MPA-ID 2019 Math Camp: R Session Slides

# MPA-ID 2019 Math Camp: R Session Slides
## Shiro Kuriwaki

---

# Table of Contents

* [Day 1: tidyverse](#day1)
 * [dplyr](#day01-dplyr)
 * [ggplot](#day01-ggplot)
* [Day 2: Beyond the tidyverse (base-R)](#day2)
 * [functions](#day02-functions)
 * [base-R syntax](#day02-base-R)
 * [formula syntax](#day02-formula)
* [Day 3: Visualization](#day3)
 * [principles](#day03-principles)
 * [technicalities](#day04-technicalities)
* [Day 4: Workflow](#day4)
 * [workflow](#day04-workflow)
 * [downloading RStudio Desktop](#day04-download-rstudio)
* [Day 5: Setup and RStudio Desktop](#day5)
 * [desktop](#day05-desktop)
 * [asking for help](#day05-getting-help)


# About these slides

These slides were prepared for the MPA-ID cohort's 2019 Math Camp. Read them alongside the <a href="https://www.shirokuriwaki.com/programming/2019%20Math%20Camp%20R%20Syllabus.pdf">syllabus</a>, which lists more resources.

Slides were originally compiled in HTML, so for best results view the <a href="https://www.shirokuriwaki.com/programming/2019-math-camp-R_all-days.html">web version</a> in a browser. The PDF version is a print-out of the web version, so it may crop code windows that can be scrolled horizontally in the web version.

Contact Shiro Kuriwaki (kuriwaki@g.harvard.edu) for more information.


---

# 5-day Schedule


1. Mastering R basics in tidyverse 
2. Mastering R basics beyond tidyverse
3. Perfecting Graphs
4. Exporting and Presenting
5. Presentations, preparing for classes

.small[\* See <a href="https://www.shirokuriwaki.com/programming/2019%20Math%20Camp%20R%20Syllabus.pdf">syllabus</a> for details]

## Logistics

1. Log in to <https://rstudio.cloud> and go to the Math Camp class space
2. "Copy" the Day 1 project `02_Day-01`, which includes
 - datasets
 - exercise PDF
 - relevant packages
3. During slides, try the code yourself + consult the cheatsheet: <http://bit.ly/HKS-R>
4. Work in groups for exercises (in your own R scripts)
5. Interrupt with questions and give me feedback on your index cards.


---
layout: true
<div class="my-footer">Day 1</div>

---
class: middle
name: day1

# Day 1: Basics of tidyverse

## dplyr verbs
## ggplot aesthetics
## functions

\* References for today's material: <a href = "https://r4ds.had.co.nz">R for Data Science</a> chs. <a href = "https://r4ds.had.co.nz/data-visualisation.html">3</a>, <a href ="https://r4ds.had.co.nz/workflow-basics.html">4</a>, and <a href ="https://r4ds.had.co.nz/transform.html">5</a>; as well as the <a href = "https://rstudio.cloud/learn/primers">Primers</a> and <a href = "https://style.tidyverse.org">style guide</a> from the summer.

---

# What is tidyverse?

* A popular suite of R packages
* Has its own syntax, designed to be more intuitive

For someone who does not know ANY programming, which of these two is more intuitive?

## (a) tidyverse syntax

```r
weo_dataset %>% 
  group_by(continent) %>% 
  summarize(gdp = median(rgdp2017, na.rm = TRUE))
```


## (b) base-R syntax

```r
aggregate(weo_dataset[, "rgdp2017"], 
          list(continent = weo_dataset[, "continent"]), 
          median, na.rm = TRUE)
```


.small[.right[Source: Adapted from Roger Peng, <a href= "https://simplystatistics.org/2018/07/12/use-r-keynote-2018/">"Teaching R to New Users - From tapply to the Tidyverse"</a>]]

---

# If learning R is like "learning a language", R is a very _fast-evolving_ language

---
class: left

# tidyverse simplifies data analysis by focusing on "tidy" rectangular data.

```r
weo <- read_excel("data/input/WEO-2018.xlsx")
```

.small[
<table>
 <thead>
 <tr>
 <th style="text-align:left;"> country </th>
 <th style="text-align:left;"> continent </th>
 <th style="text-align:right;"> pop1992 </th>
 <th style="text-align:right;"> rgdp1992 </th>
 <th style="text-align:right;"> pop1994 </th>
 <th style="text-align:right;"> pop1995 </th>
 <th style="text-align:right;"> pop1996 </th>
 <th style="text-align:right;"> pop1997 </th>
 <th style="text-align:right;"> pop1998 </th>
 <th style="text-align:right;"> pop1999 </th>
 </tr>
 </thead>
<tbody>
 <tr>
 <td style="text-align:left;"> Afghanistan </td>
 <td style="text-align:left;"> Asia </td>
 <td style="text-align:right;"> NA </td>
 <td style="text-align:right;"> NA </td>
 <td style="text-align:right;"> NA </td>
 <td style="text-align:right;"> NA </td>
 <td style="text-align:right;"> NA </td>
 <td style="text-align:right;"> NA </td>
 <td style="text-align:right;"> NA </td>
 <td style="text-align:right;"> NA </td>
 </tr>
 <tr>
 <td style="text-align:left;"> Albania </td>
 <td style="text-align:left;"> Europe </td>
 <td style="text-align:right;"> 3.217 </td>
 <td style="text-align:right;"> 9728.893 </td>
 <td style="text-align:right;"> 3.137 </td>
 <td style="text-align:right;"> 3.141 </td>
 <td style="text-align:right;"> 3.168 </td>
 <td style="text-align:right;"> 3.148 </td>
 <td style="text-align:right;"> 3.129 </td>
 <td style="text-align:right;"> 3.109 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> Algeria </td>
 <td style="text-align:left;"> Africa </td>
 <td style="text-align:right;"> 26.271 </td>
 <td style="text-align:right;"> 267807.178 </td>
 <td style="text-align:right;"> 27.496 </td>
 <td style="text-align:right;"> 28.060 </td>
 <td style="text-align:right;"> 28.566 </td>
 <td style="text-align:right;"> 29.045 </td>
 <td style="text-align:right;"> 29.507 </td>
 <td style="text-align:right;"> 29.965 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> Angola </td>
 <td style="text-align:left;"> Africa </td>
 <td style="text-align:right;"> 13.459 </td>
 <td style="text-align:right;"> 39709.339 </td>
 <td style="text-align:right;"> 14.279 </td>
 <td style="text-align:right;"> 14.707 </td>
 <td style="text-align:right;"> 15.148 </td>
 <td style="text-align:right;"> 15.603 </td>
 <td style="text-align:right;"> 16.071 </td>
 <td style="text-align:right;"> 16.553 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> Antigua and Barbuda </td>
 <td style="text-align:left;"> North America </td>
 <td style="text-align:right;"> 0.062 </td>
 <td style="text-align:right;"> 1133.675 </td>
 <td style="text-align:right;"> 0.065 </td>
 <td style="text-align:right;"> 0.067 </td>
 <td style="text-align:right;"> 0.068 </td>
 <td style="text-align:right;"> 0.070 </td>
 <td style="text-align:right;"> 0.072 </td>
 <td style="text-align:right;"> 0.074 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> Argentina </td>
 <td style="text-align:left;"> South America </td>
 <td style="text-align:right;"> 33.420 </td>
 <td style="text-align:right;"> 445025.532 </td>
 <td style="text-align:right;"> 34.353 </td>
 <td style="text-align:right;"> 34.779 </td>
 <td style="text-align:right;"> 35.196 </td>
 <td style="text-align:right;"> 35.604 </td>
 <td style="text-align:right;"> 36.005 </td>
 <td style="text-align:right;"> 36.399 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> Armenia </td>
 <td style="text-align:left;"> Asia </td>
 <td style="text-align:right;"> 3.450 </td>
 <td style="text-align:right;"> 7153.161 </td>
 <td style="text-align:right;"> 3.290 </td>
 <td style="text-align:right;"> 3.220 </td>
 <td style="text-align:right;"> 3.170 </td>
 <td style="text-align:right;"> 3.140 </td>
 <td style="text-align:right;"> 3.110 </td>
 <td style="text-align:right;"> 3.090 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> Australia </td>
 <td style="text-align:left;"> Oceania </td>
 <td style="text-align:right;"> 17.557 </td>
 <td style="text-align:right;"> 507554.453 </td>
 <td style="text-align:right;"> 17.893 </td>
 <td style="text-align:right;"> 18.120 </td>
 <td style="text-align:right;"> 18.330 </td>
 <td style="text-align:right;"> 18.510 </td>
 <td style="text-align:right;"> 18.706 </td>
 <td style="text-align:right;"> 18.919 </td>
 </tr>
</tbody>
</table>
]

---
name: day01-dplyr

# The `dplyr` package

* __94 percent__ of you reported being "Very Comfortable" or "Comfortable" using Excel

* `dplyr` ["dee" - plier] is a package that provides common spreadsheet manipulations

## single verbs

* `filter()`: Return __rows__ with matching conditions
* `select()`: Select/rename __variables__ by name
* `arrange()`: Arrange rows by variables
* `mutate()`: Create or transform variables
* `summarize()`: Reduce multiple values down to a single value

* `%>%` ["and then"]: Pipe the left-hand side into the right-hand side


```r
weo %>% filter(continent == "Europe")
```

```r
weo %>% arrange(rgdp2010)
```

```r
weo %>% select(rgdp2010)
```

```r
weo %>% mutate(gdp_pc_2010 = rgdp2010 / pop2010)
```

```r
weo %>% summarize(median_pop_2017 = median(pop2017))
```
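
To try the five verbs without the WEO file, the sketch below applies each of them to a small made-up data frame. It assumes `dplyr` is installed; `toy`, its column names, and its values are hypothetical and only stand in for `weo`.

```r
library(dplyr)

# Hypothetical stand-in for `weo`: made-up countries and numbers
toy <- data.frame(
  country   = c("A", "B", "C", "D"),
  continent = c("Europe", "Europe", "Asia", "Asia"),
  rgdp2010  = c(100, 400, 300, 200),
  pop2010   = c(10, 20, 30, 40)
)

europe   <- toy %>% filter(continent == "Europe")            # keep matching rows
sorted   <- toy %>% arrange(rgdp2010)                        # sort rows, ascending
gdp_only <- toy %>% select(country, rgdp2010)                # keep the named columns
with_pc  <- toy %>% mutate(gdp_pc = rgdp2010 / pop2010)      # add a new column
med_pop  <- toy %>% summarize(median_pop = median(pop2010))  # collapse to one row
```

Each verb takes a data frame and returns a data frame, which is what allows them to be chained with `%>%`.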


---

# A good package offers tools for common use cases

## selecting the first 10 columns

```r
weo %>% select(1:10)
```

## selecting values from 1992 (i.e., variables that include the number 1992.)

```r
weo %>% select(matches("1992"))
```

## select and rename

```r
weo %>% select(country, continent, real_gdp_2017 = rgdp2017)
```

## move 2017 to the beginning

```r
weo %>% select(country, continent, matches("2017"), everything())
```

See the help page of `?select` for more.

---

# A common summary -- counting

```r
count(weo, continent)
```

```
## # A tibble: 6 x 2
## continent n
## <chr> <int>
## 1 Africa 55
## 2 Asia 43
## 3 Europe 45
## 4 North America 22
## 5 Oceania 13
## 6 South America 13
```

---

# Cross-tabulation via counting

## A new dataset with categorical variables

```r
data(gss_cat)
count(gss_cat, race)
```

```
## # A tibble: 3 x 2
## race n
## <fct> <int>
## 1 Other 1959
## 2 Black 3129
## 3 White 16395
```

```r
count(gss_cat, race, sort = TRUE)
```

```
## # A tibble: 3 x 2
## race n
## <fct> <int>
## 1 White 16395
## 2 Black 3129
## 3 Other 1959
```


```r
count(gss_cat, race, relig)
```

```
## # A tibble: 44 x 3
## race relig n
## <fct> <fct> <int>
## 1 Other No answer 14
## 2 Other Don't know 3
## 3 Other Inter-nondenominational 2
## 4 Other Native american 16
## 5 Other Christian 74
## 6 Other Orthodox-christian 1
## 7 Other Moslem/islam 42
## 8 Other Other eastern 10
## 9 Other Hinduism 62
## 10 Other Buddhism 72
## # … with 34 more rows
```

---

# Doing more with dplyr: `group_by`

```r
weo %>% 
  group_by(continent) %>%
  summarize(median_gdp_pc = median(rgdp2017 / pop2017))
```

```
## # A tibble: 6 x 2
## continent median_gdp_pc
## <chr> <dbl>
## 1 Africa 3180.
## 2 Asia 15443.
## 3 Europe 30082.
## 4 North America 14484.
## 5 Oceania 5109.
## 6 South America 13194.
```

How would you write the equivalent of `count` with `group_by`?

```r
weo %>% 
  group_by(continent) %>% 
  summarize(n = n())
```
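
One way to convince yourself that the two forms agree is to run both on the same data and compare the results. A minimal sketch, assuming `dplyr` is installed; `toy` is a made-up data frame, not the WEO data.

```r
library(dplyr)

# Made-up data: one row per (hypothetical) country
toy <- data.frame(continent = c("Asia", "Europe", "Asia", "Africa", "Asia"))

by_count <- count(toy, continent)      # shortcut
by_group <- toy %>%                    # long form
  group_by(continent) %>% 
  summarize(n = n())

# Same groups, same counts
identical(by_count$n, by_group$n)
```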

---

# Grouping + dplyr verbs is a large part of data analysis

```r
drop_inc <- c("No answer", "Not applicable", "Don't know", "Refused")
gop_ind <- c("Not str republican", "Strong republican")

gss_cat %>% 
  filter(race != "Other", 
         !rincome %in% drop_inc) %>% 
  mutate(race = fct_infreq(race)) %>% 
  group_by(race, rincome) %>%
  summarize(is_R = mean(partyid %in% gop_ind)) %>% 
  pivot_wider(id_cols = rincome, 
              names_from = race,
              values_from = is_R) %>% 
  kable(format = "html", 
        digits = 2,
        col.names = c("Income", "Whites", "Blacks")) %>% 
  add_header_above(c(" " = 1, "% Strong Republican among" = 2))
```

.right-plot-narrow[
.small[
<table>
 <thead>
<tr>
<th style="border-bottom:hidden" colspan="1"></th>
<th style="border-bottom:hidden; padding-bottom:0; padding-left:3px;padding-right:3px;text-align: center; " colspan="2"><div style="border-bottom: 1px solid #ddd; padding-bottom: 5px; ">% Strong Republican among</div></th>
</tr>
 <tr>
 <th style="text-align:left;"> Income </th>
 <th style="text-align:right;"> Whites </th>
 <th style="text-align:right;"> Blacks </th>
 </tr>
 </thead>
<tbody>
 <tr>
 <td style="text-align:left;"> $25000 or more </td>
 <td style="text-align:right;"> 0.34 </td>
 <td style="text-align:right;"> 0.05 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> $20000 - 24999 </td>
 <td style="text-align:right;"> 0.28 </td>
 <td style="text-align:right;"> 0.05 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> $15000 - 19999 </td>
 <td style="text-align:right;"> 0.27 </td>
 <td style="text-align:right;"> 0.03 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> $10000 - 14999 </td>
 <td style="text-align:right;"> 0.25 </td>
 <td style="text-align:right;"> 0.06 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> $8000 to 9999 </td>
 <td style="text-align:right;"> 0.23 </td>
 <td style="text-align:right;"> 0.11 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> $7000 to 7999 </td>
 <td style="text-align:right;"> 0.24 </td>
 <td style="text-align:right;"> 0.00 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> $6000 to 6999 </td>
 <td style="text-align:right;"> 0.26 </td>
 <td style="text-align:right;"> 0.03 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> $5000 to 5999 </td>
 <td style="text-align:right;"> 0.28 </td>
 <td style="text-align:right;"> 0.08 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> $4000 to 4999 </td>
 <td style="text-align:right;"> 0.29 </td>
 <td style="text-align:right;"> 0.08 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> $3000 to 3999 </td>
 <td style="text-align:right;"> 0.26 </td>
 <td style="text-align:right;"> 0.10 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> $1000 to 2999 </td>
 <td style="text-align:right;"> 0.27 </td>
 <td style="text-align:right;"> 0.00 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> Lt $1000 </td>
 <td style="text-align:right;"> 0.24 </td>
 <td style="text-align:right;"> 0.04 </td>
 </tr>
</tbody>
</table>
]
]

---

# Introducing a new dataset

The WEO dataset in "long" form

```r
weo_long
```

```
## # A tibble: 6,112 x 5
## country continent year pop rgdp
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 Afghanistan Asia 1992 NA NA
## 2 Afghanistan Asia 1993 NA NA
## 3 Afghanistan Asia 1994 NA NA
## 4 Afghanistan Asia 1995 NA NA
## 5 Afghanistan Asia 1996 NA NA
## 6 Afghanistan Asia 1997 NA NA
## 7 Afghanistan Asia 1998 NA NA
## 8 Afghanistan Asia 1999 NA NA
## 9 Afghanistan Asia 2000 NA NA
## 10 Afghanistan Asia 2001 NA NA
## # … with 6,102 more rows
```


Compare this to the original "wide" form:

```r
weo
```

```
## # A tibble: 191 x 66
## country continent pop1992 pop1993 pop1994 pop1995 pop1996 pop1997
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghan… Asia NA NA NA NA NA NA 
## 2 Albania Europe 3.22 3.20 3.14 3.14 3.17 3.15
## 3 Algeria Africa 26.3 26.9 27.5 28.1 28.6 29.0 
## 4 Angola Africa 13.5 13.9 14.3 14.7 15.1 15.6 
## 5 Antigu… North Am… 0.062 0.063 0.065 0.067 0.068 0.07
## 6 Argent… South Am… 33.4 33.9 34.4 34.8 35.2 35.6 
## 7 Armenia Asia 3.45 3.37 3.29 3.22 3.17 3.14
## 8 Austra… Oceania 17.6 17.7 17.9 18.1 18.3 18.5 
## 9 Austria Europe 7.80 7.88 7.93 7.95 7.96 7.97
## 10 Azerba… Asia 7.46 7.49 7.60 7.64 7.73 7.8 
## # … with 181 more rows, and 58 more variables: pop1998 <dbl>,
## # pop1999 <dbl>, pop2000 <dbl>, pop2001 <dbl>, pop2002 <dbl>,
## # pop2003 <dbl>, pop2004 <dbl>, pop2005 <dbl>, pop2006 <dbl>,
## # pop2007 <dbl>, pop2008 <dbl>, pop2009 <dbl>, pop2010 <dbl>,
## # pop2011 <dbl>, pop2012 <dbl>, pop2013 <dbl>, pop2014 <dbl>,
## # pop2015 <dbl>, pop2016 <dbl>, pop2017 <dbl>, pop2018 <dbl>,
## # pop2019 <dbl>, pop2020 <dbl>, pop2021 <dbl>, pop2022 <dbl>,
## # pop2023 <dbl>, rgdp1992 <dbl>, rgdp1993 <dbl>, rgdp1994 <dbl>,
## # rgdp1995 <dbl>, rgdp1996 <dbl>, rgdp1997 <dbl>, rgdp1998 <dbl>,
## # rgdp1999 <dbl>, rgdp2000 <dbl>, rgdp2001 <dbl>, rgdp2002 <dbl>,
## # rgdp2003 <dbl>, rgdp2004 <dbl>, rgdp2005 <dbl>, rgdp2006 <dbl>,
## # rgdp2007 <dbl>, rgdp2008 <dbl>, rgdp2009 <dbl>, rgdp2010 <dbl>,
## # rgdp2011 <dbl>, rgdp2012 <dbl>, rgdp2013 <dbl>, rgdp2014 <dbl>,
## # rgdp2015 <dbl>, rgdp2016 <dbl>, rgdp2017 <dbl>, rgdp2018 <dbl>,
## # rgdp2019 <dbl>, rgdp2020 <dbl>, rgdp2021 <dbl>, rgdp2022 <dbl>,
## # rgdp2023 <dbl>
```


---

# groups in ggplot

## simple case: single time trend
.left-code[

```r
gdp_by_year <- weo_long %>% 
  group_by(year) %>% 
  summarize(med_gdp = median(rgdp, na.rm = TRUE),
            gdp_pc = sum(rgdp, na.rm = TRUE) / 
              sum(pop, na.rm = TRUE))

ggplot(gdp_by_year, aes(x = year, y = med_gdp)) +
  geom_line() +
  labs(y = "Median GDP (in millions 2017 USD)")
```
]

---

# groups in ggplot

## what's wrong here?

```r
ggplot(weo_long, aes(x = year,  y = rgdp)) +
  geom_line() +
  labs(y = "GDP (in millions 2017 USD)")
```

---

# groups in ggplot

## the group aesthetic treats a group of observations, defined by another variable, separately

```r
ggplot(weo_long, 
       aes(x = year, y = rgdp, group = country)) +
  geom_line() +
  labs(y = "GDP (in millions 2017 USD)")
```

---

# groups in ggplot

```r
weo_large <- weo_long %>% 
  filter(country %in% c(
    "India",
    "China",
    "Japan",
    "South Korea",
    "United States"
  ))

ggplot(weo_large, 
       aes(x = year, y = rgdp, group = country)) +
  geom_line() +
  labs(y = "GDP (in millions 2017 USD)")
```

---
class: middle, center
name: day01-ggplot

# The grammar of graphics

---
# 3 required components of ggplot grammar: data, aesthetic mapping, and geoms

---

# mapping vs. layers and the global application of geoms

```r
ggplot(data = gdp_by_year, 
       aes(x = year, y = gdp_pc)) +
  geom_point() +
  labs(x = "Year",
       y = "GDP per capita")
```


```r
ggplot(data = gdp_by_year, 
       aes(x = year, y = gdp_pc)) +
  geom_line() +
  labs(x = "Year",
       y = "GDP per capita")
```


```r
ggplot(data = gdp_by_year, 
       aes(x = year, y = gdp_pc)) +
  geom_point() +
  geom_line() +
  labs(x = "Year",
       y = "GDP per capita")
```


---

# mappings vs. hard-coded values

```r
ggplot(data = gdp_by_year, 
       aes(x = year, y = gdp_pc)) +
  geom_point(aes(color = gdp_pc)) +
  labs(x = "Year",
       y = "GDP per capita")
```


```r
ggplot(data = gdp_by_year, 
       aes(x = year, y = gdp_pc)) +
  geom_line(color = "navy") +
  labs(x = "Year",
       y = "GDP per capita")
```


```r
ggplot(data = gdp_by_year, 
       aes(x = year, y = gdp_pc)) +
  geom_point(aes(color = "navy")) + # wrong!
  labs(x = "Year",
       y = "GDP per capita")
```


---

# Try the exercises provided in your RStudio day 1 space.

---
layout: true
<div class="my-footer">Day 2</div>

---
class: middle
name: day2

# Day 2 - R beyond the tidyverse

## Group project preview
## function anatomy
## base-R syntax: vectors and lists
## strings, factors, numerics, and coercion
## formula syntax cross-tabs

\* References for today's material: <a href = "https://r4ds.had.co.nz">R for Data Science</a> chs. <a href = "https://r4ds.had.co.nz/factors.html">15</a>, <a href = "https://r4ds.had.co.nz/pipes.html">18</a>, and <a href = "https://r4ds.had.co.nz/vectors.html">20</a>; and <a href = "https://rstudio-education.github.io/hopr">Hands-on Programming with R</a> chs. <a href = "https://rstudio-education.github.io/hopr/basics.html">2</a> through <a href = "https://rstudio-education.github.io/hopr/r-notation">6</a>.

---

# Housekeeping

* Announcements by Dan and Jake
* How to get files and submit assignments on Canvas
* Open the RStudio Cloud session for today (`03_Day-02`)
* Read in the datasets we'll use

```r
library(tidyverse)
library(readxl)

weo <- read_excel("data/input/WEO-2018.xlsx")
weo_long <- read_csv("data/input/WEO-2018_long.csv")
data(gss_cat)
```

* How to ask questions

---

# Group Project (8/27): How does climate change affect economic growth / politics?

## for next Tuesday:

Short Submission: A standalone graph / table testing whether hot countries tend to be poor.

Group Presentation: A data-driven report showing if / how climate change affects growth.
 
 - Focus on the tools for data exploration and visualization we learn in math camp
 - That means: don't worry about the econometrics


We'll use data from:

same format as `WEO-2018_long.csv`, contains temperature, precipitation, and growth data.


---

# Ready with your datasets?

```r
library(tidyverse)
library(readxl)

weo <- read_excel("data/input/WEO-2018.xlsx")
weo_long <- read_csv("data/input/WEO-2018_long.csv")
data(gss_cat)

gdp_by_year <- weo_long %>% 
 group_by(year) %>% 
 summarize(med_gdp = median(rgdp, na.rm = TRUE))
```

---
class: middle, center
name: day02-functions

# Function Anatomy

---

* _function_ (in mathematics): a mapping from each element of a set `$X$` to a _single_ element of a set `$Y$`

* In R: a process that takes zero or more inputs (in parentheses) and usually returns a single output
    
    ```r
    log(x = 100, base = 10)
    ```

* Call a function by the function name, with arguments and values in parentheses, separated by commas
    
    ```r
    log(x = 100, base = 10)
    log(base = 10, x = 100)
    ```

* If the user does not name the arguments, they are assumed to come in the order they are defined.
    
    ```r
    log(100, 10)
    ```

* How do you know that the first argument of log() is x, and not base?
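
One way to answer this without opening the help page is to ask R for the argument names. The sketch below uses only base R; `args()` is called first because `log()` is a primitive function, for which `formals()` on its own returns nothing.

```r
# args() returns a function skeleton whose formals are the argument list
arg_names <- names(formals(args(log)))
arg_names  # "x" "base"

# Unnamed values are matched in that order, so these three calls agree
by_position <- log(100, 10)
by_name     <- log(x = 100, base = 10)
reordered   <- log(base = 10, x = 100)
```

The help page (`?log`) lists the same order under "Usage".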

---

# Function anatomy: important odds and ends

## 1. functions with no output

These functions are called for their side effects rather than for an output object

```r
library(tidyverse)
```
.small[.right[(loads the package)]]

```r
data(gss_cat)
```
.small[.right[(loads a built-in dataset which will be called `gss_cat`)]]


## 2. functions with no arguments

```r
Sys.Date()
```

```
## [1] "2019-09-05"
```

```r
n()
```
.small[.right[(`dplyr`'s "number of rows in the current group"; only works inside verbs such as `summarize()`)]]


---

## 3. functions with default arguments

e.g. `log()` in R defaults to `$\log_{e}()$`:

```r
args(log)
```

```
## function (x, base = exp(1)) 
## NULL
```

Therefore, an argument with a default can be left unspecified:

```r
log(exp(5))
```

_Note_ this is different from not declaring the _name_ for a no-default argument

```r
args(aes)
```

```
## function (x, y, ...) 
## NULL
```

```r
ggplot(gdp_by_year, aes(year, med_gdp))
```
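
The same distinction shows up in functions you write yourself. In this sketch, `power()` is a hypothetical helper (not from any package): `exponent` has a default and may be omitted, while `x` has none and must always be supplied, whether named or not.

```r
# Hypothetical helper: `x` has no default, `exponent` defaults to 2
power <- function(x, exponent = 2) {
  x^exponent
}

squared <- power(3)                # default exponent is used: 9
cubed   <- power(3, exponent = 3)  # default is overridden: 27
unnamed <- power(3, 3)             # same call, arguments matched by position

# Omitting `x` fails, because there is no default to fall back on
no_x <- tryCatch(power(), error = function(e) "error: x is missing")
```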

## 4. functions where defaults are not documented

```r
args(geom_point)
```

```
## function (mapping = NULL, data = NULL, stat = "identity", position = "identity", 
##     ..., na.rm = FALSE, show.legend = NA, inherit.aes = TRUE) 
## NULL
```

```r
ggplot(gdp_by_year, aes(year, med_gdp)) +
  geom_point() +
  labs(x = "Year", y = "Median GDP")
```


---

## 5. functions with countably infinite arguments

```r
count(x = gss_cat, race)
```
.right[.small[(one-way tabulation)]]

```r
count(x = gss_cat, race, relig)
```
.right[.small[(two-way tabulation)]]

```r
count(x = gss_cat, race, relig, denom)
```
.right[.small[(three-way tabulation)]]

Notice `args(count)` returns:

```r
function (x, ..., wt = NULL, sort = FALSE)
```

R represents such potentially infinite arguments with `...` (an ellipsis)


Arguments defined after `...` must be _named_:

```r
count(gss_cat, race, relig, sort = TRUE)
```

```
## # A tibble: 44 x 3
## race relig n
## <fct> <fct> <int>
## 1 White Protestant 8188
## 2 White Catholic 4001
## 3 White None 2816
## 4 Black Protestant 2271
## 5 Other Catholic 916
## 6 White Christian 474
## 7 Other Protestant 387
## 8 Black None 384
## 9 White Jewish 370
## 10 Other None 323
## # … with 34 more rows
```

They cannot be matched by position, because an unnamed value is absorbed into `...`. Consider:

```r
count(gss_cat, race, relig, TRUE) # wrong
```
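
The behavior of `...` can be reproduced in a few lines of base R. In this sketch, `count_inputs()` is a hypothetical function written for illustration; it is not how `count()` is implemented.

```r
# Hypothetical helper: reports how many values were passed through `...`
count_inputs <- function(..., label = "n") {
  n_inputs <- length(list(...))
  paste(label, n_inputs, sep = " = ")
}

three_inputs <- count_inputs(1, 2, 3)               # "n = 3"
relabelled   <- count_inputs(1, 2, 3, label = "k")  # "k = 3"
absorbed     <- count_inputs(1, 2, 3, "k")          # "n = 4": "k" went into `...`
```

The last call runs without error, which is why forgetting the argument name can go unnoticed.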


---

## 6. dplyr's pipe

* tidyverse popularized `%>%` (from the `magrittr` package) as a convenient "and then"
  
  ```r
  weo %>% 
      filter(continent == "Africa") %>% 
      select(country, rgdp1992, pop1992) %>% 
      mutate(gdp_pc = rgdp1992 / pop1992)
  ```

* The object to `%>%`'s LEFT becomes the _value of the first argument_ of the function to `%>%`'s RIGHT
  
  ```r
  weo %>% 
    filter(continent == "Africa")
  ```

* Therefore the first chunk is equivalent to:
  
  ```r
  mutate(select(filter(weo, continent == "Africa"), country, rgdp1992, pop1992), gdp_pc = rgdp1992 / pop1992)
  ```

* [Advanced]: A dataframe in a pipeline is implicitly kept "in scope", such that a variable name that is called is presumed to be from the dataframe.
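
A third way to write the same computation is to store each intermediate result, which makes the "first argument" rule visible. A minimal sketch, assuming `dplyr` is installed; `toy` and its values are made up and only stand in for `weo`.

```r
library(dplyr)

# Made-up stand-in for `weo`
toy <- data.frame(
  country   = c("A", "B", "C"),
  continent = c("Africa", "Asia", "Africa"),
  rgdp1992  = c(50, 80, 120),
  pop1992   = c(5, 4, 10)
)

# Piped: read top to bottom as "and then"
piped <- toy %>% 
  filter(continent == "Africa") %>% 
  select(country, rgdp1992, pop1992) %>% 
  mutate(gdp_pc = rgdp1992 / pop1992)

# Stepwise: each result is passed as the first argument of the next call
africa   <- filter(toy, continent == "Africa")
narrowed <- select(africa, country, rgdp1992, pop1992)
stepwise <- mutate(narrowed, gdp_pc = rgdp1992 / pop1992)

all.equal(piped, stepwise)
```

The piped form avoids naming `africa` and `narrowed`, which are never used again.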

---

## 7. ggplot's plus

* Appending layers by `+` (instead of `%>%`) is a historical legacy


```r
# format dataset
library(scales) # provides dollar() for the axis labels

weo_percap <- weo %>% 
  mutate(pc_1992 = rgdp1992 / pop1992,
         pc_2017 = rgdp2017 / pop2017)

# ggplot
ggplot(weo_percap, aes(pc_1992, pc_2017)) +
  ## layers
  geom_abline(slope = 1, 
              intercept = 0, 
              linetype = "dashed") +
  geom_point(size = 0.5, color = "indianred") + 
  ## coordinate
  coord_equal() +
  ## facet 
  facet_wrap(~continent) +
  ## scale
  scale_x_continuous(labels = NULL) +
  scale_y_continuous(labels = dollar) +
  labs(x = "1992 GDP per capita",
       y = "2017 GDP per capita",
       title = "Growth between 1992 and 2017") +
  ## theme
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))
ggsave("images/day1gg-show.png", width = 4.8, height = 3.2)
```

---
class: center, middle

# Questions?

---
name: day02-base-R

# Three broad flavors of syntax

## 1. tidyverse syntax

```r
weo %>%
  mutate(gdp_pc_2017 = rgdp2017 / pop2017)
```

## 2. base-R (dollar sign and square bracket) syntax

```r
weo$gdp_pc_2017 <- weo$rgdp2017 / weo$pop2017
weo[["gdp_pc_2017"]] <- weo[["rgdp2017"]] / weo[["pop2017"]]
```

## 3. formula (tilde) syntax

```r
lm(tvhours ~ race + relig, data = gss_cat)
```

```r
xtabs(~ race + relig, data = gss_cat)
```


---
class: middle, center

# A Whirlwind Tour of Base-R

## Data structures
## Data types
## Summary functions

---

# Data Structures: what's the container?

## the data frame is the most complex data structure

| | Homogeneous | Heterogeneous |
| -----:| :-------------: |:-------------:|
| 1-dimensional| Atomic Vector | List |
| 2-dimensional | Matrix | Data frame |

## "Heterogeneous" means it can take different types

```r
x <- list(1:3, "a", c(TRUE, FALSE, TRUE), c(2.3, 5.9))
str(x)
```

```
## List of 4
##  $ : int [1:3] 1 2 3
##  $ : chr "a"
##  $ : logi [1:3] TRUE FALSE TRUE
##  $ : num [1:2] 2.3 5.9
```

---

# tidyverse and base-R syntax have different ways of subsetting

## tidyverse (dataframes only)

```r
weo %>% 
  select(country, rgdp2017) %>% 
  slice(1:5)
```

```
## # A tibble: 5 x 2
## country rgdp2017
## <chr> <dbl>
## 1 Afghanistan 63353.
## 2 Albania 32763.
## 3 Algeria 576494.
## 4 Angola 173327.
## 5 Antigua and Barbuda 2174.
```


## base-R

Atomic vectors with square brackets

```r
letters[c(1, 2, 3)]
```

Matrices and data frames also with square brackets -- with two dimensions (rows, then columns):

```r
weo[1:5, 1]
```

Lists, including data frames (which are lists of columns), with double square brackets or `$` (S for Slot)

```r
weo[[1]]
weo[["country"]]
weo$country
```
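
The difference between single and double brackets is easiest to see by checking what each one returns. The sketch below uses base R only; `toy` is a made-up data frame standing in for `weo`.

```r
# Made-up data frame standing in for `weo`
toy <- data.frame(
  country = c("A", "B", "C"),
  gdp     = c(100, 400, 300)
)

first_two <- toy[1:2, ]     # rows 1-2, all columns: still a data frame
one_col   <- toy["gdp"]     # single bracket: a one-column data frame
as_vector <- toy[["gdp"]]   # double bracket: the column itself, as a vector
same_vec  <- toy$gdp        # `$` is shorthand for [["gdp"]]

is.data.frame(one_col)      # TRUE
is.numeric(as_vector)       # TRUE
mean(as_vector)             # summary functions expect the vector form
```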


---

# Data types: what's in the container?

(a) Logical

```r
c(TRUE, FALSE, FALSE)
```

```r
is.na(c(NA, 2, 3))
```

```
## [1]  TRUE FALSE FALSE
```


(b) Numeric - Integers

```r
c(1L, 0L, 0L)
```

(c) Numeric - Doubles

```r
c(1, 0, 0)
```


(d) Character

```r
c("A", "B", "B")
```


---

# Vectors are flexible:

## coercion

`TRUE` gets coerced to 1 and `FALSE` gets coerced to 0:

```r
sum(c(TRUE, FALSE, TRUE))
```

```
## [1] 2
```

numbers can be coerced to characters:

```r
as.character(c(1, 2, 3))
```

automatic coercion makes vectors stay homogeneous:

```r
c(1, "a")
```
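
Checking the class of each result shows which type won. A short base-R sketch; the vectors are arbitrary examples.

```r
mixed <- c(1, "a")           # the number is coerced to character
class(mixed)                 # "character"

flags <- c(TRUE, FALSE, 2)   # the logicals are coerced to numeric
class(flags)                 # "numeric"

# Going back from character to numeric can fail: non-numbers become NA
back <- suppressWarnings(as.numeric(c("1", "2", "three")))
back                         # 1 2 NA
```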


## recycling

A scalar assigned where a vector is expected is recycled: its value is repeated to fill the _vector_.

```r
weo$ones <- 1
```

This created a column called `ones` where every row is the number 1.

```r
count(weo, ones)
```

```
## # A tibble: 1 x 2
## ones n
## <dbl> <int>
## 1 1 191
```


## vector algebra

Arithmetic on numeric vectors and matrices is applied element by element

```r
c(1, 2, 3) + c(100, 200, 300)
```

```
## [1] 101 202 303
```

```r
c(1, 2, 3) * c(100, 200, 300)
```

```
## [1] 100 400 900
```

recycling works here too:

```r
(c(1, 2, 3) * c(100, 200, 300)) + 1
```

```
## [1] 101 401 901
```
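
Recycling is not limited to scalars: any shorter vector is repeated to match the longer one. A small base-R sketch of both the clean case and the case that triggers a warning:

```r
# c(10, 100) is reused twice to match the four elements on the left
short_recycled <- c(1, 2, 3, 4) * c(10, 100)
short_recycled   # 10 200 30 400

# When the longer length is not a multiple of the shorter one,
# R still recycles but raises a warning
uneven <- tryCatch(c(1, 2, 3) + c(10, 20),
                   warning = function(w) "warning raised")
```

Silent recycling is a common source of bugs, so it is worth checking lengths when two vectors are expected to line up.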


---

# factors: a hybrid type that encodes the order of values

* In practice, factors are like _ordered_ characters

* Technically, factors are integers (which dictate the ordering of the __levels__) where each value has a __label__
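
Both points can be checked on a small factor built by hand. A base-R sketch; `sizes` and its levels are made up for illustration.

```r
# Levels are given in the order we want, not alphabetical order
sizes <- factor(c("small", "large", "medium", "small"),
                levels = c("small", "medium", "large"))

levels(sizes)       # "small" "medium" "large"
as.integer(sizes)   # 1 3 2 1: the integer codes behind the labels

# Sorting follows the levels for a factor, the alphabet for a character
sorted_factor <- as.character(sort(sizes))
sorted_string <- sort(as.character(sizes))
```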

## built-in GSS saves variables as factors

```r
data(gss_cat)
count(gss_cat, marital)
```

```
## # A tibble: 6 x 2
## marital n
## <fct> <int>
## 1 No answer 17
## 2 Never married 5416
## 3 Separated 743
## 4 Divorced 3383
## 5 Widowed 1807
## 6 Married 10117
```

## compare this to strings

```r
gss_str <- read_csv("data/input/gss_str.csv")
count(gss_str, marital)
```

```
## # A tibble: 6 x 2
## marital n
## <chr> <int>
## 1 Divorced 3383
## 2 Married 10117
## 3 Never married 5416
## 4 No answer 17
## 5 Separated 743
## 6 Widowed 1807
```

---

# Recoding, Releveling, and Replacing

## 1. dplyr::recode edits the values

```r
gss_cat %>% 
  mutate(relig = 
           recode(relig,
                  Protestant = "Christian",
                  Catholic = "Christian",
                  .default = "Non-Christian")) %>%
  count(relig)
```

```
## # A tibble: 2 x 2
## relig n
## <fct> <int>
## 1 Non-Christian 5513
## 2 Christian 15970
```

---
# Recoding, Releveling, and Replacing

## 2. forcats::fct\_relevel edits the _order_ of the levels

```r
count(gss_cat, race)
```

```
## # A tibble: 3 x 2
## race n
## <fct> <int>
## 1 Other 1959
## 2 Black 3129
## 3 White 16395
```

```r
gss_cat %>% 
  mutate(race = fct_relevel(race, "White", "Black", "Other")) %>%
  count(race)
```

```
## # A tibble: 3 x 2
## race n
## <fct> <int>
## 1 White 16395
## 2 Black 3129
## 3 Other 1959
```

---
# Recoding, Releveling, and Replacing

## 3. base::replace edits the value of variables based on the value of potentially other variables

```r
gss_cat %>% 
  mutate(is_senior = NA, # this makes a column of NAs
         is_senior = replace(is_senior, age < 65, "not senior"),
         is_senior = replace(is_senior, age >= 65, "senior")) %>% 
  select(age, is_senior)
```

```
## # A tibble: 21,483 x 2
## age is_senior 
## <int> <chr> 
## 1 26 not senior
## 2 48 not senior
## 3 67 senior 
## 4 39 not senior
## 5 25 not senior
## 6 25 not senior
## 7 36 not senior
## 8 44 not senior
## 9 44 not senior
## 10 47 not senior
## # … with 21,473 more rows
```
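
Outside of `mutate()`, `replace()` works the same way on a plain vector: the second argument says which positions to change. A base-R sketch with made-up ages, including a missing value to show that it stays `NA`:

```r
age <- c(26, 48, 67, NA, 70)   # made-up ages, one missing

is_senior <- rep(NA_character_, length(age))              # start from all NA
is_senior <- replace(is_senior, age < 65, "not senior")   # fill where TRUE
is_senior <- replace(is_senior, age >= 65, "senior")
is_senior
```

The fourth element is left as `NA` because neither condition is `TRUE` for a missing age.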

---

# The things you can get out of data

## 1. dimensions

Each vector has a "length", each 2d array has a "dimension"

```r
x <- c(1:10)
length(x)
```

```
## [1] 10
```

```r
dim(weo)
```

```
## [1] 191  67
```


## 2. summary stats

```r
stdnormals <- rnorm(n = 1000, mean = 0, sd = 1)
```

```r
mean(stdnormals)
sd(stdnormals)
```

```
## [1] -0.007799939
```

```
## [1] 1.032668
```

```r
round(mean(stdnormals), digits = 2)
```

```
## [1] -0.01
```


---

# The things you can get out of data

## 3. string concatenation

```r
c("a", "b", "c", "d")
```

```
## [1] "a" "b" "c" "d"
```

concatenate ("paste") into a single item

```r
str_c("a", "b", "c", "d")
```

```
## [1] "abcd"
```
   

## 4. pulling out vectors from dataframes

```r
mean(weo$rgdp2019)
```

```
## [1] 653848.6
```

```r
mean(select(weo, rgdp2019))
```

```
## Warning in mean.default(select(weo, rgdp2019)): argument is not numeric or
## logical: returning NA
```

```
## [1] NA
```

"pull" takes one column and coerces to vector:

```r
mean(pull(weo, rgdp2019))
```

```
## [1] 653848.6
```


---

What would the following code do?

```r
weo %>% 
  mutate(rgdp2017 = round(rgdp2017, digits = 0),
         country = str_c(country, " (", continent, ")")) %>% 
  select(country, rgdp2017) %>% 
  slice(1:3)
```

<table>
 <thead>
 <tr>
 <th style="text-align:left;"> country </th>
 <th style="text-align:right;"> rgdp2017 </th>
 </tr>
 </thead>
<tbody>
 <tr>
 <td style="text-align:left;"> Afghanistan (Asia) </td>
 <td style="text-align:right;"> 63353 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> Albania (Europe) </td>
 <td style="text-align:right;"> 32763 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> Algeria (Africa) </td>
 <td style="text-align:right;"> 576494 </td>
 </tr>
</tbody>
</table>
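
The same pattern can be tried on a small made-up table. In this sketch, `toy` and its values are hypothetical; the point is that `round()` and `str_c()` inside `mutate()` work element by element, and `slice()` keeps rows by position.

```r
library(dplyr)
library(stringr)

# Made-up values, for illustration only
toy <- tibble(country = c("Xland", "Yland"),
              continent = c("Asia", "Europe"),
              rgdp2017 = c(1234.56, 78.9))

toy_labeled <- toy %>%
  mutate(rgdp2017 = round(rgdp2017, digits = 0),               # overwrite with rounded values
         country = str_c(country, " (", continent, ")")) %>%   # paste columns row by row
  slice(1:2)                                                   # keep the first two rows

toy_labeled
```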

---
class: center, middle
name: day02-formula

# Formula syntax

---

# Formula (tilde) syntax expresses the relationship between variables

```r
fit <- lm(tvhours ~ race + relig, data = gss_cat)
summary(fit)
```

* __One__ variable (unquoted) on the left-hand side
* __One or more__ variables on the right-hand side

```r
xtabs(~ relig + race, gss_cat, drop.unused.levels = TRUE)
```

* Typically no left-hand side
* __One or two__ variables on the right-hand side (row variable, then column variable)
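
Both uses of the tilde can be tried without any extra data. This sketch uses `mtcars`, which ships with R, as a stand-in for `gss_cat`; the variable choices are only for illustration.

```r
# mtcars ships with R; it stands in for gss_cat here
fit <- lm(mpg ~ wt + hp, data = mtcars)    # outcome ~ predictor + predictor
coef(fit)                                  # intercept plus one slope per predictor

tab <- xtabs(~ cyl + gear, data = mtcars)  # no left-hand side: count rows
tab                                        # rows are cyl, columns are gear

# A formula is an object in its own right and can be stored in a variable
f <- mpg ~ wt + hp
class(f)
all.vars(f)
```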

---

# Summarizing cross-tabs

Cell / row / column proportions

```r
prop.table(xtabs(~ relig + race, gss_cat))
prop.table(xtabs(~ relig + race, gss_cat), margin = 1)
prop.table(xtabs(~ relig + race, gss_cat), margin = 2)
```
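
A quick way to see what `margin` does is to check which sums equal 1. The sketch below again uses the built-in `mtcars` data as a stand-in.

```r
tab <- xtabs(~ cyl + gear, data = mtcars)  # built-in data, for illustration

cell <- prop.table(tab)              # cell proportions: the whole table sums to 1
rowp <- prop.table(tab, margin = 1)  # row proportions: each row sums to 1
colp <- prop.table(tab, margin = 2)  # column proportions: each column sums to 1

sum(cell)
rowSums(rowp)
colSums(colp)
```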

Using the GSS dataset, how would you get a cross-tab that shows (a) the proportion in each income bin, (b) for each of the two most common values of race, and (c) _rounded_ to two decimal places?

```r
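# Keep the two most common values of race; converting to character
# drops the now-unused factor levels from the table
gss_cat_wb <- gss_cat %>%
  filter(race %in% c("White", "Black")) %>%
  mutate(race = as.character(race))

# Column proportions (margin = 2): share of each income bin within each race
xtabs(~ rincome + race, gss_cat_wb) %>%
  prop.table(margin = 2) %>%
  round(2)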
```

---
layout: true
<div class="my-footer">Day 3</div>

---
class: middle
name: day3

# Day 3 Agenda

## how our eyes see data
## technicalities: axes, legends, annotation, exporting

\* References for today's material: <a href = "https://r4ds.had.co.nz">R for Data Science</a> ch. <a href = "https://r4ds.had.co.nz/data-visualisation.html">3</a>; <a href = "https://serialmentor.com/dataviz/">Fundamentals of Data Visualization</a> by Claus Wilke, chs. <a href = "https://serialmentor.com/dataviz/proportional-ink.html">17</a> through <a href = "https://serialmentor.com/dataviz/small-axis-labels.html">24</a>; and <a href = "https://www.youtube.com/watch?v=fSgEeI2Xpdc">How Humans See Data</a> by John Rauser.

\* Most of the example graphs from this day are from Claus Wilke's excellent book, and the presentation of the first section follows John Rauser's eye-opening talk.

---
class: center, middle

# Visualization

---
name: day03-principles

# 1. Show the key message as a comparison of positions on a common scale

Perception tasks, ranked

1. Position along a common scale &#x2b05; the human eye can discern this most accurately 
2. Position along identical, nonaligned scales
3. Length
4. Angle or Slope
5. Area
6. Volume, Density, Saturation
7. Color Hue &#x2b05; and this the least

.right[.small[Source: <a href = "https://pdfs.semanticscholar.org/565d/843c2c0e60915709268ac4224894469d82d5.pdf">Cleveland and McGill (1985)</a>, paper on Canvas]]

---

# Example: Alesina et al., "Intergenerational mobility in Africa" (<a href="https://voxdev.org/topic/health-education/intergenerational-mobility-africa">link</a>)

```r
alesina <- read_excel("data/input/intergenerational-mobility.xlsx")
```


---

![](2019-math-camp-R_all-days_files/figure-html/unnamed-chunk-119-1.svg)

---

Using: `geom_col()` with the `fill` aesthetic, reordering countries by `fct_reorder()`.

![](2019-math-camp-R_all-days_files/figure-html/unnamed-chunk-120-1.svg)

---

Using: defining a color scale with `scale_fill_distiller(palette = "Reds")`.

![](2019-math-camp-R_all-days_files/figure-html/unnamed-chunk-121-1.svg)

---

Using: `geom_segment()`

![](2019-math-camp-R_all-days_files/figure-html/unnamed-chunk-122-1.svg)

---

Using: `geom_col()`

![](2019-math-camp-R_all-days_files/figure-html/unnamed-chunk-123-1.svg)

---

Using: `geom_point()`

![](2019-math-camp-R_all-days_files/figure-html/unnamed-chunk-124-1.svg)

---

Using: `geom_point()`

![](2019-math-camp-R_all-days_files/figure-html/unnamed-chunk-125-1.svg)

---

Using: `geom_point()` and `geom_errorbar()`

![](2019-math-camp-R_all-days_files/figure-html/unnamed-chunk-126-1.svg)

---

Using: `geom_point()` and `geom_errorbar()`

![](2019-math-camp-R_all-days_files/figure-html/unnamed-chunk-127-1.svg)
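
The move from bars (length) to points (position on a common scale) can be sketched on any summarized data. The example below is not the code behind the figures above; it assumes the `mpg` dataset that ships with ggplot2 as a stand-in for the Alesina et al. data, and the object names are hypothetical.

```r
library(dplyr)
library(forcats)
library(ggplot2)

# mpg ships with ggplot2; it stands in for the Alesina et al. data
by_maker <- mpg %>%
  group_by(manufacturer) %>%
  summarize(mean_hwy = mean(hwy)) %>%
  mutate(manufacturer = fct_reorder(manufacturer, mean_hwy))  # sort categories by value

# Shared setup: categories on the (flipped) x axis, values on a common scale
base <- ggplot(by_maker, aes(x = manufacturer, y = mean_hwy)) +
  coord_flip() +
  labs(x = "", y = "Mean highway miles per gallon")

p_col <- base + geom_col()      # values encoded as bar length
p_point <- base + geom_point()  # values encoded as position

p_col
p_point
```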

---

# 1. Show the key message as a comparison of positions on a common scale

---

# 2. Maximize the Data-to-Ink ratio (i.e. proactively remove superfluous ink)

---

# 3. Avoid excessive overlap (instead, use small multiples)

## bad
<img src="images/papaioannou21junefig4.png" width="789" />

---

**Solution**: In ggplot, "facet" your graphs with `facet_grid()` (a rigid rows-by-columns table of panels) or `facet_wrap()` (a single sequence of panels that wraps onto new rows).

```r
ggplot(weo_long, aes(x = year, y = (rgdp/pop), group = country)) +
  facet_wrap(~ continent) + # formula syntax
  geom_line() +
  labs(x = "Year", y = "GDP per capita")
```
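
To compare the two faceting functions side by side, here is a minimal sketch that assumes the `mpg` dataset shipped with ggplot2 in place of `weo_long`.

```r
library(ggplot2)

# mpg ships with ggplot2 and stands in for weo_long here
base <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

# facet_wrap(): one variable; panels wrap onto new rows like text
p_wrap <- base + facet_wrap(~ class, ncol = 3)

# facet_grid(): rows ~ columns; every combination gets a cell, even if empty
p_grid <- base + facet_grid(drv ~ cyl)

p_wrap
p_grid
```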

---
class: center, middle

Try Problem 7 part (1)

---
class: center, middle
name: day03-technicalities

# Important Technicalities

---

# 1. Clear and concise axis titles

## Always label your axes, but keep the labels concise

---

# 1. Clear and concise axis titles

pseudo-code:

```r
ggplot(stocks, 
       aes(x = year, 
           y = price, 
           group = company,
           color = company)) + 
  labs(x = "",
       y = "stock price, indexed",
       color = "")
```

---

# 2. Reasonable font and image sizes

## The default image size is almost always too large, which leaves the labels too small to read

**Solution**: save the images at a _smaller size_ (I recommend 3 by 5 inches; some recommend smaller).

```r
ggsave("figures/figure_rightsize.pdf", height = 3, width = 5)
```
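
A self-contained version of the same call is sketched below. It assumes the `mpg` data from ggplot2 and writes to a temporary folder so that it runs anywhere; in a real project the path would be a relative one such as `figures/figure_rightsize.pdf`.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

# tempdir() keeps this sketch self-contained; it is not a recommended location
out <- file.path(tempdir(), "figure_rightsize.pdf")

# Passing the plot explicitly avoids relying on "the last plot drawn";
# width and height are in inches by default
ggsave(out, plot = p, width = 5, height = 3)

file.exists(out)
```

The `.pdf` extension tells `ggsave()` to write a vector graphic.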

---

# Also: Use vector graphics where possible

Note: panel (b) zooms into the figure in panel (a) when the graph is a raster graphic (like `jpeg` or `png`). Panel (c) zooms into the same area when the graph is a vector graphic (like `pdf` or `svg`). Vector graphics stay sharp at any zoom level.

---

# 3. Reader-friendly legends or annotation

## Having to go back and forth between data and legend is not reader-friendly.


**Solution**: relevel your factors (`fct_relevel`) so the levels align with the data, or ...

---

# Whenever possible, design your figures so they don’t need a legend.


pseudo-code:

```r
ggplot(stocks, 
       aes(x = year, 
           y = price, 
           group = company,
           color = company)) + 
  annotate("text", x = 2018, y = 550, label = "Facebook") +
  annotate("text", x = 2018, y = 340, label = "Alphabet") +
  annotate("text", x = 2018, y = 250, label = "Microsoft") +
  annotate("text", x = 2018, y = 195, label = "Apple") +
  labs(x = "", 
       y = "stock price, indexed", 
       color = "color")
```
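
For a version that runs as written, here is a sketch on made-up data (`stocks_toy` and its numbers are hypothetical). Instead of `annotate()` with hand-picked coordinates, it swaps in `geom_text()` so the labels are placed from the data at the end of each line, and it hides the legend that the labels replace.

```r
library(dplyr)
library(ggplot2)

# Made-up numbers, for illustration only
stocks_toy <- tibble(
  year = rep(2015:2018, times = 2),
  company = rep(c("Firm A", "Firm B"), each = 4),
  price = c(100, 120, 150, 200, 100, 105, 115, 130)
)

# One row per company: its last observation, where the label will sit
line_ends <- stocks_toy %>%
  group_by(company) %>%
  filter(year == max(year))

p <- ggplot(stocks_toy, aes(x = year, y = price, color = company)) +
  geom_line() +
  geom_text(data = line_ends, aes(label = company), hjust = 0, nudge_x = 0.05) +
  scale_x_continuous(expand = expansion(mult = c(0.05, 0.25))) +  # room for labels
  labs(x = "", y = "stock price, indexed") +
  theme(legend.position = "none")  # direct labels replace the legend

p
```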


---

# 4. Default colors can be indistinguishable in grayscale

or for colorblind readers


Solution:

* The viridis color scale (`scale_color_viridis_d()` for discrete variables, `scale_color_viridis_c()` for continuous ones) is a good default.
* You can also assign colors manually for discrete variables (`scale_color_manual()`).
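
The three scale functions can be compared on one plot. This sketch assumes the `mpg` data from ggplot2; the manual hex codes are taken from the Okabe-Ito colorblind-friendly palette, which is one possible choice among many.

```r
library(ggplot2)

base <- ggplot(mpg, aes(x = displ, y = hwy))

# Discrete variable: viridis palette with the _d suffix
p_d <- base + geom_point(aes(color = drv)) + scale_color_viridis_d()

# Continuous variable: viridis palette with the _c suffix
p_c <- base + geom_point(aes(color = cty)) + scale_color_viridis_c()

# Manual colors for a discrete variable; the names match the values of drv
p_m <- base + geom_point(aes(color = drv)) +
  scale_color_manual(values = c("4" = "#E69F00", "f" = "#56B4E9", "r" = "#009E73"))

p_d
p_c
p_m
```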

---
layout: true
<div class="my-footer">Day 4</div>

---
class: middle
name: day4

# Day 4 Agenda

## Workflow best practices
## RStudio and R Desktop
## Review of key functions via cheatsheet

\* Instead of the original plan, we will _not_ cover Rmarkdown in math camp and instead consolidate what we've learned so far. This prepares us well for Rmarkdown down the road. For those feeling ambitious, <a href = "https://r4ds.had.co.nz/r-markdown.html">R for Data Science Chapter 27</a> is a good introduction to Rmarkdown.

\* References for today's material: <a href = "https://r4ds.had.co.nz">R for Data Science</a> chs. <a href = "https://r4ds.had.co.nz/workflow-scripts.html">6</a> and <a href = "https://r4ds.had.co.nz/workflow-projects.html">8</a>; <a href = "https://whattheyforgot.org/">What they forgot to teach you about R</a> ch. <a href = "https://whattheyforgot.org/save-source.html">1</a>; and <a href = "https://web.stanford.edu/~gentzkow/research/CodeAndData.pdf">Code and Data for the Social Sciences</a> chs. 1 and 7.

---
class: middle
name: day04-workflow

# Workflow

.small-note[
Prepare by opening today's rstudio.cloud project and opening `sample_cleaned_script.R` in the `example_scripts` folder
]

---

# What does good R code look like?

Good recipes ...

&#x2611; Readable to your future self

&#x2611; Clear (minimizes ambiguity)

&#x2611; Ordered chronologically

&#x2611; Follows convention

<img src="images/Rworkflow_recipe.png" width="100%" />

---

# First, load packages (but don't keep installation code)

---

# Read in data at the top, save output at the end
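
A script that follows this ordering might look like the sketch below. It is only an illustration: it assumes a sample file bundled with the readr package (`readr_example("mtcars.csv")`) and writes to a temporary folder so that it runs anywhere, where a real project would use relative paths such as `data/input` and `data/output`.

```r
# Packages -------------------------------
# (install.packages() is run once in the Console, not kept in the script)
library(readr)
library(dplyr)

# Load data ------------------------------
# readr_example() points to a sample file that ships with readr
cars <- read_csv(readr_example("mtcars.csv"), show_col_types = FALSE)

# Analyze --------------------------------
cars_by_cyl <- cars %>%
  group_by(cyl) %>%
  summarize(mean_mpg = mean(mpg))

# Save output ----------------------------
# tempdir() keeps the sketch self-contained
out_path <- file.path(tempdir(), "cars_by_cyl.csv")
write_csv(cars_by_cyl, out_path)
```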

---

# Reorder and re-write frequently into a concise, top-to-bottom ordering

* Remove unnecessary "checks" (like `View(dataset)`, `?ggplot`)
--

* Section your code with comment lines, but be concise

```r
  # Load data ---------------------------
  
  # Plot data ---------------------------
  ```
--

* Better to have multiple short scripts than one huge script

        00_download.R
        01_explore.R
        ...
        09_model.R
        10_visualize.R
        
--

**Why?** Because you want to be re-running your entire code a lot ...

---

# Restart and Re-run (everything) frequently, as you write code

## 1. Restart from a clean slate

* Go to top toolbar, `Session` > `Restart R`, or
*  &#x21e7; (shift) +  &#x2318; (command/control)  + `0` (zero)

This will remove

* All objects in the Environment
* All loaded packages
* All user-created functions

Try this in sample_cleaned_script.R

## \* Warning for old R users

Do **not** put `rm(list = ls())` at the beginning of your code. This removes only objects; loaded packages stay attached, so it is not a clean slate.
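
The difference is easy to check in a fresh session. This sketch uses the `tools` package only because it ships with R; any package would show the same behavior.

```r
library(tools)   # any package works; tools ships with R
x <- 1

rm(list = ls())  # removes objects such as x ...

exists("x")                     # FALSE: the object is gone
"package:tools" %in% search()   # TRUE: the package is still attached
```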


## 2. Run all ("source") - and break on first error

* Toolbar `Code` > `Source`, or
* &#x21e7; (shift) + &#x2318; (command/control) + `s` (the letter S, for source)

## \* or run all and show output

* Toolbar `Code` > `Run Region` > `Run All`, or
* &#x21e7; (shift) + &#x2318; (command/control) + &#x23ce; (enter)

## \* or re-run only up to this line

* Toolbar `Code` > `Run Region` > `Run All Chunks Above`, or
* &#x2325; (option/alt) +   &#x2318; (command/control) + `b` (the letter B, for begin)

---
class: center, middle

Exercise: fix this example bad code so it runs on a blank slate.

---

# File paths are a way to define the location of a file

* **"Directory"** &#x2248; "Folder".    "Working Directory" = The current folder
* Paths are hierarchical: parent &#x2192; child , separated by a **forward slash** `/` (`project/data/input`)
* **File extensions** are separated by a period `.` (`kuriwaki_shiro.R`, `scatterplot.pdf`)

There are important special characters:

* **Root**: Starting with `/` (or `C:\` in Windows) means the top of the computer's file system. `/cloud/project`
* **Home**: Starting with a tilde `~/` means the default home directory -- varies by project. (`~/data/input`)
* **Relative paths**: Starting with no symbol means "wherever we are now". (`data/input`)
* **Current folder**: A period `./` also means "wherever we are now" (`./data/input`)
* **Parent folder**: Two periods `../` move to the parent folder (go back up one level): `../projects/data/input`

.small[\* You almost never call the root directory, and rarely call the home directory. Instead, do things in relative paths. Fast forward to <a href="#rstudio-project">RStudio Projects</a>]

So where is the working directory?
* It depends, but in RStudio, the working directory defaults to the project directory, which is wherever the `.Rproj` file is.
* This is not permanent; you can modify the working directory any time.
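
Base R has helpers for taking paths apart and putting them together. The sketch below only manipulates path strings (the file does not need to exist), reusing the `WEO-2018.xlsx` name from earlier slides.

```r
getwd()                          # the current working directory

# Build a relative path piece by piece; file.path() inserts the "/"
p <- file.path("data", "input", "WEO-2018.xlsx")
p

dirname(p)                       # the folder part
basename(p)                      # the file name
tools::file_ext(p)               # the extension, without the period

path.expand("~")                 # what the tilde stands for on this machine
normalizePath(".")               # "." written out as an absolute path
```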

---
# Navigating paths - the basics

Shell (or bash, command-line) is a language to navigate folders outside of R

* `pwd` means "show me the present (current) working directory" (`getwd()` in R)
* `cd` means "change the directory to the following path" (`setwd()` in R)
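
The R equivalents can be tried safely by moving into a temporary folder and back. This is a minimal sketch; `tempdir()` is used only so that the example works on any machine.

```r
old_wd <- getwd()      # like `pwd`: where are we now?

setwd(tempdir())       # like `cd`: move to another folder (here, a temporary one)
getwd()

setwd("..")            # ".." moves up to the parent folder
getwd()

setwd(old_wd)          # move back to where we started
```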

---

# Exercise your path reading skills

1. How would you read the following paths?

```r
weo <- read_excel("data/input/WEO-2018.xlsx")
```

```r
weo <- read_excel("~/Dropbox/API-209/Problem Sets/ps01/data/input/WEO-2018.xlsx")
```

```r
ggsave("figures/percapita_growth_figure.pdf")
```

2. Why would defining everything in absolute paths be a bad idea?
--

1. Too long
2. Not sharable: `setwd("path/that/only/works/on/my/machine")`

---
class: center, middle
name: day04-download-desktop

# Download R Desktop (in preparation for tomorrow)

---

# Download for Tomorrow: Newest Versions of R Desktop and RStudio Desktop

## R is the software

1. Google "download R" (or .small[<http://www.r-project.org/>])
2. Go to "download R", and pick the first mirror
3. Download the installer for your Operating System (Mac or Windows)

## RStudio is the GUI (Graphical User Interface)

1. Google "download RStudio" (or .small[<http://www.rstudio.com/products/rstudio/download>])
2. Download the Installer for Mac/Windows

---
class: middle, center
name: day04-cheatsheet

# Wrap-up and review of key commands

---

# Are you familiar with the functions on the cheatsheet?

Follow along on the cheat-sheet: <http://bit.ly/HKS-R>

.left-code[
1. Object assignment
2. dplyr verbs
3. ggplot layers
4. Base-R vectors and data-frames
5. Summary functions
6. Input / Output
7. Combining datasets (new topic)
]

We didn't get to cover these cheatsheet functions, which come up fairly often:

* `::` - call function directly from package
* `unique`, `n_distinct`: unique values
* `summary`: generic function, output varies by object 
* `str`: general structure
* `geom_histogram()`, `geom_bar()`: geoms that involve aggregation
* `fill` vs. `color` aesthetics: fill is the inside of a polygon; color is the border line
* `write_csv`: saving to a spreadsheet
* `bind_rows`: stack rows
* `left_join`, `inner_join`: merge multiple datasets
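
Since combining datasets is a new topic, here is a minimal sketch of several of these functions on two made-up tables (`gdp` and `pop` are hypothetical).

```r
library(dplyr)

# Made-up tables, for illustration only
gdp <- tibble(country = c("A", "B", "C"), rgdp = c(10, 20, 30))
pop <- tibble(country = c("A", "B", "D"), pop = c(1, 2, 4))

left_join(gdp, pop, by = "country")   # keeps all rows of gdp; C gets NA for pop
inner_join(gdp, pop, by = "country")  # keeps only countries found in both tables

stacked <- bind_rows(gdp, gdp)        # stack rows: 3 + 3 = 6
unique(stacked$country)               # the distinct values themselves
n_distinct(stacked$country)           # how many distinct values there are

# pkg::fun() calls a function without attaching the package with library()
stringr::str_c("a", "b")
```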


---
layout: true
<div class="my-footer">Day 5</div>

---
class: middle
name: day5

# Day 5 Agenda

## Discuss presentations
## R Desktop
## Getting help effectively

\* References for today's material: <a href = "https://whattheyforgot.org/">What they forgot to teach you about R</a> chs. <a href = "https://whattheyforgot.org/save-source.html">1</a> and <a href = "https://whattheyforgot.org/project-oriented-workflow.html">2</a>; <a href = "https://rstudio-education.github.io/hopr">Hands-on Programming with R</a> ch. <a href = "https://rstudio-education.github.io/hopr/starting.html">A</a>.

---
class: center, middle

# Presentations

---
class: center, middle
name: day05-desktop

# Getting to work with R Desktop

---

# Step 1: Create an API-209 folder

Choose a directory to put all your work in (e.g. `~/Dropbox`/ `C:\Dropbox`, or `~/Documents`)

&#x2B91; under it, create a `API-209` folder (avoid spaces in your folder names)

&#x2B91; under that, create a `Problem-Sets` folder

&#x2B91; under that, create folders `PS-01`, `PS-02`, .small[(adding a 0 makes it easy to sort when you get to `PS-10`)]

&#x2B91; for each problem set folder, create a `data` folder with subfolders `input` and `output`
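
The same folder tree can also be created from R. This sketch builds it inside a temporary folder so it is safe to run; `tempdir()` stands in for whichever directory you chose above.

```r
# tempdir() stands in for the directory you chose (e.g. ~/Documents)
root <- file.path(tempdir(), "API-209", "Problem-Sets")

for (ps in c("PS-01", "PS-02")) {
  for (sub in c("input", "output")) {
    # recursive = TRUE also creates any missing parent folders
    dir.create(file.path(root, ps, "data", sub),
               recursive = TRUE, showWarnings = FALSE)
  }
}

list.dirs(root, full.names = FALSE)  # show the folders that now exist
```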

\* You can deviate from this later, but these are well-established practices worth trying first. It's better to try out an established convention than to re-invent the wheel. See e.g. <a href="http://web.stanford.edu/~gentzkow/research/CodeAndData.pdf">"Code and Data for the Social Sciences: A Practitioner’s Guide"</a> and <a href="https://doi.org/10.1371/journal.pcbi.1005510">"Good enough practices in scientific computing"</a>

---

# Step 2: Open RStudio from Applications

---

# Step 2(a): Change global options so restarting entails clean slate

.pull-left[
* Change your Global settings to not save data
* To do this, Go to Toolbar `Tools` > `Global Options` and change the sections highlighted in red arrows as shown here:
]


---
# Step 2(b): Install key packages

You only need to do this once, so do this in the Console without any script

```r
install.packages(c("tidyverse", "readxl", "haven"))
```

You can install and update packages later.

---

# Step 3: Create a New Project

* Toolbar `File` > `New Project` or the icon: <img src="images/project-add-icon.png"/>

---
name:rstudio-project

# Orient yourself - where is the working directory?

1. The project directory is wherever the `.Rproj` file is
2. RStudio Projects are created so that the working directory defaults to the project directory

---

# Practice Project-Oriented Workflow

* .slarge[Create an RStudio project for each data analysis project.]

* .slarge[Keep data files there; we’ll talk about loading them into R in data import.]

* .slarge[Keep scripts there; edit them, run them in bits or as a whole.]

* .slarge[Save your outputs (plots and cleaned data) there.]

* .slarge[Only ever use relative paths, not absolute paths]

.small-note[
\* Taken from <a href="https://r4ds.had.co.nz/workflow-projects.html">R for Data Science Chapter 8</a> 
]

---

# Going beyond defaults

## setwd()

* You can change working directories by the R command `setwd()`
* You can change it back by Toolbar: `Session` > `Set Working Directory` > `To Project Directory`


## Your script location is independent from the project

.small-note[\* If you open a script for problem set 1 in the problem set 2 RStudio Project, the working directory still defaults to `Problem-Sets/PS-02`]


---

# Leverage the features of the RStudio IDE

## Technically, R is only the Console


## RStudio is a GUI - access R _via_ RStudio

## As an IDE, RStudio includes more tools

* <a href = "https://r4ds.had.co.nz/r-markdown.html">Rmarkdown</a>
* Shiny apps (interactive web apps)
* Terminal (command-line tools)
* Git (version control)


---

# Step 4: Test your Assignment

Assignment for end of today (mandatory)

> Take the summer assignment R script you created, and adapt it so that it is both (a) correct and (b) replicable within a project. Create a project specific to the summer assignment.

Materials on Canvas, <a href= "https://canvas.harvard.edu/courses/62068/assignments/303684"> R day 5 exercise</a>

---

# Running R for Problem Sets

## Preparation

1. Create a designated folder (e.g. `PS-04`)
2. Create subfolders for data, figures, etc.
3. Download problem set and data files
4. Create an RStudio Project for that project folder

## Analysis

API-209 problem sets will generally be in a fillable word doc.


---
class: center, middle
name: day05-getting-help

# Final Tips for your own R experience

---

# Take advantage of the R community

## Resources

* **Class instructors and resources**

Also helpful:

* Stackoverflow (.small[<https://stackoverflow.com/>])
* Community boards (e.g. .small[<http://community.rstudio.com>])
* Twitter?

<blockquote class="twitter-tweet" data-lang="en">What <a href="https://twitter.com/hashtag/rstats?src=hash&amp;ref_src=twsrc%5Etfw">#rstats</a> tricks did it take you way too long to learn? One of mine is using readRDS and saveRDS instead of repeatedly loading from CSV&mdash; Emily Riederer (@EmilyRiederer) <a href="https://twitter.com/EmilyRiederer/status/898735640031920129?ref_src=twsrc%5Etfw">August 19, 2017</a></blockquote>

## Stuck?

Help us help you by making your errors reproducible

* Code and/or Screenshots
* Or even better: `reprex`

```r
install.packages("reprex")
library(reprex)
```

Copy the problematic code to your clipboard,

```r
library(forcats)
data(gss_cat)

xtabs(rincome + race, gss_cat)
```

and separately run `reprex()`


---
class: center, middle

# Thanks!

Please be sure to send us any feedback in the end-of-math-camp survey.