class: center, middle, inverse, title-slide # MPA-ID 2019 Math Camp: R Session Slides ## Shiro Kuriwaki --- class: middle .pull-left[ # Table of Contents * [<b>Day 1: tidyverse</b>](#day1) * [dplyr](#day01-dplyr) * [ggplot](#day01-ggplot) * [<b>Day 2: Beyond the tidyverse (base-R)</b>](#day2) * [functions](#day02-functions) * [base-R syntax](#day02-base-R) * [formula syntax](#day02-formula) * [<b>Day 3: Visualization</b>](#day3) * [principles](#day03-principles) * [technicalities](#day04-technicalities) * [<b>Day 4: Workflow</b>](#day4) * [workflow](#day04-workflow) * [downloading RStudio Desktop](#day04-download-rstudio) * [Day 5: Setup and RStudio Desktop](#day5) * [desktop](#day05-desktop) * [asking for help](#day05-getting-help) ] .pull-right[ # About these slides These slides were prepared for the MPA-ID cohort's 2019 Math Camp. Read them alongside the <a href="https://www.shirokuriwaki.com/programming/2019%20Math%20Camp%20R%20Syllabus.pdf">syllabus</a> for more resources. The slides were originally compiled in HTML, so for best results view the <a href="https://www.shirokuriwaki.com/programming/2019-math-camp-R_all-days.html">web version</a> in a browser. The PDF version is a print-out of the web version, so it may crop code windows that can be scrolled horizontally in the web version. Contact Shiro Kuriwaki (kuriwaki@g.harvard.edu) for more information. ] --- # 5-day Schedule <img src="images/data-science.png" width="80%" style="display: block; margin: auto;" /> .small[Source: Grolemund and Wickham, R for Data Science] --- # 5-day Schedule .pull-left[ ## Topics 1. Mastering R basics in tidyverse 2. Mastering R basics beyond tidyverse 3. Perfecting Graphs 4. Exporting and Presenting 5. Presentations, preparing for classes .small[\* See <a href="https://www.shirokuriwaki.com/programming/2019%20Math%20Camp%20R%20Syllabus.pdf">syllabus</a> for details] ] -- .pull-right[ ## Logistics 1. Log in to <https://rstudio.cloud> and go to the Math Camp class space 2. 
"Copy" the Day 1 project `02_Day-01`, which includes - datasets - exercise PDF - relevant packages 3. During slides, try the code yourself + consult the cheatsheet: <http://bit.ly/HKS-R> 4. Work in groups for exercises (in your own R scripts) 5. Interrupt with questions and give me feedback on your index cards. ] --- layout: true <div class="my-footer"><span>Day 1</span></div> --- class: middle name: day1 # Day 1: Basics of tidyverse ## dplyr verbs ## ggplot aesthetics ## functions .small-note[ **Notes:** \* References for today's material: <a href = "https://r4ds.had.co.nz"><i>R for Data Science</i></a> chs. <a href = "https://r4ds.had.co.nz/data-visualisation.html">3</a>, <a href ="https://r4ds.had.co.nz/workflow-basics.html">4</a>, and <a href ="https://r4ds.had.co.nz/transform.html">5</a>; as well as the <a href = "https://rstudio.cloud/learn/primers">Primers</a> and <a href = "https://style.tidyverse.org">style guide</a> from the summer. ] --- # What is tidyverse? * A popular suite of R packages * Has its own syntax, designed to be more intuitive -- <br> <mark>For someone who does not know ANY programming, which of these two is more intuitive?</mark> -- .pull-left[ ## (a) tidyverse syntax ```r weo_dataset %>% group_by(continent) %>% summarize(gdp = median(rgdp2017, na.rm = TRUE)) ``` ] .pull-right[ ## (b) base-R syntax ```r aggregate(weo_dataset[, "rgdp2017"], list(continent = weo_dataset[, "continent"]), median, na.rm = TRUE) ``` ] -- .small[.right[Source: Adapted from Roger Peng, <a href= "https://simplystatistics.org/2018/07/12/use-r-keynote-2018/">"Teaching R to New Users - From tapply to the Tidyverse"</a>]] --- # If learning R is like "learning a language", R is a very _fast-evolving_ language <br> <br> <br> <img src="images/tikz_timeline.png" width="120%" style="display: block; margin: auto;" /> --- class: left # tidyverse simplifies data analysis by focusing on "tidy" rectangular data. 
```r weo <- read_excel("data/input/WEO-2018.xlsx") ``` .small[ <table> <thead> <tr> <th style="text-align:left;"> country </th> <th style="text-align:left;"> continent </th> <th style="text-align:right;"> pop1992 </th> <th style="text-align:right;"> rgdp1992 </th> <th style="text-align:right;"> pop1994 </th> <th style="text-align:right;"> pop1995 </th> <th style="text-align:right;"> pop1996 </th> <th style="text-align:right;"> pop1997 </th> <th style="text-align:right;"> pop1998 </th> <th style="text-align:right;"> pop1999 </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Afghanistan </td> <td style="text-align:left;"> Asia </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> NA </td> </tr> <tr> <td style="text-align:left;"> Albania </td> <td style="text-align:left;"> Europe </td> <td style="text-align:right;"> 3.217 </td> <td style="text-align:right;"> 9728.893 </td> <td style="text-align:right;"> 3.137 </td> <td style="text-align:right;"> 3.141 </td> <td style="text-align:right;"> 3.168 </td> <td style="text-align:right;"> 3.148 </td> <td style="text-align:right;"> 3.129 </td> <td style="text-align:right;"> 3.109 </td> </tr> <tr> <td style="text-align:left;"> Algeria </td> <td style="text-align:left;"> Africa </td> <td style="text-align:right;"> 26.271 </td> <td style="text-align:right;"> 267807.178 </td> <td style="text-align:right;"> 27.496 </td> <td style="text-align:right;"> 28.060 </td> <td style="text-align:right;"> 28.566 </td> <td style="text-align:right;"> 29.045 </td> <td style="text-align:right;"> 29.507 </td> <td style="text-align:right;"> 29.965 </td> </tr> <tr> <td style="text-align:left;"> Angola </td> <td style="text-align:left;"> Africa </td> <td style="text-align:right;"> 
13.459 </td> <td style="text-align:right;"> 39709.339 </td> <td style="text-align:right;"> 14.279 </td> <td style="text-align:right;"> 14.707 </td> <td style="text-align:right;"> 15.148 </td> <td style="text-align:right;"> 15.603 </td> <td style="text-align:right;"> 16.071 </td> <td style="text-align:right;"> 16.553 </td> </tr> <tr> <td style="text-align:left;"> Antigua and Barbuda </td> <td style="text-align:left;"> North America </td> <td style="text-align:right;"> 0.062 </td> <td style="text-align:right;"> 1133.675 </td> <td style="text-align:right;"> 0.065 </td> <td style="text-align:right;"> 0.067 </td> <td style="text-align:right;"> 0.068 </td> <td style="text-align:right;"> 0.070 </td> <td style="text-align:right;"> 0.072 </td> <td style="text-align:right;"> 0.074 </td> </tr> <tr> <td style="text-align:left;"> Argentina </td> <td style="text-align:left;"> South America </td> <td style="text-align:right;"> 33.420 </td> <td style="text-align:right;"> 445025.532 </td> <td style="text-align:right;"> 34.353 </td> <td style="text-align:right;"> 34.779 </td> <td style="text-align:right;"> 35.196 </td> <td style="text-align:right;"> 35.604 </td> <td style="text-align:right;"> 36.005 </td> <td style="text-align:right;"> 36.399 </td> </tr> <tr> <td style="text-align:left;"> Armenia </td> <td style="text-align:left;"> Asia </td> <td style="text-align:right;"> 3.450 </td> <td style="text-align:right;"> 7153.161 </td> <td style="text-align:right;"> 3.290 </td> <td style="text-align:right;"> 3.220 </td> <td style="text-align:right;"> 3.170 </td> <td style="text-align:right;"> 3.140 </td> <td style="text-align:right;"> 3.110 </td> <td style="text-align:right;"> 3.090 </td> </tr> <tr> <td style="text-align:left;"> Australia </td> <td style="text-align:left;"> Oceania </td> <td style="text-align:right;"> 17.557 </td> <td style="text-align:right;"> 507554.453 </td> <td style="text-align:right;"> 17.893 </td> <td style="text-align:right;"> 18.120 </td> <td 
style="text-align:right;"> 18.330 </td> <td style="text-align:right;"> 18.510 </td> <td style="text-align:right;"> 18.706 </td> <td style="text-align:right;"> 18.919 </td> </tr> </tbody> </table> ] --- name: day01-dplyr # The `dplyr` package * __94 percent__ of you reported being "Very Comfortable" or "Comfortable" using Excel -- * `dplyr` ["dee" - plier] is a package that provides common spreadsheet manipulations -- ## single verbs .pull-left[ * `filter()`: Return __rows__ with matching conditions * `select()`: Select/rename __variables__ by name * `arrange()`: Arrange rows by variables * `mutate()`: Create or transform variables * `summarize()`: Reduce multiple values down to a single value * `%>%` ["and then"]: Pipe the left-hand side into the right-hand side ] .pull-right[ ```r weo %>% filter(continent == "Europe") ``` ```r weo %>% arrange(rgdp2010) ``` ```r weo %>% select(rgdp2010) ``` ```r weo %>% mutate(gdp_pc_2010 = rgdp2010 / pop2010) ``` ```r weo %>% summarize(median_pop_2017 = median(pop2017)) ``` ] --- # A good package offers tools for common use cases -- ## selecting the first 10 columns ```r weo %>% select(1:10) ``` ## selecting variables from 1992 (i.e., variables whose names include 1992) ```r weo %>% select(matches("1992")) ``` ## select and rename ```r weo %>% select(country, continent, `real_gdp_2017 = `rgdp2017) ``` ## move the 2017 variables to the beginning ```r weo %>% select(country, continent, matches("2017"), `everything()`) ``` See the help page of `?select` for more. 
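---

# Chaining the single verbs: a toy sketch

The single verbs and `select()` helpers above can be chained into one pipeline. The sketch below is only an illustration: `toy_weo` is a made-up four-row table whose column names mimic the WEO file, and the numbers are invented, not taken from the real data.

```r
library(dplyr)

# Made-up toy data: column names mimic the WEO file, values are invented
toy_weo <- tibble(
  country   = c("A", "B", "C", "D"),
  continent = c("Europe", "Europe", "Asia", "Asia"),
  pop2017   = c(10, 20, 30, 40),
  rgdp2017  = c(300, 400, 300, 1200)
)

toy_summary <- toy_weo %>%
  filter(pop2017 >= 20) %>%                          # keep rows meeting a condition
  select(country, continent, matches("2017")) %>%    # keep columns by name or pattern
  mutate(gdp_pc_2017 = rgdp2017 / pop2017) %>%       # add a derived column
  summarize(median_gdp_pc = median(gdp_pc_2017))     # collapse to a single row

toy_summary
# median_gdp_pc is 20 for these made-up numbers (per-capita values 20, 10, 30)
```

Each step receives the data frame produced by the step before it, so the pipeline reads top to bottom as "filter, and then select, and then mutate, and then summarize".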
--- # A common summary -- counting ```r count(weo, continent) ``` ``` ## # A tibble: 6 x 2 ## continent n ## <chr> <int> ## 1 Africa 55 ## 2 Asia 43 ## 3 Europe 45 ## 4 North America 22 ## 5 Oceania 13 ## 6 South America 13 ``` --- # Cross-tabulation via counting ## A new dataset with categorical variables .pull-left[ ```r data(gss_cat) count(gss_cat, race) ``` ``` ## # A tibble: 3 x 2 ## race n ## <fct> <int> ## 1 Other 1959 ## 2 Black 3129 ## 3 White 16395 ``` ```r count(gss_cat, race, `sort = TRUE`) ``` ``` ## # A tibble: 3 x 2 ## race n ## <fct> <int> ## 1 White 16395 ## 2 Black 3129 ## 3 Other 1959 ``` ] -- .pull-right[ ```r count(gss_cat, race, relig) ``` ``` ## # A tibble: 44 x 3 ## race relig n ## <fct> <fct> <int> ## 1 Other No answer 14 ## 2 Other Don't know 3 ## 3 Other Inter-nondenominational 2 ## 4 Other Native american 16 ## 5 Other Christian 74 ## 6 Other Orthodox-christian 1 ## 7 Other Moslem/islam 42 ## 8 Other Other eastern 10 ## 9 Other Hinduism 62 ## 10 Other Buddhism 72 ## # … with 34 more rows ``` ] --- # Doing more with dplyr: `group_by` ```r weo %>% * group_by(continent) %>% summarize(median_gdp_pc = median(rgdp2017 / pop2017)) ``` ``` ## # A tibble: 6 x 2 ## continent median_gdp_pc ## <chr> <dbl> ## 1 Africa 3180. ## 2 Asia 15443. ## 3 Europe 30082. ## 4 North America 14484. ## 5 Oceania 5109. ## 6 South America 13194. 
``` -- <mark>How would you write the equivalent of `count` with `group_by`?</mark> -- ```r weo %>% group_by(continent) %>% summarize(n = n()) ``` --- # Grouping + dplyr verbs is a large part of data analysis .left-code-wide[ ```r drop_inc <- c("No answer", "Not applicable", "Don't know", "Refused") gop_ind <- c("Not str republican", "Strong republican") gss_cat %>% filter(race != "Other", !rincome %in% drop_inc) %>% mutate(race = fct_infreq(race)) %>% * group_by(race, rincome) %>% summarize(is_R = mean(partyid %in% gop_ind)) %>% pivot_wider(id_cols = rincome, names_from = race, values_from = is_R) %>% kable(format = "html", digits = 2, col.names = c("Income", "Whites", "Blacks")) %>% add_header_above(c(" " = 1, "% Strong Republican among" = 2)) ``` .small[\* don't worry about the code for now. The only point here is that `group_by` is crucial.] ] .right-plot-narrow[ .small[ <table> <thead> <tr> <th style="border-bottom:hidden" colspan="1"></th> <th style="border-bottom:hidden; padding-bottom:0; padding-left:3px;padding-right:3px;text-align: center; " colspan="2"><div style="border-bottom: 1px solid #ddd; padding-bottom: 5px; ">% Strong Republican among</div></th> </tr> <tr> <th style="text-align:left;"> Income </th> <th style="text-align:right;"> Whites </th> <th style="text-align:right;"> Blacks </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> $25000 or more </td> <td style="text-align:right;"> 0.34 </td> <td style="text-align:right;"> 0.05 </td> </tr> <tr> <td style="text-align:left;"> $20000 - 24999 </td> <td style="text-align:right;"> 0.28 </td> <td style="text-align:right;"> 0.05 </td> </tr> <tr> <td style="text-align:left;"> $15000 - 19999 </td> <td style="text-align:right;"> 0.27 </td> <td style="text-align:right;"> 0.03 </td> </tr> <tr> <td style="text-align:left;"> $10000 - 14999 </td> <td style="text-align:right;"> 0.25 </td> <td style="text-align:right;"> 0.06 </td> </tr> <tr> <td style="text-align:left;"> $8000 to 9999 </td> <td 
style="text-align:right;"> 0.23 </td> <td style="text-align:right;"> 0.11 </td> </tr> <tr> <td style="text-align:left;"> $7000 to 7999 </td> <td style="text-align:right;"> 0.24 </td> <td style="text-align:right;"> 0.00 </td> </tr> <tr> <td style="text-align:left;"> $6000 to 6999 </td> <td style="text-align:right;"> 0.26 </td> <td style="text-align:right;"> 0.03 </td> </tr> <tr> <td style="text-align:left;"> $5000 to 5999 </td> <td style="text-align:right;"> 0.28 </td> <td style="text-align:right;"> 0.08 </td> </tr> <tr> <td style="text-align:left;"> $4000 to 4999 </td> <td style="text-align:right;"> 0.29 </td> <td style="text-align:right;"> 0.08 </td> </tr> <tr> <td style="text-align:left;"> $3000 to 3999 </td> <td style="text-align:right;"> 0.26 </td> <td style="text-align:right;"> 0.10 </td> </tr> <tr> <td style="text-align:left;"> $1000 to 2999 </td> <td style="text-align:right;"> 0.27 </td> <td style="text-align:right;"> 0.00 </td> </tr> <tr> <td style="text-align:left;"> Lt $1000 </td> <td style="text-align:right;"> 0.24 </td> <td style="text-align:right;"> 0.04 </td> </tr> </tbody> </table> ] ] --- # Introducing a new dataset .pull-left[ The WEO dataset in "long" form ```r weo_long ``` ``` ## # A tibble: 6,112 x 5 ## country continent year pop rgdp ## <chr> <chr> <dbl> <dbl> <dbl> ## 1 Afghanistan Asia 1992 NA NA ## 2 Afghanistan Asia 1993 NA NA ## 3 Afghanistan Asia 1994 NA NA ## 4 Afghanistan Asia 1995 NA NA ## 5 Afghanistan Asia 1996 NA NA ## 6 Afghanistan Asia 1997 NA NA ## 7 Afghanistan Asia 1998 NA NA ## 8 Afghanistan Asia 1999 NA NA ## 9 Afghanistan Asia 2000 NA NA ## 10 Afghanistan Asia 2001 NA NA ## # … with 6,102 more rows ``` ] -- .pull-right[ compare this to the original "wide" form.... 
```r weo ``` ``` ## # A tibble: 191 x 66 ## country continent pop1992 pop1993 pop1994 pop1995 pop1996 pop1997 ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 Afghan… Asia NA NA NA NA NA NA ## 2 Albania Europe 3.22 3.20 3.14 3.14 3.17 3.15 ## 3 Algeria Africa 26.3 26.9 27.5 28.1 28.6 29.0 ## 4 Angola Africa 13.5 13.9 14.3 14.7 15.1 15.6 ## 5 Antigu… North Am… 0.062 0.063 0.065 0.067 0.068 0.07 ## 6 Argent… South Am… 33.4 33.9 34.4 34.8 35.2 35.6 ## 7 Armenia Asia 3.45 3.37 3.29 3.22 3.17 3.14 ## 8 Austra… Oceania 17.6 17.7 17.9 18.1 18.3 18.5 ## 9 Austria Europe 7.80 7.88 7.93 7.95 7.96 7.97 ## 10 Azerba… Asia 7.46 7.49 7.60 7.64 7.73 7.8 ## # … with 181 more rows, and 58 more variables: pop1998 <dbl>, ## # pop1999 <dbl>, pop2000 <dbl>, pop2001 <dbl>, pop2002 <dbl>, ## # pop2003 <dbl>, pop2004 <dbl>, pop2005 <dbl>, pop2006 <dbl>, ## # pop2007 <dbl>, pop2008 <dbl>, pop2009 <dbl>, pop2010 <dbl>, ## # pop2011 <dbl>, pop2012 <dbl>, pop2013 <dbl>, pop2014 <dbl>, ## # pop2015 <dbl>, pop2016 <dbl>, pop2017 <dbl>, pop2018 <dbl>, ## # pop2019 <dbl>, pop2020 <dbl>, pop2021 <dbl>, pop2022 <dbl>, ## # pop2023 <dbl>, rgdp1992 <dbl>, rgdp1993 <dbl>, rgdp1994 <dbl>, ## # rgdp1995 <dbl>, rgdp1996 <dbl>, rgdp1997 <dbl>, rgdp1998 <dbl>, ## # rgdp1999 <dbl>, rgdp2000 <dbl>, rgdp2001 <dbl>, rgdp2002 <dbl>, ## # rgdp2003 <dbl>, rgdp2004 <dbl>, rgdp2005 <dbl>, rgdp2006 <dbl>, ## # rgdp2007 <dbl>, rgdp2008 <dbl>, rgdp2009 <dbl>, rgdp2010 <dbl>, ## # rgdp2011 <dbl>, rgdp2012 <dbl>, rgdp2013 <dbl>, rgdp2014 <dbl>, ## # rgdp2015 <dbl>, rgdp2016 <dbl>, rgdp2017 <dbl>, rgdp2018 <dbl>, ## # rgdp2019 <dbl>, rgdp2020 <dbl>, rgdp2021 <dbl>, rgdp2022 <dbl>, ## # rgdp2023 <dbl> ``` ] --- # groups in ggplot ## simple case: single time trend .left-code[ ```r gdp_by_year <- weo_long %>% group_by(year) %>% summarize(med_gdp = median(rgdp, na.rm = TRUE), gdp_pc = sum(rgdp, na.rm = TRUE) / sum(pop, na.rm = TRUE)) ggplot(gdp_by_year, aes(x = year, y = med_gdp)) + geom_line() + labs(y = "Median GDP 
(in millions 2017 USD)") ``` ] -- .right-plot[ ![](2019-math-camp-R_all-days_files/figure-html/lineall-out-1.svg)<!-- --> ] --- # groups in ggplot ## what's wrong here? .left-code[ ```r ggplot(weo_long, aes(x = year, y = rgdp)) + geom_line() + labs(y = "GDP (in millions 2017 USD)") ``` ] .right-plot[ ![](2019-math-camp-R_all-days_files/figure-html/linectybad-out-1.svg)<!-- --> ] --- # groups in ggplot ## the group aesthetic treats a group of observations, defined by another variable, separately .left-code[ ```r ggplot(weo_long, * aes(x = year, y = rgdp, group = country)) + geom_line() + labs(y = "GDP (in millions 2017 USD)") ``` ] .right-plot[ ![](2019-math-camp-R_all-days_files/figure-html/linecty-out-1.svg)<!-- --> ] --- # groups in ggplot .left-code[ ```r weo_large <- weo_long %>% filter(country %in% c( "India", "China", "Japan", "South Korea", "United States" )) ggplot(weo_large, * aes(x = year, y = rgdp, group = country)) + geom_line() + labs(y = "GDP (in millions 2017 USD)") ``` ] .right-plot[ ![](2019-math-camp-R_all-days_files/figure-html/linectysub-out-1.svg)<!-- --> ] --- class: middle, center name: day01-ggplot # The grammar of graphics --- # 3 required components of ggplot grammar: data, aesthetic mapping, and geoms <img src="images/ggplot-grammar.png" width="60%" style="display: block; margin: auto;" /> --- # mapping vs. 
layers and the global application of geoms .left-threeway[ ```r # ggplot(data = gdp_by_year, aes(x = year, y = gdp_pc)) + * geom_point() + labs(x = "Year", y = "GDP per capita") ``` <img src="2019-math-camp-R_all-days_files/figure-html/unnamed-chunk-28-1.svg" width="100%" /> ] -- .center-threeway[ ```r # ggplot(data = gdp_by_year, aes(x = year, y = gdp_pc)) + * geom_line() + labs(x = "Year", y = "GDP per capita") ``` <img src="2019-math-camp-R_all-days_files/figure-html/unnamed-chunk-29-1.svg" width="100%" /> ] -- .right-threeway[ ```r ggplot(data = gdp_by_year, aes(x = year, y = gdp_pc)) + * geom_point() + * geom_line() + labs(x = "Year", y = "GDP per capita") ``` <img src="2019-math-camp-R_all-days_files/figure-html/unnamed-chunk-30-1.svg" width="100%" /> ] --- # mappings vs. hard-coded values .left-threeway[ ```r ggplot(data = gdp_by_year, aes(x = year, y = gdp_pc)) + * geom_point(aes(color = gdp_pc)) + labs(x = "Year", y = "GDP per capita") ``` <img src="2019-math-camp-R_all-days_files/figure-html/unnamed-chunk-31-1.svg" width="100%" /> ] -- .center-threeway[ ```r ggplot(data = gdp_by_year, aes(x = year, y = gdp_pc)) + * geom_line(color = "navy") + labs(x = "Year", y = "GDP per capita") ``` <img src="2019-math-camp-R_all-days_files/figure-html/unnamed-chunk-32-1.svg" width="100%" /> ] -- .right-threeway[ ```r ggplot(data = gdp_by_year, aes(x = year, y = gdp_pc)) + * geom_point(aes(color = "navy")) + # wrong! labs(x = "Year", y = "GDP per capita") ``` <img src="2019-math-camp-R_all-days_files/figure-html/unnamed-chunk-33-1.svg" width="100%" /> ] --- class: middle center # Try the exercises provided in your RStudio day 1 space. 
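---

# mappings vs. hard-coded values: a toy sketch

The distinction between a mapping and a hard-coded value from the slides above can be tried on a small table. This is a minimal sketch under stated assumptions: `toy_long` is made-up data (two hypothetical countries, three years), not the WEO numbers.

```r
library(ggplot2)

# Made-up toy data: two hypothetical countries observed over three years
toy_long <- data.frame(
  country = rep(c("A", "B"), each = 3),
  year    = rep(2015:2017, times = 2),
  rgdp    = c(100, 110, 120, 300, 290, 310)
)

# Mapping: color varies with a variable, so it goes inside aes()
p_mapped <- ggplot(toy_long, aes(x = year, y = rgdp, group = country)) +
  geom_line(aes(color = country))

# Hard-coded value: every line is navy, so color goes outside aes()
p_fixed <- ggplot(toy_long, aes(x = year, y = rgdp, group = country)) +
  geom_line(color = "navy")

# Common mistake: aes(color = "navy") treats "navy" as a data value,
# so the lines get a default palette color and a legend entry labeled "navy"
p_wrong <- ggplot(toy_long, aes(x = year, y = rgdp, group = country)) +
  geom_line(aes(color = "navy"))
```

Printing `p_mapped`, `p_fixed`, and `p_wrong` one after another shows the difference: only `p_fixed` draws navy lines.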
--- layout: true <div class="my-footer"><span>Day 2</span></div> --- class: middle name: day2 # Day 2 - R beyond the tidyverse ## Group project preview ## function anatomy ## base-R syntax: vectors and lists ## strings, factors, numerics, and coercion ## formula syntax cross-tabs .small-note[ **Notes:** \* References for today's material: <a href = "https://r4ds.had.co.nz"><i>R for Data Science</i></a> chs. <a href = "https://r4ds.had.co.nz/factors.html">15</a>, <a href = "https://r4ds.had.co.nz/pipes.html">18</a>, and <a href = "https://r4ds.had.co.nz/vectors.html">20</a>; and <a href = "https://rstudio-education.github.io/hopr"><i>Hands-on Programming with R</i></a> chs. <a href = "https://rstudio-education.github.io/hopr/basics.html">2</a> through <a href = "https://rstudio-education.github.io/hopr/r-notation">6</a>. ] --- class: middle # Housekeeping * Announcements by Dan and Jake * How to get files and submit assignments on Canvas * Open the rstudio cloud session for today (`03_Day-02`) * Read in the datasets we'll use ```r library(tidyverse) library(readxl) weo <- read_excel("data/input/WEO-2018.xlsx") weo_long <- read_csv("data/input/WEO-2018_long.csv") data(gss_cat) ``` * How to ask questions --- # Group Project (8/27): How does climate change affect economic growth / politics? .pull-left[ <img src="images/nytimes-climate.png" width="100%" /> ] .pull-right[ We'll use data from: <img src="images/dell_et_al_abstract.png" width="100%" /> same format as `WEO-2018_long.csv`, contains temperature, precipitation, and growth data. ] --- # Group Project (8/27): How does climate change affect economic growth / politics? .pull-left[ ## for next Tuesday: Short Submission: A standalone <mark>graph / table</mark> testing whether hot countries tend to be poor. Group Presentation: A data-driven <mark>report</mark> showing if / how climate change affects growth. 
- Focus on the tools for data exploration and visualization we learn in math camp - That means: don't worry about the econometrics ] .pull-right[ We'll use data from: <img src="images/dell_et_al_abstract.png" width="100%" /> Same format as `WEO-2018_long.csv`; contains temperature, precipitation, and growth data. ] --- class: middle # Ready with your datasets? ```r library(tidyverse) library(readxl) weo <- read_excel("data/input/WEO-2018.xlsx") weo_long <- read_csv("data/input/WEO-2018_long.csv") data(gss_cat) gdp_by_year <- weo_long %>% group_by(year) %>% summarize(med_gdp = median(rgdp, na.rm = TRUE)) ``` --- class: middle, center name: day02-functions # Function Anatomy --- * _function_ (in mathematics): a mapping from each element of a set `\(X\)` to a _single_ element of a set `\(Y\)`. In R: a process that takes one or more inputs (in parentheses) and returns exactly one output ```r log(x = 100, base = 10) ``` -- * Call a function by the function name, with arguments and values in parentheses, separated by commas ```r log(x = 100, base = 10) log(base = 10, x = 100) ``` -- * If the user does not name the arguments, they are assumed to come in the order they are defined. ```r log(100, 10) ``` -- * <mark>How do you know that the <b>first</b> argument of <font face = "Ubuntu Mono">log()</font> is <font face = "Ubuntu Mono">x</font>, and not <font face = "Ubuntu Mono">base</font>?</mark> --- # Function anatomy: important odds and ends -- .pull-left[ ## 1. functions with no output These functions still do something, but what they do is a side effect (they return no object you would use) ```r library(tidyverse) ``` .small[.right[(loads the package)]] <br> ```r data(gss_cat) ``` .small[.right[(loads a built-in dataset which will be called `gss_cat`)]] ] -- .pull-right[ ## 2. functions with no arguments ```r Sys.Date() ``` ``` ## [1] "2019-09-05" ``` .small[.right[(returns today's date; takes no input)]] ```r n() ``` .small[.right[(`dplyr`'s "number of rows in the current group")]] ] --- .pull-left[ ## 3. 
functions with default arguments e.g. `log()` in R defaults to `\(\log_{e}()\)`: ```r args(log) ``` ``` ## function (x, base = exp(1)) ## NULL ``` therefore, an argument with a default can be left unspecified: ```r log(exp(5)) ``` _Note_ this is different from not declaring the _name_ of a no-default argument ```r args(aes) ``` ``` ## function (x, y, ...) ## NULL ``` ```r ggplot(gdp_by_year, aes(year, med_gdp)) ``` ] -- .pull-right[ ## 4. functions where defaults are not documented ```r args(geom_point) ``` ``` ## function (mapping = NULL, data = NULL, stat = "identity", position = "identity", ## ..., na.rm = FALSE, show.legend = NA, inherit.aes = TRUE) ## NULL ``` ```r ggplot(gdp_by_year, aes(year, med_gdp)) + * geom_point() + labs(x = "Year", y = "GDP per capita") ``` <img src="2019-math-camp-R_all-days_files/figure-html/unnamed-chunk-52-1.svg" width="100%" /> ] --- ## 5. functions with countably infinite arguments .pull-left[ ```r count(x = gss_cat, race) ``` .right[.small[(one-way tabulation)]] ```r count(x = gss_cat, race, relig) ``` .right[.small[(two-way tabulation)]] ```r count(x = gss_cat, race, relig, denom) ``` .right[.small[(three-way tabulation)]] Notice `args(count)` returns: ```r function (x, `...`, wt = NULL, sort = FALSE) ``` R represents such potentially infinite arguments with `...` (an ellipsis) ] -- .pull-right[ Arguments that come after `...` must be _named_ ```r count(gss_cat, race, relig, sort = TRUE) ``` ``` ## # A tibble: 44 x 3 ## race relig n ## <fct> <fct> <int> ## 1 White Protestant 8188 ## 2 White Catholic 4001 ## 3 White None 2816 ## 4 Black Protestant 2271 ## 5 Other Catholic 916 ## 6 White Christian 474 ## 7 Other Protestant 387 ## 8 Black None 384 ## 9 White Jewish 370 ## 10 Other None 323 ## # … with 34 more rows ``` so they cannot be matched by position. Consider: ```r count(gss_cat, race, relig, TRUE) # wrong ``` ] -- --- ## 6. 
dplyr's pipe * tidyverse adopted `%>%` (from the `magrittr` package) as a convenient "and then" ```r weo %>% filter(continent == "Africa") %>% select(country, rgdp1992, pop1992) %>% mutate(gdp_pc = rgdp1992 / pop1992) ``` -- * The object to `%>%`'s LEFT becomes the _value of the first argument_ of the function to `%>%`'s RIGHT ```r weo %>% * filter(continent == "Africa") ``` -- * Therefore the first chunk is equivalent to: ```r mutate(select(filter(weo, continent == "Africa"), country, rgdp1992, pop1992), gdp_pc = rgdp1992 / pop1992) ``` .right[.small[Remark: this follows the formal definition of a function, but is much less readable]] -- * <font color = "gray">[Advanced]: A dataframe in a pipeline is implicitly kept "in scope", such that a variable name that is called is presumed to be from the dataframe. </font> --- ## 7. ggplot's plus .left-threeway[ * Historically, `ggplot2` came before `tidyverse` and `dplyr` * Appending layers by `+` (instead of `%>%`) is a historical legacy -- <img src="images/ggplot-grammar.png" width="100%" style="display: block; margin: auto;" /> ] -- .right-plot[ .small[ ```r # format dataset weo_percap <- weo %>% mutate(pc_1992 = rgdp1992/pop1992, pc_2017 = rgdp2017/pop2017) # ggplot ggplot(weo_percap, aes(pc_1992, pc_2017)) + ## layers geom_abline(slope = 1, intercept = 0, linetype = "dashed") + geom_point(size = 0.5, color = "indianred") + ## coordinate coord_equal() + ## facet facet_wrap(~continent) + ## scale scale_x_continuous(labels = NULL) + scale_y_continuous(labels = scales::dollar) + labs(x = "1992 GDP per capita", y = "2017 GDP per capita", title = "Growth between 1992 and 2017") + ## theme theme_classic() + theme(plot.title = element_text(hjust = 0.5, face = "bold")) ggsave("images/day1gg-show.png", width = 4.8, height = 3.2) ``` ] ] --- ## 7. 
ggplot's plus .left-code[ .small[ ```r # format dataset weo_percap <- weo %>% mutate(pc_1992 = rgdp1992/pop1992, pc_2017 = rgdp2017/pop2017) # ggplot ggplot(weo_percap, aes(pc_1992, pc_2017)) + ## layers geom_abline(slope = 1, intercept = 0, linetype = "dashed") + geom_point(size = 0.5, color = "indianred") + ## coordinate coord_equal() + ## facet facet_wrap(~continent) + ## scale scale_x_continuous(labels = NULL) + scale_y_continuous(labels = scales::dollar) + labs(x = "1992 GDP per capita", y = "2017 GDP per capita", title = "Growth between 1992 and 2017") + ## theme theme_classic() + theme(plot.title = element_text(hjust = 0.5, face = "bold")) ggsave("images/day1gg-show.png", width = 4.8, height = 3.2) ``` ] ] .right-plot[ <img src="images/day1gg-show.png" width="100%" style="display: block; margin: auto;" /> ] --- class: center, middle # Questions? --- name: day02-base-R # Three broad flavors of syntax ## 1. tidyverse syntax ```r weo `%>%` `mutate(gdp_pc_2017 = rgdp2017 / pop2017)` ``` .right[.small[Create a new variable]] ## 2. base-R (dollar sign and square bracket) syntax ```r weo`$`gdp_pc_2017 <- weo`$`rgdp2017 / weo`$`pop2017 weo`[["`gdp_pc_2017`"]]` <- weo`[["`rgdp2017`"]]` / weo`[["`pop2017`"]]` ``` .right[.small[Create a new variable]] ## 3. formula (tilde) syntax .pull-left[ ```r lm(tvhours `~` race `+` relig, data = gss_cat) ``` .right[.small[Run regressions]] ] .pull-right[ ```r xtabs(`~` race `+` relig, data = gss_cat) ``` .right[.small[Cross-tab]] ] --- class: middle, center # A Whirlwind Tour of Base-R ## Data structures ## Data types ## Summary functions --- # Data Structures: what's the container? 
## data frames are the most complex data structure | | Homogeneous | Heterogeneous | | -----:| :-------------: |:-------------:| | 1-dimensional| Atomic Vector | List | | 2-dimensional | Matrix | <mark>Data frame</mark> | <br> ## "Heterogeneous" means it can take different types ```r x <- list(1:3, "a", c(TRUE, FALSE, TRUE), c(2.3, 5.9)) str(x) ``` ``` ## List of 4 ## $ : int [1:3] 1 2 3 ## $ : chr "a" ## $ : logi [1:3] TRUE FALSE TRUE ## $ : num [1:2] 2.3 5.9 ``` --- # tidyverse vs. base-R syntax: different ways of subsetting .pull-left[ ## tidyverse (dataframes only) ```r weo %>% select(country, rgdp2017) %>% slice(1:5) ``` ``` ## # A tibble: 5 x 2 ## country rgdp2017 ## <chr> <dbl> ## 1 Afghanistan 63353. ## 2 Albania 32763. ## 3 Algeria 576494. ## 4 Angola 173327. ## 5 Antigua and Barbuda 2174. ``` ] ## base-R .pull-right[ Atomic vectors with square brackets ```r letters[c(1, 2, 3)] ``` Matrices (and data frames) also with square brackets, using two dimensions: ```r weo[1:5, 1] ``` Lists with double square brackets or `$` (S for <mark>S</mark>lot) ```r weo[[1]] weo[["country"]] weo$country ``` ] --- # Data types: what's in the container? -- .left-threeway[ (a) Logical ```r c(TRUE, FALSE, FALSE) ``` <br> ```r is.na(c(NA, 2, 3)) ``` ``` ## [1] TRUE FALSE FALSE ``` ] .center-threeway[ (b) Numeric - Integers ```r c(1L, 0L, 0L) ``` <br> (c) Numeric - Doubles ```r c(1, 0, 0) ``` ] .right-threeway[ (d) Character ```r c("A", "B", "B") ``` ] --- # Vectors are flexible: .left-threeway[ ## coercion `TRUE` gets coerced to 1 and `FALSE` gets coerced to 0: ```r sum(c(TRUE, FALSE, TRUE)) ``` ``` ## [1] 2 ``` numbers can be coerced to characters: ```r as.character(c(1, 2, 3)) ``` automatic coercion makes vectors stay homogeneous: ```r c(1, "a") ``` ] -- .center-threeway[ ## recycling A scalar assigned where a vector is expected is recycled: it becomes a _vector_ with its value repeated. ```r weo$ones <- 1 ``` This created a column called `ones` where every row is the number one. 
```r count(weo, ones) ``` ``` ## # A tibble: 1 x 2 ## ones n ## <dbl> <int> ## 1 1 191 ``` ] -- .right-threeway[ ## vector algebra Arithmetic on numeric vectors and matrices works element by element ```r c(1, 2, 3) + c(100, 200, 300) ``` ``` ## [1] 101 202 303 ``` ```r c(1, 2, 3) * c(100, 200, 300) ``` ``` ## [1] 100 400 900 ``` recycling works here too: ```r (c(1, 2, 3) * c(100, 200, 300)) + 1 ``` ``` ## [1] 101 401 901 ``` ] --- # factors: a hybrid type that encodes the order of values -- * In practice, factors are like _ordered_ characters -- * Technically, factors are integers (which dictate the ordering, called __levels__) where each value has a __label__ -- .pull-left[ ## built-in GSS saves variables as factors ```r data(gss_cat) count(gss_cat, marital) ``` ``` ## # A tibble: 6 x 2 ## marital n ## <fct> <int> ## 1 No answer 17 ## 2 Never married 5416 ## 3 Separated 743 ## 4 Divorced 3383 ## 5 Widowed 1807 ## 6 Married 10117 ``` ] .pull-right[ ## compare this to strings ```r gss_str <- read_csv("data/input/gss_str.csv") count(gss_str, marital) ``` ``` ## # A tibble: 6 x 2 ## marital n ## <chr> <int> ## 1 Divorced 3383 ## 2 Married 10117 ## 3 Never married 5416 ## 4 No answer 17 ## 5 Separated 743 ## 6 Widowed 1807 ``` ] --- # Recoding, Releveling, and Replacing ## 1. dplyr::recode edits the values ```r gss_cat %>% mutate(relig = * recode(relig, * Protestant = "Christian", * Catholic = "Christian", * .default = "Non-Christian")) %>% count(relig) ``` ``` ## # A tibble: 2 x 2 ## relig n ## <fct> <int> ## 1 Non-Christian 5513 ## 2 Christian 15970 ``` --- # Recoding, Releveling, and Replacing ## 2. 
forcats::fct\_relevel edits the _order_ of the levels ```r count(gss_cat, race) ``` ``` ## # A tibble: 3 x 2 ## race n ## <fct> <int> ## 1 Other 1959 ## 2 Black 3129 ## 3 White 16395 ``` ```r gss_cat %>% * mutate(race = fct_relevel(race, "White", "Black", "Other")) %>% count(race) ``` ``` ## # A tibble: 3 x 2 ## race n ## <fct> <int> ## 1 White 16395 ## 2 Black 3129 ## 3 Other 1959 ``` --- # Recoding, Releveling, and Replacing ## 3. base::replace edits the value of variables based on the value of potentially other variables ```r gss_cat %>% mutate(is_senior = NA, # this makes a columns of NAs is_senior = replace(is_senior, `age < 65`, "not senior"), is_senior = replace(is_senior, `age >= 65`, "senior")) %>% select(age, is_senior) ``` ``` ## # A tibble: 21,483 x 2 ## age is_senior ## <int> <chr> ## 1 26 not senior ## 2 48 not senior ## 3 67 senior ## 4 39 not senior ## 5 25 not senior ## 6 25 not senior ## 7 36 not senior ## 8 44 not senior ## 9 44 not senior ## 10 47 not senior ## # … with 21,473 more rows ``` --- # The things you can get out of data .pull-left[ ## 1. dimensions Each vector has a "length", each 2d array has a "dimension" ```r x <- c(1:10) length(x) ``` ``` ## [1] 10 ``` ```r dim(weo) ``` ``` ## [1] 191 67 ``` .right[.small[Also use `nrow` (number of rows) and `ncol` (number of columns)]] ] .pull-right[ ## 2. summary stats ```r stdnormals <- rnorm(n = 1000, mean = 0, sd = 1) ``` ```r mean(stdnormals) sd(stdnormals) ``` ``` ## [1] -0.007799939 ``` ``` ## [1] 1.032668 ``` ```r round(mean(stdnormals), digits = 2) ``` ``` ## [1] -0.01 ``` ] --- # The things you can get out of data .pull-left[ ## 3. string concatenation ```r c("a", "b", "c", "d") ``` ``` ## [1] "a" "b" "c" "d" ``` concatenate ("paste") into a single item ```r str_c("a", "b", "c", "d") ``` ``` ## [1] "abcd" ``` ] -- .pull-right[ ## 4. 
pulling out vectors from dataframes ```r mean(weo$rgdp2019) ``` ``` ## [1] 653848.6 ``` ```r mean(select(weo, rgdp2019)) ``` ``` ## Warning in mean.default(select(weo, rgdp2019)): argument is not numeric or ## logical: returning NA ``` ``` ## [1] NA ``` "pull" takes one column and coerces to vector: ```r mean(`pull`(weo, rgdp2019)) ``` ``` ## [1] 653848.6 ``` ] --- <mark> What would the following code do? </mark> ```r weo_long %>% mutate(rgdp2017 = round(rgdp2017, digits = 0), country = str_c(country, " (", continent, ")")) %>% slice(1:3) ``` -- <table> <thead> <tr> <th style="text-align:left;"> country </th> <th style="text-align:right;"> rgdp2017 </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Afghanistan (Asia) </td> <td style="text-align:right;"> 63353 </td> </tr> <tr> <td style="text-align:left;"> Albania (Europe) </td> <td style="text-align:right;"> 32763 </td> </tr> <tr> <td style="text-align:left;"> Algeria (Africa) </td> <td style="text-align:right;"> 576494 </td> </tr> </tbody> </table> --- class: center, middle name: day02-formula # Formula syntax --- # Formula (tilde) syntax expresses the relationship between variables .pull-left[ OLS has a `formula` and a `data` argument ```r fit <- lm(tvhours ~ race + relig, data = gss_cat) summary(fit) ``` * __One__ variable (unquoted) to the left hand side * __One or more__ variables to the right hand side ] -- .pull-right[ Formulas are not unique to regressions. 
See for example cross-tabs:

```r
xtabs(~ relig + race, gss_cat, drop.unused.levels = TRUE)
```

* Typically no left hand side
* __One or two__ variables on the right hand side (row variable, then column variable)
]

---

# Summarizing cross-tabs

Cell / row / column proportions

```r
prop.table(xtabs(~ relig + race, gss_cat))
prop.table(xtabs(~ relig + race, gss_cat), `margin = 1`)
prop.table(xtabs(~ relig + race, gss_cat), `margin = 2`)
```

--

<br>

Using the GSS dataset, how would you get <mark>a cross-tab that shows (a) the proportion in each income bin, (b) for each of the two most common values of race, and (c) _rounded_ to the second decimal place?</mark>

--

```r
gss_cat_wb <- gss_cat %>%
  filter(race %in% c("White", "Black")) %>%
  mutate(race = as.character(race))

xtabs(~ rincome + race, gss_cat_wb) %>%
  prop.table(margin = 2) %>%
  round(2)
```

---

layout: true

<div class="my-footer"><span>Day 3</span></div>

---

class: middle
name: day3

# Day 3 Agenda

## how our eyes see data

## technicalities: axes, legends, annotation, exporting

.small-note[
**Notes:**

\* References for today's material: <a href = "https://r4ds.had.co.nz"><i>R for Data Science</i></a> ch. <a href = "https://r4ds.had.co.nz/data-visualisation.html">3</a>; <a href = "https://serialmentor.com/dataviz/"><i>Fundamentals of Data Visualization</i></a> by Claus Wilke, chs. <a href = "https://serialmentor.com/dataviz/proportional-ink.html">17</a> through <a href = "https://serialmentor.com/dataviz/small-axis-labels.html">24</a>; and <a href = "https://www.youtube.com/watch?v=fSgEeI2Xpdc"><i>How Humans See Data</i></a> by John Rauser.

\* Most of the example graphs from this day are from Claus Wilke's excellent book, and the presentation of the first section follows John Rauser's eye-opening talk.
]

---

class: center, middle

# Visualization

<br>

---

name: day03-principles

# 1. Show the key message as a comparison of positions on a common scale

<br>

Perception tasks, ranked 1.
<u>Position along a common scale</u> <font color = "red"> ⬅ the human eye can discern this most accurately </font> 2. Position along identical, nonaligned scales 3. Length 4. Angle or Slope 5. Area 6. Volume, Density, Saturation 7. Color Hue <font color = "red"> ⬅ and this the least</font> .right[.small[Source: <a href = "https://pdfs.semanticscholar.org/565d/843c2c0e60915709268ac4224894469d82d5.pdf">Cleveland and McGill (1985)</a>, paper on Canvas]] --- # Example: Alesina et al., "Intergenerational mobility in Africa" (<a href="https://voxdev.org/topic/health-education/intergenerational-mobility-africa">link</a>) ```r alesina <- read_excel("data/input/intergenerational-mobility.xlsx") ``` .pull-left[ <img src="images/papaioannou21junefig1a.png" width="85%" /> ] -- .pull-right[ <img src="images/papaioannou21junetable1.png" width="72%" /> ] --- .left-code[ <br> <br> <br> 1. Position along a common scale <font color = "red"> ⬅ the human eye can discern this most accurately </font> 2. Position along identical, nonaligned scales 3. Length 4. Angle or Slope 5. Area 6. Volume, Density, Saturation 7. <mark>Color Hue</mark><font color = "red"> ⬅ and this the least</font> ] -- .right-plot[ ![](2019-math-camp-R_all-days_files/figure-html/unnamed-chunk-119-1.svg)<!-- --> ] --- .left-code[ <br> <br> <br> 1. Position along a common scale <font color = "red"> ⬅ the human eye can discern this most accurately </font> 2. Position along identical, nonaligned scales 3. Length 4. Angle or Slope 5. Area 6. Volume, Density, Saturation 7. <mark>Color Hue</mark> <font color = "red"> ⬅ and this the least</font> <br> Using: `geom_col()` with the `fill` aesthetic, reordering countries by `fct_reorder()`. ] .right-plot[ ![](2019-math-camp-R_all-days_files/figure-html/unnamed-chunk-120-1.svg)<!-- --> ] --- .left-code[ <br> <br> <br> 1. Position along a common scale <font color = "red"> ⬅ the human eye can discern this most accurately </font> 2. Position along identical, nonaligned scales 3. 
Length 4. Angle or Slope 5. Area 6. Volume, Density, <mark>Saturation</mark> 7. Color Hue <font color = "red"> ⬅ and this the least</font> <br> Using: defining a color scale with `scale_fill_distiller(palette = "Reds")`. ] .right-plot[ ![](2019-math-camp-R_all-days_files/figure-html/unnamed-chunk-121-1.svg)<!-- --> ] --- .left-code[ <br> <br> <br> 1. Position along a common scale <font color = "red"> ⬅ the human eye can discern this most accurately </font> 2. Position along identical, nonaligned scales 3. <mark>Length</mark> 4. Angle or Slope 5. Area 6. Volume, Density, Saturation 7. Color Hue <font color = "red"> ⬅ and this the least</font> <br> Using: `geom_segment()` ] .right-plot[ ![](2019-math-camp-R_all-days_files/figure-html/unnamed-chunk-122-1.svg)<!-- --> ] --- .left-code[ <br> <br> <br> 1. <mark>Position along a common scale</mark> <font color = "red"> ⬅ the human eye can discern this most accurately </font> 2. Position along identical, nonaligned scales 3. Length 4. Angle or Slope 5. Area 6. Volume, Density, Saturation 7. Color Hue <font color = "red"> ⬅ and this the least</font> <br> Using: `geom_col()` ] .right-plot[ ![](2019-math-camp-R_all-days_files/figure-html/unnamed-chunk-123-1.svg)<!-- --> ] --- .left-code[ <br> <br> <br> 1. <mark>Position along a common scale</mark> <font color = "red"> ⬅ the human eye can discern this most accurately </font> 2. Position along identical, nonaligned scales 3. Length 4. Angle or Slope 5. Area 6. Volume, Density, Saturation 7. Color Hue <font color = "red"> ⬅ and this the least</font> <br> Using: `geom_point()` ] .right-plot[ ![](2019-math-camp-R_all-days_files/figure-html/unnamed-chunk-124-1.svg)<!-- --> ] --- .left-code[ <br> <br> <br> 1. <mark>Position along a common scale</mark> <font color = "red"> ⬅ the human eye can discern this most accurately </font> 2. Position along identical, nonaligned scales 3. Length 4. Angle or Slope 5. Area 6. Volume, Density, Saturation 7. 
Color Hue <font color = "red"> ⬅ and this the least</font> <br> Using: `geom_point()` ] .right-plot[ ![](2019-math-camp-R_all-days_files/figure-html/unnamed-chunk-125-1.svg)<!-- --> ] --- .left-code[ <br> <br> <br> 1. <mark>Position along a common scale</mark> <font color = "red"> ⬅ the human eye can discern this most accurately </font> 2. Position along identical, nonaligned scales 3. Length 4. Angle or Slope 5. Area 6. Volume, Density, Saturation 7. Color Hue<font color = "red"> ⬅ and this the least</font> <br> Using: `geom_point()` and `geom_errorbar()` ] .right-plot[ ![](2019-math-camp-R_all-days_files/figure-html/unnamed-chunk-126-1.svg)<!-- --> ] --- .left-code[ <br> <br> <br> 1. <mark>Position along a common scale</mark> <font color = "red"> ⬅ the human eye can discern this most accurately </font> 2. Position along identical, nonaligned scales 3. Length 4. Angle or Slope 5. Area 6. Volume, Density, Saturation 7. Color Hue <font color = "red"> ⬅ and this the least</font> <br> Using: `geom_point()` and `geom_errorbar()` ] .right-plot[ ![](2019-math-camp-R_all-days_files/figure-html/unnamed-chunk-127-1.svg)<!-- --> ] --- # 1. Show the key message as a comparison of positions on a common scale <img src="images/how-humans-see.png" width="100%" style="display: block; margin: auto;" /> --- # 2. Maximize the Data-to-Ink ratio (i.e. proactively remove superficial ink) <img src="images/cwilke_table-examples-1.png" width="75%" style="display: block; margin: auto;" /> --- # 3. Avoid excessive overlap (instead, use small multiples) .pull-left[ ## bad <img src="images/cwilke_popgrowth-vs-popsize-colored-1.png" width="80%" /> ] .pull-right[ ## bad <img src="images/papaioannou21junefig4.png" width="789" /> ] --- **Solution**: In ggplot, "facet" your graphs by `facet_grid()` (a rigid table) or `facet_wrap()` (wrap long lines). 
```r
ggplot(weo_long, aes(x = year, y = (rgdp/pop), group = country)) +
* facet_wrap(~ continent) + # formula syntax
  geom_line() +
  labs(x = "Year", y = "GDP per capita")
```

<img src="images/country_facet.png" width="75%" />

---

class: center, middle

<mark>Try Problem 7 part (1) </mark>

---

class: center, middle
name: day03-technicalities

# Important Technicalities

---

# 1. Clear and concise axis titles

## Always label your axes, but don't make the labels verbose either

.pull-left[
<img src="images/cwilke_tech-stocks-minimal-labeling-bad-1.png" width="1536" />
]

.pull-right[
<img src="images/cwilke_tech-stocks-labeling-ugly-1.png" width="1536" />
]

---

# 1. Clear and concise axis titles

.pull-left[
<img src="images/cwilke_tech-stocks-minimal-labeling-1.png" width="1536" />
]

.pull-right[
pseudo-code:

```r
ggplot(stocks, aes(x = year, y = price, group = company, color = company)) +
  geom_line() +
* labs(x = "",
*      y = "stock price, indexed",
*      color = "")
```
]

---

# 2. Reasonable font and image sizes

## The default image size is almost always too large, making the labels illegible

.pull-left[
<img src="images/cwilke_Aus-athletes-small-1.png" width="1536" />
]

.pull-right[
<img src="images/cwilke_Aus-athletes-big-good-1.png" width="1536" />
]

--

**Solution**: save the images at a _smaller size_ (I recommend 3 by 5 inches; some recommend smaller).

```r
ggsave("figures/figure_rightsize.pdf", `height = 3`, `width = 5`)
```

---

# Also: Use vector graphics where possible

<img src="images/cwilke_bitmap-zoom-1.png" width="100%" />

Note: Panel (b) zooms into the figure in panel (a) when the graph is a raster graphic (like `jpeg` or `png`). Panel (c) zooms into the same area when the graph is a vector graphic (like `pdf` or `svg`).

---

# 3. Reader-friendly legends or annotation

## Having to go back and forth between data and legend is not reader-friendly.
.pull-left[
<img src="images/cwilke_tech-stocks-bad-legend-1.png" width="1536" />
]

--

.pull-right[
<img src="images/cwilke_tech-stocks-good-legend-1.png" width="1536" />
]

**Solution**: relevel your factors (`fct_relevel`) so the levels align with the data, or ...

---

# Whenever possible, design your figures so they don’t need a legend.

.pull-left[
<img src="images/cwilke_tech-stocks-good-no-legend-1.png" width="1536" />
]

.pull-right[
pseudo-code:

```r
ggplot(stocks, aes(x = year, y = price, group = company, color = company)) +
  geom_line() +
  annotate("text", x = 2018, y = 550, label = "Facebook") +
  annotate("text", x = 2018, y = 340, label = "Alphabet") +
  annotate("text", x = 2018, y = 250, label = "Microsoft") +
  annotate("text", x = 2018, y = 195, label = "Apple") +
  labs(x = "", y = "stock price, indexed") +
  theme(legend.position = "none") # the direct labels replace the legend
```
]

---

# 4. Default colors can be indistinguishable in grayscale or for colorblind readers

.pull-left[
<img src="images/cwilke_iris-scatter-one-shape-1.png" width="1536" />
]

--

.pull-right[
<img src="images/cwilke_iris-scatter-one-shape-cvd-1.png" width="2011" />
]

--

<b>Solution</b>:

* The viridis color scale (`scale_color_viridis_d` or `scale_color_viridis_c`) is a good default.
* You can also manually assign colors for discrete variables (`scale_color_manual`)

---

layout: true

<div class="my-footer"><span>Day 4</span></div>

---

class: middle
name: day4

.pull-left[
# Day 4 Agenda

## Workflow best practices

## RStudio and R Desktop

## Review of key functions via cheatsheet
]

.small-note[
**Notes:**

\* Instead of the original plan, we will _not_ cover Rmarkdown in math camp and instead consolidate what we've learned so far. This prepares us well for Rmarkdown down the road. For those feeling ambitious, <a href = "https://r4ds.had.co.nz/r-markdown.html"><i>R for Data Science Chapter 27</i></a> is a good introduction to Rmarkdown.

\* References for today's material: <a href = "https://r4ds.had.co.nz"><i>R for Data Science</i></a> chs.
<a href = "https://r4ds.had.co.nz/workflow-scripts.html">6</a> and <a href = "https://r4ds.had.co.nz/workflow-projects.html">8</a>; <a href = "https://whattheyforgot.org/"><i>What they forgot to teach you about R</i></a> ch. <a href = "https://whattheyforgot.org/save-source.html">1</a>; and <a href = "https://web.stanford.edu/~gentzkow/research/CodeAndData.pdf"><i>Code and Data for the Social Sciences</i></a> chs. 1 and 7. ] --- class: middle name: day04-workflow # Workflow .small-note[ Prepare by opening today's rstudio.cloud project and opening `sample_cleaned_script.R` in the `example_scripts` folder ] --- .pull-left[ # What does good R code look like? .large[Think of your R script like a <mark>recipe</mark>] Good recipes ... ] --- .pull-left[ # What does good R code look like? .large[Think of your R script like a <mark>recipe</mark>] Good recipes ... .narrow-space[ ☑ Readable to your collaborators ☑ Readable to your future self ☑ Clear (minimizes ambiguity) ☑ Ordered chronologically ☑ Follows convention ] <img src="images/jg-sheet-pan-pancakes-articleLarge.jpg" width="80%" /> ] .pull-right[ <img src="images/Rworkflow_recipe.png" width="100%" /> ] --- # First, load packages (but don't keep installation code) .pull-left[ <img src="images/Rworkflow_library.png" width="1045" /> ] --- # Read in data at the top, save output at the end .pull-left[ <img src="images/Rworkflow_load.png" width="1045" /> ] .pull-right[ <img src="images/Rworkflow_save.png" width="1043" /> ] --- # Reorder and re-write frequently into a concise, top-to-bottom ordering * Remove unnecessary "checks" (like `View(dataset)`, `?ggplot`) -- * Section your code with comment lines, but be concise ```r # Load data --------------------------- # Plot data --------------------------- ``` -- * Better to have multiple short scripts than one huge script 00_download.R 01_explore.R ... 09_model.R 10_visualize.R -- **Why?** Because you want to be re-running your entire code a lot ... 
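---

# Sketch: a sectioned, top-to-bottom script

A minimal sketch of the conventions above, not a required template. It uses only base R and the built-in `mtcars` data so that it runs on a clean session. The object names (`cars`, `mpg_by_cyl`, `out_file`) are hypothetical, and a real script would read from `data/input` at the top and save to `data/output` at the end.

```r
# Load data ---------------------------
cars <- mtcars                  # stand-in for reading a file from data/input

# Clean data --------------------------
cars$cyl <- factor(cars$cyl)    # treat the number of cylinders as a category

# Summarize data ----------------------
mpg_by_cyl <- aggregate(mpg ~ cyl, data = cars, FUN = median)

# Save output -------------------------
# a temporary folder is used here for illustration only
out_file <- file.path(tempdir(), "mpg_by_cyl.csv")
write.csv(mpg_by_cyl, out_file, row.names = FALSE)
```

Nothing in this file depends on objects left over from an earlier session, so it can be re-run from start to finish at any time.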
--- # Restart and Re-run (everything) frequently, as you write code .pull-left[ ## 1. Restart from a clean slate * Go to top toolbar, `Session` > `Restart R`, or * ⇧ (shift) + ⌘ (command/control) + `0` (zero) This will remove * All objects in the Environment * All loaded packages * All user-created functions <mark> Try this in sample_cleaned_script.R</mark> <br> ## \* Warning for old R users Do **not** put `rm(list = ls())` at the beginning of your code. This only removes objects. ] -- .pull-right[ ## 2. Run all ("source") - and break on first error * Toolbar `Code` > `Source`, or * ⇧ (shift) + ⌘ (command/control) + `s` (the letter S, for source) ## \* or run all and show output * Toolbar `Code` > `Run Region` > `Run All`, or * ⇧ (shift) + ⌘ (command/control) + ⏎ (enter) ## \* or re-run only up to this line * Toolbar `Code` > `Run Region` > `Run All Chunks Above`, or * ⌥ (option/alt) + ⌘ (command/control) + `b` (the letter B, for begin) .small[.right[\* For more shortcuts, see `Tools` > `Keyboard Shortcut Help`]] ] --- class: center, middle <mark>Exercise: fix this example bad code so it runs on a blank slate.</mark> <img src="images/Rworkflow_exercise-badscript.png" width="40%" /> --- # File paths are a way to define the location of a file * **"Directory"** ≈ "Folder". "Working Directory" = The current folder * Paths are hierarchical: parent → child , separated by a **forward slash** `/` (`project/data/input`) * **File extensions** are separated by a period `.` (`kuriwaki_shiro.R`, `scatterplot.pdf`) There are important special characters: * **Root**: Starting with `/` (or `C:\` in Windows) means the top of computer. `/cloud/project` * **Home**: Starting with a tilde `~/` means the default home directory -- varies by project. (`~/data/input`) * <mark>Relative paths</mark>: Starting with no symbol means "wherever we are now". 
(`data/input`) * A period `./` also means, wherever we are now (`./data/input`) * **Parent folder**: <u>two</u> periods move to parent folder (go back up one): `../projects/data/input` .small[\* You almost never call the root directory, and rarely call the home directory. Instead, do things in relative paths. Fast forward to <a href="#rstudio-project">RStudio Projects</a>] So where is the working directory? * It depends, but in RStudio, the working directory defaults to the project directory, which is wherever the `.Rproj` file is. * This is not permanent; you can modify the working directory any time. --- # Navigating paths - the basics Shell (or ba<u>sh</u>, command-line) is a language to navigate folders outside of R * `pwd ` is -- show me the <u>p</u>resent (current) working directory (`getwd()` in R) * `cd ` is -- <u>c</u>hange the directory to the following path (`setwd()` in R) --- # Exercise your path reading skills <mark>1. How would you read the following paths?</mark> <br> ```r weo <- read_excel("data/input/WEO-2018.xlsx") ``` ```r weo <- read_csv("~/Dropbox/API-209/Problem Sets/ps01/data/input/WEO-2018.xlsx") ``` ```r ggsave("figures/percapita_growth_figure.pdf") ``` -- <br> <mark> 2. Why would defining everything in absolute paths be a bad idea?</mark> -- 1. Too long 2. Not sharable: `setwd("path/that/only/works/on/my/machine")` --- class: center, middle name: day04-download-desktop # Download R Desktop (in preparation for tomorrow) --- # Download for Tomorrow: <u>Newest</u> Versions of R Desktop and RStudio Desktop .pull-left[ ## R is the software 1. Google "download R" (or .small[<http://www.r-project.org/>]) 2. Go to "download" R, and pick the first mirror <img src="images/Rdownload_homepage.png" width="80%" /> 3. Download the installer for your Operating System (Mac or Windows) <img src="images/Rdownload_mirror.png" width="80%" /> ] -- .pull-right[ ## RStudio is the IDE and GUI 1. 
Google "download RStudio" (or .small[<http://www.rstudio.com/products/rstudio/download>]) 2. Download the Installer for Mac/Windows <img src="images/Rdownload_rstudio.png" width="80%" /> .narrow-space[.small[ IDE: Integrated Development Environment. GUI: Graphical User Interface ]] ] .small[Note: We will put all of this to action on Tuesday.] --- class: middle, center name: day04-cheatsheet # Wrap-up and review of key commands --- # Are you familiar with the functions on the cheatsheet? Follow along on the cheat-sheet: <http://bit.ly/HKS-R> .left-code[ 1. Object assignment 2. dplyr verbs 3. ggplot layers 4. Base-R vectors and data-frames 5. Summary functions 6. Input / Output 7. Combining datasets (new topic) ] .right-wide[ We didn't get to cover these cheatsheet functions which come up fairly often -- * `::` - call function directly from package * `unique`, `n_distinct`: unique values * `summary`: generic function, output varies by object * `str`: general structure * `geom_histogram()`, `geom_bar()`: geoms that involve aggregation * `fill` vs. `color` aesthetics: fill is the inside of a polygon; color is the border line * `write_csv`: saving to a spreadsheet * `bind_rows`: stack rows * `left_join`, `inner_join`: merge multiple datasets ] --- layout: true <div class="my-footer"><span>Day 5</span></div> --- class: middle name: day5 # Day 5 Agenda ## Discuss presentations ## R Desktop ## Getting help effectively <br> <br> .small-note[ **Notes:** \* References for today's material: <a href = "https://whattheyforgot.org/"><i>What they forgot to teach you about R</i></a> ch. <a href = "https://whattheyforgot.org/save-source.html">1</a> and <a href = "https://whattheyforgot.org/project-oriented-workflow.html">2</a>; <a href = "https://rstudio-education.github.io/hopr"><i>Hands-on Programming with R</i></a> ch. <a href = "https://rstudio-education.github.io/hopr/starting.html">A</a>. 
]

---

class: center, middle

# Presentations

---

class: center, middle
name: day04-desktop

# Getting to work with R Desktop

---

# Step 1: Create an API-209 folder

Choose a directory to put all your work in (e.g. `~/Dropbox` / `C:\Dropbox`, or `~/Documents`)

--

⮑ under it, create an `API-209` folder (avoid spaces in your folder names)

⮑ under that, create a `Problem-Sets` folder

⮑ under that, create folders `PS-01`, `PS-02`, .small[(adding a 0 makes it easy to sort when you get to `PS-10`)]

⮑ for each problem set folder, create a `data` folder with subfolders `input` and `output`

.small-note[
\* You can deviate from this later, but these are well-established practices worth trying first. It's better to try out an established convention than to re-invent the wheel. See e.g. <a href="http://web.stanford.edu/~gentzkow/research/CodeAndData.pdf">"Code and Data for the Social Sciences: A Practitioner’s Guide"</a> and <a href="https://doi.org/10.1371/journal.pcbi.1005510">"Good enough practices in scientific computing"</a>
]

---

# Step 2: Open RStudio from Applications

<img src="images/Rstudio-emptyproject.png" width="75%" style="display: block; margin: auto;" />

---

# Step 2(a): Change global options so restarting entails a clean slate

.pull-left[
* Change your Global settings to <u>not</u> save data
* To do this, go to Toolbar `Tools` > `Global Options` and change the sections highlighted with red arrows as shown here:
]

.pull-right[
<img src="images/rstudio-workspace.png" width="80%" />
]

---

# Step 2(b): Install key packages

<br>
<br>

You only need to do this once, so do this in the Console without any script

```r
install.packages(c("tidyverse", "readxl", "haven"))
```

You can install and update packages later.
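
If you are not sure what is already installed, one possible check is sketched below. It uses only base R; the helper name `missing_packages` is hypothetical and not part of any package.

```r
# Return the subset of `pkgs` that is not yet installed in this R library
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

to_install <- missing_packages(c("tidyverse", "readxl", "haven"))
to_install   # an empty character vector if everything is already installed

# Then, in the Console (not in a script):
# if (length(to_install) > 0) install.packages(to_install)
```

To update packages that are already installed, base R provides `update.packages()`.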
--- # Step 3: Create a New Project * Toolbar `File` > `New Project` or the icon: <img / src = "images/project-add-icon.png"/> .pull-left[ <img src="images/rstudio-project-existing.png" width="80%" /> ] .pull-right[ <img src="images/rstudio-project-path.png" width="80%" /> ] --- name:rstudio-project # Orient yourself - where is the working directory? 1. The <u>project</u> directory is wherever the `.Rproj` file is 2. Rstudio Projects are created so that the <mark>working directory</mark> defaults to the <u>project directory</u> <img src="images/rstudio-ps-01.png" width="70%" style="display: block; margin: auto;" /> --- # Practice Project-Oriented Workflow <br> * .slarge[Create an RStudio project <u>for each data analysis project</u>.] -- * .slarge[Keep <u>data files</u> there; we’ll talk about loading them into R in data import.] * .slarge[Keep <u>scripts</u> there; edit them, run them in bits or as a whole.] * .slarge[Save your <u>outputs</u> (plots and cleaned data) there.] * .slarge[Only ever use <u>relative paths</u>, not absolute paths] .small-note[ \* Taken from <a href="https://r4ds.had.co.nz/workflow-projects.html">R for Data Science Chapter 8</a> ] --- # Going beyond defaults .pull-left[ ## setwd() * You can change working directories by the R command `setwd()` * You can change it back by Toolbar: `Session` > `Set Working Directory` > `To Project Directory` ] -- .pull-right[ ## Your script location is independent from the project <img src="images/project-mixup.png" width="75%" style="display: block; margin: auto;" /> .small-note[\* If you open a script for problem set 1 in the problem set 2 RStudio Project, the working directory still defaults to `Problem-Sets/PS-02`] ] --- # Leverage the features of the RStudio IDE .pull-left[ ## Technically, R is only the Console <img src="images/R-console.png" width="75%" style="display: block; margin: auto;" /> ] .pull-right[ ## RStudio is a GUI - access R _via_ RStudio <img src="images/Rstudio-emptyproject.png" 
width="80%" />

## As an IDE, RStudio includes more tools

* <a href = "https://r4ds.had.co.nz/r-markdown.html">Rmarkdown</a>
* Shiny apps (interactive web apps)
* Terminal (command-line tools)
* Git (version control)
]

---

# Step 4: Test your Assignment

<br>
<br>

Assignment for end of today (mandatory)

> Take the R script you created for your summer assignment, and adapt it so that it is both (a) <u>correct</u> and (b) <u>replicable</u> in a project. Create a project specific to the summer assignment.

Materials on Canvas, <a href= "https://canvas.harvard.edu/courses/62068/assignments/303684"> R day 5 exercise</a>

---

# Running R for Problem Sets

.pull-left[
## Preparation

1. Create a designated folder (e.g. `PS-04`)
2. Create subfolders for data, figures, etc.
3. Download problem set and data files
4. Create an <u>RStudio Project</u> for that project folder
]

--

.pull-right[
## Analysis

API-209 problem sets will generally be in a fillable Word doc.

<img src="images/problemset-look.png" width="80%" />
]

---

class: center, middle
name: day05-getting-help

# Final Tips for your own R experience

---

# Take advantage of the R community

.pull-left[
## Resources

* **Class instructors and resources**

Also helpful:

* Stackoverflow (.small[<https://stackoverflow.com/>])
* Community boards (e.g. .small[<http://community.rstudio.com>])
* Twitter?

<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">What <a href="https://twitter.com/hashtag/rstats?src=hash&ref_src=twsrc%5Etfw">#rstats</a> tricks did it take you way too long to learn? One of mine is using readRDS and saveRDS instead of repeatedly loading from CSV</p>— Emily Riederer (@EmilyRiederer) <a href="https://twitter.com/EmilyRiederer/status/898735640031920129?ref_src=twsrc%5Etfw">August 19, 2017</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
]

--

.pull-right[
## Stuck?
Help us help you by making your errors reproducible * Code and/or Screenshots * Or even better: `reprex` ```r install.packages("reprex") library(reprex) ``` Copy the problematic code to your clipboard, ```r library(forcats) data(gss_cat) xtabs(rincome + race, gss_cat) ``` and separately run `reprex()` ] --- class: center, middle # Thanks! Please be sure to send us any feedback in the end of math camp survey.