“Saying [social science researchers] should spend more time thinking about the way they write code would be like telling a novelist that she should spend more time thinking about how best to use Microsoft Word. …. This manual began with a growing sense that our own version of this self-taught seat-of-the-pants approach to computing was hitting its limits.” — Gentzkow and Shapiro (2014)
This memo outlines one way to organize a data analysis-oriented project in policy research or social sciences – an area that is increasingly data-driven, code-driven, and collaborative. Many of the principles I describe are taken verbatim from the write-up by economists Matthew Gentzkow and Jesse Shapiro, “Code and Data for the Social Sciences: A Practitioner’s Guide”, which I recommend.
Learning new tools and new ways of writing code may sound like an issue of taste – something that just makes things look nicer (as the opening quote suggests). But good organization and the use of appropriate tools improve the quality of your work and, more importantly, reduce the number of errors you make.
All the more so given recent trends: projects now involve more data, more code, and more collaborators. With these challenges come demands for new tools. Learning these tricks is an investment, but a little training on well-tested best practices goes a long way.
The recommendations by Gentzkow and Shapiro are central, and worth re-posting:
- Automate everything that can be automated.
- Write a single script that executes all code from beginning to end.
- Store code and data under version control.
- Run the whole directory before checking it back in.
- Separate directories by function.
- Separate files into inputs and outputs.
- Make directories portable.
- Store cleaned data in tables with unique, non-missing keys.
- Keep data normalized as far into your code pipeline as you can.
- Abstract to eliminate redundancy.
- Abstract to improve clarity.
- Otherwise, don’t abstract.
- Don’t write documentation you will not maintain.
- Code should be self-documenting.
- Manage tasks with a task management system.
- E-mail is not a task management system.
- Keep it short and purposeful.
- Make your functions shy.
- Order your functions for linear reading.
- Use descriptive names.
- Pay special attention to coding algebra.
- Make logical switches intuitive.
- Be consistent.
- Check for errors.
- Write tests.
- Profile slow code relentlessly.
- Store “too much” output from slow code.
- Separate slow code from fast code.
Below is a version of the above list, tailored to those whose main job is not research but who run lighter-weight data analysis projects.
Principles
1. Automation
- Automate everything that can be automated.
- Each script should execute all code from beginning to end.
- Embed figures and tables to auto-update in your document.
2. Abstraction
- Abstract to eliminate redundancy.
- Abstract to improve clarity.
- Otherwise, don’t abstract.
3. Self-Documentation
- Don’t write documentation you will not maintain.
- Code should be self-documenting.
Organization
4. Directory Structure
- Make a shared project directory
- Separate directories by function
- Separate files into inputs and outputs
- Separate and number scripts within directories by function
- Make directories portable
5. Datasets
- Store cleaned data in tables with unique, non-missing keys
6. Task and Version Management
- Manage tasks with a task management system.
- E-mail is not the best task management system.
- Consider using a version control system
- Consider going all open-source.
Production
7. Software for Writing
- Use an editor you have the most control over (consider non-WYSIWYG)
8. Figures and Tables
- One message per figure/table
- Maximize the data-ink ratio
- Drop unnecessary digits
- Modify aspect-ratios of figures
9. Code style
- Use descriptive names
- Don’t repeat code more than twice
- Keep it short and purposeful
- Make your functions shy
- Be consistent
- Write tests
10. Typesetting
- Use generous margins
- Use a consistent font
Below, I go through each of the main topics.
- Automate everything that can be automated.
- Each script should execute all code from beginning to end.
- Embed figures and tables to auto-update in your document.
You will be revising your analysis and write-up many more times than you anticipate. That’s why short and automated code is essential.
For example, the lines of code below load the necessary packages, read in data from Google Sheets, format it, and save a figure at a printable size.
library(tidyverse)
library(googlesheets)

# read data (here from google sheets)
gapminder_sheet <- gs_gap()
gap_africa <- gs_read(gapminder_sheet, ws = "Africa")

# format: add log GDP per capita
gap_africa <- mutate(gap_africa, log_gdppc = log(gdpPercap))

# plot life expectancy against log GDP per capita, one panel per year,
# labeling Rwanda
ggplot(gap_africa, aes(x = log_gdppc, y = lifeExp, size = pop)) +
  facet_wrap(~year, nrow = 2) +
  geom_point(alpha = 0.7) +
  geom_label(data = filter(gap_africa, country %in% "Rwanda"), aes(label = country)) +
  guides(size = FALSE) +
  labs(x = "log GDP per Capita", y = "Average Life Expectancy")

# save the figure at a printable size
ggsave("figures/gapminder_africa.pdf", width = 8, height = 4)
Then, a typesetting tool like R Markdown or LaTeX compiles a plain-text file that embeds this figure.
---
title: "Project Report"
author: "Shiro Kuriwaki"
date: "`r Sys.Date()`"
output: pdf_document
---
## Introduction
The relationship between a country's economy and health outcomes is policy-relevant.
## Data
This report looks at 50 years' worth of GDP and health data in a panel of African countries.
## Results
![Rwanda's Civil War led to a substantial drop in its life expectancy](figures/gapminder_africa.pdf)
See the Software for Writing section below for how this strip of text and symbols produces a nice document.
- Abstract to eliminate redundancy.
- Abstract to improve clarity.
- Otherwise, don’t abstract.
It is also normal to try something out and then run another version of it. Suppose that, in analyzing Brazilian municipality data, we want to generate a new variable that computes the difference between a municipality's literacy rate and its state average. In Stata you would try
egen mean_literate91 = mean(literate91), by(state)
generate relative_lit_state = mean_literate91 - literate91
and then look at a histogram:
histogram relative_lit_state
Intrigued, suppose you want to do the same type of calculation but with a different variable, the poverty rate. A redundant but common next step is to copy-paste the code and carefully replace the relevant parts:
egen mean_poverty91 = mean(poverty91), by(state)
generate relative_pov_state = mean_poverty91 - poverty91
histogram relative_pov_state
What about doing the same thing not by state, but by region? And so on and so on... you end up with lines of code that are redundant in terms of the key operation (compute group means, take the difference from the mean); only the variables have changed. After a while, you get tired of copy-pasting and tediously editing, so you stop analyzing the data and move on. Even worse, you make a copy-paste error and forget to swap a variable.
Here's where abstraction, by defining your own function, becomes important:
program hist_relative_rate
    syntax, invar(varname) outvar(name) byvar(varname)
    tempvar mean_invar
    egen `mean_invar' = mean(`invar'), by(`byvar')
    gen `outvar' = `mean_invar' - `invar'
    histogram `outvar'
end
Now we can do the same task with only one line of code!
Moreover, repeating the same procedure with different variables requires only as many lines as there are specifications, rather than writing the extra egen and generate lines each time:
hist_relative_rate, invar(poverty80) outvar(relative_pov80_state) byvar(state)
hist_relative_rate, invar(poverty91) outvar(relative_pov91_state) byvar(state)
hist_relative_rate, invar(poverty80) outvar(relative_pov80_region) byvar(region)
Not only is this fewer keystrokes, it is much more readable. It is clear what the function hist_relative_rate is essentially doing, whereas parsing the three underlying lines each time takes more attention than most readers have.
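For those working in R rather than Stata, a similar abstraction might look like the following sketch, using dplyr's tidy evaluation; the data frame mun_data and its variables are hypothetical, mirroring the Stata example.

library(tidyverse)

# A sketch: compute the by-group mean of a variable, take its difference
# from each value, and plot a histogram of the result.
hist_relative_rate <- function(data, invar, byvar) {
  data %>%
    group_by({{ byvar }}) %>%
    mutate(relative = mean({{ invar }}, na.rm = TRUE) - {{ invar }}) %>%
    ungroup() %>%
    ggplot(aes(x = relative)) +
    geom_histogram()
}

# One line per specification, as before (mun_data is a hypothetical data frame):
# hist_relative_rate(mun_data, poverty91, state)
# hist_relative_rate(mun_data, poverty80, region)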
Abstraction and automation are also the most effective ways to avoid the bugs and errors that plague even high-profile studies (for example, the errors surrounding work by Piketty and by Reinhart and Rogoff). “Hard-coding” refers to the practice of entering numbers directly in code instead of letting them vary, automatically and abstractly, with the input. Hard-coding important numbers invites typos and systematic errors when something upstream in the analysis pipeline changes.
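As a quick illustration in R (a sketch reusing the gap_africa data from the earlier example; the number 52 is only there to show the hard-coded pattern):

# Hard-coded: a typo here, or an upstream change to the data, goes unnoticed
n_countries <- 52

# Computed from the input: updates automatically when the data change
n_countries <- dplyr::n_distinct(gap_africa$country)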
We'll see next that adding comments to patch over such code is also not a great idea, because that documentation quickly becomes outdated.
- Don’t write documentation you will not maintain.
- Code should be self-documenting.
A common misconception is that one needs to annotate (a.k.a. “comment”) code extensively to be helpful. While this is good in theory, it almost always falls through in practice. Analysts and programmers are human, after all – we fix typos in place without updating the documentation, we make changes on a whim and forget to come back to them, we copy and paste code from elsewhere along with its irrelevant comments, and so on. The lesson is to write code that anticipates these mistakes, rather than setting yourself up for unrealistic fastidiousness.
The Gentzkow and Shapiro example is excellent. What’s wrong with this Stata code?
* Elasticity = Percent Change in Quantity / Percent Change in Price
* Elasticity = 0.4 / 0.2 = 2
* See Shapiro (2005), The Economics of Potato Chips,
* Harvard University Mimeo, Table 2A.
compute_welfare_loss, elasticity(2)
These four lines of notes are informative, but a few days later, after multiple stages of editing, you're bound to see something like this:
* Elasticity = Percent Change in Quantity / Percent Change in Price
* Elasticity = 0.4 / 0.2 = 2
* See Shapiro (2005), The Economics of Potato Chips,
* Harvard University Mimeo, Table 2A.
compute_welfare_loss, elasticity(3)
Notice that the elasticity in the notes is inconsistent with the value in the code. Changes like this happen all the time, but after a couple of days you won't remember whether your comment or your code is the newer version, and you are stuck. A better approach:
* See Shapiro (2005), The Economics of Potato Chips,
* Harvard University Mimeo, Table 2A.
local percent_change_in_quantity = -0.4
local percent_change_in_price = 0.2
local elasticity = `percent_change_in_quantity'/`percent_change_in_price'
compute_welfare_loss, elasticity(`elasticity')
This is much better, because “it has far less scope for internal inconsistency. You can’t change the percent change in quantity without also changing the elasticity, and you can’t get a different elasticity number with these percent changes.”
- Make a shared project directory
- Separate directories by function
- Separate files into inputs and outputs
- Separate and number scripts within directories by function
- Make directories portable
Directories are just a computer-engineering name for folders. Being told how to organize folders may seem pedantic at first, but this habit-forming practice is pretty consequential for all the work you do later. Especially if you are collaborating with others, using well-established practices saves you a lot of confusion and errors.
Copying directly from Gentzkow and Shapiro's suggested layout, notice that:
- build and analyze are directory names that are verbs, not nouns
- input should not be overwritten
- output should be reproducible by the code
What about within each directory? Long code (roughly 300 lines or more) can get hard to scroll through. It probably makes your life easier to split your code into separate parts, as sketched below (read, clean, deduplicate, merge, for example).
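For instance, a master script that runs the numbered pieces from beginning to end might look like this sketch in R (all file and folder names here are purely illustrative):

# run_all.R -- a hypothetical master script for the whole pipeline
source("build/code/01_read.R")
source("build/code/02_clean.R")
source("build/code/03_deduplicate.R")
source("build/code/04_merge.R")
source("analyze/code/05_figures.R")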
Whether or not to maintain a strict separation between build and analyze, even to the point of having a separate data sub-folder for each, is debatable. For example, see the directory for one of my projects, a screenshot of which is posted at the GitHub page linked in the Task and Version Management section below.
Breaking out of old habits is hard, but give this a try. The hope is that this will become natural as you use it.
- Store cleaned data in tables with unique, non-missing keys
Building your own dataset is an intermediate/advanced-level skill, but it is the bread and butter of data analysis in the type of work Gentzkow and Shapiro do. A lot of the power in data analysis comes from merging, and setting “keys” – unique identifiers – is the principle that makes this possible.
The concept of keys is also critical when discussing data because it is the database equivalent of asking what your unit of analysis is.
For example, in the gapminder dataset, what is the key?
## # A tibble: 624 x 7
## country continent year lifeExp pop gdpPercap log_gdppercap
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Algeria Africa 1952 43.1 9279525 2449. 7.80
## 2 Algeria Africa 1957 45.7 10270856 3014. 8.01
## 3 Algeria Africa 1962 48.3 11000948 2551. 7.84
## 4 Algeria Africa 1967 51.4 12760499 3247. 8.09
## 5 Algeria Africa 1972 54.5 14760787 4183. 8.34
## 6 Algeria Africa 1977 58.0 17152804 4910. 8.50
## 7 Algeria Africa 1982 61.4 20033753 5745. 8.66
## 8 Algeria Africa 1987 65.8 23254956 5681. 8.64
## 9 Algeria Africa 1992 67.7 26298373 5023. 8.52
## 10 Algeria Africa 1997 69.2 29072015 4797. 8.48
## # … with 614 more rows
Note that in this case you need a key of two variables to uniquely identify a row: country and year.
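A quick way to check this (a sketch, assuming the gap_africa tibble from the earlier example) is to confirm that no country-year combination appears more than once and that neither key column has missing values:

library(dplyr)

# Any country-year combination that appears more than once?
gap_africa %>%
  count(country, year) %>%
  filter(n > 1)   # zero rows means the key uniquely identifies each row

# Neither key column should contain missing values
stopifnot(!anyNA(gap_africa$country), !anyNA(gap_africa$year))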
Large corporate and government databases are organized in this way too – each entry in a table (which can be tens of millions of rows) is identifiable by a unique identifier. Databases like credit card transactions and voter files are stored on servers and queried by languages like SQL. The critical link that makes all this data sensible is the unique identifier.
- Manage tasks with a task management system.
- E-mail is not the best task management system.
- Consider using a version control system.
- Consider going all open-source.
“E-mail is not a task management system.” There are plenty of apps now that allow you to tag individuals with tasks, making individual responsibilities clear: for example, Dropbox, Google Docs, Google Tasks, Slack, Asana/Basecamp, and so on.
A version control system (typically Git, hosted on GitHub) is a great way to keep track of your files without being overloaded with versions and versions of confusingly named files. This may be an advanced topic for an introduction to data analysis, but tutorials like https://try.github.io/ are a good start, and learning as you go is a valuable skill.
Data analytics – both the datasets and the programs and functions that analyze them – is increasingly open source. Posting your own code and findings on the public web is daunting at first, but it has a couple of benefits. Primarily, you learn how to code and analyze data in the best possible way – by seeing how others more advanced than you write code for the same types of problems.
For example, see a version of the directory screenshot I posted at the GitHub page: https://github.com/kuriwaki/poll_error
- Use an editor you have the most control over (consider non-WYSIWYG)
WYSIWYG stands for “What You See Is What You Get”, and software like Microsoft Office and Google Docs falls in this category. A non-WYSIWYG editor requires you to control your formatting explicitly (boldface, font size, margin size, figure placement, captioning, and so on), often in plain text. Plain text is just what it sounds like – text files that have no formatting in them. Although they have different file extensions, .Rmd, .do, and .R files are all plain text.
A good, “lightweight” non-WYSIWYG language is Markdown. The best interface I know for using Markdown with data is R Markdown; that's how I made this document. To make my slides and notes, I also use a version of R Markdown with a more involved typesetting engine, TeX, underneath it.
- One message per figure/table
- Maximize the data-ink ratio
- Drop unnecessary digits
- Modify aspect-ratios of figures
As you saw in the Anscombe dataset problem, making visualizations is key to data analysis. In some disciplines it is not at all controversial to skim a paper or report by just reading the figures and tables, skipping the text.
Edward Tufte's books and workshops are classics in this area. A talk by Jean-Luc Doumont on making figures for slides that get the message across (and on presentations more generally) is also worth a watch (https://www.youtube.com/watch?v=meBXuTIPJQk).
John Rauser’s talk on the general patterns of how we see – and don’t see – visualizations is also clarifying (https://www.youtube.com/watch?v=fSgEeI2Xpdc).
The data-ink ratio principle is a good rule of thumb: don't waste ink on things that are not actual information.
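As a sketch of what that means in practice (reusing the gap_africa data from the earlier example), dropping the default grey background and the minor gridlines leaves more of the ink for the data itself:

library(tidyverse)

ggplot(gap_africa, aes(x = log_gdppc, y = lifeExp)) +
  geom_point(alpha = 0.7) +
  theme_minimal() +                            # drop the grey panel background
  theme(panel.grid.minor = element_blank()) +  # drop minor gridlines
  labs(x = "log GDP per capita", y = "Average life expectancy")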