- Congressional Representation: Accountability from the Constituent’s Perspective. (with Stephen Ansolabehere). American Journal of Political Science. [Summarized in the AJPS Blog, Replication Data]
The premise that constituents hold representatives accountable for their legislative decisions undergirds political theories of democracy and legal theories of statutory interpretation. But studies of this at the individual level are rare, examine only a handful of issues, and arrive at mixed results. We provide an extensive assessment of issue accountability at the individual level. We trace the congressional roll-call votes on 44 bills across seven Congresses (2006-2018), and link them to constituents' perceptions of their representative's votes and their evaluation of their representative. Correlational, instrumental variables, and experimental approaches all show that constituents hold representatives accountable. A one-standard-deviation increase in a constituent's perceived issue agreement with their representative can improve net approval by 35 percentage points. Congressional districts, however, are heterogeneous. Consequently, the effect of issue agreement on vote is much smaller at the district level, resolving an apparent discrepancy between micro and macro studies.
- Wealth, Slave Ownership, and Fighting for the Confederacy: An Empirical Study of the American Civil War. 2019. (with Andrew B. Hall and Connor Huff). American Political Science Review, 113 (3): 658-673. [Replication Data] [Covered by The Weeds podcast].
How did personal wealth and slaveownership affect the likelihood Southerners fought for the Confederate Army in the American Civil War? On the one hand, wealthy Southerners had incentives to free-ride on poorer Southerners and avoid fighting; on the other hand, wealthy Southerners were disproportionately slaveowners, and thus had more at stake in the outcome of the war. We assemble a dataset on roughly 3.9 million free citizens in the Confederacy and show that slaveowners were more likely to fight than non-slaveowners. We then exploit a randomized land lottery held in 1832 in Georgia. Households of lottery winners owned more slaves in 1850 and were more likely to have sons who fought in the Confederate Army. We conclude that slaveownership, in contrast to some other kinds of wealth, compelled Southerners to fight despite free-rider incentives because it raised their stakes in the war’s outcome.
Survey Statistics and Demography
- Unrepresentative Big Surveys Significantly Overestimate US Vaccine Uptake. (with Valerie C. Bradley, Michael Isakov, Dino Sejdinovic, Xiao-Li Meng, and Seth Flaxman). In Press. (previous arXiv version)
Surveys are a crucial tool for understanding public opinion and behavior, and their accuracy depends on maintaining statistical representativeness of their target populations by minimizing biases from all sources. Increasing data size shrinks confidence intervals but magnifies the impact of survey bias, an instance of the Big Data Paradox. Here we demonstrate this paradox in estimates of first-dose COVID-19 vaccine uptake in US adults from two large surveys: Delphi-Facebook (about 250,000 responses per week) and Census Household Pulse (about 75,000 per week). By May 2021, Delphi-Facebook overestimated uptake by 17 percentage points and Census Household Pulse by 14, compared to a benchmark from the Centers for Disease Control and Prevention (CDC). Moreover, their large data sizes led to minuscule margins of error on the incorrect estimates. In contrast, an Axios-Ipsos online panel with about 1,000 responses following survey research best practices provided reliable estimates and uncertainty. We decompose observed error using a recent analytic framework to explain the inaccuracy in the three surveys. We then analyze the implications for vaccine hesitancy and willingness. We show how a survey of 250,000 respondents can produce an estimate of the population mean that is no more accurate than an estimate from a simple random sample of size 10. Our central message is that data quality matters far more than data quantity, and compensating the former with the latter is a mathematically provable losing proposition.
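The "sample of size 10" comparison can be illustrated with the effective-sample-size approximation associated with the data defect correlation of Meng (2018), n_eff ≈ f / (1 - f) / ρ², where f = n / N is the sampling fraction. The sketch below is a minimal illustration under stated assumptions: the adult population size of 255 million is an assumed round number, the function names are hypothetical, and the correlation is back-solved for illustration rather than taken from the paper.

```python
import math


def effective_sample_size(n: int, N: int, ddc: float) -> float:
    """Approximate size of the simple random sample with the same MSE.

    Uses n_eff ~ f / (1 - f) / ddc**2 with f = n / N, where ddc is the
    data defect correlation between responding and the outcome.
    """
    if not 0 < n < N:
        raise ValueError("need 0 < n < N")
    if ddc == 0:
        raise ValueError("ddc must be nonzero")
    f = n / N
    return (f / (1 - f)) / ddc**2


def ddc_for_target(n: int, N: int, n_eff: float) -> float:
    """Magnitude of the data defect correlation implying a given n_eff."""
    f = n / N
    return math.sqrt((f / (1 - f)) / n_eff)


N_ADULTS = 255_000_000  # assumption: rough size of the US adult population
n = 250_000             # weekly responses, as stated in the abstract

# Back-solve for the correlation that makes 250,000 responses worth 10.
rho = ddc_for_target(n, N_ADULTS, n_eff=10)
print(f"implied data defect correlation: {rho:.4f}")
print(f"effective sample size: {effective_sample_size(n, N_ADULTS, rho):.1f}")
```

Under these assumptions, a correlation of roughly 0.01 between responding and vaccination status is enough to reduce 250,000 responses to the information in a simple random sample of 10.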
- The Use of Differential Privacy for Census Data and its Impact on Redistricting: The Case of the 2020 U.S. Census. (with Chris Kenny, Cory McCartan, Evan Rosenman, Tyler Simko, and Kosuke Imai). Science Advances. Originally a Public Comment to the Census Bureau (May 28, 2021). [FAQ, Reaction to the Bureau's Response (June 9, 2021).]
Census statistics play a key role in public policy decisions and social science research. Yet given the risk of revealing individual information, many statistical agencies are considering disclosure control methods based on differential privacy, which add noise to tabulated data. Unlike other applications of differential privacy, however, census statistics must be post-processed after noise injection to be usable. We study the impact of the US Census Bureau's new Disclosure Avoidance System (DAS) on a major application of census statistics: the redrawing of electoral districts. We find that the DAS systematically undercounts the population in mixed-race and mixed-partisan precincts, yielding unpredictable racial and partisan biases. The DAS also leads to a likely violation of the "One Person, One Vote" standard as currently interpreted, but does not prevent accurate predictions of an individual's race and ethnicity. Our findings underscore the difficulty of balancing accuracy and respondent privacy in the Census.
Selected Press Coverage
Covered by AP News, Washington Post, The Harvard Crimson, San Francisco Chronicle, Matthew Yglesias blog, Statistical Modeling (Andrew Gelman's blog) by Jessica Hullman (Part 1, Part 2)
- Towards Principled Unskewing: Viewing 2020 Election Polls Through a Corrective Lens from 2016. (with Michael Isakov). Harvard Data Science Review, 2.4 (pre - 2020 election special issue). [PDF version with post-election review] [Covered by The Harvard Crimson]
We apply the concept of the data defect index (Meng, 2018) to study the potential impact of systematic errors on the 2020 pre-election polls in twelve Presidential battleground states. We investigate the impact under the hypothetical scenarios that (1) the magnitude of the underlying non-response bias correlated with supporting Donald Trump is similar to that of the 2016 polls, (2) the pollsters' ability to correct systematic errors via weighting has not improved significantly, and (3) turnout levels remain similar to 2016. Because survey weights are crucial for our investigations but are often not released, we adopt two approximate methods under different modeling assumptions. Under these scenarios, which may be far from reality, our models shift Trump's estimated two-party voteshare by a percentage point in his favor in the median battleground state, and increase twofold the uncertainty around the voteshare estimate.
Education in Political Science
- The "Math Prefresher" and The Collective Future of Political Science Graduate Training. (with Gary King and Yon Soo Park). PS: Political Science and Politics, 54(3): 537-541.
The political science math prefresher arose a quarter-century ago and has now spread to many of our discipline’s PhD programs. Incoming students arrive for graduate school a few weeks early for ungraded instruction in math, statistics, and computer science as they relate to political science. The prefresher’s benefits, however, go beyond its technical content: it opens pathways to mastering methods necessary for political science research, facilitates connections among peers, and — perhaps most important — eases the transition to the increasingly collaborative nature of graduate work. The prefresher also shows how faculty across a highly diverse discipline have worked together to train the next generation. We review this program and advance its collaborative aspects by building infrastructure to share teaching content across universities so that separate programs can build on one another’s work and improve all of our programs.
Selected Working Papers
- Ticket Splitting in a Nationalized Era.
[Previously titled "Party Loyalty on the Long Ballot: Is Ticket Splitting More Prevalent in State and Local Elections?"]. [Covered by The Post and Courier, Governing, Colorado Public Radio News].
Many believe that party loyalty in U.S. elections has reached heights unprecedented in the post-war era, although this finding relies on evidence from presidential, congressional, and gubernatorial elections. If party labels are a heuristic, we would expect party-line voting to be even more dominant in lower-information elections. Yet, here I show that the prevalence of ticket splitting in state and local offices is often similar to or higher than in national offices because of larger incumbency advantages and starker candidate valence differentials. Because neither surveys nor election returns have been able to reliably measure individual vote choice in downballot races, I introduce an underused source of voter data: cast vote records. I create a database from voting machines that reveals the vote choices of 6.6 million voters for all offices on the long ballot, and I design a clustering algorithm tailored to such ballot data. In contrast to ticket splitting rates of 5 to 7 percent in U.S. House races, about 15 to 20 percent of voters split their ticket in a modal Sheriff race. Even in a nationalized politics, a fraction of voters still cross party lines to vote for the more experienced candidate in state and local elections.
- Synthetic Area Weighting for Measuring Public Opinion in Small Areas (with Soichiro Yamauchi). [Slides from Polmeth]
The comparison of subnational areas is ubiquitous but survey samples of these areas are often biased or prohibitively small. Researchers turn to methods such as multilevel regression and poststratification (MRP) to improve the efficiency of estimates by partially pooling data across areas via random effects. However, the random effect approach can pool observations only through area-level aggregates.
We instead propose a weighting estimator, the synthetic area estimator, which weights on variables measured only in the survey to partially pool observations individually. The proposed method consists of two-step weighting: first to adjust for differences across areas and then to adjust for differences between the sample and population. Unlike MRP, our estimator can directly use the national weights that are often estimated by pollsters using proprietary information. Our approach also clarifies the assumptions needed for valid partial pooling, without imposing an outcome model. We apply the proposed method to estimate the support for immigration policies at the congressional district level in Florida. Our empirical results show that small area estimation models with insufficient covariates can mask opinion heterogeneities across districts.
- A Clustering Approach for Characterizing Voter Types: An Application to High-Dimensional Ballot and Survey Data
Large-scale ballot and survey data hold the potential to uncover the prevalence of swing voters and strong partisans in the electorate. However, existing approaches either employ exploratory analyses that fail to fully leverage the information available in high-dimensional data, or impose a one-dimensional spatial voting model. I derive a clustering algorithm which better captures the probabilistic way in which theories of political behavior conceptualize the swing voter. Building from the canonical finite mixture model, I tailor the model to vote data, for example by allowing uncontested races. I apply this algorithm to actual ballots in the Florida 2000 election and a multi-state survey in 2018. In Palm Beach County, I find that up to 60 percent of voters were straight ticket voters; in the 2018 survey, even higher. The remaining groups of the electorate were likely to cross the party line and split their ticket, but not monolithically: swing voters were more likely to swing for state and local candidates and popular incumbents.
- Sparse Multilevel Regression (and Poststratification). (with Max Goplerud, Marc Ratkovic, and Dustin Tingley).
Multilevel models have long played an important role in a variety of social sciences. We extend this framework by bringing to bear recent developments in the machine learning literature to allow for considerable flexibility. We introduce a sparse regression framework that covers both the linear case as well as a logit model for binary outcome data. We leverage recent computational tricks based on data-augmentation to dramatically speed up estimation times with equal or better performance compared to existing approaches. We apply our model in the context of multilevel modelling with post-stratification which has become a common tool for survey researchers.
- The Seats Votes Curve for Issues: On the Aggregation of Issue Preferences. (with Stephen Ansolabehere)
The seats-votes curve is widely used to evaluate the fairness of electoral systems. Existing examples almost exclusively use election results to estimate the seats-votes curve. Instead of evaluating how votes for political parties translate into seats, we measure how the support for a particular policy translates into the proportion of districts whose majorities support that same policy, providing a similarly powerful measure of how an electoral system represents voters' issue preferences. Using twelve years of the Cooperative Congressional Election Study, we show the seats-votes curve equivalent for 75 key roll-call votes. We find that the seats-votes curve for issue preferences is remarkably majoritarian in both U.S. states and congressional districts, following the cubic polynomial. These findings suggest that the blame for representational failures of Congress, in which issues with majority support do not become law, does not lie solely in the districting system.
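To illustrate what a cubic, majoritarian seats-votes relationship implies, the sketch below uses the classic cube law, S / (1 - S) = (V / (1 - V))^3, where V is the share of people supporting a policy and S is the share of districts whose majorities support it. Reading the abstract's cubic form as this cube law is an assumption, and the function name and the `exponent` parameter are hypothetical, not taken from the paper.

```python
def seats_share(vote_share: float, exponent: float = 3.0) -> float:
    """Share of districts won under the cube law (exponent = 3).

    Solves S / (1 - S) = (V / (1 - V)) ** exponent for S, written in a
    form that stays defined at V = 0 and V = 1.
    """
    if not 0.0 <= vote_share <= 1.0:
        raise ValueError("vote_share must lie in [0, 1]")
    num = vote_share ** exponent
    den = num + (1.0 - vote_share) ** exponent
    return num / den


# Illustrative support levels, not estimates from the paper.
for v in (0.45, 0.50, 0.55, 0.60):
    print(f"support {v:.2f} -> district share {seats_share(v):.3f}")
```

Under the cube law, 55 percent support translates into majorities in about 65 percent of districts, which is the sense in which such a curve amplifies majorities.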
Book Project: Congressional Representation
(with Stephen Ansolabehere)
This book, tentatively titled Congressional Representation, argues that through all of the gridlock and the polarization that has plagued the government over the past three decades, the U.S. Congress remains a largely majoritarian institution. Congress acts in line with the majority of people more often than not. Building on 15 years of data on public preferences of more than 500,000 Americans, this study examines what voters know, what they care about when they vote, and how well their legislators and their Congress reflect their preferences. Representation is not a seamless or mechanical process, but it aggregates people's beliefs and preferences well on the important issues that face the country. Individual voters do not follow the details of congressional legislation but most know enough to hold correct beliefs about legislation and to hold their representatives accountable. For their part, legislators are highly responsive to the aggregate opinion of their districts. And, on important bills, Congress makes decisions in line with the majority of the nation. When representation fails, it is often the obstruction of one branch of government or one party.
I have taught classes on American Politics, Japanese Politics, statistics, and programming at the undergraduate, Masters, and PhD levels. I received the 2020 Dean's Excellence in Teaching Award at the Harvard Kennedy School of Public Policy for my teaching in econometrics and shepherding the use of the R statistical language in its core statistics sequence. This work included creating portable screencasts of R workflows, covering common topics in econometrics, causal inference, data science, and quantitative social science.
I am an RStudio certified trainer, and have created several resources for statistics and data science for the social sciences that I hope are useful for other students and instructors. These include a workshop I co-designed on training teachers in the social sciences for teaching statistics and programming, my presentations on project-oriented workflow (invited presentation, Toronto Data Workshop), introduction to version control with GitHub (source), introduction to Stata (source), and statistics notes covering Probability, Inference, and Regression written for a Masters-level statistics course (source).
- The Cast Vote Records project is collecting and organizing public ballot image logs to advance the understanding of voting patterns in federal, state, and local elections. One of the project's goals is to establish a relational database for such data. This project is one of the 2018 New Initiative Grants from the MIT Election Data and Science Lab. Please feel free to contact me with any questions about this project. For more information, see the associated empirical paper or the memo on the election administration of cast vote records.
- The Cumulative CCES Common Content (2006-2021) (downloaded on Dataverse over 13,000 times) is a part of the Cooperative Congressional Election Survey Dataverse. It combines all common content respondents of the CCES and harmonizes key variables, so that researchers can analyze all years of the CCES or merge in standardized variables with their own CCES datasets.
- The Candidates in American General Elections dataset (with Jeremiah Cha and James M. Snyder, Jr.) is a comprehensive list of winning and losing candidates in U.S. Congressional, Presidential, and Gubernatorial elections. Unlike official records or other datasets, we standardize candidate names across time and office, and record the incumbency of the candidates.
- Portable Routines for Preparing CCES and ACS data for MRP (ccesMRPprep) is a set of datasets and API interfaces to facilitate Multilevel Regression Poststratification (MRP), a survey weighting method for small area estimation. Other articles already provide helpful tutorials and code for MRP. But implementing an MRP entails considerable upfront costs related to data collection, cleaning, and standardization. This package provides these routines: not only modeling software, but code to import and standardize publicly available data sources, combined with detailed documentation about these data sources.
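For readers new to MRP, the poststratification step can be sketched as a population-weighted average of model predictions within each area. This is a minimal illustration in plain Python, not the interface of ccesMRPprep: the function name, the district labels, the cell counts, and the predicted shares are all hypothetical, and the multilevel regression that would produce the predictions is omitted.

```python
from collections import defaultdict


def poststratify(cells):
    """Population-weighted average of cell predictions within each area.

    `cells` is an iterable of (area, population_count, predicted_share)
    tuples, one per demographic cell (e.g. an age-by-education group).
    """
    totals = defaultdict(float)
    weighted = defaultdict(float)
    for area, count, predicted in cells:
        totals[area] += count
        weighted[area] += count * predicted
    # Skip areas with no population to avoid dividing by zero.
    return {area: weighted[area] / totals[area]
            for area in totals if totals[area] > 0}


# Hypothetical cells: counts stand in for census tabulations and the
# shares stand in for predictions from a fitted multilevel model.
cells = [
    ("District A", 1000, 0.40),
    ("District A", 3000, 0.60),
    ("District B", 2000, 0.50),
]
estimates = poststratify(cells)
print(estimates)
```

Most of the upfront cost described above lies in building the `cells` table: matching survey categories to census categories so that counts and predictions refer to the same cells.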
About the banner image: Survey data from the Cumulative CCES, limited to validated voters in contested districts who voted for a major party in the Presidency and House. Estimates are made at the congressional district level and use Multilevel Regression Poststratification (MRP) stratifying on age, gender, education from the ACS and using House candidate incumbency status and presidential voteshare as district-level predictors. In presidential years the values represent ticket splitting (e.g. Trump voters who voted for a 2016 Democratic House candidate); in midterm years they represent party switch from the previous presidential election (e.g. Trump voters who voted for a 2018 Democratic House candidate). Districts where a Democrat and Republican candidate did not contest the general election are left blank. Figure created by Shiro
About this website: This website uses code from Minimal Mistakes and GitHub Pages, uses some CSS from Matt Blackwell's website at the time, and is inspired by Sarah Bouchat's website and Andrew Hall's website.