Selected Working Papers
- The Still Secret Ballot: The Limited Privacy Cost of Transparent Election Results. (with Jeffrey B. Lewis and Michael Morse). [Featured in Bipartisan Policy Center Memo, Votebeat Texas]
Abstract
After an election, should officials release an electronic record of each ballot? The release of ballots could bolster the legitimacy of the result. But it may also facilitate vote revelation, where an analyst unravels the secret ballot by uniquely linking votes on an anonymous ballot to the voter's name and address in the public voter file. We first provide a theoretical model of how vote revelation could occur under various election-reporting regimes. Perhaps counterintuitively, releasing ballot records is no more revelatory than the typical practice of releasing aggregate vote tallies by precinct and method. We then present the first empirical evaluation of vote revelation, using the 2020 election in Maricopa County, Arizona, as a case study. For 99.8% of voters, the release of ballot records led to no revelation of any vote choice. We conclude the ballot can be both public and still as secret as it is under typical reporting practices.
- Ticket Splitting in a Nationalized Era. 2024 Version, conditionally accepted, The Journal of Politics. [Covered by The Post and Courier, Governing, Colorado Public Radio News].
Abstract
Party loyalty in U.S. Congressional elections has reached heights unprecedented in the post-war era. Theories of partisanship as informational cues would predict that ticket splitting from national partisanship should be even more rare in low-information elections. Yet, here I show that ticket splitting in state and local offices is often higher than in Congress. I use cast vote records from voting machines that overcome ecological inference challenges, and develop a clustering algorithm to summarize such ballot data. For example, about a third of South Carolina Trump voters form a bloc whose probability of ticket splitting is 5 percent for Congress, but 32 percent for county council and 50 percent for sheriff. I show that a model with candidate quality differentials can explain these patterns: Even in a nationalized era, some voters cross party lines to vote for the more experienced and higher quality candidate in state and local elections.
- Winning Elections with Unpopular Policies: Valence Advantage and Single-Party Dominance in Japan. (with Shusei Eshima, Yusaku Horiuchi, and Daniel M. Smith)
Abstract
The existence of dominant parties in democracies is an enduring puzzle, in part because spatial models of voting suggest that an opposition party should be able to challenge the incumbent by proposing more popular policies. We consider the preeminent case of Japan's Liberal Democratic Party (LDP) and investigate whether its continued success can be explained by voters' support for its policies. We measure voters' policy-based utility for parties through a novel application of conjoint experiments featuring policy profiles based on parties' real-world manifestos and find that these utilities only partially explain vote choice. Most voters prefer the policies of the main opposition parties, but many nevertheless support the LDP. We interpret this discrepancy as arising from the LDP's advantage in terms of valence (non-policy considerations). An examination of which voters change preferences when policy profiles include party labels suggests that trust is an important component of the LDP's valence advantage.
- Preference Aggregation: A Seats Votes Curve for Issues. (with Stephen Ansolabehere). Full paper available upon request
Abstract
How well does the U.S. electoral system of states and congressional districts aggregate voters' policy preferences? A century of research has studied how districts aggregate votes for parties into legislative seats, using the well-known seats-votes curve. This paper develops the analog for issues, called the Issue Aggregation Curve, from the spatial model. The one-dimensional model characterizes choice over specific bills; the multidimensional model characterizes choice between two alternatives that consist of bundles of issues, such as party platforms or ideologies. We estimate the issue aggregation curves for 104 issues using data from the CCES from 2006 to 2022. The issue aggregation curves for the U.S. Senate and House exhibit little or no bias and have steep slopes, even steeper than the seats-votes curves. These patterns reveal that the Senate and House districts magnify, rather than interfere with, the expression of the policy preferences of the majority.
- Cast Vote Records: A Database of Ballots from the 2020 U.S. Election (with Mason Reece and 12 others)
Abstract
Ballots are the basis of the electoral process. A growing group of political scientists, election administrators, and computer scientists have requested electronic records of actual ballots cast (cast vote records) from election officials, with the hope of affirming the legitimacy of elections and countering misinformation about ballot fraud. However, the administration of election data in the U.S. is scattered across local jurisdictions. Here we introduce a database of cast vote records from the 2020 U.S. general election. We downloaded, standardized, and extensively checked the accuracy of a set of cast vote records collected from the 2020 election. Our initial release includes six offices – President, Governor, U.S. Senate and House, and state upper and lower chambers – covering 40.9 million voters in 20 states who voted for a total of thousands of candidates, including 2,121 Democratic and Republican candidates. This database serves as an unparalleled source of data for studying voting behavior and election administration.
- Synthetic Area Weighting for Measuring Public Opinion in Small Areas (with Soichiro Yamauchi). [Slides from Polmeth]
Abstract
The comparison of subnational areas is ubiquitous, but survey samples of these areas are often biased or prohibitively small. Researchers turn to methods such as multilevel regression and poststratification (MRP) to improve the efficiency of estimates by partially pooling data across areas via random effects. However, the random effect approach can pool observations only through area-level aggregates. We instead propose a weighting estimator, the synthetic area estimator, which weights on variables measured only in the survey to partially pool observations individually. The proposed method consists of two-step weighting: first to adjust for differences across areas and then to adjust for differences between the sample and population. Unlike MRP, our estimator can directly use the national weights that are often estimated by pollsters using proprietary information. Our approach also clarifies the assumptions needed for valid partial pooling, without imposing an outcome model. We apply the proposed method to estimate the support for immigration policies at the congressional district level in Florida. Our empirical results show that small area estimation models with insufficient covariates can mask opinion heterogeneities across districts.
Peer-Reviewed Publications
American Politics
- The Geography of Racially Polarized Voting: Calibrating Surveys at the District Level. (with Stephen Ansolabehere, Angelo Dagonel, and Soichiro Yamauchi). American Political Science Review, vol. 118, p. 922-939. 2024. [Open Access at APSR, preprint with appendix and all estimates] [ISPS blog, WSHU radio] [Replication, Data Release] [Software: ccesMRPprep, ccesMRPrun, synthjoint] [WISM slides]
Abstract
Debates over racial voting, and over policies to combat vote dilution, turn on the extent to which groups' voting preferences differ and vary across geography. We present the first study of racial voting patterns in every congressional district in the US. Using large-sample surveys combined with aggregate demographic and election data, we find that national-level differences across racial groups explain 60 percent of the variation in district-level voting patterns, while geography explains 30 percent. Black voters consistently choose Democratic candidates across districts, while Hispanic and White voters’ preferences vary considerably across geography. Districts with the highest racial polarization are concentrated in parts of the South and Midwest. Importantly, multi-racial coalitions have become the norm: in most congressional districts, the winning majority requires support from minority voters. In arriving at these conclusions, we make methodological innovations that improve the precision and accuracy when modeling sparse survey data.
- Widespread Partisan Gerrymandering Mostly Cancels Nationally, but Reduces Electoral Competition [ISPS blog] (with Christopher Kenny, Cory McCartan, Tyler Simko, and Kosuke Imai). Proceedings of the National Academy of Sciences (PNAS), vol. 120 (25)
Abstract
Redistricting plans in legislatures determine how voters' preferences are translated into representatives' seats. Political parties may manipulate the redistricting process to gain additional seats and insulate incumbents from electoral competition, a process known as gerrymandering. But detecting gerrymandering is difficult without a representative set of alternative plans that comply with the same geographic and legal constraints. Harnessing recent algorithmic advances in sampling, we study such a collection of alternative redistricting plans that can serve as a non-partisan baseline. This methodological approach can distinguish electoral bias due to partisan effects from electoral bias due to other factors. We find that Democrats are structurally and geographically disadvantaged in House elections by 8 seats, while partisan gerrymandering disadvantages them by 2 seats.
- Congressional Representation: Accountability from the Constituent’s Perspective. (with Stephen Ansolabehere). American Journal of Political Science. 2022. [Summarized in the AJPS Blog] [Data]
Abstract
The premise that constituents hold representatives accountable for their legislative decisions undergirds political theories of democracy and legal theories of statutory interpretation. But studies of this at the individual level are rare, examine only a handful of issues, and arrive at mixed results. We provide an extensive assessment of issue accountability at the individual level. We trace the congressional rollcall votes on 44 bills across seven Congresses (2006-2018), and link them to constituents' perceptions of their representative's votes and their evaluation of their representative. Correlational, instrumental variables, and experimental approaches all show that constituents hold representatives accountable. A one-standard-deviation increase in a constituent's perceived issue agreement with their representative can improve net approval by 35 percentage points. Congressional districts, however, are heterogeneous. Consequently, the effect of issue agreement on vote is much smaller at the district level, resolving an apparent discrepancy between micro and macro studies.
- Wealth, Slave Ownership, and Fighting for the Confederacy: An Empirical Study of the American Civil War. (with Andrew B. Hall and Connor Huff). American Political Science Review, vol. 113, p. 658-673. 2019. [Covered by The Weeds podcast] [Data]
Abstract
How did personal wealth and slaveownership affect the likelihood Southerners fought for the Confederate Army in the American Civil War? On the one hand, wealthy Southerners had incentives to free-ride on poorer Southerners and avoid fighting; on the other hand, wealthy Southerners were disproportionately slaveowners, and thus had more at stake in the outcome of the war. We assemble a dataset on roughly 3.9 million free citizens in the Confederacy and show that slaveowners were more likely to fight than non-slaveowners. We then exploit a randomized land lottery held in 1832 in Georgia. Households of lottery winners owned more slaves in 1850 and were more likely to have sons who fought in the Confederate Army. We conclude that slaveownership, in contrast to some other kinds of wealth, compelled Southerners to fight despite free-rider incentives because it raised their stakes in the war’s outcome.
Survey Statistics and Demography
- Evaluating Bias and Noise Induced by the U.S. Census Bureau's Privacy Protection Methods (with Christopher T. Kenny, Cory McCartan, Tyler Simko, Kosuke Imai). Science Advances, forthcoming. [Data]
Abstract
The United States Census Bureau faces a difficult trade-off between the accuracy of Census statistics and the protection of individual information. We conduct the first independent evaluation of bias and noise induced by the Bureau's two main disclosure avoidance systems: the TopDown algorithm employed for the 2020 Census and the swapping algorithm implemented for the three previous Censuses. Our evaluation leverages the Noisy Measure File (NMF) as well as two independent runs of the TopDown algorithm applied to the 2010 decennial Census. We find that the NMF contains too much noise to be directly useful, especially for Hispanic and multiracial populations. TopDown's post-processing dramatically reduces the NMF noise and produces data whose accuracy is similar to that of swapping. While the estimated errors for both TopDown and swapping algorithms are generally no greater than other sources of Census error, they can be relatively substantial for geographies with small total populations.
- Comment: The Essential Role of Policy Evaluation for the 2020 Census Disclosure Avoidance System (with Christopher T. Kenny, Cory McCartan, Evan T. R. Rosenman, Tyler Simko, Kosuke Imai). Harvard Data Science Review, Jan 2023. [HDSR DOI]
Abstract
In "Differential Perspectives: Epistemic Disconnects Surrounding the US Census Bureau's Use of Differential Privacy," boyd and Sarathy argue that empirical evaluations of the Census Disclosure Avoidance System (DAS), including our published analysis, failed to recognize how the benchmark data against which the 2020 DAS was evaluated is never a ground truth of population counts. In this commentary, we explain why policy evaluation, which was the main goal of our analysis, is still meaningful without access to a perfect ground truth. We also point out that our evaluation leveraged features specific to the decennial Census and redistricting data, such as block-level population invariance under swapping and voter file racial identification, better approximating a comparison with the ground truth. Lastly, we show that accurate statistical predictions of individual race based on the Bayesian Improved Surname Geocoding, while not a violation of differential privacy, substantially increase the disclosure risk of private information the Census Bureau sought to protect. We conclude by arguing that policy makers must confront a key trade-off between data utility and privacy protection, and an epistemic disconnect alone is insufficient to explain disagreements between policy choices.
- Unrepresentative Big Surveys Significantly Overestimated US Vaccine Uptake. (with Valerie C. Bradley, Michael Isakov, Dino Sejdinovic, Xiao-Li Meng, and Seth Flaxman; co-first author with Bradley). Nature, vol. 600, p. 695-700. 2021. [Covered by Harvard Gazette] [Data]
Abstract
Surveys are a crucial tool for understanding public opinion and behaviour, and their accuracy depends on maintaining statistical representativeness of their target populations by minimizing biases from all sources. Increasing data size shrinks confidence intervals but magnifies the effect of survey bias: an instance of the Big Data Paradox. Here we demonstrate this paradox in estimates of first-dose COVID-19 vaccine uptake in US adults from 9 January to 19 May 2021 from two large surveys: Delphi–Facebook (about 250,000 responses per week) and Census Household Pulse (about 75,000 every two weeks). In May 2021, Delphi–Facebook overestimated uptake by 17 percentage points (14–20 percentage points with 5% benchmark imprecision) and Census Household Pulse by 14 (11–17 percentage points with 5% benchmark imprecision), compared to a retroactively updated benchmark the Centers for Disease Control and Prevention published on 26 May 2021. Moreover, their large sample sizes led to minuscule margins of error on the incorrect estimates. By contrast, an Axios–Ipsos online panel with about 1,000 responses per week following survey research best practices provided reliable estimates and uncertainty quantification. We decompose observed error using a recent analytic framework to explain the inaccuracy in the three surveys. We then analyse the implications for vaccine hesitancy and willingness. We show how a survey of 250,000 respondents can produce an estimate of the population mean that is no more accurate than an estimate from a simple random sample of size 10. Our central message is that data quality matters more than data quantity, and that compensating the former with the latter is a mathematically provable losing proposition.
- The Use of Differential Privacy for Census Data and its Impact on Redistricting: The Case of the 2020 U.S. Census. (with Chris Kenny, Cory McCartan, Evan Rosenman, Tyler Simko, and Kosuke Imai). Science Advances, vol. 7, eabk3283. 2021. Originally a Public Comment to the Census Bureau (May 28, 2021). [FAQ, Reaction to the Bureau's Response (June 9, 2021).] [Data]
Abstract
Census statistics play a key role in public policy decisions and social science research. Yet given the risk of revealing individual information, many statistical agencies are considering disclosure control methods based on differential privacy, which add noise to tabulated data. Unlike other applications of differential privacy, however, census statistics must be post-processed after noise injection to be usable. We study the impact of the US Census Bureau's new Disclosure Avoidance System (DAS) on a major application of census statistics: the redrawing of electoral districts. We find that the DAS systematically undercounts the population in mixed-race and mixed-partisan precincts, yielding unpredictable racial and partisan biases. The DAS also leads to a likely violation of "One Person, One Vote" standard as currently interpreted, but does not prevent accurate predictions of an individual's race and ethnicity. Our findings underscore the difficulty of balancing accuracy and respondent privacy in the Census.
Selected Press Coverage
Covered by AP News, Washington Post, The Harvard Crimson, San Francisco Chronicle, Matthew Yglesias blog, Statistical Modeling (Andrew Gelman's blog) by Jessica Hullman (Part 1, Part 2)
- Towards Principled Unskewing: Viewing 2020 Election Polls Through a Corrective Lens from 2016. (with Michael Isakov). Harvard Data Science Review, vol. 2.4 (pre-2020 election issue). 2020. [Covered by The Harvard Crimson] [PDF version with post-election review] [Data]
Abstract
We apply the concept of the data defect index (Meng, 2018) to study the potential impact of systematic errors on the 2020 pre-election polls in twelve Presidential battleground states. We investigate the impact under the hypothetical scenarios that (1) the magnitude of the underlying non-response bias correlated with supporting Donald Trump is similar to that of the 2016 polls, (2) the pollsters' ability to correct systematic errors via weighting has not improved significantly, and (3) turnout levels remain similar to 2016. Because survey weights are crucial for our investigations but are often not released, we adopt two approximate methods under different modeling assumptions. Under these scenarios, which may be far from reality, our models shift Trump's estimated two-party voteshare by a percentage point in his favor in the median battleground state, and increase twofold the uncertainty around the voteshare estimate.
Education in Political Science
- Simulated redistricting plans for the analysis and evaluation of redistricting in the United States (with Cory McCartan, Christopher Kenny, Tyler Simko, George Garcia III, Kevin Wang, Melissa Wu, and Kosuke Imai). Scientific Data, 9, 689. 2022. [Website] [Dataverse]
Abstract
This article introduces the 50stateSimulations, a collection of simulated congressional districting plans and underlying code developed by the Algorithm-Assisted Redistricting Methodology (ALARM) Project. The 50stateSimulations allow for the evaluation of enacted and other congressional redistricting plans in the United States. While the use of redistricting simulation algorithms has become standard in academic research and court cases, any simulation analysis requires non-trivial efforts to combine multiple data sets, identify state-specific redistricting criteria, implement complex simulation algorithms, and summarize and visualize simulation outputs. We have developed a complete workflow that facilitates this entire process of simulation-based redistricting analysis for the congressional districts of all 50 states. The resulting 50stateSimulations include ensembles of simulated 2020 congressional redistricting plans and necessary replication data. We also provide the underlying code, which serves as a template for customized analyses. All data and code are free and publicly available. This article details the design, creation, and validation of the data.
- The "Math Prefresher" and The Collective Future of Political Science Graduate Training. (with Gary King and Yon Soo Park). PS: Political Science and Politics, vol. 54, p. 537-541. 2020.
Abstract
The political science math prefresher arose a quarter-century ago and has now spread to many of our discipline’s PhD programs. Incoming students arrive for graduate school a few weeks early for ungraded instruction in math, statistics, and computer science as they relate to political science. The prefresher’s benefits, however, go beyond its technical content: it opens pathways to mastering methods necessary for political science research, facilitates connections among peers, and — perhaps most important — eases the transition to the increasingly collaborative nature of graduate work. The prefresher also shows how faculty across a highly diverse discipline have worked together to train the next generation. We review this program and advance its collaborative aspects by building infrastructure to share teaching content across universities so that separate programs can build on one another’s work and improve all of our programs.
Teaching
- PLSC 277: The U.S. Congress (Undergraduate. Fall 2022, Spring 2024) [Public Syllabus]
Course Description
The United States Congress is arguably the most powerful legislature in the world. Its actions and inaction affect taxes, healthcare, business, the environment, and international politics. To understand the nature of legislative power in Congress and in democracies more broadly, we ask: How do successful politicians become powerful? How do they navigate rules and institutions to their advantage? What is the proper role of lawmaking in regulating private business? Should we limit legislative lobbying and put a cap on campaign contributions? Class discussions use case studies including the Civil Rights movement in the 1960s, the Tax Reform Act under Reagan, and the Affordable Care Act under Obama. Exercises include coding and data analysis. The goal is to equip students with a broad understanding of the principles of politics, economics, public policy, and data science.
- PLSC 438/536: Applied Quantitative Research Design (MA, Undergraduate, and Ph.D. Fall 2022, Fall 2023) [Public Syllabus]
Course Description
Research designs are strategies to obtain empirical answers to theoretical questions. Research designs using quantitative data for social science questions are more important than ever. This class, intended for advanced students interested in social science research, trains students in best practices for designing and implementing rigorous quantitative research. We cover designs in causal inference, prediction, and missing data at a high level. This is a hands-on, application-oriented class. Exercises involve programming and statistics in addition to the social sciences (politics, economics, and policy). The final project advances a research question chosen in consultation with the instructor.
Prerequisite: Any statistics or data science course that teaches ordinary least squares regression. Past or concurrent experience with a programming language such as R is strongly recommended.
- PLSC 862: American Elections with Comparative Perspective (Spring 2023) [Syllabus]
Course Description
This graduate-level seminar covers foundational work on electoral politics in the United States, with some comparisons with other countries' systems and domestic proposals for reform. Readings examine work on elite position-taking, re-election, federalism, representation, and electoral systems. Accompanying readings include similar and more recent articles in comparative politics, political economy, or election law. This course has two intended audiences: students in American Politics, and students outside American Politics interested in theories of electoral democracy developed in the American Politics subfield that have then been exported to other subfields. Class emphasizes empirical research designs and analysis of available datasets in addition to reading.
I received the 2020 Dean's Excellence in Teaching Award at the Harvard Kennedy School of Public Policy for my teaching in econometrics and shepherding the use of the R statistical language in its core statistics sequence. This work included creating portable screencasts of R workflows, covering common topics in econometrics, causal inference, data science, and quantitative social science.
I am an RStudio certified trainer, and have created several resources for statistics and data science for the social sciences that I hope are useful for other students and instructors. These include a workshop I co-designed on training teachers in the social sciences for teaching statistics and programming, my presentations on project-oriented workflow, introduction to version control with GitHub, introduction to Stata, and statistics notes covering Probability, Inference, and Regression written for a Masters-level statistics course (links).
Any use of my teaching material available online is welcome with attribution.
Dissertation
Book Project: Congressional Representation
(with Stephen Ansolabehere)
This book, tentatively titled Congressional Representation, argues that through all of the gridlock and the polarization that has plagued the government over the past three decades, the U.S. Congress remains a largely majoritarian institution. Congress acts in line with the majority of people more often than not. Building on 15 years of data on public preferences of more than 500,000 Americans, this study examines what voters know, what they care about when they vote, and how well their legislators and their Congress reflect their preferences. Representation is not a seamless or mechanical process, but it aggregates people's beliefs and preferences well on the important issues that face the country. Individual voters do not follow the details of congressional legislation but most know enough to hold correct beliefs about legislation and to hold their representatives accountable. For their part, legislators are highly responsive to the aggregate opinion of their districts. And, on important bills, Congress makes decisions in line with the majority of the nation. When representation fails, it is often because of obstruction by one branch of government or one party.
(Slides)
Datasets
- The Cast Vote Records project is collecting and organizing public ballot image logs to advance the understanding of voting patterns in federal, state and local elections. One of the project's goals is to establish a relational database for such data. This project is one of the 2018 New Initiative Grants from the MIT Election Data and Science Lab. Please feel free to contact me with any questions about this project. For more information, see the associated empirical paper or the memo on the election administration of cast vote records.
- The Cumulative CCES Common Content (2006-2022) (downloaded on Dataverse over 20,000 times) is a part of the Cooperative Congressional Election Survey Dataverse. It combines all common content respondents of the CCES and harmonizes key variables, so that researchers can analyze all years of the CCES or merge in standardized variables with their own CCES datasets.
- The Candidates in American General Elections dataset (with Jeremiah Cha and James M. Snyder, Jr.) is a comprehensive list of winning and losing candidates in U.S. Congressional, Presidential, and Gubernatorial elections. Unlike official records or other datasets, we standardize candidate names across time and office, and record the incumbency of the candidates.
- The Fifty States Redistricting Simulations project provides an ensemble of alternative maps for congressional districts that are simulated from state-of-the-art redistricting software. These maps can be used to evaluate whether a proposed map is an outlier on any dimension. (As part of the ALARM team, with Cory McCartan, Christopher Kenny, Tyler Simko, George Garcia III, Kevin Wang, Melissa Wu, and Kosuke Imai.)
- Portable Routines for Preparing CCES and ACS data for MRP (ccesMRPprep) is a set of datasets and API interfaces to facilitate Multilevel Regression Poststratification (MRP), a survey weighting method for small area estimation. Other articles already provide helpful tutorials and code for MRP. But implementing MRP entails considerable upfront costs related to data collection, cleaning, and standardization. This package provides these routines: not only modeling software, but code to import and standardize publicly available data sources, combined with detailed documentation about these data sources.
About the banner image: Survey data from the Cumulative CCES, limited to validated voters in contested districts who voted for a major party in the Presidency and House. Estimates are made at the congressional district level and use Multilevel Regression Poststratification (MRP) stratifying on age, gender, education from the ACS and using House candidate incumbency status and presidential voteshare as district-level predictors. In presidential years the values represent ticket splitting (e.g. Trump voters who voted for a 2016 Democratic House candidate); in midterm years they represent party switch from the previous presidential election (e.g. Trump voters who voted for a 2018 Democratic House candidate). Districts where a Democrat and Republican candidate did not contest the general election are left blank. Figure created by Shiro Kuriwaki.
About this website: This website uses code from Minimal Mistakes and GitHub Pages, uses some CSS from Matt Blackwell's website at the time, and is inspired by Sirus Bouchat's website and Andrew Hall's website.