Common Misperceptions: Limitations of the ACS for MRP
The ACS is the main dataset used for post-stratification in MRP
because of its wide coverage of different geographies. However, there
are a couple of data limitations to the ACS that make subsequent MRP
diverge from the classic MRP as it is usually taught:
- The ACS is conducted by the Census Bureau but it is a
survey, not a population census. It is an extremely large
survey, with over 3 million respondents in any given year. But it is
still a sample, and therefore its counts are estimates. The ACS even
provides margin of error estimates around its counts.
-
Microdata is not available for some geographies
like congressional districts. By microdata, we mean data where each row
is an individual respondent and can be aggregated up. The Census page,
tidycensus, IPUMS (https://usa.ipums.org/) has an amazing interface to
download microdata, but that data does not include congressional
district.
- The Census provides some crosswalk files but they are mainly ZCTA to
CD crosswalks.
- PUMAs are the main unit of area for the ACS but these do not link
well to CD (see here
for discussion and potential fractional matching). For example Cambridge
Mass. is a single PUMA, but Cambridge has about 8 zipcodes and is split
into parts of MA-05 (Rep. Katherine Clark) and MA-07 (Rep. Ayanna
Pressley)
- The population Census is ideal, but the Census only occurs every 10
years, and only a 5 percent sample of it (with its restrictions of its
own) is given to the public with some lag time. Therefore the population
Census is often not a good alternative to the ACS / CPS.
- The ACS does not ask some key variables, such as
voter registration (which is asked in the CPS) or party identification
(which is only asked in political surveys like the CCES or Pew). That
limits the variables one could include in the MRP model.
- Of the variables that are available, only certain variable
combinations of them are available at the district level:
- There are three are three-way interactions of some demographic
variables but not more than that.
- The ACS provides thousands of cell counts, but only a subset of them
are useful for MRP because cells for MRP must form a partition of the
population (i.e., exclusive and exhaustive categories). I describe
several important combinations below.
- See the synthetic imputation page
(
vignette("synth", package = "synthjoint")
) in this package
to see how this restriction could be relaxed.
- Some of the partition counts are not perfect
partitions. For example, some of the variable codes I define
below exclude people aged 18-24, include non-citizens, or
double-count African Americans who also identify as Hispanic/Latino.
These introduce some error into the post-stratification.
This package and vignette goes into more detail about these
limitations and the nature of the data that is available.
Three types of ACS Tables
If you are in a hurry or want something to start with, here are three
partitions this package provides:
-
Age x Sex x Education. These variable codes are
encoded in
?acscodes_age_sex_educ
.
-
Age x Sex x Race. These variable codes are encoded
in
?acscodes_age_sex_race
.
-
Sex x Education x Race . These variable codes are
encoded in
?acscodes_sex_educ_race
See their help pages for more information and their limitations.
Collapsing variable codes
The survey and the post-stratification must share discrete variables
the level of which match 1:1. Therefore, if race is binned 9 ways in the
ACS but there are only 5 response options in the CCES, then the finer
group must be collapsed to coarser groupings of the 5 levels in the
CCES. That also means, for example, that the recoding must be nested –
i.e., it is hard to salvage two variable codings that are not
many-to-one nested mappings to each other. These judgments must be made
with careful attention to the question wording and are documented to
some extent in ?namevalue
.
Thanks
The statements in this vignette benefited from many online resources
and correspondence with experts including Yair Ghitza and Matto
Mildenberger. I also relies on findings from several published
papers:
- Howe, P. D., Mildenberger, M., Marlon, J. R., & Leiserowitz, A.
(2015). Geographic variation in opinions on climate change at state and
local scales in the USA. Nature Climate Change, 5(6), 596–603.
https://doi.org/10.1038/nclimate2583
- Kastellec, J. P., Lax, J. R., & Phillips, J. H. (2010). Public
Opinion and Senate Confirmation of Supreme Court Nominees. Journal
of Politics, 72(3), 767–784. https://doi.org/10.1017/S0022381610000150
- Walker, Kyle. (2020). tidycensus: Load US Census Boundary and
Attribute Data as ‘tidyverse’ and ‘sf’-Ready Data Frames. R package
version 0.9.9.2. https://CRAN.R-project.org/package=tidycensus
- Warshaw, C., & Rodden, J. (2012). How should We measure
district-level public opinion on individual issues? Journal of
Politics, 74(1), 203–219. https://doi.org/10.1017/S0022381611001204