All CCES data can be downloaded directly from Dataverse, using the dataverse R package. The function get_cces_dataverse() makes this quick and simple: you only need to specify the name of the dataset, as listed in ?cces_dv_ids.
# may take about 30 seconds to download
ccc <- get_cces_dataverse("cumulative")
Here we will use a built-in sample of 1,000 observations for illustration (see ?ccc_samp). In production, you will want to download all 14 datasets to your local directory via a script like initialize-cces-downloads.R (still private).
ccc_samp
#> # A tibble: 1,000 × 18
#> year case_id state st cd zipcode county_fips gender age race
#> <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <dbl+l> <dbl> <dbl+l>
#> 1 2006 1005058 Michigan MI MI-04 48603 26145 2 [Fem… 36 1 [Whi…
#> 2 2006 1006614 Texas TX TX-18 77040 48201 1 [Mal… 40 3 [His…
#> 3 2006 1009338 Californ… CA CA-48 92656 06059 1 [Mal… 32 1 [Whi…
#> 4 2006 1088898 Florida FL FL-13 34224 12015 2 [Fem… 52 1 [Whi…
#> 5 2006 1090564 Pennsylv… PA PA-19 17011 42041 2 [Fem… 25 1 [Whi…
#> 6 2006 1093132 South Ca… SC SC-02 29073 45063 2 [Fem… 48 1 [Whi…
#> 7 2006 1093573 Utah UT UT-03 84118 49035 2 [Fem… 74 1 [Whi…
#> 8 2006 1105620 Hawaii HI HI-01 96701 15003 1 [Mal… 37 4 [Asi…
#> 9 2006 1116569 Texas TX TX-21 78610 48209 2 [Fem… 20 3 [His…
#> 10 2006 1117377 Ohio OH OH-07 45502 39023 2 [Fem… 49 1 [Whi…
#> # ℹ 990 more rows
#> # ℹ 8 more variables: hispanic <dbl+lbl>, educ <dbl+lbl>, faminc <dbl+lbl>,
#> # marstat <dbl+lbl>, newsint <dbl+lbl>, vv_turnout_gvm <dbl+lbl>,
#> # voted_pres_16 <dbl+lbl>, economy_retro <dbl+lbl>
The CCES cumulative dataset is already harmonized and cleaned, but its values must be recoded so that they later match the values in the ACS. I have created key-value pairings for the main demographic variables. The wrapper function ccc_std_demographics() expects a CCES cumulative dataset and applies these recodes.
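As a rough illustration of what a key-value pairing does, the sketch below recodes numeric survey codes into the labels used on the census side. The lookup table gender_key and the helper recode_by_key() are hypothetical names made up for this example, not objects exported by the package. The codes 1 = Male and 2 = Female follow the labels visible in the ccc_samp printout.

```r
# Hypothetical key-value table: survey code -> label used in the census table.
gender_key <- data.frame(
  gender = c(1, 2),
  gender_lbl = c("Male", "Female")
)

# Hypothetical helper: look up each code in the key and return a factor whose
# levels follow the order of the key. Codes missing from the key become NA.
recode_by_key <- function(x, key, from, to) {
  factor(key[[to]][match(x, key[[from]])], levels = key[[to]])
}

survey_gender <- c(2, 1, 1, 2, 2)
recode_by_key(survey_gender, gender_key, from = "gender", to = "gender_lbl")
```

Keeping the pairing in a table rather than in nested if/else statements means the same key can later be used to check that survey and census labels agree.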
See get_cces_question() for how to get the outcome data, which is a single column from another CCES dataset. This still relies on a flat file being pre-downloaded (via get_cces_dataverse()).
ccc_std <- ccc_std_demographics(ccc_samp)
#> age variable modified to bins. Original age variable is now in age_orig.
count(ccc_std, age)
#> # A tibble: 5 × 2
#> age n
#> <int+lbl> <int>
#> 1 1 [18 to 24 years] 75
#> 2 2 [25 to 34 years] 160
#> 3 3 [35 to 44 years] 145
#> 4 4 [45 to 64 years] 426
#> 5 5 [65 years and over] 194
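The binning reported in the message above can be approximated in base R with cut(). This is a sketch under the assumption that the bins are left-closed with cut points at 25, 35, 45, and 65, as the labels suggest. It is not the package's own implementation, and the ages are toy values.

```r
# Toy ages; the first five match the first rows of the ccc_samp printout.
age_orig <- c(36, 40, 32, 52, 25, 18, 64, 65, 90)

# Bin labels copied from the count() output.
age_labels <- c("18 to 24 years", "25 to 34 years", "35 to 44 years",
                "45 to 64 years", "65 years and over")

# right = FALSE makes the intervals left-closed: [18, 25), [25, 35), ...
age_bin <- cut(age_orig,
               breaks = c(18, 25, 35, 45, 65, Inf),
               labels = age_labels,
               right = FALSE)

table(age_bin)
```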
A formula represents the variables that are in the model, how they interact, and which variables are random effects. We presume the formula will be used in brms. Currently we support binary outcomes. The model can be written in the usual regression form:
fm_brm <- response ~ age + gender + educ + pct_trump + (1|cd)
where response is a binary variable or a factor/character variable that can be coerced to one using yesno_to_binary().
A prior could be specified here as well, but this is not strictly necessary and can be defined when fitting the model.
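To make the requirements on the formula and the outcome concrete, the sketch below lists the variables a formula of this form refers to and coerces a yes/no outcome to 0/1. The helper to_binary_sketch() is a hypothetical stand-in written for this example. It does not reproduce the behaviour of yesno_to_binary(), whose exact rules are documented in the package.

```r
fm_brm <- response ~ age + gender + educ + pct_trump + (1 | cd)

# all.vars() returns every variable named in the formula, including the
# grouping variable of the random intercept.
all.vars(fm_brm)

# Hypothetical helper (an assumption, not the package function):
# map "yes"/"no" strings, ignoring case, to 1/0; anything else becomes NA.
to_binary_sketch <- function(x) {
  x <- tolower(as.character(x))
  ifelse(x == "yes", 1L, ifelse(x == "no", 0L, NA_integer_))
}

to_binary_sketch(factor(c("Yes", "No", "yes", "Not sure")))
```

Every variable on the right-hand side must eventually exist in both the survey data and the post-stratification table, so listing them early is a cheap check.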
We provide wrappers around the great tidycensus package that retrieve ACS data and build post-stratification tables from it. Tailored lookup tables internally pull out the appropriate CD-level counts and label them so that they match the CCES keys. District-level information that is not necessarily ACS data (e.g. election outcomes) must be supplied here as well.
# only do this once, replacing the input with your census key
tidycensus::census_api_key(Sys.getenv("CENSUS_API_KEY"))
#> To install your API key for use in future sessions, run this function with `install = TRUE`.
acs_tab <- get_acs_cces(
varlist = acscodes_age_sex_educ,
varlab_df = acscodes_df,
year = 2018)
#> Getting data from the 2018 1-year ACS
#> The 1-year ACS provides data for geographies with populations of 65,000 and greater.
#> Warning: • You have not set a Census API key. Users without a key are limited to 500
#> queries per day and may experience performance limitations.
#> ℹ For best results, get a Census API key at
#> http://api.census.gov/data/key_signup.html and then supply the key to the
#> `census_api_key()` function to use it throughout your tidycensus session.
#> This warning is displayed once per session.
poststrat <- get_poststrat(
cleaned_acs = acs_tab,
dist_data = cd_info_2018,
formula = fm_brm)
#> Joining with `by = join_by(year, cd)`
#> Warning: Using `across()` in `filter()` was deprecated in dplyr 1.0.8.
#> ℹ Please use `if_any()` or `if_all()` instead.
#> ℹ The deprecated feature was likely used in the ccesMRPprep package.
#> Please report the issue to the authors.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.
poststrat
#> # A tibble: 17,318 × 6
#> age gender educ pct_trump cd count
#> <fct> <fct> <fct> <dbl> <chr> <dbl>
#> 1 18 to 24 years Male HS or Less 0.0496 NY-15 21859
#> 2 18 to 24 years Male HS or Less 0.0553 NY-13 15491
#> 3 18 to 24 years Male HS or Less 0.0715 PA-03 14171
#> 4 18 to 24 years Male HS or Less 0.0722 CA-13 13738
#> 5 18 to 24 years Male HS or Less 0.0917 CA-12 4893
#> 6 18 to 24 years Male HS or Less 0.0952 IL-07 15820
#> 7 18 to 24 years Male HS or Less 0.101 CA-37 13440
#> 8 18 to 24 years Male HS or Less 0.107 NY-07 12768
#> 9 18 to 24 years Male HS or Less 0.113 CA-34 19439
#> 10 18 to 24 years Male HS or Less 0.123 GA-05 16741
#> # ℹ 17,308 more rows
You need to get an API key from the Census website to run this command. See vignette("acs").
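The shape of the post-stratification table can be pictured as a join between cell counts and district-level covariates, followed by keeping only the variables named in the formula. The base R sketch below uses made-up toy tables (acs_toy, cd_toy) and a shortened formula (fm_toy); all of these names and numbers are assumptions for illustration, and the code is not how get_poststrat() is implemented internally.

```r
# Toy cell counts standing in for the cleaned ACS table (made-up numbers).
acs_toy <- data.frame(
  year = 2018,
  cd = c("NY-15", "NY-15", "GA-05", "GA-05"),
  gender = c("Male", "Female", "Male", "Female"),
  count = c(100, 120, 90, 110)
)

# Toy district-level covariates (made-up values).
cd_toy <- data.frame(
  year = 2018,
  cd = c("NY-15", "GA-05"),
  pct_trump = c(0.05, 0.12)
)

# Attach the district covariates to every cell, joining on year and cd.
poststrat_toy <- merge(acs_toy, cd_toy, by = c("year", "cd"))

# Keep only the right-hand-side variables of the formula, plus the count.
fm_toy <- response ~ gender + pct_trump + (1 | cd)
poststrat_toy <- poststrat_toy[, c(all.vars(fm_toy)[-1], "count")]
poststrat_toy
```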
Often, the census alone does not provide all the variables one would want to post-stratify on (i.e. to adjust for non-representativeness). Party affiliation is a prominent example of a variable the Census does not collect. The synthjoint package provides open-source software to expand the post-stratification table with such variables. See the README at https://github.com/kuriwaki/synthjoint.
Under this workflow, only three objects are needed to conduct MRP with a brms model: the cleaned survey data, the model formula, and the post-stratification table produced by get_poststrat() and possibly expanded via synthetic methods (in Step 5). Together, these uniquely define one MRP model for one operationalization of one question with a particular set of covariates.
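Once a model has produced a predicted probability for every cell of the post-stratification table, the final post-stratification step is a count-weighted average within each district. The sketch below shows that arithmetic in base R. The cells table and its pred column are made-up toy values; in practice the predictions would come from the fitted brms model.

```r
# Toy post-stratification cells with made-up predicted probabilities.
cells <- data.frame(
  cd = c("A", "A", "B", "B"),
  count = c(100, 300, 50, 50),
  pred = c(0.2, 0.6, 0.4, 0.8)
)

# District estimate = sum(pred * count) / sum(count) within each district.
num <- tapply(cells$pred * cells$count, cells$cd, sum)
den <- tapply(cells$count, cells$cd, sum)
mrp_est <- num / den
mrp_est
```

Weighting by count is what corrects for cells that are over- or under-represented in the survey relative to the population.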