library(ccesMRPprep)
library(tidyverse)
library(haven)

All CCES data can be downloaded directly from Dataverse using the dataverse R package. The function get_cces_dataverse() makes this quick and simple: you only need to specify the name of the dataset, as listed in ?cces_dv_ids.

# may take about 30 seconds to download
ccc <- get_cces_dataverse("cumulative")
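Because the download takes a while, it can help to fetch each dataset once and reuse a local copy afterwards. The following is a minimal sketch of that idea in base R. The helper `cache_rds()` and the file path are hypothetical names introduced for illustration; they are not part of ccesMRPprep.

```r
# Hypothetical helper (not part of ccesMRPprep): run `fetch()` once,
# store the result as an .rds file, and read the file on later calls.
cache_rds <- function(path, fetch) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  dat <- fetch()
  dir.create(dirname(path), recursive = TRUE, showWarnings = FALSE)
  saveRDS(dat, path)
  dat
}

# Intended use (needs ccesMRPprep and network access, so it is left commented):
# ccc <- cache_rds("data/cces_cumulative.rds",
#                  function() get_cces_dataverse("cumulative"))
```

The fetch step is passed in as a function so that the same helper works for any of the datasets listed in ?cces_dv_ids.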

Here we will use a built-in sample of 1,000 observations for illustration (see ?ccc_samp). In production, you will want to download all 14 datasets to your local directory via a script like initialize-cces-downloads.R (still private).

ccc_samp
#> # A tibble: 1,000 × 18
#>     year case_id state    st    cd    zipcode county_fips  gender   age     race
#>    <dbl> <chr>   <chr>    <chr> <chr> <chr>   <chr>       <dbl+l> <dbl> <dbl+lb>
#>  1  2006 1005058 Michigan MI    MI-04 48603   26145       2 [Fem…    36 1 [Whit…
#>  2  2006 1006614 Texas    TX    TX-18 77040   48201       1 [Mal…    40 3 [Hisp…
#>  3  2006 1009338 Califor… CA    CA-48 92656   06059       1 [Mal…    32 1 [Whit…
#>  4  2006 1088898 Florida  FL    FL-13 34224   12015       2 [Fem…    52 1 [Whit…
#>  5  2006 1090564 Pennsyl… PA    PA-19 17011   42041       2 [Fem…    25 1 [Whit…
#>  6  2006 1093132 South C… SC    SC-02 29073   45063       2 [Fem…    48 1 [Whit…
#>  7  2006 1093573 Utah     UT    UT-03 84118   49035       2 [Fem…    74 1 [Whit…
#>  8  2006 1105620 Hawaii   HI    HI-01 96701   15003       1 [Mal…    37 4 [Asia…
#>  9  2006 1116569 Texas    TX    TX-21 78610   48209       2 [Fem…    20 3 [Hisp…
#> 10  2006 1117377 Ohio     OH    OH-07 45502   39023       2 [Fem…    49 1 [Whit…
#> # … with 990 more rows, and 8 more variables: hispanic <dbl+lbl>,
#> #   educ <dbl+lbl>, faminc <dbl+lbl>, marstat <dbl+lbl>, newsint <dbl+lbl>,
#> #   vv_turnout_gvm <dbl+lbl>, voted_pres_16 <dbl+lbl>, economy_retro <dbl+lbl>

## Step 2. Cleaning CCES data

The CCES cumulative dataset is already harmonized and cleaned, but its values must be recoded so that they match the values of the ACS. I have created key-value pairings for the main demographic variables. The wrapper function ccc_std_demographics() expects a CCES cumulative dataset and applies these recodes.

See get_cces_question() for how to get the outcome data, which is a single column from another CCES dataset. This still relies on a flat file being pre-downloaded (via get_cces_dataverse()).

ccc_std <- ccc_std_demographics(ccc_samp)
#> age variable modified to bins. Original age variable is now in age_orig.
count(ccc_std, age)
#> # A tibble: 5 × 2
#>                     age     n
#>               <int+lbl> <int>
#> 1 1 [18 to 24 years]       75
#> 2 2 [25 to 34 years]      160
#> 3 3 [35 to 44 years]      145
#> 4 4 [45 to 64 years]      426
#> 5 5 [65 years and over]   194
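To make the age recode concrete, here is a rough base R sketch that bins a numeric age into the five categories shown in the output above. It only illustrates the idea and is not the package's internal code; the toy ages are made up.

```r
# Toy ages (made up); the real values come from the CCES cumulative file.
age_orig <- c(18, 24, 25, 34, 35, 44, 45, 64, 65, 90)

# Bin labels copied from the count() output above.
age_labels <- c("18 to 24 years", "25 to 34 years", "35 to 44 years",
                "45 to 64 years", "65 years and over")

# Left-closed intervals: [18,25), [25,35), [35,45), [45,65), [65,Inf)
age_bin <- cut(age_orig,
               breaks = c(18, 25, 35, 45, 65, Inf),
               labels = age_labels,
               right = FALSE)

# Keep the original variable next to the binned one, as the message above describes.
toy <- data.frame(age_orig = age_orig, age = age_bin)
table(toy$age)
```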

## Step 3. Define a Model Specification

A formula represents the variables that are in the model, how they are interacted, and which variables are random effects. We presume the formula will be used in brms. Currently we support binary outcomes. The specification can be written in regression form:

fm_brm <- response ~ age + gender + educ + pct_trump + (1 | cd)

where response is a binary variable or a factor/character variable that can be coerced to one using yesno_to_binary().
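As a rough illustration of what coercing a factor or character outcome to binary involves, the sketch below maps two answer labels to 1 and 0 and everything else to NA. The function `to_binary_sketch()` and its default labels "Yes" and "No" are hypothetical stand-ins; the actual behavior of yesno_to_binary() may differ.

```r
# Hypothetical stand-in for illustration; NOT the package's yesno_to_binary().
# Values equal to `yes` become 1, values equal to `no` become 0, the rest NA.
to_binary_sketch <- function(x, yes = "Yes", no = "No") {
  x <- as.character(x)
  out <- rep(NA_integer_, length(x))
  out[x %in% yes] <- 1L
  out[x %in% no] <- 0L
  out
}

# Toy responses (made up), including a missing value and an unmapped answer.
resp <- factor(c("Yes", "No", NA, "Not sure", "Yes"))
to_binary_sketch(resp)
```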

A prior could be specified here as well, but this is not strictly necessary and can be defined when fitting the model.

We provide wrappers around the tidycensus package that retrieve ACS data and build post-stratification tables from it. Tailored internal lookup tables pull out the appropriate CD-level counts and label them so that they match the CCES keys. District-level information that is not necessarily ACS data (e.g. election outcomes) must be supplied here as well.

# only do this once, replacing the input with your census key
tidycensus::census_api_key(Sys.getenv("CENSUS_API_KEY"))
#> To install your API key for use in future sessions, run this function with install = TRUE.

acs_tab <- get_acs_cces(
  varlist = acscodes_age_sex_educ,
  varlab_df = acscodes_df,
  year = 2018)
#> The 1-year ACS provides data for geographies with populations of 65,000 and greater.
#> Getting data from the 2018 1-year ACS

poststrat <- get_poststrat(
  cleaned_acs = acs_tab,
  dist_data = cd_info_2018,
  formula = fm_brm)
#> Joining, by = c("year", "cd")

poststrat
#> # A tibble: 17,318 × 6
#>    age            gender educ       pct_trump cd    count
#>    <fct>          <fct>  <fct>          <dbl> <chr> <dbl>
#>  1 18 to 24 years Male   HS or Less     0.049 NY-15 21859
#>  2 18 to 24 years Male   HS or Less     0.054 NY-13 15491
#>  3 18 to 24 years Male   HS or Less     0.068 CA-13 13738
#>  4 18 to 24 years Male   HS or Less     0.07  PA-03 14171
#>  5 18 to 24 years Male   HS or Less     0.087 CA-12  4893
#>  6 18 to 24 years Male   HS or Less     0.092 IL-07 15820
#>  7 18 to 24 years Male   HS or Less     0.096 CA-37 13440
#>  8 18 to 24 years Male   HS or Less     0.104 NY-07 12768
#>  9 18 to 24 years Male   HS or Less     0.107 CA-34 19439
#> 10 18 to 24 years Male   HS or Less     0.119 GA-05 16741
#> # … with 17,308 more rows

You need to get an API key from the Census website to run this command. See vignette("acs").
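Conceptually, the table above is a set of ACS cell counts joined to district-level covariates on year and cd, as the "Joining, by" message suggests. The sketch below reproduces that join on toy data with base R. All numbers are made up, and it does not show how get_poststrat() is implemented.

```r
# Toy ACS cell counts (made-up numbers), one row per district-by-age cell.
acs_cells <- data.frame(
  year  = 2018,
  cd    = c("NY-15", "NY-15", "CA-13", "CA-13"),
  age   = c("18 to 24 years", "25 to 34 years",
            "18 to 24 years", "25 to 34 years"),
  count = c(100, 150, 80, 120)
)

# Toy district-level covariates (made-up values), one row per district.
cd_info <- data.frame(
  year      = 2018,
  cd        = c("NY-15", "CA-13"),
  pct_trump = c(0.05, 0.07)
)

# Every cell inherits the covariates of its district.
poststrat_toy <- merge(acs_cells, cd_info, by = c("year", "cd"))
poststrat_toy
```

The join leaves the cell counts untouched; it only repeats each district-level value across that district's cells.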

## Step 5. (Optional) Expand Population Tables

Often, the census alone does not provide all the variables one would want to post-stratify on (i.e. adjust for non-representativeness). Party affiliation, which is not among the Census variables, is a prominent example. The package provides open-source software to implement this expansion. See vignette("synth").
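To give a sense of what expanding a population table means, the sketch below splits each cell's count across a new variable using assumed conditional proportions. This is one simple approach under a strong assumption (the new variable depends only on educ) and is not necessarily the method described in vignette("synth"). All names and numbers are made up.

```r
# Toy population table (made-up counts).
pop <- data.frame(
  cd    = c("A", "A"),
  educ  = c("HS or Less", "College"),
  count = c(100, 200)
)

# Assumed distribution of a new variable within each educ level (made up).
# The proportions sum to 1 within each educ level.
cond <- data.frame(
  educ  = rep(c("HS or Less", "College"), each = 2),
  party = rep(c("party_1", "party_2"), times = 2),
  prob  = c(0.4, 0.6, 0.7, 0.3)
)

# Each original cell is split into one row per level of the new variable.
expanded <- merge(pop, cond, by = "educ")
expanded$count <- expanded$count * expanded$prob
expanded$prob <- NULL
expanded
```

Because the proportions sum to one within each educ level, the expanded table keeps the same total population as the original.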

## Final Output

Under this workflow, only three objects are needed to conduct MRP with a brms model:

1. The model specification, which can be a plain-text brms formula (Step 3). All variables on the RHS should appear in both the survey data and the poststratification table.
2. The survey data (Step 2).
3. The poststratification table from get_poststrat(), possibly expanded via synthetic methods (Step 5).
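The requirement that formula variables appear in both tables can be checked before fitting. Below is a minimal sketch using base R's all.vars(); the helper `check_formula_vars()` is a hypothetical name and not a function of this package.

```r
fm_brm <- response ~ age + gender + educ + pct_trump + (1 | cd)

# Hypothetical helper: report formula variables missing from either table.
# Assumes a two-sided formula, so formula[[2]] is the LHS and formula[[3]] the RHS.
check_formula_vars <- function(formula, survey, poststrat) {
  lhs <- all.vars(formula[[2]])
  rhs <- all.vars(formula[[3]])
  list(
    missing_in_survey    = setdiff(c(lhs, rhs), names(survey)),
    missing_in_poststrat = setdiff(rhs, names(poststrat))
  )
}

# Toy tables with zero rows; only the column names matter for the check.
survey_toy <- data.frame(response = integer(), age = character(),
                         gender = character(), educ = character(),
                         cd = character())
post_toy <- data.frame(age = character(), gender = character(),
                       educ = character(), pct_trump = numeric(),
                       cd = character(), count = numeric())

check_formula_vars(fm_brm, survey_toy, post_toy)
```

In this toy case the survey data lack pct_trump, which signals that the district-level covariate still has to be merged into the survey before fitting.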

Together, these uniquely define one MRP model for one operationalization of one question with a particular set of covariates.
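Once a model has been fit, the post-stratification step itself is a count-weighted average of cell-level predictions within each district. The sketch below shows that arithmetic on toy data. The predictions are made-up numbers; in practice they would come from the fitted model, for example via brms::posterior_epred(), one posterior draw at a time.

```r
# Toy cells: predicted probability per cell (made up) and population count.
cells <- data.frame(
  cd    = c("A", "A", "B", "B"),
  pred  = c(0.2, 0.6, 0.5, 0.7),
  count = c(100, 300, 200, 200)
)

# District estimate = sum(pred * count) / sum(count) within each district.
num <- tapply(cells$pred * cells$count, cells$cd, sum)
den <- tapply(cells$count, cells$cd, sum)
mrp_est <- num / den
mrp_est
```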