Model synthetic joint distribution with multinomial logit

Imputes the counts of a joint distribution of categorical variables for small areas based on microdata. synth_smoothfix() is essentially synth_mlogit() followed by a correction step that can be applied when the marginal distribution of the outcome is known. See Details.

synth_mlogit(formula, microdata, poptable, area_var, count_var = "count")

synth_smoothfix(
  formula,
  microdata,
  poptable,
  fix_to,
  area_var,
  count_var = "count"
)

Source

Jonathan P. Kastellec, Jeffrey R. Lax, Michael Malecki, and Justin H. Phillips (2015). Polarizing the electoral connection: Partisan representation in Supreme Court confirmation politics. The Journal of Politics 77:3 http://dx.doi.org/10.1086/681261

Soichiro Yamauchi (2021). emlogit: Implementing the ECM algorithm for multinomial logit model. R package version 0.1.1. https://github.com/soichiroy/emlogit

Yair Ghitza and Mark Steitz (2020). DEEP-MAPS Model of the Labor Force. Working Paper. https://github.com/Catalist-LLC/unemployment

Arguments

formula: A representation of the aggregate imputation or "outcome" model, of the form X_{K} ~ X_{1} + ... + X_{K - 1}.
microdata: The survey table that the multinomial model will be built off. Must contain all variables in the LHS and RHS of formula.
poptable: The population table, collapsed in terms of counts. Must contain all variables in the RHS of formula, as well as the variables specified in area_var and count_var below.
area_var: A character vector naming the variable that identifies the area of interest (for example, "cd").
count_var: A character string naming the variable in poptable that holds the count.
fix_to: A dataset with only marginal counts or proportions of the outcome in question, by each area. Proportions will be corrected so that the margins of the synthetic joint will match these, with a simple ratio.
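
The "simple ratio" correction described for fix_to can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the package's implementation: the data frames joint and targets, the columns implied, target and count_fixed, and all values are hypothetical. Each cell is scaled by the ratio of the known outcome margin to the margin implied by the synthetic joint.

```r
# Hypothetical synthetic joint for one area: counts by covariate cell and outcome
joint <- data.frame(
  cd    = "A",
  race  = c("White", "White", "Black", "Black"),
  educ  = c("HS", "College", "HS", "College"),
  count = c(300, 300, 250, 150)
)

# Hypothetical known outcome margins for the same area (the role played by fix_to)
targets <- data.frame(
  cd     = "A",
  educ   = c("HS", "College"),
  target = c(500, 500)
)

# Outcome margin currently implied by the synthetic joint
implied <- aggregate(count ~ cd + educ, data = joint, FUN = sum)
names(implied)[names(implied) == "count"] <- "implied"

# Scale each cell by target / implied so that the outcome margin matches
fixed <- merge(joint, implied, by = c("cd", "educ"))
fixed <- merge(fixed, targets, by = c("cd", "educ"))
fixed$count_fixed <- fixed$count * fixed$target / fixed$implied

# The corrected outcome margins now equal the targets
tapply(fixed$count_fixed, fixed$educ, sum)
```

Note that a single ratio pass of this kind matches the outcome margins only; the covariate margins (here, race) generally shift as a result.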

Value

A dataframe with a similar format as the poptable table but with rows expanded to serve as a joint distribution. In general, if the variable of interest has L values, the final dataset will have L times more rows than poptable. The data will have additional variables:

The outcome variable of interest (Z). For example if the LHS of the formula was party_id, then there would be a column called party_id containing the values of that variable in long form.
prX: The known distribution of the covariates (RHS) within the area, i.e. Pr(X | A).
prZ_givenX: The main estimate from the multinomial logit model. Formally, Pr(Z | X, A), although this is usually the same value for every A and thus equal to Pr(Z | X) unless A is on the RHS as well.
prXZ: A new estimate for the joint distribution within the area, i.e. Pr(Z, X | A). Computed by prX * prZ_givenX.
count: A new count variable. Simply the product of the area's total count and prXZ.
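
How the output columns relate to one another can be illustrated with a small base R sketch. The tables pop and pred, the helper column n_area, and all values are hypothetical, and the sketch assumes that count is the area total multiplied by prXZ.

```r
# Toy population table for one area: counts by covariate cell
pop <- data.frame(
  cd     = "A",
  female = c(0, 1),
  count  = c(400, 600)
)

# Hypothetical model output: Pr(Z | X) for a two-category outcome
pred <- data.frame(
  female     = c(0, 0, 1, 1),
  pid        = c("D", "R", "D", "R"),
  prZ_givenX = c(0.5, 0.5, 0.7, 0.3)
)

# Pr(X | A): each cell's share of its area total
pop$n_area <- ave(pop$count, pop$cd, FUN = sum)
pop$prX    <- pop$count / pop$n_area

# Expand to one row per (cell, outcome level) and form the joint
joint <- merge(pop[, c("cd", "female", "prX", "n_area")], pred, by = "female")
joint$prXZ  <- joint$prX * joint$prZ_givenX   # Pr(Z, X | A)
joint$count <- joint$n_area * joint$prXZ      # assumed: area total times prXZ
joint <- joint[, c("cd", "female", "pid", "prX", "prZ_givenX", "prXZ", "count")]
joint
```

Within each area, prXZ sums to 1 and count sums to the area total, so the expanded table remains a proper joint distribution.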

Details

In this setup, the population distribution table (poptable) has the joint distribution of (A, X_{1}, ..., X_{K - 1}) categorical variables where A denotes a categorical small area, Xs denote categorical covariates, and the missing covariate is X_{K}.

Now, the survey data (microdata) has a sample joint distribution of (X_{1}, ..., X_{K - 1}, X_{K}) categorical variables, but the sample size is too small for small areas. Therefore, the function fits a multinomial outcome model roughly of the form X_{K} ~ X_{1} + ... + X_{K - 1} and predicts onto poptable to estimate the joint distribution of (A, X_{1}, ..., X_{K}).
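
The fit-then-predict step described above can be sketched as follows. This is an illustration under stated assumptions, not the package's code: nnet::multinom() stands in for the package's own multinomial logit estimator, and the simulated microdata and poptable objects and their values are hypothetical.

```r
library(nnet)  # stand-in estimator; a recommended package shipped with R

set.seed(1)

# Hypothetical survey microdata: outcome pid plus two covariates
n <- 500
microdata <- data.frame(
  age    = factor(sample(c("18-34", "35-64", "65+"), n, replace = TRUE)),
  female = sample(0:1, n, replace = TRUE)
)
p_dem <- ifelse(microdata$female == 1, 0.6, 0.4)
microdata$pid <- factor(ifelse(
  runif(n) < p_dem, "D",
  ifelse(runif(n) < 0.5, "R", "I")
))

# Hypothetical population table: counts by area and covariate cell, no outcome
poptable <- expand.grid(
  cd     = c("A", "B"),
  age    = levels(microdata$age),
  female = 0:1,
  stringsAsFactors = FALSE
)
poptable$age   <- factor(poptable$age, levels = levels(microdata$age))
poptable$count <- seq(100, by = 50, length.out = nrow(poptable))

# Fit the outcome model on the survey, then predict Pr(Z | X) for every cell
fit   <- multinom(pid ~ age + female, data = microdata, trace = FALSE)
probs <- predict(fit, newdata = poptable, type = "probs")

# Pr(X | A), then expand to one row per (area, cell, outcome level)
poptable$prX <- poptable$count / ave(poptable$count, poptable$cd, FUN = sum)
joint <- do.call(rbind, lapply(colnames(probs), function(z) {
  cbind(poptable, pid = z, prZ_givenX = unname(probs[, z]))
}))
joint$prXZ <- joint$prX * joint$prZ_givenX
```

Because the area variable is not on the right-hand side of this formula, the predicted Pr(Z | X) is identical across areas, and areas differ only through prX.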

Currently, this function does not support post-stratification based on a known aggregate distribution -- that is, further adjusting the probabilities based on a known population distribution (see e.g. Leemann and Wasserfallen, AJPS, https://doi.org/10.1111/ajps.12319).

Examples

 library(dplyr)
 library(ccesMRPprep)

 # Impute the joint distribution of party ID with race, sex, and age, using
 # survey data in NY.

 synth_acs <- synth_mlogit(pid3 ~ race + age + female,
                           microdata = cc18_NY,
                           poptable = acs_race_NY,
                           area_var = "cd")
#> Warning: NAs in the microdata -- dropping data

 # original (27 districts x 2 sex x 5 age x 6 race categories)
 count(acs_race_NY, cd, female, age, race, wt = count)
#> # A tibble: 1,620 × 5
#>    cd    female age            race                n
#>    <chr>  <int> <fct>          <fct>           <dbl>
#>  1 NY-01      0 18 to 24 years White           20604
#>  2 NY-01      0 18 to 24 years Black            2654
#>  3 NY-01      0 18 to 24 years Hispanic         6264
#>  4 NY-01      0 18 to 24 years Asian            3121
#>  5 NY-01      0 18 to 24 years Native American     0
#>  6 NY-01      0 18 to 24 years All Other        1480
#>  7 NY-01      0 25 to 34 years White           27437
#>  8 NY-01      0 25 to 34 years Black            2425
#>  9 NY-01      0 25 to 34 years Hispanic        10730
#> 10 NY-01      0 25 to 34 years Asian            3168
#> # … with 1,610 more rows

 # new, modeled (original x 5 party categories)
 synth_acs
#> # A tibble: 8,100 × 9
#>    cd    race  age            female    prX pid3        prZ_givenX    prXZ count
#>    <chr> <fct> <fct>           <int>  <dbl> <fct>            <dbl>   <dbl> <dbl>
#>  1 NY-01 White 18 to 24 years      0 0.0358 Democrat        0.365  0.0130  7516.
#>  2 NY-01 White 18 to 24 years      0 0.0358 Republican      0.253  0.00903 5206.
#>  3 NY-01 White 18 to 24 years      0 0.0358 Independent     0.251  0.00899 5180.
#>  4 NY-01 White 18 to 24 years      0 0.0358 Other           0.0612 0.00219 1261.
#>  5 NY-01 White 18 to 24 years      0 0.0358 Not Sure        0.0700 0.00250 1441.
#>  6 NY-01 White 18 to 24 years      1 0.0355 Democrat        0.391  0.0139  8013.
#>  7 NY-01 White 18 to 24 years      1 0.0355 Republican      0.220  0.00782 4504.
#>  8 NY-01 White 18 to 24 years      1 0.0355 Independent     0.186  0.00659 3800.
#>  9 NY-01 White 18 to 24 years      1 0.0355 Other           0.0384 0.00136  786.
#> 10 NY-01 White 18 to 24 years      1 0.0355 Not Sure        0.165  0.00586 3378.
#> # … with 8,090 more rows

 # See the data elec_NY to see if these numbers look reasonable.


if (FALSE) {
  # another example -- imputing education
  library(ccesMRPrun)
  synth_mlogit(educ ~ age + female,
              microdata = cces_GA,
              poptable = acs_GA,
              area_var = "cd")
}


# synth_mlogit WITH MARGINS CORRECTION -----

library(dplyr)
# suppose we know the joint distribution of (race x age x female) and we know
# the marginal distribution of (educ), by CD, but we don't know the joint of the two.

educ_target <- count(acs_educ_NY, cd, educ, wt = count, name = "count")
pop_syn <- synth_smoothfix(educ ~ race + age + female,
                         microdata = cc18_NY,
                         fix_to = educ_target,
                         poptable = acs_race_NY,
                         area_var = "cd")