Imputes the counts of a joint distribution of count variables for small areas
based on microdata. `synth_smoothfix()` is essentially `synth_mlogit()`
followed by a correction that is applied when the marginals are known. See Details.

    synth_mlogit(formula, microdata, poptable, area_var, count_var = "count")

    synth_smoothfix(
      formula,
      microdata,
      poptable,
      fix_to,
      area_var,
      count_var = "count"
    )

| Argument | Description |
|---|---|
| formula | A representation of the aggregate imputation or "outcome" model, of the form |
| microdata | The survey table that the multinomial model will be built off. Must contain all variables in the LHS and RHS of |
| poptable | The population table, collapsed in terms of counts. Must contain all variables in the RHS of |
| area_var | A character vector of the area of interest. |
| count_var | A character variable that specifies which variable in |
| fix_to | A dataset with only marginal counts or proportions of the outcome in question, by each area. Proportions will be corrected so that the margins of the synthetic joint will match these, with a simple ratio. |
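
The "simple ratio" correction described for `fix_to` can be illustrated with a small base R sketch. This is not the package's implementation: the data are made up, and the column names `target`, `current`, `ratio`, and `count_fixed` are hypothetical. Within each area, the synthetic counts for each outcome level are rescaled so that they sum to the known marginal count.

```r
# Toy synthetic joint for one area: counts by covariate cell (x) and outcome (z).
# All values are hypothetical.
synth <- data.frame(
  cd    = "A",
  x     = c("young", "young", "old", "old"),
  z     = c("hs", "college", "hs", "college"),
  count = c(30, 10, 40, 20)
)

# Known marginal counts of the outcome in the same area (plays the role of fix_to).
fix_to <- data.frame(cd = "A", z = c("hs", "college"), target = c(60, 40))

# Current synthetic margin of the outcome, by area.
current <- aggregate(count ~ cd + z, data = synth, FUN = sum)
names(current)[names(current) == "count"] <- "current"

# Ratio of the known margin to the synthetic margin.
adj <- merge(current, fix_to, by = c("cd", "z"))
adj$ratio <- adj$target / adj$current

# Rescale every cell by the ratio for its (area, outcome) pair.
fixed <- merge(synth, adj[, c("cd", "z", "ratio")], by = c("cd", "z"))
fixed$count_fixed <- fixed$count * fixed$ratio

# The corrected outcome margins now match the targets.
aggregate(count_fixed ~ cd + z, data = fixed, FUN = sum)
```

In this sketch a single ratio step matches the outcome margin exactly, but it can shift the covariate margin slightly (here the "young" cells no longer sum to exactly 40).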

Jonathan P. Kastellec, Jeffrey R. Lax, Michael Malecki, and Justin H.
Phillips (2015). Polarizing the electoral connection: Partisan representation in
Supreme Court confirmation politics. *The Journal of Politics* 77:3 http://dx.doi.org/10.1086/681261

Soichiro Yamauchi (2021). emlogit: Implementing the ECM algorithm for multinomial logit model. R package version 0.1.1. https://github.com/soichiroy/emlogit

Yair Ghitza and Mark Steitz (2020). DEEP-MAPS Model of the Labor Force. Working Paper. https://github.com/Catalist-LLC/unemployment

A dataframe with a format similar to the `poptable` table, but with rows
expanded to serve as a joint distribution. In general, if the variable of
interest has `L` values, the final dataset will have `L` times as many rows as
`poptable`. The data will have additional variables:

- The outcome variable of interest (`Z`). For example, if the LHS of the formula was `party_id`, then there would be a column called `party_id` containing the values of that variable in long form.
- `prX`: The known distribution of the covariates (RHS) within the area, i.e. `Pr(X | A)`.
- `prZ_givenX`: The main estimate from the multinomial logit model. Formally, `Pr(Z | X, A)`, although this is usually the same value for every `A` and thus equal to `Pr(Z | X)` unless `A` is on the RHS as well.
- `prXZ`: A new estimate for the joint distribution within the area, i.e. `Pr(Z, X | A)`. Computed by `prX * prZ_givenX`.
- `count`: A new count variable. Simply the product of `n_aggregate` and `pr_pred`.
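
The relationship between these columns can be checked by hand. The sketch below uses made-up numbers for a single area with two covariate cells and two outcome levels. It assumes the count is the area total times `prXZ`, which is consistent with the example output further down; `n_area` is a hypothetical name for that total.

```r
# Hypothetical inputs for a single area.
n_area <- 1000                      # total population count in the area
prX <- c(young = 0.4, old = 0.6)    # Pr(X | A), from the population table

# Pr(Z | X): one row per covariate cell; each row sums to 1,
# as predictions from a fitted multinomial model would.
prZ_givenX <- rbind(
  young = c(dem = 0.7, rep = 0.3),
  old   = c(dem = 0.5, rep = 0.5)
)

# Pr(Z, X | A) = Pr(X | A) * Pr(Z | X, A).
# prX is recycled down the rows, so each row is scaled by its own prX.
prXZ <- prX * prZ_givenX

# Estimated cell counts: area total times the joint probability.
count <- n_area * prXZ

sum(prXZ)        # joint probabilities sum to 1 within the area
rowSums(count)   # recovers the known covariate counts (400 and 600)
```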

In this setup, the population distribution table (`poptable`) has the joint
distribution of `(A, X_{1}, ..., X_{K - 1})` categorical variables, where `A`
denotes a categorical small area, the `X`s denote categorical covariates, and
the missing covariate is `X_{K}`.

Now, the survey data (`microdata`) has a sample joint distribution of
`(X_{1}, ..., X_{K - 1}, X_{K})` categorical variables, but the sample size is
too small for small areas. Therefore, the function fits a multinomial outcome
model roughly of the form `X_{K} ~ X_{1} + ... + X_{K - 1}` and predicts onto
`poptable` to estimate the joint distribution of `(A, X_{1}, ..., X_{K})`.
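
The fit-then-predict idea can be sketched on simulated data. The example below is illustrative only: it swaps in `nnet::multinom()` as a generic multinomial logit fitter rather than whatever the package uses internally, and every dataset and column name in it (`area`, `x1`, `x2`, `z`) is hypothetical.

```r
library(nnet)  # generic multinomial logit, used here only for illustration

set.seed(1)

# Hypothetical survey microdata: outcome z and covariates x1, x2.
n <- 500
microdata <- data.frame(
  x1 = factor(sample(c("a", "b"), n, replace = TRUE)),
  x2 = factor(sample(c("u", "v"), n, replace = TRUE))
)
z_levels <- c("dem", "rep", "ind")
microdata$z <- factor(
  ifelse(microdata$x1 == "a",
         sample(z_levels, n, replace = TRUE, prob = c(0.6, 0.2, 0.2)),
         sample(z_levels, n, replace = TRUE, prob = c(0.2, 0.6, 0.2))),
  levels = z_levels
)

# Hypothetical population table: counts by area and covariates, without z.
poptable <- expand.grid(
  area = c("A1", "A2"),
  x1 = c("a", "b"),
  x2 = c("u", "v"),
  stringsAsFactors = TRUE
)
poptable$count <- c(100, 300, 200, 100, 150, 250, 50, 350)

# Fit the outcome model on the survey and predict Pr(Z | X) for every cell.
fit <- multinom(z ~ x1 + x2, data = microdata, trace = FALSE)
probs <- predict(fit, newdata = poptable, type = "probs")  # 8 x 3 matrix

# Pr(X | A): share of each covariate cell within its area.
poptable$prX <- poptable$count / ave(poptable$count, poptable$area, FUN = sum)

# Expand to long form: one row per (area, x1, x2, z).
joint <- do.call(rbind, lapply(colnames(probs), function(lev) {
  data.frame(
    poptable[, c("area", "x1", "x2", "prX")],
    z = lev,
    prZ_givenX = probs[, lev]
  )
}))
joint$prXZ <- joint$prX * joint$prZ_givenX

# The estimated joint sums to 1 within each area.
tapply(joint$prXZ, joint$area, sum)
```

Because the model has no area term, the predicted `prZ_givenX` is identical across areas for a given covariate cell; areas differ only through `prX`.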

Currently, this function does not support post-stratification based on a known aggregate distribution, that is, further adjusting the probabilities based on a known population distribution (see e.g. Leemann and Wasserfallen, AJPS, https://doi.org/10.1111/ajps.12319).

    library(dplyr)

    # Impute the joint distribution of party ID with race, sex, and age, using
    # survey data in NY.
    synth_acs <- synth_mlogit(pid3 ~ race + age + female,
                              microdata = cc18_NY,
                              poptable = acs_race_NY,
                              area_var = "cd")
    #> Warning: NAs in the microdata -- dropping data

    # original (27 districts x 2 sex x 5 age x 6 race categories)
    count(acs_race_NY, cd, female, age, race, wt = count)
    #> # A tibble: 1,620 x 5
    #>    cd    female age            race                n
    #>    <chr>  <int> <fct>          <fct>           <dbl>
    #>  1 NY-01      0 18 to 24 years White           20604
    #>  2 NY-01      0 18 to 24 years Black            2654
    #>  3 NY-01      0 18 to 24 years Hispanic         6264
    #>  4 NY-01      0 18 to 24 years Asian            3121
    #>  5 NY-01      0 18 to 24 years Native American     0
    #>  6 NY-01      0 18 to 24 years All Other        1480
    #>  7 NY-01      0 25 to 34 years White           27437
    #>  8 NY-01      0 25 to 34 years Black            2425
    #>  9 NY-01      0 25 to 34 years Hispanic        10730
    #> 10 NY-01      0 25 to 34 years Asian            3168
    #> # … with 1,610 more rows

    # new, modeled (original x 5 party categories)
    synth_acs
    #> # A tibble: 8,100 x 9
    #>    cd    race  age            female    prX pid3        prZ_givenX    prXZ count
    #>    <chr> <fct> <fct>           <int>  <dbl> <fct>            <dbl>   <dbl> <dbl>
    #>  1 NY-01 White 18 to 24 years      0 0.0358 Democrat        0.365  0.0130  7516.
    #>  2 NY-01 White 18 to 24 years      0 0.0358 Republican      0.253  0.00903 5206.
    #>  3 NY-01 White 18 to 24 years      0 0.0358 Independent     0.251  0.00899 5180.
    #>  4 NY-01 White 18 to 24 years      0 0.0358 Other           0.0612 0.00219 1261.
    #>  5 NY-01 White 18 to 24 years      0 0.0358 Not Sure        0.0700 0.00250 1441.
    #>  6 NY-01 White 18 to 24 years      1 0.0355 Democrat        0.391  0.0139  8013.
    #>  7 NY-01 White 18 to 24 years      1 0.0355 Republican      0.220  0.00782 4504.
    #>  8 NY-01 White 18 to 24 years      1 0.0355 Independent     0.186  0.00659 3800.
    #>  9 NY-01 White 18 to 24 years      1 0.0355 Other           0.0384 0.00136  786.
    #> 10 NY-01 White 18 to 24 years      1 0.0355 Not Sure        0.165  0.00586 3378.
    #> # … with 8,090 more rows

    # See the data elec_NY to see if these numbers look reasonable.

    if (FALSE) {
    # another example -- imputing education
    library(ccesMRPrun)
    synth_mlogit(educ ~ age + female,
                 microdata = cces_GA,
                 poptable = acs_GA,
                 area_var = "cd")
    }

    # synth_mlogit WITH MARGINS CORRECTION -----
    library(dplyr)

    # Suppose we know the joint distribution of (race x age x female) and the
    # marginal distribution of (educ), by CD, but not the joint of the two.
    educ_target <- count(acs_educ_NY, cd, educ, wt = count, name = "count")
    pop_syn <- synth_smoothfix(educ ~ race + age + female,
                               microdata = cc18_NY,
                               fix_to = educ_target,
                               poptable = acs_race_NY,
                               area_var = "cd")