Imputes the counts of a joint distribution of count variables for small areas
based on microdata. synth_smoothfix() is essentially synth_mlogit() followed by a
correction step that applies when the marginals are known. See Details.
synth_mlogit(formula, microdata, poptable, area_var, count_var = "count")
synth_smoothfix(
  formula,
  microdata,
  poptable,
  fix_to,
  area_var,
  count_var = "count"
)
Jonathan P. Kastellec, Jeffrey R. Lax, Michael Malecki, and Justin H. Phillips (2015). Polarizing the electoral connection: Partisan representation in Supreme Court confirmation politics. The Journal of Politics 77:3 http://dx.doi.org/10.1086/681261
Soichiro Yamauchi (2021). emlogit: Implementing the ECM algorithm for multinomial logit model. R package version 0.1.1. https://github.com/soichiroy/emlogit
Yair Ghitza and Mark Steitz (2020). DEEP-MAPS Model of the Labor Force. Working Paper. https://github.com/Catalist-LLC/unemployment
formula: A representation of the aggregate imputation or "outcome" model, of the form X_{K} ~ X_{1} + ... + X_{K - 1}.
microdata: The survey table that the multinomial model will be built on. Must contain all variables in the LHS and RHS of formula.
poptable: The population table, collapsed in terms of counts. Must contain all variables in the RHS of formula, as well as the variables specified in area_var and count_var below.
area_var: A character vector naming the area variable of interest, e.g. "cd".
count_var: A character string that specifies which variable in poptable holds the count.
fix_to: A dataset with only the marginal counts or proportions of the outcome in question, by area. The synthetic joint is rescaled with a simple ratio so that its outcome margins match these.
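The "simple ratio" correction toward fix_to can be illustrated with a minimal base R sketch. Every object and column name here (joint, target, margin, count_fixed) and all numbers are hypothetical, and this is one plausible reading of the description rather than the package's own implementation.

```r
# Hypothetical synthetic joint for one area: counts by covariate cell x and outcome z.
joint <- data.frame(
  area  = "A",
  x     = c("old", "old", "young", "young"),
  z     = c("HS", "College", "HS", "College"),
  count = c(300, 300, 120, 280)
)

# Hypothetical known marginal counts of the outcome in that area.
target <- data.frame(area = "A", z = c("HS", "College"), target = c(500, 500))

# Outcome margin currently implied by the synthetic joint.
joint$margin <- ave(joint$count, joint$area, joint$z, FUN = sum)

# Attach the targets and rescale each cell by target / current margin.
joint <- merge(joint, target, by = c("area", "z"))
joint$count_fixed <- joint$count * joint$target / joint$margin

# The corrected outcome margins now equal the targets.
tapply(joint$count_fixed, joint$z, sum)
```

A one-step ratio like this matches the outcome margins exactly, but it can shift the covariate margins slightly.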
A dataframe with a format similar to the poptable table, but with rows expanded to serve as a joint distribution. In general, if the variable of interest has L values, the final dataset will have L times as many rows as poptable. The data will have additional variables:
The outcome variable of interest (Z). For example, if the LHS of the formula was party_id, then there would be a column called party_id containing the values of that variable in long form.
prX: The known distribution of the covariates (RHS) within the area, i.e. Pr(X | A).
prZ_givenX: The main estimate from the multinomial logit model. Formally, Pr(Z | X, A), although this is usually the same value for every A, and thus equal to Pr(Z | X), unless A is on the RHS as well.
prXZ: A new estimate for the joint distribution within the area, i.e. Pr(Z, X | A). Computed as prX * prZ_givenX.
count: A new count variable. Simply the product of n_aggregate (the total count in the area) and pr_pred (the predicted joint probability, prXZ).
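The relationship between these columns can be checked by hand. The following is a minimal base R sketch with invented numbers. The objects toy and n_area are illustrative and not part of the package, and the sketch assumes that count is the area's total count times prXZ, which is consistent with the example output.

```r
# Toy joint for a single area with two covariate cells and two outcome values.
# All numbers are invented for illustration.
toy <- data.frame(
  cd         = "A-01",
  x          = c("young", "young", "old", "old"),
  z          = c("Dem", "Rep", "Dem", "Rep"),
  prX        = c(0.4, 0.4, 0.6, 0.6),  # Pr(X | A), repeated for each value of Z
  prZ_givenX = c(0.7, 0.3, 0.5, 0.5)   # Pr(Z | X, A), sums to 1 within each X cell
)
n_area <- 1000  # assumed total count in the area

# Joint probability within the area, and the implied count.
toy$prXZ  <- toy$prX * toy$prZ_givenX
toy$count <- n_area * toy$prXZ

toy
```

Within an area, prXZ sums to one and count sums to the area total.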
In this setup, the population distribution table (poptable) has the joint distribution of (A, X_{1}, ..., X_{K - 1}) categorical variables, where A denotes a categorical small area, the Xs denote categorical covariates, and the missing covariate is X_{K}.
Now, the survey data (microdata) has a sample joint distribution of (X_{1}, ..., X_{K - 1}, X_{K}) categorical variables, but the sample size is too small for small areas. Therefore, the function fits a multinomial outcome model roughly of the form X_{K} ~ X_{1} + ... + X_{K - 1} and predicts onto poptable to estimate the joint distribution of (A, X_{1}, ..., X_{K}).
Currently, this function does not support post-stratification based on a known aggregate distribution, that is, further adjusting the probabilities based on a known population distribution (see e.g. Leemann and Wasserfallen, AJPS, https://doi.org/10.1111/ajps.12319).
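The procedure described above can be sketched end to end on simulated data. This is only an illustration under stated assumptions: it uses nnet::multinom() as a stand-in multinomial logit fitter, not the estimator the package itself uses, and the survey, pop, and joint objects and their columns are hypothetical.

```r
library(nnet)  # recommended package shipped with R; used here as a stand-in fitter

set.seed(1)
n <- 300

# Hypothetical survey microdata: one covariate x and a three-level outcome z.
survey <- data.frame(x = factor(sample(c("old", "young"), n, replace = TRUE)))
survey$z <- factor(ifelse(
  survey$x == "young",
  sample(c("Dem", "Ind", "Rep"), n, replace = TRUE, prob = c(0.6, 0.2, 0.2)),
  sample(c("Dem", "Ind", "Rep"), n, replace = TRUE, prob = c(0.3, 0.2, 0.5))
))

# Hypothetical population table: counts of x within each area.
pop <- data.frame(
  area  = c("A", "A", "B", "B"),
  x     = factor(c("old", "young", "old", "young")),
  count = c(600, 400, 300, 700)
)
pop$n_area <- ave(pop$count, pop$area, FUN = sum)  # total count per area
pop$prX    <- pop$count / pop$n_area               # Pr(X | A)

# 1. Fit the outcome model on the survey.
fit <- multinom(z ~ x, data = survey, trace = FALSE)

# 2. Predict Pr(Z | X) for every row of the population table
#    (a matrix with one column per outcome level).
pr <- predict(fit, newdata = pop, type = "probs")

# 3. Expand to long form: one row per (area, x, z) cell.
joint <- do.call(rbind, lapply(colnames(pr), function(lev) {
  cbind(pop, z = lev, prZ_givenX = pr[, lev])
}))

# 4. Joint probability within the area and the implied counts.
joint$prXZ  <- joint$prX * joint$prZ_givenX
joint$count <- joint$n_area * joint$prXZ

joint[order(joint$area, joint$x, joint$z), ]
```

Because the area variable is not in the model, Pr(Z | X) is the same in every area, and the areas differ only through Pr(X | A).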
library(dplyr)
library(ccesMRPprep)
# Impute the joint distribution of party ID with race, sex, and age, using
# survey data in NY.
synth_acs <- synth_mlogit(pid3 ~ race + age + female,
                          microdata = cc18_NY,
                          poptable = acs_race_NY,
                          area_var = "cd")
#> Warning: NAs in the microdata -- dropping data
# original (27 districts x 2 sex x 5 age x 6 race categories)
count(acs_race_NY, cd, female, age, race, wt = count)
#> # A tibble: 1,620 × 5
#>    cd    female age            race                n
#>    <chr>  <int> <fct>          <fct>           <dbl>
#>  1 NY-01      0 18 to 24 years White           20604
#>  2 NY-01      0 18 to 24 years Black            2654
#>  3 NY-01      0 18 to 24 years Hispanic         6264
#>  4 NY-01      0 18 to 24 years Asian            3121
#>  5 NY-01      0 18 to 24 years Native American     0
#>  6 NY-01      0 18 to 24 years All Other        1480
#>  7 NY-01      0 25 to 34 years White           27437
#>  8 NY-01      0 25 to 34 years Black            2425
#>  9 NY-01      0 25 to 34 years Hispanic        10730
#> 10 NY-01      0 25 to 34 years Asian            3168
#> # … with 1,610 more rows
# new, modeled (original x 5 party categories)
synth_acs
#> # A tibble: 8,100 × 9
#>    cd    race  age            female    prX pid3        prZ_givenX    prXZ count
#>    <chr> <fct> <fct>           <int>  <dbl> <fct>            <dbl>   <dbl> <dbl>
#>  1 NY-01 White 18 to 24 years      0 0.0358 Democrat        0.365  0.0130  7516.
#>  2 NY-01 White 18 to 24 years      0 0.0358 Republican      0.253  0.00903 5206.
#>  3 NY-01 White 18 to 24 years      0 0.0358 Independent     0.251  0.00899 5180.
#>  4 NY-01 White 18 to 24 years      0 0.0358 Other           0.0612 0.00219 1261.
#>  5 NY-01 White 18 to 24 years      0 0.0358 Not Sure        0.0700 0.00250 1441.
#>  6 NY-01 White 18 to 24 years      1 0.0355 Democrat        0.391  0.0139  8013.
#>  7 NY-01 White 18 to 24 years      1 0.0355 Republican      0.220  0.00782 4504.
#>  8 NY-01 White 18 to 24 years      1 0.0355 Independent     0.186  0.00659 3800.
#>  9 NY-01 White 18 to 24 years      1 0.0355 Other           0.0384 0.00136  786.
#> 10 NY-01 White 18 to 24 years      1 0.0355 Not Sure        0.165  0.00586 3378.
#> # … with 8,090 more rows
# See the data elec_NY to see if these numbers look reasonable.
if (FALSE) {
  # another example -- imputing education
  library(ccesMRPrun)
  synth_mlogit(educ ~ age + female,
               microdata = cces_GA,
               poptable = acs_GA,
               area_var = "cd")
}
# synth_mlogit WITH MARGINS CORRECTION -----
library(dplyr)
# suppose we know the distribution of (race x age x female) and the
# distribution of (educ), by CD, but we don't know the joint of the two.
educ_target <- count(acs_educ_NY, cd, educ, wt = count, name = "count")
pop_syn <- synth_smoothfix(educ ~ race + age + female,
                           microdata = cc18_NY,
                           fix_to = educ_target,
                           poptable = acs_race_NY,
                           area_var = "cd")