vignettes/derived.Rmd
Responses to survey questions are inherently discrete, so the analyst must decide how to derive (or operationalize) them into a numeric indicator. This decision rarely gets much attention, and it is difficult to represent in a general function.
`factor` is a base R variable class in which values are defined by an exclusive and exhaustive list of levels, each paired with a label (see Advanced R for a brief technical overview). Both levels and labels are necessary to track because the order in which the values were presented and what the respondent saw (the labels) both matter. The levels can be thought of as fundamentally integers that represent the order. That said, the transformation from R factors to integers is cumbersome. `?base::factor` recommends `as.numeric(levels(f))[f]`, where `f` is a factor vector.
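A minimal sketch of why the conversion is awkward, using toy vectors that are not part of the CCES sample: `as.integer()` returns the internal level codes, while the idiom from `?base::factor` recovers numbers only when the levels themselves are numeric strings.

```r
# Toy factor whose levels are numeric strings (hypothetical data)
f <- factor(c("10", "20", "10", "30"))

# as.integer() gives the position of each level, not the original number
codes <- as.integer(f)

# The idiom from ?base::factor maps each element back to its numeric level
values <- as.numeric(levels(f))[f]

# For a survey-style factor, the level codes are the only integers available
approval <- factor(
  c("Strongly disapprove", "Somewhat approve"),
  levels = c("Strongly approve", "Somewhat approve",
             "Somewhat disapprove", "Strongly disapprove")
)
approval_codes <- as.integer(approval)
```

Here `codes` is `1 2 1 3`, `values` is `10 20 10 30`, and `approval_codes` is `4 2`.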
`haven_labelled` is a separate R class defined in the haven package to represent labelled variables from Stata/SPSS without loss of information. haven is part of the tidyverse and is designed to read Stata and SPSS files. Stata and SPSS have their own analogs to factors, but in Stata these variables are literally integers/doubles with a “labels” attribute. This attribute is a named vector in which the values are the numbers and the names are the equivalent of factor labels (corresponding to Stata’s `label list [lblname]`). `haven_labelled` is a class that preserves this information (see the `?haven::labelled` help page or the vignette from the labelled package for more detail). For example, look at any of the variables in the CCES 2018 sample whose type contains a `lbl` tag:
gov_approval_lbl <- select(cc18_samp, case_id, CC18_308d)
gov_approval_lbl
#> # A tibble: 1,000 × 2
#> case_id CC18_308d
#> <dbl> <dbl+lbl>
#> 1 415395741 4 [Strongly disapprove]
#> 2 414164923 4 [Strongly disapprove]
#> 3 412379892 2 [Somewhat approve]
#> 4 414203529 3 [Somewhat disapprove]
#> 5 412148048 1 [Strongly approve]
#> 6 412329835 2 [Somewhat approve]
#> 7 417352072 4 [Strongly disapprove]
#> 8 414614677 1 [Strongly approve]
#> 9 416797006 1 [Strongly approve]
#> 10 412962561 1 [Strongly approve]
#> # ℹ 990 more rows
Recent versions of haven (>= 2.1.0) display the labels in square brackets if the dataset is a tibble, but we can see that the underlying values are doubles. If we had used factors, these numerical values would be obscured, though the levels and labels would be kept.
gov_approval_fct <- transmute(cc18_samp, case_id, CC18_308d = as_factor(CC18_308d))
gov_approval_fct
#> # A tibble: 1,000 × 2
#> case_id CC18_308d
#> <dbl> <fct>
#> 1 415395741 Strongly disapprove
#> 2 414164923 Strongly disapprove
#> 3 412379892 Somewhat approve
#> 4 414203529 Somewhat disapprove
#> 5 412148048 Strongly approve
#> 6 412329835 Somewhat approve
#> 7 417352072 Strongly disapprove
#> 8 414614677 Strongly approve
#> 9 416797006 Strongly approve
#> 10 412962561 Strongly approve
#> # ℹ 990 more rows
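The same contrast can be reproduced without the CCES sample. The sketch below builds a toy vector with `haven::labelled()`; the values and the variable label are made up for illustration and only mimic the coding of `CC18_308d`.

```r
library(haven)

# Toy labelled vector (hypothetical data mimicking CC18_308d)
x <- labelled(
  c(4, 2, 3, 1),
  labels = c("Strongly approve" = 1, "Somewhat approve" = 2,
             "Somewhat disapprove" = 3, "Strongly disapprove" = 4),
  label = "Job approval (toy example)"
)

# The stored data are still doubles
storage <- typeof(x)

# as_factor() replaces each value with its label; levels follow the values
f <- as_factor(x)
f_chr <- as.character(f)
f_levels <- levels(f)
```

After the conversion the numeric codes are no longer visible in `f`, but the level order still follows the original values.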
For using CCES outcomes, the `haven_labelled` class is more concise than factors, for the following reasons:
(`gov_approval <= 2` instead of `gov_approval %in% c("Strongly approve", "Somewhat approve")`).
The second point is consequential. For example, suppose we want to model a derived variable, Governor approval, which is 1 if the respondent strongly or somewhat approves of the governor. With the `haven_labelled` class, we can do what we would do in Stata:
gov_approval_lbl %>%
mutate(outcome = CC18_308d <= 2) %>%
count(CC18_308d, outcome)
#> # A tibble: 6 × 3
#> CC18_308d outcome n
#> <dbl+lbl> <lgl> <int>
#> 1 1 [Strongly approve] TRUE 144
#> 2 2 [Somewhat approve] TRUE 307
#> 3 3 [Somewhat disapprove] FALSE 170
#> 4 4 [Strongly disapprove] FALSE 265
#> 5 5 [Not sure] FALSE 112
#> 6 NA NA 2
But this will not work with factors:
gov_approval_fct %>%
mutate(outcome = CC18_308d <= "Somewhat approve") %>%
count(outcome)
#> Warning: There was 1 warning in `mutate()`.
#> ℹ In argument: `outcome = CC18_308d <= "Somewhat approve"`.
#> Caused by warning in `Ops.factor()`:
#> ! '<=' not meaningful for factors
#> # A tibble: 1 × 2
#> outcome n
#> <lgl> <int>
#> 1 NA 1000
That means that, with factors, we would need to figure out all the labels and type them exactly as they appear. In surveys, the labels can be long and contain punctuation that is easy to miss. Of course, we can extract these from the attributes and save the trouble of entering them by hand, but the same information is available in the `haven_labelled` class too.
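For comparison, here is a rough sketch of the factor-based workarounds on a toy factor (hypothetical data). Both avoid typing the labels by hand, but they rely on knowing the position of each level.

```r
# Toy factor with the four substantive approval categories
approval <- factor(
  c("Strongly disapprove", "Somewhat approve", "Strongly approve"),
  levels = c("Strongly approve", "Somewhat approve",
             "Somewhat disapprove", "Strongly disapprove")
)

# Option 1: pull the labels from the levels instead of typing them
approve_labels <- levels(approval)[1:2]
outcome_in <- approval %in% approve_labels

# Option 2: compare the integer level codes
outcome_code <- as.integer(approval) <= 2
```

Both options give `FALSE TRUE TRUE` here.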
On the other hand, `haven_labelled` is a bit inconvenient because the raw number is not immediately informative. A quick way to transform the numerical data to its labels is `haven::as_factor` (notice this is different from `base::as.factor`). Or, for metadata, we can look at its attributes:
In the `haven_labelled` class, three pieces of information are stored as attributes:

- `label`: description of the entire variable (in CCES data, this contains a short version of the question or an indication of the question)
- `labels`: name-value pairs

We can use `str` and `attr` to view and extract these metadata:
str(gov_approval_lbl$CC18_308d)
#> 'haven_labelled' num [1:1000] 4 4 2 3 1 2 4 1 1 1 ...
#> - attr(*, "label")= chr "Job approval -- The Governor of $inputstate"
#> - attr(*, "labels")= Named num [1:7] 1 2 3 4 5 8 9
#> ..- attr(*, "names")= chr [1:7] "Strongly approve" "Somewhat approve" "Somewhat disapprove" "Strongly disapprove" ...
For example, this indicates that `CC18_308d` is about "Job approval -- The Governor of $inputstate". This is not the verbatim question wording, of course. The verbatim wording needs to be looked up in the Word Doc YouGov questionnaires each time, or in the CCES codebooks (Shiro and the CCES team have some of this in plain-text tabular form for recent years).
And the possible values are
attr(gov_approval_lbl$CC18_308d, "labels")
#> Strongly approve Somewhat approve Somewhat disapprove Strongly disapprove
#> 1 2 3 4
#> Not sure skipped not asked
#> 5 8 9
where again the numbers are the values and the names of the `labels` vector are the value labels.
As long as we retain these attributes (which is possible as long as haven is loaded), we can express derivation in a simple way.
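As an illustration of using these attributes directly, the sketch below looks up the label of each observed value with `match()`. The vector is a toy stand-in for `CC18_308d` (hypothetical data), not the CCES sample itself.

```r
library(haven)

# Toy labelled vector with the same value labels as CC18_308d
x <- labelled(
  c(4, 2, 5),
  labels = c("Strongly approve" = 1, "Somewhat approve" = 2,
             "Somewhat disapprove" = 3, "Strongly disapprove" = 4,
             "Not sure" = 5, "skipped" = 8, "not asked" = 9),
  label = "Job approval (toy example)"
)

# Extract the two pieces of metadata
var_label <- attr(x, "label")
labs <- attr(x, "labels")

# Map each observed value to the name of the matching entry in `labs`
value_labels <- names(labs)[match(as.numeric(x), labs)]
```

`value_labels` is then `"Strongly disapprove" "Somewhat approve" "Not sure"`.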
Another thing this metadata does not include is whether the response options are binary, ordinal, or categorical (no inherent ordering). This is something we need to hand-classify, although it is usually obvious once we see the question.
We consider three of the most common types below. These are from the sample question metadata (see the `response_type` variable):
questions_samp %>%
filter(q_ID %in% c("CC18_322C","CC18_308d", "CC18_pid3"))
#> # A tibble: 3 × 5
#> q_ID q_label cces_data q_code response_type
#> <chr> <chr> <chr> <chr> <chr>
#> 1 CC18_322C Withold Sanctuary Funding 2018 CC18_322c yesno
#> 2 CC18_308d Governor Approval 2018 CC18_308d ordinal
#> 3 CC18_pid3 Partisan Identification (3-point) 2018 pid3 categorical
This classification narrows down the type of function we would use for deriving a variable.
All three types can be derived by the basic atomic operator:
outcome = lbl_var %in% c(v1, v2, v3, ...)
where `outcome` is the derived variable, `lbl_var` is the data vector of class `haven_labelled`, and `c(v1, v2, v3, ...)` is a vector of one or more values in the data vector that we consider a success.
For this project we only consider binomial models, which is why this is sufficient. Survey data is almost never continuous, and running multinomial models in MRP is still exceedingly uncommon.
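The operator above can be wrapped in a small function. The helper below, `derive_binary()`, is hypothetical and not part of any package. It also makes one assumption explicit: because `%in%` returns `FALSE` rather than `NA` for missing responses, the sketch restores `NA` by hand, which may or may not be what a given analysis wants.

```r
library(haven)

# Hypothetical helper: 1 if the value is in `success`, 0 otherwise,
# NA if the response is missing
derive_binary <- function(lbl_var, success) {
  vals <- as.numeric(lbl_var)           # drop the labelled class, keep values
  out <- as.integer(vals %in% success)  # %in% yields FALSE, never NA
  out[is.na(vals)] <- NA_integer_       # restore missingness explicitly
  out
}

# Toy vector mimicking the coding of pid3 (hypothetical data)
pid3 <- labelled(
  c(3, 2, 1, NA, 5),
  labels = c("Democrat" = 1, "Republican" = 2, "Independent" = 3,
             "Other" = 4, "Not sure" = 5)
)

independent <- derive_binary(pid3, 3)
dem_or_rep <- derive_binary(pid3, c(1, 2))
```

Here `independent` is `1 0 0 NA 0` and `dem_or_rep` is `0 1 1 NA 0`.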
yesno_to_binary
This is the simplest case because there are only two options and it is almost always clear which of the two naturally corresponds to a “success” instead of a “failure” in a Bernoulli random variable. That means we currently only use `yesno_to_binary()` for the derivation. For example, `CC18_322C` is a question asking for people’s support of a measure to withdraw federal funding for so-called “Sanctuary Cities”:
cc18_samp %>%
select(case_id, CC18_322c)
#> # A tibble: 1,000 × 2
#> case_id CC18_322c
#> <dbl> <dbl+lbl>
#> 1 415395741 2 [Oppose]
#> 2 414164923 1 [Support]
#> 3 412379892 2 [Oppose]
#> 4 414203529 1 [Support]
#> 5 412148048 2 [Oppose]
#> 6 412329835 2 [Oppose]
#> 7 417352072 2 [Oppose]
#> 8 414614677 2 [Oppose]
#> 9 416797006 1 [Support]
#> 10 412962561 1 [Support]
#> # ℹ 990 more rows
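The internals of `yesno_to_binary()` are not reproduced here. As a rough hand-written stand-in, and assuming that code 1 (“Support”) is the success, the recode amounts to the following on a toy vector (hypothetical data):

```r
library(haven)

# Toy vector mimicking the coding of CC18_322c
sanctuary <- labelled(
  c(2, 1, 2, 1, NA),
  labels = c("Support" = 1, "Oppose" = 2)
)

# 1 for "Support", 0 for "Oppose"; a missing response stays NA
support <- as.integer(as.numeric(sanctuary) == 1)
```

`support` is `0 1 0 1 NA`. Unlike `%in%`, the `==` comparison propagates missing values.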
Categorical variables have no inherent ordering, i.e. there is no Likert scale or other obvious scale. Examples include vote choice for parties, race, religion, or method of voting. Here we show partisanship (a 3-point multiple-choice question).
str(cc18_samp$pid3)
#> 'haven_labelled' num [1:1000] 3 2 1 2 3 3 1 1 2 3 ...
#> - attr(*, "label")= chr "3 point party ID"
#> - attr(*, "labels")= Named num [1:7] 1 2 3 4 5 8 9
#> ..- attr(*, "names")= chr [1:7] "Democrat" "Republican" "Independent" "Other" ...
To make a derived variable out of this, we need a function that takes an ordered vector of value(s) that count as a success. For example, let’s say we want to measure the proportion of independents. Then the derivation will be
cc18_samp %>%
select(case_id, pid3) %>%
mutate(outcome = pid3 %in% 3) %>%
count(pid3, outcome)
#> # A tibble: 5 × 3
#> pid3 outcome n
#> <dbl+lbl> <lgl> <int>
#> 1 1 [Democrat] FALSE 353
#> 2 2 [Republican] FALSE 272
#> 3 3 [Independent] TRUE 279
#> 4 4 [Other] FALSE 43
#> 5 5 [Not sure] FALSE 53
because `3` stands for independent in this case, as can be seen from the metadata.
Clearly, the “Yes/No” variable type can be considered a special case of the categorical variable type. I still distinguish between them because in the former, the analyst does not need to make a decision about what counts as a success, whereas in `pid3`, for example, three reasonable operations can exist and what is a success as opposed to a failure is ambiguous from the name `pid3`.
Ordinal variables have a clear ordering in their labels and their values conform to that order. Examples include Likert scales (agree-disagree, approve-disapprove) as well as measures like education, income, and news interest.
This means that derivation can use the operators `<` and `>` instead of simply `%in%`, as in the Governor approval example at the beginning. We may want to lump this together with categorical variables, but again the distinction can be meaningful: in an ordinal variable one would almost never “skip” a value, as in `governor_approval %in% c(1, 3, 4)`, whereas in a categorical variable one might do that depending on the order of the values.
Note that one must be careful: almost none of the variables considered here are exhaustively “ordinal”, because they also have values such as “Not Sure” / “Other” / “Not Asked”, some of which are coded as 8 or 9. For example, because `pid3 == 4` means “Other”, a derived variable defined as `pid3 >= 3` is not very meaningful unless you are interested in the group of people who identify as “Independent” or “Other”.
cc18_samp %>%
select(case_id, CC18_308d) %>%
mutate(outcome = CC18_308d <= 2) %>%
count(CC18_308d, outcome)
#> # A tibble: 6 × 3
#> CC18_308d outcome n
#> <dbl+lbl> <lgl> <int>
#> 1 1 [Strongly approve] TRUE 144
#> 2 2 [Somewhat approve] TRUE 307
#> 3 3 [Somewhat disapprove] FALSE 170
#> 4 4 [Strongly disapprove] FALSE 265
#> 5 5 [Not sure] FALSE 112
#> 6 NA NA 2
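The caveat about non-substantive codes matters most when the comparison points upward. The sketch below uses a toy vector (hypothetical data) to show how `>=` sweeps in “Not sure” and “skipped”. Setting those codes to `NA` first is only one possible choice, and which codes count as non-substantive is an assumption the analyst has to make per question.

```r
library(haven)

# Toy vector mimicking the coding of CC18_308d
approval <- labelled(
  c(1, 2, 3, 4, 5, 8),
  labels = c("Strongly approve" = 1, "Somewhat approve" = 2,
             "Somewhat disapprove" = 3, "Strongly disapprove" = 4,
             "Not sure" = 5, "skipped" = 8, "not asked" = 9)
)
vals <- as.numeric(approval)

# Naive version: codes 5 and 8 are counted as disapproval
naive <- vals >= 3

# Assumed non-substantive codes are set to NA before comparing
vals[vals %in% c(5, 8, 9)] <- NA
disapprove <- vals >= 3
```

`naive` is `FALSE FALSE TRUE TRUE TRUE TRUE`, while `disapprove` is `FALSE FALSE TRUE TRUE NA NA`.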