library(tidyverse)
library(scales)
library(janitor)
library(survey)
library(srvyr)
library(gtsummary)
11 Case Study: NCS-R
The sample design report outline is from Valliant (2013). We’ll use the data from the National Comorbidity Survey Replication (NCS-R) featured in Heeringa (2017). This was a 2002 study of mental illness. The sample design was an equal probability, multistage sample.
11.1 Executive summary
- Provide a brief overview of the survey including information related to general study goals and year when annual survey was first implemented.
- Describe the purpose of this document.
- Provide a table of the sample size to be selected per business unit (i.e., respondent sample size inflated for ineligibility and nonresponse).
- Discuss the contents of the remaining section of the report.
11.2 Sample design
Description of the target population.
The target population was adults aged 18 and older residing in the 48 contiguous United States.
Describe the sampling frame including the date and source database.
The sampling frame included households in the 48 contiguous United States. The survey was conducted between February 2001 and April 2003. The source database for the sampling frame was the Inter-university Consortium for Political and Social Research (ICPSR), which provided the necessary geographic and demographic information to ensure a nationally representative sample.
Describe the type of sample and method of sample selection to be used.
The sampling strategy was a multi-stage clustered area probability sample.
- Stage 1 was to select primary sampling units. The entire country (48 contiguous states) was divided into primary sampling units (PSUs), composed of counties or groups of contiguous counties. A random sample of PSUs was selected.
- Stage 2 was to select segments. The selected PSUs were subdivided into segments, usually census tracts or block groups. A random sample of segments was selected.
- Stage 3 was to select households. A random sample of households was selected.
- Stage 4 was to select respondents. A person was randomly selected from the household.
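The four stages can be sketched on a toy frame. This is only an illustration: the frame, the stage sample sizes, and the use of simple random sampling at every stage are assumptions made for the sketch, not the NCS-R selection procedure.

```r
set.seed(2002)  # arbitrary seed so the sketch is reproducible

# Hypothetical frame: 20 PSUs x 5 segments x 10 households, 1-4 adults each.
frame <- expand.grid(hh = 1:10, segment = 1:5, psu = 1:20)
frame$n_adults <- sample(1:4, nrow(frame), replace = TRUE)

# Stage 1: select 4 PSUs.
psus <- sample(unique(frame$psu), 4)
stage1 <- frame[frame$psu %in% psus, ]

# Stage 2: select 2 segments within each selected PSU.
seg_keep <- unlist(lapply(psus, function(p) paste(p, sample(1:5, 2))))
stage2 <- stage1[paste(stage1$psu, stage1$segment) %in% seg_keep, ]

# Stage 3: select 3 households within each selected segment.
seg_id <- paste(stage2$psu, stage2$segment)
stage3 <- do.call(
  rbind,
  lapply(split(stage2, seg_id), function(d) d[sample(nrow(d), 3), ])
)

# Stage 4: select one adult at random within each selected household.
stage3$adult <- vapply(stage3$n_adults, function(k) sample.int(k, 1), integer(1))

nrow(stage3)  # 4 PSUs x 2 segments x 3 households = 24 respondents
```

Because respondents in the same PSU and segment tend to resemble each other, the analysis later in the chapter passes the cluster and stratum identifiers to the design object instead of treating the rows as independent.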
11.3 Sample size and allocation
- Optimization requirements
– Optimization details including constraints and budget.
– Detail the minimum domain sizes and mechanics used to determine the sizes.
- Optimization results
– Results: minimum respondent sample size per stratum
– Marginal sample sizes for key reporting domains
– Estimated precision achieved by optimization results
- Inflation adjustments to allocation solution
– Nonresponse adjustments
– Adjustments for ineligible sample members
- Final sample allocation
– Marginal sample sizes for key reporting domains
- Sensitivity analysis
– Results from comparing deviations to allocation after introducing changes to the optimization system
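The inflation adjustments in this outline amount to dividing the target number of respondents by the expected eligibility and response rates. A minimal sketch follows; every number in it is a hypothetical placeholder, not an NCS-R figure.

```r
# All inputs are hypothetical placeholders, not NCS-R figures.
target_resp <- c(stratum_a = 400, stratum_b = 250)  # respondents needed per stratum
elig_rate <- 0.90  # assumed share of released cases that are eligible
resp_rate <- 0.70  # assumed share of eligible cases that respond

# Cases to release so that the expected number of respondents meets the target.
n_release <- ceiling(target_resp / (elig_rate * resp_rate))
n_release
```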
# Downloaded from book web site
# https://websites.umich.edu/~surveymethod/asda/#Links%20to%20Data%20Sets%20for%20First%20and%20Second%20Editions
# https://www.umich.edu/~surveymethod/asda/Chapter%20Exercises%20Data%20Sets%20Stata%2015SEP2017.zip
ncsr_raw <- foreign::read.dta("input/ncsr_sub_13nov2015.dta")

# Rescale the weight by 209,128,094 / 9,282 so weighted totals are on the population scale.
ncsr <- ncsr_raw |> mutate(ncsrwtsh_pop = ncsrwtsh * (209128094 / 9282))

ncsr_des <- as_survey_design(
  ncsr,
  ids = seclustr,
  strata = sestrat,
  nest = TRUE,
  weights = ncsrwtsh_pop
)
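The rescaling step only changes the scale of the weights. A quick sketch with simulated weights shows the effect, assuming (this is not stated in the text) that ncsrwtsh is normalized to sum to roughly the 9,282 respondents. The simulated weights are made up.

```r
n_sample  <- 9282       # divisor used in the rescaling above
pop_total <- 209128094  # numerator used in the rescaling above

# Simulated stand-in for ncsrwtsh, normalized so it sums to n_sample (assumption).
set.seed(1)
w_norm <- runif(n_sample, min = 0.2, max = 3)
w_norm <- w_norm * n_sample / sum(w_norm)

# Same transformation as in the mutate() call.
w_pop <- w_norm * (pop_total / n_sample)

sum(w_norm)  # sums to the sample size
sum(w_pop)   # sums to the population total
```

Relative weights are unchanged, so means and proportions are unaffected. Only totals move to the population scale.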
11.4 Descriptive Analysis
11.4.1 Counts
How many U.S. adults have experienced an episode of major depression in their lifetime?
ncsr_des |>
  survey_count(.by = mde, vartype = c("se", "ci", "cv")) |>
  adorn_totals(where = "row", fill = NA, na.rm = TRUE, name = "Total", n) |>
  gt::gt() |>
  gt::fmt_number(n:n_upp, decimals = 0) |>
  gt::fmt_number(n_cv, decimals = 2) |>
  gt::cols_label(
    n = "Estimated Total Lifetime MDE",
    n_se = "Standard Error",
    n_low = "95% CI (low)",
    n_upp = "95% CI (upp)",
    n_cv = "CV"
  )
.by | Estimated Total Lifetime MDE | Standard Error | 95% CI (low) | 95% CI (upp) | CV |
---|---|---|---|---|---|
0 | 169,035,891 | 7,876,170 | 153,141,136 | 184,930,645 | 0.05 |
1 | 40,092,207 | 2,567,488 | 34,910,806 | 45,273,607 | 0.06 |
Total | 209,128,097 | NA | NA | NA | NA |
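The CV and interval columns can be rebuilt from the estimate and its standard error. The sketch below uses the printed values. The 42 degrees of freedom is an assumption (the text does not state it), chosen because it reproduces the printed limits to within rounding.

```r
# Point estimates and standard errors copied from the table.
est <- c(no_mde = 169035891, mde = 40092207)
se  <- c(no_mde = 7876170,  mde = 2567488)

# Coefficient of variation: standard error relative to the estimate.
cv <- se / est
round(cv, 2)

# t-based 95% interval; df_assumed = 42 is an assumption, not stated in the text.
df_assumed <- 42
t_crit <- qt(0.975, df = df_assumed)
ci <- cbind(low = est - t_crit * se, upp = est + t_crit * se)
round(ci)
```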
How many U.S. adults have experienced an episode of major depression in their lifetime by marital status subpopulation?
ncsr_des |>
  filter(mde == 1) |>
  survey_count(.by = MAR3CAT, vartype = c("se", "ci", "cv")) |>
  adorn_totals(where = "row", fill = NA, na.rm = TRUE, name = "Total", n) |>
  gt::gt() |>
  gt::fmt_number(n:n_upp, decimals = 0) |>
  gt::fmt_number(n_cv, decimals = 2) |>
  gt::cols_label(
    n = "Estimated Total Lifetime MDE",
    n_se = "Standard Error",
    n_low = "95% CI (low)",
    n_upp = "95% CI (upp)",
    n_cv = "CV"
  )
.by | Estimated Total Lifetime MDE | Standard Error | 95% CI (low) | 95% CI (upp) | CV |
---|---|---|---|---|---|
1 | 20,304,191 | 1,584,109 | 17,107,330 | 23,501,051 | 0.08 |
2 | 10,360,671 | 702,622 | 8,942,723 | 11,778,618 | 0.07 |
3 | 9,427,345 | 773,138 | 7,867,091 | 10,987,600 | 0.08 |
Total | 40,092,207 | NA | NA | NA | NA |
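The Total row has no standard error because standard errors do not add across rows. A small check with the printed values illustrates this: the subpopulation estimates sum to the overall MDE total, but combining their standard errors as if the rows were independent falls short of the design-based standard error reported for MDE = 1 in the previous table.

```r
# Estimates and standard errors by marital status, copied from the table.
est <- c(20304191, 10360671, 9427345)
se  <- c(1584109, 702622, 773138)

sum(est)  # 40092207, the MDE total from the previous table

# Standard error of the sum if the three estimates were independent.
naive_se <- sqrt(sum(se^2))

# Design-based standard error of the MDE total (previous table).
design_se <- 2567488

naive_se / design_se  # below 1: the independence shortcut understates the SE
```

Since the variance of a sum is the sum of the variances plus twice the covariances, the gap suggests the subpopulation estimates are positively correlated, which is plausible when they come from the same sampled clusters.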
11.4.2 Sums
What is the total number of females by obesity category? Sum sexf.
ncsr_des |>
  summarize(
    .by = OBESE6CA,
    Tot = survey_total(sexf, na.rm = TRUE, vartype = c("se", "ci", "var", "cv"))
  ) |>
  adorn_totals(where = "row", fill = NA, na.rm = TRUE, name = "Total", Tot) |>
  gt::gt() |>
  gt::fmt_number(Tot:Tot_var, decimals = 0) |>
  gt::fmt_number(Tot_cv, decimals = 2)
OBESE6CA | Tot | Tot_se | Tot_low | Tot_upp | Tot_var | Tot_cv |
---|---|---|---|---|---|---|
1 | 6,131,462 | 535,122 | 5,051,541 | 7,211,382 | 286,355,856,271 | 0.09 |
2 | 45,404,619 | 2,720,611 | 39,914,204 | 50,895,035 | 7,401,724,264,861 | 0.06 |
3 | 28,849,013 | 1,527,843 | 25,765,701 | 31,932,324 | 2,334,302,987,994 | 0.05 |
4 | 15,796,359 | 857,378 | 14,066,100 | 17,526,617 | 735,096,749,984 | 0.05 |
5 | 6,490,138 | 584,221 | 5,311,132 | 7,669,145 | 341,314,438,641 | 0.09 |
6 | 4,332,572 | 451,013 | 3,422,390 | 5,242,753 | 203,412,969,143 | 0.10 |
NA | 1,982,485 | 222,624 | 1,533,211 | 2,431,759 | 49,561,583,033 | 0.11 |
Total | 108,986,647 | NA | NA | NA | NA | NA |
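Summing a 0/1 indicator with survey weights gives an estimated population count, which is why the total of sexf is read as a number of females. A minimal sketch of the point estimate on a made-up five-person sample (weights and values are hypothetical):

```r
# Hypothetical five-person sample; weights and values are made up.
toy <- data.frame(
  w    = c(1200, 800, 1500, 950, 1100),  # survey weights
  sexf = c(1, 0, 1, 1, 0)                # 1 = female, 0 = male
)

# Weighted total of the indicator = estimated number of females.
total_females <- sum(toy$w * toy$sexf)
total_females  # 3650
```

This reproduces only the point estimate. The standard errors in the table depend on the strata and clusters declared in the design object.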
11.4.3 Means and Proportions
What was the mean age by region? Calculate mean(age).
ncsr_des |>
  summarize(
    .by = region,
    M = survey_mean(age, na.rm = TRUE, vartype = c("se", "ci"))
  ) |>
  gt::gt() |>
  gt::fmt_number(decimals = 0)
region | M | M_se | M_low | M_upp |
---|---|---|---|---|
1 | 46 | 1 | 44 | 48 |
2 | 45 | 0 | 44 | 46 |
3 | 45 | 0 | 44 | 46 |
4 | 43 | 1 | 41 | 45 |
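The point estimate of a survey mean is a weighted mean: the weighted total of the variable divided by the sum of the weights. A sketch on made-up data (all values hypothetical):

```r
# Hypothetical five-person sample; weights and ages are made up.
toy <- data.frame(
  w   = c(1200, 800, 1500, 950, 1100),
  age = c(34, 61, 47, 25, 52)
)

# Weighted mean = weighted total / sum of weights.
weighted_mean_age <- sum(toy$w * toy$age) / sum(toy$w)

# Same value from base R's helper, and the unweighted mean for comparison.
weighted.mean(toy$age, toy$w)
mean(toy$age)
```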
What is the proportion of respondents from each region? Calculate the proportion of region.
ncsr_des |>
  summarize(
    .by = region,
    M = survey_prop()
  ) |>
  adorn_totals(where = "row", fill = NA, na.rm = TRUE, name = "Total", M) |>
  gt::gt() |>
  gt::fmt_number(M:M_se, decimals = 3)
region | M | M_se |
---|---|---|
1 | 0.193 | 0.033 |
2 | 0.232 | 0.018 |
3 | 0.358 | 0.020 |
4 | 0.217 | 0.022 |
Total | 1.000 | NA |
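A weighted proportion is each group's share of the total weight, so the proportions sum to 1 by construction. A sketch on made-up data, followed by a check that the printed proportions sum to one (the toy weights and regions are hypothetical):

```r
# Hypothetical five-person sample; weights and regions are made up.
toy <- data.frame(
  w      = c(1200, 800, 1500, 950, 1100),
  region = c(1, 2, 2, 3, 1)
)

# Share of the total weight falling in each region.
prop <- tapply(toy$w, toy$region, sum) / sum(toy$w)
prop
sum(prop)

# The printed proportions from the table also sum to one.
published <- c(0.193, 0.232, 0.358, 0.217)
sum(published)
```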
11.4.4 Quantiles
What is the IQR of the age by region? Calculate the quantiles of age.
ncsr_des |>
  summarize(
    .by = region,
    Q = survey_quantile(age, quantiles = c(.25, .5, .75))
  ) |>
  gt::gt() |>
  gt::fmt_number(ends_with("se"), decimals = 2)
region | Q_q25 | Q_q50 | Q_q75 | Q_q25_se | Q_q50_se | Q_q75_se |
---|---|---|---|---|---|---|
1 | 33 | 44 | 58 | 1.33 | 1.33 | 1.33 |
2 | 31 | 43 | 57 | 0.68 | 0.68 | 0.68 |
3 | 30 | 43 | 57 | 0.70 | 0.70 | 0.70 |
4 | 28 | 41 | 54 | 2.60 | 1.30 | 1.30 |
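The table reports the quartiles, not the IQR itself. The IQR point estimate is the difference between the 75th and 25th percentiles, computed here from the printed values:

```r
# Quartiles of age by region, copied from the table.
q <- data.frame(
  region = 1:4,
  q25 = c(33, 31, 30, 28),
  q75 = c(58, 57, 57, 54)
)

# Interquartile range = 75th percentile minus 25th percentile.
q$iqr <- q$q75 - q$q25
q
```

This gives point estimates only. A standard error for the IQR cannot be recovered from the quartile standard errors alone, because the two quartile estimates are correlated.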
11.4.5 Ratios
What is the ratio of age to DSM_SO by region?
ncsr_des |>
  summarize(
    .by = region,
    R = survey_ratio(age, DSM_SO)
  ) |>
  gt::gt() |>
  gt::fmt_number(ends_with("se"), decimals = 2)
region | R | R_se |
---|---|---|
1 | 10.097784 | 0.21 |
2 | 10.061276 | 0.06 |
3 | 9.826059 | 0.11 |
4 | 9.736837 | 0.28 |
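A survey ratio is estimated as a ratio of weighted totals, which is generally not the same as the weighted mean of person-level ratios. A sketch on made-up data, where y and x are hypothetical stand-ins for the numerator and denominator variables:

```r
# Hypothetical five-person sample; all values are made up.
toy <- data.frame(
  w = c(1200, 800, 1500, 950, 1100),  # survey weights
  y = c(34, 61, 47, 25, 52),          # numerator variable
  x = c(4, 5, 5, 3, 5)                # denominator variable
)

# Ratio estimator: weighted total of y over weighted total of x.
ratio_of_totals <- sum(toy$w * toy$y) / sum(toy$w * toy$x)

# For contrast: weighted mean of the individual ratios, a different quantity.
mean_of_ratios <- weighted.mean(toy$y / toy$x, toy$w)

c(ratio_of_totals = ratio_of_totals, mean_of_ratios = mean_of_ratios)
```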