11  Case Study: NCS-R

This case study follows the sample design report outline from Valliant (2013), applied to data from the National Comorbidity Survey Replication (NCS-R) featured in Heeringa (2017). The NCS-R was a 2002 study of mental illness. The sample design was an equal-probability, multistage sample.

library(tidyverse)
library(scales)
library(janitor)
library(survey)
library(srvyr)
library(gtsummary)

11.1 Executive summary

  • Provide a brief overview of the survey including information related to general study goals and year when annual survey was first implemented.
  • Describe the purpose of this document.
  • Provide a table of the sample size to be selected per business unit (i.e., respondent sample size inflated for ineligibility and nonresponse).
  • Discuss the contents of the remaining section of the report.

11.2 Sample design

Description of the target population.

The target population was adults aged 18 and older residing in the 48 contiguous United States.

Describe the sampling frame including the date and source database.

The sampling frame included households in the 48 contiguous United States. The survey was conducted between February 2001 and April 2003. The source database for the sampling frame was the Inter-university Consortium for Political and Social Research (ICPSR), which provided the necessary geographic and demographic information to ensure a nationally representative sample.

Describe the type of sample and method of sample selection to be used.

The sampling strategy was a multi-stage clustered area probability sample.

  • Stage 1 was to select primary sampling units. The entire country (48 contiguous states) was divided into primary sampling units (PSUs), composed of counties or groups of contiguous counties. A random sample of PSUs was selected.
  • Stage 2 was to select segments. The selected PSUs were subdivided into segments, usually census tracts or block groups. A random sample of segments was selected.
  • Stage 3 was to select households. A random sample of households was selected.
  • Stage 4 was to select respondents. A person was randomly selected from the household.
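The four stages can be illustrated with a toy simulation in base R. This is only a sketch: the frame sizes and sampling fractions are hypothetical, and simple random sampling at every stage is an assumption made for illustration, not a description of how the NCS-R sample was actually drawn.

```r
# Toy four-stage selection. All sizes are hypothetical, and simple random
# sampling at every stage is an illustrative assumption.
set.seed(2002)

# Hypothetical frame: 20 PSUs x 10 segments x 15 households, 1-4 adults each.
frame <- expand.grid(hh = 1:15, segment = 1:10, psu = 1:20)
frame$n_adults <- sample(1:4, nrow(frame), replace = TRUE)

# Stage 1: sample 5 of the 20 PSUs.
psu_s <- sample(unique(frame$psu), 5)
stage1 <- frame[frame$psu %in% psu_s, ]

# Stage 2: sample 3 of the 10 segments within each selected PSU.
seg_s <- lapply(psu_s, function(p) data.frame(psu = p, segment = sample(1:10, 3)))
stage2 <- merge(stage1, do.call(rbind, seg_s))

# Stage 3: sample 4 of the 15 households within each selected segment.
stage3 <- do.call(rbind, lapply(
  split(stage2, list(stage2$psu, stage2$segment), drop = TRUE),
  function(d) d[sample(nrow(d), 4), ]
))

# Stage 4: select one adult at random within each sampled household.
stage3$adult <- vapply(stage3$n_adults, function(k) sample.int(k, 1), integer(1))

# The overall selection probability is the product of the stage
# probabilities, and the base weight is its inverse.
stage3$prob <- (5 / 20) * (3 / 10) * (4 / 15) * (1 / stage3$n_adults)
stage3$base_wt <- 1 / stage3$prob
nrow(stage3)  # 5 PSUs x 3 segments x 4 households = 60 respondents
```

In this toy setup the within-household step makes the base weight proportional to the number of adults in the household, which is one reason survey weights vary even when households are selected with equal probability.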

11.3 Sample size and allocation

  • Optimization requirements – Optimization details including constraints and budget.
    – Detail the minimum domain sizes and mechanics used to determine the sizes.
  • Optimization results
    – Results: minimum respondent sample size per stratum
    – Marginal sample sizes for key reporting domains
    – Estimated precision achieved by optimization results
  • Inflation adjustments to allocation solution
    – Nonresponse adjustments
    – Adjustments for ineligible sample members
  • Final sample allocation
    – Marginal sample sizes for key reporting domains
  • Sensitivity analysis
    – Results from comparing deviations to allocation after introducing changes to the optimization system
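The inflation adjustments in the outline above amount to dividing the required number of respondents by the expected eligibility and response rates. A minimal sketch follows; the stratum names, targets, and rates are hypothetical placeholders, not NCS-R figures.

```r
# All inputs are hypothetical placeholders, not NCS-R figures.
target_respondents <- c(stratum_a = 400, stratum_b = 250)  # allocation result
eligibility_rate   <- c(stratum_a = 0.90, stratum_b = 0.85)
response_rate      <- c(stratum_a = 0.70, stratum_b = 0.60)

# Inflate the respondent targets so that the expected number of eligible
# respondents still meets the allocation; round up to whole sample units.
n_to_select <- ceiling(target_respondents / (eligibility_rate * response_rate))
n_to_select
```

Rounding up rather than to the nearest integer keeps the expected respondent count at or above the target in every stratum.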
# Data set downloaded from the book web site:
# https://websites.umich.edu/~surveymethod/asda/#Links%20to%20Data%20Sets%20for%20First%20and%20Second%20Editions
# https://www.umich.edu/~surveymethod/asda/Chapter%20Exercises%20Data%20Sets%20Stata%2015SEP2017.zip
ncsr_raw <- foreign::read.dta("input/ncsr_sub_13nov2015.dta")

# Rescale the weight ncsrwtsh by 209,128,094 / 9,282 so that the weights sum
# to a population total rather than to the sample size.
ncsr <- ncsr_raw |> mutate(ncsrwtsh_pop = ncsrwtsh * (209128094 / 9282))

ncsr_des <- as_survey_design(
  ncsr,
  ids = seclustr,
  strata = sestrat,
  nest = TRUE,
  weights = ncsrwtsh_pop
)
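The effect of rescaling the weights can be seen on a toy example. The sketch below uses hypothetical weights, outcomes, and population size; it only mirrors the `weight * (N / n)` pattern used to build `ncsrwtsh_pop`.

```r
# Hypothetical normalized weights for a sample of n = 5 (they sum to n).
w_norm <- c(0.6, 0.8, 1.0, 1.2, 1.4)
y      <- c(1, 0, 0, 1, 1)   # hypothetical 0/1 outcome
n      <- length(w_norm)
N      <- 1000               # hypothetical population size

# Same rescaling pattern as ncsrwtsh_pop.
w_pop <- w_norm * (N / n)

sum(w_pop)                # the weights now sum to N
sum(w_pop * y)            # estimated population total of y
weighted.mean(y, w_norm)  # the estimated proportion is the same
weighted.mean(y, w_pop)   # before and after rescaling
```

Rescaling matters for totals and counts, which are on the population scale only after the adjustment; means, proportions, and ratios are unaffected by a constant multiplier on the weights.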

11.4 Descriptive Analysis

11.4.1 Counts

How many U.S. adults have experienced an episode of major depression in their lifetime?

ncsr_des |>
  # survey_count() takes grouping variables directly; it has no .by argument.
  survey_count(mde, vartype = c("se", "ci", "cv")) |>
  # Add a Total row for the point estimate only.
  adorn_totals(where = "row", fill = NA, na.rm = TRUE, name = "Total", n) |>
  gt::gt() |>
  gt::fmt_number(n:n_upp, decimals = 0) |>
  gt::fmt_number(n_cv, decimals = 2) |>
  gt::cols_label(
    n = "Estimated Total Lifetime MDE",
    n_se = "Standard Error",
    n_low = "95% CI (low)",
    n_upp = "95% CI (upp)",
    n_cv = "CV"
  )

| mde | Estimated Total Lifetime MDE | Standard Error | 95% CI (low) | 95% CI (upp) | CV |
|---|---|---|---|---|---|
| 0 | 169,035,891 | 7,876,170 | 153,141,136 | 184,930,645 | 0.05 |
| 1 | 40,092,207 | 2,567,488 | 34,910,806 | 45,273,607 | 0.06 |
| Total | 209,128,097 | NA | NA | NA | NA |
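The CV and confidence interval columns can be checked by hand from the estimate and its standard error. The sketch below copies the published figures from the table above. The comparison with a t quantile on 42 degrees of freedom is an assumption, chosen because it is consistent with the multiplier implied by the published bounds; it is not stated in the source.

```r
# Published figures from the table above (lifetime MDE = 0 and 1).
est <- c(169035891, 40092207)
se  <- c(7876170, 2567488)
upp <- c(184930645, 45273607)

# Coefficient of variation: standard error relative to the estimate.
round(se / est, 2)

# Multiplier implied by the published upper confidence bounds.
(upp - est) / se

# Assumption: a t reference distribution with 42 design degrees of freedom.
qt(0.975, df = 42)
```

If the implied multiplier and the t quantile agree, the intervals are of the form estimate ± t × SE rather than the normal-theory 1.96 × SE.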

How many U.S. adults in each marital status subpopulation have experienced an episode of major depression in their lifetime?

ncsr_des |>
  filter(mde == 1) |>
  # survey_count() takes grouping variables directly; it has no .by argument.
  survey_count(MAR3CAT, vartype = c("se", "ci", "cv")) |>
  adorn_totals(where = "row", fill = NA, na.rm = TRUE, name = "Total", n) |>
  gt::gt() |>
  gt::fmt_number(n:n_upp, decimals = 0) |>
  gt::fmt_number(n_cv, decimals = 2) |>
  gt::cols_label(
    n = "Estimated Total Lifetime MDE",
    n_se = "Standard Error",
    n_low = "95% CI (low)",
    n_upp = "95% CI (upp)",
    n_cv = "CV"
  )

| MAR3CAT | Estimated Total Lifetime MDE | Standard Error | 95% CI (low) | 95% CI (upp) | CV |
|---|---|---|---|---|---|
| 1 | 20,304,191 | 1,584,109 | 17,107,330 | 23,501,051 | 0.08 |
| 2 | 10,360,671 | 702,622 | 8,942,723 | 11,778,618 | 0.07 |
| 3 | 9,427,345 | 773,138 | 7,867,091 | 10,987,600 | 0.08 |
| Total | 40,092,207 | NA | NA | NA | NA |

11.4.2 Sums

What is the total number of females by obesity category? Sum sexf.

ncsr_des |>
  summarize(
    .by = OBESE6CA, 
    Tot = survey_total(sexf, na.rm = TRUE, vartype = c("se", "ci", "var", "cv"))
  ) |>
  # Add a Total row for the point estimate only.
  adorn_totals(where = "row", fill = NA, na.rm = TRUE, name = "Total", Tot) |>
  gt::gt() |>
  gt::fmt_number(Tot:Tot_var, decimals = 0) |>
  gt::fmt_number(Tot_cv, decimals = 2)

| OBESE6CA | Tot | Tot_se | Tot_low | Tot_upp | Tot_var | Tot_cv |
|---|---|---|---|---|---|---|
| 1 | 6,131,462 | 535,122 | 5,051,541 | 7,211,382 | 286,355,856,271 | 0.09 |
| 2 | 45,404,619 | 2,720,611 | 39,914,204 | 50,895,035 | 7,401,724,264,861 | 0.06 |
| 3 | 28,849,013 | 1,527,843 | 25,765,701 | 31,932,324 | 2,334,302,987,994 | 0.05 |
| 4 | 15,796,359 | 857,378 | 14,066,100 | 17,526,617 | 735,096,749,984 | 0.05 |
| 5 | 6,490,138 | 584,221 | 5,311,132 | 7,669,145 | 341,314,438,641 | 0.09 |
| 6 | 4,332,572 | 451,013 | 3,422,390 | 5,242,753 | 203,412,969,143 | 0.10 |
| NA | 1,982,485 | 222,624 | 1,533,211 | 2,431,759 | 49,561,583,033 | 0.11 |
| Total | 108,986,647 | NA | NA | NA | NA | NA |
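The point estimate of a weighted total is the sum of weight times value within each group. The base R sketch below shows this on hypothetical toy data. It reproduces only the point estimate; the standard errors in the table additionally depend on the strata and clusters in the design object.

```r
# Hypothetical toy data: weights, a 0/1 female indicator, and a category.
toy <- data.frame(
  w    = c(100, 150, 200, 250, 300),
  sexf = c(1, 0, 1, 1, 0),
  cat  = c("a", "a", "b", "b", "b")
)

# Weighted total of the indicator within each category: sum(w * sexf).
tot <- tapply(toy$w * toy$sexf, toy$cat, sum)
tot

# The category totals add up to the overall weighted total.
sum(tot)
```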

11.4.3 Means and Proportions

What was the mean age by region? Calculate mean(age).

ncsr_des |>
  summarize(
    .by = region,
    M = survey_mean(age, na.rm = TRUE, vartype = c("se", "ci"))
  ) |>
  gt::gt() |>
  gt::fmt_number(decimals = 0)

| region | M | M_se | M_low | M_upp |
|---|---|---|---|---|
| 1 | 46 | 1 | 44 | 48 |
| 2 | 45 | 0 | 44 | 46 |
| 3 | 45 | 0 | 44 | 46 |
| 4 | 43 | 1 | 41 | 45 |

What is the proportion of respondents from each region? Calculate the proportion of region.

ncsr_des |>
  summarize(
    .by = region,
    M = survey_prop()
  ) |>
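  # Add a Total row for the point estimate only.
  adorn_totals(where = "row", fill = NA, na.rm = TRUE, name = "Total", M) |>
  gt::gt() |>
  gt::fmt_number(M:M_se, decimals = 3)

| region | M | M_se |
|---|---|---|
| 1 | 0.193 | 0.033 |
| 2 | 0.232 | 0.018 |
| 3 | 0.358 | 0.020 |
| 4 | 0.217 | 0.022 |
| Total | 1.000 | NA |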

11.4.4 Quantiles

What is the interquartile range (IQR) of age by region? Calculate the 25th, 50th, and 75th percentiles of age.

ncsr_des |>
  summarize(
    .by = region,
    Q = survey_quantile(age, quantiles = c(.25, .5, .75))
  ) |>
  gt::gt() |>
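  gt::fmt_number(ends_with("se"), decimals = 2)

| region | Q_q25 | Q_q50 | Q_q75 | Q_q25_se | Q_q50_se | Q_q75_se |
|---|---|---|---|---|---|---|
| 1 | 33 | 44 | 58 | 1.33 | 1.33 | 1.33 |
| 2 | 31 | 43 | 57 | 0.68 | 0.68 | 0.68 |
| 3 | 30 | 43 | 57 | 0.70 | 0.70 | 0.70 |
| 4 | 28 | 41 | 54 | 2.60 | 1.30 | 1.30 |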
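The table reports quartiles rather than the IQR itself, which is the 75th percentile minus the 25th. A small sketch using the quartiles copied from the table above:

```r
# Quartiles of age by region, copied from the table above.
q <- data.frame(
  region = 1:4,
  q25 = c(33, 31, 30, 28),
  q75 = c(58, 57, 57, 54)
)

# Interquartile range: 75th percentile minus 25th percentile.
q$iqr <- q$q75 - q$q25
q
```

This gives point estimates of the IQR only. A standard error for the difference cannot be derived from the two quantile standard errors alone, because the two estimates are correlated.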

11.4.5 Ratios

What is the ratio of age to DSM_SO by region?

ncsr_des |>
  summarize(
    .by = region,
    R = survey_ratio(age, DSM_SO)
  ) |>
  gt::gt() |>
  gt::fmt_number(ends_with("se"), decimals = 2)

| region | R | R_se |
|---|---|---|
| 1 | 10.097784 | 0.21 |
| 2 | 10.061276 | 0.06 |
| 3 | 9.826059 | 0.11 |
| 4 | 9.736837 | 0.28 |
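A survey ratio is the weighted total of the numerator divided by the weighted total of the denominator, which is generally not the same as averaging unit-level ratios. The sketch below illustrates the difference on hypothetical toy values; `y` and `x` merely stand in for `age` and `DSM_SO`.

```r
# Hypothetical toy data; y and x stand in for age and DSM_SO.
w <- c(100, 150, 200, 250)
y <- c(40, 50, 30, 60)
x <- c(5, 5, 2, 6)

# Ratio estimator: weighted total of y over weighted total of x.
r_hat <- sum(w * y) / sum(w * x)
r_hat

# For comparison: the weighted mean of the unit-level ratios y / x.
mean_of_ratios <- weighted.mean(y / x, w)
mean_of_ratios
```

The two quantities coincide only in special cases, such as when `y / x` is the same for every unit.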