11 Case Study: NCS-R

The sample design report outline is from Valliant (2013). We’ll use the data from the National Comorbidity Survey Replication (NCS-R) featured in Heeringa (2017). This was a 2002 study of mental illness. The sample design was an equal probability, multistage sample.

library(tidyverse)
library(scales)
library(janitor)
library(survey)
library(srvyr)
library(gtsummary)

11.1 Executive summary

Provide a brief overview of the survey, including the general study goals and the year the annual survey was first implemented.
Describe the purpose of this document.
Provide a table of the sample size to be selected per business unit (i.e., the respondent sample size inflated for ineligibility and nonresponse).
Discuss the contents of the remaining sections of the report.

11.2 Sample design

Description of the target population.

The target population was adults aged 18 and older residing in the 48 contiguous United States.

Describe the sampling frame including the date and source database.

The sampling frame included households in the 48 contiguous United States. The survey was conducted between February 2001 and April 2003. The source database for the sampling frame was the Inter-university Consortium for Political and Social Research (ICPSR), which provided the necessary geographic and demographic information to ensure a nationally representative sample.

Describe the type of sample and method of sample selection to be used.

The sampling strategy was a multi-stage clustered area probability sample.

- Stage 1 was to select primary sampling units. The entire country (48 contiguous states) was divided into primary sampling units (PSUs), composed of counties or groups of contiguous counties. A random sample of PSUs was selected.
- Stage 2 was to select segments. The selected PSUs were subdivided into segments, usually census tracts or block groups. A random sample of segments was selected.
- Stage 3 was to select households. A random sample of households was selected.
- Stage 4 was to select respondents. A person was randomly selected from the household.
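The four stages described above can be illustrated with a small simulation. This is only a sketch in base R: the frame (20 PSUs, 5 segments per PSU, 8 households per segment), the per-stage sample sizes, and the use of simple random sampling at every stage are all assumptions made for illustration, not the NCS-R selection procedure.

```r
# Toy four-stage selection; frame sizes and sample sizes are hypothetical.
set.seed(2002)

# Hypothetical frame: 20 PSUs x 5 segments x 8 households, 1 to 4 adults each
frame <- expand.grid(hh = 1:8, segment = 1:5, psu = 1:20)
frame$n_adults <- sample(1:4, nrow(frame), replace = TRUE)

# Stage 1: simple random sample of 4 PSUs
psus <- sample(unique(frame$psu), 4)
stage1 <- frame[frame$psu %in% psus, ]

# Stage 2: 2 segments within each selected PSU
seg_keys <- unique(stage1[, c("psu", "segment")])
seg_pick <- do.call(rbind, lapply(
  split(seg_keys, seg_keys$psu),
  function(d) d[sample(nrow(d), 2), ]
))
stage2 <- merge(stage1, seg_pick)

# Stage 3: 3 households within each selected segment
stage3 <- do.call(rbind, lapply(
  split(stage2, list(stage2$psu, stage2$segment), drop = TRUE),
  function(d) d[sample(nrow(d), 3), ]
))

# Stage 4: one adult chosen at random within each selected household
stage3$selected_adult <- vapply(
  stage3$n_adults, function(k) sample.int(k, 1), integer(1)
)

nrow(stage3)  # 4 PSUs x 2 segments x 3 households = 24 respondents
```

Because one adult is taken per household regardless of household size, adults in larger households have a smaller within-household selection probability, which is one reason a design like this is analyzed with weights.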

11.3 Sample size and allocation

Optimization requirements
– Optimization details including constraints and budget.
– Detail the minimum domain sizes and mechanics used to determine the sizes.
Optimization results
– Results: minimum respondent sample size per stratum
– Marginal sample sizes for key reporting domains
– Estimated precision achieved by optimization results
Inflation adjustments to allocation solution
– Nonresponse adjustments
– Adjustments for ineligible sample members
Final sample allocation
– Marginal sample sizes for key reporting domains
Sensitivity analysis
– Results from comparing deviations to allocation after introducing changes to the optimization system
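The inflation adjustments in the outline above amount to simple arithmetic: divide the required number of respondents by the expected eligibility and response rates. The sketch below uses hypothetical strata, targets, and rates; none of these figures come from the NCS-R.

```r
# Hypothetical inputs for illustration only
target_respondents <- c(stratum_a = 400, stratum_b = 250)
eligibility_rate <- 0.90  # assumed share of sampled units that are eligible
response_rate <- 0.70     # assumed share of eligible units that respond

# Inflate so the expected number of responding eligible units meets the target
n_to_select <- ceiling(target_respondents / (eligibility_rate * response_rate))
n_to_select
```

With these assumed rates, 635 and 397 units would need to be selected to expect 400 and 250 respondents.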

# Downloaded from book web site
# https://websites.umich.edu/~surveymethod/asda/#Links%20to%20Data%20Sets%20for%20First%20and%20Second%20Editions
# https://www.umich.edu/~surveymethod/asda/Chapter%20Exercises%20Data%20Sets%20Stata%2015SEP2017.zip
ncsr_raw <- foreign::read.dta("input/ncsr_sub_13nov2015.dta")

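# Rescale the weight by 209128094 / 9282 (apparently population size over
# sample size) so that weighted counts are on the population scale.
ncsr <- ncsr_raw |> mutate(ncsrwtsh_pop = ncsrwtsh * (209128094 / 9282))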

ncsr_des <- as_survey_design(
  ncsr,
  ids = seclustr,
  strata = sestrat,
  nest = TRUE,
  weights = ncsrwtsh_pop
)
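The weight rescaling above multiplies every weight by a constant, which changes estimated totals but not means or proportions. A minimal sketch with toy numbers, assuming (as the scaling factor suggests) that the original weight is normalized to sum to the sample size; the weights and population size below are hypothetical.

```r
# Toy weights (hypothetical), normalized so they sum to the sample size
w_norm <- c(0.5, 1.5, 1.2, 0.8)
n <- length(w_norm)
N <- 1000  # hypothetical population size

# Same operation as ncsrwtsh * (209128094 / 9282), on toy numbers
w_pop <- w_norm * (N / n)

sum(w_norm)  # sums to the sample size
sum(w_pop)   # sums to the population size

# A weighted mean is unchanged by the rescaling
y <- c(10, 20, 30, 40)
weighted.mean(y, w_norm)
weighted.mean(y, w_pop)
```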

11.4 Descriptive Analysis

11.4.1 Counts

How many U.S. adults have experienced an episode of major depression in their lifetime?

ncsr_des |>
  # Pass the grouping variable directly; survey_count() has no .by argument
  survey_count(mde, vartype = c("se", "ci", "cv")) |>
  adorn_totals(where = "row", fill = NA, na.rm = TRUE, name = "Total", n) |>
  gt::gt() |>
  gt::fmt_number(n:n_upp, decimals = 0) |>
  gt::fmt_number(n_cv, decimals = 2) |>
  gt::cols_label(
    n = "Estimated Total Lifetime MDE",
    n_se = "Standard Error",
    n_low = "95% CI (low)",
    n_upp = "95% CI (upp)",
    n_cv = "CV"
  )

mde	Estimated Total Lifetime MDE	Standard Error	95% CI (low)	95% CI (upp)	CV
0	169,035,891	7,876,170	153,141,136	184,930,645	0.05
1	40,092,207	2,567,488	34,910,806	45,273,607	0.06
Total	209,128,097	NA	NA	NA	NA
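The columns of the table above are related in a simple way, which can be checked from the printed figures. The CV is the standard error divided by the estimate, and the confidence limits sit a fixed number of standard errors from the estimate. The degrees of freedom used below (42) are an assumption: the value is not stated in the text, but it reproduces the printed limits.

```r
# Figures copied from the table above
est <- c(mde0 = 169035891, mde1 = 40092207)
se  <- c(mde0 = 7876170,   mde1 = 2567488)
low <- c(mde0 = 153141136, mde1 = 34910806)

# Coefficient of variation: SE relative to the estimate
round(se / est, 2)

# Half-width of the interval in SE units
(est - low) / se

# t critical value under the assumed 42 degrees of freedom
qt(0.975, df = 42)
```

The half-widths are close to the t critical value rather than 1.96, which suggests the intervals are based on a t distribution with design-based degrees of freedom.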

Among U.S. adults who have experienced an episode of major depression in their lifetime, how many fall into each marital status category?

ncsr_des |>
  filter(mde == 1) |>
  # Pass the grouping variable directly; survey_count() has no .by argument
  survey_count(MAR3CAT, vartype = c("se", "ci", "cv")) |>
  adorn_totals(where = "row", fill = NA, na.rm = TRUE, name = "Total", n) |>
  gt::gt() |>
  gt::fmt_number(n:n_upp, decimals = 0) |>
  gt::fmt_number(n_cv, decimals = 2) |>
  gt::cols_label(
    n = "Estimated Total Lifetime MDE",
    n_se = "Standard Error",
    n_low = "95% CI (low)",
    n_upp = "95% CI (upp)",
    n_cv = "CV"
  )

MAR3CAT	Estimated Total Lifetime MDE	Standard Error	95% CI (low)	95% CI (upp)	CV
1	20,304,191	1,584,109	17,107,330	23,501,051	0.08
2	10,360,671	702,622	8,942,723	11,778,618	0.07
3	9,427,345	773,138	7,867,091	10,987,600	0.08
Total	40,092,207	NA	NA	NA	NA

11.4.2 Sums

What is the total number of females by obesity category? Sum sexf.

ncsr_des |>
  summarize(
    .by = OBESE6CA, 
    Tot = survey_total(sexf, na.rm = TRUE, vartype = c("se", "ci", "var", "cv"))
  ) |>
  adorn_totals(where = "row", fill = NA, na.rm = TRUE, name = "Total", Tot) |>
  gt::gt() |>
  gt::fmt_number(Tot:Tot_var, decimals = 0) |>
  gt::fmt_number(Tot_cv, decimals = 2)

OBESE6CA	Tot	Tot_se	Tot_low	Tot_upp	Tot_var	Tot_cv
1	6,131,462	535,122	5,051,541	7,211,382	286,355,856,271	0.09
2	45,404,619	2,720,611	39,914,204	50,895,035	7,401,724,264,861	0.06
3	28,849,013	1,527,843	25,765,701	31,932,324	2,334,302,987,994	0.05
4	15,796,359	857,378	14,066,100	17,526,617	735,096,749,984	0.05
5	6,490,138	584,221	5,311,132	7,669,145	341,314,438,641	0.09
6	4,332,572	451,013	3,422,390	5,242,753	203,412,969,143	0.10
NA	1,982,485	222,624	1,533,211	2,431,759	49,561,583,033	0.11
Total	108,986,647	NA	NA	NA	NA	NA

11.4.3 Means and Proportions

What was the mean age by region? Calculate mean(age).

ncsr_des |>
  summarize(
    .by = region,
    M = survey_mean(age, na.rm = TRUE, vartype = c("se", "ci"))
  ) |>
  gt::gt() |>
  gt::fmt_number(decimals = 0)

region	M	M_se	M_low	M_upp
1	46	1	44	48
2	45	0	44	46
3	45	0	44	46
4	43	1	41	45

What is the proportion of respondents from each region? Calculate the proportion of region.

ncsr_des |>
  summarize(
    .by = region,
    M = survey_prop()
  ) |>
  adorn_totals(where = "row", fill = NA, na.rm = TRUE, name = "Total", M) |>
  gt::gt() |>
  gt::fmt_number(M:M_se, decimals = 3)

region	M	M_se
1	0.193	0.033
2	0.232	0.018
3	0.358	0.020
4	0.217	0.022
Total	1.000	NA

11.4.4 Quantiles

What is the interquartile range (IQR) of age by region? Calculate the quartiles of age; the IQR is the difference between the 75th and 25th percentiles.

ncsr_des |>
  summarize(
    .by = region,
    Q = survey_quantile(age, quantiles = c(.25, .5, .75))
  ) |>
  gt::gt() |>
  gt::fmt_number(ends_with("se"), decimals = 2)

region	Q_q25	Q_q50	Q_q75	Q_q25_se	Q_q50_se	Q_q75_se
1	33	44	58	1.33	1.33	1.33
2	31	43	57	0.68	0.68	0.68
3	30	43	57	0.70	0.70	0.70
4	28	41	54	2.60	1.30	1.30
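The quantile output above reports the quartiles but not the IQR itself. A short sketch that derives it from the printed values (the data frame below is typed in by hand from the table, not produced by the survey design object):

```r
# Quartiles copied from the table above
q <- data.frame(
  region = 1:4,
  q25 = c(33, 31, 30, 28),
  q75 = c(58, 57, 57, 54)
)

# IQR is the 75th percentile minus the 25th percentile
q$iqr <- q$q75 - q$q25
q
```

This gives IQRs of 25, 26, 27, and 26 years for regions 1 through 4. Standard errors for the IQR cannot be derived from the quartile standard errors alone, so they are left out here.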

11.4.5 Ratios

What is the ratio of age to DSM_SO by region?

ncsr_des |>
  summarize(
    .by = region,
    R = survey_ratio(age, DSM_SO)
  ) |>
  gt::gt() |>
  gt::fmt_number(ends_with("se"), decimals = 2)

region	R	R_se
1	10.097784	0.21
2	10.061276	0.06
3	9.826059	0.11
4	9.736837	0.28
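A ratio estimate of this kind is the weighted total of the numerator divided by the weighted total of the denominator, which is generally not the same as the weighted mean of the unit-level ratios. A minimal sketch with hypothetical weights and values (not NCS-R data):

```r
# Toy data (hypothetical): weights w, numerator y, denominator x
w <- c(10, 20, 30)
y <- c(40, 50, 60)
x <- c(4, 5, 5)

# Ratio of weighted totals, the quantity a survey ratio estimator targets
ratio_of_totals <- sum(w * y) / sum(w * x)

# Weighted mean of unit-level ratios, shown only for contrast
mean_of_ratios <- weighted.mean(y / x, w)

c(ratio_of_totals = ratio_of_totals, mean_of_ratios = mean_of_ratios)
```

On these toy numbers the two quantities differ (about 11.03 versus 11), so it matters which one a question is asking for.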