library(tidyverse)
library(scales)
library(janitor)
library(survey)
library(srvyr)
library(gtsummary)
4 Data Preparation
This book uses the api datasets from the survey package for examples. The Academic Performance Index (API) was a school rating system used in California for several years. The survey package includes several datasets that mimic survey samples from the overall population of 6,194 schools.
data(api) loads several data frames, including a simple random sample (apisrs), a stratified simple random sample (apistrat), and a two-stage cluster sample (apiclus2).
data(api, package = "survey")
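To see exactly which objects data(api) creates without cluttering the global environment, the data can be loaded into a scratch environment first. This is a minimal sketch; api_env is a name chosen here for illustration and is not used elsewhere in the book.

```r
library(survey)

# Load the api data into a separate environment (api_env is a hypothetical name).
api_env <- new.env()
data(api, package = "survey", envir = api_env)

# List the data frames that were loaded and their row counts.
loaded <- sort(ls(api_env))
loaded
sapply(mget(loaded, envir = api_env), nrow)
```

The row count for apipop should match the population of 6,194 schools quoted above.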
# Add some cols for stat test examples.
prep_data <- function(df) {
  df |>
    mutate(
      stype = factor(stype, levels = c("E", "M", "H"), ordered = TRUE),
      meals_cut = cut(meals, c(0, 12, 25, 100), include.lowest = TRUE),
      hsg_cut = cut(hsg, c(0, 12, 25, 100), include.lowest = TRUE)
    )
}

apisrs <- prep_data(apisrs)
apistrat <- prep_data(apistrat)
apiclus2 <- prep_data(apiclus2)
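The breakpoints passed to cut() decide which bin each percentage lands in, and include.lowest = TRUE is what keeps a value of exactly 0 from becoming NA. The sketch below uses made-up values placed on and around the breakpoints; the vector x is illustrative and is not taken from the api data.

```r
# Hypothetical percentages sitting on and around the breakpoints.
x <- c(0, 12, 13, 25, 26, 100)

# Same breaks as meals_cut and hsg_cut. Intervals are right-closed, and
# include.lowest = TRUE closes the first interval on the left as well.
bins <- cut(x, c(0, 12, 25, 100), include.lowest = TRUE)

levels(bins)
table(bins)
```

Each of the three bins receives two of the six values: 0 and 12 fall in the first, 13 and 25 in the second, 26 and 100 in the third.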
Schools are uniquely identified by column snum. Schools roll up to districts, dnum. Two other columns contain metadata related to the sampling design.
fpc: finite population correction (FPC). The FPC adjusts the variance calculation (Section 3.1) and matters when the sample size is at least 5% of the population size. fpc equals the size of the population that the respondent is drawn from. For an SRS, that’s the entire population of 6,194 schools. For a two-stage cluster design, there is one column per stage (fpc1 and fpc2 in apiclus2), each holding the population size at that stage.
pw: sampling weight. The sampling weight scales the sample up to the population. Think of it as saying, “this respondent represents pw respondents from the total population.”
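The arithmetic behind these two columns can be sketched for the simple random sample of 200 schools. This assumes the common SRS form in which the variance of a mean is multiplied by (1 - n/N); Section 3.1 may present the correction differently, so treat the fpc_factor line as illustrative.

```r
N <- 6194  # population size (schools)
n <- 200   # simple random sample size

sampling_fraction <- n / N            # share of the population sampled
fpc_factor <- 1 - sampling_fraction   # assumed multiplier on the variance
pw <- N / n                           # schools represented by each sampled school

round(c(sampling_fraction = sampling_fraction,
        fpc_factor = fpc_factor,
        pw = pw), 4)
```

Here the sampling fraction is a little over 3%, below the 5% rule of thumb, so the correction shrinks the variance only slightly.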
Let’s create the design objects.
apisrs is a simple random sample of 200 schools from a population of 6,194, so fpc = 6194 and pw = 6194 / 200 = 30.97 for all rows.
apisrs_des <- as_survey_design(apisrs, weights = pw, fpc = fpc)
summary(apisrs_des)
Independent Sampling design
Called via srvyr
Probabilities:
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.03229 0.03229 0.03229 0.03229 0.03229 0.03229
Population size (PSUs): 6194
Data variables:
[1] "cds" "stype" "name" "sname" "snum" "dname"
[7] "dnum" "cname" "cnum" "flag" "pcttest" "api00"
[13] "api99" "target" "growth" "sch.wide" "comp.imp" "both"
[19] "awards" "meals" "ell" "yr.rnd" "mobility" "acs.k3"
[25] "acs.46" "acs.core" "pct.resp" "not.hsg" "hsg" "some.col"
[31] "col.grad" "grad.sch" "avg.ed" "full" "emer" "enroll"
[37] "api.stu" "pw" "fpc" "meals_cut" "hsg_cut"
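The probabilities in the summary are the reciprocals of the sampling weights, and the weights should add up to the population size. A quick check, rebuilding the design from the raw data so the snippet runs on its own (design_weights is a name introduced here):

```r
library(survey)
library(srvyr)

data(api, package = "survey")
apisrs_des <- as_survey_design(apisrs, weights = pw, fpc = fpc)

# Extract the sampling weights stored in the design object.
design_weights <- weights(apisrs_des)

unique(round(1 / design_weights, 5))  # inclusion probability shown by summary()
sum(design_weights)                   # should recover the population size
```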
apistrat is a sample of 200 schools from a population stratified by school type, stype: E = Elementary (n = 100, fpc = 4421, pw = 44.2), M = Middle (n = 50, fpc = 1018, pw = 20.4), and H = High School (n = 50, fpc = 755, pw = 15.1). Within each stratum, pw equals fpc / n.
apistrat_des <- as_survey_design(apistrat, weights = pw, fpc = fpc, strata = stype)
summary(apistrat_des)
Stratified Independent Sampling design
Called via srvyr
Probabilities:
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.02262 0.02262 0.03587 0.04014 0.05339 0.06623
Stratum Sizes:
E H M
obs 100 50 50
design.PSU 100 50 50
actual.PSU 100 50 50
Population stratum sizes (PSUs):
E H M
4421 755 1018
Data variables:
[1] "cds" "stype" "name" "sname" "snum" "dname"
[7] "dnum" "cname" "cnum" "flag" "pcttest" "api00"
[13] "api99" "target" "growth" "sch.wide" "comp.imp" "both"
[19] "awards" "meals" "ell" "yr.rnd" "mobility" "acs.k3"
[25] "acs.46" "acs.core" "pct.resp" "not.hsg" "hsg" "some.col"
[31] "col.grad" "grad.sch" "avg.ed" "full" "emer" "enroll"
[37] "api.stu" "pw" "fpc" "meals_cut" "hsg_cut"
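The relationship between pw, fpc, and the stratum sample sizes can be checked directly from the raw data. This is a small sketch using dplyr; strat_check and fpc_over_n are names introduced here for illustration.

```r
library(dplyr)

data(api, package = "survey")

# One row per stratum: sample size, population size, stored weight,
# and the weight implied by fpc / n.
strat_check <- apistrat |>
  group_by(stype) |>
  summarize(
    n = n(),
    fpc = first(fpc),
    pw = first(pw),
    fpc_over_n = fpc / n
  )

strat_check
```

The reciprocals of the smallest and largest weights correspond to the Max. and Min. probabilities reported by summary().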
apiclus2 is a two-stage cluster sample of 126 schools within districts. The first stage is a random sample of 40 of the 757 school districts (dnum). The second stage is a random sample of up to 5 schools (snum) from each district. For cluster designs, the cluster ids are specified in the design object from largest to smallest level.
apiclus2_des <- as_survey_design(
  apiclus2,
  ids = c(dnum, snum),
  weights = pw,
  fpc = c(fpc1, fpc2)
)
summary(apiclus2_des)
2 - level Cluster Sampling design
With (40, 126) clusters.
Called via srvyr
Probabilities:
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.003669 0.037743 0.052840 0.042390 0.052840 0.052840
Population size (PSUs): 757
Data variables:
[1] "cds" "stype" "name" "sname" "snum" "dname"
[7] "dnum" "cname" "cnum" "flag" "pcttest" "api00"
[13] "api99" "target" "growth" "sch.wide" "comp.imp" "both"
[19] "awards" "meals" "ell" "yr.rnd" "mobility" "acs.k3"
[25] "acs.46" "acs.core" "pct.resp" "not.hsg" "hsg" "some.col"
[31] "col.grad" "grad.sch" "avg.ed" "full" "emer" "enroll"
[37] "api.stu" "pw" "fpc1" "fpc2" "meals_cut" "hsg_cut"
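The two-stage structure described above can be confirmed from the raw data before trusting the design object. The sketch below counts sampled districts and the schools sampled within each; cluster_check is a name introduced here for illustration.

```r
library(dplyr)

data(api, package = "survey")

# One row per sampled district (first-stage cluster), with the number of
# schools sampled inside it (second-stage units) and the stage-level fpc values.
cluster_check <- apiclus2 |>
  group_by(dnum) |>
  summarize(
    n_schools = n(),
    fpc1 = first(fpc1),
    fpc2 = first(fpc2)
  )

nrow(cluster_check)            # sampled districts
sum(cluster_check$n_schools)   # sampled schools
max(cluster_check$n_schools)   # largest within-district sample
unique(cluster_check$fpc1)     # first-stage population size
```

These counts should line up with the "(40, 126) clusters" and "Population size (PSUs): 757" lines in the summary output.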