4  Data Preparation

library(tidyverse)
library(scales)
library(janitor)
library(survey)
library(srvyr)
library(gtsummary)

This book uses the api datasets from the survey package for examples. The Academic Performance Index (API) was a school rating system used in California for several years. The survey package includes several datasets that mimic survey samples from the overall population of 6,194 schools.

data(api) loads several data frames, including a simple random sample (apisrs), a stratified simple random sample (apistrat), and a two-stage cluster sample (apiclus2).
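To see what data(api) loads without touching the global workspace, the datasets can be loaded into a scratch environment and listed. This is a minimal sketch that assumes the survey package is installed; api_env, api_names, and api_rows are illustrative names, not part of the book's code.

```r
# Load the api datasets into a scratch environment so that nothing
# in the global workspace is overwritten.
api_env <- new.env()
data(api, package = "survey", envir = api_env)

# Names of the data frames that were loaded.
api_names <- sort(ls(api_env))
print(api_names)

# Number of rows (schools) in each data frame.
api_rows <- vapply(mget(api_names, envir = api_env), nrow, integer(1))
print(api_rows)
```

The row counts for apisrs, apistrat, and apiclus2 should agree with the sample sizes quoted in the rest of this chapter.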

data(api, package = "survey")

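# Add derived columns used in later statistical test examples.
prep_data <- function(df) {
  df |>
    mutate(
      # School type as an ordered factor: Elementary < Middle < High.
      stype = factor(stype, levels = c("E", "M", "H"), ordered = TRUE),
      # Bin the meals and hsg percentages into [0,12], (12,25], (25,100].
      meals_cut = cut(meals, c(0, 12, 25, 100), include.lowest = TRUE),
      hsg_cut = cut(hsg, c(0, 12, 25, 100), include.lowest = TRUE)
    )
}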

apisrs <- prep_data(apisrs)

apistrat <- prep_data(apistrat)

apiclus2 <- prep_data(apiclus2)
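The behavior of the two helpers used in prep_data() is easy to check on a small vector. The sketch below uses base R only; the values in x are hypothetical percentages chosen to sit on and around the break points, not values from the api data.

```r
# Hypothetical percentages placed on and around the break points.
x <- c(0, 12, 12.5, 25, 25.1, 100)

# Same breaks as prep_data(): intervals are closed on the right, and
# include.lowest = TRUE puts 0 into the first bin instead of NA.
bins <- cut(x, c(0, 12, 25, 100), include.lowest = TRUE)
print(data.frame(x, bins))

# Without include.lowest, a value of exactly 0 is left unclassified.
zero_is_na <- is.na(cut(0, c(0, 12, 25, 100)))
print(zero_is_na)

# An ordered factor supports comparisons that follow the level order E < M < H.
stype_demo <- factor(c("H", "E", "M"), levels = c("E", "M", "H"), ordered = TRUE)
below_high <- stype_demo < "H"
print(below_high)
```

A value of exactly 12 or 25 lands in the lower bin, which matters for variables like meals that are reported as whole percentages.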

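Schools are uniquely identified by column snum. Schools roll up to districts, dnum. The remaining design columns hold metadata about the sampling design: pw is the sampling weight, and fpc (fpc1 and fpc2 in apiclus2) is the finite population correction, given here as the population size.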

Let’s create the design objects.

apisrs is a simple random sample of 200 schools from a population of 6,194, so fpc = 6194 and pw = 6194 / 200 = 30.97 for all rows.

apisrs_des <- as_survey_design(apisrs, weights = pw, fpc = fpc)

summary(apisrs_des)
Independent Sampling design
Called via srvyr
Probabilities:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.03229 0.03229 0.03229 0.03229 0.03229 0.03229 
Population size (PSUs): 6194 
Data variables:
 [1] "cds"       "stype"     "name"      "sname"     "snum"      "dname"    
 [7] "dnum"      "cname"     "cnum"      "flag"      "pcttest"   "api00"    
[13] "api99"     "target"    "growth"    "sch.wide"  "comp.imp"  "both"     
[19] "awards"    "meals"     "ell"       "yr.rnd"    "mobility"  "acs.k3"   
[25] "acs.46"    "acs.core"  "pct.resp"  "not.hsg"   "hsg"       "some.col" 
[31] "col.grad"  "grad.sch"  "avg.ed"    "full"      "emer"      "enroll"   
[37] "api.stu"   "pw"        "fpc"       "meals_cut" "hsg_cut"  
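The weight and the probability reported in the summary are two views of the same quantity. As a quick check of the arithmetic, using only the population and sample sizes quoted above (the variable names are illustrative):

```r
N <- 6194  # population size, the value stored in the fpc column
n <- 200   # sample size

pw <- N / n    # sampling weight: schools represented by each sampled school
prob <- n / N  # inclusion probability, the inverse of the weight

print(round(pw, 2))    # 30.97
print(round(prob, 5))  # 0.03229, the single value shown under Probabilities
print(n * pw)          # the weights sum back to the population size
```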

apistrat is a sample of 200 schools from a population stratified by school type, stype: E = Elementary (n = 100, fpc = 4421, pw = 44.2), M = Middle (n = 50, fpc = 1018, pw = 20.4), and H = High School (n = 50, fpc = 755, pw = 15.1). Within each stratum, pw equals fpc / n.

apistrat_des <- as_survey_design(apistrat, weights = pw, fpc = fpc, strata = stype)

summary(apistrat_des)
Stratified Independent Sampling design
Called via srvyr
Probabilities:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.02262 0.02262 0.03587 0.04014 0.05339 0.06623 
Stratum Sizes: 
             E  H  M
obs        100 50 50
design.PSU 100 50 50
actual.PSU 100 50 50
Population stratum sizes (PSUs): 
   E    H    M 
4421  755 1018 
Data variables:
 [1] "cds"       "stype"     "name"      "sname"     "snum"      "dname"    
 [7] "dnum"      "cname"     "cnum"      "flag"      "pcttest"   "api00"    
[13] "api99"     "target"    "growth"    "sch.wide"  "comp.imp"  "both"     
[19] "awards"    "meals"     "ell"       "yr.rnd"    "mobility"  "acs.k3"   
[25] "acs.46"    "acs.core"  "pct.resp"  "not.hsg"   "hsg"       "some.col" 
[31] "col.grad"  "grad.sch"  "avg.ed"    "full"      "emer"      "enroll"   
[37] "api.stu"   "pw"        "fpc"       "meals_cut" "hsg_cut"  
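The per-stratum weights and the spread of probabilities in the summary can be reproduced from the stratum sizes alone. A small sketch in base R, using the counts quoted above (N_h, n_h, and the other names are illustrative):

```r
# Population and sample sizes per stratum, as quoted in the text.
N_h <- c(E = 4421, M = 1018, H = 755)
n_h <- c(E = 100, M = 50, H = 50)

pw_h <- N_h / n_h    # weight within each stratum
prob_h <- n_h / N_h  # inclusion probability within each stratum

print(round(pw_h, 1))    # 44.2 20.4 15.1
print(round(prob_h, 5))  # 0.02262 0.04912 0.06623

# One probability per sampled school, to compare with the summary output.
probs <- rep(prob_h, times = n_h)
print(round(median(probs), 5))  # 0.03587

# The weighted sample represents the whole population of schools.
print(sum(n_h * pw_h))
```

The median falls between the elementary and middle school probabilities because exactly half of the 200 sampled schools are elementary schools.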

apiclus2 is a two-stage cluster sample of 126 schools within districts. The first stage is a random sample of 40 of the 757 school districts (dnum). The second stage is a random sample of up to 5 schools (snum) from each sampled district. For cluster designs, the cluster ids are specified in the design object from the largest level to the smallest, and the fpc columns follow the same order.

apiclus2_des <- as_survey_design(
  apiclus2,
  ids = c(dnum, snum),
  weights = pw,
  fpc = c(fpc1, fpc2)
)

summary(apiclus2_des)
2 - level Cluster Sampling design
With (40, 126) clusters.
Called via srvyr
Probabilities:
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
0.003669 0.037743 0.052840 0.042390 0.052840 0.052840 
Population size (PSUs): 757 
Data variables:
 [1] "cds"       "stype"     "name"      "sname"     "snum"      "dname"    
 [7] "dnum"      "cname"     "cnum"      "flag"      "pcttest"   "api00"    
[13] "api99"     "target"    "growth"    "sch.wide"  "comp.imp"  "both"     
[19] "awards"    "meals"     "ell"       "yr.rnd"    "mobility"  "acs.k3"   
[25] "acs.46"    "acs.core"  "pct.resp"  "not.hsg"   "hsg"       "some.col" 
[31] "col.grad"  "grad.sch"  "avg.ed"    "full"      "emer"      "enroll"   
[37] "api.stu"   "pw"        "fpc1"      "fpc2"      "meals_cut" "hsg_cut"
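In a two-stage design, a school's overall inclusion probability is the product of the stage probabilities. The sketch below is the textbook calculation under the design described above, not necessarily how the pw column in apiclus2 was constructed; inclusion_prob() and the district sizes passed to it are hypothetical.

```r
# Stage 1: 40 of the 757 districts are sampled.
p_stage1 <- 40 / 757

# Stage 2: up to 5 schools are sampled within a sampled district.
# `district_size` is the number of schools in the district (hypothetical input).
inclusion_prob <- function(district_size, max_schools = 5) {
  n_sampled <- pmin(max_schools, district_size)
  p_stage2 <- n_sampled / district_size
  p_stage1 * p_stage2
}

# A small district (all 3 schools taken) and a larger one (5 of 20 taken).
p_small <- inclusion_prob(3)
p_large <- inclusion_prob(20)

print(round(p_small, 5))  # 0.05284
print(round(p_large, 5))  # 0.01321
print(1 / p_large)        # implied weight for a school in the larger district
```

The value for the small district is the same as the maximum probability in the summary output, which is what one would expect for districts where every school was sampled.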