Ipsos Public Affairs (Ipsos) conducted a survey on behalf of Pew Research Center from January 3-13, 2020, on knowledge of and attitudes toward the 2020 census. The target population was non-institutionalized adults age 18 and older residing in the United States. The survey was conducted in part to explore the impact of a change in questions related to race and ethnicity.

This project uses other features in the data to further explore the importance of racial/ethnic origins to self-identification. I estimate the relationship between measures of importance and five respondent features: age, gender, education, state of residence, and political ideology.

Data Set

The 2020 Census survey #1 data set is available for download at the Pew Research Center website. The link is to the project page. To navigate to this survey from the Pew Research Center home page, click TOOLS & RESOURCES. In the Dataset Downloads section, select Social & Demographic Trends from the pull-down selection box. You must register for a free account to continue. Scroll down or search for 2020 Census survey #1.

I unzipped the file to my project data directory. Jan20 Census_cleaned dataset.sav is an SPSS data file. The Jan20 Census_methodology.pdf file contains the essential data documentation.

pew_dat_0 <- foreign::read.spss(
  "../data/Jan20 Census_cleaned dataset.sav", 
  to.data.frame = TRUE
)

dim(pew_dat_0)
## [1] 3535  156

The data set consists of 156 variables collected from 3,535 participants. Only a fraction of the columns are of interest for this analysis.

Data Engineering

The survey asked all respondents,

The next two questions are the exact wording for how the 2020 census will ask about Hispanic origin and race. We’d like to know what your answer would be.

The first of the two questions, recorded in variables CENHISPAN2020_[1..5], was

Are you of Hispanic, Latino, or Spanish origin?

  1. No, not of Hispanic, Latino, or Spanish origin CENHISPAN2020_1
  2. Yes, Mexican, Mexican American, Chicano CENHISPAN2020_2
  3. Yes, Puerto Rican CENHISPAN2020_3
  4. Yes, Cuban CENHISPAN2020_4
  5. Yes, another Hispanic, Latino, or Spanish origin CENHISPAN2020_5
    Enter, for example, Salvadoran, Dominican, Colombian, Guatemalan, Spaniard, Ecuadorian, etc.

The responses were collected using check boxes, permitting multiple selections. I will simplify the information to just hispanic (Yes | No) using variable CENHISPAN2020_1. If the respondent selected CENHISPAN2020_1, the survey recorded it as “Yes”, meaning “No, not Hispanic”, so I need to reverse the labeling.

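library(dplyr)    # %>%, mutate(), case_when(), select(), count()
library(forcats)  # fct_relevel(), fct_drop()
library(janitor)  # tabyl()

# CENHISPAN2020_1 == "Yes" means "No, not Hispanic", so flip the label.
pew_dat_1 <- pew_dat_0 %>% 
  mutate(hispanic = factor(case_when(CENHISPAN2020_1 == "No" ~ "Yes",
                                     CENHISPAN2020_1 == "Yes" ~ "No",
                                     TRUE ~ as.character(CENHISPAN2020_1)),
                           levels = c("No", "Yes", "Refused")))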

pew_dat_1 %>% tabyl(hispanic)
##  hispanic    n    percent
##        No 2660 0.75247525
##       Yes  814 0.23026874
##   Refused   61 0.01725601

The second of the two questions, recorded in variables CENRACE2020_[1..15], was

What is your race?
[Select one or more boxes AND enter origins. For this survey, Hispanic origins are not races.]
  1. White
    Enter, for example, German, Irish, English, Italian, Lebanese, Egyptian, etc.
  2. Black or African American
    Enter, for example, African American, Jamaican, Haitian, Nigerian, Ethiopian, Somali, etc.
  3. American Indian or Alaska Native
    Enter name of enrolled or principal tribe(s), for example, Navajo Nation, Blackfeet Tribe, Mayan, Aztec, Native Village of Barrow Inupiat Traditional Government, Nome Eskimo Community, etc.
  4. Chinese
  5. Filipino
  6. Asian Indian
  7. Vietnamese
  8. Korean
  9. Japanese
  10. Other Asian
    Enter, for example, Pakistani, Cambodian, Hmong, etc.
  11. Native Hawaiian
  12. Samoan
  13. Chamorro
  14. Other Pacific Islander
    Enter, for example, Tongan, Fijian, Marshallese, etc.
  15. Some other race
    Enter race or origin.

Pew notes in the figure captions to their summary article "Black and Hispanic Americans See Their Origins as Central to Who They Are, Less So for White Adults" that

White and Black adults include those who report being only one race and are not Hispanic. Hispanics are of any race. Share of respondents who didn’t offer an answer not shown.

The Jan20 Census_readme.txt file included in the data zip file explains that field racnum is the number of races selected in CENRACE2020_[1..15]. I'll follow Pew's lead and use it to identify single-race respondents.

pew_dat_2 <- pew_dat_1 %>%
  mutate(
    origins = factor(
      case_when(
        hispanic == "Yes" ~ "Hispanic",
        racnum == 1 & CENRACE2020_1 == "Yes" ~ "White",
        racnum == 1 & CENRACE2020_2 == "Yes" ~ "Black",
        TRUE ~ "Other"
      ),
      levels = c("White", "Black", "Hispanic", "Other")
    )
  )
pew_dat_2 %>% tabyl(origins) 
##   origins    n    percent
##     White 2088 0.59066478
##     Black  275 0.07779349
##  Hispanic  814 0.23026874
##     Other  358 0.10127298
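
The order of the conditions in case_when() matters here: the Hispanic test comes first, so a Hispanic respondent of any race is classified as "Hispanic" before the single-race tests are reached. The sketch below illustrates that precedence on a made-up five-row tibble; the toy values are assumptions for illustration, not survey records.

```r
library(dplyr)

# Hypothetical respondents (not survey data) covering each branch.
toy <- tibble(
  hispanic      = c("Yes", "No",  "No",  "No",  "Yes"),
  racnum        = c(1,     1,     1,     2,     2),
  CENRACE2020_1 = c("Yes", "Yes", "No",  "Yes", "Yes"),
  CENRACE2020_2 = c("No",  "No",  "Yes", "Yes", "Yes")
)

toy_origins <- toy %>%
  mutate(
    origins = case_when(
      hispanic == "Yes" ~ "Hispanic",                  # checked first: any race
      racnum == 1 & CENRACE2020_1 == "Yes" ~ "White",  # single race, White
      racnum == 1 & CENRACE2020_2 == "Yes" ~ "Black",  # single race, Black
      TRUE ~ "Other"                                   # everything else
    )
  )

toy_origins$origins
```

The first row (Hispanic and White) lands in "Hispanic", and the fourth row (two races, not Hispanic) lands in "Other".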

The survey asked respondents three questions related to how they thought about their origins. Each had the same lead-in.

Here is a pair of statements about how you think about your origin (for example, German, Mexican, Jamaican, Chinese, etc.) Which statement comes closer to your view – even if neither is exactly right?

CENIDENTITYa
  • My origin is central to my identity
  • My origin is not central to my identity
CENIDENTITYb
  • I am very familiar with my origins
  • I am not too familiar with my origins
CENIDENTITYc
  • I feel a strong connection with the cultural origin of my family
  • I do not feel a strong connection with the cultural origin of my family

pew_dat_3 <- pew_dat_2 %>%
  mutate(
    central = factor(
      CENIDENTITYa, 
      levels = c("My origin is not central to my identity", 
                 "My origin is central to my identity", 
                 "Refused"),
      labels = c("Not central", "Central", "Refused")
    ),
    familiar = factor(
      CENIDENTITYb, 
      levels = c("I am not too familiar with my origins", 
                 "I am very familiar with my origins", 
                 "Refused"),
      labels = c("Not too familiar", "Very familiar", "Refused")
    ),
    connection = factor(
      CENIDENTITYc, 
      levels = c("I do not feel a strong connection with the cultural origin of my family", 
                 "I feel a strong connection with the cultural origin of my family", 
                 "Refused"),
      labels = c("Not a strong connection", "Strong connection", "Refused")
    )
  )

pew_dat_3 %>% tabyl(central)
##      central    n    percent
##  Not central 2204 0.62347949
##      Central 1199 0.33917963
##      Refused  132 0.03734088
pew_dat_3 %>% tabyl(familiar)
##          familiar    n    percent
##  Not too familiar 1188 0.33606789
##     Very familiar 2246 0.63536068
##           Refused  101 0.02857143
pew_dat_3 %>% tabyl(connection)
##               connection    n    percent
##  Not a strong connection 1726 0.48826025
##        Strong connection 1714 0.48486563
##                  Refused   95 0.02687412
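
Recoding with factor(levels = , labels = ) depends on the level strings matching the raw responses exactly; any value not listed in levels silently becomes NA. A minimal sketch of that behavior, using a made-up vector in which the fourth value is a hypothetical lower-case mismatch:

```r
# Hypothetical raw responses (not survey data); the last one differs in case.
raw <- c(
  "My origin is central to my identity",
  "My origin is not central to my identity",
  "Refused",
  "my origin is central to my identity"
)

central_toy <- factor(
  raw,
  levels = c("My origin is not central to my identity",
             "My origin is central to my identity",
             "Refused"),
  labels = c("Not central", "Central", "Refused")
)

# Unmatched values show up as NA, so counting NAs is a quick sanity check.
table(central_toy, useNA = "ifany")
sum(is.na(central_toy))
```

Running the same NA count on the recoded survey columns is one way to confirm that every response matched a level.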

I want to estimate the relationship between the importance of origins and other features available in the survey. There are five interesting features: age, gender, education, state of residence, and political ideology.

pew_dat_4 <- pew_dat_3 %>% 
  mutate(
    IDEO = fct_relevel(IDEO, c("Moderate", "Conservative")),
    # There's a level for "under 18", but every respondent is 18 or older
    ppagecat = fct_drop(ppagecat),
    CASEID = as.character(CASEID)
  ) %>%
  select(
    CASEID, weight, origins, central, familiar, connection,
    ppage, ppagecat, ppgender, ppeducat, ppstaten, ppreg4, ppreg9, IDEO
  )

pew_dat_4 %>% pull(ppage) %>% summary()
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   37.00   53.00   50.99   64.00   94.00
pew_dat_4 %>% tabyl(ppagecat)
##  ppagecat   n    percent
##     18-24 189 0.05346535
##     25-34 544 0.15388967
##     35-44 605 0.17114569
##     45-54 553 0.15643564
##     55-64 779 0.22036775
##     65-74 596 0.16859972
##       75+ 269 0.07609618
pew_dat_4 %>% tabyl(ppgender)
##  ppgender    n  percent
##      Male 1754 0.496181
##    Female 1781 0.503819
pew_dat_4 %>% tabyl(ppeducat)
##                     ppeducat    n    percent
##        Less than high school  317 0.08967468
##                  High school 1004 0.28401697
##                 Some college  966 0.27326733
##  Bachelor's degree or higher 1248 0.35304102
pew_dat_4 %>% count(ppreg4, ppreg9, ppstaten)
##       ppreg4             ppreg9 ppstaten   n
## 1  Northeast        New England       ME  11
## 2  Northeast        New England       NH  20
## 3  Northeast        New England       VT   6
## 4  Northeast        New England       MA  56
## 5  Northeast        New England       RI  12
## 6  Northeast        New England       CT  48
## 7  Northeast       Mid-Atlantic       NY 190
## 8  Northeast       Mid-Atlantic       NJ 123
## 9  Northeast       Mid-Atlantic       PA 151
## 10   Midwest East-North Central       OH 125
## 11   Midwest East-North Central       IN  56
## 12   Midwest East-North Central       IL 130
## 13   Midwest East-North Central       MI 106
## 14   Midwest East-North Central       WI  86
## 15   Midwest West-North Central       MN  60
## 16   Midwest West-North Central       IA  33
## 17   Midwest West-North Central       MO  48
## 18   Midwest West-North Central       ND   4
## 19   Midwest West-North Central       SD  10
## 20   Midwest West-North Central       NE  13
## 21   Midwest West-North Central       KS  22
## 22     South     South Atlantic       DE  12
## 23     South     South Atlantic       MD  60
## 24     South     South Atlantic       DC  10
## 25     South     South Atlantic       VA 118
## 26     South     South Atlantic       WV  11
## 27     South     South Atlantic       NC 101
## 28     South     South Atlantic       SC  46
## 29     South     South Atlantic       GA  86
## 30     South     South Atlantic       FL 253
## 31     South East-South Central       KY  40
## 32     South East-South Central       TN  64
## 33     South East-South Central       AL  42
## 34     South East-South Central       MS  27
## 35     South West-South Central       AR  27
## 36     South West-South Central       LA  47
## 37     South West-South Central       OK  37
## 38     South West-South Central       TX 320
## 39      West           Mountain       MT  12
## 40      West           Mountain       ID  13
## 41      West           Mountain       WY   3
## 42      West           Mountain       CO  44
## 43      West           Mountain       NM  33
## 44      West           Mountain       AZ  83
## 45      West           Mountain       UT  37
## 46      West           Mountain       NV  28
## 47      West            Pacific       WA  98
## 48      West            Pacific       OR  52
## 49      West            Pacific       CA 503
## 50      West            Pacific       AK   3
## 51      West            Pacific       HI  15
pew_dat_4 %>% tabyl(IDEO)
##               IDEO    n    percent
##           Moderate 1512 0.42772277
##       Conservative  866 0.24497878
##  Very conservative  319 0.09024045
##            Liberal  563 0.15926450
##       Very liberal  189 0.05346535
##            Refused   86 0.02432815
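
The two forcats calls used above are easy to confuse, so here is a small sketch of what each does. fct_relevel() moves the named levels to the front (making "Moderate" the reference level) and leaves the rest in their existing order; fct_drop() removes levels that have no observations. The toy factors and their starting level order are assumptions for illustration.

```r
library(forcats)

# Hypothetical ideology factor; the starting level order is assumed.
ideo_toy <- factor(
  c("Liberal", "Moderate", "Very conservative", "Moderate"),
  levels = c("Very conservative", "Conservative", "Moderate",
             "Liberal", "Very liberal")
)
ideo_releveled <- fct_relevel(ideo_toy, "Moderate", "Conservative")
levels(ideo_releveled)

# Hypothetical age factor with an unused "Under 18" level.
age_toy <- factor(c("18-24", "25-34"),
                  levels = c("Under 18", "18-24", "25-34"))
age_dropped <- fct_drop(age_toy)
levels(age_dropped)
```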

One final look at the refined data set. CASEID is a unique identifier for each respondent, and weight is the survey response weight.

pew_dat_4 %>% skimr::skim()
Data summary
Name Piped data
Number of rows 3535
Number of columns 14
_______________________
Column type frequency:
character 1
factor 11
numeric 2
________________________
Group variables None

Variable type: character

skim_variable  n_missing  complete_rate  min  max  empty  n_unique  whitespace
CASEID                 0              1    1    4      0      3535           0

Variable type: factor

skim_variable  n_missing  complete_rate  ordered  n_unique  top_counts
origins                0              1  FALSE           4  Whi: 2088, His: 814, Oth: 358, Bla: 275
central                0              1  FALSE           3  Not: 2204, Cen: 1199, Ref: 132
familiar               0              1  FALSE           3  Ver: 2246, Not: 1188, Ref: 101
connection             0              1  FALSE           3  Not: 1726, Str: 1714, Ref: 95
ppagecat               0              1  FALSE           7  55-: 779, 35-: 605, 65-: 596, 45-: 553
ppgender               0              1  FALSE           2  Fem: 1781, Mal: 1754
ppeducat               0              1  FALSE           4  Bac: 1248, Hig: 1004, Som: 966, Les: 317
ppstaten               0              1  FALSE          51  CA: 503, TX: 320, FL: 253, NY: 190
ppreg4                 0              1  FALSE           4  Sou: 1301, Wes: 924, Mid: 693, Nor: 617
ppreg9                 0              1  FALSE           9  Sou: 697, Pac: 671, Eas: 503, Mid: 464
IDEO                   0              1  FALSE           6  Mod: 1512, Con: 866, Lib: 563, Ver: 319

Variable type: numeric

skim_variable  n_missing  complete_rate   mean     sd     p0    p25    p50    p75   p100  hist
weight                 0              1   1.00   0.68   0.24   0.59   0.82   1.15   4.42  ▇▂▁▁▁
ppage                  0              1  50.99  16.87  18.00  37.00  53.00  64.00  94.00  ▅▆▇▆▁
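
Because weight ranges from 0.24 to 4.42, raw counts and weighted shares can differ noticeably. The sketch below shows one simple way to compare the two with weighted.mean() on a made-up tibble; the rows and weights are invented for illustration, and this is not necessarily how Pew produced its published estimates. Proper standard errors would need a survey-design approach, which this sketch does not attempt.

```r
library(dplyr)

# Hypothetical respondents and weights (not survey data).
toy <- tibble(
  origins = c("White", "White", "Black", "Black", "Hispanic"),
  central = c("Central", "Not central", "Central", "Central", "Not central"),
  weight  = c(0.5, 1.5, 1.0, 2.0, 1.0)
)

shares <- toy %>%
  group_by(origins) %>%
  summarise(
    # Share answering "Central", ignoring weights
    unweighted = mean(central == "Central"),
    # Same share with each respondent counted in proportion to weight
    weighted = weighted.mean(as.numeric(central == "Central"), weight),
    .groups = "drop"
  )

shares
```

In the toy data the "White" group has an unweighted share of 0.5 but a weighted share of 0.25, because the respondent answering "Central" carries the smaller weight.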

Save Data

Save the refined data set to a file for subsequent steps.

saveRDS(pew_dat_4, "../data/1_data_mgmt.rds")
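
I use the RDS format because it preserves R-specific attributes such as factor levels and their order, which a CSV would lose. A minimal round-trip sketch, using a made-up data frame and a temporary file rather than the project path:

```r
# Hypothetical stand-in for the refined data set.
toy <- data.frame(
  CASEID  = c("1", "2", "3"),
  central = factor(c("Central", "Not central", "Refused"),
                   levels = c("Not central", "Central", "Refused")),
  stringsAsFactors = FALSE
)

path <- tempfile(fileext = ".rds")  # temporary file, removed with the session
saveRDS(toy, path)
restored <- readRDS(path)

# The restored object should match the original, including level order.
all.equal(toy, restored)
levels(restored$central)
```

A later step can load the real file the same way with readRDS("../data/1_data_mgmt.rds").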