Ipsos Public Affairs (Ipsos) conducted a survey on behalf of Pew Research Center from January 3 to 13, 2020, about knowledge of and attitudes toward the 2020 census. The target population was non-institutionalized adults age 18 and older residing in the United States. The survey was conducted in part to explore the impact of a change in questions related to race and ethnicity.
This project uses other features in the data to further explore the importance of racial/ethnic origins to self-identification. I estimate the relationship between measures of importance and five respondent features: age, gender, education, state of residence, and political ideology.
The 2020 Census survey #1 data set is available for download from the Pew Research Center website. The link is to the project page. To navigate to this survey from the Pew Research Center home page, click TOOLS & RESOURCES. In the Dataset Downloads section, select Social & Demographic Trends from the pull-down selection box. You must register for a free account to continue. Then scroll down or search for 2020 Census survey #1.
I unzipped the file to my project data directory. Jan20 Census_cleaned dataset.sav is an SPSS data file. The Jan20 Census_methodology.pdf file contains the essential data documentation.
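# Packages used by the code in this section
library(dplyr)    # %>%, mutate(), case_when(), select(), count(), pull()
library(forcats)  # fct_relevel(), fct_drop()
library(janitor)  # tabyl()
pew_dat_0 <- foreign::read.spss(
  "../data/Jan20 Census_cleaned dataset.sav",
  to.data.frame = TRUE
)
dim(pew_dat_0)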
## [1] 3535 156
The data set consists of 156 variables collected from 3,535 participants. Only a fraction of the columns are of interest for this analysis.
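With 156 columns, it helps to pull out the families of related variables by name prefix. The sketch below shows the idea on a small invented data frame. The column names mimic the survey's naming, but the values are made up and `toy` is not the Pew data.

```r
# Toy stand-in for the survey data frame (invented values, hypothetical subset of columns)
toy <- data.frame(
  CASEID          = 1:2,
  CENHISPAN2020_1 = c("Yes", "No"),
  CENHISPAN2020_2 = c("No", "Yes"),
  CENRACE2020_1   = c("Yes", "No"),
  weight          = c(0.8, 1.2)
)

# Names of all columns that belong to the Hispanic-origin check-box question
hisp_cols <- grep("^CENHISPAN2020_", names(toy), value = TRUE)
print(hisp_cols)
```

The same call on `names(pew_dat_0)` should list the check-box columns for each question before deciding which ones to keep.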
The survey asked all respondents,
The next two questions are the exact wording for how the 2020 census will ask about Hispanic origin and race. We’d like to know what your answer would be.
The first of the two questions, recorded in variables CENHISPAN2020_1 through CENHISPAN2020_5, was
Are you of Hispanic, Latino, or Spanish origin?
- No, not of Hispanic, Latino, or Spanish origin (CENHISPAN2020_1)
- Yes, Mexican, Mexican American, Chicano (CENHISPAN2020_2)
- Yes, Puerto Rican (CENHISPAN2020_3)
- Yes, Cuban (CENHISPAN2020_4)
- Yes, another Hispanic, Latino, or Spanish origin (CENHISPAN2020_5)
  Enter, for example, Salvadoran, Dominican, Colombian, Guatemalan, Spaniard, Ecuadorian, etc.
The responses were collected using check boxes, permitting multiple selections. I will simplify the information to just hispanic (Yes | No) using variable CENHISPAN2020_1. If the respondent selected CENHISPAN2020_1, the survey recorded it as “Yes”, meaning “No, not Hispanic”, so I need to reverse the labeling.
pew_dat_1 <- pew_dat_0 %>%
  mutate(hispanic = factor(case_when(CENHISPAN2020_1 == "No" ~ "Yes",
                                     CENHISPAN2020_1 == "Yes" ~ "No",
                                     TRUE ~ as.character(CENHISPAN2020_1)),
                           levels = c("No", "Yes", "Refused")))
pew_dat_1 %>% tabyl(hispanic)
## hispanic n percent
## No 2660 0.75247525
## Yes 814 0.23026874
## Refused 61 0.01725601
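A quick way to confirm that a reversal like this did what was intended is to cross-tabulate the original column against the recoded one. The sketch below does that in base R on an invented four-row example. The level names "No", "Yes" and "Refused" are assumed from the output above, and the data are not from the survey.

```r
# Toy stand-in for the SPSS check-box column (invented values)
toy <- data.frame(
  CENHISPAN2020_1 = factor(c("Yes", "No", "Refused", "No"),
                           levels = c("No", "Yes", "Refused"))
)

# Same reversal as in the text, written as a base R lookup table
flip <- c("No" = "Yes", "Yes" = "No", "Refused" = "Refused")
toy$hispanic <- factor(unname(flip[as.character(toy$CENHISPAN2020_1)]),
                       levels = c("No", "Yes", "Refused"))

# Every "Yes" should land in "No" and vice versa; "Refused" should stay put
check <- table(original = toy$CENHISPAN2020_1, recoded = toy$hispanic)
print(check)
```

If any count shows up in the "No"/"No" or "Yes"/"Yes" cells, the recoding did not flip those rows.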
The second of the two questions, recorded in variables CENRACE2020_1 through CENRACE2020_15, was
What is your race?
[Select one or more boxes AND enter origins. For this survey, Hispanic origins are not races.]
- White
  Enter, for example, German, Irish, English, Italian, Lebanese, Egyptian, etc.
- Black or African American
  Enter, for example, African American, Jamaican, Haitian, Nigerian, Ethiopian, Somali, etc.
- American Indian or Alaska Native
  Enter name of enrolled or principal tribe(s), for example, Navajo Nation, Blackfeet Tribe, Mayan, Aztec, Native Village of Barrow Inupiat Traditional Government, Nome Eskimo Community, etc.
- Chinese
- Filipino
- Asian Indian
- Vietnamese
- Korean
- Japanese
- Other Asian
  Enter, for example, Pakistani, Cambodian, Hmong, etc.
- Native Hawaiian
- Samoan
- Chamorro
- Other Pacific Islander
  Enter, for example, Tongan, Fijian, Marshallese, etc.
- Some other race
  Enter race or origin.
Pew notes in the figure captions to their summary article Black and Hispanic Americans See Their Origins as Central to Who They Are, Less So for White Adults that
White and Black adults include those who report being only one race and are not Hispanic. Hispanics are of any race. Share of respondents who didn’t offer an answer not shown.
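The Jan20 Census_readme.txt file included in the data zip file explains that field racnum is the number of races selected in CENRACE2020_[1..15]. I’ll follow Pew’s lead: Hispanic respondents of any race are classified first, and White and Black are limited to respondents who selected only that one race.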
pew_dat_2 <- pew_dat_1 %>%
  mutate(
    origins = factor(
      case_when(
        hispanic == "Yes" ~ "Hispanic",
        racnum == 1 & CENRACE2020_1 == "Yes" ~ "White",
        racnum == 1 & CENRACE2020_2 == "Yes" ~ "Black",
        TRUE ~ "Other"
      ),
      levels = c("White", "Black", "Hispanic", "Other")
    )
  )
pew_dat_2 %>% tabyl(origins)
## origins n percent
## White 2088 0.59066478
## Black 275 0.07779349
## Hispanic 814 0.23026874
## Other 358 0.10127298
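The order of the conditions in `case_when()` matters here, because the first matching condition wins. The sketch below applies the same rules to five invented respondents to show how edge cases are classified. The `toy` data frame and its values are hypothetical, not taken from the survey.

```r
library(dplyr)

# Five invented respondents covering the edge cases
toy <- data.frame(
  hispanic      = c("Yes", "No", "No", "No", "Refused"),
  racnum        = c(1, 1, 1, 2, 1),
  CENRACE2020_1 = c("Yes", "Yes", "No", "Yes", "Yes"),  # White box
  CENRACE2020_2 = c("No", "No", "Yes", "Yes", "No")     # Black box
)

toy <- toy %>%
  mutate(
    origins = case_when(
      hispanic == "Yes" ~ "Hispanic",                  # checked first, so any race
      racnum == 1 & CENRACE2020_1 == "Yes" ~ "White",  # single race, White only
      racnum == 1 & CENRACE2020_2 == "Yes" ~ "Black",  # single race, Black only
      TRUE ~ "Other"                                   # multiracial and all others
    )
  )
print(toy)
```

Row 1 is Hispanic even though the White box is checked, and row 4 falls into "Other" because two races were selected. Row 5 shows a side effect of the rules as written: a respondent who refused the Hispanic question but selected only White is classified as White.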
The survey asked respondents three questions related to how they thought about their origins. Each had the same lead-in.
Here is a pair of statements about how you think about your origin (for example, German, Mexican, Jamaican, Chinese, etc.) Which statement comes closer to your view – even if neither is exactly right?
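- CENIDENTITYa: “My origin is central to my identity” or “My origin is not central to my identity”
- CENIDENTITYb: “I am very familiar with my origins” or “I am not too familiar with my origins”
- CENIDENTITYc: “I feel a strong connection with the cultural origin of my family” or “I do not feel a strong connection with the cultural origin of my family”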
pew_dat_3 <- pew_dat_2 %>%
  mutate(
    central = factor(
      CENIDENTITYa,
      levels = c("My origin is not central to my identity",
                 "My origin is central to my identity",
                 "Refused"),
      labels = c("Not central", "Central", "Refused")
    ),
    familiar = factor(
      CENIDENTITYb,
      levels = c("I am not too familiar with my origins",
                 "I am very familiar with my origins",
                 "Refused"),
      labels = c("Not too familiar", "Very familiar", "Refused")
    ),
    connection = factor(
      CENIDENTITYc,
      levels = c("I do not feel a strong connection with the cultural origin of my family",
                 "I feel a strong connection with the cultural origin of my family",
                 "Refused"),
      labels = c("Not a strong connection", "Strong connection", "Refused")
    )
  )
pew_dat_3 %>% tabyl(central)
## central n percent
## Not central 2204 0.62347949
## Central 1199 0.33917963
## Refused 132 0.03734088
pew_dat_3 %>% tabyl(familiar)
## familiar n percent
## Not too familiar 1188 0.33606789
## Very familiar 2246 0.63536068
## Refused 101 0.02857143
pew_dat_3 %>% tabyl(connection)
## connection n percent
## Not a strong connection 1726 0.48826025
## Strong connection 1714 0.48486563
## Refused 95 0.02687412
I want to estimate the relationship between the importance of origins and other features available in the survey. There are five interesting features: age, gender, education, state of residence, and political ideology.
pew_dat_4 <- pew_dat_3 %>%
  mutate(
    # Put "Moderate" first so it becomes the reference level
    IDEO = fct_relevel(IDEO, c("Moderate", "Conservative")),
    # There's a level for "under 18", but everyone is 18 or older
    ppagecat = fct_drop(ppagecat),
    CASEID = as.character(CASEID)
  ) %>%
  select(
    CASEID, weight, origins, central, familiar, connection,
    ppage, ppagecat, ppgender, ppeducat, ppstaten, ppreg4, ppreg9, IDEO
  )
pew_dat_4 %>% pull(ppage) %>% summary()
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.00 37.00 53.00 50.99 64.00 94.00
pew_dat_4 %>% tabyl(ppagecat)
## ppagecat n percent
## 18-24 189 0.05346535
## 25-34 544 0.15388967
## 35-44 605 0.17114569
## 45-54 553 0.15643564
## 55-64 779 0.22036775
## 65-74 596 0.16859972
## 75+ 269 0.07609618
pew_dat_4 %>% tabyl(ppgender)
## ppgender n percent
## Male 1754 0.496181
## Female 1781 0.503819
pew_dat_4 %>% tabyl(ppeducat)
## ppeducat n percent
## Less than high school 317 0.08967468
## High school 1004 0.28401697
## Some college 966 0.27326733
## Bachelor's degree or higher 1248 0.35304102
pew_dat_4 %>% count(ppreg4, ppreg9, ppstaten)
## ppreg4 ppreg9 ppstaten n
## 1 Northeast New England ME 11
## 2 Northeast New England NH 20
## 3 Northeast New England VT 6
## 4 Northeast New England MA 56
## 5 Northeast New England RI 12
## 6 Northeast New England CT 48
## 7 Northeast Mid-Atlantic NY 190
## 8 Northeast Mid-Atlantic NJ 123
## 9 Northeast Mid-Atlantic PA 151
## 10 Midwest East-North Central OH 125
## 11 Midwest East-North Central IN 56
## 12 Midwest East-North Central IL 130
## 13 Midwest East-North Central MI 106
## 14 Midwest East-North Central WI 86
## 15 Midwest West-North Central MN 60
## 16 Midwest West-North Central IA 33
## 17 Midwest West-North Central MO 48
## 18 Midwest West-North Central ND 4
## 19 Midwest West-North Central SD 10
## 20 Midwest West-North Central NE 13
## 21 Midwest West-North Central KS 22
## 22 South South Atlantic DE 12
## 23 South South Atlantic MD 60
## 24 South South Atlantic DC 10
## 25 South South Atlantic VA 118
## 26 South South Atlantic WV 11
## 27 South South Atlantic NC 101
## 28 South South Atlantic SC 46
## 29 South South Atlantic GA 86
## 30 South South Atlantic FL 253
## 31 South East-South Central KY 40
## 32 South East-South Central TN 64
## 33 South East-South Central AL 42
## 34 South East-South Central MS 27
## 35 South West-South Central AR 27
## 36 South West-South Central LA 47
## 37 South West-South Central OK 37
## 38 South West-South Central TX 320
## 39 West Mountain MT 12
## 40 West Mountain ID 13
## 41 West Mountain WY 3
## 42 West Mountain CO 44
## 43 West Mountain NM 33
## 44 West Mountain AZ 83
## 45 West Mountain UT 37
## 46 West Mountain NV 28
## 47 West Pacific WA 98
## 48 West Pacific OR 52
## 49 West Pacific CA 503
## 50 West Pacific AK 3
## 51 West Pacific HI 15
pew_dat_4 %>% tabyl(IDEO)
## IDEO n percent
## Moderate 1512 0.42772277
## Conservative 866 0.24497878
## Very conservative 319 0.09024045
## Liberal 563 0.15926450
## Very liberal 189 0.05346535
## Refused 86 0.02432815
One final look at the refined data set. CASEID is a unique identifier for the weighted group, and weight is the response weight.
pew_dat_4 %>% skimr::skim()
Name | Piped data |
Number of rows | 3535 |
Number of columns | 14 |
_______________________ | |
Column type frequency: | |
character | 1 |
factor | 11 |
numeric | 2 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
CASEID | 0 | 1 | 1 | 4 | 0 | 3535 | 0 |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
origins | 0 | 1 | FALSE | 4 | Whi: 2088, His: 814, Oth: 358, Bla: 275 |
central | 0 | 1 | FALSE | 3 | Not: 2204, Cen: 1199, Ref: 132 |
familiar | 0 | 1 | FALSE | 3 | Ver: 2246, Not: 1188, Ref: 101 |
connection | 0 | 1 | FALSE | 3 | Not: 1726, Str: 1714, Ref: 95 |
ppagecat | 0 | 1 | FALSE | 7 | 55-: 779, 35-: 605, 65-: 596, 45-: 553 |
ppgender | 0 | 1 | FALSE | 2 | Fem: 1781, Mal: 1754 |
ppeducat | 0 | 1 | FALSE | 4 | Bac: 1248, Hig: 1004, Som: 966, Les: 317 |
ppstaten | 0 | 1 | FALSE | 51 | CA: 503, TX: 320, FL: 253, NY: 190 |
ppreg4 | 0 | 1 | FALSE | 4 | Sou: 1301, Wes: 924, Mid: 693, Nor: 617 |
ppreg9 | 0 | 1 | FALSE | 9 | Sou: 697, Pac: 671, Eas: 503, Mid: 464 |
IDEO | 0 | 1 | FALSE | 6 | Mod: 1512, Con: 866, Lib: 563, Ver: 319 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
weight | 0 | 1 | 1.00 | 0.68 | 0.24 | 0.59 | 0.82 | 1.15 | 4.42 | ▇▂▁▁▁ |
ppage | 0 | 1 | 50.99 | 16.87 | 18.00 | 37.00 | 53.00 | 64.00 | 94.00 | ▅▆▇▆▁ |
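The counts shown so far are unweighted. Because weight is a response weight, population-level shares would normally sum the weights in each cell instead of counting rows. The sketch below shows one way to do that in base R with `xtabs()` and `prop.table()` on invented data. It is an illustration of the idea, not the weighting procedure Pew used, and `toy` is hypothetical.

```r
# Six invented respondents (not survey data)
toy <- data.frame(
  origins = c("White", "White", "Black", "Black", "Hispanic", "Hispanic"),
  central = c("Central", "Not central", "Central", "Central",
              "Central", "Not central"),
  weight  = c(0.5, 1.5, 1.0, 1.0, 2.0, 1.0)
)

# Sum of weights in each origins x central cell
wtd <- xtabs(weight ~ origins + central, data = toy)

# Weighted share of each answer within each origins group (rows sum to 1)
wtd_share <- prop.table(wtd, margin = 1)
print(round(wtd_share, 3))
```

In the toy data the unweighted share of "Central" among White respondents is one half, while the weighted share is 0.5 / (0.5 + 1.5) = 0.25, which shows how much the weights can move an estimate.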
Save the refined data set to a file for subsequent steps.
saveRDS(pew_dat_4, "../data/1_data_mgmt.rds")
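One reason to use RDS here instead of CSV is that it keeps column types and factor level order, so a releveled factor such as IDEO comes back unchanged in later steps. The sketch below checks that round trip on a small invented data frame written to a temporary file. The `toy` object and `path` are illustrative and unrelated to the project files.

```r
# Invented data frame with a factor whose level order is not alphabetical
toy <- data.frame(
  CASEID  = c("1", "2", "3"),
  central = factor(c("Central", "Not central", "Refused"),
                   levels = c("Not central", "Central", "Refused"))
)

# Write to a temporary file and read it back
path <- tempfile(fileext = ".rds")
saveRDS(toy, path)
restored <- readRDS(path)

# The restored object should match, including the factor level order
print(all.equal(toy, restored))
print(levels(restored$central))
```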