2  Item Selection

The design and analysis of a survey depends on its purpose. Exploratory surveys investigate topics with no particular expectation. They are usually qualitative and often ask open-ended questions which are analyzed for content (word/phrase frequency counts) and themes. Descriptive surveys measure the association between the survey topic(s) and the respondent attributes. They typically ask Likert scale questions. Explanatory surveys explain and quantify hypothesized relationships with inferential statistics.

In a survey project, exploratory factor analysis (EFA), item reduction, and confirmatory factor analysis (CFA) play critical roles in refining and validating the questionnaire. Here’s how they fit into the major phases of a survey project:

2.0.1 1. Survey Design and Item Development

  • Item Generation: During the initial design phase, you generate items (questions) based on the concepts you wish to measure. This involves creating questions that capture each domain or construct of interest.
  • Preliminary Item Pool: You often start with a larger pool of items than you plan to use in the final survey. The aim is to cover all relevant aspects of the constructs comprehensively, knowing that some items will be refined or removed later.

2.0.2 2. Pilot Testing and Exploratory Factor Analysis (EFA)

  • Pilot Survey Administration: After developing the preliminary item pool, a pilot survey is administered to a small group from the target population.
  • Exploratory Factor Analysis (EFA): EFA is conducted to explore the underlying structure of the data without imposing any predefined structure. This helps identify latent factors (or constructs) that the items represent and assess if the items group together as expected.
    • Purpose: EFA provides insights into whether items align with the hypothesized constructs or if adjustments are needed.
    • Outcome: The result of EFA often leads to item reduction, where redundant or low-loading items (items that don’t contribute well to the factors) are removed to create a cleaner, more interpretable structure.

2.0.3 3. Item Reduction and Refinement

  • Item Reduction: Based on the results from EFA, items that don’t load well on any factor or that cause cross-loading issues (loading on multiple factors) are removed. This results in a streamlined set of items that reliably measure the intended constructs.
  • Revised Survey Development: A revised version of the survey is developed with the refined item pool, improving clarity, reducing redundancy, and ensuring that each item is relevant to a single construct.

2.0.4 4. Field Testing and Confirmatory Factor Analysis (CFA)

  • Larger Sample Administration: The refined survey is then administered to a larger, more representative sample to test its reliability and validity.
  • Confirmatory Factor Analysis (CFA): CFA is conducted to validate the factor structure identified in the EFA phase. CFA requires specifying the number of factors and their relationships based on prior knowledge or the EFA results.
    • Purpose: CFA tests how well the items fit the hypothesized structure and whether the constructs align with theoretical expectations. Good model fit in CFA indicates that the items reliably measure each intended factor.

2.0.5 5. Final Survey Administration and Data Collection

  • With a validated survey, you can proceed to full-scale data collection, confident that the items are reliable and valid measures of the constructs.

2.0.6 6. Data Analysis and Reporting

  • Following data collection, analysis is performed, often including descriptive statistics, correlations, and further structural equation modeling if needed.
  • Reporting: The results and insights gained through EFA, item reduction, and CFA help demonstrate the survey’s validity and reliability, which is crucial when presenting findings to stakeholders.

Each of these steps is important to ensure that the survey effectively measures the constructs of interest, minimizes measurement error, and provides meaningful insights for your study or project.

2.1 Reliability and Validity

A high quality survey will be both reliable (consistent) and valid (accurate).1 A reliable survey is reproducible under similar conditions. It produces consistent results across time, samples, and sub-samples within the survey itself. Reliable surveys are not necessarily accurate though. A valid survey accurately measures its latent variable. Its results are compatible with established theory and with other measures of the concept. Valid surveys are usually reliable too.

Reliability can entail one or more of the following:

  • Inter-rater reliability (aka Equivalence). Each survey item is unambiguous, so subject matter experts should respond identically. This applies to opinion surveys of factual concepts, not personal preference surveys. E.g., two psychologists completing a survey assessing a patient’s mental health should respond identically. Use the Cohen’s Kappa test of inter-rater reliability to test equivalence (see Section @ref(itemgeneration) on item generation).
  • Internal consistency. Survey items measuring the same latent variable are highly correlated. Use the Cronbach alpha test and split-half test to assess internal consistency (see Section @ref(itemreduction) on item reduction).
  • Stability (aka Test-retest). Repeated measurements yield the same results. You use the test-retest construct to assess stability in the Replication phase.

There are three assessments of validity:

  • Content validity. The survey items cover all aspects of the latent variable. Use Lawshe’s CVR to assess content validity (see Section @ref(itemgeneration) on item generation).
  • Construct validity. The survey items are properly grounded on a theory of the latent variable. Use convergent analysis and discriminant analysis to assess construct validity in the Convergent/Discriminant Validity phase.
  • Criterion validity. The survey item results correspond to other valid measures of the same latent variable. Use concurrent analysis and predictive analysis to assess criterion validity in the Replication phase.

Survey considerations:

  • Question order and selection order
  • Respondent burden and fatigue

Quality of measurement: reliability and validity (internal and external).

Pretest and pilot test.

Probability sampling:

  • Simple random, systematic random, stratified random, cluster, and multistage cluster sampling
  • Power analysis

Reporting:

  • Descriptive statistics
  • Write-up

Continuous latent variables (e.g., level of satisfaction) can be measured with factor analysis (exploratory and confirmatory) or item response theory (IRT) models. Categorical or discrete variables (e.g., market segment) can be modeled with latent class analysis (LCA) or latent mixture modeling. You can even combine models, e.g., satisfaction within market segment.
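As a rough sketch of what these model families look like in R (the packages shown and the likert_items data frame with items q1, q2, and q3 are hypothetical illustrations, not part of this chapter's example):

# Hypothetical data: likert_items holds integer-coded ordinal responses q1, q2, q3.
library(mirt)    # item response theory for a continuous latent trait
irt_fit <- mirt(likert_items, model = 1, itemtype = "graded")             # one graded-response factor

library(poLCA)   # latent class analysis for a categorical latent variable
lca_fit <- poLCA(cbind(q1, q2, q3) ~ 1, data = likert_items, nclass = 3)  # three latent classes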

In practice, you specify the model, evaluate the fit, then revise the model or add/drop items from the survey.

A full survey project usually consists of six phases.2

  1. Item Generation (Section @ref(itemgeneration)). Start by generating a list of candidate survey items. With help from SMEs, you evaluate the equivalence (interrater reliability) and content validity of the candidate survey items and pare down the list into the final survey.

  2. Survey Administration (Section @ref(surveyadministration)). Administer the survey to respondents and perform an exploratory data analysis. Summarize the Likert items with plots and look for correlations among the variables.

  3. Item Reduction (Section @ref(itemreduction)). Explore the dimensions of the latent variable in the survey data with parallel analysis and exploratory factor analysis. Assess the internal consistency of the items with Cronbach’s alpha and split-half tests, and remove items that do not add value and/or amend your theory of the number of dimensions.

  4. Confirmatory Factor Analysis (Section @ref(confirmatoryfactoranalysis)). Perform a formal hypothesis test of the theory that emerged from the exploratory factor analysis.

  5. Convergent/Discriminant Validity (Section @ref(convergentvalidity)). Test for convergent and discriminant construct validity.

  6. Replication (Section @ref(replication)). Establish test-retest reliability and criterion validity.

2.2 Item Generation

Define your latent variable(s), that is, the unquantifiable variables you intend to infer from variables you can quantify. E.g., “Importance of 401(k) matching”.

After you generate a list of candidate survey items, enlist SMEs to assess their inter-rater reliability with Cohen’s Kappa and content validity with Lawshe’s CVR.

2.2.1 Cohen’s Kappa

An item has inter-rater reliability if it produces consistent results across raters. One way to test this is by having SMEs take the survey. Their answers should be close to each other. Conduct an inter-rater reliability test by measuring the statistical significance of SME response agreement with Cohen’s kappa test statistic.

Suppose your survey measures brand loyalty and two SMEs answer 13 survey items like this. The SMEs agreed on 6 of the 13 items (46%).
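The sme data frame is not constructed in these notes; it can be reconstructed from the printed output below (this assumes dplyr is loaded for the pipe):

sme <- data.frame(
  RATER_A = c(1, 2, 3, 2, 1, 1, 1, 2, 3, 3, 2, 1, 1),  # SME A's ratings
  RATER_B = c(1, 2, 2, 3, 3, 1, 1, 1, 2, 3, 3, 3, 1)   # SME B's ratings
)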

sme %>% mutate(agreement = RATER_A == RATER_B)
   RATER_A RATER_B agreement
1        1       1      TRUE
2        2       2      TRUE
3        3       2     FALSE
4        2       3     FALSE
5        1       3     FALSE
6        1       1      TRUE
7        1       1      TRUE
8        2       1     FALSE
9        3       2     FALSE
10       3       3      TRUE
11       2       3     FALSE
12       1       3     FALSE
13       1       1      TRUE

You could measure SME agreement with a simple correlation matrix (cor(sme)) or by the percentage of items they rate identically (irr::agree(sme)), but neither measure accounts for agreement that could occur by chance.

cor(sme)
##           RATER_A   RATER_B
## RATER_A 1.0000000 0.3291403
## RATER_B 0.3291403 1.0000000
irr::agree(sme)
##  Percentage agreement (Tolerance=0)
## 
##  Subjects = 13 
##    Raters = 2 
##   %-agree = 46.2

Instead, calculate Cohen’s kappa test statistic, \(\kappa\), which accounts for chance agreement. Cohen’s kappa compares the observed agreement (accuracy) to the probability of chance agreement. \(\kappa\) >= 0.8 indicates very strong agreement, \(\kappa\) >= 0.6 substantial, \(\kappa\) >= 0.4 moderate, and \(\kappa\) < 0.4 poor agreement. In this example, the weighted \(\kappa\) is only 0.32 (poor agreement).
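As a check on what the statistic is doing, the unweighted kappa can be reproduced by hand from the observed and chance agreement (base R, using the sme data above):

p_obs <- mean(sme$RATER_A == sme$RATER_B)          # observed agreement, 6/13
p_chance <- sum(prop.table(table(sme$RATER_A)) *
                prop.table(table(sme$RATER_B)))    # agreement expected by chance
(p_obs - p_chance) / (1 - p_chance)                # ~0.19, the unweighted kappa reported below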

psych::cohen.kappa(sme)
Call: cohen.kappa1(x = x, w = w, n.obs = n.obs, alpha = alpha, levels = levels, 
    w.exp = w.exp)

Cohen Kappa and Weighted Kappa correlation coefficients and confidence boundaries 
                 lower estimate upper
unweighted kappa -0.18     0.19  0.55
weighted kappa   -0.13     0.32  0.76

 Number of subjects = 13 
psych::cohen.kappa(sme2)
Call: cohen.kappa1(x = x, w = w, n.obs = n.obs, alpha = alpha, levels = levels, 
    w.exp = w.exp)

Cohen Kappa and Weighted Kappa correlation coefficients and confidence boundaries 
                 lower estimate upper
unweighted kappa -0.22     0.17  0.56
weighted kappa    0.30     0.56  0.83

 Number of subjects = 13 

Use the weighted kappa for ordinal measures like Likert items (see Wikipedia).

2.2.2 Lawshe’s CVR

An item has content validity if SMEs agree on its relevance to the latent variable. Test content validity with Lawshe’s content validity ratio (CVR),

\[CVR = \frac{E - N/2}{N/2}\] where \(N\) is the number of SMEs and \(E\) is the number who rate the item as essential. CVR can range from -1 to 1. E.g., suppose three SMEs (A, B, and C) assess the relevance of 5 survey items as “Not Necessary”, “Useful”, or “Essential”:

  item             A             B             C
1    1     Essential        Useful Not necessary
2    2        Useful Not necessary        Useful
3    3 Not necessary Not necessary     Essential
4    4     Essential        Useful     Essential
5    5     Essential     Essential     Essential
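The sme2 rating data (a different object from the sme2 in the kappa output above) can be reconstructed from this table; only the “Essential” ratings matter for the CVR calculation:

sme2 <- data.frame(
  item = 1:5,
  A = c("Essential", "Useful", "Not necessary", "Essential", "Essential"),
  B = c("Useful", "Not necessary", "Not necessary", "Useful", "Essential"),
  C = c("Not necessary", "Useful", "Essential", "Essential", "Essential")
)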

Use the psychometric::CVratio() function to calculate CVR. The threshold CVR to keep or drop an item depends on the number of raters. CVR should be >= 0.99 for 5 experts; >= 0.49 for 15, and >= 0.29 for 40.

sme2 %>% 
  pivot_longer(-item, names_to = "expert", values_to = "rating") %>%
  group_by(item) %>% 
  summarize(.groups = "drop",
            n_sme = length(unique(expert)),
            n_ess = sum(rating == "Essential"),
            CVR = psychometric::CVratio(NTOTAL = n_sme, NESSENTIAL = n_ess))
# A tibble: 5 × 4
   item n_sme n_ess    CVR
  <int> <int> <int>  <dbl>
1     1     3     1 -0.333
2     2     3     0 -1    
3     3     3     1 -0.333
4     4     3     2  0.333
5     5     3     3  1    

In this example, content validity varies widely across items, from unanimous consensus for (item 5, CVR = 1) to unanimous consensus against (item 2, CVR = -1).

2.3 Item Reduction

The third phase explores the mapping of the survey items (aka “manifest variables”) to the latent variable’s “dimensions” and refines the survey to exclude items that do not map to a dimension. A latent variable may have several dimensions. E.g., “brand loyalty” may consist of “brand identification”, “perceived value”, and “brand trust”. Exploratory factor analysis (EFA) identifies the dimensions in the data and whether any items fail to reveal information about the latent variable. EFA also establishes internal reliability: whether similar items produce similar scores.

Start with a parallel analysis and scree plot. This will suggest the number of factors in the data. Use this number as the input to an exploratory factor analysis.

2.3.1 Parallel Analysis

A scree plot is a line plot of the eigenvalues. An eigenvalue is the amount of variance explained by each factor (with standardized items, each item contributes one unit of variance, so the eigenvalues sum to the number of items). Only factors with eigenvalues greater than those from uncorrelated data are useful. You want to find a sharp reduction in the size of the eigenvalues (like a cliff), with the rest of the smaller eigenvalues constituting rubble (scree!). After the eigenvalues drop dramatically in size, additional factors add relatively little to the information already extracted.

Parallel analysis helps to make the interpretation of scree plots more objective. The eigenvalues are plotted along with eigenvalues of simulated variables with population correlations of 0. The number of eigenvalues above the point where the two lines intersect is the suggested number of factors. The rationale for parallel analysis is that useful factors account for more variance than could be expected by chance.

psych::fa.parallel() compares the scree of your data set to that of a random data set to identify the number of factors. Here, the elbow is at 3 factors.

brand_rep <- read_csv(url("https://assets.datacamp.com/production/repositories/4494/datasets/59b5f2d717ddd647415d8c88aa40af6f89ed24df/brandrep-cleansurvey-extraitem.csv"))

psych::fa.parallel(brand_rep)

Parallel analysis suggests that the number of factors =  3  and the number of components =  2 

2.3.2 Exploratory Factor Analysis

Use psych::fa() to perform the factor analysis with your chosen number of factors. The number of factors may be the result of your parallel analysis, or the opinion of the SMEs. In this case, we’ll go with the 3 factors identified by the parallel analysis.

brand_rep_efa <- psych::fa(brand_rep, nfactors = 3)
# psych::scree(brand_rep) # psych makes scree plots too.
psych::fa.diagram(brand_rep_efa)

Using EFA, you may tweak the number of factors or drop poorly-loading items. Each item should load highly on one and only one dimension; that dimension is the item’s primary loading. Generally, a primary loading > .7 is excellent, > .6 very good, > .5 good, > .4 fair, and < .4 poor. Here are the factor loadings from the 3-factor model.

brand_rep_efa$loadings

Loadings:
               MR2    MR1    MR3   
well_made       0.896              
consistent      0.947              
poor_workman_r  0.739              
higher_price    0.127  0.632  0.149
lot_more               0.850       
go_up                  0.896       
stands_out                    1.008
unique                        0.896
one_of_a_kind   0.309  0.295  0.115

                 MR2   MR1   MR3
SS loadings    2.361 2.012 1.858
Proportion Var 0.262 0.224 0.206
Cumulative Var 0.262 0.486 0.692

The brand-rep survey items load well onto 3 factors, except for the one_of_a_kind item. Its primary factor loading (0.309) is poor. The others fall in the very good (.6 to .7) or excellent (> .7) range.

Look at the model eigenvalues. There should be roughly one eigenvalue greater than 1 per dimension; eigenvalues a little less than one may indicate a factor that is contaminating the model.

brand_rep_efa$e.value
[1] 4.35629549 1.75381015 0.98701607 0.67377072 0.39901205 0.35865598 0.23915591
[8] 0.14238807 0.08989556
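As a sanity check, these eigenvalues come from the item correlation matrix, so they sum to the number of items (each standardized item contributes one unit of variance):

sum(brand_rep_efa$e.value)   # 9, one unit of variance per item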

Look at the factor score correlations. They should all be around 0.6. Much smaller means they are not describing the same latent variable. Much larger means they are describing the same dimension of the latent variable.

brand_rep_efa$r.scores
          [,1]      [,2]      [,3]
[1,] 1.0000000 0.3147485 0.4778759
[2,] 0.3147485 1.0000000 0.5428218
[3,] 0.4778759 0.5428218 1.0000000

If you have a poorly loading item, drop items one at a time from the scale and re-run the EFA. one_of_a_kind loads across all three factors, but does not load strongly onto any one of them; it is not clearly measuring any dimension of the latent variable. Drop it and try again.

brand_rep_efa <- psych::fa(brand_rep %>% select(-one_of_a_kind), nfactors = 3)
brand_rep_efa$loadings

Loadings:
               MR2    MR3    MR1   
well_made       0.887              
consistent      0.958              
poor_workman_r  0.735              
higher_price    0.120  0.596  0.170
lot_more               0.845       
go_up                  0.918       
stands_out                    0.990
unique                        0.916

                 MR2   MR3   MR1
SS loadings    2.261 1.915 1.850
Proportion Var 0.283 0.239 0.231
Cumulative Var 0.283 0.522 0.753
brand_rep_efa$e.value
[1] 4.00884429 1.75380537 0.97785883 0.42232527 0.36214534 0.24086772 0.14404264
[8] 0.09011054
brand_rep_efa$r.scores
          [,1]      [,2]      [,3]
[1,] 1.0000000 0.2978009 0.4802138
[2,] 0.2978009 1.0000000 0.5351654
[3,] 0.4802138 0.5351654 1.0000000

This is better. We have three dimensions of brand reputation:

  • items well_made, consistent, and poor_workman_r describe Product Quality,
  • items higher_price, lot_more, and go_up describe Willingness to Pay, and
  • items stands_out and unique describe Product Differentiation

Even if the data and your theory suggest otherwise, explore what happens when you include more or fewer factors in your EFA.

psych::fa(brand_rep, nfactors = 2)$loadings

Loadings:
               MR1    MR2   
well_made              0.884
consistent             0.932
poor_workman_r         0.728
higher_price    0.745       
lot_more        0.784 -0.131
go_up           0.855 -0.103
stands_out      0.591  0.286
unique          0.540  0.307
one_of_a_kind   0.378  0.305

                 MR1   MR2
SS loadings    2.688 2.485
Proportion Var 0.299 0.276
Cumulative Var 0.299 0.575
psych::fa(brand_rep, nfactors = 4)$loadings

Loadings:
               MR2    MR1    MR3    MR4   
well_made       0.879                     
consistent      0.964                     
poor_workman_r  0.732                     
higher_price           0.552  0.150  0.168
lot_more               0.835              
go_up                  0.929              
stands_out                    0.976       
unique                        0.932       
one_of_a_kind                        0.999

                 MR2   MR1   MR3   MR4
SS loadings    2.244 1.866 1.846 1.029
Proportion Var 0.249 0.207 0.205 0.114
Cumulative Var 0.249 0.457 0.662 0.776

The two-factor model worked okay. The four-factor model loaded only one variable onto the fourth factor. In this example the SME expected a three-factor model and the data did not contradict the theory, so stick with three.

Whereas the item generation phase tested for item equivalence, the EFA phase tests the internal reliability (consistency) of the items. Internal reliability means the survey produces consistent results. The most common statistics for assessing internal reliability are Cronbach’s alpha and the split-half test.

2.3.3 Cronbach’s Alpha

In general, an alpha below .6 is unacceptable, .6 to .65 is undesirable, .65 to .7 is minimally acceptable, .7 to .8 is respectable, .8 to .9 is very good, and above .9 suggests the items are too alike. A very low alpha means the items may not be measuring the same construct, so you should drop items. A very high alpha means the items are multicollinear, and you should also drop items. Here is Cronbach’s alpha for the brand reputation survey, after removing the poorly-loading one_of_a_kind variable.

psych::alpha(brand_rep[, 1:8])$total$std.alpha
[1] 0.8557356

This value is in the “very good” range. Cronbach’s alpha is often used to measure the reliability of a single dimension. Here are the values for the 3 dimensions.

psych::alpha(brand_rep[, 1:3])$total$std # Product Quality
## [1] 0.8918025
psych::alpha(brand_rep[, 4:6])$total$std # Willingness to Pay
## [1] 0.8517566
psych::alpha(brand_rep[, 7:8])$total$std # Product Differentiation
## [1] 0.951472

Alpha is >0.7 for each dimension. Sometimes the alpha for our survey as a whole is greater than that of the dimensions. This can happen because Cronbach’s alpha is sensitive to the number of items. Over-inflation of the alpha statistic can be a concern when working with surveys containing a large number of items.
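The standardized alpha formula makes this dependence on the number of items \(k\) explicit:

\[\alpha_{std} = \frac{k \bar{r}}{1 + (k - 1)\bar{r}}\]

where \(\bar{r}\) is the average inter-item correlation. Holding \(\bar{r}\) fixed, alpha rises toward 1 as \(k\) grows. With the 8 remaining items and the average inter-item correlation of about .43 reported in the split-half output below, the formula gives roughly .86, matching the whole-scale value above.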

2.3.4 Split-Half

Use psych::splitHalf() to split the survey in half and test whether all parts of the survey contribute equally to measurement. This method is much less popular than Cronbach’s alpha.

psych::splitHalf(brand_rep[, 1:8])
Split half reliabilities  
Call: psych::splitHalf(r = brand_rep[, 1:8])

Maximum split half reliability (lambda 4) =  0.93
Guttman lambda 6                          =  0.92
Average split half reliability            =  0.86
Guttman lambda 3 (alpha)                  =  0.86
Guttman lambda 2                          =  0.87
Minimum split half reliability  (beta)    =  0.66
Average interitem r =  0.43  with median =  0.4

2.4 Confirmatory Factor Analysis

Whereas EFA is used to develop a theory of the number of factors needed to explain the relationships among the survey items, confirmatory factor analysis (CFA) is a formal hypothesis test of the EFA theory. CFA measures construct validity, that is, whether you are really measuring what you claim to measure.

These notes explain how to use CFA, but do not explain the theory. For that you need to learn about dimensionality reduction and structural equation modeling.

Use the lavaan (latent variable analysis) package, passing in the model definition. Here is the model for the three dimensions in the brand reputation survey. Lavaan’s default estimator is maximum likelihood, which assumes normality. You can change it to MLR, which uses robust standard errors to mitigate non-normality. The summary prints a great deal of output; concentrate on lambda, the factor loadings.

brand_rep_mdl <- paste(
  "PrdQl =~ well_made + consistent + poor_workman_r",
  "WillPay =~ higher_price + lot_more + go_up",
  "PrdDff =~ stands_out + unique", 
  sep = "\n"
)
brand_rep_cfa <- lavaan::cfa(model = brand_rep_mdl, data = brand_rep[, 1:8], estimator = "MLR")
# lavaan::summary(brand_rep_cfa, fit.measures = TRUE, standardized = TRUE)
semPlot::semPaths(brand_rep_cfa, rotation = 4)

lavaan::inspect(brand_rep_cfa, "std")$lambda
               PrdQl WillPy PrdDff
well_made      0.914  0.000  0.000
consistent     0.932  0.000  0.000
poor_workman_r 0.730  0.000  0.000
higher_price   0.000  0.733  0.000
lot_more       0.000  0.812  0.000
go_up          0.000  0.902  0.000
stands_out     0.000  0.000  0.976
unique         0.000  0.000  0.930

The CFA hypothesis test is a chi-square test, so it is sensitive to normality assumptions and sample size. Other fit measures are reported too:

  • Comparative Fit Index (CFI): look for values > .9
  • Tucker-Lewis Index (TLI): look for values > .9
  • Root Mean Square Error of Approximation (RMSEA): look for values < .05

There are actually 78 fit measures to choose from! Focus on CFI and TLI.

lavaan::fitMeasures(brand_rep_cfa, fit.measures = c("cfi", "tli"))
  cfi   tli 
0.980 0.967 
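RMSEA was not requested in that call, but it is available the same way (a sketch; output not shown here):

lavaan::fitMeasures(brand_rep_cfa, fit.measures = "rmsea")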

Both CFI and TLI are above .9, indicating a good model. Next, check the standardized estimates for each item. The standardized factor loadings are the basis for establishing construct validity. Although we call these measures ‘loadings,’ they are better described as correlations of each manifest item with the dimensions. The difference between a perfect correlation and the observed correlation is treated as ‘error.’ This relationship between the so-called ‘true’ and ‘observed’ scores is the basis of classical test theory.

lavaan::standardizedSolution(brand_rep_cfa) %>%
  filter(op == "=~") %>%
  select(lhs, rhs, est.std, pvalue)
      lhs            rhs est.std pvalue
1   PrdQl      well_made   0.914      0
2   PrdQl     consistent   0.932      0
3   PrdQl poor_workman_r   0.730      0
4 WillPay   higher_price   0.733      0
5 WillPay       lot_more   0.812      0
6 WillPay          go_up   0.902      0
7  PrdDff     stands_out   0.976      0
8  PrdDff         unique   0.930      0

If you have a survey that meets your assumptions, performs well under EFA, but fails under CFA, return to your survey and revisit your scale, examine the CFA modification indices, factor variances, etc.

2.5 Convergent/Discriminant Validity

Construct validity means the survey measures what it intends to measure. It is composed of convergent validity and discriminant validity. Convergent validity means the items within a factor agree with one another (they address the same concept). Discriminant validity means the factors capture distinct aspects of the concept.

Test for construct validity after assessing CFA model strength (with CFI, TLI, and RMSEA); a poor-fitting model may have greater construct validity than a better-fitting model. Use the semTools::reliability() function. The average variance extracted (AVE) measures convergent validity (avevar) and should be > .5. The composite reliability (CR) measures discriminant validity (omega) and should be > .7.

semTools::reliability(brand_rep_cfa)
           PrdQl   WillPay    PrdDff
alpha  0.8926349 0.8521873 0.9514719
omega  0.9017514 0.8605751 0.9520117
omega2 0.9017514 0.8605751 0.9520117
omega3 0.9019287 0.8646975 0.9520114
avevar 0.7573092 0.6756174 0.9084654

These values look good for all three dimensions. As an aside, alpha here is Cronbach’s alpha. Do not be tempted to test reliability and validity in the same step. Start with reliability because it is a necessary but insufficient condition for validity. By checking internal consistency first, as measured by alpha, and then construct validity, as measured by AVE and CR, you establish that the scale as a whole meets the reliability requirement before taking it to the next level and checking construct validity among the individual dimensions.

At this point you have established that the latent and manifest variables are related as hypothesized, and that the survey measures what you intended to measure, in this case, brand reputation.

2.6 Replication

The replication phase establishes criterion validity and stability (reliability). Criterion validity is a measure of the relationship between the construct and some external measure of interest. Measure criterion validity with concurrent validity, how well items correlate with an external metric measured at the same time, and with predictive validity, how well an item predicts an external metric. Stability means the survey produces similar results over repeated test-retest administrations.

2.6.1 Criterion Validity

2.6.1.1 Concurrent Validity

Concurrent validity is a measure of whether our latent construct is significantly correlated to some outcome measured at the same time.

Suppose you have an additional data set of consumer spending on the brand. The consumer’s perception of the brand should correlate with their spending. Before checking for concurrent validity, standardize the data so that Likert items and other variable types are on the same scale.

set.seed(20201004)
brand_rep <- brand_rep %>%
  mutate(spend = ((well_made + consistent + poor_workman_r)/3 * 5 +
                  (higher_price + lot_more + go_up)/3 * 3 +
                  (stands_out + unique)/2 * 2) / 10)
brand_rep$spend <- brand_rep$spend + rnorm(559, 5, 4) # add randomness
brand_rep_scaled <- scale(brand_rep)

Do respondents with higher scores on the brand reputation scale also tend to spend more at the store? Build the model, latentizing spend as Spndng and relating it to the three dimensions with the ~~ (covariance) operator. Fit the model with the lavaan::sem() function.

brand_rep_cv_mdl <- paste(
  "PrdQl =~ well_made + consistent + poor_workman_r",
  "WillPay =~ higher_price + lot_more + go_up",
  "PrdDff =~ stands_out + unique",
  "Spndng =~ spend",
  "Spndng ~~ PrdQl + WillPay + PrdDff",
  sep = "\n"
)
brand_rep_cv <- lavaan::sem(data = brand_rep_scaled, model = brand_rep_cv_mdl)

Here are the standardized covariances. Because the data are standardized, interpret these as correlations. The p-values are significant because spend was constructed from the survey items; the added noise only weakens the correlations.

lavaan::standardizedSolution(brand_rep_cv) %>% 
  filter(rhs == "Spndng") %>%
  select(-op, -rhs)
      lhs est.std    se     z pvalue ci.lower ci.upper
1   PrdQl   0.174 0.043 4.092      0    0.091    0.258
2 WillPay   0.241 0.043 5.654      0    0.157    0.324
3  PrdDff   0.198 0.041 4.779      0    0.117    0.279
4  Spndng   1.000 0.000    NA     NA    1.000    1.000
semPlot::semPaths(brand_rep_cv, whatLabels = "est.std", edge.label.cex = .8, rotation = 2)

Each dimension of brand reputation is positively correlated to spending history and the relationships are all significant.

2.6.1.2 Predictive Validity

Predictive validity is established by regressing some future outcome on your established construct. Assess predictive validity just as you would any linear regression: with the regression estimates and p-values (standardizedSolution()) and the R-squared coefficient of determination (inspect()).

Build a regression model with the single ~ operator. Then fit the model to the data as before.

brand_rep_pv_mdl <- paste(
  "PrdQl =~ well_made + consistent + poor_workman_r",
  "WillPay =~ higher_price + lot_more + go_up",
  "PrdDff =~ stands_out + unique",
  "spend ~ PrdQl + WillPay + PrdDff",
  sep = "\n"
)
brand_rep_pv <- lavaan::sem(data = brand_rep_scaled, model = brand_rep_pv_mdl)
#lavaan::summary(brand_rep_pv, standardized = T, fit.measures = T, rsquare = T)
semPlot::semPaths(brand_rep_pv, whatLabels = "est.std", edge.label.cex = .8, rotation = 2)

lavaan::standardizedSolution(brand_rep_pv) %>% 
  filter(op == "~") %>%
  mutate_if(is.numeric, round, digits = 3)
    lhs op     rhs est.std    se     z pvalue ci.lower ci.upper
1 spend  ~   PrdQl   0.094 0.048 1.949  0.051   -0.001    0.189
2 spend  ~ WillPay   0.182 0.052 3.486  0.000    0.080    0.284
3 spend  ~  PrdDff   0.060 0.055 1.092  0.275   -0.047    0.167
lavaan::inspect(brand_rep_pv, "r2")
     well_made     consistent poor_workman_r   higher_price       lot_more 
         0.837          0.867          0.533          0.537          0.658 
         go_up     stands_out         unique          spend 
         0.816          0.951          0.866          0.072 

There is a statistically significant relationship between one dimension of brand reputation (Willingness to Pay) and spending. At this point you may want to drop the other two dimensions. However, the R^2 is not good: only 7% of the variability in spending can be explained by the three dimensions of the construct.

Factor scores represent individual respondents’ standing on a latent factor. While not used for scale validation per se, factor scores can be used for customer segmentation via clustering, network analysis and other statistical techniques.

brand_rep_cfa <- lavaan::cfa(brand_rep_pv_mdl, data = brand_rep_scaled)

brand_rep_cfa_scores <- lavaan::predict(brand_rep_cfa) %>% as.data.frame()
psych::describe(brand_rep_cfa_scores)
        vars   n mean   sd median trimmed  mad   min  max range  skew kurtosis
PrdQl      1 559    0 0.88  -0.50   -0.19 0.09 -0.57 4.02  4.59  2.31     6.18
WillPay    2 559    0 0.69   0.00    0.01 0.86 -1.15 1.16  2.31 -0.05    -1.18
PrdDff     3 559    0 0.96  -0.04   -0.15 1.17 -0.88 2.54  3.41  1.09     0.35
          se
PrdQl   0.04
WillPay 0.03
PrdDff  0.04
psych::multi.hist(brand_rep_cfa_scores)

map(brand_rep_cfa_scores, shapiro.test)
$PrdQl

    Shapiro-Wilk normality test

data:  .x[[i]]
W = 0.66986, p-value < 2.2e-16


$WillPay

    Shapiro-Wilk normality test

data:  .x[[i]]
W = 0.94986, p-value = 7.811e-13


$PrdDff

    Shapiro-Wilk normality test

data:  .x[[i]]
W = 0.82818, p-value < 2.2e-16

These scores are not normally distributed, which makes model-based clustering a good choice for analyzing them. Clustering here does not mean distance-based clustering such as K-means; mixture models treat the data as coming from a distribution that is itself a mixture of clusters. To learn more about model-based clustering, see the Hierarchical and Mixed Effects Models DataCamp course.
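A minimal sketch of that idea, assuming the mclust package (not used elsewhere in these notes), fits a Gaussian mixture model to the factor scores:

library(mclust)
brand_rep_mix <- Mclust(brand_rep_cfa_scores)   # selects the number of clusters by BIC
summary(brand_rep_mix)                          # cluster count and mixing proportions
head(brand_rep_mix$classification)              # each respondent's cluster assignment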

Factor scores can be extracted from a structural equation model and used as inputs in other models. For example, you can use the factor scores from the brand reputation dimensions as regressors in a regression on spending.

brand_rep_fs_reg_dat <- bind_cols(brand_rep_cfa_scores, spend = brand_rep$spend)
brand_rep_fs_reg <- lm(spend ~ PrdQl + WillPay + PrdDff, data = brand_rep_fs_reg_dat)
summary(brand_rep_fs_reg)$coef
             Estimate Std. Error    t value      Pr(>|t|)
(Intercept) 7.1555354  0.1591195 44.9695738 3.363705e-187
PrdQl       0.4260875  0.2062002  2.0663774  3.925620e-02
WillPay     1.1365087  0.2805960  4.0503388  5.842799e-05
PrdDff      0.1714031  0.2181813  0.7855993  4.324375e-01

The coefficients and R-squared of the lm() and sem() models resemble each other, but keeping the regression inside the lavaan framework provides more information (reflected here in the somewhat higher estimates and R-squared). A construct, once validated, can be combined with a wide range of outcomes and models to produce valuable information about consumer behavior and habits.

2.6.2 Test-Retest Reliability

Test-retest reliability is the ability to achieve the same result from a respondent at two closely-spaced points in time (repeated measures).

Suppose you had two surveys, identified by an id field.

# brand_rep$id <- 1:559  # a respondent id is needed to match answers across administrations
# svy_1 <- brand_rep[sample(1:559, 300),] %>% as.data.frame()
# svy_2 <- brand_rep[sample(1:559, 300),] %>% as.data.frame()
# survey_test_retest <- psych::testRetest(t1 = svy_1, t2 = svy_2, id = "id")
# survey_test_retest$r12

An r^2 <.7 is unacceptable, <.9 good, and >.9 very good. This one is unacceptable.

One way to check for replication is by splitting the data in half.

# svy <- bind_rows(svy_1, svy_2, .id = "time")
# 
# psych::describeBy(svy, "time")
# 
# brand_rep_test_retest <- psych::testRetest(
#   t1 = filter(svy, time == 1),
#   t2 = filter(svy, time == 2),
#   id = "id")
# 
# brand_rep_test_retest$r12

If the correlation of scaled scores across time 1 and time 2 is greater than .9, that indicates very strong test-retest reliability. This measure can be difficult to collect because it requires the same respondents to answer the survey at two points in time. However, it’s a good technique to have in your survey development toolkit.
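A runnable sketch of the mechanics (the retest here is simulated by adding noise to the original item responses, so the resulting reliability only illustrates the workflow, not a real retest):

brand_rep_t1 <- brand_rep[, 1:9] %>% mutate(id = row_number())
brand_rep_t2 <- brand_rep_t1 %>%
  mutate(across(-id, ~ .x + rnorm(n(), 0, 0.5)))  # simulated second administration
brand_rep_retest <- psych::testRetest(t1 = brand_rep_t1, t2 = brand_rep_t2, id = "id")
brand_rep_retest$r12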

When validating a scale, it’s a good idea to split the survey results into two samples, using one for EFA and one for CFA. This works as a sort of cross-validation such that the overall fit of the model is less likely due to chance of any one sample’s makeup.

# brand_rep_efa_data <- brand_rep[1:280,]
# brand_rep_cfa_data <- brand_rep[281:559,]
#  
# efa <- psych::fa(brand_rep_efa_data, nfactors = 3)
# efa$loadings
# 
# brand_rep_cfa <- lavaan::cfa(brand_rep_mdl, data = brand_rep_cfa_data)
# lavaan::inspect(brand_rep_cfa, what = "call")
# 
# lavaan::fitmeasures(brand_rep_cfa)[c("cfi","tli","rmsea")]

  1. This section is aided by Fiona Middleton’s “Reliability vs. Validity in Research | Difference, Types and Examples” (Middleton 2022).↩︎

  2. This section is primarily from George Mount’s Data Camp course (Mount, n.d.).↩︎