Machine learning datasets often have an enormous number of columns, many of which are correlated because they essentially measure the same latent construct. You could drop redundant columns, but the redundancy is not always obvious; it can be hidden in combinations of columns. Principal component analysis (PCA) reduces the columns to a few interpretable linear combinations while retaining as much information as possible.1

2.1 Intuition

Shlens (2014) introduces PCA with a toy example in which the movement of an oscillating spring is monitored by cameras from three positions in a room. Each camera produces a two-dimensional dataset of x and y coordinates, six vectors of numbers altogether. The physical reality is that there is only one dimension of movement. PCA should be able to reduce the six vectors to a single principal component.

I’ll simplify the example somewhat to make the math easier: I’ll locate all three cameras at the same spot, x = 0 and y = 0, and rotate them 15, 45, and 150 degrees, respectively.

library(tidyverse)  # tibble(), glimpse(), and ggplot2 are used throughout

# 100 photos of the end of the spring taken over 10s of motion.
t <- seq(0, 10, length.out = 100)

# simple harmonic motion from our perspective: 0-degree rotation.
x_0 <- sin(2 * pi * 0.5 * t)
y_0 <- rnorm(length(t), 0, .03)
coords_0 <- matrix(c(x_0, y_0), ncol = 2)

# 15-degree rotation matrix for camera 1
theta_A <- pi * (1 / 12)
M_A <- matrix(c(cos(theta_A), -sin(theta_A), sin(theta_A), cos(theta_A)), nrow = 2)
coords_A <- coords_0 %*% M_A

# 45-degree rotation matrix for camera 2
theta_B <- pi * (1 / 4)
M_B <- matrix(c(cos(theta_B), -sin(theta_B), sin(theta_B), cos(theta_B)), nrow = 2)
coords_B <- coords_0 %*% M_B

# 150-degree rotation matrix for camera 3
theta_C <- pi * (5 / 6)
M_C <- matrix(c(cos(theta_C), -sin(theta_C), sin(theta_C), cos(theta_C)), nrow = 2)
coords_C <- coords_0 %*% M_C

# Combined dataset
spring <- tibble(
  t = t,
  camera_A_x = coords_A[, 1],
  camera_A_y = coords_A[, 2],
  camera_B_x = coords_B[, 1],
  camera_B_y = coords_B[, 2],
  camera_C_x = coords_C[, 1],
  camera_C_y = coords_C[, 2]
)
glimpse(spring)
Rows: 100
Columns: 7
$ t          <dbl> 0.0000000, 0.1010101, 0.2020202, 0.3030303, 0.4040404, 0.50…
$ camera_A_x <dbl> 0.009075173, 0.306140645, 0.577588310, 0.774408459, 0.92934…
$ camera_A_y <dbl> -0.033869007, 0.063072215, 0.135231406, 0.257148083, 0.2211…
$ camera_B_x <dbl> 0.02479383, 0.23358947, 0.43259045, 0.54208336, 0.69428532,…
$ camera_B_y <dbl> -0.02479383, 0.20769246, 0.40590799, 0.60990100, 0.65615038…
$ camera_C_x <dbl> 0.01753189, -0.26107292, -0.50403966, -0.72942063, -0.81348…
$ camera_C_y <dbl> 0.03036612, 0.17187534, 0.31279357, 0.36575832, 0.50080390,…

Here are the x-y coordinates recorded from the cameras’ perspectives.

We can perform the PCA in a single step. The center and scale. arguments ensure PCA operates on standardized data.

spring_pca <- 
  spring |>
  select(-t) |>
  prcomp(center = TRUE, scale. = TRUE)

summary(spring_pca)
Importance of components:
                          PC1     PC2       PC3       PC4      PC5       PC6
Standard deviation     2.4441 0.16312 5.982e-16 1.152e-16 1.01e-16 8.729e-17
Proportion of Variance 0.9956 0.00443 0.000e+00 0.000e+00 0.00e+00 0.000e+00
Cumulative Proportion  0.9956 1.00000 1.000e+00 1.000e+00 1.00e+00 1.000e+00

PCA creates as many components as there are variables. The summary table shows the proportion of the variance in the dataset explained by each component. The standard deviation of the first component is 2.444, compared to 0.163 for the second and essentially zero for the rest; it explains 99.6% of the dataset variance. The upshot is that you can replace the six columns in the dataset with just this one component.
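As a check on that claim, the first component’s scores (stored in spring_pca$x) should track the underlying harmonic motion. A minimal sketch using the objects created above:

# PC1 recovers the single real dimension of motion; the correlation should be
# essentially 1 in absolute value (the sign of a component is arbitrary).
cor(spring_pca$x[, "PC1"], x_0)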

2.2 Matrix Algebra

PCA works by identifying the most meaningful basis to re-express the dataset. In our toy example, the unit basis along the x-axis is the most meaningful. It does this with matrix algebra.

Each row of the dataset may be expressed as a column vector, \(\vec{X}\), containing the x and y values recorded by the three cameras.

\[ \vec{X} = \begin{bmatrix} A_x \\ A_y \\ B_x \\ B_y \\ C_x \\ C_y \end{bmatrix} \]

\(\vec{X}\) lies in an \(m = 6\)-dimensional vector space spanned by an orthonormal basis. From its own perspective, each camera’s orthonormal basis is \(\{(1,0), (0,1)\}\), so a data point from camera \(A\) might equivalently be expressed as

\[ \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} A_x \\ A_y \end{bmatrix} \]

An orthonormal basis is a set of unit-length vectors whose pairwise inner products are zero. Viewed from a neutral perspective, camera \(A\)’s basis might instead be \(\{(\sqrt{2}/2,\sqrt{2}/2), (-\sqrt{2}/2,\sqrt{2}/2)\}\), i.e., its axes rotated 45 degrees.

# M is a 45-degree line
M <- matrix(c(0:100, 0:100 * tan(45*(pi/180))), ncol = 2)
colnames(M) <- c("X", "Y")

# M is unchanged after multiplication by identity matrix.
I <- matrix(c(1, 0, 0, 1), nrow = 2)
M1 <- M %*% I
colnames(M1) <- colnames(M)

# Rotate M 45-degrees.
B <- matrix(c(sqrt(2)/2, sqrt(2)/2, -sqrt(2)/2, sqrt(2)/2), nrow = 2)
M2 <- M %*% B
colnames(M2) <- colnames(M)

bind_rows(
  `45 degree line` = as_tibble(M1),
  `Rotated 45 degrees` = as_tibble(M2),
  .id = "series"
) |>
  ggplot(aes(x = X, y = Y, color = series)) +
  geom_line() +
  labs(x = NULL, y = NULL, color = NULL)
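A quick sanity check: an orthonormal basis matrix satisfies \(\textbf{B}\textbf{B}^T = \textbf{I}\).

# B is orthonormal: unit-length rows with zero pairwise inner products.
round(B %*% t(B), 10)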

Extending this to all three cameras, you might start by naively assuming they collect data from the same perspective, taking each measurement at face value. The set of orthonormal basis vectors, \(\textbf{B}\), would look like this identity matrix

\[ \textbf{B} = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_m \end{bmatrix} = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix} \]

where \(b_1\) and \(b_2\) are the basis vectors used by camera \(A\), etc. Now the data set \(\textbf{X}\) can be expressed as the matrix multiplication, \(\textbf{BX}\). This added complexity allows you to ask whether another basis, \(\textbf{P}\), whose rows are linear combinations of the original basis vectors, better expresses the data set, \(\textbf{PX} = \textbf{Y}\). The linear restriction is a key simplifying assumption of PCA. \(\textbf{P}\) transforms \(\textbf{X}\) into \(\textbf{Y}\), but you can also think of it as rotating and stretching \(\textbf{X}\) into \(\textbf{Y}.\)
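In R’s convention the observations are rows, so the same change of basis reads \(\textbf{Y} = \textbf{X}\textbf{P}\) with \(\textbf{P}\) = spring_pca$rotation. A minimal sketch confirms that prcomp() stores exactly this re-expressed data in its x element.

# Re-express the (centered, scaled) spring data in the new basis: Y = XP.
X_c <- scale(select(spring, -t))
Y   <- X_c %*% spring_pca$rotation
all.equal(unname(Y), unname(spring_pca$x))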

So, what transformation is best? The ideal is to maximize variance (signal-to-noise ratio) and minimize covariance (redundancy) in the data. The signal-to-noise ratio is \(SNR = \sigma^2_{signal} / \sigma^2_{noise}\). In each camera’s 2D perspective, the signal is the amplitude of the movement of the spring along the x-axis and the noise is the movement along the y-axis. SNR is maximized by rotating the camera axes (gold below).
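To see the signal-to-noise idea numerically, here is a small sketch that sweeps rotation angles for camera A’s recordings and picks the one that maximizes the variance along its x-axis. Given how coords_A was constructed, it should land near -15 degrees.

# Rotate camera A's axes and track the variance captured along x.
angles <- seq(-90, 90, by = 1) * pi / 180
var_x <- sapply(angles, function(a) {
  R <- matrix(c(cos(a), -sin(a), sin(a), cos(a)), nrow = 2)
  var((coords_A %*% R)[, 1])
})
angles[which.max(var_x)] * 180 / pi  # expect roughly -15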

The other criterion is redundancy. The three cameras record the same activity, so their measurements are highly correlated.

spring |> select(ends_with("x")) |> GGally::ggpairs()

If \(\textbf{X}\) is an \(m \times n\) matrix of centered values (each of the \(m\) rows is a variable with its mean subtracted, one column per observation), PCA finds an orthonormal matrix \(\textbf{P}\) in \(\textbf{Y} = \textbf{PX}\) such that the covariance matrix \(\textbf{C}_\textbf{Y} = \frac{1}{n}\textbf{YY}^T\) is diagonal. The rows of \(\textbf{P}\) are the principal components of \(\textbf{X}\). The off-diagonal elements of a covariance matrix are the covariances and they measure redundancy. The diagonal elements are the variances and they measure signal.

With matrix algebra, you can express \(\textbf{C}_\textbf{Y}\) in terms of the covariance matrix of \(\textbf{X}\), \(\textbf{C}_\textbf{X}\).

\[ \begin{align} \textbf{C}_\textbf{Y} &= \frac{1}{n}\textbf{YY}^T \\ &= \frac{1}{n}(\textbf{PX})(\textbf{PX})^T \\ &= \frac{1}{n}\textbf{PXX}^T\textbf{P}^T \\ &= \textbf{P}\left(\frac{1}{n}\textbf{XX}^T\right)\textbf{P}^T \\ &= \textbf{PC}_\textbf{X}\textbf{P}^T \end{align} \]

Any symmetric matrix can be diagonalized by an orthogonal matrix of its eigenvectors, \(\textbf{C}_\textbf{X} = \textbf{E}^T\textbf{DE}\). So select \(\textbf{P} = \textbf{E}\), the matrix whose rows are the eigenvectors of \(\textbf{C}_\textbf{X}\).

\[ \begin{align} \textbf{C}_\textbf{Y} &= \textbf{PC}_\textbf{X}\textbf{P}^T \\ &= \textbf{P}(\textbf{E}^T\textbf{DE})\textbf{P}^T \\ &= \textbf{P}(\textbf{P}^T\textbf{DP})\textbf{P}^T \\ &= (\textbf{PP}^T)\textbf{D}(\textbf{PP}^T) \\ &= (\textbf{PP}^{-1})\textbf{D}(\textbf{PP}^{-1}) \\ &= \textbf{D} \end{align} \]
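You can verify the conclusion on the spring example: the covariance matrix of the re-expressed data is diagonal, with the component variances on the diagonal (prcomp() divides by \(n - 1\) rather than \(n\), which doesn’t change the argument).

# Off-diagonal covariances are ~0; the diagonal equals sdev^2.
round(cov(spring_pca$x), 3)
round(spring_pca$sdev^2, 3)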

2.3 Case Study

Let’s work with a case study presented by Laerd, in which 315 job candidates complete a questionnaire consisting of 25 questions.

pca_dat <- 
  foreign::read.spss("./input/pca.sav", to.data.frame = TRUE) |>
  mutate(
    across(where(is.factor), ~factor(., levels = likert_scale, ordered = TRUE))
  )

glimpse(pca_dat)
Rows: 315
Columns: 30
$ Qu1    <ord> Agree Somewhat, Disagree, Disagree Somewhat, Undecided, Undecid…
$ Qu2    <ord> Undecided, Strongly Agree, Agree, Disagree, Disagree, Disagree,…
$ Qu3    <ord> Undecided, Agree, Agree Somewhat, Agree, Agree, Agree Somewhat,…
$ Qu4    <ord> Undecided, Agree, Agree, Disagree, Disagree, Agree Somewhat, Ag…
$ Qu5    <ord> Undecided, Agree, Agree Somewhat, Disagree Somewhat, Disagree S…
$ Qu6    <ord> Undecided, Agree Somewhat, Agree Somewhat, Disagree, Disagree, …
$ Qu7    <ord> Undecided, Agree Somewhat, Agree, Disagree Somewhat, Disagree S…
$ Qu8    <ord> Undecided, Agree Somewhat, Agree Somewhat, Disagree, Disagree, …
$ Qu9    <ord> Agree Somewhat, Agree Somewhat, Disagree Somewhat, Undecided, U…
$ Qu10   <ord> Undecided, Agree Somewhat, Disagree Somewhat, Disagree, Disagre…
$ Qu11   <ord> Agree Somewhat, Agree Somewhat, Disagree Somewhat, Agree Somewh…
$ Qu12   <ord> Strongly Agree, Agree, Agree, Agree, Agree, Agree, Agree, Agree…
$ Qu13   <ord> Undecided, Agree, Agree, Strongly Agree, Strongly Agree, Agree,…
$ Qu14   <ord> Undecided, Agree, Agree Somewhat, Agree Somewhat, Agree Somewha…
$ Qu15   <ord> Undecided, Agree, Agree, Agree Somewhat, Agree Somewhat, Agree …
$ Qu16   <ord> Undecided, Agree, Agree Somewhat, Undecided, Undecided, Disagre…
$ Qu17   <ord> Undecided, Agree, Agree Somewhat, Undecided, Undecided, Agree S…
$ Qu18   <ord> Undecided, Agree Somewhat, Agree, Agree Somewhat, Agree Somewha…
$ Qu19   <ord> Agree Somewhat, Strongly Agree, Agree, Disagree, Disagree, Disa…
$ Qu20   <ord> Undecided, Agree, Agree Somewhat, Disagree Somewhat, Disagree S…
$ Qu21   <ord> Undecided, Strongly Agree, Agree Somewhat, Agree, Agree, Strong…
$ Qu22   <ord> Undecided, Agree Somewhat, Agree Somewhat, Agree Somewhat, Agre…
$ Qu23   <ord> Undecided, Strongly Agree, Agree Somewhat, Agree Somewhat, Agre…
$ Qu24   <ord> Undecided, Agree, Agree Somewhat, Agree, Agree, Agree, Undecide…
$ Qu25   <ord> Undecided, Strongly Agree, Agree Somewhat, Agree Somewhat, Agre…
$ FAC1_1 <dbl> -0.31985907, -1.85802269, -1.36495416, 0.02235922, 0.02235922, …
$ FAC2_1 <dbl> -0.6366505, -1.5727256, -1.6357254, -0.6846857, -0.6846857, -1.…
$ FAC3_1 <dbl> 0.825818943, -1.576466472, -0.125140607, -0.576649155, -0.57664…
$ FAC4_1 <dbl> -1.4492270, -0.9594769, 0.3063403, -0.3991595, -0.3991595, 1.60…
$ FAC5_1 <dbl> -0.03960671, -0.03584510, -0.80707018, 1.59287496, 1.59287496, …

The questions focused on four attributes: motivation, dependability, enthusiasm, and commitment.

gt_items(q_motivation)
Characteristic, then n (%) for each response level: Strongly Agree (N = 30), Agree (N = 136), Agree Somewhat (N = 406), Undecided (N = 726), Disagree Somewhat (N = 769), Disagree (N = 206), Strongly Disagree (N = 247).
    Qu3 1 (0.3%) 19 (6.0%) 54 (17%) 97 (31%) 94 (30%) 27 (8.6%) 23 (7.3%)
    Qu4 1 (0.3%) 12 (3.8%) 53 (17%) 106 (34%) 91 (29%) 25 (7.9%) 27 (8.6%)
    Qu5 1 (0.3%) 19 (6.0%) 61 (19%) 88 (28%) 102 (32%) 23 (7.3%) 21 (6.7%)
    Qu6 3 (1.0%) 13 (4.1%) 45 (14%) 102 (32%) 104 (33%) 25 (7.9%) 23 (7.3%)
    Qu7 4 (1.3%) 11 (3.5%) 42 (13%) 99 (31%) 90 (29%) 37 (12%) 32 (10%)
    Qu8 7 (2.2%) 7 (2.2%) 42 (13%) 68 (22%) 119 (38%) 42 (13%) 30 (9.5%)
    Qu12 1 (0.3%) 26 (8.3%) 44 (14%) 76 (24%) 83 (26%) 16 (5.1%) 69 (22%)
    Qu13 12 (3.8%) 29 (9.2%) 65 (21%) 90 (29%) 86 (27%) 11 (3.5%) 22 (7.0%)
gt_items(q_dependability)
Characteristic, then n (%) for each response level: Strongly Agree (N = 49), Agree (N = 185), Agree Somewhat (N = 430), Undecided (N = 628), Disagree Somewhat (N = 553), Disagree (N = 267), Strongly Disagree (N = 93).
    Qu2 15 (4.8%) 31 (9.8%) 51 (16%) 79 (25%) 69 (22%) 51 (16%) 19 (6.0%)
    Qu14 2 (0.6%) 31 (9.8%) 53 (17%) 101 (32%) 85 (27%) 30 (9.5%) 13 (4.1%)
    Qu15 4 (1.3%) 26 (8.3%) 67 (21%) 92 (29%) 80 (25%) 33 (10%) 13 (4.1%)
    Qu16 7 (2.2%) 25 (7.9%) 66 (21%) 77 (24%) 90 (29%) 39 (12%) 11 (3.5%)
    Qu17 12 (3.8%) 29 (9.2%) 80 (25%) 94 (30%) 71 (23%) 24 (7.6%) 5 (1.6%)
    Qu18 5 (1.6%) 33 (10%) 69 (22%) 97 (31%) 73 (23%) 25 (7.9%) 13 (4.1%)
    Qu19 4 (1.3%) 10 (3.2%) 44 (14%) 88 (28%) 85 (27%) 65 (21%) 19 (6.0%)
gt_items(q_enthusiasm)
Characteristic, then n (%) for each response level: Strongly Agree (N = 107), Agree (N = 308), Agree Somewhat (N = 562), Undecided (N = 492), Disagree Somewhat (N = 372), Disagree (N = 28), Strongly Disagree (N = 21).
    Qu20 8 (2.5%) 37 (12%) 74 (23%) 83 (26%) 87 (28%) 17 (5.4%) 9 (2.9%)
    Qu21 20 (6.3%) 67 (21%) 103 (33%) 63 (20%) 60 (19%) 0 (0%) 2 (0.6%)
    Qu22 23 (7.3%) 68 (22%) 114 (36%) 59 (19%) 49 (16%) 0 (0%) 2 (0.6%)
    Qu23 16 (5.1%) 46 (15%) 87 (28%) 84 (27%) 74 (23%) 2 (0.6%) 6 (1.9%)
    Qu24 21 (6.7%) 55 (17%) 95 (30%) 96 (30%) 48 (15%) 0 (0%) 0 (0%)
    Qu25 19 (6.0%) 35 (11%) 89 (28%) 107 (34%) 54 (17%) 9 (2.9%) 2 (0.6%)
gt_items(q_commitment)
Characteristic, then n (%) for each response level: Strongly Agree (N = 25), Agree (N = 60), Agree Somewhat (N = 104), Undecided (N = 149), Disagree Somewhat (N = 484), Disagree (N = 322), Strongly Disagree (N = 116).
    Qu1 7 (2.2%) 10 (3.2%) 20 (6.3%) 39 (12%) 114 (36%) 94 (30%) 31 (9.8%)
    Qu9 4 (1.3%) 19 (6.0%) 21 (6.7%) 19 (6.0%) 117 (37%) 94 (30%) 41 (13%)
    Qu10 9 (2.9%) 9 (2.9%) 25 (7.9%) 29 (9.2%) 129 (41%) 82 (26%) 32 (10%)
    Qu11 5 (1.6%) 22 (7.0%) 38 (12%) 62 (20%) 124 (39%) 52 (17%) 12 (3.8%)

Let’s drop the unrelated FAC* columns. PCA only works with numeric values; our questionnaire uses an ordered 7-level scale, so we’re probably okay treating it as numeric.

pca_numeric <-
  pca_dat |>
  select(starts_with("Qu")) |> 
  mutate(across(everything(), as.numeric))

2.4 Assumptions

Missing values can bias PCA results. If you have missing values, you should impute them, remove the affected observations, or drop the affected columns. Fortunately there are no missing values in this dataset.

colSums(is.na(pca_numeric))
 Qu1  Qu2  Qu3  Qu4  Qu5  Qu6  Qu7  Qu8  Qu9 Qu10 Qu11 Qu12 Qu13 Qu14 Qu15 Qu16 
   0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0 
Qu17 Qu18 Qu19 Qu20 Qu21 Qu22 Qu23 Qu24 Qu25 
   0    0    0    0    0    0    0    0    0 

There are a few other assumptions to be aware of:

  • PCA is based on correlation coefficients, so all variables should be linearly related (r > .3) to at least one other variable.
  • The sample size should be large. As a rule of thumb, there should be at least 5 cases per variable (checked in the sketch below).
  • There should be no outliers. Component scores more than 3 standard deviations from the mean can have a disproportionate influence on the results (also checked below).
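The last two assumptions are easy to check directly. A minimal sketch, using pca_numeric from above (the outlier check fits a throwaway PCA here; the one we keep comes in Section 2.5):

# Cases per variable: 315 / 25 = 12.6, comfortably above the 5-per-variable rule of thumb.
nrow(pca_numeric) / ncol(pca_numeric)

# Outliers: count standardized component scores more than 3 SDs from the mean.
scores <- prcomp(pca_numeric, center = TRUE, scale. = TRUE)$x
sum(abs(scale(scores)) > 3)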

I found two procedures used to test assumptions (though they do not cover all of them). Both operate on the correlation matrix.

corr_mtrx <- cor(pca_numeric)

The first assumption can be checked by querying the matrix for each variable’s maximum correlation. The weakest is Qu20 at .38, which still clears the .3 threshold.

# ggcorrplot::ggcorrplot(corr_mtrx)

max_cor <-
  as_tibble(corr_mtrx) |>
  mutate(var1 = factor(q_colnames, levels = q_colnames)) |>
  pivot_longer(cols = c(Qu1:Qu25), names_to = "var2", values_to = "rho") |>
  filter(var1 != var2) |>
  slice_max(by = var1, order_by = rho, n = 1)

max_cor_v <- round(max_cor$rho, 2)
names(max_cor_v) <- max_cor$var1
max_cor_v
 Qu1  Qu2  Qu3  Qu4  Qu5  Qu6  Qu7  Qu8  Qu9 Qu10 Qu11 Qu12 Qu13 Qu14 Qu15 Qu16 
0.58 0.68 0.69 0.68 0.53 0.56 0.63 0.63 0.55 0.60 0.60 0.62 0.69 0.80 0.80 0.59 
Qu17 Qu18 Qu19 Qu20 Qu21 Qu22 Qu23 Qu24 Qu25 
0.63 0.63 0.68 0.38 0.69 0.69 0.59 0.62 0.62 

The Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy compares the variable correlations to the partial correlations in the data. The test measures sampling adequacy for each variable in the model and for the complete model.2

\[ \text{KMO}_j = \frac{\sum_{i \ne j}r_{ij}^2}{\sum_{i \ne j}r_{ij}^2 + \sum_{i \ne j} u_{ij}^2} \]

where \(r_{ij}\) are correlations and \(u_{ij}\) are partial correlations. Scores range from 0 to 1. Values should be at least .6 to justify a PCA; values over .8 are preferable.
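For transparency, here is a sketch of the calculation by hand; the partial correlations come from the inverse of the correlation matrix, and EFAtools::KMO() below is the reference implementation.

# Hand-rolled KMO (sketch). Partial correlations u_ij = -S_ij / sqrt(S_ii * S_jj),
# where S is the inverse of the correlation matrix.
S <- solve(corr_mtrx)
u <- -S / sqrt(outer(diag(S), diag(S)))
r_off <- corr_mtrx
diag(u) <- 0
diag(r_off) <- 0
sum(r_off^2) / (sum(r_off^2) + sum(u^2))                        # overall KMO
round(rowSums(r_off^2) / (rowSums(r_off^2) + rowSums(u^2)), 3)  # per-variable KMO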

EFAtools::KMO(corr_mtrx)

── Kaiser-Meyer-Olkin criterion (KMO) ──────────────────────────────────────────

✔ The overall KMO value for your data is meritorious.
  These data are probably suitable for factor analysis.

  Overall: 0.833

  For each variable:
  Qu1   Qu2   Qu3   Qu4   Qu5   Qu6   Qu7   Qu8   Qu9  Qu10  Qu11  Qu12  Qu13 
0.817 0.740 0.854 0.889 0.909 0.906 0.843 0.829 0.849 0.762 0.744 0.899 0.857 
 Qu14  Qu15  Qu16  Qu17  Qu18  Qu19  Qu20  Qu21  Qu22  Qu23  Qu24  Qu25 
0.824 0.877 0.913 0.880 0.794 0.803 0.834 0.725 0.724 0.831 0.741 0.803 

Bartlett’s test of sphericity tests the null hypothesis that the correlation matrix is an identity matrix, i.e., there are no correlations between any variables.
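The statistic is a simple function of the determinant of the correlation matrix; a hand-rolled sketch (it should reproduce the \(\chi^2(300)\) value reported below):

# Bartlett's test of sphericity: chi-square = -((n - 1) - (2p + 5) / 6) * ln|R|,
# with p(p - 1) / 2 degrees of freedom.
n <- nrow(pca_numeric)
p <- ncol(pca_numeric)
chi_sq <- -((n - 1) - (2 * p + 5) / 6) * log(det(corr_mtrx))
df <- p * (p - 1) / 2
c(chi_sq = chi_sq, df = df, p_value = pchisq(chi_sq, df, lower.tail = FALSE))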

EFAtools::BARTLETT(corr_mtrx, N = nrow(pca_dat))

✔ The Bartlett's test of sphericity was significant at an alpha level of .05.
  These data are probably suitable for factor analysis.

  𝜒²(300) = 4214.06, p < .001

2.5 Perform PCA

There are two stats package functions for PCA, princomp() and prcomp(). The documentation suggests prcomp() is preferable because it is based on a singular value decomposition of the data matrix rather than an eigendecomposition of the covariance matrix, which is generally better for numerical accuracy.
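A quick sketch with the spring data from Section 2.1 shows the connection: the singular values of the centered and scaled data, divided by \(\sqrt{n - 1}\), reproduce the component standard deviations (the right singular vectors likewise match the rotation up to sign).

# SVD view of PCA: singular values / sqrt(n - 1) = component standard deviations.
X_s <- scale(select(spring, -t))
sv  <- svd(X_s)
round(sv$d / sqrt(nrow(X_s) - 1), 4)
round(spring_pca$sdev, 4)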

pca_result <- prcomp(pca_numeric, center = TRUE, scale. = TRUE)

summary(pca_result)
Importance of components:
                          PC1    PC2    PC3    PC4     PC5     PC6    PC7
Standard deviation     2.5942 1.8282 1.7022 1.4204 1.02418 0.97525 0.9460
Proportion of Variance 0.2692 0.1337 0.1159 0.0807 0.04196 0.03804 0.0358
Cumulative Proportion  0.2692 0.4029 0.5188 0.5995 0.64145 0.67949 0.7153
                           PC8    PC9    PC10    PC11    PC12    PC13    PC14
Standard deviation     0.87741 0.8306 0.79657 0.74071 0.71685 0.71026 0.67118
Proportion of Variance 0.03079 0.0276 0.02538 0.02195 0.02056 0.02018 0.01802
Cumulative Proportion  0.74608 0.7737 0.79906 0.82101 0.84156 0.86174 0.87976
                          PC15    PC16    PC17    PC18    PC19   PC20    PC21
Standard deviation     0.61776 0.60252 0.59119 0.56626 0.54597 0.5315 0.50522
Proportion of Variance 0.01527 0.01452 0.01398 0.01283 0.01192 0.0113 0.01021
Cumulative Proportion  0.89502 0.90955 0.92353 0.93635 0.94828 0.9596 0.96978
                          PC22    PC23    PC24    PC25
Standard deviation     0.48827 0.45536 0.41360 0.37230
Proportion of Variance 0.00954 0.00829 0.00684 0.00554
Cumulative Proportion  0.97932 0.98761 0.99446 1.00000

The top 4 components account for 60% of the variance; each subsequent component adds about 4% or less. The component loadings show how each component relates to the original columns.

2.6 Results

Loadings are the eigenvectors of the covariance matrix (here the correlation matrix, since we standardized), scaled by the component standard deviations. The eigenvectors are in the rotation object returned by prcomp(), so multiply them by the standard deviations.

pca_loadings <- pca_result$rotation %*% diag(pca_result$sdev)

Here are the loadings from the top 8 components. PC1 loads strongly on Motivation and Dependability and moderately on Enthusiasm. PC2 loads strongly on Commitment and strongly negatively on Enthusiasm, suggesting strong commitment accompanied by low enthusiasm. PC3 loads strongly negatively on Enthusiasm. PC4 loads strongly on Commitment. The remaining components are less decisive.

colnames(pca_loadings) <- paste0("pc", 1:25)

pca_loadings |>
  as_tibble(rownames = "Q") |>
  mutate(Area = case_when(
    Q %in% q_motivation ~ "Motivation",
    Q %in% q_dependability ~ "Dependability",
    Q %in% q_enthusiasm ~ "Enthusiasm",
    Q %in% q_commitment ~ "Commitment"
  )) |>
  select(Area, Q, pc1:pc8) |>
  arrange(Area, Q) |>
  gt::gt() |> 
  gt::fmt_number(pc1:pc8, decimals = 3) |>
  gt::data_color(
    columns = pc1:pc8, 
    method = "numeric",
    palette = c("firebrick", "white", "dodgerblue3"),
    domain = c(-1, 1)
  ) |>
  gt::tab_row_group(label = "Commitment", rows = Area == "Commitment") |>
  gt::tab_row_group(label = "Enthusiasm", rows = Area == "Enthusiasm") |>
  gt::tab_row_group(label = "Dependability", rows = Area == "Dependability") |>
  gt::tab_row_group(label = "Motivation", rows = Area == "Motivation") |>
  gt::cols_hide(columns = Area) |>
  gt::tab_header("Factor loadings by survey area.") |>
  gt::tab_options(heading.align = "left")
Factor loadings by survey area.
Q pc1 pc2 pc3 pc4 pc5 pc6 pc7 pc8
Motivation
Qu12 0.656 0.282 −0.046 −0.343 −0.142 −0.172 0.089 −0.026
Qu13 0.642 0.265 −0.175 −0.347 −0.205 −0.052 −0.155 0.101
Qu3 0.671 0.286 −0.217 −0.265 −0.279 0.036 −0.134 0.197
Qu4 0.677 0.203 −0.274 −0.287 −0.124 −0.010 −0.004 −0.090
Qu5 0.586 0.203 −0.100 −0.276 −0.043 −0.123 −0.232 −0.082
Qu6 0.652 0.296 −0.052 −0.290 0.061 0.023 −0.064 −0.169
Qu7 0.559 0.332 −0.001 −0.223 0.421 0.143 0.353 0.096
Qu8 0.527 0.375 −0.181 −0.094 0.473 0.225 0.273 −0.088
Dependability
Qu14 0.649 −0.295 0.444 0.065 −0.250 0.046 0.108 0.132
Qu15 0.668 −0.333 0.405 0.062 −0.123 −0.015 0.061 0.227
Qu16 0.621 −0.242 0.349 0.071 0.204 −0.125 −0.103 0.096
Qu17 0.561 −0.401 0.349 0.085 0.158 0.078 −0.208 0.253
Qu18 0.617 −0.289 0.355 0.110 0.292 −0.082 −0.056 0.186
Qu19 0.579 −0.185 0.473 0.221 0.010 0.100 0.049 −0.371
Qu2 0.541 −0.203 0.414 0.192 −0.328 0.193 0.150 −0.409
Enthusiasm
Qu20 0.357 −0.229 −0.335 0.089 0.165 0.432 −0.556 −0.191
Qu21 0.244 −0.457 −0.583 0.043 −0.107 0.366 0.154 0.148
Qu22 0.269 −0.398 −0.579 0.132 −0.228 0.246 0.318 0.118
Qu23 0.505 −0.306 −0.422 0.275 −0.004 −0.366 0.106 −0.135
Qu24 0.253 −0.498 −0.573 0.139 0.069 −0.309 −0.085 −0.005
Qu25 0.423 −0.428 −0.458 0.224 0.179 −0.270 0.051 −0.128
Commitment
Qu1 0.230 0.540 −0.081 0.545 −0.033 −0.014 −0.048 0.228
Qu10 0.335 0.574 −0.161 0.535 0.034 −0.085 −0.070 −0.089
Qu11 0.153 0.479 −0.147 0.619 −0.064 0.207 −0.114 0.111
Qu9 0.314 0.564 0.123 0.391 −0.127 −0.136 0.125 −0.001

2.7 Interpretation

A principal components analysis will produce as many components as there are variables. However, the purpose of principal components analysis is to explain as much of the variance in your variables as possible using as few components as possible. After you have extracted your components, there are four major criteria that can help you decide on the number of components to retain: (a) the eigenvalue-one criterion, (b) the proportion of total variance accounted for, (c) the scree plot test, and (d) the interpretability criterion. All except for the first criterion will require some degree of subjective analysis.

  1. Eigenvalue-One Criterion (Kaiser Criterion). This is a straightforward rule of thumb that suggests retaining only those components with eigenvalues greater than one. An eigenvalue represents the amount of variance explained by each component. If an eigenvalue is greater than one, it means the component explains more variance than a single original variable. The first five eigenvalues are greater than one, although PC5’s (1.02418² ≈ 1.05) only barely clears the bar (see the sketch after this list).

  2. Proportion of Total Variance Accounted For: This involves retaining enough components to explain a specified cumulative percentage of the total variance. Common thresholds are 70% to 90%. You keep adding components until the cumulative percentage of explained variance reaches your desired threshold. The first four components reached 60%. The fifth brought it to 64%. 70% required seven components.

  3. Scree Plot Test: A scree plot displays the eigenvalues in descending order. The point at which the plot levels off (known as the “elbow”) indicates the number of components to retain. This method requires some visual inspection and subjective judgment. The idea is to keep the components before the plot starts to flatten out. The elbow appears to be at five components.

factoextra::fviz_eig(pca_result, addlabels = TRUE)

  4. Interpretability Criterion: This criterion involves retaining components that make sense in the context of your data and research questions. Sometimes, even if a component explains a significant portion of the variance, it might not be retained if it does not lead to meaningful or interpretable patterns. Conversely, a component with lower explained variance might be retained if it provides valuable insights. I expected one component for each of the four survey focus areas, but the loadings didn’t really occur that way. Still, four was the number producing the most intelligible results.
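The first two criteria can be read straight off pca_result; a short sketch:

# Eigenvalues are the squared component standard deviations.
eigenvalues <- pca_result$sdev^2
sum(eigenvalues > 1)                                      # Kaiser criterion: 5 components
which(cumsum(eigenvalues) / sum(eigenvalues) >= 0.70)[1]  # 70% of variance: 7 components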

Bi-plots can help with interpretability. They plot two components at a time: variables farther from the origin have stronger loadings on those components, and variables pointing in similar directions are more strongly correlated.

factoextra::fviz_pca_var(
  pca_result, 
  # axes = c(2, 3), 
  repel = TRUE, 
  col.var = "cos2",
  gradient.cols = my_palette$warm[c(1, 4, 7)]
)

2.8 Appendix: Eigenvalues and Eigenvectors

A square matrix, \(\textbf{A}\), has eigenvalues, \(\lambda\), and eigenvectors, \(\textbf{v}\), that satisfy3

\[ \textbf{A} \textbf{v} = \lambda \textbf{v} \tag{2.1}\]

For example, \(6\) and \(\begin{bmatrix}1 \\ 4 \end{bmatrix}\) are an eigenvalue and eigenvector here:

\[ \begin{bmatrix} -6 & 3 \\ 4 & 5 \end{bmatrix} \begin{bmatrix}1 \\ 4 \end{bmatrix} = 6 \begin{bmatrix}1 \\ 4 \end{bmatrix} \]

Equation 2.1 can be re-expressed as \((\textbf{A} - \lambda \textbf{I}) \textbf{v} = 0.\) For a non-zero \(\textbf{v}\) to exist, \(\textbf{A} - \lambda \textbf{I}\) must be singular, so its determinant must be zero,

\[ | \textbf{A} - \lambda \textbf{I}| = 0 \tag{2.2}\]

Back to the example, use Equation 2.2 to find possible eigenvalues.

\[ \left| \begin{bmatrix} -6 & 3 \\ 4 & 5 \end{bmatrix} - \lambda \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \right| = 0 \]

Subtract the matrices, calculate the determinant, and solve for the possible eigenvalues, \(\lambda = -7 \text{ or } 6.\)

\[ \begin{align} \left| \begin{bmatrix} -6 - \lambda & 3 \\ 4 & 5 - \lambda \end{bmatrix} \right| &= 0 \\ (-6 - \lambda)(5 - \lambda) - 3 \times 4 &= 0 \\ -30 + \lambda + \lambda^2 - 12 &= 0 \\ (\lambda + 7)(\lambda - 6) &= 0 \end{align} \]

Plug them into Equation 2.1 and solve the system of equations. For \(\lambda = 6\), both equations reduce to \(y = 4x\).

\[ \begin{align} \begin{bmatrix} -6 & 3 \\ 4 & 5 \end{bmatrix} \begin{bmatrix}x \\ y \end{bmatrix} &= 6 \begin{bmatrix}x \\ y \end{bmatrix} \\ \begin{bmatrix} -6x + 3y \\ 4x + 5y \end{bmatrix} &= \begin{bmatrix} 6x \\ 6y \end{bmatrix} \\ \begin{bmatrix} -12x + 3y \\ 4x - y \end{bmatrix} &= \begin{bmatrix} 0 \\ 0 \end{bmatrix} \\ -8x + 2y &= 0 \\ y &= 4x \end{align} \]

So \(\begin{bmatrix}1 \\ 4 \end{bmatrix}\) is a solution, or \(\begin{bmatrix}2 \\ 8 \end{bmatrix}\), or \(\begin{bmatrix}3 \\ 12 \end{bmatrix}\), etc. You can do the same exercise for \(\lambda = -7\).
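R’s eigen() confirms the hand calculation (it returns unit-length eigenvectors, so \((1, 4)\) shows up normalized):

# Eigen decomposition of the example matrix; expect eigenvalues 6 and -7.
A <- matrix(c(-6, 4, 3, 5), nrow = 2)  # column-major: rows (-6, 3) and (4, 5)
eigen(A)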

Eigenvectors and eigenvalues are useful because matrices are used to make transformations in space. In a transformation, an eigenvector is a direction the transformation does not change (for a 3D rotation, the axis of rotation), and the eigenvalue is the scale of the stretch along that direction (1 = no change, 2 = double length, -1 = point backwards, etc.).


  1. Material from PSU, Laerd, and Shlens (2014).↩︎

  2. See Statistics How-to.↩︎

  3. Took these notes from Math is Fun.↩︎