2 Generalized Linear Models (GLM)

The linear regression model, $E(y|X) = X \beta$, structured as $y_i = X_i \beta + \epsilon_i$ where $X_i \beta = \mu_i$, assumes the response is a linear function of the predictors and the residuals are independent random variables normally distributed with zero mean and constant variance, $\epsilon \sim N \left(0, \sigma^2 \right)$. This implies that given a set of predictors, the response is normally distributed about its expected value, $y_i \sim N \left(\mu_i, \sigma^2 \right)$. However, there are many situations where this normality assumption does not hold. Generalized linear models (GLMs) are a generalization of the linear regression model that work with non-normal response distributions.¹

The response will not have a normal distribution if the underlying data-generating process is binomial (Section @ref(binomiallogistic)) or multinomial (Section @ref(multinomiallogistic)), ordinal (Section @ref(ordinallogistic)), Poisson (counts, Section @ref(poissonregression)), or exponential (time-to-event). In these situations the linear regression model can predict probabilities and proportions outside [0, 1], or negative counts or times. GLMs solve this problem by modeling a function of the expected value of $y$, $f(E(y|X)) = X \beta$. There are three components to a GLM: a random component specified by the response variable’s probability distribution (normal, binomial, etc.); a systematic component in the form $X\beta$; and a link function, $\eta$, that specifies the link between the random and systematic components and converts the response to a range of $[-\infty, +\infty]$.

Linear regression is a special case of GLM where the link function is the identity function, $f(E(y|X)) = E(y|X)$. For binary logistic regression, the link function is the logit,

\[f(E(y|X)) = \log \left( \frac{E(y|X)}{1 - E(y|X)} \right) = \log \left( \frac{\pi}{1 - \pi} \right) = \mathrm{logit}(\pi)\]

where $\pi$ is the response probability.² For multinomial logistic regression, the link function is the logit again, but with respect to the baseline level, and there is set of logits (one for each non-baseline level),

\[f(E(y|X)) = \log \left( \frac{E(y_j|X)}{E(y_{j^*}|X)} \right) = \log \left( \frac{\pi_j}{\pi_{j^*}} \right) = \mathrm{logit}(\pi_j)\]

Where $j$ is a level in the dependent variable and $j^*$ is the baseline level. For Poisson regression, the link function is

\[f(E(y|X)) = \ln (E(y|X)) = \ln(\lambda)\]

where $\lambda$ is the expected event rate. For exponential regression, the link function is

\[f(E(y|X) = -E(y|X) = -\lambda\]

where $\lambda$ is the expected time to event.

GLM uses maximum likelihood estimation (MLE) rather than ordinary least squares (OLS) to estimate the parameters, and thus relies on large-sample approximations.

Fit a GLM just like an linear model, but with the glm() function, specifying the distribution with the family = c("gaussian", "binomial", "poisson") parameter. Fit a mulinomial logistic regression model with nnet::multinom().

2.1 Binomial Logistic Regression

Logistic regression estimates the probability that a categorical dependent variable is a particular level. The dependent variable levels can be binomial, multinomial, or ordinal. The binary logistic regression model is

\[y_i = \mathrm{logit}(\pi_i) = \log \left( \frac{\pi_i}{1 - \pi_i} \right) = X_i \beta\]

where $\pi_i$ is the response $i$’s binary level probability. The model predicts the log odds of the level. Why do this? The range of outcomes need to be between 0 and 1, and a sigmoid function, $f(y) = 1 / \left(1 + e^y \right)$, does that. If the log odds of the level equals $X_i\beta$, then the odds of the level equals $e^{X\beta}$. You can solve for $\pi_i$ to get $\pi = \mathrm{odds} / (\mathrm{odds} + 1)$. Substituting,

\[\pi_i = \frac{\exp(y_i)}{1 + \exp(y_i)} = \frac{e^{X_i\beta}}{1 + e^{X_i\beta}}\]

which you can simplify to $\pi_i = 1 / (1 + e^{-X_i\beta})$, a sigmoid function. The upshot is $X\beta$ is the functional relationship between the independent variables and a function of the response, not the response itself.

The model parameters are estimated either by iteratively reweighted least squares optimization or by maximum likelihood estimation (MLE). MLE maximizes the probability produced by a set of parameters $\beta$ given a data set and probability distribution.³ In logistic regression the probability distribution is the binomial and each observation is the outcome of a single Bernoulli trial.

\[L(\beta; y, X) = \prod_{i=1}^n \pi_i^{y_i}(1 - \pi_i)^{(1-y_i)} = \prod_{i=1}^n\frac{\exp(y_i X_i \beta)}{1 + \exp(X_i \beta)}.\]

In practice, multiplying many small probabilities can be unstable, so MLE optimizes the log likelihood instead.

\[\begin{align} l(\beta; y, X) &= \sum_{i = 1}^n \left(y_i \log(\pi_i) + (1 - y_i) \log(1 - \pi_i) \right) \\ &= \sum_{i = 1}^n \left(y_i X_i \beta - \log(1 + e^{X_i\beta}) \right) \end{align}\]

Sometimes you will see the cost function optimized. The cost function is the negative of of the log likelihood function.

Assumptions

The binomial logistic regression model requires a dichotomous dependent variable and independent observations. The sample size should be large, at least 10 observations per dependent variable level and independent variable. There are three conditions related to the data distribution: i) the logit transformation must be linearly related to any continuous independent variables, ii) there must be no multicollinearity, and iii) there must be no influential outliers.

Be aware of over-dispersion, a common issue with GLM. For a binomial logistic regression, the response variable should be distributed $y_i \sim \mathrm{Bin}(n_i, \pi_i)$ with $\mu_i = n_i \pi_i$ and $\sigma^2 = \pi (1 - \pi)$. Over-dispersion means the data shows evidence of variance greater than $\sigma^2$.

Case Study

This case study uses the Laerd Statistics article on binomial logistic regression data set. A study investigates the relationship between the incidence of heart disease (Yes|No) and age, weight, gender, and maximal aerobic capacity using data from n = 100 participants.

Show the code

cs1 <- list()

cs1$dat <- foreign::read.spss("./input/logistic-regression.sav", to.data.frame = TRUE)

cs1$dat %>%
  gtsummary::tbl_summary(
    by = heart_disease,
    include = -caseno,
    percent = "row",
    statistic = list(gtsummary::all_continuous() ~ "{mean}, {sd}")
  )

Characteristic	No N = 65¹	Yes N = 35¹
age	39, 8	45, 9
weight	77, 14	85, 15
gender
Female	29 (78%)	8 (22%)
Male	36 (57%)	27 (43%)
VO2max	45, 9	41, 6
¹ Mean, SD; n (%)

Overall, men are twice as likely to have heart disease. Male odds are .43/.57 = 0.8 and female odds are .22/.78 = 0.3, an male-to-female odds ratio of 2.7.

Show the code

cs1$dat %>% gtsummary::tbl_cross(row = heart_disease, col = gender, percent = "column")

	gender		Total
	Female	Male	Total
heart_disease
No	29 (78%)	36 (57%)	65 (65%)
Yes	8 (22%)	27 (43%)	35 (35%)
Total	37 (100%)	63 (100%)	100 (100%)

Age, weight, and poor max aerobic capacity are positively associated with heart disease.

Show the code

cs1$dat %>%
  pivot_longer(cols = c(age, weight, VO2max)) %>%
  ggplot(aes(x = heart_disease, y = value)) + 
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(aes(color = name)) +
  facet_wrap(facets = vars(name), scales = "free_y")

Consider centering the continuous variables around their means to facilitate model interpretation. The intercept term in the fitted model would represent a reasonable condition, not a zero-aged, zero-weighted person with no aerobic capacity. This is the way to go if you want to present your findings in the framework of a baseline probability (or odds) and the incremental effects of the independent variables. You might also standardize the continuous vars to get a more meaningful increment. On the other hand, if you want to use your model for predicting outcomes, you’ll have to back out of the centering when you predict values.

If your model is predictive rather than inferential, split the data into training/testing data sets.

Show the code

# For reproducibility
set.seed(123)

(x <- initial_split(cs1$dat, prop = 0.7, strata = heart_disease))
## <Training/Testing/Total>
## <69/31/100>

cs1$dat_training <- training(x)
dim(cs1$dat_training)
## [1] 69  6

cs1$dat_testing <- testing(x)
dim(cs1$dat_testing)
## [1] 31  6

Fit the Model

Fit the model using the tidymodels framework. If you want to continue using the classic methodology, the glm object is inside the tidymodels fit. The model fit returns a brief summary with the coefficients and model diagnostics.

Show the code

cs1$model <- 
  logistic_reg() %>% 
  set_engine("glm") %>% 
  set_mode("classification") 

cs1$fit <-
  cs1$model %>%
  fit(heart_disease ~ age + weight + VO2max + gender, data = cs1$dat)

# The fit object returned by glm(). You'll need this for interpretation and 
# checking assumptions.
cs1$result <-
  cs1$fit %>%
  extract_fit_engine()

# If you are fitting a predictive model, use the training set.
cs1$fit_training <-
  cs1$model %>%
  fit(heart_disease ~ age + weight + VO2max + gender, data = cs1$dat_training)

cs1$result %>% summary()


Call:
stats::glm(formula = heart_disease ~ age + weight + VO2max + 
    gender, family = stats::binomial, data = data)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)   
(Intercept) -1.676469   3.336079  -0.503  0.61530   
age          0.085098   0.028160   3.022  0.00251 **
weight       0.005727   0.022442   0.255  0.79858   
VO2max      -0.099024   0.047944  -2.065  0.03889 * 
genderMale   1.949639   0.842413   2.314  0.02065 * 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 129.49  on 99  degrees of freedom
Residual deviance: 102.09  on 95  degrees of freedom
AIC: 112.09

Number of Fisher Scoring iterations: 5

The null deviance, G^2, is the likelihood ratio of the intercept-only model with 69 rows - 1 parameter = 99 degrees of freedom. It is the sum of the squared deviance residuals. The residual deviance is the likelihood ratio of the full model with 100 - 5 parameters = 95 degrees of freedom.

The residual deviance is distributed chi-squared and can be used to test whether the model differs from the saturated model (model with as many coefficients as observations, G^2 = 0, df = 0) where $H_0$ = no difference. The deviance test for lack of fit fails to reject the null hypothesis.

Show the code

# G^2 calculations
cs1$result %>% residuals(type = "deviance") %>% .^2 %>% sum()
## [1] 102.0878
cs1$result %>% deviance()
## [1] 102.0878

# df
df.residual(cs1$result)
## [1] 95

# G^2 is distributed chi-squared with df degrees of freedom
pchisq(deviance(cs1$result), df = df.residual(cs1$result), lower.tail = FALSE)
## [1] 0.2911469
vcdExtra::LRstats(cs1$result)
## Likelihood summary table:
##               AIC    BIC LR Chisq Df Pr(>Chisq)
## cs1$result 112.09 125.11   102.09 95     0.2911

These two deviances, the null and residual, are shown in the ANOVA summary. An ANOVA table shows the change in deviance from successively adding each variable to the model.

Show the code

anova(cs1$result)

Analysis of Deviance Table

Model: binomial, link: logit

Response: heart_disease

Terms added sequentially (first to last)

       Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
NULL                      99     129.49              
age     1  11.9074        98     117.58 0.0005591 ***
weight  1   9.1820        97     108.40 0.0024440 ** 
VO2max  1   0.5045        96     107.89 0.4775433    
gender  1   5.8076        95     102.09 0.0159568 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Deviance residuals are one of four residuals you can calculate from a binary logistic regression.⁴ One is the raw residual, $\epsilon_i = y_i - \hat{p}_i$, where $\hat{p}_i$ is the predicted probability. Another is the Pearson residual, $r_i = \frac{\epsilon_i}{\sqrt{\hat{p}_i(1 - \hat{p}_i)}}$, the raw residual rescaled by dividing by the estimated standard deviation of a binomial distribution with 1 trial⁵. A third is the standardized Pearson residual, $rs_i = r_i / \sqrt{1 - \hat{h}_i}$, the Pearson residual adjusted for the leverage of the predictors using the hat-values. Hat-values measure the predictor distances from the mean. This residual is especially useful to evaluate model fit because if the model fits well, these residuals have a standard normal distribution. Finally, there are the deviance residuals, $d_i = \mathrm{sign}(\epsilon_i) \left[ -2(y_i \log \hat{p}_i + (1 - y_i) \log (1 - \hat{p}_i)) \right]^{.5}$. Deviance Residuals measure how much the estimated probabilities differ from the observed proportions of success. You want deviance residuals to be evenly distributed (in absolute values, 1Q $\approx$ 3Q, min $\approx$ max). You also want the min and max to be <3 because deviance residuals are roughly approximated by a standard normal distribution.

Show the code

bind_rows(
  Raw = cs1$result %>% residuals(type = "response") %>% summary(),
  Pearson = cs1$result %>% residuals(type = "pearson") %>% summary(),
  `Standardized Pearson` = cs1$result %>% rstandard(type = "pearson") %>% summary(),
  Deviance = cs1$result %>% residuals(type = "deviance") %>% summary(),
  .id = "Residual"
)

# A tibble: 4 × 7
  Residual             Min.        `1st Qu.`   Median      Mean  `3rd Qu.` Max. 
  <chr>                <table[1d]> <table[1d]> <table[1d]> <tab> <table[1> <tab>
1 Raw                  -0.7954587  -0.2500077  -0.1061849  -8.0… 0.3532839 0.91…
2 Pearson              -1.9720520  -0.5773622  -0.3446596  -1.8… 0.7391867 3.34…
3 Standardized Pearson -2.1794172  -0.5918874  -0.3521058  -2.1… 0.7592960 3.39…
4 Deviance             -1.7815642  -0.7585405  -0.4738051  -8.0… 0.9336278 2.23…

Interpretation

Before we look at the coefficient estimations, consider what it is they are predicting: the log odds of the binary response. To see what that means, plug in values for the explanatory variables to get predictions. $\hat{y}$ is the log odds of having heart disease.

Show the code

(mean_person <- 
  cs1$dat %>%
  select(-caseno) %>%
  summarize(.by = gender, across(where(is.numeric), mean)))
##   gender      age   weight   VO2max
## 1   Male 40.79365 84.83270 46.40095
## 2 Female 41.62162 70.85324 38.91135

pred_log_odds <- cs1$fit %>% predict(new_data = mean_person, type = "raw")
names(pred_log_odds) <- mean_person$gender
pred_log_odds
##       Male     Female 
## -0.3643411 -1.5819310

Exponentiate to get the odds, $\exp (\hat{y}) = \frac{\pi}{1 - \pi}$.

Show the code

(pred_odds <- exp(pred_log_odds))

     Male    Female 
0.6946542 0.2055777

Solve for $\pi = \frac{\exp (\hat{y})}{1 + \exp (\hat{y})}$ to get the probability. Do the math, or use predict(type = "prob").

Show the code

(pred_prob <- pred_odds / (1 + pred_odds))
##      Male    Female 
## 0.4099091 0.1705222
cs1$fit %>% predict(new_data = mean_person, type = "prob")
## # A tibble: 2 × 2
##   .pred_No .pred_Yes
##      <dbl>     <dbl>
## 1    0.590     0.410
## 2    0.829     0.171

Now let’s interpret the coefficients.

Show the code

cs1$fit %>% tidy()

# A tibble: 5 × 5
  term        estimate std.error statistic p.value
  <chr>          <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept) -1.68       3.34      -0.503 0.615  
2 age          0.0851     0.0282     3.02  0.00251
3 weight       0.00573    0.0224     0.255 0.799  
4 VO2max      -0.0990     0.0479    -2.07  0.0389 
5 genderMale   1.95       0.842      2.31  0.0206

The intercept term is the log-odds of heart disease for the reference case. The reference case in the model is gender = “Female”, age = 0, weight = 0, and VO2max = 0. If the data was centered, the reference case would actually meaningful.

Show the code

cs1$fit %>%
  predict(new_data = list(age = 0, weight = 0, VO2max = 0, gender = "Female"), 
          type = "raw")

        1 
-1.676469

Column “statistic” is the Wald $z$ statistic, $z = \hat{\beta} / SE(\hat{\beta})$. Its square is the Wald chi-squared statistic. The p-value is the area to the right of $z^2$ in the $\chi_1^2$ density curve:

Show the code

cs1$fit %>% 
  tidy() %>%
  mutate(p.chisq = map_dbl(statistic, ~pchisq(.^2, df = 1, lower.tail = FALSE))) %>%
  pull(p.chisq)

[1] 0.615297092 0.002511644 0.798578389 0.038885956 0.020648470

Interpret the coefficient estimates as the change in the log odds of $y$ due to a one unit change in $x$. If $\delta = x_a - x_b$, then a $\delta$ change in $x$ is associated with a $\delta \hat{\beta}$ change in the log odds of $y$. $\beta$ is the log odds ratio of $x_a$ vs $x_b$.

\[\log \left(\pi / (1 - \pi) |_{x = x_a} \right) - \log \left(\pi / (1 - \pi) |_{x = x_b} \right) = \log \left( \frac{\pi / (1 - \pi) |_{x = x_a}}{\pi / (1 - \pi) |_{x = x_b}} \right) = \delta \hat{\beta}\]

The exponential of the coefficient estimates is the change in the odds of $y$ due to a $\delta$ change in $x$. $\exp \beta$ is the odds ratio of $x_a$ vs $x_b$.

\[\mathrm{odds}(y) = e^{\delta \hat{\beta}}\]

Show the code

cs1$fit %>% tidy(exponentiate = TRUE)

# A tibble: 5 × 5
  term        estimate std.error statistic p.value
  <chr>          <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)    0.187    3.34      -0.503 0.615  
2 age            1.09     0.0282     3.02  0.00251
3 weight         1.01     0.0224     0.255 0.799  
4 VO2max         0.906    0.0479    -2.07  0.0389 
5 genderMale     7.03     0.842      2.31  0.0206

All covariates held equal, a male’s log odds of heart disease are 1.95 times that of a female’s (log(OR)). A male’s odds are 7.03 times that of a female’s (OR). Of course, all covariate’s are not equal - males are heavier and have higher VO2max. Let’s run the calculations with the mean predictor values for male and female.

Show the code

# Log OR
pred_log_odds["Male"] / pred_log_odds["Female"]
##      Male 
## 0.2303142

# OR
pred_odds["Male"] / pred_odds["Female"]
##     Male 
## 3.379034

A one-unit increase in any of the continuous independent variables is interpreted similarly. The reference level is unimportant since the change is constant across the range of values. A one year increase in age increases the log-odds of heart disease by a factor of 0.09, and the odds by a factor of 1.09. To calculate the effect of a decade increase in age, multiply $\beta$ by 10 before exponentiating, or raise the exponentiated coeficient by 10. The effect of a 10-year increase in age is to increase the odds of heart disease by 2.34. The odds double every ten years.

oddsratio::or_glm() is a handy way to calculate odds ratios from arbitrary increments to the predictors. Here are the ORs of a 10-year age change, 10 kg weight change, and VO2max change of 5.

Show the code

oddsratio::or_glm(
  cs1$dat, 
  cs1$result, 
  incr = list(age = 10, weight = 25, VO2max = -12)
)

   predictor oddsratio ci_low (2.5) ci_high (97.5)          increment
1        age     2.342        1.391          4.270                 10
2     weight     1.154        0.381          3.572                 25
3     VO2max     3.281       11.033          1.124                -12
4 genderMale     7.026        1.428         40.155 Indicator variable

Notice that the predicted probabilities have the sigmoidal shape of the binary relationship.

Show the code

augment(cs1$fit, new_data = cs1$dat, type = "raw") %>%
  ggplot(aes(x = age, color = gender)) +
  geom_point(aes(y = as.numeric(heart_disease == "Yes"))) +
  geom_point(aes(y = .pred_Yes), shape = 4) +
  geom_smooth(aes(y = .pred_Yes), se = FALSE) +
  labs(x = "Age",
       y = NULL,
       title = "Binary Fitted Line Plot") +
  scale_y_continuous(breaks = c(0,1), labels = c("Healthy", "Heart Disease")) +
  theme_light() +
  theme(legend.position = "right")

Assumptions

Four assumptions relate to the study design: (1) the dependent variable is dichotomous; (2) the observations are independent; (3) the categories of all nominal variables are mutually exclusive and exhaustive; and (4) there are at least 10 observations per dependent variable level and independent variable. These assumptions are all valid here. Three more assumptions related to the data distribution:

There is a linear relationship between the logit transformation and the continuous independent variables. Test with a plot and with Box-Tidwell.
There is no independent variable multicollinearity. Test with correlation coefficients and variance inflation factors (VIF).
There are no influential outliers. Test with Cook’s distance.

Test the linearity assumption first. There are two ways to do this (do both). First, fit your model, then plot the fitted values against the continuous predictors. This is the GLM analog to OLS bivariate analysis, except now the dependent variable is the logit transformation. These plotted relationships look pretty linear.

Show the code

cs1$fit %>%
  augment(new_data = cs1$dat) %>%
  pivot_longer(c(age, weight, VO2max)) %>%
  ggplot(aes(x = value, y = .pred_Yes)) +
  geom_point() +
  facet_wrap(facets = vars(name), scales = "free_x") +
  geom_smooth(method = "lm", formula = "y~x") +
  labs(title = "Linearity Test: predicted vs continuous predictors", x = NULL)

Show the code

# Using non-centered vars to avoid log(0) errors.
x <- 
  cs1$dat %>%
  mutate(
    age_tx = log(age) * age,
    weight_tx = log(weight) * weight,
    VO2max_tx = log(VO2max) * VO2max
  )

cs1$boxtidwell_fit <- 
  logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification") %>%
  fit(heart_disease ~ age + weight + VO2max + gender + 
        age_tx + weight_tx + VO2max_tx, 
      data = x)

cs1$boxtidwell_fit %>% tidy()

# A tibble: 8 × 5
  term        estimate std.error statistic p.value
  <chr>          <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept) -40.6       21.7      -1.87   0.0615
2 age           2.73       1.10      2.47   0.0135
3 weight        0.144      0.783     0.184  0.854 
4 VO2max        1.32       1.82      0.724  0.469 
5 genderMale    1.85       0.922     2.01   0.0443
6 age_tx       -0.543      0.227    -2.40   0.0164
7 weight_tx    -0.0266     0.146    -0.182  0.855 
8 VO2max_tx    -0.301      0.382    -0.788  0.431

Focus on the three transformed variables. age_tx is the only one with a p-value nearly <.05, but it is customary to apply a Bonferroni adjustment when testing for linearity. There are eight predictors in the model (including the intercept term), so the Bonferroni adjusted p-value for age_tx is multiplied by 8. Do not reject the null hypothesis of linearity.

If the relationship was nonlinear, you could try transforming the variable by raising it to $\lambda = 1 + b / \gamma$ where $b$ is the estimated coefficient of the model without the interaction terms, and $\gamma$ is the estimated coefficient of the interaction term of the model with interactions. For age, $b$ is 0.085 and $\gamma$ is -0.543, so $\lambda$ = 0.843. This is approximately 1 (no transformation). It appears to be customary to apply general transformations like .5 (square root), 1/3 (cube root), ln, and the reciprocal.

Now check for multicollinearity. Variance inflation factors (VIF) estimate how much the variance of a regression coefficient is inflated due to multicollinearity. When independent variables are correlated, it is difficult to say which variable really influences the dependent variable. The VIF for variable $i$ is

\[ \mathrm{VIF}_i = \frac{1}{1 - R_i^2} \]

Show the code

car::vif(cs1$result)

     age   weight   VO2max   gender 
1.035274 1.900575 2.167067 2.502538

Try calculating the $\mathrm{VIF}$ for age.

I don’t know why this doesn’t work. It would work if the underlying model was OLS instead of GLM. The answer seems to be related to GVIF vs VIF, but I didn’t figure it out.)

Show the code

r2 <- lm(age ~ weight + VO2max + gender, data = cs1$dat_training) %>%
  summary() %>% pluck("r.squared")

(vif <- 1 / (1 - r2))
## [1] 1.049814

Now check for influential outliers. Predict the response probabilities and filter for the predictions more than two standard deviations from the actual value and a Cook’s Distance greater than 4/N = 0.04.⁶ Cook’s distance measures how much predicted values change when observation i is removed from the data. Only two fitted values were both an outlier and influential, row ids 59 and 70. An index plot of Cook’s Distance shows the two points at the far left. You might examine the observations for validity. Otherwise, proceed and explain that there were two standardized residuals with value of 2.01 and 2.27 standard deviations which were kept in the analysis.

Show the code

augment(cs1$result) %>%
  mutate(
    id = row_number(),
    outlier = if_else(abs(.std.resid) >= 2, "Outlier", "Other"),
    influential = if_else(.cooksd > 4 / nrow(cs1$dat), "Influential", "Other"),
    status = case_when(
      outlier == "Outlier" & influential == "Influential" ~ "Influential Outlier",
      outlier == "Outlier" ~ "Outlier",
      influential == "Influential" ~ "Influential",
      TRUE ~ "Other"
    )
  ) %>%
  ggplot(aes(x = .fitted, y = .cooksd)) +
  geom_point(aes(color = status)) +
  geom_text(aes(label = if_else(influential == "Influential", id, NA_integer_)), 
            check_overlap = TRUE, size = 3, nudge_x = .025) +
  geom_hline(yintercept = 4 / nrow(cs1$dat), linetype = 2, color = "goldenrod") + 
  scale_color_manual(values = c("Influential Outlier" = "firebrick", 
                                "Influential" = "goldenrod",
                                "Outlier" = "slategray",
                                "Other" = "black")) +
  theme(legend.position = "right") +
  labs(title = "Index Plot of Cook's Distance.",
       subtitle = "Row id labeled for values > 4 / N.")

Evaluate the Fit

There are several ways to evaluate the model fit.

The likelihood ratio test
Pseudo R-squared⁷.
Accuracy measures
Gain and ROC curves

The likelihood ratio test compares the log likelihood of the fitted model to an intercept-only model.

Show the code

intercept_only <-
  logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification") %>%
  fit(heart_disease ~ 1, data = cs1$dat)

(cs1$lrtest <- lmtest::lrtest(cs1$result, intercept_only$fit))

Likelihood ratio test

Model 1: heart_disease ~ age + weight + VO2max + gender
Model 2: heart_disease ~ 1
  #Df  LogLik Df  Chisq Pr(>Chisq)    
1   5 -51.044                         
2   1 -64.745 -4 27.402  1.649e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The fitted model is significant, $\chi^2$(4) = 27.4, p < .001. Calculate the pseuedo-R2 with DescTools::PseudoR2().

I Can’t get this to work in the tidymodels framework. Using glm() for now.

Show the code

x <- glm(heart_disease ~ age + weight + VO2max + gender, 
         data = cs1$dat,
         family = "binomial")
cs1$pseudo_r2 <- DescTools::PseudoR2(x, which = c("CoxSnell", "Nagelkerke", "McFadden"))

cs1$pseudo_r2

  CoxSnell Nagelkerke   McFadden 
 0.2396799  0.3301044  0.2116126

Laerd interprets this as the model explained 33.0% (Nagelkerke R2) of the variance in heart disease. This is how your would interpret R2 in an OLS model. UCLA points out that the various pseudo R-squareds measure other aspects of the model and are unique to the measured quantity. A pseudo R-squared is not very informative on its own; it is useful for comparing models. Accuracy measures formed by the cross-tabulation of observed and predicted classes is the better way to go.

Show the code

cs1$conf_mat <-
  cs1$fit %>% 
  augment(new_data = cs1$dat) %>%
  # conf_mat requires truth to be first level of the factor variable.
  mutate(across(c(heart_disease, .pred_class), ~fct_relevel(., "Yes"))) %>%
  conf_mat(truth = heart_disease, estimate = .pred_class)

cs1$conf_mat
##           Truth
## Prediction Yes No
##        Yes  16 10
##        No   19 55

cs1$conf_mat %>% summary()
## # A tibble: 13 × 3
##    .metric              .estimator .estimate
##    <chr>                <chr>          <dbl>
##  1 accuracy             binary         0.71 
##  2 kap                  binary         0.322
##  3 sens                 binary         0.457
##  4 spec                 binary         0.846
##  5 ppv                  binary         0.615
##  6 npv                  binary         0.743
##  7 mcc                  binary         0.330
##  8 j_index              binary         0.303
##  9 bal_accuracy         binary         0.652
## 10 detection_prevalence binary         0.26 
## 11 precision            binary         0.615
## 12 recall               binary         0.457
## 13 f_meas               binary         0.525

cs1$conf_mat %>% autoplot()

The model accuracy, 71.0%, is the percent of observations correctly classified. The sensitivity, 45.7%, is the accuracy with regard to predicting positive cases. The specificity, 84.6%, is the accuracy with regard to predicting negative cases. If you are fitting a predictive model, use the testing data set for this.

Show the code

cs1$fit_training %>% 
  augment(new_data = cs1$dat_testing) %>%
  conf_mat(truth = heart_disease, estimate = .pred_class)

          Truth
Prediction No Yes
       No  16   5
       Yes  4   6

Finally, plot the gain curve or ROC curve. In the gain curve, the x-axis is the fraction of items seen when sorted by the predicted value, and the y-axis is the cumulatively summed true outcome. The “wizard” curve is the gain curve when the data is sorted by the true outcome. If the model’s gain curve is close to the wizard curve, then the model predicts the response well. The gray area is the “gain” over a random prediction.

Show the code

cs1$dat_testing %>%
  bind_cols(predict(cs1$fit, new_data = cs1$dat_testing, type = "prob")) %>%
  # event_level = "second" sets the second level as success
  yardstick::gain_curve(.pred_Yes, truth = heart_disease, event_level = "second") %>%
  autoplot() +
  labs(title = "Gain Curve")

11 of the 31 participants had heart disease in the test data set.

The gain curve encountered 6 heart disease cases (50%) within the first 8 observations (55%). It encountered all 11 heart disease cases on the 18th observation.
The bottom of the grey area is the outcome of a random model. Only half the heart disease cases would be observed within 50% of the observations.
The top of the grey area is the outcome of the perfect model, the “wizard curve”. Half the heart disease cases would be observed in 6/30=20% of the observations.

The ROC (Receiver Operating Characteristics) curve plots sensitivity vs specificity at varying cut-off values for the probability ranging from 0 to 1. Ideally, you want very little trade-off between sensitivity and specificity, with a curve hugging the left and top axes.

Show the code

cs1$dat_testing %>%
  bind_cols(predict(cs1$fit, new_data = cs1$dat_testing, type = "prob")) %>%
  # event_level = "second" sets the second level as success
  yardstick::roc_curve(.pred_Yes, truth = heart_disease, event_level = "second") %>%
  autoplot() +
  labs(title = "ROC Curve")

Reporting

A binomial logistic regression was performed to ascertain the effects of age, weight, gender and VO2max on the likelihood that participants have heart disease. Linearity of the continuous variables with respect to the logit of the dependent variable was assessed via the Box-Tidwell (1962) procedure. A Bonferroni correction was applied using all eight terms in the model resulting in statistical significance being accepted when p < 0.00625 (Tabachnick & Fidell, 2014). Based on this assessment, all continuous independent variables were found to be linearly related to the logit of the dependent variable. There were two standardized residuals with value of 2.01 and 2.27 standard deviations, which were kept in the analysis. The logistic regression model was statistically significant, χ2(4) = 27.40, p < .001. The model explained 33.0% (Nagelkerke R2) of the variance in heart disease and correctly classified 71.0% of cases. Sensitivity was 45.7%, specificity was 84.6%, positive predictive value was and negative predictive value was . Of the five predictor variables only three were statistically significant: age, gender and VO2max (as shown in Table 1). Females had 0.14 times lower odds to exhibit heart disease than males. Increasing age was associated with an increased likelihood of exhibiting heart disease, but increasing VO2max was associated with a reduction in the likelihood of exhibiting heart disease.

Show the code

gtsummary::tbl_regression(
  cs1$fit,
  exponentiate = TRUE
)

Characteristic	OR¹	95% CI¹	p-value
age	1.09	1.03, 1.16	0.003
weight	1.01	0.96, 1.05	0.8
VO2max	0.91	0.82, 0.99	0.039
gender
Female	—	—
Male	7.03	1.43, 40.2	0.021
¹ OR = Odds Ratio, CI = Confidence Interval

2.2 Multinomial Logistic Regression

The multinomial logistic regression model is $J - 1$ baseline logits,

\[y_i = \log \left( \frac{\pi_{ij}}{\pi_{ij^*}} \right) = X_i \beta_j, \hspace{5mm} j \ne j^*\]

where $j$ is a level of the multinomial response variable. Whereas binomial logistic regression models the log odds of the response level, multinomial logistic regression models the log relative risk, the probability relative to the baseline $j^*$ level.⁸

Interpret the $k^{th}$ element of $\beta_j$ as the increase in log relative risk of $Y_i = j$ relative to $Y_i = j^*$ given a one-unit increase in the $k^{th}$ element of $X$, holding the other terms constant. The individual probabilities, $\pi_{ij}$, are

\[\pi_{ij} = \frac{\exp(y_{ij})}{1 + \sum_{j \ne j^*} \exp(y_{ij})} = \frac{e^{X_i\beta_j}}{1 + \sum_{j \ne j^*} e^{X_i\beta_j}}\]

and for the baseline category,

\[\pi_{ij^*} = \frac{1}{1 + \sum_{j \ne j^*} \exp(y_{ij})} = \frac{1}{1 + \sum_{j \ne j^*} e^{X_i\beta_j}}\]

Assumptions

Multinomial logistic regression applies when the dependent variable is categorical. It presumes a linear relationship between the log relative risk of the dependent variable and $X$ with residuals $\epsilon$ that are independent. It also assumes there is no severe multicollinearity in the predictors, and there is independence of irrelevant alternatives (IIA). IIA means the relative likelihood of being in one category compared to the base category would not change if you added other categories.

Case Study

This case study uses the data set from this UCLA tutorial. A study measures the association between students’ academic program (academic, general, and vocational) and their socioeconomic status (SES) (low, middle, high) and writing score.

Show the code

download.file(
  "https://stats.idre.ucla.edu/stat/data/hsbdemo.dta",
  "./input/hsbdemo.dta",
  mode = "wb"
)

Show the code

cs2 <- list()

cs2$dat <- foreign::read.dta("./input/hsbdemo.dta") %>%
  # Just keep cols relevant to study
  select(id, prog, ses, write) %>%
  mutate(prog = fct_relevel(prog, "academic", after = 0))

cs2$dat %>%
  gtsummary::tbl_summary(
    by = prog,
    include = c(prog, ses, write),
    statistic = list(gtsummary::all_continuous() ~ "{mean}, {sd}")
  ) %>%
  gtsummary::add_overall()

Characteristic	Overall N = 200¹	academic N = 105¹	general N = 45¹	vocation N = 50¹
ses
low	47 (24%)	19 (18%)	16 (36%)	12 (24%)
middle	95 (48%)	44 (42%)	20 (44%)	31 (62%)
high	58 (29%)	42 (40%)	9 (20%)	7 (14%)
write	53, 9	56, 8	51, 9	47, 9
¹ n (%); Mean, SD

Conduct a brief exploratory analysis to establish your expectations. Academic programs are associated with higher writing scores and SES. General and vocational programs are the opposite, although SES has opposing effects for general (increased probability) and vocational (decreased probability).

Show the code

cs2$dat %>%
  mutate(write_bin = cut(write, breaks = 5, dig.lab = 1, right = FALSE)) %>%
  count(prog, ses, write_bin) %>%
  mutate(.by = c(ses, write_bin), prob = n / sum(n)) %>%
  ggplot(aes(x = write_bin, y = prob, color = ses)) +
  geom_point() +
  geom_line(aes(group = ses)) +
  facet_wrap(facets = vars(prog)) +
  theme(legend.position = "top") +
  labs(title = "Program Proportions by SES")

If your model is predictive rather than inferential, split the data into training/testing data sets.

Show the code

# For reproducibility
set.seed(123)

(x <- initial_split(cs2$dat, prop = 0.7, strata = prog))
## <Training/Testing/Total>
## <139/61/200>

cs2$dat_training <- training(x)
dim(cs2$dat_training)
## [1] 139   4

cs2$dat_testing <- testing(x)
dim(cs2$dat_testing)
## [1] 61  4

Fit the Model

The multinomial logitistic regression model engines all seem to be related to neural networks and advise that predictors be set on a common scale by normalizing. I have only one predictor other than SES in this model, but I’ll normalize it anyway. Normalizing write will also facilitate model interpretation. The intercept will represent a reasonable condition (average writing score) and a one-unit increase in write will represent a 1 SD increase in writing.

This case study will use both the parsnip package to fit a model directly to a data set, and the recipes package to preprocess the data. parsnip is fine for most applications and it seems to be better supported by other functions, like tidy(), so stick with it for inferential applications. recipes has the advantage of being able to process data in a workflow, so you don’t have to transform new data to make predictions. See more uses of recipes on the tidy models documentation.

Let’s first use parsnip to fit the full data set for an inferential application.

Show the code

# Create the model. This much is the same for parsnip or recipes.
cs2$model <-
  multinom_reg() %>%
  set_engine("nnet")

# For parsnip, you need to normalize the data explicitly. nnet requires dummy
# vars for factor predictors, but thankfully parsnip does that implicitly. 
# However, I want the outcome variable base level to be "academic" and while
# recipes can do that in pre-processing, I have to do it manually for parsnip.
cs2$dat_normalized <-
  cs2$dat %>%
  # Cast to numeric, otherwise scale() returns an nmatrix. :(
  mutate(write = as.numeric(scale(write, center = TRUE, scale = TRUE)))

# Fit the whole data set for an explanatory model.
cs2$fit <-
  cs2$model %>%
  fit(prog ~ ses + write, data = cs2$dat_normalized)

# Extract the fit object returned by the engine. Use for interpretation and 
# checking assumptions.
cs2$result <-
  cs2$fit %>%
  extract_fit_engine()

In parallel, let’s use recipes to fit a predictive model to the training data. The model object is already created, so we just need to pair a recipe object with a workflow.

Show the code

cs2$rec <-
  # The `data` argument can be the base data, or training, or even testing. 
  # recipe() only uses it to catalog variable names and data types.
  recipe(prog ~ ses + write, data = cs2$dat) %>%
  # You could have specified the formula as prog ~ ., then assigned roles. E.g.,
  # Keep "id" in data set, but don't use it in the model like this:
  # update_role(id, new_role = "ID") %>%
  # Unlike parsnip, recipe does not automatically create dummy vars.
  step_dummy(all_nominal_predictors()) %>%
  # Not relevant here, but good practice: if a factor level has few values, it may
  # not appear in the training set. If so, its dummy will contain a single value
  # (0). You can prevent that by dropping zero-value cols.
  step_zv(all_predictors()) %>%
  # Set the reference level of the outcome here if you want.
  step_relevel(prog, ref_level = "academic") %>%
  # Normalize write.
  step_normalize(write)

# The workflow pairs the model and recipe.
cs2$wflow <-
  workflow() %>%
  add_model(cs2$model) %>%
  add_recipe(cs2$rec)

# Fit the training data set for a predictive model.
cs2$fit_training <-
  cs2$wflow %>%
  fit(data = cs2$dat_training)

# You can't extract the engine fit and pipe into summary. Seems like a bug
# cs2$result_training <- cs2$fit_training %>% extract_fit_engine()

Let’s look at the explanatory model summary object. The model produces a set of coefficient estimates for each non-reference level of the dependent variable. The nnet engine presents the coefficient estimates, then their standard errors as a second section, but does not present the z-statistic or p-values!

Show the code

cs2$result %>% summary()

Call:
nnet::multinom(formula = prog ~ ses + write, data = data, trace = FALSE)

Coefficients:
         (Intercept)  sesmiddle    seshigh      write
general   -0.2049851 -0.5332857 -1.1628363 -0.5490814
vocation  -0.7771705  0.2913906 -0.9826773 -1.0767939

Std. Errors:
         (Intercept) sesmiddle   seshigh     write
general    0.3491296 0.4437324 0.5142201 0.2029457
vocation   0.4111072 0.4763734 0.5955665 0.2106132

Residual Deviance: 359.9635 
AIC: 375.9635

The residual deviance is $G^2 = 2 \sum_{i,j}y_{ij} \log \frac{y_{ij}}{\hat{\pi}_{ij}}$. Another model diagnostic is the log-likelihood, $-G^2 / 2$ (not shown) and AIC. More on these in the Model Fit section.

Show the code

deviance(cs2$result)
## [1] 359.9635
logLik(cs2$result, sum = TRUE)
## 'log Lik.' -179.9817 (df=8)

The Wald z-statistic is $z = \hat{\beta} / SE(\hat{\beta})$. Its square is the Wald chi-squared statistic. The p-value is the area to the right of $z^2$ in the $\chi_1^2$ density curve. Get these from tidy().

Show the code

cs2$result %>% tidy()

# A tibble: 8 × 6
  y.level  term        estimate std.error statistic     p.value
  <chr>    <chr>          <dbl>     <dbl>     <dbl>       <dbl>
1 general  (Intercept)   -0.205     0.349    -0.587 0.557      
2 general  sesmiddle     -0.533     0.444    -1.20  0.229      
3 general  seshigh       -1.16      0.514    -2.26  0.0237     
4 general  write         -0.549     0.203    -2.71  0.00682    
5 vocation (Intercept)   -0.777     0.411    -1.89  0.0587     
6 vocation sesmiddle      0.291     0.476     0.612 0.541      
7 vocation seshigh       -0.983     0.596    -1.65  0.0989     
8 vocation write         -1.08      0.211    -5.11  0.000000318

Interpretation

Start with interpreting the dependent variable. The model fits the log relative risk of belonging to program $j \in$ [vocation, general] vs. $j^*$ = academic. However, predict() returns either the risk (type = “probs”) or outcome (type = “class), not the log relative risk. Plug in values for the predictor variables to get predictions. The relative risk is $RR = \exp (\hat{y}_j) = \pi_j / \pi_{j^*}$. We see here that a student of low SES and mean writing score is less likely to be in a general or vocation program than an academic program.

Show the code

(risk <- predict(cs2$result, newdata = list(ses = "low", write = 0), type = "probs"))
##  academic   general  vocation 
## 0.4396833 0.3581922 0.2021246

# Relative risk.
(rr <- risk[-1] / risk[1])
##   general  vocation 
## 0.8146595 0.4597049

# Log-relative risk (the modeled outcome)
(log_rr <- log(rr))
##    general   vocation 
## -0.2049851 -0.7771705

Move on to the coefficients. Interpret $\hat{\beta}$ as the change in the log relative risk of $y_j$ relative to $y_{j^*}$ due to a $\delta$ = one unit change in $x$. A $\delta = x_a - x_b$ change in $x$ is associated with a $\delta \hat{\beta}$ change. $\delta\beta$ is the log relative risk ratio.

\[\log \left(\pi_j / \pi_{j^*} |_{x = x_a} \right) - \log \left(\pi_j / \pi_{j^*} |_{x = x_b} \right) = \log \left( \frac{\pi_j / \pi_{j^*} |_{x = x_a}}{\pi_j / \pi_{j^*} |_{x = x_b}} \right) = \delta \hat{\beta}\]

The exponential of $\hat{\beta}$ is the change in the relative risk of $y_j$ relative to $y_{j^*}$ due to a $\delta$ = one unit change in $x$. $\exp \delta \beta$ is the relative risk ratio.

\[\pi_j / \pi_{j^*} |_{x = x_a} = \exp{\delta \hat{\beta}}\]

The intercept term is the log-relative risk of $y_j$ relative to $y_{j^*}$ for the reference case. The reference case in the model is ses = “low” and write centered at 52.8. Notice how the intercept matches the predicted values above.

Show the code

(ref_log_rr <- coef(cs2$result)[,"(Intercept)"])
##    general   vocation 
## -0.2049851 -0.7771705

# From log-relative risk to relative risk.
(ref_rr <- exp(ref_log_rr))
##   general  vocation 
## 0.8146595 0.4597049

The log relative risks of a low SES student with a 52.8 writing score being in program general vs academic is $\hat{y} = \mathrm{Intercept}_1$ = -0.205, and $\hat{y} = \mathrm{Intercept}_2$ = -0.777 for vocation vs academic. The corresponding relative risks are $\exp(\hat{y}_j)$ = 0.815 and 0.460. The expected probabilities are 44.0%, 35.8%, and 20.2% for academic, general, and vocation respectively.

If SES was high instead of low, the expected probabilities of being in program general vs academic would be 70.1%, 17.8%, and 12.1% for academic, general, and vocation respectively.

What if the writing score increases by 1 SD (one unit)? The log RR of being in program general vs academic change by a factor of coef(cs2$result)["general", "write"] = -0.549, RR = 0.577. For program vocation vs academic, it would change by a factor of -1.077, RR = 0.341. To get a 2 SD increase, multiply the coefficient by 2, then exponentiate. The two RRs would then be 0.333 and 0.116.

Visualize the parameter estimates by plotting the predicted values.

Show the code

new_data <- expand.grid(
  ses = levels(cs2$dat$ses),
  write = seq(from = round(min(cs2$dat_normalized$write)), 
              to = round(max(cs2$dat_normalized$write)), 
              by = 1)
)

bind_cols(
  new_data,
  predict(cs2$fit, new_data = new_data, type = "prob")
) %>%
  pivot_longer(cols = -c(ses, write), names_to = "prog", values_to = "probability") %>%
  ggplot(aes(x = write, y = probability, color = ses)) +
  geom_line() + 
  facet_wrap(facets = vars(prog))

Assumptions

Four assumptions relate to the study design: (1) the dependent variable is multinomial; (2) the observations are independent; (3) the categories of all nominal variables are mutually exclusive and exhaustive; and (4) there are at least 10 observations per dependent variable level and independent variable. These assumptions are all valid. Three more assumptions related to the data distribution:

There is a linear relationship between the logit transformation and the continuous independent variables. Test with a plot and with Box-Tidwell.
There is no independent variable multicollinearity. Test with correlation coefficients and variance inflation factors (VIF).
There are no influential outliers. Test with Cook’s distance.

There are two ways to test for linearity (do both). First, plot the fitted values against the continuous predictors. This is the GLM analog to OLS bivariate analysis, except now the dependent variable is the logit transformation. These plotted relationships look good, except that in the prog = general level, writing score appears to interact with SES.

Show the code

cs2$fit %>%
  augment(new_data = cs2$dat_normalized) %>%
  pivot_longer(cols = c(.pred_academic:.pred_vocation), values_to = ".fitted") %>%
  filter(str_detect(name, as.character(prog))) %>%
  ggplot(aes(x = write, y = .fitted, color = ses)) +
  geom_point() +
  facet_wrap(facets = vars(prog), scales = "free_x") +
  geom_smooth(method = "lm", formula = "y~x") +
  labs(title = "Linearity Test: predicted vs continuous predictors", x = NULL)

The second test for linearity is the Box-Tidwell approach. Add transformations of the continuous independent variables to the model, $x_{Tx} = x \log x$, then test their significance level in the fit. Focus on the transformed variable. write_tx has a p-value <.05 for general. It is customary to apply a Bonferroni adjustment when testing for linearity. There are ten predictors in the model (including the intercept terms), so the Bonferroni adjusted p-values for write_tx are multiplied by 10. We should reject the null hypothesis of linearity because the adjusted p.value is still below .05.

Show the code

# Using non-centered vars to avoid log(0) errors.
cs2$boxtidwell <- cs2$dat %>%
  mutate(write_tx = log(write) * write) %>%
  fit(cs2$model, prog ~ ses + write + write_tx, data = .)

tidy(cs2$boxtidwell) %>% filter(term == "write_tx") %>% mutate(adj_p = p.value * 10)

# A tibble: 2 × 7
  y.level  term     estimate std.error statistic p.value  adj_p
  <chr>    <chr>       <dbl>     <dbl>     <dbl>   <dbl>  <dbl>
1 general  write_tx   0.0713    0.0231     3.09  0.00198 0.0198
2 vocation write_tx   0.0125    0.0239     0.522 0.601   6.01

If the relationship is nonlinear, you can try transforming the variable by raising it to $\lambda = 1 + b / \gamma$ where $b$ is the estimated coefficient of the model without the interaction terms, and $\gamma$ is the estimated coefficient of the interaction term of the model with interactions. For write, $b$ is for general and $\gamma$ is , so $\lambda$ = 1 + / = . It seems customary to apply general transformations like .5 (square root), 1/3 (cube root), ln, and the reciprocal, so maybe try raising write_c to . It seems in this case that the better solution is to add an interaction between write_c and ses to the model. I’m not going to pursue this further here.

Check for multicollinearity using variance inflation factors (VIF). VIFs estimate how much the variance of a regression coefficient is inflated due to multicollinearity. When independent variables are correlated, it is difficult to say which variable really influences the dependent variable. The VIF for variable $i$ is

\[ \mathrm{VIF}_i = \frac{1}{1 - R_i^2} \]

where $R_i^2$ is the coefficient of determination (i.e., the proportion of dependent variance explained by the model) of a regression of $X_i$ against all of the other predictors, $X_i = X_{j \ne i} \beta + \epsilon$. If $X_i$ is totally unrelated to its covariates, then $R_i^2$ will be zero and $\mathrm{VIF}_i$ will be 1. If $R_i^2$ is .8, $\mathrm{VIF}_i$ will be 5. The rule of thumb is that $R_i^2 \le 5$ is tolerable, and $R_i^2 > 5$ is “highly correlated” and you have to do something about it. car::vif() doesn’t work for multinomial logistic regression. The model type is not actually important here - we’re concerned about the covariate relationships. Below, I successively collapse the dependent variable into two-levels, then fit a binomial logistic regression and pipe that into car::vif().

Show the code

tmp_fit_general <-
  logistic_reg() %>%
  fit(
    prog ~ ses + write, 
    data = cs2$dat_normalized %>% mutate(prog = fct_collapse(prog, vocation = "academic"))
  ) %>%
  extract_fit_engine() 

tmp_fit_general %>% car::vif()
##           GVIF Df GVIF^(1/(2*Df))
## ses   1.037207  2        1.009175
## write 1.037207  1        1.018433

tmp_fit_vocation <- 
  logistic_reg() %>%
  fit(
    prog ~ ses + write, 
    data = cs2$dat_normalized %>% mutate(prog = fct_collapse(prog, general = "academic"))
  ) %>%
  extract_fit_engine()

tmp_fit_vocation %>% car::vif()
##           GVIF Df GVIF^(1/(2*Df))
## ses   1.020017  2        1.004967
## write 1.020017  1        1.009959

Check for influential outliers. Outliers are predicted values greater than two standard deviations from the actual value. Influential points have a Cook’s Distance greater than 4/N (4 / 200 = 0.02.⁹ Influential outliers are both. There is no simple way to do this for the multinomial regression because neither VGAM nor nnet support the augment() generic. Instead, I will use the two binomial logistic regressions from the VIF diagnostic.

Show the code

outlier_dat <-
  bind_rows(
    general = augment(tmp_fit_general, type.predict = "response"),
    vocation = augment(tmp_fit_vocation, type.predict = "response"),
    .id = "logit"
  ) %>%
  mutate(
    id = row_number(),
    outlier = if_else(abs(.std.resid) >= 2, "Outlier", "Other"),
    influential = if_else(.cooksd > 4 / nrow(cs1$dat), "Influential", "Other"),
    status = case_when(
      outlier == "Outlier" & influential == "Influential" ~ "Influential Outlier",
      outlier == "Outlier" ~ "Outlier",
      influential == "Influential" ~ "Influential",
      TRUE ~ "Other"
    ),
    status = factor(status, levels = c("Influential", "Outlier", "Influential Outlier", "Other"))
  )

outlier_dat %>%
  ggplot(aes(x = .fitted, y = .cooksd)) +
  geom_point(aes(color = status)) +
  geom_text(aes(label = if_else(influential == "Influential", id, NA_integer_)), 
            check_overlap = TRUE, size = 3, nudge_x = .025) +
  geom_hline(yintercept = 4 / nrow(cs1$dat), linetype = 2, color = "goldenrod") + 
  scale_color_manual(values = c("Influential Outlier" = "firebrick", 
                                "Influential" = "goldenrod",
                                "Outlier" = "salmon",
                                "Other" = "black")) +
  theme(legend.position = "right") +
  labs(title = "Index Plot of Cook's Distance.",
       subtitle = "Row id labeled for influential points.") +
  facet_wrap(facets = vars(logit), ncol = 1)

No fitted values were influential outliers in the first fit, and only two were influential outliers in the second fit.

Show the code

(cs2$outliers <-
  outlier_dat %>% filter(status == "Influential Outlier") %>% pull(.std.resid))

[1] 2.202098 2.593458

An index plot of Cook’s Distance shows the two points at the far left. You might examine the observations for validity. Otherwise, proceed and explain that there were two standardized residuals with value of 2.20 and 2.59 standard deviations which were kept in the analysis.

Evaluate the Fit

There are several ways to evaluate the model fit.

Deviance and chi-squared tests for lack of fit
The likelihood ratio test
Pseudo R-squared¹⁰.
Accuracy measures
Gain and ROC curves

The deviance test for lack of fit and the likelihood ratio test are residuals tests. The deviance residual is defined as $d_i = \mathrm{sign}(\epsilon_i) \left[ -2(y_i \log \hat{\pi}_i + (1 - y_i) \log (1 - \hat{\pi}_i)) \right]^{.5}$. The model deviance, $G^2$, is the sum of the squared deviance residuals. It also equals $G^2 = 2 \sum_{i,j}y_{ij} \log \frac{y_{ij}}{\hat{\pi}_{ij}}$. You can calculate them by hand.

Show the code

# Actual values (1s and 0s for three response levels)
y <- cs2$dat %>% 
  mutate(val = 1) %>% 
  pivot_wider(names_from = prog, values_from = val, values_fill = 0) %>%
  select(levels(cs2$dat$prog)) %>%
  as.matrix()

# Predicted values (probabilities for three response levels)
pi <- predict(cs2$result, type = "prob") * 1

# Raw residuals, by hand or by formula
# e <- y - pi
e <- residuals(cs2$result, type = "response")

# Deviance residuals
d <- sign(e) * sqrt(-2 * y * log(pi) + (1 - y) * log(1 - pi))

(g2 <- sum(d^2, na.rm = TRUE))
## [1] 359.9635
(g2 <- 2 * sum(y * log(y / pi), na.rm = TRUE))
## [1] 359.9635
(g2 <- deviance(cs2$result))
## [1] 359.9635

The related Pearson statistic, $X$, is the sum of the squared Pearson residuals, $pr_i = \epsilon_i / \sqrt{\hat{\pi}_i}$, the raw residual scaled by dividing by the estimated standard deviation of a binomial distribution with 1 trial. I don’t see this calculated in the residual() functions. You can do it yourself.

Show the code

(x2 <- sum((e / sqrt(pi))^2))
## [1] 406.0656

The deviance and Pearson statistic are distributed chi-squared with $(N - p)(r - 1)$ degrees of freedom where $p$ = 4 predictor variables (3 SES levels + intercept), and $r$ = 3 levels of the dependent variable for 392 degrees of freedom. The deviance and Pearson tests for lack of fit calculate the probability of the test statistic. The null hypothesis is that the model is correct. Neither test rejects the null hypothesis.

Show the code

# Deviance test for lack of fit
(N <- nrow(cs2$dat))
## [1] 200
(r <- length(levels(cs2$dat$prog)))
## [1] 3
(p <- length(coef(cs2$result)) / (r - 1)) # coefficients for each level, so divide by # levels
## [1] 4
(df <- (N - p) * (r - 1))
## [1] 392

pchisq(g2, df, lower.tail = FALSE)
## [1] 0.8755302

pchisq(x2, df, lower.tail = FALSE)
## [1] 0.3014625

You can do the same calculations for the intercept-only model.

Show the code

io <- multinom_reg() %>% fit(prog ~ 1, data = cs2$dat) %>% extract_fit_engine()

deviance(io)
## [1] 408.1933

# degrees of freedom
((nrow(cs2$dat) - length(coef(io)) / (r - 1)) * (r - 1))
## [1] 398

The log-likelihood measures the unexplained variability in the model. The likelihood ratio test compares the log likelihood of the fitted model to the intercept-only model. You can use lmtest::lrtest() to test. anova() does the same thing using the residual deviance, $G2 = -2 \times \mathrm{log likelihood}$, although it does not seem to work with the nnet engine.

Show the code

(cs2$lrtest <- lmtest::lrtest(io, cs2$result))

Likelihood ratio test

Model 1: prog ~ 1
Model 2: prog ~ ses + write
  #Df  LogLik Df Chisq Pr(>Chisq)    
1   2 -204.10                        
2   8 -179.98  6 48.23  1.063e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Show the code

# (cs2$lrtest <- anova(io, cs2$result, type = "I", test = "LR"))

The difference in deviances is $LR$ = 48.23 with 6 degrees of freedom. This is distributed chi-squared, with p-value = 1.063e-08. The deviance test for lack of fit concludes that the model fits significantly better than an empty (intercept-only) model, $\chi^2$(6) = 48.23, p < .001.

You can use lmtest::lrtest() to perform likelihood ratio tests on the significance of the predictors too. The likelihood ratio test compares the log likelihood with and without the predictor. Unfortunately, this does not seem to work within the parsnip framework.

Show the code

# (cs2$lrtest_ses <- lmtest::lrtest(cs2$result, "ses"))
no_ses <- multinom_reg() %>% fit(prog ~ write, data = cs2$dat) %>% extract_fit_engine()
(cs2$lrtest_ses <- lmtest::lrtest(cs2$result, no_ses))
## Likelihood ratio test
## 
## Model 1: prog ~ ses + write
## Model 2: prog ~ write
##   #Df  LogLik Df  Chisq Pr(>Chisq)  
## 1   8 -179.98                       
## 2   4 -185.51 -4 11.058    0.02592 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# (cs2$lrtest_write <- lmtest::lrtest(cs2$fit_nnet, "write"))
no_write <- multinom_reg() %>% fit(prog ~ ses, data = cs2$dat) %>% extract_fit_engine()
(cs2$lrtest_write <- lmtest::lrtest(cs2$result, no_write))
## Likelihood ratio test
## 
## Model 1: prog ~ ses + write
## Model 2: prog ~ ses
##   #Df  LogLik Df  Chisq Pr(>Chisq)    
## 1   8 -179.98                         
## 2   6 -195.71 -2 31.447  1.484e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Both SES, $X^2$ = 11.058, p = 0.026, and writing score $X^2$ = 31.447, p = 1.484e-07, had significant effects on the program.

Logistic regression does not have a direct R-squared statistic like OLS does (the proportion of variance explained by the model). However, there are some analogs, called pseudo R-squared. You’ll encounter three pseudo R-squared measures: Cox and Snell, Nagelkerke, and McFadden. This one does not work for the nnet engine.

Show the code

# DescTools::PseudoR2(cs2$result, which = c("CoxSnell", "Nagelkerke", "McFadden"))

Accuracy measures formed by the cross-tabulation of observed and predicted classes is the better model fit diagnostic the the pseudo r-squares.

Show the code

cs2$conf_mat <-
  cs2$fit %>% 
  augment(new_data = cs2$dat_normalized) %>%
  # conf_mat requires truth to be first level of the factor variable.
  # mutate(across(c(prog, .pred_class), ~fct_relevel(., "academic"))) %>%
  conf_mat(truth = prog, estimate = .pred_class)

cs2$conf_mat

          Truth
Prediction academic general vocation
  academic       92      27       23
  general         4       7        4
  vocation        9      11       23

Show the code

cs2$conf_mat %>% summary()

# A tibble: 13 × 3
   .metric              .estimator .estimate
   <chr>                <chr>          <dbl>
 1 accuracy             multiclass     0.61 
 2 kap                  multiclass     0.299
 3 sens                 macro          0.497
 4 spec                 macro          0.763
 5 ppv                  macro          0.550
 6 npv                  macro          0.799
 7 mcc                  multiclass     0.320
 8 j_index              macro          0.260
 9 bal_accuracy         macro          0.630
10 detection_prevalence macro          0.333
11 precision            macro          0.550
12 recall               macro          0.497
13 f_meas               macro          0.491

Show the code

cs1$conf_mat %>% autoplot()

The model accuracy, 61.0%, is the percent of observations correctly classified. The sensitivities are the accuracy with regard to predicting positive cases in each level of the dependent variable. The specificities are the accuracy with regard to predicting negative cases. The prevalences are the proportion of cases that were positive.

Show the code

cs2$dat_normalized %>%
  bind_cols(predict(cs2$fit, new_data = cs2$dat_normalized, type = "prob")) %>%
  # event_level = "second" sets the second level as success
  yardstick::gain_curve(.pred_academic, .pred_general, .pred_vocation, 
                        truth = prog, event_level = "second") %>%
  autoplot() +
  labs(title = "Gain Curve")

105 of the 200 participants were in the academic program.

The gain curve encountered 52 academic programs (50%) within the first 72 observations (36%). It encountered all 105 cases on the 189th observation.
The bottom of the grey area is the outcome of a random model. Only half the academic program cases would be observed within 50% of the observations.
The top of the grey area is the outcome of the perfect model, the “wizard curve”. Half the academic program cases would be observed in 52.5/200=26.25% of the observations.

Show the code

cs2$dat_normalized %>%
  bind_cols(predict(cs2$fit, new_data = cs2$dat_normalized, type = "prob")) %>%
  # event_level = "second" sets the second level as success
  yardstick::roc_curve(.pred_academic, .pred_general, .pred_vocation, 
                        truth = prog, event_level = "second") %>%
  autoplot() +
  labs(title = "Gain Curve")

Reporting

A multinomial logistic regression was performed to ascertain the effects of socioeconomic status (ses) and writing score on the likelihood that participants are enrolled in an academic, general, or vocation program. Linearity of the continuous variables with respect to the logit of the dependent variable was assessed via the Box-Tidwell (1962) procedure. A Bonferroni correction was applied using all eight terms in the model resulting in statistical significance being accepted when p < 0.00625 (Tabachnick & Fidell, 2014). Based on this assessment, the continuous write independent variable was found to be linearly related to the logit of the dependent variable levels. There were two standardized residuals with value of 2.20 and 2.59 standard deviations, which were kept in the analysis. The multinomial logistic regression model was statistically significant, $\chi^2$(6) = 48.23, p < .001. The model correctly classified 61% of cases. Sensitivity was 50%, specificity was 76%, positive predictive value was 55% and negative predictive value was 80%. The write predictor variable was statistically significant for both outcome levels and high SES was statistically significant for the general program (as shown in Table 1). A 1SD increase in the writing score was associated with a (1 - 0.58) decrease in the odds of choosing a general program instead of academic and a (1 - 0.34) decrease in the odds of choosing a vocation program over academic. A high SEC was associated with a (1 - 0.31) decrease in the odds of chooseing a general program over academic.

Show the code

gtsummary::tbl_regression(
  cs2$result,
  exponentiate = TRUE
) %>%
  gtsummary::as_flex_table() %>%
  flextable::theme_apa()

Outcome	Characteristic	OR1	95% CI1	p-value
general	ses
	low	—	—
	middle	0.59	0.25, 1.40	0.2
	high	0.31	0.11, 0.86	0.024
	write	0.58	0.39, 0.86	0.007
vocation	ses
	low	—	—
	middle	1.34	0.53, 3.40	0.5
	high	0.37	0.12, 1.20	0.10
	write	0.34	0.23, 0.51	<0.001
1OR = Odds Ratio, CI = Confidence Interval

2.3 Ordinal Logistic Regression

Ordinal logistic regression, also call cumulative link model (CLM), is a generalized linear model (GZLM), an extension of the general linear model (GLM) to non-continuous outcome variables. There are many approaches to ordinal logistic regression, including cumulative, adjacent, and continuation categories, but the most popular is the cumulative odds ordinal logistic regression with proportional odds.¹¹. The model for ordinal response random variable $Y_i$ with $J$ levels is

\[\gamma_{ij} = F(\eta_{ij}), \hspace{5 mm} \eta_{ij} = \theta_j - x_i^\mathrm{T}\beta, \hspace{5 mm} i = 1, \ldots, n, \hspace{5 mm} j = 1, \ldots, J-1\]

where $\gamma_{ij} = P(Y_i \le j) = \pi_{i1} + \cdots + \pi_{ij}$. $\eta_{ij}$ is a linear predictor with $J-1$ intercepts. $F$ is the inverse link function. The regression models the logit link function of $\gamma_{ij}$.

\[\mathrm{logit}(\gamma_{ij}) = \log \left[\frac{P(Y_i \le j)}{P(Y_i \gt j)} \right] = \theta_j - x_i^\mathrm{T}\beta\] The cumulative logit is the log-odds of the cumulative probabilities that the response is in category $\le j$ versus $\gt j$. $\theta_j$ is the log-odds when $x_i^\mathrm{T}=0$ and $\beta$ is the increase in the log odds attributed to a one unit increase in $x_i^\mathrm{T}$. Notice $\beta$ is the same for all $j$. The exponential of the predicted value is the odds. Solve this for the probability,

\[P(Y_i \gt j) = \frac{\mathrm{exp}(\hat{y}_i)}{1 + \mathrm{exp}(\hat{y}_i)}.\]

The exponential of $\beta$ is the odds ratio of $x_1^\mathrm{T} - x_0^\mathrm{T}$. Solve this for the odds ratio

\[\mathrm{OR} = \frac{\mathrm{exp}(\theta_j - x_1^\mathrm{T}\beta)}{\mathrm{exp}(\theta_j - x_2^\mathrm{T}\beta)} = \mathrm{exp}(\beta(x_1^\mathrm{T} - x_0^\mathrm{T}))\]

If $x$ is a binary factor factor, then $\exp(\beta)$ is the odds ratio of $x=1$ vs $x=0$. Thus the odds-ratio is proportional to the difference between values of $x$ and $\beta$ is the constant of proportionality.

The model is estimated with a regularized Newton-Raphson algorithm with step-halving (line search) using analytic expressions for the gradient and Hessian of the negative log-likelihood function. This is beyond me, but the upshot is that the estimation is an iterative maximization exercise, not a formulaic matrix algebra process. It is possible for the model estimation to fail to converge on a maximum.

You will sometimes encounter discussion about the latent variable. That is just the underlying quality you are trying to measure. If you rate something a 4 on a 5-level Likert scale, 4 is the expression of your valuation, the latent variable. Your precise valuation is somewhere between 3 and 5 on a continuous scale. The link function defines the distribution of the latent variable.

There are variations on the ordinal model. Structured thresholds impose restrictions on $\theta_j$, for example requiring equal distances between levels. Partial proportional odds allow $\theta_j$ to vary with nominal predictors. You can also use link functions other than logit.

There are two assumptions underling ordinal logistic regression: (a) no multicollinearity, and (b) proportional odds.

Case Study

This case study uses the Laerd Statistics article on ordinal logistic regression data set. A study investigates the relationship between attitude toward tax levels and participant values and background. 192 participants in a study responded to the statement “Taxes are too high” on a 4-level Likert scale (tax_too_high, Strongly Disagree, Disagree, Agree, Strongly Agree). Participant attributes include business owner (Y|N), age, and political affiliation (Liberal, Conservative, Labor).

Show the code

cs3 <- list()

cs3$dat <- foreign::read.spss("./input/ordinal-logistic-regression.sav", to.data.frame = TRUE) %>%
  mutate(tax_too_high = factor(tax_too_high, ordered = TRUE),
         biz_owner = fct_relevel(biz_owner, "No", "Yes"),
         politics = fct_relevel(politics, "Lab")) %>%
  select(-c(biz_friends, uni_educated, income))

cs3$dat %>% 
  gtsummary::tbl_summary(
  by = politics, 
  statistic = list(gtsummary::all_continuous() ~ "{mean} ({sd})")
)

Characteristic	Lab N = 62¹	Lib N = 54¹	Con N = 76¹
biz_owner	36 (58%)	26 (48%)	47 (62%)
age	37.4 (5.6)	35.4 (5.8)	37.2 (4.8)
tax_too_high
Strongly Disagree	10 (16%)	10 (19%)	4 (5.3%)
Disagree	13 (21%)	16 (30%)	9 (12%)
Agree	28 (45%)	21 (39%)	42 (55%)
Strongly Agree	11 (18%)	7 (13%)	21 (28%)
¹ n (%); Mean (SD)

Show the code

cs3$dat %>%
  mutate(age = case_when(age < quantile(age, .33) ~ "young",
                         age < quantile(age, .67) ~ "middle",
                         TRUE ~ "old"),
         age = factor(age, levels = c("young", "middle", "old"))) %>%
  count(tax_too_high, biz_owner, age, politics) %>%
  ggplot(aes(x = tax_too_high, y = n, fill = biz_owner)) +
  geom_col(position = position_dodge2(preserve = "single")) +
  facet_grid(rows = vars(age), cols = vars(politics), space = "free") +
  scale_x_discrete(labels = function (x) str_wrap(x, width = 10)) +
  theme_bw() +
  theme(legend.position = "bottom") +
  labs(title = "Taxes too high?",
       subtitle = "Reponse count by business owner, age, and politics.")

Fit the Model

Fit a cumulative link model for the cumulative probability of the $i$th response falling in $j$th category or below where $i$ indexes the ($n = 192$) responses, $j = 1, \ldots, J$ indexes the ($J = 4$) response categories, and $\theta_j$ is the threshold for the $j$th cumulative logit.

\[\mathrm{logit}(P(Y_i \le j)) = \theta_j - \beta_1(\mathrm{politics}_i) - \beta_2(\mathrm{biz\_owner}_i) - \beta_3(\mathrm{age}_i)\]

Show the code

cs3$fmla <- formula(tax_too_high ~ biz_owner + age + politics)

cs3$result <- ordinal::clm(cs3$fmla, data = cs3$dat)

cs3$result %>% summary()

formula: tax_too_high ~ biz_owner + age + politics
data:    cs3$dat

 link  threshold nobs logLik  AIC    niter max.grad cond.H 
 logit flexible  192  -197.62 409.23 6(0)  3.14e-12 3.2e+05

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
biz_ownerYes  0.66462    0.28894   2.300 0.021435 *  
age           0.24189    0.03260   7.421 1.17e-13 ***
politicsLib   0.03695    0.36366   0.102 0.919072    
politicsCon   1.16142    0.34554   3.361 0.000776 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Threshold coefficients:
                           Estimate Std. Error z value
Strongly Disagree|Disagree    7.026      1.166   6.024
Disagree|Agree                8.766      1.231   7.119
Agree|Strongly Agree         11.653      1.357   8.590

The summary object shows several fit statistics. More about these in the fit evaluation section below. The Coefficients table is the familiar parameter estimates. The coefficient estimate for biz_ownerYes is 0.665 with standard error 0.289, $z = \hat{\beta} / se =$ 2.300, and $p = 2 \cdot P(Z>z) =$ 0.021. Some programs (e.g., SPSS) also show the Wald chi-squared statistic, $z^2 =$ 5.291. The square of a normal variable has a $\chi^2$ distribution, so the p value for the Wald chi-squared statistic is pchisq(z^2, df = 1) = 0.021.

The Threshold coefficients table are the intercepts, or cut-points. The first cut-point is the log-odds of response level Strongly Disagree (or less) vs greater than Strongly Disagree when all factor variables are at their reference level and the continuous vars are at 0.

There may be interaction effects between biz_owner and politics. You can check this by comparing the log likelihood to the saturated model with a likelihood ratio test.

Show the code

saturated <- ordinal::clm(tax_too_high ~ biz_owner*politics + age, data = cs3$dat)

(cs3$sat_anova <- anova(cs3$result, saturated))

Likelihood ratio tests of cumulative link models:
 
           formula:                                  link: threshold:
cs3$result cs3$fmla                                  logit flexible  
saturated  tax_too_high ~ biz_owner * politics + age logit flexible  

           no.par    AIC  logLik LR.stat df Pr(>Chisq)
cs3$result      7 409.23 -197.62                      
saturated       9 411.75 -196.87  1.4805  2      0.477

The likelihood ratio test indicates the main-effects model fits about the same as the saturated model, $\chi^2$(2) = 1.48, p = 0.477)

Assumptions

Cumulative odds ordinal logistic regression with proportional odds models require a) no multicollinearity, and b) proportional odds.

Multicollinearity occurs when two or more independent variables are highly correlated so that they do not provide unique or independent information in the regression model. Multicollinearity inflates the variances of the estimated coefficients, resulting in larger confidence intervals. The usual interpretation of a slope coefficient as the change in the mean response per unit increase in the predictor when all the other predictors are held constant breaks down because changing one predictor necessarily changes other predictors.

Test for multicollinearity with variance inflation factors (VIF). The VIF is the inflation percentage of the parameter variance due to multicollinearity. E.g., a VIF of 1.9 means the parameter variance is 90% larger than what it would be if it was not correlated with other predictors.

Predictor k’s variance, $Var(\hat{\beta_k})$, is inflated by a factor of

\[\mathrm{VIF}_k = \frac{1}{1 - R_k^2}\]

due to collinearity with other predictors, where $R_k^2$ is the $R^2$ of a regression of the $k^{th}$ predictor on the remaining predictors. If predictor $k$ is unrelated to the other variables, $R_k^2 = 0$ and $VIF = 1$ (no variance inflation). If 100% of the variance in predictor $k$ is explained by the other predictors, then $R_k^2 = 1$ and $VIF = \infty$. The rule of thumb is that $VIF \le 5$ is acceptable.

Show the code

# Cannot use CLM model with vif(). Re-express as a linear model.
lm(as.numeric(tax_too_high) ~ politics + biz_owner + age, dat = cs3$dat) %>%
  car::vif()

              GVIF Df GVIF^(1/(2*Df))
politics  1.035831  2        1.008840
biz_owner 1.023642  1        1.011752
age       1.036491  1        1.018082

The VIFs in column GVIF are all below 5, so this model is not compromised by multicollinearity.

The proportional odds assumption means the independent variable effects are constant across each cumulative split of the ordinal dependent variable. Test for proportional odds using a full likelihood ratio test comparing the proportional odds model with a multinomial logit model, also called an unconstrained baseline logit model. This is also called the test of parallel lines. The multinomial logit model fits a slope to each of the $J – 1$ levels. The proportional odds model is nested within the multinomial model, so you can use a likelihood ratio test to see if the models are statistically different. Fit the proportional odds model and a multinomial model using VGAM::vglm() and capture the log likelihoods and degrees of freedom. Perform a likelihood ratio test on the differences in log likelihoods, $D = -2 \mathrm{loglik}(\beta)$.

Show the code

cs3$vglm_ordinal     <- VGAM::vglm(cs3$fmla, VGAM::propodds,   data = cs3$dat)

cs3$vglm_multinomial <- VGAM::vglm(cs3$fmla, VGAM::cumulative, data = cs3$dat)

(cs3$po_lrt <- VGAM::lrtest(cs3$vglm_multinomial, cs3$vglm_ordinal))

Likelihood ratio test

Model 1: tax_too_high ~ biz_owner + age + politics
Model 2: tax_too_high ~ biz_owner + age + politics
  #Df  LogLik Df  Chisq Pr(>Chisq)
1 561 -193.31                     
2 569 -197.62  8 8.6197     0.3754

The assumption of proportional odds was met, as assessed by a full likelihood ratio test comparing the fit of the proportional odds model to a model with varying location parameters, $\chi^2$(8) = 8.620, p = 0.375.

Another option is the partial proportional odds test. This test locates specific variables causing the rejection of proportional odds.

Show the code

(cs3$po_lrt2 <- ordinal::clm(cs3$fmla, data = cs3$dat) %>% ordinal::nominal_test())

Tests of nominal effects

formula: tax_too_high ~ biz_owner + age + politics
          Df  logLik    AIC     LRT Pr(>Chi)
<none>       -197.62 409.23                 
biz_owner  2 -197.34 412.67 0.55974   0.7559
age                                         
politics   4 -196.20 414.40 2.83415   0.5860

The assumption of proportional odds was met, as assessed by a full likelihood ratio test comparing the fit of the proportional odds model to a model with varying location parameters for business owner, $\chi^2$(2) = 0.560, p = 0.756 and politics, $\chi^2$(4) = 2.834, p = 0.586.

Evaluate the Fit

There are three ways to assess overall model fit: The Deviance and Pearson goodness-of-fit tests of the overall model fit; the Cox and Snell, Nagelkerke, and McFadden pseudo R measures of explained variance; and the likelihood ratio test comparing the model fit to the intercept-only model. However, these tests rely on large frequencies in each cell, that is, each possible combination of predictor values. Overall goodness-of-fit statistics should be treated with suspicion when a continuous independent variable is present and/or there are a large number of cells with zero frequency.

The Pearson goodness-of-fit statistic is $X^2 = \sum \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$ where $i$ is the observation number and $j$ is the response variable level. It is a summary of the Pearson residuals, the difference between the observed and expected cell counts, $O_{ij} - E_{ij}$. The deviance goodness-of-fit statistic is the difference in fit between the model and a full model; a full model being a model that fits the data perfectly, $G^2 = 2 \sum_{ij} O_{ij} \log \left( \frac{O_{ij}}{E_{ij}} \right)$. Neither of these tests are reliable if there are many cells with zero frequencies and/or small expected frequencies and are generally not recommended. Generally, the chi-squared test requires a frequency count of at least 5 per cell.

Show the code

# Observed combinations of model vars
cs3$cell_patterns <- 
  cs3$dat %>% count(biz_owner, age, politics, tax_too_high) %>% nrow()

# Observed combinations of predictor vars * levels of response var
cs3$covariate_patterns <- 
  cs3$dat %>% count(biz_owner, age, politics) %>% nrow()
cs3$possible_cells <- 
  cs3$covariate_patterns * length(levels(cs3$dat$tax_too_high))

# 1 - ratio of observed to possible
cs3$pct_freq_zero <- 1 - cs3$cell_patterns / cs3$possible_cells

There are 137 observed combinations of model variables (predictors), and 372 possible combinations (predictors * outcome levels), so 63.2% of cells have zero frequencies. Ideally, zero frequencies should be less than 20%, so if you were to use the deviance or Pearson tests, you would need to report this. The results below are contradictory and bogus. I think you’d only use this test if you didn’t have continuous predictor variables.

Show the code

observed <- cs3$dat %>% 
  count(biz_owner, age, politics, tax_too_high) %>%
  pivot_wider(names_from = tax_too_high, values_from = n, values_fill = 0) %>%
  pivot_longer(
    cols = `Strongly Disagree`:`Strongly Agree`, 
    names_to = "outcome", 
    values_to = "observed"
  )

expected <- bind_cols(
  cs3$dat, 
  cs3$result %>% predict(newdata = cs3$dat %>% select(-"tax_too_high")) %>% data.frame()
) %>%
  rename_with(~str_remove(., "fit\\."), starts_with("fit")) %>%
  rename_with(~str_replace(., "\\.", " ")) %>%
  summarize(.by = c(biz_owner, age, politics), 
            across(`Strongly Disagree`:`Strongly Agree`, sum)) %>%
  pivot_longer(cols = `Strongly Disagree`:`Strongly Agree`, names_to = "outcome", values_to = "expected")

obs_exp <- observed %>%
  inner_join(expected, by = c("politics", "biz_owner", "age", "outcome")) %>%
  mutate(epsilon_sq = (observed - expected)^2,
         chi_sq = epsilon_sq / expected,
         g_sq = 2 * observed * log((observed+.0001) / expected)
  )

cs3$chisq <- list()
cs3$chisq$X2 = sum(obs_exp$chi_sq)
cs3$chisq$G2 = sum(obs_exp$g_sq)
cs3$chisq$df = cs3$covariate_patterns * (length(levels(cs3$dat$tax_too_high)) - 1) - 7
cs3$chisq$X2_p.value = pchisq(cs3$chisq$X2, df = cs3$chisq$df, lower.tail = FALSE)
cs3$chisq$G2_p.value = pchisq(cs3$chisq$G2, df = cs3$chisq$df, lower.tail = FALSE)

The Pearson goodness-of-fit test indicated that the model was not a good fit to the observed data, $\chi^2$(272) = 745.4, p < .001$. The deviance goodness-of-fit test indicated that the model was a good fit to the observed data, $G^2$(272) = 232.6, p = 0.960.

There are a number of measures in ordinal regression that attempt to provide a similar “variance explained” measure as that provided in ordinary least-squares linear regression. However, these measures do not have the direct interpretation that they do in ordinary linear regression and are often, therefore, referred to as “pseudo” R2 measures. The three most common measures (Cox and Snell, Nagelkerke, and McFadden) are not particularly good and not universally used. It is presented in the SPSS output, so you might encounter it in published work.

Show the code

cs3$nagelkerke <- rcompanion::nagelkerke(cs3$result)
cs3$nagelkerke$Pseudo.R.squared.for.model.vs.null

                             Pseudo.R.squared
McFadden                             0.181957
Cox and Snell (ML)                   0.367369
Nagelkerke (Cragg and Uhler)         0.399641

The best way to assess model fit is the likelihood ratio test comparing the model to an intercept-only model. The difference in the -2 log likelihood between the models has a $\chi^2$ distribution with degrees of freedom equal to the difference in the number of parameters.

Show the code

intercept_only <- ordinal::clm(tax_too_high ~ 1, data = cs3$dat)
cs3$lrt <- anova(cs3$result, intercept_only)
cs3$lrt

Likelihood ratio tests of cumulative link models:
 
               formula:         link: threshold:
intercept_only tax_too_high ~ 1 logit flexible  
cs3$result     cs3$fmla         logit flexible  

               no.par    AIC  logLik LR.stat df Pr(>Chisq)    
intercept_only      3 489.14 -241.57                          
cs3$result          7 409.23 -197.62  87.911  4  < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The table shows the log likelihoods of the two models. LR.stat is the difference between 2 * the logLik values.

The final model statistically significantly predicted the dependent variable over and above the intercept-only model, $\chi^2(4)$ = 87.9, p = 0.000.

Interpret Results

Return to the model summary.

Show the code

tidy(cs3$clm)

# A tibble: 0 × 0

The coefficients for biz_owner, age, and politics are positive. Positive parameters increase the likelihood of stronger agreement with the statement. In this case, discontent with taxes are higher for business owners, increase with age, and are higher for Liberal Democrats and Conservatives relative to the Labor Party. The expected cumulative log-odds of declaring $\le j$ level of agreement with the statement for the baseline group (biz_ownerNo, age = 0, politicsLib) is for $j = 1$ (Strongly Disagree), for $j = 2$ (Disagree), and for $j = 3$ (Agree).

You could solve the logit equation for

\[\pi_j = \frac{\mathrm{exp}(Y_i)} {1 + \mathrm{exp}(Y_i)}\] to get the cumulative probabilities for each level. That’s what predict(type = "cum.prob") does. But it might be more intuitive to work with individual probabilities, the lagged differences to get the individual probabilities for each $j$. That’s what predict(type = "prob") does. I like to play with predicted values to get a sense of the outcome distributions. In this case, I’ll take the median age, and each combination of biz_owner and politics.

Show the code

new_data <- cs3$dat %>% 
  mutate(age = median(cs3$dat$age)) %>% 
  expand(age, politics, biz_owner)

preds <- predict(cs3$result, newdata = new_data, type = "prob")[["fit"]] %>% as.data.frame()

bind_cols(new_data, preds) %>%
  pivot_longer(cols = `Strongly Disagree`:`Strongly Agree`) %>% 
  mutate(name = factor(name, levels = levels(cs3$dat$tax_too_high))) %>%
  ggplot(aes(y = politics, x = value, fill = fct_rev(name))) +
  geom_col() +
  geom_text(aes(label = scales::percent(value, accuracy = 1)), 
            size = 3, position = position_stack(vjust=0.5)) +
  facet_grid(~paste("Bus Owner = ", biz_owner)) +
  scale_fill_grey(start = 0.5, end = 0.8) +
  theme_bw() + 
  theme(legend.position = "top",
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank()) +
  guides(fill = guide_legend(reverse = TRUE)) +
  labs(title = "Taxes too High for Conservative Business Owners?", 
       x = NULL, fill = NULL)

You will want to establish whether politics is statistically significant overall before exploring any specific contrasts. The ANOVA procedure with type I test reports an overall test of significance for each variable entered into the model.

Show the code

(cs3$anovaI <- anova(cs3$result, type = "I"))

Type I Analysis of Deviance Table with Wald chi-square tests

          Df  Chisq Pr(>Chisq)    
biz_owner  1 13.201  0.0002798 ***
age        1 57.413  3.533e-14 ***
politics   2 14.636  0.0006635 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The political party last voted for has a statistically significant effect on the prediction of whether tax is thought to be too high, Wald $\chi^2$(2) = 14.6, p = 0.001.

The best way to work with the data is with the tidy(exponentiate = TRUE) version.

Show the code

cs3$tidy <- cs3$result %>% tidy(conf.int = TRUE, exponentiate = TRUE)
cs3$tidy

# A tibble: 7 × 8
  term        estimate std.error statistic  p.value conf.low conf.high coef.type
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl> <chr>    
1 Strongly D…   1.13e3    1.17       6.02  1.70e- 9   NA         NA    intercept
2 Disagree|A…   6.41e3    1.23       7.12  1.08e-12   NA         NA    intercept
3 Agree|Stro…   1.15e5    1.36       8.59  8.72e-18   NA         NA    intercept
4 biz_ownerY…   1.94e0    0.289      2.30  2.14e- 2    1.11       3.44 location 
5 age           1.27e0    0.0326     7.42  1.17e-13    1.20       1.36 location 
6 politicsLib   1.04e0    0.364      0.102 9.19e- 1    0.508      2.12 location 
7 politicsCon   3.19e0    0.346      3.36  7.76e- 4    1.63       6.35 location

Then you can summarize the table in words.

The odds of business owners considering tax to be too high was 1.944 (95% CI, 1.107 to 3.443) times that of non-business owners, a statistically significant effect, z = 2.300, p = 0.021.

The odds of Conservative voters considering tax to be too high was 3.194 (95% CI, 1.635 to 6.351) times that of Labour voters, a statistically significant effect, z = 3.361, p = 0.001. The odds of Liberal Democrat voters considering tax to be too high was similar to that of Labour voters (odds ratio of 1.038 (95% CI, 0.508 to 2.121), p = 0.919.

An increase in age (expressed in years) was associated with an increase in the odds of considering tax too high, with an odds ratio of 1.274 (95% CI, 1.197 to 1.360), z = 7.421, p = 0.000.

Reporting

Here is the complete write-up.

A cumulative odds ordinal logistic regression with proportional odds was run to determine the effect of business ownership, political party voted for, and age, on the belief that taxes are too high. There were proportional odds, as assessed by a full likelihood ratio test comparing the fitted model to a model with varying location parameters, $\chi^2$(8) = 8.620, p = 0.375. The final model statistically significantly predicted the dependent variable over and above the intercept-only model, $\chi^2(4)$ = 87.9, p = 0.000. The odds of business owners considering tax to be too high was 1.944 (95% CI, 1.107 to 3.443) times that of non-business owners, a statistically significant effect, z = 2.300, p = 0.021. The political party last voted for has a statistically significant effect on the prediction of whether tax is thought to be too high, Wald $\chi^2$(2) = 14.6, p = 0.001. The odds of Conservative voters considering tax to be too high was 3.194 (95% CI, 1.635 to 6.351) times that of Labour voters, a statistically significant effect, z = 3.361, p = 0.001. The odds of Liberal Democrat voters considering tax to be too high was similar to that of Labour voters (odds ratio of 1.038 (95% CI, 0.508 to 2.121), p = 0.919. An increase in age (expressed in years) was associated with an increase in the odds of considering tax too high, with an odds ratio of 1.274 (95% CI, 1.197 to 1.360), z = 7.421, p = 0.000.

Package gtsummary shows a nice summary table.

Show the code

gtsummary::tbl_regression(cs3$result)

Characteristic	log(OR)¹	95% CI¹	p-value
biz_owner
No	—	—
Yes	0.66	0.10, 1.2	0.021
age	0.24	0.18, 0.31	<0.001
politics
Lab	—	—
Lib	0.04	-0.68, 0.75	>0.9
Con	1.2	0.49, 1.8	<0.001
¹ OR = Odds Ratio, CI = Confidence Interval

2.4 Poisson Regression

Poisson models count data, like “traffic tickets per day”, or “website hits per day”. The response is an expected rate or intensity. For count data, specify the generalized model, this time with family = poisson or family = quasipoisson.

Recall that the probability of achieving a count $y$ when the expected rate is $\lambda$ is distributed

\[P(Y = y|\lambda) = \frac{e^{-\lambda} \lambda^y}{y!}.\]

The poisson regression model is

\[\lambda = \exp(X \beta).\]

You can solve this for $y$ to get

\[y = X\beta = \ln(\lambda).\]

That is, the model predicts the log of the response rate. For a sample of size n, the likelihood function is

\[L(\beta; y, X) = \prod_{i=1}^n \frac{e^{-\exp({X_i\beta})}\exp({X_i\beta})^{y_i}}{y_i!}.\]

The log-likelihood is

\[l(\beta) = \sum_{i=1}^n (y_i X_i \beta - \sum_{i=1}^n\exp(X_i\beta) - \sum_{i=1}^n\log(y_i!).\]

Maximizing the log-likelihood has no closed-form solution, so the coefficient estimates are found through interatively reweighted least squares.

Poisson processes assume the variance of the response variable equals its mean. “Equals” means the mean and variance are of a similar order of magnitude. If that assumption does not hold, use the quasi-poisson. Use Poisson regression for large datasets. If the predicted counts are much greater than zero (>30), the linear regression will work fine. Whereas RMSE is not useful for logistic models, it is a good metric in Poisson.

Case Study 4

Dataset fire contains response variable injuries counting the number of injuries during the month and one explanatory variable, the month mo.

Show the code

fire <- read_csv(file = "C:/Users/mpfol/OneDrive/Documents/Data Science/Data/CivilInjury_0.csv")
fire <- fire %>% 
  mutate(mo = as.POSIXlt(`Injury Date`)$mon + 1) %>%
  rename(dt = `Injury Date`,
         injuries = `Total Injuries`)
str(fire)

In a situation like this where there the relationship is bivariate, start with a visualization.

Show the code

ggplot(fire, aes(x = mo, y = injuries)) +
  geom_jitter() +
  geom_smooth(method = "glm", method.args = list(family = "poisson")) +
  labs(title = "Injuries by Month")

Fit a poisson regression in R using glm(formula, data, family = poisson). But first, check whether the mean and variance of injuries are the same magnitude? If not, then use family = quasipoisson.

Show the code

mean(fire$injuries)
var(fire$injuries)

They are of the same magnitude, so fit the regression with family = poisson.

Show the code

m2 <- glm(injuries ~ mo, family = poisson, data = fire)
summary(m2)

The predicted value $\hat{y}$ is the estimated log of the response variable,

\[\hat{y} = X \hat{\beta} = \ln (\lambda).\]

Suppose mo is January (mo = ), then the log ofinjuries` is $\hat{y} = 0.323787$. Or, more intuitively, the expected count of injuries is $\exp(0.323787) = 1.38$

Show the code

predict(m2, newdata = data.frame(mo=1))
predict(m2, newdata = data.frame(mo=1), type = "response")

Here is a plot of the predicted counts in red.

Show the code

augment(m2, type.predict = "response") %>%
  ggplot(aes(x = mo, y = injuries)) +
  geom_point() +
  geom_point(aes(y = .fitted), color = "red") + 
  scale_y_continuous(limits = c(0, NA)) +
  labs(x = "Month",
       y = "Injuries",
       title = "Poisson Fitted Line Plot")

Evaluate a logistic model fit with an analysis of deviance.

Show the code

(perf <- glance(m2))
(pseudoR2 <- 1 - perf$deviance / perf$null.deviance)

The deviance of the null model (no regressors) is 139.9. The deviance of the full model is 132.2. The psuedo-R2 is very low at .05. How about the RMSE?

Show the code

RMSE(pred = predict(m2, type = "response"), obs = fire$injuries)

The average prediction error is about 0.99. That’s almost as much as the variance of injuries - i.e., just predicting the mean of injuries would be almost as good! Use the GainCurvePlot() function to plot the gain curve.

Show the code

augment(m2, type.predict = "response") %>%
  ggplot(aes(x = injuries, y = .fitted)) +
  geom_point() +
  geom_smooth(method ="lm") +
  labs(x = "Actual",
       y = "Predicted",
       title = "Poisson Fitted vs Actual")

Show the code

augment(m2) %>% data.frame() %>% 
  GainCurvePlot(xvar = ".fitted", truthVar = "injuries", title = "Poisson Model")

It seems that mo was a poor predictor of injuries.

These notes are primarily from PSU’s Analysis of Discrete Data which uses Alan Agresti’s Categorical Data Analysis (Agresti 2013). I also reviewed PSU’s Regression Methods, DataCamp’s Generalized Linear Models in R, DataCamp’s Multiple and Logistic Regression, and Interpretable machine learning (Molnar 2020).↩︎
The related probit regression link function is $f(E(Y|X)) = \Phi^{-1}(E(Y|X)) = \Phi^{-1}(\pi)$. The difference between the logistic and probit link function is theoretical, and the practical significance is slight. You can safely ignore probit.↩︎
Notes from Machine Learning Mastery.↩︎
UVA discusses the four types of residuals you can calculate from a binary logistic regression.↩︎
See probability notes on the binomial distribution here ↩︎
This article suggests using three standard deviations for an outliers threshold.↩︎
Nice explanation here. Sounds like you should use pseudo R2 for model comparison, not for reporting goodness of fit↩︎
These notes rely on the course notes from PSU STAT 504, Analysis of Discrete Data and UCLA.↩︎
Cook’s distance measures how much predicted values change when observation i is removed from the data. This article suggests using three standard deviations for an outliers threshold.↩︎
Nice write-up here. Sounds like you should use pseudo R2 for model comparison, not for reporting goodness of fit↩︎
These notes rely on UVA, PSU, Laerd, and the CLM package article vignette ↩︎

# Generalized Linear Models (GLM) ```{r include=FALSE} library(tidyverse) library(tidymodels) tidymodels_prefer() theme_set( theme_light() + theme( plot.caption = element_text(hjust = 0), strip.background = element_rect(fill="gray90", color="gray60"), strip.text = element_text(color = "gray20") ) ) ``` The linear regression model, $E(y|X) = X \beta$, structured as $y_i = X_i \beta + \epsilon_i$ where $X_i \beta = \mu_i$, assumes the response is a linear function of the predictors and the residuals are independent random variables normally distributed with zero mean and constant variance, $\epsilon \sim N \left(0, \sigma^2 \right)$. This implies that given a set of predictors, the response is normally distributed about its expected value, $y_i \sim N \left(\mu_i, \sigma^2 \right)$. However, there are many situations where this normality assumption does not hold. Generalized linear models (GLMs) are a generalization of the linear regression model that work with non-normal response distributions.^[These notes are primarily from PSU's [Analysis of Discrete Data](https://online.stat.psu.edu/stat504) which uses Alan Agresti's **Categorical Data Analysis** [@Agresti2013]. I also reviewed PSU's [Regression Methods](https://newonlinecourses.science.psu.edu/stat501/lesson/15), DataCamp's [Generalized Linear Models in R](https://www.datacamp.com/courses/generalized-linear-models-in-r), DataCamp's [Multiple and Logistic Regression](https://www.datacamp.com/courses/multiple-and-logistic-regression), and **Interpretable machine learning** [@Molner2020].] The response will not have a normal distribution if the underlying data-generating process is binomial (Section \@ref(binomiallogistic)) or multinomial (Section \@ref(multinomiallogistic)), ordinal (Section \@ref(ordinallogistic)), Poisson (counts, Section \@ref(poissonregression)), or exponential (time-to-event). In these situations the linear regression model can predict probabilities and proportions outside [0, 1], or negative counts or times. GLMs solve this problem by modeling a *function* of the expected value of $y$, $f(E(y|X)) = X \beta$. There are three components to a GLM: a *random component* specified by the response variable's probability distribution (normal, binomial, etc.); a *systematic component* in the form $X\beta$; and a *link function*, $\eta$, that specifies the link between the random and systematic components and converts the response to a range of $[-\infty, +\infty]$. Linear regression is a special case of GLM where the link function is the identity function, $f(E(y|X)) = E(y|X)$. For binary logistic regression, the link function is the logit, $$f(E(y|X)) = \log \left( \frac{E(y|X)}{1 - E(y|X)} \right) = \log \left( \frac{\pi}{1 - \pi} \right) = \mathrm{logit}(\pi)$$ where $\pi$ is the response probability.^[The related probit regression link function is $f(E(Y|X)) = \Phi^{-1}(E(Y|X)) = \Phi^{-1}(\pi)$. The difference between the logistic and probit link function is theoretical, and [the practical significance is slight](https://www.theanalysisfactor.com/the-difference-between-logistic-and-probit-regression/). You can safely ignore probit.] For multinomial logistic regression, the link function is the logit again, but with respect to the baseline level, and there is set of logits (one for each non-baseline level), $$f(E(y|X)) = \log \left( \frac{E(y_j|X)}{E(y_{j^*}|X)} \right) = \log \left( \frac{\pi_j}{\pi_{j^*}} \right) = \mathrm{logit}(\pi_j)$$ Where $j$ is a level in the dependent variable and $j^*$ is the baseline level. For Poisson regression, the link function is $$f(E(y|X)) = \ln (E(y|X)) = \ln(\lambda)$$ where $\lambda$ is the expected event rate. For exponential regression, the link function is $$f(E(y|X) = -E(y|X) = -\lambda$$ where $\lambda$ is the expected time to event. GLM uses maximum likelihood estimation (MLE) rather than ordinary least squares (OLS) to estimate the parameters, and thus relies on large-sample approximations. Fit a GLM just like an linear model, but with the `glm()` function, specifying the distribution with the `family = c("gaussian", "binomial", "poisson")` parameter. Fit a mulinomial logistic regression model with `nnet::multinom()`. ## Binomial Logistic Regression {#binomiallogistic} Logistic regression estimates the probability that a categorical dependent variable is a particular level. The dependent variable levels can be binomial, multinomial, or ordinal. The *binary* logistic regression model is $$y_i = \mathrm{logit}(\pi_i) = \log \left( \frac{\pi_i}{1 - \pi_i} \right) = X_i \beta$$ where $\pi_i$ is the response $i$'s binary level probability. The model predicts the *log odds* of the level. Why do this? The range of outcomes need to be between 0 and 1, and a sigmoid function, $f(y) = 1 / \left(1 + e^y \right)$, does that. If the _log odds_ of the level equals $X_i\beta$, then the _odds_ of the level equals $e^{X\beta}$. You can solve for $\pi_i$ to get $\pi = \mathrm{odds} / (\mathrm{odds} + 1)$. Substituting, $$\pi_i = \frac{\exp(y_i)}{1 + \exp(y_i)} = \frac{e^{X_i\beta}}{1 + e^{X_i\beta}}$$ which you can simplify to $\pi_i = 1 / (1 + e^{-X_i\beta})$, a sigmoid function. The upshot is $X\beta$ is the functional relationship between the independent variables and _a function of the response_, not the response itself. The model parameters are estimated either by _iteratively reweighted least squares optimization_ or by _maximum likelihood estimation_ (MLE). MLE maximizes the probability produced by a set of parameters $\beta$ given a data set and probability distribution.^[Notes from [Machine Learning Mastery](https://machinelearningmastery.com/logistic-regression-with-maximum-likelihood-estimation/).] In logistic regression the probability distribution is the binomial and each observation is the outcome of a single Bernoulli trial. $$L(\beta; y, X) = \prod_{i=1}^n \pi_i^{y_i}(1 - \pi_i)^{(1-y_i)} = \prod_{i=1}^n\frac{\exp(y_i X_i \beta)}{1 + \exp(X_i \beta)}.$$ In practice, multiplying many small probabilities can be unstable, so MLE optimizes the log likelihood instead. $$\begin{align} l(\beta; y, X) &= \sum_{i = 1}^n \left(y_i \log(\pi_i) + (1 - y_i) \log(1 - \pi_i) \right) \\ &= \sum_{i = 1}^n \left(y_i X_i \beta - \log(1 + e^{X_i\beta}) \right) \end{align}$$ Sometimes you will see the _cost function_ optimized. The cost function is the negative of of the log likelihood function. **Assumptions** The binomial logistic regression model requires a dichotomous dependent variable and independent observations. The sample size should be large, at least 10 observations per dependent variable level and independent variable. There are three conditions related to the data distribution: i) the logit transformation must be linearly related to any continuous independent variables, ii) there must be no multicollinearity, and iii) there must be no influential outliers. Be aware of over-dispersion, a common issue with GLM. For a binomial logistic regression, the response variable should be distributed $y_i \sim \mathrm{Bin}(n_i, \pi_i)$ with $\mu_i = n_i \pi_i$ and $\sigma^2 = \pi (1 - \pi)$. Over-dispersion means the data shows evidence of variance greater than $\sigma^2$. ### Case Study {.unnumbered #cs1} This case study uses the [Laerd Statistics](https://statistics.laerd.com/) article on binomial logistic regression [data set](https://statistics.laerd.com/premium/spss/spss-files/logistic-regression.sav). A study investigates the relationship between the incidence of heart disease (Yes|No) and age, weight, gender, and maximal aerobic capacity using data from *n* = 100 participants. ```{r} cs1 <- list() cs1$dat <- foreign::read.spss("./input/logistic-regression.sav", to.data.frame = TRUE) cs1$dat %>% gtsummary::tbl_summary( by = heart_disease, include = -caseno, percent = "row", statistic = list(gtsummary::all_continuous() ~ "{mean}, {sd}") ) ``` <br> Overall, men are twice as likely to have heart disease. Male odds are .43/.57 = `r comma(.43/.57, .1)` and female odds are .22/.78 = `r comma(.22/.78, .1)`, an male-to-female odds ratio of `r comma((.43/.57) / (.22/.78), .1)`. ```{r} cs1$dat %>% gtsummary::tbl_cross(row = heart_disease, col = gender, percent = "column") ``` Age, weight, and poor max aerobic capacity are positively associated with heart disease. ```{r} cs1$dat %>% pivot_longer(cols = c(age, weight, VO2max)) %>% ggplot(aes(x = heart_disease, y = value)) + geom_boxplot(outlier.shape = NA) + geom_jitter(aes(color = name)) + facet_wrap(facets = vars(name), scales = "free_y") ``` :::rmdnote Consider centering the continuous variables around their means to facilitate model interpretation. The intercept term in the fitted model would represent a reasonable condition, not a zero-aged, zero-weighted person with no aerobic capacity. This is the way to go if you want to present your findings in the framework of a baseline probability (or odds) and the incremental effects of the independent variables. You might also standardize the continuous vars to get a more meaningful increment. On the other hand, if you want to use your model for predicting outcomes, you'll have to back out of the centering when you predict values. ::: If your model is predictive rather than inferential, split the data into training/testing data sets. ```{r collapse=TRUE} # For reproducibility set.seed(123) (x <- initial_split(cs1$dat, prop = 0.7, strata = heart_disease)) cs1$dat_training <- training(x) dim(cs1$dat_training) cs1$dat_testing <- testing(x) dim(cs1$dat_testing) ``` ### Fit the Model {-} Fit the model using the tidymodels framework. If you want to continue using the classic methodology, the glm object is inside the tidymodels fit. The model fit returns a brief summary with the coefficients and model diagnostics. ```{r} cs1$model <- logistic_reg() %>% set_engine("glm") %>% set_mode("classification") cs1$fit <- cs1$model %>% fit(heart_disease ~ age + weight + VO2max + gender, data = cs1$dat) # The fit object returned by glm(). You'll need this for interpretation and # checking assumptions. cs1$result <- cs1$fit %>% extract_fit_engine() # If you are fitting a predictive model, use the training set. cs1$fit_training <- cs1$model %>% fit(heart_disease ~ age + weight + VO2max + gender, data = cs1$dat_training) cs1$result %>% summary() ``` The null deviance, G^2, is the likelihood ratio of the intercept-only model with `r nrow(cs1$dat_training)` rows - 1 parameter = `r cs1$result %>% summary() %>% .$df.null` degrees of freedom. It is the sum of the squared deviance residuals. The residual deviance is the likelihood ratio of the full model with `r nrow(cs1$dat)` - 5 parameters = `r cs1$result %>% summary() %>% .$df.residual` degrees of freedom. The residual deviance is distributed chi-squared and can be used to test whether the model differs from the saturated model (model with as many coefficients as observations, G^2 = 0, *df* = 0) where $H_0$ = no difference. The deviance test for lack of fit fails to reject the null hypothesis. ```{r collapse=TRUE} # G^2 calculations cs1$result %>% residuals(type = "deviance") %>% .^2 %>% sum() cs1$result %>% deviance() # df df.residual(cs1$result) # G^2 is distributed chi-squared with df degrees of freedom pchisq(deviance(cs1$result), df = df.residual(cs1$result), lower.tail = FALSE) vcdExtra::LRstats(cs1$result) ``` These two deviances, the null and residual, are shown in the ANOVA summary. An ANOVA table shows the change in deviance from successively adding each variable to the model. ```{r} anova(cs1$result) ``` Deviance residuals are one of four residuals you can calculate from a binary logistic regression.^[[UVA](https://data.library.virginia.edu/understanding-deviance-residuals/) discusses the four types of residuals you can calculate from a binary logistic regression.] One is the *raw residual*, $\epsilon_i = y_i - \hat{p}_i$, where $\hat{p}_i$ is the predicted probability. Another is the *Pearson residual*, $r_i = \frac{\epsilon_i}{\sqrt{\hat{p}_i(1 - \hat{p}_i)}}$, the raw residual rescaled by dividing by the estimated standard deviation of a binomial distribution with 1 trial^[See probability notes on the binomial distribution [here](https://bookdown.org/mpfoley1973/probability/binomial.html)]. A third is the *standardized Pearson residual*, $rs_i = r_i / \sqrt{1 - \hat{h}_i}$, the Pearson residual adjusted for the leverage of the predictors using the hat-values. Hat-values measure the predictor distances from the mean. This residual is especially useful to evaluate model fit because if the model fits well, these residuals have a standard normal distribution. Finally, there are the *deviance residuals*, $d_i = \mathrm{sign}(\epsilon_i) \left[ -2(y_i \log \hat{p}_i + (1 - y_i) \log (1 - \hat{p}_i)) \right]^{.5}$. Deviance Residuals measure how much the estimated probabilities differ from the observed proportions of success. You want deviance residuals to be evenly distributed (in absolute values, 1Q $\approx$ 3Q, min $\approx$ max). You also want the min and max to be <3 because deviance residuals are roughly approximated by a standard normal distribution. ```{r} bind_rows( Raw = cs1$result %>% residuals(type = "response") %>% summary(), Pearson = cs1$result %>% residuals(type = "pearson") %>% summary(), `Standardized Pearson` = cs1$result %>% rstandard(type = "pearson") %>% summary(), Deviance = cs1$result %>% residuals(type = "deviance") %>% summary(), .id = "Residual" ) ``` ### Interpretation {-} Before we look at the coefficient estimations, consider what it is they are predicting: the log odds of the binary response. To see what that means, plug in values for the explanatory variables to get predictions. $\hat{y}$ is the log odds of having heart disease. ```{r collapse=TRUE} (mean_person <- cs1$dat %>% select(-caseno) %>% summarize(.by = gender, across(where(is.numeric), mean))) pred_log_odds <- cs1$fit %>% predict(new_data = mean_person, type = "raw") names(pred_log_odds) <- mean_person$gender pred_log_odds ``` Exponentiate to get the **odds**, $\exp (\hat{y}) = \frac{\pi}{1 - \pi}$. ```{r} (pred_odds <- exp(pred_log_odds)) ``` Solve for $\pi = \frac{\exp (\hat{y})}{1 + \exp (\hat{y})}$ to get the **probability**. Do the math, or use `predict(type = "prob")`. ```{r collapse=TRUE} (pred_prob <- pred_odds / (1 + pred_odds)) cs1$fit %>% predict(new_data = mean_person, type = "prob") ``` Now let's interpret the coefficients. ```{r} cs1$fit %>% tidy() ``` The intercept term is the log-odds of heart disease for the reference case. The reference case in the model is `gender` = "Female", `age` = 0, `weight` = 0, and `VO2max` = 0. If the data was centered, the reference case would actually meaningful. ```{r} cs1$fit %>% predict(new_data = list(age = 0, weight = 0, VO2max = 0, gender = "Female"), type = "raw") ``` Column "statistic" is the Wald $z$ statistic, $z = \hat{\beta} / SE(\hat{\beta})$. Its square is the Wald chi-squared statistic. The _p_-value is the area to the right of $z^2$ in the $\chi_1^2$ density curve: ```{r} cs1$fit %>% tidy() %>% mutate(p.chisq = map_dbl(statistic, ~pchisq(.^2, df = 1, lower.tail = FALSE))) %>% pull(p.chisq) ``` Interpret the coefficient estimates as the change in the log odds of $y$ due to a one unit change in $x$. If $\delta = x_a - x_b$, then a $\delta$ change in $x$ is associated with a $\delta \hat{\beta}$ change in the log odds of $y$. $\beta$ is the log odds ratio of $x_a$ vs $x_b$. $$\log \left(\pi / (1 - \pi) |_{x = x_a} \right) - \log \left(\pi / (1 - \pi) |_{x = x_b} \right) = \log \left( \frac{\pi / (1 - \pi) |_{x = x_a}}{\pi / (1 - \pi) |_{x = x_b}} \right) = \delta \hat{\beta}$$ The *exponential* of the coefficient estimates is the change in the _odds_ of $y$ due to a $\delta$ change in $x$. $\exp \beta$ is the odds ratio of $x_a$ vs $x_b$. $$\mathrm{odds}(y) = e^{\delta \hat{\beta}}$$ ```{r} cs1$fit %>% tidy(exponentiate = TRUE) ``` All covariates held equal, a male's log odds of heart disease are `r comma(coefficients(cs1$result)["genderMale"], .01)` times that of a female's (log(OR)). A male's odds are `r comma(exp(coefficients(cs1$result)["genderMale"]), .01)` times that of a female's (OR). Of course, all covariate's are _not_ equal - males are heavier and have higher VO2max. Let's run the calculations with the mean predictor values for male and female. ```{r collapse=TRUE} # Log OR pred_log_odds["Male"] / pred_log_odds["Female"] # OR pred_odds["Male"] / pred_odds["Female"] ``` A one-unit increase in any of the continuous independent variables is interpreted similarly. The reference level is unimportant since the change is constant across the range of values. A one year increase in age increases the log-odds of heart disease by a factor of `r comma(coefficients(cs1$result)["age"], .01)`, and the odds by a factor of `r comma(exp(coefficients(cs1$result)["age"]), .01)`. To calculate the effect of a *decade* increase in age, multiply $\beta$ by 10 before exponentiating, or raise the exponentiated coeficient by 10. The effect of a 10-year increase in age is to increase the odds of heart disease by `r comma(exp(coefficients(cs1$result)["age"] * 10), .01)`. The odds double every ten years. `oddsratio::or_glm()` is a handy way to calculate odds ratios from arbitrary increments to the predictors. Here are the ORs of a 10-year age change, 10 kg weight change, and VO2max change of 5. ```{r} oddsratio::or_glm( cs1$dat, cs1$result, incr = list(age = 10, weight = 25, VO2max = -12) ) ``` Notice that the predicted probabilities have the sigmoidal shape of the binary relationship. ```{r message=FALSE} augment(cs1$fit, new_data = cs1$dat, type = "raw") %>% ggplot(aes(x = age, color = gender)) + geom_point(aes(y = as.numeric(heart_disease == "Yes"))) + geom_point(aes(y = .pred_Yes), shape = 4) + geom_smooth(aes(y = .pred_Yes), se = FALSE) + labs(x = "Age", y = NULL, title = "Binary Fitted Line Plot") + scale_y_continuous(breaks = c(0,1), labels = c("Healthy", "Heart Disease")) + theme_light() + theme(legend.position = "right") ``` ### Assumptions {-} Four assumptions relate to the study design: (1) the dependent variable is dichotomous; (2) the observations are independent; (3) the categories of all nominal variables are mutually exclusive and exhaustive; and (4) there are at least 10 observations per dependent variable level and independent variable. These assumptions are all valid here. Three more assumptions related to the data distribution: - There is a linear relationship between the logit transformation and the continuous independent variables. Test with a plot and with Box-Tidwell. - There is no independent variable multicollinearity. Test with correlation coefficients and variance inflation factors (VIF). - There are no influential outliers. Test with Cook's distance. Test the linearity assumption first. There are two ways to do this (do both). First, fit your model, then plot the _fitted values_ against the continuous predictors. This is the GLM analog to OLS bivariate analysis, except now the dependent variable is the _logit_ transformation. These plotted relationships look pretty linear. ```{r} cs1$fit %>% augment(new_data = cs1$dat) %>% pivot_longer(c(age, weight, VO2max)) %>% ggplot(aes(x = value, y = .pred_Yes)) + geom_point() + facet_wrap(facets = vars(name), scales = "free_x") + geom_smooth(method = "lm", formula = "y~x") + labs(title = "Linearity Test: predicted vs continuous predictors", x = NULL) ``` The second test for linearity is the Box-Tidwell approach. Add transformations of the continuous independent variables to the model, $x_{Tx} = x \log x$, then test their significance level in the fit. ```{r} # Using non-centered vars to avoid log(0) errors. x <- cs1$dat %>% mutate( age_tx = log(age) * age, weight_tx = log(weight) * weight, VO2max_tx = log(VO2max) * VO2max ) cs1$boxtidwell_fit <- logistic_reg() %>% set_engine("glm") %>% set_mode("classification") %>% fit(heart_disease ~ age + weight + VO2max + gender + age_tx + weight_tx + VO2max_tx, data = x) cs1$boxtidwell_fit %>% tidy() ``` Focus on the three transformed variables. `age_tx` is the only one with a *p*-value nearly <.05, but it is customary to apply a Bonferroni adjustment when testing for linearity. There are eight predictors in the model (including the intercept term), so the Bonferroni adjusted *p*-value for `age_tx` is multiplied by 8. Do not reject the null hypothesis of linearity. If the relationship _was_ nonlinear, you could try transforming the variable by raising it to $\lambda = 1 + b / \gamma$ where $b$ is the estimated coefficient of the model without the interaction terms, and $\gamma$ is the estimated coefficient of the interaction term of the model with interactions. For `age`, $b$ is `r comma(coefficients(cs1$result)["age"], .001)` and $\gamma$ is `r comma(coefficients(cs1$boxtidwell_fit$fit)["age_tx"], .001)`, so $\lambda$ = `r comma(1 + coefficients(cs1$fit$fit)["age"] / coefficients(cs1$boxtidwell_fit$fit)["age_tx"], .001)`. This is approximately 1 (no transformation). It appears to be customary to apply general transformations like .5 (square root), 1/3 (cube root), ln, and the reciprocal. Now check for multicollinearity. Variance inflation factors (VIF) estimate how much the variance of a regression coefficient is inflated due to multicollinearity. When independent variables are correlated, it is difficult to say which variable really influences the dependent variable. The VIF for variable $i$ is $$ \mathrm{VIF}_i = \frac{1}{1 - R_i^2} $$ where $R_i^2$ is the coefficient of determination (i.e., the proportion of dependent variance explained by the model) of a regression of $X_i$ against all of the other predictors, $X_i = X_{j \ne i} \beta + \epsilon$. If $X_i$ is totally unrelated to its covariates, then $R_i^2$ will be zero and $\mathrm{VIF}_i$ will be 1. If $R_i^2$ is .8, $\mathrm{VIF}_i$ will be 5. The rule of thumb is that $R_i^2 \le 5$ is tolerable, and $R_i^2 > 5$ is "highly correlated" and you have to do something about it. These are excellent. ```{r} car::vif(cs1$result) ``` Try calculating the $\mathrm{VIF}$ for `age`. :::rmdnote I don't know why this doesn't work. It _would_ work if the underlying model was OLS instead of GLM. The answer seems to be related to GVIF vs VIF, but I didn't figure it out.) ::: ```{r collapse=TRUE} r2 <- lm(age ~ weight + VO2max + gender, data = cs1$dat_training) %>% summary() %>% pluck("r.squared") (vif <- 1 / (1 - r2)) ``` ```{r include=FALSE} cs1$outliers <- cs1$result %>% augment(type.predict = "response") %>% filter(abs(.std.resid) >= 2) %>% pull(.std.resid) ``` Now check for influential outliers. Predict the response probabilities and filter for the predictions more than two standard deviations from the actual value and a Cook's Distance greater than 4/N = `r 4 / nrow(cs1$dat)`.^[[This article](https://towardsdatascience.com/assumptions-of-logistic-regression-clearly-explained-44d85a22b290) suggests using three standard deviations for an outliers threshold.] Cook's distance measures how much predicted values change when observation *i* is removed from the data. Only two fitted values were both an outlier and influential, row ids 59 and 70. An index plot of Cook's Distance shows the two points at the far left. You might examine the observations for validity. Otherwise, proceed and explain that there were two standardized residuals with value of `r comma(cs1$outliers[1], .01)` and `r comma(cs1$outliers[2], .01)` standard deviations which were kept in the analysis. ```{r warning=FALSE} augment(cs1$result) %>% mutate( id = row_number(), outlier = if_else(abs(.std.resid) >= 2, "Outlier", "Other"), influential = if_else(.cooksd > 4 / nrow(cs1$dat), "Influential", "Other"), status = case_when( outlier == "Outlier" & influential == "Influential" ~ "Influential Outlier", outlier == "Outlier" ~ "Outlier", influential == "Influential" ~ "Influential", TRUE ~ "Other" ) ) %>% ggplot(aes(x = .fitted, y = .cooksd)) + geom_point(aes(color = status)) + geom_text(aes(label = if_else(influential == "Influential", id, NA_integer_)), check_overlap = TRUE, size = 3, nudge_x = .025) + geom_hline(yintercept = 4 / nrow(cs1$dat), linetype = 2, color = "goldenrod") + scale_color_manual(values = c("Influential Outlier" = "firebrick", "Influential" = "goldenrod", "Outlier" = "slategray", "Other" = "black")) + theme(legend.position = "right") + labs(title = "Index Plot of Cook's Distance.", subtitle = "Row id labeled for values > 4 / N.") ``` ### Evaluate the Fit {-} There are several ways to evaluate the model fit. - The likelihood ratio test - Pseudo R-squared^[Nice explanation [here](https://stats.oarc.ucla.edu/other/mult-pkg/faq/general/faq-what-are-pseudo-r-squareds/). Sounds like you should use pseudo R2 for model comparison, not for reporting goodness of fit]. - Accuracy measures - Gain and ROC curves The **likelihood ratio test** compares the log likelihood of the fitted model to an intercept-only model. ```{r} intercept_only <- logistic_reg() %>% set_engine("glm") %>% set_mode("classification") %>% fit(heart_disease ~ 1, data = cs1$dat) (cs1$lrtest <- lmtest::lrtest(cs1$result, intercept_only$fit)) ``` The fitted model is significant, $\chi^2$(`r abs(cs1$lrtest$Df[2])`) = `r comma(cs1$lrtest$Chisq[2], .1)`, *p* < .001. Calculate the pseuedo-R2 with `DescTools::PseudoR2()`. :::rmdnote I Can't get this to work in the tidymodels framework. Using `glm()` for now. ::: ```{r} x <- glm(heart_disease ~ age + weight + VO2max + gender, data = cs1$dat, family = "binomial") cs1$pseudo_r2 <- DescTools::PseudoR2(x, which = c("CoxSnell", "Nagelkerke", "McFadden")) cs1$pseudo_r2 ``` Laerd interprets this as the model explained `r percent(cs1$pseudo_r2["Nagelkerke"], .1)` (Nagelkerke R2) of the variance in heart disease. This is how your would interpret R2 in an OLS model. [UCLA](https://stats.oarc.ucla.edu/other/mult-pkg/faq/general/faq-what-are-pseudo-r-squareds/) points out that the various pseudo R-squareds measure other aspects of the model and are unique to the measured quantity. A pseudo R-squared is not very informative on its own; it is useful for comparing models. Accuracy measures formed by the cross-tabulation of observed and predicted classes is the better way to go. ```{r collapse=TRUE} cs1$conf_mat <- cs1$fit %>% augment(new_data = cs1$dat) %>% # conf_mat requires truth to be first level of the factor variable. mutate(across(c(heart_disease, .pred_class), ~fct_relevel(., "Yes"))) %>% conf_mat(truth = heart_disease, estimate = .pred_class) cs1$conf_mat cs1$conf_mat %>% summary() cs1$conf_mat %>% autoplot() ``` The model accuracy, `r cs1$conf_mat %>% summary() %>% filter(.metric == "accuracy") %>% pull(.estimate) %>% percent(.1)`, is the percent of observations correctly classified. The sensitivity, `r cs1$conf_mat %>% summary() %>% filter(.metric == "sens") %>% pull(.estimate) %>% percent(.1)`, is the accuracy with regard to predicting positive cases. The specificity, `r cs1$conf_mat %>% summary() %>% filter(.metric == "spec") %>% pull(.estimate) %>% percent(.1)`, is the accuracy with regard to predicting negative cases. If you are fitting a predictive model, use the testing data set for this. ```{r} cs1$fit_training %>% augment(new_data = cs1$dat_testing) %>% conf_mat(truth = heart_disease, estimate = .pred_class) ``` Finally, plot the [gain curve or ROC curve](https://community.tibco.com/wiki/gains-vs-roc-curves-do-you-understand-difference). In the **gain curve**, the x-axis is the fraction of items seen when sorted by the predicted value, and the y-axis is the cumulatively summed true outcome. The "wizard" curve is the gain curve when the data is sorted by the true outcome. If the model's gain curve is close to the wizard curve, then the model predicts the response well. The gray area is the "gain" over a random prediction. ```{r} cs1$dat_testing %>% bind_cols(predict(cs1$fit, new_data = cs1$dat_testing, type = "prob")) %>% # event_level = "second" sets the second level as success yardstick::gain_curve(.pred_Yes, truth = heart_disease, event_level = "second") %>% autoplot() + labs(title = "Gain Curve") ``` `r sum(cs1$dat_testing$heart_disease == "Yes")` of the `r nrow(cs1$dat_testing)` participants had heart disease in the test data set. - The gain curve encountered 6 heart disease cases (50%) within the first 8 observations (55%). It encountered all 11 heart disease cases on the 18th observation. - The bottom of the grey area is the outcome of a random model. Only half the heart disease cases would be observed within 50% of the observations. - The top of the grey area is the outcome of the perfect model, the "wizard curve". Half the heart disease cases would be observed in 6/30=20% of the observations. The **ROC** (Receiver Operating Characteristics) curve plots sensitivity vs specificity at varying cut-off values for the probability ranging from 0 to 1. Ideally, you want very little trade-off between sensitivity and specificity, with a curve hugging the left and top axes. ```{r} cs1$dat_testing %>% bind_cols(predict(cs1$fit, new_data = cs1$dat_testing, type = "prob")) %>% # event_level = "second" sets the second level as success yardstick::roc_curve(.pred_Yes, truth = heart_disease, event_level = "second") %>% autoplot() + labs(title = "ROC Curve") ``` ### Reporting {-} > A binomial logistic regression was performed to ascertain the effects of age, weight, gender and VO2max on the likelihood that participants have heart disease. Linearity of the continuous variables with respect to the logit of the dependent variable was assessed via the Box-Tidwell (1962) procedure. A Bonferroni correction was applied using all eight terms in the model resulting in statistical significance being accepted when *p* < `r comma(.05/8, .00001)` (Tabachnick & Fidell, 2014). Based on this assessment, all continuous independent variables were found to be linearly related to the logit of the dependent variable. There were two standardized residuals with value of `r comma(cs1$outliers[1], .01)` and `r comma(cs1$outliers[2], .01)` standard deviations, which were kept in the analysis. The logistic regression model was statistically significant, χ2(4) = 27.40, p < .001. The model explained `r percent(cs1$pseudo_r2["Nagelkerke"], .1)` (Nagelkerke R2) of the variance in heart disease and correctly classified `r cs1$conf_mat %>% summary() %>% filter(.metric == "accuracy") %>% pull(.estimate) %>% percent(.1)` of cases. Sensitivity was `r cs1$conf_mat %>% summary() %>% filter(.metric == "sens") %>% pull(.estimate) %>% percent(.1)`, specificity was `r cs1$conf_mat %>% summary() %>% filter(.metric == "spec") %>% pull(.estimate) %>% percent(.1)`, positive predictive value was `r percent(cs1$confusion_matrix$byClass["Pos Pred Value"], .1)` and negative predictive value was `r percent(cs1$confusion_matrix$byClass["Neg Pred Value"], .1)`. Of the five predictor variables only three were statistically significant: age, gender and VO2max (as shown in Table 1). Females had `r exp(-coef(cs1$result)["genderMale"]) %>% comma(.01)` times lower odds to exhibit heart disease than males. Increasing age was associated with an increased likelihood of exhibiting heart disease, but increasing VO2max was associated with a reduction in the likelihood of exhibiting heart disease. ```{r message=FALSE} gtsummary::tbl_regression( cs1$fit, exponentiate = TRUE ) ``` ## Multinomial Logistic Regression {#multinomiallogistic} The _multinomial_ logistic regression model is $J - 1$ baseline logits, $$y_i = \log \left( \frac{\pi_{ij}}{\pi_{ij^*}} \right) = X_i \beta_j, \hspace{5mm} j \ne j^*$$ where $j$ is a level of the multinomial response variable. Whereas binomial logistic regression models the *log odds* of the response level, multinomial logistic regression models the *log relative risk*, the probability relative to the baseline $j^*$ level.^[These notes rely on the course notes from [PSU STAT 504, Analysis of Discrete Data](https://online.stat.psu.edu/stat504/) and [UCLA](https://stats.oarc.ucla.edu/r/dae/multinomial-logistic-regression/).] Interpret the $k^{th}$ element of $\beta_j$ as the increase in log relative risk of $Y_i = j$ relative to $Y_i = j^*$ given a one-unit increase in the $k^{th}$ element of $X$, holding the other terms constant. The individual probabilities, $\pi_{ij}$, are $$\pi_{ij} = \frac{\exp(y_{ij})}{1 + \sum_{j \ne j^*} \exp(y_{ij})} = \frac{e^{X_i\beta_j}}{1 + \sum_{j \ne j^*} e^{X_i\beta_j}}$$ and for the baseline category, $$\pi_{ij^*} = \frac{1}{1 + \sum_{j \ne j^*} \exp(y_{ij})} = \frac{1}{1 + \sum_{j \ne j^*} e^{X_i\beta_j}}$$ **Assumptions** Multinomial logistic regression applies when the dependent variable is categorical. It presumes a linear relationship between the log relative risk of the dependent variable and $X$ with residuals $\epsilon$ that are independent. It also assumes there is no severe multicollinearity in the predictors, and there is independence of irrelevant alternatives (IIA). IIA means the relative likelihood of being in one category compared to the base category would not change if you added other categories. ### Case Study {.unnumbered #cs2} This case study uses the data set from [this UCLA tutorial](https://stats.oarc.ucla.edu/r/dae/multinomial-logistic-regression/). A study measures the association between students' academic program (academic, general, and vocational) and their socioeconomic status (SES) (low, middle, high) and writing score. ```{r eval=FALSE} download.file( "https://stats.idre.ucla.edu/stat/data/hsbdemo.dta", "./input/hsbdemo.dta", mode = "wb" ) ``` ```{r} cs2 <- list() cs2$dat <- foreign::read.dta("./input/hsbdemo.dta") %>% # Just keep cols relevant to study select(id, prog, ses, write) %>% mutate(prog = fct_relevel(prog, "academic", after = 0)) cs2$dat %>% gtsummary::tbl_summary( by = prog, include = c(prog, ses, write), statistic = list(gtsummary::all_continuous() ~ "{mean}, {sd}") ) %>% gtsummary::add_overall() ``` <br> Conduct a brief exploratory analysis to establish your expectations. Academic programs are associated with higher writing scores and SES. General and vocational programs are the opposite, although SES has opposing effects for general (increased probability) and vocational (decreased probability). ```{r} cs2$dat %>% mutate(write_bin = cut(write, breaks = 5, dig.lab = 1, right = FALSE)) %>% count(prog, ses, write_bin) %>% mutate(.by = c(ses, write_bin), prob = n / sum(n)) %>% ggplot(aes(x = write_bin, y = prob, color = ses)) + geom_point() + geom_line(aes(group = ses)) + facet_wrap(facets = vars(prog)) + theme(legend.position = "top") + labs(title = "Program Proportions by SES") ``` If your model is predictive rather than inferential, split the data into training/testing data sets. ```{r collapse=TRUE} # For reproducibility set.seed(123) (x <- initial_split(cs2$dat, prop = 0.7, strata = prog)) cs2$dat_training <- training(x) dim(cs2$dat_training) cs2$dat_testing <- testing(x) dim(cs2$dat_testing) ``` ### Fit the Model {-} The multinomial logitistic regression model engines all seem to be related to neural networks and advise that predictors be set on a common scale by normalizing. I have only one predictor other than SES in this model, but I'll normalize it anyway. Normalizing `write` will also facilitate model interpretation. The intercept will represent a reasonable condition (average writing score) and a one-unit increase in `write` will represent a 1 SD increase in writing. :::rmdnote This case study will use both the `parsnip` package to fit a model directly to a data set, and the `recipes` package to preprocess the data. `parsnip` is fine for most applications and it seems to be better supported by other functions, like `tidy()`, so stick with it for inferential applications. `recipes` has the advantage of being able to process data in a workflow, so you don't have to transform new data to make predictions. See more uses of `recipes` on the [tidy models documentation](https://www.tidymodels.org/start/recipes/). ::: Let's first use `parsnip` to fit the full data set for an inferential application. ```{r} # Create the model. This much is the same for parsnip or recipes. cs2$model <- multinom_reg() %>% set_engine("nnet") # For parsnip, you need to normalize the data explicitly. nnet requires dummy # vars for factor predictors, but thankfully parsnip does that implicitly. # However, I want the outcome variable base level to be "academic" and while # recipes can do that in pre-processing, I have to do it manually for parsnip. cs2$dat_normalized <- cs2$dat %>% # Cast to numeric, otherwise scale() returns an nmatrix. :( mutate(write = as.numeric(scale(write, center = TRUE, scale = TRUE))) # Fit the whole data set for an explanatory model. cs2$fit <- cs2$model %>% fit(prog ~ ses + write, data = cs2$dat_normalized) # Extract the fit object returned by the engine. Use for interpretation and # checking assumptions. cs2$result <- cs2$fit %>% extract_fit_engine() ``` In parallel, let's use `recipes` to fit a predictive model to the training data. The model object is already created, so we just need to pair a recipe object with a workflow. ```{r} cs2$rec <- # The `data` argument can be the base data, or training, or even testing. # recipe() only uses it to catalog variable names and data types. recipe(prog ~ ses + write, data = cs2$dat) %>% # You could have specified the formula as prog ~ ., then assigned roles. E.g., # Keep "id" in data set, but don't use it in the model like this: # update_role(id, new_role = "ID") %>% # Unlike parsnip, recipe does not automatically create dummy vars. step_dummy(all_nominal_predictors()) %>% # Not relevant here, but good practice: if a factor level has few values, it may # not appear in the training set. If so, its dummy will contain a single value # (0). You can prevent that by dropping zero-value cols. step_zv(all_predictors()) %>% # Set the reference level of the outcome here if you want. step_relevel(prog, ref_level = "academic") %>% # Normalize write. step_normalize(write) # The workflow pairs the model and recipe. cs2$wflow <- workflow() %>% add_model(cs2$model) %>% add_recipe(cs2$rec) # Fit the training data set for a predictive model. cs2$fit_training <- cs2$wflow %>% fit(data = cs2$dat_training) # You can't extract the engine fit and pipe into summary. Seems like a bug # cs2$result_training <- cs2$fit_training %>% extract_fit_engine() ``` Let's look at the explanatory model summary object. The model produces a set of coefficient estimates for each non-reference level of the dependent variable. The `nnet` engine presents the coefficient estimates, then their standard errors as a second section, but does not present the *z*-statistic or *p*-values! ```{r} cs2$result %>% summary() ``` The residual deviance is $G^2 = 2 \sum_{i,j}y_{ij} \log \frac{y_{ij}}{\hat{\pi}_{ij}}$. Another model diagnostic is the log-likelihood, $-G^2 / 2$ (not shown) and AIC. More on these in the Model Fit section. ```{r collapse=TRUE} deviance(cs2$result) logLik(cs2$result, sum = TRUE) ``` The Wald *z*-statistic is $z = \hat{\beta} / SE(\hat{\beta})$. Its square is the Wald chi-squared statistic. The _p_-value is the area to the right of $z^2$ in the $\chi_1^2$ density curve. Get these from `tidy()`. ```{r} cs2$result %>% tidy() ``` ### Interpretation {-} Start with interpreting the dependent variable. The model fits the **log relative risk** of belonging to program $j \in$ [vocation, general] vs. $j^*$ = academic. However, `predict()` returns either the risk (`type` = "probs") or outcome (`type` = "class), not the log relative risk. Plug in values for the predictor variables to get predictions. The relative risk is $RR = \exp (\hat{y}_j) = \pi_j / \pi_{j^*}$. We see here that a student of low SES and mean writing score is less likely to be in a general or vocation program than an academic program. ```{r collapse=TRUE} (risk <- predict(cs2$result, newdata = list(ses = "low", write = 0), type = "probs")) # Relative risk. (rr <- risk[-1] / risk[1]) # Log-relative risk (the modeled outcome) (log_rr <- log(rr)) ``` Move on to the coefficients. Interpret $\hat{\beta}$ as the change in the _log relative risk_ of $y_j$ relative to $y_{j^*}$ due to a $\delta$ = one unit change in $x$. A $\delta = x_a - x_b$ change in $x$ is associated with a $\delta \hat{\beta}$ change. $\delta\beta$ is the log relative risk ratio. $$\log \left(\pi_j / \pi_{j^*} |_{x = x_a} \right) - \log \left(\pi_j / \pi_{j^*} |_{x = x_b} \right) = \log \left( \frac{\pi_j / \pi_{j^*} |_{x = x_a}}{\pi_j / \pi_{j^*} |_{x = x_b}} \right) = \delta \hat{\beta}$$ The exponential of $\hat{\beta}$ is the change in the _relative risk_ of $y_j$ relative to $y_{j^*}$ due to a $\delta$ = one unit change in $x$. $\exp \delta \beta$ is the relative risk ratio. $$\pi_j / \pi_{j^*} |_{x = x_a} = \exp{\delta \hat{\beta}}$$ The intercept term is the log-relative risk of $y_j$ relative to $y_{j^*}$ for the reference case. The reference case in the model is `ses` = "low" and `write` centered at `r comma(mean(cs2$dat$write), .1)`. Notice how the intercept matches the predicted values above. ```{r collapse=TRUE} (ref_log_rr <- coef(cs2$result)[,"(Intercept)"]) # From log-relative risk to relative risk. (ref_rr <- exp(ref_log_rr)) ``` The log relative risks of a low SES student with a `r comma(mean(cs2$dat$write), .1)` writing score being in program general vs academic is $\hat{y} = \mathrm{Intercept}_1$ = `r comma(ref_log_rr[1], .001)`, and $\hat{y} = \mathrm{Intercept}_2$ = `r comma(ref_log_rr[2], .001)` for vocation vs academic. The corresponding relative risks are $\exp(\hat{y}_j)$ = `r comma(ref_rr[1], .001)` and `r comma(ref_rr[2], .001)`. The expected probabilities are `r percent(risk[1], .1)`, `r percent(risk[2], .1)`, and `r percent(risk[3], .1)` for academic, general, and vocation respectively. ```{r include=FALSE} ses_high <- predict(cs2$result, newdata = list(ses = "high", write = 0), type = "probs") ``` If SES was high instead of low, the expected probabilities of being in program general vs academic would be `r percent(ses_high[1], .1)`, `r percent(ses_high[2], .1)`, and `r percent(ses_high[3], .1)` for academic, general, and vocation respectively. What if the writing score increases by 1 SD (one unit)? The log RR of being in program general vs academic change by a factor of `coef(cs2$result)["general", "write"]` = `r coef(cs2$result)["general", "write"] %>% comma(.001)`, RR = `r exp(coef(cs2$result)["general", "write"]) %>% comma(.001)`. For program vocation vs academic, it would change by a factor of `r coef(cs2$result)["vocation", "write"] %>% comma(.001)`, RR = `r exp(coef(cs2$result)["vocation", "write"]) %>% comma(.001)`. To get a 2 SD increase, multiply the coefficient by 2, then exponentiate. The two RRs would then be `r exp(coef(cs2$result)["general", "write"]*2) %>% comma(.001)` and `r exp(coef(cs2$result)["vocation", "write"]*2) %>% comma(.001)`. Visualize the parameter estimates by plotting the predicted values. ```{r} new_data <- expand.grid( ses = levels(cs2$dat$ses), write = seq(from = round(min(cs2$dat_normalized$write)), to = round(max(cs2$dat_normalized$write)), by = 1) ) bind_cols( new_data, predict(cs2$fit, new_data = new_data, type = "prob") ) %>% pivot_longer(cols = -c(ses, write), names_to = "prog", values_to = "probability") %>% ggplot(aes(x = write, y = probability, color = ses)) + geom_line() + facet_wrap(facets = vars(prog)) ``` ### Assumptions {-} Four assumptions relate to the study design: (1) the dependent variable is multinomial; (2) the observations are independent; (3) the categories of all nominal variables are mutually exclusive and exhaustive; and (4) there are at least 10 observations per dependent variable level and independent variable. These assumptions are all valid. Three more assumptions related to the data distribution: - There is a linear relationship between the logit transformation and the continuous independent variables. Test with a plot and with Box-Tidwell. - There is no independent variable multicollinearity. Test with correlation coefficients and variance inflation factors (VIF). - There are no influential outliers. Test with Cook's distance. There are two ways to test for linearity (do both). First, plot the _fitted values_ against the continuous predictors. This is the GLM analog to OLS bivariate analysis, except now the dependent variable is the _logit_ transformation. These plotted relationships look good, except that in the `prog` = general level, writing score appears to interact with SES. ```{r} cs2$fit %>% augment(new_data = cs2$dat_normalized) %>% pivot_longer(cols = c(.pred_academic:.pred_vocation), values_to = ".fitted") %>% filter(str_detect(name, as.character(prog))) %>% ggplot(aes(x = write, y = .fitted, color = ses)) + geom_point() + facet_wrap(facets = vars(prog), scales = "free_x") + geom_smooth(method = "lm", formula = "y~x") + labs(title = "Linearity Test: predicted vs continuous predictors", x = NULL) ``` The second test for linearity is the Box-Tidwell approach. Add transformations of the continuous independent variables to the model, $x_{Tx} = x \log x$, then test their significance level in the fit. Focus on the transformed variable. `write_tx` has a *p*-value <.05 for general. It is customary to apply a Bonferroni adjustment when testing for linearity. There are ten predictors in the model (including the intercept terms), so the Bonferroni adjusted *p*-values for `write_tx` are multiplied by 10. We should reject the null hypothesis of linearity because the adjusted p.value is still below .05. ```{r} # Using non-centered vars to avoid log(0) errors. cs2$boxtidwell <- cs2$dat %>% mutate(write_tx = log(write) * write) %>% fit(cs2$model, prog ~ ses + write + write_tx, data = .) tidy(cs2$boxtidwell) %>% filter(term == "write_tx") %>% mutate(adj_p = p.value * 10) ``` If the relationship is nonlinear, you can try transforming the variable by raising it to $\lambda = 1 + b / \gamma$ where $b$ is the estimated coefficient of the model without the interaction terms, and $\gamma$ is the estimated coefficient of the interaction term of the model with interactions. For `write`, $b$ is `r comma(coef(cs2$fit_nnet)[1, "write_c"], .001)` for general and $\gamma$ is `r comma(coef(cs2$boxtidwell)[1, "write_tx"], .001)`, so $\lambda$ = 1 + `r comma(coef(cs2$fit_nnet)[1, "write_c"], .001)` / `r comma(coef(cs2$boxtidwell)[1, "write_tx"], .001)` = `r comma(1 + coef(cs2$fit_nnet)[1, "write_c"] / coef(cs2$boxtidwell)[1, "write_tx"], .001)`. It seems customary to apply general transformations like .5 (square root), 1/3 (cube root), ln, and the reciprocal, so maybe try raising `write_c` to `r comma(1 + coef(cs2$fit_nnet)[1, "write_c"] / coef(cs2$boxtidwell)[1, "write_tx"], 1)`. It seems in this case that the better solution is to add an interaction between `write_c` and `ses` to the model. I'm not going to pursue this further here. Check for multicollinearity using variance inflation factors (VIF). VIFs estimate how much the variance of a regression coefficient is inflated due to multicollinearity. When independent variables are correlated, it is difficult to say which variable really influences the dependent variable. The VIF for variable $i$ is $$ \mathrm{VIF}_i = \frac{1}{1 - R_i^2} $$ where $R_i^2$ is the coefficient of determination (i.e., the proportion of dependent variance explained by the model) of a regression of $X_i$ against all of the other predictors, $X_i = X_{j \ne i} \beta + \epsilon$. If $X_i$ is totally unrelated to its covariates, then $R_i^2$ will be zero and $\mathrm{VIF}_i$ will be 1. If $R_i^2$ is .8, $\mathrm{VIF}_i$ will be 5. The rule of thumb is that $R_i^2 \le 5$ is tolerable, and $R_i^2 > 5$ is "highly correlated" and you have to do something about it. `car::vif()` doesn't work for multinomial logistic regression. The model type is not actually important here - we're concerned about the covariate relationships. Below, I successively collapse the dependent variable into two-levels, then fit a binomial logistic regression and pipe that into `car::vif()`. ```{r collapse=TRUE} tmp_fit_general <- logistic_reg() %>% fit( prog ~ ses + write, data = cs2$dat_normalized %>% mutate(prog = fct_collapse(prog, vocation = "academic")) ) %>% extract_fit_engine() tmp_fit_general %>% car::vif() tmp_fit_vocation <- logistic_reg() %>% fit( prog ~ ses + write, data = cs2$dat_normalized %>% mutate(prog = fct_collapse(prog, general = "academic")) ) %>% extract_fit_engine() tmp_fit_vocation %>% car::vif() ``` Check for influential outliers. Outliers are predicted values greater than two standard deviations from the actual value. Influential points have a Cook's Distance greater than 4/N (4 / `r nrow(cs2$dat)` = `r 4 / nrow(cs2$dat)`.^[Cook's distance measures how much predicted values change when observation *i* is removed from the data. [This article](https://towardsdatascience.com/assumptions-of-logistic-regression-clearly-explained-44d85a22b290) suggests using three standard deviations for an outliers threshold.] Influential outliers are both. There is no simple way to do this for the multinomial regression because neither VGAM nor nnet support the `augment()` generic. Instead, I will use the two *binomial logistic* regressions from the VIF diagnostic. ```{r} outlier_dat <- bind_rows( general = augment(tmp_fit_general, type.predict = "response"), vocation = augment(tmp_fit_vocation, type.predict = "response"), .id = "logit" ) %>% mutate( id = row_number(), outlier = if_else(abs(.std.resid) >= 2, "Outlier", "Other"), influential = if_else(.cooksd > 4 / nrow(cs1$dat), "Influential", "Other"), status = case_when( outlier == "Outlier" & influential == "Influential" ~ "Influential Outlier", outlier == "Outlier" ~ "Outlier", influential == "Influential" ~ "Influential", TRUE ~ "Other" ), status = factor(status, levels = c("Influential", "Outlier", "Influential Outlier", "Other")) ) outlier_dat %>% ggplot(aes(x = .fitted, y = .cooksd)) + geom_point(aes(color = status)) + geom_text(aes(label = if_else(influential == "Influential", id, NA_integer_)), check_overlap = TRUE, size = 3, nudge_x = .025) + geom_hline(yintercept = 4 / nrow(cs1$dat), linetype = 2, color = "goldenrod") + scale_color_manual(values = c("Influential Outlier" = "firebrick", "Influential" = "goldenrod", "Outlier" = "salmon", "Other" = "black")) + theme(legend.position = "right") + labs(title = "Index Plot of Cook's Distance.", subtitle = "Row id labeled for influential points.") + facet_wrap(facets = vars(logit), ncol = 1) ``` No fitted values were influential outliers in the first fit, and only two were influential outliers in the second fit. ```{r} (cs2$outliers <- outlier_dat %>% filter(status == "Influential Outlier") %>% pull(.std.resid)) ``` An index plot of Cook's Distance shows the two points at the far left. You might examine the observations for validity. Otherwise, proceed and explain that there were two standardized residuals with value of `r comma(cs2$outliers[1], .01)` and `r comma(cs2$outliers[2], .01)` standard deviations which were kept in the analysis. ### Evaluate the Fit {-} There are several ways to evaluate the model fit. - Deviance and chi-squared tests for lack of fit - The likelihood ratio test - Pseudo R-squared^[Nice write-up [here](https://stats.oarc.ucla.edu/other/mult-pkg/faq/general/faq-what-are-pseudo-r-squareds/). Sounds like you should use pseudo R2 for model comparison, not for reporting goodness of fit]. - Accuracy measures - Gain and ROC curves The **deviance test** for lack of fit and the **likelihood ratio test** are residuals tests. The deviance residual is defined as $d_i = \mathrm{sign}(\epsilon_i) \left[ -2(y_i \log \hat{\pi}_i + (1 - y_i) \log (1 - \hat{\pi}_i)) \right]^{.5}$. The model deviance, $G^2$, is the sum of the squared deviance residuals. It also equals $G^2 = 2 \sum_{i,j}y_{ij} \log \frac{y_{ij}}{\hat{\pi}_{ij}}$. You can calculate them by hand. ```{r collapse=TRUE, warning=FALSE} # Actual values (1s and 0s for three response levels) y <- cs2$dat %>% mutate(val = 1) %>% pivot_wider(names_from = prog, values_from = val, values_fill = 0) %>% select(levels(cs2$dat$prog)) %>% as.matrix() # Predicted values (probabilities for three response levels) pi <- predict(cs2$result, type = "prob") * 1 # Raw residuals, by hand or by formula # e <- y - pi e <- residuals(cs2$result, type = "response") # Deviance residuals d <- sign(e) * sqrt(-2 * y * log(pi) + (1 - y) * log(1 - pi)) (g2 <- sum(d^2, na.rm = TRUE)) (g2 <- 2 * sum(y * log(y / pi), na.rm = TRUE)) (g2 <- deviance(cs2$result)) ``` The related Pearson statistic, $X$, is the sum of the squared *Pearson residuals*, $pr_i = \epsilon_i / \sqrt{\hat{\pi}_i}$, the raw residual scaled by dividing by the estimated standard deviation of a binomial distribution with 1 trial. I don't see this calculated in the `residual()` functions. You can do it yourself. ```{r collapse=TRUE} (x2 <- sum((e / sqrt(pi))^2)) ``` The deviance and Pearson statistic are distributed chi-squared with $(N - p)(r - 1)$ degrees of freedom where $p$ = 4 predictor variables (3 SES levels + intercept), and $r$ = 3 levels of the dependent variable for `r (nrow(cs2$dat) - 4) * (3 - 1)` degrees of freedom. The deviance and Pearson tests for lack of fit calculate the probability of the test statistic. The null hypothesis is that the model is correct. Neither test rejects the null hypothesis. ```{r collapse=TRUE} # Deviance test for lack of fit (N <- nrow(cs2$dat)) (r <- length(levels(cs2$dat$prog))) (p <- length(coef(cs2$result)) / (r - 1)) # coefficients for each level, so divide by # levels (df <- (N - p) * (r - 1)) pchisq(g2, df, lower.tail = FALSE) pchisq(x2, df, lower.tail = FALSE) ``` You can do the same calculations for the intercept-only model. ```{r collapse=TRUE, warning=FALSE, message=FALSE} io <- multinom_reg() %>% fit(prog ~ 1, data = cs2$dat) %>% extract_fit_engine() deviance(io) # degrees of freedom ((nrow(cs2$dat) - length(coef(io)) / (r - 1)) * (r - 1)) ``` The log-likelihood measures the unexplained variability in the model. The **likelihood ratio test** compares the log likelihood of the fitted model to the intercept-only model. You can use `lmtest::lrtest()` to test. `anova()` does the same thing using the residual deviance, $G2 = -2 \times \mathrm{log likelihood}$, although it does not seem to work with the `nnet` engine. ```{r} (cs2$lrtest <- lmtest::lrtest(io, cs2$result)) # (cs2$lrtest <- anova(io, cs2$result, type = "I", test = "LR")) ``` The difference in deviances is $LR$ = `r comma(cs2$lrtest$Chisq[2], .01)` with `r cs2$lrtest$Df[2]` degrees of freedom. This is distributed chi-squared, with *p*-value = `r scientific(cs2$lrtest$"Pr(>Chisq)"[2], 4)`. The deviance test for lack of fit concludes that the model fits significantly better than an empty (intercept-only) model, $\chi^2$(`r cs2$lrtest$Df[2]`) = `r comma(cs2$lrtest$Chisq[2], .01)`, p < .001. You can use `lmtest::lrtest()` to perform likelihood ratio tests on the significance of the predictors too. The likelihood ratio test compares the log likelihood with and without the predictor. Unfortunately, this does not seem to work within the `parsnip` framework. ```{r collapse=TRUE} # (cs2$lrtest_ses <- lmtest::lrtest(cs2$result, "ses")) no_ses <- multinom_reg() %>% fit(prog ~ write, data = cs2$dat) %>% extract_fit_engine() (cs2$lrtest_ses <- lmtest::lrtest(cs2$result, no_ses)) # (cs2$lrtest_write <- lmtest::lrtest(cs2$fit_nnet, "write")) no_write <- multinom_reg() %>% fit(prog ~ ses, data = cs2$dat) %>% extract_fit_engine() (cs2$lrtest_write <- lmtest::lrtest(cs2$result, no_write)) ``` Both SES, $X^2$ = `r comma(cs2$lrtest_ses$Chisq[2], .001)`, *p* = `r comma(cs2$lrtest_ses$"Pr(>Chisq)"[2], .001)`, and writing score $X^2$ = `r comma(cs2$lrtest_write$Chisq[2], .001)`, *p* = `r scientific(cs2$lrtest_write$"Pr(>Chisq)"[2], 4)`, had significant effects on the program. Logistic regression does not have a direct R-squared statistic like OLS does (the proportion of variance explained by the model). However, there are some analogs, called pseudo R-squared. You'll encounter three pseudo R-squared measures: Cox and Snell, Nagelkerke, and McFadden. This one does not work for the `nnet` engine. ```{r} # DescTools::PseudoR2(cs2$result, which = c("CoxSnell", "Nagelkerke", "McFadden")) ``` Accuracy measures formed by the cross-tabulation of observed and predicted classes is the better model fit diagnostic the the pseudo r-squares. ```{r} cs2$conf_mat <- cs2$fit %>% augment(new_data = cs2$dat_normalized) %>% # conf_mat requires truth to be first level of the factor variable. # mutate(across(c(prog, .pred_class), ~fct_relevel(., "academic"))) %>% conf_mat(truth = prog, estimate = .pred_class) cs2$conf_mat cs2$conf_mat %>% summary() cs1$conf_mat %>% autoplot() ``` The model accuracy, `r cs2$conf_mat %>% summary() %>% filter(.metric == "accuracy") %>% pull(.estimate) %>% percent(.1)`, is the percent of observations correctly classified. The sensitivities are the accuracy with regard to predicting positive cases in each level of the dependent variable. The specificities are the accuracy with regard to predicting negative cases. The prevalences are the proportion of cases that were positive. Finally, plot the [gain curve or ROC curve](https://community.tibco.com/wiki/gains-vs-roc-curves-do-you-understand-difference). In the **gain curve**, the x-axis is the fraction of items seen when sorted by the predicted value, and the y-axis is the cumulatively summed true outcome. The "wizard" curve is the gain curve when the data is sorted by the true outcome. If the model's gain curve is close to the wizard curve, then the model predicts the response well. The gray area is the "gain" over a random prediction. ```{r} cs2$dat_normalized %>% bind_cols(predict(cs2$fit, new_data = cs2$dat_normalized, type = "prob")) %>% # event_level = "second" sets the second level as success yardstick::gain_curve(.pred_academic, .pred_general, .pred_vocation, truth = prog, event_level = "second") %>% autoplot() + labs(title = "Gain Curve") ``` `r sum(cs2$dat$prog == "academic")` of the `r nrow(cs2$dat)` participants were in the academic program. - The gain curve encountered 52 academic programs (50%) within the first 72 observations (36%). It encountered all `r sum(cs2$dat$prog == "academic")` cases on the 189th observation. - The bottom of the grey area is the outcome of a random model. Only half the academic program cases would be observed within 50% of the observations. - The top of the grey area is the outcome of the perfect model, the "wizard curve". Half the academic program cases would be observed in `r sum(cs2$dat$prog == "academic")/2`/`r nrow(cs2$dat)`=`r percent(sum(cs2$dat$prog == "academic")/2 / nrow(cs2$dat), .01)` of the observations. The **ROC** (Receiver Operating Characteristics) curve plots sensitivity vs specificity at varying cut-off values for the probability ranging from 0 to 1. Ideally, you want very little trade-off between sensitivity and specificity, with a curve hugging the left and top axes. ```{r} cs2$dat_normalized %>% bind_cols(predict(cs2$fit, new_data = cs2$dat_normalized, type = "prob")) %>% # event_level = "second" sets the second level as success yardstick::roc_curve(.pred_academic, .pred_general, .pred_vocation, truth = prog, event_level = "second") %>% autoplot() + labs(title = "Gain Curve") ``` ### Reporting {-} > A multinomial logistic regression was performed to ascertain the effects of socioeconomic status (ses) and writing score on the likelihood that participants are enrolled in an academic, general, or vocation program. Linearity of the continuous variables with respect to the logit of the dependent variable was assessed via the Box-Tidwell (1962) procedure. A Bonferroni correction was applied using all eight terms in the model resulting in statistical significance being accepted when *p* < `r comma(.05/8, .00001)` (Tabachnick & Fidell, 2014). Based on this assessment, the continuous `write` independent variable was found to be linearly related to the logit of the dependent variable levels. There were two standardized residuals with value of `r comma(cs2$outliers[1], .01)` and `r comma(cs2$outliers[2], .01)` standard deviations, which were kept in the analysis. The multinomial logistic regression model was statistically significant, $\chi^2$(`r cs2$lrtest$Df[2]`) = `r comma(cs2$lrtest$Chisq[2], .01)`, p < .001. The model correctly classified `r cs2$conf_mat %>% summary() %>% filter(.metric == "accuracy") %>% pull(.estimate) %>% percent(1)` of cases. Sensitivity was `r cs2$conf_mat %>% summary() %>% filter(.metric == "sens") %>% pull(.estimate) %>% percent(1)`, specificity was `r cs2$conf_mat %>% summary() %>% filter(.metric == "spec") %>% pull(.estimate) %>% percent(1)`, positive predictive value was `r cs2$conf_mat %>% summary() %>% filter(.metric == "ppv") %>% pull(.estimate) %>% percent(1)` and negative predictive value was `r cs2$conf_mat %>% summary() %>% filter(.metric == "npv") %>% pull(.estimate) %>% percent(1)`. The `write` predictor variable was statistically significant for both outcome levels and high SES was statistically significant for the general program (as shown in Table 1). A 1SD increase in the writing score was associated with a (1 - `r cs2$result %>% coef() %>% .["general", "write"] %>% exp() %>% comma(.01)`) decrease in the odds of choosing a general program instead of academic and a (1 - `r cs2$result %>% coef() %>% .["vocation", "write"] %>% exp() %>% comma(.01)`) decrease in the odds of choosing a vocation program over academic. A high SEC was associated with a (1 - `r cs2$result %>% coef() %>% .["general", "seshigh"] %>% exp() %>% comma(.01)`) decrease in the odds of chooseing a general program over academic. ```{r message=FALSE, warning=FALSE} gtsummary::tbl_regression( cs2$result, exponentiate = TRUE ) %>% gtsummary::as_flex_table() %>% flextable::theme_apa() ``` ## Ordinal Logistic Regression {#ordinallogistic} Ordinal logistic regression, also call cumulative link model (CLM), is a generalized linear model (GZLM), an extension of the general linear model (GLM) to non-continuous outcome variables. There are many approaches to ordinal logistic regression, including cumulative, adjacent, and continuation categories, but the most popular is the **cumulative odds ordinal logistic regression with proportional odds**.^[These notes rely on [UVA](https://data.library.virginia.edu/fitting-and-interpreting-a-proportional-odds-model/), [PSU](https://online.stat.psu.edu/stat504/), [Laerd](https://statistics.laerd.com/spss-tutorials/ordinal-regression-using-spss-statistics.php), and the CLM package [article vignette](https://cran.r-project.org/web/packages/ordinal/vignettes/clm_article.pdf)]. The model for ordinal response random variable $Y_i$ with $J$ levels is $$\gamma_{ij} = F(\eta_{ij}), \hspace{5 mm} \eta_{ij} = \theta_j - x_i^\mathrm{T}\beta, \hspace{5 mm} i = 1, \ldots, n, \hspace{5 mm} j = 1, \ldots, J-1$$ where $\gamma_{ij} = P(Y_i \le j) = \pi_{i1} + \cdots + \pi_{ij}$. $\eta_{ij}$ is a linear predictor with $J-1$ intercepts. $F$ is the inverse link function. The regression models the logit link function of $\gamma_{ij}$. $$\mathrm{logit}(\gamma_{ij}) = \log \left[\frac{P(Y_i \le j)}{P(Y_i \gt j)} \right] = \theta_j - x_i^\mathrm{T}\beta$$ The cumulative logit is the log-odds of the cumulative probabilities that the response is in category $\le j$ versus $\gt j$. $\theta_j$ is the log-odds when $x_i^\mathrm{T}=0$ and $\beta$ is the increase in the log odds attributed to a one unit increase in $x_i^\mathrm{T}$. Notice $\beta$ is the same for all $j$. The exponential of the predicted value is the odds. Solve this for the probability, $$P(Y_i \gt j) = \frac{\mathrm{exp}(\hat{y}_i)}{1 + \mathrm{exp}(\hat{y}_i)}.$$ The exponential of $\beta$ is the odds ratio of $x_1^\mathrm{T} - x_0^\mathrm{T}$. Solve this for the odds ratio $$\mathrm{OR} = \frac{\mathrm{exp}(\theta_j - x_1^\mathrm{T}\beta)}{\mathrm{exp}(\theta_j - x_2^\mathrm{T}\beta)} = \mathrm{exp}(\beta(x_1^\mathrm{T} - x_0^\mathrm{T}))$$ If $x$ is a binary factor factor, then $\exp(\beta)$ is the odds ratio of $x=1$ vs $x=0$. Thus the odds-ratio is *proportional* to the difference between values of $x$ and $\beta$ is the constant of proportionality. The model is estimated with a regularized Newton-Raphson algorithm with step-halving (line search) using analytic expressions for the gradient and Hessian of the negative log-likelihood function. This is beyond me, but the upshot is that the estimation is an iterative maximization exercise, not a formulaic matrix algebra process. It is possible for the model estimation to fail to converge on a maximum. You will sometimes encounter discussion about the *latent variable*. That is just the underlying quality you are trying to measure. If you rate something a 4 on a 5-level Likert scale, 4 is the expression of your valuation, the latent variable. Your precise valuation is somewhere between 3 and 5 on a continuous scale. The link function defines the distribution of the latent variable. There are variations on the ordinal model. Structured thresholds impose restrictions on $\theta_j$, for example requiring equal distances between levels. Partial proportional odds allow $\theta_j$ to vary with nominal predictors. You can also use link functions other than logit. There are two assumptions underling ordinal logistic regression: (a) no multicollinearity, and (b) proportional odds. ### Case Study {.unnumbered #cs3} This case study uses the [Laerd Statistics](https://statistics.laerd.com/) article on ordinal logistic regression [data set](https://statistics.laerd.com/premium/spss/spss-files/ordinal-logistic-regression.sav). A study investigates the relationship between attitude toward tax levels and participant values and background. 192 participants in a study responded to the statement "Taxes are too high" on a 4-level Likert scale (`tax_too_high`, *Strongly Disagree*, *Disagree*, *Agree*, *Strongly Agree*). Participant attributes include business owner (*Y*|*N*), age, and political affiliation (*Liberal*, *Conservative*, *Labor*). ```{r} cs3 <- list() cs3$dat <- foreign::read.spss("./input/ordinal-logistic-regression.sav", to.data.frame = TRUE) %>% mutate(tax_too_high = factor(tax_too_high, ordered = TRUE), biz_owner = fct_relevel(biz_owner, "No", "Yes"), politics = fct_relevel(politics, "Lab")) %>% select(-c(biz_friends, uni_educated, income)) cs3$dat %>% gtsummary::tbl_summary( by = politics, statistic = list(gtsummary::all_continuous() ~ "{mean} ({sd})") ) ``` <br> ```{r} cs3$dat %>% mutate(age = case_when(age < quantile(age, .33) ~ "young", age < quantile(age, .67) ~ "middle", TRUE ~ "old"), age = factor(age, levels = c("young", "middle", "old"))) %>% count(tax_too_high, biz_owner, age, politics) %>% ggplot(aes(x = tax_too_high, y = n, fill = biz_owner)) + geom_col(position = position_dodge2(preserve = "single")) + facet_grid(rows = vars(age), cols = vars(politics), space = "free") + scale_x_discrete(labels = function (x) str_wrap(x, width = 10)) + theme_bw() + theme(legend.position = "bottom") + labs(title = "Taxes too high?", subtitle = "Reponse count by business owner, age, and politics.") ``` ### Fit the Model {-} Fit a cumulative link model for the cumulative probability of the $i$th response falling in $j$th category or below where $i$ indexes the ($n = 192$) responses, $j = 1, \ldots, J$ indexes the ($J = 4$) response categories, and $\theta_j$ is the threshold for the $j$th cumulative logit. $$\mathrm{logit}(P(Y_i \le j)) = \theta_j - \beta_1(\mathrm{politics}_i) - \beta_2(\mathrm{biz\_owner}_i) - \beta_3(\mathrm{age}_i)$$ ```{r} cs3$fmla <- formula(tax_too_high ~ biz_owner + age + politics) cs3$result <- ordinal::clm(cs3$fmla, data = cs3$dat) cs3$result %>% summary() ``` The summary object shows several fit statistics. More about these in the fit evaluation section below. The Coefficients table is the familiar parameter estimates. The coefficient estimate for `biz_ownerYes` is `r coef(summary(cs3$result))["biz_ownerYes", "Estimate"] %>% comma(.001)` with standard error `r coef(summary(cs3$result))["biz_ownerYes", "Std. Error"] %>% comma(.001)`, $z = \hat{\beta} / se =$ `r coef(summary(cs3$result))["biz_ownerYes", "z value"] %>% comma(.001)`, and $p = 2 \cdot P(Z>z) =$ `r coef(summary(cs3$result))["biz_ownerYes", "Pr(>|z|)"] %>% comma(.001)`. Some programs (e.g., SPSS) also show the Wald chi-squared statistic, $z^2 =$ `r coef(summary(cs3$result))["biz_ownerYes", "z value"]^2 %>% comma(.001)`. The square of a normal variable has a $\chi^2$ distribution, so the *p* value for the Wald chi-squared statistic is `pchisq(z^2, df = 1)` = `r coef(summary(cs3$result))["biz_ownerYes", "z value"]^2 %>% pchisq(df = 1, lower.tail = FALSE) %>% comma(.001)`. The Threshold coefficients table are the intercepts, or cut-points. The first cut-point is the log-odds of response level *Strongly Disagree* (or less) vs greater than *Strongly Disagree* when all factor variables are at their reference level and the continuous vars are at 0. There may be interaction effects between `biz_owner` and `politics`. You can check this by comparing the log likelihood to the saturated model with a likelihood ratio test. ```{r} saturated <- ordinal::clm(tax_too_high ~ biz_owner*politics + age, data = cs3$dat) (cs3$sat_anova <- anova(cs3$result, saturated)) ``` The likelihood ratio test indicates the main-effects model fits about the same as the saturated model, $\chi^2$(`r cs3$sat_anova["saturated", "df"]`) = `r cs3$sat_anova["saturated", "LR.stat"] %>% comma(.01)`, *p* = `r cs3$sat_anova["saturated", "Pr(>Chisq)"] %>% comma(.001)`) ### Assumptions {-} Cumulative odds ordinal logistic regression with proportional odds models require a) no multicollinearity, and b) proportional odds. **Multicollinearity** occurs when two or more independent variables are *highly correlated* so that they do not provide unique or independent information in the regression model. Multicollinearity inflates the variances of the estimated coefficients, resulting in larger confidence intervals. The usual interpretation of a slope coefficient as the change in the mean response per unit increase in the predictor when all the other predictors are held constant breaks down because changing one predictor *necessarily* changes other predictors. Test for multicollinearity with variance inflation factors (VIF). The VIF is the inflation percentage of the parameter variance due to multicollinearity. E.g., a VIF of 1.9 means the parameter variance is 90% larger than what it would be if it was not correlated with other predictors. Predictor k's variance, $Var(\hat{\beta_k})$, is inflated by a factor of $$\mathrm{VIF}_k = \frac{1}{1 - R_k^2}$$ due to collinearity with other predictors, where $R_k^2$ is the $R^2$ of a regression of the $k^{th}$ predictor on the remaining predictors. If predictor $k$ is unrelated to the other variables, $R_k^2 = 0$ and $VIF = 1$ (no variance inflation). If 100% of the variance in predictor $k$ is explained by the other predictors, then $R_k^2 = 1$ and $VIF = \infty$. The rule of thumb is that $VIF \le 5$ is acceptable. ```{r} # Cannot use CLM model with vif(). Re-express as a linear model. lm(as.numeric(tax_too_high) ~ politics + biz_owner + age, dat = cs3$dat) %>% car::vif() ``` The VIFs in column GVIF are all below 5, so this model is not compromised by multicollinearity. The **proportional odds** assumption means the independent variable effects are constant across each cumulative split of the ordinal dependent variable. Test for proportional odds using a full [likelihood ratio test](https://stats.idre.ucla.edu/other/mult-pkg/faq/general/faqhow-are-the-likelihood-ratio-wald-and-lagrange-multiplier-score-tests-different-andor-similar/) comparing the proportional odds model with a multinomial logit model, also called an unconstrained baseline logit model. This is also called the *test of parallel lines*. The multinomial logit model fits a slope to each of the $J – 1$ levels. The proportional odds model is nested within the multinomial model, so you can use a likelihood ratio test to see if the models are statistically different. Fit the proportional odds model and a multinomial model using `VGAM::vglm()` and capture the log likelihoods and degrees of freedom. Perform a likelihood ratio test on the differences in log likelihoods, $D = -2 \mathrm{loglik}(\beta)$. ```{r} cs3$vglm_ordinal <- VGAM::vglm(cs3$fmla, VGAM::propodds, data = cs3$dat) cs3$vglm_multinomial <- VGAM::vglm(cs3$fmla, VGAM::cumulative, data = cs3$dat) (cs3$po_lrt <- VGAM::lrtest(cs3$vglm_multinomial, cs3$vglm_ordinal)) ``` > The assumption of proportional odds was met, as assessed by a full likelihood ratio test comparing the fit of the proportional odds model to a model with varying location parameters, $\chi^2$(8) = `r cs3$po_lrt %>% slot("Body") %>% tail(1) %>% pull(Chisq) %>% scales::number(accuracy = .001)`, *p* = `r cs3$po_lrt %>% slot("Body") %>% tail(1) %>% pull("Pr(>Chisq)") %>% scales::number(accuracy = .001)`. Another option is the partial proportional odds test. This test locates specific variables causing the rejection of proportional odds. ```{r} (cs3$po_lrt2 <- ordinal::clm(cs3$fmla, data = cs3$dat) %>% ordinal::nominal_test()) ``` > The assumption of proportional odds was met, as assessed by a full likelihood ratio test comparing the fit of the proportional odds model to a model with varying location parameters for business owner, $\chi^2$(`r cs3$po_lrt2["biz_owner", "Df"]`) = `r cs3$po_lrt2["biz_owner", "LRT"] %>% comma(.001)`, *p* = `r cs3$po_lrt2["biz_owner", "Pr(>Chi)"] %>% comma(.001)` and politics, $\chi^2$(`r cs3$po_lrt2["politics", "Df"]`) = `r cs3$po_lrt2["politics", "LRT"] %>% comma(.001)`, *p* = `r cs3$po_lrt2["politics", "Pr(>Chi)"] %>% comma(.001)`. ### Evaluate the Fit {-} There are three ways to assess overall model fit: The Deviance and Pearson goodness-of-fit tests of the overall model fit; the Cox and Snell, Nagelkerke, and McFadden pseudo R measures of explained variance; and the likelihood ratio test comparing the model fit to the intercept-only model. However, these tests rely on large frequencies in each *cell*, that is, each possible combination of predictor values. Overall goodness-of-fit statistics should be treated with suspicion when a continuous independent variable is present and/or there are a large number of cells with zero frequency. The **Pearson goodness-of-fit** statistic is $X^2 = \sum \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$ where $i$ is the observation number and $j$ is the response variable level. It is a summary of the Pearson residuals, the difference between the observed and expected cell counts, $O_{ij} - E_{ij}$. The **deviance goodness-of-fit** statistic is the difference in fit between the model and a full model; a full model being a model that fits the data perfectly, $G^2 = 2 \sum_{ij} O_{ij} \log \left( \frac{O_{ij}}{E_{ij}} \right)$. Neither of these tests are reliable if there are many cells with zero frequencies and/or small expected frequencies and are generally not recommended. Generally, the chi-squared test requires a frequency count of at least 5 per cell. ```{r} # Observed combinations of model vars cs3$cell_patterns <- cs3$dat %>% count(biz_owner, age, politics, tax_too_high) %>% nrow() # Observed combinations of predictor vars * levels of response var cs3$covariate_patterns <- cs3$dat %>% count(biz_owner, age, politics) %>% nrow() cs3$possible_cells <- cs3$covariate_patterns * length(levels(cs3$dat$tax_too_high)) # 1 - ratio of observed to possible cs3$pct_freq_zero <- 1 - cs3$cell_patterns / cs3$possible_cells ``` There are `r cs3$cell_patterns` observed combinations of model variables (predictors), and `r cs3$possible_cells` possible combinations (predictors * outcome levels), so `r cs3$pct_freq_zero %>% scales::percent(accuracy = .1)` of cells have zero frequencies. Ideally, zero frequencies should be less than 20%, so if you were to use the deviance or Pearson tests, you would need to report this. The results below are contradictory and bogus. I think you'd only use this test if you didn't have continuous predictor variables. ```{r} observed <- cs3$dat %>% count(biz_owner, age, politics, tax_too_high) %>% pivot_wider(names_from = tax_too_high, values_from = n, values_fill = 0) %>% pivot_longer( cols = `Strongly Disagree`:`Strongly Agree`, names_to = "outcome", values_to = "observed" ) expected <- bind_cols( cs3$dat, cs3$result %>% predict(newdata = cs3$dat %>% select(-"tax_too_high")) %>% data.frame() ) %>% rename_with(~str_remove(., "fit\\."), starts_with("fit")) %>% rename_with(~str_replace(., "\\.", " ")) %>% summarize(.by = c(biz_owner, age, politics), across(`Strongly Disagree`:`Strongly Agree`, sum)) %>% pivot_longer(cols = `Strongly Disagree`:`Strongly Agree`, names_to = "outcome", values_to = "expected") obs_exp <- observed %>% inner_join(expected, by = c("politics", "biz_owner", "age", "outcome")) %>% mutate(epsilon_sq = (observed - expected)^2, chi_sq = epsilon_sq / expected, g_sq = 2 * observed * log((observed+.0001) / expected) ) cs3$chisq <- list() cs3$chisq$X2 = sum(obs_exp$chi_sq) cs3$chisq$G2 = sum(obs_exp$g_sq) cs3$chisq$df = cs3$covariate_patterns * (length(levels(cs3$dat$tax_too_high)) - 1) - 7 cs3$chisq$X2_p.value = pchisq(cs3$chisq$X2, df = cs3$chisq$df, lower.tail = FALSE) cs3$chisq$G2_p.value = pchisq(cs3$chisq$G2, df = cs3$chisq$df, lower.tail = FALSE) ``` > The Pearson goodness-of-fit test indicated that the model was *not* a good fit to the observed data, $\chi^2$(`r cs3$chisq$df`) = `r cs3$chisq$X2 %>% scales::comma(accuracy = 0.1)`, *p* < .001$. The deviance goodness-of-fit test indicated that the model was a good fit to the observed data, $G^2$(`r cs3$chisq$df`) = `r cs3$chisq$G2 %>% scales::comma(accuracy = 0.1)`, *p* = `r cs3$chisq$G2_p.value %>% scales::comma(accuracy = 0.001)`. There are a number of measures in ordinal regression that attempt to provide a similar "variance explained" measure as that provided in ordinary least-squares linear regression. However, these measures do not have the direct interpretation that they do in ordinary linear regression and are often, therefore, referred to as **"pseudo" R2 measures**. The three most common measures (Cox and Snell, Nagelkerke, and McFadden) are not particularly good and not universally used. It is presented in the SPSS output, so you might encounter it in published work. ```{r} cs3$nagelkerke <- rcompanion::nagelkerke(cs3$result) cs3$nagelkerke$Pseudo.R.squared.for.model.vs.null ``` The best way to assess model fit is the **likelihood ratio test** comparing the model to an intercept-only model. The difference in the -2 log likelihood between the models has a $\chi^2$ distribution with degrees of freedom equal to the difference in the number of parameters. ```{r} intercept_only <- ordinal::clm(tax_too_high ~ 1, data = cs3$dat) cs3$lrt <- anova(cs3$result, intercept_only) cs3$lrt ``` The table shows the log likelihoods of the two models. LR.stat is the difference between 2 * the logLik values. > The final model statistically significantly predicted the dependent variable over and above the intercept-only model, $\chi^2(4)$ = `r cs3$lrt$LR.stat[2] %>% comma(.1)`, *p* = `r cs3$lrt[[6]][2] %>% comma(.001)`. ### Interpret Results {-} Return to the model summary. ```{r} tidy(cs3$clm) ``` The coefficients for `biz_owner`, `age`, and politics are positive. Positive parameters *increase* the likelihood of stronger agreement with the statement. In this case, discontent with taxes are higher for business owners, increase with age, and are higher for Liberal Democrats and Conservatives relative to the Labor Party. The expected cumulative log-odds of declaring $\le j$ level of agreement with the statement for the baseline group (`biz_ownerNo`, `age = 0`, `politicsLib`) is `r cs3$clm$coefficients["Strongly Disagree|Disagree"] %>% scales::comma(accuracy = .001)` for $j = 1$ (Strongly Disagree), `r cs3$clm$coefficients["Disagree|Agree"] %>% scales::comma(accuracy = .001)` for $j = 2$ (Disagree), and `r cs3$clm$coefficients["Agree|Strongly Agree"] %>% scales::comma(accuracy = .001)` for $j = 3$ (Agree). You could solve the logit equation for $$\pi_j = \frac{\mathrm{exp}(Y_i)} {1 + \mathrm{exp}(Y_i)}$$ to get the cumulative probabilities for each level. That's what `predict(type = "cum.prob")` does. But it might be more intuitive to work with individual probabilities, the lagged differences to get the individual probabilities for each $j$. That's what `predict(type = "prob")` does. I like to play with predicted values to get a sense of the outcome distributions. In this case, I'll take the median `age`, and each combination of `biz_owner` and `politics`. ```{r fig.height=2.5, fig.width=6.5} new_data <- cs3$dat %>% mutate(age = median(cs3$dat$age)) %>% expand(age, politics, biz_owner) preds <- predict(cs3$result, newdata = new_data, type = "prob")[["fit"]] %>% as.data.frame() bind_cols(new_data, preds) %>% pivot_longer(cols = `Strongly Disagree`:`Strongly Agree`) %>% mutate(name = factor(name, levels = levels(cs3$dat$tax_too_high))) %>% ggplot(aes(y = politics, x = value, fill = fct_rev(name))) + geom_col() + geom_text(aes(label = scales::percent(value, accuracy = 1)), size = 3, position = position_stack(vjust=0.5)) + facet_grid(~paste("Bus Owner = ", biz_owner)) + scale_fill_grey(start = 0.5, end = 0.8) + theme_bw() + theme(legend.position = "top", axis.text.x = element_blank(), axis.ticks.x = element_blank()) + guides(fill = guide_legend(reverse = TRUE)) + labs(title = "Taxes too High for Conservative Business Owners?", x = NULL, fill = NULL) ``` You will want to establish whether `politics` is statistically significant overall before exploring any specific contrasts. The ANOVA procedure with type I test reports an overall test of significance for each variable entered into the model. ```{r} (cs3$anovaI <- anova(cs3$result, type = "I")) ``` > The political party last voted for has a statistically significant effect on the prediction of whether tax is thought to be too high, Wald $\chi^2$(`r cs3$anovaI$Df[3]`) = `r cs3$anovaI$Chisq[3] %>% scales::comma(accuracy = .1)`, *p* = `r cs3$anovaI[[3]][3] %>% scales::comma(accuracy = .001)`. The best way to work with the data is with the `tidy(exponentiate = TRUE)` version. ```{r} cs3$tidy <- cs3$result %>% tidy(conf.int = TRUE, exponentiate = TRUE) cs3$tidy ``` Then you can summarize the table in words. > The odds of business owners considering tax to be too high was `r cs3$tidy %>% filter(term == "biz_ownerYes") %>% pull(estimate) %>% scales::comma(accuracy = .001)` (95% CI, `r cs3$tidy %>% filter(term == "biz_ownerYes") %>% pull(conf.low) %>% scales::comma(accuracy = .001)` to `r cs3$tidy %>% filter(term == "biz_ownerYes") %>% pull(conf.high) %>% scales::comma(accuracy = .001)`) times that of non-business owners, a statistically significant effect, *z* = `r cs3$tidy %>% filter(term == "biz_ownerYes") %>% pull(statistic) %>% scales::comma(accuracy = .001)`, *p* = `r cs3$tidy %>% filter(term == "biz_ownerYes") %>% pull(p.value) %>% scales::comma(accuracy = .001)`. > The odds of Conservative voters considering tax to be too high was `r cs3$tidy %>% filter(term == "politicsCon") %>% pull(estimate) %>% scales::comma(accuracy = .001)` (95% CI, `r cs3$tidy %>% filter(term == "politicsCon") %>% pull(conf.low) %>% scales::comma(accuracy = .001)` to `r cs3$tidy %>% filter(term == "politicsCon") %>% pull(conf.high) %>% scales::comma(accuracy = .001)`) times that of Labour voters, a statistically significant effect, *z* = `r cs3$tidy %>% filter(term == "politicsCon") %>% pull(statistic) %>% scales::comma(accuracy = .001)`, *p* = `r cs3$tidy %>% filter(term == "politicsCon") %>% pull(p.value) %>% scales::comma(accuracy = .001)`. The odds of Liberal Democrat voters considering tax to be too high was similar to that of Labour voters (odds ratio of `r cs3$tidy %>% filter(term == "politicsLib") %>% pull(estimate) %>% scales::comma(accuracy = .001)` (95% CI, `r cs3$tidy %>% filter(term == "politicsLib") %>% pull(conf.low) %>% scales::comma(accuracy = .001)` to `r cs3$tidy %>% filter(term == "politicsLib") %>% pull(conf.high) %>% scales::comma(accuracy = .001)`), *p* = `r cs3$tidy %>% filter(term == "politicsLib") %>% pull(p.value) %>% scales::comma(accuracy = .001)`. > An increase in age (expressed in years) was associated with an increase in the odds of considering tax too high, with an odds ratio of `r cs3$tidy %>% filter(term == "age") %>% pull(estimate) %>% scales::comma(accuracy = .001)` (95% CI, `r cs3$tidy %>% filter(term == "age") %>% pull(conf.low) %>% scales::comma(accuracy = .001)` to `r cs3$tidy %>% filter(term == "age") %>% pull(conf.high) %>% scales::comma(accuracy = .001)`), *z* = `r cs3$tidy %>% filter(term == "age") %>% pull(statistic) %>% scales::comma(accuracy = .001)`, *p* = `r cs3$tidy %>% filter(term == "age") %>% pull(p.value) %>% scales::comma(accuracy = .001)`. ### Reporting {-} Here is the complete write-up. > A cumulative odds ordinal logistic regression with proportional odds was run to determine the effect of business ownership, political party voted for, and age, on the belief that taxes are too high. There were proportional odds, as assessed by a full likelihood ratio test comparing the fitted model to a model with varying location parameters, $\chi^2$(8) = `r cs3$po_lrt %>% slot("Body") %>% tail(1) %>% pull(Chisq) %>% scales::number(accuracy = .001)`, *p* = `r cs3$po_lrt %>% slot("Body") %>% tail(1) %>% pull("Pr(>Chisq)") %>% scales::number(accuracy = .001)`. The final model statistically significantly predicted the dependent variable over and above the intercept-only model, $\chi^2(4)$ = `r cs3$lrt$LR.stat[2] %>% scales::comma(accuracy = .1)`, *p* = `r cs3$lrt[[6]][2] %>% scales::comma(accuracy = .001)`. The odds of business owners considering tax to be too high was `r cs3$tidy %>% filter(term == "biz_ownerYes") %>% pull(estimate) %>% scales::comma(accuracy = .001)` (95% CI, `r cs3$tidy %>% filter(term == "biz_ownerYes") %>% pull(conf.low) %>% scales::comma(accuracy = .001)` to `r cs3$tidy %>% filter(term == "biz_ownerYes") %>% pull(conf.high) %>% scales::comma(accuracy = .001)`) times that of non-business owners, a statistically significant effect, *z* = `r cs3$tidy %>% filter(term == "biz_ownerYes") %>% pull(statistic) %>% scales::comma(accuracy = .001)`, *p* = `r cs3$tidy %>% filter(term == "biz_ownerYes") %>% pull(p.value) %>% scales::comma(accuracy = .001)`. The political party last voted for has a statistically significant effect on the prediction of whether tax is thought to be too high, Wald $\chi^2$(`r cs3$anovaI$Df[3]`) = `r cs3$anovaI$Chisq[3] %>% scales::comma(accuracy = .1)`, *p* = `r cs3$anovaI[[3]][3] %>% scales::comma(accuracy = .001)`. The odds of Conservative voters considering tax to be too high was `r cs3$tidy %>% filter(term == "politicsCon") %>% pull(estimate) %>% scales::comma(accuracy = .001)` (95% CI, `r cs3$tidy %>% filter(term == "politicsCon") %>% pull(conf.low) %>% scales::comma(accuracy = .001)` to `r cs3$tidy %>% filter(term == "politicsCon") %>% pull(conf.high) %>% scales::comma(accuracy = .001)`) times that of Labour voters, a statistically significant effect, *z* = `r cs3$tidy %>% filter(term == "politicsCon") %>% pull(statistic) %>% scales::comma(accuracy = .001)`, *p* = `r cs3$tidy %>% filter(term == "politicsCon") %>% pull(p.value) %>% scales::comma(accuracy = .001)`. The odds of Liberal Democrat voters considering tax to be too high was similar to that of Labour voters (odds ratio of `r cs3$tidy %>% filter(term == "politicsLib") %>% pull(estimate) %>% scales::comma(accuracy = .001)` (95% CI, `r cs3$tidy %>% filter(term == "politicsLib") %>% pull(conf.low) %>% scales::comma(accuracy = .001)` to `r cs3$tidy %>% filter(term == "politicsLib") %>% pull(conf.high) %>% scales::comma(accuracy = .001)`), *p* = `r cs3$tidy %>% filter(term == "politicsLib") %>% pull(p.value) %>% scales::comma(accuracy = .001)`. An increase in age (expressed in years) was associated with an increase in the odds of considering tax too high, with an odds ratio of `r cs3$tidy %>% filter(term == "age") %>% pull(estimate) %>% scales::comma(accuracy = .001)` (95% CI, `r cs3$tidy %>% filter(term == "age") %>% pull(conf.low) %>% scales::comma(accuracy = .001)` to `r cs3$tidy %>% filter(term == "age") %>% pull(conf.high) %>% scales::comma(accuracy = .001)`), *z* = `r cs3$tidy %>% filter(term == "age") %>% pull(statistic) %>% scales::comma(accuracy = .001)`, *p* = `r cs3$tidy %>% filter(term == "age") %>% pull(p.value) %>% scales::comma(accuracy = .001)`. Package **gtsummary** shows a nice summary table. ```{r} gtsummary::tbl_regression(cs3$result) ``` ## Poisson Regression {#poissonregression} Poisson models count data, like "traffic tickets per day", or "website hits per day". The response is an expected *rate* or intensity. For count data, specify the generalized model, this time with `family = poisson` or `family = quasipoisson`. Recall that the probability of achieving a count $y$ when the expected rate is $\lambda$ is distributed $$P(Y = y|\lambda) = \frac{e^{-\lambda} \lambda^y}{y!}.$$ The poisson regression model is $$\lambda = \exp(X \beta).$$ You can solve this for $y$ to get $$y = X\beta = \ln(\lambda).$$ That is, the model predicts the log of the response rate. For a sample of size *n*, the likelihood function is $$L(\beta; y, X) = \prod_{i=1}^n \frac{e^{-\exp({X_i\beta})}\exp({X_i\beta})^{y_i}}{y_i!}.$$ The log-likelihood is $$l(\beta) = \sum_{i=1}^n (y_i X_i \beta - \sum_{i=1}^n\exp(X_i\beta) - \sum_{i=1}^n\log(y_i!).$$ Maximizing the log-likelihood has no closed-form solution, so the coefficient estimates are found through interatively reweighted least squares. Poisson processes assume the variance of the response variable equals its mean. "Equals" means the mean and variance are of a similar order of magnitude. If that assumption does not hold, use the quasi-poisson. Use Poisson regression for large datasets. If the predicted counts are much greater than zero (>30), the linear regression will work fine. Whereas RMSE is not useful for logistic models, it is a good metric in Poisson. ### Case Study 4 {.unnumbered #cs4} Dataset `fire` contains response variable `injuries` counting the number of injuries during the month and one explanatory variable, the month `mo`. ```{r eval=FALSE} fire <- read_csv(file = "C:/Users/mpfol/OneDrive/Documents/Data Science/Data/CivilInjury_0.csv") fire <- fire %>% mutate(mo = as.POSIXlt(`Injury Date`)$mon + 1) %>% rename(dt = `Injury Date`, injuries = `Total Injuries`) str(fire) ``` In a situation like this where there the relationship is bivariate, start with a visualization. ```{r eval=FALSE} ggplot(fire, aes(x = mo, y = injuries)) + geom_jitter() + geom_smooth(method = "glm", method.args = list(family = "poisson")) + labs(title = "Injuries by Month") ``` Fit a poisson regression in R using `glm(formula, data, family = poisson)`. But first, check whether the mean and variance of `injuries` are the same magnitude? If not, then use `family = quasipoisson`. ```{r eval=FALSE} mean(fire$injuries) var(fire$injuries) ``` They are of the same magnitude, so fit the regression with `family = poisson`. ```{r eval=FALSE} m2 <- glm(injuries ~ mo, family = poisson, data = fire) summary(m2) ``` The *predicted* value $\hat{y}$ is the estimated **log** of the response variable, $$\hat{y} = X \hat{\beta} = \ln (\lambda).$$ Suppose `mo` is January (mo = `), then the log of `injuries` is $\hat{y} = 0.323787$. Or, more intuitively, the expected count of injuries is $\exp(0.323787) = 1.38$ ```{r eval=FALSE} predict(m2, newdata = data.frame(mo=1)) predict(m2, newdata = data.frame(mo=1), type = "response") ``` Here is a plot of the predicted counts in red. ```{r eval=FALSE} augment(m2, type.predict = "response") %>% ggplot(aes(x = mo, y = injuries)) + geom_point() + geom_point(aes(y = .fitted), color = "red") + scale_y_continuous(limits = c(0, NA)) + labs(x = "Month", y = "Injuries", title = "Poisson Fitted Line Plot") ``` Evaluate a logistic model fit with an analysis of deviance. ```{r eval=FALSE} (perf <- glance(m2)) (pseudoR2 <- 1 - perf$deviance / perf$null.deviance) ``` The deviance of the null model (no regressors) is 139.9. The deviance of the full model is 132.2. The psuedo-R2 is very low at .05. How about the RMSE? ```{r eval=FALSE} RMSE(pred = predict(m2, type = "response"), obs = fire$injuries) ``` The average prediction error is about 0.99. That's almost as much as the variance of `injuries` - i.e., just predicting the mean of `injuries` would be almost as good! Use the `GainCurvePlot()` function to plot the gain curve. ```{r eval=FALSE} augment(m2, type.predict = "response") %>% ggplot(aes(x = injuries, y = .fitted)) + geom_point() + geom_smooth(method ="lm") + labs(x = "Actual", y = "Predicted", title = "Poisson Fitted vs Actual") ``` ```{r eval=FALSE} augment(m2) %>% data.frame() %>% GainCurvePlot(xvar = ".fitted", truthVar = "injuries", title = "Poisson Model") ``` It seems that `mo` was a poor predictor of `injuries`.