I’m modeling the effect of the number of cylinders on vehicle fuel economy. I don’t think cyl should matter if you control for disp and gear. Take a look at this regression output. What do you think? The coefficient for cyl=6 is -4.8 (p = .009), but cyl=8 is indistinguishable from cyl=4.
The problem with this regression is that it holds cyl=4 as the standard for comparison. That’s more aggressive than what I’m after. I want to know if any of the cyl levels stand out relative to the whole, not whether any two particular levels differ. In this case, at least cyl=4 and cyl=6 differed. If cyl8 was the reference, none of the p-values would be significant.
I think I have a better way: report contrasts from the estimated marginal means (EMMs).
9.1 Calculating EMMs
Estimated marginal means (EMMs) are the mean predicted values over supplies set of predictor values. Whereas ordinary marginal means are conditional means of the raw data, EMMs are conditional means of modeled data.
library(emmeans)
One way to get at the impact of cyl is by calculating the expected mpg by cyl, holding the other predictors fixed at some realistic value. You might set disp to the dataset average. For factors like gear you can make a prediction for each level. That’s nine combinations of predictors.
This predictor dataset is called a reference grid. emmeans() creates a reference grid, makes predictions, then averages them by the specs argument. specs = "1" averages over the whole grid, mean(my_preds$.fitted) = 20.1.
emmeans(my_fit, specs ="1")
1 emmean SE df lower.CL upper.CL
overall 20.1 0.665 26 18.7 21.5
Results are averaged over the levels of: cyl, gear
Confidence level used: 0.95
I’m interested in cyl. specs = "cyl" will average across the three gear levels. Specifying weights = "proportional" will weight EMMs by the number of rows in the original data for each level of cyl.
cyl emmean SE df lower.CL upper.CL
4 23.3 1.69 26 19.8 26.7
6 18.5 1.29 26 15.8 21.1
8 18.4 1.61 26 15.1 21.7
Results are averaged over the levels of: gear
Confidence level used: 0.95
What does this get us? The expected mpg for a car with 4 cylinders is 23.3. That’s 5 mpg greater than the other two. The SEs are pretty large though, so I can’t conclude fuel economy varies by cyl. For that, I need a contrast.
9.2 Level Comparisons
There’s a multitude of ways to analyze EMMs. You can compare them to each other, to the overall mean, to a single level, and so many other ways in the contrast documentation that I don’t understand.
Pairwise
Let’s start with pairwise comparisons. This is definitely not how I want to compare cyl levels, but it’s a useful introduction.
pairs(emm_cyl)
contrast estimate SE df t.ratio p.value
cyl4 - cyl6 4.7998 1.71 26 2.801 0.0248
cyl4 - cyl8 4.8559 3.06 26 1.585 0.2697
cyl6 - cyl8 0.0561 2.41 26 0.023 0.9997
Results are averaged over the levels of: gear
P value adjustment: tukey method for comparing a family of 3 estimates
The EMM of cyl4 is 4.7998 mpg greater than cyl6 and statistically significant (p = .0248). cyl4 - cyl8 is large, but not significant. This is what the regression output was saying, but pairs() gives me cyl6 - cyl8 too.
A more general way to report paired differences is with Cohen’s d effect size. It standardizes the differences by dividing by the model residual standard error. An effect size >.8 is large, >.5 medium, and >.2 small.
contrast effect.size SE df lower.CL upper.CL
cyl4 - cyl6 1.5693 0.601 26 0.334 2.80
cyl4 - cyl8 1.5876 1.030 26 -0.520 3.70
cyl6 - cyl8 0.0183 0.788 26 -1.601 1.64
Results are averaged over the levels of: gear
sigma used for effect sizes: 3.059
Confidence level used: 0.95
But again, I’m interested in identifying cyl levels that stand out from the rest, not ferreting out two levels that differ from each other.
One vs Whole
Instead of comparing cyl levels to each other, I’ll compare them to the whole using a contrast. The default contrast subtracts the grand mean, the mean of the three EMMs, 20.046, from each EMM.
contrast(emm_cyl)
contrast estimate SE df t.ratio p.value
cyl4 effect 3.22 1.450 26 2.225 0.1049
cyl6 effect -1.58 0.949 26 -1.666 0.1615
cyl8 effect -1.64 1.750 26 -0.938 0.3571
Results are averaged over the levels of: gear
P value adjustment: fdr method for 3 tests
If group sizes are imbalanced, as they are here, subtract the weighted grand mean. The population sizes are cyl4 = 11, cyl6 = 7, and cyl8 = 14, so the weighted grand mean is 20.09.
contrast(emm_cyl, wts =NA)
contrast estimate SE df t.ratio p.value
cyl4 effect 3.17 1.60 26 1.984 0.1736
cyl6 effect -1.63 1.17 26 -1.392 0.2634
cyl8 effect -1.68 1.52 26 -1.107 0.2785
Results are averaged over the levels of: gear
P value adjustment: fdr method for 3 tests
Now we’re talking. Instead of subtracting the grand mean of the EMMs, it’s subtracting the simple mean of the underlying data. Controlling for other variables, cars with cyl = 4 are predicted to have an mpg 3.17 higher, cals with cyl = 6 1.63 lower. None of these effects are statistically significant.
9.2.1 One vs Others
The default contrast shows how each level relates to the whole, but it feels like double counting since each level also contributes to the whole. Maybe cyl4 would be statistically different if it wasn’t included in the whole. You can do this with method = "del.eff"!
contrast(emm_cyl, wts =NA, method ="del.eff")
contrast estimate SE df t.ratio p.value
cyl4 effect 4.84 2.44 26 1.984 0.1736
cyl6 effect -2.08 1.49 26 -1.392 0.2634
cyl8 effect -2.99 2.70 26 -1.107 0.2785
Results are averaged over the levels of: gear
P value adjustment: fdr method for 3 tests
Yep, a much larger effect. Note however that the p-values are unchanged. The contrast is just not large enough, given the sample size, to be distinguishable from zero.
9.3 Nuisance Variables
gear has been a bit of a nuisance so far. It is an important part of the model, but after that it was just something to proportionally average over. In fact, controlling variables can become crippling if there are enough of them because the reference grid is composed of all combinations of variable levels.
It would be preferable to predict on the “average” gear, just like we predicted on the average disp. Well, you can! Models one-hot encode factor variables, so you get gear3 = 0|1, gear4 = 0|1, and gear5 = 0|1. What if instead of three combinations you predicted on gear3=1/3, gear4=1/3, and gear5=1/3?
ref_grid(my_fit, nuisance ="gear")
'emmGrid' object with variables:
cyl = 4, 6, 8
disp = 230.72
Nuisance factors that have been collapsed by averaging:
gear(3)
Now we’ll have just three predictions. This is closer to the “typical” car, but we can do even better by weighting the levels proportionally to their frequency (gear3=15/32, gear4=12/32, and gear5=5/32).
Now the one vs others contrast is more like the “average” car, varying only the number of cylinders. Recall that the default is to weight the other two EMMs equally, so wts = NA weights them by cyl count.
The expected value of cyl4 is 4.837 mpg greater than the weighted average of the other cyl levels. Let’s report this!
9.4 Reporting EMM + Contrast
Sometimes reporting model coefficients suffers from lack of context. The model fit coefficient for cyl6 indicates it reduces the expected mpg by 4.800 relative to cyl4. Instead, I will report that a ‘representative’ car, a car with a proportional mix of all modeled car attributes, has an expected mpg of 20.091 (intercept-only model) and the effect of fixing cyl = 4 is to increase this expected value by 3.174 to 23.265, and fixing cyl = 6 decreases it 1.625 to 18.465.
```{r}#| include: falselibrary(tidyverse)library(tidymodels)library(gtsummary)```# Estimated Marginal Means {#EMMs}I'm modeling the effect of the number of cylinders on vehicle fuel economy. I don't think `cyl` should matter if you control for `disp` and `gear`. Take a look at this regression output. What do you think? The coefficient for `cyl=6` is -4.8 (*p* = .009), but `cyl=8` is indistinguishable from `cyl=4`.```{r}#| code-fold: falsemy_cars <- mtcars |>mutate(across(c(cyl, gear), factor))my_fit <-linear_reg() |>fit(mpg ~ cyl + disp + gear, data = my_cars) |>extract_fit_engine()tbl_regression(my_fit, intercept =TRUE)```The problem with this regression is that it holds `cyl=4` as the standard for comparison. That's more aggressive than what I'm after. I want to know if any of the `cyl` levels stand out relative to the whole, not whether any two particular levels differ. In this case, at least `cyl=4` and `cyl=6` differed. If `cyl8` was the reference, none of the *p*-values would be significant.I think I have a better way: report contrasts from the *estimated marginal means* (EMMs).## Calculating EMMsEstimated marginal means (EMMs) are the mean predicted values over supplies set of predictor values. Whereas ordinary marginal means are conditional means of the raw data, EMMs are conditional means of _modeled_ data.```{r}#| code-fold: falselibrary(emmeans)```One way to get at the impact of `cyl` is by calculating the expected `mpg` by `cyl`, holding the other predictors fixed at some realistic value. You might set `disp` to the dataset average. For factors like `gear` you can make a prediction for each level. That's nine combinations of predictors. ```{r}#| code-fold: falsenewdata <-expand.grid(cyl =levels(my_cars$cyl), disp =mean(my_cars$disp),gear =levels(my_cars$gear))(my_preds <-augment(my_fit, newdata = newdata))```This predictor dataset is called a _reference grid_. `emmeans()` creates a reference grid, makes predictions, then averages them by the `specs` argument. `specs = "1"` averages over the whole grid, `mean(my_preds$.fitted)` = `r comma(mean(my_preds$.fitted), .1)`.```{r}#| code-fold: falseemmeans(my_fit, specs ="1")```I'm interested in `cyl`. `specs = "cyl"` will average across the three `gear` levels. Specifying `weights = "proportional"` will weight EMMs by the number of rows in the original data for each level of `cyl`.```{r}#| code-fold: false(emm_cyl <-emmeans(my_fit, specs ="cyl", weights ="proportional"))```What does this get us? The expected `mpg` for a car with 4 cylinders is 23.3. That's 5 mpg greater than the other two. The SEs are pretty large though, so I can't conclude fuel economy varies by `cyl`. For that, I need a contrast.## Level ComparisonsThere's a multitude of ways to analyze EMMs. You can compare them to each other, to the overall mean, to a single level, and so many other ways in the `contrast` documentation that I don't understand.### Pairwise {-}Let's start with pairwise comparisons. This is definitely _not_ how I want to compare `cyl` levels, but it's a useful introduction.```{r}#| code-fold: falsepairs(emm_cyl)```The EMM of `cyl4` is 4.7998 mpg greater than `cyl6` and statistically significant (*p* = .0248). `cyl4 - cyl8` is large, but not significant. This is what the regression output was saying, but `pairs()` gives me `cyl6 - cyl8` too.A more general way to report paired differences is with Cohen's *d* effect size. It standardizes the differences by dividing by the model residual standard error. An effect size >.8 is large, >.5 medium, and >.2 small.```{r}#| code-fold: falseeff_size(emm_cyl, sigma =sigma(my_fit), edf =df.residual(my_fit))```But again, I'm interested in identifying `cyl` levels that stand out from the rest, not ferreting out two levels that differ from each other.### One vs Whole {-}Instead of comparing `cyl` levels to each other, I'll compare them to the *whole* using a contrast. The default contrast subtracts the grand mean, the mean of the three EMMs, `r comma(mean(tidy(emm_cyl)$estimate), .001)`, from each EMM.```{r}#| code-fold: falsecontrast(emm_cyl)```If group sizes are imbalanced, as they are here, subtract the *weighted* grand mean. The population sizes are `cyl4` = 11, `cyl6` = 7, and `cyl8` = 14, so the weighted grand mean is `r emm_cyl |> tidy() |> mutate(estimate = estimate * case_match(cyl, "4"~11/32, "6"~7/32, "8"~14/32)) |> pull(estimate) |> sum() |> comma(.01)`.```{r}#| code-fold: falsecontrast(emm_cyl, wts =NA)```Now we're talking. Instead of subtracting the grand mean of the EMMs, it's subtracting the simple mean of the underlying data. Controlling for other variables, cars with `cyl = 4` are predicted to have an `mpg` 3.17 higher, cals with `cyl = 6` 1.63 lower. *None of these effects are statistically significant.*### One vs OthersThe default contrast shows how each level relates to the whole, but it feels like double counting since each level also contributes to the whole. Maybe `cyl4` would be statistically different if it wasn't included in the whole. You can do this with `method = "del.eff"`!```{r}#| code-fold: falsecontrast(emm_cyl, wts =NA, method ="del.eff")```Yep, a much larger effect. Note however that the *p*-values are unchanged. The contrast is just not large enough, given the sample size, to be distinguishable from zero.## Nuisance Variables`gear` has been a bit of a nuisance so far. It is an important part of the model, but after that it was just something to proportionally average over. In fact, controlling variables can become crippling if there are enough of them because the reference grid is composed of all combinations of variable levels.```{r}#| code-fold: falseref_grid(my_fit)```It would be preferable to predict on the "average" `gear`, just like we predicted on the average `disp`. Well, you can! Models one-hot encode factor variables, so you get `gear3 = 0|1`, `gear4 = 0|1`, and `gear5 = 0|1`. What if instead of three combinations you predicted on `gear3=1/3`, `gear4=1/3`, and `gear5=1/3`?```{r}#| code-fold: falseref_grid(my_fit, nuisance ="gear")```Now we'll have just three predictions. This is closer to the "typical" car, but we can do even better by weighting the levels proportionally to their frequency (`gear3=15/32`, `gear4=12/32`, and `gear5=5/32`).```{r}#| code-fold: falsenuis_grid <-ref_grid(my_fit, nuisance ="gear", wt.nuis ="proportional")tidy(nuis_grid)```Now the one vs others contrast is more like the "average" car, varying only the number of cylinders. Recall that the default is to weight the other two EMMs equally, so `wts = NA` weights them by `cyl` count.```{r}#| code-fold: falseemmeans(nuis_grid, specs ="cyl") |>contrast("del.eff", wts =NA) |>tidy()```The expected value of `cyl4` is 4.837 mpg greater than the weighted average of the other `cyl` levels. Let's report this!## Reporting EMM + ContrastSometimes reporting model coefficients suffers from lack of context. The model fit coefficient for `cyl6` indicates it reduces the expected `mpg` by 4.800 relative to `cyl4`. Instead, I will report that a 'representative' car, a car with a proportional mix of all modeled car attributes, has an expected `mpg` of 20.091 (intercept-only model) and the effect of fixing `cyl = 4` is to increase this expected value by 3.174 to 23.265, and fixing `cyl = 6` decreases it 1.625 to 18.465. ```{r}#| code-fold: falseemmeans(nuis_grid, specs ="cyl") |>contrast(wts =NA) |>tidy()```We can report this in a single row.```{r}final_fit <-linear_reg() |>fit(mpg ~ cyl + disp + gear, data = my_cars) |>extract_fit_engine()nuis_grid <-ref_grid(final_fit, nuisance ="gear", wt.nuis ="proportional")cyl_emm <-emmeans(nuis_grid, specs ="cyl") cyl_contrast <-contrast(cyl_emm, wts =NA)intercept_emm <-emmeans(nuis_grid, specs ="1", weights ="proportional")bind_rows(EMM =tidy(cyl_emm),Effect =tidy(cyl_contrast) |>mutate(cyl =str_sub(contrast, 4, 4)) |>rename(p.value = adj.p.value),.id ="measure") |>select(measure, cyl, estimate, p.value) |>pivot_wider(names_from = measure, values_from =c(estimate, p.value)) |>bind_rows(tidy(intercept_emm) |>mutate(cyl ="Intercept") |>rename(estimate_EMM = estimate)) |>select(cyl, estimate_EMM, p.value_EMM, estimate_Effect, p.value_Effect) |>mutate(cyl =fct_relevel(cyl, "Intercept", after =0)) |>arrange(cyl) |> gt::gt() |> gt::cols_label(estimate_EMM ="Estimate",p.value_EMM ="p-value",estimate_Effect ="Estimate",p.value_Effect ="p-value" ) |> gt::tab_spanner("EMM", c(estimate_EMM, p.value_EMM)) |> gt::tab_spanner("Effect", c(estimate_Effect, p.value_Effect)) |> gt::fmt_number(c(estimate_EMM, estimate_Effect), decimals =2) |> gt::fmt_number(c(p.value_EMM, p.value_Effect), decimals =3)```## Learn MoreThis [Very statisticious](https://aosmith.rbind.io/2019/03/25/getting-started-with-emmeans/) blog post is helpful. I also worked through the **emmeans** vignettes on [CRAN](https://cran.r-project.org/web/packages/emmeans/) and **ggeffects** on [GitHub](https://strengejacke.github.io/ggeffects/index.html).