9  Estimated Marginal Means

I’m modeling the effect of the number of cylinders on vehicle fuel economy. I don’t think cyl should matter if you control for disp and gear. Take a look at this regression output. What do you think? The coefficient for cyl=6 is -4.8 (p = .009), but cyl=8 is indistinguishable from cyl=4.

my_cars <- mtcars |> mutate(across(c(cyl, gear), factor))

my_fit <-
  linear_reg() |>
  fit(mpg ~ cyl + disp + gear, data = my_cars) |>
  extract_fit_engine()

tbl_regression(my_fit, intercept = TRUE)
Characteristic Beta 95% CI p-value
(Intercept) 29 24, 34 <0.001
cyl


    4
    6 -4.8 -8.3, -1.3 0.009
    8 -4.9 -11, 1.4 0.12
disp -0.03 -0.05, 0.00 0.029
gear


    3
    4 0.03 -3.8, 3.9 >0.9
    5 0.32 -3.4, 4.0 0.9
Abbreviation: CI = Confidence Interval

The problem with this regression is that it holds cyl=4 as the standard for comparison. That’s more aggressive than what I’m after. I want to know if any of the cyl levels stand out relative to the whole, not whether any two particular levels differ. In this case, at least cyl=4 and cyl=6 differed. If cyl8 was the reference, none of the p-values would be significant.

I think I have a better way: report contrasts from the estimated marginal means (EMMs).

9.1 Calculating EMMs

Estimated marginal means (EMMs) are the mean predicted values over supplies set of predictor values. Whereas ordinary marginal means are conditional means of the raw data, EMMs are conditional means of modeled data.

library(emmeans)

One way to get at the impact of cyl is by calculating the expected mpg by cyl, holding the other predictors fixed at some realistic value. You might set disp to the dataset average. For factors like gear you can make a prediction for each level. That’s nine combinations of predictors.

newdata <- expand.grid(
  cyl = levels(my_cars$cyl), 
  disp = mean(my_cars$disp),
  gear = levels(my_cars$gear)
)

(my_preds <- augment(my_fit, newdata = newdata))
# A tibble: 9 × 4
  cyl    disp gear  .fitted
  <fct> <dbl> <fct>   <dbl>
1 4      231. 3        23.2
2 6      231. 3        18.4
3 8      231. 3        18.3
4 4      231. 4        23.2
5 6      231. 4        18.4
6 8      231. 4        18.4
7 4      231. 5        23.5
8 6      231. 5        18.7
9 8      231. 5        18.7

This predictor dataset is called a reference grid. emmeans() creates a reference grid, makes predictions, then averages them by the specs argument. specs = "1" averages over the whole grid, mean(my_preds$.fitted) = 20.1.

emmeans(my_fit, specs = "1")
 1       emmean    SE df lower.CL upper.CL
 overall   20.1 0.665 26     18.7     21.5

Results are averaged over the levels of: cyl, gear 
Confidence level used: 0.95 

I’m interested in cyl. specs = "cyl" will average across the three gear levels. Specifying weights = "proportional" will weight EMMs by the number of rows in the original data for each level of cyl.

(emm_cyl <- emmeans(my_fit, specs = "cyl", weights = "proportional"))
 cyl emmean   SE df lower.CL upper.CL
 4     23.3 1.69 26     19.8     26.7
 6     18.5 1.29 26     15.8     21.1
 8     18.4 1.61 26     15.1     21.7

Results are averaged over the levels of: gear 
Confidence level used: 0.95 

What does this get us? The expected mpg for a car with 4 cylinders is 23.3. That’s 5 mpg greater than the other two. The SEs are pretty large though, so I can’t conclude fuel economy varies by cyl. For that, I need a contrast.

9.2 Level Comparisons

There’s a multitude of ways to analyze EMMs. You can compare them to each other, to the overall mean, to a single level, and so many other ways in the contrast documentation that I don’t understand.

Pairwise

Let’s start with pairwise comparisons. This is definitely not how I want to compare cyl levels, but it’s a useful introduction.

pairs(emm_cyl)
 contrast    estimate   SE df t.ratio p.value
 cyl4 - cyl6   4.7998 1.71 26   2.801  0.0248
 cyl4 - cyl8   4.8559 3.06 26   1.585  0.2697
 cyl6 - cyl8   0.0561 2.41 26   0.023  0.9997

Results are averaged over the levels of: gear 
P value adjustment: tukey method for comparing a family of 3 estimates 

The EMM of cyl4 is 4.7998 mpg greater than cyl6 and statistically significant (p = .0248). cyl4 - cyl8 is large, but not significant. This is what the regression output was saying, but pairs() gives me cyl6 - cyl8 too.

A more general way to report paired differences is with Cohen’s d effect size. It standardizes the differences by dividing by the model residual standard error. An effect size >.8 is large, >.5 medium, and >.2 small.

eff_size(emm_cyl, sigma = sigma(my_fit), edf = df.residual(my_fit))
 contrast    effect.size    SE df lower.CL upper.CL
 cyl4 - cyl6      1.5693 0.601 26    0.334     2.80
 cyl4 - cyl8      1.5876 1.030 26   -0.520     3.70
 cyl6 - cyl8      0.0183 0.788 26   -1.601     1.64

Results are averaged over the levels of: gear 
sigma used for effect sizes: 3.059 
Confidence level used: 0.95 

But again, I’m interested in identifying cyl levels that stand out from the rest, not ferreting out two levels that differ from each other.

One vs Whole

Instead of comparing cyl levels to each other, I’ll compare them to the whole using a contrast. The default contrast subtracts the grand mean, the mean of the three EMMs, 20.046, from each EMM.

contrast(emm_cyl)
 contrast    estimate    SE df t.ratio p.value
 cyl4 effect     3.22 1.450 26   2.225  0.1049
 cyl6 effect    -1.58 0.949 26  -1.666  0.1615
 cyl8 effect    -1.64 1.750 26  -0.938  0.3571

Results are averaged over the levels of: gear 
P value adjustment: fdr method for 3 tests 

If group sizes are imbalanced, as they are here, subtract the weighted grand mean. The population sizes are cyl4 = 11, cyl6 = 7, and cyl8 = 14, so the weighted grand mean is 20.09.

contrast(emm_cyl, wts = NA)
 contrast    estimate   SE df t.ratio p.value
 cyl4 effect     3.17 1.60 26   1.984  0.1736
 cyl6 effect    -1.63 1.17 26  -1.392  0.2634
 cyl8 effect    -1.68 1.52 26  -1.107  0.2785

Results are averaged over the levels of: gear 
P value adjustment: fdr method for 3 tests 

Now we’re talking. Instead of subtracting the grand mean of the EMMs, it’s subtracting the simple mean of the underlying data. Controlling for other variables, cars with cyl = 4 are predicted to have an mpg 3.17 higher, cals with cyl = 6 1.63 lower. None of these effects are statistically significant.

9.2.1 One vs Others

The default contrast shows how each level relates to the whole, but it feels like double counting since each level also contributes to the whole. Maybe cyl4 would be statistically different if it wasn’t included in the whole. You can do this with method = "del.eff"!

contrast(emm_cyl, wts = NA, method = "del.eff")
 contrast    estimate   SE df t.ratio p.value
 cyl4 effect     4.84 2.44 26   1.984  0.1736
 cyl6 effect    -2.08 1.49 26  -1.392  0.2634
 cyl8 effect    -2.99 2.70 26  -1.107  0.2785

Results are averaged over the levels of: gear 
P value adjustment: fdr method for 3 tests 

Yep, a much larger effect. Note however that the p-values are unchanged. The contrast is just not large enough, given the sample size, to be distinguishable from zero.

9.3 Nuisance Variables

gear has been a bit of a nuisance so far. It is an important part of the model, but after that it was just something to proportionally average over. In fact, controlling variables can become crippling if there are enough of them because the reference grid is composed of all combinations of variable levels.

ref_grid(my_fit)
'emmGrid' object with variables:
    cyl = 4, 6, 8
    disp = 230.72
    gear = 3, 4, 5

It would be preferable to predict on the “average” gear, just like we predicted on the average disp. Well, you can! Models one-hot encode factor variables, so you get gear3 = 0|1, gear4 = 0|1, and gear5 = 0|1. What if instead of three combinations you predicted on gear3=1/3, gear4=1/3, and gear5=1/3?

ref_grid(my_fit, nuisance = "gear")
'emmGrid' object with variables:
    cyl = 4, 6, 8
    disp = 230.72
Nuisance factors that have been collapsed by averaging:
    gear(3)

Now we’ll have just three predictions. This is closer to the “typical” car, but we can do even better by weighting the levels proportionally to their frequency (gear3=15/32, gear4=12/32, and gear5=5/32).

nuis_grid <- ref_grid(my_fit, nuisance = "gear", wt.nuis = "proportional")

tidy(nuis_grid)
# A tibble: 3 × 7
  cyl    disp estimate std.error    df statistic  p.value
  <chr> <dbl>    <dbl>     <dbl> <dbl>     <dbl>    <dbl>
1 4      231.     23.3      1.69    26      13.8 1.85e-13
2 6      231.     18.5      1.29    26      14.4 7.19e-14
3 8      231.     18.4      1.61    26      11.4 1.26e-11

Now the one vs others contrast is more like the “average” car, varying only the number of cylinders. Recall that the default is to weight the other two EMMs equally, so wts = NA weights them by cyl count.

emmeans(nuis_grid, specs = "cyl") |>
  contrast("del.eff", wts = NA) |> 
  tidy()
# A tibble: 3 × 8
  term  contrast    null.value estimate std.error    df statistic adj.p.value
  <chr> <chr>            <dbl>    <dbl>     <dbl> <dbl>     <dbl>       <dbl>
1 cyl   cyl4 effect          0     4.84      2.44    26      1.98       0.174
2 cyl   cyl6 effect          0    -2.08      1.49    26     -1.39       0.263
3 cyl   cyl8 effect          0    -2.99      2.70    26     -1.11       0.279

The expected value of cyl4 is 4.837 mpg greater than the weighted average of the other cyl levels. Let’s report this!

9.4 Reporting EMM + Contrast

Sometimes reporting model coefficients suffers from lack of context. The model fit coefficient for cyl6 indicates it reduces the expected mpg by 4.800 relative to cyl4. Instead, I will report that a ‘representative’ car, a car with a proportional mix of all modeled car attributes, has an expected mpg of 20.091 (intercept-only model) and the effect of fixing cyl = 4 is to increase this expected value by 3.174 to 23.265, and fixing cyl = 6 decreases it 1.625 to 18.465.

emmeans(nuis_grid, specs = "cyl") |> contrast(wts = NA) |> tidy()
# A tibble: 3 × 8
  term  contrast    null.value estimate std.error    df statistic adj.p.value
  <chr> <chr>            <dbl>    <dbl>     <dbl> <dbl>     <dbl>       <dbl>
1 cyl   cyl4 effect          0     3.17      1.60    26      1.98       0.174
2 cyl   cyl6 effect          0    -1.63      1.17    26     -1.39       0.263
3 cyl   cyl8 effect          0    -1.68      1.52    26     -1.11       0.279

We can report this in a single row.

Show the code
final_fit <- 
  linear_reg() |> 
  fit(mpg ~ cyl + disp + gear, data = my_cars) |>
  extract_fit_engine()

nuis_grid <- ref_grid(final_fit, nuisance = "gear", wt.nuis = "proportional")

cyl_emm <- emmeans(nuis_grid, specs = "cyl") 

cyl_contrast <- contrast(cyl_emm, wts = NA)

intercept_emm <- emmeans(nuis_grid, specs = "1", weights = "proportional")

bind_rows(
  EMM = tidy(cyl_emm),
  Effect = tidy(cyl_contrast) |> 
    mutate(cyl = str_sub(contrast, 4, 4)) |>
    rename(p.value = adj.p.value),
  .id = "measure"
) |>
  select(measure, cyl, estimate, p.value) |>
  pivot_wider(names_from = measure, values_from = c(estimate, p.value)) |>
  bind_rows(tidy(intercept_emm) |>
              mutate(cyl = "Intercept") |> 
              rename(estimate_EMM = estimate)) |>
  select(cyl, estimate_EMM, p.value_EMM, estimate_Effect, p.value_Effect) |>
  mutate(cyl = fct_relevel(cyl, "Intercept", after = 0)) |>
  arrange(cyl) |>
  gt::gt() |>
  gt::cols_label(
    estimate_EMM = "Estimate",
    p.value_EMM = "p-value",
    estimate_Effect = "Estimate",
    p.value_Effect = "p-value"
  ) |>
  gt::tab_spanner("EMM", c(estimate_EMM, p.value_EMM)) |>
  gt::tab_spanner("Effect", c(estimate_Effect, p.value_Effect)) |>
  gt::fmt_number(c(estimate_EMM, estimate_Effect), decimals = 2) |>
  gt::fmt_number(c(p.value_EMM, p.value_Effect), decimals = 3)
cyl
EMM
Effect
Estimate p-value Estimate p-value
Intercept 20.09 NA NA NA
4 23.27 0.000 3.17 0.174
6 18.47 0.000 −1.63 0.263
8 18.41 0.000 −1.68 0.279

9.5 Learn More

This Very statisticious blog post is helpful. I also worked through the emmeans vignettes on CRAN and ggeffects on GitHub.