Regularization is a set of methods in linear regression that manage the “bias-variance” tradeoff in machine learning. Bias and variance are two types of errors that affect model performance.

The tradeoff is that increasing model complexity to reduce bias (systematic error in the average prediction) comes at the cost of increased variance (sensitivity of the predictions to the particular training sample). The goal is to find a “sweet spot”, the level of model complexity that minimizes overall error on new, unseen data.

5.1 Bias-Variance Trade-off

The linear regression model is \(Y = X \beta + \epsilon\), where \(\epsilon \sim N(0, \sigma^2)\). OLS estimates the coefficients by minimizing the loss function

\[L = \sum_{i = 1}^n \left(y_i - x_i^{'} \hat\beta \right)^2\]

where \(\hat{f}(x_i) = x_i^{'} \hat\beta\) is the estimator of the true regression function, \(f\).

The resulting estimate for the coefficients is

\[\hat{\beta} = \left(X'X\right)^{-1}\left(X'Y\right).\]

There are two important characteristics of any estimator: its bias and its variance. For OLS, these are

\[Bias(\hat{\beta}) = E(\hat{\beta}) - \beta = 0\]

and

\[Var(\hat{\beta}) = \sigma^2(X'X)^{-1}\]

where the unknown population variance \(\sigma^2\) is estimated from the residuals

\[\hat\sigma^2 = \frac{\hat\epsilon' \hat\epsilon}{n - k}\]

where \(\hat\epsilon = Y - X\hat\beta\) is the vector of residuals and \(k\) is the number of estimated coefficients.
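
As a quick sanity check of these formulas, here is a minimal sketch comparing the closed-form solution against lm(), using the built-in mtcars data purely for illustration (the choice of mpg ~ wt + hp is arbitrary).

# Closed-form OLS: beta_hat = (X'X)^{-1} X'Y, compared against lm().
X <- model.matrix(mpg ~ wt + hp, data = mtcars)   # includes the intercept column
y <- mtcars$mpg
beta_hat <- drop(solve(t(X) %*% X) %*% t(X) %*% y)
cbind(closed_form = beta_hat, lm = coef(lm(mpg ~ wt + hp, data = mtcars)))

# Residual variance estimate, sigma_hat^2 = e'e / (n - k); matches sigma(lm(...))^2.
e <- y - X %*% beta_hat
sum(e^2) / (nrow(X) - ncol(X))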

The OLS estimator is unbiased, but can have a large variance when the predictor variables are highly correlated with each other, or when there are many predictors (notice how \(\hat{\sigma}^2\) increases as \(k \rightarrow n\)). Stepwise selection balances the trade-off by eliminating variables, but this throws away information. Regularization keeps all the predictors, but reduces coefficient magnitudes to reduce variance at the expense of some bias.
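
To see the variance inflation concretely, here is a small simulated example (not part of the case study data) comparing \(\sigma^2(X'X)^{-1}\) for an uncorrelated pair of predictors versus a highly correlated pair.

# Two designs with the same noise level: one with independent predictors,
# one with nearly collinear predictors.
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2_indep <- rnorm(n)                  # roughly uncorrelated with x1
x2_corr  <- x1 + rnorm(n, sd = 0.1)   # highly correlated with x1
sigma2 <- 1

# Coefficient variances are the diagonal of sigma^2 (X'X)^{-1}.
diag(sigma2 * solve(crossprod(cbind(x1, x2_indep))))
diag(sigma2 * solve(crossprod(cbind(x1, x2_corr))))   # far larger for the collinear design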

We can decompose the expected prediction error (EPE) into a reducible error related to the type of model, and an irreducible error related to the variance of \(y\) (noise).

\[ \text{EPE} = \mathbb{E}\left[\left(f(x) - \hat{f}(x)\right)^2\right] + \mathbb{V}[y|x] \]

The reducible error, which is the mean squared error (MSE) of the estimate, can be further decomposed into the estimator bias and variance.

\[ \text{EPE} = \left(f(x) - \mathbb{E}\left[\hat{f}(x)\right]\right)^2 + \mathbb{E}\left[\left(\hat{f}(x) - \mathbb{E}\left[\hat{f}(x)\right]\right)^2\right] + \mathbb{V}[y|x] \]
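
The second step follows from adding and subtracting \(\mathbb{E}[\hat{f}(x)]\) inside the square; the cross term vanishes because \(\mathbb{E}\left[\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\right] = 0\). Suppressing the argument \(x\),

\[ \mathbb{E}\left[\left(f - \hat{f}\right)^2\right] = \mathbb{E}\left[\left(\left(f - \mathbb{E}[\hat{f}]\right) - \left(\hat{f} - \mathbb{E}[\hat{f}]\right)\right)^2\right] = \underbrace{\left(f - \mathbb{E}[\hat{f}]\right)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}\left[\left(\hat{f} - \mathbb{E}[\hat{f}]\right)^2\right]}_{\text{Variance}}. \]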

Illustration. Let’s set up a bias-variance tradeoff by fitting several models of increasing complexity to a random data-generating process, \(f(x) = x^3\). We’ll create 250 random samples of 100 points each, drawing \(x\) uniformly between 1 and 5 and setting \(y = f(x) + e\), where \(e\) is random noise.

f <- function(x) x^3

make_sample <- function() {
  x <- runif(n = 100, min = 1, max = 5)
  e <- rnorm(n = 100, mean = 0, sd = 5)
  y <- f(x) + e
  data.frame(x, y)
}

sim_dat <-
  tibble(sim = 1:250) |>
  mutate(dat = map(sim, ~ make_sample()))

We’ll fit five models. The first is a linear model, \(y \sim x\). Then we’ll fit a polynomial model of degree 2. These first two models underfit the data. Then one of degree 3, matching the underlying data-generating process. Then we’ll overfit with one of degree 9. Finally, we’ll use regularization to improve on the degree-9 model.

sim_fit <-
  sim_dat |>
  mutate(
    fit_1 = map(dat, ~ linear_reg() |> fit(y ~ poly(x, degree = 1), data = .)),
    fit_2 = map(dat, ~ linear_reg() |> fit(y ~ poly(x, degree = 2), data = .)),
    fit_3 = map(dat, ~ linear_reg() |> fit(y ~ poly(x, degree = 3), data = .)),
    fit_9 = map(dat, ~ linear_reg() |> fit(y ~ poly(x, degree = 9), data = .)),
    fit_9a = map(dat, 
                  ~ linear_reg(engine = "glmnet", penalty = 0.2, mixture = 1) |>
                    fit(y ~ poly(x, degree = 9), data = .))
  )

This gives us five models fit to 250 random samples, each sample containing 100 points from the \(y = x^3 + \epsilon\) data-generating process. Let’s predict a single value, \(f(x = 4)\), from each model.

test_dat <- data.frame(x = 4)

sim_preds <-
  sim_fit |>
  mutate(
    `1` = map_dbl(fit_1, ~ predict(., new_data = test_dat) |> pull(.pred)),
    `2` = map_dbl(fit_2, ~ predict(., new_data = test_dat) |> pull(.pred)),
    `3` = map_dbl(fit_3, ~ predict(., new_data = test_dat) |> pull(.pred)),
    `9` = map_dbl(fit_9, ~ predict(., new_data = test_dat) |> pull(.pred)),
    `9 (LASSO)` = map_dbl(fit_9a, ~ predict(., new_data = test_dat) |> pull(.pred))
  )

Check out the distribution of fitted values. The degree-3 polynomial is the gold standard, and you see both low bias and low variance. The degree-1 polynomial was just bad. The degree-2 model had high bias, but at least its variance was small. The degree-9 model overfit the data: its bias was low overall, but its variance was very high. The last model used regularization to improve reliability, and you see the variance has shrunk at the cost of only a small increase in bias.

sim_preds |>
  select(`1`:`9 (LASSO)`) |>
  pivot_longer(everything()) |>
  ggplot(aes(x = name, y = value)) +
  geom_hline(yintercept = f(4), linewidth = 1, color = "goldenrod") +
  geom_boxplot(fill = NA, outlier.shape = NA) +
  geom_jitter(height = 0, width = .2, alpha = .2) +
  theme(panel.grid = element_blank()) +
  labs(
    x = "Polynomial Degree", y = "Preds",
    title = "Increasing complexity improves bias at expense of higher variance.",
    subtitle = "Simulated Predictions from Four Linear Models"
  )
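
A quick numeric check of the boxplot (this assumes the sim_preds tibble built above): estimate each model’s bias as the mean prediction minus the true value \(f(4) = 64\), and its variance as the spread of the 250 predictions.

sim_preds |>
  select(`1`:`9 (LASSO)`) |>
  pivot_longer(everything(), names_to = "model", values_to = "pred") |>
  group_by(model) |>
  summarize(bias = mean(pred) - f(4), variance = var(pred))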

5.2 Case Study

The sections below use the mtcars dataset to predict mpg from the other variables, fitting regularized models with the glmnet engine (glmnet::glmnet()). Set up a train-test split, and tune with 5-fold cross-validation on the training set.

data("mtcars")

glimpse(mtcars)
Rows: 32
Columns: 11
$ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,…
$ cyl  <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8,…
$ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 16…
$ hp   <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180…
$ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92,…
$ wt   <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.…
$ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18…
$ vs   <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,…
$ am   <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,…
$ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3,…
$ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2,…
(mt_split <- initial_split(mtcars))
<Training/Testing/Total>
<24/8/32>
(mt_folds <- vfold_cv(training(mt_split), v = 5))
#  5-fold cross-validation 
# A tibble: 5 × 2
  splits         id   
  <list>         <chr>
1 <split [19/5]> Fold1
2 <split [19/5]> Fold2
3 <split [19/5]> Fold3
4 <split [19/5]> Fold4
5 <split [20/4]> Fold5

5.3 Ridge

Ridge regression estimates the linear model coefficients by minimizing an augmented loss function which includes a penalty on the magnitude of the coefficient estimates, weighted by a tuning parameter \(\lambda\),

\[L = \sum_{i = 1}^n \left(y_i - x_i^{'} \hat\beta \right)^2 + \lambda \sum_{j=1}^k \hat{\beta}_j^2.\]

The resulting estimate for the coefficients is

\[\hat{\beta} = \left(X'X + \lambda I\right)^{-1}\left(X'Y \right).\]

As \(\lambda \rightarrow 0\), ridge regression approaches OLS. The bias and variance for the ridge estimator are

\[\mathrm{Bias}(\hat{\beta}) = -\lambda \left(X'X + \lambda I \right)^{-1} \beta\] \[\mathbb{V}(\hat{\beta}) = \sigma^2 \left(X'X + \lambda I \right)^{-1}X'X \left(X'X + \lambda I \right)^{-1}\]

Notice how the estimator bias increases with \(\lambda\) while the variance decreases with \(\lambda\). The optimal level of \(\lambda\) is the one that optimizes some criterion, such as minimizing the cross-validated root mean squared error (RMSE), minimizing the Akaike or Bayesian Information Criterion (AIC or BIC), or maximizing R-squared.
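
Before the tidymodels workflow below, here is a minimal sketch of the closed-form ridge estimate applied to the standardized mtcars predictors. Note that glmnet scales and parameterizes its penalty differently, so these \(\lambda\) values are not directly comparable to the penalty values tuned later.

# Closed-form ridge: beta_hat = (X'X + lambda I)^{-1} X'Y for a few lambdas.
X <- scale(as.matrix(mtcars[, -1]))   # standardized predictors (mpg removed)
y <- mtcars$mpg
ridge_beta <- function(lambda) {
  solve(t(X) %*% X + lambda * diag(ncol(X))) %*% t(X) %*% y
}
betas <- sapply(c(0, 10, 100), ridge_beta)
dimnames(betas) <- list(colnames(X), paste("lambda =", c(0, 10, 100)))
round(betas, 2)   # the coefficient vector shrinks toward zero as lambda grows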

To fit a linear model with ridge regression, specify linear_reg(mixture = 0). Standardize the predictors in the recipe so that each contributes to the model on a similar scale. Otherwise, the penalty would affect coefficients unevenly, since coefficient magnitudes depend on each predictor’s measurement scale.

ridge_mdl <- linear_reg(engine = "glmnet", penalty = tune(), mixture = 0)

ridge_rec <-
  recipe(mpg ~ ., data = mtcars) |>
  step_normalize(all_numeric_predictors())

ridge_wflow <-
  workflow() |>
  add_model(ridge_mdl) |>
  add_recipe(ridge_rec)

# Set up the hyperparameter space for penalty. Set range through trial and error.
ridge_grid <-
  grid_latin_hypercube(
    penalty(range = c(0, 10), trans = NULL), 
    size = 30
  )

# tune_grid() returns a resamples object, essentially a tibble with tuning
# results for each hyperparameter combination and fold.
ridge_resamples <-
  ridge_wflow |>
  tune_grid(
    resamples = mt_folds,
    grid = ridge_grid
  )

ridge_resamples |> show_best(metric = "rmse") |> knitr::kable()
 penalty  .metric  .estimator      mean  n    std_err  .config
4.035107  rmse     standard    2.748411  5  0.4425804  Preprocessor1_Model13
3.913497  rmse     standard    2.748456  5  0.4431044  Preprocessor1_Model12
4.462717  rmse     standard    2.750088  5  0.4406450  Preprocessor1_Model14
3.558491  rmse     standard    2.750365  5  0.4443715  Preprocessor1_Model11
4.899727  rmse     standard    2.754607  5  0.4383952  Preprocessor1_Model15
ridge_resamples |>
  collect_metrics() |>
  ggplot(aes(x = penalty, y = mean)) +
  geom_point() +
  facet_wrap(facets = vars(.metric), scales = "free_y") +
  labs(title = "Ridge Regression Hyperparameter Tuning")

Finalize the workflow with the optimal hyperparameter setting.

# Update the workflow with the optimal hyperparameter values.
ridge_final <-
  ridge_wflow |>
  finalize_workflow(parameters = select_best(ridge_resamples, metric = "rmse"))

# Refit the model with the optimal hyperparameters using the _full_ training 
# dataset (not the folds).
ridge_fit <- ridge_final |> last_fit(mt_split)

# Metrics are evaluated on the testing dataset.
ridge_fit |> collect_metrics() |> knitr::kable()
.metric .estimator .estimate .config
rmse standard 2.4889825 Preprocessor1_Model1
rsq standard 0.8844566 Preprocessor1_Model1
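
The individual test-set predictions behind these metrics are also available; collect_predictions() works on the object returned by last_fit().

ridge_fit |> collect_predictions()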


The most important variables here were am and wt.

ridge_final |>
  fit(training(mt_split)) |>
  extract_fit_parsnip() |>
  vi() |>
  ggplot(aes(x = Importance, y = fct_rev(fct_inorder(Variable)), color = Sign)) +
  geom_point(size = 2) +
  geom_segment(aes(x = 0, xend = Importance))  +
  theme(legend.position = "top") +
  labs(
    color = NULL, y = NULL,
    title = "Ridge Variable Importance"
  )

5.4 Lasso

Lasso stands for “least absolute shrinkage and selection operator”. Like ridge, lasso adds a penalty for coefficients, but instead of penalizing the sum of squared coefficients (L2 penalty), lasso penalizes the sum of absolute values (L1 penalty). As a result, coefficients can be zeroed under lasso for high values of \(\lambda\).

The loss function for lasso is

\[L = \sum_{i = 1}^n \left(y_i - x_i^{'} \hat\beta \right)^2 + \lambda \sum_{j=1}^k \left| \hat{\beta}_j \right|.\]
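
To see this zeroing directly, here is a small sketch calling glmnet() outside of tidymodels (alpha = 1 selects the lasso; the penalty value s = 1 is arbitrary, chosen only for illustration).

library(glmnet)

x_mat <- scale(as.matrix(mtcars[, -1]))            # standardized predictors
lasso_direct <- glmnet(x_mat, mtcars$mpg, alpha = 1)
coef(lasso_direct, s = 1)   # inspect which coefficients are exactly zero at this penalty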

To fit a linear model with lasso regression, specify linear_reg(mixture = 1).

lasso_mdl <- linear_reg(engine = "glmnet", penalty = tune(), mixture = 1)

lasso_rec <-
  recipe(mpg ~ ., data = mtcars) |>
  step_normalize(all_numeric_predictors())

lasso_wflow <-
  workflow() |>
  add_model(lasso_mdl) |>
  add_recipe(lasso_rec)

# Set up the hyperparameter space for penalty. Set range through trial and error.
lasso_grid <-
  grid_latin_hypercube(
    penalty(range = c(0, 4), trans = NULL), 
    size = 30
  )

# tune_grid() returns a resamples object, essentially a tibble with tuning
# results for each hyperparameter combination and fold.
lasso_resamples <-
  lasso_wflow |>
  tune_grid(
    resamples = mt_folds,
    grid = lasso_grid
  )

lasso_resamples |> show_best(metric = "rmse") |> knitr::kable()
  penalty  .metric  .estimator      mean  n    std_err  .config
1.0515463  rmse     standard    3.135101  5  0.5020066  Preprocessor1_Model08
0.9219008  rmse     standard    3.141623  5  0.5186401  Preprocessor1_Model07
1.1547527  rmse     standard    3.144581  5  0.4892161  Preprocessor1_Model09
1.2441210  rmse     standard    3.161189  5  0.4772235  Preprocessor1_Model10
0.7771826  rmse     standard    3.173642  5  0.5315058  Preprocessor1_Model06
lasso_resamples |>
  collect_metrics() |>
  ggplot(aes(x = penalty, y = mean)) +
  geom_point() +
  facet_wrap(facets = vars(.metric), scales = "free_y") +
  labs(title = "Lasso Regression Hyperparameter Tuning")

Finalize the workflow with the optimal hyperparameter setting.

# Update the workflow with the optimal hyperparameter values.
lasso_final <-
  lasso_wflow |>
  finalize_workflow(parameters = select_best(lasso_resamples, metric = "rmse"))

# Refit the model with the optimal hyperparameters using the _full_ training 
# dataset (not the folds).
lasso_fit <- lasso_final |> last_fit(mt_split)

# Metrics are evaluated on the testing dataset.
lasso_fit |> collect_metrics() |> knitr::kable()
.metric .estimator .estimate .config
rmse standard 2.1770670 Preprocessor1_Model1
rsq standard 0.9076329 Preprocessor1_Model1


The most important variables here were am and wt.

lasso_final |>
  fit(training(mt_split)) |>
  extract_fit_parsnip() |>
  vi() |>
  ggplot(aes(x = Importance, y = fct_rev(fct_inorder(Variable)), color = Sign)) +
  geom_point(size = 2) +
  geom_segment(aes(x = 0, xend = Importance))  +
  theme(legend.position = "top") +
  labs(
    color = NULL, y = NULL,
    title = "Lasso Variable Importance"
  )

5.5 Elastic Net

Elastic Net combines the penalties of ridge and lasso to get the best of both worlds. The loss function for elastic net is

\[L = \frac{\sum_{i = 1}^n \left(y_i - x_i^{'} \hat\beta \right)^2}{2n} + \lambda \left( \frac{1 - \alpha}{2}\sum_{j=1}^k \hat{\beta}_j^2 + \alpha \sum_{j=1}^k \left| \hat{\beta}_j \right| \right).\]

In this loss function, the new parameter \(\alpha\) is a “mixing” parameter that balances the two penalties. If \(\alpha\) is zero, you are back to ridge regression, and if \(\alpha\) is one, you are back to lasso. To fit a linear model with elastic net regression, specify linear_reg(mixture = tune()).

elnet_mdl <- linear_reg(engine = "glmnet", penalty = tune(), mixture = tune())

elnet_rec <-
  recipe(mpg ~ ., data = mtcars) |>
  step_normalize(all_numeric_predictors())

elnet_wflow <-
  workflow() |>
  add_model(elnet_mdl) |>
  add_recipe(elnet_rec)

# Set up the hyperparameter space for penalty and mixture. Set penalty range 
# through trial and error.
elnet_grid <-
  grid_latin_hypercube(
    penalty(range = c(2, 7), trans = NULL), 
    mixture(range = c(.1, .8)),
    size = 30
  )

# tune_grid() returns a resamples object, essentially a tibble with tuning
# results for each hyperparameter combination and fold.
elnet_resamples <-
  elnet_wflow |>
  tune_grid(
    resamples = mt_folds,
    grid = elnet_grid
  )

elnet_resamples |> show_best(metric = "rmse") |> knitr::kable()
 penalty    mixture  .metric  .estimator      mean  n    std_err  .config
3.451338  0.1427038  rmse     standard    2.935395  5  0.4578454  Preprocessor1_Model02
2.523840  0.2380821  rmse     standard    2.993297  5  0.4722182  Preprocessor1_Model06
5.434235  0.1124067  rmse     standard    3.041341  5  0.4382533  Preprocessor1_Model01
3.928826  0.2142507  rmse     standard    3.107241  5  0.4555658  Preprocessor1_Model05
2.403868  0.3727664  rmse     standard    3.119993  5  0.4754769  Preprocessor1_Model12
elnet_resamples |>
  collect_metrics() |>
  filter(.metric == "rmse") |>
  ggplot(aes(x = penalty, y = mixture, color = mean)) +
  geom_point(size = 2) +
  scale_color_gradient(low = "red", high = "blue") +
  # facet_wrap(facets = vars(.metric), scales = "free_y") +
  labs(title = "Elastic Net Regression Hyperparameter Tuning", color = "rmse")

Finalize the workflow with the optimal hyperparameter setting.

# Update the workflow with the optimal hyperparameter values.
elnet_final <-
  elnet_wflow |>
  finalize_workflow(parameters = select_best(elnet_resamples, metric = "rmse"))

# Refit the model with the optimal hyperparameters using the _full_ training 
# dataset (not the folds).
elnet_fit <- elnet_final |> last_fit(mt_split)

# Metrics are evaluated on the testing dataset.
elnet_fit |> collect_metrics() |> knitr::kable()
.metric .estimator .estimate .config
rmse standard 2.4026534 Preprocessor1_Model1
rsq standard 0.8865235 Preprocessor1_Model1


The most important variables here were wt, qsec and disp.

elnet_final |>
  fit(training(mt_split)) |>
  extract_fit_parsnip() |>
  vi() |>
  ggplot(aes(x = Importance, y = fct_rev(fct_inorder(Variable)), color = Sign)) +
  geom_point(size = 2) +
  geom_segment(aes(x = 0, xend = Importance))  +
  theme(legend.position = "top") +
  labs(
    color = NULL, y = NULL,
    title = "Elastic Net Variable Importance"
  )

5.6 Model Summary

Lasso performed best on RMSE and R-squared.

bind_rows(
  Ridge = collect_metrics(ridge_fit),
  Lasso = collect_metrics(lasso_fit),
  ElNet = collect_metrics(elnet_fit),
  .id = "Model"
) |>
  select(Model, .metric, .estimate) |>
  pivot_wider(names_from = .metric, values_from = .estimate) |>
  knitr::kable()
Model     rmse      rsq
Ridge  2.488983  0.8844566
Lasso  2.177067  0.9076329
ElNet  2.402653  0.8865235

Final thoughts on the models.

  • Lasso can set some coefficients to zero, thus performing variable selection.
  • Lasso and Ridge address multicollinearity differently: in ridge regression, the coefficients of correlated predictors are similar; in lasso, one of the correlated predictors gets a larger coefficient, while the rest are (nearly) zeroed.
  • Lasso tends to do well if there are a small number of significant parameters and the others are close to zero. Ridge tends to work well if there are many large parameters of about the same value.
  • In practice, you don’t know which will be best, so run cross-validation and pick the best model.