# Simulation setup: data-generating process f(x) = x^3 and 250 random samples,
# used in the bias-variance illustration below.
f <- function(x) {x^3}

make_sample = function() {
  x = runif(n = 100, 1, 5)
  e = rnorm(n = 100, 0, 5)
  y = f(x) + e
  data.frame(x, y)
}

sim_dat <-
  tibble(sim = 1:250) |>
  mutate(dat = map(sim, ~ make_sample()))
5 Regularization
Regularization is a set of methods in linear regression that manage the “bias-variance” tradeoff in machine learning. Bias and variance are two types of errors that affect model performance.
Bias refers to errors introduced by overly simplistic models that make strong assumptions. High-bias models, like linear models for nonlinear data, tend to underfit, meaning they fail to capture important patterns. This leads to systematic errors when predicting on new data.
Variance is the model’s sensitivity to fluctuations in the training data. High-variance models, like linear models with highly correlated features and deep neural networks, tend to overfit, meaning they capture noise in the data as if it were part of the underlying pattern. These models perform well on training data but are less reliable on new data.
The tradeoff is that increasing model complexity to decrease bias (systematic error in the predictions) comes at the cost of increasing variance (sensitivity of the predictions to the particular training sample). The goal is to find a “sweet spot”, an optimal level of model complexity that minimizes overall error on new, unseen data.
5.1 Bias-Variance Trade-off
The linear regression model is \(Y = X \beta + \epsilon\), where \(\epsilon \sim N(0, \sigma^2)\). OLS estimates the coefficients by minimizing the loss function
\[L = \sum_{i = 1}^n \left(y_i - x_i^{'} \hat\beta \right)^2\]
where \(\hat{f}(x_i) = x_i^{'} \hat\beta\) is the estimator of the true regression function, \(f\).
This yields the coefficient estimate
\[\hat{\beta} = \left(X'X\right)^{-1}\left(X'Y\right).\]
There are two important characteristics of any estimator: its bias and its variance. For OLS, these are
\[Bias(\hat{\beta}) = E(\hat{\beta}) - \beta = 0\]
and
\[Var(\hat{\beta}) = \sigma^2(X'X)^{-1}\]
where the unknown population variance \(\sigma^2\) is estimated from the residuals
\[\hat\sigma^2 = \frac{\hat\epsilon' \hat\epsilon}{n - k}.\]
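As a quick sanity check on these formulas, the sketch below computes \(\hat\beta\), \(\hat\sigma^2\), and \(\mathbb{V}(\hat{\beta})\) with base R matrix algebra on the `mtcars` data used later in this chapter (the two-predictor model is an arbitrary choice for illustration) and confirms that \(\hat\beta\) matches `lm()`.

X <- model.matrix(mpg ~ wt + hp, data = mtcars)    # n x k design matrix (with intercept)
y <- mtcars$mpg

beta_hat   <- solve(t(X) %*% X, t(X) %*% y)        # (X'X)^-1 X'Y
resids     <- y - X %*% beta_hat
sigma2_hat <- sum(resids^2) / (nrow(X) - ncol(X))  # e'e / (n - k)
var_beta   <- sigma2_hat * solve(t(X) %*% X)       # sigma^2 (X'X)^-1

all.equal(drop(beta_hat), coef(lm(mpg ~ wt + hp, data = mtcars)))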
The OLS estimator is unbiased, but can have a large variance when the predictor variables are highly correlated with each other, or when there are many predictors (notice how \(\hat{\sigma}^2\) increases as \(k \rightarrow n\)). Stepwise selection balances the trade-off by eliminating variables, but this throws away information. Regularization keeps all the predictors, but reduces coefficient magnitudes to reduce variance at the expense of some bias.
We can decompose the expected prediction error (EPE) into a reducible error related to the choice of model and an irreducible error related to the variance of \(y\) (noise).
\[ \text{EPE} = \mathbb{E}\left[\left(f(x) - \hat{f}(x)\right)^2\right] + \mathbb{V}[y|x] \]
The reducible error, which is the mean squared error (MSE) of the estimate, can be further decomposed into the estimator bias and variance.
\[ \text{EPE} = \left(f(x) - \mathbb{E}\left[\hat{f}(x)\right]\right)^2 + \mathbb{E}\left[\left(\hat{f}(x) - \mathbb{E}\left[\hat{f}(x)\right]\right)^2\right] + \mathbb{V}[y|x] \]
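The second decomposition comes from adding and subtracting \(\mathbb{E}[\hat{f}(x)]\) inside the squared error; the cross term drops out because \(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\) has expectation zero:

\[ \mathbb{E}\left[\left(f(x) - \hat{f}(x)\right)^2\right] = \underbrace{\left(f(x) - \mathbb{E}\left[\hat{f}(x)\right]\right)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}\left[\left(\hat{f}(x) - \mathbb{E}\left[\hat{f}(x)\right]\right)^2\right]}_{\text{Variance}} \]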
Illustration. Let’s set up a bias-variance tradeoff by fitting several models of increasing complexity to a random data-generating process, \(f(x) = x^3\). Using the `make_sample()` function and `sim_dat` samples defined at the top of this chapter, we create 250 random samples of 100 points each, choosing \(x\) uniformly between 1 and 5 and setting \(y = f(x) + e\), where \(e\) is normally distributed noise.
We’ll fit five models. The first is a linear model, \(y \sim x\). Then we’ll fit a polynomial model of degree 2. These first two models underfit the data. Then one of degree 3, matching the underlying data-generating process. Then we’ll overfit with one of degree 9. Finally, we’ll use regularization to improve on the degree 9 model.
sim_fit <-
  sim_dat |>
  mutate(
    dat = map(sim, ~ make_sample()),
    fit_1 = map(dat, ~ linear_reg() |> fit(y ~ poly(x, degree = 1), data = .)),
    fit_2 = map(dat, ~ linear_reg() |> fit(y ~ poly(x, degree = 2), data = .)),
    fit_3 = map(dat, ~ linear_reg() |> fit(y ~ poly(x, degree = 3), data = .)),
    fit_9 = map(dat, ~ linear_reg() |> fit(y ~ poly(x, degree = 9), data = .)),
    fit_9a = map(dat,
                 ~ linear_reg(engine = "glmnet", penalty = 0.2, mixture = 1) |>
                   fit(y ~ poly(x, degree = 9), data = .))
  )
This gives us 5 models fit to 250 random samples, each sample containing 100 points from the \(y = x^3 + \epsilon\) data-generating process. Let’s predict a single value, \(f(x = 4)\), from each model.
test_dat <- data.frame(x = 4)

sim_preds <-
  sim_fit |>
  mutate(
    `1` = map_dbl(fit_1, ~ predict(., new_data = test_dat) |> pull(.pred)),
    `2` = map_dbl(fit_2, ~ predict(., new_data = test_dat) |> pull(.pred)),
    `3` = map_dbl(fit_3, ~ predict(., new_data = test_dat) |> pull(.pred)),
    `9` = map_dbl(fit_9, ~ predict(., new_data = test_dat) |> pull(.pred)),
    `9 (LASSO)` = map_dbl(fit_9a, ~ predict(., new_data = test_dat) |> pull(.pred))
  )
Check out the distribution of fitted values. The degree 3 polynomial is the gold standard, and you see both low bias and low variance. The degree-1 polynomial was just bad. The degree 2 had a high bias, but at least the variance was small. The degree 9 model overfit the data. It had low bias overall, but very high variance. The last model used regularization to improve its reliability, and you see the variance has shrunk at the cost of only a small increase in bias.
sim_preds |>
  select(`1`:`9 (LASSO)`) |>
  pivot_longer(c(everything())) |>
  ggplot(aes(x = name, y = value)) +
  geom_hline(yintercept = f(4), linewidth = 1, color = "goldenrod") +
  geom_boxplot(fill = NA, outlier.shape = NA) +
  geom_jitter(height = 0, width = .2, alpha = .2) +
  theme(panel.grid = element_blank()) +
  labs(
    x = "Polynomial Degree", y = "Preds",
    title = "Increasing complexity reduces bias at the expense of higher variance.",
    subtitle = "Simulated Predictions from Five Linear Models"
  )
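We can also estimate each term of the decomposition at \(x = 4\) directly from the simulated predictions. This is a quick sketch built on the `sim_preds` columns created above; the column names identify the models.

sim_preds |>
  pivot_longer(`1`:`9 (LASSO)`, names_to = "model", values_to = "pred") |>
  group_by(model) |>
  summarize(
    bias_sq  = (mean(pred) - f(4))^2,  # squared bias of the prediction at x = 4
    variance = var(pred),              # variance of the prediction across the 250 samples
    mse      = bias_sq + variance      # reducible part of the expected prediction error
  )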
5.2 Case Study
The sections below use the `mtcars` dataset to predict `mpg` from the other variables with the regularization function `glmnet::glmnet()`. Set up a train-test split, training with 5-fold cross-validation (CV).
data("mtcars")
glimpse(mtcars)
Rows: 32
Columns: 11
$ mpg <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,…
$ cyl <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8,…
$ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 16…
$ hp <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180…
$ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92,…
$ wt <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.…
$ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18…
$ vs <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,…
$ am <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,…
$ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3,…
$ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2,…
(mt_split <- initial_split(mtcars))
<Training/Testing/Total>
<24/8/32>
(mt_folds <- vfold_cv(training(mt_split), v = 5))
# 5-fold cross-validation
# A tibble: 5 × 2
splits id
<list> <chr>
1 <split [19/5]> Fold1
2 <split [19/5]> Fold2
3 <split [19/5]> Fold3
4 <split [19/5]> Fold4
5 <split [20/4]> Fold5
5.3 Ridge
Ridge regression estimates the linear model coefficients by minimizing an augmented loss function which includes a term, \(\lambda\), that penalizes the magnitude of the coefficient estimates,
\[L = \sum_{i = 1}^n \left(y_i - x_i^{'} \hat\beta \right)^2 + \lambda \sum_{j=1}^k \hat{\beta}_j^2.\]
The resulting estimate for the coefficients is
\[\hat{\beta} = \left(X'X + \lambda I\right)^{-1}\left(X'Y \right).\]
As \(\lambda \rightarrow 0\), ridge regression approaches OLS. The bias and variance for the ridge estimator are
\[\mathrm{Bias}(\hat{\beta}) = -\lambda \left(X'X + \lambda I \right)^{-1} \beta\] \[\mathbb{V}(\hat{\beta}) = \sigma^2 \left(X'X + \lambda I \right)^{-1}X'X \left(X'X + \lambda I \right)^{-1}\]
Notice how the estimator bias increases with \(\lambda\) while the variance decreases with \(\lambda\). The optimal level for \(\lambda\) is the one that minimizes a criterion such as the root mean squared error (RMSE) or the Akaike or Bayesian information criterion (AIC or BIC), or that maximizes R-squared.
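To make the closed form concrete, here is a minimal sketch using base R matrix algebra on a few standardized `mtcars` predictors with an arbitrary \(\lambda = 10\). Note that `glmnet` scales its penalty differently, so the numbers are purely illustrative.

X <- scale(as.matrix(mtcars[, c("wt", "hp", "disp")]))  # standardized predictors
y <- mtcars$mpg - mean(mtcars$mpg)                      # centered response, so no intercept

lambda <- 10                                            # arbitrary penalty for illustration
beta_ols   <- solve(t(X) %*% X, t(X) %*% y)
beta_ridge <- solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% y)

# The ridge coefficient vector is shrunk relative to OLS (smaller L2 norm).
data.frame(OLS = drop(beta_ols), Ridge = drop(beta_ridge))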
To fit a linear model with ridge regression, specify `linear_reg(mixture = 0)`. Standardize the predictors in the recipe so that each contributes to the model on a similar scale. Otherwise, predictors with larger scales would naturally have larger coefficients.
ridge_mdl <- linear_reg(engine = "glmnet", penalty = tune(), mixture = 0)

ridge_rec <-
  recipe(mpg ~ ., data = mtcars) |>
  step_normalize(all_numeric_predictors())

ridge_wflow <-
  workflow() |>
  add_model(ridge_mdl) |>
  add_recipe(ridge_rec)

# Set up the hyperparameter space for penalty. Set range through trial and error.
ridge_grid <-
  grid_latin_hypercube(
    penalty(range = c(0, 10), trans = NULL),
    size = 30
  )

# tune_grid() returns a resamples object, essentially a tibble with tuning
# results for each hyperparameter combination and fold.
ridge_resamples <-
  ridge_wflow |>
  tune_grid(
    resamples = mt_folds,
    grid = ridge_grid
  )
ridge_resamples |> show_best(metric = "rmse") |> knitr::kable()
penalty | .metric | .estimator | mean | n | std_err | .config |
---|---|---|---|---|---|---|
4.035107 | rmse | standard | 2.748411 | 5 | 0.4425804 | Preprocessor1_Model13 |
3.913497 | rmse | standard | 2.748456 | 5 | 0.4431044 | Preprocessor1_Model12 |
4.462717 | rmse | standard | 2.750088 | 5 | 0.4406450 | Preprocessor1_Model14 |
3.558491 | rmse | standard | 2.750365 | 5 | 0.4443715 | Preprocessor1_Model11 |
4.899727 | rmse | standard | 2.754607 | 5 | 0.4383952 | Preprocessor1_Model15 |
ridge_resamples |>
  collect_metrics() |>
  ggplot(aes(x = penalty, y = mean)) +
  geom_point() +
  facet_wrap(facets = vars(.metric), scales = "free_y") +
  labs(title = "Ridge Regression Hyperparameter Tuning")
Finalize the workflow with the optimal hyperparameter setting.
# Update the workflow with the optimal hyperparameter values.
ridge_final <-
  ridge_wflow |>
  finalize_workflow(parameters = select_best(ridge_resamples, metric = "rmse"))

# Refit the model with the optimal hyperparameters using the _full_ training
# dataset (not the folds).
ridge_fit <- ridge_final |> last_fit(mt_split)

# Metrics are evaluated on the testing dataset.
ridge_fit |> collect_metrics() |> knitr::kable()
.metric | .estimator | .estimate | .config |
---|---|---|---|
rmse | standard | 2.4889825 | Preprocessor1_Model1 |
rsq | standard | 0.8844566 | Preprocessor1_Model1 |
The most important variables here were `am` and `wt`.
ridge_final |>
  fit(training(mt_split)) |>
  pull_workflow_fit() |>
  vi() |>
  ggplot(aes(x = Importance, y = fct_rev(fct_inorder(Variable)), color = Sign)) +
  geom_point(size = 2) +
  geom_segment(aes(x = 0, xend = Importance)) +
  theme(legend.position = "top") +
  labs(
    color = NULL, y = NULL,
    title = "Ridge Variable Importance"
  )
5.4 Lasso
Lasso stands for “least absolute shrinkage and selection operator”. Like ridge, lasso adds a penalty for coefficients, but instead of penalizing the sum of squared coefficients (L2 penalty), lasso penalizes the sum of absolute values (L1 penalty). As a result, coefficients can be zeroed under lasso for high values of \(\lambda\).
The loss function for lasso is
\[L = \sum_{i = 1}^n \left(y_i - x_i^{'} \hat\beta \right)^2 + \lambda \sum_{j=1}^k \left| \hat{\beta}_j \right|.\]
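The zeroing behavior is easy to see on a coefficient path. Here is a quick sketch that calls `glmnet::glmnet()` directly (rather than through `parsnip`) on standardized `mtcars` predictors:

library(glmnet)

x <- scale(as.matrix(mtcars[, setdiff(names(mtcars), "mpg")]))
y <- mtcars$mpg

lasso_path <- glmnet(x, y, alpha = 1)            # alpha = 1 is the pure L1 (lasso) penalty
plot(lasso_path, xvar = "lambda", label = TRUE)  # coefficients drop to exactly zero as lambda grows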
To fit a linear model with lasso regression, specify `linear_reg(mixture = 1)`.
lasso_mdl <- linear_reg(engine = "glmnet", penalty = tune(), mixture = 1)

lasso_rec <-
  recipe(mpg ~ ., data = mtcars) |>
  step_normalize(all_numeric_predictors())

lasso_wflow <-
  workflow() |>
  add_model(lasso_mdl) |>
  add_recipe(lasso_rec)

# Set up the hyperparameter space for penalty. Set range through trial and error.
lasso_grid <-
  grid_latin_hypercube(
    penalty(range = c(0, 4), trans = NULL),
    size = 30
  )

# tune_grid() returns a resamples object, essentially a tibble with tuning
# results for each hyperparameter combination and fold.
lasso_resamples <-
  lasso_wflow |>
  tune_grid(
    resamples = mt_folds,
    grid = lasso_grid
  )
lasso_resamples |> show_best(metric = "rmse") |> knitr::kable()
penalty | .metric | .estimator | mean | n | std_err | .config |
---|---|---|---|---|---|---|
1.0515463 | rmse | standard | 3.135101 | 5 | 0.5020066 | Preprocessor1_Model08 |
0.9219008 | rmse | standard | 3.141623 | 5 | 0.5186401 | Preprocessor1_Model07 |
1.1547527 | rmse | standard | 3.144581 | 5 | 0.4892161 | Preprocessor1_Model09 |
1.2441210 | rmse | standard | 3.161189 | 5 | 0.4772235 | Preprocessor1_Model10 |
0.7771826 | rmse | standard | 3.173642 | 5 | 0.5315058 | Preprocessor1_Model06 |
lasso_resamples |>
  collect_metrics() |>
  ggplot(aes(x = penalty, y = mean)) +
  geom_point() +
  facet_wrap(facets = vars(.metric), scales = "free_y") +
  labs(title = "Lasso Regression Hyperparameter Tuning")
Finalize the workflow with the optimal hyperparameter setting.
# Update the workflow with the optimal hyperparameter values.
lasso_final <-
  lasso_wflow |>
  finalize_workflow(parameters = select_best(lasso_resamples, metric = "rmse"))

# Refit the model with the optimal hyperparameters using the _full_ training
# dataset (not the folds).
lasso_fit <- lasso_final |> last_fit(mt_split)

# Metrics are evaluated on the testing dataset.
lasso_fit |> collect_metrics() |> knitr::kable()
.metric | .estimator | .estimate | .config |
---|---|---|---|
rmse | standard | 2.1770670 | Preprocessor1_Model1 |
rsq | standard | 0.9076329 | Preprocessor1_Model1 |
The most important variables here were `am` and `wt`.
lasso_final |>
  fit(training(mt_split)) |>
  pull_workflow_fit() |>
  vi() |>
  ggplot(aes(x = Importance, y = fct_rev(fct_inorder(Variable)), color = Sign)) +
  geom_point(size = 2) +
  geom_segment(aes(x = 0, xend = Importance)) +
  theme(legend.position = "top") +
  labs(
    color = NULL, y = NULL,
    title = "Lasso Variable Importance"
  )
5.5 Elastic Net
Elastic Net combines the penalties of ridge and lasso to get the best of both worlds. The loss function for elastic net is
\[L = \frac{\sum_{i = 1}^n \left(y_i - x_i^{'} \hat\beta \right)^2}{2n} + \lambda \frac{1 - \alpha}{2}\sum_{j=1}^k \hat{\beta}_j^2 + \lambda \alpha \sum_{j=1}^k \left| \hat{\beta}_j \right|.\]
In this loss function, the new parameter \(\alpha\) is a “mixing” parameter that balances the two approaches. If \(\alpha\) is zero, you are back to ridge regression, and if \(\alpha\) is one, you are back to lasso. To fit a linear model with elastic net regression, specify `linear_reg(mixture = tune())`.
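Before tuning, here is a quick sketch of what the mixture does: at a fixed (arbitrary) penalty, increasing \(\alpha\) zeroes more coefficients. It calls `glmnet()` directly on the same standardized predictor matrix as the lasso sketch above.

x <- scale(as.matrix(mtcars[, setdiff(names(mtcars), "mpg")]))
y <- mtcars$mpg

# Count non-zero (non-intercept) coefficients at lambda = 1 for three mixtures.
sapply(c(ridge = 0, mixed = 0.5, lasso = 1), function(a) {
  fit <- glmnet::glmnet(x, y, alpha = a, lambda = 1)
  sum(coef(fit)[-1] != 0)
})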
elnet_mdl <- linear_reg(engine = "glmnet", penalty = tune(), mixture = tune())

elnet_rec <-
  recipe(mpg ~ ., data = mtcars) |>
  step_normalize(all_numeric_predictors())

elnet_wflow <-
  workflow() |>
  add_model(elnet_mdl) |>
  add_recipe(elnet_rec)

# Set up the hyperparameter space for penalty and mixture. Set penalty range
# through trial and error.
elnet_grid <-
  grid_latin_hypercube(
    penalty(range = c(2, 7), trans = NULL),
    mixture(range = c(.1, .8)),
    size = 30
  )

# tune_grid() returns a resamples object, essentially a tibble with tuning
# results for each hyperparameter combination and fold.
elnet_resamples <-
  elnet_wflow |>
  tune_grid(
    resamples = mt_folds,
    grid = elnet_grid
  )
elnet_resamples |> show_best(metric = "rmse") |> knitr::kable()
penalty | mixture | .metric | .estimator | mean | n | std_err | .config |
---|---|---|---|---|---|---|---|
3.451338 | 0.1427038 | rmse | standard | 2.935395 | 5 | 0.4578454 | Preprocessor1_Model02 |
2.523840 | 0.2380821 | rmse | standard | 2.993297 | 5 | 0.4722182 | Preprocessor1_Model06 |
5.434235 | 0.1124067 | rmse | standard | 3.041341 | 5 | 0.4382533 | Preprocessor1_Model01 |
3.928826 | 0.2142507 | rmse | standard | 3.107241 | 5 | 0.4555658 | Preprocessor1_Model05 |
2.403868 | 0.3727664 | rmse | standard | 3.119993 | 5 | 0.4754769 | Preprocessor1_Model12 |
elnet_resamples |>
  collect_metrics() |>
  filter(.metric == "rmse") |>
  ggplot(aes(x = penalty, y = mixture, color = mean)) +
  geom_point(size = 2) +
  scale_color_gradient(low = "red", high = "blue") +
  # facet_wrap(facets = vars(.metric), scales = "free_y") +
  labs(title = "Elastic Net Regression Hyperparameter Tuning", color = "rmse")
Finalize the workflow with the optimal hyperparameter setting.
# Update the workflow with the optimal hyperparameter values.
elnet_final <-
  elnet_wflow |>
  finalize_workflow(parameters = select_best(elnet_resamples, metric = "rmse"))

# Refit the model with the optimal hyperparameters using the _full_ training
# dataset (not the folds).
elnet_fit <- elnet_final |> last_fit(mt_split)

# Metrics are evaluated on the testing dataset.
elnet_fit |> collect_metrics() |> knitr::kable()
.metric | .estimator | .estimate | .config |
---|---|---|---|
rmse | standard | 2.4026534 | Preprocessor1_Model1 |
rsq | standard | 0.8865235 | Preprocessor1_Model1 |
The most important variables here were `wt`, `qsec`, and `disp`.
elnet_final |>
  fit(training(mt_split)) |>
  pull_workflow_fit() |>
  vi() |>
  ggplot(aes(x = Importance, y = fct_rev(fct_inorder(Variable)), color = Sign)) +
  geom_point(size = 2) +
  geom_segment(aes(x = 0, xend = Importance)) +
  theme(legend.position = "top") +
  labs(
    color = NULL, y = NULL,
    title = "Elastic Net Variable Importance"
  )
5.6 Model Summary
Lasso performed best on RMSE and R-squared.
bind_rows(
  Ridge = collect_metrics(ridge_fit),
  Lasso = collect_metrics(lasso_fit),
  ElNet = collect_metrics(elnet_fit),
  .id = "Model"
) |>
  select(Model, .metric, .estimate) |>
  pivot_wider(names_from = .metric, values_from = .estimate) |>
  knitr::kable()
Model | rmse | rsq |
---|---|---|
Ridge | 2.488983 | 0.8844566 |
Lasso | 2.177067 | 0.9076329 |
ElNet | 2.402653 | 0.8865235 |
Final thoughts on the models.
- Lasso can set some coefficients to zero, thus performing variable selection.
- Lasso and Ridge address multicollinearity differently: in ridge regression, the coefficients of correlated predictors are similar; in lasso, one of the correlated predictors has a larger coefficient, while the rest are (nearly) zeroed.
- Lasso tends to do well if there are a small number of significant parameters and the others are close to zero. Ridge tends to work well if there are many large parameters of about the same value.
- In practice, you don’t know which will be best, so run cross-validation and pick the best performer.