Regularized Regression

Inference, estimation, and decision-making from data

OLS finds the coefficients that fit the training data best, which is exactly the problem when you have many features or little data: it fits the noise too, and the coefficients swing to wild values. Regularized regression tames this by adding a penalty that punishes large coefficients, trading a little training fit for much better generalization.

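Ridge regression adds an L2 penalty, the squared length of the coefficient vector, to the OLS squared-error loss. Writing β for the coefficient vector, ridge solves:

minimize over β:  ‖y − Xβ‖² + λ‖β‖²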

The knob λ controls the strength of the penalty. λ = 0 recovers plain OLS; as λ grows, every coefficient is shrunk toward zero, giving a smoother, less variable model. The penalty also fixes the ill-conditioned (XᵀX)⁻¹ from the last lesson: the ridge solution inverts XᵀX + λI instead of XᵀX, and that matrix is guaranteed to be invertible for any λ > 0.
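To make the effect of λ concrete, here is a minimal NumPy sketch of the closed-form ridge solution on synthetic data. The data, the helper name `ridge_fit`, and the λ grid are illustrative assumptions, not part of the lesson; two columns are made nearly collinear on purpose so that XᵀX is ill-conditioned.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge: solve (X^T X + lam * I) beta = X^T y."""
    n_features = X.shape[1]
    A = X.T @ X + lam * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)

# Synthetic data (assumed for illustration only).
rng = np.random.default_rng(0)
n, p = 30, 10
X = rng.normal(size=(n, p))
# Make column 1 almost a copy of column 0, so X^T X is ill-conditioned.
X[:, 1] = X[:, 0] + 1e-4 * rng.normal(size=n)
true_beta = rng.normal(size=p)
y = X @ true_beta + 0.5 * rng.normal(size=n)

lams = [0.0, 0.1, 1.0, 10.0, 100.0]
norms, conds = [], []
for lam in lams:
    beta = ridge_fit(X, y, lam)
    # Length of the coefficient vector and condition number of the matrix being inverted.
    norms.append(np.linalg.norm(beta))
    conds.append(np.linalg.cond(X.T @ X + lam * np.eye(p)))
    print(f"lambda={lam:7.1f}  ||beta||={norms[-1]:12.4f}  cond={conds[-1]:.3e}")
```

As λ increases, the printed coefficient norm falls and the condition number of XᵀX + λI drops, which is the shrinkage and the invertibility fix described above. The sketch penalizes all coefficients equally and has no intercept; in practice the intercept is usually left unpenalized and features are standardized first.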

Where this lives in ML

The ridge penalty is weight decay, the most common regularizer in deep learning and a built-in option in most optimizers. And as you saw in lesson 8, ridge = MAP with a Gaussian prior, lasso (the L1-penalized counterpart) = MAP with a Laplace prior. Regularization, weight decay, and Bayesian priors are three names for the same idea: prefer simpler weights unless the data strongly argues otherwise.
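The link between the ridge penalty and weight decay can be checked numerically. The sketch below, on made-up data with an assumed learning rate `lr` and penalty strength `lam`, compares one plain gradient-descent step on the L2-penalized loss with one step that first shrinks the weights by a factor (1 − lr·lam) and then follows the unpenalized gradient. The two coincide for plain gradient descent; with adaptive optimizers such as Adam the two formulations generally differ, which is why decoupled weight decay (AdamW) exists.

```python
import numpy as np

# Synthetic data and hyperparameters (assumed for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = rng.normal(size=50)
w = rng.normal(size=5)
lr, lam = 0.1, 0.01

def grad_mse(w):
    """Gradient of 0.5 * mean squared error with respect to w."""
    return X.T @ (X @ w - y) / len(y)

# (a) Gradient step on the penalized loss: 0.5 * MSE + (lam / 2) * ||w||^2.
w_l2 = w - lr * (grad_mse(w) + lam * w)

# (b) Weight decay: shrink the weights, then step along the unpenalized gradient.
w_decay = (1 - lr * lam) * w - lr * grad_mse(w)

same_update = np.allclose(w_l2, w_decay)
print("L2 penalty step equals weight decay step:", same_update)
```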
