Inference, estimation, and decision-making from data
OLS finds the coefficients that fit the training data best, which is exactly the problem when you have many features or little data: it fits the noise too, and the coefficients swing to wild values. Regularized regression tames this by adding a penalty that punishes large coefficients, trading a little training fit for much better generalization.
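Ridge regression adds an L2 penalty, λ times the squared length of the coefficient vector, to the OLS sum of squared errors:

minimize over β:  ‖y − Xβ‖² + λ‖β‖²

Here β is the coefficient vector, X the feature matrix, and y the vector of targets.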
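The knob λ controls the strength. λ = 0 is plain OLS; as λ grows, the coefficient vector is shrunk toward zero, making the model less sensitive to noise in the training data. This shrinkage also fixes the ill-conditioned (XᵀX)⁻¹ from the last lesson: the ridge solution inverts XᵀX + λI instead, and for any λ > 0 that matrix is positive definite and therefore invertible.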
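The effect of λ can be checked numerically. What follows is a minimal sketch, not a reference implementation: it assumes NumPy, uses a hypothetical helper named ridge_fit, omits the intercept for brevity, and runs on synthetic data with a deliberately duplicated feature so that XᵀX is singular and OLS has no unique solution.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge: solve (XᵀX + λI) β = Xᵀy. No intercept, for brevity."""
    p = X.shape[1]
    A = X.T @ X + lam * np.eye(p)
    # Solving the linear system is preferred over forming the inverse explicitly.
    return np.linalg.solve(A, X.T @ y)

# Synthetic data (an assumption for illustration): two identical columns.
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
X = np.column_stack([x1, x1])
y = 2.0 * x1 + 0.1 * rng.normal(size=50)

# X has two columns but rank 1, so XᵀX cannot be inverted.
rank_X = np.linalg.matrix_rank(X)
print("rank of X:", rank_X)

# Adding λI (here λ = 1) gives a well-conditioned, invertible matrix.
cond_ridge = np.linalg.cond(X.T @ X + 1.0 * np.eye(2))
print("condition number of XᵀX + I:", cond_ridge)

# Larger λ gives a shorter coefficient vector.
lambdas = [0.01, 1.0, 100.0]
norms = []
for lam in lambdas:
    beta = ridge_fit(X, y, lam)
    norms.append(float(np.linalg.norm(beta)))
    print(f"lambda={lam:g}  beta={beta}  norm={norms[-1]:.4f}")
```

In this sketch the printed norms should fall as λ rises, and the two coefficients should come out equal, since the penalty splits the weight evenly across the duplicated feature. In practice, features are usually standardized first and the intercept is left unpenalized; both steps are skipped here to keep the example short.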