Inference, estimation, and decision-making from data
Real predictions use many inputs, not one. Multiple linear regression generalizes the line to a flat plane (or hyperplane) in higher dimensions: each feature gets its own coefficient. Stacking all the data into a matrix X, the model is gorgeously compact:
Here X is the n×d design matrix (one row per observation, one column per feature), β is the vector of coefficients, and y the outputs. The OLS solution has a famous closed form:
The geometry is worth picturing. The vector of predictions Xβ̂ must live in the column space of X, the set of all combinations of your feature columns. OLS picks the β̂ whose prediction is the point in that space closest to y. Geometrically, ŷ is the orthogonal projection of y onto the column space, and the residual y − ŷ is perpendicular to it. That perpendicularity is exactly what (XᵀX)⁻¹Xᵀ computes.