Relationships Between Variables

Inference, estimation, and decision-making from data

So far, each variable has stood alone. The real questions usually involve two variables at once: Does study time relate to grades? Does model size relate to accuracy? The first tool is a scatter plot (one dot per observation, x against y), which lets your eye spot a trend instantly.

To put a number on a linear trend, use the Pearson correlation coefficient r. It runs from −1 to +1: +1 is a perfect upward line, −1 a perfect downward line, 0 no linear relationship at all.
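As a minimal sketch of how r can be computed, the function below uses the standard definition: the sum of products of deviations from the means, divided by the square root of the product of the summed squared deviations. The name pearson_r and the study-time data are made up for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences (illustrative helper)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Deviations of each observation from its variable's mean.
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    # Numerator: how the two variables move together.
    co_movement = sum(a * b for a, b in zip(dx, dy))
    # Denominator: rescales the result into the range [-1, 1].
    scale = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return co_movement / scale

# Made-up example data: hours studied and the resulting grade.
hours = [1, 2, 3, 4, 5, 6]
grades = [52, 55, 61, 68, 70, 79]
r = pearson_r(hours, grades)
print(f"r = {r:.3f}")
```

The function divides by zero if either variable is constant, which matches the fact that r is undefined when one variable has no spread.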

In the figure, the more tightly the points hug the fitted line, the closer |r| is to 1. Spread them out and r drifts toward 0.

Where this lives in ML

Correlation analysis is a daily ML tool. Highly correlated features are redundant; they inflate variance in linear models (multicollinearity) and waste capacity. And when choosing an evaluation benchmark, you check whether it correlates with the metric you actually care about; a cheap proxy metric is only useful if it tracks the expensive real one.