Variance

The mathematics of uncertainty

Two bets can share the same average and feel completely different: "+1 or −1" versus "+1000 or −1000" both average to 0 at even odds, but the second swings a thousand times further. Variance measures that spread as the average squared distance of X from its mean μ = E[X]:

Var(X) = E[(X − μ)²]

Squaring keeps deviations positive (so they don't cancel) and punishes large excursions harder. To get back to the original units, take the square root: the standard deviation σ = √Var(X).
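To make the two bets concrete, here is a minimal sketch in Python. It assumes each bet pays its two outcomes with probability 1/2 (the text only says both average to 0), and `variance` is an illustrative helper, not a library function.

```python
import math

def variance(outcomes, probs):
    """Var(X) = E[(X - mu)^2] for a discrete distribution (illustrative helper)."""
    mu = sum(p * x for x, p in zip(outcomes, probs))
    return sum(p * (x - mu) ** 2 for x, p in zip(outcomes, probs))

# Assumption: both outcomes of each bet are equally likely.
small_bet = variance([1, -1], [0.5, 0.5])
large_bet = variance([1000, -1000], [0.5, 0.5])

print(small_bet, math.sqrt(small_bet))  # 1.0 1.0
print(large_bet, math.sqrt(large_bet))  # 1000000.0 1000.0
```

The variances differ by a factor of a million, while the standard deviations (1 versus 1000) are back in the units of the bet itself.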

In practice the shortcut formula is faster, "the mean of the square minus the square of the mean":

Var(X) = E[X²] − (E[X])²
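As a quick check that the shortcut agrees with the definition, the sketch below computes both for a fair six-sided die. The die is an illustrative example not taken from the text, and exact fractions are used so the comparison is not blurred by rounding.

```python
from fractions import Fraction

# Illustrative example: a fair six-sided die, each face with probability 1/6.
outcomes = range(1, 7)
p = Fraction(1, 6)

mean = sum(p * x for x in outcomes)                # E[X]   = 7/2
mean_of_square = sum(p * x * x for x in outcomes)  # E[X^2] = 91/6

# Definition: average squared distance from the mean.
var_definition = sum(p * (x - mean) ** 2 for x in outcomes)
# Shortcut: mean of the square minus the square of the mean.
var_shortcut = mean_of_square - mean ** 2

print(var_definition, var_shortcut)  # 35/12 35/12
```

One caveat when moving from exact fractions to floating point: if the mean is large relative to the spread, the shortcut subtracts two nearly equal numbers and can lose precision, so the definition is often the safer form numerically.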

Where this lives in ML

The variance of a gradient estimator determines how noisy each training step is. A mini-batch gradient is an average of per-example gradients. By Bienaymé's identity, the variance of a sum of independent variables is the sum of their variances, so averaging n independent estimates with the same variance divides the variance by n, and the standard deviation of the noise falls like 1/√n. That is why bigger batches give smoother, lower-variance steps, and why variance-reduction techniques can speed up training.
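The 1/n scaling can be checked with a small simulation. This is a sketch under strong assumptions: each "per-example gradient" is stood in for by an independent standard Gaussian draw (mean 0, variance 1), which real gradients only approximate, and the names `batch_mean` and `variance_of_batch_mean` are hypothetical.

```python
import random
import statistics

random.seed(0)  # fixed seed so repeated runs give the same numbers

def batch_mean(n):
    # Stand-in for a mini-batch gradient: the average of n independent
    # draws, each with mean 0 and variance 1 (an assumption, not real data).
    return sum(random.gauss(0.0, 1.0) for _ in range(n)) / n

def variance_of_batch_mean(n, trials=5000):
    # Empirical variance of the batch average across many simulated batches.
    return statistics.pvariance([batch_mean(n) for _ in range(trials)])

results = {n: variance_of_batch_mean(n) for n in (1, 4, 16, 64)}
for n, v in results.items():
    print(f"n={n:3d}  empirical variance={v:.4f}  predicted 1/n={1 / n:.4f}")
```

Under these assumptions the empirical variance should track 1/n closely, so quadrupling the batch size roughly halves the standard deviation of the noise.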