p-values

Inference, estimation, and decision-making from data

The p-value turns "how extreme is my test statistic?" into a single number. It's the probability of seeing data at least as extreme as yours, assuming H₀ is true. A tiny p-value means "this data would be very surprising if there were really no effect", which is evidence against H₀.

The decision rule is mechanical: pick a threshold α in advance (commonly 0.05), then reject H₀ if p < α. A small p doesn't prove H₁; it just says the null explains the data poorly.
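The definition and the decision rule can be sketched by brute force: simulate the test statistic many times in a world where H₀ holds, count how often it comes out at least as extreme as the observed value, and compare that fraction with α. This is only an illustration. The helper names, the choice of statistic (absolute sample mean of 30 standard-normal draws), and the observed value 0.60 are all hypothetical.

```python
import random

def monte_carlo_p_value(observed_stat, simulate_null_stat, n_sims=10_000, seed=0):
    """Estimate P(stat >= observed_stat | H0) by simulating the null."""
    rng = random.Random(seed)
    hits = sum(simulate_null_stat(rng) >= observed_stat for _ in range(n_sims))
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_sims + 1)

def null_abs_mean(rng, n=30):
    """|sample mean| of n standard-normal draws, i.e. data generated under H0."""
    return abs(sum(rng.gauss(0.0, 1.0) for _ in range(n)) / n)

ALPHA = 0.05       # threshold fixed before looking at the data
observed = 0.60    # hypothetical observed |sample mean|

p = monte_carlo_p_value(observed, null_abs_mean)
reject_h0 = p < ALPHA
print(f"estimated p = {p:.4f}, reject H0: {reject_h0}")
```

An observed statistic of 0.0 would give an estimate of exactly 1.0, since every simulated value is at least that extreme.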

A p-value is a fluke check: if nothing were really going on, how surprising would a result like yours be? Suppose a friend claims a coin is fair yet flips nine heads in a row. A p-value puts a number on just how rare that streak would be under the boring 'it's fair' story, H₀. The smaller the number, the harder it is to shrug the result off as luck.
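For the coin example the number can be computed exactly from the binomial distribution. The sketch below assumes nine flips in total, all heads, and a fair-coin null; `binomial_tail_p` is a hypothetical helper name, not a library function.

```python
from math import comb

def binomial_tail_p(heads, flips, p_heads=0.5):
    """One-sided p-value: P(X >= heads) for X ~ Binomial(flips, p_heads)."""
    return sum(
        comb(flips, k) * p_heads**k * (1 - p_heads) ** (flips - k)
        for k in range(heads, flips + 1)
    )

# Nine heads in nine flips under H0: the coin is fair.
p_one_sided = binomial_tail_p(9, 9)
# The fair-coin null is symmetric, so nine tails would be just as extreme.
p_two_sided = min(1.0, 2 * p_one_sided)

print(p_one_sided, p_two_sided)
```

The one-sided value is 1/512, roughly 0.002, so a streak like this would turn up about once in 500 attempts with a fair coin.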

Where this lives in ML

In ML, a p-value tells you whether model A's win over model B on a benchmark is signal or noise. But the trap is real: with a giant test set, a 0.01% accuracy gain can be 'significant' yet utterly meaningless in practice. And p-hacking, trying configurations until one clears p < 0.05, is exactly how leaderboards fill with irreproducible results.
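Both traps can be put into numbers with a rough sketch. It rests on several simplifying assumptions that are not in the text above: the 0.01% gain is read as 0.01 percentage points (90.01% vs 90.00%), the two models are scored on independent test samples of equal size, a normal-approximation two-proportion z-test is used, and the p-hacking attempts are treated as independent. The test-set sizes and the helper name are hypothetical.

```python
from math import erfc, sqrt

def two_proportion_p_value(acc_a, acc_b, n_a, n_b):
    """Two-sided p-value for H0: both models have the same true accuracy.

    Assumes independent test samples and a normal approximation.
    """
    pooled = (acc_a * n_a + acc_b * n_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (acc_a - acc_b) / std_err
    return erfc(abs(z) / sqrt(2))  # P(|Z| >= |z|) for a standard normal Z

# Same tiny gain, two hypothetical test-set sizes.
p_small_test_set = two_proportion_p_value(0.9001, 0.9000, 10_000, 10_000)
p_huge_test_set = two_proportion_p_value(0.9001, 0.9000, 100_000_000, 100_000_000)
print(f"n = 1e4: p = {p_small_test_set:.3f}")
print(f"n = 1e8: p = {p_huge_test_set:.3f}")

# p-hacking: chance that at least one of k independent comparisons with no
# real effect clears p < ALPHA purely by luck.
ALPHA = 0.05
k = 20
false_positive_chance = 1 - (1 - ALPHA) ** k
print(f"{k} tries: {false_positive_chance:.0%} chance of a spurious 'win'")
```

With ten thousand examples the gain is indistinguishable from noise, while with a hundred million the same gain drops below 0.05 without becoming any more useful. Twenty independent tries at α = 0.05 give about a 64% chance of at least one false positive. When both models are scored on the same test examples, a paired test would be the more appropriate choice than the independent-samples approximation used here.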