Multiple Testing

Inference, estimation, and decision-making from data

Run one test at α = 0.05 and, when there is no real effect, you have a 5% chance of a false positive. Run twenty independent tests and, even if nothing is real, you'll probably get at least one "significant" result by pure luck. This is the multiple testing problem, and it silently corrupts a huge amount of research and ML experimentation.

The chance of at least one false positive across m tests, the family-wise error rate, balloons: with m independent tests at level α it's 1 − (1 − α)^m. For m = 20, α = 0.05, that's 1 − 0.95^20 ≈ 0.64, or about 64%, so you are more likely than not to find a phantom effect.
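The formula above can be checked numerically. The following is a minimal sketch, not a reference implementation: the function names `fwer_independent` and `simulate_fwer` are made up for illustration, and the simulation assumes every null hypothesis is true and the tests are independent, so each p-value is uniform on [0, 1].

```python
import random


def fwer_independent(m: int, alpha: float) -> float:
    """Closed-form family-wise error rate for m independent tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** m


def simulate_fwer(m: int, alpha: float, trials: int = 20000, seed: int = 0) -> float:
    """Estimate the family-wise error rate by simulation.

    Assumption: all nulls are true and tests are independent, so each
    p-value is drawn uniformly from [0, 1].
    """
    rng = random.Random(seed)
    families_with_false_positive = 0
    for _ in range(trials):
        # A family "fails" if any of its m p-values falls below alpha.
        if any(rng.random() < alpha for _ in range(m)):
            families_with_false_positive += 1
    return families_with_false_positive / trials


if __name__ == "__main__":
    for m in (1, 5, 20, 100):
        exact = fwer_independent(m, 0.05)
        approx = simulate_fwer(m, 0.05)
        print(f"m={m:>3}  exact={exact:.3f}  simulated={approx:.3f}")
```

The simulated rate should land close to the closed-form value for each m, which is one way to convince yourself that the growth in false positives comes from the number of tests alone, not from anything in the data.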

Buy a single lottery ticket and your odds of winning are tiny. Buy a thousand and one of them might "win" something purely by chance, even though you have no special insight at all. Running many statistical tests is the same gamble: with enough tries, a meaningless fluke will eventually cross the significance line and masquerade as a real discovery.

Where this lives in ML

Multiple testing is a quiet killer of ML rigor. A hyperparameter search over 100 configurations, an ablation study with dozens of variants, or a benchmark suite with 50 tasks: each is a barrage of implicit tests. Picking "the config that won on the validation set" without correction is mass multiple testing, and it's why so many reported gains evaporate on a fresh test set.
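To see how a validation "win" can evaporate, here is a small simulation sketch. Everything in it is an illustrative assumption: 100 configurations that all share the same true accuracy of 0.80, a validation set and a test set of 500 examples each, and hypothetical helper names `noisy_accuracy` and `selection_bias`. Because no configuration is actually better, any gap between the best validation score and the fresh test score is produced by selection alone.

```python
import random


def noisy_accuracy(true_acc: float, n_examples: int, rng: random.Random) -> float:
    """Measured accuracy on n_examples when each prediction is correct with prob true_acc."""
    correct = sum(rng.random() < true_acc for _ in range(n_examples))
    return correct / n_examples


def selection_bias(
    n_configs: int = 100,
    true_acc: float = 0.80,
    n_examples: int = 500,
    trials: int = 20,
    seed: int = 0,
) -> tuple[float, float]:
    """Return (mean best validation score, mean fresh test score of the winner).

    Assumption: every config has identical true accuracy, so the "winner"
    is just the config that got luckiest on the validation set.
    """
    rng = random.Random(seed)
    best_val_total = 0.0
    test_total = 0.0
    for _ in range(trials):
        val_scores = [noisy_accuracy(true_acc, n_examples, rng) for _ in range(n_configs)]
        best_val_total += max(val_scores)
        # The winner is no better than the others, so its score on a
        # fresh test set is simply another independent noisy measurement.
        test_total += noisy_accuracy(true_acc, n_examples, rng)
    return best_val_total / trials, test_total / trials


if __name__ == "__main__":
    best_val, fresh_test = selection_bias()
    print(f"true accuracy:         0.800")
    print(f"best validation score: {best_val:.3f}")
    print(f"winner on fresh test:  {fresh_test:.3f}")
```

Under these assumptions the best validation score sits noticeably above the true accuracy, while the winner's fresh test score falls back toward it. The size of the gap depends on the number of configurations and the validation set size, so treat the printed numbers as a demonstration of the mechanism rather than a general estimate.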