Statistical Testing for ML

Inference, estimation, and decision-making from data

You've built two classifiers and one scores 91.0% accuracy, the other 91.4%. Is the second really better, or did it just get a luckier test set? Answering this rigorously is statistical testing for ML: hypothesis testing adapted to the quirks of model comparison.

The naive move, a plain t-test on per-fold accuracies, is flawed: cross-validation folds share training data, so the per-fold scores are not independent, as the t-test assumes they are. The correlated scores make the variance estimate too small, which leaves the test overconfident and inflates false positives. Three better tools handle the ML setting honestly.

McNemar's test compares two classifiers on the same test set by looking only at the examples where they disagree, exactly the right question for paired predictions. The bootstrap resamples the test set with replacement many times to build a confidence interval for accuracy directly, no formula needed. The corrected paired t-test adjusts the variance to account for the overlap between CV folds, undoing the overconfidence of the naive version.
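Two of these tools fit in a few lines of plain Python. The sketch below is a minimal illustration, not a reference implementation: the function names `mcnemar_exact` and `corrected_paired_t` are made up for this example, the McNemar variant shown is the exact binomial form, and the variance correction is the Nadeau-Bengio adjustment, which is an assumption about which "corrected" test is meant here.

```python
from math import comb, sqrt
from statistics import mean, variance


def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact (binomial) two-sided McNemar test on paired predictions.

    Returns (b, c, p_value): b counts examples only A got right,
    c counts examples only B got right.
    """
    # Examples where both models are right or both are wrong are ignored.
    b = sum(a == t and p != t for t, a, p in zip(y_true, pred_a, pred_b))
    c = sum(a != t and p == t for t, a, p in zip(y_true, pred_a, pred_b))
    n = b + c
    if n == 0:
        return b, c, 1.0  # the models never disagree: no evidence either way
    # Under H0 each disagreement favours A or B with probability 1/2,
    # so the smaller count follows a Binomial(n, 0.5) tail.
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


def corrected_paired_t(diffs, n_train, n_test):
    """Naive and variance-corrected t statistics for per-fold score gaps.

    diffs: per-fold (score_B - score_A); n_train, n_test: examples per fold.
    Assumes the Nadeau-Bengio correction factor (1/k + n_test/n_train).
    """
    k = len(diffs)
    d_bar = mean(diffs)
    s2 = variance(diffs)  # sample variance, k - 1 in the denominator
    naive = d_bar / sqrt(s2 / k)
    corrected = d_bar / sqrt((1 / k + n_test / n_train) * s2)
    return naive, corrected
```

Either t statistic would be compared against a Student t distribution with k - 1 degrees of freedom. With 10-fold CV the corrected statistic comes out roughly 1.45 times smaller than the naive one, which is the overconfidence being undone.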

Where this lives in ML

This kind of rigor is what separates a real result from leaderboard noise. Before claiming model A beats model B, run McNemar's test (same test set) or a bootstrap CI on the accuracy gap. Reporting a result as "91.2% ± 0.4%" rather than just "91.2%" lets a reader make a rough version of this check by eye: a gap smaller than the stated uncertainty is plausibly noise.
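A bootstrap CI on the accuracy gap can be sketched as follows, using only the standard library. This is a minimal percentile bootstrap under stated assumptions: `bootstrap_gap_ci` and its parameters (`n_boot`, `alpha`, `seed`) are hypothetical names chosen for illustration, and examples are resampled in pairs so that both models are always scored on the same resampled test set.

```python
import random


def bootstrap_gap_ci(y_true, pred_a, pred_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy(B) - accuracy(A) on one test set.

    Returns (observed_gap, lower, upper).
    """
    rng = random.Random(seed)
    n = len(y_true)
    # Per-example gap in correctness: +1 if only B is right,
    # -1 if only A is right, 0 if they agree.
    delta = [int(b == t) - int(a == t)
             for t, a, b in zip(y_true, pred_a, pred_b)]
    gaps = []
    for _ in range(n_boot):
        # Resample test examples with replacement, keeping pairs together.
        sample = rng.choices(delta, k=n)
        gaps.append(sum(sample) / n)
    gaps.sort()
    lo_idx = max(0, int((alpha / 2) * n_boot))
    hi_idx = min(n_boot - 1, int((1 - alpha / 2) * n_boot) - 1)
    return sum(delta) / n, gaps[lo_idx], gaps[hi_idx]
```

If the resulting interval contains zero, the test set does not give clear evidence that either model is better. More refined intervals exist (for example bias-corrected variants), but the percentile form is enough to show the idea.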
