Inference, estimation, and decision-making from data
You've built two classifiers: one scores 91.0% accuracy, the other 91.4%. Is the second really better, or did it just get a luckier test set? Answering this rigorously is the job of statistical testing for ML: hypothesis testing adapted to the quirks of model comparison.
The naive move, a plain t-test on per-fold accuracies, is flawed: cross-validation folds share most of their training data, so the per-fold scores are correlated and violate the independence the t-test assumes. The test then underestimates the variance and becomes overconfident, inflating the false-positive rate. Three better tools handle the ML setting honestly.
McNemar's test compares two classifiers on the same test set by looking only at the examples where they disagree (one is right, the other wrong), which is exactly the right question for paired predictions. The bootstrap resamples the test set with replacement many times to build a confidence interval for accuracy, or for the accuracy difference between two models, directly from the resampled scores, with no closed-form formula needed. The corrected paired t-test (the Nadeau and Bengio correction) inflates the variance estimate to account for the overlap between CV training sets, countering the overconfidence of the naive version.
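The three tools can be sketched in a few lines of Python. This is a minimal illustration, assuming NumPy and SciPy are available, not a definitive implementation. The function names (mcnemar_exact, bootstrap_diff_ci, corrected_paired_ttest) and parameters (n_boot, n_train, n_test) are hypothetical, chosen for this example. Two choices here are assumptions rather than anything the text prescribes: the bootstrap is applied to the accuracy difference between the two models, and the t-test correction uses the Nadeau and Bengio factor 1/k + n_test/n_train.

```python
import math

import numpy as np
from scipy import stats


def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact two-sided McNemar p-value, using only discordant examples."""
    y_true, pred_a, pred_b = map(np.asarray, (y_true, pred_a, pred_b))
    a_ok = pred_a == y_true
    b_ok = pred_b == y_true
    n_a_only = int(np.sum(a_ok & ~b_ok))  # A right, B wrong
    n_b_only = int(np.sum(~a_ok & b_ok))  # B right, A wrong
    n = n_a_only + n_b_only
    if n == 0:
        return 1.0  # the models never disagree: no evidence of a difference
    # Under H0 each discordant example favors A or B with probability 0.5.
    tail = stats.binom.cdf(min(n_a_only, n_b_only), n, 0.5)
    return float(min(1.0, 2.0 * tail))


def bootstrap_diff_ci(y_true, pred_a, pred_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy(B) - accuracy(A)."""
    y_true, pred_a, pred_b = map(np.asarray, (y_true, pred_a, pred_b))
    rng = np.random.default_rng(seed)
    # Per-example difference in correctness: +1, 0, or -1.
    # Resampling these keeps the two models' predictions paired.
    d = (pred_b == y_true).astype(float) - (pred_a == y_true).astype(float)
    n = len(d)
    diffs = np.array([d[rng.integers(0, n, size=n)].mean() for _ in range(n_boot)])
    lo, hi = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)


def corrected_paired_ttest(scores_a, scores_b, n_train, n_test):
    """Corrected resampled t-test on k paired per-fold scores.

    n_train and n_test are the training and test set sizes of one fold.
    Returns (t statistic, two-sided p-value).
    """
    diff = np.asarray(scores_b, dtype=float) - np.asarray(scores_a, dtype=float)
    k = len(diff)
    var = diff.var(ddof=1)
    if var == 0:
        raise ValueError("per-fold differences have zero variance")
    # The naive test uses var / k; the extra n_test / n_train term
    # widens the variance to reflect overlapping training sets.
    t = diff.mean() / math.sqrt((1.0 / k + n_test / n_train) * var)
    p = 2.0 * stats.t.sf(abs(t), df=k - 1)
    return float(t), float(p)
```

In use, y_true, pred_a, and pred_b would be the labels and the two models' predictions on one shared test set, while scores_a and scores_b would be per-fold accuracies from the same cross-validation splits. Because the correction term is always positive, the corrected statistic is smaller in magnitude than the naive paired t statistic on the same scores.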