Framework

Inference, estimation, and decision-making from data

Hypothesis testing is a disciplined way to answer "is this effect real, or could it just be noise?", which is exactly the question behind "is model A actually better than model B?" You start by assuming that nothing is going on, then ask how surprising your data would be if that were true.

Two competing claims. The null hypothesis H₀ is the boring default: no effect, no difference. The alternative H₁ is what you suspect: there is an effect. You compute a test statistic from the data and ask: if H₀ were true, how extreme is this value?

If the statistic is so extreme that it would rarely happen under H₀, you reject H₀. Otherwise you fail to reject it (note: never "accept", since absence of evidence isn't evidence of absence).
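The reject / fail-to-reject logic above can be sketched on a toy problem. This is a minimal illustration, not a general-purpose test: it assumes H₀ is "the coin is fair", uses the number of heads as the test statistic, and treats 0.05 as the cutoff for "rarely happens under H₀". That cutoff is a common convention assumed here, not something fixed by the framework. The function names are hypothetical.

```python
from math import comb


def two_sided_tail_probability(heads: int, flips: int) -> float:
    """Probability, under H0 (fair coin), of a head count at least as far
    from flips / 2 as the observed count."""
    # Work with 2 * k - flips so the distance from the expected count stays an integer.
    observed_distance = abs(2 * heads - flips)
    extreme_outcomes = sum(
        comb(flips, k)
        for k in range(flips + 1)
        if abs(2 * k - flips) >= observed_distance
    )
    # Under H0 each of the 2 ** flips sequences is equally likely.
    return extreme_outcomes / 2 ** flips


def decide(tail_probability: float, threshold: float = 0.05) -> str:
    """Reject H0 only when the observed statistic would be rare under H0."""
    # threshold=0.05 is an assumed convention, chosen for illustration.
    return "reject H0" if tail_probability < threshold else "fail to reject H0"


if __name__ == "__main__":
    tail = two_sided_tail_probability(heads=60, flips=100)
    print(f"tail probability under H0: {tail:.4f}")
    print(decide(tail))
```

With 60 heads in 100 flips the result lands just above the assumed cutoff, so the sketch reports "fail to reject H0". That is not a claim that the coin is fair, only that this sample does not rule it out.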

Where this lives in ML

Every "+0.5% accuracy" claim is implicitly a hypothesis test. H₀: the two models are equally good; the observed gap is sampling noise. If you skip the test, you'll ship improvements that vanish on the next data split, chasing Type I errors (false positives: rejecting H₀ when it is actually true). The whole reason ML benchmarks report variance across seeds is to let you ask honestly whether a difference clears the noise floor.
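One way to ask whether a gap between two models clears the seed-to-seed noise is an exact permutation test on per-seed accuracies. The sketch below is illustrative only: the accuracy numbers are made up, the function name is hypothetical, and the difference in mean accuracy is just one possible test statistic. Under H₀ the "model A" and "model B" labels are interchangeable, so the code reassigns them in every possible way and counts how often the reshuffled gap is at least as large as the observed one.

```python
from itertools import combinations
from statistics import mean


def permutation_test(scores_a, scores_b) -> float:
    """Fraction of label reassignments whose absolute mean gap is at least
    as large as the observed gap (two-sided, exact enumeration)."""
    pooled = list(scores_a) + list(scores_b)
    n_a = len(scores_a)
    observed_gap = abs(mean(scores_a) - mean(scores_b))
    tolerance = 1e-12  # guards against floating-point ties

    as_extreme = 0
    total = 0
    for idx in combinations(range(len(pooled)), n_a):
        chosen = set(idx)
        group_a = [pooled[i] for i in idx]
        group_b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        gap = abs(mean(group_a) - mean(group_b))
        if gap >= observed_gap - tolerance:
            as_extreme += 1
        total += 1
    return as_extreme / total


if __name__ == "__main__":
    # Hypothetical test accuracies from five training seeds per model.
    model_a = [0.912, 0.905, 0.918, 0.909, 0.915]
    model_b = [0.907, 0.911, 0.903, 0.913, 0.906]
    print(f"observed gap: {mean(model_a) - mean(model_b):+.4f}")
    print(f"fraction of reshuffles at least as extreme: {permutation_test(model_a, model_b):.3f}")
```

Exact enumeration is only practical for a handful of seeds (five per model gives 252 reassignments); with more runs, a random sample of reassignments is the usual substitute.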