Evaluation Metrics

Inference, estimation, and decision-making from data

"Accuracy" sounds like the obvious way to score a classifier, right up until it lies to you. The right evaluation metric depends on the task and on the cost of different mistakes. Start with the confusion matrix: counts of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). Every metric here is built from these four numbers; accuracy, for example, is (TP+TN)/(TP+FP+TN+FN).
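As a minimal sketch of how accuracy can mislead, the snippet below tallies the four confusion-matrix counts for a binary problem and scores a model that always predicts the negative class. The helper names (`confusion_counts`, `accuracy`) and the data (5 positives out of 100 examples) are illustrative assumptions, not part of any particular library.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, TN, FN) for binary labels, where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def accuracy(tp, fp, tn, fn):
    """Fraction of all predictions that were correct."""
    return (tp + tn) / (tp + fp + tn + fn)


# Hypothetical imbalanced data: 5 positives, 95 negatives.
y_true = [1] * 5 + [0] * 95
# A "model" that never flags anything as positive.
y_pred = [0] * 100

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print(tp, fp, tn, fn)             # 0 0 95 5
print(accuracy(tp, fp, tn, fn))   # 0.95
```

The score is 95% even though the model misses every positive, which is why the counts themselves are a better starting point than a single accuracy number.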

Two complementary metrics. Precision = TP/(TP+FP) asks "of the things I flagged positive, how many really were?" Recall = TP/(TP+FN) asks "of the actual positives, how many did I catch?"
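The two formulas translate directly into code. This is a sketch under one stated assumption: when a denominator is zero (nothing flagged, or no actual positives), the functions return 0.0. That is a common convention, not the only one. The counts used are hypothetical.

```python
def precision(tp, fp):
    """TP / (TP + FP): of everything flagged positive, the share that was right."""
    flagged = tp + fp
    # Assumed convention: 0.0 when nothing was flagged positive.
    return tp / flagged if flagged > 0 else 0.0


def recall(tp, fn):
    """TP / (TP + FN): of all actual positives, the share that was caught."""
    actual_positives = tp + fn
    # Assumed convention: 0.0 when there are no actual positives.
    return tp / actual_positives if actual_positives > 0 else 0.0


# Hypothetical counts: 10 items flagged (8 correct), 20 actual positives.
tp, fp, fn = 8, 2, 12
print(precision(tp, fp))  # 0.8
print(recall(tp, fn))     # 0.4
```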

They trade off: flag everything and recall hits 1 but precision craters; flag only the surest cases and precision soars while recall drops. The F1 score balances them as their harmonic mean:

F1 = 2 × Precision × Recall / (Precision + Recall)
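To make the trade-off concrete, here is a minimal sketch that sweeps a decision threshold over classifier scores and reports precision, recall, and F1, with F1 computed as the harmonic mean 2PR/(P+R). The function name `prf_at_threshold`, the labels, and the scores are all hypothetical, and zero denominators are assumed to yield 0.0.

```python
def prf_at_threshold(y_true, scores, threshold):
    """Return (precision, recall, F1) when scores >= threshold are flagged positive."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    # Assumed convention: undefined ratios are reported as 0.0.
    prec = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    rec = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = 2 * prec * rec / (prec + rec) if (prec + rec) > 0 else 0.0
    return prec, rec, f1


# Hypothetical data: 4 positives, 6 negatives, with model scores in [0, 1].
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.2, 0.1, 0.1]

for threshold in (0.0, 0.5, 0.85):
    prec, rec, f1 = prf_at_threshold(y_true, scores, threshold)
    print(f"threshold={threshold:.2f}  precision={prec:.2f}  recall={rec:.2f}  F1={f1:.2f}")
```

At threshold 0.0 everything is flagged, so recall is 1 while precision falls to the base rate; at 0.85 only the surest case is flagged, so precision is 1 while recall drops. F1 is highest at the middle threshold, where the two are balanced.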

Where this lives in ML

Choosing the wrong metric quietly wrecks ML projects. Optimizing accuracy on imbalanced data produces a model that ignores the class you actually care about. The metric you optimize is the behavior you get, so define success with precision, recall, F1, or AUC before you train, matched to the real-world cost of false positives versus false negatives.