Inference, estimation, and decision-making from data
"Accuracy" sounds like the obvious way to score a classifier, right up until it lies to you. The right evaluation metric depends entirely on the task and the cost of different mistakes. Start with the confusion matrix: counts of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). Every metric is built from these four numbers.
Two complementary metrics. Precision = TP/(TP+FP) asks "of the things I flagged positive, how many really were?" Recall = TP/(TP+FN) asks "of the actual positives, how many did I catch?"
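The two definitions translate directly into code. The sketch below uses hypothetical counts, and returning 0.0 when a denominator is zero is a convention assumed here for illustration (the ratio is strictly undefined in that case).

```python
def precision(tp, fp):
    """TP / (TP + FP): of the examples flagged positive, the fraction that were."""
    flagged = tp + fp
    # Assumption: report 0.0 when nothing was flagged, rather than raising.
    return tp / flagged if flagged > 0 else 0.0


def recall(tp, fn):
    """TP / (TP + FN): of the actual positives, the fraction that were caught."""
    actual_positives = tp + fn
    # Assumption: report 0.0 when there are no actual positives.
    return tp / actual_positives if actual_positives > 0 else 0.0


# Hypothetical confusion-matrix counts.
tp, fp, tn, fn = 40, 10, 930, 20

p = precision(tp, fp)   # 40 / 50
r = recall(tp, fn)      # 40 / 60
print(f"precision = {p:.3f}")   # precision = 0.800
print(f"recall    = {r:.3f}")   # recall    = 0.667
```

Note that TN appears in neither formula, so both metrics ignore how many negatives were correctly left alone.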
They trade off: flag everything and recall hits 1 but precision falls to the fraction of examples that are actually positive; flag only the surest cases and precision rises while recall drops. The F1 score balances them as their harmonic mean: