Cross-Entropy

The mathematics of uncertainty

Suppose the truth is distribution p, but you encode outcomes using a different model q. Cross-entropy is the average surprise you actually pay, with surprise measured by your model q but averaged over how often events really occur under p:

$$H(p, q) = -\sum_x p(x) \log q(x)$$
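To make the definition concrete, here is a minimal sketch that computes cross-entropy for a made-up example: a biased coin (the truth p) encoded with a fair-coin model (q). The coin probabilities and the function name `cross_entropy` are illustrative assumptions, and base-2 logarithms are assumed so the result reads in bits.

```python
import math

def cross_entropy(p, q):
    """Average of -log2 q(x), weighted by the true probabilities p(x), in bits."""
    # Outcomes with p(x) == 0 never occur, so they contribute nothing and are skipped.
    return -sum(px * math.log2(qx) for px, qx in zip(p, q) if px > 0)

# Hypothetical example: a biased coin is the truth, a fair coin is the model.
p = [0.9, 0.1]
q = [0.5, 0.5]

print(cross_entropy(p, q))  # about 1.0 bit per flip: the fair-coin code always pays 1 bit
print(cross_entropy(p, p))  # about 0.469 bits: what you would pay with the right model
```

The gap between the two numbers is the price of encoding with the wrong model.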

It splits into two meaningful pieces: the unavoidable entropy of the truth, plus a penalty for using the wrong model, the KL divergence (next lesson):

$$H(p, q) = H(p) + D_{\mathrm{KL}}(p \,\|\, q)$$

Since H(p) is fixed by the data, minimizing cross-entropy over your model q is identical to minimizing the KL divergence, which drives q toward p. Because the KL divergence is never negative, cross-entropy is always at least H(p), with equality only when q = p.
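A quick numerical check can make the split and the lower bound tangible. The sketch below assumes a made-up three-outcome distribution p and three candidate models; the helper names and the numbers are illustrative, and natural logarithms are used (any base works as long as it is consistent).

```python
import math

def entropy(p):
    return -sum(px * math.log(px) for px in p if px > 0)

def cross_entropy(p, q):
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0)

def kl_divergence(p, q):
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

# Hypothetical truth and candidate models, from poor to perfect.
p = [0.7, 0.2, 0.1]
candidates = {
    "uniform": [1 / 3, 1 / 3, 1 / 3],
    "close": [0.6, 0.3, 0.1],
    "exact": [0.7, 0.2, 0.1],
}

results = {}
for name, q in candidates.items():
    h_pq = cross_entropy(p, q)
    gap = h_pq - entropy(p)          # the extra cost beyond H(p)
    kl = kl_divergence(p, q)         # should match the gap
    results[name] = (h_pq, gap, kl)
    print(f"{name:8s} H(p,q)={h_pq:.4f}  H(p,q)-H(p)={gap:.4f}  KL={kl:.4f}")
```

In each row the gap column should agree with the KL column, shrinking toward zero as q approaches p and reaching it only for the exact model.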

Where this lives in ML

Open almost any classifier or language model and the final layer is softmax followed by cross-entropy loss. Minimizing it is exactly maximum-likelihood estimation: −log q(true) summed over the data is the negative log-likelihood. Training a network to predict the next token is minimizing cross-entropy between the true next-token distribution and the model's.
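The link to maximum likelihood can be checked by hand. The following sketch assumes a tiny made-up batch of logits and labels, and treats each label as a one-hot distribution, which is how the "true" distribution usually appears in practice. It is a plain-Python illustration, not the implementation any particular framework uses.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(p, q):
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0)

# Hypothetical batch: (logits from the final layer, index of the true class).
batch = [
    ([2.0, 0.5, -1.0], 0),
    ([0.1, 1.2, 0.3], 1),
    ([-0.5, 0.0, 2.5], 2),
]

nll = 0.0  # negative log-likelihood: -log q(true), summed over the data
ce = 0.0   # cross-entropy against one-hot targets, summed over the data
for logits, label in batch:
    q = softmax(logits)
    nll += -math.log(q[label])
    one_hot = [1.0 if i == label else 0.0 for i in range(len(logits))]
    ce += cross_entropy(one_hot, q)

print(nll, ce)  # the two totals coincide
```

With a one-hot target, every term of the cross-entropy sum vanishes except the one for the true class, which is why the two quantities are the same number.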