The mathematics of uncertainty
Suppose the truth is distribution p, but you encode outcomes using a different model q. Cross-entropy is the average surprise you actually pay: surprise measured by your model q, but averaged over how often events really occur under p:
$$H(p, q) = -\sum_x p(x) \log q(x)$$
It splits into two meaningful pieces: the unavoidable entropy of the truth, plus a penalty for using the wrong model, the KL divergence (next lesson):
$$H(p, q) = H(p) + D_{\mathrm{KL}}(p \,\|\, q)$$
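As a quick numeric check, the sketch below computes cross-entropy, entropy, and KL divergence for a small discrete example and confirms that the first equals the sum of the other two. The distributions p and q are made-up values chosen only for illustration, the function names are hypothetical, and the natural logarithm is assumed, so results are in nats (the text does not fix a base).

```python
import math

def entropy(p):
    # H(p): average surprise when the model matches the truth.
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    # Surprise is measured by q but averaged over how often events occur under p.
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    # Extra cost paid for using q instead of the true p.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Illustrative distributions over three outcomes (assumed, not from the text).
p = [0.5, 0.25, 0.25]
q = [0.25, 0.25, 0.5]

h_p = entropy(p)
h_pq = cross_entropy(p, q)
kl = kl_divergence(p, q)

print(f"H(p)          = {h_p:.4f} nats")
print(f"KL(p || q)    = {kl:.4f} nats")
print(f"H(p, q)       = {h_pq:.4f} nats")
print(f"H(p) + KL     = {h_p + kl:.4f} nats")
```

The last two printed values should agree up to floating-point rounding, which is the decomposition in numbers.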
Since H(p) is fixed by the data and does not depend on q, minimizing cross-entropy over your model is equivalent to minimizing the KL divergence, which drives q toward p. Because the KL divergence is never negative, cross-entropy is always at least H(p), with equality exactly when q = p.
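To see the minimization claim concretely, here is a minimal sketch for a biased coin. It assumes a true heads probability of 0.3 (an illustrative value), scans a grid of candidate models q, and picks the one with the lowest cross-entropy. The helper name and the grid are hypothetical choices, and the natural logarithm is assumed.

```python
import math

def bernoulli_cross_entropy(p, q):
    # Cross-entropy between a true coin with P(heads) = p and a model with P(heads) = q.
    return -(p * math.log(q) + (1 - p) * math.log(1 - q))

p_true = 0.3  # assumed true heads probability, for illustration only

# Candidate models q = 0.01, 0.02, ..., 0.99 (endpoints excluded to avoid log(0)).
candidates = [i / 100 for i in range(1, 100)]

best_q = min(candidates, key=lambda q: bernoulli_cross_entropy(p_true, q))
h_p = bernoulli_cross_entropy(p_true, p_true)  # equals the entropy H(p)
min_ce = bernoulli_cross_entropy(p_true, best_q)

print(f"best q on the grid      = {best_q}")
print(f"minimum cross-entropy   = {min_ce:.4f} nats")
print(f"entropy of the truth    = {h_p:.4f} nats")
```

On this grid the best candidate coincides with the true probability, and no candidate achieves a cross-entropy below H(p). A real training loop would use gradient descent rather than a grid scan, but the objective being minimized is the same.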