The mathematics of uncertainty
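KL divergence measures how far a model distribution q is from a reference distribution p: the extra surprise you pay, on average, for modeling reality p with the wrong distribution q. It's the gap inside cross-entropy:
H(p, q) = H(p) + KL(p‖q), where KL(p‖q) = Σ_x p(x) log(p(x) / q(x))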
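The relationship between cross-entropy, entropy, and KL can be checked numerically. The sketch below assumes discrete distributions given as plain lists of probabilities and uses natural logarithms (nats); the function names and the example values for p and q are illustrative, not taken from any particular library.

```python
import math

def entropy(p):
    """H(p) = -sum p(x) log p(x), in nats; terms with p(x) = 0 contribute 0."""
    return -sum(px * math.log(px) for px in p if px > 0)

def cross_entropy(p, q):
    """H(p, q) = -sum p(x) log q(x); infinite if q(x) = 0 where p(x) > 0."""
    total = 0.0
    for px, qx in zip(p, q):
        if px > 0:
            if qx == 0:
                return math.inf
            total -= px * math.log(qx)
    return total

def kl_divergence(p, q):
    """KL(p||q) = sum p(x) log(p(x) / q(x)); infinite if q misses mass p has."""
    total = 0.0
    for px, qx in zip(p, q):
        if px > 0:
            if qx == 0:
                return math.inf
            total += px * math.log(px / qx)
    return total

# Hypothetical example values: p plays the role of "reality", q the model.
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

# The gap between cross-entropy and entropy should equal the KL divergence.
gap = cross_entropy(p, q) - entropy(p)
print(f"KL(p||q) = {kl_divergence(p, q):.6f}")
print(f"H(p, q) - H(p) = {gap:.6f}")
```

The two printed values should agree up to floating-point rounding, which is the sense in which KL is the part of cross-entropy that the model q is responsible for.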
Two facts make it the workhorse "distance" of ML. By Gibbs' inequality it's always ≥ 0, and it's zero exactly when q = p. So driving KL to 0 means making your model match the truth perfectly.
KL is not a true metric, though. It is not symmetric: KL(p‖q) ≠ KL(q‖p) in general, and it can violate the triangle inequality. The asymmetry is meaningful, because the two directions penalize different failures. KL(p‖q) punishes q heavily for being small where p is large, so minimizing it over q is "mode-covering"; KL(q‖p) punishes q for putting mass where p has little or none, so minimizing it is "mode-seeking".
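A small numerical sketch can make the two directions concrete. It assumes a made-up three-state distribution p with two modes separated by a low-probability valley, and two hypothetical candidates for q: one broad, one concentrated on a single mode. All values are illustrative.

```python
import math

def kl_divergence(a, b):
    """KL(a||b) in nats; assumes b(x) > 0 wherever a(x) > 0."""
    return sum(ax * math.log(ax / bx) for ax, bx in zip(a, b) if ax > 0)

# Hypothetical bimodal target: two modes with a valley between them.
p = [0.495, 0.01, 0.495]

# Two hypothetical candidate models.
q_broad = [0.34, 0.32, 0.34]   # spreads mass over everything, valley included
q_peak = [0.98, 0.01, 0.01]    # commits to the first mode, ignores the second

for name, q in [("broad", q_broad), ("peak", q_peak)]:
    forward = kl_divergence(p, q)   # KL(p||q): the "mode-covering" direction
    reverse = kl_divergence(q, p)   # KL(q||p): the "mode-seeking" direction
    print(f"{name:>5}: KL(p||q) = {forward:.3f}   KL(q||p) = {reverse:.3f}")
```

With these numbers, KL(p‖q) ranks the broad candidate better, because the peaked one is tiny where p has its second mode. KL(q‖p) ranks the peaked candidate better, because the broad one puts a third of its mass in the valley where p has almost none. Neither candidate gives the same value in both directions.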