KL Divergence

The mathematics of uncertainty

KL divergence measures how far a model distribution q is from a reference distribution p: the extra surprise you pay for modeling reality p with the wrong distribution q. It's the gap inside cross-entropy:

H(p, q) = H(p) + KL(p‖q), where KL(p‖q) = Σ_x p(x) log( p(x) / q(x) )
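To make the "gap" concrete, here is a minimal sketch that checks the standard decomposition H(p, q) = H(p) + KL(p‖q) on a small discrete example. The function names and the two toy distributions are illustrative assumptions, not taken from the page, and the code assumes q is nonzero wherever p is nonzero.

```python
import math

def entropy(p):
    # H(p) = -sum p(x) log p(x), in nats; terms with p(x) = 0 contribute 0.
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    # H(p, q) = -sum p(x) log q(x): average surprise when data from p is scored by q.
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    # KL(p || q) = sum p(x) log(p(x) / q(x)); assumes q(x) > 0 wherever p(x) > 0.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy distributions over three outcomes (made-up values for illustration).
p = [0.5, 0.25, 0.25]   # "reality"
q = [0.4, 0.4, 0.2]     # the model

gap = cross_entropy(p, q) - entropy(p)
print("cross-entropy minus entropy:", gap)
print("KL(p || q):                 ", kl_divergence(p, q))
```

The two printed numbers should agree up to floating-point rounding. Because H(p) does not depend on q, minimizing cross-entropy over q is the same as minimizing KL(p‖q).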

Two facts make it the workhorse "distance" of ML. By Gibbs' inequality it's always ≥ 0, and it's zero exactly when q = p. So driving KL to 0 means making your model match the truth perfectly.
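One standard way to see the non-negativity claim is Jensen's inequality applied to the concave logarithm. The following is a sketch for discrete distributions, with sums taken over the x where p(x) > 0:

```latex
\begin{aligned}
-\mathrm{KL}(p \,\|\, q)
  &= \sum_{x} p(x) \log \frac{q(x)}{p(x)} \\
  &\le \log \sum_{x} p(x) \, \frac{q(x)}{p(x)}
     && \text{(Jensen, since } \log \text{ is concave)} \\
  &= \log \sum_{x} q(x) \;\le\; \log 1 = 0 .
\end{aligned}
```

Equality in the Jensen step requires q(x)/p(x) to be constant on the support of p, and equality in the last step requires q to put all its mass there, which together force q = p.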

KL is not symmetric: KL(p‖q) ≠ KL(q‖p) in general, and it violates the triangle inequality. The asymmetry is meaningful, because the two directions reward different failures. KL(p‖q) punishes q heavily for being small where p is large (it's "mode-covering"); KL(q‖p) punishes q for spreading mass where p has none (it's "mode-seeking").
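The asymmetry is easy to see numerically. In this sketch, p is a toy bimodal distribution over four bins and two hypothetical candidates compete to approximate it: one spread evenly over everything, one concentrated on a single mode. All values and names are made up for illustration.

```python
import math

def kl_divergence(a, b):
    # KL(a || b) = sum a(x) log(a(x) / b(x)); assumes b(x) > 0 wherever a(x) > 0.
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

# Bimodal target: most of the mass sits in the first and last bins.
p = [0.48, 0.02, 0.02, 0.48]

q_cover = [0.25, 0.25, 0.25, 0.25]  # spreads mass over both modes
q_seek = [0.94, 0.02, 0.02, 0.02]   # commits to one mode, nearly ignores the other

# Forward direction KL(p || q): q is penalized for being small where p is large.
forward_cover = kl_divergence(p, q_cover)
forward_seek = kl_divergence(p, q_seek)

# Reverse direction KL(q || p): q is penalized for putting mass where p is small.
reverse_cover = kl_divergence(q_cover, p)
reverse_seek = kl_divergence(q_seek, p)

print("KL(p || q): cover =", forward_cover, " seek =", forward_seek)
print("KL(q || p): cover =", reverse_cover, " seek =", reverse_seek)
```

On these numbers the forward direction scores the spread-out candidate as closer to p, while the reverse direction scores the single-mode candidate as closer, which is the mode-covering versus mode-seeking behavior described above.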

Where this lives in ML

A VAE's ELBO has a KL term pulling the encoder's latent distribution toward the prior N(0, I), a regularizer that keeps the latent space well-behaved. RL methods like PPO/TRPO constrain each policy update with a KL "trust region" so the new policy can't lurch too far. Knowledge distillation minimizes KL between a big teacher's and a small student's output distributions.
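For the VAE case, the KL term has a well-known closed form when the encoder outputs a diagonal Gaussian N(μ, diag(σ²)) and the prior is N(0, I): KL = ½ Σ (μ² + σ² − 1 − log σ²). The sketch below assumes that common parameterization, with the encoder emitting a mean and a log-variance per latent dimension. The function name and inputs are hypothetical, and real implementations would operate on batched tensors rather than Python lists.

```python
import math

def gaussian_kl_to_standard_normal(mu, log_var):
    # KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions.
    # Per dimension: 0.5 * (mu^2 + sigma^2 - 1 - log sigma^2).
    return 0.5 * sum(
        m * m + math.exp(lv) - 1.0 - lv
        for m, lv in zip(mu, log_var)
    )

# An encoder output that already matches the prior pays no penalty.
kl_at_prior = gaussian_kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])

# Shifting one latent mean away from 0 is penalized quadratically.
kl_shifted = gaussian_kl_to_standard_normal([1.0, 0.0], [0.0, 0.0])

print("KL at the prior:", kl_at_prior)
print("KL with one mean shifted to 1:", kl_shifted)
```

Note the direction: the ELBO uses KL(encoder ‖ prior), so the encoder's distribution plays the role of the first argument.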
