Mutual Information

The mathematics of uncertainty

Mutual information measures how much knowing one variable tells you about another: the reduction in uncertainty about X once you observe Y. It's the KL divergence between the true joint and the "pretend they're independent" product of marginals:

$$I(X; Y) = D_{\mathrm{KL}}\big(p(x, y) \,\|\, p(x)\,p(y)\big) = \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)}$$

Because it's a KL, it's always ≥ 0, and it's zero exactly when X and Y are independent, the case where the joint really does factor into the product of marginals. The further the joint is from independence, the more information the variables share.
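To make the "zero exactly when independent" claim concrete, here is a minimal sketch that computes I(X; Y) in bits from a small joint probability table. The function name `mutual_information` and the three example tables are illustrative assumptions, not something taken from a particular library.

```python
import math

def mutual_information(joint):
    """Mutual information I(X; Y) in bits from a joint probability table.

    joint[i][j] is p(X = i, Y = j); the entries are assumed to sum to 1.
    """
    px = [sum(row) for row in joint]         # marginal of X (row sums)
    py = [sum(col) for col in zip(*joint)]   # marginal of Y (column sums)
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0:                      # 0 * log 0 is treated as 0
                mi += pxy * math.log2(pxy / (px[i] * py[j]))
    return mi

# Hypothetical 2x2 joints over two binary variables.
independent = [[0.25, 0.25], [0.25, 0.25]]   # joint equals product of marginals
correlated = [[0.4, 0.1], [0.1, 0.4]]        # X and Y usually agree
identical = [[0.5, 0.0], [0.0, 0.5]]         # Y is a copy of X

mi_independent = mutual_information(independent)
mi_correlated = mutual_information(correlated)
mi_identical = mutual_information(identical)
```

With these tables the independent case comes out at zero, the copy case at one full bit (everything there is to know about a fair coin), and the correlated case somewhere in between.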

Equivalently, it's the drop in entropy of X from learning Y:

$$I(X; Y) = H(X) - H(X \mid Y)$$
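The identity I(X; Y) = H(X) − H(X | Y) can be checked numerically against the KL form. The sketch below does this for one hypothetical joint table; the helper names are assumptions made for illustration.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def conditional_entropy_x_given_y(joint):
    """H(X | Y) in bits, where joint[i][j] is p(X = i, Y = j)."""
    py = [sum(col) for col in zip(*joint)]
    h = 0.0
    for row in joint:
        for j, pxy in enumerate(row):
            if pxy > 0:
                h -= pxy * math.log2(pxy / py[j])   # -p(x, y) log p(x | y)
    return h

joint = [[0.4, 0.1], [0.1, 0.4]]            # hypothetical joint table
px = [sum(row) for row in joint]
py = [sum(col) for col in zip(*joint)]

# Entropy form: uncertainty about X before and after observing Y.
h_x = entropy(px)
h_x_given_y = conditional_entropy_x_given_y(joint)
mi_entropy_form = h_x - h_x_given_y

# KL form: divergence of the joint from the product of marginals.
mi_kl_form = sum(
    pxy * math.log2(pxy / (px[i] * py[j]))
    for i, row in enumerate(joint)
    for j, pxy in enumerate(row)
    if pxy > 0
)
```

Both forms should agree up to floating-point error, and since X here is a fair coin, `h_x` is one bit, so the mutual information is the fraction of that bit that Y removes.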

Where this lives in ML

Mutual information quantifies how much a representation keeps about its input. The information bottleneck principle frames a good representation Z as one that maximizes I(Z; Y) (keep what predicts the label) while minimizing I(Z; X) (drop irrelevant input detail). InfoNCE, the loss behind contrastive self-supervised learning (SimCLR, CPC), is a tractable lower bound on mutual information between…
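As a rough illustration of how InfoNCE relates to mutual information, the sketch below computes the loss from a matrix of similarity scores and the corresponding estimate log N − loss (in nats). Everything here is an assumption for the sake of the example: the scores are hand-picked rather than produced by a trained encoder, the matching pair for row i is taken to sit on the diagonal, and the function name is hypothetical. The lower-bound guarantee holds in expectation over batches, so a single batch gives only an estimate.

```python
import math

def info_nce_loss(scores):
    """InfoNCE loss (nats) for a square score matrix.

    scores[i][j] is a similarity score between sample i and candidate j;
    the positive (matching) candidate for row i is assumed to be j == i.
    """
    n = len(scores)
    total = 0.0
    for i, row in enumerate(scores):
        m = max(row)  # subtract the row max for a stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(s - m) for s in row))
        total += log_denominator - row[i]   # -log softmax of the positive
    return total / n

# Hypothetical scores: positives score much higher than negatives.
informative = [[5.0, 0.0, 0.0],
               [0.0, 5.0, 0.0],
               [0.0, 0.0, 5.0]]
# Hypothetical scores that cannot tell positives from negatives.
uninformative = [[0.0, 0.0, 0.0] for _ in range(3)]

n = len(informative)
mi_estimate_informative = math.log(n) - info_nce_loss(informative)
mi_estimate_uninformative = math.log(n) - info_nce_loss(uninformative)
```

Note that the estimate can never exceed log N, so with a batch of N candidates this quantity saturates no matter how much information the two sides actually share.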