Marginal Distributions

The mathematics of uncertainty

Given a joint p(x, y), suppose you only care about X and want to forget Y. You marginalize: sum (or integrate) the joint over all values of the unwanted variable. What's left is the marginal distribution of X alone.
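Written out, the operation looks like the following. This is a minimal restatement of the sentence above, assuming Y is discrete in the first line and continuous in the second; the symbols are the same p(x, y) and p(x) used in the text.

```latex
\begin{align*}
p(x) &= \sum_{y} p(x, y)      && \text{(discrete } Y\text{)} \\
p(x) &= \int p(x, y)\, dy     && \text{(continuous } Y\text{)}
\end{align*}
```

In both cases the result is a function of x only, and it still sums (or integrates) to 1 because the joint does.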

The name comes from old probability tables: you'd add up each row and write the total in the margin. Those row-sums are the marginal of one variable, and the column-sums are the marginal of the other. Marginalizing means "sum (or integrate) out the variable you don't want."

Take a two-way height–weight table, with heights in rows and weights in columns, and suppose you only care about height, ignoring weight entirely. Add up each row of the joint p(x, y) and write the total in the margin: that row total is how often each height occurs regardless of weight. Reading only those margin totals gives the marginal distribution of X, the one variable seen on its own.
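The row-and-column bookkeeping can be sketched in a few lines of Python. The table values, the category labels, and the helper names `marginal_rows` and `marginal_cols` are all made up for illustration; only the summing pattern matters.

```python
# Hypothetical joint table p(height, weight). Rows are heights, columns are
# weights. The numbers are invented for illustration and sum to 1.
heights = ["short", "medium", "tall"]
weights = ["light", "heavy"]
joint = [
    [0.20, 0.05],  # short
    [0.25, 0.20],  # medium
    [0.10, 0.20],  # tall
]

def marginal_rows(table):
    """Sum each row: marginalizes out the column variable (weight)."""
    return [sum(row) for row in table]

def marginal_cols(table):
    """Sum each column: marginalizes out the row variable (height)."""
    return [sum(col) for col in zip(*table)]

p_height = marginal_rows(joint)
p_weight = marginal_cols(joint)

for label, prob in zip(heights, p_height):
    print(f"p(height={label}) = {prob:.2f}")
for label, prob in zip(weights, p_weight):
    print(f"p(weight={label}) = {prob:.2f}")
```

Each marginal is a proper distribution in its own right: both lists add up to 1, just as the full table does.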

Where this lives in ML

Marginalizing out latent variables is both the central computation and the central headache of generative modeling. The data likelihood is p(x) = ∫ p(x, z) dz = ∫ p(x | z) p(z) dz, an integral over every possible latent z. That integral is usually intractable, which is exactly why VAEs optimize a tractable lower bound (the ELBO) instead of computing the marginal directly.
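To make the integral concrete, here is a rough sketch that estimates p(x) by sampling z from the prior and averaging p(x | z). It assumes a deliberately tiny toy model that is not from the text: z ~ N(0, 1) and x | z ~ N(z, 1), chosen because the exact marginal is known in closed form (x ~ N(0, 2)) and can be used as a check. The function names are hypothetical.

```python
import math
import random

def normal_pdf(value, mean, std):
    """Density of a univariate normal distribution at `value`."""
    u = (value - mean) / std
    return math.exp(-0.5 * u * u) / (std * math.sqrt(2.0 * math.pi))

def estimate_marginal(x, num_samples=50_000, seed=0):
    """Monte Carlo estimate of p(x) = E_{z ~ p(z)}[p(x | z)]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(0.0, 1.0)         # draw z from the assumed prior N(0, 1)
        total += normal_pdf(x, z, 1.0)  # assumed likelihood p(x | z) = N(z, 1)
    return total / num_samples

x = 1.0
estimate = estimate_marginal(x)
exact = normal_pdf(x, 0.0, math.sqrt(2.0))  # closed form for this toy model
print(f"estimate={estimate:.4f} exact={exact:.4f}")
```

With a one-dimensional latent this averaging converges quickly. With a high-dimensional z, most prior samples give p(x | z) close to zero, so the naive estimate becomes very noisy, which is one way to see why a bound like the ELBO is used instead.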

