Joint Distributions

The mathematics of uncertainty

So far each random variable lived alone. But the interesting questions are about relationships: height and weight, an image and its label. A joint distribution p(x, y) gives the probability of every pair of values at once. It's the complete description of how two (or more) variables behave together.

For discrete variables, picture a grid: rows are values of X, columns values of Y, and each cell holds the probability of that combination. All the cells are non-negative and sum to 1, the axioms again, now in two dimensions. For continuous variables it's a density f(x, y) and probabilities are volumes under a 2-D surface.

Imagine a two-way table of people sorted by height and weight at the same time: short-and-light in one cell, tall-and-heavy in another, and a number in every cell saying how common that pairing is. That whole grid of pairings is the joint distribution p(x, y) — it describes height and weight together, not one at a time. Fill in every cell, make them non-negative and add to 1, and you have captured the complete picture of how the two traits travel together.

Where this lives in MLSupervised learning is modeling a joint p(x, y) of inputs and labels, or a piece of it. Generative models learn the full joint p(x, y) and can synthesize new data; discriminative models learn only the conditional p(y | x) needed to predict. The whole generative-vs-discriminative distinction is about how much of the joint you bother to model.

▶ Joint Distributions

← Multivariate Gaussian Marginal Distributions →