Expectation-Maximization (EM)

Inference, estimation, and decision-making from data

Sometimes the most important variable is one you never observe. Which cluster did this point come from? Which topic generated this document? These hidden latent variables Z make maximum likelihood hard: you can't just maximize the log-likelihood because it now contains a sum inside a log. Expectation–Maximization (EM) is the elegant fix.

EM breaks a hard joint optimization into two easy alternating steps, repeated until convergence:

The quantity EM actually pushes up each round is a lower bound on the log-likelihood called the ELBO (evidence lower bound). The E-step tightens the bound; the M-step raises it.

Where this lives in MLEM is the engine behind Gaussian mixture models and clustering, and its E/M structure is the conceptual ancestor of variational autoencoders. A VAE's encoder plays the role of the E-step (inferring latent z), the decoder and the ELBO objective play the M-step. The "maximize a lower bound by alternating between inferring latents and updating parameters" pattern is everywhere in modern…
▶ Expectation-Maximization (EM)
← Generative vs DiscriminativeConcentration Inequalities (brief) →