Maximum Likelihood Estimation

Inference, estimation, and decision-making from data

If you must pick a single value for the parameter θ, the most natural rule is this: choose the θ that makes the data you actually observed most probable. That's maximum likelihood estimation (MLE), the principle behind training almost every model in ML.

Given data x₁, …, xₙ assumed independent, the probability of the whole sample is the product of the per-point probabilities. As a function of θ, this product is the likelihood:

L(θ) = p(x₁ | θ) · p(x₂ | θ) ⋯ p(xₙ | θ) = ∏ᵢ p(xᵢ | θ)
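
To make the product concrete, here is a minimal sketch for a coin-flip (Bernoulli) model, where θ is the probability of heads. The sample `flips`, the helper `likelihood`, and the grid search are illustrative assumptions, not part of the original lesson; a grid is used only because it keeps the idea visible.

```python
# Hypothetical sample: 10 coin flips, 1 = heads, 0 = tails (7 heads).
flips = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]

def likelihood(theta, data):
    """Product of per-point Bernoulli probabilities, viewed as a function of theta."""
    prob = 1.0
    for x in data:
        prob *= theta if x == 1 else (1.0 - theta)
    return prob

# Evaluate the likelihood on a grid of candidate values and keep the best one.
grid = [i / 100 for i in range(1, 100)]  # 0.01, 0.02, ..., 0.99
theta_hat = max(grid, key=lambda t: likelihood(t, flips))

print(theta_hat)                # grid maximizer of the likelihood
print(sum(flips) / len(flips))  # observed frequency of heads
```

For this sample the grid maximizer coincides with the observed frequency of heads, which is what the Bernoulli MLE works out to analytically.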

Multiplying many small probabilities underflows to zero in floating point, and a long product is awkward to differentiate. The fix is to take the log: the log of a product is a sum, and log is strictly increasing, so it doesn't move the maximizer. We maximize the log-likelihood:

ℓ(θ) = log ∏ᵢ p(xᵢ | θ) = Σᵢ log p(xᵢ | θ)
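
Both claims can be checked numerically. The sketch below uses made-up inputs (2000 points that each have probability 0.5, and a small coin-flip sample) and only the Python standard library; the names `lik` and `log_lik` are illustrative.

```python
import math

# Hypothetical setup: 2000 independent points, each with probability 0.5
# under some fixed theta (think of 2000 fair coin flips).
probs = [0.5] * 2000

# Direct product: 0.5 ** 2000 is far below the smallest positive float64,
# so the running product collapses to exactly 0.0.
product = 1.0
for p in probs:
    product *= p

# Sum of logs: the same quantity on the log scale, easily representable.
log_likelihood = sum(math.log(p) for p in probs)

print(product)         # 0.0
print(log_likelihood)  # roughly 2000 * log(0.5)

# On a small sample where the product does not underflow, the likelihood and
# the log-likelihood peak at the same theta, because log is increasing.
flips = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]
grid = [i / 100 for i in range(1, 100)]

def lik(t):
    return math.prod(t if x else 1.0 - t for x in flips)

def log_lik(t):
    return sum(math.log(t if x else 1.0 - t) for x in flips)

same_argmax = max(grid, key=lik) == max(grid, key=log_lik)
print(same_argmax)
```

The product is lost entirely, while the sum of logs still carries all the information needed to compare candidate values of θ.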

Where this lives in ML

Training a model is maximum likelihood. Minimizing cross-entropy loss is exactly maximizing the log-likelihood of the labels; cross-entropy is the negative log-likelihood. Minimizing mean squared error is MLE under a Gaussian noise assumption. When you call .backward() and step the optimizer, you are climbing the log-likelihood surface above, just in millions of dimensions.
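
The two loss identities can be verified on toy numbers. This is a sketch in plain Python rather than a framework's loss functions; the predicted probabilities, labels, regression targets, and the fixed noise level `sigma` are all made up for illustration.

```python
import math

# --- Cross-entropy vs. negative log-likelihood (classification) ---
# Hypothetical model outputs: predicted class probabilities for 3 examples.
probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
]
labels = [0, 1, 2]  # true class index for each example

# Negative log-likelihood: minus the average log-probability of the true label.
nll = -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)

# Cross-entropy against one-hot targets: -sum_k target_k * log(p_k), averaged.
def one_hot(y, k):
    return [1.0 if j == y else 0.0 for j in range(k)]

cross_entropy = -sum(
    sum(t * math.log(q) for t, q in zip(one_hot(y, len(p)), p))
    for p, y in zip(probs, labels)
) / len(labels)

# --- Mean squared error vs. Gaussian negative log-likelihood (regression) ---
targets = [1.0, 2.0, 4.0]
preds = [1.5, 1.5, 3.0]
sigma = 1.0  # assumed fixed noise standard deviation
n = len(targets)

mse = sum((y - f) ** 2 for y, f in zip(targets, preds)) / n

gaussian_nll = sum(
    0.5 * math.log(2 * math.pi * sigma**2) + (y - f) ** 2 / (2 * sigma**2)
    for y, f in zip(targets, preds)
)

# The Gaussian NLL is a constant plus a positive multiple of the MSE, so both
# are minimized by the same predictions.
constant = 0.5 * n * math.log(2 * math.pi * sigma**2)
rescaled_mse = constant + n * mse / (2 * sigma**2)

print(math.isclose(nll, cross_entropy))
print(math.isclose(gaussian_nll, rescaled_mse))
```

Because the Gaussian negative log-likelihood differs from the MSE only by an additive constant and a positive scale factor (for a fixed `sigma`), gradient steps on one move the parameters in the same direction as gradient steps on the other.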

