Inference, estimation, and decision-making from data
If you must pick a single value for the parameter θ, a natural rule is this: choose the θ that makes the data you actually observed most probable. This is maximum likelihood estimation (MLE), the principle behind training many models in machine learning; minimizing cross-entropy loss for a classifier, for example, is maximum likelihood under a categorical model.
Given data x₁, …, xₙ assumed independent, the probability of the whole sample is the product of the per-point probabilities. As a function of θ, this product is the likelihood:
L(θ) = p(x₁ | θ) · p(x₂ | θ) ⋯ p(xₙ | θ) = ∏ᵢ₌₁ⁿ p(xᵢ | θ)
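Multiplying many small probabilities underflows to zero in floating-point arithmetic, and a long product is awkward to differentiate. The fix is to take the log: the log of a product is a sum, and log is strictly increasing, so it does not move the maximizer. We maximize the log-likelihood:
ℓ(θ) = log p(x₁ | θ) + … + log p(xₙ | θ) = Σᵢ₌₁ⁿ log p(xᵢ | θ)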
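The underflow problem and the log fix can be seen in a small sketch. It assumes a Bernoulli (coin-flip) model with θ as the probability of heads, which the text above does not specify; the data set and the function names are hypothetical and chosen only for illustration.

```python
import math

def likelihood(theta, data):
    """Raw product of per-point Bernoulli probabilities (assumed model)."""
    prod = 1.0
    for x in data:
        prod *= theta if x == 1 else 1.0 - theta
    return prod

def log_likelihood(theta, data):
    """Sum of per-point log-probabilities under the same assumed model."""
    total = 0.0
    for x in data:
        total += math.log(theta if x == 1 else 1.0 - theta)
    return total

# Hypothetical sample: 2000 flips, 1400 heads (1) and 600 tails (0).
data = [1] * 1400 + [0] * 600

# The true product is far below the smallest positive double,
# so the running product collapses to 0.0.
raw = likelihood(0.7, data)

# The sum of logs stays a finite, ordinary negative number.
ll = log_likelihood(0.7, data)

# Maximize the log-likelihood over a coarse grid of candidate values.
grid = [i / 100 for i in range(1, 100)]
theta_hat = max(grid, key=lambda t: log_likelihood(t, data))

# For a Bernoulli model the maximizer is known to be the sample mean.
sample_mean = sum(data) / len(data)

print(raw, ll, theta_hat, sample_mean)
```

The grid search is only for illustration. In practice the log-likelihood is usually maximized with calculus, as in the closed-form sample mean here, or with a gradient-based optimizer.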