Bayesian Estimation

Inference, estimation, and decision-making from data

MLE asks "which single θ best explains the data?" Bayesian estimation asks a richer question: "given the data, what is my full belief about θ?" Instead of one number, you get a whole distribution, and you can fold in what you knew beforehand.

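Three ingredients. The prior p(θ) is your belief before seeing data. The likelihood p(x|θ) is how well each θ explains the data (same object as in MLE). Bayes' rule combines them into the posterior p(θ|x):

p(θ|x) = p(x|θ) p(θ) / p(x)  ∝  p(x|θ) p(θ)

The denominator p(x) does not depend on θ; it only rescales the posterior so that it integrates to 1.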

Read it as: posterior belief = how well θ explains the data, weighted by how plausible θ was to begin with. More data makes the likelihood dominate and washes out the prior.
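To make the "more data washes out the prior" claim concrete, here is a minimal sketch using a coin-flip model. It assumes a Beta prior on the heads probability θ, chosen because it is conjugate to the Bernoulli likelihood, so the posterior is again a Beta distribution and the update is just addition. The Beta(2, 2) prior, the flip counts, and the function names are illustrative assumptions, not part of the original text.

```python
def beta_posterior(alpha, beta, heads, tails):
    """Posterior Beta parameters after observing coin flips.

    Assumes a Beta(alpha, beta) prior on theta and independent flips,
    so the posterior is Beta(alpha + heads, beta + tails).
    """
    return alpha + heads, beta + tails


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Hypothetical prior: Beta(2, 2), a mild belief that the coin is roughly fair.
prior = (2.0, 2.0)

# Two hypothetical datasets with the same observed heads rate (0.7).
small = beta_posterior(*prior, heads=7, tails=3)
large = beta_posterior(*prior, heads=700, tails=300)

mle = 0.7  # heads / (heads + tails) for both datasets

print("prior mean:           ", beta_mean(*prior))
print("posterior mean, n=10: ", beta_mean(*small))
print("posterior mean, n=1000:", beta_mean(*large))
print("MLE:                  ", mle)
```

With 10 flips the posterior mean still sits visibly between the prior mean (0.5) and the MLE (0.7); with 1000 flips it is nearly indistinguishable from the MLE.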

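Where this lives in ML

Regularization is this idea in everyday use. MAP (maximum a posteriori) estimation picks the single θ that maximizes the posterior. When the loss is a negative log-likelihood, adding an L2 penalty λ‖β‖² to it is equivalent to MAP estimation with a zero-mean Gaussian prior on the weights, and a larger λ corresponds to a narrower prior. The prior says "weights near zero are more plausible." Adding an L1 penalty corresponds to a Laplace prior, whose MAP estimate tends to be sparse. Weight decay isn't a hack; it's a Bayesian prior with a different name.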
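The L2 correspondence can be checked numerically. The sketch below assumes a one-weight linear model y = βx + noise with Gaussian noise of variance σ² and a prior β ~ N(0, τ²); under those assumptions, minimizing the squared error plus λβ² with λ = σ²/τ² should give the same β as maximizing the posterior. The data values, the variances, and the function names are made up for illustration.

```python
def ridge_weight(xs, ys, lam):
    """Closed-form minimizer of sum((y - b*x)^2) + lam * b^2 for one weight b."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def neg_log_posterior(b, xs, ys, sigma2, tau2):
    """Negative log posterior of b, up to an additive constant.

    Assumes Gaussian noise with variance sigma2 and a N(0, tau2) prior on b.
    """
    neg_log_lik = sum((y - b * x) ** 2 for x, y in zip(xs, ys)) / (2 * sigma2)
    neg_log_prior = b * b / (2 * tau2)
    return neg_log_lik + neg_log_prior


def map_weight(xs, ys, sigma2, tau2, lo=-10.0, hi=10.0, iters=200):
    """Ternary search for the MAP weight; works because the objective is convex in b."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if neg_log_posterior(m1, xs, ys, sigma2, tau2) < neg_log_posterior(m2, xs, ys, sigma2, tau2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2


# Hypothetical data and variances, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]
sigma2 = 1.0  # assumed noise variance
tau2 = 0.5    # assumed prior variance on the weight

lam = sigma2 / tau2  # the L2 strength implied by the prior
b_ridge = ridge_weight(xs, ys, lam)
b_map = map_weight(xs, ys, sigma2, tau2)

print("ridge weight:", b_ridge)
print("MAP weight:  ", b_map)
```

The two printed weights should agree up to the search tolerance. Shrinking τ² (a more confident prior) raises λ and pulls the weight further toward zero.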
