Constrained Optimization

Multivariate calculus from first principles

Often you don't want the lowest point everywhere; you want the lowest point subject to a constraint. Minimize loss while keeping the weight norm bounded; maximize margin while points stay correctly classified. Lagrange multipliers are the standard tool for optimizing along a constraint curve.

The geometry to hold onto: at the constrained optimum, the level curves of f are tangent to the constraint curve g(x) = 0. If they crossed instead of touching, you could slide along the constraint to a better value. Gradients are perpendicular to level curves, so tangency means ∇f and ∇g lie along the same line; that is, they are parallel:

∇f(x) = λ ∇g(x),   g(x) = 0

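The scalar λ (the Lagrange multiplier) is the proportionality factor between the two gradients. Packaging both conditions into one object gives the Lagrangian L(x, λ) = f(x) − λ g(x). Setting ∇L = 0, with derivatives taken in both x and λ, recovers exactly the two conditions: ∂L/∂x = 0 gives ∇f = λ∇g, and ∂L/∂λ = 0 gives g(x) = 0.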
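A small worked example may make the two conditions concrete. The problem below is an illustrative choice, not one from the text: minimize f(x, y) = x² + y² subject to g(x, y) = x + y − 1 = 0. The function names and the grid check are assumptions made for this sketch; it solves the stationarity equations by hand in the comments and then verifies the result numerically.

```python
# Illustrative problem (an assumption for this sketch):
#   minimize f(x, y) = x^2 + y^2  subject to  g(x, y) = x + y - 1 = 0

def f(x, y):
    return x**2 + y**2

def g(x, y):
    return x + y - 1.0

def grad_f(x, y):
    return (2.0 * x, 2.0 * y)

def grad_g(x, y):
    return (1.0, 1.0)

# Stationarity grad f = lambda * grad g gives 2x = lambda and 2y = lambda,
# so x = y = lambda / 2. Substituting into g = 0 gives lambda = 1.
lam = 1.0
x_star, y_star = lam / 2.0, lam / 2.0

# Check both conditions at the candidate point.
gf = grad_f(x_star, y_star)
gg = grad_g(x_star, y_star)
stationarity_residual = (gf[0] - lam * gg[0], gf[1] - lam * gg[1])
constraint_residual = g(x_star, y_star)

# Independent check: slide along the constraint (y = 1 - t) on a grid
# and pick the point with the smallest f.
ts = [i / 1000.0 for i in range(-2000, 3001)]
t_best = min(ts, key=lambda t: f(t, 1.0 - t))

print("candidate:", (x_star, y_star), "lambda:", lam)
print("stationarity residual:", stationarity_residual)
print("constraint residual:", constraint_residual)
print("grid minimizer along constraint:", (t_best, 1.0 - t_best))
```

Both residuals vanish at (0.5, 0.5), and the grid search along the constraint lands on the same point, which is the "no better value by sliding" picture from the tangency argument.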

Where this lives in ML

Constrained optimization is everywhere in ML. Support vector machines maximize a margin subject to classification constraints, and their dual problem is built from Lagrange multipliers (via the KKT conditions, the extension that handles inequalities). Constrained weight norms, trust regions in RL, and projected gradient methods all trace back to '∇f parallel to ∇g'. The multiplier λ is the same…
