The Gradient

Multivariate calculus from first principles

Collect every partial derivative of f into one vector and you get the gradient, written ∇f ("grad f"). Every optimizer in deep learning runs on this one object, so it earns its place at the center of the course.

The gradient isn't just bookkeeping. As a vector in the input space, it has a direction and a length, and both carry meaning. The direction is the one of steepest ascent: point yourself along ∇f and the function climbs as fast as it possibly can. Its length ‖∇f‖ is exactly how steep that climb is.

Picture yourself standing on a grassy hill in fog. The gradient ∇f is the arrow that points straight up the steepest part of the slope, and its length tells you just how punishing that climb is. Set a ball down and let go: it rolls off in exactly the opposite direction, taking the fastest way down.

Where this lives in MLStanding on the loss surface, you want to step downhill as fast as possible. The gradient ∇L points toward steepest increase, so you subtract it: w ← w − η∇L, the update behind SGD, Adam, and every other optimizer. Backpropagation exists for one reason, to compute this vector efficiently.
▶ The Gradient
← Higher-Order PartialsDirectional Derivative →