Multivariate calculus from first principles
Collect every partial derivative of f into one vector and you get the gradient, written ∇f ("grad f"). Every optimizer in deep learning runs on this one object, so it earns its place at the center of the course.
The gradient isn't just bookkeeping. As a vector in the input space, it has a direction and a length, and both carry meaning. The direction is the one of steepest ascent: point yourself along ∇f and the function climbs as fast as it possibly can. Its length ‖∇f‖ is exactly how steep that climb is.
Picture yourself standing on a grassy hill in fog. The gradient ∇f is the arrow that points straight up the steepest part of the slope, and its length tells you just how punishing that climb is. Set a ball down and let go: it rolls off in exactly the opposite direction, taking the fastest way down.