You're standing on this landscape. You can feel the slope under your feet, but you can't see. Which way do you step to climb fastest?

Slopes in Two Directions

Let's start with the simplest possible surface: $f(x,y) = x^2 + y^2$, a bowl opening upward. Standing at any point $(x_0, y_0)$, we can ask: how steep is it heading east? How steep heading north?

Those are exactly our two partial derivatives. If we hold $y$ fixed and walk east, the slope we feel is $\partial f/\partial x = 2x$. Hold $x$ fixed and walk north: the slope is $\partial f/\partial y = 2y$.
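If you'd like to feel those two slopes numerically, here is a small finite-difference sketch (the sample point and the step size $h$ are arbitrary choices for illustration, not part of the demo):

```js
// Finite-difference check of the two slopes of f(x, y) = x^2 + y^2.
const f = (x, y) => x * x + y * y;
const h = 1e-5;                 // small step for the difference quotient
const x0 = 1, y0 = 0.5;         // sample point, chosen arbitrarily

const slopeEast  = (f(x0 + h, y0) - f(x0 - h, y0)) / (2 * h); // ≈ 2*x0 = 2
const slopeNorth = (f(x0, y0 + h) - f(x0, y0 - h)) / (2 * h); // ≈ 2*y0 = 1

console.log(slopeEast.toFixed(4), slopeNorth.toFixed(4)); // 2.0000 1.0000
```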

The red plane below slices the surface at constant $y = y_0$, revealing the east-west cross-section. The blue plane slices at constant $x = x_0$, revealing the north-south cross-section. Drag the sliders to move the point and watch the slopes change.


So at any point we have two slope measurements - one per direction. But what if we want to walk northeast? Or at some arbitrary angle? We need to combine these measurements into a single object.

Assembling the Gradient Vector

The key idea: pack both partials into a single vector. For $f(x,y)$, we define

$$\nabla f = \frac{\partial f}{\partial x}\,\mathbf{i} + \frac{\partial f}{\partial y}\,\mathbf{j}$$

The symbol $\nabla$ (read "del" or "nabla") is a vector differential operator - by itself it means nothing, but applied to $f$ it produces this gradient vector. In 3D, it picks up a $z$-component too: $\nabla f = f_x\,\mathbf{i} + f_y\,\mathbf{j} + f_z\,\mathbf{k}$.

For our bowl $f = x^2 + y^2$, the gradient is $\nabla f = 2x\,\mathbf{i} + 2y\,\mathbf{j}$. At the point $(1, 0.5)$ that gives $\langle 2, 1 \rangle$ - pointing mostly east with some north. Notice something important: the gradient lives in the $xy$-plane, not on the surface itself. It tells you which horizontal direction to face to climb fastest.
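As a tiny companion to that example (a sketch of our own, not part of the demo), here is the gradient packed into a single two-component vector:

```js
// Gradient of the bowl f(x, y) = x^2 + y^2, packed into one vector.
const gradF = (x, y) => [2 * x, 2 * y];   // [∂f/∂x, ∂f/∂y]

console.log(gradF(1, 0.5)); // [2, 1] - mostly east, a little north
```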

Move the point and watch the orange gradient arrow.


Perpendicular to Level Curves

Here's something beautiful. The level curves of $f(x,y) = x^2 + y^2$ are circles - the curves where $f = c$ for each constant $c$. Along a level curve, $f$ doesn't change; you're walking at constant altitude.

It turns out the gradient is always perpendicular to the level curves. Think of it this way: if you're standing on a hillside, the steepest path goes straight up the slope - that's perpendicular to the contour lines, which run across the slope. The gradient points in that steepest direction.
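You can check the perpendicularity numerically. For the bowl, a direction tangent to the level circle through $(x_0, y_0)$ is $\langle -y_0, x_0\rangle$; here is a quick sketch (the point is an arbitrary choice):

```js
// The gradient of f = x^2 + y^2 meets its level circles at a right angle.
const x0 = 1, y0 = 0.5;                // any point works
const grad    = [2 * x0, 2 * y0];      // ∇f = <2x, 2y>
const tangent = [-y0, x0];             // direction along the level curve

console.log(grad[0] * tangent[0] + grad[1] * tangent[1]); // 0 - perpendicular
```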

Move the point below. The orange arrow (gradient) always meets the contour lines at a right angle.

Magnitude Equals Steepness

The gradient's direction tells us where to climb. Its magnitude tells us how steep:

$$|\nabla f(x,y)| = \sqrt{\left(\frac{\partial f}{\partial x}\right)^2 + \left(\frac{\partial f}{\partial y}\right)^2}$$

Where the contour lines are packed closely together, a small horizontal step causes a large change in $f$ - so the gradient magnitude is large there. Where contours are far apart, the terrain is gentle and $|\nabla f|$ is small.
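For the bowl this is easy to tabulate (a small sketch; the sample radii are arbitrary):

```js
// |∇f| for f = x^2 + y^2 equals 2r, so the walls get steeper away from the origin.
const gradMag = (x, y) => Math.hypot(2 * x, 2 * y); // sqrt((2x)^2 + (2y)^2)

[0.5, 1, 2].forEach(r => console.log(`r = ${r}, |grad f| = ${gradMag(r, 0)}`));
// r = 0.5, |grad f| = 1
// r = 1, |grad f| = 2
// r = 2, |grad f| = 4
```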

The arrows below encode both pieces of information: their direction is $\nabla f / |\nabla f|$ (the steepest direction), and their color encodes $|\nabla f|$ (blue = gentle, red = steep). For $f = x^2 + y^2$, the gradient grows as we move away from the origin - the bowl's walls get steeper.

The Directional Derivative

Now we can answer the original question: if we walk in the direction of an arbitrary unit vector $\mathbf{u}$, how fast does $f$ change? The answer is the dot product:

$$D_{\mathbf{u}} f = \nabla f \cdot \mathbf{u} = |\nabla f|\cos\theta$$

where $\theta$ is the angle between $\mathbf{u}$ and $\nabla f$. Reading off the three key cases: at $\theta = 0$ (walking straight along the gradient) the rate of change is as large as it gets, $|\nabla f|$; at $\theta = 90^\circ$ it is zero - you're moving along a level curve; at $\theta = 180^\circ$ it is $-|\nabla f|$, the steepest descent.

Common trap: $D_{\mathbf{u}} f \neq |\nabla f|$ unless $\mathbf{u}$ happens to be aligned with $\nabla f$. The gradient magnitude is the maximum possible directional derivative, achieved only in one direction.
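Here is a numerical illustration of those cases (a sketch; the point $(1, 0.5)$ and the angles are our choices):

```js
// Directional derivative of f = x^2 + y^2 at (1, 0.5), where ∇f = <2, 1>.
const grad = [2, 1];
const gradMag = Math.hypot(grad[0], grad[1]);            // |∇f| ≈ 2.2361

// D_u f = ∇f · u, with u the unit vector at angle theta from the x-axis.
const dirDeriv = theta => grad[0] * Math.cos(theta) + grad[1] * Math.sin(theta);

const thetaUp = Math.atan2(1, 2);                        // direction of ∇f itself
console.log(dirDeriv(thetaUp), gradMag);                 // both ≈ 2.2361 (the maximum)
console.log(dirDeriv(thetaUp + Math.PI / 2));            // ≈ 0 (along a level curve)
console.log(dirDeriv(thetaUp + Math.PI));                // ≈ -2.2361 (steepest descent)
```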

Rotate the blue unit vector with the slider and watch the projection (green) change.

Properties of the Gradient

The gradient obeys familiar-looking algebraic rules. For differentiable scalars $f$ and $g$:

$$\nabla(f + g) = \nabla f + \nabla g \qquad \text{(linearity)}$$ $$\nabla(fg) = f\,\nabla g + g\,\nabla f \qquad \text{(product rule)}$$

These follow directly from the corresponding rules for partial derivatives. For the product rule, just look at one component: $$\frac{\partial(fg)}{\partial x} = f \frac{\partial g}{\partial x} + g \frac{\partial f}{\partial x}.$$ Doing the same for $y$ and $z$, then assembling, gives $\nabla(fg) = f\,\nabla g + g\,\nabla f$.
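If you want to spot-check the product rule numerically, here is a sketch (the functions $f$, $g$, the point, and the step $h$ are arbitrary choices of ours):

```js
// Numeric spot-check of the product rule ∇(fg) = f ∇g + g ∇f at one point.
const f = (x, y) => x * y;
const g = (x, y) => Math.sin(x) + y * y;
const h = 1e-5;

// Central-difference gradient of a function F at (x, y).
const grad = (F, x, y) => [
  (F(x + h, y) - F(x - h, y)) / (2 * h),
  (F(x, y + h) - F(x, y - h)) / (2 * h),
];

const [x0, y0] = [0.7, -1.2];
const gradF = grad(f, x0, y0), gradG = grad(g, x0, y0);

const lhs = grad((x, y) => f(x, y) * g(x, y), x0, y0);
const rhs = [f(x0, y0) * gradG[0] + g(x0, y0) * gradF[0],
             f(x0, y0) * gradG[1] + g(x0, y0) * gradF[1]];
console.log(lhs, rhs); // the two vectors agree to several decimal places
```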

There's also a chain rule. If $f = F(u)$ where $u = g(x,y,z)$, then the chain rule on each component gives: $$\frac{\partial f}{\partial x} = F'(u)\frac{\partial g}{\partial x}, \quad \frac{\partial f}{\partial y} = F'(u)\frac{\partial g}{\partial y}, \quad \frac{\partial f}{\partial z} = F'(u)\frac{\partial g}{\partial z}.$$ Factoring out $F'(u)$:

$$\nabla f = F'(u)\,\nabla g \qquad \text{(chain rule)}$$

As a quick check: $f = e^{x^2+y^2}$ has $u = x^2 + y^2$, $F(u) = e^u$, $F'(u) = e^u$. So $\nabla f = e^{x^2+y^2}\langle 2x, 2y \rangle$ - which you can verify directly.

Playground: Explore Any Gradient Field

Enter a function $f(x,y)$ using JavaScript syntax. Examples: Math.sin(x)*Math.cos(y), x*x - y*y, Math.exp(-(x*x+y*y))

Click on the plot to read gradient values at that point.

Why Machine Learning Cares About Gradients (additional material)

Everything we've built - the gradient pointing uphill, its magnitude telling us how steep - turns out to be the core engine behind how AI learns.

Imagine you're training a neural network to recognize photos of cats. The network has millions of adjustable knobs (called weights). There's a function $L(w_1, w_2, \ldots, w_n)$ that measures how badly the network is doing - it's called the loss function. Big loss = lots of mistakes. Small loss = good predictions.

Training the network means finding the knob settings that make $L$ as small as possible. That's a minimization problem in millions of dimensions - exactly the kind of landscape we've been studying, just with way more axes than two.

The algorithm is beautifully simple: compute $\nabla L$ (the gradient of the loss), then take a small step in the opposite direction, because the gradient points uphill and we want to go downhill:

$$\mathbf{w}_{\text{new}} = \mathbf{w}_{\text{old}} - \eta\,\nabla L(\mathbf{w}_{\text{old}})$$

The parameter $\eta$ (called the learning rate) controls step size. Too big and you overshoot the valley; too small and training takes forever. This update rule is called gradient descent, and it's the same idea as rolling a ball downhill on our surface plots - the ball follows the steepest path down.
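Here is the whole loop in a few lines, run on our bowl instead of a real loss function (the starting point, learning rate, and step count are arbitrary illustrative choices):

```js
// Minimal gradient-descent sketch on the bowl L(w1, w2) = w1^2 + w2^2.
const gradL = ([w1, w2]) => [2 * w1, 2 * w2];

let w = [3, -2];        // arbitrary starting weights
const eta = 0.1;        // learning rate
for (let step = 0; step < 50; step++) {
  const g = gradL(w);
  w = [w[0] - eta * g[0], w[1] - eta * g[1]]; // step opposite the gradient
}
console.log(w); // both components ≈ 0: the bottom of the bowl
```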

Every time ChatGPT writes a sentence, every time your phone recognizes your face, every time a self-driving car spots a stop sign - somewhere underneath, gradients got computed and weights got nudged downhill. The math is exactly what we've been doing here, just scaled up.

Practice Problems - §3.3

From Kaplan, problems after §3.3

3 Determine $\nabla f$ for $f = xy$, and sketch several gradient vectors alongside the level curves.

For the scalar field $f(x,y) = xy$, compute the gradient $\nabla f$. Then sketch the level curves of $f$ and draw gradient vectors at several points, verifying they are perpendicular to the level curves.

Step 1: Compute the partial derivatives.

$\displaystyle\frac{\partial f}{\partial x} = \frac{\partial(xy)}{\partial x} = y$     $\displaystyle\frac{\partial f}{\partial y} = \frac{\partial(xy)}{\partial y} = x$

Step 2: Assemble the gradient.

$$\nabla f = y\,\mathbf{i} + x\,\mathbf{j}$$

At the point $(1, 2)$: $\nabla f = 2\,\mathbf{i} + 1\,\mathbf{j} = \langle 2, 1\rangle$. At $(2, 1)$: $\nabla f = \langle 1, 2\rangle$. Notice that swapping $x$ and $y$ reflects the gradient across $y = x$.

Step 3: Identify the level curves.

The level curves are $xy = c$, that is, $y = c/x$ - rectangular hyperbolas. For $c > 0$: hyperbolas in the first and third quadrants. For $c < 0$: hyperbolas in the second and fourth quadrants. For $c = 0$: the two coordinate axes.

The gradient $\langle y, x\rangle$ is perpendicular to these curves, as expected. We can verify at a specific point: on the curve $xy = 2$, differentiating $y\,dx + x\,dy = 0$ gives $dy/dx = -y/x$, so at $(1, 2)$ the tangent direction is proportional to $\langle 1, -2\rangle$. The gradient there is $\langle 2, 1\rangle$, and $\langle 2, 1\rangle \cdot \langle 1, -2\rangle = 2 - 2 = 0$. Perpendicular. $\checkmark$
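A quick numerical version of the same check, at a few points (our addition, not part of Kaplan's problem): a tangent direction to $xy = c$ at $(x, y)$ is $\langle x, -y\rangle$.

```js
// ∇(xy) = <y, x> dotted with the tangent direction <x, -y> is zero everywhere.
const points = [[1, 2], [2, 1], [-1, 3], [0.5, -4]];
for (const [x, y] of points) {
  const dot = y * x + x * (-y);        // <y, x> · <x, -y>
  console.log(`(${x}, ${y}): ${dot}`); // 0 at every point
}
```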

6 Prove $\nabla(f+g) = \nabla f + \nabla g$ and $\nabla(fg) = f\,\nabla g + g\,\nabla f$.

Let $f$ and $g$ be differentiable scalar fields on a domain $D$ in space. Prove the two gradient identities:

$\nabla(f+g) = \nabla f + \nabla g$   (linearity)

$\nabla(fg) = f\,\nabla g + g\,\nabla f$   (product rule)

Step 1: Prove $\nabla(f+g) = \nabla f + \nabla g$.

We work component by component. The $x$-component of $\nabla(f+g)$ is: $$\frac{\partial(f+g)}{\partial x} = \frac{\partial f}{\partial x} + \frac{\partial g}{\partial x}$$ by the linearity of partial differentiation. The $y$- and $z$-components follow by the same reasoning. Assembling all three: $\nabla(f+g) = \nabla f + \nabla g$. $\square$

Step 2: Prove $\nabla(fg) = f\,\nabla g + g\,\nabla f$.

The $x$-component of $\nabla(fg)$ is, by the product rule for partial derivatives: $$\frac{\partial(fg)}{\partial x} = f\frac{\partial g}{\partial x} + g\frac{\partial f}{\partial x}.$$ This is exactly the $x$-component of $f\,\nabla g + g\,\nabla f$. The $y$- and $z$-components follow in the same way. Assembling: $$\nabla(fg) = f\,\nabla g + g\,\nabla f. \quad\square$$

Step 3: Quick sanity check.

Take $f = x$, $g = y^2$, so $fg = xy^2$. Direct: $\nabla(xy^2) = \langle y^2, 2xy\rangle$. Via the product rule: $f\,\nabla g + g\,\nabla f = x\langle 0, 2y\rangle + y^2\langle 1, 0\rangle = \langle y^2, 2xy\rangle$. $\checkmark$

8 Prove: $\operatorname{grad}\dfrac{f}{g} = \dfrac{1}{g^2}[g\operatorname{grad} f - f\operatorname{grad} g]$

Prove: $\operatorname{grad}\dfrac{f}{g} = \dfrac{1}{g^2}[g\,\operatorname{grad} f - f\,\operatorname{grad} g]$

Step 1: Set up the quotient rule for partial derivatives.

Let $h = f/g$. We need to find $\nabla h$. By the quotient rule for partial derivatives: $$\frac{\partial h}{\partial x} = \frac{g\,\frac{\partial f}{\partial x} - f\,\frac{\partial g}{\partial x}}{g^2}$$ Similarly for $\partial h/\partial y$ and $\partial h/\partial z$.

Step 2: Combine into the gradient.

Assembling the three partial derivatives into a vector: $$\nabla h = \frac{\partial h}{\partial x}\mathbf{i} + \frac{\partial h}{\partial y}\mathbf{j} + \frac{\partial h}{\partial z}\mathbf{k} = \frac{1}{g^2}\left(g\,\frac{\partial f}{\partial x} - f\,\frac{\partial g}{\partial x}\right)\mathbf{i} + \cdots$$ Factoring $1/g^2$ from each component:

$$\operatorname{grad}\frac{f}{g} = \frac{1}{g^2}[g\,\operatorname{grad} f - f\,\operatorname{grad} g] \quad\square$$
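
Step 3: Quick sanity check.

As an example of our own (not part of Kaplan's problem): take $f = x^2$ and $g = x$ (away from $x = 0$), so $f/g = x$ and $\nabla(f/g) = \mathbf{i}$ directly. The right-hand side gives $$\frac{1}{g^2}\left[g\,\operatorname{grad} f - f\,\operatorname{grad} g\right] = \frac{1}{x^2}\left[x\cdot 2x\,\mathbf{i} - x^2\,\mathbf{i}\right] = \frac{x^2}{x^2}\,\mathbf{i} = \mathbf{i}. \quad\checkmark$$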