The General Chain Rule
Jacobian matrices multiply - the chain rule as matrix multiplication - Kaplan §2.9
Prereq: §2.8 Chain Rules
Here's a grid of points in the $(x_1, x_2)$-plane. We're going to feed it through two transformations, one after the other, and watch what happens to the grid at each stage.
[Interactive figure: a grid of points shown in three panels - $(x_1, x_2)$-space, $(u_1, u_2)$-space after $\mathbf{g}$, and $(y_1, y_2)$-space after $\mathbf{f} \circ \mathbf{g}$.]
When one transformation feeds into another, the chain rule says their Jacobian matrices multiply. But what does that look like geometrically - and why should matrix multiplication capture it?
From single to multi-variable chain rules
Let's start with something we know cold. One input, one output, one intermediate variable. If $y = f(u)$ and $u = g(x)$, single-variable calculus gives us:

$$\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}$$
Multiply two derivatives. Done. Now here's the question that matters: what happens when everything becomes vector-valued?
Say $y$ depends on two intermediate variables $(u_1, u_2)$, and each of those depends on two inputs $(x_1, x_2)$. If we nudge $x_1$ by a tiny amount, how does $y$ respond? Well, that nudge ripples through both intermediate variables simultaneously - it changes $u_1$ and $u_2$, and each of those changes affects $y$. Adding up both pathways:

$$\frac{\partial y}{\partial x_1} = \frac{\partial y}{\partial u_1}\frac{\partial u_1}{\partial x_1} + \frac{\partial y}{\partial u_2}\frac{\partial u_2}{\partial x_1}$$
Stare at that for a moment. The right side is a dot product - the row $\bigl[\frac{\partial y}{\partial u_1},\; \frac{\partial y}{\partial u_2}\bigr]$ dotted with the column $\bigl[\frac{\partial u_1}{\partial x_1},\; \frac{\partial u_2}{\partial x_1}\bigr]^T$.
And if we have multiple outputs $y_1, y_2, \ldots, y_m$ each depending on multiple inputs $x_1, x_2, \ldots, x_n$ through intermediates $u_1, u_2, \ldots, u_p$? Then every single entry of the resulting Jacobian is one of these dot products. Row $i$ of $\mathbf{Y}_u$ dotted with column $j$ of $\mathbf{U}_x$ gives us entry $(i,j)$ of the answer. That's the definition of matrix multiplication:

$$\mathbf{Y}_x = \mathbf{Y}_u \cdot \mathbf{U}_x, \qquad \frac{\partial y_i}{\partial x_j} = \sum_{k=1}^{p} \frac{\partial y_i}{\partial u_k}\,\frac{\partial u_k}{\partial x_j}$$
Each entry $(i,j)$ of $\mathbf{Y}_x$ sums up all the indirect pathways from input $x_j$ to output $y_i$ through the intermediate variables. The single-variable chain rule multiplies two numbers; the general chain rule multiplies two matrices. Same idea, bigger playground.
Pick a pair of mappings below and watch how the dot products assemble the product matrix entry by entry.
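If you prefer code to widgets, here's a minimal sketch of the same bookkeeping in numpy. The two Jacobians are made-up numbers, not from any particular mapping: the point is that assembling the product entry by entry - summing pathways through the intermediates - reproduces matrix multiplication exactly.

```python
import numpy as np

# Two made-up Jacobians: Y_u is outputs x intermediates,
# U_x is intermediates x inputs. Any shapes with a matching
# inner dimension work the same way.
Y_u = np.array([[1.0, 2.0],
                [3.0, 4.0]])
U_x = np.array([[5.0, 6.0],
                [7.0, 8.0]])

# Entry (i, j) sums all pathways x_j -> u_k -> y_i over intermediates k.
Y_x = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        Y_x[i, j] = sum(Y_u[i, k] * U_x[k, j] for k in range(2))

# The double loop is exactly ordinary matrix multiplication.
assert np.allclose(Y_x, Y_u @ U_x)
print(Y_x)
```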
The Jacobian product in action
Enough abstraction - let's compute. Consider a composition $\mathbf{y} = \mathbf{f}(\mathbf{u})$, $\mathbf{u} = \mathbf{g}(\mathbf{x})$, whose inner stage is

$$u_1 = 2x_1 + x_2, \qquad u_2 = x_1 - x_2$$
We want the Jacobian $\mathbf{Y}_x$ at the point $\mathbf{x} = (1, 0)$. We could substitute everything, expand, and differentiate the mess. But the chain rule says: just compute two simple Jacobians and multiply.
Step 1: The inner Jacobian $\mathbf{U}_x$
The inner mapping $\mathbf{g}$ is linear, so its Jacobian is just the coefficient matrix - the same everywhere:

$$\mathbf{U}_x = \begin{pmatrix} 2 & 1 \\ 1 & -1 \end{pmatrix}$$
Step 2: Find the intermediate point
Before we can compute $\mathbf{Y}_u$, we need to know where in $u$-space we are. At $\mathbf{x} = (1,0)$:

$$u_1 = 2(1) + 0 = 2, \qquad u_2 = 1 - 0 = 1$$

so the outer Jacobian must be evaluated at $\mathbf{u} = (2,1)$.
Step 3: The outer Jacobian $\mathbf{Y}_u$ at $\mathbf{u} = (2,1)$
Step 4: Multiply
We never had to compose the functions and differentiate the resulting polynomial. We just multiplied two matrices at the right point. For complicated nested mappings with many variables, this modularity is a lifesaver.
Verification - the hard way
Can we trust this? Let's verify by brute force. Substitute $u_1 = 2x_1 + x_2$ and $u_2 = x_1 - x_2$ directly:
Differentiating these and plugging in $(1,0)$:
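The outer mapping's explicit formula isn't reproduced above, so here's the same brute-force check done numerically instead - a sketch that keeps the actual inner map and base point from the example but swaps in a hypothetical outer map `f` of my own choosing. The finite-difference Jacobian of the composite should match the product of the two stage Jacobians:

```python
import numpy as np

def g(x):
    # The inner (linear) map from the example above.
    return np.array([2*x[0] + x[1], x[0] - x[1]])

def f(u):
    # Hypothetical outer map -- a stand-in, since the example's
    # explicit formula for f isn't reproduced here.
    return np.array([u[0]**2 * u[1], u[0] + u[1]**3])

def jacobian(func, p, h=1e-6):
    # Central-difference Jacobian of func at point p, one column per input.
    p = np.asarray(p, dtype=float)
    cols = []
    for j in range(len(p)):
        e = np.zeros_like(p); e[j] = h
        cols.append((func(p + e) - func(p - e)) / (2*h))
    return np.column_stack(cols)

x = np.array([1.0, 0.0])
u = g(x)                                   # the intermediate point (2, 1)

chain = jacobian(f, u) @ jacobian(g, x)    # Y_u at u, times U_x at x
direct = jacobian(lambda x_: f(g(x_)), x)  # differentiate the composite

assert np.allclose(chain, direct, atol=1e-4)
print(chain)
```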
Determinants multiply - area distortion
Here's where things get geometrically beautiful. When the Jacobian matrices are square - same number of inputs, intermediates, and outputs - we can take determinants of both sides of $\mathbf{Y}_x = \mathbf{Y}_u \cdot \mathbf{U}_x$:

$$\det \mathbf{Y}_x = \det \mathbf{Y}_u \cdot \det \mathbf{U}_x$$
That's just the standard linear algebra fact $\det(AB) = \det(A)\det(B)$. But in the language of Jacobians, it says something vivid:
What does this mean? Each Jacobian determinant measures how much the mapping locally stretches or compresses area (in 2D) or volume (in 3D). So the rule says:
Area distortion factors compose by multiplication.
Think of it like currency exchange rates. If 1 dollar buys 2 euros and 1 euro buys 3 yen, then 1 dollar buys $2 \times 3 = 6$ yen. Same logic: if $\mathbf{g}$ doubles local area and $\mathbf{f}$ triples it, the composite $\mathbf{f} \circ \mathbf{g}$ multiplies area by 6.
This fact becomes the engine behind change of variables in multiple integrals (Chapter 4). When you switch from Cartesian to polar coordinates, the area element picks up that factor of $r$ - that's exactly one of these Jacobian determinants at work.
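Here's a quick numerical check of the multiplication rule, assuming nothing beyond numpy: the inner stage is the polar-coordinate map (whose Jacobian determinant is $r$), and the outer stage is a made-up stretch with determinant $6$.

```python
import numpy as np

def jacobian(func, p, h=1e-6):
    # Central-difference Jacobian of func at point p.
    p = np.asarray(p, dtype=float)
    cols = []
    for j in range(len(p)):
        e = np.zeros_like(p); e[j] = h
        cols.append((func(p + e) - func(p - e)) / (2*h))
    return np.column_stack(cols)

def polar(p):
    # (r, theta) -> (x, y); its Jacobian determinant is r.
    r, th = p
    return np.array([r*np.cos(th), r*np.sin(th)])

def stretch(q):
    # A simple outer map: doubles x, triples y, so det = 6.
    return np.array([2*q[0], 3*q[1]])

p = np.array([1.5, 0.7])                              # some point (r, theta)
inner = np.linalg.det(jacobian(polar, p))             # ~ r = 1.5
outer = np.linalg.det(jacobian(stretch, polar(p)))    # ~ 6
total = np.linalg.det(jacobian(lambda p_: stretch(polar(p_)), p))

print(inner, outer, total)                            # 1.5, 6.0, 9.0
assert np.isclose(total, inner * outer, atol=1e-4)
```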
Try different mapping pairs below. Watch the colored square deform through each stage, and check that the area ratios multiply.
[Interactive figure: a colored square deforming through three panels - $x$-space, $u$-space, and $y$-space.]
The differential form
There's a lovely way to see the chain rule that makes it feel almost inevitable. The Jacobian matrix $\mathbf{Y}_u$ tells us how small changes in $\mathbf{u}$ produce small changes in $\mathbf{y}$:

$$d\mathbf{y} = \mathbf{Y}_u \, d\mathbf{u}$$

And the inner mapping relates $d\mathbf{u}$ to $d\mathbf{x}$:

$$d\mathbf{u} = \mathbf{U}_x \, d\mathbf{x}$$

Now do what any calculus student would do - substitute. Replace $d\mathbf{u}$:

$$d\mathbf{y} = \mathbf{Y}_u \, (\mathbf{U}_x \, d\mathbf{x}) = (\mathbf{Y}_u \, \mathbf{U}_x) \, d\mathbf{x} = \mathbf{Y}_x \, d\mathbf{x}$$
That's it. The chain rule is substitution. In single-variable calculus we write $dy = f'(u)\,du = f'(g(x))\,g'(x)\,dx$, replacing $du$ with $g'(x)\,dx$. The matrix version does exactly the same thing, just with matrices instead of numbers.
Linear approximations compose
This leads to the deepest way to understand what's happening. The matrix $\mathbf{Y}_u$ is the best linear approximation to $\mathbf{f}$ near a point. The matrix $\mathbf{U}_x$ is the best linear approximation to $\mathbf{g}$ near a point. Their product? The best linear approximation to the composite $\mathbf{f} \circ \mathbf{g}$ - because composing linear maps is exactly what matrix multiplication computes.
When $\mathbf{f}$ and $\mathbf{g}$ are actually linear - say $\mathbf{y} = A\mathbf{u}$ and $\mathbf{u} = B\mathbf{x}$ - the composite is $\mathbf{y} = AB\mathbf{x}$, and the chain rule reduces to ordinary matrix multiplication with no approximation at all. The general chain rule says: even when the maps are nonlinear, the same holds for their linear approximations at each point.
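A tiny sketch of that exactly-linear case (arbitrary random matrices, nothing assumed from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 2))   # outer: y = A u
B = rng.standard_normal((2, 2))   # inner: u = B x
x = rng.standard_normal(2)

# Composing the maps is multiplying the matrices: y = A (B x) = (A B) x.
assert np.allclose(A @ (B @ x), (A @ B) @ x)

# For a linear map the Jacobian IS the matrix, at every point, so
# Y_x = Y_u @ U_x = A @ B holds with no approximation at all.
print(A @ B)
```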
Chains of any length
The same logic extends effortlessly. If $\mathbf{y} = \mathbf{f}(\mathbf{u})$, $\mathbf{u} = \mathbf{g}(\mathbf{v})$, $\mathbf{v} = \mathbf{h}(\mathbf{x})$, then we can apply the chain rule twice:

$$\mathbf{Y}_x = \mathbf{Y}_u \cdot \mathbf{U}_v \cdot \mathbf{V}_x$$
Just keep multiplying Jacobian matrices, one for each link in the chain, outer to inner. Each factor is the derivative of that stage, evaluated at the point it actually receives as input. We'll prove this cleanly in Problem 3(a) below.
Practice Problems - §2.9
From Kaplan, problems after §2.9
$y_1 = u_1 u_2 - 3u_1$, $y_2 = u_1^2 + 2u_1 u_2 + 2u_1 - u_2$; $u_1 = x_1 \cos 3x_2$, $u_2 = x_1 \sin 3x_2$.
Find the Jacobian matrix in the form of a product of two matrices and evaluate for $x_1 = 0,\; x_2 = 0$.
Differentiate $(y_1, y_2)$ with respect to $(u_1, u_2)$:
$$\mathbf{Y}_u = \begin{pmatrix} u_2 - 3 & u_1 \\ 2u_1 + 2u_2 + 2 & 2u_1 - 1 \end{pmatrix}$$

Differentiate $(u_1, u_2)$ with respect to $(x_1, x_2)$:
$$\mathbf{U}_x = \begin{pmatrix} \cos(3x_2) & -3x_1\sin(3x_2) \\ \sin(3x_2) & 3x_1\cos(3x_2) \end{pmatrix}$$

First find the intermediate point: $u_1 = 0 \cdot \cos(0) = 0$, $u_2 = 0 \cdot \sin(0) = 0$.
$$\mathbf{Y}_u\big|_{(0,0)} = \begin{pmatrix} -3 & 0 \\ 2 & -1 \end{pmatrix}, \quad \mathbf{U}_x\big|_{(0,0)} = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}$$

Multiplying:

$$\mathbf{Y}_x = \mathbf{Y}_u \cdot \mathbf{U}_x = \begin{pmatrix} -3 & 0 \\ 2 & 0 \end{pmatrix}$$

The entire second column is zero. Why? At $x_1 = 0$, changing $x_2$ has no effect on either $u_1$ or $u_2$ - both are $x_1$ times a trig function, so when $x_1 = 0$, the trig part is irrelevant. The $\mathbf{U}_x$ matrix already told us this with its zero column, and that zero propagated through the multiplication. The Jacobian determinant is $0$: the mapping crushes a 2D region into a line at this point.
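As a sanity check, here's a short numerical sketch (plain numpy, with a small central-difference helper) that differentiates the composite directly at $(0,0)$ and recovers the same product:

```python
import numpy as np

def g(x):
    # u1 = x1 cos(3 x2), u2 = x1 sin(3 x2)
    return np.array([x[0]*np.cos(3*x[1]), x[0]*np.sin(3*x[1])])

def f(u):
    # y1 = u1 u2 - 3 u1, y2 = u1^2 + 2 u1 u2 + 2 u1 - u2
    return np.array([u[0]*u[1] - 3*u[0],
                     u[0]**2 + 2*u[0]*u[1] + 2*u[0] - u[1]])

def jacobian(func, p, h=1e-6):
    # Central-difference Jacobian of func at point p.
    p = np.asarray(p, dtype=float)
    cols = []
    for j in range(len(p)):
        e = np.zeros_like(p); e[j] = h
        cols.append((func(p + e) - func(p - e)) / (2*h))
    return np.column_stack(cols)

x0 = np.array([0.0, 0.0])
Yx = jacobian(lambda x: f(g(x)), x0)
print(Yx)   # ~ [[-3, 0], [2, 0]] -- matching the product above
```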
Given: $f(x_0,y_0) = u_0$, $g(x_0,y_0) = v_0$. At these points:
$f_x = 2,\; f_y = 3,\; g_x = -1,\; g_y = 5$
$p_u = 7,\; p_v = 1,\; q_u = -3,\; q_v = 2$
Let $z = p(f(x,y),\, g(x,y))$ and $w = q(f(x,y),\, g(x,y))$. Find the Jacobian matrix of $(z, w)$ with respect to $(x, y)$ at $(x_0, y_0)$.
This is a pure chain rule problem. We don't need explicit formulas for $f, g, p, q$ - just their derivatives at the right points. The outer mapping $(u,v) \mapsto (z,w)$ composes with the inner mapping $(x,y) \mapsto (u,v)$, and the general chain rule says their Jacobians multiply:

$$\begin{pmatrix} \frac{\partial z}{\partial x} & \frac{\partial z}{\partial y} \\[2pt] \frac{\partial w}{\partial x} & \frac{\partial w}{\partial y} \end{pmatrix} = \begin{pmatrix} p_u & p_v \\ q_u & q_v \end{pmatrix} \begin{pmatrix} f_x & f_y \\ g_x & g_y \end{pmatrix} = \begin{pmatrix} 7 & 1 \\ -3 & 2 \end{pmatrix} \begin{pmatrix} 2 & 3 \\ -1 & 5 \end{pmatrix} = \begin{pmatrix} 13 & 26 \\ -8 & 1 \end{pmatrix}$$
We computed the full $2 \times 2$ Jacobian of the composite without ever knowing an explicit formula for any of the four functions. Just the partial derivatives at one point, plus the chain rule. This is the power of the general chain rule: it reduces calculus to linear algebra.
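The final multiplication is a one-liner to check numerically - a minimal sketch with just the given numbers:

```python
import numpy as np

# Outer Jacobian of (z, w) w.r.t. (u, v) at (u0, v0).
P = np.array([[7, 1],
              [-3, 2]])
# Inner Jacobian of (u, v) = (f, g) w.r.t. (x, y) at (x0, y0).
F = np.array([[2, 3],
              [-1, 5]])

print(P @ F)   # [[13 26]
               #  [-8  1]]
```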
Let $\mathbf{y} = \mathbf{f}(\mathbf{u})$, $\mathbf{u} = \mathbf{g}(\mathbf{v})$, $\mathbf{v} = \mathbf{h}(\mathbf{x})$. Show that $\mathbf{y}_x = \mathbf{y}_u \, \mathbf{u}_v \, \mathbf{v}_x$.
The composition $\mathbf{u} = \mathbf{g}(\mathbf{h}(\mathbf{x}))$ has Jacobian:
$$\mathbf{u}_x = \mathbf{u}_v \cdot \mathbf{v}_x$$

Now $\mathbf{y} = \mathbf{f}(\mathbf{u})$ where $\mathbf{u}$ depends on $\mathbf{x}$ (through $\mathbf{v}$). The chain rule gives:
$$\mathbf{y}_x = \mathbf{y}_u \cdot \mathbf{u}_x = \mathbf{y}_u \cdot (\mathbf{u}_v \cdot \mathbf{v}_x) = \mathbf{y}_u \, \mathbf{u}_v \, \mathbf{v}_x$$

The last equality uses associativity of matrix multiplication.
The chain rule extends to any number of stages. With $k$ links in the chain, $\mathbf{y}_x$ is a product of $k$ Jacobian matrices, always written outer to inner, each evaluated at the point its mapping actually receives as input. The same logic (apply the two-stage rule and substitute) works for 4, 5, or 100 stages.
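And a closing sketch of a three-stage chain (the three maps are illustrative choices of mine, not from Kaplan): multiply the three Jacobians outer to inner, each evaluated at the point its stage actually receives, and compare against differentiating the full composite.

```python
import numpy as np

def jacobian(func, p, h=1e-6):
    # Central-difference Jacobian of func at point p.
    p = np.asarray(p, dtype=float)
    cols = []
    for j in range(len(p)):
        e = np.zeros_like(p); e[j] = h
        cols.append((func(p + e) - func(p - e)) / (2*h))
    return np.column_stack(cols)

# Three made-up smooth stages.
h_ = lambda x: np.array([x[0] + x[1]**2, np.sin(x[0])])   # v = h(x)
g_ = lambda v: np.array([v[0]*v[1], v[0] - v[1]])         # u = g(v)
f_ = lambda u: np.array([np.exp(u[0]), u[0]*u[1]])        # y = f(u)

x = np.array([0.3, -0.2])
v = h_(x)          # point the middle stage receives
u = g_(v)          # point the outer stage receives

# One Jacobian per link, outer to inner, each at its own point.
chain = jacobian(f_, u) @ jacobian(g_, v) @ jacobian(h_, x)
direct = jacobian(lambda x_: f_(g_(h_(x_))), x)

assert np.allclose(chain, direct, atol=1e-4)
print(chain)
```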