
9.2 Gradients

In Section 9.1 we suggested that organizing all the partial derivatives of $f$ into a vector might be useful. In this section, we will show that storing all the partial derivatives in a vector is a very good idea. This vector is called the gradient.

Gradients

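The gradient of $f$, written $\nabla f$, is the vector of its partial derivatives. For a function of two inputs, $\nabla f(x,y) = [\partial_x f(x,y), \partial_y f(x,y)]$. In other words, the gradient of a surface at a point is the ordered collection of all partial derivatives of the surface at the point.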

The animation below illustrates the gradient of the example surface introduced in Sections 8.2 and 9.1. The gradient is built by first finding the slopes of the $x$ and $y$ cross-sections. These are the partial derivatives. Storing the partial derivatives in a vector produces the gradient. The gradient vector is shown in white, with side-lengths equal to the partial derivatives. The animation finishes by appending the level set passing through the point where the gradient is evaluated.

The gradient of a surface.

This animation reveals a pair of compelling geometric facts about the gradient.

  1. When we view the surface as a heatmap, the gradient points uphill. In fact, it points in the direction of steepest ascent.

  2. The gradient is perpendicular to the level set passing through the point where it is evaluated. The white arrow representing the gradient is perpendicular to the white contour at the end of the animation shown above.

These two facts suggest a new interpretation of the arrows you’ve appended to contour plots to indicate the uphill direction. Those arrows were parallel to the gradient!

We’ll dive into the geometric interpretation in a moment. For now, let’s run some examples.

Examples

Here are the three examples from Section 9.1:

  1. $f(x,y) = x + 7y - 3$. Find $\nabla f(x,y)$.

  2. $f(x,y) = x^2 \times (1 + y^3)$. Find $\nabla f(x,y)$.

  3. $f(x,y) = \log(x - y)$. Find $\nabla f(x,y)$.
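One way to check hand-computed answers to these three exercises is to compare them against finite-difference estimates. The sketch below is only an illustration: the helper `numerical_gradient`, the step size `h`, and the test point are assumptions introduced here, and the closed-form gradients in the list are worked answers supplied for the comparison rather than taken from the text.

```python
import math

def numerical_gradient(f, x, y, h=1e-6):
    """Approximate [df/dx, df/dy] at (x, y) with central differences."""
    df_dx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    df_dy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return [df_dx, df_dy]

# Each entry pairs a function with a hand-computed gradient to be checked.
examples = [
    (lambda x, y: x + 7 * y - 3,
     lambda x, y: [1.0, 7.0]),
    (lambda x, y: x**2 * (1 + y**3),
     lambda x, y: [2 * x * (1 + y**3), 3 * x**2 * y**2]),
    (lambda x, y: math.log(x - y),
     lambda x, y: [1 / (x - y), -1 / (x - y)]),
]

# Arbitrary test point; x0 > y0 keeps log(x - y) defined.
x0, y0 = 2.0, 0.5
for f, grad_f in examples:
    print(numerical_gradient(f, x0, y0), grad_f(x0, y0))
```

If the two vectors printed on each line agree to several decimal places, the hand-computed gradient is very likely correct at that point.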

Vector-Valued Functions and Vector-Fields

The animation below shows how the gradient changes as the point where it is evaluated changes. Varying the point where the gradient is evaluated changes its length and direction.

The gradient of the surface depends on the point where it is evaluated.

Since the gradient accepts any input $x$ and, for every input $x$, returns a vector with the same number of elements as $x$, the gradient of a surface defines a vector field.

Vector fields are easiest to picture in two dimensions. In two dimensions, a vector field can be pictured as a collection of arrows, one for each location in the plane. You can imagine these as the arrows pointing uphill on a surface, or as the velocity of some flowing medium. Not all vector fields are the gradient of some surface, but the gradient of any surface is a vector field.
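To make the "one arrow per location" picture concrete, the sketch below samples a gradient on a small grid and stores one vector per grid point. The function is an illustrative choice, $f(x,y) = x^2(1 + y^3)$ with the hand-computed gradient $[2x(1 + y^3), 3x^2y^2]$, and the grid values are arbitrary assumptions.

```python
def grad_f(x, y):
    """Hand-computed gradient of the illustrative f(x, y) = x**2 * (1 + y**3)."""
    return [2 * x * (1 + y**3), 3 * x**2 * y**2]

# A coarse grid of input locations (arbitrary choice).
grid = [-1.0, 0.0, 1.0]

# The vector field assigns one arrow (the gradient) to every location.
field = {(x, y): grad_f(x, y) for x in grid for y in grid}

for location, arrow in field.items():
    print(location, arrow)
```

A plotting library could draw each stored arrow at its location, which is essentially what a gradient field plot displays.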

Run the code cell below to visualize a surface, its level sets, and the gradient vector field. The blue arrows on the $x,y$ plane are the gradients. The solid red arrow is the gradient at the input location $[x_0,y_0]$. The red curve is the level set passing through $[x_0,y_0]$. Try varying the input point where the gradient is evaluated by changing $x_0$ and $y_0$.

from utils_lsg import show_gradient_field

show_gradient_field()

Linearization via Gradients

In Section 9.1 we saw that the tangent plane to a surface $f$ at some input $[x_*,y_*]$ is produced by its linear approximation:

$$\tilde{f}_1(x,y) = f(x_*,y_*) + [\partial_x f(x_*,y_*), \partial_y f(x_*,y_*)] \cdot [x - x_*, y - y_*].$$

Click “Show Tangent Plane” in the demo above to visualize the tangent plane.

We can write the linearization more concisely using the gradient:

$$\tilde{f}_1(x,y) = f(x_*,y_*) + \nabla f(x_*,y_*) \cdot [x - x_*, y - y_*].$$

Note the analogy to linearization in one-dimension. In one-dimension:

$$f(x) \simeq \tilde{f}_1(x) = f(x_*) + f'(x_*) (x - x_*).$$

The $d$-dimensional formula replaces $f'(x_*)$ with the gradient, $\nabla f(x_*)$, and uses an inner product to collate the linear approximations across all $d$ dimensions.
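The linearization formula translates almost directly into code. The sketch below builds $\tilde{f}_1$ from a function value and a gradient, then measures how the approximation error shrinks near the expansion point. The function $f(x,y) = x^2 y + 3x$, the expansion point $[1,-3]$, and the helper name `linearize` are illustrative assumptions.

```python
def f(x, y):
    """Illustrative surface: f(x, y) = x**2 * y + 3 * x."""
    return x**2 * y + 3 * x

def grad_f(x, y):
    """Hand-computed gradient of f: [2xy + 3, x**2]."""
    return [2 * x * y + 3, x**2]

def linearize(x_star, y_star):
    """Return the linear approximation of f about (x_star, y_star)."""
    f_star = f(x_star, y_star)
    gx, gy = grad_f(x_star, y_star)

    def f_tilde(x, y):
        # f(x*, y*) + grad f(x*, y*) . [x - x*, y - y*]
        return f_star + gx * (x - x_star) + gy * (y - y_star)

    return f_tilde

f_tilde = linearize(1.0, -3.0)
for step in (0.1, 0.01, 0.001):
    x, y = 1.0 + step, -3.0 + step
    print(step, abs(f(x, y) - f_tilde(x, y)))
```

For this function the error falls roughly a hundredfold each time the step shrinks tenfold, which is the quadratic error decay expected of a first-order approximation.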

The animation below illustrates this process for our example surface. It shows the tangent plane (white) to the surface about an input point. The tangent plane is the plane containing the linear approximations to the $x$ and $y$ cross-sections passing through the input point (orange and green).

The tangent plane to a surface is recovered by linearizing the surface.

Directional Derivatives

The partial derivatives of a surface each evaluate the slope of the surface when only one input varies. Varying only one input moves along a line in the input space. For partial derivatives, these lines are parallel to a single coordinate axis.

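We can generalize this idea by varying the inputs along a line that points in an arbitrary direction $v$. The directional derivative of $f$ at $x$ along $v$ is the slope of $f$ along that line:

$$\partial_v f(x) = \frac{d}{dt} f(x + t v) \Big|_{t = 0}.$$

We can work out directional derivatives directly, or by using gradients to linearize $f$ about $x$. Let’s try both approaches and see that they give the same answer. We’ll work with the example function: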

$$f(x,y) = x^2 y + 3x$$

the input point $[x,y] = [1,-3]$ and the direction $v = [-5,6]$.

Directly

First, set up the derivative.

$$\begin{aligned} \partial_{v} f(x,y) & = \frac{d}{dt} f(x - 5t, y + 6t) \Big|_{t = 0} \\ & = \frac{d}{dt} \left[ (x - 5t)^2 (y + 6t) + 3 (x - 5t) \right] \Big|_{t = 0}. \end{aligned}$$

Then, evaluate:

$$\begin{aligned} \partial_{v} f(x,y) & = \left[ 2(x - 5t)(-5)(y + 6t) + (x - 5t)^2 \cdot 6 + 3(-5) \right] \Big|_{t = 0} \\ & = 2(x - 0)(-5)(y + 0) + (x - 0)^2 \cdot 6 - 15 \\ & = -10xy + 6x^2 - 15. \end{aligned}$$

At $[x,y] = [1,-3]$:

$$\partial_{v} f(1,-3) = -10 \times 1 \times (-3) + 6 \times 1^2 - 15 = 30 + 6 - 15 = 21.$$
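The direct calculation can be sanity-checked numerically by differentiating the one-variable function $g(t) = f(x - 5t, y + 6t)$ at $t = 0$, using the input point $[1,-3]$ and direction $[-5,6]$ stated with the example. This is a rough sketch: the helper `g` and the step size `h` are assumptions made for illustration.

```python
def f(x, y):
    """Example function: f(x, y) = x**2 * y + 3 * x."""
    return x**2 * y + 3 * x

def g(t, x=1.0, y=-3.0, v=(-5.0, 6.0)):
    """f evaluated along the line through (x, y) in direction v."""
    return f(x + v[0] * t, y + v[1] * t)

# Central-difference estimate of dg/dt at t = 0.
h = 1e-6
directional_derivative = (g(h) - g(-h)) / (2 * h)
print(directional_derivative)  # close to 21
```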

Using Gradients

We can rearrange the same calculation to find a more efficient approach. Consider the first equality after performing the derivative. Each term was either multiplied by $-5$ or by $6$. These numbers didn’t show up by accident. They were the entries of $v = [-5,6]$. Tracing their source, they appeared each time we applied the chain rule, and either took a derivative of $x - 5t$ or $y + 6t$ with respect to $t$. So, we can rewrite the directional derivative:

$$\begin{aligned} \partial_{v} f(x,y) & = (2(x - 5t)(y + 6t) + 3)\Big|_{t = 0} \times (-5) + ((x - 5t)^2)\Big|_{t = 0} \times (6) \\ & = (2xy + 3) \times (-5) + x^2 \times (6) \\ & = (2xy + 3) \times v_1 + x^2 \times v_2 \\ & = [2xy + 3, x^2] \cdot v. \end{aligned}$$

So, this directional derivative of $f$ with respect to $v$ equals an inner product between some vector-valued function of the input point and the direction $v$.

Inspect the entries of this vector-valued function. They look a lot like partial derivatives. Indeed:

$$\begin{aligned} & \partial_{x} f(x,y) = \partial_{x} (x^2 y + 3x) = 2xy + 3 \\ & \partial_{y} f(x,y) = \partial_{y} (x^2 y + 3x) = x^2. \end{aligned}$$

So, the vector-valued function is the gradient of $f$!

It follows that:

$$\partial_{v} f(x,y) = [\partial_{x} f(x,y), \partial_{y} f(x,y)] \cdot v = \nabla f(x,y) \cdot v.$$

The equation $\partial_v f(x,y) = \nabla f(x,y) \cdot v$ is too nice to be a coincidence. It holds for all directional derivatives where the gradient exists.

Here’s the calculation of the directional derivative using gradients:

  1. $\nabla f(x,y) = [2xy + 3, x^2]$ so $\nabla f(1,-3) = [-6 + 3, 1] = [-3,1]$.

  2. $v = [-5,6]$.

  3. $\partial_v f(1,-3) = [-3,1] \cdot [-5,6] = 15 + 6 = 21$.
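The three steps above can be sketched in a few lines of code. The helper names `dot` and `directional_derivative` are illustrative assumptions; the gradient is the one worked out by hand for $f(x,y) = x^2 y + 3x$.

```python
def grad_f(x, y):
    """Gradient of f(x, y) = x**2 * y + 3 * x, computed by hand."""
    return [2 * x * y + 3, x**2]

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))

def directional_derivative(x, y, v):
    """Directional derivative of f at (x, y) along v, via the gradient."""
    return dot(grad_f(x, y), v)

print(grad_f(1, -3))                           # [-3, 1]
print(directional_derivative(1, -3, [-5, 6]))  # 21
```

Once `grad_f` is available, changing the direction only costs one more inner product, with no new derivative to work out.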

This approach is more organized and almost always faster. It will also help us prove the two geometric observations we made at the start of this chapter.

It also provides a general chain rule for surfaces. If $x(t) = [x_1(t), x_2(t), \ldots, x_d(t)]$ is a vector-valued function of some scalar variable $t$, then we can think about $x(t)$ as a point moving along a path, whose position on the path depends on the time $t$. At any instant, the velocity of the point is $v(t) = \frac{d}{dt} x(t) = [\frac{d}{dt} x_1(t), \frac{d}{dt} x_2(t), \ldots, \frac{d}{dt} x_d(t)]$. Then, using the directional derivative, the rate of change of $f$ along the path is:

$$\frac{d}{dt} f(x(t)) = \nabla f(x(t)) \cdot v(t).$$
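As a numerical illustration of this chain rule, the sketch below moves a point around the unit circle and compares a finite-difference estimate of the rate of change of $f$ along the path with the inner product of the gradient and the velocity. The circular path, the time `t0`, and the step size `h` are arbitrary assumptions chosen for the example.

```python
import math

def f(x, y):
    """Example function: f(x, y) = x**2 * y + 3 * x."""
    return x**2 * y + 3 * x

def grad_f(x, y):
    """Hand-computed gradient of f."""
    return [2 * x * y + 3, x**2]

def path(t):
    """Position of the moving point at time t (unit circle)."""
    return [math.cos(t), math.sin(t)]

def velocity(t):
    """Entry-wise time derivative of the path."""
    return [-math.sin(t), math.cos(t)]

t0, h = 0.7, 1e-6

# Rate of change of f along the path, estimated by central differences.
numeric = (f(*path(t0 + h)) - f(*path(t0 - h))) / (2 * h)

# The same rate, computed as gradient . velocity.
gx, gy = grad_f(*path(t0))
vx, vy = velocity(t0)
chain_rule = gx * vx + gy * vy

print(numeric, chain_rule)
```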

Gradient Geometry

The gradient is a vector, so, like all vectors, it has a length (magnitude) and a direction. We’ve observed two geometric facts about the direction of the gradient vector:

  1. The gradient points in the direction of steepest ascent.

  2. The gradient is perpendicular to the level set of $f$ about the point where it is evaluated.

Let’s prove these facts using the equation:

$$\partial_v f(x) = \nabla f(x) \cdot v.$$

Along the way, we’ll work out an interpretation for the magnitude of the gradient. In all cases, we’ll work with the geometric interpretation of the inner product introduced in Section 8.1:

$$\partial_v f(x) = \|\nabla f(x)\| \|v\| \cos(\theta)$$

where $\theta$ is the angle between the gradient and the direction $v$.

Steepest Ascent

First, let’s find the direction of steepest ascent. To find the direction of steepest ascent at $x$, we should look for a direction $v$ that maximizes $\partial_v f(x)$.

Here we have a slight problem. Since

$$\partial_v f(x) = \|\nabla f(x)\| \|v\| \cos(\theta) \propto \|v\|$$

we can make the directional derivative along any direction arbitrarily large by picking $\|v\|$ arbitrarily large. If $f(x)$ increases along a direction $v$, then we can always make the slope along $v$ larger by increasing $\|v\|$.

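The last sentence may have seemed strange. It seems strange because it reveals an inconsistency between the definition we used to derive the chain rule for surfaces and the geometry we expect a directional derivative to encode. The directional derivative really should be independent of length. Here’s an improved definition, which replaces $v$ with the unit vector $v / \|v\|$ pointing in the same direction:

$$\partial_v f(x) = \frac{d}{dt} f\left(x + t \frac{v}{\|v\|}\right) \Big|_{t = 0} = \nabla f(x) \cdot \frac{v}{\|v\|}.$$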

Using this definition, all vectors pointing in the same direction will produce the same directional derivative, no matter their length, since they all share the same unit vector. This definition is consistent with the definition we used before, provided we restrict our input directions to unit vectors.

In either case, we can now write the directional derivative more simply:

$$\partial_v f(x) = \|\nabla f(x)\| \cos(\theta).$$

This definition is much better. Now the directional derivative depends only on the gradient vector, which controls the linearization of $f$, and the direction of $v$. It is independent of the magnitude of $v$.
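A quick numerical check of this length independence is sketched below: the direction is divided by its length before taking the inner product with the gradient, so rescaling the direction should not change the result. The function name `unit_directional_derivative` and the choice of example function $f(x,y) = x^2 y + 3x$ are assumptions made for illustration.

```python
import math

def grad_f(x, y):
    """Hand-computed gradient of f(x, y) = x**2 * y + 3 * x."""
    return [2 * x * y + 3, x**2]

def unit_directional_derivative(x, y, v):
    """Slope of f at (x, y) along the unit vector pointing along v."""
    norm = math.hypot(v[0], v[1])
    gx, gy = grad_f(x, y)
    return (gx * v[0] + gy * v[1]) / norm

short = unit_directional_derivative(1, -3, [-5, 6])
long = unit_directional_derivative(1, -3, [-50, 60])
print(short, long)  # the two values agree
```

Both calls return $21 / \sqrt{61}$, the unnormalized value divided by the length of $[-5,6]$.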

Now, to find the direction of steepest ascent at $x$, we want to maximize $\partial_v f(x) = \|\nabla f(x)\| \cos(\theta)$. Since $x$ and $f$ are fixed, we cannot change the gradient term. So, the only term to maximize is the cosine term. Remember, the angle $\theta$ depends on our choice of $v$.

To maximize $\cos(\theta)$ our only option is $\theta = 0$. If $\theta = 0$, then the direction $v$ and the gradient $\nabla f(x)$ are parallel. So, the best choice for $v$ is $v \propto \nabla f(x)$!

Our analysis above also provides a nice interpretation for the magnitude of the gradient vector. If $v \propto \nabla f(x)$, then $v$ and $\nabla f(x)$ are parallel, so $\theta = 0$. If $\theta = 0$ then $\cos(\theta) = 1$, so:

$$\partial_{v} f(x) = \|\nabla f(x)\| \text{ if } v \text{ is the direction of steepest ascent}.$$

Therefore:

$$\max_{v}\{\partial_{v} f(x)\} = \|\nabla f(x)\|.$$

These two facts make the gradient a wonderfully interpretable object. At every $x$, $\nabla f(x)$ is a vector pointing in the direction of steepest ascent, with magnitude equal to the fastest possible rate of ascent. The longer the gradient vector, the steeper the surface. The shorter the gradient vector, the shallower the surface.
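Both facts can be checked with a brute-force scan: evaluate the slope along many unit directions and see which one is steepest. The sketch below does this for the illustrative function $f(x,y) = x^2 y + 3x$ at the point $[1,-3]$; the number of scanned directions `n` is an arbitrary assumption that controls the angular resolution.

```python
import math

def grad_f(x, y):
    """Hand-computed gradient of f(x, y) = x**2 * y + 3 * x."""
    return [2 * x * y + 3, x**2]

gx, gy = grad_f(1.0, -3.0)
grad_norm = math.hypot(gx, gy)   # length of the gradient
grad_angle = math.atan2(gy, gx)  # angle of the gradient, in radians

# Scan unit directions [cos(a), sin(a)] for angles a in [-pi, pi) and
# keep the (slope, angle) pair with the largest slope.
n = 3600
best_slope, best_angle = max(
    (gx * math.cos(a) + gy * math.sin(a), a)
    for a in (2 * math.pi * k / n - math.pi for k in range(n))
)

print(best_slope, grad_norm)   # steepest slope vs. gradient length
print(best_angle, grad_angle)  # steepest direction vs. gradient direction
```

Up to the resolution of the scan, the steepest direction matches the direction of the gradient and the steepest slope matches its length.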

Level Sets and Gradients

To show that the gradient, $\nabla f(x)$, is perpendicular to the level set of $f$ at $x$, we will again use the equation:

$$\partial_v f(x) = \|\nabla f(x)\| \cos(\theta).$$

Along a level set, the surface stays at a constant height. So, if $v$ points along a level set, then the directional derivative of $f$ along $v$ must be zero. You can think about this statement topographically. If you walk along a path that stays at the same elevation, then your elevation will never change, so the slope of the ground along the path must equal zero.

Therefore, if $v$ is parallel to a level set, then:

$$\partial_v f(x) = \|\nabla f(x)\| \cos(\theta) = 0.$$

If $\nabla f(x)$ is a nonzero vector, then $\|\nabla f(x)\| \neq 0$, so $\cos(\theta) = 0$. As in Section 8.1, the cosine of the angle between two nonzero vectors can only equal zero if they are perpendicular to one another. Therefore, the gradient $\nabla f(x)$ is perpendicular to the level set of $f$ passing through $x$.

This rule is incredibly useful when we want to visualize gradients. Given a contour plot, the gradient is easy to draw. Just add a vector field perpendicular to the level sets!

Run the code cell below to check that the gradient vector field is, as shown above, always perpendicular to the contours of each contour plot.

from utils_lsg import show_gradient_field

show_gradient_field()
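The perpendicularity can also be checked numerically without any plotting. For the illustrative function $f(x,y) = x^2 y + 3x$, the level set through $[1,-3]$ is the curve $y = -3/x$ (where $f = 0$), so its tangent direction can be estimated by finite differences and compared with the gradient. The helper `level_curve` and the step size `h` are assumptions introduced for this sketch.

```python
def f(x, y):
    """Example function: f(x, y) = x**2 * y + 3 * x."""
    return x**2 * y + 3 * x

def grad_f(x, y):
    """Hand-computed gradient of f."""
    return [2 * x * y + 3, x**2]

def level_curve(x):
    """Point on the level set f = 0 through (1, -3): y = -3 / x, x != 0."""
    return (x, -3.0 / x)

# Two nearby points on the level set, on either side of (1, -3).
h = 1e-6
(xa, ya), (xb, yb) = level_curve(1 - h), level_curve(1 + h)

# Finite-difference estimate of the tangent direction to the level set.
tangent = [(xb - xa) / (2 * h), (yb - ya) / (2 * h)]

gx, gy = grad_f(1.0, -3.0)
inner_product = gx * tangent[0] + gy * tangent[1]

print(f(xa, ya), f(xb, yb))  # both approximately 0: same level set
print(inner_product)         # approximately 0: gradient is perpendicular
```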