In Section 9.1 we suggested that organizing all the partial derivatives of f into a vector might be useful. In this section, we will show that storing them in a vector is a very good idea. This vector is called the gradient.
In other words, the gradient of a surface at a point is the ordered collection of all partial derivatives of the surface at the point.
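To make "the ordered collection of all partial derivatives" concrete, here is a minimal sketch that approximates each partial derivative with a central difference and stores the results in a list. The surface `f(x, y) = x**2 * y + 3 * x` is an assumption, chosen so that its partial derivatives match the gradient [2xy+3, x²] used in the worked example later in this section. The helper name `numerical_gradient` and the step size `h` are illustrative choices, not part of the course utilities.

```python
def f(x, y):
    # Assumed example surface (see the lead-in); not the animated surface.
    return x**2 * y + 3 * x


def numerical_gradient(func, point, h=1e-6):
    """Approximate each partial derivative of func at point with a central difference."""
    grad = []
    for i in range(len(point)):
        forward = list(point)
        backward = list(point)
        forward[i] += h   # nudge only the i-th input up
        backward[i] -= h  # nudge only the i-th input down
        grad.append((func(*forward) - func(*backward)) / (2 * h))
    return grad


grad = numerical_gradient(f, [1.0, -3.0])
print(grad)  # entries should be close to -3 and 1
```

Each entry of the list is the slope of one cross-section, so the list has as many entries as the surface has inputs.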
The animation below illustrates the gradient of the example surface introduced in Sections 8.2 and 9.1. The gradient is built by first finding the slopes of the x and y cross-sections. These are the partial derivatives. Storing the partial derivatives in a vector produces the gradient. The gradient vector is shown in white, with side-lengths equal to the partial derivatives. The animation finishes by appending the level set passing through the point where the gradient is evaluated.
The gradient of a surface.
This animation reveals a pair of compelling geometric facts about the gradient.
When we view the surface as a heatmap, the gradient points uphill. In fact, it points in the direction of steepest ascent.
The gradient is perpendicular to the level set passing through the point where it is evaluated. The white arrow representing the gradient is perpendicular to the white contour at the end of the animation shown above.
These two facts suggest a new interpretation of the arrows you’ve appended to contour plots to indicate the uphill direction. Those arrows were parallel to the gradient!
We’ll dive into the geometric interpretation in a moment. For now, let’s run some examples.
The animation below shows how the gradient depends on the point where it is evaluated. Moving that point changes both the length and the direction of the gradient.
The gradient of the surface depends on the point where it is evaluated.
Since the gradient accepts all inputs x, and, for every input x, returns a vector with the same number of elements as x, the gradient of a surface defines a vector field.
Vector fields are easiest to picture in two dimensions. In two dimensions, a vector field can be pictured as a collection of arrows, one for each location in the plane. You can imagine these as the arrows pointing uphill on a surface, or as the velocity of some flowing medium. Not all vector fields are the gradient of some surface, but the gradient of any surface is a vector field.
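As a rough illustration of "one arrow for each location in the plane", the sketch below evaluates a gradient on a coarse grid and stores one 2-vector per grid point. The analytic gradient `grad_f` belongs to the assumed surface f(x, y) = x²y + 3x, and the 3-by-3 grid is an arbitrary choice made for readability.

```python
def grad_f(x, y):
    # Analytic gradient of the assumed surface f(x, y) = x**2 * y + 3 * x.
    return [2 * x * y + 3, x**2]


# A coarse 3-by-3 grid of input locations in the plane.
grid = [(x, y) for y in (-1.0, 0.0, 1.0) for x in (-1.0, 0.0, 1.0)]

# The vector field assigns one arrow (a 2-vector) to each location.
field = {point: grad_f(*point) for point in grid}

for point, arrow in field.items():
    print(point, "->", arrow)
```

Every input has two entries and so does every arrow, which is what makes the gradient a vector field on the input space.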
Run the code cell below to visualize a surface, its level sets, and the gradient vector field. The blue arrows on the x,y plane are the gradients. The solid red arrow is the gradient at the input location [x0,y0]. The red curve is the level set passing through [x0,y0]. Try varying the input point where the gradient is evaluated by changing x0 and y0.
from utils_lsg import show_gradient_field
show_gradient_field()
The d-dimensional formula replaces f′(x∗) with the gradient, ∇f(x∗), and uses an inner product to collate the linear approximations across all d dimensions.
The animation below illustrates this process for our example surface. It shows the tangent plane (white) to the surface about an input point. The tangent plane is the plane containing the linear approximation to the x and y cross-sections passing through the input point (orange and green).
The tangent plane to a surface is recovered by linearizing the surface.
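The linearization described above, f(x∗) plus the inner product of ∇f(x∗) with the displacement x−x∗, can be sketched in a few lines. The surface and its gradient are the assumed example f(x, y) = x²y + 3x, and `linearize` is a hypothetical helper name. The loop prints the gap between the surface and its tangent plane as the test point approaches the input point.

```python
def f(x, y):
    # Assumed example surface.
    return x**2 * y + 3 * x


def grad_f(x, y):
    # Its analytic gradient.
    return [2 * x * y + 3, x**2]


def linearize(point):
    """Return the tangent-plane approximation of f about point."""
    f0 = f(*point)
    g = grad_f(*point)

    def approx(x, y):
        # f(x*) plus the inner product of the gradient with (x - x*).
        return f0 + g[0] * (x - point[0]) + g[1] * (y - point[1])

    return approx


tangent = linearize((1.0, -3.0))
for step in (0.1, 0.01, 0.001):
    x, y = 1.0 + step, -3.0 + step
    print(step, abs(f(x, y) - tangent(x, y)))
```

The printed gap should shrink much faster than the step itself, which is the sense in which the tangent plane is a good local approximation.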
The partial derivatives of a surface each evaluate the slope of the surface when only one input varies. Varying only one input moves along a line in the input space. For partial derivatives, these lines are parallel to a single coordinate axis.
We can generalize this idea:
We can work out directional derivatives directly, or by using gradients to linearize f about x. Let’s try both approaches and see that they give the same answer. We’ll work with the example function:
We can rearrange the same calculation to find a more efficient approach. Consider the first equality after performing the derivative. Each term was either multiplied by -5 or by 6. These numbers didn’t show up by accident. They were the entries of v=[−5,6]. Tracing their source, they appeared each time we applied the chain rule, and either took a derivative of x−5t or y+6t with respect to t. So, we can rewrite the directional derivative:
So, this directional derivative of f with respect to v equals an inner product between some vector-valued function of the input point, and the direction v.
Inspect the terms in the vector-valued function of f. They look a lot like partial derivatives. Indeed:
The equation ∂vf(x,y)=∇f(x,y)⋅v is too nice to be a coincidence. It holds for all directional derivatives wherever the gradient exists:
Here’s the calculation of the directional derivative using gradients:
∇f(x,y)=[2xy+3,x²] so ∇f(1,−3)=[−6+3,1]=[−3,1].
v=[−5,6]
∂vf(1,−3)=[−3,1]⋅[−5,6]=15+6=21.
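The two approaches can also be compared numerically. The sketch below assumes the example function is f(x, y) = x²y + 3x, which is consistent with the gradient [2xy+3, x²] above. It first differentiates f along the line through (1,−3) in the direction v=[−5,6] using a central difference, then takes the inner product of the gradient with v.

```python
def f(x, y):
    # Assumed example function, consistent with the gradient [2xy + 3, x^2].
    return x**2 * y + 3 * x


def grad_f(x, y):
    return [2 * x * y + 3, x**2]


point = (1.0, -3.0)
v = (-5.0, 6.0)

# Approach 1: differentiate g(t) = f(x - 5t, y + 6t) at t = 0 numerically.
h = 1e-6
direct = (
    f(point[0] + h * v[0], point[1] + h * v[1])
    - f(point[0] - h * v[0], point[1] - h * v[1])
) / (2 * h)

# Approach 2: inner product of the gradient with v.
g = grad_f(*point)
via_gradient = g[0] * v[0] + g[1] * v[1]

print(direct, via_gradient)  # both should be close to 21
```

The gradient route needs one gradient evaluation and one inner product, and the same gradient can be reused for any other direction at the same point.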
This approach is more organized and almost always faster. It will also help us prove the two geometric observations we made at the start of this chapter.
It also provides a general chain rule for surfaces. If x(t)=[x1(t),x2(t),...,xd(t)] is a vector-valued function of some scalar variable t, then we can think about x(t) as a point moving along a path, whose position on the path depends on the time t. Then, at any instant, the velocity of the point is v(t)=dx(t)/dt=[dx1(t)/dt,dx2(t)/dt,...,dxd(t)/dt]. Then, using the directional derivative:
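A quick numerical check of this chain rule: the rate of change of f along a path should equal the inner product of the gradient at the current position with the velocity. The path below (a point moving around the unit circle), the time t = 0.7, and the surface f(x, y) = x²y + 3x are all assumptions made for illustration.

```python
import math


def f(x, y):
    # Assumed example surface.
    return x**2 * y + 3 * x


def grad_f(x, y):
    return [2 * x * y + 3, x**2]


def path(t):
    # Hypothetical path: a point moving around the unit circle.
    return (math.cos(t), math.sin(t))


def velocity(t):
    # Entry-by-entry derivative of the path with respect to t.
    return (-math.sin(t), math.cos(t))


t = 0.7
g = grad_f(*path(t))
v = velocity(t)
chain_rule = g[0] * v[0] + g[1] * v[1]

# Compare against a central difference of t -> f(x(t)).
h = 1e-6
finite_difference = (f(*path(t + h)) - f(*path(t - h))) / (2 * h)

print(chain_rule, finite_difference)
```

The two printed numbers should agree to several decimal places.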
The gradient is a vector, so, like all vectors, it has a length (magnitude) and a direction. We’ve observed two geometric facts about the direction of the gradient vector:
The gradient points in the direction of steepest ascent.
The gradient is perpendicular to the level set of f passing through the point where it is evaluated.
Along the way, we’ll work out an interpretation for the magnitude of the gradient. In all cases, we’ll work with the geometric interpretation of the inner product introduced in Section 8.1:
First, let’s find the direction of steepest ascent. To find the direction of steepest ascent at x, we should look for a direction v, that maximizes ∂vf(x).
Under the definition used so far, ∂vf(x)=∇f(x)⋅v scales with the length of v, so we can make the directional derivative arbitrarily large by picking ∥v∥ arbitrarily large. If f(x) increases along a direction v, then we can always make the slope along v larger by lengthening v.
The last sentence may have seemed strange. It seems strange because it reveals an inconsistency between the definition we used to derive the chain rule for surfaces and the geometry we expect a directional derivative to encode. The directional derivative really should be independent of length. Here’s an improved definition:
Using this definition, all vectors pointing in the same direction will produce the same directional derivative, no matter their length, since they all share the same unit vector. This definition is consistent with the definition we used before, provided we restrict our input directions to unit vectors.
In either case, we can now write the directional derivative more simply:
This definition is much better. Now the directional derivative depends only on the gradient vector, which controls the linearization of f, and the direction of v. It is independent of the magnitude of v.
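The independence from length is easy to check numerically. This sketch assumes the improved definition takes the inner product of the gradient with the unit vector v/∥v∥, and the helper name `directional_derivative` is hypothetical. It reuses the gradient [−3, 1] from the worked example and rescales v=[−5,6] by several factors.

```python
import math


def directional_derivative(grad, v):
    """Slope along the direction of v: the gradient dotted with v / ||v||."""
    norm_v = math.hypot(*v)
    return sum(g_i * v_i for g_i, v_i in zip(grad, v)) / norm_v


grad = [-3.0, 1.0]  # gradient at (1, -3) from the worked example

for scale in (1.0, 2.0, 10.0):
    v = [scale * -5.0, scale * 6.0]
    print(scale, directional_derivative(grad, v))
```

All three printed slopes should match up to rounding, since every rescaled v shares the same unit vector.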
Now, to find the direction of steepest ascent at x, we want to maximize ∂vf(x)=∥∇f(x)∥cos(θ). Since x and f are fixed, we cannot change the gradient term. So, the only term to maximize is the cosine term. Remember, the angle θ depends on our choice of v.
To maximize cos(θ) our only option is θ=0. If θ=0, then the direction v and the gradient ∇f(x) are parallel. So, the best choice for v is v∝∇f(x)!
Our analysis above also provides a nice interpretation for the magnitude of the gradient vector. If v∝∇f(x), then v and ∇f(x) are parallel, so θ=0. If θ=0 then cos(θ)=1, so:
∂vf(x)=∥∇f(x)∥ if v is the direction of steepest ascent.
These two facts make the gradient a wonderfully interpretable object. At every x, ∇f(x) is a vector pointing in the direction of steepest ascent, with magnitude equal to the fastest possible rate of ascent. The longer the gradient vector, the steeper the surface. The shorter the gradient vector, the shallower the surface.
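Both facts can be checked by brute force. The sketch below sweeps unit vectors around the circle, records the one with the largest slope, and compares its angle and slope with the angle and length of the gradient. The gradient [−3, 1] comes from the worked example, and the resolution of 3600 directions is an arbitrary choice.

```python
import math

grad = [-3.0, 1.0]  # gradient at (1, -3) from the worked example

best_angle, best_slope = None, -math.inf
n_directions = 3600
for k in range(n_directions):
    angle = 2 * math.pi * k / n_directions
    u = (math.cos(angle), math.sin(angle))  # unit vector at this angle
    slope = grad[0] * u[0] + grad[1] * u[1]  # directional derivative along u
    if slope > best_slope:
        best_angle, best_slope = angle, slope

grad_angle = math.atan2(grad[1], grad[0]) % (2 * math.pi)
grad_norm = math.hypot(*grad)

print(best_angle, grad_angle)  # steepest direction vs. gradient direction
print(best_slope, grad_norm)   # steepest slope vs. gradient magnitude
```

Up to the angular resolution of the sweep, the best direction lines up with the gradient and the best slope matches its length.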
Along a level set, the surface stays at a constant height. So, if v points along a level set, then the directional derivative of f along v must be zero. You can think about this statement topographically. If you walk along a path that stays at the same elevation, then your elevation will never change, so the slope of the ground along the path must equal zero.
If v points along the level set, then ∂vf(x)=∥∇f(x)∥cos(θ)=0. If ∇f(x) is a nonzero vector, then ∥∇f(x)∥≠0, so cos(θ)=0. As in Section 8.1, the cosine of the angle between two nonzero vectors can only equal zero if they are perpendicular to one another. Therefore:
This rule is incredibly useful when we want to visualize gradients. Given a contour plot, the gradient is easy to draw. Just add a vector field perpendicular to the level sets!
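The perpendicularity rule can be sanity-checked without plotting. This sketch rotates the gradient by 90 degrees to get a candidate direction along the level set, then confirms that the inner product with the gradient is zero and that the surface is flat along that direction to first order. The surface f(x, y) = x²y + 3x and the point (1, −3) are assumptions carried over from the worked example.

```python
def f(x, y):
    # Assumed example surface.
    return x**2 * y + 3 * x


def grad_f(x, y):
    return [2 * x * y + 3, x**2]


point = (1.0, -3.0)
g = grad_f(*point)

# Rotate the gradient by 90 degrees: a direction tangent to the level set.
tangent = (-g[1], g[0])
dot = g[0] * tangent[0] + g[1] * tangent[1]

# Slope of the surface along the tangent direction, by central difference.
h = 1e-6
slope_along_tangent = (
    f(point[0] + h * tangent[0], point[1] + h * tangent[1])
    - f(point[0] - h * tangent[0], point[1] - h * tangent[1])
) / (2 * h)

print(dot, slope_along_tangent)  # both should be zero or very nearly zero
```

A vanishing slope along the rotated direction is what staying at constant height means to first order.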
Run the code cell below to check that the gradient vector field is, as shown above, always perpendicular to the contours of each contour plot.
from utils_lsg import show_gradient_field
show_gradient_field()