Suppose that f(x) is a scalar-valued function of d inputs, x=[x1,x2,...,xd]. For example, f could be a joint density function for a random vector with d entries.
How can we find inputs x that maximize (or minimize) f(x)? Equivalently, given a surface, how can we identify its highest and lowest points?
Optimizing functions of multiple variables is a core task in probability, statistics, computer science, and machine learning. Optimization is also central in a wide variety of design problems spanning business, operations management, engineering, and economics. In this chapter we will focus on examples where our aim is to maximize a density function (find a mode), or minimize a loss function to find models that best fit observed data.
The basic technique we’ll highlight, gradient ascent, is the simplest general-purpose optimization procedure. It uses gradients to walk steadily uphill (or downhill) on a surface. We will use it to show that, as for functions of one variable, if the inputs are unconstrained and f is smooth, then any extrema (maxima or minima) of f occur where its gradient (think, slope) is zero.
To maximize a smooth function f, we’ll adopt a simple iterative strategy.
Start with some initial guess at the optimizer, x(0). To improve our guess, we’ll try to adjust it slightly. Since we are trying to maximize the function, we should adjust it in a direction which increases the function value.
As an analogy, imagine that you are hiking in the mountains and want to get to a peak. If you can’t see the location of the peak (for instance, you are in the woods), then the best strategy is to simply step in the uphill direction.
To find the uphill direction, use the gradient, ∇f(x(0)). Since the gradient vector always points in the direction of steepest ascent, moving our guess in the direction of the gradient should increase the function value. This idea defines a gradient ascent algorithm: starting from an initial guess x(0), repeatedly update x(t+1) = x(t) + γ∇f(x(t)), where γ > 0 is a small step size (the learning rate).
Gradient ascent (or descent) is a first-order optimization algorithm, since it is based on iteratively collecting local information about the surface out to first derivatives. These first derivatives are assembled into a gradient vector, and the gradient vector points the way toward higher surface values.
We won’t get into the details of choosing a learning rate, picking a stopping condition, or analyzing the convergence of this algorithm; instead we will use it as a conceptual tool. To find a maximizer, pick an initial input, then slowly walk uphill, at each step moving in the direction of steepest ascent. You can think of the strategy as a sequence of refinements: at each stage, we use a gradient to refine our guess at an optimizer. Further implementation details and theory are subjects for an optimization class.
Run the code cell below to watch gradient ascent in action. You’ll see that the sequence of iterates forms a path that climbs the surface, following the gradient vector field and moving perpendicular to the level sets of the surface.
from utils_ga import show_gradient_ascent
show_gradient_ascent()
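To make the update rule concrete, here is a minimal sketch of gradient ascent in NumPy. The example surface, starting point, learning rate, and number of steps are all illustrative choices, not the ones used by the widget above.

import numpy as np

def gradient_ascent(grad, x0, learning_rate=0.1, n_steps=200):
    # Repeatedly step in the direction of the gradient (steepest ascent).
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        x = x + learning_rate * grad(x)
    return x

# Illustrative surface: f(x, y) = -(x - 1)**2 - 2*(y + 3)**2, maximized at [1, -3].
grad_f = lambda v: np.array([-2.0 * (v[0] - 1.0), -4.0 * (v[1] + 3.0)])

print(gradient_ascent(grad_f, x0=[0.0, 0.0]))  # approaches [ 1. -3.]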
Machine Learning and Gradient Ascent
In a prototypical machine learning problem we collect some sample data, pose a model for the data that depends on a long list of parameters (typically, a neural network), then try to find values for those parameters that minimize some measure of misfit between the observed data and the model conditional on the parameters. The term learning in machine learning usually means optimization. In practice, the learning process is usually a variant of gradient descent.
We can use the basic idea of gradient ascent to motivate a condition that all maxima and minima of a smooth surface must satisfy. We’ll motivate the condition in two steps:
If x∗ is a maximum (or minimum), then any iterative optimizer should stop moving if x(t)=x∗ at some time t. In other words, if we are walking uphill and reach a summit, then we should stop walking.
If ∇f(x∗)≠0, then an iterative optimizer could keep walking, since there is an uphill direction leaving x∗.
It follows that, if x∗ is a local maximizer (or minimizer) of a smooth function f, then ∇f(x∗)=0. If not, then we could find a direction to move away from x∗ that would increase (or decrease) f.
So, the only possible extremum for this function is at [x∗,y∗]=[5,7]. In this case the extremum is a global minimum, since the function is convex (note that the coefficients on x² and y² are both positive).
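As a quick check of this kind of calculation, here is a short sympy sketch. The quadratic below is an illustrative stand-in, not necessarily the example’s exact function, chosen to be convex with positive coefficients on x² and y² and minimum at [5, 7].

import sympy as sp

x, y = sp.symbols('x y')
# Illustrative convex quadratic: positive coefficients on x**2 and y**2.
f = (x - 5)**2 + 3*(y - 7)**2

# Solve grad f = 0: the only critical point is the global minimum.
grad = [sp.diff(f, v) for v in (x, y)]
print(sp.solve(grad, (x, y)))  # {x: 5, y: 7}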
Example 2

Consider the function

$$g(x,y) = 1 - x^2 + e^{-\frac{1}{2}y^2}.$$
Notice that, in this case, we can break g into the sum of two functions, one involving only x and one involving only y:

$$g(x,y) = g_x(x) + g_y(y),$$

where $g_x(x) = 1 - x^2$ and $g_y(y) = e^{-\frac{1}{2}y^2}$. Since the value of gx(x) does not depend on y, and the value of gy(y) does not depend on x, the maximizer of g(x,y) is the pair [x∗,y∗] where x∗ maximizes gx(x) and y∗ maximizes gy(y). We can see this by applying the gradient condition: the gradient

$$\nabla g(x,y) = \left[-2x,\; -y\, e^{-\frac{1}{2}y^2}\right]$$

has first entry equal to zero only at x∗=0 and second entry equal to zero only at y∗=0, so the only critical point (and the maximizer) is [x∗,y∗]=[0,0].
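Here is a small sympy check of this separability argument; each entry of the gradient involves only one of the variables, and each vanishes only at zero.

import sympy as sp

x, y = sp.symbols('x y', real=True)
g = 1 - x**2 + sp.exp(-y**2 / 2)

gx_entry = sp.diff(g, x)  # -2*x, depends only on x
gy_entry = sp.diff(g, y)  # -y*exp(-y**2/2), depends only on y

print(sp.solve(gx_entry, x), sp.solve(gy_entry, y))  # [0] [0], so [x*, y*] = [0, 0]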
The functions cos(πx) and sin(πx) are never simultaneously zero. So, if cos(πx)=0, then sin(πx)≠0.
Therefore, to set both entries of the gradient to zero we either need sin(πx)=0 and sin(πy)=0, or cos(πx)=0 and cos(πy)=0. Thus, the gradient is zero at x∗=2k,y∗=2k′, and x∗=2k+1,y∗=2k′+1. These points form a lattice of all even integer pairs [x,y] and all odd integer pairs [x,y].
Pairs of even integers set h(x,y)=0, since h(x,y)=0 whenever sin(πx)=0 or sin(πy)=0. These points cannot be maxima or minima, since h takes both positive and negative values arbitrarily close to any point where it equals zero.
It follows that the extrema of h(x,y) must occur on a square lattice of odd integers in x and y.
Run the code cell below, and select “Sine product”, to visualize h.
You should see that h is a periodic checkerboard of peaks and valleys. If you rotate the surface to view it from above, you’ll see that the peaks and valleys are all centered on a lattice where x and y are half integers.
from utils_lsg import show_gradient_field
show_gradient_field()
First, suppose that [X,Y] is a pair of independent normal random variables with expectations E[X]=3, E[Y]=5, and standard deviations SD[X]=2 and SD[Y]=4. Then, since the joint density of a pair of independent random variables equals the product of their marginal densities,

$$f_{X,Y}(x,y) = f_X(x)\, f_Y(y) = \frac{1}{2\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-3}{2}\right)^2} \cdot \frac{1}{4\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{y-5}{4}\right)^2}.$$
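A quick numerical check of this factorization; the test point [x, y] = [1.0, 2.0] below is arbitrary.

import numpy as np
from scipy.stats import norm, multivariate_normal

x, y = 1.0, 2.0  # arbitrary test point

# Product of the two marginal normal densities...
product = norm(loc=3, scale=2).pdf(x) * norm(loc=5, scale=4).pdf(y)

# ...equals the joint density of the independent pair (diagonal covariance).
joint = multivariate_normal(mean=[3, 5], cov=[[4, 0], [0, 16]]).pdf([x, y])

print(np.isclose(product, joint))  # True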
Where is this joint density maximized? We can solve this problem in two ways. We’ll first solve it by reasoning directly on density functions, then confirm our answer by setting the gradient to zero.
By probability logic:
First, notice that the density is a product of two nonnegative functions, each of which depends on a single variable only. So, the joint density is maximized by maximizing each of the marginal densities separately.
Here’s the matching probability argument: since X and Y are independent, the most likely pair [x∗,y∗] consists of the most likely value of X and the most likely value of Y. So, to maximize the joint density, we should separately maximize each of the marginal densities.
Each of the marginal densities is a translation and a dilation of the standard normal density. The standard normal density is bell-shaped and even about 0, so it is maximized at 0. Therefore, the marginal densities are maximized at x∗=3 and y∗=5, respectively.
Therefore, the mode of the joint density is equal to the expectation of the vector [X,Y]. This is a special property of normal distributions. They are maximized at their expected value.
Via gradients:
As always, the location of a maximizer (or minimizer) is unchanged if we pass our function through an increasing (monotonic) transformation (see Section 3.3). Therefore, since multiplying by a positive constant doesn’t move the maximizer, the joint density is maximized where

$$\exp\!\left(-\frac{1}{2}\left(\left(\frac{x-3}{2}\right)^2 + \left(\frac{y-5}{4}\right)^2\right)\right)$$

is maximized. Logs are monotonic, so we can replace our objective function with its logarithm without moving the maximizer. The log is the inverse of the exponential. Therefore, the maximizer of the joint density is the maximizer of the quadratic function

$$-\frac{1}{2}\left(\left(\frac{x-3}{2}\right)^2 + \left(\frac{y-5}{4}\right)^2\right).$$

Setting the gradient of this quadratic to zero gives $-\frac{x-3}{4} = 0$ and $-\frac{y-5}{16} = 0$, so the only critical point is [x∗,y∗]=[3,5], confirming the answer above.
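To tie the two arguments together, here is a short numerical sketch that runs gradient ascent on this quadratic; the starting point, step size, and iteration count are arbitrary illustrative choices.

import numpy as np

# Gradient of the quadratic -1/2*(((x - 3)/2)**2 + ((y - 5)/4)**2).
grad_log_density = lambda v: np.array([-(v[0] - 3.0) / 4.0, -(v[1] - 5.0) / 16.0])

v = np.zeros(2)          # arbitrary starting guess
for _ in range(2000):    # many small uphill steps
    v = v + 0.5 * grad_log_density(v)

print(np.round(v, 3))    # [3. 5.], the mode equals the expectation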
Let’s try an example where we don’t know the answer from the start.
Dirichlet distributions are popular models for user preferences among a set of d+1 items, for unknown chances on d+1 outcomes, or for the gaps between sorted random samples. They naturally generalize the familiar Beta family of densities, where X∈[0,1] has density $f_X(x) \propto x^{\alpha-1}(1-x)^{\beta-1}$, to higher dimensions.
Here’s a two-dimensional example: X≥0, Y≥0, and X+Y≤1, with joint density

$$g(x,y) \propto x^{5}\, y^{3}\, (1-x-y)^{8}.$$
The function g(x,y)>0 for all x,y such that x>0, y>0, and x+y<1, so it is nonzero on the interior of the support. On the boundary of the support the function g(x,y)=0. So, we can restrict our attention to the interior of the support. Inside the triangle where x>0, y>0, and x+y<1, the gradient is zero only if

$$5(1-x-y) = 8x \quad \text{and} \quad 3(1-x-y) = 8y,$$

which has the unique solution [x∗,y∗]=[5/16,3/16].
These numbers are related simply to the parameters of the original distribution. In general, if X is drawn from a Dirichlet distribution with parameters α=[α1,α2,...,αd+1] (with every αi>1), then its joint density has its mode at

$$x^*_j = \frac{\alpha_j - 1}{\sum_{i=1}^{d+1} (\alpha_i - 1)}.$$

In our case, α1=6, α2=4, and α3=9, so the mode is at [(6−1)/(19−3), (4−1)/(19−3)] = [5/16, 3/16].
We could have just as well found the mode by maximizing the logarithm of g.
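A quick numerical check of the mode formula; the alternative point used in the comparison is an arbitrary interior point of the triangle.

import numpy as np
from scipy.stats import dirichlet

alpha = np.array([6, 4, 9])

# Closed-form mode: (alpha_j - 1) / sum_i (alpha_i - 1).
mode = (alpha - 1) / (alpha - 1).sum()
print(mode[:2])  # [0.3125 0.1875], i.e. [5/16, 3/16]

# The density at the mode beats the density at an arbitrary interior point.
other = np.array([0.4, 0.3, 0.3])
print(dirichlet.pdf(mode, alpha) > dirichlet.pdf(other, alpha))  # True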
You are provided with a list of data points $\{[x_j, y_j]\}_{j=1}^{n}$ relating an independent variable x to a dependent variable y. These points form a scatter cloud in the x,y plane.
You suggest a function that relates x and y, ŷ(x;θ), where θ is a vector of free parameters. For instance, we could consider a linear model:

$$\hat{y}(x;\theta) = \theta_0 + \theta_1 x.$$
In machine learning we usually use a neural network for ŷ.
You aim to find the best fit function among all ŷ(x;θ) by minimizing some loss function that measures the discrepancy between your proposed model and the observed data:

$$L(\theta) = \frac{1}{n} \sum_{j=1}^{n} \operatorname{loss}\big(\hat{y}(x_j;\theta),\, y_j\big).$$
The most popular example is linear least squares regression. In linear least squares regression the model ŷ(xj;θ) is a linear function of the parameters θ. Often it is also a linear function of the independent variable x, though it need not be. Examples include all polynomial models:

$$\hat{y}(x;\theta) = \theta_0 + \theta_1 x + \theta_2 x^2 + \dots + \theta_k x^k.$$

The simplest of these,

$$\hat{y}(x;\theta) = \theta_0 + \theta_1 x,$$

is a line with intercept θ0 and slope θ1. Now, our regression problem is to find the line that best fits the data by minimizing the mean square error between the line and the data points:

$$\text{MSE}(\theta_0, \theta_1) = \frac{1}{n} \sum_{j=1}^{n} \big(\theta_0 + \theta_1 x_j - y_j\big)^2.$$
This is an unconstrained minimization problem in the two free parameters, θ0 and θ1, with objective function equal to the mean squared error of the linear model.
To help visualize this problem, run the code cell below.
The panel on the left shows a series of [xj,yj] pairs as a scatter plot. You are free to choose a slope and an intercept, and for each choice you can compute the associated mean square error. Search for a slope and intercept that make the mean square error small. Then click “Reveal RMSE” to show the error as a function of the free parameters. This reveals a surface in the right-hand panel: the surface is a function of the parameters and depends on the data. Solving the regression problem amounts to minimizing this surface.
from utils_ls import show_least_squares
show_least_squares()
To find the best fit parameters, compute the gradient of the objective with respect to the parameters and set it equal to zero: ∇θMSE(θ∗)=0. This suffices since the objective is a convex, quadratic function of the parameters, and we’ve placed no constraints on the parameters.
Here we’ve added the subscript θ to the gradient symbol to remind us that we are taking the gradient with respect to the parameters. Always pay careful attention to which variables you are optimizing over, and which are held fixed. In this problem the data is fixed, and we are optimizing with respect to the parameters of the model.
To compute the partials with respect to θ0 and θ1 we’ll apply the chain rule:

$$\frac{\partial\, \text{MSE}}{\partial \theta_0} = \frac{2}{n} \sum_{j=1}^{n} \big(\theta_0 + \theta_1 x_j - y_j\big), \qquad \frac{\partial\, \text{MSE}}{\partial \theta_1} = \frac{2}{n} \sum_{j=1}^{n} \big(\theta_0 + \theta_1 x_j - y_j\big)\, x_j.$$

Setting the θ0 partial to zero gives $\theta_0^* + \theta_1^* \bar{x} = \bar{y}$, or equivalently $\theta_0^* = \bar{y} - \theta_1^* \bar{x}$.
This is a nicer form. It ensures that the best fit line passes through the point [x̄,ȳ], where x̄ and ȳ are the average x and y coordinates in the data set.
Now, the residual ŷ(xj;θ∗)−yj can also be expressed more cleanly. Substituting θ0∗ = ȳ − θ1∗x̄:

$$\hat{y}(x_j;\theta^*) - y_j = \theta_1^*\,(x_j - \bar{x}) - (y_j - \bar{y}).$$
Let Δx = x − x̄ and Δy = y − ȳ. Then Δx and Δy are centered variables that represent horizontal and vertical distances away from the mean. They are the values of x and y we would have had if we had started by centering our data (subtracting off the mean x and y coordinates). Many data processing pipelines start by centering the data, and here we see a good reason to do so when finding a best fit line: the best fit line automatically picks an intercept that effectively centers the problem. In terms of the centered variables, setting the θ1 partial to zero and using the residual expression above gives

$$\theta_1^* = \frac{\sum_{j=1}^{n} \Delta x_j\, \Delta y_j}{\sum_{j=1}^{n} \Delta x_j^2}, \qquad \theta_0^* = \bar{y} - \theta_1^*\, \bar{x}.$$
We’ll interpret this result in Section 11 in terms of the variance in the sampled x coordinates, variance in the sampled y coordinates, and correlation between the sampled x and y coordinates. Anytime you’ve generated a best fit line in a past class, these are the equations your computer used to find the best fit intercept and slope.
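Here is a short sketch of these formulas on synthetic data, with np.polyfit as an independent check. The data-generating slope, intercept, and noise level are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 * x + 1.0 + rng.normal(scale=1.5, size=50)  # noisy line: slope 2, intercept 1

# Centered variables.
dx, dy = x - x.mean(), y - y.mean()

# Best fit slope and intercept from the formulas above.
slope = (dx * dy).sum() / (dx**2).sum()
intercept = y.mean() - slope * x.mean()

print(slope, intercept)
print(np.polyfit(x, y, 1))  # [slope, intercept], should agree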
The same essential logic applies for models that are linear in θ, no matter how they depend on x. For instance, if we’d used the quadratic model

$$\hat{y}(x;\theta) = \theta_0 + \theta_1 x^2,$$

we would have arrived at essentially the same conclusion, just substituting x² for x as our independent variable. You’ll practice computing the gradient of different loss functions and for different regression models on your homework.
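As a concrete illustration of that substitution, here is a short sketch on synthetic data; the data-generating coefficients and noise level are arbitrary.

import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=50)
y = 0.5 * x**2 - 2.0 + rng.normal(scale=0.3, size=50)  # data from a noisy quadratic

z = x**2                                  # use x**2 as the independent variable
dz, dy = z - z.mean(), y - y.mean()
theta1 = (dz * dy).sum() / (dz**2).sum()  # same slope formula, new feature
theta0 = y.mean() - theta1 * z.mean()

print(theta0, theta1)  # close to -2.0 and 0.5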