
9.3 Unconstrained Optimization

Suppose that $f(x)$ is a scalar-valued function of $d$ inputs, $x = [x_1, x_2, \ldots, x_d]$. For example, $f$ could be a joint density function for a random vector with $d$ entries.

How can we find inputs $x$ that maximize (or minimize) $f(x)$? Equivalently, given a surface, how can we identify its highest and lowest points?

Optimizing functions of multiple variables is a core task in probability, statistics, computer science, and machine learning. Optimization is also central in a wide variety of design problems spanning business, operations management, engineering, and economics. In this chapter we will focus on examples where our aim is to maximize a density function (find a mode), or minimize a loss function to find models that best fit observed data.

The basic technique we’ll highlight is the first and simplest optimization procedure. It uses gradients to walk steadily uphill (or downhill) on a surface. We will use it to show that, just as for functions of one variable, if the inputs are unconstrained and $f$ is smooth, then any extrema (maxima or minima) of $f$ occur where its gradient (think: slope) is zero.

Gradients and Optimization

Gradient Ascent

To maximize a smooth function $f$, we’ll adopt a simple iterative strategy.

Start with some initial guess at the optimizer, $x(0)$. To improve our guess, we’ll adjust it slightly. Since we are trying to maximize the function, we should adjust it in a direction that increases the function value.

As an analogy, imagine that you are hiking in the mountains and want to get to a peak. If you can’t see the location of the peak (for instance, you are in the woods), then the best strategy is to simply step in the uphill direction.

To find the uphill direction, use the gradient, $\nabla f(x(0))$. Since the gradient vector always points in the direction of steepest ascent, moving our guess in the direction of the gradient should increase the function value. This idea defines a gradient ascent algorithm.
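The update rule can be sketched in a few lines of Python. This is a minimal illustration, not a tuned implementation; the learning rate `eta` and step count are arbitrary choices:

```python
import numpy as np

def gradient_ascent(grad, x0, eta=0.1, steps=100):
    """Iteratively step in the direction of the gradient (uphill)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x + eta * grad(x)  # move in the direction of steepest ascent
    return x

# Example: maximize f(x, y) = -(x - 1)^2 - (y + 2)^2, whose gradient is
# [-2(x - 1), -2(y + 2)] and whose unique maximum sits at (1, -2).
grad_f = lambda v: np.array([-2 * (v[0] - 1), -2 * (v[1] + 2)])
print(gradient_ascent(grad_f, [0.0, 0.0]))  # → approximately [1., -2.]
```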

Gradient ascent (or descent) is a first-order optimization algorithm, since it iteratively collects local information about the surface out to first derivatives. These first derivatives form a gradient vector, and the gradient vector points the way toward higher surface values.

We won’t get into the details of choosing a learning rate, a stopping condition, or the convergence behavior of this algorithm, but will instead use it as a conceptual tool. To find a maximizer, pick an initial input, then slowly walk uphill, choosing at each step the direction of steepest ascent. You can imagine the strategy as a sequence of refinements, where at each stage we use a gradient to refine our guess at an optimizer. Further implementation details and theory are subjects for an optimization class.

Run the code cell below to watch gradient ascent in action. You’ll see that the sequence of iterates forms a path that climbs the surface, following the gradient vector field and moving perpendicular to the level sets of the surface.

from utils_ga import show_gradient_ascent

show_gradient_ascent()

We can use the basic idea of gradient ascent to motivate a condition that all maxima and minima of a smooth surface must satisfy. We’ll motivate the condition in two steps:

  1. If $x_*$ is a maximum (or minimum), then any iterative optimizer should stop moving if $x(t) = x_*$ at some time $t$. In other words, if we are walking uphill and reach a summit, then we should stop walking.

  2. If $\nabla f(x_*) \neq 0$, then an iterative optimizer could keep walking, since there is an uphill direction leaving $x_*$.

It follows that, if $x_*$ is a local maximizer (or minimizer) of a smooth function $f$, then $\nabla f(x_*) = 0$. If not, we could move away from $x_*$ in a direction that increases (or decreases) $f$.

Finding Maxima by Setting Gradients to Zero

We can use the condition $\nabla f(x_*) = 0$, just as we used $\frac{d}{dx} f(x_*) = 0$ in one dimension, to narrow down our search for extrema.

Here are three examples:

  1. $f(x,y) = 8 x^2 - 12 x y + 6 y^2 + 4 x - 24 y + 4$.

  2. $g(x,y) = 1 - x^2 + e^{-\frac{1}{2} y^2}$.

  3. $h(x,y) = \sin(\pi x) \sin(\pi y)$.
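For the first example, the gradient is linear in $x$ and $y$, so setting it to zero reduces to a small linear solve. A quick check in Python, with the partial derivatives computed by hand:

```python
import numpy as np

# f(x, y) = 8x^2 - 12xy + 6y^2 + 4x - 24y + 4 has gradient
# [16x - 12y + 4, -12x + 12y - 24], so grad f = 0 is the linear
# system A @ [x, y] = b with:
A = np.array([[16.0, -12.0], [-12.0, 12.0]])
b = np.array([-4.0, 24.0])
crit = np.linalg.solve(A, b)
print(crit)  # → [5. 7.], the unique critical point of f
```

Since the Hessian $\left[\begin{smallmatrix}16 & -12 \\ -12 & 12\end{smallmatrix}\right]$ is positive definite, this critical point is a minimum.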

Run the code cell below, and select “Sine product”, to visualize hh.

You should see that $h$ is a periodic checkerboard of peaks and valleys. If you rotate the surface to view it from above, you’ll see that the peaks and valleys are all centered on a lattice where $x$ and $y$ are half integers.

from utils_lsg import show_gradient_field

show_gradient_field()

Finding Modes of Joint Densities

Let’s use this rule to find the modes of some joint distributions.

Normal Distributions

First, suppose that $X$ and $Y$ are independent normal random variables with expectations $\mathbb{E}[X] = 3$, $\mathbb{E}[Y] = 5$, and standard deviations $\text{SD}[X] = 2$ and $\text{SD}[Y] = 4$. Then, since the joint density of a pair of independent random variables equals the product of their marginal densities:

$$\begin{aligned} f_{X,Y}(x,y) & = \left( \frac{1}{\sqrt{2 \pi}} \frac{1}{2} e^{-\frac{1}{2} \left(\frac{x - 3}{2} \right)^2} \right) \times \left( \frac{1}{\sqrt{2 \pi}} \frac{1}{4} e^{-\frac{1}{2} \left(\frac{y - 5}{4} \right)^2} \right) \propto \exp\left(-\frac{1}{2}\left(\left(\frac{x - 3}{2} \right)^2 + \left(\frac{y - 5}{4} \right)^2 \right) \right)\end{aligned}$$

Where is this density maximized?

We can solve this problem in two ways. First we’ll reason directly on the density functions, then we’ll confirm our answer by setting the gradient to zero.

  1. By probability logic:

    First, notice that the density is a product of two nonnegative factors, each depending on only a single variable. So, the joint density is maximized by maximizing each of the marginal densities separately.

    Here’s the matching probability argument: since $X$ and $Y$ are independent, the most likely pair $[x_*, y_*]$ consists of the most likely value of $X$ and the most likely value of $Y$. So, to maximize the joint density, we should separately maximize each of the marginal densities.

    Each of the marginal densities is a translation and a dilation of the standard normal density. The standard normal density is bell shaped and even about 0, so it is maximized at 0. Therefore, the marginal densities are separately maximized at $x_* = 3$ and $y_* = 5$.

    $$[x_*,y_*] = [3,5] = [\mathbb{E}[X], \mathbb{E}[Y]].$$

    Therefore, the mode of the joint density equals the expectation of the vector $[X,Y]$. This is a special property of normal distributions: they are maximized at their expected value.

  2. Via gradients:

    As always, the location of a maximizer (or minimizer) is unchanged if we pass our function through a monotonically increasing transformation (see Section 3.3). Therefore, the joint density is maximized where $\exp\left(-\frac{1}{2}\left(\left(\frac{x - 3}{2} \right)^2 + \left(\frac{y - 5}{4} \right)^2 \right) \right)$ is maximized. Logs are monotonic, so we can replace our objective function with its logarithm without moving the maximizer. The log is the inverse of the exponential. Therefore, the maximizer of the joint density is the maximizer of the quadratic function:

    $$- \frac{1}{2} \left(\left(\frac{x - 3}{2} \right)^2 + \left(\frac{y - 5}{4} \right)^2 \right)$$

    This quadratic function is a sum of a function of $x$ and a function of $y$, so it is maximized by separately maximizing each term.

    Applying the gradient:

    $$\nabla \log(f_{X,Y}(x,y)) = - \nabla \frac{1}{2} \left(\left(\frac{x - 3}{2} \right)^2 + \left(\frac{y - 5}{4} \right)^2 \right) = \left[ \begin{array}{c} - \frac{1}{2}\frac{x - 3}{2} \\ - \frac{1}{4} \frac{y - 5}{4} \end{array} \right].$$

    Setting both entries to zero gives $x_* = 3$ and $y_* = 5$, as expected.
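As a sanity check, we can let gradient ascent climb the log density numerically. A rough sketch; the starting point, step size, and iteration count are arbitrary:

```python
import numpy as np

# Gradient of the log joint density, as computed above.
grad_log_f = lambda v: np.array([-(v[0] - 3) / 4, -(v[1] - 5) / 16])

v = np.array([0.0, 0.0])  # arbitrary initial guess
for _ in range(2000):
    v = v + 0.5 * grad_log_f(v)  # step uphill on the log density
print(v)  # → approximately [3., 5.], the mode
```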

Dirichlet Distributions

Let’s try an example where we don’t know the answer from the start.

Dirichlet distributions are popular models for user preferences among a set of $d+1$ items, for unknown chances on $d+1$ outcomes, or for the gaps between sorted random samples. They naturally generalize the familiar family of densities where $X \in [0,1]$ and $\text{PDF}(x) = f_X(x) \propto x^{\alpha - 1} (1 - x)^{\beta - 1}$ to higher dimensions.

Here’s a two-dimensional example: $X \geq 0$, $Y \geq 0$, and $X + Y \leq 1$, with:

$$f_{X,Y}(x,y) \propto g(x,y) = x^5 y^3 (1 - (x + y))^8.$$

Let’s find its mode.

The density is maximized where $g$ is maximized, so take the gradient of $g$:

$$\nabla g(x,y) = \left[ \begin{array}{c} 5 x^4 y^3 (1 - (x + y))^8 - 8 x^5 y^3 (1 - (x + y))^7 \\ 3 x^5 y^2 (1 - (x + y))^8 - 8 x^5 y^3 (1 - (x + y))^7 \end{array} \right].$$

We can write the gradient more conveniently:

$$\nabla g(x,y) = \left[ \begin{array}{c} \frac{5}{x} g(x,y) - \frac{8}{1 - (x + y)} g(x,y) \\ \frac{3}{y} g(x,y) - \frac{8}{1 - (x + y)} g(x,y) \end{array} \right] = \left[ \begin{array}{c} \frac{5}{x} - \frac{8}{1 - (x + y)} \\ \frac{3}{y} - \frac{8}{1 - (x + y)} \end{array} \right] g(x,y).$$

The function $g(x,y)$ is nonzero on the interior of the support, where $x > 0$, $y > 0$, and $x + y < 1$, and equals zero on the boundary of the support. So, we can restrict our attention to the interior. Inside the triangle where $x > 0$, $y > 0$, and $x + y < 1$, the gradient is only zero if:

$$\left[ \begin{array}{c} \frac{5}{x_*} - \frac{8}{1 - (x_* + y_*)} \\ \frac{3}{y_*} - \frac{8}{1 - (x_* + y_*)} \end{array} \right] = 0$$

Rearranging gives the equations:

$$\frac{5}{x_*} = \frac{8}{1 - (x_* + y_*)} = \frac{3}{y_*}$$

or:

$$\frac{x_*}{5} = \frac{1 - (x_* + y_*)}{8} = \frac{y_*}{3}$$

It follows that $y_* = \frac{3}{5} x_*$. Then:

$$\frac{1 - (x_* + y_*)}{8} = \frac{1 - x_* - \frac{3}{5} x_*}{8} = \frac{1 - \frac{8}{5} x_*}{8} = \frac{1}{8} - \frac{1}{5} x_*.$$

Then:

$$\frac{1}{8} - \frac{1}{5} x_* = \frac{1}{5} x_*$$

So:

$$\frac{2}{5} x_* = \frac{1}{8}$$

Or $x_* = \frac{5}{16}$. It follows that $y_* = \frac{3}{16}$, so:

$$[x_*,y_*] = \frac{1}{16}\left[5, 3 \right].$$

These numbers are related simply to the parameters of the original distribution. In general, if $X$ is drawn from a Dirichlet distribution with parameters $\alpha = [\alpha_1, \alpha_2, \ldots, \alpha_{d+1}]$, then its joint density has mode at ${x_*}_j = \frac{\alpha_j - 1}{\sum_{i=1}^{d+1} (\alpha_i - 1)}$. In our case, $\alpha_1 = 6$, $\alpha_2 = 4$, and $\alpha_3 = 9$, so the mode is at $[(6 - 1)/(19 - 3), (4 - 1)/(19 - 3)] = [5/16, 3/16]$.

We could just as well have found the mode by maximizing the logarithm of $g$.
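Here is what that looks like numerically: gradient ascent on $\log g(x,y) = 5\log x + 3\log y + 8\log(1 - x - y)$, using the gradient entries derived above. A sketch; the starting point and step size are arbitrary small values chosen to stay inside the triangle:

```python
import numpy as np

def grad_log_g(v):
    """Gradient of log g: [5/x - 8/(1-x-y), 3/y - 8/(1-x-y)]."""
    x, y = v
    s = 8.0 / (1.0 - x - y)
    return np.array([5.0 / x - s, 3.0 / y - s])

v = np.array([0.2, 0.2])  # a point inside the triangle x, y > 0, x + y < 1
for _ in range(5000):
    v = v + 0.001 * grad_log_g(v)  # small uphill steps on log g
print(v)  # → approximately [0.3125, 0.1875], i.e. [5/16, 3/16]
```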

Regression

Here’s an example regression problem:

  1. You are provided with a list of data points $\{[x_j,y_j]\}_{j=1}^n$ relating an independent variable $x$ to a dependent variable $y$. These points form a scatter cloud in the $x, y$ plane.

  2. You suggest a function $\hat{y}(x;\theta)$ that relates $x$ and $y$, where $\theta$ is a vector of free parameters. For instance, we could consider a linear model:

    $$\hat{y}(x;\theta) = \theta_1 + \theta_2 x$$

    or a quadratic model:

    $$\hat{y}(x;\theta) = \theta_1 + \theta_2 x + \theta_3 x^2$$

    or even an exponential model:

    $$\hat{y}(x;\theta) = \theta_1 e^{\theta_2 x}.$$

    In machine learning we usually use a neural network for $\hat{y}$.

  3. You aim to find the best fit function among all $\hat{y}(x;\theta)$ by minimizing some loss function that measures the discrepancy between your proposed model and the observed data:

$$\theta_* = \text{argmin}_{\theta}\{\mathcal{L}(\hat{y}(\cdot;\theta),\{[x_j,y_j]\}_{j=1}^n)\}$$

Least Squares Regression

In least squares regression we set the loss function to the average square error in the model over the data:

$$\mathcal{L}(\hat{y}, \{[x_j,y_j]\}_{j=1}^n) = \text{MSE}(\hat{y},\{[x_j,y_j]\}_{j=1}^n) = \frac{1}{n} \sum_{j=1}^n \left(\hat{y}(x_j) - y_j \right)^2$$

where $\text{MSE}$ stands for mean square error. Then, our problem is to find the parameters $\theta$ that minimize the mean square error:

$$\theta_* = \text{argmin}_{\theta}\left\{\frac{1}{n} \sum_{j=1}^n \left(\hat{y}(x_j;\theta) - y_j \right)^2 \right\}$$

The most popular example is linear least squares regression, in which the model $\hat{y}(x_j;\theta)$ is a linear function of the parameters $\theta$. Often it is also a linear function of the independent variable $x$, though it need not be. Examples include all polynomial models:

$$\hat{y}(x;\theta) = \theta_0 + \theta_1 x + \theta_2 x^2 + \cdots + \theta_n x^n.$$

Let’s start with the simple case where $\hat{y}$ is linear in both the parameters and $x$. Then:

$$\hat{y}(x;\theta) = \theta_0 + \theta_1 x$$

is a line with intercept $\theta_0$ and slope $\theta_1$. Now, our regression problem is to find the line that best fits the data by minimizing the mean square error between the line and the data points:

$$\theta_* = \text{argmin}_{\theta_0,\theta_1}\left\{\frac{1}{n} \sum_{j=1}^n \left((\theta_0 + \theta_1 x_j) - y_j \right)^2\right\}.$$

This is an unconstrained minimization problem in the two free parameters, $\theta_0$ and $\theta_1$, with objective function equal to the mean square error of the linear model.

To help visualize this problem, run the code cell below.

The panel on the left shows a series of $x_j, y_j$ pairs as a scatter plot. You are free to choose a slope and an intercept, and for each choice you can compute the associated mean square error. Search for a slope and intercept that make the mean square error small. Then click “Reveal RMSE” to show the error as a function of the free parameters. This reveals a surface in the right-hand panel; the surface is a function of the parameters and depends on the data. Solving the regression problem amounts to minimizing this surface.

from utils_ls import show_least_squares

show_least_squares()

To find the best fit parameters, compute the gradient of the objective, and set it equal to zero. This suffices since the objective is a convex, quadratic function of the parameters, and we’ve placed no constraints on the parameters.

$$\nabla_{\theta} \text{MSE}(\hat{y}(\cdot;\theta), \{[x_j,y_j]\}_{j=1}^n) = \nabla_{\theta} \frac{1}{n} \sum_{j=1}^n \left((\theta_0 + \theta_1 x_j) - y_j \right)^2$$

Here we’ve added the subscript $\theta$ to the gradient symbol to remind us that we are taking the gradient with respect to the parameters. Always pay careful attention to which variables you are optimizing over and which are held fixed. In this problem the data is fixed, and we are optimizing with respect to the parameters of the model.

To compute the partial derivatives with respect to $\theta_0$ and $\theta_1$, we’ll apply the chain rule:

$$\begin{aligned} \partial_{\theta_i} \frac{1}{n} \sum_{j=1}^n \left((\theta_0 + \theta_1 x_j) - y_j \right)^2 & = \partial_{\theta_i} \frac{1}{n} \sum_{j=1}^n \left(\hat{y}(x_j;\theta) - y_j \right)^2 \\ & = \frac{1}{n} \sum_{j=1}^n \partial_{\theta_i} \left(\hat{y}(x_j;\theta) - y_j \right)^2 \\ & = \frac{2}{n} \sum_{j=1}^n (\hat{y}(x_j;\theta) - y_j) \times \partial_{\theta_i} \hat{y}(x_j;\theta). \end{aligned}$$

Then, since $\hat{y}(x;\theta) = \theta_0 + \theta_1 x$, we have $\partial_{\theta_0} \hat{y}(x;\theta) = 1$ and $\partial_{\theta_1} \hat{y}(x;\theta) = x$. So:

$$\nabla_{\theta} \text{MSE}(\hat{y}(\cdot;\theta), \{[x_j,y_j]\}_{j=1}^n) = \frac{2}{n} \left[ \begin{array}{c} \sum_{j=1}^n (\hat{y}(x_j;\theta) - y_j) \\ \sum_{j=1}^n (\hat{y}(x_j;\theta) - y_j) \times x_j \end{array} \right]$$
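Before using this formula, it’s worth checking it against finite differences on synthetic data. A quick sketch; the data and the test point $\theta$ are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=20), rng.normal(size=20)  # synthetic data
theta = np.array([0.7, -1.3])  # an arbitrary parameter vector

mse = lambda t: np.mean((t[0] + t[1] * x - y) ** 2)

# Gradient from the formula above.
resid = theta[0] + theta[1] * x - y
grad = (2 / len(x)) * np.array([resid.sum(), (resid * x).sum()])

# Centered finite differences, one coordinate at a time.
h = 1e-6
fd = np.array([(mse(theta + h * e) - mse(theta - h * e)) / (2 * h)
               for e in np.eye(2)])
print(np.allclose(grad, fd, atol=1e-5))  # → True
```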

Setting the first entry to zero requires:

$$\frac{1}{n}\sum_{j=1}^n (\hat{y}(x_j;\theta_*) - y_j) = 0.$$

Rearranging, we need:

$$\frac{1}{n} \sum_{j=1}^n \hat{y}(x_j;\theta_*) = \frac{1}{n} \sum_{j=1}^n y_j = \bar{y}.$$

That is, the average value of the model must equal the average value of the data, $\bar{y}$. Substituting in for $\hat{y}$ gives:

$$\begin{aligned} \frac{1}{n} \sum_{j=1}^n {\theta_*}_0 + {\theta_*}_1 x_j & = {\theta_*}_0 + {\theta_*}_1 \frac{1}{n} \sum_{j=1}^n x_j \\ & = {\theta_*}_0 + {\theta_*}_1 \bar{x} = \bar{y} \end{aligned}$$

So:

$${\theta_*}_0 = \bar{y} - {\theta_*}_1 \bar{x}.$$

Plugging back in, we find:

$$\hat{y}(x;\theta_*) = \bar{y} + {\theta_*}_1 (x - \bar{x}).$$

This is a nicer form. It ensures that the best fit line passes through the point $[\bar{x},\bar{y}]$, where $\bar{x}$ and $\bar{y}$ are the average $x$ and $y$ coordinates in the data set.

Now, $\hat{y}(x_j;\theta_*) - y_j$ can also be expressed more cleanly:

$$\hat{y}(x_j;\theta_*) - y_j = {\theta_*}_1 (x_j - \bar{x}) - (y_j - \bar{y}).$$

Let $\Delta x = x - \bar{x}$ and $\Delta y = y - \bar{y}$. Then $\Delta x$ and $\Delta y$ are centered variables that represent horizontal and vertical distances from the mean. They correspond to the values of $x$ and $y$ had we started by centering our data (subtracting off the mean $x$ and $y$ coordinates). Many data processing pipelines start by centering the data, and here we see a good reason to center your data when finding a best fit line: the best fit line automatically picks an intercept that effectively centers the problem. In terms of the centered variables:

$$\hat{y}(x_j;\theta_*) - y_j = {\theta_*}_1 \Delta x_j - \Delta y_j.$$

So, the second entry of the gradient can be written:

$$\frac{2}{n} \sum_{j=1}^n ({\theta_*}_1 \Delta x_j - \Delta y_j) \times x_j = \frac{2}{n} \sum_{j=1}^n ({\theta_*}_1 \Delta x_j - \Delta y_j) \times (\Delta x_j + \bar{x}).$$

Expanding the product, the term involving $\bar{x}$ is:

$$\frac{2}{n} \left[\sum_{j=1}^n ({\theta_*}_1 \Delta x_j - \Delta y_j) \right] \bar{x}.$$

This term is zero since we chose $\theta_*$ so that the bracketed sum equals zero. To check our work, note that:

$$\sum_{j=1}^n ({\theta_*}_1 \Delta x_j - \Delta y_j) = {\theta_*}_1 \sum_{j=1}^n \Delta x_j - \sum_{j=1}^n \Delta y_j = {\theta_*}_1 \times 0 - 0 = 0.$$

Both sums are zero since the variables $\Delta x$ and $\Delta y$ are centered.

So, the second entry in the gradient is:

$$\frac{2}{n} \sum_{j=1}^n ({\theta_*}_1 \Delta x_j - \Delta y_j) \times \Delta x_j.$$

Setting this entry to zero requires:

$${\theta_*}_1 \frac{1}{n} \sum_{j=1}^n \Delta x_j^2 - \frac{1}{n} \sum_{j=1}^n \Delta y_j \Delta x_j = 0$$

or:

$${\theta_*}_1 = \frac{\frac{1}{n} \sum_{j=1}^n \Delta y_j \Delta x_j}{\frac{1}{n} \sum_{j=1}^n \Delta x_j^2}.$$

We’ll interpret this result in Section 11 in terms of the variance of the sampled $x$ coordinates, the variance of the sampled $y$ coordinates, and the correlation between them. Any time you’ve generated a best fit line in a past class, these are the equations your computer used to find the best fit intercept and slope.
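The derivation above can be checked end to end against a library fit. A sketch on synthetic data; `np.polyfit` serves only as an independent reference:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
y = 2.0 + 0.5 * x + rng.normal(scale=0.3, size=x.size)  # noisy line

# Slope and intercept from the formulas derived above.
dx, dy = x - x.mean(), y - y.mean()
slope = np.mean(dy * dx) / np.mean(dx**2)
intercept = y.mean() - slope * x.mean()

# Independent check: numpy's degree-1 least squares fit.
ref_slope, ref_intercept = np.polyfit(x, y, 1)
print(np.allclose([slope, intercept], [ref_slope, ref_intercept]))  # → True
```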

The same essential logic applies for models that are linear in $\theta$, no matter how they depend on $x$. For instance, if we’d used the quadratic model:

$$\hat{y}(x;\theta) = \theta_0 + \theta_2 x^2$$

we would have arrived at essentially the same conclusion, just substituting $x^2$ for $x$ as our independent variable. You’ll practice computing the gradient of different loss functions for different regression models on your homework.