
9.3 Unconstrained Optimization

Suppose that $f(x)$ is a scalar-valued function of $d$ inputs, $x = [x_1, x_2, \ldots, x_d]$. For example, $f$ could be a joint density function for a random vector with $d$ entries.

How can we find inputs $x$ that maximize (or minimize) $f(x)$? Equivalently, given a surface, how can we identify its highest and lowest points?

Optimizing functions of multiple variables is a core task in probability, statistics, computer science, and machine learning. Optimization is also central in a wide variety of design problems spanning business, operations management, engineering, and economics. In this chapter we will focus on examples where our aim is to maximize a density function (find a mode), or minimize a loss function to find models that best fit observed data.

The basic technique we’ll highlight is the first and simplest optimization procedure. It uses gradients to walk steadily uphill (or downhill) on a surface. We will use it to show that, just as for functions of one variable, if the inputs are unconstrained and $f$ is smooth, then any extrema (maxima or minima) of $f$ occur where its gradient (think: slope) is zero.

Gradients and Optimization

Gradient Ascent

To maximize a smooth function $f$, we’ll adopt a simple iterative strategy.

Start with some initial guess at the optimizer, $x(0)$. To improve our guess, we’ll adjust it slightly. Since we are trying to maximize the function, we should adjust it in a direction that increases the function value.

As an analogy, imagine that you are hiking in the mountains and want to get to a peak. If you can’t see the location of the peak (for instance, you are in the woods), then the best strategy is to simply step in the uphill direction.

To find the uphill direction, use the gradient, $\nabla f(x(0))$. Since the gradient vector always points in the direction of steepest ascent, moving our guess in the direction of the gradient should increase the function value. This idea defines a gradient ascent algorithm.
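The update rule can be sketched in a few lines of Python. This is a minimal illustration, not a tuned implementation; the learning rate `eta` and step count are arbitrary choices:

```python
import numpy as np

def gradient_ascent(grad, x0, eta=0.1, steps=100):
    """Iteratively step in the direction of the gradient (uphill)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x + eta * grad(x)  # move in the direction of steepest ascent
    return x

# Example: maximize f(x, y) = -(x - 1)^2 - (y + 2)^2, whose gradient is
# [-2(x - 1), -2(y + 2)] and whose unique maximum sits at (1, -2).
grad_f = lambda v: np.array([-2 * (v[0] - 1), -2 * (v[1] + 2)])
print(gradient_ascent(grad_f, [0.0, 0.0]))  # → approximately [1., -2.]
```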

Gradient ascent (or descent) is a first-order optimization algorithm, since it iteratively collects local information about the surface out to first derivatives. These first derivatives form a gradient vector, and the gradient vector points the way toward higher surface values.

We won’t get into the details of choosing a learning rate, a stopping condition, or the convergence behavior of this algorithm, but will instead use it as a conceptual tool. To find a maximizer, pick an initial input, then slowly walk uphill, choosing at each step the direction of steepest ascent. You can imagine the strategy as a sequence of refinements, where at each stage we use a gradient to refine our guess at an optimizer. Further implementation details and theory are subjects for an optimization class.

Run the code cell below to watch gradient ascent in action. You’ll see that the sequence of iterates forms a path that climbs the surface, following the gradient vector field and moving perpendicular to the level sets of the surface.

from utils_ga import show_gradient_ascent

show_gradient_ascent()

We can use the basic idea of gradient ascent to motivate a condition that all maxima and minima of a smooth surface must satisfy. We’ll motivate the condition in two steps:

  1. If $x_*$ is a maximum (or minimum), then any iterative optimizer should stop moving if $x(t) = x_*$ at some time $t$. In other words, if we are walking uphill and reach a summit, then we should stop walking.

  2. If $\nabla f(x_*) \neq 0$, then an iterative optimizer could keep walking, since there is an uphill direction leaving $x_*$.

It follows that, if $x_*$ is a local maximizer (or minimizer) of a smooth function $f$, then $\nabla f(x_*) = 0$. If not, we could move away from $x_*$ in a direction that increases (or decreases) $f$.

Finding Maxima by Setting Gradients to Zero

We can use the condition $\nabla f(x_*) = 0$, just as we used $\frac{d}{dx} f(x_*) = 0$ in one dimension, to narrow down our search for extrema.

Here are three examples:

  1. $f(x,y) = 8 x^2 - 12 x y + 6 y^2 + 4 x - 24 y + 4$.

  2. $g(x,y) = 1 - x^2 + e^{-\frac{1}{2} y^2}$.

  3. $h(x,y) = \sin(\pi x) \sin(\pi y)$.
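For the first example, the gradient is linear in $x$ and $y$, so setting it to zero reduces to a small linear solve. A quick check in Python, with the partial derivatives computed by hand:

```python
import numpy as np

# f(x, y) = 8x^2 - 12xy + 6y^2 + 4x - 24y + 4 has gradient
# [16x - 12y + 4, -12x + 12y - 24], so grad f = 0 is the linear
# system A @ [x, y] = b with:
A = np.array([[16.0, -12.0], [-12.0, 12.0]])
b = np.array([-4.0, 24.0])
crit = np.linalg.solve(A, b)
print(crit)  # → [5. 7.], the unique critical point of f
```

Since the Hessian $\left[\begin{smallmatrix}16 & -12 \\ -12 & 12\end{smallmatrix}\right]$ is positive definite, this critical point is a minimum.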

Run the code cell below, and select “Sine product”, to visualize hh.

You should see that $h$ is a periodic checkerboard of peaks and valleys. If you rotate the surface to view it from above, you’ll see that the peaks and valleys are all centered on a lattice where $x$ and $y$ are half integers.

from utils_lsg import show_gradient_field

show_gradient_field()

Finding Modes of Joint Densities

Let’s use this rule to find the modes of some joint distributions.

Normal Distributions

First, suppose that $X$ and $Y$ are independent normal random variables with expectations $\mathbb{E}[X] = 3$, $\mathbb{E}[Y] = 5$, and standard deviations $\text{SD}[X] = 2$ and $\text{SD}[Y] = 4$. Then, since the joint density of a pair of independent random variables equals the product of their marginal densities:

$$\begin{aligned} f_{X,Y}(x,y) & = \left( \frac{1}{\sqrt{2 \pi}} \frac{1}{2} e^{-\frac{1}{2} \left(\frac{x - 3}{2} \right)^2} \right) \times \left( \frac{1}{\sqrt{2 \pi}} \frac{1}{4} e^{-\frac{1}{2} \left(\frac{y - 5}{4} \right)^2} \right) \propto \exp\left(-\frac{1}{2}\left(\left(\frac{x - 3}{2} \right)^2 + \left(\frac{y - 5}{4} \right)^2 \right) \right)\end{aligned}$$

Where is this density maximized?

We can solve this problem in two ways. First we’ll reason directly on the density functions, then we’ll confirm our answer by setting the gradient to zero.

  1. By probability logic:

    First, notice that the density is a product of two nonnegative factors, each depending on only a single variable. So, the joint density is maximized by maximizing each of the marginal densities separately.

    Here’s the matching probability argument: since $X$ and $Y$ are independent, the most likely pair $[x_*, y_*]$ consists of the most likely value of $X$ and the most likely value of $Y$. So, to maximize the joint density, we should separately maximize each of the marginal densities.

    Each of the marginal densities is a translation and a dilation of the standard normal density. The standard normal density is bell shaped and even about 0, so it is maximized at 0. Therefore, the marginal densities are separately maximized at $x_* = 3$ and $y_* = 5$.

    $$[x_*,y_*] = [3,5] = [\mathbb{E}[X], \mathbb{E}[Y]].$$

    Therefore, the mode of the joint density equals the expectation of the vector $[X,Y]$. This is a special property of normal distributions: they are maximized at their expected value.

  2. Via gradients:

    As always, the location of a maximizer (or minimizer) is unchanged if we pass our function through a monotonically increasing transformation (see Section 3.3). Therefore, the joint density is maximized where $\exp\left(-\frac{1}{2}\left(\left(\frac{x - 3}{2} \right)^2 + \left(\frac{y - 5}{4} \right)^2 \right) \right)$ is maximized. Logs are monotonic, so we can replace our objective function with its logarithm without moving the maximizer. The log is the inverse of the exponential. Therefore, the maximizer of the joint density is the maximizer of the quadratic function:

    $$- \frac{1}{2} \left(\left(\frac{x - 3}{2} \right)^2 + \left(\frac{y - 5}{4} \right)^2 \right)$$

    This quadratic function is a sum of a function of $x$ and a function of $y$, so it is maximized by separately maximizing each term.

    Applying the gradient:

    $$\nabla \log(f_{X,Y}(x,y)) = - \nabla \frac{1}{2} \left(\left(\frac{x - 3}{2} \right)^2 + \left(\frac{y - 5}{4} \right)^2 \right) = \left[ \begin{array}{c} - \frac{1}{2}\frac{x - 3}{2} \\ - \frac{1}{4} \frac{y - 5}{4} \end{array} \right].$$

    Setting both entries to zero gives $x_* = 3$ and $y_* = 5$, as expected.
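As a sanity check, we can let gradient ascent climb the log density numerically. A rough sketch; the starting point, step size, and iteration count are arbitrary:

```python
import numpy as np

# Gradient of the log joint density, as computed above.
grad_log_f = lambda v: np.array([-(v[0] - 3) / 4, -(v[1] - 5) / 16])

v = np.array([0.0, 0.0])  # arbitrary initial guess
for _ in range(2000):
    v = v + 0.5 * grad_log_f(v)  # step uphill on the log density
print(v)  # → approximately [3., 5.], the mode
```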

Dirichlet Distributions

Let’s try an example where we don’t know the answer from the start.

Dirichlet distributions are popular models for user preferences among a set of $d+1$ items, for unknown chances on $d+1$ outcomes, or for the gaps between sorted random samples. They naturally generalize the familiar family of densities where $X \in [0,1]$ and $\text{PDF}(x) = f_X(x) \propto x^{\alpha - 1} (1 - x)^{\beta - 1}$ to higher dimensions.

Here’s a two-dimensional example: $X \geq 0$, $Y \geq 0$, and $X + Y \leq 1$, with:

$$f_{X,Y}(x,y) \propto g(x,y) = x^5 y^3 (1 - (x + y))^8.$$

Let’s find its mode.

The density is maximized where $g$ is maximized, so take the gradient of $g$:

$$\nabla g(x,y) = \left[ \begin{array}{c} 5 x^4 y^3 (1 - (x + y))^8 - 8 x^5 y^3 (1 - (x + y))^7 \\ 3 x^5 y^2 (1 - (x + y))^8 - 8 x^5 y^3 (1 - (x + y))^7 \end{array} \right].$$

We can write the gradient more conveniently:

$$\nabla g(x,y) = \left[ \begin{array}{c} \frac{5}{x} g(x,y) - \frac{8}{1 - (x + y)} g(x,y) \\ \frac{3}{y} g(x,y) - \frac{8}{1 - (x + y)} g(x,y) \end{array} \right] = \left[ \begin{array}{c} \frac{5}{x} - \frac{8}{1 - (x + y)} \\ \frac{3}{y} - \frac{8}{1 - (x + y)} \end{array} \right] g(x,y).$$

The function $g(x,y)$ is nonzero on the interior of the support, where $x > 0$, $y > 0$, and $x + y < 1$, and equals zero on the boundary of the support. So, we can restrict our attention to the interior. Inside the triangle where $x > 0$, $y > 0$, and $x + y < 1$, the gradient is only zero if:

$$\left[ \begin{array}{c} \frac{5}{x_*} - \frac{8}{1 - (x_* + y_*)} \\ \frac{3}{y_*} - \frac{8}{1 - (x_* + y_*)} \end{array} \right] = 0$$

Rearranging gives the equations:

$$\frac{5}{x_*} = \frac{8}{1 - (x_* + y_*)} = \frac{3}{y_*}$$

or:

$$\frac{x_*}{5} = \frac{1 - (x_* + y_*)}{8} = \frac{y_*}{3}$$

It follows that $y_* = \frac{3}{5} x_*$. Then:

$$\frac{1 - (x_* + y_*)}{8} = \frac{1 - x_* - \frac{3}{5} x_*}{8} = \frac{1 - \frac{8}{5} x_*}{8} = \frac{1}{8} - \frac{1}{5} x_*.$$

Then:

$$\frac{1}{8} - \frac{1}{5} x_* = \frac{1}{5} x_*$$

So:

$$\frac{2}{5} x_* = \frac{1}{8}$$

Or $x_* = \frac{5}{16}$. It follows that $y_* = \frac{3}{16}$, so:

$$[x_*,y_*] = \frac{1}{16}\left[5, 3 \right].$$

These numbers are related simply to the parameters of the original distribution. In general, if $X$ is drawn from a Dirichlet distribution with parameters $\alpha = [\alpha_1, \alpha_2, \ldots, \alpha_{d+1}]$, then its joint density has mode at ${x_*}_j = \frac{\alpha_j - 1}{\sum_{i=1}^{d+1} (\alpha_i - 1)}$. In our case, $\alpha_1 = 6$, $\alpha_2 = 4$, and $\alpha_3 = 9$, so the mode is at $[(6 - 1)/(19 - 3), (4 - 1)/(19 - 3)] = [5/16, 3/16]$.

We could just as well have found the mode by maximizing the logarithm of $g$.
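Here is what that looks like numerically: gradient ascent on $\log g(x,y) = 5\log x + 3\log y + 8\log(1 - x - y)$, using the gradient entries derived above. A sketch; the starting point and step size are arbitrary small values chosen to stay inside the triangle:

```python
import numpy as np

def grad_log_g(v):
    """Gradient of log g: [5/x - 8/(1-x-y), 3/y - 8/(1-x-y)]."""
    x, y = v
    s = 8.0 / (1.0 - x - y)
    return np.array([5.0 / x - s, 3.0 / y - s])

v = np.array([0.2, 0.2])  # a point inside the triangle x, y > 0, x + y < 1
for _ in range(5000):
    v = v + 0.001 * grad_log_g(v)  # small uphill steps on log g
print(v)  # → approximately [0.3125, 0.1875], i.e. [5/16, 3/16]
```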

Regression

Here’s an example regression problem:

  1. You are provided with a list of data points $\{[x_j,y_j]\}_{j=1}^n$ relating an independent variable $x$ to a dependent variable $y$. These points form a scatter cloud in the $x, y$ plane.

  2. You suggest a function $\hat{y}(x;\theta)$ that relates $x$ and $y$, where $\theta$ is a vector of free parameters. For instance, we could consider a linear model:

    $$\hat{y}(x;\theta) = \theta_1 + \theta_2 x$$

    or a quadratic model:

    $$\hat{y}(x;\theta) = \theta_1 + \theta_2 x + \theta_3 x^2$$

    or even an exponential model:

    $$\hat{y}(x;\theta) = \theta_1 e^{\theta_2 x}.$$

    In machine learning we usually use a neural network for $\hat{y}$.

  3. You aim to find the best fit function among all $\hat{y}(x;\theta)$ by minimizing some loss function that measures the discrepancy between your proposed model and the observed data:

$$\theta_* = \text{argmin}_{\theta}\{\mathcal{L}(\hat{y}(\cdot;\theta),\{[x_j,y_j]\}_{j=1}^n)\}$$

Least Squares Regression

In least squares regression we set the loss function to the average square error in the model over the data:

$$\mathcal{L}(\hat{y}, \{[x_j,y_j]\}_{j=1}^n) = \text{MSE}(\hat{y},\{[x_j,y_j]\}_{j=1}^n) = \frac{1}{n} \sum_{j=1}^n \left(\hat{y}(x_j) - y_j \right)^2$$

where $\text{MSE}$ stands for mean square error. Then, our problem is to find the parameters $\theta$ that minimize the mean square error:

$$\theta_* = \text{argmin}_{\theta}\left\{\frac{1}{n} \sum_{j=1}^n \left(\hat{y}(x_j;\theta) - y_j \right)^2 \right\}$$

The most popular example is linear least squares regression, in which the model $\hat{y}(x_j;\theta)$ is a linear function of the parameters $\theta$. Often it is also a linear function of the independent variable $x$, though it need not be. Examples include all polynomial models:

$$\hat{y}(x;\theta) = \theta_0 + \theta_1 x + \theta_2 x^2 + \cdots + \theta_n x^n.$$

Let’s start with the simple case where $\hat{y}$ is linear in both the parameters and $x$. Then:

$$\hat{y}(x;\theta) = \theta_0 + \theta_1 x$$

is a line with intercept $\theta_0$ and slope $\theta_1$. Now, our regression problem is to find the line that best fits the data by minimizing the mean square error between the line and the data points:

$$\theta_* = \text{argmin}_{\theta_0,\theta_1}\left\{\frac{1}{n} \sum_{j=1}^n \left((\theta_0 + \theta_1 x_j) - y_j \right)^2\right\}.$$

This is an unconstrained minimization problem in the two free parameters, $\theta_0$ and $\theta_1$, with objective function equal to the mean square error of the linear model.

To help visualize this problem, run the code cell below.

The panel on the left shows a series of $x_j, y_j$ pairs as a scatter plot. You are free to choose a slope and an intercept, and for each choice you can compute the associated mean square error. Search for a slope and intercept that make the mean square error small. Then click “Reveal RMSE” to show the error as a function of the free parameters. This reveals a surface in the right-hand panel; the surface is a function of the parameters and depends on the data. Solving the regression problem amounts to minimizing this surface.

from utils_ls import show_least_squares

show_least_squares()

To find the best fit parameters, compute the gradient of the objective, and set it equal to zero. This suffices since the objective is a convex, quadratic function of the parameters, and we’ve placed no constraints on the parameters.

$$\nabla_{\theta} \text{MSE}(\hat{y}(\cdot;\theta), \{[x_j,y_j]\}_{j=1}^n) = \nabla_{\theta} \frac{1}{n} \sum_{j=1}^n \left((\theta_0 + \theta_1 x_j) - y_j \right)^2$$

Here we’ve added the subscript $\theta$ to the gradient symbol to remind us that we are taking the gradient with respect to the parameters. Always pay careful attention to which variables you are optimizing over and which are held fixed. In this problem the data is fixed, and we are optimizing with respect to the parameters of the model.

To compute the partial derivatives with respect to $\theta_0$ and $\theta_1$, we’ll apply the chain rule:

$$\begin{aligned} \partial_{\theta_i} \frac{1}{n} \sum_{j=1}^n \left((\theta_0 + \theta_1 x_j) - y_j \right)^2 & = \partial_{\theta_i} \frac{1}{n} \sum_{j=1}^n \left(\hat{y}(x_j;\theta) - y_j \right)^2 \\ & = \frac{1}{n} \sum_{j=1}^n \partial_{\theta_i} \left(\hat{y}(x_j;\theta) - y_j \right)^2 \\ & = \frac{2}{n} \sum_{j=1}^n (\hat{y}(x_j;\theta) - y_j) \times \partial_{\theta_i} \hat{y}(x_j;\theta). \end{aligned}$$

Then, since $\hat{y}(x;\theta) = \theta_0 + \theta_1 x$, we have $\partial_{\theta_0} \hat{y}(x;\theta) = 1$ and $\partial_{\theta_1} \hat{y}(x;\theta) = x$. So:

$$\nabla_{\theta} \text{MSE}(\hat{y}(\cdot;\theta), \{[x_j,y_j]\}_{j=1}^n) = \frac{2}{n} \left[ \begin{array}{c} \sum_{j=1}^n (\hat{y}(x_j;\theta) - y_j) \\ \sum_{j=1}^n (\hat{y}(x_j;\theta) - y_j) \times x_j \end{array} \right]$$
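Before using this formula, it’s worth checking it against finite differences on synthetic data. A quick sketch; the data and the test point $\theta$ are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=20), rng.normal(size=20)  # synthetic data
theta = np.array([0.7, -1.3])  # an arbitrary parameter vector

mse = lambda t: np.mean((t[0] + t[1] * x - y) ** 2)

# Gradient from the formula above.
resid = theta[0] + theta[1] * x - y
grad = (2 / len(x)) * np.array([resid.sum(), (resid * x).sum()])

# Centered finite differences, one coordinate at a time.
h = 1e-6
fd = np.array([(mse(theta + h * e) - mse(theta - h * e)) / (2 * h)
               for e in np.eye(2)])
print(np.allclose(grad, fd, atol=1e-5))  # → True
```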

Setting the first entry to zero requires:

$$\frac{1}{n}\sum_{j=1}^n (\hat{y}(x_j;\theta_*) - y_j) = 0.$$

Rearranging, we need:

$$\frac{1}{n} \sum_{j=1}^n \hat{y}(x_j;\theta_*) = \frac{1}{n} \sum_{j=1}^n y_j = \bar{y}.$$

That is, the average value of the model must equal the average value of the data, $\bar{y}$. Substituting in for $\hat{y}$ gives:

$$\begin{aligned} \frac{1}{n} \sum_{j=1}^n {\theta_*}_0 + {\theta_*}_1 x_j & = {\theta_*}_0 + {\theta_*}_1 \frac{1}{n} \sum_{j=1}^n x_j \\ & = {\theta_*}_0 + {\theta_*}_1 \bar{x} = \bar{y} \end{aligned}$$

So:

$${\theta_*}_0 = \bar{y} - {\theta_*}_1 \bar{x}.$$

Plugging back in, we find:

$$\hat{y}(x;\theta_*) = \bar{y} + {\theta_*}_1 (x - \bar{x}).$$

This is a nicer form. It ensures that the best fit line passes through the point $[\bar{x},\bar{y}]$, where $\bar{x}$ and $\bar{y}$ are the average $x$ and $y$ coordinates in the data set.

Now, $\hat{y}(x_j;\theta_*) - y_j$ can also be expressed more cleanly:

$$\hat{y}(x_j;\theta_*) - y_j = {\theta_*}_1 (x_j - \bar{x}) - (y_j - \bar{y}).$$

Let $\Delta x = x - \bar{x}$ and $\Delta y = y - \bar{y}$. Then $\Delta x$ and $\Delta y$ are centered variables that represent horizontal and vertical distances from the mean. They correspond to the values of $x$ and $y$ had we started by centering our data (subtracting off the mean $x$ and $y$ coordinates). Many data processing pipelines start by centering the data, and here we see a good reason to center your data when finding a best fit line: the best fit line automatically picks an intercept that effectively centers the problem. In terms of the centered variables:

$$\hat{y}(x_j;\theta_*) - y_j = {\theta_*}_1 \Delta x_j - \Delta y_j.$$

So, the second entry of the gradient can be written:

$$\frac{2}{n} \sum_{j=1}^n ({\theta_*}_1 \Delta x_j - \Delta y_j) \times x_j = \frac{2}{n} \sum_{j=1}^n ({\theta_*}_1 \Delta x_j - \Delta y_j) \times (\Delta x_j + \bar{x}).$$

Expanding the product, the term involving $\bar{x}$ is:

$$\frac{2}{n} \left[\sum_{j=1}^n ({\theta_*}_1 \Delta x_j - \Delta y_j) \right] \bar{x}.$$

This term is zero since we chose $\theta_*$ so that the bracketed sum equals zero. To check our work, note that:

$$\sum_{j=1}^n ({\theta_*}_1 \Delta x_j - \Delta y_j) = {\theta_*}_1 \sum_{j=1}^n \Delta x_j - \sum_{j=1}^n \Delta y_j = {\theta_*}_1 \times 0 - 0 = 0.$$

Both sums are zero since the variables $\Delta x$ and $\Delta y$ are centered.

So, the second entry in the gradient is:

$$\frac{2}{n} \sum_{j=1}^n ({\theta_*}_1 \Delta x_j - \Delta y_j) \times \Delta x_j.$$

Setting this entry to zero requires:

$${\theta_*}_1 \frac{1}{n} \sum_{j=1}^n \Delta x_j^2 - \frac{1}{n} \sum_{j=1}^n \Delta y_j \Delta x_j = 0$$

or:

$${\theta_*}_1 = \frac{\frac{1}{n} \sum_{j=1}^n \Delta y_j \Delta x_j}{\frac{1}{n} \sum_{j=1}^n \Delta x_j^2}.$$

We’ll interpret this result in Section 11 in terms of the variance of the sampled $x$ coordinates, the variance of the sampled $y$ coordinates, and the correlation between them. Any time you’ve generated a best fit line in a past class, these are the equations your computer used to find the best fit intercept and slope.
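The derivation above can be checked end to end against a library fit. A sketch on synthetic data; `np.polyfit` serves only as an independent reference:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
y = 2.0 + 0.5 * x + rng.normal(scale=0.3, size=x.size)  # noisy line

# Slope and intercept from the formulas derived above.
dx, dy = x - x.mean(), y - y.mean()
slope = np.mean(dy * dx) / np.mean(dx**2)
intercept = y.mean() - slope * x.mean()

# Independent check: numpy's degree-1 least squares fit.
ref_slope, ref_intercept = np.polyfit(x, y, 1)
print(np.allclose([slope, intercept], [ref_slope, ref_intercept]))  # → True
```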

The same essential logic applies for models that are linear in $\theta$, no matter how they depend on $x$. For instance, if we’d used the quadratic model:

$$\hat{y}(x;\theta) = \theta_0 + \theta_2 x^2$$

we would have arrived at essentially the same conclusion, just substituting $x^2$ for $x$ as our independent variable. You’ll practice computing the gradient of different loss functions for different regression models on your homework.