
11.2 Linear Regression

Another way to summarize the relationship between two variables is by fitting one variable to a function of the other. For example, we might look for the best fit line relating $Y$ to $X$. Then, the slope and intercept of the best fit will offer a summary of the relationship between $X$ and $Y$. Note: we are only summarizing the relationship with the fit line, not trying to use the fit to estimate some underlying relationship.

This makes sense when the two variables satisfy a roughly linear relationship. Here’s the example we introduced in Section 10.2 again.

```python
from utils_cond_exp import show_conditional_expectation

show_conditional_expectation()
```

In this case $\bar{y}(x) = \mathbb{E}[Y \mid X = x]$ is a linear function of $x$, so it is very natural to summarize the relationship between $Y$ and $X$ with a line.

Finding a best fit line is an example of linear regression. We first considered linear regression in Section 9.3. We’ll recall the setting and result here, before reinterpreting the solution in terms of covariance and correlation.

Here’s an example linear regression problem using a least squares loss:

  1. You are provided with a list of data points $\{[x_j, y_j]\}_{j=1}^n$ relating an independent variable $x$ to a dependent variable $y$. These points form a scatter cloud in the $x, y$ plane.

  2. You suggest a linear function that relates $x$ and $y$, $\hat{y}(x; m, b) = m x + b$, where $m$ and $b$ are free parameters.

  3. You aim to find the best fit function among all $\hat{y}(x; m, b)$ by minimizing some loss function that measures the discrepancy between your proposed model and the observed data:

    $$m_*, b_* = \operatorname{argmin}_{m,b}\left\{\mathcal{L}\left(\hat{y}(\cdot; m, b), \{[x_j, y_j]\}_{j=1}^n\right)\right\}$$

    It is common practice to minimize the least squares loss:

    $$\mathcal{L}\left(\hat{y}, \{[x_j, y_j]\}_{j=1}^n\right) = \text{MSE}\left(\hat{y}, \{[x_j, y_j]\}_{j=1}^n\right) = \frac{1}{n} \sum_{j=1}^n \left(\hat{y}(x_j) - y_j\right)^2$$

    where $\text{MSE}$ stands for mean square error.
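The setup above can be sketched numerically. This is a minimal example on synthetic data (the particular slope, intercept, and noise level are illustrative assumptions, not from the text); `np.polyfit` with degree 1 returns the slope and intercept minimizing the least squares loss:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, illustrative data: y is roughly linear in x plus noise.
x = rng.uniform(0, 10, size=100)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=100)

def mse(m, b):
    """Least squares loss: mean square error of the line y_hat = m*x + b."""
    return np.mean((m * x + b - y) ** 2)

# np.polyfit with deg=1 minimizes the least squares loss over (m, b).
m_star, b_star = np.polyfit(x, y, deg=1)

# Any other choice of (m, b) produces at least as large a loss.
assert mse(m_star, b_star) <= mse(2.0, 1.0)
```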

The figure below shows an example with 1000 sampled $[X, Y]$ pairs (shown as blue scatter points). The lines represent 100 possible best fit lines (produced by fitting to 100 different bootstrap samples of 10 data points). They are colored according to the quality of their fit, as measured using mean square error (the least squares loss). The lines colored red are poor fits, and produce a large loss. The lines in green are good fits, and produce a small loss. Our goal is to find a formula for the intercept and slope that minimize the loss, and so provide the best summary of the relationship between $X$ and $Y$.

Possible best fit lines.

Here’s the resulting fit line (shown in red). The dashed black lines mark the empirical averages $\bar{x}$ and $\bar{y}$.

Best fit line.

Let’s try to make some more sense of this solution. First, since $\bar{x}$ and $\bar{y}$ are the empirical averages of the sampled $x$’s and $y$’s, if we rewrite the best fit line:

$$\hat{y}_*(x) - \bar{y} = m_* (x - \bar{x})$$

then we can express the best fit line in the centered coordinates:

$$y_0 = y - \bar{y}, \quad x_0 = x - \bar{x}, \quad \hat{y}_{0*}(x_0) = m_* x_0.$$

This explains the best fit choice of intercept, $b_*$. Choosing the best fit intercept is equivalent to centering the scatter cloud.
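We can check this equivalence numerically: fitting to the centered cloud recovers the same slope with a (numerically) zero intercept. A small sketch on synthetic data (the slope, intercept, and noise are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative data with a nonzero intercept.
x = rng.normal(loc=5.0, scale=1.0, size=200)
y = 1.5 * x + 4.0 + rng.normal(scale=0.5, size=200)

m, b = np.polyfit(x, y, deg=1)

# Center both variables, then refit.
x0 = x - x.mean()
y0 = y - y.mean()
m0, b0 = np.polyfit(x0, y0, deg=1)

assert np.isclose(m0, m)               # same slope
assert np.isclose(b0, 0.0, atol=1e-8)  # intercept vanishes after centering
```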

What about the best fit slope, $m_*$? Recall from Section 9.3 that minimizing the least squares loss gives

$$m_* = \frac{\frac{1}{n} \sum_{j=1}^n (x_j - \bar{x})(y_j - \bar{y})}{\frac{1}{n} \sum_{j=1}^n (x_j - \bar{x})^2}.$$

Consider its numerator and denominator separately:

$$\frac{1}{n} \sum_{j=1}^n (x_j - \bar{x})(y_j - \bar{y}), \quad \frac{1}{n} \sum_{j=1}^n (x_j - \bar{x})^2.$$

These are an average product of the centered variables, and an average of the centered variables squared. In other words, the first has the form of a covariance (see Section 11.1) and the second has the form of a variance (see Section 4.3). These are the sample covariance and sample variance of the collection $\{[x_j, y_j]\}_{j=1}^n$. They are equivalent to the covariance and variance if we choose $X$ and $Y$ by picking an index, $J$, uniformly at random from 1 to $n$, then setting $X = x_J$ and $Y = y_J$. In other words:

If we use the empirical distribution, $[X, Y] = [x_J, y_J]$ when $J \sim \text{Uniform}(\{1, 2, 3, \ldots, n\})$, then we can write:

$$\begin{aligned} & \bar{x} = \mathbb{E}[X], \quad \bar{y} = \mathbb{E}[Y] \\ & \text{Var}[X] = \frac{1}{n} \sum_{j=1}^n (x_j - \bar{x})^2, \quad \text{Var}[Y] = \frac{1}{n} \sum_{j=1}^n (y_j - \bar{y})^2 \\ & \text{Cov}[X, Y] = \frac{1}{n} \sum_{j=1}^n (x_j - \bar{x})(y_j - \bar{y}). \end{aligned}$$
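As a sanity check, these written-out averages agree with NumPy’s built-ins when both normalize by $n$ (the synthetic data here is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=500)
y = 0.5 * x + rng.normal(scale=0.3, size=500)

xbar, ybar = x.mean(), y.mean()

# Empirical variance and covariance, written out as averages.
var_x = np.mean((x - xbar) ** 2)
var_y = np.mean((y - ybar) ** 2)
cov_xy = np.mean((x - xbar) * (y - ybar))

# np.var defaults to dividing by n; np.cov needs bias=True to divide by n.
assert np.isclose(var_x, np.var(x))
assert np.isclose(var_y, np.var(y))
assert np.isclose(cov_xy, np.cov(x, y, bias=True)[0, 1])
```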

These equalities are exact if we use the empirical distribution. If, instead, $[X, Y]$ are drawn from the background distribution, then these equations are approximations based on the observed samples. Regardless, if $n$ is very large, then the averages associated with the background distribution, and their empirical approximations, will be very similar. Taking $n$ to infinity will drive any approximation error to zero. So, from here on out we’ll think about this result as if $X$ and $Y$ were sampled from the distribution that produced our data. We’ll study the sense in which empirical distributions and averages approximate the background distribution and averages in Section 13.

We can now rephrase the best fit slope:

$$m_* = \frac{\text{Cov}[X, Y]}{\text{Var}[X]}$$

where the covariance and variance are their empirical approximations (finite $n$), or the true covariance and variance in $X$ and $Y$ (infinite $n$).
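We can verify this identity directly: the least squares slope from `np.polyfit` matches the ratio of the empirical covariance to the empirical variance (synthetic data, chosen purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, size=200)
y = 3.0 * x - 2.0 + rng.normal(scale=0.5, size=200)

# Empirical covariance and variance, both normalized by n.
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
var_x = np.mean((x - x.mean()) ** 2)

# The least squares slope equals Cov[X, Y] / Var[X].
m_star, _ = np.polyfit(x, y, deg=1)
assert np.isclose(m_star, cov_xy / var_x)
```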

Let’s rewrite this result in terms of the correlation. Recall that:

$$\text{Corr}[X, Y] = \frac{\text{Cov}[X, Y]}{\text{SD}[X] \, \text{SD}[Y]}.$$

Therefore, the best fit slope is:

$$m_* = \frac{\text{Corr}[X, Y] \, \text{SD}[X] \, \text{SD}[Y]}{\text{Var}[X]} = \text{Corr}[X, Y] \frac{\text{SD}[Y]}{\text{SD}[X]}.$$

Let’s try moving all the terms involving $y$ to the left hand side. Then, the equation for the best fit line reads:

$$\frac{\hat{y}_*(x) - \mathbb{E}[Y]}{\text{SD}[Y]} = \text{Corr}[X, Y] \times \frac{x - \mathbb{E}[X]}{\text{SD}[X]}$$

The fraction on the left is the standardization of $y$. The fraction on the right is the standardization of $x$. So, the best fit line relating two random variables is, after standardizing, the line through the origin whose slope equals the correlation between the variables. In other words:

So, we can view the slope of the best fit line in standard units and the correlation as identical objects. This is useful in two ways. First, it provides an easy procedure for remembering the best fit line formula. Don’t try to memorize the messy form we found originally. Instead, do the following:

  1. Convert to standard variables.

  2. Set the slope of the best fit line equal to the correlation in the original variables.

  3. Convert back as needed.
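The three steps above can be sketched in code; the data here is synthetic and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(loc=10.0, scale=2.0, size=300)
y = -0.8 * x + rng.normal(scale=1.0, size=300)

# Step 1: convert to standard units.
x_std = (x - x.mean()) / x.std()
y_std = (y - y.mean()) / y.std()

# Step 2: in standard units, the best fit slope is the correlation
# and the intercept is (numerically) zero.
r = np.corrcoef(x, y)[0, 1]
m_std, b_std = np.polyfit(x_std, y_std, deg=1)
assert np.isclose(m_std, r)
assert np.isclose(b_std, 0.0, atol=1e-8)

# Step 3: convert back: slope = Corr[X, Y] * SD[Y] / SD[X].
m_orig = r * y.std() / x.std()
assert np.isclose(m_orig, np.polyfit(x, y, deg=1)[0])
```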

Second, it provides a clearer interpretation of the correlation and its relation to linear associations. The correlation is a slope. It is a slope in standard units. This explains both why the correlation is not a good measure for nonlinear relations, and why the correlation is a natural measure of association for variables that share a rough linear relationship. It redefines the correlation, not as an abstract measure of association, but instead as the answer to a concrete question: “what is the slope of the best fit line to this dataset/distribution?”

So, to find a correlation graphically, translate your axes to center your data. Rescale the axes so that the data cloud has the same standard deviation in each dimension. Then, estimate the slope of the best fit line. The steeper the slope, the stronger the correlation.

This interpretation also recasts the role of a best fit line. The best fit line, to a collection of data points or a joint distribution, is, like the distribution summaries introduced in Section 4, just another way of summarizing a distribution. Its intercept is fixed by, and represents, the empirical averages of the data. Its slope, after standardizing, reflects the correlation in the sampled data points!