
11.1 Covariance and Correlation

Objectives

Suppose that $X$ and $Y$ are jointly distributed random variables. How can we summarize the relationship between $X$ and $Y$ using their joint distribution? In particular, how can we summarize the degree of association between the variables?

Here’s an example. Run the code cell below. Suppose that $X$ and $Y$ were drawn from the distribution displayed. Clearly $X$ and $Y$ are related. The conditional distribution of $Y$ given $X = x$ varies as $x$ varies. Increasing $x$ increases the chance that $Y$ is large. As a result, the conditional expectation, $\bar{y}(x) = \mathbb{E}[Y|X = x]$, is an increasing function of $x$. Moreover, the conditional distribution of $Y|X = x$ is considerably narrower, for every $x$, than the marginal distribution of $Y$. That means that, if we observed $X$, we could use our observation to make better informed guesses at $Y$ than we could had we not observed $X$.

from utils_cond_exp import show_conditional_expectation

show_conditional_expectation()
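
If `utils_cond_exp` isn’t available, here is a minimal, self-contained sketch of the same idea. It uses a correlated bivariate normal as a stand-in for the displayed distribution (an assumption for illustration only), and approximates $\mathbb{E}[Y|X = x]$ by averaging the sampled $y$ values inside narrow bins of $x$:

```python
import numpy as np
import matplotlib.pyplot as plt

# Stand-in joint distribution: a correlated bivariate normal.
rng = np.random.default_rng(0)
x, y = rng.multivariate_normal([0, 0], [[1.0, 0.7], [0.7, 1.0]], size=5000).T

# Approximate the conditional expectation E[Y | X = x] by averaging the
# sampled y values within narrow bins of x.
bins = np.linspace(-2.5, 2.5, 21)
centers = 0.5 * (bins[:-1] + bins[1:])
cond_mean = [y[(x >= lo) & (x < hi)].mean() for lo, hi in zip(bins[:-1], bins[1:])]

plt.scatter(x, y, s=5, alpha=0.2, label="samples")
plt.plot(centers, cond_mean, color="red", label=r"estimated $E[Y|X=x]$")
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.show()
```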

How could we summarize the strength of this relationship between $X$ and $Y$?

When defining a new mathematical object it can help to list out some desired characteristics or “desiderata”. Here are some natural desiderata for a measure of association:

It can also be helpful to identify desired “invariances”. These are ways of modifying a question that leave the answer unchanged. When measuring associations between $X$ and $Y$ there are two natural invariances:

  1. Translation invariance: shifting $X$ or $Y$ by a constant should not change the measured association.

  2. Scale invariance: rescaling $X$ or $Y$ (for example, changing the units in which they are measured) should not change the measured association.

Covariance

Definition

Let’s try to design a measure that achieves all of our desiderata.

First, let’s work out how to ensure an invariance. We’ll start with invariance to translations.

To ensure that we treat $X$ and $X + s$ the same, for all $s$, we could start by centering our random variables.

If we always start by centering our variable, then it doesn’t matter where it is centered to start. After centering it always has expectation equal to zero. So, centering will move all translations of the same distribution to the same distribution with mean zero. In particular:

$$(X + s) - \mathbb{E}[X + s] = X + s - (\mathbb{E}[X] + s) = X - \mathbb{E}[X] = X_0$$

for all possible $s$.

So, to ensure invariance to translations, all we have to do is start by centering our variables. If we define our measure of association using centered variables, then it will be invariant to translations.

Since we will center every variable from here on out, it will be helpful to have shorthand notation for $\mathbb{E}[X]$ and $\mathbb{E}[Y]$. We’ll use $\bar{x} = \mathbb{E}[X]$ and $\bar{y} = \mathbb{E}[Y]$.

Next, let’s work on the second desideratum. Draw an $x,y$ plane. Mark the four quadrants.

Then, imagine sampling the centered variables $X_0$ and $Y_0$. The sampled pair, $[X_0, Y_0]$, could land in any of the four quadrants. If:

  1. $X_0 > 0$ and $Y_0 > 0$ then both variables are larger than their expectation. This is evidence of a positive association.

  2. $X_0 > 0$ and $Y_0 < 0$ then one variable is larger than its expectation and the other is smaller. This is evidence of a negative association.

  3. $X_0 < 0$ and $Y_0 < 0$ then both variables are smaller than their expectation. This is evidence of a positive association.

  4. $X_0 < 0$ and $Y_0 > 0$ then one variable is larger than its expectation and the other is smaller. This is evidence of a negative association.

So, imagine coloring your $x,y$ plane, using green for quadrants (I) and (III) and red for quadrants (II) and (IV). Then label the green quadrants (+) for positive association and the red quadrants (-) for negative association.

The quadrants, colored by matching sign.

Let’s look for a scalar valued function of $x$ and $y$ that is positive in quadrants (I) and (III), negative in quadrants (II) and (IV), and zero on the boundaries. A natural choice is the product function $x \times y$. Notice that:

  1. If $x > 0$ and $y > 0$ then $x \times y > 0$.

  2. If $x > 0$ and $y < 0$ then $x \times y < 0$.

  3. If $x < 0$ and $y < 0$ then $x \times y > 0$.

  4. If $x < 0$ and $y > 0$ then $x \times y < 0$.

So, $x \times y$ is positive in quadrants (I) and (III), and negative in quadrants (II) and (IV). It is zero if either $x$ or $y$ is zero, so it is zero along the coordinate axes dividing the quadrants. Here’s an overlaid heatmap + contour plot showing the function $x \times y$.

The product function.
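
If you’d like to reproduce a plot like the one above, here is a minimal matplotlib sketch (the exact styling of the original figure may differ):

```python
import numpy as np
import matplotlib.pyplot as plt

# Evaluate the product function x * y on a grid.
grid = np.linspace(-2, 2, 201)
X, Y = np.meshgrid(grid, grid)
Z = X * Y

# Heatmap with contour lines: x * y is positive in quadrants (I) and (III),
# negative in (II) and (IV), and zero along the coordinate axes.
plt.imshow(Z, extent=[-2, 2, -2, 2], origin="lower", cmap="RdYlGn")
plt.contour(X, Y, Z, levels=10, colors="black", linewidths=0.5)
plt.axhline(0, color="gray")
plt.axvline(0, color="gray")
plt.colorbar(label=r"$x \times y$")
plt.xlabel("x")
plt.ylabel("y")
plt.show()
```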

The function $x \times y$ has another useful property that aligns with our desiderata. It increases in magnitude as either input increases in magnitude. So, if both $x$ and $y$ are far from zero, then $|x \times y|$ will be large. In contrast, if either $x$ or $y$ is close to zero, then the product $|x \times y|$ will be near zero.

Since we centered our variables, the origin represents the center of mass of our distribution. So, $|X_0 \times Y_0|$ will be large when both variables vary far from their expectations, and will be small if either stays close to its expectation.

So, the function $X_0 \times Y_0 = (X - \bar{x}) \times (Y - \bar{y})$ is a natural measure of the degree to which a pair of sampled values jointly vary away from their expectations. In other words, it measures how much a single pair of samples “co”-vary. It is positive when the sampled pair suggests a positive association, negative when the sampled pair suggests a negative association, and larger in magnitude the larger the joint variation.

Now, both samples are random, so their product, $X_0 \times Y_0$, is also a random quantity. The degree of association between two random variables should be a number determined by their joint distribution, not a number that changes depending on a particular sample. So, to average over the possible values of $X_0$ and $Y_0$, let’s use the expected value of the product as our measure of association:

$$\mathbb{E}_{X,Y}[X_0 \times Y_0] = \mathbb{E}_{X,Y}[(X - \bar{x}) \times (Y - \bar{y})].$$

The expected product of the centered variables is the covariance in $X$ and $Y$.

The covariance in $X$ and $Y$ measures how much the two variables vary together, i.e., how much they “co”-vary. It is, by design:

  1. Invariant to translations.

  2. Positive when the variables are positively associated and negative when they are negatively associated.

  3. Like variance, which is an expected square, the covariance, which is an expected product, is defined to give more weight to large joint variations than small variations.

Like expected values, or variances, covariances summarize an aspect of a distribution. In this case, the covariance summarizes how much the joint distribution associates positive deviations in one variable with positive deviations in the other.
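
As a quick numerical check, here is a sketch that estimates a covariance by averaging the product of centered samples. The bivariate normal, and its population covariance of 0.7, are assumptions chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
# Assumed joint distribution: bivariate normal with Cov[X, Y] = 0.7.
x, y = rng.multivariate_normal([0, 0], [[1.0, 0.7], [0.7, 1.0]], size=100_000).T

# Covariance as the average product of the centered samples.
cov_hat = np.mean((x - x.mean()) * (y - y.mean()))
print(cov_hat)                      # close to 0.7
print(np.cov(x, y, ddof=0)[0, 1])   # numpy's estimate agrees
```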

Algebraic Properties of Covariance

Like any summary quantity defined as an expectation, the covariance has a variety of useful algebraic properties. Many of the algebraic properties of covariance are generalizations of the algebraic properties of variance. For example:

  1. If either $X$ or $Y$ is constant, then the covariance in $X$ and $Y$ is zero since:

    $$\text{Cov}[c,Y] = \mathbb{E}[(c - c) \times (Y - \bar{y})] = \mathbb{E}[0 \times Y_0] = \mathbb{E}[0] = 0.$$

    This is sensible. If one of the variables is constant, then it doesn’t vary. If one variable doesn’t vary, then the other variable can’t “co”-vary with it.

  2. Covariance = expected product - product of expectations:

    $$\begin{aligned} \text{Cov}[X,Y] & = \mathbb{E}[(X - \bar{x})(Y - \bar{y})] = \mathbb{E}[X Y - \bar{x} Y - \bar{y} X + \bar{x} \bar{y}] \\ & = \mathbb{E}[X Y] - \bar{x}\, \mathbb{E}[Y] - \bar{y}\, \mathbb{E}[X] + \bar{x} \bar{y} \\ & = \mathbb{E}[X Y] - \bar{x} \bar{y} - \bar{y} \bar{x} + \bar{x} \bar{y} \\ & = \mathbb{E}[X Y] - \bar{x} \bar{y}. \end{aligned}$$

    So:

    $$\text{Cov}[X,Y] = \mathbb{E}[X \times Y] - \mathbb{E}[X] \times \mathbb{E}[Y].$$

    This generalizes the formula for variance as an expected square minus a squared expectation. The covariance is the expected product minus the product of the expectations.

The algebraic properties of covariance generalize the algebraic properties of variance since variances are covariances.

$$\text{Cov}[X,X] = \mathbb{E}[(X - \bar{x}) \times (X - \bar{x})] = \mathbb{E}[(X - \bar{x})^2] = \text{Var}[X].$$

So, the variance in a random variable is the covariance between the variable and itself.

Let’s check whether covariances are scale invariant:

$$\begin{aligned} \text{Cov}[a X , b Y] & = \mathbb{E}[(a X - \mathbb{E}[a X]) \times (b Y - \mathbb{E}[b Y])] \\ & = \mathbb{E}[a (X - \bar{x}) \times b (Y - \bar{y})] = \mathbb{E}[a X_0 \times b Y_0] = a b\, \mathbb{E}[X_0 \times Y_0] = a b\, \text{Cov}[X,Y]. \end{aligned}$$

So, covariances are not scale invariant. Instead, $\text{Cov}[a X , b Y] = ab\, \text{Cov}[X,Y]$. For instance, $\text{Cov}[2X, -3 Y] = -6\, \text{Cov}[X,Y]$. To find a scale invariant measure of association we will need to modify the covariance to eliminate its dependence on the units of $X$ and $Y$.
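
These identities are easy to spot-check on simulated data. A minimal sketch, again using an assumed bivariate normal purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
x, y = rng.multivariate_normal([0, 0], [[1.0, 0.7], [0.7, 1.0]], size=100_000).T

def cov(u, v):
    """Empirical covariance: the average product of centered samples."""
    return np.mean((u - u.mean()) * (v - v.mean()))

# 1. Shortcut formula: Cov[X, Y] = E[XY] - E[X] E[Y].
print(cov(x, y), np.mean(x * y) - x.mean() * y.mean())

# 2. Variance as a covariance: Cov[X, X] = Var[X].
print(cov(x, x), np.var(x))

# 3. Scaling: Cov[aX, bY] = a b Cov[X, Y].
a, b = 2.0, -3.0
print(cov(a * x, b * y), a * b * cov(x, y))
```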

Correlation

Definition

What should we do to ensure scale invariance?

Think back to how we ensured translation invariance. We started by modifying our variable. Specifically, we applied a transformation to standardize some aspect of the distribution. We translated the distribution to give it a standard expectation (expectation zero).

We can generalize this standardization procedure by translating, then scaling, our random variable. We will translate to give the random variable a standard expectation (give the distribution a standard center). Then, we will scale the random variable. Scaling a random variable dilates its distribution. We will scale so that the distribution has a standard width.

Recall the following definitions from Section 4.3: the centered variable is $X_0 = X - \mathbb{E}[X]$, and the standardized variable is $Z = X_0/\text{SD}[X]$, which has expectation 0 and standard deviation 1.

The standardized variable $Z$ is scale invariant since standardizing fixes the width of the distribution. To check, let’s standardize the variable $a X_0$:

$$\frac{a X_0}{\text{SD}[a X_0]} = \frac{a X_0}{|a|\, \text{SD}[X_0]} = \text{sign}(a)\, Z$$

where $Z = X_0/\text{SD}[X_0]$. So, as long as $a > 0$, the scaled variable $a X_0$ produces the same standardized variable as $X_0$.
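
A quick numerical check that standardization removes the scale: for any $a > 0$, the standardized versions of $X$ and $aX$ are identical, while a negative scale factor only flips the sign.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(loc=5.0, scale=2.0, size=10)

def standardize(u):
    """Center the samples, then divide by their standard deviation."""
    return (u - u.mean()) / u.std()

a = 7.3  # an arbitrary positive scale factor
print(np.allclose(standardize(x), standardize(a * x)))    # True: same standardization
print(np.allclose(standardize(x), -standardize(-a * x)))  # True: negative scales flip the sign
```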

So, to define a scale invariant measure of association, we will compute the covariance in the standardizations of $X$ and $Y$. The covariance in a standardized pair of variables is the correlation between the variables.

To compute the correlation in two random variables, standardize, then find the covariance:

$$X_s = X_0/\text{SD}[X], \quad Y_s = Y_0/\text{SD}[Y]$$

so:

$$\begin{aligned} \text{Corr}[X,Y] & = \text{Cov}[X_s,Y_s] = \mathbb{E}\left[ \frac{X_0}{\text{SD}[X]} \times \frac{Y_0}{\text{SD}[Y]} \right] \\ & = \frac{\mathbb{E}[X_0 \times Y_0]}{\text{SD}[X]\, \text{SD}[Y]} = \frac{\text{Cov}[X,Y]}{\text{SD}[X]\, \text{SD}[Y]}. \end{aligned}$$

So, we can compute the correlation in two random variables from the covariance in the random variables and their standard deviations.
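
In code, the correlation is just the covariance divided by the two standard deviations (equivalently, the covariance of the standardized samples); numpy’s `np.corrcoef` computes the same quantity. The bivariate normal below is an assumed distribution with population correlation 0.7:

```python
import numpy as np

rng = np.random.default_rng(4)
x, y = rng.multivariate_normal([0, 0], [[1.0, 0.7], [0.7, 1.0]], size=100_000).T

cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
corr = cov_xy / (x.std() * y.std())

print(corr)                      # close to 0.7
print(np.corrcoef(x, y)[0, 1])   # numpy agrees
```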

The correlation achieves all the desiderata and invariances described at the start of the chapter. Like the covariance, it is translation invariant, its sign matches the sign of the association, and it is larger in magnitude the stronger the association between $X$ and $Y$. In addition, it is unitless, since standardized variables are unitless. Therefore, it is scale invariant. It will not change if you change the units used to evaluate either or both random variables.

Interpretation as a Measure of Association

Because the correlation is unitless, and defined using standardized variables, the value of the correlation can be interpreted on a standard scale:

  1. The correlation always falls between -1 and 1.

  2. The correlation does not depend on the units of either variable, so values can be compared across different pairs of variables.

  3. Correlations near $\pm 1$ indicate a strong, nearly linear association, while correlations near 0 indicate a weak association.

The last point is important. It means that we can read correlation values and know immediately whether the variables are strongly or weakly associated. This is why we usually talk about the correlation between two variables, not the covariance, when we want to say they are strongly or weakly associated.

It is important that you develop an intuitive sense for large and small correlations. Navigate to this Correlation Guessing Game to practice guessing correlation values from collections of sampled $[X,Y]$ pairs.
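
If you want to generate practice clouds yourself, here is a small sketch that draws samples with a chosen correlation. It uses a bivariate normal; the guessing game may well use other distributions:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(5)

def sample_with_correlation(rho, n=300):
    """Draw n pairs from a bivariate normal with correlation rho."""
    return rng.multivariate_normal([0, 0], [[1.0, rho], [rho, 1.0]], size=n).T

fig, axes = plt.subplots(1, 4, figsize=(12, 3))
for ax, rho in zip(axes, [0.0, 0.3, 0.7, 0.95]):
    x, y = sample_with_correlation(rho)
    ax.scatter(x, y, s=8)
    ax.set_title(f"Corr = {rho}")
plt.tight_layout()
plt.show()
```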

Geometric Interpretation

To prove that correlations are bounded between -1 and 1, and only equal -1 or +1 in the case when $X$ and $Y$ must fall on a line, we’ll adopt a geometric interpretation of the covariance and the correlation.

First, to simplify our approach, imagine independently sampling $n$ pairs $\{[X_j,Y_j]\}_{j=1}^n$ from the joint distribution for $X$ and $Y$. Then, let $\vec{x} = [x_1,x_2,...,x_n]$ represent the vector of observed $x$ samples, and $\vec{y} = [y_1,y_2,...,y_n]$ represent the vector of observed $y$ samples. Then, we could approximate the variances and covariances needed to compute a correlation with sample averages:

$$\text{Var}[X] \approx \frac{1}{n} \sum_{j=1}^n (x_j - \bar{x})^2, \quad \text{Var}[Y] \approx \frac{1}{n} \sum_{j=1}^n (y_j - \bar{y})^2, \quad \text{Cov}[X,Y] \approx \frac{1}{n} \sum_{j=1}^n (x_j - \bar{x})(y_j - \bar{y})$$

where:

$$\bar{x} = \frac{1}{n} \sum_{j=1}^n x_j, \quad \bar{y} = \frac{1}{n} \sum_{j=1}^n y_j$$

are the empirical averages of the sampled XX and YY values.

These approximations are not exact, but will converge in the limit of infinite nn. We’ll go into more detail on the convergence of sample averages to expectation in the last chapter of the book (see Section 13). For now, as we have in previous chapters, we’ll take the relation on faith.

Now, examine the empirical variances. Each is, up to scaling by $1/n$, a sum of squares of centered samples. Let $\vec{x}_0 = \vec{x} - \bar{x}$ and $\vec{y}_0 = \vec{y} - \bar{y}$ denote the centered vectors. Then, the empirical variance in the sampled $X$ values is the sum of the squared entries of the centered vector $\vec{x}_0$, divided by $n$. The sum of the squares of the entries of a vector is the square of its length (see Section 8.1). So:

$$\text{Var}[X] \approx \frac{1}{n} \|\vec{x}_0\|^2, \quad \text{Var}[Y] \approx \frac{1}{n} \|\vec{y}_0\|^2.$$

Therefore, the empirical standard deviations are proportional to the lengths of each vector:

$$\text{SD}[X] \approx \frac{1}{\sqrt{n}} \|\vec{x}_0\|, \quad \text{SD}[Y] \approx \frac{1}{\sqrt{n}} \|\vec{y}_0\|.$$

Next, consider the covariance. The empirical covariance is, up to scaling by $1/n$, the sum of the entrywise products of $\vec{x}_0$ and $\vec{y}_0$. In other words, the empirical covariance is the inner product between $\vec{x}_0$ and $\vec{y}_0$, divided by $n$:

$$\text{Cov}[X,Y] \approx \frac{1}{n}\, \vec{x}_0 \cdot \vec{y}_0.$$

So, we can estimate the correlation between $X$ and $Y$ with:

$$\text{Corr}[X,Y] = \frac{\text{Cov}[X,Y]}{\text{SD}[X]\, \text{SD}[Y]} \approx \frac{\frac{1}{n}\, \vec{x}_0 \cdot \vec{y}_0}{\frac{1}{n}\, \|\vec{x}_0\|\, \|\vec{y}_0\|}.$$

Cancelling the shared factor of $1/n$:

$$\text{Corr}[X,Y] \approx \frac{ \vec{x}_0 \cdot \vec{y}_0}{\|\vec{x}_0\|\, \|\vec{y}_0\|}.$$

Recall that the inner product between two vectors equals the product of their lengths times the cosine of the angle between them. Therefore:

$$\text{Corr}[X,Y] \approx \cos(\theta)$$

where $\theta$ is the angle between the centered vector of sampled $X$ values and the centered vector of sampled $Y$ values. This approximation becomes exact in the limit as $n$ goes to infinity, since, as $n$ goes to infinity, each sample average converges to an expectation against the background distribution.
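
We can verify the cosine interpretation directly: center the sampled vectors, compute the cosine of the angle between them, and compare to the empirical correlation. The bivariate normal below is, again, just an assumed example distribution:

```python
import numpy as np

rng = np.random.default_rng(6)
x, y = rng.multivariate_normal([0, 0], [[1.0, 0.7], [0.7, 1.0]], size=10_000).T

# Center the sampled vectors.
x0 = x - x.mean()
y0 = y - y.mean()

# Cosine of the angle between the centered vectors ...
cos_theta = np.dot(x0, y0) / (np.linalg.norm(x0) * np.linalg.norm(y0))

# ... matches the empirical correlation.
print(cos_theta)
print(np.corrcoef(x, y)[0, 1])
```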

We can use this interpretation to show that the correlation is bounded between -1 and 1, and only equals $\pm 1$ if $X$ and $Y$ satisfy a linear relationship.

First, assume $n$ is very large, so that the approximation is effectively an equality. Then, $\text{Corr}[X,Y] = \cos(\theta)$ for some angle $\theta$. The cosine of any angle is bounded between -1 and 1, so $\text{Corr}[X,Y] \in [-1,1]$.

The cosine of $\theta$ only equals $\pm 1$ if $\theta = 0^{\circ}$ or $\theta = 180^{\circ}$. In other words, the cosine of the angle between two vectors is only $\pm 1$ if the vectors point in the same, or opposite, directions. In either case, the vectors must be parallel. So, $\text{Corr}[X,Y] = \pm 1$ if and only if we can guarantee that the centered vectors produced by jointly sampling $X$ and $Y$, then centering, are parallel.

This requires $\vec{x}_0 \propto \vec{y}_0$, i.e. $\vec{y}_0 = \lambda \vec{x}_0$ for some $\lambda$. Then, in terms of the original vectors of samples:

$$\vec{y} - \bar{y} = \lambda (\vec{x} - \bar{x}) \quad \Rightarrow \quad \vec{y} = \lambda(\vec{x} - \bar{x}) + \bar{y}.$$

So, expanding elementwise, we need:

$$y_j = \lambda(x_j - \bar{x}) + \bar{y}.$$

In other words, we need all but a vanishing fraction of the sample pairs $[x_j,y_j]$ to satisfy some linear relationship. We can only guarantee this if we can guarantee that $Y = m X + b$ for some slope and intercept $m$ and $b$ when $[X,Y]$ are drawn jointly. In other words, the correlation in $[X,Y]$ can only equal $\pm 1$ if $Y$ is a linear function of $X$.

The converse is also true. If $Y = m X + b$ for some $m \neq 0$ and $b$, then the correlation equals $\pm 1$, with sign matching the sign of $m$. You will check this fact on your homework.
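
Here is a quick numerical version of that check (the algebraic version is the homework exercise): samples satisfying an exact linear relationship have correlation +1 or -1, with the sign of the slope.

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(size=1000)

print(np.corrcoef(x,  2.0 * x + 5.0)[0, 1])   # +1: positive slope
print(np.corrcoef(x, -0.5 * x + 3.0)[0, 1])   # -1: negative slope
```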

Correlation and Dependence

The correlation in two variables is related to, but distinct from, the dependence or independence of the variables. Correlations look for a particular type of relationship between the variables. We will show in Section 11.2 that the correlation in two variables equals the slope of the best fit line in the standardized variables. As a result, if the variables depend on each other, but don’t share a linear relationship, they may be weakly correlated. So, weak correlations may not imply weak dependence. That said, if two variables are independent, then they cannot be correlated.

So, correlations (and covariances) satisfy the desired characteristic that, when $X$ and $Y$ are independent, the correlation and covariance in $X$ and $Y$ are both zero. It follows that, if $\text{Corr}[X,Y] \neq 0$, then $X$ and $Y$ must be dependent variables. So correlated variables are dependent.

Beware. Uncorrelated variables may also be dependent.

Here’s another example where observing $X$ would provide information about $Y$, but the two variables don’t share a positive or negative association, so they are uncorrelated:

Uncorrelated but dependent variables 2.
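
One classic instance of this phenomenon (not necessarily the example pictured above): if $X$ is symmetric about zero and $Y = X^2$, then $Y$ is completely determined by $X$, yet $\text{Cov}[X,Y] = \mathbb{E}[X^3] - \mathbb{E}[X]\,\mathbb{E}[X^2] = 0$.

```python
import numpy as np

rng = np.random.default_rng(8)
x = rng.normal(size=1_000_000)  # symmetric about zero
y = x ** 2                      # Y is a deterministic function of X: clearly dependent

# The empirical correlation is (close to) zero despite the dependence.
print(np.corrcoef(x, y)[0, 1])
```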

Remember: independent means uncorrelated, and correlated means dependent, but uncorrelated does not mean independent except in special cases (indicator variables or jointly normal random variables).