
11.3 Chapter Summary

Covariance and Correlation

All definitions and results are available in Section 11.1.

  1. The covariance between the random variables $X$ and $Y$ is defined:

    $\text{Cov}[X,Y] = \mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])]$

    where $\bar{x} = \mathbb{E}[X]$ and $\bar{y} = \mathbb{E}[Y]$. The variables $X_0 = X - \mathbb{E}[X]$ and $Y_0 = Y - \mathbb{E}[Y]$ are centered.

    It may be expanded as the expected product of the variables minus the product of their expectations:

    $\text{Cov}[X,Y] = \mathbb{E}[X \times Y] - \mathbb{E}[X] \times \mathbb{E}[Y].$

    Both forms are checked numerically in the first sketch following this list.
  2. Properties of covariance:

    • The covariance is unchanged by translations (adding constants) to the variables: $\text{Cov}[X+s, Y+t] = \text{Cov}[X,Y]$.

    • The covariance does depend on the scale of each variable: $\text{Cov}[aX, bY] = ab\,\text{Cov}[X,Y]$.

    • The sign of the covariance indicates the sign of the association between two variables.

    • The covariance is zero if $X$ and $Y$ are independent. However, dependent variables may also have a covariance of zero.

    • The covariance between a random variable and itself is the variance: $\text{Cov}[X,X] = \text{Var}[X]$.

    • The covariance between any random variable and a constant is zero.

  3. A standard random variable is a random variable with expectation 0 and standard deviation 1. We can standardize any variable by centering it, then scaling by its standard deviation:

    $X_s = \frac{X - \mathbb{E}[X]}{\text{SD}[X]}.$
  4. The correlation between two random variables $X$ and $Y$ is defined as the covariance of the standardized variables. It may be computed as:

    $\text{Corr}[X,Y] = \frac{\text{Cov}[X,Y]}{\text{SD}[X]\,\text{SD}[Y]}.$
  5. Properties of correlation:

    • The correlation is unchanged by translations (adding constants) to the variables: $\text{Corr}[X+s, Y+t] = \text{Corr}[X,Y]$.

    • The correlation does not depend on the scale of each variable: $\text{Corr}[aX, bY] = \text{Corr}[X,Y]$ if $a > 0$ and $b > 0$.

    • The sign of the correlation indicates the sign of the association between two variables.

    • The correlation is zero if $X$ and $Y$ are independent. However, dependent variables may also be uncorrelated (have correlation equal to 0).

    • The correlation lies in $[-1,+1]$, and $|\text{Corr}[X,Y]| = 1$ if and only if $Y$ is a linear function of $X$.

  6. Correlation geometry: the empirical estimate of the correlation given $n$ sample pairs $\{x_j, y_j\}_{j=1}^n$ is:

    $\text{Corr}[X,Y] = \frac{\frac{1}{n} \sum_{j=1}^n (x_j - \bar{x})(y_j - \bar{y})}{\sqrt{\frac{1}{n} \sum_{j=1}^n (x_j - \bar{x})^2}\, \sqrt{\frac{1}{n} \sum_{j=1}^n (y_j - \bar{y})^2}}$

    where $\bar{x} = \frac{1}{n} \sum_{j=1}^n x_j$ and $\bar{y} = \frac{1}{n} \sum_{j=1}^n y_j$.

    The empirical estimate equals the cosine of the angle between the centered data vectors, $\vec{x}_0 = \vec{x} - \bar{x}$ and $\vec{y}_0 = \vec{y} - \bar{y}$, where $\vec{x} = [x_1, x_2, \ldots, x_n]$ and $\vec{y} = [y_1, y_2, \ldots, y_n]$. This identity is demonstrated in the second sketch following this list.
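
As a quick numerical check of the definitions above, here is a minimal sketch (assuming NumPy is available; the simulated data and variable names are purely illustrative) that computes the covariance both ways and recovers the correlation from the standardized variables:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sample: y is a noisy linear function of x, so the two
# variables are positively associated.
x = rng.normal(size=1000)
y = 2.0 * x + rng.normal(size=1000)

# Covariance as the expected product of the centered variables.
cov_centered = np.mean((x - x.mean()) * (y - y.mean()))

# Equivalent expansion: E[XY] - E[X] * E[Y].
cov_expanded = np.mean(x * y) - x.mean() * y.mean()

# Standardize each variable: center it, then scale by its standard deviation.
x_s = (x - x.mean()) / x.std()
y_s = (y - y.mean()) / y.std()

# Correlation as the covariance of the standardized variables ...
corr_standardized = np.mean(x_s * y_s)

# ... which matches Cov[X, Y] / (SD[X] * SD[Y]).
corr_ratio = cov_centered / (x.std() * y.std())

print(cov_centered, cov_expanded)     # equal up to floating-point error
print(corr_standardized, corr_ratio)  # equal, and always in [-1, +1]
```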
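
The geometric identity in item 6 can be checked the same way: under the same NumPy assumption, and again with illustrative data, the cosine of the angle between the centered data vectors matches the empirical correlation.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = -1.5 * x + rng.normal(size=500)  # negatively associated sample

# Center the data vectors.
x0 = x - x.mean()
y0 = y - y.mean()

# Cosine of the angle between the centered vectors: their dot product
# divided by the product of their lengths.
cos_angle = np.dot(x0, y0) / (np.linalg.norm(x0) * np.linalg.norm(y0))

# Empirical correlation from the formula above (the 1/n factors cancel).
corr = np.mean(x0 * y0) / (x.std() * y.std())

print(cos_angle, corr)  # identical up to floating-point error
```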

Best Fit Lines (Linear Regression)

All definitions and results are available in Section 11.2.

  1. The best fit line to a collection of sample data points $\{x_j, y_j\}_{j=1}^n$ is a line $\hat{y}(x) = mx + b$ whose coefficients $[m, b]$ are selected to minimize some measure of error in the fit line, commonly called a loss.

    If the user adopts the least squares loss:

    $\mathcal{L}(m, b; \{x_j, y_j\}_{j=1}^n) = \frac{1}{n} \sum_{j=1}^n ((mx_j + b) - y_j)^2$

    then the best fit line is:

    $\hat{y}_*(x) = m_*(x - \bar{x}) + \bar{y}$

    where $\bar{x} = \frac{1}{n} \sum_{j=1}^n x_j$ and $\bar{y} = \frac{1}{n} \sum_{j=1}^n y_j$ are the averages of the sample data. The best fit slope, $m_*$, is:

    $m_* = \frac{\frac{1}{n} \sum_{j=1}^n (x_j - \bar{x})(y_j - \bar{y})}{\frac{1}{n} \sum_{j=1}^n (x_j - \bar{x})^2}.$

    This closed form is verified numerically in the first sketch following this list.
  2. Interpretation:

    • The best fit intercept is chosen so that the best fit line to the centered variables passes through the origin:

      $(\hat{y}_*(x) - \bar{y}) = m_*(x - \bar{x})$

      or, equivalently, so that the best fit line passes through the point $[\bar{x}, \bar{y}]$.

    • The best fit slope equals the empirical covariance between the sampled $x$ and $y$ values, divided by the empirical variance of the sampled $x$ values.

    • If we standardize our variables, then the best fit slope is the empirical correlation between the sampled $x$ and $y$ values.

      So, the correlation between $X$ and $Y$ equals the slope of the best fit line to the distribution (to a collection of infinitely many samples) in standard units; the second sketch following this list checks this identity on a finite sample.
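
As a numerical check of the closed-form solution in item 1, here is a minimal sketch (again assuming NumPy; the simulated data is illustrative) that computes $m_*$ and the implied intercept, then compares them to NumPy's least squares polynomial fit:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0.0, 10.0, size=200)
y = 3.0 * x + 1.0 + rng.normal(scale=2.0, size=200)

x_bar, y_bar = x.mean(), y.mean()

# Best fit slope: empirical covariance of the x and y values divided by
# the empirical variance of the x values.
m_star = np.mean((x - x_bar) * (y - y_bar)) / np.mean((x - x_bar) ** 2)

# The best fit line passes through [x_bar, y_bar]:
#   y_hat(x) = m_star * (x - x_bar) + y_bar
# so the intercept is b_star = y_bar - m_star * x_bar.
b_star = y_bar - m_star * x_bar

# Compare against NumPy's degree-1 least squares fit.
m_np, b_np = np.polyfit(x, y, deg=1)
print(m_star, m_np)  # agree up to floating-point error
print(b_star, b_np)
```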
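
Finally, a sketch of the last interpretation bullet: in standard units, the best fit slope is the empirical correlation (same NumPy assumption, illustrative data).

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=300)
y = 0.5 * x + rng.normal(size=300)

# Standardize both variables.
x_s = (x - x.mean()) / x.std()
y_s = (y - y.mean()) / y.std()

# Best fit slope in standard units: Cov / Var, where Var of x_s is 1.
slope_std = np.mean(x_s * y_s) / np.mean(x_s ** 2)

# Empirical correlation of the original samples.
corr = np.corrcoef(x, y)[0, 1]

print(slope_std, corr)  # equal up to floating-point error
```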