
11.3 Chapter Summary

Covariance and Correlation

All definitions and results are available in Section 11.1.

  1. The covariance between the random variables $X$ and $Y$ is defined:

    $\text{Cov}[X,Y] = \mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])]$

    where $\bar{x} = \mathbb{E}[X]$ and $\bar{y} = \mathbb{E}[Y]$. The variables $X_0 = X - \mathbb{E}[X]$ and $Y_0 = Y - \mathbb{E}[Y]$ are centered.

    It may be expanded as the expected product of the variables minus the product of their expectations:

    $\text{Cov}[X,Y] = \mathbb{E}[X \times Y] - \mathbb{E}[X] \times \mathbb{E}[Y].$

    Both forms are checked numerically in the first sketch following this list.
  2. Properties of covariance:

    • The covariance is unchanged by translations (adding constants) to the variables: $\text{Cov}[X+s, Y+t] = \text{Cov}[X,Y]$.

    • The covariance does depend on the scale of each variable: $\text{Cov}[aX, bY] = ab\,\text{Cov}[X,Y]$.

    • The sign of the covariance indicates the sign of the association between two variables.

    • The covariance is zero if $X$ and $Y$ are independent. However, dependent variables may also have a covariance of zero.

    • The covariance between a random variable and itself is the variance: $\text{Cov}[X,X] = \text{Var}[X]$.

    • The covariance between any random variable and a constant is zero.

  3. A standard random variable is a random variable with expectation 0 and standard deviation 1. We can standardize any variable by centering it, then scaling by its standard deviation:

    $X_s = \frac{X - \mathbb{E}[X]}{\text{SD}[X]}.$
  4. The correlation between two random variables $X$ and $Y$ is defined as the covariance of the standardized variables. It may be computed as:

    $\text{Corr}[X,Y] = \frac{\text{Cov}[X,Y]}{\text{SD}[X]\,\text{SD}[Y]}.$
  5. Properties of correlation:

    • The correlation is unchanged by translations (adding constants) to the variables: $\text{Corr}[X+s, Y+t] = \text{Corr}[X,Y]$.

    • The correlation does not depend on the scale of each variable: $\text{Corr}[aX, bY] = \text{Corr}[X,Y]$ if $a > 0$ and $b > 0$.

    • The sign of the correlation indicates the sign of the association between two variables.

    • The correlation is zero if $X$ and $Y$ are independent. However, dependent variables may also be uncorrelated (have correlation equal to 0).

    • The correlation lies in $[-1,+1]$, and $|\text{Corr}[X,Y]| = 1$ if and only if $Y$ is a linear function of $X$.

  6. Correlation geometry: the empirical estimate of the correlation given $n$ sample pairs $\{x_j, y_j\}_{j=1}^n$ is:

    $\text{Corr}[X,Y] = \frac{\frac{1}{n} \sum_{j=1}^n (x_j - \bar{x})(y_j - \bar{y})}{\sqrt{\frac{1}{n} \sum_{j=1}^n (x_j - \bar{x})^2}\, \sqrt{\frac{1}{n} \sum_{j=1}^n (y_j - \bar{y})^2}}$

    where $\bar{x} = \frac{1}{n} \sum_{j=1}^n x_j$ and $\bar{y} = \frac{1}{n} \sum_{j=1}^n y_j$.

    The empirical estimate equals the cosine of the angle between the centered data vectors, $\vec{x}_0 = \vec{x} - \bar{x}$ and $\vec{y}_0 = \vec{y} - \bar{y}$, where $\vec{x} = [x_1, x_2, \ldots, x_n]$ and $\vec{y} = [y_1, y_2, \ldots, y_n]$. This identity is demonstrated in the second sketch following this list.
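
As a quick numerical check of the definitions above, here is a minimal sketch (assuming NumPy is available; the simulated data and variable names are purely illustrative) that computes the covariance both ways and recovers the correlation from the standardized variables:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sample: y is a noisy linear function of x, so the two
# variables are positively associated.
x = rng.normal(size=1000)
y = 2.0 * x + rng.normal(size=1000)

# Covariance as the expected product of the centered variables.
cov_centered = np.mean((x - x.mean()) * (y - y.mean()))

# Equivalent expansion: E[XY] - E[X] * E[Y].
cov_expanded = np.mean(x * y) - x.mean() * y.mean()

# Standardize each variable: center it, then scale by its standard deviation.
x_s = (x - x.mean()) / x.std()
y_s = (y - y.mean()) / y.std()

# Correlation as the covariance of the standardized variables ...
corr_standardized = np.mean(x_s * y_s)

# ... which matches Cov[X, Y] / (SD[X] * SD[Y]).
corr_ratio = cov_centered / (x.std() * y.std())

print(cov_centered, cov_expanded)     # equal up to floating-point error
print(corr_standardized, corr_ratio)  # equal, and always in [-1, +1]
```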
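
The geometric identity in item 6 can be checked the same way: under the same NumPy assumption, and again with illustrative data, the cosine of the angle between the centered data vectors matches the empirical correlation.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = -1.5 * x + rng.normal(size=500)  # negatively associated sample

# Center the data vectors.
x0 = x - x.mean()
y0 = y - y.mean()

# Cosine of the angle between the centered vectors: their dot product
# divided by the product of their lengths.
cos_angle = np.dot(x0, y0) / (np.linalg.norm(x0) * np.linalg.norm(y0))

# Empirical correlation from the formula above (the 1/n factors cancel).
corr = np.mean(x0 * y0) / (x.std() * y.std())

print(cos_angle, corr)  # identical up to floating-point error
```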

Best Fit Lines (Linear Regression)

All definitions and results are available in Section 11.2.

  1. The best fit line to a collection of sample data points $\{x_j, y_j\}_{j=1}^n$ is a line $\hat{y}(x) = mx + b$ whose coefficients $[m, b]$ are selected to minimize some measure of error in the fit line, commonly called a loss.

    If the user adopts the least squares loss:

    $\mathcal{L}(m, b; \{x_j, y_j\}_{j=1}^n) = \frac{1}{n} \sum_{j=1}^n ((mx_j + b) - y_j)^2$

    then the best fit line is:

    $\hat{y}_*(x) = m_*(x - \bar{x}) + \bar{y}$

    where $\bar{x} = \frac{1}{n} \sum_{j=1}^n x_j$ and $\bar{y} = \frac{1}{n} \sum_{j=1}^n y_j$ are the averages of the sample data. The best fit slope, $m_*$, is:

    $m_* = \frac{\frac{1}{n} \sum_{j=1}^n (x_j - \bar{x})(y_j - \bar{y})}{\frac{1}{n} \sum_{j=1}^n (x_j - \bar{x})^2}.$

    This closed form is verified numerically in the first sketch following this list.
  2. Interpretation:

    • The best fit intercept is chosen so that the best fit line to the centered variables passes through the origin:

      $(\hat{y}_*(x) - \bar{y}) = m_*(x - \bar{x})$

      or, equivalently, so that the best fit line passes through the point $[\bar{x}, \bar{y}]$.

    • The best fit slope equals the empirical covariance between the sampled $x$ and $y$ values, divided by the empirical variance of the sampled $x$ values.

    • If we standardize our variables, then the best fit slope is the empirical correlation between the sampled $x$ and $y$ values.

      So, the correlation between $X$ and $Y$ equals the slope of the best fit line to the distribution (to a collection of infinitely many samples) in standard units; the second sketch following this list checks this identity on a finite sample.
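
As a numerical check of the closed-form solution in item 1, here is a minimal sketch (again assuming NumPy; the simulated data is illustrative) that computes $m_*$ and the implied intercept, then compares them to NumPy's least squares polynomial fit:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0.0, 10.0, size=200)
y = 3.0 * x + 1.0 + rng.normal(scale=2.0, size=200)

x_bar, y_bar = x.mean(), y.mean()

# Best fit slope: empirical covariance of the x and y values divided by
# the empirical variance of the x values.
m_star = np.mean((x - x_bar) * (y - y_bar)) / np.mean((x - x_bar) ** 2)

# The best fit line passes through [x_bar, y_bar]:
#   y_hat(x) = m_star * (x - x_bar) + y_bar
# so the intercept is b_star = y_bar - m_star * x_bar.
b_star = y_bar - m_star * x_bar

# Compare against NumPy's degree-1 least squares fit.
m_np, b_np = np.polyfit(x, y, deg=1)
print(m_star, m_np)  # agree up to floating-point error
print(b_star, b_np)
```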
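
Finally, a sketch of the last interpretation bullet: in standard units, the best fit slope is the empirical correlation (same NumPy assumption, illustrative data).

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=300)
y = 0.5 * x + rng.normal(size=300)

# Standardize both variables.
x_s = (x - x.mean()) / x.std()
y_s = (y - y.mean()) / y.std()

# Best fit slope in standard units: Cov / Var, where Var of x_s is 1.
slope_std = np.mean(x_s * y_s) / np.mean(x_s ** 2)

# Empirical correlation of the original samples.
corr = np.corrcoef(x, y)[0, 1]

print(slope_std, corr)  # equal up to floating-point error
```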