## Covariance and Correlation
All definitions and results are available in Section 11.1.
The covariance between the random variables, $X$ and $Y$, is defined:

$$\operatorname{Cov}(X, Y) = \mathbb{E}[(X - \mu_X)(Y - \mu_Y)],$$

where $\mu_X = \mathbb{E}[X]$ and $\mu_Y = \mathbb{E}[Y]$. The variables $X - \mu_X$ and $Y - \mu_Y$ are centered.

It may be expanded as the expected product of the variables minus the product of their expectations:

$$\operatorname{Cov}(X, Y) = \mathbb{E}[XY] - \mathbb{E}[X]\,\mathbb{E}[Y].$$
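As a quick numerical check of this expansion, here is a sketch using NumPy on simulated data (the sample size, seed, and distributions are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(size=n)  # y depends on x, so the covariance is nonzero

# Definition: expected product of the centered variables
cov_centered = np.mean((x - x.mean()) * (y - y.mean()))

# Expansion: E[XY] - E[X]E[Y]
cov_expanded = np.mean(x * y) - x.mean() * y.mean()

print(cov_centered, cov_expanded)  # the two forms agree
```

Both estimates approximate $\operatorname{Cov}(X, Y) = 0.5$ here, since $\operatorname{Cov}(X, 0.5X + \varepsilon) = 0.5 \operatorname{Var}(X)$ when the noise $\varepsilon$ is independent of $X$.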
Properties of covariance:
The covariance is unchanged by translations (adding constants) to the variables: $\operatorname{Cov}(X + a, Y + b) = \operatorname{Cov}(X, Y)$.
The covariance does depend on the scale of each variable: $\operatorname{Cov}(aX, bY) = ab\,\operatorname{Cov}(X, Y)$.
The sign of the covariance indicates the sign of the association between two variables.
The covariance is zero if $X$ and $Y$ are independent. However, dependent variables may also have a covariance equal to zero.
The covariance between a random variable and itself is the variance: $\operatorname{Cov}(X, X) = \operatorname{Var}(X)$.
The covariance between any random variable and a constant is zero: $\operatorname{Cov}(X, c) = 0$.
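These properties are easy to verify empirically. A minimal sketch, assuming NumPy and arbitrary simulated data:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=50_000)
y = rng.normal(size=50_000) + 0.3 * x

def cov(u, v):
    # empirical covariance: mean product of the centered samples
    return np.mean((u - u.mean()) * (v - v.mean()))

# translation invariance: Cov(X + a, Y + b) = Cov(X, Y)
print(cov(x + 10, y - 3), cov(x, y))

# scaling: Cov(aX, bY) = ab Cov(X, Y)
print(cov(2 * x, 3 * y), 6 * cov(x, y))

# Cov(X, X) = Var(X)
print(cov(x, x), x.var())
```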
A standard random variable is a random variable with expectation 0 and standard deviation 1. We can standardize any variable by centering it, then scaling by its standard deviation:

$$Z = \frac{X - \mu_X}{\sigma_X}.$$
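Standardization can be sketched in a few lines of NumPy (the mean, scale, and sample size below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=5.0, scale=2.0, size=10_000)  # arbitrary mean and scale

# standardize: center, then scale by the standard deviation
z = (x - x.mean()) / x.std()

print(z.mean(), z.std())  # approximately 0 and 1
```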
The correlation between two random variables, $X$ and $Y$, is defined as the covariance of the standardized variables. It may be computed:

$$\rho(X, Y) = \frac{\operatorname{Cov}(X, Y)}{\sigma_X \sigma_Y}.$$
Properties of correlation:
The correlation is unchanged by translations (adding constants) to the variables: $\rho(X + a, Y + b) = \rho(X, Y)$.
The correlation does not depend on the scale of each variable: $\rho(aX, bY) = \rho(X, Y)$ if $a > 0$ and $b > 0$.
The sign of the correlation indicates the sign of the association between two variables.
The correlation is zero if $X$ and $Y$ are independent. However, dependent variables may also be uncorrelated (have correlation equal to 0).
The correlation is between $-1$ and $1$, with $\rho(X, Y) = \pm 1$ if and only if $Y$ is a linear function of $X$.
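A short NumPy check of the definition and of the linear-function boundary case (the simulated variables and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=20_000)
y = 2.0 * x + rng.normal(size=20_000)

# correlation as the covariance of the standardized variables
zx = (x - x.mean()) / x.std()
zy = (y - y.mean()) / y.std()
r_standardized = np.mean(zx * zy)

# equivalently, Cov(X, Y) / (sd(X) sd(Y))
r_ratio = np.mean((x - x.mean()) * (y - y.mean())) / (x.std() * y.std())

print(r_standardized, r_ratio, np.corrcoef(x, y)[0, 1])

# a perfectly linear relation gives correlation 1
print(np.corrcoef(x, 3 * x + 1)[0, 1])
```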
Correlation geometry: the empirical estimate of the correlation given sample pairs $(x_1, y_1), \ldots, (x_n, y_n)$ is:

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}},$$

where $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$ and $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$.

The empirical estimate equals the cosine of the angle $\theta$ between the centered data vectors $\tilde{x}$ and $\tilde{y}$,

$$r = \cos(\theta) = \frac{\tilde{x} \cdot \tilde{y}}{\|\tilde{x}\| \, \|\tilde{y}\|},$$

where $\tilde{x} = (x_1 - \bar{x}, \ldots, x_n - \bar{x})$ and $\tilde{y} = (y_1 - \bar{y}, \ldots, y_n - \bar{y})$.
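The cosine identity can be sketched directly with NumPy's vector operations (the data below are arbitrary simulated samples):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=500)
y = x + rng.normal(size=500)

# centered data vectors
xc = x - x.mean()
yc = y - y.mean()

# empirical correlation as a cosine: dot product over the product of lengths
cos_angle = xc @ yc / (np.linalg.norm(xc) * np.linalg.norm(yc))

print(cos_angle, np.corrcoef(x, y)[0, 1])  # the two agree
```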
## Best Fit Lines (Linear Regression)
All definitions and results are available in Section 11.2.
The best fit line to a collection of sample data points $(x_1, y_1), \ldots, (x_n, y_n)$ is a line $\hat{y} = a + bx$ whose coefficients, $a$ and $b$, are selected to minimize some measure of error in the fit line, commonly called a loss.
If the user adopts the least squares loss:

$$L(a, b) = \sum_{i=1}^{n} \big(y_i - (a + b x_i)\big)^2,$$

then the best fit line is:

$$\hat{y} = a + bx, \qquad a = \bar{y} - b\,\bar{x},$$

where $\bar{x}$ and $\bar{y}$ are the averages of the sample data. The best fit slope, $b$, is:

$$b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}.$$
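The least squares formulas can be checked against NumPy's built-in polynomial fit. A sketch on simulated data (the true intercept 3 and slope 1.5 are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(0, 10, size=1_000)
y = 3.0 + 1.5 * x + rng.normal(size=1_000)  # noisy line: intercept 3, slope 1.5

# slope: empirical covariance of x and y over the empirical variance of x
b = np.mean((x - x.mean()) * (y - y.mean())) / np.mean((x - x.mean()) ** 2)
# intercept: chosen so the line passes through the point of averages
a = y.mean() - b * x.mean()

# compare with NumPy's least squares fit (coefficients: highest degree first)
b_np, a_np = np.polyfit(x, y, deg=1)
print(a, b, a_np, b_np)
```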
Interpretation:
The best fit intercept is chosen so that the best fit line to the centered variables passes through the origin:

$$a = \bar{y} - b\,\bar{x},$$

or, equivalently, so that the best fit line passes through the point $(\bar{x}, \bar{y})$.
The best fit slope equals the empirical covariance between the sampled $x$ and $y$ values, divided by the empirical variance of the sampled $x$ values.
If we standardize our variables, then the best fit slope is the empirical correlation between the sampled $x$ and $y$ values.
So, the correlation between $X$ and $Y$ equals the slope of the best fit line to the distribution (to a collection of infinitely many samples) in standard units.
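This last connection can be sketched empirically: after standardizing both variables, the least squares slope and the correlation coincide (the simulated data and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.normal(size=5_000)
y = -0.8 * x + rng.normal(size=5_000)

# correlation of the samples
r = np.corrcoef(x, y)[0, 1]

# best fit slope after standardizing both variables
zx = (x - x.mean()) / x.std()
zy = (y - y.mean()) / y.std()
slope_std = np.mean(zx * zy) / np.mean(zx ** 2)  # the variance of zx is 1

print(r, slope_std)  # the slope in standard units equals the correlation
```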