Chapter Overview
Where We Are
Chapters 8-10 focused on random vectors and their distributions. We've seen how to define a random vector as an ordered collection of random variables, define distribution functions for random vectors as functions of multiple variables, optimize those functions to find modes and to estimate unknown parameters, and integrate or sum those functions to compute chances and expectations.
Where We’re Going
In this chapter, we will answer the question, “how related are $X$ and $Y$?”, using their joint distribution. A method for measuring the relationship between two variables is called a measure of association. The more strongly associated two variables are, the more we can learn about one by observing the other.
We will focus on two linear measures of association: covariance and correlation. We will conclude by interpreting covariance and correlation in terms of the slope of the best fit line relating $X$ and $Y$.
Note: there are other ways to measure association that don’t look for a linear relationship between $X$ and $Y$. Most importantly, mutual information measures how much the distribution of one variable is expected to change once we observe the other. We focus on linear relationships here since they are the most widely used and are a useful tool in a variety of applied regression and estimation problems. If you’d like to learn more about other measures of association, come ask us, or take a class in information theory!
We will:
1. Define covariance and correlation (see Section 11.1). We will:
   - Motivate both measures based on desired characteristics for a measure of association.
   - Identify invariants of both measures (e.g. covariance does not depend on how we center our data, correlation does not depend on how we scale our data).
   - Explain correlation as covariance in standard units.
   - Relate covariance and correlation to independence and dependence.
   - Provide bounds on correlation and interpret the situations when correlation equals $\pm 1$ or 0.
   - Suggest resources to practice “eyeballing” correlation from observed data.
2. Recall the linear regression problem first introduced in Section 9.3, then reinterpret correlation and covariance in terms of the slope of the best fit line relating $X$ and $Y$ (see Section 10.2). In particular, we will show that the correlation between two random variables equals the slope of the best fit line relating those variables, after converting to standard units.
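As a preview of the claims listed above, here is a minimal numerical sketch. It is not the chapter's derivation. It assumes simulated data, uses the "divide by n" convention for averages, and every function and variable name (`covariance`, `correlation`, `standard_units`, `best_fit_slope`, `xs`, `ys`) is made up for illustration. It checks three things on one sample: covariance is unchanged when we shift the data, correlation is unchanged when we rescale the data by positive constants, and the least-squares slope in standard units matches the correlation.

```python
import math
import random


def mean(values):
    """Average of a list of numbers."""
    return sum(values) / len(values)


def covariance(xs, ys):
    """Average product of deviations from the means (divide-by-n convention)."""
    mx, my = mean(xs), mean(ys)
    return mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])


def sd(xs):
    """Standard deviation: square root of the covariance of xs with itself."""
    return math.sqrt(covariance(xs, xs))


def correlation(xs, ys):
    """Covariance divided by the product of the standard deviations."""
    return covariance(xs, ys) / (sd(xs) * sd(ys))


def standard_units(xs):
    """Center at the mean and divide by the standard deviation."""
    m, s = mean(xs), sd(xs)
    return [(x - m) / s for x in xs]


def best_fit_slope(xs, ys):
    """Least-squares slope of the line predicting ys from xs."""
    return covariance(xs, ys) / covariance(xs, xs)


# Simulated data (an assumption for illustration): ys is a noisy linear function of xs.
random.seed(0)
xs = [random.gauss(0, 1) for _ in range(1000)]
ys = [2 * x + random.gauss(0, 1) for x in xs]

r = correlation(xs, ys)

# Shifting both variables leaves covariance unchanged.
cov_shifted = covariance([x + 5 for x in xs], [y - 3 for y in ys])

# Rescaling by positive constants leaves correlation unchanged.
r_rescaled = correlation([10 * x for x in xs], [0.5 * y for y in ys])

# In standard units, the slope of the best fit line matches the correlation.
slope_std = best_fit_slope(standard_units(xs), standard_units(ys))

print("correlation:", r)
print("covariance, original vs shifted:", covariance(xs, ys), cov_shifted)
print("correlation after rescaling:", r_rescaled)
print("best fit slope in standard units:", slope_std)
```

The printed values agree only up to floating-point rounding. Note that the scaling check uses positive constants: multiplying one variable by a negative constant flips the sign of the correlation.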