Expected Values - Data 89 Course Notes

This chapter will introduce methods for summarizing distributions. We will start by looking for a summary value that describes the position of the ``center" of the distribution.

Like any summary, choosing a central value, discards informastion about the distribution. Depending on what we want to describe, and what we are willing to discard, we could choose different summaries. We saw one natural summary of center in the last chapter. The mode of a distribution, or location of its peak, is often adopted as a natural descriptor of the distribution. When the distribution has one peak, the location of the peak is a natural reference value. It represents the most likely outcome when sampling from the distribution.

This chapter focuses on expected values as measures of center. Expected values are the most commonly used summary in all of statistics, data science, machine learning, and probability. They are, in essence, averages. We’ll spend this section introducing the expected value. We will define it, compare it to other measures of center, discuss its interpretation, then extend it to functions of random variables.

Definition¶

The expected value of a random variable is the weighted average of possible values of the random variable, weighted by its distribution.

We will often abbreviate the sum over all possible values of $X$ with “all $x$ ”. This avoids heavy subscripts. In general, you should assume a sum over $x$ runs over the full support of $X$ unless told otherwise.

Here’s an example. Suppose that $\mathcal{X} = \{1,2,3,4\}$ and:

$x$	1	2	3	4
$\text{PMF}(x)$	$1/2$	$1/4$	0	$1/4$

Then, to find the expected value of $X$ , multiply the entries in each column, then add the entrywise products:

\begin{aligned} \mathbb{E}[X] & = \sum_{x=1}^4 x \times \text{PMF}(x) \\& = 1 \times \frac{1}{2} + 2 \times \frac{1}{4} + 3 \times 0 + 4 \times \frac{1}{4} = 2. \end{aligned}

(2)

We can also compute expected values of continuous random variables. The formula is entirely analogous. Replace the sum over the support with an integral, and substitute the mass function for a density function.

For example, consider the dartboard demonstration in Section 2.4. The distance from the dart to the center of the dartboard was a random variable $R \in [0,1]$ with PDF $2 r$ . Therefore:

\begin{aligned} \mathbb{E}[R] & = \int_{r = 0}^1 r \times \text{PDF}(r) dr = \int_{r=0}^1 r \times (2 r) dr \\& = \int_{r = 0}^1 2 r^2 dr = \frac{2}{3} r^3 |_{r=0}^1 \\ & = \frac{2}{3}. \end{aligned}

(4)

Notice that the expected value of $R$ is not $1/2$ even though the dart’s positions were drawn uniformly. The expected value is greater than $1/2$ since the dart’s distance to the origin is not uniform. The dart is more likely to land with $R > 1/2$ than $R < 1/2$ , so its expected value is also greater than $1/2$ .

Interpretation¶

Expected values are commonly used to summarize the “center” of a distribution. Here are three interpretations of the expected value.

Center of Mass¶

Suppose that $X$ is a random variable with some PMF. Imagine the PMF as a physical distribution of mass. For instance, if $\text{PMF}(2) = 0.4$ and $\text{PMF}(8) = 0.6$ you could imagine putting a 4 pound weight at $x = 2$ and a 6 pound weight at $x = 8$ .

To find a central value, imagine that your masses are sitting on a long beam. You get to place a fulcrum under the beam. You can adjust the fulcrum, but cannot move the masses. You want to move the fulcrum to a position where the beam will balance without tipping left or right.

The position of the fulcrum that balances the beam is the center of mass of the distribution. The center of mass can be found using the same formula as the expected value:

\text{center of mass} = \sum_{\text{all } x} x \times (\text{Mass at } x) = \sum_{\text{all } x} x \times \text{PMF}(x) = \mathbb{E}[X].

(5)

The same applies for density using units of mass per length. Just replace the sum with an integral.

This physical analogy can be helpful when we want to visualize the expected value of a random variable. For example:

If the distribution is symmetric about some $x_*$ , then it must balance about $x_*$ so $x_*$ is its expected value.
If the distribution is skewed, then the expected value will move away from the mode of the distribution in the direction of the skew.
If the distribution is highly skewed, then the expected value may by far from the peak of the distribution, and may be a highly atypical value.

Expected Values are Not Typical Values

Expected values are weighted averages. So, like any weighted average, they need not correspond to possible, or even typical values. For instance:

In 2024, the average American family had 1.94 children. No family has 1.94 children.
In 2022 the national average family income, with adult earners between the ages of 35 and 65, was 170,000 dollars. That number probably seems high to you. It seems high since the average is skewed by a small portion of very wealthy households. The top 1% of American households control 31.7% of the national wealth. The bottom 50% of American households hold about 2.5% of the national wealth, less than the 3.8% controlled by the 800 billionaires in the US. Since wealth distributions are heavily skewed, most economists and labor statisticians use medians. The median incomde was between 80,000 and 90,000 dollars (see Source.)

Sample Averages¶

We’ve just seen that expectations need not be typical values. They are balance points for the distribution. This begs the question, “When do we expect an expected value?”

In short, we should expect an expected value if we draw a large collection of sample values, and average the samples This is a very common procedure in data analysis. For instance, you might:

Choose a sample of 1,000 individuals from a pool and measure their heights. Then, since your sample size is large, the average height of the 1,000 individuals in your sample will be close to $\mathbb{E}[H]$ where $H$ is the height of a uniformly selected individual. In this case, we would call the average over your sample a sample average, and the background average over the full population a population average. If you draw a large sample, then the sample average will estimate the population average closely so, before we draw the sample, it is reasonable to expect the population average.
Run a random experiment repeatedly. Each new experiment produces a new sample value $X$ . The values are independent and identical. You repeat the process 1,000 times, then average all your sampled values. Since your sample size is large, your sample average is likely close to some background value. In this case there is no population average since we are just running a process, not picking from a fixed pool. However, the likely sample average will still be concentrated around some value. That value is the expected value of $X$ , $\mathbb{E}[X]$ . Formally, the probability that the sample average differs from the expectation by any fixed amount vanishes as the number of samples increases.

These are suprisingly subtle ideas. In both cases the expectation is a background quantity that we could compute, but that we cannot observe experimentally unless we either sample the entire population, or run infinitely many experiments. Instead, we have experimental procedures that produce sample averages. Those sample averages are random, since our data is random. However, if we draw enough samples, the sample averages will closely estimate the background expectation.

So, we expect to see $\mathbb{E}[X]$ in the sense that, it is the value we should anticipate before averaging a long list of samples. The more samples we average, the more prescient our expectation will be.

Open this Law of Large Numbers Interactive to see this idea in practice. You may vary the distribution, and number of samples, then can track the sample average as you draw sampled. You will see that the sample averages approach the expectation.

Let’s see this in a bit more detail. Suppose that $X$ is a discrete random variable with finitely many possible outcomes. Suppose that we draw $n$ samples, where $n$ is a large number. Let $N(x)$ denote the number of times we see the sample value $X = x$ . Then, we can write our sample average:

\bar{X}_n = \frac{1}{n} \sum_{x \in \mathcal{X}} x N(x)

(6)

This is the usual formula for an average, grouped by possible outcomes. Now, to simplify, move the $1/n$ inside the sum:

\bar{X}_n = \sum_{x \in \mathcal{X}} x \frac{N(x)}{n}.

(7)

Notice, $N(x)/n$ is the frequency with which we saw the outcome $X = x$ . So:

\bar{X}_n = \sum_{x \in \mathcal{X}} x \text{Fr}(x)

(8)

where $\text{Fr}(x)$ is the frequency with which we saw outcome $x$ in our $n$ trials.

Recall, probabilities are meant to estimate frequencies. In Section 1.2 we defined chances as long run frequencies. So, assuming $n$ large, $\text{Fr}(x) \approx \text{Pr}(X = x) = \text{PMF}(x)$ . Therefore:

\bar{X}_n \approx \sum_{x \in \mathcal{X}} x \times \text{PMF}(x) = \mathbb{E}[X]

(9)

In other words, the sample average will be close to the expected value, and should converge to the expected value as the number of samples diverges. We haven’t proven this formally since we haven’t established exactly what we mean by converge, or that frequencies do in fact converge in the long run, but this argument captures the essential background logic. Expected values are not reasonable expectations for individual samples. Expected values are reasonable expectations for long run sample averages.

Smallest Square Error¶

Here’s a more algebraic way to define expected value.

Often, we are looking for a single summary value when we want to predict the value of a random variable $X$ . So, consider all possible values $x_*$ that we could propose as a central value. Let’s try to use $x_*$ as a prediction for $X$ .

Compute the error in our prediction between sampled $X$ and the proposed value $x_*$ via $X - x_*$ . Compute its magnitude using $(X - x_*)^2$ so that large underestimates and overestimates are both considered large errors. Pick $x_*$ so that, if we drew a long collection of samples $X$ , and averaged the squared errors, our averaged square error would be as small as possible. That is, choose $x_*$ to minimize the mean square error over a long collection of samples. Equating long run frequency to chance:

\textbf{Minimize: } \text{MSE}(x_*) = \sum_{x \in \mathcal{X}} (x - x_*)^2 \times \text{PMF}(x).

(10)

You can think about the MSE at $x_*$ as a measure of how good $x_*$ is as a summary for the distribution. Anytime $X$ can take on multiple values every choice of summary value will ignore the variation in $X$ . The quantity $(x - x_*)^2$ measures how badly $x_*$ misrepresents any particular $x$ . Minimizing an average against the PMF selects an $x_*$ that tries to that compromises between the different probable values of $x$ .

It turns out that, the best choice of $x_*$ is the expected value $\mathbb{E}[X]$ . In other words, expected values minimize mean square errors.

Notice the role of the square here. Squaring a number greater than one makes it larger. Squaring a number less than one makes it smaller. So, when we minimize a mean square error, we are discounting small errors, and exaggerating large errors. As a consequence, the expected value is the best choice of central value when we aim to avoid very large summarization or prediction errors, but mostly disregard small errors.

This interpretation matches our observation about skewed distributions. Expected values will be sensitive to outliers, since the expected value tries avoid very large errors in prediction.

We’ll make this idea more precise in the future.

Expectations of Functions of Random Variables¶

Often we are interested in functions of random variables. For instance, consider $Y = X^2$ , or $W = e^X$ . Notice that both $Y$ and $W$ are random since $X$ was random. So, each function of a random variable defines a new random variable.

We are often interested in functions of random variables since the functions represent operations we could apply to the random variable. Data analysis is all about collecting samples, then applying functions to those samples in order to answer a question.

So, if $Y = g(X)$ for some function $g$ , what is $\mathbb{E}[Y]$ ?

There are two ways to calculate this expectation:

Both formula give the same answer. The first just treats $Y$ as a new discrete random variable, and applies the formula for generic expectations. To apply it, find the possible values of $Y = g(X)$ , then find the PMF of $Y$ . This is often a lot of work since finding the PMF of $g(X)$ involves finding $\text{Pr}(g(X) = y)$ for each possible $y$ .

The second is often easier. It only requires evaluating a new weighted average against the PMF of $X$ . It generalizes our original expectation formula nicely. Instead of averaging $x$ , average $g(x)$ .

The domain space formula is the form we’ll work with. It’s called the domain space formula since it averages over the inputs to $g$ (the domain), rather than over the outputs of $g$ (its range) If you like exercises, try to show that the two formula are equivalent.

Here’s an example. Suppose that $\mathcal{X} = \{1,2,3,4\}$ and:

$x =$	1	2	3	4
$\text{PMF}(x) =$	$1/2$	$1/4$	0	$1/4$

What is $\mathbb{E}[X^2]$ ?

Add a row to the table for $g(x) = x^2$ :

$x$	1	2	3	4
$x^2 =$	1	4	9	16
$\text{PMF}(x)$	$1/2$	$1/4$	0	$1/4$

Then:

\mathbb{E}[X^2] = 1 \times \frac{1}{2} + 4 \times \frac{1}{4} + 9 \times 0 + 16 \times \frac{1}{4} = \frac{1}{2} + 1 + 4 = 5.5

(12)

Notice that $\mathbb{E}[X^2] = 5.5$ is not $\mathbb{E}[X]^2 = 2^2 = 4$ . Instead, $\mathbb{E}[X^2] = 5.5 > 4 = \mathbb{E}[X^2]$ . There are two related lessons here:

The domain space formula also generalizes nicely to continuous random variables. As usual, integrate where we used a sum, and swap mass for density:

For instance, suppose we had wanted to find the expected square distance from a dart’s position to the center of the dartboard, when the dart’s position is drawn uniformly. Then we would compute:

\mathbb{E}[R^2] = \int_{r = 0}^1 r^2 \times (2 r) dr = \int_{r = 0}^1 2 r^3 dr = \frac{2}{4} r^4|_{0}^1 = \frac{1}{2}.

(16)

Comparison to Mode and Median¶

There are other ways to summarize the “center” of a distribution. The two popular alternatives are:

The mode (or modes) of a distribution is the value (or values) of $x$ that maximize its distribution function (PMF or PDF). These are the most likely possible values of the random variable. They correspond to the locations of the peaks in the distribution plot. A distribution is unimodal if it has a single peak. Otherwise it is multimodal.
The median of a distribution is the value where it’s CDF equals first crosses 1/2 from below. The discrete case is a bit awkward so we’ll focus on the continuous case for interpretation. The continuous case is easier since, when we work with continuous random variables, we don’t need to distinguish between the statements $X \leq x_*$ and $X < x_*$ .
- For a continuous random variable, the median value is the threshold $x$ where $\text{CDF}(x) = 1/2$ .
- The median $x_*$ is a central value in the sense that, if we predict $X$ using $x_*$ then we are equally likely to over or underestimate. Half of the probability mass lies above $x_*$ and half lies below.
- Like the expected value, the median can be defined as the central value that minimizes some average error. The median is the choice of central value that minimizes the mean absolute error, $|X - x_*|$ . This means that the median is concerned by errors in proportion to their size. It does not exaggerate large errors or discount small errors. It is less sensitive to outliers than the expectation.

Importantly, the mean (expected value), median, and mode are all different summaries. Do not conflate them.

We will mostly concern ourselves with expectations and modes in this class. In comparison,medians have received short shrift by statisticians and probabilists. Of the three, expectations are the easiest to compute, are easy to estimate, and are the easiest to analyze. They are not the easiest to estimate (medians are more stable). However, it is easier to build a theory that explains how estimates to expectations behave (sample averages). This theory is deep, and is the crown jewel of probability. Many of the founding theorems in probability regard sample averages. The law of large numbers, central limit theorem, and ergodic theorems all provide guarantees for sample averages that relate sample averages to expectations. The law of large numbers, in particular, is a foundational pillar of probability. It is the key theorem that relates the abstract world of chances, and their rules, to the measurable world of frequencies. Without it, probability would be a purely abstract concept.

As a result, we will also spend a lot of effort on expectations. While we focus on expectations, don’t forget the median. It is an equally, if not more, sensible notion, that minimizes a simpler notion of error, and is more stable to outliers.