
4.2 Rules of Expectation

Section 4.1 introduced the basic definition of an expectation as a weighted average. This definition provides a direct formula for computing an expectation: evaluate a weighted average of the possible values of the random variable, weighted by their likelihood.

Evaluating these averages can be tricky. We’ll spend a couple of weeks practicing integrals and sums, but it will take some work before we can confidently evaluate interesting expectations directly.

Nevertheless, expectations are popular summary values because they obey a variety of useful rules. These rules make it possible to compute many expectations without performing summation acrobatics. Instead of plugging into the definition, we can often find an expectation by applying the rules of expectation.

This section will introduce the most useful rules. We will start with an example for motivation, then will explore three sets of rules. Once we’ve built up a list of rules, we’ll come back and complete our example without working through the sum directly.

Binomial Example

Suppose that $X \sim \text{Binom}(n,p)$. What is $\mathbb{E}[X]$?

Count variables are discrete, so we could start with a sum:

$$\mathbb{E}[X] = \sum_{\text{all } x} x \times \text{Pr}(X = x) = \sum_{x = 0}^n x \binom{n}{x} p^x (1 - p)^{n-x}.$$

That’s not an easy sum. We could try simplifying:

$$\mathbb{E}[X] = \sum_{x = 0}^n \frac{x}{x!} \frac{n!}{(n - x)!} p^x (1 - p)^{n-x} = \sum_{x = 1}^n \frac{n!}{(x - 1)!(n - x)!} p^x (1 - p)^{n-x}$$

but, unless you’re quite clever, it’s not clear how to progress. Sums are easy to set up, but are often hard to close.

It turns out that:

$$\mathbb{E}[X] = np.$$

That’s a wonderfully simple formula. The expected number of successes in a string of $n$ independent, identical, binary trials is the number of trials, $n$, times the chance each individual trial succeeds, $p$. If I run 10 experiments, and each succeeds with chance 1/5, then I expect to see 2 successes.
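The $np$ formula is easy to sanity-check by simulation. Here is a minimal sketch, using the 10-trials, chance-1/5 example from above; the trial count and seed are arbitrary choices:

```python
import random

# Simulate many Binomial(n, p) draws and compare the sample average
# to the claimed formula E[X] = n * p.
n, p = 10, 1 / 5
trials = 100_000

random.seed(0)
total = 0
for _ in range(trials):
    # One binomial draw: count successes in n independent Bernoulli(p) trials.
    total += sum(1 for _ in range(n) if random.random() < p)

sample_mean = total / trials
print(sample_mean)  # close to n * p = 2.0
```

By the sample-average interpretation of expectation, the printed mean should settle near 2 as the number of trials grows.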

This answer also closely tracks what we learned about the mode of the binomial. The most likely outcome for a binomial random variable is near $np$.

Deriving the expectation directly is exhausting. If you haven’t, open the dropdown above to see how much work it took to get here. Whenever we reach an answer that is suspiciously simple, through a process that is evidently opaque, we should ask, “was there a better way to find this answer?” Often, if your answer is intuitive, but your work is ornate, there is a simpler method. The rest of this chapter will develop a series of rules that make this sort of calculation a breeze.

Rules of Expectation

Here’s our basic strategy:

  1. When given a random variable, always try to write out the expectation directly as a sum or an integral. If you can close it, go ahead. There’s no need to try anything sharper.

  2. If the sum/integral is tricky, try to rewrite the random variable as a combination of simpler random variables.

  3. Then, apply rules of expectation to break up the original expectation into a combination of the expectations of its parts. If each part is simple enough, then we can use the expectations of the parts to put together the expectation of the whole.

Expectations of Key Distributions

To use this strategy, we will need to know the expectations of some key reference distributions. Here are three you should always be ready to use:

  1. Constants: If $X = c$ for some constant $c$, then $\mathbb{E}[X] = c$.

    This result follows immediately from either interpretation of the expectation. If all of the mass of the distribution is at one value, then that value must be the center of mass. Alternately, if $X$ is always $c$, then any sample average of a string of samples will be a sample average of a string of $c$’s, so must equal $c$.

  2. Bernoulli (binary) Random Variables: Suppose that $X$ is an indicator random variable for some event $E$. Then $X \sim \text{Bern}(p)$ where $p = \text{Pr}(E)$. What is $\mathbb{E}[X]$?

    $$\mathbb{E}[X] = \sum_{x = 0}^1 x \times \text{PMF}(x) = 0 \times (1 - p) + 1 \times p = p.$$
  3. Symmetric Distributions: Suppose that $X$ is drawn from a distribution that is symmetric about some value $x_*$. Then, to balance the distribution, the only possible midpoint is $x_*$, so the center of mass is $x_*$. It follows that $\mathbb{E}[X] = x_*$, whenever the expectation exists.
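The Bernoulli rule is worth internalizing: the sample average of an indicator is just the fraction of trials in which the event occurs, so it converges to $\text{Pr}(E) = p$. A quick check with a made-up event, a fair die landing on 5 or 6, so $p = 1/3$:

```python
import random

# The expectation of an indicator is the probability of its event, so the
# sample average of the indicator converges to p. Here the (hypothetical)
# event is a fair die landing on 5 or 6, so p = 1/3.
random.seed(0)
trials = 100_000
hits = sum(1 for _ in range(trials) if random.randint(1, 6) >= 5)
print(hits / trials)  # close to 1/3
```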

Linearity of Expectation

Next, we will need rules that help us compute expectations of transformations of random variables. These are just standard algebra rules for averages.

The simplest transformation is a linear function. We’ll break this rule into three parts. The first two are each special cases of the third.

  1. Translations: If $Y = X + b$ for some $b$, then $\mathbb{E}[Y] = \mathbb{E}[X + b] = \mathbb{E}[X] + b$.

    As usual, we can either prove this rule using the weighted average formula for expectations, or argue it using the interpretations of expectation. Let’s work by interpretation.

    Adding a constant to $X$ just shifts its distribution rightward by the constant, since $\text{PMF}_Y(y) = \text{Pr}(Y = y) = \text{Pr}(X + b = y) = \text{Pr}(X = y - b) = \text{PMF}_X(y - b)$. If I translate a distribution horizontally by a distance $b$, then I must also translate its center of mass horizontally by a distance of $b$. So, the new center of mass is the old center of mass, plus $b$.

  2. Scaling: If $Y = a X$ for some $a$, then $\mathbb{E}[Y] = \mathbb{E}[aX] = a \mathbb{E}[X]$.

    Let’s prove this one using the weighted average formula. We’ll do the discrete case. The continuous case works for the same reason.

    $$\mathbb{E}[Y] = \mathbb{E}[a X] = \sum_{\text{all } x} (a x) \, \text{PMF}(x) = a \sum_{\text{all } x} x \, \text{PMF}(x) = a \mathbb{E}[X].$$
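Both rules can be checked exactly on a small example. This sketch uses a made-up PMF and arbitrary constants $a$ and $b$, and evaluates each expectation by the weighted average formula:

```python
# Exact checks of the translation and scaling rules on a small
# hypothetical PMF for a discrete random variable X.
pmf = {0: 0.2, 1: 0.5, 2: 0.3}  # value -> probability
a, b = 3.0, 4.0

e_x = sum(x * p for x, p in pmf.items())               # E[X]
e_x_plus_b = sum((x + b) * p for x, p in pmf.items())  # E[X + b]
e_ax = sum((a * x) * p for x, p in pmf.items())        # E[aX]

print(e_x, e_x_plus_b, e_ax)  # E[X + b] = E[X] + b, and E[aX] = a E[X]
```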

Combining these two rules produces the general linear rule: if $Y = a X + b$, then $\mathbb{E}[Y] = \mathbb{E}[a X + b] = a \mathbb{E}[X] + b$.

Additivity of Expectation

The final, and most useful, property of expectation is another statement about sums. This time, it regards the expectations of sums of random variables: for any two random variables $X$ and $Y$, $\mathbb{E}[X + Y] = \mathbb{E}[X] + \mathbb{E}[Y]$.

To prove this result, we’ll use some of the ideas from Sections 1.3 and 1.5. Let $S = X + Y$. Then, by the weighted average formula for expectations:

$$\mathbb{E}[S] = \sum_{\text{all } s} s \times \text{Pr}(S = s) = \sum_{\text{all } s} s \times \text{Pr}(X + Y = s).$$

We can expand the chance that $X + Y = s$ by summing over all pairs $x, y$ that add to $s$. The events corresponding to distinct pairs are disjoint, so, by the addition rule (see Section 1.3):

$$\text{Pr}(X + Y = s) = \sum_{\text{all } x,y \text{ s.t. } x + y = s} \text{Pr}(X = x, Y = y).$$

Then, moving all terms inside the sum:

$$\mathbb{E}[S] = \sum_{\text{all } s} s \times \text{Pr}(S = s) = \sum_{\text{all } s} \sum_{\text{all } x,y \text{ s.t. } x + y = s} (x + y) \times \text{Pr}(X = x, Y = y).$$

The sum over all possible $s$, of each pair $x, y$ that could add to $s$, is just the sum over all pairs $x$ and $y$. So, we can write our sum more simply:

$$\mathbb{E}[S] = \sum_{\text{all } x, y} (x + y) \times \text{Pr}(X = x, Y = y).$$

Now, simplifying:

$$\mathbb{E}[S] = \sum_{\text{all } x,y} x \times \text{Pr}(X = x, Y = y) + \sum_{\text{all } x,y} y \times \text{Pr}(X = x, Y = y).$$

The probabilities in each sum are joint probabilities. We can expand each using the multiplication rule from Section 1.5. For example:

$$\sum_{\text{all } x,y} x \times \text{Pr}(X = x, Y = y) = \sum_{\text{all } x,y} x \times \text{Pr}(X = x) \times \text{Pr}(Y = y|X = x).$$

Now, let’s split up the sum. Sum over all $x$ first, then sum over all $y$:

$$\sum_{\text{all } x} \sum_{\text{all } y} x \, \text{Pr}(X = x) \, \text{Pr}(Y = y|X = x) = \sum_{\text{all } x} x \, \text{Pr}(X = x) \left( \sum_{\text{all } y} \text{Pr}(Y = y|X = x) \right).$$

Here’s the kicker. The sum inside the parentheses on the right is the sum of a distribution, over all possible values of the associated random variable. Anytime we sum the PMF of a random variable (here, the PMF of $Y$ given $X = x$), over all possible values, we must get back 1. All PMFs are normalized.

So:

$$\sum_{\text{all } x} \sum_{\text{all } y} x \, \text{Pr}(X = x) \, \text{Pr}(Y = y|X = x) = \sum_{\text{all } x} x \, \text{Pr}(X = x) = \mathbb{E}[X].$$

The same argument applies for the second term in our original sum, so:

$$\mathbb{E}[S] = \mathbb{E}[X] + \mathbb{E}[Y].$$

Essentially the same arguments apply in the continuous case.
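Notice that the proof never assumes $X$ and $Y$ are independent. A small exact computation confirms additivity for a dependent pair; the joint PMF below is a made-up example in which $X$ and $Y$ tend to agree:

```python
# Verify E[X + Y] = E[X] + E[Y] exactly on a small joint PMF where
# X and Y are dependent (this joint table is a hypothetical example:
# Pr(X=1, Y=1) = 0.4, not the 0.25 independence would require).
joint = {  # (x, y): Pr(X = x, Y = y)
    (0, 0): 0.4,
    (0, 1): 0.1,
    (1, 0): 0.1,
    (1, 1): 0.4,
}

e_x = sum(x * p for (x, y), p in joint.items())        # E[X]
e_y = sum(y * p for (x, y), p in joint.items())        # E[Y]
e_s = sum((x + y) * p for (x, y), p in joint.items())  # E[X + Y]

print(e_x, e_y, e_s)  # e_s equals e_x + e_y
```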

Expectations of Count Variables via Additivity

A count variable is an integer-valued random variable that represents some sort of count. For instance, binomial random variables count successes. Geometric random variables count trials until a success. The rules established above make it easy to find the expectations of count variables, since most count variables can be expanded as a sum. After all, most counting processes occur as sequences where, each time an instance occurs, we add 1 to our running count.

Binomial Random Variables

Let’s try to find the expectation of a binomial again. This time, we’ll use rules instead of brute force algebra.

First, suppose that $X \sim \text{Binom}(n,p)$. Then $X$ is the number of successes in a string of $n$ independent, identical, binary trials. So, if we let $I_j$ be an indicator for the event that the $j^{th}$ trial succeeds, then:

$$X = I_1 + I_2 + \ldots + I_n = \sum_{j=1}^n I_j.$$

Then, using the additivity property:

$$\mathbb{E}[X] = \mathbb{E}\left[\sum_{j=1}^n I_j \right] = \sum_{j=1}^n \mathbb{E}[I_j].$$

Each $I_j$ is an indicator, so it is a Bernoulli random variable with success probability $p$. Since the expectation of any Bernoulli random variable is its success probability:

$$\mathbb{E}[X] = \sum_{j=1}^n p = n \times p.$$

Done! Compare this proof to the dropdown argument provided at the start of the chapter. This one is much better.

It is better in two ways:

  1. It is simpler. It involves fewer steps and is easier to follow/remember.

  2. Each of its steps is meaningful and relies on clearly motivated logical arguments that walk directly towards the desired result. Unlike the algebraic proof, which required a large number of little steps, none of which except the last carried much intrinsic meaning, each step in this proof uses a powerful idea: count variables are sums of indicators, expectations of sums are sums of expectations, and the expectation of an indicator is the probability of its event.

This is why rules are so helpful. They will allow us to find expectations in situations where direct application of the weighted average formula is ungainly.

Hypergeometric Random Variables

Suppose you sample $n$ individuals from a pool of total size $m$. You sample uniformly, but without replacement: your sample of $n$ individuals never includes the same individual twice. Of the $m$ individuals, $s = p m$ possess a characteristic of interest. For example, perhaps you want to know what fraction of Berkeley data science majors are double majors. Then $m$ would be about 2,000, $p$ would be 44%, and $s$ would be the number of double majors, which is about 880 students. The $n$ individuals could be a sample of 100 data science majors selected from Data 89.

Let $X$ denote the number of individuals in your sample of $n$ who possess the characteristic of interest. In our example, $X$ could be the number of students in our sample who are double majors. Abstractly, $X$ is the number of successful draws, in a sequence of $n$ uniform draws, made without replacement, from a fixed pool. Random variables of this kind are called hypergeometric random variables.

What is $\mathbb{E}[X]$?

First, try to write the expectation as a weighted average:

$$\mathbb{E}[X] = \sum_{\text{all } x} x \times \text{PMF}(x).$$

To fill in the sum, we first need to work out the support of $X$. The minimum and maximum values of $X$ depend on $n$, $m$, and $s$. If $n < s$, then it is possible that every student we sample is a double major, so $X$ could be as large as $n$. If $n > s$, then, at most, we sample every double major in data science, and $X = s$. So, $X \leq \min(n, s)$. Similar logic applies to $n - X$, the number of single majors in our sample. The number of single majors can be at most $m - s$, and at most $n$, so $n - X \leq \min(m - s, n)$, which implies $X \geq \max(0, n - (m - s))$. So:

$$\mathbb{E}[X] = \sum_{x = \max(0,\, n + s - m)}^{\min(n, s)} x \times \text{PMF}(x).$$

Already this looks tough.

The PMF is even worse. To find the chance that $X = x$, use probability by proportion. There are $m$ choose $n$ ways to select $n$ individuals without replacement from a pool of $m$. There are $s$ choose $x$ ways to select $x$ individuals with the characteristic of interest from the $s$ in the pool. There are $m - s$ choose $n - x$ ways to select the remaining $n - x$ from the $m - s$ individuals in the pool who don’t have the characteristic of interest.

Therefore:

$$\text{PMF}(x) = \text{Pr}(X = x) = \frac{\binom{s}{x} \binom{m - s}{n - x}}{\binom{m}{n}} = \frac{\binom{p m}{x} \binom{(1 - p) m}{n - x}}{\binom{m}{n}}.$$

So, the expectation is:

$$\mathbb{E}[X] = \sum_{x = \max(0,\, n - (1 - p) m)}^{\min(n,\, p m)} x \times \frac{\binom{p m}{x} \binom{(1 - p) m}{n - x}}{\binom{m}{n}}.$$

That is a properly difficult sum!
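Hard to close by hand, but easy to evaluate numerically. As a sanity check, here is a sketch that brute-forces the sum with Python’s `math.comb`, using the hypothetical double-majors numbers from above:

```python
from math import comb

# Brute-force the hypergeometric expectation for the hypothetical values
# m = 2000, s = 880 (so p = 0.44), n = 100, by summing x * PMF(x)
# over the support worked out above.
m, s, n = 2000, 880, 100
lo, hi = max(0, n + s - m), min(n, s)

e_x = sum(
    x * comb(s, x) * comb(m - s, n - x) / comb(m, n)
    for x in range(lo, hi + 1)
)
print(e_x)  # matches n * s / m = 44.0
```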

To solve it, we’ll adopt the same approach we used for the expectation of the binomial.

First, notice that $X$ is a count variable. So, let’s expand it as a sum of indicators. Imagine drawing the individuals in sequence, and checking, one at a time, whether they possess the characteristic of interest. Let $I_j$ be an indicator for the event that the $j^{th}$ individual has the characteristic of interest (e.g. is a double major). Then, just like we saw for the binomial:

$$X = \sum_{j=1}^n I_j.$$

So, by additivity:

$$\mathbb{E}[X] = \sum_{j=1}^n \mathbb{E}[I_j].$$

Notice that, unlike in the binomial case, the indicators in this example are dependent. They are dependent since, each time we sample an individual, we remove them from the pool.

However, their dependence doesn’t matter since the additivity rule applies to any pair of random variables!

Now we’re almost done. As before, the expectation of an indicator is the chance that the corresponding event occurs. So, $\mathbb{E}[I_j]$ is the probability that the $j^{th}$ individual we pick has the characteristic of interest. This is a marginal probability. It does not depend on the other individuals sampled. On any particular draw, ignoring the other draws, the chance that the selected individual has the characteristic of interest is $p$, since a fraction $p$ of all individuals in the pool have the desired characteristic. In other words, the chance the 10th student selected is a double major is 44%, as is the chance that the 40th student selected is a double major.

So:

$$\mathbb{E}[X] = \sum_{j=1}^n \mathbb{E}[I_j] = \sum_{j=1}^n p = np.$$

So, just like binomial random variables, the expectation of a hypergeometric random variable is the number of draws, $n$, times the marginal chance each draw succeeds, $p$.

Notice the power of working by properties. Even though the hypergeometric PMF is much harder to work with, its expectation is just as easy to find as the binomial’s. Both can be broken into a sum of simple expectations using additivity, even though, when sampling without replacement, the draws are all dependent on each other!
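The dependence between draws is easy to see in simulation, and so is the conclusion. Here is a minimal sketch of the double-majors example (same hypothetical numbers as above), sampling without replacement and checking that the average count sits near $np = n s / m$:

```python
import random

# Simulate the double-majors example: draw n = 100 of m = 2000 students
# without replacement, s = 880 of whom are double majors (hypothetical
# numbers from the text).
random.seed(0)
m, s, n = 2000, 880, 100
pool = [1] * s + [0] * (m - s)  # 1 marks a double major

trials = 20_000
# random.sample draws n distinct positions, i.e. samples without replacement.
total = sum(sum(random.sample(pool, n)) for _ in range(trials))
print(total / trials)  # close to n * s / m = 44.0
```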