
4.2 Rules of Expectation

Section 4.1 introduced the basic definition of an expectation as a weighted average. This definition provides a direct formula for computing an expectation: evaluate a weighted average of the possible values of the random variable, weighted by their likelihood.

Evaluating these averages can be tricky. We’ll spend a couple of weeks practicing integrals and sums, but it will take some work before we can confidently evaluate interesting expectations directly.

Nevertheless, expectations are popular summary values because they obey a variety of useful rules. These rules make it possible to compute many expectations without performing summation acrobatics. Instead of plugging into the definition, we can often find an expectation by applying the rules of expectation.

This section will introduce the most useful rules. We will start with an example for motivation, then will explore three sets of rules. Once we’ve built up a list of rules, we’ll come back and complete our example without working through the sum directly.

Binomial Example

Suppose that $X \sim \text{Binom}(n,p)$. What is $\mathbb{E}[X]$?

Count variables are discrete, so we could start with a sum:

$$\mathbb{E}[X] = \sum_{\text{all } x} x \times \text{Pr}(X = x) = \sum_{x = 0}^n x \binom{n}{x} p^x (1 - p)^{n-x}.$$

That’s not an easy sum. We could try simplifying:

$$\mathbb{E}[X] = \sum_{x = 0}^n \frac{x}{x!} \frac{n!}{(n - x)!} p^x (1 - p)^{n-x} = \sum_{x = 1}^n \frac{n!}{(x - 1)!(n - x)!} p^x (1 - p)^{n-x}$$

but, unless you’re quite clever, it’s not clear how to progress. Sums are easy to set up, but are often hard to close.

It turns out that:

$$\mathbb{E}[X] = np.$$

That’s a wonderfully simple formula. The expected number of successes in a string of $n$ independent, identical, binary trials is the number of trials, $n$, times the chance each individual trial succeeds, $p$. If I run 10 experiments, and each succeeds with chance 1/5, then I expect to see 2 successes.
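The $np$ formula is easy to sanity-check by simulation. Here is a minimal sketch, using the 10-trials, chance-1/5 example from above; the trial count and seed are arbitrary choices:

```python
import random

# Simulate many Binomial(n, p) draws and compare the sample average
# to the claimed formula E[X] = n * p.
n, p = 10, 1 / 5
trials = 100_000

random.seed(0)
total = 0
for _ in range(trials):
    # One binomial draw: count successes in n independent Bernoulli(p) trials.
    total += sum(1 for _ in range(n) if random.random() < p)

sample_mean = total / trials
print(sample_mean)  # close to n * p = 2.0
```

By the sample-average interpretation of expectation, the printed mean should settle near 2 as the number of trials grows.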

This answer also closely tracks what we learned about the mode of the binomial. The most likely outcome for a binomial random variable is near $np$.

Deriving the expectation directly is exhausting. If you haven’t, open the dropdown above to see how much work it took to get here. Whenever we reach an answer that is suspiciously simple, through a process that is evidently opaque, we should ask, “was there a better way to find this answer?” Often, if your answer is intuitive, but your work is ornate, there is a simpler method. The rest of this chapter will develop a series of rules that make this sort of calculation a breeze.

Rules of Expectation

Here’s our basic strategy:

  1. When given a random variable, always try to write out the expectation directly as a sum or an integral. If you can close it, go ahead. There’s no need to try anything sharper.

  2. If the sum/integral is tricky, try to rewrite the random variable as a combination of simpler random variables.

  3. Then, apply rules of expectation to break up the original expectation into a combination of the expectations of its parts. If each part is simple enough, then we can use the expectations of the parts to put together the expectation of the whole.

Expectations of Key Distributions

To use this strategy, we will need to know the expectations of some key reference distributions. Here are three you should always be ready to use:

  1. Constants: If $X = c$ for some constant $c$, then $\mathbb{E}[X] = c$.

    This result follows immediately from either interpretation of the expectation. If all of the mass of the distribution is at one value, then that value must be the center of mass. Alternately, if $X$ is always $c$, then any sample average of a string of samples will be a sample average of a string of $c$’s, so must equal $c$.

  2. Bernoulli (binary) Random Variables: Suppose that $X$ is an indicator random variable for some event $E$. Then $X \sim \text{Bern}(p)$ where $p = \text{Pr}(E)$. What is $\mathbb{E}[X]$?

    $$\mathbb{E}[X] = \sum_{x = 0}^1 x \times \text{PMF}(x) = 0 \times (1 - p) + 1 \times p = p.$$
  3. Symmetric Distributions: Suppose that $X$ is drawn from a distribution that is symmetric about some value $x_*$. Then, to balance the distribution, the only possible midpoint is $x_*$, so the center of mass is $x_*$. It follows that $\mathbb{E}[X] = x_*$, whenever the expectation exists.
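The Bernoulli rule is worth internalizing: the sample average of an indicator is just the fraction of trials in which the event occurs, so it converges to $\text{Pr}(E) = p$. A quick check with a made-up event, a fair die landing on 5 or 6, so $p = 1/3$:

```python
import random

# The expectation of an indicator is the probability of its event, so the
# sample average of the indicator converges to p. Here the (hypothetical)
# event is a fair die landing on 5 or 6, so p = 1/3.
random.seed(0)
trials = 100_000
hits = sum(1 for _ in range(trials) if random.randint(1, 6) >= 5)
print(hits / trials)  # close to 1/3
```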

Linearity of Expectation

Next, we will need rules that help us compute expectations of transformations of random variables. These are just standard algebra rules for averages.

The simplest transformation is a linear function. We’ll break this rule into three parts. The first two are each special cases of the third.

  1. Translations: If $Y = X + b$ for some $b$, then $\mathbb{E}[Y] = \mathbb{E}[X + b] = \mathbb{E}[X] + b$.

    As usual, we can either prove this rule using the weighted average formula for expectations, or argue it using the interpretations of expectation. Let’s work by interpretation.

    Adding a constant to $X$ just shifts its distribution rightward by the constant, since $\text{PMF}_Y(y) = \text{Pr}(Y = y) = \text{Pr}(X + b = y) = \text{Pr}(X = y - b) = \text{PMF}_X(y - b)$. If I translate a distribution horizontally by a distance $b$, then I must also translate its center of mass horizontally by a distance of $b$. So, the new center of mass is the old center of mass, plus $b$.

  2. Scaling: If $Y = a X$ for some $a$, then $\mathbb{E}[Y] = \mathbb{E}[aX] = a \mathbb{E}[X]$.

    Let’s prove this one using the weighted average formula. We’ll do the discrete case. The continuous case works for the same reason.

    $$\mathbb{E}[Y] = \mathbb{E}[a X] = \sum_{\text{all } x} (a x) \, \text{PMF}(x) = a \sum_{\text{all } x} x \, \text{PMF}(x) = a \mathbb{E}[X].$$
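Both rules can be checked exactly on a small example. This sketch uses a made-up PMF and arbitrary constants $a$ and $b$, and evaluates each expectation by the weighted average formula:

```python
# Exact checks of the translation and scaling rules on a small
# hypothetical PMF for a discrete random variable X.
pmf = {0: 0.2, 1: 0.5, 2: 0.3}  # value -> probability
a, b = 3.0, 4.0

e_x = sum(x * p for x, p in pmf.items())               # E[X]
e_x_plus_b = sum((x + b) * p for x, p in pmf.items())  # E[X + b]
e_ax = sum((a * x) * p for x, p in pmf.items())        # E[aX]

print(e_x, e_x_plus_b, e_ax)  # E[X + b] = E[X] + b, and E[aX] = a E[X]
```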

Combining these two rules produces the general linear rule: if $Y = a X + b$, then $\mathbb{E}[Y] = \mathbb{E}[a X + b] = a \mathbb{E}[X] + b$.

Additivity of Expectation

The final, and most useful, property of expectation is another statement about sums. This time, it regards the expectations of sums of random variables: for any two random variables $X$ and $Y$, $\mathbb{E}[X + Y] = \mathbb{E}[X] + \mathbb{E}[Y]$.

To prove this result, we’ll use some of the ideas from Sections 1.3 and 1.5. Let $S = X + Y$. Then, by the weighted average formula for expectations:

$$\mathbb{E}[S] = \sum_{\text{all } s} s \times \text{Pr}(S = s) = \sum_{\text{all } s} s \times \text{Pr}(X + Y = s).$$

We can expand the chance that $X + Y = s$ by summing over all pairs $x, y$ that add to $s$. The events corresponding to distinct pairs are disjoint, so, by the addition rule (see Section 1.3):

$$\text{Pr}(X + Y = s) = \sum_{\text{all } x,y \text{ s.t. } x + y = s} \text{Pr}(X = x, Y = y).$$

Then, moving all terms inside the sum:

$$\mathbb{E}[S] = \sum_{\text{all } s} s \times \text{Pr}(S = s) = \sum_{\text{all } s} \sum_{\text{all } x,y \text{ s.t. } x + y = s} (x + y) \times \text{Pr}(X = x, Y = y).$$

The sum over all possible $s$, of each pair $x, y$ that could add to $s$, is just the sum over all pairs $x$ and $y$. So, we can write our sum more simply:

$$\mathbb{E}[S] = \sum_{\text{all } x, y} (x + y) \times \text{Pr}(X = x, Y = y).$$

Now, simplifying:

$$\mathbb{E}[S] = \sum_{\text{all } x,y} x \times \text{Pr}(X = x, Y = y) + \sum_{\text{all } x,y} y \times \text{Pr}(X = x, Y = y).$$

The probabilities in each sum are joint probabilities. We can expand each using the multiplication rule from Section 1.5. For example:

$$\sum_{\text{all } x,y} x \times \text{Pr}(X = x, Y = y) = \sum_{\text{all } x,y} x \times \text{Pr}(X = x) \times \text{Pr}(Y = y|X = x).$$

Now, let’s split up the sum. Sum over all $x$ first, then sum over all $y$:

$$\sum_{\text{all } x} \sum_{\text{all } y} x \, \text{Pr}(X = x) \, \text{Pr}(Y = y|X = x) = \sum_{\text{all } x} x \, \text{Pr}(X = x) \left( \sum_{\text{all } y} \text{Pr}(Y = y|X = x) \right).$$

Here’s the kicker. The sum inside the parentheses on the right is the sum of a distribution, over all possible values of the associated random variable. Anytime we sum the PMF of a random variable (here, the PMF of $Y$ given $X = x$), over all possible values, we must get back 1. All PMFs are normalized.

So:

$$\sum_{\text{all } x} \sum_{\text{all } y} x \, \text{Pr}(X = x) \, \text{Pr}(Y = y|X = x) = \sum_{\text{all } x} x \, \text{Pr}(X = x) = \mathbb{E}[X].$$

The same argument applies for the second term in our original sum, so:

$$\mathbb{E}[S] = \mathbb{E}[X] + \mathbb{E}[Y].$$

Essentially the same arguments apply in the continuous case.
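Notice that the proof never assumes $X$ and $Y$ are independent. A small exact computation confirms additivity for a dependent pair; the joint PMF below is a made-up example in which $X$ and $Y$ tend to agree:

```python
# Verify E[X + Y] = E[X] + E[Y] exactly on a small joint PMF where
# X and Y are dependent (this joint table is a hypothetical example:
# Pr(X=1, Y=1) = 0.4, not the 0.25 independence would require).
joint = {  # (x, y): Pr(X = x, Y = y)
    (0, 0): 0.4,
    (0, 1): 0.1,
    (1, 0): 0.1,
    (1, 1): 0.4,
}

e_x = sum(x * p for (x, y), p in joint.items())        # E[X]
e_y = sum(y * p for (x, y), p in joint.items())        # E[Y]
e_s = sum((x + y) * p for (x, y), p in joint.items())  # E[X + Y]

print(e_x, e_y, e_s)  # e_s equals e_x + e_y
```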

Expectations of Count Variables via Additivity

A count variable is an integer-valued random variable that represents some sort of count. For instance, binomial random variables count successes. Geometric random variables count trials until a success. The rules established above make it easy to find the expectations of count variables, since most count variables can be expanded as a sum. After all, most counting processes occur as sequences where, each time an instance occurs, we add 1 to our running count.

Binomial Random Variables

Let’s try to find the expectation of a binomial again. This time, we’ll use rules instead of brute force algebra.

First, suppose that $X \sim \text{Binom}(n,p)$. Then $X$ is the number of successes in a string of $n$ independent, identical, binary trials. So, if we let $I_j$ be an indicator for the event that the $j^{th}$ trial succeeds, then:

$$X = I_1 + I_2 + \ldots + I_n = \sum_{j=1}^n I_j.$$

Then, using the additivity property:

$$\mathbb{E}[X] = \mathbb{E}\left[\sum_{j=1}^n I_j \right] = \sum_{j=1}^n \mathbb{E}[I_j].$$

Each $I_j$ is an indicator, so it is a Bernoulli random variable with success probability $p$. Since the expectation of any Bernoulli random variable is its success probability:

$$\mathbb{E}[X] = \sum_{j=1}^n p = n \times p.$$

Done! Compare this proof to the dropdown argument provided at the start of the chapter. This one is much better.

It is better in two ways:

  1. It is simpler. It involves fewer steps and is easier to follow/remember.

  2. Each of its steps is meaningful and relies on clearly motivated logical arguments that walk directly towards the desired result. Unlike the algebraic proof, which required a large number of little steps, none of which except the last carried much intrinsic meaning, each step in this proof uses a powerful idea: count variables are sums of indicators, expectations of sums are sums of expectations, and the expectation of an indicator is the probability of its event.

This is why rules are so helpful. They will allow us to find expectations in situations where direct application of the weighted average formula is ungainly.

Hypergeometric Random Variables

Suppose you sample $n$ individuals from a pool of total size $m$. You sample uniformly, but without replacement: your sample of $n$ individuals never includes the same individual twice. Of the $m$ individuals, $s = p m$ possess a characteristic of interest. For example, perhaps you want to know what fraction of Berkeley data science majors are double majors. Then $m$ would be about 2,000, $p$ would be 44%, and $s$ would be the number of double majors, which is about 880 students. The $n$ individuals could be a sample of 100 data science majors selected from Data 89.

Let $X$ denote the number of individuals in your sample of $n$ who possess the characteristic of interest. In our example, $X$ could be the number of students in our sample who are double majors. Abstractly, $X$ is the number of successful draws, in a sequence of $n$ uniform draws, made without replacement, from a fixed pool. Random variables of this kind are called hypergeometric random variables.

What is $\mathbb{E}[X]$?

First, try to write the expectation as a weighted average:

$$\mathbb{E}[X] = \sum_{\text{all } x} x \times \text{PMF}(x).$$

To fill in the sum, we first need to work out the support of $X$. The minimum and maximum values of $X$ depend on $n$, $m$, and $s$. If $n < s$, then it is possible that every student we sample is a double major, so $X$ could be as large as $n$. If $n > s$, then, at most, we sample every double major in data science, and $X = s$. So, $X \leq \min(n, s)$. Similar logic applies to $n - X$, the number of single majors in our sample. The number of single majors can be at most $m - s$, and at most $n$, so $n - X \leq \min(m - s, n)$, which implies $X \geq \max(0, n - (m - s))$. So:

$$\mathbb{E}[X] = \sum_{x = \max(0,\, n + s - m)}^{\min(n, s)} x \times \text{PMF}(x).$$

Already this looks tough.

The PMF is even worse. To find the chance that $X = x$, use probability by proportion. There are $m$ choose $n$ ways to select $n$ individuals without replacement from a pool of $m$. There are $s$ choose $x$ ways to select $x$ individuals with the characteristic of interest from the $s$ in the pool. There are $m - s$ choose $n - x$ ways to select the remaining $n - x$ from the $m - s$ individuals in the pool who don’t have the characteristic of interest.

Therefore:

$$\text{PMF}(x) = \text{Pr}(X = x) = \frac{\binom{s}{x} \binom{m - s}{n - x}}{\binom{m}{n}} = \frac{\binom{p m}{x} \binom{(1 - p) m}{n - x}}{\binom{m}{n}}.$$

So, the expectation is:

$$\mathbb{E}[X] = \sum_{x = \max(0,\, n - (1 - p) m)}^{\min(n,\, p m)} x \times \frac{\binom{p m}{x} \binom{(1 - p) m}{n - x}}{\binom{m}{n}}.$$

That is a properly difficult sum!
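Hard to close by hand, but easy to evaluate numerically. As a sanity check, here is a sketch that brute-forces the sum with Python’s `math.comb`, using the hypothetical double-majors numbers from above:

```python
from math import comb

# Brute-force the hypergeometric expectation for the hypothetical values
# m = 2000, s = 880 (so p = 0.44), n = 100, by summing x * PMF(x)
# over the support worked out above.
m, s, n = 2000, 880, 100
lo, hi = max(0, n + s - m), min(n, s)

e_x = sum(
    x * comb(s, x) * comb(m - s, n - x) / comb(m, n)
    for x in range(lo, hi + 1)
)
print(e_x)  # matches n * s / m = 44.0
```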

To solve it, we’ll adopt the same approach we used for the expectation of the binomial.

First, notice that $X$ is a count variable. So, let’s expand it as a sum of indicators. Imagine drawing the individuals in sequence, and checking, one at a time, whether they possess the characteristic of interest. Let $I_j$ be an indicator for the event that the $j^{th}$ individual has the characteristic of interest (e.g. is a double major). Then, just like we saw for the binomial:

$$X = \sum_{j=1}^n I_j.$$

So, by additivity:

$$\mathbb{E}[X] = \sum_{j=1}^n \mathbb{E}[I_j].$$

Notice that, unlike in the binomial case, the indicators in this example are dependent. They are dependent since, each time we sample an individual, we remove them from the pool.

However, their dependence doesn’t matter since the additivity rule applies to any pair of random variables!

Now we’re almost done. As before, the expectation of an indicator is the chance that the corresponding event occurs. So, $\mathbb{E}[I_j]$ is the probability that the $j^{th}$ individual we pick has the characteristic of interest. This is a marginal probability. It does not depend on the other individuals sampled. On any particular draw, ignoring the other draws, the chance that the selected individual has the characteristic of interest is $p$, since a fraction $p$ of all individuals in the pool have the desired characteristic. In other words, the chance the 10th student selected is a double major is 44%, as is the chance that the 40th student selected is a double major.

So:

$$\mathbb{E}[X] = \sum_{j=1}^n \mathbb{E}[I_j] = \sum_{j=1}^n p = np.$$

So, just like binomial random variables, the expectation of a hypergeometric random variable is the number of draws, $n$, times the marginal chance each draw succeeds, $p$.

Notice the power of working by properties. Even though the hypergeometric PMF is much harder to work with, its expectation is just as easy to find as the binomial’s. Both can be broken into a sum of simple expectations using additivity, even though, when sampling without replacement, the draws are all dependent on each other!
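The dependence between draws is easy to see in simulation, and so is the conclusion. Here is a minimal sketch of the double-majors example (same hypothetical numbers as above), sampling without replacement and checking that the average count sits near $np = n s / m$:

```python
import random

# Simulate the double-majors example: draw n = 100 of m = 2000 students
# without replacement, s = 880 of whom are double majors (hypothetical
# numbers from the text).
random.seed(0)
m, s, n = 2000, 880, 100
pool = [1] * s + [0] * (m - s)  # 1 marks a double major

trials = 20_000
# random.sample draws n distinct positions, i.e. samples without replacement.
total = sum(sum(random.sample(pool, n)) for _ in range(trials))
print(total / trials)  # close to n * s / m = 44.0
```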