In the last chapter we saw that double integrals and double sums can be expanded as iterated integrals and iterated sums. Since expectations are weighted averages, expectations involving two or more random variables may be expanded as multiple integrals or multiple sums. In each case, converting the multiple integral or sum into an iterated one converts a single expectation into a pair of nested expectations.
This approach is very useful in applied problems. It breaks a problem involving two (or more) random variables into a sequence of problems, each involving only one variable at a time.
To understand the nested expectations, we will need to understand expectations taken over a single variable while a different variable is held fixed. An expectation given a fixed condition on some of the random inputs is a conditional expectation.
Conditional Expectation
Suppose that $X$ and $Y$ are jointly distributed random variables and $g(X, Y)$ is a scalar-valued function of $X$ and $Y$. Then the conditional expectation of $g(X, Y)$ is the expected value of $g(X, Y)$ given some constraint on the values of $X$ and/or $Y$.
Applying the division rule (conditional equals joint over marginal; see Section 1.5 and Section 8.3):

$$
E[g(X, Y) \mid X = x] \;=\; \int g(x, y)\, f_{Y \mid X}(y \mid x)\, dy, \qquad f_{Y \mid X}(y \mid x) \;=\; \frac{f_{X, Y}(x, y)}{f_X(x)}
$$

In the discrete case, replace the integral with a sum and the densities with probabilities.
Recall that all expectations are the center of mass of some distribution (see Section 4.1). So we can visualize the conditional expectation of $Y$ given $X = x$ as the center of mass of the conditional distribution of $Y$ given $X = x$. Since conditional distributions are proportional to cross-sections of joint distributions (see Section 8.3), we can imagine a conditional expectation as the center of mass of a cross-section of a joint distribution.
The figure below shows an example joint density function as a heat map. The contours are level sets of the density. The solid red line shows $E[Y \mid X = x]$, the conditional expectation of $Y$ given $X = x$, as a function of $x$. The vertical dashed and dotted red lines show the range of possible $y$ values at two particular values of $x$. The conditional distributions of $Y$ at those two values of $x$ are shown in the panel to the right. Notice that the center of each conditional matches the $y$-coordinate where the solid red line crosses the corresponding vertical red line.

Run the code cell below for an interactive example. You can vary the value of $x$ and track how both the conditional distribution of $Y$ and its expectation vary as a function of $x$. Notice that the conditional density of $Y$ is proportional to the $x$-cross-section of the joint density shown in the left-hand panel.
from utils_cond_exp import show_conditional_expectation
show_conditional_expectation()

In both examples, the conditional expectation of $Y$ given $X = x$ depends on the choice of $x$. This should not be surprising. The conditional distribution of $Y$ given $X = x$ is proportional to the cross-section of the joint at $x$. The cross-section varies depending on the choice of $x$, so the conditional expectation of $Y$ given $X = x$ may also vary with $x$.
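To make the division rule concrete with numbers, the code cell below is a minimal sketch; it is not part of the book's utilities, and the joint probability table in it is made up for illustration. It extracts the cross-section of a discrete joint at each value of $x$, normalizes it into a conditional distribution, and computes the center of mass of that conditional distribution.

import numpy as np

# A made-up discrete joint distribution P(X = x, Y = y), for illustration only.
# Rows index x in {0, 1, 2}; columns index y in {0, 1, 2, 3}.
x_vals = np.array([0, 1, 2])
y_vals = np.array([0, 1, 2, 3])
joint = np.array([
    [0.10, 0.05, 0.03, 0.02],
    [0.05, 0.15, 0.15, 0.05],
    [0.02, 0.08, 0.15, 0.15],
])

# Marginal of X: sum each row over y.
marginal_x = joint.sum(axis=1)

for i, x in enumerate(x_vals):
    # Conditional distribution of Y given X = x: the x-cross-section, normalized.
    cond_y_given_x = joint[i] / marginal_x[i]
    # Conditional expectation: the center of mass of that conditional distribution.
    print(f"E[Y | X = {x}] =", np.sum(y_vals * cond_y_given_x))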
Iterated Expectation
Iterated expectation expresses a joint expectation over both $X$ and $Y$ as an iterated expectation, first over $Y$ given $X$, then over $X$:

$$
E[g(X, Y)] \;=\; E\big[\, E[g(X, Y) \mid X] \,\big]
$$

As usual, we can choose whether to work over $X$ on the outside and $Y$ on the inside, or $Y$ on the outside and $X$ on the inside.
Loosely, you can remember this law as expressing a joint average as an average of conditional averages. Or, more succinctly, as an average of averages.
Other Names
Iterated expectation is sometimes called the chain rule of expectations or the tower property of expectation. Here we are using the term iterated expectation to relate iterated expectations to the more general strategies of iterated integration and iterated summation.
Averaging Averages
You’ve probably used an iterated average before.
Suppose that a student has a 100 percent homework average, a 90 percent quiz average, and an 80 percent exam average. Suppose that their final grade is a weighted average of their component scores, with weights 20 percent homework, 30 percent quizzes, and 50 percent exams. Then their final grade is an average of averages:

$$
0.20 \times 100 \;+\; 0.30 \times 90 \;+\; 0.50 \times 80 \;=\; 20 + 27 + 40 \;=\; 87
$$
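The same computation, written as a tiny code cell using the weights and component averages given above:

# Component averages and their weights, from the example above.
averages = {"homework": 100, "quizzes": 90, "exams": 80}
weights = {"homework": 0.20, "quizzes": 0.30, "exams": 0.50}

# The final grade is a weighted average of averages.
final_grade = sum(weights[part] * averages[part] for part in averages)
print(final_grade)   # 87.0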
Examples
Like the other algebraic properties of expectation, iterated expectation can make some joint expectations much easier to compute. Here are some example cases:
Suppose that $X$ and $Y$ are jointly distributed, where $E[X]$ is a given number and the conditional expectation $E[Y \mid X]$ is a given linear function of $X$. What is $E[Y]$?
Solution

Apply iterated expectation:

$$
E[Y] \;=\; E\big[\, E[Y \mid X] \,\big]
$$

Then, by the linearity of expectation (see Section 4.2), the inner quantity is a linear function of $X$, so its expected value is the same linear function of $E[X]$, which is given.
Notice that, in this example, we were able to compute the expectations without evaluating a single sum or integral. We didn’t even need to know the joint distribution!
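The specific numbers are not essential, so here is a simulation sketch of the same pattern with made-up values, $E[X] = 5$ and $E[Y \mid X] = 3X + 2$. These values, and the particular distributions below, are only one illustration consistent with the setup; they are not from the text. Iterated expectation plus linearity predicts $E[Y] = 3 \times 5 + 2 = 17$.

import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Hypothetical values, not from the text: E[X] = 5 and E[Y | X] = 3X + 2.
# Any joint distribution with these two properties has E[Y] = 3 * 5 + 2 = 17.
x = rng.exponential(scale=5.0, size=n)       # one choice of X with E[X] = 5
y = rng.normal(loc=3 * x + 2, scale=1.0)     # one choice of Y with E[Y | X] = 3X + 2

print(np.mean(y))            # simulated E[Y]; close to 17
print(3 * np.mean(x) + 2)    # iterated expectation plus linearity: 3 E[X] + 2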
Suppose that $X$ and $Y$ are drawn sequentially. Suppose that $X$ is an indicator variable with a given distribution. If $X = 0$, draw $Y$ from a binomial on 100 trials with success probability 0.2. If $X = 1$, draw $Y$ from a binomial on 100 trials with success probability 0.6.
What is $E[Y]$?
Solution

Apply iterated expectation. In general, if two variables are drawn in sequence, apply iterated expectation in the same order that you would use to sample the variables:

$$
E[Y] \;=\; E\big[\, E[Y \mid X] \,\big] \;=\; E[Y \mid X = 0]\, P(X = 0) \;+\; E[Y \mid X = 1]\, P(X = 1)
$$
To plug in, we need the two conditional expectations and the given probabilities of $X = 0$ and $X = 1$, which serve as the weights.
The conditional expectations are the expectations of binomial distributions. The expectation of a binomial equals the number of trials times the success probability. So:

$$
E[Y \mid X = 0] \;=\; 100 \times 0.2 \;=\; 20, \qquad E[Y \mid X = 1] \;=\; 100 \times 0.6 \;=\; 60
$$
Therefore:

$$
E[Y] \;=\; 20\, P(X = 0) \;+\; 60\, P(X = 1)
$$
In this case, the average of averages interpretation is quite clear. The values 20 and 60 are the conditional averages. The joint average is a weighted average of the conditional averages.
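The distribution of $X$ is not shown here, so the simulation sketch below assumes, purely for illustration, that $P(X = 1) = 0.5$. Under that assumption the weighted average of the conditional averages is $0.5 \times 20 + 0.5 \times 60 = 40$.

import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Assumed for illustration only: P(X = 1) = 0.5 (the text does not give this value).
p_x1 = 0.5

# Draw X first, then Y given X, matching the sampling order in the example.
x = rng.random(n) < p_x1
y = np.where(x,
             rng.binomial(100, 0.6, size=n),   # Y given X = 1
             rng.binomial(100, 0.2, size=n))   # Y given X = 0

print(np.mean(y))                       # simulated E[Y]
print((1 - p_x1) * 20 + p_x1 * 60)      # weighted average of the conditional averages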
Suppose that $X \sim \text{Geometric}(p)$. What is $E[X]$?
Solution

This doesn’t look like an obvious case for iterated expectation. We computed this expectation in Section 7.1 using the tail sum formula (summation by parts). Recall that:

$$
E[X] \;=\; \frac{1}{p}
$$
This answer makes intuitive sense. A geometric random variable represents the number of trials, in a sequence of independent and identical binary trials, up to and including the first success (see Section 2.2). The parameter $p$ is the success probability. The larger $p$ is, the more likely each trial is to succeed, so the fewer trials needed, on average, before a success. If $p = 1/10$, then it will take 10 trials, on average, before the first success.
Imagine running the process that produces a geometric random variable. Before you run any trials, $E[X]$ represents how many trials you should expect to run before stopping.
Now you run a trial. It either succeeds or fails. Let $I = 0$ if it fails and $I = 1$ if it succeeds. If the trial succeeds, then you stop, and $X = 1$. If it fails, you go on to the next trial.
So, by iterated expectation:

$$
E[X] \;=\; E[X \mid I = 1]\, P(I = 1) \;+\; E[X \mid I = 0]\, P(I = 0)
$$

Substituting in the chance of success:

$$
E[X] \;=\; E[X \mid I = 1]\, p \;+\; E[X \mid I = 0]\, (1 - p)
$$

Since we stop immediately if we succeed, $E[X \mid I = 1] = 1$. So:

$$
E[X] \;=\; p \;+\; E[X \mid I = 0]\, (1 - p)
$$
All that remains is the expected number of trials until our first success given that we failed on the first trial, $E[X \mid I = 0]$. Since the trials are independent and identical, the expected additional number of trials needed to stop after failing on the first trial equals the expected number of trials needed to stop before we ran the first trial. In essence, every time we fail, we start the process over.

So:

$$
E[X \mid I = 0] \;=\; 1 + E[X]
$$

The “1” represents the first trial that failed. The remaining $E[X]$ represents the expected number of future trials until a success. Now:

$$
E[X] \;=\; p \;+\; \big(1 + E[X]\big)(1 - p)
$$
Rearrange, and solve for $E[X]$:

$$
E[X] \;=\; p + (1 - p) + (1 - p)\, E[X] \;=\; 1 + (1 - p)\, E[X], \qquad\text{so}\qquad p\, E[X] \;=\; 1
$$
Therefore:

$$
E[X] \;=\; \frac{1}{p}
$$
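As a quick numerical check, the code cell below simulates a geometric random variable and compares the sample mean to $1/p$. The choice $p = 1/10$ comes from the intuition above and is just an example; numpy’s geometric generator counts trials up to and including the first success, matching the convention used here.

import numpy as np

rng = np.random.default_rng(0)
p = 1 / 10
n = 1_000_000

# Number of trials up to and including the first success, for each of n runs.
samples = rng.geometric(p, size=n)

print(np.mean(samples))   # simulated E[X]; close to 10
print(1 / p)              # the exact answer, 1 / p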
Independent Products
We can use iterated expectation to prove one last property of expectations. In Section 4.2 we worked out rules for simplifying expectations of sums. We can now simplify expectations of some (not all) products: if $X$ and $Y$ are independent, then

$$
E[XY] \;=\; E[X]\, E[Y]
$$
Proof
You will complete this proof on your homework using iterated integration. Here’s the start.
The proof follows from the product rule for iterated integrals and sums. As usual, we can prove the result in either the continuous or discrete case, since the rules for iterated integrals and sums are the same. In the continuous case:

$$
E[XY] \;=\; \int \int x\, y\, f_{X, Y}(x, y)\, dx\, dy
$$
Next, use the multiplication rule for independent random variables to expand the joint density (see Section 8.3):

$$
f_{X, Y}(x, y) \;=\; f_X(x)\, f_Y(y)
$$
Finally, apply iterated integration.
To complete the proof using iterated expectation, write:

$$
E[XY] \;=\; E\big[\, E[XY \mid X] \,\big]
$$
If $X$ and $Y$ are independent, then the conditional distribution of $Y$ given $X = x$ is the marginal distribution of $Y$, no matter the value of $x$. Therefore:

$$
E[XY \mid X = x] \;=\; x\, E[Y \mid X = x] \;=\; x\, E[Y]
$$
Viewing this conditional expectation as a random variable, $E[XY \mid X] = X\, E[Y]$. Now:

$$
E[XY] \;=\; E\big[ X\, E[Y] \big]
$$
The quantity $E[Y]$ does not depend on $X$, so, by the linearity of expectation:

$$
E\big[ X\, E[Y] \big] \;=\; E[Y]\, E[X]
$$
Therefore:

$$
E[XY] \;=\; E[X]\, E[Y]
$$
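Here is a simulation sketch of the independent-products property, using made-up independent distributions for $X$ and $Y$: an exponential with mean 2 and a uniform with mean 3, so $E[X]\, E[Y] = 6$.

import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Made-up independent X and Y, for illustration only.
x = rng.exponential(scale=2.0, size=n)   # E[X] = 2
y = rng.uniform(0, 6, size=n)            # E[Y] = 3

print(np.mean(x * y))              # simulated E[XY]
print(np.mean(x) * np.mean(y))     # E[X] E[Y]; both are close to 6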