7.1 Integrals of Sums and Products
Sums

Suppose that $f(x) = a g(x) + b h(x)$ for some constants $a$ and $b$, and some functions $g(x)$ and $h(x)$. What is $\sum_{\text{all } x} f(x)$? What about $\int_{\text{all } x} f(x) dx$?
Integrals and sums are linear operations. That means that, like the expectation, the integral of a sum is the sum of the integrals, the integral of a constant multiple of a function is that constant multiple of the integral, and integrals of linear combinations are linear combinations of integrals. The same rules apply to summations. In all cases, the area under the sum of two curves is the sum of the areas under each curve.
If $f(x) = a g(x) + b h(x)$ then:

$$\begin{aligned} &\sum_{\text{all } x} f(x) = a \sum_{\text{all } x} g(x) + b \sum_{\text{all } x} h(x), \\ &\int_{\text{all } x} f(x) dx = a \int_{\text{all } x} g(x) dx + b \int_{\text{all } x} h(x) dx. \end{aligned}$$

In other words, the sum/integral of a linear combination equals the linear combination of the sums/integrals.
Notice the similarity to linearity of expectation. Expectations are linear since they are weighted averages, and averages are sums/integrals.
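This linearity is easy to check numerically. Below is a small sketch; the functions $g$, $h$, the constants $a$, $b$, and the finite stand-in for "all $x$" are arbitrary illustrative choices, not from the text:

```python
# Arbitrary illustrative choices of g, h, a, b, and a finite stand-in for "all x".
def g(x):
    return x ** 2

def h(x):
    return 3 * x + 1

a, b = 2.0, -0.5
xs = range(10)

# Sum of the linear combination vs. linear combination of the sums.
lhs = sum(a * g(x) + b * h(x) for x in xs)
rhs = a * sum(g(x) for x in xs) + b * sum(h(x) for x in xs)

assert abs(lhs - rhs) < 1e-9
```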
Suppose that $I \sim \text{Bernoulli}(2/3)$ is an indicator random variable, so that $\text{Pr}(I = 0) = 1/3$ and $\text{Pr}(I = 1) = 2/3$. Suppose that, if $I = 0$, then $X \sim \text{Binomial}(100, 1/5)$. Suppose that, if $I = 1$, then $X \sim \text{Binomial}(50, 4/5)$. What is $\mathbb{E}[X]$?
To find the expectation, first expand the PMF by partitioning:
$$\begin{aligned} \text{PMF}(x) = \text{Pr}(X = x) & = \text{Pr}((X = x, I = 0) \cup (X = x, I = 1)) \\ & = \text{Pr}(X = x, I = 0) + \text{Pr}(X = x, I = 1). \end{aligned}$$

Then, use the multiplication rule to expand each joint probability:
$$\begin{aligned} \text{PMF}(x) & = \text{Pr}(I = 0) \text{Pr}(X = x|I = 0) + \text{Pr}(I = 1) \text{Pr}(X = x|I = 1) \\ & = \frac{1}{3} \text{Pr}(X = x|I = 0) + \frac{2}{3} \text{Pr}(X = x|I = 1). \end{aligned}$$

So, the PMF is a linear combination of the conditional PMFs given $I = 0$ and $I = 1$. This is an example of a mixture distribution (see Section 3.2).
It follows that:
$$\mathbb{E}[X] = \sum_{\text{all }x} x \text{PMF}(x) = \sum_{\text{all }x} \left( \frac{1}{3} x \text{Pr}(X = x|I = 0) + \frac{2}{3} x \text{Pr}(X = x|I = 1) \right).$$

Let $g(x) = x \text{Pr}(X = x|I = 0)$ and $h(x) = x \text{Pr}(X = x|I = 1)$. Then, applying the linearity of sums:
$$\mathbb{E}[X] = \frac{1}{3} \sum_{\text{all }x} x \text{Pr}(X = x|I = 0) + \frac{2}{3} \sum_{\text{all }x} x \text{Pr}(X = x|I = 1).$$

Now, each sum is a weighted average of $x$ against a PMF, so is an expectation. The first is the expected value of a Binomial random variable on 100 trials with success probability $1/5$, so equals $100 \times \frac{1}{5} = 20$. The second is the expected value of a Binomial random variable on 50 trials with success probability $4/5$, so equals $50 \times \frac{4}{5} = 40$. Therefore:
$$\mathbb{E}[X] = \frac{20}{3} + \frac{2 \times 40}{3} = \frac{100}{3} = 33.333...$$

Replacing each number with the quantity it represents:
$$\mathbb{E}[X] = \text{Pr}(I = 0) \mathbb{E}[X|I = 0] + \text{Pr}(I = 1) \mathbb{E}[X|I = 1],$$

where the conditioning bars inside the expectations indicate an expectation taken against the conditional distribution specified by the conditioning statement. This is an example of a related property of expectations, the law of total expectation, which we will explore more in the future. For now, it's enough to observe that linearity helps break problems into simpler parts.
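The mixture calculation can be checked by simulation. Here is a minimal sketch, assuming the weights $\text{Pr}(I = 0) = 1/3$ and $\text{Pr}(I = 1) = 2/3$ used in the derivation, and simulating each Binomial draw as a sum of indicator trials:

```python
import random

random.seed(0)

def draw_x():
    # Mixture weights from the derivation: Pr(I = 0) = 1/3, Pr(I = 1) = 2/3.
    if random.random() < 2 / 3:
        # I = 1: X ~ Binomial(50, 4/5), simulated as 50 indicator trials.
        return sum(random.random() < 4 / 5 for _ in range(50))
    # I = 0: X ~ Binomial(100, 1/5), simulated as 100 indicator trials.
    return sum(random.random() < 1 / 5 for _ in range(100))

n = 20_000
est = sum(draw_x() for _ in range(n)) / n

assert abs(est - 100 / 3) < 0.5  # sample mean should be close to 100/3
```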
Products

Any time we evaluate an expectation, we need to compute a sum or integral of a product of two functions (typically, the function of $X$ whose expectation we want and the distribution function of $X$).
So, suppose $f(x) = g(x) \times h(x)$. What is $\sum_{\text{all } x} f(x)$? What about $\int_{\text{all } x} f(x) dx$?
We'll state the general rule for integrals, provide some examples, then illustrate an analogous rule for sums.
Integration by Parts

Given $f(x) = g(x) h'(x)$,

$$\int_{x = a}^b f(x) dx = g(x) h(x) \Big|_{x = a}^b - \int_{a}^b g'(x) h(x) dx,$$

where $h'(x) = \frac{d}{dx} h(x)$ and $g'(x) = \frac{d}{dx} g(x)$.
Integration by parts follows simply from the product rule for differentiation. Recall that, if $f(x) = g(x) h(x)$, then:

$$f'(x) = g'(x) h(x) + g(x) h'(x).$$

So, integrating both sides:
$$\begin{aligned} f(b) - f(a) & = \int_{x = a}^b f'(x) dx = \int_{x = a}^b (g'(x) h(x) + g(x) h'(x)) dx \\ & = \int_{x = a}^b g'(x) h(x) dx + \int_{x = a}^b g(x) h'(x) dx. \end{aligned}$$

So, rearranging:
$$\int_{x = a}^b g(x) h'(x) dx = f(x) \Big|_{x = a}^b - \int_{x = a}^b g'(x) h(x) dx.$$

Finally, expressing $f$ as the product of $g$ and $h$:
$$\int_{x = a}^b g(x) h'(x) dx = g(x) h(x) \Big|_{x = a}^b - \int_{x = a}^b g'(x) h(x) dx.$$

Integration by parts swaps which factor is differentiated. It replaces $h'$ with $h$ inside the integral, and replaces $g$ with $g'$. In essence, we are moving the derivative from $h$ to $g$. To move the derivative we need to know $h$, which is the anti-derivative of $h'$. In probability, $h'$ is often a PDF, so $h$ is often a CDF.
This is a long formula. A mnemonic helps.
This rule is often written:
$$\int_{x = a}^b u(x) dv(x) = u(x) v(x) \Big|_{x = a}^b - \int_{a}^b v(x) du(x),$$

where $dv(x) = \frac{d}{dx} v(x) dx = v'(x) dx$ and $du(x) = \frac{d}{dx} u(x) dx = u'(x) dx$.
You can remember the right hand side as "ultra-violet voodoo" for "$uv$ less $v \, du$."
To apply integration by parts, follow the same steps every time:

1. Write $f$ as a product of two functions: $f(x) = g(x) h'(x)$. Look for a pair of factors where:
   - $g'(x)$ is a simpler function than $g(x)$;
   - $h'(x)$ is easy to integrate to recover $h(x)$. If we can't find the anti-derivative, then we won't be able to integrate by parts.
2. Find $g'(x) = \frac{d}{dx} g(x)$.
3. Find $h(x) = \int^x h'(s) ds$.
4. Try to evaluate $\int_{a}^b g'(x) h(x) dx$.
5. Combine terms.
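As an illustration of these steps, the identity can be verified numerically on an example not taken from the text: $f(x) = x e^x$ on $[0, 1]$, with $g(x) = x$ and $h'(x) = e^x$, using a simple midpoint-rule quadrature:

```python
import math

def quad(f, a, b, n=100_000):
    # Midpoint-rule quadrature: accurate enough here for smooth integrands.
    dx = (b - a) / n
    return sum(f(a + (i + 0.5) * dx) for i in range(n)) * dx

# f(x) = x * e^x, with g(x) = x (so g'(x) = 1) and h'(x) = e^x (so h(x) = e^x).
a, b = 0.0, 1.0
lhs = quad(lambda x: x * math.exp(x), a, b)

# Right hand side of integration by parts: g(x) h(x) at the endpoints,
# minus the integral of g'(x) h(x).
boundary = b * math.exp(b) - a * math.exp(a)
rhs = boundary - quad(lambda x: math.exp(x), a, b)

assert abs(lhs - rhs) < 1e-6  # both sides agree
```

For this choice the exact value is $e - (e - 1) = 1$, which the quadrature reproduces to within the rule's error.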
Suppose that $X \in [0,1]$ and $\text{PDF}(x) \propto g(x) = x (1 - x)^4$. What is the normalizing constant?
To find the normalizing constant, we should integrate $g(x)$ over all $x$:
$$\int_{x = 0}^1 x (1 - x)^4 dx = \int_{x = 0}^1 u(x) v'(x) dx,$$

where $u(x) = x$ and $v'(x) = (1 - x)^4$. Then $du(x) = \frac{d}{dx} u(x) dx = 1 \cdot dx = dx$ and $v(x) = \int^x v'(s) ds = \int^x (1 - s)^4 ds = -\frac{1}{5}(1 - s)^5 \big|^x = -\frac{1}{5}(1 - x)^5$.
Next, evaluate the integral of $v \, du$:
$$\begin{aligned} \int_{x = 0}^1 v(x) du(x) & = -\frac{1}{5} \int_{x = 0}^{1} (1 - x)^5 dx \\ & = \frac{1}{5} \cdot \frac{1}{6} (1 - x)^{6} \Big|_{0}^1 \\ & = \frac{1}{30}(0 - 1) = -\frac{1}{30}. \end{aligned}$$

The "$uv$" term is:
$$u(x) v(x) \Big|_{0}^1 = -\frac{1}{5} x (1 - x)^5 \Big|_{0}^1 = -\frac{1}{5}(0 - 0) = 0.$$

So, combining terms:
$$\int_{x = 0}^1 g(x) dx = 0 + \frac{1}{30} = \frac{1}{30}.$$

It follows that:
$$\text{PDF}(x) = 30 x (1 - x)^4.$$

To check our work, notice that the value of the integral wouldn't change if we reflected the function $g(x)$ about $x = 0.5$. This swaps $x$ with $1 - x$. Then:
$$\begin{aligned} \int_{x = 0}^1 x^4 (1 - x) dx & = \int_{x = 0}^1 (x^4 - x^5) dx \\ & = \int_{x = 0}^1 x^4 dx - \int_{x = 0}^1 x^5 dx \\ & = \frac{1}{5} x^5 - \frac{1}{6} x^6 \Big|_{0}^1 \\ & = \frac{1}{5} - \frac{1}{6} = \frac{1}{30}. \end{aligned}$$

Suppose that $X \sim \text{Exponential}(\lambda)$. What is $\mathbb{E}[X]$?
Expanding the expectation as a weighted average gives:
$$\mathbb{E}[X] = \int_{x = 0}^{\infty} x \text{PDF}(x) dx = \int_{x = 0}^{\infty} x \lambda e^{-\lambda x} dx.$$

Set $g(x) = x$ and $h'(x) = \lambda e^{-\lambda x}$. Then $g'(x) = 1$ and $h(x) = -e^{-\lambda x}$.
Then:
$$\begin{aligned} \int_{x = 0}^{\infty} g'(x) h(x) dx & = -\int_{x = 0}^{\infty} e^{-\lambda x} dx = \frac{1}{\lambda} e^{-\lambda x} \Big|_{0}^\infty \\ & = \frac{1}{\lambda}(0 - 1) = -\frac{1}{\lambda}. \end{aligned}$$

The "$gh$" term is:
$$-x e^{-\lambda x} \Big|_0^{\infty} = 0 - 0 = 0.$$

Therefore:
$$\int_{x = 0}^{\infty} x \lambda e^{-\lambda x} dx = 0 - \left(-\frac{1}{\lambda}\right) = \frac{1}{\lambda}.$$

So, if $X \sim \text{Exponential}(\lambda)$, then $\mathbb{E}[X] = \frac{1}{\lambda}$.
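A quick Monte Carlo sketch agrees with this calculation; `random.expovariate(lam)` in Python's standard library draws Exponential($\lambda$) samples (the rate $\lambda = 2$ is an arbitrary choice):

```python
import random

random.seed(1)
lam = 2.0  # arbitrary rate for the check

# Sample mean of Exponential(lam) draws should approach 1/lam.
n = 100_000
est = sum(random.expovariate(lam) for _ in range(n)) / n

assert abs(est - 1 / lam) < 0.02  # close to 1/lam = 0.5
```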
Tail Integrals

Any time we ask for the expectation of a continuous, nonnegative random variable we need to evaluate the integral $\int_{x = 0}^{b} x \text{PDF}(x) dx$, where $b$ is the upper endpoint of the support. In this case integration by parts produces an alternate formula for the expectation, using the area under the survival function.
Setting $g(x) = x$ gives $g'(x) = 1$. Setting $h'(x) = \text{PDF}(x)$ gives $h(x) = \text{CDF}(x)$. Then, integrating by parts:
$$\begin{aligned} \int_{x = 0}^b x \text{PDF}(x) dx & = x \text{CDF}(x) \Big|_{0}^b - \int_{x = 0}^b \text{CDF}(x) dx \\ & = [b \times \text{CDF}(b) - 0 \times \text{CDF}(0)] - \int_{x = 0}^b \text{CDF}(x) dx \\ & = [b \times 1 - 0 \times 0] - \int_{x = 0}^b \text{CDF}(x) dx \\ & = b - \int_{x = 0}^b \text{CDF}(x) dx \\ & = \int_{x = 0}^b (1 - \text{CDF}(x)) dx \\ & = \int_{x = 0}^b (1 - \text{Pr}(X \leq x)) dx \\ & = \int_{x = 0}^b \text{Pr}(X > x) dx. \end{aligned}$$

This formula holds for any $b$, so it holds for any nonnegative random variable. We can extend the upper bound of the integral to infinity since $\text{Pr}(X > x) = 0$ when $x$ is above the upper bound of the support.
For example, if $X \sim \text{Exponential}(\lambda)$ then $\text{Pr}(X > x) = e^{-\lambda x}$, so:

$$\mathbb{E}[X] = \int_{x = 0}^{\infty} e^{-\lambda x} dx = \frac{1}{\lambda}.$$
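The tail integral can also be checked numerically. A midpoint-rule sketch follows; the truncation point $T$ is an arbitrary choice, large enough that the neglected tail beyond it is negligible:

```python
import math

lam = 2.0  # arbitrary rate for the check

# Survival function of Exponential(lam): Pr(X > x) = exp(-lam * x).
def survival(x):
    return math.exp(-lam * x)

# Midpoint-rule quadrature on [0, T]; the tail beyond T is negligible here.
T, n = 50.0, 200_000
dx = T / n
area = sum(survival((i + 0.5) * dx) for i in range(n)) * dx

assert abs(area - 1 / lam) < 1e-4  # area under the survival curve is E[X] = 1/lam
```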
Summation by Parts

Most integral rules match a rule for sums. The integration by parts strategy can also be used for sums of products.
Consider a sum of the form
$$\sum_{x = 1}^n g(x) h(x)$$

for $n$ possibly infinite.
The discrete analog to a derivative is a difference. Let:
$$g'(x) = g(x) - g(x - 1), \quad g(0) = 0.$$

Then, just as derivatives are inverted by integrals, differences are inverted by running sums:
$$g(x) = (g(x) - g(x - 1)) + (g(x - 1) - g(x - 2)) + ... + (g(2) - g(1)) + g(1) = \sum_{y=1}^x g'(y).$$

So, we can write our original sum:
$$\sum_{x = 1}^n g(x) h(x) = \sum_{x = 1}^n \left[\sum_{y = 1}^x g'(y) \right] h(x) = \sum_{x = 1}^n \sum_{y=1}^x g'(y) h(x).$$

To recover the discrete analog to integration by parts, reverse the order of the sum. First, notice that the sum runs over all pairs $x, y$ such that $1 \leq y \leq x \leq n$. We can write this sum:
$$\sum_{x = 1}^n \sum_{y=1}^x g'(y) h(x) = \sum_{1 \leq y \leq x \leq n} g'(y) h(x) = \sum_{y = 1}^n \sum_{x = y}^n g'(y) h(x).$$

Since $g'(y)$ does not depend on $x$:
$$\sum_{y = 1}^n \sum_{x = y}^n g'(y) h(x) = \sum_{y = 1}^n g'(y) \left[\sum_{x = y}^n h(x) \right].$$

The sum $\sum_{x = y}^n h(x)$ is analogous to an anti-derivative. To make the analogy explicit, let $H(y) = \sum_{x = 1}^{y-1} h(x)$. Then $\sum_{x = y}^n h(x) = H(n+1) - H(y)$, so:
$$\begin{aligned} \sum_{y = 1}^n g'(y) \left[\sum_{x = y}^n h(x) \right] & = \sum_{y = 1}^n g'(y) [H(n+1) - H(y)] \\ & = H(n+1) \left( \sum_{y = 1}^n g'(y) \right) - \sum_{y=1}^n g'(y) H(y). \end{aligned}$$

Finally, since $\sum_{y=1}^n g'(y) = g(n) - g(0) = g(n)$:
$$\sum_{x = 1}^n g(x) h(x) = g(n) H(n+1) - \sum_{y=1}^n g'(y) H(y).$$

The summation by parts formula is perfectly analogous to the integration by parts formula. You'll prove integration by parts on your homework using the same steps, but replacing differencing with derivatives, and summation with integration.
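The identity can be sanity-checked numerically; the particular $g$ and $h$ below are illustrative choices (any $g$ with $g(0) = 0$ works):

```python
# Illustrative choices of g and h; any g with g(0) = 0 works.
def g(x):
    return x * x

def h(x):
    return 1.0 / x

def g_diff(x):
    # Discrete derivative: g'(x) = g(x) - g(x - 1).
    return g(x) - g(x - 1)

def H(y):
    # Running sum H(y) = h(1) + ... + h(y - 1), the discrete anti-derivative.
    return sum(h(x) for x in range(1, y))

n = 50
lhs = sum(g(x) * h(x) for x in range(1, n + 1))
rhs = g(n) * H(n + 1) - sum(g_diff(y) * H(y) for y in range(1, n + 1))

assert abs(lhs - rhs) < 1e-9
```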
Tail Sums

Summation by parts is most often used to find the expectation of count-valued random variables.
Suppose that $X \in \{0,1,2,3,...,n\}$. Then:
$$\mathbb{E}[X] = \sum_{x = 0}^{n} x \text{PMF}(x),$$

where $n$ is possibly infinite.
Then, let:
$$g(x) = x, \quad h(x) = \text{PMF}(x).$$

Then:
$$g'(x) = g(x) - g(x - 1) = x - (x - 1) = 1,$$

and:
$$H(y) = \sum_{x < y} h(x) = \sum_{x < y} \text{PMF}(x) = \text{CDF}(y - 1).$$

It follows that:
$$\begin{aligned} \mathbb{E}[X] & = n \times \text{CDF}(n) - \sum_{y=1}^n 1 \times \text{CDF}(y - 1) \\ & = n \times 1 - \sum_{y=1}^n \text{CDF}(y-1) = n - \sum_{y=1}^n \text{CDF}(y-1) \\ & = \sum_{y = 1}^n (1 - \text{CDF}(y - 1)) = \sum_{y = 1}^{n} (1 - \text{Pr}(X < y)) \\ & = \sum_{y = 1}^{n} \text{Pr}(X \geq y). \end{aligned}$$

If $X \in \{0,1,2,3,...,n\}$ then:
$$\mathbb{E}[X] = \sum_{y = 1}^{n} \text{Pr}(X \geq y).$$

In other words, the expectation of a count-valued random variable is the area under its survival function.
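The tail sum formula can be checked directly against a small count-valued distribution (the PMF below is a hypothetical example, not from the text):

```python
# A small, hypothetical count-valued distribution on {0, 1, 2, 3}.
pmf = {0: 0.1, 1: 0.2, 2: 0.4, 3: 0.3}
n = max(pmf)

# Expectation computed the usual way.
mean = sum(x * p for x, p in pmf.items())

# Expectation computed as the sum of the tail probabilities Pr(X >= y).
tail_sum = sum(
    sum(p for x, p in pmf.items() if x >= y)
    for y in range(1, n + 1)
)

assert abs(mean - tail_sum) < 1e-12  # both equal 1.9 here
```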
The tail sum formula is often proved using a different algebraic argument. Consider the sum:
$$\mathbb{E}[X] = \sum_{x = 0}^{n} x \text{PMF}(x) = \sum_{x = 1}^{n} x \text{PMF}(x).$$

For concision, let $p(x) = \text{PMF}(x)$. Then the sum has the form:
$$\mathbb{E}[X] = 1 \times p(1) + 2 \times p(2) + 3 \times p(3) + ... + n \times p(n).$$

We can organize this sum:
$$\begin{aligned} \mathbb{E}[X] = \; & p(1) + ... \\ & p(2) + p(2) + ... \\ & p(3) + p(3) + p(3) + ... \\ & \vdots \\ & p(n) + p(n) + p(n) + p(n) + ... + p(n), \end{aligned}$$

where row $x$ contains $x$ copies of $p(x)$. Summing down the columns instead of across the rows returns the tail sum formula:
$$\begin{aligned} \mathbb{E}[X] & = \left(\sum_{x = 1}^n p(x) \right) + \left(\sum_{x = 2}^n p(x) \right) + \left(\sum_{x = 3}^n p(x) \right) + ... + (p(n)) \\ & = \text{Pr}(X \geq 1) + \text{Pr}(X \geq 2) + ... + \text{Pr}(X \geq n) = \sum_{y=1}^n \text{Pr}(X \geq y). \end{aligned}$$

This proof is less general, but more transparent.
Suppose that $X \sim \text{Geometric}(p)$. What is $\mathbb{E}[X]$?
If $X$ is a geometric random variable then $X$ is count-valued. So, using the tail sum formula:
$$\mathbb{E}[X] = \sum_{x = 1}^{\infty} \text{Pr}(X \geq x).$$

The chance that a geometric random variable is greater than or equal to a threshold $x$ is the chance that $x - 1$ independent, identical binary trials all fail. The trials have success probability $p$, so the chance that each fails is $1 - p$. The chance of $x - 1$ failures is $(1 - p)^{x - 1}$. Therefore:
$$\mathbb{E}[X] = \sum_{x = 1}^{\infty} (1 - p)^{x - 1} = \sum_{n = 0}^{\infty} q^n,$$

where $q = 1 - p$.
Since the sum is a geometric series (see Section 5.1), it converges to:
$$\mathbb{E}[X] = \sum_{n = 0}^{\infty} q^n = \frac{1}{1 - q} = \frac{1}{p}.$$

So, the expectation of a geometric random variable is the reciprocal of its success probability.
This result makes intuitive sense. If the chance of success is 1/100, then it is sensible that the expected number of trials until a success is 100.
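A simulation sketch agrees: counting trials up to and including the first success (with the arbitrary choice $p = 1/4$) gives a sample mean near $1/p = 4$:

```python
import random

random.seed(2)
p = 0.25  # arbitrary success probability for the check

def geometric(p):
    # Number of independent trials up to and including the first success.
    trials = 1
    while random.random() >= p:
        trials += 1
    return trials

n = 50_000
est = sum(geometric(p) for _ in range(n)) / n

assert abs(est - 1 / p) < 0.1  # sample mean close to 1/p = 4
```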