We’ve shown that sample averages are guaranteed to converge to an underlying expectation provided the samples are sufficiently uncorrelated and are drawn with the same (or converging) expectations. We proved this result with an upper bound on tail probabilities for the distribution of the sample average. Unfortunately, the tail bounds we used often overestimate the true tail probabilities. So, while strong enough to prove the weak law of large numbers, they are too weak to give useful finite-sample guarantees when the user needs to certify accuracy with high probability.
Our results have been imprecise since we have avoided the exact distribution of the sum or sample average. This was useful since finding exact distributions is hard. We relied on summary quantities (e.g. expectations and variances) since these summaries were easier to compute and powerful enough to develop a general theory that applied no matter the original distribution. To improve on this theory, we will need to study the actual distribution of sums and sample averages of many samples.
All examples and results in this section will pursue the exact distribution of a sum in a limit where the sum includes many terms.
Remarkably, the exact distribution of a sum, or sample average, converges, in the limit of infinitely many samples, to a normal distribution, regardless of the original distribution used to generate the samples.
This theorem is the last major result in most introductory probability classes. It is especially useful when we want to produce confidence intervals. It suggests much tighter intervals than the tail bounds developed in Section 13.2. It explains the ubiquity of the normal distribution in applied probability and probability modeling.
Suppose that $\{X_j\}_{j=1}^n$ are drawn independently and identically from a Bernoulli distribution with success probability $p$.
In this case we can write down the distribution of $S_n$ exactly. The sum, $S_n = \sum_{j=1}^n X_j$, is a sum of independent, identical indicators, so is drawn from a binomial distribution with parameters $n$ and $p$:

$$P(S_n = k) = \binom{n}{k} p^k (1-p)^{n-k}, \qquad k = 0, 1, \ldots, n.$$
Since we have a closed form for the PMF of the sum, we can study the limiting distribution of the sum directly. Run the code cell below to visualize the binomial distribution. Choose some p and hold it fixed. Then gradually increase n.
from utils_dist import run_distribution_explorer
run_distribution_explorer("Binomial");
You should notice three main effects:
As you increase $n$, the peak in the distribution slides to the right. We’ve known this for a while. The expected value of a binomial distribution is, by additivity of expectation, $np$, and its mode is close to $np$. So, the peak of the distribution will move rightward along the line $np$ as a function of $n$.
The distribution gets wider as $n$ increases. We proved this in Section 13.1. The variance of a sum of independent, identical random variables grows proportionally to the number of terms in the sum. So, the standard deviation of $S_n$ grows proportionally to $\sqrt{n}$. Using our rules for the variance of a sum, $\mathrm{SD}[S_n] = \sqrt{np(1-p)}$.
Notice that the standard deviation grows more slowly than the mean ($O(n^{1/2})$ vs. $O(n)$). So, if your axes adjust to fit the bulk of the distribution, the distribution may appear to grow narrower as $n$ increases. Pay attention to the marks on the x-axis: the distribution is getting wider as $n$ increases. (The code sketch after this list checks the first two observations numerically.)
The distribution becomes more and more bell-shaped.
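Here is that check. This is a minimal sketch assuming scipy is available; the values of $p$ and $n$ are arbitrary illustrative choices.

```python
# Minimal check of the first two observations (scipy assumed; the values
# of p and n are arbitrary illustrative choices).
from scipy.stats import binom

p = 0.3
for n in [10, 100, 1000]:
    dist = binom(n, p)
    # mean() returns n*p; std() returns sqrt(n*p*(1-p)).
    print(f"n={n:4d}  mean={dist.mean():6.1f}  sd={dist.std():6.2f}")
```

Each tenfold increase in $n$ multiplies the mean by 10, but the standard deviation by only about $\sqrt{10} \approx 3.2$.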
The last observation is the remarkable one. It is true for the sum of any Bernoulli random variables with $p \neq 0, 1$. It would also be true had we used essentially any distribution to sample the $X$’s!
The function for the bell curve you are seeing is given by the normal curve with mean $\mu = np$ and standard deviation $\sigma = \sqrt{np(1-p)}$:

$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}.$$
A random variable with density of the form provided above is a normal random variable. You can experiment with normal random variables by running the code cell below. Note that the normal curve has the same shape as the bell shaped histogram we observed for the binomial with large n.
from utils_dist import run_distribution_explorer
run_distribution_explorer("Normal");
We proved that the binomial does, in fact, converge to the normal curve at the end of Section 6.4. There, we assumed that $p = 1/2$ and proceeded with a detailed limiting analysis based on Stirling’s formula (see Section 6.3).
For now, we will accept the claim based on the observation that the plotted histogram and plotted curve look suspiciously similar. We will recapitulate our old proof at the end of this chapter. We save the proof for later since the specific limiting analysis depends on details of the binomial PMF.
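You can see the resemblance directly by overlaying the binomial PMF with the matching normal curve. This is a minimal sketch assuming numpy, matplotlib, and scipy are available; $n$ and $p$ are arbitrary illustrative choices.

```python
# Overlay the binomial PMF with the normal curve of matching mean and SD.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binom, norm

n, p = 100, 0.3
mu, sigma = n * p, np.sqrt(n * p * (1 - p))

k = np.arange(n + 1)
plt.bar(k, binom.pmf(k, n, p), width=1.0, alpha=0.5, label="Binomial PMF")

# The bars have width 1, so PMF heights are directly comparable to a density.
x = np.linspace(mu - 4 * sigma, mu + 4 * sigma, 400)
plt.plot(x, norm.pdf(x, mu, sigma), color="black", label="Normal density")
plt.legend()
plt.show()
```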
Our goal now is to see that we would arrive at the same normal curve, from any initial distribution. Since we can’t test all initial distributions, we’ll try a different test case.
The simplest discrete case is $X_j \sim \mathrm{Bernoulli}(0.5)$. This is a uniform distribution on the set $\{0, 1\}$. Let’s try the simplest continuous analog, $X_j \sim \mathrm{Uniform}([0,1])$. Then $X_j \in [0,1]$ for all $j$ and:

$$f_{X_j}(x) = \begin{cases} 1 & x \in [0,1] \\ 0 & \text{otherwise.} \end{cases}$$
So, to find $S_3$, we need to compute $[f_{S_2} * f_U](s)$. Since both densities are defined piecewise, the resulting density for $S_3$ will also be defined piecewise. We can work out the boundaries between pieces as follows:
- $S_3 < 0$ is impossible since $S_3$ is a sum of nonnegative variables.
- $S_3 \in [0,1)$ requires $S_2 < 1$ since $X_3$ is nonnegative.
- $S_3 \in [1,2]$ allows any $S_2 \in [0,2]$.
- $S_3 \in (2,3]$ requires $S_2 > 1$ since $X_3$ is at most one.
- $S_3 > 3$ is impossible since $S_3$ is a sum of three variables, all less than or equal to 1.
The first and fifth observations fix the support, $S_3 \in [0,3]$. The middle three observations divide the support into three intervals: $[0,1)$, $[1,2]$, $(2,3]$.
So, to find the density function, we should run the convolution separately on each interval. To see the explicit convolution, open the dropdown below.
Convolution
We’ll work out the convolution one interval at a time. We’ll do the outer intervals first since the distribution of a sum of uniform random variables is symmetric about its expectation (the distribution of $3 - S_3$ is the same as the distribution of $\sum_{j=1}^3 (1 - X_j)$, which has the same distribution as $S_3 = \sum_{j=1}^3 X_j$ since $X_j$ and $1 - X_j$ are identically distributed when $X_j$ is drawn uniformly on $[0,1]$).
On the middle interval, the convolution returns $f_{S_3}(s_3) = -s_3^2 + 3 s_3 - \tfrac{3}{2}$. This is the downward facing parabola, centered at $s_3 = 1.5$, that equals $1/2$ at $s_3 = 1$ and $1/2$ at $s_3 = 2$.
After convolving, we find that $f_{S_3}(s_3)$ is a piecewise function that is:
- zero outside $[0,3]$,
- an upward facing parabola centered at zero connecting $(0,0)$ to $(1,1/2)$,
- a downward facing parabola centered at $1.5$ connecting $(1,1/2)$ to $(2,1/2)$,
- then an upward facing parabola centered at $3$ connecting $(2,1/2)$ to $(3,0)$.
Reading in order, the density function is constant at zero, curves up, curves down, curves up, then is constant at zero again. The result is a symmetric bell centered at $1.5$. The central location is sensible since $E[S_3] = 3E[X_j] = 3 \times \tfrac{1}{2} = 1.5$.
Clearly, this procedure is too involved to continue by hand. Nevertheless, the graphical trend is clear:
- $S_1$ is drawn from a box-shaped distribution,
- $S_2$ is drawn from a tent-shaped distribution,
- $S_3$ is drawn from a bell-shaped distribution (with parabolic pieces).
It should not be too surprising that, if we keep going, the resulting distribution becomes a smoother and smoother bell. Here are the first four densities, which you can reproduce with the sketch below:
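This is a minimal sketch that generates the densities by repeated numerical convolution; numpy and matplotlib are assumed, and the grid spacing `dx` is an arbitrary discretization choice.

```python
# Approximate the densities of S_1, ..., S_4 by repeated numerical
# convolution of the Uniform([0, 1]) density.
import numpy as np
import matplotlib.pyplot as plt

dx = 0.001
f_uniform = np.ones(int(1 / dx))   # density of X_j: equal to 1 on [0, 1)
f = f_uniform.copy()               # density of S_1

for n in range(1, 5):
    s = np.arange(len(f)) * dx
    plt.plot(s, f, label=f"$S_{n}$")
    # np.convolve approximates the convolution integral up to a factor of dx.
    f = np.convolve(f, f_uniform) * dx

plt.legend()
plt.show()
```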
The last density is very bell-shaped! It is a symmetric, piecewise density, centered at $2$, with four pieces, all cubic functions of $s_4$.
First, let’s recall the definition of a normal random variable (see Sections 5.4 and 6.4).
To make this statement formal, we need to adjust it in two ways. First, we need to define what we mean by convergence.
What do we mean by convergence?
We will say that a sequence of random variables $\{Y_n\}_{n=1}^\infty$ converges in distribution to a limiting variable $Y$ if the cumulative distribution function of $Y_n$ converges to the cumulative distribution function of $Y$. We use the CDF to avoid finagling over the difference between density and mass functions. Then, we can guarantee that, for any interval $[a,b]$:

$$\lim_{n \to \infty} P(Y_n \in [a,b]) = P(Y \in [a,b]).$$
The same conclusion extends to finite collections of disjoint intervals, so it ensures that any answer to a probability question we might ask about $Y_n$ can be approximated with an answer using $Y$, provided $n$ is sufficiently large.
Second, we need to change to standard variables. Neither the sum nor the sample average admits a sensible limiting distribution, since the expectation of the sum diverges while the variance of the sample average converges to zero.
Go back to the binomial example. The distribution of $S_n$ slides rightward while widening as $n$ increases. So, $S_n$ doesn’t approach a random variable drawn from any fixed limiting distribution.
The sample average, $\bar{X}_n$, behaves more nicely. Its mean stays planted at $\mu$ for all $n$. However, as we increase $n$, the distribution of the sample average concentrates, so $\mathrm{SD}[\bar{X}_n]$ converges to zero. Therefore, if $\bar{X}_n$ approaches a limiting variable, the limiting variable is deterministic: $\lim_{n\to\infty} \bar{X}_n = \mu$. So, $\bar{X}_n$ does not have an informative limiting distribution either. Its limiting distribution is a point mass: infinite density at $\mu$ and zero everywhere else.
To get a sensible limiting distribution, we need to find a transformation of $S_n$ and $\bar{X}_n$ whose mean and standard deviation converge to sensible values as $n$ diverges.
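Concretely, writing $\mu = E[X_j]$ and $\sigma = \mathrm{SD}[X_j]$, the transformation we will use is the usual standardization (this choice is consistent with the $Z_n$ that appears later in this section):

$$Z_n \;=\; \frac{S_n - n\mu}{\sigma\sqrt{n}} \;=\; \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}}.$$

By construction, $E[Z_n] = 0$ and $\mathrm{SD}[Z_n] = 1$ for every $n$, so a nondegenerate limiting distribution is at least possible.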
Graphical Intuition
You can think about this graphically. The standard bell curve we keep seeing is produced by plotting the distribution, then shifting the plot so that the bell is centered at zero, and scaling the x-axis so that the axis limits fit most of the bell. The first step is a centering step: it keeps the mean fixed at zero. The second step is a scaling step: it chooses an x-axis scale so that the bell stays the same width. These are the steps needed to standardize a random variable (see Sections 4.3 and 11.1).
The central limit theorem applies regardless of the underlying distribution producing the samples. However, the specific limiting construction matters.
For example, we saw in Section 6.4 that a binomial distribution with parameters $n$ and $p_n = \lambda/n$ converges, as $n$ diverges, to a Poisson distribution, not a normal distribution. In this case, the limit is Poisson since the success probability vanishes as $n$ increases. The central limit theorem assumes that the distribution used to produce the samples stays fixed as we increase $n$.
Similarly, the way in which we combine the $n$ samples matters. If we’d taken a sample maximum, instead of a sample average, then the limiting distribution would not be normal. Instead, it would be drawn from a generalized extreme value distribution.
Suppose that $\{X_j\}_{j=1}^n$ are independent, identically sampled Bernoulli random variables with unknown success probability $p$. Then, the maximum likelihood estimator for $p$ is the sample average:

$$\hat{p} = \bar{X}_n = \frac{1}{n} \sum_{j=1}^n X_j.$$
The sample average is the observed frequency of success in the n trials.
If $n$ is large, then $\hat{p}$ is a sample average of a large collection of independent, identical samples. So, it is approximately normally distributed with mean $E[X_j] = p$ and variance $\frac{1}{n}\mathrm{Var}[X_j] = \frac{1}{n}p(1-p)$.
So, as long as $n$ is large, the probability that the observed frequency differs from the true success probability by more than $k$ standard deviations is, approximately:

$$P\left(|\hat{p} - p| > k \sqrt{\tfrac{p(1-p)}{n}}\right) \approx P(|Z| > k), \qquad Z \sim \mathrm{Normal}(0, 1).$$
The probabilities that a standard normal random variable is larger, in magnitude, than an integer $k$ are well known. We’ll compare them to the upper bounds produced by Chebyshev’s inequality (see Section 13.3):
| $k$ | 1 | 2 | 3 |
| --- | --- | --- | --- |
| Chebyshev bound | 1 | 1/4 | 1/9 |
| CLT approximation | 0.32 | 0.05 | <0.01 |
Notice how much smaller the tail probabilities suggested by the CLT are than the upper bounds provided by Chebyshev. It follows that, when the assumptions of the CLT apply, the tail probabilities for a sample sum or sample average will often be much smaller than the Chebyshev bounds.
So, for sufficiently large $n$, the chance that the observed frequency is within two standard deviations of the true chance converges to about 95%, and the chance it is within three standard deviations converges to about 99.7%:

$$P\left(|\hat{p} - p| \leq 2\sqrt{\tfrac{p(1-p)}{n}}\right) \approx 0.95, \qquad P\left(|\hat{p} - p| \leq 3\sqrt{\tfrac{p(1-p)}{n}}\right) \approx 0.997.$$
This example illustrates the power of the CLT. The CLT allows us to approximate exact tail probabilities in the limit of a large sample size, regardless of the initial distribution that produces the samples! It also explains the folk wisdom: “When estimating an unknown quantity, your result is probably accurate within ±2 standard deviations.”
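You can check both the table and the folk wisdom by simulation. This is a minimal sketch assuming numpy and scipy are available; $n$, $p$, and the number of repetitions are arbitrary illustrative choices.

```python
# Compare empirical tail probabilities for the observed frequency with the
# CLT approximation and the Chebyshev upper bound.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n, p, trials = 1000, 0.3, 100_000
sd = np.sqrt(p * (1 - p) / n)                  # SD of the sample average

p_hat = rng.binomial(n, p, size=trials) / n    # simulated observed frequencies

for k in [1, 2, 3]:
    empirical = np.mean(np.abs(p_hat - p) > k * sd)
    clt = 2 * norm.sf(k)                       # P(|Z| > k), Z standard normal
    chebyshev = 1 / k**2                       # an upper bound, not an estimate
    print(f"k={k}:  empirical={empirical:.3f}  CLT={clt:.3f}  Chebyshev<={chebyshev:.3f}")
```

The empirical frequencies should land close to the CLT column of the table, well below the Chebyshev bounds.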
We are not equipped to prove the CLT for arbitrary distributions. If you’d like to see the general proof, take a second course in probability (e.g. Data 140 or Stat 134).
We will satisfy ourselves with some specific cases where we can perform the limiting analysis directly. We completed the first case in Section 6.4.
The constant $2/\sqrt{n}$ out front is the interval between successive possible values of $Z_n$. We’ll call this gap $\Delta z_n$. To convert to a density function, we will divide out by $\Delta z_n$. Dividing by $\Delta z_n$ cancels the $2/\sqrt{n}$ term. For details, check the dropdown below.
Conversion to a Density
The standard normal random variable, Z, is continuously distributed, so is parameterized by a density. Each Zn is a discrete random variable. To recover a density from a probability, we need to divide out by the length of an interval.
In this case, we can construct a density from $Z_n$ by replacing $Z_n$ with a random variable $W_n$, where $W_n \mid Z_n = z \sim \mathrm{Uniform}(z - \Delta z_n/2,\, z + \Delta z_n/2)$ and $\Delta z_n$ is the gap between successive possible values of $Z_n$. Since $Z_n = \frac{1}{\sqrt{n}}(2X_n - n)$ and $X_n$ is integer valued, $\Delta z_n = \frac{2}{\sqrt{n}}$.
- Representing the PMF of $Z_n$ with a bar plot. The width of each bar is $\Delta z_n$.
- Scaling the heights of the bars by their widths so that each bar’s area returns the PMF value. The heights of the scaled bars return the density function for $W_n$.
- Integrating over the density function of $W_n$, with bounds equal to the endpoints of the bars, will sum over the PMF of $Z_n$. So, all probability questions we could ask about $Z_n$ can be answered by integrating over the density function of $W_n$.
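Here is a minimal sketch of this construction for the $p = 1/2$ case, assuming numpy, matplotlib, and scipy are available ($n$ is an arbitrary illustrative choice):

```python
# Build the density of W_n from the binomial PMF and compare it to the
# standard normal density.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binom, norm

n = 50
k = np.arange(n + 1)
z = (2 * k - n) / np.sqrt(n)     # possible values of Z_n
dz = 2 / np.sqrt(n)              # gap between successive values

# Heights are PMF values divided by the bar width, so each bar's area
# equals the corresponding PMF value.
plt.bar(z, binom.pmf(k, n, 0.5) / dz, width=dz, alpha=0.5, label="$W_n$ density")

zz = np.linspace(-4, 4, 400)
plt.plot(zz, norm.pdf(zz), color="black", label="Standard normal")
plt.legend()
plt.show()
```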
Now that we’ve handled the normalizing constants, focus on the functional form:
The expression on the left is a PDF since $\Delta z_n$ converges to zero as $n$ diverges. The expression on the right is the standard normal density function. Therefore:

$$\lim_{n \to \infty} f_{W_n}(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}.$$
If you like challenging algebra exercises, try the same limiting analysis for generic $p$. The steps are all the same, but you will need to pay careful attention when you standardize and when you apply Stirling’s formula.
Let’s try a continuous example. The uniform case is hard, so we’ll pick the next simplest example, exponential random variables.
Suppose that $\{X_j\}_{j=1}^n$ are drawn independently and identically from an exponential distribution with parameter $\lambda$.
In this case, we can work out the distribution of the sum $S_n$ exactly using convolution.
Run the code cell below to visualize the convolution of two exponential random variables. Select exponential for both distributions, then match their parameters.
%matplotlib inline
from utils_convolution import show_convolution
show_convolution()
The distribution you’ve produced has density function:

$$f_{S_2}(s) = \lambda^2 s\, e^{-\lambda s}, \qquad s \geq 0.$$
You worked out the result for general $n$ on Homework 12. For the necessary work, check the matching solutions. The sum is gamma distributed with shape parameter $n$ and rate parameter $\lambda$:

$$f_{S_n}(s) = \frac{\lambda^n s^{n-1}}{(n-1)!}\, e^{-\lambda s}, \qquad s \geq 0.$$
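You can confirm the gamma claim by simulation. This is a minimal sketch assuming numpy, matplotlib, and scipy are available; $n$, $\lambda$, and the number of samples are arbitrary illustrative choices.

```python
# Compare a histogram of simulated sums of exponentials to the gamma density.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gamma

rng = np.random.default_rng(0)
n, lam = 5, 2.0

# Each row sums n independent Exponential(lam) samples.
sums = rng.exponential(scale=1 / lam, size=(100_000, n)).sum(axis=1)

plt.hist(sums, bins=100, density=True, alpha=0.5, label="Simulated sums")
s = np.linspace(0, sums.max(), 400)
# scipy parameterizes the gamma by shape a and scale 1/rate.
plt.plot(s, gamma.pdf(s, a=n, scale=1 / lam), color="black", label="Gamma density")
plt.legend()
plt.show()
```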
Experiment with gamma distributions by running the cell below. Start with shape $= 1$. This returns an exponential random variable. Then increase shape to $2$. This produces the gamma distribution of $S_2$. Then gradually increase the shape parameter. As you do, you will see the distribution shift to the right and become gradually more bell-shaped. It remains skewed, though the skew decreases as the shape parameter increases. The mean of the distribution, which lies to the right of its mode, increases linearly in $n$.
from utils_dist import run_distribution_explorer
run_distribution_explorer("Gamma");
Next, convert to a standard variable. To standardize, we need the expectation and variance of the sum. Both follow from our usual rules for expectations and variances of sums (see Section 13.1). The expectation and variance of an exponential random variable are $1/\lambda$ and $1/\lambda^2$ (see Section 7.1). So, $E[S_n] = n/\lambda$ and $\mathrm{Var}[S_n] = n/\lambda^2$, and the standardized sum is:

$$Z_n = \frac{S_n - n/\lambda}{\sqrt{n}/\lambda} = \frac{\lambda S_n - n}{\sqrt{n}}.$$
To find the density function of $Z_n$, use the linear change of density formula from Section 7.2. Set $h(s) = \frac{\lambda s - n}{\sqrt{n}}$, so that $Z_n = h(S_n)$. Then $h'(s) = \frac{\lambda}{\sqrt{n}}$ and $h^{-1}(z) = \frac{\sqrt{n}(z + \sqrt{n})}{\lambda}$. Then:

$$f_{Z_n}(z) = \frac{f_{S_n}(h^{-1}(z))}{h'(h^{-1}(z))} = \frac{\sqrt{n}}{\lambda}\, f_{S_n}\!\left(\frac{\sqrt{n}(z + \sqrt{n})}{\lambda}\right) = \frac{n^{n - 1/2}\, e^{-n}}{(n-1)!} \left(1 + \frac{z}{\sqrt{n}}\right)^{n-1} e^{-\sqrt{n}\, z}.$$

The $\lambda$’s cancel, as they must, since $Z_n$ is standardized.
Thus, the normalization constant, $\frac{n^{n-1/2} e^{-n}}{(n-1)!}$, converges to the normalization constant of the standard normal, $1/\sqrt{2\pi}$. Open the dropdown below for the step-by-step analysis using tools from Section 6.
Limiting Analysis of the Normalization Constant
To simplify the leading term, we’ll use Stirling’s formula (see Section 6.3):

$$n! \approx \sqrt{2\pi n} \left(\frac{n}{e}\right)^n.$$
In this case it will be easier to work out the limit of the logarithm of the functional form. Since the log is continuous, the limit of the logarithm is the logarithm of the limit. So, we can find the original limit by taking the limit of the logarithm, then exponentiating our answer.
Since $z/\sqrt{n}$ is small when $n$ is large, we can replace the logarithm with its Taylor expansion near zero (see Section 6.1). In this case we will need to go past our usual linear approximation to the log and include quadratic terms:

$$\log\left(1 + \frac{z}{\sqrt{n}}\right) \approx \frac{z}{\sqrt{n}} - \frac{z^2}{2n}.$$

Applying this to the logarithm of the functional form,

$$(n-1)\log\left(1 + \frac{z}{\sqrt{n}}\right) - \sqrt{n}\, z \approx (n-1)\left(\frac{z}{\sqrt{n}} - \frac{z^2}{2n}\right) - \sqrt{n}\, z \;\longrightarrow\; -\frac{z^2}{2},$$

so the functional form converges to $e^{-z^2/2}$, and $f_{Z_n}(z)$ converges to the standard normal density.
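As a final sanity check, we can plot the exact standardized density $f_{Z_n}$ against the standard normal and watch the convergence. This is a minimal sketch assuming numpy, matplotlib, and scipy are available ($\lambda$ is an arbitrary illustrative value; it cancels out of $Z_n$).

```python
# Plot the exact density of the standardized gamma sum for increasing n
# alongside the standard normal density.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gamma, norm

lam = 2.0
z = np.linspace(-4, 4, 400)

for n in [2, 10, 100]:
    s = np.sqrt(n) * (z + np.sqrt(n)) / lam   # s = h^{-1}(z)
    # Change of density: f_{Z_n}(z) = f_{S_n}(h^{-1}(z)) * sqrt(n) / lam
    plt.plot(z, gamma.pdf(s, a=n, scale=1 / lam) * np.sqrt(n) / lam,
             label=f"$n = {n}$")

plt.plot(z, norm.pdf(z), "k--", label="Standard normal")
plt.legend()
plt.show()
```

Even at $n = 10$ the curve is already close to the standard normal; by $n = 100$ the two are hard to tell apart.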