To address their variance, we will first need their expectation. We’ve worked out these expectations before, but we will need them repeatedly in this chapter. So, we’ll recall them one last time before progressing.
Deriving the Expected Value of a Sum and Sample Average
To find the expectation of a sum, use additivity (see Section 4.2):
Since we are only working with two variables, drop the subscripts and denote the random pair [X,Y], and the sum, S.
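In symbols, additivity gives:

$$ \mathrm{E}[S] = \mathrm{E}[X + Y] = \mathrm{E}[X] + \mathrm{E}[Y], $$

and the sample average of the pair has expectation $\mathrm{E}\big[\tfrac{1}{2}(X + Y)\big] = \tfrac{1}{2}\big(\mathrm{E}[X] + \mathrm{E}[Y]\big)$.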
We could proceed by solving for the distribution of S explicitly (see Section 10.3). If the variables are independent, we could use convolution. In practice, this is difficult unless the random variables are drawn from convenient distributions. It also will not scale easily to sums of n variables. If you want to learn more about the distribution of $S_n$, wait for Section 13.4.
For now, we’ll try to work out the variance using properties of expectation:
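Expanding the definition of variance for $S = X + Y$ and using linearity of expectation:

$$ \begin{aligned} \mathrm{Var}[S] &= \mathrm{E}\big[(X - \mathrm{E}[X] + Y - \mathrm{E}[Y])^2\big] \\ &= \mathrm{E}\big[(X - \mathrm{E}[X])^2\big] + \mathrm{E}\big[(Y - \mathrm{E}[Y])^2\big] + 2\,\mathrm{E}\big[(X - \mathrm{E}[X])(Y - \mathrm{E}[Y])\big] \\ &= \mathrm{Var}[X] + \mathrm{Var}[Y] + 2\,\mathrm{Cov}[X, Y]. \end{aligned} $$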
In the special case when X and Y are uncorrelated, as when they are independent, the covariance is zero, so $\mathrm{Var}[X + Y] = \mathrm{Var}[X] + \mathrm{Var}[Y]$.
If X and Y are positively correlated, then $\mathrm{Var}[X + Y] > \mathrm{Var}[X] + \mathrm{Var}[Y]$. So, when X and Y are positively correlated, their sum is more variable than it would be if they were uncorrelated. If X and Y are negatively correlated, then their sum is less variable than it would be if they were uncorrelated.
Examples
This is a sensible result. If two variables are positively associated then their variations will add constructively. When one is larger than its expectation, the other is more likely to be larger than its expectation. If one is smaller, then the other is likely to be smaller. If two variables are negatively associated then their variations add destructively and tend to cancel out.
For instance, the pair X and X is as positively associated as possible.
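Applying the formula above with Y = X, and recalling that $\mathrm{Cov}[X, X] = \mathrm{Var}[X]$:

$$ \mathrm{Var}[X + X] = \mathrm{Var}[X] + \mathrm{Var}[X] + 2\,\mathrm{Cov}[X, X] = 4\,\mathrm{Var}[X] = \mathrm{Var}[2X]. $$

The sum is twice as variable as it would be if the two copies were uncorrelated.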
To extend to n variables, $\{X_j\}_{j=1}^{n}$, we will need to track the covariance between every possible pair. We can track all of the covariances using a covariance matrix.
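The covariance matrix C has one row and one column per variable, with entries

$$ C_{ij} = \mathrm{Cov}[X_i, X_j], $$

so the diagonal entries are the variances, $C_{jj} = \mathrm{Var}[X_j]$, and the off-diagonal entries are the covariances between distinct variables.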
Then, we can rewrite our equation for the variance of a sum of a pair of random variables:
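In terms of the entries of the covariance matrix for the pair $[X, Y]$:

$$ \mathrm{Var}[X + Y] = \mathrm{Var}[X] + \mathrm{Var}[Y] + 2\,\mathrm{Cov}[X, Y] = C_{11} + C_{22} + C_{12} + C_{21} = \sum_{i=1}^{2} \sum_{j=1}^{2} C_{ij}. $$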
In other words, the variance of a sum of two variables is the sum of every entry of the covariance matrix for the two variables. It is the sum of all possible pairwise covariances we can construct out of pairs of variables selected from the sequence.
Example
For example, if $\{X_j\}_{j=1}^{2}$ share the covariance matrix:
If all of the variables are uncorrelated, as when they are mutually independent, then the covariances between distinct variables are all zero, so only the diagonal entries contribute: $\mathrm{Var}[S_n] = \sum_{j=1}^{n} \mathrm{Var}[X_j]$, where $S_n = \sum_{j=1}^{n} X_j$.
In the special case when the variables are uncorrelated and identically distributed, $\mathrm{Var}[X_i] = \sigma^2$ for some $\sigma > 0$. Then $\mathrm{Var}[S_n] = n\,\sigma^2$, so $\mathrm{SD}[S_n] = \sigma \sqrt{n}$.
This result is an important scaling law for variances of sums and averages. The variance of a sum of independent, identical variables grows proportionally to n. We can now show that the opposite is true for averages. While the variance in a sum increases as we add more terms, the variance in a sample average will decrease, provided the variables are sufficiently uncorrelated.
Suppose that $S \sim \mathrm{Binomial}(n, p)$. What is the variance in S?
Let’s adopt the strategy we used to find the expectation of the binomial in Section 4.2.
A binomial random variable represents the total number of successes in n independent, identical, binary trials with success probability p. Let $\{I_j\}_{j=1}^{n}$ represent the string of indicators associated with each trial. Then $I_j = 1$ if the jth trial succeeded, and $I_j = 0$ otherwise. Then, we can write S as a sum, $S = \sum_{j=1}^{n} I_j$.
We don’t need to add any covariance terms since the trials are independent, so their indicators are uncorrelated.
The trials are also identical, so the variance in a binomial random variable is the variance in a sum of n independent, identically distributed variables.
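Each indicator has variance $p(1-p)$, so additivity of variance for uncorrelated variables gives:

$$ \mathrm{Var}[S] = \sum_{j=1}^{n} \mathrm{Var}[I_j] = n\,p(1-p). $$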
We can use the results we derived for sums to work out the variance of a sample average. This time, we can just use the scaling rule for variances (see Section 4.3). Recall that $\mathrm{Var}[aX] = a^2\,\mathrm{Var}[X]$ for any constant a and variable X.
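Therefore, applying the scaling rule with $a = 1/n$ to the sum $S_n$:

$$ \mathrm{Var}[\bar{X}_n] = \mathrm{Var}\!\left[\frac{1}{n} S_n\right] = \frac{1}{n^2}\,\mathrm{Var}[S_n] = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \mathrm{Cov}[X_i, X_j]. $$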
For example, if three variables share the covariance matrix with rows $(9, 2, 1)$, $(2, 8, 2)$, and $(1, 2, 9)$, then $\mathrm{Var}[\bar{X}_3] = \frac{1}{9}\big((9 + 2 + 1) + (2 + 8 + 2) + (1 + 2 + 9)\big) = \frac{36}{9} = 4$.
Just as the variance of a sum is the sum of all the covariances, the variance in an average is the average of all the covariances.
If the variables are uncorrelated, then all the off-diagonal entries of the covariance matrix are zero, so $\mathrm{Var}[\bar{X}_n] = \frac{1}{n^2} \sum_{j=1}^{n} \mathrm{Var}[X_j]$.
If the variables are uncorrelated and all have the same variance $\sigma^2$, then $\mathrm{Var}[\bar{X}_n] = \frac{n\,\sigma^2}{n^2} = \frac{\sigma^2}{n}$.
Visualization with Covariance Matrices
In the case when all the variables are independent, all of the covariances off the diagonal of the covariance matrix are zero. Here are two examples, one for 3 variables and one for 5. We’ll use ⋅ to mark entries equal to zero.
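Taking each variance to be 9 (the value implied by the averages quoted below), the two covariance matrices look like:

$$ C_3 = \begin{bmatrix} 9 & \cdot & \cdot \\ \cdot & 9 & \cdot \\ \cdot & \cdot & 9 \end{bmatrix}, \qquad C_5 = \begin{bmatrix} 9 & \cdot & \cdot & \cdot & \cdot \\ \cdot & 9 & \cdot & \cdot & \cdot \\ \cdot & \cdot & 9 & \cdot & \cdot \\ \cdot & \cdot & \cdot & 9 & \cdot \\ \cdot & \cdot & \cdot & \cdot & 9 \end{bmatrix}. $$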
There are far more zero entries in the larger matrix. The average entry in the 3 by 3 example is $3 \times 9 / 9 = 3$. The average entry in the 5 by 5 example is $5 \times 9 / 25 = 9/5 < 2$.
For n independent variables there are n diagonal entries and $n^2 - n$ zero entries off the diagonal, so the average entry of the covariance matrix approaches zero as n diverges. The fraction of the covariance matrix that is nonzero is $n / n^2 = 1/n$, so the variance in the sample average of independent random variables vanishes proportionally to 1/n.
This result is an important scaling law for variances of sums and averages. The variance of an average of uncorrelated variables with matching variances decays proportionally to 1/n. In particular, the variance in an average of independent, identical random variables is proportional to 1/n.
This is our first example of a concentration phenomenon. The distribution of sample averages of independent, identical random variables concentrates about its expectation as n increases, since the variance in the sample average decays to zero as n diverges. This is why sample averages of independent, identical variables form reliable estimators. The variance in the sample average decreases as the number of samples grows!
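To see the concentration numerically, here is a minimal simulation sketch in Python. The use of normal draws and the value $\sigma^2 = 4$ are illustrative assumptions, not part of the text; the point is only that the empirical variance of the sample average tracks $\sigma^2 / n$.

```python
import numpy as np

rng = np.random.default_rng(0)

sigma2 = 4.0              # assumed common variance of each X_j (illustrative)
num_replicates = 10_000   # how many independent sample averages to simulate

for n in [1, 10, 100, 1000]:
    # Draw num_replicates independent sample averages, each built from n iid draws.
    draws = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(num_replicates, n))
    sample_means = draws.mean(axis=1)
    print(f"n = {n:4d}  empirical Var of sample average = {sample_means.var():.4f}"
          f"   sigma^2 / n = {sigma2 / n:.4f}")
```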
Let’s put these ideas together to prove that sample averages of large collections of samples are good estimators for unknown expectations.
To start, suppose that $\{X_j\}_{j=1}^{n}$ are drawn independently and identically with unknown $\mathrm{E}[X_j] = \mu$ and unknown, but finite, standard deviation $\mathrm{SD}[X_j] = \sigma$. Then, as suggested in Section 12.1, estimate the unknown mean using the sample average: $\bar{X}_n = \frac{1}{n} \sum_{j=1}^{n} X_j$.
Let’s compute the expected square error when we adopt the sample average as our estimator for the unknown mean.
First, the bias. We’ve shown that $\mathrm{E}[\bar{X}_n] = \mu$, so the estimator is unbiased. Therefore, $\mathrm{bias}(\hat{\mu})^2 = 0$, and the expected square error in our estimate is exactly the variance in the sample average.
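Combining the bias-variance decomposition with the variance computed above:

$$ \mathrm{E}\big[(\bar{X}_n - \mu)^2\big] = \mathrm{bias}(\hat{\mu})^2 + \mathrm{Var}[\bar{X}_n] = \frac{\sigma^2}{n}, $$

which shrinks to zero as the number of samples n grows.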
All the way back at the start of the course we defined chance as long run frequency. We can now justify that definition.
Suppose that E is some event. Let p=Pr(E) denote the chance that E occurs.
To relate this chance to a measurement we could produce from an experiment, we need a procedure for estimating p from data. The simplest procedure is to repeat the random process n times, making sure that all repetitions are unrelated to each other (independent) and are identical. Let $I_j = 1$ if E occurs on the jth repetition and zero otherwise. Then, the total number of times E occurs in the n trials is $S_n = \sum_{j=1}^{n} I_j$.
So, the observed frequency with which the event E occurred is $\mathrm{Fr}(E) = \frac{S_n}{n}$.
The variable $S_n$ is binomial with unknown success probability p on n trials. Recall that the observed frequency, $s/n$, is the maximum likelihood estimator, $\hat{p}(s)$, for the success probability of a binomial random variable. Therefore $\mathrm{Fr}(E)$ is a maximum likelihood estimator for p.
Is the frequency a consistent estimator?
To show consistency, we need to show that it is unbiased (or, if biased, that the bias vanishes in the limit as n diverges), and that the variance in the estimator converges to zero as n goes to infinity.
To show that the frequency is an unbiased estimator, write the frequency as a sample average of the indicators: $\mathrm{Fr}(E) = \frac{1}{n} \sum_{j=1}^{n} I_j$.
Then, $\mathrm{E}[\mathrm{Fr}(E)] = \frac{1}{n} \sum_{j=1}^{n} \mathrm{E}[I_j] = \frac{1}{n} \sum_{j=1}^{n} p = p$. So, the frequency is an unbiased estimator.
To show that it is a precise estimator, and becomes more precise as n increases, let’s compute its variance. We already know the variance in an indicator is $p(1-p)$ and in a binomial random variable is $n\,p(1-p)$.
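So, applying the scaling rule once more, the variance in the observed frequency is:

$$ \mathrm{Var}[\mathrm{Fr}(E)] = \mathrm{Var}\!\left[\frac{1}{n} S_n\right] = \frac{1}{n^2}\, n\,p(1-p) = \frac{p(1-p)}{n}. $$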
Then $\lim_{n \to \infty} \mathrm{Var}[\mathrm{Fr}(E)] = 0$ no matter the value of p. It follows that the expected square error in the estimate, $\mathrm{Fr}(E) = \frac{1}{n} S_n = \hat{p}(S_n) \approx p$, vanishes in the limit of infinitely many samples. So the frequency is a consistent estimator in the expected square sense. This result justifies defining the chance of an event as its long-run frequency.
Moreover, since $p(1-p) \le \frac{1}{4}$ for every p, the standard deviation of the frequency satisfies $\mathrm{SD}[\mathrm{Fr}(E)] = \sqrt{p(1-p)/n} \le \frac{1}{2\sqrt{n}}$ no matter the value of p. The larger n, the smaller this upper bound. We’ll exploit this analysis at length in Section 13.3. For now, we’ll use it to give suggested sample sizes to ensure reliable estimates of unknown chances.
Suppose that we want to estimate p with the observed frequency, and want the standard deviation in our estimate to be smaller than some error threshold $\epsilon > 0$. Then, since we don’t know p, we should set the upper bound on the standard deviation to be less than $\epsilon$. That way, even in the most variable case, our estimator will have standard deviation at most $\epsilon$. This requires $\frac{1}{2\sqrt{n}} \le \epsilon$, that is, $n \ge (2\epsilon)^{-2}$.
Since sampling is expensive, we should pick the smallest n that satisfies the bound. That is $n = \lceil (2\epsilon)^{-2} \rceil$, where $\lceil x \rceil$ means “round x up to the nearest integer.”
Suppose, for example, that we wanted to guarantee that the standard deviation in our estimator was less than $0.05 = 1/20$. Then we would need, at smallest, $n = \lceil (2 \cdot \frac{1}{20})^{-2} \rceil = \lceil (\frac{1}{10})^{-2} \rceil = \lceil 100 \rceil = 100$ samples. If we had instead demanded that the standard deviation in our estimator was less than $0.005 = 1/200$, then we would have needed, at least, $n = 10{,}000$ samples.
Here’s a table of suggested sample sizes for different error controls:
SD[Fr(E)] ≤    0.2     0.1    0.05    0.01     0.005     0.001
n ≥            6.25    25     100     2,500    10,000    250,000
Notice the diminishing returns. We can always decrease the standard deviation in our estimator by collecting more samples, but, to actually make the estimator precise, we need very large sample sizes.
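As a check on these numbers, here is a small Python sketch (not from the original text) that reproduces the bound $n \ge (2\epsilon)^{-2}$ for each error threshold in the table:

```python
import math
from fractions import Fraction

# Reproduce the sample-size bound n >= (2*eps)^(-2), which guarantees
# SD[Fr(E)] <= eps no matter the unknown value of p. Exact rational
# arithmetic avoids floating-point rounding in the bound.
thresholds = ["0.2", "0.1", "0.05", "0.01", "0.005", "0.001"]

for t in thresholds:
    eps = Fraction(t)
    bound = 1 / (4 * eps ** 2)   # (2*eps)^(-2)
    n = math.ceil(bound)         # smallest integer n meeting the bound
    print(f"SD <= {t}: n >= {float(bound):g} (smallest integer n = {n})")
```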
Example: Asymptotic Analysis of Biased Estimators
The consistency of sample averages can help us study the behavior of estimators that can be expressed as a function of a sample average. For example, the maximum likelihood estimator of the parameter of a Geometric distribution, based on n independent, identical draws, $\{W_j\}_{j=1}^{n}$, is (see Section 12.2):
$$ \hat{p}(W_1, W_2, \ldots, W_n) = \frac{1}{\bar{W}_n}, \quad \text{where } \bar{W}_n = \frac{1}{n} \sum_{j=1}^{n} W_j. $$
In Section 12.2 we showed that this estimator is biased using Jensen’s inequality (see Section 4.1). Jensen’s inequality explains the sign of the bias (the estimator has positive bias, so systematically overestimates) but provides no insight into the size of the bias.
In this section we will study an approximation to the bias based on a Taylor series expansion of the function 1/x. The approximation expresses the bias in terms of the variance of the sample average $\bar{W}_n = \frac{1}{n} \sum_{j=1}^{n} W_j$. We’ll see that, since the variance in the sample average vanishes as n increases, so does our approximation to the bias. To see the derivation based on a Taylor series, open the dropdown below.
Approximate MLE Via Taylor Series
Let $f(x) = x^{-1}$, so $\hat{p}(W_1, W_2, \ldots, W_n) = f(\bar{W}_n)$.
Next, let’s find the Taylor expansion of the function 1/x (see Section 6.1). To find the Taylor series, we need the sequence of derivatives:
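The derivatives alternate in sign and grow factorially:

$$ f'(x) = -x^{-2}, \qquad f''(x) = 2\,x^{-3}, \qquad f'''(x) = -6\,x^{-4}, \qquad \ldots, \qquad f^{(k)}(x) = (-1)^k\, k!\, x^{-(k+1)}. $$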
Since approximations based on the Taylor series are most accurate near the point of expansion, we should try expanding f(x) about a point that will probably be close to $\bar{W}_n$. Since we know that $\bar{W}_n$ converges to its expectation, $\mathrm{E}[\bar{W}_n] = \mathrm{E}[W_1] = 1/p$, let’s Taylor expand about $x_* = 1/p$. Then, we can guarantee that, with high probability, the sample average $\bar{W}_n$ will be near the point of expansion, 1/p.
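Keeping terms through second order gives a sketch of the expansion (recall that, for the Geometric distribution, $\mathrm{Var}[W_1] = (1-p)/p^2$, so $\mathrm{Var}[\bar{W}_n] = \frac{1-p}{n\,p^2}$):

$$ \hat{p} = f(\bar{W}_n) \approx f(1/p) + f'(1/p)\,\big(\bar{W}_n - 1/p\big) + \tfrac{1}{2} f''(1/p)\,\big(\bar{W}_n - 1/p\big)^2. $$

Taking expectations, the linear term vanishes because $\mathrm{E}[\bar{W}_n - 1/p] = 0$, and the quadratic term contributes $\tfrac{1}{2} f''(1/p)\,\mathrm{Var}[\bar{W}_n]$, so

$$ \mathrm{E}[\hat{p}] \approx p + \tfrac{1}{2}\,(2\,p^3)\,\frac{1-p}{n\,p^2} = p + \frac{p(1-p)}{n}, \qquad \text{that is,} \qquad \mathrm{bias}(\hat{p}) \approx \frac{p(1-p)}{n}. $$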
So, the bias in the approximate maximum likelihood estimator vanishes as the number of samples diverges. It is asymptotically unbiased.
Although this analysis is not exact, it correctly recovers the asymptotic behavior of the bias in the maximum likelihood estimator. The maximum likelihood estimator is also asymptotically unbiased with bias that vanishes at rate $O(n^{-1})$. This behavior is typical. Many maximum likelihood estimators are asymptotically unbiased with a bias that vanishes at rate $O(n^{-1})$ in the sample size.
The expected square error statement follows by decomposing the expected square error into a contribution from the bias and a contribution from the variance. If the expectation converges to $\mu$ as n diverges, then the bias, $\mathrm{E}[\bar{X}_n] - \mu$, converges to zero. If the variance converges to zero as well, then the expected square error converges to zero as n diverges.
So, even sample averages of dependent random variables may converge, provided the dependencies are weak enough. If you go on to study stochastic processes or Monte Carlo methods, you’ll see this argument appear in a wide variety of useful models.