
13.1 Variance of Sums and Averages

Let $\{X_j\}_{j=1}^n$ denote a collection of random variables. Let:

$$S_n = \sum_{j=1}^n X_j$$

denote the sum of the variables, and:

$$\bar{X}_n = \frac{1}{n} S_n = \frac{1}{n} \sum_{j=1}^n X_j$$

denote their sample average.

How variable is $S_n$? What about $\bar{X}_n$?

Expectation

To address their variance, we will first need their expectation. We've worked out these expectations before, but we will need them repeatedly in this chapter. So, we'll recall them one last time before progressing:

$$\mathbb{E}[S_n] = \sum_{j=1}^n \mathbb{E}[X_j], \qquad \mathbb{E}[\bar{X}_n] = \frac{1}{n} \sum_{j=1}^n \mathbb{E}[X_j].$$

In particular, if the variables are identically distributed with $\mathbb{E}[X_j] = \mu$, then $\mathbb{E}[S_n] = n\mu$ and $\mathbb{E}[\bar{X}_n] = \mu$.

Variance of a Sum

Two Variables

Let's start with the simplest case, $S_2 = X_1 + X_2$.

Since we are only working with two variables, drop the subscripts and denote the random pair $[X, Y]$, and the sum, $S$.

We could proceed by solving for the distribution of $S$ explicitly (see Section 10.3). If the variables are independent, we could use convolution. In practice, this is difficult unless the random variables are drawn from convenient distributions. It also will not scale easily to sums of $n$ variables. If you want to learn more about the distribution of $S_n$, wait for Section 13.4.

For now, we’ll try to work out the variance using properties of expectation:

$$\text{Var}[S] = \text{Var}[X + Y] = \mathbb{E}[(X + Y - \mathbb{E}[X + Y])^2].$$

The expectation of a sum is the sum of the expectations, so:

$$\text{Var}[S] = \text{Var}[X + Y] = \mathbb{E}[((X - \mathbb{E}[X]) + (Y - \mathbb{E}[Y]))^2].$$

Let $X_0 = X - \mathbb{E}[X]$ and $Y_0 = Y - \mathbb{E}[Y]$ denote the centered variables. Then:

$$\text{Var}[S] = \text{Var}[X + Y] = \mathbb{E}[(X_0 + Y_0)^2].$$

Now, expand the square:

$$\mathbb{E}[(X_0 + Y_0)^2] = \mathbb{E}[X_0^2 + 2 X_0 Y_0 + Y_0^2].$$

Then, use additivity and linearity to break the expectation into a sum of pieces:

$$\mathbb{E}[X_0^2 + 2 X_0 Y_0 + Y_0^2] = \mathbb{E}[X_0^2] + 2 \mathbb{E}[X_0 Y_0] + \mathbb{E}[Y_0^2].$$

Finally, apply the definitions of variance and covariance in Sections 4.3 and 11.1 to identify each term:

$$\begin{aligned} \mathbb{E}[X_0^2] &= \text{Var}[X] \\ \mathbb{E}[X_0 Y_0] &= \text{Cov}[X,Y] \\ \mathbb{E}[Y_0^2] &= \text{Var}[Y]. \end{aligned}$$

Therefore:

$$\text{Var}[X + Y] = \text{Var}[X] + 2 \text{Cov}[X,Y] + \text{Var}[Y].$$

In the special case when $X$ and $Y$ are uncorrelated, as when they are independent, the covariance is zero. So:

$$\text{Var}[X + Y] = \text{Var}[X] + \text{Var}[Y].$$

If $X$ and $Y$ are positively correlated, then $\text{Var}[X + Y] > \text{Var}[X] + \text{Var}[Y]$. So, when $X$ and $Y$ are positively correlated, their sum is more variable than it would be if they were uncorrelated. If $X$ and $Y$ are negatively correlated, then their sum is less variable than it would be if they were uncorrelated.
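As a quick numerical check, here is a minimal simulation sketch (not from the text; it assumes NumPy, and the covariance values are arbitrary illustrative choices) that verifies $\text{Var}[X + Y] = \text{Var}[X] + 2\text{Cov}[X,Y] + \text{Var}[Y]$ for a correlated pair.

```python
import numpy as np

# Minimal sketch: check Var[X + Y] = Var[X] + 2 Cov[X, Y] + Var[Y] by simulation.
rng = np.random.default_rng(0)
cov = [[1.0, 0.6],
       [0.6, 2.0]]                      # Var[X] = 1, Var[Y] = 2, Cov[X, Y] = 0.6
x, y = rng.multivariate_normal([0.0, 0.0], cov, size=500_000).T

lhs = np.var(x + y)                     # variance of the sum
rhs = np.var(x) + 2 * np.cov(x, y)[0, 1] + np.var(y)
print(lhs, rhs)                         # both should be near 1 + 1.2 + 2 = 4.2
```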

We can extend these results to a sum of $n$ variables.

$n$ Variables

To extend to $n$ variables, $\{X_j\}_{j=1}^n$, we will need to track the covariance between every possible pair. We can track all of the covariances using a covariance matrix.

Then, we can rewrite our equation for the variance of a sum of a pair of random variables:

$$\begin{aligned} \text{Var}[X_1 + X_2] & = \text{Var}[X_1] + 2 \text{Cov}[X_1,X_2] + \text{Var}[X_2] \\ & = \text{Cov}[X_1,X_1] + (\text{Cov}[X_1,X_2] + \text{Cov}[X_2,X_1]) + \text{Cov}[X_2,X_2] \\ & = \sum_{i=1}^2 \sum_{j=1}^2 \text{Cov}[X_i,X_j] = \sum_{i=1}^2 \sum_{j=1}^2 c_{i,j}. \end{aligned}$$

In other words, the variance of a sum of two variables is the sum of every entry of the covariance matrix for the two variables. It is the sum of all possible pairwise covariances we can construct out of pairs of variables selected from the sequence.

This result extends naturally to $n$ variables:

$$\text{Var}\left[\sum_{j=1}^n X_j\right] = \sum_{i=1}^n \sum_{j=1}^n \text{Cov}[X_i,X_j] = \sum_{i=1}^n \sum_{j=1}^n c_{i,j}.$$

If all of the variables are uncorrelated, as when they are mutually independent, then the covariances between distinct variables are all zero. Then:

$$\text{Var}\left[\sum_{j=1}^n X_j\right] = \sum_{j=1}^n \text{Var}[X_j].$$

In the special case when the variables are uncorrelated and identically distributed, $\text{Var}[X_j] = \sigma^2$ for some $\sigma > 0$. Then:

$$\text{Var}[S_n] = n \sigma^2.$$

This result is an important scaling law for variances of sums and averages. The variance of a sum of independent, identical variables grows proportionally to $n$. We will see shortly that the opposite is true for averages: while the variance of a sum increases as we add more terms, the variance of a sample average decreases, provided the variables are sufficiently uncorrelated.
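For the $n$-variable case, a short sketch (again assuming NumPy; the mixing matrix is just an arbitrary way to manufacture correlated variables) checks that the variance of the sum equals the sum of every entry of the covariance matrix.

```python
import numpy as np

# Minimal sketch: for n correlated variables, Var[X_1 + ... + X_n] equals the sum
# of all n*n entries of their covariance matrix.
rng = np.random.default_rng(1)
reps, n = 400_000, 5

z = rng.normal(size=(reps, n))               # independent noise
mix = rng.normal(size=(n, n))                # arbitrary mixing to induce correlation
x = z @ mix                                  # each row is one draw of (X_1, ..., X_n)

var_of_sum = np.var(x.sum(axis=1))
sum_of_cov = np.cov(x, rowvar=False).sum()   # adds every entry c_{i,j}
print(var_of_sum, sum_of_cov)                # agree up to sampling error
```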

Example: Variance of Binomial Random Variables

Suppose that $S \sim \text{Binomial}(n,p)$. What is the variance of $S$?

Let’s adopt the strategy we used to find the expectation of the binomial in Section 4.2.

A binomial random variable represents the total number of successes in $n$ independent, identical, binary trials with success probability $p$. Let $\{I_j\}_{j=1}^n$ represent the string of indicators associated with each trial. Then $I_j = 1$ if the $j^{\text{th}}$ trial succeeded, and $I_j = 0$ otherwise. Then, we can write $S$ as a sum, $S = \sum_{j=1}^n I_j$.

It follows immediately that:

$$\text{Var}[S] = \text{Var}\left[\sum_{j=1}^n I_j \right] = \sum_{j=1}^n \text{Var}[I_j].$$

We don’t need to add any covariance terms since the trials are independent, so their indicators are uncorrelated.

The trials are also identical, so the variance in a binomial random variable is the variance in a sum of $n$ independent, identically distributed variables. So:

$$\text{Var}[S] = n \text{Var}[I_j]$$

for any $j$ we like. For simplicity, set $j = 1$. Then $\text{Var}[S] = n \text{Var}[I_1]$.

To finish the problem, we need the variance in an indicator. We’ll solve for this variance directly:

$$\text{Var}[I_1] = \mathbb{E}[I_1^2] - \mathbb{E}[I_1]^2.$$

The variable $I_1^2$ is the same as the variable $I_1$ since $0^2 = 0$ and $1^2 = 1$. Therefore:

$$\text{Var}[I_1] = \mathbb{E}[I_1] - \mathbb{E}[I_1]^2.$$

Recall that the expected value of an indicator is the chance of the event it indicates. Therefore, $\mathbb{E}[I_1] = p$, the success probability. Then:

$$\text{Var}[I_1] = p - p^2 = p(1 - p).$$

So, the variance in an indicator is a product of the success and failure probabilities. It follows that:

$$\text{Var}[S] = n \text{Var}[I_1] = n p (1 - p).$$
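A short simulation sketch (assuming NumPy; the values of $n$ and $p$ are arbitrary) confirms that binomial draws have variance close to $np(1-p)$.

```python
import numpy as np

# Minimal sketch: the empirical variance of Binomial(n, p) draws is near n*p*(1-p).
rng = np.random.default_rng(2)
n, p = 30, 0.25

draws = rng.binomial(n, p, size=500_000)
print(draws.var())          # empirical variance
print(n * p * (1 - p))      # theoretical variance: 30 * 0.25 * 0.75 = 5.625
```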

Variance of a Sample Average

We can use the results we derived for sums to work out the variance of a sample average. This time, we can just use the scaling rule for variances (see Section 4.3). Recall that $\text{Var}[a X] = a^2 \text{Var}[X]$ for any constant $a$ and variable $X$. Therefore:

$$\text{Var}[\bar{X}_n] = \text{Var}\left[\frac{1}{n} S_n \right] = \frac{1}{n^2} \text{Var}[S_n].$$

So, to find the variance of a sample average, just plug in the variance of a sum, and divide by $n^2$.

We’ll study the result in the general case, the uncorrelated case, and the uncorrelated and identical case.

First, the general case:

$$\text{Var}[\bar{X}_n] = \frac{1}{n^2} \text{Var}[S_n] = \frac{1}{n^2} \sum_{i=1}^n \sum_{j=1}^n \text{Cov}[X_i,X_j].$$

So, to compute the variance in a sample average, find all the covariances, fill in the covariance matrix, then average the entries.

Just as the variance of a sum is the sum of all the covariances, the variance in an average is the average of all the covariances.

If the variables are uncorrelated, then all the off-diagonal entries of the covariance matrix are zero, so:

$$\text{Var}[\bar{X}_n] = \frac{1}{n^2} \sum_{j=1}^n \text{Var}[X_j] = \frac{1}{n} \bar{\sigma}^2, \quad \text{where} \quad \bar{\sigma}^2 = \frac{1}{n} \sum_{j=1}^n \text{Var}[X_j].$$

If the variables are uncorrelated and all have the same variance, $\sigma^2$, then $\bar{\sigma} = \sigma$ and:

$$\text{Var}[\bar{X}_n] = \frac{\sigma^2}{n}.$$

This result is an important scaling law for variances of sums and averages. The variance of an average of uncorrelated variables with matching variances decays proportionally to $1/n$. In particular, the variance in an average of independent, identical random variables is proportional to $1/n$.

This is our first example of a concentration phenomenon. The distribution of sample averages of independent, identical random variables concentrates about its expectation as $n$ increases, since the variance in the sample average decays to zero as $n$ diverges. This is why sample averages of independent, identical variables form reliable estimators. The variance in the sample average decreases as the number of samples grows!
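We can watch this concentration happen in a small simulation sketch (assuming NumPy; the distribution and sample sizes are arbitrary): the spread of the sample average shrinks like $\sigma/\sqrt{n}$.

```python
import numpy as np

# Minimal sketch: the standard deviation of the sample average of n iid draws
# shrinks like sigma / sqrt(n) as n grows.
rng = np.random.default_rng(3)
sigma, reps = 2.0, 10_000

for n in [10, 100, 1000]:
    xbars = rng.normal(0.0, sigma, size=(reps, n)).mean(axis=1)
    print(n, xbars.std(), sigma / np.sqrt(n))   # empirical vs. theoretical SD
```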

Special Case: Independent, Identical Variables

The case when $\{X_j\}_{j=1}^n$ are independent and identically distributed is both the simplest and most important, so we'll highlight it here. If each $X_j$ has variance $\sigma^2$, then $\text{Var}[S_n] = n \sigma^2$ and $\text{Var}[\bar{X}_n] = \sigma^2 / n$.

Sample Averages are Consistent Estimators

Let's put these ideas together to prove that sample averages of large collections of samples are good estimators for unknown expectations.

To start, suppose that $\{X_j\}_{j=1}^n$ are drawn independently and identically with unknown $\mathbb{E}[X_j] = \mu$ and unknown, but finite, standard deviation $\text{SD}[X_j] = \sigma$. Then, as suggested in Section 12.1, estimate the unknown mean using the sample average:

$$\hat{\mu}(X) = \bar{X}_n = \frac{1}{n} \sum_{j=1}^n X_j \approx \mu.$$

How accurate is this estimator?

We can measure accuracy with the expected squared error:

$$\mathbb{E}_{X}[(\hat{\mu}(X) - \mu)^2] = \text{bias}(\hat{\mu})^2 + \text{Var}_X[\hat{\mu}(X)].$$

Let’s compute the expected square error when we adopt the sample average as our estimator for the unknown mean.

First, the bias. We've shown that $\mathbb{E}[\bar{X}_n] = \mu$, so the estimator is unbiased. Therefore, $\text{bias}(\hat{\mu})^2 = 0$. Then, the expected square error in our estimate is exactly the variance in the sample average:

$$\mathbb{E}_{X}[(\hat{\mu}(X) - \mu)^2] = \text{Var}_X[\hat{\mu}(X)] = \text{Var}[\bar{X}_n].$$

Plugging in our formula for the variance of a sample average:

$$\mathbb{E}_{X}[(\hat{\mu}(X) - \mu)^2] = \frac{1}{n} \sigma^2 = \mathcal{O}(n^{-1}).$$

Since $\sigma$ is finite:

$$\lim_{n \rightarrow \infty} \mathbb{E}_{X}[(\hat{\mu}(X) - \mu)^2] = \lim_{n \rightarrow \infty} \frac{1}{n} \sigma^2 = 0.$$

So, the expected square error in the estimate $\bar{X}_n \approx \mu$ converges to zero in the limit as $n$ diverges to $\infty$.

This is a form of consistency. In the limit of infinitely many samples, the expected error in the estimator vanishes.

Example: Frequencies Estimate Chances

All the way back at the start of the course we defined chance as long run frequency. We can now justify that definition.

Suppose that $E$ is some event. Let $p = \text{Pr}(E)$ denote the chance that $E$ occurs.

To relate this chance to a measurement we could produce from an experiment, we need a procedure for estimating $p$ from data. The simplest procedure is to repeat the random process $n$ times, making sure that all repetitions are unrelated to each other (independent) and are identical. Let $I_j = 1$ if $E$ occurs on the $j^{\text{th}}$ repetition and zero otherwise. Then, the total number of times $E$ occurs in the $n$ trials is $S_n = \sum_{j=1}^n I_j$.

So, the observed frequency with which the event $E$ occurred is:

$$\text{Fr}(E) = \frac{S_n}{n}.$$

The variable $S_n$ is binomial with unknown success probability $p$ on $n$ trials. Recall that the observed frequency, $s/n$, is the maximum likelihood estimator, $\hat{p}(s)$, for the success probability of a binomial random variable. Therefore $\text{Fr}(E)$ is a maximum likelihood estimator for $p$.

Is the frequency a consistent estimator?

To show consistency, we need to show that it is unbiased (or, if biased, that the bias vanishes in the limit as $n$ diverges), and that the variance in the estimator converges to zero as $n$ goes to infinity.

To show that the frequency is an unbiased estimator, write the frequency as a sample average of the indicators:

$$\text{Fr}(E) = \frac{1}{n} \sum_{j=1}^n I_j.$$

Then, $\mathbb{E}[\text{Fr}(E)] = \frac{1}{n} \sum_{j=1}^n \mathbb{E}[I_j] = \frac{1}{n} \sum_{j=1}^n p = p$. So, the frequency is an unbiased estimator.

To show that it is a precise estimator, and becomes more precise as $n$ increases, let's compute its variance. We already know the variance in an indicator is $p(1 - p)$ and in a binomial random variable is $n p(1 - p)$. So, the variance in the observed frequency is:

$$\begin{aligned} \text{Var}[\text{Fr}(E)] = \text{Var}\left[\frac{1}{n} S_n \right] & = \frac{1}{n^2} \text{Var}[S_n] \\ & = \frac{1}{n} \text{Var}[I_j] = \frac{p(1 - p)}{n}. \end{aligned}$$

So, the variance in the frequency decays proportionally to $1/n$. As we collect more samples, our estimate of the unknown probability becomes more stable.

At largest, $p(1 - p) = 1/2 \times 1/2 = 1/4$, so, at largest:

$$\text{Var}[\text{Fr}(E)] = \frac{p(1 - p)}{n} \leq \frac{1}{4n}.$$

Then $\lim_{n \rightarrow \infty} \text{Var}[\text{Fr}(E)] = 0$ no matter the value of $p$. It follows that the expected square error in the estimate, $\text{Fr}(E) = \frac{1}{n} S_n = \hat{p}(S_n) \approx p$, vanishes in the limit of infinitely many samples. So the frequency is a consistent estimator in the expected square sense. This result justifies the definition:

$$p = \lim_{n \rightarrow \infty} \text{Fr}(E).$$
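A simulation sketch (assuming NumPy; $p$, $n$, and the number of repetitions are arbitrary) illustrates both properties: the observed frequency is unbiased for $p$, and its variance is close to $p(1-p)/n$.

```python
import numpy as np

# Minimal sketch: observed frequencies Fr(E) = S_n / n are unbiased for p and have
# variance close to p*(1-p)/n.
rng = np.random.default_rng(4)
p, n, reps = 0.3, 400, 200_000

s = rng.binomial(n, p, size=reps)     # S_n: number of occurrences in n trials
freqs = s / n                         # Fr(E) for each repetition of the experiment

print(freqs.mean(), p)                # mean frequency is near p (unbiased)
print(freqs.var(), p * (1 - p) / n)   # variance near 0.21 / 400 = 0.000525
```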

We can also use this analysis to generate finite sample guarantees. For example,

$$\text{SD}[\text{Fr}(E)] \leq \frac{1}{2 \sqrt{n}}.$$

The larger $n$, the smaller this upper bound. We'll exploit this analysis at length in Section 13.3. For now, we'll use it to give suggested sample sizes to ensure reliable estimates of unknown chances.

Suppose that we want to estimate $p$ with the observed frequency, and we want the standard deviation in our estimate to be smaller than some error threshold $\epsilon > 0$. Then, since we don't know $p$, we should require the upper bound on the standard deviation to be less than $\epsilon$. That way, even in the most variable case, our estimator will have standard deviation at most $\epsilon$. This requires:

$$\frac{1}{2 \sqrt{n}} \leq \epsilon \Rightarrow n \geq \frac{1}{4} \frac{1}{\epsilon^2} = (2 \epsilon)^{-2}.$$

Since sampling is expensive, we should pick the smallest $n$ that satisfies the bound. That is, $n = \lceil (2 \epsilon)^{-2} \rceil$, where $\lceil x \rceil$ means "round $x$ up to the nearest integer."

Suppose, for example, that we wanted to guarantee that the standard deviation in our estimator was less than $0.05 = 1/20$. Then we would need, at smallest, $n = \lceil (2 \cdot \tfrac{1}{20})^{-2} \rceil = \lceil (\tfrac{1}{10})^{-2} \rceil = \lceil 100 \rceil = 100$ samples. If we had instead demanded that the standard deviation in our estimator was less than $0.005 = 1/200$, then we would have needed at least $n = 10{,}000$ samples.

Here’s a table of suggested sample sizes for different error controls:

| $\text{SD}[\text{Fr}(E)] \leq$ | 0.2 | 0.1 | 0.05 | 0.01 | 0.005 | 0.001 |
| --- | --- | --- | --- | --- | --- | --- |
| $n \geq$ | 6.25 | 25 | 100 | 2,500 | 10,000 | 250,000 |

Notice the diminishing returns. We can always decrease the standard deviation in our estimator by collecting more samples, but, to actually make the estimator precise, we need very large sample sizes.
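The table values come directly from the bound $n \geq (2\epsilon)^{-2}$; a few lines of code (a minimal sketch) reproduce them.

```python
import math

# Minimal sketch: the bound SD[Fr(E)] <= 1/(2*sqrt(n)) is at most eps once
# n >= (2*eps)**-2, no matter the unknown p.
for eps in [0.2, 0.1, 0.05, 0.01, 0.005, 0.001]:
    bound = (2 * eps) ** -2
    print(f"SD <= {eps}: n >= {bound:g} (smallest integer n = {math.ceil(bound)})")
```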

Example: Asymptotic Analysis of Biased Estimators

The consistency of sample averages can help us study the behavior of estimators that can be expressed as a function of a sample average. For example, the maximum likelihood estimator for the parameter of a Geometric distribution, based on $n$ independent, identical draws $\{W_j\}_{j=1}^n$, is (see Section 12.2):

$$\hat{p}(W_1, W_2, ..., W_n) = \frac{1}{\bar{W}_n} \text{ where } \bar{W}_n = \frac{1}{n} \sum_{j=1}^n W_j.$$

In Section 12.2 we showed that this estimator is biased using Jensen's inequality (see Section 4.1). Jensen's inequality explains the sign of the bias (the estimator has positive bias, so it systematically overestimates) but provides no insight into the size of the bias.

In this section we will study an approximation to the bias based on a Taylor series expansion of the function $1/x$. The approximation expresses the bias in terms of the variance in the sample average $\frac{1}{n} \sum_{j=1}^n W_j$. We'll see that, since the variance in the sample average vanishes as $n$ increases, so does our approximation to the bias. The derivation, based on a Taylor series expansion about $\mathbb{E}[\bar{W}_n] = 1/p$, follows.

Let’s find the bias in the approximation to the maximum likelihood estimator:

$$\begin{aligned} \hat{p}(W_1,W_2,...,W_n) & \approx \tilde{p}(W_1,W_2,...,W_n) \\ & = p - p^2 \left(\bar{W}_n - \frac{1}{p} \right) + p^3 \left(\bar{W}_n - \frac{1}{p} \right)^2. \end{aligned}$$

Since expectations are additive and linear (see Section 4.2):

$$\mathbb{E}[\tilde{p}(W_1,W_2,...,W_n)] = p - p^2\, \mathbb{E}[\bar{W}_n - 1/p] + p^3\, \mathbb{E}[(\bar{W}_n - 1/p)^2].$$

The middle term cancels since $\mathbb{E}[\bar{W}_n - 1/p] = \mathbb{E}[\bar{W}_n] - 1/p = \mathbb{E}[W_1] - 1/p = 1/p - 1/p = 0$. The last expectation is a variance since $1/p = \mathbb{E}[\bar{W}_n]$. Therefore:

$$\mathbb{E}[\tilde{p}(W_1,W_2,...,W_n)] = p + p^3\, \text{Var}[\bar{W}_n].$$

The second term is the bias:

$$\text{bias}(\tilde{p}) = p^3\, \text{Var}[\bar{W}_n] \approx \text{bias}(\hat{p}).$$

We can work out a closed formula for this bias using our formula for the variance of a sample average:

$$\text{Var}[\bar{W}_n] = \frac{1}{n} \text{Var}[W_1].$$

The variance in a geometric random variable is $(1 - p)/p^2$. Therefore:

$$\text{bias}(\hat{p}) \approx \text{bias}(\tilde{p}) = p^3 \cdot \frac{1}{n} \cdot \frac{1 - p}{p^2} = \frac{p(1 - p)}{n} = \mathcal{O}(n^{-1}).$$

So, the bias in the approximate maximum likelihood estimator vanishes as the number of samples diverges. It is asymptotically unbiased.

Although this analysis is not exact, it correctly recovers the asymptotic behavior of the bias in the maximum likelihood estimator. The maximum likelihood estimator is also asymptotically unbiased with bias that vanishes at rate $\mathcal{O}(n^{-1})$. This behavior is typical. Many maximum likelihood estimators are asymptotically unbiased with a bias that vanishes at rate $\mathcal{O}(n^{-1})$ in the sample size.
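A simulation sketch (assuming NumPy; $p$, $n$, and the replication count are arbitrary) shows that the empirical bias of the maximum likelihood estimator is close to the Taylor approximation $p(1-p)/n$.

```python
import numpy as np

# Minimal sketch: empirical bias of the MLE p_hat = 1 / W_bar for Geometric(p)
# samples, compared to the Taylor approximation p*(1-p)/n derived above.
rng = np.random.default_rng(5)
p, n, reps = 0.3, 50, 200_000

w = rng.geometric(p, size=(reps, n))   # draws on {1, 2, ...} with mean 1/p
p_hat = 1.0 / w.mean(axis=1)           # maximum likelihood estimate per replication

print(p_hat.mean() - p)                # empirical bias
print(p * (1 - p) / n)                 # Taylor approximation: 0.3 * 0.7 / 50 = 0.0042
```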

Dependent Variables

What if the samples are not identical? What if they aren’t independent? Can we identify more general situations where sample averages converge?

It turns out that even sample averages of dependent random variables may converge, provided the dependencies are weak enough. If you go on to study stochastic processes, or Monte Carlo methods, you'll see this argument appear in a wide variety of useful models.