To address their variance, we will first need their expectation. We’ve worked out these expectations before, but we will need them repeatedly in this chapter. So, we’ll recall them one last time before progressing.
Deriving the Expected Value of a Sum and Sample Average
To find the expectation of a sum, use additivity (see Section 4.2):
Since we are only working with two variables, drop the subscripts and denote the random pair [X,Y], and the sum, S.
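In symbols, additivity gives:

$$ \mathrm{E}[S] = \mathrm{E}[X + Y] = \mathrm{E}[X] + \mathrm{E}[Y], $$

and the sample average of the pair has expectation $\mathrm{E}\big[\tfrac{1}{2}(X + Y)\big] = \tfrac{1}{2}\big(\mathrm{E}[X] + \mathrm{E}[Y]\big)$.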
We could proceed by solving for the distribution of S explicitly (see Section 10.3). If the variables are independent, we could use convolution. In practice, this is difficult unless the random variables are drawn from convenient distributions. It also will not scale easily to sums of n variables. If you want to learn more about the distribution of $S_n$, wait for Section 13.4.
For now, we’ll try to work out the variance using properties of expectation:
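Expanding the definition of variance for $S = X + Y$ and using linearity of expectation:

$$ \begin{aligned} \mathrm{Var}[S] &= \mathrm{E}\big[(X - \mathrm{E}[X] + Y - \mathrm{E}[Y])^2\big] \\ &= \mathrm{E}\big[(X - \mathrm{E}[X])^2\big] + \mathrm{E}\big[(Y - \mathrm{E}[Y])^2\big] + 2\,\mathrm{E}\big[(X - \mathrm{E}[X])(Y - \mathrm{E}[Y])\big] \\ &= \mathrm{Var}[X] + \mathrm{Var}[Y] + 2\,\mathrm{Cov}[X, Y]. \end{aligned} $$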
In the special case when X and Y are uncorrelated, as when they are independent, the covariance is zero, so $\mathrm{Var}[X + Y] = \mathrm{Var}[X] + \mathrm{Var}[Y]$.
If X and Y are positively correlated, then $\mathrm{Var}[X + Y] > \mathrm{Var}[X] + \mathrm{Var}[Y]$. So, when X and Y are positively correlated, their sum is more variable than it would be if they were uncorrelated. If X and Y are negatively correlated, then their sum is less variable than it would be if they were uncorrelated.
Examples
This is a sensible result. If two variables are positively associated then their variations will add constructively. When one is larger than its expectation, the other is more likely to be larger than its expectation. If one is smaller, then the other is likely to be smaller. If two variables are negatively associated then their variations add destructively and tend to cancel out.
For instance, the pair X and X is as positively associated as possible.
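Applying the formula above with Y = X, and recalling that $\mathrm{Cov}[X, X] = \mathrm{Var}[X]$:

$$ \mathrm{Var}[X + X] = \mathrm{Var}[X] + \mathrm{Var}[X] + 2\,\mathrm{Cov}[X, X] = 4\,\mathrm{Var}[X] = \mathrm{Var}[2X]. $$

The sum is twice as variable as it would be if the two copies were uncorrelated.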
To extend to n variables, $\{X_j\}_{j=1}^{n}$, we will need to track the covariance between every possible pair. We can track all of the covariances using a covariance matrix.
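The covariance matrix C has one row and one column per variable, with entries

$$ C_{ij} = \mathrm{Cov}[X_i, X_j], $$

so the diagonal entries are the variances, $C_{jj} = \mathrm{Var}[X_j]$, and the off-diagonal entries are the covariances between distinct variables.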
Then, we can rewrite our equation for the variance of a sum of a pair of random variables:
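In terms of the entries of the covariance matrix for the pair $[X, Y]$:

$$ \mathrm{Var}[X + Y] = \mathrm{Var}[X] + \mathrm{Var}[Y] + 2\,\mathrm{Cov}[X, Y] = C_{11} + C_{22} + C_{12} + C_{21} = \sum_{i=1}^{2} \sum_{j=1}^{2} C_{ij}. $$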
In other words, the variance of a sum of two variables is the sum of every entry of the covariance matrix for the two variables. It is the sum of all possible pairwise covariances we can construct out of pairs of variables selected from the sequence.
Example
For example, if $\{X_j\}_{j=1}^{2}$ share the covariance matrix:
If all of the variables are uncorrelated, as when they are mutually independent, then the covariances between distinct variables are all zero, so only the diagonal entries contribute: $\mathrm{Var}[S_n] = \sum_{j=1}^{n} \mathrm{Var}[X_j]$, where $S_n = \sum_{j=1}^{n} X_j$.
In the special case when the variables are uncorrelated and identically distributed, $\mathrm{Var}[X_i] = \sigma^2$ for some $\sigma > 0$. Then $\mathrm{Var}[S_n] = n\,\sigma^2$, so $\mathrm{SD}[S_n] = \sigma \sqrt{n}$.
This result is an important scaling law for variances of sums and averages. The variance of a sum of independent, identical variables grows proportionally to n. We can now show that the opposite is true for averages. While the variance in a sum increases as we add more terms, the variance in a sample average will decrease, provided the variables are sufficiently uncorrelated.
Suppose that $S \sim \mathrm{Binomial}(n, p)$. What is the variance in S?
Let’s adopt the strategy we used to find the expectation of the binomial in Section 4.2.
A binomial random variable represents the total number of successes in n independent, identical, binary trials with success probability p. Let $\{I_j\}_{j=1}^{n}$ represent the string of indicators associated with each trial. Then $I_j = 1$ if the jth trial succeeded, and $I_j = 0$ otherwise. Then, we can write S as a sum, $S = \sum_{j=1}^{n} I_j$.
We don’t need to add any covariance terms since the trials are independent, so their indicators are uncorrelated.
The trials are also identical, so the variance in a binomial random variable is the variance in a sum of n independent, identically distributed variables.
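Each indicator has variance $p(1-p)$, so additivity of variance for uncorrelated variables gives:

$$ \mathrm{Var}[S] = \sum_{j=1}^{n} \mathrm{Var}[I_j] = n\,p(1-p). $$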
We can use the results we derived for sums to work out the variance of a sample average. This time, we can just use the scaling rule for variances (see Section 4.3). Recall that $\mathrm{Var}[aX] = a^2\,\mathrm{Var}[X]$ for any constant a and variable X.
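Therefore, applying the scaling rule with $a = 1/n$ to the sum $S_n$:

$$ \mathrm{Var}[\bar{X}_n] = \mathrm{Var}\!\left[\frac{1}{n} S_n\right] = \frac{1}{n^2}\,\mathrm{Var}[S_n] = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \mathrm{Cov}[X_i, X_j]. $$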
For example, if three variables share the covariance matrix with rows $(9, 2, 1)$, $(2, 8, 2)$, and $(1, 2, 9)$, then $\mathrm{Var}[\bar{X}_3] = \frac{1}{9}\big((9 + 2 + 1) + (2 + 8 + 2) + (1 + 2 + 9)\big) = \frac{36}{9} = 4$.
Just as the variance of a sum is the sum of all the covariances, the variance in an average is the average of all the covariances.
If the variables are uncorrelated, then all the off-diagonal entries of the covariance matrix are zero, so $\mathrm{Var}[\bar{X}_n] = \frac{1}{n^2} \sum_{j=1}^{n} \mathrm{Var}[X_j]$.
If the variables are uncorrelated and all have the same variance $\sigma^2$, then $\mathrm{Var}[\bar{X}_n] = \frac{n\,\sigma^2}{n^2} = \frac{\sigma^2}{n}$.
Visualization with Covariance Matrices
In the case when all the variables are independent, all of the covariances off the diagonal of the covariance matrix are zero. Here are two examples, one for 3 variables and one for 5. We’ll use ⋅ to mark entries equal to zero.
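Taking each variance to be 9 (the value implied by the averages quoted below), the two covariance matrices look like:

$$ C_3 = \begin{bmatrix} 9 & \cdot & \cdot \\ \cdot & 9 & \cdot \\ \cdot & \cdot & 9 \end{bmatrix}, \qquad C_5 = \begin{bmatrix} 9 & \cdot & \cdot & \cdot & \cdot \\ \cdot & 9 & \cdot & \cdot & \cdot \\ \cdot & \cdot & 9 & \cdot & \cdot \\ \cdot & \cdot & \cdot & 9 & \cdot \\ \cdot & \cdot & \cdot & \cdot & 9 \end{bmatrix}. $$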
There are far more zero entries in the larger matrix. The average entry in the 3 by 3 example is $3 \times 9 / 9 = 3$. The average entry in the 5 by 5 example is $5 \times 9 / 25 = 9/5 < 2$.
For n independent variables there are n diagonal entries and $n^2 - n$ zero entries off the diagonal, so the average entry of the covariance matrix approaches zero as n diverges. The fraction of the covariance matrix that is nonzero is $n / n^2 = 1/n$, so the variance in the sample average of independent random variables vanishes proportionally to 1/n.
This result is an important scaling law for variances of sums and averages. The variance of an average of uncorrelated variables with matching variances decays proportionally to 1/n. In particular, the variance in an average of independent, identical random variables is proportional to 1/n.
This is our first example of a concentration phenomenon. The distribution of sample averages of independent, identical random variables concentrates about its expectation as n increases, since the variance in the sample average decays to zero as n diverges. This is why sample averages of independent, identical variables form reliable estimators. The variance in the sample average decreases as the number of samples grows!
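To see the concentration numerically, here is a minimal simulation sketch in Python. The use of normal draws and the value $\sigma^2 = 4$ are illustrative assumptions, not part of the text; the point is only that the empirical variance of the sample average tracks $\sigma^2 / n$.

```python
import numpy as np

rng = np.random.default_rng(0)

sigma2 = 4.0              # assumed common variance of each X_j (illustrative)
num_replicates = 10_000   # how many independent sample averages to simulate

for n in [1, 10, 100, 1000]:
    # Draw num_replicates independent sample averages, each built from n iid draws.
    draws = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(num_replicates, n))
    sample_means = draws.mean(axis=1)
    print(f"n = {n:4d}  empirical Var of sample average = {sample_means.var():.4f}"
          f"   sigma^2 / n = {sigma2 / n:.4f}")
```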
Let’s put these ideas together to prove that sample averages of large collections of samples are good estimators for unknown expectations.
To start, suppose that $\{X_j\}_{j=1}^{n}$ are drawn independently and identically with unknown $\mathrm{E}[X_j] = \mu$ and unknown, but finite, standard deviation $\mathrm{SD}[X_j] = \sigma$. Then, as suggested in Section 12.1, estimate the unknown mean using the sample average: $\bar{X}_n = \frac{1}{n} \sum_{j=1}^{n} X_j$.
Let’s compute the expected square error when we adopt the sample average as our estimator for the unknown mean.
First, the bias. We’ve shown that $\mathrm{E}[\bar{X}_n] = \mu$, so the estimator is unbiased. Therefore, $\mathrm{bias}(\hat{\mu})^2 = 0$, and the expected square error in our estimate is exactly the variance in the sample average.
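Combining the bias-variance decomposition with the variance computed above:

$$ \mathrm{E}\big[(\bar{X}_n - \mu)^2\big] = \mathrm{bias}(\hat{\mu})^2 + \mathrm{Var}[\bar{X}_n] = \frac{\sigma^2}{n}, $$

which shrinks to zero as the number of samples n grows.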
All the way back at the start of the course we defined chance as long run frequency. We can now justify that definition.
Suppose that E is some event. Let p=Pr(E) denote the chance that E occurs.
To relate this chance to a measurement we could produce from an experiment, we need a procedure for estimating p from data. The simplest procedure is to repeat the random process n times, making sure that all repetitions are unrelated to each other (independent) and are identical. Let $I_j = 1$ if E occurs on the jth repetition and zero otherwise. Then, the total number of times E occurs in the n trials is $S_n = \sum_{j=1}^{n} I_j$.
So, the observed frequency with which the event E occurred is $\mathrm{Fr}(E) = \frac{S_n}{n}$.
The variable $S_n$ is binomial with unknown success probability p on n trials. Recall that the observed frequency, $s/n$, is the maximum likelihood estimator, $\hat{p}(s)$, for the success probability of a binomial random variable. Therefore $\mathrm{Fr}(E)$ is a maximum likelihood estimator for p.
Is the frequency a consistent estimator?
To show consistency, we need to show that it is unbiased (or, if biased, that the bias vanishes in the limit as n diverges), and that the variance in the estimator converges to zero as n goes to infinity.
To show that the frequency is an unbiased estimator, write the frequency as a sample average of the indicators: $\mathrm{Fr}(E) = \frac{1}{n} \sum_{j=1}^{n} I_j$.
Then, $\mathrm{E}[\mathrm{Fr}(E)] = \frac{1}{n} \sum_{j=1}^{n} \mathrm{E}[I_j] = \frac{1}{n} \sum_{j=1}^{n} p = p$. So, the frequency is an unbiased estimator.
To show that it is a precise estimator, and becomes more precise as n increases, let’s compute its variance. We already know the variance in an indicator is $p(1-p)$ and in a binomial random variable is $n\,p(1-p)$.
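So, applying the scaling rule once more, the variance in the observed frequency is:

$$ \mathrm{Var}[\mathrm{Fr}(E)] = \mathrm{Var}\!\left[\frac{1}{n} S_n\right] = \frac{1}{n^2}\, n\,p(1-p) = \frac{p(1-p)}{n}. $$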
Then $\lim_{n \to \infty} \mathrm{Var}[\mathrm{Fr}(E)] = 0$ no matter the value of p. It follows that the expected square error in the estimate, $\mathrm{Fr}(E) = \frac{1}{n} S_n = \hat{p}(S_n) \approx p$, vanishes in the limit of infinitely many samples. So the frequency is a consistent estimator in the expected square sense. This result justifies defining the chance of an event as its long-run frequency.
Moreover, since $p(1-p) \le \frac{1}{4}$ for every p, the standard deviation of the frequency satisfies $\mathrm{SD}[\mathrm{Fr}(E)] = \sqrt{p(1-p)/n} \le \frac{1}{2\sqrt{n}}$ no matter the value of p. The larger n, the smaller this upper bound. We’ll exploit this analysis at length in Section 13.3. For now, we’ll use it to give suggested sample sizes to ensure reliable estimates of unknown chances.
Suppose that we want to estimate p with the observed frequency, and want the standard deviation in our estimate to be smaller than some error threshold $\epsilon > 0$. Then, since we don’t know p, we should set the upper bound on the standard deviation to be less than $\epsilon$. That way, even in the most variable case, our estimator will have standard deviation at most $\epsilon$. This requires $\frac{1}{2\sqrt{n}} \le \epsilon$, that is, $n \ge (2\epsilon)^{-2}$.
Since sampling is expensive, we should pick the smallest n that satisfies the bound. That is $n = \lceil (2\epsilon)^{-2} \rceil$, where $\lceil x \rceil$ means “round x up to the nearest integer.”
Suppose, for example, that we wanted to guarantee that the standard deviation in our estimator was less than $0.05 = 1/20$. Then we would need, at smallest, $n = \lceil (2 \cdot \frac{1}{20})^{-2} \rceil = \lceil (\frac{1}{10})^{-2} \rceil = \lceil 100 \rceil = 100$ samples. If we had instead demanded that the standard deviation in our estimator was less than $0.005 = 1/200$, then we would have needed, at least, $n = 10{,}000$ samples.
Here’s a table of suggested sample sizes for different error controls:
SD[Fr(E)] ≤    0.2     0.1    0.05    0.01     0.005     0.001
n ≥            6.25    25     100     2,500    10,000    250,000
Notice the diminishing returns. We can always decrease the standard deviation in our estimator by collecting more samples, but, to actually make the estimator precise, we need very large sample sizes.
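As a check on these numbers, here is a small Python sketch (not from the original text) that reproduces the bound $n \ge (2\epsilon)^{-2}$ for each error threshold in the table:

```python
import math
from fractions import Fraction

# Reproduce the sample-size bound n >= (2*eps)^(-2), which guarantees
# SD[Fr(E)] <= eps no matter the unknown value of p. Exact rational
# arithmetic avoids floating-point rounding in the bound.
thresholds = ["0.2", "0.1", "0.05", "0.01", "0.005", "0.001"]

for t in thresholds:
    eps = Fraction(t)
    bound = 1 / (4 * eps ** 2)   # (2*eps)^(-2)
    n = math.ceil(bound)         # smallest integer n meeting the bound
    print(f"SD <= {t}: n >= {float(bound):g} (smallest integer n = {n})")
```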
Example: Asymptotic Analysis of Biased Estimators
The consistency of sample averages can help us study the behavior of estimators that can be expressed as a function of a sample average. For example, the maximum likelihood estimator of the parameter of a Geometric distribution, based on n independent, identical draws, $\{W_j\}_{j=1}^{n}$, is (see Section 12.2):
$$ \hat{p}(W_1, W_2, \ldots, W_n) = \frac{1}{\bar{W}_n}, \quad \text{where } \bar{W}_n = \frac{1}{n} \sum_{j=1}^{n} W_j. $$
In Section 12.2 we showed that this estimator is biased using Jensen’s inequality (see Section 4.1). Jensen’s inequality explains the sign of the bias (the estimator has positive bias, so systematically overestimates) but provides no insight into the size of the bias.
In this section we will study an approximation to the bias based on a Taylor series expansion of the function 1/x. The approximation expresses the bias in terms of the variance of the sample average $\bar{W}_n = \frac{1}{n} \sum_{j=1}^{n} W_j$. We’ll see that, since the variance in the sample average vanishes as n increases, so does our approximation to the bias. To see the derivation based on a Taylor series, open the dropdown below.
Approximate MLE Via Taylor Series
Let $f(x) = x^{-1}$, so $\hat{p}(W_1, W_2, \ldots, W_n) = f(\bar{W}_n)$.
Next, let’s find the Taylor expansion of the function 1/x (see Section 6.1). To find the Taylor series, we need the sequence of derivatives:
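The derivatives alternate in sign and grow factorially:

$$ f'(x) = -x^{-2}, \qquad f''(x) = 2\,x^{-3}, \qquad f'''(x) = -6\,x^{-4}, \qquad \ldots, \qquad f^{(k)}(x) = (-1)^k\, k!\, x^{-(k+1)}. $$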
Since approximations based on the Taylor series are most accurate near the point of expansion, we should try expanding f(x) about a point that will probably be close to $\bar{W}_n$. Since we know that $\bar{W}_n$ converges to its expectation, $\mathrm{E}[\bar{W}_n] = \mathrm{E}[W_1] = 1/p$, let’s Taylor expand about $x_* = 1/p$. Then, we can guarantee that, with high probability, the sample average $\bar{W}_n$ will be near the point of expansion, 1/p.
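Keeping terms through second order gives a sketch of the expansion (recall that, for the Geometric distribution, $\mathrm{Var}[W_1] = (1-p)/p^2$, so $\mathrm{Var}[\bar{W}_n] = \frac{1-p}{n\,p^2}$):

$$ \hat{p} = f(\bar{W}_n) \approx f(1/p) + f'(1/p)\,\big(\bar{W}_n - 1/p\big) + \tfrac{1}{2} f''(1/p)\,\big(\bar{W}_n - 1/p\big)^2. $$

Taking expectations, the linear term vanishes because $\mathrm{E}[\bar{W}_n - 1/p] = 0$, and the quadratic term contributes $\tfrac{1}{2} f''(1/p)\,\mathrm{Var}[\bar{W}_n]$, so

$$ \mathrm{E}[\hat{p}] \approx p + \tfrac{1}{2}\,(2\,p^3)\,\frac{1-p}{n\,p^2} = p + \frac{p(1-p)}{n}, \qquad \text{that is,} \qquad \mathrm{bias}(\hat{p}) \approx \frac{p(1-p)}{n}. $$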
So, the bias in the approximate maximum likelihood estimator vanishes as the number of samples diverges. It is asymptotically unbiased.
Although this analysis is not exact, it correctly recovers the asymptotic behavior of the bias in the maximum likelihood estimator. The maximum likelihood estimator is also asymptotically unbiased with bias that vanishes at rate $O(n^{-1})$. This behavior is typical. Many maximum likelihood estimators are asymptotically unbiased with a bias that vanishes at rate $O(n^{-1})$ in the sample size.
The expected square error statement follows by decomposing the expected square error into a contribution from the bias and a contribution from the variance. If the expectation converges to $\mu$ as n diverges, then the bias, $\mathrm{E}[\bar{X}_n] - \mu$, converges to zero. If the variance converges to zero as well, then the expected square error converges to zero as n diverges.
So, even sample averages of dependent random variables may converge, provided the dependencies are weak enough. If you go on to study stochastic processes or Monte Carlo methods, you’ll see this argument appear in a wide variety of useful models.