
13.2 Tail Bounds

In the last section we showed that the variance in a sample average decreases as the number of samples increases, provided the samples are sufficiently uncorrelated. As a consequence, sample averages of sufficiently uncorrelated variables are consistent estimators of unknown expectations. We established consistency in the sense that the expected square error converges to zero in the limit as the number of samples diverges.

In this section, we will show that when the variance of a random variable converges to zero, its distribution must concentrate about its expectation.

Motivation

To motivate this effort, consider the following setting:

  1. You can observe $\{X_j\}_{j=1}^n$ drawn jointly, with $\mathbb{E}[X_j] = \mu$ for all $j$. This occurs if the samples are drawn identically.

  2. You don’t know $\mu$, but want to estimate it. So, you estimate it with the sample average $\bar{X}_n = \frac{1}{n}\sum_{j=1}^n X_j$. You adopt the sample average since it is an unbiased estimator.

  3. The samples are sufficiently uncorrelated so $\lim_{n \rightarrow \infty} \text{Var}[\bar{X}_n] = 0$.

  4. You don’t need an arbitrarily accurate estimator, but do want your estimates to be sufficiently accurate. For example, maybe you want to guarantee that the error in your estimate will be less than some threshold, $\epsilon$. That is, you’d like to guarantee that $|\bar{X}_n - \mu| < \epsilon$ for some $\epsilon > 0$; you might demand, say, that your estimates are accurate to within a tolerance of $\epsilon = 0.1$.

Since the estimates depend on a random sample, they are random. We’ve assumed nothing about the actual distribution of the $X$’s, so we can’t work out the distribution of $\bar{X}_n$, let alone its support. It is too much to hope that we can provide any error guarantee with total certainty. However, we might look for a probabilistic guarantee. Something like, “if you use at least 100 samples, then the chance that your error is greater than the tolerance $\epsilon$ is less than 1 in 1,000.” If you could prove this guarantee, then you could argue that there is a 99.9% chance that the error will satisfy the bound $|\bar{X}_n - \mu| < \epsilon$.

A bound of this kind is an example of a tail bound. Tail bounds are bounds on tail probabilities (see Section 5.1).

Good tail bounds are very useful pieces of math. They enable analysts to make sharp probability statements without demanding exactly specified models. By trading off the amount of information used to inform the bounds, we can balance between specificity and generality.

Tail bounds are useful when:

  1. We don’t have enough information to find the exact distribution of $Y$, so can’t work out its cumulative distribution function or survival function,

  2. The distribution of $Y$ is hard to work with, so exact calculation of tail probabilities is cumbersome, or

  3. We are looking to make a statement that holds over a range of different distributions.

In all three cases we pursue a bound instead of an equality, since an equality (exact tail probabilities) would require exact knowledge of the distribution, an exact calculation, and would only apply to that specific distribution. In general, the more information we use when forming our bound, the stronger the bound. There is a trade-off here. The more information the bound demands, the more we need to know about the underlying distribution, the more involved the calculation, and the fewer distributions it applies to.

The first and third motivations apply in our estimation setting. Since we don’t know the distribution of the $X$’s, we can’t work out the exact distribution of their sample average. Moreover, we saw in the last section that sample averages are consistent estimators for unknown expectations under extremely weak conditions on the distribution generating the samples. As a consequence, sample averages were consistent estimators for a wide variety of distributions. We ought to be able to provide a tail bound, at least in the limit of large $n$, that holds for a similarly wide range of distributions.

Returning to our estimation problem, we’d like a statement of the kind:

$$\text{Pr}(|\bar{X}_n - \mu| > \epsilon) < \delta(\epsilon, n)$$

where the failure probability is some function of the precision we demand, $\epsilon$, and the number of samples we collect, $n$. If we demand higher precision while holding the failure probability fixed, or a smaller failure probability for a fixed precision, then we should need larger $n$.

Reasoning about the relationship between an error tolerance, failure probability, and sample size is a practical exercise. Since sample collection is often expensive, we often want to know the smallest sufficient sample size needed to guarantee accuracy. Alternately, if the sample size is fixed, then tail bounds allow the analyst to bound the confidence they should place on their estimates, or to find the highest-precision statement available with a given failure probability.

We will explore two tail bounds in this section.

Markov’s Inequality

We’ll start with the simplest tail bound. This bound is not so useful by itself, since it is very loose. In most cases it grossly overestimates the true tail probability, so if used to inform an estimation policy it would usually produce suggestions that are far too conservative. However, we can modify it to derive more useful inequalities.

Our first bound formalizes the idea that, if $Y$ is nonnegative, then it should be unlikely to see $Y$ much larger than $\mathbb{E}[Y]$. Markov’s inequality states that, for any nonnegative random variable $Y$ and any threshold $y_* > 0$,

$$\text{Pr}(Y > y_*) \leq \frac{\mathbb{E}[Y]}{y_*}.$$

Sometimes it is helpful to think about Markov’s inequality as follows. Set $y_* = k \mathbb{E}[Y]$ for some $k > 1$. Then, Markov’s inequality says that:

$$\text{Pr}(Y > k \mathbb{E}[Y]) \leq \frac{1}{k}.$$

In other words, the chance that a nonnegative random variable is $k$ times larger than its expected value is less than or equal to $1/k$. So, the chance that $Y$ is more than twice as large as its expectation is less than or equal to $1/2$. The chance it is more than five times as large as its expectation is less than or equal to $1/5$.

Here’s a table for different $k$:

| $k$ | 2 | 3 | 4 | 5 | 6 | 7 | ... | 10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| $\text{Pr}(Y > k \mathbb{E}[Y]) \leq$ | $1/2$ | $1/3$ | $1/4$ | $1/5$ | $1/6$ | $1/7$ | ... | $1/10$ |

The fatal weakness of Markov’s inequality is how slowly the bound decays. Since the bound decays at rate $\mathcal{O}(k^{-1})$, it only returns a small failure probability for very large $k$. In most settings, it is very unlikely to see a random sample 4 times larger than its expectation, yet Markov would only bound the probability at 0.25. To find a bound whose failure probability is $\leq 0.1$, we needed to increase $k$ all the way to 10! Only very heavy-tailed, or heavily skewed, distributions produce samples 10 times larger than their expectation 10% of the time.
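To get a feel for how loose Markov’s bound is, here is a minimal simulation sketch (an illustration added here, assuming an exponential distribution with mean 1, for which $\text{Pr}(Y > k\,\mathbb{E}[Y]) = e^{-k}$ exactly):

```python
import numpy as np

rng = np.random.default_rng(0)

# An Exponential(1) variable has expectation 1, so Pr(Y > k * E[Y]) = exp(-k) exactly.
samples = rng.exponential(scale=1.0, size=1_000_000)

for k in [2, 4, 10]:
    markov = 1 / k                      # Markov's bound on Pr(Y > k * E[Y])
    exact = np.exp(-k)                  # exact tail probability for Exponential(1)
    empirical = np.mean(samples > k)    # Monte Carlo estimate of the same probability
    print(f"k={k:2d}  Markov bound={markov:.3f}  exact={exact:.5f}  empirical={empirical:.5f}")
```

Even at $k = 2$, the exact tail probability ($e^{-2} \approx 0.135$) is well below Markov’s bound of $1/2$, and the gap widens quickly as $k$ grows.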

Chebyshev’s Inequality

Markov’s inequality is very loose since it doesn’t use much information about the distribution. We can find tighter bounds by using more information. Let’s look for a two-sided tail bound that uses the variance to inform the bound.

A two-sided bound is a statement of the kind:

$$\text{Pr}(|Y - \mathbb{E}[Y]| > \epsilon) < \delta.$$

Notice, while $Y$ may not be nonnegative, the random variable $|Y - \mathbb{E}[Y]|$ is nonnegative. So, Markov’s inequality applies:

$$\text{Pr}(|Y - \mathbb{E}[Y]| > \epsilon) \leq \frac{\mathbb{E}[|Y - \mathbb{E}[Y]|]}{\epsilon}.$$

The expectation in the numerator is the mean absolute deviation (MAD) in the random variable $Y$ (see Section 4.3). So:

$$\text{Pr}(|Y - \mathbb{E}[Y]| > \epsilon) \leq \frac{\text{MAD}[Y]}{\epsilon}.$$

We can replace the mean absolute deviation with the variance in $Y$ by replacing the absolute deviation, $|Y - \mathbb{E}[Y]|$, with the squared deviation, $(Y - \mathbb{E}[Y])^2$. We make this replacement since the resulting bound decays more quickly as a function of $\epsilon$ once $\epsilon$ is large; the two bounds cross at $\epsilon = \text{Var}[Y]/\text{MAD}[Y]$, above which the variance-based bound is tighter. As a result, it is less conservative when we demand a very small failure probability. It is also easier to work with, since we have stronger results for variances than for mean absolute deviations. The bound using the mean absolute deviation is more useful for small $\epsilon$.

To replace the absolute deviation with the squared deviation, notice that $|Y - \mathbb{E}[Y]| > \epsilon$ if and only if $(Y - \mathbb{E}[Y])^2 > \epsilon^2$. Therefore:

$$\text{Pr}(|Y - \mathbb{E}[Y]| > \epsilon) = \text{Pr}((Y - \mathbb{E}[Y])^2 > \epsilon^2).$$

The squared deviation is a nonnegative random variable, so Markov’s inequality applies:

$$\text{Pr}((Y - \mathbb{E}[Y])^2 > \epsilon^2) \leq \frac{\mathbb{E}[(Y - \mathbb{E}[Y])^2]}{\epsilon^2}.$$

The expectation in the numerator is the expected squared deviation, which is the variance in $Y$. Thus:

$$\text{Pr}(|Y - \mathbb{E}[Y]| > \epsilon) = \text{Pr}((Y - \mathbb{E}[Y])^2 > \epsilon^2) \leq \frac{\text{Var}[Y]}{\epsilon^2}.$$

This is Chebyshev’s inequality.

Like Markov’s inequality, Chebyshev’s inequality only gives an informative bound for $\epsilon > \text{SD}[Y]$. Writing $\epsilon = k\,\text{SD}[Y]$, this is the condition $k > 1$, and the bound reads:

$$\text{Pr}(|Y - \mathbb{E}[Y]| > k\,\text{SD}[Y]) \leq \frac{1}{k^2}.$$

Chebyshev’s inequality is more useful than Markov’s since it applies to any random variable, gives a two-sided bound, and the bound decays faster as a function of $k$. Markov’s inequality produced a bound that decayed at rate $1/k$. Chebyshev’s decays at rate $1/k^2$.

Here’s a table for different $k$:

| $k$ | 2 | 3 | 4 | 5 | 6 | 7 | ... | 10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| $\text{Pr}(|Y - \mathbb{E}[Y]| > k\,\text{SD}[Y]) \leq$ | $1/4$ | $1/9$ | $1/16$ | $1/25$ | $1/36$ | $1/49$ | ... | $1/100$ |

Notice how much faster the bound decays.
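As a rough numerical check (a sketch that assumes, purely for illustration, that $Y$ is normally distributed, which Chebyshev’s inequality does not require), we can compare the bound $1/k^2$ with the exact two-sided tail probability of a normal random variable:

```python
import math

# For a normal Y, Pr(|Y - E[Y]| > k * SD[Y]) = erfc(k / sqrt(2)) exactly.
for k in [2, 3, 5, 10]:
    chebyshev = 1 / k**2                   # Chebyshev's bound
    exact = math.erfc(k / math.sqrt(2))    # exact two-sided normal tail probability
    print(f"k={k:2d}  Chebyshev bound={chebyshev:.4f}  exact tail={exact:.2e}")
```

Chebyshev’s bound is still conservative for the normal, but it is much closer to the truth than Markov’s bound, and it holds for any distribution with finite variance.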

Because the right hand side of Chebyshev’s inequality is proportional to the variance, decreasing the variance lowers the value of the bound. Therefore, if the variance in the random variable converges to zero, then the chance it differs from its mean by any fixed threshold also converges to zero.

Application

Let’s apply Chebyshev’s inequality to our original problem.

Suppose that the random variables $X_j$ are drawn identically, with known $\text{SD}[X_j] = \sigma$. Then:

$$\text{Pr}(|\bar{X}_n - \mu| > \epsilon) \leq \frac{\text{Var}[\bar{X}_n]}{\epsilon^2}.$$

Suppose, in addition, that the random variables are uncorrelated. Then (see Section 13.1):

$$\text{Var}[\bar{X}_n] = \frac{1}{n} \text{Var}[X_j] = \frac{\sigma^2}{n}.$$

So:

$$\text{Pr}(|\bar{X}_n - \mu| > \epsilon) \leq \frac{1}{n}\frac{\sigma^2}{\epsilon^2} = \delta(n, \epsilon).$$

Pay attention to the three terms in this bound.

  1. Increasing $n$ decreases the bound for a fixed tolerance $\epsilon$ since the sample average becomes a more accurate estimator as we use more samples.

  2. Decreasing the tolerance $\epsilon$ (demanding higher accuracy) decreases the chance the error is smaller than the tolerance, so increases the bound on the failure probability. This is an application of the inclusion bound from Section 1. All estimates within 0.01 of the true value are also within 0.1 of the true value, so we are more likely to see an estimate within 0.1 than 0.01.

  3. Lastly, the $\sigma^2$ term shows how the bound depends on the distribution of the samples. The larger the variance in the samples, the less accurate the estimator, so the larger the failure probability.
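Here is a short simulation sketch of this bound (the exponential sampling distribution is an assumption made purely for illustration); it compares the empirical failure rate with the Chebyshev bound $\sigma^2 / (n \epsilon^2)$ as $n$ grows:

```python
import numpy as np

rng = np.random.default_rng(1)

mu, sigma = 1.0, 1.0      # an Exponential(1) sample has mean 1 and standard deviation 1
epsilon = 0.1             # error tolerance demanded of the sample average
n_trials = 5_000          # number of repeated experiments per sample size

for n in [100, 400, 1600]:
    draws = rng.exponential(scale=1.0, size=(n_trials, n))
    sample_means = draws.mean(axis=1)
    failure_rate = np.mean(np.abs(sample_means - mu) > epsilon)  # empirical Pr(|Xbar_n - mu| > eps)
    bound = sigma**2 / (n * epsilon**2)                          # Chebyshev bound on that probability
    print(f"n={n:5d}  Chebyshev bound={bound:.4f}  empirical failure rate={failure_rate:.4f}")
```

The empirical failure rate sits below the bound at every $n$, and both shrink as the sample size grows.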

Let’s use this equation to find a sufficient sample size.

Suppose that the standard deviation in each sample is 5, we want our estimates to be accurate to within 0.5, and we want our bound to fail with chance at most 0.05. Then $\sigma = 5$, $\epsilon = 0.5$, and $\delta = 0.05$. So, we should find the smallest $n$ such that:

$$\frac{1}{n} \frac{5^2}{0.5^2} = \frac{1}{n} \frac{25}{0.25} = \frac{100}{n} \leq \frac{5}{100}.$$

This would require:

$$n = 100 \times \frac{100}{5} = 100 \times 20 = 2000.$$

So, the minimum sample size necessary to ensure that the sample average is within 0.5 units of the true mean, with probability at least 95%, is less than or equal to 2,000.
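More generally, a minimal helper (a hypothetical function sketched here, not part of the text) can compute the sample size that Chebyshev’s inequality deems sufficient for a given standard deviation, tolerance, and failure probability:

```python
import math

def chebyshev_sample_size(sigma: float, epsilon: float, delta: float) -> int:
    """Smallest n with sigma**2 / (n * epsilon**2) <= delta, per Chebyshev's inequality."""
    return math.ceil(sigma**2 / (epsilon**2 * delta))

# The worked example above: sigma = 5, epsilon = 0.5, delta = 0.05.
print(chebyshev_sample_size(5.0, 0.5, 0.05))   # prints 2000
```

Because Chebyshev’s inequality is conservative, this is a sufficient sample size; the smallest sample size that actually achieves the guarantee may be smaller.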