In the last section we showed that the variance in a sample average decreases as the number of samples increases, provided the samples are sufficiently uncorrelated. As a consequence, sample averages of sufficiently uncorrelated variables are consistent estimators of unknown expectations. We established consistency in the sense that the expected square error converges to zero in the limit as the number of samples diverges.
In this section, we will show that when the variance of a random variable converges to zero, the distribution of the random variable must concentrate about its expectation.
## Motivation
To motivate this effort, consider the following setting:
You can observe samples $X_1, X_2, \ldots, X_n$ drawn jointly, and $\mathbb{E}[X_i] = \mu$ for all $i$. This occurs if the samples are drawn identically.
You don’t know $\mu$, but want to estimate it. So, you estimate it with the sample average $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$. You adopt the sample average since it is an unbiased estimator: $\mathbb{E}[\bar{X}_n] = \mu$.
The samples are sufficiently uncorrelated, so $\text{Var}[\bar{X}_n] \to 0$ as $n \to \infty$.
You don’t need an arbitrarily accurate estimator, but do want your estimates to be sufficiently accurate. For example, maybe you want to guarantee that the error in your estimate will be less than some threshold, $\epsilon$. That is, you’d like to guarantee that $|\bar{X}_n - \mu| \leq \epsilon$ for some $\epsilon > 0$. For example, you might demand that your estimates are accurate to within a tolerance of $\epsilon = 0.01$.
Since the estimates depend on a random sample, they are random. We’ve assumed nothing about the actual distribution of the $X_i$’s, so we can’t work out the distribution of $\bar{X}_n$, let alone its support. It is too much to hope that we can provide any error guarantee with total certainty. However, we might look for a probabilistic guarantee. Something like, “if you use at least 100 samples, then the chance that your error is greater than the tolerance $\epsilon$ is less than 1 in 1,000.” If you could prove this guarantee, then you could argue that there is a 99.9% chance that the error will satisfy the bound $|\bar{X}_n - \mu| \leq \epsilon$.
A bound of this kind is an example of a tail bound. Tail bounds are bounds on tail probabilities (see Section 5.1).
Good tail bounds are very useful pieces of math. They enable analysts to make sharp probability statements without demanding exactly specified models. By trading off the amount of information used to inform the bounds, we can balance between specificity and generality.
Tail bounds are useful when:
We don’t have enough information to find the exact distribution of $X$, so can’t work out its cumulative distribution function or survival function,
The distribution of $X$ is hard to work with, so exact calculation of tail probabilities is cumbersome, or
We are looking to make a statement that holds over a range of different distributions.
In all three cases we pursue a bound instead of an equality, since an equality (exact tail probabilities) would require exact knowledge of the distribution, an exact calculation, and would only apply to that specific distribution. In general, the more information we use when forming our bound, the stronger the bound. There is a trade-off here. The more information the bound demands, the more we need to know about the underlying distribution, the more involved the calculation, and the fewer distributions it applies to.
The first and third motivations apply in our estimation setting. Since we don’t know the distribution of the $X_i$’s, we can’t work out the exact distribution of their sample average. Moreover, we saw in the last section that sample averages are consistent estimators for unknown expectations under extremely weak conditions on the distribution generating the samples, so consistency held for a wide variety of distributions. We ought to be able to provide a tail bound, at least in the limit of large $n$, that holds for a similarly wide range of distributions.
Returning to our estimation problem, we’d like a statement of the kind:

$$\Pr\big[\,|\bar{X}_n - \mu| > \epsilon\,\big] \leq \delta(\epsilon, n)$$

where the failure probability $\delta(\epsilon, n)$ is some function of the precision we demand, $\epsilon$, and the number of samples we collect, $n$. If we demand higher precision while holding the failure probability fixed, or a smaller failure probability for a fixed precision, then we should need larger $n$.
Reasoning about the relationship between an error tolerance, a failure probability, and a sample size is a practical exercise. Since sample collection is often expensive, we often want to know the smallest sample size sufficient to guarantee a given accuracy. Alternately, if the sample size is fixed, then tail bounds allow the analyst to bound the confidence they should place in their estimates, or to find the highest-precision statement available at a given failure probability.
We will explore two tail bounds in this section.
## Markov’s Inequality
We’ll start with the simplest tail bound. This bound is not so useful by itself, since it is very loose. In most cases it grossly overestimates the true tail probability. So if used to inform an estimation policy it would usually produce far too conservative suggestions. However, we can modify it to derive more useful inequalities.
Our first bound formalizes the idea that, if $X$ is nonnegative, then it should be unlikely to see $X$ much larger than $\mathbb{E}[X]$.
Consider the event $\{X \geq a\}$, where $a > 0$ is some threshold. Let $\mathbb{1}[X \geq a]$ be an indicator function for the event $\{X \geq a\}$. Then $\mathbb{1}[X \geq a] = 1$ if $X \geq a$ and $\mathbb{1}[X \geq a] = 0$ if $X < a$. So, as a function of $X$, $\mathbb{1}[X \geq a]$ is a step function. It is zero for all $X < a$, then jumps to 1 for all $X \geq a$.
Recall that the expected value of any indicator variable is the chance of the event it indicates (see Section 4.2). Therefore:

$$\mathbb{E}\big[\mathbb{1}[X \geq a]\big] = \Pr[X \geq a].$$
So, to bound the tail probability, we can bound the expected value of the indicator instead.
Consider the linear function that equals 0 at 0 and passes through the point $(a, 1)$, where the indicator function jumps from 0 to 1. This is the function:

$$\ell(x) = \frac{x}{a}.$$
The figure below illustrates the indicator function (blue) and the line $x/a$ (red) for a fixed threshold $a$. The shaded red region highlights the difference between these two functions. Notice that $x/a \geq \mathbb{1}[x \geq a]$ for all $x \geq 0$.

Since $X/a \geq \mathbb{1}[X \geq a]$, its expected value must also be greater than or equal to the expected value of the indicator. Therefore:

$$\mathbb{E}\left[\frac{X}{a}\right] \geq \mathbb{E}\big[\mathbb{1}[X \geq a]\big] = \Pr[X \geq a].$$

Simplifying:

$$\frac{\mathbb{E}[X]}{a} \geq \Pr[X \geq a].$$

Therefore:

$$\Pr[X \geq a] \leq \frac{\mathbb{E}[X]}{a}.$$

This is Markov’s inequality.
Sometimes it is helpful to think about Markov’s inequality as follows. Set $a = k\,\mathbb{E}[X]$ for some $k > 1$. Then, Markov’s inequality says that:

$$\Pr\big[X \geq k\,\mathbb{E}[X]\big] \leq \frac{1}{k}.$$
In other words, the chance that a nonnegative random variable is $k$ times larger than its expected value is less than or equal to $1/k$. So, the chance that $X$ is more than twice as large as its expectation is less than or equal to $1/2$. The chance it is more than five times as large as its expectation is less than or equal to $1/5$.
Here’s a table for different $k$:
| $k$ | 2 | 3 | 4 | 5 | 6 | 7 | ... | 10 |
|---|---|---|---|---|---|---|---|---|
| $1/k$ | $1/2$ | $1/3$ | $1/4$ | $1/5$ | $1/6$ | $1/7$ | ... | $1/10$ |
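To get a feel for how conservative the $1/k$ bound is for a particular distribution, we can check it against simulated draws. Below is a minimal sketch in Python (using numpy; the exponential distribution and sample size are our choices, purely for illustration) that compares the empirical frequency of $X \geq k\,\mathbb{E}[X]$ to Markov’s bound.

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw many samples from a nonnegative distribution.
# An exponential with mean 1 is used here purely as an illustration.
samples = rng.exponential(scale=1.0, size=1_000_000)
mean = samples.mean()

for k in [2, 3, 4, 5, 10]:
    empirical = np.mean(samples >= k * mean)  # estimated P(X >= k E[X])
    markov = 1 / k                            # Markov's bound
    print(f"k={k:2d}  empirical={empirical:.5f}  Markov bound={markov:.3f}")
```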
Pros:
Markov’s inequality demands very little information. All you need to know is:
that your random variable is nonnegative, and
its expected value.
Cons:
Markov’s inequality only applies to nonnegative random variables.
Markov’s inequality provides a very loose bound for most distributions.
The latter con is fatal. Since the bound decays at rate $1/k$, it only returns a small failure probability for very large $k$. In most settings, it is very unlikely to see a random sample 4 times larger than its expectation, yet Markov would only bound the probability at 0.25. To find a bound whose failure probability is 0.1, we needed to increase $k$ all the way to 10! Only very heavy-tailed, or heavily skewed, distributions produce samples 10 times larger than their expectation 10% of the time.
Here’s an example. Suppose that $X$ is a geometric random variable with success probability $p = 1/2$ (the number of independent trials needed to see the first success). Then $X$ is nonnegative, so Markov’s inequality applies. The expected value of $X$ is $\mathbb{E}[X] = 1/p = 2$.
We worked out the exact tail probabilities of a geometric random variable in Section 5.1. The exact tail probabilities are:

$$\Pr[X > a] = (1 - p)^a = \left(\frac{1}{2}\right)^a.$$
Compare the exact tail probabilities to the Markov bound in the table below. Markov’s inequality provides valid bounds, but the bounds are extremely conservative:
| $a$ | 1 | 2 | 3 | 4 | 5 | 6 | 7 | ... | 10 |
|---|---|---|---|---|---|---|---|---|---|
| Markov bound, $\min(\mathbb{E}[X]/a,\, 1)$ | 1 | 1 | $2/3$ | $1/2$ | $2/5$ | $1/3$ | $2/7$ | ... | $1/5$ |
| Exact, $\Pr[X > a]$ | $1/2$ | $1/4$ | $1/8$ | $1/16$ | $1/32$ | $1/64$ | $1/128$ | ... | $1/1024$ |
The upper bound produced by Markov’s inequality is roughly 200 times too large for $a = 10$.
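The comparison in the table is easy to reproduce numerically. Here is a short sketch that tabulates the exact tail probability of a geometric variable with $p = 1/2$ against the Markov bound $\mathbb{E}[X]/a$, capped at 1 since probabilities never exceed 1:

```python
from fractions import Fraction

p = Fraction(1, 2)   # success probability used in the example above
mean = 1 / p         # E[X] = 1/p = 2 for a geometric variable on {1, 2, ...}

print(f"{'a':>3} {'exact P(X > a)':>16} {'Markov bound':>13} {'ratio':>8}")
for a in [1, 2, 3, 4, 5, 6, 7, 10]:
    exact = (1 - p) ** a                # P(X > a) = (1 - p)^a
    bound = min(mean / a, Fraction(1))  # Markov's bound, capped at 1
    print(f"{a:>3} {float(exact):>16.6f} {float(bound):>13.3f} {float(bound / exact):>8.1f}")
```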
## Chebyshev’s Inequality
Markov’s inequality is very loose since it doesn’t use much information about the distribution. We can find tighter bounds by using more information. Let’s look for a two-sided tail bound that uses the variance to inform the bound.
A two-sided bound is a statement of the kind:

$$\Pr\big[\,|X - \mathbb{E}[X]| \geq \epsilon\,\big] \leq \delta(\epsilon).$$
Notice that, while $X - \mathbb{E}[X]$ may not be nonnegative, the random variable $|X - \mathbb{E}[X]|$ is nonnegative. So, Markov’s inequality applies:

$$\Pr\big[\,|X - \mathbb{E}[X]| \geq \epsilon\,\big] \leq \frac{\mathbb{E}\big[\,|X - \mathbb{E}[X]|\,\big]}{\epsilon}.$$
The expectation in the numerator is the mean absolute deviation (MAD) in the random variable (see Section 4.3). So:

$$\Pr\big[\,|X - \mathbb{E}[X]| \geq \epsilon\,\big] \leq \frac{\text{MAD}[X]}{\epsilon}.$$
We can replace the mean absolute deviation with the variance in $X$ by replacing the absolute deviation, $|X - \mathbb{E}[X]|$, with the squared deviation, $(X - \mathbb{E}[X])^2$. We’ll replace the absolute deviation with the squared deviation since the resulting bound decays more quickly as a function of $\epsilon$ once $\epsilon$ is large. As a result, it is less conservative when we demand a very small failure probability. It is also easier to work with, since we have stronger results for variances than for mean absolute deviations. The bound using the mean absolute deviation is more useful for small $\epsilon$.
To replace the absolute deviation with the squared deviation, notice that $|X - \mathbb{E}[X]| \geq \epsilon$ if and only if $(X - \mathbb{E}[X])^2 \geq \epsilon^2$. Therefore:

$$\Pr\big[\,|X - \mathbb{E}[X]| \geq \epsilon\,\big] = \Pr\big[(X - \mathbb{E}[X])^2 \geq \epsilon^2\big].$$
The squared deviation is a nonnegative random variable, so Markov’s inequality applies:

$$\Pr\big[(X - \mathbb{E}[X])^2 \geq \epsilon^2\big] \leq \frac{\mathbb{E}\big[(X - \mathbb{E}[X])^2\big]}{\epsilon^2}.$$
The expectation in the numerator is the expected square deviation, so is the variance in $X$. Thus:

$$\Pr\big[\,|X - \mathbb{E}[X]| \geq \epsilon\,\big] \leq \frac{\text{Var}[X]}{\epsilon^2}.$$
This is Chebyshev’s inequality.
Like Markov’s inequality, Chebyshev’s inequality only gives an informative bound for $\text{Var}[X]/\epsilon^2 < 1$, or, equivalently, for $\epsilon > \text{std}[X]$.
Chebyshev’s inequality is more useful than Markov’s since it applies to any random variable with finite variance, gives a two-sided bound, and the bound decays faster as a function of the threshold. Markov’s inequality produced a bound that decayed at rate $1/k$ when we looked $k$ expectations out. If we measure deviations in multiples of the standard deviation, setting $\epsilon = k\,\text{std}[X]$, then Chebyshev’s bound decays at rate $1/k^2$:

$$\Pr\big[\,|X - \mathbb{E}[X]| \geq k\,\text{std}[X]\,\big] \leq \frac{1}{k^2}.$$
Here’s a table for different $k$:
| $k$ | 2 | 3 | 4 | 5 | 6 | 7 | ... | 10 |
|---|---|---|---|---|---|---|---|---|
| $1/k^2$ | $1/4$ | $1/9$ | $1/16$ | $1/25$ | $1/36$ | $1/49$ | ... | $1/100$ |
Notice how much faster the bound decays.
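As with Markov’s inequality, we can check how conservative the $1/k^2$ bound is for a specific distribution by simulation. Here is a minimal sketch (again using numpy, with an exponential distribution chosen purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Any distribution with finite variance works; an exponential is used for illustration.
samples = rng.exponential(scale=1.0, size=1_000_000)
mean, std = samples.mean(), samples.std()

for k in [2, 3, 4, 5, 10]:
    empirical = np.mean(np.abs(samples - mean) >= k * std)  # estimated two-sided tail
    chebyshev = 1 / k**2                                     # Chebyshev's bound
    print(f"k={k:2d}  empirical={empirical:.6f}  Chebyshev bound={chebyshev:.4f}")
```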
Pros:
Chebyshev’s inequality demands very little information. It only needs the variance in the random variable.
Chebyshev’s inequality is tight. If the only information we know about a random variable is its variance, then there is no bound smaller than the Chebyshev bound that would apply to all distributions.
Cons:
Because Chebyshev’s inequality is tight, it has to apply to all possible distributions with a given variance. As a result, it is very conservative, and is often quite loose for specific distributions. In many cases, we could find significantly smaller bounds if we used additional information. In particular, Chebyshev tends to produce overly conservative estimates if we demand a very small failure probability.
Because the right hand side of Chebyshev’s inequality is proportional to the variance, decreasing the variance lowers the value of the bound. Therefore, if the variance in the random variable converges to zero, then the chance it differs from its mean by more than any fixed threshold also converges to zero.
Other Bounds
The strategy we used to derive Chebyshev’s inequality would apply for any function that converts $X - \mathbb{E}[X]$ into a nonnegative random variable. In particular, we could derive two-sided tail bounds using:

$$\Pr\big[\,|X - \mathbb{E}[X]| \geq \epsilon\,\big] \leq \frac{\mathbb{E}\big[f\big(|X - \mathbb{E}[X]|\big)\big]}{f(\epsilon)}$$
for any nonnegative function $f$ that is monotonically increasing. For example, taking $f(x) = x^m$ for an integer $m \geq 2$ gives:

$$\Pr\big[\,|X - \mathbb{E}[X]| \geq \epsilon\,\big] \leq \frac{\mathbb{E}\big[\,|X - \mathbb{E}[X]|^m\,\big]}{\epsilon^m}.$$
This produces a bound that decays at rate $1/\epsilon^m$. So, by increasing $m$ we can produce a bound that decays faster. There is no free lunch, since increasing $m$ makes $|X - \mathbb{E}[X]|^m$ grow faster as a function of the deviation, so increases the numerator.
Here’s a different example: try $f(x) = e^{t x}$ for some $t > 0$. Then:

$$\Pr\big[\,|X - \mathbb{E}[X]| \geq \epsilon\,\big] \leq \frac{\mathbb{E}\big[e^{t\,|X - \mathbb{E}[X]|}\big]}{e^{t\epsilon}}.$$
This bound decays exponentially in $\epsilon$. We can make it decay faster by increasing $t$. As before, there is no free lunch. Increasing $t$ makes $e^{t\,|X - \mathbb{E}[X]|}$ grow faster as a function of the deviation, so increases the constant in the numerator.
The best bound to use depends on which expectations we can compute, whether we want to optimize the rate at which the bound decays in the tolerance $\epsilon$, or whether we want a small constant in the numerator. In practice, if we don’t need a very small failure probability, or want a bound that applies for small tolerances, then we should use a function $f$ that grows slowly. If we need a very small failure probability, or want a bound that applies for large tolerances, then we should use a function $f$ that grows quickly. If we can compute multiple bounds, then we should always pick the smallest and use it as our bound.
This is the idea behind Chernoff’s inequality, which says that:

$$\Pr[X \geq a] \leq \min_{t > 0} \frac{\mathbb{E}\big[e^{tX}\big]}{e^{ta}}.$$
Chernoff’s inequality is quite powerful since there are a variety of important scenarios where we can compute the expectation $\mathbb{E}[e^{tX}]$ (the moment generating function) as a function of $t$, and the resulting bound can decay quite rapidly as a function of $a$. For example, you could use Chernoff’s inequality to show that the survival function of a normal random variable decays faster than exponentially.
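To make that last claim concrete, here is a brief worked sketch. If $X$ is normal with mean 0 and variance $\sigma^2$, then $\mathbb{E}[e^{tX}] = e^{t^2\sigma^2/2}$, so Chernoff’s inequality gives

$$\Pr[X \geq a] \leq \min_{t > 0} e^{-ta}\, e^{t^2\sigma^2/2} = e^{-a^2/(2\sigma^2)},$$

with the minimum attained at $t = a/\sigma^2$. The bound decays like $e^{-a^2/(2\sigma^2)}$, which shrinks faster than any exponential in $a$.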
If you’d like to learn more about tail bounds, come talk to your professor, or take a second course in probability!
## Application
Let’s apply Chebyshev’s inequality to our original problem.
Suppose that the random variables $X_1, X_2, \ldots, X_n$ are drawn identically, with known variance $\text{Var}[X_i] = \sigma^2$. Then:

$$\Pr\big[\,|\bar{X}_n - \mu| \geq \epsilon\,\big] \leq \frac{\text{Var}[\bar{X}_n]}{\epsilon^2}.$$
Suppose that the random variables are uncorrelated; then (see Section 13.1):

$$\text{Var}[\bar{X}_n] = \frac{\sigma^2}{n}.$$
So:

$$\Pr\big[\,|\bar{X}_n - \mu| \geq \epsilon\,\big] \leq \frac{\sigma^2}{n\,\epsilon^2}.$$
Pay attention to the three terms in this bound.
Increasing $n$ decreases the bound for a fixed tolerance, since the sample average becomes a more accurate estimator as we use more samples.
Decreasing the tolerance $\epsilon$ (demanding higher accuracy) decreases the chance the error is smaller than the tolerance, so increases the bound on the failure probability. This is an application of the inclusion bound from Section 1. All estimates within 0.01 of the true value are also within 0.1 of the true value, so we are more likely to see an estimate within 0.1 than within 0.01.
Lastly, the term $\sigma^2$ shows how the bound depends on the distribution of the samples. The larger the variance in the samples, the less accurate the estimator, so the larger the failure probability.
Let’s use this equation to find a sufficient sample size.
Suppose that the standard deviation in each sample is 5, we want our estimates to be accurate to within 0.5, and we want our bound to fail with chance at most 0.05. Then $\sigma = 5$, $\epsilon = 0.5$, and the failure probability is $\delta = 0.05$. Then, we should find the smallest $n$ such that:

$$\frac{\sigma^2}{n\,\epsilon^2} = \frac{25}{0.25\, n} = \frac{100}{n} \leq 0.05.$$
This would require:

$$n \geq \frac{100}{0.05} = 2{,}000.$$
So, the minimum sample size necessary to ensure that the sample average is within 0.5 units of the true mean, with probability at least 95%, is less than or equal to 2,000.
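The arithmetic above is simple enough to wrap in a small helper. The sketch below (the function name is ours, purely for illustration) computes the sample size that Chebyshev’s inequality guarantees is sufficient, given a standard deviation, a tolerance, and a failure probability:

```python
import math

def chebyshev_sample_size(std: float, tol: float, fail_prob: float) -> int:
    """Smallest n with std**2 / (n * tol**2) <= fail_prob."""
    return math.ceil(std**2 / (fail_prob * tol**2))

# The example from the text: std = 5, tolerance = 0.5, failure probability = 0.05.
print(chebyshev_sample_size(std=5, tol=0.5, fail_prob=0.05))  # prints 2000
```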
Suppose we had only demanded that our bound hold 90% instead of 95% of the time. Would the necessary sample size increase or decrease?
Solution
If we allow more failures, then we don’t need as many samples. In particular, if we only wanted the bound to hold with chance 90%, then we’d have used $\delta = 0.1$, so would have needed at most $n = 100/0.1 = 1{,}000$ samples.