
5.1 Tails and Rare Events

The tail of a distribution refers to the distribution function at extreme, or limiting, values of the random variable.

Suppose that $X \sim \text{Binomial}(100, 1/5)$. Then $\mathbb{E}[X] = 100 \times \frac{1}{5} = 20$ (see Section 4.2) and $\text{SD}[X] = \sqrt{100 \times \frac{1}{5} \times \frac{4}{5}} = 10 \times \sqrt{\frac{4}{25}} = 10 \times \frac{2}{5} = 4$ (see HW 5). So, most of the mass of the distribution lies between $20 - 2 \times 4 = 12$ and $20 + 2 \times 4 = 28$, and we could define the tails as all $x$ more than two standard deviations from the mean. We’ve highlighted the region outside two standard deviations of the mean in orange in the figure below.
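To make this concrete, we can add up the binomial PMF between the two cutoffs. Here is a stdlib-only sketch; the distribution and cutoffs match the example above:

```python
# Mass of X ~ Binomial(100, 1/5) within two SDs of the mean (12..28 inclusive).
from math import comb

n, p = 100, 1 / 5

def binom_pmf(x):
    # Pr(X = x) = C(n, x) p^x (1-p)^(n-x)
    return comb(n, x) * p**x * (1 - p)**(n - x)

mass_within_2sd = sum(binom_pmf(x) for x in range(12, 29))
print(mass_within_2sd)  # most of the mass, well above 0.9
```

As expected, only a small fraction of the probability is left over for the two tails.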

2 SD Tail.

Notice that the tails of the distribution correspond to rare events. These are unusually extreme values of $X$ that occur with small probability. In general, a tail event occurs if $X$ is surprisingly small or surprisingly large.

There is no hard-and-fast rule that determines where the tail of a distribution starts. We could have just as well used the region outside three standard deviations of the mean. This corresponds to $x < 20 - 3 \times 4 = 8$ and $x > 20 + 3 \times 4 = 32$. In either case, the tails correspond to unusually large, or small, $x$.

3 SD Tail.

Tails are interesting when we want to study rare events. In each example listed below, the tails are the most important feature of study:

  1. A seismologist is interested in the frequency and likelihood of dangerously large earthquakes.

  2. An insurance company is interested in the frequency and likelihood of catastrophes.

  3. An investor is interested in the frequency and likelihood of sudden changes in the value of an asset.

  4. A gambler is interested in the chance an unlikely bet pays off.

Often, we describe the tails of a distribution using a survival function.

We use survival functions, or CDFs, to find tail probabilities. A tail probability is the chance that a random variable is greater than, or less than, some threshold. In the binomial example above, our thresholds were $\mathbb{E}[X] - 3\,\text{SD}[X] = 20 - 3 \times 4 = 8$ (lower) and $\mathbb{E}[X] + 3\,\text{SD}[X] = 20 + 3 \times 4 = 32$ (upper).
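These two binomial tail probabilities can be computed directly from the PMF. A stdlib-only sketch, using the thresholds above:

```python
# Tail probabilities for X ~ Binomial(100, 1/5) beyond three SDs of the mean.
from math import comb

n, p = 100, 1 / 5

def binom_pmf(x):
    # Pr(X = x) = C(n, x) p^x (1-p)^(n-x)
    return comb(n, x) * p**x * (1 - p)**(n - x)

lower_tail = sum(binom_pmf(x) for x in range(0, 8))       # Pr(X < 8)
upper_tail = sum(binom_pmf(x) for x in range(33, n + 1))  # Pr(X > 32)
print(lower_tail, upper_tail)  # both small: these are rare events
```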

If $X$ is unbounded, then tail probabilities often require a sum with infinitely many terms, or an integral with infinite bounds.

For example, if:

  1. $X$ is discrete, integer valued, and unbounded above, then:

     $$\text{Pr}(X > x) = \sum_{\text{all } y > x} \text{Pr}(X = y) = \sum_{y = x+1}^{\infty} \text{PMF}(y).$$

  2. $X$ is continuous and unbounded above, then:

     $$\text{Pr}(X > x) = \int_{\text{all } y > x} \text{PDF}(y)\,dy = \int_{y = x}^{\infty} \text{PDF}(y)\,dy.$$
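In the discrete case, we can often approximate such an infinite sum by truncating it once the terms become negligible. A sketch using the Geometric(1/2) PMF (worked through in detail below), where the terms shrink geometrically:

```python
# Truncated approximation of Pr(X > x) = sum_{y=x+1}^inf (1-p)^(y-1) p.
p = 0.5
x = 5
# Terms shrink by a factor of 1/2 each step, so truncating at y = 200
# leaves an error on the order of 0.5**200 -- far below float precision.
tail = sum((1 - p)**(y - 1) * p for y in range(x + 1, 201))
print(tail)  # ~ 0.03125, i.e. (1-p)^5
```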

Sums with infinitely many terms are often harder to close than integrals with an infinite bound. Accordingly, we will pay more attention to the discrete case in this chapter.

Examples:

Geometric Distribution

Suppose that $S \sim \text{Geometric}(p)$ for some success probability $p \in (0,1)$. Here’s the geometric PMF for $p = 1/2$ from Section 2.2:

Geometric Distribution.

In a sense, the geometric distribution is all tail. Its mode is at 1, and its PMF decreases monotonically as $s$ increases.

Since the most likely outcome is always $S = 1$, the geometric distribution only has a right tail: unusually large values of $s$.

So, to find a tail probability, we should compute the survival function, $\text{Pr}(S > s)$, for some unusually large $s$ (e.g. $\mathbb{E}[S] + k\,\text{SD}[S]$ for $k \geq 2$):

$$\text{Pr}(S > s) = \sum_{x = s+1}^{\infty} \text{PMF}(x) = \sum_{x = s+1}^{\infty} (1 - p)^{x - 1} p.$$

How should we solve for the value of this sum?

Before we try anything clever, let’s simplify it as much as possible. First, pull the constant multiple $p$ outside the sum. Then, let $y = x - 1$ so that we don’t need to keep track of the $-1$ in the exponent:

$$\text{Pr}(S > s) = \sum_{x = s+1}^{\infty} (1 - p)^{x - 1} p = p \sum_{y = s}^{\infty} (1 - p)^y$$

Let $n = y - s$ so that $y = s + n$. Then the sum is:

$$\text{Pr}(S > s) = p \sum_{y = s}^{\infty} (1 - p)^y = p \sum_{n = 0}^{\infty} (1 - p)^{s + n} = p (1 - p)^{s} \sum_{n = 0}^{\infty} (1 - p)^n$$

Notice that we did not need to update the upper bound of the sum since the sum runs to infinity, and infinity ± any finite offset is still infinity.

Finally, for concision, let $q = 1 - p$ denote the failure probability. Then, the survival function can be expressed:

$$\text{Pr}(S > s) = (1 - q) q^s \sum_{n = 0}^{\infty} q^n.$$

So, to find the survival function, it is sufficient to close the sum:

$$\sum_{n = 0}^{\infty} q^n.$$

This is an example of a geometric series.

It’s not obvious how to simplify a geometric series.

We’ll start with two probability arguments, then will check our result using the standard algebraic solution.

  1. First, we can convert the sum over infinitely many terms into a sum with finitely many terms by thinking in terms of the complement:

    $$\text{Pr}(S > s) = 1 - \text{Pr}(S \leq s) = 1 - \text{CDF}(s) = 1 - \sum_{x = 1}^{s} \text{PMF}(x).$$

    Expressing the survival function as a CDF converted the infinite sum to a sum with finitely many terms since geometric random variables are bounded below.

    Notice that, while this answer is sufficient for small $s$, it becomes unwieldy for large $s$. If $s = 100$, then we would need to add up 100 terms to use this formula. Clearly, this formula does not generalize well.

  2. Alternately, let’s think about the event $S > s$ in more detail. A geometric random variable is the number of attempts up to, and including, the first success in a string of identical, independent Bernoulli trials. So, $S > s$ means that the first $s$ trials were all failures. That is, $S > s$ is the same as the event $(\text{fail on the first } s \text{ trials successively})$:

    $$\begin{aligned} \text{Pr}(S > s) & = \text{Pr}(\text{fail on the first } s \text{ trials successively}) \\ & = \text{Pr}(\text{fail on 1} \cap \text{fail on 2} \cap \dots \cap \text{fail on } s). \end{aligned}$$

    Since the trials are independent, we can use the multiplication rule to expand the joint probability:

    $$\text{Pr}(S > s) = \prod_{j=1}^{s} \text{Pr}(\text{fail on trial } j) = \prod_{j=1}^{s} q = q^s.$$

    Therefore:

    $$\text{Pr}(S > s) = q^s.$$

    Then, since $(1 - q) q^s \sum_{n = 0}^{\infty} q^n = \text{Pr}(S > s) = q^s$, we can solve for the value of the geometric series:

    $$\sum_{n=0}^{\infty} q^n = \frac{q^s}{(1 - q) q^s} = \frac{1}{1-q} = \frac{1}{p}.$$
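Both arguments are easy to check numerically. The sketch below compares the finite CDF sum from argument 1 against the closed form $q^s$ from argument 2, and checks that partial sums of the geometric series approach $1/p$ (the values of $p$ and $s$ are arbitrary choices):

```python
# Check the two survival-function arguments for S ~ Geometric(p).
p = 0.4
q = 1 - p
s = 7

# Argument 1: Pr(S > s) = 1 - sum_{x=1}^{s} PMF(x), a finite sum.
survival_finite = 1 - sum(q**(x - 1) * p for x in range(1, s + 1))

# Argument 2: Pr(S > s) = q^s.
survival_closed = q**s
print(survival_finite, survival_closed)  # agree up to float rounding

# The partial sums of the geometric series approach 1/p.
partial = sum(q**n for n in range(200))
print(partial, 1 / p)  # agree up to float rounding
```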

We’ve shown that the survival function of a geometric distribution is:

$$\text{Pr}(S > s) = q^s = (1 - p)^s.$$

This function decays rapidly as $s$ increases. In fact, it decays exponentially as a function of $s$:

$$\text{Pr}(S > s) = q^s = (e^{\log(q)})^s = e^{\log(q) \times s} = e^{-|\log(q)| \times s}.$$
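We can see the exponential decay directly: the log of the survival function is linear in $s$, with slope $\log(q) < 0$. A quick sketch:

```python
from math import log

q = 0.5
# log Pr(S > s) = s * log(q): a straight line through the origin
# with negative slope, the signature of exponential decay.
for s in (1, 5, 10, 20):
    print(s, log(q**s), s * log(q))  # the last two columns match
```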

Accordingly, the geometric distribution has exponential tails. We will often describe a distribution’s tail behavior by comparing the rate of decay of the PMF, PDF, or survival functions, as functions of their inputs.

We can derive the CDF of a geometric distribution directly from its survival function:

$$\text{CDF}(s) = \text{Pr}(S \leq s) = 1 - \text{Pr}(S > s) = 1 - q^s = 1 - (1 - p)^s.$$
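As a sanity check, we can simulate geometric draws and compare the empirical CDF against $1 - q^s$. A stdlib sketch; the sample size, seed, and parameter values are arbitrary choices:

```python
import random

random.seed(0)
p = 0.5
s = 3

def draw_geometric():
    # Count Bernoulli(p) trials up to and including the first success.
    n = 1
    while random.random() >= p:
        n += 1
    return n

N = 100_000
empirical = sum(draw_geometric() <= s for _ in range(N)) / N
theoretical = 1 - (1 - p)**s  # CDF(s) = 1 - q^s = 0.875
print(empirical, theoretical)  # close
```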

Here’s a plot of the CDF when $p = q = 1/2$:

Geometric CDF.

Notice that the geometric CDF looks like the difference between 1 and an exponential function of ss. That exponential function is the survival function we derived above.

This formula for the geometric CDF is interesting since we could also write the CDF explicitly as a sum of the PMF for all $x \leq s$. Equating both sides gives:

$$\sum_{x \leq s} q^{x - 1} p = \text{CDF}(s) = 1 - q^s$$

Simplifying (let $y = x - 1$):

$$\sum_{y = 0}^{s - 1} q^y = \frac{1}{p}(1 - q^s) = \frac{1 - q^s}{1 - q}.$$
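This identity is easy to verify numerically for any $q$ and $s$ (the values below are arbitrary):

```python
q = 0.7
s = 10

lhs = sum(q**y for y in range(s))   # sum_{y=0}^{s-1} q^y, term by term
rhs = (1 - q**s) / (1 - q)          # closed form for the partial sum
print(lhs, rhs)  # equal up to floating-point rounding
```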

This is the general form for the partial geometric sum.

Harmonic Series and Power Laws

The geometric distribution decays to zero fairly quickly for large inputs. In this sense it has “light” tails. As shown above, its tails are exponential. Let’s study a “heavy tailed” distribution as a contrast.

Power laws occur in a variety of applications. Here are some examples:

  1. The frequency of use of a word is inversely proportional to its rank, when words are ordered by frequency. This means that the $n^{th}$ most common word occurs with chance $1/n$ relative to the most common word. This specifies a distribution that obeys a power law with power $\gamma = 1$ (writing the PMF as proportional to $n^{-\gamma}$). This observation is sometimes called Zipf’s law.

  2. Income and wealth distributions are often Pareto distributed, so often obey power laws.

  3. The number of connections at a node in a large graph often obeys a power law (for example, the number of friends each individual has in a social network, or the number of links pointing to a webpage).

  4. Estimated signal to noise ratios in hypothesis testing. In this case, the distribution of possible signal to noise ratios often looks bell-shaped, but has power law tails. The power of the power law usually increases with more samples.
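Zipf’s law is easy to instantiate over a finite vocabulary, where the $1/n$ weights can be normalized (the vocabulary size below is an arbitrary choice):

```python
# Zipf PMF over a vocabulary of V words: Pr(word n) proportional to 1/n.
V = 1000
weights = [1 / n for n in range(1, V + 1)]
Z = sum(weights)                 # finite normalizer; only exists for finite V
pmf = [w / Z for w in weights]
print(pmf[0] / pmf[1])  # the top word is twice as likely as the second
```

As we show next, this normalizer no longer exists if the support is unbounded.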

The larger $\gamma$, the faster $x^{-\gamma}$ decays. So, to study a case with “heavy” tails, let’s pick $\gamma$ small. Of our examples, the smallest suggestion was $\gamma = 1$. Can we define a discrete distribution of the form:

$$X \in \{1, 2, 3, \ldots\}, \quad \text{PMF}(x) \propto x^{-1}?$$

In order for this model to be valid, the PMF must be normalized. Normalization requires:

$$\sum_{x = 1}^{\infty} c\, x^{-1} = 1$$

for some choice of normalizing constant $c \neq 0$. The normalizing constant is:

$$c = \left(\sum_{x = 1}^{\infty} x^{-1} \right)^{-1} = \frac{1}{1 + \frac{1}{2} + \frac{1}{3} + \frac{1}{4} + \cdots}$$

The constant is nonzero if and only if the sum in the denominator converges to a finite number. The sum $1 + \frac{1}{2} + \frac{1}{3} + \frac{1}{4} + \cdots$ is the infamous harmonic series, which diverges: its partial sums grow without bound.

So, we cannot define a discrete random variable, unbounded above, with tails that decay proportionally to $1/x$!
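We can watch the harmonic series diverge numerically: its partial sums track $\log(n)$ (plus a constant) and never level off. A stdlib sketch:

```python
from math import log

def harmonic(n):
    # Partial sum 1 + 1/2 + ... + 1/n of the harmonic series.
    return sum(1 / k for k in range(1, n + 1))

for n in (10, 1_000, 100_000):
    print(n, harmonic(n), log(n))  # partial sums keep growing, roughly like log(n)
```

Because $\log(n)$ grows without bound, no finite normalizer exists, matching the conclusion above.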

Power laws with larger powers are possible; however, in each case, the associated random variables have “heavy tails.”

When tails decay slowly, the random variable behaves more erratically, and rare events occur more frequently. When a distribution has very heavy tails, typical samples from the distribution will be rare events since most of the mass of the distribution is in its tail!