
5.1 Tails and Rare Events

The tail of a distribution refers to the distribution function at extreme, or limiting, values of the random variable.

Suppose that $X \sim \text{Binomial}(100, 1/5)$. Then $\mathbb{E}[X] = 100 \times \frac{1}{5} = 20$ (see Section 4.2) and $\text{SD}[X] = \sqrt{100 \times \frac{1}{5} \times \frac{4}{5}} = 10 \times \sqrt{\frac{4}{25}} = 10 \times \frac{2}{5} = 4$ (see HW 5). So, most of the mass of the distribution lies between $20 - 2 \times 4 = 12$ and $20 + 2 \times 4 = 28$, and we could define the tails as all $x$ more than two standard deviations from the mean. We’ve highlighted the region outside two standard deviations of the mean in orange in the figure below.
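To make this concrete, we can add up the binomial PMF between the two cutoffs. Here is a stdlib-only sketch; the distribution and cutoffs match the example above:

```python
# Mass of X ~ Binomial(100, 1/5) within two SDs of the mean (12..28 inclusive).
from math import comb

n, p = 100, 1 / 5

def binom_pmf(x):
    # Pr(X = x) = C(n, x) p^x (1-p)^(n-x)
    return comb(n, x) * p**x * (1 - p)**(n - x)

mass_within_2sd = sum(binom_pmf(x) for x in range(12, 29))
print(mass_within_2sd)  # most of the mass, well above 0.9
```

As expected, only a small fraction of the probability is left over for the two tails.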

2 SD Tail.

Notice that the tails of the distribution correspond to rare events. These are unusually extreme values of $X$ that occur with small probability. In general, a tail event occurs if $X$ is surprisingly small or surprisingly large.

There is no hard-and-fast rule that determines where the tail of a distribution starts. We could have just as well used the region outside three standard deviations of the mean. This corresponds to $x < 20 - 3 \times 4 = 8$ and $x > 20 + 3 \times 4 = 32$. In either case, the tails correspond to unusually large, or small, $x$.

3 SD Tail.

Tails are interesting when we want to study rare events. In each example listed below, the tails are the most important feature of study:

  1. A seismologist is interested in the frequency and likelihood of dangerously large earthquakes.

  2. An insurance company is interested in the frequency and likelihood of catastrophes.

  3. An investor is interested in the frequency and likelihood of sudden changes in the value of an asset.

  4. A gambler is interested in the chance an unlikely bet pays off.

Often, we describe the tails of a distribution using a survival function.

We use survival functions, or CDFs, to find tail probabilities. A tail probability is the chance that a random variable is greater than, or less than, some threshold. In the binomial example above, our thresholds were $\mathbb{E}[X] - 3\,\text{SD}[X] = 20 - 3 \times 4 = 8$ (lower) and $\mathbb{E}[X] + 3\,\text{SD}[X] = 20 + 3 \times 4 = 32$ (upper).
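These two binomial tail probabilities can be computed directly from the PMF. A stdlib-only sketch, using the thresholds above:

```python
# Tail probabilities for X ~ Binomial(100, 1/5) beyond three SDs of the mean.
from math import comb

n, p = 100, 1 / 5

def binom_pmf(x):
    # Pr(X = x) = C(n, x) p^x (1-p)^(n-x)
    return comb(n, x) * p**x * (1 - p)**(n - x)

lower_tail = sum(binom_pmf(x) for x in range(0, 8))       # Pr(X < 8)
upper_tail = sum(binom_pmf(x) for x in range(33, n + 1))  # Pr(X > 32)
print(lower_tail, upper_tail)  # both small: these are rare events
```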

If $X$ is unbounded, then tail probabilities often require a sum with infinitely many terms, or an integral with infinite bounds.

For example, if:

  1. $X$ is discrete, integer valued, and unbounded above, then:

     $$\text{Pr}(X > x) = \sum_{\text{all } y > x} \text{Pr}(X = y) = \sum_{y = x+1}^{\infty} \text{PMF}(y).$$

  2. $X$ is continuous and unbounded above, then:

     $$\text{Pr}(X > x) = \int_{\text{all } y > x} \text{PDF}(y)\,dy = \int_{y = x}^{\infty} \text{PDF}(y)\,dy.$$
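In the discrete case, we can often approximate such an infinite sum by truncating it once the terms become negligible. A sketch using the Geometric(1/2) PMF (worked through in detail below), where the terms shrink geometrically:

```python
# Truncated approximation of Pr(X > x) = sum_{y=x+1}^inf (1-p)^(y-1) p.
p = 0.5
x = 5
# Terms shrink by a factor of 1/2 each step, so truncating at y = 200
# leaves an error on the order of 0.5**200 -- far below float precision.
tail = sum((1 - p)**(y - 1) * p for y in range(x + 1, 201))
print(tail)  # ~ 0.03125, i.e. (1-p)^5
```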

Sums with infinitely many terms are often harder to close than integrals with an infinite bound. Accordingly, we will pay more attention to the discrete case in this chapter.

Examples:

Geometric Distribution

Suppose that $S \sim \text{Geometric}(p)$ for some success probability $p \in (0,1)$. Here’s the geometric PMF for $p = 1/2$ from Section 2.2:

Geometric Distribution.

In a sense, the geometric distribution is all tail. Its mode is at 1, and its PMF decreases monotonically as $s$ increases.

Since the most likely outcome is always $S = 1$, the geometric distribution only has a right tail: unusually large values of $s$.

So, to find a tail probability, we should compute the survival function, $\text{Pr}(S > s)$, for some unusually large $s$ (e.g. $\mathbb{E}[S] + k\,\text{SD}[S]$ for $k \geq 2$):

$$\text{Pr}(S > s) = \sum_{x = s+1}^{\infty} \text{PMF}(x) = \sum_{x = s+1}^{\infty} (1 - p)^{x - 1} p.$$

How should we solve for the value of this sum?

Before we try anything clever, let’s simplify it as much as possible. First, pull the constant multiple $p$ outside the sum. Then, let $y = x - 1$ so that we don’t need to keep track of the $-1$ in the exponent:

$$\text{Pr}(S > s) = \sum_{x = s+1}^{\infty} (1 - p)^{x - 1} p = p \sum_{y = s}^{\infty} (1 - p)^y$$

Let $n = y - s$ so that $y = s + n$. Then the sum is:

$$\text{Pr}(S > s) = p \sum_{y = s}^{\infty} (1 - p)^y = p \sum_{n = 0}^{\infty} (1 - p)^{s + n} = p (1 - p)^{s} \sum_{n = 0}^{\infty} (1 - p)^n$$

Notice that we did not need to update the upper bound of the sum since the sum runs to infinity, and infinity ± any finite offset is still infinity.

Finally, for concision, let $q = 1 - p$ denote the failure probability. Then, the survival function can be expressed:

$$\text{Pr}(S > s) = (1 - q) q^s \sum_{n = 0}^{\infty} q^n.$$

So, to find the survival function, it is sufficient to close the sum:

$$\sum_{n = 0}^{\infty} q^n.$$

This is an example of a geometric series.

It’s not obvious how to simplify a geometric series.

We’ll start with two probability arguments, then will check our result using the standard algebraic solution.

  1. First, we can convert the sum over infinitely many terms into a sum with finitely many terms by thinking in terms of the complement:

    $$\text{Pr}(S > s) = 1 - \text{Pr}(S \leq s) = 1 - \text{CDF}(s) = 1 - \sum_{x = 1}^{s} \text{PMF}(x).$$

    Expressing the survival function as a CDF converted the infinite sum to a sum with finitely many terms since geometric random variables are bounded below.

    Notice that, while this answer is sufficient for small $s$, it becomes unwieldy for large $s$. If $s = 100$, then we would need to add up 100 terms to use this formula. Clearly, this formula does not generalize well.

  2. Alternately, let’s think about the event $S > s$ in more detail. A geometric random variable is the number of attempts up to, and including, the first success in a string of identical, independent Bernoulli trials. So, $S > s$ means that the first $s$ trials were all failures. That is, $S > s$ is the same as the event $(\text{fail on the first } s \text{ trials successively})$:

    $$\begin{aligned} \text{Pr}(S > s) & = \text{Pr}(\text{fail on the first } s \text{ trials successively}) \\ & = \text{Pr}(\text{fail on 1} \cap \text{fail on 2} \cap \dots \cap \text{fail on } s). \end{aligned}$$

    Since the trials are independent, we can use the multiplication rule to expand the joint probability:

    $$\text{Pr}(S > s) = \prod_{j=1}^{s} \text{Pr}(\text{fail on trial } j) = \prod_{j=1}^{s} q = q^s.$$

    Therefore:

    $$\text{Pr}(S > s) = q^s.$$

    Then, since $(1 - q) q^s \sum_{n = 0}^{\infty} q^n = \text{Pr}(S > s) = q^s$, we can solve for the value of the geometric series:

    $$\sum_{n=0}^{\infty} q^n = \frac{q^s}{(1 - q) q^s} = \frac{1}{1-q} = \frac{1}{p}.$$
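Both arguments are easy to check numerically. The sketch below compares the finite CDF sum from argument 1 against the closed form $q^s$ from argument 2, and checks that partial sums of the geometric series approach $1/p$ (the values of $p$ and $s$ are arbitrary choices):

```python
# Check the two survival-function arguments for S ~ Geometric(p).
p = 0.4
q = 1 - p
s = 7

# Argument 1: Pr(S > s) = 1 - sum_{x=1}^{s} PMF(x), a finite sum.
survival_finite = 1 - sum(q**(x - 1) * p for x in range(1, s + 1))

# Argument 2: Pr(S > s) = q^s.
survival_closed = q**s
print(survival_finite, survival_closed)  # agree up to float rounding

# The partial sums of the geometric series approach 1/p.
partial = sum(q**n for n in range(200))
print(partial, 1 / p)  # agree up to float rounding
```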

We’ve shown that the survival function of a geometric distribution is:

$$\text{Pr}(S > s) = q^s = (1 - p)^s.$$

This function decays rapidly as $s$ increases. In fact, it decays exponentially as a function of $s$:

$$\text{Pr}(S > s) = q^s = (e^{\log(q)})^s = e^{\log(q) \times s} = e^{-|\log(q)| \times s}.$$
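We can see the exponential decay directly: the log of the survival function is linear in $s$, with slope $\log(q) < 0$. A quick sketch:

```python
from math import log

q = 0.5
# log Pr(S > s) = s * log(q): a straight line through the origin
# with negative slope, the signature of exponential decay.
for s in (1, 5, 10, 20):
    print(s, log(q**s), s * log(q))  # the last two columns match
```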

Accordingly, the geometric distribution has exponential tails. We will often describe a distribution’s tail behavior by comparing the rate of decay of the PMF, PDF, or survival functions, as functions of their inputs.

We can derive the CDF of a geometric distribution directly from its survival function:

$$\text{CDF}(s) = \text{Pr}(S \leq s) = 1 - \text{Pr}(S > s) = 1 - q^s = 1 - (1 - p)^s.$$
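As a sanity check, we can simulate geometric draws and compare the empirical CDF against $1 - q^s$. A stdlib sketch; the sample size, seed, and parameter values are arbitrary choices:

```python
import random

random.seed(0)
p = 0.5
s = 3

def draw_geometric():
    # Count Bernoulli(p) trials up to and including the first success.
    n = 1
    while random.random() >= p:
        n += 1
    return n

N = 100_000
empirical = sum(draw_geometric() <= s for _ in range(N)) / N
theoretical = 1 - (1 - p)**s  # CDF(s) = 1 - q^s = 0.875
print(empirical, theoretical)  # close
```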

Here’s a plot of the CDF when $p = q = 1/2$:

Geometric CDF.

Notice that the geometric CDF looks like the difference between 1 and an exponential function of ss. That exponential function is the survival function we derived above.

This formula for the geometric CDF is interesting since we could also write the CDF explicitly as a sum of the PMF for all $x \leq s$. Equating both sides gives:

$$\sum_{x \leq s} q^{x - 1} p = \text{CDF}(s) = 1 - q^s$$

Simplifying (let $y = x - 1$):

$$\sum_{y = 0}^{s - 1} q^y = \frac{1}{p}(1 - q^s) = \frac{1 - q^s}{1 - q}.$$
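This identity is easy to verify numerically for any $q$ and $s$ (the values below are arbitrary):

```python
q = 0.7
s = 10

lhs = sum(q**y for y in range(s))   # sum_{y=0}^{s-1} q^y, term by term
rhs = (1 - q**s) / (1 - q)          # closed form for the partial sum
print(lhs, rhs)  # equal up to floating-point rounding
```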

This is the general form for the partial geometric sum.

Harmonic Series and Power Laws

The geometric distribution decays to zero fairly quickly for large inputs. In this sense it has “light” tails. As shown above, its tails are exponential. Let’s study a “heavy tailed” distribution as a contrast.

Power laws occur in a variety of applications. Here are some examples:

  1. The frequency of use of a word is inversely proportional to its rank, when words are ordered by frequency. This means that the $n^{th}$ most common word occurs with chance $1/n$ relative to the most common word. This specifies a distribution that obeys a power law with power $\gamma = 1$ (writing the PMF as proportional to $n^{-\gamma}$). This observation is sometimes called Zipf’s law.

  2. Income and wealth distributions are often Pareto distributed, so often obey power laws.

  3. The number of connections at a node in a large graph often obeys a power law (for example, the number of friends each individual has in a social network, or the number of links pointing to a webpage).

  4. Estimated signal to noise ratios in hypothesis testing. In this case, the distribution of possible signal to noise ratios often looks bell-shaped, but has power law tails. The power of the power law usually increases with more samples.
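Zipf’s law is easy to instantiate over a finite vocabulary, where the $1/n$ weights can be normalized (the vocabulary size below is an arbitrary choice):

```python
# Zipf PMF over a vocabulary of V words: Pr(word n) proportional to 1/n.
V = 1000
weights = [1 / n for n in range(1, V + 1)]
Z = sum(weights)                 # finite normalizer; only exists for finite V
pmf = [w / Z for w in weights]
print(pmf[0] / pmf[1])  # the top word is twice as likely as the second
```

As we show next, this normalizer no longer exists if the support is unbounded.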

The larger $\gamma$, the faster $x^{-\gamma}$ decays. So, to study a case with “heavy” tails, let’s pick $\gamma$ small. Of our examples, the smallest suggestion was $\gamma = 1$. Can we define a discrete distribution of the form:

$$X \in \{1, 2, 3, \ldots\}, \quad \text{PMF}(x) \propto x^{-1}?$$

In order for this model to be valid, the PMF must be normalized. Normalization requires:

$$\sum_{x = 1}^{\infty} c\, x^{-1} = 1$$

for some choice of normalizing constant $c \neq 0$. The normalizing constant is:

$$c = \left(\sum_{x = 1}^{\infty} x^{-1} \right)^{-1} = \frac{1}{1 + \frac{1}{2} + \frac{1}{3} + \frac{1}{4} + \cdots}$$

The constant is nonzero if and only if the sum in the denominator converges to a finite number. The sum $1 + \frac{1}{2} + \frac{1}{3} + \frac{1}{4} + \cdots$ is the infamous harmonic series, which diverges: its partial sums grow without bound.

So, we cannot define a discrete random variable, unbounded above, with tails that decay proportionally to $1/x$!
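We can watch the harmonic series diverge numerically: its partial sums track $\log(n)$ (plus a constant) and never level off. A stdlib sketch:

```python
from math import log

def harmonic(n):
    # Partial sum 1 + 1/2 + ... + 1/n of the harmonic series.
    return sum(1 / k for k in range(1, n + 1))

for n in (10, 1_000, 100_000):
    print(n, harmonic(n), log(n))  # partial sums keep growing, roughly like log(n)
```

Because $\log(n)$ grows without bound, no finite normalizer exists, matching the conclusion above.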

Power laws with larger powers are possible; however, in each case, the associated random variables have “heavy tails.”

When tails decay slowly, the random variable behaves more erratically, and rare events occur more frequently. When a distribution has very heavy tails, typical samples from the distribution will be rare events since most of the mass of the distribution is in its tail!