
2.2 Discrete Models

This section will explore three famous discrete models. All of the models study random counts: random variables that take on integer values.

We will rely on a live code demo built into the textbook. To run the code, follow the instructions below.

Bernoulli

The simplest example of a random variable is a random variable that can only take on one of two values. When those values are 0 and 1, we call the random variable a Bernoulli random variable.

Let $X$ be a Bernoulli random variable. Since $X$ can either equal 1 or equal 0, we can specify the PMF of $X$ using only the chance that $X = 1$: $\text{Pr}(X = 1) = p$. The value $p$ is the success probability. By the complement rule, the chance that $X = 0$ is $\text{Pr}(X = 0) = 1 - p$. The value $q = 1 - p$ is the failure probability.
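To make the definition concrete, here is a minimal sketch of a Bernoulli PMF written out as a plain Python function. The success probability p = 0.3 is an arbitrary illustrative choice, not a value from the text:

def bernoulli_pmf(x, p=0.3):
    # Pr(X = 1) = p; Pr(X = 0) = 1 - p by the complement rule; every other value is impossible.
    if x == 1:
        return p
    if x == 0:
        return 1 - p
    return 0.0

print(bernoulli_pmf(1), bernoulli_pmf(0))  # 0.3 0.7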

The sentence, the random variable $X$ is drawn from a Bernoulli distribution with success probability $p$, is usually expressed:

$$X \sim \text{Bern}(p)$$

where $\sim$ is the symbol for “drawn from”.

Notice that:

  1. Every Bernoulli random variable is drawn from a categorical distribution on two categories, labelled “0” and “1”. Since the labels are arbitrary, Bernoulli random variables are often used to study settings where we are interested in a binary event. For example:

    • Flip a coin. Label heads 1 and tails 0.

    • Run an experiment. Label a successful outcome 1 and a failure 0.

    • Check a statement. Label a valid check 1 and an erroneous check 0.

  2. Bernoulli random variables are often constructed from indicators. An indicator function for an event $E$ is the function that returns 1 if $E$ happens and 0 otherwise. We’ll see that many count variables can be expanded as combinations of indicators, where each indicator indicates whether or not some event occurs.

  3. Bernoulli random variables are specified by the choice of the parameter $p$. If we plot the PMF, then $p$ is the height of the bar assigned to the outcome $X = 1$.

    • The parameter $p$ is often called a success probability since we often use Bernoulli random variables as indicators for events we consider successes. For instance, if I am interested in whether a string of coin tosses contains at least 5 heads, I could define an event $E$, which contains all strings with at least five heads, and set $X$ to an indicator for the event. Then, when the event occurs (a success), my indicator equals 1. So, the chance of success is $\text{Pr}(X = 1)$ (see the simulation sketch after this list).

    • If a Bernoulli random variable is constructed as an indicator for the event $E$, then the success probability is the chance of the associated event, $p = \text{Pr}(E)$.

    • This language is arbitrary since the encoding to 0 and 1 is arbitrary. We could just as well call 0 success and 1 failure, but the assignment of 1 to success is so standard we will accept it as convention.
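Here is a minimal simulation sketch of the indicator construction above, using the coin-toss example from the list. It assumes numpy is available; the event, the 10 tosses, and the 100,000 repetitions are illustrative choices:

import numpy as np

rng = np.random.default_rng(0)

# Simulate many strings of 10 fair-coin tosses (1 = heads, 0 = tails).
tosses = rng.integers(0, 2, size=(100_000, 10))

# X indicates the event E = "the string contains at least 5 heads".
# Each entry of X is a Bernoulli random variable with p = Pr(E).
X = (tosses.sum(axis=1) >= 5).astype(int)

# Averaging the indicators estimates the success probability p = Pr(E).
print(X.mean())  # roughly 0.62 for a fair coin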

Experiment with the code cell below to visualize Bernoulli distributions:

from utils_dist import run_distribution_explorer

run_distribution_explorer("Bernoulli");

While simple, Bernoulli random variables:

  1. Illustrate the two general approaches to defining a distribution. Either:

    • Write down a process that produces the variable (e.g. I am interested in some event, and assign it an indicator $X$)

    • Or, fix the support and distribution function for the random variable.

  2. Show that many random variables are specified by fixing free parameters. The Bernoulli has only one free parameter, the success probability $p$. Usually the parameters are either variables that can be toggled to change the shape of the distribution, so that it can be adapted to model the scenario of interest, or free variables involved in the definition of the process that produces the variable (e.g. the probability of the event that $X$ indicates).

  3. Are a useful building block for other discrete distributions.

Let’s build some other discrete distributions.

Geometric Random Variables

Suppose that you start flipping a coin. You are waiting to see a head. Let $W$ denote the number of flips up to and including the flip when you see your first head. How is $W$ distributed?

As always, start with the support. The random variable $W$ is a count, so it must be integer valued. In order to see a head, we must flip at least once, so $W \geq 1$.

Must $W$ stay below some upper bound $w \geq 0$? For instance, must $W \leq 10$?

No. While unlikely, it is possible to see ten tails in ten tosses. So, $W$ could be greater than 10. What about 20? The same logic applies. 50? The same logic applies.

No matter what $w$ you pick, it is possible, if very improbable, that $W > w$.

This means that $W$ is unbounded above. In other words, the random variable $W$ is supported on the set of all positive integers $\{1, 2, 3, ...\}$. This is our first example of a random variable that can take on infinitely many values.

The fact that $W$ can take on infinitely many different values may be disturbing. So far all of our outcome spaces have contained finitely many outcomes. Do not fear. There is no issue extending to infinitely many outcomes. We just have to obey the axioms. In particular, we need to make sure our chance model, or PMF, assigns chances such that:

$$\sum_{w = 1}^{\infty} \text{PMF}(w) = \text{Pr}(W = 1) + \text{Pr}(W = 2) + \text{Pr}(W = 3) + \text{Pr}(W = 4) + ... = 1.$$

If the sum equals 1, then the distribution is normalized, so it is valid.

Now that we’ve established the support for $W$, we should find one of its distribution functions. Since it is the most natural, let’s try to solve for its PMF.

To find the PMF, ask the question: what is the chance that $W = w$?

In order for the event $W = w$ to occur, we must fail on the first $w - 1$ tosses, and succeed on toss $w$. So, we express the event $W = w$ as an intersection of $w$ events:

$$\{W = w\} = \{\text{fail on toss 1}\} \cap \{\text{fail on toss 2}\} \cap ... \cap \{\text{fail on toss } w - 1\} \cap \{\text{succeed on toss } w\}$$

Since you are flipping a coin, the outcome of any toss is independent of the outcome of any other toss. Therefore, we can compute the chance of this joint event using the multiplication rule:

$$\text{Pr}(W = w) = \text{Pr}(\text{fail on toss 1}) \times \text{Pr}(\text{fail on toss 2}) \times ... \times \text{Pr}(\text{fail on toss } w - 1) \times \text{Pr}(\text{succeed on toss } w)$$

Notice, each toss is a binary event. It has two possible outcomes: heads (success) or tails (failure). So, we can represent each toss with a Bernoulli random variable. The Bernoulli has some success probability $p$ and failure probability $q = 1 - p$.

Since I am tossing the same coin repeatedly, the chance I fail or succeed on any particular toss is the same as any other toss. Therefore:

$$\text{Pr}(W = w) = q \times q \times ... \times q \times p = q^{w-1} p = (1 - p)^{w-1} p$$

You can read the equation above as: the chance I fail $w - 1$ times ($q^{w-1}$) times the chance I succeed on the last trial ($p$).

For a fair coin, $q = p = 1/2$, so:

$$\text{Pr}(W = w) = (1 - p)^{w-1} p = \left(\frac{1}{2}\right)^w$$
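As a quick check that this PMF satisfies the normalization requirement above, sum the chances over the support. For any $0 < p \leq 1$, the geometric series gives:

$$\sum_{w = 1}^{\infty} (1 - p)^{w-1} p = p \sum_{k = 0}^{\infty} (1 - p)^{k} = \frac{p}{1 - (1 - p)} = 1.$$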

We’ve plotted the PMF below:

Geometric PMF.

Notice that each bar is $1/2$ the height of the preceding bar since, on each toss, there is a $1/2$ chance of success.

The random variable $W$ is an example of a geometric random variable.

We denote the sentence, the random variable $W$ is drawn from a geometric distribution with parameter $p$, with:

$$W \sim \text{Geom}(p)$$

Like Bernoulli random variables, geometric random variables have one free parameter, $p$, which must be between 0 and 1.

The definition provided above is explicit. It defines the family of random variables by fixing a support (range of possible values) and a form for the PMF.

We can also define geometric random variables implicitly, by identifying the types of processes that produce geometric random variables:

In other words, geometric random variables describe the number of attempts needed until a first success in a sequence of independent, identical, binary experiments.
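That implicit description suggests a simulation check. The sketch below assumes numpy is available; p = 0.5 and the 100,000 repetitions are arbitrary illustrative choices. It samples waiting times until the first success and compares the empirical frequencies to $(1 - p)^{w-1} p$:

import numpy as np

rng = np.random.default_rng(0)
p = 0.5  # success probability per trial; arbitrary illustrative choice

def waiting_time(rng, p):
    # Run Bernoulli(p) trials until the first success; return how many trials it took.
    w = 1
    while rng.random() >= p:  # keep going while we see failures (chance 1 - p each)
        w += 1
    return w

samples = np.array([waiting_time(rng, p) for _ in range(100_000)])

for w in range(1, 6):
    empirical = (samples == w).mean()
    exact = (1 - p) ** (w - 1) * p
    print(w, round(empirical, 4), round(exact, 4))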

Experiment with the code cell below to visualize Geometric distributions. Select “Discrete” then “Geometric” from the drop-down. Adjust the parameter $p$ until you are confident you could explain how the shape of the PMF depends on $p$. Try to explain, by interpreting $p$ as the chance of success in a string of independent, identical trials, why the shape depends on $p$ in the way you observed.

from utils_dist import run_distribution_explorer

run_distribution_explorer("Geometric");

Binomial Random Variables

Let’s construct a random count in a different way.

Suppose you flip a coin 10 times. What is the chance that, in 10 tosses, you see 7 heads?

Let $X$ denote the number of heads. As always, you should start by asking: what is the range of possible values of $X$?

  • $X$ is a count, so it must be integer-valued.

  • At the least, every toss could be tails, so $X$ could be as small as 0.

  • At the most, every toss could be heads, so $X$ could be as large as 10.

Therefore, $X \in \{0, 1, 2, 3, ..., 10\}$. In other words, $X$ is supported on all whole numbers between 0 and 10.

Next, let’s derive the PMF. To find the PMF, first ask a question of the kind: what is $\text{Pr}(X = x)$? Pick a value that is in the support, solve for the chance, then substitute in a generic $x$. For example, what’s the chance that $X = 7$?

The event $X = 7$ can happen in a few different ways. As usual, we’ll start by partitioning into all the ways the event can occur.

Here’s one: HHHHHHHTTT. Here’s another: TTTHHHHHHH. And here’s a third: THHHTTHHHH. All of these strings of tosses contain 7 heads and 3 tails.

What’s the chance we see the string HHHHHHHTTT?

This string is a sequence of events, so it’s a series of and statements. The outcome of your 5th toss has nothing to do with the outcome of your 7th toss, so the tosses are independent. Therefore, we can use the multiplication rule for independent events to split up each chance:

$$\begin{aligned} \text{Pr}(HHHHHHHTTT) & = \text{Pr}(H \text{ and } H \text{ and } ... \text{ and } H \text{ and } T \text{ and } T \text{ and } T) \\ & = \text{Pr}(H) \times \text{Pr}(H) \times ... \times \text{Pr}(H) \times \text{Pr}(T) \times \text{Pr}(T) \times \text{Pr}(T) \end{aligned}$$

We could, at this point, plug in $1/2$ for all the chances, but it will be more informative to keep the chance of heads and tails separate. So, let’s use $p$ for the chance of a heads (success) and $q = 1 - p$ for the chance of a tails (failure). Then:

$$\text{Pr}(HHHHHHHTTT) = p \times p \times ... \times p \times q \times q \times q = p^7 q^3.$$

This equation is suggestive. Notice:

  1. The answer on the right hand side does not depend on the order of the heads and tails in the string defining the event, only on the number of heads and tails. This is sensible. If we run a string of independent, identical experiments, then the chance of any sequence of outcomes should only depend on the number of times we saw each outcome, not their order. So, we can leave the right hand side alone, and replace the string HHHHHHHTTT with any string of 10 characters, 7 of which are heads, and 3 of which are tails. We’ll need to count how many different strings satisfy this rule in a moment.

  2. Each term has a clear source in the original question. We’ve raised $p$ to the 7th power since we asked about the chance of 7 heads. Since we are counting the number of heads with $X$, the chance $X = 7$ is the chance $X$ takes on a particular selection, $x$, drawn from its possible values. So, anytime we ask the question, what is the chance we see $X = x$ heads, we should expect to see $p$ raised to the power $x$.

  3. We’ve raised $q$ to the third power since, to see 7 heads in 10 tosses, we must see 3 tails. If we’d asked, what is the chance we see $X = x$ heads, we are equivalently asking, what’s the chance we see $n - x$ tails, where $n = 10$ is the number of tosses. So, we can rewrite the right hand side of the equation:

$$p^{x} q^{n-x} = p^x (1 - p)^{n-x}.$$

What we’ve just done is solve a specific case, then generalize the result to work for variants of our original question. Notice that, if we weren’t tossing a fair coin, but repeating some other binary experiment, then we could choose any success probability $p \in [0,1]$.

Now, we’ve solved for the chance of seeing any specific string of $x$ successes and $n - x$ failures in a sequence of independent, identical Bernoulli trials, each of which succeeds with chance $p$. We can put these together with the addition rule (marginalize) to recover the probability of seeing $x$ successes and $n - x$ failures:

$$\text{Pr}(x \text{ successes and } n - x \text{ failures}) = \sum_{...} p^x (1 - p)^{n-x}$$

where the ... stands for summing over all sequences of length $n$ containing $x$ successes and $n - x$ failures. We’ve left this as a ... for now since it is too long for a subscript, and since the next step won’t depend on the details of ...

Take a look at the sum again. The term inside the sum does not depend on the sequence chosen. So, it can be taken outside. Then the sum can be written:

$$\text{Pr}(x \text{ successes and } n - x \text{ failures}) = \left(\sum_{...} 1 \right) p^x (1 - p)^{n-x}$$

The sum of any constant, taken $m$ times, is $m \times \text{the constant}$. The sum of the number 1, taken $m$ times, is just $m$. Therefore:

$$\text{Pr}(x \text{ successes and } n - x \text{ failures}) = |\text{all distinct sequences of } x \text{ successes and } n - x \text{ failures}| \times p^x (1 - p)^{n - x}.$$

In other words, the chance of $x$ successes in $n$ independent, identical, binary trials is the number of ways of ordering $x$ successes and $n - x$ failures, times the chance of any particular sequence of $x$ successes and $n - x$ failures.

All that is left to do is to count the number of distinct ways we can arrange $x$ successes and $n - x$ failures.

Clearly, listing all the arrangements is too cumbersome for a long string. Even listing all the arrangements of 7 heads in 10 tosses would require listing 120 different arrangements.

So, we use combinatorics instead. If you haven’t solved this type of problem before, or it’s been a while, check Appendix A.1.

There are $n!$ distinct arrangements of $n$ distinguishable outcomes. If the outcomes are divided into two sets of sizes $x$ and $n - x$, then we’ve overcounted by the $x!$ ways of rearranging the elements of the first set amongst themselves, and the $(n - x)!$ ways of rearranging the elements of the second set. So, the correct count is:

$$\text{number of arrangements} = \frac{n!}{x!(n-x)!} = \left(\begin{array}{c} n \\ x \end{array} \right)$$

We call the expression above a binomial or choose coefficient. We read it as $n$ choose $x$. The binomial distribution gets its name from this coefficient.
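As a quick check on the count claimed earlier, here is a minimal sketch using only Python’s standard library; n = 10 and x = 7 match the running example:

from math import comb, factorial

n, x = 10, 7  # 10 tosses, 7 of them heads

# Count arrangements from the formula n! / (x! (n - x)!) ...
by_formula = factorial(n) // (factorial(x) * factorial(n - x))

# ... and with Python's built-in binomial coefficient.
print(by_formula, comb(n, x))  # both print 120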

We now have all the pieces we need to compute the desired chance. The chance of $x$ successes in $n$ identical, independent, binary trials is:

$$\text{Pr}(X = x) = \left(\begin{array}{c} n \\ x \end{array} \right) p^x (1- p)^{n-x}.$$
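Plugging in $n = 10$, $x = 7$, and $p = 1/2$ answers the question that opened this section:

$$\text{Pr}(X = 7) = \left(\begin{array}{c} 10 \\ 7 \end{array} \right) \left(\frac{1}{2}\right)^{7} \left(\frac{1}{2}\right)^{3} = \frac{120}{1024} \approx 0.117.$$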

The random variable $X$ is an example of a binomial random variable. We can define it explicitly or implicitly.

We denote the statement, $X$ is a binomial random variable on $n$ trials with success probability $p$, with:

$$X \sim \text{Binom}(n,p).$$
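If you’d like to check the derivation numerically before exploring, here is a minimal sketch using only the standard library; n = 10 and p = 0.5 are arbitrary illustrative choices:

from math import comb

def binom_pmf(x, n, p):
    # Pr(X = x) for X ~ Binom(n, p): (n choose x) * p^x * (1 - p)^(n - x)
    return comb(n, x) * p**x * (1 - p)**(n - x)

n, p = 10, 0.5
pmf = [binom_pmf(x, n, p) for x in range(n + 1)]

print(round(pmf[7], 4))    # about 0.1172: the chance of exactly 7 heads in 10 fair tosses
print(round(sum(pmf), 4))  # the chances over the whole support sum to 1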

Experiment with the code cell below to visualize Binomial distributions. Adjust the parameters $n$ and $p$ until you are confident you could explain how the shape of the PMF depends on both. Try to explain, by interpreting $n$ and $p$, why the shape depends on them in the way you observed.

from utils_dist import run_distribution_explorer

run_distribution_explorer("Binomial");

Other Distributions

There are many discrete distribution models, each associated with a different family of distribution functions, and a different collection of generating processes. We’ve built a demo that you can use to explore some of the most important discrete distributions. You won’t need to know these all in detail, but this demo might be useful for you in future classes. We will ask you to interact with it on your HW.

If you’re interested in any of the other distributions, come ask the Professor. Given time, we’ll touch briefly on the Poisson.

from utils_dist import run_distribution_explorer

run_distribution_explorer();

Some General Lessons

Here are some last thoughts:

  1. It’s worth remembering that there are two ways of defining a random variable and its associated distribution.

    • The implicit approach is the most natural from an applied perspective. We’re interested in some random process in the world, have a natural language description of the process, and want to derive, from that description, the chances of events. Most of the important discrete distributions are tied to a particular story (a natural language description of the process) or a family of related stories. This approach is less common for distributions associated with continuous variables.

    • All distributions can also be defined explicitly. Just like we can define a set in natural language (implicitly), then convert to an explicit list of elements, we can also define a distribution explicitly, by listing all the possible values of the random variable, then writing down a function that computes the chance of each event. Usually these functions have free parameters that are:

      • allowed to vary so that the modeler has the freedom to express many different distributions with a single function, and

      • allowed to vary to represent natural variants in the story associated with the distribution. For example, the parameters $n$ and $p$ of the Binomial distribution are naturally associated with a number of trials, and a chance of success per trial.

  2. So, one skill you should develop is the ability to go back and forth between a functional form, with free parameters, and the shape of a distribution. It is very important that you understand how a distribution changes when we change its parameters. Understanding the parametric dependence of chance means that you understand how the free elements of a question (e.g. the number of trials) influence the answer to your question. We’ll practice this skill explicitly in the next chapter. The better you get at it, the faster you’ll be able to make sense of equations, see how your answers depend on the set-up to a question, gut-check whether your answers are sensible, and, if asked to develop a new model, choose a family of functions.