
2.1 Random Variables and Distributions

A random variable is a number determined by a random outcome. For instance, consider the shoe size of a randomly polled student: the shoe size is a number determined by the outcome, in this case, the choice of student. The formal definition generalizes the idea that, in many situations, we summarize a random outcome with a measurement, or with some summary number.

It is standard practice to denote:

  • A random variable with a capital letter, e.g. $X$

  • If we want to emphasize that the random variable is determined by some randomly selected outcome $\omega$, then we might write it as a function $X(\omega)$

  • A possible value of the random variable is $x$

The support is to a random variable as the outcome space is to randomly generated outcomes: it is just the set of all possible values. For instance, if I toss a coin three times and record the number of heads, $N$, then the support of $N$ is the set $\{0,1,2,3\}$.

Generally we will define random variables by either:

  1. Describing the process that produces them. In this case the support, and distribution of the random variable, are derived as consequences of the process.

  2. By directly fixing the support, and the chances assigned to different values of the random variable.

We will practice going from a story or process to an explicit definition via the support and a list of chances. For now, it is enough to know that, as for any random object, it is always a good idea to first ask: what are the possible values the random variable could return?
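As an illustration of the first approach, the coin-toss example from above can be worked out by brute-force enumeration. This is a plain-Python sketch, not code from the text: the process (three fair tosses) is described, and the support and distribution of $N$ are derived as consequences.

```python
from itertools import product
from collections import Counter
from fractions import Fraction

# The outcome space: every sequence of three fair coin tosses (8 equally likely outcomes).
outcomes = list(product("HT", repeat=3))

# The random variable N maps each outcome to a number: the count of heads.
values = [seq.count("H") for seq in outcomes]

# The support and the PMF fall out of the process.
support = sorted(set(values))
pmf = {n: Fraction(c, len(outcomes)) for n, c in Counter(values).items()}

print(support)   # [0, 1, 2, 3]
print(pmf[2])    # 3/8, from the outcomes HHT, HTH, THH
```

Note that the support and the chances were never specified directly; both were computed from the description of the process.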

Distribution Functions

We can represent the table in the example discussed above with a bar plot:

PMF for the sum of two rolls.

This is an example of a probability histogram. The horizontal axis indicates possible values, $s$, of the random variable $S$. The vertical axis represents probability. The height of the bar at $S = s$ denotes $\text{Pr}(S = s)$.

Notice: this is the first time we’ve been able to actually plot all the probabilities of each possible outcome. That’s because generic outcome spaces have no natural organization or order, and to plot something we need an ordered input axis. In many of our previous examples, there was no natural way to choose which outcomes to list before which others. Random variables, by contrast, are randomly chosen numbers, drawn from some set of possible numbers. Since numbers are ordered, we can plot a list of values that determines the chance of any statement about the random variable. In other words, we can define functions that assign chances to values, and that determine the chance of any other statement or event regarding the variable. These are distributions.

The function that returns the height of each bar is an example of a distribution function.

Distribution functions are to random variables as probability measures are to events. A distribution function is a function that accepts a possible value of a random variable, and returns the probability of a standardized question about that value.

The most natural choice is to plot the chance of each possible value:

The histogram shown above is a probability mass function, or PMF. The heights of the histogram correspond to the PMF: $\text{Pr}(S = s)$ as a function of $s$.
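The PMF for the sum of two rolls can be computed by enumerating all 36 equally likely outcomes. This is an illustrative sketch in plain Python, not the code that produced the figure:

```python
from itertools import product
from collections import Counter
from fractions import Fraction

# All 36 equally likely outcomes of rolling two fair dice.
rolls = list(product(range(1, 7), repeat=2))

# PMF of the sum S: count how many outcomes map to each value s.
pmf = {s: Fraction(c, 36) for s, c in Counter(a + b for a, b in rolls).items()}

for s in sorted(pmf):
    print(s, pmf[s])   # e.g. 7 has chance 1/6, the tallest bar
```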

Note: the use of the word “mass” in the PMF might seem odd. It’s a reference to the common analogy that probability acts like a collection of masses assigned to objects, where all the masses add to one. We’ll see the reason to adopt this odd analogy in Section 2.3 when we consider continuous random variables, which don’t have a useful PMF, but are characterized by a notion of density.

The last example listed above is an example of a cumulative probability. It is cumulative since it is a sum of chances for sequential values of the random variable. Probabilities of this kind are also associated with a standard distribution function:

The CDF is assigned a standard notation:

$$F_X(x) = \text{Pr}(X \leq x).$$

The subscript $X$ means: for the random variable $X$, the argument is an upper bound, and the value returned is the chance that $X$ is less than or equal to the upper bound. We’ll use that notation interchangeably with the more transparent notation:

$$\text{CDF}(x) = \text{Pr}(X \leq x)$$

and add a subscript when it is unclear which random variable is of interest.

In other words, the CDF is the running sum of the values of the PMF.
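That running sum can be sketched directly. The following plain-Python sketch (not the book’s plotting code) builds the CDF of the two-roll sum by accumulating the PMF over ordered support values:

```python
from itertools import product
from collections import Counter
from fractions import Fraction

# PMF of the sum of two fair dice, as before.
pmf = {s: Fraction(c, 36)
       for s, c in Counter(a + b for a, b in product(range(1, 7), repeat=2)).items()}

# The CDF is the running sum of the PMF: CDF(s) = Pr(S <= s).
cdf, running = {}, Fraction(0)
for s in sorted(pmf):
    running += pmf[s]
    cdf[s] = running

print(cdf[7])    # Pr(S <= 7) = 21/36 = 7/12
print(cdf[12])   # 1: the whole mass lies at or below the largest value
```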

Here’s the CDF for the sum of two rolls:

CDF for the sum of two rolls.

Differences in CDF values return the probability that a random variable lands in any interval. For instance, the chance that $S$ is between 6 and 11 is:

$$\text{Pr}(S \in \{6,7,8,9,10,11\}) = \text{CDF}(11) - \text{CDF}(5)$$

since subtracting off the CDF evaluated at 5 will remove from the sum any chances contributed by $S = 2$, $S = 3$, $S = 4$, and $S = 5$. (The sum of two rolls is never less than 2.)

We can also use the CDF to find the chance that a random variable is greater than a lower bound by applying the complement rule:

$$\text{Pr}(X > x) = 1 - \text{Pr}(X \leq x) = 1 - \text{CDF}(x)$$

Since we can use the CDF to find the probability that a random variable falls beneath any upper bound, between any two bounds, or above any lower bound, we can use the CDF to compute the chance of any event regarding a random variable. So, like the PMF, if we know the CDF, then we know every detail needed to compute chances. In other words, the PMF and the CDF each fully specify a probability model.
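Both rules are easy to check numerically for the two-roll sum. A quick plain-Python sketch (again illustrative, not the book’s code):

```python
from itertools import product
from collections import Counter
from fractions import Fraction

# PMF and CDF of the sum of two fair dice.
pmf = {s: Fraction(c, 36)
       for s, c in Counter(a + b for a, b in product(range(1, 7), repeat=2)).items()}
cdf, running = {}, Fraction(0)
for s in sorted(pmf):
    running += pmf[s]
    cdf[s] = running

# Interval rule: Pr(6 <= S <= 11) = CDF(11) - CDF(5).
between = cdf[11] - cdf[5]
print(between)   # 25/36

# Complement rule: Pr(S > 9) = 1 - CDF(9).
above = 1 - cdf[9]
print(above)     # 1/6
```

Both values agree with summing the relevant PMF bars directly, as the interval and complement rules promise.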

Interactive Example

Let’s explore the relationship between the PMF and CDF with a live code demo. To run the code, follow the instructions below.

Click on the power symbol on the upper right, then click on the play arrow on the code cell below:

```python
from utils_dist import run_pdf_cdf_explorer

run_pdf_cdf_explorer(dist="Poisson", show="PDF");
```

You should see a nice PMF above. Play with the parameter $\lambda$ using the available slider. This controls the shape of the PMF. Try $\lambda = 4$ or $\lambda = 5$.

Then, move the slider that controls the position of the upper bound, $x$. The visualization will highlight the bars of the PMF for all $y \leq x$. The sum of the heights of these bars (the highlighted area) returns the corresponding CDF value, since the CDF is the running sum of the PMF. To build up the CDF, gradually move the threshold, check the value for the shaded area printed above the visuals, then click “Save Value” to save the computed area at the current threshold. Repeat until you have a guess for the shape of the CDF. Then click “Reveal CDF”.

Now let’s try this the other way around. Run the code cell to create a new session:

```python
from utils_dist import run_pdf_cdf_explorer

run_pdf_cdf_explorer(dist="Poisson", show="CDF");
```

Now, to recover the unknown values of the PMF, we use the successive differences in the heights of the CDF bars. Pick a new distribution (still discrete) from the dropdown. Play with the parameters until you find a CDF you’re interested in. Then try to eyeball the PMF.

Vary the slider value for $x$, record the difference in heights of the CDF bars, and click “Save Point” to add the computed value to the list of computed PMF values. Continue until you have a good sense of the PMF, then reveal it to check your guess.