# Models of Chance

## Equally Likely Outcomes
What’s the chance a coin lands heads?
Given no other information about the coin or the tosser, you might answer 1/2. If asked why, you might say that the coin has a 1/2 chance of landing heads because it has 2 sides.
You could then point to other examples. There is a 1/6 chance a die lands on any of its six sides. Given a 52-card deck, there is a 1/52 chance of pulling any particular card after shuffling thoroughly. There is a 1/38 chance that a roulette ball lands in any particular pocket, since there are 38 pockets on the wheel.
This is the oldest and most widely accepted model of chance. It describes, or approximates, many of the physical processes we use to produce and learn about randomness. Many games of chance are built around physical processes where the chance of any outcome is simply one divided by the total number of possible outcomes, $1/n$.
While familiar and successful, this model can't quite be true. Take the deck of cards. If a player doesn't shuffle enough times, then cards starting near the top can't move to the bottom without cutting the deck, and cards starting at the bottom can't move to the top. So, if we only shuffle once and draw off the top, there aren't really 52 possible options.
Moreover, you could reasonably argue that the cards starting close to the top are more likely to be drawn than those starting farther from the top. Shuffling again mixes the cards to balance the chances, but what is true for one shuffle should also be true for two shuffles, or three, or four. In fact, it strains belief that after any number of shuffles the deck is exactly mixed so that every card is exactly equally likely to appear on the top no matter where it started. Instead, we usually use 1/52 since, after many shuffles, 1/52 is a good approximation, and the approximation gets so close that, after a while, it is essentially true.
The card shuffling example raises a key assumption hidden in the claim that the probability of an outcome equals one divided by the number of possible outcomes: this model assumes all outcomes are equally likely. It turns out that the statements:

1. every outcome has probability $1/n$, where $n$ is the number of possible outcomes, and
2. all outcomes are equally likely,

are equivalent when there are finitely many possible outcomes. The second statement is easier to reason with, so let's think about it a bit more.
There are a few common reasons to accept the claim that all outcomes are equally likely:

1. Experience: we have observed the process many times, and every outcome has occurred with roughly equal frequency.
2. Simplicity: absent other information, equal likelihood is the simplest plausible model.
3. Symmetry: the outcomes are indistinguishable to the process that generates them.

The second argument is often used when we don't really know how to assign chances to outcomes, so we should work with the simplest plausible model until we collect enough evidence, or information, to reject it. This argument is often used in hypothesis testing. The second argument may also be adopted as a simplifying ideal.

The last argument is often the best, and explains many of the examples that satisfy the first criterion. Cards are distinguished by the images painted on their surfaces. While these images differ, they change the physical properties of the cards in such minute ways that the cards should all behave the same when shuffled. Similarly, the slight differences in the images engraved on the sides of a coin, or cut into the sides of a die, are so small that they shouldn't have much impact on how the coin or die rotates in the air, bounces, rolls, spins, and ultimately lands. The last argument is a symmetry argument. When outcomes are evidently asymmetric, as when we shuffle poorly and record the initial position of cards in the deck, we shouldn't assume equally likely outcomes.
## A Thought Experiment
To test these ideas, consider the following thought experiment.
I offer to play you in a game involving a die, and show you a six-sided die I brought from home. I claim that the die is fair (all sides are equally likely). You are unsure, so you ask to test the die first. You toss it ten times and it lands on the side labeled "4" nine out of the ten tosses. Would you still believe that the die is fair?
You would be within your rights to pause and contest my claim. A fair die landing on a specific side in nine of ten tosses seems absurd. It's possible, but very unlikely. Indeed, the probability that a fair die lands on a specific side in exactly nine of ten tosses is

$$\binom{10}{9}\left(\frac{1}{6}\right)^9\left(\frac{5}{6}\right) = \frac{50}{6^{10}} \approx 8 \times 10^{-7}.$$

That's about 8 in ten million. So, you have pretty strong statistical evidence that the die is biased toward the side labeled four. Most professional statisticians would consider this event unlikely enough to reject the claim that the die is fair.

Nevertheless, I could persist. "The die is fair," I claim. "Eight in ten million is small, sure, but it isn't zero. Think about how many people have played with dice in California this year. There are about 40 million Californians. If one in every four played with dice this year, then about eight should have seen this exact event. Moreover, it's not enough just to know that a specific event is unlikely, since most events, spelled out in detail, are very unlikely. The sequence 1334621114 looks totally normal while the sequence 4444444442 looks suspicious, but, if the die is fair, then both have chance $1/6^{10}$. They're equally unlikely even though the first looks typical while the second looks atypical. In sum, coincidences happen (someone wins the lottery) and happen constantly."
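For the curious, the count behind this probability can be checked directly: among the $6^{10}$ equally likely sequences of ten tosses, exactly $\binom{10}{9} \times 5 = 50$ show a given side nine times. A quick sketch:

```python
from math import comb

# Total number of equally likely sequences of ten tosses of a fair die.
total = 6 ** 10

# Sequences with exactly nine 4s: choose which toss is not a 4 (10 ways),
# then choose its value from the five non-4 faces.
nine_fours = comb(10, 9) * 5

prob = nine_fours / total
print(nine_fours, total)  # 50 60466176
print(prob)               # roughly 8.3e-07
```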
How could we resolve our dispute? Think about how you would resolve it before opening the dropdown below.

**Suggestions**
There are two reasonable options.
We could toss the die again. I might claim that I believe it is fair since I've used it many times in the past and never seen a particular bias toward the side labeled 4. You might not believe me, and could ask to toss it ten more times, or, given the shocking prevalence of 4's in the first ten tosses, one hundred more times, to see if the bias toward 4's goes away. By collecting more samples, it should become clearer whether or not the die is fair. This approach is reasonable, but may not resolve our dispute, since anything we see is possible, so both our positions remain tenable, even if mine might be statistically absurd.
We could look for some physical property of the die that might bias it toward the side labeled 4. If, after measuring every aspect of the die that you think could possibly influence the way it rolls, you find that all the sides are essentially the same, then you might be convinced that it is physically implausible that the die prefers any side. Then, while unlikely, the long string of 4's was just a fluke. If, alternately, you show that the die is not uniformly dense, and the side opposite the side labeled four is significantly denser, then I might be pressed to admit that the outcomes differ in a way that could influence the random process.
The last option illustrates the advantage of justification (3.). The argument that outcomes are equally likely because they are indistinguishable to the process allows deduction. It establishes equal likelihood as the consequence of an alternate claim. If we accept the premise that the outcomes are indistinguishable to the process, then we must accept the consequence that they are equally likely. By establishing this chain from premise to conclusion we can shift our argument from probabilities to characteristics of outcomes that could influence the behavior of the process. If we can show that all aspects of the outcomes that influence the process are identical, then we can reach consensus that the outcomes are equally likely.
## Frequency Measures Chance
The first option, just keep rolling, illustrates the second basic model of chance. It is the model preferred by most statisticians. If the underlying outcomes are not symmetric, and we have enough evidence to question the simplest model (equal likelihood), then we can't simply set the probability of an outcome to $1/n$. To work out the probabilities we could either try to derive them from some other premises that we are confident in, or try to measure them. The first approach is deductive: it derives chances from some alternate claim. The second is inductive: it measures chances by relating them to an experimental procedure:
The ratio of the number of times the event happened to the total number of trials ought to reflect the chance the event occurs in any individual trial. We call the ratio of the number of times an event occurs to the number of trials the frequency of the event.
For instance, if a coin is fair, then about half of all tosses should be heads and about half should be tails. Since each toss is random, the exact frequency of heads may differ from 1/2, and must differ from 1/2 when the total number of tosses is odd. For example, on a single toss, the frequency can only be 0 or 1. However, if we toss the coin many, many times, the variability in the frequency should decrease. It's plausible that we see only heads in 4 tosses. It's very unlikely that we see only heads in 100 tosses if the coin is fair.
This suggests an empirical definition for chance:

> The chance of an event is the value its frequency approaches as the number of trials grows without bound.
We’ll make this more formal later in the course. The relation between the number of trials, variability in the observed frequency, and underlying chance, is the subject of the first and most fundamental result in probability, the Law of Large Numbers. This result is fundamental since it establishes a hypothetical procedure that could measure chances objectively.
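We can sketch this convergence with a short simulation (the seed and the toss counts below are arbitrary choices, not part of any formal claim): as the number of tosses grows, the frequency of heads settles near 1/2.

```python
import random

random.seed(1)  # fix the seed so the run is reproducible

def head_frequency(n):
    """Frequency of heads in n tosses of a simulated fair coin."""
    return sum(random.random() < 0.5 for _ in range(n)) / n

# The frequency is itself random, but its variability shrinks as n grows.
for n in [10, 1000, 100000]:
    print(n, head_frequency(n))
```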
## Probability as Proportion
We've now seen two models of chance:

1. Equally likely outcomes: the probability of each outcome is one divided by the number of possible outcomes.
2. Frequency: the probability of an event is the long-run frequency with which it occurs.
The first model tells us directly how to compute the probability of any specific outcome. It doesn’t tell us how to compute the probability of an arbitrary event from the probability of each outcome. The second model is more helpful since it explains how probabilities of events should change when we vary the definition of the event. In short, if probabilities equal long run frequencies, then probabilities must obey the same algebra rules as long run frequencies. So, the first model asserts specific probabilities for outcomes, while the second asserts specific rules we can use to manipulate chances, no matter the chances assigned to outcomes.
Consider the second definition. Under it, the chance of any event must behave in the same way as the fraction of all trials in which the event occurs in a long sequence of trials. Therefore:

1. the probability of any event must be between 0 and 1, since a fraction of trials is between 0 and 1, and
2. the probability of the event containing all possible outcomes must be 1, since every trial produces some outcome.

How should the probabilities of events combine?
Suppose that $A$ and $B$ are disjoint events. By definition, disjoint events are nonoverlapping sets of outcomes. This means that the events cannot occur simultaneously. Any trial will produce an outcome that is either in $A$, in $B$, or in neither. So, the number of trials in which $A$ or $B$ occurs will equal the number of trials in which $A$ occurs plus the number of trials in which $B$ occurs. It follows that the frequency (fraction of all trials) of the event $A \text{ or } B$ must equal the frequency of the event $A$ plus the frequency of the event $B$.
So, if probabilities are to behave as frequencies, then, for disjoint events $A$ and $B$:

$$\text{Pr}(A \cup B) = \text{Pr}(A) + \text{Pr}(B).$$

This disjoint union rule is our first algebra rule for chances. We can use it to complete our probability model for equally likely outcomes.
**Proof**

Suppose that all outcomes are equally likely and that the outcome space $\Omega$ contains finitely many outcomes. Then:

1. All outcomes are equally likely, so all share the same chance, $p$.
2. Distinct outcomes are disjoint, so, if an event contains $k$ distinct outcomes, its probability is $k \, p$ by the disjoint union rule.
3. Since $\Omega$ is a valid event, and all outcomes are in $\Omega$, the frequency with which $\Omega$ occurs must equal 1, so $\text{Pr}(\Omega) = 1$.

Putting the second and third statements together gives us the rule we started this section with: $\text{Pr}(\Omega) = |\Omega| \, p = 1$, so $p = 1/|\Omega|$.

If all outcomes are equally likely, then the probability of any outcome is $1/|\Omega|$.
We can now provide a rule for the probability of any event. This is our first real model of probability. It equates probability to the fraction of all outcomes contained in the event, and provides a formula for computing the probability of an event. If all outcomes are equally likely and there are finitely many:

$$\text{Pr}(A) = \frac{|A|}{|\Omega|} = \frac{\text{number of outcomes in } A}{\text{total number of possible outcomes}}.$$
## Examples
Let’s practice this approach.
### Permutations
Suppose that I have three cards labeled $a$, $b$, and $c$. The cards are otherwise identical. We shuffle the deck, then draw the cards in order from the top, without replacing any cards. We shuffle thoroughly, so each card is equally likely to appear in any location in the deck.

The first thing to do is to write down the outcome space. We've seen it already. It consists of all distinct ways in which we can order the three cards. Since the cards all have a unique label, the outcome space is the set of all permutations of the labels $a$, $b$, and $c$:

$$\Omega = \{abc,\ acb,\ bac,\ bca,\ cab,\ cba\}.$$
Next, count the number of possible outcomes. This will be the denominator each time we compute a chance:

$$|\Omega| = 6.$$
Finally, given an event $A$, count the number of ways the event can happen. This will be the numerator. For example, suppose the event $A$ contains all outcomes where $a$ occurs first. Then $A = \{abc,\ acb\}$, so $|A| = 2$. Therefore

$$\text{Pr}(A) = \frac{|A|}{|\Omega|} = \frac{2}{6} = \frac{1}{3}.$$
We can repeat this process for any event. 🛠️ Complete each example below:
| Event | Verbal Description | Subset | Size of Subset | Chance |
|---|---|---|---|---|
| $A$ | $a$ appears first | $\{abc,\ acb\}$ | 2 | 2/6 |
| $B$ | $a$ and $b$ are next to each other | | | |
| $C$ | the letters are in reverse alphabetical order | | | |
| $D$ | $a$ does not appear | | | |
| $E$ | $a$ is either first, second, or third | | | |
| $F$ | the letters form a word that means 'taxi' | | | |
**Solutions**

| Event | Verbal Description | Subset | Size of Subset | Chance |
|---|---|---|---|---|
| $A$ | $a$ appears first | $\{abc,\ acb\}$ | 2 | 2/6 |
| $B$ | $a$ and $b$ are next to each other | $\{abc,\ bac,\ cab,\ cba\}$ | 4 | 4/6 |
| $C$ | the letters are in reverse alphabetical order | $\{cba\}$ | 1 | 1/6 |
| $D$ | $a$ does not appear | $\emptyset$ | 0 | 0 |
| $E$ | $a$ is either first, second, or third | $\Omega$ | 6 | 1 |
| $F$ | the letters form a word that means 'taxi' | $\{cab\}$ | 1 | 1/6 |
Notice that the probability of the empty set is the probability that an impossible event occurs. If an event is impossible, then it never occurs, so its chance must be zero.
Notice also that the probability of every event is at least as large as the probability of any more detailed event. For example, the probability that $a$ appears first is larger than the probability that $a$ appears first and $b$ appears second. Likewise, the probability that $a$ and $b$ are next to each other is greater than the probability that $a$ and $b$ appear next to each other in the order $ab$. This is an important rule to keep in mind since it gives us a strategy for bounding probabilities. Adding detail to the description of an event never makes the event more likely, so the probability of a detailed event is, at most, the probability of a less detailed description of the event.
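With only six outcomes, we can verify this kind of counting by brute force. Here is a short sketch (the event encodings below, using small predicate functions over the strings $abc, acb, \dots$, are our own):

```python
from itertools import permutations
from fractions import Fraction

# Outcome space: all orderings of the three cards a, b, c.
omega = [''.join(p) for p in permutations('abc')]

def pr(event):
    """Probability of an event = fraction of outcomes satisfying it."""
    return Fraction(sum(event(o) for o in omega), len(omega))

print(pr(lambda o: o[0] == 'a'))             # a appears first: 1/3
print(pr(lambda o: 'ab' in o or 'ba' in o))  # a and b adjacent: 2/3
print(pr(lambda o: o == 'cba'))              # reverse alphabetical: 1/6
print(pr(lambda o: 'a' not in o))            # a does not appear: 0
```

Note that the detailed event "the order is exactly $cba$" gets a smaller probability than the looser event "$a$ and $b$ are adjacent," matching the monotonicity observation above.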
Let's add these rules to our growing list of probability rules:

- The probability of an impossible event is zero: $\text{Pr}(\emptyset) = 0$.
- If every outcome in $A$ is also in $B$, then $\text{Pr}(A) \leq \text{Pr}(B)$.
### Poker Hands
Suppose, now, that we are playing a card game with a standard 52 card deck. If we draw off the top of a thoroughly shuffled deck we can model any sequence of draws by:

- considering an outcome to be a specific order of all 52 cards in the stack,
- considering an event to be a specific statement about the order of the cards in the stack, and
- assuming that all orderings of the 52 cards are equally likely.
Once again, we are working with an outcome space that contains all permutations of a list of distinguishable objects. In the previous example we had 3 objects. Now we have 52. These examples differ only in that the number of possible permutations of 52 cards is enormously large. Far too large to write out explicitly. So, we’ll have to get better at counting. We will need to learn to count the sizes of sets without simply listing all the members of the set.
First, how big is $\Omega$?
It's hard to think about all 52 cards at once, so imagine you are drawing cards off the top, one at a time. There are 52 options for the first card. Once we've drawn the first there are 51 remaining options for the second. Once we've drawn the second there are 50 remaining options for the third. The process continues. Each time the total number of options multiplies. There are $52 \times 51$ options for the first two cards. There are $52 \times 51 \times 50$ options for the first three cards. Repeating the pattern:

$$|\Omega| = 52 \times 51 \times 50 \times \cdots \times 2 \times 1 = \prod_{k=1}^{52} k = 52!$$

Here, $\prod$ means "take the product of" all the terms appearing inside the product symbol, sweeping over all values of the index $k$. The product sign $\prod$ works like the summation symbol $\sum$.
Notice, this rule also works in our previous example. There, $|\Omega| = 3! = 3 \times 2 \times 1 = 6$.
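As a quick sanity check, we can build $52!$ as a running product, mirroring the draw-by-draw argument above:

```python
from math import factorial

# Build 52! as a running product: 52 options, times 51, times 50, ...
product = 1
for k in range(1, 53):
    product *= k

print(product == factorial(52))  # True: the running product matches 52!
print(factorial(3))              # 6: the 3-card deck has 3! orderings
```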
Now let's find the chance of an event. What is the chance that:

- the first two cards drawn are both aces?
- the first card drawn is an ace and the second is a spade?
To find the probabilities, we need the size of each set:
**Two Aces**

$A$ (both of the first two cards are aces): There are four aces in the deck, so there are four options for my first draw. After removing an ace there are three remaining aces, so there are three options for my second draw. We haven't said anything about the order of the remaining 50 cards, so we can order those 50 however we like. There are $50!$ such arrangements. Therefore:

$$|A| = 4 \times 3 \times 50!$$

and:

$$\text{Pr}(A) = \frac{4 \times 3 \times 50!}{52!} = \frac{4 \times 3}{52 \times 51} = \frac{12}{2652} = \frac{1}{221}.$$
**Ace Then Spade**

$B$ (the first card is an ace and the second is a spade): There are four aces in the deck, so there are four options for my first draw. I could have removed an ace of any suit. If I removed the ace of spades, then there are 12 remaining spades. If I did not, there are 13 remaining spades. So, let's use our union rule to split up these possibilities. Either we draw the ace of spades, then draw a spade, or we draw one of the other three aces, then draw a spade. This approach is called partitioning. We are breaking an event into a union of disjoint smaller events that are easier to think about:

$$|B| = \underbrace{1 \times 12 \times 50!}_{\text{ace of spades first}} + \underbrace{3 \times 13 \times 50!}_{\text{other ace first}} = 51 \times 50!$$

Therefore:

$$\text{Pr}(B) = \frac{51 \times 50!}{52!} = \frac{51}{52 \times 51} = \frac{1}{52}.$$
**Picking Strategically**

Notice that, in both examples, we were forced to carry around a $50!$ that cancelled in the numerator and denominator. This $50!$ appeared since we defined our outcomes as the full order of the entire stack of cards. It counted arrangements of the last 50 cards. Since our outcomes don't depend on the order of those 50 cards, the $50!$ term that counts their possible permutations cannot matter, so must cancel. If we knew that we only wanted to compute chances for the first pair of cards, then enumerating all possible orders of the entire stack is overkill.
Situations like this occur frequently in probability. We have the freedom to define any outcome space we like by changing the way we describe outcomes. All we need is enough detail to distinguish the events of interest. If we don’t want to spend a lot of time counting distinctions that don’t matter, we should pick the simplest descriptions possible. We should not make distinctions between outcomes that don’t influence membership in the events we care about.
That said, unless we are provided chances for the events of interest, it is often a good strategy to add detail to our definition of events until we reach a space of outcomes where we know the chances. For instance, when working with card shuffles, we started with the space of all permutations of all 52 cards because we believed that all permutations should be equally likely.
If all permutations of 52 cards are equally likely, then all distinct initial pairs of cards are equally likely. So, we could have computed the same probabilities working with the outcome space of ordered pairs of distinct cards. Then there are 52 options for the first card and 51 for the second, so $|\Omega| = 52 \times 51 = 2652$. There are $4 \times 3 = 12$ ordered pairs of aces, and $1 \times 12 + 3 \times 13 = 51$ ordered pairs that start with an ace then end with a spade. These ratios produce the same probabilities we computed before.
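We can confirm these pair counts by enumeration (the rank-and-suit encoding of the deck below is our own choice, with rank 1 standing for an ace):

```python
from itertools import permutations
from fractions import Fraction

# Encode each card as (rank, suit); ranks 1..13 with 1 = ace.
deck = [(rank, suit) for rank in range(1, 14)
                     for suit in ['spade', 'heart', 'diamond', 'club']]

# Outcome space: all ordered pairs of distinct cards.
pairs = list(permutations(deck, 2))

two_aces = sum(a[0] == 1 and b[0] == 1 for a, b in pairs)
ace_then_spade = sum(a[0] == 1 and b[1] == 'spade' for a, b in pairs)

print(len(pairs))                            # 52 * 51 = 2652
print(Fraction(two_aces, len(pairs)))        # 12/2652 = 1/221
print(Fraction(ace_then_spade, len(pairs)))  # 51/2652 = 1/52
```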
### Pairs of Dice

Suppose that we roll two fair dice. Then the outcome space is all possible pairs of numbers between 1 and 6. Since there are six options for each roll, and the rolls don't influence each other, we have:

$$|\Omega| = 6 \times 6 = 36.$$
What is the chance that:

- the first roll is even and the second roll is less than 5?
- the two rolls add to 2 or 7?
**Even then Less than 5**

There are three choices for the first roll (2, 4, or 6), and 4 for the second (1, 2, 3, or 4). So the number of outcomes consistent with the event is $3 \times 4 = 12$. Therefore, the probability is $12/36 = 1/3$.
There is a nice alternate way to think about this chance. Notice that:

$$\frac{12}{36} = \frac{3 \times 4}{6 \times 6} = \frac{3}{6} \times \frac{4}{6} = \text{Pr}(\text{first roll is even}) \times \text{Pr}(\text{second roll is less than 5}).$$

This example suggests another probability rule: $\text{Pr}(A \text{ and } B) = \text{Pr}(A) \times \text{Pr}(B)$. While intuitive, this rule is not strictly true, and will fail unless we are careful. Do not apply it blindly.
Take the two ace example above. The chance of drawing an ace on any individual draw is $4/52 = 1/13$. So, if we used the rule above we would have computed $\frac{1}{13} \times \frac{1}{13} = \frac{1}{169}$. Instead, we found $\frac{1}{221}$. So, the rule cannot always be true.
The key difference in these examples is how the events relate to one another. In the dice rolling example the outcome of the first roll has no influence on the second roll. In the card example, the outcome of the first draw influences the second draw since, if we draw an ace, there are fewer aces remaining in the deck.
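A brute-force check over the 36 pairs of rolls shows the product rule holding in the dice example, exactly because the rolls don't influence one another:

```python
from fractions import Fraction
from itertools import product

# Outcome space: all 36 equally likely pairs of rolls.
omega = list(product(range(1, 7), repeat=2))

even_first = [o for o in omega if o[0] % 2 == 0]
low_second = [o for o in omega if o[1] < 5]
both = [o for o in omega if o[0] % 2 == 0 and o[1] < 5]

p_even = Fraction(len(even_first), len(omega))
p_low = Fraction(len(low_second), len(omega))
p_both = Fraction(len(both), len(omega))

# The product rule holds here: 1/2 * 2/3 = 1/3.
print(p_both, p_even * p_low)  # 1/3 1/3
```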
**Add to 2 or 7**

Let's partition.

There is only one way the rolls could add to 2: roll a 1 and a 1.

There are 6 ways the rolls could add to 7: $(1,6),\ (2,5),\ (3,4),\ (4,3),\ (5,2),\ (6,1)$.

So there are $1 + 6 = 7$ ways the event could occur. Therefore:

$$\text{Pr}(\text{rolls add to 2 or 7}) = \frac{7}{36}.$$
Notice that, even though all pairs of rolls are equally likely, all sums of pairs are not. There are 6 ways the rolls can add to 7, but there is only 1 way they can add to 2. So, the probability that we see a sum equal to $s$ depends on $s$.
This is an important point. Even if all outcomes are equally likely, all events are not.
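Tallying the 36 equally likely pairs by their sum makes the point concrete:

```python
from collections import Counter
from itertools import product

# Count how many of the 36 equally likely pairs produce each sum.
ways = Counter(a + b for a, b in product(range(1, 7), repeat=2))

# Sums near 7 have many underlying outcomes; sums near 2 or 12 have few.
for s in range(2, 13):
    print(s, ways[s], f"{ways[s]}/36")
```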
Consider the biased die example introduced as a thought experiment. Even though every specific sequence of ten rolls is equally likely, there are very few sequences of 10 rolls where the total number of fours rolled is 9, while there are many sequences where the side labeled "4" appears only once or twice.
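We can count these sequences exactly: the number of 10-roll sequences containing exactly $k$ fours is $\binom{10}{k} 5^{10-k}$ (choose which tosses show a four, then fill the remaining tosses with any of the five other faces).

```python
from math import comb

def sequences_with_k_fours(k):
    """Number of 10-roll sequences of a six-sided die with exactly k fours."""
    return comb(10, k) * 5 ** (10 - k)

print(sequences_with_k_fours(9))  # 50 sequences contain nine fours
print(sequences_with_k_fours(2))  # 17578125 sequences contain two fours

# The counts over all k partition the full outcome space of 6^10 sequences.
print(sum(sequences_with_k_fours(k) for k in range(11)) == 6 ** 10)  # True
```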
**Microscopic Models**

Many probability models start by assuming that all outcomes are equally likely when the outcomes are spelled out in microscopic detail, then assign probabilities to events that are not described so exhaustively. For example, statistical mechanics, which explains the behavior of gases, temperature, pressure, and other related variables, begins by assuming that all arrangements of molecules with the same total energy are equally likely, then focuses on macroscopic variables, like the total number of molecules, or average energy of the molecules, in a region. These macroscopic variables obey non-uniform distributions. The number of microscopic arrangements corresponding to any macroscopic observation (the number of outcomes in an event) helps determine its chance and is intimately related to the entropy of the arrangement. If you go on to study the physical sciences you will see that entropy, in essence multiplicity, plays a fundamental role in our understanding of natural systems.
## Rules

Through example, we've seen some rules that will help us break probability calculations down into simpler pieces. Instead of always computing the probability of an event by counting the ways it can happen, and dividing by the total number of possible outcomes, we can break probabilities down using the following rules:

1. Probabilities behave like fractions of trials: $0 \leq \text{Pr}(A) \leq 1$, $\text{Pr}(\emptyset) = 0$, and $\text{Pr}(\Omega) = 1$.
2. Disjoint union: if $A$ and $B$ are disjoint, then $\text{Pr}(A \cup B) = \text{Pr}(A) + \text{Pr}(B)$.
3. Monotonicity: if every outcome in $A$ is also in $B$, then $\text{Pr}(A) \leq \text{Pr}(B)$.
4. Product rule (with care, only when events do not influence one another): $\text{Pr}(A \text{ and } B) = \text{Pr}(A) \times \text{Pr}(B)$.
These rules are helpful since they allow us to manipulate our questions. If you can’t answer a question as asked, try to express the event according to one of these logical operations applied to events that are easier to work with.
In the next section we will start from these rules, and show that any probability model can be defined consistently as long as it satisfies the disjoint union rule.