Not all outcomes are equally likely, and not all systems are symmetric or close to symmetric. While we can build outcome spaces without equally likely outcomes from outcome spaces with equally likely outcomes, it is rarely worthwhile to expand the natural description of outcomes into a more detailed one in pursuit of equal likelihood. In many cases, the virtue of a probability model is its ability to match observed frequencies, and to derive consequences of those frequencies, without requiring a detailed explanation of the mechanism that produced them.
What we want is a more general theory that allows any outcome space and any assignment of chance to events, such that the assignment could be equated to frequencies of outcomes in a string of repeated trials, or to proportions in a more detailed symmetric model. Thankfully, many models meet these requirements, as long as they satisfy three simple rules. These rules are the heart of probability.
## The Probability Axioms
A probability model is a procedure that returns an answer for any question of the kind, “What is the probability of the event $A$?” The procedure should provide an answer for any event in an outcome space $\Omega$. Since events are subsets of $\Omega$, the probability model must specify a rule that accepts sets as inputs and returns chances. Since chances are meant to model proportions, or frequencies, they are real numbers between zero and one. The function that accepts any subset of $\Omega$ and returns a real number representing its chance is the probability measure. A measure is any function that accepts all subsets of an outcome space and returns a real number.
We’ve already fixed notation for the probability measure, since we defined $P(A)$ to be the probability of the event $A$. Therefore, $P$ is the probability measure. The probability model is the combination of the outcome space $\Omega$, the collection of all its subsets that we might ask about as events, and the probability measure $P$ that returns the chance of any event we ask about.
A probability model is valid if the measure satisfies three rules. These are Kolmogorov’s probability axioms:

1. **Nonnegativity:** $P(A) \geq 0$ for every event $A$.
2. **Normalization:** $P(\Omega) = 1$.
3. **Additivity:** if $A$ and $B$ are disjoint events, then $P(A \cup B) = P(A) + P(B)$.

The first two are not surprising, since chances are meant to model proportions or frequencies. The last is the most substantive. We observed that it was true for proportions (see Section 1.2).
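As a minimal sketch, here is a valid probability measure on a tiny outcome space, checked in Python against all three axioms. The code and its names (`P`, `events`) are ours for illustration; only the axioms themselves come from the text:

```python
from itertools import combinations

# A tiny outcome space: one flip of a fair coin.
omega = frozenset({"H", "T"})

# The measure P, written out explicitly as a table from events to chances.
P = {
    frozenset(): 0.0,
    frozenset({"H"}): 0.5,
    frozenset({"T"}): 0.5,
    omega: 1.0,
}

# Every subset of omega is an event we might ask about.
events = [frozenset(c) for r in range(len(omega) + 1)
          for c in combinations(omega, r)]

# Axiom 1 (nonnegativity): P(A) >= 0 for every event A.
assert all(P[A] >= 0 for A in events)
# Axiom 2 (normalization): P(omega) = 1.
assert P[omega] == 1
# Axiom 3 (additivity): P(A ∪ B) = P(A) + P(B) when A and B are disjoint.
assert all(abs(P[A | B] - (P[A] + P[B])) < 1e-9
           for A in events for B in events if not (A & B))

print("P satisfies all three axioms")
```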
## Categorical Distributions
While the axioms provide some basic rules the measure must satisfy, they give no guidance on the actual choice of measure. The axioms alone cannot determine the chance of an event. To determine a chance we need a model.
We saw this same conceptual division in the previous chapter.

- Assuming equally likely outcomes fixed a model, since it assigned chances to individual outcomes. It did not give any direction on how those chances combine, or how to compute the probability of an event containing multiple outcomes.
- Asserting a relationship between probability and long run frequency did not provide a rule for computing chances. It did give explicit direction on how chances combine, namely, chances for disjoint events add.
We put these rules together to build up probability as proportion. By assigning a chance to each outcome, then by adding chances together to find the probability of events, we saw that, if outcomes are equally likely, then the probability of an event is the fraction of all possible outcomes that satisfy the event definition. We can repeat the same process by applying the axioms to any rule that assigns a well-defined chance to every outcome.
A categorical distribution is a rule that assigns a chance $P(\omega)$ to every distinct outcome $\omega$ in $\Omega$, when $\Omega$ is finite. Then, by the additivity axiom:

$$P(A) = \sum_{\omega \in A} P(\omega)$$

You should read the equation above as: the probability of any event ($P(A)$) is the sum, over all the ways the event can happen ($\omega \in A$), of the probability of each way the event can happen ($P(\omega)$). In other words, the chance of an event is the total chance of all the outcomes contained in the event.
Categorical distributions are the most directly defined probability models; simply assign a chance to every outcome. The equally likely model we studied before is a uniform categorical distribution. It is uniform since it assigns the same chance to every outcome.
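As a minimal sketch, a categorical distribution can be represented in Python as a dictionary from outcomes to chances, with event probabilities computed by the additivity sum above (the function name `prob` is ours):

```python
# A categorical distribution: assign a chance to every outcome.
# This one is uniform, so it is the equally likely model for a fair die.
die = {1: 1/6, 2: 1/6, 3: 1/6, 4: 1/6, 5: 1/6, 6: 1/6}

def prob(event, dist):
    """P(A) = sum of dist[outcome] over the outcomes in the event A."""
    return sum(dist[outcome] for outcome in event)

print(prob({2, 4, 6}, die))  # chance the roll is even: 0.5
print(prob({1, 2, 3}, die))  # chance the roll is less than 4: 0.5
```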
Categorical distributions get their name from their application. They are often used to categorize and classify data.
Example
Suppose that you were designing a self-driving car. You might want a computer vision system that could distinguish pedestrians, scooters, cyclists, vehicles, and so on. Usually these systems rely on a machine learning model that returns a categorical distribution. It receives a string of images (a video), segments the images into objects, and then, for each object, assigns a chance to the categories pedestrian, scooter, cyclist, vehicle, etc. For a given object it might return a categorical distribution:
| Event | Ped. 👟 | Scoot. 🛵 | Cycl. 🚲 | Veh. 🚗 |
|---|---|---|---|---|
| Probability | 0.1 | 0.4 | 0.5 | 0 |
We often represent categorical distributions with bar charts whose heights equal the chance of each outcome. This is sometimes called a probability histogram.
Then the chance that the object is on two wheels, a scooter or a cyclist, is:

$$P(\{\text{scooter}, \text{cyclist}\}) = 0.4 + 0.5 = 0.9$$

and the chance that it is a pedestrian or a vehicle is:

$$P(\{\text{pedestrian}, \text{vehicle}\}) = 0.1 + 0 = 0.1$$
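A short sketch of the same computations in code, including the probability histogram mentioned above. The numbers follow the table; the dictionary name `object_dist` and the bar-chart call are our own illustration, not part of any particular vision system:

```python
import matplotlib.pyplot as plt

# The classifier's categorical distribution, from the table above.
object_dist = {"pedestrian": 0.1, "scooter": 0.4, "cyclist": 0.5, "vehicle": 0.0}

def prob(event, dist):
    """P(A) = total chance of the outcomes in the event A."""
    return sum(dist[outcome] for outcome in event)

print(prob({"scooter", "cyclist"}, object_dist))     # on two wheels: 0.9
print(prob({"pedestrian", "vehicle"}, object_dist))  # ped. or vehicle: 0.1

# A probability histogram: bar heights equal the chance of each outcome.
plt.bar(list(object_dist.keys()), list(object_dist.values()))
plt.ylabel("Probability")
plt.show()
```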
## Using Axioms
### The Complement Rule
An axiom is a mathematical statement that is asserted as a basic premise for a theory. Many rich mathematical areas are built by first posing a short list of self-evident (or, at least, plausible) axioms. Once accepted, the axioms are used to prove the rest of the theory. Let’s practice this way of thinking using the probability axioms.
In Section 1.2 we observed that, when we equated probability to proportion, the following rule held:

$$P(A^c) = 1 - P(A)$$
That is, the probability that an event does not occur equals one minus the probability that it does occur.
This rule might appear self-evident. For instance, it may feel instinctive that, if the probability that it rains tomorrow is 1/3, then the probability that it doesn’t rain should be 2/3. Let’s show that we don’t need this rule as an additional fourth axiom. Instead, it is implied by the first three.
Proof:
The sets $A$ and $A^c$ are disjoint for any event $A$. No outcome can simultaneously be in $A$ and not be in $A$.

The sets $A$ and $A^c$ partition $\Omega$, since $A \cup A^c = \Omega$.

If you’re not convinced, draw a box representing $\Omega$, then a circle inside it labeled $A$. First, shade the region inside the circle. This is all outcomes in $A$. Then, shade the region outside the circle. This is all outcomes not in $A$, that is, $A^c$. When you’ve finished shading both regions, you will have shaded the whole box.

So, by additivity, $P(A \cup A^c) = P(A) + P(A^c)$.

Then, by normalization, $P(A) + P(A^c) = P(A \cup A^c) = P(\Omega) = 1$.

Rearranging:

$$P(A^c) = 1 - P(A)$$
Let’s add this rule to our list:

**Complement rule:** $P(A^c) = 1 - P(A)$ for any event $A$.
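As a quick numeric sanity check of the complement rule, here is a small simulation of the rain example above. The simulation approach and names like `p_rain` are ours, not the text’s:

```python
import random

# P(rain tomorrow) = 1/3, as in the example above.
p_rain = 1 / 3
p_no_rain = 1 - p_rain  # complement rule: P(A^c) = 1 - P(A)

# Estimate P(no rain) as a long run frequency over many simulated days.
trials = 100_000
no_rain_count = sum(random.random() >= p_rain for _ in range(trials))
print(p_no_rain, no_rain_count / trials)  # both close to 0.667
```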
### Bounding the Chance of a Union
Our union rule looks different from the complement rule, since it does not hold for all pairs of events. What happens if $A$ and $B$ are not disjoint?
An example suffices. What is the chance that a single roll of a fair die is even or is less than 4? The event that the roll is even contains the outcomes $\{2, 4, 6\}$. The event that it is less than 4 contains the outcomes $\{1, 2, 3\}$. So, the event that the roll is even or less than 4 contains the outcomes $\{1, 2, 3, 4, 6\}$. There are five of these outcomes, so the desired probability is $\frac{5}{6}$.
What would have happened if we tried to apply our rule?

$$P(\text{even}) + P(\text{less than } 4) = \frac{3}{6} + \frac{3}{6} = 1$$
In this case the sum of the individual probabilities is too large. It equals one, which is clearly wrong, since it is possible to roll a value that is neither even nor less than 4. The only such value is a 5, so

$$P(\text{even or less than } 4) = 1 - P(\{5\}) = 1 - \frac{1}{6} = \frac{5}{6}$$
The sum of the individual probabilities is too large since it counts an outcome twice. The outcome where the die lands on 2 appears in both events, since 2 is both even and less than 4. In fact, $\{2, 4, 6\} \cap \{1, 2, 3\} = \{2\}$. The sum of the probability that the die roll is even and the probability that the roll is less than 4 counts the outcomes that are even and less than 4 twice, once for each set. Therefore, it overcounts the intersection of the events:

$$P(\text{even}) + P(\text{less than } 4) = P(\text{even or less than } 4) + P(\text{even and less than } 4) = \frac{5}{6} + \frac{1}{6} = 1$$
This is the general rule for unions:

$$P(A \cup B) = P(A) + P(B) - P(A \cap B)$$

Since all probabilities are nonnegative:

$$P(A \cup B) \leq P(A) + P(B)$$

with equality if and only if the intersection has probability zero, $P(A \cap B) = 0$. In other words, equality holds if and only if the events are mutually exclusive.
Let’s add this last rule to our list:

**Union rule:** $P(A \cup B) = P(A) + P(B) - P(A \cap B)$ for any events $A$ and $B$.
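The die example, checked in code against both the general union rule and the bound. This is a sketch with our own `prob` helper; exact fractions avoid rounding error:

```python
from fractions import Fraction

# A fair die as a categorical distribution, with exact arithmetic.
die = {outcome: Fraction(1, 6) for outcome in range(1, 7)}

def prob(event):
    """P(A) = total chance of the outcomes in A."""
    return sum(die[outcome] for outcome in event)

even = {2, 4, 6}
less_than_4 = {1, 2, 3}

# General union rule: P(A ∪ B) = P(A) + P(B) - P(A ∩ B).
lhs = prob(even | less_than_4)
rhs = prob(even) + prob(less_than_4) - prob(even & less_than_4)
print(lhs, rhs)  # 5/6 5/6

# Union bound: P(A ∪ B) <= P(A) + P(B).
assert lhs <= prob(even) + prob(less_than_4)
```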
You’ll practice with this rule in discussion section and on your HW.
## Summary
To summarize our rules, let’s append a column to the logic and sets table from Section 1.1: