
2.4 Probability Densities

In Section 2.3 we showed that, for any continuous random variable X, the chance that X equals some exact value, x, is zero, no matter which x we choose. As a result, the PMF is zero everywhere:

\text{PMF}(x) = \text{Pr}(X = x) = 0.

This tells us nothing about the random variable except that it is continuous. It does not tell us how to compute chances.

In this section we will introduce our last type of distribution function, the probability density function. Density functions are the continuous analog to mass functions for discrete random variables. If you are asked to picture a distribution in your head, and you picture a bell-curve, or a bell-curve shaped histogram whose bars could be made very narrow, then you are picturing a density function.

This section will introduce density functions by experimentation with histograms for continuous random variables. Once we see that density is a natural idea for continuous random variables, we’ll show how to relate densities to chances, then introduce some examples.

Probability Density

By Experiment

Let’s start with an experiment. We’ve already seen that we can:

  1. Define uniform continuous variables by equating chances to proportions of length or area

  2. Relate chances of events to long run frequencies

  3. Define arbitrary continuous random variables by fixing a cumulative distribution function (CDF).

Let’s set up an experiment that puts these ideas to work. We’ll start with a uniform continuous random variable, transform it to get something non-uniform, divide its support into small pieces, then draw many copies of the variable and see how frequently it lands in each piece. Since probabilities equal long run frequencies, the associated histogram will represent a distribution function. Formally, it’s a categorical distribution whose categories are the histogram bins. The key continuous idea is that we have the freedom to change the width of the bins.

To get a detailed picture of the distribution, we’ll try to take a limit where the bins get arbitrarily small. Per Section 2.3, we’ll see that making the bins narrower lowers the frequency with which outcomes land in each bin, so to keep the histogram bars about the same height as we narrow the bins, we’ll have to scale by the width of the bins. Scaling by width produces a density. Once we’ve seen a density, we’ll show that, unlike the PMF, the density function can be used to recover a CDF. Then, since a CDF specifies a measure, the density function is enough to define a probability model.

Imagine you are throwing darts at a dartboard. We’ll imagine that you’re not very good at darts, so the position of each dart is uniform over the board. For a circular board this means we are picking a location uniformly from the interior of a circle.

This is a uniform continuous model, so we can measure chances. The chance a dart lands in some region A on the board is:

\text{Pr}(\text{dart lands in } A) = \frac{\text{area of } A}{\text{area of dart board}} = \frac{|A|}{|\Omega|}

where Ω is the collection of all possible positions the dart could land (the circular board).

You want your darts to land near the center of the board. So, you decide to measure the distance the dart lands from the center. The dart’s position is random, and so is its distance from the middle of the board. We’ll call that random variable R, for radius.

R is a continuous random variable. For simplicity, let’s assume the dart board has radius 1. Then R is supported on the interval [0,1], since we, at best, hit the center of the board and, at worst, hit the outer edge.

Is R uniformly distributed? Take a moment to think about this carefully before answering.

If you’re unsure how to proceed, remember that it’s often easier to start with a CDF.

What is:

\text{CDF}(r) = \text{Pr}(R \leq r)?

Well, the region on the board where R ≤ r is the interior of a circle centered at zero, with radius 0 ≤ r ≤ 1. We don’t have to worry about the distinction between ≤ r and < r since the position of the dart is continuous. We can now visualize the event R ≤ r as a filled circle with radius r inside a larger circle of radius 1. The chance we land in the inner circle is the ratio of its area to the area of the full circle.

So, using probability by proportion:

\text{CDF}(r) = \text{Pr}(R \leq r) = \frac{\pi r^2}{\pi 1^2} = r^2.

To check whether this CDF could correspond to a uniform measure, let’s compare the chances that R lands in two intervals of equal length. For instance, what are the odds that R > 1/2 compared to the odds that R ≤ 1/2?

\begin{aligned}\frac{\text{Pr}(R > 0.5)}{\text{Pr}(R \leq 0.5)} & = \frac{1 - \text{Pr}(R \leq 0.5)}{\text{Pr}(R \leq 0.5)} = \frac{1 - \text{CDF}(0.5)}{\text{CDF}(0.5)} \\& = \frac{1 - 0.5^2}{0.5^2} = 4 \times \frac{3}{4} = 3 \end{aligned}

So, the dart is 3 times more likely to land in the outer interval, (1/2, 1], than the inner interval, [0, 1/2]. The dart’s distance from the origin cannot be uniformly distributed. It is more likely to land farther from the center of the board than closer to the center.
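If you’d like to check this 3-to-1 ratio numerically before opening the demo, here is a minimal sketch using only numpy (this code is ours, not part of the course utilities):

import numpy as np

rng = np.random.default_rng(0)

# Sample points uniformly from the unit disk by rejection:
# draw from the enclosing square, keep only points inside the circle.
n = 200_000
xy = rng.uniform(-1, 1, size=(2 * n, 2))
xy = xy[(xy ** 2).sum(axis=1) <= 1][:n]

r = np.sqrt((xy ** 2).sum(axis=1))            # distance of each dart from the center

print(np.mean(r <= 0.5))                      # close to CDF(0.5) = 0.25
print(np.mean(r > 0.5))                       # close to 1 - CDF(0.5) = 0.75
print(np.mean(r > 0.5) / np.mean(r <= 0.5))   # ratio close to 3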

To visualize this bias, run the following experiment:

  1. Throw a bunch of darts (sample uniformly from the circle),

  2. Compute the distance of each dart from the center (sample R repeatedly), then

  3. Divide the interval [0,1] into many small pieces, count the frequency with which R lands in each segment, and

  4. Plot the associated histogram.

We’ll let Δr denote the width of the histogram bins. Remember, we get to choose these. They are an artifact of our visualization scheme. To get a precise picture, we will want to both take many samples to eliminate randomness in the bar heights and make the bins very narrow.

You can run this experiment yourself using the demo below:

from utils_dartboard import run_dartboard_explorer

run_dartboard_explorer(R=1.0)

If you draw enough samples, and keep the bins small enough for a detailed plot, but big enough that they each contain a lot of samples, then you should get a pretty clear picture. The histogram should look something like this:

Histogram of Distance to Center of Dartboard.

It’s basically a linear wedge. A linear trend is not surprising, since running sums act like integrals, the integral of a linear function is quadratic, and we saw that the CDF is a quadratic function of r.

So, it’s natural to think that the distribution of R should be described by some linear function of r. To recover that function, we need to make sure our histogram plot is not sensitive to arbitrary choices we made when we set up the experiment.

There were two free parameters in the set up:

  1. The number of samples, and

  2. The bin widths, Δr.

To see why these are problematic, open the demo again, and set the vertical heights of the bars equal to the raw number of samples that land in each interval. This is the simplest histogram convention. The height of each bar is the number of times the corresponding event occurred.

from utils_dartboard import run_dartboard_explorer

run_dartboard_explorer(R=1.0)

Now try varying the number of samples. If you increase the number of samples to make the plot less noisy, you should see that all the bars get taller. The histogram is still roughly linear in rr but its slope changed.

The distribution of R was fixed by the sampling process, so it cannot depend on the total number of samples. Accordingly, we can’t equate probability to a raw count of occurrences. That’s not surprising. Probabilities should match long run frequencies.

So, to make sure that our plot is invariant to (does not depend on) the total number of samples, we should set the height of the bars to the frequency with which each event occurred. This amounts to plotting chances rather than counts.

Set the y-axis back to frequency and try varying the sample size. You should see that, as long as you keep the bin widths fixed, the histogram now converges to a fixed function in the limit of many samples. This function is a categorical distribution on the bins. It is analogous to a PMF when we round R to some fixed, finite precision.

So far so good. However, our plot could still depend on the bin widths.

Keep plotting frequency, and this time try varying the bin width.

from utils_dartboard import run_dartboard_explorer

run_dartboard_explorer(R=1.0)

You should see that, if you make the bins wider, not too much changes except the bars get taller, but, as you make the bins smaller, things start to fall apart. The narrower the bins, the shorter the bars, and the noisier the pattern. The second effect, the amount of noise, can be fixed by increasing the sample size.

So, take a very large sample size, and see how narrow you can make the bins. Can you make the bins narrow enough that the bar heights stop changing?

from utils_dartboard import run_dartboard_explorer

run_dartboard_explorer(R=1.0)

No!

Every time you shrink the bins, the bars get shorter. The function we’re looking for is still linear, but its slope depends on the arbitrarily chosen bin width. Worse, its slope approaches zero as we make the bins narrower.

That last observation is the main result of Section 2.3. If a random variable is continuous, then the chance it lands in a shrinking series of intervals decreases as the intervals get narrower, and approaches zero as the intervals converge to a point. The distance from the center of the board, R, is a continuous random variable, so the chance R lands in any very narrow bin must be very small. What we’re seeing is experimental evidence that, for a continuous random variable, the chance of every exact event is zero, so its PMF is zero everywhere.

Ok, what next?

Well, since the heights of the bars decrease as we make the bins narrower, we could try an old trick from calculus. Let’s scale the heights of the bars by dividing each frequency by the width of the associated bin. Then we’re plotting frequency per bin width. Hopefully, dividing by the shrinking bin width will cancel out the decrease in frequency.

Try it! Set the y-axis convention to frequency per width (density).

from utils_dartboard import run_dartboard_explorer

run_dartboard_explorer(R=1.0)

Now you should see that, as long as you keep increasing the sample size, the histogram converges to the linear function f(r) = 2r. This function stays stable no matter how you vary the bin widths or the sample sizes, as long as we have enough samples to average away noise in the frequencies!

To check, lock the bin width, Δr, to a function that decreases with the sample size, n, so that the bin widths approach zero as the sample size diverges, but slowly enough that the number of samples in each bin still increases as n increases.

Click the “lock” checkbox, then gradually increase the sample size.

from utils_dartboard import run_dartboard_explorer

run_dartboard_explorer(R=1.0)
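If you want to reproduce this convergence outside the demo, here is a minimal standalone sketch using only numpy (illustrative code, not part of the course utilities):

import numpy as np

rng = np.random.default_rng(1)

# Throw many darts: sample uniformly from the unit disk, then compute each distance.
n = 500_000
xy = rng.uniform(-1, 1, size=(2 * n, 2))
xy = xy[(xy ** 2).sum(axis=1) <= 1][:n]
r = np.sqrt((xy ** 2).sum(axis=1))

# A density histogram: the frequency in each bin divided by the bin width.
bins = np.linspace(0, 1, 51)     # 50 bins, each of width 0.02
heights, edges = np.histogram(r, bins=bins, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])

# The bar heights should track f(r) = 2r, whichever bin width you choose.
print(np.round(heights[-5:], 2))
print(np.round(2 * centers[-5:], 2))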

What we’ve just recovered is a probability density function. We call it a density by analogy to mass per volume.

In physics, the density of a region is the ratio of its mass to its volume. We can define the density of a point by using arbitrarily small regions. For instance, the density at a point x is the total mass within Δx of x, divided by the volume of the region of points within Δx of x, in the limit as Δx goes to zero. We call frequency per length a density since it has units of probability mass per length. This definition extends naturally to higher dimensions. We can define densities as probabilities per unit area or probabilities per unit volume.

Working with densities will help us resolve one of the paradoxes from the end of Section 2.3. There we saw that, when we sample a continuous random variable, we always get a specific outcome, despite the fact that every specific outcome has zero chance. Somehow, all of the possible outcomes had zero chance, but collections of outcomes had nonzero chances.

Density and mass behave the same way. The total mass in some region vanishes as we make the region arbitrarily small. So, the total mass at any point is zero. Yet, regions are composed of points, and can have nonzero mass.

This paradox is resolved by integration. The total mass of an object is not the sum of the masses of every infinitesimally small piece of the object. It is the integral of the density of every point in the object.

The same thing will be true for probability densities. Let’s check it for our example.

We proved that:

\text{CDF}(r) = \text{Pr}(R \leq r) = r^2

using probability by proportion.

Then, by experiment, we guessed a density function:

f(r) = 2r.

The region R ≤ r is the interval [0, r] since R is a radius. Therefore, we should try integrating the density from 0 to r:

\int_{s = 0}^r f(s)\, ds = \int_{s = 0}^r 2 s\, ds = s^2 \Big|_{s = 0}^{r} = r^2 - 0 = r^2.

Therefore, in this example, probability mass (i.e., chances) is related to density by integration:

\text{CDF}(r) = \int_{s = 0}^r f(s)\, ds.

We’ll see that this is always true for densities, and that we can use this relation to define continuous random variables starting directly from density functions. This approach is more intuitive than starting from cumulative probabilities, since density functions look like histograms. CDFs don’t.
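As a quick sanity check on the dartboard example, here is a minimal symbolic verification, assuming sympy is available (our sketch, not part of the course utilities):

import sympy as sp

r, s = sp.symbols("r s", nonnegative=True)

density = 2 * s                          # the density we guessed by experiment
cdf = sp.integrate(density, (s, 0, r))   # integrate the density from 0 to r

print(cdf)                               # r**2, matching Pr(R <= r) from probability by proportion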

Definition and Relation to Chance

Let’s formalize our construction.

It is very common to denote:

  1. The PDF of a continuous random variable, X, evaluated at x: f_X(x). Here we use a subscript X to remind us which random variable, and a little f for function.

  2. The CDF of a continuous random variable, X, evaluated at x: F_X(x). Here we use a subscript X to remind us which random variable, and a capital F for function.

We use a lower case for the density and an upper case for the CDF since this notation matches the standard convention in calculus. We’ll adopt this notation since it is so widely used, but frequently remind you that little f is the PDF, while big F is the CDF.

The PDF and PMF are easy to mix up, since the PDF looks like a rescaled limit of a PMF when we make our intervals very narrow. This idea is more than an annoying stumbling block. The relationship between the PDF and its approximation with a PMF explains how we should use PDFs to compute chances. For a narrow bin of width Δx centered at x,

\text{Pr}\left(X \in x \pm \tfrac{1}{2} \Delta x \right) \approx f_X(x)\, \Delta x.

We can use the approximation statement above to recover a formula for chances from densities. Just like we integrate mass densities to recover mass, we should integrate probability densities to recover probability mass:

\text{Pr}(X \in [a,b]) = \int_{x = a}^{b} f_X(x)\, dx.

If there is one formula to remember from this section, it’s the one above. To compute chances from densities, integrate.

This equation justifies the famous “area under the curve” picture you may have seen for bell-curves. An integral computes an area under a curve.

If we:

  1. Plot the PDF f_X

  2. Then find the area under the curve over an interval (integrate)

we will have computed the chance the associated random variable lands in the interval.

from utils_dist import run_distribution_explorer

run_distribution_explorer("Normal");

Therefore, a proposed density function f_X must satisfy:

f_X(x) \geq 0 \text{ for all } x, \qquad \int_{-\infty}^{\infty} f_X(x)\, dx = 1.

Working with Densities

Now that we know what a density function is, and how to use it to compute chances, let’s see how the density function is related to the CDF. Then, we’ll make a table summarizing how to go between chances, densities, and cumulative probabilities.

We already know how to compute the chance of an interval from a density:

\text{Pr}(X \in [a,b]) = \int_{x = a}^b f_X(x)\, dx.

The CDF is defined as the chance of a one-sided interval. So, we get the CDF from the PDF by integrating:

F_X(x) = \text{Pr}(X \leq x) = \int_{s = -\infty}^{x} f_X(s)\, ds.

This equation is the continuous analog to the idea that a cumulative distribution is a running sum of a PMF. For continuous random variables, the CDF is the running integral of the density. Notice the two steps in this analogy: replace a sum with an integral, then replace a mass function with a density function. In practice, this is the only real operational change you need to make to find probabilities for continuous random variables:

\text{CDF}(x) = \sum_{x' \leq x} \text{PMF}(x') \quad \longrightarrow \quad \text{CDF}(x) = \int_{-\infty}^{x} f_X(s)\, ds.

This equation also justifies the big F, little f notation for cumulative probabilities and densities. The CDF is the integral of the PDF, or, equivalently, the anti-derivative of the PDF. The notation f and F for a function and its anti-derivative is the standard notation in calculus. You may remember it from your unit on the fundamental theorem of calculus.

As always, a CDF can be used to find the chance X lands in any interval. So, if we are given a PDF, we can integrate it to get a CDF, then take differences in CDF values to find chances:

\text{Pr}(X \in [a,b]) = \text{CDF}(b) - \text{CDF}(a) = F_X(b) - F_X(a).

What if we started from a CDF?

Well, the CDF is the anti-derivative of the PDF, so the PDF is the derivative of the CDF! If the CDF is the integral (area under the curve) of the PDF, then the PDF must be the slope of the CDF. This is just the good old fundamental theorem of calculus.
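As a small illustration with the dartboard example, differentiating the CDF recovers the density. A minimal sketch, assuming sympy:

import sympy as sp

r = sp.symbols("r", nonnegative=True)

cdf = r ** 2              # the CDF of the dart's distance from the center
pdf = sp.diff(cdf, r)     # the PDF is the derivative of the CDF

print(pdf)                # 2*r, the density we recovered by experiment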

We can now move freely between density function, CDF, and measure. Given a rule for computing any of the three, we can solve for the other two.

Here’s a table summarizing the procedures. Memorize it. It’s the most important part of this section. See if you can fill in the entries of the blank table below before opening the expandable section beneath the blank table. Fill in each “?” with a formula that recovers the object in the row header from the object in the column header. We’ve given an example in the first row.

| Object | PDF | CDF | Measure |
| --- | --- | --- | --- |
| PDF: f_X(x) | . | ? | \lim_{\Delta x \rightarrow 0} \frac{1}{\Delta x} \text{Pr}\left(X \in x \pm \frac{1}{2} \Delta x \right) |
| CDF: F_X(x) | ? | . | ? |
| Measure: \text{Pr}(X \in [a,b]) | ? | ? | . |

Modeling by Shape

For the remainder of the class, we will largely pose continuous models by explicitly defining a density function.

Example Density Functions

Here are some examples:

Uniform

We denote the statement: X is drawn uniformly on [a, b]:

X \sim \text{Uni}(a,b)

This distribution has two parameters: a, the lower bound of the interval of possible X, and b, the upper bound.
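For reference, probability by proportion gives the corresponding density (a standard fact, stated here since the demo only plots it):

f_X(x) = \begin{cases} \frac{1}{b - a} & \text{ if } a \leq x \leq b \\ 0 & \text{ otherwise}\end{cases}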

Select “uniform” from the dropdown below, and experiment with the density function.

from utils_dist import run_distribution_explorer

run_distribution_explorer("Uniform");

You should see a box whose height is inversely proportional to its width.

Notice that the smaller you make the interval, the taller the density. This is a natural consequence of normalization. The area of a box is its height times its width. So, to keep the area equal to one, it must get taller as it gets narrower.

Try picking a and b so the interval is narrower than 1. What do you notice about the height of the density function?

You should see that the value of the density function is, for all x inside the support, greater than 1. This can seem odd at first. Chances are always between 0 and 1. Remember, however, that f_X(x) is a density, not a chance. It is perfectly possible for densities to exceed one, so long as the total mass (area under the density curve) remains equal to one. If X is a continuous random variable, and is highly likely to fall in a small interval, then the density inside the interval must be large.

Let’s try to compute the chance of an event using the uniform density.
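Here is one way to do that computation in code, for a made-up example where X ∼ Uni(2, 5), assuming scipy is available (the numbers are ours, chosen only for illustration):

from scipy.stats import uniform

# X ~ Uni(2, 5): the density is 1 / (5 - 2) = 1/3 on [2, 5], and zero elsewhere.
# scipy parameterizes the uniform by loc = a and scale = b - a.
a, b = 2.0, 5.0
X = uniform(loc=a, scale=b - a)

# Pr(X in [3, 4]): the area of a box of width 1 and height 1/3.
print(X.cdf(4.0) - X.cdf(3.0))   # 0.333...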

Exponential

We denote the statement X is drawn from an exponential distribution with parameter λ:

X \sim \text{Exp}(\lambda)

with density proportional to an exponential function:

f_X(x) \propto e^{-\lambda x}, \quad x \geq 0

The symbol \propto in the exponential definition means “proportional to”. When we say a density function is proportional to some other function, g(x), what we mean is:

f_X(x) = c\, g(x)

for some positive constant c. In our case,

f_X(x) = c(\lambda) e^{-\lambda x}

Notice that the constant does not depend on x, but does depend on the choice of the free parameter. The constant factor c(λ) is called the normalizing factor. The exponential part is the functional form.

The functional form controls the shape of the distribution as a function of the possible values of the random variable, x. The normalizing constant does not depend on the random variable, but does depend on the choice of parameter. The normalizing constant is implied by the functional form and the parameter, since it is introduced to ensure that f_X(x) integrates to 1.

In our example, if we enforce normalization, then we can solve for c(λ):

\begin{aligned} \int_{x = 0}^{\infty} f_X(x)\, dx & = \int_{0}^{\infty} c(\lambda) e^{-\lambda x}\, dx \\ & = c(\lambda) \int_{0}^{\infty} e^{-\lambda x}\, dx = c(\lambda) \left[ \frac{-1}{\lambda} e^{-\lambda x} \right]_{0}^{\infty} \\ & = \frac{c(\lambda)}{\lambda} (e^{-0} - e^{-\infty}) = \frac{c(\lambda)}{\lambda} \end{aligned}

So, to enforce normalization, set the integral equal to one. This sets c(λ) = λ.

Therefore, the exponential density always has the form:

f_X(x) = \begin{cases} \lambda e^{-\lambda x} & \text{ if } x \geq 0 \\ 0 & \text{ otherwise}\end{cases}

We could have defined the distribution this way from the start, but separating the normalizing constant from the functional form is an important skill in probability, and is helpful for directing your attention when you look at a density function. Usually, there is a normalizing constant out front, which is often a messy function of the parameters. Then, there is usually a relatively simple function of xx and the parameters that determines the shape of the distribution. It is the shape that controls the properties of the distribution, implies the normalizing constant, and directs when to use a given model. So, the shape is much more important. Always read the functional form first.

In general, if f_X(x) = c\, g(x) for some constant c that depends on the parameters, then we can solve for c by enforcing normalization. This gives:

c = \frac{1}{\int g(x)\, dx},

where the integral runs over the support of X.
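Here is a minimal symbolic sketch of this recipe, applied to the exponential functional form (assuming sympy; not part of the course utilities):

import sympy as sp

x, lam, c = sp.symbols("x lambda c", positive=True)

# Integrate the unnormalized density c * exp(-lambda * x) over its support.
total = sp.integrate(c * sp.exp(-lam * x), (x, 0, sp.oo))   # c / lambda

# Enforce normalization: the total probability mass must equal 1.
print(sp.solve(sp.Eq(total, 1), c))                         # [lambda]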

Select “exponential” from the dropdown below and experiment with the exponential density.

from utils_dist import run_distribution_explorer

run_distribution_explorer("Exponential");

How does the density function vary as you vary λ?

You should notice that the exponential density looks a bit like the geometric PMF from Section 2.2. It is nonnegative, unbounded above, decays as the input increases, and is maximized at zero. This is more than a chance alignment. The geometric and the exponential are analogs. We often use an exponential distribution to model continuously distributed waiting times, and the geometric to model discretely distributed waiting times.

We’ll talk more about the properties of the exponential and geometric, and when these are or are not reasonable models. For now, it is enough to appreciate their shape. They are a good choice for a random variable that is nonnegative, unbounded, most likely to be small, and whose histogram shows a single peak at zero followed by a smooth decay as we move to the right. We’ll develop more precise language to describe these shapes in the coming chapters.

Pareto

Here’s another basic family of continuous densities that look, to the eye, a lot like the exponential densities, but have starkly different properties. As usual, we’ll explore those differences in the future. Here, we’ll define the densities, study their shape and how it varies with the parameters, and compute the normalizing constant.

We denote the statement, X is Pareto distributed with parameters x_m, α:

X \sim \text{Pareto}(x_m, \alpha)

As for the exponential, it is easier to recognize a Pareto density by its functional form. Pareto densities are simply negative powers of x, cut off to force X ≥ x_m. They are an example of heavy-tailed distributions. The Pareto distribution was originally introduced to model distributions of wealth. Its discrete analogs are widely used to model word frequencies, or link counts in social networks. We have introduced them here because they are the next simplest densities by functional form; they look similar to the exponential, but behave quite differently.
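For reference, in the standard parameterization (stated here since the text above only describes the general shape), the Pareto density is:

f_X(x) = \begin{cases} \frac{\alpha x_m^{\alpha}}{x^{\alpha + 1}} & \text{ if } x \geq x_m \\ 0 & \text{ otherwise}\end{cases}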

Select “Pareto” from the dropdown below and experiment with the density functions. Try changing the α parameter. How does the density respond?

from utils_dist import run_distribution_explorer

run_distribution_explorer("Pareto");

You should see densities that look a lot like the exponential. As you vary α, the shape of the distribution changes. The parameter α controls the rate at which the density decays for increasing inputs, and is called the shape parameter.

You will practice working with Exponential and Pareto densities on your Homework this week.