
## 8.3 Joint Distributions

A random vector is an ordered list of finitely many random variables, $X = [X_1, X_2, \ldots, X_n]$. Random vectors may be discretely or continuously distributed. If discretely distributed, the random vector is selected from a finite or countably infinite set of possible vectors. If continuously distributed, the vector is selected from an uncountably infinite set of possible vectors, so that $\text{Pr}(X = x) = 0$ for any specific $x$.

Like random variables, random vectors may be characterized using distribution functions. In the discrete case, we use an analog to the probability mass function. In the continuous case we use an analog to the probability density function.

### The Discrete Case

In the discrete case, we can define a joint probability mass function. Given a possible vector $x = [x_1, x_2, \ldots, x_d]$, the joint probability mass function returns the chance:

$$
\text{Pr}(X = x) = \text{Pr}(X_1 = x_1, X_2 = x_2, \ldots, X_d = x_d).
$$

In the bivariate setting, we can represent the joint probability mass function with a joint distribution table (see Section 1.4). The columns correspond to possible values for the first entry, and the rows correspond to possible values for the second entry. For instance, if $X_1 \in \{1, 2, 3\}$ and $X_2 \in \{2, 4, 6\}$, then we could represent the joint distribution with the table:

| Event | $X_1 = 1$ | $X_1 = 2$ | $X_1 = 3$ |
| --- | --- | --- | --- |
| $X_2 = 2$ | 2/10 | 3/10 | 1/10 |
| $X_2 = 4$ | 1/10 | 0 | 1/10 |
| $X_2 = 6$ | 0 | 0 | 2/10 |

As usual, we can expand the table by appending the marginal probabilities. These are probabilities like $\text{Pr}(X_1 = 1)$ (sum down the first column) or $\text{Pr}(X_2 = 4)$ (sum across the second row).

| Event | $X_1 = 1$ | $X_1 = 2$ | $X_1 = 3$ | Marginals |
| --- | --- | --- | --- | --- |
| $X_2 = 2$ | 2/10 | 3/10 | 1/10 | $\text{Pr}(X_2 = 2) = 6/10$ |
| $X_2 = 4$ | 1/10 | 0 | 1/10 | $\text{Pr}(X_2 = 4) = 2/10$ |
| $X_2 = 6$ | 0 | 0 | 2/10 | $\text{Pr}(X_2 = 6) = 2/10$ |
| Marginals | $\text{Pr}(X_1 = 1) = 3/10$ | $\text{Pr}(X_1 = 2) = 3/10$ | $\text{Pr}(X_1 = 3) = 4/10$ | 1 |

The bottom row provides the marginal mass function for $X_1$:

$$
\text{Pr}(X_1 = x) = \sum_{\text{all } x_2} \text{Pr}(X_1 = x, X_2 = x_2).
$$

The rightmost column provides the marginal mass function for $X_2$:

$$
\text{Pr}(X_2 = x) = \sum_{\text{all } x_1} \text{Pr}(X_1 = x_1, X_2 = x).
$$
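The marginalization above is easy to check numerically. Below is a small sketch (using NumPy; the variable names are my own) that recovers both marginal mass functions from the joint table in this section by summing along each axis:

```python
import numpy as np

# Joint distribution table from this section:
# rows index X2 in {2, 4, 6}; columns index X1 in {1, 2, 3}.
joint = np.array([
    [2/10, 3/10, 1/10],   # X2 = 2
    [1/10, 0.0,  1/10],   # X2 = 4
    [0.0,  0.0,  2/10],   # X2 = 6
])

# Marginal mass function for X1: sum down each column (over all x2).
marginal_x1 = joint.sum(axis=0)

# Marginal mass function for X2: sum across each row (over all x1).
marginal_x2 = joint.sum(axis=1)

print(marginal_x1)  # [0.3 0.3 0.4]
print(marginal_x2)  # [0.6 0.2 0.2]
```

Note that both marginals, like the joint table itself, sum to one.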

To isolate conditional mass functions from the joint mass function, isolate either a row or column by fixing either $X_1$ or $X_2$, then normalize the isolated row/column. For example, if we fix $X_2 = 2$, then we should isolate the first row of the table:

| Event | $x = 1$ | $x = 2$ | $x = 3$ |
| --- | --- | --- | --- |
| $\text{Pr}(X_1 = x \mid X_2 = 2)$ | 2/6 | 3/6 | 1/6 |

If we fix $X_2 = 4$, then we should isolate the second row:

| Event | $x = 1$ | $x = 2$ | $x = 3$ |
| --- | --- | --- | --- |
| $\text{Pr}(X_1 = x \mid X_2 = 4)$ | 1/2 | 0 | 1/2 |

If we’d fixed $X_1 = 1$, then we would isolate the first column.

Notice that isolating a row corresponds to fixing one input variable while letting the other vary. Similarly, isolating a column corresponds to fixing one input while letting the other vary. This is the same procedure we used to define cross-sections: start with a scalar-valued function of multiple inputs (in this case, a row and column index), fix one input, and let the other vary.

Since isolating a row or column is equivalent to extracting a cross-section, each conditional mass function is proportional to a cross-section of the joint mass function. The particular cross-section (the row or column isolated) is selected by the conditioning statement. We’ll see that the same intuition extends to the continuous case.
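The isolate-and-normalize recipe is short enough to express directly in code. Here is a minimal sketch (NumPy assumed; `conditional_given_x2` is an illustrative helper name, not from the text):

```python
import numpy as np

# Joint distribution table from this section:
# rows index X2 in {2, 4, 6}; columns index X1 in {1, 2, 3}.
joint = np.array([
    [2/10, 3/10, 1/10],   # X2 = 2
    [1/10, 0.0,  1/10],   # X2 = 4
    [0.0,  0.0,  2/10],   # X2 = 6
])

def conditional_given_x2(joint, row):
    """Fix X2 by isolating a row, then normalize so the row sums to one."""
    cross_section = joint[row]
    return cross_section / cross_section.sum()

# Conditioning on X2 = 2 rescales the first row by its sum, 6/10,
# giving the conditional masses 2/6, 3/6, 1/6.
print(conditional_given_x2(joint, 0))

# Conditioning on X2 = 4 rescales the second row by its sum, 2/10,
# giving the conditional masses 1/2, 0, 1/2.
print(conditional_given_x2(joint, 1))
```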

### The Continuous Case

In the continuous setting, the probability of any exact event is zero (see Section 2.3). So:

$$
\text{Pr}(X = x) = \text{Pr}(X_1 = x_1, X_2 = x_2, \ldots, X_d = x_d) = 0.
$$

So, just as continuous random variables are characterized by density functions, continuous random vectors are characterized by joint density functions.

This definition extends the one-dimensional definition. In one dimension, a probability density is the chance that a random variable lands in a small interval, relative to the length of the interval. In two dimensions, it is the chance that a random vector lands in a small square, relative to the area of the square. In three dimensions, it is the chance that a random vector lands in a small cube, relative to the volume of the cube. In each case, we recover a density by computing the chance that the random vector lands in some small region, relative to the size of the region, in the limit as the region contracts to a point $x$.

It follows that joint density functions don’t return chances; they return chances per unit volume. In two dimensions, $\text{PDF}(x)$ returns chance per unit area.
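The “chance per unit area” reading can be sanity-checked on a concrete density. The sketch below assumes the joint density of two independent Exp(1) variables, $e^{-(x+y)}$ (which also appears in the examples below), and compares the exact chance of a small square, divided by its area, to the density value there:

```python
import numpy as np

# Joint density of two independent Exp(1) variables: PDF(x, y) = e^{-(x + y)}.
# Chance per unit area: Pr(landing in a small square near (1, 2)) / area
# should approach PDF(1, 2) = e^{-3}.
eps = 1e-3

# Exact chance of the square [1, 1 + eps] x [2, 2 + eps]; by independence it
# factors into a product of two one-dimensional interval probabilities.
prob = (np.exp(-1) - np.exp(-(1 + eps))) * (np.exp(-2) - np.exp(-(2 + eps)))

density_estimate = prob / eps**2        # chance per unit area
print(density_estimate, np.exp(-3))     # both close to 0.0498
```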

In the special case when $d = 2$, $\text{PDF}(x)$ is a function of two inputs, $x_1$ and $x_2$, that returns a scalar value. So, in two dimensions, $\text{PDF}(x)$ defines a surface over two variables. More generally, the joint density function is a scalar-valued function of multiple variables. So, joint density functions are surfaces.

So, to visualize density functions, we can borrow the same techniques we developed in Section 8.2.

### Examples

You can experiment with the $\lambda = \mu = 1$ case by running the code cell below, then selecting “Independent Exp.”

```python
from utils_lsg import show_level_sets

show_level_sets()
```

Notice that all of the level sets are lines with slope $-1$. This follows since $\text{PDF}(x,y)$ is constant whenever $\lambda x + \mu y$ is constant. When $\lambda = \mu = 1$, we have $x + y = c$, so $y = c - x$. So, every level set is a line with slope $-1$.

You can experiment with the $n = 2$ case by running the code cell below, then selecting “Independent Laplace.”

```python
from utils_lsg import show_level_sets

show_level_sets()
```

Try computing the level sets of $e^{-(|x_1| + |x_2|)}$. Convince yourself that these should form concentric diamonds centered at the origin.

You can experiment with the associated density function by running the code cell below. Select “Independent Normal.”

```python
from utils_lsg import show_level_sets

show_level_sets()
```

Why are the level sets concentric circles? What would change if we’d used a density function proportional to $e^{-\frac{1}{2}(a x^2 + b y^2)}$?

The three previous examples illustrate a useful fact. If $f(x,y)$ is a composition of functions, $g(h(x,y))$, then its level sets are all level sets of the inner function $h(x,y)$. If $(x,y)$ and $(x',y')$ both satisfy $h(x,y) = h(x',y') = c$, then $f(x,y) = g(c) = f(x',y')$.
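This fact admits a quick numerical check. The sketch below (function names are illustrative) uses the normal-style kernel as the outer function and $x^2 + y^2$ as the inner function:

```python
import numpy as np

# If f(x, y) = g(h(x, y)), points on the same level set of the inner
# function h automatically lie on the same level set of f.
h = lambda x, y: x**2 + y**2        # inner function: circular level sets
g = lambda c: np.exp(-0.5 * c)      # outer function (normal-style kernel)
f = lambda x, y: g(h(x, y))

# (3, 4) and (5, 0) both sit on the circle x^2 + y^2 = 25 ...
assert np.isclose(h(3.0, 4.0), h(5.0, 0.0))
# ... so f takes the same value at both points.
assert np.isclose(f(3.0, 4.0), f(5.0, 0.0))
```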

For example:

This density also has circular level sets, since it is a composition whose innermost function is $x^2 + y^2$. You can experiment with it by running the code cell below. Select “Student-t”. Like the normal distribution, it is bell-shaped; however, unlike the normal, its tails decay at power-law rates.

```python
from utils_lsg import show_level_sets

show_level_sets()
```

### Working with Joint Density Functions

Recall that, to find the chance a random variable lies in an interval, we compute the area under the density function over the interval:

Pr(X[a,b])=x=abPDF(x)dx\text{Pr}(X \in [a,b]) = \int_{x = a}^b \text{PDF}(x) dx

The same idea extends to random vectors.

We will study integration in multiple variables in detail later in the course. For now, it is enough to recognize the analogy to the univariate case. Chances are given by integrating density functions over the region defined by an event. So, when we are in multiple dimensions, picture the volume under a segment of a surface, rather than the area under a segment of a curve.
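As a concrete instance of “chance equals volume under the surface,” the sketch below approximates $\text{Pr}(X + Y \le 1)$ for a uniform density on the unit square, where the true answer is the area of a triangle, $1/2$ (a NumPy Riemann-sum sketch; the grid construction is my own):

```python
import numpy as np

# Uniform density on the unit square: PDF(x, y) = 1 inside, 0 outside.
# Pr(X + Y <= 1) is the volume under the density over a triangular region.
n = 2000
x = (np.arange(n) + 0.5) / n            # midpoints of an n-point grid on [0, 1]
X, Y = np.meshgrid(x, x)

# Riemann sum: average the density (here, 1) times the event indicator.
prob = (X + Y <= 1.0).mean()
print(prob)                              # close to the exact answer, 0.5
```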

We’ll see that essentially every equation used to answer probability questions with a joint density is either directly analogous to the corresponding equation in a single variable, or is a natural analog of a procedure applied to joint distribution tables, with sums replaced by integrals. For example, the marginal density functions are:

$$
\text{PDF}_{X_1}(x_1) = \int_{\text{all } x_2} \text{PDF}(x_1, x_2)\, dx_2, \qquad \text{PDF}_{X_2}(x_2) = \int_{\text{all } x_1} \text{PDF}(x_1, x_2)\, dx_1.
$$

In each case, we integrate out the variable we are not interested in. This is exactly analogous to summing across a row, or down a column. So, marginal densities return the area under cross-sections of the joint density surface. The same definition extends easily to higher dimensions. For example, if $d = 3$, we can find marginal densities by integrating out two of the three variables.
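For instance, with the joint density $e^{-(x_1 + x_2)}$ of two independent Exp(1) variables, integrating out $x_2$ numerically recovers the Exp(1) marginal $e^{-x_1}$. A midpoint-rule sketch (grid choices are my own):

```python
import numpy as np

# Joint density of two independent Exp(1) variables.
joint_pdf = lambda x1, x2: np.exp(-(x1 + x2))

# Midpoint grid over [0, 50] for the variable being integrated out.
dx2 = 0.001
x2 = np.arange(0.0, 50.0, dx2) + dx2 / 2

def marginal_x1(x1):
    """Integrate the joint density over x2 (midpoint Riemann sum)."""
    return np.sum(joint_pdf(x1, x2)) * dx2

# Should recover the Exp(1) density e^{-x1}:
print(marginal_x1(1.0), np.exp(-1.0))   # both close to 0.3679
```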

Conditional densities follow in the same fashion:

$$
\text{PDF}_{X_1 \mid X_2}(x_1 \mid x_2) = \frac{\text{PDF}(x_1, x_2)}{\text{PDF}_{X_2}(x_2)}.
$$

Notice the immediate analogy to the division rule for conditional distributions (see Section 1.5). As usual, conditional is joint divided by marginal.

Notice also that conditional densities are, as functions of their input, proportional to cross-sections of the joint density surface. For example, fixing $X_2 = 3$ isolates the cross-section of $\text{PDF}(x_1, x_2)$ where $x_1$ can vary but $x_2$ is fixed at 3. So, conditional density functions are proportional to cross-sections of the joint density. The normalizing constant is the associated marginal density, which equals the area under the selected cross-section.
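Continuing the independent Exp(1) example, the sketch below fixes $x_2 = 3$, extracts the cross-section, and divides by its area (the marginal density at 3). Since the components are independent, conditioning changes nothing, and we recover the Exp(1) density $e^{-x_1}$ (grid choices are illustrative):

```python
import numpy as np

# Joint density of two independent Exp(1) variables.
joint_pdf = lambda x1, x2: np.exp(-(x1 + x2))

# Cross-section of the joint density surface at x2 = 3, as a function of x1.
dx1 = 0.001
x1 = np.arange(0.0, 50.0, dx1) + dx1 / 2    # midpoint grid on [0, 50]
cross_section = joint_pdf(x1, 3.0)

# Area under the cross-section = marginal density at x2 = 3 (about e^{-3}).
area = np.sum(cross_section) * dx1

# Conditional density = cross-section / area. For independent components,
# this is just the Exp(1) density e^{-x1} again.
conditional = cross_section / area
print(conditional[1000], np.exp(-x1[1000]))  # agree at x1 near 1
```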

It follows that the division and multiplication rules introduced in Section 1.5 extend naturally to joint densities. In particular:

$$
\text{PDF}(x_1, x_2) = \text{PDF}_{X_2}(x_2)\, \text{PDF}_{X_1 \mid X_2}(x_1 \mid x_2) = \text{PDF}_{X_1}(x_1)\, \text{PDF}_{X_2 \mid X_1}(x_2 \mid x_1).
$$

So, as usual, joint densities equal marginal densities times conditional densities.

Applying either the joint density definition or the multiplication rule establishes the usual result: joint distributions are products of marginals if and only if the variables are independent. In the continuous setting, a joint density is the product of its marginals if and only if the components of the associated random vector are independent.
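This factorization can be verified numerically. The sketch below uses independent Exp(1) components and checks that the chance of a joint event equals the product of the two marginal chances (grids and variable names are my own choices):

```python
import numpy as np

# Independent Exp(1) components: the joint density e^{-(x + y)} factors
# into the product of marginal densities e^{-x} * e^{-y}.
dx = 0.002
x = np.arange(0.0, 2.0, dx) + dx / 2    # midpoint grid: X over [0, 2]
y = np.arange(1.0, 3.0, dx) + dx / 2    # midpoint grid: Y over [1, 3]

# Pr(X in [0, 2], Y in [1, 3]): Riemann sum of the joint density on the grid.
prob_joint = np.sum(np.exp(-(x[None, :] + y[:, None]))) * dx * dx

# Marginal probabilities: Pr(X in [0, 2]) and Pr(Y in [1, 3]).
prob_x = np.sum(np.exp(-x)) * dx        # close to 1 - e^{-2}
prob_y = np.sum(np.exp(-y)) * dx        # close to e^{-1} - e^{-3}

print(prob_joint, prob_x * prob_y)      # equal: the joint chance factors
```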

Expectations are also defined in the usual fashion:

$$
\mathbb{E}[g(X_1, X_2)] = \int_{\text{all } x_1} \int_{\text{all } x_2} g(x_1, x_2)\, \text{PDF}(x_1, x_2)\, dx_2\, dx_1.
$$

So,