
## 8.3 Joint Distributions

A random vector is an ordered list of finitely many random variables, $X = [X_1, X_2, \ldots, X_n]$. Random vectors may be discretely or continuously distributed. If discretely distributed, the random vector is selected from a finite or countably infinite set of possible vectors. If continuously distributed, the vector is selected from an uncountably infinite set of possible vectors, so that $\text{Pr}(X = x) = 0$ for any specific $x$.

Like random variables, random vectors may be characterized using distribution functions. In the discrete case, we use an analog to the probability mass function. In the continuous case we use an analog to the probability density function.

### The Discrete Case

In the discrete case, we can define a joint probability mass function. Given a possible vector $x = [x_1, x_2, \ldots, x_d]$, the joint probability mass function returns the chance:

$$
\text{Pr}(X = x) = \text{Pr}(X_1 = x_1, X_2 = x_2, \ldots, X_d = x_d).
$$

In the bivariate setting, we can represent the joint probability mass function with a joint distribution table (see Section 1.4). The columns correspond to possible values for the first entry, and the rows correspond to possible values for the second entry. For instance, if $X_1 \in \{1, 2, 3\}$ and $X_2 \in \{2, 4, 6\}$, then we could represent the joint distribution with the table:

| Event | $X_1 = 1$ | $X_1 = 2$ | $X_1 = 3$ |
| --- | --- | --- | --- |
| $X_2 = 2$ | 2/10 | 3/10 | 1/10 |
| $X_2 = 4$ | 1/10 | 0 | 1/10 |
| $X_2 = 6$ | 0 | 0 | 2/10 |

As usual, we can expand the table by appending the marginal probabilities. These are probabilities like $\text{Pr}(X_1 = 1)$ (sum down the first column) or $\text{Pr}(X_2 = 4)$ (sum across the second row).

| Event | $X_1 = 1$ | $X_1 = 2$ | $X_1 = 3$ | Marginals |
| --- | --- | --- | --- | --- |
| $X_2 = 2$ | 2/10 | 3/10 | 1/10 | $\text{Pr}(X_2 = 2) = 6/10$ |
| $X_2 = 4$ | 1/10 | 0 | 1/10 | $\text{Pr}(X_2 = 4) = 2/10$ |
| $X_2 = 6$ | 0 | 0 | 2/10 | $\text{Pr}(X_2 = 6) = 2/10$ |
| Marginals | $\text{Pr}(X_1 = 1) = 3/10$ | $\text{Pr}(X_1 = 2) = 3/10$ | $\text{Pr}(X_1 = 3) = 4/10$ | 1 |

The bottom row provides the marginal mass function for $X_1$:

$$
\text{Pr}(X_1 = x) = \sum_{\text{all } x_2} \text{Pr}(X_1 = x, X_2 = x_2).
$$

The rightmost column provides the marginal mass function for $X_2$:

$$
\text{Pr}(X_2 = x) = \sum_{\text{all } x_1} \text{Pr}(X_1 = x_1, X_2 = x).
$$
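The marginalization above is easy to check numerically. Below is a small sketch (using NumPy; the variable names are my own) that recovers both marginal mass functions from the joint table in this section by summing along each axis:

```python
import numpy as np

# Joint distribution table from this section:
# rows index X2 in {2, 4, 6}; columns index X1 in {1, 2, 3}.
joint = np.array([
    [2/10, 3/10, 1/10],   # X2 = 2
    [1/10, 0.0,  1/10],   # X2 = 4
    [0.0,  0.0,  2/10],   # X2 = 6
])

# Marginal mass function for X1: sum down each column (over all x2).
marginal_x1 = joint.sum(axis=0)

# Marginal mass function for X2: sum across each row (over all x1).
marginal_x2 = joint.sum(axis=1)

print(marginal_x1)  # [0.3 0.3 0.4]
print(marginal_x2)  # [0.6 0.2 0.2]
```

Note that both marginals, like the joint table itself, sum to one.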

To isolate conditional mass functions from the joint mass function, isolate either a row or column by fixing either $X_1$ or $X_2$, then normalize the isolated row/column. For example, if we fix $X_2 = 2$, then we should isolate the first row of the table:

| Event | $x = 1$ | $x = 2$ | $x = 3$ |
| --- | --- | --- | --- |
| $\text{Pr}(X_1 = x \mid X_2 = 2)$ | 2/6 | 3/6 | 1/6 |

If we fix $X_2 = 4$, then we should isolate the second row:

| Event | $x = 1$ | $x = 2$ | $x = 3$ |
| --- | --- | --- | --- |
| $\text{Pr}(X_1 = x \mid X_2 = 4)$ | 1/2 | 0 | 1/2 |

If we’d fixed $X_1 = 1$, then we would isolate the first column.

Notice that isolating a row corresponds to fixing one input variable while letting the other vary. Similarly, isolating a column corresponds to fixing one input while letting the other vary. This is the same procedure we used to define cross-sections: start with a scalar-valued function of multiple inputs (in this case, a row and column index), fix one input, and let the other vary.

Since isolating a row or column is equivalent to extracting a cross-section, each conditional mass function is proportional to a cross-section of the joint mass function. The particular cross-section (the row or column isolated) is selected by the conditioning statement. We’ll see that the same intuition extends to the continuous case.
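The isolate-and-normalize recipe is short enough to express directly in code. Here is a minimal sketch (NumPy assumed; `conditional_given_x2` is an illustrative helper name, not from the text):

```python
import numpy as np

# Joint distribution table from this section:
# rows index X2 in {2, 4, 6}; columns index X1 in {1, 2, 3}.
joint = np.array([
    [2/10, 3/10, 1/10],   # X2 = 2
    [1/10, 0.0,  1/10],   # X2 = 4
    [0.0,  0.0,  2/10],   # X2 = 6
])

def conditional_given_x2(joint, row):
    """Fix X2 by isolating a row, then normalize so the row sums to one."""
    cross_section = joint[row]
    return cross_section / cross_section.sum()

# Conditioning on X2 = 2 rescales the first row by its sum, 6/10,
# giving the conditional masses 2/6, 3/6, 1/6.
print(conditional_given_x2(joint, 0))

# Conditioning on X2 = 4 rescales the second row by its sum, 2/10,
# giving the conditional masses 1/2, 0, 1/2.
print(conditional_given_x2(joint, 1))
```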

### The Continuous Case

In the continuous setting, the probability of any exact event is zero (see Section 2.3). So:

$$
\text{Pr}(X = x) = \text{Pr}(X_1 = x_1, X_2 = x_2, \ldots, X_d = x_d) = 0.
$$

So, just as continuous random variables are characterized by density functions, continuous random vectors are characterized by joint density functions.

This definition extends the one-dimensional definition. In one dimension, a probability density is the chance that a random variable lands in a small interval, relative to the length of the interval. In two dimensions, it is the chance that a random vector lands in a small square, relative to the area of the square. In three dimensions, it is the chance that a random vector lands in a small cube, relative to the volume of the cube. In each case, we recover a density by computing the chance that the random vector lands in some small region, relative to the size of the region, in the limit as the region contracts to a point $x$.

It follows that joint density functions don’t return chances; they return chances per unit volume. In two dimensions, $\text{PDF}(x)$ returns chance per unit area.
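The “chance per unit area” reading can be sanity-checked on a concrete density. The sketch below assumes the joint density of two independent Exp(1) variables, $e^{-(x+y)}$ (which also appears in the examples below), and compares the exact chance of a small square, divided by its area, to the density value there:

```python
import numpy as np

# Joint density of two independent Exp(1) variables: PDF(x, y) = e^{-(x + y)}.
# Chance per unit area: Pr(landing in a small square near (1, 2)) / area
# should approach PDF(1, 2) = e^{-3}.
eps = 1e-3

# Exact chance of the square [1, 1 + eps] x [2, 2 + eps]; by independence it
# factors into a product of two one-dimensional interval probabilities.
prob = (np.exp(-1) - np.exp(-(1 + eps))) * (np.exp(-2) - np.exp(-(2 + eps)))

density_estimate = prob / eps**2        # chance per unit area
print(density_estimate, np.exp(-3))     # both close to 0.0498
```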

In the special case when $d = 2$, $\text{PDF}(x)$ is a function of two inputs, $x_1$ and $x_2$, that returns a scalar value. So, in two dimensions, $\text{PDF}(x)$ defines a surface over two variables. More generally, the joint density function is a scalar-valued function of multiple variables. So, joint density functions are surfaces.

So, to visualize density functions, we can borrow the same techniques we developed in Section 8.2.

### Examples

You can experiment with the $\lambda = \mu = 1$ case by running the code cell below, then selecting “Independent Exp.”

```python
from utils_lsg import show_level_sets

show_level_sets()
```

Notice that all of the level sets are lines with slope $-1$. This follows since $\text{PDF}(x,y)$ is constant whenever $\lambda x + \mu y$ is constant. When $\lambda = \mu = 1$, we have $x + y = c$, so $y = c - x$. So, every level set is a line with slope $-1$.

You can experiment with the $n = 2$ case by running the code cell below, then selecting “Independent Laplace.”

```python
from utils_lsg import show_level_sets

show_level_sets()
```

Try computing the level sets of $e^{-(|x_1| + |x_2|)}$. Convince yourself that these should form concentric diamonds centered at the origin.

You can experiment with the associated density function by running the code cell below. Select “Independent Normal.”

```python
from utils_lsg import show_level_sets

show_level_sets()
```

Why are the level sets concentric circles? What would change if we’d used a density function proportional to $e^{-\frac{1}{2}(a x^2 + b y^2)}$?

The three previous examples illustrate a useful fact. If $f(x,y)$ is a composition of functions, $g(h(x,y))$, then its level sets are all level sets of the inner function $h(x,y)$. If $(x,y)$ and $(x',y')$ both satisfy $h(x,y) = h(x',y') = c$, then $f(x,y) = g(c) = f(x',y')$.
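This fact admits a quick numerical check. The sketch below (function names are illustrative) uses the normal-style kernel as the outer function and $x^2 + y^2$ as the inner function:

```python
import numpy as np

# If f(x, y) = g(h(x, y)), points on the same level set of the inner
# function h automatically lie on the same level set of f.
h = lambda x, y: x**2 + y**2        # inner function: circular level sets
g = lambda c: np.exp(-0.5 * c)      # outer function (normal-style kernel)
f = lambda x, y: g(h(x, y))

# (3, 4) and (5, 0) both sit on the circle x^2 + y^2 = 25 ...
assert np.isclose(h(3.0, 4.0), h(5.0, 0.0))
# ... so f takes the same value at both points.
assert np.isclose(f(3.0, 4.0), f(5.0, 0.0))
```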

For example:

This density also has circular level sets, since it is a composition whose innermost function is $x^2 + y^2$. You can experiment with it by running the code cell below. Select “Student-t”. Like the normal distribution, it is bell-shaped; however, unlike the normal, its tails decay at power-law rates.

```python
from utils_lsg import show_level_sets

show_level_sets()
```

### Working with Joint Density Functions

Recall that, to find the chance a random variable lies in an interval, we compute the area under the density function over the interval:

Pr(X[a,b])=x=abPDF(x)dx\text{Pr}(X \in [a,b]) = \int_{x = a}^b \text{PDF}(x) dx

The same idea extends to random vectors.

We will study integration in multiple variables in detail later in the course. For now, it is enough to recognize the analogy to the univariate case. Chances are given by integrating density functions over the region defined by an event. So, when we are in multiple dimensions, picture the volume under a segment of a surface, rather than the area under a segment of a curve.
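As a concrete instance of “chance equals volume under the surface,” the sketch below approximates $\text{Pr}(X + Y \le 1)$ for a uniform density on the unit square, where the true answer is the area of a triangle, $1/2$ (a NumPy Riemann-sum sketch; the grid construction is my own):

```python
import numpy as np

# Uniform density on the unit square: PDF(x, y) = 1 inside, 0 outside.
# Pr(X + Y <= 1) is the volume under the density over a triangular region.
n = 2000
x = (np.arange(n) + 0.5) / n            # midpoints of an n-point grid on [0, 1]
X, Y = np.meshgrid(x, x)

# Riemann sum: average the density (here, 1) times the event indicator.
prob = (X + Y <= 1.0).mean()
print(prob)                              # close to the exact answer, 0.5
```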

We’ll see that essentially every equation used to answer probability questions with a joint density is either directly analogous to the corresponding equation in a single variable, or is a natural analog of a procedure applied to joint distribution tables, with sums replaced by integrals. For example, the marginal density functions are:

$$
\text{PDF}_{X_1}(x_1) = \int_{\text{all } x_2} \text{PDF}(x_1, x_2)\, dx_2, \qquad \text{PDF}_{X_2}(x_2) = \int_{\text{all } x_1} \text{PDF}(x_1, x_2)\, dx_1.
$$

In each case, we integrate out the variable we are not interested in. This is exactly analogous to summing across a row, or down a column. So, marginal densities return the area under cross-sections of the joint density surface. The same definition extends easily to higher dimensions. For example, if $d = 3$, we can find marginal densities by integrating out two of the three variables.
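For instance, with the joint density $e^{-(x_1 + x_2)}$ of two independent Exp(1) variables, integrating out $x_2$ numerically recovers the Exp(1) marginal $e^{-x_1}$. A midpoint-rule sketch (grid choices are my own):

```python
import numpy as np

# Joint density of two independent Exp(1) variables.
joint_pdf = lambda x1, x2: np.exp(-(x1 + x2))

# Midpoint grid over [0, 50] for the variable being integrated out.
dx2 = 0.001
x2 = np.arange(0.0, 50.0, dx2) + dx2 / 2

def marginal_x1(x1):
    """Integrate the joint density over x2 (midpoint Riemann sum)."""
    return np.sum(joint_pdf(x1, x2)) * dx2

# Should recover the Exp(1) density e^{-x1}:
print(marginal_x1(1.0), np.exp(-1.0))   # both close to 0.3679
```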

Conditional densities follow in the same fashion:

$$
\text{PDF}_{X_1 \mid X_2}(x_1 \mid x_2) = \frac{\text{PDF}(x_1, x_2)}{\text{PDF}_{X_2}(x_2)}.
$$

Notice the immediate analogy to the division rule for conditional distributions (see Section 1.5). As usual, conditional is joint divided by marginal.

Notice also that conditional densities are, as functions of their input, proportional to cross-sections of the joint density surface. For example, fixing $X_2 = 3$ isolates the cross-section of $\text{PDF}(x_1, x_2)$ where $x_1$ can vary but $x_2$ is fixed at 3. So, conditional density functions are proportional to cross-sections of the joint density. The normalizing constant is the associated marginal density, which equals the area under the selected cross-section.
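Continuing the independent Exp(1) example, the sketch below fixes $x_2 = 3$, extracts the cross-section, and divides by its area (the marginal density at 3). Since the components are independent, conditioning changes nothing, and we recover the Exp(1) density $e^{-x_1}$ (grid choices are illustrative):

```python
import numpy as np

# Joint density of two independent Exp(1) variables.
joint_pdf = lambda x1, x2: np.exp(-(x1 + x2))

# Cross-section of the joint density surface at x2 = 3, as a function of x1.
dx1 = 0.001
x1 = np.arange(0.0, 50.0, dx1) + dx1 / 2    # midpoint grid on [0, 50]
cross_section = joint_pdf(x1, 3.0)

# Area under the cross-section = marginal density at x2 = 3 (about e^{-3}).
area = np.sum(cross_section) * dx1

# Conditional density = cross-section / area. For independent components,
# this is just the Exp(1) density e^{-x1} again.
conditional = cross_section / area
print(conditional[1000], np.exp(-x1[1000]))  # agree at x1 near 1
```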

It follows that the division and multiplication rules introduced in Section 1.5 extend naturally to joint densities. In particular:

$$
\text{PDF}(x_1, x_2) = \text{PDF}_{X_2}(x_2)\, \text{PDF}_{X_1 \mid X_2}(x_1 \mid x_2) = \text{PDF}_{X_1}(x_1)\, \text{PDF}_{X_2 \mid X_1}(x_2 \mid x_1).
$$

So, as usual, joint densities equal marginal densities times conditional densities.

Applying either the joint density definition or the multiplication rule establishes the usual result: joint distributions are products of marginals if and only if the variables are independent. In the continuous setting, a joint density is the product of its marginals if and only if the components of the associated random vector are independent.
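This factorization can be verified numerically. The sketch below uses independent Exp(1) components and checks that the chance of a joint event equals the product of the two marginal chances (grids and variable names are my own choices):

```python
import numpy as np

# Independent Exp(1) components: the joint density e^{-(x + y)} factors
# into the product of marginal densities e^{-x} * e^{-y}.
dx = 0.002
x = np.arange(0.0, 2.0, dx) + dx / 2    # midpoint grid: X over [0, 2]
y = np.arange(1.0, 3.0, dx) + dx / 2    # midpoint grid: Y over [1, 3]

# Pr(X in [0, 2], Y in [1, 3]): Riemann sum of the joint density on the grid.
prob_joint = np.sum(np.exp(-(x[None, :] + y[:, None]))) * dx * dx

# Marginal probabilities: Pr(X in [0, 2]) and Pr(Y in [1, 3]).
prob_x = np.sum(np.exp(-x)) * dx        # close to 1 - e^{-2}
prob_y = np.sum(np.exp(-y)) * dx        # close to e^{-1} - e^{-3}

print(prob_joint, prob_x * prob_y)      # equal: the joint chance factors
```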

Expectations are also defined in the usual fashion:

$$
\mathbb{E}[g(X_1, X_2)] = \int_{\text{all } x_1} \int_{\text{all } x_2} g(x_1, x_2)\, \text{PDF}(x_1, x_2)\, dx_2\, dx_1.
$$

So,