Joint, Marginal, and Conditional Distributions

Sources:

  1. Joseph K. Blitzstein & Jessica Hwang. (2019). Joint distributions. Introduction to Probability (2nd ed., pp. 304-323). CRC Press.

Notation

| Symbol | Type | Description |
| --- | --- | --- |
| $X, Y$ | Random variable | Random variables whose distributions are analyzed |
| $F_{X,Y}(x,y)$ | Function | Joint cumulative distribution function (CDF) of $X$ and $Y$ |
| $p_{X,Y}(x,y)$ | Function | Joint probability mass function (PMF) for discrete random variables $X$ and $Y$ |
| $f_{X,Y}(x,y)$ | Function | Joint probability density function (PDF) for continuous random variables $X$ and $Y$ |
| $f_X(x), f_Y(y)$ | Function | Marginal PDFs of $X$ and $Y$, respectively |
| $f_{Y\mid X}(y \mid x)$ | Function | Conditional PDF of $Y$ given $X = x$ |
| $\iint_A f_{X,Y}(x,y)\,dx\,dy$ | Operation | Integral of the joint PDF $f_{X,Y}$ over a region $A \subseteq \mathbb{R}^2$ |
| $A \subseteq \mathbb{R}^2$ | Set | A subset of the two-dimensional real plane |

Abbreviations

| Abbreviation | Description |
| --- | --- |
| r.v. | Random variable |
| CDF | Cumulative distribution function |
| PMF | Probability mass function |
| PDF | Probability density function |
| LOTP | Law of total probability |

Discrete

The most general description of the joint distribution of two r.v.s is the joint CDF, which applies to discrete and continuous r.v.s alike.

Joint CDF

Definition: The joint CDF of r.v.s X and Y is the function $F_{X,Y}$ given by $F_{X,Y}(x,y) = P(X \le x, Y \le y)$.

The joint CDF of n r.v.s is defined analogously.

For discrete r.v.s, the joint CDF often consists of jumps and flat regions, so we typically work with the joint PMF instead.

Joint PMF

Definition: The joint PMF of discrete r.v.s X and Y is the function $p_{X,Y}$ given by $p_{X,Y}(x,y) = P(X = x, Y = y)$.

The joint PMF of n discrete r.v.s is defined analogously.

Just as univariate PMFs must be nonnegative and sum to 1, we require valid joint PMFs to be nonnegative and sum to 1, where the sum is taken over all possible values of X and Y: $\sum_x \sum_y P(X = x, Y = y) = 1$.
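
As a quick illustration, a joint PMF can be stored as a table of probabilities and checked for validity directly. The table below is a made-up example, not one from the text; any nonnegative table summing to 1 would do.

```python
# Hypothetical joint PMF of two discrete r.v.s X and Y, stored as a dict
# mapping (x, y) pairs to probabilities. The values are illustrative only.
joint_pmf = {
    (0, 0): 0.10, (0, 1): 0.20,
    (1, 0): 0.30, (1, 1): 0.25,
    (2, 0): 0.05, (2, 1): 0.10,
}

def is_valid_joint_pmf(pmf, tol=1e-12):
    """Check nonnegativity and that the probabilities sum to 1."""
    nonneg = all(p >= 0 for p in pmf.values())
    sums_to_one = abs(sum(pmf.values()) - 1.0) < tol
    return nonneg and sums_to_one

print(is_valid_joint_pmf(joint_pmf))  # True
```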

Marginal PMF

Definition: For discrete r.v.s X and Y, the marginal PMF of X is $P(X = x) = \sum_y P(X = x, Y = y)$.

The operation of summing over the possible values of Y in order to convert the joint PMF into the marginal PMF of X is known as marginalizing out Y.
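
In code, marginalizing out Y from a tabulated joint PMF is just a sum over the second coordinate. The table below is the same made-up example as before, not one from the text.

```python
# Hypothetical joint PMF table; values are illustrative only.
joint_pmf = {
    (0, 0): 0.10, (0, 1): 0.20,
    (1, 0): 0.30, (1, 1): 0.25,
    (2, 0): 0.05, (2, 1): 0.10,
}

def marginal_pmf_x(pmf):
    """Marginalize out Y: P(X = x) = sum over y of P(X = x, Y = y)."""
    marg = {}
    for (x, _y), p in pmf.items():
        marg[x] = marg.get(x, 0.0) + p
    return marg

print(marginal_pmf_x(joint_pmf))  # values near {0: 0.30, 1: 0.55, 2: 0.15}
```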

Conditional PMF

Definition: For discrete r.v.s X and Y, the conditional PMF of Y given $X = x$ is $P(Y = y \mid X = x) = \dfrac{P(X = x, Y = y)}{P(X = x)}$.

This is viewed as a function of y for fixed x.
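
The definition translates directly into code: divide the relevant row of the joint table by the marginal probability of that row. The joint table is a made-up example, not one from the text.

```python
# Hypothetical joint PMF table; values are illustrative only.
joint_pmf = {
    (0, 0): 0.10, (0, 1): 0.20,
    (1, 0): 0.30, (1, 1): 0.25,
    (2, 0): 0.05, (2, 1): 0.10,
}

def conditional_pmf_y_given_x(pmf, x):
    """P(Y = y | X = x) = P(X = x, Y = y) / P(X = x), as a function of y."""
    p_x = sum(p for (xv, _yv), p in pmf.items() if xv == x)
    return {yv: p / p_x for (xv, yv), p in pmf.items() if xv == x}

cond = conditional_pmf_y_given_x(joint_pmf, 1)
print(cond)  # {0: 0.30/0.55, 1: 0.25/0.55}; note the values sum to 1
```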

Independence of discrete r.v.s

Definition: Random variables X and Y are independent if for all x and y, $F_{X,Y}(x,y) = F_X(x) F_Y(y)$.

If X and Y are discrete, this is equivalent to the condition

$P(X = x, Y = y) = P(X = x) P(Y = y)$

for all x, y, and it is also equivalent to the condition

$P(Y = y \mid X = x) = P(Y = y)$

for all x, y such that $P(X = x) > 0$.
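
For a finite joint PMF table, the first equivalent condition can be checked exhaustively. Both tables below are illustrative: the first is built as a product of marginals (so it must pass), the second is not.

```python
# Independence check for discrete r.v.s: P(X=x, Y=y) must equal
# P(X=x) * P(Y=y) for every (x, y) pair in the table.
def are_independent(pmf, tol=1e-12):
    px, py = {}, {}
    for (x, y), p in pmf.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return all(abs(p - px[x] * py[y]) < tol for (x, y), p in pmf.items())

# Product table: constructed so the joint PMF factors, hence independent.
indep = {(x, y): px * py
         for x, px in [(0, 0.5), (1, 0.5)]
         for y, py in [(0, 0.3), (1, 0.7)]}
# Dependent table: P(0,0) = 0.10 but P(X=0)P(Y=0) = 0.3 * 0.4 = 0.12.
dep = {(0, 0): 0.10, (0, 1): 0.20, (1, 0): 0.30, (1, 1): 0.40}

print(are_independent(indep), are_independent(dep))  # True False
```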

Continuous

Once we have a handle on discrete joint distributions, it isn't much harder to consider continuous joint distributions. We simply make the now-familiar substitutions of integrals for sums and PDFs for PMFs, remembering that the probability of any individual point is now 0.

Formally, in order for X and Y to have a continuous joint distribution, we require that the joint CDF

$F_{X,Y}(x,y) = P(X \le x, Y \le y)$

be differentiable with respect to x and y. The partial derivative with respect to x and y is called the joint PDF. The joint PDF determines the joint distribution, as does the joint CDF.

Joint PDF

Definition: If X and Y are continuous with joint CDF $F_{X,Y}$, their joint PDF is the derivative of the joint CDF with respect to x and y: $f_{X,Y}(x,y) = \dfrac{\partial^2}{\partial x \, \partial y} F_{X,Y}(x,y)$.

We require valid joint PDFs to be nonnegative and integrate to 1:

$f_{X,Y}(x,y) \ge 0, \quad \text{and} \quad \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_{X,Y}(x,y)\,dx\,dy = 1.$

In the univariate case, the PDF was the function we integrated to get the probability of an interval. Similarly, the joint PDF of two r.v.s is the function we integrate to get the probability of a two-dimensional region. For example,

$P(X < 3,\; 1 < Y < 4) = \int_1^4 \int_{-\infty}^3 f_{X,Y}(x,y)\,dx\,dy.$

For a general region $A \subseteq \mathbb{R}^2$,

$P((X,Y) \in A) = \iint_A f_{X,Y}(x,y)\,dx\,dy.$
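
Such a region probability can be approximated numerically. The sketch below uses a hypothetical joint PDF not from the text, with X and Y i.i.d. Exponential(1), so $f_{X,Y}(x,y) = e^{-x-y}$ for $x, y > 0$; by independence the region's probability has a closed form to compare against.

```python
import math

# Hypothetical joint PDF: X, Y i.i.d. Exponential(1).
def f(x, y):
    return math.exp(-x - y) if x > 0 and y > 0 else 0.0

def prob_region(f, x_lo, x_hi, y_lo, y_hi, n=400):
    """Midpoint-rule Riemann sum of f over the rectangle [x_lo, x_hi] x [y_lo, y_hi]."""
    hx, hy = (x_hi - x_lo) / n, (y_hi - y_lo) / n
    total = 0.0
    for i in range(n):
        x = x_lo + (i + 0.5) * hx
        for j in range(n):
            total += f(x, y_lo + (j + 0.5) * hy)
    return total * hx * hy

# f vanishes for x <= 0, so integrating x over [0, 3] captures all of X < 3.
approx = prob_region(f, 0.0, 3.0, 1.0, 4.0)
exact = (1 - math.exp(-3)) * (math.exp(-1) - math.exp(-4))
print(approx, exact)  # both close to 0.332
```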

Example

Figure 7.4

Figure 7.4 shows a sketch of what a joint PDF of two r.v.s could look like. As usual with continuous r.v.s, we need to keep in mind that the height of the surface $f_{X,Y}(x,y)$ at a single point does not represent a probability. The probability of any specific point in the plane is 0. Now that we've gone up a dimension, the probability of any line or curve in the plane is also 0. The only way we can get nonzero probability is by integrating over a region of positive area in the xy-plane.

When we integrate the joint PDF over a region A, we are calculating the volume under the surface of the joint PDF and above A. Thus, probability is represented by volume under the joint PDF. The total volume under a valid joint PDF is 1.

Marginal PDF

Definition: For continuous r.v.s X and Y with joint PDF $f_{X,Y}$, the marginal PDF of X is $f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x,y)\,dy$.

This is the PDF of X, viewing X individually rather than jointly with Y.

To simplify notation, we have mainly been looking at the joint distribution of two r.v.s rather than n r.v.s, but marginalization works analogously with any number of variables. For example, if we have the joint PDF of X, Y, Z, W but want the joint PDF of X, W, we just have to integrate over all possible values of Y and Z: $f_{X,W}(x,w) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_{X,Y,Z,W}(x,y,z,w)\,dy\,dz$.

Conceptually this is easy: just integrate over the unwanted variables to get the joint PDF of the wanted variables. Computing the integral, however, may or may not be easy.
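
To make the "integrate over the unwanted variables" step concrete, here is a numerical sketch with a hypothetical joint PDF not from the text, $f(x,y) = e^{-x-y}$ for $x, y > 0$, whose marginal is known to be $f_X(x) = e^{-x}$.

```python
import math

# Hypothetical joint PDF: f(x, y) = exp(-x - y) for x, y > 0.
def f(x, y):
    return math.exp(-x - y) if x > 0 and y > 0 else 0.0

def marginal_x(f, x, y_lo=0.0, y_hi=40.0, n=4000):
    """Approximate f_X(x) = integral of f(x, y) dy by a midpoint-rule sum.

    The infinite upper limit is truncated at y_hi, which is harmless here
    because the tail mass beyond it is negligible.
    """
    h = (y_hi - y_lo) / n
    return sum(f(x, y_lo + (j + 0.5) * h) for j in range(n)) * h

print(marginal_x(f, 1.2), math.exp(-1.2))  # both close to 0.301
```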

Returning to the case of the joint distribution of two r.v.s X and Y, let's consider how to update our distribution for Y after observing the value of X, using the conditional PDF.

Conditional PDF

Definition: For continuous r.v.s X and Y with joint PDF $f_{X,Y}$, the conditional PDF of Y given $X = x$ is $f_{Y\mid X}(y \mid x) = \dfrac{f_{X,Y}(x,y)}{f_X(x)},$

for all x with $f_X(x) > 0$. This is considered as a function of y for fixed x. As a convention, in order to make $f_{Y\mid X}(y \mid x)$ well-defined for all real x, let $f_{Y\mid X}(y \mid x) = 0$ for all x with $f_X(x) = 0$.

Notation: The subscripts that we place on all the $f$'s are just to remind us that we have three different functions on our plate. We could just as well write $g(y \mid x) = f(x,y)/h(x)$, where $f$ is the joint PDF, $h$ is the marginal PDF of X, and $g$ is the conditional PDF of Y given $X = x$, but that makes it more difficult to remember which letter stands for which function.

Note: We know that by the definition of a continuous r.v., $P(X = x) = 0$ for any x. So how can we speak of conditioning on $X = x$ when its probability is 0? Rigorously speaking, we are actually conditioning on the event that X falls within a small interval containing x, say $X \in (x - \epsilon, x + \epsilon)$, and then taking a limit as $\epsilon$ approaches 0 from the right. We will not fuss over this technicality; fortunately, many important results such as Bayes' rule work in the continuous case exactly as one would hope.

Continuous form of Bayes' rule and LOTP

Theorem: For continuous r.v.s X and Y, we have the following continuous form of Bayes' rule: $f_{Y\mid X}(y \mid x) = \dfrac{f_{X\mid Y}(x \mid y)\, f_Y(y)}{f_X(x)}, \quad \text{for } f_X(x) > 0.$

And we have the following continuous form of the law of total probability:

$f_X(x) = \int_{-\infty}^{\infty} f_{X\mid Y}(x \mid y)\, f_Y(y)\,dy.$

Proof. By definition of conditional PDFs, we have $f_{Y\mid X}(y \mid x)\, f_X(x) = f_{X,Y}(x,y) = f_{X\mid Y}(x \mid y)\, f_Y(y).$

The continuous version of Bayes' rule follows immediately from dividing by $f_X(x)$. The continuous version of LOTP follows immediately from integrating with respect to y: $f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x,y)\,dy = \int_{-\infty}^{\infty} f_{X\mid Y}(x \mid y)\, f_Y(y)\,dy.$

Out of curiosity, let's see what would have happened if we had plugged in the other expression for $f_{X,Y}(x,y)$ instead in the proof of LOTP:

$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x,y)\,dy = \int_{-\infty}^{\infty} f_{Y\mid X}(y \mid x)\, f_X(x)\,dy = f_X(x) \int_{-\infty}^{\infty} f_{Y\mid X}(y \mid x)\,dy.$

This just says that, for any x with $f_X(x) > 0$,

$\int_{-\infty}^{\infty} f_{Y\mid X}(y \mid x)\,dy = 1,$

confirming the fact that conditional PDFs must integrate to 1.
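
These identities are easy to sanity-check numerically. The sketch below uses a hypothetical model not from the text: X ~ Expo(1) and, given $X = x$, Y ~ Expo(x), so $f_{X,Y}(x,y) = x\,e^{-x - xy}$ for $x, y > 0$, with $f_X(x) = e^{-x}$ and $f_{Y\mid X}(y \mid x) = x\,e^{-xy}$ known analytically.

```python
import math

# Hypothetical joint PDF: X ~ Expo(1), and Y | X = x ~ Expo(x).
def joint(x, y):
    return x * math.exp(-x - x * y) if x > 0 and y > 0 else 0.0

def integrate(g, lo, hi, n=20_000):
    """Midpoint-rule approximation of the integral of g over [lo, hi]."""
    h = (hi - lo) / n
    return sum(g(lo + (i + 0.5) * h) for i in range(n)) * h

x0, y0 = 1.5, 0.7
f_x = integrate(lambda y: joint(x0, y), 0.0, 60.0)  # marginalize out y
print(f_x, math.exp(-x0))                           # both close to 0.2231

cond = joint(x0, y0) / f_x                          # f_{Y|X}(y0 | x0)
print(cond, x0 * math.exp(-x0 * y0))                # both close to 0.5249

# The conditional PDF integrates to 1, as derived above:
print(integrate(lambda y: joint(x0, y) / f_x, 0.0, 60.0))  # close to 1.0
```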

Independence of continuous r.v.s

Definition: Random variables X and Y are independent if for all x and y, $F_{X,Y}(x,y) = F_X(x) F_Y(y)$.

If X and Y are continuous with joint PDF $f_{X,Y}$, this is equivalent to the condition

$f_{X,Y}(x,y) = f_X(x) f_Y(y)$

for all x, y, and it is also equivalent to the condition

$f_{Y\mid X}(y \mid x) = f_Y(y)$

for all x, y such that $f_X(x) > 0$.

Here is an important proposition connecting independence of two r.v.s to factorization of the joint PDF.

Independence and joint PDF factorization

Proposition: Suppose that the joint PDF $f_{X,Y}$ of X and Y factors as $f_{X,Y}(x,y) = g(x) h(y)$

for all x and y, where g and h are nonnegative functions. Then X and Y are independent. Also, if either g or h is a valid PDF, then the other one is a valid PDF too and g and h are the marginal PDFs of X and Y, respectively. (The analogous result in the discrete case also holds.)

Proof. Let $c = \int_{-\infty}^{\infty} h(y)\,dy$. Multiplying and dividing by c, we can write $f_{X,Y}(x,y) = c\,g(x) \cdot \dfrac{h(y)}{c}$.

(The point of this is that $h(y)/c$ is a valid PDF.) Then the marginal PDF of X is

$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x,y)\,dy = c\,g(x) \int_{-\infty}^{\infty} \frac{h(y)}{c}\,dy = c\,g(x).$

It follows that $\int_{-\infty}^{\infty} c\,g(x)\,dx = 1$ since a marginal PDF is a valid PDF (knowing the integral of h gave us the integral of g for free!). Then the marginal PDF of Y is

$f_Y(y) = \int_{-\infty}^{\infty} f_{X,Y}(x,y)\,dx = \frac{h(y)}{c} \int_{-\infty}^{\infty} c\,g(x)\,dx = \frac{h(y)}{c}.$

Thus, X and Y are independent with PDFs $c\,g(x)$ and $h(y)/c$, respectively. If g or h is already a valid PDF, then $c = 1$, so the other one is also a valid PDF.

Note: In the above proposition, we need the joint PDF to factor as a function of x times a function of y for all $(x,y)$ in the plane $\mathbb{R}^2$, not just for $(x,y)$ with $f_{X,Y}(x,y) > 0$. The reason for this is illustrated in the next example.

A simple case of a continuous joint distribution is when the joint PDF is constant over some region in the plane. In the following example, we'll compare a joint PDF that is constant on a square to a joint PDF that is constant on a disk.

Example: Uniform on a region in the plane

Let $(X,Y)$ be a completely random point in the square $\{(x,y) : x, y \in [0,1]\}$, in the sense that the joint PDF of X and Y is constant over the square and 0 outside of it:

$f_{X,Y}(x,y) = \begin{cases} 1 & \text{if } x, y \in [0,1], \\ 0 & \text{otherwise.} \end{cases}$

The constant 1 is chosen so that the joint PDF will integrate to 1. This distribution is called the Uniform distribution on the square.

Intuitively, it makes sense that X and Y should be Unif(0,1) marginally. We can check this by computing

$f_X(x) = \int_0^1 f_{X,Y}(x,y)\,dy = \int_0^1 1\,dy = 1,$

and similarly for $f_Y$. Furthermore, X and Y are independent, since the joint PDF factors into the product of the marginal PDFs (this just reduces to $1 = 1 \cdot 1$, but it's important to note that the value of X does not constrain the possible values of Y). So the conditional distribution of Y given $X = x$ is Unif(0,1), regardless of x.
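
A quick simulation sketch (not from the text) backs this up: sample points uniformly from the unit square and check that the empirical marginals look Unif(0,1) and that probabilities of joint events approximately factor, as independence predicts.

```python
import random

random.seed(42)  # seeded so the run is reproducible
n = 100_000
pts = [(random.random(), random.random()) for _ in range(n)]

p_x = sum(1 for x, y in pts if x < 0.5) / n                # near P(X < 0.5) = 0.5
p_y = sum(1 for x, y in pts if y < 0.5) / n                # near P(Y < 0.5) = 0.5
p_xy = sum(1 for x, y in pts if x < 0.5 and y < 0.5) / n   # near 0.5 * 0.5 = 0.25

print(p_x, p_y, p_xy)
```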

Now let $(X,Y)$ be a completely random point in the unit disk $\{(x,y) : x^2 + y^2 \le 1\}$, with joint PDF

$f_{X,Y}(x,y) = \begin{cases} \frac{1}{\pi} & \text{if } x^2 + y^2 \le 1, \\ 0 & \text{otherwise.} \end{cases}$

Again, the constant $1/\pi$ is chosen to make the joint PDF integrate to 1; the value follows from the fact that the integral of 1 over some region in the plane is the area of that region.

Note that X and Y are not independent, since in general, knowing the value of X constrains the possible values of Y: larger values of $|X|$ restrict Y to a smaller range. It would be a misuse of the previous proposition to conclude independence from the fact that $f_{X,Y}(x,y) = g(x) h(y)$ for all $(x,y)$ in the disk, where $g(x) = 1/\pi$ and $h(y) = 1$ are constant functions; the factorization must hold on all of $\mathbb{R}^2$, and here it fails outside the disk. To see from the definition that X and Y are not independent, note that, for example, $f_{X,Y}(0.9, 0.9) = 0$ since $(0.9, 0.9)$ is not in the unit disk, but $f_X(0.9) f_Y(0.9) \neq 0$ since 0.9 is in the supports of both X and Y.

The marginal distribution of X is now

$f_X(x) = \int_{-\sqrt{1-x^2}}^{\sqrt{1-x^2}} \frac{1}{\pi}\,dy = \frac{2}{\pi}\sqrt{1-x^2}, \quad -1 \le x \le 1.$

By symmetry, $f_Y(y) = \frac{2}{\pi}\sqrt{1-y^2}$. Note that the marginal distributions of X and Y are not Uniform on $[-1, 1]$; rather, X and Y are more likely to fall near 0 than near $\pm 1$.
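
The marginal formula can be checked numerically: integrating the constant joint PDF $1/\pi$ over y at a fixed x should reproduce $\frac{2}{\pi}\sqrt{1-x^2}$. The sketch below does this with a midpoint-rule sum.

```python
import math

# Joint PDF of the uniform distribution on the unit disk.
def joint(x, y):
    return 1 / math.pi if x * x + y * y <= 1 else 0.0

def marginal_x(x, n=20_000):
    """Midpoint-rule sum of joint(x, y) over y in [-1, 1], which covers the support."""
    h = 2.0 / n
    return sum(joint(x, -1.0 + (j + 0.5) * h) for j in range(n)) * h

for x in (0.0, 0.5, 0.9):
    print(marginal_x(x), (2 / math.pi) * math.sqrt(1 - x * x))
```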

Figure 7.6

Suppose we observe $X = x$. As illustrated in Figure 7.6, this constrains Y to lie in the interval $[-\sqrt{1-x^2}, \sqrt{1-x^2}]$. Specifically, the conditional distribution of Y given $X = x$ is

$f_{Y\mid X}(y \mid x) = \frac{f_{X,Y}(x,y)}{f_X(x)} = \frac{1/\pi}{\frac{2}{\pi}\sqrt{1-x^2}} = \frac{1}{2\sqrt{1-x^2}}$

for $-\sqrt{1-x^2} \le y \le \sqrt{1-x^2}$, and 0 otherwise. This conditional PDF is constant as a function of y, which tells us that the conditional distribution of Y given $X = x$ is Uniform on the interval $[-\sqrt{1-x^2}, \sqrt{1-x^2}]$. The fact that the conditional PDF is not free of x confirms the fact that X and Y are not independent.
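
A short numeric check of the conditional: inside the support, the ratio of the joint PDF to the marginal should equal the constant $\frac{1}{2\sqrt{1-x^2}}$, and that constant times the length of the interval should be 1.

```python
import math

# Conditional PDF of Y given X = x on the unit disk, checked at x = 0.6.
x = 0.6
half_width = math.sqrt(1 - x * x)   # support of Y given X = x is [-0.8, 0.8]
f_x = (2 / math.pi) * half_width    # marginal f_X(x)
height = (1 / math.pi) / f_x        # joint / marginal, for y inside the support

print(abs(height - 1 / (2 * half_width)) < 1e-12)  # True: matches 1/(2*sqrt(1-x^2))
print(height * (2 * half_width))                   # area under the conditional PDF, close to 1
```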