Joint, Marginal, and Conditional Distributions
Sources:
- Joseph K. Blitzstein & Jessica Hwang. (2019). Joint distributions. Introduction to Probability (2nd ed., pp. 304-323). CRC Press.
Notation
Symbol | Type | Description |
---|---|---|
\(X, Y\) | Random variable | Random variables whose distributions are analyzed |
\(F_{X,Y}(x, y)\) | Function | Joint cumulative distribution function (CDF) for \(X\) and \(Y\) |
\(p_{X,Y}(x, y)\) | Function | Joint probability mass function (PMF) for discrete random variables \(X\) and \(Y\) |
\(f_{X,Y}(x, y)\) | Function | Joint probability density function (PDF) for continuous random variables \(X\) and \(Y\) |
\(f_X(x), f_Y(y)\) | Function | Marginal PDF of \(X\) and \(Y\), respectively |
\(f_{Y \mid X}(y \mid x)\) | Function | Conditional PDF of \(Y\) given \(X = x\) |
\(\iint_A f_{X,Y}(x, y) \, dx \, dy\) | Operation | Integral of the joint PDF \(f_{X,Y}\) over a region \(A \subseteq \mathbb{R}^2\) |
\(A \subseteq \mathbb{R}^2\) | Set | A subset of the two-dimensional real space |
Abbreviations
Abbreviation | Description |
---|---|
r.v. | Random variable |
CDF | Cumulative distribution function |
PMF | Probability mass function |
PDF | Probability density function |
LOTP | Law of total probability |
Discrete
The most general description of the joint distribution of two r.v.s is the joint CDF, which applies to discrete and continuous r.v.s alike.
Joint CDF
Definition: The joint CDF of r.v.s \(X\) and \(Y\) is the function \(F_{X, Y}\) given by \[ F_{X, Y}(x, y)=P(X \leq x, Y \leq y) \]
The joint CDF of \(n\) r.v.s is defined analogously.
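As a quick numerical illustration (not from the source), here is a minimal Python sketch assuming \(X\) and \(Y\) are independent standard Normal r.v.s (a hypothetical choice): the joint CDF at a point can be estimated by Monte Carlo as the fraction of simulated pairs landing in the quadrant \(\{X \leq x, Y \leq y\}\).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n = 100_000
x = rng.standard_normal(n)   # samples of X ~ N(0, 1)
y = rng.standard_normal(n)   # samples of Y ~ N(0, 1), independent of X

# Joint CDF at (1, 0): F_{X,Y}(1, 0) = P(X <= 1, Y <= 0).
print(np.mean((x <= 1) & (y <= 0)))            # Monte Carlo estimate, roughly 0.42
print(stats.norm.cdf(1) * stats.norm.cdf(0))   # exact value by independence: ~0.4207
```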
For discrete r.v.s, the joint CDF often consists of jumps and flat regions, so we typically work with the joint PMF instead.
Joint PMF
Definition: The joint PMF of discrete r.v.s \(X\) and \(Y\) is the function \(p_{X, Y}\) given by \[ p_{X, Y}(x, y)=P(X=x, Y=y) . \]
The joint PMF of \(n\) discrete r.v.s is defined analogously.
Just as univariate PMFs must be nonnegative and sum to 1, we require valid joint PMFs to be nonnegative and sum to 1, where the sum is taken over all possible values of \(X\) and \(Y\): \[ \sum_x \sum_y P(X=x, Y=y)=1 . \]
Marginal PMF
Definition: For discrete r.v.s \(X\) and \(Y\), the marginal PMF of \(X\) is \[ P(X=x)=\sum_y P(X=x, Y=y) . \]
The operation of summing over the possible values of \(Y\) in order to convert the joint PMF into the marginal PMF of \(X\) is known as marginalizing out \(Y\).
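As a small numerical illustration (my own, not from the source), a joint PMF of discrete r.v.s with finitely many values can be stored as a 2D array, with rows indexed by the values of \(X\) and columns by the values of \(Y\); marginalizing out \(Y\) is then a row-wise sum. The specific probabilities below are hypothetical.

```python
import numpy as np

# Hypothetical joint PMF of X in {0, 1, 2} (rows) and Y in {0, 1} (columns).
p_xy = np.array([
    [0.10, 0.20],
    [0.15, 0.25],
    [0.05, 0.25],
])

# A valid joint PMF is nonnegative and sums to 1 over all (x, y).
assert np.all(p_xy >= 0) and np.isclose(p_xy.sum(), 1.0)

p_x = p_xy.sum(axis=1)   # marginal PMF of X (sum over y): [0.30, 0.40, 0.30]
p_y = p_xy.sum(axis=0)   # marginal PMF of Y (sum over x): [0.30, 0.70]
```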
Conditional PMF
Definition: For discrete r.v.s \(X\) and \(Y\), the conditional PMF of \(Y\) given \(X=x\) is \[ P(Y=y \mid X=x)=\frac{P(X=x, Y=y)}{P(X=x)} \]
This is viewed as a function of \(y\) for fixed \(x\).
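In the same hypothetical array representation used above, the conditional PMF of \(Y\) given \(X=x\) is the row for \(x\) renormalized by \(P(X=x)\); a minimal sketch:

```python
import numpy as np

# Same hypothetical joint PMF as in the previous sketch.
p_xy = np.array([
    [0.10, 0.20],
    [0.15, 0.25],
    [0.05, 0.25],
])

def conditional_pmf_y_given_x(p_xy, x):
    """Conditional PMF of Y given X = x: the row for x, renormalized by P(X = x)."""
    p_x = p_xy[x].sum()                      # P(X = x)
    if p_x == 0:
        raise ValueError("need P(X = x) > 0 to condition on X = x")
    return p_xy[x] / p_x                     # nonnegative and sums to 1

print(conditional_pmf_y_given_x(p_xy, 1))    # [0.375 0.625]
```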
Independence of discrete r.v.s
Definition: Random variables \(X\) and \(Y\) are independent if for all \(x\) and \(y\), \[ F_{X, Y}(x, y)=F_X(x) F_Y(y) . \]
If \(X\) and \(Y\) are discrete, this is equivalent to the condition
\[ P(X=x, Y=y)=P(X=x) P(Y=y) \]
for all \(x, y\), and it is also equivalent to the condition
\[ P(Y=y \mid X=x)=P(Y=y) \]
for all \(x, y\) such that \(P(X=x)>0\).
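For finitely many values, this condition can be checked directly by comparing the joint PMF to the outer product of the marginals; a minimal sketch, again with the hypothetical joint PMF from above (which turns out not to be independent):

```python
import numpy as np

p_xy = np.array([
    [0.10, 0.20],
    [0.15, 0.25],
    [0.05, 0.25],
])
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

# Independence holds iff P(X = x, Y = y) = P(X = x) P(Y = y) for all x, y.
print(np.allclose(p_xy, np.outer(p_x, p_y)))
# False: e.g. P(X=0, Y=0) = 0.10, but P(X=0) P(Y=0) = 0.30 * 0.30 = 0.09.
```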
Continuous
Once we have a handle on discrete joint distributions, it isn't much harder to consider continuous joint distributions. We simply make the now-familiar substitutions of integrals for sums and PDFs for PMFs, remembering that the probability of any individual point is now 0.
Formally, in order for \(X\) and \(Y\) to have a continuous joint distribution, we require that the joint CDF
\[ F_{X, Y}(x, y)=P(X \leq x, Y \leq y) \]
be differentiable with respect to \(x\) and \(y\). The partial derivative with respect to \(x\) and \(y\) is called the joint PDF. The joint PDF determines the joint distribution, as does the joint CDF.
Joint PDF
Definition: If \(X\) and \(Y\) are continuous with joint CDF \(F_{X, Y}\), their joint PDF is the derivative of the joint CDF with respect to \(x\) and \(y\) : \[ f_{X, Y}(x, y)=\frac{\partial^2}{\partial x \partial y} F_{X, Y}(x, y) . \]
We require valid joint PDFs to be nonnegative and integrate to 1:
\[ f_{X, Y}(x, y) \geq 0, \text { and } \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_{X, Y}(x, y) d x d y=1 \]
In the univariate case, the PDF was the function we integrated to get the probability of an interval. Similarly, the joint PDF of two r.v.s is the function we integrate to get the probability of a two-dimensional region. For example,
\[ P(X<3,1<Y<4)=\int_1^4 \int_{-\infty}^3 f_{X, Y}(x, y) d x d y . \]
For a general region \(A \subseteq \mathbb{R}^2\),
\[ P((X, Y) \in A)=\iint_A f_{X, Y}(x, y) d x d y . \]
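For concreteness (this example is not from the source), here is a minimal Python sketch that evaluates such a rectangle probability by numerical double integration, assuming \(X\) and \(Y\) are independent standard Normals so that the joint PDF is a product of two Normal densities.

```python
import numpy as np
from scipy import integrate, stats

# Hypothetical joint PDF: X and Y independent standard Normal r.v.s.
def f_xy(x, y):
    return stats.norm.pdf(x) * stats.norm.pdf(y)

# P(X < 3, 1 < Y < 4): integrate the joint PDF over the rectangle.
# dblquad treats the first argument of its integrand as the inner variable (y here).
prob, _ = integrate.dblquad(lambda y, x: f_xy(x, y), -np.inf, 3, 1, 4)
print(prob)                                                         # ~0.158
print(stats.norm.cdf(3) * (stats.norm.cdf(4) - stats.norm.cdf(1)))  # exact, by independence
```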
Example
Figure 7.4 shows a sketch of what a joint PDF of two r.v.s could look like. As usual with continuous r.v.s, we need to keep in mind that the height of the surface \(f_{X, Y}(x, y)\) at a single point does not represent a probability. The probability of any specific point in the plane is 0. Now that we've gone up a dimension, the probability of any line or curve in the plane is also 0. The only way we can get nonzero probability is by integrating over a region of positive area in the \(xy\)-plane.
When we integrate the joint PDF over a region \(A\), we are calculating the volume under the surface of the joint PDF and above \(A\). Thus, probability is represented by volume under the joint PDF. The total volume under a valid joint PDF is 1.
Marginal PDF
Definition: For continuous r.v.s \(X\) and \(Y\) with joint PDF \(f_{X, Y}\), the marginal PDF of \(X\) is \[ f_X(x)=\int_{-\infty}^{\infty} f_{X, Y}(x, y) d y . \]
This is the PDF of \(X\), viewing \(X\) individually rather than jointly with \(Y\).
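As a minimal numerical sketch (my own illustration, with a hypothetical joint PDF), one can recover \(f_X\) by integrating the joint PDF over \(y\). Here the joint PDF is \(f_{X, Y}(x, y)=x+y\) on the unit square and 0 elsewhere, whose marginal is \(f_X(x)=x+\tfrac{1}{2}\) for \(0 \leq x \leq 1\).

```python
from scipy import integrate

# Hypothetical joint PDF: f(x, y) = x + y on the unit square, 0 elsewhere.
def f_xy(x, y):
    return x + y if 0 <= x <= 1 and 0 <= y <= 1 else 0.0

def f_x(x):
    # Marginal PDF of X: integrate the joint PDF over all values of y.
    return integrate.quad(lambda y: f_xy(x, y), 0, 1)[0]

print(f_x(0.3))   # ~0.8, matching the closed form x + 1/2
```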
To simplify notation, we have mainly been looking at the joint distribution of two r.v.s rather than \(n\) r.v.s, but marginalization works analogously with any number of variables. For example, if we have the joint PDF of \(X, Y, Z, W\) but want the joint PDF of \(X, W\), we just have to integrate over all possible values of \(Y\) and \(Z\) : \[ f_{X, W}(x, w)=\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_{X, Y, Z, W}(x, y, z, w) d y d z . \]
Conceptually this is easy-just integrate over the unwanted variables to get the joint PDF of the wanted variables-but computing it may or may not be easy.
Returning to the case of the joint distribution of two r.v.s \(X\) and \(Y\), let's consider how to update our distribution for \(Y\) after observing the value of \(X\), using the conditional PDF.
Conditional PDF
Definition: For continuous r.v.s \(X\) and \(Y\) with joint PDF \(f_{X, Y}\), the conditional PDF of \(Y\) given \(X=x\) is \[ f_{Y \mid X}(y \mid x)=\frac{f_{X, Y}(x, y)}{f_X(x)}, \]
for all \(x\) with \(f_X(x)>0\). This is considered as a function of \(y\) for fixed \(x\). As a convention, in order to make \(f_{Y \mid X}(y \mid x)\) well-defined for all real \(x\), let \(f_{Y \mid X}(y \mid x)=0\) for all \(x\) with \(f_X(x)=0\).
Notation: The subscripts that we place on all the \(f\)'s are just to remind us that we have three different functions on our plate. We could just as well write \(g(y \mid x)=f(x, y) / h(x)\), where \(f\) is the joint PDF, \(h\) is the marginal PDF of \(X\), and \(g\) is the conditional PDF of \(Y\) given \(X=x\), but that makes it more difficult to remember which letter stands for which function.
Note: For a continuous r.v. \(X\), we know that \(P(X=x)=0\) for every \(x\). So how can we speak of conditioning on \(X=x\) when its probability is 0? Rigorously speaking, we are actually conditioning on the event that \(X\) falls within a small interval containing \(x\), say \(X \in(x-\epsilon, x+\epsilon)\), and then taking a limit as \(\epsilon\) approaches 0 from the right. We will not fuss over this technicality; fortunately, many important results such as Bayes' rule work in the continuous case exactly as one would hope.
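As a minimal sketch (again using the hypothetical joint PDF \(f_{X, Y}(x, y)=x+y\) on the unit square from the marginal-PDF sketch above), the conditional PDF is just the joint PDF renormalized by the marginal:

```python
from scipy import integrate

def f_xy(x, y):                       # hypothetical joint PDF on the unit square
    return x + y if 0 <= x <= 1 and 0 <= y <= 1 else 0.0

def f_x(x):                           # marginal PDF of X: x + 1/2 on [0, 1]
    return integrate.quad(lambda y: f_xy(x, y), 0, 1)[0]

def f_y_given_x(y, x):                # conditional PDF of Y given X = x
    fx = f_x(x)
    return f_xy(x, y) / fx if fx > 0 else 0.0

print(f_y_given_x(0.7, 0.3))          # (0.3 + 0.7) / (0.3 + 0.5) = 1.25
```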
Continuous form of Bayes' rule and LOTP
Theorem: For continuous r.v.s \(X\) and \(Y\), we have the following continuous form of Bayes' rule: \[ f_{Y \mid X}(y \mid x)=\frac{f_{X \mid Y}(x \mid y) f_Y(y)}{f_X(x)}, \text { for } f_X(x)>0 \]
And we have the following continuous form of the law of total probability:
\[ f_X(x)=\int_{-\infty}^{\infty} f_{X \mid Y}(x \mid y) f_Y(y) d y . \]
Proof. By definition of conditional PDFs, we have \[ f_{Y \mid X}(y \mid x) f_X(x)=f_{X, Y}(x, y)=f_{X \mid Y}(x \mid y) f_Y(y) . \]
The continuous version of Bayes' rule follows immediately from dividing by \(f_X(x)\). The continuous version of LOTP follows immediately from integrating with respect to \(y\) : \[ f_X(x)=\int_{-\infty}^{\infty} f_{X, Y}(x, y) d y=\int_{-\infty}^{\infty} f_{X \mid Y}(x \mid y) f_Y(y) d y \]
Out of curiosity, let's see what would have happened if we had plugged in the other expression for \(f_{X, Y}(x, y)\) instead in the proof of LOTP:
\[ f_X(x)=\int_{-\infty}^{\infty} f_{X, Y}(x, y) d y=\int_{-\infty}^{\infty} f_{Y \mid X}(y \mid x) f_X(x) d y=f_X(x) \int_{-\infty}^{\infty} f_{Y \mid X}(y \mid x) d y . \]
This just says that, for any \(x\) with \(f_X(x)>0\),
\[ \int_{-\infty}^{\infty} f_{Y \mid X}(y \mid x) d y=1, \]
confirming the fact that conditional PDFs must integrate to 1.
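Both facts can be sanity-checked numerically; here is a minimal sketch with the same hypothetical joint PDF \(f_{X, Y}(x, y)=x+y\) on the unit square, for which \(f_X(x)=x+\tfrac{1}{2}\) and \(f_Y(y)=y+\tfrac{1}{2}\).

```python
from scipy import integrate

def f_xy(x, y):        # hypothetical joint PDF on the unit square
    return x + y if 0 <= x <= 1 and 0 <= y <= 1 else 0.0

def f_y(y):            # marginal PDF of Y: y + 1/2 on [0, 1]
    return integrate.quad(lambda x: f_xy(x, y), 0, 1)[0]

def f_x_given_y(x, y): # conditional PDF of X given Y = y
    return f_xy(x, y) / f_y(y)

x0 = 0.3
# Continuous LOTP: f_X(x) should equal the integral of f_{X|Y}(x|y) f_Y(y) over y.
print(integrate.quad(lambda y: f_x_given_y(x0, y) * f_y(y), 0, 1)[0])  # ~0.8 = x0 + 1/2

# A conditional PDF integrates to 1: here f_{Y|X}(y | x0) = (x0 + y) / (x0 + 1/2).
print(integrate.quad(lambda y: f_xy(x0, y) / (x0 + 0.5), 0, 1)[0])     # ~1.0
```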
Independence of continuous r.v.s
Definition: Random variables \(X\) and \(Y\) are independent if for all \(x\) and \(y\), \[ F_{X, Y}(x, y)=F_X(x) F_Y(y) . \]
If \(X\) and \(Y\) are continuous with joint PDF \(f_{X, Y}\), this is equivalent to the condition
\[ f_{X, Y}(x, y)=f_X(x) f_Y(y) \]
for all \(x, y\), and it is also equivalent to the condition
\[ f_{Y \mid X}(y \mid x)=f_Y(y) \]
for all \(x, y\) such that \(f_X(x)>0\).
Here is an important proposition relating independence to factorization of the joint PDF.
Independence via joint PDF factorization
Proposition: Suppose that the joint \(\operatorname{PDF} f_{X, Y}\) of \(X\) and \(Y\) factors as \[ f_{X, Y}(x, y)=g(x) h(y) \]
for all \(x\) and \(y\), where \(g\) and \(h\) are nonnegative functions. Then \(X\) and \(Y\) are independent. Also, if either \(g\) or \(h\) is a valid PDF, then the other one is a valid PDF too and \(g\) and \(h\) are the marginal PDFs of \(X\) and \(Y\), respectively. (The analogous result in the discrete case also holds.)
Proof. Let \(c=\int_{-\infty}^{\infty} h(y) d y\). Multiplying and dividing by \(c\), we can write \[ f_{X, Y}(x, y)=c g(x) \cdot \frac{h(y)}{c} . \]
(The point of this is that \(h(y) / c\) is a valid PDF.) Then the marginal PDF of \(X\) is
\[ f_X(x)=\int_{-\infty}^{\infty} f_{X, Y}(x, y) d y=c g(x) \int_{-\infty}^{\infty} \frac{h(y)}{c} d y=c g(x) . \]
It follows that \(\int_{-\infty}^{\infty} c g(x) d x=1\) since a marginal PDF is a valid PDF (knowing the integral of \(h\) gave us the integral of \(g\) for free!). Then the marginal PDF of \(Y\) is
\[ f_Y(y)=\int_{-\infty}^{\infty} f_{X, Y}(x, y) d x=\frac{h(y)}{c} \int_{-\infty}^{\infty} c g(x) d x=\frac{h(y)}{c} . \]
Thus, \(f_{X, Y}(x, y)=f_X(x) f_Y(y)\) for all \(x\) and \(y\), so \(X\) and \(Y\) are independent, with marginal PDFs \(c g(x)\) and \(h(y) / c\), respectively. If \(g\) or \(h\) is already a valid PDF, then \(c=1\), so the other one is also a valid PDF.
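As a quick numerical check of the proposition (my own sketch, with a hypothetical joint PDF that factors on all of \(\mathbb{R}^2\)): take \(f_{X, Y}(x, y)=6 e^{-2x-3y}\) for \(x, y>0\) and 0 otherwise, split it as \(g(x) h(y)\), and recover the marginals \(c\,g(x)\) and \(h(y)/c\), which here are the Expo(2) and Expo(3) densities.

```python
import numpy as np
from scipy import integrate

# Hypothetical joint PDF that factors for all (x, y): 6 * exp(-2x - 3y) for x, y > 0.
def g(x):                                 # nonnegative factor in x (not normalized)
    return 6 * np.exp(-2 * x) if x > 0 else 0.0

def h(y):                                 # nonnegative factor in y (not normalized)
    return np.exp(-3 * y) if y > 0 else 0.0

c = integrate.quad(h, 0, np.inf)[0]       # c = integral of h = 1/3

f_x = lambda x: c * g(x)                  # marginal of X: 2 * exp(-2x), the Expo(2) PDF
f_y = lambda y: h(y) / c                  # marginal of Y: 3 * exp(-3y), the Expo(3) PDF
print(integrate.quad(f_x, 0, np.inf)[0])  # ~1.0, so c*g is a valid PDF
print(integrate.quad(f_y, 0, np.inf)[0])  # ~1.0, so h/c is a valid PDF
```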
Note: In the above proposition, we need the joint PDF to factor as a function of \(x\) times a function of \(y\) for all \((x, y)\) in the plane \(\mathbb{R}^2\), not just for \((x, y)\) with \(f_{X, Y}(x, y)>0\). The reason for this is illustrated in the next example.
A simple case of a continuous joint distribution is when the joint PDF is constant over some region in the plane. In the following example, we'll compare a joint PDF that is constant on a square to a joint PDF that is constant on a disk.
Example: Uniform on a region in the plane
Let \((X, Y)\) be a completely random point in the square \(\{(x, y): x, y \in[0,1]\}\), in the sense that the joint PDF of \(X\) and \(Y\) is constant over the square and 0 outside of it:
\[ f_{X, Y}(x, y)= \begin{cases}1 & \text { if } x, y \in[0,1] \\ 0 & \text { otherwise. }\end{cases} \]
The constant 1 is chosen so that the joint PDF will integrate to 1. This distribution is called the Uniform distribution on the square.
Intuitively, it makes sense that \(X\) and \(Y\) should be \(\operatorname{Unif}(0,1)\) marginally. We can check this by computing
\[ f_X(x)=\int_0^1 f_{X, Y}(x, y) d y=\int_0^1 1 d y=1, \] and similarly for \(f_Y\). Furthermore, \(X\) and \(Y\) are independent, since the joint PDF factors into the product of the marginal PDFs (this just reduces to \(1=1 \cdot 1\), but it's important to note that the value of \(X\) does not constrain the possible values of \(Y\)). So the conditional distribution of \(Y\) given \(X=x\) is \(\operatorname{Unif}(0,1)\), regardless of \(x\).
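A quick simulation illustrates the independence (a minimal sketch, my own): sampling a uniform point in the square, the distribution of \(Y\) is unchanged by conditioning on an event involving \(X\).

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100_000
x, y = rng.random(n), rng.random(n)   # a uniformly random point in the unit square

# Independence: conditioning on X gives no information about Y.
print(np.mean(y[x <= 0.5] <= 0.5))    # P(Y <= 0.5 | X <= 0.5), roughly 0.5
print(np.mean(y <= 0.5))              # P(Y <= 0.5), roughly 0.5
```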
Now let \((X, Y)\) be a completely random point in the unit disk \(\left\{(x, y): x^2+y^2 \leq 1\right\}\), with joint PDF
\[ f_{X, Y}(x, y)= \begin{cases}\frac{1}{\pi} & \text { if } x^2+y^2 \leq 1 \\ 0 & \text { otherwise }\end{cases} \]
Again, the constant \(1 / \pi\) is chosen to make the joint PDF integrate to 1; the value follows from the fact that the integral of 1 over a region in the plane is the area of that region, and the unit disk has area \(\pi\).
Note that \(X\) and \(Y\) are not independent, since in general, knowing the value of \(X\) constrains the possible values of \(Y\): larger values of \(|X|\) restrict \(Y\) to a smaller range. It would be a misuse of the previous proposition to conclude independence from the fact that \(f_{X, Y}(x, y)=g(x) h(y)\) for all \((x, y)\) in the disk, where \(g(x)=1 / \pi\) and \(h(y)=1\) are constant functions: the factorization fails outside the disk, where the joint PDF is 0 but \(g(x) h(y)\) is not. To see from the definition that \(X\) and \(Y\) are not independent, note that, for example, \(f_{X, Y}(0.9,0.9)=0\) since \((0.9,0.9)\) is not in the unit disk, but \(f_X(0.9) f_Y(0.9) \neq 0\) since 0.9 is in the supports of both \(X\) and \(Y\).
The marginal distribution of \(X\) is now
\[ f_X(x)=\int_{-\sqrt{1-x^2}}^{\sqrt{1-x^2}} \frac{1}{\pi} d y=\frac{2}{\pi} \sqrt{1-x^2}, \quad-1 \leq x \leq 1 \]
By symmetry, \(f_Y(y)=\frac{2}{\pi} \sqrt{1-y^2}\). Note that the marginal distributions of \(X\) and \(Y\) are not Uniform on \([-1,1]\); rather, \(X\) and \(Y\) are more likely to fall near 0 than near \(\pm 1\).
Suppose we observe \(X=x\). As illustrated in Figure 7.6, this constrains \(Y\) to lie in the interval \(\left[-\sqrt{1-x^2}, \sqrt{1-x^2}\right]\). Specifically, the conditional distribution of \(Y\) given \(X=x\) is
\[ f_{Y \mid X}(y \mid x)=\frac{f_{X, Y}(x, y)}{f_X(x)}=\frac{\frac{1}{\pi}}{\frac{2}{\pi} \sqrt{1-x^2}}=\frac{1}{2 \sqrt{1-x^2}} \]
for \(-\sqrt{1-x^2} \leq y \leq \sqrt{1-x^2}\), and 0 otherwise. This conditional PDF is constant as a function of \(y\), which tells us that the conditional distribution of \(Y\) given \(X=x\) is Uniform on the interval \(\left[-\sqrt{1-x^2}, \sqrt{1-x^2}\right]\). The fact that this conditional distribution depends on \(x\) confirms that \(X\) and \(Y\) are not independent.
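A simulation sketch (my own) makes both facts visible: sampling uniformly in the disk by rejection from the square, the empirical density of \(X\) near a point matches \(\frac{2}{\pi}\sqrt{1-x^2}\), and conditional on \(X\) being near \(x=0.5\), the observed \(Y\) values spread over roughly \([-\sqrt{0.75}, \sqrt{0.75}] \approx [-0.866, 0.866]\).

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200_000
x, y = rng.uniform(-1, 1, n), rng.uniform(-1, 1, n)
keep = x**2 + y**2 <= 1                          # rejection sampling: keep points in the unit disk
x, y = x[keep], y[keep]

# Empirical marginal density of X near x = 0.5 vs. the formula (2/pi) * sqrt(1 - x^2).
h = 0.01
print(np.mean(np.abs(x - 0.5) < h) / (2 * h))    # ~0.55
print(2 / np.pi * np.sqrt(1 - 0.5**2))           # 0.5513...

# Conditional on X near 0.5, Y should look Uniform on [-sqrt(0.75), sqrt(0.75)].
y_cond = y[np.abs(x - 0.5) < h]
print(y_cond.min(), y_cond.max())                # close to -0.866 and 0.866
```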