Mutual Information

Sources:

  1. Elements of Information Theory
  2. An Introduction to Single-User Information Theory

Definition

Given two random variables $X$ and $Y$, we want to define a measure of the information that $Y$ provides about $X$ when $Y$ is observed, but $X$ is not. We call this measure mutual information, which is defined as

$$I(X;Y) := H(X) - H(X \mid Y).$$

It can also be expressed as the relative entropy between their joint distribution $p_{X,Y}$ and the product of their marginal distributions $p_X p_Y$:

$$I(X;Y) = \sum_{x,y} p_{X,Y}(x,y) \log \frac{p_{X,Y}(x,y)}{p_X(x)\,p_Y(y)} = D(p_{X,Y} \,\|\, p_X p_Y).$$
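
As a quick numerical illustration (a sketch, not taken from the sources; the $2\times 2$ joint table below is made up), the following snippet evaluates $I(X;Y)$ both as the relative entropy $D(p_{X,Y}\,\|\,p_X p_Y)$ and as $H(X)+H(Y)-H(X,Y)$, and the two values agree:

```python
# A sketch (not from the sources): the 2x2 joint pmf below is made up.
import numpy as np

p_xy = np.array([[0.30, 0.10],
                 [0.15, 0.45]])        # p_xy[x, y]
p_x = p_xy.sum(axis=1)                 # marginal of X
p_y = p_xy.sum(axis=0)                 # marginal of Y

def H(q):
    """Shannon entropy in bits of a pmf given as an array."""
    q = q[q > 0]
    return -np.sum(q * np.log2(q))

# I(X;Y) as the relative entropy D(p_{X,Y} || p_X p_Y)
I_kl = np.sum(p_xy * np.log2(p_xy / np.outer(p_x, p_y)))

# I(X;Y) = H(X) + H(Y) - H(X,Y)
I_ent = H(p_x) + H(p_y) - H(p_xy)

print(I_kl, I_ent)   # the two values agree (up to floating-point error)
```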

Properties of Mutual Information

Basics

  1. $I(X;Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} P_{X,Y}(x,y) \log_2 \frac{P_{X,Y}(x,y)}{P_X(x) P_Y(y)}$.

  2. $I(X;Y) = I(Y;X) = H(Y) - H(Y \mid X)$.

  3. $I(X;Y) = H(X) + H(Y) - H(X,Y)$.

  4. $I(X;Y) \le H(X)$, with equality holding iff $X$ is a function of $Y$ (i.e., $X = f(Y)$ for some function $f(\cdot)$).

  5. $I(X;Y) \ge 0$, with equality holding iff $X$ and $Y$ are independent.

  6. $I(X;Y) \le \min\{\log_2 |\mathcal{X}|, \log_2 |\mathcal{Y}|\}$.

Proof:

Properties 1, 2, 3, and 4 follow immediately from the definition.

Property 5 is a direct consequence of $D(p \| q) \ge 0$, since $I(X;Y) = D(p_{X,Y} \,\|\, p_X p_Y)$.

Property 6 holds iff $I(X;Y) \le \log_2 |\mathcal{X}|$ and $I(X;Y) \le \log_2 |\mathcal{Y}|$.

To show the first inequality, we write $I(X;Y) = H(X) - H(X \mid Y)$, use the fact that $H(X \mid Y)$ is nonnegative, and apply the theorem $H(X) \le \log_2 |\mathcal{X}|$. A similar proof can be used to show that $I(X;Y) \le \log_2 |\mathcal{Y}|$.

The relationships between $H(X)$, $H(Y)$, $H(X,Y)$, $H(X \mid Y)$, $H(Y \mid X)$, and $I(X;Y)$ can be illustrated by a Venn diagram.

$I(X;Y) = I(Y;X)$

Expanding $H(X) - H(X \mid Y)$, we have:

$$H(X) - H(X \mid Y) = E\!\left[\log\frac{1}{p(X)}\right] - E\!\left[\log\frac{1}{p(X \mid Y)}\right] = E\!\left[\log\frac{p(X \mid Y)}{p(X)}\right] = E\!\left[\log\frac{p(X \mid Y)\,p(Y)}{p(X)\,p(Y)}\right] = E\!\left[\log\frac{p(X,Y)}{p(X)\,p(Y)}\right] = E\!\left[\log\frac{p(Y \mid X)}{p(Y)}\right] = H(Y) - H(Y \mid X).$$

Then $I(X;Y) := H(X) - H(X \mid Y) = H(Y) - H(Y \mid X)$.

So mutual information is symmetric.
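
As a concrete illustration of the symmetry (a standard worked example, not taken from the sources): let $X$ be a fair bit and let $Y = X \oplus N$ be its observation through a binary symmetric channel, where $N \sim \mathrm{Bernoulli}(p)$ is independent of $X$. Then $Y$ is also uniform and $H(Y \mid X) = H(N) = h_b(p)$; moreover $N$ is likewise independent of $Y$ and $X = Y \oplus N$, so $H(X \mid Y) = H(N) = h_b(p)$ as well. Hence both forms give the same value:

$$I(X;Y) = H(Y) - H(Y \mid X) = 1 - h_b(p) = H(X) - H(X \mid Y),$$

where $h_b(p) := -p\log_2 p - (1-p)\log_2(1-p)$ is the binary entropy function.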

$I(X;X) = H(X)$

Now let’s ask an interesting question: How much does $X$ tell us about itself? In other words, what is $I(X;X)$? Using our first definition, we have $I(X;X) = H(X) - H(X \mid X)$.

We note that $H(X \mid X) = 0$: given $X$, the conditional distribution of $X$ places probability 1 on a single fixed value, so every term in the expectation is $\log 1 = 0$. Thus $I(X;X) = H(X)$, meaning that $X$ tells us everything about itself!

$I(X;Y) \ge 0$

For any two random variables $X, Y$, $I(X;Y) \ge 0$, with equality if and only if $X$ and $Y$ are independent.

Proof:

  1. We know that $I(X;Y) = D(p(x,y) \,\|\, p(x)p(y))$.
  2. See property 3 of Relative Entropy: $D(p \| q) \ge 0$ for all distributions $p, q$, with equality holding iff $p = q$.
  3. So $I(X;Y) = D(p(x,y) \,\|\, p(x)p(y)) \ge 0$, with equality if and only if $p(x,y) = p(x)p(y)$ (i.e., $X$ and $Y$ are independent).

Corollary: $I(X;Y \mid Z) \ge 0$, with equality if and only if $X$ and $Y$ are conditionally independent given $Z$.

Conditional Mutual Information

(Figure) Venn diagram of information-theoretic measures for three variables $x$, $y$, and $z$, represented by the lower-left, lower-right, and upper circles, respectively.

The conditional mutual informations $I(x;z \mid y)$, $I(y;z \mid x)$, and $I(x;y \mid z)$ are represented by the yellow, cyan, and magenta regions, respectively.

The conditional mutual information, denoted by $I(X;Y \mid Z)$, is defined as the common uncertainty between $X$ and $Y$ under the knowledge of $Z$: $I(X;Y \mid Z) := H(X \mid Z) - H(X \mid Y,Z)$.

One way to picture this: $I(X;Y)$ is the intersection of $H(X)$ and $H(Y)$; carving out the part of it that lies inside $H(Z)$ leaves $I(X;Y \mid Z)$. This corresponds to the magenta region in the figure.
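
A minimal numerical sketch (the $2\times2\times2$ joint pmf below is made up) that evaluates $I(X;Y \mid Z)$ as the $Z$-average of the mutual information of the conditional joint $p(x,y \mid z)$, which is equivalent to the definition $H(X \mid Z) - H(X \mid Y,Z)$:

```python
# A sketch (not from the sources): p[x, y, z] below is a made-up 2x2x2 joint pmf.
import numpy as np

p = np.array([[[0.10, 0.05],
               [0.05, 0.20]],
              [[0.15, 0.10],
               [0.05, 0.30]]])

p_z = p.sum(axis=(0, 1))               # marginal of Z

# I(X;Y|Z) = sum_z p(z) * D( p(x,y|z) || p(x|z) p(y|z) )
I_xy_given_z = 0.0
for z in range(p.shape[2]):
    q = p[:, :, z] / p_z[z]            # conditional joint p(x, y | z)
    qx, qy = q.sum(axis=1), q.sum(axis=0)
    I_xy_given_z += p_z[z] * np.sum(q * np.log2(q / np.outer(qx, qy)))

print(I_xy_given_z)   # nonnegative; zero iff X and Y are conditionally independent given Z
```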

Joint Mutual Information

Lemma: Defining the joint mutual information between $X$ and the pair $(Y,Z)$ by $I(X;Y,Z) := H(X) - H(X \mid Y,Z)$, we have $I(X;Y,Z) = I(X;Y) + I(X;Z \mid Y) = I(X;Z) + I(X;Y \mid Z)$. In words: the mutual information between $X$ and $(Y,Z)$ equals the mutual information between $X$ and $Y$, plus the mutual information between $X$ and $Z$ when $Y$ is already known.

One way to picture this: merging $H(Y)$ and $H(Z)$ into a single region gives $H(Y,Z)$, and $I(X;Y,Z)$ is the intersection of $H(X)$ with $H(Y,Z)$. This corresponds to the union of the yellow, gray, and magenta regions in the figure.

Proof: Without loss of generality, we only prove the first equality:

$$I(X;Y,Z) = H(X) - H(X \mid Y,Z) = \big[H(X) - H(X \mid Y)\big] + \big[H(X \mid Y) - H(X \mid Y,Z)\big] = I(X;Y) + I(X;Z \mid Y).$$

The above lemma can be read as follows: the information that (Y,Z) has about X is equal to the information that Y has about X plus the information that Z has about X when Y is already known.
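
The lemma can also be checked numerically; the sketch below (again with a made-up $2\times2\times2$ joint pmf) evaluates all three expressions via joint entropies and confirms they coincide:

```python
# A small numerical check (not from the sources) of
# I(X;Y,Z) = I(X;Y) + I(X;Z|Y) = I(X;Z) + I(X;Y|Z), with a made-up pmf p[x, y, z].
import numpy as np

p = np.array([[[0.12, 0.08],
               [0.05, 0.15]],
              [[0.10, 0.20],
               [0.06, 0.24]]])

def H(q):
    """Shannon entropy in bits of a pmf of any shape."""
    q = q[q > 0]
    return -np.sum(q * np.log2(q))

p_x, p_y, p_z = p.sum((1, 2)), p.sum((0, 2)), p.sum((0, 1))
p_xy, p_xz, p_yz = p.sum(2), p.sum(1), p.sum(0)

I_x_yz       = H(p_x) + H(p_yz) - H(p)                   # I(X;Y,Z)
I_xy         = H(p_x) + H(p_y) - H(p_xy)                 # I(X;Y)
I_xz_given_y = (H(p_xy) - H(p_y)) - (H(p) - H(p_yz))     # I(X;Z|Y) = H(X|Y) - H(X|Y,Z)
I_xz         = H(p_x) + H(p_z) - H(p_xz)                 # I(X;Z)
I_xy_given_z = (H(p_xz) - H(p_z)) - (H(p) - H(p_yz))     # I(X;Y|Z) = H(X|Z) - H(X|Y,Z)

print(I_x_yz, I_xy + I_xz_given_y, I_xz + I_xy_given_z)  # all three agree
```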

Properties of Entropy and Mutual Information for Multiple Random Variables

Chain rule for joint entropy

Theorem: Let $X_1, X_2, \ldots, X_n$ be drawn according to $P_{X^n}(x^n) := P_{X_1,\ldots,X_n}(x_1,\ldots,x_n)$, where we use the common superscript notation to denote an $n$-tuple: $X^n := (X_1,\ldots,X_n)$ and $x^n := (x_1,\ldots,x_n)$.

Then $H(X_1,X_2,\ldots,X_n) = \sum_{i=1}^{n} H(X_i \mid X_{i-1},\ldots,X_1)$, where $H(X_i \mid X_{i-1},\ldots,X_1) := H(X_1)$ for $i = 1$. (The above chain rule can also be written as $H(X^n) = \sum_{i=1}^{n} H(X_i \mid X^{i-1})$, where $X^i := (X_1,\ldots,X_i)$.)

For example, for three random variables $X$, $Y$, and $Z$,

$$H(X,Y,Z) = H(X) + H(Y,Z \mid X) = H(X) + H(Y \mid X) + H(Z \mid X,Y).$$


Proof:

From the chain rule for two random variables,

$$H(X_1,X_2,\ldots,X_n) = H(X_1,X_2,\ldots,X_{n-1}) + H(X_n \mid X_{n-1},\ldots,X_1).$$

Applying the chain rule for two random variables once again, this time to the first term on the right-hand side, we have

$$H(X_1,X_2,\ldots,X_{n-1}) = H(X_1,X_2,\ldots,X_{n-2}) + H(X_{n-1} \mid X_{n-2},\ldots,X_1).$$

The desired result is then obtained by repeatedly applying the chain rule for two random variables.
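
A small numerical sketch (made-up joint pmf) of the three-variable case $H(X,Y,Z) = H(X) + H(Y \mid X) + H(Z \mid X,Y)$; as in the proof, the right-hand side telescopes back to the joint entropy:

```python
# A quick check (not from the sources) of the three-variable chain rule
# using a made-up 2x2x2 joint pmf p[x, y, z].
import numpy as np

p = np.array([[[0.02, 0.18],
               [0.08, 0.12]],
              [[0.25, 0.05],
               [0.10, 0.20]]])

def H(q):
    """Shannon entropy in bits."""
    q = q[q > 0]
    return -np.sum(q * np.log2(q))

p_x  = p.sum(axis=(1, 2))
p_xy = p.sum(axis=2)

lhs = H(p)                                              # H(X,Y,Z)
rhs = H(p_x) + (H(p_xy) - H(p_x)) + (H(p) - H(p_xy))    # H(X) + H(Y|X) + H(Z|X,Y)
print(lhs, rhs)   # identical: the conditional-entropy sum telescopes
```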

Chain rule for conditional entropy

Theorem: $H(X_1,X_2,\ldots,X_n \mid Y) = \sum_{i=1}^{n} H(X_i \mid X_{i-1},\ldots,X_1,Y)$.


Proof:

The theorem can be proved similarly to the Chain Rule for Entropy (2 Variables).

If $X^n = (X_1,\ldots,X_n)$ and $Y^m = (Y_1,\ldots,Y_m)$ are jointly distributed random vectors (of not necessarily equal lengths), then their joint mutual information is given by

$$I(X_1,\ldots,X_n; Y_1,\ldots,Y_m) := H(X_1,\ldots,X_n) - H(X_1,\ldots,X_n \mid Y_1,\ldots,Y_m).$$

Chain rule for mutual information

Theorem: $I(X_1,X_2,\ldots,X_n; Y) = \sum_{i=1}^{n} I(X_i; Y \mid X_{i-1},\ldots,X_1)$, where $I(X_i; Y \mid X_{i-1},\ldots,X_1) := I(X_1; Y)$ for $i = 1$.


Proof:

This can be proved by first expressing mutual information in terms of entropy and conditional entropy, and then applying the chain rules for entropy and conditional entropy.

Independence bound on entropy

Theorem: $H(X_1,X_2,\ldots,X_n) \le \sum_{i=1}^{n} H(X_i)$.

Equality holds iff all the $X_i$'s are independent of each other.[^8]


Proof:

By applying the chain rule for entropy and the fact that conditioning cannot increase entropy,

$$H(X_1,X_2,\ldots,X_n) = \sum_{i=1}^{n} H(X_i \mid X_{i-1},\ldots,X_1) \le \sum_{i=1}^{n} H(X_i).$$

Equality holds iff each conditional entropy equals its corresponding unconditional entropy, i.e., iff $X_i$ is independent of $(X_{i-1},\ldots,X_1)$ for all $i$.
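
A numerical illustration (a sketch, not from the sources): draw a random joint pmf on $\{0,1\}^3$, check that its joint entropy does not exceed the sum of the marginal entropies, and check that the corresponding product pmf attains equality:

```python
# Sketch (not from the sources) of the independence bound H(X1,...,Xn) <= sum_i H(Xi).
import numpy as np

rng = np.random.default_rng(0)

def H(q):
    """Shannon entropy in bits."""
    q = q[q > 0]
    return -np.sum(q * np.log2(q))

def marginal(p, axis):
    """Marginal pmf of coordinate `axis` of a joint array p."""
    other = tuple(i for i in range(p.ndim) if i != axis)
    return p.sum(axis=other)

# random (generally dependent) joint pmf on {0,1}^3
p = rng.random((2, 2, 2))
p /= p.sum()
print(H(p), sum(H(marginal(p, i)) for i in range(3)))   # H(X1,X2,X3) <= sum H(Xi)

# product of the marginals: independent coordinates -> equality
q = np.einsum('i,j,k->ijk', *(marginal(p, i) for i in range(3)))
print(H(q), sum(H(marginal(q, i)) for i in range(3)))   # the two values are equal
```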

Bound on mutual information

Theorem: If $\{(X_i,Y_i)\}_{i=1}^{n}$ is a process satisfying the conditional independence assumption $P_{Y^n \mid X^n} = \prod_{i=1}^{n} P_{Y_i \mid X_i}$, then

$$I(X_1,\ldots,X_n; Y_1,\ldots,Y_n) \le \sum_{i=1}^{n} I(X_i; Y_i),$$

with equality holding iff $\{X_i\}_{i=1}^{n}$ are independent.


Proof:

From the independence bound on entropy, we have $H(Y_1,\ldots,Y_n) \le \sum_{i=1}^{n} H(Y_i)$.

By the conditional independence assumption, we have

$$H(Y_1,\ldots,Y_n \mid X_1,\ldots,X_n) = E\big[-\log_2 P_{Y^n \mid X^n}(Y^n \mid X^n)\big] = E\Big[-\sum_{i=1}^{n} \log_2 P_{Y_i \mid X_i}(Y_i \mid X_i)\Big] = \sum_{i=1}^{n} H(Y_i \mid X_i).$$

Hence,

$$I(X^n; Y^n) = H(Y^n) - H(Y^n \mid X^n) \le \sum_{i=1}^{n} H(Y_i) - \sum_{i=1}^{n} H(Y_i \mid X_i) = \sum_{i=1}^{n} I(X_i; Y_i),$$

with equality holding iff $\{Y_i\}_{i=1}^{n}$ are independent, which holds iff $\{X_i\}_{i=1}^{n}$ are independent.
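
A sanity check of the bound (a sketch, not from the sources): take $n = 2$ dependent input bits and pass each through an independent binary symmetric channel with crossover probability $\varepsilon = 0.1$, so that $P_{Y^2 \mid X^2}$ factorizes; the block mutual information then falls strictly below $I(X_1;Y_1) + I(X_2;Y_2)$ because $X_1$ and $X_2$ are dependent:

```python
# Sketch (not from the sources): I(X^n;Y^n) <= sum_i I(Xi;Yi) for a memoryless channel.
import numpy as np

def H(q):
    """Shannon entropy in bits."""
    q = q[q > 0]
    return -np.sum(q * np.log2(q))

def MI(pxy):
    """Mutual information of a 2-D joint pmf."""
    px, py = pxy.sum(1), pxy.sum(0)
    return H(px) + H(py) - H(pxy)

eps = 0.1
bsc = np.array([[1 - eps, eps],
                [eps, 1 - eps]])            # bsc[x, y] = P(Y = y | X = x)

p_x1x2 = np.array([[0.4, 0.1],
                   [0.1, 0.4]])             # dependent input bits (made up)

# joint over (X1, X2, Y1, Y2) under the memoryless-channel assumption
p = np.einsum('ab,ac,bd->abcd', p_x1x2, bsc, bsc)

I_block = H(p.sum((2, 3))) + H(p.sum((0, 1))) - H(p)   # I(X1,X2; Y1,Y2)
I_sum   = MI(p.sum((1, 3))) + MI(p.sum((0, 2)))        # I(X1;Y1) + I(X2;Y2)
print(I_block, I_sum)   # I_block <= I_sum; equality would require X1, X2 independent
```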