Chain Rules for Entropy, Relative Entropy and Mutual Information
We now show that the entropy of a collection of random variables is the sum of the conditional entropies.
Note: The proofs in this post are still rough and will be improved later.
Ref: Cover & Thomas, Elements of Information Theory
Chain rule for entropy
Let \(X_1, X_2, \ldots, X_n\) be drawn according to \(p\left(x_1, x_2, \ldots, x_n\right)\). Then \[ H\left(X_1, X_2, \ldots, X_n\right)=\sum_{i=1}^n H\left(X_i \mid X_{i-1}, \ldots, X_1\right) . \]
Proof: Short
By repeated application of the two-variable expansion rule for entropies, we have \[ \begin{aligned} H\left(X_1, X_2\right) & =H\left(X_1\right)+H\left(X_2 \mid X_1\right), \\ H\left(X_1, X_2, X_3\right) & =H\left(X_1\right)+H\left(X_2, X_3 \mid X_1\right) \\ & =H\left(X_1\right)+H\left(X_2 \mid X_1\right)+H\left(X_3 \mid X_2, X_1\right), \\ & \vdots \\ H\left(X_1, X_2, \ldots, X_n\right) & =H\left(X_1\right)+H\left(X_2 \mid X_1\right)+\cdots+H\left(X_n \mid X_{n-1}, \ldots, X_1\right) \\ & =\sum_{i=1}^n H\left(X_i \mid X_{i-1}, \ldots, X_1\right) . \end{aligned} \]
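As a quick sanity check (not part of the reference text), the sketch below verifies the three-variable case \(H(X_1, X_2, X_3)=H(X_1)+H(X_2 \mid X_1)+H(X_3 \mid X_2, X_1)\) numerically on a random joint pmf. The array shapes, the helper name `entropy`, and the use of NumPy are illustrative choices.

```python
# A minimal numerical check of the chain rule for entropy on a random
# joint pmf p(x1, x2, x3).  Names and setup are illustrative, not from the book.
import numpy as np

rng = np.random.default_rng(0)
p = rng.random((2, 3, 2))            # strictly positive joint pmf: |X1|=2, |X2|=3, |X3|=2
p /= p.sum()

def entropy(q):
    """Entropy in bits of a pmf given as an array of probabilities."""
    q = q[q > 0]
    return -np.sum(q * np.log2(q))

p12 = p.sum(axis=2)                  # marginal p(x1, x2)
p1 = p12.sum(axis=1)                 # marginal p(x1)

# Conditional entropies computed directly from conditional pmfs:
# H(X2 | X1) = -sum p(x1, x2) log2 p(x2 | x1), and similarly for H(X3 | X2, X1).
h1 = entropy(p1)
h2_given_1 = -np.sum(p12 * np.log2(p12 / p1[:, None]))
h3_given_12 = -np.sum(p * np.log2(p / p12[:, :, None]))

# Chain rule: H(X1, X2, X3) = H(X1) + H(X2 | X1) + H(X3 | X2, X1).
assert np.isclose(entropy(p), h1 + h2_given_1 + h3_given_12)
```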
Proof: Long
We write \(p\left(x_1, \ldots, x_n\right)=\prod_{i=1}^n p\left(x_i \mid x_{i-1}, \ldots, x_1\right)\) and evaluate \[ \begin{aligned} H & \left(X_1, X_2, \ldots, X_n\right) \\ & =-\sum_{x_1, x_2, \ldots, x_n} p\left(x_1, x_2, \ldots, x_n\right) \log p\left(x_1, x_2, \ldots, x_n\right) \\ & =-\sum_{x_1, x_2, \ldots, x_n} p\left(x_1, x_2, \ldots, x_n\right) \log \prod_{i=1}^n p\left(x_i \mid x_{i-1}, \ldots, x_1\right) \\ & =-\sum_{x_1, x_2, \ldots, x_n} \sum_{i=1}^n p\left(x_1, x_2, \ldots, x_n\right) \log p\left(x_i \mid x_{i-1}, \ldots, x_1\right) \\ & =-\sum_{i=1}^n \sum_{x_1, x_2, \ldots, x_n} p\left(x_1, x_2, \ldots, x_n\right) \log p\left(x_i \mid x_{i-1}, \ldots, x_1\right) \\ & =-\sum_{i=1}^n \sum_{x_1, x_2, \ldots, x_i} p\left(x_1, x_2, \ldots, x_i\right) \log p\left(x_i \mid x_{i-1}, \ldots, x_1\right) \\ & =\sum_{i=1}^n H\left(X_i \mid X_{i-1}, \ldots, X_1\right) . \end{aligned} \] We now define the conditional mutual information of \(X\) and \(Y\) given \(Z\) as \[ I(X ; Y \mid Z)=H(X \mid Z)-H(X \mid Y, Z), \] the reduction in the uncertainty of \(X\) due to knowledge of \(Y\) when \(Z\) is given.
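The conditional mutual information can equivalently be written as an expected log-ratio, \(I(X ; Y \mid Z)=\sum_{x, y, z} p(x, y, z) \log \frac{p(x, y \mid z)}{p(x \mid z)\, p(y \mid z)}\). The minimal sketch below (illustrative NumPy code, not from the book) checks that this matches \(H(X \mid Z)-H(X \mid Y, Z)\) on a random joint pmf.

```python
# Compute I(X; Y | Z) two equivalent ways on a random joint pmf p(x, y, z).
import numpy as np

rng = np.random.default_rng(1)
p = rng.random((2, 2, 3))          # axes: x, y, z; strictly positive pmf
p /= p.sum()

def H(q):
    q = q[q > 0]
    return -np.sum(q * np.log2(q))

p_xz = p.sum(axis=1)               # p(x, z)
p_yz = p.sum(axis=0)               # p(y, z)
p_z = p.sum(axis=(0, 1))           # p(z)

# I(X; Y | Z) = H(X | Z) - H(X | Y, Z)
#             = [H(X, Z) - H(Z)] - [H(X, Y, Z) - H(Y, Z)].
via_entropies = (H(p_xz) - H(p_z)) - (H(p) - H(p_yz))

# Direct definition: E log2 [ p(x, y | z) / (p(x | z) p(y | z)) ].
ratio = p * p_z[None, None, :] / (p_xz[:, None, :] * p_yz[None, :, :])
direct = np.sum(p * np.log2(ratio))

assert np.isclose(via_entropies, direct)
```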
Chain rule for mutual information
Mutual information also satisfies a chain rule. \[ I\left(X_1, X_2, \ldots, X_n ; Y\right)=\sum_{i=1}^n I\left(X_i ; Y \mid X_{i-1}, X_{i-2}, \ldots, X_1\right) . \]
Proof
\[ \begin{aligned} I & \left(X_1, X_2, \ldots, X_n ; Y\right) \\ & =H\left(X_1, X_2, \ldots, X_n\right)-H\left(X_1, X_2, \ldots, X_n \mid Y\right) \\ & =\sum_{i=1}^n H\left(X_i \mid X_{i-1}, \ldots, X_1\right)-\sum_{i=1}^n H\left(X_i \mid X_{i-1}, \ldots, X_1, Y\right) \\ & =\sum_{i=1}^n I\left(X_i ; Y \mid X_1, X_2, \ldots, X_{i-1}\right) . \end{aligned} \]
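The sketch below (again illustrative, assuming NumPy and a random joint pmf \(p(x_1, x_2, y)\)) checks the two-term instance \(I(X_1, X_2 ; Y)=I(X_1 ; Y)+I(X_2 ; Y \mid X_1)\) by expressing each term through joint entropies.

```python
# A minimal check of the two-term chain rule for mutual information.
import numpy as np

rng = np.random.default_rng(2)
p = rng.random((2, 3, 2))          # axes: x1, x2, y; strictly positive pmf
p /= p.sum()

def H(q):
    q = q[q > 0]
    return -np.sum(q * np.log2(q))

p_x1x2 = p.sum(axis=2)             # p(x1, x2)
p_x1y = p.sum(axis=1)              # p(x1, y)
p_x1 = p.sum(axis=(1, 2))          # p(x1)
p_y = p.sum(axis=(0, 1))           # p(y)

# I(X1, X2; Y) = H(X1, X2) + H(Y) - H(X1, X2, Y).
lhs = H(p_x1x2) + H(p_y) - H(p)

# I(X1; Y)      = H(X1) + H(Y) - H(X1, Y)
# I(X2; Y | X1) = H(X2 | X1) - H(X2 | X1, Y)
#               = [H(X1, X2) - H(X1)] - [H(X1, X2, Y) - H(X1, Y)].
i1 = H(p_x1) + H(p_y) - H(p_x1y)
i2 = (H(p_x1x2) - H(p_x1)) - (H(p) - H(p_x1y))

assert np.isclose(lhs, i1 + i2)
```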
Chain rule for relative entropy
The chain rule for relative entropy, which is used in Section 4.4 of Elements of Information Theory to prove a version of the second law of thermodynamics, expands the relative entropy between two joint distributions into a marginal term and a conditional term:
\[ D(p(x, y) \| q(x, y))=D(p(x) \| q(x))+D(p(y \mid x) \| q(y \mid x)) . \]
Proof
\[ \begin{aligned} D( & p(x, y) \| q(x, y)) \\ & =\sum_x \sum_y p(x, y) \log \frac{p(x, y)}{q(x, y)} \\ & =\sum_x \sum_y p(x, y) \log \frac{p(x) p(y \mid x)}{q(x) q(y \mid x)} \\ & =\sum_x \sum_y p(x, y) \log \frac{p(x)}{q(x)}+\sum_x \sum_y p(x, y) \log \frac{p(y \mid x)}{q(y \mid x)} \\ & =D(p(x) \| q(x))+D(p(y \mid x) \| q(y \mid x)) . \end{aligned} \]
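As a final numeric check (a sketch with illustrative names, not the book's code), the chain rule for relative entropy can be verified on a pair of random joint pmfs, computing the conditional term as \(\sum_{x, y} p(x, y) \log \frac{p(y \mid x)}{q(y \mid x)}\).

```python
# Check D(p(x,y) || q(x,y)) = D(p(x) || q(x)) + D(p(y|x) || q(y|x)).
import numpy as np

rng = np.random.default_rng(3)
p = rng.random((3, 4)); p /= p.sum()   # joint pmf p(x, y)
q = rng.random((3, 4)); q /= q.sum()   # joint pmf q(x, y)

def D(a, b):
    """Relative entropy D(a || b) in bits for strictly positive pmfs."""
    return np.sum(a * np.log2(a / b))

p_x, q_x = p.sum(axis=1), q.sum(axis=1)

# Conditional relative entropy: expectation over p(x, y) of log p(y|x)/q(y|x).
p_y_given_x = p / p_x[:, None]
q_y_given_x = q / q_x[:, None]
cond_term = np.sum(p * np.log2(p_y_given_x / q_y_given_x))

assert np.isclose(D(p, q), D(p_x, q_x) + cond_term)
```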