Chain Rules for Entropy, Relative Entropy and Mutual Information
We now show that the entropy of a collection of random variables is the sum of the conditional entropies.
Note: The proofs in this post are still rough and will be improved later.
Ref: Cover & Thomas, Elements of Information Theory
Chain rule for entropy
Let \(X_1, X_2, \ldots, X_n\) be drawn according to \(p\left(x_1, x_2, \ldots, x_n\right)\). Then \[ H\left(X_1, X_2, \ldots, X_n\right)=\sum_{i=1}^n H\left(X_i \mid X_{i-1}, \ldots, X_1\right) . \]
Proof: Short
By repeated application of the two-variable expansion rule for entropies, we have \[ \begin{aligned} H\left(X_1, X_2\right) & =H\left(X_1\right)+H\left(X_2 \mid X_1\right), \\ H\left(X_1, X_2, X_3\right) & =H\left(X_1\right)+H\left(X_2, X_3 \mid X_1\right) \\ & =H\left(X_1\right)+H\left(X_2 \mid X_1\right)+H\left(X_3 \mid X_2, X_1\right), \\ & \vdots \\ H\left(X_1, X_2, \ldots, X_n\right) & =H\left(X_1\right)+H\left(X_2 \mid X_1\right)+\cdots+H\left(X_n \mid X_{n-1}, \ldots, X_1\right) \\ & =\sum_{i=1}^n H\left(X_i \mid X_{i-1}, \ldots, X_1\right) . \end{aligned} \]
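As a quick sanity check (not part of the reference text), the sketch below verifies the three-variable case \(H(X_1, X_2, X_3)=H(X_1)+H(X_2 \mid X_1)+H(X_3 \mid X_2, X_1)\) numerically on a random joint pmf. The array shapes, the helper name `entropy`, and the use of NumPy are illustrative choices.

```python
# A minimal numerical check of the chain rule for entropy on a random
# joint pmf p(x1, x2, x3).  Names and setup are illustrative, not from the book.
import numpy as np

rng = np.random.default_rng(0)
p = rng.random((2, 3, 2))            # strictly positive joint pmf: |X1|=2, |X2|=3, |X3|=2
p /= p.sum()

def entropy(q):
    """Entropy in bits of a pmf given as an array of probabilities."""
    q = q[q > 0]
    return -np.sum(q * np.log2(q))

p12 = p.sum(axis=2)                  # marginal p(x1, x2)
p1 = p12.sum(axis=1)                 # marginal p(x1)

# Conditional entropies computed directly from conditional pmfs:
# H(X2 | X1) = -sum p(x1, x2) log2 p(x2 | x1), and similarly for H(X3 | X2, X1).
h1 = entropy(p1)
h2_given_1 = -np.sum(p12 * np.log2(p12 / p1[:, None]))
h3_given_12 = -np.sum(p * np.log2(p / p12[:, :, None]))

# Chain rule: H(X1, X2, X3) = H(X1) + H(X2 | X1) + H(X3 | X2, X1).
assert np.isclose(entropy(p), h1 + h2_given_1 + h3_given_12)
```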
Proof: Long
We write \(p\left(x_1, \ldots, x_n\right)=\prod_{i=1}^n p\left(x_i \mid x_{i-1}, \ldots, x_1\right)\) and evaluate \[ \begin{aligned} H & \left(X_1, X_2, \ldots, X_n\right) \\ & =-\sum_{x_1, x_2, \ldots, x_n} p\left(x_1, x_2, \ldots, x_n\right) \log p\left(x_1, x_2, \ldots, x_n\right) \\ & =-\sum_{x_1, x_2, \ldots, x_n} p\left(x_1, x_2, \ldots, x_n\right) \log \prod_{i=1}^n p\left(x_i \mid x_{i-1}, \ldots, x_1\right) \\ & =-\sum_{x_1, x_2, \ldots, x_n} \sum_{i=1}^n p\left(x_1, x_2, \ldots, x_n\right) \log p\left(x_i \mid x_{i-1}, \ldots, x_1\right) \\ & =-\sum_{i=1}^n \sum_{x_1, x_2, \ldots, x_n} p\left(x_1, x_2, \ldots, x_n\right) \log p\left(x_i \mid x_{i-1}, \ldots, x_1\right) \\ & =-\sum_{i=1}^n \sum_{x_1, x_2, \ldots, x_i} p\left(x_1, x_2, \ldots, x_i\right) \log p\left(x_i \mid x_{i-1}, \ldots, x_1\right) \\ & =\sum_{i=1}^n H\left(X_i \mid X_{i-1}, \ldots, X_1\right) . \end{aligned} \] We now define the conditional mutual information of \(X\) and \(Y\) given \(Z\) as \[ I(X ; Y \mid Z)=H(X \mid Z)-H(X \mid Y, Z), \] the reduction in the uncertainty of \(X\) due to knowledge of \(Y\) when \(Z\) is given.
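The conditional mutual information can equivalently be written as an expected log-ratio, \(I(X ; Y \mid Z)=\sum_{x, y, z} p(x, y, z) \log \frac{p(x, y \mid z)}{p(x \mid z)\, p(y \mid z)}\). The minimal sketch below (illustrative NumPy code, not from the book) checks that this matches \(H(X \mid Z)-H(X \mid Y, Z)\) on a random joint pmf.

```python
# Compute I(X; Y | Z) two equivalent ways on a random joint pmf p(x, y, z).
import numpy as np

rng = np.random.default_rng(1)
p = rng.random((2, 2, 3))          # axes: x, y, z; strictly positive pmf
p /= p.sum()

def H(q):
    q = q[q > 0]
    return -np.sum(q * np.log2(q))

p_xz = p.sum(axis=1)               # p(x, z)
p_yz = p.sum(axis=0)               # p(y, z)
p_z = p.sum(axis=(0, 1))           # p(z)

# I(X; Y | Z) = H(X | Z) - H(X | Y, Z)
#             = [H(X, Z) - H(Z)] - [H(X, Y, Z) - H(Y, Z)].
via_entropies = (H(p_xz) - H(p_z)) - (H(p) - H(p_yz))

# Direct definition: E log2 [ p(x, y | z) / (p(x | z) p(y | z)) ].
ratio = p * p_z[None, None, :] / (p_xz[:, None, :] * p_yz[None, :, :])
direct = np.sum(p * np.log2(ratio))

assert np.isclose(via_entropies, direct)
```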
Chain rule for mutual information
Mutual information also satisfies a chain rule. \[ I\left(X_1, X_2, \ldots, X_n ; Y\right)=\sum_{i=1}^n I\left(X_i ; Y \mid X_{i-1}, X_{i-2}, \ldots, X_1\right) . \]
Proof
\[ \begin{aligned} I & \left(X_1, X_2, \ldots, X_n ; Y\right) \\ & =H\left(X_1, X_2, \ldots, X_n\right)-H\left(X_1, X_2, \ldots, X_n \mid Y\right) \\ & =\sum_{i=1}^n H\left(X_i \mid X_{i-1}, \ldots, X_1\right)-\sum_{i=1}^n H\left(X_i \mid X_{i-1}, \ldots, X_1, Y\right) \\ & =\sum_{i=1}^n I\left(X_i ; Y \mid X_1, X_2, \ldots, X_{i-1}\right) . \end{aligned} \]
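The sketch below (again illustrative, assuming NumPy and a random joint pmf \(p(x_1, x_2, y)\)) checks the two-term instance \(I(X_1, X_2 ; Y)=I(X_1 ; Y)+I(X_2 ; Y \mid X_1)\) by expressing each term through joint entropies.

```python
# A minimal check of the two-term chain rule for mutual information.
import numpy as np

rng = np.random.default_rng(2)
p = rng.random((2, 3, 2))          # axes: x1, x2, y; strictly positive pmf
p /= p.sum()

def H(q):
    q = q[q > 0]
    return -np.sum(q * np.log2(q))

p_x1x2 = p.sum(axis=2)             # p(x1, x2)
p_x1y = p.sum(axis=1)              # p(x1, y)
p_x1 = p.sum(axis=(1, 2))          # p(x1)
p_y = p.sum(axis=(0, 1))           # p(y)

# I(X1, X2; Y) = H(X1, X2) + H(Y) - H(X1, X2, Y).
lhs = H(p_x1x2) + H(p_y) - H(p)

# I(X1; Y)      = H(X1) + H(Y) - H(X1, Y)
# I(X2; Y | X1) = H(X2 | X1) - H(X2 | X1, Y)
#               = [H(X1, X2) - H(X1)] - [H(X1, X2, Y) - H(X1, Y)].
i1 = H(p_x1) + H(p_y) - H(p_x1y)
i2 = (H(p_x1x2) - H(p_x1)) - (H(p) - H(p_x1y))

assert np.isclose(lhs, i1 + i2)
```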
Chain rule for relative entropy
The chain rule for relative entropy, which is used in Section 4.4 of Elements of Information Theory to prove a version of the second law of thermodynamics, expands the relative entropy between two joint distributions into a marginal term and a conditional term:
\[ D(p(x, y) \| q(x, y))=D(p(x) \| q(x))+D(p(y \mid x) \| q(y \mid x)) . \]
Proof
\[ \begin{aligned} D( & p(x, y) \| q(x, y)) \\ & =\sum_x \sum_y p(x, y) \log \frac{p(x, y)}{q(x, y)} \\ & =\sum_x \sum_y p(x, y) \log \frac{p(x) p(y \mid x)}{q(x) q(y \mid x)} \\ & =\sum_x \sum_y p(x, y) \log \frac{p(x)}{q(x)}+\sum_x \sum_y p(x, y) \log \frac{p(y \mid x)}{q(y \mid x)} \\ & =D(p(x) \| q(x))+D(p(y \mid x) \| q(y \mid x)) . \end{aligned} \]
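As a final numeric check (a sketch with illustrative names, not the book's code), the chain rule for relative entropy can be verified on a pair of random joint pmfs, computing the conditional term as \(\sum_{x, y} p(x, y) \log \frac{p(y \mid x)}{q(y \mid x)}\).

```python
# Check D(p(x,y) || q(x,y)) = D(p(x) || q(x)) + D(p(y|x) || q(y|x)).
import numpy as np

rng = np.random.default_rng(3)
p = rng.random((3, 4)); p /= p.sum()   # joint pmf p(x, y)
q = rng.random((3, 4)); q /= q.sum()   # joint pmf q(x, y)

def D(a, b):
    """Relative entropy D(a || b) in bits for strictly positive pmfs."""
    return np.sum(a * np.log2(a / b))

p_x, q_x = p.sum(axis=1), q.sum(axis=1)

# Conditional relative entropy: expectation over p(x, y) of log p(y|x)/q(y|x).
p_y_given_x = p / p_x[:, None]
q_y_given_x = q / q_x[:, None]
cond_term = np.sum(p * np.log2(p_y_given_x / q_y_given_x))

assert np.isclose(D(p, q), D(p_x, q_x) + cond_term)
```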