Common Problems of Entropy

Common problems involving Shannon entropy in information theory.

Theorem: Deterministic Distribution Minimizes The Entropy

What is the minimum value of $H(p_1, \ldots, p_n) = H(\mathbf{p})$ as $\mathbf{p}$ ranges over the set of $n$-dimensional probability vectors? Find all $\mathbf{p}$'s which achieve this minimum.

Solution: We wish to find all probability vectors $\mathbf{p} = (p_1, \ldots, p_n)$ which minimize $H(\mathbf{p}) = -\sum_i p_i \log p_i$.

Now $-p_i \log p_i \ge 0$, with equality iff $p_i = 0$ or $1$. Moreover, we know that $\sum_i p_i = 1$.

Hence the only probability vectors which minimize $H(\mathbf{p})$ are those with $p_i = 1$ for some $i$ and $p_j = 0$ for $j \ne i$. There are $n$ such vectors, namely $(1, 0, \ldots, 0), (0, 1, 0, \ldots, 0), \ldots, (0, \ldots, 0, 1)$, and the minimum value of $H(\mathbf{p})$ is $0$.
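As a quick numerical sanity check, here is a minimal Python sketch (the `entropy` helper and the example vectors are my own, chosen for illustration) showing that the deterministic vectors give $H(\mathbf{p}) = 0$ while anything else is strictly positive:

```python
import math

def entropy(p):
    """Shannon entropy in bits; 0 * log 0 is treated as 0."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

# Deterministic vectors achieve the minimum H(p) = 0.
print(entropy([1.0, 0.0, 0.0]))      # 0.0
# Any vector with two or more nonzero entries has H(p) > 0.
print(entropy([0.9, 0.1, 0.0]))      # ~0.469
print(entropy([1/3, 1/3, 1/3]))      # ~1.585
```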

Theorem: Uniform Distribution Maximizes The Entropy

Source:

  1. Why is Entropy maximised when the probability distribution is uniform?
  2. Why does $\log(1+x) = x + O(x^2)$ when $x \to 0$?

Proof

Heuristically, the probability distribution on $\{x_1, x_2, \ldots, x_n\}$ with maximum entropy turns out to be the one that corresponds to the least amount of knowledge about $\{x_1, x_2, \ldots, x_n\}$, in other words the uniform distribution.

Now, for a more formal proof, consider the following. A probability distribution on $\{x_1, x_2, \ldots, x_n\}$ is a set of nonnegative real numbers $p_1, \ldots, p_n$ that add up to $1$. Entropy is a continuous function of the $n$-tuple $(p_1, \ldots, p_n)$, and these points lie in a compact subset of $\mathbb{R}^n$, so there is an $n$-tuple where the entropy is maximized. We want to show this occurs at $(1/n, \ldots, 1/n)$ and nowhere else.

Suppose the $p_j$ are not all equal, say $p_1 < p_2$. (Clearly $n \ne 1$.) We will find a new probability distribution with higher entropy. It then follows, since entropy is maximized at some $n$-tuple, that entropy is uniquely maximized at the $n$-tuple with $p_i = 1/n$ for all $i$.

Since $p_1 < p_2$, for small positive $\varepsilon$ we have $p_1 + \varepsilon < p_2 - \varepsilon$. The entropy of $\{p_1 + \varepsilon, p_2 - \varepsilon, p_3, \ldots, p_n\}$ minus the entropy of $\{p_1, p_2, p_3, \ldots, p_n\}$ equals

$$-p_1\log\left(\frac{p_1+\varepsilon}{p_1}\right) - \varepsilon\log(p_1+\varepsilon) - p_2\log\left(\frac{p_2-\varepsilon}{p_2}\right) + \varepsilon\log(p_2-\varepsilon)$$

To complete the proof, we want to show this is positive for small enough $\varepsilon$. Rewrite the above expression as

$$-p_1\log\left(1+\frac{\varepsilon}{p_1}\right) - \varepsilon\left(\log p_1 + \log\left(1+\frac{\varepsilon}{p_1}\right)\right) - p_2\log\left(1-\frac{\varepsilon}{p_2}\right) + \varepsilon\left(\log p_2 + \log\left(1-\frac{\varepsilon}{p_2}\right)\right)$$

Recalling that $\log(1+x) = x + O(x^2)$ for small $x$, the above expression is

$$-\varepsilon - \varepsilon\log p_1 + \varepsilon + \varepsilon\log p_2 + O(\varepsilon^2) = \varepsilon\log(p_2/p_1) + O(\varepsilon^2),$$

which is positive when $\varepsilon$ is small enough, since $p_1 < p_2$.
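The first-order estimate can be checked numerically. The sketch below is my own illustration: the particular distribution, the value of $\varepsilon$, and the use of base-2 logarithms are arbitrary choices (the base only rescales both sides). It moves a small amount of mass from the larger $p_2$ to the smaller $p_1$ and compares the entropy gain with $\varepsilon \log(p_2/p_1)$:

```python
import math

def entropy(p):
    """Shannon entropy in bits, skipping zero-probability terms."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

p = [0.1, 0.5, 0.4]          # p1 < p2, so entropy is not maximal
eps = 1e-3
q = [p[0] + eps, p[1] - eps, p[2]]

diff = entropy(q) - entropy(p)
print(diff)                              # positive, ~0.00232
print(eps * math.log2(p[1] / p[0]))      # first-order prediction, ~0.00232
```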

Simpler Proof

Recall that $H(X) \le \log|\mathcal{X}|$, with equality if and only if $X$ has a uniform distribution over its alphabet $\mathcal{X}$.

Then for any r.v. $X$, its maximum entropy is $\log|\mathcal{X}|$, and it is achieved by the uniform distribution over $\mathcal{X}$.
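For a quick empirical check of the bound (a minimal sketch; the alphabet size $n = 4$ and the random trials are my own choices), randomly generated distributions never exceed $\log|\mathcal{X}|$ bits, while the uniform distribution attains it:

```python
import math, random

def entropy(p):
    """Shannon entropy in bits, skipping zero-probability terms."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

n = 4
bound = math.log2(n)

# Random distributions on n outcomes never exceed log2(n) bits ...
for _ in range(5):
    weights = [random.random() for _ in range(n)]
    total = sum(weights)
    p = [w / total for w in weights]
    print(entropy(p) <= bound + 1e-12)    # True every time

# ... and the uniform distribution attains the bound exactly.
print(entropy([1 / n] * n), bound)        # 2.0 2.0
```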

Drawing with and without replacement

An urn contains $r$ red, $w$ white, and $b$ black balls. Which has higher entropy, drawing $n \ge 2$ balls from the urn with replacement or without replacement?

Solution:

Let the random variable $X_i$ denote the color of the $i$th ball drawn. The alphabet of each $X_i$ is the same, $\mathcal{X} = \{0, 1, 2\}$, where

  1. $X_i = 0$ if the ball is red,
  2. $X_i = 1$ if the ball is white,
  3. $X_i = 2$ if the ball is black.

Let $p_{X_i}$ denote the PMF of $X_i$.

For the case with replacement:

For each $X_i \in \{X_1, \ldots, X_n\}$, since the total numbers of red, white, and black balls are $r$, $w$, and $b$,

$$p_{X_i}(x) = \begin{cases} \dfrac{r}{r+w+b} & \text{if } x = 0 \text{ (red)} \\ \dfrac{w}{r+w+b} & \text{if } x = 1 \text{ (white)} \\ \dfrac{b}{r+w+b} & \text{if } x = 2 \text{ (black)}. \end{cases}$$

This PMF does not change during the drawing process, and the draws $X_1, \ldots, X_n$ are independent (because of the replacement).

Due to Theorem: Entropy is additive for independent r.v., $H(X_1, X_2, \ldots, X_n) = H(X_1) + H(X_2) + \cdots + H(X_n)$.

For the case without replacement:

The draws are no longer independent. Due to the chain rule for joint entropy, we have $H(X_1, X_2, \ldots, X_n) = H(X_1) + H(X_2 \mid X_1) + \cdots + H(X_n \mid X_1, \ldots, X_{n-1})$.

Meanwhile, due to Theorem: Conditioning reduces entropy, we have

$$H(X_1) + H(X_2 \mid X_1) + \cdots + H(X_n \mid X_1, \ldots, X_{n-1}) \le H(X_1) + H(X_2) + \cdots + H(X_n).$$

Note that, by symmetry, each $X_i$ has the same marginal distribution whether we draw with or without replacement, so the right-hand side equals the joint entropy of the with-replacement case. Hence, sampling with replacement has higher entropy.
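To make the comparison concrete, here is a small Python sketch (the urn sizes $r = 2$, $w = 1$, $b = 1$, the two draws, and the helper functions are my own illustrative choices, not part of the original problem) that computes both joint entropies exactly by enumeration:

```python
import math
from itertools import product
from collections import Counter

def entropy(probs):
    """Shannon entropy in bits of a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Small example urn: 2 red, 1 white, 1 black; draw 2 balls.
counts = {"red": 2, "white": 1, "black": 1}
n_draws = 2
total = sum(counts.values())

# With replacement: the draws are i.i.d., so the joint entropy is n * H(X_1).
p_marginal = [c / total for c in counts.values()]
h_with = n_draws * entropy(p_marginal)

# Without replacement: enumerate all ordered color sequences and their probabilities.
joint = Counter()
for seq in product(counts, repeat=n_draws):
    remaining = dict(counts)
    prob, ok = 1.0, True
    for color in seq:
        if remaining[color] == 0:
            ok = False
            break
        prob *= remaining[color] / sum(remaining.values())
        remaining[color] -= 1
    if ok:
        joint[seq] += prob
h_without = entropy(joint.values())

print(h_with, h_without)   # 3.0 vs ~2.75 bits: with replacement is larger
```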