Common Problems of Entropy

Common problems involving Shannon entropy in information theory.

Theorem: Deterministic Distribution Minimizes The Entropy

What is the minimum value of $H(p_1, \ldots, p_n) = H(\mathbf{p})$ as $\mathbf{p}$ ranges over the set of $n$-dimensional probability vectors? Find all $\mathbf{p}$'s which achieve this minimum.

Solution: We wish to find all probability vectors $\mathbf{p} = (p_1, \ldots, p_n)$ which minimize $H(\mathbf{p}) = -\sum_i p_i \log p_i$.

Now $-p_i \log p_i \ge 0$, with equality iff $p_i = 0$ or $1$. Moreover, we know that $\sum_i p_i = 1$.

Hence the only probability vectors which minimize $H(\mathbf{p})$ are those with $p_i = 1$ for some $i$ and $p_j = 0$ for $j \ne i$. There are $n$ such vectors, namely $(1, 0, \ldots, 0), (0, 1, 0, \ldots, 0), \ldots, (0, \ldots, 0, 1)$, and the minimum value of $H(\mathbf{p})$ is $0$.
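As a quick numerical sanity check, here is a minimal Python sketch (the `entropy` helper and the example vectors are my own, chosen for illustration) showing that the deterministic vectors give $H(\mathbf{p}) = 0$ while anything else is strictly positive:

```python
import math

def entropy(p):
    """Shannon entropy in bits; 0 * log 0 is treated as 0."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

# Deterministic vectors achieve the minimum H(p) = 0.
print(entropy([1.0, 0.0, 0.0]))      # 0.0
# Any vector with two or more nonzero entries has H(p) > 0.
print(entropy([0.9, 0.1, 0.0]))      # ~0.469
print(entropy([1/3, 1/3, 1/3]))      # ~1.585
```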

Theorem: Uniform Distribution Maximizes The Entropy

Source:

  1. Why is Entropy maximised when the probability distribution is uniform?
  2. Why does $\log(1+x) = x + O(x^2)$ when $x \to 0$?

Proof

Heuristically, the probability distribution on $\{x_1, x_2, \ldots, x_n\}$ with maximum entropy turns out to be the one that corresponds to the least amount of knowledge about $\{x_1, x_2, \ldots, x_n\}$, in other words the uniform distribution.

Now, for a more formal proof, consider the following. A probability distribution on $\{x_1, x_2, \ldots, x_n\}$ is a set of nonnegative real numbers $p_1, \ldots, p_n$ that add up to $1$. Entropy is a continuous function of the $n$-tuple $(p_1, \ldots, p_n)$, and these points lie in a compact subset of $\mathbb{R}^n$, so there is an $n$-tuple where the entropy is maximized. We want to show this occurs at $(1/n, \ldots, 1/n)$ and nowhere else.

Suppose the $p_j$ are not all equal, say $p_1 < p_2$. (Clearly $n \ne 1$.) We will find a new probability distribution with higher entropy. It then follows, since entropy is maximized at some $n$-tuple, that entropy is uniquely maximized at the $n$-tuple with $p_i = 1/n$ for all $i$.

Since $p_1 < p_2$, for small positive $\varepsilon$ we have $p_1 + \varepsilon < p_2 - \varepsilon$. The entropy of $\{p_1 + \varepsilon, p_2 - \varepsilon, p_3, \ldots, p_n\}$ minus the entropy of $\{p_1, p_2, p_3, \ldots, p_n\}$ equals

$$-p_1\log\left(\frac{p_1+\varepsilon}{p_1}\right) - \varepsilon\log(p_1+\varepsilon) - p_2\log\left(\frac{p_2-\varepsilon}{p_2}\right) + \varepsilon\log(p_2-\varepsilon)$$

To complete the proof, we want to show this is positive for small enough $\varepsilon$. Rewrite the above expression as

$$-p_1\log\left(1+\frac{\varepsilon}{p_1}\right) - \varepsilon\left(\log p_1 + \log\left(1+\frac{\varepsilon}{p_1}\right)\right) - p_2\log\left(1-\frac{\varepsilon}{p_2}\right) + \varepsilon\left(\log p_2 + \log\left(1-\frac{\varepsilon}{p_2}\right)\right)$$

Recalling that $\log(1+x) = x + O(x^2)$ for small $x$, the above expression is

$$-\varepsilon - \varepsilon\log p_1 + \varepsilon + \varepsilon\log p_2 + O(\varepsilon^2) = \varepsilon\log(p_2/p_1) + O(\varepsilon^2),$$

which is positive when $\varepsilon$ is small enough, since $p_1 < p_2$.
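The first-order estimate can be checked numerically. The sketch below is my own illustration: the particular distribution, the value of $\varepsilon$, and the use of base-2 logarithms are arbitrary choices (the base only rescales both sides). It moves a small amount of mass from the larger $p_2$ to the smaller $p_1$ and compares the entropy gain with $\varepsilon \log(p_2/p_1)$:

```python
import math

def entropy(p):
    """Shannon entropy in bits, skipping zero-probability terms."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

p = [0.1, 0.5, 0.4]          # p1 < p2, so entropy is not maximal
eps = 1e-3
q = [p[0] + eps, p[1] - eps, p[2]]

diff = entropy(q) - entropy(p)
print(diff)                              # positive, ~0.00232
print(eps * math.log2(p[1] / p[0]))      # first-order prediction, ~0.00232
```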

Simpler Proof

Recall that $H(X) \le \log|\mathcal{X}|$, with equality if and only if $X$ has a uniform distribution over its alphabet $\mathcal{X}$.

Then for any r.v. $X$, its maximum entropy is $\log|\mathcal{X}|$, and it is achieved by the uniform distribution over $\mathcal{X}$.
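For a quick empirical check of the bound (a minimal sketch; the alphabet size $n = 4$ and the random trials are my own choices), randomly generated distributions never exceed $\log|\mathcal{X}|$ bits, while the uniform distribution attains it:

```python
import math, random

def entropy(p):
    """Shannon entropy in bits, skipping zero-probability terms."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

n = 4
bound = math.log2(n)

# Random distributions on n outcomes never exceed log2(n) bits ...
for _ in range(5):
    weights = [random.random() for _ in range(n)]
    total = sum(weights)
    p = [w / total for w in weights]
    print(entropy(p) <= bound + 1e-12)    # True every time

# ... and the uniform distribution attains the bound exactly.
print(entropy([1 / n] * n), bound)        # 2.0 2.0
```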

Drawing with and without replacement

An urn contains $r$ red, $w$ white, and $b$ black balls. Which has higher entropy, drawing $n \ge 2$ balls from the urn with replacement or without replacement?

Solution:

Let the random variable $X_i$ denote the color of the $i$th ball drawn. The alphabet of each $X_i$ is the same, $\mathcal{X} = \{0, 1, 2\}$, where

  1. $X_i = 0$ if the ball is red,
  2. $X_i = 1$ if the ball is white,
  3. $X_i = 2$ if the ball is black.

Let $p_{X_i}$ denote the PMF of $X_i$.

For the case with replacement:

For each $X_i \in \{X_1, \ldots, X_n\}$, since the total numbers of red, white, and black balls are $r$, $w$, and $b$,

$$p_{X_i}(x) = \begin{cases} \dfrac{r}{r+w+b} & \text{if } x = 0 \text{ (red)} \\ \dfrac{w}{r+w+b} & \text{if } x = 1 \text{ (white)} \\ \dfrac{b}{r+w+b} & \text{if } x = 2 \text{ (black)}. \end{cases}$$

This PMF does not change during the drawing process, and the draws $X_1, \ldots, X_n$ are independent (because of the replacement).

Due to Theorem: Entropy is additive for independent r.v., $H(X_1, X_2, \ldots, X_n) = H(X_1) + H(X_2) + \cdots + H(X_n)$.

For the case without replacement:

The draws are no longer independent. Due to the chain rule for joint entropy, we have $H(X_1, X_2, \ldots, X_n) = H(X_1) + H(X_2 \mid X_1) + \cdots + H(X_n \mid X_1, \ldots, X_{n-1})$.

Meanwhile, due to Theorem: Conditioning reduces entropy, we have

$$H(X_1) + H(X_2 \mid X_1) + \cdots + H(X_n \mid X_1, \ldots, X_{n-1}) \le H(X_1) + H(X_2) + \cdots + H(X_n).$$

Note that, by symmetry, each $X_i$ has the same marginal distribution whether we draw with or without replacement, so the right-hand side equals the joint entropy of the with-replacement case. Hence, sampling with replacement has higher entropy.
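To make the comparison concrete, here is a small Python sketch (the urn sizes $r = 2$, $w = 1$, $b = 1$, the two draws, and the helper functions are my own illustrative choices, not part of the original problem) that computes both joint entropies exactly by enumeration:

```python
import math
from itertools import product
from collections import Counter

def entropy(probs):
    """Shannon entropy in bits of a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Small example urn: 2 red, 1 white, 1 black; draw 2 balls.
counts = {"red": 2, "white": 1, "black": 1}
n_draws = 2
total = sum(counts.values())

# With replacement: the draws are i.i.d., so the joint entropy is n * H(X_1).
p_marginal = [c / total for c in counts.values()]
h_with = n_draws * entropy(p_marginal)

# Without replacement: enumerate all ordered color sequences and their probabilities.
joint = Counter()
for seq in product(counts, repeat=n_draws):
    remaining = dict(counts)
    prob, ok = 1.0, True
    for color in seq:
        if remaining[color] == 0:
            ok = False
            break
        prob *= remaining[color] / sum(remaining.values())
        remaining[color] -= 1
    if ok:
        joint[seq] += prob
h_without = entropy(joint.values())

print(h_with, h_without)   # 3.0 vs ~2.75 bits: with replacement is larger
```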