Jensen’s Inequality
Sources:
- Elements of Information Theory
- EE 376A: Information Theory - Stanford University
Convexity
Convexity underlies many of the basic properties of information-theoretic quantities such as entropy and mutual information.
Definition: A function $f(x)$ is said to be convex over an interval $(a, b)$ if for every $x_1, x_2 \in (a, b)$ and $0 \le \lambda \le 1$,
$$f(\lambda x_1 + (1 - \lambda) x_2) \le \lambda f(x_1) + (1 - \lambda) f(x_2).$$
A function $f$ is said to be strictly convex if equality holds only when $\lambda = 0$ or $\lambda = 1$.
Note: A function $f$ is concave if $-f$ is convex.
A function is convex if it always lies below any chord. A function is concave if it always lies above any chord.
Examples

- Examples of convex functions: $x^2$, $|x|$, $e^x$, and $x \log x$ (for $x \ge 0$), and so on.
- Examples of concave functions: $\log x$ and $\sqrt{x}$ for $x \ge 0$.
- Note: linear functions $ax + b$ are both convex and concave.
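As a quick numerical sanity check (an added sketch, not from the sources above), the chord condition in the definition can be tested directly for a few of these functions; the helper function and tolerance below are illustrative choices.

```python
import math

def lies_below_chord(f, x1, x2, lam, tol=1e-12):
    """Check the convexity condition f(lam*x1 + (1-lam)*x2) <= lam*f(x1) + (1-lam)*f(x2)."""
    lhs = f(lam * x1 + (1 - lam) * x2)
    rhs = lam * f(x1) + (1 - lam) * f(x2)
    return lhs <= rhs + tol

# x^2 and x*log(x) lie below their chords (convex);
# log x is concave, so -log x should satisfy the same condition.
for lam in [0.0, 0.25, 0.5, 0.75, 1.0]:
    assert lies_below_chord(lambda x: x * x, 0.3, 2.0, lam)
    assert lies_below_chord(lambda x: x * math.log(x), 0.3, 2.0, lam)
    assert lies_below_chord(lambda x: -math.log(x), 0.3, 2.0, lam)
print("chord condition holds at the sampled points")
```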
Theorem 1
If the function $f$ has a second derivative that is nonnegative (positive) over an interval, the function is convex (strictly convex) over that interval.
Proof: We use the Taylor series expansion of the function around $x_0$:
$$f(x) = f(x_0) + f'(x_0)(x - x_0) + \frac{f''(x^*)}{2}(x - x_0)^2,$$
where $x^*$ lies between $x_0$ and $x$.
By hypothesis, $f''(x^*) \ge 0$, and thus the last term is nonnegative for all $x$.
We let $x_0 = \lambda x_1 + (1 - \lambda) x_2$ and take $x = x_1$, to obtain
$$f(x_1) \ge f(x_0) + f'(x_0)\left[(1 - \lambda)(x_1 - x_2)\right].$$
Similarly, taking $x = x_2$, we obtain
$$f(x_2) \ge f(x_0) + f'(x_0)\left[\lambda(x_2 - x_1)\right].$$
Multiplying the first inequality by $\lambda$ and the second by $1 - \lambda$ and adding, we obtain the convexity condition
$$\lambda f(x_1) + (1 - \lambda) f(x_2) \ge f(x_0) = f(\lambda x_1 + (1 - \lambda) x_2). \quad \text{Q.E.D.}$$
Theorem 1 allows us immediately to verify the strict convexity of $x^2$, $e^x$, and $x \log x$ for $x \ge 0$, and the strict concavity of $\log x$ and $\sqrt{x}$ for $x \ge 0$.
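A short symbolic check of this second-derivative test, sketched with SymPy (an added illustration, not part of the notes):

```python
import sympy as sp

x = sp.symbols('x', positive=True)

# f'' > 0 on the domain => strictly convex; f'' < 0 => strictly concave.
for expr in [x**2, sp.exp(x), x * sp.log(x)]:
    print(expr, "second derivative:", sp.simplify(sp.diff(expr, x, 2)))  # 2, exp(x), 1/x
for expr in [sp.log(x), sp.sqrt(x)]:
    print(expr, "second derivative:", sp.simplify(sp.diff(expr, x, 2)))  # -1/x**2, -1/(4*x**(3/2))
```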
Let $E$ denote expectation. Thus, $EX = \sum_{x} p(x)\, x$ in the discrete case and $EX = \int x f(x)\, dx$ in the continuous case.
Jensen’s Inequality
Theorem: If $f$ is a convex function and $X$ is a random variable, then
$$E f(X) \ge f(EX).$$
Moreover, if $f$ is strictly convex, equality implies that $X = EX$ with probability 1 (i.e., $X$ is a constant).
Proof
We prove this for discrete distributions by induction on the number of mass points. The proof of the conditions for equality when $f$ is strictly convex is left to the reader.
For a two-mass-point distribution, the inequality becomes
$$p_1 f(x_1) + p_2 f(x_2) \ge f(p_1 x_1 + p_2 x_2),$$
which follows directly from the definition of convex functions. Suppose that the theorem is true for distributions with $k - 1$ mass points. Then writing $p_i' = p_i / (1 - p_k)$ for $i = 1, 2, \ldots, k - 1$, we have
$$
\begin{aligned}
\sum_{i=1}^{k} p_i f(x_i) &= p_k f(x_k) + (1 - p_k) \sum_{i=1}^{k-1} p_i' f(x_i) \\
&\ge p_k f(x_k) + (1 - p_k) f\!\left(\sum_{i=1}^{k-1} p_i' x_i\right) \\
&\ge f\!\left(p_k x_k + (1 - p_k) \sum_{i=1}^{k-1} p_i' x_i\right) \\
&= f\!\left(\sum_{i=1}^{k} p_i x_i\right),
\end{aligned}
$$
where the first inequality follows from the induction hypothesis and the second from the definition of convexity.
The proof can be extended to continuous distributions by continuity arguments.
We now use these results to prove some of the properties of entropy and relative entropy. The following theorem is of fundamental importance.
Example
Consider a uniform random variable $X$ on $\{1, 2, 3, 4\}$ and the convex function $f(x) = x^2$.
By Jensen's inequality, $E[X^2] \ge (EX)^2$: indeed, $E[X^2] = \frac{1 + 4 + 9 + 16}{4} = 7.5$, while $(EX)^2 = (2.5)^2 = 6.25$.

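The same computation written out as a small script (the distribution and $f$ match the example above):

```python
# Numerical check of E[f(X)] >= f(E[X]) for X uniform on {1, 2, 3, 4} and f(x) = x^2.
xs = [1, 2, 3, 4]
p = [0.25] * 4

EX = sum(pi * xi for pi, xi in zip(p, xs))          # E[X]   = 2.5
EfX = sum(pi * xi**2 for pi, xi in zip(p, xs))      # E[X^2] = 7.5

assert EfX >= EX**2                                 # 7.5 >= 6.25
print(f"E[f(X)] = {EfX}, f(E[X]) = {EX**2}")
```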
Theorem: Information Inequality
Let $p(x)$, $q(x)$, $x \in \mathcal{X}$, be two probability mass functions. Then
$$D(p \,\|\, q) \ge 0,$$
with equality if and only if $p(x) = q(x)$ for all $x$.
Proof: Let $A = \{x : p(x) > 0\}$ be the support set of $p(x)$. Then
$$
\begin{aligned}
-D(p \,\|\, q) &= -\sum_{x \in A} p(x) \log \frac{p(x)}{q(x)} = \sum_{x \in A} p(x) \log \frac{q(x)}{p(x)} \\
&\le \log \sum_{x \in A} p(x) \frac{q(x)}{p(x)} \\
&= \log \sum_{x \in A} q(x) \le \log \sum_{x \in \mathcal{X}} q(x) = \log 1 = 0,
\end{aligned}
$$
where the first inequality follows from Jensen's inequality applied to the concave function $\log t$.
Also, since $\log t$ is strictly concave in $t$, we have equality in the first inequality if and only if $q(x)/p(x)$ is constant everywhere [i.e., $q(x) = c\,p(x)$ for all $x \in A$].
We have equality in the second inequality only if $\sum_{x \in A} q(x) = \sum_{x \in \mathcal{X}} q(x) = 1$, which implies $c = 1$. Hence $D(p \,\|\, q) = 0$ if and only if $p(x) = q(x)$ for all $x$. Q.E.D.
Note: the proof must restrict the sums to the support set of $p(x)$.
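A minimal numerical sketch of the information inequality (the distributions below are arbitrary examples, not from the sources):

```python
import math

def relative_entropy(p, q):
    """D(p || q) in bits, with the sum restricted to the support set of p."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [1/2, 1/4, 1/4]
q = [1/3, 1/3, 1/3]

assert relative_entropy(p, q) >= 0          # strictly positive here, since p != q
assert abs(relative_entropy(p, p)) < 1e-12  # zero when the two distributions coincide
print(f"D(p||q) = {relative_entropy(p, q):.4f} bits")
```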
Theorem: Maximum entropy of the uniform distribution
We now show that the uniform distribution over the range $\mathcal{X}$ is the maximum entropy distribution over this range. It follows that any random variable with this range has an entropy no greater than $\log |\mathcal{X}|$.
Theorem: $H(X) \le \log |\mathcal{X}|$, where $|\mathcal{X}|$ denotes the number of elements in the range of $X$, with equality if and only if $X$ has a uniform distribution over $\mathcal{X}$.
Proof: Let $u(x) = \frac{1}{|\mathcal{X}|}$ be the uniform probability mass function over $\mathcal{X}$, and let $p(x)$ be the probability mass function for $X$. Then
$$D(p \,\|\, u) = \sum_{x} p(x) \log \frac{p(x)}{u(x)} = \log |\mathcal{X}| - H(X).$$
Hence, by the nonnegativity of relative entropy,
$$0 \le D(p \,\|\, u) = \log |\mathcal{X}| - H(X). \quad \text{Q.E.D.}$$
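The identity $D(p \,\|\, u) = \log|\mathcal{X}| - H(X)$ used in the proof can be checked numerically; the distribution below is an arbitrary example.

```python
import math

def entropy(p):
    """Shannon entropy H(p) in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

p = [1/2, 1/4, 1/8, 1/8]
n = len(p)
u = [1 / n] * n

D_pu = sum(pi * math.log2(pi / ui) for pi, ui in zip(p, u) if pi > 0)

assert abs(D_pu - (math.log2(n) - entropy(p))) < 1e-12  # D(p||u) = log|X| - H(X)
assert entropy(p) <= math.log2(n)                        # hence H(X) <= log|X|
print(f"H(p) = {entropy(p)} bits <= log|X| = {math.log2(n)} bits")
```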
Theorem: Conditioning reduces entropy
Conditioning reduces entropy (information can't hurt):
$$H(X \mid Y) \le H(X),$$
with equality if and only if $X$ and $Y$ are independent.
Proof:
1. We know that $I(X; Y) = D(p(x, y) \,\|\, p(x)p(y)) \ge 0$, with equality if and only if $p(x, y) = p(x)p(y)$, i.e., if and only if $X$ and $Y$ are independent.
2. Since $I(X; Y) = H(X) - H(X \mid Y)$, we have $H(X) - H(X \mid Y) \ge 0$.
3. We have $H(X \mid Y) \le H(X)$, with equality if and only if $X$ and $Y$ are independent. Q.E.D.
Intuitively, the theorem says that knowing another random variable $Y$ can only reduce the uncertainty in $X$.
Note that this is true only on the average. Specifically, $H(X \mid Y = y)$ may be greater than, less than, or equal to $H(X)$, but on the average $H(X \mid Y) = \sum_{y} p(y)\, H(X \mid Y = y) \le H(X)$.
For multiple conditions
This theorem can be extended to multiple conditions. Given random variables $X$, $Y$, and $Z$,
$$H(X \mid Y, Z) \le H(X \mid Y).$$
1. We know that $I(X; Z \mid Y) \ge 0$.
2. Note that the conditional entropy of $X$ given the observations $Y$ and $Z$ is $H(X \mid Y, Z)$, so $I(X; Z \mid Y) = H(X \mid Y) - H(X \mid Y, Z) \ge 0$.
3. We have $H(X \mid Y, Z) \le H(X \mid Y)$. Q.E.D.
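As a sanity check on this extension (an added sketch; the random joint distribution is illustrative), the inequality can be verified via $H(X \mid Y) = H(X, Y) - H(Y)$ and $H(X \mid Y, Z) = H(X, Y, Z) - H(Y, Z)$:

```python
import itertools, math, random

def entropy(probs):
    """Entropy in bits of a collection of probabilities."""
    return -sum(v * math.log2(v) for v in probs if v > 0)

random.seed(0)
# A small random joint pmf p(x, y, z) on {0,1}^3 (illustrative, not from the source).
w = [random.random() for _ in range(8)]
total = sum(w)
pxyz = {xyz: wi / total for xyz, wi in zip(itertools.product([0, 1], repeat=3), w)}

def marginal(keys):
    """Marginal pmf over the named coordinates ('x', 'y', 'z')."""
    out = {}
    for (x, y, z), v in pxyz.items():
        k = tuple(val for name, val in zip("xyz", (x, y, z)) if name in keys)
        out[k] = out.get(k, 0.0) + v
    return out

H_XY, H_Y   = entropy(marginal("xy").values()), entropy(marginal("y").values())
H_XYZ, H_YZ = entropy(pxyz.values()),           entropy(marginal("yz").values())

H_X_given_Y  = H_XY - H_Y      # H(X|Y)   = H(X,Y) - H(Y)
H_X_given_YZ = H_XYZ - H_YZ    # H(X|Y,Z) = H(X,Y,Z) - H(Y,Z)

assert H_X_given_YZ <= H_X_given_Y + 1e-12
print(f"H(X|Y,Z) = {H_X_given_YZ:.4f} <= H(X|Y) = {H_X_given_Y:.4f}")
```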
Example
Let $(X, Y)$ have the following joint distribution:
Y\X | x=1 | x=2 |
---|---|---|
Y=1 | 0 | 3/4 |
Y=2 | 1/8 | 1/8 |
Then we have
$$H(X) = H\!\left(\tfrac{1}{8}, \tfrac{7}{8}\right) \approx 0.544 \text{ bits}, \quad H(X \mid Y = 1) = 0 \text{ bits}, \quad H(X \mid Y = 2) = 1 \text{ bit}.$$
We calculate
$$H(X \mid Y) = \tfrac{3}{4}\, H(X \mid Y = 1) + \tfrac{1}{4}\, H(X \mid Y = 2) = 0.25 \text{ bits}.$$
Thus, the uncertainty in $X$ is
- increased if $Y = 2$ is observed,
- decreased if $Y = 1$ is observed.
But uncertainty decreases on the average.
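The numbers in this example can be reproduced with a short script (the helper names are illustrative):

```python
import math

def entropy(probs):
    """Entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Joint distribution p(x, y) from the table above (rows: y, columns: x).
p = {(1, 1): 0.0, (2, 1): 3/4,
     (1, 2): 1/8, (2, 2): 1/8}

px = {x: sum(v for (xx, yy), v in p.items() if xx == x) for x in (1, 2)}   # {1: 1/8, 2: 7/8}
py = {y: sum(v for (xx, yy), v in p.items() if yy == y) for y in (1, 2)}   # {1: 3/4, 2: 1/4}

H_X = entropy(px.values())                                                 # ~0.544 bits
H_X_given_y = {y: entropy([p[(x, y)] / py[y] for x in (1, 2)]) for y in (1, 2)}  # {1: 0.0, 2: 1.0}
H_X_given_Y = sum(py[y] * H_X_given_y[y] for y in (1, 2))                  # 0.25 bits

print(H_X, H_X_given_y, H_X_given_Y)
```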
Theorem: Independence bound on entropy
Let $X_1, X_2, \ldots, X_n$ be drawn according to $p(x_1, x_2, \ldots, x_n)$. Then
$$H(X_1, X_2, \ldots, X_n) \le \sum_{i=1}^{n} H(X_i),$$
with equality if and only if the $X_i$ are independent.
Proof:
By the chain rule for entropies,
$$H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} H(X_i \mid X_{i-1}, \ldots, X_1).$$
Then we have
$$\sum_{i=1}^{n} H(X_i \mid X_{i-1}, \ldots, X_1) \le \sum_{i=1}^{n} H(X_i),$$
where the inequality follows from the theorem that conditioning reduces entropy, with equality if and only if $X_i$ is independent of $X_{i-1}, \ldots, X_1$ for all $i$ (i.e., if and only if the $X_i$'s are independent). Q.E.D.
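A small numerical illustration of the independence bound for $n = 2$ (the joint distribution below is an arbitrary dependent example, not from the sources):

```python
import math

def entropy(probs):
    """Entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Joint pmf p(x1, x2) on {0,1}^2 with dependent coordinates (illustrative numbers).
p = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

H_joint = entropy(p.values())
H_X1 = entropy([p[(0, 0)] + p[(0, 1)], p[(1, 0)] + p[(1, 1)]])   # marginal of X1: (0.5, 0.5)
H_X2 = entropy([p[(0, 0)] + p[(1, 0)], p[(0, 1)] + p[(1, 1)]])   # marginal of X2: (0.5, 0.5)

assert H_joint <= H_X1 + H_X2   # strict inequality here, since X1 and X2 are dependent
print(f"H(X1,X2) = {H_joint:.4f} <= H(X1) + H(X2) = {H_X1 + H_X2:.4f}")
```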