Asymptotic Equipartition Property

Sources:

  1. EE 376A: Information Theory, Lecture 4. Stanford University, Winter 2018.
  2. EE 376A: Information Theory, Lecture 4. Stanford University, Winter 2017.
  3. Cover, T. M. and Thomas, J. A. Elements of Information Theory.

Intro

In information theory, the analog of the law of large numbers (LLN) is the asymptotic equipartition property (AEP).

It is a direct consequence of the weak law of large numbers.

The law of large numbers states that for i.i.d. random variables $X_1, X_2, \ldots$, the sample mean $\frac{1}{n}\sum_{i=1}^{n} X_i$ is close to its expected value $E[X]$ for large values of $n$.

The AEP states that $\frac{1}{n}\log\frac{1}{p(X_1,X_2,\ldots,X_n)}$ is close to the entropy $H$, where $X_1,X_2,\ldots,X_n$ are i.i.d. random variables and $p(X_1,X_2,\ldots,X_n)$ is the probability of observing the sequence $X_1,X_2,\ldots,X_n$. Thus, the probability $p(X_1,X_2,\ldots,X_n)$ assigned to an observed sequence will be close to $2^{-nH}$:

$$\frac{1}{n}\log\frac{1}{p(X_1,X_2,\ldots,X_n)}\to H \quad\Longrightarrow\quad p(X_1,X_2,\ldots,X_n)\approx 2^{-nH}$$

This enables us to divide the set of all sequences into two sets, the typical set, where the sample entropy is close to the true entropy, and the nontypical set, which contains the other sequences.

Most of our attention will be on the typical sequences. Any property that is proved for the typical sequences will then be true with high probability and will determine the average behavior of a large sample.

Notation

  1. Memoryless source: $X_1, X_2, \ldots \overset{\text{iid}}{\sim} X$, where $X$ denotes the common PMF of every $X_i$. Note that "memoryless" is used here because samples are drawn i.i.d. and have no dependence on past realizations.

    • Note: All of the i.i.d. random variables have the same probability distribution and hence the same entropy.
  2. Alphabet: $\mathcal{X}=\{1,2,\ldots,r\}$ specifies the possible values that each symbol $X_i$ can take on. The size of $\mathcal{X}$ is denoted $|\mathcal{X}|$.

  3. Source sequence: $X^n=(X_1,\ldots,X_n)$ denotes the $n$-tuple that specifies a sequence of $n$ source symbols. In other words, $X^n$ is a stochastic process. Further note that $\mathcal{X}^n$ denotes the set of all possible source sequences of length $n$.

    • Since the source symbols are memoryless here, $X^n$ is an i.i.d. stochastic process. It is also denoted $\{X_i\}$ in some of the literature.
  4. Probability of $X^n$ where the $X_i$ are i.i.d.: The probability assigned to a source sequence $X^n$ is given by $P(X^n)=\prod_{i=1}^n P_X(X_i)$. Since we implicitly evaluate the probabilities over the alphabet $\mathcal{X}$, we may also omit the subscript $X$ and write $$P(X^n)=\prod_{i=1}^n P(X_i). \tag{1}$$
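To make the notation concrete, here is a minimal Python sketch that computes $P(X^n)$ via equation (1) and the corresponding per-symbol quantity $-\frac{1}{n}\log P(X^n)$. The Bernoulli(0.3) source, the sequence length, and the helper name `sequence_probability` are illustrative choices, not part of the original notes.

```python
import numpy as np

# Hypothetical memoryless source: Bernoulli(0.3) over the alphabet {0, 1}.
pmf = {0: 0.7, 1: 0.3}

def sequence_probability(xn):
    """P(X^n) = prod_i P(x_i), per equation (1)."""
    p = 1.0
    for x in xn:
        p *= pmf[x]
    return p

rng = np.random.default_rng(0)
xn = rng.choice([0, 1], size=20, p=[0.7, 0.3])        # one source sequence X^n
p_xn = sequence_probability(xn)
sample_entropy = -np.log2(p_xn) / len(xn)             # -(1/n) log P(X^n)
entropy = -sum(p * np.log2(p) for p in pmf.values())  # H(X)
print(f"P(X^n) = {p_xn:.3e}  sample entropy = {sample_entropy:.3f}  H(X) = {entropy:.3f}")
```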

The ϵ-typical Set

Definition: For some $\epsilon>0$, the source sequence $X^n$ is $\epsilon$-typical if $$\left|-\frac{1}{n}\log P(X^n)-H(X)\right|\le\epsilon. \tag{2}$$

Let $A_\epsilon^{(n)}$ denote the "$\epsilon$-typical set", that is, the set of all source sequences $X^n$ that are $\epsilon$-typical.

That is: $X^n\in A_\epsilon^{(n)} \iff X^n$ is $\epsilon$-typical $\iff \left|-\frac{1}{n}\log P(X^n)-H(X)\right|\le\epsilon$.

Furthermore, note the following equivalent ways of defining ϵ-typicality:

  1. $$\left|-\frac{1}{n}\log P(X^n)-H(X)\right|\le\epsilon \tag{2}$$

  2. $$H(X)-\epsilon\le-\frac{1}{n}\log P(X^n)\le H(X)+\epsilon \tag{3}$$

  3. $$-n(H(X)+\epsilon)\le\log P(X^n)\le-n(H(X)-\epsilon) \tag{4}$$

  4. $$2^{-n(H(X)+\epsilon)}\le P(X^n)\le 2^{-n(H(X)-\epsilon)} \tag{5}$$

We have (2) $\iff$ (3) $\iff$ (4) $\iff$ (5).

From (5) we can say that every $\epsilon$-typical sequence has probability roughly $2^{-nH(X)}$.
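As a quick illustration, the following Python sketch checks definition (2) directly (the Bernoulli source and the helper name `is_typical` are ours, used only for illustration). Note a perhaps surprising consequence of the definition: the single most probable sequence is generally not typical.

```python
import numpy as np

def is_typical(xn, pmf, eps):
    """Check definition (2): |-(1/n) log P(X^n) - H(X)| <= eps."""
    log_p = sum(np.log2(pmf[x]) for x in xn)        # log P(X^n)
    h = -sum(p * np.log2(p) for p in pmf.values())  # H(X)
    return abs(-log_p / len(xn) - h) <= eps

pmf = {0: 0.7, 1: 0.3}
# The most probable sequence (all zeros) is NOT typical: its sample entropy
# -log2(0.7) ~ 0.515 is far from H(X) ~ 0.881.
print(is_typical([0] * 20, pmf, eps=0.1))             # False
# A sequence with roughly 30% ones IS typical.
print(is_typical([1] * 6 + [0] * 14, pmf, eps=0.1))   # True
```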

Theorem: $P(X^n\in A_\epsilon^{(n)})\to 1$

Theorem: $\forall\epsilon>0,\ P\left(X^n\in A_\epsilon^{(n)}\right)\xrightarrow{n\to\infty}1$.

Proof

  1. Note that $X^n\in A_\epsilon^{(n)} \iff X^n$ is $\epsilon$-typical $\iff \left|-\frac{1}{n}\log P(X^n)-H(X)\right|\le\epsilon$. So we have $$P\left(X^n\in A_\epsilon^{(n)}\right)=P\left(\left|-\frac{1}{n}\log P(X^n)-H(X)\right|\le\epsilon\right)$$

  2. Also, we know that $P(X^n)=\prod_{i=1}^n P(X_i)$. Then the probability on the right can be rewritten as $P\left(\left|-\frac{1}{n}\log\left[\prod_{i=1}^n P(X_i)\right]-H(X)\right|\le\epsilon\right)$. By (1), it can then be rewritten as $$P\left(\left|\frac{1}{n}\sum_{i=1}^n\log\frac{1}{P(X_i)}-H(X)\right|\le\epsilon\right)$$ As a result, we have $$P\left(X^n\in A_\epsilon^{(n)}\right)=P\left(\left|\frac{1}{n}\sum_{i=1}^n\log\frac{1}{P(X_i)}-H(X)\right|\le\epsilon\right).$$

  3. Note that since the $X_i$ are i.i.d., the $\log\frac{1}{P(X_i)}$ are i.i.d. as well. Then by the weak law of large numbers (LLN), $$P\left(\left|\frac{1}{n}\sum_{i=1}^n\log\frac{1}{P(X_i)}-E\left[\log\frac{1}{P(X)}\right]\right|\le\epsilon\right)\xrightarrow{n\to\infty}1.$$ Here we just treat $\log\frac{1}{P(X_i)}$ as i.i.d. random variables, then apply the LLN.

    Recalling that by the definition of entropy, $H(X)\triangleq E\left[\log\frac{1}{P(X)}\right]$, so $$P\left(\left|\frac{1}{n}\sum_{i=1}^n\log\frac{1}{P(X_i)}-H(X)\right|\le\epsilon\right)\xrightarrow{n\to\infty}1.$$ To conclude,

$$P\left(X^n\in A_\epsilon^{(n)}\right)=P\left(\left|\frac{1}{n}\sum_{i=1}^n\log\frac{1}{P(X_i)}-H(X)\right|\le\epsilon\right)\xrightarrow{n\to\infty}1.$$

Note: Since $P(X^n\in A_\epsilon^{(n)})\to1$ and $A_\epsilon^{(n)}$ is comprised of sequences each with probability roughly $2^{-nH(X)}$ of being observed (see (5)), we get $|A_\epsilon^{(n)}|\approx 2^{nH(X)}$. We'll provide more rigorous bounds on $|A_\epsilon^{(n)}|$ in a following theorem.
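A small Monte Carlo experiment makes the theorem tangible. The sketch below estimates $P(X^n\in A_\epsilon^{(n)})$ for increasing $n$ and should show the estimate climbing towards 1; the Bernoulli(0.3) source, $\epsilon=0.1$, and 10,000 trials per $n$ are all illustrative choices.

```python
import numpy as np

# Monte Carlo estimate of P(X^n in A_eps^(n)) for a hypothetical
# Bernoulli(0.3) source.
rng = np.random.default_rng(0)
p1, eps = 0.3, 0.1
h = -(p1 * np.log2(p1) + (1 - p1) * np.log2(1 - p1))   # H(X) ~ 0.881

for n in [10, 100, 1000]:
    xn = rng.random((10_000, n)) < p1                   # 10_000 sequences X^n
    k = xn.sum(axis=1)                                  # number of ones per sequence
    # For a Bernoulli source, -(1/n) log P(X^n) depends only on the count of ones.
    sample_h = -(k * np.log2(p1) + (n - k) * np.log2(1 - p1)) / n
    print(n, np.mean(np.abs(sample_h - h) <= eps))      # estimate of P(X^n in A)
```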

AEP

From Theorem: $P(X^n\in A_\epsilon^{(n)})\to1$, we know that for any $\epsilon>0$, the probability mass in $\mathcal{X}^n$ is almost entirely concentrated in $A_\epsilon^{(n)}$.

From (2), we know that if $X^n\in A_\epsilon^{(n)}$, then for a fixed $\epsilon>0$ (the one specified in $A_\epsilon^{(n)}$), we have $\left|-\frac{1}{n}\log P(X^n)-H(X)\right|\le\epsilon$.

As a result,

for $X_1,X_2,\ldots\overset{\text{iid}}{\sim}X$ and any $\epsilon>0$, with probability approaching 1 we have $\left|-\frac{1}{n}\log P(X^n)-H(X)\right|\le\epsilon$. In other words, $$-\frac{1}{n}\log p(X^n)\xrightarrow{n\to\infty}H(X)\quad\text{(in probability)}.$$ This is called the asymptotic equipartition property (AEP).

Thus, the AEP enables us to partition the set of all sequences $\mathcal{X}^n$ into

  • Typical set: containing sequences with probability close to $2^{-nH(X)}$.
  • Atypical set: containing the other sequences.
[Figure: AEP — the typical set within the set of all source sequences]

Note: the notation $p(X_1,X_2,\ldots,X_n)$ in the above figure is denoted $p(X^n)$ in this post.

Theorem: Bounds on $|A_\epsilon^{(n)}|$

Theorem: $\forall\epsilon>0$ and $n$ sufficiently large, $$(1-\epsilon)\,2^{n(H(X)-\epsilon)}\le\left|A_\epsilon^{(n)}\right|\le 2^{n(H(X)+\epsilon)}. \tag{6}$$

Proof

Upper bound:

See the following chain: $$1\overset{(7)}{\ge}P\left(X^n\in A_\epsilon^{(n)}\right)=\sum_{x^n\in A_\epsilon^{(n)}}P(x^n)\overset{(8)}{\ge}\sum_{x^n\in A_\epsilon^{(n)}}2^{-n(H(X)+\epsilon)}=2^{-n(H(X)+\epsilon)}\left|A_\epsilon^{(n)}\right|.$$

(7) holds because any probability is at most 1.

(8) is due to the lower bound in (5).

Combining the two ends of the chain above, we have $\left|A_\epsilon^{(n)}\right|\le 2^{n(H(X)+\epsilon)}$.


Lower bound:

See the following chain: $$1-\epsilon\overset{(9)}{\le}P\left(X^n\in A_\epsilon^{(n)}\right)=\sum_{x^n\in A_\epsilon^{(n)}}P(x^n)\le\sum_{x^n\in A_\epsilon^{(n)}}2^{-n(H(X)-\epsilon)}\overset{(10)}{=}2^{-n(H(X)-\epsilon)}\left|A_\epsilon^{(n)}\right|.$$

The starting inequality (9) in the lower bound proof is a consequence of Theorem: $P(X^n\in A_\epsilon^{(n)})\to1$.

Since $P(X^n\in A_\epsilon^{(n)})\xrightarrow{n\to\infty}1$, for any $\epsilon>0$ we can choose a sufficiently large $n$ such that $P(X^n\in A_\epsilon^{(n)})\ge1-\epsilon$.

According to (10), we have $\left|A_\epsilon^{(n)}\right|\ge(1-\epsilon)\,2^{n(H(X)-\epsilon)}$.
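For small $n$ we can verify (6) exhaustively. The sketch below enumerates all $2^n$ sequences of a hypothetical Bernoulli(0.3) source with $n=12$ and $\epsilon=0.2$ (illustrative parameters), counts the $\epsilon$-typical ones, and compares $|A_\epsilon^{(n)}|$ against both bounds; keep in mind the lower bound is only guaranteed for sufficiently large $n$.

```python
import itertools
import math

# Brute-force verification of (6); enumeration costs O(2^n), so keep n small.
p1, n, eps = 0.3, 12, 0.2
h = -(p1 * math.log2(p1) + (1 - p1) * math.log2(1 - p1))

size = 0
for xn in itertools.product([0, 1], repeat=n):
    k = sum(xn)
    sample_h = -(k * math.log2(p1) + (n - k) * math.log2(1 - p1)) / n
    if abs(sample_h - h) <= eps:          # definition (2)
        size += 1

lower = (1 - eps) * 2 ** (n * (h - eps))  # left side of (6)
upper = 2 ** (n * (h + eps))              # right side of (6)
print(f"{lower:.0f} <= |A| = {size} <= {upper:.0f}")
```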

Theorem: $A_\epsilon^{(n)}$ is exponentially smaller than $\mathcal{X}^n$

The space of all possible source sequences $\mathcal{X}^n$ has exponential size $|\mathcal{X}^n|=r^n$, and the $\epsilon$-typical set $A_\epsilon^{(n)}$ comprises a tiny fraction of $\mathcal{X}^n$, with size $|A_\epsilon^{(n)}|\approx2^{nH(X)}$.

In fact, $A_\epsilon^{(n)}$ is exponentially smaller than $\mathcal{X}^n$, as indicated by the ratio of their sizes, except when $X$ is uniformly distributed: $$\frac{|A_\epsilon^{(n)}|}{|\mathcal{X}^n|}\approx\frac{2^{nH(X)}}{r^n}=\frac{2^{nH(X)}}{2^{n\log r}}=2^{-n(\log r-H(X))}.$$
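Plugging in illustrative numbers (binary alphabet $r=2$ and a hypothetical Bernoulli(0.1) source, so $H(X)\approx0.469$ bits) shows how quickly this fraction collapses:

```python
import math

# Typical set's share of X^n shrinks like 2^{-n (log r - H(X))} ~ 2^{-0.531 n}.
r, p1 = 2, 0.1
h = -(p1 * math.log2(p1) + (1 - p1) * math.log2(1 - p1))  # ~ 0.469 bits
for n in [10, 100, 1000]:
    print(n, 2.0 ** (-n * (math.log2(r) - h)))
```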

Despite the small size of $A_\epsilon^{(n)}$, the probability mass in $\mathcal{X}^n$ is almost entirely concentrated in it. The forthcoming theorem illustrates the point that any subset of $\mathcal{X}^n$ that is exponentially smaller than $A_\epsilon^{(n)}$ captures only a vanishing fraction of the probability mass. In this sense, $A_\epsilon^{(n)}$ is minimal.

Theorem: $A_\epsilon^{(n)}$ is minimal

Theorem: Fix $\delta>0$ and $B_\delta^{(n)}\subseteq\mathcal{X}^n$ such that $|B_\delta^{(n)}|\le2^{n(H(X)-\delta)}$. Then $$\lim_{n\to\infty}P\left(X^n\in B_\delta^{(n)}\right)=0. \tag{11}$$

Proof

$$
\begin{aligned}
P\left(X^n\in B_\delta^{(n)}\right)
&= P\left(X^n\in B_\delta^{(n)}\cap A_\epsilon^{(n)}\right)+P\left(X^n\in B_\delta^{(n)}\cap\overline{A_\epsilon^{(n)}}\right) && (12)\\
&\le P\left(X^n\in B_\delta^{(n)}\cap A_\epsilon^{(n)}\right)+P\left(X^n\notin A_\epsilon^{(n)}\right) && (13)\\
&= \sum_{x^n\in B_\delta^{(n)}\cap A_\epsilon^{(n)}}P(x^n)+P\left(X^n\notin A_\epsilon^{(n)}\right) && (14)\\
&\le \sum_{x^n\in B_\delta^{(n)}\cap A_\epsilon^{(n)}}2^{-n(H(X)-\epsilon)}+P\left(X^n\notin A_\epsilon^{(n)}\right) && (15)\\
&= \left|B_\delta^{(n)}\cap A_\epsilon^{(n)}\right|\,2^{-n(H(X)-\epsilon)}+P\left(X^n\notin A_\epsilon^{(n)}\right) && (16)\\
&\le \left|B_\delta^{(n)}\right|\,2^{-n(H(X)-\epsilon)}+P\left(X^n\notin A_\epsilon^{(n)}\right) && (17)\\
&\le 2^{n(H(X)-\delta)}\,2^{-n(H(X)-\epsilon)}+P\left(X^n\notin A_\epsilon^{(n)}\right) && (18)\\
&= \underbrace{2^{-n(\delta-\epsilon)}}_{\to0\text{ as }n\to\infty}+\underbrace{P\left(X^n\notin A_\epsilon^{(n)}\right)}_{\to0\text{ as }n\to\infty} && (19)
\end{aligned}
$$

where $\epsilon$ is chosen with $0<\epsilon<\delta$.

Explanation:

  1. (12) is derived from a basic identity of sets. That is, given arbitrary sets $B,A$, $$B=(B\cap A)\cup(B\cap\overline{A}) \tag{20}$$ The proof is in the appendix.

    Then, we have $B_\delta^{(n)}=\left(B_\delta^{(n)}\cap A_\epsilon^{(n)}\right)\cup\left(B_\delta^{(n)}\cap\overline{A_\epsilon^{(n)}}\right)$, and since these two pieces are disjoint, their probabilities add.

  2. (13) is because given arbitrary sets $A,B$, $A\cap B\subseteq A$. Then we have $B_\delta^{(n)}\cap\overline{A_\epsilon^{(n)}}\subseteq\overline{A_\epsilon^{(n)}}$. And we also know $X^n\in\overline{A_\epsilon^{(n)}}\iff X^n\notin A_\epsilon^{(n)}$.

    Then $P\left(X^n\in B_\delta^{(n)}\cap\overline{A_\epsilon^{(n)}}\right)\le P\left(X^n\notin A_\epsilon^{(n)}\right)$.

  3. (14) simply expands the probability of the event as a sum over the (mutually exclusive) sequences that comprise it: $P(X^n\in S)=\sum_{x^n\in S}P(x^n)$ for any $S\subseteq\mathcal{X}^n$. No independence assumption is needed here.

  4. (15) is due to the upper bound in (5): every $\epsilon$-typical sequence satisfies $P(x^n)\le2^{-n(H(X)-\epsilon)}$.

  5. (16) just sums the constant $2^{-n(H(X)-\epsilon)}$ over the set, giving the set's cardinality times the constant.

  6. (17) is because given arbitrary sets $A,B$, $A\cap B\subseteq B$. Then we have $B_\delta^{(n)}\cap A_\epsilon^{(n)}\subseteq B_\delta^{(n)}$, hence $\left|B_\delta^{(n)}\cap A_\epsilon^{(n)}\right|\le\left|B_\delta^{(n)}\right|$.

  7. (18) is due to the prerequisite $|B_\delta^{(n)}|\le2^{n(H(X)-\delta)}$.

  8. The limits in (19) are easy to understand. Since we chose $0<\epsilon<\delta$, we have $2^{-n(\delta-\epsilon)}\xrightarrow{n\to\infty}0$; and because $\forall\epsilon>0,\ P(X^n\in A_\epsilon^{(n)})\xrightarrow{n\to\infty}1$, we have $P(X^n\notin A_\epsilon^{(n)})\xrightarrow{n\to\infty}0$.
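To see the theorem numerically, the sketch below (illustrative parameters: a hypothetical Bernoulli(0.3) source, $\delta=0.1$) gives $B_\delta^{(n)}$ its best chance by greedily filling it with the highest-probability sequences up to the size budget $2^{n(H(X)-\delta)}$, and then computes $P(X^n\in B_\delta^{(n)})$, which still vanishes as $n$ grows.

```python
import math

# B is filled shell by shell with the most probable sequences (fewest ones,
# since p < 0.5), while |B| stays within the budget 2^{n(H - delta)}.
p1, delta = 0.3, 0.1
h = -(p1 * math.log2(p1) + (1 - p1) * math.log2(1 - p1))

for n in [20, 100, 500]:
    budget = 2 ** (n * (h - delta))      # maximum allowed |B|
    size, prob = 0, 0.0
    for k in range(n + 1):               # shell k: all sequences with k ones
        c = math.comb(n, k)
        if size + c > budget:
            break
        size += c
        prob += c * p1**k * (1 - p1)**(n - k)
    print(n, round(prob, 4))             # P(X^n in B) -> 0 as n grows
```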

From the theorems proven above, we understand $A_\epsilon^{(n)}$ as the subset of $\mathcal{X}^n$ that most efficiently captures virtually all of the probability of the source sequences drawn from $\mathcal{X}^n$. The following section confirms the intuition that, when developing a scheme which encodes sequences from $\mathcal{X}^n$, we should focus our efforts on the sequences that lie in $A_\epsilon^{(n)}$.

Appendix

  • Proof of (20) "Given arbitrary sets $A,B$, $B=(B\cap A)\cup(B\cap\overline{A})$":

    • Inclusion ($\subseteq$):

      Let $x$ be an arbitrary element of $B$. We want to show that $x$ is also in the right-hand side of the equation. Either $x\in A$ or $x\in\overline{A}$:

      • If $x\in A$: then $x\in B\cap A$.
      • If $x\in\overline{A}$: then $x\in B\cap\overline{A}$.

      Therefore, $B\subseteq(B\cap A)\cup(B\cap\overline{A})$.

    • Inclusion ($\supseteq$):

      Let $y$ be an arbitrary element of $(B\cap A)\cup(B\cap\overline{A})$. We want to show that $y$ is also in $B$.

      • If $y\in B\cap A$: $y$ is in the intersection of $B$ and $A$, so $y$ is in $B$.
      • If $y\in B\cap\overline{A}$: $y$ is in the intersection of $B$ and the complement of $A$, so $y$ is in $B$.

      Since $y$ must lie in either $B\cap A$ or $B\cap\overline{A}$, we can conclude that $y$ is in $B$.

      Therefore, $(B\cap A)\cup(B\cap\overline{A})\subseteq B$.

      Combining both inclusions, we have $B=(B\cap A)\cup(B\cap\overline{A})$.