Relative Entropy (or KL divergence)

Sources:

  1. Elements of Information Theory
  2. An Introduction to Single-User Information Theory

Definition

The relative entropy (or Kullback-Leibler divergence, KL divergence) $D(p\|q)$ is a 'measure' of the distance between the true distribution $p$ and the model distribution $q$ (e.g., a neural network). It is defined as

$$D(p\|q) = \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)} = \mathbb{E}_{x \sim p}\!\left[\log \frac{p(x)}{q(x)}\right],$$

where $\mathcal{X}$ is the sample space of the random variable $X$.

In the above definition, we use two conventions:

  1. $0 \log \frac{0}{0} = 0$;
  2. (based on continuity arguments) $0 \log \frac{0}{q} = 0$ and $p \log \frac{p}{0} = \infty$.

Thus, if there is any symbol $x \in \mathcal{X}$ such that $p(x) > 0$ and $q(x) = 0$, then $D(p\|q) = \infty$.

You may see other symbols used to represent relative entropy; they are interchangeable: $D(p\|q)$, $KL(p\|q)$, $D_{\mathrm{KL}}(p\|q)$.
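To make the definition and the conventions above concrete, here is a minimal Python sketch; the function name kl_divergence and the dict-based representation of the distributions are illustrative choices, not from the sources.

```python
import math

def kl_divergence(p, q):
    """D(p || q) in bits for discrete distributions given as dicts
    mapping symbols to probabilities, using the conventions
    0*log(0/0) = 0, 0*log(0/q) = 0, and p*log(p/0) = infinity."""
    total = 0.0
    for x, px in p.items():
        if px == 0:
            continue  # 0 * log(0/q) = 0 by convention
        qx = q.get(x, 0.0)
        if qx == 0:
            return math.inf  # p(x) > 0 but q(x) = 0
        total += px * math.log(px / qx, 2)  # log base 2 gives bits
    return total
```

For example, kl_divergence({'a': 0.5, 'b': 0.5}, {'a': 0.5, 'b': 0.5}) returns 0.0, consistent with property (2) below.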

Properties of Relative Entropy

  1. In general, relative entropy is asymmetric ($D(p\|q) \neq D(q\|p)$) and does not satisfy the triangle inequality. Therefore, it is not a metric.
  2. $D(p\|p) = 0$.
  3. $D(p\|q) \geq 0$ for all distributions $p, q$, with equality holding iff $p = q$.
  4. $D(p(y|x)\|q(y|x)) \geq 0$, with equality if and only if $p(y|x) = q(y|x)$ for all $y$ and $x$ such that $p(x) > 0$.

Property (3) is proved using Jensen’s inequality.
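For reference, a standard proof sketch, writing $A = \{x : p(x) > 0\}$ for the support of $p$ (the symbol $A$ is introduced here only for brevity):

$$-D(p\|q) = \sum_{x \in A} p(x) \log\frac{q(x)}{p(x)} \le \log\sum_{x \in A} p(x)\,\frac{q(x)}{p(x)} = \log\sum_{x \in A} q(x) \le \log 1 = 0,$$

where the first inequality is Jensen's inequality applied to the concave function $\log$. Equality requires $q(x)/p(x)$ to be constant on $A$ and $\sum_{x \in A} q(x) = 1$, which together force $p = q$.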

Property (4) follows from property (3) by applying it to each conditional distribution $p(y|x)$ vs. $q(y|x)$ and averaging over $p(x)$.

Relative Entropy is Not Symmetric

In the following problem and solution, we give a counterexample showing that relative entropy is not symmetric.

Relative entropy is not symmetric. Let the random variable $X$ have three possible outcomes $\{a, b, c\}$. Consider two distributions $p(x)$ and $q(x)$ on this random variable:

Symbol   p(x)   q(x)
a        1/2    1/3
b        1/4    1/3
c        1/4    1/3

Calculate $H(p)$, $H(q)$, $D(p\|q)$, and $D(q\|p)$. Verify that in this case $D(p\|q) \neq D(q\|p)$.

Solution:

$$H(p) = \frac{1}{2}\log 2 + \frac{1}{4}\log 4 + \frac{1}{4}\log 4 = 1.5 \text{ bits}$$

$$H(q) = 3 \times \frac{1}{3}\log 3 = 1.58496 \text{ bits}$$

$$D(p\|q) = \frac{1}{2}\log\frac{3}{2} + \frac{1}{4}\log\frac{3}{4} + \frac{1}{4}\log\frac{3}{4} = \log 3 - 1.5 = 0.08496 \text{ bits}$$

$$D(q\|p) = \frac{1}{3}\log\frac{2}{3} + \frac{1}{3}\log\frac{4}{3} + \frac{1}{3}\log\frac{4}{3} = \frac{5}{3} - \log 3 = 0.0817 \text{ bits}$$

It is clear that $D(p\|q) \neq D(q\|p)$.
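As a quick numerical check of the solution, here is a short Python snippet that reuses the kl_divergence sketch from the definition section; the entropy helper is likewise only illustrative.

```python
import math

def entropy(p):
    """Shannon entropy in bits of a distribution given as a dict."""
    return -sum(px * math.log(px, 2) for px in p.values() if px > 0)

p = {'a': 1/2, 'b': 1/4, 'c': 1/4}
q = {'a': 1/3, 'b': 1/3, 'c': 1/3}

# kl_divergence is the sketch defined in the Definition section above.
print(entropy(p))           # 1.5
print(entropy(q))           # approx. 1.58496
print(kl_divergence(p, q))  # approx. 0.08496
print(kl_divergence(q, p))  # approx. 0.08170
```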

Conditional Relative Entropy

We define a conditional version of the relative entropy.

Definition: For joint probability mass functions $p(x, y)$ and $q(x, y)$, the conditional relative entropy $D(p(y|x)\|q(y|x))$ is the average of the relative entropies between the conditional probability mass functions $p(y|x)$ and $q(y|x)$, averaged over the probability mass function $p(x)$. More precisely,

$$D(p(y|x)\|q(y|x)) = \sum_x p(x) \sum_y p(y|x) \log\frac{p(y|x)}{q(y|x)} = \mathbb{E}_{p(x,y)}\!\left[\log\frac{p(Y|X)}{q(Y|X)}\right].$$

The notation for conditional relative entropy is not explicit since it omits mention of the distribution p(x) of the conditioning random variable. However, it is normally understood from the context.
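A minimal Python sketch of this definition, assuming the joint pmf $p(x, y)$ is given as a dict keyed by $(x, y)$ pairs and $q(y|x)$ as a dict keyed the same way (these representations, and the name conditional_kl, are illustrative only):

```python
import math

def conditional_kl(p_joint, q_cond):
    """D(p(y|x) || q(y|x)) in bits, where p_joint[(x, y)] = p(x, y)
    and q_cond[(x, y)] = q(y|x)."""
    # Recover the marginal p(x) = sum_y p(x, y).
    p_x = {}
    for (x, y), pxy in p_joint.items():
        p_x[x] = p_x.get(x, 0.0) + pxy
    total = 0.0
    for (x, y), pxy in p_joint.items():
        if pxy == 0:
            continue  # 0 * log(0/q) = 0 by convention
        p_y_given_x = pxy / p_x[x]
        q_y_given_x = q_cond.get((x, y), 0.0)
        if q_y_given_x == 0:
            return math.inf  # p(y|x) > 0 but q(y|x) = 0
        # E_{p(x,y)}[log p(Y|X)/q(Y|X)]: each term weighted by p(x, y)
        total += pxy * math.log(p_y_given_x / q_y_given_x, 2)
    return total
```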

The relative entropy between two joint distributions on a pair of random variables can be expanded as the sum of a relative entropy and a conditional relative entropy.
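Written out, this is the chain rule for relative entropy:

$$D(p(x, y)\|q(x, y)) = D(p(x)\|q(x)) + D(p(y|x)\|q(y|x)).$$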

For multivariate Gaussian distributions

Suppose $p = \mathcal{N}(\mu_1, \Sigma_1)$ and $q = \mathcal{N}(\mu_2, \Sigma_2)$ are both $n$-dimensional multivariate Gaussian distributions.

Their relative entropy (or KL divergence) is:

$$D(p\|q) = \frac{1}{2}\left[\log\frac{|\Sigma_2|}{|\Sigma_1|} - n + \mathrm{tr}\{\Sigma_2^{-1}\Sigma_1\} + (\mu_2 - \mu_1)^T \Sigma_2^{-1} (\mu_2 - \mu_1)\right]$$
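As a sanity check, here is a minimal NumPy sketch of this formula; the function name gaussian_kl is an illustrative choice, and natural logarithms are used, so the result is in nats.

```python
import numpy as np

def gaussian_kl(mu1, sigma1, mu2, sigma2):
    """D(p || q) in nats for p = N(mu1, sigma1), q = N(mu2, sigma2)."""
    n = mu1.shape[0]
    sigma2_inv = np.linalg.inv(sigma2)
    diff = mu2 - mu1
    # log(|Sigma_2| / |Sigma_1|), computed via slogdet for numerical stability
    log_det_ratio = np.linalg.slogdet(sigma2)[1] - np.linalg.slogdet(sigma1)[1]
    return 0.5 * (log_det_ratio - n
                  + np.trace(sigma2_inv @ sigma1)
                  + diff @ sigma2_inv @ diff)

# Identical Gaussians have zero divergence.
mu, cov = np.zeros(3), np.eye(3)
print(gaussian_kl(mu, cov, mu, cov))  # 0.0
```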