Properties of Differential Entropy

//NOTE: This article is not finished yet and contains many errors. I am always ready to edit it.

Sources:

  1. Thomas M. Cover & Joy A. Thomas. (2006). Chapter 8. Differential Entropy. Elements of Information Theory (2nd ed., pp. 243-255). Wiley-Interscience.
  2. Fady Alajaji & Po-Ning Chen. (2018). Chapter 5. Differential Entropy and Gaussian Channels. An Introduction to Single-User Information Theory (1st ed., pp. 165-218). Springer.

Notations

  • Note that the joint pdf $f_{X,Y}$ is also commonly written as $f_{XY}$.

Joint differential entropy

Definition: If $X^n = (X_1, X_2, \ldots, X_n)$ is a continuous random vector of size $n$ (i.e., a vector of $n$ continuous random variables) with joint pdf $f_{X^n}$ and support $S_{X^n} \subseteq \mathbb{R}^n$, then its joint differential entropy is defined as
$$h(X^n) := -\int_{S_{X^n}} f_{X^n}(x_1, x_2, \ldots, x_n) \log_2 f_{X^n}(x_1, x_2, \ldots, x_n)\, dx_1\, dx_2 \cdots dx_n = -E\left[\log_2 f_{X^n}(X^n)\right]$$
when the $n$-dimensional integral exists.
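
As a quick sanity check of the expectation form of this definition, the following sketch (not from the sources above; it assumes NumPy is available, and the seed, variance, and sample size are arbitrary choices) estimates $h(X) = -E[\log_2 f_X(X)]$ by Monte Carlo for a scalar Gaussian and compares it with the closed-form value $\frac{1}{2}\log_2(2\pi e \sigma^2)$.

```python
import numpy as np

# Monte Carlo estimate of h(X) = -E[log2 f_X(X)] for X ~ N(0, sigma^2),
# compared against the closed form (1/2) * log2(2*pi*e*sigma^2).
rng = np.random.default_rng(0)
sigma = 2.0
samples = rng.normal(0.0, sigma, size=1_000_000)

# Evaluate the Gaussian pdf at the sampled points.
pdf_vals = np.exp(-samples**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

h_mc = -np.mean(np.log2(pdf_vals))                    # Monte Carlo estimate (bits)
h_exact = 0.5 * np.log2(2 * np.pi * np.e * sigma**2)  # closed form (bits)

print(f"Monte Carlo estimate: {h_mc:.4f} bits")
print(f"Closed form:          {h_exact:.4f} bits")
```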

To emphasize the distinction, entropy and differential entropy are sometimes called discrete entropy and continuous entropy, respectively.

Conditional differential entropy

Definition: Let $X$ and $Y$ be two jointly distributed continuous random variables with joint pdf $f_{X,Y}$ and support $S_{X,Y} \subseteq \mathbb{R}^2$ such that the conditional pdf of $Y$ given $X$, given by
$$f_{Y|X}(y|x) = \frac{f_{X,Y}(x,y)}{f_X(x)},$$
is well defined for all $(x,y) \in S_{X,Y}$, where $f_X$ is the marginal pdf of $X$. Then the conditional differential entropy of $Y$ given $X$ is defined as
$$h(Y|X) := -\iint_{S_{X,Y}} f_{X,Y}(x,y) \log_2 f_{Y|X}(y|x)\, dx\, dy = -E\left[\log_2 f_{Y|X}(Y|X)\right],$$
when the integral exists. Note that, as in the case of (discrete) entropy, the chain rule holds for differential entropy:
$$h(X,Y) = h(X) + h(Y|X) = h(Y) + h(X|Y).$$
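
For a concrete instance of the chain rule (a standard textbook example, not taken verbatim from the sources above), consider a zero-mean bivariate Gaussian pair $(X,Y)$ with unit variances and correlation coefficient $\rho \in (-1,1)$. Then $Y$ given $X = x$ is $\mathcal{N}(\rho x,\, 1-\rho^2)$, so using the univariate and joint Gaussian entropy formulas (derived later in this article),
$$h(X) + h(Y \mid X) = \frac{1}{2}\log_2(2\pi e) + \frac{1}{2}\log_2\!\left(2\pi e\,(1-\rho^2)\right) = \frac{1}{2}\log_2\!\left[(2\pi e)^2 (1-\rho^2)\right] = h(X,Y),$$
since $\det(K_{X,Y}) = 1-\rho^2$ for this pair.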

Relative entropy

Definition: Let $X$ and $Y$ be two continuous random variables with marginal pdfs $f_X$ and $f_Y$, respectively, such that their supports satisfy $S_X \subseteq S_Y \subseteq \mathbb{R}$. Then the KL-divergence (or relative entropy) between $X$ and $Y$ is written as $D(X\|Y)$ or $D(f_X\|f_Y)$ and defined by
$$D(X\|Y) := \int_{S_X} f_X(x) \log_2 \frac{f_X(x)}{f_Y(x)}\, dx = E\left[\log_2 \frac{f_X(X)}{f_Y(X)}\right]$$
when the integral exists. The definition carries over similarly to the multivariate case: for $X^n = (X_1, X_2, \ldots, X_n)$ and $Y^n = (Y_1, Y_2, \ldots, Y_n)$ two random vectors with joint pdfs $f_{X^n}$ and $f_{Y^n}$, respectively, and supports satisfying $S_{X^n} \subseteq S_{Y^n} \subseteq \mathbb{R}^n$, the divergence between $X^n$ and $Y^n$ is defined as
$$D(X^n\|Y^n) := \int_{S_{X^n}} f_{X^n}(x_1, x_2, \ldots, x_n) \log_2 \frac{f_{X^n}(x_1, x_2, \ldots, x_n)}{f_{Y^n}(x_1, x_2, \ldots, x_n)}\, dx_1\, dx_2 \cdots dx_n$$
when the integral exists.
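
As a numerical illustration (not from the sources above; it assumes NumPy/SciPy, and the specific means and variances are arbitrary), the sketch below evaluates $D(f_X\|f_Y)$ by quadrature for two scalar Gaussians and compares it with the well-known Gaussian-to-Gaussian closed form, here expressed in bits.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# D(f_X || f_Y) in bits for X ~ N(mu1, s1^2), Y ~ N(mu2, s2^2):
# numerical quadrature of f_X * log2(f_X / f_Y) versus the known closed form.
mu1, s1 = 0.0, 1.0
mu2, s2 = 1.0, 2.0

def integrand(x):
    fx = norm.pdf(x, mu1, s1)
    fy = norm.pdf(x, mu2, s2)
    return fx * np.log2(fx / fy)

kl_numeric, _ = quad(integrand, -20, 20)  # Gaussian tails beyond +/-20 are negligible

# Closed form in nats, converted to bits.
kl_exact_nats = np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5
kl_exact = kl_exact_nats / np.log(2)

print(f"quadrature:  {kl_numeric:.6f} bits")
print(f"closed form: {kl_exact:.6f} bits")
```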

Mutual information

Definition: Let $X$ and $Y$ be two jointly distributed continuous random variables with joint pdf $f_{X,Y}$ and support $S_{X,Y} \subseteq \mathbb{R}^2$. Then the mutual information between $X$ and $Y$ is defined by
$$I(X;Y) := D(f_{X,Y} \| f_X f_Y) = \iint_{S_{X,Y}} f_{X,Y}(x,y) \log_2 \frac{f_{X,Y}(x,y)}{f_X(x) f_Y(y)}\, dx\, dy,$$
assuming the integral exists, where $f_X$ and $f_Y$ are the marginal pdfs of $X$ and $Y$, respectively.

Observation 5.13: Consider two jointly distributed continuous random variables $X$ and $Y$ with joint pdf $f_{X,Y}$, support $S_{X,Y} \subseteq \mathbb{R}^2$ and joint differential entropy
$$h(X,Y) = -\iint_{S_{X,Y}} f_{X,Y}(x,y) \log_2 f_{X,Y}(x,y)\, dx\, dy.$$
Then, as in Lemma 5.2 and the ensuing discussion, one can write
$$H(q_n(X), q_m(Y)) \approx h(X,Y) + n + m$$
for $n$ and $m$ sufficiently large, where $q_k(Z)$ denotes the (uniformly) quantized version of random variable $Z$ with $k$-bit accuracy. On the other hand, for the above continuous $X$ and $Y$,
$$I(q_n(X); q_m(Y)) = H(q_n(X)) + H(q_m(Y)) - H(q_n(X), q_m(Y)) \approx [h(X)+n] + [h(Y)+m] - [h(X,Y)+n+m] = h(X) + h(Y) - h(X,Y) = \iint_{S_{X,Y}} f_{X,Y}(x,y) \log_2 \frac{f_{X,Y}(x,y)}{f_X(x) f_Y(y)}\, dx\, dy$$
for $n$ and $m$ sufficiently large; in other words,
$$\lim_{n,m \to \infty} I(q_n(X); q_m(Y)) = h(X) + h(Y) - h(X,Y).$$

Furthermore, it can be shown that
$$\lim_{n \to \infty} D(q_n(X) \| q_n(Y)) = \int_{S_X} f_X(x) \log_2 \frac{f_X(x)}{f_Y(x)}\, dx.$$
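
The mutual information limit in Observation 5.13 can be illustrated empirically. The sketch below (an illustrative experiment, not from the sources above; it assumes NumPy, uses a correlated bivariate Gaussian, and relies on a simple histogram plug-in estimate of discrete mutual information, which is only approximate at finite sample sizes) quantizes $X$ and $Y$ ever more finely and compares the estimated $I(q_n(X); q_n(Y))$ with the limit $h(X) + h(Y) - h(X,Y) = -\frac{1}{2}\log_2(1-\rho^2)$ for unit-variance marginals.

```python
import numpy as np

# Empirical illustration of Observation 5.13: I(q_n(X); q_n(Y)) -> I(X;Y) as the
# quantization gets finer.  For a unit-variance bivariate Gaussian with correlation
# rho, the limit is I(X;Y) = -(1/2) * log2(1 - rho^2).
rng = np.random.default_rng(0)
rho, n_samples = 0.9, 2_000_000
cov = [[1.0, rho], [rho, 1.0]]
x, y = rng.multivariate_normal([0.0, 0.0], cov, size=n_samples).T

def plug_in_mi_bits(qx, qy):
    """Plug-in (histogram) estimate of discrete mutual information, in bits."""
    joint = np.zeros((qx.max() + 1, qy.max() + 1))
    np.add.at(joint, (qx, qy), 1)
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return np.sum(joint[nz] * np.log2(joint[nz] / (px @ py)[nz]))

for n_bits in range(1, 6):
    delta = 2.0 ** (-n_bits)                        # quantization step = 2^-n
    qx = np.floor((x - x.min()) / delta).astype(int)
    qy = np.floor((y - y.min()) / delta).astype(int)
    print(f"n = {n_bits}: I(q_n(X); q_n(Y)) ~= {plug_in_mi_bits(qx, qy):.3f} bits")

print(f"limit I(X;Y) = {-0.5 * np.log2(1 - rho**2):.3f} bits")
```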

Thus, mutual information and divergence can be considered the true tools of information theory, as they retain the same operational characteristics and properties for both discrete and continuous probability spaces (as well as for general spaces, where they can be defined in terms of Radon–Nikodym derivatives1).

Properties

The following properties hold for the information measures of continuous systems.

//TODO: Proofs

Nonnegativity of divergence

Nonnegativity of divergence: Let $X$ and $Y$ be two continuous random variables with marginal pdfs $f_X$ and $f_Y$, respectively, such that their supports satisfy $S_X \subseteq S_Y \subseteq \mathbb{R}$. Then
$$D(f_X \| f_Y) \ge 0,$$
with equality iff $f_X(x) = f_Y(x)$ for all $x \in S_X$ except on a set of $f_X$-measure zero (i.e., $X = Y$ almost surely).

Nonnegativity of mutual information

Nonnegativity of mutual information: For any two continuous jointly distributed random variables $X$ and $Y$,
$$I(X;Y) \ge 0,$$
with equality iff $X$ and $Y$ are independent.

Conditioning never increases differential entropy

For any two continuous random variables $X$ and $Y$ with joint pdf $f_{X,Y}$ and well-defined conditional pdf $f_{X|Y}$,
$$h(X|Y) \le h(X),$$
with equality iff $X$ and $Y$ are independent.

Chain rule for differential entropy

For a continuous random vector $X^n = (X_1, X_2, \ldots, X_n)$,
$$h(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} h(X_i \mid X_1, X_2, \ldots, X_{i-1}),$$
where $h(X_i \mid X_1, X_2, \ldots, X_{i-1}) := h(X_1)$ for $i = 1$.
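
For instance, for $n = 3$ the chain rule reads (with the $i = 1$ convention making the first term an unconditional differential entropy):
$$h(X_1, X_2, X_3) = h(X_1) + h(X_2 \mid X_1) + h(X_3 \mid X_1, X_2).$$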

Chain rule for mutual information

For a continuous random vector $X^n = (X_1, X_2, \ldots, X_n)$ and a random variable $Y$ with joint pdf $f_{X^n,Y}$ and well-defined conditional pdfs $f_{X_i,Y|X^{i-1}}$, $f_{X_i|X^{i-1}}$ and $f_{Y|X^{i-1}}$ for $i = 1, \ldots, n$, we have that
$$I(X_1, X_2, \ldots, X_n; Y) = \sum_{i=1}^{n} I(X_i; Y \mid X_{i-1}, \ldots, X_1),$$
where $I(X_i; Y \mid X_{i-1}, \ldots, X_1) := I(X_1; Y)$ for $i = 1$.

Data processing inequality

For continuous random variables $X$, $Y$, and $Z$ such that $X \to Y \to Z$, i.e., $X$ and $Z$ are conditionally independent given $Y$,
$$I(X;Y) \ge I(X;Z).$$

Independence bound for differential entropy

For a continuous random vector $X^n = (X_1, X_2, \ldots, X_n)$,
$$h(X^n) \le \sum_{i=1}^{n} h(X_i),$$
with equality iff all the $X_i$'s are independent of each other.

Invariance of differential entropy under translation

For continuous random variables $X$ and $Y$ with joint pdf $f_{X,Y}$ and well-defined conditional pdf $f_{X|Y}$,
$$h(X + c) = h(X)$$
for any constant $c \in \mathbb{R}$, and
$$h(X + Y \mid Y) = h(X \mid Y).$$

The results also generalize to the multivariate case: for two continuous random vectors $X^n = (X_1, X_2, \ldots, X_n)$ and $Y^n = (Y_1, Y_2, \ldots, Y_n)$ with joint pdf $f_{X^n,Y^n}$ and well-defined conditional pdf $f_{X^n|Y^n}$,
$$h(X^n + c^n) = h(X^n)$$
for any constant $n$-tuple $c^n = (c_1, c_2, \ldots, c_n) \in \mathbb{R}^n$, and
$$h(X^n + Y^n \mid Y^n) = h(X^n \mid Y^n),$$
where the addition of two $n$-tuples is performed component-wise.

Differential entropy under scaling

For any continuous random variable $X$ and any nonzero real constant $a$,
$$h(aX) = h(X) + \log_2|a|.$$
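
As a quick check of the scaling property against the univariate Gaussian entropy formula $h(X) = \frac{1}{2}\log_2(2\pi e\sigma^2)$ (a standard example, not from the sources above): if $X \sim \mathcal{N}(0, \sigma^2)$, then $aX \sim \mathcal{N}(0, a^2\sigma^2)$ and
$$h(aX) = \frac{1}{2}\log_2\!\left(2\pi e\, a^2 \sigma^2\right) = \frac{1}{2}\log_2\!\left(2\pi e\, \sigma^2\right) + \log_2|a| = h(X) + \log_2|a|.$$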

Joint differential entropy under linear mapping

Consider the random (column) vector $X = (X_1, X_2, \ldots, X_n)^T$ with joint pdf $f_{X^n}$, where $T$ denotes transposition, and let $Y = (Y_1, Y_2, \ldots, Y_n)^T$ be a random (column) vector obtained from the linear transformation $Y = AX$, where $A$ is an invertible (non-singular) $n \times n$ real-valued matrix. Then
$$h(Y) = h(Y_1, Y_2, \ldots, Y_n) = h(X_1, X_2, \ldots, X_n) + \log_2|\det(A)|,$$
where $\det(A)$ is the determinant of the square matrix $A$.
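
Since both sides have closed forms in the Gaussian case, this property can be checked numerically. The sketch below (not from the sources above; it assumes NumPy, and the covariance matrix and transformation are arbitrary random choices) compares $h(AX)$ computed from the covariance of $AX$ with $h(X) + \log_2|\det(A)|$, using the Gaussian joint entropy formula $\frac{1}{2}\log_2[(2\pi e)^n \det(K)]$ derived below.

```python
import numpy as np

def gaussian_h_bits(K):
    """Joint differential entropy (bits) of N_n(0, K): 0.5 * log2((2*pi*e)^n * det(K))."""
    n = K.shape[0]
    return 0.5 * np.log2((2 * np.pi * np.e) ** n * np.linalg.det(K))

rng = np.random.default_rng(1)
n = 3
B = rng.normal(size=(n, n))
K_X = B @ B.T + n * np.eye(n)         # a positive-definite covariance matrix
A = rng.normal(size=(n, n))           # almost surely invertible

h_X = gaussian_h_bits(K_X)
h_Y = gaussian_h_bits(A @ K_X @ A.T)  # covariance of Y = A X is A K_X A^T

print(f"h(AX)               = {h_Y:.6f} bits")
print(f"h(X) + log2|det(A)| = {h_X + np.log2(abs(np.linalg.det(A))):.6f} bits")
```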

Joint differential entropy under nonlinear mapping

Consider the random (column) vector $X = (X_1, X_2, \ldots, X_n)^T$ with joint pdf $f_{X^n}$, and let $Y = (Y_1, Y_2, \ldots, Y_n)^T$ be a random (column) vector obtained from the nonlinear transformation $Y = g(X) := (g_1(X_1), g_2(X_2), \ldots, g_n(X_n))^T$, where each $g_i: \mathbb{R} \to \mathbb{R}$ is a differentiable function, $i = 1, 2, \ldots, n$. Then
$$h(Y) = h(Y_1, Y_2, \ldots, Y_n) = h(X_1, \ldots, X_n) + \int_{\mathbb{R}^n} f_{X^n}(x_1, \ldots, x_n) \log_2|\det(J)|\, dx_1 \cdots dx_n,$$
where $J$ is the $n \times n$ Jacobian matrix given by
$$J := \begin{bmatrix} \dfrac{\partial g_1}{\partial x_1} & \dfrac{\partial g_1}{\partial x_2} & \cdots & \dfrac{\partial g_1}{\partial x_n} \\ \dfrac{\partial g_2}{\partial x_1} & \dfrac{\partial g_2}{\partial x_2} & \cdots & \dfrac{\partial g_2}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \dfrac{\partial g_n}{\partial x_1} & \dfrac{\partial g_n}{\partial x_2} & \cdots & \dfrac{\partial g_n}{\partial x_n} \end{bmatrix}.$$

Observation: The scaling property above (differential entropy under scaling) indicates that for a continuous random variable $X$, $h(X) \ne h(aX)$ in general (except for the trivial case $|a| = 1$), and hence differential entropy is not, in general, invariant under invertible maps. This is in contrast to entropy, which is always invariant under invertible maps: given a discrete random variable $X$ with alphabet $\mathcal{X}$, $H(f(X)) = H(X)$ for all invertible maps $f: \mathcal{X} \to \mathcal{Y}$, where $\mathcal{Y}$ is a discrete set; in particular, $H(aX) = H(X)$ for all nonzero reals $a$.

On the other hand, for both discrete and continuous systems, mutual information and divergence are invariant under invertible maps:
$$I(X;Y) = I(g(X);Y) = I(g(X);h(Y)) \quad \text{and} \quad D(X\|Y) = D(g(X)\|g(Y))$$
for all invertible maps $g$ and $h$ properly defined on the alphabet/support of the concerned random variables. This reinforces the notion that mutual information and divergence constitute the true tools of information theory.

Joint differential entropy of the multivariate Gaussian

If $X \sim \mathcal{N}_n(\mu, K_X)$ is a Gaussian random vector with mean vector $\mu$ and (positive-definite) covariance matrix $K_X$, then its joint differential entropy is given by
$$h(X) = h(X_1, X_2, \ldots, X_n) = \frac{1}{2}\log_2\left[(2\pi e)^n \det(K_X)\right].$$

In particular, in the univariate case $n = 1$, this reduces to the familiar formula $h(X) = \frac{1}{2}\log_2(2\pi e\sigma^2)$.

Proof: Without loss of generality, we may assume that $X$ has a zero mean vector, since its differential entropy is invariant under translation (Property 8 of Lemma 5.14): $h(X) = h(X - \mu)$, so we take $\mu = 0$.

Since the covariance matrix $K_X$ is a real-valued symmetric matrix, it is orthogonally diagonalizable; i.e., there exists a square ($n \times n$) orthogonal matrix $A$ (satisfying $A^T = A^{-1}$) such that $A K_X A^T$ is a diagonal matrix whose entries are given by the eigenvalues of $K_X$ ($A$ is constructed using the eigenvectors of $K_X$; e.g., see [128]). As a result, the linear transformation $Y = AX \sim \mathcal{N}_n(0, A K_X A^T)$ is a Gaussian vector with the diagonal covariance matrix $K_Y = A K_X A^T$ and therefore has independent components (as noted in Observation 5.17). Thus
$$h(Y) = h(Y_1, Y_2, \ldots, Y_n) = \sum_{i=1}^{n} h(Y_i) = \sum_{i=1}^{n} \frac{1}{2}\log_2\left[2\pi e \operatorname{Var}(Y_i)\right] = \frac{n}{2}\log_2(2\pi e) + \frac{1}{2}\log_2\left[\prod_{i=1}^{n} \operatorname{Var}(Y_i)\right] = \frac{n}{2}\log_2(2\pi e) + \frac{1}{2}\log_2\left[\det(K_Y)\right] = \frac{1}{2}\log_2(2\pi e)^n + \frac{1}{2}\log_2\left[\det(K_X)\right] = \frac{1}{2}\log_2\left[(2\pi e)^n \det(K_X)\right],$$
where the second equality follows from the independence of the random variables $Y_1, \ldots, Y_n$ (e.g., see Property 7 of Lemma 5.14), the third equality follows from the univariate Gaussian formula $h(Y_i) = \frac{1}{2}\log_2[2\pi e \operatorname{Var}(Y_i)]$, the fifth equality holds since the matrix $K_Y$ is diagonal and hence its determinant is the product of its diagonal entries, and the sixth equality holds since
$$\det(K_Y) = \det(A K_X A^T) = \det(A)\det(K_X)\det(A^T) = \det(A)^2\det(K_X) = \det(K_X),$$
where the last equality holds since $\det(A)^2 = 1$, as the matrix $A$ is orthogonal ($A^T = A^{-1} \Rightarrow \det(A) = \det(A^T) = 1/\det(A)$, and thus $\det(A)^2 = 1$). Now invoking Property 10 of Lemma 5.14 (joint differential entropy under a linear mapping) and noting that $|\det(A)| = 1$ yield
$$h(Y_1, Y_2, \ldots, Y_n) = h(X_1, X_2, \ldots, X_n) + \underbrace{\log_2|\det(A)|}_{=0} = h(X_1, X_2, \ldots, X_n).$$

Combining the last two displays, we therefore obtain
$$h(X_1, X_2, \ldots, X_n) = \frac{1}{2}\log_2\left[(2\pi e)^n \det(K_X)\right],$$
hence completing the proof. An alternate (but rather mechanical) proof to the one presented above consists of directly evaluating the joint differential entropy of $X$ by integrating $-f_{X^n}(x^n)\log_2 f_{X^n}(x^n)$ over $\mathbb{R}^n$; it is left as an exercise.
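
The direct evaluation mentioned above can at least be verified numerically. The sketch below (a numerical check rather than the analytical exercise; it assumes NumPy/SciPy, uses a particular $2 \times 2$ covariance matrix, and truncates the integration to a box outside which the Gaussian mass is negligible) integrates $-f_{X^2}\log_2 f_{X^2}$ for a bivariate Gaussian and compares the result with $\frac{1}{2}\log_2[(2\pi e)^2 \det(K_X)]$.

```python
import numpy as np
from scipy.integrate import dblquad

# Direct numerical evaluation of h(X1, X2) for a bivariate Gaussian:
# integrate -f * log2(f) and compare with (1/2) * log2((2*pi*e)^2 * det(K)).
rho = 0.5
K = np.array([[1.0, rho], [rho, 1.0]])
K_inv = np.linalg.inv(K)
norm_const = 1.0 / (2 * np.pi * np.sqrt(np.linalg.det(K)))

def neg_f_log2_f(y, x):
    quad_form = K_inv[0, 0] * x**2 + 2 * K_inv[0, 1] * x * y + K_inv[1, 1] * y**2
    f = norm_const * np.exp(-0.5 * quad_form)
    return -f * np.log2(f)

# Integrate over a finite box; the Gaussian mass outside [-10, 10]^2 is negligible.
h_numeric, _ = dblquad(neg_f_log2_f, -10, 10, -10, 10)
h_exact = 0.5 * np.log2((2 * np.pi * np.e) ** 2 * np.linalg.det(K))

print(f"numerical integration: {h_numeric:.6f} bits")
print(f"closed form:           {h_exact:.6f} bits")
```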

Hadamard's inequality

For any real-valued $n \times n$ positive-definite matrix $K = [K_{i,j}]_{i,j=1,\ldots,n}$,
$$\det(K) \le \prod_{i=1}^{n} K_{i,i},$$
with equality iff $K$ is a diagonal matrix, where the $K_{i,i}$ are the diagonal entries of $K$.

Proof: Since every positive-definite matrix is a covariance matrix (e.g., see [162]), let $X = (X_1, X_2, \ldots, X_n)^T \sim \mathcal{N}_n(0, K)$ be a jointly Gaussian random vector with zero mean vector and covariance matrix $K$. Then
$$\frac{1}{2}\log_2\left[(2\pi e)^n \det(K)\right] = h(X_1, X_2, \ldots, X_n) \le \sum_{i=1}^{n} h(X_i) = \sum_{i=1}^{n} \frac{1}{2}\log_2\left[2\pi e \operatorname{Var}(X_i)\right] = \frac{1}{2}\log_2\left[(2\pi e)^n \prod_{i=1}^{n} K_{i,i}\right],$$
where the first equality follows from Theorem 5.18 (the joint differential entropy of the multivariate Gaussian above), the inequality follows from the independence bound for differential entropy (Property 7 of Lemma 5.14), and the last two equalities hold by the univariate Gaussian formula along with the fact that each random variable $X_i \sim \mathcal{N}(0, K_{i,i})$ is Gaussian with zero mean and variance $\operatorname{Var}(X_i) = K_{i,i}$ for $i = 1, 2, \ldots, n$ (as the marginals of a multivariate Gaussian are also Gaussian, e.g., cf. [162]). From the above we directly obtain that
$$\det(K) \le \prod_{i=1}^{n} K_{i,i},$$
with equality iff the jointly Gaussian random variables $X_1, X_2, \ldots, X_n$ are independent of each other, or equivalently iff the covariance matrix $K$ is diagonal.
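
A quick numerical illustration of Hadamard's inequality (not from the sources above; it assumes NumPy and generates an arbitrary random positive-definite matrix):

```python
import numpy as np

# Hadamard's inequality: det(K) <= product of the diagonal entries of K,
# with equality iff K is diagonal.
rng = np.random.default_rng(2)
n = 5
B = rng.normal(size=(n, n))
K = B @ B.T + 0.1 * np.eye(n)        # a random positive-definite matrix

print(f"det(K)           = {np.linalg.det(K):.4f}")
print(f"prod of diagonal = {np.prod(np.diag(K)):.4f}")   # always >= det(K)

# Equality case: for the diagonal matrix with the same diagonal, the two coincide.
D = np.diag(np.diag(K))
print(np.isclose(np.linalg.det(D), np.prod(np.diag(D))))
```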

The next theorem states that among all real-valued size-$n$ random vectors (of support $\mathbb{R}^n$) with identical mean vector and covariance matrix, the Gaussian random vector has the largest differential entropy.

Maximal differential entropy for real-valued random vectors

Let $X = (X_1, X_2, \ldots, X_n)^T$ be a real-valued random vector with a joint pdf of support $S_{X^n} = \mathbb{R}^n$, mean vector $\mu$, covariance matrix $K_X$ and finite joint differential entropy $h(X_1, X_2, \ldots, X_n)$. Then
$$h(X_1, X_2, \ldots, X_n) \le \frac{1}{2}\log_2\left[(2\pi e)^n \det(K_X)\right],$$
with equality iff $X$ is Gaussian, i.e., $X \sim \mathcal{N}_n(\mu, K_X)$.

Proof: We present the proof in two parts: the scalar (univariate) case and the multivariate case.

(i) Scalar case ($n = 1$): For a real-valued random variable $X$ with support $S_X = \mathbb{R}$, mean $\mu$ and variance $\sigma^2$, let us show that
$$h(X) \le \frac{1}{2}\log_2(2\pi e\sigma^2),$$
with equality iff $X \sim \mathcal{N}(\mu, \sigma^2)$. For a Gaussian random variable $Y \sim \mathcal{N}(\mu, \sigma^2)$, using the nonnegativity of divergence, we can write
$$0 \le D(X\|Y) = \int_{\mathbb{R}} f_X(x)\log_2\frac{f_X(x)}{\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}}\, dx = -h(X) + \int_{\mathbb{R}} f_X(x)\left[\frac{1}{2}\log_2(2\pi\sigma^2) + \frac{(x-\mu)^2}{2\sigma^2}\log_2 e\right] dx = -h(X) + \frac{1}{2}\log_2(2\pi\sigma^2) + \frac{\log_2 e}{2\sigma^2}\underbrace{\int_{\mathbb{R}} (x-\mu)^2 f_X(x)\, dx}_{=\sigma^2} = -h(X) + \frac{1}{2}\log_2\left[2\pi e\sigma^2\right].$$

Thus $h(X) \le \frac{1}{2}\log_2\left[2\pi e\sigma^2\right]$, with equality iff $X = Y$ (almost surely), i.e., $X \sim \mathcal{N}(\mu, \sigma^2)$.

(ii) Multivariate case ($n > 1$): As in the proof of Theorem 5.18, we can use an orthogonal square matrix $A$ (satisfying $A^T = A^{-1}$ and hence $|\det(A)| = 1$) such that $A K_X A^T$ is diagonal. Therefore, the random vector generated by the linear map $Z = AX$ has covariance matrix $K_Z = A K_X A^T$ and hence uncorrelated (but not necessarily independent) components. Thus
$$h(X) = h(Z) - \underbrace{\log_2|\det(A)|}_{=0} = h(Z_1, Z_2, \ldots, Z_n) \le \sum_{i=1}^{n} h(Z_i) \le \sum_{i=1}^{n} \frac{1}{2}\log_2\left[2\pi e \operatorname{Var}(Z_i)\right] = \frac{n}{2}\log_2(2\pi e) + \frac{1}{2}\log_2\left[\prod_{i=1}^{n}\operatorname{Var}(Z_i)\right] = \frac{1}{2}\log_2(2\pi e)^n + \frac{1}{2}\log_2\left[\det(K_Z)\right] = \frac{1}{2}\log_2(2\pi e)^n + \frac{1}{2}\log_2\left[\det(K_X)\right] = \frac{1}{2}\log_2\left[(2\pi e)^n \det(K_X)\right],$$
where the first equality holds by the linear-mapping property (Property 10 of Lemma 5.14) and since $|\det(A)| = 1$, the first inequality follows from the independence bound for differential entropy (Property 7 of Lemma 5.14), the second inequality follows from the scalar case above, and the remaining equalities hold since $K_Z$ is diagonal and since $\det(K_Z) = \det(K_X)$ (as $A$ is orthogonal). Finally, equality is achieved in both inequalities iff the random variables $Z_1, Z_2, \ldots, Z_n$ are Gaussian and independent of each other, or equivalently iff $X \sim \mathcal{N}_n(\mu, K_X)$.

Observation 5.21 (Examples of maximal differential entropy under various constraints): The following three results can also be shown (the proofs are left as an exercise):

  1. Among all continuous random variables admitting a pdf with support the interval $(a,b)$, where $b > a$ are real numbers, the uniformly distributed random variable maximizes differential entropy.
  2. Among all continuous random variables admitting a pdf with support the interval $[0,\infty)$, finite mean $\mu$, and finite differential entropy, the exponentially distributed random variable with parameter (or rate parameter) $\lambda = 1/\mu$ maximizes differential entropy.
  3. Among all continuous random variables admitting a pdf with support $\mathbb{R}$, finite mean $\mu$, finite differential entropy, and satisfying $E[|X - \mu|] = \lambda$, where $\lambda > 0$ is a fixed finite parameter, the Laplacian random variable with mean $\mu$, variance $2\lambda^2$ and pdf $f_X(x) = \frac{1}{2\lambda}e^{-\frac{|x-\mu|}{\lambda}}$ for $x \in \mathbb{R}$ maximizes differential entropy.

A systematic approach to finding distributions that maximize differential entropy subject to various support and moment constraints can be found in [83, 415].
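
A small closed-form comparison can illustrate these maximal-entropy statements (this sketch is not from the sources above; it uses the standard differential entropy formulas for the uniform, exponential, Laplacian and Gaussian distributions, with arbitrary parameter values, and checks the fixed-variance theorem plus items 2 and 3 of Observation 5.21).

```python
import numpy as np

# Closed-form differential entropies in bits (standard formulas).
def h_gauss(var):         # X ~ N(mu, var)
    return 0.5 * np.log2(2 * np.pi * np.e * var)

def h_uniform(a, b):      # X ~ Uniform(a, b)
    return np.log2(b - a)

def h_exponential(mean):  # X >= 0, exponential with E[X] = mean
    return np.log2(np.e * mean)

def h_laplace(lam):       # X ~ Laplace(mu, lam):  E|X - mu| = lam,  Var = 2*lam^2
    return np.log2(2 * np.e * lam)

# Fixed variance 1: the Gaussian beats a Laplacian of the same variance (Theorem above).
print(h_gauss(1.0), ">", h_laplace(1.0 / np.sqrt(2)))

# Support [0, inf) and mean 1: the exponential beats a uniform on (0, 2) with the same mean.
print(h_exponential(1.0), ">", h_uniform(0.0, 2.0))

# Fixed E|X - mu| = 1: the Laplacian beats a Gaussian with the same mean absolute
# deviation (for which sigma^2 = pi/2).
print(h_laplace(1.0), ">", h_gauss(np.pi / 2))
```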

Information rates for stationary Gaussian sources

We close this section by noting that for stationary zero-mean Gaussian processes $\{X_i\}$ and $\{\hat{X}_i\}$, the differential entropy rate, $\lim_{n\to\infty}\frac{1}{n}h(X^n)$, the divergence rate, $\lim_{n\to\infty}\frac{1}{n}D(X^n\|\hat{X}^n)$, as well as their Rényi counterparts all exist and admit analytical expressions in terms of the source power spectral densities [154, 196, 223, 393], [144, Table 4]. In particular, the differential entropy rate of $\{X_i\}$ and the divergence rate between $\{X_i\}$ and $\{\hat{X}_i\}$ are given (in nats) by
$$\lim_{n\to\infty}\frac{1}{n}h(X^n) = \frac{1}{2}\ln(2\pi e) + \frac{1}{4\pi}\int_{-\pi}^{\pi}\ln\phi_X(\lambda)\, d\lambda$$
and
$$\lim_{n\to\infty}\frac{1}{n}D(X^n\|\hat{X}^n) = \frac{1}{4\pi}\int_{-\pi}^{\pi}\left(\frac{\phi_X(\lambda)}{\phi_{\hat{X}}(\lambda)} - 1 - \ln\frac{\phi_X(\lambda)}{\phi_{\hat{X}}(\lambda)}\right) d\lambda,$$
respectively. Here, $\phi_X(\cdot)$ and $\phi_{\hat{X}}(\cdot)$ denote the power spectral densities of the zero-mean stationary Gaussian processes $\{X_i\}$ and $\{\hat{X}_i\}$, respectively. Recall that for a stationary zero-mean process $\{Z_i\}$, its power spectral density $\phi_Z(\cdot)$ is the (discrete-time) Fourier transform of its covariance function $K_Z(\tau) := E[Z_{n+\tau}Z_n] - E[Z_{n+\tau}]E[Z_n] = E[Z_{n+\tau}Z_n]$, $n, \tau = 1, 2, \ldots$; more precisely,
$$\phi_Z(\lambda) = \sum_{\tau=-\infty}^{\infty} K_Z(\tau) e^{-j\tau\lambda}, \quad -\pi \le \lambda \le \pi,$$
where $j = \sqrt{-1}$ is the imaginary unit. Note that both rate expressions above hold under mild integrability and boundedness conditions; see [196, Sect. 2.4] for the details.
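
As a final numerical sketch (not from the sources above; it assumes a first-order autoregressive Gaussian source $X_i = aX_{i-1} + W_i$ with $|a| < 1$ and i.i.d. $W_i \sim \mathcal{N}(0,\sigma^2)$, whose power spectral density is $\phi_X(\lambda) = \sigma^2/|1 - ae^{-j\lambda}|^2$), the script below evaluates the entropy-rate integral numerically; for this source the spectral integral reduces to $\ln\sigma^2$, so the rate should coincide with $\frac{1}{2}\ln(2\pi e\sigma^2)$ nats, the differential entropy of the innovation.

```python
import numpy as np
from scipy.integrate import quad

# Differential entropy rate (in nats) of a stationary Gaussian AR(1) source
# X_i = a*X_{i-1} + W_i,  W_i ~ N(0, sigma^2),  via the spectral formula
#   lim (1/n) h(X^n) = (1/2) ln(2*pi*e) + (1/(4*pi)) * integral of ln phi_X(lambda).
a, sigma2 = 0.8, 1.5

def psd(lam):
    # phi_X(lambda) = sigma^2 / |1 - a e^{-j lambda}|^2 = sigma^2 / (1 - 2a cos(lambda) + a^2)
    return sigma2 / (1.0 - 2.0 * a * np.cos(lam) + a**2)

integral, _ = quad(lambda lam: np.log(psd(lam)), -np.pi, np.pi)
rate_spectral = 0.5 * np.log(2 * np.pi * np.e) + integral / (4 * np.pi)

# For an AR(1) source the spectral integral evaluates to 2*pi*ln(sigma^2), so the
# rate should equal (1/2) ln(2*pi*e*sigma^2) -- the entropy of the innovation.
rate_expected = 0.5 * np.log(2 * np.pi * np.e * sigma2)

print(f"spectral formula: {rate_spectral:.6f} nats")
print(f"expected value:   {rate_expected:.6f} nats")
```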


  1. This justifies using identical notations for both $I(\cdot;\cdot)$ and $D(\cdot\|\cdot)$ in the discrete and continuous cases, as opposed to the distinguishing notations of $H(\cdot)$ for entropy and $h(\cdot)$ for differential entropy.↩︎