//NOTE: This article is not finished yet and may still contain errors; I will keep editing it.
Sources:
- Thomas M. Cover & Joy A. Thomas. (2006). Chapter 8. Differential Entropy. Elements of Information Theory (2nd ed., pp. 243-255). Wiley-Interscience.
- Fady Alajaji & Po-Ning Chen. (2018). Chapter 5. Differential Entropy and Gaussian Channels. An Introduction to Single-User Information Theory (1st ed., pp. 165-218). Springer.
Notations
- Note that the joint pdf $f_{X,Y}(x,y)$ is also commonly written as $f(x,y)$.
Joint differential entropy
Definition: If $X^n = (X_1, X_2, \ldots, X_n)$ is a continuous random vector of size $n$ (i.e., a vector of $n$ continuous random variables) with joint pdf $f_{X^n}$ and support $\mathcal{S}_{X^n} \subseteq \mathbb{R}^n$, then its joint differential entropy is defined as
$$
h(X^n) = h(X_1, X_2, \ldots, X_n) := -\int_{\mathcal{S}_{X^n}} f_{X^n}(x_1, \ldots, x_n) \log f_{X^n}(x_1, \ldots, x_n) \, dx_1 \cdots dx_n
$$
when the $n$-dimensional integral exists.
By contrast, entropy and differential entropy are sometimes called discrete entropy and continuous entropy, respectively.
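As a quick numerical sanity check of the definition above (the distributions, parameters, and the NumPy sketch are my own illustration, not from the sources), one can estimate $h(X_1, X_2) = -E[\log f_{X_1,X_2}(X_1,X_2)]$ by Monte Carlo for two independent exponentials, whose differential entropies are known in closed form ($1 - \ln\lambda$ nats each):

```python
# Monte Carlo check of h(X1, X2) = -E[log f_{X1,X2}(X1, X2)] for two independent
# Exponential(lambda_i) components, with h(Exp(lam)) = 1 - ln(lam) nats.
import numpy as np

rng = np.random.default_rng(0)
lam1, lam2 = 2.0, 0.5
n_samples = 1_000_000

x1 = rng.exponential(1 / lam1, n_samples)
x2 = rng.exponential(1 / lam2, n_samples)

# log of the joint pdf of the independent pair, evaluated at the samples
log_f = (np.log(lam1) - lam1 * x1) + (np.log(lam2) - lam2 * x2)

h_mc = -log_f.mean()
h_exact = (1 - np.log(lam1)) + (1 - np.log(lam2))
print(f"Monte Carlo: {h_mc:.4f} nats, closed form: {h_exact:.4f} nats")
```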
Conditional differential entropy
Definition: Let $X$ and $Y$ be two jointly distributed continuous random variables with joint pdf $f_{X,Y}$ and support $\mathcal{S}_{X,Y} \subseteq \mathbb{R}^2$ such that the conditional pdf of $Y$ given $X$, given by
$$
f_{Y|X}(y|x) = \frac{f_{X,Y}(x,y)}{f_X(x)},
$$
is well defined for all $x \in \mathcal{S}_X$, where $f_X$ is the marginal pdf of $X$. Then, the conditional differential entropy of $Y$ given $X$ is defined as
$$
h(Y|X) := -\int_{\mathcal{S}_{X,Y}} f_{X,Y}(x,y) \log f_{Y|X}(y|x) \, dx \, dy
$$
when the integral exists. Note that as in the case of (discrete) entropy, the chain rule holds for differential entropy:
$$
h(X,Y) = h(X) + h(Y|X) = h(Y) + h(X|Y).
$$
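The chain rule can be checked numerically. The sketch below (an illustrative bivariate Gaussian example of my own, not from the sources) estimates $h(X)$, $h(Y|X)$ and $h(X,Y)$ by Monte Carlo and compares $h(X) + h(Y|X)$ with $h(X,Y)$:

```python
# Monte Carlo check of the chain rule h(X,Y) = h(X) + h(Y|X) for a bivariate
# Gaussian pair with unit variances and correlation rho.
import numpy as np

rng = np.random.default_rng(1)
rho, n_samples = 0.7, 1_000_000
cov = np.array([[1.0, rho], [rho, 1.0]])
xy = rng.multivariate_normal([0.0, 0.0], cov, size=n_samples)
x, y = xy[:, 0], xy[:, 1]

def gauss_logpdf(z, var):
    return -0.5 * (np.log(2 * np.pi * var) + z**2 / var)

h_x = -gauss_logpdf(x, 1.0).mean()                            # -E[log f_X(X)]
h_y_given_x = -gauss_logpdf(y - rho * x, 1 - rho**2).mean()   # Y|X ~ N(rho*X, 1-rho^2)

quad = np.einsum('ij,jk,ik->i', xy, np.linalg.inv(cov), xy)   # x^T cov^{-1} x per sample
log_f_xy = -0.5 * (2 * np.log(2 * np.pi) + np.log(np.linalg.det(cov)) + quad)
h_xy = -log_f_xy.mean()

print(f"h(X) + h(Y|X) = {h_x + h_y_given_x:.4f},  h(X,Y) = {h_xy:.4f}  (nats)")
```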
Relative entropy
Definition: Let $X$ and $Y$ be two continuous random variables with marginal pdfs $f_X$ and $f_Y$, respectively, such that their supports satisfy $\mathcal{S}_X \subseteq \mathcal{S}_Y \subseteq \mathbb{R}$. Then, the KL-divergence (or relative entropy) between $X$ and $Y$ is written as $D(X \| Y)$ or $D(f_X \| f_Y)$ and defined by
$$
D(X \| Y) := \int_{\mathcal{S}_X} f_X(x) \log \frac{f_X(x)}{f_Y(x)} \, dx
$$
when the integral exists. The definition carries over similarly in the multivariate case: for $X^n$ and $Y^n$ two random vectors with joint pdfs $f_{X^n}$ and $f_{Y^n}$, respectively, and supports satisfying $\mathcal{S}_{X^n} \subseteq \mathcal{S}_{Y^n} \subseteq \mathbb{R}^n$, the divergence between $X^n$ and $Y^n$ is defined as
$$
D(X^n \| Y^n) := \int_{\mathcal{S}_{X^n}} f_{X^n}(x^n) \log \frac{f_{X^n}(x^n)}{f_{Y^n}(x^n)} \, dx^n
$$
when the integral exists.
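For a concrete instance, the divergence between two univariate Gaussians has a well-known closed form; the sketch below (parameters chosen arbitrarily by me) compares it with a Monte Carlo estimate of $E_X[\log f_X(X)/f_Y(X)]$:

```python
# Check of D(X||Y) = E_X[log f_X(X)/f_Y(X)] for two univariate Gaussians, against
# the closed form ln(s2/s1) + (s1^2 + (m1-m2)^2)/(2*s2^2) - 1/2 (in nats).
import numpy as np

rng = np.random.default_rng(2)
m1, s1, m2, s2 = 0.0, 1.0, 1.0, 2.0
x = rng.normal(m1, s1, 1_000_000)

def logpdf(z, m, s):
    return -0.5 * np.log(2 * np.pi * s**2) - (z - m)**2 / (2 * s**2)

d_mc = np.mean(logpdf(x, m1, s1) - logpdf(x, m2, s2))
d_exact = np.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5
print(f"Monte Carlo: {d_mc:.4f} nats, closed form: {d_exact:.4f} nats")
```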
Mutual information

Definition: Let $X$ and $Y$ be two jointly distributed continuous random variables with joint pdf $f_{X,Y}$ and support $\mathcal{S}_{X,Y} \subseteq \mathbb{R}^2$. Then, the mutual information between $X$ and $Y$ is defined by
$$
I(X;Y) := D(f_{X,Y} \,\|\, f_X f_Y) = \int_{\mathcal{S}_{X,Y}} f_{X,Y}(x,y) \log \frac{f_{X,Y}(x,y)}{f_X(x) f_Y(y)} \, dx \, dy,
$$
assuming the integral exists, where $f_X$ and $f_Y$ are the marginal pdfs of $X$ and $Y$, respectively.
Observation 5.13: For two jointly distributed continuous random variables $X$ and $Y$ with joint pdf $f_{X,Y}$, support $\mathcal{S}_{X,Y}$ and joint differential entropy $h(X,Y)$, as in Lemma 5.2 and the ensuing discussion one can write
$$
H\big([X]_n, [Y]_m\big) \approx h(X,Y) + n + m
$$
for $n$ and $m$ sufficiently large, where $[Z]_j$ denotes the (uniformly) quantized version of random variable $Z$ with a $j$-bit accuracy. On the other hand, for the above continuous $X$ and $Y$,
$$
I\big([X]_n; [Y]_m\big) \approx I(X;Y)
$$
for $n$ and $m$ sufficiently large; in other words,
$$
I(X;Y) = \lim_{n,m \to \infty} I\big([X]_n; [Y]_m\big).
$$
Furthermore, it can be shown that
$$
D(X \| Y) = \lim_{n \to \infty} D\big([X]_n \,\big\|\, [Y]_n\big).
$$
Thus, mutual information and divergence can be considered as the true tools of information theory, as they retain the same operational characteristics and properties for both discrete and continuous probability spaces (as well as general spaces, where they can be defined in terms of Radon-Nikodym derivatives).
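To illustrate Observation 5.13 numerically, the sketch below (a rough plug-in estimate on sampled data, with bin range, correlation and sample size chosen by me) quantizes a bivariate Gaussian pair with increasingly fine uniform cells and compares the discrete mutual information of the quantized pair with the analytical value $-\tfrac{1}{2}\ln(1-\rho^2)$:

```python
# Plug-in mutual information of a uniformly quantized bivariate Gaussian pair,
# approaching -0.5*ln(1 - rho^2) as the quantization becomes finer.
# Note: the plug-in estimate has a small positive bias at fine quantization levels.
import numpy as np

rng = np.random.default_rng(3)
rho, n_samples = 0.8, 1_000_000
xy = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=n_samples)

def quantized_mi(x, y, bits):
    step = 2.0 ** (-bits)                        # cell width for j-bit accuracy
    edges = np.arange(-6.0, 6.0 + step, step)    # covers essentially all the mass
    p, _, _ = np.histogram2d(x, y, bins=[edges, edges])
    p /= p.sum()
    px, py = p.sum(axis=1, keepdims=True), p.sum(axis=0, keepdims=True)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / (px * py)[mask]))

for bits in (1, 2, 3, 4):
    print(f"{bits}-bit quantization: I = {quantized_mi(xy[:, 0], xy[:, 1], bits):.4f} nats")
print(f"analytical I(X;Y) = {-0.5 * np.log(1 - rho**2):.4f} nats")
```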
Properties
The following properties hold for the information measures of continuous systems.
//TODO: Proofs
## Nonnegativity of divergence
Nonnegativity of divergence: Let $X$ and $Y$ be two continuous random variables with marginal pdfs $f_X$ and $f_Y$, respectively, such that their supports satisfy $\mathcal{S}_X \subseteq \mathcal{S}_Y \subseteq \mathbb{R}$. Then
$$
D(X \| Y) \ge 0,
$$
with equality iff $f_X(x) = f_Y(x)$ for all $x \in \mathcal{S}_X$ except in a set of Lebesgue measure zero (i.e., $f_X = f_Y$ almost surely).

## Nonnegativity of mutual information
Nonnegativity of mutual information: For any two continuous jointly distributed random variables $X$ and $Y$,
$$
I(X;Y) \ge 0,
$$
with equality iff $X$ and $Y$ are independent.

## Conditioning never increases differential entropy
For any two continuous random variables $X$ and $Y$ with joint pdf $f_{X,Y}$ and well-defined conditional pdf $f_{X|Y}$,
$$
h(X \mid Y) \le h(X),
$$
with equality iff $X$ and $Y$ are independent.

## Chain rule for differential entropy
For a continuous random vector $X^n = (X_1, X_2, \ldots, X_n)$,
$$
h(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} h(X_i \mid X_1, \ldots, X_{i-1}),
$$
where $h(X_i \mid X_1, \ldots, X_{i-1}) := h(X_1)$ for $i = 1$.
## Chain rule for mutual information

For a continuous random vector $X^n = (X_1, \ldots, X_n)$ and random variable $Y$ with joint pdf $f_{X^n, Y}$ and well-defined conditional pdfs $f_{X^n \mid Y}$ and $f_{X_i \mid X_1, \ldots, X_{i-1}, Y}$ for $i = 1, \ldots, n$, we have that
$$
I(X_1, X_2, \ldots, X_n; Y) = \sum_{i=1}^{n} I(X_i; Y \mid X_1, \ldots, X_{i-1}),
$$
where $I(X_i; Y \mid X_1, \ldots, X_{i-1}) := I(X_1; Y)$ for $i = 1$.

## Data processing inequality
For continuous random variables $X$, $Y$ and $Z$ such that $X \to Y \to Z$, i.e., $X$ and $Z$ are conditionally independent given $Y$,
$$
I(X;Z) \le I(X;Y).
$$
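As a simple closed-form illustration (the additive Gaussian noise cascade is my own example, not from the sources), take $Y = X + N_1$ and $Z = Y + N_2$ with independent Gaussians $X \sim \mathcal{N}(0,P)$, $N_1 \sim \mathcal{N}(0,v_1)$, $N_2 \sim \mathcal{N}(0,v_2)$; then $I(X;Y) = \tfrac{1}{2}\ln(1 + P/v_1)$ and $I(X;Z) = \tfrac{1}{2}\ln\big(1 + P/(v_1+v_2)\big) \le I(X;Y)$:

```python
# Data processing inequality on a Gaussian cascade X -> Y -> Z with Y = X + N1,
# Z = Y + N2: the second noise can only reduce the mutual information about X.
import numpy as np

P, v1, v2 = 4.0, 1.0, 2.0                # Var(X), Var(N1), Var(N2)
i_xy = 0.5 * np.log(1 + P / v1)          # I(X;Y) for the Gaussian channel X -> Y
i_xz = 0.5 * np.log(1 + P / (v1 + v2))   # I(X;Z): the two noises add up
print(f"I(X;Y) = {i_xy:.4f} nats  >=  I(X;Z) = {i_xz:.4f} nats")
```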
## Independence bound for differential entropy

For a continuous random vector $X^n = (X_1, X_2, \ldots, X_n)$,
$$
h(X_1, X_2, \ldots, X_n) \le \sum_{i=1}^{n} h(X_i),
$$
with equality iff all the $X_i$'s are independent from each other.

## Invariance of differential entropy under translation
For continuous random variables $X$ and $Y$ with joint pdf $f_{X,Y}$ and well-defined conditional pdf $f_{X|Y}$,
$$
h(X + c) = h(X) \quad \text{for any constant } c \in \mathbb{R},
$$
and
$$
h(X + Y \mid Y) = h(X \mid Y).
$$
The results also generalize to the multivariate case: for two continuous random vectors $X^n$ and $Y^n$ with joint pdf $f_{X^n, Y^n}$ and well-defined conditional pdf $f_{X^n \mid Y^n}$,
$$
h(X^n + c^n) = h(X^n)
$$
for any constant $n$-tuple $c^n = (c_1, \ldots, c_n) \in \mathbb{R}^n$, and
$$
h(X^n + Y^n \mid Y^n) = h(X^n \mid Y^n),
$$
where the addition of two $n$-tuples is performed component-wise.

## Differential entropy under scaling
For any continuous random variable $X$ and any nonzero real constant $a$,
$$
h(aX) = h(X) + \log |a|.
$$
## Joint differential entropy under linear mapping
Consider the random (column) vector $X^n = (X_1, X_2, \ldots, X_n)^T$ with joint pdf $f_{X^n}$, where $T$ denotes transposition, and let $Y^n = (Y_1, Y_2, \ldots, Y_n)^T$ be a random (column) vector obtained from the linear transformation $Y^n = A X^n$, where $A$ is an invertible (non-singular) real-valued $n \times n$ matrix. Then
$$
h(Y^n) = h(A X^n) = h(X^n) + \log |\det(A)|,
$$
where $\det(A)$ is the determinant of the square matrix $A$.
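A small sanity check of this property (my own sketch, not from the sources): for a Gaussian vector, both $h(X^n)$ and $h(AX^n)$ are available in closed form via the multivariate Gaussian entropy formula derived later in this section, so the property reduces to the determinant identity $\det(A K A^T) = \det(A)^2 \det(K)$:

```python
# Check of h(A X) = h(X) + log|det A| for a Gaussian vector, using the closed-form
# Gaussian entropy h = 0.5*ln((2*pi*e)^n det(K)) for both X and A X.
import numpy as np

rng = np.random.default_rng(5)
n = 3
B = rng.normal(size=(n, n))
K = B @ B.T + np.eye(n)        # a positive-definite covariance matrix for X
A = rng.normal(size=(n, n))    # a random matrix, invertible with probability one

def gauss_entropy(cov):
    # differential entropy (in nats) of a Gaussian vector with covariance `cov`
    return 0.5 * np.log((2 * np.pi * np.e) ** cov.shape[0] * np.linalg.det(cov))

h_x = gauss_entropy(K)
h_ax = gauss_entropy(A @ K @ A.T)   # A X is Gaussian with covariance A K A^T
print(f"h(AX) - h(X) = {h_ax - h_x:.6f}")
print(f"log|det A|   = {np.log(abs(np.linalg.det(A))):.6f}")
```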
## Joint differential entropy under nonlinear mapping

Consider the random (column) vector $X^n = (X_1, X_2, \ldots, X_n)^T$ with joint pdf $f_{X^n}$, and let $Y^n = (Y_1, Y_2, \ldots, Y_n)^T$ be a random (column) vector obtained from the nonlinear transformation
$$
Y^n = g(X^n) := \big(g_1(X^n), g_2(X^n), \ldots, g_n(X^n)\big)^T,
$$
where each $g_i: \mathbb{R}^n \to \mathbb{R}$ is a differentiable function, $i = 1, \ldots, n$. Then
$$
h(Y^n) \le h(X^n) + \int_{\mathcal{S}_{X^n}} f_{X^n}(x^n) \log \big|\det J(x^n)\big| \, dx^n,
$$
with equality when the transformation $g$ is one-to-one, where $J(x^n)$ is the Jacobian matrix given by
$$
J(x^n) = \left[ \frac{\partial g_i}{\partial x_j}(x^n) \right]_{i,j = 1, \ldots, n}.
$$
Observation: Property 9 (differential entropy under scaling) of the above lemma indicates that for a continuous random variable $X$,
$$
h(aX) \neq h(X)
$$
(except for the trivial case of $|a| = 1$), and hence differential entropy is not in general invariant under invertible maps. This is in contrast to entropy, which is always invariant under invertible maps: given a discrete random variable $X$ with alphabet $\mathcal{X}$,
$$
H(g(X)) = H(X)
$$
for all invertible maps $g: \mathcal{X} \to \mathcal{Y}$, where $\mathcal{Y}$ is a discrete set; in particular, $H(aX) = H(X)$ for all nonzero reals $a$.
On the other hand, for both discrete and continuous systems, mutual information and divergence are invariant under invertible maps:
$$
I(X;Y) = I(g_1(X); g_2(Y))
$$
and
$$
D(X \| Y) = D(g_1(X) \| g_1(Y))
$$
for all invertible maps $g_1$ and $g_2$ properly defined on the alphabet/support of the concerned random variables. This reinforces the notion that mutual information and divergence constitute the true tools of information theory.
Joint differential entropy of the multivariate Gaussian
If $X^n = (X_1, X_2, \ldots, X_n)^T$ is a Gaussian random vector with mean vector $\mu$ and (positive-definite) covariance matrix $K_{X^n}$, then its joint differential entropy is given by
$$
h(X^n) = h(X_1, X_2, \ldots, X_n) = \frac{1}{2} \log\!\big[(2\pi e)^n \det(K_{X^n})\big]. \qquad (5.2.1)
$$
In particular, in the univariate case of $n = 1$, (5.2.1) reduces to (5.1.1).

Proof: Without loss of generality, we assume that $X^n$ has a zero-mean vector, since its differential entropy is invariant under translation by Property 8 of Lemma 5.14:
$$
h(X^n - \mu) = h(X^n);
$$
so we assume that $\mu = 0$.
Since the covariance matrix $K_{X^n}$ is a real-valued symmetric matrix, it is orthogonally diagonalizable; i.e., there exists a square orthogonal matrix $A$ (i.e., satisfying $A A^T = A^T A = I_n$) such that $A K_{X^n} A^T$ is a diagonal matrix whose entries are given by the eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_n$ of $K_{X^n}$ ($A$ is constructed using the eigenvectors of $K_{X^n}$; e.g., see [128]). As a result, the linear transformation $Z^n := A X^n$ is a Gaussian vector with the diagonal covariance matrix $K_{Z^n} = A K_{X^n} A^T$ and has therefore independent components (as noted in Observation 5.17). Thus
$$
\begin{aligned}
h(Z^n) &= h(Z_1, Z_2, \ldots, Z_n) \\
&= \sum_{i=1}^{n} h(Z_i) && (5.2.2)\\
&= \sum_{i=1}^{n} \frac{1}{2} \log\big(2\pi e\, \lambda_i\big) && (5.2.3)\\
&= \frac{1}{2} \log\!\Big[(2\pi e)^n \prod_{i=1}^{n} \lambda_i\Big] \\
&= \frac{1}{2} \log\!\big[(2\pi e)^n \det(A K_{X^n} A^T)\big] && (5.2.4)\\
&= \frac{1}{2} \log\!\big[(2\pi e)^n \det(K_{X^n})\big], && (5.2.5)
\end{aligned}
$$
where (5.2.2) follows by the independence of the random variables $Z_1, Z_2, \ldots, Z_n$ (e.g., see Property 7 of Lemma 5.14), (5.2.3) follows from (5.1.1), (5.2.4) holds since the matrix $A K_{X^n} A^T$ is diagonal and hence its determinant is given by the product of its diagonal entries, and (5.2.5) holds since
$$
\det(A K_{X^n} A^T) = \det(A) \det(K_{X^n}) \det(A^T) = \det(K_{X^n}),
$$
where the last equality holds since
$$
\det(A) \det(A^T) = \det(A A^T) = \det(I_n) = 1,
$$
as the matrix $A$ is orthogonal; thus, $\det(A K_{X^n} A^T) = \det(K_{X^n})$. Now invoking Property 10 of Lemma 5.14 and noting that $|\det(A)| = 1$ yield that
$$
h(Z^n) = h(A X^n) = h(X^n) + \log|\det(A)| = h(X^n). \qquad (5.2.6)
$$
We therefore obtain using (5.2.6) that
$$
h(X^n) = h(Z^n) = \frac{1}{2} \log\!\big[(2\pi e)^n \det(K_{X^n})\big],
$$
hence completing the proof. An alternate (but rather mechanical) proof to the one presented above consists of directly evaluating the joint differential entropy of $X^n$ by integrating $-f_{X^n}(x^n) \log f_{X^n}(x^n)$ over $\mathbb{R}^n$; it is left as an exercise.
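A Monte Carlo check of the theorem (dimension, covariance and sample size are arbitrary choices of mine): estimate $-E[\log f_{X^n}(X^n)]$ from Gaussian samples and compare it with $\tfrac{1}{2}\log\big[(2\pi e)^n \det(K_{X^n})\big]$:

```python
# Monte Carlo check of h(X^n) = 0.5*log[(2*pi*e)^n det(K)] for a Gaussian vector.
import numpy as np

rng = np.random.default_rng(6)
n, n_samples = 3, 500_000
B = rng.normal(size=(n, n))
K = B @ B.T + np.eye(n)                           # positive-definite covariance
x = rng.multivariate_normal(np.zeros(n), K, size=n_samples)

inv_K = np.linalg.inv(K)
_, logdet = np.linalg.slogdet(K)
quad = np.einsum('ij,jk,ik->i', x, inv_K, x)      # x^T K^{-1} x for each sample
log_f = -0.5 * (n * np.log(2 * np.pi) + logdet + quad)

h_mc = -log_f.mean()
h_exact = 0.5 * (n * np.log(2 * np.pi * np.e) + logdet)
print(f"Monte Carlo: {h_mc:.4f} nats, closed form: {h_exact:.4f} nats")
```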
Hadamard's inequality
For any real-valued $n \times n$ positive-definite matrix $K = [K_{i,j}]$,
$$
\det(K) \le \prod_{i=1}^{n} K_{i,i},
$$
with equality iff $K$ is a diagonal matrix, where $K_{1,1}, K_{2,2}, \ldots, K_{n,n}$ are the diagonal entries of $K$.

Proof: Since every positive-definite matrix is a covariance matrix (e.g., see [162]), let $X^n = (X_1, X_2, \ldots, X_n)^T$ be a jointly Gaussian random vector with zero-mean vector and covariance matrix $K$. Then
$$
\begin{aligned}
\frac{1}{2} \log\!\big[(2\pi e)^n \det(K)\big] &= h(X_1, X_2, \ldots, X_n) && (5.2.7)\\
&\le \sum_{i=1}^{n} h(X_i) && (5.2.8)\\
&= \sum_{i=1}^{n} \frac{1}{2} \log\big(2\pi e\, K_{i,i}\big) && (5.2.9)\\
&= \frac{1}{2} \log\!\Big[(2\pi e)^n \prod_{i=1}^{n} K_{i,i}\Big], && (5.2.10)
\end{aligned}
$$
where (5.2.7) follows from Theorem 5.18, (5.2.8) follows from Property 7 of Lemma 5.14, and (5.2.9)-(5.2.10) hold using (5.1.1) along with the fact that each random variable $X_i$ is Gaussian with zero mean and variance $K_{i,i}$, $i = 1, \ldots, n$ (as the marginals of a multivariate Gaussian are also Gaussian; e.g., cf. [162]). Finally, from (5.2.10), we directly obtain that
$$
\det(K) \le \prod_{i=1}^{n} K_{i,i},
$$
with equality iff the jointly Gaussian random variables $X_1, X_2, \ldots, X_n$ are independent from each other, or equivalently iff the covariance matrix $K$ is diagonal.
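A quick numerical illustration of Hadamard's inequality (the randomly generated matrices below are my own sketch):

```python
# det(K) <= product of the diagonal entries of K, for random positive-definite K.
import numpy as np

rng = np.random.default_rng(4)
n = 4
for _ in range(5):
    B = rng.normal(size=(n, n))
    K = B @ B.T + 1e-3 * np.eye(n)       # positive definite by construction
    print(f"det(K) = {np.linalg.det(K):10.4f}   prod(diag(K)) = {np.prod(np.diag(K)):10.4f}")
```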
The next theorem states that among all real-valued size-$n$ random vectors (of support $\mathbb{R}^n$) with identical mean vector and covariance matrix, the Gaussian random vector has the largest differential entropy.
Maximal differential entropy for real-valued random vectors
Let $X^n = (X_1, X_2, \ldots, X_n)^T$ be a real-valued random vector with a joint pdf of support $\mathbb{R}^n$, mean vector $\mu$, covariance matrix $K$ and finite joint differential entropy $h(X_1, X_2, \ldots, X_n)$. Then
$$
h(X_1, X_2, \ldots, X_n) \le \frac{1}{2} \log\!\big[(2\pi e)^n \det(K)\big],
$$
with equality iff $X^n$ is Gaussian; i.e., $X^n \sim \mathcal{N}_n(\mu, K)$.
Proof: We will present the proof in two parts: the scalar or univariate case, and the multivariate case.

(i) Scalar case ($n = 1$): For a real-valued random variable $X$ with support $\mathcal{S}_X = \mathbb{R}$, mean $\mu$ and variance $\sigma^2$, let us show that
$$
h(X) \le \frac{1}{2} \log\big(2\pi e\, \sigma^2\big), \qquad (5.2.12)
$$
with equality iff $X \sim \mathcal{N}(\mu, \sigma^2)$. For a Gaussian random variable $X^g \sim \mathcal{N}(\mu, \sigma^2)$ with pdf $f_{X^g}$, using the nonnegativity of divergence, we can write
$$
\begin{aligned}
0 \le D(f_X \| f_{X^g}) &= \int_{\mathbb{R}} f_X(x) \log \frac{f_X(x)}{f_{X^g}(x)} \, dx \\
&= -h(X) - \int_{\mathbb{R}} f_X(x) \log f_{X^g}(x) \, dx \\
&= -h(X) + \int_{\mathbb{R}} f_X(x) \left[\frac{1}{2} \log(2\pi\sigma^2) + \frac{(x-\mu)^2}{2\sigma^2} \log e\right] dx \\
&= -h(X) + \frac{1}{2} \log(2\pi\sigma^2) + \frac{1}{2} \log e \\
&= -h(X) + \frac{1}{2} \log\big(2\pi e\, \sigma^2\big).
\end{aligned}
$$
Thus
$$
h(X) \le \frac{1}{2} \log\big(2\pi e\, \sigma^2\big),
$$
with equality iff $f_X = f_{X^g}$ (almost surely); i.e., $X \sim \mathcal{N}(\mu, \sigma^2)$.

(ii) Multivariate case ($n > 1$): As in the proof of Theorem 5.18, we can use an orthogonal square matrix $A$ (i.e., satisfying $A A^T = I_n$ and hence $|\det(A)| = 1$) such that $A K A^T$ is diagonal. Therefore, the random vector $Z^n := A X^n$ generated by the linear map will have a covariance matrix given by $K_{Z^n} = A K A^T$ and hence have uncorrelated (but not necessarily independent) components. Thus
$$
\begin{aligned}
h(X_1, X_2, \ldots, X_n) &= h(Z_1, Z_2, \ldots, Z_n) && (5.2.13)\\
&\le \sum_{i=1}^{n} h(Z_i) && (5.2.14)\\
&\le \sum_{i=1}^{n} \frac{1}{2} \log\big(2\pi e \,\mathrm{Var}(Z_i)\big) && (5.2.15)\\
&= \frac{1}{2} \log\!\big[(2\pi e)^n \det(A K A^T)\big] && (5.2.16)\\
&= \frac{1}{2} \log\!\big[(2\pi e)^n \det(K)\big], && (5.2.17)
\end{aligned}
$$
where (5.2.13) holds by Property 10 of Lemma 5.14 and since $|\det(A)| = 1$, (5.2.14) follows from Property 7 of Lemma 5.14, (5.2.15) follows from (5.2.12) (the scalar case above), (5.2.16) holds since $A K A^T$ (the covariance matrix of $Z^n$) is diagonal with diagonal entries $\mathrm{Var}(Z_1), \ldots, \mathrm{Var}(Z_n)$, and (5.2.17) follows from the fact that $\det(A K A^T) = \det(K)$ (as $A$ is orthogonal). Finally, equality is achieved in both (5.2.14) and (5.2.15) iff the random variables $Z_1, Z_2, \ldots, Z_n$ are Gaussian and independent from each other, or equivalently iff $X^n = A^T Z^n$ is Gaussian; i.e., $X^n \sim \mathcal{N}_n(\mu, K)$.

Observation 5.21 (Examples of maximal differential entropy under various constraints): The following three results can also be shown (the proof is left as an exercise):
1. Among all continuous random variables admitting a pdf with support the interval $[a, b]$, where $a < b$ are real numbers, the uniformly distributed random variable on $[a, b]$ maximizes differential entropy.
2. Among all continuous random variables admitting a pdf with support the interval $[0, \infty)$, finite mean $\mu > 0$, and finite differential entropy, the exponentially distributed random variable with parameter (or rate parameter) $\lambda = 1/\mu$ maximizes differential entropy.
3. Among all continuous random variables admitting a pdf with support $\mathbb{R}$, finite mean $\mu$, and finite differential entropy and satisfying $E[|X - \mu|] = \lambda$, where $\lambda > 0$ is a fixed finite parameter, the Laplacian random variable with mean $\mu$, variance $2\lambda^2$ and pdf
$$
f_X(x) = \frac{1}{2\lambda} e^{-\frac{|x - \mu|}{\lambda}}, \qquad x \in \mathbb{R},
$$
maximizes differential entropy.
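As a numerical illustration of the theorem above (the distributions and the unit-variance normalization are my own choices), the sketch below compares the closed-form differential entropies of a Gaussian, a Laplacian and a uniform random variable, all with the same variance; the Gaussian is the largest:

```python
# Closed-form comparison: at fixed variance, the Gaussian differential entropy
# exceeds that of a Laplacian or a uniform random variable (values in nats).
import numpy as np

var = 1.0
h_gauss = 0.5 * np.log(2 * np.pi * np.e * var)   # N(0, var)
b = np.sqrt(var / 2)                             # Laplacian scale: Var = 2*b^2
h_laplace = 1.0 + np.log(2 * b)
w = np.sqrt(12 * var)                            # Uniform width: Var = w^2/12
h_uniform = np.log(w)
print(f"Gaussian: {h_gauss:.4f}  Laplacian: {h_laplace:.4f}  Uniform: {h_uniform:.4f}")
```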
A systematic approach to finding distributions that maximize differential entropy subject to various support and moments constraints can be found in [83, 415].
We close this section by noting that for stationary zero-mean Gaussian processes $\{X_i\}$ and $\{\hat{X}_i\}$, the differential entropy rate, $\lim_{n \to \infty} \frac{1}{n} h(X_1, \ldots, X_n)$, the divergence rate, $\lim_{n \to \infty} \frac{1}{n} D(X^n \| \hat{X}^n)$, as well as their Rényi counterparts, all exist and admit analytical expressions in terms of the source power spectral densities [154, 196, 223, 393], [144, Table 4]. In particular, the differential entropy rate of $\{X_i\}$ and the divergence rate between $\{X_i\}$ and $\{\hat{X}_i\}$ are given (in nats) by
$$
\lim_{n \to \infty} \frac{1}{n} h(X_1, \ldots, X_n) = \frac{1}{2} \ln(2\pi e) + \frac{1}{4\pi} \int_{-\pi}^{\pi} \ln S_X(\omega) \, d\omega \qquad (5.2.18)
$$
and
$$
\lim_{n \to \infty} \frac{1}{n} D(X^n \| \hat{X}^n) = \frac{1}{4\pi} \int_{-\pi}^{\pi} \left[\frac{S_X(\omega)}{S_{\hat{X}}(\omega)} - 1 - \ln \frac{S_X(\omega)}{S_{\hat{X}}(\omega)}\right] d\omega, \qquad (5.2.19)
$$
respectively. Here, $S_X(\omega)$ and $S_{\hat{X}}(\omega)$ denote the power spectral densities of the zero-mean stationary Gaussian processes $\{X_i\}$ and $\{\hat{X}_i\}$, respectively. Recall that for a stationary zero-mean process $\{X_i\}$, its power spectral density $S_X$ is the (discrete-time) Fourier transform of its covariance function $K_X(\tau) := E[X_{t+\tau} X_t]$; more precisely,
$$
S_X(\omega) = \sum_{\tau = -\infty}^{\infty} K_X(\tau) \, e^{-i \omega \tau}, \qquad \omega \in [-\pi, \pi),
$$
where $i = \sqrt{-1}$ is the imaginary unit number. Note that (5.2.18) and (5.2.19) hold under mild integrability and boundedness conditions; see [196, Sect. 2.4] for the details.
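As a hedged numerical check of the entropy-rate formula (the AR(1) example below is my own, not from the sources): for a stationary Gaussian AR(1) process $X_t = a X_{t-1} + Z_t$ with $|a| < 1$ and innovation variance $\sigma^2$, the entropy rate equals $\tfrac{1}{2}\ln(2\pi e \sigma^2)$ nats per sample, and the spectral integral in (5.2.18) reproduces this value:

```python
# Numerical check of the Gaussian entropy-rate formula for an AR(1) process
# X_t = a*X_{t-1} + Z_t, Z_t ~ N(0, s2), whose power spectral density is
# S(w) = s2 / |1 - a*exp(-i*w)|^2.
import numpy as np

a, s2 = 0.9, 0.5
w = np.linspace(-np.pi, np.pi, 200_000, endpoint=False)   # uniform grid on [-pi, pi)
S = s2 / np.abs(1.0 - a * np.exp(-1j * w)) ** 2

# Riemann-sum approximation of (1/4pi) * integral of ln S(w) dw over [-pi, pi)
integral_term = np.log(S).mean() / 2.0
rate_spectral = 0.5 * np.log(2 * np.pi * np.e) + integral_term
rate_exact = 0.5 * np.log(2 * np.pi * np.e * s2)          # innovations form
print(f"spectral formula: {rate_spectral:.6f} nats, innovations form: {rate_exact:.6f} nats")
```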