
Chapter 9

Differential Entropy
The channels considered in the previous chapters have been discrete valued and memoryless (DMC). These channel models are widespread and can represent real world signalling in many cases. On that abstraction level they typically represent correct transmission, errors and erasures. However, in many cases the modelling is done on another abstraction level, closer to the actual signalling on the channel. In those models the signals can be viewed as continuous valued random variables, and the noise on the channel is often considered to be additive white noise. In this chapter random variables that are continuous will be considered and the corresponding information measures adapted to this case. In the next chapter this will be used to find the capacity for time discrete sequences of continuous variables, and then a fundamental limit will be derived for signalling in the presence of additive white Gaussian noise (AWGN).
9.1 Information measures for continuous random variables
Starting with the entropy, which in the discrete variable case is defined as the expected value of the self information, $H(X) = E\big[-\log p(x)\big]$. Using the same definition for continuous variables will result in the following. To keep the definitions apart it is common to name this value the differential entropy¹.

DEFINITION 9.1 Let $X$ be a real continuous random variable with probability density function $f(x)$. The differential entropy is
$$H(X) = E\big[-\log f(x)\big] = -\int_{\mathbb{R}} f(x)\log f(x)\,dx$$
where we use the convention that $0\log 0 = 0$.
¹ For simplicity, it is assumed that the variables are defined over the complete range of real numbers, $\mathbb{R}$, hence the notation $\int_{\mathbb{R}} dx = \int dx$. For variables defined on another range, say $D$, the integration should be $\int_D dx$. Then the definition also includes multi-dimensional variables as defined in (9.1), or complex variables.
In the literature there are other notations as well; among others, $H(f)$ is used to denote the differential entropy of the density function $f(x)$. This is quite handy and will be used on some occasions in this text, similar to the use of $H(p_1, \ldots, p_N)$ for the discrete case.
EXAMPLE 9.1 For a uniform distribution, $U(a)$, the density function is given by
$$f(x) = \begin{cases} \frac{1}{a}, & 0 \le x \le a \\ 0, & \text{otherwise} \end{cases}$$
The differential entropy becomes
$$H(X) = -\int_0^a \frac{1}{a}\log\frac{1}{a}\,dx = \log a\int_0^a \frac{1}{a}\,dx = \log a$$
As can be seen in Figure 9.1 the log function is negative when the argument is less than
1. Hence if a < 1 in the previous example the entropy is negative. If a = 1 the entropy is
zero, and if a > 1 the entropy is positive. The fact that the entropy can now be negative
means that the interpretation of uncertainty can no longer be used.
[Figure 9.1: The log function; $\log a$ is negative for $a < 1$, zero at $a = 1$ and positive for $a > 1$.]
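As a numerical illustration of this sign behaviour, the following sketch (an addition to the text, assuming NumPy and SciPy are available) integrates $-f(x)\log_2 f(x)$ for a few values of $a$ and compares the result with the closed form $\log_2 a$.

```python
# Sketch: numerical check that H(U(a)) = log2(a) for a uniform density on [0, a].
import numpy as np
from scipy.integrate import quad

def uniform_diff_entropy(a):
    f = 1.0 / a                                # density value on [0, a]
    integrand = lambda x: -f * np.log2(f)      # -f(x) log2 f(x)
    H, _ = quad(integrand, 0.0, a)
    return H

for a in (0.5, 1.0, 2.0):
    print(f"a={a}: H={uniform_diff_entropy(a):+.4f} bits, log2(a)={np.log2(a):+.4f}")
# a=0.5 gives a negative differential entropy, a=1 gives zero, a=2 gives +1 bit.
```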
In Appendix A the most common distributions are listed together with their mean, variance and entropy. In this listing the entropy is derived with the natural base, since this makes the integration slightly easier. Then the unit is nats instead of bits. By simple logarithm laws the translation between base 2 and base e can be found as
$$H(X) = E\big[-\log f(x)\big] = \frac{E\big[-\ln f(x)\big]}{\ln 2} = \frac{H_e(X)}{\ln 2}$$
In some cases it can be useful to notice that translation by a constant $c$ of a random variable does not affect the entropy,
$$H(X + c) = -\int_{\mathbb{R}} f(x - c)\log f(x - c)\,dx = -\int_{\mathbb{R}} f(z)\log f(z)\,dz = H(X)$$
where the variable change $z = x - c$ was used. Furthermore, if a random variable $X$ with density function $f(x)$ is scaled by a positive constant $\alpha$, the corresponding density function of $\alpha X$ is $f_{\alpha X}(x) = \frac{1}{\alpha} f\big(\frac{x}{\alpha}\big)$. Then the entropy becomes
$$\begin{aligned}
H(\alpha X) &= -\int_{\mathbb{R}} \frac{1}{\alpha} f\Big(\frac{x}{\alpha}\Big)\log\frac{1}{\alpha} f\Big(\frac{x}{\alpha}\Big)\,dx \\
&= \log\alpha\int_{\mathbb{R}} \frac{1}{\alpha} f\Big(\frac{x}{\alpha}\Big)\,dx - \int_{\mathbb{R}} \frac{1}{\alpha} f\Big(\frac{x}{\alpha}\Big)\log f\Big(\frac{x}{\alpha}\Big)\,dx \\
&= \log\alpha - \int_{\mathbb{R}} f(z)\log f(z)\,dz = H(X) + \log\alpha
\end{aligned}$$
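Both properties can be checked numerically. The sketch below is an addition to the text and assumes SciPy; its built-in entropies are returned in nats, so the scaling law reads $H(\alpha X) = H(X) + \ln\alpha$ in that unit.

```python
# Sketch: H(X + c) = H(X) and H(alpha*X) = H(X) + ln(alpha) for a Gaussian X.
import numpy as np
from scipy.stats import norm

sigma, c, alpha = 1.5, 10.0, 3.0
H_X       = norm(loc=0.0, scale=sigma).entropy()           # nats
H_shifted = norm(loc=c,   scale=sigma).entropy()           # same spread, shifted mean
H_scaled  = norm(loc=0.0, scale=alpha * sigma).entropy()   # alpha*X ~ N(0, (alpha*sigma)^2)

print(np.isclose(H_shifted, H_X))                  # True: translation invariance
print(np.isclose(H_scaled,  H_X + np.log(alpha)))  # True: scaling adds ln(alpha)
```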
EXAMPLE 9.2 Let $X$ be a Gaussian (Normal) distributed random variable, $X \sim N(\mu, \sigma)$. To make the derivations easier, consider first a normalised variable $Y = \frac{X - \mu}{\sigma}$, where $Y \sim N(0, 1)$, with density function
$$f(y) = \frac{1}{\sqrt{2\pi}}\, e^{-y^2/2}$$
The entropy of $Y$ can be derived as
$$\begin{aligned}
H(Y) &= -\int_{\mathbb{R}} f(y)\log f(y)\,dy = -\int_{\mathbb{R}} f(y)\log\frac{1}{\sqrt{2\pi}}\, e^{-y^2/2}\,dy \\
&= -\int_{\mathbb{R}} f(y)\log\frac{1}{\sqrt{2\pi}}\,dy - \int_{\mathbb{R}} f(y)\log e^{-y^2/2}\,dy \\
&= \frac{1}{2}\log(2\pi)\int_{\mathbb{R}} f(y)\,dy + \frac{1}{2}\log(e)\int_{\mathbb{R}} y^2 f(y)\,dy \\
&= \frac{1}{2}\log(2\pi e)
\end{aligned}$$
The normalisation of the mean $E[X] = \mu$ does not affect the entropy, and the entropy for $X = \sigma Y + \mu$ can be derived as
$$H(X) = H(\sigma Y + \mu) = H(Y) + \log(\sigma) = \frac{1}{2}\log(2\pi e) + \frac{1}{2}\log(\sigma^2) = \frac{1}{2}\log(2\pi e\sigma^2)$$
which is the listed function (with the natural base in the logarithm) for the entropy of a Gaussian variable in Appendix A. It should be noted that the entropy for a Gaussian variable can be negative. It will be zero for $\log(2\pi e\sigma^2) = 0$, or equivalently, $\sigma^2 = \frac{1}{2\pi e}$. That is,
$$\sigma^2 > \frac{1}{2\pi e} \;\Rightarrow\; H(X) > 0, \qquad
\sigma^2 = \frac{1}{2\pi e} \;\Rightarrow\; H(X) = 0, \qquad
\sigma^2 < \frac{1}{2\pi e} \;\Rightarrow\; H(X) < 0$$
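A short sketch (an addition to the text, assuming NumPy) that evaluates $\frac{1}{2}\log_2(2\pi e\sigma^2)$ around the threshold $\sigma^2 = 1/(2\pi e)$ illustrates the three cases:

```python
# Sketch: sign of the Gaussian differential entropy around sigma^2 = 1/(2*pi*e).
import numpy as np

def gaussian_entropy_bits(var):
    return 0.5 * np.log2(2 * np.pi * np.e * var)

threshold = 1.0 / (2 * np.pi * np.e)               # ~0.0586
for var in (0.5 * threshold, threshold, 2 * threshold):
    print(f"sigma^2={var:.4f}: H={gaussian_entropy_bits(var):+.4f} bits")
# Output: negative, (numerically) zero and positive entropy, respectively.
```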
A more general form of the differential entropy function is given when multi-dimensional random variables are allowed, $(X_1, X_2, \ldots, X_N)$. The entropy can still be defined as the expectation of the logarithmic density function,
$$H(X_1, \ldots, X_N) = E\big[-\log f(x_1, \ldots, x_N)\big] \qquad (9.1)$$
In particular, for the two-dimensional case the joint differential entropy is defined as follows.
DEFINITION 9.2 The joint differential entropy for a two-dimensional random variable $(X, Y)$ with density function $f(x, y)$ is
$$H(X, Y) = E\big[-\log f(x, y)\big] = -\int_{\mathbb{R}^2} f(x, y)\log f(x, y)\,dx\,dy$$
EXAMPLE 9.3 The two-dimensional continuous random variable $(X, Y)$ has the density function
$$f(x, y) = \begin{cases} A, & x > 0,\ y > 0,\ ax + by < ab \\ 0, & \text{otherwise} \end{cases}$$
That is, it is uniformly distributed over the area shown in Figure 9.2.

[Figure 9.2: The triangular area bounded by the axes and the line $ax + by = ab$, where $f(x, y) = A$.]

First the amplitude of the density function, $A$, must be determined. Using that
$$1 = \int\!\!\int A\,dx\,dy = A\int\!\!\int 1\,dx\,dy = A\,\frac{ab}{2}$$
gives $A = \frac{2}{ab}$. Then the joint entropy becomes
$$H(X, Y) = -\int\!\!\int A\log A\,dx\,dy = -\log A\int\!\!\int A\,dx\,dy = \log\frac{1}{A} = \log\frac{ab}{2} = \log ab - 1$$
As for the discrete case, random vectors can be viewed as multi-dimensional random variables.

Next is the mutual information for continuous variables. It is a straightforward generalisation of the discrete case, as in the next definition.
DEFINITION 9.3 The mutual information for a pair of continuous random variables $(X, Y)$ with joint probability density function $f(x, y)$ is
$$I(X; Y) = E\Big[\log\frac{f(x, y)}{f(x)f(y)}\Big] = \int_{\mathbb{R}^2} f(x, y)\log\frac{f(x, y)}{f(x)f(y)}\,dx\,dy$$
From the definition it can directly be concluded that the mutual information is symmetric, i.e. $I(X; Y) = I(Y; X)$. By breaking up the logarithm into a sum, the function can be rewritten using the entropy functions,
$$\begin{aligned}
I(X; Y) &= E\Big[\log\frac{f(x, y)}{f(x)f(y)}\Big] = E\big[\log f(x, y) - \log f(x) - \log f(y)\big] \\
&= E\big[-\log f(x)\big] + E\big[-\log f(y)\big] - E\big[-\log f(x, y)\big] \\
&= H(X) + H(Y) - H(X, Y)
\end{aligned}$$
As for discrete variables, the conditional entropy is defined as follows.

DEFINITION 9.4 The conditional differential entropy is the differential entropy for the random variable $X$ conditioned on the random variable $Y$, and is written as
$$H(X|Y) = E\big[-\log f(x|y)\big] = -\int_{\mathbb{R}^2} f(x, y)\log f(x|y)\,dx\,dy$$
where $f(x, y)$ is the joint density function and $f(x|y)$ is the conditional density function.
Since the joint density function can be written as $f(x, y) = f(x|y)f(y)$, the joint entropy becomes
$$H(X, Y) = E\big[-\log f(x, y)\big] = E\big[-\log f(x|y)\big] + E\big[-\log f(y)\big] = H(X|Y) + H(Y)$$
Similarly,
$$H(X, Y) = H(Y|X) + H(X)$$
Combining the above gives the following expressions for the mutual information,
$$\begin{aligned}
I(X; Y) &= H(X) - H(X|Y) \\
&= H(Y) - H(Y|X) \\
&= H(X) + H(Y) - H(X, Y)
\end{aligned}$$
EXAMPLE 9.4 To derive the mutual information between the two variables in Example 9.3 the entropies of the individual variables $X$ and $Y$ are needed. Starting with $X$, the density function is
$$f(x) = \int_{y=0}^{-\frac{a}{b}x + a} \frac{2}{ab}\,dy = \frac{2}{ab}\Big(-\frac{a}{b}x + a\Big) = -\frac{2}{b^2}x + \frac{2}{b}$$
This is a triangular distribution starting at $f(0) = \frac{2}{b}$ and decreasing linearly to $f(b) = 0$. Then, to derive the entropy, use the variable change $z = -\frac{2}{b^2}x + \frac{2}{b}$ and integration by parts to get
$$\begin{aligned}
H(X) &= -\int_0^b \Big(-\frac{2}{b^2}x + \frac{2}{b}\Big)\log\Big(-\frac{2}{b^2}x + \frac{2}{b}\Big)\,dx
= -\frac{b^2}{2\ln 2}\int_0^{2/b} z\ln z\,dz \\
&= -\frac{b^2}{2\ln 2}\Big[\frac{z^2}{2}\ln z - \frac{z^2}{4}\Big]_0^{2/b}
= -\frac{b^2}{2\ln 2}\Big(\frac{2}{b^2}\ln\frac{2}{b} - \frac{1}{b^2}\Big) \\
&= \log\frac{b}{2} + \frac{1}{2\ln 2} = \log b\sqrt{e} - 1
\end{aligned}$$
Similarly, the entropy of $Y$ is $H(Y) = \log a\sqrt{e} - 1$. Then, the mutual information between $X$ and $Y$ becomes
$$\begin{aligned}
I(X; Y) &= H(X) + H(Y) - H(X, Y) \\
&= \log b\sqrt{e} - 1 + \log a\sqrt{e} - 1 - \log ab + 1 \\
&= \log e - 1 = \log\frac{e}{2}
\end{aligned}$$
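The value $\log(e/2) \approx 0.44$ bits can be verified by simulation. The sketch below is an addition to the text and assumes NumPy; it draws points uniformly over the triangle by rejection sampling and averages $\log_2\frac{f(x,y)}{f(x)f(y)}$ using the densities derived above.

```python
# Sketch: Monte Carlo estimate of I(X;Y) for the uniform density on the triangle
# x > 0, y > 0, a*x + b*y < a*b (Examples 9.3 and 9.4). Expected value: log2(e/2).
import numpy as np

a, b, n = 3.0, 2.0, 200_000
rng = np.random.default_rng(0)

# Rejection sampling from the triangle (acceptance probability 1/2).
x = rng.uniform(0, b, 4 * n)
y = rng.uniform(0, a, 4 * n)
keep = a * x + b * y < a * b
x, y = x[keep][:n], y[keep][:n]

f_xy = 2.0 / (a * b)                     # joint density (constant A)
f_x  = 2.0 / b - 2.0 * x / b**2          # triangular marginal of X
f_y  = 2.0 / a - 2.0 * y / a**2          # triangular marginal of Y

I_hat = np.mean(np.log2(f_xy / (f_x * f_y)))
print(I_hat, np.log2(np.e / 2))          # both approximately 0.44 bits
```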
There are two things to notice in the previous example. Firstly, the mutual information, in this case, does not depend on the constants $a$ and $b$. Secondly, which is more important, is that the mutual information is a positive number. To see that it is not only in this example that the mutual information is non-negative, define the relative entropy for the continuous case.

DEFINITION 9.5 The relative entropy for a pair of continuous random variables with probability density functions $f(x)$ and $g(x)$ is
$$D(f\|g) = E_f\Big[\log\frac{f(x)}{g(x)}\Big] = \int_{\mathbb{R}} f(x)\log\frac{f(x)}{g(x)}\,dx$$
EXAMPLE 9.5 Consider a random variable that is exponentially distributed, $\mathrm{Exp}(\lambda_0)$. The density function is
$$f(x) = \lambda_0 e^{-\lambda_0 x}$$
and the mean is $E[X] = \frac{1}{\lambda_0}$. Then the relative entropy between this distribution and another exponential distribution $\mathrm{Exp}(\lambda)$ is
$$\begin{aligned}
D\big(\lambda_0\|\lambda\big) &= E_{\lambda_0}\Big[\log\frac{\lambda_0 e^{-\lambda_0 x}}{\lambda e^{-\lambda x}}\Big]
= E_{\lambda_0}\Big[\log\frac{\lambda_0}{\lambda} - (\lambda_0 - \lambda)x\log e\Big] \\
&= \log\frac{\lambda_0}{\lambda} - (\lambda_0 - \lambda)E_{\lambda_0}[X]\log e
= \Big(\frac{\lambda}{\lambda_0} - 1\Big)\log e - \log\frac{\lambda}{\lambda_0}
\end{aligned}$$
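The closed form can be checked against a direct numerical integration of Definition 9.5. The sketch below is an addition to the text and assumes SciPy; it works in bits, so $\log e$ means $\log_2 e$.

```python
# Sketch: D(Exp(lam0) || Exp(lam)) in bits, closed form vs numerical integration.
import numpy as np
from scipy.integrate import quad

lam0, lam = 2.0, 0.5

closed_form = (lam / lam0 - 1) * np.log2(np.e) - np.log2(lam / lam0)

def integrand(x):
    f0 = lam0 * np.exp(-lam0 * x)        # density of Exp(lam0)
    f  = lam  * np.exp(-lam  * x)        # density of Exp(lam)
    return f0 * np.log2(f0 / f)

# The integrand is negligible beyond x = 60, so a finite upper limit is enough.
numeric, _ = quad(integrand, 0.0, 60.0)
print(closed_form, numeric)              # both approximately 0.918 bits
```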
To see that the relative entropy is still a non-negative function in the continuous case, we first denote the supports of $f(x)$ and $g(x)$ by $S_f$ and $S_g$, respectively. The support is the area where the function is strictly positive. When both functions are zero, we use the convention that $0\log\frac{0}{0} = 0$, and when $f(x) = 0$ but $g(x) \neq 0$ we get $0\log 0 = 0$. The problem comes when $f(x) \neq 0$ and $g(x) = 0$, as the function tends to infinity. However, right now we only want to show that the function is non-negative, which is of course true for infinity. Therefore, we will consider, without loss of generality, the case when $S_f \subseteq S_g$. Then, by using the IT-inequality,
$$\begin{aligned}
D(f\|g) &= \int_{S_f} f(x)\log\frac{f(x)}{g(x)}\,dx = -\int_{S_f} f(x)\log\frac{g(x)}{f(x)}\,dx \\
&\ge -\int_{S_f} f(x)\Big(\frac{g(x)}{f(x)} - 1\Big)\log e\,dx
= \Big(-\int_{S_f} g(x)\,dx + \int_{S_f} f(x)\,dx\Big)\log e \\
&\ge (-1 + 1)\log e = 0
\end{aligned}$$
where we have equality if and only if $f(x) = g(x)$ for all $x$. We state the result as a theorem.
THEOREM 9.1 The relative entropy is non-negative,
$$D\big(f\|g\big) \ge 0$$
with equality if and only if $f(x) = g(x)$ for all $x$.
Clearly, the mutual information can be expressed as a relative entropy,
$$I(X; Y) = D\big(f(x, y)\,\big\|\, f(x)f(y)\big)$$
which is a non-negative function.

COROLLARY 9.2 The mutual information is non-negative, i.e.
$$I(X; Y) = H(X) - H(X|Y) \ge 0$$
with equality if and only if $X$ and $Y$ are independent.
Then, from $I(X; Y) = H(X) - H(X|Y)$ we get

COROLLARY 9.3 The entropy will not increase by conditioning on side information, i.e.
$$H(X|Y) \le H(X)$$
with equality if and only if $X$ and $Y$ are independent.
The latter corollary can be generalised by the chain rule for probabilities,
$$f(x_1, \ldots, x_n) = \prod_{i=1}^n f(x_i \mid x_1, \ldots, x_{i-1})$$
Hence, the n-dimensional entropy can be written
$$\begin{aligned}
H(X_1, \ldots, X_n) &= E\big[-\log f(x_1, \ldots, x_n)\big]
= E\Big[-\log \prod_{i=1}^n f(x_i \mid x_1, \ldots, x_{i-1})\Big] \\
&= \sum_{i=1}^n E\big[-\log f(x_i \mid x_1, \ldots, x_{i-1})\big]
= \sum_{i=1}^n H(X_i \mid X_1, \ldots, X_{i-1})
\end{aligned}$$
Using that $H(X_i \mid X_1, \ldots, X_{i-1}) \le H(X_i)$, this also means that the entropy of the vector is not more than the sum of the marginal entropies,
$$H(X_1, \ldots, X_n) \le \sum_{i=1}^n H(X_i)$$
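For a Gaussian vector this bound is Hadamard's inequality $|\Lambda| \le \prod_i \Lambda_{ii}$ in disguise, since the joint and marginal entropies are $\frac{1}{2}\log\big((2\pi e)^n|\Lambda|\big)$ and $\frac{1}{2}\log(2\pi e\Lambda_{ii})$, as derived in Section 9.2.1. The sketch below is an addition to the text, assuming SciPy, and checks the inequality for a randomly generated covariance matrix.

```python
# Sketch: H(X1,...,Xn) <= sum_i H(Xi) for a Gaussian vector, checked numerically.
import numpy as np
from scipy.stats import multivariate_normal, norm

rng = np.random.default_rng(1)
B = rng.standard_normal((4, 4))
cov = B @ B.T + 0.1 * np.eye(4)              # random positive definite covariance

H_joint     = multivariate_normal(mean=np.zeros(4), cov=cov).entropy()    # nats
H_marginals = sum(norm(scale=np.sqrt(cov[i, i])).entropy() for i in range(4))

print(H_joint, H_marginals)
print(H_joint <= H_marginals + 1e-12)        # True (equality only for diagonal cov)
```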
9.1.1 Relation between information measures for discrete and continuous variables
In the previous it was seen that the definition of the differential entropy is not consistent with the interpretation of the entropy as uncertainty of the random variable. One way to see the reason for this is to discretise a continuous density function to obtain a discrete variable. Given the continuous random variable $X$ with density function $f(x)$, define a discrete random variable $X^\Delta$ where the probability for the outcome $x^\Delta_k$ is
$$p(x^\Delta_k) = \int_{k\Delta}^{(k+1)\Delta} f(x)\,dx = f(x_k)\Delta$$
where the existence of such an $x_k$ in the interval $k\Delta \le x_k \le (k+1)\Delta$ is guaranteed by the mean value theorem of integral calculus, see Figure 9.3. The entropy, or uncertainty, of this discrete variable is
$$\begin{aligned}
H(X^\Delta) &= -\sum_k p(x^\Delta_k)\log p(x^\Delta_k)
= -\sum_k f(x_k)\Delta\log\big(f(x_k)\Delta\big) \\
&= -\sum_k \big(f(x_k)\log f(x_k)\big)\Delta - \Big(\sum_k f(x_k)\Delta\Big)\log\Delta \qquad (9.2)
\end{aligned}$$
As the interval for $x^\Delta_k$ narrows, $\Delta \to 0$, the probability approaches zero, $p(x^\Delta_k) \to 0$. The first term in (9.2) is a Riemann sum and the corresponding limit value is the differential entropy,
$$-\sum_k \big(f(x_k)\log f(x_k)\big)\Delta \to -\int_{\mathbb{R}} f(x)\log f(x)\,dx = H(X)$$
Similarly, the first part of the second term is a Riemann sum with $\sum_k f(x_k)\Delta \to \int_{\mathbb{R}} f(x)\,dx = 1$. However, that means the second term behaves as $-\log\Delta \to \infty$, and as long as the differential entropy is finite the uncertainty $H(X^\Delta)$ does not exist (it grows without bound) in the limit as $\Delta \to 0$. Actually, this is reasonable, since for most distributions the number of outcomes for $X^\Delta$ grows to infinity as $\Delta \to 0$, and then the uncertainty of the outcome also goes to infinity. This divergence is the reason that the interpretation of uncertainty in the discrete case cannot be used for continuous variables. There are simply too many values for it to be reasonable to talk about the uncertainty of specific outcomes.

[Figure 9.3: Creation of a discrete random variable from a continuous one. In the interval $k\Delta \le x \le (k+1)\Delta$ the value $f(x_k)$ gives the probability $p(x^\Delta_k) = f(x_k)\Delta$.]
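The relation $H(X^\Delta) \approx H(X) - \log\Delta$ can be illustrated numerically: quantise a Gaussian variable with bin width $\Delta$, compute the discrete entropy of the bin probabilities, and add $\log\Delta$ back; the result should approach the differential entropy. The sketch below is an addition to the text and assumes SciPy.

```python
# Sketch: H(X^Delta) + log2(Delta) approaches the differential entropy as Delta -> 0.
import numpy as np
from scipy.stats import norm

sigma = 1.0
H_diff = 0.5 * np.log2(2 * np.pi * np.e * sigma**2)      # differential entropy, bits

for Delta in (1.0, 0.1, 0.01):
    edges = np.arange(-10 * sigma, 10 * sigma + Delta, Delta)
    p = np.diff(norm.cdf(edges, scale=sigma))             # bin probabilities p(x_k)
    p = p[p > 0]
    H_disc = -np.sum(p * np.log2(p))                      # entropy of the discrete variable
    print(Delta, H_disc + np.log2(Delta), H_diff)
```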
For the mutual information, however, it can be seen from the same type of derivation that the interpretation that it is a measure of the information obtained about one variable by observing another still holds. Consider two continuous random variables, $X$ and $Y$, with joint density function $f(x, y)$ and marginals $f(x)$ and $f(y)$, respectively. Define the two discrete random variables $X^\Delta$ and $Y^\delta$ with joint probabilities $p(x^\Delta_k, y^\delta_\ell) = f(x_k, y_\ell)\Delta\delta$ and marginals $p(x^\Delta_k) = f(x_k)\Delta$ and $p(y^\delta_\ell) = f(y_\ell)\delta$, respectively. The corresponding mutual information between $X^\Delta$ and $Y^\delta$ is
$$\begin{aligned}
I(X^\Delta; Y^\delta) &= \sum_{k,\ell} p(x^\Delta_k, y^\delta_\ell)\log\frac{p(x^\Delta_k, y^\delta_\ell)}{p(x^\Delta_k)p(y^\delta_\ell)}
= \sum_{k,\ell} f(x_k, y_\ell)\Delta\delta\log\frac{f(x_k, y_\ell)\Delta\delta}{f(x_k)\Delta\, f(y_\ell)\delta} \\
&= \sum_{k,\ell} f(x_k, y_\ell)\log\frac{f(x_k, y_\ell)}{f(x_k)f(y_\ell)}\,\Delta\delta
\;\to\; \int_{\mathbb{R}^2} f(x, y)\log\frac{f(x, y)}{f(x)f(y)}\,dx\,dy = I(X; Y), \quad \Delta, \delta \to 0
\end{aligned}$$
where the limit value as $\Delta$ and $\delta$ approach zero, individually, follows from the Riemann integration formula. The conclusion from the above derivation is that the properties of the mutual information in the discrete case are inherited by the continuous case. Especially, this means that the interpretation of mutual information is still valid for continuous variables. It also means there is no problem in considering the information exchange between one discrete and one continuous variable.

By similar derivations it can be shown that the properties of the relative entropy, $D(p\|q)$, for discrete distributions are still valid for continuous distributions.
9.2 Gaussian distribution
In many applications the Gaussian distribution plays an important role, and information theory is not an exception. In this section we will have a closer look at it, and see that it maximises the entropy for a given mean and variance. In the next chapter this will give the means to calculate the capacity for the case when the noise is Gaussian distributed.

In Example 9.2 the differential entropy for the Gaussian distribution, $N(\mu, \sigma)$, was derived as
$$H(X) = E\Big[-\log\frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x-\mu)^2/2\sigma^2}\Big]
= \frac{1}{2}\log(2\pi\sigma^2)E[1] + \frac{1}{2}\log(e)\frac{E\big[(x-\mu)^2\big]}{\sigma^2}
= \frac{1}{2}\log(2\pi e\sigma^2)$$
To derive the mutual information between two Gaussian variables, the density function for the two-dimensional case is needed. In the next section the n-dimensional case of the Gaussian distribution will be treated a bit more thoroughly, but at this point the two-dimensional density function for a pair of Gaussian variables $(X, Y) \sim N(0, \Lambda)$ with zero mean and covariance matrix $\Lambda$ is defined as
$$f(x, y) = \frac{1}{2\pi\sqrt{|\Lambda|}}\, e^{-\frac{1}{2}\begin{pmatrix}x & y\end{pmatrix}\Lambda^{-1}\begin{pmatrix}x \\ y\end{pmatrix}}$$
The joint entropy can be derived as
$$\begin{aligned}
H(X, Y) &= E\Big[-\log\frac{1}{2\pi\sqrt{|\Lambda|}}\, e^{-\frac{1}{2}\begin{pmatrix}x & y\end{pmatrix}\Lambda^{-1}\begin{pmatrix}x \\ y\end{pmatrix}}\Big] \\
&= E\Big[\frac{1}{2}\log\big((2\pi)^2|\Lambda|\big) + \frac{1}{2}\begin{pmatrix}x & y\end{pmatrix}\Lambda^{-1}\begin{pmatrix}x \\ y\end{pmatrix}\log e\Big] \\
&= \frac{1}{2}\log\big((2\pi)^2|\Lambda|\big) + \frac{1}{2}\log e^2 = \frac{1}{2}\log\big((2\pi e)^2|\Lambda|\big)
\end{aligned}$$
where it is used that $E\big[\begin{pmatrix}x & y\end{pmatrix}\Lambda^{-1}\begin{pmatrix}x \\ y\end{pmatrix}\big] = 2$. This will be shown in the next section when we consider multidimensional Gaussian distributions.

The covariance matrix for the two-dimensional case can be written as
$$\Lambda = \begin{pmatrix} E[X^2] & E[XY] \\ E[XY] & E[Y^2] \end{pmatrix}
= \begin{pmatrix} \sigma_x^2 & \rho\sigma_x\sigma_y \\ \rho\sigma_x\sigma_y & \sigma_y^2 \end{pmatrix}$$
where $|\rho| \le 1$. Hence, the joint entropy can be rewritten as
$$H(X, Y) = \frac{1}{2}\log\big((2\pi e)^2\sigma_x^2\sigma_y^2(1 - \rho^2)\big)$$
The mutual information then becomes
$$\begin{aligned}
I(X; Y) &= H(X) + H(Y) - H(X, Y) \\
&= \frac{1}{2}\log(2\pi e\sigma_x^2) + \frac{1}{2}\log(2\pi e\sigma_y^2) - \frac{1}{2}\log\big((2\pi e)^2\sigma_x^2\sigma_y^2(1 - \rho^2)\big) \\
&= -\frac{1}{2}\log(1 - \rho^2)
\end{aligned}$$
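The expression can be checked directly from the three entropies. The sketch below is an addition to the text, assuming SciPy, and works in nats.

```python
# Sketch: I(X;Y) = H(X) + H(Y) - H(X,Y) = -0.5*ln(1 - rho^2) for jointly Gaussian X, Y.
import numpy as np
from scipy.stats import norm, multivariate_normal

sx, sy, rho = 1.0, 2.0, 0.8
cov = np.array([[sx**2,         rho * sx * sy],
                [rho * sx * sy, sy**2        ]])

H_x  = norm(scale=sx).entropy()
H_y  = norm(scale=sy).entropy()
H_xy = multivariate_normal(mean=[0, 0], cov=cov).entropy()

print(H_x + H_y - H_xy, -0.5 * np.log(1 - rho**2))   # both approximately 0.51 nats
```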
To derive the relative entropy we consider two Gaussian distributions with equal mean and different variances, $N(\mu, \sigma_0^2)$ and $N(\mu, \sigma^2)$. Then
$$\begin{aligned}
D\big(f_0\|f\big) &= E_{f_0}\Bigg[\log\frac{\frac{1}{\sqrt{2\pi\sigma_0^2}}\, e^{-(x-\mu)^2/2\sigma_0^2}}{\frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x-\mu)^2/2\sigma^2}}\Bigg]
= E_{f_0}\Big[\log\frac{\sigma}{\sigma_0} + \frac{1}{2}\Big(\frac{(x-\mu)^2}{\sigma^2} - \frac{(x-\mu)^2}{\sigma_0^2}\Big)\log e\Big] \\
&= \log\frac{\sigma}{\sigma_0} + \frac{1}{2}\Big(E_{f_0}\Big[\frac{(x-\mu)^2}{\sigma^2}\Big] - E_{f_0}\Big[\frac{(x-\mu)^2}{\sigma_0^2}\Big]\Big)\log e \\
&= \log\frac{\sigma}{\sigma_0} + \frac{1}{2}\Big(\frac{\sigma_0^2}{\sigma^2} - 1\Big)\log e
\end{aligned}$$
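As a check of the closed form, the following sketch (an addition to the text, assuming SciPy) integrates $f_0(x)\log_2\frac{f_0(x)}{f(x)}$ numerically and compares it with the formula above, evaluated in bits.

```python
# Sketch: D( N(mu, s0^2) || N(mu, s^2) ) in bits, formula vs numerical integration.
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

mu, s0, s = 1.0, 1.0, 2.0

closed_form = np.log2(s / s0) + 0.5 * (s0**2 / s**2 - 1) * np.log2(np.e)

def integrand(x):
    f0 = norm.pdf(x, loc=mu, scale=s0)
    f  = norm.pdf(x, loc=mu, scale=s)
    return f0 * np.log2(f0 / f)

# The integrand is concentrated within a few s0 around mu.
numeric, _ = quad(integrand, mu - 10 * s0, mu + 10 * s0)
print(closed_form, numeric)              # both approximately 0.46 bits
```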
We will now continue to show that the Gaussian distribution maximises the entropy for a given mean and variance. This will be an important result in the next chapter when we are considering the capacity of a channel with Gaussian noise. To start with we first need a lemma stating that the averaging distribution in the entropy formula is not of importance, as long as the mean and variance are not changed.

LEMMA 9.4 Let $h(x)$ be a Gaussian distribution, $N(\mu, \sigma)$, with mean $\mu$ and variance $\sigma^2$. If $f(x)$ is an arbitrary distribution with the same mean and variance, then
$$\int_{\mathbb{R}} f(x)\log h(x)\,dx = \int_{\mathbb{R}} h(x)\log h(x)\,dx$$
The lemma can be shown by the following derivation,
$$\begin{aligned}
E_f\big[-\log h(x)\big] &= E_f\Big[-\log\frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x-\mu)^2/2\sigma^2}\Big] \\
&= \frac{1}{2}\log(2\pi\sigma^2) + E_f\Big[\frac{(x-\mu)^2}{2\sigma^2}\log e\Big] \\
&= \frac{1}{2}\log(2\pi\sigma^2) + \frac{E_f\big[(x-\mu)^2\big]}{2\sigma^2}\log e \\
&= \frac{1}{2}\log(2\pi\sigma^2) + \frac{1}{2}\log e = \frac{1}{2}\log(2\pi e\sigma^2) = E_h\big[-\log h(x)\big]
\end{aligned}$$
which completes the proof of the lemma.
We can then compare the entropy of the Gaussian distribution, $N(\mu, \sigma)$, with that of an arbitrary distribution with the same mean and variance. By using the previous lemma and the non-negativity of the relative entropy we obtain
$$\begin{aligned}
H_h(X) - H_f(X) &= -\int_{\mathbb{R}} h(x)\log h(x)\,dx + \int_{\mathbb{R}} f(x)\log f(x)\,dx \\
&= -\int_{\mathbb{R}} f(x)\log h(x)\,dx + \int_{\mathbb{R}} f(x)\log f(x)\,dx \\
&= \int_{\mathbb{R}} f(x)\log\frac{f(x)}{h(x)}\,dx = D\big(f\|h\big) \ge 0
\end{aligned}$$
with equality if and only if $f(x) = h(x)$ for all $x$. Stated differently, we have seen that if $h(x)$ is the density function for a Gaussian distribution, $N(\mu, \sigma)$, and $f(x)$ the density function for any other distribution with the same mean and variance, then
$$H_h(X) \ge H_f(X)$$
That is, we have the following theorem.

THEOREM 9.5 The Gaussian distribution maximises the differential entropy over all distributions with mean $\mu$ and variance $\sigma^2$.
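The theorem can be illustrated by comparing distributions normalised to the same variance. The sketch below is an addition to the text and assumes SciPy; it compares the unit-variance Gaussian, uniform and Laplace entropies (in nats), and the Gaussian value is the largest.

```python
# Sketch: among unit-variance distributions the Gaussian has the largest entropy.
import numpy as np
from scipy.stats import norm, uniform, laplace

H_gauss   = norm(scale=1.0).entropy()                     # 0.5*ln(2*pi*e)
H_uniform = uniform(scale=np.sqrt(12.0)).entropy()        # width sqrt(12) gives variance 1
H_laplace = laplace(scale=1.0 / np.sqrt(2.0)).entropy()   # scale 1/sqrt(2) gives variance 1

print(H_gauss, H_uniform, H_laplace)        # ~1.419, ~1.242, ~1.347 nats
print(H_gauss > H_uniform and H_gauss > H_laplace)        # True
```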
9.2.1 Multidimensional Gaussian distribution
In the previous section we have treated the Gaussian distribution with extra care. The result that this is the distribution that maximises the entropy is of special interest in the next chapter. In this section we will expand the results to the multidimensional case. We will start by defining an n-dimensional Gaussian distribution, and then show the necessary derivations to see how the results look in this case.

Assume we have an n-dimensional vector $X = (X_1, \ldots, X_n)$ where each entry is Gaussian with mean $\mu_i$ and variance $\sigma_i^2$, i.e. $X_i \sim N(\mu_i, \sigma_i)$, and all linear combinations of them, $Xa^T$, $a \in \mathbb{R}^n$, are Gaussian. The mean of the vector $X$ is
$$\mu = E\big[X\big] = (\mu_1, \ldots, \mu_n)$$
and the covariance matrix
$$\Lambda = E\big[(X - \mu)^T(X - \mu)\big] = \Big(E\big[(X_i - \mu_i)(X_j - \mu_j)\big]\Big)_{i,j=1,\ldots,n}$$
where $^T$ denotes the transpose. Clearly the diagonal elements of $\Lambda$ contain the variances of $X$. The vector $X$ then has an n-dimensional Gaussian distribution, $N(\mu, \Lambda)$.²
To find the density function of the distribution we first consider a general scaling and translation of a random variable $X$. Let $X$ be an n-dimensional random variable distributed according to an n-dimensional distribution with mean $\mu$ and covariance $\Lambda$. Then, if $A$ is a square, non-singular (full rank) matrix and $a$ an n-dimensional row vector, we form the new random vector $Y = XA + a$. The mean and covariance of $Y$ are
$$E\big[Y\big] = E\big[XA + a\big] = E\big[X\big]A + a = \mu A + a$$
$$\begin{aligned}
\mathrm{Cov}[Y] &= E\big[(Y - E[Y])^T(Y - E[Y])\big]
= E\big[(XA + a - \mu A - a)^T(XA + a - \mu A - a)\big] \\
&= E\Big[\big((X - \mu)A\big)^T\big((X - \mu)A\big)\Big]
= A^T E\big[(X - \mu)^T(X - \mu)\big]A = A^T\Lambda A
\end{aligned}$$
The idea now is to transform the Gaussian vector $X$ into a normalised Gaussian vector instead. In the case when $X$ is a one-dimensional random variable, this is done with $Y = \frac{X - \mu}{\sigma}$. To see how the corresponding equation looks for the n-dimensional case, we need some definitions and results from matrix theory. For a more thorough treatment of this topic refer to e.g. [11]. Most of the results here will be given without any proofs.

² In this text we will only consider cases where $\Lambda$ has full rank. In the case it does not have full rank, i.e. $|\Lambda| = 0$, the dimensionality of the vector can be decreased.
Firstly, we will characterise the covariance matrix to see how we can get the square root of its inverse. We start by defining a symmetric (or Hermitian) matrix.

DEFINITION 9.6 A matrix $A$ is symmetric³ if it is symmetric along the diagonal, $A^T = A$.

If the matrix $A$ is symmetric and has an inverse, we can write the identity matrix as $I = I^T = (AA^{-1})^T = A^{-T}A^T = A^{-T}A$, where $A^{-T}$ denotes the transpose of the inverse. Then we get $A^{-1} = IA^{-1} = A^{-T}AA^{-1} = A^{-T}$. Hence, the inverse of a symmetric matrix is again symmetric.

From the definition of the covariance matrix, we conclude that since $E[(X_i - \mu_i)(X_j - \mu_j)] = E[(X_j - \mu_j)(X_i - \mu_i)]$, the covariance matrix is symmetric.

Next we need that the covariance is positive. In matrix theory this corresponds to positive definiteness.

DEFINITION 9.7 A matrix $A$ is positive definite if $aAa^T > 0$ for all vectors $a \neq 0$.

DEFINITION 9.8 A matrix $A$ is positive semi-definite if $aAa^T \ge 0$ for all vectors $a \neq 0$.
For the covariance matrix we get, for any real vector $a \neq 0$,
$$a\Lambda a^T = aE\big[(X - \mu)^T(X - \mu)\big]a^T
= E\Big[\big((X - \mu)a^T\big)^T\big((X - \mu)a^T\big)\Big]
= V\big[Xa^T\big] \ge 0$$
since $Xa^T$ is a random variable and its variance is non-negative. Hence, we can conclude the following theorem.
THEOREM 9.6 Given an n-dimensional random vector $X = (X_1, \ldots, X_n)$ with mean $E[X] = \mu = (\mu_1, \ldots, \mu_n)$, the covariance matrix $\Lambda = E\big[(X - \mu)^T(X - \mu)\big]$ is symmetric and positive semi-definite.
In e.g. [11] we can find that for every symmetric positive semi-definite matrix $A$ there exists a unique symmetric positive semi-definite matrix $A^{1/2}$ such that
$$\big(A^{1/2}\big)^2 = A$$
This matrix $A^{1/2}$ is the matrix equivalent of the scalar square root. Furthermore, it can be shown that the inverse of the square root is equivalent to the square root of the inverse,
$$\big(A^{1/2}\big)^{-1} = \big(A^{-1}\big)^{1/2}$$
and we denote this matrix $A^{-1/2}$. It can then be found that
$$\big|A^{-1/2}\big| = |A|^{-1/2} = \frac{1}{\sqrt{|A|}}$$

³ For a complex matrix $A$ we say that it is Hermitian if $A^* = A$, where $^*$ denotes complex conjugate and transpose. For real matrices the two definitions are equivalent, but for complex matrices the second definition is often more convenient. In MATLAB the notation A' means Hermitian transpose, $A^*$.
With this we can go back to our n-dimensional Gaussian distributed vector, $X \sim N(\mu, \Lambda)$, and consider the vector
$$Y = (X - \mu)\Lambda^{-1/2}$$
The mean and covariance of $Y$ can be derived as
$$E\big[Y\big] = E\big[X\Lambda^{-1/2} - \mu\Lambda^{-1/2}\big] = E\big[X\big]\Lambda^{-1/2} - \mu\Lambda^{-1/2} = 0$$
and
$$\mathrm{Cov}(Y) = \big(\Lambda^{-1/2}\big)^T\Lambda\Lambda^{-1/2} = \Lambda^{-1/2}\Lambda^{1/2}\Lambda^{1/2}\Lambda^{-1/2} = I$$
respectively. Hence, $Y \sim N(0, I)$ is normalised Gaussian distributed with zero mean and covariance $I$. Since we have assumed that $|\Lambda| > 0$ there exists a density function that is uniquely determined by the mean and covariance. To find it we let the entries of $Y$ be independent and write the density function as
$$f_Y(y) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}y_i^2}
= \frac{1}{(2\pi)^{n/2}}\, e^{-\frac{1}{2}\sum_i y_i^2}
= \frac{1}{(2\pi)^{n/2}}\, e^{-\frac{1}{2}yy^T}$$
The entropy of this vector follows from the independence,
$$H(Y) = \sum_{i=1}^n H(Y_i) = \frac{1}{2}\log(2\pi e)^n$$
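The normalisation $Y = (X - \mu)\Lambda^{-1/2}$ can be carried out numerically through an eigendecomposition of the covariance matrix. The sketch below is an addition to the text and assumes NumPy; it builds the symmetric square root $\Lambda^{-1/2}$ and checks that the whitened samples have approximately zero mean and identity covariance.

```python
# Sketch: whitening Y = (X - mu) Lambda^(-1/2) using a symmetric matrix square root.
import numpy as np

rng = np.random.default_rng(2)
mu  = np.array([1.0, -2.0, 0.5])
B   = rng.standard_normal((3, 3))
Lam = B @ B.T + 0.5 * np.eye(3)                 # positive definite covariance matrix

# Symmetric square root of the inverse via the eigendecomposition Lam = U diag(w) U^T.
w, U = np.linalg.eigh(Lam)
Lam_inv_sqrt = U @ np.diag(1.0 / np.sqrt(w)) @ U.T

X = rng.multivariate_normal(mu, Lam, size=100_000)   # rows are samples of X
Y = (X - mu) @ Lam_inv_sqrt

print(np.round(Y.mean(axis=0), 2))              # approximately [0, 0, 0]
print(np.round(np.cov(Y, rowvar=False), 2))     # approximately the identity matrix
```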
To calculate the entropy of the general Gaussian vector $X \sim N(\mu, \Lambda)$, we first consider the general case. Assume that $Z$ is an n-dimensional random vector with density function $f_Z(z)$, and let $A$ be an $n \times n$ non-singular matrix and $a$ an n-dimensional static vector. Then form $X = ZA + a$, which leads to $Z = (X - a)A^{-1}$ and $dx = |A|\,dz$, where $|A|$ is the Jacobian of the variable change. The density function of $X$ can then be written as
$$f_X(x) = \frac{1}{|A|} f_Z\big((x - a)A^{-1}\big)$$
and the entropy
$$\begin{aligned}
H(X) &= -\int_{\mathbb{R}^n} f_X(x)\log f_X(x)\,dx \\
&= -\int_{\mathbb{R}^n} \frac{1}{|A|} f_Z\big((x - a)A^{-1}\big)\log\frac{1}{|A|} f_Z\big((x - a)A^{-1}\big)\,dx \\
&= -\int_{\mathbb{R}^n} f_Z(z)\log\frac{1}{|A|} f_Z(z)\,dz \\
&= -\int_{\mathbb{R}^n} f_Z(z)\log f_Z(z)\,dz + \log|A|\int_{\mathbb{R}^n} f_Z(z)\,dz = H(Z) + \log|A|
\end{aligned}$$
Hence, we have the following result.
THEOREM 9.7 Let $Z$ be an n-dimensional random vector with entropy $H(Z)$. If $A$ is an $n \times n$ non-singular matrix and $a$ an n-dimensional static vector, then $X = ZA + a$ has the entropy
$$H(X) = H(Z) + \log|A|$$
To get back from the normalised Gaussian vector $Y$ to $X \sim N(\mu, \Lambda)$, we use the function
$$X = Y\Lambda^{1/2} + \mu$$
The above then states that the entropy of the vector $X$ is
$$H(X) = \frac{1}{2}\log(2\pi e)^n + \log\big|\Lambda^{1/2}\big| = \frac{1}{2}\log\big((2\pi e)^n|\Lambda|\big)$$
THEOREM 9.8 Let $X = (X_1, \ldots, X_n)$ be an n-dimensional Gaussian vector with mean $\mu = (\mu_1, \ldots, \mu_n)$ and covariance matrix $\Lambda = E\big[(X - \mu)^T(X - \mu)\big]$, i.e. $X \sim N(\mu, \Lambda)$. Then the entropy of the vector is
$$H(X) = \frac{1}{2}\log\big((2\pi e)^n|\Lambda|\big)$$
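Theorem 9.8 is easy to evaluate numerically. The sketch below is an addition to the text and assumes SciPy; it compares $\frac{1}{2}\log_2\big((2\pi e)^n|\Lambda|\big)$ with SciPy's built-in entropy, converted from nats to bits.

```python
# Sketch: H(X) = 0.5*log2((2*pi*e)^n * det(Lambda)) for X ~ N(mu, Lambda).
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(3)
n   = 4
B   = rng.standard_normal((n, n))
Lam = B @ B.T + np.eye(n)                       # positive definite covariance

H_formula = 0.5 * np.log2((2 * np.pi * np.e)**n * np.linalg.det(Lam))
H_scipy   = multivariate_normal(mean=np.zeros(n), cov=Lam).entropy() / np.log(2)

print(np.isclose(H_formula, H_scipy))           # True
```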
An alternative way to show the above theorem is to first derive the density function for $X$ and then use this to derive the entropy. Since we will reuse this derivation later, we will also go through this. So, again we use the variable change $Y = (X - \mu)\Lambda^{-1/2}$, where the Jacobian of the transformation is $\big|\Lambda^{1/2}\big| = \sqrt{|\Lambda|}$. Then
$$f_X(x) = \frac{1}{\sqrt{|\Lambda|}}\frac{1}{(2\pi)^{n/2}}\, e^{-\frac{1}{2}(x-\mu)\Lambda^{-1/2}\big((x-\mu)\Lambda^{-1/2}\big)^T}
= \frac{1}{(2\pi)^{n/2}\sqrt{|\Lambda|}}\, e^{-\frac{1}{2}(x-\mu)\Lambda^{-1}(x-\mu)^T}$$
which is the density function normally used for an n-dimensional Gaussian distribution. In the above derivation we used that the square root of a symmetric and positive semi-definite matrix is again symmetric and positive semi-definite, together with the fact that the inverse of a symmetric matrix is also symmetric. That is, $\big(\Lambda^{-1/2}\big)^T = \Lambda^{-1/2}$.
Before we continue to derive the entropy, we will study the argument of the exponent a bit more in detail. Assume that we have a random variable $X$ (not necessarily Gaussian) with $E[X] = \mu$ and $\mathrm{Cov}[X] = \Lambda$, and that we form $Y = (X - \mu)\Lambda^{-1/2}$ to get a normalised version, $E[Y] = 0$ and $\mathrm{Cov}[Y] = I$. Then
$$E\big[(x - \mu)\Lambda^{-1}(x - \mu)^T\big]
= E\Big[(x - \mu)\Lambda^{-1/2}\big((x - \mu)\Lambda^{-1/2}\big)^T\Big]
= E\big[yy^T\big] = E\Big[\sum_{i=1}^n y_i^2\Big] = \sum_{i=1}^n E[y_i^2] = \sum_{i=1}^n 1 = n$$
If $X$ is Gaussian, $X \sim N(\mu, \Lambda)$, then $Y$ is normalised Gaussian, $Y \sim N(0, I)$, and so is each of the marginals, $Y_i \sim N(0, 1)$. Then
$$Z = (x - \mu)\Lambda^{-1}(x - \mu)^T = \sum_{i=1}^n Y_i^2 \sim \chi^2(n)$$
This also gives the expectation of a chi-square distributed random variable, $E[Z] = n$.
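A quick simulation (an addition to the text, assuming NumPy) confirms that the quadratic form has mean $n$, in line with the $\chi^2(n)$ result above.

```python
# Sketch: the quadratic form (x - mu) Lambda^{-1} (x - mu)^T has expectation n.
import numpy as np

rng = np.random.default_rng(4)
n   = 5
mu  = rng.standard_normal(n)
B   = rng.standard_normal((n, n))
Lam = B @ B.T + np.eye(n)

X = rng.multivariate_normal(mu, Lam, size=200_000)
d = X - mu
q = np.einsum('ij,jk,ik->i', d, np.linalg.inv(Lam), d)   # quadratic form per sample

print(q.mean())          # approximately n = 5
```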
We are now ready to derive the entropy for the Gaussian distribution using the density function above.
$$\begin{aligned}
H(X) &= E_f\Big[-\log\frac{1}{(2\pi)^{n/2}\sqrt{|\Lambda|}}\, e^{-\frac{1}{2}(x-\mu)\Lambda^{-1}(x-\mu)^T}\Big] \\
&= E_f\Big[\frac{1}{2}\log\big((2\pi)^n|\Lambda|\big) + \frac{1}{2}(x - \mu)\Lambda^{-1}(x - \mu)^T\log e\Big] \\
&= \frac{1}{2}\log\big((2\pi)^n|\Lambda|\big) + \frac{1}{2}E_f\big[(x - \mu)\Lambda^{-1}(x - \mu)^T\big]\log e \\
&= \frac{1}{2}\log\big((2\pi)^n|\Lambda|\big) + \frac{1}{2}\log e^n = \frac{1}{2}\log\big((2\pi e)^n|\Lambda|\big)
\end{aligned}$$
Looking back at Lemma 9.4 and Theorem 9.5, we can get the corresponding results for the n-dimensional case. Starting with the lemma, we assume that $h(x)$ is the density function for a normal distribution, $N(\mu, \Lambda)$, and that $f(x)$ is an arbitrary density function with the same mean $E_f[X] = \mu$ and covariance matrix $\mathrm{Cov}[X] = \Lambda$. Then the expectations of $\log h(x)$ with respect to $h(x)$ and to $f(x)$ are equal. This can be seen from the exact same derivation as above, even when $f(x)$ is not Gaussian, since the Gaussian property of the averaging distribution is not used when taking the expectation; only the mean and covariance are. Hence, we have the following lemma, corresponding to Lemma 9.4.

LEMMA 9.9 Let $h(x)$ be an n-dimensional Gaussian distribution, $N(\mu, \Lambda)$, with mean $\mu$ and covariance matrix $\Lambda$. If $f(x)$ is an arbitrary distribution with the same mean and covariance, then
$$E_f\big[-\log h(x)\big] = E_h\big[-\log h(x)\big]$$
To see that the Gaussian distribution maximises the entropy, consider
$$\begin{aligned}
H_h(X) - H_f(X) &= E_h\big[-\log h(x)\big] - E_f\big[-\log f(x)\big] \\
&= E_f\big[-\log h(x)\big] - E_f\big[-\log f(x)\big] \\
&= E_f\Big[\log\frac{f(x)}{h(x)}\Big] = D\big(f\|h\big) \ge 0
\end{aligned}$$

THEOREM 9.10 The n-dimensional Gaussian distribution maximises the differential entropy over all n-dimensional distributions with mean $\mu$ and covariance matrix $\Lambda$.
9.2.2 Complex Gaussian distribution
Pseudo covariance
Circular Gaussian
Complex Gaussian Vectors
In many engineering applications it is convenient to use a complex notation. One of the most obvious cases is modulation of binary data into a signal, where we often use two orthogonal basis functions such as $\cos\omega t$ and $\sin\omega t$. A signal is then represented by a point in this two-dimensional space and can be written as a complex number. During the transmission the signal is corrupted by additive noise in both dimensions, which can then be modelled as complex Gaussian noise.

In general, a complex Gaussian variable is written $Z = X + jY$ where $X$ and $Y$ are Gaussian random variables.

To be done