
INFORMATION THEORY & CODING

Week 10: Differential Entropy

Dr. Rui Wang

Department of Electrical and Electronic Engineering


Southern University of Science and Technology (SUSTech)

Email: wang.r@sustech.edu.cn

November 16, 2020

Differential Entropy - 1

Definitions

AEP for Continuous Random Variables

Relation of differential entropy to discrete entropy

From discrete to continuous variables

Differential Entropy

Definition
Let X be a random variable with cumulative distribution function (CDF)
F(x) = Pr(X ≤ x). If F(x) is continuous, the random variable X is said to be
continuous. Let f(x) = F'(x) when the derivative is defined. If
\int_{-\infty}^{+\infty} f(x)\, dx = 1, then f(x) is called the probability density
function (pdf) of X. The set of x where f(x) > 0 is called the support set of X.

Definition
The differential entropy h(X) of a continuous random variable X with
density f(x) is defined as

h(X) = -\int_S f(x) \log f(x)\, dx = h(f),

where S is the support set of the random variable.

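As a quick numerical sanity check (not part of the slides), the definition can be approximated directly. The sketch below is a minimal Riemann-sum estimate of h(f) in bits, assuming NumPy and a density supplied as a Python function on a finite interval:

import numpy as np

def differential_entropy_bits(pdf, lo, hi, num_points=100_000):
    # Riemann-sum approximation of h(f) = -integral over S of f(x) log2 f(x) dx
    x = np.linspace(lo, hi, num_points)
    dx = x[1] - x[0]
    f = pdf(x)
    mask = f > 0                    # restrict the sum to the support set S
    return -np.sum(f[mask] * np.log2(f[mask])) * dx

# Illustrative check against the uniform example on the next slide: h(X) = log2(a)
a = 2.0
print(differential_entropy_bits(lambda x: np.full_like(x, 1.0 / a), 0.0, a))  # ~1.0 bit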
Example: Uniform distribution

f(x) = \frac{1}{a}, \quad x \in [0, a]

The differential entropy is:


h(X) = -\int_0^a \frac{1}{a} \log \frac{1}{a}\, dx = \log a \text{ bits}

For a < 1, h(X) = log a < 0: differential entropy can be negative (unlike discrete entropy)!

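A concrete instance (not on the slide): for a = 1/2, the density is f(x) = 2 on [0, 1/2], so

h(X) = \log_2 \tfrac{1}{2} = -1 \text{ bit},

which is negative because f(x) = 2 > 1 on the whole support, so -\log f(x) < 0 pointwise.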
Example: Normal distribution

X \sim \phi(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{x^2}{2\sigma^2}\right), \quad x \in \mathbb{R}

Differential entropy:

h(\phi) = \frac{1}{2} \log 2\pi e \sigma^2 \text{ bits}

Calculation:

h(\phi) = -\int \phi \log \phi\, dx
        = -\int \phi(x) \left[ -\frac{x^2}{2\sigma^2} \log e - \log \sqrt{2\pi\sigma^2} \right] dx
        = \frac{E(X^2)}{2\sigma^2} \log e + \frac{1}{2} \log 2\pi\sigma^2
        = \frac{1}{2} \log e + \frac{1}{2} \log 2\pi\sigma^2
        = \frac{1}{2} \log 2\pi e \sigma^2

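As a hedged numerical cross-check (again assuming NumPy; not slide material), a Riemann-sum estimate of h(\phi) agrees with the closed form \frac{1}{2}\log_2 2\pi e\sigma^2:

import numpy as np

sigma = 1.5
phi = lambda x: np.exp(-x**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

x = np.linspace(-10 * sigma, 10 * sigma, 200_000)   # wide truncation of the real line
dx = x[1] - x[0]
f = phi(x)
h_numeric = -np.sum(f * np.log2(f)) * dx

h_closed_form = 0.5 * np.log2(2 * np.pi * np.e * sigma**2)
print(h_numeric, h_closed_form)                     # both ~2.63 bits for sigma = 1.5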
AEP for continuous random variables

Discrete world: for a sequence of i.i.d. random variables


-\frac{1}{n} \log p(X_1, X_2, \ldots, X_n) \to H(X).
Continuous world: for a sequence of i.i.d. random variables
-\frac{1}{n} \log f(X_1, X_2, \ldots, X_n) \to E[-\log f(X)] = h(X) \quad \text{in probability.}
Proof follows from the weak law of large numbers.

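A small simulation (an illustrative sketch assuming NumPy; the Gaussian source and sample sizes are arbitrary choices, not slide material) shows the continuous AEP in action: the normalized log-likelihood -\frac{1}{n}\log_2 f(X_1,\ldots,X_n) concentrates around h(X) as n grows.

import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0
h_true = 0.5 * np.log2(2 * np.pi * np.e * sigma**2)     # ~2.05 bits

def log2_density(x):
    # log2 of the N(0, sigma^2) density, evaluated elementwise
    return -x**2 / (2 * sigma**2) * np.log2(np.e) - 0.5 * np.log2(2 * np.pi * sigma**2)

for n in [10, 100, 10_000, 1_000_000]:
    samples = rng.normal(0.0, sigma, size=n)
    estimate = -np.mean(log2_density(samples))          # = -(1/n) log2 f(X_1, ..., X_n)
    print(n, estimate)                                   # approaches h_true as n grows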
Typical set
Discrete case: number of typical sequences

\left| A_\epsilon^{(n)} \right| \approx 2^{nH(X)}

Continuous case: The volume of the typical set


\mathrm{Vol}(A) = \int_A dx_1\, dx_2 \ldots dx_n, \quad A \subset \mathbb{R}^n.

Definition
For \epsilon > 0 and any n, we define the typical set A_\epsilon^{(n)} with respect to f(x) as follows:

A_\epsilon^{(n)} = \left\{ (x_1, x_2, \ldots, x_n) \in S^n : \left| -\frac{1}{n} \log f(x_1, x_2, \ldots, x_n) - h(X) \right| \le \epsilon \right\},

where f(x_1, x_2, \ldots, x_n) = \prod_{i=1}^n f(x_i).

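A hedged Monte Carlo sketch (assuming NumPy; the exponential source with rate \lambda = 1, for which h(X) = \log_2(e/\lambda) bits, is an illustrative choice) checks membership in A_\epsilon^{(n)} for many independent length-n blocks and thereby estimates \Pr(A_\epsilon^{(n)}):

import numpy as np

rng = np.random.default_rng(1)
lam, eps = 1.0, 0.05
h = np.log2(np.e / lam)                          # differential entropy of Exp(lam), in bits

def in_typical_set(block):
    # membership test: | -(1/n) sum_i log2 f(x_i) - h(X) | <= eps
    log2_f = np.log2(lam) - lam * block * np.log2(np.e)
    return abs(-np.mean(log2_f) - h) <= eps

for n in [10, 100, 1000, 10_000]:
    blocks = rng.exponential(1.0 / lam, size=(2000, n))
    prob = np.mean([in_typical_set(b) for b in blocks])
    print(n, prob)                                # estimated Pr(A_eps^(n)) climbs toward 1

This anticipates property 1 of the theorem on the next slide.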
Typical set

Theorem
The typical set A_\epsilon^{(n)} has the following properties:
1. Pr(A_\epsilon^{(n)}) > 1 - \epsilon for n sufficiently large.
2. Vol(A_\epsilon^{(n)}) \le 2^{n(h(X)+\epsilon)} for all n.
3. Vol(A_\epsilon^{(n)}) \ge (1-\epsilon)\, 2^{n(h(X)-\epsilon)} for n sufficiently large.

Proof. 1.
Similar to the discrete case. By definition,

-\frac{1}{n} \log f(X^n) = -\frac{1}{n} \sum_{i=1}^n \log f(X_i) \to h(X) \quad \text{in probability.}

Typical set

Theorem
The typical set A_\epsilon^{(n)} has the following properties:
2. Vol(A_\epsilon^{(n)}) \le 2^{n(h(X)+\epsilon)} for all n.

Proof. 2.
1 = \int_{S^n} f(x_1, x_2, \ldots, x_n)\, dx_1 dx_2 \ldots dx_n
  \ge \int_{A_\epsilon^{(n)}} f(x_1, x_2, \ldots, x_n)\, dx_1 dx_2 \ldots dx_n
  \ge \int_{A_\epsilon^{(n)}} 2^{-n(h(X)+\epsilon)}\, dx_1 dx_2 \ldots dx_n
  = 2^{-n(h(X)+\epsilon)} \int_{A_\epsilon^{(n)}} dx_1 dx_2 \ldots dx_n
  = 2^{-n(h(X)+\epsilon)}\, \mathrm{Vol}(A_\epsilon^{(n)}).

Typical set

Theorem
The typical set A_\epsilon^{(n)} has the following properties:
3. Vol(A_\epsilon^{(n)}) \ge (1-\epsilon)\, 2^{n(h(X)-\epsilon)} for n sufficiently large.

Proof. 3.
1 - \epsilon \le \int_{A_\epsilon^{(n)}} f(x_1, x_2, \ldots, x_n)\, dx_1 dx_2 \ldots dx_n
  \le \int_{A_\epsilon^{(n)}} 2^{-n(h(X)-\epsilon)}\, dx_1 dx_2 \ldots dx_n
  = 2^{-n(h(X)-\epsilon)} \int_{A_\epsilon^{(n)}} dx_1 dx_2 \ldots dx_n
  = 2^{-n(h(X)-\epsilon)}\, \mathrm{Vol}(A_\epsilon^{(n)}).

An interpretation

The volume of the smallest set that contains most of the probability is approximately 2^{nh(X)}.

For an n-dimensional volume, this means that each dimension has side length \left(2^{nh(X)}\right)^{1/n} = 2^{h(X)}.

Mean value theorem (MVT)
If a function f is continuous on the closed interval [a, b], and differentiable
on (a, b), then there exists a point c ∈ (a, b) such that

f'(c) = \frac{f(b) - f(a)}{b - a}

Relation of differential entropy to discrete entropy

Consider a random variable X with pdf f(x). We divide the range of X into bins of length ∆.

MVT: there exists a value x_i \in (i\Delta, (i+1)\Delta) within each bin such that

f(x_i)\,\Delta = \int_{i\Delta}^{(i+1)\Delta} f(x)\, dx.

Relation of differential entropy to discrete entropy

Define the quantized random variable as X^\Delta = x_i if i\Delta \le X < (i+1)\Delta, with pmf

p_i = \Pr[X^\Delta = x_i] = \int_{i\Delta}^{(i+1)\Delta} f(x)\, dx = f(x_i)\,\Delta.

The entropy of X^\Delta is

H(X^\Delta) = -\sum_{i=-\infty}^{+\infty} p_i \log p_i = -\sum_i \Delta f(x_i) \log f(x_i) - \log \Delta.

If f(x) is Riemann integrable, then as ∆ → 0,

H(X^\Delta) + \log \Delta \to h(f) = h(X).

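A hedged numerical sketch (assuming NumPy; the Gaussian source, the midpoint approximation of each p_i, and the chosen bin widths are illustrative, not slide material) showing H(X^∆) + \log_2 ∆ approaching h(X) as ∆ shrinks:

import numpy as np

sigma = 1.0
h_true = 0.5 * np.log2(2 * np.pi * np.e * sigma**2)     # ~2.05 bits
phi = lambda x: np.exp(-x**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

for delta in [1.0, 0.1, 0.01, 0.001]:
    edges = np.arange(-10 * sigma, 10 * sigma, delta)
    # p_i = integral of f over bin i, approximated here by f(bin midpoint) * delta
    p = phi(edges + delta / 2) * delta
    p = p[p > 0] / p.sum()                              # renormalize the truncated pmf
    H_quantized = -np.sum(p * np.log2(p))
    print(delta, H_quantized + np.log2(delta))           # approaches h_true as delta shrinks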
Reading & Homework

Reading: Chapter 8: 8.1 - 8.3


Homework: Problems 8.1, 8.5, 8.7

