Information Theory
Lecture 2. Entropy and Mutual Information
Prof. CHEN Jie
Lab. 201, School of EIE
Beihang University
Contents
1 Self-information
2 Jensen’s inequality
3 Entropy
4 Joint entropy and conditional entropy
5 Relative entropy and mutual information
6 Relationship between entropy and mutual information
7 Chain rules for entropy, relative entropy and mutual information
8 Data processing inequality
Conditional Probability
$P(x = a_i \mid y = b_j) = \dfrac{P(x = a_i,\ y = b_j)}{P(y = b_j)}, \quad \text{if } P(y = b_j) \ne 0$
Some Definitions
Rather than writing down the joint probability directly, we will
often define an ensemble in terms of a collection of conditional
probabilities. The following rules of probability theory will be
useful. (H denotes background assumptions).
Product rule: $P(x, y \mid H) = P(x \mid y, H)\, P(y \mid H)$
Sum rule: $P(x \mid H) = \sum_{y} P(x, y \mid H) = \sum_{y} P(x \mid y, H)\, P(y \mid H)$
Bayes’ theorem: $P(y \mid x, H) = \dfrac{P(x \mid y, H)\, P(y \mid H)}{\sum_{y'} P(x \mid y', H)\, P(y' \mid H)}$
Self-information & Conditional Self-information
2.1.1 Self-information
1. Simple events
The self-information of an event $x_i$ with probability $p(x_i)$ is defined as
$I(x_i) = -\log p(x_i)$
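As a quick numerical illustration (not part of the original slides), here is a minimal Python sketch of this definition; the helper name self_information is our own choice:

```python
import math

def self_information(p, base=2):
    """I(x) = -log_base p(x): base 2 gives bits, base e gives nats."""
    if not 0 < p <= 1:
        raise ValueError("probability must lie in (0, 1]")
    return -math.log(p, base)

# An outcome of probability 1/8 carries 3 bits; one of probability 1/2 carries 1 bit.
print(self_information(1 / 8), self_information(1 / 2))   # 3.0 1.0
```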
Conditional self-information
2.1.2 Conditional self-information
The conditional self-information of $x_i$ given $y_j$ is defined as
$I(x_i \mid y_j) = -\log p(x_i \mid y_j)$
Review of probability theory
2.1.2 Conditional self-information
Some useful formulas:
$P(AB) = P(B)\, P(A \mid B) = P(A)\, P(B \mid A)$
$P(A) = \sum_{i=1}^{n} P(B_i)\, P(A \mid B_i)$
Conditional probability: $P(A \mid B) = \dfrac{P(AB)}{P(B)}$
Bayes’ formula: $P(B_i \mid A) = \dfrac{P(B_i)\, P(A \mid B_i)}{\sum_{j=1}^{n} P(B_j)\, P(A \mid B_j)}$
2.1.2 Conditional self-information
Example
Suppose $p(x_i y_j) = 1/64$. Then:
1) $I(x_i y_j) = -\log p(x_i y_j) = -\log \tfrac{1}{64} = 6$ bit
2) $I(x_i \mid y_j) = -\log p(x_i \mid y_j) = -\log \dfrac{p(x_i y_j)}{p(y_j)} = 3$ bit
2.2 Jensen’s inequality
Definition
A function $f$ is convex if, for all $x_1, x_2$ and $0 \le \alpha \le 1$,
$f(\alpha x_1 + (1-\alpha) x_2) \le \alpha f(x_1) + (1-\alpha) f(x_2)$
2.2 Jensen’s inequality
Figure: for a convex $f$, the chord value $\alpha f(x_1) + (1-\alpha) f(x_2)$ lies above the curve value $f(\alpha x_1 + (1-\alpha) x_2)$ for $x$ between $x_1$ and $x_2$.
2.2 Jensen’s inequality
If $f$ is convex, then $E[f(X)] \ge f(E[X])$; for a concave $f$ the inequality is reversed.
For a discrete distribution with $p_j \ge 0$ and $\sum_{j} p_j = 1$:
$\sum_{j} p_j f(x_j) \ge f\!\Big(\sum_{j} p_j x_j\Big)$
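A small Monte-Carlo sanity check of the convex case stated above; the choice of $f(x) = x^2$ and of the uniform distribution are ours, purely for illustration:

```python
import random

random.seed(0)
xs = [random.uniform(-1.0, 3.0) for _ in range(100_000)]

f = lambda x: x * x                        # a convex function
e_f_x = sum(f(x) for x in xs) / len(xs)    # E[f(X)]
f_e_x = f(sum(xs) / len(xs))               # f(E[X])

# Jensen: E[f(X)] >= f(E[X]) for convex f (here roughly 2.33 vs. 1.0).
print(e_f_x >= f_e_x, e_f_x, f_e_x)
```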
Entropy
2.3 Entropy
Definition
The entropy H(X) of a discrete random variable X is defined by:
$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x)$
Remark
The entropy of X can also be interpreted as the expected value of $\log \tfrac{1}{p(X)}$, where X is drawn according to the probability mass function p(x).
Thus: $H(X) = E_p\!\left[\log \dfrac{1}{p(X)}\right]$
Annotation
$E_p$ denotes the expectation taken with respect to the distribution p(x).
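A short Python sketch of this definition (the helper entropy below is illustrative and not part of the lecture):

```python
import math

def entropy(pmf, base=2):
    """H(X) = -sum_x p(x) log p(x); terms with p(x) = 0 contribute nothing."""
    return -sum(p * math.log(p, base) for p in pmf if p > 0)

# A fair bit is maximally uncertain; a biased one carries less entropy.
print(entropy([0.5, 0.5]))   # 1.0 bit
print(entropy([0.9, 0.1]))   # about 0.469 bits
```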
2.3 Entropy
Definition 2
$H(X) \overset{\text{def}}{=} E\big[I(x_i)\big] = E\big[-\log p(x_i)\big] = -\sum_{i=1}^{q} p(x_i) \log p(x_i)$
2.3 Entropy
Annotation
The expression $H(X) = \dfrac{\sum_x p(x)\, I(x)}{\sum_x p(x)}$ shows that entropy is the probability-weighted average of the self-information of the individual outcomes.
2.3 Entropy (…)
Lemma 2.3.1: $H(X) \ge 0$
Proof: $0 \le p(x) \le 1$ implies $\log \dfrac{1}{p(x)} \ge 0$.
Annotation
The second property of entropy enables us to change the base of the logarithm in the definition: entropy can be changed from one base to another by multiplying by the appropriate factor, $H_b(X) = (\log_b a)\, H_a(X)$.
Entropy (…)
Example 2.3.1
Let
$X = \begin{cases} 1 & \text{with probability } p \\ 0 & \text{with probability } 1-p \end{cases}$
Then
$H(X) = -p \log p - (1-p) \log(1-p) \triangleq H(p)$
In particular, H(X) = 1 bit when p = 1/2.
Entropy (…)
Example 2.3.1 (continued)
Figure: the binary entropy function H(p); H(X) = 1 bit when p = 1/2.
Entropy (…)
Example 2.3.2
Let
$X = \begin{cases} a & \text{with probability } 1/2 \\ b & \text{with probability } 1/4 \\ c & \text{with probability } 1/8 \\ d & \text{with probability } 1/8 \end{cases}$
The entropy of X is
$H(X) = -\tfrac{1}{2}\log\tfrac{1}{2} - \tfrac{1}{4}\log\tfrac{1}{4} - \tfrac{1}{8}\log\tfrac{1}{8} - \tfrac{1}{8}\log\tfrac{1}{8} = \tfrac{7}{4} \text{ bit}$
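If SciPy happens to be available, the value can be checked in one line (scipy.stats.entropy computes $-\sum p \log p$ for a pmf in the requested base):

```python
from scipy.stats import entropy

# Example 2.3.2: H(1/2, 1/4, 1/8, 1/8) should equal 7/4 bits.
print(entropy([1/2, 1/4, 1/8, 1/8], base=2))   # 1.75
```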
Joint Entropy and Conditional Entropy
2.4.1 Joint entropy & conditional entropy
Definition:
The joint entropy H(X,Y) of a pair of discrete random variables (X,Y) with a joint distribution p(x,y) is defined as:
$H(X, Y) = -\sum_{x}\sum_{y} p(x, y) \log p(x, y) = -E\big[\log p(X, Y)\big]$
2.4.1 Joint entropy & conditional entropy
Definition:
If $(X, Y) \sim p(x, y)$, then the conditional entropy $H(Y \mid X)$ is defined as:
$H(Y \mid X) = \sum_{x} p(x)\, H(Y \mid X = x) = -\sum_{x} p(x) \sum_{y} p(y \mid x) \log p(y \mid x) = -\sum_{x}\sum_{y} p(x, y) \log p(y \mid x)$
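A small numpy sketch of these two definitions on an assumed 2×2 joint pmf (the numbers are purely illustrative); the last line also previews the chain rule H(X,Y) = H(X) + H(Y|X) proved below:

```python
import numpy as np

def H(p):
    """Entropy in bits of a probability vector; 0 log 0 is treated as 0."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Illustrative joint pmf p(x, y): rows index x, columns index y (assumed values).
p_xy = np.array([[0.4, 0.1],
                 [0.2, 0.3]])

p_x = p_xy.sum(axis=1)                                   # marginal p(x)
# H(Y|X) = sum_x p(x) H(Y | X = x), an average of row-conditional entropies.
H_Y_given_X = sum(px * H(row / px) for px, row in zip(p_x, p_xy))

print(H(p_xy))                     # H(X,Y)
print(H(p_x) + H_Y_given_X)        # equals H(X) + H(Y|X), the same value
```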
2.4 Joint entropy & Conditional entropy (…)
Chain rule: $H(X, Y) = H(X) + H(Y \mid X)$
Proof:
$H(X, Y) = -\sum_{i=1}^{n}\sum_{j=1}^{m} p(x_i y_j) \log p(x_i y_j)$
$= -\sum_{i=1}^{n}\sum_{j=1}^{m} p(x_i y_j) \log \big[p(x_i)\, p(y_j \mid x_i)\big]$
$= -\sum_{i=1}^{n} \Big[\sum_{j=1}^{m} p(y_j \mid x_i)\Big] p(x_i) \log p(x_i) - \sum_{i=1}^{n}\sum_{j=1}^{m} p(x_i y_j) \log p(y_j \mid x_i)$
$= -\sum_{i=1}^{n} p(x_i) \log p(x_i) + H(Y \mid X)$
$= H(X) + H(Y \mid X)$
2.4 Joint entropy & Conditional entropy (…)
Corollary
$H(X, Y \mid Z) = H(X \mid Z) + H(Y \mid X, Z)$
Additional corollaries
$H(X, Y) \le H(X) + H(Y)$
$H(Y \mid X) \le H(Y)$
$H(X \mid Y) \le H(X)$
A useful inequality: $\ln x \le x - 1$, with equality iff $x = 1$ (compare the curves $y = \ln x$ and $y = x - 1$).
2.4 Joint entropy & Conditional entropy (…)
Example 2.4.1
Let (X,Y) have the following joint distribution p(x,y) (rows: Y = 1,…,4; columns: X = 1,…,4; the last column is the marginal p(Y = i)):

Y\X     1      2      3      4      p(Y)
1       1/8    1/16   1/32   1/32   1/4
2       1/16   1/8    1/32   1/32   1/4
3       1/16   1/16   1/16   1/16   1/4
4       1/4    0      0      0      1/4

The marginal of X is $p(X) = \big(\tfrac{1}{2}, \tfrac{1}{4}, \tfrac{1}{8}, \tfrac{1}{8}\big)$, so H(X) = 7/4 bits and H(Y) = 2 bits.
Answer:
$H(X \mid Y) = \sum_{i=1}^{4} p(Y = i)\, H(X \mid Y = i)$
$= \tfrac{1}{4} H\big(\tfrac{1}{2}, \tfrac{1}{4}, \tfrac{1}{8}, \tfrac{1}{8}\big) + \tfrac{1}{4} H\big(\tfrac{1}{4}, \tfrac{1}{2}, \tfrac{1}{8}, \tfrac{1}{8}\big) + \tfrac{1}{4} H\big(\tfrac{1}{4}, \tfrac{1}{4}, \tfrac{1}{4}, \tfrac{1}{4}\big) + \tfrac{1}{4} H(1, 0, 0, 0)$
$= \tfrac{1}{4} \cdot \tfrac{7}{4} + \tfrac{1}{4} \cdot \tfrac{7}{4} + \tfrac{1}{4} \cdot 2 + \tfrac{1}{4} \cdot 0 = \tfrac{11}{8}$ bits
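The arithmetic of Example 2.4.1 is easy to check with a few lines of numpy (a sketch, not part of the lecture):

```python
import numpy as np

# Joint pmf of Example 2.4.1 (rows: Y = 1..4, columns: X = 1..4).
p = np.array([[1/8,  1/16, 1/32, 1/32],
              [1/16, 1/8,  1/32, 1/32],
              [1/16, 1/16, 1/16, 1/16],
              [1/4,  0.0,  0.0,  0.0 ]])

def H(v):
    v = np.asarray(v, dtype=float).ravel()
    v = v[v > 0]
    return float(-(v * np.log2(v)).sum())

p_y = p.sum(axis=1)                                            # each row sums to 1/4
H_X_given_Y = sum(py * H(row / py) for py, row in zip(p_y, p))

print(H(p.sum(axis=0)))   # H(X)   = 1.75  bits (= 7/4)
print(H(p_y))             # H(Y)   = 2.0   bits
print(H_X_given_Y)        # H(X|Y) = 1.375 bits (= 11/8)
```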
2.4 Joint entropy & Conditional entropy (…)
Let (X,Y) have the joint distribution
p(x, y)    X = 1    X = 2
Y = 1      0        3/4
Y = 2      1/8      1/8
Then: $H(X) = H\big(\tfrac{1}{8}, \tfrac{7}{8}\big) = 0.544$ bits,
$H(X \mid Y = 1) = H(0, 1) = 0$ bits,  $H(X \mid Y = 2) = H\big(\tfrac{1}{2}, \tfrac{1}{2}\big) = 1$ bit,
$H(X \mid Y) = \tfrac{3}{4}\, H(X \mid Y = 1) + \tfrac{1}{4}\, H(X \mid Y = 2) = 0.25$ bits.
Remarks
The uncertainty in X is increased if Y = 2 is observed and decreased if Y = 1 is observed, but on average the uncertainty decreases.
Relative Entropy and Mutual Information
2.5 Relative Entropy & Mutual Information
Definition:
The relative entropy or Kullback–Leibler distance between two probability mass functions p(x) and q(x) is defined as:
$D(p \,\|\, q) = \sum_{x \in \mathcal{X}} p(x) \log \dfrac{p(x)}{q(x)} = E_p\!\left[\log \dfrac{p(X)}{q(X)}\right]$
2.5 Relative Entropy & Mutual Information
Annotation
The relative entropy is always non-negative and is zero if
and only if p=q.
However, it is not a true distance between distributions
since it is not symmetric and does not satisfy the triangle
inequality.
Nonetheless, it is often useful to think of relative entropy
as a “distance” between distributions.
2.5 Relative Entropy & Mutual Information
Theorem: $D(p \,\|\, q) \ge 0$, with equality if and only if $p = q$.
Proof:
$D(p \,\|\, q) = \sum_{x \in \mathcal{X}} p(x) \log \dfrac{p(x)}{q(x)} = -\sum_{x \in \mathcal{X}} p(x) \log \dfrac{q(x)}{p(x)}$
$\ge -\log \sum_{x \in \mathcal{X}} p(x)\, \dfrac{q(x)}{p(x)} = -\log \sum_{x \in \mathcal{X}} q(x) \ge -\log 1 = 0$
where the first inequality follows from Jensen’s inequality applied to the concave function log.
2.5 Relative Entropy & Mutual Information
Example 2.5.1
Let $\mathcal{X} = \{0, 1\}$ and consider two distributions p and q on $\mathcal{X}$ with $p(1) = r$ and $q(1) = s$. Then
$D(p \,\|\, q) = (1 - r) \log \dfrac{1 - r}{1 - s} + r \log \dfrac{r}{s}$
and
$D(q \,\|\, p) = (1 - s) \log \dfrac{1 - s}{1 - r} + s \log \dfrac{s}{r}$
2.5 Relative Entropy & Mutual Information
Annotation
►If r=s, then D(p||q)=D(q||p)=0.
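A quick numerical illustration of the asymmetry; the slides do not fix values for r and s, so the ones below are our own choice:

```python
import math

def kl(p, q, base=2):
    """D(p||q) = sum_x p(x) log p(x)/q(x), with the convention 0 log 0 = 0."""
    return sum(pi * math.log(pi / qi, base) for pi, qi in zip(p, q) if pi > 0)

r, s = 1/2, 1/4                      # illustrative values only
p = [1 - r, r]
q = [1 - s, s]

print(kl(p, q))   # D(p||q) ~= 0.2075 bits
print(kl(q, p))   # D(q||p) ~= 0.1887 bits -- different, so D is not symmetric
```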
2.5 Relative Entropy & Mutual Information
Definition:
The mutual information I(X;Y) of two random variables X and Y is
$I(X; Y) = \sum_{x}\sum_{y} p(x, y) \log \dfrac{p(x, y)}{p(x)\, p(y)} = D\big(p(x, y) \,\|\, p(x)\, p(y)\big) = E_{p(x, y)}\!\left[\log \dfrac{p(X, Y)}{p(X)\, p(Y)}\right]$
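A direct numpy translation of this definition, again on an assumed 2×2 joint pmf (the values are illustrative, not from the lecture):

```python
import numpy as np

p_xy = np.array([[0.4, 0.1],          # assumed joint pmf p(x, y)
                 [0.2, 0.3]])
p_x = p_xy.sum(axis=1, keepdims=True) # marginal p(x) as a column
p_y = p_xy.sum(axis=0, keepdims=True) # marginal p(y) as a row
prod = p_x * p_y                      # product distribution p(x) p(y)

mask = p_xy > 0
I = float((p_xy[mask] * np.log2(p_xy[mask] / prod[mask])).sum())
print(I)   # about 0.1245 bits here; I(X;Y) = 0 iff X and Y are independent
```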
2.5 Relative Entropy & Mutual Information
Lemma (additional): $I(X; Y) \ge 0$
Proof:
$-I(X; Y) = \sum_{X}\sum_{Y} p(x_i, y_j) \log \dfrac{p(x_i)}{p(x_i \mid y_j)}$
$\le \log e \cdot \sum_{X}\sum_{Y} p(x_i, y_j) \left[\dfrac{p(x_i)}{p(x_i \mid y_j)} - 1\right]$
$= \log e \cdot \left[\sum_{X}\sum_{Y} p(x_i)\, p(y_j) - \sum_{X}\sum_{Y} p(x_i, y_j)\right]$
$= \log e \cdot (1 - 1) = 0$
using $\ln u \le u - 1$. Hence $I(X; Y) \ge 0$.
Relationship between entropy & mutual information
2.6 Relationship between entropy & MI
Theorem 2.6.1
$I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X)$
$I(X; Y) = H(X) + H(Y) - H(X, Y)$
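Theorem 2.6.1 is easy to verify numerically; a sketch using the same assumed joint pmf as in the earlier snippets:

```python
import numpy as np

def H(v):
    """Entropy in bits of a nonnegative array summing to 1 (zeros ignored)."""
    v = np.asarray(v, dtype=float).ravel()
    v = v[v > 0]
    return float(-(v * np.log2(v)).sum())

p_xy = np.array([[0.4, 0.1],            # assumed joint pmf
                 [0.2, 0.3]])
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)

I = H(p_x) + H(p_y) - H(p_xy)           # I(X;Y) = H(X) + H(Y) - H(X,Y)
H_X_given_Y = H(p_xy) - H(p_y)          # H(X|Y) = H(X,Y) - H(Y)
print(I, H(p_x) - H_X_given_Y)          # both are I(X;Y), about 0.1245 bits
```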
2.6 Relationship between entropy & MI
Proof
$I(X; Y) = \sum_{X}\sum_{Y} p(x_i y_j) \log \dfrac{p(x_i \mid y_j)}{p(x_i)}$
$= -\sum_{X}\sum_{Y} p(x_i y_j) \log p(x_i) + \sum_{X}\sum_{Y} p(x_i y_j) \log p(x_i \mid y_j)$
$= H(X) - H(X \mid Y)$
By symmetry, we can also prove:
$I(X; Y) = H(Y) - H(Y \mid X)$
2.6 Relationship between entropy & MI
Proof
$I(X; Y) = \sum_{X}\sum_{Y} p(x_i y_j) \log \dfrac{p(x_i \mid y_j)}{p(x_i)}$
$= \sum_{X}\sum_{Y} p(x_i y_j) \log \dfrac{p(x_i \mid y_j)\, p(y_j)}{p(x_i)\, p(y_j)}$
$= \sum_{X}\sum_{Y} p(x_i y_j) \log \dfrac{p(x_i y_j)}{p(x_i)\, p(y_j)}$
$= \sum_{X}\sum_{Y} p(x_i y_j) \log \dfrac{p(y_j \mid x_i)\, p(x_i)}{p(x_i)\, p(y_j)}$
$= \sum_{X}\sum_{Y} p(x_i y_j) \log \dfrac{p(y_j \mid x_i)}{p(y_j)}$
$= I(Y; X)$
2.6 Relationship between entropy & MI
Figure (Venn diagram): the relationship among H(X), H(Y) and H(X,Y).
2.7 Chain rules for entropy, RE & MI
Theorem (chain rule for entropy): for random variables $X_1, X_2, \ldots, X_n$ drawn according to $p(x_1, x_2, \ldots, x_n)$,
$H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} H(X_i \mid X_{i-1}, \ldots, X_1)$
2.7 Chain rules for entropy, RE & MI
Proof
$H(X_1, X_2) = H(X_1) + H(X_2 \mid X_1)$
$H(X_1, X_2, X_3) = H(X_1) + H(X_2, X_3 \mid X_1) = H(X_1) + H(X_2 \mid X_1) + H(X_3 \mid X_2, X_1)$
…
Definition
The conditional mutual information of X and Y given Z is
$I(X; Y \mid Z) = H(X \mid Z) - H(X \mid Y, Z) = E_{p(x, y, z)}\!\left[\log \dfrac{p(X, Y \mid Z)}{p(X \mid Z)\, p(Y \mid Z)}\right]$
2.7 Chain rules for entropy, RE & MI
Theorem (chain rule for information): $I(X_1, X_2, \ldots, X_n; Y) = \sum_{i=1}^{n} I(X_i; Y \mid X_1, \ldots, X_{i-1})$
Proof:
$I(X_1, X_2, \ldots, X_n; Y) = H(X_1, X_2, \ldots, X_n) - H(X_1, X_2, \ldots, X_n \mid Y)$
$= \sum_{i=1}^{n} H(X_i \mid X_{i-1}, \ldots, X_1) - \sum_{i=1}^{n} H(X_i \mid X_{i-1}, \ldots, X_1, Y)$
$= \sum_{i=1}^{n} I(X_i; Y \mid X_1, X_2, \ldots, X_{i-1})$
2.7 Chain rules for entropy, RE & MI
Definition
The conditional relative entropy $D\big(p(y \mid x) \,\|\, q(y \mid x)\big)$ is defined as
$D\big(p(y \mid x) \,\|\, q(y \mid x)\big) = \sum_{x} p(x) \sum_{y} p(y \mid x) \log \dfrac{p(y \mid x)}{q(y \mid x)} = E_{p(x, y)}\!\left[\log \dfrac{p(Y \mid X)}{q(Y \mid X)}\right]$
2.7 Chain rules for entropy, RE & MI
Chain rule for relative entropy:
$D\big(p(x, y) \,\|\, q(x, y)\big) = D\big(p(x) \,\|\, q(x)\big) + D\big(p(y \mid x) \,\|\, q(y \mid x)\big)$
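A numerical sanity check of this chain rule on two assumed 2×2 joint pmfs (all values below are illustrative only):

```python
import numpy as np

def kl(p, q):
    """D(p||q) in bits; assumes q > 0 wherever p > 0."""
    p, q = np.asarray(p, float).ravel(), np.asarray(q, float).ravel()
    m = p > 0
    return float((p[m] * np.log2(p[m] / q[m])).sum())

p = np.array([[0.4, 0.1], [0.2, 0.3]])        # assumed joint pmf p(x, y)
q = np.array([[0.25, 0.25], [0.25, 0.25]])    # assumed joint pmf q(x, y)

p_x, q_x = p.sum(axis=1), q.sum(axis=1)
# Conditional relative entropy: sum_x p(x) D( p(.|x) || q(.|x) ).
d_cond = sum(px * kl(prow / px, qrow / qx)
             for px, qx, prow, qrow in zip(p_x, q_x, p, q))

print(kl(p, q))                # D( p(x,y) || q(x,y) )
print(kl(p_x, q_x) + d_cond)   # the chain rule gives the same value
```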
Data processing inequality
2.8 Data processing inequality
Definition
Random variables X, Y, Z are said to form a Markov chain in that order (denoted by X → Y → Z) if the conditional distribution of Z depends only on Y and is conditionally independent of X. Specifically, X, Y and Z form a Markov chain X → Y → Z if the joint probability mass function can be written as
$p(x, y, z) = p(x)\, p(y \mid x)\, p(z \mid y)$
2.8 Data processing inequality
Theorem 2.8.1 (data processing inequality): if X → Y → Z, then $I(X; Y) \ge I(X; Z)$.
Proof:
$I(X; Y, Z) = I(X; Z) + I(X; Y \mid Z) = I(X; Y) + I(X; Z \mid Y)$
Since X and Z are conditionally independent given Y, $I(X; Z \mid Y) = 0$, and $I(X; Y \mid Z) \ge 0$, so $I(X; Y) \ge I(X; Z)$.
Corollary:
In particular, if $Z = g(Y)$, we have $I(X; Y) \ge I\big(X; g(Y)\big)$.
Corollary:
If X → Y → Z, then $I(X; Y \mid Z) \le I(X; Y)$.
Proof: From Theorem 2.8.1, using the fact that $I(X; Z \mid Y) = 0$ by Markovity and $I(X; Z) \ge 0$, we have
$I(X; Y \mid Z) \le I(X; Y)$
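A numerical illustration of the data processing inequality on an assumed two-state Markov chain X → Y → Z (the source distribution and both channels below are arbitrary illustrative choices):

```python
import numpy as np

def H(v):
    v = np.asarray(v, dtype=float).ravel()
    v = v[v > 0]
    return float(-(v * np.log2(v)).sum())

def mi(p_joint):
    """Mutual information between the row and column variables of a joint pmf matrix."""
    return H(p_joint.sum(axis=1)) + H(p_joint.sum(axis=0)) - H(p_joint)

p_x = np.array([0.3, 0.7])              # assumed source distribution
p_y_given_x = np.array([[0.9, 0.1],     # assumed channel p(y|x), rows indexed by x
                        [0.2, 0.8]])
p_z_given_y = np.array([[0.7, 0.3],     # assumed channel p(z|y), rows indexed by y
                        [0.1, 0.9]])

p_xy = p_x[:, None] * p_y_given_x       # p(x, y) = p(x) p(y|x)
p_xz = p_xy @ p_z_given_y               # p(x, z) = sum_y p(x, y) p(z|y)

print(mi(p_xy), mi(p_xz))               # I(X;Y) >= I(X;Z), as the DPI asserts
```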
Thanks