Information Theory
Lecture 2. Entropy and Mutual Information
Prof. CHEN Jie
Lab. 201, School of EIE
Beihang University
Contents
1 Self-information
2 Jensen’s inequality
3 Entropy
4 Joint entropy and conditional entropy
5 Relative entropy and mutual information
6 Relationship between entropy and mutual information
7 Chain rules for entropy, relative entropy and mutual information
8 Data processing inequality
Conditional Probability
$P(x = a_i \mid y = b_j) = \dfrac{P(x = a_i,\ y = b_j)}{P(y = b_j)}, \quad \text{if } P(y = b_j) \ne 0$
Some Definitions
Rather than writing down the joint probability directly, we will
often define an ensemble in terms of a collection of conditional
probabilities. The following rules of probability theory will be
useful. (H denotes background assumptions).
Product rule: $P(x, y \mid H) = P(x \mid y, H)\, P(y \mid H)$
Sum rule: $P(x \mid H) = \sum_{y} P(x, y \mid H) = \sum_{y} P(x \mid y, H)\, P(y \mid H)$
Bayes’ theorem: $P(y \mid x, H) = \dfrac{P(x \mid y, H)\, P(y \mid H)}{\sum_{y'} P(x \mid y', H)\, P(y' \mid H)}$
Self-information & Conditional Self-information
2.1.1 Self-information
1. Simple events
The self-information of an event $x_i$ with probability $p(x_i)$ is defined as
$I(x_i) = -\log p(x_i)$
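As a quick numerical illustration (not part of the original slides), here is a minimal Python sketch of this definition; the helper name self_information is our own choice:

```python
import math

def self_information(p, base=2):
    """I(x) = -log_base p(x): base 2 gives bits, base e gives nats."""
    if not 0 < p <= 1:
        raise ValueError("probability must lie in (0, 1]")
    return -math.log(p, base)

# An outcome of probability 1/8 carries 3 bits; one of probability 1/2 carries 1 bit.
print(self_information(1 / 8), self_information(1 / 2))   # 3.0 1.0
```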
Conditional self-information
2.1.2 Conditional self-information
The conditional self-information of $x_i$ given $y_j$ is defined as
$I(x_i \mid y_j) = -\log p(x_i \mid y_j)$
Review of probability theory
2.1.2 Conditional self-information
Some useful formulas:
$P(AB) = P(B)\, P(A \mid B) = P(A)\, P(B \mid A)$
$P(A) = \sum_{i=1}^{n} P(B_i)\, P(A \mid B_i)$
Conditional probability: $P(A \mid B) = \dfrac{P(AB)}{P(B)}$
Bayes’ formula: $P(B_i \mid A) = \dfrac{P(B_i)\, P(A \mid B_i)}{\sum_{j=1}^{n} P(B_j)\, P(A \mid B_j)}$
2.1.2 Conditional self-information
Example
Suppose $p(x_i y_j) = 1/64$. Then:
1) $I(x_i y_j) = -\log p(x_i y_j) = -\log \tfrac{1}{64} = 6$ bit
2) $I(x_i \mid y_j) = -\log p(x_i \mid y_j) = -\log \dfrac{p(x_i y_j)}{p(y_j)} = 3$ bit
2.2 Jensen’s inequality
Definition
A function $f$ is convex if, for all $x_1, x_2$ and $0 \le \alpha \le 1$,
$f(\alpha x_1 + (1-\alpha) x_2) \le \alpha f(x_1) + (1-\alpha) f(x_2)$
2.2 Jensen’s inequality
Figure: for a convex $f$, the chord value $\alpha f(x_1) + (1-\alpha) f(x_2)$ lies above the curve value $f(\alpha x_1 + (1-\alpha) x_2)$ for $x$ between $x_1$ and $x_2$.
2.2 Jensen’s inequality
If $f$ is convex, then $E[f(X)] \ge f(E[X])$; for a concave $f$ the inequality is reversed.
For a discrete distribution with $p_j \ge 0$ and $\sum_{j} p_j = 1$:
$\sum_{j} p_j f(x_j) \ge f\!\Big(\sum_{j} p_j x_j\Big)$
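A small Monte-Carlo sanity check of the convex case stated above; the choice of $f(x) = x^2$ and of the uniform distribution are ours, purely for illustration:

```python
import random

random.seed(0)
xs = [random.uniform(-1.0, 3.0) for _ in range(100_000)]

f = lambda x: x * x                        # a convex function
e_f_x = sum(f(x) for x in xs) / len(xs)    # E[f(X)]
f_e_x = f(sum(xs) / len(xs))               # f(E[X])

# Jensen: E[f(X)] >= f(E[X]) for convex f (here roughly 2.33 vs. 1.0).
print(e_f_x >= f_e_x, e_f_x, f_e_x)
```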
Entropy
2.3 Entropy
Definition
The entropy H(X) of a discrete random variable X is defined by:
$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x)$
Remark
The entropy of X can also be interpreted as the expected value of $\log \tfrac{1}{p(X)}$, where X is drawn according to the probability mass function p(x).
Thus: $H(X) = E_p\!\left[\log \dfrac{1}{p(X)}\right]$
Annotation
$E_p$ denotes the expectation taken with respect to the distribution p(x).
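A short Python sketch of this definition (the helper entropy below is illustrative and not part of the lecture):

```python
import math

def entropy(pmf, base=2):
    """H(X) = -sum_x p(x) log p(x); terms with p(x) = 0 contribute nothing."""
    return -sum(p * math.log(p, base) for p in pmf if p > 0)

# A fair bit is maximally uncertain; a biased one carries less entropy.
print(entropy([0.5, 0.5]))   # 1.0 bit
print(entropy([0.9, 0.1]))   # about 0.469 bits
```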
2.3 Entropy
Definition 2
$H(X) \overset{\text{def}}{=} E\big[I(x_i)\big] = E\big[-\log p(x_i)\big] = -\sum_{i=1}^{q} p(x_i) \log p(x_i)$
2.3 Entropy
Annotation
The expression $H(X) = \dfrac{\sum_x p(x)\, I(x)}{\sum_x p(x)}$ shows that entropy is the probability-weighted average of the self-information of the individual outcomes.
2.3 Entropy (…)
Lemma 2.3.1: $H(X) \ge 0$
Proof: $0 \le p(x) \le 1$ implies $\log \dfrac{1}{p(x)} \ge 0$.
Annotation
The second property of entropy enables us to change the base of the logarithm in the definition: entropy can be changed from one base to another by multiplying by the appropriate factor, $H_b(X) = (\log_b a)\, H_a(X)$.
Entropy (…)
Example 2.3.1
Let
$X = \begin{cases} 1 & \text{with probability } p \\ 0 & \text{with probability } 1-p \end{cases}$
Then
$H(X) = -p \log p - (1-p) \log(1-p) \triangleq H(p)$
In particular, H(X) = 1 bit when p = 1/2.
Entropy (…)
Example 2.3.1 (continued)
Figure: the binary entropy function H(p); H(X) = 1 bit when p = 1/2.
Entropy (…)
Example 2.3.2
Let
$X = \begin{cases} a & \text{with probability } 1/2 \\ b & \text{with probability } 1/4 \\ c & \text{with probability } 1/8 \\ d & \text{with probability } 1/8 \end{cases}$
The entropy of X is
$H(X) = -\tfrac{1}{2}\log\tfrac{1}{2} - \tfrac{1}{4}\log\tfrac{1}{4} - \tfrac{1}{8}\log\tfrac{1}{8} - \tfrac{1}{8}\log\tfrac{1}{8} = \tfrac{7}{4} \text{ bit}$
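If SciPy happens to be available, the value can be checked in one line (scipy.stats.entropy computes $-\sum p \log p$ for a pmf in the requested base):

```python
from scipy.stats import entropy

# Example 2.3.2: H(1/2, 1/4, 1/8, 1/8) should equal 7/4 bits.
print(entropy([1/2, 1/4, 1/8, 1/8], base=2))   # 1.75
```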
Joint Entropy and Conditional Entropy
2.4.1 Joint entropy & conditional entropy
Definition:
The joint entropy H(X,Y) of a pair of discrete random variables (X,Y) with a joint distribution p(x,y) is defined as:
$H(X, Y) = -\sum_{x}\sum_{y} p(x, y) \log p(x, y) = -E\big[\log p(X, Y)\big]$
2.4.1 Joint entropy & conditional entropy
Definition:
If $(X, Y) \sim p(x, y)$, then the conditional entropy $H(Y \mid X)$ is defined as:
$H(Y \mid X) = \sum_{x} p(x)\, H(Y \mid X = x) = -\sum_{x} p(x) \sum_{y} p(y \mid x) \log p(y \mid x) = -\sum_{x}\sum_{y} p(x, y) \log p(y \mid x)$
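A small numpy sketch of these two definitions on an assumed 2×2 joint pmf (the numbers are purely illustrative); the last line also previews the chain rule H(X,Y) = H(X) + H(Y|X) proved below:

```python
import numpy as np

def H(p):
    """Entropy in bits of a probability vector; 0 log 0 is treated as 0."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Illustrative joint pmf p(x, y): rows index x, columns index y (assumed values).
p_xy = np.array([[0.4, 0.1],
                 [0.2, 0.3]])

p_x = p_xy.sum(axis=1)                                   # marginal p(x)
# H(Y|X) = sum_x p(x) H(Y | X = x), an average of row-conditional entropies.
H_Y_given_X = sum(px * H(row / px) for px, row in zip(p_x, p_xy))

print(H(p_xy))                     # H(X,Y)
print(H(p_x) + H_Y_given_X)        # equals H(X) + H(Y|X), the same value
```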
2.4 Joint entropy & Conditional entropy (…)
Chain rule: $H(X, Y) = H(X) + H(Y \mid X)$
Proof:
$H(X, Y) = -\sum_{i=1}^{n}\sum_{j=1}^{m} p(x_i y_j) \log p(x_i y_j)$
$= -\sum_{i=1}^{n}\sum_{j=1}^{m} p(x_i y_j) \log \big[p(x_i)\, p(y_j \mid x_i)\big]$
$= -\sum_{i=1}^{n} \Big[\sum_{j=1}^{m} p(y_j \mid x_i)\Big] p(x_i) \log p(x_i) - \sum_{i=1}^{n}\sum_{j=1}^{m} p(x_i y_j) \log p(y_j \mid x_i)$
$= -\sum_{i=1}^{n} p(x_i) \log p(x_i) + H(Y \mid X)$
$= H(X) + H(Y \mid X)$
2.4 Joint entropy & Conditional entropy (…)
Corollary
$H(X, Y \mid Z) = H(X \mid Z) + H(Y \mid X, Z)$
Additional corollaries
$H(X, Y) \le H(X) + H(Y)$
$H(Y \mid X) \le H(Y)$
$H(X \mid Y) \le H(X)$
A useful inequality: $\ln x \le x - 1$, with equality iff $x = 1$ (compare the curves $y = \ln x$ and $y = x - 1$).
2.4 Joint entropy & Conditional entropy (…)
Example 2.4.1
Let (X,Y) have the following joint distribution p(x,y) (rows: Y = 1,…,4; columns: X = 1,…,4; the last column is the marginal p(Y = i)):

Y\X     1      2      3      4      p(Y)
1       1/8    1/16   1/32   1/32   1/4
2       1/16   1/8    1/32   1/32   1/4
3       1/16   1/16   1/16   1/16   1/4
4       1/4    0      0      0      1/4

The marginal of X is $p(X) = \big(\tfrac{1}{2}, \tfrac{1}{4}, \tfrac{1}{8}, \tfrac{1}{8}\big)$, so H(X) = 7/4 bits and H(Y) = 2 bits.
Answer:
$H(X \mid Y) = \sum_{i=1}^{4} p(Y = i)\, H(X \mid Y = i)$
$= \tfrac{1}{4} H\big(\tfrac{1}{2}, \tfrac{1}{4}, \tfrac{1}{8}, \tfrac{1}{8}\big) + \tfrac{1}{4} H\big(\tfrac{1}{4}, \tfrac{1}{2}, \tfrac{1}{8}, \tfrac{1}{8}\big) + \tfrac{1}{4} H\big(\tfrac{1}{4}, \tfrac{1}{4}, \tfrac{1}{4}, \tfrac{1}{4}\big) + \tfrac{1}{4} H(1, 0, 0, 0)$
$= \tfrac{1}{4} \cdot \tfrac{7}{4} + \tfrac{1}{4} \cdot \tfrac{7}{4} + \tfrac{1}{4} \cdot 2 + \tfrac{1}{4} \cdot 0 = \tfrac{11}{8}$ bits
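The arithmetic of Example 2.4.1 is easy to check with a few lines of numpy (a sketch, not part of the lecture):

```python
import numpy as np

# Joint pmf of Example 2.4.1 (rows: Y = 1..4, columns: X = 1..4).
p = np.array([[1/8,  1/16, 1/32, 1/32],
              [1/16, 1/8,  1/32, 1/32],
              [1/16, 1/16, 1/16, 1/16],
              [1/4,  0.0,  0.0,  0.0 ]])

def H(v):
    v = np.asarray(v, dtype=float).ravel()
    v = v[v > 0]
    return float(-(v * np.log2(v)).sum())

p_y = p.sum(axis=1)                                            # each row sums to 1/4
H_X_given_Y = sum(py * H(row / py) for py, row in zip(p_y, p))

print(H(p.sum(axis=0)))   # H(X)   = 1.75  bits (= 7/4)
print(H(p_y))             # H(Y)   = 2.0   bits
print(H_X_given_Y)        # H(X|Y) = 1.375 bits (= 11/8)
```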
2.4 Joint entropy & Conditional entropy (…)
Let (X,Y) have the joint distribution
p(x, y)    X = 1    X = 2
Y = 1      0        3/4
Y = 2      1/8      1/8
Then: $H(X) = H\big(\tfrac{1}{8}, \tfrac{7}{8}\big) = 0.544$ bits,
$H(X \mid Y = 1) = H(0, 1) = 0$ bits,  $H(X \mid Y = 2) = H\big(\tfrac{1}{2}, \tfrac{1}{2}\big) = 1$ bit,
$H(X \mid Y) = \tfrac{3}{4}\, H(X \mid Y = 1) + \tfrac{1}{4}\, H(X \mid Y = 2) = 0.25$ bits.
Remarks
The uncertainty in X is increased if Y = 2 is observed and decreased if Y = 1 is observed, but on average the uncertainty decreases.
Relative Entropy and Mutual Information
2.5 Relative Entropy & Mutual Information
Definition:
The relative entropy or Kullback–Leibler distance between two probability mass functions p(x) and q(x) is defined as:
$D(p \,\|\, q) = \sum_{x \in \mathcal{X}} p(x) \log \dfrac{p(x)}{q(x)} = E_p\!\left[\log \dfrac{p(X)}{q(X)}\right]$
2.5 Relative Entropy & Mutual Information
Annotation
The relative entropy is always non-negative and is zero if
and only if p=q.
However, it is not a true distance between distributions
since it is not symmetric and does not satisfy the triangle
inequality.
Nonetheless, it is often useful to think of relative entropy
as a “distance” between distributions.
2.5 Relative Entropy & Mutual Information
Theorem: $D(p \,\|\, q) \ge 0$, with equality if and only if $p = q$.
Proof:
$D(p \,\|\, q) = \sum_{x \in \mathcal{X}} p(x) \log \dfrac{p(x)}{q(x)} = -\sum_{x \in \mathcal{X}} p(x) \log \dfrac{q(x)}{p(x)}$
$\ge -\log \sum_{x \in \mathcal{X}} p(x)\, \dfrac{q(x)}{p(x)} = -\log \sum_{x \in \mathcal{X}} q(x) \ge -\log 1 = 0$
where the first inequality follows from Jensen’s inequality applied to the concave function log.
2.5 Relative Entropy & Mutual Information
Example 2.5.1
Let $\mathcal{X} = \{0, 1\}$ and consider two distributions p and q on $\mathcal{X}$ with $p(1) = r$ and $q(1) = s$. Then
$D(p \,\|\, q) = (1 - r) \log \dfrac{1 - r}{1 - s} + r \log \dfrac{r}{s}$
and
$D(q \,\|\, p) = (1 - s) \log \dfrac{1 - s}{1 - r} + s \log \dfrac{s}{r}$
2.5 Relative Entropy & Mutual Information
Annotation
►If r=s, then D(p||q)=D(q||p)=0.
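A quick numerical illustration of the asymmetry; the slides do not fix values for r and s, so the ones below are our own choice:

```python
import math

def kl(p, q, base=2):
    """D(p||q) = sum_x p(x) log p(x)/q(x), with the convention 0 log 0 = 0."""
    return sum(pi * math.log(pi / qi, base) for pi, qi in zip(p, q) if pi > 0)

r, s = 1/2, 1/4                      # illustrative values only
p = [1 - r, r]
q = [1 - s, s]

print(kl(p, q))   # D(p||q) ~= 0.2075 bits
print(kl(q, p))   # D(q||p) ~= 0.1887 bits -- different, so D is not symmetric
```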
2.5 Relative Entropy & Mutual Information
Definition:
The mutual information I(X;Y) of two random variables X and Y is
$I(X; Y) = \sum_{x}\sum_{y} p(x, y) \log \dfrac{p(x, y)}{p(x)\, p(y)} = D\big(p(x, y) \,\|\, p(x)\, p(y)\big) = E_{p(x, y)}\!\left[\log \dfrac{p(X, Y)}{p(X)\, p(Y)}\right]$
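A direct numpy translation of this definition, again on an assumed 2×2 joint pmf (the values are illustrative, not from the lecture):

```python
import numpy as np

p_xy = np.array([[0.4, 0.1],          # assumed joint pmf p(x, y)
                 [0.2, 0.3]])
p_x = p_xy.sum(axis=1, keepdims=True) # marginal p(x) as a column
p_y = p_xy.sum(axis=0, keepdims=True) # marginal p(y) as a row
prod = p_x * p_y                      # product distribution p(x) p(y)

mask = p_xy > 0
I = float((p_xy[mask] * np.log2(p_xy[mask] / prod[mask])).sum())
print(I)   # about 0.1245 bits here; I(X;Y) = 0 iff X and Y are independent
```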
2.5 Relative Entropy & Mutual Information
Lemma (additional): $I(X; Y) \ge 0$
Proof:
$-I(X; Y) = \sum_{X}\sum_{Y} p(x_i, y_j) \log \dfrac{p(x_i)}{p(x_i \mid y_j)}$
$\le \log e \cdot \sum_{X}\sum_{Y} p(x_i, y_j) \left[\dfrac{p(x_i)}{p(x_i \mid y_j)} - 1\right]$
$= \log e \cdot \left[\sum_{X}\sum_{Y} p(x_i)\, p(y_j) - \sum_{X}\sum_{Y} p(x_i, y_j)\right]$
$= \log e \cdot (1 - 1) = 0$
using $\ln u \le u - 1$. Hence $I(X; Y) \ge 0$.
Relationship between entropy & mutual information
2.6 Relationship between entropy & MI
Theorem 2.6.1
$I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X)$
$I(X; Y) = H(X) + H(Y) - H(X, Y)$
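Theorem 2.6.1 is easy to verify numerically; a sketch using the same assumed joint pmf as in the earlier snippets:

```python
import numpy as np

def H(v):
    """Entropy in bits of a nonnegative array summing to 1 (zeros ignored)."""
    v = np.asarray(v, dtype=float).ravel()
    v = v[v > 0]
    return float(-(v * np.log2(v)).sum())

p_xy = np.array([[0.4, 0.1],            # assumed joint pmf
                 [0.2, 0.3]])
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)

I = H(p_x) + H(p_y) - H(p_xy)           # I(X;Y) = H(X) + H(Y) - H(X,Y)
H_X_given_Y = H(p_xy) - H(p_y)          # H(X|Y) = H(X,Y) - H(Y)
print(I, H(p_x) - H_X_given_Y)          # both are I(X;Y), about 0.1245 bits
```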
2.6 Relationship between entropy & MI
Proof
$I(X; Y) = \sum_{X}\sum_{Y} p(x_i y_j) \log \dfrac{p(x_i \mid y_j)}{p(x_i)}$
$= -\sum_{X}\sum_{Y} p(x_i y_j) \log p(x_i) + \sum_{X}\sum_{Y} p(x_i y_j) \log p(x_i \mid y_j)$
$= H(X) - H(X \mid Y)$
By symmetry, we can also prove:
$I(X; Y) = H(Y) - H(Y \mid X)$
2.6 Relationship between entropy & MI
Proof
$I(X; Y) = \sum_{X}\sum_{Y} p(x_i y_j) \log \dfrac{p(x_i \mid y_j)}{p(x_i)}$
$= \sum_{X}\sum_{Y} p(x_i y_j) \log \dfrac{p(x_i \mid y_j)\, p(y_j)}{p(x_i)\, p(y_j)}$
$= \sum_{X}\sum_{Y} p(x_i y_j) \log \dfrac{p(x_i y_j)}{p(x_i)\, p(y_j)}$
$= \sum_{X}\sum_{Y} p(x_i y_j) \log \dfrac{p(y_j \mid x_i)\, p(x_i)}{p(x_i)\, p(y_j)}$
$= \sum_{X}\sum_{Y} p(x_i y_j) \log \dfrac{p(y_j \mid x_i)}{p(y_j)}$
$= I(Y; X)$
2.6 Relationship between entropy & MI
Figure (Venn diagram): the relationship among H(X), H(Y) and H(X,Y).
2.7 Chain rules for entropy, RE & MI
Theorem (chain rule for entropy): for random variables $X_1, X_2, \ldots, X_n$ drawn according to $p(x_1, x_2, \ldots, x_n)$,
$H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} H(X_i \mid X_{i-1}, \ldots, X_1)$
2.7 Chain rules for entropy, RE & MI
Proof
$H(X_1, X_2) = H(X_1) + H(X_2 \mid X_1)$
$H(X_1, X_2, X_3) = H(X_1) + H(X_2, X_3 \mid X_1) = H(X_1) + H(X_2 \mid X_1) + H(X_3 \mid X_2, X_1)$
…
Definition
The conditional mutual information of X and Y given Z is
$I(X; Y \mid Z) = H(X \mid Z) - H(X \mid Y, Z) = E_{p(x, y, z)}\!\left[\log \dfrac{p(X, Y \mid Z)}{p(X \mid Z)\, p(Y \mid Z)}\right]$
2.7 Chain rules for entropy, RE & MI
Theorem (chain rule for information): $I(X_1, X_2, \ldots, X_n; Y) = \sum_{i=1}^{n} I(X_i; Y \mid X_1, \ldots, X_{i-1})$
Proof:
$I(X_1, X_2, \ldots, X_n; Y) = H(X_1, X_2, \ldots, X_n) - H(X_1, X_2, \ldots, X_n \mid Y)$
$= \sum_{i=1}^{n} H(X_i \mid X_{i-1}, \ldots, X_1) - \sum_{i=1}^{n} H(X_i \mid X_{i-1}, \ldots, X_1, Y)$
$= \sum_{i=1}^{n} I(X_i; Y \mid X_1, X_2, \ldots, X_{i-1})$
2.7 Chain rules for entropy, RE & MI
Definition
The conditional relative entropy $D\big(p(y \mid x) \,\|\, q(y \mid x)\big)$ is defined as
$D\big(p(y \mid x) \,\|\, q(y \mid x)\big) = \sum_{x} p(x) \sum_{y} p(y \mid x) \log \dfrac{p(y \mid x)}{q(y \mid x)} = E_{p(x, y)}\!\left[\log \dfrac{p(Y \mid X)}{q(Y \mid X)}\right]$
2.7 Chain rules for entropy, RE & MI
Chain rule for relative entropy:
$D\big(p(x, y) \,\|\, q(x, y)\big) = D\big(p(x) \,\|\, q(x)\big) + D\big(p(y \mid x) \,\|\, q(y \mid x)\big)$
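A numerical sanity check of this chain rule on two assumed 2×2 joint pmfs (all values below are illustrative only):

```python
import numpy as np

def kl(p, q):
    """D(p||q) in bits; assumes q > 0 wherever p > 0."""
    p, q = np.asarray(p, float).ravel(), np.asarray(q, float).ravel()
    m = p > 0
    return float((p[m] * np.log2(p[m] / q[m])).sum())

p = np.array([[0.4, 0.1], [0.2, 0.3]])        # assumed joint pmf p(x, y)
q = np.array([[0.25, 0.25], [0.25, 0.25]])    # assumed joint pmf q(x, y)

p_x, q_x = p.sum(axis=1), q.sum(axis=1)
# Conditional relative entropy: sum_x p(x) D( p(.|x) || q(.|x) ).
d_cond = sum(px * kl(prow / px, qrow / qx)
             for px, qx, prow, qrow in zip(p_x, q_x, p, q))

print(kl(p, q))                # D( p(x,y) || q(x,y) )
print(kl(p_x, q_x) + d_cond)   # the chain rule gives the same value
```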
Data processing inequality
2.8 Data processing inequality
Definition
Random variables X, Y, Z are said to form a Markov chain in that order (denoted by X → Y → Z) if the conditional distribution of Z depends only on Y and is conditionally independent of X. Specifically, X, Y and Z form a Markov chain X → Y → Z if the joint probability mass function can be written as
$p(x, y, z) = p(x)\, p(y \mid x)\, p(z \mid y)$
2.8 Data processing inequality
Theorem 2.8.1 (data processing inequality): if X → Y → Z, then $I(X; Y) \ge I(X; Z)$.
Proof:
$I(X; Y, Z) = I(X; Z) + I(X; Y \mid Z) = I(X; Y) + I(X; Z \mid Y)$
Since X and Z are conditionally independent given Y, $I(X; Z \mid Y) = 0$, and $I(X; Y \mid Z) \ge 0$, so $I(X; Y) \ge I(X; Z)$.
Corollary:
In particular, if $Z = g(Y)$, we have $I(X; Y) \ge I\big(X; g(Y)\big)$.
Corollary:
If X → Y → Z, then $I(X; Y \mid Z) \le I(X; Y)$.
Proof: From Theorem 2.8.1, using the fact that $I(X; Z \mid Y) = 0$ by Markovity and $I(X; Z) \ge 0$, we have
$I(X; Y \mid Z) \le I(X; Y)$
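A numerical illustration of the data processing inequality on an assumed two-state Markov chain X → Y → Z (the source distribution and both channels below are arbitrary illustrative choices):

```python
import numpy as np

def H(v):
    v = np.asarray(v, dtype=float).ravel()
    v = v[v > 0]
    return float(-(v * np.log2(v)).sum())

def mi(p_joint):
    """Mutual information between the row and column variables of a joint pmf matrix."""
    return H(p_joint.sum(axis=1)) + H(p_joint.sum(axis=0)) - H(p_joint)

p_x = np.array([0.3, 0.7])              # assumed source distribution
p_y_given_x = np.array([[0.9, 0.1],     # assumed channel p(y|x), rows indexed by x
                        [0.2, 0.8]])
p_z_given_y = np.array([[0.7, 0.3],     # assumed channel p(z|y), rows indexed by y
                        [0.1, 0.9]])

p_xy = p_x[:, None] * p_y_given_x       # p(x, y) = p(x) p(y|x)
p_xz = p_xy @ p_z_given_y               # p(x, z) = sum_y p(x, y) p(z|y)

print(mi(p_xy), mi(p_xz))               # I(X;Y) >= I(X;Z), as the DPI asserts
```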
Thanks