Information Theory Lecture Notes
Richard Combes¹,²
Version 1.0

¹ Université Paris-Saclay, CNRS, CentraleSupélec, Laboratoire des signaux et systèmes, France
² Department Signal, Information and Communication
Contents

1 Information Measures
  1.1 Entropy
    1.1.1 Definition
    1.1.2 Entropy and Physics
    1.1.3 Positivity of Entropy and Maximal Entropy
  1.2 Joint and Conditional Entropy
    1.2.1 Definition
    1.2.2 Properties
  1.3 Relative Entropy
    1.3.1 Definition
    1.3.2 Positivity of Relative Entropy
    1.3.3 Relative Entropy is Not a Distance
  1.4 Mutual Information
    1.4.1 Definition
    1.4.2 Positivity of Mutual Information
    1.4.3 Conditioning Reduces Entropy

2 Properties of Information Measures
  2.5.1 AEP
  2.5.2 Typicality
  2.5.3 Joint Typicality

8 Portfolio Theory
  8.1 A Model for Investment
    8.1.1 Asset Prices and Portfolios
    8.1.2 Relative Returns
  8.2 Log Optimal Portfolios
    8.2.1 Asymptotic Wealth Distribution
    8.2.2 Growth Rate Maximization
  8.3 Properties of Log Optimal Portfolios
    8.3.1 Kuhn-Tucker Conditions
    8.3.2 Asymptotic Optimality
  8.4 Investment with Side Information
    8.4.1 Mismatched Portfolios
    8.4.2 Exploiting Side Information

10 Mathematical Tools
  10.1 Jensen's Inequality
  10.2 Constrained Optimization
Foreword
These lecture notes pertain to the Information Theory course given at CentraleSupélec. They are based on the book "Cover and Thomas, Elements of Information Theory", which we highly recommend to interested students who wish to go further in the study of this topic. Each chapter corresponds to a lecture, apart from the last chapter, which contains mathematical tools used in the proofs.
Chapter 1
Information Measures
1.1 Entropy
1.1.1 Definition
Definition 1.1.1. The entropy of a discrete random variable $X \in \mathcal{X}$ with distribution $p_X$ is:

$$H(X) = \mathbb{E}\left[\log_2 \frac{1}{p_X(X)}\right] = \sum_{x \in \mathcal{X}} p_X(x) \log_2 \frac{1}{p_X(x)}$$
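As a quick illustration, here is a minimal Python sketch computing the entropy of a distribution given as a dictionary of probabilities (the example distributions are hypothetical):

```python
import math

def entropy(p):
    """H(X) = sum_x p(x) log2(1/p(x)); terms with p(x) = 0 contribute 0."""
    return sum(px * math.log2(1.0 / px) for px in p.values() if px > 0)

# A fair coin has entropy 1 bit; a biased coin has less.
print(entropy({"heads": 0.5, "tails": 0.5}))   # 1.0
print(entropy({"heads": 0.9, "tails": 0.1}))   # ~0.469
```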
One of the fundamental ideas is that the Gibbs entropy of an isolated physical system is a non-decreasing function of time, and is maximized at equilibrium. Therefore, the randomness in an isolated system always increases and is maximized at equilibrium. In fact, one can prove that the Boltzmann distribution:

$$p_X(x) = \frac{\exp\left(-\frac{E(x)}{k_B T}\right)}{\sum_{x' \in \mathcal{X}} \exp\left(-\frac{E(x')}{k_B T}\right)}$$

where $T$ is the temperature, $E(x)$ is the energy of state $x$ and $k_B$ is the Boltzmann constant, maximizes the Gibbs entropy under the average energy constraint $\sum_{x \in \mathcal{X}} p_X(x) E(x) = \bar{E}$. Hence all systems in equilibrium follow this distribution.
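As an aside, here is a minimal sketch computing a Boltzmann distribution numerically, assuming a hypothetical three-state system with energies expressed in units of $k_B T$:

```python
import math

def boltzmann(energies, kBT=1.0):
    """Boltzmann distribution over states with the given energies."""
    w = [math.exp(-E / kBT) for E in energies]
    Z = sum(w)                    # partition function
    return [wi / Z for wi in w]

p = boltzmann([0.0, 1.0, 2.0])
print(p, sum(p))                  # higher-energy states are less likely
```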
Since $p_X(x) \le 1$ for all $x$, entropy is positive:

$$H(X) = \sum_{x\in\mathcal{X}} p_X(x) \log_2 \frac{1}{p_X(x)} \ge \sum_{x\in\mathcal{X}} p_X(x) \log_2 1 = 0.$$

By Jensen's inequality, entropy is at most the logarithm of the alphabet size:

$$H(X) = \sum_{x\in\mathcal{X}} p_X(x) \log_2 \frac{1}{p_X(x)} \le \log_2 \sum_{x\in\mathcal{X}} p_X(x) \frac{1}{p_X(x)} = \log_2 |\mathcal{X}|.$$
The joint entropy H(X, Y ) is simply the entropy of (X, Y ) seen as a single
random variable. It is important to notice that the joint entropy depends on the
joint distribution, not only on the marginal distributions.
The conditional entropy $H(X|Y)$ measures the entropy of $X$ once the value of $Y$ has been revealed. It has several equivalent definitions; using Bayes' rule, which states that $p_{X,Y}(x,y) = p_Y(y)\,p_{X|Y}(x|y)$, we get:

$$H(X|Y) = \mathbb{E}\left[\log_2 \frac{1}{p_{X|Y}(X|Y)}\right] = \mathbb{E}\left[\log_2 \frac{1}{p_{X,Y}(X,Y)}\right] - \mathbb{E}\left[\log_2 \frac{1}{p_Y(Y)}\right] = H(X,Y) - H(Y).$$

The identity $H(X,Y) = H(Y) + H(X|Y)$ is called a chain rule, and can be interpreted as the fact that the amount of randomness in $(X,Y)$ equals the amount of randomness in $Y$ plus the amount of randomness left in $X$ once $Y$ has been revealed.
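A minimal sketch verifying the chain rule $H(X|Y) = H(X,Y) - H(Y)$ numerically, on a hypothetical joint distribution:

```python
import math

def H(p):
    """Entropy of a distribution given as {outcome: probability}."""
    return sum(v * math.log2(1.0 / v) for v in p.values() if v > 0)

# Joint distribution p_{X,Y} as {(x, y): probability}.
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

# Marginal of Y, then H(X|Y) via the chain rule H(X|Y) = H(X,Y) - H(Y).
p_y = {}
for (x, y), v in p_xy.items():
    p_y[y] = p_y.get(y, 0.0) + v

print(H(p_xy) - H(p_y))  # H(X|Y)
```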
1.2.2 Properties
Property 2. If X and Y are independent then H(X|Y) = H(X) and H(X,Y) = H(X) + H(Y).

Proof: If X and Y are independent then $p_{X,Y}(x,y) = p_X(x)p_Y(y)$, and replacing in the definition gives the result immediately.
Entropy is additive for independent random variables, which once again is
coherent with its interpretation as a measure of randomness. Indeed, if there
is no relationship between X and Y , the randomness of (X, Y ) is simply the
sum of the randomness in X and Y taken separately. It is also noticed that entropy is not additive if X and Y are correlated: for instance, if X = Y then H(X,Y) = H(X) ≠ H(X) + H(Y), unless both X and Y are deterministic.
Property 3. Conditional entropy is not symmetrical unless H(X) = H(Y):

$$H(Y|X) - H(X|Y) = H(Y) - H(X)$$

Conditional entropy is hence not symmetrical in general, one notable exception being when X and Y have the same distribution.
We shall see later that mutual information also quantifies the amount of information that can be exchanged between a sender who selects X and a receiver who observes Y.
Proof: Using both the chain rule and the definition of mutual information:

$$I(X_1, \ldots, X_n; Y) = \sum_{i=1}^n I(X_i; Y | X_{i-1}, \ldots, X_1).$$

The chain rule for mutual information also has a natural interpretation. Imagine that a sender selects $X_1, \ldots, X_n$ and attempts to communicate with a receiver who observes $Y$. Then the information that can be exchanged, $I(X_1,\ldots,X_n;Y)$, is the sum of the terms $I(X_i;Y|X_{i-1},\ldots,X_1)$, which can be interpreted as the sender sending $X_1$, the receiver retrieving $X_1$ from $Y$, then the sender sending $X_2$ and the receiver retrieving $X_2$ from both $Y$ and $X_1$, and so on. This idea of retrieving $X_1,\ldots,X_n$ iteratively is used in many communication systems.
2.2.1 Statement

Proposition 2.2.1 (Log-sum inequality). For any positive $(a_i)_i$, $(b_i)_i$:

$$\sum_{i=1}^n a_i \log_2 \frac{a_i}{b_i} \ge \Big(\sum_{i=1}^n a_i\Big) \log_2 \frac{\sum_{i=1}^n a_i}{\sum_{i=1}^n b_i}$$

with equality iff $\frac{a_i}{b_i} = c$ for all $i$.
Proof: Define $f(t) = t \log_2 t$, which is convex, and $\alpha_i = b_i / \sum_{j=1}^n b_j$. By Jensen's inequality:

$$\sum_{i=1}^n a_i \log_2 \frac{a_i}{b_i} = \Big(\sum_{j=1}^n b_j\Big) \sum_{i=1}^n \alpha_i f\Big(\frac{a_i}{b_i}\Big) \ge \Big(\sum_{j=1}^n b_j\Big) f\Big(\sum_{i=1}^n \alpha_i \frac{a_i}{b_i}\Big) = \Big(\sum_{i=1}^n a_i\Big) \log_2 \frac{\sum_{i=1}^n a_i}{\sum_{i=1}^n b_i}.$$
Simply said, a Markov chain X → Y → Z is such that one first draws the value of X; then, once the value of X is known, one draws Y according to some distribution that depends solely on X; and finally one draws Z according to some distribution that depends solely on Y. The key idea is that, in order to generate Z, one can only look at the previously generated value Y, i.e. we generate the process with a memory of order 1. The simplest, and most often encountered, example of a Markov chain X → Y → Z is any X, Y, Z such that Z = g(Y) where g is a known, deterministic function.
Proof: Using the chain rule for mutual information twice:

$$I(X; Y, Z) = I(X; Z) + I(X; Y|Z) = I(X; Y) + I(X; Z|Y).$$

Since $I(X;Y|Z) \ge 0$ and $I(X;Z|Y) = 0$ (by the Markov property), we have $I(X;Y) \ge I(X;Z)$.
The data processing inequality simply states that mutual information cannot
increase along a Markov chain, i.e. data processing cannot create information
out of nowhere. An interpretation in the context of communication is that if a sender selects X and a receiver observes Y, and a helper offers to help the receiver by computing the value of g(Y), then X → Y → g(Y) and so I(X; g(Y)) ≤ I(X; Y). That is, the helper is in fact never helpful.
2.4.2 Statement

Proposition 2.4.1 (Fano's inequality). If X → Y → X̂, then with $P_e = \mathbb{P}(\hat{X} \ne X)$:

$$H(X|Y) \le h_2(P_e) + P_e \log_2(|\mathcal{X}| - 1)$$

with $h_2(a) = a \log_2 \frac{1}{a} + (1-a)\log_2\frac{1}{1-a}$ the binary entropy. The first step of the proof is to note that, since X → Y → X̂, the data processing inequality gives:

$$H(X|Y) \le H(X|\hat{X}).$$
Proposition 2.5.1 (AEP). Consider $(X_1,Y_1),\ldots,(X_n,Y_n)$ i.i.d. with common joint distribution $p_{X,Y}$. Then:

$$\frac{1}{n}\sum_{i=1}^n \log_2\frac{1}{p_{X,Y}(X_i,Y_i)} \xrightarrow[n\to\infty]{} H(X,Y) \text{ in probability}$$

and

$$\frac{1}{n}\sum_{i=1}^n \log_2\frac{1}{p_{X|Y}(X_i|Y_i)} \xrightarrow[n\to\infty]{} H(X|Y) \text{ in probability}$$

and

$$\frac{1}{n}\sum_{i=1}^n \log_2\frac{p_{X,Y}(X_i,Y_i)}{p_X(X_i)p_Y(Y_i)} \xrightarrow[n\to\infty]{} I(X;Y) \text{ in probability.}$$
Proof: All statements follow from the weak law of large numbers.
The Asymptotic Equipartition Property (AEP), which in itself is a straightfor-
ward consequence of the law of large numbers, roughly states that for large i.i.d.
samples, the "empirical information measures" behave like the actual information
measures. While this is not very useful in itself, a consequence is that i.i.d. samples
concentrate on what is called "typical sets".
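A small simulation illustrating the AEP, using a hypothetical three-symbol distribution: the empirical average of $\log_2(1/p_X(X_i))$ concentrates around $H(X)$ as $n$ grows:

```python
import math, random

p = {"a": 0.5, "b": 0.25, "c": 0.25}
H = sum(v * math.log2(1 / v) for v in p.values())  # 1.5 bits

random.seed(0)
symbols, weights = list(p), list(p.values())
for n in [100, 10_000, 1_000_000]:
    sample = random.choices(symbols, weights=weights, k=n)
    emp = sum(math.log2(1 / p[x]) for x in sample) / n
    print(n, emp, "vs H(X) =", H)
```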
2.5.2 Typicality
Proposition 2.5.2. Consider $X_1, \ldots, X_n$ i.i.d. with common distribution $p_X$. Given $\epsilon > 0$, define the typical set:

$$A_\epsilon^n = \Big\{x^n \in \mathcal{X}^n : \Big|\frac{1}{n}\sum_{i=1}^n \log_2\frac{1}{p_X(x_i)} - H(X)\Big| \le \epsilon\Big\}.$$

Then:

(i) $|A_\epsilon^n| \le 2^{n(H(X)+\epsilon)}$ for all $n$

(ii) $|A_\epsilon^n| \ge (1-\epsilon)\, 2^{n(H(X)-\epsilon)}$ for $n$ large enough

(iii) $\mathbb{P}((X_1,\ldots,X_n) \in A_\epsilon^n) \ge 1 - \epsilon$ for $n$ large enough
Proof: From asymptotic equipartition, the typical set is a high-probability set: for $n$ large enough

$$1 - \epsilon \le \mathbb{P}((X_1, \ldots, X_n) \in A_\epsilon^n) \le 1.$$

The size of the typical set is bounded by summing the bounds $2^{-n(H(X)+\epsilon)} \le p_{X^n}(x^n) \le 2^{-n(H(X)-\epsilon)}$, valid for all $x^n \in A_\epsilon^n$, over the elements of $A_\epsilon^n$.

2.5.3 Joint Typicality

Proposition 2.5.3. Consider $(X_1,Y_1),\ldots,(X_n,Y_n)$ i.i.d. with common joint distribution $p_{X,Y}$, define the jointly typical set $A_\epsilon^n$ as the set of pairs $(x^n,y^n)$ whose empirical entropies for $x^n$, $y^n$ and $(x^n,y^n)$ are within $\epsilon/3$ of $H(X)$, $H(Y)$ and $H(X,Y)$ respectively, and let $(\tilde{X}^n, \tilde{Y}^n)$ be independent sequences drawn i.i.d. from the marginals $p_X$ and $p_Y$. Then:

(i) $|A_\epsilon^n| \le 2^{n(H(X,Y)+\epsilon)}$ for all $n$

(ii) $\mathbb{P}((X^n, Y^n) \in A_\epsilon^n) \to 1$ as $n \to \infty$

(iii) $(1-\epsilon)\, 2^{-n(I(X;Y)+\epsilon)} \le \mathbb{P}((\tilde{X}^n, \tilde{Y}^n) \in A_\epsilon^n) \le 2^{-n(I(X;Y)-\epsilon)}$ for $n$ large enough
Proof: We have:

$$A_\epsilon^n \subset \Big\{(x^n,y^n) \in \mathcal{X}^n \times \mathcal{Y}^n : \Big|\frac{1}{n}\sum_{i=1}^n \log_2\frac{1}{p_{X,Y}(x_i,y_i)} - H(X,Y)\Big| \le \epsilon\Big\}$$

and we know that this set has size at most $2^{n(H(X,Y)+\epsilon)}$.
From the law of large numbers:

$$\mathbb{P}\Big(\Big|\frac{1}{n}\sum_{i=1}^n \log_2\frac{1}{p_X(X_i)} - H(X)\Big| \ge \frac{\epsilon}{3}\Big) \xrightarrow[n\to\infty]{} 0$$

$$\mathbb{P}\Big(\Big|\frac{1}{n}\sum_{i=1}^n \log_2\frac{1}{p_Y(Y_i)} - H(Y)\Big| \ge \frac{\epsilon}{3}\Big) \xrightarrow[n\to\infty]{} 0$$

$$\mathbb{P}\Big(\Big|\frac{1}{n}\sum_{i=1}^n \log_2\frac{1}{p_{X,Y}(X_i,Y_i)} - H(X,Y)\Big| \ge \frac{\epsilon}{3}\Big) \xrightarrow[n\to\infty]{} 0$$
If $(x^n, y^n) \in A_\epsilon^n$:

$$2^{-n(I(X;Y)+\epsilon)} \le \prod_{i=1}^n \frac{p_X(x_i)\, p_Y(y_i)}{p_{X,Y}(x_i,y_i)} \le 2^{-n(I(X;Y)-\epsilon)}$$
Therefore:

$$2^{-n(I(X;Y)+\epsilon)} \le \frac{\mathbb{P}((\tilde{X}^n, \tilde{Y}^n) \in A_\epsilon^n)}{\mathbb{P}((X^n, Y^n) \in A_\epsilon^n)} \le 2^{-n(I(X;Y)-\epsilon)}$$

and the result is proven since $\mathbb{P}((X^n, Y^n) \in A_\epsilon^n) \to 1$ as $n \to \infty$.
Joint typicality is similar to typicality, and we will expand on its implications
when considering communication over noisy channels.
Chapter 3

Data Representation: Fundamental Limits
In this chapter we start our exposition of how to represent data efficiently using
information theoretic tools. We introduce prefix codes and show that the entropy
of the source quantifies the length of the best prefix codes, and how such codes can
be constructed.
3.1.1 Definition
Definition 3.1.1. Consider X ∈ X and D the set of finite strings on {0, 1}. A
source code is a mapping C : X → D.
A source code takes as input a symbol X and maps it into a finite sequence of
bits.
One of the main measures of efficiency of a source code is its expected length,
which is the expected number of bits required to represent a symbol, if this symbol
were drawn according to the source distribution.
A critical point is that extension can create ambiguity, even if the code is non-singular. Indeed, if one only observes the concatenated codewords C(X₁), ..., C(Xₙ), it might be difficult to know where one codeword ends and where the next one begins. A simple example would be $\mathcal{X} = \{a, b, c\}$ and a code C(a) = 0, C(b) = 1 and C(c) = 01. We have C(a)C(b) = C(c), so it is impossible to differentiate between ab and c. This code is non-singular, but not uniquely decodable.

A uniquely decodable code is such that extension does not create ambiguity, which makes it possible to encode streams of symbols by encoding each symbol separately, without losing any information.
Proof: Consider the following decoding algorithm: let C(X₁), ..., C(Xₙ) be a sequence of bits u₁...uₘ, and let ℓ be the smallest integer such that u₁...uℓ = C(x) for some x. Then we must have x = X₁, since otherwise one of C(x), C(X₁) would be the prefix of the other codeword. This yields X₁, and repeating the procedure yields X₁, ..., Xₙ.
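A minimal sketch of this decoding procedure, assuming a hypothetical prefix code:

```python
def decode(stream, code):        # code: {symbol: codeword}
    """Repeatedly strip the unique codeword that prefixes the bit stream."""
    inv = {w: s for s, w in code.items()}
    out, buf = [], ""
    for bit in stream:
        buf += bit
        if buf in inv:           # for a prefix code this match is unambiguous
            out.append(inv[buf])
            buf = ""
    return out

code = {"a": "0", "b": "10", "c": "11"}
print(decode("010110", code))    # ['a', 'b', 'c', 'a']
```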
We have seen that prefix codes are uniquely decodable, and uniquely decodable codes are non-singular, but there exist uniquely decodable codes that are not prefix codes, and there exist non-singular codes that are not uniquely decodable.
Definition 3.2.3. Given a binary tree G = (V, E), we call the "label" of leaf v the binary sequence encoding the unique path from the root to v, where 0 stands for "down and left" and 1 for "down and right".
Property 10. Consider a binary tree; then the labels of its leaves form a prefix code. Conversely, for any prefix code, there exists a binary tree whose leaf labels are the codewords of that code.
Proof: Consider v and v′ two leaves of G such that the label of v is a prefix of the label of v′. This means that v′ is a descendant of v, so that v is not a leaf, a contradiction. So the leaves' labels form a prefix code.
Conversely, consider a prefix code, and the following procedure to build the associated binary tree. Start with G a complete binary tree. While the code is not empty, select one of its codewords C(x), find the node v whose label is C(x), remove all of the descendants of v from G, and remove C(x) from the code. Repeat the procedure until the code is empty.
Therefore, there is a one-to-one correspondence between binary trees and prefix codes: for every prefix code we can construct a binary tree representation of this code, and every binary tree represents a prefix code. This is fundamental in order to derive lower bounds on the code length and to design codes which attain these bounds.
Any prefix code with codeword lengths $(\ell(x))_{x\in\mathcal{X}}$ satisfies the Kraft inequality:

$$\sum_{x\in\mathcal{X}} 2^{-\ell(x)} \le 1.$$

Also, given any $(\ell(x))_{x\in\mathcal{X}}$ satisfying this inequality, one can construct a prefix code with codeword lengths $(\ell(x))_{x\in\mathcal{X}}$.

Proof: Let $l_m = \max_{x\in\mathcal{X}} \ell(x)$ be the largest codeword length, and let $Z(x) \subset \{0,1\}^{l_m}$ be the set of words that have C(x) as a prefix. Then $|Z(x)| = 2^{l_m - \ell(x)}$. Furthermore $Z(x) \cap Z(x') = \emptyset$ as C is a prefix code. Summing over x proves the first result:

$$2^{l_m} = |\{0,1\}^{l_m}| \ge \Big|\bigcup_{x\in\mathcal{X}} Z(x)\Big| = \sum_{x\in\mathcal{X}} |Z(x)| = \sum_{x\in\mathcal{X}} 2^{l_m - \ell(x)}.$$

Conversely, assume that the $\ell(x)$ are sorted in increasing order and satisfy the Kraft inequality. Consider the prefix code where C(x) is the first $\ell(x)$ digits of the binary representation of $\sum_{i<x} 2^{-\ell(i)}$. This proves the second result.
Kraft's inequality is a fundamental limit: it states a constraint on the codeword lengths that must be satisfied by any prefix code.
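A minimal sketch of the converse construction from the proof: given lengths satisfying Kraft's inequality, build a prefix code by truncating the binary expansions of the cumulative sums $\sum_{i<x} 2^{-\ell(i)}$ (the example lengths are hypothetical):

```python
def kraft_sum(lengths):
    return sum(2.0 ** -l for l in lengths)

def prefix_code(lengths):
    """Build a prefix code from lengths sorted in increasing order."""
    assert kraft_sum(lengths) <= 1.0, "Kraft inequality violated"
    lengths = sorted(lengths)
    code, acc = [], 0.0
    for l in lengths:
        # First l binary digits of the cumulative sum acc.
        word = "".join(str(int(acc * 2 ** (k + 1)) % 2) for k in range(l))
        code.append(word)
        acc += 2.0 ** -l
    return code

print(prefix_code([1, 2, 3, 3]))  # ['0', '10', '110', '111']
```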
Any prefix code C satisfies:

$$L(C) \ge H(X).$$

Proof: By Kraft's inequality, L(C) is lower bounded by the value of the relaxed problem (P₁) of minimizing $\sum_{x\in\mathcal{X}} p_X(x)\ell(x)$ over real-valued lengths satisfying $\sum_{x\in\mathcal{X}} 2^{-\ell(x)} \le 1$. The Lagrangian conditions for (P₁) give:

$$2^{-\ell(x)} = \frac{p_X(x)}{\lambda(\log 2)}, \quad x\in\mathcal{X}$$

and the constraint then yields:

$$2^{-\ell(x)} = p_X(x), \quad x \in \mathcal{X}.$$

The value of (P₁) at this solution lower bounds L(C), which concludes the proof:

$$\sum_{x\in\mathcal{X}} p_X(x)\ell(x) = \sum_{x\in\mathcal{X}} p_X(x)\log_2\frac{1}{p_X(x)} = H(X).$$
Recall that whenever $\ell(x), x \in \mathcal{X}$ satisfy the Kraft inequality, there exists a corresponding prefix code with lengths $\ell(x), x \in \mathcal{X}$. Choosing $\ell(x) = \lceil \log_2 \frac{1}{p_X(x)} \rceil$ satisfies the Kraft inequality, and the length of this code is:

$$L(C) = \sum_{x\in\mathcal{X}} p_X(x)\ell(x) = \sum_{x\in\mathcal{X}} p_X(x)\Big\lceil \log_2\frac{1}{p_X(x)} \Big\rceil \le \sum_{x\in\mathcal{X}} p_X(x)\Big(\log_2\frac{1}{p_X(x)} + 1\Big) = H(X) + 1.$$
$$\frac{L(C)}{n} \le H(X) + \frac{1}{n}.$$

Proof: From independence, H(X₁, ..., Xₙ) = nH(X), and selecting C as the optimal prefix code for (X₁, ..., Xₙ) gives the result.
Chapter 4

Data Representation: Algorithms

4.1 The Huffman Algorithm

4.1.2 Rationale

The Huffman algorithm is based on the idea that a good prefix code should satisfy three properties:

• (i) If p(x) ≥ p(y) then ℓ(y) ≥ ℓ(x)

• (ii) The two longest codewords have the same length

• (iii) The two longest codewords differ by only 1 bit and correspond to the two least likely symbols

In fact, these properties will serve to show the optimality of the Huffman algorithm.
4.1.3 Complexity

At each step of the algorithm, one must find the two nodes with the smallest weight. There are $|\mathcal{X}|$ steps, and finding the two nodes with smallest weight by sorting the list of nodes by weight at each step requires $O(|\mathcal{X}| \ln |\mathcal{X}|)$ time. Hence a naive implementation of the algorithm requires time $O(|\mathcal{X}|^2 \ln |\mathcal{X}|)$. A smarter implementation is to keep the list of nodes sorted across steps, so that finding the two nodes with smallest weight can be done in time $O(1)$, and then insert the new node into the sorted list using binary search in time $O(\ln |\mathcal{X}|)$. Hence the Huffman algorithm can be implemented in time $O(|\mathcal{X}| \ln |\mathcal{X}|)$, almost linear in the number of symbols.
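A minimal Huffman implementation using a binary heap, in the spirit of the $O(|\mathcal{X}| \ln |\mathcal{X}|)$ implementation discussed above (the example distribution is hypothetical):

```python
import heapq

def huffman(p):
    """Return {symbol: codeword} for a distribution {symbol: probability}."""
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(p.items())]
    heapq.heapify(heap)
    counter = len(heap)  # tie-breaker so dicts are never compared
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)  # two least likely nodes are merged
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]

p = {"A": 0.5, "B": 0.2, "C": 0.1, "D": 0.1, "E": 0.1}
print(huffman(p))  # e.g. A -> '0', B -> '10', ...
```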
4.1.4 Limitations

While optimal, for sources with billions of symbols the Huffman algorithm is too complex to implement, and there exist other techniques, such as arithmetic coding (used in JPEG). Also, the Huffman algorithm requires knowing the source distribution p(x) for $x \in \mathcal{X}$ at the encoder, which is a practical limitation; to solve this problem there exist universal codes, which operate without prior knowledge of p. We will show some simple strategies to design universal codes.
4.1.5 Illustration
[Figure: Huffman tree for the source below; merging the two least likely nodes at each step produces leaf depths 1, 2, 3, 4, 4.]

x     A    B    C    D    E
p(x)  1/2  1/5  1/10 1/10 1/10
C(x)  0    10   110  1110 1111
ℓ(x)  1    2    3    4    4
Above is the result of the Huffman algorithm applied to a given source. One can readily verify that the more probable the symbol, the shorter the codeword, and that the two least probable symbols D and E have been assigned to the two leaves with highest depth.
The length of the code is minimal amongst all prefix codes and equals:

$$\frac{1}{2}\times 1 + \frac{1}{5}\times 2 + \frac{1}{10}\times(3+4+4) = 2$$

to be compared with the entropy of the source:

$$\frac{1}{2}\log_2(2) + \frac{1}{5}\log_2(5) + \frac{3}{10}\log_2(10) \approx 1.96$$
4.1.6 Optimality
Proposition 4.1.2. The Huffman algorithm outputs a prefix code with minimal
expected length L(C) amongst all prefix codes.
Proof: Assume that the source symbols are sorted so that p(1) ≤ ... ≤ p(|𝒳|). Consider a code C with minimal length, and x, y two symbols such that x ≤ y and ℓ(x) < ℓ(y). Construct a new code C′ such that C′(x) = C(y), C′(y) = C(x) and C′(z) = C(z) for z ≠ x, y. Then L(C′) − L(C) = (p(x) − p(y))(ℓ(y) − ℓ(x)) ≤ 0, so exchanging the two codewords can only improve the code. This shows that for any x, y such that x ≤ y we may assume ℓ(x) ≥ ℓ(y). Furthermore, since the two least probable symbols have maximal depth, we can always assume that they are siblings (otherwise simply perform an exchange between symbol 2 and the sibling of symbol 1).
Consider C a prefix code with minimal length, and H the prefix code output by the Huffman algorithm. Further define C′ and H′ the codes obtained by considering C and H and replacing nodes 1 and 2 by their parent, with weight p(1) + p(2). Then we have:

$$L(C') = L(C) - (p(1) + p(2))$$

and

$$L(H') = L(H) - (p(1) + p(2)).$$

We also observe that H′ is exactly the output of the Huffman algorithm applied to a source with |𝒳| − 1 symbols.

We can then prove the result by induction. Clearly, for |𝒳| = 1 symbol the Huffman algorithm is optimal. Furthermore, if the Huffman algorithm is optimal for |𝒳| − 1 symbols, then L(C′) = L(H′), so that L(C) = L(H), and hence the Huffman algorithm is optimal for |𝒳| symbols.
[Figure: a two-state (ON/OFF) Markov source.]

$$\Big(1 - \frac{1}{n}\Big)R(\pi,P) + \frac{H(X_1)}{n} \le \frac{L(C)}{n} \le \Big(1 - \frac{1}{n}\Big)R(\pi,P) + \frac{H(X_1) + 1}{n}.$$
Furthermore:

$$H(X_i|X_{i-1}) = \sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{X}} \mathbb{P}(X_{i-1}=x, X_i=y)\log_2\frac{1}{\mathbb{P}(X_i=y|X_{i-1}=x)} = \sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{X}} \pi(x)P(y|x)\log_2\frac{1}{P(y|x)} = R(\pi,P).$$
Therefore:

$$H(X_1, \ldots, X_n) = (n-1)R(\pi,P) + H(X_1).$$

The lower bound holds as before, and applying Huffman coding to (X₁, ..., Xₙ) yields a code with:

$$(n-1)R(\pi,P) + H(X_1) \le L(C) \le (n-1)R(\pi,P) + H(X_1) + 1.$$
We have therefore established that the rate of optimal codes for Markov sources is exactly R(π,P) bits per symbol. Furthermore, optimal codes can be found using the same algorithms as in the memoryless case: one would first determine the transition probabilities of the Markov source at hand, which gives the probability of any sequence (X₁, ..., Xₙ), and finally apply the Huffman algorithm. One can apply this (for instance) in order to encode English text optimally, since English can be seen as a Markov source.

Now, one caveat of our approach is that it requires knowing the probability distribution of any sequence that can be generated by the source. In the case of memoryless sources this means knowing the distribution of a symbol, and in the case of Markov sources this means knowing both the stationary distribution and the transition probabilities. This can often be a limitation in practice, and to solve this problem we study the concept of universal codes.
The idea of a universal code is that the code should have no prior knowledge of
the data distribution, and that the code should work well irrespective of the data
distribution. This is important in practical scenarios in which nothing is known
about the data distribution. In fact, when the data distribution is known, we know
that the smallest attainable rate is the entropy H(X), and if a code is universal,
then it attains this rate asymptotically for all distributions.
Proof: For a given value of k, let $A_k$ denote the set of binary sequences of length n with exactly k ones. Since $A_k$ has $\binom{n}{k}$ elements, the length of the corresponding codeword is

$$\ell(C(x_1, \ldots, x_n)) = \log_2(n) + \log_2\binom{n}{k}$$

so that, using Stirling's approximation,

$$\frac{1}{n}\ell(C(x_1, \ldots, x_n)) = h_2(k/n) + o(1)$$
Consider X ∼ Bernoulli(a). Then:

$$\frac{k}{n} = \frac{1}{n}\sum_{i=1}^n X_i \xrightarrow[n\to\infty]{} a \text{ almost surely}$$

so that

$$\frac{1}{n}\mathbb{E}[\ell(C(X_1, \ldots, X_n))] \xrightarrow[n\to\infty]{} h_2(a) = H(X).$$
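A minimal sketch of this two-part code length, assuming the codeword consists of roughly $\log_2 n$ bits describing k followed by $\log_2\binom{n}{k}$ bits for the index within $A_k$:

```python
import math, random

def h2(a):
    """Binary entropy in bits."""
    return 0.0 if a in (0.0, 1.0) else a * math.log2(1/a) + (1-a) * math.log2(1/(1-a))

def code_length(x):
    n, k = len(x), sum(x)
    # ~log2(n+1) bits to describe k, then ~log2 C(n,k) bits for the index in A_k.
    return math.ceil(math.log2(n + 1)) + math.ceil(math.log2(math.comb(n, k)))

random.seed(1)
a = 0.1
for n in [100, 10_000]:
    x = [1 if random.random() < a else 0 for _ in range(n)]
    print(n, code_length(x) / n, "vs h2(a) =", h2(a))
```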
The most famous universal codes are the Lempel-Ziv algorithms; we present here the algorithm that uses a sliding window. There exist other versions, such as the one based on trees. The algorithm encodes the sequence by first parsing it into a set of words, and then encoding each word based on the previous words. The central idea of this coding scheme is that if a word (x₁, ..., x_k) of size k has a relatively high probability, then it is likely to appear in a window of size W, if W is large enough. In turn this word can be represented with 1 + log₂ W + log₂ k bits instead of k bits. In short, words that are frequent tend to appear repeatedly, and therefore can be encoded by providing a pointer to one of their past occurrences, which can sometimes drastically reduce the number of bits required.
0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0
The above example illustrates how the algorithm operates on a binary sequence. The sliding window enables us to encode long runs of consecutive 0's with relatively few bits. Indeed, we manage to encode a run of 17 consecutive 0's by the word (1, 1, 17), which can be represented using roughly 1 + log₂(4) + log₂(17) ≈ 7 bits: a net gain of 17 − 7 = 10 bits.
Lempel-Ziv coding has the advantages of being very easy to implement, requiring no knowledge about the data distribution, and being universal. We do not present the proof of universality here, due to its complexity.
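A toy sliding-window parse in the spirit of Lempel-Ziv, assuming each phrase is either a literal symbol or an (offset, length) pointer into the recent past; this is a simplified sketch, not the exact algorithm of the text:

```python
def lz_parse(seq, W=8):
    """Parse seq into literals and (offset, length) pointers into the last W symbols."""
    out, i = [], 0
    while i < len(seq):
        best_off, best_len = 0, 0
        for off in range(1, min(i, W) + 1):
            l = 0
            # Allow overlapping matches (run-length style) via l % off.
            while i + l < len(seq) and seq[i + l] == seq[i - off + (l % off)]:
                l += 1
            if l > best_len:
                best_off, best_len = off, l
        if best_len >= 2:
            out.append(("ptr", best_off, best_len))
            i += best_len
        else:
            out.append(("lit", seq[i]))
            i += 1
    return out

print(lz_parse([0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]))
```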
Chapter 5
Data Representation: Rate-Distortion Theory
[Figure: a continuous signal and its quantized version as functions of time.]
In fact, in some cases, even if data is already discrete, one may want to represent it using fewer bits, even at the expense of losing some information. For instance, we might be interested in reducing the size (in bits) of an image or a sound file as long as, after compression, one can reconstruct them and the reconstructed image or sound looks or sounds similar to a human. This means that most of the information has been preserved. We call this process lossy compression. Since quantization and lossy compression can be understood in the same framework, we will use both terms interchangeably.
• The encoder encodes the data as $f_n(X^n) \in \{1, ..., 2^{nR}\}$, using nR bits.

The mappings $f_n$ and $g_n$ define the strategy for encoding and decoding the data, and given a rate R the goal is to select these mappings in order to minimize the distortion, defined as:

$$D = \frac{1}{n}\sum_{i=1}^n \mathbb{E}(d(X_i, \hat{X}_i)).$$
and

$$D = \mathbb{E}[d(g(f(X)), X)] = \sum_{i=1}^{2^R} \mathbb{E}(d(g(i),X)|f(X)=i)\,\mathbb{P}(f(X)=i) \ge \sum_{i=1}^{2^R} \min_{x'\in\mathcal{X}} \mathbb{E}(d(x',X)|f(X)=i)\,\mathbb{P}(f(X)=i).$$
Therefore, if (i) or (ii) is not satisfied, we can decrease the distortion by modifying f or g.

The insights gained from the Lloyd-Max conditions are twofold. First, to design the quantizer, a point should be mapped to the closest reconstruction point. Second, when designing the decoder, one should select the reconstruction points to minimize the conditional expected distortion. In fact this shows that if the quantizer f is known, then finding g is easy, and vice-versa, and suggests an iterative algorithm: start with (f, g) arbitrary and alternately minimize over f and g until convergence. This algorithm may not always converge to the optimal solution and should be seen as a heuristic.
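A minimal sketch of this alternating heuristic on a data sample (empirical Lloyd-Max for squared-error distortion; the initialization and sample are illustrative choices):

```python
import random

def lloyd_max(data, levels, iters=100):
    g = sorted(random.sample(data, levels))  # initial reconstruction points
    for _ in range(iters):
        cells = [[] for _ in g]
        for x in data:                       # encoder: map to nearest level
            i = min(range(len(g)), key=lambda j: (x - g[j]) ** 2)
            cells[i].append(x)
        # Decoder: move each level to the mean of its cell.
        g = [sum(c) / len(c) if c else gi for c, gi in zip(cells, g)]
    return g

random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(10_000)]
print(lloyd_max(data, 2))  # close to ±sqrt(2/pi) ≈ ±0.798 for sigma = 1
```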
[Figure: density p(x) of the uniform distribution on [0, 1].]
One can readily check by recursion that this implies $g(i) = i/2^R$ for $i = 1, ..., 2^R$. The distortion is hence $D = \frac{1}{12} 2^{-2R}$, which concludes the proof.

When data is uniformly distributed over an interval, the optimal quantization scheme is uniform quantization, which simply partitions the interval into $2^R$ intervals of equal size; the distortion is $\frac{1}{12} 2^{-2R}$, so that when the rate is increased by 1 bit, the distortion is divided by 4 (or decreased by 6 dB). It is also noted that uniform quantization is equivalent to rounding the data to the nearest integer multiple of $2^{-R}$, so it is very easy to implement.
[Figure: density p(x) of a Gaussian distribution.]
Proof: Let us assume without loss of generality that g(1) < g(2). From Lloyd-Max, the quantization scheme should map each point to the closest reconstruction point. Since X has the same distribution as −X, one must have g(2) = −g(1), hence f(X) = 1 if X < 0 and f(X) = 2 otherwise. Furthermore:

$$g(2) = \arg\min_{x'} \mathbb{E}(d(x',X)|f(X)=2) = \mathbb{E}(X|f(X)=2) = \mathbb{E}(X|X\in[0,+\infty)) = \sqrt{\frac{2\sigma^2}{\pi}}.$$

One may readily check that $D = \frac{\pi-2}{\pi}\sigma^2$, which concludes the proof.
If only R = 1 bit per symbol is available, the most efficient quantizer consists in simply encoding the sign of the data, so that the information in the absolute value is lost. It is also noted that the optimal reconstruction points $\pm\sqrt{2\sigma^2/\pi}$ equal plus or minus the expected absolute value of X.
and

$$g^s(i^n) = (i_1 2^{-R}, ..., i_n 2^{-R}).$$

Then one may readily check that the reconstruction error $g^s(f^s(X^n)) - X^n$ has i.i.d. uniformly distributed entries with variance $\frac{1}{12}2^{-2R}$, and therefore the achieved distortion is $D = \frac{1}{12}2^{-2R}$.
On the other hand, consider another quantization strategy where the quantization points $g(1), ..., g(2^{nR})$ are selected uniformly at random in $[0,1]^n$. One may readily check that, from independence of $g(1), ..., g(2^{nR})$:

$$\mathbb{P}\Big(\min_{i=1,...,2^{nR}} d(X, g(i)) \ge r_n^2\Big) = \mathbb{P}(d(X,g(1)) \ge r_n^2)^{2^{nR}}$$

with $r_n^2 = \frac{n}{12}2^{-2R}$. Furthermore

$$\mathbb{P}(d(X,g(1)) \le r_n^2) \approx \frac{(\pi r_n^2)^{n/2}}{\Gamma(n/2+1)}$$

since the probability that $d(X,g(1)) \le r_n^2$ can be approximated by the Lebesgue measure of a ball of radius $r_n$ centered at X.

We may then use Stirling's approximation to show that

$$\mathbb{P}(d(X,g(1)) \ge r_n^2)^{2^{nR}} \xrightarrow[n\to\infty]{} 0.$$

Therefore, this quantization strategy achieves distortion lower than $\frac{1}{12}2^{-2R}$ with high probability, and is superior to scalar quantization.
[Figure: scalar quantization points (left) versus vector quantization points (right) in the unit square.]
Definition 5.3.2. The rate distortion function R(D) for a given D is the infimum
over R such that (R, D) is achievable.
Given a rate R and a distortion D, we say that (R, D) is achievable if, asymptotically as n grows large, there exists a sequence of quantizers whose distortion is at most D. We insist on the fact that for each value of n an appropriate quantizer must be found, and what matters is the limiting behaviour of this sequence. This means that the notion of achievability is asymptotic, and there may not exist quantizers with rate R and distortion D for small values of n. In a sense, achievability quantifies the smallest distortion for n = +∞. Clearly, the larger the allowed distortion, the smaller the rate can be with an efficient quantizer, and a natural question is: what is the optimal trade-off between distortion and rate? The answer to this question is called the rate distortion function. Computing this function may be difficult in general, and we will show how this may be done by minimizing the mutual information.
The information rate distortion function is:

$$R^{(I)}(D) = \min_{p_{\hat{X}|X} : \mathbb{E}(d(X,\hat{X})) \le D} I(X;\hat{X})$$

minimizing over all possible conditional distributions $p_{\hat{X}|X}$ that satisfy the constraint $\mathbb{E}(d(X,\hat{X})) \le D$.
Theorem 5.4.2. The information rate distortion function equals the rate distortion function.
with $D_i = \mathbb{E}(d(X_i,\hat{X}_i))$ the distortion for the i-th symbol. We have $D = \frac{1}{n}\sum_{i=1}^n D_i$, and since the mutual information is convex, so is the rate distortion function, which in turn implies:

$$nR(D) \le \sum_{i=1}^n R(D_i).$$

We have proven that R(D) ≤ R, so that R(D) is indeed a lower bound on the rate that can be achieved at distortion level D.
Algorithm 5.4.4 (Random Coding for Rate Distortion). Consider the following randomized scheme to construct a rate-distortion codebook:

• (Codebook generation) Let $p_{\hat{X}|X}$ be a distribution such that $R(D) = I(X;\hat{X})$ and $\mathbb{E}(d(X,\hat{X})) \le D$. Draw $\mathcal{C} = \{\hat{X}^n(i), i = 1, ..., 2^{nR}\}$ where each $\hat{X}^n(i)$ is an i.i.d. sample of size n from $p_{\hat{X}}$.
It is noted that this is a randomized strategy, so that both the encoder fn and
the decoder gn are in fact random. While it may seem counter-intuitive to select a
random codebook, this in fact eases the analysis very much, because it allows us to
average over the codebook itself. Furthermore, when performing this averaging, as
long as we are able to prove that the codebook has good performance in expectation,
it automatically implies that there exists a codebook with good performance. This
strategy is common in information theory as well as other fields (for instance
random graphs), and is known as the "probabilistic method". The disadvantage of
random coding with respect to, for instance, Huffman coding, is that it is much
more complex to implement.
Proposition 5.4.5. There exists a sequence of codebooks achieving any rate distortion pair (R, D) with R > R(D).

The main idea centers around typicality, in this case rate-distortion typicality.

The point of random coding is that if the codewords are drawn in an i.i.d. fashion, then the pairs $(X^n, \hat{X}^n)$ will be distortion typical, so that $d(X^n, \hat{X}^n)$ will be arbitrarily close to D with high probability.
[Figure: the rate distortion function of a Gaussian source, decreasing and convex in the distortion D.]
Proposition 5.5.1. Consider X ∼ N(0, σ²) with d(x, x′) = (x − x′)². The rate distortion function is given by:

$$R(D) = \max\Big(\frac{1}{2}\log_2\frac{\sigma^2}{D},\, 0\Big).$$
Proof: We must minimize $I(X;\hat{X})$ where X ∼ N(0, σ²) and $(X,\hat{X})$ verifies $\mathbb{E}((X-\hat{X})^2) \le D$. By definition of the mutual information, $I(X;\hat{X}) = h(X) - h(X|\hat{X})$, with

$$h(X) = \frac{1}{2}\log_2 2\pi e\sigma^2.$$

Furthermore, since conditioning reduces entropy, $h(X|\hat{X}) = h(X-\hat{X}|\hat{X}) \le h(X-\hat{X})$. Now, since the Gaussian distribution maximizes entropy for a given variance:

$$h(X - \hat{X}) \le \frac{1}{2}\log_2 2\pi e\,\mathrm{var}(X - \hat{X}) \le \frac{1}{2}\log_2 2\pi e D$$

so that

$$I(X;\hat{X}) \ge \frac{1}{2}\log_2\frac{\sigma^2}{D}.$$

This bound is attained by choosing $(X, \hat{X})$ jointly Gaussian with $\mathbb{E}((X-\hat{X})^2) = D$, in which case

$$I(X;\hat{X}) = \frac{1}{2}\log_2\frac{\sigma^2}{D}$$

which proves the result.
The rate distortion function for Gaussian variables is indeed convex and decreasing, and in particular this function is 0 for any D > σ², due to the fact that, even with no information, one can achieve a distortion of σ² by representing X by a fixed value equal to E(X). Furthermore, for D < σ², when R is increased by 1, D is divided by 4, so each added bit of quantization decreases the quantization error by 6 dB. Finally, as predicted previously, vector quantization is better than scalar quantization. For instance, consider R = 1: using vector quantization on (X₁, ..., Xₙ) with a rate of R = 1 yields a distortion of D = σ²/4, while using scalar quantization on each entry of (X₁, ..., Xₙ) with a rate of R = 1 yields a distortion of D = ((π−2)/π)σ². Hence in that example vector quantization is 45% more efficient than scalar quantization.
Therefore:

$$I(X^k; \hat{X}^k) \ge \sum_{i=1}^k I(X_i;\hat{X}_i).$$

From Lagrangian relaxation, the solution of this optimization problem must be such that there exists λ⋆ > 0 such that, for each i, either $D_i = \sigma_i^2$ or $D_i = \lambda^\star$. Selecting λ⋆ to ensure that $\sum_{i=1}^k D_i = D$ yields the result.
For Gaussian vectors with independent entries, the rate distortion function can be computed as well, and the solution is given by an allocation called "reverse water-filling", which attempts to equalize the distortion across components. Bits are allocated mostly to components with high variance, and components with low variance are simply ignored. This makes sense since, for an equal number of bits, the larger the variance, the larger the distortion. This can be generalized to Gaussian vectors with non-diagonal covariance matrices by performing reverse water-filling on the eigenvectors/eigenvalues of the covariance matrix.
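A minimal sketch of reverse water-filling by bisection on the water level λ⋆, assuming squared-error distortion and independent components with the given (hypothetical) variances:

```python
import math

def reverse_waterfill(variances, D, tol=1e-12):
    """Find lambda with sum_i min(lambda, sigma_i^2) = D, then compute the rate."""
    lo, hi = 0.0, max(variances)
    while hi - lo > tol:                       # bisection on the water level
        lam = (lo + hi) / 2
        total = sum(min(lam, v) for v in variances)
        lo, hi = (lam, hi) if total < D else (lo, lam)
    Ds = [min(lo, v) for v in variances]       # D_i = min(lambda, sigma_i^2)
    R = sum(0.5 * math.log2(v / d) for v, d in zip(variances, Ds) if d < v)
    return Ds, R

print(reverse_waterfill([4.0, 1.0, 0.25], D=1.0))
```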
Chapter 6

Communication: Discrete Channels
We now move away from data representation, and focus on communication over
noisy channels. For this problem, we are concerned with the maximal rate at which
information can be reliably sent over the channel, in the sense that the receiver
should be able to retrieve the sent information with high probability. As we shall
see, information theoretic tools provide a complete characterization of the problem
in terms of achievable rates as well as coding strategies.
and so on. We will focus mostly on memoryless channels, which already constitute a rather rich model. Of course, there exist more general models, such as Markovian channels and the most general model of ergodic channels. It is noted that, if a channel is memoryless and Xⁿ = (X₁, ..., Xₙ) is i.i.d., then Yⁿ = (Y₁, ..., Yₙ) is also i.i.d.
6.1.3 Examples
We now propose to compute the information channel capacity for a few simple
channel models.
[Figure: the noiseless binary channel: each input is received unchanged.]

To maximize I(X;Y) one must maximize H(X), so the maximizing input distribution is uniform over {0, 1} and the capacity is

$$C = \log_2 2 = 1.$$
[Figure: a noisy channel whose sets of possible outputs for distinct inputs do not overlap.]

To maximize I(X;Y) one must maximize H(X), so the maximizing input distribution is uniform over $\mathcal{X}$ and the capacity is

$$C = \log_2|\mathcal{X}|.$$
[Figure: the binary symmetric channel with crossover probability a.]

To maximize I(X;Y) one must maximize H(X), so the maximizing input distribution is uniform over {0, 1} and the capacity is

$$C = \log_2 2 - h_2(a) = 1 - h_2(a).$$
[Figure: the erasure channel, where each input symbol is either received correctly or erased (output ×) with probability α.]

Here

$$I(X;Y) = (1-\alpha)H(X)$$

so to maximize I(X;Y) one must maximize H(X); the maximizing input distribution is uniform over {0, 1} and the capacity is

$$C = (1-\alpha)\log_2 2 = 1 - \alpha.$$
$$R = \frac{1}{n}\log_2 M$$

and error probability:

$$P_e^n = \mathbb{P}(\hat{W} \ne W).$$
[Figure: the binary symmetric channel with crossover probability a.]
For the binary symmetric channel, a code is given by a subset C of {0, 1}ⁿ of size $2^{nR}$, along with a decoding rule. The distribution of the channel output yⁿ conditional on transmitting some codeword xⁿ is given by

$$p_{Y^n|X^n}(y^n|x^n) = \prod_{i=1}^n (1-a)^{1\{x_i=y_i\}}\, a^{1\{x_i\ne y_i\}} = (1-a)^n \Big(\frac{a}{1-a}\Big)^{d(x^n,y^n)}$$

where $d(x^n, y^n)$ is the Hamming distance between xⁿ and yⁿ. We notice that, when a < 1/2, maximizing this likelihood is equivalent to minimizing the Hamming distance $d(x^n, y^n)$ between the output and the codeword. Also note that, if C is very large, this might be very hard to do computationally.
A well-known code for the BSC is the so-called Hamming code.

[Figure: Venn diagram of a Hamming code with data bits x₁, x₂, x₃ and parity bits x₄ = x₁ ⊕ x₃, x₅ = x₂ ⊕ x₃, x₆ = x₁ ⊕ x₂.]
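As an illustration, here is a sketch of the classic Hamming(7,4) construction with syndrome decoding; note this is the standard (7,4) code, whereas the figure above shows a shorter variant:

```python
def encode(d):
    """Hamming(7,4): 4 data bits, 3 parity bits at positions 1, 2 and 4."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p3, d2, d3, d4]   # positions 1..7

def decode(r):
    s1 = r[0] ^ r[2] ^ r[4] ^ r[6]        # checks positions 1, 3, 5, 7
    s2 = r[1] ^ r[2] ^ r[5] ^ r[6]        # checks positions 2, 3, 6, 7
    s3 = r[3] ^ r[4] ^ r[5] ^ r[6]        # checks positions 4, 5, 6, 7
    err = s1 + 2 * s2 + 4 * s3            # syndrome = position of flipped bit
    if err:
        r[err - 1] ^= 1                   # correct a single bit flip
    return [r[2], r[4], r[5], r[6]]

word = encode([1, 0, 1, 1])
word[3] ^= 1                               # flip one bit
print(decode(word))                        # recovers [1, 0, 1, 1]
```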
$$\lambda^n = \max_{i=1,...,M} \lambda_i.$$
We now show that any rate above the information capacity is not achievable. The
main idea is to apply Fano’s inequality to show that if there are too many codewords,
then the transmitted codeword cannot be estimated with arbitrarily high accuracy.
Proposition 6.3.1. Consider a memoryless channel. Then any rate R > C is not
achievable.
Proof: We recall that for any X, Y we have H(X|Y) ≤ H(X), so that, using the chain rule, for any X₁, ..., Xₙ:

$$H(X_1, ..., X_n) = \sum_{i=1}^n H(X_i|X_{i-1},...,X_1) \le \sum_{i=1}^n H(X_i).$$
We now upper bound the maximal mutual information with n channel uses. By definition of the capacity, $I(X^n; Y^n) \le nC$, and we have the Markov chain:

$$W \to X^n(W) \to Y^n \to \hat{W}.$$

Since the message $W \in \{1, ..., 2^{nR}\}$ is chosen uniformly at random, we have H(W) = nR and:

$$H(W|\hat{W}) = H(W) - I(W;\hat{W}) \ge n(R - C).$$

We may now apply Fano's inequality:
Random Coding
Algorithm 6.3.3 (Random Channel Coding). Consider the following randomized
algorithm in order to generate a codebook and transmit data.
Error Probability
We compute the error probability averaged over C. Define E the event that decoding
fails and average over C:
2 nR
X 1 XX
P(E) = P(C = c)Pen (c) = P(C = c)λi (c).
c
2nR i=1 c
66 CHAPTER 6. COMMUNICATION: DISCRETE CHANNELS
P
By symmetry c P(C = c)λi (c) does not depend on i, so:
2 nR
1 XX
P(E) = P(C = c)λ1 (c) = P(E|W = 1).
2nR c i=1
$$\mathbb{P}(E|W=1) \le \mathbb{P}(E_1^c|W=1) + \sum_{i=2}^{2^{nR}} \mathbb{P}(E_i|W=1)$$
there are at least $2^{nR-1}$ indices i such that $\lambda_i(c^\star) \le 4\epsilon$, obtained by keeping the best half of the codewords. So we have proven that there exists a sequence of $(n, 2^{nR})$ codes with vanishing error probability, which concludes the proof.
Definition 6.4.1. A channel is weakly symmetric if (i) for any x, x′, the vectors $p_{Y|X}(\cdot|x)$ and $p_{Y|X}(\cdot|x')$ are equal up to a permutation, and (ii) for any y, y′ we have $\sum_{x\in\mathcal{X}} p_{Y|X}(y|x) = \sum_{x\in\mathcal{X}} p_{Y|X}(y'|x)$.
Proposition 6.4.2. Assume that (i) for any x, x′, the vectors $p_{Y|X}(\cdot|x)$, $p_{Y|X}(\cdot|x')$ are equal up to a permutation, and (ii) for any y, y′ we have $\sum_{x\in\mathcal{X}} p_{Y|X}(y|x) = \sum_{x\in\mathcal{X}} p_{Y|X}(y'|x)$. Then:

$$C = \log_2|\mathcal{Y}| - \sum_{y\in\mathcal{Y}} p_{Y|X}(y|x)\log_2\frac{1}{p_{Y|X}(y|x)},$$

where the value of the sum does not depend on x by (i).
Define $f(x) = x\log_2\frac{1}{x}$, which is concave, and write:

$$H(Y) = \sum_{y\in\mathcal{Y}} p_Y(y)\log_2\frac{1}{p_Y(y)} = \sum_{y\in\mathcal{Y}} f\Big(\sum_{x\in\mathcal{X}} p_X(x)\, p_{Y|X}(y|x)\Big)$$

$$H(Y|X) = \sum_{(x,y)\in\mathcal{X}\times\mathcal{Y}} p_{X,Y}(x,y)\log_2\frac{1}{p_{Y|X}(y|x)} = \sum_{x\in\mathcal{X}} p_X(x)\sum_{y\in\mathcal{Y}} p_{Y|X}(y|x)\log_2\frac{1}{p_{Y|X}(y|x)}.$$
Chapter 7

Communication: Continuous Channels

In this chapter, we turn our attention to continuous channels, where both the input and the output are real valued. Such channels are ubiquitous, due to the continuous nature of the physical world. To solve this problem we need to generalize the notions of entropy, relative entropy and mutual information to continuous random variables. We compute the capacity and the optimal input distribution of Gaussian channels, which are found in many applications such as wireless communication.
7.1.2 Examples

Uniform Distribution

If X ∼ Uniform($\mathcal{X}$), with $\mathrm{Vol}(\mathcal{X})$ the Lebesgue measure of $\mathcal{X}$:

$$h(X) = \log_2 \mathrm{Vol}(\mathcal{X}).$$

Exponential Distribution

If X ∼ Exponential(λ):

$$h(X) = \mathbb{E}\Big[\log_2\frac{e^{\lambda X}}{\lambda}\Big] = \frac{\lambda\,\mathbb{E}(X)}{\log 2} + \log_2\frac{1}{\lambda} = \log_2\frac{e}{\lambda}$$

The fact that differential entropy decreases with λ is intuitive, since the smaller λ, the less X is concentrated around 0.
Gaussian Distribution

If X ∼ N(µ, σ²):

$$h(X) = \mathbb{E}\Big[\log_2\Big(\sqrt{2\pi\sigma^2}\; e^{\frac{(X-\mu)^2}{2\sigma^2}}\Big)\Big] = \frac{1}{2}\log_2(2\pi\sigma^2) + \frac{\mathbb{E}(X-\mu)^2}{2\log(2)\,\sigma^2} = \frac{1}{2}\log_2(2\pi e\sigma^2)$$
This expression will occur in various places, in particular when computing the capacity of Gaussian channels. Two remarks can be made: first, the differential entropy does not depend on µ, which illustrates the fact that differential entropy is invariant by translation; second, it is increasing in σ², which is intuitive since the larger σ², the less X is concentrated around its mean µ.
Relative Entropy

Definition 7.1.3. Consider two p.d.f.s p(x) and q(x). The relative entropy is:

$$D(p||q) = \int_{\mathcal{X}} p(x)\log_2\frac{p(x)}{q(x)}\,dx$$
Proposition 7.1.4. We have D(p||q) ≥ 0 for any p, q.
Proof: Jensen’s inequality.
which proves the first result. If A is not invertible, then the support of the distribution of a + AX has Lebesgue measure 0, so that h(a + AX) = −∞.

Therefore, an affine transformation incurs an additive change to the entropy, and this change is the logarithm of the determinant of A. If A = I, or more generally if A is a rotation, then $\log_2 |\det A| = 0$, so that differential entropy is invariant by both translation and rotation.
Y =X +Z
Therefore:

$$h(Y) \le \frac{1}{2}\log_2(2\pi e(N + P)),$$
Definition 7.5.3. The AWGN (Additive White Gaussian Noise) channel is given by:

$$Y(t) = x(t) + Z(t)$$

where x(t) is bandlimited to [−W, W] with total power P and Z(t) is white Gaussian noise with power spectral density N₀.
In communication systems, increasing the bandwidth yields much larger gains than increasing the power, especially if the SNR of the typical user is already high. Also, the formula for the capacity of the AWGN channel allows one to predict the performance of many practical communication systems past and present; while the capacity is an upper bound on the best performance that can be achieved in ideal conditions (infinite processing power for coding and decoding, for instance), the formula allows one to roughly predict the typical performance, provided one knows the typical SNR as well as the bandwidth. Here are three illustrative examples. Telephone lines: W = 3.3 kHz, P/(W N₀) = 33 dB, C = 36 kbit/s. WiFi: W = 40 MHz, P/(W N₀) = 30 dB, C = 400 Mbit/s. 4G networks: W = 20 MHz, P/(W N₀) = 20 dB, C = 133 Mbit/s.
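A minimal sketch reproducing these figures from the formula $C = W \log_2(1 + P/(W N_0))$:

```python
import math

def awgn_capacity(W_hz, snr_db):
    """Shannon capacity of the AWGN channel, SNR given in dB."""
    return W_hz * math.log2(1 + 10 ** (snr_db / 10))

print(awgn_capacity(3.3e3, 33) / 1e3, "kbit/s")   # telephone line, ~36
print(awgn_capacity(40e6, 30) / 1e6, "Mbit/s")    # WiFi, ~400
print(awgn_capacity(20e6, 20) / 1e6, "Mbit/s")    # 4G, ~133
```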
Definition 7.5.5. A set of parallel Gaussian channels with total power P is:

$$Y_j = X_j + Z_j, \quad j = 1, \ldots, k$$

with $Z_j \sim N(0, N_j)$ independent and input power constraint $\sum_{j=1}^k \mathbb{E}(X_j^2) \le P$. The capacity is:

$$C = \sum_{j=1}^k \frac{1}{2}\log_2\Big(1 + \frac{(\lambda^\star - N_j)^+}{N_j}\Big)$$

with λ⋆ the unique solution to $\sum_{j=1}^k (\lambda^\star - N_j)^+ = P$.
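A minimal sketch of water-filling by bisection on the water level (the noise levels are hypothetical):

```python
import math

def waterfill(noises, P, tol=1e-12):
    """Find lambda with sum_j (lambda - N_j)+ = P, then compute the capacity."""
    lo, hi = min(noises), max(noises) + P
    while hi - lo > tol:
        lam = (lo + hi) / 2
        power = sum(max(lam - N, 0.0) for N in noises)
        lo, hi = (lam, hi) if power < P else (lo, lam)
    powers = [max(lo - N, 0.0) for N in noises]   # P_j = (lambda - N_j)+
    C = sum(0.5 * math.log2(1 + p / N) for p, N in zip(powers, noises))
    return powers, C

print(waterfill([1.0, 2.0, 4.0], P=3.0))
```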
$$\Sigma_Z = U^\top \mathrm{diag}(\lambda_1, ..., \lambda_k)\, U$$

Multiplying by U:

$$U Y^k = U X^k + U Z^k.$$

This defines a new channel:

$$\bar{Y}^k = \bar{X}^k + \bar{Z}^k.$$

We have:

$$(\bar{X}^k)^\top \bar{X}^k = (X^k)^\top U^\top U X^k = (X^k)^\top X^k$$
Chapter 8

Portfolio Theory
where $b_{n,i}$ denotes the fraction of his wealth invested in asset i. At the end of day n, he observes the closing prices $(P'_{n,1}, ..., P'_{n,m})$ and realizes his profits and losses, so that the amount of wealth available at the start of day n+1 equals:

$$\frac{S_{n+1}}{S_n} = \sum_{j=1}^m b_{n,j}\frac{P'_{n,j}}{P_{n,j}}$$

so that:

$$\frac{S_n}{S_0} = \prod_{i=1}^{n-1}\Big(\sum_{j=1}^m b_{i,j}\frac{P'_{i,j}}{P_{i,j}}\Big).$$
Indeed, the relative returns of each asset are sufficient in order to predict the
evolution of the wealth. Throughout the chapter we will assume that the vectors of
relative returns Xn = (Xn,1 , ..., Xn,m ) are i.i.d. with some fixed distribution F .
Proof: If the investment strategy is constant, then $\frac{1}{n}\log_2\frac{S_n}{S_0}$ is an empirical average of i.i.d. random variables:

$$\frac{1}{n}\log_2\frac{S_n}{S_0} = \frac{1}{n}\sum_{i=1}^{n-1}\log_2\Big(\sum_{j=1}^m b_j X_{i,j}\Big)$$

each with expectation W(b, F), so the strong law of large numbers yields the result.
The above proposition shows that, if the investor chooses a fixed investment strategy across time, then with high probability wealth will grow exponentially as a function of time:

$$S_n \approx S_0\, 2^{nW(b,F)}$$

and the exponent equals the growth rate of the portfolio W(b, F). Perhaps surprisingly, if the growth rate is strictly positive, then with high probability the wealth asymptotically grows to infinity.
$$\text{maximize } W(b,F) \text{ subject to } \sum_{i=1}^m b_i \le 1 \text{ and } b \ge 0$$
The previous results suggest that, if the investor knows the distribution of the returns F, then he should select the portfolio maximizing the growth rate, to ensure that his wealth grows as rapidly as possible. While this is not the only possible objective function in portfolio theory, it comes with strong guarantees provided that returns are indeed i.i.d. Other possible objective functions in portfolio theory are for instance linear combinations of the mean and variance of the returns, as there exists a trade-off between high-risk/high-return and low-risk/low-return portfolios.

Another interesting observation is that maximizing the growth rate is usually different from maximizing the expected returns $\mathbb{E}(\sum_{i=1}^m b_i X_i)$, which can be achieved by selecting $b_i = 1\{i = i^\star\}$ where $i^\star = \arg\max_i \mathbb{E}(X_i)$, i.e. the investor places all of his wealth on the stock with highest average return, a risky strategy indeed. Usually, maximizing the growth rate is much more conservative, due to the logarithm, which places a heavy penalty on the wealth $\sum_{i=1}^m b_i X_i$ becoming very close to 0. In other words, maximizing the growth rate discourages portfolios that can bankrupt the investor in a day.
$$\nabla W(b^\star, F) + \lambda\mathbf{1} + \mu = 0$$

Therefore λ = 1/log(2), and substituting yields the result.

The KKT conditions are necessary and sufficient conditions for the optimality of the portfolio, and if F is known, one can search for the optimal portfolio using an iterative scheme such as gradient descent.
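A minimal sketch of growth-rate maximization for two assets by direct search (a hypothetical two-point return distribution, with the second asset playing the role of cash):

```python
import math

# With probability 1/2 the stock doubles, with probability 1/2 it halves.
outcomes = [((2.0, 1.0), 0.5), ((0.5, 1.0), 0.5)]

def W(b):
    """Growth rate of the portfolio (b, 1 - b)."""
    return sum(p * math.log2(b * x1 + (1 - b) * x2) for (x1, x2), p in outcomes)

best = max((i / 1000 for i in range(1001)), key=W)
print(best, W(best))   # b* = 0.5, W(b*) = (1/2) log2(9/8) ≈ 0.085 bits/day
```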
Definition 8.3.2. A portfolio strategy is said to be causal if for all n, $(b_{n,1}, ..., b_{n,m})$ is solely a function of $(X_{n',1}, ..., X_{n',m})$ for n′ < n.
with equality if one selects b⋆, the maximizer of W(b, F), at all times, i.e. constant strategies are optimal.

For any i, when $(b_{i,1}, ..., b_{i,m})$ is an arbitrary function of $(X_{i',1}, ..., X_{i',m})$ for i′ < i, the optimal choice is to select the maximizer of:

$$\mathbb{E}\Big(\log_2\Big(\sum_{j=1}^m b_{i,j}X_{i,j}\Big)\,\Big|\,(X_{i',1}, ..., X_{i',m}),\, i'<i\Big) = \mathbb{E}\Big(\log_2\Big(\sum_{j=1}^m b_{i,j}X_{i,j}\Big)\Big)$$

since $(X_{i,1}, ..., X_{i,m})$ is independent of $(X_{i',1}, ..., X_{i',m})$, i′ < i. Therefore, for each i, $(b_{i,1}, ..., b_{i,m})$ can be chosen as the maximizer of W(b, F), and constant strategies are optimal.
Interestingly, in our setting causal strategies yield no gains with respect to constant strategies. Therefore, the best achievable performance with causal strategies is still given by the growth rate. Of course, this is only true if F is known to the investor and the returns are i.i.d. If F were unknown, then the investor should change his decisions as more and more returns are observed. Similarly, if the returns have a significant correlation in time, then the investment strategy should be time varying, as the returns observed up to time n − 1 can be used to predict the returns at time n and choose a portfolio intelligently.
Proposition 8.4.1. Consider two distributions F and G, and the corresponding log optimal portfolios $b^\star_F$ and $b^\star_G$, which maximize W(b, F) and W(b, G) respectively. Then we have:

$$W(b_F^\star, F) - W(b_G^\star, F) \le D(F||G).$$

In other words, the amount of growth rate lost by the investor due to his imperfect knowledge is upper bounded by the relative entropy between the true distribution F and his estimate G. So the wealth of an investor with perfect knowledge will be approximately $2^{nW(b_F^\star,F)}$, while the wealth of an investor with imperfect knowledge will be approximately (at least) $2^{n[W(b_F^\star,F) - D(F||G)]}$. It should also be noted that this bound is tight for some distributions of X. This is indeed a surprising link between portfolio theory and information theory.
Definition 8.4.2. The growth rate of portfolio b with side information Y is:

$$W(b, F|Y) = \mathbb{E}\Big(\log_2\Big(\sum_{j=1}^m b_j X_j\Big)\,\Big|\,Y\Big)$$
If the investor has access to side information Y, then he should select the portfolio maximizing W(b, F|Y). While this certainly yields a better performance compared to the case with no side information, one can wonder how much growth rate is gained with side information (for instance when the investor must pay some premium in order to access the side information). Intuitively, this should depend on how much X and Y are correlated.
Proposition 8.4.3. Consider b⋆ the log optimal portfolio maximizing W(b, F) and $b^\star_{|Y}$ the log optimal portfolio with side information maximizing W(b, F|Y). Then we have:

$$0 \le W(b^\star_{|Y}, F|Y) - W(b^\star, F) \le I(X;Y)$$
Proof: If Y = y, from our previous result, the loss of growth rate between an investor who assumes that X has distribution $G = p_X$ and an investor who knows the actual distribution $F = p_{X|Y=y}$ is at most:

$$D(p_{X|Y=y}||p_X) = \sum_{x\in\mathcal{X}} p_{X|Y}(x|y)\log_2\frac{p_{X|Y}(x|y)}{p_X(x)}$$
Chapter 9

Machine Learning and Statistics

9.1 Statistics
9.1.1 Statistical Inference
Assume that we are given n data points X1 , ..., Xn in a finite set X drawn i.i.d.
from some unknown distribution Q. We would like to perform statistical inference,
meaning that we would like to learn information about the unknown distribution Q,
solely by observing the data points X1 , ..., Xn . Of course, depending on what kind
of information we wish to obtain, the resulting problems can be vastly different.
We give a few examples.
function of the data T such that both P(T = 0|Q ∈ H0 ) and P(T = 1|Q ∈ H1 )
are close to 1.
It is noted that the type $P_{x^n}$, i.e. the empirical distribution $P_{x^n}(a) = \frac{1}{n}\sum_{i=1}^n 1\{x_i = a\}$, is indeed a distribution over $\mathcal{X}$, since it has positive entries and sums to 1, and that it is an element of the set of probability distributions over $\mathcal{X}$:

$$\mathcal{P} = \Big\{P \in (\mathbb{R}^+)^{\mathcal{X}} : \sum_{a\in\mathcal{X}} P(a) = 1\Big\}$$

This set is often called the probability simplex, and has dimension $|\mathcal{X}| - 1$.
The reason why the most natural strategy is to compute the empirical distribution of the data is that it converges to the true distribution when the number of data points grows large, as a consequence of the law of large numbers.
Proposition 9.1.2. If X n = (X1 , ..., Xn ) are drawn i.i.d. from distribution Q, then
the type of X n converges to Q almost surely.
Proof: Consider Xⁿ = (X₁, ..., Xₙ) i.i.d. from distribution Q; then the probability distribution of Xⁿ only depends on its type, in the sense that:

$$\mathbb{P}(X^n = x^n) = \prod_{i=1}^n Q(x_i) = \prod_{a\in\mathcal{X}} Q(a)^{\sum_{i=1}^n 1\{x_i=a\}} = \prod_{a\in\mathcal{X}} Q(a)^{nP_{x^n}(a)}.$$

Indeed, the expression above only depends on the type $P_{x^n}$, so that all sequences that have the same type are equally likely to occur.
Furthermore, taking logarithms and dividing by n:

$$-\frac{1}{n}\log_2\mathbb{P}(X^n = x^n) = \sum_{a\in\mathcal{X}} P_{x^n}(a)\log_2\frac{1}{Q(a)} = \sum_{a\in\mathcal{X}} P_{x^n}(a)\log_2\frac{P_{x^n}(a)}{Q(a)} + \sum_{a\in\mathcal{X}} P_{x^n}(a)\log_2\frac{1}{P_{x^n}(a)} = D(P_{x^n}||Q) + H(P_{x^n}).$$
So not only does the probability of a sequence only depend on its type, but the
exponent is equal to the sum of the entropy of the type, and the relative entropy
between the type and the true distribution. This implies that the most likely type is
the true distribution, and also that, when n is large, types that are far away from
the true distribution are very unlikely to occur.
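A minimal sketch checking this identity numerically on a hypothetical alphabet and distribution:

```python
import math
from collections import Counter

def type_of(x):
    """Empirical distribution (type) of a sequence."""
    n = len(x)
    return {a: c / n for a, c in Counter(x).items()}

def H(P):
    return sum(p * math.log2(1 / p) for p in P.values() if p > 0)

def D(P, Q):
    return sum(p * math.log2(p / Q[a]) for a, p in P.items() if p > 0)

Q = {"a": 0.6, "b": 0.4}
x = list("aabab")
P = type_of(x)
logp = sum(math.log2(Q[c]) for c in x)          # log2 P(X^n = x^n)
print(-logp / len(x), H(P) + D(P, Q))           # the two values coincide
```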
$$|\mathcal{P}_n| \le (n+1)^{|\mathcal{X}|}$$

Proof: One can readily check that the entries of $P_{x^n}$ are integer multiples of 1/n by definition. Furthermore:

$$|\mathcal{P}_n| \le (n+1)^{|\mathcal{X}|}$$

since $|\mathcal{P}_n|$ is the number of vectors whose components are positive integer multiples of 1/n summing to 1, and $(n+1)^{|\mathcal{X}|}$ is the number of vectors whose components are integer multiples of 1/n between 0 and 1.
$$T(P) = \{x^n \in \mathcal{X}^n : P_{x^n} = P\}$$

$$|T(Q)| \le 2^{nH(Q)}$$
One may check that the maximum in the above occurs for P = Q, which gives the bound, where:

$$P^\star = \arg\min_{P\in E} D(P||Q)$$

Using the fact that P⋆ minimizes D(P||Q) over E, summing the above over P and using the fact that $|\mathcal{P}_n \cap E| \le (n+1)^{|\mathcal{X}|}$, we get the first result:

$$\mathbb{P}(P_{X^n} \in E) \le (n+1)^{|\mathcal{X}|}\, 2^{-nD(P^\star||Q)}.$$

If E is the closure of its interior, we can find a sequence of Pₙ such that, when n → ∞:

$$D(P_n||Q) \to D(P^\star||Q)$$

and in turn:

$$\mathbb{P}(P_{X^n} \in E) \ge \mathbb{P}(P_{X^n} = P_n).$$
9.3.2 Examples

We now highlight a few examples of how Sanov's theorem may be applied to various statistical problems.

Majority Vote. Consider an election with two candidates, where Q(1), Q(2) are the proportions of people who prefer candidates 1 and 2 respectively. We gather the votes X₁, ..., Xₙ of n voters, which we'll assume to be i.i.d. distributed from Q. The candidate who wins is the one who gathers the most votes. Assume that Q(1) > 1/2, so that 1 is the favorite candidate. What is the probability that 2 gets elected in place of 1?

The votes Xⁿ = (X₁, ..., Xₙ) are an i.i.d. sample from Q, and 2 gets elected if and only if $P_{X^n}(2) \ge 1/2$, i.e. he gets at least n/2 votes. So 2 gets elected if and only if $P_{X^n} \in E$ where:

$$E = \{P \in \mathcal{P} : P(2) \ge 1/2\}$$
We can then apply Sanov's theorem to conclude that 2 gets elected in place of 1 with probability

$$\mathbb{P}(P_{X^n} \in E) \approx 2^{-nD(P^\star||Q)}$$

with P⋆ = (1/2, 1/2), so that:

$$D(P^\star||Q) = \frac{1}{2}\log_2\frac{1/2}{Q(2)} + \frac{1}{2}\log_2\frac{1/2}{1-Q(2)}.$$
Indeed:

$$D(P||Q) = P(2)\log_2\frac{P(2)}{Q(2)} + (1-P(2))\log_2\frac{1-P(2)}{1-Q(2)}$$

and minimizing this quantity over P(2) under the constraint P(2) ≥ 1/2 gives P(2) = 1/2, since Q(2) ≤ 1/2.
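A minimal sketch comparing the exact probability of an upset with the Sanov exponent (Q(2) = 0.4 is a hypothetical value):

```python
import math

q2 = 0.4                 # true support of candidate 2
Dstar = 0.5 * math.log2(0.5 / q2) + 0.5 * math.log2(0.5 / (1 - q2))

for n in [100, 1000]:
    # Exact probability that candidate 2 gets at least n/2 of n i.i.d. votes.
    p_upset = sum(math.comb(n, k) * q2**k * (1 - q2)**(n - k)
                  for k in range(math.ceil(n / 2), n + 1))
    print(n, -math.log2(p_upset) / n, "vs D(P*||Q) =", Dstar)
```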
Testing Fairness. Assume that one is given a die with k faces, and we want to test whether or not the die is fair, in the sense that it is equally likely to fall on each of its faces. Consider Xⁿ = (X₁, ..., Xₙ) the outcomes of casting the die n times, where $X_i \in \mathcal{X}$ is the index of the face on which the die has fallen. To test fairness of the die we compute the empirical distribution $P_{X^n}$ and compare it to Q, the uniform distribution over $\mathcal{X}$. Namely, if $D(P_{X^n}||Q) \le \epsilon$ we deem the die fair, and unfair otherwise.

What is the probability that we mistake a fair die for an unfair one?

If the die is fair, Xⁿ = (X₁, ..., Xₙ) is an i.i.d. sample from Q, and we mistake it for an unfair die if and only if $P_{X^n} \in E$ where:

$$E = \{P \in \mathcal{P} : D(P||Q) \ge \epsilon\}$$

By Sanov's theorem, this occurs with probability approximately $2^{-nD(P^\star||Q)}$,
with D(P ⋆ ||Q) = minP ∈E D(P ||Q) = ϵ. It is remarkable that Sanov’s theorem
allows for an easy, explicit computation.
Testing General Distributions. It is also noted that the above works in the more general case where Q is not the uniform distribution but simply some target distribution: namely, reject the hypothesis that P = Q if $D(P_{X^n}||Q) \ge \epsilon$ and accept otherwise.
Chapter 10
Mathematical Tools
In this chapter we provide a few results that are instrumental for some proofs.
Results are stated without proofs.