Information Theory Lecture Notes
Richard Combes¹,²
Version 1.0

¹ Université Paris-Saclay, CNRS, CentraleSupélec, Laboratoire des signaux et systèmes, France
² Department Signal, Information and Communication
Contents

1 Information Measures
  1.1 Entropy
    1.1.1 Definition
    1.1.2 Entropy and Physics
    1.1.3 Positivity of Entropy and Maximal Entropy
  1.2 Joint and Conditional Entropy
    1.2.1 Definition
    1.2.2 Properties
  1.3 Relative Entropy
    1.3.1 Definition
    1.3.2 Positivity of Relative Entropy
    1.3.3 Relative Entropy is Not a Distance
  1.4 Mutual Information
    1.4.1 Definition
    1.4.2 Positivity of Mutual Information
    1.4.3 Conditioning Reduces Entropy

2 Properties of Information Measures
  2.5.1 AEP
  2.5.2 Typicality
  2.5.3 Joint Typicality

8 Portfolio Theory
  8.1 A Model for Investment
    8.1.1 Asset Prices and Portfolios
    8.1.2 Relative Returns
  8.2 Log Optimal Portfolios
    8.2.1 Asymptotic Wealth Distribution
    8.2.2 Growth Rate Maximization
  8.3 Properties of Log Optimal Portfolios
    8.3.1 Kuhn-Tucker Conditions
    8.3.2 Asymptotic Optimality
  8.4 Investment with Side Information
    8.4.1 Mismatched Portfolios
    8.4.2 Exploiting Side Information

10 Mathematical Tools
  10.1 Jensen's Inequality
  10.2 Constrained Optimization
Foreword
These lecture notes pertain to the Information Theory course given at CentraleSupélec. They are based on the book "Cover and Thomas, Elements of Information Theory", which we highly recommend to interested students who wish to go further in the study of this topic. Each chapter corresponds to a lecture, apart from the last chapter, which contains mathematical tools used in the proofs.
Chapter 1
Information Measures
1.1 Entropy
1.1.1 Definition
Definition 1.1.1. The entropy of a discrete random variable $X \in \mathcal{X}$ with distribution $p_X$ is:

$$H(X) = \mathbb{E}\left[\log_2 \frac{1}{p_X(X)}\right] = \sum_{x \in \mathcal{X}} p_X(x) \log_2 \frac{1}{p_X(x)}$$
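As a quick illustration, here is a minimal Python sketch computing the entropy of a distribution given as a dictionary of probabilities (the example distributions are hypothetical):

```python
import math

def entropy(p):
    """H(X) = sum_x p(x) log2(1/p(x)); terms with p(x) = 0 contribute 0."""
    return sum(px * math.log2(1.0 / px) for px in p.values() if px > 0)

# A fair coin has entropy 1 bit; a biased coin has less.
print(entropy({"heads": 0.5, "tails": 0.5}))   # 1.0
print(entropy({"heads": 0.9, "tails": 0.1}))   # ~0.469
```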
One of the fundamental ideas is that the Gibbs entropy of an isolated physical system is a non-decreasing function of time, and is maximized at equilibrium. Therefore, the randomness in an isolated system always increases and is maximized at equilibrium. In fact, one can prove that the Boltzmann distribution:

$$p_X(x) = \frac{\exp\left(-\frac{E(x)}{k_B T}\right)}{\sum_{x' \in \mathcal{X}} \exp\left(-\frac{E(x')}{k_B T}\right)}$$

where $T$ is the temperature, $E(x)$ is the energy of state $x$ and $k_B$ is the Boltzmann constant, maximizes the Gibbs entropy under the average energy constraint $\sum_{x \in \mathcal{X}} p_X(x) E(x) = \bar{E}$. Hence all systems in equilibrium follow this distribution.
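As an aside, here is a minimal sketch computing a Boltzmann distribution numerically, assuming a hypothetical three-state system with energies expressed in units of $k_B T$:

```python
import math

def boltzmann(energies, kBT=1.0):
    """Boltzmann distribution over states with the given energies."""
    w = [math.exp(-E / kBT) for E in energies]
    Z = sum(w)                    # partition function
    return [wi / Z for wi in w]

p = boltzmann([0.0, 1.0, 2.0])
print(p, sum(p))                  # higher-energy states are less likely
```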
Since $p_X(x) \le 1$ for all $x$, entropy is positive:

$$H(X) = \sum_{x\in\mathcal{X}} p_X(x) \log_2 \frac{1}{p_X(x)} \ge \sum_{x\in\mathcal{X}} p_X(x) \log_2 1 = 0.$$

By Jensen's inequality, entropy is at most the logarithm of the alphabet size:

$$H(X) = \sum_{x\in\mathcal{X}} p_X(x) \log_2 \frac{1}{p_X(x)} \le \log_2 \sum_{x\in\mathcal{X}} p_X(x) \frac{1}{p_X(x)} = \log_2 |\mathcal{X}|.$$
The joint entropy H(X, Y ) is simply the entropy of (X, Y ) seen as a single
random variable. It is important to notice that the joint entropy depends on the
joint distribution, not only on the marginal distributions.
The conditional entropy $H(X|Y)$ measures the entropy of $X$ once the value of $Y$ has been revealed. It has several equivalent definitions; using Bayes' rule, which states that $p_{X,Y}(x,y) = p_Y(y)\,p_{X|Y}(x|y)$, we get:

$$H(X|Y) = \mathbb{E}\left[\log_2 \frac{1}{p_{X|Y}(X|Y)}\right] = \mathbb{E}\left[\log_2 \frac{1}{p_{X,Y}(X,Y)}\right] - \mathbb{E}\left[\log_2 \frac{1}{p_Y(Y)}\right] = H(X,Y) - H(Y).$$

The identity $H(X,Y) = H(Y) + H(X|Y)$ is called a chain rule, and can be interpreted as the fact that the amount of randomness in $(X,Y)$ equals the amount of randomness in $Y$ plus the amount of randomness left in $X$ once $Y$ has been revealed.
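A minimal sketch verifying the chain rule $H(X|Y) = H(X,Y) - H(Y)$ numerically, on a hypothetical joint distribution:

```python
import math

def H(p):
    """Entropy of a distribution given as {outcome: probability}."""
    return sum(v * math.log2(1.0 / v) for v in p.values() if v > 0)

# Joint distribution p_{X,Y} as {(x, y): probability}.
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

# Marginal of Y, then H(X|Y) via the chain rule H(X|Y) = H(X,Y) - H(Y).
p_y = {}
for (x, y), v in p_xy.items():
    p_y[y] = p_y.get(y, 0.0) + v

print(H(p_xy) - H(p_y))  # H(X|Y)
```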
1.2.2 Properties
Property 2. If X and Y are independent then H(X|Y) = H(X) and H(X,Y) = H(X) + H(Y).

Proof: If X and Y are independent then $p_{X,Y}(x,y) = p_X(x)p_Y(y)$, and replacing in the definition gives the result immediately.
Entropy is additive for independent random variables, which once again is
coherent with its interpretation as a measure of randomness. Indeed, if there
is no relationship between X and Y , the randomness of (X, Y ) is simply the
sum of the randomness in X and Y taken separately. It is also noticed that entropy is not additive if X and Y are correlated: for instance, if X = Y then H(X,Y) = H(X) ≠ H(X) + H(Y), unless both X and Y are deterministic.
Property 3. Conditional entropy is not symmetrical unless H(X) = H(Y):

$$H(Y|X) - H(X|Y) = H(Y) - H(X)$$

Conditional entropy is hence not symmetrical in general, one notable exception being when X and Y have the same distribution.
We shall see later that mutual information also quantifies the amount of information that can be exchanged between a sender who selects X and a receiver who observes Y.
Proof: Using both the chain rule and the definition of mutual information:

$$I(X_1, \ldots, X_n; Y) = \sum_{i=1}^n I(X_i; Y | X_{i-1}, \ldots, X_1).$$

The chain rule for mutual information also has a natural interpretation. Imagine that a sender selects $X_1, \ldots, X_n$ and attempts to communicate with a receiver who observes $Y$. Then the information that can be exchanged, $I(X_1,\ldots,X_n;Y)$, is the sum of the terms $I(X_i;Y|X_{i-1},\ldots,X_1)$, which can be interpreted as the sender sending $X_1$, the receiver retrieving $X_1$ from $Y$, then the sender sending $X_2$ and the receiver retrieving $X_2$ from both $Y$ and $X_1$, and so on. This idea of retrieving $X_1,\ldots,X_n$ iteratively is used in many communication systems.
2.2.1 Statement

Proposition 2.2.1 (Log-sum inequality). For any positive $(a_i)_i$, $(b_i)_i$:

$$\sum_{i=1}^n a_i \log_2 \frac{a_i}{b_i} \ge \Big(\sum_{i=1}^n a_i\Big) \log_2 \frac{\sum_{i=1}^n a_i}{\sum_{i=1}^n b_i}$$

with equality iff $\frac{a_i}{b_i} = c$ for all $i$.
Proof: Define $f(t) = t \log_2 t$, which is convex, and $\alpha_i = b_i / \sum_{j=1}^n b_j$. By Jensen's inequality:

$$\sum_{i=1}^n a_i \log_2 \frac{a_i}{b_i} = \Big(\sum_{j=1}^n b_j\Big) \sum_{i=1}^n \alpha_i f\Big(\frac{a_i}{b_i}\Big) \ge \Big(\sum_{j=1}^n b_j\Big) f\Big(\sum_{i=1}^n \alpha_i \frac{a_i}{b_i}\Big) = \Big(\sum_{i=1}^n a_i\Big) \log_2 \frac{\sum_{i=1}^n a_i}{\sum_{i=1}^n b_i}.$$
Simply said, a Markov chain X → Y → Z is such that one first draws the value of X; then, once the value of X is known, one draws Y according to some distribution that depends solely on X; and finally one draws Z according to some distribution that depends solely on Y. The key idea is that, in order to generate Z, one can only look at the previously generated value Y, i.e. we generate the process with a memory of order 1. The simplest, and most often encountered, example of a Markov chain X → Y → Z is any X, Y, Z such that Z = g(Y) where g is a known, deterministic function.
Proof: Using the chain rule for mutual information twice:

$$I(X; Y, Z) = I(X; Z) + I(X; Y|Z) = I(X; Y) + I(X; Z|Y).$$

Since $I(X;Y|Z) \ge 0$ and $I(X;Z|Y) = 0$ (by the Markov property), we have $I(X;Y) \ge I(X;Z)$.
The data processing inequality simply states that mutual information cannot
increase along a Markov chain, i.e. data processing cannot create information
out of nowhere. An interpretation in the context of communication is that if a sender selects X and a receiver observes Y, and a helper offers to help the receiver by computing the value of g(Y), then X → Y → g(Y) and so I(X; g(Y)) ≤ I(X; Y). That is, the helper is in fact never helpful.
2.4.2 Statement

Proposition 2.4.1 (Fano's inequality). If X → Y → X̂, then with $P_e = \mathbb{P}(\hat{X} \ne X)$:

$$H(X|Y) \le h_2(P_e) + P_e \log_2(|\mathcal{X}| - 1)$$

with $h_2(a) = a \log_2 \frac{1}{a} + (1-a)\log_2\frac{1}{1-a}$ the binary entropy. The first step of the proof is to note that, since X → Y → X̂, the data processing inequality gives:

$$H(X|Y) \le H(X|\hat{X}).$$
Proposition 2.5.1 (AEP). Consider $(X_1,Y_1),\ldots,(X_n,Y_n)$ i.i.d. with common joint distribution $p_{X,Y}$. Then:

$$\frac{1}{n}\sum_{i=1}^n \log_2\frac{1}{p_{X,Y}(X_i,Y_i)} \xrightarrow[n\to\infty]{} H(X,Y) \text{ in probability}$$

and

$$\frac{1}{n}\sum_{i=1}^n \log_2\frac{1}{p_{X|Y}(X_i|Y_i)} \xrightarrow[n\to\infty]{} H(X|Y) \text{ in probability}$$

and

$$\frac{1}{n}\sum_{i=1}^n \log_2\frac{p_{X,Y}(X_i,Y_i)}{p_X(X_i)p_Y(Y_i)} \xrightarrow[n\to\infty]{} I(X;Y) \text{ in probability.}$$
Proof: All statements follow from the weak law of large numbers.
The Asymptotic Equipartition Property (AEP), which in itself is a straightfor-
ward consequence of the law of large numbers, roughly states that for large i.i.d.
samples, the "empirical information measures" behave like the actual information
measures. While this is not very useful in itself, a consequence is that i.i.d. samples
concentrate on what is called "typical sets".
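A small simulation illustrating the AEP, using a hypothetical three-symbol distribution: the empirical average of $\log_2(1/p_X(X_i))$ concentrates around $H(X)$ as $n$ grows:

```python
import math, random

p = {"a": 0.5, "b": 0.25, "c": 0.25}
H = sum(v * math.log2(1 / v) for v in p.values())  # 1.5 bits

random.seed(0)
symbols, weights = list(p), list(p.values())
for n in [100, 10_000, 1_000_000]:
    sample = random.choices(symbols, weights=weights, k=n)
    emp = sum(math.log2(1 / p[x]) for x in sample) / n
    print(n, emp, "vs H(X) =", H)
```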
2.5.2 Typicality
Proposition 2.5.2. Consider $X_1, \ldots, X_n$ i.i.d. with common distribution $p_X$. Given $\epsilon > 0$, define the typical set:

$$A_\epsilon^n = \Big\{x^n \in \mathcal{X}^n : \Big|\frac{1}{n}\sum_{i=1}^n \log_2\frac{1}{p_X(x_i)} - H(X)\Big| \le \epsilon\Big\}.$$

Then:

(i) $|A_\epsilon^n| \le 2^{n(H(X)+\epsilon)}$ for all $n$

(ii) $|A_\epsilon^n| \ge (1-\epsilon)\, 2^{n(H(X)-\epsilon)}$ for $n$ large enough

(iii) $\mathbb{P}((X_1,\ldots,X_n) \in A_\epsilon^n) \ge 1 - \epsilon$ for $n$ large enough
Proof: From asymptotic equipartition, the typical set is a high-probability set: for $n$ large enough

$$1 - \epsilon \le \mathbb{P}((X_1, \ldots, X_n) \in A_\epsilon^n) \le 1.$$

The size of the typical set is bounded by summing the bounds $2^{-n(H(X)+\epsilon)} \le p_{X^n}(x^n) \le 2^{-n(H(X)-\epsilon)}$, valid for all $x^n \in A_\epsilon^n$, over the elements of $A_\epsilon^n$.

2.5.3 Joint Typicality

Proposition 2.5.3. Consider $(X_1,Y_1),\ldots,(X_n,Y_n)$ i.i.d. with common joint distribution $p_{X,Y}$, define the jointly typical set $A_\epsilon^n$ as the set of pairs $(x^n,y^n)$ whose empirical entropies for $x^n$, $y^n$ and $(x^n,y^n)$ are within $\epsilon/3$ of $H(X)$, $H(Y)$ and $H(X,Y)$ respectively, and let $(\tilde{X}^n, \tilde{Y}^n)$ be independent sequences drawn i.i.d. from the marginals $p_X$ and $p_Y$. Then:

(i) $|A_\epsilon^n| \le 2^{n(H(X,Y)+\epsilon)}$ for all $n$

(ii) $\mathbb{P}((X^n, Y^n) \in A_\epsilon^n) \to 1$ as $n \to \infty$

(iii) $(1-\epsilon)\, 2^{-n(I(X;Y)+\epsilon)} \le \mathbb{P}((\tilde{X}^n, \tilde{Y}^n) \in A_\epsilon^n) \le 2^{-n(I(X;Y)-\epsilon)}$ for $n$ large enough
Proof: We have:

$$A_\epsilon^n \subset \Big\{(x^n,y^n) \in \mathcal{X}^n \times \mathcal{Y}^n : \Big|\frac{1}{n}\sum_{i=1}^n \log_2\frac{1}{p_{X,Y}(x_i,y_i)} - H(X,Y)\Big| \le \epsilon\Big\}$$

and we know that this set has size at most $2^{n(H(X,Y)+\epsilon)}$.
From the law of large numbers:

$$\mathbb{P}\Big(\Big|\frac{1}{n}\sum_{i=1}^n \log_2\frac{1}{p_X(X_i)} - H(X)\Big| \ge \frac{\epsilon}{3}\Big) \xrightarrow[n\to\infty]{} 0$$

$$\mathbb{P}\Big(\Big|\frac{1}{n}\sum_{i=1}^n \log_2\frac{1}{p_Y(Y_i)} - H(Y)\Big| \ge \frac{\epsilon}{3}\Big) \xrightarrow[n\to\infty]{} 0$$

$$\mathbb{P}\Big(\Big|\frac{1}{n}\sum_{i=1}^n \log_2\frac{1}{p_{X,Y}(X_i,Y_i)} - H(X,Y)\Big| \ge \frac{\epsilon}{3}\Big) \xrightarrow[n\to\infty]{} 0$$
If $(x^n, y^n) \in A_\epsilon^n$:

$$2^{-n(I(X;Y)+\epsilon)} \le \prod_{i=1}^n \frac{p_X(x_i)\, p_Y(y_i)}{p_{X,Y}(x_i,y_i)} \le 2^{-n(I(X;Y)-\epsilon)}$$
Therefore:

$$2^{-n(I(X;Y)+\epsilon)} \le \frac{\mathbb{P}((\tilde{X}^n, \tilde{Y}^n) \in A_\epsilon^n)}{\mathbb{P}((X^n, Y^n) \in A_\epsilon^n)} \le 2^{-n(I(X;Y)-\epsilon)}$$

and the result is proven since $\mathbb{P}((X^n, Y^n) \in A_\epsilon^n) \to 1$ as $n \to \infty$.
Joint typicality is similar to typicality, and we will expand on its implications
when considering communication over noisy channels.
Chapter 3

Data Representation: Fundamental Limits
In this chapter we start our exposition of how to represent data efficiently using
information theoretic tools. We introduce prefix codes and show that the entropy
of the source quantifies the length of the best prefix codes, and how such codes can
be constructed.
3.1.1 Definition
Definition 3.1.1. Consider X ∈ X and D the set of finite strings on {0, 1}. A
source code is a mapping C : X → D.
A source code takes as input a symbol X and maps it into a finite sequence of
bits.
One of the main measures of efficiency of a source code is its expected length,
which is the expected number of bits required to represent a symbol, if this symbol
were drawn according to the source distribution.
A critical point is that extension can create ambiguity, even if the code is non-singular. Indeed, if one only observes the concatenated codewords C(X₁), ..., C(Xₙ), it might be difficult to know where one codeword ends and where the next one begins. A simple example would be $\mathcal{X} = \{a, b, c\}$ and a code C(a) = 0, C(b) = 1 and C(c) = 01. We have C(a)C(b) = C(c), so it is impossible to differentiate between ab and c. This code is non-singular, but not uniquely decodable.

A uniquely decodable code is such that extension does not create ambiguity, which makes it possible to encode streams of symbols by encoding each symbol separately, without losing any information.
Proof: Consider the following decoding algorithm: let C(X₁), ..., C(Xₙ) be a sequence of bits u₁...uₘ, and let ℓ be the smallest integer such that u₁...uℓ = C(x) for some x. Then we must have x = X₁, since otherwise one of C(x), C(X₁) would be the prefix of the other codeword. This yields X₁, and repeating the procedure yields X₁, ..., Xₙ.
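A minimal sketch of this decoding procedure, assuming a hypothetical prefix code:

```python
def decode(stream, code):        # code: {symbol: codeword}
    """Repeatedly strip the unique codeword that prefixes the bit stream."""
    inv = {w: s for s, w in code.items()}
    out, buf = [], ""
    for bit in stream:
        buf += bit
        if buf in inv:           # for a prefix code this match is unambiguous
            out.append(inv[buf])
            buf = ""
    return out

code = {"a": "0", "b": "10", "c": "11"}
print(decode("010110", code))    # ['a', 'b', 'c', 'a']
```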
We have seen that prefix codes are uniquely decodable, and uniquely decodable codes are non-singular, but there exist uniquely decodable codes that are not prefix codes, and there exist non-singular codes that are not uniquely decodable.
Definition 3.2.3. Given a binary tree G = (V, E), we call the "label" of leaf v the binary sequence encoding the unique path from the root to v, where 0 stands for "down and left" and 1 for "down and right".
Property 10. Consider a binary tree; then the labels of its leaves form a prefix code. Conversely, for any prefix code, there exists a binary tree whose leaf labels are the codewords of that code.
Proof: Consider v and v′ two leaves of G such that the label of v is a prefix of the label of v′. This means that v′ is a descendant of v, so that v is not a leaf, a contradiction. So the leaves' labels form a prefix code.
Conversely, consider a prefix code, and the following procedure to build the associated binary tree. Start with G a complete binary tree. While the code is not empty, select one of its codewords C(x), find the node v whose label is C(x), remove all of the descendants of v from G, and remove C(x) from the code. Repeat the procedure until the code is empty.
Therefore, there is a one-to-one correspondence between binary trees and prefix codes: for every prefix code we can construct a binary tree representation of this code, and every binary tree represents a prefix code. This is fundamental in order to derive lower bounds on the code length and to design codes which attain these bounds.
Any prefix code with codeword lengths $(\ell(x))_{x\in\mathcal{X}}$ satisfies the Kraft inequality:

$$\sum_{x\in\mathcal{X}} 2^{-\ell(x)} \le 1.$$

Also, given any $(\ell(x))_{x\in\mathcal{X}}$ satisfying this inequality, one can construct a prefix code with codeword lengths $(\ell(x))_{x\in\mathcal{X}}$.

Proof: Let $l_m = \max_{x\in\mathcal{X}} \ell(x)$ be the largest codeword length, and let $Z(x) \subset \{0,1\}^{l_m}$ be the set of words that have C(x) as a prefix. Then $|Z(x)| = 2^{l_m - \ell(x)}$. Furthermore $Z(x) \cap Z(x') = \emptyset$ as C is a prefix code. Summing over x proves the first result:

$$2^{l_m} = |\{0,1\}^{l_m}| \ge \Big|\bigcup_{x\in\mathcal{X}} Z(x)\Big| = \sum_{x\in\mathcal{X}} |Z(x)| = \sum_{x\in\mathcal{X}} 2^{l_m - \ell(x)}.$$

Conversely, assume that the $\ell(x)$ are sorted in increasing order and satisfy the Kraft inequality. Consider the prefix code where C(x) is the first $\ell(x)$ digits of the binary representation of $\sum_{i<x} 2^{-\ell(i)}$. This proves the second result.
Kraft's inequality is a fundamental limit: it states a constraint on the codeword lengths that must be satisfied by any prefix code.
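A minimal sketch of the converse construction from the proof: given lengths satisfying Kraft's inequality, build a prefix code by truncating the binary expansions of the cumulative sums $\sum_{i<x} 2^{-\ell(i)}$ (the example lengths are hypothetical):

```python
def kraft_sum(lengths):
    return sum(2.0 ** -l for l in lengths)

def prefix_code(lengths):
    """Build a prefix code from lengths sorted in increasing order."""
    assert kraft_sum(lengths) <= 1.0, "Kraft inequality violated"
    lengths = sorted(lengths)
    code, acc = [], 0.0
    for l in lengths:
        # First l binary digits of the cumulative sum acc.
        word = "".join(str(int(acc * 2 ** (k + 1)) % 2) for k in range(l))
        code.append(word)
        acc += 2.0 ** -l
    return code

print(prefix_code([1, 2, 3, 3]))  # ['0', '10', '110', '111']
```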
Any prefix code C satisfies:

$$L(C) \ge H(X).$$

Proof: By Kraft's inequality, L(C) is lower bounded by the value of the relaxed problem (P₁) of minimizing $\sum_{x\in\mathcal{X}} p_X(x)\ell(x)$ over real-valued lengths satisfying $\sum_{x\in\mathcal{X}} 2^{-\ell(x)} \le 1$. The Lagrangian conditions for (P₁) give:

$$2^{-\ell(x)} = \frac{p_X(x)}{\lambda(\log 2)}, \quad x\in\mathcal{X}$$

and the constraint then yields:

$$2^{-\ell(x)} = p_X(x), \quad x \in \mathcal{X}.$$

The value of (P₁) at this solution lower bounds L(C), which concludes the proof:

$$\sum_{x\in\mathcal{X}} p_X(x)\ell(x) = \sum_{x\in\mathcal{X}} p_X(x)\log_2\frac{1}{p_X(x)} = H(X).$$
Recall that whenever $\ell(x), x \in \mathcal{X}$ satisfy the Kraft inequality, there exists a corresponding prefix code with lengths $\ell(x), x \in \mathcal{X}$. Choosing $\ell(x) = \lceil \log_2 \frac{1}{p_X(x)} \rceil$ satisfies the Kraft inequality, and the length of this code is:

$$L(C) = \sum_{x\in\mathcal{X}} p_X(x)\ell(x) = \sum_{x\in\mathcal{X}} p_X(x)\Big\lceil \log_2\frac{1}{p_X(x)} \Big\rceil \le \sum_{x\in\mathcal{X}} p_X(x)\Big(\log_2\frac{1}{p_X(x)} + 1\Big) = H(X) + 1.$$
$$\frac{L(C)}{n} \le H(X) + \frac{1}{n}.$$

Proof: From independence, H(X₁, ..., Xₙ) = nH(X), and selecting C as the optimal prefix code for (X₁, ..., Xₙ) gives the result.
Chapter 4

Data Representation: Algorithms

4.1 The Huffman Algorithm

4.1.2 Rationale

The Huffman algorithm is based on the idea that a good prefix code should satisfy three properties:

• (i) If p(x) ≥ p(y) then ℓ(y) ≥ ℓ(x)

• (ii) The two longest codewords have the same length

• (iii) The two longest codewords differ by only 1 bit and correspond to the two least likely symbols

In fact, these properties will serve to show the optimality of the Huffman algorithm.
4.1.3 Complexity

At each step of the algorithm, one must find the two nodes with the smallest weight. There are $|\mathcal{X}|$ steps, and finding the two nodes with smallest weight by sorting the list of nodes by weight at each step requires $O(|\mathcal{X}| \ln |\mathcal{X}|)$ time. Hence a naive implementation of the algorithm requires time $O(|\mathcal{X}|^2 \ln |\mathcal{X}|)$. A smarter implementation is to keep the list of nodes sorted across steps, so that finding the two nodes with smallest weight can be done in time $O(1)$, and then insert the new node into the sorted list using binary search in time $O(\ln |\mathcal{X}|)$. Hence the Huffman algorithm can be implemented in time $O(|\mathcal{X}| \ln |\mathcal{X}|)$, almost linear in the number of symbols.
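A minimal Huffman implementation using a binary heap, in the spirit of the $O(|\mathcal{X}| \ln |\mathcal{X}|)$ implementation discussed above (the example distribution is hypothetical):

```python
import heapq

def huffman(p):
    """Return {symbol: codeword} for a distribution {symbol: probability}."""
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(p.items())]
    heapq.heapify(heap)
    counter = len(heap)  # tie-breaker so dicts are never compared
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)  # two least likely nodes are merged
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]

p = {"A": 0.5, "B": 0.2, "C": 0.1, "D": 0.1, "E": 0.1}
print(huffman(p))  # e.g. A -> '0', B -> '10', ...
```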
4.1.4 Limitations

While optimal, for sources with billions of symbols the Huffman algorithm is too complex to implement, and there exist other techniques, such as arithmetic coding (used in JPEG). Also, the Huffman algorithm requires knowing the source distribution p(x) for $x \in \mathcal{X}$ at the encoder, which is a practical limitation; to solve this problem there exist universal codes, which operate without prior knowledge of p. We will show some simple strategies to design universal codes.
4.1.5 Illustration
[Figure: Huffman tree for the source below; merging the two least likely nodes at each step produces leaf depths 1, 2, 3, 4, 4.]

x     A    B    C    D    E
p(x)  1/2  1/5  1/10 1/10 1/10
C(x)  0    10   110  1110 1111
ℓ(x)  1    2    3    4    4
Above is the result of the Huffman algorithm applied to a given source. One can readily verify that the more probable the symbol, the shorter the codeword, and that the two least probable symbols D and E have been assigned to the two leaves with highest depth.
The length of the code is minimal amongst all prefix codes and equals:

$$\frac{1}{2}\times 1 + \frac{1}{5}\times 2 + \frac{1}{10}\times(3+4+4) = 2$$

to be compared with the entropy of the source:

$$\frac{1}{2}\log_2(2) + \frac{1}{5}\log_2(5) + \frac{3}{10}\log_2(10) \approx 1.96$$
4.1.6 Optimality
Proposition 4.1.2. The Huffman algorithm outputs a prefix code with minimal
expected length L(C) amongst all prefix codes.
Proof: Assume that the source symbols are sorted so that p(1) ≤ ... ≤ p(|𝒳|). Consider a code C with minimal length, and x, y two symbols such that x ≤ y and ℓ(x) < ℓ(y). Construct a new code C′ such that C′(x) = C(y), C′(y) = C(x) and C′(z) = C(z) for z ≠ x, y. Then L(C′) − L(C) = (p(x) − p(y))(ℓ(y) − ℓ(x)) ≤ 0, so exchanging the two codewords can only improve the code. This shows that for any x, y such that x ≤ y we may assume ℓ(x) ≥ ℓ(y). Furthermore, since the two least probable symbols have maximal depth, we can always assume that they are siblings (otherwise simply perform an exchange between symbol 2 and the sibling of symbol 1).
Consider C a prefix code with minimal length, and H the prefix code output by the Huffman algorithm. Further define C′ and H′ the codes obtained by considering C and H and replacing nodes 1 and 2 by their parent, with weight p(1) + p(2). Then we have:

$$L(C') = L(C) - (p(1) + p(2))$$

and

$$L(H') = L(H) - (p(1) + p(2)).$$

We also observe that H′ is exactly the output of the Huffman algorithm applied to a source with |𝒳| − 1 symbols.

We can then prove the result by induction. Clearly, for |𝒳| = 1 symbol the Huffman algorithm is optimal. Furthermore, if the Huffman algorithm is optimal for |𝒳| − 1 symbols, then L(C′) = L(H′), so that L(C) = L(H), and hence the Huffman algorithm is optimal for |𝒳| symbols.
[Figure: a two-state (ON/OFF) Markov source.]

$$\Big(1 - \frac{1}{n}\Big)R(\pi,P) + \frac{H(X_1)}{n} \le \frac{L(C)}{n} \le \Big(1 - \frac{1}{n}\Big)R(\pi,P) + \frac{H(X_1) + 1}{n}.$$
Furthermore:

$$H(X_i|X_{i-1}) = \sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{X}} \mathbb{P}(X_{i-1}=x, X_i=y)\log_2\frac{1}{\mathbb{P}(X_i=y|X_{i-1}=x)} = \sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{X}} \pi(x)P(y|x)\log_2\frac{1}{P(y|x)} = R(\pi,P).$$
Therefore:

$$H(X_1, \ldots, X_n) = (n-1)R(\pi,P) + H(X_1).$$

The lower bound holds as before, and applying Huffman coding to (X₁, ..., Xₙ) yields a code with:

$$(n-1)R(\pi,P) + H(X_1) \le L(C) \le (n-1)R(\pi,P) + H(X_1) + 1.$$
We have therefore established that the rate of optimal codes for Markov sources is exactly R(π,P) bits per symbol. Furthermore, optimal codes can be found using the same algorithms as in the memoryless case: one would first determine the transition probabilities of the Markov source at hand, which gives the probability of any sequence (X₁, ..., Xₙ), and finally apply the Huffman algorithm. One can apply this (for instance) in order to encode English text optimally, since English can be seen as a Markov source.

Now, one caveat of our approach is that it requires knowing the probability distribution of any sequence that can be generated by the source. In the case of memoryless sources this means knowing the distribution of a symbol, and in the case of Markov sources this means knowing both the stationary distribution and the transition probabilities. This can often be a limitation in practice, and to solve this problem we study the concept of universal codes.
The idea of a universal code is that the code should have no prior knowledge of
the data distribution, and that the code should work well irrespective of the data
distribution. This is important in practical scenarios in which nothing is known
about the data distribution. In fact, when the data distribution is known, we know
that the smallest attainable rate is the entropy H(X), and if a code is universal,
then it attains this rate asymptotically for all distributions.
Proof: For a given value of k, let $A_k$ denote the set of binary sequences of length n with exactly k ones. Since $A_k$ has $\binom{n}{k}$ elements, the length of the corresponding codeword is

$$\ell(C(x_1, \ldots, x_n)) = \log_2(n) + \log_2\binom{n}{k}$$

so that, using Stirling's approximation,

$$\frac{1}{n}\ell(C(x_1, \ldots, x_n)) = h_2(k/n) + o(1)$$
Consider X ∼ Bernoulli(a). Then:

$$\frac{k}{n} = \frac{1}{n}\sum_{i=1}^n X_i \xrightarrow[n\to\infty]{} a \text{ almost surely}$$

so that

$$\frac{1}{n}\mathbb{E}[\ell(C(X_1, \ldots, X_n))] \xrightarrow[n\to\infty]{} h_2(a) = H(X).$$
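A minimal sketch of this two-part code length, assuming the codeword consists of roughly $\log_2 n$ bits describing k followed by $\log_2\binom{n}{k}$ bits for the index within $A_k$:

```python
import math, random

def h2(a):
    """Binary entropy in bits."""
    return 0.0 if a in (0.0, 1.0) else a * math.log2(1/a) + (1-a) * math.log2(1/(1-a))

def code_length(x):
    n, k = len(x), sum(x)
    # ~log2(n+1) bits to describe k, then ~log2 C(n,k) bits for the index in A_k.
    return math.ceil(math.log2(n + 1)) + math.ceil(math.log2(math.comb(n, k)))

random.seed(1)
a = 0.1
for n in [100, 10_000]:
    x = [1 if random.random() < a else 0 for _ in range(n)]
    print(n, code_length(x) / n, "vs h2(a) =", h2(a))
```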
The most famous universal codes are the Lempel-Ziv algorithms; we present here the algorithm that uses a sliding window. There exist other versions, such as the one based on trees. The algorithm encodes the sequence by first parsing it into a set of words, and then encoding each word based on the previous words. The central idea of this coding scheme is that if a word (x₁, ..., x_k) of size k has a relatively high probability, then it is likely to appear in a window of size W, if W is large enough. In turn this word can be represented with 1 + log₂ W + log₂ k bits instead of k bits. In short, words that are frequent tend to appear repeatedly, and therefore can be encoded by providing a pointer to one of their past occurrences, which can sometimes drastically reduce the number of bits required.
0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0
The above example illustrates how the algorithm operates on a binary sequence. The sliding window enables us to encode long runs of consecutive 0's with relatively few bits. Indeed, we manage to encode a run of 17 consecutive 0's by the word (1, 1, 17), which can be represented using roughly 1 + log₂(4) + log₂(17) ≈ 7 bits: a net gain of 17 − 7 = 10 bits.
Lempel-Ziv coding has the advantages of being very easy to implement, requiring no knowledge about the data distribution, and being universal. We do not present the proof of universality here, due to its complexity.
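A toy sliding-window parse in the spirit of Lempel-Ziv, assuming each phrase is either a literal symbol or an (offset, length) pointer into the recent past; this is a simplified sketch, not the exact algorithm of the text:

```python
def lz_parse(seq, W=8):
    """Parse seq into literals and (offset, length) pointers into the last W symbols."""
    out, i = [], 0
    while i < len(seq):
        best_off, best_len = 0, 0
        for off in range(1, min(i, W) + 1):
            l = 0
            # Allow overlapping matches (run-length style) via l % off.
            while i + l < len(seq) and seq[i + l] == seq[i - off + (l % off)]:
                l += 1
            if l > best_len:
                best_off, best_len = off, l
        if best_len >= 2:
            out.append(("ptr", best_off, best_len))
            i += best_len
        else:
            out.append(("lit", seq[i]))
            i += 1
    return out

print(lz_parse([0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]))
```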
Chapter 5
Data Representation: Rate-Distortion Theory
[Figure: a continuous signal and its quantized version as functions of time.]
In fact, in some cases, even if data is already discrete, one may want to represent it using fewer bits, even at the expense of losing some information. For instance, we might be interested in reducing the size (in bits) of an image or a sound file as long as, after compression, one can reconstruct them and the reconstructed image or sound looks or sounds similar to a human. This means that most of the information has been preserved. We call this process lossy compression. Since quantization and lossy compression can be understood in the same framework, we will use both terms interchangeably.
• The encoder encodes the data as $f_n(X^n) \in \{1, ..., 2^{nR}\}$, using nR bits.

The mappings $f_n$ and $g_n$ define the strategy for encoding and decoding the data, and given a rate R the goal is to select these mappings in order to minimize the distortion, defined as:

$$D = \frac{1}{n}\sum_{i=1}^n \mathbb{E}(d(X_i, \hat{X}_i)).$$
and

$$D = \mathbb{E}[d(g(f(X)), X)] = \sum_{i=1}^{2^R} \mathbb{E}(d(g(i),X)|f(X)=i)\,\mathbb{P}(f(X)=i) \ge \sum_{i=1}^{2^R} \min_{x'\in\mathcal{X}} \mathbb{E}(d(x',X)|f(X)=i)\,\mathbb{P}(f(X)=i).$$
Therefore, if (i) or (ii) is not satisfied, we can decrease the distortion by modifying f or g.

The insights gained from the Lloyd-Max conditions are twofold. First, to design the quantizer, a point should be mapped to the closest reconstruction point. Second, when designing the decoder, one should select the reconstruction points to minimize the conditional expected distortion. In fact this shows that if the quantizer f is known, then finding g is easy, and vice-versa, and suggests an iterative algorithm: start with (f, g) arbitrary and alternately minimize over f and g until convergence. This algorithm may not always converge to the optimal solution and should be seen as a heuristic.
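A minimal sketch of this alternating heuristic on a data sample (empirical Lloyd-Max for squared-error distortion; the initialization and sample are illustrative choices):

```python
import random

def lloyd_max(data, levels, iters=100):
    g = sorted(random.sample(data, levels))  # initial reconstruction points
    for _ in range(iters):
        cells = [[] for _ in g]
        for x in data:                       # encoder: map to nearest level
            i = min(range(len(g)), key=lambda j: (x - g[j]) ** 2)
            cells[i].append(x)
        # Decoder: move each level to the mean of its cell.
        g = [sum(c) / len(c) if c else gi for c, gi in zip(cells, g)]
    return g

random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(10_000)]
print(lloyd_max(data, 2))  # close to ±sqrt(2/pi) ≈ ±0.798 for sigma = 1
```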
[Figure: density p(x) of the uniform distribution on [0, 1].]
One can readily check by recursion that this implies $g(i) = i/2^R$ for $i = 1, ..., 2^R$. The distortion is hence $D = \frac{1}{12} 2^{-2R}$, which concludes the proof.

When data is uniformly distributed over an interval, the optimal quantization scheme is uniform quantization, which simply partitions the interval into $2^R$ intervals of equal size; the distortion is $\frac{1}{12} 2^{-2R}$, so that when the rate is increased by 1 bit, the distortion is divided by 4 (or decreased by 6 dB). It is also noted that uniform quantization is equivalent to rounding the data to the nearest integer multiple of $2^{-R}$, so it is very easy to implement.
[Figure: density p(x) of a Gaussian distribution.]
Proof: Let us assume without loss of generality that g(1) < g(2). From Lloyd-Max, the quantization scheme should map each point to the closest reconstruction point. Since X has the same distribution as −X, one must have g(2) = −g(1), hence f(X) = 1 if X < 0 and f(X) = 2 otherwise. Furthermore:

$$g(2) = \arg\min_{x'} \mathbb{E}(d(x',X)|f(X)=2) = \mathbb{E}(X|f(X)=2) = \mathbb{E}(X|X\in[0,+\infty)) = \sqrt{\frac{2\sigma^2}{\pi}}.$$

One may readily check that $D = \frac{\pi-2}{\pi}\sigma^2$, which concludes the proof.
If only R = 1 bit per symbol is available, the most efficient quantizer consists in simply encoding the sign of the data, so that the information in the absolute value is lost. It is also noted that the optimal reconstruction points $\pm\sqrt{2\sigma^2/\pi}$ equal plus or minus the expected absolute value of X.
and

$$g^s(i^n) = (i_1 2^{-R}, ..., i_n 2^{-R}).$$

Then one may readily check that the reconstruction error $g^s(f^s(X^n)) - X^n$ has i.i.d. uniformly distributed entries with variance $\frac{1}{12}2^{-2R}$, and therefore the achieved distortion is $D = \frac{1}{12}2^{-2R}$.
On the other hand, consider another quantization strategy where the quantization points $g(1), ..., g(2^{nR})$ are selected uniformly at random in $[0,1]^n$. One may readily check that, from independence of $g(1), ..., g(2^{nR})$:

$$\mathbb{P}\Big(\min_{i=1,...,2^{nR}} d(X, g(i)) \ge r_n^2\Big) = \mathbb{P}(d(X,g(1)) \ge r_n^2)^{2^{nR}}$$

with $r_n^2 = \frac{n}{12}2^{-2R}$. Furthermore

$$\mathbb{P}(d(X,g(1)) \le r_n^2) \approx \frac{(\pi r_n^2)^{n/2}}{\Gamma(n/2+1)}$$

since the probability that $d(X,g(1)) \le r_n^2$ can be approximated by the Lebesgue measure of a ball of radius $r_n$ centered at X.

We may then use Stirling's approximation to show that

$$\mathbb{P}(d(X,g(1)) \ge r_n^2)^{2^{nR}} \xrightarrow[n\to\infty]{} 0.$$

Therefore, this quantization strategy achieves distortion lower than $\frac{1}{12}2^{-2R}$ with high probability, and is superior to scalar quantization.
[Figure: scalar quantization points (left) versus vector quantization points (right) in the unit square.]
Definition 5.3.2. The rate distortion function R(D) for a given D is the infimum
over R such that (R, D) is achievable.
Given a rate R and a distortion D, we say that (R, D) is achievable if, asymptotically as n grows large, there exists a sequence of quantizers whose distortion is at most D. We insist on the fact that for each value of n an appropriate quantizer must be found, and what matters is the limiting behaviour of this sequence. This means that the notion of achievability is asymptotic, and there may not exist quantizers with rate R and distortion D for small values of n. In a sense, achievability quantifies the smallest distortion for n = +∞. Clearly, the larger the allowed distortion, the smaller the rate can be with an efficient quantizer, and a natural question is: what is the optimal trade-off between distortion and rate? The answer to this question is called the rate distortion function. Computing this function may be difficult in general, and we will show how this may be done by minimizing the mutual information.
The information rate distortion function is:

$$R^{(I)}(D) = \min_{p_{\hat{X}|X} : \mathbb{E}(d(X,\hat{X})) \le D} I(X;\hat{X})$$

minimizing over all possible conditional distributions $p_{\hat{X}|X}$ that satisfy the constraint $\mathbb{E}(d(X,\hat{X})) \le D$.
Theorem 5.4.2. The information rate distortion function equals the rate distortion function.
with $D_i = \mathbb{E}(d(X_i,\hat{X}_i))$ the distortion for the i-th symbol. We have $D = \frac{1}{n}\sum_{i=1}^n D_i$, and since the mutual information is convex, so is the rate distortion function, which in turn implies:

$$nR(D) \le \sum_{i=1}^n R(D_i).$$

We have proven that R(D) ≤ R, so that R(D) is indeed a lower bound on the rate that can be achieved at distortion level D.
Algorithm 5.4.4 (Random Coding for Rate Distortion). Consider the following randomized scheme to construct a rate-distortion codebook:

• (Codebook generation) Let $p_{\hat{X}|X}$ be a distribution such that $R(D) = I(X;\hat{X})$ and $\mathbb{E}(d(X,\hat{X})) \le D$. Draw $\mathcal{C} = \{\hat{X}^n(i), i = 1, ..., 2^{nR}\}$ where each $\hat{X}^n(i)$ is an i.i.d. sample of size n from $p_{\hat{X}}$.
It is noted that this is a randomized strategy, so that both the encoder fn and
the decoder gn are in fact random. While it may seem counter-intuitive to select a
random codebook, this in fact eases the analysis very much, because it allows us to
average over the codebook itself. Furthermore, when performing this averaging, as
long as we are able to prove that the codebook has good performance in expectation,
it automatically implies that there exists a codebook with good performance. This
strategy is common in information theory as well as other fields (for instance
random graphs), and is known as the "probabilistic method". The disadvantage of
random coding with respect to, for instance, Huffman coding, is that it is much
more complex to implement.
Proposition 5.4.5. There exists a sequence of codebooks achieving any rate distortion pair (R, D) with R > R(D).

The main idea centers around typicality, in this case rate-distortion typicality.

The point of random coding is that if the codewords are drawn in an i.i.d. fashion, then the pairs $(X^n, \hat{X}^n)$ will be distortion typical, so that $d(X^n, \hat{X}^n)$ will be arbitrarily close to D with high probability.
[Figure: the rate distortion function of a Gaussian source, decreasing and convex in the distortion D.]
Proposition 5.5.1. Consider X ∼ N(0, σ²) with d(x, x′) = (x − x′)². The rate distortion function is given by:

$$R(D) = \max\Big(\frac{1}{2}\log_2\frac{\sigma^2}{D},\, 0\Big).$$
Proof: We must minimize $I(X;\hat{X})$ where X ∼ N(0, σ²) and $(X,\hat{X})$ verifies $\mathbb{E}((X-\hat{X})^2) \le D$. By definition of the mutual information, $I(X;\hat{X}) = h(X) - h(X|\hat{X})$, with

$$h(X) = \frac{1}{2}\log_2 2\pi e\sigma^2.$$

Furthermore, since conditioning reduces entropy, $h(X|\hat{X}) = h(X-\hat{X}|\hat{X}) \le h(X-\hat{X})$. Now, since the Gaussian distribution maximizes entropy for a given variance:

$$h(X - \hat{X}) \le \frac{1}{2}\log_2 2\pi e\,\mathrm{var}(X - \hat{X}) \le \frac{1}{2}\log_2 2\pi e D$$

so that

$$I(X;\hat{X}) \ge \frac{1}{2}\log_2\frac{\sigma^2}{D}.$$

This bound is attained by choosing $(X, \hat{X})$ jointly Gaussian with $\mathbb{E}((X-\hat{X})^2) = D$, in which case

$$I(X;\hat{X}) = \frac{1}{2}\log_2\frac{\sigma^2}{D}$$

which proves the result.
The rate distortion function for Gaussian variables is indeed convex and decreasing, and in particular this function is 0 for any D > σ², due to the fact that, even with no information, one can achieve a distortion of σ² by representing X by a fixed value equal to E(X). Furthermore, for D < σ², when R is increased by 1, D is divided by 4, so each added bit of quantization decreases the quantization error by 6 dB. Finally, as predicted previously, vector quantization is better than scalar quantization. For instance, consider R = 1: using vector quantization on (X₁, ..., Xₙ) with a rate of R = 1 yields a distortion of D = σ²/4, while using scalar quantization on each entry of (X₁, ..., Xₙ) with a rate of R = 1 yields a distortion of D = ((π−2)/π)σ². Hence in that example vector quantization is 45% more efficient than scalar quantization.
Therefore:

$$I(X^k; \hat{X}^k) \ge \sum_{i=1}^k I(X_i;\hat{X}_i).$$

From Lagrangian relaxation, the solution of this optimization problem must be such that there exists λ⋆ > 0 such that, for each i, either $D_i = \sigma_i^2$ or $D_i = \lambda^\star$. Selecting λ⋆ to ensure that $\sum_{i=1}^k D_i = D$ yields the result.
For Gaussian vectors with independent entries, the rate distortion function can be computed as well, and the solution is given by an allocation called "reverse water-filling", which attempts to equalize the distortion across components. Bits are allocated mostly to components with high variance, and components with low variance are simply ignored. This makes sense since, for an equal number of bits, the larger the variance, the larger the distortion. This can be generalized to Gaussian vectors with non-diagonal covariance matrices by performing reverse water-filling on the eigenvectors/eigenvalues of the covariance matrix.
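A minimal sketch of reverse water-filling by bisection on the water level λ⋆, assuming squared-error distortion and independent components with the given (hypothetical) variances:

```python
import math

def reverse_waterfill(variances, D, tol=1e-12):
    """Find lambda with sum_i min(lambda, sigma_i^2) = D, then compute the rate."""
    lo, hi = 0.0, max(variances)
    while hi - lo > tol:                       # bisection on the water level
        lam = (lo + hi) / 2
        total = sum(min(lam, v) for v in variances)
        lo, hi = (lam, hi) if total < D else (lo, lam)
    Ds = [min(lo, v) for v in variances]       # D_i = min(lambda, sigma_i^2)
    R = sum(0.5 * math.log2(v / d) for v, d in zip(variances, Ds) if d < v)
    return Ds, R

print(reverse_waterfill([4.0, 1.0, 0.25], D=1.0))
```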
Chapter 6

Communication: Discrete Channels
We now move away from data representation, and focus on communication over
noisy channels. For this problem, we are concerned with the maximal rate at which
information can be reliably sent over the channel, in the sense that the receiver
should be able to retrieve the sent information with high probability. As we shall
see, information theoretic tools provide a complete characterization of the problem
in terms of achievable rates as well as coding strategies.
and so on. We will focus mostly on memoryless channels, which already constitute a rather rich model. Of course, there exist more general models, such as Markovian channels and the most general model of ergodic channels. It is noted that, if a channel is memoryless and Xⁿ = (X₁, ..., Xₙ) is i.i.d., then Yⁿ = (Y₁, ..., Yₙ) is also i.i.d.
6.1.3 Examples
We now propose to compute the information channel capacity for a few simple
channel models.
[Figure: the noiseless binary channel: each input is received unchanged.]

To maximize I(X;Y) one must maximize H(X), so the maximizing input distribution is uniform over {0, 1} and the capacity is

$$C = \log_2 2 = 1.$$
[Figure: a noisy channel whose sets of possible outputs for distinct inputs do not overlap.]

To maximize I(X;Y) one must maximize H(X), so the maximizing input distribution is uniform over $\mathcal{X}$ and the capacity is

$$C = \log_2|\mathcal{X}|.$$
[Figure: the binary symmetric channel with crossover probability a.]

To maximize I(X;Y) one must maximize H(X), so the maximizing input distribution is uniform over {0, 1} and the capacity is

$$C = \log_2 2 - h_2(a) = 1 - h_2(a).$$
[Figure: the erasure channel, where each input symbol is either received correctly or erased (output ×) with probability α.]

Here

$$I(X;Y) = (1-\alpha)H(X)$$

so to maximize I(X;Y) one must maximize H(X); the maximizing input distribution is uniform over {0, 1} and the capacity is

$$C = (1-\alpha)\log_2 2 = 1 - \alpha.$$
$$R = \frac{1}{n}\log_2 M$$

and error probability:

$$P_e^n = \mathbb{P}(\hat{W} \ne W).$$
[Figure: the binary symmetric channel with crossover probability a.]
For the binary symmetric channel, a code is given by a subset C of {0, 1}ⁿ of size $2^{nR}$, along with a decoding rule. The distribution of the channel output yⁿ conditional on transmitting some codeword xⁿ is given by

$$p_{Y^n|X^n}(y^n|x^n) = \prod_{i=1}^n (1-a)^{1\{x_i=y_i\}}\, a^{1\{x_i\ne y_i\}} = (1-a)^n \Big(\frac{a}{1-a}\Big)^{d(x^n,y^n)}$$

where $d(x^n, y^n)$ is the Hamming distance between xⁿ and yⁿ. We notice that, when a < 1/2, maximizing this likelihood is equivalent to minimizing the Hamming distance $d(x^n, y^n)$ between the output and the codeword. Also note that, if C is very large, this might be very hard to do computationally.
A well-known code for the BSC is the so-called Hamming code.

[Figure: Venn diagram of a Hamming code with data bits x₁, x₂, x₃ and parity bits x₄ = x₁ ⊕ x₃, x₅ = x₂ ⊕ x₃, x₆ = x₁ ⊕ x₂.]
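As an illustration, here is a sketch of the classic Hamming(7,4) construction with syndrome decoding; note this is the standard (7,4) code, whereas the figure above shows a shorter variant:

```python
def encode(d):
    """Hamming(7,4): 4 data bits, 3 parity bits at positions 1, 2 and 4."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p3, d2, d3, d4]   # positions 1..7

def decode(r):
    s1 = r[0] ^ r[2] ^ r[4] ^ r[6]        # checks positions 1, 3, 5, 7
    s2 = r[1] ^ r[2] ^ r[5] ^ r[6]        # checks positions 2, 3, 6, 7
    s3 = r[3] ^ r[4] ^ r[5] ^ r[6]        # checks positions 4, 5, 6, 7
    err = s1 + 2 * s2 + 4 * s3            # syndrome = position of flipped bit
    if err:
        r[err - 1] ^= 1                   # correct a single bit flip
    return [r[2], r[4], r[5], r[6]]

word = encode([1, 0, 1, 1])
word[3] ^= 1                               # flip one bit
print(decode(word))                        # recovers [1, 0, 1, 1]
```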
$$\lambda^n = \max_{i=1,...,M} \lambda_i.$$
We now show that any rate above the information capacity is not achievable. The
main idea is to apply Fano’s inequality to show that if there are too many codewords,
then the transmitted codeword cannot be estimated with arbitrarily high accuracy.
Proposition 6.3.1. Consider a memoryless channel. Then any rate R > C is not
achievable.
Proof: We recall that for any X, Y we have H(X|Y) ≤ H(X), so that, using the chain rule, for any X₁, ..., Xₙ:

$$H(X_1, ..., X_n) = \sum_{i=1}^n H(X_i|X_{i-1},...,X_1) \le \sum_{i=1}^n H(X_i).$$
We now upper bound the maximal mutual information with n channel uses. By definition of the capacity, $I(X^n; Y^n) \le nC$, and we have the Markov chain:

$$W \to X^n(W) \to Y^n \to \hat{W}.$$

Since the message $W \in \{1, ..., 2^{nR}\}$ is chosen uniformly at random, we have H(W) = nR and:

$$H(W|\hat{W}) = H(W) - I(W;\hat{W}) \ge n(R - C).$$

We may now apply Fano's inequality:
Random Coding
Algorithm 6.3.3 (Random Channel Coding). Consider the following randomized
algorithm in order to generate a codebook and transmit data.
Error Probability
We compute the error probability averaged over C. Define E the event that decoding
fails and average over C:
2 nR
X 1 XX
P(E) = P(C = c)Pen (c) = P(C = c)λi (c).
c
2nR i=1 c
66 CHAPTER 6. COMMUNICATION: DISCRETE CHANNELS
P
By symmetry c P(C = c)λi (c) does not depend on i, so:
2 nR
1 XX
P(E) = P(C = c)λ1 (c) = P(E|W = 1).
2nR c i=1
$$\mathbb{P}(E|W=1) \le \mathbb{P}(E_1^c|W=1) + \sum_{i=2}^{2^{nR}} \mathbb{P}(E_i|W=1)$$
there are at least $2^{nR-1}$ indices i such that $\lambda_i(c^\star) \le 4\epsilon$, obtained by keeping the best half of the codewords. So we have proven that there exists a sequence of $(n, 2^{nR})$ codes with vanishing error probability, which concludes the proof.
Definition 6.4.1. A channel is weakly symmetric if (i) for any x, x′, the vectors $p_{Y|X}(\cdot|x)$ and $p_{Y|X}(\cdot|x')$ are equal up to a permutation, and (ii) for any y, y′ we have $\sum_{x\in\mathcal{X}} p_{Y|X}(y|x) = \sum_{x\in\mathcal{X}} p_{Y|X}(y'|x)$.
Proposition 6.4.2. Assume that (i) for any x, x′, the vectors $p_{Y|X}(\cdot|x)$, $p_{Y|X}(\cdot|x')$ are equal up to a permutation, and (ii) for any y, y′ we have $\sum_{x\in\mathcal{X}} p_{Y|X}(y|x) = \sum_{x\in\mathcal{X}} p_{Y|X}(y'|x)$. Then:

$$C = \log_2|\mathcal{Y}| - \sum_{y\in\mathcal{Y}} p_{Y|X}(y|x)\log_2\frac{1}{p_{Y|X}(y|x)},$$

where the value of the sum does not depend on x by (i).
Define $f(x) = x\log_2\frac{1}{x}$, which is concave, and write:

$$H(Y) = \sum_{y\in\mathcal{Y}} p_Y(y)\log_2\frac{1}{p_Y(y)} = \sum_{y\in\mathcal{Y}} f\Big(\sum_{x\in\mathcal{X}} p_X(x)\, p_{Y|X}(y|x)\Big)$$

$$H(Y|X) = \sum_{(x,y)\in\mathcal{X}\times\mathcal{Y}} p_{X,Y}(x,y)\log_2\frac{1}{p_{Y|X}(y|x)} = \sum_{x\in\mathcal{X}} p_X(x)\sum_{y\in\mathcal{Y}} p_{Y|X}(y|x)\log_2\frac{1}{p_{Y|X}(y|x)}.$$
Chapter 7

Communication: Continuous Channels

In this chapter, we turn our attention to continuous channels, where both the input and the output are real valued. Such channels are ubiquitous, due to the continuous nature of the physical world. To solve this problem we need to generalize the notions of entropy, relative entropy and mutual information to continuous random variables. We compute the capacity and the optimal input distribution of Gaussian channels, which are found in many applications such as wireless communication.
7.1.2 Examples

Uniform Distribution

If X ∼ Uniform($\mathcal{X}$), with $\mathrm{Vol}(\mathcal{X})$ the Lebesgue measure of $\mathcal{X}$:

$$h(X) = \log_2 \mathrm{Vol}(\mathcal{X}).$$

Exponential Distribution

If X ∼ Exponential(λ):

$$h(X) = \mathbb{E}\Big[\log_2\frac{e^{\lambda X}}{\lambda}\Big] = \frac{\lambda\,\mathbb{E}(X)}{\log 2} + \log_2\frac{1}{\lambda} = \log_2\frac{e}{\lambda}$$

The fact that differential entropy decreases with λ is intuitive, since the smaller λ, the less X is concentrated around 0.
Gaussian Distribution

If X ∼ N(µ, σ²):

$$h(X) = \mathbb{E}\Big[\log_2\Big(\sqrt{2\pi\sigma^2}\; e^{\frac{(X-\mu)^2}{2\sigma^2}}\Big)\Big] = \frac{1}{2}\log_2(2\pi\sigma^2) + \frac{\mathbb{E}(X-\mu)^2}{2\log(2)\,\sigma^2} = \frac{1}{2}\log_2(2\pi e\sigma^2)$$
This expression will occur in various places, in particular when computing the capacity of Gaussian channels. Two remarks can be made: first, the differential entropy does not depend on µ, which illustrates the fact that differential entropy is invariant by translation; second, it is increasing in σ², which is intuitive since the larger σ², the less X is concentrated around its mean µ.
Relative Entropy

Definition 7.1.3. Consider two p.d.f.s p(x) and q(x). The relative entropy is:

$$D(p||q) = \int_{\mathcal{X}} p(x)\log_2\frac{p(x)}{q(x)}\,dx$$
Proposition 7.1.4. We have D(p||q) ≥ 0 for any p, q.
Proof: Jensen’s inequality.
which proves the first result. If A is not invertible, then the support of the distribution of a + AX has Lebesgue measure 0, so that h(a + AX) = −∞.

Therefore, an affine transformation incurs an additive change to the entropy, and this change is the logarithm of the determinant of A. If A = I, or more generally if A is a rotation, then $\log_2 |\det A| = 0$, so that differential entropy is invariant by both translation and rotation.
Y =X +Z
Therefore:

$$h(Y) \le \frac{1}{2}\log_2(2\pi e(N + P)),$$
Definition 7.5.3. The AWGN (Additive White Gaussian Noise) channel is given by:

$$Y(t) = x(t) + Z(t)$$

where x(t) is bandlimited to [−W, W] with total power P and Z(t) is white Gaussian noise with power spectral density N₀.
In communication systems, increasing the bandwidth yields much larger gains than increasing the power, especially if the SNR of the typical user is already high. Also, the formula for the capacity of the AWGN channel allows one to predict the performance of many practical communication systems past and present; while the capacity is an upper bound on the best performance that can be achieved in ideal conditions (infinite processing power for coding and decoding, for instance), the formula allows one to roughly predict the typical performance, provided one knows the typical SNR as well as the bandwidth. Here are three illustrative examples. Telephone lines: W = 3.3 kHz, P/(W N₀) = 33 dB, C = 36 kbit/s. WiFi: W = 40 MHz, P/(W N₀) = 30 dB, C = 400 Mbit/s. 4G networks: W = 20 MHz, P/(W N₀) = 20 dB, C = 133 Mbit/s.
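A minimal sketch reproducing these figures from the formula $C = W \log_2(1 + P/(W N_0))$:

```python
import math

def awgn_capacity(W_hz, snr_db):
    """Shannon capacity of the AWGN channel, SNR given in dB."""
    return W_hz * math.log2(1 + 10 ** (snr_db / 10))

print(awgn_capacity(3.3e3, 33) / 1e3, "kbit/s")   # telephone line, ~36
print(awgn_capacity(40e6, 30) / 1e6, "Mbit/s")    # WiFi, ~400
print(awgn_capacity(20e6, 20) / 1e6, "Mbit/s")    # 4G, ~133
```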
Definition 7.5.5. A set of parallel Gaussian channels with total power P is:

$$Y_j = X_j + Z_j, \quad j = 1, \ldots, k$$

with $Z_j \sim N(0, N_j)$ independent and input power constraint $\sum_{j=1}^k \mathbb{E}(X_j^2) \le P$. The capacity is:

$$C = \sum_{j=1}^k \frac{1}{2}\log_2\Big(1 + \frac{(\lambda^\star - N_j)^+}{N_j}\Big)$$

with λ⋆ the unique solution to $\sum_{j=1}^k (\lambda^\star - N_j)^+ = P$.
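A minimal sketch of water-filling by bisection on the water level (the noise levels are hypothetical):

```python
import math

def waterfill(noises, P, tol=1e-12):
    """Find lambda with sum_j (lambda - N_j)+ = P, then compute the capacity."""
    lo, hi = min(noises), max(noises) + P
    while hi - lo > tol:
        lam = (lo + hi) / 2
        power = sum(max(lam - N, 0.0) for N in noises)
        lo, hi = (lam, hi) if power < P else (lo, lam)
    powers = [max(lo - N, 0.0) for N in noises]   # P_j = (lambda - N_j)+
    C = sum(0.5 * math.log2(1 + p / N) for p, N in zip(powers, noises))
    return powers, C

print(waterfill([1.0, 2.0, 4.0], P=3.0))
```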
$$\Sigma_Z = U^\top \mathrm{diag}(\lambda_1, ..., \lambda_k)\, U$$

Multiplying by U:

$$U Y^k = U X^k + U Z^k.$$

This defines a new channel:

$$\bar{Y}^k = \bar{X}^k + \bar{Z}^k.$$

We have:

$$(\bar{X}^k)^\top \bar{X}^k = (X^k)^\top U^\top U X^k = (X^k)^\top X^k$$
Chapter 8

Portfolio Theory
where $b_{n,i}$ denotes the fraction of his wealth invested in asset i. At the end of day n, he observes the closing prices $(P'_{n,1}, ..., P'_{n,m})$ and realizes his profits and losses, so that the amount of wealth available at the start of day n+1 equals:

$$\frac{S_{n+1}}{S_n} = \sum_{j=1}^m b_{n,j}\frac{P'_{n,j}}{P_{n,j}}$$

so that:

$$\frac{S_n}{S_0} = \prod_{i=1}^{n-1}\Big(\sum_{j=1}^m b_{i,j}\frac{P'_{i,j}}{P_{i,j}}\Big).$$
Indeed, the relative returns of each asset are sufficient in order to predict the
evolution of the wealth. Throughout the chapter we will assume that the vectors of
relative returns Xn = (Xn,1 , ..., Xn,m ) are i.i.d. with some fixed distribution F .
Proof: If the investment strategy is constant, then $\frac{1}{n}\log_2\frac{S_n}{S_0}$ is an empirical average of i.i.d. random variables:

$$\frac{1}{n}\log_2\frac{S_n}{S_0} = \frac{1}{n}\sum_{i=1}^{n-1}\log_2\Big(\sum_{j=1}^m b_j X_{i,j}\Big)$$

each with expectation W(b, F), so the strong law of large numbers yields the result.
The above proposition shows that, if the investor chooses a fixed investment strategy across time, then with high probability wealth will grow exponentially as a function of time:

$$S_n \approx S_0\, 2^{nW(b,F)}$$

and the exponent equals the growth rate of the portfolio W(b, F). Perhaps surprisingly, if the growth rate is strictly positive, then with high probability the wealth asymptotically grows to infinity.
$$\text{maximize } W(b,F) \text{ subject to } \sum_{i=1}^m b_i \le 1 \text{ and } b \ge 0$$
The previous results suggest that, if the investor knows the distribution of the returns F, then he should select the portfolio maximizing the growth rate, to ensure that his wealth grows as rapidly as possible. While this is not the only possible objective function in portfolio theory, it comes with strong guarantees provided that returns are indeed i.i.d. Other possible objective functions in portfolio theory are for instance linear combinations of the mean and variance of the returns, as there exists a trade-off between high-risk/high-return and low-risk/low-return portfolios.

Another interesting observation is that maximizing the growth rate is usually different from maximizing the expected returns $\mathbb{E}(\sum_{i=1}^m b_i X_i)$, which can be achieved by selecting $b_i = 1\{i = i^\star\}$ where $i^\star = \arg\max_i \mathbb{E}(X_i)$, i.e. the investor places all of his wealth on the stock with highest average return, a risky strategy indeed. Usually, maximizing the growth rate is much more conservative, due to the logarithm, which places a heavy penalty on the wealth $\sum_{i=1}^m b_i X_i$ becoming very close to 0. In other words, maximizing the growth rate discourages portfolios that can bankrupt the investor in a day.
$$\nabla W(b^\star, F) + \lambda\mathbf{1} + \mu = 0$$

Therefore λ = 1/log(2), and substituting yields the result.

The KKT conditions are necessary and sufficient conditions for the optimality of the portfolio, and if F is known, one can search for the optimal portfolio using an iterative scheme such as gradient descent.
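A minimal sketch of growth-rate maximization for two assets by direct search (a hypothetical two-point return distribution, with the second asset playing the role of cash):

```python
import math

# With probability 1/2 the stock doubles, with probability 1/2 it halves.
outcomes = [((2.0, 1.0), 0.5), ((0.5, 1.0), 0.5)]

def W(b):
    """Growth rate of the portfolio (b, 1 - b)."""
    return sum(p * math.log2(b * x1 + (1 - b) * x2) for (x1, x2), p in outcomes)

best = max((i / 1000 for i in range(1001)), key=W)
print(best, W(best))   # b* = 0.5, W(b*) = (1/2) log2(9/8) ≈ 0.085 bits/day
```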
Definition 8.3.2. A portfolio strategy is said to be causal if for all n, $(b_{n,1}, ..., b_{n,m})$ is solely a function of $(X_{n',1}, ..., X_{n',m})$ for n′ < n.
with equality if one selects b⋆, the maximizer of W(b, F), at all times, i.e. constant strategies are optimal.

For any i, when $(b_{i,1}, ..., b_{i,m})$ is an arbitrary function of $(X_{i',1}, ..., X_{i',m})$ for i′ < i, the optimal choice is to select the maximizer of:

$$\mathbb{E}\Big(\log_2\Big(\sum_{j=1}^m b_{i,j}X_{i,j}\Big)\,\Big|\,(X_{i',1}, ..., X_{i',m}),\, i'<i\Big) = \mathbb{E}\Big(\log_2\Big(\sum_{j=1}^m b_{i,j}X_{i,j}\Big)\Big)$$

since $(X_{i,1}, ..., X_{i,m})$ is independent of $(X_{i',1}, ..., X_{i',m})$, i′ < i. Therefore, for each i, $(b_{i,1}, ..., b_{i,m})$ can be chosen as the maximizer of W(b, F), and constant strategies are optimal.
Interestingly, in our setting causal strategies yield no gains with respect to constant strategies. Therefore, the best achievable performance with causal strategies is still given by the growth rate. Of course, this is only true if F is known to the investor and the returns are i.i.d. If F were unknown, then the investor should change his decisions as more and more returns are observed. Similarly, if the returns have a significant correlation in time, then the investment strategy should be time varying, as the returns observed up to time n − 1 can be used to predict the returns at time n and choose a portfolio intelligently.
Proposition 8.4.1. Consider two distributions F and G, and the corresponding log optimal portfolios $b^\star_F$ and $b^\star_G$, which maximize W(b, F) and W(b, G) respectively. Then we have:

$$W(b_F^\star, F) - W(b_G^\star, F) \le D(F||G).$$

In other words, the amount of growth rate lost by the investor due to his imperfect knowledge is upper bounded by the relative entropy between the true distribution F and his estimate G. So the wealth of an investor with perfect knowledge will be approximately $2^{nW(b_F^\star,F)}$, while the wealth of an investor with imperfect knowledge will be approximately (at least) $2^{n[W(b_F^\star,F) - D(F||G)]}$. It should also be noted that this bound is tight for some distributions of X. This is indeed a surprising link between portfolio theory and information theory.
Definition 8.4.2. The growth rate of portfolio b with side information Y is:

$$W(b, F|Y) = \mathbb{E}\Big(\log_2\Big(\sum_{j=1}^m b_j X_j\Big)\,\Big|\,Y\Big)$$
If the investor has access to side information Y, then he should select the portfolio maximizing W(b, F|Y). While this certainly yields a better performance compared to the case with no side information, one can wonder how much growth rate is gained with side information (for instance when the investor must pay some premium in order to access the side information). Intuitively, this should depend on how much X and Y are correlated.
Proposition 8.4.3. Consider b⋆ the log optimal portfolio maximizing W(b, F) and $b^\star_{|Y}$ the log optimal portfolio with side information maximizing W(b, F|Y). Then we have:

$$0 \le W(b^\star_{|Y}, F|Y) - W(b^\star, F) \le I(X;Y)$$
Proof: If Y = y, from our previous result, the loss of growth rate between an investor who assumes that X has distribution $G = p_X$ and an investor who knows the actual distribution $F = p_{X|Y=y}$ is at most:

$$D(p_{X|Y=y}||p_X) = \sum_{x\in\mathcal{X}} p_{X|Y}(x|y)\log_2\frac{p_{X|Y}(x|y)}{p_X(x)}$$
Chapter 9

Machine Learning and Statistics

9.1 Statistics
9.1.1 Statistical Inference
Assume that we are given n data points X1 , ..., Xn in a finite set X drawn i.i.d.
from some unknown distribution Q. We would like to perform statistical inference,
meaning that we would like to learn information about the unknown distribution Q,
solely by observing the data points X1 , ..., Xn . Of course, depending on what kind
of information we wish to obtain, the resulting problems can be vastly different.
We give a few examples.
function of the data T such that both P(T = 0|Q ∈ H0 ) and P(T = 1|Q ∈ H1 )
are close to 1.
It is noted that the type $P_{x^n}$, i.e. the empirical distribution $P_{x^n}(a) = \frac{1}{n}\sum_{i=1}^n 1\{x_i = a\}$, is indeed a distribution over $\mathcal{X}$, since it has positive entries and sums to 1, and that it is an element of the set of probability distributions over $\mathcal{X}$:

$$\mathcal{P} = \Big\{P \in (\mathbb{R}^+)^{\mathcal{X}} : \sum_{a\in\mathcal{X}} P(a) = 1\Big\}$$

This set is often called the probability simplex, and has dimension $|\mathcal{X}| - 1$.
The reason why the most natural strategy is to compute the empirical distribution of the data is that it converges to the true distribution when the number of data points grows large, as a consequence of the law of large numbers.
Proposition 9.1.2. If X n = (X1 , ..., Xn ) are drawn i.i.d. from distribution Q, then
the type of X n converges to Q almost surely.
Proof: Consider Xⁿ = (X₁, ..., Xₙ) i.i.d. from distribution Q; then the probability distribution of Xⁿ only depends on its type, in the sense that:

$$\mathbb{P}(X^n = x^n) = \prod_{i=1}^n Q(x_i) = \prod_{a\in\mathcal{X}} Q(a)^{\sum_{i=1}^n 1\{x_i=a\}} = \prod_{a\in\mathcal{X}} Q(a)^{nP_{x^n}(a)}.$$

Indeed, the expression above only depends on the type $P_{x^n}$, so that all sequences that have the same type are equally likely to occur.
Furthermore, taking logarithms and dividing by n:

$$-\frac{1}{n}\log_2\mathbb{P}(X^n = x^n) = \sum_{a\in\mathcal{X}} P_{x^n}(a)\log_2\frac{1}{Q(a)} = \sum_{a\in\mathcal{X}} P_{x^n}(a)\log_2\frac{P_{x^n}(a)}{Q(a)} + \sum_{a\in\mathcal{X}} P_{x^n}(a)\log_2\frac{1}{P_{x^n}(a)} = D(P_{x^n}||Q) + H(P_{x^n}).$$
So not only does the probability of a sequence only depend on its type, but the
exponent is equal to the sum of the entropy of the type, and the relative entropy
between the type and the true distribution. This implies that the most likely type is
the true distribution, and also that, when n is large, types that are far away from
the true distribution are very unlikely to occur.
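A minimal sketch checking this identity numerically on a hypothetical alphabet and distribution:

```python
import math
from collections import Counter

def type_of(x):
    """Empirical distribution (type) of a sequence."""
    n = len(x)
    return {a: c / n for a, c in Counter(x).items()}

def H(P):
    return sum(p * math.log2(1 / p) for p in P.values() if p > 0)

def D(P, Q):
    return sum(p * math.log2(p / Q[a]) for a, p in P.items() if p > 0)

Q = {"a": 0.6, "b": 0.4}
x = list("aabab")
P = type_of(x)
logp = sum(math.log2(Q[c]) for c in x)          # log2 P(X^n = x^n)
print(-logp / len(x), H(P) + D(P, Q))           # the two values coincide
```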
$$|\mathcal{P}_n| \le (n+1)^{|\mathcal{X}|}$$

Proof: One can readily check that the entries of $P_{x^n}$ are integer multiples of 1/n by definition. Furthermore:

$$|\mathcal{P}_n| \le (n+1)^{|\mathcal{X}|}$$

since $|\mathcal{P}_n|$ is the number of vectors whose components are positive integer multiples of 1/n summing to 1, and $(n+1)^{|\mathcal{X}|}$ is the number of vectors whose components are integer multiples of 1/n between 0 and 1.
$$T(P) = \{x^n \in \mathcal{X}^n : P_{x^n} = P\}$$

$$|T(Q)| \le 2^{nH(Q)}$$
One may check that the maximum in the above occurs for P = Q, which gives the bound, where:

$$P^\star = \arg\min_{P\in E} D(P||Q)$$

Using the fact that P⋆ minimizes D(P||Q) over E, summing the above over P and using the fact that $|\mathcal{P}_n \cap E| \le (n+1)^{|\mathcal{X}|}$, we get the first result:

$$\mathbb{P}(P_{X^n} \in E) \le (n+1)^{|\mathcal{X}|}\, 2^{-nD(P^\star||Q)}.$$

If E is the closure of its interior, we can find a sequence of Pₙ such that, when n → ∞:

$$D(P_n||Q) \to D(P^\star||Q)$$

and in turn:

$$\mathbb{P}(P_{X^n} \in E) \ge \mathbb{P}(P_{X^n} = P_n).$$
9.3.2 Examples

We now highlight a few examples of how Sanov's theorem may be applied to various statistical problems.

Majority Vote. Consider an election with two candidates, where Q(1), Q(2) are the proportions of people who prefer candidates 1 and 2 respectively. We gather the votes X₁, ..., Xₙ of n voters, which we'll assume to be i.i.d. distributed from Q. The candidate who wins is the one who gathers the most votes. Assume that Q(1) > 1/2, so that 1 is the favorite candidate. What is the probability that 2 gets elected in place of 1?

The votes Xⁿ = (X₁, ..., Xₙ) are an i.i.d. sample from Q, and 2 gets elected if and only if $P_{X^n}(2) \ge 1/2$, i.e. he gets at least n/2 votes. So 2 gets elected if and only if $P_{X^n} \in E$ where:

$$E = \{P \in \mathcal{P} : P(2) \ge 1/2\}$$
We can then apply Sanov's theorem to conclude that 2 gets elected in place of 1 with probability

$$\mathbb{P}(P_{X^n} \in E) \approx 2^{-nD(P^\star||Q)}$$

with P⋆ = (1/2, 1/2), so that:

$$D(P^\star||Q) = \frac{1}{2}\log_2\frac{1/2}{Q(2)} + \frac{1}{2}\log_2\frac{1/2}{1-Q(2)}.$$
Indeed:

$$D(P||Q) = P(2)\log_2\frac{P(2)}{Q(2)} + (1-P(2))\log_2\frac{1-P(2)}{1-Q(2)}$$

and minimizing this quantity over P(2) under the constraint P(2) ≥ 1/2 gives P(2) = 1/2, since Q(2) ≤ 1/2.
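A minimal sketch comparing the exact probability of an upset with the Sanov exponent (Q(2) = 0.4 is a hypothetical value):

```python
import math

q2 = 0.4                 # true support of candidate 2
Dstar = 0.5 * math.log2(0.5 / q2) + 0.5 * math.log2(0.5 / (1 - q2))

for n in [100, 1000]:
    # Exact probability that candidate 2 gets at least n/2 of n i.i.d. votes.
    p_upset = sum(math.comb(n, k) * q2**k * (1 - q2)**(n - k)
                  for k in range(math.ceil(n / 2), n + 1))
    print(n, -math.log2(p_upset) / n, "vs D(P*||Q) =", Dstar)
```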
Testing Fairness. Assume that one is given a die with k faces, and we want to test whether or not the die is fair, in the sense that it is equally likely to fall on each of its faces. Consider Xⁿ = (X₁, ..., Xₙ) the outcomes of casting the die n times, where $X_i \in \mathcal{X}$ is the index of the face on which the die has fallen. To test fairness of the die we compute the empirical distribution $P_{X^n}$ and compare it to Q, the uniform distribution over $\mathcal{X}$. Namely, if $D(P_{X^n}||Q) \le \epsilon$ we deem the die fair, and unfair otherwise.

What is the probability that we mistake a fair die for an unfair one?

If the die is fair, Xⁿ = (X₁, ..., Xₙ) is an i.i.d. sample from Q, and we mistake it for an unfair die if and only if $P_{X^n} \in E$ where:

$$E = \{P \in \mathcal{P} : D(P||Q) \ge \epsilon\}$$

By Sanov's theorem, this occurs with probability approximately $2^{-nD(P^\star||Q)}$,
with D(P ⋆ ||Q) = minP ∈E D(P ||Q) = ϵ. It is remarkable that Sanov’s theorem
allows for an easy, explicit computation.
Testing General Distributions. It is also noted that the above works in the more general case where Q is not the uniform distribution but simply some target distribution: namely, reject the hypothesis that P = Q if $D(P_{X^n}||Q) \ge \epsilon$ and accept otherwise.
Chapter 10
Mathematical Tools
In this chapter we provide a few results that are instrumental for some proofs.
Results are stated without proofs.