
Technische Universität München

Lehrstuhl für Nachrichtentechnik


Prof. Dr. sc. techn. Gerhard Kramer

Lecture Notes for

Information Theory

Prof. Dr. sc. techn. Gerhard Kramer


Dr. Christian Deppe
Emna Ben Yacoub M.Sc.

Institute for Communications Engineering


Technische Universität München
Building: N4, 2nd – 4th Floor

Edition: WS 2021/22
Authors: G. Kramer
(all rights reserved)
Information Theory,
Winter Semester 2021/2022
Lecturers:
• Prof. Dr. sc. techn. Gerhard Kramer,
Phone: (089) 289-23491
Email: gerhard.kramer@tum.de
Address: Institute for Communications Engineering, room N2403
Office hours: on request
• Dr. Christian Deppe,
Phone: (089) 289-23458
Email: christian.deppe@tum.de
Address: Institute for Communications Engineering, Room N3404
Office hours: on request
Teaching Assistant:
• Emna Ben Yacoub M.Sc.,
Phone: (089) 289-23436
Email: emna.ben-yacoub@tum.de
Address: Institute for Communications Engineering, Room N3406
Office hours: on request
Schedule:
• Lecture/Tutorial:
– Monday: 13:15 - 14:45
– Tuesday: 15:00 - 17:30
General information about the lecture and the tutorial:

• The material related to the lecture and tutorial will be made avail-
able in Moodle (http://www.moodle.tum.de). Each week after the
lecture, a problem set addressing the material discussed in the lecture
will be uploaded and discussed in the following tutorial session(s).
You are strongly advised to attempt to solve these problems before
the corresponding tutorial. In addition, detailed solutions of the prob-
lem sets will be uploaded after each tutorial session.
• Announcements will be made in Moodle.
• All material discussed in the lectures or tutorials is relevant for the
exam unless mentioned otherwise.
• In addition to the final exam, a midterm exam will take place in which
a bonus worth 10 % of the final exam can be achieved. Participating
in the midterm exam is optional.
• For both the midterm and final exam, no notes and no calculators are
allowed.
Contents

1. Information Theory for Discrete Variables
1.1. Message Sets
1.2. Measuring Choice
1.3. Entropy
1.4. Example Distributions
1.4.1. Binomial Distribution
1.4.2. Poisson Distribution
1.4.3. Geometric Distribution
1.5. Conditional Entropy
1.6. Joint Entropy
1.7. Informational Divergence
1.8. Mutual Information
1.9. Cross Entropy
1.10. Inequalities
1.10.1. Log-Sum Identity and Inequality
1.10.2. Data Processing Inequalities
1.10.3. Fano's Inequality
1.10.4. Pinsker's Inequality
1.11. Convexity Properties

2. Information Theory for Continuous Variables
2.1. Differential Entropy
2.2. Mixed Distributions
2.3. Example Distributions
2.3.1. Discrete Random Variables
2.3.2. Exponential Distribution
2.3.3. Gaussian Distribution
2.4. Informational Divergence
2.5. Cross Entropy
2.6. Maximum Entropy
2.6.1. Alphabet or Volume Constraint
2.6.2. First Moment Constraint
2.6.3. Second Moment Constraint
2.7. Appendix: Table of Differential Entropies

3. Coding of Memoryless Sources
3.1. Discrete Memoryless Source
3.2. Coding a Single Random Variable
3.3. Rooted Trees and the Kraft Inequality
3.4. Rooted Trees with Probabilities
3.5. Entropy Bounds on Source Coding
3.6. Coding Theorem for Discrete Memoryless Sources
3.7. Huffman Codes
3.8. Dictionary Parsers
3.9. Tunstall Codes

4. Coding of Stationary Sources
4.1. Introduction
4.2. Discrete Stationary Sources
4.3. Entropy Bounds on Source Coding a DSS
4.4. Elias Code for the Positive Integers
4.5. Elias-Willems Universal Source Coding

5. Channel Coding
5.1. Rate, Reliability, and Cost
5.2. Memoryless Channels
5.3. Cost Functions
5.4. Block and Bit Error Probability
5.5. Random Coding
5.5.1. Block Error Probability
5.5.2. Capacity-Cost Function
5.5.3. Maximum Error Probability
5.6. Concavity and Converse
5.7. Discrete Alphabet Examples
5.7.1. Binary Symmetric Channel
5.7.2. Binary Erasure Channel
5.7.3. Strongly Symmetric Channels
5.7.4. Symmetric Channels
5.8. Continuous Alphabet Examples
5.8.1. AWGN Channel
5.8.2. AWGN Channel with BPSK
5.8.3. Complex AWGN Channel
5.8.4. Parallel AWGN Channels
5.8.5. Vector AWGN Channels
5.8.6. AWGN Channels with Receiver Channel Information
5.9. Source and Channel Coding
5.9.1. Separate Coding
5.9.2. Rates Beyond Capacity
5.10. Feedback

A. Discrete Probability
A.1. Events, Sample Space, and Probability Measure
A.2. Discrete Random Variables
A.3. Independent Random Variables
A.4. Probabilistic Dependence via Functional Dependence
A.5. Establishing Conditional Statistical Independence
A.6. Expectation
A.7. Second-Order Statistics for Scalars
A.8. Second-Order Statistics for Vectors
A.9. Conditional Expectation Random Variables
A.10. Linear Estimation
A.11. Markov Inequalities
A.12. Jensen Inequalities
A.13. Weak Law of Large Numbers
A.14. Strong Law of Large Numbers
A.15. Problems

B. Problems
B.1. Information Theory for Discrete Variables
B.2. Information Theory for Continuous Variables
B.3. Coding of Memoryless Sources
B.4. Coding of Stationary Sources
B.5. Channel Coding
Notation
We use standard notation for probabilities, random variables, entropy, mu-
tual information, and so forth. Table 0.1 lists notation, some of it developed
in the appendices, and we use this without further explanation in the main
body of the text. We introduce the remaining notation as we go along.
The reader is referred to the appendices for a review of relevant probability
theory concepts.

Table 0.1.: Probability and Information Theory Notation.


Sets
Ω sample space
Ac complement of the set A in Ω
N0 and N1 natural numbers {0, 1, 2, . . .} and {1, 2, . . .}
Z, R, C integers, real numbers, complex numbers
Strings, Vectors, Matrices
x^n the string x1 x2 . . . xn or x1, x2, . . . , xn
x^n y^m string concatenation: x1 x2 . . . xn y1 y2 . . . ym
x the vector [x1, x2, . . . , xn]
x^T, x^† transpose of x, complex-conjugate transpose of x
H, H^T, H^† a matrix, its transpose, and its complex-conjugate transpose
|Q| determinant of the matrix Q
Probability
Pr [A] probability of the event A
Pr [A|B] probability of event A conditioned on event B
PX (·) probability distribution of the random variable X
PX|Y (·) probability distribution of X conditioned on Y
supp(PX ) support of PX , i.e., the set of a such that PX (a) > 0
pX (·) probability density of the random variable X
pX|Y (·) probability density of X conditioned on Y
Expectation and Variance
E[X] expectation of the real-valued X
E[X|A] expectation of X conditioned on event A
E[X|Y ] random variable that takes on the value E[X|Y = y] if Y = y
Var[X] variance of the real-valued X
Var[X|A] variance of X conditioned on event A
Var[X|Y ] random variable that takes on the value Var[X|Y = y] if Y = y
Cov[X, Y ] covariance of X and Y
QX covariance matrix of X
Cov[X, Y ] covariance matrix of X and Y
Information Theory
H(X) entropy of the discrete random variable X
H(X|Y ) entropy of X conditioned on Y
I(X; Y ) mutual information between X and Y
I(X; Y |Z) mutual information between X and Y conditioned on Z
D(PX‖PY) informational divergence between PX and PY
D(PX‖PY|PZ) informational divergence between PX and PY conditioned on Z
h(X) differential entropy of X
h(X|Y ) differential entropy of X conditioned on Y
H2 (·) binary entropy function
Chapter 1.

Information Theory for Discrete Variables

1.1. Message Sets


Information theory was born as A Mathematical Theory of Communication
as developed by Shannon in [1]. Shannon was particularly interested in
messages and he wrote that:
The fundamental problem of communication is that of repro-
ducing at one point either exactly or approximately a message
selected at another point.
But what is a message? Shannon suggested that a message has to do with
choice and sets. In his words:
The significant aspect is that the actual message is one selected
from a set of possible messages. The system must be designed
to operate for each possible selection, not just the one which will
actually be chosen since this is unknown at the time of design.
For example, suppose we wish to communicate the local weather to a friend.
Suppose that our message set is

{sunny, cloudy, rain, thunderstorm}.

We could, e.g., communicate the weather “sunny” by transmitting this word.


However, from an engineering perspective this approach is inefficient. We
could instead represent the four elements in the message set by using only
binary digits or bits:
{00, 01, 10, 11}
where 00 represents “sunny”, 01 represents “cloudy”, and so forth. The
main point is, again in Shannon’s words, that although
the messages have meaning ... these semantic aspects of com-
munication are irrelevant to the engineering problem.


1.2. Measuring Choice


Shannon was interested in defining
a quantity which will measure, in some sense, how much infor-
mation is “produced” by
choosing messages [1, Sec. 6]. He suggested that the logarithm of the number
of elements in the message set
can be regarded as a measure of the information produced when
one message is chosen from the set, all choices being equally
likely.
This logarithmic measure of information was also suggested by Hartley [2].
For example, consider a fair coin that takes on values in {Heads, Tails} and
suppose we flip the coin n times. The string of coin flips takes on one of 2n
values, all equally likely, and the information produced is

log2 2n = n bits.

Note that the base of the logarithm simply defines the units of measurement.
For non-equally likely choices Shannon developed a measure that has the
same form as the entropy in statistical mechanics. For example, consider a
biased coin that takes on the value Heads with probability p and Tails with
probability 1 − p. Shannon proposed that the information produced when
flipping the coin n times is
n ( p log2(1/p) + (1 − p) log2(1/(1 − p)) ) bits.

One motivation for this choice is as follows: approximately np of the n coin


flips should take on the value Heads. Furthermore, each string of coin flips
with approximately np Heads will be equally likely. If we take the logarithm
of the number of such strings we obtain
log2 (n choose np) ≈ n ( p log2(1/p) + (1 − p) log2(1/(1 − p)) ) bits   (1.1)

where we have used Stirling's approximation¹ to write

(n choose np) = n! / ( (np)! (n(1 − p))! ) ≈ 1/√(2πp(1 − p)n) · 1/( p^(np) (1 − p)^((1−p)n) ).

For instance, if p = 1/2 then the right-hand side of (1.1) gives n bits, as
expected. But if p = 0.11 then we compute n/2 bits.
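As a quick numerical check of this counting argument, the following Python sketch compares log2 of the exact number of length-n strings with np Heads against n·H2(p); the values of n and p are arbitrary illustration choices.

from math import comb, log2

def H2(p):
    """Binary entropy function in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

n, p = 1000, 0.11
k = round(n * p)
exact = log2(comb(n, k))   # log2 of the number of strings with k Heads
approx = n * H2(p)         # approximation (1.1)
print(f"log2 (n choose k) = {exact:.1f} bits, n*H2(p) = {approx:.1f} bits")
# For p = 0.11 both values are close to n/2 = 500 bits, as claimed above.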
More generally, consider an experiment that has m possible outcomes with
probabilities pi , i = 1, 2, . . . , m. Suppose we repeat the experiment n times.

¹Stirling’s approximation is n! ≈ √(2πn) (n/e)^n.


We expect outcome i to occur approximately npi times for all i, and the
logarithm of the number of expected strings is

log2 (n choose np1, np2, . . . , npm) ≈ n Σ_{i=1}^{m} −pi log2 pi.   (1.2)

The amount of information obtained by observing the outcomes of the experiments thus seems to grow at the rate Σ_{i=1}^{m} −pi log2 pi.

1.3. Entropy
The word “entropy” was invented by Clausius [3, p. 390] from the Greek ἡ τροπή (hē tropē) for “turn” or “change”. Clausius was attempting to find a Greek word
• that was related to the German “Verwandlungsinhalt” which trans-
lates literally to “transformation content”;
• for which the German version, in this case “Entropie”, sounds similar
to the German word for energy, in this case “Energie”.
Clausius had a physical notion of “transformation content.” Shannon instead
related entropy to messages and sets. In other words, Shannon entropy is
not necessarily physical: it has to do with choice and uncertainty in a broader
sense than envisioned by Clausius.
We now become more formal in our treatment. Let supp(f ) be the support
set of the function f , i.e., the set of a such that f (a) > 0. We define the
entropy or uncertainty of the discrete random variable X as
H(X) = Σ_{a∈supp(PX)} −PX(a) log2 PX(a).   (1.3)

We remark that by taking the sum over the support of PX (·) we avoid
dealing with the expression 0 · log2 0. Many authors simply define 0 · log2 0
to be zero because limx→0+ x log2 x = 0, where the notation ”0+ ” means that
limit is taken from above. However, it is often instructive to be concerned
about events with probability zero, see Sec. 1.7 below.

Example 1.1. The entropy H(X) depends only on the probability distri-
bution PX (·), and not on what we call the letters a in the alphabet X of X.
This idea is consistent with our discussion in Sec. 1.1. We thus have

H(X) = H(g(X)) (1.4)

for any invertible function g(·), since g(·) simply relabels the letters in X .

We can express (1.3) in an insightful way by considering Y = − log2 PX (X)


to be a random variable that is a function of X. The entropy H(X) is then


Figure 1.1.: The binary entropy function H2(·), plotted as H2(p) versus p for 0 ≤ p ≤ 1.

the expectation of Y :

H(X) = E[− log2 PX (X)] . (1.5)

In cases where the distribution PX is a variable, we may use the notation


H(PX ) rather than H(X) (see Sec. 1.11). We have chosen to evaluate the
logarithm using the base 2, and we continue to follow this convention for
discrete random variables. Our entropy units are, therefore, bits.
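The definition (1.3) is straightforward to evaluate numerically. The following Python sketch (the helper name entropy and the example distributions are our own choices) sums over the support of PX:

from math import log2

def entropy(P):
    """Entropy in bits of a distribution given as a list of probabilities.
    The sum runs only over the support, as in (1.3)."""
    return sum(-p * log2(p) for p in P if p > 0)

print(entropy([0.5, 0.5]))      # 1.0 bit (fair coin)
print(entropy([1/6] * 6))       # log2(6) ≈ 2.585 bits (fair die)
print(entropy([0.11, 0.89]))    # ≈ 0.5 bits, the value used in Sec. 1.2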

Example 1.2. Consider the Bernoulli distribution that has X = {0, 1}


and PX (1) = p. The entropy of X is

H(X) = H2 (p) = −p log2 p − (1 − p) log2 (1 − p) (1.6)

and H2 (·) is called the binary entropy function.

The binary entropy function is plotted in Fig. 1.1. Observe that H2 (0) =
H2 (1) = 0, H2 (0.11) = H2 (0.89) ≈ 1/2, H2 (1/2) = 1, and H2 (p) is maxi-
mized by p = 1/2. More generally, we have the following important result
where we recall that |X | is the number of values in X .


Figure 1.2.: Illustration that ln(x) ≤ x − 1 for x ≥ 0.

Theorem 1.1.

0 ≤ H(X) ≤ log2 |X | (1.7)

with equality on the left if and only if there is one letter a in X with
PX (a) = 1, and with equality on the right if and only if PX (a) = 1/|X | for
all a ∈ X , i.e., X is uniform over X .

Proof. Consider (1.3) and note that for 0 < p ≤ 1 we have −p log2 p ≥ 0
with equality if and only if p = 1. Thus, we have H(X) ≥ 0 with equality if
and only if there is one letter a in X with PX (a) = 1. Consider next (1.5)
and observe that

H(X) = E[ log2( 1/(|X| PX(X)) ) ] + log2 |X|.   (1.8)

But we have the inequality (see Fig. 1.2)

log2 (x) = ln(x) log2 (e) ≤ (x − 1) log2 (e) (1.9)

where ln(x) is the natural logarithm of x, and where equality holds if and


only if x = 1. Applying (1.9) to (1.8) we have

H(X) ≤ E[ 1/(|X| PX(X)) − 1 ] log2(e) + log2 |X|   (step (a))
     = Σ_{a∈supp(PX)} PX(a) ( 1/(|X| PX(a)) − 1 ) log2(e) + log2 |X|
     = ( |supp(PX)|/|X| − 1 ) log2(e) + log2 |X|
     ≤ log2 |X|   (step (b))   (1.10)

with equality in (a) if and only if X is uniform over X , and with equality
in (b) if and only if supp(PX ) = X . 
The two bounds in (1.7) are intuitively pleasing: uncertainty should not
be negative (what would negative uncertainty mean?) and the maximum
uncertainty is when all possibilities are equally likely.

Example 1.3. Consider a fair die with X = {1, 2, 3, 4, 5, 6} and PX (a) =


1/6 for all a ∈ X . Theorem 1.1 tells us that

H(X) = log2 6 ≈ 2.585 bits. (1.11)

Furthermore, an unfair six-sided die has smaller entropy or uncertainty.

Example 1.4. Consider X = {0, 1, 2} and PX (0) = PX (1) = p/2 and


PX (2) = 1 − p. We have
H(X) = −(p/2) log2(p/2) − (p/2) log2(p/2) − (1 − p) log2(1 − p)
     = p + H2(p)   (1.12)

and entropy is maximized at H(X) = log2 (3) if p = 2/3.


1.4. Example Distributions

1.4.1. Binomial Distribution

Consider the random variable

X = Σ_{i=1}^{n} Xi   (1.13)

where the Xi, i = 1, 2, . . . , n, are independent binary random variables with the distribution of Example 1.2. The alphabet of X is X = {0, 1, 2, . . . , n} and PX(·) is called the binomial distribution with

PX(k) = (n choose k) p^k (1 − p)^(n−k),   k ∈ X.   (1.14)

A plot of the distribution with p = 0.5 and n = 20 is shown in Fig. 1.3. We compute E[X] = np, Var[X] = np(1 − p) and

H(X) = − Σ_{k=0}^{n} (n choose k) p^k (1 − p)^(n−k) log2( (n choose k) p^k (1 − p)^(n−k) ).   (1.15)

The expression (1.15) seems not to have a simple closed form for general n. We instead use a Gaussian approximation for large n:

PX(k) ≈ 1/√(2πVar[X]) · e^(−(k − E[X])²/(2Var[X]))   (1.16)

which gives

H(X) ≈ E[ −log2( 1/√(2πVar[X]) · e^(−(X − E[X])²/(2Var[X])) ) ]
     = (1/2) log2( 2πe · np(1 − p) ).   (1.17)
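A small numerical comparison of the exact sum (1.15) with the approximation (1.17), as a Python sketch; the values n = 20 and p = 0.5 match Fig. 1.3 but are otherwise arbitrary:

from math import comb, log2, pi, e

def binomial_entropy(n, p):
    """Exact entropy (1.15) of the binomial distribution, in bits."""
    H = 0.0
    for k in range(n + 1):
        Pk = comb(n, k) * p**k * (1 - p)**(n - k)
        if Pk > 0:
            H -= Pk * log2(Pk)
    return H

n, p = 20, 0.5
exact = binomial_entropy(n, p)
approx = 0.5 * log2(2 * pi * e * n * p * (1 - p))   # approximation (1.17)
print(f"exact H(X) = {exact:.3f} bits, Gaussian approximation = {approx:.3f} bits")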

1.4.2. Poisson Distribution

The Poisson distribution is²

PX(k) = (λ^k / k!) e^(−λ),   k ∈ N0 = {0, 1, 2, . . .}.   (1.18)

Note that the range of X is discrete but infinite. A plot of the distribution with λ = 5 is shown in Fig. 1.4. We compute E[X] = λ and Var[X] = λ. We define H(X) as in (1.3), and H(X) again seems not to have a simple closed form for general λ.

²The notation N refers to the so-called natural numbers {1, 2, 3, . . .} or {0, 1, 2, 3, . . .}. To avoid ambiguity, we refer to the former set as N1 and to the latter as N0.


Figure 1.3.: The binomial distribution for p = 0.5 and n = 20. The Gaussian approximation with E[X] = 10 and Var[X] = 5 is the solid curve.

However, if λ is large then PX(·) is approximately a binomial distribution with variance λ and mean shifted to λ, so we have

H(X) ≈ (1/2) log2(2πeλ).   (1.19)

1.4.3. Geometric Distribution

The geometric distribution arises when counting the number of Bernoulli trials needed for observing a 1. The distribution is

PX(k) = p (1 − p)^(k−1),   k ∈ N1 = {1, 2, 3, . . .}.   (1.20)

A plot of the distribution with p = 0.25 is shown in Fig. 1.5. We compute (see Problem B.1.8) E[X] = 1/p, Var[X] = (1 − p)/p², and

H(X) = H2(p) E[X].   (1.21)
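A quick numerical check of (1.21), as a Python sketch; the truncation point of the infinite sum is an arbitrary choice:

from math import log2

p = 0.25
# Truncate the infinite sum defining H(X); the tail is negligible here.
H_numeric = 0.0
for k in range(1, 2000):
    Pk = p * (1 - p)**(k - 1)
    H_numeric -= Pk * log2(Pk)

H2p = -p * log2(p) - (1 - p) * log2(1 - p)
print(H_numeric, H2p / p)   # both ≈ 3.245 bits, matching H(X) = H2(p) * E[X]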


Figure 1.4.: The Poisson distribution for λ = 5. The Gaussian approximation with E[X] = 5 and Var[X] = 5 is the solid curve.


Figure 1.5.: The geometric distribution for p = 0.25.


1.5. Conditional Entropy

Consider a joint distribution PXY (·) where the random variables X and
Y take on values in the discrete and finite alphabets X and Y, respec-
tively. The conditional entropy of X given the event Y = b with probability
Pr [Y = b] > 0 is
H(X|Y = b) = Σ_{a∈supp(PX|Y(·|b))} −PX|Y(a|b) log2 PX|Y(a|b)
           = E[ −log2 PX|Y(X|Y) | Y = b ].   (1.22)

Using the same steps as in the proof of Theorem 1.1, one can show that

0 ≤ H(X|Y = b) ≤ log2 |X | (1.23)

with equality on the left if and only if PX|Y (a|b) = 1 for some a, and with
equality on the right if and only if PX|Y (a|b) = 1/|X | for all a.

The conditional entropy of X given Y is the average of the values (1.22), i.e., we define

H(X|Y) = Σ_{b∈supp(PY)} PY(b) H(X|Y = b).   (1.24)

Alternatively, we have

H(X|Y) = Σ_{(a,b)∈supp(PXY)} −PXY(a, b) log2 PX|Y(a|b)
       = E[ −log2 PX|Y(X|Y) ].   (1.25)

Example 1.5. Consider the joint distribution

PXY (a, b) a = 0 a = 1
b=0 1/3 1/3 (1.26)
b=1 1/3 0

We compute H(X|Y = 0) = 1, H(X|Y = 1) = 0, and H(X|Y ) = 2/3.
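The numbers in Example 1.5 can be reproduced mechanically from (1.24) and (1.25). A Python sketch (the dictionary P_XY holding the joint distribution is our own construction):

from math import log2

# Joint distribution of Example 1.5: P_XY[(a, b)] = Pr[X = a, Y = b]
P_XY = {(0, 0): 1/3, (1, 0): 1/3, (0, 1): 1/3, (1, 1): 0.0}

P_Y = {b: sum(P_XY[(a, b)] for a in (0, 1)) for b in (0, 1)}

H_X_given_Y = 0.0
for (a, b), Pab in P_XY.items():
    if Pab > 0:
        P_cond = Pab / P_Y[b]                # P_{X|Y}(a|b)
        H_X_given_Y -= Pab * log2(P_cond)    # definition (1.25)

print(H_X_given_Y)   # 2/3 bits, as computed in Example 1.5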

One can show that (try Exercise B.1.10)

0 ≤ H(X|Y ) ≤ log2 |X | (1.27)

with equality on the left if and only if for every b in supp(PY ) there is an
a such that PX|Y (a|b) = 1, and with equality on the right if and only if for
every b in supp(PY ) we have PX|Y (a|b) = 1/|X | for all a. We say that Y
essentially determines X if H(X|Y ) = 0.


The above definitions and bounds extend naturally to more than two ran-
dom variables. For example, consider the distribution PXY Z (·). We de-
fine the conditional entropy of X given Y and the event Z = c with
Pr [Z = c] > 0 as
H(X|Y, Z = c) = Σ_{(a,b)∈supp(PXY|Z(·|c))} −PXY|Z(a, b|c) log2 PX|YZ(a|b, c)
              = E[ −log2 PX|YZ(X|Y, Z) | Z = c ].   (1.28)


1.6. Joint Entropy

The joint entropy of X and Y is defined by considering the concatenation XY of X and Y as a new discrete random variable, i.e., we have

H(XY) = Σ_{(a,b)∈supp(PXY)} −PXY(a, b) log2 PXY(a, b)
      = E[ −log2 PXY(X, Y) ].   (1.29)

Note that we have written H(XY ) rather than H(X, Y ) and the reader
should not confuse XY with “X multiplied by Y ”. Some authors prefer
H(X, Y ) and this is a matter of taste. For example, if one considers XY as
a vector [X, Y ] then the notation H(X, Y ) makes sense. We will follow the
convention of not using punctuation if no confusion arises.3
Using Bayes’ rule in (1.29) we have

H(XY) = E[ −log2( PX(X) PY|X(Y|X) ) ]
      = E[ −log2 PX(X) ] + E[ −log2 PY|X(Y|X) ]
      = H(X) + H(Y|X).   (1.30)

We similarly have

H(XY ) = H(Y ) + H(X|Y ). (1.31)

More generally, we have the chain rule for entropy

H(X1 X2 . . . Xn) = H(X1) + H(X2|X1) + . . . + H(Xn|X1 X2 · · · Xn−1)
                 = Σ_{i=1}^{n} H(Xi|X^(i−1))   (1.32)

where X^j = X1 X2 . . . Xj (see Sec. A.2) and X^0 is a constant.

Example 1.6. Consider the joint distribution of Example 1.5. We compute

H(XY ) = log2 (3) ≈ 1.585 bits


H(X) = H(Y ) = H2 (1/3) ≈ 0.918 bits
H(X|Y ) = H(Y |X) = 2/3 bits

and one may check that (1.30) and (1.31) are satisfied.

Theorem 1.1 and (1.27) give

max (H(X), H(Y )) ≤ H(XY ) ≤ log2 (|X | · |Y|) (1.33)


³We are not entirely consistent with this approach, e.g., we write PXY(a, b) without punctuation in the subscript and with punctuation in the argument.


with equality on the left if and only if X essentially determines Y, or Y essentially determines X, or both, and with equality on the right if and only if PXY(a, b) = 1/(|X| |Y|) for all (a, b).

Example 1.7. Using (1.33), (1.30), and that X essentially determines


f (X) we have

H(f (X)) ≤ H(Xf (X)) = H(X). (1.34)


1.7. Informational Divergence


Suppose we wish to measure how close two distributions are to each other.
A natural approach is to use the ℓ1 distance

d(PX, PY) = Σ_{a∈X} |PX(a) − PY(a)|   (1.35)

where PX(·) and PY(·) have the same domain X. This distance and the so-called variational distance (1/2) d(PX, PY) measure additive differences. Another useful approach is the following that considers logarithmic differences. The informational divergence (or relative entropy or Kullback-Leibler distance) between PX(·) and PY(·) is defined as

D(PX‖PY) = Σ_{a∈supp(PX)} PX(a) log2( PX(a)/PY(a) )   (1.36)

which is the same as

D(PX‖PY) = E[ log2( PX(X)/PY(X) ) ].   (1.37)

We define D(PX‖PY) = ∞ if PY(a) = 0 for some a ∈ supp(PX). Observe that the definition is not symmetric in PX and PY, i.e., we have D(PX‖PY) ≠ D(PY‖PX) in general.

Example 1.8. Consider X = {0, 1} and the Bernoulli distributions with PX(1) = p, PY(1) = q. If 0 < p < 1 then we have

D(PX‖PY) = p log2(p/q) + (1 − p) log2( (1 − p)/(1 − q) ).   (1.38)

If q = 1/2 then we compute D(PX‖PY) = 1 − H2(p). If q = 0 or q = 1 then we have D(PX‖PY) = ∞.
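A short numerical illustration of (1.38), as a Python sketch; the helper name D_bernoulli and the value p = 0.11 are our own choices:

from math import log2

def D_bernoulli(p, q):
    """Informational divergence (1.38) between Bernoulli(p) and Bernoulli(q), in bits."""
    if (p > 0 and q == 0) or (p < 1 and q == 1):
        return float("inf")   # support of P_X not contained in support of P_Y
    D = 0.0
    if p > 0:
        D += p * log2(p / q)
    if p < 1:
        D += (1 - p) * log2((1 - p) / (1 - q))
    return D

p = 0.11
H2p = -p * log2(p) - (1 - p) * log2(1 - p)
print(D_bernoulli(p, 0.5), 1 - H2p)   # equal: divergence is 1 - H2(p) for q = 1/2
print(D_bernoulli(p, 1.0))            # inf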

To avoid the case D(PX‖PY) = ∞ we introduce the notation PX ≪ PY (or PY ≫ PX) to mean that PY(a) = 0 ⇒ PX(a) = 0 for all a ∈ X. In other words, PX ≪ PY is the same as saying that supp(PX) ⊆ supp(PY), or that D(PX‖PY) < ∞ for finite sets. The measure-theoretic terminology is that PX is absolutely continuous with respect to PY.

Example 1.9. Consider a joint distribution PXY and its marginals PX and
PY . We use the notation PX PY (·) to refer to the distribution for which

PX PY (a, b) = PX (a)PY (b) for all a and b. (1.39)

Observe that PX (a) = 0 implies PXY (a, b) = 0. However, we may have


PXY (a, b) = 0 even though PX (a) > 0 and PY (b) > 0 (see Example 1.5). In


other words, we always have

PXY ≪ PX PY   (1.40)

but the statement PXY ≫ PX PY is not necessarily true.

The following result is fundamentally important.

Theorem 1.2.

D(PX‖PY) ≥ 0   (1.41)

with equality if and only if PX (a) = PY (a) for all a ∈ supp(PX ).4

Proof. Apply the inequality (1.9) to the negative of (1.37):

−D(PX‖PY) = E[ log2( PY(X)/PX(X) ) ]
          ≤ E[ PY(X)/PX(X) − 1 ] log2(e)   (step (a))
          = Σ_{a∈supp(PX)} PX(a) ( PY(a)/PX(a) − 1 ) log2(e)
          = ( Σ_{a∈supp(PX)} PY(a) − 1 ) log2(e)
          ≤ 0   (step (b))   (1.42)

with equality in (a) if and only if PY(a) = PX(a) for all a ∈ supp(PX), and with equality in (b) if and only if PY ≪ PX.

Example 1.10. Suppose PY (·) is uniform on X . We compute

D(PX‖PY) = log2 |X| − H(X)   (1.43)

and so Theorem 1.2 gives H(X) ≤ log2 |X | with equality if and only if PX
is uniform. This reproves the interesting part of Theorem 1.1.

Example 1.11. If PX(·) is uniform on X then we have

D(PX‖PY) = −log2 |X| − (1/|X|) Σ_{a∈X} log2 PY(a).   (1.44)

⁴This is the same as requiring PX(a) = PY(a) for all a ∈ X.


Example 1.12. Suppose PX (a) = 1 for some a ∈ X . We then have

D(PX‖PY) = −log2 PY(a).   (1.45)

As in (1.37), given a third discrete random variable Z, we define the conditional informational divergence between PX|Z(·) and PY|Z(·) as

D(PX|Z‖PY|Z|PZ) = Σ_{b∈supp(PZ)} PZ(b) D( PX|Z(·|b) ‖ PY|Z(·|b) )
                = Σ_{(a,b)∈supp(PXZ)} PXZ(a, b) log2( PX|Z(a|b)/PY|Z(a|b) )
                = E[ log2( PX|Z(X|Z)/PY|Z(X|Z) ) ].   (1.46)

Similar to (1.41), we have

D(PX|Z‖PY|Z|PZ) ≥ 0   (1.47)

with equality if and only if PX|Z (a|b) = PY |Z (a|b) for all (a, b) ∈ supp(PXZ ).
Problem B.1.23 develops a chain rule for informational divergence similar
to the chain rule (1.32) for entropy, namely
D(PX^n‖PY^n) = Σ_{i=1}^{n} D( PXi|X^(i−1) ‖ PYi|Y^(i−1) | PX^(i−1) ).   (1.48)


1.8. Mutual Information


The mutual information I(X; Y ) between two random variables X and Y
with respective discrete and finite alphabets X and Y is defined as

I(X; Y) = D(PXY‖PX PY).   (1.49)

More explicitly, we have

I(X; Y) = Σ_{(a,b)∈supp(PXY)} PXY(a, b) log2( PXY(a, b)/(PX(a)PY(b)) ).   (1.50)

Note that PXY = PX PY means that X and Y are statistically independent. Thus, I(X; Y) measures the dependence of X and Y. Recall from Example 1.9 that PXY ≪ PX PY so that I(X; Y) is finite for finite-alphabet random variables.
The term “mutual” describes the symmetry in the arguments of I(X; Y ).
By using the chain rule (1.48), Bayes’ rule, and expanding the logarithm in
(1.50) in various ways we may write

I(X; Y) = D(PY|X‖PY|PX)
        = D(PX|Y‖PX|PY)
        = H(X) − H(X|Y)
        = H(Y) − H(Y|X)
        = H(X) + H(Y) − H(XY)
        = H(XY) − H(X|Y) − H(Y|X).   (1.51)

Every one of these expressions is useful for certain problems.
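For the joint distribution of Example 1.5, the identities in (1.51) can be verified numerically. A Python sketch (variable names are our own):

from math import log2

P_XY = {(0, 0): 1/3, (1, 0): 1/3, (0, 1): 1/3}   # the zero entry is omitted
P_X = {a: sum(v for (x, y), v in P_XY.items() if x == a) for a in (0, 1)}
P_Y = {b: sum(v for (x, y), v in P_XY.items() if y == b) for b in (0, 1)}

def H(P):
    return sum(-p * log2(p) for p in P.values() if p > 0)

I_direct = sum(v * log2(v / (P_X[a] * P_Y[b]))    # definition (1.50)
               for (a, b), v in P_XY.items())
I_via_entropies = H(P_X) + H(P_Y) - H(P_XY)       # identity in (1.51)
print(I_direct, I_via_entropies)                  # both ≈ 0.252 bits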


The definition (1.49) and Theorem 1.2 imply the following inequalities.

Theorem 1.3. We have

I(X; Y ) ≥ 0 (1.52)
H(X|Y ) ≤ H(X) (1.53)
H(XY ) ≤ H(X) + H(Y ) (1.54)

with equality if and only if X and Y are statistically independent.

The inequality (1.53) means that conditioning cannot increase entropy, or


colloquially that conditioning reduces entropy. Note, however, that
H(X|Y = b) can be larger than H(X).


Example 1.13. Consider Example 1.5 for which we compute

H(X) = H(Y ) = H2 (1/3) ≈ 0.918.

We thus have H(X|Y ) ≤ H(X) in accordance with (1.53), but observe that
H(X|Y = 0) > H(X).

The conditional mutual information between X and Y given a random variable Z is defined as

I(X; Y|Z) = Σ_{c∈supp(PZ)} PZ(c) I(X; Y|Z = c)   (1.55)

where we define

I(X; Y|Z = c) = D( PXY|Z(·|c) ‖ PX|Z(·|c)PY|Z(·|c) ).   (1.56)

From (1.46) we may write

I(X; Y|Z) = D(PXY|Z‖PX|Z PY|Z|PZ).   (1.57)

Using (1.56) and Theorem 1.2, we have

I(X; Y |Z = c) ≥ 0 (1.58)
I(X; Y |Z) ≥ 0 (1.59)

with equality in (1.58) if and only if X and Y are statistically independent


conditioned on the event Z = c, and with equality in (1.59) if and only if X
and Y are statistically independent when conditioned on any event Z = c
with positive probability. Equality in (1.59) is thus the same as saying that
X − Z − Y forms a Markov chain (see Sec. A.3).
We compute

I(X; Y |Z) = H(X|Z) − H(X|Y Z)


= H(Y |Z) − H(Y |XZ). (1.60)

We thus have

0 ≤ I(X; Y |Z) ≤ min (H(X|Z), H(Y |Z)) (1.61)

with equality on the left if and only if X − Z − Y forms a Markov chain, and
with equality on the right if and only if Y Z essentially determines X, or
XZ essentially determines Y , or both. The left-hand side of (1.61) implies

H(X|Y Z) ≤ H(X|Z) (1.62)

with equality if and only if X − Z − Y forms a Markov chain.


We can expand mutual information as follows:

I(X1 X2 . . . Xn; Y) = I(X1; Y) + I(X2; Y|X1) + . . . + I(Xn; Y|X1 X2 · · · Xn−1)
                    = Σ_{i=1}^{n} I(Xi; Y|X^(i−1)).   (1.63)

The expansion (1.63) is called the chain rule for mutual information.

Example 1.14. Using the chain rule for mutual information and the left-
hand side of (1.61), we have

I(X; Y ) ≤ I(X; Y Z)
(1.64)
I(X; Y ) ≤ I(XZ; Y ).


1.9. Cross Entropy


The cross entropy of two distributions PX (·) and PY (·) with the same do-
main is defined as
X(PX‖PY) = Σ_{a∈supp(PX)} −PX(a) log2 PY(a)   (1.65)

which is the same as

X(PX‖PY) = E[ −log2 PY(X) ].   (1.66)

We thus have

X(PX‖PY) = H(X) + D(PX‖PY)   (1.67)

and Theorem 1.2 gives the following result.

Theorem 1.4.

X(PX‖PY) ≥ H(X)   (1.68)

with equality if and only if PX (a) = PY (a) for all a ∈ X .

Theorem 1.4 gives a useful tool to upper bound entropy: every choice of
PY (·) gives an upper bound on H(X)! To simplify the computation, one
usually chooses PY (·) to have an exponential form, such as a Gaussian
distribution for continuous random variables as in Section 2.6.3.

Example 1.15. Consider Example 1.10 where PY (·) is uniform on X . We


compute

X(PX‖PY) = log2 |X| ≥ H(X)   (1.69)

which matches Theorem 1.4.


1.10. Inequalities

1.10.1. Log-Sum Identity and Inequality

Theorem 1.5. (Log-sum Identity and Inequality) Consider positive ai and non-negative bi, i = 1, 2, . . . , n, and suppose that at least one of the bi is positive. Let Sa = Σ_{i=1}^{n} ai, Sb = Σ_{i=1}^{n} bi, and define PX(i) = ai/Sa and PY(i) = bi/Sb for i = 1, 2, . . . , n. We have

Σ_{i=1}^{n} ai log(ai/bi) = Sa D(PX‖PY) + Sa log(Sa/Sb)   (1.70)
                          ≥ Sa log(Sa/Sb)   (1.71)

with equality if and only if ai/bi = Sa/Sb for all i.

Proof. If bi = 0 for some i then the left-hand side of (1.70) is infinity and
the identity and the inequality are valid. Now suppose that bi > 0 for all i.
The identity (1.70) follows by substitution. The inequality (1.71) and the
condition for equality follow by Theorem 1.2. 
Example 1.16. Consider n = 2 for which the log-sum inequality is

a1 log(a1/b1) + a2 log(a2/b2) ≥ (a1 + a2) log( (a1 + a2)/(b1 + b2) )   (1.72)

with equality if and only if a1/b1 = a2/b2 = (a1 + a2)/(b1 + b2).
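A direct numerical check of (1.72), as a Python sketch; the values of a1, a2, b1, b2 are arbitrary positive numbers:

from math import log2

a1, a2 = 3.0, 1.0
b1, b2 = 1.0, 2.0

lhs = a1 * log2(a1 / b1) + a2 * log2(a2 / b2)
rhs = (a1 + a2) * log2((a1 + a2) / (b1 + b2))
print(lhs, rhs, lhs >= rhs)   # the log-sum inequality (1.72) holds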

1.10.2. Data Processing Inequalities

Theorem 1.6. (Data Processing Inequalities) If X − Y − Z forms a


Markov chain, then we have

I(X; Z) ≤ I(X; Y ) and I(X; Z) ≤ I(Y ; Z). (1.73)

If Y1 and Y2 are the outputs of a channel PY |X (·) with inputs X1 and X2 ,


respectively, then we have

D(PY1‖PY2) ≤ D(PX1‖PX2).   (1.74)

Proof. We have

I(X; Z) ≤ I(X; YZ)   (step (a))
        = I(X; Y) + I(X; Z|Y)   (step (b))
        = I(X; Y)   (step (c))   (1.75)

where (a) follows by (1.64), (b) follows by the chain rule (1.63), and (c) follows because X − Y − Z forms a Markov chain. One can prove I(X; Z) ≤ I(Y; Z) in the same way. Next, by the chain rule (1.48) we have the two expansions

D(PX1Y1‖PX2Y2) = D(PX1‖PX2) + D(PY1|X1‖PY2|X2|PX1)   (1.76)
D(PX1Y1‖PX2Y2) = D(PY1‖PY2) + D(PX1|Y1‖PX2|Y2|PY1)   (1.77)

where the last term in (1.76) is zero because Y1 and Y2 are outputs of the same channel PY|X. This gives

D(PY1‖PY2) = D(PX1‖PX2) − D(PX1|Y1‖PX2|Y2|PY1)   (1.78)

which implies (1.74).

Example 1.17. Suppose PY |X has the special property that a PX uniform


on X produces a PY uniform on Y. So choosing PX2 uniform in (1.74) makes
PY2 uniform. Using (1.43) we find that

H(X1 ) ≤ H(Y1 ). (1.79)

In other words, such a channel PY |X does not decrease entropy.


1.10.3. Fano’s Inequality

A useful lower bound on error probability is the following.

Theorem 1.7. (Fano’s Inequality) Suppose both X and X̂ take on values in the alphabet X, and let Pe = Pr[X̂ ≠ X]. We have

H(X|X̂) ≤ H2(Pe) + Pe log2(|X| − 1)   (1.80)

with equality if and only if, for all a and b in X, we have

PX|X̂(a|b) = 1 − Pe           if b = a,
PX|X̂(a|b) = Pe/(|X| − 1)     if b ≠ a.   (1.81)

Proof. Let E = 1(X̂ ≠ X), where 1(·) is the indicator function. We use the chain rule to expand H(EX|X̂) in two ways as

H(EX|X̂) = H(X|X̂) + H(E|X̂X)
         = H(X|X̂)   (step (a))

H(EX|X̂) = H(E|X̂) + H(X|X̂E)
         = H(E|X̂) + Pr[E = 0] H(X|X̂, E = 0) + Pr[E = 1] H(X|X̂, E = 1)
         = H(E|X̂) + Pe H(X|X̂, E = 1)
         ≤ H(E|X̂) + Pe log2(|X| − 1)   (step (b))
         ≤ H(E) + Pe log2(|X| − 1)   (step (c))
         = H2(Pe) + Pe log2(|X| − 1)

where (a) follows because X̂ and X essentially determine E. Step (b) follows because, given X̂ and E = 1, X takes on at most |X| − 1 values. Furthermore, by Theorem 1.1 equality holds in (b) if and only if, conditioned on X̂, X is uniform over the set of X not including X̂, i.e., if and only if (1.81) is satisfied for all a and b in X. Inequality (c) holds with equality if (1.81) is satisfied for all a and b in X.

Example 1.18. Consider X = {0, 1} for which Fano’s inequality is

H2 (Pe ) ≥ H(X|X̂). (1.82)

Equality holds if and only if X = X̂ ⊕ Z where Z is independent of X̂,


PZ (0) = 1 − Pe , PZ (1) = Pe , and “⊕” denotes addition modulo-2 (or XOR).


Figure 1.6.: Fano upper bound on H(X|X̂) for |X| = 3.

Example 1.19. Consider X = {0, 1, 2} and X = X̂ + Z where Z is


independent of X̂, “+” denotes addition modulo-3, and PZ (i) = pi , i =
0, 1, 2. Fano’s inequality is

H2 (1 − p0 ) + (1 − p0 ) ≥ H(X|X̂) (1.83)

and one can check that equality holds if and only if p1 = p2 (see (1.12)).

A plot of the right-hand side of (1.80) as a function of Pe is shown in Fig. 1.6 for |X| = 3. We can interpret (1.80) as follows: Pe cannot be driven to zero if H(X|X̂) is bounded from below by some positive number.
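The right-hand side of (1.80) is easy to evaluate. A Python sketch reproducing a few points of the curve in Fig. 1.6 (the helper name fano_bound is our own):

from math import log2

def fano_bound(Pe, M):
    """Right-hand side of Fano's inequality (1.80) for an alphabet of size M."""
    H2 = 0.0 if Pe in (0.0, 1.0) else -Pe * log2(Pe) - (1 - Pe) * log2(1 - Pe)
    return H2 + Pe * log2(M - 1)

M = 3
for Pe in (0.0, 0.1, 0.5, 2/3, 1.0):
    print(f"Pe = {Pe:.2f}: bound = {fano_bound(Pe, M):.3f} bits")
# The bound reaches its maximum log2(3) ≈ 1.585 at Pe = 2/3; if H(X|Xhat) is
# bounded away from zero, Pe cannot be made arbitrarily small.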


1.10.4. Pinsker’s Inequality


Suppose P(·) and Q(·) are probability distributions with discrete and finite domain X. Recall that the ℓ1 distance between P(·) and Q(·) is

d(P, Q) = Σ_{a∈X} |P(a) − Q(a)|.   (1.84)

Observe that

0 ≤ d(P, Q) ≤ 2 (1.85)

where the right inequality follows by |P (a) − Q(a)| ≤ P (a) + Q(a). We


have equality on the left if and only if P (a) = Q(a) for all a, and we have
equality on the right if and only if supp(P ) ∩ supp(Q) = ∅.

Theorem 1.8. (Pinsker’s Inequality)

D(P‖Q) ≥ (1/(2 ln(2))) d²(P, Q)   (1.86)

Proof. Partition X into two sets:

X1 = {a : P(a) ≥ Q(a)}   (1.87)
X2 = {a : P(a) < Q(a)}.   (1.88)

Define

p = Σ_{a∈X1} P(a)   and   q = Σ_{a∈X1} Q(a)   (1.89)

so that p ≥ q. We have d(P, Q) = 2(p − q) and the log-sum inequality gives

D(P‖Q) ≥ p log2(p/q) + (1 − p) log2( (1 − p)/(1 − q) ).   (1.90)

We thus have

D(P‖Q) − (1/(2 ln(2))) d(P, Q)² ≥ p log2(p/q) + (1 − p) log2( (1 − p)/(1 − q) ) − (2/ln(2)) (p − q)².   (1.91)

If q = p then the right-hand side of (1.91) is zero, so suppose q < p. The derivative of the right-hand side of (1.91) with respect to q is

( (p − q)/ln(2) ) ( −1/(q(1 − q)) + 4 ) ≤ 0.   (1.92)

Thus, as q decreases the right-hand side of (1.91) increases from its (minimum) value of 0.
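A numerical sanity check of Pinsker's inequality on randomly drawn distributions, as a Python sketch using numpy; the alphabet size and the number of trials are arbitrary choices:

import numpy as np

rng = np.random.default_rng(0)

def D(P, Q):
    """Informational divergence in bits; assumes supp(P) is contained in supp(Q)."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log2(P[mask] / Q[mask])))

for _ in range(5):
    P = rng.dirichlet(np.ones(4))   # random distributions on a 4-letter alphabet
    Q = rng.dirichlet(np.ones(4))
    d = np.sum(np.abs(P - Q))       # L1 distance (1.84)
    lhs = D(P, Q)
    rhs = d**2 / (2 * np.log(2))    # right-hand side of (1.86)
    print(f"D = {lhs:.4f}, d^2/(2 ln 2) = {rhs:.4f}, Pinsker holds: {lhs >= rhs}")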


1.11. Convexity Properties


Entropy, informational divergence, and mutual information have convexity
properties that are useful for proving capacity theorems. We list and prove
some of these below.
Recall (see Appendix A.12) that a real-valued function f(·) whose domain is a non-empty convex set S in R^n is convex on S if for every pair of points x1 and x2 in S we have

λf(x1) + (1 − λ)f(x2) ≥ f(λx1 + (1 − λ)x2)   for 0 < λ < 1.   (1.93)

We say that f (·) is concave (or convex-∩) on S if −f (·) is convex on S.


A useful tool for convex and concave functions is Jensen’s inequality (see
Theorem A.3).
For the following, it is useful to think of PX as being a real-valued n-dimensional vector that is in the convex set of vectors with non-negative entries and for which Σ_a PX(a) = 1. Similarly, we view PY|X as being an n × m real matrix, or rather a long vector of length n · m, that is in the convex set of matrices, or long vectors, with non-negative entries and for which Σ_b PY|X(b|a) = 1 for all a.

Theorem 1.9. (Convexity of Informational Divergence) D(PX‖PY) is convex (or convex-∪) in the pair (PX, PY). That is, for distributions PX, PY, QX, QY with the same domain X we have

λD(PX‖PY) + (1 − λ)D(QX‖QY) ≥ D( λPX + (1 − λ)QX ‖ λPY + (1 − λ)QY )   (1.94)

for any λ satisfying 0 < λ < 1.

Proof. Consider PX(a) > 0 and QX(a) > 0. The log-sum inequality gives

λPX(a) log2( λPX(a)/(λPY(a)) ) + (1 − λ)QX(a) log2( (1 − λ)QX(a)/((1 − λ)QY(a)) )
   ≥ [λPX(a) + (1 − λ)QX(a)] log2( (λPX(a) + (1 − λ)QX(a))/(λPY(a) + (1 − λ)QY(a)) )

where 0 < λ < 1. If PX(a) > 0 and QX(a) = 0 then we have

λPX(a) log2( λPX(a)/(λPY(a)) ) ≥ λPX(a) log2( λPX(a)/(λPY(a) + (1 − λ)QY(a)) ).

A similar bound results when PX (a) = 0 and QX (a) > 0. Now sum the
appropriate bounds over supp(PX ) and supp(QX ) to obtain (1.94). 


Theorem 1.10. (Concavity of Entropy) We write H(X) as H(PX ). The


entropy H(PX ) is concave (or convex-∩) in PX . That is, for two distributions
PX and QX with the same domain X we have

λH(PX ) + (1 − λ)H(QX ) ≤ H (λPX + (1 − λ)QX ) (1.95)

for any λ satisfying 0 < λ < 1.

Proof. From (1.43) we can write H(X) = log2 |X| − D(PX‖PY) where PY is uniform. Theorem 1.9 thus gives the desired result by fixing PY as the uniform distribution, i.e., use (1.94) with QY = PY.

Example 1.20. For H2(p) we compute

(d/dp) H2(p) = log2( (1 − p)/p )   (1.96)
(d²/dp²) H2(p) = −1/( ln(2) p(1 − p) ) < 0.   (1.97)

H2(p) is therefore concave in p.
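The concavity statement (1.95) can also be checked numerically on randomly chosen distributions. A Python sketch using numpy (alphabet size and λ are arbitrary):

import numpy as np

rng = np.random.default_rng(1)

def H(P):
    """Entropy in bits."""
    P = P[P > 0]
    return float(-np.sum(P * np.log2(P)))

lam = 0.3
for _ in range(5):
    P = rng.dirichlet(np.ones(5))
    Q = rng.dirichlet(np.ones(5))
    lhs = lam * H(P) + (1 - lam) * H(Q)
    rhs = H(lam * P + (1 - lam) * Q)
    print(f"{lhs:.4f} <= {rhs:.4f}: {lhs <= rhs}")   # concavity (1.95)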

Theorem 1.11. (Convexity of Mutual Information) We write I(X; Y) as I(PX, PY|X). The function I(PX, PY|X) is concave in PX if PY|X is fixed, and I(PX, PY|X) is convex in PY|X if PX is fixed. That is, we have

λI(PX, PY|X) + (1 − λ)I(QX, PY|X) ≤ I( λPX + (1 − λ)QX, PY|X )   (1.98)
λI(PX, PY|X) + (1 − λ)I(PX, QY|X) ≥ I( PX, λPY|X + (1 − λ)QY|X )   (1.99)

for any λ satisfying 0 < λ < 1.

Proof. Suppose PY|X is fixed and consider I(X; Y) = H(Y) − H(Y|X). Note that H(Y) is concave in PY. But PY and H(Y|X) are linear in PX. Thus, I(X; Y) is concave in PX.
Suppose next that PX is fixed and consider I(X; Y) = D(PX PY|X‖PX PY). PY is linear in PY|X so that D(PX PY|X‖PX PY) is convex in PY|X.

Theorem 1.12. (Linearity and Convexity of Cross Entropy) X(PX‖PY) is linear in PX and convex in PY. That is, we have

λX(PX‖PY) + (1 − λ)X(QX‖PY) = X( λPX + (1 − λ)QX ‖ PY )   (1.100)
λX(PX‖PY) + (1 − λ)X(PX‖QY) ≥ X( PX ‖ λPY + (1 − λ)QY )   (1.101)


for any λ satisfying 0 < λ < 1. However, X(PX‖PY) is neither convex nor concave in the pair (PX, PY) in general.

Proof. The linearity (1.100) follows by substitution and the convexity (1.101) follows by the concavity of the logarithm, or by the convexity of informational divergence, see Problem B.1.24. The non-convexity/concavity of X(PX‖PY) is developed in Problem B.1.24.


References
[1] C. E. Shannon. A mathematical theory of communication. Bell System
Technical Journal, 27:379–423 and 623–656, July and October 1948.
Reprinted in Claude Elwood Shannon: Collected Papers, pp. 5–83, (N.
J. A. Sloane and A. D. Wyner, eds.) Piscataway: IEEE Press, 1993.
[2] R. V. L. Hartley. Transmission of information. Bell Syst. Tech. J., page
535, July 1928.
[3] R. Clausius. Ueber verschiedene für die Anwendung bequeme Formen
der Hauptgleichungen der mechanischen Wärmetheorie. Annalen der
Physik und Chemie, CXXV(7):353–400, 1865.
[4] A. D. Wyner and J. Ziv. A theorem on the entropy of certain binary sequences and applications: Part I. IEEE Trans. Inf. Theory, 19(6):769–772, November 1973.



Chapter 2.

Information Theory for Continuous Variables

2.1. Differential Entropy


Consider a real-valued random variable X with density pX (·). The differen-
tial entropy of X is defined in a similar manner as the entropy of a discrete
random variable:
h(X) = ∫_{supp(pX)} −pX(a) log pX(a) da   (2.1)

assuming this integral exists. We can alternatively write

h(X) = E[− log pX (X)] . (2.2)

Example 2.1. Consider the uniform distribution with pX (a) = 1/A for
a ∈ [0, A) where [0, A) = {x : 0 ≤ x < A} (see Fig. 2.1). We compute

h(X) = log(A). (2.3)

We find that h(X) is negative for A < 1. A natural next question is how we
should interpret h(X): what does “negative uncertainty” mean? We shall
give one interpretation in Sec 2.6.1.

Figure 2.1.: Densities of uniform distributions: A = 1/2 gives h(X) = −1 bit, and A = 1 gives h(X) = 0.


The joint differential entropy of real-valued and continuous random variables X1, X2, . . . , Xn with joint density pX^n(·) is defined as

h(X^n) = ∫_{supp(pX^n)} −pX^n(a) log pX^n(a) da.   (2.4)

We can alternatively write (2.4) as h(X) where X = [X1 , X2 , . . . , Xn ].


Simple exercises show that for a non-zero real number c we have

Translation rule: h(X + c) = h(X)
Scaling rule: h(cX) = h(X) + log |c|.   (2.5)

Similarly, for a real-valued column vector c of dimension n and an invertible n × n matrix C we have

Translation rule: h(X + c) = h(X)
Scaling rule: h(CX) = h(X) + log |det C|   (2.6)

where det C is the determinant of C. We will, however, use the notation


|C| for the determinant of C below.
Consider a joint density pXY (·) and its conditional density pY |X (·) = pXY (·)/pX (·).
We define
h(Y|X = a) = ∫_{supp(pY|X(·|a))} −pY|X(b|a) log pY|X(b|a) db   (2.7)

and if X has a density then

h(Y|X) = ∫_{supp(pX)} pX(a) h(Y|X = a) da.   (2.8)

We thus have h(Y|X) = h(XY) − h(X). If X is discrete, then we define

h(Y|X) = Σ_{a∈supp(PX)} PX(a) h(Y|X = a).   (2.9)

Note that by conditioning on X = a and using the translation rule in (2.5),


for any real constant c we obtain

h(Y + cX|X) = h(Y |X). (2.10)


Figure 2.2.: A cumulative distribution and its density.

2.2. Mixed Distributions

One often encounters problems with random variables having both discrete
and continuous components. For example, consider a real-valued random
variable X with cumulative distribution (see Fig. 2.2)

FX(a) = 0 for a < 0,
       = 3a/4 for 0 ≤ a < 1/2,
       = (3a + 1)/4 for 1/2 ≤ a < 1,
       = 1 for a ≥ 1.   (2.11)

Observe that FX(·) has a discontinuity that represents a probability mass. A common way of writing the “density” for such a mixed distribution is as a sum of a discrete part and a continuous part, in this case (see Fig. 2.2)

pX(a) = (3/4) pU(a) + (1/4) δ(a − 1/2)   (2.12)

where pU(·) is the uniform density in the interval [0, 1) and where δ(·) is the “Dirac-δ” (generalized) function defined indirectly by taking integrals over proper¹ intervals I:

∫_I f(x)δ(x)dx = f(0) if 0 ∈ I, and 0 otherwise.   (2.13)

More generally, if there are probability mass points at ai, i = 1, 2, . . . , J, with probabilities PX(ai), then we may write the “density” as

pX(x) = pX̃(x) + Σ_{i=1}^{J} PX(ai) δ(x − ai)   (2.14)

where pX̃ (·) is a normalized density of a continuous random variable X̃.


We will see below that h(X) = −∞ for a mixed random variable with a
non-zero probability mass. Hence one should exercise caution when dealing
with such random variables and using differential entropy.
¹An interval is proper if it is non-empty and does not have just one point.


Figure 2.3.: A probability density function for a two-dimensional random vector.

One can extend the above ideas to random vectors. For instance, consider
X = [X1 , X2 ] that takes on values in R2 , and suppose X has a
• 0-dimensional part: probability mass 1/4 at the point (1, 1/2);
• 1-dimensional part: probability mass 1/2 on the interval defined by
x1 = −1 and x2 ∈ [0, 1/2);
• 2-dimensional part: uniform probability density on the rectangle (x1 , x2 ) ∈
[−1, 1) × [−1/2, 1/2) (so the 2-dimensional probability mass is 1/4).
A plot of the “density” is shown in Fig. 2.3. We have h(X) = −∞ due
to the 0-dimensional and 1-dimensional parts. However, if we condition on
the event E = {X1 = −1, X2 ∈ [0, 1/2)} then we compute h(X|E) = −1 bit
because X is uniform given E.


2.3. Example Distributions

2.3.1. Discrete Random Variables

Consider again the uniform density with pX (a) = 1/A for a ∈ [0, A) and
let A → 0. We thus have h(X) → −∞. We can interpret such limiting
densities as Dirac-δ functions representing discrete random variables. For
instance, suppose that pX (a) = pi /A for some integers i, a ∈ [i, i + A), and
0 ≤ A ≤ 1. As A → 0, this density represents a discrete random variable
X̃ with PX̃ (i) = pi . We compute
h(X) = Σ_i −pi log(pi/A) = log(A) + H(X̃)   (2.15)

so h(X) has increased as compared to (2.3). However, h(X) still approaches


−∞ for small A.
In general, one must exercise caution when dealing with h(X) where X has
discrete components. For example, we have h(Xf (X)) = h(X)+h(f (X)|X)
and h(f (X)|X) = −∞.
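The relation (2.15) between differential entropy and the entropy of a finely quantized version of X can be illustrated numerically. A Python sketch using numpy (it assumes the Gaussian entropy formula (2.19) from Sec. 2.3.3; the sample size and bin width are arbitrary):

import numpy as np

rng = np.random.default_rng(3)
N, sigma, delta = 500_000, 1.0, 1 / 64          # sample size, std dev, bin width

X = rng.normal(0.0, sigma, size=N)
bins = np.floor(X / delta).astype(int)          # quantize X with step delta
_, counts = np.unique(bins, return_counts=True)
P = counts / N
H_quantized = float(-np.sum(P * np.log2(P)))    # entropy of the quantized variable

h_X = 0.5 * np.log2(2 * np.pi * np.e * sigma**2)   # differential entropy (2.19)
print(H_quantized, h_X + np.log2(1 / delta))
# For fine quantization, H(quantized X) ≈ h(X) - log2(delta), mirroring (2.15).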

2.3.2. Exponential Distribution

The exponential density is

pX(a) = (1/m) e^(−a/m) for a ≥ 0, and pX(a) = 0 for a < 0   (2.16)

where m = E[X]. We compute Var[X] = m² and

h(X) = log(e m).   (2.17)

We find that h(X) < 0 if m < 1/e and h(X) → −∞ as m → 0.

2.3.3. Gaussian Distribution

The scalar Gaussian density is

pX(a) = (1/√(2πσ²)) e^(−(a−m)²/(2σ²))   (2.18)

where m = E[X] and σ² = Var[X] is the variance of X. Inserting (2.18) into (2.1), we compute

h(X) = (1/2) log(2πeσ²).   (2.19)


We find that h(X) < 0 if σ² < 1/(2πe). We further have h(X) → −∞ as σ² → 0.

More generally, consider a random column vector X of dimension n, mean E[X] = m and covariance matrix

QX = E[ (X − m)(X − m)^T ]   (2.20)

where the superscript “T” denotes transposition. Suppose X is Gaussian distributed, i.e., the density of X is

pX(a) = 1/( (2π)^(n/2) |QX|^(1/2) ) · exp( −(1/2) (a − m)^T QX^(−1) (a − m) )   (2.21)

where we recall that |QX| is the determinant of QX. Inserting (2.21) into (2.1), we compute (see Problem B.2.2)

h(X) = (1/2) log( (2πe)^n |QX| ).   (2.22)

Note that h(X) is negative for small |QX|.

Example 2.2. Consider (2.21) with X = [X Y]^T and

QXY = [ σX²        ρ σX σY
        ρ σX σY    σY²     ]   (2.23)

so that ρ = E[XY]/(σX σY) is the correlation coefficient of X and Y. The differential entropy (2.22) is

h(XY) = (1/2) log( (2πe)² σX² σY² (1 − ρ²) ).   (2.24)

X and Y are uncorrelated if ρ = 0, in which case X and Y are independent and h(XY) = h(X) + h(Y). On the other hand, if X = a · Y + b for some constants a and b then we have |ρ| = 1 and h(XY) = −∞. We must therefore be cautious with Gaussian vectors: if one entry is a function of the other entries then we have |QX| = 0 and h(X) = −∞.

Finally, suppose pXY(·) is Gaussian, where X has dimension n and Y has dimension m. Let QXY be the covariance matrix of the stacked vector [X^T Y^T]^T. We compute

h(Y|X) = h(XY) − h(X) = (1/2) log( (2πe)^m |QXY| / |QX| ).   (2.25)


Example 2.3. Consider (2.23) for which we compute

h(Y|X) = (1/2) log( 2πe σY² (1 − ρ²) ).   (2.26)
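The formulas (2.22) and (2.25) are easy to evaluate for a given covariance matrix. A Python sketch using numpy (the helper name h_gauss and the numbers in QXY are our own choices):

import numpy as np

def h_gauss(Q):
    """Differential entropy (2.22) of a Gaussian vector with covariance Q, in bits."""
    n = Q.shape[0]
    return 0.5 * np.log2((2 * np.pi * np.e) ** n * np.linalg.det(Q))

sigma_X, sigma_Y, rho = 1.0, 2.0, 0.8
Q_XY = np.array([[sigma_X**2, rho * sigma_X * sigma_Y],
                 [rho * sigma_X * sigma_Y, sigma_Y**2]])
Q_X = Q_XY[:1, :1]

h_joint = h_gauss(Q_XY)             # (2.24)
h_cond = h_joint - h_gauss(Q_X)     # (2.25): h(Y|X) = h(XY) - h(X)
print(h_joint, h_cond)
# h(Y|X) equals 0.5*log2(2*pi*e*sigma_Y^2*(1-rho^2)), as in (2.26):
print(0.5 * np.log2(2 * np.pi * np.e * sigma_Y**2 * (1 - rho**2)))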


2.4. Informational Divergence


The informational divergence between the densities pX and pY is

D(pX‖pY) = ∫_{supp(pX)} pX(a) log( pX(a)/pY(a) ) da   (2.27)

assuming this integral exists. If pX and pY are mixed distributions of the form (2.14), then we partition the events that can occur into events with probability mass on points (0-dimensional parts), curves (1-dimensional parts), and so forth. The informational divergence is defined by summing informational divergences of the parts. For example, if pX has 0-dimensional and 1-dimensional parts then we write

D(pX‖pY) = Σ_{a∈supp(PX)} PX(a) log( PX(a)/PY(a) ) + ∫_{supp(pX̃)} pX̃(a) log( pX̃(a)/pỸ(a) ) da.   (2.28)

A similar generalization is possible for vectors X and Y. The mutual information between continuous random variables X and Y is

I(X; Y) = D(pXY‖pX pY).   (2.29)

We can derive natural bounds on informational divergence for continuous random variables. For instance, the bound ln(x) ≤ x − 1 implies

D(pX‖pY) ≥ 0   (2.30)

with equality if and only if pX (a) = pY (a) for (almost) all a ∈ supp(pX ).
This means that

I(X; Y ) ≥ 0 (2.31)
h(X|Y ) ≤ h(X) (2.32)
h(XY ) ≤ h(X) + h(Y ) (2.33)

with equality if and only if X and Y are independent.


The conditional informational divergence between pX|Z(·) and pY|Z(·) is defined as (see (1.46))

D(pX|Z‖pY|Z|pZ) = ∫_{supp(pZ)} pZ(b) D( pX|Z(·|b) ‖ pY|Z(·|b) ) db
                = ∫_{supp(pXZ)} pXZ(a, b) log( pX|Z(a|b)/pY|Z(a|b) ) da db
                = E[ log( pX|Z(X|Z)/pY|Z(X|Z) ) ]   (2.34)

assuming the integrals exist. We have D(pX|Z‖pY|Z|pZ) ≥ 0 with equality


if and only if pX|Z (a|b) = pY |Z (a|b) for (almost) all (a, b) ∈ supp(pXZ ).

2.5. Cross Entropy


The cross entropy of two densities pX (·) and pY (·) is defined as
X(pX‖pY) = ∫_{supp(pX)} −pX(a) log pY(a) da.   (2.35)

We thus have

X(pX‖pY) = h(X) + D(pX‖pY)   (2.36)

and (2.30) gives the following result.

X(pX‖pY) ≥ h(X)   (2.37)

with equality if and only if pX(a) = pY(a) for (almost) all a ∈ supp(pX). Theorem 1.4 gives a useful tool to upper bound entropy that we use in the next section.

Example 2.4. Consider Example 2.1 where pX(a) = 1/A for a ∈ [0, A). Suppose pY(b) = 1/B for b ∈ [0, B), for which we compute

X(pX‖pY) = ∞ if B < A, and X(pX‖pY) = log(B) if B ≥ A.   (2.38)

We thus have X(pX‖pY) ≥ h(X) = log(A) with equality if and only if A = B. This matches (2.37). Observe that any choice of pY(·) where supp(pY) is a proper subset of supp(pX) gives a useless bound.


2.6. Maximum Entropy

2.6.1. Alphabet or Volume Constraint

Recall that the uniform distribution maximizes the entropy of discrete ran-
dom variables with alphabet X . Similarly, the uniform density maximizes
the differential entropy of continuous random variables with a support of
finite volume. To prove this, suppose that X is confined to a set S in R^n. Let |S| be the volume of S, i.e., |S| = ∫_S 1 dx, and let U be uniform over S. We use (2.37) and compute

h(X) ≤ X(pX‖pU) = ∫_{supp(pX)} pX(a) log(|S|) da = log |S|.   (2.39)

We thus find that if X is limited to S then

h(X) ≤ log |S| (2.40)

with equality if and only if pX (a) = 1/|S| for (almost) all a ∈ S.

Alternatively, we have 2^(h(X)) ≤ |S|. This bound justifies having negative differential entropy, namely that 2^(h(X)) for a uniform X measures the volume of the support set S.

2.6.2. First Moment Constraint

For continuous random variables, one is often interested in moment constraints rather than alphabet constraints. For example, suppose that the alphabet of X is all of R^n and we wish to maximize h(X) under the first-moment constraint

E[X] ≤ m   (2.41)

where the inequality a ≤ b means that ai ≤ bi for all entries ai and bi of the
respective a and b.

Without further constraints we can choose X to be uniform over the interval


[−A, 0) for large positive A and make h(X) arbitrarily large. We hence
further restrict attention to non-negative X, i.e., every entry Xi of X must
be non-negative.

Let E have independent entries Ei that are exponentially distributed with mean mi, i.e., we choose

pEi(a) = (1/mi) e^(−a/mi) for a ≥ 0, and pEi(a) = 0 for a < 0.   (2.42)


We use the same approach as in (2.39) to compute


h(X) ≤ X(pX ‖ pE) = − ∫_{supp(pX)} pX(a) log pE(a) da

                  = − ∫_{supp(pX)} pX(a) Σ_i ( −(ai/mi) log e − log mi ) da

                  = Σ_i ( (E[Xi]/mi) log e + log mi )

                  ≤ Σ_i log(e mi)   (2.43)

where the last inequality uses E[Xi] ≤ mi. We thus have

h(X) ≤ Σ_i log(e mi)   (2.44)

with equality if and only if pX (x) = pE (x) for almost all x. Independent ex-
ponential random variables therefore maximize (differential) entropy under
first moment and non-negativity constraints.
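A minimal numerical check of (2.44) for n = 1, using the closed-form entropies
from the table in Sec. 2.7 (the Python code and the chosen mean are illustrative
only): among non-negative densities with mean m, the exponential density attains
log(e m), while for instance a uniform density on [0, 2m] falls short.

    import math

    m = 1.5                                  # common mean (illustrative value)
    bound = math.log(math.e * m)             # right-hand side of (2.44) for n = 1, in nats

    h_exponential = 1.0 + math.log(m)        # exponential with mean m (table in Sec. 2.7)
    h_uniform = math.log(2.0 * m)            # uniform on [0, 2m], also non-negative with mean m

    print(f"bound log(e m)    = {bound:.4f} nats")
    print(f"h(exponential)    = {h_exponential:.4f} nats (attains the bound)")
    print(f"h(uniform[0,2m])  = {h_uniform:.4f} nats (strictly smaller)")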

2.6.3. Second Moment Constraint

Suppose we wish to maximize h(X) under the second-moment constraint

|QX | ≤ D (2.45)

where D is some constant. For example, the constraint (2.45) occurs if we


are restricting attention to X that satisfy

QX ⪯ Q   (2.46)

for some positive semidefinite Q with |Q| = D, where A ⪯ B means that
B − A is positive semidefinite (and hence |A| ≤ |B|; see [1, p. 471]).
Let G be Gaussian with the same mean m and covariance matrix QX as X.
We repeat the approach of (2.39) and (2.43) and compute
h(X) ≤ X(pX ‖ pG) = − ∫_{supp(pX)} pX(a) log pG(a) da

                  = − ∫_{supp(pX)} pX(a) [ −(1/2) log((2π)^n |QX|) − (1/2)(a − m)^T QX^{−1}(a − m) log e ] da

                  = (1/2) log((2πe)^n |QX|).   (2.47)
We thus have
h(X) ≤ (1/2) log((2πe)^n |QX|)   (2.48)

with equality if and only if pX(x) = pG(x) for almost all x. Gaussian random
variables thus maximize (differential) entropy under the second moment
constraint (2.45).

Example 2.5. The constraint (2.45) for scalars (n = 1) is Var[X] ≤ D and


the bound (2.48) implies
h(X) ≤ (1/2) log(2πe D).   (2.49)
A Gaussian random variable X with variance D (and any mean) thus max-
imizes differential entropy under the variance constraint.
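The scalar bound (2.49) can also be checked by Monte-Carlo integration of
h(X) = E[−log pX(X)]. The following Python sketch (illustrative only; it assumes
numpy is available) compares a Gaussian and a uniform density with the same
variance D.

    import numpy as np

    rng = np.random.default_rng(0)
    D, n = 2.0, 200_000                       # common variance and sample size (illustrative)

    def mc_entropy(samples, logpdf):
        """Monte-Carlo estimate of h(X) = E[-log p(X)] in nats."""
        return float(np.mean(-logpdf(samples)))

    # Gaussian with variance D
    x_g = rng.normal(0.0, np.sqrt(D), n)
    h_g = mc_entropy(x_g, lambda x: -0.5 * np.log(2 * np.pi * D) - x**2 / (2 * D))

    # Uniform on [-L/2, L/2] with the same variance D, i.e., L = sqrt(12 D)
    L = np.sqrt(12 * D)
    x_u = rng.uniform(-L / 2, L / 2, n)
    h_u = mc_entropy(x_u, lambda x: np.full_like(x, -np.log(L)))

    bound = 0.5 * np.log(2 * np.pi * np.e * D)
    print(f"bound (2.49) = {bound:.3f} nats, h(Gaussian) = {h_g:.3f}, h(uniform) = {h_u:.3f}")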

Finally, we prove a conditional version of the maximum entropy theorem.


Consider a density pX Y (·) that has the conditional density pY |X (·) and co-
variance matrix QX Y . Suppose that (G, H) is Gaussian with the same
covariance matrix. We compute

h(Y|X) ≤ X(pY|X ‖ pH|G | pX)

       := ∫_{supp(pX)} pX(a) X(pY|X(·|a) ‖ pH|G(·|a)) da

        = − ∫_{supp(pXY)} pXY(a, b) log pH|G(b|a) da db

        = (1/2) log((2πe)^m |QXY| / |QX|).   (2.50)

We thus have

h(Y|X) ≤ (1/2) log((2πe)^m |QXY| / |QX|)   (2.51)

with equality if and only if pY|X(b|a) = pH|G(b|a) for almost all (a, b).


2.7. Appendix: Table of Differential Entropies
The constant γ ≈ 0.57721566 is Euler’s constant.

Distribution    Density p(x)                              Support      Entropy (in nats)
Uniform         1/(b − a)                                 a ≤ x < b    ln(b − a)
Exponential     λ e^{−λx}, λ > 0                          x ≥ 0        1 − ln λ
Gaussian        (2πσ^2)^{−1/2} e^{−(x−m)^2/(2σ^2)}        x ∈ R        (1/2) ln(2πeσ^2)
Rayleigh        (x/σ^2) e^{−x^2/(2σ^2)}                   x ≥ 0        1 + γ/2 + ln(σ/√2)

References
[1] R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University
Press, Cambridge, UK, 1985.



Chapter 3.

Coding of Memoryless Sources

3.1. Discrete Memoryless Source

Our first application of Shannon entropy will be data compression or source


coding. Consider a discrete memoryless source (DMS), which is a device that
emits symbols from a discrete and finite alphabet X in an independent and
identically distributed (i.i.d.) manner (see Fig. 3.1). For instance, suppose
the source probability distribution is PX (·) where

PX (0) = 1 − p and PX (1) = p. (3.1)

If p = 1/2 then this DMS is called a binary symmetric source or BSS. As


discussed in Sec. 1.2, we expect that a block X B = X1 , X2 , . . . , XB of B
output symbols of the DMS has approximately B(1 − p) zeros and Bp ones.
Furthermore, such strings have the same probability and their number is
approximately 2^{B H(X)} where H(X) = H2(p). Can we use this insight to
compress the string?
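A quick numerical check of this counting claim (the Python code and the chosen p
are illustrative only): the number of length-B binary strings with exactly Bp ones
is the binomial coefficient C(B, Bp), and log2 of this number grows like B·H2(p).

    import math

    def H2(p):
        """Binary entropy function in bits."""
        return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

    p = 0.25
    for B in (20, 100, 1000):
        k = int(B * p)
        log_count = math.log2(math.comb(B, k))    # log2 of the number of strings with Bp ones
        print(f"B = {B:4d}: log2 C(B, Bp) = {log_count:8.2f},  B*H2(p) = {B * H2(p):8.2f}")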

3.2. Coding a Single Random Variable

We begin with B = 1. A binary source is not very interesting for this case
so we choose X to be the K-ary alphabet {a1 , a2 , . . . , aK } where K > 2.
We shall return to binary sources later when we consider B > 1.
Our problem is depicted in Fig. 3.2 where we are interested in encoding
the random variable X into a number of digits U L = U1 U2 . . . UL . We are
mostly interested in encoding into bits, but we will take the alphabet of the
Ui , i = 1, 2, . . . , L, to be the D-ary set U = {0, 1, . . . , D − 1}.


Figure 3.1.: A DMS with distribution PX (·).



Figure 3.2.: Source coding scheme where L is a function of X.

In the following we distinguish between a code and an encoder. The source


code is a list1 of D-ary strings (or sequences)

C = {b1 , b2 , . . . , bK } (3.2)

and the bk are called code words. Code word bk has length `k , k = 1, 2, . . . , K.
Since the `k can take on different values for different k, we are considering
variable-length codes.

Example 3.1. The list C = {0, 10, 11} is a binary (D = 2) code for a
ternary (K = 3) source. The code words are b1 = 0, b2 = 10, b3 = 11 with
lengths `1 = 1, `2 = 2, `3 = 2, respectively.
The list C = {0, 1, 20, 21} is a ternary (D = 3) code for a quaternary
(K = 4) source. The code words are b1 = 0, b2 = 1, b3 = 20, b4 = 21 with
lengths `1 = 1, `2 = 1, `3 = 2, `4 = 2, respectively.

The source encoder is the mapping U L (·) from X to C, where bk = U L (ak )


for k = 1, 2, . . . , K. The length ℓk is therefore a function of ak and we write
ℓk = L(ak). The code word length is thus a random variable L(X) that is a
function of X.

Example 3.2. Consider the code C = {0, 10, 11} from Example 3.1 and
suppose that X = {x, y, z}. There are 33 = 27 mappings from X to C so
there are 27 possible encoders we could choose. If we choose U L (x) = 0,
U L (y) = 10, U L (z) = 11 then we have L(x) = 1, L(y) = 2, L(z) = 2.

Our ultimate goal is to encode a sequence X n of source variables emitted by


a DMS. If we apply the source encoder U L (·) to the symbols Xi sequentially,
then the encoder output sequence is, with some abuse of notation,

U L (X n ) = U L (X1 ) U L (X2 ) . . . U L (Xn ) (3.3)

which has length L(X^n) = Σ_{i=1}^{n} L(Xi). The weak law of large numbers
(see Appendix A.13) states that L(X^n) ≈ n E[L(X)], so we wish to choose


a source code and encoder that minimize the average code word length
E[L(X)] = Σ_{k=1}^{K} PX(ak) ℓk.   (3.4)

1
A set does not admit multiple occurrences of the same element while a list does. For
instance, {0, 1, 2} is a set and a list, while {0, 1, 1, 2, 2} is a list but not a set.


Encoding a sequence X n places restrictions on the choice of code. For


example, in this chapter we wish to recover X n perfectly from U L (X n ), i.e.,
we require that the code and encoder make the sequence mapping from xn
to U L (xn ) invertible for any xn . Such a code and encoder are called uniquely
decodable.
Another restriction we place on the code (and encoder) will be that we can
decode quickly. More precisely, we wish to recover Xi immediately after the
symbols U L (Xi ) have been received. We accomplish this by requiring that
no code word bj is the prefix of another code word bk for j ≠ k. Such a
code is called prefix-free. A prefix-free code has no two code words being
the same: the mapping from X to U L (X) is invertible and H(X) = H(U L ).

Example 3.3. Suppose K = 3 and D = 2. The code C = {0, 10, 11} is


prefix free since 0 is not a prefix of 10 or 11, and similarly for 10 and 11.
The encoder is specified by the mapping U L (ak ) = bk so that L(ak ) = `k ,
as shown in the table below.
x U L (x) L(x)
a1 0 1
a2 10 2
a3 11 2

Example 3.4. The following table specifies an encoder for a code that is
not prefix-free.

x U L (x) L(x)
a1 0 1
a2 00 2
a3 11 2

In fact, the code is not uniquely decodable. For example, both of the source
strings a1 a2 and a2 a1 are mapped to the encoded string 000.

Example 3.5. The following table specifies an encoder for a code that is
not prefix-free but that is uniquely decodable.

x U L (x) L(x)
a1 0 1
a2 01 2
a3 11 2

In fact, the code is the reflected code of Example 3.3 and it is suffix-free: no
code word bj is the suffix of another code word bk for j ≠ k. Codes can be
both prefix-free and suffix-free, e.g., the code with all D^N strings of a given
length N .


We can decode a string of suffix-free code words in reverse order. For


instance, the string 01111111 decodes into a2 a3 a3 a3 . Note, however, that
we must wait until the last binary symbol to distinguish this sequence from
a1 a3 a3 a3 .


Figure 3.3.: A rooted tree for the prefix-free code and encoder of Example 3.3.
Figure 3.4.: A rooted tree for the code and encoder of Example 3.4.

3.3. Rooted Trees and the Kraft Inequality


Prefix-free codes can be conveniently represented by rooted trees. For in-
stance, consider the binary prefix-free code of Example 3.3. We label the
branches of a binary tree with the code words as shown in Fig. 3.3. Encod-
ing proceeds by transmitting bits encountered when moving from the root
of the tree towards the leaves drawn with large dark circles. The branch-
ing nodes are drawn with small dark circles. The leaves represent the code
words. However, we often label the leaves with the source symbols, in which
case we are specifying both the code and the encoder.
The code and encoder of Example 3.4 is shown in Fig. 3.4. Note that one
of the code words corresponds to a branching node rather than a leaf, and
that some leaves are unused, i.e., they do not correspond to a code word.
We draw unused leaves with hollow circles.
In general, for a D-ary code we draw a D-ary tree and label the branches
coming out of a branching node with the letters 0, 1, . . . , D−1. A prefix-free
code has code words at leaves only, although it can have unused leaves. The
D-ary tree for a prefix-free code is unique if we prune the tree at branching
nodes from which no code words stem.

Example 3.6. The unique rooted tree for the ternary prefix-free code
C = {0, 1, 20, 21} is drawn in Fig. 3.5.

The depth of a branching node or leaf is the number of branches one must
traverse from the root to this node or leaf. The depth of a code is its
maximal code word length N = max_k ℓk. The full D-ary tree of depth N is
the D-ary tree with D^N leaves at depth N.



Figure 3.5.: A rooted tree for a prefix-free ternary code.

Consider, then, any prefix-free code of depth N . A branching node at depth


ℓ of the full tree of length N has D^{N−ℓ} leaves at depth N stemming from
it. Since all code words correspond to leaves, a simple counting argument
gives

Σ_{k=1}^{K} D^{N−ℓk} ≤ D^N   (3.5)

This bound is known as the Kraft Inequality. The inequality turns out to
give both a necessary and sufficient condition for the existence of a prefix-
free code. We state this formally as a theorem.

Theorem 3.1. (Kraft Inequality) A D-ary prefix-free code with code


word lengths `k , k = 1, 2, . . . , K, exists if and only if
Σ_{k=1}^{K} D^{−ℓk} ≤ 1.   (3.6)

Equality holds if and only if the code has no unused leaves in its D-ary tree.

Proof. We have already shown that (3.6) is a necessary condition, so it


remains to prove sufficiency. We present an algorithm for constructing a
prefix-free code for the `k satisfying (3.6). Suppose the lengths are ordered
so that 1 ≤ `1 ≤ `2 ≤ . . . ≤ `K = N and identify ak with the length `k .
Consider the full tree of length N . We assign a1 to any branching node or
leaf at depth `1 and, if necessary, we prune the tree at this node so that
a1 labels a leaf in the new tree. As a result, there are D^N (1 − D^{−ℓ1}) > 0
depth-N leaves remaining from the full tree, and therefore there must be
at least one branching node or leaf remaining at depth `2 . Furthermore,
since `1 ≤ `2 the code word U L (a1 ) cannot stem from such a node or leaf,
and we can choose this node or leaf as U L (a2 ). Now repeat this argument
iteratively in the remaining tree for `k , k = 2, 3, . . . , K. After step k there
are

D^N − Σ_{i=1}^{k} D^{N−ℓi} = D^N ( 1 − Σ_{i=1}^{k} D^{−ℓi} )   (3.7)

depth-N unused leaves remaining. And since Σ_{i=1}^{k} D^{−ℓi} < 1 by assumption,
if k < K there is one or more depth-N unused leaf remaining. But this


Figure 3.6.: A rooted tree for a binary prefix-free code with specified code word lengths.

means that there is one or more depth-ℓ_{k+1} unused leaf or branching node
remaining and since the lengths were ordered no already chosen code words
can stem from such a node. □

Example 3.7. Consider D = 2, K = 5, and (`1 , `2 , `3 , `4 , `5 ) = (1, 2, 4, 4, 4).


We compute Σ_{k=1}^{5} 2^{−ℓk} = 15/16 < 1. There is thus a prefix-free code with
these code word lengths. We construct the code in Fig. 3.6 by “growing” it
from the root as specified by the above algorithm. The resulting code is

k bk `k
1 0 1
2 10 2
3 1100 4
4 1101 4
5 1110 4

Note that we can shorten the code by replacing 1110 with 111. We then
have no unused leaves and the Kraft inequality is tight.
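The construction in the proof of Theorem 3.1 is easy to mechanize. The Python
sketch below uses the equivalent "canonical code" bookkeeping instead of explicit
tree pruning (the function names are illustrative); it first checks the Kraft
inequality (3.6) and then assigns code words in order of increasing length. On
the lengths (1, 2, 4, 4, 4) of Example 3.7 it reproduces the code in the table above.

    from fractions import Fraction

    def to_digits(value, length, D):
        """Write 'value' as a base-D string with exactly 'length' digits."""
        digits = []
        for _ in range(length):
            value, r = divmod(value, D)
            digits.append(str(r))
        return "".join(reversed(digits))

    def prefix_free_code(lengths, D=2):
        """Construct a D-ary prefix-free code with the given code word lengths."""
        if sum(Fraction(1, D ** l) for l in lengths) > 1:
            raise ValueError("the lengths violate the Kraft inequality (3.6)")
        order = sorted(range(len(lengths)), key=lambda k: lengths[k])
        words = [None] * len(lengths)
        value, prev_len = 0, 0
        for k in order:
            l = lengths[k]
            value *= D ** (l - prev_len)      # descend to depth l in the D-ary tree
            words[k] = to_digits(value, l, D)
            value += 1                        # move to the next unused node at depth l
            prev_len = l
        return words

    print(prefix_free_code([1, 2, 4, 4, 4]))  # ['0', '10', '1100', '1101', '1110']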


3.4. Rooted Trees with Probabilities

Our discussion of trees has so far not included the probabilities of the code
words. From (3.4) we should clearly associate short code word lengths with
large probabilities, but how should we do this precisely? To understand this
better, we develop properties of rooted trees with probabilities (see also [1]).
Suppose we assign the leaves a probability distribution so that leaf k (or code
word bk ) has probability pk , k = 1, 2, . . . , K. The leaf entropy is therefore
Hleaf = Σ_{k: pk>0} −pk log2 pk.   (3.8)

Suppose there are J branching nodes labeled j = 1, 2, . . . , J. We assign


each node j the sum Pj of the probabilities of the leaves stemming from
node j. This means that Pj is the sum of the probabilities of the branching
nodes and leaves at depth 1 in the subtree stemming from node j. We say
that such nodes and leaves stem directly from node j.
The branching distribution Qj (·) of node j is the distribution of the nodes
and leaves stemming directly from node j, conditioned on the event that
node j has occurred. The branching entropy of a branching node is the
entropy H(Qj ) of its branching distribution. It is perhaps best to illustrate
these concepts with an example.

Example 3.8. Consider the rooted tree of Fig. 3.5 and suppose the source
has the probabilities as shown in the table below.

k bk pk
1 0 0.4
2 1 0.3
3 20 0.2
4 21 0.1

We label the leaves and branching nodes with their probabilities as shown
in Fig. 3.7. The leaf entropy is
Hleaf = Σ_{k=1}^{4} −pk log2 pk ≈ 1.846 bits.   (3.9)

We label the (branching) root node with j = 1 and branching node 2 with
j = 2. We thus have P1 = 1.0 which is the sum of the probabilities of the
leaves stemming from the root node, and we have P2 = 0.3. The branch-
ing distributions are Q1 (·) = [0.4, 0.3, 0.3] and Q2 (·) = [0.2, 0.1, 0]/0.3 =
[2/3, 1/3, 0]. The branching entropies are therefore

H(Q1) = −0.4 log2(0.4) − 2 · 0.3 log2(0.3) ≈ 1.571 bits   (3.10)
H(Q2) = −(2/3) log2(2/3) − (1/3) log2(1/3) ≈ 0.9183 bits.   (3.11)



Figure 3.7.: A rooted tree with probabilities for a prefix-free ternary code.

The following lemmas give useful properties of rooted trees with probabili-
ties (see [1]).

Lemma 3.1. (Path Length Lemma) The average depth of the leaves
in a rooted tree with probabilities is the sum of the probabilities of the
branching nodes including the root, i.e., we have
Σ_{k=1}^{K} pk ℓk = Σ_{j=1}^{J} Pj.   (3.12)

Lemma 3.2. (Leaf-Entropy Lemma) The leaf entropy of a rooted tree


with probabilities is the average branching entropy, i.e., we have
Hleaf = Σ_{j=1}^{J} Pj H(Qj).   (3.13)

Proof. For the Path Length Lemma, the depth `k of each leaf is the number
of branching nodes encountered when moving from the root to this leaf. We
can thus reorder the first sum in (3.12) by first summing the probabilities
of the leaves stemming from each branching node, and then sum over all
branching nodes. The result is the second sum in (3.12).
For the Leaf-Entropy Lemma, the branching nodes and leaves stemming
directly from node j have the probabilities Pm = Pj · Qj (m). Furthermore,
the average branching entropy on the right-hand side of (3.13) has the terms
Pj H(Qj) = Pj Σ_{m: Pj Qj(m)>0} −Qj(m) log2 Qj(m)

         = Pj log2 Pj + Σ_{m: Pm>0} −Pm log2 Pm   (3.14)

Node j therefore contributes a term Pj log2 Pj to the sum in (3.13), but it


also subtracts the terms Pm log2 Pm for each of the nodes stemming directly
from it. The sum in (3.13) is thus a telescoping sum and the surviving terms
are the −Pk log2 Pk where the k represent leaves. □


Example 3.9. Consider Example 3.8. The average depth of the leaves can
be computed using either of the following two approaches:
Σ_{k=1}^{4} pk ℓk = (0.4 + 0.3) · 1 + (0.2 + 0.1) · 2 = 1.3   (3.15)
P1 + P2 = 1.3.   (3.16)

Similarly, the average branching entropy is

P1 H(Q1 ) + P2 H(Q2 ) ≈ 1.0(1.571) + 0.3(0.9183) = 1.846 bits. (3.17)


3.5. Entropy Bounds on Source Coding

Lemmas 3.1 and 3.2 give a useful lower-bound on E[L]. Consider (3.13) for
which we have

H(X) = Hleaf
     (a)
      ≤ Σ_{j=1}^{J} Pj log2 D
     (b)
      = E[L] log2 D bits   (3.18)

where (a) follows because H(Qj ) ≤ log2 D with equality if and only if Qj is
the uniform distribution over {0, 1, . . . , D − 1} for all j (see Thm. 1.1), and
(b) follows by Lemma 3.1. Equality holds if and only if pk = D^{−ℓk} for all k
because the branching probabilities must all be 1/D.
For an upper bound on E[L], the conditions for equality in (3.18) suggest
that we should choose pk ≈ D^{−ℓk}. Consider the lengths

ℓk = ⌈−logD pk⌉   (3.19)

for k = 1, 2, . . . , K. (We assume from here on that all pk are positive.)


Inserting (3.19) into the left-hand side of the Kraft inequality (3.6) and
using the left-hand side of

x ≤ ⌈x⌉ < x + 1   (3.20)

we have
Σ_{k=1}^{K} D^{−⌈−logD pk⌉} ≤ Σ_{k=1}^{K} D^{logD pk} = 1.   (3.21)

By Theorem 3.1, there is a D-ary prefix-free code with code word lengths
given by (3.19). A code with these lengths is called a Shannon-Fano code.
We next bound E[L] by using the right-hand side of (3.20):
E[L] = Σ_{k=1}^{K} pk ⌈−logD pk⌉ < H(X)/log2 D + 1.   (3.22)

A Shannon-Fano code is thus within 1 digit of the lower bound (3.18). We


summarize our findings in the following theorem.

Theorem 3.2. (Coding Theorem for a Single Random Variable)


The average code word length E[L] of an optimum D-ary prefix-free code
for a K-ary random variable X satisfies

H(X)/log2 D ≤ E[L] < H(X)/log2 D + 1   (3.23)



Figure 3.8.: A rooted tree with probabilities for a Shannon-Fano code.

with equality on the left if and only if PX(a) = D^{−L(a)} for all a and some
positive integer L(a), and the length of the corresponding code word is L(a).
Moreover, a Shannon-Fano code achieves E[L] satisfying (3.23).

Example 3.10. Consider the source of Example 3.8 and choose D = 2


(not D = 3) and the Shannon-Fano code word lengths given in the table
below.
k    pk    ℓk = ⌈−log2 pk⌉    bk
1 0.4 2 00
2 0.3 2 01
3 0.2 3 110
4 0.1 4 1110

We use the algorithm in the proof of Thm. 3.1 to grow a tree with the
specified lengths. The resulting tree is shown in Fig. 3.8 and it is listed in
the above table.
We use the Path Length Lemma to compute

E[L] = 1.0 + 0.7 + 0.3 + 0.3 + 0.1 = 2.4 bits. (3.24)

The source entropy is H(X) ≈ 1.846 bits (see (3.9)) so that (3.23) is sat-
isfied. The code is not optimal, however, and we can reduce E[L] to 2 by
simply using the full binary tree of length 2. However, if H(X) is large then
a Shannon-Fano code is nearly optimal. We use this insight in the next
section.
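The Shannon-Fano lengths (3.19) and the quantities appearing in (3.21)–(3.23)
are easily computed. The Python sketch below (function names illustrative)
does this for the source of Example 3.10.

    import math

    def shannon_fano_lengths(probs, D=2):
        """Code word lengths (3.19): l_k = ceil(-log_D p_k)."""
        return [math.ceil(-math.log(p, D)) for p in probs]

    probs = [0.4, 0.3, 0.2, 0.1]                       # source of Examples 3.8 and 3.10
    lengths = shannon_fano_lengths(probs)

    kraft = sum(2.0 ** (-l) for l in lengths)          # must be at most 1 by (3.21)
    avg_len = sum(p * l for p, l in zip(probs, lengths))
    entropy = -sum(p * math.log2(p) for p in probs)

    print("lengths  :", lengths)                       # [2, 2, 3, 4]
    print("Kraft sum:", kraft)                         # 0.6875 <= 1
    print("E[L]     :", avg_len)                       # 2.4 bits, as in (3.24)
    print("H(X)     :", round(entropy, 3))             # about 1.846 bits, cf. (3.23)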


Figure 3.9.: Source encoder with a block parser (top) and a variable-length
dictionary parser (bottom); each parser is followed by a mapper to a prefix-free D-ary code.

3.6. Coding Theorem for Discrete Memoryless Sources
Consider encoding a sequence X n coming out of a DMS. The main new idea
is that we parse X n into blocks of length B and consider each block as the
output of a new DMS with distribution PX B (·). The process is shown at
the top of Fig. 3.9 where we have split the source encoder into two parts: a
parser and a mapper. We treat the structure on the bottom of Fig. 3.9 in
Sec. 3.8 and Sec. 3.9 below.
Applying Theorem 3.2 we find that the average code word length E[L] of
an optimum D-ary code for the K^B-ary random variable X^B satisfies

H(X^B)/log2 D ≤ E[L] < H(X^B)/log2 D + 1.   (3.25)

However, we have H(X B ) = BH(X), and E[L] in (3.25) measures the


expected code word length per B symbols from the original source. We
thus have the following coding theorem for the DMS.

Theorem 3.3. (Coding Theorem for a DMS) The average code word
length E[L] of an optimum D-ary prefix-free code for a K-ary DMS PX (·)
that is parsed into blocks of length B satisfies

H(X)/log2 D ≤ E[L]/B < H(X)/log2 D + 1/B.   (3.26)

Moreover, a Shannon-Fano code achieves an E[L] satisfying (3.26).

Theorem 3.3 tells us that we can approach the lower bound H(X)/ log2 D
code symbols per source symbol as closely as we like by using large block
lengths B. Of course, the encoding and decoding delays increase linearly
with B and the complexity of the encoder and decoder seems to increase
exponentially with B.


As a final remark, we may apply the structures in Fig. 3.9 to blocks of


a source with memory. For instance, a binary (D = 2) Shannon-Fano
code compresses the block X B to H(X B )/B bits per source symbol. The
entropy rate limB→∞ H(X B )/B will interest us when we consider sources
with memory.


3.7. Huffman Codes


We have until now developed only bounds on the average code word lengths
by using the Kraft inequality and Shannon-Fano codes. In this section, we
derive the optimal prefix-free codes for a single random variable X.
The D-ary prefix-free codes that minimize E[L] for a K-ary random variable
are called Huffman codes in honor of their inventor. Before we describe
Huffman codes, we prove a few basic lemmas. We call two leaves siblings
if they stem directly from the same branching node. Similarly, we call two
code words siblings if their leaves are siblings.

Lemma 3.3. Optimal D-ary prefix-free codes assign shorter code words to
more probable letters, i.e., if pk < pm then `k ≥ `m .

Lemma 3.4. All (used) leaves in the tree of an optimal D-ary prefix-free
code have a sibling.

Lemma 3.5. There is an optimal D-ary prefix-free code whose two least-
likely code words differ only in their last digit.

Proof. Lemma 3.3: Suppose we have pk < pm but `k < `m . Then we can
swap bk and bm and E[L] is reduced by the positive number

pk (`k − `m ) + pm (`m − `k ) = (pm − pk )(`m − `k ). (3.27)

Lemma 3.4: If a leaf has no siblings then we can delete the last digit of the
code word and we still have a prefix-free code. We thus reduce E[L] and the
original code could not have been optimal.
Lemma 3.5: By Lemma 3.3, the two least-likely code words must have the
two longest lengths. By Lemma 3.4, both of these code words must have
siblings. Thus, the two least-likely code words must have the same length.
Further, if they are not siblings then we can swap one of the code words for
the other code word’s sibling to make them siblings without changing E[L].
The result is a prefix-free code with the desired property. □
Consider the simplest case D = 2. By the Path Length Lemma, to minimize
E[L] we should minimize the sum of the probabilities of the branching nodes.
By Lemma 3.5 we can choose the two least likely codewords as siblings, say
with probabilities pK−1 and pK . The branching node from which they stem
can now be considered as a leaf in a new tree with K − 1 leaves and with
expected length

E[L′] = E[L] − (pK−1 + pK)   (3.28)



Figure 3.10.: Construction of a binary (D = 2) Huffman code for K = 4.

where we have used the Path Length Lemma. The problem of minimizing
E[L] is thus equivalent to minimizing E[L′], i.e., it is equivalent to con-
structing a binary tree with these K − 1 leaves such that the sum of the
probabilities of its branching nodes is minimum. But Lemma 3.5 states that
for this tree we can make two least likely code words siblings. Continuing
until the root, we have the following algorithm for constructing an optimal
binary prefix-free code for a K-ary random variable.
a) Draw K leaves and label them with the K probabilities p1 , p2 , . . . , pK .
Set k = K.
b) Create a branching node from which the two least likely leaves stem.
Label this node with the sum of the probabilities of these two leaves.
Label one of the new branches with 0 and the other with 1. The k − 2
remaining leaves and the new branching node are considered to be the
k − 1 leaves of a new tree.
c) Reduce k by 1. If k = 1 then make the remaining leaf the root node
and stop. Else go to step b).

Example 3.11. Consider the source of Examples 3.8 and 3.10 and choose
D = 2. We use the algorithm described above to construct a Huffman code
as depicted in Fig. 3.10. The resulting table of code words and lengths is
as follows
k pk `k bk
1 0.4 1 0
2 0.3 2 10
3 0.2 3 110
4 0.1 3 111

The Path Length Lemma gives

E[L] = 1.0 + 0.6 + 0.3 = 1.9 bits (3.29)

which is quite close to the source entropy H(X) ≈ 1.846 bits (see (3.9))
and much better than the Shannon-Fano code (see (3.24)).
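The three-step procedure above is conveniently implemented with a priority queue.
The Python sketch below (function name illustrative) merges the two least likely
subtrees until only the root remains and returns the code word lengths; on the
probabilities of Example 3.11 it gives the lengths (1, 2, 3, 3) and E[L] = 1.9 bits.

    import heapq
    import itertools

    def huffman_lengths(probs):
        """Binary Huffman code word lengths via repeated merging of the two
        least likely subtrees (steps a)-c) above). Returned in input order."""
        counter = itertools.count()                    # tie-breaker for equal probabilities
        heap = [(p, next(counter), [k]) for k, p in enumerate(probs)]
        heapq.heapify(heap)
        lengths = [0] * len(probs)
        while len(heap) > 1:
            p1, _, idx1 = heapq.heappop(heap)          # two least likely subtrees ...
            p2, _, idx2 = heapq.heappop(heap)
            for k in idx1 + idx2:                      # ... every leaf below the new node
                lengths[k] += 1                        #     gains one branch in depth
            heapq.heappush(heap, (p1 + p2, next(counter), idx1 + idx2))
        return lengths

    probs = [0.4, 0.3, 0.2, 0.1]
    lengths = huffman_lengths(probs)
    print("lengths:", lengths)                                    # [1, 2, 3, 3]
    print("E[L]  =", sum(p * l for p, l in zip(probs, lengths)))  # 1.9 bits, see (3.29)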

Consider now the case D ≥ 3 where the only new complication is that the
optimal code might have unused leaves. We determine their number by


using a few simple observations.
First, an optimal prefix-free code has all unused leaves at maximal depth,
since otherwise we could transfer one of the code words at maximal depth
to this leaf and lower E[L]. Next, at the maximal depth we can swap leaves,
used or unused, without changing E[L]. But this means that if there are
more than D − 2 unused leaves we can swap D − 1 of them to stem directly
from a common branching node which contradicts Lemma 3.4. An optimal
prefix-free code thus has at most D − 2 unused leaves, say their number is r.
We can move the r unused leaves and any of the other D −r maximal-length
leaves to stem directly from a common branching node. We may as well
choose these D − r maximal-length leaves to be the D − r least likely leaves.
We prove the following lemma.

Lemma 3.6. The number of unused leaves in the tree of an optimal D-ary
prefix-free code for a random variable that takes on K values with positive
probability, K ≥ D, is the remainder when (K − D)(D − 2) is divided by
D − 1.

Proof. A D-ary tree must have D + (J − 1)(D − 1) leaves (including used


and unused leaves) where J is the number of branching nodes including the
root (see Sec. 3.4). The reason is that every time we add D new leaves to
the tree we remove one leaf, except for the first time where we add D leaves
to the root. Next, the number of unused leaves is the number of leaves
minus the number of code words, i.e., we have r = [D + (J − 1)(D − 1)] − K
or

D − K = −(J − 1)(D − 1) + r. (3.30)

By the above discussion, we have r < D − 1 so that r is the remainder when


D − K is divided by D − 1. We add (K − D)(D − 1) to both sides of (3.30)
and obtain

(K − D)(D − 2) = (K − D − J + 1)(D − 1) + r. (3.31)

This proves the lemma. □


Combining the above insights, we can construct an optimal D-ary prefix
free code for K ≥ D by using a similar algorithm as for the binary case.
The main difference is that we must create the first branching node with the
D − r least likely code leaves. All remaining branching nodes are created
by collecting the D least likely leaves in appropriate trees.

Example 3.12. Consider the source with K = 4 of Example 3.8 and choose
D = 3. Using Lemma 3.6 we compute the remainder when K − D = 1 is
divided by D −1 = 2 to be r = 1. The algorithm described above constructs
the ternary Huffman code depicted in Fig. 3.11. The resulting table of code



Figure 3.11.: Construction of a ternary Huffman code.

words is the same as in Example 3.8 so that E[L] = 1.3 code symbols per
source symbol. This number is reasonably close to H(X)/ log2 (3) ≈ 1.165.
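An equivalent way to mechanize the D-ary case is to add r zero-probability dummy
leaves, with r given by Lemma 3.6, and then always merge the D least likely
subtrees; this reproduces the construction above. A Python sketch follows
(illustrative only, reusing the heap idea from the binary case); on Example 3.12
it gives the lengths (1, 1, 2, 2) and E[L] = 1.3.

    import heapq
    import itertools

    def huffman_lengths_dary(probs, D):
        """D-ary Huffman code word lengths for K >= D source letters. Dummy
        (zero-probability) leaves are added so that every merge combines exactly
        D subtrees; their number r follows Lemma 3.6."""
        K = len(probs)
        r = ((K - D) * (D - 2)) % (D - 1)
        counter = itertools.count()
        heap = [(p, next(counter), [k]) for k, p in enumerate(probs)]
        heap += [(0.0, next(counter), []) for _ in range(r)]       # unused leaves
        heapq.heapify(heap)
        lengths = [0] * K
        while len(heap) > 1:
            merged_p, merged_idx = 0.0, []
            for _ in range(D):                                     # D least likely subtrees
                p, _, idx = heapq.heappop(heap)
                merged_p += p
                merged_idx += idx
            for k in merged_idx:
                lengths[k] += 1
            heapq.heappush(heap, (merged_p, next(counter), merged_idx))
        return lengths

    probs = [0.4, 0.3, 0.2, 0.1]
    lengths = huffman_lengths_dary(probs, D=3)
    print("lengths:", lengths)                                     # [1, 1, 2, 2]
    print("E[L]  =", sum(p * l for p, l in zip(probs, lengths)))   # 1.3, as in Example 3.12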


Figure 3.12.: Source encoding scheme with K-ary input blocks X^B of length B
and D-ary output blocks U^L of length L.

3.8. Dictionary Parsers


Suppose we use a variable-length dictionary parser as shown on the bottom
of Fig. 3.9. We can interpret the parsed source sequence as X B where B
is a random variable. The output U^L takes on D^L values, and L may be
variable in general. However, in this section we are interested
in the special case where L is a fixed block length (see Fig. 3.12). We are
thus interested in variable-to-fixed length source coding. The average length
of X B is E[B] so the compression rate is L/E[B] code symbols per source
symbol. The goal is therefore to maximize E[B] for a given length L. (We
shall later consider the case where both B and L are random variables.) We
remark that the parsed sequence of blocks X^B itself forms a string of independent
and identically distributed symbols because the source is memoryless.
We require that the code is uniquely decodable which means that every
sufficiently long source sequence X B has a word in the dictionary as a
prefix. If we further desire the parser to recognize the word immediately
after it appears, then we require that no message is the prefix of a longer
message. As a result, we require that:
• The dictionary (or message set) satisfies a prefix-free condition, i.e.,
the dictionary words are leaves of a rooted tree.
• No leaves are unused.
A dictionary that satisfies these two properties is called proper. For example,
a proper dictionary for a source with alphabet {0, 1, 2} is {0, 1, 20, 21, 22}
and is shown on the left in Fig. 3.13. The dictionary {0, 1, 2, 22} on the
right in Fig. 3.13 is improper but we can make the code uniquely decodable
by using the parsing rule to always choose the longest message if more than
one parsing is possible. We thus have a parser with memory in the sense
that the parser does not necessarily put out U L immediately after having
received the final symbol in X B that is being represented. We will focus
on proper dictionaries for simplicity. However, improper dictionaries can
outperform proper dictionaries, and practical adaptive parsers often use
improper dictionaries.
The leaf entropy of a proper dictionary is simply H(U L ). Furthermore,
the branching entropy (see Sec. 3.4) is H(X) for all branching nodes. Ap-
plying the Leaf-Entropy Lemma 3.2 and the Path Length Lemma 3.1 we
immediately have

H(U L ) = H(X)E[B] . (3.32)



Figure 3.13.: Proper (left) and improper (right) dictionaries.

We further have H(U^L) ≤ log2 D^L since U^L takes on D^L values. Combining


these results, we have

L/E[B] ≥ H(X)/log2 D   (3.33)

with equality if and only if all U^L have the same probability D^{−L}. The
bound (3.33) is the variable-length-to-block counterpart of the block-to-
variable-length lower bound in (3.26).
In fact, we immediately have a more general lower bound that applies to first
using a variable-length-to-block encoder with a proper dictionary followed
by a prefix-free block-to-variable-length encoder. The idea is that the U L
in (3.32) becomes a letter “X” that is input to the block-to-variable-length
encoder to which we apply (3.26) to get E[L] ≥ H(U L )/ log2 D. As a result,
we have a variable-length B and a variable-length L.

Theorem 3.4. (Converse to the Coding Theorem for a DMS) A


D-ary prefix-free encoding of a proper dictionary for a DMS satisfies

E[L]/E[B] ≥ H(X)/log2 D.   (3.34)


3.9. Tunstall Codes


Consider Fig. 3.12 with a K-ary source and where L is fixed. The proper
dictionary codes that maximize E[B] for a K-ary random variable are called
Tunstall codes in honor of their inventor.
A proper dictionary has M = K + (J − 1)(K − 1) words if its tree is
extended J times starting from the root. The maximum depth of the tree
is J so consider the full K-ary tree of depth J. We order the 1 + K +
K^2 + · · · + K^{J−1} branching nodes of the full tree so that their probabilities
P1, P2, . . . , P_{(K^J −1)/(K−1)} are non-increasing, i.e., we have P1 = 1 and Pj ≥
Pj+1 for j = 1, 2, . . . , J − 1. We claim that the leaves of (one of) the tree(s)
with the J most likely branching nodes including each of their K leaves form
a proper dictionary. To see this, observe that there can be no unused leaves
because we have included all K leaves of each branching node. Furthermore,
this dictionary clearly maximizes E[B] which by the path-length lemma is
the sum of the probabilities of the branching nodes.
The above description specifies one method for constructing an optimal
proper dictionary by starting with a full tree. A simpler construction is
as follows. Observe that we may prune the least-likely branching node of
an optimal proper dictionary with M nodes and obtain an optimal proper
dictionary with M − (K − 1) nodes (assuming that J > 1). We may repeat
this pruning until we arrive at the root node. Reversing the process, we
have the following algorithm for constructing an optimal proper dictionary
with M words for a K-ary random variable.
• Grow the tree by applying the following rule J−1 times after extending
the root: extend the most likely leaf.
Tunstall’s algorithm for constructing an optimum proper dictionary uses the
above rule with the addition that M (or J) is chosen as large as possible
while satisfying M ≤ D^L.

Example 3.13. Consider the source of Example 3.11 with K = 4 and


suppose we wish to encode into blocks of 4 bits (D = 2, L = 4). We choose
J = 5 so that M = 16 ≤ D^L. Tunstall's algorithm constructs the dictionary
depicted in Fig. 3.14. The Path Length Lemma gives E[B] = 2.06 so that
L/E[B] = 1.94 bits. The Huffman code of Example 3.11 has a similar
efficiency of E[L] = 1.9 bits. However, the Tunstall code has many more
code words than the Huffman code. Recall that H(X) ≈ 1.846 bits.
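The rule "extend the most likely leaf" is again easy to mechanize with a priority
queue. The Python sketch below (function name illustrative) performs J − 1
extensions after extending the root and returns E[B] via the Path Length Lemma;
on Example 3.13 it gives E[B] = 2.06.

    import heapq
    import itertools

    def tunstall_expected_length(probs, J):
        """Grow a proper dictionary by extending the root and then the most likely
        leaf J-1 more times; return E[B] = sum of branching node probabilities
        (Path Length Lemma)."""
        counter = itertools.count()
        leaves = [(-p, next(counter)) for p in probs]      # root extension; max-heap via negation
        heapq.heapify(leaves)
        expected_B = 1.0                                   # the root has probability 1
        for _ in range(J - 1):
            neg_p, _ = heapq.heappop(leaves)               # most likely leaf ...
            p = -neg_p
            expected_B += p                                # ... becomes a branching node
            for q in probs:                                # and gets K new leaves
                heapq.heappush(leaves, (-p * q, next(counter)))
        return expected_B

    probs = [0.4, 0.3, 0.2, 0.1]
    E_B = tunstall_expected_length(probs, J=5)
    print(f"E[B] = {E_B:.2f}")      # 2.06, so L/E[B] = 4/2.06 ≈ 1.94 bits per source symbol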

Finally, we show that Tunstall codes approach entropy for large L. Let
pmin be the smallest (positive) probability letter from the source X. The
least likely word from the dictionary has probability Pj ∗ pmin where j ∗ is the
branching node from which this word stems. We further have
Pj∗ pmin ≤ 1/M   (3.35)


Figure 3.14.: A Tunstall code for K = 4 and M = 16 by using J = 5 extensions
including the root node.

because there are M words. By construction, Tunstall’s algorithm results


in a tree where every branching node probability Pj is at least as large as
every leaf probability. We thus have
P_{U^L}(u^L) ≤ Pj∗ ≤ 1/(M pmin)   (3.36)

for all uL and therefore


−log2 P_{U^L}(u^L) ≥ log2 M − log2(1/pmin).   (3.37)

To bound M, observe that Tunstall's algorithm chooses M > D^L − (K − 1).
Since K ≤ M we thus have M > D^L/2. Inserting this bound into (3.37)
and averaging, we obtain

H(U^L) > L log2 D − log2(2/pmin).   (3.38)

Applying (3.32) and using Theorem 3.4 we have the following analog of
Theorem 3.2 for variable-to-fixed length parsers.

Theorem 3.5. (Coding Theorem for Dictionary Parsing) The aver-


age word length E[B] of an optimum proper dictionary parser with DL code
words for a K-ary random variable X satisfies

H(X)/log2 D ≤ L/E[B] < H(X) / [ log2 D + log2(pmin/2)/L ]   (3.39)

as long as log2 D + log2 (pmin /2) /L > 0. Moreover, a Tunstall code achieves
E[B] satisfying the inequalities in (3.39).

We remark that variable-to-fixed length codes apply naturally to sources

with memory (see the end of Sec. 3.6). For example, if a small number
of long strings of symbols are very probable then one may include these
strings in a (relatively small) dictionary rather than considering all strings
of a large length whose number grows exponentially with the string length.


References
[1] J. L. Massey. Applied Digital Information Theory. ETH Zurich, Zurich,
Switzerland, 1980-1998.



Chapter 4.

Coding of Stationary Sources

4.1. Introduction

Most sources have memory but it is not obvious how to model the memory.
For example, consider the text you are reading. The memory in this source
depends on the language but also on the subject matter and the writer.
Should one model the source probabilistically? Should one use a model
with block memory or sliding-window memory? Such questions can be
answered in different ways, and different approaches lead to different data
compression algorithms. Ultimately, the choice of algorithm should be based
on the empirical performance for data sets of interest.
We take a probabilistic approach. For example, we might model the source
letters as being output by a finite state machine whose transitions from
one state to the next are governed by a conditional distribution PS 0 |S (·|s)
where S is the current state and S 0 is the next state. The letters output
by the source might be the state labels or, more generally, they might be
chosen from a conditional distribution PX|S (·|s). However, we may not know
PS 0 |S (·) or PX|S (·). We must therefore design an encoding algorithm that
learns these distributions while compressing the data. Furthermore, the
decoder must be able to keep track of the encoder’s model.
Codes that compress data for a class of sources are called universal, where
the word “universal” loosely refers to the class only. The reader should
always check to which class of sources the label “universal” applies. In fact,
small changes to a source can make “universal” codes perform poorly even
if these codes compress long strings to entropy for the original source.

4.2. Discrete Stationary Sources

The first class of sources that we consider are discrete stationary sources.
A K-ary discrete stationary source (DSS) is a device that emits symbols
X1 , X2 , X3 , . . . , from a K-ary alphabet X and where

Xn+1 , Xn+2 , . . . Xn+L


Figure 4.1.: A DSS. For example, for the window length L = 3 the strings
X1 X2 X3 and Xn Xn+1 Xn+2 have the same statistics for any n. The statistics
are time-invariant for any choice of L.

has the same probability distribution as

X1 , X2 , . . . , XL

for all n ≥ 1 and all L ≥ 1 (see Fig. 4.1). In other words, if we place a
window of any length L around consecutive symbols, then the statistical
behavior is the same irrespective of where we place the window. The reader
is warned that this definition is restrictive. For example, if the underlying
source model is a finite state machine then the source is a DSS only if the
machine has reached a steady state. If the source is in a transient state, e.g.,
upon start-up, then the source is not generally time-invariant as is required
of a DSS.
The DSS has finite memory µ if µ is the smallest non-negative integer such
that

PXn |X n−1 (xn |xn−1 ) = PXn |Xn−µ ...Xn−1 (xn |xn−µ , . . . , xn−1 )
= PXµ+1 |X µ (xn |xn−µ , . . . , xn−1 ) (4.1)

for all n > µ and for all xn (if µ = 0 then we write PXn |X n−1 (·) = PXn (·)).
For example, a DMS is a DSS with µ = 0. Of course, this means that

H(Xn |X n−1 ) = H(Xn |Xn−µ . . . Xn−1 ) = H(Xµ+1 |X µ ). (4.2)

Example 4.1. Consider the binary DSS with PX1 (0) = PX1 (1) = 1/2 and

Xn = Xn−1 ⊕ An (4.3)

for n ≥ 2 where A1 , A2 , A3 , . . . is a sequence of independent and identically


distributed (i.i.d.) bits with distribution PA (0) = 1 − p and PA (1) = p. The
DSS has memory µ = 1 if p 6= 1/2 and we compute

H(X1 ) = 1
H(X2 |X1 ) = H2 (p)

and H(Xn |X n−1 ) = H2 (p) for all n ≥ 2.

Example 4.2. We consider three special cases of Example 4.1.


• If p = 1/2 then µ = 0 and the DSS is a DMS.


• If p = 0 then we have the peculiar situation Xn = X1 for all n ≥ 1,


i.e., the DSS puts out either 0, 0, 0, . . . or 1, 1, 1, . . ., and each sequence
occurs with probability 1/2. The memory is µ = 1 and H(X2 |X1 ) = 0.
• If p = 1 then the source puts out either 0, 1, 0, 1, . . . or 1, 0, 1, 0, . . .
with probability 1/2. We again have µ = 1 and H(X2 |X1 ) = 0.

The latter two sources are called non-ergodic because there are empirical
statistics that do not match the statistics computed via the probability
distributions. For example, if p = 0 then the number of zeros in a string of
length n will be either 0 or n, while the expected number of zeros based on
the probability distributions is n/2.

For source encoding, we are interested in the entropy per source letter (see
the end of Sec. 3.6). Consider a block X B for which we use the notation
HB(X) = (1/B) H(X^B).   (4.4)

The entropy rate is defined as

H∞(X) = lim_{B→∞} (1/B) H(X^B)   (4.5)

if this limit exists. We proceed to show that the limit does exist for a DSS
and that the sequence H1(X), H2(X), H3(X), . . . has special properties re-
lated to the conditional entropy H(XB|X^{B−1}).

Theorem 4.1. (Entropy Rate for a DSS) A DSS output X1 , X2 , X3 , . . .


has the following properties

H(XB|X^{B−1}) ≤ HB(X)  for all B ≥ 1   (4.6)
H(XB|X^{B−1})  is non-increasing as B increases   (4.7)
HB(X)  is non-increasing as B increases   (4.8)
lim_{B→∞} H(XB|X^{B−1}) = H∞(X)   (4.9)

where property (4.9) means that both limits exist and are the same.

Proof. We begin with property (4.7) for which we have

H(XB |X B−1 ) ≤ H(XB |X2 X3 . . . XB−1 ) = H(XB−1 |X B−2 ) (4.10)

where the equality follows by stationarity (time-invariance). Next, we prove


property (4.6) with
B H(XB|X^{B−1}) ≤ Σ_{i=1}^{B} H(Xi|X^{i−1}) = H(X^B)   (4.11)


where the inequality follows by (4.7). Property (4.8) follows by

H(X B+1 ) = H(X B ) + H(XB+1 |X B )


≤ H(X B ) + H(XB |X B−1 )
≤ H(X B ) + HB (X)
= (B + 1)HB (X). (4.12)

where the second inequality is due to (4.6). Finally, for property (4.9) we
note that both H(XB |X B−1 ) and HB (X) are lower-bounded by zero so that
limB→∞ H(XB |X B−1 ) and H∞ (X) must exist (the limit of a non-increasing
sequence that is lower bounded exists). Property (4.6) implies

lim_{B→∞} H(XB|X^{B−1}) ≤ H∞(X)   (4.13)

and it remains to show that equality holds in (4.13). Consider a long string
X nB where n is a large integer. We have
(1/(nB)) H(X^{nB}) = (1/(nB)) Σ_{i=1}^{nB} H(Xi|X^{i−1})

                   = (1/(nB)) [ Σ_{i=1}^{B−1} H(Xi|X^{i−1}) + Σ_{i=B}^{nB} H(Xi|X^{i−1}) ]

                   ≤ (1/(nB)) [ Σ_{i=1}^{B−1} H(Xi|X^{i−1}) + (nB − B + 1) H(XB|X^{B−1}) ]   (4.14)

where the inequality follows by property (4.7). Taking n → ∞ we find that

H∞ (X) ≤ H(XB |X B−1 ) (4.15)

and we therefore have

H∞(X) ≤ lim_{B→∞} H(XB|X^{B−1}).   (4.16)

□
Property (4.9) means that we may use either HB (X) or H(XB |X B−1 ) to
estimate the entropy rate. In fact, property (4.6) suggests that using a
conditional encoding algorithm will perform better than a block-encoding
algorithm because the entropy is smaller in the former case.

Example 4.3. A DSS with finite memory µ has

H∞(X) = lim_{B→∞} H(XB|X^{B−1}) = H(Xµ+1|X^µ).   (4.17)

For instance, Example 4.1 has H∞ (X) = H(X2 |X1 ) = H2 (p).
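For the binary source of Example 4.1 the block entropies have the closed form
H(X^B) = 1 + (B − 1)H2(p), so the quantities in Theorem 4.1 can be tabulated
directly. The short Python sketch below (the chosen p is illustrative) shows HB(X)
decreasing towards H∞(X) = H2(p) while H(XB|X^{B−1}) stays at H2(p) for B ≥ 2.

    import math

    def H2(p):
        """Binary entropy function in bits."""
        return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

    p = 0.1
    h_inf = H2(p)                              # entropy rate of the source in Example 4.1
    for B in (1, 2, 4, 8, 16, 32):
        H_block = 1.0 + (B - 1) * H2(p)        # H(X^B) = H(X1) + sum of H(Xi|X_{i-1})
        H_B = H_block / B                      # HB(X), see (4.4)
        cond = 1.0 if B == 1 else H2(p)        # H(X_B|X^{B-1})
        print(f"B={B:2d}  H_B(X)={H_B:.4f}  H(X_B|X^(B-1))={cond:.4f}  H_inf={h_inf:.4f}")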


4.3. Entropy Bounds on Source Coding a DSS
Consider a block parser with blocks of length B (see Fig. 3.9). We write
the ith block of symbols as

Vi = X(i−1)B+1 , X(i−1)B+2 , . . . , XiB (4.18)

for i ≥ 1. To derive a lower bound on the compression rate, we must let


the source encoder use all past output blocks to encode. Theorem 3.3 gives
the conditional lower bound
E[Li | V^{i−1} = v^{i−1}] ≥ H(Vi | V^{i−1} = v^{i−1}) / log2 D   (4.19)

where Li is the code word length when encoding block Vi . The Theorem on
Total Expectation thus gives

E[Li] ≥ H(Vi | V^{i−1}) / log2 D   (4.20)

and this bound is valid for all i and all B. We prove the following result.

Theorem 4.2. (Coding Theorem for a DSS) The average code word
length E[L] of an optimum D-ary prefix-free code for a K-ary DSS that is
parsed into blocks of length B satisfies

H∞(X)/log2 D ≤ E[L]/B < HB(X)/log2 D + 1/B.   (4.21)

Shannon-Fano codes achieve an E[L] satisfying (4.21) and E[L] /B approaches


H∞ (X)/ log2 D as closely as desired by choosing B sufficiently large.

Proof. Consider the right-hand side of (4.20) for which we have

H(Vi|V^{i−1}) = H(X_{(i−1)B+1} X_{(i−1)B+2} . . . X_{iB} | X^{(i−1)B})

              = Σ_{j=1}^{B} H(X_{(i−1)B+j} | X^{(i−1)B+j−1})

              ≥ B H(X_{iB} | X^{iB−1})

              ≥ B H∞(X)   (4.22)

where the inequalities follow by Theorem 4.1. Substituting into (4.20) we


have the inequality on the left in (4.21). Note that we may replace E[Li ]
with E[L] because (4.20) is valid for all i. For the inequality on the right in
(4.21), we use a Shannon-Fano code for each block and ignore the memory
across blocks (see Sec. 3.6). Finally, we can make HB (X) as close to H∞ (X)
as desired by choosing large B. □


4.4. Elias Code for the Positive Integers

This section develops a prefix-free code when the source has the discrete
but infinite alphabet X = {1, 2, 3, . . .} (see also [1]). We shall use this code
in the next section. Consider the usual binary representation of a positive
integer x:

x U L (x) L(x)
1 1 1
2 10 2
3 11 2
4 100 3
5 101 3
6 110 3
7 111 3
8 1000 4
9 1001 4

This base-two (or radix-two) code is not prefix-free which is why we usually
add a space or a comma between code words to separate them. Similarly,
the usual ternary (or 4-ary or 10-ary, etc.) representation of the integers is
not prefix-free:

x U L (x) L(x)
1 1 1
2 2 1
3 10 2
4 11 2
5 12 2
6 20 2
7 21 2
8 22 2
9 100 3

Observe that we have L(x) = ⌊logD x⌋ + 1 for such D-ary codes.

We would like to make these codes prefix-free. Observe that every code word
begins with a non-zero digit. We can thus make the code prefix-free by
pre-padding each string with L(x) − 1 zeros to get a code word of length

L1(x) = 2⌊logD x⌋ + 1.

The decoder simply counts the number of zeros to determine how many
subsequent D-ary symbols it should consider after the non-zero first digit.


The prefix-free binary code is

x U L (x) L1 (x)
1 1 1
2 0.10 3
3 0.11 3
4 00.100 5
5 00.101 5
6 00.110 5
7 00.111 5
8 000.1000 7
9 000.1001 7

where we have added a “.” to distinguish the pre-padding. However, now


the code words are almost twice as long as before.

A simple trick Elias used to shorten this code for large integers is to pre-
pad differently. Instead of pre-padding with L(x) − 1 zeros, he suggested
to pre-pad with the above prefix-free code with L1 (L(x)) digits that count
the number of digits after the non-zero first digit. For D > 2 this gives the
code word length

L2(x) = L1(L(x)) + L(x)
      = 2⌊logD(⌊logD x⌋ + 1)⌋ + 1 + (⌊logD x⌋ + 1)
      = ⌊logD x⌋ + 2⌊logD(⌊logD x⌋ + 1)⌋ + 2.

For large x we thus recover the size ⌊logD x⌋ that we had for the usual D-ary
representation of x. In fact, for D = 2 we can do slightly better because
the only non-zero digit is a “1”. We can therefore remove this bit to achieve
the code-word length

L2(x) = ⌊log2 x⌋ + 2⌊log2(⌊log2 x⌋ + 1)⌋ + 1.   (4.23)

The binary code up to x = 9 is as follows:

x U L (x) L2 (x)
1 1 1
2 010.0 4
3 010.1 4
4 011.00 5
5 011.01 5
6 011.10 5
7 011.11 5
8 00100.000 8
9 00100.001 8

where we have added a “.” to distinguish the pre-padding.

The reader will observe that L2 (x) is larger than L1 (x) in the D = 2 tables


above. In fact, for D = 2 we find that

L1 (x) < L2 (x) for 2 ≤ x ≤ 3 and 8 ≤ x ≤ 15


L1 (x) = L2 (x) for x = 1 and 4 ≤ x ≤ 7 and 16 ≤ x ≤ 31
L1 (x) > L2 (x) for x ≥ 32.

For large x the Elias code performs noticeably better than the zero-padding
code. For example, for 16-bit Unicode characters and the “intermediate”
number x = 2^15 we have L1(x) = 31 bits and L2(x) = 24 bits. We will
consider Elias’ code again when we develop a “universal” source coding
algorithm.
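The binary (D = 2) construction can be summarized in a few lines of Python
(function names illustrative): the code word for x is the zero-padded prefix-free
code of L(x) followed by the binary representation of x with its leading 1
removed, so the lengths match (4.23) and the tables above.

    def binary_rep(x):
        """Usual binary representation of a positive integer, e.g. 5 -> '101'."""
        return bin(x)[2:]

    def elias_L1(x):
        """Zero-padded prefix-free code: L(x)-1 zeros followed by the binary representation."""
        b = binary_rep(x)
        return "0" * (len(b) - 1) + b

    def elias_L2(x):
        """Elias code for D = 2: L1-code of L(x), then the binary representation of x
        without its leading 1 (the leading 1 is implicit)."""
        b = binary_rep(x)
        return elias_L1(len(b)) + b[1:]

    for x in (1, 2, 3, 4, 8, 9, 2**15):
        word = elias_L2(x)
        print(f"x={x:6d}  code={word}  L2(x)={len(word)}")
    # Reproduces the table above, e.g. x=2 -> 0100 and x=8 -> 00100000,
    # and gives L2(2^15) = 24 bits as quoted above.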


4.5. Elias-Willems Universal Source Coding

We now describe an algorithm for source coding when we do not know the
source statistics. We consider sources that are stationary and ergodic, where
ergodic means that time averages correspond to statistical averages.
Consider the DSS sequence X1 , X2 , X3 , . . .. Let ∆i be the time to the most
recent occurrence of the letter Xi . For example, consider the 4-ary source
with alphabet {a, b, c, d} and

X 8 = abacadab.

We initialize by prepending X 8 with the string dcba and compute

∆8 = 1, 3, 2, 6, 2, 9, 2, 6.

Note that the first a has the prepended a occurring 1 time unit in the past,
the first b has the prepended b occurring 3 time units in the past, the second
a has the first a occurring 2 time units in the past, and so forth. Observe
that we can recover X n from ∆n and the initialization string. Furthermore,
we can encode the sequence of integers by using the Elias code for the
positive integers given in Sec. 4.4.
An important result that we state without proof is that the expected recur-
rence time of a stationary and ergodic source satisfies
E[∆i | Xi = a] = 1/PX1(a)   (4.24)

for all times i and all letters a.

Example 4.4. Consider a DMS PX for which the probability of waiting for
j symbols before seeing the letter a is (1 − PX(a))^{j−1} PX(a). The expected
waiting time is therefore

E[∆i | Xi = a] = Σ_{j=1}^{∞} j (1 − PX(a))^{j−1} PX(a)   (4.25)

which evaluates to 1/PX (a).
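The identity (4.24) is also easy to check empirically for a DMS. The Python
sketch below (it assumes numpy is available; the distribution and letter are
illustrative) generates an i.i.d. sequence, collects the recurrence times of one
letter, and compares their average with 1/PX(a).

    import numpy as np

    rng = np.random.default_rng(1)
    letters = np.array(["a", "b", "c", "d"])
    probs = np.array([0.4, 0.3, 0.2, 0.1])

    seq = rng.choice(letters, size=500_000, p=probs)   # DMS output
    target = "c"                                       # check (4.24) for this letter

    positions = np.flatnonzero(seq == target)
    recurrences = np.diff(positions)                   # times between successive occurrences
    print("empirical E[Delta | X = c] ~", recurrences.mean())
    print("1 / P_X(c)                 =", 1 / probs[letters == target][0])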

We apply the binary Elias code to ∆i and compute

E[Li |Xi = a] = E[L2 (∆i )|Xi = a]


≤ E[log2 ∆i + 2 log2 (log2 ∆i + 1) + 1|Xi = a]
(a)
≤ log2 E[∆i |Xi = a] + 2 log2 (log2 E[∆i |Xi = a] + 1) + 1
= − log2 PX1 (a) + 2 log2 (− log2 PX1 (a) + 1) + 1 (4.26)

where (a) follows by the concavity of the logarithm and Jensen’s inequality.


Taking expectations we have

E[Li ] ≤ H(X1 ) + 2 log2 (H(X1 ) + 1) + 1 (4.27)

where we have again used Jensen’s inequality. Finally, rather than encoding
the Xi we use a block parser and encode blocks of symbols Vi as in (4.18).
We may simply replace X1 with V1 of length B in (4.27) and we compute

H(V1 ) = H(X B ) = BHB (X). (4.28)

Combining the above results, we have a source coding scheme that is uni-
versal for the class of stationary and ergodic sources and that compresses
to the DSS entropy rate.

Theorem 4.3. (Universal Source Coding Theorem for a DSS) The


average code word length E[L] of a binary code for a K-ary DSS that is
parsed into blocks of length B and for which the Elias code is used to map
the time to the most recent occurrence of letters satisfies
E[L]/B < HB(X) + (2/B) log2(B HB(X) + 1) + 1/B.   (4.29)
Moreover, E[L] /B can be made as close to H∞ (X) as desired by choosing
B sufficiently large.

Finally, we remark that coding the waiting times ∆i is inefficient. For


example, if some letters have very low probability then their waiting times
tend to be very large. A better approach is to use a recency rank calculator
that puts out numbers Ri that are at most the size of the alphabet. The
idea is that we assign each letter in the K B -ary alphabet of Vi a number
called its recency rank. If the letter v with recency rank r occurs at time
i then the recency rank calculator puts out the number Ri = r. Next, the
recency ranks are adjusted: the recency ranks of letters whose recency rank
was below r are incremented by one and v is assigned the recency rank 1.
For example, consider B = 1 and our string X 8 = abacadab. We ini-
tialize the recency ranks as 1, 2, 3, 4 for the letters a, b, c, d, respectively
(this corresponds to the same initialization as in our example above). The
recency-ranks are
R8 = 1, 2, 2, 3, 2, 4, 2, 4
and the important things to note are that (i) we can recover X n from Rn
and the initialization rank, and (ii) Ri ≤ ∆i . Encoding the recency ranks
Ri is thus more efficient than encoding the waiting times ∆i .
The compression scheme consisting of the recency rank calculator and the
Elias code for the positive integers is known as the Elias-Willems source
coding scheme [1, Ch. 4].
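Both transforms used in this section are easy to reproduce. The Python sketch
below (function names illustrative) computes the waiting times ∆i and the recency
ranks Ri for the string X^8 = abacadab with the initializations used above
(prepended string dcba, i.e., initial ranks 1, 2, 3, 4 for a, b, c, d).

    def waiting_times(x, init):
        """Delta_i: time to the most recent occurrence of x[i], with 'init' prepended."""
        last = {c: pos for pos, c in enumerate(init)}
        deltas = []
        for i, c in enumerate(x, start=len(init)):
            deltas.append(i - last[c])
            last[c] = i
        return deltas

    def recency_ranks(x, init_order):
        """R_i: recency rank of x[i]; the most recently seen letter has rank 1."""
        order = list(init_order)                       # most recent letter first
        ranks = []
        for c in x:
            r = order.index(c) + 1                     # current rank of c
            ranks.append(r)
            order.insert(0, order.pop(r - 1))          # move c to the front
        return ranks

    x = "abacadab"
    print(waiting_times(x, "dcba"))        # [1, 3, 2, 6, 2, 9, 2, 6]
    print(recency_ranks(x, "abcd"))        # [1, 2, 2, 3, 2, 4, 2, 4]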


References
[1] J. L. Massey. Applied Digital Information Theory. ETH Zurich, Zurich,
Switzerland, 1980-1998.



Chapter 5.

Channel Coding

5.1. Rate, Reliability, and Cost


Channel coding is concerned with transmitting data reliably from a source to
a sink. One can anticipate that by reducing rate, one can increase reliability.
For example, if your conversation partner does not understand what you say,
simply repeat the message until you receive an acknowledgment. We thus
expect a rate-reliability tradeoff and would like to determine the frontier of
this tradeoff.
As a great surprise, Shannon showed that the tradeoff exhibits a sharp be-
havior: below a certain rate he called capacity one can achieve as reliable
communication as desired! And above capacity one has a positive lower
bound on reliability that increases very rapidly with increasing rate. More-
over, reliability is in general possible only by coding over long sequences of
symbols.
But Shannon discovered more. First, one can control reliability by applying
a lossy compression code followed by a channel code. Second, this separation
of compression and reliability coding incurs no loss in rate. Third, if one
introduces a cost S associated with transmission then there is a capacity-
cost tradeoff. For example, a symbol with large energy generally costs
more to transmit than one with small energy. Capacity naturally increases
with increasing cost, and this suggests capacity-cost shapes such as those
depicted in Fig. 5.1. We shall find that the uppermost shape is the correct
one for the cost functions that we are interested in.

Figure 5.1.: Three possible shapes for increasing (or more precisely non-decreasing) curves.


Figure 5.2.: The channel coding problem: the channel is defined by a function f(·)
and a noise source PZ(·).

Figure 5.3.: FDG for a memoryless channel with n = 3. The hollow vertices
represent mutually statistically independent random variables.

5.2. Memoryless Channels


A memoryless channel is the basic model for channel coding, and it is
depicted in Fig. 5.2. The functional dependence graph (FDG) shown in
Fig. 5.3 specifies the relationships between the random variables.1 There
are five types of variables: a source message W , channel inputs Xi , channel
outputs Yi , noise Zi , i = 1, 2, . . . , n, and a message estimate Ŵ .
A source puts out the message w, w ∈ {1, 2, . . . , M }. An encoder maps w
to a string xn in X n . We assume that H(W ) = nR and nR is an integer for
simplicity. We may thus view W as being a string V nR of independent bits
Vi , i = 1, 2, . . . , nR, where PVi (0) = PVi (1) = 1/2. The rate R measures
how many information bits are sent per channel input symbol.
The channel puts out

yi = f (xi , zi ), i = 1, 2, . . . , n (5.1)

for some function f (·) with range Y, and where each zi is a different re-
alization of a noise random variable Z with alphabet Z. In other words,
1
A FDG is a graph where the vertices represent random variables and the edges repre-
sent functional dependencies, see Sec. A.5.

Figure 5.4.: The channel coding problem: the channel is a conditional probability distribution PY|X(·).

the Z1 , Z2 , . . . , Zn all have the same distribution PZ and are statistically


independent. Furthermore, the noise string Z n is statistically independent
of W and X n . The channel is memoryless because yi is a function of xi
and zi only. The channel is time invariant because f (·) and PZ (·) do not
depend on time. If the alphabets X and Y are discrete and finite, then the
channel is called a discrete memoryless channel (DMC).
The decoder maps y n to an estimate ŵ of w, where ŵ ∈ {1, 2, . . . , M }. The goal is
to find the maximum rate R for which, by choosing n sufficiently large, one
can make Pe = Pr[Ŵ ≠ W] arbitrarily close to zero (but not necessarily
exactly zero). This maximum rate is called the capacity C.
Observe that capacity does not account for the delay due to encoding and
decoding, or for the complexity of encoding and decoding. Delay and com-
plexity are, of course, of great engineering relevance. We are therefore trying
to find only the limits that we should try to approach if we are willing to
delay information transfer, and if we are willing to build complex devices.
The above functional definitions can alternatively be written by using prob-
ability distributions. In particular, for a DMC we may write the channel
(5.1) as a conditional probability distribution PY |X (·) as in Fig. 5.4. The
joint distribution of the random variables (other than the noise) satisfies
" n #
n n n
PY |X (yi |xi ) P (ŵ|y n )
Y
P (w, x , y , ŵ) = P (w)P (x |w) (5.2)
i=1

where both P (xn |w) and P (ŵ|y n ) are either 0 or 1.


As done here, we will often remove the subscripts on the probability dis-
tributions if the arguments are lower-case versions of the random variables.
For instance, we write P (w) for PW (w). Similarly, we could write P (yi |xi )
for PYi |Xi (yi |xi ) but we prefer to write this as PY |X (yi |xi ) to emphasize that
the channel PY |X is time-invariant.

Example 5.1. A binary symmetric channel (BSC) has X = Y = Z =


{0, 1} and

f (x, z) = x + z (5.3)

where addition is modulo-2. A diagram representing the BSC is shown in


Fig. 5.5 where PZ (1) = p. The parameter p is called the channel crossover
probability and we usually have p ≤ 1/2. The channel conditional probabil-

LNT, TUM Information Theory


84 Chapter 5. Channel Coding

Figure 5.5.: BSC with crossover probability p.

Figure 5.6.: Binary channel with asymmetric crossover probabilities.

ity distribution is

P(y|x) = { 1 − p if y = x;  p if y ≠ x }.    (5.4)

Example 5.2. An asymmetric version of the BSC has Z = {00, 01, 10, 11}
and
f(x, z0 z1) = { z0 if x = 0;  1 + z1 if x = 1 } = x + zx    (5.5)

where the noise is now a pair Z = Z0 Z1 of binary random variables. The


variables X and Z are independent, but X and ZX are dependent in general.
A diagram representing this channel is shown in Fig. 5.6, where px = PZx(1).
The channel conditional probability distribution is
P(y|x) = { 1 − px if y = x;  px if y ≠ x }.    (5.6)

Example 5.3. A binary erasure channel (BEC) has X = {0, 1}, Y =


{0, 1, ∆}, Z = {0, ∆}, and
f(x, z) = { x if z = 0;  ∆ if z = ∆ }.    (5.7)

A diagram representing the BEC is shown in Fig. 5.7 where PZ (∆) = p.


Observe that receiving y = 0 or y = 1 gives full knowledge of x. The channel

Figure 5.7.: BEC with erasure probability p.

Figure 5.8.: PEC with b = 2 and erasure probability p.

conditional probability distribution is

P(y|x) = { 1 − p if y = x;  p if y = ∆;  0 if x = 0 and y = 1, or if x = 1 and y = 0 }.    (5.8)

Example 5.4. A packet erasure channel (PEC) with packets of length b


bits has X = {0, 1, . . . , 2b − 1}, Y = {0, 1, . . . , 2b − 1, ∆}, Z = {0, ∆}, and
f (x, z) having the same form as (5.7). The PEC with b = 1 is thus a BEC.
A diagram representing the PEC with b = 2 is shown in Fig. 5.8 where
PZ (∆) = p. Receiving y, y 6= ∆, again gives full knowledge of x. The
channel conditional probability distribution is

P(y|x) = { 1 − p if y = x;  p if y = ∆;  0 if y ≠ ∆ and y ≠ x }.    (5.9)

Example 5.5. An additive channel has

f (x, z) = x + z (5.10)

where the “+” denotes addition in a field with alphabets X = Y = Z. For


example, a BSC is additive over the Galois field GF(2). If the field is the set


of real numbers and Z is a Gaussian random variable with zero mean then
we have an additive white Gaussian noise (AWGN) channel. Of course,
an AWGN channel is not a DMC. We must therefore replace PY |X (y|x)
in (5.2) with a channel conditional probability density function p(y|x). If Z
has variance N then we have
p(y|x) = pZ(y − x) = (1/√(2πN)) e^{−(y−x)²/(2N)}.    (5.11)


5.3. Cost Functions


The transmission and reception of symbols often incurs costs, e.g., power
or energy costs. We therefore refine the capacity problem by adding a cost
constraint. Suppose that transmitting xn and receiving y n incurs a cost of
sn (xn , y n ) units. We require that the average cost satisfy

E[sn (X n , Y n )] ≤ S. (5.12)

We consider sn (·) that are averages of a per-letter cost function s(·):


s^n(x^n, y^n) = (1/n) Σ_{i=1}^n s(xi, yi).    (5.13)

The largest rate C as a function of the cost S is called the capacity cost
function, and is denoted C(S).

Example 5.6. Suppose X = {0, 1} and


s(x, y) = { 0 if x = 0;  E if x = 1 }    (5.14)

so that sending x = 1 costs E units of energy. This situation might occur for
an optical channel where transmitting a 1 represents light while transmitting
a 0 represents dark. A cost constraint with 0 ≤ S < E/2 will bias the best
transmission scheme towards sending the symbol 1 less often.

Example 5.7. Suppose X = R and s(x, y) = x2 so that


E[s^n(X^n, Y^n)] = (1/n) Σ_{i=1}^n E[Xi²] ≤ P    (5.15)

where we have chosen S = P to emphasize that we interpret P as the


average transmit power. The energy is thus at most nP . The constraint
(5.15) is called an average block power constraint. Another interesting choice
is s(x, y) = y 2 which limits the average receive power.

Example 5.8. A more stringent constraint than (5.12) is that the cost be
bounded with probability 1:

Pr [sn (X n , Y n ) ≤ S] = 1. (5.16)

For example, with s(x, y) = x2 and S = P we obtain the power constraints


(1/n) Σ_{i=1}^n xi(w)² ≤ P,  for all w = 1, 2, . . . , M.    (5.17)


5.4. Block and Bit Error Probability

Consider W = V b where the Vi are independent bits with PVi (0) = PVi (1) =
1/2, i = 1, 2, . . . , b. We thus have M = 2b and R = b/n bits per channel
symbol. We have defined the channel coding by requiring that the block
error probability
Pe = Pr[Ŵ ≠ W]    (5.18)

be small. However, one is sometimes interested in minimizing the average


bit error probability
Pb = (1/b) Σ_{i=1}^b Pr[V̂i ≠ Vi].    (5.19)

We have the following relations between Pb and Pe :


(a) (b)
Pb ≤ Pe ≤ b Pb . (5.20)

The bound (a) is because a bit error implies a block error, and the bound
(b) is because a block error implies at least 1 bit error for the b bits. One
has equality on the left if all bits in an erroneous block are incorrect, and
one has equality on the right if exactly one bit is incorrect in each erroneous
block. The bounds (5.20) imply that if Pb is positive so is Pe . Similarly, if
Pe is small so is Pb . This is why coding theorems should upper bound Pe
and converse theorems should lower bound Pb . For example, a code with
large Pe can have small Pb .

We next develop lower bounds on Pe and Pb . Consider first Pe . Using Fano’s


inequality and |W| = M = 2^{nR} we have

H(W|Ŵ) ≤ H2(Pe) + Pe log2(|W| − 1)
        < H2(Pe) + Pe nR    (5.21)

We also have H(W|Ŵ) = H(W) − I(W; Ŵ) and H(W) = nR so that²

nR < [I(W; Ŵ) + H2(Pe)] / (1 − Pe).    (5.22)

The bound (5.22) gives a rate-reliability tradeoff, i.e., nR is at most I(W ; Ŵ )


if reliability is good (Pe is small) and nR can become larger if reliability is
poor (Pe is large). The tradeoff is parameterized by I(W ; Ŵ ). We empha-
size that (5.22) is valid for any choice of Pe .

2
The bounds (5.22) and (5.25) are valid for any channel: discrete or continuous, mem-
oryless or with memory. An improved lower bound on the block error probability Pe
for memoryless channels is called a strong converse and is developed in Problem B.5.6.


Consider next Pb for which we bound


b
!
1X h i
H2 (Pb ) = H2 Pr V̂i 6= Vi
b i=1
b
(a) 1X  h i
≥ H2 Pr V̂i 6= Vi
b i=1
b
(b) 1X
≥ H(Vi |V̂i ) (5.23)
b i=1

where (a) follows by the concavity of H2 (·), and (b) by Fano’s inequality.
We continue the chain of inequalities by using H(Vi |V̂i ) ≥ H(Vi |V i−1 V̂ b ) so
that
H2(Pb) ≥ (1/b) Σ_{i=1}^b H(Vi|V^{i−1} V̂^b)
       = (1/b) H(V^b|V̂^b)
       = (1/b) [H(V^b) − I(V^b; V̂^b)]
       = 1 − I(W; Ŵ)/(nR).    (5.24)
We thus have the following counterpart to (5.22):

nR ≤ I(W; Ŵ) / (1 − H2(Pb)).    (5.25)

We again have a rate-reliability tradeoff with 0 ≤ Pb ≤ 1/2, and we essen-


tially require nR ≤ I(W ; Ŵ ) if Pb is small.
Observe that we can write
Pe = Σ_{w=1}^M P(w) Pr[Ŵ ≠ w | W = w]    (5.26)

and the probabilities Pr[Ŵ ≠ w | W = w] can be different. An interesting
parameter is therefore the maximum block error probability

Pm = max_{1≤w≤M} Pr[Ŵ ≠ w | W = w].    (5.27)

We clearly have Pb ≤ Pe ≤ Pm so that a lower bound on Pb is a lower bound


on Pm and an upper bound on Pm is also an upper bound on Pb and Pe .

Figure 5.9.: Random coding experiment for channel coding.

5.5. Random Coding


We construct a random code book for the DMC with a cost constraint. We
begin by choosing a distribution PX .
Code Construction: Generate M = 2^{nR} codewords x^n(w), w = 1, 2, . . . , 2^{nR},
by choosing the n · 2^{nR} symbols xi(w) in the code book

xn = [xn (1), xn (2), . . . , xn (M )] (5.28)

independently using PX .
Encoder: Given w, transmit xn (w).
Decoder: Given y n , choose ŵ as (one of) the message(s) w that maximizes

P(w|y^n) = P(w)P(y^n|w)/P(y^n) = P(w)P(y^n|x^n(w))/P(y^n).    (5.29)

This decoder is called a maximum a-posteriori probability (MAP) decoder.


Since all messages are equally likely, the MAP decoder is the same as the
maximum likelihood (ML) decoder that chooses ŵ as (one of) the message(s)
w that maximizes the P (y n |xn (w)).
Analysis: The random coding experiment is shown in Fig. 5.9 where the
random variables W , X n (1), X n (2), . . . , X n (2nR ) are mutually statistically
independent. The joint distribution of the random variables is

P(w, x^n(1), . . . , x^n(2^{nR}), y^n, ŵ)
  = P(w) [ ∏_{i=1}^{2^{nR}} PX^n(x^n(i)) ] PY|X^n(y^n|x^n(w)) 1(ŵ = f(y^n))    (5.30)

where 1(·) is the indicator function that takes on the value 1 if its argu-
ment is true and is 0 otherwise. We compute the error probability for this
experiment. We have two error events

E = {Ŵ ≠ W}    (5.31)
F = {E[s^n(X^n, Y^n)] > S}    (5.32)


and the error probability can be written as the code book average

Pr[E ∪ F] = Σ_{x^n} P(x^n) Pr[E ∪ F | X^n = x^n].    (5.33)

We wish to find conditions so that for any positive δ there is a sufficiently


large n such that Pr [E ∪ F] ≤ δ. If we are successful, then (5.33) guarantees
there must exist a code book xn for which Pr [E ∪ F | X n = xn ] ≤ δ.

5.5.1. Block Error Probability

The union bound gives Pr [E ∪ F] ≤ Pr [E] + Pr [F] so we may upper bound


the probability of each error event separately. We begin with
Pr[E] = Σ_w P(w) Pr[Ŵ ≠ w | W = w].    (5.34)

By symmetry, we have Pr[Ŵ ≠ w | W = w] = Pr[Ŵ ≠ 1 | W = 1] so we
proceed to bound Pr[Ŵ ≠ 1 | W = 1].

Consider the events

E(w̃) = {PY^n|X^n(Y^n|X^n(1)) ≤ PY^n|X^n(Y^n|X^n(w̃))}    (5.35)

for w̃ ≠ 1. The event {Ŵ ≠ 1} occurs only if E(w̃) occurs for at least one
w̃ with w̃ ≠ 1. Furthermore, for such a w̃ the likelihood ratio

L(w̃) = PY^n|X^n(Y^n|X^n(w̃)) / PY^n|X^n(Y^n|X^n(1))    (5.36)

must be at least 1. We thus have L(w̃)^s ≥ 1 for any s ≥ 0 and also
{Σ_{w̃≠1} L(w̃)^s}^ρ ≥ 1 for any s ≥ 0 and any ρ ≥ 0. These results imply

Pr[Ŵ ≠ 1 | W = 1] ≤(a) Pr[ {Σ_{w̃≠1} L(w̃)^s}^ρ ≥ 1 | W = 1 ]
                  ≤(b) E[ {Σ_{w̃≠1} L(w̃)^s}^ρ | W = 1 ]    (5.37)

where (a) is because A ⇒ B implies Pr [A] ≤ Pr [B], and (b) follows by the
Markov inequality (A.65).

The expectation in (5.37) is with respect to the X n (w), w = 1, 2, . . . , M ,


and Y n . Since we have W = 1, the pair (X n (1), Y n ) is jointly distributed
according to the channel distribution PY |X . The X n (w̃), w̃ 6= 1, are statis-
tically independent of (X n (1), Y n ). If we take the expectation over the pair


(X^n(1), Y^n), then the right-hand side of (5.37) is

Σ_{x^n,y^n} P(x^n, y^n) · E[ { Σ_{w̃≠1} (PY^n|X^n(y^n|X^n(w̃)) / P(y^n|x^n))^s }^ρ | X^n(1) = x^n, Y^n = y^n ]
  =(a) Σ_{x^n,y^n} P(x^n) P(y^n|x^n)^{1−sρ} · E[ { Σ_{w̃≠1} PY^n|X^n(y^n|X^n(w̃))^s }^ρ ]    (5.38)

where we removed the conditioning on W = 1 because (X n (1), Y n ) and the


X n (w̃), w̃ 6= 1, are independent of {W = 1}. Step (a) follows because the
X n (w̃), w̃ 6= 1, are independent of (X n (1), Y n ). For 0 ≤ ρ ≤ 1, Jensen’s in-
equality gives E[X ρ ] ≤ E[X]ρ so the expectation in (5.38) is upper bounded
by
E[ Σ_{w̃≠1} PY^n|X^n(y^n|X^n(w̃))^s ]^ρ = (M − 1)^ρ { Σ_{x^n} P(x^n) P(y^n|x^n)^s }^ρ.    (5.39)

Now substitute s = 1/(1 + ρ) and insert (5.39) into (5.38) to obtain

Pr[Ŵ ≠ 1 | W = 1] ≤ (M − 1)^ρ Σ_{y^n} { Σ_{x^n} P(x^n) P(y^n|x^n)^{1/(1+ρ)} }^{1+ρ}.    (5.40)

We remark that (5.40) is valid for channels with memory and any P (xn ). It is
also valid for continuous channels by replacing P (y n |xn ) with the conditional
density p(y n |xn ).
Using (M − 1)^ρ ≤ M^ρ = 2^{nRρ}, random coding with P(x^n) = ∏_{i=1}^n PX(xi),
and the memoryless property P(y^n|x^n) = ∏_{i=1}^n PY|X(yi|xi), the bound (5.40)
becomes

Pr[Ŵ ≠ 1 | W = 1] ≤ 2^{−n[E0(ρ,PX) − ρR]}    (5.41)

for all 0 ≤ ρ ≤ 1 where


E0(ρ, PX) = −log2 Σ_y { Σ_x P(x) P(y|x)^{1/(1+ρ)} }^{1+ρ}.    (5.42)

Optimizing over ρ, we have


h i
Pr Ŵ 6= 1 | W = 1 ≤ 2−nEG (R,PX ) (5.43)

where EG (R, PX ) is the Gallager exponent3

EG (R, PX ) = max [E0 (ρ, PX ) − ρR] . (5.44)


0≤ρ≤1

One may prove the following properties of EG (R, PX ) (see Problem B.5.5):
a) EG (0, PX ) = E0 (1, PX ).
3
The Gallager exponent optimized over PX is written as EG (R) = maxPX EG (R, PX ).

Figure 5.10.: Gallager exponent EG(R, PX) for a fixed PX.

b) EG (R, PX ) is linear with slope -1 for 0 ≤ R ≤ Rc (PX ) where

Rc(PX) = ∂E0(ρ, PX)/∂ρ |_{ρ=1}.    (5.45)

c) EG (R, PX ) is convex and positive in the interval 0 ≤ R < I(X; Y ) if


I(X; Y ) is positive.

The shape of EG (R, PX ) is shown in Fig. 5.10. We see that as long as we


choose R and PX so that

R < I(X; Y ) (5.46)


then we can make Pr[Ŵ ≠ W] ≤ δ for any δ > 0 by choosing large n.
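The exponent (5.42) and the maximization (5.44) are easy to evaluate numerically for small alphabets. The following is a minimal sketch (added for illustration, not part of the original notes), assuming Python with NumPy and a simple grid search over ρ; the BSC parameters are chosen only as an example.

import numpy as np

def E0(rho, PX, PYgX):
    # E0(rho, PX) in (5.42): -log2 sum_y [ sum_x P(x) P(y|x)^(1/(1+rho)) ]^(1+rho)
    s = 1.0 / (1.0 + rho)
    inner = (PX[:, None] * PYgX ** s).sum(axis=0)   # one term per output y
    return -np.log2((inner ** (1.0 + rho)).sum())

def gallager_exponent(R, PX, PYgX, grid=1001):
    # crude grid search over 0 <= rho <= 1 in (5.44)
    rhos = np.linspace(0.0, 1.0, grid)
    return max(E0(r, PX, PYgX) - r * R for r in rhos)

# BSC with crossover p = 0.11 and a uniform input
p = 0.11
PX = np.array([0.5, 0.5])
PYgX = np.array([[1 - p, p],
                 [p, 1 - p]])            # rows indexed by x, columns by y
print(gallager_exponent(0.2, PX, PYgX))  # positive since 0.2 < I(X;Y) = 1 - H2(0.11)

For R ≥ I(X; Y) the maximum in (5.44) is attained at ρ = 0 and equals zero, consistent with property c).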

5.5.2. Capacity-Cost Function

For the cost constraint we have

Pr [F] = 1 (E[sn (X n (W ), Y n )] > S) (5.47)

where 1(.) is the indicator function that takes on the value 1 if its argu-
ment is true and is zero otherwise. We must therefore choose PX so that
E[sn (X n , Y n )] ≤ S over the ensemble of code books. We easily compute
E[s^n(X^n(W), Y^n)] = (1/n) Σ_{i=1}^n E[s(Xi(W), Yi)] = E[s(X, Y)].    (5.48)

Thus, we must choose PX so that

E[s(X, Y )] ≤ S. (5.49)

Combining the above results, for large n there is a code in the random
ensemble of codes that has small Pe , expected cost at most S, and with a


rate that approaches the capacity-cost function

C(S) = max_{PX : E[s(X,Y)]≤S} I(X; Y).    (5.50)

If there is no cost constraint, then we can approach the rate

C = max_{PX} I(X; Y).    (5.51)
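For a channel with a binary input alphabet, the maximization in (5.51) can be approximated by a one-dimensional search over PX(1). The sketch below is an illustration added here (not from the original notes), assuming Python with NumPy; for larger input alphabets one would use a proper optimization routine instead.

import numpy as np

def mutual_information(PX, PYgX):
    # I(X;Y) = sum_{x,y} P(x) P(y|x) log2( P(y|x) / P(y) )
    PXY = PX[:, None] * PYgX
    PY = PXY.sum(axis=0)
    I = 0.0
    for x in range(PXY.shape[0]):
        for y in range(PXY.shape[1]):
            if PXY[x, y] > 0:
                I += PXY[x, y] * np.log2(PYgX[x, y] / PY[y])
    return I

# one-dimensional search over PX(1) for a binary-input channel, here a BSC with p = 0.11
p = 0.11
PYgX = np.array([[1 - p, p], [p, 1 - p]])
C = max(mutual_information(np.array([1 - q, q]), PYgX) for q in np.linspace(0, 1, 1001))
print(C)   # ~ 1 - H2(0.11), achieved by the uniform input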

5.5.3. Maximum Error Probability

Consider the maximum error probability (5.27). Suppose we have a fixed


code for which

Pr [E ∪ F] ≤ δ. (5.52)

For example, we might have found this code by using the random coding
method described above. We may write
Pr[E ∪ F] = Σ_{w=1}^M P(w) Pr[E ∪ F | W = w].    (5.53)

Now re-index the codewords by ordering them so that the error probability
increases as w increases, i.e., re-index so that

Pr [E ∪ F | W = w1 ] ≤ Pr [E ∪ F | W = w2 ] (5.54)

if w1 ≤ w2 . Now consider the code book having only the first M/2 messages
(suppose M is even) of the re-indexed code book. The rate is reduced by
1/n bits as compared to the original code book, but the maximum error
probability satisfies

Pm = max_{1≤w≤M/2} Pr[E | W = w]
   ≤ max_{1≤w≤M/2} Pr[E ∪ F | W = w]
   = Pr[E ∪ F | W = M/2]
   ≤(a) 2δ    (5.55)

where (a) follows by (5.53) since no more than half of the terms in the sum
can be larger than 2δ. In other words, by expurgating our codes we obtain
codes with a maximum error probability as close to zero as desired and rates
as close to C(S) as desired.

Figure 5.11.: Concavity of the capacity-cost function.

5.6. Concavity and Converse

The function C(S) in (5.50) is non-decreasing in S because increasing S


permits using a larger class of PX . We show that C(S) is concave in S.

Consider two distinct points (S1 , C(S1 )) and (S2 , C(S2 )) and suppose the
distributions PX1 and PX2 achieve these respective points (see Fig. 5.11).
That is, we have

S1 = E[s(X1, Y1)],  C(S1) = I(X1; Y1)
S2 = E[s(X2, Y2)],  C(S2) = I(X2; Y2)    (5.56)

where Y1 and Y2 are the respective outputs of the channel PY |X when the
input is X1 and X2 . Consider the mixture distribution

PX3 (x) = λ PX1 (x) + (1 − λ) PX2 (x) (5.57)

for all x, where 0 ≤ λ ≤ 1. We have


S3 = Σ_{(x,y)∈supp(PX3Y)} PX3(x) P(y|x) s(x, y)
   = Σ_{(x,y)∈supp(PX3Y)} [λ PX1(x) + (1 − λ) PX2(x)] P(y|x) s(x, y)
   = λ S1 + (1 − λ) S2.    (5.58)

We thus have

C(λ S1 + (1 − λ) S2) = C(S3)
   ≥(a) I(X3; Y3)
   ≥(b) λ I(X1; Y1) + (1 − λ) I(X2; Y2)
   = λ C(S1) + (1 − λ) C(S2).    (5.59)

where (a) follows because PX3 might not give the maximum mutual infor-
mation for the cost S3 , and (b) follows by the concavity of I(X; Y ) in PX
when PY |X is held fixed (see Thm. 1.11). The bound (5.59) means that
C(S) is concave in S.


We now show that C(S) in (5.50) is the capacity cost function. We have
I(W; Ŵ) ≤(a) I(X^n; Y^n)
        = Σ_{i=1}^n [H(Yi|Y^{i−1}) − H(Yi|Xi)]
        ≤ Σ_{i=1}^n [H(Yi) − H(Yi|Xi)]
        = Σ_{i=1}^n I(Xi; Yi).    (5.60)

where (a) follows by the data processing inequality. If there is no cost


constraint then we can use the simple bound

I(Xi; Yi) ≤ max_{PX} I(X; Y) = C.    (5.61)

Inserting (5.61) into (5.60) and then into (5.22) and (5.25) we obtain R ≤ C
for small Pe and Pb .
More generally, with a cost constraint we must use the concavity of C(S) in
S. We have I(Xi ; Yi ) ≤ C(E[s(Xi , Yi )]) because C(E[s(Xi , Yi )]) maximizes
mutual information for the cost E[s(Xi , Yi )]. We thus have
Σ_{i=1}^n I(Xi; Yi) ≤ n Σ_{i=1}^n (1/n) C(E[s(Xi, Yi)])
                    ≤(a) n C( (1/n) Σ_{i=1}^n E[s(Xi, Yi)] )
                    = n C(E[s^n(X^n, Y^n)])
                    ≤(b) n C(S)    (5.62)

where (a) follows by the concavity of C(S) and (b) follows because we require
E[sn (X n , Y n )] ≤ S and because C(S) is non-decreasing in S. Inserting
(5.62) into (5.60) and then into (5.22) and (5.25) we have

R < [C(S) + H2(Pe)/n] / (1 − Pe)    (5.63)

and

R ≤ C(S) / (1 − H2(Pb)).    (5.64)

Thus, we find that R can be at most C(S) for reliable communication and
E[sn (X n , Y n )] ≤ S.

Figure 5.12.: Capacity-cost function for a BSC with p = 0.11 and E = 1.

5.7. Discrete Alphabet Examples

5.7.1. Binary Symmetric Channel

The binary symmetric channel (BSC) has X = Y = {0, 1} and Pr[Y ≠ X] =


p. In the absence of a cost constraint, we have

I(X; Y) = H(Y) − H(Y|X) = H2(PX(1) ∗ p) − H2(p)    (5.65)

where q ∗p = q(1−p)+(1−q)p. The best choice for PX is PX (0) = PX (1) =


1/2 so that

C = 1 − H2 (p). (5.66)

Suppose next that we have the cost function (5.14). We compute

E[s(X)] = PX (1) · E. (5.67)

The capacity cost function is thus

C(S) = H2 (min(S/E, 1/2) ∗ p) − H2 (p) (5.68)

and for S ≥ E/2 we have C = 1 − H2 (p). A plot of C(S) is shown in


Fig. 5.12 for the case p = 0.11 and E = 1.
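The curve in Fig. 5.12 can be reproduced directly from (5.68). A minimal Python sketch (added for illustration, with p = 0.11 and E = 1 as in the figure) is:

import numpy as np

def H2(q):
    # binary entropy function with H2(0) = H2(1) = 0
    return 0.0 if q in (0.0, 1.0) else -q * np.log2(q) - (1 - q) * np.log2(1 - q)

def bsc_capacity_cost(S, p=0.11, E=1.0):
    # C(S) = H2(min(S/E, 1/2) * p) - H2(p), where q * p = q(1-p) + (1-q)p
    q = min(S / E, 0.5)
    return H2(q * (1 - p) + (1 - q) * p) - H2(p)

print(bsc_capacity_cost(0.5))   # = 1 - H2(0.11), the unconstrained capacity
print(bsc_capacity_cost(0.1))   # smaller rate when the symbol 1 may be used only rarely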


5.7.2. Binary Erasure Channel

The binary erasure channel (BEC) has X = {0, 1}, Y = {0, 1, ∆}, and
Pr [Y = X] = 1−p and Pr [Y = ∆] = p. For no cost constraint, we compute

C = max_{PX} [H(X) − H(X|Y)]
  = max_{PX} H(X) (1 − p)    (5.69)

and choosing PX to be coin-flipping we have

C = 1 − p. (5.70)

5.7.3. Strongly Symmetric Channels

Many practical channels exhibit symmetries. Two symmetries that we con-


sider in detail are:
• Uniformly dispersive: for every input letter x the list of probabilities
{P (y|x) : y ∈ Y}, is the same.
• Uniformly focusing: for every output letter y the list of probabilities
{P (y|x) : x ∈ X }, is the same.
For example, a BSC is both uniformly dispersive and uniformly focusing.
A BEC is uniformly dispersive but not uniformly focusing.
Uniformly dispersive channels have the special property that H(Y |X) does
not depend on PX . To see this, observe that
H(Y|X = x) = Σ_{y∈Y} −P(y|x) log2 P(y|x)    (5.71)

is the same for all x. Hence H(Y |X) is given by (5.71) and determining the
capacity-cost function reduces to a constrained maximum-entropy problem:
" #
C(S) = max H(Y ) − H(Y |X). (5.72)
PX : E[s(X,Y )]≤S

On the other hand, uniformly focusing channels have the special property
that a uniform PX results in a uniform PY . To see this, observe that if
P (x) = 1/|X | for all x then we have
P(y) = Σ_{x∈X} (1/|X|) P(y|x)    (5.73)

which is the same for all y. Thus, PY is uniform, which maximizes H(Y).
A channel is said to be strongly symmetric if it is both uniformly dispersive
and uniformly focusing. In the absence of a cost constraint, this means that

Figure 5.13.: A BEC decomposed into two strongly symmetric channels.

a strongly symmetric channel has capacity


C = log2 |Y| − Σ_{y∈Y} [−P(y|x) log2 P(y|x)]    (5.74)

for any choice of x, and capacity is achieved by a uniform PX . For example,


a BSC is strongly symmetric and has capacity C = 1−H2 (p) which matches
(5.74).

5.7.4. Symmetric Channels

Strongly symmetric channels are somewhat restrictive, e.g., they do not


include the BEC. We now consider the more interesting class of symmetric
channels that have the special property that they can be decomposed into
strongly symmetric channels. By this we mean that one can partition the
output symbols in Y into L sets Y1 , Y2 , . . . , YL such that the channel from
X to Yi is strongly symmetric (uniformly dispersive and uniformly focusing)
for all i = 1, 2, . . . , L.

Example 5.9. The BEC can be decomposed into L = 2 strongly symmetric
channels with Y1 = {0, 1} and Y2 = {∆}, i.e., we have

Channel 1: {P(y|x) : y ∈ Y1} = {0, 1 − p} is the same for all x ∈ X, and {P(y|x) : x ∈ X} = {0, 1 − p} is the same for all y ∈ Y1.
Channel 2: {P(y|x) : y ∈ Y2} = {p} is the same for all x ∈ X, and {P(y|x) : x ∈ X} = {p, p} is the same for all y ∈ Y2.

The decomposition is shown in Fig. 5.13. Observe that each input symbol
X goes through either a BSC with crossover probability 0 (Channel 1) or
through a channel that maps both inputs to ∆ with probability 1 (Channel
2). The former channel is chosen with probability 1 − p and the latter with
probability p. Observe that Y specifies which sub-channel was chosen.

Many channels in practice are symmetric. Furthermore, it turns out that


(in the absence of a cost constraint) the capacity of a symmetric DMC is


easily computed from the capacities of the sub-channels:

C = Σ_{i=1}^L qi Ci    (5.75)

where qi = Σ_{y∈Yi} P(y|x) for any x ∈ X is the probability that sub-channel
i is chosen, and Ci is the capacity of this sub-channel. Moreover, a uniform
PX achieves capacity. For example, the BEC has q1 = 1 − p, q2 = p, C1 = 1,
and C2 = 0.
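As a quick numerical check of (5.75), the following Python fragment (added here for illustration) evaluates the BEC capacity from its two sub-channels:

def bec_capacity_from_subchannels(p):
    # (5.75) for the BEC: q1 = 1 - p, C1 = 1 (noiseless sub-channel), q2 = p, C2 = 0 (erasure)
    q = [1 - p, p]
    C = [1.0, 0.0]
    return sum(qi * Ci for qi, Ci in zip(q, C))

print(bec_capacity_from_subchannels(0.3))   # 0.7 = 1 - p, matching (5.70)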

To prove (5.75), let A be a random variable that represents the sub-channel,


i.e., we have PA (i) = qi for i = 1, 2, . . . , L. We have
I(X; Y) =(a) I(X; Y A)
        =(b) I(X; A) + I(X; Y|A)
        =(c) I(X; Y|A)    (5.76)

where (a) is because A is a function of Y (see (B.23)), (b) follows by the
chain rule for mutual information (see (1.63)), and (c) is because X and A
are independent. We further have

I(X; Y|A) = Σ_{i=1}^L qi I(X; Y|A = i)
          ≤(a) Σ_{i=1}^L qi Ci    (5.77)

where (a) follows by the definition of Ci. Finally, a uniform PX simultaneously
maximizes I(X; Y|A = i) for all i = 1, 2, . . . , L so that we can achieve
equality in (a).

Example 5.10. Consider an AWGN channel where Z is Gaussian with zero


mean and variance N. Suppose we use binary phase shift keying (BPSK)
with X = {−√P, √P} and a uniform quantizer that maps Y to the nearest
value in the set Ỹ = {−L + 1, −L + 3, . . . , L − 3, L − 1}. Call the resulting
discrete output Ỹ. The X-to-Ỹ channel is a symmetric DMC that can be
decomposed into ⌈L/2⌉ strongly symmetric channels. If L is even then all
sub-channels are BSCs. If L is odd then all sub-channels are BSCs except
one sub-channel that maps both inputs to Ỹ = 0. Either way, the capacity
is given by (5.75).


For instance, suppose we have P ≤ 2. We choose L = 4 so that Ỹ takes
on values in the set Ỹ = {−3, −1, 1, 3} (see Fig. 5.14 where the quantizer
boundaries are shown with dashed lines). There are two sub-channels whose

Figure 5.14.: Conditional probability densities for an AWGN channel with N = 1 and BPSK with P = 1.5. The channel output is quantized to values in the set {−3, −1, 1, 3}.

probabilities and capacities are:

q1 = Q((2 − √P)/√N) + Q((2 + √P)/√N)
q2 = 1 − q1
C1 = 1 − H2( Q((2 + √P)/√N) / q1 )
C2 = 1 − H2( [Q(√(P/N)) − Q((2 + √P)/√N)] / q2 )    (5.78)

where

Q(x) = ∫_x^∞ (1/√(2π)) e^{−y²/2} dy.    (5.79)
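The expressions (5.78) and (5.75) can be evaluated numerically. The following Python sketch (added for illustration; it uses math.erfc for the Q-function in (5.79)) computes the capacity of the quantized BPSK channel:

from math import erfc, sqrt, log2

def Q(x):
    # Gaussian tail function (5.79)
    return 0.5 * erfc(x / sqrt(2))

def H2(q):
    return 0.0 if q in (0.0, 1.0) else -q * log2(q) - (1 - q) * log2(1 - q)

def quantized_bpsk_capacity(P, N=1.0):
    # sub-channel probabilities and capacities from (5.78), combined via (5.75)
    q1 = Q((2 - sqrt(P)) / sqrt(N)) + Q((2 + sqrt(P)) / sqrt(N))
    q2 = 1 - q1
    C1 = 1 - H2(Q((2 + sqrt(P)) / sqrt(N)) / q1)
    C2 = 1 - H2((Q(sqrt(P / N)) - Q((2 + sqrt(P)) / sqrt(N))) / q2)
    return q1 * C1 + q2 * C2

print(quantized_bpsk_capacity(1.5))   # bits per symbol for P = 1.5, N = 1 as in Fig. 5.14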


5.8. Continuous Alphabet Examples

5.8.1. AWGN Channel

Consider the additive white Gaussian noise (AWGN) channel with

Y =X +Z (5.80)

where Z is a zero-mean, variance N , Gaussian random variable that is


independent of X. We consider the cost function s(x, y) = x2 and S = P .

At this point, we have not shown that the capacity-cost function (5.50) gives
a rate that we can approach for continuous-alphabet channels. However,
one can check that the arguments in Sec. 5.5 extend to AWGN and other
continuous-alphabet channels. We compute

C(P) = max_{PX : E[X²]≤P} [h(Y) − h(Y|X)]
     = max_{PX : E[X²]≤P} [h(Y) − (1/2) log(2πeN)]
     =(a) (1/2) log(2πe(P + N)) − (1/2) log(2πeN)    (5.81)
where (a) follows by
Var[Y] = Var[X] + N ≤ E[X²] + N ≤ P + N    (5.82)

and the maximum entropy result (2.48). We thus have

C(P) = (1/2) log(1 + P/N).    (5.83)

Furthermore, we achieve C(P ) by choosing X to be Gaussian with zero-


mean and variance P . Observe that C(P ) depends only on the signal-to-
noise ratio (SNR) P/N . The function (5.83) is plotted in Fig. 5.15 for N = 1
(and therefore P = P/N ) as the curve labeled “Gauss”. We have taken the
logarithm to the base 2 so that the rate units are bits per channel symbol.

5.8.2. AWGN Channel with BPSK

Consider the additive white Gaussian noise (AWGN) channel (5.80) but
suppose we use BPSK with X = {−√P, √P}. From Example 5.10 it is
clear that we should choose PX to be uniform even if Y is not quantized.
The capacity is therefore

C(P) = I(X; Y) = h(Y) − (1/2) log(2πeN)    (5.84)

Figure 5.15.: Capacity-cost functions for an AWGN channel with N = 1. The rate units are bits per channel symbol.

where the density of y is

p(y) = (1/2) pY|X(y | −√P) + (1/2) pY|X(y | √P)
     = (1/2) (1/√(2πN)) e^{−(y+√P)²/(2N)} + (1/2) (1/√(2πN)) e^{−(y−√P)²/(2N)}.

The differential entropy h(Y ) can be computed numerically from (5.85).


The function (5.84) is plotted in Fig. 5.15 as the curve labeled “BPSK”. As
should be expected, the capacity saturates at 1 bit per symbol for large P .
A more interesting result is that BPSK almost achieves capacity for small
P (say P ≤ 2).
We may alternatively compute capacity by expanding I(X; Y ) differently:

C(P) = H(X) − H(X|Y)
     = 1 − ∫_{−∞}^{∞} p(y) H2( PX|Y(−√P | y) ) dy    (5.85)

where

PX|Y(−√P | y) = PX(−√P) pY|X(y | −√P) / p(y)
              = (1/(2 p(y))) (1/√(2πN)) e^{−(y+√P)²/(2N)}.    (5.86)

The integral (5.85) must again be computed numerically.
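One way to carry out such a numerical computation of h(Y) in (5.84) is a simple Riemann sum over a truncated range of y. The sketch below (added for illustration, assuming Python with NumPy) reproduces points on the “BPSK” curve of Fig. 5.15; the truncation range and step size are ad hoc choices.

import numpy as np

def bpsk_awgn_capacity(P, N=1.0):
    # C(P) = h(Y) - 0.5*log2(2*pi*e*N) with h(Y) from a Riemann sum over the mixture density
    s = np.sqrt(P)
    y = np.linspace(-s - 10 * np.sqrt(N), s + 10 * np.sqrt(N), 20001)
    dy = y[1] - y[0]
    gauss = lambda m: np.exp(-(y - m) ** 2 / (2 * N)) / np.sqrt(2 * np.pi * N)
    p = 0.5 * gauss(-s) + 0.5 * gauss(s)      # density of Y for uniform BPSK
    hY = -np.sum(p * np.log2(p)) * dy         # differential entropy in bits
    return hY - 0.5 * np.log2(2 * np.pi * np.e * N)

print(bpsk_awgn_capacity(1.5))   # close to the "BPSK" curve of Fig. 5.15 at P = 1.5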


5.8.3. Complex AWGN Channel

The complex AWGN channel has

Y =X +Z (5.87)

where X = XR + jXI, Y = YR + jYI, and Z = ZR + jZI are complex random
variables with j = √−1. The noise variables ZR and ZI are Gaussian with
zero mean. One usually chooses ZR and ZI to be independent with the same
variance, in which case the noise Z is said to be proper complex or circularly
symmetric. We choose E[ZR²] = E[ZI²] = N/2 so that E[|Z|²] = N.
In communications engineering, complex AWGN channels usually model
bandpass channels. The symbol XR represents the sign and amplitude of
the transmit component that is in-phase with a carrier √2 cos(2πfc t) at
frequency fc. The symbol XI represents the sign and amplitude of the
transmit component in quadrature with this carrier, i.e., the component
that is in-phase with √2 sin(2πfc t).
The power constraint is usually taken to be E[|X|2 ] = E[XR2 ] + E[XI2 ] ≤ P .
We may view this channel as a parallel Gaussian channel with two sub-
channels and a sum power constraint. We define PR = E[XR2 ] and PI =
E[XI2 ] and compute

C(P) = max_{PXRXI : PR+PI≤P} [h(YR YI) − h(YR YI | XR XI)]
     = max_{PXRXI : PR+PI≤P} [h(YR YI) − log(πeN)]
     ≤(a) max_{PXR PXI : PR+PI≤P} [h(YR) + h(YI) − log(πeN)]
     ≤(b) max_{PR+PI≤P} [(1/2) log(1 + 2PR/N) + (1/2) log(1 + 2PI/N)]    (5.88)
with equality in (a) if XR and XI are independent, and with equality in
(b) if XR and XI are Gaussian with zero mean. It is clearly best to choose
PR + PI = P and direct differentiation shows that the best choice of power
allocation is PR = PI = P/2. We thus have

C(P) = log(1 + P/N).    (5.89)

C(P ) again depends only on the SNR and the capacity is twice that of the
real case (5.83) where E[X 2 ] ≤ P and E[Z 2 ] = N . We achieve C(P ) by
choosing XR and XI to be independent Gaussian random variables with
zero-mean and variance P/2.
The capacity-cost function C(P ) is clearly concave in P . However, there
are two different and more common ways of plotting C(P ). First, one
considers a bandlimited channel with bandwidth W Hz and transmits 2W
real symbols per second that are represented as W complex symbols per

Figure 5.16.: C(Es/N0) for an AWGN channel in terms of (Es/N0)dB. The rate units are bits per channel symbol.

second. The received signal is sampled at the rate W complex symbols per
second. If the transmit signal has a power constraint P and the noise is
modeled as AWGN with power N = N0 W , where N0 is a constant with
units of Watts/Hz, then the capacity is

C̃(P) = W log2(1 + P/(N0 W)) bits/second.    (5.90)

The tilde on C̃(P ) emphasizes that the units are bits per second. The
transmit energy with transmit intervals of length 1/W seconds is Es = P/W
Joules. We can also express the capacity in units of bits per second per Hertz
(which are units of spectral efficiency) or in units of bits per symbol:

C(Es/N0) = log2(1 + Es/N0) bits/symbol.    (5.91)

The expression (5.91) is basically the same as (5.89) but now the SNR is
measured in terms of an energy ratio rather than a power ratio. One usually
plots C(Es /N0 ) by measuring the SNR in decibels:

(Es /N0 )dB = 10 log10 Es /N0 . (5.92)

The resulting curve is labeled “Gauss” in Fig. 5.16. The two other curves
show the capacities of BPSK and quaternary phase shift keying (QPSK),
respectively, where QPSK has the modulation alphabet
X = {√P, √P j, −√P, −√P j}.    (5.93)

Figure 5.17.: Capacity for an AWGN channel in terms of (Eb/N0)dB. The rate units are bits per channel symbol.

The QPSK capacity is computed by doubling the BPSK capacity and then
shifting the curve to the right by 3 dB (since the energy per dimension is
Es /2). Both modulations almost achieve capacity at low SNR.
Yet another way of plotting the capacity is to measure the SNR in terms of
the energy per information bit: Eb = Es /R where R is the rate in bits per
symbol. We require
R ≤ log2(1 + Es/N0) = log2(1 + (Eb/N0) · R)    (5.94)

so we say that the capacity is the largest R that satisfies (5.94).4 One
usually plots this capacity against

(Eb /N0 )dB = 10 log10 Eb /N0 . (5.95)

The resulting curve is shown in Fig. 5.17. Observe that the smallest Eb /N0
as R → 0 is Eb /N0 = ln(2) which is

(Eb /N0 )dB = 10 log10 (ln 2) ≈ −1.6 dB. (5.96)

Thus, there is an ultimate minimum energy (per information bit) required


to transmit reliably over an AWGN channel. Moreover, we now find that the
QPSK capacity is exactly double the BPSK capacity. BPSK is poor at low
SNR because it uses only 1 of the 2 complex dimensions available for each
symbol. One trick to improve BPSK is to use single sideband modulation
(SSB) to reduce the number of dimensions per symbol back to 1. Thus,
BPSK with SSB achieves the same rates as QPSK.
4
Alternatively, the smallest SNR that satisfies (5.94) is Eb /N0 = (2R − 1)/R.


5.8.4. Parallel AWGN Channels

Suppose we have L AWGN sub-channels:

Yi = hi Xi + Zi , i = 1, 2, . . . , L, (5.97)

where the hi are constants and the Zi are Gaussian random variables with
zero-mean and variance Ni . The Zi , i = 1, 2, . . . , L, are mutually statisti-
cally independent of each other and of the input variables Xi , i = 1, 2, . . . , L.
For example, the complex Gaussian channel treated in Sec. 5.8.3 effectively
has L = 2, h1 = h2 = 1, and N1 = N2 = N/2. We may write the overall
channel in vector form as

Y = HX + Z (5.98)

where H is a L × L diagonal matrix, and where the X, Y , and Z have as


their ith entries the random variables Xi , Yi , and Zi , respectively.
If each sub-channel had its own power constraint si (xi , yi ) = x2i then the
capacity is simply the sum of the capacities of the L sub-channels. The
situation is more interesting if we have a sum power constraint
s(x, y) = ‖x‖² = Σ_{i=1}^L xi².    (5.99)

We define Pi = E[Xi²] and compute

C(P) = max_{PX : E[‖X‖²]≤P} I(X; HX + Z)
     =(a) max_{Σ_{i=1}^L Pi ≤ P} Σ_{i=1}^L (1/2) log(1 + hi² Pi / Ni)    (5.100)

where (a) follows by the maximum entropy result (2.48) (see Problem B.5.4).
We have thus arrived at a power allocation problem.
We may solve the problem by using standard optimization methods. The
problem is concave in P = [P1 , P2 , . . . , PL ], and the Lagrangian is
" L !# " L # " L
#
1 h2 Pi
log 1 + i
X X X
Λ= + γi Pi + λ P − Pi (5.101)
i=1 2 Ni i=1 i=1

where the γi , i = 1, 2, . . . , L, and λ are Lagrange multipliers. The Karush-


Kuhn-Tucker (KKT) conditions for the optimal solution are

γi ≥ 0,  Pi ≥ 0,  γi · Pi = 0,  i = 1, 2, . . . , L
λ ≥ 0,  Σ_{i=1}^L Pi ≤ P,  λ · [P − Σ_{i=1}^L Pi] = 0
∂Λ/∂Pi = (1/2) · 1/(Ni/hi² + Pi) + γi − λ = 0,  i = 1, 2, . . . , L    (5.102)

Table 5.1.: Sub-channel parameters

i          1     2     3     4
hi         2     √2    1     √2
Ni         1     1     2     2
Ni/hi²     1/4   1/2   2     1

Figure 5.18.: Waterfilling for L = 4 sub-channels.

where we have used the natural logarithm for the derivative. The γi are slack
variables that we may choose to achieve equality in the last L conditions
above. Furthermore, we cannot choose λ = 0 so we must have Σ_{i=1}^L Pi = P.
The KKT conditions thus reduce to

Pi ≥ 0,  [λ − (1/2) · 1/(Ni/hi² + Pi)] · Pi = 0,  i = 1, 2, . . . , L    (5.103)
λ > 0,  Σ_{i=1}^L Pi = P.    (5.104)

Now if λ > hi²/(2Ni), or equivalently 1/(2λ) < Ni/hi², then (5.103) requires Pi = 0. We
must thus choose

Pi = max( 1/(2λ) − Ni/hi², 0 ),  i = 1, 2, . . . , L.    (5.105)

This solution is often called waterfilling because one visualizes pouring water
up to a level 1/(2λ) when the ground level is at Ni /h2i . One stops pouring
once the water “volume” is P . The capacity is given by (5.100), namely

L
!
1 h2
log 1 + i · Pi .
X
C(P ) = (5.106)
i=1 2 Ni

Example 5.11. Consider L = 4 sub-channels with the parameters given


in Table 5.1. Suppose P = 2 in which case we need to set 1/(2λ) = 5/4 so
that P1 = 1, P2 = 3/4, P3 = 0, P4 = 1/4 (see Fig. 5.18). The capacity is


C(2) = (1/2) log(1 + 4P1) + (1/2) log(1 + 2P2) + (1/2) log(1 + P4)
     = (1/2) log(5) + (1/2) log(2.5) + (1/2) log(1.25)
     ≈ 1.98 bits.    (5.107)
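The water level 1/(2λ) in (5.105) can be found by bisection on the total power constraint. The following Python sketch (added for illustration) reproduces the allocation and capacity of Example 5.11:

import numpy as np

def waterfill(noise_over_gain, P, tol=1e-12):
    # bisection on the water level 1/(2*lambda) in (5.105) until the powers sum to P
    floors = np.array(noise_over_gain)    # the values N_i / h_i^2
    lo, hi = 0.0, floors.max() + P
    while hi - lo > tol:
        level = 0.5 * (lo + hi)
        Pi = np.maximum(level - floors, 0.0)
        lo, hi = (lo, level) if Pi.sum() > P else (level, hi)
    return Pi

# Example 5.11: N_i / h_i^2 = [1/4, 1/2, 2, 1] and P = 2
ratios = [0.25, 0.5, 2.0, 1.0]
Pi = waterfill(ratios, 2.0)
print(Pi)   # ~ [1, 0.75, 0, 0.25], water level 1/(2*lambda) = 1.25
C = sum(0.5 * np.log2(1 + p / r) for p, r in zip(Pi, ratios))
print(C)    # ~ 1.98 bits, matching (5.107)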

Example 5.12. Suppose the L sub-channels are proper complex AWGN


channels, i.e., Zi is circularly symmetric for all i. Repeating the above
analysis requires only slight modifications to (5.105) and (5.106): choose λ
so that Σ_{i=1}^L Pi = P where

Pi = max( 1/λ − Ni/|hi|², 0 ),  i = 1, 2, . . . , L.    (5.108)

The resulting capacity is


C(P) = Σ_{i=1}^L log(1 + (|hi|²/Ni) · Pi).    (5.109)


5.8.5. Vector AWGN Channels

Consider the complex vector AWGN channel with nt × 1 input X, nr × nt


matrix H, and nr × 1 output

Y = HX + Z (5.110)

where Z is a nr × 1 Gaussian vector with independent and identically dis-


tributed (i.i.d.) proper complex entries of unit variance. This problem
is also known as a multi-antenna, or multi-input, multi-output (MIMO)
AWGN channel. We choose the cost function s(x, y) = kxk2 and compute

C(P) = max_{PX : E[‖X‖²]≤P} I(X; HX + Z)
     = max_{PX : E[‖X‖²]≤P} [h(HX + Z) − nr log(πe)]
     =(a) max_{tr(QX)≤P} log |I + H QX H†|    (5.111)

where (a) follows by the maximum entropy result (2.48). Suppose H has
the singular-value decomposition H = UDV† where U and V are unitary
matrices (with UU† = I and VV† = I) and where D is a real, diagonal
nr × nt matrix with the singular values of H on the diagonal. We write
(5.111) as

C(P) = max_{tr(QX)≤P} log |I + D QX D^T|
     =(a) max_{Σ_{i=1}^{min(nt,nr)} Pi ≤ P} Σ_{i=1}^{min(nt,nr)} log(1 + di² Pi)    (5.112)

where the di , i = 1, 2, . . . , min(nt , nr ), are the singular values of H, and


where we have used Hadamard’s inequality for matrices for (a). The re-
maining optimization problem is the same as for parallel Gaussian channels
with Ni = 1 for all i. The waterfilling solution is to choose the water level
λ̃ so that Σ_{i=1}^{min(nt,nr)} Pi = P where

Pi = max( λ̃ − 1/di², 0 ).    (5.113)

The capacity is

C(P) = Σ_{i=1}^{min(nt,nr)} log(1 + di² · Pi).    (5.114)
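The steps (5.111)–(5.114) translate directly into a short numerical routine: compute the singular values of H, then waterfill over 1/di². The sketch below (added for illustration, assuming Python with NumPy, unit noise variance, and an arbitrarily chosen full-rank H) is one way to do this; log2 is used so that the result is in bits.

import numpy as np

def mimo_capacity(H, P, tol=1e-12):
    # (5.112)-(5.114): waterfill over the squared singular values of H (unit-variance noise)
    d2 = np.linalg.svd(H, compute_uv=False) ** 2
    d2 = d2[d2 > 1e-12]                  # drop zero singular values; they carry no power
    inv = 1.0 / d2                       # plays the role of 1/d_i^2 in (5.113)
    lo, hi = 0.0, inv.max() + P
    while hi - lo > tol:
        level = 0.5 * (lo + hi)          # candidate water level lambda-tilde
        Pi = np.maximum(level - inv, 0.0)
        lo, hi = (lo, level) if Pi.sum() > P else (level, hi)
    return np.sum(np.log2(1 + d2 * Pi)), Pi

H = np.array([[1.0, 0.5],
              [0.2, 0.8]])               # an arbitrary full-rank 2x2 example channel
C, Pi = mimo_capacity(H, P=4.0)
print(C, Pi)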

5.8.6. AWGN Channels with Receiver Channel Information

Consider the following complex channel with a vector output:

Y = [H X + Z, H] (5.115)

where Z is proper complex Gaussian, and H is a random variable with


density pH that is independent of X and Z. This problem models a fad-
ing channel where the receiver, but not the transmitter, knows the fading
coefficient H. We choose the cost function s(x, y) = |x|2 and compute

C(P) = max_{PX : E[|X|²]≤P} I(X; [HX + Z, H])
     = max_{PX : E[|X|²]≤P} [I(X; H) + I(X; HX + Z | H)]
     = max_{PX : E[|X|²]≤P} I(X; HX + Z | H)
     = max_{PX : E[|X|²]≤P} [∫_a pH(a) h(aX + Z) da − log(πeN)]
     = ∫_a pH(a) · log(1 + |a|² P/N) da    (5.116)

where the last step follows by the maximum entropy result (2.48), i.e.,
a Gaussian X with zero mean and variance P simultaneously maximizes
h(aX + Z) for all values of a.
If the fading coefficient H does not have a density then the analysis hardly
changes. For example, suppose that H is a discrete random variable with:
PH (1/2) = 1/4, PH (1) = 1/2, and PH (2) = 1/4. The capacity is

C(P) = (1/4) log(1 + P/(4N)) + (1/2) log(1 + P/N) + (1/4) log(1 + 4P/N).    (5.117)
We remark that QPSK may perform poorly for fading channels because the
information rate I(X; aX + Z) saturates at 2 bits per symbol even if the
fading coefficient H = a has a large amplitude. Thus, for fading channels it
can be important to use a large modulation set X to approach the capacity
(5.116) that is achieved with a Gaussian distribution.
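For a discrete fading law such as the one used in (5.117), the capacity is just a finite sum. A minimal Python sketch (added for illustration; log2 gives the result in bits, and the value P = 10 is an arbitrary choice) is:

import numpy as np

def fading_capacity_discrete(PH, P, N=1.0):
    # discrete version of (5.116): C(P) = sum_a P_H(a) log2(1 + |a|^2 P / N), in bits
    return sum(prob * np.log2(1 + abs(a) ** 2 * P / N) for a, prob in PH.items())

PH = {0.5: 0.25, 1.0: 0.5, 2.0: 0.25}    # the fading law used in (5.117)
print(fading_capacity_discrete(PH, P=10.0))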


5.9. Source and Channel Coding

5.9.1. Separate Coding

Suppose we wish to communicate the output of a DMS PU with entropy


H(U ) across a DMC with capacity-cost function C(S). A natural approach
is to separate source and channel coding: compress the DMS output to a
rate close to H(U ) bits per source symbol and then communicate these bits
across the DMC reliably at a rate close to C(S) bits per channel symbol.
The overall rate is C(S)/H(U ) source symbols per channel symbol.
We wish to prove that other techniques cannot do better. Suppose the
source puts out m symbols U m and we communicate over n channel uses.
A simple modification of Fano’s inequality (5.21) gives

H(U^m|Û^m) ≤ H2(Pe) + Pe log2(|U|^m − 1)
           < H2(Pe) + Pe m log2 |U|.    (5.118)

We also have H(U^m|Û^m) = H(U^m) − I(U^m; Û^m) and H(U^m) = m H(U) so
that

m < [I(U^m; Û^m) + H2(Pe)] / [H(U) − Pe log2 |U|]
  ≤(a) [I(X^n; Y^n) + H2(Pe)] / [H(U) − Pe log2 |U|]
  ≤(b) [n C(S) + H2(Pe)] / [H(U) − Pe log2 |U|]    (5.119)

where (a) follows by the data processing inequality and (b) by (5.60)-(5.62).
Now if Pe → 0 then we have the desired bound

m/n ≤ C(S)/H(U)    (5.120)

source symbols per channel symbol. Thus, separate source and channel
coding achieves the best possible performance (if we can permit n → ∞).

5.9.2. Rates Beyond Capacity

Suppose we permit an average bit error probability


m
1 X h i
Pb = Pr Ûi 6= Ui . (5.121)
m i=1

It turns out that there are source codes that can compress coin-tossing bits
U m to m(1 − H2 (Pb )) bits from which we can recover a string Û m that has
an average bit error probability Pb . But we can send the m(1 − H2 (Pb ))
bits reliably (with average bit error probability near zero) over the channel

Figure 5.19.: Average bit error probability Pb as a function of (Eb/N0)dB for a complex AWGN channel. The blue curves are lower bounds on the best possible Pb given R.

in approximately n = m(1 − H2(Pb))/C(S) channel uses. The overall rate
R = m/n is

R = C(S)/(1 − H2(Pb))    (5.122)

so that we can approach equality in (5.64). For instance, if we permit


Pb = 0.11 then we can approach R = 2C(S) as closely as desired. Moreover,
we can accomplish this task by separate source coding and channel coding.

We further explore the impact of this result. Consider a complex AWGN


channel and suppose we wish to communicate R bits per symbol. We use
(5.64), (5.91), and Es = Eb R to compute

Eb/N0 ≥ (1/R) (2^{R(1−H2(Pb))} − 1).    (5.123)

As argued above, we can approach equality in (5.123) by using separate


source and channel coding. For example, for Pb = 0 and R = 1 we compute
Eb /N0 = 1 or (Eb /N0 )dB = 0 dB. The minimum (Eb /N0 )dB for positive
Pb and R = 1 and R = 2 are plotted in Fig. 5.19 as the curves labeled
“Gauss Lower Bounds”. For instance, if we permit Pb ≈ 0.1 then for R = 1
we can save over 3 dB, or over half, the energy as compared to Pb → 0.
Alternatively, for a fixed Eb /N0 , if we increase Pb we may increase R.
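The bound (5.123) is easy to evaluate, which is how the “Gauss Lower Bounds” in Fig. 5.19 can be reproduced. A minimal Python sketch (added for illustration) is:

import numpy as np

def H2(q):
    return 0.0 if q in (0.0, 1.0) else -q * np.log2(q) - (1 - q) * np.log2(1 - q)

def min_ebno_db(R, Pb):
    # right-hand side of (5.123), expressed in dB as in (5.95)
    ebno = (2 ** (R * (1 - H2(Pb))) - 1) / R
    return 10 * np.log10(ebno)

print(min_ebno_db(1, 1e-9))   # ~ 0 dB, the limit for Pb -> 0 and R = 1
print(min_ebno_db(1, 0.1))    # a few dB smaller when Pb ~ 0.1 is tolerated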


Now suppose we transmit using BPSK without coding. We have R = 1 and

Pb = Q(√(P/(N/2))) = Q(√(2 Eb/N0))    (5.124)

since Eb = Es = P/W and N = N0 W. Similarly, QPSK has R = 2 and

Pb = Q(√((P/2)/(N/2))) = Q(√(2 Eb/N0))    (5.125)

since Eb = Es /2 = P/(2W ). Thus, QPSK achieves the same Pb as BPSK


but at double the rate. The resulting Pb are shown in Fig. 5.19 as the
curve labeled “Uncoded BPSK and QPSK”. For instance, uncoded BPSK
and QPSK require (Eb /N0 )dB ≈ 9.6 dB for Pb = 10−5 , while coding and
Gaussian modulation require only (Eb /N0 )dB = 0 dB and (Eb /N0 )dB ≈ 1.8
dB for R = 1 and R = 2, respectively. The potential energy savings of
coding and modulation are therefore up to 9.6 dB, or over a factor of 9, as
compared to uncoded BPSK, and up to 7.8 dB, or over a factor of 6, as
compared to uncoded QPSK.

Figure 5.20.: The capacity-cost problem with feedback.

Figure 5.21.: FDG for a memoryless channel with feedback. The message estimate Ŵ is not shown.

5.10. Feedback

Feedback is usually used in communications to improve performance. One


might expect that feedback can increase capacity. To check this, consider a
memoryless channel with feedback in the sense that Xi can be a function of
the message W and a function of the past channel outputs Y i−1 . The most
informative feedback would thus be that the transmitter is aware of Y i−1 ,
as shown in Fig. 5.20 and Fig. 5.21 (the latter figure is also in Sec. A.5).
We slightly modify (5.60) and bound

I(W; Ŵ) ≤ I(W; Y^n)
        = Σ_{i=1}^n [H(Yi|Y^{i−1}) − H(Yi|W Y^{i−1})]
        =(a) Σ_{i=1}^n [H(Yi|Y^{i−1}) − H(Yi|W Y^{i−1} Xi)]
        =(b) Σ_{i=1}^n [H(Yi|Y^{i−1}) − H(Yi|Xi)]
        ≤ Σ_{i=1}^n I(Xi; Yi)    (5.126)

where (a) follows because Xi is a function of W and Y i−1 , and (b) is because
the channel is memoryless. We have thus arrived at (5.60) and find the
surprising result that feedback does not improve the capacity-cost function
of a discrete memoryless channel [1, 2]. However, we emphasize that feed-


back can help to reduce complexity and improve reliability, as the following
example shows. Feedback can also increase the capacity of channels with
memory and the capacity of multi-user channels.

Example 5.13. Consider the BEC with feedback of the channel outputs.
The encoder can use a variable-length encoding scheme where each message
bit is repeated until the decoder receives this bit without erasure. The
number N of times that each bit must be transmitted has the geometric
distribution

PN (k) = (1 − p)pk−1 , k = 1, 2, 3, . . . (5.127)

whose mean is E[N ] = 1/(1 − p). The average rate R is the number of in-
formation bits transmitted divided by the average number of trials, giving
R = 1 − p = C. The benefit of feedback is therefore not in increasing ca-
pacity but in giving a simple (variable-length) coding method that achieves
capacity. In fact, the error probability is zero if one permits infinite delay.
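The repeat-until-unerased scheme of Example 5.13 is easy to simulate. The following Python sketch (added for illustration) estimates the empirical rate, which approaches 1 − p = C:

import random

def simulate_bec_feedback(num_bits, p, seed=0):
    # send each bit until it arrives unerased; count the total number of channel uses
    rng = random.Random(seed)
    uses = 0
    for _ in range(num_bits):
        while True:
            uses += 1
            if rng.random() > p:          # the symbol was not erased
                break
    return num_bits / uses                # empirical rate in bits per channel use

print(simulate_bec_feedback(100000, p=0.3))   # ~ 0.7 = 1 - p = C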


References
[1] C. E. Shannon. The zero error capacity of a noisy channel. IRE
Trans. Inf. Theory, 2:221–238, September 1956. Reprinted in Claude
Elwood Shannon: Collected Papers, pp. 221-238, (N.J.A. Sloane and
A.D. Wyner, eds.) Piscataway: IEEE Press, 1993.
[2] R. L. Dobrushin. Information transmission in a channel with feedback.
Theory Prob. Appl., 34:367–383, December 1958.



Appendix A.

Discrete Probability

A.1. Events, Sample Space, and Probability Measure
We begin with basic definitions. A discrete sample space Ω = {ω1 , . . . , ωN }
is the set of possible outcomes of a random experiment. An event is a subset
of Ω including the empty set ∅ and the certain event Ω. The probability
measure Pr [·] assigns each event a number in the interval [0, 1] = {x : 0 ≤
x ≤ 1} such that

Pr [Ω] = 1 (A.1)
Pr [A ∪ B] = Pr [A] + Pr [B] if A ∩ B = ∅. (A.2)

The atomic events are the events {ωi }, i = 1, 2, . . . , N , so we have


Pr[A] = Σ_{ωi∈A} Pr[ωi]    (A.3)

where we have written Pr [ωi ] as a shorthand for Pr [{ωi }]. The complement
Ac (or Ā) of event A is the set of all ωi that are not in A.

Example A.1. Consider a six-sided die and define Ω = {1, 2, 3, 4, 5, 6} (see


Fig. A.1). A fair die has Pr [ωi ] = 1/6 for all i. The probability of the event
A is therefore |A|/|Ω|, where |A| is the number of elements in A.

We say that “event A implies event B”, or A ⇒ B, if and only if A ⊆ B.


By using (A.3), we thus find that A ⇒ B gives Pr [A] ≤ Pr [B]. Equation
(A.3) also implies that

Pr [A ∪ B] = Pr [A] + Pr [B] − Pr [A ∩ B] . (A.4)

We thus have

Pr [A ∪ B] ≤ Pr [A] + Pr [B] . (A.5)

which is known as the union bound. Equality holds in (A.5) if and only if
Pr [A ∩ B] = 0 (this does not necessarily mean that A ∩ B = ∅).

Figure A.1.: A sample space with six atomic events.

The events A and B are said to be independent if

Pr [A ∩ B] = Pr [A] · Pr [B] . (A.6)

More generally, the sets Ai , i = 1, 2, . . . , n, are independent if


" n # n
\ Y
Pr Ai = Pr [Ai ] . (A.7)
i=1 i=1

The conditional probability of the event B given the occurrence of the event
A with Pr [A] > 0 is

Pr[B|A] = Pr[A ∩ B] / Pr[A].    (A.8)

Thus, if Pr [A] > 0 then using (A.8) the events A and B are independent if
Pr [B|A] = Pr [B].1

Example A.2. Consider our fair die and the events A = {1, 3, 5} and
B = {1, 2} in Fig. A.1. We find that (A.6) is satisfied so A and B are
independent. We further have Pr [A] > 0 and compute

Pr[B|A] = Pr[{1}]/Pr[A] = (1/6)/(1/2) = 1/3.    (A.9)

1
The reader may now wonder what happens if Pr [A] = 0. We then have Pr [A ∩ B] = 0
so that (A.6) is satisfied. Thus, if Pr [A] = 0 then A and B are always independent.


A.2. Discrete Random Variables


A discrete random variable X is a mapping from Ω into a discrete and
finite set X and its range is denoted by X(Ω) (we usually consider random
variables with X(Ω) = X ). The preimage (or inverse image) of a point a,
a ∈ X , is written as

X −1 (a) = {ω : X(ω) = a}. (A.10)

More generally, for a subset A of X we write X −1 (A) = {ω : X(ω) ∈ A}.


The probability distribution PX (·) is a mapping from X(Ω) into the interval
[0, 1] such that
h i
PX (a) = Pr X −1 (a) (A.11)

or simply PX (a) = Pr [X = a]. We thus have

PX(a) ≥ 0 for all a ∈ X(Ω)    (A.12)
Σ_{a∈X(Ω)} PX(a) = 1.    (A.13)

Example A.3. Consider the sample space of Example A.1 and choose
X = {odd, even}. We define the mapping X(·) as follows:

X(1) = X(3) = X(5) = odd


X(2) = X(4) = X(6) = even.

We compute PX (odd) = PX (even) = 1/2.

Consider next n random variables X n = X1 , X2 , . . . , Xn with domain Ω


and range X n (Ω) = X1 (Ω) × X2 (Ω) × · · · × Xn (Ω). The joint probability
distribution PX n (·) of these random variables is the mapping from X n (Ω)
into the interval [0, 1] such that
" n #
n
\
PX n (a ) = Pr {Xi = ai } . (A.14)
i=1

We thus have

PX^n(a^n) ≥ 0 for all a^n ∈ X^n(Ω)    (A.15)
Σ_{a^n∈X^n(Ω)} PX^n(a^n) = 1.    (A.16)

We further have

PX^{n−1}(a^{n−1}) = PX1X2...Xn−1(a1, a2, . . . , an−1)
                  = Σ_{an∈Xn(Ω)} PX1X2...Xn−1Xn(a1, a2, . . . , an−1, an).    (A.17)


The marginal distributions of PX n are the distributions PX1 , PX2 , . . . , PXn .


The process of “removing” a random variable as in (A.17) by summing a
joint distribution over the range of the random variable is called marginal-
ization.
The support of a random variable X is the set

supp(PX ) = {a : a ∈ X , PX (a) > 0}. (A.18)

The conditional probability distribution PY |X (·) is a mapping from X(Ω) ×


Y (Ω) into the interval [0, 1] such that
PY|X(b|a) = { PXY(a, b)/PX(a) if PX(a) > 0;  undefined otherwise }.    (A.19)

The value “undefined” is sometimes chosen as a number in the interval [0, 1].
An alternative (but indirect) way of defining PY |X (·) is as any function from
X(Ω) × Y (Ω) into the interval [0, 1] such that

PY |X (b|a)PX (a) = PXY (a, b) (A.20)

for all (a, b) ∈ X(Ω) × Y (Ω). This alternative approach recovers the usual
definition if PX (a) > 0, and it motivates choosing PY |X (b|a) to be “unde-
fined” otherwise.


A.3. Independent Random Variables


The random variables X1 , X2 , . . . , Xn are statistically independent if
PX^n(a^n) = ∏_{i=1}^n PXi(ai) for all a^n ∈ X^n(Ω).    (A.21)

Similarly, X1 , X2 , . . . , Xn are statistically independent conditioned on the


event A with Pr [A] > 0 if, for all an ∈ X n (Ω), we have
" n # n
\ Y
Pr {Xi = ai } A = Pr [Xi = ai |A] . (A.22)
i=1 i=1

Thus, using (A.21) we find that X and Y are statistically independent if


and only if

PY |X (b|a) = PY (b) for all (a, b) ∈ supp(PX ) × Y (Ω). (A.23)

Similarly, we say that X and Y are statistically independent conditioned


on Z if

PXY |Z (a, b|c) = PX|Z (a|c)PY |Z (b|c) (A.24)

for all (a, b, c) ∈ X(Ω) × Y (Ω) × supp(PZ ). Thus, we find that X and Y are
statistically independent conditioned on Z if and only if

PY |XZ (b|a, c) = PY |Z (b|c) (A.25)

for all (a, b, c) ∈ supp(PX ) × Y (Ω) × supp(PZ ). Alternatively, X and Y are


statistically independent conditioned on Z if and only if

PX|Y Z (a|b, c) = PX|Z (a|c) (A.26)

for all (a, b, c) ∈ X(Ω) × supp(PY Z ).


A common way of expressing that X and Y are statistically independent
given Z is to say that

X −Z −Y (A.27)

forms a Markov chain.

Figure A.2.: Graph representation of PXY(·). The statistics of Zi depend on i but Z = Z1 Z2 is statistically independent of X.

A.4. Probabilistic Dependence via Functional Dependence

Consider a joint distribution PXY (·). We claim that there is a random


variable Ỹ having PX Ỹ (·) = PXY (·) and where

Ỹ = f (X, Z) (A.28)

for some function f (·) and some random variable Z that is statistically
independent of X and that has alphabet Z of size at most |Z| = |X | · |Y|.

Suppose X = {a1 , a2 , . . . , a|X | } and Y = {b1 , b2 , . . . , b|Y| }. Consider the


graph representation of PXY (·), as depicted in Fig. A.2 for |X | = 2 and |Y| =
4. Let Z be a word with |X | letters whose ith letter Zi , i = 1, 2, . . . , |X |,
takes on the value bj with probability PZi (bj ) = PY |X (bj |ai ) as long as
PX (ai ) > 0. If PX (ai ) = 0 then we leave PZi (·) unspecified. We define the
function index(ai ) = i, i = 1, 2, . . . , |X |, and choose Z independent of X
(the Zi , i = 1, 2, . . . , |X |, could be dependent). We claim that the function

Ỹ = Zindex(X) (A.29)

makes X Ỹ have the joint distribution PXY (·). Indeed, by construction have

PỸ |X (bj |ai ) = PZi (bj ) = PY |X (bj |ai ). (A.30)

The purpose of the above exercise is to show that we may as well represent
any channel PY |X (·) by a functional relation Y = f (X, Z) where the “noise”
Z is independent of the channel input X. This result forms the basis of the
ideas in the next section.

Example A.4. Consider binary X and Y for which

PXY(0, 0) = p00,  PXY(0, 1) = p01,
PXY(1, 0) = p10,  PXY(1, 1) = p11.    (A.31)

We define a1 = 0, a2 = 1, b1 = 0, b2 = 1, and Z = Z1 Z2 where

PZ1(0) = PY|X(0|0) = p00/(p00 + p01)
PZ2(0) = PY|X(0|1) = p10/(p10 + p11).    (A.32)

We choose Ỹ = Z1 if X = 0 and Ỹ = Z2 if X = 1.

Example A.5. Consider Example A.4 with p00 = p11 and p01 = p10 . The
symmetry lets us simplify f (·) and Z. We may define Ỹ = X ⊕ Z where
PX (0) = PX (1) = 1/2, Z is independent of X, and PZ (0) = p00 + p11 ,
PZ (1) = 1 − PZ (0). We see that |Z| = 2 suffices in this case.
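The construction (A.29) can be checked by simulation. The Python sketch below (added for illustration) uses the binary setup of Example A.4 with hypothetical numerical values p00 = 0.4, p01 = 0.1, p10 = 0.2, p11 = 0.3, draws (X, Ỹ) via Ỹ = Z_index(X), and compares the empirical frequencies with PXY:

import random

rng = random.Random(0)

def sample_via_functional_form(PX, PYgX):
    # draw Z = (Z_1, ..., Z_|X|) independently of X with P_{Z_i}(.) = P_{Y|X}(.|a_i),
    # then output Y~ = Z_index(X) as in (A.29)
    z = {a: rng.choices(list(PYgX[a]), weights=list(PYgX[a].values()))[0] for a in PYgX}
    x = rng.choices(list(PX), weights=list(PX.values()))[0]
    return x, z[x]

# binary setup of Example A.4 with hypothetical values p00 = 0.4, p01 = 0.1, p10 = 0.2, p11 = 0.3
PX = {0: 0.5, 1: 0.5}
PYgX = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.4, 1: 0.6}}
counts, trials = {}, 100000
for _ in range(trials):
    pair = sample_via_functional_form(PX, PYgX)
    counts[pair] = counts.get(pair, 0) + 1
print({k: round(v / trials, 3) for k, v in sorted(counts.items())})  # ~ PXY(x,y) = PX(x) PY|X(y|x)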

Figure A.3.: FDG for a memoryless channel with feedback.

A.5. Establishing Conditional Statistical Independence

Graphs can help us to understand relationships between random variables


and to prove conditional statistical independence results. We will use func-
tional dependence graphs or FDGs.
An FDG has vertices that represent random variables and edges that rep-
resent the functional dependencies between the random variables. For in-
stance, suppose we have NRV random variables that are defined by SRV
independent (source) random variables by NRV functions. The correspond-
ing FDG G is a directed graph having NRV + SRV vertices and where edges
are drawn from one vertex to another if the random variable of the former
vertex is an argument of the function defining the random variable of the
latter vertex.

Example A.6. Fig. A.3 depicts the FDG for the first three uses of a
channel with feedback. In this graph the channel input symbol Xi , i =
1, 2, 3, is a function of the message W and the past channel outputs Y i−1 .
We have drawn the feedback links using dashed lines to emphasize the
role that feedback plays. The output Yi is a function of Xi and a noise
random variable Zi . The graph has NRV = 6 random variables defined by
SRV = 4 independent random variables. The SRV vertices representing the
independent W, Z1 , Z2 and Z3 are distinguished by drawing them with a
hollow circle.

The precise structure of FDGs lets one establish the conditional statisti-
cal independence of sets of random variables by using graphical procedures
called d-separation and f d-separation (“d” for dependence and “fd” for func-
tional dependence). By d-separation we mean the following.

Definition A.1 Let X , Y and Z be disjoint subsets of the vertices of an


FDG G. Z is said to d-separate X from Y if there is no path between a
vertex in X and a vertex in Y after the following manipulations of the graph
have been performed.


a) Consider the subgraph GX YZ of G consisting of the vertices in X , Y


and Z, as well as the edges and vertices encountered when moving
backward one or more edges starting from any of the vertices in X or
Y or Z.
b) In GX YZ delete all edges coming out of the vertices in Z. Call the
resulting graph GX Y|Z .
c) Remove the arrows on the remaining edges of GX Y|Z to obtain an undi-
rected graph.

A fundamental result of [1, Sec. 3.3] is that d-separation establishes con-


ditional independence in FDGs having no directed cycles. That is, if G is
acyclic, Z d-separates X from Y in G, and we collect the random variables
of the vertices in X , Y and Z in the respective vectors X, Y and Z, then
PX Y |Z = PX|Z PY |Z . This is the same as saying that X − Z − Y forms a
Markov chain.

Example A.7. Consider Fig. A.3 and choose X = {W }, Y = {Y2 }, and


Z = {X1 , X2 }. We find that Z d-separates X from Y so that W −X1 X2 −Y2
forms a Markov chain.
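
The d-separation test of Definition A.1 is easy to automate. The Python sketch below encodes one reading of the FDG in Fig. A.3 as a directed edge list (the edge list is an assumption based on the figure description) and reproduces the conclusion of Example A.7.

    from collections import deque

    # Directed edges of the FDG in Fig. A.3 (assumed reading of the figure):
    # Xi depends on W and Y^{i-1}; Yi depends on Xi and Zi.
    edges = [("W", "X1"), ("W", "X2"), ("W", "X3"),
             ("X1", "Y1"), ("Z1", "Y1"),
             ("Y1", "X2"), ("Y1", "X3"),
             ("X2", "Y2"), ("Z2", "Y2"),
             ("Y2", "X3"),
             ("X3", "Y3"), ("Z3", "Y3")]

    def d_separated(edges, X, Y, Z):
        """Return True if Z d-separates X from Y (Definition A.1, acyclic FDG)."""
        parents = {}
        for u, v in edges:
            parents.setdefault(v, set()).add(u)
        # a) ancestral subgraph: keep X, Y, Z and everything reachable by moving backward.
        keep, stack = set(), list(X | Y | Z)
        while stack:
            v = stack.pop()
            if v in keep:
                continue
            keep.add(v)
            stack.extend(parents.get(v, ()))
        sub = [(u, v) for u, v in edges if u in keep and v in keep]
        # b) delete all edges coming out of the vertices in Z.
        sub = [(u, v) for u, v in sub if u not in Z]
        # c) drop the arrows and look for an undirected path between X and Y.
        adj = {}
        for u, v in sub:
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
        seen, queue = set(X), deque(X)
        while queue:
            v = queue.popleft()
            if v in Y:
                return False          # a path exists, so no d-separation
            for w in adj.get(v, ()):
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        return True

    # Example A.7: {X1, X2} d-separates {W} from {Y2}.
    print(d_separated(edges, {"W"}, {"Y2"}, {"X1", "X2"}))   # True
    # Without conditioning, W and Y2 are connected.
    print(d_separated(edges, {"W"}, {"Y2"}, set()))          # False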

A simple extension of d-separation is known as f d-separation which uses


the fact that the FDG represents functional relations, and not only Markov
relations as in Bayesian networks [2, Ch. 2],[3]. For f d-separation, after the
second step above one removes all edges coming out of vertices to which
there is no path (in a directed sense) from the SRV source vertices. We
remark that f d-separation applies to an FDG G with cycles, as long as all
subgraphs of G are also FDGs (see [2, Ch. 2]) and if one considers discrete
random variables.


A.6. Expectation

Expectation is an integral and usually involves continuous real-valued ran-


dom variables. However, (real-valued) discrete random variables have all
the important properties we shall need for the continuous cases.

Consider a real-valued function f(·) with domain X(Ω). The expectation of
the random variable Y = f(X) is

    E[Y] = E[f(X)] = Σ_{a ∈ supp(PX)} PX(a) f(a).                (A.33)

One sometimes encounters the notation EX [·] if it is unclear which of the


letters in the argument of E[·] are random variables.

Example A.8. The random variable Y = f (X) = 1(X = a) has

E[Y ] = PX (a). (A.34)

Similarly, Y = f(X) = 1(X ∈ A) for A ⊆ X(Ω) has

    E[Y] = Σ_{a∈A} PX(a).                (A.35)

The conditional expectation of f(X) given that the event A with Pr[A] > 0
occurred is

    E[f(X)|A] = Σ_{a: Pr[{X=a}∩A]>0} Pr[X = a|A] f(a)                (A.36)

where the conditional probability Pr[X = a|A] is defined as in (A.8). In
particular, if A = {Z = c} and PZ(c) > 0 we have

    E[f(X)|Z = c] = Σ_{a ∈ supp(PX|Z(·|c))} PX|Z(a|c) f(a).                (A.37)

We can re-write the above definitions slightly differently. Let {B1 , B2 , . . . , BM }
be a collection of events that partition the sample space, i.e., we have

    ∪_{m=1}^{M} Bm = Ω   and   Bi ∩ Bj = ∅ for i ≠ j.                (A.38)


We can then write (A.33) as

    E[f(X)] = Σ_{i,a: Pr[Bi∩{X=a}]>0} Pr[Bi ∩ {X = a}] f(a)
            = Σ_{i: Pr[Bi]>0} Pr[Bi] Σ_{a: Pr[Bi∩{X=a}]>0} ( Pr[Bi ∩ {X = a}] / Pr[Bi] ) f(a)
            = Σ_{i: Pr[Bi]>0} Pr[Bi] Σ_{a: Pr[Bi∩{X=a}]>0} Pr[X = a|Bi] f(a)
            = Σ_{i: Pr[Bi]>0} Pr[Bi] E[f(X)|Bi]                (A.39)

and (A.36) as

    E[f(X)|A] = Σ_{i: Pr[Bi∩A]>0} Pr[Bi|A] E[f(X)|Bi ∩ A].                (A.40)

Example A.9. For a discrete random variable Y we can choose Bb = {Y = b} and write

    E[f(X)] = Σ_{b ∈ supp(PY)} PY(b) E[f(X)|Y = b]                (A.41)
    E[f(X)|A] = Σ_{b: Pr[{Y=b}∩A]>0} Pr[Y = b|A] E[f(X)|{Y = b} ∩ A].                (A.42)

The identities (A.39)-(A.42) are called the Theorem on Total Expectation.
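
A quick numerical check of (A.41) in Python, for an arbitrarily chosen joint PMF and function f:

    # Arbitrary joint PMF P_XY on {0,1} x {0,1} and an arbitrary function f of X.
    P_XY = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.4}
    f = lambda x: x * x + 1.0

    P_Y = {b: sum(p for (a, bb), p in P_XY.items() if bb == b) for b in (0, 1)}

    # Left-hand side of (A.41): E[f(X)] computed directly from the joint distribution.
    lhs = sum(p * f(a) for (a, b), p in P_XY.items())

    # Right-hand side of (A.41): sum over b of P_Y(b) E[f(X) | Y = b].
    def cond_exp(b):
        return sum(p * f(a) for (a, bb), p in P_XY.items() if bb == b) / P_Y[b]

    rhs = sum(P_Y[b] * cond_exp(b) for b in P_Y if P_Y[b] > 0)
    print(lhs, rhs)   # the two values agree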


A.7. Second-Order Statistics for Scalars


The m’th moment, m = 1, 2, 3, . . ., of a real-valued random variable Y is
E[Y^m]. The variance of Y is

    Var[Y] = E[ (Y − E[Y])^2 ] = E[Y^2] − E[Y]^2.                (A.43)

The covariance of real-valued X and Y is

Cov[X, Y ] = E[(X − E[X])(Y − E[Y ])]


= E[XY ] − E[X] E[Y ] . (A.44)

We thus have Var[X] = Cov[X, X]. We say that X and Y are uncorrelated
if Cov[X, Y ] = 0 or, alternatively, if E[XY ] = E[X] E[Y ].

Example A.10. Statistically independent X and Y are uncorrelated.

Example A.11. Consider random variables X1 , X2 , . . . , Xn . We compute

    Var[ Σ_{i=1}^{n} Xi ] = Σ_{i=1}^{n} Σ_{j=1}^{n} Cov[Xi , Xj].                (A.45)

If the {Xi}_{i=1}^{n} are pairwise uncorrelated then (A.45) is Σ_{i=1}^{n} Var[Xi].

The correlation coefficient of X and Y is the normalized covariance

    ρ = E[(X − E[X])(Y − E[Y])] / √( Var[X] Var[Y] )                (A.46)

as long as the variances are positive; otherwise we say ρ = 0. One can show
that −1 ≤ ρ ≤ 1.
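
Both (A.45) and (A.46) are easy to check numerically; the following sketch uses NumPy samples of three arbitrarily constructed correlated random variables.

    import numpy as np
    rng = np.random.default_rng(0)

    # Draw correlated samples X1, X2, X3 (an arbitrary linear construction, for illustration).
    n = 200000
    W = rng.standard_normal((3, n))
    X = np.vstack([W[0], 0.5 * W[0] + W[1], -0.3 * W[1] + 2.0 * W[2]])

    # Empirical check of (A.45): Var[sum_i Xi] equals the sum of all covariances Cov[Xi, Xj].
    S = X.sum(axis=0)
    lhs = S.var()
    C = np.cov(X, bias=True)         # 3x3 matrix of empirical covariances
    rhs = C.sum()
    print(lhs, rhs)                  # equal up to floating-point error

    # Correlation coefficient (A.46) of X1 and X2.
    rho = C[0, 1] / np.sqrt(C[0, 0] * C[1, 1])
    print(rho)                       # lies in [-1, 1]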


A.8. Second-Order Statistics for Vectors


The definitions of the previous section extend naturally to random vectors.
The covariance matrix of the real-valued column vectors X = [X1 . . . Xℓ]^T
and Y = [Y1 . . . Ym]^T is

    Cov[X, Y] = E[ (X − E[X])(Y − E[Y])^T ]
              = E[X Y^T] − E[X] E[Y]^T.                (A.47)

The covariance matrix of X is QX = Cov[X, X]. We say that X and Y are


uncorrelated if Cov[X, Y ] = 0.

Example A.12. For the random vectors X_1 , X_2 , . . . , X_n we compute

    Q_{Σ_{i=1}^{n} X_i} = Σ_{i=1}^{n} Σ_{j=1}^{n} Cov[X_i , X_j].                (A.48)

If the {X_i}_{i=1}^{n} are pairwise uncorrelated then (A.48) is Σ_{i=1}^{n} Q_{X_i}.

The correlation matrix of X and Y is the normalized covariance matrix

    R = E[ X̃ Ỹ^T ]                (A.49)

where the ith entry of X̃ is

    X̃i = ( Xi − E[Xi] ) / √( Var[Xi] )                (A.50)

as long as the variance is positive; otherwise we set X̃i = 0. The entries of


Ỹ are similarly defined. In other words, the entry of row i and column j of
R is simply the correlation coefficient of Xi and Yj .
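
A short NumPy sketch of (A.47) and (A.49)-(A.50) with sample averages in place of expectations (the linear construction of X and Y is an arbitrary choice):

    import numpy as np
    rng = np.random.default_rng(1)

    # Samples of two correlated random column vectors X (length 2) and Y (length 3),
    # stored as columns of the arrays below.
    n = 100000
    G = rng.standard_normal((5, n))
    X = G[:2]
    Y = np.vstack([G[0] + 0.2 * G[2], G[3], 0.7 * G[1] + G[4]])

    # Covariance matrix (A.47): Cov[X, Y] = E[X Y^T] - E[X] E[Y]^T (sample estimate).
    mX, mY = X.mean(axis=1, keepdims=True), Y.mean(axis=1, keepdims=True)
    cov_XY = (X @ Y.T) / n - mX @ mY.T

    # Correlation matrix (A.49)-(A.50): normalize each entry by the standard deviations.
    sX = X.std(axis=1, keepdims=True)
    sY = Y.std(axis=1, keepdims=True)
    R = cov_XY / (sX @ sY.T)
    print(R)          # entry (i, j) is the correlation coefficient of Xi and Yj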


A.9. Conditional Expectation Random Variables
The conditional expectation (A.37) motivates defining the random variable
E[Y |X] that takes on the value E[Y |X = a] when X = a. E[Y |X] is there-
fore a deterministic function of X and we have
    P_{E[Y|X]}(b) = Σ_{a: E[Y|X=a]=b} PX(a).                (A.51)

We similarly define Var[Y |X] as the random variable that takes on the value
Var[Y |X = a] when X = a.
We next list several simple properties of conditional expectation.
• E[f (X)|X] = f (X)
• E[E[Y |X]] = E[Y ] and more generally E[E[Y |XZ] |Z] = E[Y |Z]
• E[Y Z] = E[Y E[Z|Y ]] and more generally E[Y Z|X] = E[Y E[Z|XY ] |X]
• If Y − X − Z forms a Markov chain then E[Z|XY ] = E[Z|X]
• If Y − X − Z forms a Markov chain then E[Y Z|X] = E[Y |X] E[Z|X]
• Var[Y |X] = E[Y^2|X] − E[Y|X]^2 and E[Var[Y |X]] = E[Y^2] − E[ E[Y|X]^2 ]
An important property of conditional expectation concerns minimum-mean
square-error (MMSE) estimation. Consider the error S = X − X̂ and
suppose that, given Y = y, we wish to find the estimate X̂(y) that minimizes

    Var[S|Y = y] = E[ (X − X̂(y))^2 | Y = y ].                (A.52)

A simple optimization gives X̂(y) = E[X|Y = y] and the MMSE is

    Var[S|Y = y] = E[X^2|Y = y] − E[X|Y = y]^2.                (A.53)

Hence, the MMSE estimate is the random variable

X̂(Y ) = E[X|Y ] (A.54)

and the MMSE is the random variable Var[S|Y ] = E[X 2 |Y ] − E[X|Y ]2 .


We caution that E[Var[S|Y]] ≠ Var[S] in general. In fact, using Jensen’s
inequality and the strict convexity of the function f(x) = x^2, we have

    E[Var[S|Y]] = E[X^2] − E[ E[X|Y]^2 ]
                ≤ E[X^2] − E[X]^2
                = Var[S]                (A.55)

with equality if and only if E[X|Y ] is the constant E[X].
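
The following Python sketch illustrates these facts for a small discrete joint PMF (an arbitrary choice): the estimator E[X|Y] attains a mean square error no larger than that of the constant estimate E[X], in line with (A.55).

    # Arbitrary joint PMF of (X, Y) with X in {0,1,2} and Y in {0,1}.
    P_XY = {(0, 0): 0.30, (1, 0): 0.10, (2, 0): 0.10,
            (0, 1): 0.05, (1, 1): 0.15, (2, 1): 0.30}

    P_Y = {}
    for (a, b), p in P_XY.items():
        P_Y[b] = P_Y.get(b, 0.0) + p

    def E_X_given(b):                       # conditional expectation E[X | Y = b]
        return sum(a * p for (a, bb), p in P_XY.items() if bb == b) / P_Y[b]

    def mse(estimator):                     # E[(X - estimator(Y))^2]
        return sum(p * (a - estimator(b)) ** 2 for (a, b), p in P_XY.items())

    EX = sum(a * p for (a, b), p in P_XY.items())
    print("MMSE with X_hat(Y) = E[X|Y]  :", mse(E_X_given))
    print("error variance with X_hat = E[X]:", mse(lambda b: EX))   # never smaller, cf. (A.55)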


A.10. Linear Estimation


This section is concerned with linear minimum-mean square-error (LMMSE)
estimators. Consider the zero-mean random variables X and Y . (For non-
zero mean X and Y , simply subtract the means to give zero-mean random
variables. One may add E[X] to the final estimate without affecting the
estimator performance.) Given Y , we estimate X̂ = c Y where c is chosen
to minimize
    E[ (X − X̂)^2 ].                (A.56)

Simple calculations give

    c = E[XY] / E[Y^2]                (A.57)
    E[ (X − X̂)^2 ] = E[X^2] − E[X̂^2]                (A.58)
    E[ (X − X̂) · Y ] (a)= 0                (A.59)

and (a) is called the orthogonality principle.


More generally, for (zero-mean column) vectors the LMMSE estimator is
X̂ = C Y where C minimizes
 
2
E X − X̂ . (A.60)
h i
Suppose that X and Y have zero mean and QY = E Y Y T is invertible.
We compute
h i
C = E X Y T Q−1
Y (A.61)
   
2 h i 2
E X − X̂ = E kXk2 − E X̂ (A.62)
h  i (a)
E X − X̂ · Y T = 0 (A.63)

and (a) is again called the orthogonality principle.


Summarizing, the vector LMMSE estimator can be written as

    X̂(Y) = E[X Y^T] QY^{-1} Y.                (A.64)
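
A NumPy sketch of (A.61)-(A.64) with sample moments in place of expectations; the linear observation model used to generate Y below is an assumption made only for the example.

    import numpy as np
    rng = np.random.default_rng(2)

    # Zero-mean data: Y is a noisy linear observation of X (assumed model, for illustration).
    n, lx, ly = 200000, 2, 3
    X = rng.standard_normal((lx, n))
    A = np.array([[1.0, 0.5], [0.0, 2.0], [-1.0, 1.0]])
    Y = A @ X + 0.3 * rng.standard_normal((ly, n))

    # Sample versions of E[X Y^T] and Q_Y = E[Y Y^T], then the LMMSE matrix C of (A.61).
    E_XYt = (X @ Y.T) / n
    Q_Y = (Y @ Y.T) / n
    C = E_XYt @ np.linalg.inv(Q_Y)

    X_hat = C @ Y                                    # LMMSE estimate (A.64)
    err = X - X_hat
    print("MSE:", np.mean(np.sum(err ** 2, axis=0)))
    # Orthogonality principle (A.63): E[(X - X_hat) Y^T] is (numerically) the zero matrix.
    print(np.abs((err @ Y.T) / n).max())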


A.11. Markov Inequalities


We state and prove several useful inequalities.

Theorem A.2. (Markov Inequality) Let X be a non-negative real-


valued random variable with mean E[X]. For a > 0, we have

    Pr[X ≥ a] ≤ E[X] / a.                (A.65)

Proof. We have Pr [X ≥ a] = E[1(X ≥ a)], where 1(·) is the indicator func-


tion that takes on the value 1 if its argument is true and is 0 otherwise.
We further note that a 1(X ≥ a) ≤ X. We thus have a Pr [X ≥ a] =
E[a 1(X ≥ a)] ≤ E[X]. 

Example A.13. Suppose we set X = |Y − E[Y ] |. Markov’s inequality


then gives Tchebycheff’s inequality
    Pr[ |Y − E[Y]| ≥ a ] = Pr[ |Y − E[Y]|^2 ≥ a^2 ] ≤ Var[Y] / a^2                (A.66)
where Var[Y ] is the variance of Y and a > 0.

Example A.14. Suppose we set X = eνY and a = eνb . Markov’s inequality


then gives the Chernoff bounds
    Pr[Y ≥ b] ≤ E[e^{νY}] e^{−νb}   for ν ≥ 0
    Pr[Y ≤ b] ≤ E[e^{νY}] e^{−νb}   for ν ≤ 0.                (A.67)
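
A quick Monte Carlo check of (A.65) and (A.66) in Python, using an exponential random variable as an arbitrary test case:

    import numpy as np
    rng = np.random.default_rng(3)

    # Monte Carlo check of the Markov inequality (A.65) and Tchebycheff's inequality (A.66)
    # for an exponential random variable with mean 1 and variance 1.
    n = 1_000_000
    X = rng.exponential(scale=1.0, size=n)

    a = 3.0
    print("Pr[X >= a]        :", np.mean(X >= a), "   Markov bound:", X.mean() / a)
    print("Pr[|X-E[X]| >= a] :", np.mean(np.abs(X - X.mean()) >= a),
          "   Tchebycheff bound:", X.var() / a ** 2)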


A.12. Jensen Inequalities


A real-valued function f (·) with domain an interval I of non-zero length on
the real line is convex (or convex-∪) on I if, for every interior point x0 of
I, there exists a real number m (that may depend on x0 ) such that

f (x) ≥ f (x0 ) + m(x − x0 ) for all x ∈ I. (A.68)

The convexity is strict if the inequality (A.68) is strict whenever x ≠ x0 .


An alternative and equivalent definition is that f (·) is convex on I if for
every x1 and x2 in I we have

f (λx1 + (1 − λ)x2 ) ≤ λf (x1 ) + (1 − λ)f (x2 ) for 0 < λ < 1. (A.69)

We say that f (·) is concave (or convex-∩) on I if −f (·) is convex on I.

Theorem A.3. (Jensen’s Inequality) Let X be a real-valued random


variable taking values in I and suppose f (·) is convex on I. We have

f (E[X]) ≤ E[f (X)] . (A.70)

Similarly, if f (·) is concave on I then we have

f (E[X]) ≥ E[f (X)] . (A.71)

If f (·) is strictly convex (or concave) then equality holds in (A.70) (or
(A.71)) if and only if X is a constant.

Instead of proving Theorem A.3, we prove a more general result for vectors.
Consider a real-valued function f (·) whose domain is a non-empty convex
set S in the n-dimensional vector space Rn . We say that f (·) is convex (or
convex-∪) on S if, for every interior point x0 of S, there exists a real vector
m (that may depend on x0 ) such that

f (x) ≥ f (x0 ) + mT (x − x0 ) for all x ∈ S. (A.72)

The convexity is strict if the inequality (A.72) is strict whenever x ≠ x0 .


An alternative and equivalent definition is that f (·) is convex on S if for
every x1 and x2 in S we have

f (λx1 + (1 − λ)x2 ) ≤ λf (x1 ) + (1 − λ)f (x2 ) for 0 < λ < 1. (A.73)

We say that f (·) is concave (or convex-∩) on S if −f (·) is convex on S.

Theorem A.4. (Jensen’s Inequality for Vectors) Let X be a vector-


valued random variable taking values in S and suppose f (·) is convex on S.
We have

f (E[X]) ≤ E[f (X)] . (A.74)


Similarly, if f (·) is concave on S then we have

f (E[X]) ≥ E[f (X)] . (A.75)

If f (·) is strictly convex (or concave) then equality holds in (A.74) (or
(A.75)) if and only if X is a constant.

Proof. Choose x0 = E[X] in (A.72), choose an m that satisfies (A.72) for


this x0 , replace x with the random variable X, and take expectations of
both sides of (A.72). The result is (A.74). If f (·) is concave on S, then we
similarly obtain (A.75). Furthermore, if f (·) is strictly convex (or concave),
equality holds in (A.74) (or (A.75)) if and only if X is a constant. 
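
A short numerical illustration of (A.70) and (A.71) for the convex f(x) = x^2 and the concave f(x) = log(x), using samples of an arbitrary positive random variable:

    import numpy as np
    rng = np.random.default_rng(4)

    X = rng.uniform(0.5, 2.5, size=1_000_000)

    print(X.mean() ** 2, "<=", np.mean(X ** 2))        # f(E[X]) <= E[f(X)], f convex
    print(np.log(X.mean()), ">=", np.mean(np.log(X)))  # f(E[X]) >= E[f(X)], f concave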


A.13. Weak Law of Large Numbers


Laws of large numbers are concerned with n repeated trials of the same
random experiment. Let X1 , X2 , . . . , Xn be independent and identically
distributed (i.i.d.) real-valued random variables with PXi (·) = PX (·) for all
i. The sample mean Sn is

    Sn = (1/n) Σ_{i=1}^{n} Xi.                (A.76)

We clearly have E[Sn ] = E[X] and Var[Sn ] = Var[X] /n. Applying Tcheby-
cheff’s inequality (A.66), we have

    Pr[ |Sn − E[X]| ≥ ε ] ≤ Var[X] / (n ε^2).                (A.77)
Alternatively, we have

    Pr[ |Sn − E[X]| < ε ] ≥ 1 − Var[X] / (n ε^2)                (A.78)

which is a quantitative version of the Weak Law of Large Numbers. The


Weak Law thus states that the sample mean of n independent samples of
X is almost certain to be near E[X] when n is large. Qualitatively, we can
write (A.78) as a “limit in probability”:

    plim_{n→∞} Sn = E[X].                (A.79)

Example A.15. Consider a (perhaps non-real) random variable X with


alphabet X = {a1 , a2 , . . . , a|X | }. Let 1(·) be the indicator function that is
1 if its argument is true and is 0 otherwise. Let Y = 1(X = aj ) for some
1 ≤ j ≤ |X | for which we compute

E[Y ] = PX (aj ) (A.80)


Var[Y ] = PX (aj ) (1 − PX (aj )) . (A.81)

Now consider n independent trials X1 , X2 , . . . , Xn of X which generate n


independent trials of real-valued Y1 , Y2 , . . . , Yn of Y . We form the sum
(A.76) with Yi replacing Xi . Note that this sum is 1/n multiplying the
number N (aj |X n ) of times that the letter aj occurs in X n . Using the weak
law (A.78) we find that

    Pr[ | N(aj|X^n)/n − PX(aj) | < ε ] ≥ 1 − PX(aj)(1 − PX(aj)) / (n ε^2).                (A.82)

In other words, for large n the letter aj occurs about nPX (aj ) times in X n .
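
The following Python sketch illustrates Example A.15 and the bound (A.82) for an arbitrary three-letter alphabet.

    import numpy as np
    rng = np.random.default_rng(5)

    # The empirical frequency N(aj|X^n)/n approaches PX(aj); the alphabet and PX are arbitrary.
    alphabet = np.array(["a", "b", "c"])
    P_X = np.array([0.5, 0.3, 0.2])
    eps, j = 0.01, 1                                  # look at the letter a_j = "b"

    for n in (10**3, 10**4, 10**5, 10**6):
        Xn = rng.choice(alphabet, size=n, p=P_X)
        freq = np.mean(Xn == alphabet[j])
        bound = 1.0 - P_X[j] * (1 - P_X[j]) / (n * eps**2)   # right-hand side of (A.82)
        print(n, "empirical freq:", round(freq, 4),
              "  bound on Pr[|freq - 0.3| < eps]:", round(bound, 4))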


A.14. Strong Law of Large Numbers


The Weak Law of Large Numbers required only the Tchebycheff bound that
is based on Var[X]. We next wish to derive a stronger result that concerns
the semi-infinite sequence of sums S1 , S2 , S3 , . . .. The Chernoff bound (A.67)
with Y = Sn − E[X] gives
    Pr[ (Sn − E[X]) ≥ ε ] ≤ E[ e^{ν(Sn − E[X])} ] e^{−νε}
                          (a)= ( E[ e^{(ν/n)(X − E[X])} ] )^n e^{−νε}                (A.83)

where ν ≥ 0 and (a) follows by the independence of the trials. We should


now optimize over ν but instead choose ν = n δ for some small constant δ.
Suppose the magnitudes of all moments of X are bounded from above as
follows:
    | E[ (X − E[X])^i ] | ≤ m^i                (A.84)

for i = 2, 3, 4, . . . and for some positive m. The right-hand side of (A.83) is

    ( E[ e^{(ν/n)(X − E[X])} ] )^n e^{−νε}
        = ( E[ 1 + Σ_{i=2}^{∞} (δ^i / i!) (X − E[X])^i ] )^n e^{−nδε}
        ≤ ( 1 + Σ_{i=2}^{∞} (δ^i / i!) m^i )^n e^{−nδε}
        = ( e^{mδ} − mδ )^n e^{−nδε}
        = e^{n[ log(e^{mδ} − mδ) − δε ]}.                (A.85)

The term in square brackets in (A.85) evaluates to zero at δ = 0 and its


derivative with respect to δ is −ε at δ = 0. Hence we can find a small δ for
which the term in square brackets is negative. We thus find that

    Pr[ (Sn − E[X]) ≥ ε ] ≤ β^n                (A.86)

where β depends on ε and m and satisfies 0 < β < 1. Combining (A.86)
with a similar bound on Pr[ Sn − E[X] ≤ −ε ] we have

    Pr[ |Sn − E[X]| ≥ ε ] ≤ 2β^n.                (A.87)

Now consider the following result on the behavior of {Si}_{i=1}^{∞} for i ≥ n:

    Pr[ sup_{i≥n} |Si − E[X]| ≥ ε ] = Pr[ ∪_{i≥n} {|Si − E[X]| ≥ ε} ]
                                   (a)≤ Σ_{i=n}^{∞} Pr[ |Si − E[X]| ≥ ε ]
                                   (b)≤ 2β^n / (1 − β)                (A.88)


where (a) follows by the union bound, and (b) follows by the Chernoff bound
(A.87). Note that step (b) does not work with the Tchebycheff bound (A.77)
because the probabilities decrease only as 1/n rather than exponentially
with n. The bound (A.88) implies the Strong Law of Large Numbers, namely

    lim_{n→∞} Pr[ sup_{i≥n} |Si − E[X]| < ε ] = 1.                (A.89)

A.15. Problems

A.15.1. Borel-Kolmogorov “Paradox”

Consider a 3-dimensional sphere centered at the origin and label the points
on its surface by using the spherical coordinates longitude φ, −π ≤ φ < π,
and latitude θ, −π/2 ≤ θ ≤ π/2. Suppose X is a point that is uniformly
distributed on the sphere, i.e., we write X(φ, θ) and consider the joint den-
sity
    pΦΘ(φ, θ) = (1/(4π)) cos θ.                (A.90)

a) Determine the marginal densities pΦ (·) and pΘ (·).
b) Determine the density of a point on the great circle defined by θ = 0,
i.e., determine pΦ|Θ (·|0).
c) Determine the density of a point on the great circle defined by φ = 0,
i.e., determine pΘ|Φ (·|0). Compare your answer to the previous result
and interpret.

A.15.2. Strong Law with Fourth Central Moment Constraint

This exercise proves the strong law of large numbers but replaces (A.84)
with the weaker constraint that X has a finite fourth central moment
    E[ (X − E[X])^4 ] ≤ m                (A.91)

for some non-negative m.

a) Use Jensen’s inequality to show that

       Var[X]^2 ≤ E[ (X − E[X])^4 ]                (A.92)

   so that (A.91) implies Var[X]^2 ≤ m.
b) Use the Markov inequality to show that

       Pr[ (Sn − E[X]) ≥ ε ] ≤ E[ (Sn − E[X])^4 ] / ε^4.                (A.93)


c) Show by direct computation or induction that for zero-mean and i.i.d.


   Y1 , Y2 , Y3 , . . . with common distribution PY , we have

       E[ ( Σ_{i=1}^{n} Yi )^4 ] = n E[Y^4] + 3n(n − 1) E[Y^2]^2.                (A.94)

d) Now use Y = X − E[X] and show that


       E[ (Sn − E[X])^4 ] ≤ (3n^2 − 2n) m / n^4 ≤ 3m / n^2.                (A.95)
e) Insert (A.95) into (A.93) and modify the steps (A.86)-(A.88) to show
   that

       Pr[ sup_{i≥n} |Si − E[X]| ≥ ε ] ≤ Σ_{i=n}^{∞} 3m/(i^2 ε^4) ≤ (3m/ε^4) ∫_{n}^{∞} (1/x^2) dx = 3m/(n ε^4).                (A.96)

Explain why the result (A.96) proves the strong law of large numbers.

References
[1] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of
Plausible Inference. Morgan Kaufmann, San Mateo, CA, 1988.
[2] G. Kramer. Directed Information for Channels with Feedback. Hartung-
Gorre Verlag, Konstanz, Germany, 1998. ETH Series in Information
Processing, Vol. 11.
[3] G. Kramer and S. A. Savari. Edge-cut bounds on network coding rates.
Journal of Network and Systems Management, 14(1):49–67, March 2006.



Appendix B.

Problems

B.1. Information Theory for Discrete Variables

B.1.1. Expectation
Consider random variables X and Y with
    PX(0) = PX(1) = 1/2,   PX(2) = 0
    PY(0) = PY(1) = PY(2) = 1/3.
a) Determine E[1/PX(X)], E[1/PX(Y)], E[1/PY(X)], and E[1/PY(Y)].
b) Compute E[− log2 PX(X)] and E[− log2 PY(X)].
c) Compute E[ log2( PX(X)/PY(X) ) ] and E[ log2( PY(Y)/PX(Y) ) ].
.

B.1.2. Expectation of a Function of a Discrete Random Variable


In Information Theory (IT), one is frequently required to compute the ex-
pectation of F (X) where X is a discrete random variable and F is a function
whose domain includes X(Ω) and which takes values on the extended real
line R ∪ {−∞, ∞}.
The defining relation for expectation reads as

    E[F(X)] = Σ_{x ∈ supp(PX)} PX(x) F(x)                (B.1)

where supp(PX ) denotes the support of PX defined as

supp(PX ) = {x : x ∈ X(Ω) and PX (x) ≠ 0}.

a) Suppose that #(X(Ω)) = L and that #(supp(PX )) = L′ where L′ ≤ L.
   Find E[1/PX (X)].
   Hints: What is F(x)? What is Σ_{i=1}^{n} 1?


b) Suppose that X and Y are random variables such that X(Ω) = Y (Ω).
Write out the general expression for E[PX (X)] and E[PY (X)].
c) Write out the general expression for E[− log PX (X)] and E[− log PY (X)].

B.1.3. Probability Theory


The contestants of a television game show can win a prize which is hidden
behind one of three doors. As soon as one of the contestants chooses one of
the doors, the omniscient game show host opens one of the other two doors
and exclaims, “The prize is not behind this door. Would you like to change
your decision and choose the other door?”
a) What decision would you spontaneously make?
b) Can you give a good reason for your choice?
Hint: Each contestant can choose from one of two sensible strategies:
i) Always stick with the original decision.
ii) Always change one’s mind and choose the other remaining door.

B.1.4. Jensen’s Inequality


a) For each of the following functions and the indicated interval, de-
termine whether the function is convex-∩ or convex-∪ (or neither or
both) on this interval, and whether the convexity is strict or not.
Sketch each function over the indicated interval.
i) f (x) = x2 , −∞ < x < ∞,
ii) f (x) = e−ax , a > 0, −∞ < x < ∞,
iii) f (x) = log(x), 0 ≤ x ≤ 1,
iv) f (x) = h(x), 0 ≤ x ≤ 1.
b) Use Jensen’s inequality to find the appropriate inequalities between
i) E[X 2 ] and (E[X])2 ,
ii) E[e−ax ] and e−aE[X] , a > 0,
iii) E[log(X)] and log(E[X]),
   iv) Σ_{i=1}^{K} (1/K) h(Pe_i) and h( Σ_{i=1}^{K} (1/K) Pe_i ).

B.1.5. Maximum Entropy


Prove the inequality on the right-hand side of (1.7) by using Jensen’s in-
equality (A.71) and the concavity of log2 (x) in x for x > 0. What are the
conditions for equality?

B.1.6. Binomial Distribution 1


Show that

    (a + b)^n = Σ_{i=0}^{n} (n choose i) a^i b^{n−i}                (B.2)


and use this identity to show that the PX (k), k = 0, 1, 2, . . . , n specified in


(1.14) give a probability distribution.

B.1.7. Binomial Distribution 2


Consider a binomial probability distribution PX (k) for n = 5 and p = 1/3.
a) Plot PX .
b) Compute E[X] and Var[X] and overlay the Gaussian approximation
(1.16) on the plot in part (a).
c) Compute H(X) and compare with the approximation (1.17).

B.1.8. Geometric Distribution


Consider a geometric distribution PX and verify the equations for E[X],
Var[X], and H(X) given in Sec. 1.4.3.

B.1.9. Conditional Entropy


Consider a jar in which there is a fair coin c1 and an unfair coin c2 . The fair
coin takes on the values H and T with probability 1/2 each and the unfair
coin always takes on the value H. Suppose we reach into the jar, choose
one of the coins, and flip it. Let X be the random variable that represents
the outcome H or T .
a) Let C be a random variable that represents the choice of coin, i.e., C
takes on the values c1 and c2 each with probability 1/2. Determine
H(X|C = c1 ) and H(X|C = c2 ).
b) Compute H(X|C).
c) Determine PX (·) and compute H(X).
d) Compare the four entropies you computed by ordering them.
Can conditioning increase entropy?

B.1.10. Bounds on Conditional Entropy


Verify (1.27) including the conditions for equality.

B.1.11. Functions of Variables


Let f (·) and g(·) be functions whose domains are the ranges of [X, Y ] and
Y , respectively. Show that

H(Xf (X, Y )|Y g(Y )) = H(X|Y ). (B.3)

Interpret the result.

B.1.12. Uncertainty 1
Suppose that the random vector [X1 X2 X3 ] takes on the values [0 0 0],
[0 1 1], [1 0 1] and [1 1 0], each with probability 1/4 (i.e., PX1 X2 X3 (0, 0, 0) =
1/4, PX1 X2 X3 (0, 0, 1) = 0, etc.).


a) Find H(X1 X2 X3 ).
b) Calculate the following uncertainties:
i) H(X1 ), H(X2 ) and H(X3 ).
ii) H(X2 |X1 = 0), H(X2 |X1 = 1) and H(X2 |X1 ).
Hint: PX2 |X1 (0|0) =?
iii) H(X3 |X1 X2 = 00), H(X3 |X1 X2 = 01), H(X3 |X1 X2 = 10),
H(X3 |X1 X2 = 11) and H(X3 |X1 X2 ).
Hint: PX3 |X1 X2 (0|00) =?
c) Verify from a) and b) that the chain rule for uncertainties is verified.

B.1.13. Uncertainty 2
Repeat Problem B.1.12 when [X1 X2 X3 ] takes on the values [0 0 0], [0 1 0],
[1 0 0] and [0 0 1], each with probability 1/4.
Hint: If an event has zero probability, then the conditional uncertainty
given that event cannot generally be calculated, but it is also of no physical
significance.

B.1.14. Uncertainty of a Geometrically Distributed Random Variable
In the analysis of data communications networks, it is commonly assumed
that the lengths of messages are geometrically (or “exponentially”) dis-
tributed. That is, if X denotes the number of binary digits in a message,
then
    PX(x) = p(1 − p)^{x−1}   for x = 1, 2, 3, . . .

where 0 < p < 1. It is easy to show that E[X] = 1/p and Var[X] =
(1 − p)/p^2 . Find the uncertainty H(X) about the message length in terms
of p, then express H(X) as simply as possible in terms of the binary entropy
function h(p) and E[X].

B.1.15. Essential Determination


To say that the discrete random variable Y essentially determines the dis-
crete random variable X means that for every y such that PY (y) 6= 0, there
exists an x such that PX|Y (x|y) = 1. Recall that H(X|Y ) = 0 if and only if
Y essentially determines X.
Show that if Y essentially determines X, then

H(X) ≤ H(Y )

with equality if and only if X also essentially determines Y .


Hint: Consider H(XY ).

B.1.16. Calculation of Joint Entropy


Suppose X is binary with alphabet X = {0, 1}. Consider the Binary Sym-
metric Channel or BSC for which Y = {0, 1} and
    PY|X(b|a) = 1 − p,   if b = a
              = p,       if b ≠ a.                (B.4)

a) Show that

H(X) = H2 (PX (0))


H(Y |X) = H2 (p).

b) Defining q ∗ p = q(1 − p) + (1 − q)p, verify that

H(Y ) = H2 (PX (0) ∗ p)


H(X|Y ) = H2 (PX (0)) + H2 (p) − H2 (PX (0) ∗ p) .

c) Show that H(Y ) is maximized by PX (0) = 1/2.

B.1.17. Binary Channels

Verify the following claims where the inverse function H2−1 (·) is defined with
domain the interval [0, 1] and range the interval [0, 1/2].
 
a) Show that H2(H2^{-1}(x)) = x for 0 ≤ x ≤ 1 but H2^{-1}(H2(x)) ≠ x can
   happen (for which x?).
b) Show that [λa + (1 − λ)b] ∗ p = λ(a ∗ p) + (1 − λ)(b ∗ p), where
q ∗ p = q(1 − p) + (1 − q)p.
c) Show that H2 (a) ∗ p ≤ H2 (a ∗ p) for 0.11 ≤ a ≤ 0.5 and 0 ≤ p ≤ 1.
 
d) Using the above two results and H2(H2^{-1}(a)) = a, show that

       [λa + (1 − λ)b] ∗ p ≤ H2( λ H2^{-1}(a) ∗ p + (1 − λ) H2^{-1}(b) ∗ p )

for the range of a and p of problem (c).

B.1.18. Mrs. Gerber’s Lemma 1

Recall the definition q ∗ p = q(1 − p) + (1 − q)p. The following problem shows
that H2( p ∗ H2^{-1}(z) ) is convex in z, 0 ≤ z ≤ 1, for p satisfying 0 ≤ p ≤ 1/2.

a) Consider the function composition g(f^{-1}(·)) where f(·) is invertible.
   Show that

       ∂^2 g(f^{-1}(z)) / ∂z^2 = ( 1 / f′(x) ) ( g′(x) / f′(x) )′                (B.5)

   where f′(x) = df(x)/dx and z = f(x).


   Hint: Use the identities:

       ( f^{-1} )′(x) = 1 / f′(f^{-1}(x))
       ( g(x)/f(x) )′ = ( f(x) g′(x) − f′(x) g(x) ) / f(x)^2

b) Consider f(x) = H2(x) and show that f′(x) > 0 for 0 < x < 1/2.
c) Consider g(x) = H2(p ∗ x) and show that (g′/f′)′ ≥ 0 for 0 < x < 1/2.
d) Finally, prove the claimed convexity of H2( p ∗ H2^{-1}(z) ).

B.1.19. Mrs. Gerber’s Lemma 2


 
The following problem makes use of the fact that H2( p ∗ H2^{-1}(z) ) is convex
in z, 0 ≤ z ≤ 1, for p satisfying 0 ≤ p ≤ 1.
a) Consider the BSC of Problem B.1.16. Show that for any discrete
random variable U for which U − X − Y forms a Markov chain we
have

       H(Y|U) ≥ H2( p ∗ H2^{-1}(H(X|U)) ).                (B.6)

b) Suppose the BSC is used n times with input Xi and output Yi for
i = 1, 2, . . . , n. The {Xi }ni=1 could be dependent. Show that
       H(Y^n) ≥ Σ_{i=1}^{n} H(Yi | X^{i−1})                (B.7)

with equality if and only if the {Xi }ni=1 are independent or p = 1/2.
c) Use (B.6) and (B.7) to show that
 
       H(Y^n)/n ≥ H2( p ∗ H2^{-1}(H(X^n)/n) )                (B.8)

with equality if and only if the {Xi }ni=1 are independent or p = 1/2.
This result is known as Mrs. Gerber’s Lemma.
d) Now show that for any discrete random variable V we have
 
       H(Y^n|V)/n ≥ H2( p ∗ H2^{-1}(H(X^n|V)/n) ).                (B.9)

B.1.20. Informational Divergence 1


Prove the inequality (1.41) by using Jensen’s inequality (A.71) and the
concavity of log2 (x) in x for x > 0. What are the conditions for equality?

B.1.21. Informational Divergence 2

a) Verify that D(P1 P2 kQ1 Q2 ) = D(P1 kQ1 ) + D(P2 kQ2 ).


b) Verify the “parallelogram identity” for probability distributions P, Q, R:

       D(P ‖ R) + D(Q ‖ R) = D( P ‖ (P+Q)/2 ) + D( Q ‖ (P+Q)/2 ) + 2 D( (P+Q)/2 ‖ R ).                (B.10)

B.1.22. Informational Divergence 3

Consider X = {0, 1} and PX(0) = PY(0)(1 + ε) where 0 ≤ ε ≤ 1/PY(0) − 1.

a) Verify that

       D(PX ‖ PY) = PY(0)(1 + ε) log2(1 + ε)
                    + [1 − PY(0)(1 + ε)] log2( (1 − PY(0)(1 + ε)) / (1 − PY(0)) )                (B.11)

   and we have D(PX ‖ PY) ≥ 0 with equality if and only if ε = 0.

b) Show that D(PX ‖ PY) in (B.11) is convex in ε.

B.1.23. Chain Rule for Informational Divergence

Verify the chain rule (1.48) for informational divergence.

B.1.24. Cross Entropy Convexity

Show that X(PX kPY ) is linear in PX and convex in PY . For convexity,


use either the concavity of the logarithm or the convexity of informational
divergence, Next, give a counterexample to show that X(PX kPY ) is neither
convex nor concave in the pair (PX , PY ).
Hint: One direction of inequality can be established by choosing the distri-
butions

[PX (0), PX (1)] = [1, 0], [PY (0), PY (1)] = [0, 1]


[QX (0), QX (1)] = [0, 1], [QY (0), QY (1)] = [1, 0]

and for λ = 1/2 computing

    (1/2) X(PX ‖ PY) + (1/2) X(QX ‖ QY) = ∞                (B.12)
    X( (1/2)PX + (1/2)QX ‖ (1/2)PY + (1/2)QY ) = 1.                (B.13)
Next, swap PY and QY and choose

[PX (0), PX (1)] = [1, 0], [PY (0), PY (1)] = [1, 0]


[QX (0), QX (1)] = [0, 1], [QY (0), QY (1)] = [0, 1].


For λ = 1/2 we now compute

    (1/2) X(PX ‖ PY) + (1/2) X(QX ‖ QY) = 0                (B.14)
    X( (1/2)PX + (1/2)QX ‖ (1/2)PY + (1/2)QY ) = 1.                (B.15)
Thus, the cross entropy is neither convex nor concave in the pair (PX , PY )
in general.

B.1.25. Mutual Information

a) Verify the following identities:

       I(X; Y) = D(PX|Y ‖ PX | PY)                (B.16)
               = D(PY|X ‖ PY | PX)                (B.17)
               = D(PY|X ‖ QY | PX) − D(PY ‖ QY)                (B.18)
               = E[ log( QY|X(Y|X) / QY(Y) ) ] + D(PXY ‖ PY RX|Y)                (B.19)

where QY is any distribution on Y, QY |X is an auxiliary channel with


QY |X = QY /PX , RX|Y = PX QY |X /QY is a reverse channel, and the
expectation in (B.19) is over PXY .
b) What happens if QY |X = PY |X ? What happens if QY |X 6= PY |X ?
c) Argue that if PY |X can be simulated, but not computed, then one can
compute a lower bound on I(X; Y ) as follows. Generate a long string
of independent and identically distributed (i.i.d.) inputs xn by using
PX , simulate the outputs y n by passing xn through the channel PY |X ,
choose a QY |X to compute QY |X (yi |xi ) for i = 1, 2, . . . , n, and then
estimate the expectation in (B.19) by averaging.
d) Similarly, argue that one can compute an upper bound on I(X; Y ) by
using (B.18).

B.1.26. Uncertainty and Information in Cryptography


Let X, Y and Z be binary random variables. Consider the simple crypto-
graphic system in which X is the plaintext, Z is the secret key, and Y is
the cryptogram formed as
Y =X ⊕Z
where ⊕ denotes the “exclusive or” operation (1 ⊕ 1 = 0).
Suppose that PZ (1) = 1/2 and that PX (1) = p. Suppose further that X
and Z are statistically independent.
Find each of the following quantities:

a) H(X) and H(X|Z).


b) H(X|Y ).


c) I(X; Y ) and I(X; Z).


d) H(X|Y Z).
e) I(X; Y |Z) and I(X; Y Z).
Remark: This problem shows that I(X; Y |Z) > I(X; Y ) is possible.
f) Just to convince yourself that the inequality can also go in the other
direction in general, define U = X and V = X and compute both
I(X; U ) and I(X; U |V ).

B.1.27. Uncertainty and Information in the Rolling of Dice


Consider the random experiment of rolling two dice, where one die is larger
than the other. Let X be the number of dots that turn up on the larger
die, and let Y be the number on the smaller die. Note that X and Y are
independent random variables with the same probability distribution. Let
S = X + Y and Z = X · Y be the sum and product, respectively, of X and
Y.
a) Find H(X), H(Y ), H(XY ), H(X|Y ) and I(X; Y ).
b) Find H(S), H(S|X) and I(X; S).
c) Find H(Z), H(Z|X) and I(X; Z). Which gives more information
about a single die, the sum S or the product Z?

B.1.28. Uncertainty and Information in Data Transmission


Consider the following data transmission, which represents an information
sequence [X1 , X2 ] that is encoded with a “parity-check code” into the code-
word [X1 , X2 , X1 ⊕ X2 ], which is then transmitted over a channel with “bit
error probability p” where 0 < p < 1/2 to yield the received word [Y1 , Y2 , Y3 ]:

[Figure: the encoder appends X3 = X1 ⊕ X2; each received symbol is Yi = Xi ⊕ Zi for i = 1, 2, 3.]
The symbol ⊕ denotes the exclusive-or operation (or addition mod 2). As-
sume that X1 , X2 , Z1 , Z2 , and Z3 are independent binary random variables
such that PX1 (1) = PX2 (1) = 1/2 and PZ1 (1) = PZ2 (1) = PZ3 (1) = p.


a) Find the following quantities: H(Y1 ), H(Y2 ), H(Y3 ), H(X1 X2 ) and


H(X1 X2 X3 ).
b) Find H(Y1 Y2 ) and I(X1 X2 ; Y1 Y2 ).
c) Verify the values of PY3 |Y1 Y2 (y3 |y1 y2 ) as given in the following table
where q = (1 − p)(1 − 2p + 4p2 ).

y1 y2 y3 PY3 |Y1 Y2 (y3 |y1 y2 )


0 0 0 q
0 0 1 1−q
0 1 0 1−q
0 1 1 q
1 0 0 1−q
1 0 1 q
1 1 0 q
1 1 1 1−q

d) Now find H(Y1 Y2 Y3 ) and I(X1 X2 ; Y1 Y2 Y3 ).


Hint: I(X1 X2 ; Y1 Y2 Y3 ) = I(X1 X2 X3 ; Y1 Y2 Y3 ).
e) Compare the mutual information found in b) with the one in d) for
p = 0.01 and p = 0.1.

B.1.29. Bounds on Conditional Mutual Information

Verify (1.61) including the conditions for equality.

B.1.30. Chain Rule for Mutual Information 1

Show that

    I(X^n; Y | Z) = Σ_{i=1}^{n} I(Xi; Y | Z X^{i−1}).                (B.20)

B.1.31. Chain Rule for Mutual Information 2


Define X_{i+1}^{n} = X_{i+1} X_{i+2} . . . X_n and consider X^0 and X_{n+1}^{n} to be constants.

a) Establish the telescoping identity

       Σ_{i=1}^{n} I(X^i; Y_{i+1}^{n}) = Σ_{i=1}^{n} I(X^{i−1}; Y_i^{n}).                (B.21)

b) Establish the Csiszár sum identity

       Σ_{i=1}^{n} I(Xi; Y_{i+1}^{n} | X^{i−1}) = Σ_{i=1}^{n} I(X^{i−1}; Yi | Y_{i+1}^{n}).                (B.22)


Hint: Use (B.21) and use the chain rule (1.63) to show that

    I(X^i; Y_{i+1}^{n}) = I(X^{i−1}; Y_{i+1}^{n}) + I(Xi; Y_{i+1}^{n} | X^{i−1})
    I(X^{i−1}; Y_i^{n}) = I(X^{i−1}; Y_{i+1}^{n}) + I(X^{i−1}; Yi | Y_{i+1}^{n}).

B.1.32. Functional Dependence and Mutual Information


Let f (·), g(·), and h(·) be functions whose domains are the ranges of [X, Z],
[Y, Z], and Z, respectively. Show that

I(X; Y |Z) = I(Xf (X, Z); Y g(Y, Z)|Zh(Z)). (B.23)

B.1.33. Conditional Convexity


This exercise develops conditional versions of Theorems 1.9-1.11.
a) Verify that the conditional informational divergence D(PX|Z kPY |Z |PZ )
is convex in the pair (PX|Z , PY |Z ).
Is D(PX|Z kPY |Z |PZ ) convex in (PXZ , PY Z )?
Is D(PX|Z kPY |Z |PZ ) convex in PXY Z ?
b) Verify that H(X|Z) is concave in PX|Z . Is H(X|Z) concave in PXZ ?
c) Verify that I(X; Y |Z) is concave in PX|Z for fixed PY |XZ , and is convex
in PY |XZ for fixed PX|Z .

B.1.34. Data Processing Inequality


Prove (1.74) by using the log-sum inequality (see Theorem 1.5).

B.2. Information Theory for Continuous Variables

B.2.1. Translation and Scaling


Verify equations (2.5) and (2.6).

B.2.2. Entropy for Gaussian Random Vectors


Verify equation (2.22).
Hint: use tr (AB) = tr (BA) where tr (A) is the trace of matrix A.

B.2.3. Entropy for Complex Gaussian Random Variables


A complex Gaussian random variable X = XR + jXI has Gaussian XR and
XI , where j = √−1. Suppose that [XR XI ] has the covariance matrix

    Q_{XR XI} = [ σR^2       ρ σR σI
                  ρ σR σI    σI^2    ]                (B.24)

The variance and pseudo-variance of X are defined as the respective


    Var[X] = E[ |X − E[X]|^2 ]                (B.25)
    pVar[X] = E[ (X − E[X])^2 ].                (B.26)

The complex random variable X is said to be proper if pVar[X] = 0.


a) Show that X is proper if and only if σR2 = σI2 and ρ = 0. Such X are
also called circularly symmetric.
b) Show that a proper complex X has h(X) = log (πe Var[X]).
c) Show that X with non-singular QXR XI has h(X) ≤ log (πe Var[X])
with equality if and only if X is proper.
d) What happens if QXR XI is singular?

B.2.4. Entropy for Complex Gaussian Random Vectors


The covariance and pseudo-covariance matrices of a complex random vector
X are defined as the respective
    Q_X = E[ (X − E[X])(X − E[X])^† ]                (B.27)
    Q̃_X = E[ (X − E[X])(X − E[X])^T ].                (B.28)

where X^T and X^† are the respective transpose and complex-conjugate transpose
of X. The complex vector X is said to be proper if Q̃_X = 0.

a) Show that a linear transformation of a proper complex X is proper.


b) Show that X with non-singular QX has h(X) ≤ log( (πe)^L det(QX) ) with
   equality if and only if X is proper.

B.2.5. Informational Divergence

a) Verify (2.30) and show that this bound holds with equality if and only
if pX (a) = pY (a) for (almost) all a ∈ supp(pX ).
b) Verify that D(pX|Z kpY |Z |pZ ) ≥ 0 with equality if and only if pX|Z (a|b) =
pY |Z (a|b) for (almost) all (a, b) ∈ supp(pXZ ).

B.2.6. Conditional Entropy-Power Inequality


The entropy power of a real random variable X with differential entropy
h(X) is defined as e2h(X) . The entropy power inequality states that the sum


X + Y of two independent random variables X and Y having differential


entropies satisfies

    2^{2h(X+Y)} ≥ 2^{2h(X)} + 2^{2h(Y)}                (B.29)

Prove that f(x) = log(e^x + c), where c is a non-negative constant, is convex-∪
in x. Now use the entropy power inequality to prove that

    2^{2h(X+Y|U)} ≥ 2^{2h(X|U)} + 2^{2h(Y|U)}                (B.30)

if U and Y are independent (assume that U has a density).

B.2.7. Polar Coordinates



Consider a complex random variable X = XR + jXI where j = √−1 and
XR and XI are real random variables. We can view X as a vector [XR , XI ],
or as a string XR XI , and we define

h(X) = h(XR XI ). (B.31)

Suppose that we wish to represent X in polar coordinates via the vector


[AX , ΦX ] where AX = |X| and ΦX is the phase of X. Show that

h(X) = h(AX ΦX ) + E[log AX ] (B.32)


     = h(A_X^2) + h(ΦX |AX ) − log 2.                (B.33)

B.2.8. Computing Achievable Rates of Memoryless Channels


Consider a memoryless channel with conditional probability pY |X and dis-
crete input alphabet X . The channel output Y is a continuous random
variable. Consider an auxiliary channel qY |X . We assume that the auxiliary
channel is connected to the same source. Its output distribution is then
    qY(y) = Σ_{x∈X} PX(x) qY|X(y|x).                (B.34)

In the following we will assume that qY |X is chosen such that


    ∫_y qY|X(y|x) dy = 1.                (B.35)

We define

    Ī_q(X; Y) = Σ_{x∈supp(PX)} PX(x) ∫_y pY|X(y|x) log2( pY|X(y|x) / qY(y) ) dy                (B.36)

and

    I_q(X; Y) = Σ_{x∈supp(PX)} PX(x) ∫_y pY|X(y|x) log2( qY|X(y|x) / qY(y) ) dy                (B.37)


a) Show that I(X; Y) ≤ Ī_q(X; Y).
   Hint: Prove that Ī_q(X; Y) − I(X; Y) = D(pY ‖ qY).
b) Show that I(X; Y) ≥ I_q(X; Y).
   Hint: Prove that I(X; Y) − I_q(X; Y) = D(PX pY|X ‖ pY RX|Y) where
   RX|Y(x|y) = PX(x) qY|X(y|x) / qY(y).
c) Argue that if pY |X can be simulated, but not computed, then one
can compute a lower bound on I(X; Y ) as follows. Generate a long
string of independent and identically distributed (i.i.d.) inputs x =
(x1 , . . . , xn ) by using PX , simulate the outputs y = (y1 , . . . , yn ) by
   passing x through the channel pY |X and computing

       I_q(X; Y) ≈ (1/n) Σ_{i=1}^{n} log2( qY|X(yi|xi) / Σ_{x∈X} PX(x) qY|X(yi|x) ).                (B.38)

d) Similarly, argue that one can compute an upper bound on I(X; Y ) by


using I¯q (X; Y ).
e) Let C be defined as

       C = max_{PX} I(X; Y).                (B.39)

   Prove that

       C ≤ max_{PX} Ī_q(X; Y) = max_{x∈X} D( pY|X(·|x) ‖ qY ).                (B.40)

Consider an AWGN channel Y = X + Z with input alphabet X =


{a1 , a2 , . . . , aK } where ai < aj if i < j, and where Z is a Gaussian
random variable with zero mean and variance N . (Note: there is no
further constraint on X, e.g., there is no constraint on the average
“power” E[X 2 ] as in Chapter 5.)
Show that for large N the optimal input distribution PX in (B.39)
   (achieving the upper bound in (B.40)) uses PX(a1) = PX(aK) = 1/2 and
   where we choose qY = pY .

B.2.9. Scaling for Divergence and Cross Entropy


Let X and Y be two real-valued random variables with densities pX and
pY and c ∈ R be a non-zero constant.

a) Verify the scaling rule D(pcX ||pcY ) = D(pX ||pY ).


b) Verify the scaling rule for the cross entropy X(pcX ||pcY ) = X(pX ||pY )+
log |c|.

Now, let X, Y be two real-valued random column vectors of dimension n.


Let C be an invertible n × n matrix.


a) Verify the scaling rule D(pCX ||pCY ) = D(pX ||pY ).


b) Verify the scaling rule for the cross entropy X(pCX ||pCY ) = X(pX ||pY )+
log | det C|, where det C is the determinant of C.

B.3. Coding of Memoryless Sources

B.3.1. Bounds on Conditional Entropy


Design a Huffman code for the source PU (0) = 1/2, PU (1) = 1/4, PU (2) =
1/4.

B.3.2. Kraft Inequality 1


One can formulate the problem of finding the best prefix-free code as an
optimization problem:

    min_{ℓ1 ,ℓ2 ,...,ℓK}  Σ_{k=1}^{K} pk ℓk                (B.41)
    subject to  Σ_{k=1}^{K} D^{−ℓk} ≤ 1.                (B.42)

The optimization is over positive integers ℓk , k = 1, 2, . . . , K. However,
suppose we consider the relaxed problem where we permit the ℓk to be real
numbers. Show that the optimal code word lengths for this problem are

    ℓk = − logD pk .                (B.43)

B.3.3. Kraft Inequality 2


Let S = Σ_{k=1}^{K} D^{−ℓk} and qk = D^{−ℓk}/S for k = 1, 2, . . . , K. Define
PY (k) = qk for k = 1, 2, . . . , K. Show that

    E[L] − H(X)/log2(D) = D(PX ‖ PY) logD(2) − logD(S) ≥ 0                (B.44)

What are the conditions for equality?

B.3.4. Uniquely Decodable Codes


This exercise shows that uniquely-decodable codes satisfy the Kraft inequal-
ity. We thus may as well use prefix-free codes for source coding. Consider
a uniquely decodable code C and suppose we use it n times. The list C n of
concatenations of n code words from C has size K n and must itself form a
uniquely decodable code.


a) Show that the code word lengths of C^n range from n to n · max_k ℓk .


b) Show that

       [ Σ_{k=1}^{K} D^{−ℓk} ]^n = Σ_{ℓ=n}^{n·max_k ℓk} N_ℓ D^{−ℓ}                (B.45)

   where

       N_ℓ = Σ_{n1 ℓ1 + ··· + nK ℓK = ℓ} n! / (n1! n2! · · · nK!)                (B.46)
   is the number of concatenations having length ℓ, and where nk is the
   number of times the letter bk is chosen in the n concatenations.
c) Argue that the sum on the left-hand side of (B.45) corresponds to the
   Kraft inequality for C, while the sum on the right-hand side corresponds
   to the Kraft inequality for C^n .
d) Argue that N_ℓ ≤ D^ℓ and show that this implies

       [ Σ_{k=1}^{K} D^{−ℓk} ]^n ≤ n ( max_k ℓk − 1 ) + 1.                (B.47)

e) Now take the nth root of both sides of (B.47) and let n → ∞. What
is the result?

B.3.5. Uniquely Decodable and Prefix Free Codes


A D-ary code is said to be uniquely decodable if every finite D-ary sequence
has at most one interpretation as a sequence of its codewords (but the D-
ary sequence may have no such interpretation). It is difficult to conceive
of a situation where one would want to use a code that was not at least
uniquely decodable.
a) Show that a prefix-free code is automatically also a uniquely decodable
code.
Hint: Consider, for a given finite D-ary sequence, what the first code-
word must be if this sequence can be written as a sequence of code-
words from the prefix-free code.
b) For each of the following binary codes, determine whether it is uniquely
decodable and whether it is prefix-free. If the code is prefix-free, draw
its code tree. If the code is not uniquely decodable, give a binary
sequence that has two different interpretations as a sequence of code-
words from that code.
i) {00, 10, 01, 11},
ii) {0, 10, 11},
iii) {0, 01, 11},
iv) {1, 101},
v) {0, 1, 01}.


Remark: It is not hard to show that there exists a D-ary uniquely decodable
code with a given list of codeword lengths if and only if these lengths satisfy
Kraft’s inequality. Thus, whenever a D-ary uniquely decodable code exists,
there also exists a D-ary prefix-free code with precisely the same codeword
lengths. For this reason, there is never an advantage in using a D-ary
uniquely decodable code that is not prefix-free, even if one did not actually
need the prefix-free property in his or her code.

B.3.6. Existence of a Binary Prefix-Free Code


For each of the following lists of codeword lengths, determine whether a
binary prefix-free code with these codeword lengths exists and, if it exists,
construct such a code.
a) (1, 2, 3, 4, 5, 5),
b) (1, 2, 3, 4, 4, 5),
c) (2, 2, 2, 3, 4).

B.3.7. Existence of a Ternary Prefix-Free Code


For each of the following lists of codeword lengths, determine whether a
ternary prefix-free code with these codeword lengths exists and, if it exists,
construct such a code.
a) (1, 2, 3, 1, 2, 3, 2),
b) (1, 2, 3, 4, 1, 2, 3, 4),
c) (3, 3, 3, 3, 3, 3, 2, 1, 1).

B.3.8. A Rooted Tree with Probabilities


Consider the following rooted tree, where the leaves are indexed by Arabic
numbers and the nodes are indexed by Roman numerals.

[Figure: a rooted tree with root I, intermediate nodes II–V, and leaves 1–8.]


Suppose that the following probabilities are assigned to the leaves: p1 = 1/4,
p2 = p3 = 1/8, p4 = p5 = 1/6, p6 = p7 = p8 = 1/18.

a) Directly compute the average path length E[W ] from the root to the
leaves, and directly compute the leaf entropy Hleaf as defined by Equa-
tion (3.8) in the lecture notes.
b) Find the Probabilities PI , PII , PIII , PIV and PV of the nodes in the
rooted tree with probabilities.
c) Compute E[W ] by the Path Length Lemma.
d) Find Hi , the branching entropy, for i = I, II, III, IV and V. Compute
Hleaf as suggested by Lemma 3.2 in the lecture notes.

B.3.9. Prefix-Free Code for a Quantized Speech Source


When a certain four-level quantizer is used as part of a linear predictive
coding system for speech, each of the two outer quantization levels has
probability 0.1 and each of the two inner quantization levels has probability
0.4. Let the random variable U denote the quantizer output.

a) Find the uncertainty, H(U ), of the quantizer output.


b) Find (by “trial-and-error”) an optimum binary prefix-free code for the
quantizer output.
Remark: You should find that this simple code would result in a saving
of 10% in the number of transmitted digits compared to “uncoded”
transmission of the quantizer output as a pair of binary digits. (This
simple coding has been used in practice to improve some speech en-
coding systems.)
c) Why has no one ever used the Shannon-Fano code for this random
variable U ?

B.3.10. A Shannon-Fano Prefix-Free Code


Consider the random variable U whose probability distribution is:

u u1 u2 u3 u4 u5 u6 u7
PU (u) 0.3 0.2 0.1 0.1 0.1 0.1 0.1

a) Construct a binary Shannon-Fano prefix-free code for U .


b) Calculate E[W ] for the code found in a) and verify that it satisfies

H(U ) ≤ E[W ] < H(U ) + 1

where H(U ) is the entropy in bits.

The following theorem was alluded to in the “remark” at the end of Problem
B.3.5.


McMillan’s Theorem: There exists a uniquely decodable D-ary code with


codeword lengths w1 , w2 , . . . wn (all positive integers of course) if and only
if these lengths satisfy Kraft’s inequality, i.e., if and only if
    Σ_{i=1}^{n} D^{−wi} ≤ 1.

This theorem was given by McMillan in 1956 [B. McMillan, “Two In-
equalities Implied by Unique Decipherability”, IRE Trans. Inform. Theory,
vol. IT-2, pp. 115-116, 1956]. An elegant proof of this theorem was provided
by Karush five years later [J. Karush, “A Simple Proof of an Inequality of
McMillan,” IRE Trans. Inform. Theory, vol. IT-7, pp. 118, 1961]. In this
problem, you will develop Karush’s proof for yourself.
Consider any D-ary code with codeword lengths w1 , w2 , . . . wn (all positive
integers of course) and let wmax be the largest codeword length. For an
arbitrary positive integer L, consider the following quantity:

    [ Σ_{i=1}^{n} D^{−wi} ]^L = Σ_{i1=1}^{n} Σ_{i2=1}^{n} · · · Σ_{iL=1}^{n} D^{−(w_{i1} + w_{i2} + ··· + w_{iL})} = Σ_{j=1}^{L·wmax} A_j D^{−j}.

Note that Aj is the number of choices of i1 , i2 , . . . iL such that wi1 +wi2 +· · ·+


wiL = j, i.e., the number of different ways to choose a sequence of exactly
L codewords so that this sequence of codewords forms a D-ary string of
length exactly j.
a) What inequality must Aj satisfy if the code is uniquely decodable?
(Remember that a code is uniquely decodable if and only if every
finite D-ary sequence can be written in at most one way as a sequence
of codewords.)
b) Substitute your upper bound on Aj on the right in the above equation
to get an upper bound on the left side, then let L go to ∞ to complete
your proof of McMillan’s theorem.

B.3.11. The Upper Bound of the Coding Theorem


Show that the upper bound E[W ] stated in the Coding Theorem in the
lecture notes is the best general bound by demonstrating that for any ε > 0
there exists a random variable U such that E[W ] for the optimal code is less
than the upper bound by at most ε, i.e.,

    H(U)/log(D) + 1 − ε ≤ E[W] < H(U)/log(D) + 1.

Hint: It suffices to consider a U with only two possible values.

B.3.12. Comma Codes


A comma code is a code (with no duplicate codewords, of course), such that
every codeword ends with the same digit and this digit appears nowhere


else in the codeword. This special digit is called the comma. For instance,
the codewords [1], [0, 1] and [0, 0, 1] form a binary comma code and the
digit 1 is the comma. Every comma code is also a prefix-free code. It is
very easy to find an optimum D-ary comma code for any random variable
U . One simply assigns the length 1 codeword (the comma alone) to the
most likely value of U , length 2 codewords to the D − 1 next most likely
values of U , length 3 codewords to the (D − 1)2 next most likely values of
U , etc. This follows from the fact that there are exactly (D − 1)i−1 possible
codewords of length i for a given comma, regardless of which codewords of
shorter length are in the comma code. Sometimes a D-ary comma code is
called a “(D − 1)-ary code” (for instance, a “unary code” when D = 2), but
this is a misleading terminology as the comma itself is active in supplying
information.
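
As an illustration of the construction just described, the following Python sketch builds an optimum D-ary comma code from a list of symbol probabilities; the function name and the example probabilities (those of the quantizer in Problem B.3.9) are arbitrary choices.

    from itertools import product

    def comma_code(probs, D=2, comma="1"):
        """Optimum D-ary comma code as described above: the comma alone goes to the most
        likely value, the D-1 words of length 2 to the next most likely values, and so on."""
        non_comma = [str(d) for d in range(D) if str(d) != comma]
        order = sorted(range(len(probs)), key=lambda k: -probs[k])   # most likely first
        words, length = [], 1
        while len(words) < len(probs):
            # all (D-1)^(length-1) prefixes of non-comma digits, each terminated by the comma
            for prefix in product(non_comma, repeat=length - 1):
                words.append("".join(prefix) + comma)
                if len(words) == len(probs):
                    break
            length += 1
        code = [None] * len(probs)
        for w, k in zip(words, order):
            code[k] = w
        return code

    # Example: the quantized speech source of Problem B.3.9 (probabilities 0.4, 0.4, 0.1, 0.1).
    probs = [0.4, 0.4, 0.1, 0.1]
    code = comma_code(probs)
    print(code)
    print("E[W] =", sum(p * len(w) for p, w in zip(probs, code)))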

a) Find optimum binary (D = 2) comma codes for the block length L


message sets in Problem B.3.17 with L = 1, 2 and 3. Compare the
efficiency LH(U )/(E[W ] log D) of the optimum comma code to that
of the Huffman code used in Problem B.3.17. (Comma codes often
perform better than one might have expected).
b) Engineers like the ease of synchronization provided by comma codes.
What synchronization advantages do comma codes offer compared to
general prefix-free codes?

B.3.13. The Efficiency of Huffman and Shannon-Fano Code

a) Construct a binary Huffman code for the random variable U of Prob-


lem B.3.10, and compare its E[W ] to that of the Shannon-Fano code.
b) Construct a ternary Huffman code for this same U and verify that its
average codeword length E[W ] satisfies

H(U )/ log 3 ≤ E[W ] < H(U )/ log 3 + 1.

c) The efficiency, η, of the D-ary prefix-free code for a random variable


U is defined as the ratio of H(U )/ log D to E[W ], i.e.,

H(U )
η= .
E[W ] log D

Compare the efficiencies of the binary Shannon-Fano and Huffman


codes for U .
d) Compare the efficiencies of the binary and ternary Huffman codes for
U.

B.3.14. The Optimal Prefix-Free Code is not Necessarily Unique


Show that there can exist several fundamentally different optimal prefix-
free codes for the same random variable, where fundamentally different
means that their ordered lists of codeword lengths are different. How many


fundamentally different optimal binary prefix-free codes exist for a random


variable with symbol probabilities 0.3, 0.2, 0.2, 0.1, 0.1, 0.05 and 0.05?

B.3.15. Huffman Codes for German and English Text


The attached table (kindly provided by Dr. Wah of the IKT) gives the
probability distribution for a single letter U chosen from written text as well
as the uncertainty H(U ) for both the English and the German language.

a) Find a binary Huffman code for a single German letter U , and compute
the efficiency of the code.
b) If you enjoyed doing a), repeat a) for a single English letter U .

B.3.16. The Most Probable Message of a Huffman Code


Prove that for any binary Huffman code, if the most probable message
symbol has probability p1 > 2/5 and it is the only message symbol with
this probability, then that symbol must be assigned a codeword of length 1.
Hint: Sometimes the simplest things are not so easy to prove.

B.3.17. The Efficiency of Block-to-Variable Length Coding


Consider the binary memoryless source (BMS) with PU (0) = 0.9 and PU (1) =
0.1.

a) Find an optimum binary prefix-free code for a single source letter, and
compute its efficiency
H(U )
η= .
E[W ] log D
b) Find an optimum binary prefix-free code for L = 2 source letters, and
compute its efficiency
LH(U )
η= .
E[W ] log D
c) Repeat b) for L = 3.

B.3.18. The Efficiency of Variable-Length-to-Block Coding


Consider the BMS with PU (0) = 0.9 and PU (1) = 0.1.

a) Find an optimum binary block encoding scheme with blocklength N =


1 for a proper message set for this BMS, and compute its efficiency

H(U )E[Y ]
η= ,
N log D

where E[Y ] is the average message length.


b) Repeat a) for N = 2.
c) Repeat a) for N = 3.


German English
i Letter pi Letter pi
1 _ 0.15149 _ 0.1859
2 E 0.14700 E 0.1031
3 N 0.08835 T 0.0796
4 R 0.06858 A 0.0642
5 I 0.06378 O 0.0632
6 S 0.05388 I 0.0575
7 T 0.04731 N 0.0574
8 D 0.04385 S 0.0514
9 H 0.04355 R 0.0484
10 A 0.04331 H 0.0467
11 U 0.03188 L 0.0321
12 L 0.02931 D 0.0317
13 C 0.02673 U 0.0228
14 G 0.02667 C 0.0218
15 M 0.02134 F 0.0208
16 O 0.01772 M 0.0198
17 B 0.01597 W 0.0175
18 Z 0.01423 Y 0.0164
19 W 0.01420 G 0.0152
20 F 0.01360 P 0.0152
21 K 0.00956 B 0.0127
22 V 0.00735 V 0.0083
23 Ü 0.00580 K 0.0049
24 P 0.00499 X 0.0013
25 Ä 0.00491 J 0.0008
26 Ö 0.00255 Q 0.0008
27 J 0.00165 Z 0.0005
28 Y 0.00017
29 Q 0.00014
30 X 0.00013
Entropy 4.113 [bits/letter] 4.08 [bits/letter]

Table B.1.: The Probability Distribution for a Single Letter of German and
English Text.


Note: Since the number of codewords in parts a), b) and c) of this problem
equals that in parts a), b) and c), respectively, of Problem B.3.17, it is fair
to compare the efficiencies of these block-to-variable-length and variable-
length-to-block coding schemes. It is generally true that the former will be
somewhat more efficient for the same number of codewords.

B.3.19. The Efficiency of Variable-Length-to-Variable-Length Coding
Consider the BMS with PU (0) = 0.9 and PU (1) = 0.1.

a) Find an optimum binary prefix-free code for the Tunstall message


set with M = 2 messages for this BMS, and compute its efficiency
η = H(U )E[Y ]/(E[W ] log D) where E[Y ] is the average message length
and E[W ] is the average codeword length.
b) Repeat a) for M = 4 messages.
c) Repeat a) for M = 8 messages.

Hint: Use the results of Problem B.3.18.


Note: Again, since the number of codewords is the same, it is fair to compare
the efficiencies in each part of this problem to those in the corresponding
parts of Problems B.3.17 and B.3.18.

B.3.20. Run-Length Coding


“Run-length coding” is a widely used variable-length-to-block coding
scheme for binary information sources. Let 0^n 1 denote a sequence of n zeros
followed by a one, i.e., a “run of zeros” of length n. For run-length coding
with blocklength N, the message v = 0^n 1 is encoded into [b1 b2 . . . bN ],
the binary representation of the integer n for 0 ≤ n < 2^N − 1, while the
message 0^(2^N − 1) is encoded as [11 . . . 1]. Notice that the message set
is proper.

Consider the BMS with PU (0) = 0.9 and PU (1) = 0.1.

a) Find the smallest N such that the run-length message set is not a
Tunstall message set.
b) As a function of N , find N/E[Y ] for the run-length coding scheme,
where E[Y ] is the average message length.

For optimum block encoding with blocklength N of a proper message set
for a DMS, the ratio N/E[Y ] generally decreases as N increases and always
approaches H(U )/ log D so that the efficiency η = H(U )E[Y ]/(N log D)
always approaches unity. This is not the case for run-length coding where
there will be some particular N that maximizes the efficiency, but this
maximum efficiency will generally be less than 1.

c) Find the N which maximizes the efficiency of run-length coding for
this BMS and compute this maximum efficiency.


d) Run-length coding generally has a maximum efficiency strictly less
than unity. Why do you think it is so popular in practice?
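
For a numerical check of parts b) and c), the averages can be scanned over N directly from the message-set definition above. The Python sketch below assumes a binary code alphabet (log D = 1 bit) and the BMS with PU (0) = 0.9.

from math import log2

p0, p1 = 0.9, 0.1
H_U = -(p0 * log2(p0) + p1 * log2(p1))   # entropy of the BMS in bits

def run_length_stats(N):
    """E[Y] and efficiency of run-length coding with blocklength N."""
    K = 2 ** N - 1
    # messages 0^n 1 for n = 0, ..., K-1, plus the all-zero message 0^K
    EY = sum((n + 1) * p0 ** n * p1 for n in range(K)) + K * p0 ** K
    eta = H_U * EY / N        # log D = 1 bit since D = 2
    return EY, eta

for N in range(1, 9):
    EY, eta = run_length_stats(N)
    print(N, round(EY, 4), round(eta, 4))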

B.3.21. Quasi-Proper Message Sets

Consider the following binary message set:

[Figure: binary code tree of the message set, with message nodes labelled
V1 , V2 , V3 and V4 and branches labelled 0 and 1.]

Although this message set is “improper”, it could in fact be used in practice
if we adopted the convention that, when more than one choice is possible, we
always choose the longest message. Thus, if the source output sequence was
0001001 . . ., the resulting sequence of messages would be v4 , v1 , v1 , v2 , . . ..
Notice that we must sometimes examine digits beyond the end of a message
before we choose that message. Such an improper message set could be
called quasi-proper. (A quasi-proper message set is a message set that is
not prefix-free but has the property that every path from the root to a leaf in
its K-ary tree contains at least one message node.) Rather surprisingly (as
this problem will illustrate), D-ary blocklength N coding of a quasi-proper
message set for a DMS is sometimes better than the optimum such coding
of a proper message set.

Consider using the above quasi-proper message set for blocklength N = 2
binary coding of the BMS with PU (0) = 0.9 and PU (1) = 0.1. (Recall that
in Problem B.3.18b) we found that the optimum proper message set in this
case gave an efficiency of η = 0.635.) Notice that the source sequence 01 will
always be identified as the message sequence v1 , v2 , that the source sequence
001 will always be identified as the message sequence v1 , v1 , v2 , etc.

a) Show that the above quasi-proper message set is equivalent to a proper
message set with M = 5 messages in which each message in the latter
message set is a sequence of one or more messages from the former
message set. Draw the binary tree for this equivalent proper message
set and find its average message length E[Y ].
Notice that, because we are using blocklength N = 2 coding of the
quasi-proper message set, the message “v1 , v2 ” in the equivalent proper
message set will have a codeword length 2N = 4.
b) Find the average codeword length E[W ] for the five codewords that
are assigned to the messages in the equivalent proper message set.


c) Finally, compute the efficiency

η = H(U ) E[Y ]/(E[W ] log D)

of the blocklength N = 2 binary coding of the given quasi-proper
message set and verify that it exceeds that for the optimum proper
message set.

Note: Very little is known about how to construct optimum quasi-proper
message sets for blocklength N coding of a DMS. Moreover, one can con-
struct quasi-proper message sets such that there is no “equivalent” proper
message set of the kind found in a) above; in this case, one needs to use
Markov process techniques to evaluate the system efficiency.

B.3.22. Estimating the Entropy of Written German

The letters in written text certainly cannot be treated as the output of
a memoryless source because the probability distributions for successive
letters are greatly affected by the values of the previous letters. For instance,
in English the probability is virtually one that the next letter will be a “u”
given that the previous letter is a “q”. The true information rate then is
not H(U ), the uncertainty of a single letter, but H∞ (U ), which is defined as

H∞ (U ) = lim_{n→∞} H(Un |U1 U2 . . . Un−1 ).

Shannon, in his paper “Prediction and entropy of printed English” (Bell
System Techn. Journ., Vol. 30, pp. 50-64, January 1951), suggested the
following intriguing way to estimate H∞ (U ). Suppose that Peter and Paul
are identical twins (whose ways of thinking coincide). Consider transforming
the sequence U1 , U2 , U3 , . . . into the sequence Ũ1 , Ũ2 , Ũ3 , . . . in the following
way. Suppose Peter knows U1 , U2 , . . . , Un−1 and tries to guess Un . Peter’s
friend, James, tells him if the guess is correct and otherwise tells him to
guess again. If Peter’s fifth guess is correct, James sets Ũn = 5 and goes
on to the next letter, etc. The original text can be recovered from the
transformed sequence as indicated in the following diagram:


Here, Paul already knows U1 , U2 , . . . , Un−1 . Paul’s friend, John, knows Ũn
and asks Paul to guess Un . If this guess is correct (which John would know
because then Ũn = 1) he tells Paul that it is correct and otherwise tells him
to guess again, etc.
For large n, Ũn will be virtually statistically independent of the previous
letters so that H(Ũn ) is a good estimate of H∞ (U ), the true information
rate of the text.
a) Use Shannon’s method to estimate H∞ (U ) for German text as follows.
In order to generate the sequence Ũ1 , Ũ2 , Ũ3 , . . . in the above diagram,
only the services of James and Peter are required. Thus, you should
work in pairs on this problem, letting one member play the role of
James and the other play the role of Peter. James should first act as
the information source by secretly selecting a typical German sentence
of about 40 letters (including the letter * to denote a space but ex-
cluding punctuation symbols) to serve as the sequence U1 , U2 , U3 , . . ..
James should then tell Peter the first 10 letters of the sequence. Peter
should then guess U11 until he is successful, which determines Ũ11 .
Continue in this way to get Ũ12 , Ũ13 , . . . , Ũ40 .
Now use your 30 samples of Ũ to estimate the probabilities of the four
most likely values of Ũ . Then estimate H(Ũ ) by assuming that the
other 26 values of Ũ are equally likely (which will make your estimate
slightly too large).
b) Find a binary Huffman code for Ũ (assuming that the probabili-
ties used in estimating H(Ũ ) are correct) and compute its efficiency.
Also, compare E[W ] for this code with E[W ] for the code in Prob-
lem B.3.15(a).
Note: To make your results more accurate, you can pool your statistics in
part a) with those from several other teams before estimating the probability
distribution for Ũ .

B.3.23. Elias’s B1 (.) and B2 (.) Codes


a) Make a table of j, B1 (j) and L1 (j) for j = 1, 2, . . . , 10 where B1 is
Elias’ first prefix-free coding scheme for the positive integers and L1 (j)
is the length of B1 (j).
b) Make a table of j, B2 (j) and L2 (j) for j = 1, 2, . . . , 10 where B2 is
Elias’ second prefix-free coding scheme for the positive integers and
L2 (j) is the length of B2 (j).
c) Find B1−1 (0001101) and B1−1 (000011101).
d) Find B2−1 (00111000001) and B2−1 (000100100001001).
e) Elias’ idea for constructing prefix-free codes for the positive integers
could be iterated one more time, resulting in a code B3 (.). Which
is the smallest n such that B3 (.) is better than B2 (.), i.e., such that
L3 (n) < L2 (n)?
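
The following Python sketch implements one common reading of Elias’ two schemes: B1 (j) is the binary representation of j preceded by ⌊log2 j⌋ zeros, and B2 (j) is B1 of the length of j followed by j’s binary digits without the leading 1. This reading is consistent with the decoding examples in parts c) and d), but check it against the definitions used in the lecture.

def B1(j):
    """Binary representation of j preceded by floor(log2 j) zeros."""
    b = bin(j)[2:]
    return "0" * (len(b) - 1) + b

def B2(j):
    """B1 of the length of j, followed by j's binary digits without the leading 1."""
    b = bin(j)[2:]
    return B1(len(b)) + b[1:]

def B1_decode(s):
    k = s.index("1")               # number of leading zeros
    return int(s[k:2 * k + 1], 2)

def B2_decode(s):
    k = s.index("1")
    length = int(s[k:2 * k + 1], 2)
    return int("1" + s[2 * k + 1:2 * k + length], 2)

for j in range(1, 11):
    print(j, B1(j), len(B1(j)), B2(j), len(B2(j)))
print(B1_decode("0001101"), B2_decode("00111000001"))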

B.3.24. The Elias-Willems Coding Scheme


Because we are forced to use small message sets in hand-solved problems, we
cannot expect to see clearly in such problems the effectiveness of the Elias-
Willems universal coding scheme. But we can at least get a good idea of how
the scheme works. Do this by encoding all or part of this problem statement
with a slightly modified Elias-Willems scheme. Take as your message set
the twenty-seven letter alphabet consisting of the usual twenty-six letter
English alphabet together with the space symbol. Ignore punctuation signs.
First map this problem statement into the positive integer sequence of times
to last sighting of the symbol. Then compute the total length of the binary
encoded sequence. Use the code B2 (.) to code the integers from one to seven
but code any integer eight or greater by seven digits, two zeros followed by
the five bit binary code for that integer. (Is this coding prefix-free?) Thus,
you need to find the actual integer only when it is seven or less. This
will greatly speed up your hand encoding. Finally, calculate the number
of binary digits per character that your coding has used. Would you have
been better off just to code directly all symbols into their five bit codes?
It would not have been better to code the time-to-last-sighting using Elias’
first integer coding scheme B1 (.) rather than his second B2 (.), even though,
for integers less than thirty-one, the codewords of the former code are never
longer than those of the second. Why wouldn’t
it be better to have used B1 (.) rather than B2 (.) to code the integers from
one to seven?
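
A sketch of the first encoding step (the time-to-last-sighting transform) in Python; the handling of symbols that have not appeared yet is an assumed convention and should be replaced by the one used in the lecture.

def time_to_last_sighting(text, alphabet="ABCDEFGHIJKLMNOPQRSTUVWXYZ "):
    """Map each symbol to the number of symbols since its previous occurrence.

    Symbols that have not yet occurred are handled by pretending the alphabet
    was written out once, in order, just before the text (one possible
    convention; check the one used in the lecture)."""
    last = {c: -(len(alphabet) - i) for i, c in enumerate(alphabet)}
    deltas = []
    for n, c in enumerate(text):
        deltas.append(n - last[c])
        last[c] = n
    return deltas

text = "BECAUSE WE ARE FORCED TO USE SMALL MESSAGE SETS"
print(time_to_last_sighting(text))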

B.4. Coding of Stationary Sources

B.4.1. Two-State Markov Chain


Consider a source which acts as a finite state machine. The source produces
a sequence of zeros and ones as shown in the figure below. Suppose that


[State diagram: from state 0, the source stays in state 0 with probability 2/3
and outputs 0, or moves to state 1 with probability 1/3 and outputs 1; from
state 1, it stays in state 1 with probability 3/4 and outputs 1, or moves to
state 0 with probability 1/4 and outputs 0.]

we are in steady state, where the effect of the initial state is absent and the
source is stationary.
(a) What is the stationary distribution of the source?
(b) What is the memory of the source?
(c) Let the source output be denoted by X1 , X2 , X3 , . . . (assume the source
is in steady state). Calculate H(X1 ), H(X1 X2 ), and H(X1 X2 X3 ).
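
A numerical check for parts (a)–(c) in Python, assuming the transition probabilities as read from the figure above and treating the output sequence as the state sequence (each transition outputs the label of the state it enters); verify these assumptions against the diagram.

from itertools import product
from math import log2

# Transition probabilities as read from the state diagram (treat as an assumption):
# from state 0: stay with prob 2/3 (output 0), move to 1 with prob 1/3 (output 1);
# from state 1: stay with prob 3/4 (output 1), move to 0 with prob 1/4 (output 0).
P = {0: {0: 2/3, 1: 1/3}, 1: {0: 1/4, 1: 3/4}}

# Stationary distribution of a two-state chain: pi0 * P[0][1] = pi1 * P[1][0].
pi1 = P[0][1] / (P[0][1] + P[1][0])
pi = {0: 1 - pi1, 1: pi1}

def block_entropy(k):
    """H(X1 ... Xk) in bits, with the chain started in steady state."""
    H = 0.0
    for seq in product((0, 1), repeat=k):
        p = pi[seq[0]]
        for a, b in zip(seq, seq[1:]):
            p *= P[a][b]
        H -= p * log2(p)
    return H

print(pi)
print([block_entropy(k) for k in (1, 2, 3)])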

B.4.2. Constrained Sequences


In (d, k)-constrained sequences, the 1s are separated by at least d and at
most k 0s. The following source generates (0, 1)-constrained sequences.
[State diagram with states 1 and 0: from state 1, the source outputs 0 with
probability p and moves to state 0, or outputs 1 with probability 1 − p and
stays in state 1; from state 0, it outputs 1 with probability 1 and moves to
state 1.]

(a) What is the stationary distribution of the source?
(b) Calculate the entropy rate of the source and determine the maximum
entropy rate as a function of p.
(c) Let N (B) be the number of sequences of length B produced by the
source. Find N (B) and calculate

C = lim_{B→∞} (1/B) log N (B). (B.48)
Hint: Find a linear recurrence that expresses N (B) in terms of N (B −
1) and N (B − 2). Why is C an upper bound on the entropy rate of
the Markov chain? Compare C with the maximum entropy found in
part (b).
Remark: C is called the capacity of the (0, 1)-constraint. In magnetic record-
ing, data is stored in the form of (d, k)-constrained sequences.
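
For part (c), the count N (B) can be generated directly from the hinted recurrence. The Python sketch below assumes N (B) counts the binary sequences of length B with no two consecutive zeros; adjust the initial values if the source's starting state restricts the first symbol.

from math import log2

def N(B):
    """Number of length-B binary sequences with no two consecutive zeros."""
    a, b = 2, 3           # N(1) = 2, N(2) = 3 (assumed initial values)
    if B == 1:
        return a
    for _ in range(B - 2):
        a, b = b, a + b   # N(B) = N(B-1) + N(B-2)
    return b

for B in (1, 2, 5, 10, 20, 50):
    print(B, N(B), log2(N(B)) / B)   # the ratio approaches the capacity C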

B.4.3. Elias-Willems Universal Source Coding 1


Consider the 3-ary source with alphabet {a, b, c} and

X^32 = a a b c c a c b a a a c a b b c a a b c c b c b a b a a c b a b. (B.49)

(a) Encode the given sequence for B = 1 and B = 2 by using the Elias-
Willems universal source encoder without a recency rank calculator.


(b) Encode the given sequence for B = 1 and B = 2 by using the Elias-
Willems universal source encoder with a recency rank calculator.
(c) Find a numerical upper bound on E[L]/B as B gets large. Compare
with the compression rate achieved in (a) and (b).
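
A recency rank calculator is essentially a move-to-front list. The sketch below is a Python illustration; the rank assigned to a symbol seen for the first time is an assumed convention and should be matched to the one used in the lecture.

def recency_ranks(symbols):
    """Replace each symbol by its recency rank (1 = most recently seen)
    and move it to the front of the list (move-to-front)."""
    order = []                      # most recent first
    ranks = []
    for s in symbols:
        if s in order:
            r = order.index(s) + 1
            order.remove(s)
        else:
            r = len(order) + 1      # first occurrence: rank behind all seen symbols
        order.insert(0, s)
        ranks.append(r)
    return ranks

x = "aabccacbaaacabbcaabccbcbabaacbab"   # the sequence of (B.49)
print(recency_ranks(list(x)))            # B = 1
pairs = [x[i:i + 2] for i in range(0, len(x), 2)]
print(recency_ranks(pairs))              # B = 2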

B.4.4. Elias-Willems Universal Source Coding 2

Suppose a source puts out the sequence 01010101 . . .. What does the Elias-
Willems universal source encoder (with a recency rank calculator) put out
if B = 1, 2, 3, 4, ...? What happens when B is odd? What happens when B
is even? What happens as B gets large?

B.4.5. Markov Sources

A discrete stationary source (DSS) with finite memory µ, i.e., such that

P (un |u1 , u2 , . . . , un−1 ) = P (un |un−µ , . . . , un−1 ) for all n > µ,

is often called a Markovian DSS or simply a Markov source. The state of
the source at time n is the µ-tuple

Sn = [un−µ , . . . , un−1 ].

In particular, Sµ+1 = [u1 , . . . , uµ ]. The probability of the state transiting
from state Sn = [a1 , a2 , . . . , aµ ] at any time n to state Sn+1 = [a2 , a3 , . . . , aµ+1 ]
at time n + 1 is just

PUµ+1 |U1 ,...,Uµ (aµ+1 |a1 , . . . , aµ ).

Note that the stationary probability distribution for the states is just the
probability distribution P (u1 , . . . , uµ ). From

P (u1 , . . . , uµ ),

it is easy to find P (u1 , . . . , uL ) for any L ≤ µ just by “summing out” the
unwanted random variables. Using P (u1 , . . . , uµ ) and the transition proba-
bilities, it is then easy to find P (u1 , . . . , uL ) for any L > µ as follows:

P (u1 , . . . , uL )
= P (u1 , . . . , uµ )P (uµ+1 |u1 , . . . , uµ ) . . . P (uL |uL−µ , . . . , uL−1 ).

Note that P (u1 , . . . , uL ) is the probability distribution for the messages
formed by an L-block parser for the Markov source.

a) Consider the binary Markov source with memory µ = 1 and output
alphabet {a, b} having the following state-transition diagram:


On each branch, the transition probability is indicated together with
(in parentheses) the output symbol emitted by the source when the
transition is made. Find the stationary probability distribution of the
states of this Markov source. Find also the probability distribution of
a block message set of length L for L = 1, 2 and 3.
b) Repeat a) for the binary Markov source with memory µ = 2 and
output alphabet {a, b} having the following state-transition diagram:

B.4.6. Entropy-Rate of a Markov Source


The entropy-rate of a DSS is the quantity H∞ (U ), which determines the
smallest number of code digits per source letter with which the source can
be encoded.

a) Find the entropy-rates of the Markov sources in Problems B.4.5a) and B.4.5b).
b) The Markov sources in Problems B.4.5a) and B.4.5b) are both ergodic
DSSs. Give an intuitive explanation of why this is true.
c) Find the entropy H(V ) = H(U1 U2 ) of the block message set of length
L = 2 for the Markov sources in Problems B.4.5a) and B.4.5b). Con-
struct the binary Huffman code for each of these message sets and
compare the resulting E[W ]/L to H(V ).
d) Consider adaptive Huffman coding of the block message set of length
L = 2 in the manner that one can use a different code for Vi depending
on the value of Vi−1 . Thus, H(V2 |V1 ) becomes the entropy of interest.
Compute this conditional entropy as well as the average of E[W ]/L
for the four Huffman codes when the source is the Markov source in
Problem B.4.5a).


e) Try to devise a good variable-length parser for the Markov source in
Problem B.4.5a) for use with binary block coding of length N = 2.
The goal is to maximize the average message length, but the messages
are not i.i.d. so it will take some ingenuity to calculate this average
message length.
f) For the Markov source in Problem B.4.5a) and for each message in
the block message set of length L = 2, find the average number of
messages from one occurrence of this message until the next occurrence
of this message. (For instance, if Vn = [a, a] and the next subsequent
message equal to [a, a] is Vn+4 , then there are 4 messages from this
former occurrence to this next occurrence of [a, a].)
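
For the entropy-rate computations in this problem, the formula H∞ (U ) = Σ_s π(s) H(P (·|s)) of a memory-1 Markov source can be evaluated mechanically once the transition probabilities are known. The Python sketch below uses a hypothetical transition matrix P_trans as a stand-in for the diagram of Problem B.4.5a); substitute the actual values from the figure.

from math import log2

# Hypothetical memory-1 transition probabilities over {a, b}; replace these
# numbers with the ones from the state-transition diagram of Problem B.4.5a).
P_trans = {"a": {"a": 0.7, "b": 0.3},
           "b": {"a": 0.4, "b": 0.6}}

# Stationary distribution of a two-state chain.
pi_a = P_trans["b"]["a"] / (P_trans["a"]["b"] + P_trans["b"]["a"])
pi = {"a": pi_a, "b": 1 - pi_a}

def H(dist):
    """Entropy in bits of a probability distribution given as a dict."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)

# Entropy rate of a memory-1 Markov source: H_inf = sum_s pi(s) H(P(.|s)).
H_inf = sum(pi[s] * H(P_trans[s]) for s in pi)
print(pi, H_inf)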

B.4.7. Block-to-Block (Lossy) Source Coding


The Block-to-Block (Lossy) Source Coding Theorem for a DMS states that,
given any positive numbers ε1 and ε2 , there exists a source coding scheme
as shown in Figure B.1 for which

N/L < H(U )/ log(D) + ε1

and
P (F ) < ε2 ,
where P (F ) is the probability that [Û1 , . . . , ÛL ] ≠ [U1 , . . . , UL ]. Defining

[Figure B.1.: A Source Coding Scheme. A K-ary DMS emits U1 , . . . , UL ; a
D-ary source encoder maps them into X1 , . . . , XN ; a reconstructor outputs
the K-ary estimates Û1 , . . . , ÛL .]

the symbol error probability PS by

PS = (1/L) ∑_{i=1}^{L} Pei ,

and noting that Pei = P (Ûi ≠ Ui ) ≤ P (F ), we see that

PS ≤ P (F ).
Hence, we see that the Lossy Source Coding Theorem implies

PS < ε2
so that PS can be made arbitrarily small.
In this problem, you will prove the converse to the Lossy Source Coding
Theorem by showing that PS cannot be made arbitrarily small when N/L


is smaller than H(U )/ log D. More precisely, you will show that if

N/L ≤ H(U )/ log(D),

then

h(PS ) + PS · log(K − 1) ≥ [ H(U )/ log(D) − N/L ] · log(D). (B.50)

a) Make use of the Data Processing Lemma and the fact that

I(U1 . . . UL ; X1 . . . XN ) ≤ H(X1 . . . XN )

to prove that, for the source coding scheme of the figure,

I(U1 . . . UL ; Û1 . . . ÛL ) ≤ N log(D).

b) Now show that

H(U1 . . . UL |Û1 . . . ÛL ) ≥ LH(U ) − N log(D).

c) Make use of the inequality (4.32), Fano’s Lemma, and the convexity-∩
of h(p) to show that
(1/L) H(U1 . . . UL |Û1 . . . ÛL ) ≤ h(PS ) + PS log(K − 1).
d) Now combine the results of b) and c) to prove Equation (B.50).
e) Find an explicit lower bound on PS when the DMS is a ternary source
(i.e., K = 3), the coding alphabet is binary (i.e., D = 2), and N/L ≤
H(U ) − 0.336.
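
For part e), the bound (B.50) has to be inverted numerically: with K = 3 and D = 2 (so log D = 1 bit) one needs the smallest PS for which h(PS ) + PS log(K − 1) reaches the right-hand side. A Python bisection sketch, with all entropies in bits:

from math import log2

def h2(x):
    """Binary entropy in bits."""
    return 0.0 if x in (0.0, 1.0) else -x * log2(x) - (1 - x) * log2(1 - x)

def ps_lower_bound(rhs, K=3):
    """Smallest x with h2(x) + x*log2(K-1) >= rhs, found by bisection.
    The left-hand side is increasing on [0, (K-1)/K], so bisection applies there."""
    lo, hi = 0.0, (K - 1) / K
    for _ in range(60):
        mid = (lo + hi) / 2
        if h2(mid) + mid * log2(K - 1) < rhs:
            lo = mid
        else:
            hi = mid
    return hi

# Part e): K = 3, D = 2, N/L <= H(U) - 0.336 gives a right-hand side of 0.336 bits.
print(ps_lower_bound(0.336))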

B.4.8. The Rate Distortion Function


Consider again the source coding scheme of Figure B.1 of Problem B.4.7.
One can consider the symbol error probability PS , as described in Equa-
tion (B.50) of Problem B.4.7, to be the “distortion” in the source recon-
struction. Suppose the coding alphabet is binary, i.e., D = 2. One can
define the rate R of this source coding scheme as

R = N/L binary digits per source letter.

The units of R are often, but somewhat imprecisely, called “bits per letter”.
a) Specializing Equation (B.50) of Problem B.4.7 to the case of a BSS so
that PS = Pb , show that

R ≥ 1 − h(Pb ) bits/letter.

The previous inequality states that, to achieve a distortion Pb , the rate R
of the coding scheme must be at least 1 − h(Pb ).


b) Sketch this lower bound on R as a function of the allowed distortion
Pb for 0 ≤ Pb ≤ 1/2. It can be shown that, by choosing L and N
sufficiently large, this bound can be approached as closely as desired.
The function R = 1 − h(Pb ) for 0 ≤ Pb ≤ 1/2 is called the rate-
distortion-function of the BSS when the bit error probability is the
distortion measure.

Suppose now that the source in Figure B.1 of Problem B.4.7 is changed to the
general binary memoryless source (BMS) with p = PU (1). Without loss of
essential generality, assume that p ≤ 1/2.

c) Repeat a) and b) (with the appropriate changes) to show that

R ≥ h(p) − h(Pb ).

Sketch this bound for 0 ≤ Pb ≤ p for the BMS with p = 0.110.


Again, it can be shown that, by choosing L and N sufficiently large,
this bound can be approached as closely as desired. The function
R = h(p) − h(Pb ) is thus called the rate-distortion function of the
BMS when the bit error probability is the distortion measure.

B.5. Channel Coding

B.5.1. Block and Bit Error Probability


Prove the relations (5.20).
Hint: Observe that {Ŵ ≠ W } = ∪_{i=1}^{b} {V̂i ≠ Vi }. Now use the union bound
to upper bound Pe . Finally, try to upper bound 1 − Pe .

B.5.2. BSC with Output Cost

a) Compute the capacity-cost function C(S) for the binary symmetric
channel (BSC) with alphabets X = Y = {0, 1}, crossover probability
Pr [Y ≠ X] = p, and the cost function s(x, y) = 2y.
b) Plot C(S) as a function of S for p = 1/3. Is C(S) concave? Explain
what happens when S ∈ [0, 2/3].

B.5.3. Z-Channel with Sum Cost

a) Compute the capacity-cost function C(S) for the binary “Z-channel”
with

PY |X (0|0) = 1, PY |X (1|0) = 0,
PY |X (0|1) = 1/2, PY |X (1|1) = 1/2 (B.51)

and the cost constraint E[X + Y ] ≤ S.


b) Plot C(S) as a function of S.

B.5.4. Parallel Channel Power Allocation


Prove step (a) in (5.100).

B.5.5. Gallager Exponent


Prove the three properties of EG (R, PX ) given after (5.44).
Hint: Use the result:
" #
d d g(x) ln f (x) g(x) df (x) dg(x)
f (x)g(x) = e = + ln f (x) f (x)g(x) .
dx dx f (x) dx dx

B.5.6. Strong Converse


Consider a code book xn as in (5.28). The ML decoder chooses as its
estimate ŵ = i one of the messages that maximizes PY n |X n (y n |xn (i)). The
ML decoder thus partitions the set of channel output sequences y n into M
disjoint decoding regions.
a) Show that the error probability of an ML decoder for any code with
M equally-likely codewords is given by
Pr [E] = 1 − ∑_{y^n} (1/M ) max_i PY n |X n (y n |xn (i)). (B.52)

b) Now use a = {a^s }^{1/s} to show that for any s > 0 we have

Pr [E] ≥ 1 − ∑_{y^n} { (1/M ) ∑_{i=1}^{M} PY n |X n (y n |xn (i))^s }^{1/s} . (B.53)

c) Argue that Pr [E] does not change by permuting the indexes i =
1, 2, . . . , M . Take the expectation of Pr [E] over all permutations, each
with probability 1/M !, to show that for s ≥ 1 we have

Pr [E] ≥ 1 − ∑_{y^n} { (1/M ) ∑_{i=1}^{M} E[ PY n |X n (y n |X n (i))^s ] }^{1/s}

       ≥ 1 − M^{(1−s)/s} max_{PX n} ∑_{y^n} { ∑_{x^n} PX n (xn ) PY n |X n (y n |xn )^s }^{1/s} . (B.54)

d) Since the channel is memoryless, show that (B.54) reduces to

Pr [E] ≥ 1 − M^{(1−s)/s} [ max_{PX} ∑_y { ∑_x PX (x) P (y|x)^s }^{1/s} ]^n
       (a)
        = 1 − 2^{−n[−ρR + min_{PX} E0 (ρ,PX )]} (B.55)


where (a) follows by choosing s = 1/(1 + ρ) for −1 ≤ ρ < 0 and
defining min_{PX} E0 (−1, PX ) = lim_{ρ→−1} min_{PX} E0 (ρ, PX ).
e) Define

E(R) = max_{−1≤ρ<0} [ −ρR + min_{PX} E0 (ρ, PX ) ] (B.56)

Show that if R > C then we have E(R) > 0. In other words, for large
n the block error probability approaches 1. This result is known as a
strong converse because it shows that the block error probability must
approach 1 as n grows.

B.5.7. Frame Error Rate

A frame is a block V = [V1 , V2 , . . . , Vb ] of b bits. We map each frame
into a block X n of n channel symbols. Suppose we send L blocks V i ,
i = 1, 2, . . . , L, so the total number of bits and channel symbols is B = L b
and N = L n, respectively. The rate is R = B/N = b/n and the frame
error rate (FER) is
PF = (1/L) ∑_{i=1}^{L} Pr[ V̂i ≠ Vi ]. (B.57)

a) Suppose we permit coding across frames, i.e., X N is a function of
W = V L . Use Fano’s inequality as in Sec. 5.4 to show that

N R ≤ I(W ; Ŵ ) / [ 1 − H2 (PF )/b − PF log2 (2^b − 1)/b ]. (B.58)

We recover (5.25) for b = 1, as should be expected. Show that for
large b we essentially have

R ≤ [ I(W ; Ŵ )/N ] / (1 − PF ) ≤ C(S)/(1 − PF ). (B.59)
b) Show that one can approach equality in (B.59) by discarding a fraction
PF of the frames and transmitting the remaining fraction 1 − PF of
the frames reliably. Is coding across frames required?
c) The above shows that one may transmit at rates above capacity with-
out having PF approach one. How does this result relate to the strong
converse of Problem B.5.6? Which result is relevant in practice?

B.5.8. Computing Entropy and Mutual Information

Recall that the BPSK capacity is given by (5.84). Suppose we wish to


numerically compute

h(Y ) = E[− log pY (Y )] (B.60)


where pY (·) is given in (5.85). By the Weak Law of Large Numbers (see
Appendix A.13), we can compute (B.60) by Monte Carlo integration by
simulating n independent variables Yi = Xi + Zi , i = 1, 2, . . . , n. Here
each Xi has probability distribution PX (√P ) = PX (−√P ) = 1/2, and
Zi is a Gaussian random variable with zero mean and variance N that is
independent of Xi . Consider the sum

Sn = −(1/n) ∑_{i=1}^{n} log pY (Yi ). (B.61)

a) Show that plim_{n→∞} Sn = h(Y ) if E[ (log pY (Y ))^2 ] is bounded.
b) Simulate Sn for n = 10^2 , 10^3 , 10^4 , and for N = 1 and P in the range of
Fig. 5.15. Compare the corresponding computed mutual information
expressions with the BPSK curve in Fig. 5.15.
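
A minimal Monte Carlo sketch for part b) in Python/NumPy. It assumes pY (·) is the Gaussian mixture for equally likely BPSK amplitudes ±√P with noise variance 1 (here called noise_var); check this against the form of (5.85) in the notes.

import numpy as np

rng = np.random.default_rng(0)

def h_Y_monte_carlo(P, noise_var=1.0, n=10**4):
    """Monte Carlo estimate of h(Y) in bits for Y = X + Z, X = +-sqrt(P) equiprobable,
    Z ~ N(0, noise_var). Uses S_n = -(1/n) sum log2 p_Y(Y_i)."""
    x = rng.choice([np.sqrt(P), -np.sqrt(P)], size=n)
    y = x + rng.normal(0.0, np.sqrt(noise_var), size=n)
    def gauss(y, m):
        return np.exp(-(y - m) ** 2 / (2 * noise_var)) / np.sqrt(2 * np.pi * noise_var)
    p_y = 0.5 * gauss(y, np.sqrt(P)) + 0.5 * gauss(y, -np.sqrt(P))
    return -np.mean(np.log2(p_y))

noise_var = 1.0
hZ = 0.5 * np.log2(2 * np.pi * np.e * noise_var)   # h(Z) in bits
for P in (0.25, 1.0, 4.0):
    hY = h_Y_monte_carlo(P, noise_var)
    print(P, hY, hY - hZ)    # I(X;Y) = h(Y) - h(Z) for this additive-noise channel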

B.5.9. Information Rates for a Rayleigh Fast Fading Channel

Consider the Rayleigh fast fading channel

Y = HX + Z (B.62)

where H and Z are independent Gaussian random variables with zero mean
and unit variance.
Suppose that X has the on-off keying (OOK) distribution
PX (0) = PX (√(2P )) = 1/2. We wish to compute the mutual information
" #
pY |X (Y |X)
I(X; Y ) = E log . (B.63)
pY (Y )

By the Weak Law of Large Numbers (see Exercise B.5.8) we can compute
I(X; Y ) by simulating independent random variables Yi = Hi Xi + Zi , i =
1, 2, . . . , n and computing
Sn = (1/n) ∑_{i=1}^{n} log ( pY |X (Yi |Xi ) / pY (Yi ) ). (B.64)

a) Give a closed-form expression for pY |X (y|x) and pY (y).
b) Simulate Sn for n = 10^4 for P satisfying 0 ≤ P ≤ 30 and compare the
curves with the BPSK curves in Fig. 5.15.
c) Repeat the experiment but with PX (0) = 0.9 and PX (√(10P )) = 0.1.
What happens for small P ?
Remark: sending a large amplitude signal with small probability is
called bursty or flash signaling. This can be useful for fading channels.
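
A Monte Carlo sketch for part b) in Python/NumPy. It uses the fact, which part a) asks you to derive, that given X = x the output Y is zero-mean Gaussian with variance x^2 + 1, and that pY is the corresponding two-component mixture; treat these densities as quantities to verify rather than as given.

import numpy as np

rng = np.random.default_rng(0)

def gauss(y, var):
    return np.exp(-y ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def ook_rate(P, n=10**4):
    """Monte Carlo estimate of I(X;Y) in bits for Y = H X + Z with OOK input."""
    x = rng.choice([0.0, np.sqrt(2 * P)], size=n)
    y = rng.normal(size=n) * x + rng.normal(size=n)     # H*X + Z
    # Given X = x, Y is zero-mean Gaussian with variance x^2 + 1 (to be shown in a)).
    p_y_given_x = gauss(y, x ** 2 + 1.0)
    p_y = 0.5 * gauss(y, 1.0) + 0.5 * gauss(y, 2 * P + 1.0)
    return np.mean(np.log2(p_y_given_x / p_y))

for P in (1.0, 4.0, 16.0):
    print(P, ook_rate(P))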

B.5.10. Information Rate Lower Bounds


Consider the vector channel

[ Y1 ; Y2 ] = [ 1 , 1/2 ; 0 , 1 ] ( [ X1 ; X2 ] + [ Z1 ; Z2 ] ),   i.e.,   Y = H (X + Z), (B.65)

where Z1 and Z2 are independent Gaussian random variables with zero mean
and unit variance. Suppose the power constraint is E[X12 + X22 ] ≤ 2P .

a) Show that the capacity of the channel is C = log(1 + P ).
Hint: observe that the matrix H is invertible.
b) Let X be a Gaussian vector with independent entries with mean zero
and variance P . Simplify the lower bound of Exercise B.1.25, namely
equation (B.19):
" #
qY |X (Y |X)
I(X; Y ) ≥ E log (B.66)
qY (Y )

if we choose the auxiliary channel to be the Gaussian channel

qY |X (y|x) = (2π)^{−1} (det Qq )^{−1/2} exp( −(1/2) (y − Hq x)^T Qq^{−1} (y − Hq x) ) (B.67)

where Qq is a positive semi-definite matrix and Hq is a 2 × 2 matrix.


Hint: show that

qY (y) = (2π)^{−1} (det Kq )^{−1/2} exp( −(1/2) y^T Kq^{−1} y ) (B.68)

where Kq = P Hq Hq^T + Qq .
c) Show that the lower bound (B.66) is tight if Hq = H and Qq = H H^T .
d) Suppose the receiver believes that Hq and Qq are the 2 × 2 identity
matrices. Show that the lower bound of (B.66) is

I(X; Y ) ≥ log(1 + P ) − P log e (B.69)

which is not very useful.


e) Suppose the receiver measures the interference plus noise powers as
E[ (Y1 − X1 )^2 ] = P/4 + 5/4,   E[ (Y2 − X2 )^2 ] = 1.

To improve performance, the receiver thus chooses Hq as the 2 × 2
identity matrix but

Qq = [ (P + 5)/4 , 0 ; 0 , 1 ].


Show that the lower bound of (B.66) is

I(X; Y ) ≥ log(1 + P ) − (1/2) log(1 + P/5). (B.70)
f) Suppose the receiver performs more measurements and realizes that
Hq = H. However, the receiver continues to believe that the noise is
uncorrelated. In other words, the receiver chooses
" #
5/4 0
Qq = .
0 1

Show that the lower bound of (B.66) is

I(X; Y ) ≥ (1/2) log[ (1 + P )^2 − P/5 ] − P^2 / ( 5(1 + P )^2 − P ). (B.71)

g) Plot the bounds of (B.69)-(B.71) and compare them with each other
and the capacity. What happens at small and large P ?

B.5.11. Information Rates for Channels with Memory


Consider a channel P (y n |xn ) with memory. Suppose we choose a memory-
less input P (xn ) = P n (xn ) and a memoryless auxiliary channel
qY n |X n (y n |xn ) = ∏_{i=1}^{n} qY |X (yi |xi ). (B.72)

a) Simplify the lower bound of Exercise B.1.25, namely equation (B.19):

I(X n ; Y n ) ≥ E[ log ( qY n |X n (Y n |X n ) / qY n (Y n ) ) ] (B.73)

and show that the information rate is bounded as

(1/n) I(X n ; Y n ) ≥ (1/n) ∑_{i=1}^{n} E[ log ( qY |X (Yi |Xi ) / qY (Yi ) ) ] (B.74)

where qY (b) = ∑_{a∈X } PX (a) qY |X (b|a).
b) Suppose the channel is stationary, by which we (loosely) mean that a
stationary input makes all statistics converge to a stationary distribu-
tion. Argue (loosely) that, by the Weak Law of Large Numbers (see
Exercise B.5.8), we can compute the lower bound (B.74) by simulat-
ing the channel output Y n with i.i.d. inputs X1 , X2 , . . ., Xn , and then
compute the average
Sn = (1/n) ∑_{i=1}^{n} log ( qY |X (Yi |Xi ) / qY (Yi ) ). (B.75)

Remark: stationarity is useful because it lets us perform one simula-
tion with large n rather than many simulations with large n.


B.5.12. The Capacity of the J-ary Symmetric Channel


The J-ary symmetric channel is the DMC with A = B = {0, 1, 2, . . . , J −1},
such that

P (y|x) = 1 − ε if y = x, and P (y|x) = ε/(J − 1) if y ≠ x,

where 0 ≤ ε ≤ 1.
a) Find the capacity C of this DMC and verify that it reduces to that of
the BSC when J = 2.
b) Find the explicit values of C when ε = 0, ε = (J − 1)/J and ε = 1, and give
an intuitive explanation of these values.

B.5.13. The Capacity of the Binary Symmetric Erasure Channel


The following DMC with 0 ≤ ε ≤ 1 and 0 ≤ δ ≤ 1 − ε is called the binary

[Transition diagram: inputs 0 and 1, outputs 0, ∆ and 1, with
P (0|0) = P (1|1) = 1 − ε − δ, P (1|0) = P (0|1) = ε, and P (∆|0) = P (∆|1) = δ.]

symmetric erasure channel (BSEC). Notice that it reduces to the BSC when
δ = 0 and to the BEC when ε = 0.
a) Show that the BSEC is indeed symmetric, and find its capacity C.
b) Verify that C reduces to the appropriate capacities when δ = 0 and
when ε = 0.

B.5.14. The Capacity of a Deep Space Channel


The following two-input, eight-output DMC arises when the additive white
Gaussian noise (AWGN) channel (the “deep-space channel”) is used with
binary antipodal signals and 8-level “soft-decision” demodulation at an
energy-to-noise ratio E/N0 of −3 dB. (I. M. Jacobs, “Sequential Decoding
for Efficient Communication from Deep Space”, IEEE Trans. Comm. Tech.,
Vol. COM-15, pp. 492-501, August 1967.) Here E is the energy of the re-
ceived signal in the absence of noise, and N0 is the noise power in a band-
width of 1 Hz.
a) Find the capacity C8 of this DMC.
b) If “hard-decision” demodulation had been used, then 0, 0′, 0′′ and
0′′′ would all be converted to 0 while 1, 1′, 1′′ and 1′′′ would all be
converted to 1. Find the capacity C2 of the resulting BSC and compute
the decibel loss in capacity (i.e., compute 10 log10 (C8 /C2 )).


c) There are three different sensible ways that the above channel could
be converted to the BSEC (one way is to convert 0′′′ and 1′′′ to ∆).
Find the capacity C3 for the way that gives the greatest capacity. Find
also the decibel loss compared to eight-level demodulation, and the
decibel gain over hard-decision demodulation.

Note: The kinds of dB gains and losses computed here are typical of the
AWGN channel when used at an efficient E/N0 value. Additionally, there
is about 0.2 dB loss by using eight-level, rather than infinitely-many-level,
demodulation.

B.5.15. Directed Information and Feedback Capacity


In the paper (J.L. Massey, “Causality, Feedback and Directed Information”,
pp. 303-305 in Proc. 1990 Int’l Symp. on Information Theory and Its Ap-
plications (ISITA ’90), Hawaii, Nov. 27-30, 1990), an attempt was made to
modify Shannon’s measure of information so as to quantify the amount of
information flowing from one sequence to another, i.e., to give directivity to
information. Whether the definition of directed information given in that
paper has general validity remains to be seen. In this problem, however,
you will show that it is at least useful in proving that the “capacity” of a
DMC is not increased by feedback.
The directed information I(X1 . . . XN → Y1 . . . YN ) from a sequence X1 . . . XN
to a sequence Y1 . . . YN is defined by
I(X1 . . . XN → Y1 . . . YN ) = ∑_{n=1}^{N} I(X1 . . . Xn ; Yn |Y1 . . . Yn−1 ).

a) Show that if X1 . . . XN and Y1 . . . YN are input and output sequences
of a DMC (regardless of whether the DMC is used with feedback or
without feedback), then

I(X1 . . . XN → Y1 . . . YN ) ≤ ∑_{n=1}^{N} I(Xn ; Yn )

with equality if and only if Y1 . . . YN are independent.


Hint: Note that the definition of a DMC implies

H(Yn |X1 . . . Xn Y1 . . . Yn−1 ) = H(Yn |Xn ).

We will say that the DMC (or any other channel) is embedded in a causal
system if, for the output U1 . . . UK of any source in the system, for all n ≥ 1,

P (yn |x1 , . . . , xn , y1 , . . . , yn−1 , u1 , . . . , uK ) = P (yn |x1 , . . . , xn , y1 , . . . , yn−1 )

i.e., the channel output at any time can depend on the source sequence only
through the effect of that source sequence on past and present inputs and
past outputs of the channel.

b) Show that if X1 . . . XN and Y1 . . . YN are input and output sequences
of a channel embedded in a causal system and if U1 . . . UK is a source
output sequence, then

I(U1 . . . UK ; Y1 . . . YN ) ≤ I(X1 . . . XN → Y1 . . . YN ).

Hint: Consider H(Y1 . . . YN |U1 . . . UK ) = ∑_{n=1}^{N} H(Yn |Y1 . . . Yn−1 U1 . . . UK ).
c) Now show that if K digits of a BSS are sent to a destination via N
uses of a DMC (possibly with feedback), then

I(U1 . . . UK ; Y1 . . . YN ) ≤ N C.

B.5.16. Cauchy’s Inequality


Cauchy’s Inequality (which is the discrete counterpart of Schwarz’s Inequal-
ity) states that, for any two sequences of real numbers (a1 , a2 , . . . , aN ) and
(b1 , b2 , . . . , bN ),
( ∑_{i=1}^{N} ai bi )^2 ≤ ( ∑_{i=1}^{N} ai^2 ) ( ∑_{i=1}^{N} bi^2 )

where, in the nontrivial case when neither sequence is all-zero, equality holds
if and only if there is a constant c such that ai = c · bi , i = 1, 2, . . . , N .

a) Prove the validity of Cauchy’s Inequality.
Hint: Begin with the fact that, for all c,

f (c) = ∑_{i=1}^{N} (ai − c · bi )^2 ≥ 0.

Next, find the value of c, say c0 , that minimizes f (c), and then use
the fact that f (c0 ) ≥ 0.
b) Show that

∑_y √( PY |X (y|0) PY |X (y|1) ) ≤ √( ∑_y PY |X (y|0) ) · √( ∑_y PY |X (y|1) ) = 1

with equality if and only if PY |X (y|0) = PY |X (y|1), ∀y.


B.5.17. The Z-Channel


Consider the Z-Channel with one-directional crossover probability 1/2, which
is the DMC:

[Transition diagram of the Z-channel from input X to output Y : one input
symbol is received correctly with probability 1, while the other is received
correctly with probability 1/2 and crosses over with probability 1/2.]

a) For the M = 2 codewords x1 = [0 . . . 01 . . . 1] and x2 = [1 . . . 10 . . . 0] of
length N = 2L, where the runs of 1’s and 0’s have length L, calculate
PB|0 , PB|1 and (PB )WC .
b) Find the Bhattacharyya upper bound on (PB )WC for the code in a).
c) Repeat a) for the repeat code of length N , i.e., x1 = [0 . . . 0] and
x2 = [1 . . . 1].
d) Find the Bhattacharyya upper bound on (PB )WC for the code in c).
e) Explain why the Bhattacharyya bound is tight in b) but not in d).

B.5.18. Binary-Input Continuous-Output Channels


The theory of DMC’s is easily extended to memoryless channels whose out-
put is a continuous random variable Y but whose input is a discrete random
variable. For instance, for a binary input continuous-output memoryless
channel, the expression for the cut-off rate is

R0 = 1 − log2 ( 1 + ∫_{−∞}^{+∞} √( pY |X (y|0) · pY |X (y|1) ) dy ) bits/use

where pY |X (y|x) is the probability density function for Y given that X = x.


When binary antipodal signalling is used in the presence of additive white
Gaussian noise (AWGN), the unquantized correlator output Y is described by

pY |X (y|0) = pn (y − √E)   and   pY |X (y|1) = pn (y + √E),
where E is the energy of the signal, where pn is the noise probability density
function
pn (α) = (1/√(πN0 )) e^{−α^2 /N0 },
and where N0 is the one-sided noise power spectral density (power per
Hertz).


a) Find the cut-off rate R0 of the above binary-input continuous-output
memoryless channel.
Hint: Don’t forget that every probability density function integrates
to 1 over the whole real line.
b) Sketch the cut-off rate per unit energy, R0 /E, versus the energy-to-
noise ratio E/N0 and determine in what range of E/N0 the resulting
channel makes the most efficient use of energy. Express R0 /E in terms
of E/N0 in this most efficient region. Find also the value of E/N0 for
which R0 /E is 90% of its maximum value.
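
Before deriving the closed form asked for in part a), the Bhattacharyya integral can be evaluated numerically as a sanity check. The Python/NumPy sketch below normalizes the signal energy to E = 1 and sweeps E/N0; the integration limits and step size are rough choices.

import numpy as np

def cutoff_rate(E_over_N0):
    """Numerical evaluation of R0 = 1 - log2(1 + integral sqrt(p(y|0) p(y|1)) dy)
    for binary antipodal signals in AWGN, with the energy normalized to E = 1."""
    E, N0 = 1.0, 1.0 / E_over_N0
    pn = lambda a: np.exp(-a ** 2 / N0) / np.sqrt(np.pi * N0)   # noise density from above
    lim = np.sqrt(E) + 25.0 * np.sqrt(N0)                       # integration range
    y, dy = np.linspace(-lim, lim, 200001, retstep=True)
    bhatt = np.sum(np.sqrt(pn(y - np.sqrt(E)) * pn(y + np.sqrt(E)))) * dy
    return 1.0 - np.log2(1.0 + bhatt)

for dB in (-9, -6, -3, 0, 3, 6):
    ratio = 10 ** (dB / 10)
    print(dB, "dB:", round(cutoff_rate(ratio), 4), "bits/use")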

Note: The results of this problem contain an important lesson for commu-
nications engineers. The “natural impulse” of a communications engineer
is to make E/N0 very large so that reliable “hard-decisions” about X can
be made from the correlator output.
