Lecture Notes for Information Theory
Edition: WS 2021/22
Authors: G. Kramer
(all rights reserved)
Information Theory,
Winter Semester 2021/2022
Lecturers:
• Prof. Dr. sc. techn. Gerhard Kramer,
Phone: (089) 289-23491
Email: gerhard.kramer@tum.de
Address: Institute for Communications Engineering, Room N2403
Office hours: on request
• Dr. Christian Deppe,
Phone: (089) 289-23458
Email: christian.deppe@tum.de
Address: Institute for Communications Engineering, Room N3404
Office hours: on request
Teaching Assistant:
• Emna Ben Yacoub M.Sc.,
Phone: (089) 289-23436
Email: emna.ben-yacoub@tum.de
Address: Institute for Communications Engineering, Room N3406
Office hours: on request
Schedule:
• Lecture/Tutorial:
– Monday: 13:15 - 14:45
– Tuesday: 15:00 - 17:30
General information about the lecture and the tutorial:
• The material related to the lecture and tutorial will be made available in Moodle (http://www.moodle.tum.de). Each week after the lecture, a problem set addressing the material discussed in the lecture will be uploaded and discussed in the following tutorial session(s). You are strongly advised to attempt to solve these problems before the corresponding tutorial. In addition, detailed solutions of the problem sets will be uploaded after each tutorial session.
• Announcements will be made in Moodle.
• All material discussed in the lectures or tutorials is relevant for the
exam unless mentioned otherwise.
• In addition to the final exam, a midterm exam will take place in which
a bonus worth 10 % of the final exam can be achieved. Participating
in the midterm exam is optional.
• For both the midterm and final exam, no notes and no calculators are
allowed.
Contents
5. Channel Coding
5.1. Rate, Reliability, and Cost
5.2. Memoryless Channels
5.3. Cost Functions
5.4. Block and Bit Error Probability
5.5. Random Coding
5.5.1. Block Error Probability
5.5.2. Capacity-Cost Function
5.5.3. Maximum Error Probability
5.6. Concavity and Converse
5.7. Discrete Alphabet Examples
5.7.1. Binary Symmetric Channel
5.7.2. Binary Erasure Channel
5.7.3. Strongly Symmetric Channels
5.7.4. Symmetric Channels
5.8. Continuous Alphabet Examples
5.8.1. AWGN Channel
5.8.2. AWGN Channel with BPSK
5.8.3. Complex AWGN Channel
5.8.4. Parallel AWGN Channels
5.8.5. Vector AWGN Channels
5.8.6. AWGN Channels with Receiver Channel Information
5.9. Source and Channel Coding
5.9.1. Separate Coding
5.9.2. Rates Beyond Capacity
5.10. Feedback
B. Problems
B.1. Information Theory for Discrete Variables
B.2. Information Theory for Continuous Variables
B.3. Coding of Memoryless Sources
B.4. Coding of Stationary Sources
B.5. Channel Coding
Notation
We use standard notation for probabilities, random variables, entropy, mu-
tual information, and so forth. Table 0.1 lists notation, some of it developed
in the appendices, and we use this without further explanation in the main
body of the text. We introduce the remaining notation as we go along.
The reader is referred to the appendices for a review of relevant probability
theory concepts.
Information Theory for Discrete Variables

log2 2^n = n bits.
Note that the base of the logarithm simply defines the units of measurement.
For non-equally likely choices Shannon developed a measure that has the
same form as the entropy in statistical mechanics. For example, consider a
biased coin that takes on the value Heads with probability p and Tails with
probability 1 − p. Shannon proposed that the information produced when
flipping the coin n times is
n [ p log2(1/p) + (1 − p) log2(1/(1 − p)) ] bits. (1.1)
For instance, if p = 1/2 then the right-hand side of (1.1) gives n bits, as
expected. But if p = 0.11 then we compute n/2 bits.
More generally, consider an experiment that has m possible outcomes with
probabilities pi , i = 1, 2, . . . , m. Suppose we repeat the experiment n times.
¹Stirling's approximation is n! ≈ √(2πn) (n/e)^n.
We expect outcome i to occur approximately npi times for all i, and the
logarithm of the number of expected strings is
log2 [ n! / ( (np1)! (np2)! · · · (npm)! ) ] ≈ n Σ_{i=1}^{m} −pi log2 pi. (1.2)
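As a numerical sanity check of (1.2), the following Python sketch compares the exact logarithm of the multinomial coefficient, computed via log-factorials, with n times the entropy; the distribution and the value of n are arbitrary choices for illustration.

    from math import lgamma, log

    def log2_multinomial(n, counts):
        # log2 of n!/(k_1! k_2! ... k_m!) using lgamma(x+1) = ln(x!)
        val = lgamma(n + 1) - sum(lgamma(k + 1) for k in counts)
        return val / log(2)

    def entropy_bits(p):
        # entropy in bits, summing over the support only
        return -sum(pi * log(pi, 2) for pi in p if pi > 0)

    p = [0.5, 0.3, 0.2]
    n = 10000
    print(log2_multinomial(n, [pi * n for pi in p]))  # slightly below n*H
    print(n * entropy_bits(p))                        # about 14855 bits

The gap between the two numbers is the lower-order Stirling correction, which vanishes per symbol as n grows.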
1.3. Entropy
The word "entropy" was invented by Clausius [3, p. 390] from the Greek word τροπή (tropē) for "turn" or "change". Clausius was attempting to find a Greek word
• that was related to the German “Verwandlungsinhalt” which trans-
lates literally to “transformation content”;
• for which the German version, in this case “Entropie”, sounds similar
to the German word for energy, in this case “Energie”.
Clausius had a physical notion of "transformation content." Shannon instead
related entropy to messages and sets. In other words, Shannon entropy is
not necessarily physical: it has to do with choice and uncertainty in a broader
sense than envisioned by Clausius.
We now become more formal in our treatment. Let supp(f ) be the support
set of the function f , i.e., the set of a such that f (a) > 0. We define the
entropy or uncertainty of the discrete random variable X as
H(X) = Σ_{a∈supp(PX)} −PX(a) log2 PX(a). (1.3)
We remark that by taking the sum over the support of PX (·) we avoid
dealing with the expression 0 · log2 0. Many authors simply define 0 · log2 0
to be zero because lim_{x→0+} x log2 x = 0, where the notation "0+" means that the limit is taken from above. However, it is often instructive to be concerned
about events with probability zero, see Sec. 1.7 below.
Example 1.1. The entropy H(X) depends only on the probability distribution PX(·), and not on what we call the letters a in the alphabet X of X. This idea is consistent with our discussion in Sec. 1.1. We thus have

H(g(X)) = H(X)

for any invertible function g(·), since g(·) simply relabels the letters in X.
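A small Python check of (1.3) and of the relabeling invariance of Example 1.1; the distributions and letter names below are arbitrary illustrations.

    from math import log2

    def H(P):
        # entropy in bits of {letter: probability}, summed over the support, cf. (1.3)
        return -sum(p * log2(p) for p in P.values() if p > 0)

    PX = {'a': 0.5, 'b': 0.25, 'c': 0.25}
    PgX = {'x': 0.5, 'y': 0.25, 'z': 0.25}   # relabeled by an invertible g
    print(H(PX), H(PgX))                      # both 1.5 bits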
[Figure 1.1: The binary entropy function H2(p) for 0 ≤ p ≤ 1.]
The binary entropy function is plotted in Fig. 1.1. Observe that H2 (0) =
H2 (1) = 0, H2 (0.11) = H2 (0.89) ≈ 1/2, H2 (1/2) = 1, and H2 (p) is maxi-
mized by p = 1/2. More generally, we have the following important result
where we recall that |X | is the number of values in X .
[Figure 1.2: The functions x − 1 and ln(x), illustrating ln(x) ≤ x − 1.]
Theorem 1.1. We have

0 ≤ H(X) ≤ log2 |X| (1.7)

with equality on the left if and only if there is one letter a in X with PX(a) = 1, and with equality on the right if and only if PX(a) = 1/|X| for all a ∈ X, i.e., X is uniform over X.
Proof. Consider (1.3) and note that for 0 < p ≤ 1 we have −p log2 p ≥ 0
with equality if and only if p = 1. Thus, we have H(X) ≥ 0 with equality if
and only if there is one letter a in X with PX (a) = 1. Consider next (1.5)
and observe that
" #
1
H(X) = E log2 + log2 |X |. (1.8)
|X |PX (X)
where ln(x) is the natural logarithm of x, and where equality holds if and
with equality in (a) if and only if X is uniform over X , and with equality
in (b) if and only if supp(PX ) = X .
The two bounds in (1.7) are intuitively pleasing: uncertainty should not
be negative (what would negative uncertainty mean?) and the maximum
uncertainty is when all possibilities are equally likely.
The expression (1.15) seems not to have a simple closed form for general n.
We instead use a Gaussian approximation for large n:
PX(k) ≈ (1/√(2π Var[X])) e^{−(k−E[X])²/(2 Var[X])} (1.16)

which gives

H(X) ≈ E[ −log2 ( (1/√(2π Var[X])) e^{−(X−E[X])²/(2 Var[X])} ) ]
     = (1/2) log2 ( 2πe · np(1 − p) ). (1.17)
Figure 1.3.: The binomial distribution for p = 0.5 and n = 20. The Gaus-
sian approximation with E[X] = 10 and Var[X] = 5 is the solid
curve.
Consider a joint distribution PXY (·) where the random variables X and
Y take on values in the discrete and finite alphabets X and Y, respec-
tively. The conditional entropy of X given the event Y = b with probability
Pr [Y = b] > 0 is
H(X|Y = b) = Σ_{a∈supp(PX|Y(·|b))} −PX|Y(a|b) log2 PX|Y(a|b)
           = E[ −log2 PX|Y(X|Y) | Y = b ]. (1.22)
Using the same steps as in the proof of Theorem 1.1, one can show that

0 ≤ H(X|Y = b) ≤ log2 |X|

with equality on the left if and only if PX|Y(a|b) = 1 for some a, and with equality on the right if and only if PX|Y(a|b) = 1/|X| for all a.
Alternatively, we have
H(X|Y) = Σ_{(a,b)∈supp(PXY)} −PXY(a, b) log2 PX|Y(a|b)
       = E[ −log2 PX|Y(X|Y) ]. (1.25)
PXY(a, b)   a = 0   a = 1
b = 0       1/3     1/3
b = 1       1/3     0          (1.26)
We then have

0 ≤ H(X|Y) ≤ log2 |X|

with equality on the left if and only if for every b in supp(PY) there is an a such that PX|Y(a|b) = 1, and with equality on the right if and only if for every b in supp(PY) we have PX|Y(a|b) = 1/|X| for all a. We say that Y essentially determines X if H(X|Y) = 0.
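For the joint distribution (1.26), a short computation gives H(X|Y) = 2/3 bit while H(X) = H2(1/3) ≈ 0.918 bits. The Python sketch below evaluates (1.25) directly; the dictionary layout of the table is an implementation choice.

    from math import log2

    PXY = {(0, 0): 1/3, (1, 0): 1/3, (0, 1): 1/3, (1, 1): 0}   # (a, b) -> PXY(a, b)

    PY, PX = {}, {}
    for (a, b), p in PXY.items():
        PY[b] = PY.get(b, 0) + p
        PX[a] = PX.get(a, 0) + p

    # H(X|Y) = E[-log2 PX|Y(X|Y)], summing over supp(PXY) as in (1.25)
    HXgY = -sum(p * log2(p / PY[b]) for (a, b), p in PXY.items() if p > 0)
    HX = -sum(p * log2(p) for p in PX.values() if p > 0)
    print(HXgY, HX)   # 2/3 bit vs about 0.918 bits, so H(X|Y) <= H(X)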
The above definitions and bounds extend naturally to more than two ran-
dom variables. For example, consider the distribution PXY Z (·). We de-
fine the conditional entropy of X given Y and the event Z = c with
Pr [Z = c] > 0 as
H(X|Y, Z = c) = Σ_{(a,b)∈supp(PXY|Z(·|c))} −PXY|Z(a, b|c) log2 PX|YZ(a|b, c)
             = E[ −log2 PX|YZ(X|Y, Z) | Z = c ]. (1.28)
Note that we have written H(XY ) rather than H(X, Y ) and the reader
should not confuse XY with “X multiplied by Y ”. Some authors prefer
H(X, Y ) and this is a matter of taste. For example, if one considers XY as
a vector [X, Y ] then the notation H(X, Y ) makes sense. We will follow the
convention of not using punctuation if no confusion arises.
Using Bayes’ rule in (1.29) we have
H(XY) = E[ −log2 PX(X) PY|X(Y|X) ]
      = E[ −log2 PX(X) ] + E[ −log2 PY|X(Y|X) ]
      = H(X) + H(Y|X). (1.30)
We similarly have

H(XY) = H(Y) + H(X|Y) (1.31)

and one may check that (1.30) and (1.31) are satisfied.
where PX(·) and PY(·) have the same domain X. This distance and the so-called variational distance (1/2) d(PX, PY) measure additive differences.
Another useful approach is the following that considers logarithmic differ-
ences. The informational divergence (or relative entropy or Kullback-Leibler
distance) between PX (·) and PY (·) is defined as
D(PX ‖ PY) = Σ_{a∈supp(PX)} PX(a) log2 ( PX(a)/PY(a) ). (1.36)
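A direct implementation of (1.36) is straightforward; the two distributions below are arbitrary examples.

    from math import log2

    def D(P, Q):
        # informational divergence D(P||Q) in bits, cf. (1.36);
        # infinite if supp(P) is not contained in supp(Q)
        total = 0.0
        for a, p in P.items():
            if p > 0:
                if Q.get(a, 0) == 0:
                    return float('inf')
                total += p * log2(p / Q[a])
        return total

    P = {0: 0.5, 1: 0.5}
    Q = {0: 0.89, 1: 0.11}
    print(D(P, Q), D(Q, P))   # both positive, and in general unequal
    print(D(P, P))            # zero when the distributions coincide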
Example 1.9. Consider a joint distribution PXY and its marginals PX and
PY . We use the notation PX PY (·) to refer to the distribution for which
(PX PY)(a, b) = PX(a) PY(b) for all (a, b), and consider D(PXY ‖ PX PY). (1.40)
Theorem 1.2. We have D(PX ‖ PY) ≥ 0 with equality if and only if PX(a) = PY(a) for all a ∈ supp(PX).⁴ To see this, write

−D(PX ‖ PY) = Σ_{a∈supp(PX)} PX(a) log2 ( PY(a)/PX(a) )
(a)
≤ log2(e) Σ_{a∈supp(PX)} PX(a) [ PY(a)/PX(a) − 1 ]
(b)
≤ 0

with equality in (a) if and only if PY(a) = PX(a) for all a ∈ supp(PX), and with equality in (b) if and only if supp(PY) ⊆ supp(PX).
and so Theorem 1.2 gives H(X) ≤ log2 |X | with equality if and only if PX
is uniform. This reproves the interesting part of Theorem 1.1.
⁴This is the same as requiring PX(a) = PY(a) for all a ∈ X.
Similarly, D(PX|Z ‖ PY|Z | PZ) ≥ 0 with equality if and only if PX|Z(a|b) = PY|Z(a|b) for all (a, b) ∈ supp(PXZ).
Problem B.1.23 develops a chain rule for informational divergence similar
to the chain rule (1.32) for entropy, namely
D(PX^n ‖ PY^n) = Σ_{i=1}^{n} D( P_{Xi|X^{i−1}} ‖ P_{Yi|Y^{i−1}} | P_{X^{i−1}} ). (1.48)
I(X; Y ) ≥ 0 (1.52)
H(X|Y ) ≤ H(X) (1.53)
H(XY ) ≤ H(X) + H(Y ) (1.54)
We thus have H(X|Y ) ≤ H(X) in accordance with (1.53), but observe that
H(X|Y = 0) > H(X).
where we define
I(X; Y|Z = c) = D( PXY|Z(·|c) ‖ PX|Z(·|c) PY|Z(·|c) ). (1.56)
I(X; Y |Z = c) ≥ 0 (1.58)
I(X; Y |Z) ≥ 0 (1.59)
We thus have

0 ≤ I(X; Y|Z) ≤ min{ H(X|Z), H(Y|Z) } (1.61)

with equality on the left if and only if X − Z − Y forms a Markov chain, and with equality on the right if and only if YZ essentially determines X, or XZ essentially determines Y, or both. The left-hand side of (1.61) implies

H(X|Y Z) ≤ H(X|Z). (1.62)

We further have

I(X; Y Z) = I(X; Y) + I(X; Z|Y). (1.63)

The expansion (1.63) is called the chain rule for mutual information.
Example 1.14. Using the chain rule for mutual information and the left-
hand side of (1.61), we have
I(X; Y) ≤ I(X; Y Z) and I(X; Y) ≤ I(X Z; Y). (1.64)
We thus have the following bound, where X(PX ‖ PY) = Σ_{a∈supp(PX)} −PX(a) log2 PY(a) denotes the cross-entropy and satisfies X(PX ‖ PY) = H(X) + D(PX ‖ PY).

Theorem 1.4. For every distribution PY(·) we have

H(X) ≤ X(PX ‖ PY)

with equality if and only if PX(a) = PY(a) for all a.
Theorem 1.4 gives a useful tool to upper bound entropy: every choice of
PY (·) gives an upper bound on H(X)! To simplify the computation, one
usually chooses PY (·) to have an exponential form, such as a Gaussian
distribution for continuous random variables as in Section 2.6.3.
1.10. Inequalities
Proof. If bi = 0 for some i then the left-hand side of (1.70) is infinity and
the identity and the inequality are valid. Now suppose that bi > 0 for all i.
The identity (1.70) follows by substitution. The inequality (1.71) and the
condition for equality follow by Theorem 1.2.
Example 1.16. Consider n = 2 for which the log-sum inequality is
a1 log(a1/b1) + a2 log(a2/b2) ≥ (a1 + a2) log( (a1 + a2)/(b1 + b2) ) (1.72)
with equality if and only if a1 /b1 = a2 /b2 = (a1 + a2 )/(b1 + b2 ).
Proof. We have
I(X; Z) (a)≤ I(X; Y Z)
        (b)= I(X; Y) + I(X; Z|Y)
        (c)= I(X; Y) (1.75)
where (a) follows by (1.64), (b) follows by the chain rule (1.63), and (c)
follows because X − Y − Z forms a Markov chain. One can prove I(X; Z) ≤
I(Y ; Z) in the same way. Next, by the chain rule (1.48) we have the two
expansions
D(PX1Y1 ‖ PX2Y2) = D(PX1 ‖ PX2) + D(PY1|X1 ‖ PY2|X2 | PX1) (1.76)

D(PX1Y1 ‖ PX2Y2) = D(PY1 ‖ PY2) + D(PX1|Y1 ‖ PX2|Y2 | PY1) (1.77)

where the second term in (1.76) is zero because Y1 and Y2 are outputs of the same channel, i.e., PY1|X1 = PY2|X2. Combining the two expansions gives

D(PY1 ‖ PY2) = D(PX1 ‖ PX2) − D(PX1|Y1 ‖ PX2|Y2 | PY1). (1.78)
Proof. Let E = 1(X̂ ≠ X), where 1(·) is the indicator function. We use the chain rule to expand H(EX|X̂) in two ways as

H(EX|X̂) = H(X|X̂) + H(E|X X̂) (a)= H(X|X̂)

H(EX|X̂) = H(E|X̂) + H(X|E X̂) (b)≤ H(E|X̂) + Pe log2(|X| − 1) (c)≤ H2(Pe) + Pe log2(|X| − 1)
where (a) follows because X̂ and X essentially determine E. Step (b) follows
because, given X̂ and E = 1, X takes on at most |X | − 1 values. Further-
more, by Theorem 1.1 equality holds in (b) if and only if, conditioned on X̂,
X is uniform over the set of X not including X̂, i.e., if and only if (1.81) is
satisfied for all a and b in X . Inequality (c) holds with equality if (1.81) is
satisfied for all a and b in X .
[Figure: The Fano lower bound on H(X|X̂) as a function of Pe.]
H2 (1 − p0 ) + (1 − p0 ) ≥ H(X|X̂) (1.83)
and one can check that equality holds if and only if p1 = p2 (see (1.12)).
Observe that
0 ≤ d(P, Q) ≤ 2 (1.85)
Define
p = Σ_{a∈X1} P(a) and q = Σ_{a∈X1} Q(a). (1.89)
Thus, as q decreases the right-hand side of (1.91) increases from its (mini-
mum) value of 0.
Proof. Consider PX (a) > 0 and QX (a) > 0. The log-sum inequality gives
A similar bound results when PX (a) = 0 and QX (a) > 0. Now sum the
appropriate bounds over supp(PX ) and supp(QX ) to obtain (1.94).
Proof. From (1.43) we can write H(X) = log2 |X | − D(PX kPY ) where PY
is uniform. Theorem 1.9 thus gives the desired result by fixing PY as the
uniform distribution, i.e., use (1.94) with QY = PY .
d/dp H2(p) = log2( (1 − p)/p ) (1.96)

d²/dp² H2(p) = −1/( ln(2) p(1 − p) ) < 0. (1.97)
One can further show that D(PX ‖ PY) is convex in the pair (PX, PY), i.e.,

D( λPX1 + (1−λ)PX2 ‖ λPY1 + (1−λ)PY2 ) ≤ λ D(PX1 ‖ PY1) + (1−λ) D(PX2 ‖ PY2)

for any λ satisfying 0 < λ < 1. However, X(PX ‖ PY) is neither convex nor concave in the pair (PX, PY) in general.
References
[1] C. E. Shannon. A mathematical theory of communication. Bell System
Technical Journal, 27:379–423 and 623–656, July and October 1948.
Reprinted in Claude Elwood Shannon: Collected Papers, pp. 5–83, (N.
J. A. Sloane and A. D. Wyner, eds.) Piscataway: IEEE Press, 1993.
[2] R. V. L. Hartley. Transmission of information. Bell Syst. Tech. J., page
535, July 1928.
[3] R. Clausius. Ueber verschiedene für die Anwendung bequeme Formen
der Hauptgleichungen der mechanischen Wärmetheorie. Annalen der
Physik und Chemie, CXXV(7):353–400, 1865.
[4] A. D. Wyner and J. Ziv. A theorem on the entropy of certain binary sequences and applications: Part I. IEEE Trans. Inf. Theory, 19(6):769–772, November 1973.
Example 2.1. Consider the uniform distribution with pX (a) = 1/A for
a ∈ [0, A) where [0, A) = {x : 0 ≤ x < A} (see Fig. 2.1). We compute

h(X) = ∫_0^A (1/A) log2(A) da = log2(A) bits.

We find that h(X) is negative for A < 1. A natural next question is how we should interpret h(X): what does "negative uncertainty" mean? We shall give one interpretation in Sec. 2.6.1.
[Figure 2.1: Uniform densities pX(a) = 1/A for A = 1/2 (h(X) = −1 bit) and A = 1 (h(X) = 0).]
[Figure 2.2: The cumulative distribution FX(a) and the corresponding "density" pX(a) for the mixed random variable in (2.11).]
One often encounters problems with random variables having both discrete
and continuous components. For example, consider a real-valued random
variable X with cumulative distribution (see Fig. 2.2)
FX(a) = { 0,             a < 0
        { 3a/4,          0 ≤ a < 1/2
        { (3a + 1)/4,    1/2 ≤ a < 1
        { 1,             a ≥ 1.        (2.11)
One can extend the above ideas to random vectors. For instance, consider
X = [X1 , X2 ] that takes on values in R2 , and suppose X has a
• 0-dimensional part: probability mass 1/4 at the point (1, 1/2);
• 1-dimensional part: probability mass 1/2 on the interval defined by
x1 = −1 and x2 ∈ [0, 1/2);
• 2-dimensional part: uniform probability density on the rectangle (x1 , x2 ) ∈
[−1, 1) × [−1/2, 1/2) (so the 2-dimensional probability mass is 1/4).
A plot of the “density” is shown in Fig. 2.3. We have h(X) = −∞ due
to the 0-dimensional and 1-dimensional parts. However, if we condition on
the event E = {X1 = −1, X2 ∈ [0, 1/2)} then we compute h(X|E) = −1 bit
because X is uniform given E.
Consider again the uniform density with pX (a) = 1/A for a ∈ [0, A) and
let A → 0. We thus have h(X) → −∞. We can interpret such limiting
densities as Dirac-δ functions representing discrete random variables. For
instance, suppose that pX(a) = pi/A for integers i and a ∈ [i, i + A), where 0 ≤ A ≤ 1. As A → 0, this density represents a discrete random variable
X̃ with PX̃ (i) = pi . We compute
h(X) = Σ_i −pi log(pi/A) = log(A) + H(X̃). (2.15)
For example, a Gaussian random variable X with variance σ² has

h(X) = (1/2) log( 2πe σ² ) (2.19)

and, more generally, a Gaussian random vector X with covariance matrix QX has

h(X) = (1/2) log( (2πe)^n |QX| ). (2.22)
with equality if and only if pX (a) = pY (a) for (almost) all a ∈ supp(pX ).
This means that
I(X; Y ) ≥ 0 (2.31)
h(X|Y ) ≤ h(X) (2.32)
h(XY ) ≤ h(X) + h(Y ) (2.33)
assuming the integrals exist. We have D(pX|Z kpY |Z |pZ ) ≥ 0 with equality
if and only if pX|Z (a|b) = pY |Z (a|b) for (almost) all (a, b) ∈ supp(pXZ ).
We thus have
with equality if and only if pX(a) = pY(a) for (almost) all a ∈ supp(pX). Theorem 1.4 gives a useful tool to upper bound entropy that we use in the next section.
Example 2.4. Consider Example 2.1 where pX (a) = 1/A for a ∈ [0, A).
Suppose pY(b) = 1/B for b ∈ [0, B), for which we compute
X(pX ‖ pY) = { ∞,       B < A
             { log(B),  B ≥ A.     (2.38)
We thus have X(pX kpY ) ≥ h(X) = log(A) with equality if and only if
A = B. This matches (2.37). Observe that any choice of pY (·) where
supp(pY ) is a proper subset of supp(pX ) gives a useless bound.
Recall that the uniform distribution maximizes the entropy of discrete ran-
dom variables with alphabet X . Similarly, the uniform density maximizes
the differential entropy of continuous random variables with a support of
finite volume. To prove this, suppose that X is confined to a set S in R^n. Let |S| be the volume of S, i.e., |S| = ∫_S 1 dx, and let U be uniform over S.
E[X] ≤ m (2.41)
where the inequality a ≤ b means that ai ≤ bi for all entries ai and bi of the
respective a and b.
We thus have
h(X) ≤ Σ_i log(e · mi) (2.44)
with equality if and only if pX (x) = pE (x) for almost all x. Independent ex-
ponential random variables therefore maximize (differential) entropy under
first moment and non-negativity constraints.
Similarly, under either of the constraints

|QX| ≤ D (2.45)

QX ⪯ Q (2.46)

where ⪯ denotes the positive semidefinite ordering, we have h(X) ≤ h(XG) for a Gaussian XG with density pG meeting the constraint, with equality if and only if pX(x) = pG(x) for almost all x. Gaussian random vectors therefore maximize differential entropy under such second-moment constraints, and a conditional version holds with equality if and only if pY|X(y|x) = pH|G(y|x) for almost all (x, y).
We begin with B = 1. A binary source is not very interesting for this case
so we choose X to be the K-ary alphabet {a1 , a2 , . . . , aK } where K > 2.
We shall return to binary sources later when we consider B > 1.
Our problem is depicted in Fig. 3.2 where we are interested in encoding
the random variable X into a number of digits U L = U1 U2 . . . UL . We are
mostly interested in encoding into bits, but we will take the alphabet of the
Ui , i = 1, 2, . . . , L, to be the D-ary set U = {0, 1, . . . , D − 1}.
[Figure 3.1: A DMS PX(·) emitting . . . , X3, X2, X1.]

[Figure 3.2: A source encoder mapping X into U^L = U1 U2 . . . UL.]
A code for X is a list¹

C = {b1, b2, . . . , bK} (3.2)

and the bk are called code words. Code word bk has length ℓk, k = 1, 2, . . . , K. Since the ℓk can take on different values for different k, we are considering variable-length codes.
Example 3.1. The list C = {0, 10, 11} is a binary (D = 2) code for a ternary (K = 3) source. The code words are b1 = 0, b2 = 10, b3 = 11 with lengths ℓ1 = 1, ℓ2 = 2, ℓ3 = 2, respectively.

The list C = {0, 1, 20, 21} is a ternary (D = 3) code for a quaternary (K = 4) source. The code words are b1 = 0, b2 = 1, b3 = 20, b4 = 21 with lengths ℓ1 = 1, ℓ2 = 1, ℓ3 = 2, ℓ4 = 2, respectively.
Example 3.2. Consider the code C = {0, 10, 11} from Example 3.1 and
suppose that X = {x, y, z}. There are 3³ = 27 mappings from X to C so
there are 27 possible encoders we could choose. If we choose U L (x) = 0,
U L (y) = 10, U L (z) = 11 then we have L(x) = 1, L(y) = 2, L(z) = 2.
which has length L(X^n) = Σ_{i=1}^{n} L(Xi). The weak law of large numbers ensures that L(X^n)/n approaches E[L(X)] in probability as n grows.
¹A set does not admit multiple occurrences of the same element while a list does. For instance, {0, 1, 2} is a set and a list, while {0, 1, 1, 2, 2} is a list but not a set.
Example 3.4. The following table specifies an encoder for a code that is
not prefix-free.
x    U^L(x)    L(x)
a1 0 1
a2 00 2
a3 11 2
In fact, the code is not uniquely decodable. For example, both of the source
strings a1 a2 and a2 a1 are mapped to the encoded string 000.
Example 3.5. The following table specifies an encoder for a code that is
not prefix-free but that is uniquely decodable.
x    U^L(x)    L(x)
a1 0 1
a2 01 2
a3 11 2
In fact, the code is the reflected code of Example 3.3 and it is suffix-free: no code word bj is the suffix of another code word bk for j ≠ k. Codes can be both prefix-free and suffix-free, e.g., the code with all D^N strings of a given length N.
Figure 3.3.: A rooted tree for the prefix-free code and encoder of Exam-
ple 3.3.
Figure 3.4.: A rooted tree for the code and encoder of Example 3.4.
Example 3.6. The unique rooted tree for the ternary prefix-free code C = {0, 1, 20, 21} is drawn in Fig. 3.5.
The depth of a branching node or leaf is the number of branches one must
traverse from the root to this node or leaf. The depth of a code is its
maximal code word length N = max_k ℓk. The full D-ary tree of depth N is the D-ary tree with D^N leaves at depth N.
[Figure 3.5: The rooted tree for the ternary code C = {0, 1, 20, 21}.]
Σ_{k=1}^{K} D^{−ℓk} ≤ 1.

This bound is known as the Kraft inequality. The inequality turns out to give both a necessary and sufficient condition for the existence of a prefix-free code. We state this formally as a theorem.

Theorem 3.1. There is a D-ary prefix-free code with code word lengths ℓ1, ℓ2, . . . , ℓK if and only if these lengths satisfy the Kraft inequality. Equality holds if and only if the code has no unused leaves in its D-ary tree.
depth-N unused leaves remaining. And since Σ_{i=1}^{k} D^{−ℓi} < 1 by assumption, if k < K there is one or more depth-N unused leaf remaining. But this
Figure 3.6.: A rooted tree for a binary prefix-free code with specified code
word lengths.
means that there is one or more depth-`k+1 unused leaf or branching node
remaining and since the lengths were ordered no already chosen code words
can stem from such a node.
Example 3.7. Suppose we wish a binary prefix-free code with the code word lengths 1, 2, 4, 4, 4. These lengths satisfy the Kraft inequality, so Theorem 3.1 guarantees a prefix-free code with these code word lengths. We construct the code in Fig. 3.6 by "growing" it from the root as specified by the above algorithm. The resulting code is

k    bk      ℓk
1    0       1
2    10      2
3    1100    4
4    1101    4
5    1110    4
Note that we can shorten the code by replacing 1110 with 111. We then
have no unused leaves and the Kraft inequality is tight.
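The tree-growing construction can be mimicked in a few lines of Python. The sketch below uses the standard numerical construction (assign code words in order of non-decreasing length), which has the same effect as the procedure in the proof of Theorem 3.1; for the lengths 1, 2, 4, 4, 4 it reproduces the code above.

    def to_base(n, D, width):
        digits = []
        for _ in range(width):
            digits.append(str(n % D))
            n //= D
        return ''.join(reversed(digits))

    def kraft_code(lengths, D=2):
        # build a D-ary prefix-free code for lengths satisfying Kraft
        assert sum(D ** -l for l in lengths) <= 1 + 1e-9, "Kraft inequality violated"
        codewords, value, prev = [], 0, 0
        for l in sorted(lengths):
            value *= D ** (l - prev)      # descend to depth l in the tree
            codewords.append(to_base(value, D, l))
            value += 1                    # next unused node at this depth
            prev = l
        return codewords

    print(kraft_code([1, 2, 4, 4, 4]))    # ['0', '10', '1100', '1101', '1110']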
Our discussion of trees has so far not included the probabilities of the code
words. From (3.4) we should clearly associate short code word lengths with
large probabilities, but how should we do this precisely? To understand this
better, we develop properties of rooted trees with probabilities (see also [1]).
Suppose we assign the leaves a probability distribution so that leaf k (or code
word bk ) has probability pk , k = 1, 2, . . . , K. The leaf entropy is therefore
Hleaf = Σ_{k: pk>0} −pk log2 pk. (3.8)
Example 3.8. Consider the rooted tree of Fig. 3.5 and suppose the source
has the probabilities as shown in the table below.
k bk pk
1 0 0.4
2 1 0.3
3 20 0.2
4 21 0.1
We label the leaves and branching nodes with their probabilities as shown
in Fig. 3.7. The leaf entropy is
Hleaf = Σ_{k=1}^{4} −pk log2 pk ≈ 1.846 bits. (3.9)
We label the (branching) root node with j = 1 and branching node 2 with
j = 2. We thus have P1 = 1.0 which is the sum of the probabilities of the
leaves stemming from the root node, and we have P2 = 0.3. The branch-
ing distributions are Q1 (·) = [0.4, 0.3, 0.3] and Q2 (·) = [0.2, 0.1, 0]/0.3 =
[2/3, 1/3, 0]. The branching entropies are therefore
H(Q1) = −0.4 log2(0.4) − 2 · 0.3 log2(0.3) ≈ 1.571 bits (3.10)

H(Q2) = −(2/3) log2(2/3) − (1/3) log2(1/3) ≈ 0.9183 bits. (3.11)
Figure 3.7.: A rooted tree with probabilities for a prefix-free ternary code.
The following lemmas give useful properties of rooted trees with probabili-
ties (see [1]).
Lemma 3.1. (Path Length Lemma) The average depth of the leaves
in a rooted tree with probabilities is the sum of the probabilities of the
branching nodes including the root, i.e., we have
Σ_{k=1}^{K} pk ℓk = Σ_{j=1}^{J} Pj. (3.12)
k=1 j=1
Proof. For the Path Length Lemma, the depth `k of each leaf is the number
of branching nodes encountered when moving from the root to this leaf. We
can thus reorder the first sum in (3.12) by first summing the probabilities
of the leaves stemming from each branching node, and then sum over all
branching nodes. The result is the second sum in (3.12).
For the Leaf-Entropy Lemma, the branching nodes and leaves stemming
directly from node j have the probabilities Pm = Pj · Qj (m). Furthermore,
the average branching entropy on the right-hand side of (3.13) has the terms
Pj H(Qj) = Pj Σ_{m: Pj Qj(m)>0} −Qj(m) log2 Qj(m)
         = Pj log2 Pj + Σ_{m: Pm>0} −Pm log2 Pm (3.14)
Example 3.9. Consider Example 3.8. The average depth of the leaves can
be computed using either of the following two approaches:
Σ_{k=1}^{4} pk ℓk = (0.4 + 0.3) · 1 + (0.2 + 0.1) · 2 = 1.3 (3.15)

P1 + P2 = 1.0 + 0.3 = 1.3. (3.16)
Lemmas 3.1 and 3.2 give a useful lower-bound on E[L]. Consider (3.13) for
which we have
H(X) = Hleaf = Σ_{j=1}^{J} Pj H(Qj)
(a)
≤ Σ_{j=1}^{J} Pj log2 D
(b)
= E[L] log2 D bits (3.18)
where (a) follows because H(Qj ) ≤ log2 D with equality if and only if Qj is
the uniform distribution over {0, 1, . . . , D − 1} for all j (see Thm. 1.1), and
(b) follows by Lemma 3.1. Equality holds if and only if pk = D^{−ℓk} for all k because the branching probabilities must all be 1/D.
For an upper bound on E[L], the conditions for equality in (3.18) suggest
that we should choose pk ≈ D^{−ℓk}. Consider the lengths

ℓk = ⌈−log_D pk⌉ (3.19)
we have
Σ_{k=1}^{K} D^{−⌈−log_D pk⌉} ≤ Σ_{k=1}^{K} D^{log_D pk} = 1. (3.21)
By Theorem 3.1, there is a D-ary prefix-free code with code word lengths
given by (3.19). A code with these lengths is called a Shannon-Fano code.
We next bound E[L] by using the right-hand side of (3.20):
E[L] = Σ_{k=1}^{K} pk ⌈−log_D pk⌉ < H(X)/log2 D + 1. (3.22)

Combining (3.18) and (3.22) gives

H(X)/log2 D ≤ E[L] < H(X)/log2 D + 1. (3.23)
[Figure 3.8: A Shannon-Fano code tree for the probabilities 0.4, 0.3, 0.2, 0.1 with lengths 2, 2, 3, 4.]
Theorem 3.2. (Coding Theorem for a Random Variable) The average code word length E[L] of an optimal D-ary prefix-free code for X satisfies (3.23), with equality on the left if and only if PX(a) = D^{−L(a)} for all a and some positive integer L(a), where L(a) is the length of the corresponding code word. Moreover, a Shannon-Fano code achieves E[L] satisfying (3.23).
Example 3.10. We use the algorithm in the proof of Thm. 3.1 to grow a tree with the lengths (3.19) for the source of Example 3.8. The resulting tree is shown in Fig. 3.8.
We use the Path Length Lemma to compute

E[L] = 1.0 + 0.7 + 0.3 + 0.3 + 0.1 = 2.4 code symbols. (3.24)

The source entropy is H(X) ≈ 1.846 bits (see (3.9)) so that (3.23) is satisfied. The code is not optimal, however, and we can reduce E[L] to 2 by simply using the full binary tree of depth 2. However, if H(X) is large then a Shannon-Fano code is nearly optimal. We use this insight in the next section.
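The Shannon-Fano lengths (3.19) and the resulting E[L] for this example can be checked quickly in Python:

    from math import ceil, log

    def shannon_fano_lengths(probs, D=2):
        # code word lengths l_k = ceil(-log_D p_k) from (3.19)
        return [ceil(-log(p) / log(D)) for p in probs]

    p = [0.4, 0.3, 0.2, 0.1]
    lengths = shannon_fano_lengths(p)                            # [2, 2, 3, 4]
    print(lengths, sum(pk * lk for pk, lk in zip(p, lengths)))   # E[L] = 2.4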
Figure 3.9.: Source encoder with a block parser (top) and a variable-length
dictionary parser (bottom).
H(X^B)/log2 D ≤ E[L] < H(X^B)/log2 D + 1. (3.25)
Theorem 3.3. (Coding Theorem for a DMS) The average code word length E[L] of an optimum D-ary prefix-free code for a K-ary DMS PX(·) that is parsed into blocks of length B satisfies

H(X)/log2 D ≤ E[L]/B < H(X)/log2 D + 1/B. (3.26)
Theorem 3.3 tells us that we can approach the lower bound H(X)/ log2 D
code symbols per source symbol as closely as we like by using large block
lengths B. Of course, the encoding and decoding delays increase linearly
with B and the complexity of the encoder and decoder seems to increase
exponentially with B.
Lemma 3.3. Optimal D-ary prefix-free codes assign shorter code words to more probable letters, i.e., if pk < pm then ℓk ≥ ℓm.
Lemma 3.4. All (used) leaves in the tree of an optimal D-ary prefix-free
code have a sibling.
Lemma 3.5. There is an optimal D-ary prefix-free code whose two least-
likely code words differ only in their last digit.
Proof. Lemma 3.3: Suppose we have pk < pm but ℓk < ℓm. Then we can swap bk and bm and E[L] is reduced by the positive number (pm − pk)(ℓm − ℓk).
Lemma 3.4: If a leaf has no siblings then we can delete the last digit of the
code word and we still have a prefix-free code. We thus reduce E[L] and the
original code could not have been optimal.
Lemma 3.5: By Lemma 3.3, the two least-likely code words must have the
two longest lengths. By Lemma 3.4, both of these code words must have
siblings. Thus, the two least-likely code words must have the same length.
Further, if they are not siblings then we can swap one of the code words for
the other code word’s sibling to make them siblings without changing E[L].
The result is a prefix-free code with the desired property.
Consider the simplest case D = 2. By the Path Length Lemma, to minimize
E[L] we should minimize the sum of the probabilities of the branching nodes.
By Lemma 3.5 we can choose the two least likely codewords as siblings, say
with probabilities pK−1 and pK . The branching node from which they stem
can now be considered as a leaf in a new tree with K − 1 leaves and with
expected length
[Figure 3.10: Construction of a binary Huffman code for the probabilities 0.4, 0.3, 0.2, 0.1.]

E[L'] = E[L] − (p_{K−1} + p_K)

where we have used the Path Length Lemma. The problem of minimizing E[L] is thus equivalent to minimizing E[L'], i.e., it is equivalent to constructing a binary tree with these K − 1 leaves such that the sum of the
probabilities of its branching nodes is minimum. But Lemma 3.5 states that
for this tree we can make two least likely code words siblings. Continuing
until the root, we have the following algorithm for constructing an optimal
binary prefix-free code for a K-ary random variable.
a) Draw K leaves and label them with the K probabilities p1 , p2 , . . . , pK .
Set k = K.
b) Create a branching node from which the two least likely leaves stem.
Label this node with the sum of the probabilities of these two leaves.
Label one of the new branches with 0 and the other with 1. The k − 2
remaining leaves and the new branching node are considered to be the
k − 1 leaves of a new tree.
c) Reduce k by 1. If k = 1 then make the remaining leaf the root node
and stop. Else go to step b).
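The algorithm translates directly into code. The Python sketch below uses a heap for the repeated "two least likely" selection; ties may be broken differently than in a hand construction, so the code words (but not E[L]) can differ from Fig. 3.10.

    import heapq

    def huffman_code(probs):
        # binary Huffman code: repeatedly make the two least likely nodes siblings
        heap = [(p, k, k) for k, p in enumerate(probs)]   # (prob, tiebreak, node)
        heapq.heapify(heap)
        tiebreak = len(probs)
        while len(heap) > 1:
            p0, _, n0 = heapq.heappop(heap)
            p1, _, n1 = heapq.heappop(heap)
            heapq.heappush(heap, (p0 + p1, tiebreak, (n0, n1)))
            tiebreak += 1
        code = [''] * len(probs)
        def walk(node, prefix):
            if isinstance(node, tuple):
                walk(node[0], prefix + '0')
                walk(node[1], prefix + '1')
            else:
                code[node] = prefix
        walk(heap[0][2], '')
        return code

    p = [0.4, 0.3, 0.2, 0.1]
    code = huffman_code(p)
    print(code, sum(pk * len(c) for pk, c in zip(p, code)))
    # e.g. ['0', '10', '110', '111'] with E[L] = 1.9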
Example 3.11. Consider the source of Examples 3.8 and 3.10 and choose
D = 2. We use the algorithm described above to construct a Huffman code
as depicted in Fig. 3.10. The resulting table of code words and lengths is
as follows:

k    pk     ℓk    bk
1    0.4    1     0
2    0.3    2     10
3    0.2    3     110
4    0.1    3     111

The Path Length Lemma gives E[L] = 1.0 + 0.6 + 0.3 = 1.9 code symbols, which is quite close to the source entropy H(X) ≈ 1.846 bits (see (3.9)) and much better than the Shannon-Fano code (see (3.24)).
Consider now the case D ≥ 3, where the only new complication is that the tree of an optimal code may have unused leaves.
Lemma 3.6. The number of unused leaves in the tree of an optimal D-ary
prefix-free code for a random variable that takes on K values with positive
probability, K ≥ D, is the remainder when (K − D)(D − 2) is divided by
D − 1.
Example 3.12. Consider the source with K = 4 of Example 3.8 and choose
D = 3. Using Lemma 3.6 we compute the remainder when (K − D)(D − 2) = 1 is divided by D − 1 = 2 to be r = 1. The algorithm described above constructs
the ternary Huffman code depicted in Fig. 3.11. The resulting table of code
[Figure 3.11: Construction of a ternary Huffman code.]
words is the same as in Example 3.8 so that E[L] = 1.3 code symbols per
source symbol. This number is reasonably close to H(X)/ log2 (3) ≈ 1.165.
Figure 3.12.: Source encoding scheme with K-ary input blocks X B of length
B and D-ary output blocks U L of length L.
L/E[B] ≥ H(X)/log2 D (3.33)
with equality if and only if all U^L have the same probability D^{−L}. The bound (3.33) is the variable-length-to-block counterpart of the block-to-variable-length lower bound in (3.26).
In fact, we immediately have a more general lower bound that applies to first
using a variable-length-to-block encoder with a proper dictionary followed
by a prefix-free block-to-variable-length encoder. The idea is that the U L
in (3.32) becomes a letter “X” that is input to the block-to-variable-length
encoder to which we apply (3.26) to get E[L] ≥ H(U L )/ log2 D. As a result,
we have a variable-length B and a variable-length L.
E[L]/E[B] ≥ H(X)/log2 D. (3.34)
Finally, we show that Tunstall codes approach entropy for large L. Let
pmin be the smallest (positive) probability letter from the source X. The
least likely word from the dictionary has probability Pj ∗ pmin where j ∗ is the
branching node from which this word stems. We further have
Pj* pmin ≤ 1/M (3.35)
[Figure: A Tunstall tree for the 4-ary DMS with PX(a) = 0.4, PX(b) = 0.3, PX(c) = 0.2, PX(d) = 0.1.]
Applying (3.32) and using Theorem 3.4 we have the following analog of
Theorem 3.2 for variable-to-fixed length parsers.
H(X)/log2 D ≤ L/E[B] < H(X)/( log2 D + log2(pmin/2)/L ) (3.39)
as long as log2 D + log2 (pmin /2) /L > 0. Moreover, a Tunstall code achieves
E[B] satisfying the inequalities in (3.39).
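The dictionary construction itself (repeatedly split the most probable dictionary word) is not spelled out above, so the following Python sketch is included for concreteness; the source probabilities match the tree figure above, and the dictionary size M = 13 is an arbitrary illustrative choice.

    import heapq

    def tunstall_dictionary(P, M):
        # split the most probable dictionary word until no full split fits;
        # P maps letters to probabilities, M is the dictionary size (M >= |P|)
        K = len(P)
        heap = [(-p, a) for a, p in P.items()]   # max-heap via negation
        heapq.heapify(heap)
        size = K
        while size + K - 1 <= M:
            negp, w = heapq.heappop(heap)        # most probable word
            for a, pa in P.items():
                heapq.heappush(heap, (negp * pa, w + a))
            size += K - 1
        return sorted(w for _, w in heap)

    P = {'a': 0.4, 'b': 0.3, 'c': 0.2, 'd': 0.1}
    print(tunstall_dictionary(P, 13))
    # 13 words: every two-letter word starting with a, b, or c, plus 'd'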
Variable-length parsing is also natural for sources with memory (see the end of Sec. 3.6). For example, if a small number
of long strings of symbols are very probable then one may include these
strings in a (relatively small) dictionary rather than considering all strings
of a large length whose number grows exponentially with the string length.
References
[1] J. L. Massey. Applied Digital Information Theory. ETH Zurich, Zurich,
Switzerland, 1980-1998.
4.1. Introduction
Most sources have memory but it is not obvious how to model the memory.
For example, consider the text you are reading. The memory in this source
depends on the language but also on the subject matter and the writer.
Should one model the source probabilistically? Should one use a model
with block memory or sliding-window memory? Such questions can be
answered in different ways, and different approaches lead to different data
compression algorithms. Ultimately, the choice of algorithm should be based
on the empirical performance for data sets of interest.
We take a probabilistic approach. For example, we might model the source
letters as being output by a finite state machine whose transitions from
one state to the next are governed by a conditional distribution PS 0 |S (·|s)
where S is the current state and S 0 is the next state. The letters output
by the source might be the state labels or, more generally, they might be
chosen from a conditional distribution PX|S (·|s). However, we may not know
PS 0 |S (·) or PX|S (·). We must therefore design an encoding algorithm that
learns these distributions while compressing the data. Furthermore, the
decoder must be able to keep track of the encoder’s model.
Codes that compress data for a class of sources are called universal, where
the word “universal” loosely refers to the class only. The reader should
always check to which class of sources the label “universal” applies. In fact,
small changes to a source can make “universal” codes perform poorly even
if these codes compress long strings to entropy for the original source.
Figure 4.1.: A DSS. For example, for the window length L = 3 the strings X1 X2 X3 and Xn Xn+1 Xn+2 have the same statistics for any n. The statistics are time-invariant for any choice of L.

The first class of sources that we consider are discrete stationary sources. A K-ary discrete stationary source (DSS) is a device that emits symbols X1, X2, X3, . . . , from a K-ary alphabet X where

PX1 X2 ... XL(x1, . . . , xL) = PXn Xn+1 ... Xn+L−1(x1, . . . , xL)

for all n ≥ 1 and all L ≥ 1 (see Fig. 4.1). In other words, if we place a
window of any length L around consecutive symbols, then the statistical
behavior is the same irrespective of where we place the window. The reader
is warned that this definition is restrictive. For example, if the underlying
source model is a finite state machine then the source is a DSS only if the
machine has reached a steady state. If the source is in a transient state, e.g.,
upon start-up, then the source is not generally time-invariant as is required
of a DSS.
The DSS has finite memory µ if µ is the smallest non-negative integer such
that
PXn|X^{n−1}(xn | x^{n−1}) = PXn|Xn−µ···Xn−1(xn | xn−µ, . . . , xn−1)
                          = PXµ+1|X^µ(xn | xn−µ, . . . , xn−1) (4.1)

for all n > µ and for all x^n (if µ = 0 then we write PXn|X^{n−1}(·) = PXn(·)).
For example, a DMS is a DSS with µ = 0. Of course, this means that the joint distribution of a DMS factors as P_{X^n}(x^n) = Π_{i=1}^{n} PX(xi).
Example 4.1. Consider the binary DSS with PX1(0) = PX1(1) = 1/2 and

Xn = Xn−1 ⊕ An (4.3)

for n ≥ 2, where ⊕ is addition modulo 2 and the An are i.i.d. binary random variables, independent of X1, with PAn(1) = p. We compute
H(X1 ) = 1
H(X2 |X1 ) = H2 (p)
The latter two sources are called non-ergodic because there are empirical
statistics that do not match the statistics computed via the probability
distributions. For example, if p = 0 then the number of zeros in a string of
length n will be either 0 or n, while the expected number of zeros based on
the probability distributions is n/2.
For source encoding, we are interested in the entropy per source letter (see
the end of Sec. 3.6). Consider a block X B for which we use the notation
HB(X) = (1/B) H(X^B). (4.4)

The entropy rate is defined as

H∞(X) = lim_{B→∞} (1/B) H(X^B) (4.5)
if this limit exists. We proceed to show that the limit does exist for a DSS
and that the sequence H1(X), H2(X), H3(X), . . . , has special properties related to the conditional entropy H(XB|X^{B−1}).
Theorem 4.1. For a DSS, the sequences H(XB|X^{B−1}) and HB(X) are non-increasing in B, they satisfy H(XB|X^{B−1}) ≤ HB(X) (property (4.6)), and

H∞(X) = lim_{B→∞} HB(X) = lim_{B→∞} H(XB|X^{B−1}) (4.9)

where property (4.9) means that both limits exist and are the same.
where the second inequality is due to (4.6). Finally, for property (4.9) we
note that both H(XB |X B−1 ) and HB (X) are lower-bounded by zero so that
limB→∞ H(XB |X B−1 ) and H∞ (X) must exist (the limit of a non-increasing
sequence that is lower bounded exists). Property (4.6) implies

lim_{B→∞} H(XB|X^{B−1}) ≤ H∞(X) (4.13)

and it remains to show that equality holds in (4.13). Consider a long string
X^{nB} where n is a large integer. We have

(1/(nB)) H(X^{nB}) = (1/(nB)) Σ_{i=1}^{nB} H(Xi|X^{i−1})
  = (1/(nB)) [ Σ_{i=1}^{B−1} H(Xi|X^{i−1}) + Σ_{i=B}^{nB} H(Xi|X^{i−1}) ]
  ≤ (1/(nB)) [ Σ_{i=1}^{B−1} H(Xi|X^{i−1}) + (nB − B + 1) H(XB|X^{B−1}) ] (4.14)
Property (4.9) means that we may use either HB (X) or H(XB |X B−1 ) to
estimate the entropy rate. In fact, property (4.6) suggests that using a
conditional encoding algorithm will perform better than a block-encoding
algorithm because the entropy is smaller in the former case.
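For the source of Example 4.1 these properties can be verified in closed form: the chain rule gives H(X^B) = 1 + (B − 1)H2(p), so HB(X) decreases to H∞(X) = H2(p) while H(XB|X^{B−1}) = H2(p) already for every B ≥ 2. A small Python check:

    from math import log2

    def H2(p):
        return 0.0 if p in (0, 1) else -p*log2(p) - (1 - p)*log2(1 - p)

    def HB(B, p):
        # H(X^B)/B for Example 4.1: H(X^B) = 1 + (B-1) H2(p) by the chain rule
        return (1 + (B - 1) * H2(p)) / B

    p = 0.1
    for B in (1, 2, 4, 16, 64):
        print(B, round(HB(B, p), 4))   # decreases toward H2(0.1), about 0.469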
where Li is the code word length when encoding block Vi . The Theorem on
Total Expectation thus gives
E[Li] ≥ H(Vi|V^{i−1}) / log2 D (4.20)
and this bound is valid for all i and all B. We prove the following result.
Theorem 4.2. (Coding Theorem for a DSS) The average code word
length E[L] of an optimum D-ary prefix-free code for a K-ary DSS that is
parsed into blocks of length B satisfies

E[L]/B ≥ H∞(X)/log2 D.

Proof. Consider (4.20) and expand

H(Vi|V^{i−1}) = Σ_{j=1}^{B} H(X_{(i−1)B+j} | X^{(i−1)B+j−1})
  ≥ B H(X_{iB}|X^{iB−1})
  ≥ B H∞(X) (4.22)

where the inequalities follow because H(Xn|X^{n−1}) is non-increasing in n and lower bounded by H∞(X).
This section develops a prefix-free code when the source has the discrete
but infinite alphabet X = {1, 2, 3, . . .} (see also [1]). We shall use this code
in the next section. Consider the usual binary representation of a positive
integer x:
x    U^L(x)    L(x)
1 1 1
2 10 2
3 11 2
4 100 3
5 101 3
6 110 3
7 111 3
8 1000 4
9 1001 4
This base-two (or radix-two) code is not prefix-free which is why we usually
add a space or a comma between code words to separate them. Similarly,
the usual ternary (or 4-ary or 10-ary, etc.) representation of the integers is
not prefix-free:
x    U^L(x)    L(x)
1 1 1
2 2 1
3 10 2
4 11 2
5 12 2
6 20 2
7 21 2
8 22 2
9 100 3
We would like to make these codes prefix-free. Observe that every code word begins with a non-zero digit. We can thus make the code prefix-free by pre-padding each string with L(x) − 1 zeros to get a code word of length

L1(x) = 2⌊log_D x⌋ + 1.
The decoder simply counts the number of zeros to determine how many
subsequent D-ary symbols it should consider after the non-zero first digit.
x    U^L(x)    L1(x)
1 1 1
2 0.10 3
3 0.11 3
4 00.100 5
5 00.101 5
6 00.110 5
7 00.111 5
8 000.1000 7
9 000.1001 7
A simple trick Elias used to shorten this code for large integers is to pre-pad differently. Instead of pre-padding with L(x) − 1 zeros, he suggested pre-padding with the above prefix-free code applied to L(x), i.e., with L1(L(x)) digits that specify the number of digits after the non-zero first digit. For D > 2 this gives the code word length

L2(x) = L(x) + 2⌊log_D L(x)⌋ + 1.

For large x we thus recover the size ⌊log_D x⌋ that we had for the usual D-ary representation of x. In fact, for D = 2 we can do slightly better because the only non-zero digit is a "1". We can therefore remove this bit to achieve the code word length

L2(x) = ⌊log2 x⌋ + 2⌊log2 L(x)⌋ + 1.
x    U^L(x)    L2(x)
1 1 1
2 010.0 4
3 010.1 4
4 011.00 5
5 011.01 5
6 011.10 5
7 011.11 5
8 00100.000 8
9 00100.001 8
The reader will observe that L2(x) is larger than L1(x) in the D = 2 tables for small x. For large x, however, the Elias code performs noticeably better than the zero-padding code. For example, for 16-bit Unicode characters and the "intermediate" number x = 2^15 we have L1(x) = 31 bits and L2(x) = 24 bits. We will consider Elias' code again when we develop a "universal" source coding algorithm.
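Both codes are easy to implement for D = 2; the Python sketch below reproduces the table entries and the lengths L1(2^15) = 31 and L2(2^15) = 24 quoted above.

    def elias_l1(x):
        # L(x)-1 zeros followed by the usual binary representation of x
        b = bin(x)[2:]
        return '0' * (len(b) - 1) + b

    def elias_l2(x):
        # encode L(x) with elias_l1, then append x's binary form without its leading 1
        b = bin(x)[2:]
        return elias_l1(len(b)) + b[1:]

    for x in (1, 2, 8, 2**15):
        print(x, elias_l1(x), elias_l2(x), len(elias_l1(x)), len(elias_l2(x)))
    # x = 2**15 gives lengths 31 and 24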
We now describe an algorithm for source coding when we do not know the
source statistics. We consider sources that are stationary and ergodic, where
ergodic means that time averages correspond to statistical averages.
Consider the DSS sequence X1 , X2 , X3 , . . .. Let ∆i be the time to the most
recent occurrence of the letter Xi . For example, consider the 4-ary source
with alphabet {a, b, c, d} and
X^8 = abacadab.

Suppose the string is initialized by prepending dcba. The recurrence times are

∆^8 = 1, 3, 2, 6, 2, 9, 2, 6.

Note that the first a has the prepended a occurring 1 time unit in the past, the first b has the prepended b occurring 3 time units in the past, the second a has the first a occurring 2 time units in the past, and so forth. Observe
that we can recover X^n from ∆^n and the initialization string. Furthermore, we can encode the sequence of integers by using the Elias code for the positive integers given in Sec. 4.4.
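The recurrence times are simple to compute; the Python sketch below reproduces ∆^8 for the example, assuming the initialization string dcba (the deltas listed above are consistent with this choice).

    def recurrence_times(seq, init):
        # Delta_i = time to the most recent occurrence of each letter,
        # with an initialization string prepended
        last = {a: pos for pos, a in enumerate(init)}
        deltas = []
        for i, a in enumerate(seq, start=len(init)):
            deltas.append(i - last[a])
            last[a] = i
        return deltas

    print(recurrence_times("abacadab", "dcba"))   # [1, 3, 2, 6, 2, 9, 2, 6]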
An important result that we state without proof is that the expected recur-
rence time of a stationary and ergodic source satisfies
E[∆i | Xi = a] = 1/PX1(a) (4.24)
Example 4.4. Consider a DMS PX for which the probability of waiting j symbols before seeing the letter a is (1 − PX(a))^{j−1} PX(a). The expected waiting time is therefore
E[∆i | Xi = a] = Σ_{j=1}^{∞} j (1 − PX(a))^{j−1} PX(a) = 1/PX(a) (4.25)
where (a) follows by the concavity of the logarithm and Jensen’s inequality.
where we have again used Jensen’s inequality. Finally, rather than encoding
the Xi we use a block parser and encode blocks of symbols Vi as in (4.18).
We may simply replace X1 with V1 of length B in (4.27) and we compute
Combining the above results, we have a source coding scheme that is uni-
versal for the class of stationary and ergodic sources and that compresses
to the DSS entropy rate.
References
[1] J. L. Massey. Applied Digital Information Theory. ETH Zurich, Zurich,
Switzerland, 1980-1998.
Channel Coding
Figure 5.1.: Three possible shapes for increasing (or more precisely non-
decreasing) curves.
Figure 5.2.: The channel coding problem: the channel is defined by a func-
tion f (·) and a noise source PZ (·).
Figure 5.3.: FDG for a memoryless channel with n = 3. The hollow vertices
represent mutually statistically independent random variables.
yi = f (xi , zi ), i = 1, 2, . . . , n (5.1)
for some function f (·) with range Y, and where each zi is a different re-
alization of a noise random variable Z with alphabet Z. In other words,
¹A FDG is a graph where the vertices represent random variables and the edges represent functional dependencies, see Sec. A.5.
Figure 5.4.: The channel coding problem: the channel is a conditional prob-
ability distribution PY |X (·).
f (x, z) = x + z (5.3)
[Figures: the BSC with crossover probability p, and a binary channel with crossover probabilities p0 and p1.]

For the BSC, the channel conditional probability distribution is

P(y|x) = { 1 − p   if y = x
         { p       if y ≠ x.    (5.4)
Example 5.2. An asymmetric version of the BSC has Z = {00, 01, 10, 11}
and
f(x, z0 z1) = { z0        if x = 0
             { 1 + z1    if x = 1

which we may write as f(x, z0 z1) = x + zx. (5.5)
[Figures: the binary erasure channel (BEC) with erasure probability p, and an erasure channel with input alphabet {0, 1, 2, 3}.]
Suppose next that the channel is again

f(x, z) = x + z (5.10)

but the alphabets are sets of real numbers and Z is a Gaussian random variable with zero mean; then we have an additive white Gaussian noise (AWGN) channel. Of course, an AWGN channel is not a DMC. We must therefore replace PY|X(y|x) in (5.2) with a channel conditional probability density function p(y|x). If Z has variance N then we have

p(y|x) = pZ(y − x) = (1/√(2πN)) e^{−(y−x)²/(2N)}. (5.11)
E[sn (X n , Y n )] ≤ S. (5.12)
The largest rate C as a function of the cost S is called the capacity cost
function, and is denoted C(S).
Consider the cost function s(x, y) = E · 1(x = 1), so that sending x = 1 costs E units of energy. This situation might occur for
an optical channel where transmitting a 1 represents light while transmitting
a 0 represents dark. A cost constraint with 0 ≤ S < E/2 will bias the best
transmission scheme towards sending the symbol 1 less often.
Example 5.8. A more stringent constraint than (5.12) is that the cost be
bounded with probability 1:
Pr [sn (X n , Y n ) ≤ S] = 1. (5.16)
Consider W = V^b where the Vi are independent bits with PVi(0) = PVi(1) = 1/2, i = 1, 2, . . . , b. We thus have M = 2^b and R = b/n bits per channel symbol. We have defined the channel coding problem by requiring that the block error probability

Pe = Pr[Ŵ ≠ W] (5.18)

be small. Define also the bit error probability Pb = (1/b) Σ_{i=1}^{b} Pr[V̂i ≠ Vi]. We then have

Pb (a)≤ Pe (b)≤ b Pb. (5.20)
The bound (a) is because a bit error implies a block error, and the bound
(b) is because a block error implies at least 1 bit error for the b bits. One
has equality on the left if all bits in an erroneous block are incorrect, and
one has equality on the right if exactly one bit is incorrect in each erroneous
block. The bounds (5.20) imply that if Pb is positive so is Pe . Similarly, if
Pe is small so is Pb . This is why coding theorems should upper bound Pe
and converse theorems should lower bound Pb . For example, a code with
large Pe can have small Pb .
nR < ( I(W; Ŵ) + H2(Pe) ) / (1 − Pe). (5.22)
²The bounds (5.22) and (5.25) are valid for any channel: discrete or continuous, memoryless or with memory. An improved lower bound on the block error probability Pe for memoryless channels is called a strong converse and is developed in Problem B.5.6.
where (a) follows by the concavity of H2 (·), and (b) by Fano’s inequality.
We continue the chain of inequalities by using H(Vi |V̂i ) ≥ H(Vi |V i−1 V̂ b ) so
that
H2(Pb) ≥ (1/b) Σ_{i=1}^{b} H(Vi|V^{i−1} V̂^b)
       = (1/b) H(V^b|V̂^b)
       = (1/b) [ H(V^b) − I(V^b; V̂^b) ]
       = 1 − I(W; Ŵ)/(nR). (5.24)
We thus have the following counterpart to (5.22):

nR ≤ I(W; Ŵ) / (1 − H2(Pb)). (5.25)
[Figure 5.9: Random coding: a code book of 2^{nR} code words X^n(1), . . . , X^n(2^{nR}), each generated with PX, used over the channel PY|X.]

Code Construction: Generate M = 2^{nR} code words x^n(w), w = 1, 2, . . . , 2^{nR}, by choosing the n · 2^{nR} code symbols xi(w) independently using PX.
Encoder: Given w, transmit xn (w).
Decoder: Given y^n, choose ŵ as (one of) the message(s) w that maximizes PY^n|X^n(y^n|x^n(w)). Below, 1(·) is the indicator function that takes on the value 1 if its argument is true and is 0 otherwise. We compute the error probability for this experiment. We have two error events
E = {Ŵ 6= W } (5.31)
F = {E[sn (X n , Y n )] > S} (5.32)
and the error probability can be written as the code book average
Pr[E ∪ F] = Σ_{x^n} P(x^n) Pr[ E ∪ F | X^n = x^n ]. (5.33)
E(w̃) = { PY^n|X^n(Y^n|X^n(w̃)) ≥ PY^n|X^n(Y^n|X^n(1)) }

for w̃ ≠ 1. The event {Ŵ ≠ 1} occurs only if E(w̃) occurs for at least one w̃ with w̃ ≠ 1. Furthermore, for such a w̃ the likelihood ratio

L(w̃) = PY^n|X^n(Y^n|X^n(w̃)) / PY^n|X^n(Y^n|X^n(1)) (5.36)

satisfies L(w̃) ≥ 1, and hence L(w̃)^s ≥ 1 for any s > 0. We thus have, for ρ ≥ 0,
Pr[Ŵ ≠ 1 | W = 1] (a)≤ Pr[ ( Σ_{w̃≠1} L(w̃)^s )^ρ ≥ 1 | W = 1 ]
                  (b)≤ E[ ( Σ_{w̃≠1} L(w̃)^s )^ρ | W = 1 ] (5.37)
where (a) is because A ⇒ B implies Pr [A] ≤ Pr [B], and (b) follows by the
Markov inequality (A.65).
Expanding the expectation over Y^n and X^n(1) gives

Pr[Ŵ ≠ 1 | W = 1] ≤ Σ_{x^n,y^n} P(x^n) P(y^n|x^n) E[ ( Σ_{w̃≠1} ( PY^n|X^n(y^n|X^n(w̃)) / P(y^n|x^n) )^s )^ρ ]
(a)
= Σ_{x^n,y^n} P(x^n) P(y^n|x^n)^{1−sρ} · E[ ( Σ_{w̃≠1} PY^n|X^n(y^n|X^n(w̃))^s )^ρ ] (5.38)

where (a) follows because the X^n(w̃), w̃ ≠ 1, are independent of X^n(1). Applying Jensen's inequality E[Z^ρ] ≤ E[Z]^ρ for 0 ≤ ρ ≤ 1, choosing s = 1/(1 + ρ), and noting that the code words are identically distributed, we obtain

Pr[Ŵ ≠ 1 | W = 1] ≤ (M − 1)^ρ Σ_{y^n} [ Σ_{x^n} P(x^n) P(y^n|x^n)^{1/(1+ρ)} ]^{1+ρ}. (5.40)
We remark that (5.40) is valid for channels with memory and any P (xn ). It is
also valid for continuous channels by replacing P (y n |xn ) with the conditional
density p(y n |xn ).
Using (M − 1)^ρ ≤ M^ρ = 2^{nRρ}, random coding with P(x^n) = Π_{i=1}^{n} PX(xi), and the memoryless property P(y^n|x^n) = Π_{i=1}^{n} PY|X(yi|xi), the bound (5.40) becomes

Pr[Ŵ ≠ 1 | W = 1] ≤ 2^{−n[ E0(ρ,PX) − ρR ]} (5.41)

where E0(ρ, PX) = −log2 Σ_y [ Σ_x PX(x) PY|X(y|x)^{1/(1+ρ)} ]^{1+ρ} is Gallager's function.
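For finite alphabets, E0(ρ, PX) and the resulting exponent are easy to evaluate numerically. The Python sketch below does this for a BSC with uniform input; the grid search over ρ ∈ [0, 1] is a simple stand-in for the exact optimization.

    from math import log2

    def E0(rho, PX, PYX):
        # Gallager's function: -log2 sum_y (sum_x PX(x) P(y|x)^(1/(1+rho)))^(1+rho)
        ys = set(y for row in PYX.values() for y in row)
        return -log2(sum(
            sum(PX[x] * PYX[x][y] ** (1 / (1 + rho)) for x in PX) ** (1 + rho)
            for y in ys))

    def EG(R, PX, PYX, grid=1000):
        # Gallager exponent: max over 0 <= rho <= 1 of E0(rho, PX) - rho*R
        return max(E0(i / grid, PX, PYX) - (i / grid) * R for i in range(grid + 1))

    p = 0.11
    PX = {0: 0.5, 1: 0.5}
    PYX = {0: {0: 1 - p, 1: p}, 1: {0: p, 1: 1 - p}}
    print(E0(1, PX, PYX))     # the intercept E0(1, PX)
    print(EG(0.25, PX, PYX))  # positive since 0.25 < I(X;Y) = 1 - H2(0.11), about 0.5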
One may prove the following properties of EG (R, PX ) (see Problem B.5.5):
a) EG (0, PX ) = E0 (1, PX ).
³The Gallager exponent optimized over PX is written as EG(R) = max_{PX} EG(R, PX).
[Figure 5.10: The Gallager exponent EG(R, PX) versus R, showing the intercept E0(1, PX), the critical rate Rc(PX), and the zero at I(X; Y).]
Rc(PX) = ∂E0(ρ, PX)/∂ρ |_{ρ=1}. (5.45)
where 1(·) is the indicator function that takes on the value 1 if its argument is true and is zero otherwise. We must therefore choose PX so that E[sn(X^n, Y^n)] ≤ S over the ensemble of code books. We easily compute
E[sn(X^n(W), Y^n)] = (1/n) Σ_{i=1}^{n} E[s(Xi(W), Yi)] = E[s(X, Y)]. (5.48)
E[s(X, Y )] ≤ S. (5.49)
Combining the above results, for large n there is a code in the random ensemble of codes that has small Pe, expected cost at most S, and a rate R close to I(X; Y).

Suppose, then, that we have found a code with M code words for which

Pr[E ∪ F] ≤ δ. (5.52)
For example, we might have found this code by using the random coding
method described above. We may write
Pr[E ∪ F] = Σ_{w=1}^{M} P(w) Pr[ E ∪ F | W = w ]. (5.53)
Now re-index the codewords by ordering them so that the error probability
increases as w increases, i.e., re-index so that
Pr [E ∪ F | W = w1 ] ≤ Pr [E ∪ F | W = w2 ] (5.54)
if w1 ≤ w2 . Now consider the code book having only the first M/2 messages
(suppose M is even) of the re-indexed code book. The rate is reduced by
1/n bits as compared to the original code book, but the maximum error
probability satisfies
Pm = max_{1≤w≤M/2} Pr[ E | W = w ]
   ≤ max_{1≤w≤M/2} Pr[ E ∪ F | W = w ]
   = Pr[ E ∪ F | W = M/2 ]
   (a)≤ 2δ (5.55)
where (a) follows by (5.53) since no more than half of the terms in the sum
can be larger than 2δ. In other words, by expurgating our codes we obtain
codes with a maximum error probability as close to zero as desired and rates
as close to C(S) as desired.
[Figure 5.11: A capacity-cost function C(S) with points S1 < S3 < S2 and values C(S1), C(S3), C(S2).]
Consider two distinct points (S1 , C(S1 )) and (S2 , C(S2 )) and suppose the
distributions PX1 and PX2 achieve these respective points (see Fig. 5.11).
That is, we have C(Sk) = I(Xk; Yk) and E[s(Xk, Yk)] ≤ Sk for k = 1, 2, where Y1 and Y2 are the respective outputs of the channel PY|X when the input is X1 and X2. Consider the mixture distribution

PX3 = λ PX1 + (1 − λ) PX2, 0 ≤ λ ≤ 1,

and set S3 = λ S1 + (1 − λ) S2, so that E[s(X3, Y3)] ≤ S3.
We thus have

C(λ S1 + (1 − λ) S2) = C(S3)
  (a)≥ I(X3; Y3)
  (b)≥ λ I(X1; Y1) + (1 − λ) I(X2; Y2)
     = λ C(S1) + (1 − λ) C(S2) (5.59)
where (a) follows because PX3 might not give the maximum mutual information for the cost S3, and (b) follows by the concavity of I(X; Y) in PX when PY|X is held fixed (see Thm. 1.11). The bound (5.59) means that C(S) is concave in S.
We now show that C(S) in (5.50) is the capacity cost function. We have
I(W; Ŵ) (a)≤ I(X^n; Y^n)
         = Σ_{i=1}^{n} [ H(Yi|Y^{i−1}) − H(Yi|Xi) ]
         ≤ Σ_{i=1}^{n} [ H(Yi) − H(Yi|Xi) ]
         = Σ_{i=1}^{n} I(Xi; Yi) (5.60)

where (a) follows by the data processing inequality, and the first equality uses the chain rule and the memoryless property of the channel.
Inserting (5.61) into (5.60) and then into (5.22) and (5.25) we obtain R ≤ C
for small Pe and Pb .
More generally, with a cost constraint we must use the concavity of C(S) in
S. We have I(Xi ; Yi ) ≤ C(E[s(Xi , Yi )]) because C(E[s(Xi , Yi )]) maximizes
mutual information for the cost E[s(Xi , Yi )]. We thus have
Σ_{i=1}^{n} I(Xi; Yi) ≤ Σ_{i=1}^{n} C(E[s(Xi, Yi)])
  = n Σ_{i=1}^{n} (1/n) C(E[s(Xi, Yi)])
  (a)≤ n C( (1/n) Σ_{i=1}^{n} E[s(Xi, Yi)] )
  = n C(E[sn(X^n, Y^n)])
  (b)≤ n C(S) (5.62)
where (a) follows by the concavity of C(S) and (b) follows because we require
E[sn (X n , Y n )] ≤ S and because C(S) is non-decreasing in S. Inserting
(5.62) into (5.60) and then into (5.22) and (5.25) we have

R < ( C(S) + H2(Pe)/n ) / (1 − Pe) (5.63)

and

R ≤ C(S) / (1 − H2(Pb)). (5.64)
Thus, we find that R can be at most C(S) for reliable communication and
E[sn (X n , Y n )] ≤ S.
[Figure: A capacity-cost function C(S) plotted for 0 ≤ S ≤ 0.5.]
C = 1 − H2 (p). (5.66)
The binary erasure channel (BEC) has X = {0, 1}, Y = {0, 1, ∆}, and
Pr [Y = X] = 1−p and Pr [Y = ∆] = p. For no cost constraint, we compute
C = 1 − p. (5.70)
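The two capacities (5.66) and (5.70) in a few lines of Python:

    from math import log2

    def H2(p):
        return 0.0 if p in (0, 1) else -p*log2(p) - (1 - p)*log2(1 - p)

    def C_bsc(p):
        return 1 - H2(p)    # (5.66)

    def C_bec(p):
        return 1 - p        # (5.70)

    for p in (0.0, 0.11, 0.5):
        print(p, round(C_bsc(p), 4), C_bec(p))
    # C_bsc(0.11) is about 0.5; the BSC capacity vanishes at p = 1/2, the BEC's at p = 1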
is the same for all x. Hence H(Y |X) is given by (5.71) and determining the
capacity-cost function reduces to a constrained maximum-entropy problem:
" #
C(S) = max H(Y ) − H(Y |X). (5.72)
PX : E[s(X,Y )]≤S
On the other hand, uniformly focusing channels have the special property
that a uniform PX results in a uniform PY . To see this, observe that if
P(x) = 1/|X| for all x then we have

P(y) = Σ_{x∈X} (1/|X|) P(y|x) (5.73)

which is the same for all y by the uniform focusing property. Thus, PY is uniform which maximizes H(Y).
A channel is said to be strongly symmetric if it is both uniformly dispersive and uniformly focusing. In the absence of a cost constraint, this means that a uniform PX achieves capacity and C = log2 |Y| − H(Y|X = x), where H(Y|X = x) is the same for every x.
[Figure 5.13: Decomposition of the BEC into a noiseless binary sub-channel with outputs Y1 = {0, 1} (chosen with probability 1 − p) and an erasure sub-channel with output Y2 = {∆} (chosen with probability p).]
The decomposition is shown in Fig. 5.13. Observe that each input symbol
X goes through either a BSC with crossover probability 0 (Channel 1) or
through a channel that maps both inputs to ∆ with probability 1 (Channel
2). The former channel is chosen with probability 1 − p and the latter with
probability p. Observe that Y specifies which sub-channel was chosen.
More generally, if a channel decomposes into L sub-channels, where sub-channel i has capacity Ci and is chosen with probability qi and the output identifies the chosen sub-channel, then

C = Σ_{i=1}^{L} qi Ci. (5.75)
For instance, suppose we have √P ≤ 2. We choose L = 4 so that Ỹ takes on values in the set Ỹ = {−3, −1, 1, 3} (see Fig. 5.14 where the quantizer boundaries are shown with dashed lines). There are two sub-channels whose
[Figure 5.14: The densities pY|X(y | −1.5) and pY|X(y | 1.5), with the quantizer boundaries shown as dashed lines.]
where
Q(x) = ∫_x^∞ (1/√(2π)) e^{−y²/2} dy. (5.79)
Y =X +Z (5.80)
At this point, we have not shown that the capacity-cost function (5.50) gives
a rate that we can approach for continuous-alphabet channels. However,
one can check that the arguments in Sec. 5.5 extend to AWGN and other
continuous-alphabet channels. We compute
C(P) = (1/2) log( 1 + P/N ). (5.83)
[Figure 5.15: C(P) for the AWGN channel ("Gauss") and for BPSK.]
where

PX|Y(−√P | y) = PX(−√P) pY|X(y | −√P) / p(y)
             = (1/(2 p(y))) · (1/√(2πN)) e^{−(y+√P)²/(2N)}. (5.86)
Y =X +Z (5.87)
C(P) = log( 1 + P/N ). (5.89)
C(P ) again depends only on the SNR and the capacity is twice that of the
real case (5.83) where E[X 2 ] ≤ P and E[Z 2 ] = N . We achieve C(P ) by
choosing XR and XI to be independent Gaussian random variables with
zero-mean and variance P/2.
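The BPSK curve in Fig. 5.15 can be reproduced by numerical integration: with X uniform on {±√P}, the mutual information is I(X; Y) = h(Y) − h(Z), where pY is a two-component Gaussian mixture. A rough Python sketch follows (the integration range and step count are ad hoc choices):

    from math import e, exp, log2, pi, sqrt

    def c_awgn(P, N=1.0):
        return 0.5 * log2(1 + P / N)                  # (5.83)

    def c_bpsk(P, N=1.0, lim=12.0, steps=100000):
        s = sqrt(P)
        def phi(y, m):                                # Gaussian density, variance N
            return exp(-(y - m) ** 2 / (2 * N)) / sqrt(2 * pi * N)
        dy, hY = 2 * lim / steps, 0.0
        for i in range(steps):
            y = -lim + (i + 0.5) * dy
            pY = 0.5 * phi(y, s) + 0.5 * phi(y, -s)   # output density
            if pY > 0:
                hY -= pY * log2(pY) * dy
        return hY - 0.5 * log2(2 * pi * e * N)        # h(Y) - h(Z)

    print(c_awgn(1.0), c_bpsk(1.0))   # about 0.5 vs 0.486: BPSK is near-optimal at low SNR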
The capacity-cost function C(P ) is clearly concave in P . However, there
are two different and more common ways of plotting C(P ). First, one
considers a bandlimited channel with bandwidth W Hz and transmits 2W
real symbols per second that are represented as W complex symbols per
second.

Figure 5.16.: C(Es/N0) for an AWGN channel in terms of (Es/N0)dB. The rate units are bits per channel symbol.

The received signal is sampled at the rate W complex symbols per
second. If the transmit signal has a power constraint P and the noise is
modeled as AWGN with power N = N0 W , where N0 is a constant with
units of Watts/Hz, then the capacity is
C̃(P) = W log2( 1 + P/(N0 W) ) bits/second. (5.90)
The tilde on C̃(P ) emphasizes that the units are bits per second. The
transmit energy with transmit intervals of length 1/W seconds is Es = P/W
Joules. We can also express the capacity in units of bits per second per Hertz
(which are units of spectral efficiency) or in units of bits per symbol:
C(Es/N0) = log2( 1 + Es/N0 ) bits/symbol. (5.91)
The expression (5.91) is basically the same as (5.89) but now the SNR is measured in terms of an energy ratio rather than a power ratio. One usually plots C(Es/N0) by measuring the SNR in decibels:

(Es/N0)dB = 10 log10(Es/N0). (5.92)

The resulting curve is labeled “Gauss” in Fig. 5.16. The two other curves
show the capacities of BPSK and quaternary phase shift keying (QPSK),
respectively, where QPSK has the modulation alphabet
X = { √P, √P j, −√P, −√P j }. (5.93)
[Figure 5.17: Capacity for an AWGN channel in terms of (Eb/N0)dB; the curves show the Gaussian input ("Gauss"), QPSK, and BPSK. The rate units are bits per channel symbol.]
The QPSK capacity is computed by doubling the BPSK capacity and then
shifting the curve to the right by 3 dB (since the energy per dimension is
Es /2). Both modulations almost achieve capacity at low SNR.
Yet another way of plotting the capacity is to measure the SNR in terms of the energy per information bit: Eb = Es/R, where R is the rate in bits per symbol. We require

R ≤ log2(1 + Es/N0) = log2(1 + (Eb/N0) · R) (5.94)

so we say that the capacity is the largest R that satisfies (5.94). One usually plots this capacity against (Eb/N0)dB = 10 log10(Eb/N0). The resulting curve is shown in Fig. 5.17. Observe that the smallest Eb/N0 as R → 0 is Eb/N0 = ln(2), since (2^R − 1)/R → ln(2) as R → 0; this is approximately −1.59 dB.
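A quick numerical check of this limit: for each R, the smallest Eb/N0 satisfying (5.94) with equality is (2^R − 1)/R. A short Python sketch (the rate values are illustrative):

import numpy as np

# Minimum Eb/N0 in dB for rate R: Eb/N0 = (2^R - 1)/R, from eq. (5.94).
for R in [2.0, 1.0, 0.5, 0.1, 0.01, 0.001]:
    ebno = (2.0**R - 1.0) / R
    print(R, 10 * np.log10(ebno))
print(10 * np.log10(np.log(2)))   # limiting value as R -> 0: about -1.59 dB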
Yi = hi Xi + Zi , i = 1, 2, . . . , L, (5.97)
where the hi are constants and the Zi are Gaussian random variables with
zero-mean and variance Ni . The Zi , i = 1, 2, . . . , L, are mutually statisti-
cally independent of each other and of the input variables Xi , i = 1, 2, . . . , L.
For example, the complex Gaussian channel treated in Sec. 5.8.3 effectively
has L = 2, h1 = h2 = 1, and N1 = N2 = N/2. We may write the overall
channel in vector form as
Y = HX + Z (5.98)
where (a) follows by the maximum entropy result (2.48) (see Problem B.5.4).
We have thus arrived at a power allocation problem.
We may solve the problem by using standard optimization methods. The
problem is concave in P = [P1 , P2 , . . . , PL ], and the Lagrangian is
" L !# " L # " L
#
1 h2 Pi
log 1 + i
X X X
Λ= + γi Pi + λ P − Pi (5.101)
i=1 2 Ni i=1 i=1
γi ≥ 0, Pi ≥ 0, γi · Pi = 0, i = 1, 2, . . . , L
L L
" #
X X
λ ≥ 0, Pi ≤ P, λ· P − Pi = 0
i=1 i=1
∂Λ 1 1
= + γi − λ = 0, i = 1, 2, . . . , L (5.102)
∂Pi 2 Ni /h2i + Pi
[Figure: Waterfilling for L = 4 sub-channels with P = 2 and water level 1/(2λ) = 1.25; the ground levels are Ni/hi².]
where we have used the natural logarithm for the derivative. The γi are slack variables that we may choose to achieve equality in the last L conditions above. Furthermore, we cannot choose λ = 0, so we must have Σ_{i=1}^{L} Pi = P. Now if λ > hi²/(2Ni), or equivalently 1/(2λ) < Ni/hi², then (5.103) requires Pi = 0. We must thus choose
Pi = max( 1/(2λ) − Ni/hi², 0 ), i = 1, 2, . . . , L. (5.105)
This solution is often called waterfilling because one visualizes pouring water
up to a level 1/(2λ) when the ground level is at Ni /h2i . One stops pouring
once the water “volume” is P . The capacity is given by (5.100), namely
C(P) = Σ_{i=1}^{L} (1/2) log(1 + (hi²/Ni) · Pi). (5.106)
C(2) = (1/2) log(1 + 4P1) + (1/2) log(1 + 2P2) + (1/2) log(1 + P4)
     = (1/2) log(5) + (1/2) log(2.5) + (1/2) log(1.25)
     ≈ 1.98 bits. (5.107)
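The waterfilling rule (5.105) is easy to implement with a bisection search for the water level. The Python sketch below reproduces the example; the ground levels Ni/hi² = 1/4, 1/2, 3/2, 1 are inferred from (5.107) and the figure, so treat them as an assumption.

import numpy as np

def waterfill(ground, P, tol=1e-12):
    """Waterfilling power allocation, eq. (5.105):
    Pi = max(mu - Ni/hi^2, 0), with the water level mu = 1/(2*lambda)
    chosen by bisection so that sum(Pi) = P. ground[i] is Ni/hi^2."""
    g = np.asarray(ground, dtype=float)
    lo, hi = g.min(), g.max() + P        # the water level lies in this range
    while hi - lo > tol:
        mu = 0.5 * (lo + hi)
        if np.maximum(mu - g, 0.0).sum() > P:
            hi = mu                      # too much water: lower the level
        else:
            lo = mu
    return np.maximum(0.5 * (lo + hi) - g, 0.0)

g = [0.25, 0.5, 1.5, 1.0]               # assumed ground levels Ni/hi^2
Pi = waterfill(g, 2.0)
C = sum(0.5 * np.log2(1 + p / gi) for p, gi in zip(Pi, g))
print(Pi, C)    # [1.0, 0.75, 0.0, 0.25] and about 1.98 bits, cf. (5.107)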
Pi = max( 1/λ − Ni/|hi|², 0 ), i = 1, 2, . . . , L. (5.108)
Y = HX + Z (5.110)
where (a) follows by the maximum entropy result (2.48). Suppose H has
the singular-value decomposition H = UDV† where U and V are unitary
matrices (with UU† = I and VV† = I) and where D is a real, diagonal
nr × nt matrix with the singular values of H on the diagonal. We write
(5.111) as
The capacity is

C(P) = Σ_{i=1}^{min(nt,nr)} log(1 + di² · Pi). (5.114)
Y = [H X + Z, H] (5.115)
where the last step follows by the maximum entropy result (2.48), i.e.,
a Gaussian X with zero mean and variance P simultaneously maximizes
h(aX + Z) for all values of a.
If the fading coefficient H does not have a density then the analysis hardly
changes. For example, suppose that H is a discrete random variable with:
PH (1/2) = 1/4, PH (1) = 1/2, and PH (2) = 1/4. The capacity is
C(P) = (1/4) log(1 + P/(4N)) + (1/2) log(1 + P/N) + (1/4) log(1 + 4P/N). (5.117)
We remark that QPSK may perform poorly for fading channels because the
information rate I(X; aX + Z) saturates at 2 bits per symbol even if the
fading coefficient H = a has a large amplitude. Thus, for fading channels it
can be important to use a large modulation set X to approach the capacity
(5.116) that is achieved with a Gaussian distribution.
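As a numerical check of (5.117), the following Python sketch averages the AWGN capacity over the fading distribution (the SNR value P/N = 10 is an illustrative choice):

import numpy as np

def fading_capacity(P, N=1.0):
    """Eq. (5.117): average over H of log2(1 + h^2 P / N) for the
    discrete fading law PH(1/2) = 1/4, PH(1) = 1/2, PH(2) = 1/4."""
    law = [(0.25, 0.5), (0.5, 1.0), (0.25, 2.0)]   # (probability, h)
    return sum(q * np.log2(1 + h**2 * P / N) for q, h in law)

print(fading_capacity(10.0))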
m < [I(U^m; Û^m) + H2(Pe)] / [H(U) − Pe log2 |U|]
  (a)≤ [I(X^n; Y^n) + H2(Pe)] / [H(U) − Pe log2 |U|]
  (b)≤ [n C(S) + H2(Pe)] / [H(U) − Pe log2 |U|] (5.119)
where (a) follows by the data processing inequality and (b) by (5.60)-(5.62).
Now if Pe → 0 then we have the desired bound
m/n ≤ C(S)/H(U) (5.120)
source symbols per channel symbol. Thus, separate source and channel
coding achieves the best possible performance (if we can permit n → ∞).
It turns out that there are source codes that can compress coin-tossing bits
U m to m(1 − H2 (Pb )) bits from which we can recover a string Û m that has
an average bit error probability Pb . But we can send the m(1 − H2 (Pb ))
bits reliably (with average bit error probability near zero) over the channel
[Figure 5.19: Average bit error probability Pb as a function of (Eb/N0)dB for a complex AWGN channel, with curves for R = 1 and R = 2. The blue curves are lower bounds on the best possible Pb given R.]
R = C(S) / (1 − H2(Pb)) (5.122)

Eb/N0 ≥ (1/R) · (2^{R(1−H2(Pb))} − 1). (5.123)
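The bound (5.123) is straightforward to evaluate. The following Python sketch computes the smallest (Eb/N0)dB permitting a target Pb at rates R = 1 and R = 2, which is the quantity plotted by the blue curves in Fig. 5.19 (the Pb values are illustrative):

import numpy as np

def h2(p):
    """Binary entropy function in bits, with h2(0) = h2(1) = 0."""
    return 0.0 if p in (0.0, 1.0) else -p*np.log2(p) - (1-p)*np.log2(1-p)

def min_ebno_db(R, Pb):
    """Smallest Eb/N0 (in dB) allowed by eq. (5.123) at rate R and
    average bit error probability Pb."""
    return 10 * np.log10((2.0**(R * (1 - h2(Pb))) - 1.0) / R)

for Pb in [1e-1, 1e-2, 1e-3]:
    print(Pb, min_ebno_db(1.0, Pb), min_ebno_db(2.0, Pb))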
[Figure 5.20: A message source produces W, an encoder maps W and (via a delay element) past channel outputs to Xn, the channel PY|X(·) produces Yn, and a decoder delivers the estimate Ŵ to the sink.]

[Figure 5.21: FDG for a memoryless channel with feedback, with inputs X1, X2, X3, noise variables Z1, Z2, Z3, and outputs Y1, Y2, Y3. The message estimate Ŵ is not shown.]
5.10. Feedback
I(W; Ŵ) ≤ I(W; Y^n)
  = Σ_{i=1}^{n} [H(Yi|Y^{i−1}) − H(Yi|W Y^{i−1})]
  (a)= Σ_{i=1}^{n} [H(Yi|Y^{i−1}) − H(Yi|W Y^{i−1} Xi)]
  (b)= Σ_{i=1}^{n} [H(Yi|Y^{i−1}) − H(Yi|Xi)]
  ≤ Σ_{i=1}^{n} I(Xi; Yi) (5.126)
i=1
where (a) follows because Xi is a function of W and Y i−1 , and (b) is because
the channel is memoryless. We have thus arrived at (5.60) and find the
surprising result that feedback does not improve the capacity-cost function
of a discrete memoryless channel [1, 2]. However, we emphasize that feed-
back can help to reduce complexity and improve reliability, as the following
example shows. Feedback can also increase the capacity of channels with
memory and the capacity of multi-user channels.
Example 5.13. Consider the BEC with feedback of the channel outputs. The encoder can use a variable-length encoding scheme where each message bit is repeated until the decoder receives this bit without erasure. The number N of times that each bit must be transmitted has the geometric distribution

Pr[N = n] = p^{n−1}(1 − p), n = 1, 2, 3, . . .

whose mean is E[N] = 1/(1 − p). The average rate R is the number of in-
formation bits transmitted divided by the average number of trials, giving
R = 1 − p = C. The benefit of feedback is therefore not in increasing ca-
pacity but in giving a simple (variable-length) coding method that achieves
capacity. In fact, the error probability is zero if one permits infinite delay.
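A short simulation illustrates the example. This is a sketch under the idealized assumption of noiseless, instantaneous feedback; p and the number of bits are free choices.

import random

def simulate_bec_feedback(p, num_bits, seed=0):
    """Repeat each bit over a BEC(p) until it arrives unerased; with
    ideal feedback the encoder knows when that happens. Returns the
    empirical rate in bits per channel use, which approaches C = 1 - p."""
    rng = random.Random(seed)
    uses = 0
    for _ in range(num_bits):
        while True:
            uses += 1
            if rng.random() >= p:   # transmission not erased
                break
    return num_bits / uses

print(simulate_bec_feedback(0.3, 100000))   # close to 0.7 = 1 - p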
References

[1] C. E. Shannon. The zero error capacity of a noisy channel. IRE Trans. Inf. Theory, 2(3):8–19, September 1956. Reprinted in Claude Elwood Shannon: Collected Papers, pp. 221–238 (N.J.A. Sloane and A.D. Wyner, eds.), Piscataway: IEEE Press, 1993.

[2] R. L. Dobrushin. Information transmission in a channel with feedback. Theory Probab. Appl., 3(4):367–383, 1958.
Discrete Probability
Pr [Ω] = 1 (A.1)
Pr [A ∪ B] = Pr [A] + Pr [B] if A ∩ B = ∅. (A.2)
where we have written Pr [ωi ] as a shorthand for Pr [{ωi }]. The complement
Ac (or Ā) of event A is the set of all ωi that are not in A.
We thus have

Pr[A ∪ B] ≤ Pr[A] + Pr[B], (A.5)

which is known as the union bound. Equality holds in (A.5) if and only if Pr[A ∩ B] = 0 (this does not necessarily mean that A ∩ B = ∅).
[Figure A.1: Sample space Ω with events A and B.]
The conditional probability of the event B given the occurrence of the event
A with Pr [A] > 0 is
Pr[B|A] = Pr[A ∩ B] / Pr[A]. (A.8)
Thus, if Pr [A] > 0 then using (A.8) the events A and B are independent if
Pr [B|A] = Pr [B].1
Example A.2. Consider our fair die and the events A = {1, 3, 5} and
B = {1, 2} in Fig. A.1. We find that (A.6) is satisfied so A and B are
independent. We further have Pr [A] > 0 and compute
Pr[B|A] = Pr[{1}] / Pr[A] = (1/6)/(1/2) = 1/3. (A.9)
¹The reader may now wonder what happens if Pr[A] = 0. We then have Pr[A ∩ B] = 0 so that (A.6) is satisfied. Thus, if Pr[A] = 0 then A and B are always independent.
Example A.3. Consider the sample space of Example A.1 and choose
X = {odd, even}. We define the mapping X(·) as follows:
We thus have
We further have
The value “undefined” is sometimes chosen as a number in the interval [0, 1].
An alternative (but indirect) way of defining PY |X (·) is as any function from
X(Ω) × Y (Ω) into the interval [0, 1] such that
for all (a, b) ∈ X(Ω) × Y (Ω). This alternative approach recovers the usual
definition if PX (a) > 0, and it motivates choosing PY |X (b|a) to be “unde-
fined” otherwise.
for all (a, b, c) ∈ X(Ω) × Y (Ω) × supp(PZ ). Thus, we find that X and Y are
statistically independent conditioned on Z if and only if
X −Z −Y (A.27)
[Figure: a channel with inputs a1, a2 and outputs b1, b2, b3, b4; the noise distributions are PZ1 and PZ2.]
Ỹ = f (X, Z) (A.28)
for some function f (·) and some random variable Z that is statistically
independent of X and that has alphabet Z of size at most |Z| = |X | · |Y|.
Ỹ = Zindex(X) (A.29)
makes X Ỹ have the joint distribution PXY(·). Indeed, by construction we have
The purpose of the above exercise is to show that we may as well represent
any channel PY |X (·) by a functional relation Y = f (X, Z) where the “noise”
Z is independent of the channel input X. This result forms the basis of the
ideas in the next section.
We choose Ỹ = Z1 if X = 0 and Ỹ = Z2 if X = 1.
Example A.5. Consider Example A.4 with p00 = p11 and p01 = p10 . The
symmetry lets us simplify f (·) and Z. We may define Ỹ = X ⊕ Z where
PX (0) = PX (1) = 1/2, Z is independent of X, and PZ (0) = p00 + p11 ,
PZ (1) = 1 − PZ (0). We see that |Z| = 2 suffices in this case.
[Figure A.3: FDG for the first three uses of a channel with feedback; the inputs are X1, X2, X3, the noise variables are Z1, Z2, Z3, and the outputs are Y1, Y2, Y3.]
Example A.6. Fig. A.3 depicts the FDG for the first three uses of a
channel with feedback. In this graph the channel input symbol Xi , i =
1, 2, 3, is a function of the message W and the past channel outputs Y i−1 .
We have drawn the feedback links using dashed lines to emphasize the
role that feedback plays. The output Yi is a function of Xi and a noise
random variable Zi. The graph has N_RV = 6 random variables defined by S_RV = 4 independent random variables. The S_RV vertices representing the independent W, Z1, Z2 and Z3 are distinguished by drawing them with a hollow circle.
The precise structure of FDGs lets one establish the conditional statistical independence of sets of random variables by using graphical procedures called d-separation and fd-separation (“d” for dependence and “fd” for functional dependence). By d-separation we mean the following.
A.6. Expectation
The conditional expectation of f(X) given that the event A with Pr[A] > 0 occurred is

E[f(X)|A] = Σ_{a: Pr[{X=a}∩A]>0} Pr[X = a|A] f(a) (A.36)

and we can expand (A.36) as

E[f(X)|A] = Σ_{i: Pr[Bi∩A]>0} Pr[Bi|A] E[f(X)|Bi ∩ A]. (A.40)
We thus have Var[X] = Cov[X, X]. We say that X and Y are uncorrelated
if Cov[X, Y ] = 0 or, alternatively, if E[XY ] = E[X] E[Y ].
If the {Xi}_{i=1}^{n} are pairwise uncorrelated then (A.45) equals Σ_{i=1}^{n} Var[Xi].
as long as the variances are positive; otherwise we say ρ = 0. One can show
that −1 ≤ ρ ≤ 1.
If the {X_i}_{i=1}^{n} are pairwise uncorrelated then (A.48) equals Σ_{i=1}^{n} Q_{X_i}.
X̃i = (Xi − E[Xi]) / √(Var[Xi]) (A.50)
We similarly define Var[Y |X] as the random variable that takes on the value
Var[Y |X = a] when X = a.
We next list several simple properties of conditional expectation.
• E[f (X)|X] = f (X)
• E[E[Y |X]] = E[Y ] and more generally E[E[Y |XZ] |Z] = E[Y |Z]
• E[Y Z] = E[Y E[Z|Y ]] and more generally E[Y Z|X] = E[Y E[Z|XY ] |X]
• If Y − X − Z forms a Markov chain then E[Z|XY ] = E[Z|X]
• If Y − X − Z forms a Markov chain then E[Y Z|X] = E[Y |X] E[Z|X]
• Var[Y|X] = E[Y²|X] − E[Y|X]² and E[Var[Y|X]] = E[Y²] − E[E[Y|X]²]
An important property of conditional expectation concerns minimum-mean
square-error (MMSE) estimation. Consider the error S = X − X̂ and
suppose that, given Y = y, we wish to find the estimate X̂(y) that minimizes
Var[S|Y = y] = E[(X − X̂(y))² | Y = y]. (A.52)
Pr[X ≥ a] ≤ E[X]/a. (A.65)
If f (·) is strictly convex (or concave) then equality holds in (A.70) (or
(A.71)) if and only if X is a constant.
Instead of proving Theorem A.3, we prove a more general result for vectors.
Consider a real-valued function f (·) whose domain is a non-empty convex
set S in the n-dimensional vector space Rn . We say that f (·) is convex (or
convex-∪) on S if, for every interior point x0 of S, there exists a real vector
m (that may depend on x0 ) such that
If f (·) is strictly convex (or concave) then equality holds in (A.74) (or
(A.75)) if and only if X is a constant.
We clearly have E[Sn] = E[X] and Var[Sn] = Var[X]/n. Applying Tchebycheff's inequality (A.66), we have

Pr[|Sn − E[X]| ≥ ε] ≤ Var[X]/(nε²). (A.77)
Alternatively, we have

Pr[|Sn − E[X]| < ε] ≥ 1 − Var[X]/(nε²) (A.78)
Pr[ |N(aj|X^n)/n − PX(aj)| < ε ] ≥ 1 − PX(aj)(1 − PX(aj))/(nε²). (A.82)
In other words, for large n the letter aj occurs about nPX (aj ) times in X n .
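A quick empirical illustration of (A.82); the alphabet and distribution are illustrative choices.

import random
from collections import Counter

# For large n, letter aj occurs about n*PX(aj) times in X^n.
rng = random.Random(1)
alphabet, probs = ['a', 'b', 'c'], [0.5, 0.3, 0.2]
n = 100000
xn = rng.choices(alphabet, weights=probs, k=n)
counts = Counter(xn)
for a, p in zip(alphabet, probs):
    print(a, counts[a] / n, p)   # relative frequencies vs. PX(aj)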
" #
[
Pr sup |Si − E[X] | ≥ = Pr {|Si − E[X] | ≥ }
i≥n i≥n
∞
(a) X
≤ Pr [|Sn − E[X] | ≥ ]
i=n
(b) 2β n
≤ (A.88)
1−β
where (a) follows by the union bound, and (b) follows by the Chernoff bound
(A.87). Note that step (b) does not work with the Tchebycheff bound (A.77)
because the probabilities decrease only as 1/n rather than exponentially
with n. The bound (A.88) implies the Strong Law of Large Numbers, namely
" #
lim Pr sup |Si − E[X] | < = 1 .
n→∞
(A.89)
i≥n
A.15. Problems
Consider a 3-dimensional sphere centered at the origin and label the points
on its surface by using the spherical coordinates longitude φ, −π ≤ φ < π,
and latitude θ, −π/2 ≤ θ ≤ π/2. Suppose X is a point that is uniformly
distributed on the sphere, i.e., we write X(φ, θ) and consider the joint den-
sity
pΦΘ(φ, θ) = (1/(4π)) cos θ. (A.90)
a) Determine the marginal densities pΦ (·) and pΘ (·).
b) Determine the density of a point on the great circle defined by θ = 0,
i.e., determine pΦ|Θ (·|0).
c) Determine the density of a point on the great circle defined by φ = 0,
i.e., determine pΘ|Φ (·|0). Compare your answer to the previous result
and interpret.
This exercise proves the strong law of large numbers but replaces (A.84)
with the weaker constraint that X has a finite fourth central moment
E[(X − E[X])^4] ≤ m (A.91)

Pr[|Sn − E[X]| ≥ ε] ≤ E[(Sn − E[X])^4] / ε^4. (A.93)
Explain why the result (A.96) proves the strong law of large numbers.
Problems
B.1.1. Expectation
Consider random variables X and Y with

PX(0) = PX(1) = 1/2, PX(2) = 0
PY(0) = PY(1) = PY(2) = 1/3.

a) Determine E[1/PX(X)], E[1/PX(Y)], E[1/PY(X)], and E[1/PY(Y)].

b) Compute E[−log2 PX(X)] and E[−log2 PY(X)].

c) Compute E[log2(PX(X)/PY(X))] and E[log2(PY(Y)/PX(Y))].
b) Suppose that X and Y are random variables such that X(Ω) = Y (Ω).
Write out the general expression for E[PX (X)] and E[PY (X)].
c) Write out the general expression for E[− log PX (X)] and E[− log PY (X)].
B.1.12. Uncertainty 1
Suppose that the random vector [X1 X2 X3 ] takes on the values [0 0 0],
[0 1 1], [1 0 1] and [1 1 0], each with probability 1/4 (i.e., PX1 X2 X3 (0, 0, 0) =
1/4, PX1 X2 X3 (0, 0, 1) = 0, etc.).
a) Find H(X1 X2 X3 ).
b) Calculate the following uncertainties:
i) H(X1 ), H(X2 ) and H(X3 ).
ii) H(X2 |X1 = 0), H(X2 |X1 = 1) and H(X2 |X1 ).
Hint: PX2 |X1 (0|0) =?
iii) H(X3 |X1 X2 = 00), H(X3 |X1 X2 = 01), H(X3 |X1 X2 = 10),
H(X3 |X1 X2 = 11) and H(X3 |X1 X2 ).
Hint: PX3 |X1 X2 (0|00) =?
c) Verify from a) and b) that the chain rule for uncertainties holds.
B.1.13. Uncertainty 2
Repeat Problem B.1.12 when [X1 X2 X3 ] takes on the values [0 0 0], [0 1 0],
[1 0 0] and [0 0 1], each with probability 1/4.
Hint: If an event has zero probability, then the conditional uncertainty
given that event cannot generally be calculated, but it is also of no physical
significance.
H(X) ≤ H(Y )
Suppose X is binary with alphabet X = {0, 1}. Consider the Binary Symmetric Channel or BSC for which Y = {0, 1} and

PY|X(b|a) = 1 − p if b = a, and PY|X(b|a) = p if b ≠ a. (B.4)
a) Show that
Verify the following claims, where the inverse function H2⁻¹(·) is defined with domain the interval [0, 1] and range the interval [0, 1/2].

a) Show that H2(H2⁻¹(x)) = x for 0 ≤ x ≤ 1, but that H2⁻¹(H2(x)) ≠ x can happen (for which x?).

b) Show that [λa + (1 − λ)b] ∗ p = λ(a ∗ p) + (1 − λ)(b ∗ p), where q ∗ p = q(1 − p) + (1 − q)p.

c) Show that H2(a) ∗ p ≤ H2(a ∗ p) for 0.11 ≤ a ≤ 0.5 and 0 ≤ p ≤ 1.

d) Using the above two results and H2(H2⁻¹(a)) = a, show that

[λa + (1 − λ)b] ∗ p ≤ H2( λ H2⁻¹(a) ∗ p + (1 − λ) H2⁻¹(b) ∗ p )
b) Consider f(x) = H2(x) and show that f′(x) > 0 for 0 < x < 1/2.

c) Consider g(x) = H2(p ∗ x) and show that (g′/f′)′ ≥ 0 for 0 < x < 1/2.

d) Finally, prove the claimed convexity of H2(p ∗ H2⁻¹(z)).
b) Suppose the BSC is used n times with input Xi and output Yi for i = 1, 2, . . . , n. The {Xi}_{i=1}^{n} could be dependent. Show that

H(Y^n) ≥ Σ_{i=1}^{n} H(Yi|X^{i−1}) (B.7)

with equality if and only if the {Xi}_{i=1}^{n} are independent or p = 1/2.

c) Use (B.6) and (B.7) to show that

H(Y^n)/n ≥ H2( p ∗ H2⁻¹(H(X^n)/n) ) (B.8)

with equality if and only if the {Xi}_{i=1}^{n} are independent or p = 1/2. This result is known as Mrs. Gerber's Lemma.
d) Now show that for any discrete random variable V we have

H(Y^n|V)/n ≥ H2( p ∗ H2⁻¹(H(X^n|V)/n) ). (B.9)
D(P‖R) + D(Q‖R) = D(P ‖ (P+Q)/2) + D(Q ‖ (P+Q)/2) + 2 D((P+Q)/2 ‖ R). (B.10)
a) Verify that
[Figure: a network of modulo-2 adders combining X1, X2, X3 with the noise variables Z1, Z2, Z3 to produce the outputs Y1, Y2, Y3.]
The symbol ⊕ denotes the exclusive-or operation (or addition mod 2). As-
sume that X1 , X2 , Z1 , Z2 , and Z3 are independent binary random variables
such that PX1 (1) = PX2 (1) = 1/2 and PZ1 (1) = PZ2 (1) = PZ3 (1) = p.
Show that
I(X^n; Y|Z) = Σ_{i=1}^{n} I(Xi; Y | Z X^{i−1}). (B.20)
Hint: Use (B.21) and use the chain rule (1.63) to show that

I(X^i; Y_{i+1}^n) = I(X^{i−1}; Y_{i+1}^n) + I(Xi; Y_{i+1}^n | X^{i−1})
I(X^{i−1}; Y_i^n) = I(X^{i−1}; Y_{i+1}^n) + I(X^{i−1}; Yi | Y_{i+1}^n).
A complex Gaussian random variable X = XR + jXI has Gaussian XR and XI, where j = √(−1). Suppose that [XR XI] has the covariance matrix

QXRXI = [ σR²       ρ σR σI ]
        [ ρ σR σI   σI²     ] (B.24)
a) Verify (2.30) and show that this bound holds with equality if and only
if pX (a) = pY (a) for (almost) all a ∈ supp(pX ).
b) Verify that D(pX|Z kpY |Z |pZ ) ≥ 0 with equality if and only if pX|Z (a|b) =
pY |Z (a|b) for (almost) all (a, b) ∈ supp(pXZ ).
We define

Ī_q(X; Y) = Σ_{x∈supp(PX)} PX(x) ∫_y pY|X(y|x) log2( pY|X(y|x) / qY(y) ) dy (B.36)

and

I_q(X; Y) = Σ_{x∈supp(PX)} PX(x) ∫_y pY|X(y|x) log2( qY|X(y|x) / qY(y) ) dy (B.37)
Prove that
ℓk = −logD pk. (B.43)

E[L] − H(X)/log2 D = D(PX‖PY) logD(2) − logD S ≥ 0 (B.44)
where

Nℓ = Σ_{n1ℓ1+···+nKℓK=ℓ} n!/(n1! n2! · · · nK!) (B.46)
e) Now take the nth root of both sides of (B.47) and let n → ∞. What
is the result?
Remark: It is not hard to show that there exists a D-ary uniquely decodable
code with a given list of codeword lengths if and only if these lengths satisfy
Kraft’s inequality. Thus, whenever a D-ary uniquely decodable code exists,
there also exists a D-ary prefix-free code with precisely the same codeword
lengths. For this reason, there is never an advantage in using a D-ary
uniquely decodable code that is not prefix-free, even if one did not actually
need the prefix-free property in his or her code.
[Figure: a rooted tree with intermediate nodes I, II, III, IV, V and leaves 1–8.]
Suppose that the following probabilities are assigned to the leaves: p1 = 1/4,
p2 = p3 = 1/8, p4 = p5 = 1/6, p6 = p7 = p8 = 1/18.
a) Directly compute the average path length E[W] from the root to the leaves, and directly compute the leaf entropy Hleaf as defined by Equation (3.8) in the lecture notes.
b) Find the Probabilities PI , PII , PIII , PIV and PV of the nodes in the
rooted tree with probabilities.
c) Compute E[W ] by the Path Length Lemma.
d) Find Hi , the branching entropy, for i = I, II, III, IV and V. Compute
Hleaf as suggested by Lemma 3.2 in the lecture notes.
u u1 u2 u3 u4 u5 u6 u7
PU (u) 0.3 0.2 0.1 0.1 0.1 0.1 0.1
The following theorem was alluded to in the “remark” at the end of Problem
B.3.5.
Theorem (McMillan). The codeword lengths w1, w2, . . . , wn of every D-ary uniquely decodable code satisfy Kraft's inequality Σ_{i=1}^{n} D^{−wi} ≤ 1.
This theorem was given by McMillan in 1956 [B. McMillan, “Two In-
equalities Implied by Unique Decipherability”, IRE Trans. Inform. Theory,
vol. IT-2, pp. 115-116, 1956]. An elegant proof of this theorem was provided
by Karush five years later [J. Karush, “A Simple Proof of an Inequality of
McMillan,” IRE Trans. Inform. Theory, vol. IT-7, pp. 118, 1961]. In this
problem, you will develop Karush’s proof for yourself.
Consider any D-ary code with codeword lengths w1 , w2 , . . . wn (all positive
integers of course) and let wmax be the largest codeword length. For an
arbitrary positive integer L, consider the following quantity:
" n #L n X
n n L·wmax
−wi
D−(wi1 +wi2 +···+wiL ) = Aj D−j .
X X X X
D = ···
i=1 i1 =1 i2 =1 iL =1 j=1
else in the codeword. This special digit is called the comma. For instance,
the codewords [1], [0, 1] and [0, 0, 1] form a binary comma code and the
digit 1 is the comma. Every comma code is also a prefix-free code. It is
very easy to find an optimum D-ary comma code for any random variable
U . One simply assigns the length 1 codeword (the comma alone) to the
most likely value of U , length 2 codewords to the D − 1 next most likely
values of U , length 3 codewords to the (D − 1)2 next most likely values of
U , etc. This follows from the fact that there are exactly (D − 1)i−1 possible
codewords of length i for a given comma, regardless of which codewords of
shorter length are in the comma code. Sometimes a D-ary comma code is
called a “(D − 1)-ary code” (for instance, a “unary code” when D = 2), but
this is a misleading terminology as the comma itself is active in supplying
information.
η = H(U) / (E[W] log D).
a) Find a binary Huffman code for a single German letter U , and compute
the efficiency of the code.
b) If you enjoyed doing a), repeat a) for a single English letter U .
a) Find an optimum binary prefix-free code for a single source letter, and
compute its efficiency
η = H(U) / (E[W] log D).
b) Find an optimum binary prefix-free code for L = 2 source letters, and
compute its efficiency
η = L H(U) / (E[W] log D).
c) Repeat b) for L = 3.
η = H(U) E[Y] / (N log D),
German English
i Letter pi Letter pi
1 _ 0.15149 _ 0.1859
2 E 0.14700 E 0.1031
3 N 0.08835 T 0.0796
4 R 0.06858 A 0.0642
5 I 0.06378 O 0.0632
6 S 0.05388 I 0.0575
7 T 0.04731 N 0.0574
8 D 0.04385 S 0.0514
9 H 0.04355 R 0.0484
10 A 0.04331 H 0.0467
11 U 0.03188 L 0.0321
12 L 0.02931 D 0.0317
13 C 0.02673 U 0.0228
14 G 0.02667 C 0.0218
15 M 0.02134 F 0.0208
16 O 0.01772 M 0.0198
17 B 0.01597 W 0.0175
18 Z 0.01423 Y 0.0164
19 W 0.01420 G 0.0152
20 F 0.01360 P 0.0152
21 K 0.00956 B 0.0127
22 V 0.00735 V 0.0083
23 Ü 0.00580 K 0.0049
24 P 0.00499 X 0.0013
25 Ä 0.00491 J 0.0008
26 Ö 0.00255 Q 0.0008
27 J 0.00165 Z 0.0005
28 Y 0.00017
29 Q 0.00014
30 X 0.00013
Entropy 4.113 [bits/letter] 4.08 [bits/letter]
Table B.1.: The probability distribution for a single letter of German and English text.
Note: Since the number of codewords in parts a), b) and c) of this problem
equals that in parts a), b) and c), respectively, of Problem B.3.17, it is fair
to compare the efficiencies of these block-to-variable-length and variable-
length-to-block coding schemes. It is generally true that the former will be
somewhat more efficient for the same number of codewords.
a) Find the smallest N such that the run-length message set is not a
Tunstall message set.
b) As a function of N , find N/E[Y ] for the run-length coding scheme,
where E[Y ] is the average message length.
[Figure: a binary tree with messages V1, V2, V3, V4 at the leaves.]
η = H(U) E[Y] / (E[W] log D)
Here, Paul already knows U1 , U2 , . . . , Un−1 . Paul’s friend, John, knows Ũn
and asks Paul to guess Un . If this guess is correct (which John would know
because then Ũn = 1) he tells Paul that it is correct and otherwise tells him
to guess again, etc.
For large n, Ũn will be virtually statistically independent of the previous
letters so that H(Ũn ) is a good estimate of H∞ (U ), the true information
rate of the text.
a) Use Shannon’s method to estimate H∞ (U ) for German text as follows.
In order to generate the sequence Ũ1 , Ũ2 , Ũ3 , . . . in the above diagram,
only the services of James and Peter are required. Thus, you should
work in pairs on this problem, letting one member play the role of
James and the other play the role of Peter. James should first act as
the information source by secretly selecting a typical German sentence
of about 40 letters (including the letter * to denote a space but ex-
cluding punctuation symbols) to serve as the sequence U1 , U2 , U3 , . . ..
James should then tell Peter the first 10 letters of the sequence. Peter
should then guess U11 until he is successful, which determines Ũ11 .
Continue in this way to get Ũ12 , Ũ13 , . . . , Ũ40 .
Now use your 30 samples of Ũ to estimate the probabilities of the four
most likely values of Ũ . Then estimate H(Ũ ) by assuming that the
other 26 values of Ũ are equally likely (which will make your estimate
slightly too large).
b) Find a binary Huffman code for Ũ (assuming that the probabili-
ties used in estimating H(Ũ ) are correct) and compute its efficiency.
Also, compare E[W ] for this code with E[W ] for the code in Prob-
lem B.3.15(a).
Note: To make your results more accurate, you can pool your statistics in
part a) with those from several other teams before estimating the probability
distribution for Ũ .
[Figure: state diagram of the source.]
we are in steady state, where the effect of the initial state is absent and the
source is stationary.
(a) What is the stationary distribution of the source?
(b) What is the memory of the source?
(c) Let the source output be denoted by X1 , X2 , X3 , . . . (assume the source
is in steady state). Calculate H(X1 ), H(X1 X2 ), and H(X1 X2 X3 ).
[Figure: state diagram of a two-state source with states 0 and 1.]
X^32 = a a b c c a c b a a a c a b b c a a b c c b c b a b a a c b a b. (B.49)
(a) Encode the given sequence for B = 1 and B = 2 by using the Elias-
Willems universal source encoder without a recency rank calculator.
(b) Encode the given sequence for B = 1 and B = 2 by using the Elias-
Willems universal source encoder with a recency rank calculator.
(c) Find a numerical upper bound on E[L]/B as B gets large. Compare
with the compression rate achieved in (a) and (b).
Suppose a source puts out the sequence 01010101 . . .. What does the Elias-
Willems universal source encoder (with a recency rank calculator) put out
if B = 1, 2, 3, 4, ...? What happens when B is odd? What happens when B
is even? What happens as B gets large?
A discrete stationary source (DSS) with finite memory µ, i.e., such that
Sn = [un−µ , . . . , un−1 ].
Note that the stationary probability distribution for the states is just the
probability distribution P (u1 , . . . , uµ ). From
P (u1 , . . . , uµ ),
P (u1 , . . . , uL )
= P (u1 , . . . , uµ )P (uµ+1 |u1 , . . . , uµ ) . . . P (uL |uL−µ , . . . , uL−1 ).
N/L < H(U)/log(D) + ε1

and

P(F) < ε2,

where P(F) is the probability that [Û1, . . . , ÛL] ≠ [U1, . . . , UL]. Defining the symbol error probability PS, we have

PS ≤ P(F).

Hence, we see that the Lossy Source Coding Theorem implies

PS < ε2

so that PS can be made arbitrarily small.
In this problem, you will prove the converse to the Lossy Source Coding Theorem by showing that PS cannot be made arbitrarily small when N/L is smaller than H(U)/log D. More precisely, you will show that if

N/L ≤ H(U)/log(D),

then

h(PS) + PS · log(K − 1) ≥ [ H(U)/log(D) − N/L ] · log(D). (B.50)
a) Make use of the Data Processing Lemma and the fact that
I(U1 . . . UL ; X1 . . . XN ) ≤ H(X1 . . . XN )
c) Make use of the inequality (4.32), Fano's Lemma, and the convexity-∩ of h(p) to show that

(1/L) H(U1 . . . UL | Û1 . . . ÛL) ≤ h(PS) + PS log(K − 1).
d) Now combine the results of b) and c) to prove Equation (B.50).
e) Find an explicit lower bound on PS when the DMS is a ternary source
(i.e., K = 3), the coding alphabet is binary (i.e., D = 2), and N/L ≤
H(U ) − 0.336.
The units of R are often, but somewhat imprecisely, called “bits per letter”.
a) Specializing Equation (B.50) of Problem B.4.7 to the case of a BSS so
that PS = Pb , show that
R ≥ 1 − h(Pb ) bits/letter.
Suppose now that the source in Figure 1 of Problem B.4.7 is changed to the
general binary memoryless source (BMS) with p = PU (1). Without loss of
essential generality, assume that p ≤ 1/2.
R ≥ h(p) − h(Pb ).
b) Now use a = (a^s)^{1/s} to show that for any s > 0 we have

Pr[E] ≥ 1 − Σ_{y^n} [ (1/M) Σ_{i=1}^{M} PY^n|X^n(y^n | x^n(i))^s ]^{1/s}. (B.53)

Averaging over the random code then gives

Pr[E] ≥ 1 − Σ_{y^n} [ (1/M) Σ_{i=1}^{M} E[ PY^n|X^n(y^n | X^n(i))^s ] ]^{1/s}
      ≥ 1 − M^{(1−s)/s} max_{PX^n} Σ_{y^n} [ Σ_{x^n} PX^n(x^n) PY^n|X^n(y^n|x^n)^s ]^{1/s}. (B.54)
Show that if R > C then we have E(R) > 0. In other words, for large
n the block error probability approaches 1. This result is known as a
strong converse because it shows that the block error probability must
approach 1 as n grows.
N R ≤ I(W; Ŵ) / [ 1 − H2(PF)/b − PF log2(2^b − 1)/b ]. (B.58)
where pY(·) is given in (5.85). By the Weak Law of Large Numbers (see Appendix A.13), we can compute (B.60) by Monte Carlo integration by simulating n independent variables Yi = Xi + Zi, i = 1, 2, . . . , n. Here each Xi has probability distribution PX(√P) = PX(−√P) = 1/2, and Zi is a Gaussian random variable with zero mean and variance N that is independent of Xi. Consider the sum

Sn = −(1/n) Σ_{i=1}^{n} log pY(Yi). (B.61)
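A possible Monte Carlo implementation in Python; P, N, and n are free choices, and the mixture density follows (5.85).

import numpy as np

rng = np.random.default_rng(0)
P, N, n = 4.0, 1.0, 200000
x = rng.choice([np.sqrt(P), -np.sqrt(P)], size=n)   # BPSK inputs
y = x + rng.normal(0.0, np.sqrt(N), size=n)         # Yi = Xi + Zi
phi = lambda t: np.exp(-t**2 / (2 * N)) / np.sqrt(2 * np.pi * N)
pY = 0.5 * (phi(y - np.sqrt(P)) + phi(y + np.sqrt(P)))   # pY(.) of (5.85)
Sn = -np.mean(np.log(pY))        # converges to h(Y) in nats
print(Sn)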
Y = HX + Z (B.62)

where H and Z are independent Gaussian random variables with zero mean and unit variance. Suppose that X has the on-off keying (OOK) distribution PX(0) = PX(√(2P)) = 1/2. We wish to compute the mutual information

I(X; Y) = E[ log( pY|X(Y|X) / pY(Y) ) ]. (B.63)
By the Weak Law of Large Numbers (see Exercise B.5.8) we can compute I(X; Y) by simulating independent random variables Yi = Hi Xi + Zi, i = 1, 2, . . . , n and computing

Sn = (1/n) Σ_{i=1}^{n} log( pY|X(Yi|Xi) / pY(Yi) ). (B.64)
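A possible Python sketch of this simulation; P and n are free choices. Since H is Gaussian, Y given X = x is zero-mean Gaussian with variance 1 + x², which gives pY|X and pY in closed form.

import numpy as np

rng = np.random.default_rng(0)
P, n = 2.0, 200000
x = rng.choice([0.0, np.sqrt(2 * P)], size=n)        # OOK inputs
y = rng.standard_normal(n) * x + rng.standard_normal(n)   # Hi*Xi + Zi
gauss = lambda t, v: np.exp(-t**2 / (2 * v)) / np.sqrt(2 * np.pi * v)
p_cond = gauss(y, 1 + x**2)                          # pY|X(Yi|Xi)
p_marg = 0.5 * (gauss(y, 1.0) + gauss(y, 1 + 2 * P)) # pY(Yi)
Sn = np.mean(np.log2(p_cond / p_marg))               # I(X;Y) estimate, bits
print(Sn)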
where Z1 and Z2 are independent Gaussian random variables with zero mean
and unit variance. Suppose the power constraint is E[X12 + X22 ] ≤ 2P .
where Kq = P Hq HTq + Qq .
c) Show that the lower bound (B.66) is tight if Hq = H and Qq = H HT .
d) Suppose the receiver believes that Hq and Qq are the 2 × 2 identity matrices. Show that the lower bound of (B.66) is

I(X; Y) ≥ (1/2) log((1 + P)² − P/5) − P² / (5(1 + P)² − P). (B.71)
g) Plot the bounds of (B.69)-(B.71) and compare them with each other
and the capacity. What happens at small and large P ?
I(X^n; Y^n) ≥ E[ log( qY^n|X^n(Y^n|X^n) / qY^n(Y^n) ) ] (B.73)
where 0 ≤ ε ≤ 1.
a) Find the capacity C of this DMC and verify that it reduces to that of
the BSC when J = 2.
b) Find the explicit values of C when ε = 0, ε = (J − 1)/J, and ε = 1, and give an intuitive explanation of these values.
[Figure: a channel with input alphabet {0, 1} and output alphabet {0, ∆, 1}, where Pr[Y = X] = 1 − ε − δ, Pr[Y = ∆] = δ, and the crossover probability is ε.]

This channel is called the binary symmetric erasure channel (BSEC). Notice that it reduces to the BSC when δ = 0 and to the BEC when ε = 0.
a) Show that the BSEC is indeed symmetric, and find its capacity C.
b) Verify that C reduces to the appropriate capacities when δ = 0 and
when ε = 0.
c) There are three different sensible ways that the above channel could
be converted to the BSEC (one way is to convert 0000 and 1000 to ∆).
Find the capacity C3 for the way that gives the greatest capacity. Find
also the decibel loss compared to eight-level demodulation, and the
decibel gain over hard-decision demodulation.
Note: The kinds of dB gains and losses computed here are typical of the
AWGN channel when used at an efficient E/N0 value. Additionally, there
is about 0.2 dB loss by using eight-level, rather than infinitely-many-level,
demodulation.
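For a numerical check of the capacity found in part a), one can run the Blahut-Arimoto iteration on the BSEC transition matrix. The generic Python sketch below (not part of the problem statement; ε and δ are illustrative) computes the capacity of any DMC.

import numpy as np

def blahut_arimoto(W, iters=2000):
    """Capacity (in bits) of a DMC with transition matrix W, whose rows
    index inputs and columns index outputs, via Blahut-Arimoto."""
    m = W.shape[0]
    p = np.full(m, 1.0 / m)                 # start from the uniform input
    for _ in range(iters):
        q = p @ W                           # induced output distribution
        ratio = np.where(W > 0, W / np.maximum(q, 1e-300), 1.0)
        D = np.sum(np.where(W > 0, W * np.log(ratio), 0.0), axis=1)
        p = p * np.exp(D)                   # multiplicative update
        p /= p.sum()
    q = p @ W
    return sum(p[i] * W[i, j] * np.log2(W[i, j] / q[j])
               for i in range(m) for j in range(W.shape[1]) if W[i, j] > 0)

eps, delta = 0.05, 0.2                      # illustrative BSEC parameters
W = np.array([[1 - eps - delta, delta, eps],     # outputs ordered 0, ∆, 1
              [eps, delta, 1 - eps - delta]])
print(blahut_arimoto(W))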
We will say that the DMC (or any other channel) is embedded in a causal
system if, for the output U1 . . . UK of any source in the system, for all n ≥ 1,
i.e., the channel output at any time can depend on the source sequence only
through the effect of that source sequence on past and present inputs and
past outputs of the channel.
I(U1 . . . UK ; Y1 . . . YN ) ≤ I(X1 . . . XN → Y1 . . . YN ).
Hint: Consider H(Y1 . . . YN |U1 . . . UK) = Σ_{n=1}^{N} H(Yn|Y1 . . . Yn−1 U1 . . . UK).
c) Now show that if K digits of a BSS are sent to a destination via N
uses of a DMC (possibly with feedback), then
I(U1 . . . UK ; Y1 . . . YN ) ≤ N C.
where, in the nontrivial case when neither sequence is all-zero, equality holds if and only if there is a constant c such that ai = c · bi, i = 1, 2, . . . , N. Next, find the value of c, say c0, that minimizes f(c), and then use the fact that f(c0) ≥ 0.
b) Show that
Xq sX sX
PY |X (y|0)PY |X (y|1) ≤ PY |X (y|0) · PY |X (y|1) = 1
y y y
[Figure: a binary-input channel from X to Y with transition probabilities 1/2 and 1.]
Note: The results of this problem contain an important lesson for commu-
nications engineers. The “natural impulse” of a communications engineer
is to make E/N0 very large so that reliable “hard-decisions” about X can
be made from the correlator output.