
Chapter 02

Information theory and Coding


Chapter objectives
 Introduction to the field of information theory
 Definition of information source
 Source coding theorem and algorithms
 Channel coding theorem and algorithms

Chapter contents
1. Introduction
2. Modeling of Information Sources
3. Measure of information
4. Source coding theorem
5. Source coding algorithms

1. Introduction
In the design and analysis of the communication systems discussed in Chapter One, there exists an
information source that produces information and a channel that propagates it to the receiver.
The purpose of a communication system is to transmit this output of the source (the information) to
the destination via the communication channel.

The source encoder takes the source output and represents it with as few bits as possible, and the
channel is expected to transport these bits to the receiver with negligible error. As little information as
possible should be lost in either stage. Therefore, it is necessary to have a precise notion of
information: the performance analysis of a communication system can hardly be conceived
without a quantitative measure of information and a mathematical model of information
sources. This study is called information theory.

“In the context of electronic communication, information theory deals with mathematical
modeling and analysis of a communication system rather than with physical sources and
physical channels”

In particular, it provides answers to two fundamental questions:

1. Given an information source, how do we evaluate the ‘rate’ at which the source is emitting
information? The output of an information source can be made more compact; how much can
the source output be compressed?
2. Given a noisy communication channel, how do we evaluate the ‘maximum rate’ at which
reliable information transmission can take place over the channel?

The answers to these questions lie in the entropy of the source and the capacity of the channel,
respectively. Entropy is defined in terms of the probabilistic behavior of a source of information,
and capacity is defined as the intrinsic ability of a channel to convey information; it is naturally
related to the noise characteristics of the channel. A remarkable result that emerges from
information theory is that if the entropy of the source is less than the capacity of the channel,
then error-free communication over the channel can be achieved. These concepts are developed
in this chapter.

2. Modeling of Information Sources


The intuitive and common notion of information refers to any new knowledge about something.
The information source, therefore, produces outputs which are of interest to the receiver of
information, who does not know these outputs in advance. Since the output of the information
source is a time-varying unpredictable function (if predictable, there is no need to transmit it), it
can be modeled as a random process. And in communication channels, the existence of noise
causes stochastic dependence between the input and output of the channel. Therefore, the
communication-system designer designs a system that transmits the output of a random process
(information source) to a destination via a random medium (channel) and ensures low distortion.

Information sources can be modeled by random processes, and the properties of the random
process depend on the nature of the information source.

• For example, when modeling speech signals, the resulting random process has all its
power in a frequency band of approximately 300–4000 Hz. Therefore, the power-
spectral density of the speech signal also occupies this band of frequencies.

• Video signals are obtained from a still or moving image and, therefore, the bandwidth
depends on the required resolution. For TV transmission, depending on the system
employed (NTSC, PAL or SECAM), this band is typically between 0–4.5 MHz and 0–6.5
MHz.

What is common to all these processes is that they are bandlimited processes and, therefore, can
be sampled at or above the Nyquist rate and reconstructed from the sampled values. It therefore
makes sense to confine ourselves to discrete-time random processes in this chapter, because all
information sources of interest can be modeled by such processes.

The mathematical model for an information source is shown below. Here the source is modeled by
a discrete-time random process {Xi}, i = ..., −2, −1, 0, 1, 2, .... The alphabet over which the random
variables Xi are defined can be either discrete (in transmission of binary data, for instance) or
continuous (e.g., sampled speech). The statistical properties of the discrete-time random process
depend on the nature of the information source.

In this chapter, we will only study rather simple models for information sources. The simplest
model for the information source that we study is the discrete memoryless source (DMS).

A DMS is a discrete-time, discrete-amplitude random process in which all Xi’s are generated
independently and with the same distribution. Therefore, a DMS generates a sequence of
i.i.d. random variables taking values in a discrete set.

Let S = {a1, a2, ..., aN} denote the set in which the random variable X takes its values, and let
the probability mass function for the discrete random variable X be denoted by pi = p(X = ai) for
all i = 1, 2, ..., N.

A full description of the DMS is given by the set S, called the alphabet, and the
probabilities {p1, p2, ..., pN}.

Example: An information source is described by the alphabet S = {0, 1} and p(Xi = 1) = 1 − p(Xi = 0) =
p. This is an example of a discrete memoryless source. In the special case where p = 0.5 the source
is called a binary symmetric source, or BSS for short.
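
To make the DMS model concrete, here is a minimal Python sketch (an illustrative addition; the alphabets and probabilities used are arbitrary example values) that draws i.i.d. symbols from a given alphabet and PMF:

import random

def dms(alphabet, pmf, n):
    """Draw n i.i.d. symbols from a discrete memoryless source with the given PMF."""
    return random.choices(alphabet, weights=pmf, k=n)

# Binary symmetric source (BSS): p = 0.5
print(dms([0, 1], [0.5, 0.5], 20))

# Biased binary DMS with p(Xi = 1) = 0.2 (assumed example value)
print(dms([0, 1], [0.8, 0.2], 20))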

3. Measure of information
In order to give a quantitative measure of information, we will start with the basic model of an
information source, i.e., a discrete source and try to define the information content of the source
in such a way that certain intuitive properties are satisfied. Let the outputs of this source be
revealed to an interested party.

Let a1 be the most likely and aN the least likely output. For example, one could imagine
the source to represent both the weather condition and the air pollution of Semera city during
May. In this case, S represents the various combinations of weather conditions and
pollution levels, such as:

• a1 = hot and polluted
• a2 = hot and lightly polluted
• a3 = cold and highly polluted
• a4 = cold and mildly polluted
• ...
• aN = very cold and lightly polluted

The question is: which output conveys more information, a1 or aN (the most probable or
the least probable one)? Answer: Intuitively, revealing aN (or, equivalently, very cold and
lightly polluted in the previous example) reveals the most information. From this it
follows that:

First intuitive property of a measure of information

A rational measure of information for an output of an information source should be a
decreasing function of the probability of that output.

A second intuitive property of a measure of information is that a small change in the
probability of a certain output should not change the information delivered by that output
by a large amount. In other words:

Second intuitive property

The information measure should be a decreasing and continuous function of the
probability of the source output.

From the above discussion, we can conclude that the amount of information revealed about an
output aj with probability pj must satisfy the following four conditions:

1. The information content of output aj depends only on the probability of aj and not on the
value of aj. We denote this function by I(pj) and call it self-information.
2. Self-information is a continuous function of pj; i.e., I(·) is a continuous function.
3. Self-information is a decreasing function of its argument; i.e., the least probable outcomes
convey the most information.
4. If pj = pj1 · pj2, then I(pj) = I(pj1) + I(pj2). This happens if each source output can be
broken into two independent components, for example temperature and pollution. Since the
components are independent, revealing the information about one component
(temperature) does not provide any information about the other component (pollution)
and, therefore, intuitively, the amount of information provided by revealing aj is the sum
of the information obtained by revealing aj1 and aj2.

It can be proved that the only function that satisfies all the above properties is the logarithmic
function; i.e., I(x) = log(1/x) = −log(x).

The base of the logarithm is not important and only defines the unit in which the information is
measured. If the base is 2, the information is expressed in bits, and if the natural logarithm is
employed, the unit is nats.

Now that the information revealed about each source output aj is defined as the self-information
of that output, given by −log(pj), we can define the information content of the source as the
weighted average of the self-information of all source outputs. This is justified by the fact that the
various source outputs appear with their corresponding probabilities. Therefore, the information
revealed by an unidentified source output is the weighted average of the self-information of the
various source outputs. The information content of the information source is known as the
entropy of the source and is denoted by H(X).

Definition: The entropy of a discrete random variable X is a function of its PMF and is
defined by

H(X) = − Σ pi log pi    (the sum is over all i = 1, 2, ..., N)

where 0 log 0 = 0. Note that there exists a slight abuse of notation here. One would expect H(X)
to denote a function of the random variable X and, hence, be a random variable itself. However,
H(X) is a function of the PMF of the random variable X and is, therefore, a number.

Example 1

For the binary memoryless source with probabilities p and 1 − p, respectively, we have

H(X) = −p log p − (1 − p) log (1 − p)

This function, denoted by Hb(p), is known as the binary entropy function, and a plot of it is given below.
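
For a quick numerical feel, a minimal Python sketch of the binary entropy function is given below (an illustrative addition; the sample values of p are arbitrary). It can also be used to generate the plot referred to above.

import math

def binary_entropy(p):
    """Binary entropy function Hb(p) in bits, with 0 log 0 taken as 0."""
    if p in (0, 1):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Hb(p) is symmetric about p = 0.5, where it attains its maximum of 1 bit.
for p in (0.1, 0.25, 0.5, 0.75, 0.9):   # assumed sample values
    print(f"Hb({p}) = {binary_entropy(p):.4f} bits")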

Example 2

A source with bandwidth 4000 Hz is sampled at the Nyquist rate. Assuming that the resulting
sequence can be approximately modeled by a DMS with alphabet S = {−2, −1, 0, 1, 2} and with
corresponding probabilities {1/2, 1/4, 1/8, 1/16, 1/16}, determine the rate of the source in bits/sec.

Solution: We have

H(X) = (1/2) log 2 + (1/4) log 4 + (1/8) log 8 + 2 × (1/16) log 16 = 15/8 bits/sample

and since we have 8000 samples/sec, the source produces information at a rate of 15,000 bits/sec.
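
The calculation in Example 2 can be verified with a short Python sketch (an illustrative addition; the PMF and sampling rate are those assumed in the example):

import math

def entropy(pmf):
    """Entropy H(X) = -sum(p log2 p) in bits/sample, with 0 log 0 = 0."""
    return -sum(p * math.log2(p) for p in pmf if p > 0)

pmf = [1/2, 1/4, 1/8, 1/16, 1/16]        # alphabet S = {-2, -1, 0, 1, 2}
H = entropy(pmf)
rate = 8000 * H                          # Nyquist-rate sampling of a 4000 Hz source
print(f"H(X) = {H} bits/sample")         # 1.875 = 15/8
print(f"Source rate = {rate} bits/sec")  # 15000.0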

Joint and Conditional Entropy

When dealing with two or more random variables, exactly in the same way that joint and
conditional probabilities are introduced, one can introduce joint and conditional entropies. These
concepts are especially important when dealing with sources with memory.

Definition: The joint entropy of two discrete random variables (X, Y) is defined by

H(X, Y) = − Σx Σy p(x, y) log p(x, y)

For the case of n random variables X = (X1, X2, ..., Xn), we have

H(X1, X2, ..., Xn) = − Σ p(x1, x2, ..., xn) log p(x1, x2, ..., xn)

As seen, the joint entropy is simply the entropy of a vector-valued random variable.

The conditional entropy of the random variable X, given the random variable Y, can be defined by
noting that if Y = y, then the PMF of the random variable X will be p(x | y), and the corresponding
entropy is H(X | Y = y) = − Σx p(x | y) log p(x | y), which is intuitively the amount of uncertainty
in X when one knows Y = y. The weighted average of the above quantities over all y is the
uncertainty in X when Y is known. This quantity is known as the conditional entropy and is defined
as follows:

Definition: The conditional entropy of the random variable X given the random variable Y is
defined by

H(X | Y) = Σy p(y) H(X | Y = y) = − Σx Σy p(x, y) log p(x | y)

In general, we have

H(Xn | X1, ..., Xn−1) = − Σ p(x1, ..., xn) log p(xn | x1, ..., xn−1)

Example:

Using the chain rule for PMFs, p(x, y) = p(y) p(x | y), show that H(X, Y) = H(Y) + H(X | Y).
Generalize this result to the case of n random variables to show the following chain rule
for entropies:

H(X1, X2, ..., Xn) = H(X1) + H(X2 | X1) + ... + H(Xn | X1, X2, ..., Xn−1)

Solution: From the definition of the joint entropy of two random variables, we have

H(X, Y) = − Σx Σy p(x, y) log p(x, y)
        = − Σx Σy p(x, y) log [p(y) p(x | y)]
        = − Σx Σy p(x, y) log p(y) − Σx Σy p(x, y) log p(x | y)
        = H(Y) + H(X | Y)

where in the last step we have used Σx p(x, y) = p(y).
This relation says that the information content of the pair (X, Y) is equal to the
information content of Y plus the information content of X after Y is known. Equivalently,
it says that the same information is transferred either by revealing the pair (X, Y), or by
first revealing Y and then revealing the remaining information in X. The proof for general
n is similar and is left as an exercise. In the case where the random variables (X1, X2, ...,
Xn) are independent, the equation reduces to

H(X1, X2, ..., Xn) = H(X1) + H(X2) + ... + H(Xn)
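
As a numerical check of these relations, the following Python sketch (an illustrative addition; the joint PMF is arbitrarily chosen) computes H(X, Y), H(Y) and H(X | Y) and confirms that H(X, Y) = H(Y) + H(X | Y):

import math

# Assumed example joint PMF p(x, y), indexed as p_xy[x][y]
p_xy = [[0.25, 0.25],
        [0.40, 0.10]]

def H(probs):
    """Entropy in bits of a list of probabilities, with 0 log 0 = 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

H_xy = H([p for row in p_xy for p in row])                        # joint entropy H(X, Y)
p_y = [sum(row[y] for row in p_xy) for y in range(len(p_xy[0]))]  # marginal p(y)
H_y = H(p_y)

# Conditional entropy H(X | Y) = -sum_{x,y} p(x, y) log p(x | y)
H_x_given_y = -sum(p_xy[x][y] * math.log2(p_xy[x][y] / p_y[y])
                   for x in range(len(p_xy)) for y in range(len(p_xy[0]))
                   if p_xy[x][y] > 0)

print(H_xy, H_y + H_x_given_y)   # the two values agree (chain rule)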

If the random variable Xn denotes the output of a discrete (not necessarily memoryless) source at
time n, then H (X2 | X1) denotes the fresh information provided by source output X2 to someone
who already knows the source output X1. In the same way, H (Xn | X1, X2, ..., Xn−1) denotes the
fresh information in Xn for an observer who has observed the sequence (X1, X2, ..., Xn−1). The
limit of the above conditional entropy as n tends to infinity is known as the entropy rate of the
random process.

Definition: The entropy rate of a stationary discrete-time random process is defined by

H = lim (n→∞) H(Xn | X1, X2, ..., Xn−1)

Stationarity ensures the existence of the limit, and it can be proved that an alternative
definition of the entropy rate for sources with memory is given by

H = lim (n→∞) (1/n) H(X1, X2, ..., Xn)
Entropy rate plays the role of entropy for sources with memory. It is basically a measure of the
uncertainty per output symbol of the source.
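
As an illustration of the entropy rate of a source with memory, the following Python sketch (an added example; the transition probabilities are assumed values) computes the entropy rate of a stationary two-state Markov source, for which H(Xn | X1, ..., Xn−1) reduces to H(Xn | Xn−1):

import math

def Hb(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0, 1) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Assumed two-state Markov source: P[i][j] = P(next symbol = j | current symbol = i)
P = [[0.9, 0.1],
     [0.4, 0.6]]

# Stationary distribution of a two-state chain: pi0 = P[1][0] / (P[0][1] + P[1][0])
pi0 = P[1][0] / (P[0][1] + P[1][0])
pi = [pi0, 1 - pi0]

# Entropy rate H = sum_i pi_i * H(Xn | Xn-1 = i); smaller than the memoryless entropy Hb(pi[1])
entropy_rate = sum(pi[i] * Hb(P[i][1]) for i in range(2))
print(f"Entropy rate = {entropy_rate:.4f} bits/symbol")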

SOURCE-CODING THEOREM
Source encoding is the process of efficiently representing the data generated by a discrete source
with a sequence of binary digits called codewords. The device that performs the representation is
called a source encoder. For the source encoder to be efficient, we require knowledge of the
statistics of the source. An efficient source encoder satisfies two functional requirements:

1. The code words produced by the encoder are in binary form.


2. The source code is uniquely decodable, so that the original source sequence can be
reconstructed perfectly from the encoded binary sequence.

Consider then the scheme shown below, which depicts a discrete memoryless source whose
output sk is converted by the source encoder into a block of 0s and 1s, denoted by bk.

We assume that the source has an alphabet with K different symbols, and that the kth symbol sk
occurs with probability pk, k = 0, 1, ..., K − 1. Let the binary codeword assigned to symbol sk by
the encoder have length lk, measured in bits. We define the average codeword length, L, of the
source encoder as

L = Σ (k = 0 to K−1) pk lk

In physical terms, the parameter L represents the average number of bits per source symbol used
in the source encoding process. Let Lmin denote the minimum possible value of L. We then
define the coding efficiency of the source encoder as

η = Lmin / L

With L ≥ Lmin we clearly have η ≤ 1. The source encoder is said to be efficient when η
approaches unity.

But how is the minimum value Lmin determined?

The minimum value Lmin is determined from Shannon's first theorem, which may be stated as
follows:

Given a discrete memoryless source of entropy H(S), the average codeword length L for
any distortionless source-encoding scheme is bounded as

L ≥ H(S)

According to the source-coding theorem, the entropy H(S) represents a fundamental limit on the
average number of bits per source symbol necessary to represent a discrete memoryless source, in
that L can be made as small as, but no smaller than, the entropy H(S). Thus, with Lmin = H(S) we
may rewrite the efficiency of a source encoder in terms of the entropy as

η = H(S) / L

where, as before, we have η ≤ 1.
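
These quantities are straightforward to compute; the sketch below (an illustrative addition with an assumed set of symbol probabilities and codeword lengths) evaluates L, H(S) and η in Python:

import math

# Assumed example: probabilities and binary codeword lengths for a four-symbol source
p = [0.5, 0.25, 0.125, 0.125]
l = [1, 2, 3, 3]

L = sum(pk * lk for pk, lk in zip(p, l))             # average codeword length
H = -sum(pk * math.log2(pk) for pk in p if pk > 0)   # source entropy H(S)
eta = H / L                                          # coding efficiency, with Lmin = H(S)

# Here eta = 100% because the probabilities are negative powers of 2
print(f"L = {L} bits/symbol, H(S) = {H} bits/symbol, efficiency = {eta:.1%}")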

Lossless Data Compression Algorithms


A common characteristic of signals generated by physical sources is that, in their natural form,
they contain a significant amount of redundant information, the transmission of which is
therefore wasteful of primary communication resources. For example, the output of a computer
used for business transactions constitutes a redundant sequence in the sense that any two
adjacent symbols are typically correlated with each other.

For efficient signal transmission, the redundant information should, therefore, be removed from
the signal prior to transmission. This operation, with no loss of information, is ordinarily
performed on a signal in digital form, in which case we refer to the operation as lossless data
compression. The code resulting from such an operation provides a representation of the source
output that is not only efficient in terms of the average number of bits per symbol, but also exact
in the sense that the original data can be reconstructed with no loss of information.

The entropy of the source establishes the fundamental limit on the removal of redundancy from
the data. Basically, lossless data compression is achieved by assigning short descriptions to the
most frequent outcomes of the source output and longer descriptions to the less frequent ones.

Prefix Coding
Consider a discrete memoryless source with alphabet {s0, s1, ..., sK−1} and respective probabilities
{p0, p1, ..., pK−1}. For a source code representing the output of this source to be of practical use,
the code has to be uniquely decodable. This restriction ensures that, for each finite sequence of
symbols emitted by the source, the corresponding sequence of codewords is different from the
sequence of codewords corresponding to any other source sequence. A special class of uniquely
decodable codes is the class of codes satisfying a restriction known as the prefix condition. Any
sequence made up of the initial part of a codeword is called a prefix of the codeword. We thus say:

A prefix code is defined as a code in which no codeword is the prefix of any other
codeword

Prefix codes are distinguished from other uniquely decodable codes by the fact that the end of a
codeword is always recognizable. Hence, the decoding of a prefix code can be accomplished as soon as
the binary sequence representing a source symbol is fully received. For this reason, prefix codes
are also referred to as instantaneous codes.

Illustrative Example of Prefix Coding

Symbol   Code I   Code II   Code III
s0       0        0         0
s1       1        10        01
s2       00       110       011
s3       11       111       0111

Code I is not a prefix code because the bit 0, the codeword for s0, is a prefix of 00, the codeword
for s2. Likewise, the bit 1, the codeword for s1, is a prefix of 11, the codeword for s3. Similarly, we
may show that code III is not a prefix code but code II is.

Decoding of Prefix Code

To decode a sequence of codewords generated from a prefix source code, the source decoder
simply starts at the beginning of the sequence and decodes one codeword at a time. Specifically, it

sets up what is equivalent to a decision tree, which is a graphical portrayal of the codewords in
the particular source code.

The figure above depicts the decision tree corresponding to code II in the table. The tree has an initial
state and four terminal states corresponding to source symbols s0, s1, s2, and s3. The decoder
always starts at the initial state. The first received bit moves the decoder to the terminal state s0 if
it is 0, or else to a second decision point if it is 1. Note also that each bit in the received encoded
sequence is examined only once. Consider, for example, the encoded sequence
1011111000…. This sequence is readily decoded as the source sequence s1, s3, s2, s0, s0, …

As mentioned previously, a prefix code has the important property that it is instantaneously
decodable. But the converse is not necessarily true. For example, code III in Table does not satisfy
the prefix condition, yet it is uniquely decodable because the bit 0 indicates the beginning of each
codeword in the code.
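
A prefix code can be decoded symbol by symbol with no look-ahead. The Python sketch below (an illustrative addition, using code II from the table) walks through the received bit stream and emits a symbol as soon as a complete codeword is recognized:

# Code II from the table: a prefix code for symbols s0, ..., s3
code = {"s0": "0", "s1": "10", "s2": "110", "s3": "111"}
decode_map = {cw: sym for sym, cw in code.items()}

def decode_prefix(bits):
    """Decode a bit string produced by a prefix code, one codeword at a time."""
    symbols, current = [], ""
    for b in bits:
        current += b
        if current in decode_map:        # the end of a codeword is always recognizable
            symbols.append(decode_map[current])
            current = ""
    return symbols

print(decode_prefix("1011111000"))       # ['s1', 's3', 's2', 's0', 's0']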

Kraft Inequality

Consider a discrete memoryless source with source alphabet {s0, s1, ..., sK−1} and source
probabilities {p0, p1, ..., pK−1}, with the codeword of symbol sk having length lk, k = 0, 1, ..., K − 1.
Then, according to the Kraft inequality, the codeword lengths always satisfy the following
inequality:

Σ (k = 0 to K−1) 2^(−lk) ≤ 1

where the factor 2 refers to the number of symbols in the binary alphabet. The Kraft inequality is
a necessary but not sufficient condition for a source code to be a prefix code. In other words, the
inequality is merely a condition on the codeword lengths of a prefix code and not on the codewords
themselves. For example, referring to the three codes listed in the table, we see:

• Code I violates the Kraft inequality; it cannot, therefore, be a prefix code.

• The Kraft inequality is satisfied by both codes II and III, but only code II is a prefix
code.

Given a discrete memoryless source of entropy H(S), a prefix code can be constructed with an
average codeword length L that is bounded as follows:

H(S) ≤ L < H(S) + 1
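
The Kraft inequality is a one-line check; the sketch below (an illustrative addition) applies it to the three codes of the table:

def kraft_sum(codewords):
    """Sum of 2^(-lk) over all codeword lengths lk (binary alphabet)."""
    return sum(2 ** -len(cw) for cw in codewords)

codes = {
    "Code I":   ["0", "1", "00", "11"],
    "Code II":  ["0", "10", "110", "111"],
    "Code III": ["0", "01", "011", "0111"],
}

for name, cws in codes.items():
    s = kraft_sum(cws)
    print(f"{name}: Kraft sum = {s} -> {'satisfied' if s <= 1 else 'violated'}")
# Code I violates the inequality; codes II and III satisfy it, but only code II is a prefix code.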

Huffman Coding
We next describe an important class of prefix codes known as Huffman codes. The basic idea
behind Huffman coding is the construction of a simple algorithm that computes an optimal
prefix code for a given distribution, optimal in the sense that the code has the shortest expected
length.

The end result is a source code whose average codeword length approaches the
fundamental limit set by the entropy of a discrete memoryless source, namely H(S).

Huffman coding is a coding algorithm based on the statistics of the source outputs, which proceeds as
follows.

1. The source symbols are listed in order of decreasing probability. The two source symbols
of lowest probability are assigned 0 and 1. This part of the step is referred to as the
splitting stage.
2. These two source symbols are then combined into a new source symbol with probability
equal to the sum of the two original probabilities. (The list of source symbols, and,
therefore, source statistics, is thereby reduced in size by one.) The probability of the new
symbol is placed in the list in accordance with its value.
3. The procedure is repeated until we are left with a final list of source statistics (symbols) of
only two for which the symbols 0 and 1 are assigned.
4. The code for each (original) source symbol is found by working backward and tracing the
sequence of 0s and 1s assigned to that symbol as well as its successors.

Example 1: Consider a DMS with five source symbols s0, s1, s2, s3, s4 with probabilities 0.4, 0.2,
0.2, 0.1, 0.1, respectively. Construct a Huffman code for this source and determine its efficiency.

Solution: Order the symbols in decreasing order of probability; the Huffman tree and the resulting
source code are shown below.

The average codeword length is, therefore,

L = 0.4(2) + 0.2(2) + 0.2(2) + 0.1(3) + 0.1(3) = 2.2 bits/symbol

The entropy of the specified discrete memoryless source is calculated as follows:

H(S) = 0.4 log2(1/0.4) + 0.2 log2(1/0.2) + 0.2 log2(1/0.2) + 0.1 log2(1/0.1) + 0.1 log2(1/0.1) ≈ 2.121 bits

The efficiency becomes η = H(S)/L = 2.121/2.2 × 100% ≈ 96.4%.
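
A compact way to carry out steps 1 to 4 is with a priority queue. The following Python sketch (an illustrative implementation, not the construction used in the tree above) builds a Huffman code for the probabilities of Example 1 and reproduces the average length of 2.2 bits/symbol; the individual 0/1 labels, and hence the codewords, may differ from one valid tree to another, but the average length does not.

import heapq, itertools, math

def huffman(probs):
    """Return a dict mapping each symbol to its Huffman codeword."""
    counter = itertools.count()                      # tie-breaker for equal probabilities
    heap = [(p, next(counter), {sym: ""}) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)              # two least probable entries
        p1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c0.items()}
        merged.update({s: "1" + w for s, w in c1.items()})
        heapq.heappush(heap, (p0 + p1, next(counter), merged))
    return heap[0][2]

probs = {"s0": 0.4, "s1": 0.2, "s2": 0.2, "s3": 0.1, "s4": 0.1}   # Example 1
code = huffman(probs)
L = sum(probs[s] * len(cw) for s, cw in code.items())
H = -sum(p * math.log2(p) for p in probs.values())
print(code)
print(f"L = {L:.2f} bits/symbol, H(S) = {H:.3f} bits, efficiency = {H / L:.1%}")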
Example 2: Given a DMS with seven source letters 𝑥1, 𝑥2, … , 𝑥7 with probabilities 0.35, 0.30,
0.20,0.10,0.04,0.005, 0.005, respectively.

Example 3: Let the output of a DMS consist of x1, x2 and x3 with probabilities 0.45, 0.35 and 0.2,
respectively. The entropy of the source is

H(X) = −0.45 log2 0.45 − 0.35 log2 0.35 − 0.2 log2 0.2 ≈ 1.513 bits/symbol

The codewords are x1 → 0, x2 → 10, and x3 → 11, with an average codeword length of 1.55
bits/symbol and an efficiency of about 97.6%.

If pairs of symbols are encoded using the Huffman algorithm, one possible variable-length code
is as given next.

H(x) = 3.036 and L = 3.0675 bits per pair, and

the efficiency is η = (3.036/3.0675) × 100% ≈ 99%

It is noteworthy that the Huffman encoding process (i.e., the Huffman tree) is not unique. In
particular, we may cite two variations in the process that are responsible for the non-uniqueness of
the Huffman code. First, at each splitting stage in the construction of a Huffman code, there is
arbitrariness in the way the symbols 0 and 1 are assigned to the last two source symbols. Second,
ambiguity arises when the probability of a combined symbol (obtained by adding the last two
probabilities pertinent to a particular step) is found to equal another probability in the list.
Should the probability of the new symbol be placed as high or as low as possible? The answer is:
as high as possible. Why?

Lempel–Ziv Coding
A drawback of the Huffman code is that it requires knowledge of a probabilistic model of the
source; unfortunately, in practice, source statistics are not always known a priori. Moreover, in
the modeling of text we find that storage requirements prevent the Huffman code from capturing
the higher-order relationships between words and phrases because the codebook grows
exponentially fast in the size of each super-symbol of letters (i.e., grouping of letters); the
efficiency of the code is therefore compromised. To overcome these practical limitations of

Huffman codes, we may use the Lempel–Ziv algorithm, which is intrinsically adaptive and simpler
to implement than Huffman coding. Basically, the idea behind encoding in the Lempel–Ziv
algorithm is described as follows:

The source data stream is parsed into segments that are the shortest subsequences not
encountered previously.

The following procedure can be followed when generating codes using the Lempel-Ziv
algorithm:

1. The sequence at the output of the discrete source is parsed into variable-length blocks,
called phrases.
2. A new phrase is introduced every time a block of letters from the source differs from some
previous phrase in its last letter; i.e., the new phrase is the shortest string that has not
appeared before.
3. The phrases are listed in a dictionary, which stores the location of the existing phrases.
4. In encoding a new phrase:
• specify the location of the existing phrase in the dictionary, and
• append the new letter.
5. Codewords are determined by listing the dictionary location (in binary form) of the
previous phrase that matches the new phrase in all but the last letter.
6. The new output letter is then appended to that dictionary location to form the codeword.
7. The location 0000 is used to encode a phrase that has not appeared previously.
8. The source decoder constructs an identical copy of the dictionary and decodes the
received sequence in step with the transmitted data sequence.

Example 1: Consider the binary sequence

10101101001001110101000011001110101100011011

Parsing the sequence results in the following phrases


1,0,10,11,01,00,100,111,010,1000,011,001,110,101, 10001,1011

Dictionary locations are numbered consecutively, beginning with 1 and counting up, in this case
to 16, which is the number of phrases in the parsed sequence.

Then the following table is constructed.
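
The parsing and the table entries can be reproduced with the following Python sketch (an illustrative addition implementing the procedure described above; location 0, i.e. 0000, is reserved for phrases with no previous match):

seq = "10101101001001110101000011001110101100011011"

# Parse the sequence into the shortest phrases not encountered previously
dictionary = {}              # phrase -> dictionary location (1, 2, 3, ...)
phrases = []
current = ""
for bit in seq:
    current += bit
    if current not in dictionary:
        dictionary[current] = len(dictionary) + 1
        phrases.append(current)
        current = ""

# Codeword = 4-bit location of the longest previous prefix + the new last bit
loc_bits = 4                 # 16 phrases, so 4 bits for the dictionary location
for phrase in phrases:
    prefix, last = phrase[:-1], phrase[-1]
    loc = dictionary.get(prefix, 0)                 # 0 (0000) if no previous match
    print(dictionary[phrase], phrase, format(loc, f"0{loc_bits}b") + last)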

The Lempel-Ziv algorithm does not work well for short strings. In this example, 44 source bits are
encoded into 16 codewords of 5 bits each, resulting in 80 coded bits. Hence, the algorithm
provided no data compression at all. However, the inefficiency is due to the fact that the sequence
we have considered is very short. If longer sequences are used, better results may be obtained. See
the next example.

Example 2: Consider the binary sequence:

001101100011010101001001001101000001010010110010110

Parsing the sequence gives the following phrases:

0, 01, 1, 011, 00, 0110, 10, 101, 001, 0010, 01101, 000, 00101, 001011, 0010110

Since we have 15 phrases, plus the reserved all-zeros location, 16 dictionary locations are needed,
so 4 bits suffice for the dictionary location.

The table is constructed as follows.
