
Chapter 2
Mathematical Preliminaries for Lossless Compression
Yeuan-Kuen Lee [ MCU, CSIE ]

Outline

2.1 Overview
2.2 A Brief Introduction to Information Theory
    2.2.1 Derivation of Average Information (*)
2.3 Models
    2.3.1 Physical Models
    2.3.2 Probability Models
    2.3.3 Markov Models
    2.3.4 Composite Source Model
2.4 Coding
    2.4.1 Uniquely Decodable Codes
    2.4.2 Prefix Codes
    2.4.3 The Kraft-McMillan Inequality (*)
2.5 Summary


2.1 Overview

In this chapter:
- Some of the ideas in information theory that provide the framework for the development of lossless data compression schemes are briefly reviewed.
- We will also look at some ways to model the data that lead to efficient coding schemes.
- We assume some knowledge of probability concepts. (See Appendix A for a brief review of probability and random processes.)

2.2 A Brief Introduction to Information Theory

Information theory:
- The idea of a quantitative measure of information.
- Claude Elwood Shannon (who pulled everything together), Bell Labs.
- Self-information.

Self-information:
- Suppose we have an event A, which is a set of outcomes of some random experiment.
- If P(A) is the probability that the event A will occur, then the self-information associated with A is given by

    i(A) = log_b ( 1 / P(A) ) = -log_b P(A)        (2.1)
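As a quick illustration (my own sketch, not from the slides; the function name and the base argument are mine), Equation (2.1) in code:

```python
import math

def self_information(p, base=2):
    """Self-information i(A) = log_b(1/P(A)) = -log_b P(A); base 2 gives bits."""
    if not 0 < p <= 1:
        raise ValueError("probability must be in (0, 1]")
    return math.log(1 / p, base)

print(self_information(1.0))              # 0.0 -- a certain event carries no information
print(self_information(0.5))              # 1.0 bit
print(round(self_information(0.125), 6))  # 3.0 bits -- rarer events carry more information
```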

Properties of self-information:
- log(1) = 0
- 0 ≤ P(A) ≤ 1
- -log(x) increases as x decreases from 1 to 0.

Therefore, if the probability of an event is low, the amount of self-information associated with it is high; if the probability of an event is high, the amount of self-information associated with it is low.

Another property of this mathematical definition of information that makes intuitive sense is that the information obtained from the occurrence of two independent events is the sum of the information obtained from the occurrence of the individual events.

Suppose A and B are two independent events. The self-information associated with the occurrence of both event A and event B is

    i(AB) = log_b ( 1 / P(AB) ) = log_b ( 1 / (P(A) P(B)) )
          = log_b ( 1 / P(A) ) + log_b ( 1 / P(B) )
          = i(A) + i(B)


The unit of information depends on the base of the logarithm:
- If we use log base 2, the unit is bits.
- If we use log base e, the unit is nats.
- If we use log base 10, the unit is hartleys.
  (R. V. L. Hartley first proposed the use of the logarithmic measure of information.)

Note that to calculate the information in bits, we need to take the logarithm base 2 of the probability:

    log2 x = a  means  2^a = x
    ln(2^a) = ln(x)
    a ln(2) = ln(x)
    =>  a = ln(x) / ln(2)

Example 2.2.1
Let H and T be the outcomes of flipping a coin. If the coin is fair, then

    P(H) = P(T) = 1/2  and  i(H) = i(T) = 1 bit.

If the coin is not fair, then we would expect the information associated with each event to be different. Suppose

    P(H) = 1/8,  P(T) = 7/8

then

    i(H) = 3 bits,  i(T) = 0.193 bits.

At least mathematically, the occurrence of a head conveys much more information than the occurrence of a tail.
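A small check of these numbers (my own sketch), using the natural-log identity above so the results come out in bits:

```python
import math

def i_bits(p):
    # self-information in bits: -log2 P(A) = -ln(P(A)) / ln(2)
    return -math.log(p) / math.log(2)

# fair coin
print(i_bits(1 / 2))            # 1.0 bit for both H and T

# biased coin: P(H) = 1/8, P(T) = 7/8
print(round(i_bits(1 / 8), 3))  # 3.0 bits
print(round(i_bits(7 / 8), 3))  # 0.193 bits
```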

If we have a set of independent events Ai, which are sets of outcomes of some experiment S, such that

    ∪ Ai = S

where S is the sample space, then the average self-information associated with the random experiment is given by

    H = ∑ P(Ai) i(Ai) = - ∑ P(Ai) log_b P(Ai)

This quantity is called the entropy associated with the experiment.

One of the many contributions of Shannon was that he showed that if the experiment is a source that puts out symbols Ai from a set A, then the entropy is a measure of the average number of binary symbols needed to code the output of the source.

[Figure: example digit sequences illustrating this.
 1. Eight equally likely symbols: i(A1) = i(A2) = ... = i(A8) = 3, entropy = 3.
 2. Four equally likely symbols: i(A2) = i(A3) = i(A5) = i(A7) = 2, entropy = 2.
 3. Mixed case: i(A2) = i(A4) = i(A5) = i(A8) = 3 and i(A3) = i(A7) = 2, entropy = (3*4 + 2*2)/6 = 16/6 = 2.67.]
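As an aside (my own sketch, not from the slides), the entropy of a probability model in code:

```python
import math

def entropy_bits(probs):
    """H = -sum p log2 p, the average self-information in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy_bits([1 / 8] * 8))  # 3.0 -- eight equally likely symbols
print(entropy_bits([1 / 4] * 4))  # 2.0 -- four equally likely symbols
```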


Shannon showed that the best that a lossless compression scheme can do is to encode the output of a source with an average number of bits equal to the entropy of the source.

[Figure: Relative frequency of letters in English text.]

The set of symbols A is often called the alphabet for the source, and the symbols are referred to as letters. For a general source S with alphabet A = { 1, 2, ... , m } that generates a sequence {X1, X2, ...}, the entropy is given by

    H(S) = lim (n -> ∞) (1/n) Gn        (2.2)

where

    Gn = - ∑ (i1 = 1 to m) ∑ (i2 = 1 to m) ... ∑ (in = 1 to m) P(X1 = i1, X2 = i2, ... , Xn = in) log P(X1 = i1, X2 = i2, ... , Xn = in)

and {X1, X2, ... , Xn} is a sequence of length n from the source.
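One rough way to get a feel for Gn/n (my own sketch, under the strong simplifying assumption that n-gram relative frequencies observed in a single sequence stand in for the joint probabilities in Equation (2.2); the 16-sample sequence is the one used in the example that follows):

```python
from collections import Counter
import math

def block_entropy_per_symbol(seq, n):
    """Estimate Gn/n from observed n-gram relative frequencies."""
    blocks = [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]
    counts = Counter(blocks)
    total = len(blocks)
    g_n = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return g_n / n

seq = [1, 2, 3, 2, 3, 4, 5, 4, 5, 6, 7, 8, 9, 8, 9, 10]
print(block_entropy_per_symbol(seq, 1))  # first-order estimate
print(block_entropy_per_symbol(seq, 2))  # drops when there is sample-to-sample structure
```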

If each element in the sequence is independent and identically distributed (iid), then we can show that

    Gn = - n ∑ (i1 = 1 to m) P(X1 = i1) log P(X1 = i1)        (2.3)

and the equation for the entropy becomes

    H(S) = - ∑ P(X1) log P(X1)        (2.4)

For most sources, Equations (2.2) and (2.4) are not identical. If we need to distinguish between the two, we will call the quantity computed in (2.4) the first-order entropy of the source, while the quantity in (2.2) will be referred to as the entropy of the source.

In general, it is not possible to know the entropy for a physical source, so we have to estimate it. The estimate of the entropy depends on our assumptions about the structure of the source sequence.

Consider the following sequence:

    1 2 3 2 3 4 5 4 5 6 7 8 9 8 9 10

Assuming the frequency of occurrence of each number is reflected accurately in the number of times it appears in the sequence, we can estimate the probability of occurrence of each symbol as follows:

    P(1) = P(6) = P(7) = P(10) = 1/16
    P(2) = P(3) = P(4) = P(5) = P(8) = P(9) = 2/16 = 1/8

Assuming the sequence is iid, the entropy for this sequence is the same as the first-order entropy as defined in (2.4). The entropy can be calculated as

    H = - ∑ (i = 1 to 10) P(i) log2 P(i) = - 4 (1/16) log2 (1/16) - 6 (1/8) log2 (1/8) = 1 + 18/8 = 3.25 bits

This means that the best scheme we could find for coding this sequence could only code it at 3.25 bits/sample.
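A quick empirical check of the 3.25 bits/sample figure (my own sketch):

```python
from collections import Counter
import math

seq = [1, 2, 3, 2, 3, 4, 5, 4, 5, 6, 7, 8, 9, 8, 9, 10]
counts = Counter(seq)
n = len(seq)

# first-order entropy estimate from relative frequencies
H = -sum((c / n) * math.log2(c / n) for c in counts.values())
print(H)  # 3.25 bits/sample
```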


However, if we assume that there was sample-to-sample correlation between the samples, and we remove the correlation by taking differences of neighboring sample values, we arrive at the residual sequence

    1 1 1 -1 1 1 1 -1 1 1 1 1 1 -1 1 1

This sequence is constructed using only two values, with probabilities P(1) = 13/16 and P(-1) = 3/16. The entropy in this case is

    H = - ∑ (i = 1 to 2) P(i) log2 P(i)
      = - (13/16) log2 (13/16) - (3/16) log2 (3/16)
      = - (0.8125) log2 (0.8125) - (0.1875) log2 (0.1875)
      = - (0.8125) ln(0.8125)/ln(2) - (0.1875) ln(0.1875)/ln(2)
      = - (0.8125) (-0.2076)/(0.6931) - (0.1875) (-1.6740)/(0.6931)
      = - (0.8125) (-0.2996) - (0.1875) (-2.4150)
      = 0.2434 + 0.4528
      = 0.6962  ≈ 0.7 bits/symbol

Of course, knowing only this sequence would not be enough for the receiver to reconstruct the original sequence. The receiver must also know the process by which this sequence was generated from the original sequence. The process depends on our assumptions about the structure of the sequence. These assumptions are called the model for the sequence.
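The differencing step and the resulting entropy estimate, sketched in code (my own illustration; it assumes the first sample is carried over unchanged, which reproduces the residual sequence shown above):

```python
from collections import Counter
import math

seq = [1, 2, 3, 2, 3, 4, 5, 4, 5, 6, 7, 8, 9, 8, 9, 10]

# residual r[n] = x[n] - x[n-1]; the first sample is kept as-is
residual = [seq[0]] + [b - a for a, b in zip(seq, seq[1:])]
print(residual)  # [1, 1, 1, -1, 1, 1, 1, -1, 1, 1, 1, 1, 1, -1, 1, 1]

counts = Counter(residual)
n = len(residual)
H = -sum((c / n) * math.log2(c / n) for c in counts.values())
print(round(H, 4))  # 0.6962 bits/symbol
```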

In this case, the model for the sequence is

    xn = xn-1 + rn

where xn is the nth element of the original sequence and rn is the nth element of the residual sequence.

This model is called a static model because its parameters do not change with n. A model whose parameters change or adapt with n to the changing characteristics of the data is called an adaptive model.

Consider the following contrived sequence:

    12123333123333123312

Obviously, there is some structure to the data. However, if we look at it one symbol at a time:

    P(1) = P(2) = 1/4,  P(3) = 1/2
    Entropy = - 2 (1/4) log2 (1/4) - (1/2) log2 (1/2) = 1 + 0.5 = 1.5 bits/symbol

This sequence consists of 20 symbols; therefore, the total number of bits required to represent it is 30.

Now let's take the same sequence and look at it in blocks of two. Obviously, there are only two symbols, 12 and 33:

    P(12) = 1/2,  P(33) = 1/2

The entropy is 1 bit/symbol. As there are 10 such symbols in the sequence, we need a total of 10 bits to represent the entire sequence, a reduction by a factor of three.
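The symbol-at-a-time versus blocks-of-two comparison, sketched in code (my own illustration):

```python
from collections import Counter
import math

def first_order_entropy(symbols):
    counts = Counter(symbols)
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

s = "12123333123333123312"

H1 = first_order_entropy(list(s))
print(H1, H1 * len(s))      # 1.5 bits/symbol -> 30 bits total

pairs = [s[i:i + 2] for i in range(0, len(s), 2)]
H2 = first_order_entropy(pairs)
print(H2, H2 * len(pairs))  # 1.0 bit/block  -> 10 bits total
```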


2.3 Models

- Having a good model for the data can be useful in estimating the entropy of the source.
- Good models for sources lead to more efficient compression algorithms.
- Obviously, the better the model, the more likely it is that we will come up with a satisfactory technique.

2.3.1 Physical Models

- If we know something about the physics of the data generation process, we can use that information to construct a model. Examples: speech-related applications (Chapter 7), telemetry data.
- In general, however, the physics of data generation is simply too complicated to understand, let alone use to develop a model. Where the physics of the problem is too complicated, we can obtain a model based on empirical observation of the statistics of the data.

2.3.2 Probability Models

- Ignorance model: the simplest statistical model for the source is to assume that each letter generated by the source is independent of every other letter, and that each occurs with the same probability.
- Probability model: for a source that generates letters from an alphabet A = { a1, a2, ... , aM }, we can have a probability model P = { P(a1), P(a2), ... , P(aM) }. (We keep the assumption that each letter in the alphabet is independent of the others.)
- Given a probability model (and the independence assumption), we can compute the entropy of the source using Equation (2.4).
- If the assumption of independence does not fit our observation of the data, we can generally find a better compression scheme if we discard this assumption. When we discard the independence assumption, we have to come up with a way to describe the dependence of elements of the data sequence on each other.

2.3.3 Markov Models

- One of the most popular ways of representing dependence in the data is through the use of Markov models. [Russian mathematician Andrei Andreyevich Markov, 1856-1922]
- For models used in lossless compression, we use a specific type of Markov process called a discrete time Markov chain. Let { xn } be a sequence of observations. This sequence is said to follow a kth-order Markov model if

    P( xn | xn-1, xn-2, ... , xn-k ) = P( xn | xn-1, ... , xn-k, ... )        (2.13)

  In other words, knowledge of the past k symbols is equivalent to the knowledge of the entire past history of the process. The values taken on by the set { xn-1, xn-2, ... , xn-k } are called the state of the process. If the size of the source alphabet is l, then the number of states is l^k.
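To make the "state" idea concrete, here is a small sketch (my own, not from the slides) that estimates first-order (k = 1) conditional probabilities from an observed sequence by counting symbol pairs:

```python
from collections import Counter, defaultdict

def first_order_transitions(seq):
    """Estimate P(x_n | x_{n-1}) by the relative frequency of observed pairs."""
    counts = defaultdict(Counter)
    for prev, cur in zip(seq, seq[1:]):
        counts[prev][cur] += 1
    return {state: {sym: c / sum(ctr.values()) for sym, c in ctr.items()}
            for state, ctr in counts.items()}

print(first_order_transitions("12123333123333123312"))
# e.g. in this sequence a '1' is always followed by a '2', so P(2|1) = 1.0
```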


- The most commonly used Markov model is the first-order Markov model, for which

    P( xn | xn-1 ) = P( xn | xn-1, xn-2, xn-3, ... )        (2.14)

- Equations (2.13) and (2.14) indicate the existence of dependence between samples. However, they do not describe the form of the dependence. (We can develop different first-order Markov models.)
- If we assume that the dependence was introduced in a linear manner, we could view the data sequence as the output of a linear filter driven by white noise. The output of such a filter can be given by the difference equation

    xn = ρ xn-1 + εn        (2.15)

  where εn is a white noise process. This model is often used when developing coding algorithms for speech and images.

The use of the Markov model does not require the assumption of linearity. For example, consider a binary image. Define two states: Sw (the current pixel is a white pixel) and Sb (the current pixel is a black pixel). We define the transition probabilities P(w|b) and P(b|w), and the probability of being in each state, P(Sw) and P(Sb).

[Figure 2.2 A two-state Markov model for binary images: states Sw and Sb, with transitions P(b|w) from Sw to Sb, P(w|b) from Sb to Sw, and self-loops P(w|w) and P(b|b).]

- The entropy of a finite state process with states Si is the average value of the entropy at each state:

    H = ∑ (i = 1 to M) P(Si) H(Si)

- For our example of a binary image:

    H(Sw) = - P(b|w) log P(b|w) - P(w|w) log P(w|w)
    H(Sb) = - P(b|b) log P(b|b) - P(w|b) log P(w|b)

  where P(w|w) = 1 - P(b|w) and P(b|b) = 1 - P(w|b).

Example 2.3.1
To see the effect of modeling on the estimate of entropy, let us calculate the entropy for a binary image:
1. Using a probability model (iid).
2. Using the Markov model.

The model parameters (as in Figure 2.2) are

    P(b|w) = 0.01, P(w|w) = 0.99, P(w|b) = 0.3, P(b|b) = 0.7
    P(Sw) = 30/31, P(Sb) = 1/31


Example 2.3.1 (continued)

Using the probability model (iid assumption):

    H = - (30/31) log2 (30/31) - (1/31) log2 (1/31) = 0.20559

Using the Markov model:

    H(Sb) = - P(b|b) log P(b|b) - P(w|b) log P(w|b)
          = - (0.7) log (0.7) - (0.3) log (0.3)
          = 0.88129

    H(Sw) = - P(b|w) log P(b|w) - P(w|w) log P(w|w)
          = - (0.01) log (0.01) - (0.99) log (0.99)      (skewed probabilities)
          = 0.08079

    H = (1/31) (0.88129) + (30/31) (0.08079)
      = 0.10661

This is about half of the entropy obtained using the iid assumption.

Markov Models in Text Compression

- The use of Markov models for written English appears in the original work of Shannon.
- In the current text compression literature, the kth-order Markov models are more widely known as finite context models (a first-order Markov model corresponds to a single-letter context).
- Context = state.
- Example: "preceding". Suppose we have already processed "precedin" and are going to encode the next letter.
  - Taking no account of context: P(g) = 2%.
  - Single-letter context: P(g | n) increases substantially.
  - Two-letter context: P(g | in).
  - Three-letter context: P(g | din).
- The probability becomes more and more skewed → lower entropy.
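A check of Example 2.3.1 in code (my own sketch):

```python
import math

def H2(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

P_Sw, P_Sb = 30 / 31, 1 / 31
P_bw, P_wb = 0.01, 0.3                 # P(b|w), P(w|b)
P_ww, P_bb = 1 - P_bw, 1 - P_wb        # P(w|w), P(b|b)

# iid (probability model) estimate
print(round(H2([P_Sw, P_Sb]), 5))           # 0.20559

# Markov model estimate: H = P(Sw) H(Sw) + P(Sb) H(Sb)
H_Sw, H_Sb = H2([P_bw, P_ww]), H2([P_bb, P_wb])
print(round(P_Sw * H_Sw + P_Sb * H_Sb, 4))  # 0.1066 (the slide's 0.10661 uses rounded intermediates)
```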

Markov Models in Text Compression (continued)

- Shannon used a second-order model for English text consisting of 26 letters and 1 space to obtain an entropy of 3.1 bits/letter.
- Using a model where the output symbols were words rather than letters brought down the entropy to 2.4 bits/letter.
- Shannon also used predictions generated by people (subjects who knew the 100 previous letters) to estimate upper and lower bounds on the entropy of the 27-letter English alphabet: upper bound 1.3 bits/letter, lower bound 0.6 bits/letter.
- The longer the context, the better its predictive value. Problem: the number of contexts (states) grows exponentially with the length of the context. For a fourth-order model with an alphabet size of 95, the possible number of contexts is 95^4, more than 81 million!

- Adaptive strategy: the probabilities for the different symbols in the different contexts are updated as they are encountered.
- Zero frequency problem: we may encounter symbols that have not been encountered before in a given context.
  - Send a code to indicate that the following symbol is being encountered for the first time, followed by a prearranged code for that symbol.
  - Trade-off: overhead vs. frequency.
  - Solution: the ppm algorithm (prediction with partial match), Chapter 6.
- The use of Markov models in text compression is a rich and active area of research (Chapter 6).


2.3.4 Composite Source Model

[Figure 2.3 A composite source: Source 1 (P1), Source 2 (P2), ... , Source n (Pn) connected through a switch; only one source can be active at any given time.]

2.4 Coding

Coding: the assignment of binary sequences to elements of an alphabet.
- code: the set of binary sequences.
- codewords: the individual members of the set.

An alphabet is a collection of symbols called letters.
Example: the alphabet used in writing most books consists of the 26 lowercase letters, 26 uppercase letters, and a variety of punctuation marks. The 7-bit ASCII code for the letter a is 1100001, the letter A is coded as 1000001, and the letter "," is coded as 0101100.

Note that the ASCII code uses the same number of bits to represent each symbol. Such a code is called a fixed-length code.

- Variable-length code: if we want to reduce the number of bits required to represent different messages, we need to use a different number of bits to represent different symbols. If we use fewer bits to represent symbols that occur more often, we would, on average, use fewer bits per symbol.
- The average number of bits per symbol is often called the rate of the code.
- Morse code: the codewords for letters that occur more frequently are shorter than those for letters that occur less frequently. The codeword for E is ".", while the codeword for Z is "- - . .".

2.4.1 Uniquely Decodable Codes

The average length of the code is not the only important point in designing a "good" code. Consider an alphabet { a1, a2, a3, a4 } with

    P(a1) = 1/2, P(a2) = 1/4, P(a3) = P(a4) = 1/8

The entropy is 1.75 bits/symbol.

Table 2.1 Four different codes for a four-letter alphabet.

    Letter   Probability   Code 1   Code 2   Code 3   Code 4
    a1       0.5           0        0        0        0
    a2       0.25          0        1        10       01
    a3       0.125         1        00       110      011
    a4       0.125         10       11       111      0111
    Average length         1.125    1.25     1.75     1.875
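A small sketch (my own) that reproduces the average-length row of Table 2.1:

```python
probs = [0.5, 0.25, 0.125, 0.125]
codes = {
    "Code 1": ["0", "0", "1", "10"],
    "Code 2": ["0", "1", "00", "11"],
    "Code 3": ["0", "10", "110", "111"],
    "Code 4": ["0", "01", "011", "0111"],
}

# average length l = sum_i P(a_i) * n(a_i), where n(a_i) is the codeword length
for name, words in codes.items():
    print(name, sum(p * len(w) for p, w in zip(probs, words)))
# Code 1 1.125, Code 2 1.25, Code 3 1.75, Code 4 1.875
```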


The average length l for each code is given by

    l = ∑ (i = 1 to 4) P(ai) n(ai)

Based on the average length alone, Code 1 appears to be the best code. However:

- Code 1 (a1 = 0, a2 = 0, a3 = 1, a4 = 10; average length 1.125):
  Problem: ambiguous! Both a1 and a2 have been assigned the codeword 0. When a 0 is received, there is no way to know whether an a1 or an a2 was transmitted. We would like each symbol to be assigned a unique codeword.

- Code 2 (a1 = 0, a2 = 1, a3 = 00, a4 = 11; average length 1.25):
  Each symbol is assigned a distinct codeword.
  Problem: encoding the sequence a2 a1 a1 gives 1 0 0, but 1 0 0 can be decoded as a2 a1 a1 or as a2 a3. The original sequence cannot be recovered with certainty.

- Code 3 (a1 = 0, a2 = 10, a3 = 110, a4 = 111; average length 1.75):
  Note that the first three codewords all end in a 0; a 0 always denotes the termination of a codeword. The final codeword contains no 0s and is 3 bits long.
  Decoding rule: accumulate bits until you get a 0 or you have three 1s.
  No ambiguity: uniquely decodable!

- Code 4 (a1 = 0, a2 = 01, a3 = 011, a4 = 0111; average length 1.875):
  Each codeword starts with a 0, and the only time we see a 0 is at the beginning of a codeword.
  Decoding rule: accumulate bits until you see a 0; the bit before that 0 is the last bit of the previous codeword.
  No ambiguity: uniquely decodable!
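The Code 3 decoding rule described above, sketched in code (my own illustration):

```python
CODE3 = {"0": "a1", "10": "a2", "110": "a3", "111": "a4"}

def decode_code3(bits):
    """Accumulate bits until a 0 arrives or three 1s have accumulated."""
    out, buf = [], ""
    for b in bits:
        buf += b
        if b == "0" or buf == "111":
            out.append(CODE3[buf])
            buf = ""
    return out

print(decode_code3("1000"))       # ['a2', 'a1', 'a1'] -- no ambiguity, unlike Code 2
print(decode_code3("111110100"))  # ['a4', 'a3', 'a2', 'a1']
```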

- Instantaneous vs. near-instantaneous codes:
  - The decoder of Code 3 knows the moment a codeword is complete.
  - The decoder of Code 4 has to wait till the beginning of the next codeword.
- Instantaneous decoding is a nice property, but it is not a requirement for unique decodability. In fact, Code 5, while it is certainly not instantaneous, is uniquely decodable.

Tables 2.2 and 2.3: Codes 5 and 6. Uniquely decodable or not?

    Letter   Code 5   Code 6
    a1       0        0
    a2       01       01
    a3       11       10

Let's decode the string

    0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

In this string, the first codeword is either 0, corresponding to a1, or 01, corresponding to a2. Depending on that choice, the rest of the string is decoded as a run of a3's:

    0  | 11 | 11 | ...  ->  a1, a3, a3, ...
    01 | 11 | 11 | ...  ->  a2, a3, a3, ...

Only one of the two choices comes out even when we reach the end of the string (it depends on whether the number of 1s is even or odd), so the string can be uniquely decoded, although the decoder has to see the whole string before it can decide.

For small codes this kind of reasoning is easy; for large codes it is not, so a systematic procedure (a test for unique decodability) is useful. On the other hand, there is a class of variable-length codes that are always uniquely decodable, so for them a test for unique decodability is not necessary.
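A systematic test of this kind tracks "dangling suffixes" (essentially the Sardinas-Patterson test). The following is my own rough sketch of that idea, not the book's code:

```python
def is_uniquely_decodable(codewords):
    """Dangling-suffix test: the code is not uniquely decodable
    if a dangling suffix ever equals one of the codewords."""
    cw = set(codewords)

    def suffixes(a_set, b_set):
        # if a is a proper prefix of b, the leftover b[len(a):] is a dangling suffix
        return {b[len(a):] for a in a_set for b in b_set
                if a != b and b.startswith(a)}

    dangling = suffixes(cw, cw)
    seen = set()
    while dangling:
        if dangling & cw:          # a dangling suffix is itself a codeword
            return False
        seen |= dangling
        dangling = (suffixes(dangling, cw) | suffixes(cw, dangling)) - seen
    return True

print(is_uniquely_decodable(["0", "01", "11"]))  # Code 5 -> True
print(is_uniquely_decodable(["0", "01", "10"]))  # Code 6 -> False
```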


As for Code 6 (a1 = 0, a2 = 01, a3 = 10), encode the sequence a1, a3, a3, a3, a3, a3, a3, a3, a3 and then decode the result:

    Encode:  a1, a3, a3, a3, a3, a3, a3, a3, a3  ->  0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0

    Decode:  0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0  ->  a2, a1, a3, a3, a3, a3, a3, a3, a3

This incorrect decoding is also a valid decoding, and Code 6 is not uniquely decodable.

2.4.2 Prefix Codes

A prefix code: no codeword is a prefix of another codeword.

A simple way to check whether a code is a prefix code is to draw the rooted binary tree corresponding to the code. Note that the tree has two kinds of nodes:
- internal nodes
- external nodes (leaves)

In a prefix code, the codewords are associated only with the external nodes.

[Figure: the binary tree for Code 2 (a1 = 0, a2 = 1, a3 = 00, a4 = 11); a3 hangs below a1 and a4 below a2, so the codewords for a1 and a2 sit on internal nodes.]
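A quick sketch of the prefix-property check itself (my own code; it tests pairs of codewords directly rather than drawing the tree):

```python
def is_prefix_code(codewords):
    """True if no codeword is a prefix of another codeword."""
    return not any(a != b and b.startswith(a)
                   for a in codewords for b in codewords)

print(is_prefix_code(["0", "1", "00", "11"]))      # Code 2 -> False (0 is a prefix of 00)
print(is_prefix_code(["0", "10", "110", "111"]))   # Code 3 -> True
print(is_prefix_code(["0", "01", "011", "0111"]))  # Code 4 -> False
```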

[Figure: binary code trees for Code 3 (a1 = 0, a2 = 10, a3 = 110, a4 = 111) and Code 4 (a1 = 0, a2 = 01, a3 = 011, a4 = 0111). In the Code 3 tree, all four codewords sit on external nodes, so Code 3 is a prefix code. In the Code 4 tree, the codewords for a1, a2, and a3 lie on internal nodes along the path to a4, so Code 4 is not a prefix code.]

- It is nice to have a class of codes whose members are so clearly uniquely decodable.
- Are we losing something if we restrict ourselves to prefix codes? (If we do not restrict ourselves to prefix codes, can we find shorter codes?)
- Answer: No! For any nonprefix uniquely decodable code, we can always find a prefix code with the same codeword lengths. (The proof is in Section 2.4.3.)
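That last claim rests on the Kraft-McMillan inequality of Section 2.4.3: the codeword lengths of any uniquely decodable code satisfy sum_i 2^(-l_i) <= 1, and any set of lengths satisfying the inequality can be realized by a prefix code. A small sketch of checking the inequality (my own, not from the slides):

```python
def kraft_sum(lengths):
    """Kraft-McMillan sum over the codeword lengths: sum of 2^(-l_i)."""
    return sum(2 ** -l for l in lengths)

# Codes 3 and 4 from Table 2.1 have lengths (1, 2, 3, 3) and (1, 2, 3, 4)
print(kraft_sum([1, 2, 3, 3]))  # 1.0    -> a prefix code with these lengths exists (Code 3)
print(kraft_sum([1, 2, 3, 4]))  # 0.9375 -> Code 4's lengths can also be realized by a prefix code
print(kraft_sum([1, 1, 1]))     # 1.5    -> no uniquely decodable code has these lengths
```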

