
9/30/2019

Introduction to Information Theory

Part 4-A

Assignment#2 Results
Adventures of Tom Sawyer (Mark Twain), 374,590 characters
H(X) = 4.077714
[Figure: bar chart of character probabilities, vertical axis from 0 to 0.25, characters in descending order of frequency: SPC e t a o n h i s r d l u w m y g c f b p k v j x q z]


Quick Code Review


• Using dictionaries in Python

• Using the Counter class
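As a minimal sketch of the two approaches named above, the snippet below counts characters once with a plain dictionary and once with `Counter`, then turns the counts into an entropy estimate. The function names are hypothetical, and the base-2 logarithm (entropy in bits) is an assumption about how the assignment measured H(X).

```python
from collections import Counter
from math import log2


def count_with_dict(text):
    """Count characters with a plain dictionary."""
    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1
    return counts


def char_entropy(text):
    """Entropy in bits per character, estimated from character frequencies."""
    counts = Counter(text)  # same counts as count_with_dict, in one call
    total = sum(counts.values())
    return -sum((n / total) * log2(n / total) for n in counts.values())


sample = "the adventures of tom sawyer"
print(count_with_dict(sample) == Counter(sample))  # both approaches agree
print(char_entropy(sample))
```

Applied to the full text of a book, `char_entropy` would give a figure comparable to the H(X) reported on the results slide, provided the text is normalized to the same alphabet first.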


Assignment#2: Huffman Encoding

Source entropy: H = 0.8113 bits per letter

Single letters (L = 1)
Symbol  Probability  Code
A       0.75         0
B       0.25         1

Pairs (L/2 = 0.84375)
Symbol  Probability  Code
AA      0.5625       0
AB      0.1875       10
BA      0.1875       110
BB      0.0625       111

Triples (L/3 = 0.8229167)
Symbol  Probability  Code
AAA     0.421875     1
AAB     0.140625     001
ABA     0.140625     010
BAA     0.140625     011
ABB     0.046875     00000
BAB     0.046875     00001
BBA     0.046875     00010
BBB     0.015625     00011

Notice that as we increase the block length, the average code length per letter approaches the entropy of the source.
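The per-letter lengths in the tables can be checked with a short sketch. It builds a binary Huffman code over blocks of n letters and reports the average code length per letter; the helper names `huffman_lengths`, `block_probs`, and `bits_per_letter` are hypothetical, and the source is assumed to be memoryless (letters independent), which is what the block probabilities in the tables imply.

```python
import heapq
from itertools import product
from math import log2, prod


def huffman_lengths(probs):
    """Return {symbol: codeword length} for a binary Huffman code."""
    # Heap entries: (probability, tie-breaker, symbols below this node).
    heap = [(p, i, [s]) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(probs, 0)
    tie = len(heap)
    while len(heap) > 1:
        p1, _, group1 = heapq.heappop(heap)
        p2, _, group2 = heapq.heappop(heap)
        # Every symbol under the merged node gets one more code bit.
        for s in group1 + group2:
            lengths[s] += 1
        heapq.heappush(heap, (p1 + p2, tie, group1 + group2))
        tie += 1
    return lengths


def block_probs(base, n):
    """Probabilities of all length-n blocks of independent letters."""
    return {"".join(b): prod(base[c] for c in b)
            for b in product(base, repeat=n)}


def bits_per_letter(base, n):
    """Average Huffman code length per source letter for block size n."""
    probs = block_probs(base, n)
    lengths = huffman_lengths(probs)
    return sum(probs[s] * lengths[s] for s in probs) / n


base = {"A": 0.75, "B": 0.25}
print("entropy:", -sum(p * log2(p) for p in base.values()))
for n in (1, 2, 3):
    print(n, bits_per_letter(base, n))  # 1.0, 0.84375, 0.8229166...
```

The individual codewords depend on how ties are broken, but the average length of a Huffman code does not, so the three per-letter values are reproducible.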


Lempel-Ziv Coding
• Sequences of text repeat patterns (words, phrases, etc.)

• Construct a dictionary of common patterns

• Send references to patterns as triples (𝑥, 𝑦, 𝑧)


Lempel-Ziv Coding (LZ77)

Message: THIS-THESIS-IS-THE-THESIS.

Offset  Length  Next  Phrase    Search Buffer | Look-Ahead Buffer
0       0       T     T         T | HIS-THESIS-IS-THE-THESIS.
0       0       H     H         TH | IS-THESIS-IS-THE-THESIS.
0       0       I     I         THI | S-THESIS-IS-THE-THESIS.
0       0       S     S         THIS | -THESIS-IS-THE-THESIS.
0       0       -     -         THIS- | THESIS-IS-THE-THESIS.
5       2       E     THE       THIS-THE | SIS-IS-THE-THESIS.
5       1       I     SI        THIS-THESI | S-IS-THE-THESIS.
7       2       I     S-I       THIS-THESIS-I | S-THE-THESIS.
10      5       -     S-THE-    THIS-THESIS-IS-THE- | THESIS.
14      6       .     THESIS.   THIS-THESIS-IS-THE-THESIS. |
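The encoding steps traced above can be sketched as a greedy longest-match search. This is an illustrative encoder, not a production one: the name `lz77_encode` and the `window` parameter are hypothetical, matches are kept entirely inside the search buffer, and one character is always held back to be sent as the literal.

```python
def lz77_encode(text, window=32):
    """Greedy LZ77 sketch: emit (offset, length, next_char) triples."""
    triples = []
    i = 0  # start of the look-ahead buffer
    while i < len(text):
        best_offset, best_length = 0, 0
        # Try every starting point in the search buffer.
        for start in range(max(0, i - window), i):
            k = 0
            # Extend the match while it stays inside the search buffer
            # and leaves one character to send as the literal.
            while (start + k < i and i + k < len(text) - 1
                   and text[start + k] == text[i + k]):
                k += 1
            if k > best_length:
                best_offset, best_length = i - start, k
        triples.append((best_offset, best_length, text[i + best_length]))
        i += best_length + 1
    return triples


for triple in lz77_encode("THIS-THESIS-IS-THE-THESIS."):
    print(triple)
```

For this message every longest match is unique, so the output does not depend on how ties between equally long matches are broken.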


Lempel-Ziv Coding
• Send references to patterns as triples (𝑥, 𝑦, 𝑧)

e.g. (5, 3, 𝐹)
go back 5 received chars
take the next 3 from there
add 𝐹 to the end

• The Search Buffer and the Look-Ahead Buffer both have finite size.

• Used by ZIP, PKZip, LHarc, PNG, gzip, ARJ

• Extended to LZ78 (uses an explicit dictionary) and LZW (with Terry Welch)

• Achieves the optimal rate of transmission in the long run without using the source's probability distribution.


Decode

Triple      Message so far
(0, 0, I)   I
(0, 0, -)   I-
(0, 0, M)   I-M
(3, 1, S)   I-MIS
(1, 1, -)   I-MISS-
(5, 5, L)   I-MISS-MISS-L
(5, 3, Y)   I-MISS-MISS-LISSY
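The decoding walk-through follows the rule "go back x characters, copy y of them, append z". A minimal sketch of that rule, with the hypothetical name `lz77_decode`:

```python
def lz77_decode(triples):
    """Rebuild the message from (offset, length, next_char) triples."""
    out = []
    for offset, length, next_char in triples:
        start = len(out) - offset  # go back `offset` received characters
        for k in range(length):    # take the next `length` from there
            out.append(out[start + k])
        out.append(next_char)      # add the literal to the end
    return "".join(out)


triples = [(0, 0, "I"), (0, 0, "-"), (0, 0, "M"), (3, 1, "S"),
           (1, 1, "-"), (5, 5, "L"), (5, 3, "Y")]
print(lz77_decode(triples))  # I-MISS-MISS-LISSY
```

Copying one character at a time means the same loop would also handle a match that runs past the end of the current output, although none of the triples in this example needs that.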


Lempel-Ziv Compression: Class Exercise


A General Communication System


• Information Source
• Transmitter
• Channel
• Receiver
• Destination

Information Channel

Input 𝑋 → Channel → Output 𝑌

Examples:
Input 𝑋                     Output 𝑌
Cholesterol levels          Condition of arteries
Symptoms or test results    Diagnosis
Geological structure        Presence of oil deposits
Opinion poll                Next President


Perfect Communication
(Discrete Noiseless Channel)

𝑋 (Transmitted Symbol)      𝑌 (Received Symbol)
0                       →   0
1                       →   1


NOISE


Motivating Noise…
f = 0.1, n = ~10,000

Transmitted  Received  Probability
0            0         1 − f
0            1         f
1            0         f
1            1         1 − f
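A small simulation makes the effect of f concrete. The sketch below flips each bit independently with probability f; reading n as the number of transmitted bits is an assumption, and the function name `transmit` and the fixed seed are illustrative choices.

```python
import random


def transmit(bits, f, rng):
    """Send bits through a noisy channel: each bit flips with probability f."""
    return [b ^ (rng.random() < f) for b in bits]


rng = random.Random(0)  # fixed seed so the run is repeatable
n, f = 10_000, 0.1
sent = [rng.randint(0, 1) for _ in range(n)]
received = transmit(sent, f, rng)
flips = sum(s != r for s, r in zip(sent, received))
print(flips, "of", n, "bits flipped")  # expected to be near f * n = 1000
```

The flip count varies from run to run, with a standard deviation of about 30 bits for these values of n and f.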


Motivating Noise…
Message: $5213.75
Received: $5293.75

1. Detect that an error has occurred.

2. Correct the error.

3. Watch out for the overhead.



Error Detection by Repetition

In the presence of 20% noise…

Message:        $ 5 2 1 3 . 7 5
Transmission 1: $ 5 2 9 3 . 7 5
Transmission 2: $ 5 2 1 3 . 7 5
Transmission 3: $ 5 2 1 3 . 1 1
Transmission 4: $ 5 4 4 3 . 7 5
Transmission 5: $ 7 2 1 8 . 7 5

From any single transmission, there is no way of knowing where the errors are.

Most common:    $ 5 2 1 3 . 7 5
1. Guesswork is involved.
2. There is overhead.
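The "most common" row is a per-position majority vote over the five transmissions. A minimal sketch, with the hypothetical name `majority_decode`:

```python
from collections import Counter


def majority_decode(transmissions):
    """Pick the most common character at each position."""
    return "".join(Counter(column).most_common(1)[0][0]
                   for column in zip(*transmissions))


received = ["$5293.75", "$5213.75", "$5213.11", "$5443.75", "$7218.75"]
print(majority_decode(received))  # $5213.75
```

The vote is still a guess: if most copies were corrupted the same way at one position, the wrong character would win. The overhead is also plain to see, since five copies were sent to deliver one message.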


Error Detection by Repetition


In the presence of 50% noise…
Message: $5213.75

Repeat 1000 times!

1. Guesswork is involved.
But it will almost never be wrong!
2. There is overhead.
A LOT of it!


Binary Symmetric Channel (BSC)
(Discrete Memoryless Channel)

𝑋 (Transmitted Symbol)   𝑌 (Received Symbol)   Probability
0                        0                     1 − 𝑝
0                        1                     𝑝
1                        0                     𝑝
1                        1                     1 − 𝑝

Defined by a set of conditional probabilities (aka transition probabilities)

$p(y \mid x)$ for all $x \in X$ and $y \in Y$

This is the probability of 𝑦 occurring at the output when 𝑥 is the input to the channel.


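A General Discrete Channel

𝑠 input symbols: $x_1, x_2, \dots, x_s$
𝑟 output symbols: $y_1, y_2, y_3, \dots, y_r$

Each input $x_i$ is connected to every output $y_j$ with probability $p(y_j \mid x_i)$; for example, $x_1$ has the transitions $p(y_1 \mid x_1), p(y_2 \mid x_1), p(y_3 \mid x_1), \dots, p(y_r \mid x_1)$.

𝑠 × 𝑟 transition probabilities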
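One way to hold the 𝑠 × 𝑟 transition probabilities is a matrix with one row per input symbol, where each row sums to 1. The sketch below checks that property and computes the output distribution by the law of total probability, p(y) = Σ p(x) p(y|x). The function names and the example channel (2 inputs, 3 outputs) are hypothetical and not taken from the slides.

```python
def is_valid_channel(channel, tol=1e-12):
    """Each row holds p(y|x) for one input x and must sum to 1."""
    return all(min(row) >= 0 and abs(sum(row) - 1.0) <= tol
               for row in channel)


def output_distribution(p_x, channel):
    """p(y_j) = sum over i of p(x_i) * p(y_j | x_i)."""
    r = len(channel[0])
    return [sum(p_x[i] * channel[i][j] for i in range(len(p_x)))
            for j in range(r)]


# Hypothetical channel with s = 2 inputs and r = 3 outputs.
channel = [[0.8, 0.2, 0.0],
           [0.0, 0.2, 0.8]]
p_x = [0.5, 0.5]
print(is_valid_channel(channel))
print(output_distribution(p_x, channel))
```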


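Channel With Internal Structure

Two binary symmetric channels in series, with crossover probabilities 𝑝 and 𝑞, are equivalent to a single binary symmetric channel:

$p(0 \mid 0) = p(1 \mid 1) = (1 - p)(1 - q) + pq$
$p(1 \mid 0) = p(0 \mid 1) = (1 - p)q + (1 - q)p$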
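The combined probabilities are the product of the two channels' transition matrices, which can be checked numerically. In this sketch the helper names `bsc` and `cascade` and the values p = 0.1, q = 0.2 are illustrative assumptions.

```python
def bsc(p):
    """Transition matrix of a binary symmetric channel; rows are inputs."""
    return [[1 - p, p],
            [p, 1 - p]]


def cascade(first, second):
    """Channels in series: multiply the transition matrices."""
    return [[sum(first[i][k] * second[k][j] for k in range(len(second)))
             for j in range(len(second[0]))]
            for i in range(len(first))]


p, q = 0.1, 0.2
combined = cascade(bsc(p), bsc(q))
print(combined[0][0], (1 - p) * (1 - q) + p * q)  # p(0|0)
print(combined[0][1], (1 - p) * q + (1 - q) * p)  # p(1|0)
```

A bit arrives intact either when neither stage flips it or when both do, which is where the two terms of p(0|0) come from.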


