
Entropy and Huffman Codes

Entropy as a Measure of Information Content

Entropy of a random variable. Let X be a random variable that takes on values from the set
{x_1, x_2, ..., x_n} with respective probabilities p_1, p_2, ..., p_n, where \sum_i p_i = 1. Then the entropy of X,
H(X), represents the average amount of information contained in X and is defined by

H(X) = -\sum_{i=1}^{n} p_i \log_2 p_i.

Note that entropy is measured in bits. The notation H_n(p_1, ..., p_n) is another way of writing the entropy
of the distribution (p_1, ..., p_n).
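For concreteness, the definition translates directly into a few lines of Python. The following is a minimal sketch (the function name entropy and the list-of-probabilities interface are choices made here, not notation from these notes):

    import math

    def entropy(probs):
        """Return H(X) = -sum_i p_i * log2(p_i), in bits, for a distribution
        given as a list of probabilities summing to 1."""
        # Terms with p_i = 0 contribute 0, by the usual convention 0 * log 0 = 0.
        return -sum(p * math.log2(p) for p in probs if p > 0)

    print(entropy([0.5, 0.5]))   # 1.0 bit: the equally likely binary outcome
    print(entropy([1.0]))        # 0.0 bits: a certain outcome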

Arguments in favor of the above definition of information:

• the definition is consistent with the following extreme cases:

1. If n = 2 and p_1 = p_2 = 1/2, then H(X) = 1 bit; i.e. when an event (e.g. X = x_1) has an equally
likely chance of occurring or not occurring, then its outcome possesses one bit of information.
This is the maximum amount of information a binary outcome may possess.
2. If p_i = 1 for some 1 ≤ i ≤ n, then H(X) = 0; i.e. any random variable whose
outcome is certain possesses no information.

• moreover, the above definition is the only definition which satisfies the following three properties of
information which seem reasonable under any definition:

– Normalization: H_2(1/2, 1/2) = 1
– Continuity: H_2(p, 1 - p) is a continuous function of p on the interval (0, 1)
– Grouping:

  H_m(p_1, ..., p_m) = H_{m-1}(p_1 + p_2, p_3, ..., p_m) + (p_1 + p_2) H_2\left(\frac{p_1}{p_1 + p_2}, \frac{p_2}{p_1 + p_2}\right)

Claude Shannon (1916-2001). Pioneer in

• applying Boolean logic to electronic circuit design

• studying the complexity of Boolean circuits

• signal processing: determined lower bounds on the number of samples needed to achieve a desired
estimation accuracy

• game theory: inventor of minimax algorithm

• information theory: first to give a precise definition for the concept of information

• coding theory: Channel Coding Theorem, Optimal Coding Theorem

Example 1. Calculate the amount of information contained in a weather forecast if the possibilities are
{normal, rain, fog, hot, windy} and their respective probabilities are .8, .10, .04, .03, and .03.

Example 2. Calculate the entropy of the probability distribution (1/2, 1/4, 1/8, 1/16, 1/16).
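Both examples can be checked numerically with the entropy sketch given after the definition above; the distributions below are simply the ones stated in the examples:

    forecast = [0.8, 0.10, 0.04, 0.03, 0.03]   # Example 1
    dyadic = [1/2, 1/4, 1/8, 1/16, 1/16]       # Example 2

    print(entropy(forecast))   # information content of the forecast, in bits
    print(entropy(dyadic))     # 1.875 bits, since every p_i is a power of 1/2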

Example 3. Verify that independently tossing a fair coin n times imparts n bits of information.

Introduction to Codes

A code is a set of words C over some alphabet Ψ. The elements of C are called codewords. Let X be a
finite set of objects. Then an encoding of X is a map φ : X → Ψ*, where Ψ* denotes the set of finite-length
words over Ψ. Thus, the image of the map is a set of codewords. Moreover, when we speak of a code C, we
generally are referring to some encoding φ whose image is C. More definitions:

• encoding φ is one-to-one or non-singular if φ is a one-to-one map

• an extension of an encoding φ is a map φ* which maps finite-length strings over X to finite-length
strings over Ψ, and is defined in terms of φ in the following manner:

  φ*(x_1 x_2 ⋯ x_n) = φ(x_1)φ(x_2) ⋯ φ(x_n).

• a code (encoding) is called uniquely decodable iff its extension is one-to-one

• a code is called a prefix code if no codeword is a proper prefix of another codeword.

Theorem 1. Every prefix code is uniquely decodable. (The proof is left as an exercise; see Exercise 3.)
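Since the next example asks whether a given code has the prefix property, here is a small Python sketch that tests the property directly from the definition (the function name is chosen here for illustration); it can be applied, for instance, to the list of Morse codewords below:

    def is_prefix_code(codewords):
        """Return True iff no codeword is a proper prefix of another codeword."""
        words = sorted(codewords)
        # After sorting, if some codeword is a proper prefix of another, then it
        # is a prefix of its immediate successor, so adjacent checks suffice.
        return all(not words[i + 1].startswith(words[i])
                   for i in range(len(words) - 1))

    print(is_prefix_code(["0", "10", "110", "111"]))   # True
    print(is_prefix_code(["0", "01"]))                 # False: "0" is a prefix of "01"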

5
Example 4. Consider the following famous binary code (if we let a “dot” denote 0 and a “dash” denote
1) known as the Morse code, which encodes the alphabet {A, ..., Z} (see below). Is this code injective? a
prefix code? uniquely decodable? What if we enlarge the code alphabet to {0, 1, #}, where e.g. φ('A') = 01#
and # represents a pause between letters?

CHARACTER    INTERNATIONAL MORSE CODE
A 01
B 1000
C 1010
D 100
E 0
F 0010
G 110
H 0000
I 00
J 0111
K 101
L 0100
M 11
N 10
O 111
P 0110
Q 1101
R 010
S 000
T 1
U 001
V 0001
W 011
X 1001
Y 1011
Z 1100

Example 5. Same questions as in Example 4, but now assume the code is the ASCII code (American
Standard Code for Information Interchange). Hint: each letter is encoded into a seven-bit word.

     0    1    2    3    4    5    6    7
0   NUL  SOH  STX  ETX  EOT  ENQ  ACK  BEL
1   DLE  DC1  DC2  DC3  DC4  NAK  SYN  ETB
2   SP   !    "    #    $    %    &    '
3   0    1    2    3    4    5    6    7
4   @    A    B    C    D    E    F    G
5   P    Q    R    S    T    U    V    W
6   `    a    b    c    d    e    f    g
7   p    q    r    s    t    u    v    w

     8    9    A    B    C    D    E    F
0   BS   HT   LF   VT   FF   CR   SO   SI
1   CAN  EM   SUB  ESC  FS   GS   RS   US
2   (    )    *    +    ,    -    .    /
3   8    9    :    ;    <    =    >    ?
4   H    I    J    K    L    M    N    O
5   X    Y    Z    [    \    ]    ^    _
6   h    i    j    k    l    m    n    o
7   x    y    z    {    |    }    ~    DEL

Theorem 2.

1. Kraft Inequality: For any binary prefix code with codeword lengths l_1, ..., l_n,

   \sum_{i=1}^{n} 2^{-l_i} \leq 1.

Conversely, given a set of codeword lengths that satisfy this inequality, there exists a binary prefix
code with these word lengths.
2. McMillan’s Result: Kraft’s inequality holds for all uniquely-decodable codes.

Proof of Theorem 2, Part 1. Let C denote a finite binary prefix code and l_max denote the maximum
length of a codeword in C. Now consider a perfect binary tree T of height l_max. Then the following facts
should seem self-evident upon sufficient consideration.

1. T has 2^{l_max} leaves
2. there is a one-to-one correspondence between binary words of length not exceeding l_max and nodes of T
3. there is a one-to-one mapping from C into the set of nodes of T
4. every leaf of T has at most one ancestor in C
5. every codeword w_i ∈ C is the ancestor of exactly 2^{l_max - l_i} leaves of T

From the above facts we see that

\sum_{i=1}^{n} 2^{l_max - l_i} \leq 2^{l_max},

and hence (dividing both sides of the above inequality by 2^{l_max})

\sum_{i=1}^{n} 2^{-l_i} \leq 1.

Conversely, suppose that the lengths l_1, ..., l_n satisfy

\sum_{i=1}^{n} 2^{-l_i} \leq 1.

Let T and l_max be as above. Define the binary code C in the following manner. Basis step: let w_1 = 0^{l_1}.
Inductive step: assume that there exists 1 ≤ k ≤ n - 1 such that codewords w_1, ..., w_k have been defined
in such a way that |w_i| = l_i for all 1 ≤ i ≤ k. Then by the Kraft inequality, \sum_{i=1}^{k} 2^{-l_i} < 1, which implies
\sum_{i=1}^{k} 2^{l_max - l_i} < 2^{l_max}. Then there exists a leaf L of T for which no member of w_1, ..., w_k is an ancestor of
L. Choose the first such leaf L and set w_{k+1} to the ancestor of L having length l_{k+1}. Continuing in this
manner, a prefix code C with the desired word lengths is attained. QED
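The converse direction of the proof is constructive, and the standard “canonical code” construction below sketches it in Python. The function name is chosen here for illustration; it assumes, as in the theorem, that the given lengths satisfy the Kraft inequality.

    def prefix_code_from_lengths(lengths):
        """Build binary codewords with the given lengths (canonical construction)."""
        if sum(2 ** -l for l in lengths) > 1:
            raise ValueError("lengths violate the Kraft inequality")
        order = sorted(range(len(lengths)), key=lambda i: lengths[i])
        codewords = [None] * len(lengths)
        code = 0                              # next free codeword, as an integer
        prev_len = 0
        for i in order:                       # process lengths in increasing order
            code <<= lengths[i] - prev_len    # pad out to the new length
            codewords[i] = format(code, "0{}b".format(lengths[i]))
            code += 1                         # skip past this codeword's subtree
            prev_len = lengths[i]
        return codewords

    print(prefix_code_from_lengths([1, 2, 3, 3]))   # ['0', '10', '110', '111']

Assigning codewords in order of increasing length, and skipping one unit of code space per assigned word, is just the leaf-choosing argument of the proof carried out left to right in the tree T.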

Efficient Codes

We now show the beautiful connection between entropy and coding theory. We may think of a computer
file (or any other entity that possesses information content) as a finite string from some alphabet X =
{x1 , . . . , xn }. Moreover, we know that letters of the alphabet occur with different frequencies. For example,
if X is the set of ASCII characters, then the letter “e” occurs with much greater frequency than “EOF”, the
symbol that denotes the end of a file. Furthermore, through empirical studies, we can obtain good estimates
for the values p_i, 1 ≤ i ≤ n, where p_i denotes the proportion of the file that consists of the i-th symbol x_i.

Now suppose we want to represent a file F as a binary string, so as to store it and transmit it on a computer
network. Let φ be an encoding of X into binary strings (words). We can define the average code length
with respect to φ as

L_φ = \sum_i p_i |φ(x_i)|,

where p_i represents the probability of x_i appearing in the file. Thus, given a file F that consists of m letters
from X, and given some encoding φ, the size of file F with respect to encoding φ is denoted by |F|_φ and
equals m · L_φ. And so to minimize the size of the file, we must find an encoding φ for which L_φ is
minimized. We call such encodings length-optimal, or, in the case of files, size-optimal.
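As a small illustration of these two quantities, the following sketch computes L_φ and |F|_φ = m · L_φ for a hypothetical five-symbol encoding (the probabilities and codeword lengths below are made up for illustration):

    def average_code_length(probs, lengths):
        """L_phi = sum_i p_i * |phi(x_i)|, the expected codeword length in bits."""
        return sum(p * l for p, l in zip(probs, lengths))

    probs = [0.4, 0.3, 0.15, 0.1, 0.05]     # hypothetical symbol probabilities
    lengths = [1, 2, 3, 4, 4]               # lengths of a hypothetical prefix code
    L = average_code_length(probs, lengths)
    m = 10000                               # number of symbols in the file F
    print(L, m * L)                         # L_phi and the file size |F|_phi in bits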

Theorem 3 (Claude Shannon). Let X = {x1 , . . . , xn }, and suppose that pi is the probability (or weighted
frequency) of xi . Let φ : X → C be uniquely-decodable. Then

1. Lφ ≥ H(X ).

2. There exists an encoding φ̂ for which

H(X ) ≤ Lφ̂ < H(X ) + 1.

To prove Theorem 3, we introduce an asymmetric distance measure between two finite probability distribu-
tions.

Kullback-Leibler Distance. Let p = {p_1, ..., p_n} and r = {r_1, ..., r_n} be two probability distributions.
Then the Kullback-Leibler distance between p and r, denoted D(p||r), is given by

D(p||r) = \sum_{i=1}^{n} p_i \log\left(\frac{p_i}{r_i}\right).
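Written out in Python, the definition reads as follows (a sketch with illustrative names; base-2 logarithms are used to match the bit units above). Note that D(p||r) is not symmetric in p and r:

    import math

    def kl_distance(p, r):
        """D(p||r) = sum_i p_i * log2(p_i / r_i); terms with p_i = 0 contribute 0."""
        return sum(pi * math.log2(pi / ri) for pi, ri in zip(p, r) if pi > 0)

    print(kl_distance([0.5, 0.5], [0.75, 0.25]))   # positive
    print(kl_distance([0.5, 0.5], [0.5, 0.5]))     # 0.0 when the distributions agree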

The following is stated without proof.

Lemma 1. D(p||r) ≥ 0, for all distributions p and r.

Proof of Theorem 3.

Part 1. Let l_i denote the length of φ(x_i) for all 1 ≤ i ≤ n. Then

L_φ - H(X) = \sum_{i=1}^{n} p_i l_i - \sum_{i=1}^{n} p_i \log \frac{1}{p_i} = -\sum_{i=1}^{n} p_i \log 2^{-l_i} + \sum_{i=1}^{n} p_i \log p_i.

Now let c = \sum_{i=1}^{n} 2^{-l_i} and r_i = \frac{2^{-l_i}}{c}, so that 2^{-l_i} = c \cdot r_i. Then

L_φ - H(X) = -\sum_{i=1}^{n} p_i \log (c \cdot r_i) + \sum_{i=1}^{n} p_i \log p_i =

\sum_{i=1}^{n} p_i \log \frac{1}{c} - \sum_{i=1}^{n} p_i \log r_i + \sum_{i=1}^{n} p_i \log p_i =

\log \frac{1}{c} + \sum_{i=1}^{n} p_i \log \frac{p_i}{r_i} = \log \frac{1}{c} + D(p||r) ≥ 0,

where the final inequality holds since c ≤ 1 (by Part 2 of Theorem 2, because φ is uniquely decodable) and D(p||r) ≥ 0 (by Lemma 1). QED

Part 2. Let l_i = \lceil \log \frac{1}{p_i} \rceil. These lengths satisfy the Kraft inequality since

\sum_{i=1}^{n} 2^{-\lceil \log \frac{1}{p_i} \rceil} \leq \sum_{i=1}^{n} 2^{-\log \frac{1}{p_i}} = \sum_{i=1}^{n} p_i = 1.

Thus, by Theorem 2, we know that a prefix code φ̂ : X → C exists with these lengths. Moreover, for all
1 ≤ i ≤ n,

\log \frac{1}{p_i} \leq l_i < \log \frac{1}{p_i} + 1.

And from this we conclude (by multiplying by p_i and summing over i)

H(X) ≤ L_φ̂ < H(X) + 1.

QED
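The construction in Part 2 is easy to check numerically. The sketch below (illustrative code, using the dyadic distribution of Example 2) computes the lengths l_i = ⌈log(1/p_i)⌉ and compares the resulting average length with the entropy:

    import math

    p = [1/2, 1/4, 1/8, 1/16, 1/16]                       # distribution of Example 2
    lengths = [math.ceil(math.log2(1 / pi)) for pi in p]  # Shannon code lengths
    H = -sum(pi * math.log2(pi) for pi in p)
    L = sum(pi * li for pi, li in zip(p, lengths))

    print(lengths)    # [1, 2, 3, 4, 4] -- these satisfy the Kraft inequality
    print(H, L)       # the distribution is dyadic, so L equals H exactly here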

Huffman Coding

We now provide a greedy algorithm due to D. A. Huffman (1952) that always finds an encoding having
minimum average length; a short implementation sketch in Python follows the algorithm description.

Huffman’s Algorithm:

• Name of Algorithm: Huffman’s Algorithm

• Input:

– symbol set X = {x1 , . . . , xn } listed in increasing order of probability


– probabilities {p1 , . . . , pn }

• Output: a length-optimal encoding φ

• Begin Algorithm

• base case 1. if X = {x} then return encoding φ, where φ(x) = ε, the empty string

• base case 2. if X = {x1 , x2 } then return encoding φ, where φ(x1 ) = 0 and φ(x2 ) = 1

• recursive case. combine x_1 and x_2 into a new symbol y, having probability p_1 + p_2. Let X̂ =
{y, x_3, ..., x_n} and let φ̂ be the encoding obtained upon applying Huffman's Algorithm to X̂ and
probabilities {p_1 + p_2, p_3, ..., p_n}

• define φ as

– φ(x_i) = φ̂(x_i) for i ≥ 3
– φ(x_1) = φ̂(y) · 0
– φ(x_2) = φ̂(y) · 1

• return φ

• End Algorithm
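A compact Python sketch of the algorithm above, using a min-heap in place of the explicit recursion (the function name and the heap-based organization are choices made here, not part of the original description):

    import heapq
    from itertools import count

    def huffman_code(symbols, probs):
        """Return a dict mapping each symbol to its Huffman codeword."""
        if len(symbols) == 1:
            return {symbols[0]: ""}       # base case 1: the empty string
        # Heap entries: (probability, tie-breaker, {symbol: partial codeword}).
        tie = count()
        heap = [(p, next(tie), {s: ""}) for s, p in zip(symbols, probs)]
        heapq.heapify(heap)
        while len(heap) > 1:
            p1, _, group1 = heapq.heappop(heap)   # the two least likely groups
            p2, _, group2 = heapq.heappop(heap)
            # Prepending a bit here has the same effect as appending it in the
            # recursive case: the merged group plays the role of the symbol y.
            merged = {s: "0" + w for s, w in group1.items()}
            merged.update({s: "1" + w for s, w in group2.items()})
            heapq.heappush(heap, (p1 + p2, next(tie), merged))
        return heap[0][2]

    # e.g. huffman_code(["a", "b", "c", "d", "e"], [.15, .15, .2, .25, .25])
    # returns a prefix code that can be compared with the answer to Example 6.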

Example 6. Let X = {1, 2, 3, 4, 5} with respective probabilities {.15, .15, .2, .25, .25}. Use Huffman's
Algorithm to find a length-optimal encoding for X.

Example 7. Let X = {1, 2, 3, 4} with respective weights {4, 4, 5, 5}. Use Huffman's Algorithm to find
a length-optimal encoding for X.

Theorem 4. Huffman's Algorithm is correct! In other words, if φ is the encoding returned by Huffman's Algorithm and φ_2 is any other prefix encoding, then L_φ ≤ L_{φ_2}.

The proof of the following lemma is left as an exercise.

Lemma 2. For a distribution p = {p1 , . . . , pn }, there exists a length-optimal prefix code C such that

1. if p_i > p_j, then l_i ≤ l_j
2. the two longest codewords have the same length
3. the two longest codewords differ only in the last bit. These two words are the encoding of the two
least likely symbols in X

Proof of Theorem 4. The proof uses induction on n = |X |.

Basis Step. n = 1. In this case X = {x} and φ(x) = , the empty string. Note that in this case
Lφ = H(X ) = 0 and φ is clearly length-optimal.

Now suppose n = 2. Then X = {x_1, x_2} and Huffman's Algorithm yields φ(x_1) = 0 and φ(x_2) = 1.
Obviously φ is a length-optimal prefix code.

Induction Step. Suppose that Huffman's Algorithm yields length-optimal codes for all symbol sets of size n
or less. Assume that x_1 and x_2 have the least probabilities p_1 and p_2. Let X̂ denote the symbol set
{y, x_3, ..., x_{n+1}} with corresponding probabilities p = {p_1 + p_2, p_3, ..., p_{n+1}}. By the induction assumption,
Huffman’s Algorithm produces a length-optimal encoding φ̂. Now according to the algorithm, φ is defined
as

• φ(x_i) = φ̂(x_i) for i ≥ 3
• φ(x_1) = φ̂(y) · 0
• φ(x_2) = φ̂(y) · 1

Since φ̂ is a prefix code, it is clear that φ is also a prefix code. We now show that φ is length-optimal. To
see this we note that

L_φ = \sum_{i=1}^{n+1} p_i l_i = p_1 + p_2 + p_1(l_1 - 1) + p_2(l_2 - 1) + \sum_{i=3}^{n+1} p_i l_i = p_1 + p_2 + L_φ̂.

Thus, minimizing L_φ̂ is equivalent to minimizing L_φ, since p_1 + p_2 is a constant. Verifying this is left
as an exercise and requires Lemma 2. QED

Exercises.

1. Prove that the entropy function Hm (p1 , . . . , pm ) satisfies the grouping property.

2. Suppose that a multiple-choice exam has fifty questions with four responses each, where the correct
response is randomly assigned a letter a-d. If a student who knows nothing about the exam subject
takes the exam, how much information is in the scantron that records her responses? How much
information is in the scantron of a student who has complete mastery of the subject? Assume all
questions are answered independently. Hint: think in terms of what the professor is expecting to see
when grading these exams. Suppose now the instructor notices that, for all questions, the correct
answer was marked 67% of the time, while the second-best response was marked 20%, the third best 10%,
and the worst 3%. On average, how much information can be found in an exam?

3. Prove that every prefix code is uniquely decodable. Hint: use mathematical induction on the length
of the string y to be decoded; i.e., if φ*(x_1 x_2 ⋯ x_n) = y, then you must show that the sequence of
objects x_1, x_2, ..., x_n is unique. You may assume that φ is one-to-one.

4. Is it possible to define a uniquely decodable encoding of 5 objects if the respective codeword lengths
are to be 1,2,3,3,3? Explain.

5. Give an example of a code that is not a prefix code, but is still uniquely decodable.

6. A fair coin is flipped until the first head occurs. Let random variable X denote the number of flips
required. Find the entropy H(X) in bits. The following expressions may be useful:

\sum_{n=1}^{\infty} r^n = \frac{r}{1 - r},

\sum_{n=1}^{\infty} n r^n = \frac{r}{(1 - r)^2}.

7. Let φ : X → {0, 1}∗ be a prefix code for finite set X . Given a binary string y of length n describe
an efficient procedure for decoding y into a unique sequence x1 , . . . , xm of objects in X . State the
asymptotic running time (in terms of n) of your procedure.

8. The inventor of Morse code, Samuel Morse (1791-1872), needed to know the frequency of letters in
English text so that he could give the simplest codewords to the most frequently used letters. He did it
simply by counting the number of letters in sets of printers' type. The figures he came up with were:

12,000 E 2,500 F
9,000 T 2,000 W, Y
8,000 A, I, N, O, S 1,700 G, P
6,400 H 1,600 B
6,200 R 1,200 V
4,400 D 800 K
4,000 L 500 Q
3,400 U 400 J, X
3,000 C, M 200 Z

Use this data to compute the entropy of a random variable that outputs one of the English letters.
Assuming that a 1Mb text file is Huffman encoded according to the above frequencies, what will be
the size of the encoded binary file?

9. Compute the Kullback-Leibler distance between the distributions p = (1/3, 1/3, 1/3) and q = (7/8, 1/16, 1/16).

10. Given object set X = {x1 , x2 , x3 , x4 , x5 } with respective probabilities 0.1, 0.35, 0.05, 0.2, 0.3, find a
Huffman code for X .

