Entropy and Huffman Codes
Entropy of a random variable. Let X be a random variable that takes on values from the set
{x_1, x_2, . . . , x_n} with respective probabilities p_1, p_2, . . . , p_n, where Σ_i p_i = 1. Then the entropy of X,
H(X), represents the average amount of information contained in X and is defined by

    H(X) = − Σ_{i=1}^{n} p_i log_2 p_i.
Note that entropy is measured in bits. The notation H_n(p_1, . . . , p_n) is an alternative way of writing the entropy
function.
1. If n = 2 and p_1 = p_2 = 1/2, then H(X) = 1 bit; i.e. when an event (e.g. X = x_1) has an equally
likely chance of occurring or not occurring, then its outcome possesses one bit of information.
This is the maximum amount of information a binary outcome may possess.
2. In the case when p_i = 1 for some 1 ≤ i ≤ n, then H(X) = 0; i.e. any random variable whose
outcome is certain possesses no information.
• Moreover, the above definition is the only definition which satisfies the following three properties of
information, each of which seems reasonable to require under any definition:
– Normalization: H_2(1/2, 1/2) = 1
– Continuity: H_2(p, 1 − p) is a continuous function of p on the interval (0, 1)
– Grouping: H_m(p_1, . . . , p_m) = H_{m−1}(p_1 + p_2, p_3, . . . , p_m) + (p_1 + p_2) H_2(p_1/(p_1 + p_2), p_2/(p_1 + p_2))
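These properties can be checked numerically against the entropy function defined above; a small sketch (the three-point distribution used for the grouping check is an arbitrary example, not from the text):

```python
import math

def H(*p):
    """Entropy of a probability distribution, in bits."""
    return -sum(x * math.log2(x) for x in p if x > 0)

# Normalization: H2(1/2, 1/2) = 1
assert abs(H(0.5, 0.5) - 1.0) < 1e-12

# Grouping, checked on an arbitrary three-point distribution
p1, p2, p3 = 0.2, 0.3, 0.5
lhs = H(p1, p2, p3)
rhs = H(p1 + p2, p3) + (p1 + p2) * H(p1 / (p1 + p2), p2 / (p1 + p2))
assert abs(lhs - rhs) < 1e-12
```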
Claude Shannon (1916-2001). Pioneer in
• signal processing: determined lower bounds on the number of samples needed to achieve a desired
estimation accuracy
• information theory: first to give a precise definition for the concept of information
Example 1. Calculate the amount of information contained in a weather forecast if the possibilities are
{normal, rain, fog, hot, windy}, with respective probabilities .8, .10, .04, .03, and .03.
Example 2. Calculate the entropy of the probability distribution (1/2, 1/4, 1/8, 1/16, 1/16).
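Both examples can be checked by direct computation; a sketch (the helper name `entropy` is ours):

```python
import math

def entropy(probs):
    """H(X) = -sum of p_i * log2(p_i), in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Example 1: the weather forecast
print(entropy([0.8, 0.10, 0.04, 0.03, 0.03]))   # roughly 1.08 bits

# Example 2: a dyadic distribution, so H is exactly
# (1/2)*1 + (1/4)*2 + (1/8)*3 + (1/16)*4 + (1/16)*4 = 1.875 bits
print(entropy([1/2, 1/4, 1/8, 1/16, 1/16]))
```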
Example 3. Verify that independently tossing a fair coin n times imparts n bits of information.
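The claim in Example 3 can also be verified mechanically: n independent fair tosses give 2^n equally likely outcome sequences, each of probability 2^−n, and the entropy of that uniform distribution is exactly n bits. A sketch:

```python
import math

def entropy(probs):
    """H(X) = -sum of p_i * log2(p_i), in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# n fair tosses: 2**n equally likely sequences, each with probability 2**-n
for n in range(1, 11):
    outcomes = [2**-n] * 2**n
    assert abs(entropy(outcomes) - n) < 1e-9
```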
Introduction to Codes
A code is a set of words C over some alphabet Ψ. The elements of C are called codewords. Let X be a
finite set of objects. Then an encoding of X is a map φ : X → Ψ∗, where Ψ∗ denotes the set of finite
words over Ψ. Thus, the image of the map is a set of codewords. Moreover, when we speak of a code C,
we generally are referring to some encoding φ whose image is C. More definitions:
Example 4. Consider the following famous binary code (if we let a “dot” denote 0, and a “dash” denote
1) known as the Morse code, which encodes the alphabet {A, . . . , Z} (see below). Is this code injective? A
prefix code? Uniquely decodable? What if we increase the alphabet to {0, 1, #}, where e.g. φ(‘A’) = 01#,
and # represents a pause between letters?
CHARACTER    INTERNATIONAL MORSE CODE
A 01
B 1000
C 1010
D 100
E 0
F 0010
G 110
H 0000
I 00
J 0111
K 101
L 0100
M 11
N 10
O 111
P 0110
Q 1101
R 010
S 000
T 1
U 001
V 0001
W 011
X 1001
Y 1011
Z 1100
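The questions in Example 4 can be checked mechanically against the table above (dot = 0, dash = 1); a sketch:

```python
morse = {
    'A': '01',   'B': '1000', 'C': '1010', 'D': '100',  'E': '0',
    'F': '0010', 'G': '110',  'H': '0000', 'I': '00',   'J': '0111',
    'K': '101',  'L': '0100', 'M': '11',   'N': '10',   'O': '111',
    'P': '0110', 'Q': '1101', 'R': '010',  'S': '000',  'T': '1',
    'U': '001',  'V': '0001', 'W': '011',  'X': '1001', 'Y': '1011',
    'Z': '1100',
}

# Injective: no two letters share a codeword.
is_injective = len(set(morse.values())) == len(morse)

# Prefix code: no codeword is a prefix of another.  If a is a prefix of b,
# then a sorts before b, so checking pairs in sorted order suffices.
words = sorted(morse.values())
is_prefix_code = all(not b.startswith(a)
                     for i, a in enumerate(words) for b in words[i+1:])

print(is_injective, is_prefix_code)   # True False
```

The check confirms the encoding is injective but is not a prefix code: for instance E = 0 is a prefix of I = 00, which is why the string 01 could be read as A or as ET.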
Example 5. Same questions as in Example 4, but now assume the code is the ASCII code (American
Standard Code for Information Interchange). Hint: each letter is encoded into a seven-bit word.
     0    1    2    3    4    5    6    7    8    9    A    B    C    D    E    F
0    NUL  SOH  STX  ETX  EOT  ENQ  ACK  BEL  BS   HT   LF   VT   FF   CR   SO   SI
1    DLE  DC1  DC2  DC3  DC4  NAK  SYN  ETB  CAN  EM   SUB  ESC  FS   GS   RS   US
2    SP   !    "    #    $    %    &    '    (    )    *    +    ,    -    .    /
3    0    1    2    3    4    5    6    7    8    9    :    ;    <    =    >    ?
4    @    A    B    C    D    E    F    G    H    I    J    K    L    M    N    O
5    P    Q    R    S    T    U    V    W    X    Y    Z    [    \    ]    ^    _
6    `    a    b    c    d    e    f    g    h    i    j    k    l    m    n    o
7    p    q    r    s    t    u    v    w    x    y    z    {    |    }    ~    DEL
(The row gives the high hexadecimal digit of the 7-bit code; the column gives the low digit.)
Theorem 2.
1. Kraft’s Inequality: if C is a binary prefix code with codeword lengths l_1, . . . , l_n, then
Σ_{i=1}^n 2^{−l_i} ≤ 1. Conversely, given a set of codeword lengths that satisfy this inequality, there exists a binary prefix
code with these word lengths.
2. McMillan’s Result: Kraft’s inequality holds for all uniquely-decodable codes.
Proof of Theorem 2, Part 1. Let C denote a finite binary prefix code and l_max denote the maximum
length of a codeword in C. Now consider a perfect binary tree T of height l_max. Then the following facts
should seem self-evident upon sufficient consideration.
Let T and l_max be as above, and let l_1 ≤ l_2 ≤ · · · ≤ l_n be lengths satisfying the Kraft inequality. Define
binary code C in the following manner. Basis step: let w_1 = 0^{l_1}.
Inductive step: assume that there exists 1 ≤ k ≤ n − 1 such that codewords w_1, . . . , w_k have been defined
in such a way that |w_i| = l_i, for all 1 ≤ i ≤ k. Then by the Kraft inequality, Σ_{i=1}^k 2^{−l_i} < 1, which implies
Σ_{i=1}^k 2^{l_max − l_i} < 2^{l_max}. Hence there exists a leaf L of T for which no member of w_1, . . . , w_k is an ancestor of
L. Choose the first such leaf L and set w_{k+1} equal to the ancestor of L having length l_{k+1}. Continuing in this
manner, a prefix code C with the desired word lengths is attained. QED
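The inductive construction can be sketched in code: always handing out the leftmost available subtree amounts to assigning, for each length l (in increasing order), the l-bit binary expansion of the running Kraft sum. A sketch (the function name is ours):

```python
def kraft_prefix_code(lengths):
    """Construct a binary prefix code with the given codeword lengths,
    assuming they satisfy the Kraft inequality.  Codewords are returned
    in order of increasing length."""
    assert sum(2**-l for l in lengths) <= 1, "Kraft inequality violated"
    code = []
    acc = 0.0  # position of the next free leaf, as a fraction of [0, 1)
    for l in sorted(lengths):
        # the codeword is the l-bit binary expansion of acc
        code.append(format(int(acc * 2**l), '0{}b'.format(l)))
        acc += 2**-l
    return code

print(kraft_prefix_code([1, 2, 3, 3]))   # ['0', '10', '110', '111']
```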
Efficient Codes
We now show the beautiful connection between entropy and coding theory. We may think of a computer
file (or any other entity that possesses information content) as a finite string from some alphabet X =
{x1 , . . . , xn }. Moreover, we know that letters of the alphabet occur with different frequencies. For example,
if X is the set of ASCII characters, then the letter “e” occurs with much greater frequency than “EOF”, the
symbol that denotes the end of a file. Furthermore, through empirical studies, we can obtain good estimates
for the values pi , 1 ≤ i ≤ n, where pi denotes the proportion of a file that is comprised of the i th symbol
xi .
Now suppose we want to represent file F as a binary string, so as to store it and transmit it on a computer
network. Let φ be an encoding of X into binary strings (words). We can define the average codelength
with respect to φ as

    L_φ = Σ_i p_i |φ(x_i)|,
where p_i represents the probability of x_i appearing in the file. Thus, given a file F that consists of m letters
from X, and given some encoding φ, the size of file F with respect to encoding φ is denoted by |F|_φ and
equals m · L_φ. And so to minimize the size of the file, we must find an encoding φ for which
L_φ is minimized. We call such encodings length-optimal, or, in the case of files, size-optimal.
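To make the definitions concrete, a toy computation (the encoding and probabilities below are made up for illustration):

```python
# Average codelength L_phi = sum over i of p_i * |phi(x_i)|,
# and file size |F|_phi = m * L_phi, for a made-up three-letter alphabet.
phi = {'a': '0', 'b': '10', 'c': '11'}
p = {'a': 0.5, 'b': 0.25, 'c': 0.25}

L = sum(p[x] * len(phi[x]) for x in phi)
print(L)            # 1.5 bits per symbol

m = 1000            # a hypothetical file of 1000 letters
print(m * L)        # 1500.0 bits
```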
Theorem 3 (Claude Shannon). Let X = {x_1, . . . , x_n}, and suppose that p_i is the probability (or weighted
frequency) of x_i. Let φ : X → C be uniquely decodable. Then
1. L_φ ≥ H(X).
2. There exists a prefix code φ̂ for X with L_φ̂ < H(X) + 1.
To prove Theorem 3, we introduce an asymmetric distance measure between two finite probability distributions:
the Kullback-Leibler distance between p = (p_1, . . . , p_n) and q = (q_1, . . . , q_n) is defined as
D(p||q) = Σ_{i=1}^n p_i log(p_i/q_i).
The following is stated without proof: for any two probability distributions p and q on the same finite set,
D(p||q) ≥ 0, with equality if and only if p = q.
Proof of Theorem 3, Part 1. Write l_i = |φ(x_i)|. Then

    L_φ − H(X) = Σ_{i=1}^n p_i l_i + Σ_{i=1}^n p_i log p_i = − Σ_{i=1}^n p_i log 2^{−l_i} + Σ_{i=1}^n p_i log p_i.

Now let c = Σ_{i=1}^n 2^{−l_i} and r_i = 2^{−l_i}/c, so that 2^{−l_i} = c · r_i. Then

    L_φ − H(X) = − Σ_{i=1}^n p_i log(c · r_i) + Σ_{i=1}^n p_i log p_i
               = Σ_{i=1}^n p_i log(1/c) − Σ_{i=1}^n p_i log r_i + Σ_{i=1}^n p_i log p_i
               = log(1/c) + Σ_{i=1}^n p_i log(p_i/r_i)
               = log(1/c) + D(p||r) ≥ 0,

where the last inequality holds since D(p||r) ≥ 0 and, by McMillan’s result, c ≤ 1, so that log(1/c) ≥ 0. QED
Part 2. Let l_i = ⌈log(1/p_i)⌉. These lengths satisfy the Kraft inequality, since

    Σ_{i=1}^n 2^{−⌈log(1/p_i)⌉} ≤ Σ_{i=1}^n 2^{−log(1/p_i)} = Σ_{i=1}^n p_i = 1.

Thus, by Theorem 2, we know that a prefix code φ̂ : X → C exists with these lengths. Moreover, for all
1 ≤ i ≤ n,

    log(1/p_i) ≤ l_i < log(1/p_i) + 1.

And from this we conclude (by multiplying by p_i and summing over i) that H(X) ≤ L_φ̂ < H(X) + 1. QED
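The construction in Part 2 is easy to experiment with; a sketch, using an arbitrary example distribution:

```python
import math

def shannon_lengths(probs):
    """Codeword lengths l_i = ceil(log2(1/p_i)).  They satisfy the Kraft
    inequality, so a prefix code with these lengths exists, and its
    average length L satisfies H <= L < H + 1."""
    return [math.ceil(math.log2(1 / p)) for p in probs]

p = [0.4, 0.3, 0.2, 0.1]                   # an arbitrary example distribution
lengths = shannon_lengths(p)
assert sum(2**-l for l in lengths) <= 1    # Kraft inequality

H = -sum(x * math.log2(x) for x in p)
L = sum(x * l for x, l in zip(p, lengths))
assert H <= L < H + 1
print(lengths, H, L)
```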
Huffman Coding
We now provide a greedy algorithm due to D.A. Huffman (1952) which will always find an encoding having
minimum average length.
Huffman’s Algorithm:
• Input: finite symbol set X = {x_1, . . . , x_n} with respective probabilities p_1, . . . , p_n, indexed so that p_1 and p_2 are the two smallest
• Begin Algorithm
• base case 1. if X = {x} then return encoding φ, where φ(x) = ε, the empty string
• base case 2. if X = {x_1, x_2} then return encoding φ, where φ(x_1) = 0 and φ(x_2) = 1
• recursive case. combine x_1 and x_2 into a new symbol y, having probability p_1 + p_2. Let X̂ =
{y, x_3, . . . , x_n} and let φ̂ be the encoding obtained upon applying Huffman’s Algorithm to X̂ and
probabilities {p_1 + p_2, p_3, . . . , p_n}
• define φ as φ(x_1) = φ̂(y)0, φ(x_2) = φ̂(y)1, and φ(x_i) = φ̂(x_i) for 3 ≤ i ≤ n
• return φ
• End Algorithm
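The algorithm transcribes almost directly into code; a sketch (the function name is ours, and the merged symbol y is represented as a tuple so it cannot collide with an original symbol):

```python
def huffman(symbols):
    """Huffman's algorithm as described above: symbols maps each symbol
    to its probability (or weight); returns a dict mapping each symbol
    to its binary codeword."""
    if len(symbols) == 1:
        (x,) = symbols
        return {x: ''}                      # the empty string
    if len(symbols) == 2:
        x1, x2 = list(symbols)
        return {x1: '0', x2: '1'}
    # recursive case: merge the two least likely symbols into one
    x1, x2 = sorted(symbols, key=symbols.get)[:2]
    merged = dict(symbols)
    p1, p2 = merged.pop(x1), merged.pop(x2)
    y = (x1, x2)                            # fresh symbol standing for x1, x2
    merged[y] = p1 + p2
    phi_hat = huffman(merged)
    phi = {x: w for x, w in phi_hat.items() if x != y}
    phi[x1] = phi_hat[y] + '0'
    phi[x2] = phi_hat[y] + '1'
    return phi

# an arbitrary dyadic example: optimal lengths are 1, 2, 3, 3
code = huffman({'a': 0.5, 'b': 0.25, 'c': 0.125, 'd': 0.125})
print(sorted(len(w) for w in code.values()))   # [1, 2, 3, 3]
```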
Example 6. Let X = {1, 2, 3, 4, 5} with respective probabilities {.15, .15, .2, .25, .25}. Use Huffman’s
Algorithm to find a length-optimal encoding for X.
Example 7. Let X = {1, 2, 3, 4} with respective weights {4, 4, 5, 5}. Use Huffman’s Algorithm to find
a length-optimal encoding for X.
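Both examples can be checked without drawing trees, using the identity that the weighted average codeword length of a Huffman code equals the sum of the weights created by the merges, since each merge adds one bit to every symbol beneath it. A sketch:

```python
import heapq

def huffman_cost(weights):
    """Total weighted codeword length of an optimal code: each merge of
    the two smallest weights contributes their sum."""
    heap = list(weights)
    heapq.heapify(heap)
    cost = 0
    while len(heap) > 1:
        a, b = heapq.heappop(heap), heapq.heappop(heap)
        heapq.heappush(heap, a + b)
        cost += a + b
    return cost

# Example 6: average codeword length in bits
print(round(huffman_cost([0.15, 0.15, 0.2, 0.25, 0.25]), 6))   # 2.3
# Example 7 (integer weights 4, 4, 5, 5, total 18): total length 36,
# i.e. average length 36/18 = 2, so every symbol gets a 2-bit codeword
print(huffman_cost([4, 4, 5, 5]))                              # 36
```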
Theorem 4. Huffman’s Algorithm is correct! In other words, if φ is the encoding it returns and φ_2 is any other encoding, then L_φ ≤ L_{φ_2}.
Lemma 2. For a distribution p = {p_1, . . . , p_n}, there exists a length-optimal prefix code C such that
1. if p_i > p_j, then l_i ≤ l_j
2. the two longest codewords have the same length
3. the two longest codewords differ only in the last bit. These two words are the encodings of the two
least likely symbols in X
Proof of Theorem 4. By induction on n = |X|.
Basis Step. n = 1. In this case X = {x} and φ(x) = ε, the empty string. Note that in this case
L_φ = H(X) = 0 and φ is clearly length-optimal.
Now suppose n = 2. Then X = {x_1, x_2} and Huffman’s Algorithm yields φ(x_1) = 0 and φ(x_2) = 1.
Obviously φ is a length-optimal prefix code.
Induction Step. Suppose that Huffman’s Algorithm yields length-optimal codes for all symbol sets of size n
or less, and let |X| = n + 1. Assume that x_1 and x_2 have the least probabilities p_1 and p_2. Let X̂ denote the symbol set
{y, x_3, . . . , x_{n+1}} with corresponding probabilities p = {p_1 + p_2, p_3, . . . , p_{n+1}}. By the induction assumption,
Huffman’s Algorithm produces a length-optimal encoding φ̂. Now according to the algorithm, φ is defined
by φ(x_1) = φ̂(y)0, φ(x_2) = φ̂(y)1, and φ(x_i) = φ̂(x_i) for 3 ≤ i ≤ n + 1.
Since φ̂ is a prefix code, it is clear that φ is also a prefix code. We now show that φ is length-optimal. To
see this we note that

    L_φ = Σ_{i=1}^{n+1} p_i l_i = p_1 + p_2 + p_1(l_1 − 1) + p_2(l_2 − 1) + Σ_{i=3}^{n+1} p_i l_i = p_1 + p_2 + L_φ̂.

Thus, minimizing L_φ̂ is equivalent to minimizing L_φ, since p_1 + p_2 is a constant. Verifying this is left
as an exercise and requires Lemma 2. QED
Exercises.
1. Prove that the entropy function Hm (p1 , . . . , pm ) satisfies the grouping property.
2. Suppose that a multiple-choice exam has fifty questions with four responses each, where the correct
response is randomly assigned a letter a-d. If a student who knows nothing about the exam subject
takes the exam, how much information is in the scantron that records her responses? How much
information is in the scantron of a student who has complete mastery of the subject? Assume all
questions are answered independently. Hint: think in terms of what the professor is expecting to see
when grading these exams. Suppose now the instructor notices that, for all questions, the correct
answer was marked 67% of the time, while the second-best response was marked 20% of the time, the
third best 10%, and the worst 3%. On average, how much information can be found in an exam?
3. Prove that every prefix code is uniquely decodable. Hint: use mathematical induction on the length
of the string y, to be decoded; i.e. φ∗ (x1 , x2 , . . . , xn ) = y, and you must show that the sequence of
objects x1 , x2 , . . . , xn is unique. You may assume that φ is one-to-one.
4. Is it possible to define a uniquely decodable encoding of 5 objects if the respective codeword lengths
are to be 1,2,3,3,3? Explain.
5. Give an example of a code that is not a prefix code, but is still uniquely decodable.
6. A fair coin is flipped until the first head occurs. Let random variable X denote the number of flips
required. Find the entropy H(X) in bits. The following expressions may be useful:
    Σ_{n=1}^∞ r^n = r/(1 − r),        Σ_{n=1}^∞ n r^n = r/(1 − r)^2.
7. Let φ : X → {0, 1}∗ be a prefix code for finite set X . Given a binary string y of length n describe
an efficient procedure for decoding y into a unique sequence x1 , . . . , xm of objects in X . State the
asymptotic running time (in terms of n) of your procedure.
8. The inventor of Morse code, Samuel Morse (1791-1872), needed to know the frequency of letters in
English text so that he could give the simplest codewords to the most frequently used letters. He did it
simply by counting the number of letters in sets of printers’ type. The figures he came up with were:
12,000 E 2,500 F
9,000 T 2,000 W, Y
8,000 A, I, N, O, S 1,700 G, P
6,400 H 1,600 B
6,200 R 1,200 V
4,400 D 800 K
4,000 L 500 Q
3,400 U 400 J, X
3,000 C, M 200 Z
Use this data to compute the entropy of a random variable that outputs one of the English letters.
Assuming that a 1Mb text file is Huffman encoded according to the above frequencies, what will be
the size of the encoded binary file?
9. Compute the Kullback-Leibler distance between the distributions p = (1/3, 1/3, 1/3) and q = (7/8, 1/16, 1/16).
10. Given object set X = {x1 , x2 , x3 , x4 , x5 } with respective probabilities 0.1, 0.35, 0.05, 0.2, 0.3, find a
Huffman code for X .