Quantization and Data Compression

ECE 302 Spring 2012
Purdue University, School of ECE
Prof. Ilya Pollak
What is data compression?
•  Reducing the file size without compromising the
quality of the data stored in the file too much
(lossy compression) or at all (lossless
compression).
•  With compression, you can fit higher-quality data
(e.g., higher-resolution pictures or video) into a
file of the same size as required for lower-quality
uncompressed data.

Why data compression?

•  Our appetite for data (high-resolution pictures,
HD video, audio, documents, etc.) seems to
always significantly outpace hardware
capabilities for storage and transmission.

Data compression: Step 0
•  If the data is continuous-time (e.g., audio) or
continuous-space (e.g., picture), it first needs to be
sampled.
•  Sampling is typically done nowadays during signal
acquisition (e.g., digital camera for pictures or audio
recording equipment for music and speech).
•  We will not study sampling. It is studied in ECE 301,
ECE 438, and ECE 440.
•  We will consider compressing discrete-time or
discrete-space data.

Example: compression of
grayscale images
•  An eight-bit grayscale image is a rectangular array
of integers between 0 (black) and 255 (white).
•  Each site in the array is called a pixel.
•  It takes one byte (eight bits) to store one pixel value,
since it can be any number between 0 and 255.
•  It would take 25 bytes to store a 5x5 image.
•  Can we do better?

Example: compression of
grayscale images
255 255 255 255 255
255 255 255 255 255
200 200 200 200 200
200 200 200 200 200
200 200 200 200 100

Can we do better than 25 bytes?

Two key ideas
•  Idea #1:
–  Transform the data to create lots of zeros. For example,
we could rasterize the image, compute the differences, and
store the top left value along with the 24 differences [in
reality, other transforms are used, but they work in a similar
way].
–  255,0,0,0,0,0,0,0,0,0,−55,0,0,0,0,0,0,0,0,0,0,0,0,0,−100
–  This seems to make things worse: now the numbers can
range from −255 to 255, and therefore we need two bytes
per pixel!
•  Idea #2:
–  When encoding the data, spend fewer bits on frequently
occurring numbers and more bits on rare numbers.
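
The differencing transform of Idea #1 can be sketched in a few lines of Python. This is only an illustration on the 5x5 example image; the function names are made up for this sketch, and the row-by-row raster order is an assumption that matches the sequence shown above.

```python
# The 5x5 example image from the slides, stored row by row.
image = [
    [255, 255, 255, 255, 255],
    [255, 255, 255, 255, 255],
    [200, 200, 200, 200, 200],
    [200, 200, 200, 200, 200],
    [200, 200, 200, 200, 100],
]

def difference_transform(rows):
    """Rasterize row by row; keep the first value, then successive differences."""
    flat = [value for row in rows for value in row]
    return [flat[0]] + [b - a for a, b in zip(flat, flat[1:])]

def inverse_difference_transform(diffs):
    """Undo the transform with a running sum, recovering the rasterized pixels."""
    flat, total = [], 0
    for d in diffs:
        total += d
        flat.append(total)
    return flat

diffs = difference_transform(image)
print(diffs)           # top-left value followed by the 24 differences
print(diffs.count(0))  # number of zeros created by the transform
```

The transform is invertible, so no information is lost; it only reshapes the data so that one value (zero) becomes very frequent.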

Entropy coding
Suppose we are encoding realizations of a discrete random variable X such that

value of X     0      255    −55    −100
probability    22/25  1/25   1/25   1/25

Consider the following fixed-length encoder:

value of X     0      255    −55    −100
codeword       00     01     10     11

For a file with 25 numbers, E[file size] = 25*2*(22/25 + 1/25 + 1/25 + 1/25) = 50 bits.

Now consider the following encoder:

value of X     0      255    −55    −100
codeword       1      01     000    001

For a file with 25 numbers, E[file size] = 25*(1*22/25 + 2*1/25 + 3*1/25 + 3*1/25) = 30 bits!
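
The two expected file sizes can be checked with a short Python sketch. The dictionary and function names are illustrative choices, not part of the lecture; exact fractions are used so the arithmetic matches the hand calculation.

```python
from fractions import Fraction

# Distribution of X and the two encoders from the slide.
probabilities = {
    0: Fraction(22, 25),
    255: Fraction(1, 25),
    -55: Fraction(1, 25),
    -100: Fraction(1, 25),
}
fixed_length = {0: "00", 255: "01", -55: "10", -100: "11"}
variable_length = {0: "1", 255: "01", -55: "000", -100: "001"}

def expected_file_size(code, probs, count):
    """Expected number of bits for `count` i.i.d. symbols encoded with `code`."""
    return count * sum(probs[s] * len(code[s]) for s in probs)

print(expected_file_size(fixed_length, probabilities, 25))     # fixed-length encoder
print(expected_file_size(variable_length, probabilities, 25))  # variable-length encoder
```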

Entropy coding
•  A similar encoding scheme can be devised for a
random variable of pixel differences which takes
values between −255 and 255, to result in a smaller
average file size than two bytes per pixel.
•  Another commonly used idea: run-length coding. I.e.,
instead of encoding each 0 individually, encode the
length of each string of zeros.
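
A minimal sketch of the run-length idea in Python, applied to the difference sequence from the 5x5 example. The output representation (a ("zeros", length) pair for each run) is an assumption made for illustration; practical coders choose their own formats.

```python
def zero_run_lengths(values):
    """Replace each run of zeros by a ("zeros", length) pair; keep other values as-is."""
    encoded, run = [], 0
    for v in values:
        if v == 0:
            run += 1
        else:
            if run:
                encoded.append(("zeros", run))
                run = 0
            encoded.append(v)
    if run:  # flush a run that reaches the end of the data
        encoded.append(("zeros", run))
    return encoded

diffs = [255] + [0] * 9 + [-55] + [0] * 13 + [-100]
encoded = zero_run_lengths(diffs)
print(encoded)
```

The 25 numbers collapse to five items, which could then be passed on to an entropy coder.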

Back to the four-symbol example
value of X     0      255    −55    −100
probability    22/25  1/25   1/25   1/25
codeword       1      01     000    001

Can we do even better than 30 bits?

What about this alternative encoder?

value of X     0      255    −55    −100
probability    22/25  1/25   1/25   1/25
codeword       0      01     1      10

E[file size] = 25*(1*22/25 + 2*1/25 + 1*1/25 + 2*1/25) = 27 bits

Is there anything wrong with this encoder?

The second encoding is not
uniquely decodable!
value of X     0      255    −55    −100
probability    22/25  1/25   1/25   1/25
codeword       0      01     1      10

Encoded string ‘01’ could either be 255 or 0 followed by −55.
Therefore, this code is unusable!
It turns out that the first code is uniquely decodable.
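
As a quick illustration that the first code can be decoded without ambiguity, the Python sketch below encodes the 25 differences from the example and decodes them again. The left-to-right decoding strategy is an assumption of this sketch; it works here because no codeword of the first code is the beginning of another codeword.

```python
code = {0: "1", 255: "01", -55: "000", -100: "001"}

def encode(values, code):
    """Concatenate the codewords of the values."""
    return "".join(code[v] for v in values)

def decode(bits, code):
    """Read bits left to right; emit a symbol as soon as the buffer matches a codeword."""
    inverse = {word: symbol for symbol, word in code.items()}
    values, buffer = [], ""
    for bit in bits:
        buffer += bit
        if buffer in inverse:
            values.append(inverse[buffer])
            buffer = ""
    if buffer:
        raise ValueError("leftover bits do not form a codeword")
    return values

diffs = [255] + [0] * 9 + [-55] + [0] * 13 + [-100]
bits = encode(diffs, code)
print(len(bits))                    # size of the encoded file in bits
print(decode(bits, code) == diffs)  # round trip recovers the data
```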

What kinds of distributions are
amenable to entropy coding?




[Figure: two histograms over the symbols a, b, c, d, each with a vertical axis running from 0 to 0.3.
Left panel: Can do a lot better than two bits per symbol.
Right panel: Cannot do better than two bits per symbol.]
Conclusion: the transform procedure should be such that the numbers fed
into the entropy coder have a highly concentrated histogram (a few very
likely values, most values unlikely). Also, if we are encoding each number
individually, they should be independent or approximately independent.

What if we are willing to lose
some information?
Original pixel values:
253 253 255 254 255
254 254 254 255 254
252 255 255 254 252
253 253 254 254 254
252 255 253 252 253

After replacing every pixel value with 253.5:
253.5 253.5 253.5 253.5 253.5
253.5 253.5 253.5 253.5 253.5
253.5 253.5 253.5 253.5 253.5
253.5 253.5 253.5 253.5 253.5
253.5 253.5 253.5 253.5 253.5


Some eight-bit images

Left image: the five stripes contain random values from (left to right): {252,253,254,255}, {188,189,190,191}, {125,126,127,128}, {61,62,63,64}, {0,1,2,3}.
Right image: the five stripes contain random integers from (left to right): {240,…,255}, {176,…,191}, {113,…,128}, {49,…,64}, {0,…,15}.

Converting continuous-valued to
discrete-valued signals
•  Many real-world signals are continuous-valued.
–  audio signal a(t): both the time argument t and the intensity value
a(t) are continuous;
–  image u(x,y): both the spatial location (x,y) and the image
intensity value u(x,y) are continuous;
–  video v(x,y,t): x,y,t, and v(x,y,t) are all continuous.
•  Discretizing the argument values t, x, and y (or
sampling), is studied in ECE 301, 438, and 440.
•  However, in addition to discretizing the argument
values, the signal values must be discretized as well in
order to be digitally stored.

Quantization
•  Digitizing a continuous-valued signal into a discrete and
finite set of values.
•  Converting a discrete-valued signal into another
discrete-valued signal, with fewer possible discrete values.

How to compare two quantizers?

•  Suppose data X(1),…,X(N) is quantized using two quantizers, to result in
Y1(1),…,Y1(N) and Y2(1),…,Y2(N).
•  Suppose both Y1(1),…,Y1(N) and Y2(1),…,Y2(N) can be encoded with the
same number of bits.
•  Which quantization is better?
•  The one that results in less distortion. But how to measure distortion?
–  In general, measuring and modeling perceptual image similarity and similarity of
audio are open research problems.
–  Some useful things are known about human audio and visual systems that
inform the design of quantizers.

Sensitivity of the Human Visual
System to Contrast Changes, as a
Function of Frequency

[From Mannos-Sakrison IEEE-IT 1974]

High and low frequencies may be quantized more coarsely

But there are many other
intricacies in the way human
visual system computes

Are these two images similar?

What about these two?

•  Performance assessment of compression algorithms and quantizers is
complicated, because measuring image fidelity is complicated.
•  Often, very simple distortion measures are used, such as mean-square error.

Scalar vs Vector Quantization
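[Figure: two panels in the (r,s) plane, with s running from 0 to 255 on the vertical axis; the r-axis ticks are 0, 127, 255 in the left panel and 0, 95, 255 in the right panel.]
Scalar quantization (left panel):
•  quantize each value separately
•  simple thresholding
Vector quantization (right panel):
•  quantize several values jointly
•  more complex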

What kinds of joint distributions are
amenable to scalar quantization?
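[Figure: two panels in the (r,s) plane, with s running from 0 to 255 on the vertical axis: a green square (left, r-axis ticks 0, 127, 255) and a yellow region (right, r-axis ticks 0, 95, 255).]
Left panel: If (r,s) are jointly uniform over green square
(or, more generally, independent), knowing
r does not tell us anything about s.
Best thing to do: make quantization
decisions independently.
Right panel: If (r,s) are jointly uniform over yellow
region, knowing r tells us a lot about s.
Best thing to do: make quantization
decisions jointly.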
Conclusion: if the data is transformed before quantization, the transform
procedure should be such that the coefficients fed into the quantizer are
independent (or at least uncorrelated, or almost uncorrelated), in order to
enable the simpler scalar quantization.

More on Scalar Quantization
•  Does it make sense to do scalar
quantization with different
quantization bins for different
–  No reason to do this if we are
quantizing grayscale pixel values.
–  However, if we can decompose the
image into components that are less
perceptually important and more
perceptually important, we should use
larger quantization bins for the less
important components.
[Figure: the (r,s) plane, with s running from 0 to 255 on the vertical axis and r-axis ticks at 0, 127, 255.]

Structure of a Typical Lossy
Compression Algorithm for Audio,
Images, or Video

data → transform → quantization → entropy coding → compressed bitstream

Let’s more closely consider quantization and entropy coding.

(Various transforms are considered in ECE 301 and ECE 438.)

Quantization: problem statement
Input: source (e.g., image, video, speech signal) producing a sequence of
discrete or continuous random variables X(1),…,X(N)
(e.g., transformed image pixel values).
Output: sequence of discrete random variables Y(1),…,Y(N), each
distributed over a finite set of values (quantization levels).

Errors: D(1),…,D(N) where D(n) = X(n) − Y(n)

MSE is a widely used measure of
distortion of quantizers

•  Suppose data X(1),…,X(N) are quantized, to result in Y(1),…,Y(N).

\[
E\left[\sum_{n=1}^{N} \big(X(n) - Y(n)\big)^2\right] = E\left[\sum_{n=1}^{N} \big(D(n)\big)^2\right]
\]
If $D(1),\ldots,D(N)$ are identically distributed, this is the same as $N\,E\big[(D(n))^2\big]$, for any $n$.
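
For a single observed data set, the sum inside the expectation can be computed directly. The Python sketch below does this for the 5x5 example in which every pixel was replaced by 253.5. It evaluates one realization rather than the expectation itself, and the function name is an illustrative choice.

```python
def sum_squared_error(x, y):
    """Sum over n of (X(n) - Y(n))^2 for one realization of the data."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

# Rasterized 5x5 image from the lossy example, and its quantized version.
x = [253, 253, 255, 254, 255,
     254, 254, 254, 255, 254,
     252, 255, 255, 254, 252,
     253, 253, 254, 254, 254,
     252, 255, 253, 252, 253]
y = [253.5] * len(x)

total = sum_squared_error(x, y)
print(total)           # total squared error over the 25 pixels
print(total / len(x))  # average squared error per pixel
```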

Scalar uniform quantization
•  Use quantization intervals (bins) of equal
size [x1,x2), [x2,x3),…[xL,xL+1].
•  Quantization levels q1, q2,…, qL.
•  Each quantization level is in the middle of
the corresponding quantization bin: $q_k = (x_k + x_{k+1})/2$.
•  If quantizer input X is in [xk,xk+1), the
corresponding quantized value is Y = qk.
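
A minimal Python sketch of a scalar uniform quantizer along these lines. The parameter names `low`, `high`, and `num_levels` are assumptions of this sketch: the range [low, high] is split into `num_levels` equal bins and each input is mapped to the midpoint of its bin.

```python
def uniform_quantize(value, low, high, num_levels):
    """Map a value in [low, high] to the midpoint of its equal-width bin."""
    width = (high - low) / num_levels
    k = int((value - low) // width)     # index of the bin containing value
    k = min(max(k, 0), num_levels - 1)  # the last bin is closed on the right
    return low + (k + 0.5) * width

# Example: quantize 8-bit pixel values to 4 levels over [0, 256].
for pixel in (0, 63, 64, 200, 255):
    print(pixel, uniform_quantize(pixel, 0, 256, 4))
```

With these settings the bins are [0,64), [64,128), [128,192), [192,256] and the quantization levels are 32, 96, 160, and 224.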

Uniform vs non-uniform
•  Uniform quantization is not a good
strategy for distributions which
significantly differ from uniform.
•  If the distribution is non-uniform, it is better
to spend more quantization levels on
more probable parts of the distribution
and fewer quantization levels on less
probable parts.

Scalar Lloyd-Max quantizer
•  X = source random variable with a known distribution. We assume it to be a
continuous r.v. with PDF fX(x)>0.
–  The results can be extended to discrete or mixed random variables, and to
continuous random variables whose density can be zero for some x.
•  Quantization intervals $(x_1,x_2), [x_2,x_3), \ldots, [x_L,x_{L+1})$ and levels $q_1,\ldots,q_L$ such that
–  $x_1 = -\infty$
–  $x_{L+1} = \infty$
–  $-\infty < q_1 < x_2 \le q_2 < x_3 \le q_3 < \cdots \le q_{L-1} < x_L \le q_L < +\infty$
I.e., $q_k \in$ k-th quantization interval.
•  Y = the result of quantizing X, a discrete random variable with L possible
outcomes, $q_1, q_2, \ldots, q_L$, defined by
\[
Y = Y(X) = \begin{cases}
q_1 & \text{if } X < x_2 \\
q_2 & \text{if } x_2 \le X < x_3 \\
\;\vdots \\
q_{L-1} & \text{if } x_{L-1} \le X < x_L \\
q_L & \text{if } X \ge x_L
\end{cases}
\]

Scalar Lloyd-Max quantizer: goal

•  Given the pdf $f_X(x)$ of the source r.v. X and the desired number L of
quantization levels, find the quantization interval endpoints $x_2,\ldots,x_L$ and
quantization levels $q_1,\ldots,q_L$ to minimize the mean-square error, $E[(Y-X)^2]$.
•  To do this, express the mean-square error in terms of the quantization
interval endpoints and quantization levels, and find the minimum (or
minima) through differentiation.

Scalar Lloyd-Max quantizer: derivation
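\[
E\left[(Y - X)^2\right] = \int_{-\infty}^{\infty} \big(y(x) - x\big)^2 f_X(x)\,dx
= \sum_{k=1}^{L} \int_{x_k}^{x_{k+1}} \big(y(x) - x\big)^2 f_X(x)\,dx
= \sum_{k=1}^{L} \int_{x_k}^{x_{k+1}} (q_k - x)^2 f_X(x)\,dx
\]
Minimize w.r.t. $q_k$:
\[
\frac{\partial}{\partial q_k} E\left[(Y - X)^2\right] = \int_{x_k}^{x_{k+1}} 2\,(q_k - x)\, f_X(x)\,dx = 0
\]
\[
\int_{x_k}^{x_{k+1}} q_k f_X(x)\,dx = \int_{x_k}^{x_{k+1}} x f_X(x)\,dx, \quad \text{therefore} \quad
q_k = \frac{\int_{x_k}^{x_{k+1}} x f_X(x)\,dx}{\int_{x_k}^{x_{k+1}} f_X(x)\,dx} = E[X \mid X \in k\text{-th quantization interval}]
\]
This is a minimum, since
\[
\frac{\partial^2}{\partial q_k^2} E\left[(Y - X)^2\right] = \int_{x_k}^{x_{k+1}} 2 f_X(x)\,dx > 0.
\]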


Scalar Lloyd-Max quantizer: derivation
\[
E\left[(Y - X)^2\right] = \int_{-\infty}^{\infty} \big(y(x) - x\big)^2 f_X(x)\,dx
= \sum_{k=1}^{L} \int_{x_k}^{x_{k+1}} \big(y(x) - x\big)^2 f_X(x)\,dx
= \sum_{k=1}^{L} \int_{x_k}^{x_{k+1}} (q_k - x)^2 f_X(x)\,dx
\]
Minimize w.r.t. $x_k$, for $k = 2,\ldots,L$:
\[
\frac{\partial}{\partial x_k} E\left[(Y - X)^2\right]
= \frac{\partial}{\partial x_k} \left\{ \int_{x_{k-1}}^{x_k} (q_{k-1} - x)^2 f_X(x)\,dx + \int_{x_k}^{x_{k+1}} (q_k - x)^2 f_X(x)\,dx \right\}
\]
\[
= (q_{k-1} - x_k)^2 f_X(x_k) - (q_k - x_k)^2 f_X(x_k)
= (q_{k-1} - q_k)(q_{k-1} + q_k - 2x_k)\, f_X(x_k) = 0.
\]
By assumption, $f_X(x) \ne 0$ and $q_{k-1} \ne q_k$. Therefore,
\[
x_k = \frac{q_{k-1} + q_k}{2}, \quad \text{for } k = 2,\ldots,L.
\]
This is a minimum, since
\[
\frac{\partial^2}{\partial x_k^2} E\left[(Y - X)^2\right] = 2\,(q_k - q_{k-1})\, f_X(x_k) > 0.
\]


Nonlinear system to be solved
\[
\begin{cases}
q_k = \dfrac{\int_{x_k}^{x_{k+1}} x f_X(x)\,dx}{\int_{x_k}^{x_{k+1}} f_X(x)\,dx} = E[X \mid X \in k\text{-th quantization interval}], & \text{for } k = 1,\ldots,L \\[2ex]
x_k = \dfrac{q_{k-1} + q_k}{2}, & \text{for } k = 2,\ldots,L
\end{cases}
\]

•  Closed-form solution can be found only for very simple PDFs.
–  E.g., if X is uniform, then Lloyd-Max quantizer = uniform quantizer.
•  In general, an approximate solution can be found numerically, via an
iterative algorithm (e.g., lloyds command in Matlab).
•  For real data, typically the PDF is not given and therefore needs to be
estimated using, for example, histograms constructed from the observed
data.
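
One common way to approximate a solution is to alternate between the two conditions until the levels stop changing. The Python sketch below does this on a list of observed samples, replacing the integrals with sample averages. It is an illustration under that assumption, not a description of Matlab's lloyds command; the function name, the initial levels, and the iteration cap are all choices made for this sketch.

```python
def lloyd_max(samples, levels, iterations=100):
    """Alternate between the two optimality conditions on empirical data."""
    q = sorted(levels)
    for _ in range(iterations):
        # Endpoint condition: each interior endpoint is the midpoint of adjacent levels.
        x = [(a + b) / 2 for a, b in zip(q, q[1:])]
        # Centroid condition: each level is the mean of the samples in its bin.
        bins = [[] for _ in q]
        for s in samples:
            k = sum(s >= t for t in x)  # index of the bin containing s
            bins[k].append(s)
        new_q = [sum(b) / len(b) if b else q[k] for k, b in enumerate(bins)]
        if new_q == q:  # fixed point reached
            break
        q = new_q
    x = [(a + b) / 2 for a, b in zip(q, q[1:])]
    return q, x

# Evenly spread samples on [0, 1): the result should be close to a uniform quantizer.
samples = [i / 1000 for i in range(1000)]
q, x = lloyd_max(samples, [0.2, 0.4, 0.6, 0.8])
print(q)
print(x)
```

For samples spread evenly over an interval, the levels end up near the midpoints of four equal bins, in line with the remark that the Lloyd-Max quantizer for a uniform source is the uniform quantizer.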

Vector Lloyd-Max quantizer?
$\mathbf{X} = (X(1),\ldots,X(N))$ = source random vector with a given joint distribution.
L = a desired number of quantization points.
We would like to find:
(1) L events $A_1,\ldots,A_L$ that partition the joint sample space of $X(1),\ldots,X(N)$, and
(2) L quantization points $\mathbf{q}_1 \in A_1,\ldots,\mathbf{q}_L \in A_L$,
such that the quantized random vector, defined by
$\mathbf{Y} = \mathbf{q}_k$ if $\mathbf{X} \in A_k$, for $k = 1,\ldots,L$,
minimizes the mean-square error,
\[
E\left[\|\mathbf{Y} - \mathbf{X}\|^2\right] = E\left[\sum_{n=1}^{N} \big(Y(n) - X(n)\big)^2\right].
\]
Difficulty: cannot differentiate with respect to a set $A_k$, and so unless the set of all allowed
partitions is somehow restricted, this cannot be solved.

Hopefully, prior discussion gives
you some idea about various
issues involved in quantization.
And now, on to entropy coding…
data → transform → quantization → entropy coding → compressed bitstream

Problem statement
Source (e.g., image, video, speech signal, or quantizer output)
→ Sequence of discrete random variables X(1),…,X(N)
(e.g., transformed image pixel values), assumed to be independent and
identically distributed over a finite alphabet {a1,…,aM}
→ Encoder: mapping between source symbols and binary strings (codewords)
→ Binary string

•  minimize the expected length of the binary string;
•  the binary string needs to be uniquely decodable, i.e., we need to be able
to infer X(1),…,X(N) from it!

•  Since X(1),…,X(N) are assumed independent in this model, we will
encode each of them separately.
•  Each can assume any value among {a1,…,aM}.
•  Therefore, our code will consist of M codewords, one for each symbol:

symbol   codeword
a1       w1
…        …
aM       wM

Unique Decodability
symbol   codeword
a        0
b        1
c        00
d        01

•  How to decode the following string: 0001?

•  It could be aaab or aad or acb or cab or cd.
•  Not uniquely decodable!
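
The list of possible decodings can be reproduced by brute force. The Python sketch below tries every codeword that matches the start of the bit string and recurses on the remainder; the function name is an illustrative choice.

```python
def all_decodings(bits, code):
    """Return every symbol string whose encoding under `code` equals `bits`."""
    if not bits:
        return [""]
    results = []
    for symbol, word in code.items():
        if bits.startswith(word):
            rest = all_decodings(bits[len(word):], code)
            results += [symbol + tail for tail in rest]
    return results

code = {"a": "0", "b": "1", "c": "00", "d": "01"}
decodings = all_decodings("0001", code)
print(sorted(decodings))
```

A code is uniquely decodable only if this search never returns more than one result for any bit string; here it returns five.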

A condition that ensures unique decodability
•  Prefix condition: no codeword in the code is a prefix for
any other codeword.
•  If the prefix condition is satisfied, then the code is
uniquely decodable.
–  Proof. Take a bit string W that corresponds to two different
strings of symbols, A and B. If the first symbols in A and B are
the same, discard them and the corresponding portion of W.
Repeat until either there are no bits left in W (in this case A=B)
or the first symbols in A and B are different. Then one of the
codewords corresponding to these two symbols is a prefix for
the other.

A condition that ensures unique decodability
•  Prefix condition: no codeword in the code is a prefix of
any other codeword.
•  Visualizing binary strings. Form a binary tree where
each branch is labeled 0 or 1. Each codeword w can be
associated with the unique node of the tree such that
the string of 0’s and 1’s on the path from the root to the
node forms w.
•  The prefix condition holds if and only if all the codewords are
leaves of the binary tree, i.e., if no codeword is a
descendant of another codeword.
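As a rough illustration, the prefix condition can be tested without building the tree. The sketch below (the function name is hypothetical) relies on the fact that, after lexicographic sorting, a codeword that is a prefix of some other codeword is also a prefix of its immediate successor.

```python
def satisfies_prefix_condition(codewords):
    """True if no codeword is a prefix of another (duplicates count as a violation)."""
    words = sorted(codewords)
    # Only adjacent pairs in sorted order need to be compared.
    return all(not later.startswith(earlier)
               for earlier, later in zip(words, words[1:]))


print(satisfies_prefix_condition(["0", "1", "00", "01"]))    # code with 0 as a prefix
print(satisfies_prefix_condition(["1", "01", "000", "001"]))  # all codewords are leaves
```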

Example: no prefix condition, no unique
decodability, one word is not a leaf
symbol  codeword
a       0
b       1
c       00
d       01

•  Codeword 0 is a prefix of both codeword 00 and codeword 01

Example: prefix condition, all words are leaves
symbol  codeword
a       1
b       01
c       000
d       001

[Figure: binary code tree with leaves wa=1, wb=01, wc=000, wd=001.]

•  No path from the root to a codeword contains another
codeword. This is equivalent to saying that the prefix
condition holds.

Example: prefix condition, all words are
leaves => unique decodability
symbol  codeword
a       1
b       01
c       000
d       001

[Figure: binary code tree with leaves wa=1, wb=01, wc=000, wd=001.]

Decoding: traverse the string left to right, tracing the
corresponding path from the root of the binary tree.
Each time a leaf is reached, output the corresponding symbol and
go back to the root.

Example: prefix condition, all words are
leaves => unique decodability

How to decode the following string? 000001101
•  Bits 000: the path from the root reaches the leaf wc=000. Output: c
•  Bits 001: the path from the root reaches the leaf wd=001. Output: cd
•  Bit 1: the path from the root reaches the leaf wa=1. Output: cda
•  Bits 01: the path from the root reaches the leaf wb=01. Output: cdab
Final output: cdab

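The tree-walking decoder described above can be sketched as follows. This is only an illustration: the nested-dictionary tree representation and the names `build_tree` and `decode` are assumptions made here, and the code presumes that the prefix condition holds.

```python
def build_tree(code):
    """Build the binary code tree as nested dicts; a leaf is the symbol itself.

    Assumes the prefix condition holds, so every codeword ends at a leaf.
    """
    root = {}
    for symbol, word in code.items():
        node = root
        for bit in word[:-1]:
            node = node.setdefault(bit, {})
        node[word[-1]] = symbol
    return root


def decode(bits, code):
    """Trace the bits from the root; at each leaf, emit its symbol and restart."""
    root = build_tree(code)
    node, output = root, []
    for bit in bits:
        node = node[bit]
        if isinstance(node, str):  # reached a leaf
            output.append(node)
            node = root
    if node is not root:
        raise ValueError("bit string ends in the middle of a codeword")
    return "".join(output)


code = {"a": "1", "b": "01", "c": "000", "d": "001"}
print(decode("000001101", code))
```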
Prefix condition and unique decodability
•  There are uniquely decodable codes
which do not satisfy the prefix condition
(e.g., {0, 01}). For any such code, a prefix
condition code can be constructed with an
identical set of codeword lengths. (E.g.,
{0, 10} for {0, 01}.)
•  The expected codeword length depends only on the
codeword lengths. For this reason, we can consider just
prefix condition codes.

Entropy coding
•  Given a discrete random variable X with M possible outcomes
(“symbols” or “letters”) a1,…,aM and with PMF pX, what is the
lowest achievable expected codeword length among all the
uniquely decodable codes?
–  Answer depends on pX; Shannon’s source coding theorem provides
bounds.
•  How to construct a prefix condition code which achieves this
lowest expected codeword length?
–  Answer: Huffman code.

Huffman code
•  Consider a discrete r.v. X with M possible outcomes a1,…,aM and with PMF
pX. Assume that pX(a1) ≤ … ≤ pX(aM). (If this condition is not satisfied,
reorder the outcomes so that it is satisfied.)
•  Consider “aggregate outcome” a12 = {a1,a2} and a discrete r.v. X’ such that

X' = \begin{cases} a_{12} & \text{if } X = a_1 \text{ or } X = a_2 \\ X & \text{otherwise} \end{cases}

p_{X'}(a) = \begin{cases} p_X(a_1) + p_X(a_2) & \text{if } a = a_{12} \\ p_X(a) & \text{if } a = a_3, \ldots, a_M \end{cases}

•  Suppose we have a tree, T’, for an optimal prefix condition code for X’. A tree
T for an optimal prefix condition code for X can be obtained from T’ by
splitting the leaf a12 into two leaves corresponding to a1 and a2.
•  We won’t prove this.

letter  pX(letter)
a1      0.10
a2      0.10
a3      0.25
a4      0.25
a5      0.30

Step 1: combine the two least likely letters, a1 and a2, into a12.
letter  pX’(letter)
a12     0.20
a3      0.25
a4      0.25
a5      0.30

Tree for X: node a12 has two children, a1 (branch 1) and a2 (branch 0).
The rest of the tree is the tree for X’, which is still to be constructed.

Step 2: combine the two least likely letters from the new alphabet, a12 and a3, into a123.
letter  pX’(letter)
a12     0.20
a3      0.25
a4      0.25
a5      0.30

letter  pX’’(letter)
a123    0.45
a4      0.25
a5      0.30

Tree for X: node a123 has two children, a12 (the subtree from Step 1) and a3.
The rest of the tree is the tree for X’’, which is still to be constructed.

Step 3: again combine the two least likely letters, a4 and a5, into a45.
letter  pX’’(letter)
a123    0.45
a4      0.25
a5      0.30

letter  pX’’’(letter)
a123    0.45
a45     0.55

Tree for X: node a45 has two children, a4 (branch 1) and a5 (branch 0).
The subtree rooted at a123 was built in Steps 1 and 2. The rest of the tree is the tree for X’’’.

Step 4: combine the last two remaining letters, a123 and a45.
letter  pX’’’(letter)
a123    0.45
a45     0.55

Done! The codeword for each leaf is the sequence of 0’s and 1’s along the path
from the root to that leaf.


Tree for X: the root has children a123 (branch 1) and a45 (branch 0);
a123 has children a12 (branch 1) and a3 (branch 0);
a12 has children a1 (branch 1) and a2 (branch 0);
a45 has children a4 (branch 1) and a5 (branch 0).

letter  pX(letter)  codeword
a1      0.10        111
a2      0.10        110
a3      0.25        10
a4      0.25        01
a5      0.30        00

Expected codeword length: 3(0.1) + 3(0.1) + 2(0.25) + 2(0.25) + 2(0.3) = 2.2 bits
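
The merging steps above can be sketched in a few lines of Python. This is one possible implementation, not the one used in the course: the heap-based bookkeeping, the tie-breaking counter, and the name `huffman_code` are assumptions made for illustration. The branch labels follow the example (the less likely group of each merged pair gets a leading 1, the other a leading 0), and the sketch presumes at least two symbols.

```python
import heapq
from itertools import count


def huffman_code(pmf):
    """Return {symbol: codeword} for a dict {symbol: probability}."""
    tiebreak = count()  # keeps heap entries comparable when probabilities tie
    heap = [(p, next(tiebreak), {symbol: ""}) for symbol, p in pmf.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Combine the two least likely (possibly aggregate) letters.
        p_low, _, group_low = heapq.heappop(heap)
        p_next, _, group_next = heapq.heappop(heap)
        merged = {s: "1" + w for s, w in group_low.items()}
        merged.update({s: "0" + w for s, w in group_next.items()})
        heapq.heappush(heap, (p_low + p_next, next(tiebreak), merged))
    return heap[0][2]


pmf = {"a1": 0.10, "a2": 0.10, "a3": 0.25, "a4": 0.25, "a5": 0.30}
code = huffman_code(pmf)
expected_length = sum(pmf[s] * len(w) for s, w in code.items())
print(code)
print(expected_length)
```

Different tie-breaking rules can produce different codewords, but the expected codeword length of any Huffman code for this PMF is 2.2 bits.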

•  Consider again a discrete random variable X with M possible
outcomes a1,…,aM and with PMF pX.
•  Self-information of outcome am is I(am) = −log2 pX(am) bits.
•  E.g., if pX(am) = 1 then I(am) = 0. The occurrence of am is not at
all informative, since it had to occur. The smaller the
probability of an outcome, the larger its self-information.
•  Self-information of X is I(X) = −log2 pX(X) and is a random
variable.
•  Entropy of X is the expected value of its self-information:

H(X) = E[I(X)] = -\sum_{m=1}^{M} p_X(a_m) \log_2 p_X(a_m)

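As a small numerical illustration, the entropy of the five-letter PMF from the Huffman example can be computed directly from the definition. The function name `entropy_bits` is an illustrative choice, and outcomes with zero probability are skipped by convention.

```python
import math


def entropy_bits(probabilities):
    """Entropy in bits: the expected value of -log2 p over the PMF."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


pmf = [0.10, 0.10, 0.25, 0.25, 0.30]
H = entropy_bits(pmf)
print(H)
```

The result is slightly below the 2.2 bits achieved by the Huffman code for the same PMF.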
Source coding theorem (Shannon)

For any uniquely decodable code, the expected codeword length is ≥ H (X).
Moreover, there exists a prefix condition code for which the expected codeword
length is < H (X) + 1.

•  Suppose that X has M=2^K possible outcomes a1,…,aM.
•  Suppose that X is uniform, i.e., pX(a1) = … = pX(aM) = 2^{−K}. Then

H(X) = E[I(X)] = -\sum_{k=1}^{2^K} 2^{-K} \log_2\left(2^{-K}\right) = 2^K \left(-2^{-K}\right)(-K) = K

•  On the other hand, observe that there exist 2^K different K-bit
sequences. Thus, a fixed-length code for X that uses all these
2^K K-bit sequences as codewords for all the 2^K outcomes of X
will have expected codeword length of K.
•  I.e., for this particular random variable, this fixed-length code
achieves the entropy of X, which is the lower bound given by
the source coding theorem.
•  Therefore, the K-bit fixed-length code is optimal for this X.

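A quick numerical check of the uniform case, as a sketch with K = 3 chosen arbitrarily for illustration:

```python
import math

K = 3
M = 2 ** K
uniform_pmf = [2.0 ** -K] * M

# Entropy of the uniform PMF over 2^K outcomes.
H = -sum(p * math.log2(p) for p in uniform_pmf)

# Fixed-length code: the K-bit binary representation of each outcome index.
codewords = [format(index, "b").zfill(K) for index in range(M)]
expected_length = sum(p * len(w) for p, w in zip(uniform_pmf, codewords))

print(H, expected_length)
```

Both quantities equal K, so the fixed-length code meets the lower bound.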
Lemma 1: An auxiliary result helpful for
proving the source coding theorem

•  log2 α ≤ (α−1) log2 e for α > 0.

•  Proof: differentiate g(α) = (α−1) log2 e − log2 α and show that
g(1) = 0 is its minimum.
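
The differentiation step in the proof of Lemma 1 can be written out as follows. This is a sketch of the standard argument, using the fact that the derivative of log2 α is (log2 e)/α.

```latex
g(\alpha) = (\alpha - 1)\log_2 e - \log_2 \alpha, \qquad \alpha > 0
\\
g'(\alpha) = \log_2 e - \frac{\log_2 e}{\alpha}
           = \left(1 - \frac{1}{\alpha}\right)\log_2 e
\\
g'(\alpha) < 0 \text{ for } 0 < \alpha < 1, \qquad
g'(1) = 0, \qquad
g'(\alpha) > 0 \text{ for } \alpha > 1
\\
\Rightarrow \; g(\alpha) \ge g(1) = 0
\;\Longleftrightarrow\;
\log_2 \alpha \le (\alpha - 1)\log_2 e
```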

Another auxiliary result: Kraft inequality

If positive integers d1,…,dM satisfy the inequality

\sum_{m=1}^{M} 2^{-d_m} \le 1, \qquad (1)

then there exists a prefix condition code whose codeword lengths are these integers.
Conversely, the codeword lengths of any prefix condition code satisfy this inequality.

Some useful facts about full binary trees
[Figure: full binary tree of depth D = 4, with one node at depth 2 marked in red.]

A full binary tree of depth D has 2^D leaves. (Here, depth is D=4 and
the number of leaves is 2^4=16.)

In a full binary tree of depth D, each node at depth d has 2^{D−d} leaf
descendants. (Here, D=4, the red node is at depth d=2, and so it has
2^{4−2} = 4 leaf descendants.)

Kraft inequality: proof of ⇒

Suppose d1 ≤ … ≤ dM satisfy (1). Consider the full binary tree of depth dM, and consider all its
nodes at depth d1. Assign one of these nodes to symbol a1. Consider all the nodes at depth d2 which
are not a1 and not descendants of a1. Assign one of them to symbol a2. Iterate like this M times.
If we have run out of tree nodes to assign after r < M iterations, it means that every leaf in the full
binary tree of depth dM is a descendant of one of the first r symbols, a1,…,ar. But note that every
node at depth dm has 2^{dM − dm} leaf descendants. Note also that the full tree has 2^{dM} leaves. Therefore, if
every leaf in the tree is a descendant of a1,…,ar, then

\sum_{m=1}^{r} 2^{d_M - d_m} = 2^{d_M} \iff \sum_{m=1}^{r} 2^{-d_m} = 1.

Therefore,

\sum_{m=1}^{M} 2^{-d_m} = \sum_{m=1}^{r} 2^{-d_m} + \sum_{m=r+1}^{M} 2^{-d_m} > 1.

This violates (1).

Thus, our procedure can in fact go on for M iterations. After the M-th iteration, we will have
constructed a prefix condition code with codeword lengths d1,…,dM.
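
The construction in this proof can be sketched in code. The version below always takes the leftmost available node at each depth, which is one valid way of choosing "one of these nodes"; the function names and the use of exact fractions are choices made for this illustration.

```python
from fractions import Fraction


def kraft_sum(lengths):
    """Left-hand side of the Kraft inequality, computed exactly."""
    return sum(Fraction(1, 2 ** d) for d in lengths)


def prefix_code_from_lengths(lengths):
    """Build a prefix condition code with the given positive integer lengths."""
    if kraft_sum(lengths) > 1:
        raise ValueError("lengths violate the Kraft inequality")
    codewords, value, previous = [], 0, 0
    for d in sorted(lengths):
        # Descend to depth d, staying at the leftmost node not yet blocked.
        value <<= d - previous
        codewords.append(format(value, "b").zfill(d))
        value += 1  # move to the next node at this depth
        previous = d
    return codewords


print(kraft_sum([3, 3, 2, 2, 2]))
print(prefix_code_from_lengths([3, 3, 2, 2, 2]))
```

The codewords are returned in order of increasing length, matching the ordering d1 ≤ … ≤ dM assumed in the proof.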

Kraft inequality: proof of ⇐

Suppose d1 ≤ … ≤ dM, and suppose we have a prefix condition code with these codeword lengths.
Consider the binary tree corresponding to this code. Complete this tree to obtain a full tree of
depth dM. Again use the following facts:
the full tree has 2^{dM} leaves;
the number of leaf descendants of the codeword of length dm is 2^{dM − dm}.
By the prefix condition, no codeword is a descendant of another, so different codewords have
no leaf descendants in common. The combined number of all leaf descendants of all codewords
must therefore be less than or equal to the total number of leaves in the full tree:

\sum_{m=1}^{M} 2^{d_M - d_m} \le 2^{d_M} \iff \sum_{m=1}^{M} 2^{-d_m} \le 1.

Source coding theorem: proof of E[C] ≥ H(X)
Let dm be the codeword length for am, and let random variable C be the codeword length for X.

H(X) - E[C] = -\sum_{m=1}^{M} p_X(a_m) \log_2 p_X(a_m) - \sum_{m=1}^{M} p_X(a_m) d_m
= \sum_{m=1}^{M} p_X(a_m) \left[ \log_2 \frac{1}{p_X(a_m)} - \log_2 2^{d_m} \right]
= \sum_{m=1}^{M} p_X(a_m) \log_2 \frac{1}{p_X(a_m) 2^{d_m}}
\le \sum_{m=1}^{M} p_X(a_m) \left( \frac{1}{p_X(a_m) 2^{d_m}} - 1 \right) \log_2 e \quad \text{(by Lemma 1)}
= \left( \sum_{m=1}^{M} \frac{1}{2^{d_m}} - \sum_{m=1}^{M} p_X(a_m) \right) \log_2 e
= \left( \sum_{m=1}^{M} 2^{-d_m} - 1 \right) \log_2 e \le 0

By Kraft inequality, this holds for any prefix condition code. But it is also true for any uniquely
decodable code, since for any such code there is a prefix condition code with the same codeword lengths.

Source coding theorem: how to satisfy
E[C] < H(X)+1?
Choose d_m = \lceil -\log_2 p_X(a_m) \rceil (where \lceil x \rceil stands for the smallest integer which is ≥ x). Then

d_m \ge -\log_2 p_X(a_m) \Rightarrow -d_m \le \log_2 p_X(a_m) \Rightarrow 2^{-d_m} \le p_X(a_m) \Rightarrow \sum_{m=1}^{M} 2^{-d_m} \le \sum_{m=1}^{M} p_X(a_m) = 1.

Therefore, Kraft inequality is satisfied, and we can construct a prefix condition code with codeword
lengths d1,…,dM. Also, by construction,

d_m - 1 < -\log_2 p_X(a_m) \Rightarrow d_m < -\log_2 p_X(a_m) + 1
\Rightarrow p_X(a_m) d_m < -p_X(a_m) \log_2 p_X(a_m) + p_X(a_m)
\Rightarrow \sum_{m=1}^{M} p_X(a_m) d_m < \sum_{m=1}^{M} \left( -p_X(a_m) \log_2 p_X(a_m) + p_X(a_m) \right)
\Rightarrow E[C] < \sum_{m=1}^{M} \left( -p_X(a_m) \log_2 p_X(a_m) \right) + \sum_{m=1}^{M} p_X(a_m) = H(X) + 1
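
To make the choice of codeword lengths concrete, the sketch below applies it to the five-letter PMF from the Huffman example. The variable names are illustrative, and the lengths are rounded up from −log2 pX(am) as in the argument above.

```python
import math

pmf = [0.10, 0.10, 0.25, 0.25, 0.30]

# Codeword lengths: -log2 of each probability, rounded up to an integer.
lengths = [math.ceil(-math.log2(p)) for p in pmf]

kraft = sum(2.0 ** -d for d in lengths)
expected_length = sum(p * d for p, d in zip(pmf, lengths))
H = -sum(p * math.log2(p) for p in pmf)

print(lengths)
print(kraft, H, expected_length)
```

These lengths satisfy the Kraft inequality and give an expected length between H(X) and H(X)+1, but they are longer on average than the Huffman code for the same PMF (2.2 bits).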

Note: Huffman code may often be very
far from the entropy
•  Let X have two outcomes, a1 and a2, with probabilities 1−2^{−d}
and 2^{−d}, respectively.
•  Huffman code: 0 for a1; 1 for a2.
•  Expected codeword length: 1.
•  Entropy: −(1−2^{−d}) log2(1−2^{−d}) + d·2^{−d} ≈ 0 for large d. For
example, if d=20, this is 0.0000204493.
•  Problem: no codeword can have a fractional number of bits!
•  If we have a source which produces independent random
variables X1, X2, …, all identically distributed to X, a single
Huffman code can be constructed for several of them,
effectively resulting in fractional numbers of bits per random
variable.
Ilya Pollak
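
The gap between the one-bit Huffman code and the entropy can be reproduced with a few lines of Python. This is only an illustrative sketch: the function name `binary_entropy` is a hypothetical helper, and `math.log1p` is used as a numerical precaution because 1 − 2^(−d) is extremely close to 1.

```python
import math

def binary_entropy(q):
    """Entropy in bits of a two-outcome source with probabilities 1 - q and q."""
    # log1p(-q) computes ln(1 - q) accurately even when q is tiny.
    term_a1 = -(1 - q) * math.log1p(-q) / math.log(2)
    term_a2 = -q * math.log2(q)
    return term_a1 + term_a2

d = 20
q = 2.0 ** -d
H = binary_entropy(q)
huffman_length = 1.0  # one bit per outcome: codeword 0 for a1, codeword 1 for a2

print(f"entropy        = {H:.10f} bits")
print(f"Huffman length = {huffman_length} bit")
print(f"ratio          = {huffman_length / H:.0f}")
```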
•  $(X_1,X_2)$ will have four outcomes, $(a_1,a_1)$, $(a_1,a_2)$, $(a_2,a_1)$, $(a_2,a_2)$,
with probabilities $1-2^{-d+1}+2^{-2d}$, $2^{-d}-2^{-2d}$, $2^{-d}-2^{-2d}$, and $2^{-2d}$,
respectively.
•  Huffman code: 0 for $(a_1,a_1)$; 10 for $(a_1,a_2)$; 110 for $(a_2,a_1)$; 111
for $(a_2,a_2)$.
•  Expected codeword length per random variable:
–  $\bigl[(1-2^{-d+1}+2^{-2d}) + 2(2^{-d}-2^{-2d}) + 3(2^{-d}-2^{-2d}) + 3(2^{-2d})\bigr]/2 = \bigl[1 + 3\cdot 2^{-d} - 2^{-2d}\bigr]/2$
–  This is approximately 0.500001 for d = 20.
•  Can get arbitrarily close to the entropy by encoding longer
sequences of $X_k$'s.
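
The effect of coding longer blocks can be explored with a short experiment. The sketch below is an illustration under stated assumptions, not the lecture's own code: it enumerates all 2^n outcomes of a block of n iid two-outcome symbols, builds a binary Huffman code with Python's `heapq`, and reports the expected length per symbol. The function names and the value q = 0.1 are hypothetical choices.

```python
import heapq
import math
from itertools import product

def huffman_expected_length(pmf):
    """Expected codeword length in bits of a binary Huffman code for pmf.

    Each merge adds one bit to every leaf beneath the merged node, so the
    expected length equals the sum of the merged-node probabilities.
    """
    heap = list(pmf)
    heapq.heapify(heap)
    total = 0.0
    while len(heap) > 1:
        merged = heapq.heappop(heap) + heapq.heappop(heap)
        total += merged
        heapq.heappush(heap, merged)
    return total

def block_pmf(q, n):
    """pmf of a block of n iid symbols, where each symbol is a2 with probability q."""
    return [q ** sum(bits) * (1 - q) ** (n - sum(bits))
            for bits in product((0, 1), repeat=n)]

def bits_per_symbol(q, n):
    """Expected Huffman codeword length per symbol when coding blocks of n symbols."""
    return huffman_expected_length(block_pmf(q, n)) / n

def binary_entropy(q):
    """Entropy in bits of a two-outcome source with probabilities 1 - q and q."""
    return -(1 - q) * math.log2(1 - q) - q * math.log2(q)

q = 0.1  # hypothetical probability of a2, chosen so the effect shows at small n
for n in range(1, 9):
    print(n, bits_per_symbol(q, n), "entropy:", binary_entropy(q))
```

Because every block receives a codeword of at least one bit, the per-symbol length with block length n can never be below 1/n. For a very skewed source such as d = 20, the block length needed to approach the entropy is therefore far too large for explicit enumeration like this.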

Source coding theorem for sequences
of independent, identically distributed
random variables
Suppose we are jointly encoding independent, identically distributed discrete
random variables $X_1, \ldots, X_N$, each taking values in $\{a_1, \ldots, a_M\}$.
For any uniquely decodable code, the expected codeword length per symbol is $\ge H(X_n)$.
Moreover, there exists a prefix condition code for which the expected codeword
length per symbol is $< H(X_n) + \frac{1}{N}$.

Proof of the source coding theorem
for iid sequences
Consider the random vector $\mathbf{X} = (X_1, \ldots, X_N)$. The self-information of its outcome $\mathbf{x} = (x_1, \ldots, x_N)$ is
\[
I(\mathbf{x}) = -\log_2 p_{X_1,\ldots,X_N}(x_1, \ldots, x_N) = -\sum_{n=1}^{N} \log_2 p_{X_n}(x_n) = \sum_{n=1}^{N} I(x_n).
\]

Therefore, the entropy of $\mathbf{X}$ is
\[
H(\mathbf{X}) = E\bigl[I(\mathbf{X})\bigr] = E\Bigl[\sum_{n=1}^{N} I(X_n)\Bigr] = \sum_{n=1}^{N} H(X_n) = N H(X_n).
\]
Therefore, applying the single-symbol source coding theorem to $\mathbf{X}$, we have:
\[
\begin{aligned}
H(\mathbf{X}) &\le E[C_N] < H(\mathbf{X}) + 1, \\
N H(X_n) &\le E[C_N] < N H(X_n) + 1, \\
H(X_n) &\le E[C] < H(X_n) + \frac{1}{N},
\end{aligned}
\]
where $E[C_N]$ is the expected codeword length for the optimal uniquely decodable code for $\mathbf{X}$,
and $E[C] = \frac{E[C_N]}{N}$ is the corresponding expected codeword length per symbol.
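
The key step of the proof, H(X) = N H(X_n) for iid components, can be verified by brute force on a small example. The sketch below is only a numerical check under assumptions: the pmf and the block length N are hypothetical values, and the joint pmf is built by multiplying marginals, which is exactly the independence assumption of the theorem.

```python
import math
from itertools import product

def entropy(pmf):
    """Entropy in bits of a discrete pmf."""
    return -sum(p * math.log2(p) for p in pmf if p > 0)

def iid_joint_pmf(pmf, N):
    """Joint pmf of N independent copies of a variable with the given pmf."""
    return [math.prod(combo) for combo in product(pmf, repeat=N)]

pmf = [0.5, 0.3, 0.2]   # hypothetical single-symbol pmf
N = 4                   # hypothetical block length

H_single = entropy(pmf)
H_joint = entropy(iid_joint_pmf(pmf, N))

print("H(X_n)     =", H_single)
print("H(X)       =", H_joint)
print("N * H(X_n) =", N * H_single)
```

Dividing the resulting bounds on E[C_N] by N gives the per-symbol bounds, so the slack above the entropy shrinks as 1/N.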
Arithmetic coding
•  Another form of entropy coding.
•  More amenable to coding long sequences of symbols than
Huffman coding.
•  Can be used in conjunction with on-line learning of conditional
probabilities to encode dependent sequences of symbols:
–  Q-coder in JPEG (JPEG also has a Huffman coding option)
–  QM-coder in JBIG
–  MQ-coder in JPEG-2000
–  CABAC coder in H.264/MPEG-4 AVC
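
To give a feel for why arithmetic coding handles long sequences well, here is a minimal sketch of the interval-narrowing idea behind it. It is an idealized illustration using exact rational arithmetic and a fixed, known pmf. It is not the Q-coder, QM-coder, MQ-coder, or CABAC, which use finite-precision approximations and adaptive probability estimates. All names and the example pmf (the two-outcome source with d = 4) are assumptions made for this example.

```python
import math
from fractions import Fraction

def encode_interval(symbols, pmf):
    """Return (low, width) of the subinterval of [0, 1) that identifies the sequence."""
    # Lower edge of each symbol's slice of [0, 1), in the pmf's insertion order.
    lower_edge, acc = {}, Fraction(0)
    for symbol, p in pmf.items():
        lower_edge[symbol] = acc
        acc += p
    low, width = Fraction(0), Fraction(1)
    for symbol in symbols:
        low += width * lower_edge[symbol]   # move into the symbol's slice
        width *= pmf[symbol]                # width becomes the sequence probability
    return low, width

def dyadic_tag(low, width):
    """Find n with 2**-n <= width/2 and an n-bit tag whose dyadic interval lies in [low, low + width)."""
    n = 0
    while Fraction(1, 2 ** n) > width / 2:
        n += 1
    tag = math.ceil(low * 2 ** n)
    return n, tag

# Hypothetical source: a1 with probability 15/16, a2 with probability 1/16 (d = 4).
pmf = {"a1": Fraction(15, 16), "a2": Fraction(1, 16)}
message = ["a1"] * 10 + ["a2"] + ["a1"] * 9

low, width = encode_interval(message, pmf)
n, tag = dyadic_tag(low, width)
print("symbols:", len(message), " bits used:", n, " codeword:", format(tag, f"0{n}b"))
```

The interval width equals the probability of the whole sequence, so the number of bits grows with the sequence's total self-information rather than with one whole bit per symbol, and no table of all possible blocks is ever built.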
