
Data compression

Sid Lamrous

MITGL01 / Course 1, by Sid Lamrous. Monday, 20 June 2022


Data compression

Data compression is of considerable importance in a variety of areas: television, music, remote sensing, medical imaging, ...

Graphic designers have long (sometimes unknowingly) used the various compression methods implemented in commercial software.

We propose here, limited to the case of texts and images, to take an inventory of the available methods and to understand their principles.



Compression methods

■ Reversible methods (lossless)
» Statistical coding (Huffman and Shannon-Fano algorithms)
» Arithmetic methods
» Dictionary methods (LZW algorithm)
■ Irreversible methods (lossy)
» JPEG compression
» JPEG 2000 compression



Definition

We will say that we have compressed a file if we manage to reduce the number of binary digits needed to store it.

The effectiveness of compression is measured by the compression ratio:

τ = (number of binary digits used by the original document) / (number of binary digits used by the compressed document)

A ratio greater than 1 means that space has actually been saved.



Lossless compression methods

A simple example: suppose we have to try to compress the following series of bytes:

0001 0100 0011 1111 0101 0101 0101 0101 0101 0101
0101 0101 1110 1111 0000 1111 1111 1111 1111 1111
1111 1111 1111 1111 1111 1111 0110 0111 0111 0000
0111 0000 1010 1111 0000 0000 0001 1111 0001 1111
0001 1111

This series has 21 bytes. (You can interpret it as a sequence of characters or as the grey levels of a series of pixels; it does not matter for our purposes here.)



Lossless compression methods

Let's start by writing a signaling byte; in the result below, a signaling byte opens each group: 0000 0010.
The first bit of a signaling byte indicates whether the following data byte repeats: it is set to 1 if so, and to 0 if not.

If the first bit = 0:
the following 7 bits give the number of bytes transmitted without repetition.

If the first bit = 1:
the following 7 bits give the number of repetitions of the single data byte that follows.



Data compression

Result:
0000 0010 0001 0100 0011 1111 1000 0100 0101 0101 0000 0010
1110 1111 0000 1111 1000 0101 1111 1111 0000 0001 0110 0111
1000 0010 0111 0000 0000 0010 1010 1111 0000 0000 1000 0011
0001 1111
This sequence is 19 bytes, 2 fewer than the original, so we obtain a compression ratio of 21/19 ≈ 1.1.
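As a sanity check, here is a minimal Python sketch of this signaling-byte scheme (the function name and the decision to encode any run of two or more identical bytes as a repetition are choices of this sketch, inferred from the worked result above):

```python
def rle_encode(data: bytes) -> bytes:
    """Signaling-byte RLE: high bit 1 = 'the next byte repeats count times',
    high bit 0 = 'count literal bytes follow'; count sits in the low 7 bits."""
    out = bytearray()
    i = 0
    while i < len(data):
        # Measure the run of identical bytes starting at position i.
        j = i
        while j < len(data) and data[j] == data[i] and j - i < 127:
            j += 1
        if j - i >= 2:                      # encode the run as a repetition
            out.append(0x80 | (j - i))      # flag bit 1 + repetition count
            out.append(data[i])
        else:                               # gather bytes with no repetition
            j = i + 1
            while (j < len(data) and j - i < 127
                   and (j + 1 >= len(data) or data[j] != data[j + 1])):
                j += 1
            out.append(j - i)               # flag bit 0 + literal count
            out.extend(data[i:j])
        i = j
    return bytes(out)

# The 21 bytes of the example, written in hexadecimal:
original = bytes([0x14, 0x3F, 0x55, 0x55, 0x55, 0x55, 0xEF, 0x0F,
                  0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x67, 0x70, 0x70,
                  0xAF, 0x00, 0x1F, 0x1F, 0x1F])
assert len(rle_encode(original)) == 19      # matches the 19-byte result above
```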

Decoding poses no problem as long as the decoder is informed of the method used: it will then interpret the first byte as a signaling byte, and everything follows from there...



Statistical method

Is encoding all the values we have to store with the same number of bits the best solution?
Theory tells us it is not!

Idea: use shorter codes for frequent values and reserve longer codes for less frequent values (the histogram of an image, for instance, shows that a few grey levels are far more frequent than the rest).

=> We will therefore focus on VLCs (Variable Length Codes); we also speak of entropy coding.



Example

Suppose we have to process a message which contains only 4 characters, which we will name A, B, C and D, and that the frequencies of these characters in our message are as follows:

A : 60 %   B : 30 %   C : 5 %   D : 5 %

One can imagine the following code:

A : 0   B : 11   C : 01   D : 10

It will certainly allow us to save space, but if we receive the message 000110, how do we interpret it? 3A, B, A? 2A, C, D? It does not work! We have just discovered the problem of synchronization.
How can we solve it? With separator characters? That would contradict the objective of compression.



The solution: prefixed VLC

Definition: a code is prefixed if no codeword is the beginning of any other codeword.

Let's go back to our example and try to correct it:

A : 0   B : 10   C : 110   D : 111   It works!

Is this the optimal solution? Since we obtained it by trial and error, we cannot tell; but rest assured, the theory asserts that it is.
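To see why the prefix property removes the ambiguity, here is a minimal decoding sketch (the function name is ours): reading bit by bit, the first codeword matched is necessarily the right one, so no backtracking is ever needed.

```python
CODE = {"A": "0", "B": "10", "C": "110", "D": "111"}
DECODE = {bits: char for char, bits in CODE.items()}

def decode(bitstream: str) -> str:
    out, buffer = [], ""
    for bit in bitstream:
        buffer += bit
        if buffer in DECODE:        # unique match: no codeword begins another
            out.append(DECODE[buffer])
            buffer = ""
    if buffer:
        raise ValueError("truncated bit stream")
    return "".join(out)

print(decode("000110"))  # -> AAAC: the once-ambiguous message has a single reading
```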

We will now take a slightly more complex example and present a method for determining an efficient VLC (although it is not always optimal).



Shannon-Fano algorithm

Suppose we want to send the following message:

LE PRESIDENT EST ENTRE DANS LA SALLE

It has 36 characters and therefore occupies 36 bytes in software such as WordPerfect. We count the occurrences of the different characters of the alphabet used and sort them by decreasing frequency, obtaining the following list:

E : 7
Space : 6
L : 4
S : 4
N : 3
T : 3
A : 3
R : 2
D : 2
P : 1
I : 1

We now split the characters into two groups whose total frequencies of occurrence are as close as possible, then divide each of these groups in the same way, until each group is reduced to a single starting character.


Sample processing

[Figure: the successive splits of the frequency list into two groups of nearly equal weight.]
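A minimal Python sketch of this splitting procedure (recursive; the closest-split search and all names are ours). On the frequency list above, it reproduces the 118-bit total computed just below:

```python
def shannon_fano(freqs):
    """freqs: list of (symbol, count), sorted by decreasing count.
    Returns a dict {symbol: code string}."""
    if len(freqs) == 1:
        return {freqs[0][0]: ""}
    total = sum(count for _, count in freqs)
    # Find the split making the two groups' total frequencies closest.
    running, split, best_diff = 0, 1, total
    for i in range(1, len(freqs)):
        running += freqs[i - 1][1]
        diff = abs(total - 2 * running)
        if diff < best_diff:
            best_diff, split = diff, i
    # Prefix 0 to the first group's codes and 1 to the second group's.
    codes = {}
    for sym, code in shannon_fano(freqs[:split]).items():
        codes[sym] = "0" + code
    for sym, code in shannon_fano(freqs[split:]).items():
        codes[sym] = "1" + code
    return codes

freqs = [("E", 7), (" ", 6), ("L", 4), ("S", 4), ("N", 3), ("T", 3),
         ("A", 3), ("R", 2), ("D", 2), ("P", 1), ("I", 1)]
codes = shannon_fano(freqs)
print(sum(len(codes[sym]) * n for sym, n in freqs))  # 118 bits
```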


Data compression

• We can verify that we obtain a prefixed VLC.

• Let's measure its effectiveness.

We started with a message of 36 characters, i.e. 36 x 8 = 288 bits.
If we had coded each character on 4 bits (which would have been enough for our 11-character alphabet), we would have used 144 bits.
Using the coding we have just built, we use:
(2 x 7) + (3 x 6) + (3 x 4) + (3 x 4) + (4 x 3) + (4 x 3) + (4 x 3) + (4 x 2) + (4 x 2) + (5 x 1) + (5 x 1) = 14 + 18 + 12 + 12 + 12 + 12 + 12 + 8 + 8 + 5 + 5 = 118 bits
The compression ratio relative to the 4-bit representation is

τ = 144/118 ≈ 1.22
Huffman Algorithm

Principle:
An element i is represented by a sequence of bits whose length is inversely related to its probability of appearance p(i) (ideally about -log2 p(i) bits, rounded to an integer).
The problem is to define a variable-length code of which no element is the beginning of another (a prefixed code).
The method is based on the construction of a tree driven by the probabilities of appearance of the elements.

Example: elements to be coded, with their probabilities of appearance (the table is not reproduced here).


Principle of tree construction

1. Initially, the leaves are the free nodes.
2. Choose the two free nodes with the lowest probabilities.
3. Create a parent node whose weight is the sum of the weights of the two child nodes.
4. Associate 0 with the branch of smaller probability and 1 with the other.
5. The parent becomes a free node, and the two children are no longer free.
6. As long as there is more than one free node, repeat steps 2 to 5.
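A minimal Python sketch of these steps, using a heap to pick the two lowest-probability free nodes (the probabilities in the demo are hypothetical, since the course's table is not reproduced here):

```python
import heapq
import itertools

def huffman(probs):
    """probs: {symbol: probability}. Returns {symbol: code string}."""
    tie = itertools.count()   # tie-breaker so heap tuples never compare dicts
    heap = [(p, next(tie), {sym: ""}) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:                      # step 6: more than one free node
        p0, _, codes0 = heapq.heappop(heap)   # steps 2/4: smallest -> branch 0
        p1, _, codes1 = heapq.heappop(heap)   # next smallest -> branch 1
        merged = {sym: "0" + c for sym, c in codes0.items()}
        merged.update({sym: "1" + c for sym, c in codes1.items()})
        heapq.heappush(heap, (p0 + p1, next(tie), merged))  # steps 3/5: parent
    return heap[0][2]

print(huffman({"A": 0.4, "B": 0.2, "R": 0.2, "C": 0.1, "D": 0.1}))
```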
Sample processing

[Figure: the Huffman tree built for the example; the 0/1 branch labels visible in the figure yield codewords including 110, 1110, 11110 and 11111.]


Arithmetic coding

■ Unlike statistical methods (where a frequent element is given a shorter representation, and a rare one a longer representation), here we represent a whole stream of information by a numerical interval.
■ A message is represented by an interval of real numbers within [0, 1[.
■ The longer the message becomes, the narrower the interval needed to represent it.


Processing an example

■ Message: BRACADABRA
■ Each character is given a representation interval. The particular allocation of the intervals has no influence on the compression / decompression of the message.
» But attention! Decompression must use the same table as the one used for compression.

Character   Probability   Interval
A           4/10          [0.0, 0.4[
B           2/10          [0.4, 0.6[
C           1/10          [0.6, 0.7[
D           1/10          [0.7, 0.8[
R           2/10          [0.8, 1.0[



Arithmetic coding

For each character c read, the current interval is narrowed:

Value = Old_High_Limit - Old_Low_Limit

New_High_Limit = Old_Low_Limit + Value * High_value(c)

New_Low_Limit = Old_Low_Limit + Value * Low_value(c)

where Low_value(c) and High_value(c) are the bounds of the interval assigned to c.


How to assign the code?

Starting from [0.0, 1.0[ and applying the update rules to each character of « BRACADABRA », with the interval table above:

New character   Low limit      High limit
(start)         0.0            1.0
B               0.4            0.6
R               0.56           0.60
A               0.560          0.576
C               0.5696         0.5712
A               0.56960        0.57024
D               0.570048       0.570112
A               0.570048       0.5700736
B               0.57005824     0.57006336
R               0.570062336    0.57006336
A               0.5700623360   0.5700627456

The last value of the lower limit gives a unique code for the message « BRACADABRA ».
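A minimal Python sketch of this narrowing loop (names are ours; double-precision floats are sufficient for this 10-character toy message, whereas real coders use integer arithmetic). It reproduces the final lower limit of the table:

```python
# Per-character intervals [low, high[ from the table above.
INTERVALS = {"A": (0.0, 0.4), "B": (0.4, 0.6), "C": (0.6, 0.7),
             "D": (0.7, 0.8), "R": (0.8, 1.0)}

def encode(message: str) -> float:
    low, high = 0.0, 1.0
    for c in message:
        value = high - low              # Value = Old_High - Old_Low
        c_low, c_high = INTERVALS[c]
        high = low + value * c_high     # New_High_Limit
        low = low + value * c_low       # New_Low_Limit
    return low                          # any number in [low, high[ identifies the message

print(f"{encode('BRACADABRA'):.10f}")   # 0.5700623360
```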



How to decode?

To decode, find the character c whose interval contains the current value, output it, then rescale:

Value = Value - Value_low(c)
Value = Value / (Value_high(c) - Value_low(c))

Value           Interval      Character
0.5700623360    [0.4, 0.6[    B
0.85031168      [0.8, 1.0[    R
0.2515584       [0.0, 0.4[    A
0.628896        [0.6, 0.7[    C
0.28896         [0.0, 0.4[    A
0.7224          [0.7, 0.8[    D
0.224           [0.0, 0.4[    A
0.56            [0.4, 0.6[    B
0.8             [0.8, 1.0[    R
0.0             [0.0, 0.4[    A
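The corresponding decoding sketch (it reuses INTERVALS from the encoding sketch above; note that the message length must be known, which is why practical coders add an end-of-message symbol):

```python
def decode(value: float, length: int) -> str:
    out = []
    for _ in range(length):
        # Find the character whose interval contains the current value.
        for c, (c_low, c_high) in INTERVALS.items():
            if c_low <= value < c_high:
                out.append(c)
                value = (value - c_low) / (c_high - c_low)  # rescale
                break
    return "".join(out)

print(decode(0.5700623360, 10))  # -> BRACADABRA
```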



Dictionary method

LZW
« Lempel-Ziv-Welch »



LZW Method, dictionary method

Message to compress: « DU CODAGE AU DECODAGE »

[Table: the worked compression trace, with columns 'code read', 'added to dictionary', 'latent code' and 'transmitted code'; its rows are not reproduced here.]
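In the absence of the worked trace, here is a minimal LZW encoding sketch (for readability, the dictionary is initialized from the message's own alphabet rather than from all 256 byte values, as a real implementation would do):

```python
def lzw_encode(message: str):
    # Initial dictionary: one entry per distinct character of the message.
    dictionary = {c: i for i, c in enumerate(sorted(set(message)))}
    current, out = "", []
    for char in message:
        if current + char in dictionary:
            current += char                       # keep extending the match
        else:
            out.append(dictionary[current])       # transmit code of the match
            dictionary[current + char] = len(dictionary)  # add new entry
            current = char
    out.append(dictionary[current])               # flush the latent code
    return out, dictionary

codes, dictionary = lzw_encode("DU CODAGE AU DECODAGE")
print(codes)  # the second occurrence of 'ODAGE' reuses multi-character entries
```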



Thank you
