
Data compression

Sid Lamrous

MITGL01 / Course 1, by Sid Lamrous. Monday, 20 June 2022


Data compression

Data compression is of considerable importance in a variety of areas: television, music, remote sensing, medical imaging, ...

Graphic designers have long (sometimes unknowingly) used the various compression methods implemented in commercial software.

We propose here, limited to the case of texts and images, to take an inventory of the available methods and to understand their principles.



Compression methods

■ Reversible methods (lossless)
» Statistical coding (Huffman and Shannon-Fano algorithms)
» Arithmetic methods
» Dictionary methods (LZW algorithm)
■ Irreversible methods (lossy)
» JPEG compression
» JPEG 2000 compression



Definition

We will say that we have compressed a file if we manage to reduce the number of binary digits needed to store it.

The effectiveness of compression is measured by the compression ratio:

τ = (number of binary digits used by the original document) / (number of binary digits used by the compressed document)

A ratio greater than 1 means that space has actually been saved.



Lossless compression methods

A simple example: suppose we have to try to compress the following series of bytes:

0001 0100 0011 1111 0101 0101 0101 0101 0101 0101
0101 0101 1110 1111 0000 1111 1111 1111 1111 1111
1111 1111 1111 1111 1111 1111 0110 0111 0111 0000
0111 0000 1010 1111 0000 0000 0001 1111 0001 1111
0001 1111

This series has 21 bytes. (You can interpret it as a sequence of characters or as the grey levels of a series of pixels; it does not matter for our purposes here.)



Lossless compression methods

Let's start by writing a signaling byte; in the result below, a signaling byte opens each group: 0000 0010.
The first bit of a signaling byte indicates whether the following data byte repeats: it is set to 1 if so, and to 0 if not.

If the first bit = 0:
the following 7 bits give the number of bytes transmitted without repetition.

If the first bit = 1:
the following 7 bits give the number of repetitions of the single data byte that follows.



Data compression

Result:
0000 0010 0001 0100 0011 1111 1000 0100 0101 0101 0000 0010
1110 1111 0000 1111 1000 0101 1111 1111 0000 0001 0110 0111
1000 0010 0111 0000 0000 0010 1010 1111 0000 0000 1000 0011
0001 1111
This sequence is 19 bytes, 2 fewer than the original, so we obtain a compression ratio of 21/19 ≈ 1.1.
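As a sanity check, here is a minimal Python sketch of this signaling-byte scheme (the function name and the decision to encode any run of two or more identical bytes as a repetition are choices of this sketch, inferred from the worked result above):

```python
def rle_encode(data: bytes) -> bytes:
    """Signaling-byte RLE: high bit 1 = 'the next byte repeats count times',
    high bit 0 = 'count literal bytes follow'; count sits in the low 7 bits."""
    out = bytearray()
    i = 0
    while i < len(data):
        # Measure the run of identical bytes starting at position i.
        j = i
        while j < len(data) and data[j] == data[i] and j - i < 127:
            j += 1
        if j - i >= 2:                      # encode the run as a repetition
            out.append(0x80 | (j - i))      # flag bit 1 + repetition count
            out.append(data[i])
        else:                               # gather bytes with no repetition
            j = i + 1
            while (j < len(data) and j - i < 127
                   and (j + 1 >= len(data) or data[j] != data[j + 1])):
                j += 1
            out.append(j - i)               # flag bit 0 + literal count
            out.extend(data[i:j])
        i = j
    return bytes(out)

# The 21 bytes of the example, written in hexadecimal:
original = bytes([0x14, 0x3F, 0x55, 0x55, 0x55, 0x55, 0xEF, 0x0F,
                  0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x67, 0x70, 0x70,
                  0xAF, 0x00, 0x1F, 0x1F, 0x1F])
assert len(rle_encode(original)) == 19      # matches the 19-byte result above
```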

Decoding poses no problem as long as the decoder is informed of the method used: it will then interpret the first byte as a signaling byte, and everything follows from there...



Statistical method

Is encoding all the values we have to store with the same number of bits the best solution?
Theory tells us it is not!

Idea: use shorter codes for frequent values and reserve longer codes for less frequent values (the histogram of an image, for instance, shows that a few grey levels are far more frequent than the rest).

=> We will therefore focus on VLCs (Variable Length Codes); we also speak of entropy coding.



Example

Suppose we have to process a message which contains only 4 characters, which we will name A, B, C and D, and that the frequencies of these characters in our message are as follows:

A : 60 %   B : 30 %   C : 5 %   D : 5 %

One can imagine the following code:

A : 0   B : 11   C : 01   D : 10

It will certainly allow us to save space, but if we receive the message 000110, how do we interpret it? 3A, B, A? 2A, C, D? It does not work! We have just discovered the problem of synchronization.
How can we solve it? With separator characters? That would contradict the objective of compression.



The solution: prefixed VLC

Definition: a code is prefixed if no codeword is the beginning of any other codeword.

Let's go back to our example and try to correct it:

A : 0   B : 10   C : 110   D : 111   It works!

Is this the optimal solution? Since we obtained it by trial and error, we cannot tell; but rest assured, the theory asserts that it is.
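To see why the prefix property removes the ambiguity, here is a minimal decoding sketch (the function name is ours): reading bit by bit, the first codeword matched is necessarily the right one, so no backtracking is ever needed.

```python
CODE = {"A": "0", "B": "10", "C": "110", "D": "111"}
DECODE = {bits: char for char, bits in CODE.items()}

def decode(bitstream: str) -> str:
    out, buffer = [], ""
    for bit in bitstream:
        buffer += bit
        if buffer in DECODE:        # unique match: no codeword begins another
            out.append(DECODE[buffer])
            buffer = ""
    if buffer:
        raise ValueError("truncated bit stream")
    return "".join(out)

print(decode("000110"))  # -> AAAC: the once-ambiguous message has a single reading
```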

We will now take a slightly more complex example and present a method for determining an efficient VLC (although it is not always optimal).



Shannon-Fano algorithm

Suppose we want to send the following message:

LE PRESIDENT EST ENTRE DANS LA SALLE

It has 36 characters and therefore occupies 36 bytes in software such as WordPerfect. We count the occurrences of the different characters of the alphabet used and sort them by decreasing frequency, obtaining the following list:

E : 7
Space : 6
L : 4
S : 4
N : 3
T : 3
A : 3
R : 2
D : 2
P : 1
I : 1

We now split the characters into two groups whose total frequencies of occurrence are as close as possible, then divide each of these groups in the same way, until each group is reduced to a single starting character.


Sample processing

[Figure: the successive splits of the frequency list into two groups of nearly equal weight.]
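A minimal Python sketch of this splitting procedure (recursive; the closest-split search and all names are ours). On the frequency list above, it reproduces the 118-bit total computed just below:

```python
def shannon_fano(freqs):
    """freqs: list of (symbol, count), sorted by decreasing count.
    Returns a dict {symbol: code string}."""
    if len(freqs) == 1:
        return {freqs[0][0]: ""}
    total = sum(count for _, count in freqs)
    # Find the split making the two groups' total frequencies closest.
    running, split, best_diff = 0, 1, total
    for i in range(1, len(freqs)):
        running += freqs[i - 1][1]
        diff = abs(total - 2 * running)
        if diff < best_diff:
            best_diff, split = diff, i
    # Prefix 0 to the first group's codes and 1 to the second group's.
    codes = {}
    for sym, code in shannon_fano(freqs[:split]).items():
        codes[sym] = "0" + code
    for sym, code in shannon_fano(freqs[split:]).items():
        codes[sym] = "1" + code
    return codes

freqs = [("E", 7), (" ", 6), ("L", 4), ("S", 4), ("N", 3), ("T", 3),
         ("A", 3), ("R", 2), ("D", 2), ("P", 1), ("I", 1)]
codes = shannon_fano(freqs)
print(sum(len(codes[sym]) * n for sym, n in freqs))  # 118 bits
```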


Data compression

• We can verify that we obtain a prefixed VLC.

• Let's measure its effectiveness.

We started with a message of 36 characters, i.e. 36 x 8 = 288 bits.
If we had coded each character on 4 bits (which would have been enough for our 11-character alphabet), we would have used 144 bits.
Using the coding we have just built, we use:
(2 x 7) + (3 x 6) + (3 x 4) + (3 x 4) + (4 x 3) + (4 x 3) + (4 x 3) + (4 x 2) + (4 x 2) + (5 x 1) + (5 x 1) = 14 + 18 + 12 + 12 + 12 + 12 + 12 + 8 + 8 + 5 + 5 = 118 bits
The compression ratio relative to the 4-bit representation is

τ = 144/118 ≈ 1.22
Huffman Algorithm

Principle:
An element i is represented by a sequence of bits whose length is inversely related to its probability of appearance p(i) (ideally about -log2 p(i) bits, rounded to an integer).
The problem is to define a variable-length code of which no element is the beginning of another (a prefixed code).
The method is based on the construction of a tree driven by the probabilities of appearance of the elements.

Example: elements to be coded, with their probabilities of appearance (the table is not reproduced here).


Principle of tree construction

1. Initially, the leaves are the free nodes.
2. Choose the two free nodes with the lowest probabilities.
3. Create a parent node whose weight is the sum of the weights of the two child nodes.
4. Associate 0 with the branch of smaller probability and 1 with the other.
5. The parent becomes a free node, and the two children are no longer free.
6. As long as there is more than one free node, repeat steps 2 to 5.
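A minimal Python sketch of these steps, using a heap to pick the two lowest-probability free nodes (the probabilities in the demo are hypothetical, since the course's table is not reproduced here):

```python
import heapq
import itertools

def huffman(probs):
    """probs: {symbol: probability}. Returns {symbol: code string}."""
    tie = itertools.count()   # tie-breaker so heap tuples never compare dicts
    heap = [(p, next(tie), {sym: ""}) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:                      # step 6: more than one free node
        p0, _, codes0 = heapq.heappop(heap)   # steps 2/4: smallest -> branch 0
        p1, _, codes1 = heapq.heappop(heap)   # next smallest -> branch 1
        merged = {sym: "0" + c for sym, c in codes0.items()}
        merged.update({sym: "1" + c for sym, c in codes1.items()})
        heapq.heappush(heap, (p0 + p1, next(tie), merged))  # steps 3/5: parent
    return heap[0][2]

print(huffman({"A": 0.4, "B": 0.2, "R": 0.2, "C": 0.1, "D": 0.1}))
```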
Sample processing

[Figure: the Huffman tree built for the example; the 0/1 branch labels visible in the figure yield codewords including 110, 1110, 11110 and 11111.]


Arithmetic coding

■ Unlike statistical methods (where a frequent element is given a shorter representation, and a rare one a longer representation), here we represent a whole stream of information by a numerical interval.
■ A message is represented by an interval of real numbers within [0, 1[.
■ The longer the message becomes, the narrower the interval needed to represent it.


Processing an example

■ Message: BRACADABRA
■ Each character is given a representation interval. The particular allocation of the intervals has no influence on the compression / decompression of the message.
» But attention! Decompression must use the same table as the one used for compression.

Character   Probability   Interval
A           4/10          [0.0, 0.4[
B           2/10          [0.4, 0.6[
C           1/10          [0.6, 0.7[
D           1/10          [0.7, 0.8[
R           2/10          [0.8, 1.0[



Arithmetic coding

For each character c read, the current interval is narrowed:

Value = Old_High_Limit - Old_Low_Limit

New_High_Limit = Old_Low_Limit + Value * High_value(c)

New_Low_Limit = Old_Low_Limit + Value * Low_value(c)

where Low_value(c) and High_value(c) are the bounds of the interval assigned to c.


How to assign the code?

Starting from [0.0, 1.0[ and applying the update rules to each character of « BRACADABRA », with the interval table above:

New character   Low limit      High limit
(start)         0.0            1.0
B               0.4            0.6
R               0.56           0.60
A               0.560          0.576
C               0.5696         0.5712
A               0.56960        0.57024
D               0.570048       0.570112
A               0.570048       0.5700736
B               0.57005824     0.57006336
R               0.570062336    0.57006336
A               0.5700623360   0.5700627456

The last value of the lower limit gives a unique code for the message « BRACADABRA ».
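A minimal Python sketch of this narrowing loop (names are ours; double-precision floats are sufficient for this 10-character toy message, whereas real coders use integer arithmetic). It reproduces the final lower limit of the table:

```python
# Per-character intervals [low, high[ from the table above.
INTERVALS = {"A": (0.0, 0.4), "B": (0.4, 0.6), "C": (0.6, 0.7),
             "D": (0.7, 0.8), "R": (0.8, 1.0)}

def encode(message: str) -> float:
    low, high = 0.0, 1.0
    for c in message:
        value = high - low              # Value = Old_High - Old_Low
        c_low, c_high = INTERVALS[c]
        high = low + value * c_high     # New_High_Limit
        low = low + value * c_low       # New_Low_Limit
    return low                          # any number in [low, high[ identifies the message

print(f"{encode('BRACADABRA'):.10f}")   # 0.5700623360
```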



How to decode?

To decode, find the character c whose interval contains the current value, output it, then rescale:

Value = Value - Value_low(c)
Value = Value / (Value_high(c) - Value_low(c))

Value           Interval      Character
0.5700623360    [0.4, 0.6[    B
0.85031168      [0.8, 1.0[    R
0.2515584       [0.0, 0.4[    A
0.628896        [0.6, 0.7[    C
0.28896         [0.0, 0.4[    A
0.7224          [0.7, 0.8[    D
0.224           [0.0, 0.4[    A
0.56            [0.4, 0.6[    B
0.8             [0.8, 1.0[    R
0.0             [0.0, 0.4[    A
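The corresponding decoding sketch (it reuses INTERVALS from the encoding sketch above; note that the message length must be known, which is why practical coders add an end-of-message symbol):

```python
def decode(value: float, length: int) -> str:
    out = []
    for _ in range(length):
        # Find the character whose interval contains the current value.
        for c, (c_low, c_high) in INTERVALS.items():
            if c_low <= value < c_high:
                out.append(c)
                value = (value - c_low) / (c_high - c_low)  # rescale
                break
    return "".join(out)

print(decode(0.5700623360, 10))  # -> BRACADABRA
```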



Dictionary method

LZW
« Lempel-Ziv-Welch »



LZW Method, dictionary method

Message to compress: « DU CODAGE AU DECODAGE »

[Table: the worked compression trace, with columns 'code read', 'added to dictionary', 'latent code' and 'transmitted code'; its rows are not reproduced here.]
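In the absence of the worked trace, here is a minimal LZW encoding sketch (for readability, the dictionary is initialized from the message's own alphabet rather than from all 256 byte values, as a real implementation would do):

```python
def lzw_encode(message: str):
    # Initial dictionary: one entry per distinct character of the message.
    dictionary = {c: i for i, c in enumerate(sorted(set(message)))}
    current, out = "", []
    for char in message:
        if current + char in dictionary:
            current += char                       # keep extending the match
        else:
            out.append(dictionary[current])       # transmit code of the match
            dictionary[current + char] = len(dictionary)  # add new entry
            current = char
    out.append(dictionary[current])               # flush the latent code
    return out, dictionary

codes, dictionary = lzw_encode("DU CODAGE AU DECODAGE")
print(codes)  # the second occurrence of 'ODAGE' reuses multi-character entries
```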



Thank you
