
Common Compression Methods

• Statistical methods:
• These require prior information about the occurrence of symbols,
e.g., Huffman coding and entropy coding.
• Estimate probabilities of symbols and code one symbol at a time, assigning
shorter codes to symbols with higher probabilities.

• Dictionary-based coding
• The previous algorithms (both entropy and Huffman coding) require
statistical knowledge that is often not available (e.g., for live audio and
video).
• Dictionary-based coding, such as the Lempel-Ziv (LZ) compression
techniques, does not require prior information to compress strings.
• Rather, it replaces strings of symbols with pointers to dictionary entries.
Common Compression Techniques
• Compression techniques are classified into static, adaptive
(or dynamic), and hybrid.
• Static coding requires two passes: one pass to compute
probabilities (or frequencies) and determine the mapping,
and a second pass to encode.
• Examples: Huffman coding, entropy encoding

• Adaptive coding:
• It adapts to localized changes in the characteristics of the data
and does not require a first pass over the data to calculate a
probability model. All adaptive methods are one-pass
methods; only one scan of the message is required.
• The cost paid for these advantages is that the encoder and decoder
must be complex enough to keep their states synchronized, and more
computational power is needed to keep adapting the
encoder/decoder state.
• Examples: Lempel-Ziv and adaptive Huffman coding
Data Compression = Modeling + Coding
• Data compression consists of taking a stream of symbols
and transforming them into codes.
• The model is a collection of data and rules used to process input
symbols and determine their probabilities.
• A coder uses a model (probabilities) to assign codes to the given
input symbols.

[Diagram: Input Symbol Stream → Model → Probabilities → Encoder → Output Code Stream]
• We will use Huffman coding to demonstrate this
distinction:
Huffman Coding
• Developed in the 1950s by David Huffman, it is
widely used for text compression,
multimedia codecs, and message
transmission.
• The problem: given a set of n symbols
and their weights (or frequencies),
construct a tree structure (a binary tree for
a binary code) with the objective of
reducing memory space and decoding
time per symbol.
• For instance, Huffman coding is
constructed based on the frequency of
occurrence of letters in text documents.

[Example tree: left edges labeled 0, right edges labeled 1, giving the codes
D1 = 000, D2 = 001, D3 = 01, D4 = 1.]
Huffman Coding
• The Model could determine the raw probability of each
symbol occurring anywhere in the input stream:

p_i = (# of occurrences of S_i) / (total # of symbols)

• The output of the Huffman encoder is determined by the
Model (probabilities).
• The higher the probability of occurrence of a symbol, the
shorter the code assigned to that symbol, and vice versa.
• This makes it easy to handle the most frequently occurring
symbols in the data and also reduces the time taken to decode
each symbol.
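As a small illustration, the model's probability estimates amount to a
frequency count. This is a minimal sketch; the function name
symbol_probabilities is our own choice, not part of the slides:

from collections import Counter

def symbol_probabilities(stream):
    """Model step: p_i = (# of occurrences of S_i) / (total # of symbols)."""
    counts = Counter(stream)
    total = sum(counts.values())
    return {symbol: count / total for symbol, count in counts.items()}

# symbol_probabilities("abracadabra")
# -> {'a': 0.4545..., 'b': 0.1818..., 'r': 0.1818..., 'c': 0.0909..., 'd': 0.0909...}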
How to construct Huffman coding
Step 1:
• Create a leaf node for each character of the text.
• The leaf node of a character contains the occurring frequency of that character.
Step 2:
• Arrange all the nodes in increasing order of their frequency value.
Step 3:
• Considering the two nodes having minimum frequency:
• Create a new internal node.
• The frequency of this new node is the sum of the frequencies of those two nodes.
• Make the first node the left child and the other node the right child of the
newly created node.
Step 4:
• Keep repeating Steps 2 and 3 until all the nodes form a single tree.
• The tree finally obtained is the desired Huffman tree (see the code sketch below).
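The four steps above map directly onto a priority queue. Below is a minimal
Python sketch; the function name build_huffman_codes and the use of heapq
are our own choices, not part of the slides:

import heapq
from collections import Counter

def build_huffman_codes(text):
    """Build Huffman codes following Steps 1-4: one leaf per character,
    repeatedly merge the two lowest-frequency nodes, and label left
    edges '0' and right edges '1'."""
    # Step 1: a leaf [frequency, [character, code]] per distinct character.
    heap = [[freq, [ch, ""]] for ch, freq in Counter(text).items()]
    heapq.heapify(heap)                    # Step 2: order nodes by frequency
    while len(heap) > 1:
        lo = heapq.heappop(heap)           # Step 3: take the two minimum-
        hi = heapq.heappop(heap)           # frequency nodes and merge them
        for pair in lo[1:]:
            pair[1] = "0" + pair[1]        # left child: prepend '0'
        for pair in hi[1:]:
            pair[1] = "1" + pair[1]        # right child: prepend '1'
        heapq.heappush(heap, [lo[0] + hi[0]] + lo[1:] + hi[1:])
    return dict(heap[0][1:])               # Step 4: a single tree remains

# e.g. build_huffman_codes("for each rose, a rose is a rose")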
Example
A file contains the following characters with the frequencies as
shown. If Huffman coding is used for data compression,
determine:

[Table: frequencies of the characters a, e, i, o, u, s, t]

1. The Huffman code for each character
2. The average code length
3. The length of the Huffman-encoded message (in bits)
Solution
The Huffman tree is constructed in the following steps.

[Steps 1-7: the intermediate trees obtained by repeatedly merging the two
lowest-frequency nodes are shown in the original figures.]
Huffman code tree
• We assign a weight to each edge of the constructed Huffman tree.
• Let us assign weight '0' to the left edges and weight '1' to the right
edges.
Solution
1. Huffman Code for Characters
• To write the Huffman code for any character, traverse the
Huffman tree from the root node to the leaf node of that
character.
• Following this rule, the Huffman code for each character is:

a = 111
e = 10
i = 00
o = 11001
u = 1101
s = 01
t = 11000

From here, we can observe:
• Characters occurring less frequently in the text are assigned longer
codes.
• Characters occurring more frequently in the text are assigned shorter
codes.
2. Average Code Length

Average code length = Σ (frequency_i × code length_i) / Σ (frequency_i)

3. Length of Huffman Encoded Message

Length (in bits) = Σ (frequency_i × code length_i)
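Both quantities can be computed directly from the code table. This is a
minimal sketch; the frequency values below are hypothetical placeholders
for illustration only, since the original frequency table appeared as a figure:

def huffman_code_stats(freqs, codes):
    """Return (average bits per symbol, total encoded length in bits)."""
    total_symbols = sum(freqs.values())
    total_bits = sum(freqs[ch] * len(codes[ch]) for ch in freqs)
    return total_bits / total_symbols, total_bits

# Codes from the solution above; frequencies are hypothetical placeholders.
codes = {'a': '111', 'e': '10', 'i': '00', 'o': '11001',
         'u': '1101', 's': '01', 't': '11000'}
freqs = {'a': 10, 'e': 15, 'i': 12, 'o': 3, 'u': 4, 's': 13, 't': 1}
avg_len, total_len = huffman_code_stats(freqs, codes)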
Class Exercise
Q1: What is the Huffman binary representation for ‘café’?
Q2: Given the text “for each rose, a rose is a rose”, construct the Huffman coding.
Find:
1. The Huffman code for each character
2. The average code length
3. The length of the Huffman-encoded message (in bits)
Lempel-Ziv Encoding
• Data compression up until the late 1970s was mainly directed towards
creating better methodologies for Huffman coding.
• An innovative, radically different method was introduced in 1977 by Abraham
Lempel and Jacob Ziv.
• This technique (called Lempel-Ziv) actually consists of two
considerably different algorithms, LZ77 and LZ78.
• Due to patents, LZ77 and LZ78 led to many variants:

LZ77 variants: LZR, LZSS, LZB, LZH
LZ78 variants: LZW, LZC, LZT, LZMW, LZJ, LZFG

• The zip and unzip utilities use the LZH technique, while UNIX's compress
belongs to the LZW and LZC classes.
Lempel-Ziv compression
• The problem with Huffman coding is that it requires
knowledge about the data before encoding takes place.
• Huffman coding requires the frequencies of symbol occurrences
before codewords are assigned to symbols.

• Lempel-Ziv compression:
• Does not rely on prior knowledge about the data
• Rather, it builds this knowledge in the course of data
transmission or data storage
• The Lempel-Ziv algorithm (called LZ) uses a table of codewords
created during data transmission;
• each time, it replaces strings of characters with a reference to a previous
occurrence of the string.
Lempel-Ziv Compression Algorithm
• The multi-symbol patterns are of the form C_0 C_1 ... C_{n-1} C_n.
The prefix of a pattern consists of all the pattern
symbols except the last: C_0 C_1 ... C_{n-1}

Lempel-Ziv output: there are three options in assigning a code
to each symbol in the list (see the sketch after this list):
• If a one-symbol pattern is not in the dictionary, assign (0, symbol)
• If a multi-symbol pattern is not in the dictionary, assign
(dictionaryPrefixIndex, lastPatternSymbol)
• If the last input symbol or the last pattern is in the dictionary,
assign (dictionaryPrefixIndex, ) — i.e., the index alone
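These three rules translate into a short encoder. Below is a minimal Python
sketch; the function name lz78_encode is our own, and index 0 stands for the
empty string as described above:

def lz78_encode(text):
    """LZ78 encoding: emit (dictionaryPrefixIndex, lastPatternSymbol)
    pairs. Each emitted pair also defines the next dictionary entry:
    the prefix entry extended by one new symbol."""
    dictionary = {"": 0}    # pattern -> index; the empty string has index 0
    output = []
    pattern = ""
    for ch in text:
        if pattern + ch in dictionary:
            pattern += ch                    # keep extending the match
        else:
            output.append((dictionary[pattern], ch))
            dictionary[pattern + ch] = len(dictionary)   # new entry
            pattern = ""
    if pattern:
        # The last pattern is already in the dictionary: emit its index alone.
        output.append((dictionary[pattern],))
    return output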
Example: LZ compression
• Example: Given a word, aaababbbaaabaaaaaaabaabb
containing only two letters, a and b, compress it using LZ
technique.
Steps in Compression
• First, split the given word into pieces of symbols.
• In the example, the first piece of our sample text is a. The second
piece must then be aa. If we go on like this, we obtain the following
breakdown of the data:

a | aa | b | ab | bb | aaa | ba | aaaa | aab | aabb

• Note that each new piece is the shortest string of characters
that has not been seen so far.
LZ Compression
• Second, index the pieces of text obtained in the breaking-
down process from 1 to n.
• The empty string (start of text) has index 0, a has index 1, ...

• Third, number the pieces of data using the above indices.
• Thus a, the initial string, is numbered 0a. String 2, aa, is
numbered 1a, because it contains a, whose index is 1, plus the new
character a. Proceed to number all the pieces in terms of those
preceding them, as the usage sketch below shows.
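Running the lz78_encode sketch from above on the sample word reproduces
this numbering:

pieces = lz78_encode("aaababbbaaabaaaaaaabaabb")
# -> [(0,'a'), (1,'a'), (0,'b'), (1,'b'), (3,'b'),
#     (2,'a'), (3,'a'), (6,'a'), (2,'b'), (9,'b')]
# i.e. the pieces a, aa, b, ab, bb, aaa, ba, aaaa, aab, aabb
# are numbered 0a, 1a, 0b, 1b, 3b, 2a, 3a, 6a, 2b, 9b.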

• Does replacing characters by integers compress the given text?
LZ Compression
• Now, compute how many bits are needed to represent this coded
information.
• Each piece of text is composed of an integer and a character.

• One of the advantages of Lempel-Ziv compression is that for a long
string of text, the number of bits needed to transmit the coded
information is small compared to the actual length of the text.
• E.g., to transmit the actual text aab, 24 bits (8 + 8 + 8) are needed, whereas
the code 2b needs only 12 bits.

• How do we calculate the number of bits required for coding using LZ
compression?
Example 2: LZ Compression
Encode (i.e., compress) the string ABBCBCABABCAABCAAB using the
LZ algorithm.

The compressed message is: (0,A)(0,B)(2,C)(3,A)(2,A)(4,A)(6,B)

Note: the above is just a representation; the commas and parentheses are
not transmitted.
Example 2: Compute Number of Bits Transmitted
• Consider the string ABBCBCABABCAABCAAB given in Example 2
(previous slide) and compute the number of bits transmitted.
• Uncompressed: number of bits = total no. of characters × 8 = 18 × 8 = 144 bits
• The compressed string consists of codewords and the corresponding
codeword indices, as shown below:

Codeword:       (0, A)  (0, B)  (2, C)  (3, A)  (2, A)  (4, A)  (6, B)
Codeword index:   1       2       3       4       5       6       7

• Each codeword consists of a character and an integer:
• The character is represented by 8 bits.
• The number of bits n required to represent the integer part of the codeword
with index i is given by n = ⌈log2(i)⌉, with a minimum of 1 bit
(see the sketch below).

Bits: (1 + 8) + (1 + 8) + (2 + 8) + (2 + 8) + (3 + 8) + (3 + 8) + (3 + 8) = 71 bits

• The actual compressed message is: 0A0B10C11A010A100A110B
• where each character is replaced by its 8-bit binary ASCII code.
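The per-codeword arithmetic above can be checked with a few lines of Python.
A minimal sketch; lz78_bit_count is our own name:

import math

def lz78_bit_count(codewords):
    """Total bits: 8 bits (ASCII) for the character part, plus
    n = ceil(log2(i)) bits (minimum 1) for the integer part of the
    codeword with index i."""
    total = 0
    for i, _codeword in enumerate(codewords, start=1):
        total += max(1, math.ceil(math.log2(i))) + 8
    return total

codewords = [(0, 'A'), (0, 'B'), (2, 'C'), (3, 'A'), (2, 'A'), (4, 'A'), (6, 'B')]
print(lz78_bit_count(codewords))   # -> 71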
Example 3: Decompression
Decode (i.e., decompress) the sequence (0, A) (0, B) (2, C) (3, A)
(2, A) (4, A) (6, B).

The decompressed message is: ABBCBCABABCAABCAAB
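Decoding rebuilds the dictionary on the fly. A minimal sketch handling
(index, symbol) pairs only; lz78_decode is our own name:

def lz78_decode(codewords):
    """Each (index, symbol) pair expands to dictionary entry `index`
    followed by `symbol`; the expansion becomes the next entry."""
    dictionary = [""]                     # entry 0 is the empty string
    pieces = []
    for index, symbol in codewords:
        piece = dictionary[index] + symbol
        dictionary.append(piece)          # new dictionary entry
        pieces.append(piece)
    return "".join(pieces)

print(lz78_decode([(0, 'A'), (0, 'B'), (2, 'C'), (3, 'A'),
                   (2, 'A'), (4, 'A'), (6, 'B')]))   # -> ABBCBCABABCAABCAAB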
Exercise
Encode (i.e., compress) the following strings using the
Lempel-Ziv algorithm.

1. ABBCBCABABCAABCAAB
2. SATATASACITASA.
Group Assignment
Compare the Huffman coding and Lempel-Ziv
algorithms:
• The algorithm
• Draw a flowchart
• Time and space complexity
• Provide an example to show how the algorithms work in
image compression

Due August 23, 2021
