
DATA STRUCTURE & ALGORITHM

TOPIC: HUFFMAN’S CODE

By: Hira Farman


Huffman Coding:
An Application of Binary Trees and
Priority Queues



PURPOSE OF HUFFMAN CODING:
• Proposed by David A. Huffman in 1952 in his paper
“A Method for the Construction of Minimum-Redundancy Codes”

• Applicable to many forms of data transmission


• Our example: text files



Huffman Coding:
• Huffman codes can be used to compress information
• Like WinZip – although WinZip doesn’t use the Huffman algorithm
• JPEGs do use Huffman as part of their compression process

• The basic idea is that instead of storing each character in a file as an 8-bit ASCII value, we store the more frequently occurring characters using fewer bits and the less frequently occurring characters using more bits
• On average this decreases the file size, often to roughly half
The Basic Algorithm
• Huffman coding is a form of statistical coding
• Not all characters occur with the same frequency!
• Yet all characters are allocated the same amount of space
• 1 char = 1 byte, be it e or x

CS 102
The (Real) Basic Algorithm
1. Scan text to be compressed and tally
occurrence of all characters.
2. Sort or prioritize characters based on
number of occurrences in text.
3. Build Huffman code tree based on
prioritized list.
4. Perform a traversal of tree to determine
all code words.
5. Scan text again and create new file
using the Huffman codes.
INTRODUCTION TO HUFFMAN CODE:
• Huffman codes are an effective technique of “lossless data compression”, which means no information is lost.
• The algorithm builds a table of the frequencies of each character in a file.
• The table is then used to determine an optimal way of representing each character as a binary string.
HUFFMAN CODE:
• The algorithm as described by David Huffman assigns every symbol to
a leaf node of a binary code tree. These nodes are weighted by the
number of occurrences of the corresponding symbol called frequency
or cost.
• The tree structure results from combining the nodes step by step until all of them are embedded in a single tree with one root.
• The algorithm always combines the two nodes with the lowest frequency in a bottom-up procedure. Each new interior node gets the sum of the frequencies of its two child nodes.
ALGORITHM
HUFFMAN CODING ALGORITHM:
• 1. Initialization: put all symbols on a list sorted according to their frequency counts.
• 2. Repeat until the list has only one symbol left:
  1. From the list, pick the two symbols with the lowest frequency counts.
  2. Form a Huffman subtree that has these two symbols as child nodes, and create a parent node for them.
  3. Assign the sum of the children’s frequency counts to the parent and insert it into the list so that the order is maintained.
  4. Delete the children from the list.
• 3. Assign a code word to each leaf based on the path from the root.
PSEUDOCODE
HUFFMAN(C)
// C is the set of characters; freq is the frequency attribute
1: n = |C|
2: Q = C                  // build a min-priority queue keyed on freq
3: for i = 1 to n - 1
4:     allocate a new node z
5:     z.left  = x = EXTRACT-MIN(Q)
6:     z.right = y = EXTRACT-MIN(Q)
7:     z.freq  = x.freq + y.freq
8:     INSERT(Q, z)
9: return EXTRACT-MIN(Q)  // the root of the Huffman tree
CODE:
struct tree_t
{
    struct tree_t *left;
    struct tree_t *right;
    char character;
};

to store the tree elements (note that if left and right are NULL, then we know the node is a leaf and that character stores a valid char of interest), and

struct list_t
{
    struct list_t *next;
    int total_frequency;
    struct tree_t *tree;
};

to store the list of trees that still need to be merged together, as a linked list.
EXAMPLE 1:
HUFFMAN TREE:

                 100
                /    \
            a/45      55
                     /    \
                   25      30
                  /  \    /  \
              c/12 b/13  14  d/16
                        /  \
                      f/5  e/9
CONSIDER THE TABLE:
(DESIGN HUFFMAN TREE)

Character   Frequency   Fixed-length code   Variable-length code
a           45          000                 ?
b           13          001                 ?
c           12          010                 ?
d           16          011                 ?
e           9           100                 ?
f           5           101                 ?
Encoding the File: Traverse Tree for Codes
• Perform a traversal of the tree to obtain the new code words
• Going left is a 0; going right is a 1
• A code word is only complete when a leaf node is reached
HUFFMAN TREE:

                 100
               0/    \1
            a/45      55
                    0/    \1
                   25      30
                 0/  \1  0/  \1
              c/12 b/13  14  d/16
                       0/  \1
                      f/5  e/9
CONSIDER THE TABLE:
(DESIGN HUFFMAN TREE)

Character   Frequency   Fixed-length code   Variable-length code
a           45          000                 0
b           13          001                 101
c           12          010                 100
d           16          011                 111
e           9           100                 1101
f           5           101                 1100
EXAMPLE 2:
Code Tree According to Huffman:
• The branches of the tree represent the binary values 0 and 1, following the rules for common prefix-free code trees.
• The path from the root to the corresponding leaf node defines the particular code word.
Huffman Code: Example

The following example is based on a data source using a set of five different symbols. The symbol frequencies are:

Symbol   Frequency
A        24
B        12
C        10
D        8
E        8

The two rarest symbols, 'E' and 'D', are combined first, followed by 'C' and 'B'. The new parent nodes have the frequencies 16 and 22 respectively and are brought together in the next step. The resulting node and the remaining symbol 'A' are subordinated to the root node, which is created in a final step.
Code Tree according to Huffman:

Symbol   Frequency   Code   Code length   Total length
A        24          0      1             24
B        12          100    3             36
C        10          101    3             30
D        8           110    3             24
E        8           111    3             24
http://www.binaryessence.com/dct/en000080.htm
EXAMPLE 3:
“Eerie eyes seen near lake.” (26 characters)
Building a Tree
Scan the original text
Eerie eyes seen near lake.
• What characters are present?

  E  e  r  i  space  y  s  n  a  l  k  .
Building a Tree
Scan the original text
Eerie eyes seen near lake.
• What is the frequency of each character in the text?

Char Freq. Char Freq. Char Freq.


E 1 y 1 k 1
e 8 s 2 . 1
r 2 n 2
i 1 a 2
space 4 l 1

Building a Tree
Prioritize characters
• Create binary tree nodes with character and
frequency of each character.
• Place nodes in a priority queue
• The lower the occurrence, the higher the
priority in the queue

Building a Huffman Tree:
• The queue after inserting all nodes (null pointers are not shown):

  E/1  i/1  y/1  l/1  k/1  ./1  r/2  s/2  n/2  a/2  sp/4  e/8
Building a Huffman Tree

Repeatedly dequeue the two nodes with the lowest frequency, merge them under a new parent node whose frequency is the sum of the two, and enqueue the parent. Step by step (sp = space; a merged subtree is written with the characters it contains and its combined weight):

1.  Merge E/1 and i/1 into a node of weight 2.
    Queue: y/1  l/1  k/1  ./1  (Ei)/2  r/2  s/2  n/2  a/2  sp/4  e/8
2.  Merge y/1 and l/1 into a node of weight 2.
3.  Merge k/1 and ./1 into a node of weight 2.
4.  Merge r/2 and s/2 into a node of weight 4.
5.  Merge n/2 and a/2 into a node of weight 4.
6.  Merge (Ei)/2 and (yl)/2 into a node of weight 4.
7.  Merge (k.)/2 and sp/4 into a node of weight 6.
    Queue: (rs)/4  (na)/4  (Eiyl)/4  (k. sp)/6  e/8
8.  Merge (rs)/4 and (na)/4 into a node of weight 8.
9.  Merge (Eiyl)/4 and (k. sp)/6 into a node of weight 10.
10. Merge e/8 and (rs na)/8 into a node of weight 16.
    Queue: (Eiyl k. sp)/10  (e rs na)/16

Ties (for example, the several nodes of weight 2) may be broken in any order; every choice yields a tree with the same total cost.

What is happening to the characters with a low number of occurrences? They are merged earliest, so they sink to the bottom of the tree and receive the longest code words.
Why is Huffman Coding Greedy?
At every step it makes the locally optimal choice, merging the two nodes with the lowest frequencies, and this sequence of local choices produces a globally optimal prefix code.
Building a Huffman Tree
• Merge the two remaining nodes, of weight 10 and 16, into a root node of weight 26. After enqueueing this node there is only one node left in the priority queue.
Building a Huffman Tree
• Dequeue the single node left in the queue.
• This tree contains the new code words for each character.
• The frequency of the root node should equal the number of characters in the text: “Eerie eyes seen near lake.” has 26 characters, and the root weight is 26.
Encoding the File: Traverse Tree for Codes
• Perform a traversal of the tree to obtain the new code words
• Going left is a 0; going right is a 1
• A code word is only complete when a leaf node is reached
Encoding the File: Traverse Tree for Codes

Char    Code
E       0000
i       0001
y       0010
l       0011
k       0100
.       0101
space   011
e       10
r       1100
s       1101
n       1110
a       1111
Encoding the File
• Rescan the text and encode the file using the new code words.

Eerie eyes seen near lake.

0000 10 1100 0001 10 011 10 0010 10 1101 011 1101 10
10 1110 011 1110 10 1111 1100 011 0011 1111 0100 10 0101

(One group per character; the spaces are shown only for readability. 84 bits in total.)

• Why is there no need for a separator character?
Encoding the File: Results
• Have we made things any better?
• 84 bits to encode the text
• ASCII would take 8 * 26 = 208 bits
• If a modified fixed-length code of 4 bits per character were used instead, 4 * 26 = 104 bits would be needed; the savings would not be as great.
Decoding the File
• How does the receiver know what the codes are?
• Option 1: a tree is constructed for each text file
  • considers the frequencies of that particular file
  • big hit on compression, especially for smaller files, because the tree must be transmitted along with the data
• Option 2: the tree is predetermined
  • based on statistical analysis of text files or file types
• Data transmission is bit-based rather than byte-based
Decoding the File
• Once the receiver has the tree, it scans the incoming bit stream:
  • 0 → go left
  • 1 → go right
• Bit stream to decode against the tree built above:

  101000110111101111 01111110000110101
NEW CASE
(WORD BAGGAGE)
HUFFMAN CODING TYPES:
• The construction of a code tree for Huffman coding is based on a certain probability distribution.
• This distribution comes in three variants:
  – static probability distribution
  – dynamic probability distribution
  – adaptive probability distribution
SUMMARY:
• Huffman coding is a technique used to compress files for transmission
• Uses statistical coding: more frequently used symbols have shorter code words
• Works well for text and fax transmissions
• An application that uses several data structures (binary trees and priority queues)
REFERENCES:
• https://www.slideshare.net/alitech123/huffman-coding-10173665?next_slideshow=1
• http://www2.hawaii.edu/~janst/211/lecture/Huffman%20Trees.htm
• https://www.cs.duke.edu/csed/poop/huff/info/
• http://homes.soic.indiana.edu/yye/lab/teaching/spring2014-C343/huffman.php
• https://www.youtube.com/watch?v=ZZX_KtWWPi8
• http://www.binaryessence.com/dct/en000093.htm
