
HUFFMAN CODING

Project Report
Submitted By
URVASHI GUPTA, 04220902717
In partial fulfillment of the requirements for
the award of the degree
BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE ENGINEERING

DEPARTMENT OF COMPUTER SCIENCE ENGINEERING


GOVIND BALLABH PANT GOVT. ENGINEERING COLLEGE,
NEW DELHI-110020
TABLE OF CONTENTS

Certificate
Acknowledgement
About the Organization
Abstract
1. Introduction
2. Compression
2.1 The Basic Idea
2.2 Building the Huffman Tree
2.3 An Example
3. Decompression
3.1 Storing the Huffman Tree
3.2 Creating the Huffman Table
3.3 Storing Sizes
4. Program Code and Output
5. Conclusion and Future Works
ACKNOWLEDGEMENT
ABOUT THE ORGANIZATION
ABSTRACT

Data reduction is one of the data pre-processing techniques that
can be applied to obtain a reduced representation of a data set that
is much smaller in volume, yet closely maintains the integrity of the
original data. That is, mining on the reduced data set should be more
efficient yet produce the same analytical results. Data compression is
one such technique: encoding mechanisms are used to reduce the data set
size. In data compression, data encodings or transformations are
applied so as to obtain a reduced or compressed representation of
the original data. Huffman coding is a successful compression method
originally used for text compression. Huffman's idea is, instead of
using a fixed-length code such as 8-bit extended ASCII or EBCDIC for
each symbol, to represent a frequently occurring character in a source
with a shorter codeword and a less frequently occurring one with a
longer codeword.
Chapter 1
INTRODUCTION

The project aims at developing programs that transform a string of
characters in some representation into a new string (of bits) that
contains the same information.
Compression refers to reducing the quantity of data used to
represent a file, image, or video without excessively reducing
the quality of the original data. It also reduces the number of bits
required to store and/or transmit digital media. To compress
something means to take a piece of data and decrease its
size. There are different techniques for doing this, and they all have
their own advantages and disadvantages. One trick is to remove
redundant information, for example storing a repeated piece of data
once instead of six times. Another is to identify which parts of the
data are not really important and simply leave those out.
Huffman Coding, also called Huffman Encoding, is a famous greedy
algorithm used for the lossless compression of data. It uses
variable-length encoding, where variable-length codes are assigned to
the characters depending on how frequently they occur in the
given text. The character that occurs most frequently gets the
smallest code, and the character that occurs least frequently gets
the largest code. To prevent ambiguities while decoding, Huffman
coding implements a rule known as the prefix rule, which ensures that
the code assigned to any character is not a prefix of the code
assigned to any other character.
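
As a small illustration of the prefix rule, the C++ sketch below decodes a bit string greedily, left to right. The codewords here are invented for this example (they are not the project's actual codes); because no codeword is a prefix of another, the scan can never misread a boundary.

#include <iostream>
#include <map>
#include <string>

int main() {
    // A prefix-free code: no codeword is a prefix of any other.
    std::map<std::string, char> code = {{"0", 'a'}, {"10", 'b'}, {"11", 'c'}};

    std::string bits = "010110";   // 'a' 'b' 'c' 'a' under the table above
    std::string buffer, decoded;
    for (char bit : bits) {
        buffer += bit;
        auto it = code.find(buffer);
        if (it != code.end()) {    // a complete codeword: emit it and reset
            decoded += it->second;
            buffer.clear();
        }
    }
    std::cout << decoded << '\n';  // prints "abca" -- the only possible reading
    return 0;
}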
Chapter 2
COMPRESSION

2.1 The Basic Idea


The basic idea behind Huffman Coding is to use fewer bits for more
frequently occurring characters. We will see how this is done using a
tree that stores characters at the leaves, and whose root-to-leaf
paths provide the bit sequences used to encode the characters.
We'll assume that each character has an associated weight equal to
the number of times the character occurs in a file, for example. In
the "go go gophers" example, the characters 'g' and 'o' have weight
3, the space has weight 2, and the other characters have weight 1.
When compressing a file we'll need to calculate these weights; we'll
ignore this step for now and assume that all character weights have
been calculated.
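
For reference, a minimal sketch of that weight-calculation step, assuming for simplicity that the input is a std::string rather than a file being compressed:

#include <iostream>
#include <map>
#include <string>

int main() {
    std::string text = "go go gophers";
    std::map<char, int> weight;           // character -> number of occurrences
    for (char c : text)
        ++weight[c];                      // operator[] default-initializes to 0
    for (const auto& [c, w] : weight)
        std::cout << "'" << c << "' has weight " << w << '\n';
    return 0;
}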

2.2 Building the Huffman Tree


Huffman's algorithm assumes that we're building a single tree from a
group (or forest) of trees. Initially, all the trees have a single node
containing a character and the character's weight. Trees are combined by
picking two trees and making a new tree from the two. This
decreases the number of trees by one at each step, since two trees
are combined into one tree. The algorithm is as follows (a code sketch
of the loop appears after the list):

1. Begin with a forest of trees. All trees are one node, with the
weight of the tree equal to the weight of the character in the
node. Characters that occur most frequently have the highest
weights. Characters that occur least frequently have the
smallest weights.
2. Repeat this step until there is only one tree: choose the two
trees with the smallest weights, call these trees T1 and T2, and
create a new tree whose root has a weight equal to the sum of the
weights T1 + T2, whose left subtree is T1, and whose right subtree
is T2.

3. The single tree left after the previous step is an optimal
encoding tree.
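
One natural way to implement this loop is with a min-priority queue keyed on weight. The sketch below is our own illustration (the names Node and buildTree are assumptions, not the report's program code) and follows the three steps above:

#include <iostream>
#include <queue>
#include <utility>
#include <vector>

// One tree in the forest: leaves hold a character, parents hold only a weight.
struct Node {
    int weight;
    char ch;                        // meaningful only for leaf nodes
    Node* left = nullptr;
    Node* right = nullptr;
};

// Comparator so the priority queue hands back the smallest weight first.
struct ByWeight {
    bool operator()(const Node* a, const Node* b) const {
        return a->weight > b->weight;
    }
};

Node* buildTree(const std::vector<std::pair<char, int>>& weights) {
    std::priority_queue<Node*, std::vector<Node*>, ByWeight> forest;
    for (const auto& [c, w] : weights)          // step 1: one-node trees
        forest.push(new Node{w, c});
    while (forest.size() > 1) {                 // step 2: until one tree remains
        Node* t1 = forest.top(); forest.pop();  // the two smallest weights
        Node* t2 = forest.top(); forest.pop();
        forest.push(new Node{t1->weight + t2->weight, '\0', t1, t2});
    }
    return forest.top();                        // step 3: the encoding tree
}

int main() {
    Node* root = buildTree({{'g', 3}, {'o', 3}, {' ', 2}, {'p', 1},
                            {'h', 1}, {'e', 1}, {'r', 1}, {'s', 1}});
    std::cout << root->weight << '\n';          // 13, the length of "go go gophers"
    return 0;
}

The priority queue makes each combine step logarithmic in the number of trees, which is why implementations usually prefer it to rescanning an array for the two minima.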

2.3 An Example
We'll use the string "go go gophers" as an example. Initially we have
the forest shown below. The nodes are shown with a weight/count
that represents the number of times the node's character occurs.

We pick two minimal nodes. There are five nodes with the minimal
weight of one; it doesn't matter which two we pick. In a program, the
deterministic aspects of the implementation will dictate which two are
chosen, e.g., the first two in an array, or the elements returned by a
priority queue implementation. We create a new tree whose root is
weighted by the sum of the two weights chosen. We now have a forest of
seven trees as shown here:

Choosing two minimal trees yields another tree with weight two as
shown below. There are now six trees in the forest of trees that will
eventually build an encoding tree.
Again we must choose the two trees of minimal weight. The lowest
weight is the 'e'-node/tree, with weight equal to one. There are three
trees with weight two; we can choose any of these to create a new
tree whose weight will be three.

Now there are two trees with weight equal to two. These are joined
into a new tree whose weight is four. There are four trees left: one
whose weight is four and three with a weight of three.

Two minimal (weight-three) trees are joined into a tree whose
weight is six. In the diagram below we choose the 'g' and 'o' trees
(we could have chosen the 'g' tree and the space-'e' tree, or the 'o'
tree and the space-'e' tree). There are three trees left.
The minimal trees have weights of three and four; these are joined
into a tree whose weight is seven, leaving two trees.

Finally, the last two trees are joined into a final tree whose weight is
thirteen, the sum of the two weights six and seven. Note that this
tree is different from the tree we used to illustrate Huffman
coding earlier, and the bit patterns for each character are different,
but the total number of bits used to encode "go go gophers" is the
same.
The character encoding induced by the last tree is shown below,
where again 0 is used for left edges and 1 for right edges.

char binary

'g' 00
'o' 01
'p' 1110
'h' 1101
'e' 101
'r' 1111
's' 1100
' ' 100

The string "go go gophers" would be encoded as shown (with


spaces used for easier reading, the spaces wouldn't appear in
the real encoding).
00 01 100 00 01 100 00 01 1110 1101 101 1111 1100
Once again, 37 bits are used to encode "go go gophers". There
are several trees that yield an optimal 37-bit encoding of "go
go gophers". The tree that actually results from a programmed
implementation of Huffman's algorithm will be the same each
time the program is run for the same weights (assuming no
randomness is used in creating the tree).
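
As a quick check, the sketch below encodes the string with the table above (hard-coded here; a real run would take it from the tree) and confirms the 37-bit total:

#include <iostream>
#include <map>
#include <string>

int main() {
    // The code table induced by the final tree, copied from above.
    std::map<char, std::string> table = {
        {'g', "00"},  {'o', "01"},   {'p', "1110"}, {'h', "1101"},
        {'e', "101"}, {'r', "1111"}, {'s', "1100"}, {' ', "100"}};

    std::string bits;
    for (char c : std::string("go go gophers"))
        bits += table[c];                         // concatenate the codewords
    std::cout << bits.size() << " bits\n";        // prints "37 bits"
    return 0;
}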
Chapter 3
DECOMPRESSION

The process of decompression is simply a matter of translating the
stream of prefix codes into individual byte values, usually by traversing
the Huffman tree node by node as each bit is read from the input
stream (reaching a leaf node necessarily terminates the search for
that particular byte value). Before this can take place, however,
the Huffman tree must somehow be reconstructed.
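
A sketch of that traversal, assuming a Node struct shaped like the one in the Chapter 2 sketch (the weight field is omitted here since it is only needed while building the tree):

#include <iostream>
#include <string>

struct Node {
    char ch;
    Node* left = nullptr;
    Node* right = nullptr;
    bool isLeaf() const { return !left && !right; }
};

// Walk the tree bit by bit: 0 goes left, 1 goes right; reaching a leaf
// terminates the search for that byte, emits it, and restarts at the root.
std::string decode(const Node* root, const std::string& bits) {
    std::string out;
    const Node* cur = root;
    for (char b : bits) {
        cur = (b == '0') ? cur->left : cur->right;
        if (cur->isLeaf()) {
            out += cur->ch;
            cur = root;
        }
    }
    return out;
}

int main() {
    // A tiny hand-built tree: 'a' = 0, 'b' = 10, 'c' = 11.
    Node a{'a'}, b{'b'}, c{'c'};
    Node inner{'\0', &b, &c};
    Node root{'\0', &a, &inner};
    std::cout << decode(&root, "010110") << '\n';   // prints "abca"
    return 0;
}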

3.1 Storing the Huffman Tree


 In the simplest case, where character frequencies are fairly
predictable, the tree can be preconstructed (and even
statistically adjusted on each compression cycle) and thus
reused every time, at the expense of at least some measure of
compression efficiency.
 Otherwise, the information needed to reconstruct the tree must be
sent.
 A naive approach might be to prepend the frequency count of
each character to the compression stream. Unfortunately, the
overhead in such a case could amount to several kilobytes, so
this method has little practical use.
 Another method is to simply prepend the Huffman tree, bit by
bit, to the output stream. For example, assuming that the value
0 represents a parent node and 1 a leaf node, whenever the
latter is encountered the tree-building routine simply reads the
next 8 bits to determine the character value of that particular
leaf. The process continues recursively until the last leaf node is
reached; at that point, the Huffman tree will thus be faithfully
reconstructed. The overhead using such a method ranges
from roughly 2 to 320 bytes (assuming an 8-bit alphabet). A
sketch of this scheme appears after this list.
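
Here is that bit-by-bit scheme sketched with one character per bit for readability (a real implementation would pack the bits; the function names are ours, not the report's):

#include <bitset>
#include <cstddef>
#include <iostream>
#include <string>

struct Node {
    char ch;
    Node* left = nullptr;
    Node* right = nullptr;
    bool isLeaf() const { return !left && !right; }
};

// Preorder walk: '0' announces a parent (both subtrees follow), '1' announces
// a leaf and is followed by 8 bits giving that leaf's character value.
void writeTree(const Node* n, std::string& out) {
    if (n->isLeaf()) {
        out += '1';
        out += std::bitset<8>(static_cast<unsigned char>(n->ch)).to_string();
    } else {
        out += '0';
        writeTree(n->left, out);
        writeTree(n->right, out);
    }
}

// Mirror image of writeTree: consume the same stream and rebuild the tree.
Node* readTree(const std::string& in, std::size_t& pos) {
    if (in[pos++] == '1') {
        char c = static_cast<char>(std::bitset<8>(in.substr(pos, 8)).to_ulong());
        pos += 8;
        return new Node{c};
    }
    Node* left = readTree(in, pos);
    Node* right = readTree(in, pos);
    return new Node{'\0', left, right};
}

int main() {
    Node g{'g'}, o{'o'};
    Node root{'\0', &g, &o};
    std::string bits;
    writeTree(&root, bits);
    std::cout << bits << '\n';                    // 0 1 01100111 1 01101111
    std::size_t pos = 0;
    Node* copy = readTree(bits, pos);
    std::cout << copy->left->ch << copy->right->ch << '\n';   // prints "go"
    return 0;
}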
Many other techniques are possible as well. In any case, since the
compressed data can include unused "trailing bits", the decompressor
must be able to determine when to stop producing output. This can be
accomplished either by transmitting the length of
the decompressed data along with the compression model or by
defining a special code symbol to signify the end of input (the latter
method can adversely affect code-length optimality, however).

3.2 Creating the Huffman Table


To create a table or map of coded bit values for each character, you'll
need to traverse the Huffman tree (e.g., inorder, preorder, etc.),
making an entry in the table each time you reach a leaf. For example,
if you reach a leaf that stores the character 'C' by following the path
left-left-right-right-left, then the entry at the 'C'-th location of the
map should be set to 00110. You'll need to make a decision about how to
store the bit patterns in the map. At least two methods are possible
for implementing what could be a class/struct BitPattern:

 Use a string. This makes it easy to add a character (using +) to a
string during tree traversal and makes it possible to use a string as
the BitPattern. Your program may be slow, because appending
characters to a string (in creating the bit pattern) and accessing
characters in a string (in writing 0's or 1's when compressing) are
slower than the next approach.
 Alternatively, you can store an integer for the bitwise coding of
a character. You need to store the length of the code too, to
differentiate between 01001 and 00101. However, using an int
restricts root-to-leaf paths to be at most 32 edges long, since an
int holds 32 bits. In a pathological file, a Huffman tree could
have a root-to-leaf path of over 100 edges. Because of this problem,
you should use strings to store paths rather than ints: a slow,
correct program is better than a fast, incorrect program. A sketch
of the string-based approach follows this list.
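
Following the string-based choice, here is a sketch of the table-building traversal ('0' is appended for a left edge and '1' for a right edge; the Node struct is the same assumed shape as before):

#include <iostream>
#include <map>
#include <string>

struct Node {
    char ch;
    Node* left = nullptr;
    Node* right = nullptr;
    bool isLeaf() const { return !left && !right; }
};

// Traverse the tree, growing the path string on the way down and
// recording it in the table whenever a leaf is reached.
void buildTable(const Node* n, const std::string& path,
                std::map<char, std::string>& table) {
    if (n->isLeaf()) {
        table[n->ch] = path;
        return;
    }
    buildTable(n->left, path + '0', table);
    buildTable(n->right, path + '1', table);
}

int main() {
    // The same tiny tree as before: 'a' = 0, 'b' = 10, 'c' = 11.
    Node a{'a'}, b{'b'}, c{'c'};
    Node inner{'\0', &b, &c};
    Node root{'\0', &a, &inner};
    std::map<char, std::string> table;
    buildTable(&root, "", table);
    for (const auto& [ch, code] : table)
        std::cout << "'" << ch << "' -> " << code << '\n';
    return 0;
}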
3.3 Storing Sizes
The operating system will buffer output; i.e., output to disk actually
occurs only when some internal buffer is full. In particular, it is not
possible to write just one single bit to a file; all output is actually
done in "chunks", e.g., it might be done in eight-bit chunks. In any
case, when you write 3 bits, then 2 bits, then 10 bits, all the bits are
eventually written, but you cannot be sure precisely when they're
written during the execution of your program. Also, because of
buffering, if all output is done in eight-bit chunks and your
program writes exactly 61 bits explicitly, then 3 extra bits will be
written so that the number of bits written is a multiple of eight.
Because of the potential for these "extra bits" when
reading one bit at a time, you cannot simply read bits until
there are no more left, since your program might then read the extra
bits written due to buffering.
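
One common remedy, sketched below as an assumption rather than the project's actual file format, is to write the exact bit count as a header so the reader knows where the meaningful bits end:

#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Pack a string of '0'/'1' characters into whole bytes, prefixed by the exact
// bit count, so a reader can stop before consuming the padding in the last byte.
void writeBits(const std::string& bits, const std::string& path) {
    std::ofstream out(path, std::ios::binary);
    std::uint32_t n = static_cast<std::uint32_t>(bits.size());
    out.write(reinterpret_cast<const char*>(&n), sizeof n);   // header: bit count
    std::vector<unsigned char> bytes((bits.size() + 7) / 8, 0);
    for (std::uint32_t i = 0; i < n; ++i)
        if (bits[i] == '1')
            bytes[i / 8] |= 1u << (7 - i % 8);                // high bit first
    out.write(reinterpret_cast<const char*>(bytes.data()), bytes.size());
}

int main() {
    writeBits("0001110", "demo.bin");  // 7 bits: 4-byte header + 1 padded byte
    return 0;
}

A matching reader would load the 4-byte count first and stop after exactly that many bits, never consuming the padding.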
Chapter 4
PROGRAM CODE AND OUTPUT
