
SC205 DISCRETE MATHEMATICS

PROJECT

May 2023

Professors:
Prof. Manish Gupta
Prof. Prosenjit Kundu
Prof. Manoj Raut
Team members:
Harsh Popatiya :- 202201463
Dhrudeep Sharma :- 202201150
Jaimin Prajapati :- 202201228
Parshwa Modi :- 202201165
Anuj Valambhiya :- 202201481
Sahil Pandavadara :- 202201151

Contents

1 INTRODUCTION
2 PROBLEM STATEMENT
3 SOLUTION
  3.1 BURROWS WHEELER TRANSFORM (BWT)
  3.2 RUN LENGTH ENCODING (RLE)
  3.3 MOVE TO FRONT (MTF)
  3.4 HUFFMAN CODING
4 PROJECT EXECUTION: Achievements and Implementation
  4.1 COMMERCIALIZATION ASPECT
  4.2 CONTRIBUTION
5 REFERENCES

1 INTRODUCTION

• In this rapidly evolving world, we want everything to be as efficient as possible. In this technological world, we keep innovating to make life more comfortable with efficient solutions to everything around us. As the generation and usage of data are increasing exponentially nowadays, we need efficient techniques to store and transfer data.
• Data compression reduces the space required to transmit or store data while preserving the essential information. The objective of this project is to explore various data compression algorithms, methodologies, and technologies. We aim to analyze their effectiveness, advantages, and limitations in different contexts. By removing repetitive patterns, encoding information more efficiently, or employing mathematical algorithms, compression techniques can significantly reduce the storage space needed for data.

2 PROBLEM STATEMENT

• If you wish to share a very large text file over the internet, how can you transfer the file to your friend efficiently?

3 SOLUTION

• Data compression is a necessary tool to reduce the number of bits needed to represent data. Compressing data can save storage capacity, speed up file transfer and decrease costs for storage hardware and network bandwidth.
• Large text files can be problematic to transfer due to size restrictions imposed by various platforms, and uploading and downloading the data becomes time consuming.
Basically there are two common types of compression techniques:-

∗ Lossy compression
∗ Lossless compression

1. Lossy compression
• As the name suggests, after compression we cannot retrieve the information that was discarded during compression.
• It is used when some loss of information is acceptable.
• It is used to compress images, audio and videos. For example, if a video codec keeps only every third frame, the file shrinks to roughly 1/3rd of its original size.

2. Lossless compression
• Decompression restores and rebuilds the compressed file in its original form.
• It is used when data loss is unacceptable, such as for high quality images, audio, or text files.
From these two techniques, it is evident that lossless compression is the better tool for text files. So let us have a look at the lossless data compression algorithms. And the algorithms are:-
1. BURROWS WHEELER TRANSFORM (BWT)
2. RUN LENGTH ENCODING (RLE)
3. MOVE TO FRONT (MTF)
4. HUFFMAN CODING

3.1 BURROWS WHEELER TRANSFORM (BWT)

• The Burrows-Wheeler Transform (BWT), also known as block-sorting compression, is a data transformation technique used in data compression algorithms. It is a pre-processing step in lossless compression: it restructures the data in such a way that it improves the compressibility. It is thus a useful way to increase the efficiency of text compression algorithms, costing only some extra computation.
• There are basically 3 steps followed in BWT:-

1. Rotating the initial string to get all the possible outcomes
2. Arranging the strings lexicographically
3. The last characters of the lexicographical strings are our output

• Let us understand all the 3 steps in detail:


1. Rotating the initial string to get all the possible outcomes

∗ Let us consider our input string to be "#banana#". Now rotate it 'n' number of times, where n is the number of characters in the string. The cyclic rotations are:

#banana#
banana##
anana##b
nana##ba
ana##ban
na##bana
a##banan
##banana
2. Arranging the strings lexicographically

∗ Sorting lexicographically means we sort the rotations by ASCII value, comparing from the first column to the last:

##banana
#banana#
a##banan
ana##ban
anana##b
banana##
na##bana
nana##ba
3. The last characters of the lexicographical strings are our output

∗ For our output string, we take the last character of every sorted rotation and put them in one single string to get our Burrows-Wheeler Transform, which looks like:

##banan|a|
#banana|#|
a##bana|n|
ana##ba|n|
anana##|b|
banana#|#|
na##ban|a|
nana##b|a|

Output string :- a#nnb#aa

Index (0-based) at which the original string is present :- 1
• Now that we have understood BWT for the string "#banana#", let us consider a longer string and apply BWT to it:
"Susie works in a shoeshine shop. Where she shines she sits, and where she sits she shines"

Burrows Wheeler Transform algorithm code :-

#include <bits/stdc++.h>
using namespace std;
typedef long long ll;

class BWT
{
private:
    // a rotation stores the starting index and the rotated string
    struct rotation
    {
        ll index;
        string suffix;
    };
    // merge step of merge sort, comparing rotations lexicographically
    void merge(rotation arr[], ll left, ll mid, ll right)
    {
        ll n1 = mid - left + 1;
        ll n2 = right - mid;
        rotation a1[n1], a2[n2];
        // copy data to a1 and a2
        for (ll i = 0; i < n1; i++)
            a1[i] = arr[left + i];
        for (ll i = 0; i < n2; i++)
            a2[i] = arr[mid + 1 + i];
        ll indexof1 = 0, indexof2 = 0, indexofmerged = left;
        while (indexof1 < n1 && indexof2 < n2)
        {
            if (a1[indexof1].suffix.compare(a2[indexof2].suffix) <= 0)
                arr[indexofmerged++] = a1[indexof1++];
            else
                arr[indexofmerged++] = a2[indexof2++];
        }
        // now copy the remaining elements
        while (indexof1 < n1)
            arr[indexofmerged++] = a1[indexof1++];
        while (indexof2 < n2)
            arr[indexofmerged++] = a2[indexof2++];
    }
    void mergesort(rotation arr[], ll begin, ll end)
    {
        if (begin >= end)
            return;
        ll mid = begin + (end - begin) / 2;
        mergesort(arr, begin, mid);
        mergesort(arr, mid + 1, end);
        merge(arr, begin, mid, end);
    }

public:
    // returns {sorted row of the original string, last column}
    pair<ll, string> encode(string &input_str, ll n)
    {
        // an array of structures to store all the rotations
        rotation arr[n];
        string s = input_str;
        for (ll i = 0; i < n; i++)
        {
            arr[i].index = i;
            arr[i].suffix = s.substr(i, n - i) + s.substr(0, i);
        }
        mergesort(arr, 0, n - 1);

        string last_column = "";
        for (ll i = 0; i < n; i++)
            last_column += arr[i].suffix[n - 1];

        // locate the row holding the unrotated string
        ll index = -1;
        for (ll i = 0; i < n; i++)
        {
            if (arr[i].index == 0)
            {
                index = i;
                break;
            }
        }
        return {index, last_column};
    }
    // inverts the transform via the first-column / last-column correspondence
    string decode(ll index, string &lastcolumn, ll n)
    {
        // the first column is simply the sorted last column
        string firstcolumn = lastcolumn;
        sort(firstcolumn.begin(), firstcolumn.end());
        string ans = "";
        for (ll i = 0; i < n; i++)
        {
            // count how many copies of c precede position `index`
            // in the first column
            ll count = 0;
            char c = firstcolumn[index];
            for (ll j = 0; j < n; j++)
            {
                if (firstcolumn[j] == c && j != index)
                    count++;
                else if (firstcolumn[j] == c && j == index)
                    break;
            }
            // jump to the matching copy of c in the last column
            for (ll k = 0; k < n; k++)
            {
                if (lastcolumn[k] == c && count != 0)
                    count--;
                else if (lastcolumn[k] == c && count == 0)
                {
                    index = k;
                    ans += c;
                    break;
                }
            }
        }
        return ans;
    }
};

3.2 RUN LENGTH ENCODING (RLE)

• Run-Length Encoding (RLE) is a simple data compression technique that reduces the size of data by storing a count for characters that occur in sequence.
• In RLE, instead of explicitly repeating the same value multiple times, we count the number of consecutive occurrences of that character and represent the run as a pair of the count and the character. For example, the sequence "AAABBBBCCCCC" is represented in RLE as "3A4B5C".
• As per our above example, we have a sequence of :-
”.n,seseseeaeeedesps nrrhhnhhihhnnossssWwssssss
hhhssriaiiihhwoeeoktete e u iiS”,

• In RLE, we can represent it as :-


”.n,seses2ea3edesps3 n2r2hn2hi2h2no4sWw6s
3h2sria3i2hwo2eoktete4 e4 u2 2iS”.

Figure 1: RLE conversion

• By grouping repeated characters together and storing
the count, RLE can effectively reduce the size of the
data. It is commonly used in various applications,
such as image and video compression, where there are
often long sequences of repeated values or patterns.

RLE algorithm code :-

#include <bits/stdc++.h>
using namespace std;
typedef long long ll;

class RLE
{
public:
    string encode(string &s)
    {
        ll n = s.length();
        string ans = "";
        for (ll i = 0; i < n; i++)
        {
            // measure the length of the current run
            ll count = 1;
            while (i < n - 1 && s[i] == s[i + 1])
            {
                count++;
                i++;
            }
            // runs of length one are written without a count
            if (count != 1)
                ans += to_string(count);
            ans += s[i];
        }
        return ans;
    }
    string decode(string &s)
    {
        string ans = "";
        ll n = s.length();
        for (ll i = 0; i < n; i++)
        {
            if (s[i] < '0' || s[i] > '9')
            {
                // not a digit: a run of length one
                ans += s[i];
            }
            else
            {
                // read the full (possibly multi-digit) count,
                // then repeat the character that follows it
                ll count = 0;
                while (i < n && s[i] >= '0' && s[i] <= '9')
                {
                    count = count * 10 + (s[i] - '0');
                    i++;
                }
                for (ll j = 0; j < count; j++)
                    ans += s[i];
            }
        }
        return ans;
    }
};

3.3 MOVE TO FRONT (MTF)

• The MTF algorithm is a data encoding technique that rearranges the order of elements in a list based on their access frequency. It is commonly used in data compression and data transmission applications.
• The main idea behind the MTF algorithm is that elements accessed more frequently are more likely to be accessed again in the near future. By moving these elements to the front, we optimize access time and improve efficiency.
• Let's understand it with a simple example, "banana".

3.4 HUFFMAN CODING

• Huffman coding is a lossless data compression technique. The idea is to assign codes to input characters; the lengths of the assigned codes are based on the frequencies of the corresponding characters.

• The codes assigned to input characters are Prefix Codes, meaning the codes are assigned in such a way that the code of one character is never a prefix (an initial part) of the code of any other character. This is how Huffman coding makes sure that no ambiguity occurs when we try to decode the data.
• Let's understand the Huffman coding technique with one simple example:-
"aaaaabbbbbbbbbccccccccccccdddddddddddddee
eeeeeeeeeeeeeefffffffffffffffffffffffffffffffffffffffffffff"
• First we need to count the frequencies of all characters of the input, which gives a:5, b:9, c:12, d:13, e:16 and f:45.
• Then we will make the Huffman tree. For each character we create a leaf node; we then repeatedly select the two nodes with the least frequencies and merge them into a new node whose frequency is their sum, which takes the place of the two. We repeat the process for the rest of the nodes till just one node, the root, is left. It makes the following tree:-

∗ Huffman Tree:-

• Now to assign the codes we traverse the tree starting from the root, maintaining an auxiliary array. While moving to the left child, we write 0 to the array; while moving to the right child, we write 1. When we reach the leaf node of a particular character, the code built up in the array while traversing is assigned to that character, and we get the code table:-

∗ Applications of Huffman Coding :-

1. They are used for transmitting fax and text.
2. They are used by conventional compression formats like PKZIP, GZIP, etc.
3. Multimedia codecs like JPEG, PNG, and MP3 use Huffman encoding (to be more precise, prefix codes).

4 PROJECT EXECUTION: Achievements and
Implementation

• Till now, we have just been discussing algorithms and how we can compress data efficiently, but now let us have a look at how the project was implemented.
• At first we were exploring different ideas on how to make a good project as well as to learn something new, and we were really excited. We did the obvious thing, which was to ask "Google Baba", and then we started collecting different topics and documents regarding discrete maths. We did a lot of research, and everyone got excited about one idea, which was

Data Compression

• So we started researching different algorithms to compress data, and we found that Run Length Encoding, Burrows Wheeler Transform, Move To Front and Huffman Coding were the best techniques to compress data efficiently. There are plenty of algorithms to compress data, but these were the best methods for us, as they compressed the data by more than 60% of its original size.
• These techniques do have drawbacks, chiefly that compression takes a lot of time, but the most important factor for us was that they always compress by more than 60%.

• Then all of us started the research and started
discussing different ideas for the project. And after a
lot of discussion and arguments, we decided to first
implement Run Length Encoding, then Burrows
Wheeler Transform, then Move to Front and at last
Huffman coding.
• We all went back to research to find if there were any small things we were missing, and we found a very big drawback: "NUMBERS". In Run Length Encoding (RLE), the output contains the count of occurrences of a particular character, such as

33A42B5C7D

but what if our data itself contains, say, 20 '2's? How can we distinguish the run count from the digits in the data?
• We then came up with an idea to overcome this drawback, which was to write the digit 1 as "one", 2 as "two" and so on, and then continue with our compression as we did earlier.
• Everyone was doing their tasks enthusiastically and we
were on the right path completing RLE and BWT
algorithms but then came the father of all problems.
• In Huffman coding, the output is in binary, and we did a lot of research on how to write that output into a binary file. We asked many of our seniors, but made no progress. We invested 4 days on just this one problem and were really exhausted working on it. We even asked Professor Naresh Jotwani, Professor Manish Gupta and Professor Prosenjit Kundu, and we got the idea that we could pack those bits in bundles of eight and then store each bundle as a character in a file. But our method remained effectively lossy, as we were not able to apply this and make it a fully lossless compression.
• It was funny that we, who don't pay much attention to KBs today, were really worried about bits for those 4-5 days. It was very clear that the file could be compressed by more than 60% every time, so we decided to move on from Huffman coding and focus on RLE and BWT for the project.
• The next drawback we faced was that compression took very long. After some study, we found that the lexicographical sorting took most of the time in BWT. At first we were using bubble sort, but afterwards we switched to merge sort to sort efficiently. And guess what, it worked.
• A file of 25 KB would take approximately 10 minutes to compress with bubble sort, but with merge sort it took just 10 seconds. This was a huge victory for us.
• As you can see in figure 2, we uploaded a file of 50 KB and got a compressed file of 8.8 KB, which gives us a monstrous compression ratio of 82.37%.
• Interestingly, we named our compressed file as .mkg in
honour of our professor Manish K. Gupta.
Figure 2: Compression Ratio

• After this we decided to work on a working software model, and we made it in Python. The app was made using the tkinter module of Python. It has 2 buttons: the first one is to select the file from our PC which is to be compressed, and the second one is to extract the file. In this way, we were successful in creating a working software model of our idea.
• For simplicity, we have also made a readme.md with a guide on how to use our app, so you can use it without issues.
• Let us look at how we wish to commercialize it.

4.1 COMMERCIALIZATION ASPECT:
• The commercialization aspect is how we would like to launch our product, here an app, in the market. There are many data compression applications available on the internet, but the main thing about our app is that it compresses files by more than 60%.
• As the working model of our software is ready, we now wish to minimise the compression time and increase the compression ratio. To reduce the time, our main idea is to divide the file into small parts and then evaluate each part simultaneously, increasing productivity and efficiency. Through this method, the compression would become even faster than usual.
• We would also like to make separate applications for Windows, Linux and Mac users, which will help our product gain more reach.
• Who knows, even Google may get impressed by our product and hire us for an internship, so we guess there are many possibilities and benefits once we commercialize our product.
• Let us have a quick peek at the contribution of each group member:

4.2 CONTRIBUTION:
Harsh Popatiya (202201463):-
Major work in code implementation and research on various topics for the project.

Dhrudeep Sharma (202201150):-
Major work in LaTeX document preparation, YouTube video recording and research on different algorithms.

Jaimin Prajapati (202201228):-
Major work in LaTeX document preparation, explanatory photos and research on the different algorithms selected.

Parshwa Modi (202201165):-
Major work in PPT preparation and research on the different algorithms selected.

Anuj Valambhiya (202201481):-
Major work in code implementation, app development and research on different topics.

Sahil Pandavadara (202201151):-
Major work in video editing, PPT preparation, website development, explanatory photo making and research on different topics.

5 REFERENCES

[1] GeeksforGeeks. Burrows-Wheeler data transform algorithm. https://www.geeksforgeeks.org/burrows-wheeler-data-transform-algorithm/
[2] GeeksforGeeks. Huffman coding (greedy algorithm). https://www.geeksforgeeks.org/huffman-coding-greedy-algo-3/
[3] GeeksforGeeks. Move to front (MTF) data transform algorithm. https://www.geeksforgeeks.org/move-front-data-transform-algorithm/
[4] GeeksforGeeks. Run-length encoding. https://www.geeksforgeeks.org/run-length-encoding/
[5] Wikipedia. Data compression. https://en.wikipedia.org/wiki/Data_compression
[6] Wikipedia. Huffman coding. https://en.wikipedia.org/wiki/Huffman_coding
[7] Wikipedia. Move-to-front transform. https://en.wikipedia.org/wiki/Move-to-front_transform
[8] Wikipedia. Run-length encoding. https://en.wikipedia.org/wiki/Run-length_encoding

