DM 1
PROJECT
May 2023
Professors:
Prof. Manish Gupta
Prof. Prosenjit Kundu
Prof. Manoj Raut
Team members:
Harsh Popatiya :- 202201463
Dhrudeep Sharma :- 202201150
Jaimin Prajapati :- 202201228
Parshwa Modi :- 202201165
Anuj Valambhiya :- 202201481
Sahil Pandavadara :- 202201151
Page 1 of 31
Contents
1 INTRODUCTION
2 PROBLEM STATEMENT
3 SOLUTION
3.1 BURROWS WHEELER TRANSFORM (BWT)
3.2 RUN LENGTH ENCODING (RLE)
3.3 MOVE TO FRONT (MTF)
3.4 HUFFMAN CODING
4 PROJECT EXECUTION: Achievements and Implementation
5 REFERENCES
1 INTRODUCTION
2 PROBLEM STATEMENT
• If you wish to share a very large text file over the internet, how can you efficiently transfer the file to your friend?
3 SOLUTION
There are two kinds of compression techniques:
∗ Lossy compression
∗ Lossless compression
1. Lossy compression
• As the name suggests, after compression we can't retrieve the information that was lost during compression.
• It is used when some loss from compression is acceptable.
• It is used to compress images, audio and videos. In a video, for example, removing every third frame cuts the number of frames by a third and shrinks the file accordingly.
2. Lossless compression
• It restores and rebuilds the compressed file in its original form.
• It is used when data loss is unacceptable, such as in some high quality images, audio, or text files.
So from the two techniques, it is evident that lossless compression is the better tool for text files. So let us have a look at the lossless data compression algorithms. And the algorithms are:-
1. BURROWS WHEELER TRANSFORM (BWT)
2. RUN LENGTH ENCODING (RLE)
3. MOVE TO FRONT (MTF)
4. HUFFMAN CODING
3.1 BURROWS WHEELER TRANSFORM (BWT)
"#banana#"
Page 6 of 31
#banana#
b a banana##
# n anana##b
cyclic rotations ------> nana##ba
# a ana##ban
a n na##bana
a##banan
##banana
2. Arranging the strings lexicographically
∗ Sorting lexicographically means we sort the strings by ASCII value, comparing from the first column to the last, as:-
#banana#                  ##banana
banana##                  #banana#
anana##b                  a##banan
nana##ba  ----------->    ana##ban
ana##ban                  anana##b
na##bana                  banana##
a##banan                  na##bana
##banana                  nana##ba
3. The last characters of the sorted strings form our output
∗ For the output string, we take the last character of every sorted rotation and join them into a single string to get our Burrows Wheeler Transform, which looks like:-
#banana#                  ##banan|a|
banana##                  #banana|#|
anana##b                  a##bana|n|
nana##ba  ----------->    ana##ba|n|
ana##ban                  anana##|b|
na##bana                  banana#|#|
a##banan                  na##ban|a|
##banana                  nana##b|a|
∗ So the Burrows Wheeler Transform of "#banana#" is "a#nnb#aa".
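The steps above can be checked with a short standalone sketch (this uses std::sort for brevity rather than the merge sort of our implementation below):

```cpp
#include <algorithm>
#include <string>
#include <vector>

// BWT of s: generate all cyclic rotations, sort them lexicographically,
// and take the last character of each sorted rotation
std::string bwt_last_column(const std::string &s) {
    int n = s.size();
    std::vector<std::string> rot(n);
    for (int i = 0; i < n; i++)
        rot[i] = s.substr(i) + s.substr(0, i);  // rotation starting at i
    std::sort(rot.begin(), rot.end());          // lexicographic order
    std::string last;
    for (const std::string &r : rot)
        last += r.back();                       // last column
    return last;
}
```

For "#banana#" this returns "a#nnb#aa", matching the table above.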
Burrows Wheeler Transform algorithm code :-
// assumes: typedef long long ll; and using namespace std;
class BWT
{
    // private members
private:
    // structure that stores the starting index and the rotated (suffix) string
    struct rotation
    {
        ll index;
        string suffix;
    };
    void merge(rotation arr[], ll left, ll mid, ll right)
    {
        ll n1 = mid - left + 1;
        ll n2 = right - mid;
        vector<rotation> a1(n1), a2(n2);
        // copy data to a1 and a2
        for (ll i = 0; i < n1; i++)
            a1[i] = arr[left + i];
        for (ll i = 0; i < n2; i++)
            a2[i] = arr[mid + 1 + i];
        ll indexof1 = 0, indexof2 = 0, indexofmerged = left;
        while (indexof1 < n1 && indexof2 < n2)
        {
            if (a1[indexof1].suffix.compare(a2[indexof2].suffix) <= 0)
                arr[indexofmerged++] = a1[indexof1++];
            else
                arr[indexofmerged++] = a2[indexof2++];
        }
        // now copying the remaining elements of either half
        while (indexof1 < n1)
            arr[indexofmerged++] = a1[indexof1++];
        while (indexof2 < n2)
            arr[indexofmerged++] = a2[indexof2++];
    }
    void mergesort(rotation arr[], ll left, ll right)
    {
        if (left >= right)
            return;
        ll mid = left + (right - left) / 2;
        mergesort(arr, left, mid);
        mergesort(arr, mid + 1, right);
        merge(arr, left, mid, right);
    }

public:
    // returns {row of the original string in the sorted table, last column}
    pair<ll, string> encode(string &input_str, ll n)
    {
        // array of structures to store all the rotations
        vector<rotation> arr(n);
        string s = input_str;
        for (ll i = 0; i < n; i++)
        {
            arr[i].index = i;
            arr[i].suffix = s.substr(i, n - i) + s.substr(0, i);
        }
        mergesort(arr.data(), 0, n - 1);
        // the last characters of the sorted rotations form the output
        string last_column = "";
        for (ll i = 0; i < n; i++)
            last_column += arr[i].suffix[n - 1];
        // remember which sorted row holds the original string
        ll index = -1;
        for (ll i = 0; i < n; i++)
        {
            if (arr[i].index == 0)
            {
                index = i;
                break;
            }
        }
        return {index, last_column};
    }
    string decode(ll index, string &lastcolumn, ll n)
    {
        // the first column is the last column sorted
        string firstcolumn = lastcolumn;
        sort(firstcolumn.begin(), firstcolumn.end());
        ll count = 0;
        char c;
        string ans = "";
        for (ll i = 0; i < n; i++)
        {
            // rank of the current character within the first column
            count = 0;
            c = firstcolumn[index];
            for (ll j = 0; j < n; j++)
            {
                if (firstcolumn[j] == c && j != index)
                    count++;
                else if (firstcolumn[j] == c && j == index)
                    break;
            }
            // find the occurrence with the same rank in the last column
            for (ll k = 0; k < n; k++)
            {
                if (lastcolumn[k] == c && count != 0)
                    count--;
                else if (lastcolumn[k] == c && count == 0)
                {
                    index = k;
                    ans += c;
                    break;
                }
            }
        }
        return ans;
    }
};
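As an independent sanity check on the decoder, here is the classic naive inversion (repeatedly prepend the last column and re-sort); this is a separate sketch, not the method used in the class above:

```cpp
#include <algorithm>
#include <string>
#include <vector>

// naive inverse BWT: after n rounds of prepending the last column and
// sorting, the table holds all sorted rotations; row `index` is the original
std::string bwt_inverse(const std::string &last, int index) {
    int n = last.size();
    std::vector<std::string> table(n);
    for (int step = 0; step < n; step++) {
        for (int i = 0; i < n; i++)
            table[i] = last[i] + table[i];   // grow each row from the right
        std::sort(table.begin(), table.end());
    }
    return table[index];
}
```

With last column "a#nnb#aa" and index 1 (the sorted position of "#banana#"), this recovers "#banana#". It is O(n² log n), so it is only useful for checking small inputs.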
3.2 RUN LENGTH ENCODING (RLE)
• By grouping repeated characters together and storing
the count, RLE can effectively reduce the size of the
data. It is commonly used in various applications,
such as image and video compression, where there are
often long sequences of repeated values or patterns.
class RLE
{
public:
    string encode(string &s)
    {
        ll n = s.length();
        string ans = "";
        for (ll i = 0; i < n; i++)
        {
            // measure the length of the current run
            ll count = 1;
            while (i < n - 1 && s[i] == s[i + 1])
            {
                count++;
                i++;
            }
            if (count != 1)
            {
                // convert the count to digits (built reversed, then swapped)
                string temp = "";
                ll co = count;
                while (co > 0)
                {
                    int t = co % 10;
                    temp += char(t + '0');
                    co = co / 10;
                }
                ll start = 0, end = temp.length() - 1;
                while (start < end)
                {
                    swap(temp[start++], temp[end--]);
                }
                ans += temp;
            }
            ans += s[i];
        }
        return ans;
    }
    string decode(string &s)
    {
        string ans = "";
        ll n = s.length();
        for (ll i = 0; i < n; i++)
        {
            if (int(s[i]) < 48 || int(s[i]) > 57)
            {
                // not a digit: a run of length one
                ans += s[i];
            }
            else
            {
                // read the full (possibly multi-digit) run length
                ll count = 0;
                while (i < n)
                {
                    if (int(s[i]) >= 48 && int(s[i]) <= 57)
                    {
                        count = count * 10 + int(s[i] - '0');
                        i++;
                    }
                    else
                        break;
                }
                // append the repeated character
                for (ll j = 0; j < count; j++)
                {
                    ans += s[i];
                }
            }
        }
        return ans;
    }
};
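A compact standalone sketch of the same scheme, handy for checking the round trip (note it shares the class's limitation: digits in the input would be misread as counts):

```cpp
#include <cctype>
#include <string>

// runs longer than one are written as <count><char>; single chars as-is
std::string rle_encode(const std::string &s) {
    std::string out;
    for (size_t i = 0; i < s.size();) {
        size_t j = i;
        while (j < s.size() && s[j] == s[i])
            j++;                                  // extend the current run
        if (j - i > 1)
            out += std::to_string(j - i);         // write the run length
        out += s[i];
        i = j;
    }
    return out;
}

std::string rle_decode(const std::string &s) {
    std::string out;
    size_t i = 0;
    while (i < s.size()) {
        size_t count = 0;
        while (i < s.size() && std::isdigit((unsigned char)s[i]))
            count = count * 10 + (s[i++] - '0');  // read a multi-digit count
        if (count == 0)
            count = 1;                            // bare character: run of one
        out.append(count, s[i++]);
    }
    return out;
}
```

For example, "AAABBC" encodes to "3A2BC" and decodes back to "AAABBC".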
3.3 MOVE TO FRONT (MTF)
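The MTF illustrations for this section were figures in the original; as a placeholder, here is a minimal sketch of Move-to-Front encoding over a full byte alphabet (the choice of alphabet is our assumption):

```cpp
#include <list>
#include <string>
#include <vector>

// MTF: emit each symbol's current position in the table, then move
// that symbol to the front so recently-seen symbols get small codes
std::vector<int> mtf_encode(const std::string &s) {
    std::list<char> table;
    for (int c = 0; c < 256; c++)
        table.push_back((char)c);            // initial alphabet 0..255
    std::vector<int> out;
    for (char ch : s) {
        int pos = 0;
        auto it = table.begin();
        while (*it != ch) { ++it; ++pos; }   // find current position
        out.push_back(pos);
        table.erase(it);
        table.push_front(ch);                // move the symbol to the front
    }
    return out;
}
```

Encoding "aab" gives {97, 0, 98}: the first 'a' costs its ASCII position 97, but once it has moved to the front the second 'a' costs only 0, which is why MTF pairs so well with the runs that BWT produces.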
3.4 HUFFMAN CODING
• Then we build the Huffman tree. For each character we create a leaf node; then we repeatedly select the two nodes with the lowest frequencies and merge them into a new internal node whose frequency is their sum, which takes the place of the two. We repeat this process for the remaining nodes until just one node, the root, is left. It makes the following tree:-
∗ Huffman Tree:-
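The tree itself was a figure in the original; the construction described above can be sketched as follows (a minimal version using a priority queue; the identifier names are ours):

```cpp
#include <map>
#include <queue>
#include <string>
#include <vector>

// a tree node: leaves carry a character, internal nodes only a frequency
struct Node {
    char ch;
    long long freq;
    Node *left, *right;
};
struct Cmp {
    bool operator()(const Node *a, const Node *b) const { return a->freq > b->freq; }
};

// walk the finished tree: left edge = '0', right edge = '1'
void collect(const Node *n, const std::string &code, std::map<char, std::string> &out) {
    if (!n->left && !n->right) { out[n->ch] = code; return; }
    collect(n->left, code + "0", out);
    collect(n->right, code + "1", out);
}

std::map<char, std::string> huffman_codes(const std::string &s) {
    std::map<char, long long> freq;
    for (char c : s) freq[c]++;
    std::priority_queue<Node *, std::vector<Node *>, Cmp> pq;
    for (auto &p : freq)
        pq.push(new Node{p.first, p.second, nullptr, nullptr});
    // repeatedly merge the two lowest-frequency nodes into an internal node
    while (pq.size() > 1) {
        Node *a = pq.top(); pq.pop();
        Node *b = pq.top(); pq.pop();
        pq.push(new Node{0, a->freq + b->freq, a, b});
    }
    std::map<char, std::string> codes;
    collect(pq.top(), "", codes);
    return codes;
}
```

For the input "aaaabbc", 'a' (frequency 4) gets a 1-bit code while 'b' and 'c' get 2-bit codes, showing how more frequent characters receive shorter codes.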
∗ Applications of Huffman Coding :-
4 PROJECT EXECUTION: Achievements and
Implementation
• Then all of us started researching and discussing different ideas for the project. After a lot of discussion and arguments, we decided to first implement Run Length Encoding, then Burrows Wheeler Transform, then Move to Front, and at last Huffman coding.
• We all went back to research to see if there were any small things we were missing, and we found a very big drawback: "NUMBERS". In Run Length Encoding (RLE), the output stores the count of each repeated character, such as 33A42B5C7D, so digits in the original text clash with the stored counts.
• We discussed the problem with Professor Manish Gupta and Professor Prosenjit Kundu and got the idea of packing those bits in bundles of eight and then storing each bundle as a character in a file. But it became a lossy compression method, as we were not able to apply it and make it a fully lossless compression.
• It was funny that we, who don't pay much attention to KBs today, were really worried about bits for those 4-5 days. It was clear that the file could be compressed by more than 60% every time. So we decided to move on from Huffman coding and focus on RLE and BWT for the project.
• The next drawback we faced was that compression took far too long. After some study, we found that the lexicographical sorting in BWT took most of the time. At first we used bubble sort, but we switched to merge sort to do the sorting efficiently. And guess what, it worked.
• A file of 25 KB took approximately 10 minutes to compress with bubble sort, but now takes just 10 seconds with merge sort. This was a huge victory for us.
• As you can see in Figure 2, we uploaded a file of 50 KB and got a compressed file of 8.8 KB, which gives us a monstrous 82.37% compression ratio.
• Interestingly, we named our compressed file as .mkg in
honour of our professor Manish K. Gupta.
• After that, we decided to work on a working software model, which we made in Python.
Figure 2: Compression Ratio
4.1 COMMERCIALIZATION ASPECT:
• The commercialization aspect is how we would like to launch our product, here an app, in the market. There are many data compression applications available on the internet, but the main thing about our app is that it compresses files by about 60% or more.
• As the working model of our software is ready, we now wish to minimise the compression time and increase the compression ratio. To reduce the time, our main idea is to divide the input file into small parts and process each part simultaneously, increasing productivity and efficiency. Through this method, the compression would become even faster than usual.
• We would also like to make separate applications for Windows, Linux and Mac users, which will help our product gain more reach.
• Who knows, even Google may be impressed by our product and hire us as interns, so we guess there are many possibilities and benefits once we commercialize our product.
• Let us have a quick peek at the contributions of all the group members:
4.2 CONTRIBUTION:
Harsh Popatiya (202201463):-
Major work in code implementation; research on various topics for the project.
5 REFERENCES