Segmentation-Based DNA Sequence Compression

Alochana Chakra Journal ISSN NO:2231-3990

K. Punitha¹, Dr. A. Murugan²
¹Assistant Professor, Department of Computer Science, Agurchand Manmull Jain College (Shift II), Chennai, India
²Associate Professor & Head, PG & Research Department of Computer Science, Dr. Ambedkar Government Arts College (Autonomous), Affiliated to University of Madras, Chennai, India
¹punithsathish@gmail.com
²amurugan1972@gmail.com
Abstract- This paper presents a lossless DNA sequence compression algorithm based on a substitution method that is equivalent to offline-dictionary Lempel-Ziv compression. The proposed method exploits the repetition structures that are intrinsic to DNA sequences. The offline dictionary holds the repeated patterns found in the input DNA sequence together with information about mismatches. Only mismatches that yield a compression gain are permitted, which helps the method reach a compression ratio that surpasses existing lossless DNA sequence compression algorithms. The proposed R_Pattern algorithm performs dictionary-based compression of DNA sequences by making use of the varied repetitive structures inherent in them, attains a better compression ratio, and is tested on benchmark datasets from NCBI (National Center for Biotechnology Information).
Keywords— DNA sequence compression, BWT, Lempel-Ziv compression, biological sequences.
1. INTRODUCTION

With the unceasing rise in the generation of DNA sequences, issues related to comprehension, storage and transmission have grown. Although storage costs have dropped drastically of late, the surge in the amount of DNA being sequenced has produced voluminous data that must be stored online, which keeps storage costly. There is also the concern of drawing sense out of this voluminous data. Research on genomes must handle millions or even billions of base pairs, and the presence of genome databases aggravates the issue further. As a result, highly efficient techniques are required for compressing DNA sequences or any other biological sequence data.

Because DNA sequences are not random in nature, redundancy can be eliminated, which makes compression possible. More than 50% of the human genome is estimated to consist of repeat DNA [1]. Storage concerns can be addressed by compression. Chen et al. [2] illustrate that compressibility can be used in evolutionary tree construction and sequence alignment, and that it acts as a good measure of relatedness among sequences. Allison et al. [3] indicate that the compressibility of DNA sequences can support a sound assessment of such sequences. Compression also aids effective sequence classification [4]. A DNA sequence has four nucleotide bases, namely A (Adenine), C (Cytosine), G (Guanine) and T (Thymine), so each base can be represented with just two bits. A DNA sequence usually contains different types of repeats, such as complementary, reverse complementary, approximate and reverse repeats, which are long and not very frequent. Traditional text compression algorithms capture only short and frequent repeats; as a result, applying such algorithms to DNA sequences tends to expand the data rather than compress it. Hence the challenge lies in identifying the varied set of repeats in a DNA sequence and encoding them to obtain an improved compression ratio. The present work recommends an algorithm that combines a substitution method with the Lempel-Ziv [5, 6] approach for DNA sequence compression. Different kinds of repeats (complementary, reverse complementary, exact and reverse) are found by this algorithm and stored in an offline dictionary. Offline-dictionary strategies for DNA sequence compression are examined in relation to the BWT (Burrows-Wheeler Transform), which transforms a character string into runs of similar characters. The prevalence of short, repeating patterns is of utmost significance in biological sequences. The present research proposes dictionary-based methods for compressing DNA sequences that make use of the varied repetitive structures inherent in these sequences. To generate the final parsed sequence, repeats are eliminated from the original sequence. The compressed sequence is formed by combining the dictionary and the final parsed sequence. Mismatches that yield a better compression gain are considered and recorded with the repetitive substrings in the dictionary.

2. RELATED WORK

Grumbach and Tahi [7, 8] proposed that biological sequence compression can operate in two modes, horizontal or vertical. In the horizontal mode, a biological sequence is compressed using only the information it holds itself, such as references to its own substrings. The vertical mode, on the other hand, compresses a set of biological sequences using information retrieved from the whole set. For reducing transmission and storage costs, the horizontal mode is the preferred choice [9]; it incorporates statistical techniques, substitution techniques, or a combination of both. Statistical compression makes use of a statistical model of the data that
Volume IX, Issue V, May/2020 Page No:6630

involves variable code sizes, and the compression quality achieved depends on the data model [10]. Substitution or dictionary-based compression selects strings of symbols that are quite frequent and encodes every such string as a token that acts as a pointer to the string in the dictionary. The dictionary can be either static or dynamic. Compression algorithms that rely on the LZ method use an online dictionary. Algorithms with an offline dictionary, by contrast, compress in two passes: in the first pass all the repeats are recognized and stored in the dictionary, and in the second pass all such repeats are encoded as pointers into the dictionary [5, 6]. Hybrid compression integrates both substitution and statistical techniques. Numerous compression methods for biological sequences use the substitution technique [2, 7, 8, 29].

Grumbach and Tahi [7, 8] developed the earliest special-purpose DNA compression algorithm in the literature, Biocompress. Biocompress and Biocompress-2 identify repeats of substrings that occurred earlier in the sequence and encode them by the position of the earlier occurrence and the length of the repeat. In addition, order-2 arithmetic coding is used to encode the regions that do not repeat. Apostolico and Lonardi [11] propose an offline approach that repeatedly chooses the repeated substrings whose encoding achieves the highest compression. Adjeroh et al. [12, 13] generate an offline dictionary of short repeats and then code every occurrence of a repeat by the position of that repeat within the dictionary. Rivals et al. [14] propose Cfact, which builds a suffix tree in the first pass and uses this structure in the second pass to identify the longest exact matching repeats. Some methods, such as ARM, XM and CDNA, employ statistical techniques. Cao et al. [15] propose the expert model (XM), which uses order-2 Markov experts and a copy expert to predict the probability of a symbol's occurrence and incorporates adaptive coding to handle correct or incorrect predictions. Loewenstern and Yianilos [16] proposed the purely statistical CDNA algorithm, in which every symbol's probability distribution is obtained from approximate partial matches in the history. Each approximate match refers to an earlier subsequence that lies within a small Hamming distance of the context preceding the symbol to be encoded. Allison et al. introduce the purely statistical ARM algorithm [3], which determines a sequence's probability by summing the probabilities of all explanations of how the subsequence could be produced. Korodi and Tabus [17] recommend a hybrid method that encodes blocks with a basic normalized maximum likelihood model for discrete regression, conditioned on earlier approximately matching blocks, and then encodes them using first-order context coding. Korodi and Tabus [17] proposed GeNML, in which the DNA sequence is split into fixed-size blocks. The bit mask is encoded by means of a probability distribution that is assessed using the normalized maximum likelihood of similarity. Matsumoto et al. [18] make use of the LZ [5, 6] algorithm. First, the approximate repeat regions are recognized via hashing and dynamic programming, and these repeat regions are then replaced by an offset and a length. The edit operations are encoded by arithmetic coding, and order-32 context tree weighting (CTW) is used for the non-repeat regions. Although arithmetic coding and CTW achieve a better compression ratio, their drawback is a low decompression speed [23]. Biocompress-2 [8] is an improved version, broadly equivalent to Biocompress, that incorporates order-2 arithmetic coding for encoding non-repeat areas. Reported results on the standard benchmark data show an average of 1.850 bpb (bits per base) for Biocompress and 1.783 bpb for Biocompress-2, in contrast to the general-purpose algorithms compact and compress, which needed more than 2 bpb. Cfact [24], like Biocompress, uses a two-pass algorithm to detect the longest exact and reverse complement repeats. In the first pass a suffix tree of the sequence is generated, and the second pass performs the actual encoding via LZ. Non-repeat areas are encoded using 2 bpb. Chen et al. use an equivalent approach in GenCompress [25], with the addition of approximate repetitions.

3. PROPOSED WORK

3.1 Compression Algorithm for DNA Sequences

The method proposed in the present work comprises an algorithm that determines possibly good matches. The identified matching substrings are stored in the dictionary, level by level, for later reference. All mismatches are passed to the next level, where matches and mismatches are sought again. In the first level, take a pattern of four bases starting at position i and search for a match in the remaining sequence. A match may be of four types: equal, reverse, complement or reverse complement. If any of the four types of match occurs, remove the matching pattern from the input sequence and store it in the dictionary along with its position, the type of match (equal (E), reverse (R), complement (C) or reverse complement (RC)) and its level number. If no match occurs, leave the sequence as it is and move to the next level. In the second level, take a pattern of three bases from position i+1 of that four-base segment and repeat the matching process; on a mismatch, move to the next level. In the third level, take a pattern of two bases from position i+2 and repeat the matching process; on a mismatch, move to the next level. In the fourth level, all remaining unmatched bases are encoded with two bits each: A=00, C=01, G=10 and T=11.
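The four match types and the two-bit code of the fourth level can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function names (`complement`, `match_type`, `encode_two_bit`) are hypothetical, the complement is taken to be the standard base pairing (A with T, C with G), which the text does not spell out, and the order in which the four types are tested is a choice made here.

```python
from typing import Optional

# Assumption: standard Watson-Crick pairing; the paper does not define "complement".
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}
# Two-bit code used at the fourth level, as given in the text.
TWO_BIT = {"A": "00", "C": "01", "G": "10", "T": "11"}


def complement(pattern: str) -> str:
    """Replace every base by its pairing partner."""
    return "".join(COMPLEMENT[base] for base in pattern)


def match_type(pattern: str, candidate: str) -> Optional[str]:
    """Return 'E', 'R', 'C' or 'RC' if candidate matches pattern, else None.

    The types are tested in a fixed order, so a candidate that satisfies
    more than one type is reported under the first one that applies.
    """
    if candidate == pattern:
        return "E"
    if candidate == pattern[::-1]:
        return "R"
    if candidate == complement(pattern):
        return "C"
    if candidate == complement(pattern)[::-1]:
        return "RC"
    return None


def encode_two_bit(sequence: str) -> str:
    """Encode unmatched bases with two bits each (A=00, C=01, G=10, T=11)."""
    return "".join(TWO_BIT[base] for base in sequence)


print(match_type("AGAC", "GTCT"))
print(encode_two_bit("TGAATGGTTCA"))
```

A real implementation would also have to store each dictionary entry compactly; that encoding is outside the scope of this sketch.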

[Figure: the input DNA sequence is segmented; the next four-base pattern is checked for exact, reverse, complement or reverse complement matching; matches found at Levels 1, 2 and 3 are stored in the dictionary, while on a mismatch the process is repeated at the next level; Level 4 encodes the remaining sequence, and the compressed sequence is output.]
Fig 1 Overall Architecture
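One level of the flow in Fig 1 can be sketched in Python as follows. This is a simplified reading, not the authors' implementation, and it rests on several assumptions that the paper leaves open: the sequence is cut into consecutive blocks of k bases, a block is compared only with later blocks, the first occurrence of a matched pattern is itself recorded as an equal (E) entry and removed, positions are 0-based, and the complement is the standard base pairing. The names `variant_types` and `level_pass` are hypothetical.

```python
from typing import Dict, List, Tuple

# Assumption: standard base pairing for the complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

# (pattern, type of repeat, 0-based position, level number)
Entry = Tuple[str, str, int, int]


def variant_types(pattern: str) -> Dict[str, str]:
    """Map each matching form of pattern to its type; the first type wins on ties."""
    comp = pattern.translate(COMPLEMENT)
    forms: Dict[str, str] = {}
    for kind, form in (("E", pattern), ("R", pattern[::-1]),
                       ("C", comp), ("RC", comp[::-1])):
        forms.setdefault(form, kind)
    return forms


def level_pass(sequence: str, k: int, level: int) -> Tuple[List[Entry], str]:
    """Run one level: move matched k-base blocks to the dictionary, keep the rest."""
    blocks = [sequence[p:p + k] for p in range(0, len(sequence) - k + 1, k)]
    tail = sequence[len(blocks) * k:]  # leftover bases shorter than k
    removed = [False] * len(blocks)
    entries: List[Entry] = []
    for a, pattern in enumerate(blocks):
        if removed[a]:
            continue
        forms = variant_types(pattern)
        hits = [(b, forms[blocks[b]]) for b in range(a + 1, len(blocks))
                if not removed[b] and blocks[b] in forms]
        if hits:
            # Record the pattern block itself plus every later matching block.
            for b, kind in [(a, "E")] + hits:
                removed[b] = True
                entries.append((pattern, kind, b * k, level))
    remaining = "".join(blk for blk, gone in zip(blocks, removed) if not gone) + tail
    return entries, remaining


dictionary, rest = level_pass("ACGTTGCAGGGG", k=4, level=1)
print(dictionary)
print(rest)
```

Calling `level_pass` again on the remaining sequence with k=3 and then k=2 would give the second and third levels under the same assumptions. The sketch does not claim to reproduce the positions listed in the paper's own worked example.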
TABLE 1
STRUCTURE OF THE OFFLINE DICTIONARY

Matching pattern   Type of repeat   Position of repeat   Level no.
AATA               Equal            5                    1
AATA               Reverse          20                   1
ACT                Reverse          36                   2
ACT                Complement       73                   2

If a mismatch occurs during the extension of a repeat substring, the decision on whether to allow it depends on the total number of mismatches and on whether allowing the mismatch would yield a compression gain.

3.2 General Data Compressions

Information in DNA sequences can be represented using an alphabet of four letters, A, C, G and T. If the sequences were absolutely unpredictable or random, two bits would be required to code every nucleotide base. From the compression standpoint, repetitions within biological sequences create redundancies that open the way to major compaction. Recognizing these dependencies is the basis of biological sequence compression. Lossless compression schemes fall under three general categories: symbol-wise substitution, dictionary-based and context-based methods [26]. Symbol-wise substitution replaces every symbol with a new code word such that frequently occurring symbols receive shorter code words, thus gaining
an overall compression. Dictionary-based methods build a vocabulary of symbols or groups of symbols that occur frequently in the input sequence. Compression is obtained by replacing the symbols with pointers to their positions in the dictionary. The dictionary can be either offline or online. With an online dictionary, the text itself acts as the dictionary, and symbols observed earlier in the sequence are replaced with pointers to their previous positions.

The well-known LZ family of algorithms relies on online dictionaries. In offline dictionary methods the input sequence is compressed in two passes: the first pass determines the repeating sequences and constructs the dictionary, and the second pass encodes the repeats with pointers into the dictionary. Notably, dictionary-based compression is a form of substitution in which text is replaced by references to earlier symbols or sequences of symbols rather than by code words. Storer [27] proposed a standard framework for describing various substitution-based schemes.

3.3 Biological sequence compression algorithms

Information about DNA sequences is stored in molecular biology databases. Both the size and the importance of such databases will grow considerably in the near future, so it is a prerequisite that DNA sequence information is stored efficiently. In addition, sequence compression can be used to characterize similarities among biological sequences. Standard compression algorithms such as gzip or compress cannot compress DNA sequences; they only expand them. An algorithm such as Context Tree Weighting (CTW), however, can compress DNA sequences to less than 2 bps (two bits per symbol) without using any special structure of biological sequences. DNA sequences have two main characteristic structures: palindromes (reverse complements) and approximate repeats. Various algorithms use these structures to compress DNA sequences to less than 2 bps. In existing work, CTW has been improved to exploit the characteristic structures of DNA sequences. Using hashing and dynamic programming, the algorithm searches for an approximate repeat before encoding the next symbol and, if a palindrome or approximate repeat of sufficient length is found, represents it by its length and distance. This pre-processing lets that program obtain a slightly better compression ratio than earlier DNA-oriented compression algorithms. A new compression algorithm for protein sequences is also described there.

Although the conventional compression methods discussed above are effective on text, they struggle with biological sequences such as DNA or protein sequences. Directly applying classical compression methods such as Huffman, LZ and arithmetic coding to these sequences in fact leads to data expansion rather than compression [24, 25]. This may be because such conventional models were developed purely for text and so do not work in the same way for biological sequences, which have special characteristics. Although they worked to some extent for dictionary-based schemes, the desired compression was not achieved. Biological sequences contain many regularities that can be exploited for compression.

Effective approaches to biological sequence compression take into account the multiple regularities or repetition structures present in such sequences, for example tandem repeats, simple (interspersed) repeats (SINEs and LINEs), palindromes, complemented palindromes and complemented repeats. Both exact and inexact (approximate) repetitions must be used. The challenge is to identify such repetition structures quickly and to use this knowledge for sequence compression. Various specialized algorithms have been suggested for compressing biological sequences; they differ in the type(s) of repetition they use and in how they use them. Such methods are usually based on an online dictionary, in which various repetition structures are considered for constructing the dictionary (substitution mechanism). Grumbach and Tahi [8] pioneered the first special-purpose compression algorithm for DNA sequences. Their algorithms Biocompress and Biocompress-2 [8] factor the input sequence into repetitive structures. Each factor (repeat) is denoted by a pair of integers giving the repeat length and the position of the factor's occurrence in the part of the input already seen. The pair of integers is then coded with a Fibonacci code, a universal code for integers. Before a factor is replaced, a check verifies whether the replacement yields any compression, that is, whether the coded representation of the pair of numbers is smaller than the original nucleotide bases left uncoded. The parts of the sequence that remain unfactored are then coded with arithmetic codes. Biocompress-2 is broadly equivalent to Biocompress, except that it provides palindrome handling.

3.4 The Burrows-Wheeler Transform

The BWT permutes the characters of a sequence so that characters occurring in lexically similar contexts end up close to one another. The forward and inverse BWT, along with the subsequent encoding of the permuted sequence, are the main procedures in BWT-based compression and decompression.

3.4.1 The forward transform

Given an input sequence T of length u, the forward BWT has three steps: i) Form u permutations of T by cyclic rotations of the characters in T. These permutations form a u × u matrix M', in which each row is one permutation of T; ii) Sort the rows of M' lexicographically to build another matrix M, which contains T as one of its rows; iii) Record L, the last column of the sorted matrix M, and id, the row number in M that corresponds to the original sequence T. The pair (L, id) is the output of the BWT. The general effect is that similar contexts in T are brought closer together in L, and this closeness of similar contexts helps compression. For instance, consider T = ACTAGA. Let F denote the array of first characters and L denote the array of last
characters. Then F = AAACGT and L = GATAAC, and the result of the transformation is the pair (L, id) = (GATAAC, 2), with indices running from 1 to u. The rotation matrices for the sequence T = ACTAGA are as follows:

TABLE 2
BWT FORWARD TRANSFORMATION

S. No.   M'         M (sorted)
1        ACTAGA$    $ACTAGA
2        CTAGA$A    A$ACTAG
3        TAGA$AC    ACTAGA$
4        AGA$ACT    AGA$ACT
5        GA$ACTA    CTAGA$A
6        A$ACTAG    GA$ACTA
7        $ACTAGA    TAGA$AC

M' is the matrix of cyclic rotations before sorting, and M is the matrix after sorting. The extra symbol $ is included in the table for consistency in the suffixes; the values of F and L quoted above are those obtained for T without the $ symbol.

3.5 Calculation of Compression Ratio

Encoding each symbol of a DNA sequence directly requires just 2 bits, so DNA compression aims to use fewer than 2 bits per base. The output of the proposed method consists of the offline dictionary and the final parsed sequence. Compression performance is commonly expressed in bps (bits per symbol); here the compression ratio is computed as a percentage using

\[
\text{Compression ratio} = \left(1 - \frac{\text{Length of dictionary} + \text{Length of compressed sequence}}{\text{Length of input sequence}}\right) \times 100
\]

Consider the following example:

Osequence = AGACTTAATAACTTGAACTTAGACTAAGGTTCATAACTAGTATGAGTAAATTCTCA (56 bases; data length 448 bits at 8 bits per character)

Level 1: TAACTTGAACTTTAAGGTTCATAACTAGTATGAGTA
Level 2: CTTGAACTTGGTTCACTAGTATGAGTA
Level 3: TGAATGGTTCA
Level 4: 1110000011101011110100

Encoded data length = 22 bits

TABLE 3
OFFLINE DICTIONARY

Patterns   Type of repeat   Position     Level no.
AGAC       Equal            1, 20        1
AGAC       Complement       52           1
TTAA       Equal            4, 48        1
TAA        Equal            0, 13, 21    2
CT         Equal            1, 6, 15     3

Algorithm 1: R_Pattern DNA Sequence Compression Algorithm

Algorithm Rpattern(input_sequence)
Begin
{
    Read the DNA sequence from input_sequence
    Set i = 0    // assign the starting position
    Do
    {
        Read 4 bases as R_pattern from input_sequence
        Set level to 1
        Assign Dictionary to name of the dictionary
        Osequence = call Rpattern_match(R_pattern, Dictionary, level)
    } Until i >= input_sequence.length
    Do
    {
        Read 3 bases as R_pattern from input_sequence
        Set level to 2
        Osequence = call Rpattern_match(R_pattern, Dictionary, level)
    } Until i >= Osequence.length
    Do
    {
        Read 2 bases as R_pattern from input_sequence
        Set level to 3
        Osequence = call Rpattern_match(R_pattern, Dictionary, level)
    } Until i >= Osequence.length
    Set j = 0    // starting position
    Do
    {
        Read Osequence from jth position
        If Osequence[j] = 'A' then
            Esequence[j] = 00
        Else if Osequence[j] = 'C' then
            Esequence[j] = 01
        Else if Osequence[j] = 'G' then
            Esequence[j] = 10
        Else
            Esequence[j] = 11
        End if
        Increment j by 1
    } Until j >= Osequence.length
}
End

Algorithm 1 defines the segmentation process, calls the pattern matching Algorithm 2 to find the type of match, and encodes the output sequence.

Algorithm 2: Function Rpattern_Match(R_pattern, Dictionary, level)

Function Rpattern_match(R_pattern, Dictionary, level)
Begin
{
    Assign the starting position to k
    While (k <= Osequence.length)
    {
        If R_pattern matches the remaining sequence in any of the four types Equal (E), Reverse (R), Complement (C), Reverse complement (RC) then
        {
            Set Dictionary.matchpattern = R_pattern
            Set Dictionary.position = k
            Set Dictionary.type_of_match = 'E' / 'R' / 'C' / 'RC'
            Set Dictionary.level_no = level
            Remove R_pattern from Osequence
        }
        Increment k value by level value
    }
    Return Osequence
}
End

Algorithm 2 defines the process of finding the type of match, whether equal, reverse, complement or reverse complement, and returns the result to Algorithm 1.

Decompression: Decompression is the reverse of the above compression algorithm. The decompressed sequence can be checked against the original DNA sequence using the following Text_Match(file1, file2) algorithm.

Algorithm 3: Text_Match(file1, file2)

Algorithm Text_Match(file1, file2)
Begin
    Initialise i = 0
    Set flag := true
    While (not end of the files)
    {
        Read line1[i] from file1
        Read line2[i] from file2
        If line1[i] != line2[i] then
            flag := false
        i++
    }
    If flag == true then
        Print "Both files are matching"
    Else
        Print "Not Matching"
    End if
End

Algorithm 3 defines the process of finding whether the original input sequence and the decompressed output sequence match.

4. RESULT AND DISCUSSIONS

In the present research, the dictionary-based compression algorithm was tested on a set of DNA sequences supplied in FASTA format. Testing used the same standard benchmark data as [2, 7, 19, 28]. These standard sequences comprise HUMGHCSA (human growth hormone), HUMHDABCD (human DNA sequence), VACCG (vaccinia virus Copenhagen complete genome), HUMHBB (human beta globin region on chromosome 11), HUMDYSTROP (Homo sapiens dystrophin gene), and HUMHPRTB (human hypoxanthine phosphoribosyltransferase gene).

Dictionary Compression - The dictionary method is used in the proposed system to store the matching patterns with their positions, types and levels, and the dictionary is used for decompression.

A. Performance in terms of data compression:

The compression ratios (in percentage) of the proposed R_Pattern algorithm and of the existing algorithm are compared in Table 4.
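The percentage compression ratio of Section 3.5 can be computed as in the following minimal Python sketch. The function name `compression_ratio_percent` is hypothetical, and the formula is read as (1 - (dictionary + compressed) / input) × 100. The values 448 and 22 are the data lengths quoted in the worked example of Section 3.5; the dictionary length of 100 is a made-up figure used only for illustration, since the paper does not state the size of the dictionary for that example. All three lengths must be expressed in the same unit.

```python
def compression_ratio_percent(dictionary_len: int, compressed_len: int,
                              input_len: int) -> float:
    """Percentage saved relative to the input; all lengths in the same unit."""
    if input_len <= 0:
        raise ValueError("input length must be positive")
    return (1 - (dictionary_len + compressed_len) / input_len) * 100


# 448 and 22 come from the worked example; 100 is a hypothetical dictionary size.
ratio = compression_ratio_percent(dictionary_len=100, compressed_len=22, input_len=448)
print(f"{ratio:.1f}%")  # prints 72.8%
```

A larger dictionary lowers the ratio, which is why only repeats that yield a net gain are worth storing.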

TABLE 4
COMPARISON OF COMPRESSION RATIOS OF THE PROPOSED ALGORITHM AGAINST EXISTING ALGORITHM IN PERCENTAGE

Sequence      Length     Optimal seed based method   Proposed R_Pattern algorithm
HUMDYSTROP    38,770     82%                         87%
HUMGHCSA      66,496     90%                         90%
HUMHBB        73,308     82%                         88%
HUMHDABCD     58,863     84%                         90%
HUMHPRTB      56,832     84%                         85%
VACCG         191,735    84%                         90%

Decompression - After implementing the R_Pattern algorithm, the decompressed output was compared with the original file using the text matching algorithm Text_Match(file1, file2) to ascertain that the proposed compression method is lossless.

B. Experiments in Time Execution:

Table 5 compares the execution time, in seconds, of the proposed algorithm against existing algorithms.

TABLE 5
TIME TAKEN FOR EXECUTION (IN SECONDS) [28]

Sequence      Length     DNA_Compress (sec)   Gen_Compress (sec)   Optimal seed based method (sec)   Proposed Algorithm R_Pattern (sec)
HUMDYSTROP    38,770     0.125                0:00:45              1.5                               31
HUMGHCSA      66,496     0.094                874                  2.5                               65
HUMHBB        73,308     0.125                NA                   2.8                               87
HUMHDABCD     58,863     0.125                104                  2.2                               56
HUMHPRTB      56,832     0.124                90                   2                                 52
VACCG         191,735    0.219                1239                 4                                 150

5. CONCLUSION

A dictionary-based substitutional compression algorithm for DNA sequences is proposed. Based on intensive testing, the optimum pattern length is concluded to be 4. The results show that the proposed method matches or exceeds the compression ratio of the existing algorithm on the standard sequences tested (Table 4). The NCBI GenBank was accessed to collect the datasets, and the Java environment was used for the implementation.

REFERENCES
[1] E. S. Lander, L. M. Linton, B. Birren et al., "Initial sequencing and analysis of the human genome," Nature, vol. 409, no. 6822, pp. 860–921, 2001.
[2] X. Chen, S. Kwong, and M. Li, "Compression algorithm for DNA sequences and its applications in genome comparison," in Proceedings of the 4th Annual International Conference on Computational Molecular Biology (RECOMB '00), p. 107, ACM, Tokyo, Japan, April 2000.
[3] L. Allison, L. Stern, T. Edgoose, and T. I. Dix, "Sequence complexity for biological sequence analysis," Computers and Chemistry, vol. 24, no. 1, pp. 43–55, 2000.
[4] E. Keogh, S. Lonardi, and C. A. Ratanamahatana, "Towards parameter-free data mining," in Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 206–215, August 2004.
[5] J. Ziv and A. Lempel, "A universal algorithm for sequential data compression," IEEE Transactions on Information Theory, vol. 23, no. 3, pp. 337–343, 1977.
[6] J. Ziv and A. Lempel, "Compression of individual sequences via variable-rate coding," IEEE Transactions on Information Theory, vol. 24, no. 5, pp. 530–536, 1978.
[7] S. Grumbach and F. Tahi, "Compression of DNA sequences," in Proceedings of the IEEE Symposium on Data Compression, pp. 340–350, Snowbird, Utah, USA, 1993.
[8] S. Grumbach and F. Tahi, "A new challenge for compression algorithms: genetic sequences," Information Processing and Management, vol. 30, no. 6, pp. 875–886, 1994.
[9] R. Giancarlo, D. Scaturro, and F. Utro, "Textual data compression in computational biology: a synopsis," Bioinformatics, vol. 25, no. 13, pp. 1575–1586, 2009.
[10] D. Salomon, Data Compression: The Complete Reference, Springer Science and Business Media, 2004.
[11] A. Apostolico and S. Lonardi, "Compression of biological sequences by greedy off-line textual substitution," in Proceedings of the Data Compression Conference (DCC '00), pp. 143–152, March 2000.
[12] D. Adjeroh and F. Nan, "On compressibility of protein sequences," in Proceedings of the Data Compression Conference (DCC '06), p. 10, Snowbird, Utah, USA, March 2006.
[13] D. Adjeroh, Y. Zhang, A. Mukherjee, M. Powell, and T. Bell, "DNA sequence compression using the Burrows-Wheeler Transform," in Proceedings of the IEEE Computer Society Bioinformatics Conference, Computer Society, vol. 1, pp. 303–313, 2002.
[14] É. Rivals, M. Dauchet, J. P. Delahaye, and O. Delgrange, "Compression and genetic sequence analysis," Biochimie, vol. 78, no. 5, pp. 315–322, 1996.
[15] M. D. Cao, T. I. Dix, L. Allison, and C. Mears, "A simple statistical algorithm for biological sequence compression," in Proceedings of the Data Compression Conference (DCC '07), pp. 43–52, IEEE, Snowbird, Utah, USA, March 2007.
[16] D. Loewenstern and P. N. Yianilos, "Significantly lower entropy estimates for natural DNA sequences," Journal of Computational Biology, vol. 6, no. 1, pp. 125–142, 1999.
[17] G. Korodi and I. Tabus, "An efficient normalized maximum likelihood algorithm for DNA sequence compression," ACM Transactions on Information Systems, vol. 23, no. 1, pp. 3–34, 2005.
[18] T. Matsumoto, K. Sadakane, and H. Imai, "Biological sequence compression algorithms," Genome Informatics, vol. 11, pp. 43–52, 2000.
[19] B. Behzadi and F. Le Fessant, "DNA compression challenge revisited: a dynamic programming approach," in Proceedings of the Annual Symposium on Combinatorial Pattern Matching, pp. 190–200, Springer, Berlin, Germany, 2005.
[20] D. Adjeroh and J. Feng, "The SCP and compressed domain analysis of biological sequences," in Proceedings of the IEEE Bioinformatics Conference (CSB '03), pp. 587–592, Stanford, Calif, USA, August 2003.
[21] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman, "Basic local alignment search tool," Journal of Molecular Biology, vol. 215, no. 3, pp. 403–410, 1990.
[22] P. Agarwal, "Compact encoding strategies for DNA sequence similarity search," in Proceedings of the International Conference on Intelligent Systems for Molecular Biology (ISMB '95), vol. 4, pp. 211–217, 1995.
[23] H. Sato, T. Yoshioka, A. Konagaya, and T. Toyoda, "DNA data compression in the post genome era," Genome Informatics, vol. 12, pp. 512–514, 2001.
[24] E. Rivals, J. P. Delahaye, M. Dauchet, and O. Delgrange, "Fast discerning repeats in DNA sequences with a compression algorithm," Genome Informatics, vol. 8, pp. 215–226, 1997.
[25] X. Chen, S. Kwong, and M. Li, "A compression algorithm for DNA sequences and its applications in genome comparison," Genome Informatics, vol. 10, pp. 51–61, 1999.
[26] J. A. Storer, "Textual substitution techniques for data compression," in Combinatorial Algorithms on Words, pp. 111–129, Springer, Berlin, Heidelberg, 1985.
[27] T. Bell, I. H. Witten, and J. G. Cleary, "Modeling for text compression," ACM Computing Surveys, vol. 21, no. 4, pp. 557–591, 1989.
[28] P. V. Eric, G. Gopalakrishnan, and M. Karunakaran, "An optimal seed based compression algorithm for DNA sequences," Advances in Bioinformatics, vol. 2016, Article ID 3528406, 2016.
