Segmentation-Based DNA Sequence Compression
Abstract- The DNA sequence is compressed losslessly through a substitution method that is equivalent to offline-dictionary Lempel-Ziv compression. The proposed method exploits the repetition structures that are intrinsic to DNA sequences. The offline dictionary holds the repeated patterns found in the input DNA sequence together with information about the mismatches. The method ensures that only beneficial mismatches are permitted, which helps it achieve a compression ratio that surpasses the prevailing lossless DNA sequence compression algorithms. This research proposes dictionary-based methods for compressing DNA sequences by making use of the varied repetitive structures that are inherent in these sequences. In this paper, the R_pattern algorithm performs dictionary-based compression of DNA sequences and attains a better compression ratio; it is tested with the benchmark datasets from NCBI (National Center for Biotechnology Information).
involves varying code size and, based on the data model, compression quality can be achieved [10]. Substitution or dictionary-based compression selects strings of symbols that are quite frequent and then encodes every such string as a token that acts as a pointer to the string in the dictionary. The dictionary can be either static or dynamic. An online dictionary is used by compression algorithms that rely upon the LZ method. Alternatively, algorithms that use an offline dictionary compress in two passes: in the first pass all the repeats are recognized and stored in the dictionary, whereas in the second pass all such repeats are encoded as pointers into the dictionary [5, 6]. Hybrid compression integrates both the substitution and the statistical techniques. The substitution technique is used by numerous compression methods for biological sequences [2, 7, 8, 29].

Grumbach and Tahi [7, 8] earlier developed a special-purpose DNA compression algorithm named Bio-compress. Bio-compress and Bio-compress 2 identify repeats of substrings that occurred previously in the sequence and encode them by the position of the earlier occurrence and the length of the repeat. In addition, order-2 arithmetic coding is used for encoding the regions that do not repeat. Apostolico and Lonardi [11] propose an offline approach that repeatedly chooses repeated substrings so that the encoding achieves the highest compression. Adjeroh et al. [12, 13] generate an offline dictionary of short repeats and subsequently code all occurrences of a repeat with reference to the position of that repeat within the dictionary. Rivals et al. [14] propose Cfact, which builds a suffix tree in the first pass and uses this structure in the second pass to identify the longest exact matching repeat.

Statistical techniques are employed by methods such as ARM, XM and CDNA. Cao et al. [15] propose the expert model (XM), which uses order-2 Markov experts and a copy expert to estimate the probability of a symbol's occurrence and incorporates adaptive coding for handling correct or incorrect predictions. Loewenstern and Yianilos [16] proposed the purely statistical CDNA algorithm, in which every symbol's probability distribution is obtained from approximate partial matches in the history. Each approximate match refers to an earlier subsequence that lies within a small Hamming distance of the context preceding the symbol to be encoded. Allison et al. also introduce a purely statistical algorithm, ARM [3], which determines the probability of a sequence by summing the probabilities of all explanations of how the subsequence is produced. Korodi and Tabus [17] recommend a hybrid method that encodes by means of a basic normalized maximum likelihood model for discrete regression with respect to prior approximately matching blocks and thereafter encodes them using first-order context coding. Korodi and Tabus [17] proposed GeNML, wherein the splitting of the DNA sequence is done in fixed-size blocks. By means of a probability distribution, the bit mask is encoded, which is assessed using the normalized maximum likelihood of similarity. Matsumoto et al. [18] make use of the LZ [5, 6] algorithm. Initially, the approximate repeat regions are recognized via hashing and dynamic programming, and thereafter such repeat regions are replaced by an offset and a length. Edit operations are encoded by arithmetic coding, and order-32 context tree weighting (CTW) is used for non-repeat regions. Although arithmetic coding and CTW achieve a better compression ratio, their drawback is a low decompression speed [23].

Bio-compress 2 [8] is an improved version that is broadly equivalent to Bio-compress but incorporates order-2 arithmetic coding for encoding non-repeat areas. The results show that both algorithms compressed the standard benchmark data, with Bio-compress achieving an average of 1.850 bpb (bits per base) and Bio-compress 2 achieving 1.783 bpb, in contrast to the general-purpose programs compact and compress, which needed more than 2 bpb. Cfact [24], like Bio-compress, uses a two-pass algorithm to detect the longest exact and reverse-complement repeats. In the first pass a suffix tree of the sequence is built, and the second pass performs the actual encoding via LZ. Non-repeat areas are encoded using 2 bpb. Chen et al. use an equivalent approach in GenCompress [25], though with approximate repetitions.

3. PROPOSED WORK

3.1 Compression Algorithm for DNA Sequences

The method recommended in the present work comprises an algorithm that helps determine possibly good matches. The identified matching substrings are stored in the dictionary for further reference, level by level. All mismatches are moved to the next level for finding further matches and mismatches. In the first level, take a pattern of four DNA bases starting at position i and search for a match in the remaining sequence. A match may be equal, reverse, complement or reverse complement. If any one of the four types of match occurs, remove the matching pattern from the input sequence and store it in the dictionary along with its position, its type of match (equal (E), reverse (R), complement (C) or reverse complement (RC)) and its level number. If a mismatch occurs, leave the sequence as it is and move to the next level. In the second level, take the next pattern of 3 bases from position i+1 in that 4-base input sequence, repeat the matching process and, on a mismatch, move to the next level. In the third level, take a pattern of 2 bases from position i+2, repeat the matching process and, on a mismatch, move to the next level. In the fourth level, all the mismatched patterns are encoded with 2 bits per base: A=00, C=01, G=10 and T=11.
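The four match types named in Section 3.1 can be illustrated with a minimal Python sketch. The helper names `variants` and `find_match`, and the order in which the four types are tried (E, R, C, RC), are assumptions made for illustration; the paper does not specify them, and this is not the authors' implementation.

```python
# Watson-Crick complement table: A<->T, C<->G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def variants(pattern):
    """Return the four forms of a pattern, keyed by match type."""
    comp = pattern.translate(COMPLEMENT)
    return {
        "E": pattern,         # equal
        "R": pattern[::-1],   # reverse
        "C": comp,            # complement
        "RC": comp[::-1],     # reverse complement
    }


def find_match(pattern, rest):
    """Search `rest` for any of the four forms of `pattern`.

    Returns (match_type, offset_in_rest) for the first form found,
    trying E, R, C, RC in that (assumed) order, or None on a mismatch.
    """
    for match_type, form in variants(pattern).items():
        offset = rest.find(form)
        if offset != -1:
            return match_type, offset
    return None


if __name__ == "__main__":
    print(variants("ACTG"))
    print(find_match("ACTT", "GGAAGTCC"))
```

In this sketch a 4-base Level 1 pattern would be passed as `pattern` and the remainder of the sequence as `rest`; a `None` result corresponds to the mismatch case that is deferred to the next level.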
[Figure: flowchart of the compression process. Recoverable labels: DNA Sequence; Matching; Mismatching; Level 1, Level 2, Level 3; Level 4; Encoding the seq.; Repeat the process; Dictionary; Output the compressed sequence.]
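The Level 4 step of Section 3.1 encodes every remaining base with the fixed 2-bit codes A=00, C=01, G=10 and T=11. The sketch below packs a residue into an integer and unpacks it again; the function names and the choice of a Python integer as the bit container are assumptions for illustration only.

```python
# Fixed 2-bit codes from Section 3.1.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BASES = "ACGT"  # index in this string equals the 2-bit code


def encode_2bit(seq):
    """Pack a DNA string into an integer, 2 bits per base.

    The base count is returned as well, because leading 'A' bases
    (code 00) would otherwise be lost when decoding.
    """
    bits = 0
    for base in seq:
        bits = (bits << 2) | CODE[base]
    return bits, len(seq)


def decode_2bit(bits, n):
    """Unpack n bases from an integer produced by encode_2bit."""
    out = []
    for shift in range(2 * (n - 1), -1, -2):
        out.append(BASES[(bits >> shift) & 0b11])
    return "".join(out)


if __name__ == "__main__":
    packed, n = encode_2bit("TGAATGGTTCA")
    print(n * 2, "bits")            # 2 bits for each of the 11 bases
    print(decode_2bit(packed, n))   # round-trips to the input string
```

The example input is the Level 3 residue from the worked example in Section 3.5; at 2 bits per base its 11 bases need 22 bits.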
an overall compression. The dictionary-based methods build a vocabulary of symbols or groups of symbols that occur frequently in the input sequence. Compression is obtained by replacing the symbols with a pointer to their positions in the dictionary. The dictionary can be either offline or online. In an online dictionary, the text itself acts as the dictionary, and symbols observed earlier in the sequence are replaced with pointers to their previous positions.

The well-known LZ family of algorithms relies upon online dictionaries. In the offline dictionary methods the input sequence is compressed in two passes: the first pass determines the repeating sequences and constructs the dictionary, and in the second pass the repeats are encoded with pointers into the dictionary. Significantly, dictionary-based compression acts as a substitution in terms of previously occurring symbols or sequences of symbols rather than a code-word substitution. Storer [27] proposed a standard approach for describing various substitution-based schemes.

3.3 Biological sequence compression algorithms

All the information regarding DNA sequences is stored in molecular biology databases. In the near future both the size and the importance of such databases will grow considerably. Hence it is essential that the information pertaining to DNA sequences is stored efficiently. In addition, sequence compression can be used to characterize similarities among biological sequences. Standard compression algorithms such as gzip or compress cannot compress DNA sequences; they only expand them in size. But an algorithm like Context Tree Weighting (CTW) is able to compress DNA sequences to less than 2 bps (two bits per symbol) without making use of any special structures of biological sequences. DNA sequences have mainly two characteristic structures: the first is palindromes, or reverse complements, and the second is approximate repeats. Such structures are used by various algorithms to compress DNA sequences to less than 2 bps. CTW is improved in the existing work to make use of the characteristic structures of DNA sequences. Using hashing and dynamic programming, the algorithm searches for an approximate repeat before encoding the next symbol; it then represents a palindrome or an approximate repeat of sufficient length (if any) using its length and distance. Such pre-processing helps the new program obtain a slightly better compression ratio than the prevailing DNA-oriented compression algorithms. A new compression algorithm for protein sequences is also described.

Though the conventional compression methods discussed above prove effective on text, compressing biological sequences such as DNA or protein sequences is difficult for them. Directly applying classical compression methods such as Huffman, LZ and arithmetic coding to these sequences in fact leads to data expansion and not compression [24, 25]. This might be because such conventional models were developed purely for text and thus could not work in the same way for biological sequences, which have special characteristics. And though they worked somewhat for dictionary-based schemes, desirable compression was not produced. There exists much regularity in biological sequences that can be utilized for compression.

Effective approaches for biological sequence compression take into account multiple regularities or repetition structures present in such sequences. To name a few: tandem repeats, simple (interspersed) repeats (SINEs and LINEs), palindromes, complemented palindromes and complemented repeats. Exact as well as inexact (approximate) repetitions must be used in any case. The challenging part is the prompt identification of such repetition structures and the use of this knowledge for sequence compression. Various specialized algorithms have been suggested for successfully compressing biosequences; they vary in the type(s) of repetition they use and the way it is done. Usually such methods are based on an online dictionary, wherein various repetition structures are considered for constructing the dictionary (substitution mechanism). Grumbach and Tahi [8] pioneered the first special-purpose compression algorithm for DNA sequences. BIOCOMPRESS1 and BIOCOMPRESS2 [8] are algorithms that factor the input sequence into repetitive structures. Every factor (repeat) is denoted by a pair of integers giving the repeat size and the position of the factor's occurrence in the already seen part of the input. The pair of integers is then coded by means of the Fibonacci code, which is a universal code for integers. Prior to replacing a factor, a check verifies whether the replacement yields any compression, that is, whether the size of the coded representation of the pair of numbers will be less than that of the original nucleotide bases without any coding. Thereafter, the remaining parts of the sequence that are not factored are coded by means of arithmetic codes. BIOCOMPRESS2 is broadly equivalent to BIOCOMPRESS1, with the exception that it provides palindrome handling.

3.4 The Burrows-Wheeler Transform

The BWT permutes the characters of a sequence so that characters with lexically similar contexts reside close to one another. The forward and inverse BWT, along with the subsequent encoding of the permuted sequence, are the main procedures in BWT-based compression and decompression.

3.4.1 The forward transform

Given the input sequence T, there are three steps in the forward BWT: i) Form u permutations of T by cyclic rotations of the characters in T. These permutations result in a u × u matrix M', wherein each row of M' is one permutation of T; ii) The rows of the matrix M' are sorted lexicographically to build another matrix M, which contains T as one of its rows; iii) Record L, the last column of the sorted permutation matrix M, and id, the row number in M corresponding to the original sequence T. The (L, id) pair forms the output of the BWT. The general effect is that similar contexts in T are brought nearer in L. Such similarity in nearby contexts helps in achieving compression. For instance, consider T=ACTAGA. Let F denote the array of first characters and L denote the array of last characters.
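The three steps of the forward transform can be sketched in Python for T = ACTAGA. The function name `bwt_forward`, the 0-based row index and the returned tuple layout are assumptions for illustration; the end marker $ follows the note under Table 2, and this naive version stores all rotations, so it is suited to short examples only.

```python
def bwt_forward(t, sentinel="$"):
    """Naive forward BWT following steps i) to iii).

    Returns (m_prime, m, f, l, row_id), where row_id is the 0-based row
    of M that equals the original sequence with the sentinel appended.
    """
    s = t + sentinel
    # i) all cyclic rotations of s form the rows of M'
    m_prime = [s[i:] + s[:i] for i in range(len(s))]
    # ii) sort the rows lexicographically to obtain M
    #     ('$' sorts before 'A', 'C', 'G', 'T' in code-point order)
    m = sorted(m_prime)
    # iii) record the first column F, the last column L and the row id
    f = "".join(row[0] for row in m)
    l = "".join(row[-1] for row in m)
    row_id = m.index(s)
    return m_prime, m, f, l, row_id


if __name__ == "__main__":
    m_prime, m, f, l, row_id = bwt_forward("ACTAGA")
    for number, (before, after) in enumerate(zip(m_prime, m), start=1):
        print(number, before, after)
    print("F =", f, "L =", l, "id =", row_id)
```

For this input the sketch yields L = AG$TAAC, and rows 5 to 7 of both matrices agree with the rows that survive in Table 2.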
TABLE 2
BWT FORWARD TRANSFORMATION

S. NO.   M'        M
5        GA$ACTA   CTAGA$A
6        A$ACTAG   GA$ACTA
7        $ACTAGA   TAGA$AC

M' is the matrix of cyclic rotations before sorting. M is the matrix after sorting. The extra symbol $ is included for consistency in the suffixes.

3.5 Calculation of Compression Ratio

In order to encode each symbol of a DNA sequence, just 2 bits are required. DNA compression aims to use fewer than 2 bits to denote each base. The output of the suggested method includes the offline dictionary as well as the final parsed sequence, and the compression ratio denotes bps, i.e., bits per symbol, and is computed using

\[
\text{Compression ratio} = \left(1 - \frac{\text{Length of dictionary} + \text{Length of compressed sequence}}{\text{Length of input sequence}}\right) \times 100
\]

Consider the following example:

Osequence = AGACTTAATAACTTGAACTTAGACTAAGGTTCATAACTAGTATGAGTAAATTCTCA (data length 448 bits, i.e., 56 characters of 8 bits each)

Level 1
TAACTTGAACTTTAAGGTTCATAACTAGTATGAGTA

Level 2
CTTGAACTTGGTTCACTAGTATGAGTA

Level 3
TGAATGGTTCA

Level 4

TABLE 3
OFFLINE DICTIONARY

Patterns   Type of repeat   Position   Level no.
CT         Equal            1,6,15     3

Algorithm 1 : R_Pattern DNA Sequence Compression Algorithm

Algorithm Rpattern(input_sequence)
Begin
{
    Read the DNA sequence from input_sequence
    Set i = 0 //assign the starting position
    Do
    {
        Read 4 bases as R_pattern from input_sequence
        Set level to 1
        Assign Dictionary to name of the dictionary
        Osequence = call Rpattern_match(R_pattern, Dictionary, level)
    } Until i >= input_sequence.length
    Do
    {
        Read 3 bases as R_pattern from input_sequence
        Set level to 2
        Osequence = call Rpattern_match(R_pattern, Dictionary, level)
    } Until i >= Osequence.length
    Do
    {
        Read 2 bases as R_pattern from input_sequence
Function Rpattern_match(R_pattern, Dictionary, level)
Begin
{
    Assign the starting position to k
    While (k <= Osequence.length)
    {
        If R_pattern matches the remaining sequence in any of the four types Equal (E), Reverse (R), Complement (C), Reverse complement (RC) then
        {
            Set Dictionary.matchpattern = R_pattern
            Set Dictionary.position = k
            Set Dictionary.type_of_match = 'E' / 'R' / 'C' / 'RC'
            Set Dictionary.level_no = level
            Remove R_pattern from Osequence
        }
        Increment k value by level value
    }
    Return Osequence
}
End

Algorithm 2 defines the process of finding the type of match (equal, reverse, complement or reverse complement) and returns the result to Algorithm 1.

In this research, the dictionary-based compression algorithm was tested on a set of DNA sequences with input in FASTA format. The method was tested using the same standard benchmark data used in [2, 7, 19, 28]. These standard sequences comprise HUMGHCSA (human growth hormone), HUMHDABCD (human DNA sequence), VACCG (vaccinia virus Copenhagen complete genome), HUMHBB (human beta globin region on chromosome 11), HUMDYSTROP (Homo sapiens dystrophin gene), and HUMHPRTB (human hypoxanthine phosphoribosyltransferase gene).

Dictionary Compression: The dictionary method is used in the proposed system to store the matching patterns with their positions, types and levels, and it is used for decompression.

A. Performance in terms of data compression:

The comparison of compression ratios, in percentage, of the proposed R_Pattern algorithm with existing algorithms is tabulated in Table 4.
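The percentage compression ratio of Section 3.5 can be computed as in the sketch below. It assumes that the factor of 100 applies to the whole difference, i.e. (1 - (dictionary + compressed) / input) × 100, and that all three lengths are given in the same unit. The function name and the numbers in the usage example are illustrative only and are not results from the paper.

```python
def compression_ratio_percent(dictionary_len, compressed_len, input_len):
    """Percentage of space saved relative to the input.

    All three lengths must be measured in the same unit (for example bits).
    """
    if input_len <= 0:
        raise ValueError("input_len must be positive")
    return (1 - (dictionary_len + compressed_len) / input_len) * 100


if __name__ == "__main__":
    # Hypothetical sizes: a 448-bit input, a 100-bit dictionary and a
    # 124-bit compressed sequence give (1 - 224 / 448) * 100.
    print(compression_ratio_percent(100, 124, 448))
```

A negative result would indicate that the dictionary plus the compressed sequence is larger than the input, that is, that the data has expanded.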