An Efficient Machine Learning Approach To Low-Complexity Filtering in Biological Sequences


Christopher A. Barber
Pacific Northwest National Laboratory
Richland, Washington 99352
Email: christopher.barber@pnnl.gov

Christopher S. Oehmen
Pacific Northwest National Laboratory
Richland, Washington 99352
Email: christopher.oehmen@pnnl.gov

Abstract—Biological sequences contain low-complexity regions (LCRs) which produce superfluous matches in homology searches, and lead to slow execution of database search algorithms such as BLAST. These regions are efficiently identified by low-complexity filtering algorithms such as SDUST and SEG, which are included in the BLAST tool-suite. These algorithms target differing notions of complexity, so an algorithm which combines their sensitivities is pursued. A variety of features are derived from these algorithms, as well as a new filtering algorithm based on Lempel-Ziv complexity. Artificial sequences with known LCRs are used to train and evaluate an SVM classifier, which significantly outperforms the standalone filtering algorithms.

I. INTRODUCTION

Nucleotide and amino acid sequences contain regions with low information content due to biased composition (e.g. CpG islands in DNA, or hydrophobic-biased transmembrane regions in proteins) or short-period repeats (also known as microsatellites in DNA). The efficiency of BLAST (Altschul et al., 1997) depends on the absence of low-complexity regions in sequences being processed. Furthermore, the presence of LCRs may lead to superfluous (chance) alignments appearing in the output of BLAST searches.

Because low-complexity filtering or masking is performed as a pre-processing step on every query sequence in a BLAST search, it is desirable to have accurate and efficient (linear-time) algorithms for this purpose. The long-standing SDUST (Morgulis et al., 2006) and SEG (Wootton and Federhen, 1993) algorithms have been packaged with BLAST for filtering DNA and protein sequences, respectively. SDUST and SEG achieve running time linear in sequence length $N$ by examining subsequences or windows of fixed length $n$. SDUST characterizes the complexity of strings by counting repeated 3-mers, whereas SEG measures complexity based on the relative frequencies of individual characters.

In BLAST, filtering of LCRs affects the results of homology searches by preventing local alignments from beginning within these regions. Consequently, false-positive filtering choices may cause statistically significant alignments to be missed. False negatives (unmasked LCRs) may cause the aforementioned problems of slow search execution or the appearance of superfluous alignments in search results. For these reasons, an improvement in filtering accuracy is desired.

Depending upon the complexity scoring function used, filtering algorithms are sensitive to different types of low complexity. Furthermore, filtering algorithms take a variety of approaches in determining the boundaries of LCRs based on the scores obtained. These differing characteristics cause filtering algorithms such as SDUST and SEG to exhibit complementary strengths and weaknesses, motivating a filtering approach which combines multiple complexity measures and masking procedures.

A filtering method which combines various filtering algorithms using machine learning is proposed. The low-complexity filtering task is formulated as a classification problem on a per-character basis, where class 0 denotes an unmasked or high-complexity region (HCR) and class 1 denotes an LCR. Features used in classification are indicators for the filtering decisions of the various constituent algorithms (e.g. 1 if SDUST masked the corresponding character, 0 otherwise). Additional features include statistics on complexity scores relevant to each position. This classifier-based filtering algorithm is evaluated using features from SDUST and SEG alone, as well as the additional features from a new filtering method based on Lempel-Ziv complexity which is presented here.

In order to train and evaluate the classifier-based filtering algorithm, artificial sequences containing high- and low-complexity regions are generated from probabilistic models with known complexity characteristics. Specifically, HCRs are modelled as i.i.d. (independent and identically distributed) random sources with entropy at or near the maximum for the given number of symbols. LCRs are modelled either as i.i.d. sources with biased composition, or as Markov sources with relatively low entropy rate. For HCRs and both types of LCRs, random assignment produces a diverse set of parameterizations with entropy rates within the desired ranges.

This paper is organized as follows: The methods section (section II) covers details of generating artificial sequences for classifier training and evaluation (section II-A); the various complexity scoring functions (section II-B) and masking rules (section II-C) employed by SDUST, SEG, and the proposed Lempel-Ziv based filter; and the various features which are supplied to the classifier (section II-D). Section III describes experimental setup and results from an evaluation of the classifier-based filtering algorithm. Section IV discusses the results and offers concluding remarks.

II. METHODS

A. Generating Sequences

To train and evaluate the SVM based filtering algorithm, artificial sequences with a variety of properties were generated using probabilistic models. This section describes the design of these models and the procedures for choosing their parameters and measuring their complexity.

One way to model both high and low complexity regions is as i.i.d. sources. That is, each symbol $X_i$ is drawn independently from a source distribution $P(X_i)$ over alphabet $\{s_1, \ldots, s_b\}$ of $b$ symbols. For modelling HCRs, the source entropy $H(X_i)$ should be at or near $\log_2 b$ bits. For LCRs, the entropy should be substantially lower.

An i.i.d. source models compositionally biased LCRs, but it would be useful to model LCRs with some type of repetitive structure. For alphabet sizes 4 and 20, this was accomplished by modelling LCRs as second- and first-order Markov models, respectively.

A suitable complexity measure for these probabilistic sequence models is entropy rate, which, intuitively, is a measure of the average information content per character in a sequence generated by the model. The entropy rate of a stochastic process $X$ is defined as

\[ H(X) = \lim_{n \to \infty} \frac{1}{n} H(X_1, \ldots, X_n) \]

For an i.i.d. model, the entropy rate is simply the entropy $H(X_i)$ of the source distribution. For a Markov chain with transition matrix $P_{i,j}$, with stationary distribution $\mu = \langle \mu_1, \ldots, \mu_b \rangle$, the entropy rate is given by

\[ H(X) = -\sum_{i,j} \mu_i P_{i,j} \log_2 P_{i,j} \]

In order to differentiate this type of LCR from the compositionally biased LCRs, parameters are chosen such that $H(\mu)$ is at least 90% of $\log_2 b$, or relatively unbiased. In these experiments, the stationary distribution $\mu$ was estimated by running the Markov chain from a uniform initial state distribution until no entry in $\mu$ changed by more than one percent of its previous value.

To obtain a diverse set of parameterizations to a model at a specified entropy, parameters are randomized until the entropy rate falls between 90 and 100% of the desired value. First, each parameter is set uniformly at random to a value between 0 and 1. Then all parameters are raised to a power $d$ drawn uniformly at random from the integers $[1, 100]$.

Generating sequences using either of the models is straightforward. The initial states of the Markov chains are sampled from the estimated stationary distribution. To make sequences consisting of multiple HCRs and LCRs, the regions are sampled individually from randomly parameterized models with the desired entropy rates, and then concatenated.

The following are examples of protein LCRs of length 50, sampled from models with entropy rates of about $0.5 \log_2 20 \approx 2.161$ bits. The first was sampled from an i.i.d. model, and the second from a (first-order) Markov model. The actual entropy rates were 2.1005 and 2.0883 bits, respectively.

YYYTAYSSYATWTSTTTTATTYATYTTTYYTWTTTYTYSASTYTSWTTTK
GFGFGVNAEFGFEIMFGTMFMFGLGLQMKHACHYKHYKHHQIARFIANIV

Likewise, the following are examples of DNA LCRs from models with entropy rates of approximately $0.5 \log_2 4 = 1$ bit. Here, the Markov LCR model was of order 2. The actual entropy rates were 0.9638 and 0.9416 bits, respectively.

CCGGGGCGGCCCGCGCCCGGCCCCCCCGCCCCCGGCGCCGGGGGGCCGCG
CGTAGGCGTCCACTTGACTTAGACTTAGACACTTGCGTAGGCGTAGGCGT

Note that the Markov LCRs have relatively unbiased character composition as desired, and that their low complexity can mostly be attributed to short repeats.

B. Complexity Scoring Functions

Low-complexity filtering algorithms such as SDUST and SEG are composed of two main elements: first, a scoring function is used to measure the complexity of subsequences of an input sequence; second, a set of rules is defined which determines the regions to be masked (identified as LCRs) based on the collection of subsequence scores. This section covers the complexity scoring functions used by SDUST, SEG, and the new Lempel-Ziv based filter. Section II-C covers the masking rules applied by these algorithms.

SDUST (Morgulis et al., 2006) measures complexity based on counts of repeated $m$-mers. In the original formulation, which was designed for DNA sequences, $m = 3$. In order to apply SDUST to proteins, $m$ is set to 2, since repeated 3-mers are very improbable even in low-complexity sequences of length $n = 64$ or less with an alphabet of 20 characters.

Specifically, SDUST scores a subsequence $X = X_1, \ldots, X_n$ by counting every repeat of length $m$ within $X$, and scales this count by $10(n-m)^{-1}$. Letting $M_{i,j}$ denote the "match" event $X_i, \ldots, X_{i+m-1} = X_j, \ldots, X_{j+m-1}$, the SDUST score can be written as

\[ \mathrm{sdust\_score}(X) = 10(n-m)^{-1} \sum_{i \neq j} I(M_{i,j}) \]

for $i, j \in [1, n-m+1]$. Note that higher SDUST scores correspond to a greater number of repeats and hence lower complexity.

SEG (Wootton and Federhen, 1993) measures sequence complexity using a technique inspired by the statistical "method of types." Informally, the type of a sequence is the sorted (descending) tuple of symbol frequencies. For example, the type of the binary sequences 01101 or 10001 is (3,2). Note that amongst random binary sequences of length 5, the type (3,2) is much more probable than (5,0). There are only two sequences (00000 and 11111) with type (5,0), giving a probability of $2(1/2^5)$. SEG measures the complexity of a sequence by calculating this probability for its type.

Calculating the probability of an observed sequence's type is straightforward. Consider a sequence $X = X_1, \ldots, X_n$ over alphabet $\{s_1, \ldots, s_b\}$. The observed character counts $c_i = \sum_{j=1}^{n} I(X_j = s_i)$ are used to calculate the type $(d_1, \ldots, d_n)$ according to $d_k = \sum_{i=1}^{b} I(c_i = k)$. The probability of the type is then given by

\[ P(d_1, \ldots, d_n) = b^{-n} \left( \frac{n!}{\prod_{1 \le i \le b} c_i!} \right) \left( \frac{b!}{\prod_{0 \le k \le n} d_k!} \right) \]

Note that lower SEG scores correspond to improbable types and hence lower complexity. In the subsequent sections we assume that the negative SEG score is used, in order to be consistent with other scoring functions which assign higher scores to lower complexity strings.

Both the SDUST and SEG scores can be computed for all prefixes of a length $n$ string in $O(n)$ time. This allows all substrings of length $n$ or less in an input sequence of length $N$ to be scored in $O(nN)$ time.

The new filtering algorithm presented in this paper uses a scoring function based on the Lempel-Ziv (LZ) complexity of strings. The LZ complexity counts the number of distinct patterns in a sequence when scanned from left to right. For instance, the LZ complexity of the binary sequence 011010010000111 is 7 because its distinct patterns appearing from left to right are 0|1|10|100|1000|01|11. To obtain a further refined complexity estimate, the scoring function uses the negative log-probability of the observed LZ complexity for a random string of the corresponding length and alphabet size.

The LZ complexity $LZ(X)$ of all prefixes of a string $X = X_1, \ldots, X_n$ can be calculated in $O(n)$ time using an algorithm based on suffix trees (Ukkonen, 1995). Thus, computing the LZ complexities for all substrings up to the window length $n$ in a sequence of length $N$ can be performed in $O(nN)$ time, just as with the SDUST and SEG scores. The final complexity score of $X$ is obtained by performing a constant-time lookup of the probability of observing LZ complexity $LZ(X)$ in random strings of the same length as $X$. For alphabet size $b$, this lookup table can be computed for all lengths up to $n$ in O(n2b+1 ) time using the recurrence from Doğanaksoy and Göloğlu (2006) with a straightforward generalization to alphabet sizes $b > 2$. Using our implementation, it was possible to compute the lookup tables for $b = 4$ and 20 out to $n = 100$ and 48, respectively, within an hour of wall-clock time each.

C. Masking Algorithms

Each filtering algorithm applies its scoring function to every subsequence of an input sequence up to a window length $n$. For each character in a sequence, the decision whether to mark it as part of an LCR must be made based on a collection of possibly contradictory complexity scores. Specifically, there are $n(n+1)/2$ substrings of length $n$ or less which contain a particular character.

To formalize, denote by $s_{i,l}$ the score of the subsequence $X_{i,l} = X_i, \ldots, X_{i+l-1}$ of length $l$ beginning at position $i$. Let $S^{+}_{i,l}$ denote the set of scores for all superstrings (inclusive) of $X_{i,l}$, and let $S^{-}_{i,l}$ denote the set of scores for all substrings (inclusive) of $X_{i,l}$. As mentioned before, the number of superstring scores relevant to a single character at position $i$ is $|S^{+}_{i,1}| = n(n+1)/2$.

The filtering algorithms covered here take advantage of the following procedure, which aggregates substring or superstring scores in $O(n)$ time, despite the fact that these sets contain $O(n^2)$ scores. In this description, the max function is used, but any aggregate function can be substituted in its place (this will be used in section II-D in order to efficiently calculate statistical features based on superstring scores). At each position $i = 1, \ldots, N$ in an input sequence, store the maximum score for all prefixes of length $l$ or less, $M^{-}_{i,l} = \max_{1 \le l' \le l}(s_{i,l'})$, and likewise for suffixes, $M^{+}_{i,l} = \max_{l \le l' \le n}(s_{i,l'})$. The quantities $\max(S^{-}_{i,l})$ and $\max(S^{+}_{i,l})$ can then be computed in $O(n)$ time from $\max_{1 \le k \le l}(M^{-}_{i+k-1,\,l-k+1})$ and $\max_{1 \le k \le n-l+1}(M^{+}_{i-k+1,\,l+k-1})$.

SDUST masks a character at position $i$ if any score $s_{i',l'} \in S^{+}_{i,1}$ is equal to $\max(S^{-}_{i',l'})$ and greater than a threshold $T$. That is, a character is part of an LCR if it belongs to a string whose score is greater than $T$ and which contains no higher-scoring substring. This method is referred to as the perfect interval method.

SEG, on the other hand, employs a two-stage procedure where a window at position $i$ and of length $n$ will only be inspected for LCRs if its Shannon entropy falls below a threshold. If it does, the maximum scoring subsequence is masked; i.e. the subsequence at $i'$ of length $l'$ where $s_{i',l'} = \max(S^{-}_{i,n})$. This method is referred to as the maximum method.

Two additional heuristic filtering methods are proposed, which operate on the score sequence $S = \max(S^{+}_{1,1}), \max(S^{+}_{2,1}), \ldots, \max(S^{+}_{N,1})$, or superstring scores for each position of the input sequence.

The first such method simply masks any portion of the sequence for which $S$ is a local maximum which attains a value greater than a threshold $T$. This method is referred to as the local maximum method.

The second such method masks portions of the sequence where $S$ is flat for more than 10 consecutive positions, and greater than a threshold $T$. These intervals are referred to as plateaus and hence this method is referred to as the plateau method. Plateaus which are separated by monotonically increasing or decreasing scores are joined, i.e. the intervening region is masked along with the plateaus being joined.

Figure 1 shows a plot of the superstring scores for the following example sequence, using the SDUST scoring function, and shows the regions which would be masked by the local maximum and plateau methods. The sequence was generated from a model with a Markov LCR of length 100 bordered by two HCRs of length 50. The LCR model had entropy rate of approximately 2 bits.

MPCTFGPLIIVDAERNSEVLDHLWLLCLWCTMPFSEVIGSENIPTVMYFP
SLMKGGMKQSLMKQSLTMKGEACACQSLMTWMMTWMTWMMKQSQYGDHAC
IEACFFVQSQPSLMMKQSLMTEACFVEAIQSIEADPSQSIQSQSLMQSIE
SQEMTDQVQIAVQQGLDLCSLQGWIHSSWPKEEPIFKHRADHTMRMHFRP

[Figure 1 plot: superstring score (vertical axis, 0 to 12) versus sequence position (horizontal axis, 0 to 200).]

Fig. 1: Superstring scores along an artificial protein sequence with a Markov LCR from position 51 to 150. The horizontal line shows the score threshold $T$. Lengths of key plateaus are indicated in the diagram. Regions masked by the local maximum method are shown in solid gray, while regions masked by the plateau method are shown with diagonal stripes. Note that the plateau method joined the two plateaus on the left since they are connected by a strictly increasing interval. The plateau of length 10 is below the minimum plateau length.
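
The local maximum and plateau rules of section II-C can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the function names (`flat_runs`, `plateau_mask`, `local_max_mask`) are hypothetical, a "local maximum" is read here as a maximal run of equal scores whose neighbouring runs are both lower (the sequence ends count as lower), and the joining of plateaus across monotone intervals is omitted for brevity.

```python
from itertools import groupby


def flat_runs(scores):
    """Yield (start, end, value) for each maximal run of equal scores (end exclusive)."""
    pos = 0
    for value, group in groupby(scores):
        length = sum(1 for _ in group)
        yield pos, pos + length, value
        pos += length


def plateau_mask(scores, threshold, min_len=10):
    """Mask flat runs longer than min_len positions whose score exceeds threshold.

    Assumption: plateau joining across monotone intervals is NOT implemented.
    """
    mask = [False] * len(scores)
    for start, end, value in flat_runs(scores):
        if end - start > min_len and value > threshold:
            mask[start:end] = [True] * (end - start)
    return mask


def local_max_mask(scores, threshold):
    """Mask runs that exceed threshold and are higher than both neighbouring runs.

    Assumption: a run at either end of the sequence has no neighbour on that
    side, and the missing neighbour is treated as lower.
    """
    runs = list(flat_runs(scores))
    mask = [False] * len(scores)
    for k, (start, end, value) in enumerate(runs):
        left_lower = k == 0 or runs[k - 1][2] < value
        right_lower = k == len(runs) - 1 or runs[k + 1][2] < value
        if value > threshold and left_lower and right_lower:
            mask[start:end] = [True] * (end - start)
    return mask
```

For instance, with a threshold of 5, a flat run of 12 eights flanked by lower scores is masked by both rules, whereas a flat run of exactly 10 nines is masked only by the local maximum rule, since a plateau must span more than 10 positions.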
D. Features

The low-complexity filtering task is formulated as a classification problem where each character is assigned to class 1 if it is part of an LCR (masked) and class 0 otherwise. This section describes the set of features for each character using the terminology and methods developed in sections II-B and II-C.

The following are used as features for a character at position $i$ in the input sequence. Features 1–9 are 0/1 variables indicating whether position $i$ was masked by the corresponding scoring function and masking method.

1) SDUST scoring + perfect interval (default)
2) SEG scoring + two-stage maximum method (default)
3) LZ scoring + perfect interval
4) SDUST scoring + local maximum
5) SEG scoring + local maximum
6) LZ scoring + local maximum
7) SDUST scoring + plateau
8) SEG scoring + plateau
9) LZ scoring + plateau
10) Maximum SDUST superstring score, $\max(S^{+}_{i,1})$
11) Mean SDUST superstring score, $\mathrm{mean}(S^{+}_{i,1})$
12) SDUST superstring score variance, $\mathrm{Var}(S^{+}_{i,1})$
13) Maximum SEG superstring score, $\max(S^{+}_{i,1})$
14) Mean SEG superstring score, $\mathrm{mean}(S^{+}_{i,1})$
15) SEG superstring score variance, $\mathrm{Var}(S^{+}_{i,1})$
16) Maximum LZ superstring score, $\max(S^{+}_{i,1})$
17) Mean LZ superstring score, $\mathrm{mean}(S^{+}_{i,1})$
18) LZ superstring score variance, $\mathrm{Var}(S^{+}_{i,1})$

The statistical features 10–18 can be calculated in $O(n)$ time, where $n$ is the window length, according to the procedure given in section II-C. To compute the means, scores in $S^{+}_{i,1}$ are summed and then divided by $n(n+1)/2$. To compute the variances, scores in $S^{+}_{i,1}$ are squared and averaged, and the variance is obtained using the identity $\mathrm{Var}(X) = E(X^2) - E(X)^2$.

III. RESULTS

The SVM filtering method was evaluated in two 10-fold cross-validation experiments on datasets of 1000 artificially generated DNA sequences, and 1000 protein sequences. Each sequence was generated according to a randomly chosen format: either an LCR bordered by two HCRs (HCR-LCR-HCR) or an HCR bordered by two LCRs (LCR-HCR-LCR). The first and last regions always have length 100, while the second region was assigned a length between 10 and 100 uniformly at random. LCR entropy rates were drawn uniformly at random between 0 and 50% of the maximum possible entropy of $\log_2 b$ bits, where $b$ is alphabet size. HCR entropy rates were chosen uniformly at random between 80 and 100% of $\log_2 b$. LCR type (biased versus Markov source) was chosen uniformly at random for each region.

For features generated by running SDUST or SEG, the threshold $T$ was set to correspond to $0.65 \log_2 b$ when applicable (a threshold is only used in the two-stage version of SEG). For the Lempel-Ziv based filter, a threshold of 16 was chosen experimentally (this corresponds to a probability of $2^{-16}$).

In all experiments, the SVM training was done in MATLAB using the least-squares solver, with a quadratic kernel function. Features were auto-scaled to have unit standard deviation. In order to achieve reasonable run-time whilst maintaining a diversity of examples, 10,000 training points were chosen at random from the 900 training sequences in each fold. This achieved reasonable coverage (about 20%) of the total length of the training sequences.

The SVM filtering performance was evaluated with various sets of features in order to assess their value. First, features pertaining to the LZ scoring were omitted, leaving features 1, 2, 4, 5, 7, 8, and 10–15. Next, all features 1–18 were included, in order to assess whether the addition of LZ features is valuable. Additionally, an experiment including only features 1–9 tested whether the superstring score features were valuable. These three experiments are referred to as SVM-no-LZ, SVM-LZ, and SVM-no-score, respectively.

Figure 2 shows mean test-set accuracy of the stand-alone LZ, SEG, and SDUST filtering algorithms, alongside the three SVM variants, for DNA and protein.

[Figure 2 plot: two bar charts, "DNA Filtering Accuracy" and "Protein Filtering Accuracy", each with vertical axis "Percent Accuracy" (0 to 100) and bars for LZ, SEG, SDUST, SVM-no-score, SVM-no-LZ, and SVM-LZ. Bar labels appearing in the plots: 69.5%, 75.6%, 77.7%, 79.9%, 87.9%, 88.2%, 89.1%, 90.5%, 92.5%, 93.7%, 94.9%, 95.3%.]

Fig. 2: Mean accuracy of the various filtering algorithms on the DNA and protein test-sets. Error bars extend one standard deviation in both directions, where the standard deviation is measured over the 10 cross-validation folds.

First, the value of including LZ based features and score based features was assessed. For DNA, SVM-LZ performs slightly better than SVM-no-LZ, and the improvement was significant (p<1e-3) using a paired t-test over the 10 folds. Furthermore, SVM-LZ performed substantially better than SVM-no-score (p<1e-5). Likewise for protein, SVM-LZ performed better than SVM-no-LZ (p<1e-5), and SVM-LZ greatly outperformed SVM-no-score (p<1e-9).

Second, the value of including alternate heuristic masking methods was assessed. For DNA, inclusion of the local maximum and plateau based features increased mean classification accuracy by only 0.4%, which was statistically insignificant (p=0.021). For protein, however, a 0.6% increase in accuracy was found to be statistically significant (p<1e-04).

Finally, the increase in performance of SVM-LZ versus the best performing stand-alone filtering method was quantified. For both DNA and protein, SDUST was the best performing stand-alone filter. For DNA, SVM-LZ improved mean accuracy versus SDUST by 5.7% (p<1e-5) from 87.9% to 93.7%, which is a 47% reduction in misclassifications. For protein, SVM-LZ improved mean accuracy versus SDUST by 6.2% (p<1e-9) from 89.1% to 95.4%, which is a 57% reduction in misclassifications.

Figure 3 shows a visualization of SVM-LZ filtering versus SEG and SDUST on test sequences not in the original test-set. The sequences were generated by the same procedure except that the middle region is always of length 50 for illustrative purposes.

Figure 4 shows ROC curves for SVM-LZ on the DNA and protein test-sets.

[Figure 3 panels: masking diagrams in four rows and two columns; vertical axis is sequence index (ticks at 25, 50, 75, 100), horizontal axis is sequence position (ticks at 101 and 150).]

Fig. 3: Masking of artificial DNA (left) and protein (right) by SEG (second row), SDUST (third row), and SVM-LZ (last row). Diagrams on the first row show the correct LCRs according to the generating models. In each diagram there are 100 sequences of length 250 arranged top to bottom, and black indicates masked regions. Sequences 1–25 and 51–75 were generated using Markov LCRs, which explains why SEG has the most difficulty detecting these regions. Note that SDUST is the default algorithm in BLAST for masking DNA sequences, and SEG is the default algorithm for protein. It is apparent that SVM-LZ reduces false positives in both DNA and protein, while retaining most of the true positives detected by either SEG or SDUST.

[Figure 4 plot: ROC curves; horizontal axis (false positives) on a logarithmic scale from $10^{-6}$ to $10^{0}$, vertical axis (true positives) from 0 to 1.]

Fig. 4: ROC curves for SVM-LZ on the DNA (solid) and protein (dashed) test-sets. Note that false positives (x-axis) correspond to HCR regions which were incorrectly masked, meaning that local alignments would be prevented from beginning in these (potentially interesting) portions of the sequence. True positives (y-axis) correspond to correctly masking LCRs, and thus a potential for improving the run-time of a BLAST search.

IV. CONCLUSION

On a set of artificially generated DNA and protein sequences, the SVM classifier-based filtering algorithm was shown to significantly increase accuracy versus standalone filtering algorithms SDUST and SEG by combining a novel set of features derived from the SDUST and SEG designs, as well as a new filtering method based on Lempel-Ziv sequence complexity. Classification errors were reduced by 47% and 57% on DNA and protein sequences, respectively, versus the best performing standalone algorithm (SDUST).

The increased accuracy of the SVM-based classifier comes at a negligible performance cost, since SVM classification can be done in time linear in the input sequence length $N$. Since all scoring functions permit the scoring of every substring up to the window length $n$ in an input sequence of length $N$ in $O(nN)$ time, and features can be computed in $O(n)$ time for each of the $N$ positions, the overall run-time is $O(nN)$. This asymptotic run-time is the same as for SDUST, SEG, and the LZ filter. The constant of proportionality in an efficient implementation would be small, since SVM classification with a quadratic kernel is relatively simple.

Reducing complexity classification errors by nearly half may have significant impacts on BLAST search runtime. The extremely low complexity regions which cause the most severe slowdowns in BLAST are unlikely to be missed by the standalone algorithms, however, and these experiments did not attempt to characterize what kind of regions are misclassified by SDUST or SEG. Perhaps more important than BLAST runtime is the potential for misclassifications to lead to significant local alignments being missed. For negligible added run-time, the SVM based filtering algorithm significantly reduces false-positive masking errors, and thus reduces the chance that an important alignment will not be found.
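
The "nearly half" figure can be checked against the accuracies reported in section III. As a worked sketch (the symbols $a_{\mathrm{base}}$ and $\Delta$ are introduced here purely for illustration), the relative reduction in misclassifications is the accuracy gain $\Delta$ divided by the error rate of the baseline filter, $100\% - a_{\mathrm{base}}$, with SDUST as the baseline in both cases:

```latex
\[
\text{reduction} = \frac{\Delta}{100 - a_{\mathrm{base}}}, \qquad
\text{DNA: } \frac{5.7}{100 - 87.9} = \frac{5.7}{12.1} \approx 0.47, \qquad
\text{protein: } \frac{6.2}{100 - 89.1} = \frac{6.2}{10.9} \approx 0.57 .
\]
```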
Certain types of sequences did not appear in these experi-
ments, but the methodology is easily extended to any labelled
dataset. For example, one might wish to include real biological
sequences that have been annotated by experts. Furthermore, it
might be valuable to construct artificial sequences using more
sophisticated or diverse models; for example, sequences other
than the simple HCR-LCR-HCR and LCR-HCR-LCR formats,
or more sophisticated LCR or HCR models.
Finally, these experiments show that classifying local re-
gions of a sequence based on their complexity is a subtle and
difficult problem, and that an algorithm constructed around
a single scoring function will have inherent weaknesses. An
effective low-complexity filter should combine several notions
of complexity estimation, as well as several techniques for
refining the boundaries of low-complexity regions. The clas-
sification based filtering algorithm presented in this paper is
shown to successfully leverage the complementary strengths
of several low-complexity filtering algorithms, through signif-
icant increases in accuracy on a diverse artificial dataset.
ACKNOWLEDGMENT
The authors would like to thank Jeremy Teuton for his
comments on this manuscript.
REFERENCES
Altschul, S. F., Madden, T. L., Schäffer, A. A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D. J. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research, 25(17), 3389–3402.
Doğanaksoy, A. and Göloğlu, F. (2006). On Lempel-Ziv complexity of sequences. Sequences and Their Applications–SETA 2006, pages 180–189.
Morgulis, A., Gertz, E., Schäffer, A., and Agarwala, R. (2006). A fast and symmetric DUST implementation to mask low-complexity DNA sequences. Journal of Computational Biology, 13(5), 1028–1040.
Ukkonen, E. (1995). On-line construction of suffix trees. Algorithmica, 14(3), 249–260.
Wootton, J. C. and Federhen, S. (1993). Statistics of local complexity in amino acid sequences and sequence databases. Computers and Chemistry, 17(2), 149–163.
