Putational - Biology.and - Bioinformatics.vol.2.Issue.1.Jan.2005.eBook Li
1, JANUARY-MARCH 2005 1
Junhyong Kim is the Edmund J. and Louise Kahn Term Endowed Professor in the Department of Biology at the University of Pennsylvania. He holds joint appointments in the Department of Computer and Information Science, Penn Center for Bioinformatics, and the Penn Genomics Institute. He serves on the editorial board of Molecular Development and Evolution and the IEEE/ACM Transactions on Computational Biology and Bioinformatics, the council of the Society for Systematic Biology, and the executive committee of the Cyber Infrastructure for Phylogenetics Research. His research focuses on computational and experimental approaches to comparative development. The current focus of his lab is in three areas: computational phylogenetics, in silico gene discovery, and comparative development using genome-wide gene expression data.

Inge Jonassen is a professor of computer science in the Department of Informatics at the University of Bergen in Norway, where he is a member of the bioinformatics group. He is also affiliated with the Bergen Center for Computational Science at the same university where he heads the Computational Biology Unit. He is also vice president of the Society for Bioinformatics in the Nordic Countries (SocBiN) and a member of the board of the Nordic Bioinformatics Network. He coordinates the technology platform for bioinformatics funded by the Norwegian Research Council functional genomics programme FUGE. He has worked in the field of bioinformatics since the early 1990s, where he has primarily focused on methods for discovery of patterns with applications to biological sequences and structures and on methods for the analysis of microarray gene expression data.
Abstract—We describe an algorithm for comparing two RNA secondary structures coded in the form of trees that introduces two new
operations, called node fusion and edge fusion, besides the tree edit operations of deletion, insertion, and relabeling classically used in
the literature. This allows us to address some serious limitations of the more traditional tree edit operations when the trees represent
RNAs and what is searched for is a common structural core of two RNAs. Although the algorithm complexity has an exponential term, this
term depends only on the number of successive fusions that may be applied to the same node, not on the total number of fusions. The
algorithm remains therefore efficient in practice and is used for illustrative purposes on ribosomal as well as on other types of RNAs.
1 INTRODUCTION
Fig. 3. Edit operations: (a) the original tree T, (b) deletion of the node
labeled D, (c) insertion of the node labeled I, and (d) relabeling of a
node in T (the label A of the root is changed into K).
Fig. 4. Illustration of the scattering effect problem. Result of the matching of two RNAsePs, of Streptococcus gordonii and of Thermotoga maritima,
using the model given in Fig. 2b.
location of an unpaired base is not taken into account. It is therefore possible to match, for instance, an unpaired base from a hairpin loop with an unpaired base from a multiloop. Using another type of representation, as we shall do, would, however, not be enough to solve all problems as we see next.

Indeed, to compare the same two RNAs, we can also use a more abstract tree representation such as the one given in Fig. 2d. In this case, the internal nodes represent a multiloop, internal-loop, or bulge, the leaves code for hairpin loops and edges for helices. The result of the edition of T into T' for some cost function is presented in Fig. 5 (we shall come back later to the cost functions used in the case of such more abstract RNA representations; for the sake of this example, we may assume an arbitrary one is used).

The problem we wish to illustrate in this case is shown by the boxes in the figure. Consider the boxes at the bottom. In the left RNA, we have a helix made up of 13 base pairs. In the right RNA, the helix is formed by seven base pairs followed by an internal loop and another helix of size 5. By definition (see Section 2), the algorithm can only associate one element in the first tree to one element in the second tree. In this case, we would like to associate the helix of the left tree to the two helices of the second tree since it seems clear that the internal loop represents either an inserted element in the second RNA, or the unbonding of one base pair. This, however, is not possible with classical edit operations.

A third type of problem one can meet when using only the three classical edit operations to compare trees standing for RNAs is similar to the previous one, but concerns this time a node instead of edges in the same tree representation. Often, an RNA may present a very small helix between two elements (multiloop, internal-loop, bulge, or hairpin-loop) while such helix is absent in the other RNA. In this case, we would therefore have liked to be able to associate one node in a tree representing an RNA with two or more
Fig. 5. Illustration of the one-to-one association problem with edges. Result of the matching of the two RNAsePs, of Saccharomyces uvarum and of
Saccharomyces kluyveri, using the model given in Fig. 2d.
ALLALI AND SAGOT: A NEW DISTANCE FOR HIGH LEVEL RNA SECONDARY STRUCTURE COMPARISON 7
Fig. 6. Illustration of the one-to-one association problem with nodes. The two RNAs used here are RNAsePs from Pyrococcus furiosus and
Metallosphaera sedula. Triangles stand for bulges, diamonds stand for internal loops, and squares for hairpin loops.
nodes in the tree for the other RNA. Once again, this is not possible with any of the classical tree edit operations. An illustration of this problem is shown in Fig. 6.

We shall use RNA representations that take the elements of the structure of an RNA into account to avoid some of the scattering effect. Furthermore, in addition to considering information of a structural nature, labels are attached, in general, to both nodes and edges of the tree representing an RNA. Such labels are numerical values (integers or reals). They represent in most cases the size of the corresponding element, but may also further indicate its composition, etc. Such additional information is then incorporated into the cost functions for all three edit operations. It is important to observe that when dealing with trees labeled at both the nodes and edges, any node and the edge that leads to it (or, in an alternative perspective, departs from it) represent a single object from the point of view of computing an edit distance between the trees.

It remains now to deal with the last two problems that are a consequence of the one-to-one associations between nodes and edges enforced by the classical tree edit operations. To that purpose, we introduce two novel tree edit operations, called the edge fusion and the node fusion.

4 INTRODUCING NOVEL TREE EDIT OPERATIONS
4.1 Edge Fusion and Node Fusion
In order to address some of the limitations of the classical tree edit operations that were illustrated in the previous section, we need to introduce two novel operations. These are the edge fusion and the node fusion. They may be applied to any of the tree representations given in Figs. 2c, 2d, and 2e.

An example of edge fusion is shown in Fig. 7a. Let $e_u$ be an edge leading to a node $u$, $c_i$ a child of $u$ and $e_{c_i}$ the edge between $u$ and $c_i$. The edge fusion of $e_u$ and $e_{c_i}$ consists in replacing $e_{c_i}$ and $e_u$ with a new single edge $e$. The edge $e$ links the father of $u$ to $c_i$. Its label then becomes a function of the (numerical) labels of $e_u$, $u$ and $e_{c_i}$. For instance, if such labels indicated the size of each element (e.g., for a helix, the number of its stacked pairs, and for a loop, the min, max or the average of its unpaired bases on each side of the loop), the label of $e$ could be the sum of the sizes of $e_u$, $u$ and $e_{c_i}$. Observe that merging two edges implies deleting all subtrees rooted at the children $c_j$ of $u$ for $j$ different from $i$. The cost of such deletions is added to the cost of the edge fusion.

An example of node fusion is given in Fig. 7b. Let $u$ be a node and $c_i$ one of its children. Performing a node fusion of $u$ and $c_i$ consists in making $u$ the father of all children of $c_i$ and in relabeling $u$ with a value that is a function of the values of the labels of $u$, $c_i$ and of the edge between them. Observe that a node fusion may be simulated using the classical edit operations by a deletion followed by a relabeling. However, the difference between a node fusion and a deletion/relabeling is in the cost associated with both operations. We shall come back to this point later.

Obviously, like insertions or deletions, edge fusions and node fusions have of course symmetric counterparts, which are the edge split and the node split.

Fig. 7. (a) An example of edge fusion. (b) An example of node fusion.

Given two rooted, ordered, and labeled trees $T$ and $T'$, we define the “edit distance with fusion” between $T$ and $T'$
8 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 2, NO. 1, JANUARY-MARCH 2005
Fig. 8. Zhang and Shasha’s dynamic programming algorithm: the tree distance part. The right box corresponds to the additional operations added to
take fusion into account.
as $\mathrm{distance}_{\mathrm{fusion}}(T, T') = \min \{ \mathrm{cost}_S \mid T \xrightarrow{S} T' \}$ with $\mathrm{cost}_s$ the cost associated to each of the seven edit operations now considered (relabeling, insertion, deletion, node fusion and split, edge fusion and split).

Proposition 1. If the following is verified:
• $\mathrm{cost}_{\mathrm{match}}(a, b)$ is a distance,
• $\mathrm{cost}_{\mathrm{ins}}(a) = \mathrm{cost}_{\mathrm{del}}(a) \geq 0$,
• $\mathrm{cost}_{\mathrm{nodefusion}}(a, b, c) = \mathrm{cost}_{\mathrm{nodesplit}}(a, b, c) \geq 0$, and
• $\mathrm{cost}_{\mathrm{edgefusion}}(a, b, c) = \mathrm{cost}_{\mathrm{edgesplit}}(a, b, c) \geq 0$,
then $\mathrm{distance}_{\mathrm{fusion}}$ is indeed a distance.

Proof. The positiveness of $\mathrm{distance}_{\mathrm{fusion}}$ is given by the fact that all elementary cost functions are positive. Its symmetry is guaranteed by the symmetry in the costs of the insertion/deletion and (node/edge) fusion/split operations. Finally, it is straightforward to see that $\mathrm{distance}_{\mathrm{fusion}}$ satisfies triangular inequality. □

Besides the above properties that must be satisfied by the cost functions in order to obtain a distance, others may be introduced for specific purposes. Some will be discussed in Section 5.

We now present an algorithm to compute the tree edit distance between two trees using the classical tree edit operations plus the two operations just introduced.

4.2 Algorithm
The method we introduce is a dynamic programming algorithm based on the one proposed by Zhang and Shasha. Their algorithm is divided in two parts: They first compute the edit distance between two trees (this part is denoted by TDist) and then the distance between two forests (this part is denoted by FDist). Fig. 8 illustrates in pictorial form the part TDist and Fig. 9 the FDist part of the computation.

In order to take our two new operations into account, we need to compute a few more things in the TDist part. Indeed, we must add the possibility for each tree to have a node fusion (inversely, node split) between the root and one of its children, or to have an edge fusion (inversely edge split) between the root and one of its children. These additional operations are indicated in the right box of Fig. 8.

We present now a formal description of the algorithm. Let $T$ be an ordered rooted tree with $|T|$ nodes. We denote by $t_i$ the $i$th node in a postfix order. For each node $t_i$, $l(i)$ is the index of the leftmost leaf of the subtree rooted at $t_i$. Let $T(i \ldots j)$ denote the forest composed by the nodes $t_i \ldots t_j$ $(T \equiv T(0 \ldots |T|))$. To simplify notation, from now on, when there is no ambiguity, $i$ will refer to the node $t_i$. In this case, $\mathrm{distance}(i_1 \ldots i_2,\; j_1 \ldots j_2)$ will be equivalent to $\mathrm{distance}(T(i_1 \ldots i_2),\; T'(j_1 \ldots j_2))$.

The algorithm of Zhang and Shasha is fully described by the following recurrence formula:

if $(i_1 == l(i_2))$ and $(j_1 == l(j_2))$
\[
\min \begin{cases}
\mathrm{distance}(i_1 \ldots i_2 - 1,\; j_1 \ldots j_2) + \mathrm{cost}_{\mathrm{del}}(i_2) \\
\mathrm{distance}(i_1 \ldots i_2,\; j_1 \ldots j_2 - 1) + \mathrm{cost}_{\mathrm{ins}}(j_2) \\
\mathrm{distance}(i_1 \ldots i_2 - 1,\; j_1 \ldots j_2 - 1) + \mathrm{cost}_{\mathrm{match}}(i_2, j_2)
\end{cases}
\tag{1}
\]
else
\[
\min \begin{cases}
\mathrm{distance}(i_1 \ldots i_2 - 1,\; j_1 \ldots j_2) + \mathrm{cost}_{\mathrm{del}}(i_2) \\
\mathrm{distance}(i_1 \ldots i_2,\; j_1 \ldots j_2 - 1) + \mathrm{cost}_{\mathrm{ins}}(j_2) \\
\mathrm{distance}(i_1 \ldots l(i_2) - 1,\; j_1 \ldots l(j_2) - 1) + \mathrm{distance}(l(i_2) \ldots i_2,\; l(j_2) \ldots j_2)
\end{cases}
\tag{2}
\]

Part (1) of the formula corresponds to Fig. 8, while part (2) corresponds to Fig. 9. In practice, the algorithm stores in a matrix the score between each subtree of $T$ and $T'$. The space complexity is therefore $O(|T| \times |T'|)$. To reach this complexity, the computation must be done in a certain order (see Section 4.3). The time complexity of the algorithm is
\[
O\bigl(|T| \times \min(\mathrm{leaf}(T), \mathrm{height}(T)) \times |T'| \times \min(\mathrm{leaf}(T'), \mathrm{height}(T'))\bigr),
\]
where $\mathrm{leaf}(T)$ and $\mathrm{height}(T)$ represent, respectively, the number of leaves and the height of a tree $T$.
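Recurrences (1) and (2) can be illustrated by the following minimal Python sketch. It is not the implementation used in this paper: it evaluates the recurrences by memoized recursion rather than in the table order discussed in Section 4.3, and the nested-tuple tree encoding, the helper names `postorder` and `tree_edit_distance`, and the unit default costs are all assumptions made for the example.

```python
from functools import lru_cache


def postorder(tree):
    """Return (labels, l), both 1-indexed in postfix order (index 0 unused).

    tree is a nested tuple (label, [children]); l[i] is the index of the
    leftmost leaf of the subtree rooted at node i.
    """
    labels, l = [None], [None]

    def visit(node):
        label, children = node
        first = None
        for child in children:
            leftmost = visit(child)
            if first is None:
                first = leftmost
        labels.append(label)
        idx = len(labels) - 1
        l.append(first if first is not None else idx)
        return l[idx]

    visit(tree)
    return labels, l


def tree_edit_distance(t1, t2,
                       cost_del=lambda a: 1,
                       cost_ins=lambda b: 1,
                       cost_match=lambda a, b: 0 if a == b else 1):
    """Classical tree edit distance (no fusion), following (1) and (2)."""
    a, la = postorder(t1)
    b, lb = postorder(t2)

    @lru_cache(maxsize=None)
    def distance(i1, i2, j1, j2):
        # Empty forests: only insertions or only deletions remain.
        if i2 < i1 and j2 < j1:
            return 0
        if i2 < i1:
            return distance(i1, i2, j1, j2 - 1) + cost_ins(b[j2])
        if j2 < j1:
            return distance(i1, i2 - 1, j1, j2) + cost_del(a[i2])
        delete = distance(i1, i2 - 1, j1, j2) + cost_del(a[i2])
        insert = distance(i1, i2, j1, j2 - 1) + cost_ins(b[j2])
        if la[i2] == i1 and lb[j2] == j1:
            # Both forests are single trees: recurrence (1).
            both = distance(i1, i2 - 1, j1, j2 - 1) + cost_match(a[i2], b[j2])
        else:
            # General forests: recurrence (2).
            both = (distance(i1, la[i2] - 1, j1, lb[j2] - 1)
                    + distance(la[i2], i2, lb[j2], j2))
        return min(delete, insert, both)

    return distance(1, len(a) - 1, 1, len(b) - 1)


# Two small example trees (hypothetical data, not RNA structures).
T1 = ("f", [("d", [("a", []), ("c", [("b", [])])]), ("e", [])])
T2 = ("f", [("c", [("d", [("a", []), ("b", [])])]), ("e", [])])
```

With unit costs, `tree_edit_distance(T1, T2)` evaluates to 2: deleting the node labeled c in the first tree and inserting a node labeled c above d yields the second tree. The memoized form trades the space bound stated above for brevity and is meant only to make the two cases of the recurrence concrete.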
Fig. 9. Zhang and Shasha’s dynamic programming algorithm: the forest distance part.
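Before turning to the recurrence with fusions, the two operations of Section 4.1 can be made concrete on a small labeled tree. The following Python sketch is only an illustration under stated assumptions: the `Node` class, the function names, and the choice of the sum as the relabeling function are hypothetical (the text only says that the new label is a function of the labels involved, the sum being one example), and no cost is computed.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    size: float                 # node label, e.g. the size of a loop
    edge: float = 0.0           # label of the edge leading to this node
    children: List["Node"] = field(default_factory=list)


def node_fusion(u: Node, i: int) -> None:
    """Fuse u with its i-th child c: u adopts the children of c in place of c.

    The new label of u is taken to be the sum of the labels of u, c and
    the edge between them (an assumed relabeling function).
    """
    c = u.children[i]
    u.children[i:i + 1] = c.children
    u.size = u.size + c.edge + c.size


def edge_fusion(parent: Node, k: int, i: int) -> List[Node]:
    """Fuse the edge leading to u = parent.children[k] with the edge from u
    to its i-th child c.

    The new edge links parent to c and is labeled with the sum of the labels
    of the two edges and of u (assumed). The subtrees rooted at the other
    children of u are deleted and returned, since their deletion cost must
    be added to the cost of the fusion.
    """
    u = parent.children[k]
    c = u.children[i]
    deleted = u.children[:i] + u.children[i + 1:]
    c.edge = u.edge + u.size + c.edge
    parent.children[k] = c
    return deleted


def build_example():
    """A small tree with arbitrary labels: root -> u -> (c1 -> g, c2)."""
    g = Node(size=6, edge=2)
    c1 = Node(size=1, edge=5, children=[g])
    c2 = Node(size=7, edge=3)
    u = Node(size=2, edge=4, children=[c1, c2])
    root = Node(size=3, children=[u])
    return root, u, c1, c2, g
```

On this example, `node_fusion(u, 0)` leaves `u` with children `g` and `c2` and label 2 + 5 + 1 = 8, while `edge_fusion(root, 0, 0)` on a fresh copy attaches `c1` directly to the root with an edge labeled 4 + 2 + 5 = 11 and reports the subtree rooted at `c2` as deleted.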
The formula to compute the edit score allowing for both node and edge fusions follows.

if $(i_1 == l(i_k))$ and $(j_1 == l(j_{k'}))$
\[
\min \begin{cases}
\mathrm{distance}(\{i_1 \ldots i_{k-1}\}, \emptyset, \{j_1 \ldots j_{k'}\}, \mathit{path}') + \mathrm{cost}_{\mathrm{del}}(i_k) \\
\mathrm{distance}(\{i_1 \ldots i_k\}, \mathit{path}, \{j_1 \ldots j_{k'-1}\}, \emptyset) + \mathrm{cost}_{\mathrm{ins}}(j_{k'}) \\
\mathrm{distance}(\{i_1 \ldots i_{k-1}\}, \emptyset, \{j_1 \ldots j_{k'-1}\}, \emptyset) + \mathrm{cost}_{\mathrm{match}}(i_k, j_{k'}) \\
\text{for each child } i_c \text{ of } i_k \text{ in } \{i_1, \ldots, i_k\}, \text{ set } i_l = l(i_c): \\
\quad \mathrm{distance}(\{i_1 \ldots i_{c-1}, i_{c+1} \ldots i_k\}, \mathit{path}.(u, i_c), \{j_1 \ldots j_{k'}\}, \mathit{path}') \\
\qquad + \mathrm{cost}_{\mathrm{node\,fusion}}(i_c, i_k) \quad (\text{obs.: } i_k \text{ data are changed}) \\
\quad \mathrm{distance}(\{i_l \ldots i_{c-1}, i_k\}, \mathit{path}.(e, i_c), \{j_1 \ldots j_{k'}\}, \mathit{path}') \\
\qquad + \mathrm{cost}_{\mathrm{edge\,fusion}}(i_c, i_k) + \mathrm{distance}(\{i_1 \ldots i_{l-1}\}, \emptyset, \emptyset, \emptyset) \\
\qquad + \mathrm{distance}(\{i_{c+1} \ldots i_{k-1}\}, \emptyset, \emptyset, \emptyset) \quad (\text{obs.: } i_k \text{ data are changed}) \\
\text{for each child } j_{c'} \text{ of } j_{k'} \text{ in } \{j_1, \ldots, j_{k'}\}, \text{ set } j_{l'} = l(j_{c'}): \\
\quad \mathrm{distance}(\{i_1 \ldots i_k\}, \mathit{path}, \{j_1 \ldots j_{c'-1}, j_{c'+1} \ldots j_{k'}\}, \mathit{path}'.(u, j_{c'})) \\
\qquad + \mathrm{cost}_{\mathrm{node\,split}}(j_{c'}, j_{k'}) \quad (\text{obs.: } j_{k'} \text{ data are changed}) \\
\quad \mathrm{distance}(\{i_1 \ldots i_k\}, \mathit{path}, \{j_{l'} \ldots j_{c'-1}, j_{k'}\}, \mathit{path}'.(e, j_{c'})) \\
\qquad + \mathrm{cost}_{\mathrm{edge\,split}}(j_{c'}, j_{k'}) + \mathrm{distance}(\emptyset, \emptyset, \{j_1 \ldots j_{l'-1}\}, \emptyset) \\
\qquad + \mathrm{distance}(\emptyset, \emptyset, \{j_{c'+1} \ldots j_{k'-1}\}, \emptyset) \quad (\text{obs.: } j_{k'} \text{ data are changed})
\end{cases}
\tag{3}
\]
else set $i_l = l(i_k)$ and $j_{l'} = l(j_{k'})$
\[
\min \begin{cases}
\mathrm{distance}(\{i_1 \ldots i_{k-1}\}, \emptyset, \{j_1 \ldots j_{k'}\}, \mathit{path}') + \mathrm{cost}_{\mathrm{del}}(i_k) \\
\mathrm{distance}(\{i_1 \ldots i_k\}, \mathit{path}, \{j_1 \ldots j_{k'-1}\}, \emptyset) + \mathrm{cost}_{\mathrm{ins}}(j_{k'}) \\
\mathrm{distance}(\{i_1 \ldots i_{l-1}\}, \emptyset, \{j_1 \ldots j_{l'-1}\}, \emptyset) + \mathrm{distance}(\{i_l \ldots i_k\}, \mathit{path}, \{j_{l'} \ldots j_{k'}\}, \mathit{path}')
\end{cases}
\tag{4}
\]

Given two nodes $u$ and $v$ such that $v$ is a child of $u$, $\mathrm{node\_fusion}(u, v)$ is the fusion of node $v$ with $u$, and $\mathrm{edge\_fusion}(u, v)$ is the edge fusion between the edges leading to, respectively, nodes $u$ and $v$. The symmetric operations are denoted by, respectively, $\mathrm{node\_split}(u, v)$ and $\mathrm{edge\_split}(u, v)$.

The distance computation takes two new parameters $\mathit{path}$ and $\mathit{path}'$. These are sets of pairs $(e \text{ or } u, v)$ which indicate, for node $i_k$ (respectively, $j_{k'}$), the series of fusions that were done. Thus, a pair $(e, v)$ indicates that an edge fusion has been performed between $i_k$ and $v$, while for $(u, v)$ a node $v$ has been merged with node $i_k$.

The notation $\mathit{path}.(e, v)$ indicates that the operation $(e, v)$ has been performed in relation to node $i_k$ and the information is thus concatenated to the set $\mathit{path}$ of pairs currently linked with $i_k$.

4.3 Implementation and Complexity
The previous section gave the recurrence formulæ for calculating the edit distance between two trees allowing for node and edge fusion and split. We now discuss the complexity of the algorithm. This requires paying attention to some high-level implementation details that, in the case of the tree edit distance problem, may have an important influence on the theoretical complexity of the algorithm. Such details were first observed by Zhang and Shasha. They concern the order in which to perform the operations indicated in (2) and (1) to obtain an algorithm that is time and space efficient.

Let us consider the last line of (2). We may observe that the computation of the distance between two forests refers to the computation of the distance between two trees $T(l(i_2) \ldots i_2)$ and $T'(l(j_2) \ldots j_2)$. We must therefore memorise the distance between any two subtrees of $T$ and $T'$. Furthermore, we have to carry out the computation from the leaves to the root because when we compute the distance between two subtrees $U$ and $U'$, the distance between any subtrees of $U$ and $U'$ must already have been measured. This explains the space complexity which is in $O(|T| \times |T'|)$ and corresponds to the size of the table used for storing such distances in memory.

If we look at (1) now, we see that it is not necessary to calculate separately the distance between the subtrees rooted at $i'$ and $j'$ if $i'$ is on the path from $l(i)$ to $i$ and $j'$ is on the path from $l(j)$ to $j$, for $i$ and $j$ nodes of, respectively, $T$ and $T'$.

We define a set $LR(T)$ of the left roots of $T$ as follows:
\[
LR(T) = \{ k \mid 1 \leq k \leq |T| \text{ and } \nexists\, k' > k \text{ such that } l(k') = l(k) \}
\]
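The definition of the left roots translates directly into a short scan of the nodes in decreasing postfix order. The sketch below is a minimal Python illustration; the function name `left_roots` and the encoding of $l$ as a 1-indexed list are assumptions made for the example.

```python
def left_roots(l):
    """Return LR(T) as a sorted list of postfix indexes.

    l is a 1-indexed list (l[0] is unused) where l[k] is the index of the
    leftmost leaf of the subtree rooted at node k. A node k is a left root
    if no node k' > k has the same leftmost leaf.
    """
    n = len(l) - 1
    seen = set()
    roots = []
    # Scanning from the root downward, the first node met for a given
    # leftmost leaf is the highest one, hence a left root.
    for k in range(n, 0, -1):
        if l[k] not in seen:
            roots.append(k)
            seen.add(l[k])
    return sorted(roots)


# Hypothetical six-node tree f(d(a, c(b)), e), numbered in postfix order:
# a=1, b=2, c=3, d=4, e=5, f=6.
example_l = [None, 1, 2, 2, 1, 5, 1]
```

For this tree, `left_roots(example_l)` returns `[3, 5, 6]`: the root and every node that has a left sibling. There is exactly one left root per leaf, so the size of $LR(T)$ equals the number of leaves of $T$.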
The algorithm for computing the edit distance between $T$ and $T'$ consists then in computing the distance between each subtree rooted at a node in $LR(T)$ and each subtree rooted at a node in $LR(T')$. Such subtrees are considered from the leaves to the root of $T$ and $T'$, that is, in the order of their indexes.

Zhang and Shasha proved that this algorithm has a time complexity in $O(|T| \times \min(\mathrm{leaf}(T), \mathrm{height}(T)) \times |T'| \times \min(\mathrm{leaf}(T'), \mathrm{height}(T')))$, $\mathrm{leaf}(T)$ designating the number of leaves of $T$ and $\mathrm{height}(T)$ its height. In the worst case (fan tree), the complexity is in $O(|T|^2 \times |T'|^2)$.

Taking fusion and split operations into account does not change the above reasoning. However, we must now store in memory the distance between all subtrees $T(l(i_2) \ldots i_2)$ and $T'(l(j_2) \ldots j_2)$, and all the possible values of $\mathit{path}$ and $\mathit{path}'$.

We must therefore determine the number of values that $\mathit{path}$ can take. This amounts to determining the total number of successive fusions that could be applied to a given node. We recall that $\mathit{path}$ is a list of pairs $(e \text{ or } u, v)$. Let $\mathit{path} = \{(e \text{ or } u, v_1), (e \text{ or } u, v_2), \ldots, (e \text{ or } u, v_\ell)\}$ be the list for node $i$ of $T$. The first fusion can be performed only with a child $v_1$ of $i$. If $d$ is the maximum degree of $T$, there are $d$ possible choices for $v_1$. The second fusion can be done with one of the children of $i$ or with one of its grandchildren. Let $v_2$ be the node chosen. There are $d + d^2$ possible choices for $v_2$. Following the same reasoning, there are $\sum_{k=1}^{\ell} d^k$ possible choices for the $\ell$th node $v_\ell$ to be fused with $i$.

our algorithm is thus in $O((2d)^\ell \times (2d')^\ell \times |T| \times |T'|)$, with $d$ and $d'$ the maximum degrees of, respectively, $T$ and $T'$.

The computation of the time complexity of our algorithm is done in a similar way as for the algorithm of Zhang and Shasha. For each node of $T$ and $T'$, one must compute the number of subtree distance computations the node will be involved in by considering all subtrees rooted in, respectively, a node of $LR(T)$ and a node of $LR(T')$. In our case, one must also take into account for each node the possibility of applying a fusion. This leads to a time complexity in
\[
O\bigl((2d)^\ell \times |T| \times \min(\mathrm{leaf}(T), \mathrm{height}(T)) \times (2d')^\ell \times |T'| \times \min(\mathrm{leaf}(T'), \mathrm{height}(T'))\bigr).
\]

This complexity suggests that the fusion operations may be used only for reasonable trees (typically, less than 100 nodes) and small values of $\ell$ (typically, less than 4). It is however important to observe that the overall number of fusions one may perform can be much greater than $\ell$ without affecting the worst-case complexity of the algorithm. Indeed, any number of fusions can be made while still retaining the bound of
\[
O\bigl((2d)^\ell \times |T| \times \min(\mathrm{leaf}(T), \mathrm{height}(T)) \times (2d')^\ell \times |T'| \times \min(\mathrm{leaf}(T'), \mathrm{height}(T'))\bigr)
\]
so long as one does not realize more than $\ell$ consecutive fusions for each node.

In general, also, most interesting tree representations of an RNA are of small enough size as will be shown next, together with some initial results obtained in practice.
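To give a feeling for why small values of the number of successive fusions matter, the short Python sketch below evaluates the two quantities that appear in this section: the number of candidate nodes for the last of a series of successive fusions, and the multiplicative factor $(2d)^\ell$ of the bounds. The function names and the sample degree are assumptions chosen for illustration; the code only evaluates the formulas and says nothing about actual running times.

```python
def choices_for_fusion(d, ell):
    """Number of candidate nodes for the ell-th successive fusion on a node
    of a tree of maximum degree d: d + d^2 + ... + d^ell."""
    return sum(d ** k for k in range(1, ell + 1))


def fusion_factor(d, ell):
    """Multiplicative factor (2d)^ell appearing in the complexity bounds."""
    return (2 * d) ** ell


if __name__ == "__main__":
    d = 3  # hypothetical maximum degree
    for ell in range(1, 5):
        print(ell, choices_for_fusion(d, ell), fusion_factor(d, ell))
```

For a maximum degree of 3, the factor grows from 6 for a single fusion to 1296 for four successive fusions on the same node, and it applies once for each of the two trees.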
Fig. 11. Result of the editing between the two RNAs shown in Fig. 4 allowing for node and edge fusions.
6 MULTILEVEL RNA STRUCTURE COMPARISON: SKETCH OF THE MAIN IDEA
We briefly discuss now an approach which addresses in part the “scattering effect” problem (see Section 2). This approach is being currently validated and will be more fully described in another paper. We therefore present here the main idea only.

To start with, it is important to understand the nature of this “scattering effect.” Let us consider first a trivial case: the cost functions are unitary (insertion, deletion, and relabeling each cost 1) and we compute the edit distance between two trees composed of a single node each. The obtained mapping will associate the single node in the first tree with the single one in the second tree, independently from the labels of the nodes. This example can be extended to the comparison of two trees whose node labels are all different. In this case, the obtained mapping corresponds to the maximum homeomorphic subtree common to both trees.

If the two RNA secondary structures compared using a tree representation which models both the base pairs and the nonpaired bases are globally similar but present some local dissimilarity, then an edit operation will almost always associate the nodes of the locally divergent regions that are located at the same positions relatively to the global common structure. This is a normal, expected behavior in the context of an edition. However, it seems clear also when we look at Fig. 4 that the bases of a terminal loop should not be mapped to those of a multiple loop.

To reduce this problem, one possible solution consists of adding to the nodes corresponding to a base an information concerning the element of secondary structure to which the base belongs. The cost functions are then adapted to take this type of information into account. This solution, although producing interesting results, is not entirely satisfying. Indeed, the algorithm will tend to systematically put into correspondence nodes (and, thus, bases) belonging to structural elements of the same type, which is also not necessarily a good choice as these elements may not be related in the overall structure. It seems therefore preferable to have a structural approach first, mapping initially the elements of secondary structure to each other and taking care of the nucleotides in a second step only.

The approach we have elaborated may be briefly described as follows: Given two RNA secondary structures, the first step consists in coding the RNAs by trees of type (c) in Fig. 2 (nodes represent bulges or multiple, internal or
Fig. 12. Part of a mapping between two rRNA small subunits. The node fusion is circled.
Fig. 13. Result of the comparison of the two RNAs of Fig. 4 using trees in Fig. 2c. The thick dashed lines indicate some of the associations resulting
from the computation of the edit distance between these two trees. Triangular nodes stand for bulges, diamonds for internal loops, squares for
hairpin loops, and circles for multiloops. Noncolored fine dashed nodes and lines correspond, respectively, to deleted nodes/edges.
terminal loops while edges code for helices). We then compute the edit distance between these two trees using the two novel fusion operations described in this paper. This also produces a mapping between the two trees. Each node and edge of the trees, that is, each element of secondary structure, is then colored according to this mapping. Two elements are thus of the same color if they have been mapped in the first step. We now have at our disposal information concerning the structural similarity of the two RNAs. We can then code the RNAs using a tree of type (b). To these trees, we add to each node the color of the structural element to which it belongs. We need now only to restrict the match operation to nodes of the same color. Two nodes can therefore match only if they belong to secondary elements that have been identified in the first step as being similar.

To illustrate the use of this algorithm, we have applied it to the two RNAs of Fig. 4. Fig. 13 presents the trees of the type in Fig. 2c coding for these structures, and the mapping produced by the computation of the edit distance with fusion. In particular, the noncolored fine dashed nodes and edges correspond, respectively, to deleted nodes/edges. One can see that in the left RNA, the two hairpin loops involved in the scattering effect problem in Fig. 4 (indicated by the arrows) have been destroyed and will not be mapped to one another anymore when the edit operations are applied to the trees of the type in Fig. 2b.

This approach allows one to obtain interesting results. Furthermore, it considerably reduces the complexity of the algorithm for comparing two RNA structures coded with trees of the type in Fig. 2b. However, it is important to observe that the scattering effect problem is not specific to the tree representations of the type in Fig. 2b. Indeed, the same problem may be observed, to a lesser degree, with trees of the type in Fig. 2c. This is the reason why we generalize the process by adopting a modelling of RNA secondary structures at different levels of abstraction. This model, and the accompanying algorithm for comparing RNA structures, is in progress.

7 FURTHER WORK AND CONCLUSION
We have proposed an algorithm that addresses two main limitations of the classical tree edit operations for comparing RNA secondary structures. Its complexity is high in theory if many fusions are applied in succession to any given (the same) node, but the total number of fusions that may be performed is not limited. In practice, the algorithm is fast enough for most situations one can meet.

To provide a more complete solution to the problem of the scattering effect, we also proposed a new multilevel approach for comparing two RNA secondary structures whose main idea was sketched in this paper. Further details and evaluation of such novel comparison scheme will be the subject of another paper.

REFERENCES
[1] D. Bouthinon and H. Soldano, “A New Method to Predict the Consensus Secondary Structure of a Set of Unaligned RNA Sequences,” Bioinformatics, vol. 15, no. 10, pp. 785-798, 1999.
[2] J.W. Brown, “The Ribonuclease P Database,” Nucleic Acids Research, vol. 24, no. 1, p. 314, 1999.
[3] N. el Mabrouk and F. Lisacek, “Very Fast Identification of RNA Motifs in Genomic DNA. Application to tRNA Search in the Yeast Genome,” J. Molecular Biology, vol. 264, no. 1, pp. 46-55, 1996.
[4] I. Hofacker, “The Vienna RNA Secondary Structure Server,” 2003.
[5] I. Hofacker, W. Fontana, P.F. Stadler, L. Sebastian Bonhoeffer, M. Tacker, and P. Schuster, “Fast Folding and Comparison of RNA Secondary Structures,” Monatshefte für Chemie, vol. 125, pp. 167-188, 1994.
[6] M. Höchsmann, T. Töller, R. Giegerich, and S. Kurtz, “Local Similarity in RNA Secondary Structures,” Proc. IEEE Computer Soc. Conf. Bioinformatics, p. 159, 2003.
[7] M. Höchsmann, B. Voss, and R. Giegerich, “Pure Multiple RNA Secondary Structure Alignments: A Progressive Profile Approach,” IEEE/ACM Trans. Computational Biology and Bioinformatics, vol. 1, no. 1, pp. 53-62, 2004.
[8] T. Winkelmans, J. Wuyts, Y. Van de Peer, and R. De Wachter, “The European Database on Small Subunit Ribosomal RNA,” Nucleic Acids Research, vol. 30, no. 1, pp. 183-185, 2002.
[9] T. Jiang, L. Wang, and K. Zhang, “Alignment of Trees—An Alternative to Tree Edit,” Proc. Fifth Ann. Symp. Combinatorial Pattern Matching, pp. 75-86, 1994.
[10] F. Lisacek, Y. Diaz, and F. Michel, “Automatic Identification of Group I Intron Cores in Genomic DNA Sequences,” J. Molecular Biology, vol. 235, no. 4, pp. 1206-1217, 1994.
[11] B. Shapiro, “An Algorithm for Multiple RNA Secondary Structures,” Computer Applications in the Biosciences, vol. 4, no. 3, pp. 387-393, 1988.
[12] B.A. Shapiro and K. Zhang, “Comparing Multiple RNA Secondary Structures Using Tree Comparisons,” Computer Applications in the Biosciences, vol. 6, no. 4, pp. 309-318, 1990.
[13] K.-C. Tai, “The Tree-to-Tree Correction Problem,” J. ACM, vol. 26, no. 3, pp. 422-433, 1979.
[14] K. Zhang and D. Shasha, “Simple Fast Algorithms for the Editing Distance between Trees and Related Problems,” SIAM J. Computing, vol. 18, no. 6, pp. 1245-1262, 1989.
[15] M. Zuker, “Mfold Web Server for Nucleic Acid Folding and Hybridization Prediction,” Nucleic Acids Research, vol. 31, no. 13, pp. 3406-3415, 2003.

Julien Allali studied at the University of Marne la Vallée (France), where he received the MSc degree in computer science and computational genomics. In 2001, he began his PhD in computational genomics at the Gaspard Monge Institute of the University of Marne la Vallée. His thesis focused on the study of RNA secondary structures and, in particular, their comparison using a tree distance. In 2004, he received the PhD degree.

Marie-France Sagot received the BSc degree in computer science from the University of São Paulo, Brazil, in 1991, the PhD degree in theoretical computer science and applications from the University of Marne-la-Vallée, France, in 1996, and the Habilitation from the same university in 2000. From 1997 to 2001, she worked as a research associate at the Pasteur Institute in Paris, France. In 2001, she moved to Lyon, France, as a research associate at the INRIA, the French National Institute for Research in Computer Science and Control. Since 2003, she has been the Director of Research at the INRIA. Her research interests are in computational biology, algorithmics, and combinatorics.
Abstract—The problem of reconstructing the duplication history of a set of tandemly repeated sequences was first introduced by Fitch
[4]. Many recent studies deal with this problem, showing the validity of the unequal recombination model proposed by Fitch, describing
numerous inference algorithms, and exploring the combinatorial properties of these new mathematical objects, which are duplication
trees. In this paper, we deal with the topological rearrangement of these trees. Classical rearrangements used in phylogeny (NNI, SPR,
TBR, ...) cannot be applied directly on duplication trees. We show that restricting the neighborhood defined by the SPR (Subtree
Pruning and Regrafting) rearrangement to valid duplication trees, allows exploring the whole duplication tree space. We use these
restricted rearrangements in a local search method which improves an initial tree via successive rearrangements. This method is
applied to the optimization of parsimony and minimum evolution criteria. We show through simulations that this method improves all
existing programs for both reconstructing the topology of the true tree and recovering its duplication events. We apply this approach to
tandemly repeated human Zinc finger genes and observe that a much better duplication tree is obtained by our method than using any
other program.
Index Terms—Tandem duplication trees, phylogeny, topological rearrangements, local search, parsimony, minimum evolution, Zinc
finger genes.
1 INTRODUCTION
Fig. 1. (a) Rooted duplication tree describing the evolutionary history of the 13 Antennapedia-class homeobox genes from the cognate group [6].
(b) Rooted duplication tree describing the evolutionary history of the nine variable genes of the human T cell receptor Gamma (TRGV) locus [8]. In
both examples, the contemporary genes are adjacent and linearly ordered along the extant locus.
duplication history of tandemly repeated sequences. Indeed, accurate reconstruction of duplication histories will be useful to elucidate various aspects of genome evolution. Such reconstructions will provide new insights into the mechanisms and determinants of gene and protein domain duplication, often recognized as major generators of novelty [13]. Several important gene families, such as immunity-related genes, are arranged in tandem; a better understanding of their evolution should provide new insights into their duplication dynamics and clues about their functional specialization. Studying the evolution of micro- and minisatellites could resolve unanswered biological questions regarding human migrations or the evolution of bacterial diseases [14].
Given a set of aligned and ordered sequences (DNA or proteins), the aim is to find the duplication tree that best explains these sequences, according to usual criteria in phylogenetics, e.g., parsimony or minimum evolution. Few studies have focused on the computational hardness of this problem, and all of these studies only deal with the restricted version where simultaneous duplication of multiple adjacent segments is not allowed. In this context, Jaitly et al. [15] show that finding the optimal single copy duplication tree with parsimony is NP-Hard and that this problem has a PTAS (Polynomial Time Approximation Scheme). Another closely related PTAS is given by Tang et al. [10] for the same problem. On the other hand, Elemento et al. [7] describe a polynomial distance-based algorithm that reconstructs optimal single copy tandem duplication trees with minimum evolution.
However, it is commonly believed, as in phylogeny, that most (especially multiple) duplication tree inference problems are NP-Hard. This explains the development of heuristic approaches. Benson and Dong [9] provide various parsimony-based heuristic reconstruction algorithms to infer duplication trees, especially from minisatellites. Elemento et al. [8] present an enumerative algorithm that computes the most parsimonious duplication tree; this algorithm (by its exhaustive approach) is limited to datasets of fewer than 15 repeats. Several distance-based methods have also been described. The WINDOW method [10] uses an agglomeration scheme similar to UPGMA [16] and NJ [17], but the cost function used to judge potential duplication is based on the assumption that the sequences follow a molecular clock mode of evolution. The DTSCORE method [18] uses the same scheme but corrects this limitation using a score criterion [19], like ADDTREE [20]. DTSCORE can be used with sequences that do not follow the molecular clock, which is, for example, essential when dealing with gene families containing pseudogenes that evolve much faster than functional genes. Finally, GREEDY SEARCH [21] corresponds to a different approach divided into two steps: First, a phylogeny is computed with a classical reconstruction method (NJ), then, with nearest neighbor interchange (NNI) rearrangements, a duplication tree close to this phylogeny is computed. This approach is noteworthy since it implements topological rearrangements which are highly useful in phylogenetics [22], but it works blindly and does not ensure that good duplication trees will be found (cf. Section 5.2).
Topological rearrangements have an essential function in phylogenetic inference, where they are used to improve an initial phylogeny by subtree movement or exchange. Rearrangements are very useful for all common criteria (parsimony, distance, maximum likelihood) and are integrated into all classical programs like PAUP* [23] or PHYLIP [24]. Furthermore, they are used to define various distances between phylogenies and are the foundation of much mathematical work [25]. Unfortunately, they cannot be directly used here, as shown by a simple example given
BERTRAND AND GASCUEL: TOPOLOGICAL REARRANGEMENTS AND LOCAL SEARCH METHOD FOR TANDEM DUPLICATION TREES 17
Fig. 2. (a) Duplication history; each segment represents a copy; extant segments are numbered. (b) Duplication tree (DT); the black points show the
possible root locations. (c) Rooted duplication tree (RDT) corresponding to history (a) and root position 1 on (b).
later. Indeed, when applied to a duplication tree, they do not guarantee that another valid duplication tree will be produced.
In this paper, we describe a set of topological rearrangements to stay inside the duplication tree space and explore the whole space from any of its elements. We then show the advantages of this approach for duplication tree inference from sequences. In Section 2, we describe the duplication model introduced by [4], [8], [10], as well as an algorithm to recognize duplication trees in linear time. Thanks to this algorithm, we restrict the neighborhoods defined by classical phylogeny rearrangements, namely, nearest neighbor interchange (NNI) and subtree pruning and regrafting (SPR), to valid duplication trees. We demonstrate (Section 3) that for NNI moves this restricted neighborhood does not allow the exploration of the whole duplication tree space. On the other hand, we demonstrate that the restricted neighborhood of SPR rearrangement allows the whole space to be explored. In this way, we define a local search method, applied here to parsimony and minimum evolution (Section 4). We compare this method to other existing approaches using simulated and real data sets (Section 5). We conclude by discussing the positive results obtained by our method, and indicate directions for further research (Section 6).
2 MODEL
2.1 Duplication History and Duplication Tree
The tandem duplication model used in this article was first introduced by Fitch [4] then studied independently by [8], [10]. It is based on unequal recombination which is assumed to be the sole evolution mechanism (except point mutations) acting on sequences. Although it is a completely different biological mechanism, slipped-strand mispairing leads to the same duplication model [5], [9].
Let $O = (1, 2, \ldots, n)$ be the ordered set of sequences representing the extant locus. Initially containing a single copy, the locus grew through a series of consecutive duplications. As shown in Fig. 2a, a duplication history may contain simple duplication events. When the duplicated fragment contains two, three, or $k$ repeats, we say that it involves a multiple duplication event. Under this duplication model, a duplication history is a rooted tree with $n$ labeled and ordered leaves, in which internal nodes of degree 3 correspond to duplication events. In a real duplication history (Fig. 2a), the time intervals between consecutive duplications are completely known, and the internal nodes are ordered from top to bottom according to the moment they occurred in the course of evolution. Any ordered segment set of the same height then represents an ancestral state of the locus. We call such a set a floor, and we say that two nodes $i, j$ are adjacent ($i \prec j$) if there is a floor where $i$ and $j$ are consecutive and $i$ is on the left of $j$.
However, in the absence of a molecular clock mode of evolution (a typical problem), it is impossible to recover the order between the duplication events of two different lineages from the sequences. In this case, we are only able to infer a duplication tree (DT) (Fig. 2b) or a rooted duplication tree (RDT) (Fig. 2c).
A duplication tree is an unrooted phylogeny with ordered leaves, whose topology is compatible with at least one duplication history. Also, internal nodes of duplication trees are partitioned into events (or "blocks" following [10]), each containing one or more (ordered) nodes. We distinguish "simple" duplication events that contain a unique internal node (e.g., b and f in Fig. 2c) and "multiple" duplication events which group a series of adjacent and simultaneous duplications (e.g., c, d, and e in Fig. 2c). Let $E = (s_i, s_{i+1}, \ldots, s_k)$ denote an event containing internal nodes $s_i, s_{i+1}, \ldots, s_k$ in left to right order. We say that two consecutive nodes of the same event are adjacent ($s_j \prec s_{j+1}$) just like in histories, as any event belongs to a floor in all of
18 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 2, NO. 1, JANUARY-MARCH 2005
Fig. 3. The tree obtained by applying an NNI move to a DT is not always a valid DT: $T$ whose $RT$ is a rooted version; $T'$ is obtained by applying NNI(5,4) around the bold edge; none of the possible root positions of $T'$ (a, b, c, and d) leads to a valid RDT, cf. tree (b) which corresponds to root b in $T'$.
the histories that are compatible with the DT being considered. The same notation will also be used for leaves to express the segment order in the extant locus. When the tree is rooted, every internal node $s_j$ is unambiguously associated to one parent and two child nodes; moreover, one child of $s_j$ is "left" and the other one is "right," which is denoted as $l_j$ and $r_j$, respectively. In this case, for any duplication history that is compatible with this tree, child nodes of an event $s_i, s_{i+1}, \ldots, s_k$ are organized as follows:
$$l_i \prec l_{i+1} \prec \ldots \prec l_k \prec r_i \prec r_{i+1} \prec \ldots \prec r_k.$$
In [8], [26], [27], it was shown that rooting a duplication tree is different than rooting a phylogeny: the root of a duplication tree necessarily lies on the tree path between the most distant repeats on the locus, i.e., 1 and $n$; moreover, the root is always located "above" all multiple duplications, e.g., Fig. 1b shows that there are only three valid root positions, the root cannot be a direct ancestor of 12.
The definition for unrooted trees is quite similar: $(T, O)$ defines an unrooted duplication tree if and only if:
1. $(T, O)$ contains 1 segment, or
2. same as for rooted trees with $(T', O')$ now defining an unrooted duplication tree.
Those definitions provide a recursive algorithm, RADT (Recognition Algorithm for Duplication Trees), to check whether any given phylogeny with ordered leaves is a duplication tree. In case of success, this algorithm can also be used to reconstruct duplication events: At each step, the series of internal nodes above denoted as $(s_i, s_{i+1}, \ldots, s_k)$ is a duplication event. When the tree is rooted, $l_j$ is the left child of $s_j$ and $r_j$ its right child, for every $j$, $i \le j \le k$. This algorithm can be implemented in $O(n)$ [26] where $n$ is the number of leaves. Another linear algorithm is proposed by Zhang et al. [21] using a top-down approach instead of a bottom-up one, but applies only to rooted duplication trees.
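To make the recognition idea concrete, the following is a minimal Python sketch of a bottom-up check for rooted trees. It is not the linear-time RADT of [26]. It assumes an agglomeration rule in which a block of adjacent segments $l_i, \ldots, l_k, r_i, \ldots, r_k$ whose pairs $(l_j, r_j)$ are cherries is replaced by the parents $s_i, \ldots, s_k$; that rule is inferred from the child ordering stated above, since the rooted definition itself is not reproduced in this text. The nested-tuple encoding and the names `is_rooted_duplication_tree` and `parent_of` are illustrative choices, not the authors'.

```python
def _internal_nodes(tree, acc):
    """Collect every internal node (a 2-tuple) of a nested-tuple tree."""
    if isinstance(tree, tuple):
        acc.add(tree)
        for child in tree:
            _internal_nodes(child, acc)
    return acc


def _count_leaves(tree):
    if not isinstance(tree, tuple):
        return 1
    return sum(_count_leaves(child) for child in tree)


def is_rooted_duplication_tree(tree):
    """Naive check; leaves are assumed to be the integers 1..n in locus order."""
    internal = _internal_nodes(tree, set())
    segments = list(range(1, _count_leaves(tree) + 1))

    def parent_of(x, y):
        # Return the internal node whose two children are x and y, if any.
        if (x, y) in internal:
            return (x, y)
        if (y, x) in internal:
            return (y, x)
        return None

    while len(segments) > 1:
        merged = False
        for m in range(1, len(segments) // 2 + 1):  # m = size of the event
            for start in range(len(segments) - 2 * m + 1):
                lefts = segments[start:start + m]
                rights = segments[start + m:start + 2 * m]
                parents = [parent_of(l, r) for l, r in zip(lefts, rights)]
                if all(p is not None for p in parents):
                    # Agglomerate the event: 2m segments become m parents.
                    segments[start:start + 2 * m] = parents
                    merged = True
                    break
            if merged:
                break
        if not merged:
            return False  # no visible event left, so the check fails
    return True
```

Under these assumptions, the caterpillar `((((1, 2), 3), 4), 5)` and the double duplication `((1, 3), (2, 4))` are accepted, while `((1, 3), 2)` is rejected because leaves 1 and 3 form a cherry without being adjacent on the locus.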
Fig. 4. The NNI neighborhood of a duplication tree does not always contain duplication trees: $T$ whose $RT$ is a rooted version; $T'$ is obtained by exchanging subtrees 1 and (2 5); none of the possible root positions of $T'$ (a, b, and c) leads to a valid duplication tree, cf. tree (b) which corresponds to root b in $T'$; and the same holds for every neighbor of $T$ being obtained by NNI.
3.1 Topological Rearrangements for Phylogeny
There are many ways of carrying out topological rearrangements on phylogeny [22]. We only describe NNI (Nearest Neighbor Interchange), SPR (Subtree Pruning and Regrafting), and TBR (Tree Bisection and Reconnection) rearrangements.
The NNI move is a simple rearrangement which exchanges two subtrees adjacent to the same internal edge (Figs. 3 and 4). There are two possible NNIs for each internal edge, so $2(n-3)$ neighboring trees for one tree with $n$ leaves. This rearrangement allows the whole space of phylogeny to be explored; i.e., there is a succession of NNI moves making it possible to transform any phylogeny $P_1$ into any phylogeny $P_2$ [28].
The SPR move consists of pruning a subtree and regrafting it, by its root, to an edge of the resulting tree (Figs. 6 and 7). We note that the neighborhood of a tree defined by the NNI rearrangements is included in the neighborhood defined by SPRs. The latter rearrangement defines a neighborhood of size $2(n-3)(2n-7)$ [25].
Finally, TBR generalizes SPR by allowing the pruned subtree to be reconnected by any of its edges to the resulting tree. These three rearrangements (NNI, SPR, and TBR) are reversible, that is, if $T'$ is obtained from $T$ by a particular rearrangement, then $T$ can be obtained from $T'$ using the same type of rearrangement.
3.2 NNI Rearrangements Do Not Stay in DT Space
The classical phylogenetic rearrangements (NNI, SPR, TBR, ...) do not always stay in DT space. So, if we apply an NNI to a DT (e.g., Fig. 3), the resulting tree is not always a valid DT. This property is also true for SPR and TBR rearrangements since NNI rearrangements are included in these two rearrangement classes.
3.3 Restricted NNI Does Not Allow the Whole DT Space to Be Explored
To restrict the neighborhood defined by NNI rearrangements to duplication trees, each element of the neighborhood is filtered using the recognition algorithm (RADT). But, this restricted neighborhood does not allow the whole DT space to be explored. Fig. 4 gives an example of a duplication tree, $T$, the neighborhood of which does not contain any DT. So, its restricted neighborhood is empty, and there is no succession of restricted NNIs allowing $T$ to be transformed into any other DT.
3.4 Restricted SPR Allows the Whole DT Space to Be Explored
As before, we restrict (using RADT) the neighborhood defined by SPR rearrangements to duplication trees. We call restricted SPRs those SPR moves that, starting from a duplication tree, lead to another duplication tree.
Main Theorem. Let $T_1$ and $T_2$ be any given duplication trees; $T_1$ can be transformed into $T_2$ via a succession of restricted SPRs.
Proof. To demonstrate the Main Theorem, we define two types of special SPR that ensure staying within the space of rooted duplication trees (RDT). Given these two types of SPRs, we demonstrate that it is possible to transform any rooted duplication tree into a caterpillar, i.e., a rooted tree in which all internal nodes belong to the tree path between the leaf 1 and the tree root (cf. Fig. 5). This result demonstrates the theorem. Indeed, let $T_1$ and $T_2$ be two RDTs. We can transform $T_1$ and $T_2$ into a caterpillar by a succession of restricted SPRs. So, it is possible to transform $T_1$ into $T_2$ by a succession of restricted SPRs, with (possibly) a caterpillar as intermediate tree. This property holds since the reciprocal movement of an SPR is an SPR. As the two SPR types proposed ensure that we stay within the RDT space, we have the desired result for rooted duplication trees. And, this result extends to unrooted duplication trees since two DTs can be arbitrarily rooted, transformed from one to the other using restricted SPRs, then unrooted. □
Fig. 5. A six-leaf caterpillar.
The first special SPR allows multiple duplication events to be destroyed. Let $E = (s_i, s_{i+1}, \ldots, s_k)$ be a duplication event, $r_i$ and $l_k$ respectively right child of $s_i$
and left child of $s_k$, and let $p_i$ be the father of $s_i$. The DELETE rearrangement consists of pruning the subtree of root $r_i$ and grafting this subtree on the edge $(s_k, l_k)$, while $l_i$ is renamed $s_i$ and the edge $(l_i, s_i)$ is deleted. Fig. 6 demonstrates this rearrangement.
Lemma 1. DELETE preserves the RDT property.
Proof. Let $T$ be the initial tree (Fig. 6a), $E = (s_i, s_{i+1}, \ldots, s_k)$ be an event of $T$, and $T'$ be the tree obtained from $T$ by applying DELETE to $E$ (Fig. 6b). Children of any node $s_j$ ($i \le j \le k$) are denoted $l_j$ and $r_j$. By definition, for any duplication history compatible with $T$ we have
$$l_i \prec l_{i+1} \prec \ldots \prec l_k \prec r_i \prec r_{i+1} \prec \ldots \prec r_k.$$
Thus, there is a way to partially agglomerate $T$ (using an RADT-like procedure) such that these nodes become leaves. The same agglomeration can be applied to $T'$ as only ancestors of the $l_j$s and $r_j$s are affected by DELETE. Now, 1) agglomerate the event $E$ of $T$, and 2) reduce $T'$ by agglomerating the cherry $(l_k, r_i)$ and then agglomerating the event $(s_{i+1}, \ldots, s_k)$. Two identical trees follow, which concludes the proof. □
By successively applying DELETE to any duplication tree, we remove all multiple duplication events. The following SPR rearrangement allows duplications to be moved within a simple RDT, i.e., any RDT containing only simple duplications. Let $p$ be a node of a simple RDT $T$, $l$ its left child, $r$ its right child, and $x$ the left child of $r$. This rearrangement consists of pruning the subtree of root $x$ and regrafting it to the edge $(l, p)$ (Fig. 7). This rearrangement is an SPR (in fact an NNI); we name it LEFT as it moves the subtree root towards the left. It is obvious that the tree obtained by applying such a rearrangement to a simple RDT is a simple RDT. We now establish the following lemma, which shows that any simple tree can be transformed into a caterpillar.
Lemma 2. Let $T$ be a simple RDT; $T$ can be transformed into a caterpillar by a succession of LEFT rearrangements.
Proof. In a caterpillar all internal nodes are ancestors of 1. If $T$ is not a caterpillar, there is an internal node $r$ that is not an ancestor of 1. If $r$ is the right child of its father, we can apply LEFT to the left child of $r$ (Fig. 7). If $r$ is the left child of its father, we consider its father: It cannot be an ancestor of 1 since its children are $r$ and a node on the right of $r$. So, we can apply the same argument: Either the father of $r$ is adequate for performing LEFT, or we consider its father again. In this way, we necessarily obtain a node for which the rearrangement is possible. $T$ is then transformed into a caterpillar by successively applying the LEFT rearrangement to nodes which are not on the path between 1 and the root. After a finite number of steps, all internal nodes are ancestors of 1 and $T$ has been transformed into a caterpillar. This concludes the proof of Lemma 2 and, therefore, of our Main Theorem. □
4 LOCAL SEARCH METHOD
We consider data consisting of an alignment of $n$ segments with length $k$, and of the ordering $O$ of the segments along the locus. This alignment has been created before tree construction and the problem is not to build simultaneously the alignment and the tree, a much more complicated task [29]. The aim is to find a (nearly) optimal duplication tree, where "optimal" is defined by some usual phylogenetic criterion and the ordered and aligned segments at hand. Topological rearrangements described in the previous section naturally lead to a local search method for this purpose. We discuss its use to optimize the usual Wagner parsimony [22] and the distance-based balanced minimum evolution criterion (BME) [30], [31]. First, we describe our local search method, then we define briefly these two criteria and explain how to compute them during local search.
Fig. 7. LEFT rearrangement.
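The LEFT rearrangement and the argument of Lemma 2 can be illustrated with a short Python sketch. It assumes that a simple RDT is encoded as a nested tuple `(left, right)` whose leaves read 1..n from left to right, so that LEFT at a node `p = (l, (x, y))` yields `((l, x), y)`. The function names and the order in which moves are chosen (always the topmost admissible node on the path from the root to leaf 1) are simplifications for illustration, not the selection rule used in the proof.

```python
def left_move(tree):
    """Apply one LEFT move at the first admissible node, or return None.

    At a node p = (l, (x, y)), the subtree x is pruned from the right
    child r = (x, y) and regrafted on the edge (l, p), giving ((l, x), y).
    The search only descends along left children, i.e., towards leaf 1.
    """
    if not isinstance(tree, tuple):
        return None
    left, right = tree
    if isinstance(right, tuple):
        x, y = right
        return ((left, x), y)
    moved = left_move(left)
    return None if moved is None else (moved, right)


def to_caterpillar(tree):
    """Repeat LEFT until no move applies; return (caterpillar, move count)."""
    steps = 0
    while True:
        moved = left_move(tree)
        if moved is None:
            return tree, steps
        tree, steps = moved, steps + 1
```

For instance, `to_caterpillar((1, (2, (3, 4))))` goes through `((1, 2), (3, 4))` and stops at `(((1, 2), 3), 4)` after two moves. Each move preserves the left-to-right leaf order and adds one internal node to the path between leaf 1 and the root, which is why the loop terminates.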
Fig. 9. (a) Every edge defines one down-subtree and one up-subtree; e.g., A represents the down-subtree (2 3) defined by the edge e while D
corresponds to the up-subtree (1 (4 5)). Moreover, only the parsimony vector of the five leaves is known before the preprocessing stage. The
postorder search computes the parsimony vector of down-subtrees: A is computed from 2 and 3, B from 4 and 5, C from A and B. The preorder
search computes the parsimony vector of up-subtrees: D is obtained from 1 and B, E is obtained from D and 3, etc. (b) When the parsimony vector
of every subtree in Tr is known, regrafting Tp on any given edge and computing the parsimony score of the resulting tree only requires analyzing the
parsimony vector of three subtrees and is done in $O(k)$ time.
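The postorder pass described in the caption of Fig. 9 can be sketched in Python as follows. The sketch assumes Fitch's set-based rule for unordered characters as a stand-in for the parsimony vectors of the caption, and it covers only the down-subtrees, not the preorder pass for up-subtrees. The nested-tuple tree, the toy alignment, and the names `fitch_down_pass` and `parsimony_score` are hypothetical.

```python
def fitch_down_pass(tree, states):
    """Return (state set, cost) for one site of a rooted binary tree.

    `tree` is a leaf label or a pair of subtrees; `states` maps each
    leaf label to its character at the site under consideration.
    """
    if not isinstance(tree, tuple):
        return {states[tree]}, 0
    (left_set, left_cost), (right_set, right_cost) = (
        fitch_down_pass(child, states) for child in tree
    )
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    # Disjoint child sets: take the union and count one substitution.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Sum the per-site costs over the k sites of the alignment."""
    k = len(next(iter(alignment.values())))
    return sum(
        fitch_down_pass(tree, {leaf: seq[site] for leaf, seq in alignment.items()})[1]
        for site in range(k)
    )


# Five-leaf tree of Fig. 9a, rooted on the edge leading to leaf 1:
# A = (2, 3), B = (4, 5), C = (A, B).
tree = (1, ((2, 3), (4, 5)))
alignment = {1: "AC", 2: "AC", 3: "GC", 4: "AT", 5: "AT"}
score = parsimony_score(tree, alignment)
```

With this toy alignment, each of the two sites needs one substitution, so `score` is 2. Storing the set and cost of every down-subtree (and, symmetrically, of every up-subtree) is what allows a regrafting position to be evaluated from three stored vectors, as the caption states.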
preprocessing stages; the third term $(n^2 k)$ is the time to test the $n$ subtrees and the $n$ possible insertion edges.
4.3 The Distance-Based Balanced Minimum Evolution Principle
As in any distance-based approach, we first estimate the matrix of pairwise evolutionary distances between the segments, using some standard distance estimator [22], e.g., the Kimura two-parameter estimator [37] in case of DNA or the JTT method with proteins [38]. Let $\Delta$ be this matrix and $\delta_{ij}$ be the distance between segments $i$ and $j$. The matrix $\Delta$ plus the segment order is the input of the reconstruction method.
The minimum evolution principle (ME) [39], [40] involves selecting the shortest tree to be the tree which best explains the observed sequences. The tree length is equal to the sum of all the edge lengths, and the edge lengths are estimated by minimizing a least squares fit criterion. The problem of inferring optimal phylogenies within ME is commonly assumed to be NP-hard, as are many other distance-based phylogeny inference problems [41]. Nonetheless, ME forms the basis of several phylogenetic reconstruction methods, generally based on greedy heuristics. Among them is the popular Neighbor-Joining (NJ) algorithm [17]. Starting from a star tree, NJ iteratively agglomerates external pairs of taxa so as to minimize the tree length at each step.
Recently, Pauplin [30] proposed a new simple formula to estimate the tree length $L(T)$ of tree $T$:
$$L(T) = \sum_{i<j} 2^{1 - T_{ij}} \delta_{ij},$$
where $T_{ij}$ is the topological distance (number of edges) in $T$ between segments $i$ and $j$. The correctness of this formula was shown by Semple and Steel [42], while Desper and Gascuel [31] showed that this formula is a special case of weighted-least squares tree fitting. Moreover, Desper and Gascuel demonstrated that selecting the shortest tree (as computed from above formula) is statistically consistent and well suited for phylogenetic inference. They called this new version of ME "balanced minimum evolution" (BME) [31].
Using the above formula, the length of any given tree is computed in $O(n^2)$, so computing one LSDT local search step can be achieved in $O(n^4)$. However, a faster implementation is possible using a straightforward modification of our BME addition algorithm [43]. This involves:
1. pruning a rooted subtree $T_p$ from tree $T$,
2. computing the average distance between all nonintersecting subtree pairs in the remaining tree $T_r$,
3. computing the average distance between $T_p$ and any subtree of $T_r$ in $T$, and
4. using formula (10) from [43] and RADT to find the best allowed edge to regraft $T_p$.
Steps 2 and 3 are based on algorithms described in [43], which follow the same approach as the double depth-first search described in the previous section. These two steps require $O(n^2)$, just as Step 4. As there are $O(n)$ subtrees to prune and regraft, this implementation requires $O(n^3)$ to perform one search step.
5 RESULTS
5.1 Simulation Protocol
We applied our method and other existing methods to simulated datasets obtained using the procedure described in [18]. We uniformly randomly generated rooted tandem duplication trees (see [26]) with 12, 24, and 48 leaves and assigned lengths to the edges of these trees using the coalescent model [44]. We then obtained molecular clock trees (MC), which might be unrealistic in numerous cases, e.g., when the sequences being studied contain pseudogenes which evolve much faster than functional genes. Then, we generated nonmolecular clock trees (NO-MC) from the previous trees by independently multiplying
every edge length by $1 + 0.8X$, where $X$ was drawn from an exponential distribution with parameter 1. MC trees were rescaled by multiplying every edge length by 1.8. The trees thus obtained (MC and NO-MC) have a maximum leaf-to-leaf divergence in the range $[0.1, 0.7]$, and in NO-MC trees the ratio between the longest and shortest root-to-leaf lineages is about 3.0 on average. Both values are in accordance with real data, e.g., gene families [8] or repeated protein domains [10].
SEQGEN [45] was used to produce a 1,000 bp-long nucleotide multiple alignment from each of the generated trees using the Kimura two-parameter model of substitution [46], and a distance matrix was computed by DNADIST [24] from this alignment using the same substitution model. For MC and NO-MC cases, 1,000 trees (and, then, 1,000 sequence sets and 1,000 distance matrices) were generated per tree size. These data sets were used to compare the ability of the various methods to recover the original trees from the sequences or from the distance matrices, depending on the method being tested. We measured the percentage of trees (out of 1,000) being correctly reconstructed (%tr). For the phylogeny reconstruction methods, we also kept the percentage of duplication trees among the set of inferred trees. Due to the random process used for generating these trees and datasets, some short branches might not have undergone any substitution (as during evolution) and, thus, are unobtainable, except by chance. When $n$ and, thus, the branch number is high, it becomes hard or impossible to find the entire tree. So, we also measured the percentage of duplication events in the true tree recovered by the inferred tree (%ev). A duplication event involves one or more internal nodes and is the lowest common ancestor of a set of leaves; we say it "covers" its descendant leaves. However, the leaves covered by a simple duplication event can change when the root position changes. As regards the true tree, the root is known and each event is defined by the set of leaves which it covers. But, the inferred tree is unrooted. To avoid ambiguity, we then tested all possible root positions and chose the one which gave the highest proximity in number of events detected between the true tree and the inferred tree, where two events are identical if they cover the same leaves. Finally, we kept the average parsimony value of each method (pars).
5.2 Performance and Comparison
Using this protocol, we compared NJ [17], TNT [47], GREEDY-SEARCH (GS) [21] which starts from the NJ tree, a modified version of GREEDY TRHIST RESTRICTED (GTR) [9] to infer multiple duplication trees, WINDOW [10], DTSCORE [18], and eight versions of our local search method LSDT corresponding to different starting duplication trees (GS, GTR, WINDOW, and DTSCORE) and different criteria (parsimony and BME). TNT and GTR use the parsimony criterion, but the others are distance-based methods. TNT is acknowledged as one of the very best parsimony packages; it was run with 10 replicates and TBR rearrangements. TNT often returns a set of equally parsimonious trees. When this set contained duplication trees, we randomly selected one of them; when no duplication tree was inferred by TNT, we randomly selected one of the output trees.
Results are given in Tables 1 and 2. First, we observe that with $n = 48$ the true tree is almost never entirely found, for the reasons explained earlier. On the other hand, the best methods recover 80 to 95 percent of the duplication events, indicating that the tested datasets are relatively easy. NJ and TNT perform relatively well, but they often output trees that are not duplication trees, which is unsatisfactory (e.g., with 48 leaves and NO-MC, NJ and TNT only infer 1 percent and 5 percent of duplication trees, respectively). The GS approach is noteworthy since it modifies the trees inferred by NJ to transform them into duplication trees. However, GS is only slightly better than NJ regarding the proportion of correctly reconstructed trees, but considerably degrades the number of recovered duplication events, which could be explained by the blind search it performs to transform NJ trees into duplication trees. GTR also obtains relatively poor results. As expected from its assumptions, WINDOW performs better in the MC case than in the NO-MC one. Finally, DTSCORE obtains the best performance among the four existing methods, whatever the topological criterion considered.
Applying our method to starting trees produced by GS, GTR, WINDOW, and DTSCORE reveals the advantages of the local search approach. Optimizing parsimony or BME gives similar results, with a slight advantage for parsimony as expected from the relatively low divergence rates in our data sets. The trees produced by GS, GTR, and WINDOW are clearly improved and, for most, are better than those obtained by DTSCORE. DTSCORE trees are also improved, even though this improvement is not very high from a topological point of view. This could be explained by the fact that DTSCORE is already an accurate method with respect to the datasets used.
When we consider the parsimony criterion, the gain achieved by LSDT is appreciable for each start method. This could be expected for GS, WINDOW, and DTSCORE, which do not optimize this criterion; with $n = 48$ in the NO-MC case, the gain for GS is about 329, thus confirming that this method is clearly suboptimal; the gains for WINDOW and DTSCORE are about 42 and 15, which are lower but still significant. The results for GTR, which optimizes parsimony, are more surprising since the gain (always with $n = 48$ in the NO-MC case) is about 77 on average, which is very high. Moreover, the parsimony value obtained by LSDT is very close to that of TNT, in spite of a much more restricted search space. This confirms the good performance of our
TABLE 1
Performance Comparison Using Simulations (Molecular Clock Mode of Evolution)
X+LSDT_Y: X is the method used to obtain the starting tree and Y the criterion being optimized by LSDT; %tr: the percentage of trees being correctly
reconstructed; the percentage of duplication trees obtained by phylogeny reconstruction methods is given between parentheses; %ev: the
percentage of duplication events in the true tree being recovered by the inferred tree; pars: the average parsimony value.
local search method. It should be stressed that these gains are obtained at low computational cost as dealing with any of the 48-taxon datasets only requires about 10 seconds for parsimony and five seconds for BME on a standard PC-Pentium 4.
5.3 Analysis of the ZNF45 Family
Zinc finger (ZNF) genes code for proteins that contain one or more zinc finger motifs. The zinc finger motif is one of the most common motifs involved in nucleic acid-protein interaction. Experimental studies on functions of ZNF genes suggest that many of them code for transcription factors, and some of them are known to take part in cellular growth and development [48]. However, the biological functions of most ZNF genes are currently unknown. The 16 members of the ZNF45 gene family are found in the q13.2 gene cluster on human chromosome 19 [49]. The organization and features of the members of the ZNF45 family suggest that the genes in the family may have been produced by a series of in situ
TABLE 2
Performance Comparison Using Simulations (No Molecular Clock of Evolution)
Fig. 10. (a) Duplication tree for the 16 genes of human ZNF 45 family inferred by DTSCORE plus LSDT with parsimony; black dots represent the only
allowed root positions, according to the tandem duplication model; the (arbitrarily) selected root position is circled. (b) Rooted duplication tree
corresponding to tree (a). (c) Phylogeny inferred by TNT. Tree (a) can be obtained from tree (c) by moving ZNF45 and ZNF228 to edge 1, and
ZNF233 to edge 2. Edge lengths in tree (a) and tree (c) were estimated by maximum likelihood [52]. Lengths in tree (b) are meaningless and were
adjusted to obtain a readable drawing.
gene duplication events [49]. The ZNF45 gene family has been previously studied by Tang et al. [10] and Zhang et al. [21], who proposed different tandem duplication trees to explain its evolutionary history.
We downloaded the DNA sequences of the 16 members of ZNF45 from NCBI. Multiple alignment was achieved using TCOFFEE¹ with default settings. We removed gaps as usual in phylogenetics [22] and third codon positions which look saturated (734 parsimony steps are required to explain the evolution of the 237 sites). We thus obtained a final alignment² containing 474 homologous sites, with a maximum pairwise divergence of 0.45.
PAUP* [23] was used to estimate the matrix of pairwise distances, assuming the GTR substitution model [50] and a gamma distribution of rates with parameter 1. We used this distance matrix and DTSCORE to build a starting tree, which was then refined by LSDT using parsimony. We selected this criterion because of its good performance with simulated data (Tables 1 and 2). The resulting tree (Figs. 10a and 10b) is a simple DT requiring 897 steps to explain the extant sequences. We tried to improve this score using a computationally intensive ratchet approach [51], but were unable to obtain any other DT with better (or even identical) parsimony. We also ran TNT with ratchet, 1,000 random taxon addition replicates and TBR branch swapping (i.e., all TNT options to intensify the search) and found one maximum-parsimony phylogeny requiring 896 steps. This phylogeny (Fig. 10c) contains an unresolved node with degree 4 and is not a duplication tree.
The TNT phylogeny is close to the LSDT duplication tree. To transform one into the other, only three taxa have to be
1. http://igs-server.cnrs-mrs.fr/Tcoffee/tcoffee_cgi/index.cgi.
2. Available on request.
26 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 2, NO. 1, JANUARY-MARCH 2005
TABLE 3
Analysis of the ZNF45 Data Set
moved (Fig. 10), and both trees differ by only 1 parsimony step. A similar difference was commonly observed in simulation where TNT found (non-DT) phylogenies requiring one parsimony step less (on average) than the DTs found by LSDT (Tables 1 and 2), though the true tree used to generate the sequences was a DT. Thus, having (only) one parsimony step of difference between the best DT and the best phylogeny is not significant and can be seen as supporting the duplication model. Moreover, the discrepancy between the two trees can be explained by long branch attraction, a phenomenon that frequently affects parsimony-based reconstructions [53]. Indeed, ZNF180 and ZNF229 genes are distant from the other genes (Figs. 10a and 10c) and might perturb the whole tree. When removing those two genes from the data set, both LSDT and TNT found the same tree, which is identical to the LSDT tree of Fig. 10a without the two genes. With 14 segments, the probability of randomly picking up a duplication tree among all distinct phylogenies is less than $10^{-4}$ [26]. This extremely small probability indicates that the identity between LSDT and TNT trees is very unlikely to be due to chance. This provides a strong support for the tandem duplication model and indicates that our LSDT tree likely represents most, if not all, of the history of ZNF45 family.
We compared trees obtained by Tang et al. [10], Zhang et al. [21], and those of the other programs to the LSDT tree of Fig. 10. We computed the parsimony score of each tree and the percentage of events shared by each tree with the LSDT tree. Just as in the simulation study, we tested GS [21], GTR [9], WINDOW [10], DTSCORE [8], and LSDT using different starting points but optimizing parsimony in all cases.
Results are displayed in Table 3 and confirm those obtained with simulated data sets. Results of trees from [10] and [21] are poor, which was expected as these methods (WINDOWS and GS, respectively) do not optimize the parsimony criterion and as we did not use the same alignment. GS is relatively poor, while DTSCORE, WINDOWS, and GTR perform better. LSDT clearly improves these four methods, with gains ranging from 10 to 50 parsimony steps. In all cases but GTR, LSDT recovers the most parsimonious DT of Fig. 10.
6 CONCLUSION AND PROSPECTS
We have demonstrated that restricting the neighborhood defined by the SPR rearrangement to valid duplication trees allows the whole DT space to be explored. Thanks to these rearrangements, we have defined a general local search method which we used to optimize the parsimony and balanced minimum evolution criteria. We have thus improved the topological accuracy of all the tested methods.
Several research directions are possible. Finding the set of combinatorial configurations for the SPR rearrangement which necessarily produce a duplication tree, could allow the neighborhood computation to be accelerated (e.g., for $n = 48$ only 5 percent of the SPR neighborhood correspond to duplication trees) and, furthermore, gain more insight into the nature of duplication trees, which are just starting to be investigated mathematically [12], [26], [27]. Our local search method could be improved using restricted TBR rearrangements or with the help of different stochastic approaches (taboo, noising, ...) in order to avoid local minima. Moreover, it would be relevant to test this local search method with other criteria like maximum likelihood. Finally, combining the tandem duplication events with speciation events, as described in [54] and [55] for nontandem duplications, would be relevant for real applications where we have homologous tandem repeats from several genomes.
ACKNOWLEDGMENTS
The authors would like to thank Wafae El Alaoui for her help with ZNF45 family genes, and Richard Desper, Wim Hordijk and the referees of the Workshop on Algorithms in Bioinformatics (WABI ’04) for reading preliminary versions of this paper. This work was supported by ACI-IMPBIO (Ministère de la Recherche, France).
BERTRAND AND GASCUEL: TOPOLOGICAL REARRANGEMENTS AND LOCAL SEARCH METHOD FOR TANDEM DUPLICATION TREES 27
REFERENCES
[1] F. Blattner, G. Plunkett, C. Bloch, N. Perna, V. Burland, M. Riley, J. Collado-Vides, J. Glasner, C. Rode, G. Mayhew, J. Gregor, N. Davis, H. Kirkpatrick, M. Goeden, D. Rose, B. Mau, and Y. Shao, “The Complete Genome Sequence Of Escherichia Coli k-12,” Science, vol. 277, no. 5331, pp. 1453-1474, 1997.
[2] E. Lander et al., “Initial Sequencing and Analysis of the Human Genome,” Nature, vol. 409, pp. 860-921, 2001.
[3] A. Smit, “Interspersed Repeats and Other Mementos of Transposable Elements in Mammalian Genomes,” Current Opinion in Genetics & Development, vol. 9, pp. 657-663, 1999.
[4] W. Fitch, “Phylogenies Constrained by Cross-Over Process as Illustrated by Human Hemoglobins in a Thirteen-Cycle, Eleven Amino-Acid Repeat in Human Apolipoprotein A-I,” Genetics, vol. 86, pp. 623-644, 1977.
[5] G. Levinson and G. Gutman, “Slipped-Strand Mispairing: A Major Mechanism for DNA Sequence Evolution,” Molecular Biology and Evolution, vol. 4, pp. 203-221, 1987.
[6] J. Zhang and M. Nei, “Evolution of Antennapedia-Class Homeobox Genes,” Genetics, vol. 142, no. 1, pp. 295-303, 1996.
[7] O. Elemento and O. Gascuel, “An Exact and Polynomial Distance-Based Algorithm to Reconstruct Single Copy Tandem Duplication Trees,” Proc. 14th Ann. Symp. Combinatorial Pattern Matching (CPM2003), 2003.
[8] O. Elemento, O. Gascuel, and M.-P. Lefranc, “Reconstructing the Duplication History of Tandemly Repeated Genes,” Molecular Biology and Evolution, vol. 19, pp. 278-288, 2002.
[9] G. Benson and L. Dong, “Reconstructing the Duplication History of a Tandem Repeat,” Proc. Intelligent Systems in Molecular Biology (ISMB1999), T. Lengauer, ed., pp. 44-53, 1999.
[10] M. Tang, M. Waterman, and S. Yooseph, “Zinc Finger Gene Clusters and Tandem Gene Duplication,” J. Computational Biology, vol. 9, pp. 429-446, 2002.
[11] E. Rivals, “A Survey on Algorithmic Aspects of Tandem Repeats Evolution,” Int’l J. Foundations of Computer Science, vol. 15, no. 2, pp. 225-257, 2004.
[12] O. Gascuel, D. Bertrand, and O. Elemento, “Reconstructing the Duplication History of Tandemly Repeated Sequences,” Math. of Evolution and Phylogeny, O. Gascuel, ed., 2004.
[13] S. Ohno, Evolution by Gene Duplication. Springer Verlag, 1970.
[14] P.L. Fleche, Y. Hauck, L. Onteniente, A. Prieur, F. Denoeud, V. Ramisse, P. Sylvestre, G. Benson, F. Ramisse, and G. Vergnaud, “A Tandem Repeats Database for Bacterial Genomes: Application to the Genotyping of Yersinia Pestis and Bacillus Anthracis,” BioMed Central Microbiology, vol. 1, pp. 2-15, 2001.
[15] D. Jaitly, P. Kearney, G. Lin, and B. Ma, “Methods for Reconstructing the History of Tandem Repeats and Their Application to the Human Genome,” J. Computer and System Sciences, vol. 65, pp. 494-507, 2002.
[16] P. Sneath and R. Sokal, Numerical Taxonomy. pp. 230-234, San Francisco: W.H. Freeman and Company, 1973.
[17] N. Saitou and M. Nei, “The Neighbor-Joining Method: A New Method for Reconstructing Phylogenetic Trees,” Molecular Biology and Evolution, vol. 4, pp. 406-425, 1987.
[18] O. Elemento and O. Gascuel, “A Fast and Accurate Distance-Based Algorithm to Reconstruct Tandem Duplication Trees,” Bioinformatics, vol. 18, pp. 92-99, 2002.
[19] J. Barthélemy and A. Guénoche, Trees and Proximity Representations. Wiley and Sons, 1991.
[20] S. Sattath and A. Tversky, “Additive Similarity Trees,” Psychometrika, vol. 42, pp. 319-345, 1977.
[21] L. Zhang, B. Ma, L. Wang, and Y. Xu, “Greedy Method for Inferring Tandem Duplication History,” Bioinformatics, vol. 19, pp. 1497-1504, 2003.
[22] D. Swofford, P. Olsen, P. Waddell, and D. Hillis, Molecular Systematics. pp. 407-514, Sunderland, Mass.: Sinauer Associates, 1996.
[23] D. Swofford, PAUP*. Phylogenetic Analysis Using Parsimony (*and Other Methods), version 4. Sunderland, Mass.: Sinauer Associates, 1999.
[24] J. Felsenstein, “PHYLIP - PHYLogeny Inference Package,” Cladistics, vol. 5, pp. 164-166, 1989.
[25] C. Semple and M. Steel, Phylogenetics. Oxford Univ. Press, 2003.
[26] O. Gascuel, M. Hendy, A. Jean-Marie, and S. McLachlan, “The Combinatorics of Tandem Duplication Trees,” Systematic Biology, vol. 52, pp. 110-118, 2003.
[27] J. Yang and L. Zhang, “On Counting Tandem Duplication Trees,” Molecular Biology and Evolution, vol. 21, pp. 1160-1163, 2004.
[28] D. Robinson, “Comparison of Labeled Trees with Valency Trees,” J. Combinatorial Theory, vol. 11, pp. 105-119, 1971.
[29] L. Wang and D. Gusfield, “Improved Approximation Algorithms for Tree Alignment,” J. Algorithms, vol. 25, pp. 255-273, 1997.
[30] Y. Pauplin, “Direct Calculation of a Tree Length Using a Distance Matrix,” J. Molecular Evolution, vol. 51, pp. 41-47, 2000.
[31] R. Desper and O. Gascuel, “Theoretical Foundation of the Balanced Minimum Evolution Method of Phylogenetic Inference and Its Relationship to Weighted Least-Squares Tree Fitting,” Molecular Biology and Evolution, vol. 21, no. 3, pp. 587-598, 2004.
[32] W. Fitch, “Toward Defining the Course of Evolution: Minimum Change for a Specified Tree Topology,” Systematic Zoology, vol. 20, pp. 406-416, 1971.
[33] J. Hartigan, “Minimum Mutation Fits to a Given Tree,” Biometrics, vol. 29, pp. 53-65, 1973.
[34] G. Ganapathy, V. Ramachandran, and T. Warnow, “Better Hill-Climbing Searches for Parsimony,” Proc. Third Int’l Workshop Algorithms in Bioinformatics, 2003.
[35] P.A. Goloboff, “Methods for Faster Parsimony Analysis,” Cladistics, vol. 12, pp. 199-220, 1996.
[36] V. Berry and O. Gascuel, “Inferring Evolutionary Trees with Strong Combinatorial Evidence,” Theoretical Computer Science, vol. 240, pp. 271-298, 2000.
[37] M. Kimura, “A Simple Model for Estimating Evolutionary Rates of Base Substitutions through Comparative Studies of Nucleotide Sequences,” J. Molecular Evolution, vol. 16, pp. 111-120, 1980.
[38] D. Jones, W. Taylor, and J. Thornton, “The Rapid Generation of Mutation Data Matrices from Protein Sequences,” Computer Applications in Biosciences, vol. 8, pp. 275-282, 1992.
[39] K. Kidd and L. Sgaramella-Zonta, “Phylogenetic Analysis: Concepts and Methods,” Am. J. Human Genetics, vol. 23, pp. 235-252, 1971.
[40] A. Rzhetsky and M. Nei, “Theoretical Foundation of the Minimum-Evolution Method of Phylogenetic Inference,” Molecular Biology and Evolution, vol. 10, pp. 173-1095, 1993.
[41] W. Day, “Computational Complexity of Inferring Phylogenies from Dissimilarity Matrices,” Bull. Math. Biology, vol. 49, pp. 461-467, 1987.
[42] C. Semple and M. Steel, “Cyclic Permutations and Evolutionary Trees,” Advances in Applied Math., vol. 32, no. 4, pp. 669-680, 2004.
[43] R. Desper and O. Gascuel, “Fast and Accurate Phylogeny Reconstruction Algorithms Based on the Minimum-Evolution Principle,” J. Computational Biology, vol. 9, pp. 687-706, 2002.
[44] M. Kuhner and J. Felsenstein, “A Simulation Comparison of Phylogeny Algorithms under Equal and Unequal Evolutionary Rates,” Molecular Biology and Evolution, vol. 11, pp. 459-468, 1994.
[45] A. Rambault and N. Grassly, “Seq-Gen: An Application for the Monte Carlo Simulation of DNA Sequence Evolution Along Phylogenetic Trees,” Computer Applied Biosciences, vol. 13, pp. 235-238, 1997.
[46] J. Felsenstein and G. Churchill, “A Hidden Markov Model Approach to Variation Among Sites in Rate of Evolution,” Molecular Biology and Evolution, vol. 13, pp. 93-104, 1996.
[47] P.A. Goloboff, J.S. Farris, and K. Nixon, “TNT: Tree Analysis Using New Technology,” 2000, www.cladistics.com.
[48] T. El-Barabi and T. Pieler, “Zinc Finger Proteins: What We Know and What We Would Like to Know,” Mechanisms of Development, vol. 33, pp. 155-169, 1991.
[49] M. Shannon, J. Kim, L. Ashworth, E. Branscomb, and L. Stubbs, “Tandem Zinc-Finger Gene Families in Mammals: Insights and Unanswered Questions,” DNA Sequence - The J. Sequencing and Mapping, vol. 8, no. 5, pp. 303-315, 1998.
[50] P. Waddel and M. Steel, “General Time Reversible Distances with Unequal Rates Across Sites: Mixing Γ and Inverse Gaussian Distributions with Invariant Sites,” Molecular Phylogeny and Evolution, vol. 8, pp. 398-414, 1997.
[51] K.C. Nixon, “The Parsimony Ratchet, a New Method for Rapid Parsimony Analysis,” Cladistics, vol. 15, pp. 407-414, 1999.
[52] S. Guindon and O. Gascuel, “A Simple, Fast and Accurate Method to Estimate Large Phylogenies by Maximum-Likelihood,” Systematic Biology, vol. 52, no. 5, pp. 696-704, 2003.
[53] J. Felsenstein, “Cases in Which Parsimony or Compatibility Methods Will Be Positively Misleading,” Systematic Zoology, vol. 27, pp. 401-410, 1978.
[54] D. Page and M. Charleston, “From Gene to Organismal Phylogeny: Reconciled Trees and the Gene Tree/Species Tree Problem,” Molecular Phylogenetics and Evolution, vol. 7, pp. 231-240, 1997.
[55] M. Hallett, J. Lagergren, and A. Tofigh, “Simultaneous Identification of Duplications and Lateral Transfers,” Proc. Conf. Research and Computational Molecular Biology (RECOMB2004), pp. 347-356, 2004.
Denis Bertrand is a PhD student under the supervision of Olivier Gascuel. His research subject is the study of tandemly repeated sequences. His main areas of interest are phylogenetics, combinatorics, and algorithms.
Olivier Gascuel is Directeur de Recherche at the Centre National de la Recherche Scientifique (France). He is the head of the bioinformatics group from the LIRMM laboratory, belongs to the editorial board of Systematic Biology and of BMC Evolutionary Biology, and has served in a number of program committees of bioinformatics conferences (ISMB, WABI). He started in this field in the mid 1980s, with works on sequence analysis and protein structure prediction. Since the beginning of the 1990s, he turned his efforts to phylogenetics, focusing on the mathematical and computational tools and concepts. He (co)authored several well-known phylogeny inference programs (BioNJ, PHYML, FastME).
Abstract—We present a framework for improving local protein alignment algorithms. Specifically, we discuss how to extend local
protein aligners to use a collection of vector seeds or ungapped alignment seeds to reduce noise hits. We model picking a set of seed
models as an integer programming problem and give algorithms to choose such a set of seeds. While the problem is NP-hard, and
Quasi-NP-hard to approximate to within a logarithmic factor, it can be solved easily in practice. A good set of seeds we have chosen
allows four to five times fewer false positive hits, while preserving essentially identical sensitivity as BLASTP.
1 INTRODUCTION
2 BACKGROUND: HEURISTIC ALIGNMENT AND SPACED SEEDS
Since the development of heuristic sequence aligners [1], the same approach has been commonly used: identify short, highly conserved regions and build local alignments around these “hits.” This avoids the use of the Smith-Waterman algorithm [8] for pairwise local alignment, which has $\Theta(nm)$ runtimes on input sequences A and B of length n and m, respectively. (We will use the notation $A[i]$ to represent the ith character of sequence A.)
Instead, assuming random sequences, the expected runtime of this heuristic search method is $h(n, m) + a(n, m)$, where $h(n, m)$ is the amount of time needed to find hits in the two sequences and $a(n, m)$ is the expected time needed to compute the alignments from the hits. Most heuristic aligners have $h(n, m) = \Theta(n + m + nm/k)$, while $a(n, m) = \Theta(nm/k)$ for some large constant k. There are many assumptions in these formulas. First, even when we align sequences with true homologies, most hits are between unrelated positions, so the estimation of the runtime need not consider whether the sequences are related. Further, this simplification assumes that each hit found in the first phase results in a constant amount of work being done in the second phase to identify that it is false (or that true hits are rare). It is the speedup factor of k that is important here; assuming m and n are large, the overall runtime is much faster.
Most heuristic aligners look at the scores of matching characters in short regions and use high-scoring short regions as hits. For example, BLASTP [1] hits are three consecutive positions in the two sequences where the total score, according to a BLOSUM or PAM scoring matrix, of aligning the three letters in one sequence to the three letters of the other sequence is at least +13. Finding such hits can be done easily, for example, by making a hash table of one sequence and searching positions of the hash table for the other sequence, in time proportional to the length of the sequences and the number of hits found. BLASTP uses more complicated data structures for this process, but the principle is similar.
2.1 Seeding Models
To generalize BLASTP’s hits, we defined vector seeds [3], [9]. A vector seed is a pair $(v, T)$. Vector $v = (v_1, \ldots, v_k)$ is a vector of position multipliers and T is a threshold. Given two sequences A and B, let $s_{i,j}$ be the score in our scoring matrix of aligning $A[i]$ to $B[j]$. If we consider position i in A and j in B, we then get a hit to the vector seed at those positions when $v \cdot (s_{i,j}, s_{i+1,j+1}, \ldots, s_{i+k-1,j+k-1}) \ge T$. In this framework, BLASTP’s seed is ((1, 1, 1), 13).
Vector seeds generalize the earlier idea of spaced seeds [2] for nucleotide alignments, where both scores and the vector are 0/1 vectors and where T, the threshold, equals the number of 1s in v. A spaced seed requires an exact match in the positions where the vector is 1 and the places where the vector is 0 are “don’t care” positions. In our original work with vector seeds [3], the freedom to allow positions of v to have values besides 0 and 1 was not extremely useful, so the vector seeds we discuss here all have binary vectors v.
Spaced seeds have the same expected number of junk hits as unspaced seeds. For unrelated noise DNA sequences, this is $nm4^{-w}$, where w is the number of ones in the seed (its support). Their advantage comes because more distinct internal subregions of a given alignment will match a spaced seed than the unspaced seed; this happens because the hits are more independent of each other. The probability that an alignment of length 64 with 70 percent conservation matches a good spaced seed of support 11 can be greater than 45 percent because there are likely to be more subregions that match the spaced seed than the unspaced seed; by contrast, the default BLASTN seed, which is 11 consecutive required matches, hits only 30 percent of alignments.
Spaced seeds have three advantages over unspaced seeds. First, their hits are more independent, which means that it is more likely that a given alignment has at least one hit to a seed; fewer alignments have many. Second, the seed model can be tailored to a particular application: If there is structure or periodicity to alignments, this can be reflected in the design of the seeds chosen. For example, in searching for homologous codons, they can be tailored to the three-periodic structure of such alignments [10], [11]. Finally, the use of multiple seeds allows us to boost sensitivity well above what is achievable with a single seed, which, for nucleotide alignment, can give near 100 percent sensitivity in reasonable runtime [4].
Keich et al. [12] have given an algorithm for a simple model of alignments to compute the probability that an alignment hits a seed; this has been extended by both Buhler et al. [10] and Brejova et al. [11] to more complex sequence models. Choi et al. [13] have also shown experimental results for spaced seeds with high sensitivity across a wide range of homologies. Kucherov et al. [14] show how to adapt spaced seeds to the interesting case of alignments where no subregion of the alignment has a higher score than the entire alignment.
2.2 Some Newer Seeding Models
Another seeding model, which has recently arisen [7], [15] is of ungapped alignment seeds. These were developed by Brown and Hudek [15] to anchor global alignments of ambiguous DNA sequences and, independently, by Kisman et al. [7] in their heuristic protein aligner, tPatternHunter. An ungapped alignment seed is a vector v, a global threshold T, and a vector of positional minimum scores b. There is a match between positions in the two sequences when the vector of pairwise match scores is at least as large, position-by-position, as the minimum scores vector b and where the dot product of the position-by-position scores and the multiplier vector v is at least T. These seeds are a compromise between spaced seeds and consecutive seeds: They require spaced positions to have good scores (those
BROWN: OPTIMIZING MULTIPLE SEEDS FOR PROTEIN HOMOLOGY SEARCH 31
where the lower bound vector b has high values), while also focusing on the quality of the local alignment at the seed by possibly examining all of the positions of the seed. It is not possible to cast an ungapped alignment seed in the language of vector seeds because of the requirement that each individual position’s score is greater than its bound. It is possible to cast a vector seed as an ungapped alignment seed, by setting the b vector to $-\infty$ in all positions, thus removing the position-by-position lower bound requirement.
Csürös [16] has also extended this framework of seeding to look at variable-length seeds, where the length of the regions that must match depends on their positional scores. While this approach can also be brought into the framework of the present work, we have not done so in our experiments.
alignments. Unfortunately, this also gives rise to problems, as the thresholds may be set high due to overtraining for a given set of alignments.
Most of our experiments concern themselves with vector seeds, but the framework can be expanded straightforwardly to ungapped alignment seeds as well. This is because we do not compute theoretical sensitivity of the seeds, but, instead, only identify hits in existing real alignments. Indeed, our framework is quite broad and extends to many different models for seeding as long as the assumption that false positives are additive is reasonably accurate and that one can compute that false positive rate for the seed models. Where the ungapped alignment seeds require some thought, we present the addition needed for them.
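The seed definitions of Sections 2.1 and 2.2 can be illustrated with a minimal Python sketch. It is not the implementation used in this work: the function names are hypothetical, the window of pairwise scores is assumed to be precomputed from a scoring matrix, and the positional bound of an ungapped alignment seed is read as "at least as large" (the wording of Section 2.2).

```python
import math
from typing import List, Sequence, Tuple


def expected_noise_hits(n: int, m: int, w: int) -> float:
    """Expected junk hits n*m*4^(-w) for a seed of support w on unrelated DNA."""
    return n * m * 4.0 ** (-w)


def vector_seed_hit(scores: Sequence[float], v: Sequence[float], T: float) -> bool:
    """True when v . (s_{i,j}, ..., s_{i+k-1,j+k-1}) >= T.

    `scores` is the window of k pairwise scores along one diagonal
    (assumed to be precomputed by the caller).
    """
    if len(scores) != len(v):
        raise ValueError("score window and seed vector must have equal length")
    return sum(vi * si for vi, si in zip(v, scores)) >= T


def ungapped_seed_hit(scores: Sequence[float], v: Sequence[float],
                      T: float, b: Sequence[float]) -> bool:
    """Vector-seed test plus the position-by-position lower bounds b."""
    if len(b) != len(v):
        raise ValueError("bound vector and seed vector must have equal length")
    meets_bounds = all(si >= bi for si, bi in zip(scores, b))
    return meets_bounds and vector_seed_hit(scores, v, T)


def vector_as_ungapped(v: Sequence[float],
                       T: float) -> Tuple[Sequence[float], float, List[float]]:
    """Cast a vector seed as an ungapped alignment seed by setting b to -infinity."""
    return v, T, [-math.inf] * len(v)
```

With BLASTP's seed ((1, 1, 1), 13), the score window (4, 5, 4) is a hit and (4, 4, 4) is not; adding the illustrative bounds b = (4, 0, 4) rejects the window (6, 7, 0) even though its total score reaches 13.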
We show this by giving an approximation-preserving reduction of the Set-Cover problem to this problem. Since Set-Cover is Quasi-NP-hard to approximate to within a logarithmic factor [19], so is our problem.
An instance of Set-Cover is a ground set S and a collection $T = \{T_1, \ldots, T_m\}$ of subsets of S; the goal is the smallest cardinality subset of T whose union is S. The connection to our problem is clear: We will produce one alignment per ground set member and, for each of the elements of T, we will have one seed. For simplicity, we will assume that $S = \{1, \ldots, n\}$. To fill the construction out, we will assign the vector seed
$v_i = ((1, \overbrace{0, \ldots, 0}^{i}, 1), 1)$
to every ground set element $s_i$. In a model of sequence where all positions are independent of all other, each of these seeds has the same false positive rate, so the false positive rate will be proportional to the number of ground set members chosen.
Then, for each set $T_j \in T$, we create an alignment $A_j$ of length $2n^2 + 4n$ by pasting together in n blocks of length $2n + 4$. If i is in $T_j$, then we make the ith block of the alignment have the first and $(i + 2)$nd position be of score 1, while all other positions in the block have score zero, while if $i \notin T_j$, then the ith block is all score zero. Then, it is clear that if we choose the seed $v_i$, we will hit all alignments $A_j$,
problem will have the thresholds on the seeds as high as possible while still hitting each alignment. This allows overtraining: Since even a tiny increase in the thresholds would have caused a missed alignment, we may easily expect that, in another set of alignments, there may be alignments just barely missed by the chosen thresholds. This is particularly possible if thresholds are allowed to get extremely high and only useful for a single alignment. This overtraining happened in some of our experiments, so we lowered the maximum so that they were either found in a fairly narrow range (+13 to +25) or set to $\infty$ when a seed was not used. As one way of also addressing overtraining, we considered lowering the thresholds obtained from the IP uniformly or just lowering the thresholds that have been set to high values.
And, finally, the framework can be extended to allow a specific number of alignments to be missed. For each alignment, rather than requiring that
$\sum_i x_{i,T_{i,j}} \ge 1$,
which requires that some threshold be chosen so that the alignment is hit, we can add a 0/1 slack variable to count how many are missed, changing the constraint to
$\sum_i x_{i,T_{i,j}} + s_j \ge 1$.
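The slack-variable constraint can be checked for a candidate set of thresholds with the following Python sketch. It rests on an assumed reading of the notation, since the definitions are not restated here: $T_{i,j}$ is taken to be the highest threshold at which seed pattern i still hits alignment j, so that alignment j is hit whenever some chosen threshold is at most $T_{i,j}$. The names `chosen`, `best`, and `max_missed` are hypothetical.

```python
import math
from typing import List, Sequence


def slack_variables(chosen: Sequence[float],
                    best: Sequence[Sequence[float]]) -> List[int]:
    """Return s_j for each alignment j: 1 if no chosen seed hits it, else 0.

    chosen[i]  -- threshold selected for seed pattern i (math.inf if unused)
    best[i][j] -- assumed meaning of T_{i,j}: highest threshold at which
                  pattern i still hits alignment j
    """
    n_alignments = len(best[0]) if best else 0
    slack = []
    for j in range(n_alignments):
        hit = any(chosen[i] <= best[i][j] for i in range(len(chosen)))
        slack.append(0 if hit else 1)
    return slack


def within_miss_budget(chosen: Sequence[float],
                       best: Sequence[Sequence[float]],
                       max_missed: int) -> bool:
    """True when at most `max_missed` alignments are left without a hit."""
    return sum(slack_variables(chosen, best)) <= max_missed
```

For example, with `chosen = [13, math.inf]` and `best = [[15, 12, 13], [20, 20, 9]]`, only the second alignment is missed, so the thresholds are acceptable when one miss is allowed but not when none is.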
We finally note that a simple greedy heuristic works well happen to a random sequence by chance only one time in
for the problem, as well: Start with low thresholds for all 10,000, according to BLASTP’s statistics.
seed patterns and repeatedly increase the threshold whose We begin by identifying a set of BLASTP alignments in
increase most reduces the false positive rate until no such this score range. To avoid overrepresenting certain families
increase can be made without missing an alignment. This of alignments in our test set, we did an all-versus-all
simple heuristic performed essentially comparably to the comparison of 8,654 human proteins from the SWISS-PROT
integer program in our experiments, but, since the IP solved database [20]. (We note that this is the same set of proteins
quickly, we present its results. and alignments we used in our previous vector seed work
One other advantage to the IP formulation is that the [3]. We have used this test set in part to confirm our belief
false-positive rate from the LP relaxation is a lower bound that, while a single seed may not help much, in comparison
on what can possibly be achieved; the simple greedy to BLASTP, many seeds will be of assistance.) We then
heuristic offers no such lower bound. divided the proteins into families so that all alignments
with BLASTP score greater than 100 are between two
sequences in the same family and there are as many families
4 EXPERIMENTAL RESULTS as possible. We then chose 10 sets of alignments in our
Here, we present the results of experiments with our target score range such that, in each set of alignments, a
multiple seed selection framework in the context of protein particular family will only contribute at most eight
alignments. Our goal is to identify collections of seed alignments to that set. Note that, since our threshold for
models which together have extremely high sensitivity to sharing family membership is a BLASTP score greater than
even moderately strong alignments, while admitting a very 100 and the alignments we are seeking score between +40
low false positive rate. and +60, many chosen alignments will be between members
Since we pick seeds with a relatively small number of of different families. We divided the sets of alignments into
alignments, we run the serious risk of overtraining. In five training sets and five testing sets. It is possible that the
particular, the requirement that our set of seeds has same alignments will occur in a training and testing set as
100 percent sensitivity on the training data need not require we did not take any efforts to avoid this, though the set of
that it also have comparable sensitivity overall. In one possible alignments is large enough to make this a rare
example, the particular choice of training examples was occurrence.
We note that we are using this somewhat complicated
apparently quite unrepresentative since a 100 percent
system specifically because we want to avoid imposing a
sensitivity to this set of alignments still gave only 96 percent
preexisting bias on the set of alignments: Many true yet
sensitivity on a testing set. (Or, presumably, the testing set
moderate-scoring alignments will be between proteins with
may be unrepresentative.) As a simple way of exploring this,
different function or from different biological families. For the
we examined what happened when we lowered the thresh-
same reason, we have used alignments from dynamic
old on some seeds that were chosen by the integer program
programming as our standard, rather than structural align-
to modestly increase their false positive rates and sensitivity
in the hope of still keeping very high sensitivity.
We first present simple experiments with vector seeds and with ungapped alignment seeds on a small sample of alignments discovered with BLASTP; in this section, we also allow for seed sets that miss a small number of the training alignments.
Then, we explore how well these seed sets do in hitting alignments that we did not use BLASTP to identify. Here, we note that our vector seed sets do not appear to do as well as BLASTP for sensitivity to alignments in general, but they do hit more alignments with high-scoring short regions; presumably, these alignments are more likely true.
4.1 Preliminary Experiments
We begin by exploring several sets of alignments generated using BLASTP. Our target score range for our alignments is BLASTP score between +40 and +60 (BLOSUM score +112 to +168). These moderate-scoring alignments can happen by chance, but also are often true. Alignments below this threshold are much more likely to be errors, while, in a database of proteins we used, such alignments are likely to
ments of known proteins or curated alignments because our goal is to improve the quality of heuristic alignments. Certainly, many of the alignments we consider will not be precise; still, a heuristic dynamic programming-based alignment that finds a hit between two proteins and then uses the same scoring matrix as BLASTP will find the exact same, potentially inaccurate, alignment as did BLASTP.
4.1.1 Multiple Vector Seeds
We then considered the set of all 35 vector patterns of length at most 7 that include three or four 1s (the support of the seed). We used this collection of vector patterns as we have seen no evidence that nonbinary seed vectors are preferable to binary ones for proteins and because it is more difficult to find hits to seeds with higher support than four due to the high number of needed hash table keys.
We computed the optimal set of thresholds for these vector seeds such that every alignment in a training set has a hit to at least one of the seeds, while minimizing the background rate of hits to the seeds and only using at most 10 vector patterns. Then, we examined the sensitivity of the chosen seeds for a training set to its corresponding test set.
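The count of 35 vector patterns quoted in Section 4.1.1 can be reproduced with a short enumeration. The sketch below assumes that a pattern of a given length must begin and end with a 1 (otherwise its length would not be well defined); the function name and the 0/1 string encoding are illustrative choices, not taken from the paper.

```python
from itertools import combinations

def binary_seed_patterns(max_length=7, supports=(3, 4)):
    """Enumerate binary patterns that start and end with 1 (an assumption)."""
    patterns = []
    for length in range(1, max_length + 1):
        for support in supports:
            if support > length:
                continue  # not enough positions to place all the 1s
            # The first and last positions are fixed to 1; choose the rest inside.
            for interior in combinations(range(1, length - 1), support - 2):
                ones = {0, length - 1, *interior}
                patterns.append(
                    "".join("1" if i in ones else "0" for i in range(length))
                )
    return patterns

patterns = binary_seed_patterns()
print(len(patterns))
```

Under this reading, 15 patterns have support three and 20 have support four.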
BROWN: OPTIMIZING MULTIPLE SEEDS FOR PROTEIN HOMOLOGY SEARCH 35
TABLE 1
Hit Rates for Optimal Seed Sets for Various Sets of Training Alignments when Applied to an Unrelated Test Set
The results are found in Table 1. Some seed sets chosen showed signs of overtraining, but others were quite successful, where the chosen seeds work well for their training set as well and have a low false positive rate.
We took the best seed set with near 100 percent sensitivity for both its training and testing data, which was the third of our experimental sets, and used it in further experiments. This seed set is shown in Table 2. We note that this seed set has a five times lower false positive rate (1/8,000) than does BLASTP, while still hitting all of its testing alignments but four (which is not statistically significantly different from zero). We also considered a set of thresholds where we lowered the higher thresholds slightly to allow more hits and possibly avoid overtraining on the initial set of alignments. These altered thresholds are shown as well in Table 2 and give a total false positive rate of 1/6,900. (This set of thresholds also hits all 402 test alignments for that instance.)
TABLE 2
Seeds and Thresholds Chosen by Integer Programming for 409 Test Alignments
4.1.2 A Weaker Requirement on the Sensitivity
As noted previously, we can alter our integer program so that it does not require 100 percent sensitivity on the training data set. We performed experiments on this formulation, using five subsets of the training alignments chosen as before, where we allowed between zero and five alignments from the training set to be missed by the seed set. We show results in Table 3, using again a randomly chosen testing set for each training set. The training data sets varied in size from 304 to 415, while the testing sets ranged from 392 to 407 in size.
TABLE 3
Weakening Sensitivity to Testing Alignment Reduces Sensitivity on Training Alignments
Unsurprisingly, if we did not hit all alignments in the training set, we often miss alignments in the testing set as well. However, the ranges of the sensitivities we saw in testing data for the seed sets picked allowing some misses in the training data were much less wide, suggesting that there may be fewer seed thresholds lowered merely to accommodate a single outlier in the training data. As such, if slightly lower sensitivity is acceptable, this approach may give much more predictable results than training to require all alignments to be hit.
4.1.3 Multiple Ungapped Alignment Seeds
Ungapped alignment seeds can be seen as breaking the model we have for alignment speed. The most straightforward implementation of ungapped alignment seeds would involve a hash table keyed on the letters corresponding to the positions in the bounds vector b, where there is a nontrivial lower bound on the score of a position. Still, even after the first step, where we identified pairs of positions satisfying the minimum bounds scores, we still need another test to verify that a pair of positions satisfies the requirement of the dot product of the local alignment score with the vector v of positional multipliers being higher than the threshold. Similar limitations affect any such two-phase seed, such as requiring that two hypothetically aligned positions satisfy two vector seeds at once.
If we assume, however, that testing a hit to the simple hash table to verify whether the dot product of the local alignment score with the vector of multipliers v has score greater than the threshold T can be done so rapidly that we can throw out misses without having to count them, then we return to the case from before, where we need count only the fraction of positions expected to pass both levels of filtration. This assumption may be appropriate, assuming that the small amount of time taken to throw out a hash-table hit that does not satisfy the dot product threshold is much, much smaller than the amount of time needed to throw out a hit to the whole ungapped alignment seed that still does not make a good local alignment.
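The two-phase test described in Section 4.1.3 can be sketched as follows. This is only an illustration under stated assumptions: the per-position scores are taken as already looked up in a scoring matrix, trivial lower bounds are encoded as negative infinity, and the comparison against T is written as "at least T" (whether the inequality is strict is an assumption). The names `passes_bounds` and `is_seed_hit` are hypothetical.

```python
NO_BOUND = float("-inf")  # stands for a trivial (absent) lower bound

def passes_bounds(scores, bounds):
    """Phase 1: every position meets its lower bound (the hash-table step)."""
    return all(s >= b for s, b in zip(scores, bounds))

def is_seed_hit(scores, bounds, multipliers, threshold):
    """Phase 2 runs only for candidates that survive phase 1."""
    if not (len(scores) == len(bounds) == len(multipliers)):
        raise ValueError("scores, bounds and multipliers must have equal length")
    if not passes_bounds(scores, bounds):
        return False
    # Dot product of the local alignment scores with the multiplier vector v.
    weighted = sum(v * s for v, s in zip(multipliers, scores))
    return weighted >= threshold  # assumption: non-strict comparison

# Made-up scores for three aligned positions, purely for illustration.
bounds = [4, NO_BOUND, 4]
multipliers = [1, 0, 1]
print(is_seed_hit([5, -1, 4], bounds, multipliers, 9))
print(is_seed_hit([5, -1, 3], bounds, multipliers, 9))
```

Only the positions with a nontrivial bound (here the first and third) would contribute to the hash key; the dot product is the cheap second filter that the text argues may not need to be counted toward runtime.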
36 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 2, NO. 1, JANUARY-MARCH 2005
TABLE 5
Hits in Locally Good Regions of Alignments
identifying the alignments that actually have a core conserved region.
Our experiments show that multiple seed models can have an impact on local alignment of protein sequences. Using many spaced seeds, which we picked by optimizing an integer program, we find seed models with a chance of finding a good hit in a moderate-scoring alignment comparable to that of the BLASTP seed, with four to five times fewer noise hits. The difficulty with the BLASTP seed is that it not only has more junk hits and more hits in overlapping places, it also has more hits in short regions of true alignments, which are likely to be filtered and thrown out.
5 CONCLUSIONS
We have given a theoretical framework to the problem of using spaced seeds for protein homology search detection. Our result shows that using multiple vector or ungapped alignment seeds can give sensitivity to good parts of local protein alignments essentially comparable to BLASTP, while reducing the false positive rate of the search algorithm by a factor of four to five.
Our set of vector seeds is chosen by optimizing an integer programming framework for choosing multiple seeds when we want 100 percent sensitivity to a collection of training alignments. The framework is general enough to accommodate many extensions, such as requiring a fixed amount of sensitivity on the training set (not only 100 percent), allowing only a small number of seeds to be chosen, or allowing for many different sorts of seeding strategies. We have mostly used it to optimize sets of vector seeds because they encapsulate an approach to homology search for nucleotides that has been very successful.
One difficulty with our approach is that it relies on a theoretical estimate of the runtime of a homology search program: namely, that the program will take time proportional to the number of false positives found by the seeding method. As seeding methods become more complex, such as the two-step ungapped alignment seeds, it may become harder to identify what a “false positive” is; in particular, if a false positive fits through one step of a filter, but is quickly discarded before the next step, should it count toward the estimated runtime? Using our framework, we identified a set of seeds for moderate-scoring protein alignments whose total false positive rate in random sequence is four to five times lower than that of the default BLASTP seed. This set of seeds had hits to slightly fewer alignments in a test set of moderate-scoring alignments found by the Smith-Waterman algorithm than found by BLASTP; however, the BLASTP seeds hit subregions of these alignments that were actually slightly worse than those hit by the spaced seeds. Hence, given the filtering used by BLASTP, we expect that the two alignment strategies would give comparable sensitivity, while the spaced seeds give four times fewer false hits.
ACKNOWLEDGMENTS
The author would like to thank Ming Li for introducing him to the idea of spaced seeds. This work is supported by the Natural Science and Engineering Research Council of Canada and by the Human Frontier Science Program. A preliminary version of this paper [21] appeared at the Workshop on Algorithms in Bioinformatics, held in Bergen, Norway, in September, 2004.
REFERENCES
[1] S.F. Altschul, W. Gish, W. Miller, E.W. Myers, and D.J. Lipman, “Basic Local Alignment Search Tool,” J. Molecular Biology, vol. 215, no. 3, pp. 403-410, 1990.
[2] B. Ma, J. Tromp, and M. Li, “PatternHunter: Faster and More Sensitive Homology Search,” Bioinformatics, vol. 18, no. 3, pp. 440-445, Mar. 2002.
[3] B. Brejova, D. Brown, and T. Vinar, “Vector Seeds: An Extension to Spaced Seeds Allows Substantial Improvements in Sensitivity and Specificity,” Proc. Third Ann. Workshop Algorithms in Bioinformatics, pp. 39-54, 2003.
[4] M. Li, B. Ma, D. Kisman, and J. Tromp, “Patternhunter II: Highly Sensitive and Fast Homology Search,” J. Bioinformatics and Computational Biology, vol. 2, no. 3, pp. 419-439, 2004.
[5] J. Xu, D. Brown, M. Li, and B. Ma, “Optimizing Multiple Spaced Seeds for Homology Search,” Proc. 15th Ann. Symp. Combinatorial Pattern Matching, pp. 47-58, 2004.
[6] Y. Sun and J. Buhler, “Designing Multiple Simultaneous Seeds for DNA Similarity Search,” Proc. Eighth Ann. Int’l Conf. Computational Biology, pp. 76-84, 2004.
[7] D. Kisman, M. Li, B. Ma, and L. Wang, “TPatternHunter: Gapped, Fast and Sensitive Translated Homology Search,” Bioinformatics, 2004.
[8] T. Smith and M. Waterman, “Identification of Common Molecular Subsequences,” J. Molecular Biology, vol. 147, pp. 195-197, 1981.
[9] B. Brejova, D. Brown, and T. Vinar, “Vector Seeds: An Extension to Spaced Seeds,” J. Computer and System Sciences, 2005, pending publication.
[10] J. Buhler, U. Keich, and Y. Sun, “Designing Seeds for Similarity Search in Genomic DNA,” Proc. Seventh Ann. Int’l Conf. Computational Biology, pp. 67-75, 2003.
[11] B. Brejova, D. Brown, and T. Vinar, “Optimal Spaced Seeds for Homologous Coding Regions,” J. Bioinformatics and Computational Biology, vol. 1, pp. 595-610, Jan. 2004.
[12] U. Keich, M. Li, B. Ma, and J. Tromp, “On Spaced Seeds for Similarity Search,” Discrete Applied Math., vol. 138, pp. 253-263, 2004.
[13] K.P. Choi, F. Zeng, and L. Zhang, “Good Spaced Seeds for Homology Search,” Bioinformatics, vol. 20, no. 7, pp. 1053-1059, 2004.
[14] G. Kucherov, L. Noé, and Y. Ponty, “Estimating Seed Sensitivity on Homogeneous Alignments,” Proc. Fourth IEEE Int’l Symp. BioInformatics and BioEng., pp. 387-394, 2004.
[15] D. Brown and A. Hudek, “New Algorithms for Multiple DNA Sequence Alignment,” Proc. Fourth Ann. Workshop Algorithms in Bioinformatics, pp. 314-326, 2004.
[16] M. Csürös, “Performing Local Similarity Searches with Variable Length Seeds,” Proc. 15th Ann. Symp. Combinatorial Pattern Matching, pp. 373-387, 2004.
[17] K. Choi and L. Zhang, “Sensitive Analysis and Efficient Method for Identifying Optimal Spaced Seeds,” J. Computer and System Sciences, vol. 68, pp. 22-40, 2004.
[18] G. Kucherov, L. Noé, and Y. Ponty, “Multiseed Lossless Filtration,” Proc. 15th Ann. Symp. Combinatorial Pattern Matching, pp. 297-310, 2004.
[19] U. Feige, “A Threshold of ln n for Approximating Set Cover,” J. ACM, vol. 45, pp. 634-652, 1998.
[20] A. Bairoch and R. Apweiler, “The SWISS-PROT Protein Sequence Database and Its Supplement TrEMBL in 2000,” Nucleic Acids Research, vol. 28, no. 1, pp. 45-48, 2000.
[21] D. Brown, “Multiple Vector Seeds for Protein Alignment,” Proc. Fourth Ann. Workshop Algorithms in Bioinformatics, pp. 170-181, 2004.
Daniel G. Brown received the undergraduate degree in mathematics with computer science from the Massachusetts Institute of Technology in 1995 and the PhD degree in computer science from Cornell University in 2000. He then spent a year as a research scientist at the Whitehead Institute/MIT Center for Genome Research in Cambridge, Massachusetts, working on the Human and Mouse Genome Projects. Since 2001, he has been an assistant professor in the School of Computer Science at the University of Waterloo.
It is a pleasure to write this editorial at the beginning of the second year of the publication of the IEEE/ACM Transactions
on Computational Biology and Bioinformatics (TCBB). The last year saw the publication of four issues of TCBB, the first of
which was mailed out roughly nine months after our initial call for submissions. That accomplishment was the result of
tremendous cooperation and hard work on the part of authors, reviewers, associate editors, and staff. I would like to thank
everyone for making that possible.
During the past year, we received roughly 205 submissions and, presently, we have about 50 of those under review. In
our first year, we published 16 papers, including Part I of a special section on The Best Papers from WABI (Workshop on
Algorithms in Bioinformatics). Part II will appear this year, along with a special issue on Machine Learning in
Computational Biology and Bioinformatics. Other special issues are also in the planning stages. The papers that we have
published are establishing TCBB as a venue for the highest quality research in a broad range of topics in computational
biology and bioinformatics. I know that some of the papers we have already published will be cited as the foundational or
the definitive papers in several subareas of the field.
A goal for the future is to attract more submissions from the biology community and this will be facilitated when TCBB
is indexed in MEDLINE, which requires two years of publication before it will consider indexing a journal. So, this second
year of publication will hopefully lead to the inclusion of TCBB in MEDLINE.
Finally, I would like to share some wonderful news we received in February. The Association of American Publishers,
Professional and Scholarly Publishing Division awarded TCBB their “Honorable Mention” award for The Best New Journal
in any category for the year 2004. Only one Honorable Mention is awarded. Again, the credit for this accomplishment goes
to all the authors, reviewers, associate editors, and staff who have worked so hard to establish TCBB in this last year. I look
forward to continued growth and success of TCBB in our second year of publication.
Dan Gusfield
Editor-in-Chief
1545-5963/05/$20.00 © 2005 IEEE Published by the IEEE CS, CI, and EMB Societies & the ACM
Abstract—Motif inference represents one of the most important areas of research in computational biology, and one of its oldest ones.
Despite this, the problem remains very much open in the sense that no existing definition is fully satisfying, either in formal terms, or in
relation to the biological questions that involve finding such motifs. Two main types of motifs have been considered in the literature:
matrices (of letter frequency per position in the motif) and patterns. There is no conclusive evidence in favor of either, and recent work
has attempted to integrate the two types into a single model. In this paper, we address the formal issue in relation to motifs as patterns.
This is essential to get at a better understanding of motifs in general. In particular, we consider a promising idea that was recently
proposed, which attempted to avoid the combinatorial explosion in the number of motifs by means of a generator set for the motifs.
Instead of exhibiting a complete list of motifs satisfying some input constraints, what is produced is a basis of such motifs from which all
the other ones can be generated. We study the computational cost of determining such a basis of repeated motifs with wild cards in a
sequence. We give new upper and lower bounds on such a cost, introducing a notion of basis that is provably contained in (and, thus,
smaller than) previously defined ones. Our basis can be computed in less time and space, and is still able to generate the same set of
motifs. We also prove that the number of motifs in all bases defined so far grows exponentially with the quorum, that is, with the
minimal number of times a motif must appear in a sequence, something unnoticed in previous work. We show that there is no hope to
efficiently compute such bases unless the quorum is fixed.
1 INTRODUCTION
structuring such input. Yet, it appeared clear also to any computational biologist working with motifs as patterns that there was further structure to be extracted from the set of motifs found, even when such a set is huge. Furthermore, such a structure could reflect some additional biological information, thus providing additional motivation for inferring it. Doing this is generally addressed by means of clustering, or even by attempting to bring together the two types of motif models (PSSMs and patterns). Indeed, recently researchers have been using pattern detection as a first filter-flavored step toward inferring PSSMs from biological sequences [6]. This seems very promising although much work remains to be done to precisely determine the relation between the two types of models, and to fully explore the biological implications this may have.
Again, each of the two above approaches is valid, but the question remained open whether or not the inner structure of a set of motifs could be expressed in a manner that would be more satisfying from both the mathematical and the biological points of view. Then, in 2000, a paper by Parida et al. [17] seemed to present a way of extracting such an inner structure in a very elegant and powerful way for a particular type of motif. The power of their proposal resided in the fact that the above mentioned structure corresponded to a well-known and precisely defined mathematical object and, moreover, guaranteed that no solution would be lost. Exhaustiveness in relation to the chosen type of motif is also preserved, thus enabling a biologist to draw some conclusions even in the face of negative answers (i.e., when no motifs, or no a priori “expected” motifs are found in a given input), something which PSSM-detecting methods do not allow. The structure is that of a basis of motifs. Informally speaking, it is a subset of all the motifs satisfying some input parameters (related, for instance, to which differences between a pattern and its occurrences are allowed) from which it is possible to recover all the other motifs, in the sense that all motifs not in the basis are a combination of some (in general, a few only) motifs in the basis. Such a combination is modeled by simple rules to systematically generate the other motifs with an output sensitive cost [18]. A basis would therefore also provide a way of characterizing the input, which then might be used to compare different inputs without resorting to the traditional alignment methods with all the pitfalls they present. The idea of a basis would fulfill such expectations if its size could be proven to be small enough. The argument [17] seemed to be that, for the type of motifs considered, a compact enough basis could always be found.
The motifs considered in [17] were patterns with wild card symbols occurring in a given sequence $s$ of $n$ symbols drawn over an alphabet $\Sigma$. A wild card symbol is a special symbol “$\circ$” matching any other element.¹ For example, the pattern T$\circ$G matches both TTG and TGG inside $s$ = TTGG. Parida et al. focused on patterns which appear at least $q$ times in $s$ for an input parameter $q \ge 2$, called the quorum. This may, at first sight, seem an even more restrictive type of motif than patterns in general. It, however, has the merit of capturing one aspect of biological features that current PSSMs in general ignore, or address only in an indirect way. This aspect often concerns isolated positions inside a motif that are not part of the biological feature being captured. This is the case, for instance, with some binding sites, particularly at the protein level. Studying patterns with wild cards has a further very important motivation in biology, even when no differences (such as substitutions) are allowed. Indeed, motifs such as these or closely related ones can be used as seeds for finding long repeats and for aligning, pairwise or multiple-wise, a set of sequences or even whole genomes [15], [23].
The basis introduced by Parida et al. had interesting features, but presented some unsatisfying properties. In particular, as we show in this paper, there is an infinite family of strings for which the authors’ basis contains $\Omega(n^2)$ motifs for $q = 2$. This contradicts the upper bound of $3n$ for any $q \ge 2$ given in [17]. As a result, the algorithm taking $O(n^3 \log n)$ time, mentioned in [17], for finding the basis of motifs does not hold since it relies on the upper bound of $3n$, thus leaving open the problem of efficiently discovering a basis. A refinement of the definition of basis and an incremental construction in $O(n^3)$ time has recently been described by Apostolico and Parida [2]. A comparative survey of several notions of bases can be found in [22].
Closely following previous work, here we introduce a new definition of basis. The condition for the new basis is stronger than that of [17] and, hence, our basis is included in that of [17] (and is thus smaller) while both are able to generate the same set of motifs with mechanical rules. Our basis is moreover symmetric: Given a string $s$, the motifs in the basis for its reverse $\tilde{s}$ are the reversals of the motifs in the basis for $s$. Moreover, the number of motifs in our basis can provably be upper bounded in the worst case by $n - 1$ for $q = 2$ and occur in $s$ a total of $2n$ times at most. However, we reveal an exponential dependency on $q$ for the number of motifs in all bases defined so far (i.e., including our basis, Parida’s and Pelfrene et al.’s [19]), something unnoticed in previous work. Consequently, no polynomial-time algorithm can exist for finding one of these bases with arbitrary values of $q \ge 2$.
2 NOTATION AND TERMINOLOGY
We consider strings that are finite sequences of letters drawn from an alphabet $\Sigma$, whose elements are also called solid characters. We introduce an additional symbol (denoted by $\circ$ and called wild card) that does not belong to $\Sigma$ and matches any letter; a wild card clearly matches itself. The length of a string $t$, denoted by $|t|$, is the number of letters and wild cards in $t$, and $t[i]$ indicates the letter or wild card at position $i$ in $t$ for $0 \le i \le |t| - 1$ (hence, $t = t[0]\,t[1] \cdots t[|t| - 1]$, also noted $t[0..|t| - 1]$).
Definition 1 (pattern). Given the alphabet $\Sigma$, a pattern is a string in $\Sigma \cup \Sigma(\Sigma \cup \{\circ\})^{*}\Sigma$ (that is, it starts and ends with a solid character).
The patterns are related by the following specificity relation $\preceq$.
1. In the literature on sequence analysis and pattern matching, the wild card is often referred to as do not care (as it is in the literature on bases of motifs). Therefore, we will use this latter term when referring to the sequence analysis and string matching literature.
Definition 2 ($\preceq$). For individual characters $\sigma_1, \sigma_2 \in \Sigma \cup \{\circ\}$, we have $\sigma_1 \preceq \sigma_2$ if $\sigma_1 = \circ$ or $\sigma_1 = \sigma_2$. Relation $\preceq$ extends to strings in $(\Sigma \cup \{\circ\})^{*}$ under the convention that each string $t$ is implicitly surrounded by wild cards, namely, letter $t[j]$ is $\circ$ when $j \ge |t|$. Hence, $v$ is more specific than $u$ (written $u \preceq v$) if $u[j] \preceq v[j]$ for any integer $j$.
We can now formally define the occurrences of patterns $x$ in $s$ and their lists.
Definition 3 (occurrence, $L_x$). We say that $u$ occurs at position $\ell$ in $v$ if $u[j] \preceq v[j + \ell]$, for $0 \le j \le |u| - 1$ (equivalently, we say that $u$ matches $v[\ell..\ell + |u| - 1]$). For the input string $s \in \Sigma^{*}$ with $n = |s|$, we consider the location list $L_x \subseteq \{0..n - 1\}$ as the set of all the positions on $s$ at which $x$ occurs.
employing the example string $s$ = FABCXFADCYZEADCEADC. For this string and $q = 2$ the location list of motif $x_1$ = A$\circ$C is $L_{x_1} = \{1, 6, 12, 16\}$, and that of motif $x_2$ = FA$\circ$C is $L_{x_2} = \{0, 5\}$. They are both maximal because they lose at least one of their occurrences when extended with solid characters at one side (possibly with wild cards in between), or when their wild cards are replaced by solid characters. However, motif $x_3$ = DC having list $L_{x_3} = \{7, 13, 17\}$ is not maximal. It occurs in $x_4$ = ADC, where $L_{x_4} = \{6, 12, 16\}$, and its occurrences can be obtained from those of $x_4$ by a displacement of $d = 1$ positions. The basis of the irredundant motifs for $s$ is made up of $x_1$ = A$\circ$C, $x_2$ = FA$\circ$C, $x_4$ = ADC, and $x_5$ = EADC. The location list of each of them cannot be obtained from the union of any of the other location lists.
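The location lists quoted for the example string can be reproduced with a direct implementation of Definition 3. The sketch below is a naive scan, not an algorithm from the paper; it assumes the wild card is written as "." (so "." must not be a solid character), and it only tries positions where the pattern fits entirely inside s, which is sufficient because patterns end with a solid character. The function names are illustrative.

```python
WILD = "."  # stand-in for the wild card symbol (an assumption of this sketch)

def matches_char(pattern_char, text_char):
    """Character-level specificity: a wild card matches anything."""
    return pattern_char == WILD or pattern_char == text_char

def occurs_at(x, s, pos):
    """True if pattern x occurs at position pos in s (Definition 3)."""
    if pos < 0 or pos + len(x) > len(s):
        return False
    return all(matches_char(x[j], s[pos + j]) for j in range(len(x)))

def location_list(x, s):
    """All positions of s at which x occurs, in increasing order."""
    return [p for p in range(len(s) - len(x) + 1) if occurs_at(x, s, p)]

s = "FABCXFADCYZEADCEADC"
for motif in ("A.C", "FA.C", "DC", "ADC", "EADC"):
    print(motif, location_list(motif, s))
```

The list for DC is the list for ADC shifted by one position, which is the displacement d = 1 mentioned in the text.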
4 TILING MOTIFS: THE BASIS AND ITS PROPERTIES
4.1 Terminology and Properties
In this section, we introduce a natural notion of a basis for generating all maximal motifs occurring in a string $s$ of length $n$.
Definition 7 (tiling motif). A maximal motif $x$ is tiling if, for any maximal motifs $y_1, y_2, \ldots, y_k$ and for any integers $d_1, d_2, \ldots, d_k$ such that $L_x = \bigcup_{i=1}^{k} (L_{y_i} + d_i)$, motif $x$ must be one of the $y_i$s. Conversely, if all the $y_i$s are different from $x$, pattern $x$ is said to be tiled by motifs $y_1, y_2, \ldots, y_k$.
The notion of tiling is in general more selective than that of irredundancy. Continuing our example string $s$ = FABCXFADCYZEADCEADC, we have seen in Section 2 that motif $x_1$ = A$\circ$C is irredundant for $s$. Now, $x_1$ is tiled by $x_2$ = FA$\circ$C and $x_4$ = ADC according to Definition 7 since its location list, $L_{x_1} = \{1, 6, 12, 16\}$, can be obtained from the union of $L_{x_2} = \{0, 5\}$ and $L_{x_4} = \{6, 12, 16\}$ with respective displacements $d_2 = 1$ and $d_4 = 0$.
Remark 1. A fairly direct consequence of Definition 7 is that if $x$ is tiled by $y_1, y_2, \ldots, y_k$ with associated displacements $d_1, d_2, \ldots, d_k$, then $x$ occurs at position $d_i$ in $y_i$ for $1 \le i \le k$. As a consequence, we have that $d_i \ge 0$ in Definition 7. Note also that the $y_i$s in Definition 7 are not necessarily distinct and that $k > 1$ for tiled motifs. (It follows from the fact that $L_x = L_{y_1} + d_1$ with $x \ne y_1$ would contradict the maximality of both $x$ and $y_1$.) As a result, a maximal motif $x$ occurring exactly $q$ times in $s$ is tiling as it cannot be tiled by any other motifs because such motifs would occur less than $q$ times.
The basis of tiling motifs is the complete set of all tiling motifs for $s$, and the size of the basis is the number of these motifs. For example, the basis, let us denote it by $B$, for FABCXFADCYZEADCEADC contains FA$\circ$C, EADC, and ADC as tiling motifs. Although Definition 7 is derived from that of irredundant motifs given in Definition 6, the difference is much more substantial than it may appear. The basis of tiling motifs relies on the fact that tiling motifs are considered as invariant by displacement as for maximality. Consequently, our definition of basis is symmetric, that is, each tiling motif in the basis for the reverse string $\tilde{s}$ is the reverse of a tiling motif in the basis of $s$. This follows from the symmetry in Definition 7 and from the fact that maximality is also symmetric in Definition 5. It is a sine qua non condition for having a notion of basis invariant by the left-to-right or right-to-left order of the symbols in $s$ (like the entropy of $s$), while this property does not hold for the irredundant motifs.
The basis of tiling motifs has further interesting properties for quorum $q = 2$, illustrated in Sections 4.2, 4.3, and 4.4. In Section 4.2, we show that our basis is linear (that is, its size is at most $n - 1$). In Section 4.3, we show that the total size of the location lists for the tiling motifs is less than $2n$, describing how to find them in $O(n^2 \log n \log |\Sigma|)$ time. In Section 4.4, we discuss some applications such as generating all maximal motifs with the basis and finding motifs with a constraint on the number of undefined symbols.
4.2 A Linear Upper Bound for the Tiling Motifs with Quorum $q = 2$
Given a string $s$ of length $n$, let $B$ denote its basis of tiling motifs for quorum $q = 2$. Although the number of maximal motifs may be exponential and the basis of irredundant motifs may be at least quadratic (see Section 3), we show that the size of $B$ is always less than $n$. For this, we introduce an operator $\oplus$ between the symbols of $\Sigma$ to define the merges, which are at the heart of the properties of $B$. Given two letters $\sigma_1, \sigma_2 \in \Sigma$ with $\sigma_1 \ne \sigma_2$, the operator satisfies $\sigma_1 \oplus \sigma_2 = \circ$ and $\sigma_1 \oplus \sigma_1 = \sigma_1$. The operator applies to any pair of strings $x, y \in \Sigma^{*}$, so that $u = x \oplus y$ satisfies $u[j] = x[j] \oplus y[j]$ for all integers $j$.
Definition 8 (Merge). For $1 \le k \le n - 1$, let $s_k$ be the (infinite) string whose character at position $i$ is $s_k[i] = s[i] \oplus s[i + k]$. If $s_k$ contains at least one solid character, $Merge_k$ denotes the motif obtained by removing all the leading and trailing $\circ$s in $s_k$ (that is, those appearing before the leftmost solid character and after the rightmost solid character).
For example, FABCXFADCYZEADCEADC has $Merge_4$ = EADC, $Merge_5$ = FA$\circ$C, $Merge_6 = Merge_{10}$ = ADC, and $Merge_{11} = Merge_{15}$ = A$\circ$C. The latter is the only merge that is not a tiling motif.
Lemma 1. If $Merge_k$ exists, it must be a maximal motif.
Proof. Motif $x = Merge_k$ occurs at positions, say, $i$ and $i + k$ in $s$. Character $s_k[i]$ is solid by Definitions 4 and 8. We use the fact that $x$ occurs at least twice in $s$ for showing that it is maximal. Suppose it is not maximal. By Definition 5, there exists $y \ne x$ such that $x$ occurs in $y$ and $L_y = L_x + d$ for some integer $d$ (in this case $d \le 0$). Since $y$ is more specific than $x$ displaced by $d$, there must exist at least one position $j$ with $0 \le j < |y|$ such that $x[j + d] = \circ$ and $y[j] = \sigma \in \Sigma$. Hence, $x[j + d] = s[i + (j + d)] \oplus s[i + k + (j + d)] = \circ$, and so $s[(i + d) + j] \ne s[(i + k + d) + j]$. Since $y[j]$ cannot match both of the latter symbols in $s$, at least one of $i + d$ or $i + k + d$ is not a position of $y$ in $s$. This contradicts the hypothesis that $L_y = L_x + d$, whereas both $i, i + k \in L_x$. $\square$
Lemma 2. For each tiling motif $x$ in the basis $B$, there is at least one $k$ for which $Merge_k = x$.
Proof. As mentioned in Remark 1, a maximal motif occurring exactly twice in $s$ is tiling. Hence, if $|L_x| = 2$, say $L_x = \{i, j\}$ with $j > i$, then $x = Merge_k$ with $k = j - i$ by the maximality of $x$ and that of the merges by Lemma 1. Let us now consider the case where $|L_x| > 2$. For any pair $i, j \in L_x$, we denote by $u_{ij}$ the string $s[i..i + |x| - 1] \oplus s[j..j + |x| - 1]$ obtained by applying the $\oplus$ operator to the two substrings of $s$ matching $x$ at positions $i$ and $j$, respectively. We have $x \preceq u_{ij}$ since $x$ occurs at positions $i$ and $j$, and $L_x = \bigcup_{i,j \in L_x} L_{u_{ij}}$ since we are taking all pairs of occurrences of $x$. Letting $k = |j - i|$ for $i, j \in L_x$, we observe that $u_{ij}$ is a substring of $Merge_k$ occurring at position, say, $\delta_k$ in it. Thus,
$$\bigcup_{i,j \in L_x} L_{u_{ij}} = \bigcup_{k = |j - i| \,:\, i,j \in L_x} \left( L_{Merge_k} + \delta_k \right) = L_x.$$
By Definition 7, the fact that $x$ is tiling implies that $x$ must be one $Merge_k$, proving the lemma. $\square$
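Definition 8 and the example merges listed after it can be checked with a few lines of code. This is a minimal sketch, not the paper's algorithm: the wild card is written as "." (an assumption, so "." must not occur in s), positions outside s are treated as wild cards so only i with i + k < n is examined, and the names `merge` and `merges` are illustrative.

```python
WILD = "."  # stand-in for the wild card symbol (an assumption of this sketch)

def merge(s, k):
    """Return Merge_k of s, or None when s_k has no solid character."""
    n = len(s)
    # s_k[i] keeps s[i] when s[i] == s[i + k] and is a wild card otherwise.
    s_k = "".join(s[i] if s[i] == s[i + k] else WILD for i in range(n - k))
    trimmed = s_k.strip(WILD)  # drop leading and trailing wild cards
    return trimmed or None

def merges(s):
    """Map each k in 1..n-1 for which Merge_k exists to that merge."""
    result = {}
    for k in range(1, len(s)):
        m = merge(s, k)
        if m is not None:
            result[k] = m
    return result

example = merges("FABCXFADCYZEADCEADC")
print(example)
periodic = merges("ATATATATA")
print(periodic)
```

For the example string only six values of k yield a merge, giving four distinct motifs. This is consistent with Lemma 2, since the three tiling motifs FA.C, EADC, and ADC all appear among them.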
PISANTI ET AL.: BASES OF MOTIFS FOR GENERATING REPEATED PATTERNS WITH WILD CARDS 45
We now state the main property of tiling bases that follows directly from Lemma 2.

Theorem 3 (linearity of the basis). Given a string $s$ of length $n$ and the quorum $q = 2$, let $M$ be the set of $\mathit{Merge}_k$, for $1 \le k \le n - 1$ such that $\mathit{Merge}_k$ exists. The basis $B$ of tiling motifs for $s$ satisfies $B \subseteq M$ and, therefore, the size of $B$ is at most $n - 1$.

A simple consequence of Theorem 3 implies a tight bound on the number of tiling motifs for periodic strings. If $s = w^e$ for a string $w$ repeated $e > 1$ times, then $s$ has at most $|w|$ tiling motifs.

Corollary 1. The number of tiling motifs for $s$ is at most $p$, the smallest period of $s$.

The bound in Corollary 1 is not valid for irredundant motifs. String $s$ = ATATATATA has period $p = 2$ and only one tiling motif ATATATA, while its irredundant motifs are A, ATA, ATATA, and ATATATA.

4.3 A Simple Algorithm for Computing Tiling Motifs with Quorum $q = 2$

We describe how to compute the basis $B$ for string $s$ when $q = 2$. A brute-force algorithm generating first all maximal motifs of $s$ takes exponential time in the worst case. Theorem 3 plays a crucial role in that we first compute the motifs in $M$ and then discard those being tiled. Since $B \subseteq M$, what remains is exactly $B$. To appreciate this approach, it is worth noting that we are left with the problem of selecting $B$ from $n - 1$ maximal motifs in $M$ at most, rather than selecting $B$ among all the maximal motifs in $s$, which may be exponential in number. Our simple algorithm takes $O(n^2 \log n \log |\Sigma|)$ time and is faster than previous (and more complicated) methods discussed in Section 1.

Step 1. Compute the multiset $M'$ of merges. Letting $s_k[i]$ be the leftmost solid character of string $s_k$ in Definition 8, we define $\mathit{occ}_x = \{i, i + k\}$ to be the positions of the two occurrences of $x$ whose superposition generates $x = \mathit{Merge}_k$. For $k = 1, 2, \ldots, n - 1$, we compute string $s_k$ in $O(n - k)$ time. If $s_k$ contains some solid characters, we compute $x = \mathit{Merge}_k$ and $\mathit{occ}_x$ in the same time complexity. As a result, we compute the multiset $M'$ of merges in $O(n^2)$ time. Each merge $x$ in $M'$ is identified by a triplet $\langle i, i + k, |x| \rangle$, from which we can recover the $j$th symbol of $x$ in constant time by simple arithmetic operations and comparisons.

Step 2. Transform the multiset $M'$ into the set $M$ of merges. Since there can be two or more merges in $M'$ that are identical and correspond to the same merge in $M$, we put together all identical merges in $M'$ by radix sorting them. The total cost of this step is dominated by radix sorting, giving $O(n^2)$ time. As a byproduct, we produce the temporary location list $T_x = \bigcup_{x' = x \,:\, x' \in M'} \mathit{occ}_{x'}$ for each distinct $x \in M$ thus obtained.

Lemma 3. Each motif $x \in B$ satisfies $T_x = L_x$.
Proof. For a fixed $x \in B$, the fact that $x$ is equal to at least one merge by Lemma 2 implies that $T_x$ is well defined, with $|T_x| \ge 2$. Since $T_x \subseteq L_x$, let us assume by contradiction that $L_x - T_x \ne \emptyset$. For each pair $i \in L_x - T_x$ and $j \in T_x$, let $m_{ij} = \mathit{Merge}_{|j-i|}$, which is maximal by Lemma 1. Note that each $m_{ij} \ne x$ by our assumption, as otherwise $i$ would belong to $T_x$; however, $x$ must occur in $m_{ij}$, say, at position $\delta_{ij}$ in $m_{ij}$. Consequently,
\[ \bigcup_{i \in L_x - T_x,\; j \in T_x} \bigl( L_{m_{ij}} + \delta_{ij} \bigr) = L_x \]
since any occurrence of $x$ is either $i \in L_x - T_x$ or $j \in T_x$. At this point, we apply Definition 7 to the tiling motif $x$, obtaining the contradiction that $x$ must be equal to one $m_{ij}$. □

Notice that the conclusion of Lemma 3 does not necessarily hold for the motifs in $M - B$. For the previous example string FADABCXFADCYZEADCEADCFADC, one such motif is $x$ = ADC with $L_x = \{8, 14, 18, 22\}$ while $T_x = \{8, 18\}$.

Step 3. Select $M^* \subseteq M$, where $M^* = \{x \in M : T_x = L_x\}$. In order to build $M^*$, we employ the Fischer-Paterson algorithm based on convolution [8] for string matching with don't cares to compute the whole list of occurrences $L_x$ for each merge $x \in M$. Its cost is $O((|x| + n) \log n \log |\Sigma|)$ time for each merge $x$. Since $|x| < n$ and there are at most $n - 1$ motifs $x \in M$, we obtain $O(n^2 \log n \log |\Sigma|)$ time to construct all lists $L_x$. We can compute $M^*$ by discarding the merges $x \in M$ such that $T_x \ne L_x$ in additional $O(n^2)$ time.

Lemma 4. The set $M^*$ satisfies the conditions $B \subseteq M^*$ and $\sum_{x \in M^*} |L_x| < 2n$.
Proof. The first condition follows from the fact that the motifs in $M - M^*$ are surely tiled by Lemma 3. The second condition follows from the definition of $M^*$ and from the observation that
\[ \sum_{x \in M^*} |L_x| \;=\; \sum_{x \in M^*} |T_x| \;\le\; \sum_{x \in M} |\mathit{occ}_x| \;<\; 2n, \]
since $|\mathit{occ}_x| = 2$ (see Step 1) and there are fewer than $n$ of them. □

The property of $M^*$ in Lemma 4 is crucial in that $\sum_{x \in M} |L_x| = \Theta(n^2)$ when many lists contain $\Theta(n)$ entries. For example, $s = \mathrm{A}^n$ has $n - 1$ distinct merges, each of the form $x = \mathrm{A}^i$ for $1 \le i \le n - 1$, and so $|L_x| = n - i + 1$. This would be a sharp drawback in Step 4 when removing tiled motifs, as it may turn into a $\Theta(n^3)$ algorithm. Using $M^*$ instead, we are guaranteed that $\sum_{x \in M^*} |L_x| = O(n)$; hence, we may still have some tiled motifs in $M^*$, but their total number of occurrences is $O(n)$.

Step 4. Discard the tiled motifs in $M^*$. We can now check for tiling motifs in $O(n^2)$ time. Given two distinct motifs $x, y \in M^*$, we want to test whether $L_x + d \subseteq L_y$ for some integer $d$ and, in that case, we want to mark the entries in $L_y$ that are also in $L_x + d$. At the end of this task, the lists having all entries marked are tiled (see Definition 7). By removing their corresponding motifs from $M^*$, we eventually obtain the basis $B$ by Lemma 4. Since the meaningful values of $d$ are as many as the entries of $L_y$, we have only $|L_y|$ possible values to check. For a given value of $d$, we avoid merging $L_x$ and $L_y$ in $O(|L_x| + |L_y|)$ time to perform the test, as it would contribute to a total of $\Theta(n^3)$ time. Instead, we exploit the fact that each list has values ranging from 1 to $n$, and use two bit-vectors of size $n$ to perform the above check in $O(|L_x| \times |L_y|)$ time for all values of $d$. This gives
\[ O\Bigl(\sum_y \sum_x |L_x| \times |L_y|\Bigr) = O\Bigl(\sum_y |L_y| \times \sum_x |L_x|\Bigr) = O(n^2) \]
by Lemma 4.
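Steps 1 to 4 can be illustrated with the following minimal Python sketch. It is not the authors' implementation: it assumes 0-based positions, uses the character "." as a stand-in for the wild card, replaces the Fischer-Paterson convolution of Step 3 with naive quadratic matching, and uses plain Python lists in place of bit-vectors. The function names (`merges`, `occurrences`, `tiling_basis`) are hypothetical and chosen only for this illustration.

```python
from collections import defaultdict

WILD = "."  # assumed stand-in for the wild card; must not occur in s


def merges(s):
    """Steps 1-2: map each distinct merge x to its temporary list T_x."""
    n = len(s)
    temp = defaultdict(set)
    for k in range(1, n):
        # s_k[j] keeps s[j] where s[j] == s[j + k]; wild card elsewhere
        sk = [s[j] if s[j] == s[j + k] else WILD for j in range(n - k)]
        solid = [j for j, c in enumerate(sk) if c != WILD]
        if not solid:
            continue  # Merge_k does not exist for this k
        first, last = solid[0], solid[-1]
        x = "".join(sk[first:last + 1])
        temp[x].update((first, first + k))  # occ_x = {i, i + k}
    return temp


def occurrences(x, s):
    """Full location list L_x, by naive matching (not Fischer-Paterson)."""
    return {p for p in range(len(s) - len(x) + 1)
            if all(c == WILD or c == s[p + j] for j, c in enumerate(x))}


def tiling_basis(s):
    """Steps 3-4: keep merges with T_x == L_x, then discard tiled ones."""
    n = len(s)
    temp = merges(s)
    star = {}
    for x, t_x in temp.items():
        l_x = occurrences(x, s)
        if t_x == l_x:              # Step 3: x belongs to M*
            star[x] = sorted(l_x)
    basis = []
    for y, l_y in star.items():     # Step 4: two-vector marking
        v1 = [False] * n
        v2 = [False] * n
        for i in l_y:
            v1[i] = True
        for x, l_x in star.items():
            if x == y:
                continue
            m = l_x[0]              # smallest entry of L_x
            for d in (i - m for i in l_y):
                shifted = [j + d for j in l_x]
                if all(0 <= j < n and v1[j] for j in shifted):
                    for j in shifted:
                        v2[j] = True
        if not all(v2[i] for i in l_y):
            basis.append(y)         # some occurrence of y is not covered
    return basis
```

On the periodic string ATATATATA, the sketch keeps only ATATATA, in line with the remark following Corollary 1. The naive matching makes Step 3 quadratic per merge, so this is meant only for small inputs.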
46 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 2, NO. 1, JANUARY-MARCH 2005
We therefore detail how to perform the above check with $L_x$ and $L_y$ in $O(|L_x| \times |L_y|)$ time. We use two bit-vectors $V_1$ and $V_2$ of length $n$ initially set to all zeros. Given $y \in M^*$, we set $V_1[i] = 1$ if $i \in L_y$. For each $x \in M^* - \{y\}$ and for each $d \in (L_y - m)$ (where $m$ is the smallest entry of $L_x$), we then perform the following test. If all $j \in L_x + d$ satisfy $V_1[j] = 1$, we set $V_2[j] = 1$ for all such $j$. Otherwise, we take the next value of $d$, or the next motif if there are no more values of $d$, and we repeat the test. After examining all $x \in M^* - \{y\}$, we check whether $V_1[i] = V_2[i]$ for all $i \in L_y$. If so, $y$ is tiled as its list is covered by possibly shifted location lists of other motifs. We then reset the ones in both vectors in $O(|L_y|)$ time.

Summing up Steps 1-4, we have that the dominant cost is that of Step 3 and that we have proved the following result.

Theorem 4. Given an input string $s$ of length $n$ over the alphabet $\Sigma$, the basis of tiling motifs with quorum $q = 2$ can be computed in $O(n^2 \log n \log |\Sigma|)$ time. The total number of motifs in the basis is less than $n$, and the total number of their occurrences in $s$ is less than $2n$.

We have implemented the algorithm underlying Theorem 4, and we report here the lessons learned from our experiments. Step 1 requires, in practice, less than the predicted $O(n^2)$ running time. If $p = 1/|\Sigma|$ denotes the probability that two randomly chosen symbols of $\Sigma$ match in the uniform distribution, the probability of finding the first solid character in a merge follows the binomial distribution, and so the expected number of examined characters in $s$ is $O(1/p) = O(|\Sigma|)$, yielding $O(n|\Sigma|)$ time on the average to locate the first (scanning $s$ from the beginning) and the last (scanning $s$ from the end backward) solid character in each merge. A similar approach can be followed in Step 2 for finding the distinct merges. In this case, the merges are first partially sorted using hashing and exploiting the fact that the input is almost sorted. Insertion sort is then the best choice and works very efficiently in our experiments (at least 50 percent faster than Quicksort). We do not yet compute the full merges at this stage, but we delay this expensive part to a later stage on a small set of buckets that require explicit representation of the merges. As a result, the average case is almost linear. For example, executing Steps 1 and 2 on chromosome V of C. elegans containing more than 21 million bases took around 15 minutes on a machine with 512 MB of RAM running Linux on a 1 GHz AMD Athlon processor. Step 3 is expensive also in practice and the worst case predicted by theory shows up in the experiments. Running this step on sequences much shorter than chromosome V of C. elegans took many hours. Step 4 is not much of a problem. As a result, an alternative way of selecting $M^*$ from $M$ in Step 3 that works fast in practice would improve considerably the overall performance.

4.4 Some Applications

Checking whether a pattern is a motif. The main property underlying the notion of basis is that it is a generator of all motifs. The generation can be done as follows: First select segments of motifs in the basis that start and end with solid characters, then replace any number of internal solid characters by wild cards. However, since the number of motifs, and even maximal motifs, can be exponential, this is not really meaningful unless this number is small and the time complexity of the algorithm is proportional to the total size of the output. An attempt in this direction is made in [18]. The dual problem concerns testing only one pattern. We show how, given a pattern $x$, it can be tested whether $x$ is a motif for string $s$, that is, if pattern $x$ occurs at least $q$ times in $s$. There are two possible ways of performing such a test, depending on whether we test directly on the string or on the basis. The answer relies on iterative applications of the observation made in Remark 1, according to which any tiled motif must occur in at least one tiling motif. The next two statements deal with the alternative. In both cases, we assume that integer $k$ comes from the decomposition of pattern $x$ in the form $u_0 \circ^{\ell_0} u_1 \circ^{\ell_1} \cdots u_{k-1} \circ^{\ell_{k-1}} u_k$, where the subwords $u_i$ contain no wild cards ($u_i \in \Sigma^*$, $0 \le i \le k$) and $\ell_j$ are positive integers, $0 \le j \le k - 1$. The next proposition states a well-known fact on matching such a pattern in a text without any wild card that we report here because it is used in the sequel.

Proposition 3. The positions of the occurrences of a pattern $x$ in a string of length $n$ can be computed in time $O(kn)$.
Proof. This is a mere application of matching a pattern with do not cares inside a text without do not cares. Using, for instance, the Fischer and Paterson algorithm [8] is not necessary. Instead, the positions of the subwords $u_i$ are computed by a multiple string-matching algorithm, such as the Aho-Corasick algorithm [1]. For each position $p$, a counter associated with position $p - \ell$ on $s$ is incremented, where $\ell$ is the position of $u_i$ in $x$ ($\ell$ is the offset of $u_i$ in $x$). Counters whose value is $k + 1$ correspond then to occurrences of $x$ in $s$. It remains to check if $x$ occurs at least $q$ times in $s$. The running time is governed by the string-matching algorithm, which is $O(kn)$ (equivalent to running $k$ times a linear-time string matching algorithm). □

Proposition 4. Given the basis $B$ of string $s$, testing if pattern $x$ is a motif or a maximal motif can be done in $O(kb)$ time, where $b = \sum_{y \in B} |y|$.
Proof. From Remark 1, testing if $x$ is a maximal motif requires only finding if $x$ occurs in an element $y$ of the basis. To do this, we can apply the procedure of the previous proof because wild cards in $y$ should be viewed as extra characters that do not match any letter of $\Sigma$. The time complexity of the procedure is thus $O(kb)$. Since a nonmaximal motif occurs in a maximal motif, the same procedure applies to test if $x$ is a general motif. □

As a consequence of Propositions 3 and 4, we get an upper bound on the time complexity for testing motifs.

Corollary 2. Testing whether or not pattern $u_0 \circ^{\ell_0} u_1 \circ^{\ell_1} \cdots u_{k-1} \circ^{\ell_{k-1}} u_k$ is a motif in a string of length $n$ having a basis of total size $b$ can be done in time $O(k \times \min\{b, n\})$.

Remark 2. Inside the procedure described in the proofs of Propositions 3 and 4, it is also possible to use bit-vector pattern matching methods [3], [16], [25] to compute the occurrences of $x$. This leads to practically efficient solutions running in time proportional to the length of
the string $n$ or the total size of the basis $b$, in the bit-vector model of machine. This is certainly a method of choice for short patterns.

Finding the longest motif with bounded number of wild cards. We address an interesting question concerning the computation of a longest motif occurring repeatedly in a string. Given an integer $g \ge 0$, let $LM_g(s)$ be the maximal length of motifs occurring in a string $s$ of length $n$ with quorum $q = 2$, and containing no more than $g$ wild cards. If $g = 0$, the value can be computed in $O(n \log |\Sigma|)$ time with the help of the suffix tree of $s$ (see [5] or [10]). For $g > 0$, we can show that $LM_g(s)$ can be computed in $O(gn^2)$ time using the suffix tree augmented (in linear time) to accept lowest common ancestor (LCA) queries as follows: For each possible pair $(i, j)$ of positions on $s$ for which $s[i] = s[j]$, we compute the longest common prefix of $s[i..n-1]$ and $s[j..n-1]$ in constant time through an LCA query on the suffix tree. If $\ell$ is the length of the prefix, we get the first part $s[i..i+\ell-1]$ of a possible longest motif. The second part is found similarly by considering the pair of positions $(i + \ell + 1, j + \ell + 1)$. The process is iterated $g$ times (or less) and provides a longest motif containing at most $g$ wild cards and occurring at positions $i$ and $j$. Length $LM_g(s)$ is obtained by taking the maximum length of motifs for all pairs of positions $(i, j)$. This yields the next result.

Proposition 5. Using the suffix tree, $LM_g(s)$ can be computed in $O(gn^2)$ time.

What makes the use of the basis of tiling motifs interesting is that computing $LM_g(s)$ becomes a mere pattern matching exercise because of the strong properties of the basis. This contrasts with the previous result grounded on the deep algorithmic technique for LCA queries.

Proposition 6. Using the basis $B$ of tiling motifs, $LM_g(s)$ can be computed in time $O(b)$, where $b = \sum_{y \in B} |y|$.
Proof. Let $x$ be a motif yielding $LM_g(s)$ (i.e., $x$ is of length $LM_g(s)$); hence, $x$ occurs at least twice in $s$. Let $y$ be a maximal motif in which $x$ occurs (we have $y = x$ if $x$ is itself maximal). Let $z$ be a tiling motif in which $y$ occurs (again we may have $z = y$ if $y$ is a tiling motif). The word $x$ then occurs in $z$ that belongs to the basis. Let us say that it matches $z[i..j]$. Assume that $x$ is not a tiling motif, that is $x \ne z$. Certainly, $i = 0$ or $z[i-1] = \circ$, otherwise, $x$ would not be the longest with its property. For the same reason, $j = |z| - 1$ or $z[j+1] = \circ$. But, indeed, $x$ occurs exactly in $z$, which means that the wild card symbols do not match any solid symbol. Because, otherwise, $z[i..j]$ would contain fewer than $g$ do not cares and could be extended by at least one symbol to the left or to the right because $x \ne z$, yielding a contradiction with the definition of $x$. Therefore, either $x$ is a tiling motif or it matches exactly a segment of one of the tiling motifs. Searching for $x$ thus reduces to finding a longest segment of a tiling motif in $B$ that contains no more than $g$ wild cards. The computation can be done in linear time with only two pointers on $s$, which proves the result. □

By Proposition 6, it is clear that a small basis $B$ leads to an efficient computation once $B$ is given. If we have to build $B$ from scratch, we can observe that no (maximal) motif can give a larger value of $LM_g(s)$ if it does not belong to $B$. With this observation, we have $O(n^2)$ running time, which always beats the $O(g n^2)$ cost of using the suffix tree. In particular, it is interesting to notice that the running time of the algorithm using the basis is independent of the parameter $g$.

5 PSEUDOPOLYNOMIAL BASES FOR HIGHER QUORUM

We now discuss the general case of quorum $q \ge 2$ for finding the basis of a string of length $n$. Differently from previous work, we show in Section 5.1 that no polynomial-time algorithm can exist for any arbitrary value of $q$ in the worst case, both for the basis of irredundant motifs and for the basis of tiling motifs. The size of these bases provably depends exponentially on suitable values of $q \ge 2$, that is, we give a lower bound of
\[ \binom{\frac{n-1}{2} - 1}{q - 1} = \Omega\!\left( \frac{1}{2^q} \binom{n-1}{q-1} \right). \]
In practice, this size has an exponential growth for increasing values of $q$ up to $O(\log n)$, but larger values of $q$ are theoretically possible in the worst case. Fixing $q = (n-1)/4 + 1$ in our lower bound, we get a size of $\Omega(2^{(n-1)/4})$ motifs in the bases. On the average, $q = O(\log_{|\Sigma|} n)$ by extending the argument after Theorem 4, namely, using the fact that on the average the number of simultaneous comparisons to find the first solid character of a merge is $O(|\Sigma|^{q-1})$, which must be less than $n$.

We show a further property for the basis of tiling motifs in Section 5.2, giving an upper bound of $\binom{n-1}{q-1}$ on its size with a simple proof. Since we can find an algorithm taking time proportional to the square of that size, we can conclude that a worst-case polynomial-time algorithm for finding the basis of tiling motifs exists if and only if the quorum $q$ satisfies either $q = O(1)$ or $q = n - O(1)$ (the latter condition is hardly meaningful in practice).

5.1 A Lower Bound of $\binom{\frac{n-1}{2} - 1}{q - 1}$ on the Bases

We show the existence of a family of strings for which there are at least $\binom{\frac{n-1}{2} - 1}{q - 1}$ tiling motifs for a quorum $q$. Since a tiling motif is also irredundant, this gives a lower bound for the irredundant motifs to be combined with that in Section 3 (note that the lower bound in Section 3 still gives $\Omega(n^2)$ for $q \ge 2$). For $q > 2$, this gives a lower bound of $\binom{\frac{n-1}{2} - 1}{q - 1} = \Omega\!\left( \frac{1}{2^q} \binom{n-1}{q-1} \right)$ for the number of both tiling and irredundant motifs.

The strings are this time of the form $t_k = \mathrm{A}^k \mathrm{T} \mathrm{A}^k$ ($k \ge 5$), without the left extension used in the bound of Section 3. The proof proceeds by exhibiting $\binom{k-1}{q-1}$ motifs that are maximal and have each exactly $q$ occurrences, from which it follows immediately that they are tiling. Indeed, Remark 1 for tiling motifs holds for any $q \ge 2$. Namely, all maximal motifs that occur exactly $q$ times in a string are tiling.

Proposition 7. For $2 \le q \le k$ and $1 \le p \le k - q + 1$, any motif $\mathrm{A}^p \circ \{\mathrm{A}, \circ\}^{k-p-1} \circ \mathrm{A}^p$ with exactly $q$ wild cards is tiling (and so irredundant) in $t_k$.
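The family of motifs in Proposition 7 can be checked by brute force for small $k$. The Python sketch below is an illustration only, under stated assumptions: positions are 0-based, the character "." stands in for the wild card, and the function names (`occurrences`, `prop7_motifs`, `check_family`) are hypothetical. Each motif is built by placing wild cards at positions $p_1 < \cdots < p_{q-1}$ (chosen among $1, \ldots, k-1$) and at position $k$. The sketch verifies only that every such motif occurs exactly $q$ times in $t_k$ and that the motifs are pairwise distinct; it does not verify maximality.

```python
from itertools import combinations
from math import comb

WILD = "."  # assumed stand-in for the wild card


def occurrences(x, s):
    """Positions (0-based) where pattern x with wild cards matches s."""
    return {p for p in range(len(s) - len(x) + 1)
            if all(c == WILD or c == s[p + j] for j, c in enumerate(x))}


def prop7_motifs(k, q):
    """Yield the motifs of length k + 1 + p_1 with exactly q wild cards."""
    for ps in combinations(range(1, k), q - 1):
        p1 = ps[0]
        x = ["A"] * (k + 1 + p1)
        for p in ps + (k,):   # wild cards at p_1, ..., p_{q-1} and at k
            x[p] = WILD
        yield "".join(x)


def check_family(k, q):
    """Check that each motif occurs exactly q times in t_k; return the count."""
    t = "A" * k + "T" + "A" * k
    motifs = list(prop7_motifs(k, q))
    assert all(len(occurrences(x, t)) == q for x in motifs)
    return len(set(motifs))
```

For instance, `check_family(5, 3)` returns `comb(4, 2)`, that is, 6 distinct motifs for $t_5$ = AAAAATAAAAA.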
Proof. Let $x$ be an arbitrary motif $\mathrm{A}^p \circ \{\mathrm{A}, \circ\}^{k-p-1} \circ \mathrm{A}^p$ with $1 \le p \le k - q + 1$ and $q$ wild cards; namely,
\[ x = \mathrm{A}^{p_1} \circ \mathrm{A}^{p_2 - p_1 - 1} \circ \cdots \circ \mathrm{A}^{p_{q-1} - p_{q-2} - 1} \circ \mathrm{A}^{k - p_{q-1} - 1} \circ \mathrm{A}^{p_1} \]
for $1 \le p_1 < p_2 < \cdots < p_{q-1} \le k - 1$ and $p = p_1$. We first have to prove that $x$ is a maximal motif according to Definition 5. Its length is $k + 1 + p_1$ and its location list is $L_x = \{0, k - p_{q-1}, \ldots, k - p_2, k - p_1\}$. Observe that the number of its occurrences is exactly the number of times the wild card appears in $x$, which is equal to $q$. A motif $y$ different from $x$ such that $x$ occurs in $y$ can be obtained by replacing the wild card at position $p_i$ with a solid symbol, for $1 \le i \le q - 1$, but this eliminates $k - p_i$ from the location list of $y$. Also, $y$ can be obtained by extending $x$ to the right by a solid symbol (at any position $\ge |x|$), but then position $k - p_1$ is not in $L_y$ because the last symbol in that occurrence of $y$ occupies position $(k - p_1) + |y| - 1 \ge (k - p_1) + |x| = (k - p_1) + (k + 1 + p_1) > |t_k| - 1$ in $t_k$, which is impossible. Analogously, $y$ can be obtained by extending $x$ to the left by a solid symbol (at any position $d < 0$), but position 0 is no longer in $L_y$. Consequently, for any motif $y$ more specific than $x$, we have $L_y \ne L_x + d$, implying that $x$ is maximal. As previously mentioned, $x$ is tiling because it has exactly $q$ occurrences. □

Theorem 5. String $t_k$ has $\binom{\frac{n-1}{2} - 1}{q - 1} = \Omega\!\left( \frac{1}{2^q} \binom{n-1}{q-1} \right)$ tiling (and irredundant) motifs, where $n = |t_k|$ and $k \ge 2$.
Proof. By Proposition 7, the tiling or irredundant motifs in $t_k$ are at least $\binom{k-1}{q-1}$, the number of choices of $q - 1$ positions on $\mathrm{A}^{k-1}$. Since $n = 2k + 1$, we obtain the statement. □

5.2 An Upper Bound of $\binom{n-1}{q-1}$ Tiling Motifs

We now prove that $\binom{n-1}{q-1}$ is an upper bound for the size of a basis of tiling motifs for a string $s$ and quorum $q \ge 2$. Let us denote as before such a basis by $B$. To prove the upper bound, we use again the notion of a merge, except that it now involves $q$ strings. The operator $\oplus$ between the elements of $\Sigma$ extends to more than two arguments, so that the result is a $\circ$ if at least two arguments differ. Let $k$ denote now an array of $q - 1$ positive values $k_1, \ldots, k_{q-1}$ with $1 \le k_i < k_j \le n - 1$ for all $1 \le i < j \le q - 1$.

Definition 9. Let $s_k$ denote the string such that its $j$th character is $s_k[j] = s[j] \oplus s[j + k_1] \oplus \cdots \oplus s[j + k_{q-1}]$ for all integers $j$. $\mathit{Merge}_k$ is the pattern obtained by removing all the leading and trailing $\circ$s in $s_k$ (that is, appearing before the leftmost solid character and after the rightmost solid character).

Lemmas 5 and 6 reported below extend Lemmas 1 and 2 for $q > 2$.

Lemma 5. If $\mathit{Merge}_k$ exists for quorum $q$, then it must be a maximal motif.
Proof. Let $x = \mathit{Merge}_k$ denote the (nonempty) pattern, and let $s_k[i]$ be its first character, which is solid by Definition 9. Since $x$ occurs at least $q$ times in $s$, at positions $i, i + k_1, \ldots, i + k_{q-1}$, then $x$ is a motif for quorum $q$. We show that $x$ is maximal. Suppose it is not maximal. By Definition 5, there exists $y \ne x$ such that $x$ occurs in $y$ and $L_y = L_x + d$ for some integer $d$. This implies there exists at least one position $j$ with $0 \le j < |y|$ such that $y[j] = \sigma \in \Sigma$ and $x[j + d] = \circ$. Since
\[ x[j + d] = s[i + j + d] \oplus s[i + j + k_1 + d] \oplus \cdots \oplus s[i + j + k_{q-1} + d], \]
then at least one among $i + d, i + k_1 + d, \ldots, i + k_{q-1} + d$ is not an occurrence of $y$, contradicting the hypothesis that $L_y = L_x + d$ (since $i, i + k_1, \ldots, i + k_{q-1} \in L_x$). □

Lemma 6. For each tiling motif $x$ in the basis $B$ with quorum $q$, there is at least one $k$ for which $\mathit{Merge}_k = x$.
Proof. If $|L_x| = q$ and $L_x = \{i_1, \ldots, i_q\}$ with $i_1 < \cdots < i_q$, then $x = \mathit{Merge}_k$ where $k$ is the array of values $i_2 - i_1, i_3 - i_1, \ldots, i_q - i_1$. Let us now consider the case where $|L_x| > q$. Given any $q$-tuple $i_1, \ldots, i_q \in L_x$, let $u_k$ denote $s[i_1..i_1 + |x| - 1] \oplus \cdots \oplus s[i_q..i_q + |x| - 1]$, which is a substring of $\mathit{Merge}_k$ introduced in Definition 9. We have that $x \preceq u_k$ and $L_x = \bigcup_{i_1, i_2, \ldots, i_q \in L_x} L_{u_k}$. Since each $u_k$ for $i_1, i_2, \ldots, i_q \in L_x$ is a substring of $\mathit{Merge}_k$, we infer that $L_x = \bigcup_{i_1, i_2, \ldots, i_q \in L_x} \bigl( L_{\mathit{Merge}_k} + \delta_k \bigr)$ where the $\delta_k$s are non-negative integers. By Definition 7, if $\mathit{Merge}_k$ were different from $x$, then $x$ would not be tiling, which is a contradiction. Therefore, at least one $\mathit{Merge}_k$ is $x$. □

The following property of tiling bases follows from Lemmas 5 and 6.

Theorem 6. Given a string $s$ of length $n$ and a quorum $q \ge 2$, let $M$ be the set of $\mathit{Merge}_k$, for any of the $\binom{n-1}{q-1}$ possible choices of $k$ for which $\mathit{Merge}_k$ exists. The basis $B$ of tiling motifs for $s$ satisfies $B \subseteq M$ and, therefore, the size of $B$ is at most $\binom{n-1}{q-1}$.

The tiling motifs in our basis appear in $s$ for a total of $q \binom{n-1}{q-1}$ times at most. A variation of the algorithm given in Section 4.3 gives a pseudopolynomial-time complexity of
\[ O\!\left( q^2 \binom{n-1}{q-1}^{2} \right). \]
When this upper bound is combined with the lower bound of Section 5.1, we obtain that there exists a polynomial-time algorithm for finding the basis if and only if either $q = O(1)$ or $q = n - O(1)$.

6 CONCLUSIONS

The work presented in this paper is theoretical in nature, but it should be clear by now that its practical consequences, particularly (but not exclusively) for computational biology, are relevant. Whether motifs as patterns are used for inferring binding sites or repeats of any length, for characterizing sequences or as a filtering step in a whole genome comparison algorithm or before inferring PSSMs: We show that wild cards alone are not enough for a biologically satisfying definition of the patterns of interest. Simply throwing away the pattern-type of motif detection is not a good way to address the problem. This is confirmed by various biological publications [24], [7] as well as by the not yet published (but already publicly available) results of a first
motif detection competition http://bio.cs.washington.edu/assessment/. Even if patterns are not the best way of modeling biological features, they deserve an important function in any future improved algorithm for inferring motifs ab initio from biological sequences. As such, the purpose of this paper is to shed some further light on the inner structure of one important type of motif.

ACKNOWLEDGMENTS

Many suggestions from the anonymous referees greatly improved the original form of this paper. The authors are thankful to them for this and to M.H. ter Beek for improving the English. A preliminary version of the results in this paper has been described in the technical report IGM-2002-10, July 2002 [20], and in [21]. Work was partially supported by the French program bioinformatique EPST 2002 “Algorithms for Modelling and Inference Problems in Molecular Biology.” N. Pisanti and R. Grossi were partially supported by the Italian PRIN project “ALINWEB: Algorithmics for Internet and the Web.” M.-F. Sagot was partially supported by CNRS-INRIA-INRA-INSERM action BioInformatique and the Wellcome Trust Foundation. M. Crochemore was partially supported by CNRS action AlBio, NATO Science Programme grant PST.CLG.977017, and the Wellcome Trust Foundation.

REFERENCES

[1] A. Aho and M. Corasick, “Efficient String Matching: An Aid to Bibliographic Search,” Comm. ACM, vol. 18, no. 6, pp. 333-340, 1975.
[2] A. Apostolico and L. Parida, “Incremental Paradigms of Motif Discovery,” J. Computational Biology, vol. 11, no. 1, pp. 15-25, 2004.
[3] R. Baeza-Yates and G. Gonnet, “A New Approach to Text Searching,” Comm. ACM, vol. 35, pp. 74-82, 1992.
[4] A. Brazma, I. Jonassen, I. Eidhammer, and D. Gilbert, “Approaches to the Automatic Discovery of Patterns in Biosequences,” J. Computational Biology, vol. 5, pp. 279-305, 1998.
[5] M. Crochemore and W. Rytter, Jewels of Stringology. World Scientific Publishing, 2002.
[6] E. Eskin, “From Profiles to Patterns and Back Again: A Branch and Bound Algorithm for Finding Near Optimal Motif Profiles,” RECOMB’04: Proc. Eighth Ann. Int’l Conf. Computational Molecular Biology, pp. 115-124, 2004.
[7] E. Eskin, U. Keich, M. Gelfand, and P. Pevzner, “Genome-Wide Analysis of Bacterial Promoter Regions,” Proc. Pacific Symp. Biocomputing, pp. 29-40, 2003.
[8] M. Fischer and M. Paterson, “String Matching and Other Products,” SIAM AMS Complexity of Computation, R. Karp, ed., pp. 113-125, 1974.
[9] M. Gribskov, A. McLachlan, and D. Eisenberg, “Profile Analysis: Detection of Distantly Related Proteins,” Proc. Nat’l Academy of Sciences, vol. 84, no. 13, pp. 4355-4358, 1987.
[10] D. Gusfield, Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge Univ. Press, 1997.
[11] G.Z. Hertz and G.D. Stormo, “Escherichia Coli Promoter Sequences: Analysis and Prediction,” Methods in Enzymology, vol. 273, pp. 30-42, 1996.
[12] C.E. Lawrence, S.F. Altschul, M.S. Boguski, J.S. Liu, A.F. Neuwald, and J.C. Wooton, “Detecting Subtle Sequence Signals: A Gibbs Sampling Strategy for Multiple Alignment,” Science, vol. 262, pp. 208-214, 1993.
[13] C.E. Lawrence and A.A. Reilly, “An Expectation Maximization (EM) Algorithm for the Identification and Characterization of Common Sites in Unaligned Biopolymer Sequences,” Proteins: Structure, Function, and Genetics, vol. 7, pp. 41-51, 1990.
[14] L. Marsan and M.-F. Sagot, “Algorithms for Extracting Structured Motifs Using a Suffix Tree with an Application to Promoter and Regulatory Site Consensus Identification,” J. Computational Biology, vol. 7, pp. 345-362, 2000.
[15] W. Miller, “Comparison of Genomic DNA Sequences: Solved and Unsolved Problems,” Bioinformatics, vol. 17, pp. 391-397, 2001.
[16] G. Myers, “A Fast Bit-Vector Algorithm for Approximate String Matching Based on Dynamic Programming,” J. ACM, vol. 46, no. 3, pp. 395-415, 1999.
[17] L. Parida, I. Rigoutsos, A. Floratos, D. Platt, and Y. Gao, “Pattern Discovery on Character Sets and Real-Valued Data: Linear Bound on Irredundant Motifs and Efficient Polynomial Time Algorithm,” Proc. SIAM Symp. Discrete Algorithms (SODA), 2000.
[18] L. Parida, I. Rigoutsos, and D. Platt, “An Output-Sensitive Flexible Pattern Discovery Algorithm,” Combinatorial Pattern Matching, A. Amir and G. Landau, eds., pp. 131-142, Springer-Verlag, 2001.
[19] J. Pelfrêne, S. Abdeddaïm, and J. Alexandre, “Extracting Approximate Patterns,” Combinatorial Pattern Matching, pp. 328-347, Springer-Verlag, 2003.
[20] N. Pisanti, M. Crochemore, R. Grossi, and M.-F. Sagot, “A Basis for Repeated Motifs in Pattern Discovery and Text Mining,” Technical Report IGM 2002-10, Institut Gaspard-Monge, Univ. of Marne-la-Vallée, July 2002.
[21] N. Pisanti, M. Crochemore, R. Grossi, and M.-F. Sagot, “A Basis of Tiling Motifs for Generating Repeated Patterns and Its Complexity for Higher Quorum,” Math. Foundations of Computer Science (MFCS), B. Rovan and P. Vojtás, eds., pp. 622-631, Springer-Verlag, 2003.
[22] N. Pisanti, M. Crochemore, R. Grossi, and M.-F. Sagot, String Algorithmics, chapter: A Comparative Study of Bases for Motif Inference, pp. 195-225, KCL Press, 2004.
[23] D. Pollard, C. Bergman, J. Stoye, S. Celniker, and M. Eisen, “Benchmarking Tools for the Alignment of Functional Noncoding DNA,” BMC Bioinformatics, vol. 5, pp. 6-23, 2004.
[24] A. Vanet, L. Marsan, and M.-F. Sagot, “Promoter Sequences and Algorithmical Methods for Identifying Them,” Research in Microbiology, vol. 150, pp. 779-799, 1999.
[25] S. Wu and U. Manber, “Path-Matching Problems,” Algorithmica, vol. 8, no. 2, pp. 89-101, 1992.

Nadia Pisanti received the laurea degree in computer science in 1996 from the University of Pisa (Italy), the French DEA in fundamental informatics with applications to genome treatment in 1998 from the University of Marne-la-Vallee (France), and the PhD degree in computer science in 2002 from the University of Pisa. She has been postdoctorate at INRIA and at the University of Paris 13 and she is currently a research fellow in the Department of Computer Science of the University of Pisa. Her interests are in computational biology and, in particular, in motifs extraction and genome rearrangement.

Maxime Crochemore received the PhD degree in 1978 and the Doctorat d’etat in 1983 from the University of Rouen. He received his first professorship position at the University of Paris-Nord in 1975 where he acted as President of the Department of Mathematics and Computer Science for two years. He became a professor at the University Paris 7 in 1989 and was involved in the creation of the University of Marne-la-Vallee where he is presently a professor. He also created the Computer Science Research Laboratory of this university in 1991. Since then, he has been the director of the laboratory, which now has around 45 permanent researchers. Professor Crochemore has been a senior research fellow at King’s College London since 2002. He has been the recipient of several French grants on string algorithmics and bioinformatics. He participated in a good number of international projects in algorithmics and supervised 20 PhD students.
Roberto Grossi received the laurea degree in computer science in 1988, and the PhD degree in computer science in 1993, at the University of Pisa. He joined the University of Florence in 1993 as an associate researcher. Since 1998, he has been an associate professor of computer science in the Dipartimento di Informatica, University of Pisa. He has been visiting several international research institutions. His interests are in the design and analysis of algorithms and data structures, namely, dynamic and external memory algorithms, graph algorithms, experimental and algorithm engineering, fast lookup tables and dictionaries, pattern matching algorithms, text indexing, and compressed data structures.

Marie-France Sagot received the BSc degree in computer science from the University of Sao Paulo, Brazil, in 1991, the PhD degree in theoretical computer science and applications from the University of Marne-la-Vallee, France, in 1996, and the Habilitation from the same university in 2000. From 1997 to 2001, she worked as a research associate at the Pasteur Institute in Paris, France. In 2001, she moved to Lyon, France, as a research associate at the INRIA, the French National Institute for Research in Computer Science and Control. Since 2003, she has been director of research at the INRIA. Her research interests are in computational biology, algorithmics, and combinatorics.
Abstract—We study a method of seed-based lossless filtration for approximate string matching and related bioinformatics
applications. The method is based on a simultaneous use of several spaced seeds rather than a single seed as studied by Burkhardt
and Kärkkäinen [1]. We present algorithms to compute several important parameters of seed families, study their combinatorial
properties, and describe several techniques to construct efficient families. We also report a large-scale application of the proposed
technique to the problem of oligonucleotide selection for an EST sequence database.
Index Terms—Filtration, string matching, gapped seed, gapped q-gram, local alignment, sequence similarity, seed family, multiple
spaced seeds, dynamic programming, EST, oligonucleotide selection.
1 INTRODUCTION
several seed families we computed, and we report a large-scale experimental application of the method to a practical problem of oligonucleotide selection.

2 MULTIPLE SEED FILTERING
A seed $Q$ (also called a spaced seed or gapped q-gram) is a list $\{p_1, p_2, \ldots, p_d\}$ of non-negative integers, called matching positions, such that $p_1 < p_2 < \cdots < p_d$. By convention, we always assume $p_1 = 0$. The span of a seed $Q$, denoted $s(Q)$, is the quantity $p_d + 1$. The number $d$ of matching positions is called the weight of the seed and denoted $w(Q)$. Often, we will use a more visual representation of seeds, adopted in [1], as words of length $s(Q)$ over the two-letter alphabet {#, -}, where # occurs at all matching positions and - at all positions in between. For example, seed $\{0, 1, 2, 4, 6, 9, 10, 11\}$ of weight 8 and span 12 is represented by the word ###-#-#--###. The character - is called a joker. Note that, unless otherwise stated, the seed has the character # at its first and last positions.

Intuitively, a seed specifies the set of patterns that, if shared by two sequences, indicate a possible similarity between them. Two sequences are similar if the Hamming distance between them is smaller than a certain threshold. For example, sequences CACTCGT and CACACTT are similar within Hamming distance 2 and this similarity is detected by the seed ##-# at position 2. We are interested in seeds that detect all similarities of a given length with a given Hamming distance.

Formally, a gapless similarity (hereafter simply similarity) of two sequences of length $m$ is a binary word $w \in \{0,1\}^m$ interpreted as a sequence of matches (1s) and mismatches (0s) of individual characters from the alphabet of input sequences. A seed $Q = \{p_1, p_2, \ldots, p_d\}$ matches a similarity $w$ at position $i$, $1 \le i \le m - p_d$, iff for every $j \in [1..d]$, we have $w[i + p_j] = 1$. In this case, we also say that seed $Q$ has an occurrence in similarity $w$ at position $i$. A seed $Q$ is said to detect a similarity $w$ if $Q$ has at least one occurrence in $w$.

Given a similarity length $m$ and a number of mismatches $k$, consider all similarities of length $m$ containing $k$ 0s and $(m-k)$ 1s. These similarities are called $(m,k)$-similarities. A seed $Q$ solves the detection problem $(m,k)$ (for short, the $(m,k)$-problem) iff all of the $\binom{m}{k}$ $(m,k)$-similarities $w$ are detected by $Q$. For example, one can check that seed #-##--#-## solves the $(15,2)$-problem.

Note that the weight of the seed is directly related to the selectivity of the corresponding filtration procedure. A larger weight improves the selectivity, as fewer similarities will pass through the filter. On the other hand, a smaller weight reduces the filtration efficiency. Therefore, the goal is to solve an $(m,k)$-problem by a seed with the largest possible weight.

Solving $(m,k)$-problems by a single seed has been studied by Burkhardt and Kärkkäinen [1]. An extension we propose here is to use a family of seeds, instead of a single seed, to solve the $(m,k)$-problem. Formally, a finite family of seeds $F = \langle Q_l \rangle_{l=1}^{L}$ solves an $(m,k)$-problem iff for any $(m,k)$-similarity $w$, there exists a seed $Q_l \in F$ that detects $w$.

Note that the seeds of the family are used in the complementary (or disjunctive) fashion, i.e., a similarity is detected if it is detected by one of the seeds. This differs from the conjunctive approach of [7] where a similarity should be detected by two seeds simultaneously.

The following example motivates the use of multiple seeds. In [1], it has been shown that a seed solving the $(25,2)$-problem has the maximal weight 12. The only such seed (up to reversal) is

###-#--###-#--###-#.

However, the problem can be solved by the family composed of the following two seeds of weight 14:

#####-##---#####-##

and

#-##---#####-##---####.

Clearly, using these two seeds increases the selectivity of the search, as only similarities having 14 or more matching characters pass the filter versus 12 matching characters in the case of a single seed. On uniform Bernoulli sequences, this results in the decrease of the number of candidate similarities by the factor of $|A|^2/2$, where $A$ is the input alphabet. This illustrates the advantage of the multiple seed approach: it allows one to increase the selectivity while preserving a lossless search. The price to pay for this gain in selectivity is multiplying the work on identifying the seed occurrences. In the case of large sequences, however, this is largely compensated by the decrease in the number of false positives caused by the increase of the seed weight.

3 COMPUTING PROPERTIES OF SEED FAMILIES
Burkhardt and Kärkkäinen [1] proposed a dynamic programming algorithm to compute the optimal threshold of a given seed, that is, the minimal number of its occurrences over all possible $(m,k)$-similarities. In this section, we describe an extension of this algorithm for seed families and, on the other hand, describe dynamic programming algorithms for computing two other important parameters of seed families that we will use in a later section.

Consider an $(m,k)$-problem and a family of seeds $F = \langle Q_l \rangle_{l=1}^{L}$. We need the following notations:

• $s_{max} = \max\{s(Q_l)\}_{l=1}^{L}$, $s_{min} = \min\{s(Q_l)\}_{l=1}^{L}$,
• for a binary word $w$ and a seed $Q_l$, $\mathit{suff}(Q_l, w) = 1$ if $Q_l$ matches $w$ at position $(|w| - s(Q_l) + 1)$ (i.e., matches a suffix of $w$), otherwise $\mathit{suff}(Q_l, w) = 0$,
• $\mathit{last}(w) = 1$ if the last character of $w$ is 1, otherwise $\mathit{last}(w) = 0$, and
• $\mathit{zeros}(w)$ is the number of 0s in $w$.

3.1 Optimal Threshold
Given an $(m,k)$-problem, a family of seeds $F = \langle Q_l \rangle_{l=1}^{L}$ has the optimal threshold $T_F(m,k)$ if every $(m,k)$-similarity
KUCHEROV ET AL.: MULTISEED LOSSLESS FILTRATION 53
has at least $T_F(m,k)$ occurrences of seeds of $F$ and this is the maximal number with this property. Note that overlapping occurrences of a seed as well as occurrences of different seeds at the same position are counted separately. For example, the singleton family {###-##} has threshold 2 for the $(15,2)$-problem.

Clearly, $F$ solves an $(m,k)$-problem if and only if $T_F(m,k) > 0$. If $T_F(m,k) > 1$, then one can strengthen the detection criterion by requiring several seed occurrences for a similarity to be detected. This shows the importance of the optimal threshold parameter.

We now describe a dynamic programming algorithm for computing the optimal threshold $T_F(m,k)$. For a binary word $w$, consider the quantity $T_F(m,k,w)$ defined as the minimal number of occurrences of seeds of $F$ in all $(m,k)$-similarities which have the suffix $w$. By definition, $T_F(m,k) = T_F(m,k,\varepsilon)$. Assume that we precomputed values $\overline{T}_F(j,w) = T_F(s_{max}, j, w)$, for all $j \le \max\{k, s_{max}\}$, $|w| = s_{max}$. The algorithm is based on the following recurrence relations on $T_F(i,j,w)$, for $i \ge s_{max}$.

\[
T_F(i, j, w[1..n]) =
\begin{cases}
\overline{T}_F(j, w), & \text{if } i = s_{max};\\
T_F(i-1, j-1, w[1..n-1]), & \text{if } w[n] = 0;\\
T_F(i-1, j, w[1..n-1]) + \left[\sum_{l=1}^{L} \mathit{suff}(Q_l, w)\right], & \text{if } n = s_{max};\\
\min\{T_F(i, j, 1.w),\ T_F(i, j, 0.w)\}, & \text{if } \mathit{zeros}(w) < j;\\
T_F(i, j, 1.w), & \text{if } \mathit{zeros}(w) = j.
\end{cases}
\]

The first relation is an initial condition of the recurrence. The second one is based on the fact that if the last symbol of $w$ is 0, then no seed can match a suffix of $w$ (as the last

$O(g(k, s_{max}))$, under the assumption that checking an individual match is done in constant time. This leads to the overall time complexity $O(m \cdot f(k, s_{max}) + L \cdot g(k, s_{max}))$ with the leading term $m \cdot f(k, s_{max})$ (as $L$ is usually small compared to $m$ and $g(k, s_{max})$ is smaller than $f(k, s_{max})$).

3.2 Number of Undetected Similarities
We now describe a dynamic programming algorithm that computes another characteristic of a seed family, which will be used later in Section 4.4. Consider an $(m,k)$-problem. Given a seed family $F = \langle Q_l \rangle_{l=1}^{L}$, we are interested in the number $U_F(m,k)$ of $(m,k)$-similarities that are not detected by $F$. For a binary word $w$, define $U_F(m,k,w)$ to be the number of undetected $(m,k)$-similarities that have the suffix $w$.

Similar to [10], let $X(F)$ be the set of binary words $w$ such that 1) $|w| \le s_{max}$, 2) for any $Q_l \in F$, $\mathit{suff}(Q_l, 1^{s_{max}-|w|} w) = 0$, and 3) no proper suffix of $w$ satisfies 2). Note that word 0 belongs to $X(F)$, as the last position of every seed is a matching position.

The following recurrence relations allow one to compute $U_F(i,j,w)$ for $i \le m$, $j \le k$, and $|w| \le s_{max}$:

\[
U_F(i, j, w[1..n]) =
\begin{cases}
\binom{i-|w|}{j-\mathit{zeros}(w)}, & \text{if } i < s_{min};\\
0, & \text{if } \exists l \in [1..L],\ \mathit{suff}(Q_l, w) = 1;\\
U_F(i-1, j-\mathit{last}(w), w[1..n-1]), & \text{if } w \in X(F);\\
U_F(i, j, 1.w) + U_F(i, j, 0.w), & \text{if } \mathit{zeros}(w) < j;
\end{cases}
\]
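As an illustrative cross-check of the definitions in Sections 2 and 3, the following Python sketch computes the optimal threshold and the number of undetected similarities by exhaustive enumeration of all $(m,k)$-similarities. It is not the dynamic programming algorithm described in the text and is only practical for small $m$ and $k$; all function names are hypothetical, and seeds are assumed to be written as strings over '#' and '-'.

```python
from itertools import combinations


def seed_offsets(seed):
    """0-based matching positions of a seed written over {'#', '-'}."""
    return [p for p, c in enumerate(seed) if c == "#"]


def similarity(a, b):
    """Binary word of matches (1) and mismatches (0) of two equal-length sequences."""
    return "".join("1" if x == y else "0" for x, y in zip(a, b))


def occurrences(seed, w):
    """Number of positions at which the seed matches the binary word w."""
    offs = seed_offsets(seed)
    return sum(
        all(w[i + p] == "1" for p in offs)
        for i in range(len(w) - len(seed) + 1)
    )


def similarities(m, k):
    """Enumerate all (m, k)-similarities: words of length m with exactly k zeros."""
    for zeros in combinations(range(m), k):
        w = ["1"] * m
        for z in zeros:
            w[z] = "0"
        yield "".join(w)


def optimal_threshold(family, m, k):
    """Brute-force T_F(m, k): minimal total number of seed occurrences."""
    return min(sum(occurrences(q, w) for q in family) for w in similarities(m, k))


def undetected(family, m, k):
    """Brute-force U_F(m, k): number of (m, k)-similarities detected by no seed."""
    return sum(
        all(occurrences(q, w) == 0 for q in family) for w in similarities(m, k)
    )


def solves(family, m, k):
    """A family solves the (m, k)-problem iff no similarity is left undetected."""
    return undetected(family, m, k) == 0


if __name__ == "__main__":
    print(optimal_threshold(["###-##"], 15, 2))  # 2
    print(solves(["#-##--#-##"], 15, 2))         # True
```

The enumeration visits $\binom{m}{k}$ words, which is why the text resorts to dynamic programming over suffixes for realistic parameters.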
\[
S_F(i, j, l, w[1..n]) =
\begin{cases}
0, & \text{if } i < s_{min} \text{ or } \exists l' \ne l,\ \mathit{suff}(Q_{l'}, w) = 1;\\
S_F(i-1, j-1, l, w[1..n-1]), & \text{if } w[n] = 0;\\
S_F(i-1, j, l, w[1..n-1]), & \text{if } n = |Q_l| \text{ and } \mathit{suff}(Q_l, w) = 0;\\
S_F(i-1, j, l, w[1..n-1]) + U_F(i-1, j, w[1..n-1]), & \text{if } n = s_{max} \text{ and } \mathit{suff}(Q_l, w) = 1 \text{ and } \forall l' \ne l,\ \mathit{suff}(Q_{l'}, w) = 0;\\
S_F(i, j, l, 1.w[1..n]) + S_F(i, j, l, 0.w[1..n]), & \text{if } \mathit{zeros}(w) < j;\\
S_F(i, j, l, 1.w[1..n]), & \text{if } \mathit{zeros}(w) = j.
\end{cases}
\]

The third and fourth relations play the principal role: if $Q_l$ does not match a suffix of $w[1..n]$, then we simply drop out the last letter. If $Q_l$ matches a suffix of $w[1..n]$, but no other seed does, then we count prefixes matched by $Q_l$ exclusively (term $S_F(i-1, j, l, w[1..n-1])$) together with prefixes matched by no seed at all (term $U_F(i-1, j, w[1..n-1])$). The latter is computed by the algorithm of the previous section.

The complexity of computing $S_F(m,k,l)$ for a given $l$ is the same as the complexity of dynamic programming algorithms from the previous sections.

4 SEED DESIGN
In the previous section we showed how to compute various useful characteristics of a given family of seeds. A much more difficult task is to find an efficient seed family that solves a given $(m,k)$-problem. Note that there exists a trivial solution where the family consists of all $\binom{m}{k}$ position combinations, but this is in general unacceptable in practice because of a huge number of seeds. Our goal is to find families of reasonable size (typically, with the number of seeds smaller than 10), with a good filtration efficiency.

In this section, we present several results that contribute to this goal. In Section 4.1, we start with the case of a single seed with a fixed number of jokers and show, in particular, that for one joker, there exists one best seed in a sense that will be defined. We then show in Section 4.2 that a solution for a larger problem can be obtained from a smaller one by a regular expansion operation. In Section 4.3, we focus on seeds that have a periodic structure and show how those seeds can be constructed by iterating some smaller seeds. We then show a way to build efficient families of periodic seeds. Finally, in Section 4.4, we briefly describe a heuristic approach to constructing efficient seed families that we used in the experimental part of this work presented in Section 5.

4.1 Single Seeds with a Fixed Number of Jokers
Assume that we fixed a class of seeds under interest (e.g., seeds of a given minimal weight). One possible way to define the seed design problem is to fix a similarity length $m$ and find a seed that solves the $(m,k)$-problem with the largest possible value of $k$. A complementary definition is to fix $k$ and minimize $m$ provided that the $(m,k)$-problem is still solved. In this section, we adopt the second definition and present an optimal solution for one particular case.

For a seed $Q$ and a number of mismatches $k$, define the $k$-critical length for $Q$ as the minimal value $m$ such that $Q$ solves the $(m,k)$-problem. For a class of seeds $C$ and a value $k$, a seed $Q$ is $k$-optimal in $C$ if it has the minimal $k$-critical length among all seeds of $C$.

One interesting class of seeds $C$ is obtained by putting an upper bound on the possible number of jokers in the seed, i.e., on the number $(s(Q) - w(Q))$. We have found a general solution of the seed design problem for the class $C_1(d)$ consisting of seeds of weight $d$ with only one joker, i.e., seeds $\#^{d-r}\,\text{-}\,\#^{r}$.

Consider first the case of one mismatch, i.e., $k = 1$. A 1-optimal seed from $C_1(d)$ is $\#^{d-r}\,\text{-}\,\#^{r}$ with $r = \lfloor d/2 \rfloor$. To see this, consider an arbitrary seed $Q = \#^{p}\,\text{-}\,\#^{q}$, $p + q = d$, and assume by symmetry that $p \ge q$. Observe that the longest $(m,1)$-similarity that is not detected by $Q$ is $1^{p-1} 0 1^{p+q}$ of length $(2p + q)$. Therefore, we have to minimize $2p + q = d + p$, and since $p \ge \lceil d/2 \rceil$, the minimum is reached for $p = \lceil d/2 \rceil$, $q = \lfloor d/2 \rfloor$.

However, for $k \ge 2$, an optimal seed has an asymmetric structure described by the following theorem.

Theorem 1. Let $d$ be an integer and $r = [d/3]$ ($[x]$ is the closest integer to $x$). For every $k \ge 2$, seed $Q(d) = \#^{d-r}\,\text{-}\,\#^{r}$ is $k$-optimal among the seeds of $C_1(d)$.

Proof. Again, consider a seed $Q = \#^{p}\,\text{-}\,\#^{q}$, $p + q = d$, and assume that $p \ge q$. Consider the longest word $S(k)$ from $(1^*0)^k 1^*$, $k \ge 1$, which is not detected by $Q$ and let $L(k)$ be the length of $S(k)$. By the above remark, $S(1) = 1^{p-1} 0 1^{p+q}$ and $L(1) = 2p + q$.

It is easily seen that for every $k$, $S(k)$ starts either with $1^{p-1}0$, or with $1^{p+q} 0 1^{q-1} 0$. Define $L'(k)$ to be the maximal length of a word from $(1^*0)^k 1^*$ that is not detected by $Q$ and starts with $1^{q-1}0$. Since prefix $1^{q-1}0$ implies no additional constraint on the rest of the word, we have $L'(k) = q + L(k-1)$. Observe that $L'(1) = p + 2q$ (word $1^{q-1} 0 1^{p+q}$). To summarize, we have the following recurrences for $k \ge 2$:

\[ L'(k) = q + L(k-1), \tag{1} \]
\[ L(k) = \max\{p + L(k-1),\ p + q + 1 + L'(k-1)\}, \tag{2} \]

with initial conditions $L'(1) = p + 2q$, $L(1) = 2p + q$.

Two cases should be distinguished. If $p \ge 2q + 1$, then the straightforward induction shows that the first term in (2) is always greater, and we have

\[ L(k) = (k+1)p + q, \tag{3} \]

and the corresponding longest word is

\[ S(k) = (1^{p-1}0)^k\, 1^{p+q}. \tag{4} \]
If $q \le p \le 2q + 1$, then by induction, we obtain

\[
L(k) =
\begin{cases}
(\ell+1)p + (k+1)q + \ell, & \text{if } k = 2\ell;\\
(\ell+2)p + kq + \ell, & \text{if } k = 2\ell + 1,
\end{cases}
\tag{5}
\]

and

\[
S(k) =
\begin{cases}
(1^{p+q} 0 1^{q-1} 0)^{\ell}\, 1^{p+q}, & \text{if } k = 2\ell;\\
1^{p-1} 0\, (1^{p+q} 0 1^{q-1} 0)^{\ell}\, 1^{p+q}, & \text{if } k = 2\ell + 1.
\end{cases}
\tag{6}
\]

By definition of $L(k)$, seed $\#^{p}\,\text{-}\,\#^{q}$ detects any word from $(1^*0)^k 1^*$ of length $(L(k) + 1)$ or more, and this is the tight bound. Therefore, we have to find $p, q$ which minimize $L(k)$. Recall that $p + q = d$, and observe that for $p \ge 2q + 1$, $L(k)$ (defined by (3)) is increasing in $p$, while for $p \le 2q + 1$, $L(k)$ (defined by (5)) is decreasing in $p$. Therefore, both functions reach their minimum when $p = 2q + 1$. Therefore, if $d \equiv 1 \pmod 3$, we obtain $q = \lfloor d/3 \rfloor$ and $p = d - q$. If $d \equiv 0 \pmod 3$, a routine computation shows that the minimum is reached at $q = d/3$, $p = 2d/3$, and if $d \equiv 2 \pmod 3$, the minimum is reached at $q = \lceil d/3 \rceil$, $p = d - q$. Putting the three cases together results in $q = [d/3]$, $p = d - q$. □

To illustrate Theorem 1, seed ####-## is optimal among all seeds of weight 6 with one joker. This means that this seed solves the $(m,2)$-problem for all $m \ge 16$ and this is the smallest possible bound over all seeds of this class. Similarly, this seed solves the $(m,3)$-problem for all $m \ge 20$, which is the best possible bound, etc.

4.2 Regular Expansion and Contraction of Seeds
We now show that seeds solving larger problems can be obtained from seeds solving smaller problems, and vice versa, using regular expansion and regular contraction operations.

Given a seed $Q$, its $i$-regular expansion $i \otimes Q$ is obtained by multiplying each matching position by $i$. This is equivalent to inserting $i - 1$ jokers between every two successive positions along the seed. For example, if $Q = \{0, 2, 3, 5\}$ (or #-##-#), then the 2-regular expansion of $Q$ is $2 \otimes Q = \{0, 4, 6, 10\}$ (or #---#-#---#). Given a family $F$, its $i$-regular expansion $i \otimes F$ is the family obtained by applying the $i$-regular expansion on each seed of $F$.

Lemma 1. If a family $F$ solves an $(m,k)$-problem, then the $(im, (i+1)k - 1)$-problem is solved both by family $F$ and by its $i$-regular expansion $F_i = i \otimes F$.

Proof. Consider an $(im, (i+1)k - 1)$-similarity $w$. By the pigeonhole principle, it contains at least one substring of length $m$ with $k$ mismatches or less and, therefore, $F$ solves the $(im, (i+1)k - 1)$-problem. On the other hand, consider $i$ disjoint subsequences of $w$ each one consisting of $m$ positions equal modulo $i$. Again, by the pigeonhole principle, at least one of them contains $k$ mismatches or less and, therefore, the $(im, (i+1)k - 1)$-problem is solved by $i \otimes F$. □

The following lemma is the inverse of Lemma 1. It states that if seeds solving a bigger problem have a regular structure, then a solution for a smaller problem can be obtained by the regular contraction operation, inverse to the regular expansion.

Lemma 2. If a family $F_i = i \otimes F$ solves an $(im, k)$-problem, then $F$ solves both the $(im, k)$-problem and the $(m, \lfloor k/i \rfloor)$-problem.

Proof. One can even show that $F$ solves the $(im, k)$-problem with the additional restriction for $F$ to match inside one of the position intervals $[1..m], [m+1..2m], \ldots, [(i-1)m+1..im]$. This is done by using the bijective mapping from Lemma 1: Given an $(im, k)$-similarity $w$, consider $i$ disjoint subsequences $w_j$ ($0 \le j \le i-1$) of $w$ obtained by picking $m$ positions equal to $j$ modulo $i$, and then consider the concatenation $w' = w_1 w_2 \ldots w_{i-1} w_0$.

For every $(im, k)$-similarity $w'$, its inverse image $w$ is detected by $F_i$, and therefore $F$ detects $w'$ at one of the intervals

$[1..m], [m+1..2m], \ldots, [(i-1)m+1..im]$.

Furthermore, for any $(m, \lfloor k/i \rfloor)$-similarity $v$, consider $w' = v^i$ and its inverse image $w$. As $w'$ is detected by $F_i$, $v$ is detected by $F$. □

Example 1. To illustrate the two lemmas above, we give the following example pointed out in [1]. The following two seeds are the only seeds of weight 12 that solve the $(50,5)$-problem:

#-#-#---#-----#-#-#---#-----#-#-#---#

and

###-#--###-#--###-#.

The first one is the 2-regular expansion of the second. The second one is the only seed of weight 12 that solves the $(25,2)$-problem.

The regular expansion allows one, in some cases, to obtain an efficient solution for a larger problem by reducing it to a smaller problem for which an optimal or a near-optimal solution is known.

4.3 Periodic Seeds
In this section, we study seeds with a periodic structure that can be obtained by iterating a smaller seed. Such seeds often turn out to be among maximally weighted seeds solving a given $(m,k)$-problem. Interestingly, this contrasts with the lossy framework where optimal seeds usually have a “random” irregular structure.

Consider two seeds $Q_1, Q_2$ represented as words over {#, -}. In this section, we lift the assumption that a seed must start and end with a matching position. We denote by $[Q_1, Q_2]^i$ the seed defined as $(Q_1 Q_2)^i Q_1$. For example, [###-#, --]^2 = ###-#--###-#--###-#.

We also need a modification of the $(m,k)$-problem, where $(m,k)$-similarities are considered modulo a cyclic permutation. We say that a seed family $F$ solves a cyclic $(m,k)$-problem, if for every $(m,k)$-similarity $w$, $F$ detects one of the cyclic permutations of $w$. Trivially, if $F$ solves an $(m,k)$-problem, it also solves the cyclic $(m,k)$-problem. To
distinguish from a cyclic problem, we sometimes call an $(m,k)$-problem a linear problem.

We first restrict ourselves to the single-seed case. The following lemma demonstrates that iterating smaller seeds solving a cyclic problem allows one to obtain a solution for bigger problems, for the same number of mismatches.

Lemma 3. If a seed $Q$ solves a cyclic $(m,k)$-problem, then for every $i \ge 0$, the seed $Q_i = [Q, \text{-}^{(m - s(Q))}]^i$ solves the linear $(m \cdot (i+1) + s(Q) - 1, k)$-problem. If $i \ne 0$, the inverse holds too.

Proof. ($\Rightarrow$) Consider an $(m \cdot (i+1) + s(Q) - 1, k)$-similarity $u$. Transform $u$ into a similarity $u'$ for the cyclic $(m,k)$-problem as follows: For each mismatch position $\ell$ of $u$, set 0 at position $(\ell \bmod m)$ in $u'$. The other positions of $u'$ are set to 1. Clearly, there are at most $k$ 0s in $u'$. As $Q$ solves the cyclic $(m,k)$-problem, we can find at least one position $j$, $1 \le j \le m$, such that $Q$ detects $u'$ cyclically.

We show now that $Q_i$ matches at position $j$ of $u$ (which is a valid position as $1 \le j \le m$ and $s(Q_i) = im + s(Q)$). As the positions of 1 in $u$ are projected modulo $m$ to matching positions of $Q$, there is no 0 under any matching element of $Q_i$ and, thus, $Q_i$ detects $u$.

($\Leftarrow$) Consider a seed $Q_i = [Q, \text{-}^{(m - s(Q))}]^i$ solving the $(m \cdot (i+1) + s(Q) - 1, k)$-problem. As $i > 0$, consider $(m \cdot (i+1) + s(Q) - 1, k)$-similarities having all their mismatches located inside the interval $[m, 2m-1]$. For each such similarity, there exists a position $j$, $1 \le j \le m$, such that $Q_i$ detects it. Note that the span of $Q_i$ is at least $m + s(Q)$, which implies that there is either an entire occurrence of $Q$ inside the window $[m, 2m-1]$, or a prefix of $Q$ matching a suffix of the window and the complementary suffix of $Q$ matching a prefix of the window. This implies that $Q$ solves the cyclic $(m,k)$-problem. □

Example 2. Observe that the seed ###-# solves the cyclic $(7,2)$-problem. From Lemma 3, this implies that for every $i \ge 0$, the $(11 + 7i, 2)$-problem is solved by the seed [###-#, --]^i of span $5 + 7i$. Moreover, for $i = 1, 2, 3$, this seed is optimal (maximally weighted) over all seeds solving the problem.

By a similar argument based on Lemma 3, the periodic seed [#####-##, ---]^i solves the $(18 + 11i, 2)$-problem. Note that its weight grows as $\frac{7}{11} m$ compared to $\frac{4}{7} m$ for the seed from the previous

$(18 + 11i, 3)$-problem.

One question raised by these examples is whether iterating some seed could provide an asymptotically optimal solution, i.e., a seed of maximal asymptotic weight. The following theorem establishes a tight asymptotic bound on the weight of an optimal seed, for a fixed number of mismatches. It gives a negative answer to this question, as it shows that the maximal weight grows faster than any linear fraction of the similarity size.

Theorem 2. Consider a constant $k$. Let $w(m)$ be the maximal weight of a seed solving the cyclic $(m,k)$-problem. Then, $(m - w(m)) = \Theta(m^{\frac{k-1}{k}})$.

Proof. Note first that all seeds solving a cyclic $(m,k)$-problem can be considered as seeds of span $m$. The number of jokers in any seed $Q$ is then $n = m - w(Q)$. The theorem states that the minimal number of jokers of a seed solving the $(m,k)$-problem is $\Theta(m^{\frac{k-1}{k}})$ for every fixed $k$.

Lower bound. Consider a cyclic $(m,k)$-problem. The number $D(m,k)$ of distinct cyclic $(m,k)$-similarities satisfies

\[ \frac{\binom{m}{k}}{m} \le D(m,k), \tag{7} \]

as every linear $(m,k)$-similarity has at most $m$ cyclically equivalent ones. Consider a seed $Q$. Let $n$ be the number of jokers in $Q$ and $J_Q(m,k)$ the number of distinct cyclic $(m,k)$-similarities detected by $Q$. Observe that $J_Q(m,k) \le \binom{n}{k}$ and if $Q$ solves the cyclic $(m,k)$-problem, then

\[ D(m,k) = J_Q(m,k) \le \binom{n}{k}. \tag{8} \]

From (7) and (8), we have

\[ \frac{\binom{m}{k}}{m} \le \binom{n}{k}. \tag{9} \]

Using the Stirling formula, this gives $n(k) = \Omega(m^{\frac{k-1}{k}})$.

Upper bound. To prove the upper bound, we construct a seed $Q$ that has no more than $k \cdot m^{\frac{k-1}{k}}$ joker positions and solves the cyclic $(m,k)$-problem.

We start with the seed $Q_0$ of span $m$ with all matching positions, and introduce jokers into it in $k$ steps. After step $i$, the obtained seed is denoted $Q_i$, and $Q = Q_k$.

Let $B = \lceil m^{\frac{1}{k}} \rceil$. $Q_1$ is obtained by introducing into $Q_0$ individual jokers with periodicity $B$ by placing jokers at positions $1, B+1, 2B+1, \ldots$. At step 2, we introduce into $Q_1$ contiguous intervals of jokers of length $B$ with periodicity $B^2$, such that jokers are placed at positions $[1 \ldots B], [B^2+1 \ldots B^2+B], [2B^2+1 \ldots 2B^2+B], \ldots$.

In general, at step $i$ ($i \le k$), we introduce into $Q_{i-1}$ intervals of $B^{i-1}$ jokers with periodicity $B^i$ at positions $[1 \ldots B^{i-1}], [B^i+1 \ldots B^i+B^{i-1}], \ldots$ (see Fig. 1).

Note that $Q_i$ is periodic with periodicity $B^i$. Note also that at each step $i$, we introduce at most $\lfloor m^{1-\frac{i}{k}} \rfloor$

By induction on $i$, we prove that for any $(m,i)$-similarity $u$ ($i \le k$), $Q_i$ detects $u$ cyclically, that is, there is a cyclic shift of $Q_i$ such that all $i$ mismatches of $u$ are covered with jokers introduced at steps $1, \ldots, i$.

For $i = 1$, the statement is obvious, as we can always cover the single mismatch by shifting $Q_1$ by at most $(B - 1)$ positions. Assuming that the statement
holds for $(i-1)$, we show now that it holds for $i$ too. Consider an $(m,i)$-similarity $u$. Select one mismatch of $u$. By induction hypothesis, the other $(i-1)$ mismatches can be covered by $Q_{i-1}$. Since $Q_{i-1}$ has period $B^{i-1}$ and $Q_i$ differs from $Q_{i-1}$ by having at least one contiguous interval of $B^{i-1}$ jokers, we can always shift $Q_i$ by $j \cdot B^{i-1}$ positions such that the selected mismatch falls into this interval. This shows that $Q_i$ detects $u$. We conclude that $Q$ solves the cyclic $(m,i)$-problem. □

Fig. 1. Construction of seeds $Q_i$ from the proof of Theorem 2. Jokers are represented in white and matching positions in black.

Using Theorem 2, we obtain the following bound on the number of jokers for the linear $(m,k)$-problem.

Lemma 4. Consider a constant $k$. Let $w(m)$ be the maximal weight of a seed solving the linear $(m,k)$-problem. Then, $(m - w(m)) = \Theta(m^{\frac{k}{k+1}})$.

Proof. To prove the upper bound, we construct a seed $Q$ that solves the linear $(m,k)$-problem and satisfies the asymptotic bound. Consider some $l < m$ that will be defined later, and let $P$ be a seed that solves the cyclic $(l,k)$-problem. Without loss of generality, we assume $s(P) = l$.

For a real number $e \ge 1$, define $P^e$ to be the maximally weighted seed of span at most $le$ of the form $P' \cdot P \cdots P \cdot P''$, where $P'$ and $P''$ are, respectively, a suffix and a prefix of $P$. Due to the condition of maximal weight, $w(P^e) \ge e \cdot w(P)$.

We now set $Q = P^e$ for some real $e$ to be defined. Observe that if $e \cdot l \le m - l$, then $Q$ solves the linear $(m,k)$-problem. Therefore, we set $e = \frac{m-l}{l}$.

From the proof of Theorem 2, we have $l - w(P) \le k \cdot l^{\frac{k-1}{k}}$. We then have

\[ w(Q) = e \cdot w(P) \ge \frac{m-l}{l} \cdot \left(l - k \cdot l^{\frac{k-1}{k}}\right). \tag{10} \]

If we set

\[ l = m^{\frac{k}{k+1}}, \tag{11} \]

we obtain

and as $k$ is constant,

\[ m - w(Q) = O(m^{\frac{k}{k+1}}). \tag{13} \]

The lower bound is obtained similarly to Theorem 2. Let $Q$ be a seed solving a linear $(m,k)$-problem, and let $n = m - w(Q)$. From simple combinatorial considerations, we have

\[ \binom{m}{k} \le \binom{n}{k} \cdot (m - s(Q)) \le \binom{n}{k} \cdot n, \tag{14} \]

which implies $n = \Omega(m^{\frac{k}{k+1}})$ for constant $k$. □

The following simple lemma is also useful for constructing efficient seeds.

Lemma 5. Assume that a family $F$ solves an $(m,k)$-problem. Let $F'$ be the family obtained from $F$ by cutting out $l$ characters from the left and $r$ characters from the right of each seed of $F$. Then $F'$ solves the $(m - r - l, k)$-problem.

Example 3. The $(9 + 7i, 2)$-problem is solved by the seed [###, -#--]^i which is optimal for $i = 1, 2, 3$. Using Lemma 5, this seed can be immediately obtained from the seed [###-#, --]^i from Example 2, solving the $(11 + 7i, 2)$-problem.

We now apply the above results for the single seed case to the case of multiple seeds.

For a seed $Q$ considered as a word over {#, -}, we denote by $Q[i]$ its cyclic shift to the left by $i$ characters. For example, if $Q$ = ####-#-##--, then $Q[5]$ = #-##--####-. The following lemma gives a way to construct seed families solving bigger problems from an individual seed solving a smaller cyclic problem.

Lemma 6. Assume that a seed $Q$ solves a cyclic $(m,k)$-problem and assume that $s(Q) = m$ (otherwise, we pad $Q$ on the right with $(m - s(Q))$ jokers). Fix some $i > 1$. For some $L > 0$, consider a list of $L$ integers $0 \le j_1 < \cdots < j_L < m$, and define a family of seeds $F = \langle \|(Q[j_l])^i\| \rangle_{l=1}^{L}$, where $\|(Q[j_l])^i\|$ stands for the seed obtained from $(Q[j_l])^i$ by deleting the joker characters at the left and right edges. Define $\delta(l) = ((j_{l-1} - j_l) \bmod m)$ (or, alternatively, $\delta(l) = ((j_l - j_{l-1}) \bmod m)$) for all $l$, $1 \le l \le L$. Let $m' = \max\{s(\|(Q[j_l])^i\|) + \delta(l)\}_{l=1}^{L} - 1$. Then, $F$ solves the $(m', k)$-problem.

Proof. The proof is an extension of the proof of Lemma 3. Here, the seeds of the family are constructed in such a way that for any instance of the linear $(m', k)$-problem, there exists at least one seed that satisfies the property required in the proof of Lemma 3 and, therefore, matches this instance. □

In applying Lemma 6, integers $j_l$ are chosen from the interval $[0, m]$ in such a way that values $s(\|(Q[j_l])^i\|) + \delta(l)$ are close to each other. We illustrate Lemma 6 with two examples that follow.

Example 4. Let $m = 11$, $k = 2$. Consider the seed $Q$ = ####-#-##-- solving the cyclic $(11,2)$-problem. Choose $i = 2$, $L = 2$, $j_1 = 0$, $j_2 = 5$. This gives two seeds:
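The construction of Lemma 6 with the parameters of Example 4 can be reproduced with a short Python sketch. It rests on assumptions that are not spelled out in the surviving text: $(Q[j])^i$ is taken to mean $i$ concatenated copies of the shifted word, $\delta(l)$ uses the alternative definition $(j_l - j_{l-1}) \bmod m$ with $j_0$ read cyclically as $j_L$, and all function names are hypothetical.

```python
from itertools import combinations


def shift(seed, j):
    """Cyclic shift of a seed word to the left by j characters (Q[j])."""
    j %= len(seed)
    return seed[j:] + seed[:j]


def trim(seed):
    """Delete the joker characters at the left and right edges."""
    return seed.strip("-")


def solves_cyclic(seed, k):
    """Cyclic (m, k)-problem for a seed of span m = len(seed): every set of
    k mismatches must fall under jokers for some rotation of the seed."""
    m = len(seed)
    jokers = {p for p, c in enumerate(seed) if c == "-"}
    return all(
        any(all((z + r) % m in jokers for z in zeros) for r in range(m))
        for zeros in combinations(range(m), k)
    )


def solves_linear(family, m, k):
    """Brute-force check that the family detects every (m, k)-similarity."""
    def detects(seed, zs):
        offs = [p for p, c in enumerate(seed) if c == "#"]
        return any(
            all(i + p not in zs for p in offs)
            for i in range(m - len(seed) + 1)
        )
    return all(
        any(detects(q, set(zeros)) for q in family)
        for zeros in combinations(range(m), k)
    )


def lemma6_family(seed, i, js):
    """Seeds ||(Q[j])^i|| for each j in js (assumed reading of the notation)."""
    return [trim(shift(seed, j) * i) for j in js]


def lemma6_length(seed, i, js):
    """m' = max{span + delta(l)} - 1, with delta(l) = (j_l - j_{l-1}) mod m
    and j_0 taken to be j_L (assumed cyclic convention)."""
    m = len(seed)
    spans = [len(q) for q in lemma6_family(seed, i, js)]
    deltas = [(js[l] - js[l - 1]) % m for l in range(len(js))]
    return max(s + d for s, d in zip(spans, deltas)) - 1


if __name__ == "__main__":
    Q = "####-#-##--"
    family = lemma6_family(Q, 2, [0, 5])
    m_prime = lemma6_length(Q, 2, [0, 5])
    print(family, m_prime, solves_linear(family, m_prime, 2))
```

Under these assumptions the two constructed seeds have weight 14 and spans 20 and 21, and the brute-force check covers all $\binom{m'}{2}$ similarities of the resulting linear problem.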
TABLE 1
Seed Families for (25,2)-Problem
TABLE 2
Seed Families for (25,3)-Problem
to match those of other genes. As the first approximation, the problem of oligo selection can then be formulated as the search for strings of a fixed length that occur in a given sequence but do not occur, within a specified distance, in other sequences of a given (possibly very large) sample. Different approaches to this problem apply different distance measures and different algorithmic techniques [21], [22], [23], [24]. The experiments we briefly present here demonstrate that the multiseed filtering provides an efficient computation of candidate oligonucleotides. These should then be further processed by complementary methods in order to take into account other physico-chemical factors occurring in hybridisation, such as the melting temperature or the possible hairpin structure of palindromic oligos.

Here, we adopt the formalization of the oligo selection problem as the problem of identifying in a given sequence (or a sequence database) all substrings of length $m$ that have no occurrences elsewhere in the sequence within the Hamming distance $k$. The parameters $m$ and $k$ were set to 32 and 5, respectively. For the $(32,5)$-problem, different seed families were designed and their selectivity was estimated. Those are summarized in the table in Fig. 2, using the same conventions as in Tables 1 and 2 above. The family composed of six seeds of weight 11 was selected for the filtration experiment (shown in Fig. 2).

The filtering has been applied to a database of rice EST sequences composed of 100,015 sequences for a total length of 42,845,242 bp.¹ Substrings matching other substrings with five substitution errors or less were computed. The computation took slightly more than one hour on a

1. Source: http://bioserver.myongji.ac.kr/ricemac.html, The Korea Rice Genome Database.
Fig. 2. Computed seed families for the ð32; 5Þ-problem and the chosen family (six seeds of weight 11).
Pentium 4 3GHz computer. Before applying the filtering of this work has been done during a stay of M. Roytberg at
using the family for the ð32; 5Þ-problem, we made a rough LORIA, Nancy, supported by INRIA. M. Roytberg has been
prefiltering using one spaced seed of weight 16 to detect,
supported by the Russian Foundation for Basic Research
with a high selectivity, almost identical regions. Sixty-five
percent of the database has been discarded by this (project nos. 03-04-49469, 02-07-90412) and by grants from
prefiltering. Another 22 percent of the database has been the RF Ministry for Industry, Science, and Technology (20/
filtered out using the chosen seed family, leaving the 2002, 5/2003) and NWO. An extended abstract of this work
remaining 13 percent as oligo candidates.
has been presented to the Combinatorial Pattern Matching
Conference (Istanbul, July 2004).
6 CONCLUSION
In this paper, we studied a lossless filtration method based
REFERENCES
on multiseed families and demonstrated that it represents
[1] S. Burkhardt and J. Kärkkäinen, “Better Filtering with Gapped
an improvement compared to the single-seed approach q-Grams,” Fundamenta Informaticae, vol. 56, nos. 1-2, pp. 51-70,
considered in [1]. We showed how some important 2003, preliminary version in Combinatorial Pattern Matching
characteristics of seed families can be computed using the 2001.
[2] G. Navarro and M. Raffinot, Flexible Pattern Matching in Strings
dynamic programming. We presented several combinator- —Practical On-Line Search Algorithms for Texts and Biological
ial results that allow one to construct efficient families Sequences. Cambridge Univ. Press, 2002.
composed of seeds with a periodic structure. Finally, we [3] S. Altschul, T. Madden, A. Schäffer, J. Zhang, Z. Zhang, W. Miller,
and D. Lipman, “Gapped BLAST and PSI-BLAST: A New
described a large-scale computational experiment of de- Generation of Protein Database Search Programs,” Nucleic Acids
signing reliable oligonucleotides for DNA microarrays. The Research, vol. 25, no. 17, pp. 3389-3402, 1997.
obtained experimental results provided evidence of the [4] B. Ma, J. Tromp, and M. Li, “PatternHunter: Faster and More
Sensitive Homology Search,” Bioinformatics, vol. 18, no. 3, pp. 440-
applicability and efficiency of the whole method. 445, 2002.
The results of Sections 4.1, 4,2, and 4.3 establish several [5] S. Schwartz, J. Kent, A. Smit, Z. Zhang, R. Baertsch, R. Hardison,
D. Haussler, and W. Miller, “Human—Mouse Alignments with
combinatorial properties of seed families, but many more of BLASTZ,” Genome Research, vol. 13, pp. 103-107, 2003.
them remain to be elucidated. The structure of optimal or [6] L. Noé and G. Kucherov, “Improved Hit Criteria for DNA Local
near-optimal seed families can be reduced to number- Alignment,” BMC Bioinformatics, vol. 5, no. 149, Oct. 2004.
[7] P. Pevzner and M. Waterman, “Multiple Filtration and Approx-
theoretic questions, but this relation remains to be clearly imate Pattern Matching,” Algorithmica, vol. 13, pp. 135-154, 1995.
established. In general, constructing an algorithm to [8] A. Califano and I. Rigoutsos, “Flash: A Fast Look-Up Algorithm
systematically design seed families with quality guarantee for String Homology,” Proc. First Int’l Conf. Intelligent Systems for
Molecular Biology, pp. 56-64, July 1993.
remains an open problem. Some complexity issues remain [9] J. Buhler, “Provably Sensitive Indexing Strategies for Biosequence
open too: For example, what is the complexity of testing if a Similarity Search,” Proc. Sixth Ann. Int’l Conf. Computational
single seed is lossless for given m; k? Section 3 implies a Molecular Biology (RECOMB ’02), pp. 90-99, Apr. 2002.
[10] U. Keich, M. Li, B. Ma, and J. Tromp, “On Spaced Seeds for
time bound exponential on the number of jokers. Note that Similarity Search,” Discrete Applied Math., vol. 138, no. 3, pp. 253-
for multiple seeds, computing the number of detected 263, 2004.
similarities is NP-complete [16, Section 3.1]. [11] J. Buhler, U. Keich, and Y. Sun, “Designing Seeds for Similarity
Search in Genomic DNA,” Proc. Seventh Ann. Int’l Conf. Computa-
Another direction is to consider different distance tional Molecular Biology (RECOMB ’03), pp. 67-75, Apr. 2003.
measures, especially the Levenstein distance, or at least to [12] B. Brejova, D. Brown, and T. Vinar, “Vector Seeds: An Extension to
Spaced Seeds Allows Substantial Improvements in Sensitivity and
allow some restricted insertion/deletion errors. The method Specificity,” Proc. Third Int’l Workshop Algorithms in Bioinformatics
proposed in [25] does not seem to be easily generalized to (WABI), pp. 39-54, Sept. 2003.
multiseed families, and a further work is required to [13] G. Kucherov, L. Noé, and Y. Ponty, “Estimating Seed Sensitivity
on Homogeneous Alignments,” Proc. IEEE Fourth Symp. Bioinfor-
improve lossless filtering in this case. matics and Bioeng. (BIBE 2004), May 2004.
[14] K. Choi and L. Zhang, “Sensitivity Analysis and Efficient Method
for Identifying Optimal Spaced Seeds,” J. Computer and System
ACKNOWLEDGMENTS Sciences, vol. 68, pp. 22-40, 2004.
[15] M. Csürös, “Performing Local Similarity Searches with Variable
G. Kucherov and L. Noé have been supported by the French Length Seeds,” Proc. 15th Ann. Combinatorial Pattern Matching
Action Spécifique “Algorithmes et Séquences” of CNRS. A part Symp. (CPM), pp. 373-387, 2004.
KUCHEROV ET AL.: MULTISEED LOSSLESS FILTRATION 61
[16] M. Li, B. Ma, D. Kisman, and J. Tromp, “PatternHunter II: Highly Sensitive and Fast Homology Search,” J. Bioinformatics and Computational Biology, vol. 2, no. 3, pp. 417-440, Sept. 2004.
[17] Y. Sun and J. Buhler, “Designing Multiple Simultaneous Seeds for DNA Similarity Search,” Proc. Eighth Ann. Int’l Conf. Research in Computational Molecular Biology (RECOMB 2004), pp. 76-84, Mar. 2004.
[18] D.G. Brown, “Multiple Vector Seeds for Protein Alignment,” Proc. Fourth Int’l Workshop Algorithms in Bioinformatics (WABI), pp. 170-181, Sept. 2004.
[19] J. Xu, D. Brown, M. Li, and B. Ma, “Optimizing Multiple Spaced Seeds for Homology Search,” Proc. 15th Symp. Combinatorial Pattern Matching, pp. 47-58, 2004.
[20] J. Oommen and J. Dong, “Generalized Swap-with-Parent Schemes for Self-Organizing Sequential Linear Lists,” Proc. 1997 Int’l Symp. Algorithms and Computation (ISAAC ’97), pp. 414-423, Dec. 1997.
[21] F. Li and G. Stormo, “Selection of Optimal DNA Oligos for Gene Expression Arrays,” Bioinformatics, vol. 17, pp. 1067-1076, 2001.
[22] L. Kaderali and A. Schliep, “Selecting Signature Oligonucleotides to Identify Organisms Using DNA Arrays,” Bioinformatics, vol. 18, no. 10, pp. 1340-1349, 2002.
[23] S. Rahmann, “Fast Large Scale Oligonucleotide Selection Using the Longest Common Factor Approach,” J. Bioinformatics and Computational Biology, vol. 1, no. 2, pp. 343-361, 2003.
[24] J. Zheng, T. Close, T. Jiang, and S. Lonardi, “Efficient Selection of Unique and Popular Oligos for Large EST Databases,” Proc. 14th Ann. Combinatorial Pattern Matching Symp. (CPM), pp. 273-283, 2003.
[25] S. Burkhardt and J. Kärkkäinen, “One-Gapped q-Gram Filters for Levenshtein Distance,” Proc. 13th Symp. Combinatorial Pattern Matching (CPM ’02), vol. 2373, pp. 225-234, 2002.
Gregory Kucherov received the PhD degree in computer science in 1988 from the USSR Academy of Sciences, and a Habilitation degree in 2000 from the Henri Poincaré University in Nancy. He is a senior INRIA researcher with the LORIA research unit in Nancy, France. For the last 10 years, he has been doing research on word combinatorics, text algorithms and combinatorial algorithms for bioinformatics, and computational biology.
Laurent Noé studied computer science at the ESIAL engineering school in Nancy, France. He received the MS degree in 2002 and is currently a PhD student in computational biology at LORIA.
Mikhail Roytberg received the PhD degree in computer science in 1983 from Moscow State University. He is a leader of the Computational Molecular Biology Group in the Institute of Mathematical Problems in Biology of the Russian Academy of Sciences at Pushchino, Russia. During the last years, his main research field has been the development of algorithms for comparative analysis of biological sequences.
Abstract—Partitioning closely related genes into clusters has become an important element of practically all statistical analyses of
microarray data. A number of computer algorithms have been developed for this task. Although these algorithms have demonstrated their
usefulness for gene clustering, some basic problems remain. This paper describes our work on extracting functional keywords from
MEDLINE for a set of genes that are isolated for further study from microarray experiments based on their differential expression patterns.
The sharing of functional keywords among genes is used as a basis for clustering in a new approach called BEA-PARTITION in this paper.
Functional keywords associated with genes were extracted from MEDLINE abstracts. We modified the Bond Energy Algorithm (BEA),
which is widely accepted in psychology and database design but is virtually unknown in bioinformatics, to cluster genes by functional
keyword associations. The results showed that BEA-PARTITION and hierarchical clustering algorithm outperformed k-means clustering
and self-organizing map by correctly assigning 25 of 26 genes in a test set of four known gene groups. To evaluate the effectiveness of
BEA-PARTITION for clustering genes identified by microarray profiles, 44 yeast genes that are differentially expressed during the cell
cycle and have been widely studied in the literature were used as a second test set. Using established measures of cluster quality, the
results produced by BEA-PARTITION had higher purity, lower entropy, and higher mutual information than those produced by k-means
and self-organizing map. Whereas BEA-PARTITION and the hierarchical clustering produced similar quality of clusters, BEA-PARTITION
provides clear cluster boundaries compared to the hierarchical clustering. BEA-PARTITION is simple to implement and provides a
powerful approach to clustering genes or to any clustering problem where starting matrices are available from experimental observations.
Index Terms—Bond energy algorithm, microarray, MEDLINE, text analysis, cluster analysis, gene function.
1 INTRODUCTION
information appears in the literature is a major challenge that is rarely met adequately.
If, instead of organizing by expression pattern similarity, genes were grouped according to shared function, investigators might more quickly discover patterns or themes of biological processes that were revealed by their microarray experiments and focus on a select group of functionally related genes. A number of clustering strategies based on shared functions rather than similar expression patterns have been devised. Chaussabel and Sher [3] analyzed literature profiles generated by extracting the frequencies of certain terms from the abstracts in MEDLINE and then clustered the genes based on these terms, essentially applying the same algorithm used for expression pattern clustering. Jenssen et al. [12] used co-occurrence of gene names in abstracts to create networks of related genes automatically. Text analysis of biomedical literature has also been applied successfully to incorporate functional information about the genes in the analysis of gene expression data [1], [10], [13], [14] without generating clusters de novo. For example, Blaschke et al. [1] extracted information about the common biological characteristics of gene clusters from MEDLINE using Andrade and Valencia’s statistical text mining approach, which accepts user-supplied abstracts related to a protein of interest and returns an ordered set of keywords that occur in those abstracts more often than would be expected by chance [15].
We extended Andrade and Valencia’s approach [15] to functional gene clustering by applying an algorithm called the Bond Energy Algorithm (BEA) [16], [17], which, to our knowledge, has not been used in bioinformatics. We modified it so that the “affinity” among attributes (in our case, genes) is defined based on the sharing of keywords between them, and we devised a scheme for partitioning the clustered affinity matrix to produce clusters of genes. We call the resulting algorithm BEA-PARTITION. BEA was originally conceived as a technique to cluster questions in psychological instruments [16], has been used in operations research, production engineering, marketing, and various other fields [18], and is a popular clustering algorithm in distributed database system (DDBS) design. The fundamental task of BEA in DDBS design is to group attributes based on their affinity, which indicates how closely related the attributes are, as determined by the inclusion of these attributes by the same database transactions. In our case, each gene was considered as an attribute. Hence, the basic premise is that two genes would have higher affinity, thus higher bond energy, if abstracts mentioning these genes shared many informative keywords. BEA has several useful properties [16], [19]. First, it groups attributes with larger affinity values together, and the ones with smaller values together (i.e., during the permutation of columns and rows, it shuffles the attributes towards those with which they have higher affinity and away from those with which they have lower affinity). Second, the composition and order of the final groups are insensitive to the order in which items are presented to the algorithm. Finally, it seeks to uncover and display the association and interrelationships of the clustered groups with one another.
In order to explore whether this algorithm could be useful for clustering genes derived from microarray experiments, we compared the performance of BEA-PARTITION, hierarchical clustering algorithm, self-organizing map, and the k-means algorithm for clustering functionally-related genes based on shared keywords, using purity, entropy, and mutual information as metrics for evaluating cluster quality.
2 METHODS
2.1 Keyword Extraction from Biomedical Literature
We used statistical methods to extract keywords from MEDLINE citations, based on the work of [15]. This method estimates the significance of words by comparing the frequency of words in a given gene-related set (Test Set) of abstracts with their frequency in a background set of abstracts. We modified the original method by using 1) a different background set, 2) a different stemming algorithm (Porter’s stemmer), and 3) a customized stop list. The details were reported by Liu et al. [20], [21].
For each gene analyzed, word frequencies were calculated from a group of abstracts retrieved by an SQL (structured query language) search of MEDLINE for the specific gene name, gene symbol, or any known aliases (see LocusLink, ftp://ftp.ncbi.nih.gov/refseq/LocusLink/LL_tmpl.gz for gene aliases) in the TITLE field. The resulting set of abstracts (the Test Set) was processed to generate a specific keyword list.
Test Sets of Genes. We compared BEA-PARTITION and other clustering algorithms (k-means, hierarchical, and SOM) on two test sets.
1. Twenty-six genes in four well-defined functional groups consisting of 10 glutamate receptor subunits, seven enzymes in catecholamine metabolism, five cytoskeletal proteins, and four enzymes in tyrosine and phenylalanine synthesis. The gene names and aliases are listed in Table 1. This experiment was performed to determine whether keyword associations can be used to group genes appropriately and whether the four gene families or clusters that were known a priori would also be predicted by a clustering algorithm simply using the affinity metric based on keywords.
2. Forty-four yeast genes involved in the cell cycle of budding yeast (Saccharomyces cerevisiae) that had altered expression patterns on spotted DNA microarrays [6]. These genes were analyzed by Cherepinsky et al. [4] to demonstrate their Shrinkage algorithm for gene clustering. A master list of member genes for each cluster was assembled according to a combination of 1) common cell-cycle functions and regulatory systems and 2) the corresponding transcriptional activators for each gene [4] (Table 2).
Keyword Assessment. Statistical formulae from [15] for word frequencies were used without modification. These calculations were repeated for all gene names in the test
TABLE 1
Twenty-Six Genes Manually Clustered Based on Functional Similarity
TABLE 2
Forty-Four Yeast Genes Grouped by Transcriptional Activators and Cell Cycle Functions [4]
set, a process that generated a database of keywords associated with specific genes, the strength of the association being reflected by a z-score. The z-score of word a for gene g is defined as:
\[ Z_g^a = \frac{F_g^a - \bar{F}^a}{\sigma^a}, \tag{1} \]
where $F_g^a$ equals the frequency of word a in Test Set g (i.e., in Test Set g, the number of abstracts where the word a occurs divided by the total number of abstracts), and $\bar{F}^a$ and $\sigma^a$ are the average frequency and standard deviation, respectively, of word a in the background set. Intuitively, the score Z compares the “importance” or “discriminatory relevance” of a keyword in the test set of abstracts with the background set that represents the expected occurrence of that word in the literature at large.
Keyword Selection for Gene Clustering. We used z-score thresholds to select the keywords used for gene clustering. Those keywords with z-scores less than the threshold were discarded. The z-score thresholds we tested were 0, 5, 8, 10, 15, 20, 30, 50, and 100. The database generated by this algorithm is represented as a sparse word (rows) × gene (columns) matrix with cells containing z-scores. The matrix is characterized as “sparse” because each gene only has a fraction of all words associated with it. The output of the keyword selection for all genes in each Test Set is represented as a sparse keyword (rows) × gene (columns) matrix with cells containing z-scores.
2.2 BEA-PARTITION: Detailed Working of the Algorithm
The BEA-PARTITION takes a symmetric matrix as input, permutes its rows and columns, and generates a sorted matrix, which is then partitioned to form a clustered matrix.
Constructing the Symmetric Gene × Gene Matrix. The sparse word × gene matrix, with the cells containing the z-scores of each word-gene pair, was converted to a gene × gene matrix with the cells containing the sum of products of z-scores for shared keywords. The z-score value was set to zero if the value was less than the threshold. Larger values reflect stronger and more extensive keyword associations between gene-gene pairs. For each gene pair $(G_i, G_j)$ and every word a they share in the sparse word × gene matrix, the $G_i \times G_j$ cell value $\mathrm{aff}(G_i, G_j)$ in the gene × gene matrix represents the affinity of the two genes for each other and is calculated as:
\[ \mathrm{aff}(G_i, G_j) = \frac{\sum_{a=1}^{N} \left( Z_{G_i}^a \times Z_{G_j}^a \right)}{1{,}000}. \tag{2} \]
Dividing the sum of the z-score products by 1,000 was done to reduce the typically large numbers to a more readable format in the output matrix.
Sorting the Matrix [19]. The sorted matrix is generated as follows:
LIU ET AL.: TEXT MINING BIOMEDICAL LITERATURE FOR DISCOVERING GENE-TO-GENE RELATIONSHIPS: A COMPARATIVE STUDY OF... 65
1. Initialization. Place and fix one of the columns of the symmetric matrix arbitrarily into the clustered matrix.
2. Iteration. Pick one of the remaining n-i columns (where i is the number of columns already in the sorted matrix). Choose the placement in the sorted matrix that maximizes the change in bond energy as described below (3). Repeat this step until no more columns remain.
3. Row ordering. Once the column ordering is determined, the placement of the rows should also be changed correspondingly so that their relative positions match the relative position of the columns. This restores the symmetry to the sorted matrix.
To calculate the change in bond energy for each possible placement of the next (i + 1)th column, the bonds between that column (k) and each of the two newly adjacent columns (i, j) are added and the bond that would be broken between the latter two columns is subtracted. Thus, the “bond energy” between these three columns i, j, and k (representing gene i $(G_i)$, gene j $(G_j)$, and gene k $(G_k)$) is calculated by the following interaction contribution measure:
\[ \mathrm{energy}(G_i, G_j, G_k) = 2 \times \left[ \mathrm{bond}(G_i, G_k) + \mathrm{bond}(G_k, G_j) - \mathrm{bond}(G_i, G_j) \right], \tag{3} \]
where $\mathrm{bond}(G_i, G_j)$ is the bond energy between gene $G_i$ and gene $G_j$ and
\[ \mathrm{bond}(G_i, G_j) = \sum_{r=1}^{N} \mathrm{aff}(G_r, G_i) \times \mathrm{aff}(G_r, G_j), \tag{4} \]
\[ \mathrm{aff}(G_0, G_i) = \mathrm{aff}(G_i, G_0) = \mathrm{aff}(G_{n+1}, G_i) = \mathrm{aff}(G_i, G_{n+1}) = 0. \tag{5} \]
The last set of conditions (5) takes care of cases where a gene is being placed in the sorted matrix to the left of the leftmost gene or to the right of the rightmost gene during column permutations, and prior to the topmost row and following the last row during row permutations.
Partitioning the Sorted Matrix. The original BEA algorithm [16] did not propose how to partition the sorted matrix. The partitioning heuristic was added by Navathe et al. [17] for the problems in distributed database design. These heuristics were constructed using the goals of design: to minimize access time and storage costs. We do not have the luxury of such a clear-cut objective function in our case. Hence, to partition the sorted matrix into submatrices, each representing a gene cluster, we experimented with different heuristics and, finally, derived a heuristic that identifies the boundaries between clusters by sequentially finding the maximum sum of the quotients for corresponding cells in adjacent columns across the matrix. With each successive split, only those rows corresponding to the remaining columns were processed, i.e., only the remaining symmetrical portion of the submatrix was used for further iterations of the splitting algorithm. The number of clusters into which the gene affinity matrix was partitioned was determined by AUTOCLASS (described below); however, other heuristics might be useful for this determination. The boundary metric (B) for columns $G_i$ and $G_j$ used for placement of new column k between existing columns i and j was defined as:
\[ B(G_i, G_j) = \max_{p_{-1} \le q \le p} \sum_{k=p_{-1}}^{p} \frac{\max(\mathrm{aff}(k, q), \mathrm{aff}(k, q+1))}{\min(\mathrm{aff}(k, q), \mathrm{aff}(k, q+1))}, \tag{6} \]
where q is the new splitting point (for simplicity, we use the number of the leftmost column in the new submatrix that is to the right of the splitting point), which will split the submatrix defined between two previous splitting points, $p$ and $p_{-1}$ (which do not necessarily represent contiguous columns). To partition the entire sorted matrix, the following initial conditions are set: $p = N$, $p_{-1} = 0$.
2.3 K-Means Algorithm and Hierarchical Clustering Algorithm
K-means and hierarchical clustering analysis were performed using Cluster/Treeview programs available online (http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/software.htm).
2.4 Self-Organizing Map
Self-organizing map was performed using GeneCluster 2.0 (http://www.broad.mit.edu/cancer/software/software.html).
The Euclidean distance measure was used when the gene × keyword matrix was the input. When the gene × gene matrix was used as input, the gene similarity was calculated by (2).
2.5 Number of Clusters
In order to apply BEA-PARTITION and k-means clustering algorithms, the investigator needs to have a priori knowledge about the number of clusters in the test set. We determined the number of clusters by applying AUTOCLASS, an unsupervised Bayesian classification system developed by [22]. AUTOCLASS, which seeks a maximum posterior probability classification, determines the optimal number of classes in large data sets. Among a variety of applications, AUTOCLASS has been used for the discovery of new classes of infra-red stars in the IRAS Low Resolution Spectral catalogue, new classes of airports in a database of all US airports, and discovery of classes of proteins, introns and other patterns in DNA/protein sequence data [22]. We applied an open source implementation of AUTOCLASS (http://ic.arc.nasa.gov/ic/projects/bayes-group/autoclass/autoclass-c-program.html). The resulting number of clusters was then used as the endpoint for the partitioning step of the BEA-PARTITION algorithm. To determine whether AUTOCLASS could discover the number of clusters in the test sets correctly, we also tested numbers of clusters other than the ones AUTOCLASS predicted.
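The column-placement rule of (3)-(5) and the boundary score of (6) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `bond`, `bea_order`, and `boundary_scores` are hypothetical, the remaining columns are taken in index order (the text only says to pick one of them), ties go to the leftmost position, and a small `eps` guards against division by zero, a case the text does not address.

```python
import numpy as np

def bond(aff, i, j):
    """Eq. (4): sum over r of aff(Gr, Gi) * aff(Gr, Gj).
    None stands for the empty neighbours G0 / G(n+1) of eq. (5)."""
    if i is None or j is None:
        return 0.0
    return float(aff[:, i] @ aff[:, j])

def bea_order(aff, first=0):
    """Greedy column ordering: insert each remaining column where the
    energy of eq. (3) is largest (assumption: columns taken in index order)."""
    order = [first]
    for k in (c for c in range(aff.shape[0]) if c != first):
        gains = []
        for pos in range(len(order) + 1):
            left = order[pos - 1] if pos > 0 else None
            right = order[pos] if pos < len(order) else None
            gains.append(2 * (bond(aff, left, k) + bond(aff, k, right)
                              - bond(aff, left, right)))
        order.insert(int(np.argmax(gains)), k)  # first maximum wins ties
    return order

def boundary_scores(sorted_aff, lo, hi, eps=1e-9):
    """Score of eq. (6) for every boundary between adjacent columns of the
    submatrix [lo, hi): sum over rows of max/min of the two neighbouring cells.
    eps is an assumption to avoid dividing by zero."""
    sub = sorted_aff[lo:hi, lo:hi]
    a, b = sub[:, :-1], sub[:, 1:]
    return (np.maximum(a, b) / np.maximum(np.minimum(a, b), eps)).sum(axis=0)

# Toy affinity matrix: genes 0 and 2 form one group, genes 1 and 3 another.
aff = np.array([[5., 1., 4., 1.],
                [1., 5., 1., 4.],
                [4., 1., 5., 1.],
                [1., 4., 1., 5.]])
order = bea_order(aff)
sorted_aff = aff[np.ix_(order, order)]   # rows follow the column order
scores = boundary_scores(sorted_aff, 0, len(order))
split = int(np.argmax(scores)) + 1
clusters = [order[:split], order[split:]]
```

On this toy matrix the sketch places the two groups side by side and the largest boundary score falls between them. Splitting further would repeat `boundary_scores` on a submatrix until the requested number of clusters is reached.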
2.6 Evaluating the Clustering Results
To evaluate the quality of our resultant clusters, we used the established metrics of Purity, Entropy, and Mutual Information, which are briefly described below [23]. Let us assume that we have C classes (i.e., C expert clusters, as shown in Tables 1 and 2), while our clustering algorithms produce K clusters, $\xi_1, \xi_2, \ldots, \xi_K$.
Purity. Purity can be interpreted as classification accuracy under the assumption that all objects of a cluster are classified to be members of the dominant class for that cluster. If the majority of genes in cluster A are in class X, then class X is the dominant class. Purity is defined as the ratio between the number of items in cluster $\xi_i$ from dominant class j and the size of cluster $\xi_i$, that is:
\[ P(\xi_i) = \frac{1}{n_i} \max_j \left( n_i^j \right), \quad i = 1, 2, \ldots, K, \tag{7} \]
\[ M(\xi) = \frac{2}{N} \sum_{i=1}^{K} \sum_{j=1}^{C} n_i^j \, \frac{\log\!\left( \dfrac{n_i^j N}{\sum_{t=1}^{K} n_t^j \, \sum_{t=1}^{C} n_i^t} \right)}{\log(K \cdot C)}, \tag{9} \]
where N is the total number of genes being clustered and K is the number of clusters the algorithm produced, and C is the number of expert classes.
2.7 Top-Scoring Keywords Shared among Members of a Gene Cluster
Keywords were ranked according to their highest shared z-scores in each cluster. The keyword sharing strength metric ($K^a$) is defined as the sum of z-scores for a shared keyword a within the cluster, multiplied by the number of genes (M) within the cluster with which the word is associated; in this calculation z-scores less than a user-selected threshold are set to zero and are not counted.
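As a worked illustration of the purity (7) and mutual information (9) measures of Section 2.6, the following Python sketch computes both from two label lists. The function names and the toy labels are hypothetical; empty cluster/class cells are skipped because they contribute nothing to the sum, and the sketch assumes K · C > 1 so that the normalizing logarithm is nonzero.

```python
import math
from collections import Counter

def purity(cluster_labels, class_labels):
    """Eq. (7): for each cluster, the share of its members that belong
    to the cluster's dominant expert class."""
    result = {}
    for c in set(cluster_labels):
        members = [cls for cl, cls in zip(cluster_labels, class_labels) if cl == c]
        result[c] = Counter(members).most_common(1)[0][1] / len(members)
    return result

def mutual_information(cluster_labels, class_labels):
    """Eq. (9): mutual information between clusters and expert classes,
    normalized by log(K * C)."""
    n = len(cluster_labels)
    joint = Counter(zip(cluster_labels, class_labels))   # n_i^j
    per_cluster = Counter(cluster_labels)                # sum over classes
    per_class = Counter(class_labels)                    # sum over clusters
    k, c = len(per_cluster), len(per_class)
    total = sum(nij * math.log(nij * n / (per_cluster[i] * per_class[j]))
                for (i, j), nij in joint.items())
    return 2.0 / n * total / math.log(k * c)

# Toy example: five genes, two clusters, two expert classes.
clusters = [0, 0, 1, 1, 1]
classes = ["a", "a", "b", "b", "a"]
p = purity(clusters, classes)
mi_perfect = mutual_information([0, 0, 1, 1], ["a", "a", "b", "b"])
```

When the clustering reproduces the expert classes exactly and K equals C, the normalized value is 1; when K and C differ, a clustering with purity 1 can still score below 1, which is consistent with the values reported for the 26-gene set.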
Fig. 1. Procedure for clustering genes by the strength of their associated keywords.
mutual information when the z-score threshold was 8, while, for the 26-gene set, mutual information was highest when z-score threshold was 15. For the remaining studies, we chose to use a z-score threshold of 10 to keep as many functional keywords as possible.
3.3 Number of Clusters
We then used AUTOCLASS to decide the number of clusters in the test sets. AUTOCLASS took the keyword × gene matrix as input and predicted that there were five clusters in the set of 26 genes and nine clusters in the set of 44 yeast genes. The effect of the numbers of clusters on the algorithm performance is shown in Figs. 2A2 and 2B2. BEA-PARTITION again produced a better result regardless of the number of clusters used. BEA-PARTITION had the highest mutual information when the numbers of clusters were five (26-gene set) and nine (44-gene set), whereas k-means worked marginally better when the numbers of clusters were 8 (26-gene set) and 10 (44-gene set). Based on these results we chose to use five and nine clusters, respectively, because the probabilities were higher than the other choices.
much higher values than those outside. Hierarchical clustering algorithm, with the gene × keyword matrix as the input, generated a similar result to BEA-PARTITION (five clusters, with TT as the outlier) (Fig. 4a). The results, with the gene × gene matrix as the input, are shown in tables in the supplementary materials, which can be found at www.computer.org/publications/dlib.
While BEA-PARTITION and hierarchical clustering algorithm produced clusters very similar to the original functional classes, those produced by k-means (Table 4), self-organizing map (Table 5), and AUTOCLASS (Table 6), with the gene × keyword matrix as input, were heterogeneous and, thus, more difficult to explain. The average purity,
Fig. 3. Gene clusters by keyword associations using BEA-PARTITION. Keywords with z-scores ≥ 10 were extracted from MEDLINE abstracts for 26 genes in four functional classes. The resulting word × gene sparse matrix was converted to a gene × gene matrix. The cell values are the sum of z-score products for all keywords shared by the gene pair. This value is divided by 1,000 for purpose of display. A modified bond energy algorithm [16], [17] was used to group genes into five clusters based on the strength of keyword associations, and the resulting gene clusters are boxed.
average entropy, and mutual information of the BEA-PARTITION and hierarchical algorithm result were 1, 0, and 0.88, while those of k-means result were 0.53, 0.65, and 0.28, respectively, those of SOM result were 0.76, 0.35, and 0.18, respectively, and those of AUTOCLASS result were 0.82, 0.28, and 0.56 (Table 3) (gene × keyword matrix as input). When the gene × gene matrix was used as input to hierarchical algorithm, k-means, and SOM, the results were even worse as measured by purity, entropy, and mutual information (Table 3).
3.5 Yeast Microarray Gene Clustering by Keyword Association
To determine whether our text mining/gene clustering approach could be used to group genes identified in microarray experiments, we clustered 44 yeast genes taken from Eisen et al. [6] via Cherepinsky et al. [4], again using BEA-PARTITION, hierarchical algorithm, SOM, AUTOCLASS, and k-means. Keyword lists were generated for each of the 44 yeast genes (Table 2) and a 3,882 (words appearing in the query sets with z-score greater than or equal to 10) × 44 (genes) matrix was created. The clusters produced by BEA-PARTITION, k-means, SOM, and AUTOCLASS are shown in Tables 7, 8, 9, and 10, respectively, whereas those produced by hierarchical algorithm are shown in Fig. 4b. The average purity, average entropy, and mutual information of the BEA-PARTITION result were 0.74, 0.24, and 0.60, whereas those of hierarchical algorithm, SOM, k-means, and AUTOCLASS results (gene × keyword matrix as input) were 0.86, 0.12, and 0.58; 0.60, 0.37, and 0.46; 0.61, 0.33, and 0.39; 0.57, 0.39, and 0.49, respectively (Table 3).
3.6 Keywords Indicative of Major Shared Functions with a Gene Cluster
Keywords shared among genes (26-gene set) within each cluster were ranked according to a metric based on both the degree of significance (the sum of z-scores for each keyword) and the breadth of distribution (the sum of the number of genes within the cluster for which the keyword has a z-score greater than a selected threshold). This double-pronged metric obviated the difficulty encountered with keywords that had extremely high z-scores for single genes within the cluster but modest z-scores for the remainder. The 30 highest scoring keywords for each of the four clusters were tabulated (Table 11). The respective keyword lists appeared to be highly informative about the general function of the original, preselected clusters when shown to medical students, faculty, and postdoctoral fellows.
4 DISCUSSION
In this paper, we clustered the genes by shared functional keywords. Our gene clustering strategy is similar to document clustering in information retrieval. Document clustering, defined as grouping documents into clusters according to their topics or main contents in an unsupervised manner, organizes large amounts of information into a small number of meaningful clusters and improves the information retrieval performance either via cluster-driven dimensionality reduction, term-weighting, or query expansion [9], [24], [25], [26], [27].
Term vector-based document clustering has been widely studied in information retrieval [9], [24], [25], [26], [27]. A
Fig. 4. Gene clusters by keyword associations using hierarchical clustering algorithm. Keywords with z-scores ≥ 10 were extracted from MEDLINE abstracts for (a) 26 genes in four functional classes and (b) 44 genes in nine classes. The resulting word × gene sparse matrix was used as input to the hierarchical algorithm.
number of clustering algorithms have been proposed and many of them have been applied to bioinformatics research. In this report, we introduced a new algorithm for clustering genes, BEA-PARTITION. Our results showed that BEA-PARTITION, in conjunction with the heuristic developed for partitioning the sorted matrix, outperformed the k-means algorithm and SOM in two test sets. In the first set of genes (26-gene set), BEA-PARTITION, as well as the hierarchical algorithm, correctly assigned 25 of 26 genes in a test set of four known gene groups with one outlier, whereas k-means and SOM mixed the genes into five more evenly sized but functionally less well-defined groups. In the 44-gene set, the result generated by BEA-PARTITION had the highest mutual information, indicating that BEA-PARTITION outperformed the other four clustering algorithms.
4.1 BEA-PARTITION versus k-Means
In this study, z-score thresholds were used for keyword selection. When the threshold was 0, all words, including noise (noninformative and misspelled words), were used to cluster genes. Under the tested conditions, clusters produced by BEA-PARTITION had higher quality than those produced by k-means. BEA-PARTITION clusters genes based on their shared keywords. It is unlikely that genes within the same cluster shared the same noisy words with high z-scores, indicating that BEA-PARTITION is less sensitive to noise than k-means. In fact, BEA-PARTITION performed better than k-means in the two test gene sets under almost all test conditions (Fig. 2). BEA-PARTITION performed best when z-score thresholds were 10, 15, and 20, which indicated that 1) words with z-scores less than 10 were less informative and 2) few words with z-scores between 10 and 20 were shared by at least two genes and did not improve the cluster quality. When z-score thresholds were high (> 30 in the 26-gene set and > 20 in the 44-gene set), more informative words were discarded and, as a result, the cluster quality was degraded.
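The effect of the z-score threshold on keyword selection can be illustrated with a small sketch. It assumes a dense gene × keyword matrix of z-scores and, as an extra assumption suggested by the remark about words shared by at least two genes, drops keywords that survive the threshold for fewer than two genes. The function name `select_keywords`, the `min_genes` parameter, and the toy values are hypothetical and are not taken from the study.

```python
import numpy as np


def select_keywords(z_scores, words, threshold=10.0, min_genes=2):
    """Keep keywords whose z-score passes the threshold for enough genes.

    z_scores: gene x keyword array of z-scores (rows are genes).
    Returns the filtered matrix and the list of surviving keywords.
    """
    z = np.asarray(z_scores, dtype=float)
    # Entries below the threshold are treated as "keyword not associated".
    kept = np.where(z >= threshold, z, 0.0)
    # Assumption: a keyword is only useful for clustering if at least
    # `min_genes` genes share it after thresholding.
    shared = (kept > 0).sum(axis=0) >= min_genes
    return kept[:, shared], [w for w, s in zip(words, shared) if s]


# Toy data (hypothetical): three genes, four words, one of them a misspelling.
toy_z = [[25.0, 4.0, 12.0, 0.0],
         [18.0, 3.0, 0.0, 31.0],
         [0.0, 5.0, 14.0, 2.0]]
toy_words = ["glutamate", "teh", "kainate", "tyrosine"]

matrix_10, words_10 = select_keywords(toy_z, toy_words, threshold=10.0)
matrix_0, words_0 = select_keywords(toy_z, toy_words, threshold=0.0)
matrix_30, words_30 = select_keywords(toy_z, toy_words, threshold=30.0)
```

In this toy case a threshold of 0 keeps the noisy word, a threshold of 10 keeps only the two shared informative words, and a threshold of 30 leaves no shared keyword at all, which mirrors the qualitative behavior described above.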
TABLE 3
The Quality of the Gene Clusters Derived by Different Clustering Algorithms, Measured by Purity, Entropy, and Mutual Information
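As a rough illustration of the first two measures named in this caption, the sketch below computes purity and entropy from a cluster assignment and known functional classes. It assumes the commonly used definitions (majority-class fraction for purity; size-weighted per-cluster class entropy, normalized by the log of the number of classes); the exact normalization used in the study is not restated in this section, and mutual information is left out for that reason. All names and the toy labels are hypothetical.

```python
import math
from collections import Counter


def _members_by_cluster(cluster_ids, class_labels):
    """Group the known class labels by the cluster each item was assigned to."""
    groups = {}
    for cid, label in zip(cluster_ids, class_labels):
        groups.setdefault(cid, []).append(label)
    return groups


def purity(cluster_ids, class_labels):
    """Fraction of items that belong to the majority class of their cluster."""
    groups = _members_by_cluster(cluster_ids, class_labels)
    majority = sum(Counter(g).most_common(1)[0][1] for g in groups.values())
    return majority / len(cluster_ids)


def entropy(cluster_ids, class_labels):
    """Size-weighted class entropy per cluster, normalized to [0, 1]."""
    n_classes = len(set(class_labels))
    if n_classes < 2:
        return 0.0
    groups = _members_by_cluster(cluster_ids, class_labels)
    total = 0.0
    for g in groups.values():
        h = -sum((c / len(g)) * math.log(c / len(g))
                 for c in Counter(g).values())
        total += (len(g) / len(cluster_ids)) * h / math.log(n_classes)
    return total


# Toy example (hypothetical): six genes, two clusters, two known classes.
toy_clusters = [0, 0, 0, 1, 1, 1]
toy_classes = ["AMPA", "AMPA", "kainate", "kainate", "kainate", "kainate"]
toy_purity = purity(toy_clusters, toy_classes)
toy_entropy = entropy(toy_clusters, toy_classes)
```

Under these definitions, higher purity and lower entropy both indicate clusters that agree better with the known functional classes.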
BEA-PARTITION is designed to group cells with larger values together, and the ones with smaller values together. The final order of the genes within the cluster reflected deeper interrelationships. Among the 10 glutamate receptor genes examined, GluR1, GluR2, and GluR4 are AMPA receptors, while GluR6, KA1, and KA2 are kainate receptors. The observation that BEA-PARTITION placed GluR6 and KA2 next to each other confirms that the literature associations between GluR6 and KA2 are higher than those between GluR6 and the AMPA receptors. Furthermore, the association and interrelationships of the clustered groups with one another can be seen in the final clustering matrix. For example, TT was an outlier in Fig. 3; however, it still had higher affinity to PD1 (affinity = 202) and PD2 (affinity = 139) than to any other genes. Thus, TT appears to be strongly related to genes in the tyrosine and phenylalanine synthesis cluster, from which it originated.
BEA-PARTITION has several advantages over the k-means algorithm: 1) while k-means generally produces a locally optimal clustering [2], BEA-PARTITION produces
TABLE 4
Twenty-Six Gene Set k-Means Result (Gene Keyword Matrix as Input)
TABLE 5
Twenty-Six Gene SOM Result (Gene Keyword Matrix as Input)
the globally optimal clustering by permuting the columns and rows of the symmetric matrix; 2) the k-means algorithm is sensitive to initial seed selection and noise [9].
4.2 BEA-PARTITION versus Hierarchical Algorithm
Hierarchical clustering algorithms, as well as k-means and Self-Organizing Maps, have been widely used in microarray expression profile analysis. Hierarchical clustering organizes expression data into a binary tree without providing a clear indication of how the hierarchy should be clustered. In practice, investigators define clusters by a manual scan of the genes in each node and rely on their biological expertise to notice shared functional properties of genes. Therefore, the definition of the clusters is subjective, and as a result, different investigators may interpret the same clustering result differently. Some have proposed automatically defining boundaries based on statistical properties of the gene expression profiles; however, the same statistical criteria may not be generally applicable to identify all relevant biological functions [10]. We believe that an algorithm that produces clusters with clear boundaries can provide more objective results and possibly new discoveries that are beyond the experts’ knowledge. In this report, our results showed that BEA-PARTITION can perform similarly to a hierarchical algorithm while providing distinct cluster boundaries.
4.3 K-Means versus SOM
The k-means algorithm and SOM can group objects into different clusters and provide clear boundaries. Despite its
TABLE 6
Twenty-Six Gene AUTOCLASS Result (Gene Keyword Matrix as Input)
TABLE 7
Forty-Four Yeast Genes BEA-PARTITION Result (Gene Keyword Matrix as Input)
simplicity and efficiency, the SOM algorithm has several weaknesses that make its theoretical analysis difficult and limit its practical usefulness. Various studies have suggested that it is hard to find any criteria under which the SOM algorithm performs better than the traditional techniques, such as k-means [11]. Balakrishnan et al. [28] compared the SOM algorithm with k-means clustering on 108 multivariate normal clustering problems. The results showed that the SOM algorithm performed significantly worse than the k-means clustering algorithm. Our results also showed that k-means performed better than SOM by generating clusters with higher mutual information.
4.4 Computing Time
The computing time of BEA-PARTITION, like that of the hierarchical algorithm and SOM, is on the order of N^2, commonly denoted O(N^2), which means that it grows in proportion to the square of the number of genes. That of k-means is on the order of N*K*T, or O(NKT), where N is the number of genes tested, K is the number of clusters, and T is the number of improvement steps (iterations) performed by k-means. In our study, the number of improvement steps was 1,000. Therefore, when the number of genes tested is about 1,000, BEA-PARTITION runs (aK + b) times faster than k-means, where a and b are constants. As long as the number of genes to be clustered is less than the product of the number
TABLE 8
Forty-Four Yeast Gene SOM Result (Gene Keyword as Input)
TABLE 9
Forty-Four Yeast Gene k-Means Result (Gene Keyword Matrix as Input)
of clusters and the number of iterations, BEA-PARTITION will run faster than k-means.
4.5 Number of Clusters
One disadvantage of BEA-PARTITION and k-means compared to hierarchical clustering is that the investigator needs a priori knowledge of the number of clusters in the test set, which may not be known. We approached this problem by using AUTOCLASS to predict the number of clusters in the test sets. BEA-PARTITION performed best when it grouped the genes into five clusters (26-gene set) and nine clusters (44-gene set), which were predicted by AUTOCLASS with higher probabilities. Therefore, AUTOCLASS
TABLE 10
Forty-Four Yeast Gene AUTOCLASS Result (Gene Keyword Matrix as Input)
TABLE 11
Top Ranking Keywords Associated with Each Gene Cluster
appears to be an effective tool to assist BEA-PARTITION in gene clustering.
5 CONCLUSIONS AND FUTURE WORK
There are several aspects of the BEA approach that we are currently exploring with more detailed studies. For example, although the BEA-PARTITION described here performs relatively well on small sets of genes, it still needs to be tested on the larger gene lists expected from microarray experiments. Furthermore, we derived a heuristic to partition the clustered affinity matrix into clusters. We anticipate that this heuristic, which is simply based on the sum of ratios of corresponding values from adjacent columns, will generally work regardless of the type of items being clustered. More generally, optimizing the heuristic to partition a sorted matrix after BEA-based clustering will be valuable. Finally, we are developing a Web-based tool that will include a text mining phase to identify functional keywords and a gene clustering phase to cluster the genes based on the shared functional keywords. We believe that this tool should be useful for discovering novel relationships among sets of genes because it links genes by shared functional keywords rather than just reporting known interactions based on published reports. Thus, genes that never co-occur in the same publication could still be linked by their shared keywords.
The BEA approach has been applied successfully to other disciplines, such as operations research, production engineering, and marketing [18]. The BEA-PARTITION algorithm represents our extension to the BEA approach specifically for dealing with the problem of discovering functional similarity among genes based on functional keywords extracted from literature. We believe that this important clustering technique, which was originally proposed by McCormick et al. [16] to cluster questions on psychological instruments and later introduced by Navathe et al. [17] for clustering of data items in database design, has promise for application to other bioinformatics problems where starting matrices are available from experimental observations.
ACKNOWLEDGMENTS
This work was supported by NINDS (RD) and the Emory-Georgia Tech Research Consortium. The authors would like to thank Brian Revennaugh and Alex Pivoshenk for research support.
REFERENCES
[1] C. Blaschke, J.C. Oliveros, and A. Valencia, “Mining Functional Information Associated with Expression Arrays,” Functional & Integrative Genomics, vol. 1, pp. 256-268, 2001.
[2] Y. Xu, V. Olman, and D. Xu, “EXCAVATOR: A Computer Program for Efficiently Mining Gene Expression Data,” Nucleic Acids Research, vol. 31, pp. 5582-5589, 2003.
[3] D. Chaussabel and A. Sher, “Mining Microarray Expression Data by Literature Profiling,” Genome Biology, vol. 3, pp. 1-16, 2002.
[4] V. Cherepinsky, J. Feng, M. Rejali, and B. Mishra, “Shrinkage-Based Similarity Metric for Cluster Analysis of Microarray Data,” Proc. Nat’l Academy of Sciences USA, vol. 100, pp. 9668-9673, 2003.
[5] J. Quackenbush, “Computational Analysis of Microarray Data,” Nature Rev. Genetics, vol. 2, pp. 418-427, 2001.
[6] M.B. Eisen, P.T. Spellman, P.O. Brown, and D. Botstein, “Cluster Analysis and Display of Genome-Wide Expression Patterns,” Proc. Nat’l Academy of Sciences USA, vol. 95, pp. 14863-14868, 1998.
[7] R. Herwig, A.J. Poustka, C. Müller, C. Bull, H. Lehrach, and J. O’Brien, “Large-Scale Clustering of cDNA-Fingerprinting Data,” Genome Research, vol. 9, pp. 1093-1105, 1999.
[8] P. Tamayo, D. Slonim, J. Mesirov, Q. Zhu, S. Kitareewan, E. Dmitrovsky, E.S. Lander, and T.R. Golub, “Interpreting Patterns of Gene Expression with Self-Organizing Maps: Methods and Application to Hematopoietic Differentiation,” Proc. Nat’l Academy of Sciences USA, vol. 96, pp. 2907-2912, 1999.
[9] A.K. Jain, M.N. Murty, and P.J. Flynn, “Data Clustering: A Review,” ACM Computing Surveys, vol. 31, pp. 264-323, 1999.
[10] S. Raychaudhuri, J.T. Chang, F. Imam, and R.B. Altman, “The Computational Analysis of Scientific Literature to Define and Recognize Gene Expression Clusters,” Nucleic Acids Research, vol. 15, pp. 4553-4560, 2003.
[11] B. Kegl, “Principal Curves: Learning, Design, and Applications,” PhD dissertation, Dept. of Computer Science, Concordia Univ., Montreal, Quebec, 2002.
[12] T.K. Jenssen, A. Laegreid, J. Komorowski, and E. Hovig, “A Literature Network of Human Genes for High-Throughput Analysis of Gene Expression,” Nature Genetics, vol. 178, pp. 139-143, 2001.
[13] D.R. Masys, J.B. Welsh, J.L. Fink, M. Gribskov, I. Klacansky, and J. Corbeil, “Use of Keyword Hierarchies to Interpret Gene Expression Patterns,” Bioinformatics, vol. 17, pp. 319-326, 2001.
[14] S. Raychaudhuri, H. Schutze, and R.B. Altman, “Using Text Analysis to Identify Functionally Coherent Gene Groups,” Genome Research, vol. 12, pp. 1582-1590, 2002.
[15] M. Andrade and A. Valencia, “Automatic Extraction of Keywords from Scientific Text: Application to the Knowledge Domain of Protein Families,” Bioinformatics, vol. 14, pp. 600-607, 1998.
[16] W.T. McCormick, P.J. Schweitzer, and T.W. White, “Problem Decomposition and Data Reorganization by a Clustering Technique,” Operations Research, vol. 20, pp. 993-1009, 1972.
[17] S. Navathe, S. Ceri, G. Wiederhold, and J. Dou, “Vertical Partitioning Algorithms for Database Design,” ACM Trans. Database Systems, vol. 9, pp. 680-710, 1984.
[18] P. Arabie and L.J. Hubert, “The Bond Energy Algorithm Revisited,” IEEE Trans. Systems, Man, and Cybernetics, vol. 20, pp. 268-274, 1990.
[19] A.T. Ozsu and P. Valduriez, Principles of Distributed Database Systems, second ed. Prentice Hall Inc., 1999.
[20] Y. Liu, M. Brandon, S. Navathe, R. Dingledine, and B.J. Ciliax, “Text Mining Functional Keywords Associated with Genes,” Proc. Medinfo 2004, pp. 292-296, Sept. 2004.
[21] Y. Liu, B.J. Ciliax, K. Borges, V. Dasigi, A. Ram, S. Navathe, and R. Dingledine, “Comparison of Two Schemes for Automatic Keyword Extraction from MEDLINE for Functional Gene Clustering,” Proc. IEEE Computational Systems Bioinformatics Conf. (CSB 2004), pp. 394-404, Aug. 2004.
[22] P. Cheeseman and J. Stutz, “Bayesian Classification (Autoclass): Theory and Results,” Advances in Knowledge Discovery and Data Mining, pp. 153-180, AAAI/MIT Press, 1996.
[23] A. Strehl, “Relationship-Based Clustering and Cluster Ensembles for High-Dimensional Data Mining,” PhD dissertation, Dept. of Electric and Computer Eng., The University of Texas at Austin, 2002.
[24] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval. New York: Addison Wesley Longman, 1999.
[25] F. Sebastiani, “Machine Learning in Automated Text Categorization,” ACM Computing Surveys, vol. 34, pp. 1-47, 1999.
[26] P. Willett, “Recent Trends in Hierarchic Document Clustering: A Critical Review,” Information Processing and Management, vol. 24, pp. 577-597, 1988.
[27] J. Aslam, A. Leblanc, and C. Stein, “Clustering Data without Prior Knowledge,” Proc. Algorithm Eng.: Fourth Int’l Workshop, 1982.
[28] P.V. Balakrishnan, M.C. Cooper, V.S. Jacob, and P.A. Lewis, “A Study of the Classification Capabilities of Neural Networks Using Unsupervised Learning: A Comparison with K-Means Clustering,” Psychometrika, vol. 59, pp. 509-525, 1994.
Ying Liu received the BS degree in environmental biology from Nanjing University, China. He received Master’s degrees in bioinformatics and computer science from Georgia Institute of Technology in 2002. He is a PhD candidate in the College of Computing, Georgia Institute of Technology, where he works on text mining biomedical literature to discover gene-to-gene relationships. His research interests include bioinformatics, computational biology, data mining, text mining, and database systems. He is a student member of the IEEE Computer Society.
Shamkant B. Navathe received the PhD degree from the University of Michigan in 1976. He is a professor in the College of Computing, Georgia Institute of Technology. He has published more than 130 refereed papers in database research; his important contributions are in database modeling, database conversion, database design, conceptual clustering, distributed database allocation, data mining, and database integration. Current projects include text mining of medical literature databases, creation of databases for biological applications, transaction models in P2P and Web applications, and data mining for better understanding of genomic/proteomic and medical data. His recent work has been focusing on issues of mobility, scalability, interoperability, and personalization of databases in scientific, engineering, and e-commerce applications. He is an author of the book, Fundamentals of Database Systems, with R. Elmasri (Addison Wesley, fourth edition, 2004) which is currently the leading database text-book worldwide. He also coauthored the book Conceptual Design: An Entity Relationship Approach (Addison Wesley, 1992) with Carlo Batini and Stefano Ceri. He was the general cochairman of the 1996 International VLDB (Very Large Data Base) Conference in Bombay, India. He was also program cochair of ACM SIGMOD 1985 at Austin, Texas. He is also on the editorial boards of Data and Knowledge Engineering (North Holland), Information Systems (Pergamon Press), Distributed and Parallel Databases (Kluwer Academic Publishers), and World Wide Web Journal (Kluwer). He has been an associate editor of IEEE Transactions on Knowledge and Data Engineering. He is a member of the IEEE.
Jorge Civera received the BSc degree in computer science from the Universidad Politécnica de Valencia in 2002, and the MSc degree in computer science from Georgia Institute of Technology in 2003. He is currently a PhD student at Departamento de Sistemas Informáticos y Computación and a research assistant in the Instituto Tecnológico de Informática. He also holds a fellowship from the Spanish Ministry of Education and Culture. His research interests include bioinformatics, machine translation, and text mining.
Venu Dasigi received the BE degree in electronics and communication engineering from Andhra University in 1979, the MEE degree in electronic engineering from the Netherlands Universities Foundation for International Cooperation in 1981, and the MS and PhD degrees in computer science from the University of Maryland, College Park in 1985 and 1988, respectively. He is currently professor and chair of computer science at Southern Polytechnic State University in Marietta, Georgia. He is also an honorary professor at Gandhi Institute of Technology and Management in India. He held research fellowships at the Oak Ridge National Laboratory and the Air Force Research Laboratory. His research interests include text mining, information retrieval, natural language processing, artificial intelligence, bioinformatics, and computer science education. He is a member of ACM and the IEEE Computer Society.
Ashwin Ram received the PhD degree from Yale University in 1989, the MS degree from the University of Illinois in 1984, and the BTech degree from IIT Delhi in 1982. He is an associate professor in the College of Computing at the Georgia Institute of Technology, an associate professor of Cognitive Science, and an adjunct professor in the School of Psychology. He has published two books and more than 80 scientific papers in international forums. His research interests lie in artificial intelligence and cognitive science, and include machine learning, natural language processing, case-based reasoning, educational technology, and artificial intelligence applications.
Brian J. Ciliax received the BS degree in biochemistry from Michigan State University in 1981, and the PhD degree in pharmacology from the University of Michigan in 1987. He is currently an assistant professor in the Department of Neurology at Emory University School of Medicine. His research interests include the functional neuroanatomy of the basal ganglia, particularly as it relates to hyperkinetic movement disorders such as Tourette’s Syndrome. Since 2000, he has collaborated with the coauthors on the development of a system to functionally cluster genes (identified by high-throughput genomic and proteomic assays) according to keywords mined from relevant MEDLINE abstracts.
Jem Rowland
Larry Ruzzo
Leszek Rychlewski
S
Gerhard Sagerer
Steven Salzberg
Herbert Sauro
Alejandro Schaffer
Alexander Schliep
Scott Schmidler
Jeanette Schmidt
Alexander Schönhuth
Charles Semple
Soheil Shams
Roded Sharan
Chad Shaw
Dinggang Shen
Dou Shen
Lisan Shen
Stanislav Shvartsman
Amandeep Sidhu
Richard Simon
Sameer Singh
Janne Sinkkonen
Steven S. Skiena
Quinn Snell
Carol Soderlund
Rainer Spang
Peter Stadler
Mike Steel
Gerhard Steger
Jens Stoye
Jack Sullivan
Krister Swenson
T
Pablo Tamayo
Amos Tanay
Chun Tang
Jijun Tang
Thomas Tang
Glenn Tesler
Robert Tibshirani
Martin Tompa
Anna Tramontano
James Troendle
Jerry Tsai
Koji Tsuda
John Tyson
W
Baoying Wang
Chang Wang
Lisan Wang
Tandy Warnow
Michael K. Weir
Jason Weston
Ydo Wexler
Nalin Wickramarachchi
Chris Wiggins
David Wild
Tiffani Williams
Thomas Wu
X
Dong Xu
Jinbo Xu
Y
Qiang Yang
Yee Hwa Yang
Zizhen Yao
Daniel Yekutieli
Jeffrey Yu
Z
Mohammed J. Zaki
An-Ping Zeng
Chengxiang Zhai
Jingfen Zhang
Kaizhong Zhang
Xuegong Zhang
Yang Zhang
Zhi-Hua Zhou
Zonglin Zhou
Ji Zhu