
IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 2, NO. 1, JANUARY-MARCH 2005

Guest Editorial: WABI Special Section Part II


Junhyong Kim and Inge Jonassen

The Fourth International Workshop on Algorithms in Bioinformatics (WABI) 2004 was held in Bergen, Norway, September 2004. The program committee consisted of 33 members and selected, among 117 submissions, 39 to be presented at the workshop and included in the proceedings from the workshop (volume 3240 of Lecture Notes in Bioinformatics, series edited by Sorin Istrail, Pavel Pevzner, and Michael Waterman).

The WABI 2004 program committee selected a small number of papers among the 39 to be invited to submit extended versions of their papers to a special section of the IEEE/ACM Transactions on Computational Biology and Bioinformatics. Four papers were published in the October-December 2004 issue of the journal and this issue contains an additional three papers. We would like to thank both the entire program committee for WABI and the reviewers of the papers in this issue for their valuable contributions.

The first of the papers is “A New Distance for High Level RNA Secondary Structure Comparison” authored by Julien Allali and Marie-France Sagot. This paper describes algorithms for comparing secondary structures of RNA molecules where the structures are represented by trees. The problem of classifying RNA secondary structure is becoming critical as biologists are discovering more and more noncoding functional elements in the genome (e.g., miRNA). Most likely, the major functional determinants of the elements are their secondary structure and, therefore, a metric between such secondary structures will also help delineate clusters of functional groups. In Allali and Sagot’s paper, two tree representations of secondary structure are compared by analysing how one tree can be transformed into the other using an allowed set of operations. Each operation can be associated with a cost and the distance between two trees can then be defined as the minimum cost associated with a transform of one tree to the other. Allali and Sagot introduce two new operations that they name edge fusion and node fusion and show that these alleviate limitations associated with the classical tree edit operations used for RNA comparison. Importantly, they also present algorithms for calculating the distance between trees allowing the new operations in addition to the classical ones, and analyze the performance of the algorithms.

The second paper is “Topological Rearrangements and Local Search Method for Tandem Duplication Trees” and is authored by Denis Bertrand and Olivier Gascuel. The paper approaches the problem of estimating the evolutionary history of tandem repeats. A tandem repeat is a stretch of DNA sequence that contains an element that is repeated multiple times and where the repeat occurrences are next to each other in the sequence. Since the repeats are subject to mutations, they are not identical. Therefore, tandem repeats occur through evolution by “copying” (duplication) of repeat elements in blocks of varying size. Bertrand and Gascuel address the problem of finding the most likely sequence of events giving rise to the observed set of repeats. Each sequence of events can be described by a duplication tree and one searches for the tree that is the most parsimonious, i.e., one that explains how the sequence has evolved from an ancestral single copy with a minimum number of mutations along the branches of the tree. The main difference with the standard phylogeny problem is that the linear ordering of the tandem duplications imposes constraints on the possible binary tree forms. This paper describes a local search method that allows exploration of the complete space of possible duplication trees and shows that the method is superior to other existing methods for reconstructing the tree and recovering its duplication events.

The third paper is “Optimizing Multiple Seeds for Homology Search” authored by Daniel G. Brown. The paper presents an approach to selecting starting points for pairwise local alignments of protein sequences. The problem of pairwise local alignment is to find a segment from each sequence so that the two local segments can be aligned to obtain a high score. For commonly used scoring schemes, this can be solved exactly using dynamic programming. However, pairwise alignment is frequently applied to large data sets and heuristic methods for restricting the alignments to be considered are frequently used, for instance, in the BLAST programs. The key is to restrict the number of alignments as much as possible, by choosing a few good seeds, without missing high scoring alignments. The paper shows that this can be formulated as an integer programming problem and presents an algorithm for choosing optimal seeds. Analysis is presented showing that the approach gives four times fewer false positives (unnecessary seeds) in comparison with BLASTP without losing more good hits.
Junhyong Kim
Inge Jonassen
Guest Editors

J. Kim is with the Department of Biology, University of Pennsylvania, 3451 Walnut Street, Philadelphia, PA 19104. E-mail: junhyong@sas.upenn.edu.
I. Jonassen is with the Department of Informatics and Computational Biology Unit, University of Bergen, HIB N5020 Bergen, Norway. E-mail: inge@ii.uib.no.
For information on obtaining reprints of this article, please send e-mail to: tcbb@computer.org.
1545-5963/05/$20.00 © 2005 IEEE Published by the IEEE CS, CI, and EMB Societies & the ACM

Junhyong Kim is the Edmund J. and Louise Kahn Term Endowed Professor in the Department of Biology at the University of Pennsylvania. He holds joint appointments in the Department of Computer and Information Science, Penn Center for Bioinformatics, and the Penn Genomics Institute. He serves on the editorial board of Molecular Development and Evolution and the IEEE/ACM Transactions on Computational Biology and Bioinformatics, the council of the Society for Systematic Biology, and the executive committee of the Cyber Infrastructure for Phylogenetics Research. His research focuses on computational and experimental approaches to comparative development. The current focus of his lab is in three areas: computational phylogenetics, in silico gene discovery, and comparative development using genome-wide gene expression data.

Inge Jonassen is a professor of computer science in the Department of Informatics at the University of Bergen in Norway, where he is a member of the bioinformatics group. He is also affiliated with the Bergen Center for Computational Science at the same university where he heads the Computational Biology Unit. He is also vice president of the Society for Bioinformatics in the Nordic Countries (SocBiN) and a member of the board of the Nordic Bioinformatics Network. He coordinates the technology platform for bioinformatics funded by the Norwegian Research Council functional genomics programme FUGE. He has worked in the field of bioinformatics since the early 1990s, where he has primarily focused on methods for discovery of patterns with applications to biological sequences and structures and on methods for the analysis of microarray gene expression data.

For more information on this or any other computing topic, please visit our Digital Library at www.computer.org/publications/dlib.

A New Distance for High Level RNA Secondary Structure Comparison
Julien Allali and Marie-France Sagot

Abstract—We describe an algorithm for comparing two RNA secondary structures coded in the form of trees that introduces two new
operations, called node fusion and edge fusion, besides the tree edit operations of deletion, insertion, and relabeling classically used in
the literature. This allows us to address some serious limitations of the more traditional tree edit operations when the trees represent
RNAs and what is searched for is a common structural core of two RNAs. Although the algorithm complexity has an exponential term, this
term depends only on the number of successive fusions that may be applied to the same node, not on the total number of fusions. The
algorithm remains therefore efficient in practice and is used for illustrative purposes on ribosomal as well as on other types of RNAs.

Index Terms—Tree comparison, edit operation, distance, RNA, secondary structure.

J. Allali is with the Institut Gaspard-Monge, Université de Marne-la-Vallée, Cité Descartes, Champs-sur-Marne, 77454, Marne-la-Vallée Cedex 2, France. E-mail: allali@univ-mlv.fr.
M.-F. Sagot is with Inria Rhône-Alpes, Université Claude Bernard, Lyon I, 43 Bd du Novembre 1918, 69622 Villeurbanne cedex, France. E-mail: Marie-France.Sagot@inria.fr.
Manuscript received 11 Oct. 2004; accepted 20 Dec. 2004; published online 30 Mar. 2005.
For information on obtaining reprints of this article, please send e-mail to: tcbb@computer.org, and reference IEEECS Log Number TCBB-0164-1004.

1 INTRODUCTION

RNAs are one of the fundamental elements of a cell. Their role in regulation has been recently shown to be far more prominent than initially believed (20 December 2002 issue of Science, which designated small RNAs with regulatory function as the scientific breakthrough of the year). It is now known, for instance, that there is massive transcription of noncoding RNAs. Yet current mathematical and computer tools remain mostly inadequate to identify, analyze, and compare RNAs.

An RNA may be seen as a string over the alphabet of nucleotides (also called bases), {A, C, G, U}. Inside a cell, RNAs do not retain a linear form, but instead fold in space. The fold is given by the set of nucleotide bases that pair. The main type of pairing, called canonical, corresponds to bonds of the type A-U and G-C. Other rarer types of bonds may be observed, the most frequent among them being G-U, also called the wobble pair. Fig. 1 shows the sequence of a folded RNA. Each box represents a consecutive sequence of bonded pairs, corresponding to a helix in 3D space. The secondary structure of an RNA is the set of helices (or the list of paired bases) making up the RNA. Pseudoknots, which may be described as a pair of interleaved helices, are in general excluded from the secondary structure of an RNA. RNA secondary structures can thus be represented as planar graphs. An RNA primary structure is its sequence of nucleotides while its tertiary structure corresponds to the geometric form the RNA adopts in space.

Fig. 1. Primary and secondary structures of a transfer RNA.

Apart from helices, the other main structural elements in an RNA are:

1. hairpin loops, which are sequences of unpaired bases closing a helix;
2. internal loops, which are sequences of unpaired bases linking two different helices;
3. bulges, which are internal loops with unpaired bases on one side only of a helix;
4. multiloops, which are unpaired bases linking at least three helices.

Stems are successions of one or more among helices, internal loops, and/or bulges.

The comparison of RNA secondary structures is one of the main basic computational problems raised by the study of RNAs. It is the problem we address in this paper. The motivations are many. RNA structure comparison has been used in at least one approach to RNA structure prediction that takes as initial data a set of unaligned sequences supposed to have a common structural core [1]. For each sequence, a set of structural predictions is made (for instance, all suboptimal structures predicted by an algorithm like Zuker’s MFOLD [15], or all suboptimal sets of compatible helices or stems). The common structure is then found by comparing all the structures obtained from the initial set of sequences, and identifying a substructure common to all, or to some of the sequences. RNA structure comparison is also an essential element in the discovery of RNA structural motifs, or profiles, or of more general models that may then be used to search for other RNAs of the same type in newly sequenced genomes. For instance, general models for tRNAs and introns of group I have been derived by hand [3], [10]. It is an open question whether models at least as accurate as these, or perhaps even more accurate, could have been derived in an automatic way. The identification of smaller structural motifs is an equally important topic that requires comparing structures.

As we saw, the comparison of RNA structures may concern known RNA structures (that is, structures that were experimentally determined) or predicted structures. The objective in both cases is the same: to find the common parts of such structures.
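The notions introduced in this section (canonical and wobble pairs, helices as runs of stacked pairs, and pseudoknots as interleaved helices) can be made concrete with a small sketch. It assumes a secondary structure is given as a list of 0-based index pairs over the sequence; the function names `pair_kind`, `has_pseudoknot`, and `helices`, as well as the toy sequence, are hypothetical illustrations and are not taken from the paper.

```python
# Illustrative sketch only: a secondary structure as a list of paired positions.
CANONICAL = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
WOBBLE = {("G", "U"), ("U", "G")}


def pair_kind(seq, i, j):
    """Classify the bond between positions i and j of seq."""
    pair = (seq[i], seq[j])
    if pair in CANONICAL:
        return "canonical"
    if pair in WOBBLE:
        return "wobble"
    return "other"


def has_pseudoknot(pairs):
    """Return True if two base pairs interleave, i.e. i < k < j < l."""
    ordered = sorted((min(p), max(p)) for p in pairs)
    for n, (i, j) in enumerate(ordered):
        for k, l in ordered[n + 1:]:
            if i < k < j < l:
                return True
    return False


def helices(pairs):
    """Group base pairs into maximal stacks (i, j), (i+1, j-1), ..."""
    stacks = []
    for i, j in sorted((min(p), max(p)) for p in pairs):
        if stacks and stacks[-1][-1] == (i - 1, j + 1):
            stacks[-1].append((i, j))      # extends the current stack
        else:
            stacks.append([(i, j)])        # starts a new helix
    return stacks


if __name__ == "__main__":
    seq = "GGGAAAUCC"                      # hypothetical toy hairpin
    pairs = [(0, 8), (1, 7), (2, 6)]
    print([pair_kind(seq, i, j) for i, j in pairs])
    print(has_pseudoknot(pairs), len(helices(pairs)))
```

A structure for which `has_pseudoknot` returns `False` is nested, which is the property that allows it to be coded as a tree.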
In [11], Shapiro suggested to mathematically model RNA secondary structures without pseudoknots by means of trees. The trees are rooted and ordered, which means that the order among the children of a node matters. This order corresponds to the 5’-3’ orientation of an RNA sequence.

Given two trees representing each an RNA, there are two main ways for comparing them. One is based on the computation of the edit distance between the two trees while the other consists in aligning the trees and using the score of the alignment as a measure of the distance between the trees. Contrary to what happens with sequences, the two, alignment and edit distance, are not equivalent. The alignment distance is a restrained form of the edit distance between two trees, where all insertions must be performed before any deletions. The alignment distance for general trees was defined in 1994 by Jiang et al. in [9] and extended to an alignment distance between forests in [6]. More recently, Höchsmann et al. [7] applied the tree alignment distance to the comparison of two RNA secondary structures. Because of the restriction on the way edit operations can be applied in an alignment, we are not concerned in this paper with tree alignment distance and we therefore address exclusively from now on the problem of tree edit distance.

Our way for comparing two RNA secondary structures is then to apply a number of tree edit operations in one or both of the trees representing the RNAs until isomorphic trees are obtained. The currently most popular program using this approach is probably the Vienna package [5], [4]. The tree edit operations considered are derived from the operations classically applied to sequences [13]: substitution, deletion, and insertion. In 1989, Zhang and Shasha [14] gave a dynamic programming algorithm for comparing two trees. Shapiro and Zhang then showed [12] how to use tree editing to compare RNAs. The latter also proposed various tree models that could be used for representing RNA secondary structures. Each suggested tree offers a more or less detailed view of an RNA structure. Figs. 2b, 2c, 2d, and 2e present a few examples of such possible views for the RNA given in Fig. 2a. In Fig. 2, the nodes of the tree in Fig. 2b represent either unpaired bases (leaves) or paired bases (internal nodes). Each node is labeled with, respectively, a base or a pair of bases. A node of the tree in Fig. 2c represents a set of successive unpaired bases or of stacked paired ones. The label of a node is an integer indicating, respectively, the number of unpaired bases or the height of the stack of paired ones. The nodes of the tree in Fig. 2d represent elements of secondary structure: hairpin loop (H), bulge (B), internal loop (I), or multiloop (M). The edges correspond to helices. Finally, the tree in Fig. 2e contains only the information concerning the skeleton of multiloops of an RNA. The last representation, though giving a highly simplified view of an RNA, is important nevertheless as it is generally accepted that it is this skeleton which is usually the most constrained part of an RNA. The last two models may be enriched with information concerning, for instance, the number of (unpaired) bases in a loop (hairpin, internal, multi) or bulge, and the number of paired bases in a helix. The first label the nodes of the tree, the second its edges. Other types of information may be added (such as overall composition of the elements of secondary structure). In fact, one could consider working with various representations simultaneously or in an interlocked, multilevel fashion. This goes beyond the scope of this paper which is concerned with comparing RNA secondary structures using any one among the many tree representations possible. We shall, however, comment further on this multilevel approach later on.

Fig. 2. Example of different tree representations ((b), (c), (d), and (e)) of the same RNA (a).

Concerning the objectives of this paper, they are twofold. The first is to give some indications on why the classical edit operations that have been considered so far in the literature for comparing trees present some limitations when the trees stand for RNA structures. Three cases of such limitations will be illustrated through examples in Section 3. In Section 4, we then introduce two novel operations, so-called node-fusion and edge-fusion, that enable us to address some of these limitations and then give a dynamic programming algorithm for comparing two RNA structures with these two additional operations. Implementation issues and initial results are presented in Section 4. In Section 5, we give a first application
of our algorithm to the comparison of two RNA secondary structures. Finally, in Section 6, we sketch the main ideas behind the multilevel RNA comparison approach mentioned above. Before that, we start by introducing some notation and by recalling in the next section the basics about classical tree edit operations and tree mapping.

This paper is an extended version of a paper presented at the Workshop on Algorithms in BioInformatics (WABI) in 2004, in Bergen, Norway. A few more examples are given to illustrate some of the points made in the WABI paper, complexity and implementation issues are discussed in more depth as are the cost functions and a multilevel approach to comparing RNAs.

2 TREE EDITING AND MAPPING

Let $T$ be an ordered rooted tree, that is, a tree where the order among the children of a node matters. We define three kinds of operations on $T$: deletion, insertion, and relabeling (corresponding to a substitution in sequence comparison). The operations are shown in Fig. 3. The deletion (Fig. 3b) of a node $u$ removes $u$ from the tree. The children of $u$ become the children of $u$’s father. An insertion (Fig. 3c) is the symmetric of a deletion. Given a node $u$, we remove a consecutive (in relation to the order among the children) set $u_1, \ldots, u_p$ of its children, create a new node $v$, make $v$ a child of $u$ by attaching it at the place where the set was, and, finally, make the set $u_1, \ldots, u_p$ (in the same order) the children of $v$. The relabeling of a node (Fig. 3d) consists simply in changing its label.

Fig. 3. Edit operations: (a) the original tree $T$, (b) deletion of the node labeled D, (c) insertion of the node labeled I, and (d) relabeling of a node in $T$ (the label A of the root is changed into K).

Given two trees $T$ and $T'$, we define $S = \{s_1 \ldots s_e\}$ to be a series of edit operations such that, if we apply successively the operations in $S$ to the tree $T$, we obtain $T'$ (i.e., $T$ and $T'$ become isomorphic). A series of operations like $S$ realizes the editing of $T$ into $T'$ and is denoted by $T \xrightarrow{S} T'$.

We define a function cost from the set of possible edit operations (deletion, insertion, relabeling) to the integers (or the reals) such that $\mathrm{cost}_s$ is the score of the edit operation $s$. If $S$ is a series of edit operations, we define by extension that $\mathrm{cost}_S$ is $\sum_{s \in S} \mathrm{cost}_s$. We can define the edit distance between two trees as the minimal cost of a series of operations that performs the editing of $T$ into $T'$:

$\mathrm{distance}(T, T') = \min\{\mathrm{cost}_S \mid T \xrightarrow{S} T'\}$.

Let an insertion or a deletion cost one and the relabeling of a node cost zero if the label is the same and one otherwise. For the two trees of the figure on the left, the series relabel(A → F).delete(B).insert(G) realizes the editing of the left tree into the right one and costs 3. Another possibility is the series delete(B).relabel(A → G).insert(F) which also costs 3. The distance between these two trees is 3.

Given a series of operations $S$, let us consider the nodes of $T$ that are not deleted (in the initial tree or after some relabeling). Such nodes are associated with nodes of $T'$. The mapping $M_S$ relative to $S$ is the set of couples $(u, u')$ with $u \in T$ and $u' \in T'$ such that $u$ is associated with $u'$ by $S$.

The operations described above are the “classical tree edit operations” that have been commonly used in the literature for RNA secondary structure comparison. We now present a few results obtained using such classical operations that will allow us to illustrate a few limitations they may present when used for comparing RNA structures.

3 LIMITATIONS OF CLASSICAL TREE EDIT OPERATIONS FOR RNA COMPARISON

As suggested in [12], the tree edit operations recalled in the previous section can be used on any type of tree coding of an RNA secondary structure.

Fig. 4 shows two RNAsePs extracted from the database [2] (they are found, respectively, in Streptococcus gordonii and Thermotoga maritima). For the example we discuss now, we code the RNAs using the tree representation indicated in Fig. 2b where a node represents a base pair and a leaf an unpaired base. After applying a few edit operations to the trees, we obtain the result indicated in Fig. 4, with deleted/inserted bases in gray. We have surrounded a few regions that match in the two trees. Bases in the rectangular box at the bottom of the RNA on the left are thus associated with bases in the bottom rightmost rectangular box of the RNA on the right. The same is observed for the bases in the oval boxes for both RNAs. Such matches illustrate one of the main problems with the classical tree edit operations: Bases in one RNA may be mapped to identically labeled bases in the other RNA to minimise the total cost, while such bases should not be associated in terms of the elements of secondary structure to which they belong. In fact, such elements are often distant from one another along the common RNA structure. We call this problem the “scattering effect.” It is related to the definition of tree edit operations. In the case of this example and of the representation adopted, the problem might have been avoided if structural information had been used. Indeed, the problem appears also because the structural
location of an unpaired base is not taken into account. It is therefore possible to match, for instance, an unpaired base from a hairpin loop with an unpaired base from a multiloop. Using another type of representation, as we shall do, would, however, not be enough to solve all problems as we see next.

Fig. 4. Illustration of the scattering effect problem. Result of the matching of two RNAsePs, of Streptococcus gordonii and of Thermotoga maritima, using the model given in Fig. 2b.

Indeed, to compare the same two RNAs, we can also use a more abstract tree representation such as the one given in Fig. 2d. In this case, the internal nodes represent a multiloop, internal-loop, or bulge, the leaves code for hairpin loops and edges for helices. The result of the editing of $T$ into $T'$ for some cost function is presented in Fig. 5 (we shall come back later to the cost functions used in the case of such more abstract RNA representations; for the sake of this example, we may assume an arbitrary one is used).

Fig. 5. Illustration of the one-to-one association problem with edges. Result of the matching of the two RNAsePs, of Saccharomyces uvarum and of Saccharomyces kluveri, using the model given in Fig. 2d.

The problem we wish to illustrate in this case is shown by the boxes in the figure. Consider the boxes at the bottom. In the left RNA, we have a helix made up of 13 base pairs. In the right RNA, the helix is formed by seven base pairs followed by an internal loop and another helix of size 5. By definition (see Section 2), the algorithm can only associate one element in the first tree to one element in the second tree. In this case, we would like to associate the helix of the left tree to the two helices of the second tree since it seems clear that the internal loop represents either an inserted element in the second RNA, or the unbonding of one base pair. This, however, is not possible with classical edit operations.

A third type of problem one can meet when using only the three classical edit operations to compare trees standing for RNAs is similar to the previous one, but concerns this time a node instead of edges in the same tree representation. Often, an RNA may present a very small helix between two elements (multiloop, internal-loop, bulge, or hairpin-loop) while such helix is absent in the other RNA. In this case, we would therefore have liked to be able to associate one node in a tree representing an RNA with two or more
nodes in the tree for the other RNA. Once again, this is not possible with any of the classical tree edit operations. An illustration of this problem is shown in Fig. 6.

Fig. 6. Illustration of the one-to-one association problem with nodes. The two RNAs used here are RNAsePs from Pyrococcus furiosus and Metallosphaera sedula. Triangles stand for bulges, diamonds stand for internal loops, and squares for hairpin loops.

We shall use RNA representations that take the elements of the structure of an RNA into account to avoid some of the scattering effect. Furthermore, in addition to considering information of a structural nature, labels are attached, in general, to both nodes and edges of the tree representing an RNA. Such labels are numerical values (integers or reals). They represent in most cases the size of the corresponding element, but may also further indicate its composition, etc. Such additional information is then incorporated into the cost functions for all three edit operations. It is important to observe that when dealing with trees labeled at both the nodes and edges, any node and the edge that leads to it (or, in an alternative perspective, departs from it) represent a single object from the point of view of computing an edit distance between the trees.

It remains now to deal with the last two problems that are a consequence of the one-to-one associations between nodes and edges enforced by the classical tree edit operations. To that purpose, we introduce two novel tree edit operations, called the edge fusion and the node fusion.

4 INTRODUCING NOVEL TREE EDIT OPERATIONS

4.1 Edge Fusion and Node Fusion

In order to address some of the limitations of the classical tree edit operations that were illustrated in the previous section, we need to introduce two novel operations. These are the edge fusion and the node fusion. They may be applied to any of the tree representations given in Figs. 2c, 2d, and 2e.

An example of edge fusion is shown in Fig. 7a. Let $e_u$ be an edge leading to a node $u$, $c_i$ a child of $u$ and $e_{c_i}$ the edge between $u$ and $c_i$. The edge fusion of $e_u$ and $e_{c_i}$ consists in replacing $e_{c_i}$ and $e_u$ with a new single edge $e$. The edge $e$ links the father of $u$ to $c_i$. Its label then becomes a function of the (numerical) labels of $e_u$, $u$ and $e_{c_i}$. For instance, if such labels indicated the size of each element (e.g., for a helix, the number of its stacked pairs, and for a loop, the min, max or the average of its unpaired bases on each side of the loop), the label of $e$ could be the sum of the sizes of $e_u$, $u$ and $e_{c_i}$. Observe that merging two edges implies deleting all subtrees rooted at the children $c_j$ of $u$ for $j$ different from $i$. The cost of such deletions is added to the cost of the edge fusion.

An example of node fusion is given in Fig. 7b. Let $u$ be a node and $c_i$ one of its children. Performing a node fusion of $u$ and $c_i$ consists in making $u$ the father of all children of $c_i$ and in relabeling $u$ with a value that is a function of the values of the labels of $u$, $c_i$ and of the edge between them. Observe that a node fusion may be simulated using the classical edit operations by a deletion followed by a relabeling. However, the difference between a node fusion and a deletion/relabeling is in the cost associated with both operations. We shall come back to this point later.

Fig. 7. (a) An example of edge fusion. (b) An example of node fusion.

Like insertions or deletions, edge fusions and node fusions have symmetric counterparts, which are the edge split and the node split.

Given two rooted, ordered, and labeled trees $T$ and $T'$, we define the “edit distance with fusion” between $T$ and $T'$
8 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 2, NO. 1, JANUARY-MARCH 2005

Fig. 8. Zhang and Sasha’s dynamic programming algorithm: the tree distance part. The right box corresponds to the additional operations added to
take fusion into account.

S
as distancefusion ðT ; T 0 Þ ¼ fminðcostS ÞjT ! T 0 g with costs the index of the leftmost child of the subtree rooted at ti . Let
cost associated to each of the seven edit operations now T ði . . . jÞ denote the forest composed by the nodes ti . . . tj
considered (relabeling, insertion, deletion, node fusion and split, edge fusion and split).

Proposition 1. If the following is verified:
• cost_match(a, b) is a distance,
• cost_ins(a) = cost_del(a) ≥ 0,
• cost_nodefusion(a, b, c) = cost_nodesplit(a, b, c) ≥ 0, and
• cost_edgefusion(a, b, c) = cost_edgesplit(a, b, c) ≥ 0,
then distance_fusion is indeed a distance.

Proof. The positiveness of distance_fusion is given by the fact that all elementary cost functions are positive. Its symmetry is guaranteed by the symmetry in the costs of the insertion/deletion and (node/edge) fusion/split operations. Finally, it is straightforward to see that distance_fusion satisfies the triangular inequality. □

Besides the above properties that must be satisfied by the cost functions in order to obtain a distance, others may be introduced for specific purposes. Some will be discussed in Section 5.

We now present an algorithm to compute the tree edit distance between two trees using the classical tree edit operations plus the two operations just introduced.

4.2 Algorithm

The method we introduce is a dynamic programming algorithm based on the one proposed by Zhang and Shasha. Their algorithm is divided in two parts: They first compute the edit distance between two trees (this part is denoted by TDist) and then the distance between two forests (this part is denoted by FDist). Fig. 8 illustrates in pictorial form the TDist part and Fig. 9 the FDist part of the computation.

In order to take our two new operations into account, we need to compute a few more things in the TDist part. Indeed, we must add the possibility for each tree to have a node fusion (inversely, node split) between the root and one of its children, or to have an edge fusion (inversely, edge split) between the root and one of its children. These additional operations are indicated in the right box of Fig. 8.

We now present a formal description of the algorithm. Let T be an ordered rooted tree with |T| nodes. We denote by t_i the ith node in a postfix order. For each node t_i, l(i) is the

(T ≡ T(0 … |T|)). To simplify notation, from now on, when there is no ambiguity, i will refer to the node t_i. In this case, distance(i_1 … i_2, j_1 … j_2) will be equivalent to distance(T(i_1 … i_2), T'(j_1 … j_2)).

The algorithm of Zhang and Shasha is fully described by the following recurrence formula:

if ((i_1 == l(i_2)) and (j_1 == l(j_2)))

\[
\min
\begin{cases}
\mathrm{distance}(i_1 \ldots i_2 - 1,\; j_1 \ldots j_2) + \mathrm{cost}_{del}(i_2) \\
\mathrm{distance}(i_1 \ldots i_2,\; j_1 \ldots j_2 - 1) + \mathrm{cost}_{ins}(j_2) \\
\mathrm{distance}(i_1 \ldots i_2 - 1,\; j_1 \ldots j_2 - 1) + \mathrm{cost}_{match}(i_2, j_2)
\end{cases}
\tag{1}
\]

else

\[
\min
\begin{cases}
\mathrm{distance}(i_1 \ldots i_2 - 1,\; j_1 \ldots j_2) + \mathrm{cost}_{del}(i_2) \\
\mathrm{distance}(i_1 \ldots i_2,\; j_1 \ldots j_2 - 1) + \mathrm{cost}_{ins}(j_2) \\
\mathrm{distance}(i_1 \ldots l(i_2) - 1,\; j_1 \ldots l(j_2) - 1) + \mathrm{distance}(l(i_2) \ldots i_2,\; l(j_2) \ldots j_2)
\end{cases}
\tag{2}
\]

Part (1) of the formula corresponds to Fig. 8, while part (2) corresponds to Fig. 9. In practice, the algorithm stores in a matrix the score between each subtree of T and T'. The space complexity is therefore O(|T| × |T'|). To reach this complexity, the computation must be done in a certain order (see Section 4.3). The time complexity of the algorithm is

\[
O\bigl(|T| \times \min(\mathrm{leaf}(T), \mathrm{height}(T)) \times |T'| \times \min(\mathrm{leaf}(T'), \mathrm{height}(T'))\bigr),
\]

where leaf(T) and height(T) represent, respectively, the number of leaves and the height of a tree T.
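As an illustration of recurrences (1) and (2) only (without the fusion and split operations), the following is a minimal Python sketch. It is not the authors' C++ implementation: the tree encoding `(label, [children])`, the helper names `postorder` and `tree_edit_distance`, and the default unit costs are assumptions made for this example. The recursion is memoised on forest boundaries rather than evaluated in the space-efficient order discussed in Section 4.3, so it is meant for small trees.

```python
from functools import lru_cache


def postorder(tree):
    """Number the nodes of `tree` (a (label, [children]) pair) in postfix order.

    Returns (labels, l), both 1-based (index 0 is unused): labels[i] is the
    label of node i and l[i] is the index of the leftmost leaf of the
    subtree rooted at node i.
    """
    labels, l = [None], [0]

    def visit(node):
        label, children = node
        first = None
        for child in children:
            leftmost = visit(child)
            if first is None:
                first = leftmost
        labels.append(label)
        index = len(labels) - 1
        l.append(first if first is not None else index)
        return l[index]

    visit(tree)
    return labels, l


def tree_edit_distance(t1, t2,
                       cost_del=lambda a: 1,
                       cost_ins=lambda b: 1,
                       cost_match=lambda a, b: 0 if a == b else 1):
    """Classical tree edit distance following recurrences (1) and (2).

    The unit costs used by default are an assumption of this sketch.
    """
    lab1, l1 = postorder(t1)
    lab2, l2 = postorder(t2)

    @lru_cache(maxsize=None)
    def distance(i1, i2, j1, j2):
        # A forest i1..i2 (resp. j1..j2) is empty when i1 > i2 (resp. j1 > j2).
        if i1 > i2 and j1 > j2:
            return 0
        if i1 > i2:
            return distance(i1, i2, j1, j2 - 1) + cost_ins(lab2[j2])
        if j1 > j2:
            return distance(i1, i2 - 1, j1, j2) + cost_del(lab1[i2])
        delete = distance(i1, i2 - 1, j1, j2) + cost_del(lab1[i2])
        insert = distance(i1, i2, j1, j2 - 1) + cost_ins(lab2[j2])
        if i1 == l1[i2] and j1 == l2[j2]:
            # Part (1): both forests are trees, so the two roots are matched.
            match = (distance(i1, i2 - 1, j1, j2 - 1)
                     + cost_match(lab1[i2], lab2[j2]))
        else:
            # Part (2): split each forest into its last tree and the rest.
            match = (distance(i1, l1[i2] - 1, j1, l2[j2] - 1)
                     + distance(l1[i2], i2, l2[j2], j2))
        return min(delete, insert, match)

    return distance(1, len(lab1) - 1, 1, len(lab2) - 1)


if __name__ == "__main__":
    t1 = ("a", [("b", []), ("c", [])])
    t2 = ("a", [("b", [])])
    print(tree_edit_distance(t1, t2))  # 1: deleting the leaf c is enough
```

Memoising on the four boundaries keeps the sketch close to the formula; the matrix-based evaluation order that yields the stated space complexity is a separate design choice.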
ALLALI AND SAGOT: A NEW DISTANCE FOR HIGH LEVEL RNA SECONDARY STRUCTURE COMPARISON 9

Fig. 9. Zhang and Shasha’s dynamic programming algorithm: the forest distance part.

The formula to compute the edit score allowing for both node and edge fusions follows.

if ((i_1 ≥ l(i_k)) and (j_1 ≥ l(j_k')))

\[
\min
\begin{cases}
\mathrm{distance}(\{i_1 \ldots i_{k-1}\}, \emptyset, \{j_1 \ldots j_{k'}\}, path') + \mathrm{cost}_{del}(i_k) \\[2pt]
\mathrm{distance}(\{i_1 \ldots i_k\}, path, \{j_1 \ldots j_{k'-1}\}, \emptyset) + \mathrm{cost}_{ins}(j_{k'}) \\[2pt]
\mathrm{distance}(\{i_1 \ldots i_{k-1}\}, \emptyset, \{j_1 \ldots j_{k'-1}\}, \emptyset) + \mathrm{cost}_{match}(i_k, j_{k'}) \\[4pt]
\text{for each child } i_c \text{ of } i_k \text{ in } \{i_1, \ldots, i_k\}, \text{ set } i_l = l(i_c): \\[2pt]
\quad \mathrm{distance}(\{i_1 \ldots i_{c-1}, i_{c+1} \ldots i_k\}, path.(u, i_c), \{j_1 \ldots j_{k'}\}, path') \\
\qquad + \mathrm{cost}_{node\,fusion}(i_c, i_k) \quad (\text{obs.: } i_k \text{ data are changed}) \\[2pt]
\quad \mathrm{distance}(\{i_l \ldots i_{c-1}, i_k\}, path.(e, i_c), \{j_1 \ldots j_{k'}\}, path') \\
\qquad + \mathrm{cost}_{edge\,fusion}(i_c, i_k) + \mathrm{distance}(\{i_1 \ldots i_{l-1}\}, \emptyset, \emptyset, \emptyset) \\
\qquad + \mathrm{distance}(\{i_{c+1} \ldots i_{k-1}\}, \emptyset, \emptyset, \emptyset) \quad (\text{obs.: } i_k \text{ data are changed}) \\[4pt]
\text{for each child } j_{c'} \text{ of } j_{k'} \text{ in } \{j_1, \ldots, j_{k'}\}, \text{ set } j_{l'} = l(j_{c'}): \\[2pt]
\quad \mathrm{distance}(\{i_1 \ldots i_k\}, path, \{j_1 \ldots j_{c'-1}, j_{c'+1} \ldots j_{k'}\}, path'.(u, j_{c'})) \\
\qquad + \mathrm{cost}_{node\,split}(j_{c'}, j_{k'}) \quad (\text{obs.: } j_{k'} \text{ data are changed}) \\[2pt]
\quad \mathrm{distance}(\{i_1 \ldots i_k\}, path, \{j_{l'} \ldots j_{c'-1}, j_{k'}\}, path'.(e, j_{c'})) \\
\qquad + \mathrm{cost}_{edge\,split}(j_{c'}, j_{k'}) + \mathrm{distance}(\emptyset, \emptyset, \{j_1 \ldots j_{l'-1}\}, \emptyset) \\
\qquad + \mathrm{distance}(\emptyset, \emptyset, \{j_{c'+1} \ldots j_{k'-1}\}, \emptyset) \quad (\text{obs.: } j_{k'} \text{ data are changed})
\end{cases}
\tag{3}
\]

else set i_l = l(i_k) and j_l' = l(j_k')

\[
\min
\begin{cases}
\mathrm{distance}(\{i_1 \ldots i_{k-1}\}, \emptyset, \{j_1 \ldots j_{k'}\}, path') + \mathrm{cost}_{del}(i_k) \\
\mathrm{distance}(\{i_1 \ldots i_k\}, path, \{j_1 \ldots j_{k'-1}\}, \emptyset) + \mathrm{cost}_{ins}(j_{k'}) \\
\mathrm{distance}(\{i_1 \ldots i_{l-1}\}, \emptyset, \{j_1 \ldots j_{l'-1}\}, \emptyset) + \mathrm{distance}(\{i_l \ldots i_k\}, path, \{j_{l'} \ldots j_{k'}\}, path')
\end{cases}
\tag{4}
\]

Given two nodes u and v such that v is a child of u, node_fusion(u, v) is the fusion of node v with u, and edge_fusion(u, v) is the edge fusion between the edges leading to, respectively, nodes u and v. The symmetric operations are denoted by, respectively, node_split(u, v) and edge_split(u, v).

The distance computation takes two new parameters path and path'. These are sets of pairs (e or u, v) which indicate, for node i_k (respectively, j_k'), the series of fusions that were done. Thus, a pair (e, v) indicates that an edge fusion has been performed between i_k and v, while for (u, v) a node v has been merged with node i_k.

The notation path.(e, v) indicates that the operation (e, v) has been performed in relation to node i_k and the information is thus concatenated to the set path of pairs currently linked with i_k.

4.3 Implementation and Complexity

The previous section gave the recurrence formulæ for calculating the edit distance between two trees allowing for node and edge fusion and split. We now discuss the complexity of the algorithm. This requires paying attention to some high-level implementation details that, in the case of the tree edit distance problem, may have an important influence on the theoretical complexity of the algorithm. Such details were first observed by Zhang and Shasha. They concern the order in which to perform the operations indicated in (1) and (2) to obtain an algorithm that is time and space efficient.

Let us consider the last line of (2). We may observe that the computation of the distance between two forests refers to the computation of the distance between two trees T(l(i_2) … i_2) and T'(l(j_2) … j_2). We must therefore memorise the distance between any two subtrees of T and T'. Furthermore, we have to carry out the computation from the leaves to the root because when we compute the distance between two subtrees U and U', the distance between any subtrees of U and U' must already have been measured. This explains the space complexity which is in O(|T| × |T'|) and corresponds to the size of the table used for storing such distances in memory.

If we look at (1) now, we see that it is not necessary to calculate separately the distance between the subtrees rooted at i' and j' if i' is on the path from l(i) to i and j' is on the path from l(j) to j, for i and j nodes of, respectively, T and T'.

We define a set LR(T) of the left roots of T as follows:

\[
LR(T) = \{\, k \mid 1 \le k \le |T| \text{ and } \nexists\, k' > k \text{ such that } l(k') = l(k) \,\}
\]
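The definition of LR(T) can be made concrete with a short Python sketch. The function name `left_roots` and the input convention (a 1-based list `l` of leftmost-leaf indices in postfix order, with `l[0]` unused) are assumptions of this example. For every distinct value of l, the node with the largest index carrying that value is kept.

```python
def left_roots(l):
    """Return LR(T) as a sorted list of node indices.

    `l` is 1-based (l[0] is unused): l[k] is the index of the leftmost leaf
    of the subtree rooted at node k, nodes being numbered in postfix order.
    A node k is a left root when no node k' > k satisfies l[k'] == l[k].
    """
    last_with_value = {}
    for k in range(1, len(l)):
        # Later nodes overwrite earlier ones, so only the largest k survives.
        last_with_value[l[k]] = k
    return sorted(last_with_value.values())


if __name__ == "__main__":
    # Hypothetical tree f(d(a, c(b)), e), numbered in postfix order:
    # a=1, b=2, c=3, d=4, e=5, f=6.
    l = [0, 1, 2, 2, 1, 5, 1]
    print(left_roots(l))  # [3, 5, 6]
```

Each leaf value of l selects exactly one left root, so LR(T) has as many elements as T has leaves. Processing the returned list in increasing order visits the subtrees from the leaves to the root.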

The algorithm for computing the edit distance between T and T' then consists in computing the distance between each subtree rooted at a node in LR(T) and each subtree rooted at a node in LR(T'). Such subtrees are considered from the leaves to the root of T and T', that is, in the order of their indexes.

Zhang and Shasha proved that this algorithm has a time complexity in O(|T| × min(leaf(T), height(T)) × |T'| × min(leaf(T'), height(T'))), leaf(T) designating the number of leaves of T and height(T) its height. In the worst case (fan tree), the complexity is in O(|T|^2 × |T'|^2).

Taking fusion and split operations into account does not change the above reasoning. However, we must now store in memory the distance between all subtrees T(l(i_2) … i_2) and T'(l(j_2) … j_2), and all the possible values of path and path'.

We must therefore determine the number of values that path can take. This amounts to determining the total number of successive fusions that could be applied to a given node. We recall that path is a list of pairs (e or u, v). Let path = {(e or u, v_1), (e or u, v_2), …, (e or u, v_ℓ)} be the list for node i of T. The first fusion can be performed only with a child v_1 of i. If d is the maximum degree of T, there are d possible choices for v_1. The second fusion can be done with one of the children of i or with one of its grandchildren. Let v_2 be the node chosen. There are d + d^2 possible choices for v_2. Following the same reasoning, there are \(\sum_{k=1}^{\ell} d^k\) possible choices for the ℓth node v_ℓ to be fused with i.

Furthermore, we must take into account the fact that a fusion can concern a node or an edge. The total number of possible values for the variable path is therefore:

\[
2^{\ell} \times \prod_{k=1}^{\ell} \sum_{j=1}^{k} d^{j} \;<\; 2^{\ell} \times \prod_{k=1}^{\ell} \frac{d^{k+1} - 1}{d - 1},
\]

that is:

\[
2^{\ell} \times \left(\frac{1}{d-1}\right)^{\ell} \times \prod_{k=1}^{\ell} \left(d^{k+1} - 1\right) \;<\; 2^{\ell} \times \left(\frac{1}{d-1}\right)^{\ell} \times d^{\frac{(\ell+1)(\ell+2)}{2}}.
\]

A node i may then be involved in O((2d)^ℓ) possible successive (node/edge) fusions.

As indicated, we must store in memory the distance between each subtree T(l(i_2) … i_2) and T'(l(j_2) … j_2) for all possible values of path and path'. The space complexity of our algorithm is thus in O((2d)^ℓ × (2d')^ℓ × |T| × |T'|), with d and d' the maximum degrees of, respectively, T and T'.

The computation of the time complexity of our algorithm is done in a similar way as for the algorithm of Zhang and Shasha. For each node of T and T', one must compute the number of subtree distance computations the node will be involved in by considering all subtrees rooted in, respectively, a node of LR(T) and a node of LR(T'). In our case, one must also take into account for each node the possibility of applying a fusion. This leads to a time complexity in

\[
O\bigl((2d)^{\ell} \times |T| \times \min(\mathrm{leaf}(T), \mathrm{height}(T)) \times (2d')^{\ell} \times |T'| \times \min(\mathrm{leaf}(T'), \mathrm{height}(T'))\bigr).
\]

This complexity suggests that the fusion operations may be used only for reasonable trees (typically, less than 100 nodes) and small values of ℓ (typically, less than 4). It is however important to observe that the overall number of fusions one may perform can be much greater than ℓ without affecting the worst-case complexity of the algorithm. Indeed, any number of fusions can be made while still retaining the bound of

\[
O\bigl((2d)^{\ell} \times |T| \times \min(\mathrm{leaf}(T), \mathrm{height}(T)) \times (2d')^{\ell} \times |T'| \times \min(\mathrm{leaf}(T'), \mathrm{height}(T'))\bigr)
\]

so long as one does not realize more than ℓ consecutive fusions for each node.

In general, also, most interesting tree representations of an RNA are of small enough size as will be shown next, together with some initial results obtained in practice.

5 APPLICATION TO RNA SECONDARY STRUCTURES COMPARISON

The algorithm presented in the previous section has been coded using C++. An online version is available at http://www-igm.univ-mlv.fr/~allali/migal/.

We recall that RNAs are relatively small molecules with sizes limited to a few kilobases. For instance, the small ribosomal subunit of Sulfolobus acidocaldarius (D14876) is made up of 1,147 bases. Using the representation shown in Fig. 2b, the tree obtained contains 440 internal nodes and 567 leaves, that is 1,007 nodes overall. Using the representation in Fig. 2d, the tree is composed of 78 nodes. Finally, the tree obtained using the representation given in Fig. 2e contains only 48 nodes. We therefore see that even for large RNAs, any of the known abstract tree-representations (that is, representations which take the elements of the secondary structure of an RNA into account) that we can use leads to a tree of manageable size for our algorithm. In fact, for small values of ℓ (2 or 3), the tree comparison takes reasonable time (a few minutes) and memory (less than 1 Gb).

As we already mentioned, a fusion (respectively, split) can be viewed as an alternative to a deletion (respectively, insertion) followed by a relabeling. Therefore, the cost function for a fusion must be chosen carefully.
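As a numerical check of the count of values of path discussed in Section 4.3, the short Python sketch below evaluates the product 2^ℓ × Π_{k=1..ℓ} Σ_{j=1..k} d^j and compares it with the closed-form upper bound 2^ℓ × (1/(d−1))^ℓ × d^((ℓ+1)(ℓ+2)/2). The function names are assumptions of this example, and the bound is only defined for d ≥ 2.

```python
from fractions import Fraction


def path_value_count(d, ell):
    """Number of values of `path` for maximum degree d and at most ell
    successive fusions: 2^ell * prod_{k=1..ell} sum_{j=1..k} d^j."""
    total = 2 ** ell
    for k in range(1, ell + 1):
        total *= sum(d ** j for j in range(1, k + 1))
    return total


def path_value_bound(d, ell):
    """Upper bound 2^ell * (1/(d-1))^ell * d^((ell+1)(ell+2)/2), for d >= 2."""
    if d < 2:
        raise ValueError("the bound requires d >= 2")
    exponent = (ell + 1) * (ell + 2) // 2  # (ell+1)(ell+2) is always even
    return Fraction(2 ** ell * d ** exponent, (d - 1) ** ell)


if __name__ == "__main__":
    for d, ell in [(2, 1), (3, 2), (4, 3)]:
        count = path_value_count(d, ell)
        bound = path_value_bound(d, ell)
        print(d, ell, count, bound, count < bound)
```

For d = 3 and ℓ = 2 the count is 4 × 3 × (3 + 9) = 144, well below the bound 4 × 3^6 / 2^2 = 729. Exact rational arithmetic is used so that the comparison is not affected by rounding.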

Fig. 10. Illustration of the gain that must be obtained using a fusion instead of a deletion/relabeling.

To simplify, we reason on the cost of a node fusion without considering the label of the edges leading to the nodes that are fused with a father. The formal definition of the cost functions takes the edges also into account.

Let us assume that the cost function returns a real value between zero and one. If we want to compute the cost of a fusion between two nodes u and v, the aim is to give to such fusion a cost slightly greater than the cost of deleting v and relabeling u; that is, we wish to have cost_node_fusion(u, v) = min(cost_del(v) + t, 1). The parameter t is a tuning parameter for the fusion.

Suppose that the new node w resulting from the fusion of u and v matches with another node z. The cost of this match is cost_match(w, z). If we do not allow for node fusions, the algorithm will first match u with z, then will delete v. If we compare the two possibilities, on one hand we have a total cost of cost_node_fusion(u, v) + cost_match(w, z) for the fusion, that is, cost_del(v) + t + cost_match(w, z), on the other hand, a cost of cost_del(v) + cost_match(u, z). Thus, t represents the gain that must be obtained by cost_match(w, z) with regard to cost_match(u, z), that is, by a match without fusion. This is illustrated in Fig. 10.

In this example, the cost associated with the path on the top is cost_match(5, 9) + cost_del(3). The path at the bottom has a cost of cost_node_fusion(5, 3) = cost_del(3) + t for the node fusion to which is added a relabeling cost of cost_match(8, 9), leading to a total of cost_match(8, 9) + cost_del(3) + t. A node fusion will therefore be chosen if cost_match(8, 9) + t < cost_match(5, 9), that is, if the score of a match with fusion is better by at least t than a match without fusion.

We apply the same reasoning to the cost of an edge fusion. The cost function for a node and an edge fusion between a node u and a node v, with e_u denoting the edge leading to u and e_v the edge leading to v, is defined as follows:

\[
\begin{aligned}
\mathrm{cost}_{node\,fusion}(u, v) &= \mathrm{cost}_{del}(v) + \mathrm{cost}_{del}(e_v) + t \\
\mathrm{cost}_{edge\,fusion}(u, v) &= \mathrm{cost}_{del}(u) + \mathrm{cost}_{del}(e_u) + t + \sum_{c \text{ sibling of } v} \text{cost of deleting the subtree rooted at } c
\end{aligned}
\]

The tuning parameter t is thus an important parameter that allows us to control fusions. Always considering a cost function that produces real values between 0 and 1, if t is equal to 0.1, a fusion will be performed only if it improves the score by 0.1. In practice, we use values of t between 0 and 0.2.

For practical considerations, we also set a further condition on the cost and relabeling functions related to a node or edge resulting from a fusion, which is as follows:

\[
\mathrm{cost}_{del}(a) + \mathrm{cost}_{del}(b) \le \mathrm{cost}_{del}(c)
\]

with c the label of the node/edge resulting from the fusion of the nodes/edges labeled a and b. Indeed, if this condition is not fulfilled, the algorithm may systematically fuse the nodes or edges to reduce the overall cost.

An important consequence of the conditions seen above is that a node fusion cannot be followed by an edge fusion. Below, the node fusion followed by an edge fusion costs: (cost_del(b) + cost_del(B) + t) + (cost_del(AB) + cost_del(a) + t). The alternative is to destroy node B (together with edge b) and then to operate an edge fusion, the whole costing: (cost_del(b) + cost_del(B)) + (cost_del(A) + cost_del(a) + t). The difference between these two costs is t + cost_del(AB) − cost_del(A), which is always positive.

This observation significantly improves the performance of the algorithm in practice.

We have applied the new algorithm on the two RNAs shown in Fig. 5 (these are eukaryotic nuclear P RNAs from Saccharomyces uvarum and Saccharomyces kluveri) and coded using the same type of representation as in Fig. 2d. We have limited the number of consecutive fusions to one (ℓ = 1). The computation of the edit distance between the two trees taking node and edge fusions into account besides deletions, insertions, and relabeling has required less than a second. The total cost allowing for fusions is 6.18 with t = 0.05 against 7.42 without fusions. As indicated in Fig. 11, the last two problems discussed in Section 3 disappear thanks to some edge fusions (represented by the boxes).

An example of node fusions required when comparing two “real” RNAs is given in Fig. 12. The RNAs are coded using the same type of representation as in Fig. 2d. The figure shows part of the mapping obtained between the small subunits of two ribosomal RNAs retrieved from [8] (from Bacillaria paxillifer and Calicophoron calicophorum). The node fusion has been circled.
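The role of the tuning parameter t in the comparison discussed for Fig. 10 can be sketched in a few lines of Python. The function names and the numeric costs below are hypothetical values chosen for illustration; they are not taken from the paper. Edge labels are ignored, as in the simplified reasoning of this section.

```python
def cost_node_fusion(cost_del_v, t):
    """Simplified node fusion cost min(cost_del(v) + t, 1), edges ignored."""
    return min(cost_del_v + t, 1.0)


def fusion_is_preferred(match_without_fusion, match_with_fusion, cost_del_v, t):
    """Compare the two alternatives for a father u and its child v.

    Without fusion: match u with z, then delete v.
    With fusion:    fuse v into u (giving w), then match w with z.
    The fusion is preferred when its total cost is strictly smaller.
    """
    without_fusion = match_without_fusion + cost_del_v
    with_fusion = match_with_fusion + cost_node_fusion(cost_del_v, t)
    return with_fusion < without_fusion


if __name__ == "__main__":
    t = 0.1
    cost_del_3 = 0.5  # hypothetical cost of deleting node 3
    # The match improves by 0.2 (more than t): the fusion is chosen.
    print(fusion_is_preferred(0.4, 0.2, cost_del_3, t))   # True
    # The match improves by only 0.05 (less than t): the fusion is rejected.
    print(fusion_is_preferred(0.4, 0.35, cost_del_3, t))  # False
```

With these values the fusion is selected only when the match cost with fusion is lower than the match cost without fusion by more than t, which is the behaviour the text attributes to the tuning parameter.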

Fig. 11. Result of the editing between the two RNAs shown in Fig. 4 allowing for node and edge fusions.

6 MULTILEVEL RNA STRUCTURE COMPARISON: SKETCH OF THE MAIN IDEA

We briefly discuss now an approach which addresses in part the “scattering effect” problem (see Section 2). This approach is currently being validated and will be more fully described in another paper. We therefore present here the main idea only.

To start with, it is important to understand the nature of this “scattering effect.” Let us consider first a trivial case: the cost functions are unitary (insertion, deletion, and relabeling each cost 1) and we compute the edit distance between two trees composed of a single node each. The obtained mapping will associate the single node in the first tree with the single one in the second tree, independently from the labels of the nodes. This example can be extended to the comparison of two trees whose node labels are all different. In this case, the obtained mapping corresponds to the maximum homeomorphic subtree common to both trees.

If the two RNA secondary structures compared using a tree representation which models both the base pairs and the nonpaired bases are globally similar but present some local dissimilarity, then an edit operation will almost always associate the nodes of the locally divergent regions that are located at the same positions relatively to the global common structure. This is a normal, expected behavior in the context of an edition. However, it seems clear also when we look at Fig. 4 that the bases of a terminal loop should not be mapped to those of a multiple loop.

To reduce this problem, one possible solution consists of adding to the nodes corresponding to a base information concerning the element of secondary structure to which the base belongs. The cost functions are then adapted to take this type of information into account. This solution, although producing interesting results, is not entirely satisfying. Indeed, the algorithm will tend to systematically put into correspondence nodes (and, thus, bases) belonging to structural elements of the same type, which is also not necessarily a good choice as these elements may not be related in the overall structure. It seems therefore preferable to have a structural approach first, mapping initially the elements of secondary structure to each other and taking care of the nucleotides in a second step only.

The approach we have elaborated may be briefly described as follows: Given two RNA secondary structures, the first step consists in coding the RNAs by trees of type (c) in Fig. 2 (nodes represent bulges or multiple, internal or

Fig. 12. Part of a mapping between two rRNA small subunits. The node fusion is circled.

Fig. 13. Result of the comparison of the two RNAs of Fig. 4 using trees in Fig. 2c. The thick dash lines indicate some of the associations resulting
from the computation of the edit distance between these two trees. Triangular nodes stand for bulges, diamonds for internal loops, squares for
hairpin loops, and circles for multiloops. Noncolored fine dashed nodes and lines correspond, respectively, to deleted nodes/edges.

terminal loops while edges code for helices). We then compute the edit distance between these two trees using the two novel fusion operations described in this paper. This also produces a mapping between the two trees. Each node and edge of the trees, that is, each element of secondary structure, is then colored according to this mapping. Two elements are thus of the same color if they have been mapped in the first step. We now have at our disposal information concerning the structural similarity of the two RNAs. We can then code the RNAs using a tree of type (b). To these trees, we add to each node the color of the structural element to which it belongs. We now need only to restrict the match operation to nodes of the same color. Two nodes can therefore match only if they belong to secondary elements that have been identified in the first step as being similar.

To illustrate the use of this algorithm, we have applied it to the two RNAs of Fig. 4. Fig. 13 presents the trees of the type in Fig. 2c coding for these structures, and the mapping produced by the computation of the edit distance with fusion. In particular, the noncolored fine dashed nodes and edges correspond, respectively, to deleted nodes/edges. One can see that in the left RNA, the two hairpin loops involved in the scattering effect problem in Fig. 4 (indicated by the arrows) have been destroyed and will not be mapped to one another anymore when the edit operations are applied to the trees of the type in Fig. 2b.

This approach yields interesting results. Furthermore, it considerably reduces the complexity of the algorithm for comparing two RNA structures coded with trees of the type in Fig. 2b. However, it is important to observe that the scattering effect problem is not specific to the tree representations of the type in Fig. 2b. Indeed, the same problem may be observed, to a lesser degree, with trees of the type in Fig. 2c. This is the reason why we generalize the process by adopting a modelling of RNA secondary structures at different levels of abstraction. This model, and the accompanying algorithm for comparing RNA structures, is in progress.

7 FURTHER WORK AND CONCLUSION

We have proposed an algorithm that addresses two main limitations of the classical tree edit operations for comparing RNA secondary structures. Its complexity is high in theory if many fusions are applied in succession to any given (the same) node, but the total number of fusions that may be performed is not limited. In practice, the algorithm is fast enough for most situations one is likely to meet.

To provide a more complete solution to the problem of the scattering effect, we also proposed a new multilevel approach for comparing two RNA secondary structures whose main idea was sketched in this paper. Further details and evaluation of such novel comparison scheme will be the subject of another paper.

REFERENCES

[1] D. Bouthinon and H. Soldano, “A New Method to Predict the Consensus Secondary Structure of a Set of Unaligned RNA Sequences,” Bioinformatics, vol. 15, no. 10, pp. 785-798, 1999.
[2] J.W. Brown, “The Ribonuclease P Database,” Nucleic Acids Research, vol. 24, no. 1, p. 314, 1999.
[3] N. el Mabrouk and F. Lisacek, “Very Fast Identification of RNA Motifs in Genomic DNA. Application to tRNA Search in the Yeast Genome,” J. Molecular Biology, vol. 264, no. 1, pp. 46-55, 1996.
[4] I. Hofacker, “The Vienna RNA Secondary Structure Server,” 2003.
[5] I. Hofacker, W. Fontana, P.F. Stadler, L. Sebastian Bonhoeffer, M. Tacker, and P. Schuster, “Fast Folding and Comparison of RNA Secondary Structures,” Monatshefte für Chemie, vol. 125, pp. 167-188, 1994.
[6] M. Höchsmann, T. Töller, R. Giegerich, and S. Kurtz, “Local Similarity in RNA Secondary Structures,” Proc. IEEE Computer Soc. Conf. Bioinformatics, p. 159, 2003.
[7] M. Höchsmann, B. Voss, and R. Giegerich, “Pure Multiple RNA Secondary Structure Alignments: A Progressive Profile Approach,” IEEE/ACM Trans. Computational Biology and Bioinformatics, vol. 1, no. 1, pp. 53-62, 2004.
[8] T. Winkelmans, J. Wuyts, Y. Van de Peer, and R. De Wachter, “The European Database on Small Subunit Ribosomal RNA,” Nucleic Acids Research, vol. 30, no. 1, pp. 183-185, 2002.
[9] T. Jiang, L. Wang, and K. Zhang, “Alignment of Trees—An Alternative to Tree Edit,” Proc. Fifth Ann. Symp. Combinatorial Pattern Matching, pp. 75-86, 1994.
[10] F. Lisacek, Y. Diaz, and F. Michel, “Automatic Identification of Group I Intron Cores in Genomic DNA Sequences,” J. Molecular Biology, vol. 235, no. 4, pp. 1206-1217, 1994.

[11] B. Shapiro, “An Algorithm for Multiple RNA Secondary Structures,” Computer Applications in the Biosciences, vol. 4, no. 3, pp. 387-393, 1988.
[12] B.A. Shapiro and K. Zhang, “Comparing Multiple RNA Secondary Structures Using Tree Comparisons,” Computer Applications in the Biosciences, vol. 6, no. 4, pp. 309-318, 1990.
[13] K.-C. Tai, “The Tree-to-Tree Correction Problem,” J. ACM, vol. 26, no. 3, pp. 422-433, 1979.
[14] K. Zhang and D. Shasha, “Simple Fast Algorithms for the Editing Distance between Trees and Related Problems,” SIAM J. Computing, vol. 18, no. 6, pp. 1245-1262, 1989.
[15] M. Zuker, “Mfold Web Server for Nucleic Acid Folding and Hybridization Prediction,” Nucleic Acids Research, vol. 31, no. 13, pp. 3406-3415, 2003.

Julien Allali studied at the University of Marne la Vallée (France), where he received the MSc degree in computer science and computational genomics. In 2001, he began his PhD in computational genomics at the Gaspard Monge Institute of the University of Marne la Vallée. His thesis focused on the study of RNA secondary structures and, in particular, their comparison using a tree distance. In 2004, he received the PhD degree.

Marie-France Sagot received the BSc degree in computer science from the University of São Paulo, Brazil, in 1991, the PhD degree in theoretical computer science and applications from the University of Marne-la-Vallée, France, in 1996, and the Habilitation from the same university in 2000. From 1997 to 2001, she worked as a research associate at the Pasteur Institute in Paris, France. In 2001, she moved to Lyon, France, as a research associate at the INRIA, the French National Institute for Research in Computer Science and Control. Since 2003, she has been the Director of Research at the INRIA. Her research interests are in computational biology, algorithmics, and combinatorics.

IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 2, NO. 1, JANUARY-MARCH 2005 15

Topological Rearrangements and Local Search


Method for Tandem Duplication Trees
Denis Bertrand and Olivier Gascuel

Abstract—The problem of reconstructing the duplication history of a set of tandemly repeated sequences was first introduced by Fitch
[4]. Many recent studies deal with this problem, showing the validity of the unequal recombination model proposed by Fitch, describing
numerous inference algorithms, and exploring the combinatorial properties of these new mathematical objects, which are duplication
trees. In this paper, we deal with the topological rearrangement of these trees. Classical rearrangements used in phylogeny (NNI, SPR,
TBR, ...) cannot be applied directly on duplication trees. We show that restricting the neighborhood defined by the SPR (Subtree
Pruning and Regrafting) rearrangement to valid duplication trees, allows exploring the whole duplication tree space. We use these
restricted rearrangements in a local search method which improves an initial tree via successive rearrangements. This method is
applied to the optimization of parsimony and minimum evolution criteria. We show through simulations that this method improves all
existing programs for both reconstructing the topology of the true tree and recovering its duplication events. We apply this approach to
tandemly repeated human Zinc finger genes and observe that a much better duplication tree is obtained by our method than using any
other program.

Index Terms—Tandem duplication trees, phylogeny, topological rearrangements, local search, parsimony, minimum evolution, Zinc
finger genes.

The authors are with Projet Méthodes et Algorithmes pour la Bioinformatique, LIRMM (UMR 5506, CNRS, Univ. Montpellier 2), 161 rue Ada, 34392 Montpellier Cedex 5, France. E-mail: gascuel@lirmm.fr.
Manuscript received 11 Oct. 2004; revised 17 Dec. 2004; accepted 20 Dec. 2004; published online 30 Mar. 2005.
For information on obtaining reprints of this article, please send e-mail to: tcbb@computer.org, and reference IEEECS Log Number TCBBSI-0170-1004.
1545-5963/05/$20.00 © 2005 IEEE. Published by the IEEE CS, CI, and EMB Societies & the ACM.

1 INTRODUCTION

Repeated sequences constitute an important fraction of most genomes, from the well-studied Escherichia coli bacterial genome [1] to the Human genome [2]. For example, it is estimated that more than 50 percent of the Human genome consists of repeated sequences [2], [3]. There exist three major types of repeated sequences: transposon-derived repeats, micro or minisatellites, and large duplicated sequences, the last often containing one or several RNA or protein-coding genes. Micro or minisatellites arise through a mechanism called slipped-strand mispairing, and are always arranged in tandem: copies of a same basic unit are linearly ordered on the chromosome. Large duplicated sequences are also often found in tandem and, when this is the case, unequal recombination is widely assumed to be responsible for their formation.

Both the linear order among tandemly repeated sequences, and the knowledge of the biological mechanisms responsible for their generation, suggest a simple model of evolution by duplication. This model, first described by Fitch in 1977 [4], introduces tandem duplication trees as phylogenies constrained by the unequal recombination mechanism. Although being a completely different biological mechanism, slipped-strand mispairing leads to the same duplication model [5]. A formal recursive definition of this model is provided in Section 2, but its main features can be grasped from the examples of Fig. 1. Fig. 1a shows the duplication history of the 13 Antennapedia-class homeobox genes from the cognate group [6]. In this history, the ancestral locus has undergone a series of simple duplication events where one of the genes has been duplicated into two adjacent copies. Starting from the unique ancestral gene, this series of events has produced the extant locus containing the 13 linearly ordered contemporary genes. It is easily seen [7] that trees only containing simple duplication events are equivalent to binary search trees with labeled leaves. They differ from standard phylogenies in that node children have left/right orientation. Fig. 1b shows another example corresponding to the nine variable genes of the human T cell receptor Gamma (TRGV) locus [8]. In this history, the most recent event involves a double duplication where two adjacent genes have been simultaneously duplicated to produce four adjacent copies. Duplication trees containing multiple duplication events differ from binary search trees, but are less general than phylogenies. The model proposed by Fitch [4] covers both simple and multiple duplication trees.

Fitch’s paper [4] received relatively little attention at the time of its publication probably due to the lack of available sequence data. Rediscovered by Benson and Dong [9], Tang et al. [10], and Elemento et al. [8], tandemly repeated sequences and their suggested duplication model have recently received much interest, providing several new computational biology problems and challenges [11], [12].

The main challenge consists of creating algorithms incorporating the model constraints to reconstruct the

Fig. 1. (a) Rooted duplication tree describing the evolutionary history of the 13 Antennapedia-class homeobox genes from the cognate group [6].
(b) Rooted duplication tree describing the evolutionary history of the nine variable genes of the human T cell receptor Gamma (TRGV) locus [8]. In
both examples, the contemporary genes are adjacent and linearly ordered along the extant locus.

duplication history of tandemly repeated sequences. Indeed, accurate reconstruction of duplication histories will be useful to elucidate various aspects of genome evolution. Such reconstructions will provide new insights into the mechanisms and determinants of gene and protein domain duplication, often recognized as major generators of novelty [13]. Several important gene families, such as immunity-related genes, are arranged in tandem; better understanding their evolution should provide new insights into their duplication dynamics and clues about their functional specialization. Studying the evolution of micro and minisatellites could resolve unanswered biological questions regarding human migrations or the evolution of bacterial diseases [14].
Given a set of aligned and ordered sequences (DNA or proteins), the aim is to find the duplication tree that best explains these sequences, according to usual criteria in phylogenetics, e.g., parsimony or minimum evolution. Few studies have focused on the computational hardness of this problem, and all of these studies only deal with the restricted version where simultaneous duplication of multiple adjacent segments is not allowed. In this context, Jaitly et al. [15] show that finding the optimal single copy duplication tree with parsimony is NP-Hard and that this problem has a PTAS (Polynomial Time Approximation Scheme). Another closely related PTAS is given by Tang et al. [10] for the same problem. On the other hand, Elemento et al. [7] describe a polynomial distance-based algorithm that reconstructs optimal single copy tandem duplication trees with minimum evolution.
However, it is commonly believed, as in phylogeny, that most (especially multiple) duplication tree inference problems are NP-Hard. This explains the development of heuristic approaches. Benson and Dong [9] provide various parsimony-based heuristic reconstruction algorithms to infer duplication trees, especially from minisatellites. Elemento et al. [8] present an enumerative algorithm that computes the most parsimonious duplication tree; this algorithm (by its exhaustive approach) is limited to datasets of fewer than 15 repeats. Several distance-based methods have also been described. The WINDOW method [10] uses an agglomeration scheme similar to UPGMA [16] and NJ [17], but the cost function used to judge potential duplication is based on the assumption that the sequences follow a molecular clock mode of evolution. The DTSCORE method [18] uses the same scheme but corrects this limitation using a score criterion [19], like ADDTREE [20]. DTSCORE can be used with sequences that do not follow the molecular clock, which is, for example, essential when dealing with gene families containing pseudogenes that evolve much faster than functional genes. Finally, GREEDY SEARCH [21] corresponds to a different approach divided into two steps: First, a phylogeny is computed with a classical reconstruction method (NJ), then, with nearest neighbor interchange (NNI) rearrangements, a duplication tree close to this phylogeny is computed. This approach is noteworthy since it implements topological rearrangements which are highly useful in phylogenetics [22], but it works blindly and does not ensure that good duplication trees will be found (cf. Section 5.2).
Topological rearrangements have an essential function in phylogenetic inference, where they are used to improve an initial phylogeny by subtree movement or exchange. Rearrangements are very useful for all common criteria (parsimony, distance, maximum likelihood) and are integrated into all classical programs like PAUP* [23] or PHYLIP [24]. Furthermore, they are used to define various distances between phylogenies and are the foundation of much mathematical work [25]. Unfortunately, they cannot be directly used here, as shown by a simple example given
BERTRAND AND GASCUEL: TOPOLOGICAL REARRANGEMENTS AND LOCAL SEARCH METHOD FOR TANDEM DUPLICATION TREES 17

Fig. 2. (a) Duplication history; each segment represents a copy; extant segments are numbered. (b) Duplication tree (DT); the black points show the
possible root locations. (c) Rooted duplication tree (RDT) corresponding to history (a) and root position 1 on (b).

later. Indeed, when applied to a duplication tree, they do not guarantee that another valid duplication tree will be produced.
In this paper, we describe a set of topological rearrangements to stay inside the duplication tree space and explore the whole space from any of its elements. We then show the advantages of this approach for duplication tree inference from sequences. In Section 2, we describe the duplication model introduced by [4], [8], [10], as well as an algorithm to recognize duplication trees in linear time. Thanks to this algorithm, we restrict the neighborhoods defined by classical phylogeny rearrangements, namely, nearest neighbor interchange (NNI) and subtree pruning and regrafting (SPR), to valid duplication trees. We demonstrate (Section 3) that for NNI moves this restricted neighborhood does not allow the exploration of the whole duplication tree space. On the other hand, we demonstrate that the restricted neighborhood of SPR rearrangement allows the whole space to be explored. In this way, we define a local search method, applied here to parsimony and minimum evolution (Section 4). We compare this method to other existing approaches using simulated and real data sets (Section 5). We conclude by discussing the positive results obtained by our method, and indicate directions for further research (Section 6).

2 MODEL
2.1 Duplication History and Duplication Tree
The tandem duplication model used in this article was first introduced by Fitch [4] then studied independently by [8], [10]. It is based on unequal recombination which is assumed to be the sole evolution mechanism (except point mutations) acting on sequences. Although it is a completely different biological mechanism, slipped-strand mispairing leads to the same duplication model [5], [9].
Let $O = (1, 2, \ldots, n)$ be the ordered set of sequences representing the extant locus. Initially containing a single copy, the locus grew through a series of consecutive duplications. As shown in Fig. 2a, a duplication history may contain simple duplication events. When the duplicated fragment contains two, three, or k repeats, we say that it involves a multiple duplication event. Under this duplication model, a duplication history is a rooted tree with n labeled and ordered leaves, in which internal nodes of degree 3 correspond to duplication events. In a real duplication history (Fig. 2a), the time intervals between consecutive duplications are completely known, and the internal nodes are ordered from top to bottom according to the moment they occurred in the course of evolution. Any ordered segment set of the same height then represents an ancestral state of the locus. We call such a set a floor, and we say that two nodes $i, j$ are adjacent ($i \prec j$) if there is a floor where i and j are consecutive and i is on the left of j.
However, in the absence of a molecular clock mode of evolution (a typical problem), it is impossible to recover the order between the duplication events of two different lineages from the sequences. In this case, we are only able to infer a duplication tree (DT) (Fig. 2b) or a rooted duplication tree (RDT) (Fig. 2c).
A duplication tree is an unrooted phylogeny with ordered leaves, whose topology is compatible with at least one duplication history. Also, internal nodes of duplication trees are partitioned into events (or “blocks” following [10]), each containing one or more (ordered) nodes. We distinguish “simple” duplication events that contain a unique internal node (e.g., b and f in Fig. 2c) and “multiple” duplication events which group a series of adjacent and simultaneous duplications (e.g., c, d, and e in Fig. 2c). Let $E = (s_i, s_{i+1}, \ldots, s_k)$ denote an event containing internal nodes $s_i, s_{i+1}, \ldots, s_k$ in left to right order. We say that two consecutive nodes of the same event are adjacent ($s_j \prec s_{j+1}$) just like in histories, as any event belongs to a floor in all of

Fig. 3. The tree obtained by applying an NNI move to a DT is not always a valid DT: T whose RT is a rooted version; $T'$ is obtained by applying NNI(5,4) around the bold edge; none of the possible root positions of $T'$ (a, b, c, and d) leads to a valid RDT, cf. tree (b) which corresponds to root b in $T'$.

the histories that are compatible with the DT being considered. The same notation will also be used for leaves to express the segment order in the extant locus. When the tree is rooted, every internal node $s_j$ is unambiguously associated to one parent and two child nodes; moreover, one child of $s_j$ is “left” and the other one is “right,” which is denoted as $l_j$ and $r_j$, respectively. In this case, for any duplication history that is compatible with this tree, child nodes of an event, $s_i, s_{i+1}, \ldots, s_k$ are organized as follows:
$$l_i \prec l_{i+1} \prec \ldots \prec l_k \prec r_i \prec r_{i+1} \prec \ldots \prec r_k.$$
In [8], [26], [27], it was shown that rooting a duplication tree is different than rooting a phylogeny: the root of a duplication tree necessarily lies on the tree path between the most distant repeats on the locus, i.e., 1 and n; moreover, the root is always located “above” all multiple duplications, e.g., Fig. 1b shows that there are only three valid root positions, the root cannot be a direct ancestor of 12.

2.2 Recursive Definition of Rooted and Unrooted Duplication Trees
A duplication tree is compatible with at least one duplication history. This suggests a recursive definition, which progressively reconstructs a possible history, given a phylogeny T and a leaf ordering O. We define a cherry $(l, s, r)$ as a pair of leaves (l and r) separated by a single node s in T, and we call $C(T)$ the set of cherries of T. This recursive definition reverses evolution: It searches for a “visible duplication event,” “agglomerates” this event, and checks whether the “reduced” tree is a duplication tree. In case of rooted trees, we have:
$(T, O)$ defines a duplication tree with root $\rho$ if and only if:
1. $(T, O)$ only contains $\rho$, or
2. there is in $C(T)$ a series of cherries $(l_i, s_i, r_i), (l_{i+1}, s_{i+1}, r_{i+1}), \ldots, (l_k, s_k, r_k)$ with $k \ge i$ and $l_i \prec l_{i+1} \prec \ldots \prec l_k \prec r_i \prec r_{i+1} \prec \ldots \prec r_k$ in O, such that $(T', O')$ defines a duplication tree with root $\rho$, where $T'$ is obtained from T by removing $l_i, l_{i+1}, \ldots, l_k, r_i, r_{i+1}, \ldots, r_k$, and $O'$ is obtained by replacing $(l_i, l_{i+1}, \ldots, l_k, r_i, r_{i+1}, \ldots, r_k)$ by $(s_i, s_{i+1}, \ldots, s_k)$ in O.
The definition for unrooted trees is quite similar:
$(T, O)$ defines an unrooted duplication tree if and only if:
1. $(T, O)$ contains 1 segment, or
2. same as for rooted trees with $(T', O')$ now defining an unrooted duplication tree.
Those definitions provide a recursive algorithm, RADT (Recognition Algorithm for Duplication Trees), to check whether any given phylogeny with ordered leaves is a duplication tree. In case of success, this algorithm can also be used to reconstruct duplication events: At each step, the series of internal nodes above denoted as $(s_i, s_{i+1}, \ldots, s_k)$ is a duplication event. When the tree is rooted, $l_j$ is the left child of $s_j$ and $r_j$ its right child, for every $j$, $i \le j \le k$. This algorithm can be implemented in $O(n)$ [26] where n is the number of leaves. Another linear algorithm is proposed by Zhang et al. [21] using a top down approach instead of a bottom-up one, but applies only to rooted duplication trees.

3 TOPOLOGICAL REARRANGEMENTS FOR DUPLICATION TREES
This section shows how to explore the DT space using SPR rearrangements. First, we describe some NNI, SPR, and TBR rearrangement properties with standard phylogenies. But, these rearrangements cannot be directly used to explore the DT space. Indeed, when applied to a duplication tree, they do not guarantee that another valid duplication tree will be produced. So, we have decided to restrict the neighborhood defined by those rearrangements to duplication trees. If we only used NNI rearrangements, the neighborhood would be too restricted (as shown by a simple example) and would not allow the whole DT space to be explored. On the other hand, we can distinguish two types of SPR rearrangements which, when applied to a rooted duplication tree guarantee that another valid duplication tree will be produced. Thanks to these specific rearrangements, we demonstrate that restricting the neighborhood of SPR rearrangements allows the whole space of duplication trees to be explored.
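The recursive definition of Section 2.2 can be turned directly into a small recognizer. The following is a minimal Python sketch for rooted trees only; it is a naive backtracking reading of the definition, not the linear-time RADT implementation of [26]. The tree encoding (nested 2-tuples whose integer leaves give their position on the locus, child order being irrelevant) and the function names are assumptions made for illustration.

```python
from typing import Dict, FrozenSet, Tuple, Union

Tree = Union[int, Tuple["Tree", "Tree"]]   # leaf = position on the locus (1..n)
Node = FrozenSet[int]                       # a node is identified by its leaf set


def is_rooted_duplication_tree(tree: Tree) -> bool:
    """Naive check of the recursive definition (exponential in the worst case)."""
    parent: Dict[Node, Node] = {}

    def collect(t: Tree) -> Node:
        # Record the parent of every non-root node, bottom-up.
        if isinstance(t, int):
            return frozenset([t])
        left, right = collect(t[0]), collect(t[1])
        node = left | right
        parent[left] = node
        parent[right] = node
        return node

    root = collect(tree)
    order = tuple(frozenset([leaf]) for leaf in sorted(root))  # the ordering O
    return _reducible(order, parent)


def _reducible(order: Tuple[Node, ...], parent: Dict[Node, Node]) -> bool:
    n = len(order)
    if n == 1:                      # case 1: only the root is left
        return True
    for m in range(1, n // 2 + 1):  # m = number of cherries in the event
        for p in range(n - 2 * m + 1):
            # Visible event: l_i..l_k sit at p..p+m-1, r_i..r_k at p+m..p+2m-1,
            # and l_j, r_j are the two children of the same node s_j.
            if all(parent[order[p + j]] == parent[order[p + m + j]]
                   for j in range(m)):
                merged = tuple(parent[order[p + j]] for j in range(m))
                reduced = order[:p] + merged + order[p + 2 * m:]
                if _reducible(reduced, parent):
                    return True
    return False
```

Under this encoding, `((1, 3), (2, 4))` is accepted (a double duplication below the root), whereas `((1, 4), (2, 3))` is rejected because the cherry (1, 4) never becomes a pair of adjacent segments.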

Fig. 4. The NNI neighborhood of a duplication tree does not always contain duplication trees: T whose RT is a rooted version; $T'$ is obtained by exchanging subtrees 1 and (2 5); none of the possible root positions of $T'$ (a, b, and c) leads to a valid duplication tree, cf. tree (b) which corresponds to root b in $T'$; and the same holds for every neighbor of T being obtained by NNI.

3.1 Topological Rearrangements for Phylogeny
There are many ways of carrying out topological rearrangements on phylogeny [22]. We only describe NNI (Nearest Neighbor Interchange), SPR (Subtree Pruning Regrafting), and TBR (Tree Bisection and Reconnection) rearrangements.
The NNI move is a simple rearrangement which exchanges two subtrees adjacent to the same internal edge (Figs. 3 and 4). There are two possible NNIs for each internal edge, so $2(n - 3)$ neighboring trees for one tree with n leaves. This rearrangement allows the whole space of phylogeny to be explored; i.e., there is a succession of NNI moves making it possible to transform any phylogeny $P_1$ into any phylogeny $P_2$ [28].
The SPR move consists of pruning a subtree and regrafting it, by its root, to an edge of the resulting tree (Figs. 6 and 7). We note that the neighborhood of a tree defined by the NNI rearrangements is included in the neighborhood defined by SPRs. The latter rearrangement defines a neighborhood of size $2(n - 3)(2n - 7)$ [25].
Finally, TBR generalizes SPR by allowing the pruned subtree to be reconnected by any of its edges to the resulting tree. These three rearrangements (NNI, SPR, and TBR) are reversible, that is, if $T'$ is obtained from T by a particular rearrangement, then T can be obtained from $T'$ using the same type of rearrangement.

3.2 NNI Rearrangements Do Not Stay in DT Space
The classical phylogenetic rearrangements (NNI, SPR, TBR,...) do not always stay in DT space. So, if we apply an NNI to a DT (e.g., Fig. 3), the resulting tree is not always a valid DT. This property is also true for SPR and TBR rearrangements since NNI rearrangements are included in these two rearrangement classes.

3.3 Restricted NNI Does Not Allow the Whole DT Space to Be Explored
To restrict the neighborhood defined by NNI rearrangements to duplication trees, each element of the neighborhood is filtered thanks to the recognition algorithm (RADT). But, this restricted neighborhood does not allow the whole DT space to be explored. Fig. 4 gives an example of a duplication tree, T, the neighborhood of which does not contain any DT. So, its restricted neighborhood is empty, and there is no succession of restricted NNIs allowing T to be transformed into any other DT.

3.4 Restricted SPR Allows the Whole DT Space to Be Explored
As before, we restrict (using RADT) the neighborhood defined by SPR rearrangements to duplication trees. We call restricted SPRs those SPR moves that, starting from a duplication tree, lead to another duplication tree.
Main Theorem. Let $T_1$ and $T_2$ be any given duplication trees; $T_1$ can be transformed into $T_2$ via a succession of restricted SPRs.
Proof. To demonstrate the Main Theorem, we define two types of special SPR that ensure staying within the space of rooted duplication trees (RDT). Given these two types of SPRs, we demonstrate that it is possible to transform any rooted duplication tree into a caterpillar, i.e., a rooted tree in which all internal nodes belong to the tree path between the leaf 1 and the tree root $\rho$ (cf. Fig. 5).
This result demonstrates the theorem. Indeed, let $T_1$ and $T_2$ be two RDTs. We can transform $T_1$ and $T_2$ into a caterpillar by a succession of restricted SPRs. So, it is possible to transform $T_1$ into $T_2$ by a succession of restricted SPRs, with (possibly) a caterpillar as intermediate tree. This property holds since the reciprocal movement of an SPR is an SPR. As the two SPR types proposed ensure that we stay within the RDT space, we have the desired result for rooted duplication trees. And, this result extends to unrooted duplication trees since two DTs can be arbitrarily rooted, transformed from one to the other using restricted SPRs, then unrooted. □

Fig. 5. A six-leaf caterpillar.

The first special SPR allows multiple duplication events to be destroyed. Let $E = (s_i, s_{i+1}, \ldots, s_k)$ be a duplication event, $r_i$ and $l_k$ respectively right child of $s_i$

Fig. 6. DELETE rearrangement.

and left child of $s_k$, and let $p_i$ be the father of $s_i$. The DELETE rearrangement consists of pruning the subtree of root $r_i$ and grafting this subtree on the edge $(s_k, l_k)$, while $l_i$ is renamed $s_i$ and the edge $(l_i, s_i)$ is deleted. Fig. 6 demonstrates this rearrangement.
Lemma 1. DELETE preserves the RDT property.
Proof. Let T be the initial tree (Fig. 6a), $E = (s_i, s_{i+1}, \ldots, s_k)$ be an event of T, and $T'$ be the tree obtained from T by applying DELETE to E (Fig. 6b). Children of any node $s_j$ ($i \le j \le k$) are denoted $l_j$ and $r_j$.
By definition, for any duplication history compatible with T we have
$$l_i \prec l_{i+1} \prec \ldots \prec l_k \prec r_i \prec r_{i+1} \prec \ldots \prec r_k.$$
Thus, there is a way to partially agglomerate T (using an RADT-like procedure) such that these nodes become leaves. The same agglomeration can be applied to $T'$ as only ancestors of the $l_j$s and $r_j$s are affected by DELETE. Now, 1) agglomerate the event E of T, and 2) reduce $T'$ by agglomerating the cherry $(l_k, r_i)$ and then agglomerating the event $(s_{i+1}, \ldots, s_k)$. Two identical trees follow, which concludes the proof. □
By successively applying DELETE to any duplication tree, we remove all multiple duplication events. The following SPR rearrangement allows duplications to be moved within simple RDT, i.e., any RDT containing only simple duplications. Let p be a node of a simple RDT T, l its left child, r its right child, and x the left child of r. This rearrangement consists of pruning the subtree of root x and regrafting it to the edge $(l, p)$ (Fig. 7). This rearrangement is an SPR (in fact an NNI); we name it LEFT as it moves the subtree root towards the left. It is obvious that the tree obtained by applying such a rearrangement to a simple RDT is a simple RDT. We now establish the following lemma which shows that any simple tree can be transformed into a caterpillar.

Fig. 7. LEFT rearrangement.

Lemma 2. Let T be a simple RDT; T can be transformed into a caterpillar by a succession of LEFT rearrangements.
Proof. In a caterpillar all internal nodes are ancestors of 1. If T is not a caterpillar, there is an internal node r that is not an ancestor of 1. If r is the right child of its father, we can apply LEFT to the left child of r (Fig. 7). If r is the left child of its father, we consider its father: It cannot be an ancestor of 1 since its children are r and a node on the right of r. So, we can apply the same argument: Either the father of r is adequate for performing LEFT, or we consider its father again. In this way, we necessarily obtain a node for which the rearrangement is possible. T is then transformed into a caterpillar by successively applying the LEFT rearrangement to nodes which are not on the path between 1 and $\rho$. After a finite number of steps, all internal nodes are ancestors of 1 and T has been transformed into a caterpillar. This concludes the proof of Lemma 2 and, therefore, of our Main Theorem. □

4 LOCAL SEARCH METHOD
We consider data consisting of an alignment of n segments with length k, and of the ordering O of the segments along the locus. This alignment has been created before tree construction and the problem is not to build simultaneously the alignment and the tree, a much more complicated task [29]. The aim is to find a (nearly) optimal duplication tree, where “optimal” is defined by some usual phylogenetic criterion and the ordered and aligned segments at hand.
Topological rearrangements described in the previous section naturally lead to a local search method for this purpose. We discuss its use to optimize the usual Wagner parsimony [22] and the distance-based balanced minimum evolution criterion (BME) [30], [31]. First, we describe our local search method, then we define briefly these two criteria and explain how to compute them during local search.
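The LEFT move and the argument of Lemma 2 (Section 3.4) can be illustrated on a toy encoding. The sketch below is an illustration under stated assumptions, not the authors' code: a simple RDT is written as nested 2-tuples `(left, right)` with integer leaves in locus order, and the helper names `left_move` and `to_caterpillar` are hypothetical. One LEFT move at a node p = (l, (x, y)) yields ((l, x), y); repeating it until no node on the path from the root to leaf 1 has an internal right child leaves a caterpillar.

```python
from typing import Optional, Tuple, Union

Tree = Union[int, Tuple["Tree", "Tree"]]


def left_move(tree: Tree) -> Optional[Tree]:
    """Apply one LEFT move on the path from the root to leaf 1, or return None."""
    if isinstance(tree, int):
        return None
    l, r = tree
    if isinstance(r, tuple):
        x, y = r                  # r is internal and is not an ancestor of leaf 1
        return ((l, x), y)        # prune x and regraft it on the edge (l, p)
    moved = left_move(l)          # r is a leaf: look further down towards leaf 1
    return None if moved is None else (moved, r)


def to_caterpillar(tree: Tree) -> Tuple[Tree, int]:
    """Repeat LEFT until every internal node is an ancestor of leaf 1."""
    moves = 0
    while True:
        nxt = left_move(tree)
        if nxt is None:
            return tree, moves
        tree, moves = nxt, moves + 1
```

Each move adds one internal node to the path between leaf 1 and the root, so the loop stops after at most n - 2 moves on a tree with n leaves. For instance, `to_caterpillar(((1, 2), ((3, 4), (5, 6))))` reaches the six-leaf caterpillar `(((((1, 2), 3), 4), 5), 6)` in three moves.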

4.1 The LSDT Method
Our method, LSDT (Local Search for Duplication Trees), follows a classical local search procedure in which, at each step, we try to strictly improve the current tree. This approach can be used to optimize various criteria. In this study, we restrict ourselves to parsimony and balanced minimum evolution; $f(T)$ represents the value (to be minimized) of one of these criteria for the duplication tree T and the sequence set.
Algorithm 1 summarizes LSDT. The neighborhood of the current DT, $T_{current}$, is computed using SPR. As we explained earlier, we use the RADT procedure to restrict this neighborhood to valid DTs. When a tree is a valid DT, its f criterion value is computed. That way, we select the best neighbor of $T_{current}$. If this DT improves the value obtained so far (i.e., $f(T_{best})$), the local search restarts with this new topology. If no neighbor of $T_{current}$ improves $T_{best}$, the local search is stopped and returns $T_{best}$.
To analyze the time complexity of one LSDT step, we have to consider the size of the neighborhood defined by the restricted SPR. In the worst case, this size is of the same order as the size of an unrestricted SPR neighborhood, i.e., $O(n^2)$. Indeed for the “double caterpillar” (Fig. 8), it is possible to move any subtree being rooted on the path between $n/2$ and $\rho$ towards any edge of the path between $(n + 1)/2$ and $\rho$; and inversely. Thus, for this tree, $O(n^2)$ restricted SPRs can be performed. In the worst case, restricting the neighborhood defined by SPR to duplication trees does not significantly decrease the neighborhood size. However, on average the diminution is quite significant; e.g., with $n = 48$, only 5 percent of the neighborhood corresponds to valid DTs, assuming DTs are uniformly distributed [26].

Fig. 8. A simple rooted duplication tree with a double caterpillar structure.

Since the time complexity of the recognition algorithm (RADT) is $O(n)$, computing the neighborhood defined by restricted SPR requires $O(n^3)$. The calculation of the criterion value is done for each tree of the restricted neighborhood. Thus one local search step basically requires $O(n^3 + n^2 g)$, where g represents the time complexity of computing the criterion value. However, preprocessing allows this time complexity to be lowered, both for parsimony and minimum evolution, as we shall explain in the following sections.

4.2 The Maximum Parsimony Criterion
Parsimony is commonly acknowledged [22] to be a good criterion when dealing with slightly divergent sequences, which is usually the case with tandemly duplicated genes [8]. The parsimony criterion involves selecting the tree which minimizes the number of substitutions needed to explain the evolution of the given sequences. Finding the most parsimonious tree [22] or duplication tree [15] is NP-hard, but we can find the optimal labeling of the internal nodes and the parsimony score of a given tree T in polynomial time using the Fitch-Hartigan algorithm [32], [33]. The parsimony score and optimal labeling of internal nodes is independently computed for each position within sequences, using a postorder depth-first search algorithm that requires $O(n)$ time [32], [33]. Thus, computing the parsimony score of n sequences of length k requires $O(kn)$ time. Hence, if we use this algorithm during our local search method, one local search step is computed in $O(kn^3)$, which is relatively high.
To speed up this process, we adapted techniques commonly used in phylogeny for fast calculation of parsimony. Our implementation uses a data structure implemented (among others) in DNAPARS [24] and described in [34], [35]. Let $T_p$ be the pruned subtree and $T_r$ be the resulting tree. A preprocessing stage computes the parsimony vector (i.e., the optimal score and optimal labeling of all sequence positions) of every rooted subtree of $T_r$ using a double depth-first search [36] (Fig. 9a); the first search is postordered and computes the parsimony vector of down-subtrees; the second search is preordered and computes the parsimony vector of up-subtrees. Each search requires $O(nk)$ time. Thanks to this data structure, the parsimony score of the tree obtained by regrafting $T_p$ on any given edge of $T_r$ is computed in $O(k)$ (Fig. 9b). Hence, computing the SPR neighbor with minimum parsimony of any given duplication tree is achieved in $O(n^3 + n \cdot nk + n^2 k) = O(n^3 + n^2 k)$; the first term ($n^3$) represents the neighborhood computation; the second term ($n \cdot nk$) corresponds to the time required by the n

Fig. 9. (a) Every edge defines one down-subtree and one up-subtree; e.g., A represents the down-subtree (2 3) defined by the edge e while D corresponds to the up-subtree (1 (4 5)). Moreover, only the parsimony vector of the five leaves is known before the preprocessing stage. The postorder search computes the parsimony vector of down-subtrees: A is computed from 2 and 3, B from 4 and 5, C from A and B. The preorder search computes the parsimony vector of up-subtrees: D is obtained from 1 and B, E is obtained from D and 3, etc. (b) When the parsimony vector of every subtree in $T_r$ is known, regrafting $T_p$ on any given edge and computing the parsimony score of the resulting tree only requires analyzing the parsimony vector of three subtrees and is done in $O(k)$ time.
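The per-site computation behind these parsimony vectors can be sketched with Fitch's set rule for rooted binary trees. This is only the basic postorder pass for one subtree, under assumed conventions (nested 2-tuples for the tree, a dict from leaf label to aligned sequence, hypothetical function names); it does not implement the up-subtree pass or the $O(k)$ regrafting shortcut.

```python
from typing import Dict, Set, Tuple, Union

Tree = Union[int, Tuple["Tree", "Tree"]]


def fitch_site(tree: Tree, column: Dict[int, str]) -> Tuple[Set[str], int]:
    """Postorder pass for one alignment column: (candidate states, substitutions)."""
    if not isinstance(tree, tuple):
        return {column[tree]}, 0
    left_states, left_cost = fitch_site(tree[0], column)
    right_states, right_cost = fitch_site(tree[1], column)
    common = left_states & right_states
    if common:                                   # children agree: no extra change
        return common, left_cost + right_cost
    return left_states | right_states, left_cost + right_cost + 1


def parsimony_score(tree: Tree, alignment: Dict[int, str]) -> int:
    """Sum the per-column costs; all sequences are assumed to have equal length."""
    length = len(next(iter(alignment.values())))
    total = 0
    for pos in range(length):
        column = {leaf: seq[pos] for leaf, seq in alignment.items()}
        total += fitch_site(tree, column)[1]
    return total
```

Each column is handled independently in one traversal of the tree, which is where the $O(kn)$ cost per tree quoted in Section 4.2 comes from.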

preprocessing stages; the third term ($n^2 k$) is the time to test the n subtrees and the n possible insertion edges.

4.3 The Distance-Based Balanced Minimum Evolution Principle
As in any distance-based approach, we first estimate the matrix of pairwise evolutionary distances between the segments, using some standard distance estimator [22], e.g., the Kimura two-parameter estimator [37] in case of DNA or the JTT method with proteins [38]. Let $\Delta$ be this matrix and $\delta_{ij}$ be the distance between segments i and j. The $\Delta$ matrix plus the segment order is the input of the reconstruction method.
The minimum evolution principle (ME) [39], [40] involves selecting the shortest tree to be the tree which best explains the observed sequences. The tree length is equal to the sum of all the edge lengths, and the edge lengths are estimated by minimizing a least squares fit criterion. The problem of inferring optimal phylogenies within ME is commonly assumed to be NP-hard, as are many other distance-based phylogeny inference problems [41]. Nonetheless, ME forms the basis of several phylogenetic reconstruction methods, generally based on greedy heuristics. Among them is the popular Neighbor-Joining (NJ) algorithm [17]. Starting from a star tree, NJ iteratively agglomerates external pairs of taxa so as to minimize the tree length at each step.
Recently, Pauplin [30] proposed a new simple formula to estimate the tree length $L(T)$ of tree T:
$$L(T) = \sum_{i<j} 2^{1 - T_{ij}} \delta_{ij},$$
where $T_{ij}$ is the topological distance (number of edges) in T between segments i and j. The correctness of this formula was shown by Semple and Steel [42], while Desper and Gascuel [31] showed that this formula is a special case of weighted-least squares tree fitting. Moreover, Desper and Gascuel demonstrated that selecting the shortest tree (as computed from above formula) is statistically consistent and well suited for phylogenetic inference. They called this new version of ME “balanced minimum evolution” (BME) [31].
Using the above formula, the length of any given tree is computed in $O(n^2)$, so computing one LSDT local search step can be achieved in $O(n^4)$. However, a faster implementation is possible using a straightforward modification of our BME addition algorithm [43]. This involves:
1. pruning a rooted subtree $T_p$ from tree T,
2. computing the average distance between all nonintersecting subtree pairs in the remaining tree $T_r$,
3. computing the average distance between $T_p$ and any subtree of $T_r$ in T, and
4. using formula (10) from [43] and RADT to find the best allowed edge to regraft $T_p$.
Steps 2 and 3 are based on algorithms described in [43], which follow the same approach as the double depth-first search described in the previous section. These two steps require $O(n^2)$, just as Step 4. As there are $O(n)$ subtrees to prune and regraft, this implementation requires $O(n^3)$ to perform one search step.

5 RESULTS
5.1 Simulation Protocol
We applied our method and other existing methods to simulated datasets obtained using the procedure described in [18]. We uniformly randomly generated rooted tandem duplication trees (see [26]) with 12, 24, and 48 leaves and assigned lengths to the edges of these trees using the coalescent model [44]. We then obtained molecular clock trees (MC), which might be unrealistic in numerous cases, e.g., when the sequences being studied contain pseudogenes which evolve much faster than functional genes. Then, we generated nonmolecular clock trees (NO-MC) from the previous trees by independently multiplying

every edge length by $1 + 0.8X$, where X was drawn from an exponential distribution with parameter 1. MC trees were rescaled by multiplying every edge length by 1.8. The trees thus obtained (MC and NO-MC) have a maximum leaf-to-leaf divergence in the range $[0.1, 0.7]$, and in NO-MC trees the ratio between the longest and shortest root-to-leaf lineages is about 3.0 on average. Both values are in accordance with real data, e.g., gene families [8] or repeated protein domains [10].
SEQGEN [45] was used to produce a 1,000 bp-long nucleotide multiple alignment from each of the generated trees using the Kimura two-parameter model of substitution [46], and a distance matrix was computed by DNADIST [24] from this alignment using the same substitution model. For MC and NO-MC cases, 1,000 trees (and, then, 1,000 sequence sets and 1,000 distance matrices) were generated per tree size. These data sets were used to compare the ability of the various methods to recover the original trees from the sequences or from the distance matrices, depending on the method being tested. We measured the percentage of trees (out of 1,000) being correctly reconstructed (%tr). For the phylogeny reconstruction methods, we also kept the percentage of duplication trees among the set of inferred trees. Due to the random process used for generating these trees and datasets, some short branches might not have undergone any substitution (as during Evolution) and, thus, are unobtainable, except by chance. When n and, thus, the branch number is high, it becomes hard or impossible to find the entire tree. So, we also measured the percentage of duplication events in the true tree recovered by the inferred tree (%ev). A duplication event involves one or more internal nodes and is the lowest common ancestor of a set of leaves; we say it “covers” its descendent leaves. However, the leaves covered by a simple duplication event can change when the root position changes. As regards the true tree, the root is known and each event is defined by the set of leaves which it covers. But, the inferred tree is unrooted. To avoid ambiguity, we then tested all possible root positions and chose the one which gave the highest proximity in number of events detected between the true tree and the inferred

methods. TNT is acknowledged as one of the very best parsimony packages; it was run with 10 replicates and TBR rearrangements. TNT often returns a set of equally parsimonious trees. When this set contained duplication trees, we randomly selected one of them; when no duplication tree was inferred by TNT, we randomly selected one of the output trees.
Results are given in Tables 1 and 2. First, we observe that with $n = 48$ the true tree is almost never entirely found, for the reasons explained earlier. On the other hand, the best methods recover 80 to 95 percent of the duplication events, indicating that the tested datasets are relatively easy. NJ and TNT perform relatively well, but they often output trees that are not duplication trees, which is unsatisfactory (e.g., with 48 leaves and NO-MC, NJ and TNT only infer 1 percent and 5 percent of duplication trees, respectively). The GS approach is noteworthy since it modifies the trees inferred by NJ to transform them into duplication trees. However, GS is only slightly better than NJ regarding the proportion of correctly reconstructed trees, but considerably degrades the number of recovered duplication events, which could be explained by the blind search it performs to transform NJ trees into duplication trees. GTR also obtains relatively poor results. As expected from its assumptions, WINDOW performs better in the MC case than in the NO-MC one. Finally, DTSCORE obtains the best performance among the four existing methods, whatever the topological criterion considered.
Applying our method to starting trees produced by GS, GTR, WINDOW, and DTSCORE reveals the advantages of the local search approach. Optimizing parsimony or BME gives similar results, with a slight advantage for parsimony as expected from the relatively low divergence rates in our data sets. The trees produced by GS, GTR, and WINDOW are clearly improved and, for most, are better than those obtained by DTSCORE. DTSCORE trees are also improved, even though this improvement is not very high from a topological point of view. This could be explained by the fact that DTSCORE is already an accurate method with respect to the datasets used.
tree, where two events are identical if they cover the same
When we consider the parsimony criterion, the gain
leaves. Finally, we kept the average parsimony value of each
achieved by LSDT is appreciable for each start method. This
method (pars).
could be expected for GS, WINDOW and DTSCORE which
5.2 Performance and Comparison do not optimize this criterion; with n ¼ 48 in NO-MC case,
Using this protocol, we compared NJ [17], TNT [47], and the gain for GS is about 329, thus confirming that this
GREEDY-SEARCH (GS) [21] which starts from the NJ tree, a method is clearly suboptimal; the gains for WINDOW and
modified version of GREEDY TRHIST RESTRICTED (GTR) DTSCORE are about 42 and 15, which are lower but still
[9] to infer multiple duplication trees, WINDOWS [10], significant. The GTR results, which optimizes parsimony,
DTSCORE [18], and eight versions of our local search are more surprising since the gain (always with n ¼ 48 in
method LSDT corresponding to different starting duplica- NO-MC case) is about 77 on average, which is very high.
tion trees (GS, GTR, WINDOW, and DTSCORE) and Moreover, the parsimony value obtained by LSDT is very
different criteria (parsimony and BME). TNT and GS use close to that of TNT, in spite of a much more restricted
the parsimony criterion, but the other are distance-based search space. This confirms the good performance of our

TABLE 1
Performance Comparison Using Simulations (Molecular Clock Mode of Evolution)

X+LSDT_Y: X is the method used to obtain the starting tree and Y the criterion being optimized by LSDT; %tr: the percentage of trees being correctly
reconstructed; the percentage of duplication trees obtained by phylogeny reconstruction methods is given between parentheses; %ev: the
percentage of duplication events in the true tree being recovered by the inferred tree; pars: the average parsimony value.
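
The %ev measure can be sketched in a few lines. The following is only an illustration under simplifying assumptions: each duplication event is reduced to the set of leaves it covers, the inferred tree is given as adjacency lists, and every edge is tried as a root position. The names `clades_for_root` and `percent_events_recovered` and the toy trees are hypothetical and are not taken from the authors' software.

```python
from typing import Dict, FrozenSet, List, Set

Tree = Dict[str, List[str]]  # unrooted tree as adjacency lists; leaves have one neighbour


def clades_for_root(tree: Tree, u: str, v: str) -> Set[FrozenSet[str]]:
    """Leaf sets covered by each internal node when the root is placed on edge (u, v)."""
    clades: Set[FrozenSet[str]] = set()

    def collect(node: str, parent: str) -> FrozenSet[str]:
        children = [n for n in tree[node] if n != parent]
        if not children:  # a leaf covers only itself and is not an event
            return frozenset([node])
        covered = frozenset().union(*(collect(c, node) for c in children))
        clades.add(covered)
        return covered

    collect(u, v)
    collect(v, u)
    return clades


def percent_events_recovered(true_events: Set[FrozenSet[str]], tree: Tree) -> float:
    """Best percentage of true events found over all root positions (edges)."""
    edges = {(a, b) for a in tree for b in tree[a] if a < b}
    best = max(len(true_events & clades_for_root(tree, u, v)) for u, v in edges)
    return 100.0 * best / len(true_events)


# Toy data (hypothetical): true rooted tree (((A,B),C),(D,E)).
true_events = {frozenset("AB"), frozenset("ABC"), frozenset("DE")}
same_topology = {"A": ["x"], "B": ["x"], "C": ["y"], "D": ["z"], "E": ["z"],
                 "x": ["A", "B", "y"], "y": ["x", "C", "z"], "z": ["y", "D", "E"]}
c_and_d_swapped = {"A": ["x"], "B": ["x"], "D": ["y"], "C": ["z"], "E": ["z"],
                   "x": ["A", "B", "y"], "y": ["x", "D", "z"], "z": ["y", "C", "E"]}

print(percent_events_recovered(true_events, same_topology))
print(percent_events_recovered(true_events, c_and_d_swapped))
```

In this toy case the tree with the true topology recovers all three events, while swapping C and D leaves only the event covering {A, B} recoverable, whatever the root position. The sketch ignores the restriction of roots to positions allowed by the tandem duplication model and does not distinguish simple from multiple duplication events.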

local search method. It should be stressed that these gains are obtained at low computational cost as dealing with any of the 48-taxon datasets only requires about 10 seconds for parsimony and five seconds for BME on a standard PC-Pentium 4.

5.3 Analysis of the ZNF45 Family
Zinc finger (ZNF) genes code for proteins that contain one or more zinc finger motifs. The zinc finger motif is one of the most common motifs involved in nucleic acid-protein interaction. Experimental studies on functions of ZNF genes suggest that many of them code for transcription factors, and some of them are known to take part in cellular growth and development [48]. However, the biological functions of most ZNF genes are currently unknown. The 16 members of ZNF45 gene family are found in the q13.2 gene cluster on human chromosome 19 [49]. The organization and features of the members of the ZNF45 family suggest that the genes in the family may have been produced by a series of in situ

TABLE 2
Performance Comparison Using Simulations (No Molecular Clock of Evolution)

Note: see Table 1.



Fig. 10. (a) Duplication tree for the 16 genes of the human ZNF45 family inferred by DTSCORE plus LSDT with parsimony; black dots represent the only
allowed root positions, according to the tandem duplication model; the (arbitrarily) selected root position is circled. (b) Rooted duplication tree
corresponding to tree (a). (c) Phylogeny inferred by TNT. Tree (a) can be obtained from tree (c) by moving ZNF45 and ZNF228 to edge 1, and
ZNF233 to edge 2. Edge lengths in tree (a) and tree (c) were estimated by maximum likelihood [52]. Lengths in tree (b) are meaningless and were
adjusted to obtain a readable drawing.

gene duplication events [49]. The ZNF45 gene family has been previously studied by Tang et al. [10] and Zhang et al. [21], who proposed different tandem duplication trees to explain its evolutionary history.

We downloaded the DNA sequences of the 16 members of ZNF45 from NCBI. Multiple alignment was achieved using TCOFFEE,¹ with default settings. We removed gaps as usual in phylogenetics [22] and third codon positions which look saturated (734 parsimony steps are required to explain the evolution of the 237 sites). We thus obtained a final alignment² containing 474 homologous sites, with a maximum pairwise divergence of 0.45.

PAUP* [23] was used to estimate the matrix of pairwise distances, assuming the GTR substitution model [50] and a gamma distribution of rates with parameter 1.

1. http://igs-server.cnrs-mrs.fr/Tcoffee/tcoffee_cgi/index.cgi.
2. Available on request.

We used this distance matrix and DTSCORE to build a starting tree, which was then refined by LSDT using parsimony. We selected this criterion because of its good performance with simulated data (Tables 1 and 2). The resulting tree (Figs. 10a and 10b) is a simple DT requiring 897 steps to explain the extant sequences. We tried to improve this score using a computationally intensive ratchet approach [51], but were unable to obtain any other DT with better (or even identical) parsimony. We also ran TNT with ratchet, 1,000 random taxon addition replicates and TBR branch swapping (i.e., all TNT options to intensify the search) and found one maximum-parsimony phylogeny requiring 896 steps. This phylogeny (Fig. 10c) contains an unresolved node with degree 4 and is not a duplication tree.

The TNT phylogeny is close to the LSDT duplication tree. To transform from one to the other only three taxa have to be

TABLE 3
Analysis of the ZNF45 Data Set
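
To make concrete what a parsimony score such as those compared for the ZNF45 trees counts, the sketch below computes the number of substitution steps of a rooted binary tree with Fitch's counting rule, one site at a time. This is an illustration only: the text does not state which implementation the authors used, and the nested-tuple tree encoding, the function names, and the four toy sequences are assumptions.

```python
from typing import Dict, Set, Tuple, Union

Node = Union[str, Tuple["Node", "Node"]]  # leaf name or (left, right) subtree


def fitch_site(tree: Node, states: Dict[str, str]) -> Tuple[Set[str], int]:
    """Return (candidate ancestral states, number of steps) for one alignment site."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch_site(tree[0], states)
    right_set, right_cost = fitch_site(tree[1], states)
    shared = left_set & right_set
    if shared:  # children agree on at least one state: no extra step
        return shared, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree: Node, alignment: Dict[str, str]) -> int:
    """Sum the per-site Fitch step counts over a gap-free alignment."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch_site(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(n_sites)
    )


# Toy alignment (hypothetical), two sites, four sequences.
alignment = {"a": "AC", "b": "AC", "c": "GC", "d": "GT"}
tree_1 = (("a", "b"), ("c", "d"))
tree_2 = (("a", "c"), ("b", "d"))

print(parsimony_score(tree_1, alignment))
print(parsimony_score(tree_2, alignment))
```

On the toy data the first topology needs two steps and the second three, so the first is the more parsimonious explanation. The count obtained this way does not depend on where the root is placed, which is consistent with the root position in Fig. 10a being selected arbitrarily.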

moved (Fig. 10), and both trees differ by only 1 parsimony step. A similar difference was commonly observed in simulation where TNT found (non-DT) phylogenies requiring one parsimony step less (on average) than the DTs found by LSDT (Tables 1 and 2), though the true tree used to generate the sequences was a DT. Thus, having (only) one parsimony step of difference between the best DT and the best phylogeny is not significant and can be seen as supporting the duplication model. Moreover, the discrepancy between the two trees can be explained by long branch attraction, a phenomenon that frequently affects parsimony-based reconstructions [53]. Indeed, ZNF180 and ZNF229 genes are distant from the other genes (Figs. 10a and 10c) and might perturb the whole tree. When removing those two genes from the data set, both LSDT and TNT found the same tree, which is identical to the LSDT tree of Fig. 10a without the two genes. With 14 segments, the probability of randomly picking a duplication tree among all distinct phylogenies is less than $10^{-4}$ [26]. This extremely small probability indicates that the identity between LSDT and TNT trees is very unlikely to be due to chance. This provides strong support for the tandem duplication model and indicates that our LSDT tree likely represents most, if not all, of the history of the ZNF45 family.

We compared trees obtained by Tang et al. [10], Zhang et al. [21], and those of the other programs to the LSDT tree of Fig. 10. We computed the parsimony score of each tree and the percentage of events shared by each tree with the LSDT tree. Just as in the simulation study, we tested GS [21], GTR [9], WINDOW [10], DTSCORE [8], and LSDT using different starting points but optimizing parsimony in all cases.

Results are displayed in Table 3 and confirm those obtained with simulated data sets. Results of trees from [10] and [21] are poor, which was expected as these methods (WINDOW and GS, respectively) do not optimize the parsimony criterion and as we did not use the same alignment. GS is relatively poor, while DTSCORE, WINDOW, and GTR perform better. LSDT clearly improves these four methods, with gains ranging from 10 to 50 parsimony steps. In all cases but GTR, LSDT recovers the most parsimonious DT of Fig. 10.

6 CONCLUSION AND PROSPECTS
We have demonstrated that restricting the neighborhood defined by the SPR rearrangement to valid duplication trees allows the whole DT space to be explored. Thanks to these rearrangements, we have defined a general local search method which we used to optimize the parsimony and balanced minimum evolution criteria. We have thus improved the topological accuracy of all the tested methods.

Several research directions are possible. Finding the set of combinatorial configurations for the SPR rearrangement which necessarily produce a duplication tree could allow the neighborhood computation to be accelerated (e.g., for n = 48 only 5 percent of the SPR neighborhood corresponds to duplication trees) and, furthermore, give more insight into the nature of duplication trees, which are just starting to be investigated mathematically [12], [26], [27]. Our local search method could be improved using restricted TBR rearrangements or with the help of different stochastic approaches (taboo, noising, ...) in order to avoid local minima. Moreover, it would be relevant to test this local search method with other criteria like maximum likelihood. Finally, combining the tandem duplication events with speciation events, as described in [54] and [55] for nontandem duplications, would be relevant for real applications where we have homologous tandem repeats from several genomes.

ACKNOWLEDGMENTS
The authors would like to thank Wafae El Alaoui for her help with ZNF45 family genes, and Richard Desper, Wim Hordijk and the referees of the Workshop on Algorithms in Bioinformatics (WABI ’04) for reading preliminary versions of this paper. This work was supported by ACI-IMPBIO (Ministère de la Recherche, France).

REFERENCES
[1] F. Blattner, G. Plunkett, C. Bloch, N. Perna, V. Burland, M. Riley, J. Collado-Vides, J. Glasner, C. Rode, G. Mayhew, J. Gregor, N. Davis, H. Kirkpatrick, M. Goeden, D. Rose, B. Mau, and Y. Shao, “The Complete Genome Sequence of Escherichia Coli K-12,” Science, vol. 277, no. 5331, pp. 1453-1474, 1997.
[2] E. Lander et al., “Initial Sequencing and Analysis of the Human Genome,” Nature, vol. 409, pp. 860-921, 2001.
[3] A. Smit, “Interspersed Repeats and Other Mementos of Transposable Elements in Mammalian Genomes,” Current Opinion in Genetics & Development, vol. 9, pp. 657-663, 1999.
[4] W. Fitch, “Phylogenies Constrained by Cross-Over Process as Illustrated by Human Hemoglobins in a Thirteen-Cycle, Eleven Amino-Acid Repeat in Human Apolipoprotein A-I,” Genetics, vol. 86, pp. 623-644, 1977.
[5] G. Levinson and G. Gutman, “Slipped-Strand Mispairing: A Major Mechanism for DNA Sequence Evolution,” Molecular Biology and Evolution, vol. 4, pp. 203-221, 1987.
[6] J. Zhang and M. Nei, “Evolution of Antennapedia-Class Homeobox Genes,” Genetics, vol. 142, no. 1, pp. 295-303, 1996.
[7] O. Elemento and O. Gascuel, “An Exact and Polynomial Distance-Based Algorithm to Reconstruct Single Copy Tandem Duplication Trees,” Proc. 14th Ann. Symp. Combinatorial Pattern Matching (CPM2003), 2003.
[8] O. Elemento, O. Gascuel, and M.-P. Lefranc, “Reconstructing the Duplication History of Tandemly Repeated Genes,” Molecular Biology and Evolution, vol. 19, pp. 278-288, 2002.
[9] G. Benson and L. Dong, “Reconstructing the Duplication History of a Tandem Repeat,” Proc. Intelligent Systems in Molecular Biology (ISMB1999), T. Lengauer, ed., pp. 44-53, 1999.
[10] M. Tang, M. Waterman, and S. Yooseph, “Zinc Finger Gene Clusters and Tandem Gene Duplication,” J. Computational Biology, vol. 9, pp. 429-446, 2002.
[11] E. Rivals, “A Survey on Algorithmic Aspects of Tandem Repeats Evolution,” Int’l J. Foundations of Computer Science, vol. 15, no. 2, pp. 225-257, 2004.
[12] O. Gascuel, D. Bertrand, and O. Elemento, “Reconstructing the Duplication History of Tandemly Repeated Sequences,” Math. of Evolution and Phylogeny, O. Gascuel, ed., 2004.
[13] S. Ohno, Evolution by Gene Duplication. Springer Verlag, 1970.
[14] P.L. Fleche, Y. Hauck, L. Onteniente, A. Prieur, F. Denoeud, V. Ramisse, P. Sylvestre, G. Benson, F. Ramisse, and G. Vergnaud, “A Tandem Repeats Database for Bacterial Genomes: Application to the Genotyping of Yersinia Pestis and Bacillus Anthracis,” BioMed Central Microbiology, vol. 1, pp. 2-15, 2001.
[15] D. Jaitly, P. Kearney, G. Lin, and B. Ma, “Methods for Reconstructing the History of Tandem Repeats and Their Application to the Human Genome,” J. Computer and System Sciences, vol. 65, pp. 494-507, 2002.
[16] P. Sneath and R. Sokal, Numerical Taxonomy. pp. 230-234, San Francisco: W.H. Freeman and Company, 1973.
[17] N. Saitou and M. Nei, “The Neighbor-Joining Method: A New Method for Reconstructing Phylogenetic Trees,” Molecular Biology and Evolution, vol. 4, pp. 406-425, 1987.
[18] O. Elemento and O. Gascuel, “A Fast and Accurate Distance-Based Algorithm to Reconstruct Tandem Duplication Trees,” Bioinformatics, vol. 18, pp. 92-99, 2002.
[19] J. Barthélemy and A. Guénoche, Trees and Proximity Representations. Wiley and Sons, 1991.
[20] S. Sattath and A. Tversky, “Additive Similarity Trees,” Psychometrika, vol. 42, pp. 319-345, 1977.
[21] L. Zhang, B. Ma, L. Wang, and Y. Xu, “Greedy Method for Inferring Tandem Duplication History,” Bioinformatics, vol. 19, pp. 1497-1504, 2003.
[22] D. Swofford, P. Olsen, P. Waddell, and D. Hillis, Molecular Systematics. pp. 407-514, Sunderland, Mass.: Sinauer Associates, 1996.
[23] D. Swofford, PAUP*. Phylogenetic Analysis Using Parsimony (*and Other Methods), version 4. Sunderland, Mass.: Sinauer Associates, 1999.
[24] J. Felsenstein, “PHYLIP - PHYLogeny Inference Package,” Cladistics, vol. 5, pp. 164-166, 1989.
[25] C. Semple and M. Steel, Phylogenetics. Oxford Univ. Press, 2003.
[26] O. Gascuel, M. Hendy, A. Jean-Marie, and S. McLachlan, “The Combinatorics of Tandem Duplication Trees,” Systematic Biology, vol. 52, pp. 110-118, 2003.
[27] J. Yang and L. Zhang, “On Counting Tandem Duplication Trees,” Molecular Biology and Evolution, vol. 21, pp. 1160-1163, 2004.
[28] D. Robinson, “Comparison of Labeled Trees with Valency Three,” J. Combinatorial Theory, vol. 11, pp. 105-119, 1971.
[29] L. Wang and D. Gusfield, “Improved Approximation Algorithms for Tree Alignment,” J. Algorithms, vol. 25, pp. 255-273, 1997.
[30] Y. Pauplin, “Direct Calculation of a Tree Length Using a Distance Matrix,” J. Molecular Evolution, vol. 51, pp. 41-47, 2000.
[31] R. Desper and O. Gascuel, “Theoretical Foundation of the Balanced Minimum Evolution Method of Phylogenetic Inference and Its Relationship to Weighted Least-Squares Tree Fitting,” Molecular Biology and Evolution, vol. 21, no. 3, pp. 587-598, 2004.
[32] W. Fitch, “Toward Defining the Course of Evolution: Minimum Change for a Specified Tree Topology,” Systematic Zoology, vol. 20, pp. 406-416, 1971.
[33] J. Hartigan, “Minimum Mutation Fits to a Given Tree,” Biometrics, vol. 29, pp. 53-65, 1973.
[34] G. Ganapathy, V. Ramachandran, and T. Warnow, “Better Hill-Climbing Searches for Parsimony,” Proc. Third Int’l Workshop Algorithms in Bioinformatics, 2003.
[35] P.A. Goloboff, “Methods for Faster Parsimony Analysis,” Cladistics, vol. 12, pp. 199-220, 1996.
[36] V. Berry and O. Gascuel, “Inferring Evolutionary Trees with Strong Combinatorial Evidence,” Theoretical Computer Science, vol. 240, pp. 271-298, 2000.
[37] M. Kimura, “A Simple Model for Estimating Evolutionary Rates of Base Substitutions through Comparative Studies of Nucleotide Sequences,” J. Molecular Evolution, vol. 16, pp. 111-120, 1980.
[38] D. Jones, W. Taylor, and J. Thornton, “The Rapid Generation of Mutation Data Matrices from Protein Sequences,” Computer Applications in Biosciences, vol. 8, pp. 275-282, 1992.
[39] K. Kidd and L. Sgaramella-Zonta, “Phylogenetic Analysis: Concepts and Methods,” Am. J. Human Genetics, vol. 23, pp. 235-252, 1971.
[40] A. Rzhetsky and M. Nei, “Theoretical Foundation of the Minimum-Evolution Method of Phylogenetic Inference,” Molecular Biology and Evolution, vol. 10, pp. 1073-1095, 1993.
[41] W. Day, “Computational Complexity of Inferring Phylogenies from Dissimilarity Matrices,” Bull. Math. Biology, vol. 49, pp. 461-467, 1987.
[42] C. Semple and M. Steel, “Cyclic Permutations and Evolutionary Trees,” Advances in Applied Math., vol. 32, no. 4, pp. 669-680, 2004.
[43] R. Desper and O. Gascuel, “Fast and Accurate Phylogeny Reconstruction Algorithms Based on the Minimum-Evolution Principle,” J. Computational Biology, vol. 9, pp. 687-706, 2002.
[44] M. Kuhner and J. Felsenstein, “A Simulation Comparison of Phylogeny Algorithms under Equal and Unequal Evolutionary Rates,” Molecular Biology and Evolution, vol. 11, pp. 459-468, 1994.
[45] A. Rambaut and N. Grassly, “Seq-Gen: An Application for the Monte Carlo Simulation of DNA Sequence Evolution Along Phylogenetic Trees,” Computer Applications in the Biosciences, vol. 13, pp. 235-238, 1997.
[46] J. Felsenstein and G. Churchill, “A Hidden Markov Model Approach to Variation Among Sites in Rate of Evolution,” Molecular Biology and Evolution, vol. 13, pp. 93-104, 1996.
[47] P.A. Goloboff, J.S. Farris, and K. Nixon, “TNT: Tree Analysis Using New Technology,” 2000, www.cladistics.com.
[48] T. El-Barabi and T. Pieler, “Zinc Finger Proteins: What We Know and What We Would Like to Know,” Mechanisms of Development, vol. 33, pp. 155-169, 1991.
[49] M. Shannon, J. Kim, L. Ashworth, E. Branscomb, and L. Stubbs, “Tandem Zinc-Finger Gene Families in Mammals: Insights and Unanswered Questions,” DNA Sequence - The J. Sequencing and Mapping, vol. 8, no. 5, pp. 303-315, 1998.
[50] P. Waddell and M. Steel, “General Time Reversible Distances with Unequal Rates Across Sites: Mixing Γ and Inverse Gaussian Distributions with Invariant Sites,” Molecular Phylogenetics and Evolution, vol. 8, pp. 398-414, 1997.
[51] K.C. Nixon, “The Parsimony Ratchet, a New Method for Rapid Parsimony Analysis,” Cladistics, vol. 15, pp. 407-414, 1999.
[52] S. Guindon and O. Gascuel, “A Simple, Fast and Accurate Method to Estimate Large Phylogenies by Maximum-Likelihood,” Systematic Biology, vol. 52, no. 5, pp. 696-704, 2003.
[53] J. Felsenstein, “Cases in Which Parsimony or Compatibility Methods Will Be Positively Misleading,” Systematic Zoology, vol. 27, pp. 401-410, 1978.

[54] D. Page and M. Charleston, “From Gene to Organismal Phylogeny: Reconciled Trees and the Gene Tree/Species Tree Problem,” Molecular Phylogenetics and Evolution, vol. 7, pp. 231-240, 1997.
[55] M. Hallett, J. Lagergren, and A. Tofigh, “Simultaneous Identification of Duplications and Lateral Transfers,” Proc. Conf. Research and Computational Molecular Biology (RECOMB2004), pp. 347-356, 2004.

Denis Bertrand is a PhD student under the supervision of Olivier Gascuel. His research subject is the study of tandemly repeated sequences. His main areas of interest are phylogenetics, combinatorics, and algorithms.

Olivier Gascuel is Directeur de Recherche at the Centre National de la Recherche Scientifique (France). He is the head of the bioinformatics group from the LIRMM laboratory, belongs to the editorial board of Systematic Biology and of BMC Evolutionary Biology, and has served in a number of program committees of bioinformatics conferences (ISMB, WABI). He started in this field in the mid 1980s, with works on sequence analysis and protein structure prediction. Since the beginning of the 1990s, he turned his efforts to phylogenetics, focusing on the mathematical and computational tools and concepts. He (co)authored several well-known phylogeny inference programs (BioNJ, PHYML, FastME).


Optimizing Multiple Seeds for Protein Homology Search

Daniel G. Brown

Abstract—We present a framework for improving local protein alignment algorithms. Specifically, we discuss how to extend local
protein aligners to use a collection of vector seeds or ungapped alignment seeds to reduce noise hits. We model picking a set of seed
models as an integer programming problem and give algorithms to choose such a set of seeds. While the problem is NP-hard, and
Quasi-NP-hard to approximate to within a logarithmic factor, it can be solved easily in practice. A good set of seeds we have chosen
allows four to five times fewer false positive hits while preserving essentially the same sensitivity as BLASTP.

Index Terms—Bioinformatics database applications, similarity measures, biology and genetics.

1 INTRODUCTION

Pairwise alignment is one of the most important problems in bioinformatics. Here, we continue an exploration into the seeding and structure of local pairwise alignments and show that a recent strategy for seeding nucleotide alignments can be expanded to protein alignment. Heuristic protein sequence aligners, exemplified by BLASTP [1], find almost all high-scoring alignments. However, the sensitivity of heuristic aligners to moderate-scoring alignments can still be poor. In particular, alignments with BLASTP score between 40 and 60 are commonly missed by BLASTP, even though many are of truly homologous sequences. We focus on these alignments and show that a change to the seeding strategy gives success rates comparable to BLASTP with far fewer false positive hits.

Specifically, multiple spaced seeds [2] and their relatives, vector seeds [3], can be used in local protein alignment to reduce the false positive rate in the seeding step of alignment by a factor of four. We present a protocol for choosing multiple vector seeds that allows us to find good seeds that work well together. Our approach is based on solving a set-cover integer program whose solution gives optimal thresholds for a collection of seeds. Our IP is prone to overtraining, so we discuss how to reduce the dependency of the solution on the set of training alignments, both by increasing the false positive rate of the seeds found slightly and by making the program less sensitive to outliers. The problem we are trying to solve is NP-hard and Quasi-NP-hard to approximate to a sublogarithmic factor, so we present heuristics for it, though most instances are of moderate enough size to use integer programming solvers.

Our successful result here contrasts with our previous work [3] in which we introduced vector seeds. There, we found that using only one vector seed would not substantially improve BLASTP’s sensitivity or selectivity. The use of multiple seeds is the important change in the present work. This successful use of multiple seeds is similar to what has been reported recently for pairwise nucleotide alignment [4], [5], [6], but the approach we use is different since protein aligners require extremely high sensitivity. We note that, independently of our work, the authors of PatternHunter, the first program to use optimized spaced seeds, have developed a protein aligner based on seeding approaches similar to those we discuss here [7]; however, they have not offered theoretical justification for their approach, which, in some sense, we provide here.

Our results confirm the themes developed by us and others since the initial development of spaced seeds. The first theme is that spaced seeds help in heuristic alignment because the very surprisingly conserved regions that one uses as a basis for building an alignment happen more independently in true alignments than for unspaced seeds. In protein alignments, there are often many small regions of high conservation, each of which has a chance to have a hit to a seed in it. With unspaced seeds, the probability that any one of these regions is hit is low, but, when a region is hit, there may be several more hits, which is unhelpful. By contrast, a spaced seed is likely to hit a given region fewer times, wasting less runtime, and will also hit at least one region in more alignments, increasing sensitivity.

The second theme is that the more one understands how local and global alignments look, the more possible it is to tailor alignment seeding strategies to a particular application, reducing false positives and improving true positives. Here, by basing our set of seeds on sensitivity to true alignments, we choose a set of seed models that hit diverse types of short conserved alignment subregions. Consequently, the probability that one of them hits a given alignment is high since they complement each other well.

The author is with the School of Computer Science, University of Waterloo, 200 University Ave., West, Waterloo, ON N2L 3G1, Canada. E-mail: browndg@uwaterloo.ca.
Manuscript received 1 Nov. 2004; revised 2 Jan. 2005; accepted 11 Jan. 2005; published online 30 Mar. 2005.
For information on obtaining reprints of this article, please send e-mail to: tcbb@computer.org, and reference IEEECS Log Number TCBBSI-0183-1104.
1545-5963/05/$20.00 © 2005 IEEE. Published by the IEEE CS, CI, and EMB Societies & the ACM
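
As a concrete illustration of the seeding step discussed in the Introduction, the sketch below tests the hit condition of a vector seed (v, T) as it is defined in Section 2.1: the dot product of v with the scores of k consecutive aligned position pairs must reach the threshold T. It is a minimal sketch under stated assumptions: `toy_score` is a hypothetical stand-in for a BLOSUM or PAM matrix (+5 for a match, -1 otherwise), positions are 0-indexed, and the quadratic scan is for clarity only and is not how BLASTP or the author's software locates hits.

```python
from typing import Callable, List, Sequence, Tuple


def toy_score(a: str, b: str) -> int:
    """Hypothetical scoring function standing in for a BLOSUM/PAM lookup."""
    return 5 if a == b else -1


def vector_seed_hits(
    A: str,
    B: str,
    v: Sequence[int],
    T: int,
    score: Callable[[str, str], int] = toy_score,
) -> List[Tuple[int, int]]:
    """Return all (i, j) with v . (s[i,j], ..., s[i+k-1,j+k-1]) >= T."""
    k = len(v)
    hits = []
    for i in range(len(A) - k + 1):
        for j in range(len(B) - k + 1):
            total = sum(v[p] * score(A[i + p], B[j + p]) for p in range(k))
            if total >= T:
                hits.append((i, j))
    return hits


A, B = "MKVLAW", "MKVLGW"  # toy sequences differing at one position
consecutive = vector_seed_hits(A, B, (1, 1, 1), 13)  # BLASTP-like seed ((1, 1, 1), 13)
spaced = vector_seed_hits(A, B, (1, 0, 1), 10)       # binary vector with a "don't care" position

print(consecutive)
print(spaced)
```

With these toy sequences the consecutive seed hits only at (0, 0) and (1, 1), whereas the spaced vector (1, 0, 1) also hits at (3, 3), because its middle position ignores the A/G mismatch. The thresholds 13 and 10 are meaningful only relative to the toy scores used here.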

2 BACKGROUND: HEURISTIC ALIGNMENT AND SPACED SEEDS

Since the development of heuristic sequence aligners [1], the same approach has been commonly used: identify short, highly conserved regions and build local alignments around these “hits.” This avoids the use of the Smith-Waterman algorithm [8] for pairwise local alignment, which has $\Theta(nm)$ runtime on input sequences A and B of length n and m, respectively. (We will use the notation $A[i]$ to represent the ith character of sequence A.)

Instead, assuming random sequences, the expected runtime of this heuristic search method is $h(n, m) + a(n, m)$, where $h(n, m)$ is the amount of time needed to find hits in the two sequences and $a(n, m)$ is the expected time needed to compute the alignments from the hits. Most heuristic aligners have $h(n, m) = \Theta(n + m + nm/k)$, while $a(n, m) = \Theta(nm/k)$ for some large constant k. There are many assumptions in these formulas. First, even when we align sequences with true homologies, most hits are between unrelated positions, so the estimation of the runtime need not consider whether the sequences are related. Further, this simplification assumes that each hit found in the first phase results in a constant amount of work being done in the second phase to identify that it is false (or that true hits are rare). It is the speedup factor of k that is important here; assuming m and n are large, the overall runtime is much faster.

Most heuristic aligners look at the scores of matching characters in short regions and use high-scoring short regions as hits. For example, BLASTP [1] hits are three consecutive positions in the two sequences where the total score, according to a BLOSUM or PAM scoring matrix, of aligning the three letters in one sequence to the three letters of the other sequence is at least +13. Finding such hits can be done easily, for example, by making a hash table of one sequence and searching positions of the hash table for the other sequence, in time proportional to the length of the sequences and the number of hits found. BLASTP uses more complicated data structures for this process, but the principle is similar.

2.1 Seeding Models
To generalize BLASTP’s hits, we defined vector seeds [3], [9]. A vector seed is a pair $(v, T)$. Vector $v = (v_1, \ldots, v_k)$ is a vector of position multipliers and T is a threshold. Given two sequences A and B, let $s_{i,j}$ be the score in our scoring matrix of aligning $A[i]$ to $B[j]$. If we consider position i in A and j in B, we then get a hit to the vector seed at those positions when $v \cdot (s_{i,j}, s_{i+1,j+1}, \ldots, s_{i+k-1,j+k-1}) \geq T$. In this framework, BLASTP’s seed is ((1, 1, 1), 13).

Vector seeds generalize the earlier idea of spaced seeds [2] for nucleotide alignments, where both scores and the vector are 0/1 vectors and where T, the threshold, equals the number of 1s in v. A spaced seed requires an exact match in the positions where the vector is 1 and the places where the vector is 0 are “don’t care” positions. In our original work with vector seeds [3], the freedom to allow positions of v to have values other than 0 and 1 was not extremely useful, so the vector seeds we discuss here all have binary vectors v.

Spaced seeds have the same expected number of junk hits as unspaced seeds. For unrelated noise DNA sequences, this is $nm4^{-w}$, where w is the number of ones in the seed (its support). Their advantage comes because more distinct internal subregions of a given alignment will match a spaced seed than the unspaced seed; this happens because the hits are more independent of each other. The probability that an alignment of length 64 with 70 percent conservation matches a good spaced seed of support 11 can be greater than 45 percent because there are likely to be more subregions that match the spaced seed than the unspaced seed; by contrast, the default BLASTN seed, which is 11 consecutive required matches, hits only 30 percent of alignments.

Spaced seeds have three advantages over unspaced seeds. First, their hits are more independent, which means that it is more likely that a given alignment has at least one hit to a seed; fewer alignments have many. Second, the seed model can be tailored to a particular application: If there is structure or periodicity to alignments, this can be reflected in the design of the seeds chosen. For example, in searching for homologous codons, they can be tailored to the three-periodic structure of such alignments [10], [11]. Finally, the use of multiple seeds allows us to boost sensitivity well above what is achievable with a single seed, which, for nucleotide alignment, can give near 100 percent sensitivity in reasonable runtime [4].

Keich et al. [12] have given an algorithm for a simple model of alignments to compute the probability that an alignment hits a seed; this has been extended by both Buhler et al. [10] and Brejova et al. [11] to more complex sequence models. Choi et al. [13] have also shown experimental results for spaced seeds with high sensitivity across a wide range of homologies. Kucherov et al. [14] show how to adapt spaced seeds to the interesting case of alignments where no subregion of the alignment has a higher score than the entire alignment.

2.2 Some Newer Seeding Models
Another seeding model, which has recently arisen [7], [15], is that of ungapped alignment seeds. These were developed by Brown and Hudek [15] to anchor global alignments of ambiguous DNA sequences and, independently, by Kisman et al. [7] in their heuristic protein aligner, tPatternHunter. An ungapped alignment seed is a vector v, a global threshold T, and a vector of positional minimum scores b. There is a match between positions in the two sequences when the vector of pairwise match scores is at least as large, position-by-position, as the minimum scores vector b and where the dot product of the position-by-position scores and the multiplier vector v is at least T. These seeds are a compromise between spaced seeds and consecutive seeds: They require spaced positions to have good scores (those
BROWN: OPTIMIZING MULTIPLE SEEDS FOR PROTEIN HOMOLOGY SEARCH 31

where the lower bound vector $b$ has high values), while also focusing on the quality of the local alignment at the seed by possibly examining all of the positions of the seed. It is not possible to cast an ungapped alignment seed in the language of vector seeds because of the requirement that each individual position’s score is greater than its bound. It is possible to cast a vector seed as an ungapped alignment seed, by setting the $b$ vector to $-\infty$ in all positions, thus removing the position-by-position lower bound requirement.

Csürös [16] has also extended this framework of seeding to look at variable-length seeds, where the length of the regions that must match depends on their positional scores. While this approach can also be brought into the framework of the present work, we have not done so in our experiments.

2.3 Multiple Seeds
Another important extension to these ideas of seeding has been the use of multiple seeds of different sorts in basing alignments. In this approach, an attempt is made to perform extension when any of a collection of seed models has a hit. This will work well if each chosen seed has a very low false positive rate so that their total false positive rate is still below that of one seed of comparable sensitivity.

Several authors [2], [3], [4], [6], [10], [17] have proposed using multiple seeds and given heuristics to choose them. This problem was recently given a theoretical framework by Xu et al. [5] and, independently, Kucherov et al. [18] studied heuristic algorithms for identifying sets of good seeds. In work unrelated to the present work, Kisman et al. [7] have heuristically used multiple ungapped alignment seeds (though not called by that term) for protein alignment. To the best of our knowledge, the present work is the first work to choose multiple seeds for protein alignment with a theoretical basis.

3 CHOOSING A GOOD SET OF SEEDS
Spaced seeds have made a substantial impact in nucleotide alignments, but less in protein alignment. Here, we show that they have use in this domain as well. Specifically, multiple vector seeds or multiple ungapped alignment seeds, with high thresholds, give essentially the sensitivity of BLASTP with four times fewer noise hits. Slightly fewer alignments are hit, but the regions of alignment hit by the vector seeds are all of the same good ones as hit by the BLASTP seed and a few more. In other words, BLASTP hits more alignments, but the hits found by BLASTP and not the vector seeds are mostly in areas unlikely to be expanded to full alignments.

We adapt a framework for identifying sets of seeds introduced by Xu et al. [5]. We model multiple seed selection as a set cover problem and give heuristics for the problem. For our purposes, one advantage of the formulation is that it works with explicit alignments: Since real alignments may not look like a probabilistic model, we can pick a set of seeds for sensitivity to a collection of true alignments. Unfortunately, this also gives rise to problems, as the thresholds may be set high due to overtraining for a given set of alignments.

Most of our experiments concern themselves with vector seeds, but the framework can be expanded straightforwardly to ungapped alignment seeds as well. This is because we do not compute theoretical sensitivity of the seeds, but, instead, only identify hits in existing real alignments. Indeed, our framework is quite broad and extends to many different models for seeding as long as the assumption that false positives are additive is reasonably accurate and one can compute the false positive rate for the seed models. Where the ungapped alignment seeds require some thought, we present the addition needed for them.

3.1 Background Rates
One important detail that we need before we begin is the background hit rate for a given vector or ungapped alignment seed. We noted previously [3] that this can be computed for vector seeds, given a scoring matrix; it is also straightforward to compute for ungapped alignment seeds. Namely, from the scoring matrix, we can compute the distribution of letters in random sequences implied by the matrix; this can then be used to compute the distribution of scores found in unrelated sequences. Using this, we can compute the probability that unrelated sequences give a hit to a given seed at a random position, which we call the false positive rate for that seed. In fact, we can easily compute the entire probability distribution on the score for a given seed vector at a random position. Similarly, we can compute this probability under the constraint that positional scores have minimum value, thus expanding to ungapped alignment seeds.

For the default BLASTP seed, the probability that two random unrelated positions have a hit is quite high, 1/1,600. Because of this high level of false positives, BLASTP must filter hits further in hopes of throwing out hits in unrelated sequences. Specifically, BLASTP rapidly examines the local area around a hit and, if this region is not also well-conserved, the hit is thrown out. Sometimes, this filtering throws out all of the hits found in some true alignments and, thus, BLASTP misses them, even though they hit the seed. One way of modeling this filtering is to view BLASTP as testing two seeds simultaneously: The vector seed ((1, 1, 1), 13) and an ungapped alignment seed that looks at the region surrounding the seed hit.

Our goal in using other seed models here is to reduce the false positive rate, while still hitting the overwhelming majority of alignments and hitting them in places that are highly enough conserved as to make a full alignment likely. A flowchart of our proposal, and the approach of BLASTP, is in Fig. 1.

For a set $Q$ of alignment seeds, we say that its false positive rate is the probability that any seed in $Q$ has a hit to two random positions in unrelated sequences. This is not equal to the sum of the false positive rates for all seeds in $Q$ since hits to

one seed may overlap hits to another. However, we will use this approximation in our optimization. As we extend to a very large collection of seeds in $Q$, this can become worrisome as the same false positive may be counted many times. However, this may be appropriate, in fact, depending on how the search is done to find the false hits.

Fig. 1. Flowchart contrasting BLASTP’s approach to heuristic sequence alignment to the one proposed here. The only difference is in the initial collection of hits. The smaller collection of hits found with the variations on seeds gives as many hits to true alignments that survive to the third stage as does BLASTP, yet far fewer noise hits must be filtered out.

3.2 An Integer Program to Choose Many Seeds
Here, we give an integer program to find the set of seeds that hits all alignments in a given training set with overall lowest possible false positive rate. We will show that our IP encodes the Set-Cover problem and that it is NP-hard to solve and Quasi-NP-hard even to approximate to a sublogarithmic factor. However, for moderate-sized training sets, we can solve it, in practice, or use simple heuristics to get good solutions.

Given a set of alignment seeds $Q$, we say that they hit a given alignment $a$ if any member of $Q$ has a hit to the alignment. Our goal in picking such a set will be to minimize the false positive rate of the set $Q$, with the requirement that we hit all alignments in a training collection, $A$.

This optimization goal is the alternative to the goal of Xu et al. [5]. In that work, we maximized seed sensitivity when a maximum number of spaced seeds is allowed; given that all possible seeds had the same false positive rate, this was equivalent to maximizing sensitivity for a given false positive rate. This alternative goal of minimizing false positives when we want 100 percent sensitivity on the training set is appropriate for protein alignment, where we want to achieve extremely high sensitivity, as close to 100 percent as possible.

3.2.1 The Integer Program
Here, we show how to cast this seed selection problem as an integer program. Recall that a seed model includes the vector $v$ of multipliers or, for an ungapped alignment seed, the vector $v$ of multipliers and the vector $b$ of positional lower bounds. We will call this vector or vectors the “pattern” of a seed. We can then view choosing a set of vector or ungapped alignment seeds as choosing thresholds for each pattern.

More formally, suppose we are given a collection of alignments $A = \{a_1, \ldots, a_m\}$ and a set of seed patterns $P = \{p_1, \ldots, p_n\}$. We will choose thresholds $(T_1, \ldots, T_n)$ for the patterns of $P$ such that the seed model set $Q = \{(p_1, T_1), \ldots, (p_n, T_n)\}$ hits all alignments in $A$ and the false positive rate of $Q$ is as low as possible. The $T_i$ may be $\infty$, which corresponds to not choosing the pattern $p_i$ at all.

We require that each alignment $a_j$ must be hit, so one of the thresholds must be low enough to hit $a_j$. To verify this, we compute the best-scoring hit for each seed pattern $p_i$ in each alignment $a_j$; let the score of this hit be $T_{i,j}$. If we choose $T_i$ so that it is at most $T_{i,j}$, then the seed $(p_i, T_i)$ will hit alignment $a_j$.

To model this as an integer program, we have a collection of integer variables $x_{i,T}$ for each possible threshold value for seed pattern $p_i$. We note that this requires the number of possible thresholds to be small, or reasonably discretized, since each possible threshold will get its own constraint. For simple seeds from a BLOSUM matrix, the scores at a position come in a small range of integers, so the possible reasonable thresholds form a small range; let $T_m$ be the smallest such threshold. We will set variable $x_{i,T}$ to 1 when the threshold for seed pattern $p_i$ is at most $T$; for each pattern $p_i$, its threshold chosen is the smallest $T$, where $x_{i,T} = 1$.

To compute the false positive rate, we let $r_{i,T}$ be the probability that a random place in the background model has score exactly $T$ according to the seed model $(p_i, T)$. We add these up for all of the false hits with score equal to or greater than the chosen thresholds. Our integer program is as follows:

$$
\begin{aligned}
\min \; & \sum_{i,T} x_{i,T}\, r_{i,T}, \quad \text{such that} \\
& \sum_{i} x_{i,T_{i,j}} \geq 1 \quad \text{for all alignments } a_j, \\
& x_{i,T} \geq x_{i,T-1} \quad \text{for all thresholds } T > T_m, \\
& x_{i,T} \in \{0, 1\} \quad \text{for all } i \text{ and } T.
\end{aligned}
$$

Our framework is quite general: Given any collection of alignments and the sensitivity of a collection of seeds to the alignments, one can use this IP formulation to choose thresholds to hit all alignments while minimizing false positives. In particular, one could require that a hit satisfy multiple seeds simultaneously or use more complicated hit formulations. Of course, for these harder models, one might have a more difficult time optimizing the integer program.

3.2.2 NP-Hardness
We now show that the problem of optimizing the seed set to minimize the false positive rate while hitting all alignments is NP-hard and that it is Quasi-NP-hard to approximate to within a logarithmic factor [19]. (That is, assuming NP does not have deterministic algorithms running in $O(n^{O(\log \log n)})$ time, no polynomial-time algorithm exists with approximation ratio $o(\log n)$.)
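As a minimal sketch of the integer program of Section 3.2.1, the following Python code solves a tiny instance by exhaustive search rather than with an IP solver, which is only feasible for a handful of patterns. The function name, the input layout, and all numbers are illustrative assumptions, not data from our experiments; `tail_rate[i][T]` stands for the sum of $r_{i,T'}$ over all $T' \geq T$, and `best_hit[i][j]` stands for $T_{i,j}$.

```python
from itertools import product
from math import inf

def choose_thresholds(tail_rate, best_hit):
    """Exhaustively pick one threshold per pattern (inf = pattern unused).

    tail_rate[i][T]: background probability that pattern i scores >= T,
                     i.e. the sum of r_{i,T'} over T' >= T (hypothetical input).
    best_hit[i][j]:  T_{i,j}, the best score of pattern i in alignment j.
    Returns (additive false positive rate, tuple of thresholds).
    """
    n = len(tail_rate)
    m = len(best_hit[0])
    options = [sorted(tail_rate[i]) + [inf] for i in range(n)]
    best = (inf, None)
    for thresholds in product(*options):
        # Coverage constraint: each alignment j needs some i with T_i <= T_{i,j}.
        covered = all(
            any(thresholds[i] <= best_hit[i][j] for i in range(n))
            for j in range(m)
        )
        if not covered:
            continue
        # Objective: the additive approximation of the false positive rate.
        cost = sum(tail_rate[i][t] for i, t in enumerate(thresholds) if t != inf)
        if cost < best[0]:
            best = (cost, thresholds)
    return best

# Toy instance: two patterns, three alignments (invented numbers).
tail_rate = [
    {13: 1 / 1600, 14: 1 / 3000, 15: 1 / 5700},
    {13: 1 / 2000, 14: 1 / 4000, 15: 1 / 9000},
]
best_hit = [
    [15, 13, 14],  # pattern 0: best hit score in alignments 0, 1, 2
    [13, 15, 14],  # pattern 1
]
cost, thresholds = choose_thresholds(tail_rate, best_hit)
print(thresholds)  # (15, 14)
```

In this toy instance, using both patterns at thresholds 15 and 14 is cheaper than using either pattern alone at threshold 13, which is the effect the multiple-seed formulation is meant to capture.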

We show this by giving an approximation-preserving reduction of the Set-Cover problem to this problem. Since Set-Cover is Quasi-NP-hard to approximate to within a logarithmic factor [19], so is our problem.

An instance of Set-Cover is a ground set $S$ and a collection $\mathcal{T} = \{T_1, \ldots, T_m\}$ of subsets of $S$; the goal is the smallest cardinality subset of $\mathcal{T}$ whose union is $S$. The connection to our problem is clear: We will produce one alignment per ground set member and, for each of the elements of $\mathcal{T}$, we will have one seed. For simplicity, we will assume that $S = \{1, \ldots, n\}$. To fill the construction out, we will assign the vector seed

$$v_i = ((1, \overbrace{0, \ldots, 0}^{i}, 1), 1)$$

to every ground set element $s_i$. In a model of sequence where all positions are independent of all others, each of these seeds has the same false positive rate, so the false positive rate will be proportional to the number of ground set members chosen.

Then, for each set $T_j \in \mathcal{T}$, we create an alignment $A_j$ of length $2n^2 + 4n$ by pasting together $n$ blocks of length $2n + 4$. If $i$ is in $T_j$, then we make the $i$th block of the alignment have the first and $(i + 2)$nd position be of score 1, while all other positions in the block have score zero, while if $i \notin T_j$, then the $i$th block is all score zero. Then, it is clear that if we choose the seed $v_i$, we will hit all alignments $A_j$, where $i \in T_j$. If we desire the minimum false positive rate to hit all alignments, this is exactly equivalent to choosing the minimum cardinality set to cover all of the $T_j$.

Thus, we have presented an approximation-preserving transformation from Set-Cover to our problem and it is both NP-hard and Quasi-NP-hard to approximate to within a logarithmic factor.

3.2.3 Expansions of the Framework
In our experiments, we use the vector seed requirement as a threshold; one could use a more complicated threshold scheme to focus on hits that would be expanded to full alignments. That is, our minimum threshold for $T_{i,j}$ could be the highest-scoring hit that is expanded to a full alignment of seed vector $v_i$ in alignment $a_j$. We could also have a more complicated way of seeding alignments and, still, as long as we could compute false positive rates, we could require that all alignments are hit and minimize false positive rates.

Also, we can limit the total number of vector seeds used in the true solution (in other words, limit the number of vectors with finite threshold). We do this by putting an upper bound on $\sum_i x_{i,T}$ for the maximum threshold $T$. In practice, one might want an upper bound of four or eight seeds, as each chosen seed requires a method to identify hits and one might not want to have to use too many such methods, with the goal of keeping fewer indexes of a protein sequence database, for example.

Further, we might want to not allow seeds to be chosen with very high threshold. The optimal solution to the problem will have the thresholds on the seeds as high as possible while still hitting each alignment. This allows overtraining: Since even a tiny increase in the thresholds would have caused a missed alignment, we may easily expect that, in another set of alignments, there may be alignments just barely missed by the chosen thresholds. This is particularly possible if thresholds are allowed to get extremely high and only useful for a single alignment. This overtraining happened in some of our experiments, so we lowered the maximum threshold so that thresholds were either in a fairly narrow range (+13 to +25) or set to $\infty$ when a seed was not used. As one way of also addressing overtraining, we considered lowering the thresholds obtained from the IP uniformly or just lowering the thresholds that have been set to high values.

And, finally, the framework can be extended to allow a specific number of alignments to be missed. For each alignment, rather than requiring that

$$\sum_{i} x_{i,T_{i,j}} \geq 1,$$

which requires that some threshold be chosen so that the alignment is hit, we can add a 0/1 slack variable to count how many are missed, changing the constraint to

$$\sum_{i} x_{i,T_{i,j}} + s_j \geq 1.$$

Then, if we require that

$$\sum_{j} s_j \leq M,$$

this allows at most $M$ alignments to be so missed. This may be appropriate to allow the optimization framework to be less sensitive to a small number of outliers. We show experiments with this slightly expanded framework in the next section.

We note one simplification of our formulation: False hit rates are not additive. Given two spaced seeds, a hit to one may coincide with a hit to the other, so the background rate of false positives is lower than estimated by the program. When we give such background rates later, we will distinguish those found by the IP from the true values.

3.2.4 Solving the IP and Heuristics
To solve this integer program or its variations is not necessarily straightforward since the problem is NP-hard. In our experiments, we used sets of approximately 400 alignments and we have been able to solve the IP directly and quickly, using the CPLEX 9.0 integer programming solver.

Straightforward heuristics also work well for the problem, such as solving the LP relaxation and rounding to 1 all variables with values close to 1, until all alignments are hit, or setting all variables with fractional LP solutions to 1 and then raising thresholds on seeds until we start to miss alignments.
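The threshold-raising step shared by these heuristics can be sketched as a simple loop: starting from thresholds that hit every training alignment, repeatedly raise by one the threshold whose increase removes the most background hits, as long as every alignment is still hit. The code below is an illustrative sketch under assumed inputs; the function name, the tabulated `tail_rate` values, and the unit step size are assumptions, and it does not reproduce the LP-based starting point.

```python
def raise_thresholds(tail_rate, best_hit, start):
    """Greedily raise thresholds while every alignment stays hit.

    tail_rate[i][T]: background probability that pattern i scores >= T
                     (hypothetical tabulated values).
    best_hit[i][j]:  best score of pattern i in alignment j.
    start:           initial (low) threshold shared by all patterns.
    """
    n, m = len(best_hit), len(best_hit[0])
    thresholds = [start] * n

    def all_hit(ts):
        return all(
            any(ts[i] <= best_hit[i][j] for i in range(n)) for j in range(m)
        )

    assert all_hit(thresholds), "starting thresholds must hit every alignment"
    while True:
        best_gain, best_i = 0.0, None
        for i in range(n):
            nxt = thresholds[i] + 1
            if nxt not in tail_rate[i]:
                continue  # no higher threshold tabulated for this pattern
            trial = thresholds[:i] + [nxt] + thresholds[i + 1:]
            gain = tail_rate[i][thresholds[i]] - tail_rate[i][nxt]
            if all_hit(trial) and gain > best_gain:
                best_gain, best_i = gain, i
        if best_i is None:
            return thresholds  # any further increase would miss an alignment
        thresholds[best_i] += 1

# Toy instance: two patterns, three alignments (invented numbers).
tail_rate = [
    {13: 1 / 1600, 14: 1 / 3000, 15: 1 / 5700},
    {13: 1 / 2000, 14: 1 / 4000, 15: 1 / 9000},
]
best_hit = [
    [15, 13, 14],
    [13, 15, 14],
]
result = raise_thresholds(tail_rate, best_hit, start=13)
print(result)  # [15, 14]
```

Unlike the integer program, this loop never discards a pattern outright and gives no bound on how far the result is from optimal.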

We finally note that a simple greedy heuristic works well for the problem, as well: Start with low thresholds for all seed patterns and repeatedly increase the threshold whose increase most reduces the false positive rate until no such increase can be made without missing an alignment. This simple heuristic performed essentially comparably to the integer program in our experiments, but, since the IP solved quickly, we present its results.

One other advantage to the IP formulation is that the false-positive rate from the LP relaxation is a lower bound on what can possibly be achieved; the simple greedy heuristic offers no such lower bound.

4 EXPERIMENTAL RESULTS
Here, we present the results of experiments with our multiple seed selection framework in the context of protein alignments. Our goal is to identify collections of seed models which together have extremely high sensitivity to even moderately strong alignments, while admitting a very low false positive rate.

Since we pick seeds with a relatively small number of alignments, we run the serious risk of overtraining. In particular, the requirement that our set of seeds has 100 percent sensitivity on the training data need not require that it also have comparable sensitivity overall. In one example, the particular choice of training examples was apparently quite unrepresentative since a 100 percent sensitivity to this set of alignments still gave only 96 percent sensitivity on a testing set. (Or, presumably, the testing set may be unrepresentative.) As a simple way of exploring this, we examined what happened when we lowered the threshold on some seeds that were chosen by the integer program to modestly increase their false positive rates and sensitivity in the hope of still keeping very high sensitivity.

We first present simple experiments with vector seeds and with ungapped alignment seeds on a small sample of alignments discovered with BLASTP; in this section, we also allow for seed sets that miss a small number of the training alignments.

Then, we explore how well these seed sets do in hitting alignments that we did not use BLASTP to identify. Here, we note that our vector seed sets do not appear to do as well as BLASTP for sensitivity to alignments in general, but they do hit more alignments with high-scoring short regions; presumably, these alignments are more likely true.

4.1 Preliminary Experiments
We begin by exploring several sets of alignments generated using BLASTP. Our target score range for our alignments is BLASTP score between +40 and +60 (BLOSUM score +112 to +168). These moderate-scoring alignments can happen by chance, but also are often true. Alignments below this threshold are much more likely to be errors, while, in the database of proteins we used, such alignments are likely to happen to a random sequence by chance only one time in 10,000, according to BLASTP’s statistics.

We begin by identifying a set of BLASTP alignments in this score range. To avoid overrepresenting certain families of alignments in our test set, we did an all-versus-all comparison of 8,654 human proteins from the SWISS-PROT database [20]. (We note that this is the same set of proteins and alignments we used in our previous vector seed work [3]. We have used this test set in part to confirm our belief that, while a single seed may not help much, in comparison to BLASTP, many seeds will be of assistance.) We then divided the proteins into families so that all alignments with BLASTP score greater than 100 are between two sequences in the same family and there are as many families as possible. We then chose 10 sets of alignments in our target score range such that, in each set of alignments, a particular family will only contribute at most eight alignments to that set. Note that, since our threshold for sharing family membership is a BLASTP score greater than 100 and the alignments we are seeking score between +40 and +60, many chosen alignments will be between members of different families. We divided the sets of alignments into five training sets and five testing sets. It is possible that the same alignments will occur in a training and testing set as we did not make any effort to avoid this, though the set of possible alignments is large enough to make this a rare occurrence.

We note that we are using this somewhat complicated system specifically because we want to avoid imposing a preexisting bias on the set of alignments: Many true yet moderate-scoring alignments will be between proteins with different function or from different biological families. For the same reason, we have used alignments from dynamic programming as our standard, rather than structural alignments of known proteins or curated alignments, because our goal is to improve the quality of heuristic alignments. Certainly, many of the alignments we consider will not be precise; still, a heuristic dynamic programming-based alignment that finds a hit between two proteins and then uses the same scoring matrix as BLASTP will find the exact same, potentially inaccurate, alignment as did BLASTP.

4.1.1 Multiple Vector Seeds
We then considered the set of all 35 vector patterns of length at most 7 that include three or four 1s (the support of the seed). We used this collection of vector patterns as we have seen no evidence that nonbinary seed vectors are preferable to binary ones for proteins and because it is more difficult to find hits to seeds with higher support than four due to the high number of needed hash table keys.

We computed the optimal set of thresholds for these vector seeds such that every alignment in a training set has a hit to at least one of the seeds, while minimizing the background rate of hits to the seeds and only using at most 10 vector patterns. Then, we examined the sensitivity of the chosen seeds for a training set to its corresponding test set.
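The count of 35 patterns in Section 4.1.1 can be checked by enumeration. The sketch below assumes, as that count suggests, that a pattern's first and last positions are both 1, so that its length equals its span; the function name and parameters are illustrative.

```python
from itertools import combinations

def binary_patterns(max_length=7, supports=(3, 4)):
    """List 0/1 seed vectors whose first and last entries are 1."""
    patterns = []
    for length in range(min(supports), max_length + 1):
        interior = range(1, length - 1)  # positions strictly between the ends
        for support in supports:
            if support > length:
                continue
            # Choose which interior positions carry the remaining 1s.
            for ones in combinations(interior, support - 2):
                vec = [0] * length
                vec[0] = vec[-1] = 1
                for pos in ones:
                    vec[pos] = 1
                patterns.append(tuple(vec))
    return patterns

patterns = binary_patterns()
print(len(patterns))  # 35
```

Of these, 15 patterns have support three and 20 have support four; the BLASTP pattern (1, 1, 1) is the only one of length three.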

TABLE 1
Hit Rates for Optimal Seed Sets for Various Sets of Training Alignments when Applied to an Unrelated Test Set

The results are found in Table 1. Some seed sets chosen showed signs of overtraining, but others were quite successful, where the chosen seeds work well for their training set as well and have low false positive rate.

We took the best seed set with near 100 percent sensitivity for both its training and testing data, which was the third of our experimental sets, and used it in further experiments. This seed set is shown in Table 2. We note that this seed set has five times lower false positive rate (1/8,000) than does BLASTP, while still hitting all of its testing alignments but four (which is not statistically significantly different from zero). We also considered a set of thresholds where we lowered the higher thresholds slightly to allow more hits and possibly avoid overtraining on the initial set of alignments. These altered thresholds are shown as well in Table 2 and give a total false positive rate of 1/6,900. (This set of thresholds also hits all 402 test alignments for that instance.)

TABLE 2
Seeds and Thresholds Chosen by Integer Programming for 409 Test Alignments

4.1.2 A Weaker Requirement on the Sensitivity
As noted previously, we can alter our integer program so that it does not require 100 percent sensitivity on the training data set. We performed experiments on this formulation, using five subsets of the training alignments chosen as before, where we allowed between zero and five alignments from the training set to be missed by the seed set. We show results in Table 3, using again a randomly chosen testing set for each training set. The training data sets varied in size from 304 to 415, while the testing sets ranged from 392 to 407 in size.

Unsurprisingly, if we did not hit all alignments in the training set, we often miss alignments in the testing set as well. However, the ranges of the sensitivities we saw in testing data for the seed sets picked allowing some misses in the training data were much narrower, suggesting that there may be fewer seed thresholds lowered merely to accommodate a single outlier in the training data. As such, if slightly lower sensitivity is acceptable, this approach may give much more predictable results than training to require all alignments to be hit.

TABLE 3
Weakening Sensitivity to Testing Alignment Reduces Sensitivity on Training Alignments

4.1.3 Multiple Ungapped Alignment Seeds
Ungapped alignment seeds can be seen as breaking the model we have for alignment speed. The most straightforward implementation of ungapped alignment seeds would involve a hash table keyed on the letters corresponding to the positions in the bounds vector $b$, where there is a nontrivial lower bound on the score of a position. Still, even after the first step, where we identified pairs of positions satisfying the minimum bounds scores, we still need another test to verify that a pair of positions satisfies the requirement of the dot product of the local alignment score with the vector $v$ of positional multipliers being higher than the threshold. Similar limitations affect any such two-phase seed, such as requiring that two hypothetically aligned positions satisfy two vector seeds at once.

If we assume, however, that testing a hit to the simple hash table, to verify whether the dot product of the local alignment score with the vector of multipliers $v$ is greater than the threshold $T$, can be done so rapidly that we can throw out misses without having to count them, then we return to the case from before, where we need count only the fraction of positions expected to pass both levels of filtration. This assumption may be appropriate, assuming that the small amount of time taken to throw out a hash-table hit that does not satisfy the dot product threshold is much, much smaller than the amount of time needed to throw out a hit to the whole ungapped alignment seed that still does not make a good local alignment.
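The two-phase test just described can be written down directly. The following sketch assumes the definition from Section 2.2 (every positional score at least its bound, and the dot product with $v$ at least $T$); the function name, the example seeds, and the example scores are hypothetical, and in a real aligner the first phase would be a hash-table lookup rather than a loop.

```python
from math import inf

def ungapped_seed_hit(scores, v, b, T):
    """Two-phase test of an ungapped alignment seed (v, b, T).

    scores: pairwise scores s_{i,j}, s_{i+1,j+1}, ... at the candidate
            position, one entry per seed position (hypothetical input).
    """
    # Phase 1: positional lower bounds (the hash-table step in practice).
    if any(s < lo for s, lo in zip(scores, b)):
        return False
    # Phase 2: dot product of the scores with the multipliers must reach T.
    return sum(mult * s for mult, s in zip(v, scores)) >= T

# BLASTP's vector seed ((1, 1, 1), 13) as a special case: all bounds -inf.
blastp = dict(v=(1, 1, 1), b=(-inf, -inf, -inf), T=13)
# A hypothetical seed whose positions 1, 3, and 4 must score at least 0.
spaced = dict(v=(1, 1, 1, 1), b=(0, -inf, 0, 0), T=13)

print(ungapped_seed_hit((6, 4, 5), **blastp))      # True: 15 >= 13
print(ungapped_seed_hit((6, -2, 5, 5), **spaced))  # True: bounds met, 14 >= 13
print(ungapped_seed_hit((9, 9, -1, 9), **spaced))  # False: third score below bound
```

The last call shows why such a seed cannot be expressed as a vector seed: the total score of 26 is well above the threshold, but one bounded position fails.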

TABLE 4 one should instead focus on longer windows around a hit


Ungapped Alignment Seeds Offer before discarding it with a filter.
Similar Performance to Vector Seeds
4.2 A Broader Set of Alignments
Returning to our set of vector seeds from Table 2, we then
considered a larger set of alignments in our target range of
good, but not great scores to verify if the advantage of
multiple seeds still holds. We used the Smith-Waterman
algorithm to compute all alignments between pairs of a
1,000-sequence subset of our protein data set and computed
how many of them were not found by BLASTP. Only 970
out of 2,950 Smith-Waterman alignments with BLOSUM62
score between +112 and +168 had been identified by
With this in mind, we tested our set of moderate BLASTP, even though alignments in this score range would
alignments on a simple collection of ungapped alignment have happened by chance only one time in 10,000 according
seed patterns to identify whether ungapped alignment seeds to BLASTP’s statistics.
form a potentially superior seed filtering approach to vector Almost all of these 2,950 alignments, 2,942, had a hit to
seeds. Of course, since they include vector seeds as a special the BLASTP default seed. Despite this, however, only 970
case, this is trivial, but our interest is primarily whether the actually built a successful BLASTP alignment. Our set of
advantage of ungapped alignments is large enough to merit eight seeds had hits to 1,939 of the 1,980 that did not build a
their consideration over that of vector seeds. BLASTP alignment and to 955 of the 970 that did build a
In our experiments, we used ungapped alignment seeds BLASTP alignment, so, at first glance, the situation does not
where the vector of score lower bounds consisted of only look good. However, the difference between having a hit
the values 0 and 1 (which results in no score restriction); and having a hit in a good region of the alignment is where
we also allowed the vector of pairwise multipliers to only we are able to show substantial improvement.
be the all-ones vector. This simple approach, which was The discrepancy between hits and alignments comes
used independently in the multiple aligner of Brown and because the BLASTP seed can have a hit in a bad part of the
Hudek [15] and in the tPatternHunter protein aligner [7], alignment, which is filtered out. Typically, such hits occur
simply requires a good local region, with certain specified in a region where the source of positive score is quite short,
positions having positive score. We required that the which is much more likely with an unspaced seed than with
bounds vector have at most four active positions and a spaced seed. We looked at all of the regions of length
considered seed lengths between three and six. Note that, in 10 amino acids of alignments that included a hit to a seed
this model, the bounds vector ð0; 0; 0; 1Þ behaves quite (either the BLASTP seed or one of the multiple seeds), and
differently than the bounds vector ð0; 0; 0Þ because we will assigned the best score of such a region to that alignment; if
be adding pairwise scores of four positions in the former no ungapped region of length 10 surrounded a hit, we
case and three in the latter. assumed it would certainly be filtered out. The data are
The results of our experiment are shown in Table 4. We shown in Table 5 and show that of the alignments hit by the
used the same testing and training data sets as for Table 3. spaced seeds, they are hit in regions that are essentially
In general, these results are slightly worse than the results identical in conservation to where the BLASTP seed hits
of our original experiments with vector seeds when we them. For example, 47.7 percent of the alignments contain a
require 100 percent sensitivity to testing data, but improve 10-amino acid region around a hit to the ((1, 1, 1), 13) seed
when we allow some misses in the training data. Typical with BLOSUM score at least +30, while 46.7 percent contain
false positive rates on the order of 1/10,000 are common such a region surrounding a hit to one of the multiple seeds
with testing sensitivity of approximately 99 percent, as with higher threshold. If we use the lower thresholds that
before; again, the corresponding false positive rate for allow slightly more false positives, their performance is
BLASTP’s seed is approximately 1/1,600. actually slightly better than BLASTP’s.
A positive note to the ungapped alignment seeds is that Table 5 also shows that the higher-threshold seed ((1, 1, 1),
there seems to be less overtraining: As the training 15), which has a worse false positive rate (1/5,700) than our
sensitivity is allowed to go down slightly, the testing ensembles of seeds, performs substantially worse: Namely,
sensitivity does not plummet as quickly as for vector seeds. only 64 percent of the alignments have a hit to the single seed
One reason for this is that an ungapped alignment seed, found in a region with local score above +25, while 73 percent
both times they have been implemented [7], [15], still of the alignments have a hit to one of the multiple seeds with
requires high-scoring short local alignment around the this property. This single seed strategy is clearly worse than
seed. As we show in the next section, focusing on very the multiple seed strategy of comparable false positive rate
narrow alignments in seeding may be inappropriate and and the optimized seeds perform comparably to BLASTP in

TABLE 5
Hits in Locally Good Regions of Alignments

identifying the alignments that actually have a core conserved region.

Our experiments show that multiple seed models can have an impact on local alignment of protein sequences. Using many spaced seeds, which we picked by optimizing an integer program, we find seed models whose chance of finding a good hit in a moderate-scoring alignment is comparable to that of the BLASTP seed, with four to five times fewer noise hits. The difficulty with the BLASTP seed is that it not only has more junk hits and more hits in overlapping places, it also has more hits in short regions of true alignments, which are likely to be filtered and thrown out.

5 CONCLUSIONS

We have given a theoretical framework to the problem of using spaced seeds for protein homology search detection. Our result shows that using multiple vector or ungapped alignment seeds can give sensitivity to good parts of local protein alignments essentially comparable to BLASTP, while reducing the false positive rate of the search algorithm by a factor of four to five.

Our set of vector seeds is chosen by optimizing an integer programming framework for choosing multiple seeds when we want 100 percent sensitivity to a collection of training alignments. The framework is general enough to accommodate many extensions, such as requiring a fixed amount of sensitivity on the training (not only 100 percent), allowing only a small number of seeds to be chosen or allowing for many different sorts of seeding strategies. We have mostly used it to optimize sets of vector seeds because they encapsulate an approach to homology search for nucleotides that has been very successful.

One difficulty with our approach is that it relies on a theoretical estimate of the runtime of a homology search program: namely, that the program will take time proportional to the number of false positives found by the seeding method. As seeding methods become more complex, such as the two-step ungapped alignment seeds, it may become harder to identify what a “false positive” is, in particular, if a false positive fits through one step of a filter, but is quickly discarded before the next step, should it count toward the estimated runtime? Using our framework, we identified a set of seeds for moderate-scoring protein alignments whose total false positive rate in random sequence is four-to-five times lower than the default BLASTP seed. This set of seeds had hits to slightly fewer alignments in a test set of moderate-scoring alignments found by the Smith-Waterman algorithm than found by BLASTP; however, the BLASTP seeds hit subregions of these alignments that were actually slightly worse than hit by the spaced seeds. Hence, given the filtering used by BLASTP, we expect that the two alignment strategies would give comparable sensitivity, while the spaced seeds give four times fewer false hits.

ACKNOWLEDGMENTS

The author would like to thank Ming Li for introducing him to the idea of spaced seeds. This work is supported by the Natural Science and Engineering Research Council of Canada and by the Human Frontier Science Program. A preliminary version of this paper [21] appeared at the Workshop on Algorithms in Bioinformatics, held in Bergen, Norway, in September 2004.

REFERENCES

[1] S.F. Altschul, W. Gish, W. Miller, E.W. Myers, and D.J. Lipman, “Basic Local Alignment Search Tool,” J. Molecular Biology, vol. 215, no. 3, pp. 403-410, 1990.
[2] B. Ma, J. Tromp, and M. Li, “PatternHunter: Faster and More Sensitive Homology Search,” Bioinformatics, vol. 18, no. 3, pp. 440-445, Mar. 2002.
[3] B. Brejova, D. Brown, and T. Vinar, “Vector Seeds: An Extension to Spaced Seeds Allows Substantial Improvements in Sensitivity and Specificity,” Proc. Third Ann. Workshop Algorithms in Bioinformatics, pp. 39-54, 2003.
[4] M. Li, B. Ma, D. Kisman, and J. Tromp, “Patternhunter II: Highly Sensitive and Fast Homology Search,” J. Bioinformatics and Computational Biology, vol. 2, no. 3, pp. 419-439, 2004.
[5] J. Xu, D. Brown, M. Li, and B. Ma, “Optimizing Multiple Spaced Seeds for Homology Search,” Proc. 15th Ann. Symp. Combinatorial Pattern Matching, pp. 47-58, 2004.
[6] Y. Sun and J. Buhler, “Designing Multiple Simultaneous Seeds for DNA Similarity Search,” Proc. Eighth Ann. Int’l Conf. Computational Biology, pp. 76-84, 2004.
[7] D. Kisman, M. Li, B. Ma, and L. Wang, “TPatternHunter: Gapped, Fast and Sensitive Translated Homology Search,” Bioinformatics, 2004.

[8] T. Smith and M. Waterman, “Identification of Common Molecular Subsequences,” J. Molecular Biology, vol. 147, pp. 195-197, 1981.
[9] B. Brejova, D. Brown, and T. Vinar, “Vector Seeds: An Extension to Spaced Seeds,” J. Computer and System Sciences, 2005, pending publication.
[10] J. Buhler, U. Keich, and Y. Sun, “Designing Seeds for Similarity Search in Genomic DNA,” Proc. Seventh Ann. Int’l Conf. Computational Biology, pp. 67-75, 2003.
[11] B. Brejova, D. Brown, and T. Vinar, “Optimal Spaced Seeds for Homologous Coding Regions,” J. Bioinformatics and Computational Biology, vol. 1, pp. 595-610, Jan. 2004.
[12] U. Keich, M. Li, B. Ma, and J. Tromp, “On Spaced Seeds for Similarity Search,” Discrete Applied Math., vol. 138, pp. 253-263, 2004.
[13] K.P. Choi, F. Zeng, and L. Zhang, “Good Spaced Seeds for Homology Search,” Bioinformatics, vol. 20, no. 7, pp. 1053-1059, 2004.
[14] G. Kucherov, L. Noé, and Y. Ponty, “Estimating Seed Sensitivity on Homogeneous Alignments,” Proc. Fourth IEEE Int’l Symp. BioInformatics and BioEng., pp. 387-394, 2004.
[15] D. Brown and A. Hudek, “New Algorithms for Multiple DNA Sequence Alignment,” Proc. Fourth Ann. Workshop Algorithms in Bioinformatics, pp. 314-326, 2004.
[16] M. Csürös, “Performing Local Similarity Searches with Variable Length Seeds,” Proc. 15th Ann. Symp. Combinatorial Pattern Matching, pp. 373-387, 2004.
[17] K. Choi and L. Zhang, “Sensitive Analysis and Efficient Method for Identifying Optimal Spaced Seeds,” J. Computer and System Sciences, vol. 68, pp. 22-40, 2004.
[18] G. Kucherov, L. Noé, and Y. Ponty, “Multiseed Lossless Filtration,” Proc. 15th Ann. Symp. Combinatorial Pattern Matching, pp. 297-310, 2004.
[19] U. Feige, “A Threshold of ln n for Approximating Set Cover,” J. ACM, vol. 45, pp. 634-652, 1998.
[20] A. Bairoch and R. Apweiler, “The SWISS-PROT Protein Sequence Database and Its Supplement TrEMBL in 2000,” Nucleic Acids Research, vol. 28, no. 1, pp. 45-48, 2000.
[21] D. Brown, “Multiple Vector Seeds for Protein Alignment,” Proc. Fourth Ann. Workshop Algorithms in Bioinformatics, pp. 170-181, 2004.

Daniel G. Brown received the undergraduate degree in mathematics with computer science from the Massachusetts Institute of Technology in 1995 and the PhD degree in computer science from Cornell University in 2000. He then spent a year as a research scientist at the Whitehead Institute/MIT Center for Genome Research in Cambridge, Massachusetts, working on the Human and Mouse Genome Projects. Since 2001, he has been an assistant professor in the School of Computer Science at the University of Waterloo.

Editorial—State of the Transaction


Dan Gusfield

IT is a pleasure to write this editorial at the beginning of the second year of the publication of the IEEE/ACM Transactions
on Computational Biology and Bioinformatics (TCBB). The last year saw the publication of four issues of TCBB, the first of
which was mailed out roughly nine months after our initial call for submissions. That accomplishment was the result of
tremendous cooperation and hard work on the part of authors, reviewers, associate editors, and staff. I would like to thank
everyone for making that possible.
During the past year, we received roughly 205 submissions and, presently, we have about 50 of those under review. In
our first year, we published 16 papers, including Part I of a special section on The Best Papers from WABI (Workshop on
Algorithms in Bioinformatics). Part II will appear this year, along with a special issue on Machine Learning in
Computational Biology and Bioinformatics. Other special issues are also in the planning stages. The papers that we have
published are establishing TCBB as a venue for the highest quality research in a broad range of topics in computational
biology and bioinformatics. I know that some of the papers we have already published will be cited as the foundational or
the definitive papers in several subareas of the field.
A goal for the future is to attract more submissions from the biology community and this will be facilitated when TCBB
is indexed in MEDLINE, which requires two years of publication before it will consider indexing a journal. So, this second
year of publication will hopefully lead to the inclusion of TCBB in MEDLINE.
Finally, I would like to share some wonderful news we received in February. The Association of American Publishers,
Professional and Scholarly Publishing Division awarded TCBB their “Honorable Mention” award for The Best New Journal
in any category for the year 2004. Only one Honorable Mention is awarded. Again, the credit for this accomplishment goes
to all the authors, reviewers, associate editors, and staff who have worked so hard to establish TCBB in this last year. I look
forward to continued growth and success of TCBB in our second year of publication.

Dan Gusfield
Editor-in-Chief

For information on obtaining reprints of this article, please send e-mail to:
tcbb@computer.org.
1545-5963/05/$20.00 © 2005 IEEE Published by the IEEE CS, CI, and EMB Societies & the ACM

Bases of Motifs for Generating Repeated Patterns with Wild Cards
Nadia Pisanti, Maxime Crochemore, Roberto Grossi, and Marie-France Sagot

Abstract—Motif inference represents one of the most important areas of research in computational biology, and one of its oldest ones.
Despite this, the problem remains very much open in the sense that no existing definition is fully satisfying, either in formal terms, or in
relation to the biological questions that involve finding such motifs. Two main types of motifs have been considered in the literature:
matrices (of letter frequency per position in the motif) and patterns. There is no conclusive evidence in favor of either, and recent work
has attempted to integrate the two types into a single model. In this paper, we address the formal issue in relation to motifs as patterns.
This is essential to get at a better understanding of motifs in general. In particular, we consider a promising idea that was recently
proposed, which attempted to avoid the combinatorial explosion in the number of motifs by means of a generator set for the motifs.
Instead of exhibiting a complete list of motifs satisfying some input constraints, what is produced is a basis of such motifs from which all
the other ones can be generated. We study the computational cost of determining such a basis of repeated motifs with wild cards in a
sequence. We give new upper and lower bounds on such a cost, introducing a notion of basis that is provably contained in (and, thus,
smaller than) previously defined ones. Our basis can be computed in less time and space, and is still able to generate the same set of
motifs. We also prove that the number of motifs in all bases defined so far grows exponentially with the quorum, that is, with the
minimal number of times a motif must appear in a sequence, something unnoticed in previous work. We show that there is no hope to
efficiently compute such bases unless the quorum is fixed.

Index Terms—Motifs basis, repeated motifs.

. N. Pisanti and R. Grossi are with the Dipartimento di Informatica, Università di Pisa, Italy. E-mail: {pisanti, grossi}@di.unipi.it.
. M. Crochemore is with the Institut Gaspard-Monge, University of Marne-la-Vallée, France and King’s College London. E-mail: maxime.crochemore@univ-mlv.fr.
. M.-F. Sagot is with INRIA Rhône-Alpes, Laboratoire de Biométrie et Biologie Evolutive, Université Claude Bernard Lyon 1, France and King’s College London. E-mail: marie-france.sagot@inria.fr.

Manuscript received 14 Mar. 2004; revised 2 Dec. 2004; accepted 16 Feb. 2005; published online 30 Mar. 2005.
For information on obtaining reprints of this article, please send e-mail to: tcbb@computer.org, and reference IEEECS Log Number TCBB-0036-0304.

1 INTRODUCTION

IDENTIFYING motifs in biological sequences is one of the oldest fields in computational biology. Yet, it remains also very much an open problem in the sense that no currently existing definition of a “motif” is fully satisfying for the purposes of accurately and sensitively identifying the biological features that such motifs are supposed to represent. Among the most difficult to model are binding sites, as they are often quite degenerate. Indeed, variability may be considered part of their function. Such variability translates itself into changes in the motif, mostly substitutions, that do not affect the biological function. Two main schools of thought on how to define motifs in biology have coexisted for years, each valid in its own way. The first works with a statistical representation of motifs, usually given in the form of what is called in the literature a PSSM (“Position Specific Scoring Matrix” [9], [11], [13], [12] or a profile which is one type of PSSM). Interesting PSSMs are those that have a high information value (measured, for instance, by the relative entropy of the corresponding matrix). The second school defines a motif as a consensus [4], [24]. A motif is therefore a pattern that appears repeatedly, in general, approximately, that is, up to a certain number of differences (most often substitutions only) in a sequence or set of sequences of interest.

It is generally accepted that PSSMs are more appropriate for modeling an already known (in the sense of well-characterized) biological feature for the purpose of then identifying other occurrences of the feature, even though the false positive rate of this further identification remains very high. Identifying the PSSM itself ab initio is still, however, a difficult problem, particularly for large data sets or when the amount of noise may be high. The methods used are also no guarantee heuristics, leaving an uncertainty as to whether motifs that are statistically as meaningful as those reported have not been missed.

On the other hand, formulating the problem of identifying approximate motifs as patterns enables one to address the motif identification problem in an exhaustive fashion, even though the algorithmic complexity of the problem remains relatively high, and the model may appear more limited than PSSMs. Because of the lower algorithmic complexity of identifying repeated patterns, the model may, however, be made more complex and biologically pertinent in other ways. One could think of introducing motifs composed of various different submotifs separated by variable-length distances that may then also be found in a relatively efficient way [14]. Motifs presenting such a high level of combinatorial complexity are indeed frequent, particularly in eukaryotes. Exhaustively seeking for approximately repeated patterns may however have the drawback of producing many “solutions,” that is, many motifs. In fact, the number of motifs identified with this model may be so high (e.g., exponential in the size of the input) that it is as impossible to manage as the initial input sequence(s), even though they provide a first way of

structuring such input. Yet, it appeared clear also to any computational biologist working with motifs as patterns that there was further structure to be extracted from the set of motifs found, even when such a set is huge. Furthermore, such a structure could reflect some additional biological information, thus providing additional motivation for inferring it. Doing this is generally addressed by means of clustering, or even by attempting to bring together the two types of motif models (PSSMs and patterns). Indeed, recently researchers have been using pattern detection as a first filter-flavored step toward inferring PSSMs from biological sequences [6]. This seems very promising although much work remains to be done to precisely determine the relation between the two types of models, and to fully explore the biological implications this may have.

Again, each of the two above approaches is valid, but the question remained open whether or not the inner structure of a set of motifs could be expressed in a manner that would be more satisfying from both the mathematical and the biological points of view. Then, in 2000, a paper by Parida et al. [17] seemed to present a way of extracting such an inner structure in a very elegant and powerful way for a particular type of motif. The power of their proposal resided in the fact that the above mentioned structure corresponded to a well-known and precisely defined mathematical object and, moreover, guaranteed that no solution would be lost. Exhaustiveness in relation to the chosen type of motif is also preserved, thus enabling a biologist to draw some conclusions even in the face of negative answers (i.e., when no motifs, or no a priori “expected” motifs are found in a given input), something which PSSM-detecting methods do not allow. The structure is that of a basis of motifs. Informally speaking, it is a subset of all the motifs satisfying some input parameters (related, for instance, to which differences between a pattern and its occurrences are allowed) from which it is possible to recover all the other motifs, in the sense that all motifs not in the basis are a combination of some (in general, a few only) motifs in the basis. Such a combination is modeled by simple rules to systematically generate the other motifs with an output sensitive cost [18]. A basis would therefore also provide a way of characterizing the input, which then might be used to compare different inputs without resorting to the traditional alignment methods with all the pitfalls they present. The idea of a basis would fulfill such expectations if its size could be proven to be small enough. The argument [17] seemed to be that, for the type of motifs considered, a compact enough basis could always be found.

The motifs considered in [17] were patterns with wild card symbols occurring in a given sequence $s$ of $n$ symbols drawn over an alphabet $\Sigma$. A wild card symbol is a special symbol “$\circ$” matching any other element.¹ For example, the pattern $\mathtt{T}\circ\mathtt{G}$ matches both $\mathtt{TTG}$ and $\mathtt{TGG}$ inside $s = \mathtt{TTGG}$. Parida et al. focused on patterns which appear at least $q$ times in $s$ for an input parameter $q \ge 2$, called the quorum. This may, at first sight, seem an even more restrictive type of motif than patterns in general. It, however, has the merit of capturing one aspect of biological features that current PSSMs in general ignore, or address only in an indirect way. This aspect often concerns isolated positions inside a motif that are not part of the biological feature being captured. This is the case, for instance, with some binding sites, particularly at the protein level. Studying patterns with wild cards has a further very important motivation in biology, even when no differences (such as substitutions) are allowed. Indeed, motifs such as these or closely related ones can be used as seeds for finding long repeats and for aligning, pairwise or multiple-wise, a set of sequences or even whole genomes [15], [23].

1. In the literature on sequence analysis and pattern matching, the wild card is often referred to as do not care (as it is in the literature on bases of motifs). Therefore, we will use this latter term when referring to the sequence analysis and string matching literature.

The basis introduced by Parida et al. had interesting features, but presented some unsatisfying properties. In particular, as we show in this paper, there is an infinite family of strings for which the authors’ basis contains $\Omega(n^2)$ motifs for $q = 2$. This contradicts the upper bound of $3n$ for any $q \ge 2$ given in [17]. As a result, the algorithm taking $O(n^3 \log n)$ time, mentioned in [17], for finding the basis of motifs does not hold since it relies on the upper bound of $3n$, thus leaving open the problem of efficiently discovering a basis. A refinement of the definition of basis and an incremental construction in $O(n^3)$ time has recently been described by Apostolico and Parida [2]. A comparative survey of several notions of bases can be found in [22].

Closely following previous work, here we introduce a new definition of basis. The condition for the new basis is stronger than that of [17] and, hence, our basis is included in that of [17] (and is thus smaller) while both are able to generate the same set of motifs with mechanical rules. Our basis is moreover symmetric: Given a string $s$, the motifs in the basis for its reverse $\tilde{s}$ are the reversals of the motifs in the basis for $s$. Moreover, the number of motifs in our basis can provably be upper bounded in the worst case by $n - 1$ for $q = 2$ and occur in $s$ a total of $2n$ times at most. However, we reveal an exponential dependency on $q$ for the number of motifs in all bases defined so far (i.e., including our basis, Parida’s and Pelfrene et al.’s [19]), something unnoticed in previous work. Consequently, no polynomial-time algorithm can exist for finding one of these bases with arbitrary values of $q \ge 2$.

2 NOTATION AND TERMINOLOGY

We consider strings that are finite sequences of letters drawn from an alphabet $\Sigma$, whose elements are also called solid characters. We introduce an additional symbol (denoted by $\circ$ and called wild card) that does not belong to $\Sigma$ and matches any letter; a wild card clearly matches itself. The length of a string $t$, denoted by $|t|$, is the number of letters and wild cards in $t$, and $t[i]$ indicates the letter or wild card at position $i$ in $t$ for $0 \le i \le |t| - 1$ (hence, $t = t[0]\,t[1] \cdots t[|t| - 1]$, also noted $t[0..|t| - 1]$).

Definition 1 (pattern). Given the alphabet $\Sigma$, a pattern is a string in $\Sigma \cup \Sigma(\Sigma \cup \{\circ\})^{*}\Sigma$ (that is, it starts and ends with a solid character).

The patterns are related by the following specificity relation $\preceq$.

Definition 2 ($\preceq$). For individual characters $\sigma_1, \sigma_2 \in \Sigma \cup \{\circ\}$, we have $\sigma_1 \preceq \sigma_2$ if $\sigma_1 = \circ$ or $\sigma_1 = \sigma_2$. Relation $\preceq$ extends to strings in $(\Sigma \cup \{\circ\})^{*}$ under the convention that each string $t$ is implicitly surrounded by wild cards, namely, letter $t[j]$ is $\circ$ when $j \ge |t|$. Hence, $v$ is more specific than $u$ (written $u \preceq v$) if $u[j] \preceq v[j]$ for any integer $j$.

We can now formally define the occurrences of patterns $x$ in $s$ and their lists.

Definition 3 (occurrence, $L$). We say that $u$ occurs at position $\ell$ in $v$ if $u[j] \preceq v[j + \ell]$, for $0 \le j \le |u| - 1$ (equivalently, we say that $u$ matches $v[\ell..\ell + |u| - 1]$). For the input string $s \in \Sigma^{*}$ with $n = |s|$, we consider the location list $L_x \subseteq \{0..n - 1\}$ as the set of all the positions on $s$ at which $x$ occurs.

When a pattern $u$ occurs in another pattern (or into a string) $v$, we also say that $v$ contains $u$. For example, the location list of $x = \mathtt{T}\circ\mathtt{G}$ in $s = \mathtt{TTGG}$ is $L_x = \{0, 1\}$, hence $s$ contains $x$.

Definition 4 (motif). Given a parameter $q \ge 2$, called quorum, we say that pattern $x$ is a motif in $s$ when $|L_x| \ge q$.

Given any location list $L_x$ and any integer $d$, we adopt the notation $L_x + d = \{\ell + d \mid \ell \in L_x\}$ for indicating the occurrences in $L_x$ “displaced” by the offset $d$.

Definition 5 (maximality). A motif $x$ is maximal if for any other motif $y$ that contains $x$, we have no integer $d$ such that $L_y = L_x + d$.

In other words, making a maximal motif $x$ more specific (thus obtaining $y$) reduces the number of its occurrences in $s$. Definition 5 is equivalent to that meant in [17] stating that $x$ is maximal if there exist no other motif $y$ and no integer $d \ge 0$ verifying $L_x = L_y + d$, such that $x[j] \preceq y[j + d]$ for $0 \le j \le |x| - 1$ (that is, $x$ occurs in $y$ at position $d$ in our terminology).²

Definition 6 (irredundant motif). A maximal motif $x$ is irredundant if, for any maximal motifs $y_1, y_2, \ldots, y_k$ such that $L_x = \bigcup_{i=1}^{k} L_{y_i}$, motif $x$ must be one of the $y_i$s. Conversely, if all the $y_i$s are different from $x$, pattern $x$ is said to be covered by motifs $y_1, y_2, \ldots, y_k$.

The basis of irredundant motifs for string $s$ is the set of all irredundant motifs in $s$. The definition is given with respect to the set of maximal motifs of the input string which is unique; indeed, such basis is unique and it can be used as a generator for all maximal motifs in $s$ as proved in [17]. The size of the basis is the number of irredundant motifs contained in it. We illustrate the notions given so far by employing the example string $s = \mathtt{FABCXFADCYZEADCEADC}$. For this string and $q = 2$ the location list of motif $x_1 = \mathtt{A}\circ\mathtt{C}$ is $L_{x_1} = \{1, 6, 12, 16\}$, and that of motif $x_2 = \mathtt{FA}\circ\mathtt{C}$ is $L_{x_2} = \{0, 5\}$. They are both maximal because they lose at least one of their occurrences when extended with solid characters at one side (possibly with wild cards in between), or when their wild cards are replaced by solid characters. However, motif $x_3 = \mathtt{DC}$ having list $L_{x_3} = \{7, 13, 17\}$ is not maximal. It occurs in $x_4 = \mathtt{ADC}$, where $L_{x_4} = \{6, 12, 16\}$, and its occurrences can be obtained from those of $x_4$ by a displacement of $d = 1$ positions. The basis of the irredundant motifs for $s$ is made up of $x_1 = \mathtt{A}\circ\mathtt{C}$, $x_2 = \mathtt{FA}\circ\mathtt{C}$, $x_4 = \mathtt{ADC}$, and $x_5 = \mathtt{EADC}$. The location list of each of them cannot be obtained from the union of any of the other location lists.

3 IRREDUNDANT MOTIFS: THE BASIS AND ITS SIZE FOR QUORUM $q = 2$

In this section, we show the existence of an infinite family of strings $s_k$ ($k \ge 5$) for which there are $\Omega(n^2)$ irredundant motifs in the basis for quorum $q = 2$, where $n = |s_k|$. In this way, we disprove the claimed upper bound of $3n$ [17] mentioned in Section 1. Each string $s_k$ will be constructed from a shorter string $t_k$, which we now define. For each $k$, $t_k = \mathtt{A}^k \mathtt{T} \mathtt{A}^k$, where $\mathtt{A}^k$ denotes the letter $\mathtt{A}$ repeated $k$ times (our argument works, in general, for $z^k w z^k$, where $z$ and $w$ are strings of equal length not sharing any common character). String $t_k$ contains an exponential number of maximal motifs, including those having the form $\mathtt{A}\{\mathtt{A}, \circ\}^{k-2}\mathtt{A}$ with exactly two wild cards. To see why, each such motif $x$ occurs four times in $t_k$: Specifically, two occurrences of $x$ match the first and the last $k$ letters in $t_k$ while each distinct wild card in $x$ matching the letter $\mathtt{T}$ in $t_k$ contributes to one of the two remaining occurrences. Extending $x$ or replacing a wild card with a solid character reduces the number of these occurrences, so $x$ is maximal. The idea of our proof is to obtain strings $s_k$ by prefixing $t_k$ with $O(|t_k|)$ symbols so that these motifs $x$ become irredundant in $s_k$. Since there are $\Theta(k^2)$ of them, and $n = |s_k| = \Theta(|t_k|) = \Theta(k)$, this leads to the claimed result.

In order to define the strings $s_k$ on the alphabet $\Sigma = \{\mathtt{A}, \mathtt{T}, u, v, w, x, y, z, a_1, a_2, \ldots, a_{k-2}\}$, we introduce some notation. Let $\tilde{u}$ denote the reversal of $u$, and let $ev_k$, $od_k$, $u_k$, $v_k$ be the strings thus defined

\[
\begin{aligned}
\text{if } k \text{ is even:}\quad & ev_k = a_2 a_4 \cdots a_{k-2}, \\
& od_k = a_1 a_3 \cdots a_{k-3}, \\
& u_k = ev_k\, u\, \widetilde{ev_k}\, v w\, ev_k, \\
& v_k = od_k\, x y\, \widetilde{od_k}\, z\, od_k; \\
\text{if } k \text{ is odd:}\quad & ev_k = a_2 a_4 \cdots a_{k-3}, \\
& od_k = a_1 a_3 \cdots a_{k-2}, \\
& u_k = ev_k\, u v\, \widetilde{ev_k}\, w x\, ev_k, \\
& v_k = od_k\, y\, \widetilde{od_k}\, z\, od_k.
\end{aligned}
\]

The strings $s_k$ are then defined by $s_k = u_k v_k t_k$ for $k \ge 5$. Fig. 1 shows them for $k = 7$.

Fact 1. The length of $u_k v_k$ is $3k$, and that of $s_k$ is $n = 5k + 1$.

2. Actually, the definition literally reported in [17] is “Definition 4 (Maximal Motif). Let $p_1, p_2, \ldots, p_k$ be the motifs in a sequence $s$. Let $p_i[j]$ be “.” if $j > |p_i|$. A motif $p_i$ is maximal if and only if there exists no $p_l$, $l \ne i$ and no integer $0 \le \delta$ such that $L_{p_i} + \delta = L_{p_l}$ and $p_l[\delta + j] \preceq p_i[j]$ hold for $1 \le j \le |p_i|$.” (The symbols in $p_i$ and $p_l$ are indexed starting from 1 onward.) The corresponding example in the paper illustrates the definition for $s = \mathtt{ABCDABCD}$, stating that $p_i = \mathtt{ABCD}$ is maximal while $p_l = \mathtt{ABC}$ is not. However, $p_i$ does not match the definition because of the existence of its prefix $p_l$ (setting $\delta = 0$); hence, we suspect a minor typo in the definition, for which the definition should read as “... such that $L_{p_i} = L_{p_l} + \delta$ and $p_i[j] \preceq p_l[\delta + j]$.”
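The location lists of the example string and the lengths stated in Fact 1 can be checked mechanically. The following is a minimal sketch, not the authors' implementation: the function names `occurs_at`, `location_list`, and `build_s` are hypothetical, the character `*` stands in for the wild card symbol, and each letter $a_i$ is represented by the integer $i$ (as in Fig. 1). Strings over the mixed alphabet are held as Python lists of symbols.

```python
WILD = "*"  # assumed stand-in for the paper's wild card symbol


def occurs_at(pattern, text, pos):
    """True if pattern occurs at position pos of text (Definition 3)."""
    if pos < 0 or pos + len(pattern) > len(text):
        return False
    window = text[pos:pos + len(pattern)]
    return all(p == WILD or p == c for p, c in zip(pattern, window))


def location_list(pattern, text):
    """Sorted list of all positions of text at which pattern occurs."""
    return [i for i in range(len(text) - len(pattern) + 1)
            if occurs_at(pattern, text, i)]


def build_s(k):
    """Return (u_k v_k, t_k, s_k) as lists of symbols; a_i is the integer i."""
    a = list(range(1, k - 1))            # a_1 .. a_{k-2}
    ev = [i for i in a if i % 2 == 0]    # ev_k: even-indexed letters
    od = [i for i in a if i % 2 == 1]    # od_k: odd-indexed letters
    if k % 2 == 0:
        u = ev + ["u"] + ev[::-1] + ["v", "w"] + ev
        v = od + ["x", "y"] + od[::-1] + ["z"] + od
    else:
        u = ev + ["u", "v"] + ev[::-1] + ["w", "x"] + ev
        v = od + ["y"] + od[::-1] + ["z"] + od
    t = ["A"] * k + ["T"] + ["A"] * k    # t_k = A^k T A^k
    return u + v, t, u + v + t


if __name__ == "__main__":
    s = "FABCXFADCYZEADCEADC"
    for x in ("A*C", "FA*C", "DC", "ADC", "EADC"):
        print(x, location_list(x, s))

    k, p = 7, 2
    uv, t, sk = build_s(k)
    print(len(uv), len(sk))              # Fact 1 predicts 3k and 5k + 1
    w = ["A"] * p + [WILD] + ["A"] * (k - p - 1)
    print(location_list(w, sk))          # compare with {3k, 4k - p, 4k + 1}
```

The sketch scans every position naively, which is enough for checking small cases; it makes no attempt at the efficiency questions the paper studies.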

Fig. 1. Example string $s_7$ ($a_i$ of the definition is simply denoted by $i$). Above it, there are the occurrences of $w$ of the Proof of Proposition 1, while the three lines below show the occurrences of motif $x = 4 \circ^{19} \mathtt{AAAA}\circ\mathtt{AA}$ in $s_7$. The letter 4 corresponds to position 4 of the wild card in $\mathtt{AAAA}\circ\mathtt{AA}$.

Proof. Whatever the parity of $k$, the string $u_k v_k$ contains the six letters $u$, $v$, $w$, $x$, $y$, $z$, two occurrences each of $ev_k$ and $od_k$, and one occurrence each of $\widetilde{ev_k}$ and $\widetilde{od_k}$. Since $od_k$ and $ev_k$ together contain one occurrence of each letter $a_1, a_2, \ldots, a_{k-2}$, we have $|od_k| + |ev_k| = k - 2$. Moreover, $|\widetilde{ev_k}| = |ev_k|$ and $|\widetilde{od_k}| = |od_k|$, so that $|u_k v_k| = 6 + 3(k - 2) = 3k$. This proves the first statement. For the second statement, the total length of $s_k$ follows by observing that $|t_k| = 2k + 1$, and so $n = |s_k| = 3k + 2k + 1 = 5k + 1$. □

Proposition 1. For $1 \le p \le k - 2$, no motif of the form $\mathtt{A}^p \circ \mathtt{A}^{k-p-1}$ can be maximal in $s_k$. Also, motif $\mathtt{A}^k$ cannot be maximal in $s_k$.

Proof. Let $w$ be an arbitrary motif of the form $\mathtt{A}^p \circ \mathtt{A}^{k-p-1}$, with $1 \le p \le k - 2$. Its location list is $L_w = \{0, k - p, k + 1\} + |u_k v_k| = \{3k, 4k - p, 4k + 1\}$ since $|u_k v_k| = 3k$ by Fact 1 and $w$ matches the two substrings $\mathtt{A}^k$ of $s_k$ as well as $\mathtt{A}^p \mathtt{T} \mathtt{A}^{k-p-1}$. The occurrences are shown in Fig. 1 for $k = 7$ and $p = 2$. No other occurrences are possible. Let us consider the position, say $i$, of the leftmost appearance of letter $a_p$ in $s_k$ (recall that there are three positions on $s_k$ at which letter $a_p$ occurs; we have $i = 0$ in our example of Fig. 1 with $p = 2$). We claim that motif $y = a_p \circ^{3k-i-1} w$ satisfies

two leftmost letters $a_p$ is the length of $a_p a_{p+2} \cdots a_{k-3}\, x y\, a_{k-3} \cdots a_{p+2}$, that is, $2|a_{p+2} \cdots a_{k-3}| + 3 = 2(k - 3 - p)/2 + 3 = k - p$. The distance between the leftmost and the rightmost $a_p$ is the length of the string $a_p a_{p+2} \cdots a_{k-3}\, x y\, \widetilde{od_k}\, z\, a_1 a_3 \cdots a_{p-2}$, which equals $k + 1$, the length of $x y\, \widetilde{od_k}\, z\, od_k$. The analogous verification of the other two cases yields the fact that $w$ cannot be maximal.

The second part of the lemma for motif $\mathtt{A}^k$ proceeds along the same lines, except that we choose $y = a_p \circ^{3k-i-1} \mathtt{A}^k$ with $i$ as before (note that $y$ is not required to be maximal and that the motifs in the statement are maximal in $t_k$). □

Proposition 2. Each motif of the form $\mathtt{A}\{\mathtt{A}, \circ\}^{k-2}\mathtt{A}$ with exactly two $\circ$s is irredundant in $s_k$.

Proof. Let $x$ be an arbitrary motif of the form $\mathtt{A}\{\mathtt{A}, \circ\}^{k-2}\mathtt{A}$ with two $\circ$s, namely, $x = \mathtt{A}^{p_1} \circ \mathtt{A}^{p_2 - p_1 - 1} \circ \mathtt{A}^{k - p_2 - 1}$ for $1 \le p_1 < p_2 \le k - 2$. To prove that $x$ is an irredundant motif, we first show that $x$ is maximal. Its location list is $L_x = \{0, k - p_2, k - p_1, k + 1\} + 3k$ since $|u_k v_k| = 3k$ by Fact 1 and $x$ matches the two substrings $\mathtt{A}^k$ of $s_k$ as well as $\mathtt{A}^{p_1} \mathtt{T} \mathtt{A}^{k - p_1 - 1}$ and $\mathtt{A}^{p_2} \mathtt{T} \mathtt{A}^{k - p_2 - 1}$. Any other motif $y$ such that $x$ occurs in $y$ can be obtained by replacing at least one wild card (at position $p_1$ or $p_2$) in $x$ with a solid character, but this would cause the removal of position $4k - p_1$ or $4k - p_2$ from $L_x$. Analogously, extending $x$ to the right by putting a solid character at position $|x|$ or larger would eliminate position $4k + 1$ from $L_x$. Finally, extending $x$ to the left by a solid character would eliminate at least one position from $L_x$ because no symbol occurs four times in $u_k v_k$. In conclusion, for any motif $y$ such that $x$ occurs in $y$, we have $L_y \ne L_x + d$ for any integer $d$ and, thus, $x$ is a maximal motif by Definition 5. We now prove that $x$ is irredundant according to Definition 6. Let us consider an arbitrary set of maximal motifs $y_1, y_2, \ldots, y_h$ such that $L_x = \bigcup_{i=1}^{h} L_{y_i}$. We claim that at least one $y_i$ is of the form $\mathtt{A}\{\mathtt{A}, \circ\}^{k-2}\mathtt{A}$. Indeed, there must exist a location list $L_{y_i}$ containing position $4k +$
Ly ¼ Lw  ð3k  iÞ. Since w appears in y, it follows that w 1 since that position belongs to Lx . This implies that yi
cannot be maximal in sk by Definition 5 (setting occurs in the suffix Ak of sk . It cannot be that jyi j < k since yi
d ¼ 3k þ i). To see why Lw ¼ Ly þ ð3k  iÞ, it suffices to would occur also in some position j > 4k þ 1 whereas
prove that the distance in sk between the positions of the j 62 Lx , so it is impossible. Consequently, yi is of length k
two leftmost letters ap is k  p while that of the leftmost and and matches Ak , thus being of the form AfA; gk2 A. We
the rightmost ap is k þ 1. The verification is a bit tedious observe that yi cannot contain zero or one s, as it would
because four cases arise according to the fact that each of k
not be maximal by Proposition 1. Also, yi cannot contain
and p can be even or odd. Since the cases are analogous, we
three or more s, as each distinct  symbol would match the
detail only two of them, namely, when both k and p are
even, and when k is even and p is odd. In the first case, the letter T in sk giving jLyi j > jLx j, which is impossible. The
three occurrences of ap are all in uk . Moreover, the distance only possibility is that yi contains exactly two s as x does
between the two leftmost letters ap is the length of the at the same positions because Ly Lx and they are
substring ap apþ2    ak2 uak2 ak4    apþ2 , that is, 2japþ2    maximal. It follows that yi ¼ x proving the proposition. t u
ak2 j þ 2 ¼ 2ðk  2  pÞ=2 þ 2 ¼ k  p. The distance be- Theorem 2. The basis for string sk contains ðn2 Þ irredundant
tween the leftmost and rightmost ap is the length of motifs, where n ¼ jsk j and k  5.
fk vwa2 a4    ap2 . This is also the length of
ap apþ2    ak2 u ev
fk vwa2 a4    ap2 ap apþ2    ak2 ¼ u ev
u ev fk vwevk , that is, Proof. By Proposition 2, the number of irredundant motifs
 
2ðk  2Þ=2 þ 3 ¼ k þ 1 as expected. In the second case in sk is at least k2
2 ¼ ðk2 Þ, the number of choices of
where k is even and p is odd, the occurrences of ap are all in two positions in fA; gk2 . Since jsk j ¼ 5k þ 1 by Fact 1,
vk . Analogously to the first case, the distance between the we get the conclusion. u
t
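The quadratic growth claimed in Theorem 2 can be made explicit by substituting $k = (n - 1)/5$ from Fact 1 into the count of Proposition 2. The following short computation is only a worked restatement of that step, not an additional result:

\[
\binom{k-2}{2} \;=\; \frac{(k-2)(k-3)}{2}
\;=\; \frac{1}{2}\cdot\frac{n-11}{5}\cdot\frac{n-16}{5}
\;=\; \frac{(n-11)(n-16)}{50} \;=\; \Omega(n^2).
\]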

4 TILING MOTIFS: THE BASIS AND ITS PROPERTIES

4.1 Terminology and Properties

In this section, we introduce a natural notion of a basis for generating all maximal motifs occurring in a string $s$ of length $n$.

Definition 7 (tiling motif). A maximal motif $x$ is tiling if, for any maximal motifs $y_1, y_2, \ldots, y_k$ and for any integers $d_1, d_2, \ldots, d_k$ such that $L_x = \bigcup_{i=1}^{k} (L_{y_i} + d_i)$, motif $x$ must be one of the $y_i$s. Conversely, if all the $y_i$s are different from $x$, pattern $x$ is said to be tiled by motifs $y_1, y_2, \ldots, y_k$.

The notion of tiling is in general more selective than that of irredundancy. Continuing our example string $s =$ FABCXFADCYZEADCEADC, we have seen in Section 2 that motif $x_1 = \mathtt{A{\circ}C}$ is irredundant for $s$. Now, $x_1$ is tiled by $x_2 = \mathtt{FA{\circ}C}$ and $x_4 = \mathtt{ADC}$ according to Definition 7 since its location list, $L_{x_1} = \{1, 6, 12, 16\}$, can be obtained from the union of $L_{x_2} = \{0, 5\}$ and $L_{x_4} = \{6, 12, 16\}$ with respective displacements $d_2 = 1$ and $d_4 = 0$.

Remark 1. A fairly direct consequence of Definition 7 is that if $x$ is tiled by $y_1, y_2, \ldots, y_k$ with associated displacements $d_1, d_2, \ldots, d_k$, then $x$ occurs at position $d_i$ in $y_i$ for $1 \le i \le k$. As a consequence, we have that $d_i \ge 0$ in Definition 7. Note also that the $y_i$s in Definition 7 are not necessarily distinct and that $k > 1$ for tiled motifs. (It follows from the fact that $L_x = L_{y_1} + d_1$ with $x \ne y_1$ would contradict the maximality of both $x$ and $y_1$.) As a result, a maximal motif $x$ occurring exactly $q$ times in $s$ is tiling as it cannot be tiled by any other motifs because such motifs would occur fewer than $q$ times.

The basis of tiling motifs is the complete set of all tiling motifs for $s$, and the size of the basis is the number of these motifs. For example, the basis, let us denote it by $B$, for FABCXFADCYZEADCEADC contains $\mathtt{FA{\circ}C}$, $\mathtt{EADC}$, and $\mathtt{ADC}$ as tiling motifs. Although Definition 7 is derived from that of irredundant motifs given in Definition 6, the difference is much more substantial than it may appear. The basis of tiling motifs relies on the fact that tiling motifs are considered as invariant by displacement as for maximality. Consequently, our definition of basis is symmetric, that is, each tiling motif in the basis for the reverse string $\tilde{s}$ is the reverse of a tiling motif in the basis of $s$. This follows from the symmetry in Definition 7 and from the fact that maximality is also symmetric in Definition 5. It is a sine qua non condition for having a notion of basis invariant by the left-to-right or right-to-left order of the symbols in $s$ (like the entropy of $s$), while this property does not hold for the irredundant motifs.

The basis of tiling motifs has further interesting properties for quorum $q = 2$, illustrated in Sections 4.2, 4.3, and 4.4. In Section 4.2, we show that our basis is linear (that is, its size is at most $n - 1$). In Section 4.3, we show that the total size of the location lists for the tiling motifs is less than $2n$, describing how to find them in $O(n^2 \log n \log |\Sigma|)$ time. In Section 4.4, we discuss some applications such as generating all maximal motifs with the basis and finding motifs with a constraint on the number of undefined symbols.

4.2 A Linear Upper Bound for the Tiling Motifs with Quorum $q = 2$

Given a string $s$ of length $n$, let $B$ denote its basis of tiling motifs for quorum $q = 2$. Although the number of maximal motifs may be exponential and the basis of irredundant motifs may be at least quadratic (see Section 3), we show that the size of $B$ is always less than $n$. For this, we introduce an operator $\oplus$ between the symbols of $\Sigma$ to define the merges, which are at the heart of the properties of $B$. Given two letters $\sigma_1, \sigma_2 \in \Sigma$ with $\sigma_1 \ne \sigma_2$, the operator satisfies $\sigma_1 \oplus \sigma_2 = \circ$ and $\sigma_1 \oplus \sigma_1 = \sigma_1$. The operator applies to any pair of strings $x, y \in \Sigma^*$, so that $u = x \oplus y$ satisfies $u[j] = x[j] \oplus y[j]$ for all integers $j$.

Definition 8 (Merge). For $1 \le k \le n - 1$, let $s_k$ be the (infinite) string whose character at position $i$ is $s_k[i] = s[i] \oplus s[i + k]$. If $s_k$ contains at least one solid character, $\mathit{Merge}_k$ denotes the motif obtained by removing all the leading and trailing $\circ$s in $s_k$ (that is, those appearing before the leftmost solid character and after the rightmost solid character).

For example, FABCXFADCYZEADCEADC has $\mathit{Merge}_4 = \mathtt{EADC}$, $\mathit{Merge}_5 = \mathtt{FA{\circ}C}$, $\mathit{Merge}_6 = \mathit{Merge}_{10} = \mathtt{ADC}$, and $\mathit{Merge}_{11} = \mathit{Merge}_{15} = \mathtt{A{\circ}C}$. The latter is the only merge that is not a tiling motif.

Lemma 1. If $\mathit{Merge}_k$ exists, it must be a maximal motif.

Proof. Motif $x = \mathit{Merge}_k$ occurs at positions, say, $i$ and $i + k$ in $s$. Character $s_k[i]$ is solid by Definitions 4 and 8. We use the fact that $x$ occurs at least twice in $s$ for showing that it is maximal. Suppose it is not maximal. By Definition 5, there exists $y \ne x$ such that $x$ occurs in $y$ and $L_y = L_x + d$ for some integer $d$ (in this case $d \le 0$). Since $y$ is more specific than $x$ displaced by $d$, there must exist at least one position $j$ with $0 \le j < |y|$ such that $x[j + d] = \circ$ and $y[j] = \sigma \in \Sigma$. Hence, $x[j + d] = s[i + (j + d)] \oplus s[i + k + (j + d)] = \circ$, and so $s[(i + d) + j] \ne s[(i + k + d) + j]$. Since $y[j]$ cannot match both of the latter symbols in $s$, at least one of $i + d$ or $i + k + d$ is not a position of $y$ in $s$. This contradicts the hypothesis that $L_y = L_x + d$, whereas both $i, i + k \in L_x$. □

Lemma 2. For each tiling motif $x$ in the basis $B$, there is at least one $k$ for which $\mathit{Merge}_k = x$.

Proof. As mentioned in Remark 1, a maximal motif occurring exactly twice in $s$ is tiling. Hence, if $|L_x| = 2$, say $L_x = \{i, j\}$ with $j > i$, then $x = \mathit{Merge}_k$ with $k = j - i$ by the maximality of $x$ and that of the merges by Lemma 1. Let us now consider the case where $|L_x| > 2$. For any pair $i, j \in L_x$, we denote by $u_{ij}$ the string $s[i..i + |x| - 1] \oplus s[j..j + |x| - 1]$ obtained by applying the $\oplus$ operator to the two substrings of $s$ matching $x$ at positions $i$ and $j$, respectively. We have $x \preceq u_{ij}$ since $x$ occurs at positions $i$ and $j$, and $L_x = \bigcup_{i,j \in L_x} L_{u_{ij}}$ since we are taking all pairs of occurrences of $x$. Letting $k = |j - i|$ for $i, j \in L_x$, we observe that $u_{ij}$ is a substring of $\mathit{Merge}_k$ occurring at position, say, $\delta_k$ in it. Thus,

\[
\bigcup_{i,j \in L_x} L_{u_{ij}} \;=\; \bigcup_{k = |j - i| \,:\, i,j \in L_x} \left( L_{\mathit{Merge}_k} + \delta_k \right) \;=\; L_x .
\]

By Definition 7, the fact that $x$ is tiling implies that $x$ must be one $\mathit{Merge}_k$, proving the lemma. □
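Definition 8 can be illustrated with a few lines of code. The sketch below is only an illustration under stated assumptions: the function names `merge` and `all_merges` are hypothetical, the character `.` stands in for the wild card and is assumed not to occur in the input, and positions beyond the end of the string are treated as wild cards.

```python
WILD = "."  # assumed stand-in for the wild card; must not occur in the input


def merge(s, k):
    """Return (Merge_k, (i, i + k)) for string s, or None if Merge_k does not exist.

    Position i is that of the leftmost solid character of s_k, so (i, i + k)
    are the two occurrences whose superposition generates the merge.
    """
    n = len(s)
    # s_k[i] = s[i] (+) s[i + k]: keep the letter when both agree, else a wild card.
    column = [s[i] if s[i] == s[i + k] else WILD for i in range(n - k)]
    solid = [i for i, c in enumerate(column) if c != WILD]
    if not solid:
        return None  # s_k has no solid character, so Merge_k is undefined
    first, last = solid[0], solid[-1]
    # Strip leading and trailing wild cards.
    return "".join(column[first:last + 1]), (first, first + k)


def all_merges(s):
    """Map each k in 1..n-1 for which Merge_k exists to the motif Merge_k."""
    result = {}
    for k in range(1, len(s)):
        found = merge(s, k)
        if found is not None:
            result[k] = found[0]
    return result


if __name__ == "__main__":
    for k, motif in all_merges("FABCXFADCYZEADCEADC").items():
        print(k, motif)
```

On the running example this reproduces the six merges listed after Definition 8, with `.` in place of the wild card. Each call to `merge` scans $n - k$ positions, in line with the $O(n - k)$ cost per merge used later in Step 1 of Section 4.3.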

We now state the main property of tiling bases that follows directly from Lemma 2.

Theorem 3 (linearity of the basis). Given a string $s$ of length $n$ and the quorum $q = 2$, let $M$ be the set of $\mathit{Merge}_k$, for $1 \le k \le n - 1$ such that $\mathit{Merge}_k$ exists. The basis $B$ of tiling motifs for $s$ satisfies $B \subseteq M$ and, therefore, the size of $B$ is at most $n - 1$.

A simple consequence of Theorem 3 implies a tight bound on the number of tiling motifs for periodic strings. If $s = w^e$ for a string $w$ repeated $e > 1$ times, then $s$ has at most $|w|$ tiling motifs.

Corollary 1. The number of tiling motifs for $s$ is at most $p$, the smallest period of $s$.

The bound in Corollary 1 is not valid for irredundant motifs. String $s =$ ATATATATA has period $p = 2$ and only one tiling motif ATATATA, while its irredundant motifs are A, ATA, ATATA, and ATATATA.

4.3 A Simple Algorithm for Computing Tiling Motifs with Quorum $q = 2$

We describe how to compute the basis $B$ for string $s$ when $q = 2$. A brute-force algorithm generating first all maximal motifs of $s$ takes exponential time in the worst case. Theorem 3 plays a crucial role in that we first compute the motifs in $M$ and then discard those being tiled. Since $B \subseteq M$, what remains is exactly $B$. To appreciate this approach, it is worth noting that we are left with the problem of selecting $B$ from $n - 1$ maximal motifs in $M$ at most, rather than selecting $B$ among all the maximal motifs in $s$, which may be exponential in number. Our simple algorithm takes $O(n^2 \log n \log |\Sigma|)$ time and is faster than previous (and more complicated) methods discussed in Section 1.

Step 1. Compute the multiset $M'$ of merges. Letting $s_k[i]$ be the leftmost solid character of string $s_k$ in Definition 8, we define $occ_x = \{i, i + k\}$ to be the positions of the two occurrences of $x$ whose superposition generates $x = \mathit{Merge}_k$. For $k = 1, 2, \ldots, n - 1$, we compute string $s_k$ in $O(n - k)$ time. If $s_k$ contains some solid characters, we compute $x = \mathit{Merge}_k$ and $occ_x$ in the same time complexity. As a result, we compute the multiset $M'$ of merges in $O(n^2)$ time. Each merge $x$ in $M'$ is identified by a triplet $\langle i, i + k, |x| \rangle$, from which we can recover the $j$th symbol of $x$ in constant time by simple arithmetic operations and comparisons.

Step 2. Transform the multiset $M'$ into the set $M$ of merges. Since there can be two or more merges in $M'$ that are identical and correspond to the same merge in $M$, we put together all identical merges in $M'$ by radix sorting them. The total cost of this step is dominated by radix sorting, giving $O(n^2)$ time. As a byproduct, we produce the temporary location list $T_x = \bigcup_{x' = x \,:\, x' \in M'} occ_{x'}$ for each distinct $x \in M$ thus obtained.

Lemma 3. Each motif $x \in B$ satisfies $T_x = L_x$.

Proof. For a fixed $x \in B$, the fact that $x$ is equal to at least one merge by Lemma 2 implies that $T_x$ is well defined, with $|T_x| \ge 2$. Since $T_x \subseteq L_x$, let us assume by contradiction that $L_x - T_x \ne \emptyset$. For each pair $i \in L_x - T_x$ and $j \in T_x$, let $m_{ij} = \mathit{Merge}_{|j - i|}$, which is maximal by Lemma 1. Note that each $m_{ij} \ne x$ by our assumption as otherwise $i$ would belong to $T_x$; however, $x$ must occur in $m_{ij}$, say, at position $\delta_{ij}$ in $m_{ij}$. Consequently, $\bigcup_{i \in L_x - T_x,\, j \in T_x} (L_{m_{ij}} + \delta_{ij}) = L_x$ since any occurrence of $x$ is either $i \in L_x - T_x$ or $j \in T_x$. At this point, we apply Definition 7 to the tiling motif $x$, obtaining the contradiction that $x$ must be equal to one $m_{ij}$. □

Notice that the conclusion of Lemma 3 does not necessarily hold for the motifs in $M - B$. For the previous example string FADABCXFADCYZEADCEADCFADC, one such motif is $x = \mathtt{ADC}$ with $L_x = \{8, 14, 18, 22\}$ while $T_x = \{8, 18\}$.

Step 3. Select $M^* \subseteq M$, where $M^* = \{x \in M : T_x = L_x\}$. In order to build $M^*$, we employ the Fischer-Paterson algorithm based on convolution [8] for string matching with don't cares to compute the whole list of occurrences $L_x$ for each merge $x \in M$. Its cost is $O((|x| + n) \log n \log |\Sigma|)$ time for each merge $x$. Since $|x| < n$ and there are at most $n - 1$ motifs $x \in M$, we obtain $O(n^2 \log n \log |\Sigma|)$ time to construct all lists $L_x$. We can compute $M^*$ by discarding the merges $x \in M$ such that $T_x \ne L_x$ in additional $O(n^2)$ time.

Lemma 4. The set $M^*$ satisfies the conditions $B \subseteq M^*$ and $\sum_{x \in M^*} |L_x| < 2n$.

Proof. The first condition follows from the fact that the motifs in $M - M^*$ are surely tiled by Lemma 3. The second condition follows from the definition of $M^*$ and from the observation that

\[
\sum_{x \in M^*} |L_x| \;=\; \sum_{x \in M^*} |T_x| \;\le\; \sum_{x \in M} |occ_x| \;<\; 2n,
\]

since $|occ_x| = 2$ (see Step 1) and there are fewer than $n$ of them. □

The property of $M^*$ in Lemma 4 is crucial in that $\sum_{x \in M} |L_x| = \Theta(n^2)$ when many lists contain $\Theta(n)$ entries. For example, $s = \mathtt{A}^n$ has $n - 1$ distinct merges, each of the form $x = \mathtt{A}^i$ for $1 \le i \le n - 1$, and so $|L_x| = n - i + 1$. This would be a sharp drawback in Step 4 when removing tiled motifs as it may turn into a $\Theta(n^3)$ algorithm. Using $M^*$ instead, we are guaranteed that $\sum_{x \in M^*} |L_x| = O(n)$; hence, we may still have some tiled motifs in $M^*$, but their total number of occurrences is $O(n)$.

Step 4. Discard the tiled motifs in $M^*$. We can now check for tiling motifs in $O(n^2)$ time. Given two distinct motifs $x, y \in M^*$, we want to test whether $L_x + d \subseteq L_y$ for some integer $d$ and, in that case, we want to mark the entries in $L_y$ that are also in $L_x + d$. At the end of this task, the lists having all entries marked are tiled (see Definition 7). By removing their corresponding motifs from $M^*$, we eventually obtain the basis $B$ by Lemma 4. Since the meaningful values of $d$ are as many as the entries of $L_y$, we have only $|L_y|$ possible values to check. For a given value of $d$, we avoid merging $L_x$ and $L_y$ in $O(|L_x| + |L_y|)$ time to perform the test, as it would contribute to a total of $\Theta(n^3)$ time. Instead, we exploit the fact that each list has values ranging from 1 to $n$, and use two bit-vectors of size $n$ to perform the above check in $O(|L_x| \times |L_y|)$ time for all values of $d$. This gives $O(\sum_y \sum_x |L_x| \times |L_y|) = O(\sum_y |L_y| \times \sum_x |L_x|) = O(n^2)$ by Lemma 4.

We therefore detail how to perform the above check with $L_x$ and $L_y$ in $O(|L_x| \times |L_y|)$ time. We use two bit-vectors $V_1$ and $V_2$ of length $n$ initially set to all zeros. Given $y \in M^*$, we set $V_1[i] = 1$ if $i \in L_y$. For each $x \in M^* - \{y\}$ and for each $d \in (L_y - m)$ (where $m$ is the smallest entry of $L_x$), we then perform the following test. If all $j \in L_x + d$ satisfy $V_1[j] = 1$, we set $V_2[j] = 1$ for all such $j$. Otherwise, we take the next value of $d$, or the next motif if there are no more values of $d$, and we repeat the test. After examining all $x \in M^* - \{y\}$, we check whether $V_1[i] = V_2[i]$ for all $i \in L_y$. If so, $y$ is tiled as its list is covered by possibly shifted location lists of other motifs. We then reset the ones in both vectors in $O(|L_y|)$ time.

Summing up Steps 1-4, we have that the dominant cost is that of Step 3 and that we have proved the following result.

Theorem 4. Given an input string $s$ of length $n$ over the alphabet $\Sigma$, the basis of tiling motifs with quorum $q = 2$ can be computed in $O(n^2 \log n \log |\Sigma|)$ time. The total number of motifs in the basis is less than $n$, and the total number of their occurrences in $s$ is less than $2n$.

We have implemented the algorithm underlying Theorem 4, and we report here the lessons learned from our experiments. Step 1 requires, in practice, less than the predicted $O(n^2)$ running time. If $p = 1/|\Sigma|$ denotes the probability that two randomly chosen symbols of $\Sigma$ match in the uniform distribution, the probability of finding the first solid character in a merge follows the binomial distribution, and so the expected number of examined characters in $s$ is $O(1/p) = O(|\Sigma|)$, yielding $O(n|\Sigma|)$ time on the average to locate the first (scanning $s$ from the beginning) and the last (scanning $s$ from the end backward) solid character in each merge. A similar approach can be followed in Step 2 for finding the distinct merges. In this case, the merges are first partially sorted using hashing and exploiting the fact that the input is almost sorted. Insertion sort is then the best choice and works very efficiently in our experiments (at least 50 percent faster than Quicksort). We do not yet compute the full merges at this stage, but we delay this expensive part to a later stage on a small set of buckets that require explicit representation of the merges. As a result, the average case is almost linear. For example, executing Steps 1 and 2 on chromosome V of C. elegans containing more than 21 million bases took around 15 minutes on a machine with 512 MB of RAM running Linux on a 1 GHz AMD Athlon processor. Step 3 is expensive also in practice and the worst case predicted by theory shows up in the experiments. Running this step on sequences much shorter than chromosome V of C. elegans took many hours. Step 4 is not much of a problem. As a result, an alternative way of selecting $M^*$ from $M$ in Step 3 that works fast in practice would considerably improve the overall performance.

4.4 Some Applications

Checking whether a pattern is a motif. The main property underlying the notion of basis is that it is a generator of all motifs. The generation can be done as follows: First select segments of motifs in the basis that start and end with solid characters, then replace any number of internal solid characters by wild cards. However, since the number of motifs, and even maximal motifs, can be exponential, this is not really meaningful unless this number is small and the time complexity of the algorithm is proportional to the total size of the output. An attempt in this direction is done in [18]. The dual problem concerns testing only one pattern. We show how, given a pattern $x$, it can be tested whether $x$ is a motif for string $s$, that is, if pattern $x$ occurs at least $q$ times in $s$. There are two possible ways of performing such a test, depending on whether we test directly on the string or on the basis. The answer relies on iterative applications of the observation made in Remark 1, according to which any tiled motif must occur in at least one tiling motif. The next two statements deal with the alternative. In both cases, we assume that integer $k$ comes from the decomposition of pattern $x$ in the form $u_0 \circ^{\ell_0} u_1 \circ^{\ell_1} \cdots u_{k-1} \circ^{\ell_{k-1}} u_k$, where the subwords $u_i$ contain no wild cards ($u_i \in \Sigma^*$, $0 \le i \le k$) and $\ell_j$ are positive integers, $0 \le j \le k - 1$. The next proposition states a well-known fact on matching such a pattern in a text without any wild card that we report here because it is used in the sequel.

Proposition 3. The positions of the occurrences of a pattern $x$ in a string of length $n$ can be computed in time $O(kn)$.

Proof. This is a mere application of matching a pattern with do not cares inside a text without do not cares. Using, for instance, the Fischer and Paterson's algorithm [8] is not necessary. Instead, the positions of the subwords $u_i$ are computed by a multiple string-matching algorithm, such as the Aho-Corasick algorithm [1]. For each position $p$, a counter associated with position $p - \ell$ on $s$ is incremented, where $\ell$ is the position of $u_i$ in $x$ ($\ell$ is the offset of $u_i$ in $x$). Counters whose value is $k + 1$ correspond then to occurrences of $x$ in $s$. It remains to check if $x$ occurs at least $q$ times in $s$. The running time is governed by the string-matching algorithm, which is $O(kn)$ (equivalent to running $k$ times a linear-time string matching algorithm). □

Proposition 4. Given the basis $B$ of string $s$, testing if pattern $x$ is a motif or a maximal motif can be done in $O(kb)$ time, where $b = \sum_{y \in B} |y|$.

Proof. From Remark 1, testing if $x$ is a maximal motif requires only finding if $x$ occurs in an element $y$ of the basis. To do this, we can apply the procedure of the previous proof because wild cards in $y$ should be viewed as extra characters that do not match any letter of $\Sigma$. The time complexity of the procedure is thus $O(kb)$. Since a nonmaximal motif occurs in a maximal motif, the same procedure applies to test if $x$ is a general motif. □

As a consequence of Propositions 3 and 4, we get an upper bound on the time complexity for testing motifs.

Corollary 2. Testing whether or not pattern $u_0 \circ^{\ell_0} u_1 \circ^{\ell_1} \cdots u_{k-1} \circ^{\ell_{k-1}} u_k$ is a motif in a string of length $n$ having a basis of total size $b$ can be done in time $O(k \times \min\{b, n\})$.

Remark 2. Inside the procedure described in the proofs of Propositions 3 and 4, it is also possible to use bit-vector pattern matching methods [3], [16], [25] to compute the occurrences of $x$. This leads to practically efficient solutions running in time proportional to the length of the string $n$ or the total size of the basis $b$, in the bit-vector model of machine. This is certainly a method of choice for short patterns.

Finding the longest motif with bounded number of wild cards. We address an interesting question concerning the computation of a longest motif occurring repeatedly in a string. Given an integer $g \ge 0$, let $LM_g(s)$ be the maximal length of motifs occurring in a string $s$ of length $n$ with quorum $q = 2$, and containing no more than $g$ wild cards. If $g = 0$, the value can be computed in $O(n \log |\Sigma|)$ time with the help of the suffix tree of $s$ (see [5] or [10]). For $g > 0$, we can show that $LM_g(s)$ can be computed in $O(gn^2)$ time using the suffix tree augmented (in linear time) to accept lowest common ancestor (LCA) queries as follows: For each possible pair $(i, j)$ of positions on $s$ for which $s[i] = s[j]$, we compute the longest common prefix of $s[i..n - 1]$ and $s[j..n - 1]$ in constant time through an LCA query on the suffix tree. If $\ell$ is the length of the prefix, we get the first part $s[i..i + \ell - 1]\circ$ of a possible longest motif. The second part is found similarly by considering the pair of positions $(i + \ell + 1, j + \ell + 1)$. The process is iterated $g$ times (or less) and provides a longest motif containing at most $g$ wild cards and occurring at positions $i$ and $j$. Length $LM_g(s)$ is obtained by taking the maximum length of motifs for all pairs of positions $(i, j)$. This yields the next result.

Proposition 5. Using the suffix tree, $LM_g(s)$ can be computed in $O(gn^2)$ time.

What makes the use of the basis of tiling motifs interesting is that computing $LM_g(s)$ becomes a mere pattern matching exercise because of the strong properties of the basis. This contrasts with the previous result grounded on the deep algorithmic technique for LCA queries.

Proposition 6. Using the basis $B$ of tiling motifs, $LM_g(s)$ can be computed in time $O(b)$, where $b = \sum_{y \in B} |y|$.

Proof. Let $x$ be a motif yielding $LM_g(s)$ (i.e., $x$ is of length $LM_g(s)$); hence, $x$ occurs at least twice in $s$. Let $y$ be a maximal motif in which $x$ occurs (we have $y = x$ if $x$ is itself maximal). Let $z$ be a tiling motif in which $y$ occurs (again we may have $z = y$ if $y$ is a tiling motif). The word $x$ then occurs in $z$ that belongs to the basis. Let us say that it matches $z[i..j]$. Assume that $x$ is not a tiling motif, that is $x \ne z$. Certainly, $i = 0$ or $z[i - 1] = \circ$, otherwise, $x$ would not be the longest with its property. For the same reason, $j = |z| - 1$ or $z[j + 1] = \circ$. But, indeed, $x$ occurs exactly in $z$, which means that the wild card symbols do not match any solid symbol. Because, otherwise, $z[i..j]$ would contain less than $g$ do not cares and could be extended by at least one symbol to the left or to the right because $x \ne z$, yielding a contradiction with the definition of $x$. Therefore, either $x$ is a tiling motif or it matches exactly a segment of one of the tiling motifs. Searching for $x$ thus reduces to finding a longest segment of a tiling motif in $B$ that contains no more than $g$ wild cards. The computation can be done in linear time with only two pointers on $s$, which proves the result. □

By Proposition 6, it is clear that a small basis $B$ leads to an efficient computation once $B$ is given. If we have to build $B$ from scratch, we can observe that no (maximal) motif can give a larger value of $LM_g(s)$ if it does not belong to $B$. With this observation, we have $O(n^2)$ running time, which always beats the $O(g n^2)$ cost of using the suffix tree. In particular, it is interesting to notice that the running time of the algorithm using the basis is independent of the parameter $g$.

5 PSEUDOPOLYNOMIAL BASES FOR HIGHER QUORUM

We now discuss the general case of quorum $q \ge 2$ for finding the basis of a string of length $n$. Differently from previous work, we show in Section 5.1 that no polynomial-time algorithm can exist for any arbitrary value of $q$ in the worst case, both for the basis of irredundant motifs and for the basis of tiling motifs. The size of these bases provably depends exponentially on suitable values of $q \ge 2$, that is, we give a lower bound of $\binom{\frac{n-1}{2} - 1}{q - 1} = \Omega\!\left(\frac{1}{2^q}\binom{n-1}{q-1}\right)$. In practice, this size has an exponential growth for increasing values of $q$ up to $O(\log n)$, but larger values of $q$ are theoretically possible in the worst case. Fixing $q = (n - 1)/4 + 1$ in our lower bound, we get a size of $\Omega(2^{(n-1)/4})$ motifs in the bases. On the average, $q = O(\log_{|\Sigma|} n)$ by extending the argument after Theorem 4, namely, using the fact that on the average the number of simultaneous comparisons to find the first solid character of a merge is $O(|\Sigma|^{q-1})$, which must be less than $n$.

We show a further property for the basis of tiling motifs in Section 5.2, giving an upper bound of $\binom{n-1}{q-1}$ on its size with a simple proof. Since we can find an algorithm taking time proportional to the square of that size, we can conclude that a worst-case polynomial-time algorithm for finding the basis of tiling motifs exists if and only if the quorum $q$ satisfies either $q = O(1)$ or $q = n - O(1)$ (the latter condition is hardly meaningful in practice).

5.1 A Lower Bound of $\binom{\frac{n-1}{2} - 1}{q - 1}$ on the Bases

We show the existence of a family of strings for which there are at least $\binom{\frac{n-1}{2} - 1}{q - 1}$ tiling motifs for a quorum $q$. Since a tiling motif is also irredundant, this gives a lower bound for the irredundant motifs to be combined with that in Section 3 (note that the lower bound in Section 3 still gives $\Omega(n^2)$ for $q \ge 2$). For $q > 2$, this gives a lower bound of $\binom{\frac{n-1}{2} - 1}{q - 1} = \Omega\!\left(\frac{1}{2^q}\binom{n-1}{q-1}\right)$ for the number of both tiling and irredundant motifs.

The strings are this time of the form $t_k = \mathtt{A}^k \mathtt{T} \mathtt{A}^k$ ($k \ge 5$), without the left extension used in the bound of Section 3. The proof proceeds by exhibiting $\binom{k-1}{q-1}$ motifs that are maximal and have each exactly $q$ occurrences, from which it follows immediately that they are tiling. Indeed, Remark 1 for tiling motifs holds for any $q \ge 2$. Namely, all maximal motifs that occur exactly $q$ times in a string are tiling.

Proposition 7. For $2 \le q \le k$ and $1 \le p \le k - q + 1$, any motif $\mathtt{A}^p \circ \{\mathtt{A}, \circ\}^{k-p-1} \circ \mathtt{A}^p$ with exactly $q$ wild cards is tiling (and so irredundant) in $t_k$.

Proof. Let $x$ be an arbitrary motif $\mathtt{A}^p \circ \{\mathtt{A}, \circ\}^{k-p-1} \circ \mathtt{A}^p$ with $1 \le p \le k - q + 1$ and $q$ wild cards; namely, $x = \mathtt{A}^{p_1} \circ \mathtt{A}^{p_2 - p_1 - 1} \circ \cdots \circ \mathtt{A}^{p_{q-1} - p_{q-2} - 1} \circ \mathtt{A}^{k - p_{q-1} - 1} \circ \mathtt{A}^{p_1}$ for $1 \le p_1 < p_2 < \cdots < p_{q-1} \le k - 1$ and $p = p_1$. We first have to prove that $x$ is a maximal motif according to Definition 5. Its length is $k + 1 + p_1$ and its location list is $L_x = \{0, k - p_{q-1}, \ldots, k - p_2, k - p_1\}$. Observe that the number of its occurrences is exactly the number of times the wild card appears in $x$, which is equal to $q$. A motif $y$ different from $x$ such that $x$ occurs in $y$ can be obtained by replacing the wild card at position $p_i$ with a solid symbol, for $1 \le i \le q - 1$, but this eliminates $k - p_i$ from the location list of $y$. Also, $y$ can be obtained by extending $x$ to the right by a solid symbol (at any position $\ge |x|$), but then position $k - p_1$ is not in $L_y$ because the last symbol in that occurrence of $y$ occupies position $(k - p_1) + |y| - 1 \ge (k - p_1) + |x| = (k - p_1) + (k + 1 + p_1) > |t_k| - 1$ in $t_k$, which is impossible. Analogously, $y$ can be obtained by extending $x$ to the left by a solid symbol (at any position $d < 0$), but position 0 is no longer in $L_y$. Consequently, for any motif $y$ more specific than $x$, we have $L_y \ne L_x + d$, implying that $x$ is maximal. As previously mentioned, $x$ is tiling because it has exactly $q$ occurrences. □

Theorem 5. String $t_k$ has $\binom{\frac{n-1}{2} - 1}{q - 1} = \Omega\!\left(\frac{1}{2^q}\binom{n-1}{q-1}\right)$ tiling (and irredundant) motifs, where $n = |t_k|$ and $k \ge 2$.

Proof. By Proposition 7, the tiling or irredundant motifs in $t_k$ are at least $\binom{k-1}{q-1}$, the number of choices of $q - 1$ positions on $\mathtt{A}^{k-1}$. Since $n = 2k + 1$, we obtain the statement. □

5.2 An Upper Bound of $\binom{n-1}{q-1}$ Tiling Motifs

We now prove that $\binom{n-1}{q-1}$ is an upper bound for the size of a basis of tiling motifs for a string $s$ and quorum $q \ge 2$. Let us denote as before such a basis by $B$. To prove the upper bound, we use again the notion of a merge, except that it now involves $q$ strings. The operator $\oplus$ between the elements of $\Sigma$ extends to more than two arguments, so that the result is a $\circ$ if at least two arguments differ. Let $k$ denote now an array of $q - 1$ positive values $k_1, \ldots, k_{q-1}$ with $1 \le k_i < k_j \le n - 1$ for all $1 \le i < j \le q - 1$.

Definition 9. Let $s_k$ denote the string such that its $j$th character is $s_k[j] = s[j] \oplus s[j + k_1] \oplus \cdots \oplus s[j + k_{q-1}]$ for all integers $j$. $\mathit{Merge}_k$ is the pattern obtained by removing all the leading and trailing $\circ$s in $s_k$ (that is, appearing before the leftmost solid character and after the rightmost solid character).

Lemmas 5 and 6 reported below extend Lemmas 1 and 2 for $q > 2$.

Lemma 5. If $\mathit{Merge}_k$ exists for quorum $q$, then it must be a [...] implies there exists at least one position $j$ with $0 \le j < |y|$ such that $y[j] = \sigma \in \Sigma$ and $x[j + d] = \circ$. Since

\[
x[j + d] = s[i + j + d] \oplus s[i + j + k_1 + d] \oplus \cdots \oplus s[i + j + k_{q-1} + d],
\]

then at least one among $i + d, i + k_1 + d, \ldots, i + k_{q-1} + d$ is not an occurrence of $y$, contradicting the hypothesis that $L_y = L_x + d$ (since $i, i + k_1, \ldots, i + k_{q-1} \in L_x$). □

Lemma 6. For each tiling motif $x$ in the basis $B$ with quorum $q$, there is at least one $k$ for which $\mathit{Merge}_k = x$.

Proof. If $|L_x| = q$ and $L_x = \{i_1, \ldots, i_q\}$ with $i_1 < \cdots < i_q$, then $x = \mathit{Merge}_k$ where $k$ is the array of values $i_2 - i_1, i_3 - i_1, \ldots, i_q - i_1$. Let us now consider the case where $|L_x| > q$. Given any $q$-tuple $i_1, \ldots, i_q \in L_x$, let $u_k$ denote $s[i_1..i_1 + |x| - 1] \oplus \cdots \oplus s[i_q..i_q + |x| - 1]$, which is a substring of $\mathit{Merge}_k$ introduced in Definition 9. We have that $x \preceq u_k$ and $L_x = \bigcup_{i_1, i_2, \ldots, i_q \in L_x} L_{u_k}$. Since each $u_k$ for $i_1, i_2, \ldots, i_q \in L_x$ is a substring of $\mathit{Merge}_k$, we infer that $L_x = \bigcup_{i_1, i_2, \ldots, i_q \in L_x} \left( L_{\mathit{Merge}_k} + \delta_k \right)$ where the $\delta_k$s are non-negative integers. By Definition 7, if $\mathit{Merge}_k$ were different from $x$, then $x$ would not be tiling, which is a contradiction. Therefore, at least one $\mathit{Merge}_k$ is $x$. □

The following property of tiling bases follows from Lemmas 5 and 6.

Theorem 6. Given a string $s$ of length $n$ and a quorum $q \ge 2$, let $M$ be the set of $\mathit{Merge}_k$, for any of the $\binom{n-1}{q-1}$ possible choices of $k$ for which $\mathit{Merge}_k$ exists. The basis $B$ of tiling motifs for $s$ satisfies $B \subseteq M$ and, therefore, the size of $B$ is at most $\binom{n-1}{q-1}$.

The tiling motifs in our basis appear in $s$ for a total of $q \binom{n-1}{q-1}$ times at most. A variation of the algorithm given in Section 4.3 gives a pseudopolynomial-time complexity of

\[
O\!\left( q^2 \binom{n-1}{q-1}^{2} \right).
\]

When this upper bound is combined with the lower bound of Section 5.1, we obtain that there exists a polynomial-time algorithm for finding the basis if and only if either $q = O(1)$ or $q = n - O(1)$.

6 CONCLUSIONS

The work presented in this paper is theoretical in nature, but it should be clear by now that its practical consequences, particularly, but not exclusively, for computational biology, are relevant. Whether motifs as patterns are used for inferring binding sites or repeats of any length, for character-
maximal motif. izing sequences or as a filtering step in a whole genome
Proof. Let x ¼ Mergek denote the (nonempty) pattern, and comparison algorithm or before inferring PSSMs: We show
let sk ½i be its first character, which is solid by that wild cards alone are not enough for a biologically
Definition 9. Since x occurs at least q times in s, at satisfying definition of the patterns of interest. Simply
positions i; i þ k1 ; . . . ; i þ kq1 , then x is a motif for throwing away the pattern-type of motif detection is not a
quorum q. We show that x is maximal. Suppose it is good way to address the problem. This is confirmed by
not maximal. By Definition 5, there exists y 6¼ x s.t. x various biological publications [24], [7] as well as by the not yet
occurs in y and Ly ¼ Lx þ d for some integer d. This published—but already publicly available—results of a first

motif detection competition http://bio.cs.washington.edu/assessment/. Even if patterns are not the best way of modeling biological features, they deserve an important function in any future improved algorithm for inferring motifs ab initio from biological sequences. As such, the purpose of this paper is to shed some further light on the inner structure of one important type of motif.

ACKNOWLEDGMENTS

Many suggestions from the anonymous referees greatly improved the original form of this paper. The authors are thankful to them for this and to M.H. ter Beek for improving the English. A preliminary version of the results in this paper has been described in the technical report IGM-2002-10, July 2002 [20], and in [21]. Work was partially supported by the French program bioinformatique EPST 2002 “Algorithms for Modelling and Inference Problems in Molecular Biology.” N. Pisanti and R. Grossi were partially supported by the Italian PRIN project “ALINWEB: Algorithmics for Internet and the Web.” M.-F. Sagot was partially supported by CNRS-INRIA-INRA-INSERM action BioInformatique and the Wellcome Trust Foundation. M. Crochemore was partially supported by CNRS action AlBio, NATO Science Programme grant PST.CLG.977017, and the Wellcome Trust Foundation.

REFERENCES

[1] A. Aho and M. Corasick, “Efficient String Matching: An Aid to Bibliographic Search,” Comm. ACM, vol. 18, no. 6, pp. 333-340, 1975.
[2] A. Apostolico and L. Parida, “Incremental Paradigms of Motif Discovery,” J. Computational Biology, vol. 11, no. 1, pp. 15-25, 2004.
[3] R. Baeza-Yates and G. Gonnet, “A New Approach to Text Searching,” Comm. ACM, vol. 35, pp. 74-82, 1992.
[4] A. Brazma, I. Jonassen, I. Eidhammer, and D. Gilbert, “Approaches to the Automatic Discovery of Patterns in Biosequences,” J. Computational Biology, vol. 5, pp. 279-305, 1998.
[5] M. Crochemore and W. Rytter, Jewels of Stringology. World Scientific Publishing, 2002.
[6] E. Eskin, “From Profiles to Patterns and Back Again: A Branch and Bound Algorithm for Finding Near Optimal Motif Profiles,” RECOMB’04: Proc. Eighth Ann. Int’l Conf. Computational Molecular Biology, pp. 115-124, 2004.
[7] E. Eskin, U. Keich, M. Gelfand, and P. Pevzner, “Genome-Wide Analysis of Bacterial Promoter Regions,” Proc. Pacific Symp. Biocomputing, pp. 29-40, 2003.
[8] M. Fischer and M. Paterson, “String Matching and Other Products,” SIAM AMS Complexity of Computation, R. Karp, ed., pp. 113-125, 1974.
[9] M. Gribskov, A. McLachlan, and D. Eisenberg, “Profile Analysis: Detection of Distantly Related Proteins,” Proc. Nat’l Academy of Sciences, vol. 84, no. 13, pp. 4355-4358, 1987.
[10] D. Gusfield, Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge Univ. Press, 1997.
[11] G.Z. Hertz and G.D. Stormo, “Escherichia Coli Promoter Sequences: Analysis and Prediction,” Methods in Enzymology, vol. 273, pp. 30-42, 1996.
[12] C.E. Lawrence, S.F. Altschul, M.S. Boguski, J.S. Liu, A.F. Neuwald, and J.C. Wooton, “Detecting Subtle Sequence Signals: A Gibbs Sampling Strategy for Multiple Alignment,” Science, vol. 262, pp. 208-214, 1993.
[13] C.E. Lawrence and A.A. Reilly, “An Expectation Maximization (EM) Algorithm for the Identification and Characterization of Common Sites in Unaligned Biopolymer Sequences,” Proteins: Structure, Function, and Genetics, vol. 7, pp. 41-51, 1990.
[14] L. Marsan and M.-F. Sagot, “Algorithms for Extracting Structured Motifs Using a Suffix Tree with an Application to Promoter and Regulatory Site Consensus Identification,” J. Computational Biology, vol. 7, pp. 345-362, 2000.
[15] W. Miller, “Comparison of Genomic DNA Sequences: Solved and Unsolved Problems,” Bioinformatics, vol. 17, pp. 391-397, 2001.
[16] G. Myers, “A Fast Bit-Vector Algorithm for Approximate String Matching Based on Dynamic Programming,” J. ACM, vol. 46, no. 3, pp. 395-415, 1999.
[17] L. Parida, I. Rigoutsos, A. Floratos, D. Platt, and Y. Gao, “Pattern Discovery on Character Sets and Real-Valued Data: Linear Bound on Irredundant Motifs and Efficient Polynomial Time Algorithm,” Proc. SIAM Symp. Discrete Algorithms (SODA), 2000.
[18] L. Parida, I. Rigoutsos, and D. Platt, “An Output-Sensitive Flexible Pattern Discovery Algorithm,” Combinatorial Pattern Matching, A. Amir and G. Landau, eds., pp. 131-142, Springer-Verlag, 2001.
[19] J. Pelfrêne, S. Abdeddaïm, and J. Alexandre, “Extracting Approximate Patterns,” Combinatorial Pattern Matching, pp. 328-347, Springer-Verlag, 2003.
[20] N. Pisanti, M. Crochemore, R. Grossi, and M.-F. Sagot, “A Basis for Repeated Motifs in Pattern Discovery and Text Mining,” Technical Report IGM 2002-10, Institut Gaspard-Monge, Univ. of Marne-la-Vallée, July 2002.
[21] N. Pisanti, M. Crochemore, R. Grossi, and M.-F. Sagot, “A Basis of Tiling Motifs for Generating Repeated Patterns and Its Complexity for Higher Quorum,” Math. Foundations of Computer Science (MFCS), B. Rovan and P. Vojtáš, eds., pp. 622-631, Springer-Verlag, 2003.
[22] N. Pisanti, M. Crochemore, R. Grossi, and M.-F. Sagot, String Algorithmics, chapter: A Comparative Study of Bases for Motif Inference, pp. 195-225, KCL Press, 2004.
[23] D. Pollard, C. Bergman, J. Stoye, S. Celniker, and M. Eisen, “Benchmarking Tools for the Alignment of Functional Noncoding DNA,” BMC Bioinformatics, vol. 5, pp. 6-23, 2004.
[24] A. Vanet, L. Marsan, and M.-F. Sagot, “Promoter Sequences and Algorithmical Methods for Identifying Them,” Research in Microbiology, vol. 150, pp. 779-799, 1999.
[25] S. Wu and U. Manber, “Path-Matching Problems,” Algorithmica, vol. 8, no. 2, pp. 89-101, 1992.

Nadia Pisanti received the laurea degree in computer science in 1996 from the University of Pisa (Italy), the French DEA in fundamental informatics with applications to genome treatment in 1998 from the University of Marne-la-Vallee (France), and the PhD degree in computer science in 2002 from the University of Pisa. She has been postdoctorate at INRIA and at the University of Paris 13 and she is currently a research fellow in the Department of Computer Science of the University of Pisa. Her interests are in computational biology and, in particular, in motifs extraction and genome rearrangement.

Maxime Crochemore received the PhD degree in 1978 and the Doctorat d’etat in 1983 from the University of Rouen. He received his first professorship position at the University of Paris-Nord in 1975 where he acted as President of the Department of Mathematics and Computer Science for two years. He became a professor at the University Paris 7 in 1989 and was involved in the creation of the University of Marne-la-Vallee where he is presently a professor. He also created the Computer Science Research Laboratory of this university in 1991. Since then, he has been the director of the laboratory, which now has around 45 permanent researchers. Professor Crochemore has been a senior research fellow at King’s College London since 2002. He has been the recipient of several French grants on string algorithmics and bioinformatics. He participated in a good number of international projects in algorithmics and supervised 20 PhD students.

Roberto Grossi received the laurea degree in computer science in 1988, and the PhD degree in computer science in 1993, at the University of Pisa. He joined the University of Florence in 1993 as an associate researcher. Since 1998, he has been an associate professor of computer science in the Dipartimento di Informatica, University of Pisa. He has been visiting several international research institutions. His interests are in the design and analysis of algorithms and data structures, namely, dynamic and external memory algorithms, graph algorithms, experimental and algorithm engineering, fast lookup tables and dictionaries, pattern matching algorithms, text indexing, and compressed data structures.

Marie-France Sagot received the BSc degree in computer science from the University of Sao Paulo, Brazil, in 1991, the PhD degree in theoretical computer science and applications from the University of Marne-la-Vallee, France, in 1996, and the Habilitation from the same university in 2000. From 1997 to 2001, she worked as a research associate at the Pasteur Institute in Paris, France. In 2001, she moved to Lyon, France, as a research associate at the INRIA, the French National Institute for Research in Computer Science and Control. Since 2003, she has been director of research at the INRIA. Her research interests are in computational biology, algorithmics, and combinatorics.
IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 2, NO. 1, JANUARY-MARCH 2005 51

Multiseed Lossless Filtration


Gregory Kucherov, Laurent Noé, and Mikhail Roytberg

Abstract—We study a method of seed-based lossless filtration for approximate string matching and related bioinformatics
applications. The method is based on a simultaneous use of several spaced seeds rather than a single seed as studied by Burkhardt
and Kärkkäinen [1]. We present algorithms to compute several important parameters of seed families, study their combinatorial
properties, and describe several techniques to construct efficient families. We also report a large-scale application of the proposed
technique to the problem of oligonucleotide selection for an EST sequence database.

Index Terms—Filtration, string matching, gapped seed, gapped q-gram, local alignment, sequence similarity, seed family, multiple
spaced seeds, dynamic programming, EST, oligonucleotide selection.

1 INTRODUCTION

FILTERING is a widely-used technique in biosequence analysis. Applied to the approximate string matching problem [2], it can be summarized by the following two-stage scheme: To find approximate occurrences (matches) of a given string in a sequence (text), one first quickly discards (filters out) those sequence regions where matches cannot occur, and then checks out the remaining parts of the sequence for actual matches. The filtering is done according to small patterns of a specified form that the searched string is assumed to share, in the exact way, with its approximate occurrences. A similar filtration scheme is used by heuristic local alignment algorithms ([3], [4], [5], [6], to mention a few): They first identify potential similarity regions that share some patterns and then actually check whether those regions represent a significant similarity by computing a corresponding alignment.

Two types of filtering should be distinguished: lossless and lossy. A lossless filtration guarantees to detect all sequence fragments under interest, while a lossy filtration may miss some of them, but still tries to detect a majority of them. Local alignment algorithms usually use a lossy filtration. On the other hand, the lossless filtration has been studied in the context of approximate string matching problem [7], [1]. In this paper, we focus on the lossless filtration.

In the case of lossy filtration, its efficiency is measured by two parameters, usually called selectivity and sensitivity. The sensitivity measures the part of sequence fragments of interest that are missed by the filter (false negatives), and the selectivity indicates what part of detected candidate fragments do not actually represent a solution (false positives). In the case of lossless filtration, only the selectivity parameter makes sense and is therefore the main characteristic of the filtration efficiency.

The choice of patterns that must be contained in the searched sequence fragments is a key ingredient of the filtration algorithm. Gapped seeds (spaced seeds, gapped q-grams) have been recently shown to significantly improve the filtration efficiency over the “traditional” technique of contiguous seeds. In the framework of lossy filtration for sequence alignment, the use of designed gapped seeds has been introduced by the PATTERNHUNTER method [4] and then used by some other algorithms (e.g., [5], [6]). In [8], [9], spaced seeds have been shown to improve indexing schemes for similarity search in sequence databases. The estimation of the sensitivity of spaced seeds (as well as of some extended seed models) has been the subject of several recent studies [10], [11], [12], [13], [14], [15]. In the framework of lossless filtration for approximate pattern matching, gapped seeds were studied in [1] (see also [7]) and have also been shown to increase the filtration efficiency considerably.

In this paper, we study an extension of the lossless single-seed filtration technique [1]. The extension is based on using seed families rather than individual seeds. The idea of simultaneous use of multiple seeds for DNA local alignment was already envisaged in [4] and applied in PATTERNHUNTER II software [16]. The problem of designing efficient seed families has also been studied in [17]. In [18], multiple seeds have been applied to the protein search. However, the issues analyzed in the present paper are quite different, due to the proposed requirement for the search to be lossless.

• G. Kucherov and L. Noé are with the INRIA/LORIA, 615, rue du Jardin Botanique, B.P. 101, 54602 Villers-lès-Nancy, France. E-mail: {Gregory.Kucherov, Laurent.Noe}@loria.fr.
• M. Roytberg is with the Institute of Mathematical Problems in Biology, Pushchino, Moscow Region, Russia. E-mail: roytberg@impb.psn.ru.
Manuscript received 24 Sept. 2004; revised 13 Dec. 2004; accepted 10 Jan. 2005; published online 30 Mar. 2005.
For information on obtaining reprints of this article, please send e-mail to: tcbb@computer.org, and reference IEEECS Log Number TCBB-0154-0904.
1545-5963/05/$20.00 © 2005 IEEE. Published by the IEEE CS, CI, and EMB Societies & the ACM.

The rest of the paper is organized as follows: After formally introducing the concept of multiple seed filtering in Section 2, Section 3 is devoted to dynamic programming algorithms to compute several important parameters of seed families. In Section 4, we first study several combinatorial properties of families of seeds and, in particular, seeds having a periodic structure. These results are used to obtain a method for constructing efficient seed families. We also outline a heuristic genetic programming algorithm for constructing seed families. Finally, in Section 5, we present

several seed families we computed, and we report a large-scale experimental application of the method to a practical problem of oligonucleotide selection.

2 MULTIPLE SEED FILTERING

A seed $Q$ (called also spaced seed or gapped q-gram) is a list $\{p_1, p_2, \ldots, p_d\}$ of nonnegative integers, called matching positions, such that $p_1 < p_2 < \ldots < p_d$. By convention, we always assume $p_1 = 0$. The span of a seed $Q$, denoted $s(Q)$, is the quantity $p_d + 1$. The number $d$ of matching positions is called the weight of the seed and denoted $w(Q)$. Often, we will use a more visual representation of seeds, adopted in [1], as words of length $s(Q)$ over the two-letter alphabet {#, -}, where # occurs at all matching positions and - at all positions in between. For example, seed $\{0, 1, 2, 4, 6, 9, 10, 11\}$ of weight 8 and span 12 is represented by word ###-#-#--###. The character - is called a joker. Note that, unless otherwise stated, the seed has the character # at its first and last positions.

Intuitively, a seed specifies the set of patterns that, if shared by two sequences, indicate a possible similarity between them. Two sequences are similar if the Hamming distance between them is smaller than a certain threshold. For example, sequences CACTCGT and CACACTT are similar within Hamming distance 2 and this similarity is detected by the seed ##-# at position 2. We are interested in seeds that detect all similarities of a given length with a given Hamming distance.

Formally, a gapless similarity (hereafter simply similarity) of two sequences of length $m$ is a binary word $w \in \{0, 1\}^m$ interpreted as a sequence of matches (1s) and mismatches (0s) of individual characters from the alphabet of input sequences. A seed $Q = \{p_1, p_2, \ldots, p_d\}$ matches a similarity $w$ at position $i$, $1 \le i \le m - p_d$, iff for every $j \in [1..d]$, we have $w[i + p_j] = 1$. In this case, we also say that seed $Q$ has an occurrence in similarity $w$ at position $i$. A seed $Q$ is said to detect a similarity $w$ if $Q$ has at least one occurrence in $w$.

Given a similarity length $m$ and a number of mismatches $k$, consider all similarities of length $m$ containing $k$ 0s and $(m - k)$ 1s. These similarities are called $(m, k)$-similarities. A seed $Q$ solves the detection problem $(m, k)$ (for short, the $(m, k)$-problem) iff all of $\binom{m}{k}$ $(m, k)$-similarities $w$ are detected by $Q$. For example, one can check that seed #-##--#-## solves the $(15, 2)$-problem.

Note that the weight of the seed is directly related to the selectivity of the corresponding filtration procedure. A larger weight improves the selectivity, as fewer similarities will pass through the filter. On the other hand, a smaller weight reduces the filtration efficiency. Therefore, the goal is to solve an $(m, k)$-problem by a seed with the largest possible weight.

Solving $(m, k)$-problems by a single seed has been studied by Burkhardt and Kärkkäinen [1]. An extension we propose here is to use a family of seeds, instead of a single seed, to solve the $(m, k)$-problem. Formally, a finite family of seeds $F = \langle Q_l \rangle_{l=1}^{L}$ solves an $(m, k)$-problem iff for any $(m, k)$-similarity $w$, there exists a seed $Q_l \in F$ that detects $w$.

Note that the seeds of the family are used in the complementary (or disjunctive) fashion, i.e., a similarity is detected if it is detected by one of the seeds. This differs from the conjunctive approach of [7] where a similarity should be detected by two seeds simultaneously.

The following example motivates the use of multiple seeds. In [1], it has been shown that a seed solving the $(25, 2)$-problem has the maximal weight 12. The only such seed (up to reversal) is

    ###-#--###-#--###-#

However, the problem can be solved by the family composed of the following two seeds of weight 14:

    #####-##---#####-##

and

    #-##---#####-##---####

Clearly, using these two seeds increases the selectivity of the search, as only similarities having 14 or more matching characters pass the filter versus 12 matching characters in the case of single seed. On uniform Bernoulli sequences, this results in the decrease of the number of candidate similarities by the factor of $|A|^2 / 2$, where $A$ is the input alphabet. This illustrates the advantage of the multiple seed approach: it allows one to increase the selectivity while preserving a lossless search. The price to pay for this gain in selectivity is multiplying the work on identifying the seed occurrences. In the case of large sequences, however, this is largely compensated by the decrease in the number of false positives caused by the increase of the seed weight.

3 COMPUTING PROPERTIES OF SEED FAMILIES

Burkhardt and Kärkkäinen [1] proposed a dynamic programming algorithm to compute the optimal threshold of a given seed, that is, the minimal number of its occurrences over all possible $(m, k)$-similarities. In this section, we describe an extension of this algorithm for seed families and, on the other hand, describe dynamic programming algorithms for computing two other important parameters of seed families that we will use in a later section.

Consider an $(m, k)$-problem and a family of seeds $F = \langle Q_l \rangle_{l=1}^{L}$. We need the following notations:

• $s_{max} = \max\{s(Q_l)\}_{l=1}^{L}$, $s_{min} = \min\{s(Q_l)\}_{l=1}^{L}$,
• for a binary word $w$ and a seed $Q_l$, $\mathrm{suff}(Q_l, w) = 1$ if $Q_l$ matches $w$ at position $(|w| - s(Q_l) + 1)$ (i.e., matches a suffix of $w$), otherwise $\mathrm{suff}(Q_l, w) = 0$,
• $\mathrm{last}(w) = 1$ if the last character of $w$ is 1, otherwise $\mathrm{last}(w) = 0$, and
• $\mathrm{zeros}(w)$ is the number of 0s in $w$.

3.1 Optimal Threshold

Given an $(m, k)$-problem, a family of seeds $F = \langle Q_l \rangle_{l=1}^{L}$ has the optimal threshold $T_F(m, k)$ if every $(m, k)$-similarity

has at least $T_F(m, k)$ occurrences of seeds of $F$ and this is the maximal number with this property. Note that overlapping occurrences of a seed as well as occurrences of different seeds at the same position are counted separately. For example, the singleton family {###-##} has threshold 2 for the $(15, 2)$-problem.

Clearly, $F$ solves an $(m, k)$-problem if and only if $T_F(m, k) > 0$. If $T_F(m, k) > 1$, then one can strengthen the detection criterion by requiring several seed occurrences for a similarity to be detected. This shows the importance of the optimal threshold parameter.

We now describe a dynamic programming algorithm for computing the optimal threshold $T_F(m, k)$. For a binary word $w$, consider the quantity $T_F(m, k, w)$ defined as the minimal number of occurrences of seeds of $F$ in all $(m, k)$-similarities which have the suffix $w$. By definition, $T_F(m, k) = T_F(m, k, \varepsilon)$. Assume that we precomputed values $\mathcal{T}_F(j, w) = T_F(s_{max}, j, w)$, for all $j \le \max\{k, s_{max}\}$, $|w| = s_{max}$. The algorithm is based on the following recurrence relations on $T_F(i, j, w)$, for $i \ge s_{max}$.

$$
T_F(i, j, w[1..n]) =
\begin{cases}
\mathcal{T}_F(j, w), & \text{if } i = s_{max}; \\
T_F(i-1, j-1, w[1..n-1]), & \text{if } w[n] = 0; \\
T_F(i-1, j, w[1..n-1]) + \left[\sum_{l=1}^{L} \mathrm{suff}(Q_l, w)\right], & \text{if } n = s_{max}; \\
\min\{T_F(i, j, 1.w),\; T_F(i, j, 0.w)\}, & \text{if } \mathrm{zeros}(w) < j; \\
T_F(i, j, 1.w), & \text{if } \mathrm{zeros}(w) = j.
\end{cases}
$$

The first relation is an initial condition of the recurrence. The second one is based on the fact that if the last symbol of $w$ is 0, then no seed can match a suffix of $w$ (as the last position of a seed is always assumed to be a matching position). The third relation reduces the size of the problem by counting the number of suffix seed occurrences. The fourth one splits the counting into two cases, by considering two possible characters occurring on the left of $w$. If $w$ already contains $j$ 0s, then only 1 can occur on the left of $w$, as stated by the last relation.

A dynamic programming implementation of the above recurrence allows to compute $T_F(m, k, \varepsilon)$ in a bottom-up fashion, starting from initial values $\mathcal{T}_F(j, w)$ and applying the above relations in the order in which they are given. A straightforward dynamic programming implementation requires $O(m \cdot k \cdot 2^{(s_{max}+1)})$ time and space. However, the space complexity can be immediately improved: If values of $i$ are processed successively, then only $O(k \cdot 2^{(s_{max}+1)})$ space is needed. Furthermore, for each $i$ and $j$, it is not necessary to consider all $2^{(s_{max}+1)}$ different strings $w$, but only those which contain up to $j$ 0s. The number of those $w$ is $g(j, s_{max}) = \sum_{e=0}^{j} \binom{s_{max}}{e}$. For each $i$, $j$ ranges from 0 to $k$. Therefore, for each $i$, we need to store $f(k, s_{max}) = \sum_{j=0}^{k} g(j, s_{max}) = \sum_{j=0}^{k} \binom{s_{max}}{j} \cdot (k - j + 1)$ values. This yields the same space complexity as for computing the optimal threshold for one seed [1].

The quantity $\sum_{l=1}^{L} \mathrm{suff}(Q_l, w)$ can be precomputed for all considered words $w$ in time $O(L \cdot g(k, s_{max}))$ and space $O(g(k, s_{max}))$, under the assumption that checking an individual match is done in constant time. This leads to the overall time complexity $O(m \cdot f(k, s_{max}) + L \cdot g(k, s_{max}))$ with the leading term $m \cdot f(k, s_{max})$ (as $L$ is usually small compared to $m$ and $g(k, s_{max})$ is smaller than $f(k, s_{max})$).

3.2 Number of Undetected Similarities

We now describe a dynamic programming algorithm that computes another characteristic of a seed family, that will be used later in Section 4.4. Consider an $(m, k)$-problem. Given a seed family $F = \langle Q_l \rangle_{l=1}^{L}$, we are interested in the number $U_F(m, k)$ of $(m, k)$-similarities that are not detected by $F$. For a binary word $w$, define $U_F(m, k, w)$ to be the number of undetected $(m, k)$-similarities that have the suffix $w$.

Similar to [10], let $X(F)$ be the set of binary words $w$ such that 1) $|w| \le s_{max}$, 2) for any $Q_l \in F$, $\mathrm{suff}(Q_l, 1^{s_{max}-|w|} w) = 0$, and 3) no proper suffix of $w$ satisfies 2). Note that word 0 belongs to $X(F)$, as the last position of every seed is a matching position.

The following recurrence relations allow to compute $U_F(i, j, w)$ for $i \le m$, $j \le k$, and $|w| \le s_{max}$:

$$
U_F(i, j, w[1..n]) =
\begin{cases}
\binom{i - |w|}{j - \mathrm{zeros}(w)}, & \text{if } i < s_{min}; \\
0, & \text{if } \exists l \in [1..L],\; \mathrm{suff}(Q_l, w) = 1; \\
U_F(i-1, j - \mathrm{last}(w), w[1..n-1]), & \text{if } w \in X(F); \\
U_F(i, j, 1.w) + U_F(i, j, 0.w), & \text{if } \mathrm{zeros}(w) < j; \\
U_F(i, j, 1.w), & \text{if } \mathrm{zeros}(w) = j.
\end{cases}
$$

The first condition says that if $i < s_{min}$, then no word of length $i$ will be detected, hence the binomial coefficient. The second condition is straightforward. The third relation follows from the definition of $X(F)$ and allows us to reduce the size of the problem. The last two conditions are similar to those from the previous section.

The set $X(F)$ can be precomputed in time $O(L \cdot g(k, s_{max}))$ and the worst-case time complexity of the whole algorithm remains $O(m \cdot f(k, s_{max}) + L \cdot g(k, s_{max}))$.

3.3 Contribution of a Seed

Using a similar dynamic programming technique, one can compute, for a given seed of the family, the number of $(m, k)$-similarities that are detected only by this seed and not by the others. Together with the number of undetected similarities, this parameter will be used later in Section 4.4.

Given an $(m, k)$-problem and a family $F = \langle Q_l \rangle_{l=1}^{L}$, we define $S_F(m, k, l)$ to be the number of $(m, k)$-similarities detected by the seed $Q_l$ exclusively (through one or several occurrences), and $S_F(m, k, l, w)$ to be the number of those similarities ending with the suffix $w$. A dynamic programming algorithm similar to the one described in the previous sections can be applied to compute $S_F(m, k, l)$. The recurrence is given below.
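The three parameters of this section (optimal threshold, number of undetected similarities, and exclusive contribution of a seed) can be cross-checked on small instances by plain enumeration of all $(m, k)$-similarities. The following is a minimal sketch of such a reference check, not the dynamic programming algorithms of the paper: it runs in time proportional to $\binom{m}{k}$, and all function names (`parse_seed`, `occurrences`, `similarities`, `optimal_threshold`, `undetected`, `exclusive`) are hypothetical names chosen for illustration. Positions in the code are 0-based, so an occurrence reported at index 1 corresponds to position 2 in the text.

```python
from itertools import combinations


def parse_seed(pattern):
    """Return the matching positions (0-based offsets) of a seed written over {#, -}."""
    return tuple(i for i, c in enumerate(pattern) if c == "#")


def occurrences(seed, w):
    """Count the positions of the binary tuple w where every matching position sees a 1."""
    span = seed[-1] + 1
    return sum(all(w[i + p] for p in seed) for i in range(len(w) - span + 1))


def similarities(m, k):
    """Enumerate all (m, k)-similarities: binary words of length m with exactly k zeros."""
    for zero_positions in combinations(range(m), k):
        w = [1] * m
        for z in zero_positions:
            w[z] = 0
        yield tuple(w)


def optimal_threshold(family, m, k):
    """Minimal total number of seed occurrences over all (m, k)-similarities."""
    seeds = [parse_seed(s) for s in family]
    return min(sum(occurrences(q, w) for q in seeds) for w in similarities(m, k))


def undetected(family, m, k):
    """Number of (m, k)-similarities in which no seed of the family occurs."""
    seeds = [parse_seed(s) for s in family]
    return sum(
        1 for w in similarities(m, k) if not any(occurrences(q, w) for q in seeds)
    )


def exclusive(family, m, k, l):
    """Number of (m, k)-similarities detected by seed number l (0-based) and by no other seed."""
    seeds = [parse_seed(s) for s in family]
    count = 0
    for w in similarities(m, k):
        hits = [occurrences(q, w) > 0 for q in seeds]
        if hits[l] and sum(hits) == 1:
            count += 1
    return count
```

With these definitions, `optimal_threshold(["###-##"], 15, 2)` evaluates to 2 and `undetected(["#-##--#-##"], 15, 2)` evaluates to 0, in agreement with the two $(15, 2)$ examples given in the text. Enumeration is only practical for small $m$ and $k$, which is precisely why the recurrences of this section are needed.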

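For larger instances, the optimal threshold and the number of undetected similarities can also be obtained with a forward dynamic program over a sliding window of the last $s_{max}$ characters. The sketch below is a simplified variant written for illustration, not the suffix-based recurrences of Sections 3.1 and 3.2: it stores a full dictionary of states per prefix length and does not apply the state-count optimizations discussed there. The function names are hypothetical, and the code assumes $0 \le k \le m$ and seeds that start and end with #.

```python
def parse_seed(pattern):
    """Matching positions (0-based offsets) of a seed written over {#, -}."""
    if not (pattern.startswith("#") and pattern.endswith("#")):
        raise ValueError("a seed must start and end with '#'")
    return tuple(i for i, c in enumerate(pattern) if c == "#")


def threshold_and_undetected(family, m, k):
    """Return (optimal threshold, number of undetected similarities) for the (m, k)-problem."""
    seeds = [parse_seed(s) for s in family]
    smax = max(q[-1] + 1 for q in seeds)
    full = (1 << smax) - 1
    # Bit t of a window holds the character read t steps ago (bit 0 = latest).
    # A seed ending at the current character needs bit (span - 1 - p) set
    # for each of its matching positions p.
    seed_masks = [sum(1 << (q[-1] - p) for p in q) for q in seeds]

    # State: (zeros used so far, window of the last smax characters).
    # The window starts filled with 0s; since every seed begins with '#',
    # this padding can never create a spurious occurrence.
    # Value: (minimal occurrence count, number of prefixes without any occurrence).
    states = {(0, 0): (0, 1)}
    for _ in range(m):
        nxt = {}
        for (zeros, window), (best, free) in states.items():
            for c in (0, 1):
                z = zeros + (1 - c)
                if z > k:
                    continue
                w2 = ((window << 1) | c) & full
                hits = sum((w2 & sm) == sm for sm in seed_masks)
                cand = (best + hits, free if hits == 0 else 0)
                old = nxt.get((z, w2))
                if old is None:
                    nxt[(z, w2)] = cand
                else:
                    nxt[(z, w2)] = (min(old[0], cand[0]), old[1] + cand[1])
        states = nxt

    final = [value for (zeros, _), value in states.items() if zeros == k]
    if not final:
        raise ValueError("no (m, k)-similarity exists for these parameters")
    return min(b for b, _ in final), sum(f for _, f in final)
```

The design choice here is to let the number of zeros and the last $s_{max}$ characters summarize a prefix, since any later seed occurrence depends only on that information. Restricting the windows to those with at most $k$ zeros, as the text does with $g(j, s_{max})$, happens implicitly because states exceeding $k$ zeros are never created.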
$$
S_F(i, j, l, w[1..n]) =
\begin{cases}
0, & \text{if } i < s_{min} \text{ or } \exists l' \ne l,\; \mathrm{suff}(Q_{l'}, w) = 1; \\
S_F(i-1, j-1, l, w[1..n-1]), & \text{if } w[n] = 0; \\
S_F(i-1, j, l, w[1..n-1]), & \text{if } n = |Q_l| \text{ and } \mathrm{suff}(Q_l, w) = 0; \\
S_F(i-1, j, l, w[1..n-1]) + U_F(i-1, j, w[1..n-1]), & \text{if } n = s_{max} \text{ and } \mathrm{suff}(Q_l, w) = 1 \text{ and } \forall l' \ne l,\; \mathrm{suff}(Q_{l'}, w) = 0; \\
S_F(i, j, l, 1.w[1..n]) + S_F(i, j, l, 0.w[1..n]), & \text{if } \mathrm{zeros}(w) < j; \\
S_F(i, j, l, 1.w[1..n]), & \text{if } \mathrm{zeros}(w) = j.
\end{cases}
$$

The third and fourth relations play the principal role: if $Q_l$ does not match a suffix of $w[1..n]$, then we simply drop out the last letter. If $Q_l$ matches a suffix of $w[1..n]$, but no other seed does, then we count prefixes matched by $Q_l$ exclusively (term $S_F(i-1, j, l, w[1..n-1])$) together with prefixes matched by no seed at all (term $U_F(i-1, j, w[1..n-1])$). The latter is computed by the algorithm of the previous section.

The complexity of computing $S_F(m, k, l)$ for a given $l$ is the same as the complexity of dynamic programming algorithms from the previous sections.

4 SEED DESIGN

In the previous section we showed how to compute various useful characteristics of a given family of seeds. A much more difficult task is to find an efficient seed family that solves a given $(m, k)$-problem. Note that there exists a trivial solution where the family consists of all $\binom{m}{k}$ position combinations, but this is in general unacceptable in practice because of a huge number of seeds. Our goal is to find families of reasonable size (typically, with the number of seeds smaller than 10), with a good filtration efficiency.

In this section, we present several results that contribute to this goal. In Section 4.1, we start with the case of single seed with a fixed number of jokers and show, in particular, that for one joker, there exists one best seed in a sense that will be defined. We then show in Section 4.2 that a solution for a larger problem can be obtained from a smaller one by a regular expansion operation. In Section 4.3, we focus on seeds that have a periodic structure and show how those seeds can be constructed by iterating some smaller seeds. We then show a way to build efficient families of periodic seeds. Finally, in Section 4.4, we briefly describe a heuristic approach to constructing efficient seed families that we used in the experimental part of this work presented in Section 5.

4.1 Single Seeds with a Fixed Number of Jokers

Assume that we fixed a class of seeds under interest (e.g.,

define the seed design problem is to fix a similarity length $m$ and find a seed that solves the $(m, k)$-problem with the largest possible value of $k$. A complementary definition is to fix $k$ and minimize $m$ provided that the $(m, k)$-problem is still solved. In this section, we adopt the second definition and present an optimal solution for one particular case.

For a seed $Q$ and a number of mismatches $k$, define the $k$-critical length for $Q$ as the minimal value $m$ such that $Q$ solves the $(m, k)$-problem. For a class of seeds $C$ and a value $k$, a seed $Q$ of $C$ is $k$-optimal in $C$ if $Q$ has the minimal $k$-critical length among all seeds of $C$.

One interesting class of seeds $C$ is obtained by putting an upper bound on the possible number of jokers in the seed, i.e. on the number $(s(Q) - w(Q))$. We have found a general solution of the seed design problem for the class $C_1(d)$ consisting of seeds of weight $d$ with only one joker, i.e. seeds $\#^{d-r}\,\text{-}\,\#^{r}$.

Consider first the case of one mismatch, i.e., $k = 1$. A 1-optimal seed from $C_1(d)$ is $\#^{d-r}\,\text{-}\,\#^{r}$ with $r = \lfloor d/2 \rfloor$. To see this, consider an arbitrary seed $Q = \#^{p}\,\text{-}\,\#^{q}$, $p + q = d$, and assume by symmetry that $p \ge q$. Observe that the longest $(m, 1)$-similarity that is not detected by $Q$ is $1^{p-1} 0 1^{p+q}$ of length $(2p + q)$. Therefore, we have to minimize $2p + q = d + p$, and since $p \ge \lceil d/2 \rceil$, the minimum is reached for $p = \lceil d/2 \rceil$, $q = \lfloor d/2 \rfloor$.

However, for $k \ge 2$, an optimal seed has an asymmetric structure described by the following theorem.

Theorem 1. Let $d$ be an integer and $r = [d/3]$ ($[x]$ is the closest integer to $x$). For every $k \ge 2$, seed $Q(d) = \#^{d-r}\,\text{-}\,\#^{r}$ is $k$-optimal among the seeds of $C_1(d)$.

Proof. Again, consider a seed $Q = \#^{p}\,\text{-}\,\#^{q}$, $p + q = d$, and assume that $p \ge q$. Consider the longest word $S(k)$ from $(1^*0)^k 1^*$, $k \ge 1$, which is not detected by $Q$ and let $L(k)$ be the length of $S(k)$. By the above remark, $S(1) = 1^{p-1} 0 1^{p+q}$ and $L(1) = 2p + q$.

It is easily seen that for every $k$, $S(k)$ starts either with $1^{p-1}0$, or with $1^{p+q} 0 1^{q-1} 0$. Define $L'(k)$ to be the maximal length of a word from $(1^*0)^k 1^*$ that is not detected by $Q$ and starts with $1^{q-1}0$. Since prefix $1^{q-1}0$ implies no additional constraint on the rest of the word, we have $L'(k) = q + L(k-1)$. Observe that $L'(1) = p + 2q$ (word $1^{q-1} 0 1^{p+q}$). To summarize, we have the following recurrences for $k \ge 2$:

$$L'(k) = q + L(k-1), \qquad (1)$$
$$L(k) = \max\{p + L(k-1),\; p + q + 1 + L'(k-1)\}, \qquad (2)$$

with initial conditions $L'(1) = p + 2q$, $L(1) = 2p + q$.

Two cases should be distinguished. If $p \ge 2q + 1$, then the straightforward induction shows that the first term in (2) is always greater, and we have

$$L(k) = (k + 1)p + q, \qquad (3)$$

and the corresponding longest word is
SðkÞ ¼ ð1p1 0Þk 1pþq : ð4Þ
seeds of a given minimal weight). One possible way to
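Recurrences (1) and (2) are easy to check numerically. The following Python sketch evaluates $L(k)$ for a one-joker seed $\#^{p}\texttt{-}\#^{q}$ and compares $L(k) + 1$ with a brute-force search for the smallest $m$ such that the seed solves the $(m,k)$-problem. The function names (`longest_undetected`, `solves`, `critical_length`) and the encoding of seeds as strings over `#` and `-` are illustrative assumptions, not part of the original method. The brute force enumerates every placement of exactly $k$ mismatches, so it is only practical for small $m$ and $k$.

```python
from itertools import combinations


def longest_undetected(p, q, k):
    """Evaluate recurrences (1)-(2): the length L(k) of the longest word with
    k mismatches that the one-joker seed #^p-#^q (p >= q) fails to detect."""
    L, L_prime = 2 * p + q, p + 2 * q  # initial conditions L(1), L'(1)
    for _ in range(2, k + 1):
        # The right-hand side uses the previous values L(k-1) and L'(k-1).
        L, L_prime = max(p + L, p + q + 1 + L_prime), q + L
    return L


def solves(seed, m, k):
    """Brute force: does `seed` hit every length-m word with exactly k mismatches?"""
    matching = [i for i, c in enumerate(seed) if c == "#"]
    offsets = range(m - len(seed) + 1)
    for mismatches in combinations(range(m), k):
        bad = set(mismatches)
        if not any(all(t + x not in bad for x in matching) for t in offsets):
            return False  # this similarity is missed by the seed
    return True


def critical_length(seed, k):
    """Smallest m for which `seed` solves the (m, k)-problem (upward search from the span)."""
    m = len(seed)
    while not solves(seed, m, k):
        m += 1
    return m


if __name__ == "__main__":
    p, q = 4, 2
    seed = "#" * p + "-" + "#" * q  # the seed ####-##
    for k in (1, 2, 3):
        print(k, longest_undetected(p, q, k) + 1, critical_length(seed, k))
```

For $p = 4$, $q = 2$ the recurrence gives $L(1) = 10$, $L(2) = 15$, $L(3) = 19$, and the brute-force critical lengths are 11, 16, and 20, one more than $L(k)$ in each case.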
KUCHEROV ET AL.: MULTISEED LOSSLESS FILTRATION 55

If $q \le p \le 2q + 1$, then by induction, we obtain

$$L(k) = \begin{cases} (\ell+1)p + (k+1)q + \ell & \text{if } k = 2\ell, \\ (\ell+2)p + kq + \ell & \text{if } k = 2\ell+1, \end{cases} \tag{5}$$

and

$$S(k) = \begin{cases} (1^{p+q}\,0\,1^{q-1}\,0)^{\ell}\,1^{p+q} & \text{if } k = 2\ell, \\ 1^{p-1}\,0\,(1^{p+q}\,0\,1^{q-1}\,0)^{\ell}\,1^{p+q} & \text{if } k = 2\ell+1. \end{cases} \tag{6}$$

By definition of $L(k)$, seed $\#^{p}\texttt{-}\#^{q}$ detects any word from $(1^*0)^k 1^*$ of length $(L(k) + 1)$ or more, and this is the tight bound. Therefore, we have to find $p, q$ which minimize $L(k)$. Recall that $p + q = d$, and observe that for $p \ge 2q + 1$, $L(k)$ (defined by (3)) is increasing on $p$, while for $p \le 2q + 1$, $L(k)$ (defined by (5)) is decreasing on $p$. Therefore, both functions reach their minimum when $p = 2q + 1$. Therefore, if $d \equiv 1 \pmod 3$, we obtain $q = \lfloor d/3 \rfloor$ and $p = d - q$. If $d \equiv 0 \pmod 3$, a routine computation shows that the minimum is reached at $q = d/3$, $p = 2d/3$, and if $d \equiv 2 \pmod 3$, the minimum is reached at $q = \lceil d/3 \rceil$, $p = d - q$. Putting the three cases together results in $q = [d/3]$, $p = d - q$. □

To illustrate Theorem 1, seed ####-## is optimal among all seeds of weight 6 with one joker. This means that this seed solves the $(m,2)$-problem for all $m \ge 16$ and this is the smallest possible bound over all seeds of this class. Similarly, this seed solves the $(m,3)$-problem for all $m \ge 20$, which is the best possible bound, etc.

4.2 Regular Expansion and Contraction of Seeds
We now show that seeds solving larger problems can be obtained from seeds solving smaller problems, and vice versa, using regular expansion and regular contraction operations.
Given a seed $Q$, its $i$-regular expansion $i \otimes Q$ is obtained by multiplying each matching position by $i$. This is equivalent to inserting $i - 1$ jokers between every two successive positions along the seed. For example, if $Q = \{0, 2, 3, 5\}$ (or #-##-#), then the 2-regular expansion of $Q$ is $2 \otimes Q = \{0, 4, 6, 10\}$ (or #---#-#---#). Given a family $F$, its $i$-regular expansion $i \otimes F$ is the family obtained by applying the $i$-regular expansion on each seed of $F$.

Lemma 1. If a family $F$ solves an $(m,k)$-problem, then the $(im, (k+1)i - 1)$-problem is solved both by family $F$ and by its $i$-regular expansion $F_i = i \otimes F$.

Proof. Consider an $(im, (k+1)i - 1)$-similarity $w$. By the pigeonhole principle, it contains at least one substring of length $m$ with $k$ mismatches or less and, therefore, $F$ solves the $(im, (k+1)i - 1)$-problem. On the other hand, consider $i$ disjoint subsequences of $w$, each one consisting of $m$ positions equal modulo $i$. Again, by the pigeonhole principle, at least one of them contains $k$ mismatches or less and, therefore, the $(im, (k+1)i - 1)$-problem is solved by $i \otimes F$. □

The following lemma is the inverse of Lemma 1. It states that if seeds solving a bigger problem have a regular structure, then a solution for a smaller problem can be obtained by the regular contraction operation, inverse to the regular expansion.

Lemma 2. If a family $F_i = i \otimes F$ solves an $(im, k)$-problem, then $F$ solves both the $(im, k)$-problem and the $(m, \lfloor k/i \rfloor)$-problem.

Proof. One can even show that $F$ solves the $(im, k)$-problem with the additional restriction for $F$ to match inside one of the position intervals $[1..m], [m+1..2m], \ldots, [(i-1)m+1..im]$. This is done by using the bijective mapping from Lemma 1: Given an $(im, k)$-similarity $w$, consider $i$ disjoint subsequences $w_j$ ($0 \le j \le i-1$) of $w$ obtained by picking $m$ positions equal to $j$ modulo $i$, and then consider the concatenation $w' = w_1 w_2 \ldots w_{i-1} w_0$.
For every $(im, k)$-similarity $w'$, its inverse image $w$ is detected by $F_i$, and therefore $F$ detects $w'$ at one of the intervals $[1..m], [m+1..2m], \ldots, [(i-1)m+1..im]$. Furthermore, for any $(m, \lfloor k/i \rfloor)$-similarity $v$, consider $w' = v^i$ and its inverse image $w$. As $w'$ is detected by $F_i$, $v$ is detected by $F$. □

Example 1. To illustrate the two lemmas above, we give the following example pointed out in [1]. The following two seeds are the only seeds of weight 12 that solve the $(50, 5)$-problem:

#-#-#---#-----#-#-#---#-----#-#-#---#

and

###-#--###-#--###-#.

The first one is the 2-regular expansion of the second. The second one is the only seed of weight 12 that solves the $(25, 2)$-problem.

The regular expansion makes it possible, in some cases, to obtain an efficient solution for a larger problem by reducing it to a smaller problem for which an optimal or a near-optimal solution is known.

4.3 Periodic Seeds
In this section, we study seeds with a periodic structure that can be obtained by iterating a smaller seed. Such seeds often turn out to be among maximally weighted seeds solving a given $(m,k)$-problem. Interestingly, this contrasts with the lossy framework where optimal seeds usually have a “random” irregular structure.
Consider two seeds $Q_1, Q_2$ represented as words over $\{\#, \texttt{-}\}$. In this section, we lift the assumption that a seed must start and end with a matching position. We denote by $[Q_1, Q_2]^i$ the seed defined as $(Q_1 Q_2)^i Q_1$. For example, [###-#, --]^2 = ###-#--###-#--###-#.
We also need a modification of the $(m,k)$-problem, where $(m,k)$-similarities are considered modulo a cyclic permutation. We say that a seed family $F$ solves a cyclic $(m,k)$-problem, if for every $(m,k)$-similarity $w$, $F$ detects one of the cyclic permutations of $w$. Trivially, if $F$ solves an $(m,k)$-problem, it also solves the cyclic $(m,k)$-problem. To

distinguish from a cyclic problem, we sometimes call an $(m,k)$-problem a linear problem.
We first restrict ourselves to the single-seed case. The following lemma demonstrates that iterating smaller seeds solving a cyclic problem makes it possible to obtain a solution for bigger problems, for the same number of mismatches.

Lemma 3. If a seed $Q$ solves a cyclic $(m,k)$-problem, then for every $i \ge 0$, the seed $Q_i = [Q, \texttt{-}^{(m - s(Q))}]^i$ solves the linear $(m \cdot (i+1) + s(Q) - 1, k)$-problem. If $i \ne 0$, the inverse holds too.

Proof. ($\Rightarrow$) Consider an $(m \cdot (i+1) + s(Q) - 1, k)$-similarity $u$. Transform $u$ into a similarity $u'$ for the cyclic $(m,k)$-problem as follows: For each mismatch position $\ell$ of $u$, set 0 at position $(\ell \bmod m)$ in $u'$. The other positions of $u'$ are set to 1. Clearly, there are at most $k$ 0s in $u'$. As $Q$ solves the cyclic $(m,k)$-problem, we can find at least one position $j$, $1 \le j \le m$, such that $Q$ detects $u'$ cyclically.
We show now that $Q_i$ matches at position $j$ of $u$ (which is a valid position as $1 \le j \le m$ and $s(Q_i) = im + s(Q)$). As the positions of 1 in $u$ are projected modulo $m$ to matching positions of $Q$, then there is no 0 under any matching element of $Q_i$ and, thus, $Q_i$ detects $u$.
($\Leftarrow$) Consider a seed $Q_i = [Q, \texttt{-}^{(m - s(Q))}]^i$ solving the $(m \cdot (i+1) + s(Q) - 1, k)$-problem. As $i > 0$, consider $(m \cdot (i+1) + s(Q) - 1, k)$-similarities having all their mismatches located inside the interval $[m, 2m-1]$. For each such similarity, there exists a position $j$, $1 \le j \le m$, such that $Q_i$ detects it. Note that the span of $Q_i$ is at least $m + s(Q)$, which implies that there is either an entire occurrence of $Q$ inside the window $[m, 2m-1]$, or a prefix of $Q$ matching a suffix of the window and the complementary suffix of $Q$ matching a prefix of the window. This implies that $Q$ solves the cyclic $(m,k)$-problem. □

Example 2. Observe that the seed ###-# solves the cyclic $(7,2)$-problem. From Lemma 3, this implies that for every $i \ge 0$, the $(11 + 7i, 2)$-problem is solved by the seed [###-#, --]^i of span $5 + 7i$. Moreover, for $i = 1, 2, 3$, this seed is optimal (maximally weighted) over all seeds solving the problem.
By a similar argument based on Lemma 3, the periodic seed [#####-##, ---]^i solves the $(18 + 11i, 2)$-problem. Note that its weight grows as $\frac{7}{11} m$ compared to $\frac{4}{7} m$ for the seed from the previous paragraph. However, when $m \to \infty$, this is not an asymptotically optimal bound, as we will see later.
The $(18 + 11i, 3)$-problem is solved by the seed [###-#--#, ---]^i, as seed ###-#--# solves the cyclic $(11,3)$-problem. For $i = 1, 2$, the former is a maximally weighted seed among all solving the $(18 + 11i, 3)$-problem.

One question raised by these examples is whether iterating some seed could provide an asymptotically optimal solution, i.e., a seed of maximal asymptotic weight. The following theorem establishes a tight asymptotic bound on the weight of an optimal seed, for a fixed number of mismatches. It gives a negative answer to this question, as it shows that the maximal weight grows faster than any linear fraction of the similarity size.

Theorem 2. Consider a constant $k$. Let $w(m)$ be the maximal weight of a seed solving the cyclic $(m,k)$-problem. Then, $(m - w(m)) = \Theta(m^{\frac{k-1}{k}})$.

Proof. Note first that all seeds solving a cyclic $(m,k)$-problem can be considered as seeds of span $m$. The number of jokers in any seed $Q$ is then $n = m - w(Q)$. The theorem states that the minimal number of jokers of a seed solving the $(m,k)$-problem is $\Theta(m^{\frac{k-1}{k}})$ for every fixed $k$.
Lower bound. Consider a cyclic $(m,k)$-problem. The number $D(m,k)$ of distinct cyclic $(m,k)$-similarities satisfies

$$\frac{\binom{m}{k}}{m} \le D(m,k), \tag{7}$$

as every linear $(m,k)$-similarity has at most $m$ cyclically equivalent ones. Consider a seed $Q$. Let $n$ be the number of jokers in $Q$ and $J_Q(m,k)$ the number of distinct cyclic $(m,k)$-similarities detected by $Q$. Observe that $J_Q(m,k) \le \binom{n}{k}$ and if $Q$ solves the cyclic $(m,k)$-problem, then

$$D(m,k) = J_Q(m,k) \le \binom{n}{k}. \tag{8}$$

From (7) and (8), we have

$$\frac{\binom{m}{k}}{m} \le \binom{n}{k}. \tag{9}$$

Using the Stirling formula, this gives $n(k) = \Omega(m^{\frac{k-1}{k}})$.
Upper bound. To prove the upper bound, we construct a seed $Q$ that has no more than $k \cdot m^{\frac{k-1}{k}}$ joker positions and solves the cyclic $(m,k)$-problem.
We start with the seed $Q_0$ of span $m$ with all matching positions, and introduce jokers into it in $k$ steps. After step $i$, the obtained seed is denoted $Q_i$, and $Q = Q_k$.
Let $B = \lceil m^{\frac{1}{k}} \rceil$. $Q_1$ is obtained by introducing into $Q_0$ individual jokers with periodicity $B$ by placing jokers at positions $1, B+1, 2B+1, \ldots$. At step 2, we introduce into $Q_1$ contiguous intervals of jokers of length $B$ with periodicity $B^2$, such that jokers are placed at positions $[1 \ldots B], [B^2+1 \ldots B^2+B], [2B^2+1 \ldots 2B^2+B], \ldots$.
In general, at step $i$ ($i \le k$), we introduce into $Q_{i-1}$ intervals of $B^{i-1}$ jokers with periodicity $B^i$ at positions $[1 \ldots B^{i-1}], [B^i+1 \ldots B^i+B^{i-1}], \ldots$ (see Fig. 1).
Note that $Q_i$ is periodic with periodicity $B^i$. Note also that at each step $i$, we introduce at most $\lfloor m^{1 - \frac{i}{k}} \rfloor$ intervals of $B^{i-1}$ jokers. Moreover, due to overlaps with already added jokers, each interval adds $(B-1)^{i-1}$ new jokers.
This implies that the total number of jokers added at step $i$ is at most $m^{1-\frac{i}{k}} \cdot (B-1)^{i-1} \le m^{1-\frac{i}{k}} \cdot m^{\frac{i-1}{k}} = m^{\frac{k-1}{k}}$. Thus, the total number of jokers in $Q$ is less than $k \cdot m^{\frac{k-1}{k}}$.
By induction on $i$, we prove that for any $(m,i)$-similarity $u$ ($i \le k$), $Q_i$ detects $u$ cyclically, that is, there is a cyclic shift of $Q_i$ such that all $i$ mismatches of $u$ are covered with jokers introduced at steps $1, \ldots, i$.
For $i = 1$, the statement is obvious, as we can always cover the single mismatch by shifting $Q_1$ by at most $(B-1)$ positions. Assuming that the statement

holds for $(i-1)$, we show now that it holds for $i$ too. Consider an $(m,i)$-similarity $u$. Select one mismatch of $u$. By induction hypothesis, the other $(i-1)$ mismatches can be covered by $Q_{i-1}$. Since $Q_{i-1}$ has period $B^{i-1}$ and $Q_i$ differs from $Q_{i-1}$ by having at least one contiguous interval of $B^{i-1}$ jokers, we can always shift $Q_i$ by $j \cdot B^{i-1}$ positions such that the selected mismatch falls into this interval. This shows that $Q_i$ detects $u$. We conclude that $Q$ solves the cyclic $(m,k)$-problem. □

Fig. 1. Construction of seeds $Q_i$ from the proof of Theorem 2. Jokers are represented in white and matching positions in black.

Using Theorem 2, we obtain the following bound on the number of jokers for the linear $(m,k)$-problem.

Lemma 4. Consider a constant $k$. Let $w(m)$ be the maximal weight of a seed solving the linear $(m,k)$-problem. Then, $(m - w(m)) = \Theta(m^{\frac{k}{k+1}})$.

Proof. To prove the upper bound, we construct a seed $Q$ that solves the linear $(m,k)$-problem and satisfies the asymptotic bound. Consider some $l < m$ that will be defined later, and let $P$ be a seed that solves the cyclic $(l,k)$-problem. Without loss of generality, we assume $s(P) = l$.
For a real number $e \ge 1$, define $P^e$ to be the maximally weighted seed of span at most $le$ of the form $P' \cdot P \cdots P \cdot P''$, where $P'$ and $P''$ are, respectively, a suffix and a prefix of $P$. Due to the condition of maximal weight, $w(P^e) \ge e \cdot w(P)$.
We now set $Q = P^e$ for some real $e$ to be defined. Observe that if $e \cdot l \le m - l$, then $Q$ solves the linear $(m,k)$-problem. Therefore, we set $e = \frac{m-l}{l}$.
From the proof of Theorem 2, we have $l - w(P) \le k \cdot l^{\frac{k-1}{k}}$. We then have

$$w(Q) = e \cdot w(P) \ge \frac{m-l}{l} \cdot \left(l - k \cdot l^{\frac{k-1}{k}}\right). \tag{10}$$

If we set

$$l = m^{\frac{k}{k+1}}, \tag{11}$$

we obtain

$$m - w(Q) \le (k+1)\, m^{\frac{k}{k+1}} - k\, m^{\frac{k-1}{k+1}}, \tag{12}$$

and as $k$ is constant,

$$m - w(Q) = O(m^{\frac{k}{k+1}}). \tag{13}$$

The lower bound is obtained similarly to Theorem 2. Let $Q$ be a seed solving a linear $(m,k)$-problem, and let $n = m - w(Q)$. From simple combinatorial considerations, we have

$$\binom{m}{k} \le \binom{n}{k} \cdot (m - s(Q)) \le \binom{n}{k} \cdot n, \tag{14}$$

which implies $n = \Omega(m^{\frac{k}{k+1}})$ for constant $k$. □

The following simple lemma is also useful for constructing efficient seeds.

Lemma 5. Assume that a family $F$ solves an $(m,k)$-problem. Let $F'$ be the family obtained from $F$ by cutting out $l$ characters from the left and $r$ characters from the right of each seed of $F$. Then $F'$ solves the $(m - r - l, k)$-problem.

Example 3. The $(9 + 7i, 2)$-problem is solved by the seed [###, -#--]^i, which is optimal for $i = 1, 2, 3$. Using Lemma 5, this seed can be immediately obtained from the seed [###-#, --]^i from Example 2, solving the $(11 + 7i, 2)$-problem.

We now apply the above results for the single seed case to the case of multiple seeds.
For a seed $Q$ considered as a word over $\{\#, \texttt{-}\}$, we denote by $Q^{[i]}$ its cyclic shift to the left by $i$ characters. For example, if $Q$ = ####-#-##--, then $Q^{[5]}$ = #-##--####-. The following lemma gives a way to construct seed families solving bigger problems from an individual seed solving a smaller cyclic problem.

Lemma 6. Assume that a seed $Q$ solves a cyclic $(m,k)$-problem and assume that $s(Q) = m$ (otherwise, we pad $Q$ on the right with $(m - s(Q))$ jokers). Fix some $i > 1$. For some $L > 0$, consider a list of $L$ integers $0 \le j_1 < \cdots < j_L < m$, and define a family of seeds $F = \langle \|(Q^{[j_l]})^i\| \rangle_{l=1}^{L}$, where $\|(Q^{[j_l]})^i\|$ stands for the seed obtained from $(Q^{[j_l]})^i$ by deleting the joker characters at the left and right edges. Define $\delta(l) = ((j_{l-1} - j_l) \bmod m)$ (or, alternatively, $\delta(l) = ((j_l - j_{l-1}) \bmod m)$) for all $l$, $1 \le l \le L$. Let $m' = \max\{s(\|(Q^{[j_l]})^i\|) + \delta(l)\}_{l=1}^{L} - 1$. Then, $F$ solves the $(m', k)$-problem.

Proof. The proof is an extension of the proof of Lemma 3. Here, the seeds of the family are constructed in such a way that for any instance of the linear $(m', k)$-problem, there exists at least one seed that satisfies the property required in the proof of Lemma 3 and, therefore, matches this instance. □

In applying Lemma 6, integers $j_l$ are chosen from the interval $[0, m]$ in such a way that values $s(\|(Q^{[j_l]})^i\|) + \delta(l)$ are close to each other. We illustrate Lemma 6 with two examples that follow.

Example 4. Let $m = 11$, $k = 2$. Consider the seed $Q$ = ####-#-##-- solving the cyclic $(11,2)$-problem. Choose $i = 2$, $L = 2$, $j_1 = 0$, $j_2 = 5$. This gives two seeds:

$Q_1 = \|(Q^{[0]})^2\|$ = ####-#-##--####-#-##

and

$Q_2 = \|(Q^{[5]})^2\|$ = #-##--####-#-##--####

of span 20 and 21, respectively, with $\delta(1) = 6$ and $\delta(2) = 5$. Since $\max\{20 + 6, 21 + 5\} - 1 = 25$, the family $F = \{Q_1, Q_2\}$ solves the $(25,2)$-problem.

Example 5. Let $m = 11$, $k = 3$. The seed $Q$ = ###-#--#--- solves the cyclic $(11,3)$-problem. Choose $i = 2$, $L = 2$, $j_1 = 0$, $j_2 = 4$. The two seeds are

$Q_1 = \|(Q^{[0]})^2\|$ = ###-#--#---###-#--#

(span 19) and

$Q_2 = \|(Q^{[4]})^2\|$ = #--#---###-#--#---###

(span 21), with $\delta(1) = 7$ and $\delta(2) = 4$. Since $\max\{19 + 7, 21 + 4\} - 1 = 25$, the family $F = \{Q_1, Q_2\}$ solves the $(25,3)$-problem.

4.4 Heuristic Seed Design
Results of Sections 4.1, 4.2, and 4.3 allow one to construct efficient seed families in certain cases, but still do not allow a systematic seed design. Recently, linear programming approaches to designing efficient seed families were proposed in [19] and in [18], respectively, for DNA and protein similarity search. However, neither of these methods aims at constructing lossless families.
In this section, we outline a heuristic genetic programming algorithm for designing lossless seed families. The algorithm will be used in the experimental part of this work, which we present in the next section. Note that this algorithm uses the dynamic programming algorithms discussed in Section 3. Since the algorithm uses standard genetic programming techniques, we give only a high-level description here without going into all details.
The algorithm tries to iteratively improve characteristics of a population of seed families until it finds a small family that detects all $(m,k)$-similarities (i.e., is lossless). The first step of each iteration is based on screening current families against a set of difficult similarities, that is, similarities that have been detected by fewer families. This set is continually reordered and updated according to the number of families that do not detect those similarities. For this, each set is stored in a tree and the reordering is done using the list-as-a-tree principle [20]: Each time a similarity is not detected by a family, it is moved towards the root of the tree such that its height is divided by two.
For those families that pass through the screening, the number of undetected similarities is computed by the dynamic programming algorithm of Section 3.2. The family is kept if it produces a smaller number than the families currently known. An undetected similarity obtained during this computation is added as a leaf to the tree of difficult similarities.
To detect seeds to be improved inside a family, we compute the contribution of each seed by the dynamic programming algorithm of Section 3.3. The seeds with the least contribution are then modified with a higher probability. In general, the population of seed families is evolving by mutating and crossing over according to the set of similarities they do not detect. Moreover, random seed families are regularly injected into the population in order to avoid local optima.
The described heuristic procedure often allows efficient or even optimal solutions to be computed in a reasonable time. For example, in 10 runs of the algorithm, we found three of the six existing families of two seeds of weight 14 solving the $(25,2)$-problem. The whole computation took less than 1 hour, compared to a week of computation needed to exhaustively test all seed pairs. Note that the randomized-greedy approach (incremental completion of the seed set by adding the best random seed) applied a dozen times to the same problem yielded only sets of three and sometimes four, but never two seeds, taking about 1 hour at each run.

5 EXPERIMENTS
We describe two groups of experiments that we made. The first one concerns the design of efficient seed families, and the second one applies a multiseed lossless filtration to the identification of unique oligos in a large set of EST sequences.

5.1 Seed Design Experiments
We considered several $(m,k)$-problems. For each problem, and for a fixed number of seeds in the family, we computed families solving the problem and realizing the largest possible seed weight (under a natural assumption that all seeds in a family have the same weight). We also kept track of the ways (periodic seeds, genetic programming heuristics, exhaustive search) in which those families can be computed.
Tables 1 and 2 summarize some results obtained for the $(25,2)$-problem and the $(25,3)$-problem, respectively. Families of periodic seeds (that can be found using Lemma 6) are marked with p, those that are found using a genetic algorithm are marked with g, and those which are obtained by an exhaustive search are marked with e. Only in this latter case are the families guaranteed to be optimal. Families of periodic seeds are shifted according to their construction (see Lemma 6).
Moreover, to compare the selectivity of different families solving a given $(m,k)$-problem, we estimated the probability $\delta$ for at least one of the seeds of the family to match at a given position of a uniform Bernoulli four-letter sequence. This has been done using the inclusion-exclusion formula.
Note that the simple fact of passing from a single seed to a two-seed family results in a considerable gain in efficiency: In both examples shown in the tables there is a change of about one order of magnitude in the selectivity estimator $\delta$.

5.2 Oligo Selection Using Multiseed Filtering
An important practical application of lossless filtration is the selection of reliable oligonucleotides for DNA microarray experiments. Oligonucleotides (oligos) are small DNA sequences of fixed size (usually ranging from 10 to 50) designed to hybridize only with a specific region of the genome sequence. In microarray experiments, oligos are expected to match ESTs that stem from a given gene and not

to match those of other genes. As the first approximation, the problem of oligo selection can then be formulated as the search for strings of a fixed length that occur in a given sequence but do not occur, within a specified distance, in other sequences of a given (possibly very large) sample. Different approaches to this problem apply different distance measures and different algorithmic techniques [21], [22], [23], [24]. The experiments we briefly present here demonstrate that the multiseed filtering provides an efficient computation of candidate oligonucleotides. These should then be further processed by complementary methods in order to take into account other physico-chemical factors occurring in hybridisation, such as the melting temperature or the possible hairpin structure of palindromic oligos.

TABLE 1
Seed Families for (25,2)-Problem

TABLE 2
Seed Families for (25,3)-Problem

Here, we adopt the formalization of the oligo selection problem as the problem of identifying in a given sequence (or a sequence database) all substrings of length $m$ that have no occurrences elsewhere in the sequence within the Hamming distance $k$. The parameters $m$ and $k$ were set to 32 and 5, respectively. For the $(32,5)$-problem, different seed families were designed and their selectivity was estimated. Those are summarized in the table in Fig. 2, using the same conventions as in Tables 1 and 2 above. The family composed of six seeds of weight 11 was selected for the filtration experiment (shown in Fig. 2).
The filtering has been applied to a database of rice EST sequences composed of 100,015 sequences for a total length of 42,845,242 bp (source: http://bioserver.myongji.ac.kr/ricemac.html, The Korea Rice Genome Database). Substrings matching other substrings with five substitution errors or less were computed. The computation took slightly more than one hour on a

Pentium 4 3 GHz computer. Before applying the filtering using the family for the $(32,5)$-problem, we made a rough prefiltering using one spaced seed of weight 16 to detect, with a high selectivity, almost identical regions. Sixty-five percent of the database has been discarded by this prefiltering. Another 22 percent of the database has been filtered out using the chosen seed family, leaving the remaining 13 percent as oligo candidates.

Fig. 2. Computed seed families for the (32,5)-problem and the chosen family (six seeds of weight 11).

6 CONCLUSION
In this paper, we studied a lossless filtration method based on multiseed families and demonstrated that it represents an improvement compared to the single-seed approach considered in [1]. We showed how some important characteristics of seed families can be computed using dynamic programming. We presented several combinatorial results that allow one to construct efficient families composed of seeds with a periodic structure. Finally, we described a large-scale computational experiment of designing reliable oligonucleotides for DNA microarrays. The obtained experimental results provided evidence of the applicability and efficiency of the whole method.
The results of Sections 4.1, 4.2, and 4.3 establish several combinatorial properties of seed families, but many more of them remain to be elucidated. The structure of optimal or near-optimal seed families can be reduced to number-theoretic questions, but this relation remains to be clearly established. In general, constructing an algorithm to systematically design seed families with quality guarantee remains an open problem. Some complexity issues remain open too: For example, what is the complexity of testing if a single seed is lossless for given $m, k$? Section 3 implies a time bound exponential in the number of jokers. Note that for multiple seeds, computing the number of detected similarities is NP-complete [16, Section 3.1].
Another direction is to consider different distance measures, especially the Levenshtein distance, or at least to allow some restricted insertion/deletion errors. The method proposed in [25] does not seem to be easily generalized to multiseed families, and further work is required to improve lossless filtering in this case.

ACKNOWLEDGMENTS
G. Kucherov and L. Noé have been supported by the French Action Spécifique “Algorithmes et Séquences” of CNRS. A part of this work has been done during a stay of M. Roytberg at LORIA, Nancy, supported by INRIA. M. Roytberg has been supported by the Russian Foundation for Basic Research (project nos. 03-04-49469, 02-07-90412) and by grants from the RF Ministry for Industry, Science, and Technology (20/2002, 5/2003) and NWO. An extended abstract of this work has been presented to the Combinatorial Pattern Matching Conference (Istanbul, July 2004).

REFERENCES
[1] S. Burkhardt and J. Kärkkäinen, “Better Filtering with Gapped q-Grams,” Fundamenta Informaticae, vol. 56, nos. 1-2, pp. 51-70, 2003, preliminary version in Combinatorial Pattern Matching 2001.
[2] G. Navarro and M. Raffinot, Flexible Pattern Matching in Strings: Practical On-Line Search Algorithms for Texts and Biological Sequences. Cambridge Univ. Press, 2002.
[3] S. Altschul, T. Madden, A. Schäffer, J. Zhang, Z. Zhang, W. Miller, and D. Lipman, “Gapped BLAST and PSI-BLAST: A New Generation of Protein Database Search Programs,” Nucleic Acids Research, vol. 25, no. 17, pp. 3389-3402, 1997.
[4] B. Ma, J. Tromp, and M. Li, “PatternHunter: Faster and More Sensitive Homology Search,” Bioinformatics, vol. 18, no. 3, pp. 440-445, 2002.
[5] S. Schwartz, J. Kent, A. Smit, Z. Zhang, R. Baertsch, R. Hardison, D. Haussler, and W. Miller, “Human-Mouse Alignments with BLASTZ,” Genome Research, vol. 13, pp. 103-107, 2003.
[6] L. Noé and G. Kucherov, “Improved Hit Criteria for DNA Local Alignment,” BMC Bioinformatics, vol. 5, no. 149, Oct. 2004.
[7] P. Pevzner and M. Waterman, “Multiple Filtration and Approximate Pattern Matching,” Algorithmica, vol. 13, pp. 135-154, 1995.
[8] A. Califano and I. Rigoutsos, “Flash: A Fast Look-Up Algorithm for String Homology,” Proc. First Int’l Conf. Intelligent Systems for Molecular Biology, pp. 56-64, July 1993.
[9] J. Buhler, “Provably Sensitive Indexing Strategies for Biosequence Similarity Search,” Proc. Sixth Ann. Int’l Conf. Computational Molecular Biology (RECOMB ’02), pp. 90-99, Apr. 2002.
[10] U. Keich, M. Li, B. Ma, and J. Tromp, “On Spaced Seeds for Similarity Search,” Discrete Applied Math., vol. 138, no. 3, pp. 253-263, 2004.
[11] J. Buhler, U. Keich, and Y. Sun, “Designing Seeds for Similarity Search in Genomic DNA,” Proc. Seventh Ann. Int’l Conf. Computational Molecular Biology (RECOMB ’03), pp. 67-75, Apr. 2003.
[12] B. Brejova, D. Brown, and T. Vinar, “Vector Seeds: An Extension to Spaced Seeds Allows Substantial Improvements in Sensitivity and Specificity,” Proc. Third Int’l Workshop Algorithms in Bioinformatics (WABI), pp. 39-54, Sept. 2003.
[13] G. Kucherov, L. Noé, and Y. Ponty, “Estimating Seed Sensitivity on Homogeneous Alignments,” Proc. IEEE Fourth Symp. Bioinformatics and Bioeng. (BIBE 2004), May 2004.
[14] K. Choi and L. Zhang, “Sensitivity Analysis and Efficient Method for Identifying Optimal Spaced Seeds,” J. Computer and System Sciences, vol. 68, pp. 22-40, 2004.
[15] M. Csürös, “Performing Local Similarity Searches with Variable Length Seeds,” Proc. 15th Ann. Combinatorial Pattern Matching Symp. (CPM), pp. 373-387, 2004.

[16] M. Li, B. Ma, D. Kisman, and J. Tromp, “PatternHunter II: Highly Sensitive and Fast Homology Search,” J. Bioinformatics and Computational Biology, vol. 2, no. 3, pp. 417-440, Sept. 2004.
[17] Y. Sun and J. Buhler, “Designing Multiple Simultaneous Seeds for DNA Similarity Search,” Proc. Eighth Ann. Int’l Conf. Research in Computational Molecular Biology (RECOMB 2004), pp. 76-84, Mar. 2004.
[18] D.G. Brown, “Multiple Vector Seeds for Protein Alignment,” Proc. Fourth Int’l Workshop Algorithms in Bioinformatics (WABI), pp. 170-181, Sept. 2004.
[19] J. Xu, D. Brown, M. Li, and B. Ma, “Optimizing Multiple Spaced Seeds for Homology Search,” Proc. 15th Symp. Combinatorial Pattern Matching, pp. 47-58, 2004.
[20] J. Oommen and J. Dong, “Generalized Swap-with-Parent Schemes for Self-Organizing Sequential Linear Lists,” Proc. 1997 Int’l Symp. Algorithms and Computation (ISAAC ’97), pp. 414-423, Dec. 1997.
[21] F. Li and G. Stormo, “Selection of Optimal DNA Oligos for Gene Expression Arrays,” Bioinformatics, vol. 17, pp. 1067-1076, 2001.
[22] L. Kaderali and A. Schliep, “Selecting Signature Oligonucleotides to Identify Organisms Using DNA Arrays,” Bioinformatics, vol. 18, no. 10, pp. 1340-1349, 2002.
[23] S. Rahmann, “Fast Large Scale Oligonucleotide Selection Using the Longest Common Factor Approach,” J. Bioinformatics and Computational Biology, vol. 1, no. 2, pp. 343-361, 2003.
[24] J. Zheng, T. Close, T. Jiang, and S. Lonardi, “Efficient Selection of Unique and Popular Oligos for Large EST Databases,” Proc. 14th Ann. Combinatorial Pattern Matching Symp. (CPM), pp. 273-283, 2003.
[25] S. Burkhardt and J. Kärkkäinen, “One-Gapped q-Gram Filters for Levenshtein Distance,” Proc. 13th Symp. Combinatorial Pattern Matching (CPM ’02), vol. 2373, pp. 225-234, 2002.

Gregory Kucherov received the PhD degree in computer science in 1988 from the USSR Academy of Sciences, and a Habilitation degree in 2000 from the Henri Poincaré University in Nancy. He is a senior INRIA researcher with the LORIA research unit in Nancy, France. For the last 10 years, he has been doing research on word combinatorics, text algorithms and combinatorial algorithms for bioinformatics, and computational biology.

Laurent Noé studied computer science at the ESIAL engineering school in Nancy, France. He received the MS degree in 2002 and is currently a PhD student in computational biology at LORIA.

Mikhail Roytberg received the PhD degree in computer science in 1983 from Moscow State University. He is a leader of the Computational Molecular Biology Group in the Institute of Mathematical Problems in Biology of the Russian Academy of Sciences at Pushchino, Russia. During the last years, his main research field has been the development of algorithms for comparative analysis of biological sequences.

62 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 2, NO. 1, JANUARY-MARCH 2005

Text Mining Biomedical Literature for Discovering Gene-to-Gene Relationships: A Comparative Study of Algorithms
Ying Liu, Shamkant B. Navathe, Jorge Civera, Venu Dasigi,
Ashwin Ram, Brian J. Ciliax, and Ray Dingledine

Abstract—Partitioning closely related genes into clusters has become an important element of practically all statistical analyses of
microarray data. A number of computer algorithms have been developed for this task. Although these algorithms have demonstrated their
usefulness for gene clustering, some basic problems remain. This paper describes our work on extracting functional keywords from
MEDLINE for a set of genes that are isolated for further study from microarray experiments based on their differential expression patterns.
The sharing of functional keywords among genes is used as a basis for clustering in a new approach called BEA-PARTITION in this paper.
Functional keywords associated with genes were extracted from MEDLINE abstracts. We modified the Bond Energy Algorithm (BEA),
which is widely accepted in psychology and database design but is virtually unknown in bioinformatics, to cluster genes by functional
keyword associations. The results showed that BEA-PARTITION and hierarchical clustering algorithm outperformed k-means clustering
and self-organizing map by correctly assigning 25 of 26 genes in a test set of four known gene groups. To evaluate the effectiveness of
BEA-PARTITION for clustering genes identified by microarray profiles, 44 yeast genes that are differentially expressed during the cell
cycle and have been widely studied in the literature were used as a second test set. Using established measures of cluster quality, the
results produced by BEA-PARTITION had higher purity, lower entropy, and higher mutual information than those produced by k-means
and self-organizing map. Whereas BEA-PARTITION and the hierarchical clustering produced similar quality of clusters, BEA-PARTITION
provides clear cluster boundaries compared to the hierarchical clustering. BEA-PARTITION is simple to implement and provides a
powerful approach to clustering genes or to any clustering problem where starting matrices are available from experimental observations.

Index Terms—Bond energy algorithm, microarray, MEDLINE, text analysis, cluster analysis, gene function.

Y. Liu, S.B. Navathe, J. Civera, and A. Ram are with the College of Computing, Georgia Institute of Technology, 801 Atlantic Drive, Atlanta, GA 30322. E-mail: {yingliu, sham, ashwin}@cc.gatech.edu, jorcisai@iti.upv.es.
V. Dasigi is with the Department of Computer Science, School of Computing and Software Engineering, Southern Polytechnic State University, Marietta, GA 30060. E-mail: vdasigi@spsu.edu.
B.J. Ciliax is with the Department of Neurology, Emory University School of Medicine, Atlanta, GA 30322. E-mail: bciliax@emory.edu.
R. Dingledine is with the Department of Pharmacology, Emory University School of Medicine, Atlanta, GA 30322. E-mail: rdingledine@pharm.emory.edu.
Manuscript received 4 Apr. 2004; revised 1 Oct. 2004; accepted 10 Feb. 2005; published online 30 Mar. 2005.
For information on obtaining reprints of this article, please send e-mail to: tcbb@computer.org, and reference IEEECS Log Number TCBB-0043-0404.
1545-5963/05/$20.00 © 2005 IEEE. Published by the IEEE CS, CI, and EMB Societies & the ACM.

1 INTRODUCTION

DNA microarrays, among the most rapidly growing tools for genome analysis, are introducing a paradigmatic change in biology by shifting experimental approaches from single gene studies to genome-level analyses [1], [2]. Increasingly accessible microarray platforms allow the rapid generation of large expression data sets [3]. One of the key challenges of microarray studies is to derive biological insights from the unprecedented quantities of data on gene-expression patterns [5]. Partitioning genes into closely related groups has become an element of practically all analyses of microarray data [4].

A number of computer algorithms have been applied to gene clustering. One of the earliest was a hierarchical algorithm developed by Eisen et al. [6]. Other popular algorithms, such as k-means [7] and Self-Organizing Maps (SOM) [8] have also been widely used. These algorithms have demonstrated their usefulness in gene clustering, but some basic problems remain [2], [9]. Hierarchical clustering organizes expression data into a binary tree, in which the leaves are genes and the interior nodes (or branch points) are candidate clusters. True clusters with discrete boundaries are not produced [10]. Although SOM is efficient and simple to implement, studies suggest that it typically performs worse than the traditional techniques, such as k-means [11].

Based on the assumption that genes with the same function or in the same biological pathway usually show similar expression patterns, the functions of unknown genes can be inferred from those of the known genes with similar expression profile patterns. Therefore, expression profile gene clustering by all the algorithms mentioned above has received much attention; however, the task of finding functional relationships between specific genes is left to the investigator. Manual scanning of the biological literature (for example, via MEDLINE) for clues regarding potential functional relationships among a set of genes is not feasible when the number of genes to be explored rises above approximately 10. Restricting the scan (manual or automatic) to annotation fields of GenBank, SwissProt, or LocusLink is quicker but can suffer from the ad hoc relationship of keywords to the research interests of whoever submitted the entry. Moreover, keeping annotation fields current as new
LIU ET AL.: TEXT MINING BIOMEDICAL LITERATURE FOR DISCOVERING GENE-TO-GENE RELATIONSHIPS: A COMPARATIVE STUDY OF... 63

information appears in the literature is a major challenge that is rarely met adequately.

If, instead of organizing by expression pattern similarity, genes were grouped according to shared function, investigators might more quickly discover patterns or themes of biological processes that were revealed by their microarray experiments and focus on a select group of functionally related genes. A number of clustering strategies based on shared functions rather than similar expression patterns have been devised. Chaussabel and Sher [3] analyzed literature profiles generated by extracting the frequencies of certain terms from the abstracts in MEDLINE and then clustered the genes based on these terms, essentially applying the same algorithm used for expression pattern clustering. Jenssen et al. [12] used co-occurrence of gene names in abstracts to create networks of related genes automatically. Text analysis of biomedical literature has also been applied successfully to incorporate functional information about the genes in the analysis of gene expression data [1], [10], [13], [14] without generating clusters de novo. For example, Blaschke et al. [1] extracted information about the common biological characteristics of gene clusters from MEDLINE using Andrade and Valencia’s statistical text mining approach, which accepts user-supplied abstracts related to a protein of interest and returns an ordered set of keywords that occur in those abstracts more often than would be expected by chance [15].

We expanded and extended Andrade and Valencia’s approach [15] to functional gene clustering by using an approach that applies an algorithm called the Bond Energy Algorithm (BEA) [16], [17], which, to our knowledge, has not been used in bioinformatics. We modified it so that the “affinity” among attributes (in our case, genes) is defined based on the sharing of keywords between them and we came up with a scheme for partitioning the clustered affinity matrix to produce clusters of genes. We call the resulting algorithm BEA-PARTITION. BEA was originally conceived as a technique to cluster questions in psychological instruments [16], has been used in operations research, production engineering, marketing, and various other fields [18], and is a popular clustering algorithm in distributed database system (DDBS) design. The fundamental task of BEA in DDBS design is to group attributes based on their affinity, which indicates how closely related the attributes are, as determined by the inclusion of these attributes by the same database transactions. In our case, each gene was considered as an attribute. Hence, the basic premise is that two genes would have higher affinity, thus higher bond energy, if abstracts mentioning these genes shared many informative keywords. BEA has several useful properties [16], [19]. First, it groups attributes with larger affinity values together, and the ones with smaller values together (i.e., during the permutation of columns and rows, it shuffles the attributes towards those with which they have higher affinity and away from those with which they have lower affinity). Second, the composition and order of the final groups are insensitive to the order in which items are presented to the algorithm. Finally, it seeks to uncover and display the association and interrelationships of the clustered groups with one another.

In order to explore whether this algorithm could be useful for clustering genes derived from microarray experiments, we compared the performance of BEA-PARTITION, hierarchical clustering algorithm, self-organizing map, and the k-means algorithm for clustering functionally-related genes based on shared keywords, using purity, entropy, and mutual information as metrics for evaluating cluster quality.

2 METHODS

2.1 Keyword Extraction from Biomedical Literature
We used statistical methods to extract keywords from MEDLINE citations, based on the work of [15]. This method estimates the significance of words by comparing the frequency of words in a given gene-related set (Test Set) of abstracts with their frequency in a background set of abstracts. We modified the original method by using 1) a different background set, 2) a different stemming algorithm (Porter’s stemmer), and 3) a customized stop list. The details were reported by Liu et al. [20], [21].

For each gene analyzed, word frequencies were calculated from a group of abstracts retrieved by an SQL (structured query language) search of MEDLINE for the specific gene name, gene symbol, or any known aliases (see LocusLink, ftp://ftp.ncbi.nih.gov/refseq/LocusLink/LL_tmpl.gz for gene aliases) in the TITLE field. The resulting set of abstracts (the Test Set) was processed to generate a specific keyword list.

Test Sets of Genes. We compared BEA-PARTITION and other clustering algorithms (k-means, hierarchical, and SOM) on two test sets.

1. Twenty-six genes in four well-defined functional groups consisting of 10 glutamate receptor subunits, seven enzymes in catecholamine metabolism, five cytoskeletal proteins, and four enzymes in tyrosine and phenylalanine synthesis. The gene names and aliases are listed in Table 1. This experiment was performed to determine whether keyword associations can be used to group genes appropriately and whether the four gene families or clusters that were known a priori would also be predicted by a clustering algorithm simply using the affinity metric based on keywords.
2. Forty-four yeast genes involved in the cell cycle of budding yeast (Saccharomyces cerevisiae) that had altered expression patterns on spotted DNA microarrays [6]. These genes were analyzed by Cherepinsky et al. [4] to demonstrate their Shrinkage algorithm for gene clustering. A master list of member genes for each cluster was assembled according to a combination of 1) common cell-cycle functions and regulatory systems and 2) the corresponding transcriptional activators for each gene [4] (Table 2).

Keyword Assessment. Statistical formulae from [15] for word frequencies were used without modification. These calculations were repeated for all gene names in the test

TABLE 1
Twenty-Six Genes Manually Clustered Based on Functional Similarity

TABLE 2
Forty-Four Yeast Genes Grouped by Transcriptional Activators and Cell Cycle Functions [4]

set, a process that generated a database of keywords associated with specific genes, the strength of the association being reflected by a z-score. The z-score of word a for gene g is defined as:

\[ Z_g^a = \frac{F_g^a - \bar{F}^a}{\sigma^a}, \tag{1} \]

where $F_g^a$ equals the frequency of word a in Test Set g (i.e., in Test Set g, the number of abstracts where the word a occurs divided by the total number of abstracts) and $\bar{F}^a$ and $\sigma^a$ are the average frequency and standard deviation, respectively, of word a in the background set. Intuitively, the score Z compares the “importance” or “discriminatory relevance” of a keyword in the test set of abstracts with the background set that represents the expected occurrence of that word in the literature at large.

Keyword Selection for Gene Clustering. We used z-score thresholds to select the keywords used for gene clustering. Those keywords with z-scores less than the threshold were discarded. The z-score thresholds we tested were 0, 5, 8, 10, 15, 20, 30, 50, and 100. The database generated by this algorithm is represented as a sparse word (rows) × gene (columns) matrix with cells containing z-scores. The matrix is characterized as “sparse” because each gene only has a fraction of all words associated with it. The output of the keyword selection for all genes in each Test Set is represented as a sparse keyword (rows) × gene (columns) matrix with cells containing z-scores.

2.2 BEA-PARTITION: Detailed Working of the Algorithm
The BEA-PARTITION takes a symmetric matrix as input, permutes its rows and columns, and generates a sorted matrix, which is then partitioned to form a clustered matrix.

Constructing the Symmetric Gene × Gene Matrix. The sparse word × gene matrix, with the cells containing the z-scores of each word-gene pair, was converted to a gene × gene matrix with the cells containing the sum of products of z-scores for shared keywords. The z-score value was set to zero if the value was less than the threshold. Larger values reflect stronger and more extensive keyword associations between gene-gene pairs. For each gene pair $(G_i, G_j)$ and every word a they share in the sparse word × gene matrix, the $G_i \times G_j$ cell value $\mathrm{aff}(G_i, G_j)$ in the gene × gene matrix represents the affinity of the two genes for each other and is calculated as:

\[ \mathrm{aff}(G_i, G_j) = \frac{\sum_{a=1}^{N} \left( Z_{G_i}^a \times Z_{G_j}^a \right)}{1{,}000}. \tag{2} \]

Dividing the sum of the z-score product by 1,000 was done to reduce the typically large numbers to a more readable format in the output matrix.

Sorting the Matrix [19]. The sorted matrix is generated as follows:

1. Initialization. Place and fix one of the columns of the symmetric matrix arbitrarily into the clustered matrix.
2. Iteration. Pick one of the remaining n − i columns (where i is the number of columns already in the sorted matrix). Choose the placement in the sorted matrix that maximizes the change in bond energy as described below (3). Repeat this step until no more columns remain.
3. Row ordering. Once the column ordering is determined, the placement of the rows should also be changed correspondingly so that their relative positions match the relative position of the columns. This restores the symmetry to the sorted matrix.

To calculate the change in bond energy for each possible placement of the next (i + 1) column, the bonds between that column (k) and each of two newly adjacent columns (i, j) are added and the bond that would be broken between the latter two columns is subtracted. Thus, the “bond energy” between these three columns i, j, and k (representing gene i $(G_i)$, gene j $(G_j)$, and gene k $(G_k)$) is calculated by the following interaction contribution measure:

\[ \mathrm{energy}(G_i, G_j, G_k) = 2 \times \left[ \mathrm{bond}(G_i, G_k) + \mathrm{bond}(G_k, G_j) - \mathrm{bond}(G_i, G_j) \right], \tag{3} \]

where $\mathrm{bond}(G_i, G_j)$ is the bond energy between gene $G_i$ and gene $G_j$ and

\[ \mathrm{bond}(G_i, G_j) = \sum_{r=1}^{N} \mathrm{aff}(G_r, G_i) \times \mathrm{aff}(G_r, G_j), \tag{4} \]

\[ \mathrm{aff}(G_0, G_i) = \mathrm{aff}(G_i, G_0) = \mathrm{aff}(G_{n+1}, G_i) = \mathrm{aff}(G_i, G_{n+1}) = 0. \tag{5} \]

The last set of conditions (5) takes care of cases where a gene is being placed in the sorted matrix to the left of the leftmost gene or to the right of the rightmost gene during column permutations, and prior to the topmost row and following the last row during row permutations.

Partitioning the Sorted Matrix. The original BEA algorithm [16] did not propose how to partition the sorted matrix. The partitioning heuristic was added by Navathe et al. [17] for the problems in the distributed database design. These heuristics were constructed using the goals of design: to minimize access time and storage costs. We do not have the luxury of such a clear cut objective function in our case. Hence, to partition the sorted matrix into submatrices, each representing a gene cluster, we experimented with different heuristics and, finally, derived a heuristic that identifies the boundaries between clusters by sequentially finding the maximum sum of the quotients for corresponding cells in adjacent columns across the matrix. With each successive split, only those rows corresponding to the remaining columns were processed, i.e., only the remaining symmetrical portion of the submatrix was used for further iterations of the splitting algorithm. The number of clusters into which the gene affinity matrix was partitioned was determined by AUTOCLASS (described below); however, other heuristics might be useful for this determination. The boundary metric (B) for columns $G_i$ and $G_j$ used for placement of new column k between existing columns i and j was defined as:

\[ B(G_i, G_j) = \max_{p-1 \le q \le p} \sum_{k=p-1}^{p} \frac{\max\left(\mathrm{aff}(k, q), \mathrm{aff}(k, q+1)\right)}{\min\left(\mathrm{aff}(k, q), \mathrm{aff}(k, q+1)\right)}, \tag{6} \]

where q is the new splitting point (for simplicity, we use the number of the leftmost column in the new submatrix that is to the right of the splitting point), which will split the submatrix defined between two previous splitting points, p and p − 1 (which do not necessarily represent contiguous columns). To partition the entire sorted matrix, the following initial conditions are set: p = N; p − 1 = 0.

2.3 K-Means Algorithm and Hierarchical Clustering Algorithm
K-means and hierarchical clustering analysis were performed using Cluster/Treeview programs available online (http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/software.htm).

2.4 Self-Organizing Map
Self-organizing map was performed using GeneCluster 2.0 (http://www.broad.mit.edu/cancer/software/software.html).

The Euclidean distance measure was used when the gene × keyword matrix was the input. When the gene × gene matrix was used as input, the gene similarity was calculated by (2).

2.5 Number of Clusters
In order to apply BEA-PARTITION and k-means clustering algorithms, the investigator needs to have a priori knowledge about the number of clusters in the test set. We determined the number of clusters by applying AUTOCLASS, an unsupervised Bayesian classification system developed by [22]. AUTOCLASS, which seeks a maximum posterior probability classification, determines the optimal number of classes in large data sets. Among a variety of applications, AUTOCLASS has been used for the discovery of new classes of infra-red stars in the IRAS Low Resolution Spectral catalogue, new classes of airports in a database of all US airports, and discovery of classes of proteins, introns and other patterns in DNA/protein sequence data [22]. We applied an open source implementation of AUTOCLASS (http://ic.arc.nasa.gov/ic/projects/bayes-group/autoclass/autoclass-c-program.html). The resulting number of clusters was then used as the endpoint for the partitioning step of the BEA-PARTITION algorithm. To determine whether AUTOCLASS could discover the number of clusters in the test sets correctly, we also tested different numbers of clusters other than the ones AUTOCLASS predicted.
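As an illustration of the matrix-sorting step of Section 2.2, the following minimal Python sketch implements the bond and energy measures of equations (3)-(5) and the three sorting steps. It is not the authors' implementation. The function names (`bond`, `energy`, `bea_order`, `reorder`) are hypothetical, and several choices are assumptions that the text leaves open: column 0 is used as the fixed starting column, the remaining columns are placed in index order, and ties are resolved in favor of the leftmost position.

```python
def bond(aff, i, j):
    """Bond between columns i and j, as in equation (4).

    An index of None stands for the virtual empty column outside the
    matrix, whose affinities are all zero (equation (5)).
    """
    if i is None or j is None:
        return 0.0
    return sum(aff[r][i] * aff[r][j] for r in range(len(aff)))


def energy(aff, left, new, right):
    """Net contribution of placing `new` between `left` and `right` (equation (3))."""
    return 2.0 * (bond(aff, left, new) + bond(aff, new, right)
                  - bond(aff, left, right))


def bea_order(aff):
    """Return a column ordering of the symmetric affinity matrix `aff`."""
    order = [0]  # step 1: fix one column (assumption: column 0)
    for k in range(1, len(aff)):  # step 2: place the remaining columns
        best_pos, best_gain = 0, float("-inf")
        for pos in range(len(order) + 1):
            left = order[pos - 1] if pos > 0 else None
            right = order[pos] if pos < len(order) else None
            gain = energy(aff, left, k, right)
            if gain > best_gain:  # strict '>' keeps the leftmost tie
                best_pos, best_gain = pos, gain
        order.insert(best_pos, k)
    return order


def reorder(aff, order):
    """Step 3: apply the column permutation to the rows as well."""
    return [[aff[r][c] for c in order] for r in order]
```

With a toy affinity matrix in which genes 0 and 2 share keywords and genes 1 and 3 share keywords, `bea_order` places each pair in adjacent columns, which is the block structure that the partitioning step then cuts into clusters.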

2.6 Evaluating the Clustering Results
To evaluate the quality of our resultant clusters, we used the established metrics of Purity, Entropy, and Mutual Information, which are briefly described below [23]. Let us assume that we have C classes (i.e., C expert clusters, as shown in Tables 1 and 2), while our clustering algorithms produce K clusters, $\Lambda_1, \Lambda_2, \ldots, \Lambda_K$.

Purity. Purity can be interpreted as classification accuracy under the assumption that all objects of a cluster are classified to be members of the dominant class for that cluster. If the majority of genes in cluster A are in class X, then class X is the dominant class. Purity is defined as the ratio between the number of items in cluster $\Lambda_i$ from dominant class j and the size of cluster $\Lambda_i$, that is:

\[ P(\Lambda_i) = \frac{1}{n_i} \max_j \left( n_i^j \right), \quad i = 1, 2, \ldots, K, \tag{7} \]

where $n_i = |\Lambda_i|$, that is, the size of cluster $\Lambda_i$, and $n_i^j$ is the number of genes in $\Lambda_i$ that belong to class j, j = 1, 2, ..., C. The closer to 1 the purity value is, the more similar this cluster is to its dominant class. Purity is measured for each cluster and the average purity of each test gene set cluster result was calculated.

Entropy. Entropy denotes how uniform the cluster is. If a cluster is composed of genes coming from different classes, then the value of entropy will be close to 1. If a cluster only contains one class, the value of entropy will be close to 0. The ideal value for entropy would be zero. Lower values of entropy would indicate better clustering. Entropy is also measured for each cluster and is defined as:

\[ E(\Lambda_i) = -\frac{1}{\log C} \sum_{j=1}^{C} \frac{n_i^j}{n_i} \log\!\left( \frac{n_i^j}{n_i} \right). \tag{8} \]

The average entropy of each test gene set cluster result was also calculated.

Mutual Information. One problem with purity and entropy is that they are inherently biased to favor small clusters. For example, if we had one object for each cluster, then the value of purity would be 1 and entropy would be zero, no matter what the distribution of objects in the expert classes is.

Mutual information is a symmetric measure for the degree of dependency between clusters and classes. Unlike correlation, mutual information also takes higher order dependencies into account. We use mutual information because it captures how related clusters are to classes without bias towards small clusters. Mutual information is a measure of the concordance between the algorithm-derived clusters and the actual clusters. It is the measure of how much information the algorithm-derived clusters can tell us to infer the actual clusters. Random clustering has mutual information of 0 in the limit. Higher mutual information indicates higher similarity between the algorithm-derived clusters and the actual clusters. Mutual information is defined as:

\[ M(\Lambda) = \frac{2}{N} \sum_{i=1}^{K} \sum_{j=1}^{C} n_i^j \, \frac{\log\!\left( \dfrac{n_i^j \, N}{\sum_{t=1}^{K} n_t^j \, \sum_{t=1}^{C} n_i^t} \right)}{\log (K \cdot C)}, \tag{9} \]

where N is the total number of genes being clustered, K is the number of clusters the algorithm produced, and C is the number of expert classes.

2.7 Top-Scoring Keywords Shared among Members of a Gene Cluster
Keywords were ranked according to their highest shared z-scores in each cluster. The keyword sharing strength metric ($K^a$) is defined as the sum of z-scores for a shared keyword a within the cluster, multiplied by the number of genes (M) within the cluster with which the word is associated; in this calculation z-scores less than a user-selected threshold are set to zero and are not counted.

\[ K^a = \sum_{g=1}^{M} \left( z_g^a \right) \times \sum_{g=1}^{M} \mathrm{Count}\left( z_g^a \right). \tag{10} \]

Thus, larger values reflect stronger and more extensive keyword associations within a cluster. We identified the 30 highest scoring keywords for each of the four clusters and provided these four lists to approximately 20 students, postdoctoral fellows, and faculty, asking them to guess a major function of the underlying genes that gave rise to the four keyword lists.

3 RESULTS

3.1 Keywords and Keyword × Gene Matrix Generation
A list of keywords was generated for each gene to build the keyword × gene matrix. Keywords were sorted according to their z-scores. The keyword selection experiment (see below) showed that a z-score threshold of 10 generally produced better results, which suggests that keywords with z-scores lower than 10 have less information content, e.g., “cell,” “express.” The relative values of z-scores depended on the size of the background set (data not shown). Since we used 5.6 million abstracts as the background set, the z-scores of most of the informative keywords were well above 10 (based on smaller values of standard deviation in the definition of z-score). The keyword × gene matrices were used as inputs to k-means, the hierarchical clustering algorithm, and the self-organizing map, whereas, as required by the BEA approach, they were first converted to a gene × gene matrix based on common shared keywords, and these gene × gene matrices were used as inputs to BEA-PARTITION. An overview of the gene clustering by shared keyword process is provided in Fig. 1.

3.2 Effect of Keyword Selection on Gene Clustering
The effect of using different z-score thresholds for keyword selection on the quality of resulting clusters is shown in Figs. 2A1 and 2B1. For both test sets, BEA-PARTITION produced clusters with higher mutual information when z-score thresholds were within a range of 10 to 20. For the 44-gene set, K-means produced clusters with the highest

Fig. 1. Procedure for clustering genes by the strength of their associated keywords.
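The cluster-quality measures of Section 2.6 (purity, entropy, and mutual information, equations (7)-(9)) can be sketched in plain Python as follows. This is an illustrative sketch rather than the authors' code. The function names and the list-based input format are hypothetical, empty cells of the contingency table are assumed to contribute zero (the usual convention that 0 log 0 = 0), and natural logarithms are used because the base cancels in both normalized measures.

```python
import math


def contingency(clusters, classes):
    """Build n[i][j]: number of items in cluster i that belong to class j."""
    cluster_ids = sorted(set(clusters))
    class_ids = sorted(set(classes))
    n = [[0] * len(class_ids) for _ in cluster_ids]
    for k, c in zip(clusters, classes):
        n[cluster_ids.index(k)][class_ids.index(c)] += 1
    return n


def purity(row):
    """Purity of one cluster (equation (7)); `row` is one row of n."""
    return max(row) / sum(row)


def entropy(row, num_classes):
    """Entropy of one cluster, normalized by log C (equation (8))."""
    if num_classes < 2:
        return 0.0
    total = sum(row)
    h = -sum((x / total) * math.log(x / total) for x in row if x > 0)
    return h / math.log(num_classes)


def mutual_information(n):
    """Normalized mutual information of the whole clustering (equation (9))."""
    num_clusters, num_classes = len(n), len(n[0])
    if num_clusters * num_classes < 2:
        return 0.0
    total_items = sum(sum(row) for row in n)
    cluster_sizes = [sum(row) for row in n]
    class_sizes = [sum(n[i][j] for i in range(num_clusters))
                   for j in range(num_classes)]
    acc = 0.0
    for i in range(num_clusters):
        for j in range(num_classes):
            if n[i][j] > 0:  # empty cells contribute nothing
                acc += n[i][j] * math.log(
                    n[i][j] * total_items / (class_sizes[j] * cluster_sizes[i]))
    return 2.0 * acc / (total_items * math.log(num_clusters * num_classes))
```

Average purity and average entropy, as reported in the Results, would then be the means of `purity(row)` and `entropy(row, C)` over the rows of the contingency table.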

Fig. 2. Effect of keyword selection by z-score thresholds (A1 and B1) and different number of clusters (A2 and B2) on the cluster quality. Z-score thresholds were used to select the keywords for gene clustering. Those keywords with z-scores less than the threshold were discarded. To determine the effect of keyword selection by z-score thresholds on cluster quality, we tested z-score thresholds 0, 5, 8, 10, 15, 20, 30, 50, and 100. To determine whether AUTOCLASS could be used to discover the number of clusters in the test sets correctly, we tested a different number of clusters other than the ones AUTOCLASS predicted (four for the 26-gene set and nine for the 44-gene set).

mutual information when the z-score threshold was 8, while, for the 26-gene set, mutual information was highest when the z-score threshold was 15. For the remaining studies, we chose to use a z-score threshold of 10 to keep as many functional keywords as possible.

3.3 Number of Clusters
We then used AUTOCLASS to decide the number of clusters in the test sets. AUTOCLASS took the keyword × gene matrix as input and predicted that there were five clusters in the set of 26 genes and nine clusters in the set of 44 yeast genes. The effect of the number of clusters on the algorithm performance is shown in Figs. 2A2 and 2B2. BEA-PARTITION again produced a better result regardless of the number of clusters used. BEA-PARTITION had the highest mutual information when the numbers of clusters were five (26-gene set) and nine (44-gene set), whereas k-means worked marginally better when the numbers of clusters were 8 (26-gene set) and 10 (44-gene set). Based on these results we chose to use five and nine clusters, respectively, because the probabilities were higher than the other choices.

3.4 Clustering of the 26-Gene Set by Keyword Associations
To determine whether keyword associations could be used to group genes appropriately, we clustered the 26-gene set with BEA-PARTITION, k-means, the hierarchical algorithm, SOM, and AUTOCLASS. Keyword lists were generated for each of these 26 genes, which belonged to one of four well-defined functional groups (Table 1). The resulting word × gene matrix had 26 columns (genes) and approximately 8,540 rows (words with z-scores ≥ 10 appearing in any of the query sets). The BEA-PARTITION, with z-score threshold = 10, correctly assigned 25 of 26 genes to the appropriate cluster based on the strength of keyword associations (Fig. 3). Tyrosine transaminase was the only outlier. As expected from the BEA-PARTITION, cells inside clusters tended to have much higher values than those outside. The hierarchical clustering algorithm, with the gene × keyword matrix as the input, generated a result similar to that of BEA-PARTITION (five clusters and TT was the outlier) (Fig. 4a). The results with the gene × gene matrix as the input are shown in tables in the supplementary materials, which can be found at www.computer.org/publications/dlib.

While BEA-PARTITION and the hierarchical clustering algorithm produced clusters very similar to the original functional classes, those produced by k-means (Table 4), self-organizing map (Table 5), and AUTOCLASS (Table 6), with the gene × keyword matrix as input, were heterogeneous and, thus, more difficult to explain. The average purity,

Fig. 3. Gene clusters by keyword associations using BEA-PARTITION. Keywords with z-scores ≥ 10 were extracted from MEDLINE abstracts for 26 genes in four functional classes. The resulting word × gene sparse matrix was converted to a gene × gene matrix. The cell values are the sum of z-score products for all keywords shared by the gene pair. This value is divided by 1,000 for purpose of display. A modified bond energy algorithm [16], [17] was used to group genes into five clusters based on the strength of keyword associations, and the resulting gene clusters are boxed.
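The conversion described in this caption, from thresholded z-scores (equation (1)) to gene × gene affinities (equation (2)), can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the nested-dictionary layout of the sparse word × gene matrix, and the keyword arguments are all hypothetical. Diagonal cells are computed with the same formula as off-diagonal cells, which the text does not state explicitly.

```python
def z_score(freq_test, mean_background, sd_background):
    """z-score of a word for a gene (equation (1))."""
    return (freq_test - mean_background) / sd_background


def affinity_matrix(zscores, genes, threshold=10.0, scale=1000.0):
    """Build the symmetric gene x gene affinity matrix (equation (2)).

    zscores: dict mapping gene -> {word: z-score}; this nested dict is a
             hypothetical stand-in for the sparse word x gene matrix.
    genes:   list fixing the row and column order of the result.
    """
    # Keyword selection: discard z-scores below the threshold.
    kept = {
        g: {w: z for w, z in zscores.get(g, {}).items() if z >= threshold}
        for g in genes
    }
    aff = []
    for gi in genes:
        row = []
        for gj in genes:
            shared = kept[gi].keys() & kept[gj].keys()
            # Sum of z-score products over shared keywords, scaled for display.
            row.append(sum(kept[gi][w] * kept[gj][w] for w in shared) / scale)
        aff.append(row)
    return aff
```

The default threshold of 10 mirrors the value the authors selected for most experiments; passing any of the other tested thresholds (0, 5, 8, 15, 20, 30, 50, 100) reproduces the keyword-selection variants.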

average entropy, and mutual information of the BEA-PARTITION and hierarchical algorithm results
were 1, 0, and 0.88; those of the k-means result were 0.53, 0.65, and 0.28; those of the SOM result
were 0.76, 0.35, and 0.18; and those of the AUTOCLASS result were 0.82, 0.28, and 0.56 (Table 3;
gene × keyword matrix as input). When the gene × gene matrix was used as input to the hierarchical
algorithm, k-means, and SOM, the results were even worse as measured by purity, entropy, and mutual
information (Table 3).

3.5 Yeast Microarray Gene Clustering by Keyword Association
To determine whether our text mining/gene clustering approach could be used to group genes
identified in microarray experiments, we clustered 44 yeast genes taken from Eisen et al. [6] via
Cherepinsky et al. [4], again using BEA-PARTITION, the hierarchical algorithm, SOM, AUTOCLASS, and
k-means. Keyword lists were generated for each of the 44 yeast genes (Table 2), and a matrix of
3,882 rows (words appearing in the query sets with z-score ≥ 10) × 44 columns (genes) was created.
The clusters produced by BEA-PARTITION, SOM, k-means, and AUTOCLASS are shown in Tables 7, 8, 9,
and 10, respectively, whereas those produced by the hierarchical algorithm are shown in Fig. 4b. The
average purity, average entropy, and mutual information of the BEA-PARTITION result were 0.74, 0.24,
and 0.60, whereas those of hierarchical algorithm, SOM, k-means, and AUTOCLASS results (gene ×
keyword matrix as input) were 0.86, 0.12, and 0.58; 0.60, 0.37, and 0.46; 0.61, 0.33, and 0.39; 0.57,
0.39, and 0.49, respectively (Table 3).

3.6 Keywords Indicative of Major Shared Functions within a Gene Cluster
Keywords shared among genes (26-gene set) within each cluster were ranked according to a metric
based on both the degree of significance (the sum of z-scores for each keyword) and the breadth of
distribution (the sum of the number of genes within the cluster for which the keyword has a z-score
greater than a selected threshold). This double-pronged metric obviated the difficulty encountered
with keywords that had extremely high z-scores for single genes within the cluster but modest
z-scores for the remainder. The 30 highest scoring keywords for each of the four clusters were
tabulated (Table 11). The respective keyword lists appeared to be highly informative about the
general function of the original, preselected clusters when shown to medical students, faculty, and
postdoctoral fellows.

4 DISCUSSION
In this paper, we clustered the genes by shared functional keywords. Our gene clustering strategy is
similar to document clustering in information retrieval. Document clustering, defined as grouping
documents into clusters according to their topics or main contents in an unsupervised manner,
organizes large amounts of information into a small number of meaningful clusters and improves
information retrieval performance via cluster-driven dimensionality reduction, term weighting, or
query expansion [9], [24], [25], [26], [27].
Term vector-based document clustering has been widely studied in information retrieval [9], [24],
[25], [26], [27]. A

Fig. 4. Gene clusters by keyword associations using hierarchical clustering algorithm. Keywords with z-scores ≥ 10 were extracted from MEDLINE
abstracts for (a) 26 genes in four functional classes and (b) 44 genes in nine classes. The resulting word × gene sparse matrix was used as input to
the hierarchical algorithm.

number of clustering algorithms have been proposed and many of them have been applied to
bioinformatics research. In this report, we introduced a new algorithm for clustering genes,
BEA-PARTITION. Our results showed that BEA-PARTITION, in conjunction with the heuristic developed
for partitioning the sorted matrix, outperforms the k-means algorithm and SOM in two test sets. In
the first set of genes (26-gene set), BEA-PARTITION, as well as the hierarchical algorithm, correctly
assigned 25 of 26 genes in a test set of four known gene groups with one outlier, whereas k-means
and SOM mixed the genes into five more evenly sized but less well functionally defined groups. In
the 44-gene set, the result generated by BEA-PARTITION had the highest mutual information,
indicating that BEA-PARTITION outperformed all four other clustering algorithms.

4.1 BEA-PARTITION versus k-Means
In this study, the z-score thresholds were used for keyword selection. When the threshold was 0, all
words, including noise (noninformative words and misspelled words), were used to cluster genes.
Under the tested conditions, clusters produced by BEA-PARTITION had higher quality than those
produced by k-means. BEA-PARTITION clusters genes based on their shared keywords. It is unlikely
that genes within the same cluster shared the same noisy words with high z-scores, indicating that
BEA-PARTITION is less sensitive to noise than k-means. In fact, BEA-PARTITION performed better than
k-means in the two test gene sets under almost all test conditions (Fig. 2). BEA-PARTITION performed
best when z-score thresholds were 10, 15, and 20, which indicated 1) that the words with z-score less
than 10 were less informative and 2) few words with z-scores between 10 and 20 were shared by at
least two genes and did not improve the cluster quality. When z-score thresholds were high (> 30 in
the 26-gene set and > 20 in the 44-gene set), more informative words were discarded, and as a
result, the cluster quality was degraded.

TABLE 3
The Quality of the Gene Clusters Derived by Different Clustering Algorithms, Measured by Purity, Entropy, and Mutual Information
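
Two of the measures named in the caption of Table 3 can be sketched as follows. This section does not restate how the measures are defined, so the sketch assumes the common definitions: purity is the size-weighted fraction of each cluster that belongs to its majority class, and entropy is the size-weighted class entropy of each cluster, normalized by the logarithm of the number of classes. Mutual information is left out because its normalization is not given here. The function name and toy labels are hypothetical.

```python
import math
from collections import Counter


def purity_and_entropy(clusters, labels):
    """Return (average purity, average entropy) of a clustering.

    clusters: list of lists of item ids.
    labels:   mapping item id -> known class.
    Both averages are weighted by cluster size (an assumption).
    """
    n = sum(len(members) for members in clusters)
    n_classes = len(set(labels.values()))
    purity = 0.0
    entropy = 0.0
    for members in clusters:
        size = len(members)
        counts = Counter(labels[m] for m in members)
        # (size / n) * (majority count / size) simplifies to majority / n.
        purity += max(counts.values()) / n
        if n_classes > 1:
            h = -sum((c / size) * math.log(c / size) for c in counts.values())
            entropy += (size / n) * h / math.log(n_classes)
    return purity, entropy


# Hypothetical labels: two known classes, X and Y.
toy_labels = {"a": "X", "b": "X", "c": "Y", "d": "Y"}
perfect = purity_and_entropy([["a", "b"], ["c", "d"]], toy_labels)
mixed = purity_and_entropy([["a", "c"], ["b", "d"]], toy_labels)
split = purity_and_entropy([["a"], ["b"], ["c", "d"]], toy_labels)
print(perfect, mixed, split)
```

Under these definitions, splitting one item off into its own cluster still yields purity 1 and entropy 0, because a singleton cluster is pure. That would be consistent with the values of 1 and 0 reported for a result with a single outlier, if the outlier forms its own cluster.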

BEA-PARTITION is designed to group cells with larger values together, and the ones with smaller
values together. The final order of the genes within the cluster reflected deeper interrelationships.
Among the 10 glutamate receptor genes examined, GluR1, GluR2, and GluR4 are AMPA receptors, while
GluR6, KA1, and KA2 are kainate receptors. The observation that BEA-PARTITION placed gene GluR6 and
gene KA2 next to each other confirms that the literature associations between GluR6 and KA2 are
higher than those between GluR6 and AMPA receptors. Furthermore, the association and
interrelationships of the clustered groups with one another can be seen in the final clustering
matrix. For example, TT was an outlier in Fig. 3; however, it still had higher affinity to PD1
(affinity = 202) and PD2 (affinity = 139) than to any other genes. Thus, TT appears to be strongly
related to genes in the tyrosine and phenylalanine synthesis cluster, from which it originated.
BEA-PARTITION has several advantages over the k-means algorithm: 1) while k-means generally produces
a locally optimal clustering [2], BEA-PARTITION produces

TABLE 4
Twenty-Six Gene Set k-Means Result (Gene × Keyword Matrix as Input)

TABLE 5
Twenty-Six Gene SOM Result (Gene × Keyword Matrix as Input)

the globally optimal clustering by permuting the columns and rows of the symmetric matrix; 2) the
k-means algorithm is sensitive to initial seed selection and noise [9].

4.2 BEA-PARTITION versus Hierarchical Algorithm
The hierarchical clustering algorithm, as well as k-means and Self-Organizing Maps, has been widely
used in microarray expression profile analysis. Hierarchical clustering organizes expression data
into a binary tree without providing clear indication of how the hierarchy should be clustered. In
practice, investigators define clusters by a manual scan of the genes in each node and rely on their
biological expertise to notice shared functional properties of genes. Therefore, the definition of
the clusters is subjective, and as a result, different investigators may interpret the same
clustering result differently. Some have proposed automatically defining boundaries based on
statistical properties of the gene expression profiles; however, the same statistical criteria may
not be generally applicable to identify all relevant biological functions [10]. We believe that an
algorithm that produces clusters with clear boundaries can provide more objective results and
possibly new discoveries, which are beyond the experts’ knowledge. In this report, our results showed
that BEA-PARTITION can perform similarly to a hierarchical algorithm while providing distinct
cluster boundaries.

4.3 K-Means versus SOM
The k-means algorithm and SOM can group objects into different clusters and provide clear
boundaries. Despite its

TABLE 6
Twenty-Six Gene AUTOCLASS Result (Gene × Keyword Matrix as Input)

TABLE 7
Forty-Four Yeast Genes BEA-PARTITION Result (Gene × Keyword Matrix as Input)
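
BEA-PARTITION, whose 44-gene result Table 7 reports, is described in this paper as an extension of the bond energy algorithm [16]. The details of that extension and of the partitioning heuristic are not given in this section, so the Python sketch below illustrates only the classical bond energy measure: the sum of products of horizontally and vertically adjacent cells after the rows and columns of a symmetric matrix are permuted in the same order. The function name and the toy affinity values are hypothetical.

```python
def bond_energy(matrix, order):
    """Bond energy of a symmetric matrix under a row/column ordering.

    matrix: square list of lists of affinities.
    order:  permutation of range(len(matrix)) applied to rows and columns.
    Returns the sum of products of all horizontally adjacent cell pairs
    plus the sum of products of all vertically adjacent cell pairs.
    """
    n = len(order)
    permuted = [[matrix[order[i]][order[j]] for j in range(n)] for i in range(n)]
    horizontal = sum(
        permuted[i][j] * permuted[i][j + 1] for i in range(n) for j in range(n - 1)
    )
    vertical = sum(
        permuted[i][j] * permuted[i + 1][j] for i in range(n - 1) for j in range(n)
    )
    return horizontal + vertical


# Hypothetical affinities: items 0 and 1 are related, as are items 2 and 3.
toy_matrix = [
    [9, 5, 0, 0],
    [5, 9, 0, 0],
    [0, 0, 9, 5],
    [0, 0, 5, 9],
]
grouped = bond_energy(toy_matrix, [0, 1, 2, 3])
interleaved = bond_energy(toy_matrix, [0, 2, 1, 3])
print(grouped, interleaved)
```

An ordering that keeps related items adjacent puts large cells next to each other and scores higher than one that interleaves the two groups, which is the behavior the text attributes to the clustered matrix.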

simplicity and efficiency, the SOM algorithm has several weaknesses that make its theoretical
analysis difficult and limit its practical usefulness. Various studies have suggested that it is hard
to find any criteria under which the SOM algorithm performs better than the traditional techniques,
such as k-means [11]. Balakrishnan et al. [28] compared the SOM algorithm with k-means clustering on
108 multivariate normal clustering problems. The results showed that the SOM algorithm performed
significantly worse than the k-means clustering algorithm. Our results also showed that k-means
performed better than SOM by generating clusters with higher mutual information.

4.4 Computing Time
The computing time of BEA-PARTITION, like that of the hierarchical algorithm and SOM, is on the order
of N^2, commonly denoted O(N^2), which means that it grows proportionally to the square of the number
of genes. That of k-means is on the order of N*K*T, or O(NKT), where N is the number of genes tested,
K is the number of clusters, and T is the number of improvement steps (iterations) performed by
k-means. In our study, the number of improvement steps was 1,000. Therefore, when the number of genes
tested is about 1,000, BEA-PARTITION runs (aK + b) times faster than k-means, where a and b are
constants. As long as the number of genes to be clustered is less than the product of the number

TABLE 8
Forty-Four Yeast Gene SOM Result (Gene × Keyword Matrix as Input)

TABLE 9
Forty-Four Yeast Gene k-Means Result (Gene × Keyword Matrix as Input)

of clusters and the number of iterations, BEA-PARTITION will run faster than k-means.

4.5 Number of Clusters
One disadvantage of BEA-PARTITION and k-means compared to hierarchical clustering is that the
investigator needs to have a priori knowledge about the number of clusters in the test set, which
may not be known. We approached this problem by using AUTOCLASS to predict the number of clusters in
the test sets. BEA-PARTITION performed best when it grouped the genes into five clusters (26-gene
set) and nine clusters (44-gene set), which were predicted by AUTOCLASS with higher probabilities.
Therefore, AUTOCLASS

TABLE 10
Forty-Four Yeast Gene AUTOCLASS Result (Gene × Keyword Matrix as Input)

TABLE 11
Top Ranking Keywords Associated with Each Gene Cluster
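
The keyword ranking behind Table 11 (Section 3.6) combines a keyword's summed z-score with the number of genes in the cluster for which it exceeds a threshold. The text does not say how the two quantities are combined, so the Python sketch below multiplies them. That choice, along with the function name, the default threshold, and the toy z-scores, is an assumption made for illustration.

```python
def rank_cluster_keywords(cluster_zscores, threshold=10.0, top_n=30):
    """Rank the keywords of one cluster; returns keywords, best first.

    cluster_zscores: mapping gene -> {keyword: z-score} for one cluster.
    significance = sum of a keyword's z-scores over the cluster's genes.
    breadth      = number of genes where the keyword's z-score > threshold.
    """
    significance = {}
    breadth = {}
    for words in cluster_zscores.values():
        for word, z in words.items():
            significance[word] = significance.get(word, 0.0) + z
            if z > threshold:
                breadth[word] = breadth.get(word, 0) + 1
    # Assumption: combine the two prongs by multiplication.
    combined = {w: significance[w] * breadth.get(w, 0) for w in significance}
    ranked = sorted(combined, key=lambda w: combined[w], reverse=True)
    return ranked[:top_n]


# Hypothetical cluster: "cloning" is extreme for one gene only.
toy_cluster = {
    "GluR1": {"ampa": 40.0, "cloning": 120.0},
    "GluR2": {"ampa": 35.0, "cloning": 2.0},
    "GluR4": {"ampa": 30.0},
}
top_keywords = rank_cluster_keywords(toy_cluster)
print(top_keywords)
```

In the toy cluster, "cloning" has the larger summed z-score, but it is high for only one gene, so the broadly shared keyword "ampa" ranks first. This is the situation the double-pronged metric is said to address.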

appears to be an effective tool to assist the BEA-PARTITION in gene clustering.

5 CONCLUSIONS AND FUTURE WORK
There are several aspects of the BEA approach that we are currently exploring with more detailed
studies. For example, although the BEA-PARTITION described here performs relatively well on small
sets of genes, the larger gene lists expected from microarray experiments need to be tested.
Furthermore, we derived a heuristic to partition the clustered affinity matrix into clusters. We
anticipate that this heuristic, which is simply based on the sum of ratios of corresponding values
from adjacent columns, will generally work regardless of the type of items being clustered.
Generally, optimizing the heuristic to partition a sorted matrix after BEA-based clustering will be
valuable. Finally, we are developing a Web-based tool that will include a text mining phase to
identify functional keywords, and a gene clustering phase to cluster the genes based on the shared
functional keywords. We believe that this tool should be useful for discovering novel relationships
among sets of genes because it links genes by shared functional keywords rather than just reporting
known interactions based on published reports. Thus, genes that never co-occur in the same
publication could still be linked by their shared keywords.
The BEA approach has been applied successfully to other disciplines, such as operations research,
production engineering, and marketing [18]. The BEA-PARTITION algorithm represents our extension to
the BEA approach specifically for dealing with the problem of discovering functional similarity
among genes based on functional keywords extracted from literature. We believe that this important
clustering technique, which was originally proposed by McCormick et al. [16] to cluster questions on
psychological instruments and later introduced by Navathe et al. [17] for clustering of data items in
database design, has promise for application to other bioinformatics problems where starting
matrices are available from experimental observations.

ACKNOWLEDGMENTS
This work was supported by NINDS (RD) and the Emory-Georgia Tech Research Consortium. The authors
would like to thank Brian Revennaugh and Alex Pivoshenk for research support.

REFERENCES
[1] C. Blaschke, J.C. Oliveros, and A. Valencia, “Mining Functional Information Associated with
Expression Arrays,” Functional & Integrative Genomics, vol. 1, pp. 256-268, 2001.
[2] Y. Xu, V. Olman, and D. Xu, “EXCAVATOR: A Computer Program for Efficiently Mining Gene Expression
Data,” Nucleic Acids Research, vol. 31, pp. 5582-5589, 2003.
[3] D. Chaussabel and A. Sher, “Mining Microarray Expression Data by Literature Profiling,” Genome
Biology, vol. 3, pp. 1-16, 2002.
[4] V. Cherepinsky, J. Feng, M. Rejali, and B. Mishra, “Shrinkage-Based Similarity Metric for Cluster
Analysis of Microarray Data,” Proc. Nat’l Academy of Sciences USA, vol. 100, pp. 9668-9673, 2003.
[5] J. Quackenbush, “Computational Analysis of Microarray Data,” Nature Rev. Genetics, vol. 2,
pp. 418-427, 2001.

[6] M.B. Eisen, P.T. Spellman, P.O. Brown, and D. Botstein, “Cluster Analysis and Display of
Genome-Wide Expression Patterns,” Proc. Nat’l Academy of Sciences USA, vol. 95, pp. 14863-14868, 1998.
[7] R. Herwig, A.J. Poustka, C. Müller, C. Bull, H. Lehrach, and J. O’Brien, “Large-Scale Clustering
of cDNA-Fingerprinting Data,” Genome Research, vol. 9, pp. 1093-1105, 1999.
[8] P. Tamayo, D. Slonim, J. Mesirov, Q. Zhu, S. Kitareewan, E. Dmitrovsky, E.S. Lander, and T.R.
Golub, “Interpreting Patterns of Gene Expression with Self-Organizing Maps: Methods and Application
to Hematopoietic Differentiation,” Proc. Nat’l Academy of Sciences USA, vol. 96, pp. 2907-2912, 1999.
[9] A.K. Jain, M.N. Murty, and P.J. Flynn, “Data Clustering: A Review,” ACM Computing Surveys,
vol. 31, pp. 264-323, 1999.
[10] S. Raychaudhuri, J.T. Chang, F. Imam, and R.B. Altman, “The Computational Analysis of Scientific
Literature to Define and Recognize Gene Expression Clusters,” Nucleic Acids Research, vol. 15,
pp. 4553-4560, 2003.
[11] B. Kegl, “Principal Curves: Learning, Design, and Applications,” PhD dissertation, Dept. of
Computer Science, Concordia Univ., Montreal, Quebec, 2002.
[12] T.K. Jenssen, A. Laegreid, J. Komorowski, and E. Hovig, “A Literature Network of Human Genes for
High-Throughput Analysis of Gene Expression,” Nat’l Genetics, vol. 178, pp. 139-143, 2001.
[13] D.R. Masys, J.B. Welsh, J.L. Fink, M. Gribskov, I. Klacansky, and J. Corbeil, “Use of Keyword
Hierarchies to Interpret Gene Expression Patterns,” Bioinformatics, vol. 17, pp. 319-326, 2001.
[14] S. Raychaudhuri, H. Schutze, and R.B. Altman, “Using Text Analysis to Identify Functionally
Coherent Gene Groups,” Genome Research, vol. 12, pp. 1582-1590, 2002.
[15] M. Andrade and A. Valencia, “Automatic Extraction of Keywords from Scientific Text: Application
to the Knowledge Domain of Protein Families,” Bioinformatics, vol. 14, pp. 600-607, 1998.
[16] W.T. McCormick, P.J. Schweitzer, and T.W. White, “Problem Decomposition and Data Reorganization
by a Clustering Technique,” Operations Research, vol. 20, pp. 993-1009, 1972.
[17] S. Navathe, S. Ceri, G. Wiederhold, and J. Dou, “Vertical Partitioning Algorithms for Database
Design,” ACM Trans. Database Systems, vol. 9, pp. 680-710, 1984.
[18] P. Arabie and L.J. Hubert, “The Bond Energy Algorithm Revisited,” IEEE Trans. Systems, Man, and
Cybernetics, vol. 20, pp. 268-274, 1990.
[19] A.T. Ozsu and P. Valduriez, Principles of Distributed Database Systems, second ed. Prentice Hall
Inc., 1999.
[20] Y. Liu, M. Brandon, S. Navathe, R. Dingledine, and B.J. Ciliax, “Text Mining Functional Keywords
Associated with Genes,” Proc. Medinfo 2004, pp. 292-296, Sept. 2004.
[21] Y. Liu, B.J. Ciliax, K. Borges, V. Dasigi, A. Ram, S. Navathe, and R. Dingledine, “Comparison of
Two Schemes for Automatic Keyword Extraction from MEDLINE for Functional Gene Clustering,” Proc. IEEE
Computational Systems Bioinformatics Conf. (CSB 2004), pp. 394-404, Aug. 2004.
[22] P. Cheeseman and J. Stutz, “Bayesian Classification (Autoclass): Theory and Results,” Advances
in Knowledge Discovery and Data Mining, pp. 153-180, AAAI/MIT Press, 1996.
[23] A. Strehl, “Relationship-Based Clustering and Cluster Ensembles for High-Dimensional Data
Mining,” PhD dissertation, Dept. of Electric and Computer Eng., The University of Texas at Austin,
2002.
[24] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval. New York: Addison Wesley
Longman, 1999.
[25] F. Sebastiani, “Machine Learning in Automated Text Categorization,” ACM Computing Surveys,
vol. 34, pp. 1-47, 1999.
[26] P. Willett, “Recent Trends in Hierarchic Document Clustering: A Critical Review,” Information
Processing and Management, vol. 24, pp. 577-597, 1988.
[27] J. Aslam, A. Leblanc, and C. Stein, “Clustering Data without Prior Knowledge,” Proc. Algorithm
Eng.: Fourth Int’l Workshop, 1982.
[28] P.V. Balakrishnan, M.C. Cooper, V.S. Jacob, and P.A. Lewis, “A Study of the Classification
Capabilities of Neural Networks Using Unsupervised Learning: A Comparison with K-Means Clustering,”
Psychometrika, vol. 59, pp. 509-525, 1994.

Ying Liu received the BS degree in environmental biology from Nanjing University, China. He received
Master’s degrees in bioinformatics and computer science from Georgia Institute of Technology in 2002.
He is a PhD candidate in the College of Computing, Georgia Institute of Technology, where he works on
text mining biomedical literature to discover gene-to-gene relationships. His research interests
include bioinformatics, computational biology, data mining, text mining, and database systems. He is
a student member of the IEEE Computer Society.

Shamkant B. Navathe received the PhD degree from the University of Michigan in 1976. He is a
professor in the College of Computing, Georgia Institute of Technology. He has published more than
130 refereed papers in database research; his important contributions are in database modeling,
database conversion, database design, conceptual clustering, distributed database allocation, data
mining, and database integration. Current projects include text mining of medical literature
databases, creation of databases for biological applications, transaction models in P2P and Web
applications, and data mining for better understanding of genomic/proteomic and medical data. His
recent work has been focusing on issues of mobility, scalability, interoperability, and
personalization of databases in scientific, engineering, and e-commerce applications. He is an author
of the book, Fundamentals of Database Systems, with R. Elmasri (Addison Wesley, fourth edition, 2004)
which is currently the leading database textbook worldwide. He also coauthored the book Conceptual
Design: An Entity Relationship Approach (Addison Wesley, 1992) with Carlo Batini and Stefano Ceri. He
was the general cochairman of the 1996 International VLDB (Very Large Data Base) Conference in
Bombay, India. He was also program cochair of ACM SIGMOD 1985 at Austin, Texas. He is also on the
editorial boards of Data and Knowledge Engineering (North Holland), Information Systems (Pergamon
Press), Distributed and Parallel Databases (Kluwer Academic Publishers), and World Wide Web Journal
(Kluwer). He has been an associate editor of IEEE Transactions on Knowledge and Data Engineering. He
is a member of the IEEE.

Jorge Civera received the BSc degree in computer science from the Universidad Politécnica de Valencia
in 2002, and the MSc degree in computer science from Georgia Institute of Technology in 2003. He is
currently a PhD student at Departamento de Sistemas Informáticos y Computación and a research
assistant in the Instituto Tecnológico de Informática. He also holds a fellowship from the Spanish
Ministry of Education and Culture. His research interests include bioinformatics, machine
translation, and text mining.

Venu Dasigi received the BE degree in electronics and communication engineering from Andhra
University in 1979, the MEE degree in electronic engineering from the Netherlands Universities
Foundation for International Cooperation in 1981, and the MS and PhD degrees in computer science from
the University of Maryland, College Park in 1985 and 1988, respectively. He is currently professor
and chair of computer science at Southern Polytechnic State University in Marietta, Georgia. He is
also an honorary professor at Gandhi Institute of Technology and Management in India. He held
research fellowships at the Oak Ridge National Laboratory and the Air Force Research Laboratory. His
research interests include text mining, information retrieval, natural language processing,
artificial intelligence, bioinformatics, and computer science education. He is a member of ACM and
the IEEE Computer Society.

Ashwin Ram received the PhD degree from Yale University in 1989, the MS degree from the University of
Illinois in 1984, and the BTech degree from IIT Delhi in 1982. He is an associate professor in the
College of Computing at the Georgia Institute of Technology, an associate professor of Cognitive
Science, and an adjunct professor in the School of Psychology. He has published two books and more
than 80 scientific papers in international forums. His research interests lie in artificial
intelligence and cognitive science, and include machine learning, natural language processing,
case-based reasoning, educational technology, and artificial intelligence applications.

Brian J. Ciliax received the BS degree in biochemistry from Michigan State University in 1981, and
the PhD degree in pharmacology from the University of Michigan in 1987. He is currently an assistant
professor in the Department of Neurology at Emory University School of Medicine. His research
interests include the functional neuroanatomy of the basal ganglia, particularly as it relates to
hyperkinetic movement disorders such as Tourette’s Syndrome. Since 2000, he has collaborated with the
coauthors on the development of a system to functionally cluster genes (identified by
high-throughput genomic and proteomic assays) according to keywords mined from relevant MEDLINE
abstracts.

Ray Dingledine received the PhD degree in pharmacology from Stanford. He is currently professor and
chair of pharmacology at Emory University and serves on the Scientific Council of NINDS at NIH. His
research interests include the application of microarray and associated technologies to identify
novel molecular targets for neurologic disease, the normal functions and pathobiology of glutamate
receptors, and the role of COX2 signaling in neurologic disease.


2004 Reviewers List



We thank the following reviewers for the time and energy they have given to TCBB:

A Diego di Bernardo J François Major


Adrian Dobra Elisabetta Manduchi
John Aach Bruce R. Donald Inge Jonassen Mark Marron
Tatsuya Akutsu Sebastián Dormido-Canto Rebecka Jornsten Jens Meiler
David Aldous Zhihua Du Stefano Merler
Aijun An Blythe Durbin K Webb Miller
Iannis Apostolakis Marta Milo
Lars Arvestad E Jaap Kaandorp Satoru Miyano
Daniel Ashlock Markus Kalisch Annette Molinaro
Kevin Atteson Nadia El-Mabrouk Rachel Karchin Shinichi Morishita
Wai-Ho Au Charles Elkan Juha Karkkainen Vincent Moulton
Eleazar Eskin Kevin Karplus Marcus Mueller
B Simon Kasif Sayan Mukherjee
F Samuel Kaski Rory Mulvaney
Rolf Backofen Ed Keedwell T.M. Murali
David Bader Giancarlo Ferrari-Trecate Purvesh Khatri Simon Myers
Tim Bailey Liliana Florea Hyunsoo Kim
Tomas Balla Gary Fogel Junhyong Kim N
Serafim Batzoglou Yoav Freund Ross D. King
Gil Bejerano Jane Fridlyand Andrzej Konopka Iftach Nachman
Amir Ben-Dor Yan Fu Hamid Krim Luay Nakhleh
Asa Ben-Hur Terrence Furey Nandini Krishnamurthy Anand Narasimhamurthy
Anne Bergeron Cesare Furlanello Gregory Kucherov Gonzalo Navarro
Olaf Bininda-Emonds David Kulp William Noble
Riccardo Boscolo G
Guillaume Bourque L O
Alvis Brazma Olivier Gascuel
Daniel Brown Dan Geiger Michelle Lacey Enno Ohlebusch
Duncan Brown Zoubin Ghahramani Wai Lam Arlindo Oliveira
Barb Bryant Debashis Ghosh Giuseppe Lancia Jose Oliver
David Bryant Pulak Ghosh Michael Lappe Christos Ouzounis
Jeremy Buhler Raffaele Giancarlo Richard Lathrop
Joachim Buhmann Robert Giegerich Nicolas Le Novere P
David Gilbert Thierry LeCroq
C Jan Gorodkin Hansheng Lei Junfeng Pan
John Goutsias Boaz Lerner Rong Pan
Andrea Califano Daniel Gusfield Christina Leslie Wei Pan
Colin Campbell Ilya Levner Paul Pavlidis
Isabelle M. Guyon
Alberto Caprara Dequan Li Itsik Pe’er
Adolfo Guzman-Arenas
Keith Chan Fan Li Christian Pedersen
Claudine Chaouiya Jinyan Li Anton Petrov
H
Ferdinando Cicalese Wentian Li Tuan Pham
Melissa Cline Jie Liang Katherine Pollard
Sridhar Hannenhalli
David Corne Olivier Lichtarge Gianluca Pollastri
Alexander Hartemink
Nello Cristianini Charles Ling Calton Pu
Tzvika Hartman
Miklos Csuros Michal Linial
Lisa Holm
Adele Cutler Huan Liu R
Paul Horton Zhenqiu Liu
D Steve Horvath Stanley Loh John Rachlin
Xiao Hu Heitor Lopes Mark Ragan
Patrik D’haeseleer Haiyan Huang Rune Lyngsoe Jagath Rajapakse
Michiel de Hoon Alan Hubbard R.S. Ramakrishna
Arthur Delcher Katharina Huber M Isidore Rigoutsos
Alain Denise Dirk Husmeier Dave Ritchie
Marcel Dettling Daniel Huson Bin Ma Fredrik Ronquist
Inderjit S. Dhillon Patrick Ma Juho Rousu

Jem Rowland W
Larry Ruzzo
Leszek Rychlewski Baoying Wang
Chang Wang
S Lisan Wang
Tandy Warnow
Gerhard Sagerer Michael K. Weir
Steven Salzberg Jason Weston
Herbert Sauro Ydo Wexler
Alejandro Schaffer Nalin Wickramarachchi
Alexander Schliep Chris Wiggins
Scott Schmidler David Wild
Jeanette Schmidt Tiffani Williams
Alexander Schönhuth Thomas Wu
Charles Semple
Soheil Shams X
Roded Sharan
Chad Shaw Dong Xu
Dinggang Shen Jinbo Xu
Dou Shen
Lisan Shen Y
Stanislav Shvartsman
Amandeep Sidhu Qiang Yang
Richard Simon Yee Hwa Yang
Sameer Singh Zizhen Yao
Janne Sinkkonen Daniel Yekutieli
Steven S. Skiena Jeffrey Yu
Quinn Snell
Carol Soderlund Z
Rainer Spang
Peter Stadler Mohammed J. Zaki
Mike Steel An-Ping Zeng
Gerhard Steger Chengxiang Zhai
Jens Stoye Jingfen Zhang
Jack Sullivan Kaizhong Zhang
Krister Swenson Xuegong Zhang
Yang Zhang
T Zhi-Hua Zhou
Zonglin Zhou
Pablo Tamayo Ji Zhu
Amos Tanay
Chun Tang
Jijun Tang
Thomas Tang
Glenn Tesler
Robert Tibshirani
Martin Tompa
Anna Tramontano
James Troendle
Jerry Tsai
Koji Tsuda
John Tyson

Eugene van Someren


Stella Veretnik
David Vogel
Gwenn Volkert
