Professional Documents
Culture Documents
04-Alinemiento Múltiple de Secuencias
04-Alinemiento Múltiple de Secuencias
structure prediction aim at predicting the role a aware that given inappropriate sequences, most
residue plays in a protein structure (buried or multiple alignment routines will nonetheless
exposed, helix or strand etc.). Secondary struc- produce an alignment. It will be the responsibil-
ture predictions based on a single sequence yield ity of the biologist to realize that this alignment
a low accuracy (in the order of 60%) [13], while is meaningless. This is not an easy task, and a few
predictions based on a MSA go much higher (in years ago Henikoff reviewed a series of problems
the order of 75%) [2,14,15]. The rationale behind that can occur when one forces multiple align-
such improvements is that the pattern of substi- ments with unrelated sequences [22]. In order to
tutions observed in a column directly reflects the recruit a set of homologous sequences, it is com-
type of constraints imposed on that position in mon practice to use one of the BLAST programs
the course of evolution. In the context of tertiary (WU-BLAST, PSI-BLAST, GAPPED BLAST
structure determination or when predicting non- etc.) [12], for searching within a database all the
local contacts, multiple alignments can also help sequences similar to some query sequence.
to identify correlated mutations. This approach When doing so, an observed similarity is consid-
has only given limited results when applied to ered good when it is unlikely to arise by chance
proteins [16], it has been much more successful in (given the database and the amino-acid frequen-
RNA analysis where it allows highly accurate cies). To make this estimation, BLAST uses pow-
predictions [17] well confirmed by structural erful statistical models developed by Altschul
analysis. and Karlin [23]. Of course, these statistical mod-
Altogether, these very important applications els merely approximate the biological reality, and
explain the amount of attention dedicated to the homology may be misrepresented by similarity,
MSA problem and any biologist should be aware leading to the incorporation of improper
that very few bioinformatics protocols bypass the sequences within a multiple alignment.
multiple alignment stage. Unfortunately, availa-
ble tools are only heuristics providing an approx- The choice of an objective function
imate solution to a problem that remains largely This is purely a biological problem that lies in
open. These many heuristics are based on differ- the definition of correctness. What should a bio-
ent paradigms, each well suited to a limited logically correct alignment look like? Can we
range of situations. define its expected properties and will we recog-
nize it when we see it? These intricate questions
A complicated problem. can only be answered by means of a mathemati-
MSA is a complicated problem. It stands at the- cal function able to measure an alignment bio-
cross road of three distinct technical difficulties: logical quality. We name this function an
Objective Function (OF) because it defines the
• the choice of the sequences
mathematical objective of the search. Given a
• the choice of an objective function (i.e., a
perfect function, the mathematically optimal
comparison model)
alignment will also be biologically optimal. Yet
• the optimization of that function
this is rarely the case, and while the function
Altogether, properly solving these three problems defines a mathematical optimum, we rarely have
would require an understanding of statistics, an argument that this optimum will also be bio-
biology and computer science that lies far logically optimal.
beyond our grasp. Defining a proper objective function is a
highly non-trivial task and an active research
The choice of the sequences field of its own right. In theory, an OF should
The methods reviewed here (i.e., global MSA incorporate everything that is known about the
methods) only make sense if they are assumed to sequences, including their structure, function
be dealing with a set of homologous sequences and evolutionary history. This information is
i.e., sequences sharing a common ancestor. Fur- rarely at hand and is hard to use, so it is usually
thermore, with the exception of DiAlign [18], replaced with sequence similarity. Thus, a very
global methods require the sequences to be simple general function is often used: the
related over their whole length (or at least most weighted sums-of-pairs with affine gap penalties
of it). When that condition is not met, one [24]. Under this model, each sequence receives a
should consider the use of local MSA methods weight proportional to the amount of independ-
such as the Gibbs sampler [19], Match-Box [20] or ent information it contains [25] and the cost of
MACAW [21]. In any case, one should always be the multiple alignment is equal to the sum of the
cost of all the weighted pair-wise substitutions. drawbacks. In an ideal world, a perfect OF
The substitution costs are evaluated using a pre- would be available for every situation. In prac-
defined evolutionary model known as a substitu- tice, this is not the case and the user is always left
tion matrix [26], in which a score is assigned to to make a decision when choosing the method
every possible substitution or conservation that is most suitable to the problem.
according to its biological likeliness (i.e., rarely
observed mutations receive a negative score Computational
while mutations observed more often would be The third problem associated with MSAs is com-
expected by chance receive to a positive score). putational. Assuming we have at our disposal an
Insertions or deletions are scored using affine gap adequate set of sequences and a biologically per-
penalties that penalize a gap once for opening fect objective function, the computation of a
and then proportionally to its length. This pen- mathematically optimal alignment is too com-
alty scheme is a major source of concern because plex a task for an exact method to be used [38].
it requires two parameters: Even if the function we are interested in was as
simple as a maximization of the number of per-
• The gap opening
fect identities within each column, the problem
• The gap extension penalty
would already be out of reach for more than
whose adequate values can only be set empiri- three sequences. This is why all the current
cally and may vary from one set of sequences to implementations of multiple alignment algo-
the next [27]. Although this function is clearly rithms are heuristics and that none of them guar-
wrong from an evolutionary point of view [24], antee a full optimization. Considering their most
because it assumes every sequence within the set obvious properties, it is convenient to classify
to be an ancestor of every other sequence, the existing algorithms in three main categories:
ease of its implementation has made it popular exact, progressive and iterative. Exact algorithms
with the most widely used MSA packages [28-30]. are high quality heuristics that deliver an align-
This validation was recently confirmed by a ment usually very close to optimality [28,39],
more thorough benchmarking [31] indicating sometimes but not always within well-defined
that packages that rely on the sums-of-pairs are boundaries. They can only handle a small
reasonable performers as judged by the biological number of sequences (< 20) and are limited to
quality of the alignments they produce. Very the sums-of-pairs objective function. Progressive
recently, a new variant of the sum-of-pairs func- alignments are by far the most widely used
tion has been introduced that seems less likely to [34,40,41]. They depend on a progressive assembly
over-estimate evolutionary events [32]. of the multiple alignment [42-44] where sequences
Over the last years, new OFs were described or alignments are added one by one so that never
that seem to be less sensitive to gap penalty esti- more than two sequences (or multiple align-
mation thanks to the incorporation of local ments) are simultaneously aligned using
information. These include the segment-based dynamic programming [45]. This approach has
evaluation of DiAlign [33] and the consistency the great advantage of speed and simplicity com-
objective function of T-Coffee [34]. HMMs [9,35] bined with reasonable sensitivity, even if it is by
constitute another line of thought recently nature a heuristic that does not guarantee any
explored. HMMs describe the multiple align- level of optimization. Other progressive align-
ment in a statistical context, using a Bayesian ment methods exist such as DiAlign [18] or
approach. Although from a formal point of view Match-Box [20], which assemble the alignment in
they provide us with the most attractive solution, a sequence-independent manner by combining
their performances for ab initio alignments have segment pairs in an order dictated by their score,
so far been disappointing and recent work shows until every residue of every sequence has been
that carefully tuned HMM packages barely out- incorporated in the multiple alignment. Iterative
perform ClustalW [36]. Other statistically-based alignment methods depend on algorithms able to
methods that attempt to associate a P-value to produce an alignment and to refine it through a
the multiple alignment have been described series of cycles (iterations) until no more
[19,37]. Unfortunately, these measures are improvements can be made. Iterative methods
restricted to ungapped MSAs. can be deterministic or stochastic, depending on
All things considered, one should be well the strategy used to improve the alignment. The
aware that there is no such thing as the ideal OF simplest iterative strategies are deterministic.
and every available scheme suffers from major They involve extracting sequences one by one
http://www.ashley.com 3
REVIEW
from a multiple alignment and realigning them ing a set of sequences in little time and with little
to the remaining sequences [46,47], some of these memory. This algorithm was initially described
methods can even be a mixture of progressive by Hogeweg [42] and later re-invented by Feng
and iterative strategies [48]. The procedure is ter- [43] and Taylor [44]. The most widely used MSA
minated when no more improvement can be packages are based on an implementation of this
made (convergence). Stochastic iterative meth- algorithm, which include: Pileup, a part of the
ods include HMM training [49] and simulated GCG package [59], MultAlign [41] and ClustalW
annealing or genetic algorithms [50-56]. The main [29] that has become the standard method for
advantage is to allow for a good conceptual sepa- multiple alignments.
ration between optimization processes and OF. ClustalW is a non-iterative, deterministic
Recent examples of algorithms belonging to algorithm that attempts to optimize the
these three categories are reviewed in the next weighted sums-of-pairs with affine gap penalties.
section. It is a straightforward progressive alignment
strategy where sequences are added one by one to
Review the multiple alignment according to the order
The number of available MSA methods has indicated by a pre-computed dendrogram.
steadily increased over the last 20 years. Being Sequence addition is made using a pair-wise
exhaustive on these will not be possible within sequence alignment algorithm [45]. The main
the scope of this work, and this review should be shortcoming of this strategy is that once a
seen as complementary with another recent sequence has been aligned, that alignment will
review [57]. Furthermore, it should be pointed never be modified even if it conflicts with
out that only a minority of the methods sequences added later in the process as shown in
described in the literature have found their way Figure 1. ClustalW also includes many highly spe-
towards regular usage. There are many reasons cialized heuristics meant to maximally exploit
for failure, but the main one stems from a simple sequence information:
fact: there is no satisfactory theoretical frame-
• local gap penalties
work in sequence analysis, in this context an
• automatic substitution matrix choice
algorithm is only as good as it is useful. Improve-
• automatic gap penalty adjustment
ments are driven by results and not theory, so
• the delaying of the alignment of distantly
that programs with badly designed interfaces or
related sequences
poor portability have been disgarded by natural
selection, leaving their algorithms to be re- Benchmarking tests, carried out on BAliBASE
invented by later generations. [31],a database of reference multiple sequence
Over the last few years, the field of MSA has alignments. In general, ClustalW performs bet-
undergone drastic evolutionary changes with the ter when the Phylogenetic tree is relatively dense
introduction of several new algorithms and new without any obvious outlier. It does not matter
evaluation methods. Some of the methods used how widely the sequences are spread just as long
for mutiple sequence alignments are listed in as every sequence remains close enough (a bit
Table 1. Among all this, two new trends have like crossing a river stepping from stone to
emerged: stone). Long insertions or deletions also cause
trouble, due to the intrinsic limitation of the aff-
• the increasing use of iterative optimisation
ine penalty scheme used by ClustalW.
strategies (stochastic or non-stochastic)
The latest improvement to the progressive
• the use of consistency-based scoring schemes
alignment algorithm is T-Coffee, a novel strategy
In this section, we review some of these new where sequences are aligned in a progressive
algorithms, their main characteristics and poten- manner but using a consistency-based objective
tial shortcomings. Another major trend, that will function that makes it possible to minimize
not be extensively covered here, has been the potential errors, especially in the early stages of
introduction of HMMs methods [9,35]. A very the alignment assembly. T-Coffee is reviewed in
detailed account on HMM-based methods for more detail in the consistency-based algorithm
MSAs may be found in [58]. section.
Table 1. Some recent and less recent available methods for MSAs.
Name Algorithm URL Ref.
MSA Exact http://www.ibc.wustl.edu/ibc/msa.html [28]
signal contained in the sequences properly, one and extends its capabilities. The DCA algorithm
would like to simultaneously align them, rather cuts the sequences in subsets of segments that are
than adding them one by one to a multiple small enough to be fed to MSA. The sub-align-
alignment. This would be especially useful when ments are later reassembled by DCA. The trick is
dealing with sets of extremly divergent sequences to cut the sequences at the right points so that
whose pair-wise alignments are all likely to be the produced alignment remains as close as pos-
incorrect. Unfortunately, to align several sible to optimality. The way it is done in DCA is
sequences, one would need to generalize the slightly heuristic albeit fairly accurate. Bench-
Needlman and Wunsch algorithm [45] to a multi- marking on BAliBASE indicated that the DCA
dimensional space and for practical reasons (time strategy does slightly better that ClustalW, even
and memory) this is only possible for a maxi- if the four largest BAliBASE test sets could not
mum of three sequences. That limit can be be computed with DCA (Notredame, unpub-
pushed a bit further if one finds a way to identify lished results). Even when MSA is coupled to
in advance the portion of the hyperspace that DCA, strong limitations remain on the number
does not contribute to the solution and exclude of sequences that can be handled (20–30) and on
it from computation. This is achieved in the their phylogenetic spread. Recently, an iterative
MSA program, an implementation of the Car- implementation of DCA [61], optimal multiple
rillo and Lipman algorithm [60] that makes it alignment (OMA) was described that is meant to
possible to align up to ten closely related speed up the DCA strategy and decreases its
sequences [28]. It should be stressed here that, memory requirements.
contrary to a widespread belief, the MSA pro-
gram is only a heuristic implementation of the Iterative algorithms
Carillo and Lipman algorithm, that is not guar- Iterative algorithms are based on the idea that
anteed to reach the mathematical optimum. the solution to a given problem can be computed
MSA uses lower and upper bounds tighter than by modifying an already existing sub-optimal
the guaranteed ones (Altschul, personal commu- solution. Each ‘modification’ step is an iteration.
nication). Even so, the high memory require- In the examples considered here, modifications
ment, the lengthy computational time and the can be made using dynamic programming or
limitation on the number of sequences explain various random protocols. While the dynamic
why the MSA program quickly gave way to programming-based protocols can also include
ClustalW. Yet, MSA met again with popularity elements of randomization, we distinguish them
when Stoye described a new divide and conquer from more traditional stochastic iterative meth-
algorithmDCA [39] that sits on the top of MSA
http://www.ashley.com 5
REVIEW
THE FAT CAT GARFIELD THE VERY FAST CAT GARFIELD THE FAST CAT GARFIELD THE LAST FAT CAT
This example shows how a progressive alignment strategy can be misled. In the initial alignment of sequences 1 and 2, ClustalW has a
choice between aligning CAT with CAT and making an internal gap or making a mismatch between C and F and having a terminal gap.
Since terminal gaps are much cheaper than internals, the ClustalW scoring schemes prefers the former. In the next stage, when the
extra sequence is added, it turns out that properly aligning the two CATs in the previous stage would have led to a better scori ng sums-
of-pairs multiple alignment.
ods such as simulated annealing (SA) [62] or generated multiple alignments of a given set of
genetic algorithms (GA) [63]. sequences evolve under some selection pressure.
These alignments are in competition with
Stochastic iterative algorithms each other for survival (survival of the fittest)
SA was the first stochastic iterative method and reproduction. Within SAGA, fitness
described for simultaneously aligning a set of depends on the score measured by the objective
sequences. Various schemes have been published function (the better the score, the fitter the
[50,64], which all involve the same chain of proc- multiple alignment). Over a series of cycles
esses: an alignment is randomly modified, its known as generations, alignments will die or
score assessed, it is kept or discarded according to survive, depending on their fitness. They can
an acceptance function that gets more stringent also improve and reproduce through some sto-
while the iteration number increases (by analogy chastic modifications known as mutations and
with a decreasing temperature during crystalliza- crossovers. Mutations randomly insert or shift
tion), the process goes on until a finishing crite- gaps while crossovers combine the content of
ria such as convergence is met. In practice, two alignments ( Figure 2). Overall, 20 operators
despite being intellectually very attractive, SA is co-exist in SAGA and compete for usage. The
too slow for making ab initio alignment and it program does not guarantee optimality but has
can only be used as an alignment improver. GAs been shown to equal or outperform MSA from
constitute an interesting alternative to SA as a mathematical point of view on 13 test sets
shown in SAGA [51], a GA dedicated to MSA. (using exactly the same OF in both programs).
Like SA, SAGA is an optimization black box in The complete disconnection between the oper-
which any OF invented can be tested. The prin- ators and the original OF made it possible to
ciple of SAGA is very straightforward and fol- seamlessly modify the original OF in order to
lows closely the ‘simple GA’ [65]: randomly test SAGA with a new OF named COFFEE
(Consistency Objective Function For align-
http://www.ashley.com 7
REVIEW
randomization makes the algorithm more robust may be some correlation between this measure
and improves its accuracy. Few of these early iter- and the true accuracy of the alignment.
ative methods have been properly bench- Regardless of the potential performances of
marked, making it hard to estimate their true these two methods (neither have been properly
biological significance. bench marked), some emphasis should be given
The most sophisticated DP-based iterative to the novel concepts they incorporate:
algorithm available was recently described by
• the first one is the use of local information in
Gotoh [47]. It is a double nested iterative strategy
IterAlign, in order to decrease sensitivity to
with randomization that optimizes the weighted
the gap penalty parameterization
sums-of-pairs with affine gap penalties (Figure 3).
• the second key concept is consistency
The originality of this algorithm is that the
weights and the alignment are simultaneously Sequences are preprocessed so that the regions
optimised. The inner iteration optimizes the consistently conserved across the family see their
weighted sums of pairs while the outer iteration signal enhanced and become more likely to drive
optimizes the weights that are calculated on a the alignment. This search for consistency has
phylogenetic tree estimated from the current been one of the strongest trend in recent devel-
alignment [25]. The algorithm terminates when opments of MSA. It is also central to the non-
the weights have converged. Prrp was the first iterative methods.
multiple alignment program to be extensively
benchmarked, using JOY, a database of struc- Consistency-based algorithm
tural alignments. The results were confirmed on The first consistency-based MSA method was
BAliBASE [31,34]. Prrp significantly out-performs described by Kececioglu in the 80s [71]. Given a
most of the traditional progressive methods as set of sequences, the optimal MSA is defined as
well as some of the most recent iterative strate- the one that agrees the most with all the possible
gies (Table 2). optimal pair-wise alignments. Computing that
Two other iterative alignment methods were alignment is an NP complete problem that can
recently described: Praline [48] and IterAlign [70]. only be solved for a small number of related
These two methods share very similar protocols. sequences, using an MSA-like algorithm. None-
They both start with a preprocessing of the theless, there are at least three good reasons that
sequences to align. In IterAlign, sequences are make consistency-based OFs very attractive:
‘ameliorated’(sic), this means that each sequence
• firstly, they do not depend on a specific substi-
is locally compared to others and that every seg-
tution matrix but rather on any method or
ment that shows high similarity with other pro-
collection of methods able to align two
teins is replaced by a consensus. One round of
sequences at a time
‘amelioration’ constitutes one iteration. Other
• secondly, the consistency-based scheme is
iterations are run on the new set of ‘ameliorated’
position dependant, given the collection of
sequences, until the collection of consensus con-
pair-wise alignments. This means that the
verges. Consistent blocks are then extracted from
score associated with the alignment of two res-
the consensus collection and these blocks are
idues depends on their indexes (position
chained in order to produce the final alignment.
within the protein sequence) rather than their
Praline uses a very similar protocol: sequences
individual nature
are replaced with a complete profile made from a
• the third reason is more general and has to do
multiple alignment that only includes their clos-
with consistency. Experience shows that given
est relatives. That profile step is iterated until the
a set of independent observations, the most
collection of profiles converges. This collection
consistent are often closer to the truth
of profiles is conceptually similar to the ‘amelio-
rated’ set of sequences used by IterAlign. The This principle generally holds well in biology
multiple alignment is then assembled by using a and can be loosely connected to the observation
straightforward progressive algorithm where that, given a series of measurements, noise
sequences are replaced with profiles. One of the spreads while signal accumulates.
most interesting consequences of the protocol Although the first consistency-based OF was
used in Praline is the possibility of measuring the described in 1983, it took several more years to
consistency between the final alignment and the develop heuristic algorithms able to deal with
collection of profiles used for its assembly. There optimization and it is only recently that a GA,
(SAGA [51]) was used to show the biological
Initial alignment
No
Outer iteration
Realign two sub-groups
Inner iteration
This figure shows the layout of Prrp, a double-nested strategy for optimizing multiple alignments. When the
inner iteration has converged, new sequence weights are estimated. The convergence of these weights is the
criteria for the outer iteration to stop.
Table 2. Some elements of validation on BAliBASE. sequence comparison, database search, experi-
mental knowledge etc.). Although SAGA-COF-
Method Ref1 Ref2 Ref3 Ref4 Ref5 Total FEE yielded interesting results, the GA was too
DiAlign 71.0 25.2 35.1 74.7 80.4 57.3 slow for everyday use. This prompted the devel-
ClustalW 78.5 32.2 42.5 65.7 74.3 58.7 opment of a new heuristic algorithm to optimize
the COFFEE function in a time efficient man-
Prrp 78.6 32.5 50.2 51.1 82.7 59.0
ner: T-Coffee (Figure 4). In T-Coffee, the COF-
T-Coffee 80.7 37.3 52.9 83.2 88.7 68.7 FEE library is turned into a so-called ‘extended
Each method in the Method column was used to align the 141 test-sets contained in library’, a position-specific substitution matrix
BAliBASE. The alignments were then compared with the reference BAliBASE where the score associated with each pair of resi-
alignment using aln_compare [34]. Ref1–5 indicates the five BAliBASE categories.
dues depends on the compatibility of that pair
Results obtained in each category were averaged. All the observed differences are
statistically significant, as assessed by the Wilcoxon rank-based test [34,47]. Ref1
with the rest of the library. T-Coffee uses a pro-
contains a homogenous set of sequences, ref2 contains a homogenous group of cedure reminiscent of Vingron’s Dot matrix mul-
sequences and an outlayer, ref3 contains two distantly related groups of sequences. tiplication [72] and Morgenstern overlapping
Ref4 contains sequences that require long internal gaps to be properly aligned and weights [73]. The multiple alignment is assem-
ref5 contains sequences that require long-terminal gaps to be properly aligned. Total
bled using a progressive alignment algorithm
is the average of ref1–5.
similar to the one used in ClustalW:
advantages of such a function, COFFEE [66], • pair-wise distances are computed
which emulates the maximum weight trace • a neighbour joining tree is estimated [3]
problem. In SAGA-COFFEE, the collection of • the sequences are aligned one by one follow-
weighted pair-wise alignments is named a library ing the topology of the tree
and SAGA is used to compute the alignment
that has the highest level of consistency with the The main difference between T-Coffee and
library. In practice, the library may contain more ClustalW is that in T-Coffee, the extended
than one alignment for each pair of sequences, library replaces a substitution matrix. Another
the information it contains may be redundant, important characteristic of T-Coffee is that its
conflicting and may originate from sources as primary library is made of a mixture of global
various as one wishes (structure analysis, alignments (produced with ClustalW) and local
http://www.ashley.com 9
REVIEW
alignments (produced with Lalign [74]). The ‘relevant’ information and in fact there is so
bench-marking carried out on BAliBASE shows much of it that by choosing the data, one can
that this combination of local and global infor- suit the needs of almost any method (progres-
mation makes the T-Coffee implementation able sive, iterative etc.). Ironically, one could be
to outperform Prrp, ClustalW and DiAlign on tempted to say that data has improved faster
the five categories of test-sets contained in this than mutiple aligment methods.
reference database [34]. These results were As a consequence, the real challenge is not so
obtained without tuning, since T-Coffee does much the multiple alignment itself but rather the
not have any parameters of its own. Due to the choice of a subset of sequences that will yield the
library extension, T-Coffee does more than sim- most biologically correct and informative align-
ply compute a consensus alignment. Nonethe- ment, given one method or another. There are
less, given a collection of multiple alignments, it two good reasons for not using all the available
can be interesting to combine them into a single sequences:
consensus multiple alignment. This is what the
• Alignments with a large number of sequences
ComAlign program does [75] by combining sev-
are slow to compute and hard to analyze.
eral multiple alignments into a single, often
Whenever possible, an alignment should fit
improved, multiple alignment.
on a single sheet of A4 paper.
Although the details differ, T-Coffee bears
• Limitations of existing programs. Although
some similarity to DiAlign [73], another consist-
they all use weighting schemes meant to mini-
ency-based algorithm that attempts to use local
mize the effect of similar or highly correlated
information in order to guide a global multiple
sequences, none of these schemes are entirely
alignment. DiAlign starts with an identification
satisfactory, and over-represented sub-groups
of highly homologous segment-pairs. The
always end up dominating the alignment or
weight of each of these pairs is defined by a P-
the profile.
value comparable to the P-values used in BLAST.
Each of these segment-pairs receives another This can prevent the proper alignment of less
score proportional to its compatibility with the well represented sequence sub-groups that may
complete set of segment-pairs. This score is be just as important. Careful user’s trimming is
named an overlapping weight and segment-pairs still the best available way around that effect.
weighted this way are very reminiscent of the Unfortunately, the increased sensitivity of data-
extended library. The multiple alignment is then base search tools coupled with the increase in
progressively assembled by adding the pairs of database size has rendered this process very tedi-
segments according to their weight. Assembly is ous.
made in a sequence independent order, as The second major change that has occurred
opposed to the ClustalW-style progressive align- over the last years is the increasing number of
ment strategy. Non-compatible segment-pairs available 3D structures. Although the proportion
are discarded, hence the importance of the order of protein sequences with a known 3D structure
induced by the weights. According to the is getting smaller and smaller, the situation is
authors, DiAlign is especially good at properly very different from a protein family perspective
aligning sequences where local homology is the and the proportion of protein families where at
driving signal. This has been confirmed by BAli- least one member has a known 3D structure
BASE benchmarking [31,34]. Overall, DiAlign is increases regularly. This means that in most
not as accurate as ClustalW or Prrp but it does cases, multiple alignment modelling could bene-
very well in categories 4 and 5 of BAliBASE, fit from the incorporation of 3D structural infor-
which require very long insertions to be properly mation, in order to enhance very remote
aligned. Over the past few years, the DiAlign homologies, or to guide the choice of local pen-
algorithm has been modified on numerous occa- alties [77]. Very few of the packages avaliable are
sions for improved efficiency [76]. able to mix structure and sequences within a
multiple alignment. While ClustalW is able to
Conclusion and expert opinion use SwissProt secondary structure information
Ten years ago, when schemes such as MSA were for gap penalty estimation, a proper tool is still
developed, there was very little data available and lacking for the simultaneous alignment of
the main problem was to use every bit of availa- sequences and structures. Two of the methods
ble information properly. Today the situation has introduced here are good candidates for such a
dramatically changed. We are overwhelmed by combination. The consistency-based algorithms
Users library
A A A
B B B
A A A
C C Primary library C
B B
C C
Extended library
Progressive alignment
This figure indicates the layout of T-Coffee. Local and global pairwise alignments are first computed and then combined into a primary
library that is extended in order to be used for computing the multiple sequence alignment in a progressive manner.
have the advantage of having few requirements sampler [19], Mocca [82] or Repro [83]. Unfortu-
on the origin of their libraries. For instance, nately, none of these tools are well integrated
DALI, the database of structural multiple align- within a global multiple alignment procedure.
ments [78] relies on T-Coffee to assemble the col- The Gibbs sampler and Mocca have the advan-
lection of pair-wise alignments produced by the tage of providing the user with some estimation
DALI algorithm into a multiple alignment. The of the biological relevance of their output.
double dynamic programming algorithm intro- The fourth point that needs to be raised here
duced by Taylor [79] is also a good candidate for is computation. While elegant solutions have
that purpose. While it has been shown that this been found to parallelize database searches, the
algorithm is suitable for structure-to-structure parallelization of a MSA algorithm remains a dif-
alignments [80], recent results indicate that it ficult task. The operations involved in the imple-
could also be used in the context of MSA and mentation of these algorithms require complex
possibly as a means to mix sequences and struc- schemes of memory sharing that are not suitable
tures [81]. to Linux-farms and other clusters. When dealing
The third major obstacle on the road that with large sets of data of long sequences, super-
leads toward an informative multiple alignment computers are still required for multiple align-
is the processing of repeats. Repeated sequences ment programs.
(in tandem or not) are renowned for confusing The last important point is the estimation of
all the existing MSA methods. When dealing local accuracy. The common property of all the
with sequences that contain such repeats, the methods introduced here is that no one in par-
only solution is to pre-process the sequences, ticular is the best. They may all be out-per-
extract the repeats and only align homologous formed by the others on one protein family or
regions. This extraction can be made using any another. For that reason, we feel that it is more
local multiple alignment tool such as the Gibbs important to be able to assess the exact level of
http://www.ashley.com 11
REVIEW
9. Haussler D, Krogh A, Mian IS, Sjölander K: pitfalls of forcing multiple alignments. The 35. Baldi P, Chauvin Y, Hunkapiller T, McClure
Protein modeling using hidden markov New Biologist 3(12), 1148-1154 (1991). MA: Hidden Markov models of biological
models: analysis of globins. In: Proceedings 23. Karlin S, Bucher P, Brendel V, Altschul SF: primary sequence information. Proc. Nat.
for the 26th Hawaii International Conference Statistical methods and insight for protein Acad. Sci. 91, 1059-1063 (1994).
on Systems Sciences. Wailea HI U.S.A.: Los and DNA sequences. Annu. Rev.Biophys. 36. Karplus K, Hu B: Evaluation of protein
Alamitos CA: IEEE Computer Society Press Biophys. Chem. 20, 175-203 (1991). multiple alignments by SAM-T99 using the
(1993). 24. Altschul SF, Lipman DJ:Trees stars and BAliBASE multiple alignment test set.
10. Luthy R, Xenarios I, Bucher P: Improving multiple biological sequence alignment. Bioinformatics 17(8), 713-20 (2001).
the sensitivity of the sequence profile SIAM J. Appl. Math. 49, 197-209 (1989). 37. Hertz GZ, Stormo GD: Identifying DNA
method. Protein Sci. 3(1), 139-146 (1994). 25. Altschul SF, Carroll RJ, Lipman DJ: and protein patterns with statistically
11. Bateman A, Birney E, Durbin R, Eddy SR, Weights for data related by a tree. J. Mol. significant alignments of multiple
Howe KL, Sonnhammer E: The Pfam Biol. 207, 647-653 (1989). sequences. Bioinformatics 15(7-8), 563-77
protein families database. Nucleic Acids Res. 26. Dayhoff MO, Schwarz RM, Orcutt BC: A (1999).
28(1), 263-266 (2000). model of evolutionary change in proteins. 38. Wang L, Jiang T: On the complexity of
12. Altschul SF, Madden TL, Schaffer AA et al.: Detecting distant relationships: computer multiple sequence alignment. J. Comput.
Gapped BLAST and PSI-BLAST: a new methods and results. In: Atlas of Protein Biol. 1(4), 337-348 (1994).
generation of protein database search Sequence and Structure Dayhoff MO (Eds.), 39. Stoye J, Moulton V, Dress AW: DCA: an
programs. Nucleic Acids Res. 25(17), 2289- xNational Biomedical Research Foundation: efficient implementation of the divide-and-
3402 (1997). Washington D.C. USA 353-358 (1979). conquer approach to simultaneous multiple
13. Garnier J, Gibrat J-F, Robson B: GOR 27. Vingron M, Waterman MS: Sequence sequence alignment. Comput. Appl. Biosci.
method for predicting protein secondary alignment and penalty choice. J. Mol. Biol. 13(6), 625-6 (1997).
structure from amino acid sequence. Meth. 235, 1-12 (1994). 40. Higgins DG, Sharp PM: Clustal: a package
Enzymol. 266, 540-553 (1996). 28. Lipman DJ, Altschul SF, Kececioglu JD: A for performing multiple sequence alignment
14. Jones DT: Protein secondary structure tool for multiple sequence alignment. Proc. on a microcomputer. Gene 73, 237-244
prediction based on position-specific scoring Natl. Acad. Sci. USA 86, 4412-4415 (1989). (1988).
matrices. J. Mol. Biol. 292(2), 195-202 29. Thompson J, Higgins D, Gibson T: 41. Corpet F: Multiple sequence alignment with
(1999). CLUSTAL W: Improving the sensitivity of hierarchical clustering. Nucleic Acids Res. 16,
15. Rost B: Review: protein secondary structure progressive multiple sequence alignment 10881-10890 (1988).
prediction continues to rise. J. Struct. Biol. through sequence weighting position- 42. Hogeweg P, Hesper B: The alignment of sets
134(2-3), 204-18 (2001). specific gap penalties and weight matrix of sequences and the construction of
16. Goebel U, Sander C, Schneider R, Valencia choice. Nucleic Acids Res. 22, 4673-4690 phylogenetic trees. An integrated method. J.
A: Correlated mutations and residue (1994). Mol. Evol. 20, 175-186 (1984).
contacts in proteins. Proteins: Structure •• The most widely used method for making • The first description of the progressive
Function and Genetics 18(4), 309-317 multiple sequence alignments. algorithm.
(1994). 30. Gotoh O: Further improvement in methods 43. Feng D-F, Doolittle RF: Progressive
17. Gutell RR, Weiser B, Woese CR, Noller HF: of group-to-group sequence alignment with sequence alignment as a prerequisite to
Comparative anatomy of 16S-like ribosomal generalized profile operations. Comput. correct phylogenetic trees. J. Mol. Evol. 25,
RNA. Prog. Nucleic Acid Res. Mol. Biol. 32, Appl. Biosci. 10(4), 379-87 (1994). 351-360 (1987).
155-216 (1985). •• The first attempt to systematically assess 44. Taylor WR: A flexible method to align large
18. Morgenstern B, Dress A, Wener T: Multiple the accuracy of a MSA method by numbers of biological sequences. J. Mol.
DNA and protein sequence based on comparison wit reference structural Evol. 28, 161-169 (1988).
segment-to-segment comparison. Proc. Natl. alignment. Also the most complex dynamic 45. Needleman SB, Wunsch CD: A general
Acad. Sci. USA 93, 12098-12103 (1996). programming based iterative method. method applicable to the search for
•• The first method described that does not 31. Thompson JD, Plewniak F, Poch O: A similarities in the amino acid sequence of
require arbitrary gap penalties. comprehensive comparison of multiple two proteins. J. Mol. Biol. 48, 443-453
19. Lawrence CE, Altschul SF, Boguski MS, Liu sequence alignment programs. Nucleic Acids (1970).
JS, Neuwald AF, Wootton JC: Detecting Res. 27(13), 2682-2690 (1999). 46. Barton GJ, Sternberg MJE: A strategy for
subtle sequence signals: a Gibbs sampling 32. Gonnet GH, Korostensky C, Benner S: the rapid multiple alignment of protein
strategy for multiple alignment. Science 262, Evaluation measures of multiple sequence sequences: confidence levels from tertiary
208-214 (1993). alignments. J. Comput. Biol. 7(1-2), 261-76 structure comparisons. J. Mol. Biol. 198,
20. Depiereux E, Baudoux G, Briffeuil P et al.: (2000). 327-337 (1987).
Match-Box_server: a multiple sequence 33. Morgenstern B: DIALIGN 2: improvement 47. Gotoh O: Significant Improvement in
alignment tool placing emphasis on of the segment-to-segment approach to Accuracy of Multiple Protein Sequence
reliability. Comput. Appl. Biosci. 13(3), 249- multiple sequence alignment [In Process Alignments by Iterative Refinements as
56 (1997). Citation]. Bioinformatics 15(3), 211-8 Assessed by Reference to Structural
21. Schuler GD, Altschul SF, Lipman DJ: A (1999). Alignments. J. Mol. Biol. 264(4), 823-838
workbench for multiple alignment 34. Notredame C, Higgins DG, Heringa J: T- (1996).
construction and analysis. Proteins 9(3), Coffee: A novel algorithm for multiple 48. Heringa J: Two strategies for sequence
180-90 (1991). sequence alignment. J. Mol. Biol. 302, 205- comparison: profile-preprocessed and
22. Henikoff S: Playing with blocks: some 217 (2000). secondary structure-induced multiple
http://www.ashley.com 13
REVIEW