Praline 3
Abstract
Profile ALIgNmEnt (PRALINE) is a versatile multiple sequence alignment toolkit. In its main alignment
protocol, PRALINE follows the global progressive alignment algorithm. It provides various alignment
optimization strategies to address the different situations that call for protein multiple sequence alignment:
global profile preprocessing, homology-extended alignment, secondary structure-guided alignment, and
transmembrane-aware alignment. Several combinations of these strategies are also supported.
PRALINE is accessible via the online server http://www.ibi.vu.nl/programs/PRALINEwww/. The
server offers extensive visualization options that aid the interpretation of the generated alignments,
which can be written out in PDF format for publication purposes. PRALINE also allows the sequences in
the alignment to be represented in a dendrogram to show their mutual relationships according to the
alignment. The chapter ends with a discussion of various issues occurring in multiple sequence alignment.
Key words Multiple sequence alignment, Progressive alignment, Sequence preprocessing, Homology-
extended MSA, Secondary structure-guided MSA, Transmembrane-aware protein alignment
1 Introduction
David J. Russell (ed.), Multiple Sequence Alignment Methods, Methods in Molecular Biology, vol. 1079,
DOI 10.1007/978-1-62703-646-7_16, © Springer Science+Business Media, LLC 2014
246 Punto Bawono and Jaap Heringa
the alignment and further optimize its quality. Here the rationale is
to integrate predicted structural information into the alignment,
following the principle that protein structural aspects tend to be
more conserved than the associated sequences during evolution.
PRALINE incorporates secondary structure and/or TM information
by using specific residue exchange matrices during alignment.
PRALINE is available as an online server (URL:
http://www.ibi.vu.nl/programs/PRALINEwww/), which is also equipped
with a SOAP service, allowing the users easy access to the Web
service from within their own programs or scripts.
2 Method
2.1 The “Core” MSA Protocol in PRALINE
PRALINE employs a profile-based progressive alignment strategy.
As stated above, after initial all-against-all pairwise alignment, the
highest scoring sequence pair is joined into the first sequence block.
Then, this sequence block is aligned with all the remaining single
sequences, after which the highest scoring pair is selected. Note
that at this stage, the highest scoring alignment can be between the
sequence block and a single sequence, whereas at later stages two
sequence blocks may also be aligned with each other. Alignment
proceeds until all sequences have been aligned in a single MSA.
As a consequence, PRALINE does not use a precomputed guide tree,
but calculates the guide tree on the fly by
utilizing the information afforded by pre-aligned blocks at each
stage, such that the tree reflecting the progressive alignment steps
becomes available at the end. Since successive profile scores during
the PRALINE progressive protocol descend uniformly, they can
be used to construct a dendrogram reflecting the alignment order.
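
The greedy joining order described above can be illustrated with a minimal Python sketch. It is not PRALINE's implementation: the toy `identity` score on equal-length strings and the averaging in `block_score` are assumptions that stand in for real pairwise and profile alignment scores, and all function names are hypothetical. The sketch only shows how repeatedly joining the highest scoring pair of blocks yields a merge order, and hence a dendrogram, without a precomputed guide tree.

```python
from itertools import combinations


def identity(a: str, b: str) -> float:
    """Toy stand-in for an alignment score: fraction of identical positions
    between two equal-length strings (assumption, not PRALINE's score)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def block_score(block_a, block_b, seqs):
    """Average toy score over all cross-block sequence pairs (assumption;
    PRALINE scores profiles rather than averaged sequence pairs)."""
    pairs = [(i, j) for i in block_a for j in block_b]
    return sum(identity(seqs[i], seqs[j]) for i, j in pairs) / len(pairs)


def progressive_order(seqs):
    """Join the highest scoring pair of blocks until a single block remains.
    Returns the merges as (block_a, block_b, score) in joining order."""
    blocks = [(name,) for name in seqs]  # every sequence starts as its own block
    merges = []
    while len(blocks) > 1:
        a, b = max(combinations(blocks, 2),
                   key=lambda pair: block_score(pair[0], pair[1], seqs))
        merges.append((a, b, block_score(a, b, seqs)))
        # Replace the two joined blocks by their union.
        blocks = [blk for blk in blocks if blk not in (a, b)] + [a + b]
    return merges


# Hypothetical toy sequences of equal length.
seqs = {
    "s1": "ACDEFGHIKL",
    "s2": "ACDEFGHIKV",
    "s3": "ACDQFGHLKV",
    "s4": "MCDQWGYLKV",
}
merges = progressive_order(seqs)
for a, b, score in merges:
    print(a, "+", b, f"score={score:.2f}")
```

The recorded merge list is the on-the-fly tree: each entry is an internal node whose height can be derived from the score at which its two blocks were joined.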
Alignment in PRALINE is carried out using the dynamic programming
technique [7]. The following simple profile-scoring
scheme is used to score a pair of profile positions (columns) x and y:
\[
\mathrm{Score}(x, y) = \sum_{i}^{20} \sum_{j}^{20} \alpha_i \beta_j \log \frac{P_{ij}}{P_i P_j}, \tag{1}
\]
where α_i and β_j are the frequencies with which amino acids i and j
appear in columns x and y, respectively, and the log-odds term
log(P_ij / (P_i P_j)) is the exchange value M(i, j) for amino acids i and j
according to substitution matrix M
(e.g., BLOSUM62 [12] or PAM250 [13]).
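
As a worked illustration of Eq. 1, the following Python sketch computes the score of two profile columns as the frequency-weighted sum of exchange values M(i, j). The three-letter alphabet, the matrix values, and the function names are illustrative assumptions, not PRALINE code; a real run would use a full 20 x 20 matrix. Ignoring gap characters when computing column frequencies is also an assumption of this sketch.

```python
from collections import Counter


def column_frequencies(column: str) -> dict:
    """Relative amino acid frequencies in one alignment column.
    Gap characters ('-') are ignored (assumption); the column must
    contain at least one residue."""
    residues = [c for c in column if c != "-"]
    counts = Counter(residues)
    return {aa: n / len(residues) for aa, n in counts.items()}


def profile_score(alpha: dict, beta: dict, M: dict) -> float:
    """Eq. 1: sum over residue pairs (i, j) of alpha_i * beta_j * M(i, j)."""
    return sum(a * b * M[i][j] for i, a in alpha.items() for j, b in beta.items())


# Toy symmetric exchange matrix over three residues (illustrative values only).
M = {
    "A": {"A": 4, "L": -1, "K": -1},
    "L": {"A": -1, "L": 4, "K": -2},
    "K": {"A": -1, "L": -2, "K": 5},
}

x = column_frequencies("AL")    # A: 0.5, L: 0.5
y = column_frequencies("LLLK")  # L: 0.75, K: 0.25
print(profile_score(x, y, M))
```

Residues absent from a column contribute nothing, so the double sum only needs to run over the residues that actually occur in the two columns.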
PRALINE adopts a semi-global alignment strategy, which
means that it aligns sequences over their whole length, but without
penalizing the so-called end gaps, i.e., gaps occurring N- or
C-terminally to any of the sequences. A global alignment strategy is
known to be optimal for sequences of high-to-medium
similarity. Since interesting biological alignments can have
sequences that diverged considerably beyond the level that can
Fig. 1 Schematic overview of the profile preprocessing (a) and the pre-profile alignment (b) routines.
For details, see text. Adapted from ref. 8
Fig. 3 Schematic overview of the secondary structure-guided alignment strategy in PRALINE (Pirovano,
Simossis, and Heringa, unpublished). For details, see text
2.4 Secondary Structure-Guided Alignment
It is well known that the secondary structure elements of proteins
are much more conserved than their amino acid sequence during
evolution [16, 19–21]. Therefore, secondary structure information
can be used to guide the alignment process, particularly in the case
of distantly related proteins [7, 8, 18, 22–27].
The secondary structure-guided alignment strategy in PRALINE
works by combining secondary structure prediction with a
secondary structure-based scoring scheme (Fig. 3). When using
predicted secondary structure, however, the gain in information
might be overshadowed by prediction error. Fortunately, earlier
tests with the PRALINE secondary structure-guided strategy
showed that the inclusion of secondary structure information
improves alignment whenever a prediction accuracy of 65 % or more
is achieved (Simossis and Heringa, unpublished), and this
is easily attained by modern prediction methods.
PRALINE starts the strategy by predicting the secondary structure
elements of each sequence using a secondary structure prediction
tool. PRALINE provides the user with the choice of four
different secondary structure predictors: PSIPRED [28], SSPRO
4.0 [29], PORTER [30], and YASPIN [31]. Each of these predictors
has its own strengths and weaknesses, so the choice of predictor is
left to the user’s discretion. The secondary structure
prediction methods perform a PSI-BLAST search for each input
sequence and then perform the secondary structure prediction
using the position-specific scoring matrix (PSSM) produced by
PSI-BLAST, thereby making use of the amino acid conservation as
observed in the putative homologous sequences. If an input
1 PRALINE finds the PDB identifier of a protein by extracting it from the FASTA definition line of that protein. For
example, these definition lines are accepted: “>102L_A”, “>102L|A”, and “>102LA”. For any other definition
line, the PDB identifier is not extracted. No description may follow the sequence identifier. Thus “>pdb|102L|A”,
“>gi|157829524|pdb|102L|A”, and also “>102L_A ” (note the trailing space) are skipped.
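
The rule in this footnote can be checked before submitting sequences. The Python sketch below is a guess at the accepted pattern, inferred only from the examples given here, and is not PRALINE's actual parser: it assumes a PDB identifier is one digit followed by three alphanumeric characters, that the chain is a single alphanumeric character, and that the optional separator is "_" or "|". The function name is hypothetical.

```python
import re

# Assumed pattern: ">" + 4-character PDB ID + optional "_" or "|" + 1-character chain.
PDB_HEADER = re.compile(r">([0-9][A-Za-z0-9]{3})[_|]?([A-Za-z0-9])")


def extract_pdb_id(definition_line: str):
    """Return (pdb_id, chain) if the whole definition line matches the assumed
    pattern, otherwise None. The line is expected without its newline; any
    other trailing character, including a space, makes the match fail."""
    m = PDB_HEADER.fullmatch(definition_line)
    if m is None:
        return None
    return m.group(1).upper(), m.group(2)


for line in [">102L_A", ">102L|A", ">102LA",
             ">pdb|102L|A", ">gi|157829524|pdb|102L|A", ">102L_A "]:
    print(repr(line), "->", extract_pdb_id(line))
```

Using `fullmatch` rather than a prefix match is what rejects the header with a trailing space, mirroring the behavior described above.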
PRALINE: A Versatile Multiple Sequence Alignment Toolkit 253
Fig. 4 Schematic overview of the TM-aware strategy in PRALINE. For details, see text. Adapted from ref. 37
2.6 The PRALINE Online Server
The PRALINE server is accessible via the Web site of the IBIVU
center at VU University Amsterdam (URL: http://www.ibi.vu.nl/programs/PRALINEwww/).
The server is intended to assist both
specialist and nonspecialist users. It provides the user with extensive
online documentation for each of the different parameters PRALINE
may be run with, and also provides a “sample output” page
which contains examples of the possible outputs of the PRALINE
server using the various alignment strategies described above.
PRALINE accepts sequences in FASTA [45] format as input. For each
alignment job, the maximum number of sequences that can be
color scheme table (Fig. 7). Finally, PRALINE includes the option
to generate a tree based upon the MSA. However, the user should
note that trees generated by PRALINE are not phylogenetic trees,
but simply show the relationships between the sequences as determined
by the alignment scores (Fig. 8).
The following output (Figs. 6, 8, and 9) is taken from an
alignment of 14 proteins belonging to the MscL family of large-
conductance mechanosensitive channels compiled together in the
BAliBASE 3.0 benchmarking database [47]. The alignment was
performed using the homology-extended strategy with both
integrated transmembrane and secondary structure information
from the predictions of PHOBIUS and PSIPRED, respectively.
The alignment shown in Fig. 9 is colored using the “Residue
Type” coloring scheme. The alignment shows conserved elements
as well as regions with extensive gaps. The associated tree (Fig. 8)
clearly shows that the 1msla sequence (bottom sequence in the
alignment) is an outlier, missing elements at both the N- and
C-termini.
2.7 Practical Issues
1. Aligning distantly related protein sequences. Although state-of-the-art
alignment methods are able to make very accurate
MSAs, inaccurate MSAs can arise due to divergent evolution.
It has been shown that the accuracy of alignment methods
decreases dramatically when the sequence identity between
the aligned sequences is lower than 30 % [16]. Given this
limitation, it is advisable to compile a number of MSAs using
different amino acid substitution matrices (e.g., PAM and
BLOSUM matrices). It is helpful to know that higher PAM
numbers and lower BLOSUM numbers (e.g., PAM250 or
BLOSUM45) correspond to exchange matrices that are suited for
the alignment of more divergent sequences,
whereas matrices with lower PAM and higher BLOSUM numbers
are more suitable for more closely related protein
sequences. It is also important to try different gap penalties
when aligning distant protein sequences. Gap penalties play an
important role in the dynamic programming algorithm; therefore
they can have considerable influence on the alignment
quality. The higher the gap penalties, the stricter the insertion
of gaps into the alignment and consequently the fewer gaps
inserted. Gap regions in an MSA often correspond to loop
regions in the associated tertiary structure, which are more
likely to be altered by divergent evolution. Therefore, it can
be useful to lower the gap penalty values when aligning divergent
proteins, although care should be taken not to deviate too
much from the recommended settings. Excessively high gap penalty
values will enforce a gapless alignment, whereas very low gap penalties
will lead to alignments with very many gaps, allowing
(near) identical amino acids to be matched. In both cases
the resulting alignment will be biologically inaccurate.
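
The effect of the gap penalty on the number of inserted gaps can be reproduced with a small pairwise example. The Python sketch below is a plain global alignment by dynamic programming with a linear gap penalty. It is not PRALINE's scoring scheme: the match and mismatch scores, the linear (rather than open/extension) penalty, the penalized end gaps, and the toy sequences are all assumptions chosen to keep the example short.

```python
def global_align(a: str, b: str, gap_penalty: int, match: int = 2, mismatch: int = -1):
    """Global pairwise alignment with a linear gap penalty (illustrative only).
    Returns the two aligned strings, with '-' marking gaps."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = -i * gap_penalty
    for j in range(1, m + 1):
        score[0][j] = -j * gap_penalty
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] - gap_penalty,
                              score[i][j - 1] - gap_penalty)
    # Trace back one optimal path from the bottom-right corner.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(a[i - 1]); out_b.append(b[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] - gap_penalty:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def gap_count(a: str, b: str, gap_penalty: int) -> int:
    """Total number of gap characters in the optimal alignment."""
    aligned_a, aligned_b = global_align(a, b, gap_penalty)
    return (aligned_a + aligned_b).count("-")


# Hypothetical toy pair: a shifted copy with one substitution (E vs Q).
seq_a, seq_b = "WACDEFGHIK", "ACDQFGHIKW"
for penalty in (0, 2, 20):
    print(penalty, global_align(seq_a, seq_b, penalty), gap_count(seq_a, seq_b, penalty))
```

For this toy pair the gap count falls from 4 to 2 to 0 as the penalty rises from 0 to 2 to 20. With no penalty even the single substitution is replaced by gaps, while the excessive penalty forces a gapless alignment that misaligns every position.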
Fig. 9 MSA of 14 proteins belonging to the MscL family of large-conductance mechanosensitive channels
References
34. Lüthy R, McLachlan AD, Eisenberg D (1991) Secondary structure-based profiles: use of structure-conserving scoring tables in searching protein sequence databases for structural similarities. Proteins 10:229–239
35. Jones DT, Taylor WR, Thornton JM (1994) A mutation data matrix for transmembrane proteins. FEBS Lett 339:269–275
36. Shafrir Y, Guy HR (2004) STAM: simple transmembrane alignment method. Bioinformatics 20:758–769
37. Pirovano W, Feenstra KA, Heringa J (2008) PRALINETM: a strategy for improved multiple alignment of transmembrane proteins. Bioinformatics 24:492–497
38. Käll L, Krogh A, Sonnhammer ELL (2004) A combined transmembrane topology and signal peptide prediction method. J Mol Biol 338:1027–1036
39. Krogh A, Larsson B, von Heijne G et al (2001) Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol 305:567–580
40. Tusnády GE, Simon I (2001) The HMMTOP transmembrane topology prediction server. Bioinformatics 17:849–850
41. Ng PC, Henikoff JG, Henikoff S (2000) PHAT: a transmembrane-specific substitution matrix. Bioinformatics 16:760–766
42. Hirosawa M, Totoki Y, Hoshida M et al (1995) Comprehensive study on iterative algorithms of multiple sequence alignment. Comput Appl Biosci 11:13–18
43. Edgar RC (2004) MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 5:113
44. Edgar RC (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32:1792–1797
45. Pearson WR (2000) Flexible sequence similarity searching with the FASTA3 program package. Methods Mol Biol 132:185–219
46. Gonnet GH, Cohen MA, Benner SA (1992) Exhaustive matching of the entire protein sequence database. Science 256:1443–1445
47. Thompson JD, Koehl P, Ripp R et al (2005) BAliBASE 3.0: latest developments of the multiple sequence alignment benchmark. Proteins 61:127–136
48. Sammeth M, Heringa J (2006) Global multiple-sequence alignment with repeats. Proteins 64:263–274