Multiple Sequence Alignment (MSA)
Plan
Introduction to sequence alignments
Multiple alignment construction
Traditional approaches
Alignment parameters
Alternative approaches
Sequence A
Sequence B
Global alignment
The sequences are aligned over their whole length
GGCTGACCACCTT
|||||||
GATCACTTCCATG
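As a rough illustration of what "aligned over their whole length" means computationally, the sketch below scores a global alignment by dynamic programming (Needleman-Wunsch). The match, mismatch and gap values are arbitrary assumptions chosen for the example, not parameters taken from the slides, and a linear gap cost is used for brevity.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of a and b (Needleman-Wunsch, linear gap cost).

    The scoring values are illustrative assumptions, not values from the slides.
    """
    # prev[j] holds the best score for aligning the first i-1 residues of a
    # with the first j residues of b.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    # The two DNA sequences shown on the slide.
    print(global_alignment_score("GGCTGACCACCTT", "GATCACTTCCATG"))
```

Every residue of both sequences contributes to the score, which is what distinguishes this from a local alignment.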
Local alignment
Conservation profile
Secondary structure
MACS
Schematic overview of complete alignment
e.g. domain organisation (Interpro)
Key:
SH3
PI-PLC-X
CH
SH2
PI-PLC-Y
rhoGEF
PH
C2
DAG_PE-bind
MSA Construction
Alignment parameters
Residue similarity matrices
Gap penalties
Alternative approaches
Iterative alignment methods
Combinatorial algorithms
PipeAlign : a protein family analysis tool
Traditional
Approaches
Problem
The mathematically optimal alignment is not necessarily the biologically optimal alignment
CPU time and memory requirements are prohibitive for practical purposes (the required time is
proportional to N^k for k sequences of length N): limited to <10 sequences
possible
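To make the N^k growth concrete, the short sketch below counts the cells of a full k-dimensional dynamic programming table. The helper name and the example length of 100 residues are assumptions for illustration only.

```python
def dp_cells(n, k):
    """Cells in a full k-dimensional DP table for k sequences of length n.

    Each axis has n + 1 positions (one extra for the empty prefix).
    """
    return (n + 1) ** k


if __name__ == "__main__":
    for k in (2, 3, 5, 10):
        print(f"k={k:2d}: {dp_cells(100, k):.3e} cells")
```

Each additional sequence multiplies the table size by roughly N, which is why the exact simultaneous approach is restricted to a handful of sequences.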
Principle:
Progressively align the sequences (or groups of sequences) in pairs
Problem:
Which sequences should be aligned first, and in which order?
=> align the closest sequences first
Ex: pairwise alignment of 2 globin sequences
Hbb_horse
Hbb_human
Hba_human
Hba_human
Hbb_horse
1 VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLST ...
| |. |||.|| ||| ||| :|||||||||||||||||||||:||||||
2 VQLSGEEKAAVLALWDKVNEEEVGGEALGRLLVVYPWTQRFFDSFGDLSN ...
1 LTPEEKSAVTALWGKV..NVDEVGGEALGRLLVVYPWTQRFFESFGDLST ...
|.| :|. | | |||| . | | ||| |: . :| |. :| | |||
3 LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF.DLS. ...
3 LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF.DLSH ...
|| :| | | | ||
| | ||| |: . :| |. :| | |||.
2 LSGEEKAAVLALWDKVNEE..EVGGEALGRLLVVYPWTQRFFDSFGDLSN ...
=> faster
=> better
In ClustalX:
distance between 2 sequences = 1 - (nb of identical residues / nb of compared residues)
Ex: Hbb_human vs Hbb_horse = 83% identity = 17% distance
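The distance defined above can be sketched as a small function over two rows of a pairwise alignment. It is assumed here that "compared residues" means columns where neither sequence has a gap; the function name and the gap characters are illustrative choices.

```python
def identity_distance(row_a, row_b, gap_chars="-."):
    """1 - (identical residues / compared residues) for two aligned rows.

    Assumption: columns with a gap in either row are not compared.
    """
    compared = 0
    identical = 0
    for x, y in zip(row_a, row_b):
        if x in gap_chars or y in gap_chars:
            continue
        compared += 1
        if x == y:
            identical += 1
    if compared == 0:
        raise ValueError("no compared positions")
    return 1.0 - identical / compared


if __name__ == "__main__":
    # First 50 aligned residues of the two beta-globins shown on the slide;
    # the value differs from the whole-sequence figure of 17%.
    a = "VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLST"
    b = "VQLSGEEKAAVLALWDKVNEEEVGGEALGRLLVVYPWTQRFFDSFGDLSN"
    print(round(identity_distance(a, b), 2))
```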
Pairwise distance matrix:

                    1     2     3     4     5     6
Hbb_human     1     -
Hbb_horse     2    .17    -
Hba_human     3    .59   .60    -
Hba_horse     4    .59   .59   .13    -
Myg_phyca     5    .77   .77   .75   .75    -
Glb5_petma    6    .81   .82   .73   .74   .80    -
Lgb2_lupla    7    .87   .86   .86   .88   .93   .90
Guide tree
[Figure: guide tree for Hbb_human, Hbb_horse, Hba_human, Hba_horse, Myg_phyca, Glb5_petma and Lgb2_lupla, followed by the resulting multiple alignment with regions labelled H1 to H7]
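The step from distance matrix to guide tree can be sketched as follows. ClustalW/X builds its guide tree by neighbor-joining; UPGMA is swapped in here only because it is shorter to write. The block carries its own copy of the seven-globin distances from the slide, and the function name and the Newick-like output are illustrative assumptions.

```python
from itertools import combinations

NAMES = ["Hbb_human", "Hbb_horse", "Hba_human", "Hba_horse",
         "Myg_phyca", "Glb5_petma", "Lgb2_lupla"]

# Lower triangle of the distance matrix shown on the slide.
LOWER = [
    [],
    [0.17],
    [0.59, 0.60],
    [0.59, 0.59, 0.13],
    [0.77, 0.77, 0.75, 0.75],
    [0.81, 0.82, 0.73, 0.74, 0.80],
    [0.87, 0.86, 0.86, 0.88, 0.93, 0.90],
]


def upgma(names, lower):
    """Return (tree, join_order); tree is a nested parenthesised string."""
    clusters = {i: (name, 1) for i, name in enumerate(names)}  # id -> (label, size)
    dist = {}
    for i, row in enumerate(lower):
        for j, d in enumerate(row):
            dist[frozenset((i, j))] = d
    next_id = len(names)
    join_order = []
    while len(clusters) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(sorted(clusters), 2),
                   key=lambda pair: dist[frozenset(pair)])
        (label_a, size_a), (label_b, size_b) = clusters.pop(a), clusters.pop(b)
        # Distance to the merged cluster is the size-weighted average.
        for c in clusters:
            dist[frozenset((next_id, c))] = (
                size_a * dist[frozenset((a, c))]
                + size_b * dist[frozenset((b, c))]
            ) / (size_a + size_b)
        clusters[next_id] = (f"({label_a},{label_b})", size_a + size_b)
        join_order.append((label_a, label_b))
        next_id += 1
    tree = next(iter(clusters.values()))[0]
    return tree, join_order


if __name__ == "__main__":
    tree, order = upgma(NAMES, LOWER)
    print(tree)
    for step, pair in enumerate(order, 1):
        print(step, pair)
```

The join order gives the order in which a progressive method would align sequences and groups: the closest pairs first, the most divergent sequence last.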
Local
SBpima
SB
multal
NJ
UPGMA
ML
clustalx
multalign
pileup
MLpima
SB - Sequential Branching
UPGMA - Unweighted Pair Grouping Method
ML - Maximum Likelihood
NJ - Neighbor-Joining
Alignment
Parameters
PAM 250
Gap penalties
A gap penalty is a cost for introducing gaps into the alignment,
corresponding to insertions or deletions in the sequences
SFGDLSNPGAVMG
HF-DLS-----HG
Fixed penalty :
P=x+yL
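A minimal sketch of the penalty formula, assuming the usual reading of P = x + yL in which x is a gap opening cost, y a gap extension cost and L the gap length (the slide does not define the symbols). The default values of x and y are placeholders, not recommended settings.

```python
import re


def gap_penalties(aligned_row, x=10.0, y=0.2, gap_char="-"):
    """Return (total penalty, gap lengths) for one aligned row.

    Each run of L consecutive gap characters costs P = x + y * L.
    x and y are illustrative assumptions.
    """
    pattern = re.escape(gap_char) + "+"
    lengths = [len(m.group()) for m in re.finditer(pattern, aligned_row)]
    total = sum(x + y * length for length in lengths)
    return total, lengths


if __name__ == "__main__":
    # Gapped row from the slide: one gap of length 1 and one of length 5.
    total, lengths = gap_penalties("HF-DLS-----HG")
    print(lengths, total)
```

With this form, one long gap is cheaper than several short gaps covering the same number of positions, since the opening cost x is paid once per gap.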
Alternative
Approaches
Local
SBpima
SB
multal
NJ
ML
UPGMA
MLpima
multalign
pileup
clustalx
prrp
dialign
Genetic Algo.
HMM
saga
hmmt
Iterative
reference set
description
several sub-families
long insertions
repeats
transmembrane regions
circular permutations
Ecole Phylogénomique, Carry le Rouet 2006
ClustalW: 2 mins 41 secs
PRRP: 3 hours 40 mins
Dialign: 3 hours 48 mins
Ballast Anchors
Query Sequence
Query Sequence
Anchors
Database Hits
Domain A
Domain B
Domain C
DbClustal Alignment
Combinatorial algorithms
DbClustal (http://bips.u-strasbg.fr/PipeAlign/jump_to.cgi?DbClustal+noid):
designed to align the sequences detected by a database search. Locally conserved motifs are detected
using the Ballast program (Plewniak et al. 1999) and are used as anchor points in the global multiple
alignment.
MAFFT (http://timpani.genome.ad.jp/%7Emafft/server):
detects locally conserved segments using a Fast Fourier Transform, then uses a restricted global DP
and a progressive algorithm.
Multiple Alignment Quality
Truncated Alignments

Program        Ref1 V1   Ref1 V2    Ref2      Ref3        Ref4         Ref5         Time
               (<20%)    (20-40%)   orphans   subgroups   extensions   insertions   (sec)
ClustalW1.83   0.42      0.78       0.42      0.52        0.41         0.38         902
Dialign2.2.1   0.31      0.71       0.37      0.39        0.45         0.43         5993
Mafft5.32      0.44      0.78       0.49      0.53        0.47         0.48         96
Maffti5.32     0.54      0.83       0.56      0.60        0.49         0.57         327
Muscle3.51     0.52      0.82       0.50      0.58        0.46         0.54         523
Muscle_fast    0.40      0.77       0.43      0.44        0.35         0.49         34
Muscle_med     0.45      0.80       0.50      0.59        0.44         0.51         219
Tcoffee2.66    0.47      0.84       0.50      0.64        0.54         0.58         216133
Probcons1.1    0.63      0.87       0.60      0.65        0.54         0.63         19035
Multiple Alignment Quality
Comparison: truncated versus full-length (FL) sequences

Program        Ref1 V1 (<20%)   Ref1 V2 (20-40%)   Ref2: orphans    Ref3: subgroups   Time (sec) for all refs
               Trunc.   FL      Trunc.   FL        Trunc.   FL      Trunc.   FL       Trunc.    FL
ClustalW1.83   0.42     0.24    0.78     0.72      0.42     0.20    0.52     0.27     902       2227
Dialign2.2.1   0.31     0.26    0.71     0.70      0.37     0.29    0.39     0.31     5993      12595
Mafft5.32      0.44     0.25    0.78     0.75      0.49     0.35    0.53     0.38     96        312
Maffti5.32     0.54     0.35    0.83     0.80      0.56     0.40    0.60     0.50     327       1409
Muscle3.51     0.52     0.34    0.82     0.79      0.50     0.36    0.58     0.39     523       3608
Muscle_fast    0.40     0.28    0.77     0.72      0.43     0.29    0.44     0.33     34        132
Muscle_med     0.45     0.29    0.80     0.74      0.50     0.34    0.59     0.38     219       1601
Tcoffee2.66    0.47     0.35    0.84     0.82      0.50     0.40    0.64     0.49     216133    341578
Probcons1.1    0.63     0.43    0.87     0.86      0.60     0.41    0.65     0.54     19035     58488
1. Loss of accuracy is more important in the twilight zone (Ref1 V1, orphans, and subgroups)
2. Probcons still scores best in all tests
3. MAFFT still scores better than MUSCLE in all tests
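The first observation can be checked with a few lines of arithmetic on the Ref1 scores from the comparison table. The sketch assumes that the first value of each pair is the truncated-alignment score and the second the full-length (FL) score, which is consistent with the truncated-alignment table; the names below are illustrative.

```python
# (Ref1 V1 truncated, Ref1 V1 FL, Ref1 V2 truncated, Ref1 V2 FL), from the table.
SCORES = {
    "ClustalW1.83": (0.42, 0.24, 0.78, 0.72),
    "Dialign2.2.1": (0.31, 0.26, 0.71, 0.70),
    "Mafft5.32":    (0.44, 0.25, 0.78, 0.75),
    "Maffti5.32":   (0.54, 0.35, 0.83, 0.80),
    "Muscle3.51":   (0.52, 0.34, 0.82, 0.79),
    "Muscle_fast":  (0.40, 0.28, 0.77, 0.72),
    "Muscle_med":   (0.45, 0.29, 0.80, 0.74),
    "Tcoffee2.66":  (0.47, 0.35, 0.84, 0.82),
    "Probcons1.1":  (0.63, 0.43, 0.87, 0.86),
}


def mean_drop(scores, truncated_index, full_length_index):
    """Average loss of score when moving from truncated to full-length input."""
    drops = [row[truncated_index] - row[full_length_index]
             for row in scores.values()]
    return sum(drops) / len(drops)


if __name__ == "__main__":
    print("Ref1 V1 (<20%):  ", round(mean_drop(SCORES, 0, 1), 3))
    print("Ref1 V2 (20-40%):", round(mean_drop(SCORES, 2, 3), 3))
```

The average drop is several times larger for the low-identity V1 set than for V2, in line with the first point above.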
Relative Entropy:
uses a normalized log-likelihood ratio to measure the degree of conservation for each column
(identical residues only).
MD (column scores used in ClustalX):
uses a comparison matrix (Gonnet) to take into account similar residues.
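One plausible form of a per-column relative entropy score is sketched below. The slides do not give the exact formula, so the uniform background frequencies and the normalization to the range 0 to 1 are assumptions made for illustration, not the definition used by any particular program.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_relative_entropy(column, background=None):
    """Normalized relative entropy of one alignment column.

    Gaps and unknown characters are ignored. Background frequencies default
    to uniform (an assumption); the score is divided by its maximum so that
    a fully conserved column scores 1.0.
    """
    residues = [c for c in column.upper() if c in AMINO_ACIDS]
    if not residues:
        return 0.0
    if background is None:
        background = {a: 1 / 20 for a in AMINO_ACIDS}
    counts = Counter(residues)
    n = len(residues)
    entropy = sum((c / n) * math.log((c / n) / background[a])
                  for a, c in counts.items())
    max_entropy = math.log(1 / min(background.values()))
    return entropy / max_entropy


if __name__ == "__main__":
    for col in ("WWWW", "LLIV", "ACDE"):
        print(col, round(column_relative_entropy(col), 3))
```

Because only residue identities are counted, a column of similar residues such as L/I/V scores low here; a matrix-based score like MD would rate it higher.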
FASTA format
>O88763 Phosphatidylinositol 3-kinase.
------MGEAEKFHYIYSCDLDINVQLKIGSLEGKREQKSYKAVLEDPMLKFSGLYQETC
SDLYVTCQVFAEGKPLALPVRTSYKPFSTRWN-WNEWLKLPVKYPDLPRNAQVALTIWD-
-----VYGPG-RAVPVGGTTVSLFGKYGMFRQGMHDLKVWPNVEADGSEPTRTPGRTSST
LSEDQMSRLAKLTKAHRQGHMVKVLDRLTFREIEMINESEKRSS--NFMYLMVEFRCVKC
DDKE-YGIVYYE----
>Q9W1M7 CG5373-PA (GH13170p).
-----MDQPDDHFRYIHSSSLHERVQIKVGTLEGKKRQPDYEKLLEDPILRFSGLYSEEH
PSFQVRLQVFNQGRPYCLPVTSSYKAFGKRWS-WNEWVTLPLQFSDLPRSAMLVLTILD-
-----CSGAG-QTTVIGGTSISMFGKDGMFRQGMYDLRVWLGVEGDGNFPSRTPGK-GKE
SSKSQMQRLGKLAKKHRNGQVQKVLDRLTFREIEVINEREKRMS--DYMFLMIEFPAIVV
DDMYNYAVVYFE----
>Q7PMF0 ENSANGP00000002906 (Fragment).
------------LRYIGSSSLLQKISIKIGTLEGENVGYSYEKLIEQPLLKFSGMYTEKT
PPLKVKLQIFDNGEPVGLPVCTSHKHFTTRWS-WNEWVTLPLRFTDISRTAVLGLTIYD-
-----CAGGREQLTVVGGTSISFFSTNGLFRQGLYDLKVWPQMEPDGACNSITPGK-AIT
TGVHQMQRLSKLAKKHRNGQMEKILDRLTFRELEVINEMEKRNS--QFLYLMVEFPQVYI
HEKL-YSVIHLE----
>Q9TXI7 Related to yeast vacuolar protein sorting factor protein 34
MIPGMRATPTESFSFVYSCDLQTNVQVKVAEFEG-----IFRDVLN-PVRRLNQLFAEIT
VYCNNQQIGYPVCTSFHTPPDSSQLARQKLIQKWNEWLTLPIRYSDLSRDAFLHITIWEH
EDDEIVNNSTFSRRLVAQSKLSMFSKRGILKSGVIDVQMNVSTTPDPFVKQPETWKYSDA
WG-DEIDLLFKQVTRQSRGLVEDVLDPFASRRIEMIRAKYKYSSPDRHVFLVLEMAAIRL
GPTF-YKVVYYEDETK
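Reading an aligned FASTA file such as the one above takes only a few lines. This is a minimal sketch: the function name is an assumption, only the first word of each header is kept as the identifier, and the demo uses a tiny made-up alignment rather than the sequences from the slide.

```python
def parse_fasta(text):
    """Parse FASTA text into {identifier: sequence}.

    The identifier is the first word after '>'; sequence lines are joined.
    """
    chunks = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            chunks[name] = []
        elif name is not None:
            chunks[name].append(line)
    return {key: "".join(parts) for key, parts in chunks.items()}


if __name__ == "__main__":
    demo = """>seq1 toy example
MKV-LA
GT
>seq2 toy example
MKVALA
G-
"""
    records = parse_fasta(demo)
    # In an aligned FASTA file every row must have the same length.
    lengths = {len(seq) for seq in records.values()}
    print(records, "aligned" if len(lengths) == 1 else "not aligned")
```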
MSF format

toto.msf  MSF: 256  Type: P  Check: 3415 ..

 Name: O88763   Len:   Check: 9443   Weight: 1.00
 Name: Q9W1M7   Len:   Check: 1161   Weight: 1.00
 Name: Q7PMF0   Len:   Check: 8095   Weight: 1.00
 Name: Q9TXI7   Len:   Check: 4716   Weight: 1.00

//

         1                                                   50
O88763   ......MGEA EKFHYIYSCD LDINVQLKIG SLEGKREQKS YKAVLEDPML
Q9W1M7   .....MDQPD DHFRYIHSSS LHERVQIKVG TLEGKKRQPD YEKLLEDPIL
Q7PMF0   .......... ..LRYIGSSS LLQKISIKIG TLEGENVGYS YEKLIEQPLL
Q9TXI7   MIPGMRATPT ESFSFVYSCD LQTNVQVKVA EFEG.....I FRDVLN.PVR

         51                                                 100
O88763   KFSGLYQETC SDLYVTCQVF AEGKPLALPV RTSYKPFSTR WN.WNEWLKL
Q9W1M7   RFSGLYSEEH PSFQVRLQVF NQGRPYCLPV TSSYKAFGKR WS.WNEWVTL
Q7PMF0   KFSGMYTEKT PPLKVKLQIF DNGEPVGLPV CTSHKHFTTR WS.WNEWVTL
Q9TXI7   RLNQLFAEIT VYCNNQQIGY PVCTSFHTPP DSSQLARQKL IQKWNEWLTL

         101                                                150
O88763   PVKYPDLPRN AQVALTIWD. .....VYGPG .RAVPVGGTT VSLFGKYGMF
Q9W1M7   PLQFSDLPRS AMLVLTILD. .....CSGAG .QTTVIGGTS ISMFGKDGMF
Q7PMF0   PLRFTDISRT AVLGLTIYD. .....CAGGR EQLTVVGGTS ISFFSTNGLF
Q9TXI7   PIRYSDLSRD AFLHITIWEH EDDEIVNNST FSRRLVAQSK LSMFSKRGIL

         151                                                200
O88763   RQGMHDLKVW PNVEADGSEP TRTPGRTSST LSEDQMSRLA KLTKAHRQGH
Q9W1M7   RQGMYDLRVW LGVEGDGNFP SRTPGK.GKE SSKSQMQRLG KLAKKHRNGQ
Q7PMF0   RQGLYDLKVW PQMEPDGACN SITPGK.AIT TGVHQMQRLS KLAKKHRNGQ
Q9TXI7   KSGVIDVQMN VSTTPDPFVK QPETWKYSDA WG.DEIDLLF KQVTRQSRGL

         201                                                250
O88763   MVKVLDRLTF REIEMINESE KRSS..NFMY LMVEFRCVKC DDKE.YGIVY
Q9W1M7   VQKVLDRLTF REIEVINERE KRMS..DYMF LMIEFPAIVV DDMYNYAVVY
Q7PMF0   MEKILDRLTF RELEVINEME KRNS..QFLY LMVEFPQVYI HEKL.YSVIH
Q9TXI7   VEDVLDPFAS RRIEMIRAKY KYSSPDRHVF LVLEMAAIRL GPTF.YKVVY

         251
O88763   YE....
Q9W1M7   FE....
Q7PMF0   LE....
Q9TXI7   YEDETK
With an editor
PipeAlign
INPUT: single sequence OR set of unaligned sequences
single
sequence
BlastP search
Identify motifs
conservation profile
list of homologs
single
sequence
MACS of user-specified
homologs
multiple
alignment
Refine alignment
Correct alignment errors
refined MACS
multiple
alignment
multiple
alignment
Validate alignment
validated MACS
multiple
alignment
Cluster sequences
MSA Main
Applications
Phylogenetic studies
MACS
Interaction networks
[Figure: schematic domain organisation compared between Euc, Arc and Bac sequences; labelled elements: DBD, Motif I, Motif II, N-terminal extension, EMAP domain, S4 domain, C-terminal extension; scale bar 10 aa]
Phylogenetic studies
Multiple alignments = basis for calculation of the levels of similarity between sequences
Phylogenetic studies
[Figure: phylogenetic tree computed from the whole alignment; groups labelled Eucarya, Archaea, and Bacteria + Mitochondria]
Phylogenetic studies
[Figure: phylogenetic tree for the same species; labels: N terminus, global gap removal; groups labelled Bacteria, Archaea, Mito. and Eucarya; scale bar 0.1]
Schematic alignment of Aspartyl-tRNA synthetases
[Figure: schematic alignment of Euc, Arc and Eub sequences, alignment positions 180 to 930; annotated regions: Motif I, flipping loop, Motif II, catalytic core I, insertion domain, Motif III, catalytic core II]
Definition of blocs
N-terminal region analysis :
Reference position
Proposed N-terminus : potential start codon
closest to the reference position
MXXXXXXXXXXXXXXX
MXXXXXXXXXXXXXXXXX
MXXXXXXMXXXMXXXXXXXXXXXXXXXXXX
MXXXXXXXXXXXXXXXXXXXXXX
MXXXXXXXXXXXXXXXXXXXXX
extension
Reference position
3000 proteins from B. subtilis with wrong, randomly generated N-termini: 82% predicted
For the 3828 proteins from the Vibrio cholerae proteome:
817 specific / 1722 valid start codons / 236 wrong (by 1 up to 56 aa)
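The selection rule described above (take the potential start closest to the reference position) can be sketched on a protein sequence as follows. Working on methionines in the translated sequence rather than on codons in the DNA, the 0-based coordinates, and the tie-breaking toward the upstream candidate are all assumptions made for the illustration; how the reference position is derived from the alignment is not shown.

```python
def propose_n_terminus(protein, reference_pos):
    """Index of the methionine closest to reference_pos, or None if there is none.

    Positions are 0-based. Ties are broken in favour of the upstream
    candidate (an arbitrary choice for this sketch).
    """
    candidates = [i for i, aa in enumerate(protein) if aa == "M"]
    if not candidates:
        return None
    return min(candidates, key=lambda i: (abs(i - reference_pos), i))


if __name__ == "__main__":
    # Pattern from the slide with three candidate starts; the reference
    # position 10 is a hypothetical value.
    seq = "MXXXXXXMXXXMXXXXXXXXXXXXXXXXXX"
    print(propose_n_terminus(seq, 10))
```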
Daily Blastp
Characterization
of the
specificity of
the homologous
sequences
Clustering
-> Filter
Databases : DBWatcher [Plewniak, IGBMC]
- Proteins
- Structures
Automatic Daily Update
Filter
Integration of the
sub-family members
[Figure: Gene Ontology hierarchy, levels 0 to 6, with counts per level; terms shown: Gene_Ontology, biological_process, physiological processes, metabolism, cellular process, cell communication; label: minV(Horiz) = 21 * F]
795 proteins with GO terms (increase of 47%)
3085 GO terms (increase of 92%)
Transmembrane
region
Additional
domain
Phosphorylation
site
1st
FAMILY
Bacteria
Bacteria
2
FAMILY
nd
Archaea
Eucarya
Differential
conservation between
the two families
NLS
Universal
conservation
Intra-group
conservation
Functional
genomics
Evolutionary
studies
Structure
modeling
Mutagenesis
experiments
Drug design
MACSIMS
MAO: Multiple Alignment Ontology
http://www-igbmc.u-strasbg.fr/BioInfo/MAO/mao.html
MAO consortium:
- RNA analysis (Steve HOLBROOK, Berkeley)
- MACS algorithm (Kazutake KATOH, Kyoto)
- Protein 3D analysis (Patrice KOEHL, Davis)
- Protein 3D structure (Dino MORAS, Strasbourg)
- 3D RNA structure (Eric WESTHOF, Strasbourg)
Also available from the OBO web site: http://obo.sourceforge.net
MACSIMS
Multiple Alignment of Complete Sequences Information Management System
Thompson et al BMC Bioinformatics 2006
Structural and functional information is mined automatically from the public databases
Homologous regions are identified in the MACS
http://bips.u-strasbg.fr/MACSIMS/
MACSIMS
Schematic overview of complete alignment
e.g. domain organisation (Interpro)
Key:
SH3
PI-PLC-X
CH
SH2
PI-PLC-Y
rhoGEF
PH
C2
DAG_PE-bind
MACSIMS visualisation
MACSIMS
BAliBASE reference 3: aldehyde dehydrogenase-like
*
*GSVPTG
* **
E
* * *
C
GSTKVG
GETRTG
GSTEVG
GSVSAG
GSRDVG
GSRDVG
GSRDVG
GSTNVF
GSTNVF
GSTAVF
Uniprot annotation
NAD binding
Active site
Active site
Summary
Choice of multiple alignment method
traditional progressive method (e.g. clustalw / clustalx)
combined local and global method (e.g. mafft, muscle, dbclustal)
knowledge-based method (e.g. PipeAlign)
Web Server versus Local Installation ?
WARNING: Automatic alignment methods can make mistakes.
Verify alignment quality by automatic methods (e.g. norMD) and visual inspection !
alternative algorithms
Iterative Refinement
PRRP (Gotoh, 1993) refines an initial progressive multiple alignment
by iteratively dividing the alignment into 2 profiles and realigning
them.
[Flowchart: initial alignment -> divide sequences into 2 groups -> profile 1 + profile 2 -> pairwise profile alignment -> refined alignment -> converged? (no: repeat)]
alternative algorithms
Genetic Algorithms
SAGA (Notredame, Higgins, 1996) evolves a population of alignments in a quasi-evolutionary
manner, iteratively improving the fitness of the population
population n
select a number of individuals to be parents
modify the parents by shuffling gaps, merging 2 alignments etc.
population n+1
evaluation of the fitness using OF
(sum of pairs or COFFEE)
END
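The fitness evaluation step relies on an objective function (OF). A bare-bones sum-of-pairs score is sketched below as one possible OF; it uses fixed match, mismatch and gap values instead of a substitution matrix and sequence weights, so it is a simplification and not the scoring used by SAGA itself.

```python
from itertools import combinations


def sum_of_pairs(alignment, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of a multiple alignment given as equal-length rows.

    Every pair of rows is scored column by column; gap/gap pairs score 0.
    The scoring values are illustrative assumptions.
    """
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all rows must have the same length")
    score = 0
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            if a == "-" and b == "-":
                continue
            if a == "-" or b == "-":
                score += gap
            elif a == b:
                score += match
            else:
                score += mismatch
    return score


if __name__ == "__main__":
    candidate_1 = ["ACG-T", "ACGGT", "AC--T"]
    candidate_2 = ["ACGT-", "ACGGT", "A-C-T"]
    # A genetic algorithm would keep the fitter of the two candidates.
    print(sum_of_pairs(candidate_1), sum_of_pairs(candidate_2))
```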
alternative algorithms
HMM
Probabilistic model for sequence profiles, visualized as a finite state
machine
For each column of the alignment a match state models the distribution
of residues allowed
Insert and delete states at each column allow for insertion or deletion of
one or more residues
Original profile HMM (Krogh et al, 1994)
[Figure: profile HMM with match states, insert states, and delete, begin and end states, shown for the example alignment below]
AKY-L-D
--WVLED
Multiple Alignment using HMM
generate initial alignment
(Baum-Welch expectation maximization)
HMMER (Eddy, unpublished)
produce a model
SAMT98 (Hughey, 1996)
generate new alignment
(Viterbi algorithm or
posterior decoding)
evaluate alignment
(expectation maximization)
END
alternative algorithms
Segment-to-segment Alignment
Dialign (Morgenstern et al. 1996) compares segments of sequences
instead of single residues
1. construct dot plots of all possible pairs of sequences
Sequence i
Sequence j
2. find a maximal set of consistent diagonals in all the sequences
.......aeyVRALFDFngndeedlpfkKGDILRIrdkpeeq...............WWNAedsegkr.GMIPVPYVek..........
........nlFVALYDFvasgdntlsitKGEKLRVlgynhnge..............WCEAqtkngq..GWVPSNYItpvns.......
ieqvpqqptyVQALFDFdpqedgelgfrRGDFIHVmdnsdpn...............WWKGachgqt..GMFPRNYVtpvnrnv.....
gsmstselkkVVALYDYmpmnandlqlrKGDEYFIleesnlp...............WWRArdkngqe.GYIPSNYVteaeds......
.....tagkiFRAMYDYmaadadevsfkDGDAIINvqaideg...............WMYGtvqrtgrtGMLPANYVeai.........
..gsptfkcaVKALFDYkaqredeltfiKSAIIQNvekqegg...............WWRGdyggkkq.LWFPSNYVeemvnpegihrd
.......gyqYRALYDYkkereedidlhLGDILTVnkgslvalgfsdgqearpeeigWLNGynettgerGDFPGTYVeyigrkkisp..
Local alignment: residues between the diagonals are not aligned
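Step 1 can be illustrated with a toy search for gap-free diagonal segments between two sequences. This sketch reports only runs of identical residues above a minimum length; Dialign itself weights segment pairs by similarity and then selects a consistent set across all sequences, which is not attempted here. The function name and the min_len parameter are assumptions.

```python
def diagonal_segments(seq_i, seq_j, min_len=3):
    """Gap-free runs of identical residues as (start_i, start_j, length).

    Each diagonal of the dot plot is scanned once; runs shorter than
    min_len are discarded.
    """
    segments = []
    n, m = len(seq_i), len(seq_j)
    for offset in range(-(m - 1), n):
        i, j = max(offset, 0), max(-offset, 0)
        run = 0
        while i < n and j < m:
            if seq_i[i] == seq_j[j]:
                run += 1
            else:
                if run >= min_len:
                    segments.append((i - run, j - run, run))
                run = 0
            i += 1
            j += 1
        if run >= min_len:
            segments.append((i - run, j - run, run))
    return segments


if __name__ == "__main__":
    # Two toy sequences sharing the segment "GDIL" at different offsets.
    print(diagonal_segments("AAKGDILRI", "TKGDILTV", min_len=4))
```

Residues that fall outside the reported segments are left unaligned, which corresponds to the lowercase regions in the example above.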