Lec 4 Seq and Seq Align New (Compatibility Mode)
Sequencing & Sequence Alignment
• Understand how DNA sequence data is collected and prepared
• Be aware of the importance of sequence searching and sequence alignment in biology and medicine
• Be familiar with the different algorithms and scoring schemes used in sequence searching and sequence alignment

Similar genes arise by gene duplication
• Copy of a gene inserted next to the original
• Two copies mutate independently
• Each can take on separate functions
• All or part can be transferred from one part of genome to another

What is Sequence Alignment?
• Given two sequences, how to measure their similarity?
• ATAACTTTAATTAA
• ATCC-TTTACTAA-
• ATAACTTTAATTAA
• ATCC-TTTAC-TAA

Tasks of Sequence Alignment
• Pairwise alignment
• Multiple sequence alignment
• Global alignment
• Local alignment
• Approximate alignment algorithms versus optimal/exact alignment algorithms

Pairwise Sequence Alignment
• Pairwise sequence alignment is the primary tool in sequence analysis, used for database search and as a component of other algorithms.
• What? To detect similarity
• Why?
– Infer phylogeny (similarity ~ distance)
– Predict function
– Predict structure
– Predict "signals" (binding sites, splicing signals, etc.)

Example Alignment

Shotgun Sequencing
[Figure labels: Primer, DNA fragment, Amp, pBR322, Tet]

The Finished Product

Genomes to Date
• 8 vertebrates (human, mouse, rat, fugu, zebrafish)
• 3 plants (Arabidopsis, rice, poplar)
• 2 insects (fruit fly, mosquito)
• 2 nematodes (C. elegans, C. briggsae)
• 1 sea squirt
• 4 parasites (Plasmodium, Guillardia)
• 4 fungi (S. cerevisiae, S. pombe)
• 200+ bacteria and archaebacteria
• 2000+ viruses

So what do we do with all this sequence data?

Sequence Alignment
     G   E   N   E   T   I   C   S
G   60  40  30  20  20   0  10   0
E   40  50  30  30  20   0  10   0
N   30  30  40  20  20   0  10   0
E   20  20  20  30  20  10  10   0
S   20  20  20  20  20   0  10  10
I   10  10  10  10  10  20  10   0
S    0   0   0   0   0   0   0  10

Types of Alignments
• Global: sequences aligned from end to end
• Local: alignments may start in the middle of either sequence
• Ungapped: no insertions or deletions are allowed
• Other types: overlap alignments, repeated match alignments
Alignments tell us about... Factoid:

Similarity versus Homology
• Homology cannot be quantified
• If two sequences have a high % identity it is OK to say they are homologous
• It is incorrect to say two sequences have a homology score of Z
• It is incorrect to say two sequences are X% homologous

• Similarity can be quantified
• It is correct to say that two sequences are X% identical
• It is correct to say that two sequences have a similarity score of Z
• It is generally incorrect to say that two sequences are X% similar
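
To make "X% identical" concrete, here is a minimal sketch that counts identical columns in an already-aligned pair. The function name, the "-" gap symbol and the choice of denominator (all alignment columns, gaps included) are assumptions for illustration; other conventions divide by the shorter sequence length or by ungapped columns only.

```python
def percent_identity(row_a: str, row_b: str, gap: str = "-") -> float:
    """Percent of alignment columns holding the same residue in both rows.

    Assumption: the denominator is the total number of alignment columns.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    if not row_a:
        return 0.0
    # A column counts as identical only if both rows hold the same residue
    # (two gap characters facing each other do not count).
    identical = sum(1 for a, b in zip(row_a, row_b) if a == b and a != gap)
    return 100.0 * identical / len(row_a)


if __name__ == "__main__":
    print(round(percent_identity("GAATTCAGTTA", "GGA-TC-G--A"), 1))
```

The same two rows would give a different percentage under a different denominator, which is one reason the identity figure should always be reported together with the alignment length.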
Homologues & All That Sequence Complexity
Rbn PNACYKTTQANKHIIVACEGNPYVPHFDASV
THE STORY OF GENESIS Lsz NRCKGTDVQA WIRGCRL
Context
THIS BOOK ON GENETICS Comparison
is this alignment significant?

Is This Alignment Significant?
Gelsolin 89 L G N E L S Q D E S G A A A I F T V Q L 108
82 L P S A L K S A L S G H L E T V I L G L 101
Annexin
154 L E K D I I S D T S G D F R K L M V A L 173
240 L E - S I K K E V K G D L E N A F L N L 258
314 L Y Y Y I Q Q D T K G D Y Q K A L L Y L 333
Consensus L x P x x x P D x S G x h x x h x V L L

Some Simple Rules
• If two sequences are > 100 residues and > 25% identical, they are likely related
• If two sequences are 15-25% identical they may be related, but more tests are needed
• If two sequences are < 15% identical they are probably not related
• If you need more than 1 gap for every 20 residues the alignment is suspicious
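
The rules of thumb above can be written down as a small decision function. This is only a sketch: the function name and return format are invented here, and the slide gives no rule for alignments above 25% identity that are 100 residues or shorter, so that case is reported as undecided.

```python
def simple_rule(length: int, identity: float, gaps: int = 0) -> tuple[str, bool]:
    """Apply the slide's rules of thumb.

    length   -- number of aligned residues
    identity -- percent identity (0-100)
    gaps     -- number of gaps in the alignment
    Returns (verdict, gaps_suspicious).
    """
    if identity > 25.0:
        # The slide only covers the > 100 residue case; shorter alignments
        # are left undecided here (assumption).
        verdict = "likely related" if length > 100 else "no rule applies"
    elif identity >= 15.0:
        verdict = "may be related, more tests needed"
    else:
        verdict = "probably not related"
    # More than 1 gap per 20 residues makes the alignment suspicious.
    gaps_suspicious = gaps * 20 > length
    return verdict, gaps_suspicious


if __name__ == "__main__":
    print(simple_rule(150, 30.0, gaps=3))
```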

[Figure: sequence identity (%) plotted against number of residues (0 to 400), with the "Twilight Zone" marked]

• Dot Plots
• Dynamic Programming
• Heuristic (Fast) Local Alignment
• Multiple Sequence Alignment
• Contig Assembly

Dot Plots
• "Invented" in 1970 by Gibbs & McIntyre
• Good for quick graphical overview
• Simplest method for sequence comparison
• Inter-sequence comparison
• Intra-sequence comparison
  • Identifies internal repeats
  • Identifies domains or "modules"
[Figure: dot plot grid with "a c g c g" along the top and "a c a c g" down the side]

Dot Plot Algorithm
• Take two sequences (A & B), write sequence A out as a row (length = m) and sequence B as a column (length = n)
• Create a table or "matrix" of "m" columns and "n" rows
• Compare each letter of sequence A with every letter in sequence B. If there's a match mark it with a dot, if not, leave blank
[Figure: dot plot of A C D E F G H G (row) against A C D E F G H G (column)]
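
The dot plot procedure described above fits in a few lines. The sketch below builds the n-row by m-column grid as a list of strings; the function name and the use of "." and " " as cell markers are choices made here for illustration.

```python
def dot_plot(seq_a: str, seq_b: str) -> list[str]:
    """Return one string per letter of seq_b (rows); each string has one
    character per letter of seq_a (columns): '.' for a match, ' ' otherwise."""
    return [
        "".join("." if a == b else " " for a in seq_a)
        for b in seq_b
    ]


if __name__ == "__main__":
    seq = "ACDEFGHG"  # the sequence used on the slide, compared with itself
    print("  " + seq)
    for letter, row in zip(seq, dot_plot(seq, seq)):
        print(letter + " " + row)
```

Comparing a sequence with itself always gives an unbroken main diagonal; the off-diagonal dots (here from the two G residues) are what reveal internal repeats.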

Dot Plots
• Most commercial programs offer pretty good dot plot programs including:
  • GCG/Omiga (Pharmacopeia)
  • PepTool (BioTools Inc.)
  • LaserGene (DNAStar)
• Popular freeware package is Dotter www.cgr.ki.se/cgr/groups/sonnhammer/Dotter.html
• Dotlet http://www.isrec.isb-sib.ch/java/dotlet/Dotlet.html
• JDotter http://athena.bioc.uvic.ca/sars/jdotter/main.php

Dynamic Programming
Match matrix:
     G   E   N   E   T   I   C   S
G   10   0   0   0   0   0   0   0
E    0  10   0  10   0   0   0   0
N    0   0  10   0   0   0   0   0
E    0   0   0  10   0   0   0   0
S    0   0   0   0   0   0   0  10
I    0   0   0   0   0  10   0   0
S    0   0   0   0   0   0   0  10

Summed matrix:
     G   E   N   E   T   I   C   S
G   60  40  30  20  20   0  10   0
E   40  50  30  30  20   0  10   0
N   30  30  40  20  20   0  10   0
E   20  20  20  30  20  10  10   0
S   20  20  20  20  20   0  10  10
I   10  10  10  10  10  20  10   0
S    0   0   0   0   0   0   0  10

G E N E T I C S
| | | | * | |
G E N E S I S

Idea: Display one sequence above another with spaces inserted in both to reveal similarity
A: C A T - T C A - C
   |   |     | |   |
B: C - T C G C A G C

S = CTGTCGCTGCACG
T = TGCCGTG

Global alignment
CTGTCG-CTGCACG
-TGC-CG-TG----

Local alignment
CTGTCGCTGCACG--
-------TGC-CGTG

Dynamic Programming
• Developed by Needleman & Wunsch (1970)
• Refined by Smith & Waterman (1981)
• Performed in 2 stages
  • Prepare a scoring matrix using recursive function
  • Scan matrix diagonally using traceback protocol

Identity Scoring Matrix (Sij)
[Table: lower-triangular 20 x 20 matrix over the amino acids A R N D C Q E G H I L K M F P S T W Y V, with 1 on the diagonal (identical residues) and 0 everywhere else]
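
The Recursive Function
In the form used by the worked example that follows, each cell takes the best of a diagonal step (match or mismatch), a horizontal gap, or a vertical gap:

$$S_{i,j} = \max \begin{cases} S_{i-1,j-1} + s_{i,j} \\ S_{i,j-1} + W \\ S_{i-1,j} + W \end{cases}$$

W = gap penalty
S = alignment score
s = match/mismatch score for the pair of residues at position i,j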

An Example...
G A A T T C A G T T A (sequence #1)
G G A T C G A (sequence #2)

Three steps in dynamic programming
• Initialization
• Matrix fill (scoring)
• Traceback (alignment)

Initialization Step
• Create a matrix with M + 1 columns and N + 1 rows, where M and N correspond to the sizes of the sequences to be aligned
• Since this example assumes there is no gap opening or gap extension penalty, the first row and first column of the matrix can be initially filled with 0.

Matrix Fill Step
• Since the gap penalty (w) is 0, the rest of row 1 and column 1 can be filled in with the value 1.
• Take the example of row 1
– At column 2, the value is the max of 0 (for a mismatch), 0 (for a vertical gap) or 1 (horizontal gap). The rest of row 1 can be filled out similarly until we get to column 8.
– At this point, there is a G in both sequences (light blue). Thus, the value for the cell at row 1 column 8 is the maximum of 1 (for a match), 0 (for a vertical gap) or 1 (horizontal gap). The value will again be 1.
– The rest of row 1 and column 1 can be filled with 1 using the above reasoning.

• Now let's look at column 2.
– The location at row 2 will be assigned the value of the maximum of 1 (mismatch), 1 (horizontal gap) or 1 (vertical gap). So its value is 1.
– At the position column 2 row 3, there is an A in both sequences. Thus, its value will be the maximum of 2 (match), 1 (horizontal gap), 1 (vertical gap), so its value is 2.
• Moving along to position column 2 row 4, its value will be the maximum of 1 (mismatch), 1 (horizontal gap), 2 (vertical gap), so its value is 2. Note that for all of the remaining positions except the last one in column 2, the choices for the value will be the exact same as in row 4 since there are no matches.
• The final row will contain the value 2 since it is the maximum of 2 (match), 1 (horizontal gap) and 2 (vertical gap).

• Using the same techniques as described for column 2, we can fill in column 3.
• After filling in all of the values the score matrix is as follows
[Figure: completed score matrix]
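
The matrix fill walked through above can be reproduced with a short function. This is a sketch under the example's scoring (match = 1, mismatch = 0, gap = 0); the function name and parameter names are chosen here, and sequence #1 is laid out along the columns as in the slides.

```python
def fill_matrix(seq1: str, seq2: str, match: int = 1,
                mismatch: int = 0, gap: int = 0) -> list[list[int]]:
    """Needleman-Wunsch style fill: seq1 runs along columns, seq2 along rows."""
    cols, rows = len(seq1) + 1, len(seq2) + 1
    score = [[0] * cols for _ in range(rows)]
    # Initialization: first row and column accumulate the gap penalty
    # (all zeros when gap == 0, as in the slide example).
    for j in range(1, cols):
        score[0][j] = score[0][j - 1] + gap
    for i in range(1, rows):
        score[i][0] = score[i - 1][0] + gap
    # Matrix fill: best of diagonal (match/mismatch), horizontal gap, vertical gap.
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if seq1[j - 1] == seq2[i - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,
                              score[i][j - 1] + gap,
                              score[i - 1][j] + gap)
    return score


if __name__ == "__main__":
    for row in fill_matrix("GAATTCAGTTA", "GGATCGA"):
        print(" ".join(str(v) for v in row))
```

With these scores the bottom-right cell holds the maximum global alignment score for the two example sequences.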

Traceback Step
• After the matrix fill step, the maximum alignment score for the two test sequences is 6.
• The traceback step determines the actual alignment(s) that result in the maximum score
• The traceback step begins in the M,N position in the matrix (the last column of the last row), i.e. the position that leads to the maximal score. In this case, there is a 6 in that location.

• The alignment as described in the above step adds a gap to sequence #2, so the current alignment is
(Seq #1) TA
          |
(Seq #2) _A
Once again, the direct predecessor produces a gap in sequence #2. After this step, the current alignment is
(Seq #1) TTA
           |
(Seq #2) __A

• Continuing on with the traceback step, we eventually get to a position in column 0 row 0, which tells us that traceback is completed.
• One possible maximum alignment is:
G_AATTCAGTTA
|  | || |  |
GG_A_TC_G__A
• Another traceback path gives an alignment of:
GAATTCAGTTA
| | || |  |
GGA_TC_G__A

• Note:
– There are more alternative solutions, each resulting in a maximal global alignment score of 6.
– Since the number of equally good alignments can grow exponentially, most dynamic programming algorithms will only print out a single solution.
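
A minimal sketch of the traceback described above, again assuming match = 1, mismatch = 0, gap = 0 and using "_" for gaps as the slides do. The function name and the tie-breaking order (diagonal first, then horizontal gap, then vertical gap) are choices made here; a different order would return a different but equally scoring alignment.

```python
def global_align(seq1: str, seq2: str, match: int = 1,
                 mismatch: int = 0, gap: int = 0) -> tuple[int, str, str]:
    """Fill the matrix, then trace back from the bottom-right cell to (0, 0)."""
    n, m = len(seq1), len(seq2)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(1, n + 1):
        S[0][j] = j * gap
    for i in range(1, m + 1):
        S[i][0] = i * gap

    def sub(a: str, b: str) -> int:
        return match if a == b else mismatch

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            S[i][j] = max(S[i - 1][j - 1] + sub(seq1[j - 1], seq2[i - 1]),
                          S[i][j - 1] + gap,
                          S[i - 1][j] + gap)

    top, bottom = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + sub(seq1[j - 1], seq2[i - 1]):
            # Diagonal predecessor: align the two residues.
            top.append(seq1[j - 1]); bottom.append(seq2[i - 1])
            i -= 1; j -= 1
        elif j > 0 and S[i][j] == S[i][j - 1] + gap:
            # Horizontal predecessor: gap in sequence #2.
            top.append(seq1[j - 1]); bottom.append("_")
            j -= 1
        else:
            # Vertical predecessor: gap in sequence #1.
            top.append("_"); bottom.append(seq2[i - 1])
            i -= 1
    return S[m][n], "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    score, row1, row2 = global_align("GAATTCAGTTA", "GGATCGA")
    print(score)
    print(row1)
    print(row2)
```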

Global alignment matrix for C A T T C A C (rows) against C T C G C A G C (columns); λ is the empty prefix
+10 for match, -2 for mismatch, -5 for space

       λ    C    T    C    G    C    A    G    C
λ      0   -5  -10  -15  -20  -25  -30  -35  -40
C     -5   10    5    0   -5  -10  -15  -20  -25
A    -10    5    8    3   -2   -7    0   -5  -10
T    -15    0   15   10    5    0   -5   -2   -7
T    -20   -5   10   13    8    3   -2   -7   -4
C    -25  -10    5   20   15   18   13    8    3
A    -30  -15    0   15   18   13   28   23   18
C    -35  -20   -5   10   13   28   23   26   33

Traceback can yield both optimum alignments
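
As a cross-check on the matrix above, the sketch below fills the same table with +10 / -2 / -5 and counts how many distinct traceback paths reach the optimum. The function name is hypothetical; the counting is done by memoised recursion over the predecessors that explain each cell's value.

```python
from functools import lru_cache


def score_and_count(a: str, b: str, match: int = 10,
                    mismatch: int = -2, space: int = -5) -> tuple[int, int]:
    """Return (optimal global score, number of optimal alignments).

    a runs down the rows, b along the columns, as in the slide's matrix.
    """
    rows, cols = len(a) + 1, len(b) + 1
    S = [[0] * cols for _ in range(rows)]
    for j in range(1, cols):
        S[0][j] = j * space
    for i in range(1, rows):
        S[i][0] = i * space

    def sub(i: int, j: int) -> int:
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            S[i][j] = max(S[i - 1][j - 1] + sub(i, j),
                          S[i - 1][j] + space,
                          S[i][j - 1] + space)

    @lru_cache(maxsize=None)
    def paths(i: int, j: int) -> int:
        # Number of optimal tracebacks from cell (i, j) back to (0, 0).
        if i == 0 and j == 0:
            return 1
        total = 0
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + sub(i, j):
            total += paths(i - 1, j - 1)
        if i > 0 and S[i][j] == S[i - 1][j] + space:
            total += paths(i - 1, j)
        if j > 0 and S[i][j] == S[i][j - 1] + space:
            total += paths(i, j - 1)
        return total

    return S[-1][-1], paths(len(a), len(b))


if __name__ == "__main__":
    print(score_and_count("CATTCAC", "CTCGCAGC"))
```

For these two sequences the count comes out as two, which agrees with the slide's remark that traceback can yield both optimum alignments.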

Scoring systems
• Match scores: s(a,b) assigns a score to each combination of aligned letters. Examples: transition/transversion, PAM, Blosum.
• Gap score: f(g) assigns a score to a gap of length g. Examples: linear, affine.
• Scoring systems are usually additive: the total score is the sum of the substitution scores and all the gap scores.

Match Scores
• Some amino acids are more "substitutable" for each other than others. Serine and threonine are more alike than tryptophan and alanine.
• We can introduce "mismatch costs" for handling different substitutions.
• We don't usually use mismatch costs in aligning nucleotide sequences, since often no substitution is per se better than any other.
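
The additive scheme above (sum of substitution scores plus sum of gap scores) can be sketched as follows. The function and parameter names are invented for illustration, and the affine example's numbers (8 to open, 1 to extend) are arbitrary assumptions, not values from the lecture.

```python
from itertools import groupby
from typing import Callable


def alignment_score(row_a: str, row_b: str,
                    s: Callable[[str, str], int],
                    f: Callable[[int], int],
                    gap: str = "-") -> int:
    """Total score = sum of s(a, b) over residue pairs + sum of f(g) over gaps."""
    total = 0
    # Substitution part: every column where both rows hold a residue.
    for a, b in zip(row_a, row_b):
        if a != gap and b != gap:
            total += s(a, b)
    # Gap part: each maximal run of gap characters in a row is one gap of length g.
    for row in (row_a, row_b):
        for is_gap, run in groupby(row, key=lambda c: c == gap):
            if is_gap:
                total += f(len(list(run)))
    return total


def match_score(a: str, b: str) -> int:
    return 10 if a == b else -2      # +10 match, -2 mismatch


def linear_gap(g: int) -> int:
    return -5 * g                    # -5 per space


def affine_gap(g: int) -> int:
    return -(8 + 1 * (g - 1))        # assumed: 8 to open, 1 per extension


if __name__ == "__main__":
    print(alignment_score("CAT-TCA-C", "C-TCGCAGC", match_score, linear_gap))
    print(alignment_score("CAT-TCA-C", "C-TCGCAGC", match_score, affine_gap))
```

Keeping s and f as separate functions mirrors the slide's split between match scores and gap scores, so either can be swapped without touching the other.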

A Better Matrix - PAM250

     A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
A    2
R   -2   6
N    0   0   2
D    0  -1   2   4
C   -2  -4  -4  -5   4
Q    0   1   1   2  -5   4
E    0  -1   1   3  -5   2   4
G    1  -3   0   1  -3  -1   0   5
H   -1   2   2   1  -3   3   1  -2   6
I   -1  -2  -2  -2  -2  -2  -2  -3  -2   5
L   -2  -3  -3  -4  -6  -2  -3  -4  -2   2   6
K   -1   3   1   0  -5   1   0  -2   0  -2  -3   5
M   -1   0  -2  -3  -5  -1  -2  -3  -2   2   4   0   6
F   -4  -4  -4  -6  -4  -5  -5  -5  -2   1   2  -5   0   9
P    1   0  -1  -1  -3   0  -1  -1   0  -2  -3  -1  -2  -5   6
S    1   0   1   0   0  -1   0   1  -1  -1  -3   0  -2  -3   1   3
T    1  -1   0   0  -2  -1   0   0  -1   0  -2   0  -1  -2   0   1   3
W   -6   2  -4  -7  -8  -5  -7  -7  -3  -5  -2  -3  -4   0  -6  -2  -5  17
Y   -3  -4  -2  -4   0  -4  -4  -5   0  -1  -1  -4  -2   7  -5  -3  -3   0  10
V    0  -2  -2  -2  -2  -2  -2  -1  -2   4   2  -2   2  -1  -1  -1   0  -6  -2   4

Using PAM250...
Sub-matrix used in the example (Gap Penalty = -1):
     A   T   V   D
A    2
T    1   3
V    0   0   4
D    0   0  -2   4

[Figure: step-by-step matrix fill for A A T V D (columns) against A V V D (rows). The first row (A) fills as 2 1 0 -1 -1; the second row (V) is then filled cell by cell: -1 2, then -1 2 1, then -1 2 1 5.]

PAM Matrices
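
Because substitution matrices such as PAM250 are symmetric, they are usually stored as a lower triangle, as printed above. A minimal lookup sketch for the four-residue sub-matrix from the example (A, T, V, D) might look like this; the names `ORDER`, `LOWER` and `sub_score` are illustrative only.

```python
ORDER = "ATVD"
# Lower triangle of the sub-matrix shown on the slide, one list per row.
LOWER = {
    "A": [2],
    "T": [1, 3],
    "V": [0, 0, 4],
    "D": [0, 0, -2, 4],
}


def sub_score(x: str, y: str) -> int:
    """Symmetric lookup: s(x, y) == s(y, x)."""
    i, j = ORDER.index(x), ORDER.index(y)
    if i < j:
        i, j = j, i          # always read from the stored lower triangle
    return LOWER[ORDER[i]][j]


if __name__ == "__main__":
    print(sub_score("V", "D"), sub_score("D", "V"), sub_score("A", "T"))
```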

Fast Local Alignment Methods

Fast Alignment Algorithm
Query: ACDEFGDEF…..
Query words: ACD CDE DEF EFG FGD GDE
Similar words: ACE CDD NEF … … …
               GCE CEE DEY … … …
               GCD DDY … … …
Database: LMRGCDDYGDEY…

[Figure: word-match grid with the query A C D E F G D E F... across the top and the database L M R G C D D Y G down the side]

FASTA
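
The word-lookup idea above can be sketched with a dictionary index of database words. This version finds exact 3-letter word hits only; the "similar words" step shown on the slide (for example ACD also matching GCD) is not implemented, and the function name is an assumption. Only the visible prefixes of the slide's query and database strings are used.

```python
from collections import defaultdict


def word_hits(query: str, database: str, k: int = 3) -> list[tuple[int, int, str]]:
    """Return (query_pos, database_pos, word) for every exact k-letter word hit."""
    # Index every k-letter word of the database by its start position.
    index: dict[str, list[int]] = defaultdict(list)
    for pos in range(len(database) - k + 1):
        index[database[pos:pos + k]].append(pos)
    hits = []
    # Slide a k-letter window along the query and look each word up.
    for q in range(len(query) - k + 1):
        word = query[q:q + k]
        for d in index.get(word, ()):
            hits.append((q, d, word))
    return hits


if __name__ == "__main__":
    print(word_hits("ACDEFGDEF", "LMRGCDDYGDEY"))
```

Each hit is a candidate diagonal (query position minus database position) that a heuristic method would then try to extend, instead of filling the whole dynamic programming matrix.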

FASTA
gi|135775|sp|P08628|THIO_RABIT THIOREDOXIN (104 aa)
initn: 641 init1: 641 opt: 642 Z-score: 806.4 expect() 3.2e-38
Smith-Waterman score: 642; 86.538% identity in 104 aa overlap (2-105:1-104)

               10        20        30        40        50        60        70        80
thiore MVKQIESKTAFQEALDAAGDKLVVVDFSATWCGPCKMINPFFHSLSEKYSNVIFLEVDVDDCQDVASECEVKCTPTFQFF
        :::::::.::::.::.:::::::::::::::::::::.::::.::::..::.:.:::::::.:.:.:::::: ::::::
gi|135  VKQIESKSAFQEVLDSAGDKLVVVDFSATWCGPCKMIKPFFHALSEKFNNVVFIEVDVDDCKDIAAECEVKCMPTFQFF
                10        20        30        40        50        60        70

               90       100
thiore KKGQKVGEFSGANKEKLEATINELV
       ::::::::::::::::::::::::.
gi|135 KKGQKVGEFSGANKEKLEATINELL
      80        90       100

Multiple Sequence Alignment
[Figure: multiple alignment of calcitonins]

Multiple Alignment
• Most commercial vendors offer good multiple alignment programs including:
  • GCG (Accelrys)
  • PepTool/GeneTool (BioTools Inc.)
  • LaserGene (DNAStar)
• Popular web servers include T-COFFEE, MULTALIN and CLUSTALW
• Popular freeware includes PHYLIP & PAUP

Multi-Align Websites
• Match-Box http://www.fundp.ac.be/sciences/biologie/bms/matchbox_submit.shtml
• MUSCA http://cbcsrv.watson.ibm.com/Tmsa.html
• T-Coffee http://www.ebi.ac.uk/Tools/msa/tcoffee/
• MULTALIN http://www.toulouse.inra.fr/multalin.html
• CLUSTALW http://www.ebi.ac.uk/clustalw/

T-Coffee
http://www.ebi.ac.uk/Tools/msa/tcoffee/

Multi-alignment & Contig Assembly
1. Only accept a very high sequence identity
2. Accept unlimited number of "end" gaps
3. Very high cost for opening "internal" gaps
4. A short match with high score/residue is preferred over a long match with low score/residue

Contig Assembly
• User-selected parameters
1. minimum length of overlap
2. percent identity within overlap
• Non-adjustable parameters
1. sequence "quality" factors
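
The two user-selected parameters above (minimum overlap length and percent identity within the overlap) can be illustrated with a simple end-to-start overlap test between two reads. This is a deliberately simplified sketch: it is ungapped, ignores base quality, and is not how any particular assembler such as Phrap works. The function name and default thresholds are assumptions.

```python
from typing import Optional


def best_overlap(left: str, right: str, min_overlap: int = 20,
                 min_identity: float = 95.0) -> Optional[tuple[int, float]]:
    """Find the longest suffix of `left` that overlaps a prefix of `right`.

    Returns (overlap_length, percent_identity) or None if no overlap of at
    least `min_overlap` bases reaches `min_identity`.
    """
    longest = min(len(left), len(right))
    for n in range(longest, min_overlap - 1, -1):
        tail, head = left[-n:], right[:n]
        same = sum(1 for x, y in zip(tail, head) if x == y)
        identity = 100.0 * same / n
        if identity >= min_identity:
            return n, identity
    return None


if __name__ == "__main__":
    read1 = "ATCGATGCGTAGCAG"   # hypothetical reads
    read2 = "GTAGCAGACTACCG"
    print(best_overlap(read1, read2, min_overlap=5, min_identity=100.0))
```

Free "end" gaps correspond here to the unaligned start of the left read and the unaligned end of the right read, which carry no penalty at all.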
Chromatogram Editing Sequence Loading
ATCGATGCGTAGCAGACTACCGTTACGATGCCTT…
Sequence Assembly Programs
Problems for Assembly
http://bio.ifom‐firc.it/ASSEMBLY/assemble.html
Phrap
• Phrap is a program for assembling shotgun DNA
sequence data
• Uses a combination of user‐supplied and
internally computed data quality information to
improve assembly accuracy in the presence of
repeats
• Constructs the contig sequence as a mosaic of
the highest quality read segments rather than a
consensus
• Handles large datasets

Conclusions
• Sequence alignments and database searching are key to all of bioinformatics
• There are four different methods for doing sequence comparisons: 1) Dot Plots; 2) Dynamic Programming; 3) Fast Alignment; and 4) Multiple Alignment
• Understanding the significance of alignments requires an understanding of statistics and distributions