Lec 4 Seq and Seq Align New (Compatibility Mode)

Sequencing & Sequence Alignment

Objectives
• Understand how DNA sequence data is collected and prepared
• Be aware of the importance of sequence searching and sequence alignment in biology and medicine
• Be familiar with the different algorithms and scoring schemes used in sequence searching and sequence alignment

[Figure: scoring matrix comparing GENETICS with GENESIS]
Why compare sequences?

• To find whether two (or more) genes or proteins are evolutionarily related to each other
• To find structurally or functionally similar regions within proteins

Similar genes arise by gene duplication
• Copy of a gene inserted next to the original
• Two copies mutate independently
• Each can take on separate functions
• All or part can be transferred from one part of genome to another

What is Sequence Alignment?
• Given two sequences, how to measure their similarity?
• ATAACTTTAATTAA
• ATCC-TTTACTAA-

• ATAACTTTAATTAA
• ATCC-TTTAC-TAA

What is Sequence Alignment?
• Arranging the primary sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences

[Figure: sequence alignment of instances of the acidic ribosomal protein P0 (L10E) from several organisms]

Tasks of Sequence Alignment
• Pairwise alignment
• Multiple sequence alignment
• Global alignment
• Local alignment
• Approximate alignment algorithms versus optimal/exact alignment algorithms

Pairwise Sequence Alignment
• Pairwise sequence alignment is the primary tool in sequence analysis, used for database search and as a component of other algorithms.
• What? To detect similarity
• Why?
  – Infer phylogeny (similarity ~ distance)
  – Predict function
  – Predict structure
  – Predict "signals" (binding sites, splicing signals, etc.)

Problem Definition
• (Optimal) pairwise alignment consists of considering all possible alignments of two sequences and choosing the optimal one.
• Sub-optimal (heuristic) alignment algorithms are also very important, e.g. BLAST
• We will focus on optimal alignment methods in this class.

Key Issues
The key issues are:
• Types of alignments (local vs. global)
• The scoring system
• The alignment algorithm
• Measuring alignment significance

Example Alignment
True homology: alpha globin and beta globin

HBA_HUMAN  GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKL
           G+ +VK+HGKKV A+++++AH+D++ +++++LS+LH KL
HBB_HUMAN  GNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKL

Spurious "homology": alpha globin and a protein with different structure and function

HBA_HUMAN  GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSD----LHAHKL
           GS+ + G + +D L ++ H+ D+ A +AL D ++AH+
F11G11.2   GSGYLVGDSLTFVDLL--VAQHTADLLAANAALLDEFPQFKAHQE

Shotgun Sequencing
[Figure: isolate chromosome, shear DNA into fragments, clone into sequencing vectors, sequence]

Principles of DNA Sequencing
[Figure: a DNA fragment cloned into a pBR322 vector (Amp, Tet, Ori) is denatured with heat to produce single-stranded DNA (ssDNA), then copied from a primer with Klenow + ddNTP + dNTP + primers]

Shotgun Sequencing
[Figure: sequence, chromatogram, send to computer, assembled sequence]

Shotgun Sequencing
• Very efficient process for small-scale (~10 kb) sequencing (preferred method)
• First applied to whole genome sequencing in 1995 (H. influenzae)
• Now standard for all prokaryotic genome sequencing projects
• Successfully applied to D. melanogaster
• Moderately successful for H. sapiens

The Finished Product
GATTACAGATTACAGATTACAGATTACAGATTACAG
ATTACAGATTACAGATTACAGATTACAGATTACAGA
TTACAGATTACAGATTACAGATTACAGATTACAGAT
TACAGATTAGAGATTACAGATTACAGATTACAGATT
ACAGATTACAGATTACAGATTACAGATTACAGATTA
CAGATTACAGATTACAGATTACAGATTACAGATTAC
AGATTACAGATTACAGATTACAGATTACAGATTACA
GATTACAGATTACAGATTACAGATTACAGATTACAG
ATTACAGATTACAGATTACAGATTACAGATTACAGA
TTACAGATTACAGATTACAGATTACAGATTACAGAT

Sequencing Successes
• T7 bacteriophage: completed in 1983; 39,937 bp, 59 coded proteins
• Escherichia coli: completed in 1998; 4,639,221 bp, 4293 ORFs
• Saccharomyces cerevisiae: completed in 1996; 12,069,252 bp, 5800 genes

Sequencing Successes
• Caenorhabditis elegans: completed in 1998; 95,078,296 bp, 19,099 genes
• Drosophila melanogaster: completed in 2000; 116,117,226 bp, 13,601 genes
• Homo sapiens: completed in 2003; 3,201,762,515 bp, 31,780 genes

Genomes to Date
• 8 vertebrates (human, mouse, rat, fugu, zebrafish)
• 3 plants (Arabidopsis, rice, poplar)
• 2 insects (fruit fly, mosquito)
• 2 nematodes (C. elegans, C. briggsae)
• 1 sea squirt
• 4 parasites (Plasmodium, Guillardia)
• 4 fungi (S. cerevisiae, S. pombe)
• 200+ bacteria and archaebacteria
• 2000+ viruses

So what do we do with all this sequence data?

Sequence Alignment
     G   E   N   E   T   I   C   S
G   60  40  30  20  20   0  10   0
E   40  50  30  30  20   0  10   0
N   30  30  40  20  20   0  10   0
E   20  20  20  30  20  10  10   0
S   20  20  20  20  20   0  10  10
I   10  10  10  10  10  20  10   0
S    0   0   0   0   0   0   0  10

Types of Alignments
• Global: sequences aligned from end to end
• Local: alignments may start in the middle of either sequence
• Ungapped: no insertions or deletions are allowed
• Other types: overlap alignments, repeated match alignments

Alignments tell us about...
• Function or activity of a new gene/protein
• Structure or shape of a new protein
• Location or preferred location of a protein
• Stability of a gene or protein
• Origin of a gene or protein
• Origin or phylogeny of an organelle
• Origin or phylogeny of an organism

Factoid:
Sequence comparisons lie at the heart of all bioinformatics

SEQUENCE SIMILARITY ≠ HOMOLOGY

Biological Definitions for Related Sequences
• Homologs are similar sequences in two different organisms that have been derived from a common ancestor sequence. Homologs can be described as either orthologous or paralogous.
  – Orthologs are similar sequences in two different organisms that have arisen due to a speciation event. Orthologs typically retain their functionality throughout evolution.
  – Paralogs are similar sequences within a single organism that have arisen due to a gene duplication event.
• Xenologs are similar sequences that do not share the same evolutionary origin, but rather have arisen out of horizontal transfer events through symbiosis, viruses, etc.

Similarity versus Homology
• Similarity refers to the likeness or % identity between 2 sequences
• Similarity means sharing a statistically significant number of bases or amino acids
• Similarity does not imply homology

• Homology refers to shared ancestry
• Two sequences are homologous if they are derived from a common ancestral sequence
• Homology usually implies similarity

Similarity versus Homology
• Homology cannot be quantified
• If two sequences have a high % identity it is OK to say they are homologous
• It is incorrect to say two sequences have a homology score of Z
• It is incorrect to say two sequences are X% homologous

Similarity versus Homology
• Similarity can be quantified
• It is correct to say that two sequences are X% identical
• It is correct to say that two sequences have a similarity score of Z
• It is generally incorrect to say that two sequences are X% similar

Homologues & All That
• Homologue (or Homolog)
  – Protein/gene that shares a common ancestor and which has good sequence and/or structure similarity to another (general term)
• Paralogue (or Paralog)
  – A homologue which arose through gene duplication in the same species/chromosome
• Orthologue (or Ortholog)
  – A homologue which arose through speciation (found in different species)

Sequence Complexity
MCDEFGHIKLAN....  High Complexity
ACTGTCACTGAT....  Mid Complexity
NNNNTTTTTNNN....  Low Complexity
Translate those DNA sequences!!!

Assessing Sequence Similarity

Two Character Strings
THESTORYOFGENESIS
THISBOOKONGENETICS

Character Comparison
THESTORYOFGENESI-S
** * *  * **** * *
THISBOOKONGENETICS

Context Comparison
THE STORY OF GENESIS
THIS BOOK ON GENETICS

Assessing Sequence Similarity
Rbn KETAAAKFERQHMD
Lsz KVFGRCELAAAMKRHGLDNYRGYSLGNWVCAAKFESNFNT
Rbn SST SAASSSNYCNQMMKSRNLTKDRCKPMNTFVHESLA
Lsz QATNRNTDGSTDYGILQINSRWWCNDGRTP GSRN
Rbn DVQAVCSQKNVACKNGQTNCYQSYSTMSITDCRETGSSKY
Lsz LCNIPCSALLSSDITASVNC AKKIVSDGDGMNAWVAWR
Rbn PNACYKTTQANKHIIVACEGNPYVPHFDASV
Lsz NRCKGTDVQA WIRGCRL
Is this alignment significant?

Is This Alignment Significant?
Gelsolin   89 L G N E L S Q D E S G A A A I F T V Q L 108
Annexin    82 L P S A L K S A L S G H L E T V I L G L 101
          154 L E K D I I S D T S G D F R K L M V A L 173
          240 L E - S I K K E V K G D L E N A F L N L 258
          314 L Y Y Y I Q Q D T K G D Y Q K A L L Y L 333
Consensus     L x P x x x P D x S G x h x x h x V L L

Some Simple Rules
• If two sequences are > 100 residues and > 25% identical, they are likely related
• If two sequences are 15-25% identical they may be related, but more tests are needed
• If two sequences are < 15% identical they are probably not related
• If you need more than 1 gap for every 20 residues the alignment is suspicious
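
The rules above can be sketched as a small helper. This is only an illustration: the function names are hypothetical, percent identity is computed over all alignment columns (one of several possible conventions), and the slide gives no rule for sequences that are more than 25% identical but 100 residues or shorter, so the sketch assumes they fall into the "may be related" bucket.

```python
def percent_identity(a, b):
    """Percent of alignment columns that hold the same residue in both rows.

    Assumption: the denominator is the full alignment length, gaps included.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    same = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return 100.0 * same / len(a)


def rule_of_thumb(identity, length):
    """Apply the simple identity rules to a percent identity and a residue count."""
    if identity > 25 and length > 100:
        return "likely related"
    if identity >= 15:
        # Also catches short sequences above 25% (an assumption, see lead-in).
        return "may be related, more tests needed"
    return "probably not related"


print(percent_identity("GENETICS", "GENES-IS"))
print(rule_of_thumb(30, 150))
```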
Doolittle's Rules of Thumb
[Figure: "Evolutionary Distance vs. Percent Sequence Identity", plotting Sequence Identity (%) against Number of Residues (0 to 400), with the "Twilight Zone" marked]

Sequence Alignment - Methods
• Dot Plots
• Dynamic Programming
• Heuristic (Fast) Local Alignment
• Multiple Sequence Alignment
• Contig Assembly

Dot Plots
• "Invented" in 1970 by Gibbs & McIntyre
• Good for quick graphical overview
• Simplest method for sequence comparison
• Inter-sequence comparison
• Intra-sequence comparison
  – Identifies internal repeats
  – Identifies domains or "modules"

Dot matrices
[Figure: dot matrix of a c g c g (columns) against a c a c g (rows)]

Dot Plots & Internal Repeats
[Figure]

Dot Plot Algorithm
• Take two sequences (A & B), write sequence A out as a row (length = m) and sequence B as a column (length = n)
• Create a table or "matrix" of "m" columns and "n" rows
• Compare each letter of sequence A with every letter in sequence B. If there's a match mark it with a dot, if not, leave blank

[Figure: dot plot of A C D E F G H G (row) against A C D E F G H G (column)]
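
The three steps above can be sketched in a few lines. This is a minimal illustration rather than any particular program's method; the function name `dot_plot` is made up here, and "*" stands in for the dot.

```python
def dot_plot(seq_a, seq_b):
    """Return one string per letter of seq_b (rows); columns follow seq_a.

    A "*" marks a position where the two letters match, a space otherwise.
    """
    return ["".join("*" if a == b else " " for a in seq_a) for b in seq_b]


seq = "ACDEFGHG"
rows = dot_plot(seq, seq)

# Print sequence A as the header row and sequence B down the side.
print("  " + " ".join(seq))
for letter, row in zip(seq, rows):
    print(letter + " " + " ".join(row))
```

Comparing a sequence with itself, as here, gives the main diagonal plus off-diagonal marks wherever a letter repeats (the two G's).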
Dot Plots
• Dot plots are useful as a first-level filter for determining an alignment between two sequences.
• Regions of similarity will show up as diagonals within the dot plot matrix.

Dot Plots
• Most commercial packages include good dot plot programs, including:
  – GCG/Omiga (Pharmacopeia)
  – PepTool (BioTools Inc.)
  – LaserGene (DNAStar)
• Popular freeware package is Dotter www.cgr.ki.se/cgr/groups/sonnhammer/Dotter.html
• Dotlet http://www.isrec.isb-sib.ch/java/dotlet/Dotlet.html
• JDotter http://athena.bioc.uvic.ca/sars/jdotter/main.php

Dynamic Programming
     G   E   N   E   T   I   C   S
G   10   0   0   0   0   0   0   0
E    0  10   0  10   0   0   0   0
N    0   0  10   0   0   0   0   0
E    0   0   0  10   0   0   0   0
S    0   0   0   0   0   0   0  10
I    0   0   0   0   0  10   0   0
S    0   0   0   0   0   0   0  10

     G   E   N   E   T   I   C   S
G   60  40  30  20  20   0  10   0
E   40  50  30  30  20   0  10   0
N   30  30  40  20  20   0  10   0
E   20  20  20  30  20  10  10   0
S   20  20  20  20  20   0  10  10
I   10  10  10  10  10  20  10   0
S    0   0   0   0   0   0   0  10

G E N E T I C S
| | | | * | |
G E N E S I S

Pair-wise sequence alignments
Idea: Display one sequence above another with spaces inserted in both to reveal similarity

A: C A T - T C A - C
   |   |     | |   |
B: C - T C G C A G C

Two types of alignment
S = CTGTCGCTGCACG
T = TGCCGTG

Global alignment
CTGTCG-CTGCACG
-TGC-CG-TG----

Local alignment
CTGTCGCTGCACG--
-------TGC-CGTG

Dynamic Programming
• Developed by Needleman & Wunsch (1970)
• Refined by Smith & Waterman (1981)
• Ideal for quantitative assessment
• Guaranteed to be mathematically optimal
• Slow: an N^2 algorithm
• Performed in 2 stages
  – Prepare a scoring matrix using recursive function
  – Scan matrix diagonally using traceback protocol

Identity Scoring Matrix (Sij)
[Table: lower-triangular 20 x 20 matrix over the amino acids A R N D C Q E G H I L K M F P S T W Y V, with 1 on the diagonal (identical residues) and 0 for every other pair]

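The Recursive Function

\[
S_{ij} = s_{ij} + \max
\begin{cases}
S_{i-1,\,j-1} \\
\max\limits_{2 < x < i} \left( S_{i-x,\,j-1} + w_{x-1} \right) \\
\max\limits_{2 < y < j} \left( S_{i-1,\,j-y} + w_{y-1} \right)
\end{cases}
\]

w = gap penalty
S = alignment score
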
An Example...
G A A T T C A G T T A (sequence #1)
G G A T C G A (sequence #2)

Three steps in dynamic programming
• Initialization
• Matrix fill (scoring)
• Traceback (alignment)

Initialization Step
• Create a matrix with M + 1 columns and N + 1 rows, where M and N correspond to the sizes of the sequences to be aligned.
• Since this example assumes there is no gap opening or gap extension penalty, the first row and first column of the matrix can be initially filled with 0.

Matrix Fill Step
• For each position, M_{i,j} is defined to be the maximum score at position i, j:

\[
M_{i,j} = \max
\begin{cases}
M_{i-1,\,j-1} + S_{i,j} & \text{(match/mismatch in the diagonal)} \\
M_{i,\,j-1} + w & \text{(gap in sequence \#1)} \\
M_{i-1,\,j} + w & \text{(gap in sequence \#2)}
\end{cases}
\]

• Note that in the example, M_{i-1,j-1} will be red, M_{i,j-1} will be green and M_{i-1,j} will be blue.
• Using this information, the score at position 1,1 in the matrix can be calculated. Since the first residue in both sequences is a G, S_{1,1} = 1, and by the assumptions stated at the beginning, w = 0. Thus,

\[
M_{1,1} = \max\,[\,M_{0,0} + 1,\; M_{1,0} + 0,\; M_{0,1} + 0\,] = \max\,[\,1,\, 0,\, 0\,] = 1
\]

• A value of 1 is then placed in position 1,1 of the scoring matrix.

• Since the gap penalty (w) is 0, the rest of row 1 and column 1 can be filled in with the value 1.
• Take the example of row 1
  – At column 2, the value is the max of 0 (for a mismatch), 0 (for a vertical gap) or 1 (horizontal gap). The rest of row 1 can be filled out similarly until we get to column 8.
  – At this point, there is a G in both sequences (light blue). Thus, the value for the cell at row 1 column 8 is the maximum of 1 (for a match), 0 (for a vertical gap) or 1 (horizontal gap). The value will again be 1.
  – The rest of row 1 and column 1 can be filled with 1 using the above reasoning.

• Now let's look at column 2.
  – The location at row 2 will be assigned the value of the maximum of 1 (mismatch), 1 (horizontal gap) or 1 (vertical gap). So its value is 1.
  – At the position column 2 row 3, there is an A in both sequences. Thus, its value will be the maximum of 2 (match), 1 (horizontal gap), 1 (vertical gap) so its value is 2.
• Moving along to position column 2 row 4, its value will be the maximum of 1 (mismatch), 1 (horizontal gap), 2 (vertical gap) so its value is 2. Note that for all of the remaining positions except the last one in column 2, the choices for the value will be the exact same as in row 4 since there are no matches.
• The final row will contain the value 2 since it is the maximum of 2 (match), 1 (horizontal gap) and 2 (vertical gap).

• Using the same techniques as described for column 2, we can fill in column 3.
• After filling in all of the values the score matrix is as follows
[Figure: completed score matrix]
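
The initialization and fill steps walked through above can be sketched as follows. This is a minimal illustration under the example's assumptions (match = 1, mismatch = 0, gap penalty w = 0); the function name `fill_matrix` and its parameters are hypothetical. Sequence #1 runs along the columns and sequence #2 down the rows.

```python
def fill_matrix(seq1, seq2, match=1, mismatch=0, gap=0):
    """Build the (N + 1) x (M + 1) score matrix for a global alignment."""
    cols, rows = len(seq1) + 1, len(seq2) + 1
    m = [[0] * cols for _ in range(rows)]

    # Initialization: first row and column accumulate the gap penalty
    # (all zeros when gap == 0, as in the example).
    for j in range(1, cols):
        m[0][j] = m[0][j - 1] + gap
    for i in range(1, rows):
        m[i][0] = m[i - 1][0] + gap

    # Matrix fill: each cell is the best of diagonal, left, and upper neighbors.
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if seq2[i - 1] == seq1[j - 1] else mismatch
            m[i][j] = max(m[i - 1][j - 1] + s,   # match/mismatch
                          m[i][j - 1] + gap,     # move from the left
                          m[i - 1][j] + gap)     # move from above
    return m


matrix = fill_matrix("GAATTCAGTTA", "GGATCGA")
for row in matrix:
    print(" ".join(str(v) for v in row))
```

The bottom-right cell holds the maximum global alignment score for the two test sequences.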
Traceback Step
• After the matrix fill step, the maximum alignment score for the two test sequences is 6.
• The traceback step determines the actual alignment(s) that result in the maximum score.
• The traceback step begins in the M,N position in the matrix, i.e. the position that leads to the maximal score. In this case, there is a 6 in that location.

• Traceback takes the current cell and looks to the neighbor cells that could be direct predecessors. This means it looks to the neighbor to the left (gap in sequence #2), the diagonal neighbor (match/mismatch), and the neighbor above it (gap in sequence #1).
• The algorithm for traceback chooses as the next cell in the sequence one of the possible predecessors.
• In this case, the neighbors are marked in red. They are all also equal to 5.

• Since the current cell has a value of 6 and the scores are 1 for a match and 0 for anything else, the only possible predecessor is the diagonal match/mismatch neighbor.
  – This gives us a current alignment of

    (Seq #1) A
             |
    (Seq #2) A

• If more than one possible predecessor exists, any can be chosen. In this case, it is the cell with the red 5.

• The alignment as described in the above step adds a gap to sequence #2, so the current alignment is

    (Seq #1) TA
              |
    (Seq #2) _A

• Once again, the direct predecessor produces a gap in sequence #2. After this step, the current alignment is

    (Seq #1) TTA
               |
    (Seq #2) __A

• Continuing on with the traceback step, we eventually get to a position in column 0 row 0 which tells us that traceback is completed.
• One possible maximum alignment is:

    GAATTCAGTTA
    | | || |  |
    GGA_TC_G__A

• An alternate solution is:

    G_AATTCAGTTA
    |  | || |  |
    GG_A_TC_G__A

• Note:
  – There are more alternative solutions, each resulting in a maximal global alignment score of 6.
  – Since the number of optimal alignments can grow exponentially, most dynamic programming algorithms will only print out a single solution.
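
The traceback described above can be sketched as below. This is one possible implementation, not the only one: the name `global_align` is hypothetical, the matrix is rebuilt inside the function so the block stands alone, and ties between predecessors are broken in a fixed order (diagonal, then left, then up), so it reports a single optimal alignment that need not be either of the two shown above.

```python
def global_align(seq1, seq2, match=1, mismatch=0, gap=0):
    """Return (score, aligned seq1, aligned seq2) for one optimal global alignment."""
    rows, cols = len(seq2) + 1, len(seq1) + 1
    m = [[0] * cols for _ in range(rows)]
    for j in range(1, cols):
        m[0][j] = j * gap
    for i in range(1, rows):
        m[i][0] = i * gap

    def s(i, j):
        # Score for pairing residue i of seq2 with residue j of seq1.
        return match if seq2[i - 1] == seq1[j - 1] else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            m[i][j] = max(m[i - 1][j - 1] + s(i, j),
                          m[i][j - 1] + gap,
                          m[i - 1][j] + gap)

    # Traceback: start in the bottom-right cell and walk back to row 0, column 0.
    i, j = rows - 1, cols - 1
    top, bottom = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and m[i][j] == m[i - 1][j - 1] + s(i, j):
            top.append(seq1[j - 1]); bottom.append(seq2[i - 1])
            i, j = i - 1, j - 1
        elif j > 0 and m[i][j] == m[i][j - 1] + gap:
            top.append(seq1[j - 1]); bottom.append("_")   # gap in sequence #2
            j -= 1
        else:
            top.append("_"); bottom.append(seq2[i - 1])   # gap in sequence #1
            i -= 1
    return m[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


score, row1, row2 = global_align("GAATTCAGTTA", "GGATCGA")
print(score)
print(row1)
print(row2)
```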
+10 for match, -2 for mismatch, -5 for space

      λ    C    T    C    G    C    A    G    C
λ     0   -5  -10  -15  -20  -25  -30  -35  -40
C    -5   10    5
A   -10
T   -15
T   -20
C   -25
A   -30
C   -35

Traceback can yield both optimum alignments

      λ    C    T    C    G    C    A    G    C
λ     0   -5  -10  -15  -20  -25  -30  -35  -40
C    -5   10    5    0   -5  -10  -15  -20  -25
A   -10    5    8    3   -2   -7    0   -5  -10
T   -15    0   15   10    5    0   -5   -2   -7
T   -20   -5   10   13    8    3   -2   -7   -4
C   -25  -10    5   20   15   18   13    8    3
A   -30  -15    0   15   18   13   28   23   18
C   -35  -20   -5   10   13   28   23   26   33
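
As a check on the completed matrix, the alignment of CATTCAC with CTCGCAGC shown earlier (C A T - T C A - C over C - T C G C A G C) can be scored column by column with the same values: +10 for a match, -2 for a mismatch, -5 for a space. The helper name below is made up for this sketch.

```python
def score_alignment(top, bottom, match=10, mismatch=-2, space=-5):
    """Sum the per-column scores of an alignment written with '-' for spaces."""
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += space
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


total = score_alignment("CAT-TCA-C", "C-TCGCAGC")
print(total)  # 33, the value in the bottom-right cell of the matrix
```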
Could We Do Better?
• Key to the performance of Dynamic Programming is the scoring function
• Dynamic Programming always gives the mathematically correct answer
• Dynamic Programming does not always give the biologically correct answer
• The weakest link: The Scoring Matrix

Local vs. Global Pairwise Alignments
• A global alignment includes all elements of the sequences and includes gaps.
  – A global alignment may or may not include "end gap" penalties.
  – Global alignments are better indicators of homology and take longer to compute.
• A local alignment includes only subsequences, and sometimes is computed without gaps.
  – Local alignments can find shared domains in divergent proteins and are fast to compute
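
For contrast with the global fill, here is a minimal sketch of local alignment scoring in the style of Smith & Waterman: the same recurrence, but a cell may never drop below 0, and the answer is the best cell anywhere in the matrix rather than the bottom-right corner. The scoring values (+1, -1, -1) and the function name are assumptions chosen for the illustration.

```python
def local_score(s, t, match=1, mismatch=-1, gap=-1):
    """Best local alignment score between s and t (score only, no traceback)."""
    prev = [0] * (len(t) + 1)   # previous row of the matrix
    best = 0
    for i in range(1, len(s) + 1):
        cur = [0]               # column 0 is always 0 for local alignment
        for j in range(1, len(t) + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            val = max(0,                   # start a fresh local alignment here
                      prev[j - 1] + sub,   # extend diagonally
                      prev[j] + gap,       # gap in t
                      cur[j - 1] + gap)    # gap in s
            cur.append(val)
            best = max(best, val)
        prev = cur
    return best


print(local_score("TTTACGTTT", "GGACGGG"))            # shared "ACG" core
print(local_score("CTGTCGCTGCACG", "TGCCGTG"))        # S and T from the slides
```

Only two rows are kept in memory because the score alone is needed; recovering the aligned subsequences would require the full matrix or stored pointers.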
Scoring systems
• Match scores: s(a,b) assigns a score to each combination of aligned letters. Examples: transition/transversion, PAM, BLOSUM.
• Gap score: f(g) assigns a score to a gap of length g. Examples: linear, affine.
• Scoring systems are usually additive: the total score is the sum of the substitution scores and all the gap scores.

Match Scores
• Some amino acids are more "substitutable" for each other than others. Serine and threonine are more alike than tryptophan and alanine.
• We can introduce "mismatch costs" for handling different substitutions.
• We don't usually use mismatch costs in aligning nucleotide sequences, since often no substitution is per se better than any other.
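
The additive scheme above, with s(a,b) and f(g) as separate pieces, can be sketched as follows. All numeric values (a per-space cost of -5, a gap-open cost of -10, a gap-extend cost of -1, and +1/-1 substitution scores) are placeholders picked for the example, and the function names are hypothetical.

```python
import re


def linear_gap(g, per_space=-5):
    """Linear gap score: every space in the gap costs the same."""
    return per_space * g


def affine_gap(g, gap_open=-10, gap_extend=-1):
    """Affine gap score: one opening cost plus a smaller cost per extra space."""
    return gap_open + gap_extend * (g - 1) if g > 0 else 0


def total_score(top, bottom, s, f):
    """Sum of substitution scores s(a, b) plus gap scores f(g) over all gaps."""
    subs = sum(s(a, b) for a, b in zip(top, bottom) if a != "-" and b != "-")
    gaps = sum(f(len(run)) for row in (top, bottom)
               for run in re.findall(r"-+", row))
    return subs + gaps


def simple(a, b):
    return 1 if a == b else -1


print(total_score("ACGTT", "A--TT", simple, linear_gap))
print(total_score("ACGTT", "A--TT", simple, affine_gap))
```

With an affine f(g), one gap of length 2 costs less than two separate gaps of length 1, which is the usual motivation for preferring it over a linear penalty.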
Match Score Tables
• Match scores are computed using a lookup table with an entry for each possible pair of letters.
• Match scores are often calculated on the basis of the frequency of particular mutations in very similar sequences.

Scoring Matrices
• An empirical model of evolution, biology and chemistry all wrapped up in a 20 x 20 table of integers
• Structurally or chemically similar residues should ideally have high diagonal or off-diagonal numbers
• Structurally or chemically dissimilar residues should ideally have low diagonal or off-diagonal numbers

A Better Matrix - PAM250

    A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
A   2
R  -2  6
N   0  0  2
D   0 -1  2  4
C  -2 -4 -4 -5  4
Q   0  1  1  2 -5  4
E   0 -1  1  3 -5  2  4
G   1 -3  0  1 -3 -1  0  5
H  -1  2  2  1 -3  3  1 -2  6
I  -1 -2 -2 -2 -2 -2 -2 -3 -2  5
L  -2 -3 -3 -4 -6 -2 -3 -4 -2  2  6
K  -1  3  1  0 -5  1  0 -2  0 -2 -3  5
M  -1  0 -2 -3 -5 -1 -2 -3 -2  2  4  0  6
F  -4 -4 -4 -6 -4 -5 -5 -5 -2  1  2 -5  0  9
P   1  0 -1 -1 -3  0 -1 -1  0 -2 -3 -1 -2 -5  6
S   1  0  1  0  0 -1  0  1 -1 -1 -3  0 -2 -3  1  3
T   1 -1  0  0 -2 -1  0  0 -1  0 -2  0 -1 -2  0  1  3
W  -6  2 -4 -7 -8 -5 -7 -7 -3 -5 -2 -3 -4  0 -6 -2 -5 17
Y  -3 -4 -2 -4  0 -4 -4 -5  0 -1 -1 -4 -2  7 -5 -3 -3  0 10
V   0 -2 -2 -2 -2 -2 -2 -1 -2  4  2 -2  2 -1 -1 -1  0 -6 -2  4

Using PAM250...
Gap Penalty = -1

Scores used in the example:
    A  T  V  D
A   2
T   1  3
V   0  0  4
D   0  0 -2  4

The matrix for AATVD (columns) against AVVD (rows) is filled in one cell at a time. The completed matrix:

     A   A   T   V   D
A    2   1   0  -1  -1
V   -1   2   1   5  -1
V   -1   1   2   5   3
D   -1   1   1   0   9

AATVD
|  ||
AV-VD

PAM Matrices
• Developed by M.O. Dayhoff (1978)
• PAM = Point Accepted Mutation
• Matrix assembled by looking at patterns of substitutions in closely related proteins
• 1 PAM corresponds to 1 amino acid change per 100 residues
• 1 PAM = 1% divergence or 1 million years in evolutionary history

Dynamic Programming
• Great for doing pairwise global alignments
• Produces a quantitative alignment "score"
• Problems if one tries to do alignments with very large sequences (memory requirement grows as N^2 or as N x M)
• Serious problems if one tries to align one sequence against a database (10's of hours)
• Need an alternative.....

Fast Local Alignment Methods
ACDEAGHNKLM...
KKDEFGHPKLM...
SCDEFCHLKLM...
MCDEFGHNKLV...
ACDEFGHIKLM...
QCDEFGHAKLM...
AQQQFGHIKLPI...
WCDEFGHLKLM...
SMDEFAHVKLM...
ACDEFGFKKLM...

Fast Local Alignment Methods
• Developed by Lipman & Pearson (1985/88)
• Refined by Altschul et al. (1990/97)
• Ideal for large database comparisons
• Uses heuristics & statistical simplification
• Fast N-type algorithm (similar to Dot Plot)
• Cuts sequences into short words (k-tuples)
• Uses "Hash Tables" to speed comparison

Fast Alignment Algorithm
Query: ACDEFGDEF.....

ACD  CDE  DEF  EFG  FGD  GDE  ...
1    2    3,7  4    5    6

ACD  CDE  DEF  EFG  FGD  GDE
ACE  CDD  NEF  ...  ...  ...
GCE  CEE  DEY  ...  ...  ...
GCD  DDY  ...  ...  ...

Fast Alignment Algorithm
Query: ACDEFGDEF.....

ACD  CDE  DEF  EFG  FGD  GDE
ACE  CDD  NEF  ...  ...  ...
GCE  CEE  DEY  ...  ...  ...
GCD  DDY  ...  ...  ...

Database: LMRGCDDYGDEY...

Fast Alignment Algorithm
[Figure: dot matrix of the query A C D E F G D E F... (columns) against the database sequence L M R G C D D Y G... (rows)]
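
The word (k-tuple) lookup above can be sketched with an ordinary Python dictionary playing the role of the hash table. This is an illustration only: the function names are hypothetical, positions are 1-based to match the slide, and only exact word matches are found, whereas the slide also lists similar words (ACE, GCE, ...) that a real program would score as neighbors.

```python
from collections import defaultdict


def kmer_index(seq, k=3):
    """Map each k-letter word in seq to the list of 1-based positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(seq) - k + 1):
        index[seq[pos:pos + k]].append(pos + 1)
    return dict(index)


def shared_words(query, database, k=3):
    """List (word, query position, database position) for every exact word hit."""
    index = kmer_index(query, k)
    hits = []
    for pos in range(len(database) - k + 1):
        word = database[pos:pos + k]
        for qpos in index.get(word, []):
            hits.append((word, qpos, pos + 1))
    return hits


index = kmer_index("ACDEFGDEF")
print(index)
print(shared_words("ACDEFGDEF", "LMRGCDDYGDEY"))
```

Each database word costs one dictionary lookup, so the scan is linear in the database length instead of comparing every letter with every other letter.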
FASTA
• Developed in 1985 and 1988 (W. Pearson)
• Looks for clusters of nearby or locally dense "identical" k-tuples
• init1 score = score for first set of k-tuples
• initn score = score for gapped k-tuples
• opt score = optimized alignment score
• Z-score = number of S.D. above random
• expect = expected # of random matches

FASTA
gi|135775|sp|P08628|THIO_RABIT THIOREDOXIN (104 aa)
 initn: 641 init1: 641 opt: 642 Z-score: 806.4 expect() 3.2e-38
Smith-Waterman score: 642; 86.538% identity in 104 aa overlap (2-105:1-104)
gi|135 2- 105: --------------------------------------------------------------------:

               10        20        30        40        50        60        70        80
thiore MVKQIESKTAFQEALDAAGDKLVVVDFSATWCGPCKMINPFFHSLSEKYSNVIFLEVDVDDCQDVASECEVKCTPTFQFF
        :::::::.::::.::.:::::::::::::::::::::.::::.::::..::.:.:::::::.:.:.:::::: ::::::
gi|135  VKQIESKSAFQEVLDSAGDKLVVVDFSATWCGPCKMIKPFFHALSEKFNNVVFIEVDVDDCKDIAAECEVKCMPTFQFF
                10        20        30        40        50        60        70

               90       100
thiore KKGQKVGEFSGANKEKLEATINELV
       ::::::::::::::::::::::::.
gi|135 KKGQKVGEFSGANKEKLEATINELL
      80        90       100

Multiple Sequence Alignment
[Figure: Multiple alignment of Calcitonins]

Multiple Sequence Alignment
• Developed and refined by many (Doolittle, Barton, Corpet) through the 1980's
• Used extensively for extracting hidden phylogenetic relationships and identifying sequence families
• Powerful tool for extracting new sequence motifs and signature sequences

Multiple Alignment Algorithm
1. Take all "n" sequences and perform all possible pairwise (n(n-1)/2) alignments
2. Identify highest scoring pair, perform an alignment & create a consensus sequence
3. Select next most similar sequence and align it to the initial consensus, regenerate a second consensus
4. Repeat step 3 until finished
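
The first two steps of the algorithm above (all n(n-1)/2 pairwise comparisons, then picking the highest-scoring pair) can be sketched as below. The scoring function here is a deliberately crude stand-in, counting identical positions in equal-length sequences; a real implementation would use a pairwise alignment score. All names are hypothetical.

```python
from itertools import combinations


def identity_score(a, b):
    """Placeholder pairwise score: number of positions with the same letter."""
    return sum(x == y for x, y in zip(a, b))


def best_pair(seqs):
    """Return (number of pairwise comparisons, (best score, i, j))."""
    pairs = list(combinations(range(len(seqs)), 2))
    scored = [(identity_score(seqs[i], seqs[j]), i, j) for i, j in pairs]
    return len(pairs), max(scored)


seqs = ["ACDEFGHIKLM", "KKDEFGHPKLM", "SCDEFCHLKLM", "MCDEFGHNKLV"]
n_pairs, best = best_pair(seqs)
print(n_pairs)   # 4 sequences give 4 * 3 / 2 = 6 comparisons
print(best)
```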
Multiple Alignment
• Most commercial vendors offer good multiple alignment programs including:
  – GCG (Accelrys)
  – PepTool/GeneTool (BioTools Inc.)
  – LaserGene (DNAStar)
• Popular web servers include T-COFFEE, MULTALIN and CLUSTALW
• Popular freeware includes PHYLIP & PAUP

Multi-Align Websites
• Match-Box http://www.fundp.ac.be/sciences/biologie/bms/matchbox_submit.shtml
• MUSCA http://cbcsrv.watson.ibm.com/Tmsa.html
• T-Coffee http://www.ebi.ac.uk/Tools/msa/tcoffee/
• MULTALIN http://www.toulouse.inra.fr/multalin.html
• CLUSTALW http://www.ebi.ac.uk/clustalw/

T-Coffee
• Uses standard progressive alignment but with a "twist" to avoid local minima
• Allows the combination of a collection of multiple/pairwise, global or local alignments into a single model
• It also estimates the level of consistency of each position within the new alignment with the rest of the alignments
http://www.ebi.ac.uk/Tools/msa/tcoffee/

Multi-alignment & Contig Assembly
ATCGATGCGTAGCAGACTACCGTTACGATGCCTT...
TAGCTACGCATCGTCTGATGGCAATGCTACGGAA..

Contig Assembly
• Read, edit & trim DNA chromatograms
• Remove overlaps & ambiguous calls
• Read in all sequence files (10-10,000)
• Reverse complement all sequences (doubles # of sequences to align)
• Remove vector sequences (vector trim)
• Remove regions of low complexity
• Perform multiple sequence alignment
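
The "reverse complement all sequences" step can be sketched as follows. The helper names are made up for this example, and only the four unambiguous bases A, C, G, T are handled; ambiguity codes such as N are left untouched by `str.translate`.

```python
# Translation table mapping each base to its Watson-Crick partner.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(seq):
    """Complementary strand, written in the same left-to-right order."""
    return seq.translate(COMPLEMENT)


def reverse_complement(seq):
    """Complementary strand read in its own 5'-to-3' direction."""
    return complement(seq)[::-1]


read = "ATCGATGCGTAGCAGACTACCGTTACGATGCCTT"
print(complement(read))          # the second strand shown on the slide
print(reverse_complement(read))  # the sequence added to the assembly input
```

Because a read may come from either strand, adding the reverse complement of every read lets the aligner find overlaps regardless of orientation, at the cost of doubling the number of sequences.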
Contig Assembly = Multiple Alignment
1. Only accept a very high sequence identity
2. Accept unlimited number of "end" gaps
3. Very high cost for opening "internal" gaps
4. A short match with high score/residue is preferred over a long match with low score/residue

Assembly Parameters
• User-selected parameters
  1. minimum length of overlap
  2. percent identity within overlap
• Non-adjustable parameters
  1. sequence "quality" factors
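
The two user-selected parameters can be illustrated with a simple ungapped suffix-prefix overlap test between two reads. This is a sketch under simplifying assumptions (no gaps, no quality values, longest acceptable overlap wins); the function name, parameter names, and default thresholds are hypothetical, and the two reads are overlapping pieces cut from the example sequence on the slides.

```python
def best_overlap(left, right, min_overlap=4, min_identity=90.0):
    """Longest overlap between the end of `left` and the start of `right`.

    Returns (overlap length, percent identity), or None if no overlap of at
    least `min_overlap` bases reaches `min_identity` percent identity.
    """
    for length in range(min(len(left), len(right)), min_overlap - 1, -1):
        suffix, prefix = left[-length:], right[:length]
        same = sum(x == y for x, y in zip(suffix, prefix))
        identity = 100.0 * same / length
        if identity >= min_identity:
            return length, identity
    return None


read_1 = "ATCGATGCGTAG"   # bases 1-12 of ATCGATGCGTAGCAGACT...
read_2 = "GCGTAGCAGACT"   # bases 7-18 of the same sequence
print(best_overlap(read_1, read_2))
```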
Chromatogram Editing
[Figure]

Sequence Loading
[Figure]

Sequence Alignment
[Figure]

Contig Alignment - Process
ATCGATGCGTAGCAGACTACCGTTACGATGCCTT...

Problems for Assembly
• Repeat regions
  – Capture sequences from non-contiguous regions
• Polymorphisms
  – Cause failure to join correct regions
• Large data volume
  – Requires large numbers of pair-wise comparisons

Sequence Assembly Programs
• Phred - base calling program that does detailed statistical analysis (UNIX) http://www.phrap.org/
• Phrap - sequence assembly program (UNIX) http://www.phrap.org/
• TIGR Assembler - microbial genomes (UNIX) http://www.tigr.org/softlab/assembler/
• The Staden Package (UNIX) http://www.mrc-lmb.cam.ac.uk/pubseq/
• GeneTool/ChromaTool/Sequencher (PC/Mac) http://bio.ifom-firc.it/ASSEMBLY/assemble.html

Phrap
• Phrap is a program for assembling shotgun DNA
sequence data
• Uses a combination of user‐supplied and
internally computed data quality information to
improve assembly accuracy in the presence of
repeats
• Constructs the contig sequence as a mosaic of
the highest quality read segments rather than a
consensus
• Handles large datasets
Conclusions
• Sequence alignments and database searching are key to all of bioinformatics
• There are four different methods for doing sequence comparisons: 1) Dot Plots; 2) Dynamic Programming; 3) Fast Alignment; and 4) Multiple Alignment
• Understanding the significance of alignments requires an understanding of statistics and distributions
