Lec 4 Seq and Seq Align New (Compatibility Mode)


Sequencing & Sequence Alignment

[Figure: dynamic programming score matrix comparing GENETICS (columns) with GENESIS (rows)]

Objectives

• Understand how DNA sequence data is collected and prepared
• Be aware of the importance of sequence searching and sequence alignment in biology and medicine
• Be familiar with the different algorithms and scoring schemes used in sequence searching and sequence alignment

Why compare sequences?


• To find whether two (or more) genes or proteins are evolutionarily related to each other
• To find structurally or functionally similar
regions within proteins

Similar genes arise by gene duplication
• Copy of a gene inserted next to the original
• Two copies mutate independently
• Each can take on separate functions
• All or part can be transferred from one part of genome to another

What is Sequence Alignment?
• Given two sequences, how to measure their similarity?
• ATAACTTTAATTAA
• ATCC-TTTACTAA-

• ATAACTTTAATTAA
• ATCC-TTTAC-TAA
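One simple way to compare the two candidate alignments shown above is to count the columns in which both rows carry the same residue. The sketch below is only an illustration: the function name `count_identities` is made up here, and ASCII `-` is assumed as the gap character.

```python
def count_identities(row_a: str, row_b: str) -> int:
    """Count alignment columns where both rows show the same residue."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    # A gap never counts as an identity, even against another gap.
    return sum(1 for a, b in zip(row_a, row_b) if a == b and a != "-")


top = "ATAACTTTAATTAA"
first = "ATCC-TTTACTAA-"   # first gap placement from the slide
second = "ATCC-TTTAC-TAA"  # second gap placement from the slide

print(count_identities(top, first))   # 8
print(count_identities(top, second))  # 9
```

Under this very crude measure the second gap placement looks slightly better; a real scoring scheme would also charge for mismatches and gaps.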

What is Sequence Alignment?

[Figure: sequence alignment of instances of the acidic ribosomal protein P0 (L10E) from several organisms]

• Arranging the primary sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences

Tasks of Sequence Alignment
• Pairwise alignment
• Multiple sequence alignment
• Global alignment
• Local alignment
• Approximate alignment algorithms versus optimal/exact alignment algorithms

Pairwise Sequence Alignment
• Pairwise sequence alignment is the primary tool in sequence analysis, used for database search and as a component of other algorithms.
• What? To detect similarity
• Why?
– Infer phylogeny (similarity ~ distance)
– Predict function
– Predict structure
– Predict “signals” (binding sites, splicing signals, etc.)

Problem Definition
• (Optimal) pairwise alignment consists of considering all possible alignments of two sequences and choosing the optimal one.
• Sub-optimal (heuristic) alignment algorithms are also very important, e.g. BLAST
• We will focus on optimal alignment methods in this class.

Key Issues
The key issues are:
• Types of alignments (local vs. global)
• The scoring system
• The alignment algorithm
• Measuring alignment significance
Example Alignment

True homology: alpha globin and beta globin

HBA_HUMAN GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKL
G+ +VK+HGKKV A+++++AH+D++ +++++LS+LH KL
HBB_HUMAN GNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKL

Spurious “homology”: alpha globin aligned with a protein of different structure and function

HBA_HUMAN GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSD----LHAHKL
GS+ + G + +D L ++ H+ D+ A +AL D ++AH+
F11G11.2 GSGYLVGDSLTFVDLL--VAQHTADLLAANAALLDEFPQFKAHQE

Shotgun Sequencing

[Figure: isolate chromosome, shear DNA into fragments, clone into sequencing vectors, sequence]

Principles of DNA Sequencing

[Figure: a DNA fragment cloned into a pBR322 vector (Amp, Tet, Ori) is denatured with heat to produce single-stranded DNA (ssDNA); primers, Klenow, ddNTP and dNTP are then added]

Shotgun Sequencing

[Figure: sequence chromatogram, send to computer, assembled sequence]
Shotgun Sequencing

• Very efficient process for small-scale (~10 kb) sequencing (preferred method)
• First applied to whole genome sequencing in 1995 (H. influenzae)
• Now standard for all prokaryotic genome sequencing projects
• Successfully applied to D. melanogaster
• Moderately successful for H. sapiens

The Finished Product

GATTACAGATTACAGATTACAGATTACAGATTACAG
ATTACAGATTACAGATTACAGATTACAGATTACAGA
TTACAGATTACAGATTACAGATTACAGATTACAGAT
TACAGATTAGAGATTACAGATTACAGATTACAGATT
ACAGATTACAGATTACAGATTACAGATTACAGATTA
CAGATTACAGATTACAGATTACAGATTACAGATTAC
AGATTACAGATTACAGATTACAGATTACAGATTACA
GATTACAGATTACAGATTACAGATTACAGATTACAG
ATTACAGATTACAGATTACAGATTACAGATTACAGA
TTACAGATTACAGATTACAGATTACAGATTACAGAT

Sequencing Successes

T7 bacteriophage: completed in 1983; 39,937 bp, 59 coded proteins
Escherichia coli: completed in 1998; 4,639,221 bp, 4293 ORFs
Saccharomyces cerevisiae: completed in 1996; 12,069,252 bp, 5800 genes
Caenorhabditis elegans: completed in 1998; 95,078,296 bp, 19,099 genes
Drosophila melanogaster: completed in 2000; 116,117,226 bp, 13,601 genes
Homo sapiens: completed in 2003; 3,201,762,515 bp, 31,780 genes
Genomes to Date
• 8 vertebrates (human, mouse, rat, fugu, zebrafish)
• 3 plants (Arabidopsis, rice, poplar)
• 2 insects (fruit fly, mosquito)
• 2 nematodes (C. elegans, C. briggsae)
• 1 sea squirt
• 4 parasites (Plasmodium, Guillardia)
• 4 fungi (S. cerevisiae, S. pombe)
• 200+ bacteria and archaebacteria
• 2000+ viruses

So what do we do with all this sequence data?

Sequence Alignment

    G  E  N  E  T  I  C  S
G  60 40 30 20 20  0 10  0
E  40 50 30 30 20  0 10  0
N  30 30 40 20 20  0 10  0
E  20 20 20 30 20 10 10  0
S  20 20 20 20 20  0 10 10
I  10 10 10 10 10 20 10  0
S   0  0  0  0  0  0  0 10

Types of Alignments
• Global: sequences aligned from end to end.
• Local: alignments may start in the middle of either sequence
• Ungapped: no insertions or deletions are allowed
• Other types: overlap alignments, repeated match alignments
Alignments tell us about...

• Function or activity of a new gene/protein
• Structure or shape of a new protein
• Location or preferred location of a protein
• Stability of a gene or protein
• Origin of a gene or protein
• Origin or phylogeny of an organelle
• Origin or phylogeny of an organism

Factoid:

Sequence comparisons lie at the heart of all bioinformatics

SEQUENCE SIMILARITY ≠ HOMOLOGY

Biological Definitions for Related Sequences

• Homologs are similar sequences in two different organisms that have been derived from a common ancestor sequence. Homologs can be described as either orthologous or paralogous.
– Orthologs are similar sequences in two different organisms that have arisen due to a speciation event. Orthologs typically retain their functionality throughout evolution.
– Paralogs are similar sequences within a single organism that have arisen due to a gene duplication event.
• Xenologs are similar sequences that do not share the same evolutionary origin, but rather have arisen out of horizontal transfer events through symbiosis, viruses, etc.
Similarity versus Homology

• Similarity refers to the likeness or % identity between 2 sequences
• Similarity means sharing a statistically significant number of bases or amino acids
• Similarity does not imply homology

• Homology refers to shared ancestry
• Two sequences are homologous if they are derived from a common ancestral sequence
• Homology usually implies similarity

Similarity versus Homology

• Similarity can be quantified
• It is correct to say that two sequences are X% identical
• It is correct to say that two sequences have a similarity score of Z
• It is generally incorrect to say that two sequences are X% similar

Similarity versus Homology

• Homology cannot be quantified
• If two sequences have a high % identity it is OK to say they are homologous
• It is incorrect to say two sequences have a homology score of Z
• It is incorrect to say two sequences are X% homologous
Homologues & All That

• Homologue (or Homolog)
– Protein/gene that shares a common ancestor and which has good sequence and/or structure similarity to another (general term)
• Paralogue (or Paralog)
– A homologue which arose through gene duplication in the same species/chromosome
• Orthologue (or Ortholog)
– A homologue which arose through speciation (found in different species)

Sequence Complexity

MCDEFGHIKLAN…. High Complexity
ACTGTCACTGAT…. Mid Complexity
NNNNTTTTTNNN…. Low Complexity

Translate those DNA sequences!!!

Assessing Sequence Similarity

Two Character Strings
THESTORYOFGENESIS
THISBOOKONGENETICS

Character Comparison
THESTORYOFGENESI-S
* * * * * * * * * * *
THISBOOKONGENETICS

Context Comparison
THE STORY OF GENESIS
THIS BOOK ON GENETICS

Assessing Sequence Similarity

Rbn KETAAAKFERQHMD
Lsz KVFGRCELAAAMKRHGLDNYRGYSLGNWVCAAKFESNFNT

Rbn SST SAASSSNYCNQMMKSRNLTKDRCKPMNTFVHESLA
Lsz QATNRNTDGSTDYGILQINSRWWCNDGRTP GSRN

Rbn DVQAVCSQKNVACKNGQTNCYQSYSTMSITDCRETGSSKY
Lsz LCNIPCSALLSSDITASVNC AKKIVSDGDGMNAWVAWR

Rbn PNACYKTTQANKHIIVACEGNPYVPHFDASV
Lsz NRCKGTDVQA WIRGCRL

Is this alignment significant?
Is This Alignment Significant?

Gelsolin   89 L G N E L S Q D E S G A A A I F T V Q L 108
Annexin    82 L P S A L K S A L S G H L E T V I L G L 101
          154 L E K D I I S D T S G D F R K L M V A L 173
          240 L E – S I K K E V K G D L E N A F L N L 258
          314 L Y Y Y I Q Q D T K G D Y Q K A L L Y L 333
Consensus     L x P x x x P D x S G x h x x h x V L L

Some Simple Rules
• If two sequences are > 100 residues and > 25% identical, they are likely related
• If two sequences are 15-25% identical they may be related, but more tests are needed
• If two sequences are < 15% identical they are probably not related
• If you need more than 1 gap for every 20 residues the alignment is suspicious
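The rules of thumb above can be written down as a small decision function. This is only a sketch of the slide's heuristics: the function name, argument names and return strings are made up for illustration, and the slide gives no rule for short alignments above 25% identity, so that case is reported as undecided.

```python
def rule_of_thumb(length: int, percent_identity: float, gaps: int = 0) -> str:
    """Apply the slide's rules of thumb to one pairwise alignment."""
    # More than 1 gap per 20 aligned residues makes the alignment suspect.
    if gaps * 20 > length:
        return "suspicious: more than 1 gap per 20 residues"
    if percent_identity < 15:
        return "probably not related"
    if percent_identity <= 25:
        return "possibly related; more tests needed"
    if length > 100:
        return "likely related"
    return "no rule applies (short alignment)"


print(rule_of_thumb(150, 30.0))         # likely related
print(rule_of_thumb(150, 20.0))         # possibly related; more tests needed
print(rule_of_thumb(100, 40.0, gaps=6)) # suspicious: more than 1 gap per 20 residues
```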

Doolittle’s Rules of Thumb

[Figure: Evolutionary Distance vs. Percent Sequence Identity. Sequence identity (%) plotted against number of residues (0 to 400), with the “Twilight Zone” marked]

Sequence Alignment - Methods

• Dot Plots
• Dynamic Programming
• Heuristic (Fast) Local Alignment
• Multiple Sequence Alignment
• Contig Assembly
Dot Plots

• “Invented” in 1970 by Gibbs & McIntyre
• Good for quick graphical overview
• Simplest method for sequence comparison
• Inter-sequence comparison
• Intra-sequence comparison
  • Identifies internal repeats
  • Identifies domains or “modules”

Dot Plots

[Figure: example dot plot]

Dot matrices

[Figure: dot matrix with a c g c g along the top and a c a c g down the side]

Dot Plots & Internal Repeats

[Figure: dot plot showing internal repeats]
Dot Plot Algorithm

• Take two sequences (A & B), write sequence A out as a row (length = m) and sequence B as a column (length = n)
• Create a table or “matrix” of “m” columns and “n” rows
• Compare each letter of sequence A with every letter in sequence B. If there’s a match mark it with a dot, if not, leave blank

Dot Plot Algorithm

[Figure: empty dot plot grid with A C D E F G H G along the top and A C D E F G H G down the side]
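The dot plot procedure described above can be sketched in a few lines. The function names `dot_plot` and `render` are illustrative only, and a `*` is printed where the slide would draw a dot.

```python
def dot_plot(seq_a: str, seq_b: str) -> list[list[bool]]:
    """Return an n x m grid: rows follow seq_b, columns follow seq_a."""
    return [[a == b for a in seq_a] for b in seq_b]


def render(seq_a: str, seq_b: str) -> str:
    """Draw the grid as text, marking every match with '*'."""
    lines = ["  " + " ".join(seq_a)]
    for b, row in zip(seq_b, dot_plot(seq_a, seq_b)):
        lines.append(b + " " + " ".join("*" if hit else " " for hit in row))
    return "\n".join(lines)


# Sequence from the slide, compared against itself.
print(render("ACDEFGHG", "ACDEFGHG"))
```

Comparing a sequence with itself gives the full main diagonal, plus off-diagonal marks wherever a letter repeats (here the two G residues).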

Dot Plots

• Dot plots are useful as a first-level filter for determining an alignment between two sequences.
• Regions of similarity will show up as diagonals within the dot plot matrix.
Dot Plots
• Most commercial packages offer good dot plot programs, including:
  • GCG/Omiga (Pharmacopeia)
  • PepTool (BioTools Inc.)
  • LaserGene (DNAStar)
• Popular freeware package is Dotter www.cgr.ki.se/cgr/groups/sonnhammer/Dotter.html
• Dotlet http://www.isrec.isb-sib.ch/java/dotlet/Dotlet.html
• JDotter http://athena.bioc.uvic.ca/sars/jdotter/main.php

Dynamic Programming

Match matrix:
    G  E  N  E  T  I  C  S
G  10  0  0  0  0  0  0  0
E   0 10  0 10  0  0  0  0
N   0  0 10  0  0  0  0  0
E   0  0  0 10  0  0  0  0
S   0  0  0  0  0  0  0 10
I   0  0  0  0  0 10  0  0
S   0  0  0  0  0  0  0 10

Cumulative score matrix:
    G  E  N  E  T  I  C  S
G  60 40 30 20 20  0 10  0
E  40 50 30 30 20  0 10  0
N  30 30 40 20 20  0 10  0
E  20 20 20 30 20 10 10  0
S  20 20 20 20 20  0 10 10
I  10 10 10 10 10 20 10  0
S   0  0  0  0  0  0  0 10

Resulting alignment:
G E N E T I C S
| | | | * | |
G E N E S I S

Pair-wise sequence alignments

Idea: Display one sequence above another with spaces inserted in both to reveal similarity

A: C A T - T C A - C
   |   |     | |   |
B: C - T C G C A G C

Two types of alignment

S = CTGTCGCTGCACG
T = TGCCGTG

Global alignment
CTGTCG-CTGCACG
-TGC-CG-TG----

Local alignment
CTGTCGCTGCACG--
-------TGC-CGTG
Dynamic Programming

• Developed by Needleman & Wunsch (1970)
• Refined by Smith & Waterman (1981)
• Ideal for quantitative assessment
• Guaranteed to be mathematically optimal
• Slow N² algorithm
• Performed in 2 stages
  • Prepare a scoring matrix using recursive function
  • Scan matrix diagonally using traceback protocol

Identity Scoring Matrix (Sij)

  A R N D C Q E G H I L K M F P S T W Y V
A 1
R 0 1
N 0 0 1
D 0 0 0 1
C 0 0 0 0 1
Q 0 0 0 0 0 1
E 0 0 0 0 0 0 1
G 0 0 0 0 0 0 0 1
H 0 0 0 0 0 0 0 0 1
I 0 0 0 0 0 0 0 0 0 1
L 0 0 0 0 0 0 0 0 0 0 1
K 0 0 0 0 0 0 0 0 0 0 0 1
M 0 0 0 0 0 0 0 0 0 0 0 0 1
F 0 0 0 0 0 0 0 0 0 0 0 0 0 1
P 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
S 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
W 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
Y 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
V 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
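The Recursive Function

$$
S_{ij} = s_{ij} + \max
\begin{cases}
S_{i-1,\,j-1} \\
\max\limits_{2 \le x \le i} \left( S_{i-x,\,j-1} + w_{x-1} \right) \\
\max\limits_{2 \le y \le j} \left( S_{i-1,\,j-y} + w_{y-1} \right)
\end{cases}
$$

w = gap penalty
S = alignment score
s = score for pairing residue i with residue j (from the scoring matrix)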
An Example...

G A A T T C A G T T A (sequence #1)
G G A T C G A (sequence #2)

Three steps in dynamic programming
Initialization
Matrix fill (scoring)
Traceback (alignment)

Initialization Step
• To create a matrix with M + 1 columns and N + 1 rows where M and N correspond to the size of the sequences to be aligned
• Since this example assumes there is no gap opening or gap extension penalty, the first row and first column of the matrix can be initially filled with 0.

Matrix Fill Step
• For each position, Mi,j is defined to be the maximum score at position i, j;
• Mi,j = MAX [
  Mi-1,j-1 + Si,j (match/mismatch in the diagonal),
  Mi,j-1 + w (gap in sequence #1),
  Mi-1,j + w (gap in sequence #2)]

Note that in the example, Mi-1,j-1 will be red, Mi,j-1 will be green and Mi-1,j will be blue.

• Using this information, the score at position 1,1 in the matrix can be calculated. Since the first residue in both sequences is a G, S1,1 = 1, and by the assumptions stated at the beginning, w = 0. Thus,
M1,1 = MAX[M0,0 + 1, M1,0 + 0, M0,1 + 0] = MAX [1, 0, 0] = 1
• A value of 1 is then placed in position 1,1 of the scoring matrix.
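The initialization and matrix fill steps above can be sketched as follows. This is a minimal illustration, assuming the example's scoring (1 for a match, 0 for a mismatch, gap penalty w = 0); the function name `fill_matrix` and its parameter names are not from the slides.

```python
def fill_matrix(seq1: str, seq2: str, match: int = 1, mismatch: int = 0,
                gap: int = 0) -> list[list[int]]:
    """Fill the score matrix: seq1 runs along the columns, seq2 along the rows."""
    cols, rows = len(seq1), len(seq2)
    M = [[0] * (cols + 1) for _ in range(rows + 1)]
    # Initialization: first row and column accumulate gap penalties (all 0 here).
    for j in range(1, cols + 1):
        M[0][j] = M[0][j - 1] + gap
    for i in range(1, rows + 1):
        M[i][0] = M[i - 1][0] + gap
    # Matrix fill: best of diagonal (match/mismatch), left and upper neighbours.
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            s = match if seq2[i - 1] == seq1[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1] + s, M[i][j - 1] + gap, M[i - 1][j] + gap)
    return M


M = fill_matrix("GAATTCAGTTA", "GGATCGA")
print(M[1][1])    # 1, the value worked out for position 1,1
print(M[-1][-1])  # 6, the final score in the bottom-right cell
```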

• Since the gap penalty (w) is 0, the rest of row 1 and column 1 can be filled in with the value 1.
• Take the example of row 1
– At column 2, the value is the max of 0 (for a mismatch), 0 (for a vertical gap) or 1 (horizontal gap). The rest of row 1 can be filled out similarly until we get to column 8.
– At this point, there is a G in both sequences (light blue). Thus, the value for the cell at row 1 column 8 is the maximum of 1 (for a match), 0 (for a vertical gap) or 1 (horizontal gap). The value will again be 1.
– The rest of row 1 and column 1 can be filled with 1 using the above reasoning.

• Now let's look at column 2.
– The location at row 2 will be assigned the value of the maximum of 1 (mismatch), 1 (horizontal gap) or 1 (vertical gap). So its value is 1.
– At the position column 2 row 3, there is an A in both sequences. Thus, its value will be the maximum of 2 (match), 1 (horizontal gap), 1 (vertical gap) so its value is 2.
• Moving along to position column 2 row 4, its value will be the maximum of 1 (mismatch), 1 (horizontal gap), 2 (vertical gap) so its value is 2. Note that for all of the remaining positions except the last one in column 2, the choices for the value will be the exact same as in row 4 since there are no matches.
• The final row will contain the value 2 since it is the maximum of 2 (match), 1 (horizontal gap) and 2 (vertical gap).

• Using the same techniques as described for column 2, we can fill in column 3.

• After filling in all of the values the score matrix is as follows

[Figure: completed score matrix]
Traceback Step

• After the matrix fill step, the maximum alignment score for the two test sequences is 6.
• The traceback step determines the actual alignment(s) that result in the maximum score

• The traceback step begins in the M,N position in the matrix (the bottom-right cell), i.e. the position that leads to the maximal score. In this case, there is a 6 in that location.

• Traceback takes the current cell and looks to the neighbor cells that could be direct predecessors.
• This means it looks to the neighbor to the left (gap in sequence #2), the diagonal neighbor (match/mismatch), and the neighbor above it (gap in sequence #1).
• The algorithm for traceback chooses as the next cell in the sequence one of the possible predecessors.
• In this case, the neighbors are marked in red. They are all also equal to 5.

• Since the current cell has a value of 6 and the scores are 1 for a match and 0 for anything else, the only possible predecessor is the diagonal match/mismatch neighbor.
• If more than one possible predecessor exists, any can be chosen. In this case, it is the cell with the red 5.
– This gives us a current alignment of
(Seq #1) A
         |
(Seq #2) A
• The alignment as described in the above step adds a gap to sequence #2, so the current alignment is
(Seq #1) TA
          |
(Seq #2) _A
Once again, the direct predecessor produces a gap in sequence #2.

After this step, the current alignment is
(Seq #1) TTA
           |
(Seq #2) __A

• Continuing on with the traceback step, we eventually get to a position in column 0 row 0 which tells us that traceback is completed.
• One possible maximum alignment is:

Giving an alignment of:
GAATTCAGTTA
| | || |  |
GGA_TC_G__A

• An alternate solution is:

Giving an alignment of:
G_AATTCAGTTA
|  | || |  |
GG_A_TC_G__A

• Note:
– There are more alternative solutions, each resulting in a maximal global alignment score of 6.
– Since the number of equally good alignments can grow exponentially, most dynamic programming algorithms will only print out a single solution.
[Figure: the same matrix after initialization, with only the first row, first column and first few cells filled in]

      λ    C    T    C    G    C    A    G    C
λ     0   -5  -10  -15  -20  -25  -30  -35  -40
C    -5   10    5    0   -5  -10  -15  -20  -25
A   -10    5    8    3   -2   -7    0   -5  -10
T   -15    0   15   10    5    0   -5   -2   -7
T   -20   -5   10   13    8    3   -2   -7   -4
C   -25  -10    5   20   15   18   13    8    3
A   -30  -15    0   15   18   13   28   23   18
C   -35  -20   -5   10   13   28   23   26   33

+10 for match, -2 for mismatch, -5 for space
Traceback can yield both optimum alignments
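The matrix above can be recomputed, and the number of equally good tracebacks counted, with a short sketch. It assumes the scoring stated on the slide (+10 match, -2 mismatch, -5 per space); the function name `score_and_count` and the idea of carrying a path count alongside each score are illustrative additions, not part of the slides.

```python
def score_and_count(seq1: str, seq2: str, match: int = 10, mismatch: int = -2,
                    gap: int = -5) -> tuple[list[list[int]], int]:
    """Return the filled score matrix and the number of optimal alignments."""
    n, m = len(seq1), len(seq2)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    ways = [[0] * (m + 1) for _ in range(n + 1)]
    ways[0][0] = 1
    for i in range(1, n + 1):
        score[i][0], ways[i][0] = i * gap, 1
    for j in range(1, m + 1):
        score[0][j], ways[0][j] = j * gap, 1
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            candidates = [
                (score[i - 1][j - 1] + s, ways[i - 1][j - 1]),  # diagonal
                (score[i - 1][j] + gap, ways[i - 1][j]),        # from above
                (score[i][j - 1] + gap, ways[i][j - 1]),        # from the left
            ]
            best = max(value for value, _ in candidates)
            score[i][j] = best
            # Every predecessor that ties for the best score adds its paths.
            ways[i][j] = sum(count for value, count in candidates if value == best)
    return score, ways[n][m]


matrix, n_optimal = score_and_count("CATTCAC", "CTCGCAGC")
print(matrix[-1])  # [-35, -20, -5, 10, 13, 28, 23, 26, 33]
print(n_optimal)   # 2
```

The count of 2 agrees with the slide's remark that traceback yields two optimum alignments.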

Could We Do Better?

• Key to the performance of Dynamic Programming is the scoring function
• Dynamic Programming always gives the mathematically correct answer
• Dynamic Programming does not always give the biologically correct answer
• The weakest link: the scoring matrix

Local vs. Global Pairwise Alignments

• A global alignment includes all elements of the sequences and includes gaps.
– A global alignment may or may not include "end gap" penalties.
– Global alignments are better indicators of homology and take longer to compute.
• A local alignment includes only subsequences, and sometimes is computed without gaps.
– Local alignments can find shared domains in divergent proteins and are fast to compute
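To make the local/global distinction concrete, here is a minimal sketch of a local alignment score in the style of Smith & Waterman: it is the same dynamic programming fill, except that a cell may never drop below 0 and the answer is the best cell anywhere in the matrix. The scoring values (match 2, mismatch -1, gap -2), the test sequences and the function name are assumptions chosen for illustration.

```python
def local_alignment_score(a: str, b: str, match: int = 2, mismatch: int = -1,
                          gap: int = -2) -> int:
    """Best local alignment score between a and b (score only, no traceback)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # The extra 0 lets an alignment start fresh anywhere.
            value = max(0, prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap)
            cur.append(value)
            best = max(best, value)
        prev = cur
    return best


# Two otherwise unrelated sequences sharing the segment GATTACA.
print(local_alignment_score("TTTTGATTACATTTT", "CCCGATTACACCC"))  # 14
```

The score of 14 comes from the seven matched residues of the shared segment; the unrelated flanks are simply left out, which a global alignment could not do.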

Scoring systems
• Match scores: s(a,b) assigns a score to each combination of aligned letters. Examples: transition/transversion, PAM, Blosum.
• Gap score: f(g) assigns a score to a gap of length g. Examples: linear, affine.
• Scoring systems are usually additive: the total score is the sum of the substitution scores and all the gap scores.

Match Scores
• Some amino acids are more “substitutable” for each other than others. Serine and threonine are more alike than tryptophan and alanine.
• We can introduce "mismatch costs" for handling different substitutions.
• We don't usually use mismatch costs in aligning nucleotide sequences, since often no substitution is per se better than any other.

Match Score Tables

• Match scores are computed using a lookup table with an entry for each possible pair of letters.
• Match scores are often calculated on the basis of the frequency of particular mutations in very similar sequences.

Scoring Matrices

• An empirical model of evolution, biology and chemistry all wrapped up in a 20 X 20 table of integers
• Structurally or chemically similar residues should ideally have high diagonal or off-diagonal numbers
• Structurally or chemically dissimilar residues should ideally have low diagonal or off-diagonal numbers
A Better Matrix - PAM250

    A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
A   2
R  -2  6
N   0  0  2
D   0 -1  2  4
C  -2 -4 -4 -5  4
Q   0  1  1  2 -5  4
E   0 -1  1  3 -5  2  4
G   1 -3  0  1 -3 -1  0  5
H  -1  2  2  1 -3  3  1 -2  6
I  -1 -2 -2 -2 -2 -2 -2 -3 -2  5
L  -2 -3 -3 -4 -6 -2 -3 -4 -2  2  6
K  -1  3  1  0 -5  1  0 -2  0 -2 -3  5
M  -1  0 -2 -3 -5 -1 -2 -3 -2  2  4  0  6
F  -4 -4 -4 -6 -4 -5 -5 -5 -2  1  2 -5  0  9
P   1  0 -1 -1 -3  0 -1 -1  0 -2 -3 -1 -2 -5  6
S   1  0  1  0  0 -1  0  1 -1 -1 -3  0 -2 -3  1  3
T   1 -1  0  0 -2 -1  0  0 -1  0 -2  0 -1 -2  0  1  3
W  -6  2 -4 -7 -8 -5 -7 -7 -3 -5 -2 -3 -4  0 -6 -2 -5 17
Y  -3 -4 -2 -4  0 -4 -4 -5  0 -1 -1 -4 -2  7 -5 -3 -3  0 10
V   0 -2 -2 -2 -2 -2 -2 -1 -2  4  2 -2  2 -1 -1 -1  0 -6 -2  4

Using PAM250... (Gap Penalty = -1)

Scores needed for this example:
    A  T  V  D
A   2
T   1  3
V   0  0  4
D   0  0 -2  4

The matrix is filled in cell by cell, row by row, for AVVD (rows) against AATVD (columns). The completed matrix is:

    A  A  T  V  D
A   2  1  0 -1 -1
V  -1  2  1  5 -1
V  -1  1  2  5  3
D  -1  1  1  0  9

Resulting alignment:
AATVD
|  ||
AV-VD

PAM Matrices

• Developed by M.O. Dayhoff (1978)
• PAM = Point Accepted Mutation
• Matrix assembled by looking at patterns of substitutions in closely related proteins
• 1 PAM corresponds to 1 amino acid change per 100 residues
• 1 PAM = 1% divergence or 1 million years in evolutionary history
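The small "Using PAM250" example (AVVD against AATVD, gap penalty -1) can be reproduced with the recursive function given earlier in the lecture, where each cell adds its own pair score to the best of the diagonal neighbour or any gapped predecessor in the previous row or column. The sketch below rests on assumptions: the gap penalty is taken as a constant -1 regardless of gap length, x and y run from 2 up to i and j inclusive, and a border row and column holding one gap penalty is added so that the first row and column come out as on the slide. The names `SCORES`, `sub_score` and `fill_scores` are illustrative.

```python
# Pair scores copied from the small table in the "Using PAM250" example.
SCORES = {
    ("A", "A"): 2,
    ("T", "A"): 1, ("T", "T"): 3,
    ("V", "A"): 0, ("V", "T"): 0, ("V", "V"): 4,
    ("D", "A"): 0, ("D", "T"): 0, ("D", "V"): -2, ("D", "D"): 4,
}


def sub_score(a: str, b: str) -> int:
    """Look up a pair score; the table is symmetric."""
    return SCORES[(a, b)] if (a, b) in SCORES else SCORES[(b, a)]


def fill_scores(rows: str, cols: str, w: int = -1) -> list[list[int]]:
    """Fill S[i][j] = s_ij + max(diagonal, gapped predecessors + w)."""
    n, m = len(rows), len(cols)
    # Assumed border: leading gaps cost a single penalty w.
    S = [[w] * (m + 1) for _ in range(n + 1)]
    S[0][0] = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = S[i - 1][j - 1]
            for x in range(2, i + 1):      # skip residues in the row sequence
                best = max(best, S[i - x][j - 1] + w)
            for y in range(2, j + 1):      # skip residues in the column sequence
                best = max(best, S[i - 1][j - y] + w)
            S[i][j] = sub_score(rows[i - 1], cols[j - 1]) + best
    return [row[1:] for row in S[1:]]      # drop the border


for letter, row in zip("AVVD", fill_scores("AVVD", "AATVD")):
    print(letter, row)
```

Under these assumptions the four printed rows match the completed matrix of the example, ending in the final score of 9.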

Dynamic Programming

• Great for doing pairwise global alignments
• Produces a quantitative alignment “score”
• Problems if one tries to do alignments with very large sequences (memory requirement grows as N² or as N x M)
• Serious problems if one tries to align one sequence against a database (10’s of hours)
• Need an alternative…..

Fast Local Alignment Methods

[Figure: a query sequence compared against many database sequences]
ACDEAGHNKLM...
KKDEFGHPKLM...
SCDEFCHLKLM...
MCDEFGHNKLV...
ACDEFGHIKLM...
QCDEFGHAKLM...
AQQQFGHIKLPI...
WCDEFGHLKLM...
SMDEFAHVKLM...
ACDEFGFKKLM...
Fast Local Alignment Methods

• Developed by Lipman & Pearson (1985/88)
• Refined by Altschul et al. (1990/97)
• Ideal for large database comparisons
• Uses heuristics & statistical simplification
• Fast N-type algorithm (similar to Dot Plot)
• Cuts sequences into short words (k-tuples)
• Uses “Hash Tables” to speed comparison

Fast Alignment Algorithm

Query: ACDEFGDEF…..

Word:     ACD  CDE  DEF  EFG  FGD  GDE  …
Position: 1    2    3,7  4    5    6

Words and similar words:
ACD  CDE  DEF  EFG  FGD  GDE
ACE  CDD  NEF  …    …    …
GCE  CEE  DEY  …    …    …
GCD  DDY  …    …    …
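The word table above can be built with an ordinary hash table (a Python dictionary). This sketch only handles identical k-tuples, whereas the slide also lists similar words such as ACE or DEY; the function names are made up for illustration, and the target string is the database sequence LMRGCDDYGDEY from the lecture's fast alignment example.

```python
from collections import defaultdict


def index_words(seq: str, k: int = 3) -> dict[str, list[int]]:
    """Map every k-letter word in seq to its 1-based start positions."""
    table = defaultdict(list)
    for start in range(len(seq) - k + 1):
        table[seq[start:start + k]].append(start + 1)
    return dict(table)


def shared_words(query: str, target: str, k: int = 3) -> list[tuple[str, int, int]]:
    """List (word, query position, target position) for identical k-tuples."""
    table = index_words(query, k)
    hits = []
    for start in range(len(target) - k + 1):
        word = target[start:start + k]
        for query_pos in table.get(word, []):
            hits.append((word, query_pos, start + 1))
    return hits


print(index_words("ACDEFGDEF")["DEF"])            # [3, 7]
print(shared_words("ACDEFGDEF", "LMRGCDDYGDEY"))  # [('GDE', 6, 9)]
```

Each database word is looked up in constant time, which is why the search cost grows roughly linearly with sequence length rather than quadratically.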

Fast Alignment Algorithm

Query: ACDEFGDEF…..

ACD  CDE  DEF  EFG  FGD  GDE
ACE  CDD  NEF  …    …    …
GCE  CEE  DEY  …    …    …
GCD  DDY  …    …    …

Database: LMRGCDDYGDEY…

Fast Alignment Algorithm

[Figure: dot matrix with the query A C D E F G D E F... along the top and the database sequence L M R G C D D Y G... down the side]

Fast Alignment Algorithm

[Figure: fast alignment algorithm]

FASTA

• Developed in 1985 and 1988 (W. Pearson)
• Looks for clusters of nearby or locally dense “identical” k-tuples
• init1 score = score for first set of k-tuples
• initn score = score for gapped k-tuples
• opt score = optimized alignment score
• Z-score = number of S.D. above random
• expect = expected # of random matches
FASTA

gi|135775|sp|P08628|THIO_RABIT THIOREDOXIN (104 aa)
initn: 641 init1: 641 opt: 642 Z-score: 806.4 expect() 3.2e-38
Smith-Waterman score: 642; 86.538% identity in 104 aa overlap (2-105:1-104)

gi|135 2- 105: --------------------------------------------------------------------:

10 20 30 40 50 60 70 80
thiore MVKQIESKTAFQEALDAAGDKLVVVDFSATWCGPCKMINPFFHSLSEKYSNVIFLEVDVDDCQDVASECEVKCTPTFQFF
:::::::.::::.::.:::::::::::::::::::::.::::.::::..::.:.:::::::.:.:.:::::: ::::::
gi|135 VKQIESKSAFQEVLDSAGDKLVVVDFSATWCGPCKMIKPFFHALSEKFNNVVFIEVDVDDCKDIAAECEVKCMPTFQFF
10 20 30 40 50 60 70

90 100
thiore KKGQKVGEFSGANKEKLEATINELV
::::::::::::::::::::::::.
gi|135 KKGQKVGEFSGANKEKLEATINELL
80 90 100

Multiple Sequence Alignment

[Figure: Multiple alignment of Calcitonins]
Multiple Sequence Alignment

• Developed and refined by many (Doolittle, Barton, Corpet) through the 1980’s
• Used extensively for extracting hidden phylogenetic relationships and identifying sequence families
• Powerful tool for extracting new sequence motifs and signature sequences

Multiple Alignment Algorithm

1. Take all “n” sequences and perform all possible pairwise (n(n-1)/2) alignments
2. Identify highest scoring pair, perform an alignment & create a consensus sequence
3. Select next most similar sequence and align it to the initial consensus, regenerate a second consensus
4. Repeat step 3 until finished
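The first two steps of the procedure above (score all n(n-1)/2 pairs, then pick the best pair) can be sketched as follows. The longest common subsequence length is used here purely as a stand-in similarity score; it is an assumption for illustration, not the scoring used by any particular multiple alignment program, and the function names are made up. The four sequences are taken from the fast local alignment slide.

```python
from itertools import combinations


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence (a simple similarity proxy)."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        cur = [0]
        for j, other in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if ch == other else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def most_similar_pair(seqs: list[str]) -> tuple[int, tuple[int, int]]:
    """Return the number of pairs scored and the indices of the best pair."""
    pairs = list(combinations(range(len(seqs)), 2))
    best = max(pairs, key=lambda p: lcs_length(seqs[p[0]], seqs[p[1]]))
    return len(pairs), best


seqs = ["ACDEFGHIKLM", "ACDEAGHNKLM", "KKDEFGHPKLM", "SMDEFAHVKLM"]
n_pairs, best = most_similar_pair(seqs)
print(n_pairs)  # 6 pairs for n = 4, i.e. 4 * 3 / 2
print(best)     # (0, 1)
```

The quadratic growth in the number of pairs is what makes this first step expensive for large sequence sets.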

Multiple Alignment

• Most commercial vendors offer good multiple alignment programs including:
  • GCG (Accelrys)
  • PepTool/GeneTool (BioTools Inc.)
  • LaserGene (DNAStar)
• Popular web servers include T-COFFEE, MULTALIN and CLUSTALW
• Popular freeware includes PHYLIP & PAUP

Multi-Align Websites

• Match-Box http://www.fundp.ac.be/sciences/biologie/bms/matchbox_submit.shtml
• MUSCA http://cbcsrv.watson.ibm.com/Tmsa.html
• T-Coffee http://www.ebi.ac.uk/Tools/msa/tcoffee/
• MULTALIN http://www.toulouse.inra.fr/multalin.html
• CLUSTALW http://www.ebi.ac.uk/clustalw/

T-Coffee

• Uses standard progressive alignment but with a “twist” to avoid local minima
• Allows the combination of a collection of multiple/pairwise, global or local alignments into a single model
• It also makes it possible to estimate the level of consistency of each position within the new alignment with the rest of the alignments

http://www.ebi.ac.uk/Tools/msa/tcoffee/
Multi-alignment & Contig Assembly

ATCGATGCGTAGCAGACTACCGTTACGATGCCTT…
TAGCTACGCATCGTCTGATGGCAATGCTACGGAA..

Contig Assembly

• Read, edit & trim DNA chromatograms
• Remove overlaps & ambiguous calls
• Read in all sequence files (10-10,000)
• Reverse complement all sequences (doubles # of sequences to align)
• Remove vector sequences (vector trim)
• Remove regions of low complexity
• Perform multiple sequence alignment
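The reverse-complement step in the list above is easy to sketch. The two strands shown on the "Multi-alignment & Contig Assembly" slide are complements of each other read in the same direction; reversing the complement gives the sequence as it would be read from the opposite strand. The function names are illustrative, and only the four unambiguous bases are handled.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(seq: str) -> str:
    """Base-by-base complement, same left-to-right order."""
    return seq.translate(COMPLEMENT)


def reverse_complement(seq: str) -> str:
    """Complement read in the opposite direction."""
    return complement(seq)[::-1]


top = "ATCGATGCGTAGCAGACTACCGTTACGATGCCTT"
print(complement(top))          # TAGCTACGCATCGTCTGATGGCAATGCTACGGAA
print(reverse_complement(top))
```

Because a shotgun read may come from either strand, an assembler typically considers both the read and its reverse complement, which is why the slide notes that this step doubles the number of sequences to align.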

Contig Assembly = Multiple Alignment

1. Only accept a very high sequence identity
2. Accept unlimited number of “end” gaps
3. Very high cost for opening “internal” gaps
4. A short match with high score/residue is preferred over a long match with low score/residue

Assembly Parameters

• User-selected parameters
1. minimum length of overlap
2. percent identity within overlap
• Non-adjustable parameters
1. sequence “quality” factors
[Figure: Chromatogram Editing]

[Figure: Sequence Loading]

[Figure: Sequence Alignment]

Contig Alignment - Process

ATCGATGCGTAGCAGACTACCGTTACGATGCCTT…
Problems for Assembly

• Repeat regions
– Capture sequences from non-contiguous regions
• Polymorphisms
– Cause failure to join correct regions
• Large data volume
– Requires large numbers of pair-wise comparisons

Sequence Assembly Programs

• Phred - base calling program that does detailed statistical analysis (UNIX) http://www.phrap.org/
• Phrap - sequence assembly program (UNIX) http://www.phrap.org/
• TIGR Assembler - microbial genomes (UNIX) http://www.tigr.org/softlab/assembler/
• The Staden Package (UNIX) http://www.mrc-lmb.cam.ac.uk/pubseq/
• GeneTool/ChromaTool/Sequencher (PC/Mac)

http://bio.ifom-firc.it/ASSEMBLY/assemble.html
Phrap
• Phrap is a program for assembling shotgun DNA
sequence data
• Uses a combination of user‐supplied and
internally computed data quality information to
improve assembly accuracy in the presence of
repeats
• Constructs the contig sequence as a mosaic of
the highest quality read segments rather than a
consensus
• Handles large datasets

Conclusions

• Sequence alignments and database searching are key to all of bioinformatics
• There are four different methods for doing
sequence comparisons 1) Dot Plots; 2) Dynamic
Programming; 3) Fast Alignment; and 4)
Multiple Alignment
• Understanding the significance of alignments
requires an understanding of statistics and
distributions
