Pairwise Alignment


Sándor Pongor

Pairwise sequence alignment


With slides adapted from Jack Leunissen and Christoph Sensen
Algorithms
Representations (unstructured, structured, mixed).
Comparison: 1) Proximity measures (similarities,
distances) 2) Motifs (from pairwise and multiple
alignment of sequences)
Main distance and similarity measures
Granularity problems
Dot plot
Last lecture (on algorithms)
Edit distance (refresh)
Substitution matrices (PAM, BLOSUM, how to build
your own)
Algorithm types: exhaustive and heuristic, global
and local
Exhaustive global alignment: Needleman-Wunsch
Exhaustive local alignment: Smith-Waterman
This lecture
The score S is the sum of the costs assigned to identities
and mismatches, minus a penalty for gaps. Costs
are stored in the substitution matrix. The gap penalty is
usually the sum of a gap-opening and a gap-extension cost.
HSP, high scoring
segment pair
Sequence similarity score
2011.10.17..
TMOP 4.1.2-08/2/A/KMR-2009-0006
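Alignment score
$$\text{Score} = \sum_{\text{start}}^{\text{end}} \text{Similarity}_{\text{weights}} - \sum_{\text{start}}^{\text{end}} \text{Penalties}$$
$$\text{Penalty}_{\text{gap}} = \text{Gap}_{\text{init}} + \text{Gap}_{\text{length}}$$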
Linear
Penalty rises monotonically with the length of the gap
Affine
Penalty has a gap-opening and a separate length
component
Probabilistic
Penalties may depend upon the character of the
residues involved
Other functions
Gap penalty functions
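The linear and affine penalty functions listed above can be sketched as follows. This is a minimal illustration, not a reference implementation: the function names and the default costs are assumptions (the affine defaults mirror the -(10 + k) gap cost quoted later in these slides).

```python
def linear_gap_penalty(length, per_residue=1.0):
    """Linear: the penalty grows in proportion to the gap length."""
    return per_residue * length


def affine_gap_penalty(length, gap_open=10.0, gap_extend=1.0):
    """Affine: a one-off gap-opening cost plus a separate length component."""
    if length <= 0:
        return 0.0  # no gap, no penalty
    return gap_open + gap_extend * length


# Compare the two schemes for a few gap lengths.
for k in (1, 2, 5):
    print(k, linear_gap_penalty(k), affine_gap_penalty(k))
```

With an affine penalty, one long gap is cheaper than several short gaps of the same total length, because the opening cost is paid only once.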
A simple example (without gaps):
For a match/mismatch we look up the value in the
substitution matrix. The matrix is a lookup table
Introduction to bioinformatics: Similarity Searching and BLAST
The substitution matrix (also called scoring matrix)
contains costs for amino acid identities and
substitutions in an alignment.
It is a 20x20 symmetrical matrix that can be
constructed from pairwise alignments of related
sequences
Related means either
a) evolutionary relatedness described by
an approved evolutionary tree (Dayhoff's PAM)
Substitution matrices in detail

Calculation of scoring matrices from multiple
alignments.
$$M(S/T) = \log\left(\frac{f(S/T)}{f(S)\,f(T)}\right)$$
Matrix elements are calculated from the
observed and expected frequencies (using a log
odds principle). E.g. for S/T (indicated by red):
The values are calculated from many (not just
one) multiple alignments. The log-odds values in
the matrix are then normalized to a range (e.g. -5
to +15) depending on the application
Substitution matrices
ASDESKLVV
|
ATDDATLSI
| |
ASDSERITV
f(S/T)=3
f(S)=4, f(T)=2
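As a rough numerical check of the log-odds principle, M(S/T) = log( f(S/T) / (f(S) f(T)) ), the sketch below plugs in the counts given for the example above. The function name and the base of the logarithm (10) are assumptions, since the slide does not state a base; a real matrix would use normalized frequencies pooled from many alignments rather than raw counts from one.

```python
import math


def log_odds(pair_count, count_a, count_b, base=10.0):
    """Log of observed pair frequency over the product of single frequencies."""
    return math.log(pair_count / (count_a * count_b), base)


# Counts from the slide: f(S/T) = 3, f(S) = 4, f(T) = 2.
m_st = log_odds(3, 4, 2)
print(round(m_st, 3))
```

A negative value means the pair is seen less often than expected by chance under these counts; a positive value means it is seen more often.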
PAM matrices
Point Accepted Mutation: unit of evolutionary
change for protein sequences [Dayhoff78].
Calculated from related sequences organized into
accepted evolutionary trees (71 trees, only 1572
exchanges)
20x20 matrix, columns add up to the number of cases
observed.
All entries × 10^4
Converted into
scoring matrix by
log-odds and
scaling
Pam_1 = 1% of amino acids mutate
Pam_30 = (Pam_1)^30 (matrix multiplication)
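The extrapolation from Pam_1 to Pam_30 is plain repeated matrix multiplication. The sketch below shows it on a toy two-letter alphabet; the 2x2 matrix is a made-up stand-in for the real 20x20 mutation probability matrix, and the helper names are assumptions.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, m)
    return result


# Toy "Pam_1": each of two letters has a 1% chance of mutating per step.
pam1 = [[0.99, 0.01],
        [0.01, 0.99]]
pam30 = mat_pow(pam1, 30)
print([round(x, 4) for x in pam30[0]])
```

Each row still sums to 1 after multiplication, while the off-diagonal (mutation) probabilities grow with the PAM number.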

PAM 250 (the higher the number, the higher the divergence)
Note: chemically similar amino acids are near each other
[Figure: PAM 250 matrix with the amino acids grouped as small, polar, basic, large, aromatic]
BLOSUM Matrices (most often used)
Developed by Henikoff & Henikoff (1992)
BLOcks SUbstitution Matrix
Derived from the BLOCKS database
PAM Matrices
Developed by Schwartz and Dayhoff (1978)
Point Accepted Mutation
Derived from manual alignments of closely related
proteins
Scoring Matrices used today
PAM versus BLOSUM
PAM
First useful scoring matrix for proteins
Assumed a Markov model of evolution (i.e. all sites equally mutable and independent)
Derived from small, closely related proteins with ~15% divergence
BLOSUM
Much later entry to the matrix sweepstakes
No evolutionary model is assumed
Built from PROSITE-derived sequence blocks
Uses a much larger, more diverse set of protein sequences (30% - 90% ID)
PAM versus BLOSUM
PAM
Higher PAM numbers to detect more remote sequence similarities
Lower PAM numbers to detect high similarities
1 PAM ~ 1 million years of divergence
Errors in PAM 1 are scaled 250X in PAM 250
BLOSUM
Lower BLOSUM numbers to detect more remote sequence similarities
Higher BLOSUM numbers to detect high similarities
Sensitive to structural and functional substitution
Errors in BLOSUM arise from errors in alignment
PAM Matrices
PAM 40 - prepared by multiplying PAM 1 by itself a total of 40 times
best for short alignments with high similarity
PAM 120 - prepared by multiplying PAM 1 by itself a total of 120 times
best for general alignment
PAM 250 - prepared by multiplying PAM 1 by itself a total of 250 times
BLOSUM Matrices
BLOSUM 90 - prepared from BLOCKS sequences with >90% sequence ID
best for short alignments with high similarity
BLOSUM 62 - prepared from BLOCKS sequences with >62% sequence ID
best for general alignment (default)
BLOSUM 30 - prepared from BLOCKS sequences with >30% sequence ID
Scores
Sequence 1    V    D    S    -    C    Y
Sequence 2    V    E    T    L    C    F   Total
BLOSUM62     +4   +2   +1  -12   +9   +3       7
PAM30        +7   +2    0  -10  +10   +2      11
Slide by David Landsman, NCBI
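The BLOSUM62 total on this slide can be reproduced by summing the per-column values. The sketch below is only an illustration: the dictionary holds just the five pair scores shown on the slide (not the full matrix), the gap is assumed to sit opposite the L because that is the column carrying the -12, and the function name is hypothetical.

```python
# Only the BLOSUM62 entries that appear on the slide.
BLOSUM62_SUBSET = {("V", "V"): 4, ("D", "E"): 2, ("S", "T"): 1,
                   ("C", "C"): 9, ("Y", "F"): 3}
GAP_PENALTY = 12  # the slide charges -12 for the single gap column


def alignment_score(top, bottom, matrix, gap_penalty):
    """Sum substitution scores column by column, subtracting gap penalties."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            score -= gap_penalty
        elif (a, b) in matrix:
            score += matrix[(a, b)]
        else:
            score += matrix[(b, a)]  # the matrix is symmetric
    return score


total = alignment_score("VDS-CY", "VETLCF", BLOSUM62_SUBSET, GAP_PENALTY)
print(total)
```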
Nucleic acid matrices
Needleman-Wunsch
    A   C   G   T
A  10   0   0   0
C   0  10   0   0
G   0   0  10   0
T   0   0   0  10
Smith-Waterman
    A   C   G   T
A  10  -9  -9  -9
C  -9  10  -9  -9
G  -9  -9  10  -9
T  -9  -9  -9  10
1) The magnitudes of the elements are relative and can be scaled. Note, e.g.,
that the SW matrix puts a high penalty (large negative value) on
mismatches.
2) Other heuristic matrices can be easily constructed. Identity matrix:
diagonal = 1, rest = 0. Or one can penalize certain associations by
assigning a large negative value to them, etc.
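Such match/mismatch matrices are simple to generate. The sketch below builds the two nucleic acid matrices and the identity matrix mentioned above as nested dictionaries; the helper name `make_matrix` and the dictionary layout are assumptions made for illustration.

```python
def make_matrix(alphabet, match, mismatch):
    """Build a symmetric lookup table: match on the diagonal, mismatch elsewhere."""
    return {a: {b: (match if a == b else mismatch) for b in alphabet}
            for a in alphabet}


nw_matrix = make_matrix("ACGT", 10, 0)    # Needleman-Wunsch matrix from the slide
sw_matrix = make_matrix("ACGT", 10, -9)   # Smith-Waterman matrix from the slide
identity = make_matrix("ACGT", 1, 0)      # identity matrix: diagonal = 1, rest = 0

print(sw_matrix["A"]["A"], sw_matrix["A"]["C"])
```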
Pairwise alignment
We have two (protein or DNA) sequences
originating from a common ancestor
The purpose of an alignment is to line up all
positions that originate from the same position in
the ancestral sequence
Pairwise alignment
The purpose of an alignment is to line up all
residues that were derived from the same residue
position in the ancestral gene or protein in two
sequences
gap = insertion or deletion
Two kinds of alignments: Global and local
Global
Local
Global alignment
Align two sequences from head to toe, i.e.
from 5' ends to 3' ends
from N-termini to C-termini
Algorithm published by:
Needleman, S.B. and Wunsch, C.D. (1970)
A general method applicable to the search for
similarities in the amino acid sequence of two proteins
J. Mol. Biol. 48:443-453.
Global alignment
Simple rules:
Match (i,j) = 1, if residue (i) = residue (j); else 0
(used here for simplicity; the value can also be taken from a substitution matrix)
Gap = 1
Score (i,j) = Maximum of
Score (i+1,j+1) + Match (i,j)          (follow diagonal, align next residue)
Score (i+1,j) + Match (i,j) - Gap      (insert gap in vertical sequence)
Score (i,j+1) + Match (i,j) - Gap      (insert gap in horizontal sequence)
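A minimal sketch of these rules, filling the matrix from the bottom-right corner as the recurrence implies. The function name is an assumption; the boundary values (one gap per remaining residue) and the two example sequences are taken from the worked matrix on the slides that follow.

```python
def global_score_matrix(vertical, horizontal, gap=1):
    """Fill the global alignment matrix with the slide's recurrence."""
    rows, cols = len(vertical), len(horizontal)
    score = [[0] * (cols + 1) for _ in range(rows + 1)]
    # Boundary: a leftover suffix aligned against nothing costs one gap per residue.
    for i in range(rows + 1):
        score[i][cols] = -gap * (rows - i)
    for j in range(cols + 1):
        score[rows][j] = -gap * (cols - j)
    # Work backwards from the bottom-right corner.
    for i in range(rows - 1, -1, -1):
        for j in range(cols - 1, -1, -1):
            match = 1 if vertical[i] == horizontal[j] else 0
            score[i][j] = max(
                score[i + 1][j + 1] + match,      # follow diagonal
                score[i + 1][j] + match - gap,    # gap, stepping along vertical
                score[i][j + 1] + match - gap,    # gap, stepping along horizontal
            )
    return score


matrix = global_score_matrix("ctgagt", "aacttgagc")
for label, row in zip("ctgagt-", matrix):
    print(label, row)
```

The alignment itself is recovered afterwards by tracing back through the cells that produced each maximum; ties give alternative, equally scoring alignments.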
Global alignment
    a   a   c   t   t   g   a   g   c   -
c                                      -6
t                                      -5
g                                      -4
a                                      -3
g                                      -2
t                                      -1
-  -9  -8  -7  -6  -5  -4  -3  -2  -1   0
Global alignment
    a   a   c   t   t   g   a   g   c   -
c                                      -6
t                                      -5
g                                      -4
a                                      -3
g                                      -2
t                                   0  -1
-  -9  -8  -7  -6  -5  -4  -3  -2  -1   0
0 + 0 = 0
-1 + 0 - 1 = -2
-1 + 0 - 1 = -2

Global alignment
    a   a   c   t   t   g   a   g   c   -
c                          -1  -2  -4  -6
t                           0  -1  -4  -5
g                           1   0  -3  -4
a                           2   0  -2  -3
g                       0   0   1  -1  -2
t                      -3  -2  -1   0  -1
-  -9  -8  -7  -6  -5  -4  -3  -2  -1   0
0 + 0 = 0
2 + 0 - 1 = 1
0 + 0 - 1 = -1

Global alignment
    a   a   c   t   t   g   a   g   c   -
c                          -1  -2  -4  -6
t                           0  -1  -4  -5
g                           1   0  -3  -4
a                       1   2   0  -2  -3
g                       0   0   1  -1  -2
t                      -3  -2  -1   0  -1
-  -9  -8  -7  -6  -5  -4  -3  -2  -1   0
2 + 1 = 3
1 + 1 - 1 = 1
1 + 1 - 1 = 1

Global alignment
    a   a   c   t   t   g   a   g   c   -
c                          -1  -2  -4  -6
t                           0  -1  -4  -5
g                       3   1   0  -3  -4
a                       1   2   0  -2  -3
g                       0   0   1  -1  -2
t                      -3  -2  -1   0  -1
-  -9  -8  -7  -6  -5  -4  -3  -2  -1   0
Global alignment
    a   a   c   t   t   g   a   g   c   -
c   3   4   5   4   3   1  -1  -2  -4  -6
t   1   2   3   4   4   2   0  -1  -4  -5
g  -2  -1   0   1   2   3   1   0  -3  -4
a  -2  -2  -2  -1   0   1   2   0  -2  -3
g  -5  -4  -3  -2  -1   0   0   1  -1  -2
t  -6  -5  -4  -3  -3  -3  -2  -1   0  -1
-  -9  -8  -7  -6  -5  -4  -3  -2  -1   0
a a c t t g a g c
- - c t - g a g t
a a c t t g a g c
- - c - t g a g t
Percent Similarity: 70.833 Percent Identity: 58.929

cyboa x cyboab

. . . . .
1 MDIAIQHPWFKRTLGPFY.PSRLFDQFFGEGLFEYDLLPFLSSTISPYYR 49
||||| ||| :| ||: ||||||||||| | | || | |..:||:|
1 MDIAIHHPWIRRPFFPFHSPSRLFDQFFGEHLLESDLFP.ASTSLSPFYL 49
. . . . .
50 Q..SLFR..TVLDSGISEVRSDRDKFVIFLDVKHFSPEDLTVKVQEDFVE 95
. | | . :|.|:||.| ::|:| : |||||||||:| ||| | :|
50 RPPSFLRAPSWIDTGLSEMRLEKDRFSVNLDVKHFSPEELKVKVLGDVIE 99
. . . . .
96 IHGKHNERQDDHGYISREFHRRYRLPSNVDQSALSCSLSADGMLTFSGPK 145
:|||| ||||:||:|||||||:||:|..|| |:. |||.||.|| .||:
100 VHGKHEERQDEHGFISREFHRKYRIPADVDPLAITSSLSSDGVLTVNGPR 149
. . .
146 IPSGVDAGHSERAIPVSREEKP..SSAPSS 173
| || ||:.||||| ..||
150 ....KQASGPERTIPITREEKPAVTAAPKK 175
Percent Similarity: 46.857 Percent Identity: 37.143

cybog x cybob

. . . . .
1 ...............MGKITFYEDRGFQGHCYECSSDCPNL.QPYFSRCN 34
|| :| |||| :| . |||| : :
1 MASDHQTQAGKPQPLNPKIIIFEQENFQGHSHELNGPCPNLKETGVEKAG 50
. . . . .
35 SIRVDSGCWMLYERPNYQGHQYFLRRGDYPDYQQWMGF..NDSIRSCRLI 82
|: | .| |. ||. | .| |: :|:|| : | ||: | | |
51 SVLVQAGPWVGYEQANCKGEQFVFEKGEYPRWDSWTSSRRTDSLSSLRPI 100
. . . . .
83 PQHTGTFRMRIYERDDFRG.QMSEITDDCPSLQDRFHLTEVHSLNVLEGS 131
. :. :|| .| | .| | || || : .| |. | |.
101 KVDSQEHKITLYENPNFTGKKMEVIDDDVPSFHAHGYQEKVSSVRVQSGT 150
. . . . .
132 WVLYEMPSYRGRQYLLRPGEYRRYLDWGAMNAKVGSLRRVMDFY...... 175
|| |: | ||| |||| |:|: |.|| .| |.||: |
151 WVGYQYPGYRGLQYLLEKGDYKDSGDFGAPQPQVQSVRRIRDMQWHQRGA 200
Local alignment
Locate region(s) with high degree of similarity in
two sequences

Algorithm published by:
Smith, T.F. and Waterman, M.S. (1981)
Identification of common molecular subsequences
J. Mol. Biol. 147:195-197.
Local alignment
Simple rules:
Match (i,j) = 1, if residue (i) = residue (j); else 0
(the value can also be taken from a substitution matrix)
Gap = 1
Score (i,j) = Maximum of
Score (i+1,j+1) + Match (i,j)          (follow diagonal, align next residue)
Score (i+1,j) + Match (i,j) - Gap      (insert gap in vertical sequence)
Score (i,j+1) + Match (i,j) - Gap      (insert gap in horizontal sequence)
0                                      (no negative values allowed)
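The only change from the global rules is the extra 0 in the maximum, which also makes all boundary cells 0. A minimal sketch under the same assumptions as the slide (Match = 1 or 0, Gap = 1); the function name and the returned tuple are illustrative choices, and the sequences are those of the worked matrix that follows.

```python
def local_score_matrix(vertical, horizontal, gap=1):
    """Fill the local alignment matrix; scores are never allowed below 0."""
    rows, cols = len(vertical), len(horizontal)
    score = [[0] * (cols + 1) for _ in range(rows + 1)]  # boundaries stay 0
    best, best_cell = 0, (rows, cols)
    for i in range(rows - 1, -1, -1):
        for j in range(cols - 1, -1, -1):
            match = 1 if vertical[i] == horizontal[j] else 0
            score[i][j] = max(
                score[i + 1][j + 1] + match,      # follow diagonal
                score[i + 1][j] + match - gap,    # gap, stepping along vertical
                score[i][j + 1] + match - gap,    # gap, stepping along horizontal
                0,                                # restart: no negative values
            )
            if score[i][j] > best:
                best, best_cell = score[i][j], (i, j)
    return score, best, best_cell


matrix, best, best_cell = local_score_matrix("ctgagt", "aacttgagc")
print(best, best_cell)
```

The local alignment starts at the highest-scoring cell and is traced until a 0 is reached, so unrelated flanking regions are simply left out.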

Local alignment
    a   a   c   t   t   g   a   g   c   -
c   3   4   5   4   3   1   0   0   1   0
t   1   2   3   4   4   2   1   0   0   0
g   2   1   0   1   2   3   1   1   0   0
a   2   2   1   0   1   1   2   0   0   0
g   0   0   1   1   0   1   0   1   0   0
t   0   0   0   1   1   0   0   0   0   0
-   0   0   0   0   0   0   0   0   0   0
c t t g a g
c t - g a g
c t t g a g
c - t g a g
Percent Similarity: 45.679 Percent Identity: 37.037

Average quality based on 100 randomizations: 32.3 +/- 4.6

cyboa x p40_schma November 24, 2001 14:33 ..

. . . . .
63 EVRSD.RDKFVIFLDVKHFSPEDLTVKVQEDFVEIHGKHNERQDDHGYIS 111
|| | : | : | . |.|:|: | |. | :| | | | |
132 EVGEDGKVHFKVRFDAQGFAPQDINVTSSENRVTVHAK.KETTTDGRKCS 180
. . .
112 REFHRRYRLPSNVDQSALSCSLSADGMLTFSGP 144
||| | .|| .:| | | | :. ||.| |
181 REFCRMVQLPKSIDDSQLKCRMTDDGVLMLEAP 213
Significance of alignment
How significant is the alignment that we have
found?
Or put differently: how much different is the
alignment score that we found from scores
obtained by aligning random sequences to our
sequence?
Significance of alignment
[Figures: distributions of alignment scores (x-axis: alignment score). The bulk of each distribution consists of random matches; the score of interest is marked, in turn, as a genuine match, as a random match, and as a possibly genuine match.]
Statistical significance
The p-value is the probability of observing data at least as extreme as that observed.
[Figure: density function (integral = 1.0); the marked tail area is the p-value]
An engineer's guide to significance
Significance: 1) The probability of finding a score
by chance (p-value) ; 2) The number of times you
expect to find a score >= a certain value by
chance (E-value). (the smaller, the better)
You can estimate p by making a histogram of
chance (random) scores, linearizing it and
reading p from the linear curve.
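Before any curve fitting, the histogram of chance scores already gives a direct estimate of p: the fraction of chance scores at least as large as the observed one. The sketch below shows only this direct reading, not the linearization step; the function name and the list of chance scores are made up for illustration.

```python
def empirical_p_value(random_scores, observed):
    """Fraction of chance scores that are >= the observed score."""
    hits = sum(1 for s in random_scores if s >= observed)
    return hits / len(random_scores)


# Hypothetical chance scores, e.g. from alignments against shuffled sequences.
chance_scores = [28, 30, 31, 31, 33, 34, 35, 36, 38, 41]
p = empirical_p_value(chance_scores, 36)
print(p)
```

When the observed score is larger than every chance score, this estimate is 0, which is why the tail has to be extrapolated from a fitted or linearized curve.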
An engineer's guide to significance
Methods to estimate significance differ in how they model
the distribution of random scores

a) Calculation of score with random shuffled sequences (Z
score)

b) Comparison with a histogram of unrelated similarity
scores (using either a random db or the low (= random)
scores of the real data).

b1) Without knowing the distribution (linearization and
reading from the curve)

b2) Determining the distribution (like the extreme value
distribution of the Karlin-Altschul statistics)
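Calculating significance
Align sequence A to sequence B, and calculate the
alignment score $S_{\text{genuine}}$
Repeat N times (N > 100):
Randomise sequence A by shuffling its residues in a
random fashion
Align the shuffled sequence A to sequence B, and record the
alignment score $S_{\text{random}}$
Calculate the mean $\mu_{\text{random}}$ and standard deviation $\sigma_{\text{random}}$ of the random scores
Calculate the Z-score:
$Z = (S_{\text{genuine}} - \mu_{\text{random}}) / \sigma_{\text{random}}$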
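The shuffling procedure can be sketched as below. This is an illustration under stated assumptions: `identity_count` is a toy stand-in for a real alignment score (it just counts ungapped identities), the function names and the fixed random seed are hypothetical, and the two sequences are the first 18 residues of the cyboa and cyboab example shown earlier.

```python
import random
import statistics


def z_score(genuine, random_scores):
    """Z = (S_genuine - mean of random scores) / standard deviation."""
    mean = statistics.mean(random_scores)
    sd = statistics.stdev(random_scores)
    return (genuine - mean) / sd


def shuffled_scores(seq_a, seq_b, score_fn, n=100, seed=0):
    """Score seq_b against n shuffled copies of seq_a."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n):
        residues = list(seq_a)
        rng.shuffle(residues)  # keeps the composition, destroys the order
        scores.append(score_fn("".join(residues), seq_b))
    return scores


def identity_count(a, b):
    """Toy score: number of identical residues at the same position."""
    return sum(1 for x, y in zip(a, b) if x == y)


seq_a, seq_b = "MDIAIQHPWFKRTLGPFY", "MDIAIHHPWIRRPFFPFH"
genuine = identity_count(seq_a, seq_b)
chance = shuffled_scores(seq_a, seq_b, identity_count)
print(genuine, round(z_score(genuine, chance), 1))
```

With the numbers reported by the alignment program (quality 129, random mean 32.3, standard deviation 4.6), the same formula gives (129 - 32.3) / 4.6, about 21.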
Quality: 129 Length: 83
Ratio: 1.573 Gaps: 2
Percent Similarity: 45.679 Percent Identity: 37.037

Average quality based on 100 randomizations: 32.3 +/- 4.6

( 129 - 32.3 ) / 4.6 = 21.0
Z-score statistics: why not?
Z-score implies a normal distribution of scores;
this is NOT true!
Calculating Z-scores for all alignment pairs during
database searches is not practical
Z score does not consider the other sequences in
the database, it models sequences with the same
composition as sequence A.
An engineer's guide to significance
A typical distribution of scores S
Not normally
distributed!
An engineer's guide to significance
b1) Estimating significance from an unknown
distribution
1) Draw % histogram of
chance similarities
An engineer's guide to significance
Estimating significance from an unknown
distribution (FAQ)
Where to get the distribution data? 1) From comparisons with real sequences,
omitting the largest scores. 2) From simulated, randomly shuffled sequences.
Neither is correct, but both work quite well.
Usually one has to extrapolate quite far, since large S values are rare (red
line). True, but there is no other way.
Just as the sum of many independent identically distributed (i.i.d.) random variables
tends to a normal distribution, the maximum of a large number of i.i.d. random
variables tends to an extreme value distribution. And we ALWAYS use maximal
alignments.
Why is the difference
important?
Important formulas
The Karlin-Altschul statistics are based on the extreme
value distribution. The expected number of local
alignments (high-scoring segment pairs, HSPs) with score
at least S is
$$E = K m n e^{-\lambda S}$$
m is the query length, n is the database length, K
and λ are constants. mn is called the search space.
E-value (again)
Expected number of HSPs with score at least S is
given by
$$K m n e^{-\lambda S} = E$$
We call this the E-value for the score S

Further important formulas
The raw score S depends on the scoring system
(matrix), K and λ. The normalized bit score S' is more
portable:
$$S' = \frac{\lambda S - \ln K}{\ln 2}$$
The probability P of finding at least one HSP with
score >= S is
$$P = 1 - e^{-E} = 1 - e^{-K m n e^{-\lambda S}}$$
where E is calculated by the Karlin-Altschul
formula
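The three formulas, E = K m n e^(-λS), S' = (λS - ln K) / ln 2 and P = 1 - e^(-E), can be checked numerically with a short sketch. The function names are assumptions; λ = 0.252 and K = 0.035 are the values these slides quote for BLOSUM-62 with affine gap costs, while the query length, database length and raw score are made-up inputs.

```python
import math


def e_value(score, m, n, lam, k):
    """Expected number of HSPs with score >= S: E = K*m*n*exp(-lambda*S)."""
    return k * m * n * math.exp(-lam * score)


def bit_score(score, lam, k):
    """Normalized bit score: S' = (lambda*S - ln K) / ln 2."""
    return (lam * score - math.log(k)) / math.log(2)


def p_value(e):
    """Probability of at least one such HSP: P = 1 - exp(-E)."""
    return -math.expm1(-e)  # numerically stable form of 1 - exp(-e)


LAMBDA, K = 0.252, 0.035       # values quoted in the slides
m, n = 250, 1_000_000          # hypothetical query and database lengths
raw = 100                      # hypothetical raw score

e = e_value(raw, m, n, LAMBDA, K)
print(e, bit_score(raw, LAMBDA, K), p_value(e))
```

In bit-score form the E-value becomes E = m n 2^(-S'), which no longer mentions K or λ; that is what makes the bit score portable between scoring systems. For small E, P is almost equal to E.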

Estimating λ and K
Generate 10,000 pairs of length-1000 random
protein sequences, and use the Smith-Waterman
algorithm to calculate 10,000 optimal local
alignment scores
For example, λ was estimated at 0.252 and K at 0.035
by the maximum-likelihood method for the BLOSUM-62
amino acid substitution scores and affine gap costs in
which a gap of length k is assigned a score of -(10 + k)

Estimating λ and K
In general, given M samples from an extreme
value distribution, the ratio of the maximum-
likelihood estimate of λ to its actual value is
approximately normally distributed, with mean 1.0
and standard deviation 0.78/sqrt(M)
Thus the standard error for the estimate of λ is
about 0.002, or less than 1%
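The arithmetic behind the 0.002 figure can be written out explicitly, assuming the M = 10,000 simulated alignment scores and the estimate λ ≈ 0.252 mentioned on the previous slide:

```latex
\frac{\sigma_{\hat{\lambda}}}{\lambda} \approx \frac{0.78}{\sqrt{M}}
  = \frac{0.78}{\sqrt{10\,000}} = 0.0078 \quad (\text{about } 0.8\%),
\qquad
\sigma_{\hat{\lambda}} \approx 0.0078 \times 0.252 \approx 0.002
```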

Practical considerations
A score and its significance are not fully
informative by themselves.
Also important is the length coverage,
LC, which tells you whether the similarity
covers a high percentage of the
protein/gene.
The score coverage, SC, can be used
to scale the score to the self
score, or to the score of a
protein group.
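$$LC = \frac{l_{\text{align}}}{l_1} \quad \text{or} \quad \frac{l_{\text{align}}}{l_2}$$
$$SC = \frac{S_{ab}}{S_{aa}} \quad \text{or} \quad \frac{S_{ab}}{\langle S_{a\_group} \rangle}$$
[Figure: two sequences of lengths l_1 and l_2 with the aligned region of length l_align marked]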
What have we learnt?
Sequence scoring matrices (PAM, BLOSUM,
unitary, and how to make one's own)
Alignment methods: exhaustive, heuristic, local,
global
Global alignment, Needleman-Wunsch
Local alignment, Smith-Waterman
Significance: Z-score (not too good), reading from
linearized histogram of random scores, fitting an
extreme value distribution.
