
(Scoring matrices)

PAM = Percent (or Point) Accepted Mutation


The PAM matrices are a series of scoring matrices, each reflecting a
certain level of divergence:

PAM = unit of evolution (1 PAM = 1 accepted mutation per 100 amino acids)

• PAM1: proteins with an evolutionary distance of 1% mutations
• PAM50: 50% mutations
• PAM250: 250% mutations (a position can mutate more than once)

The substitution score is expected to depend on the rate of divergence
between sequences.

accepted mutations
To illustrate how the PAM substitution matrices have been derived,
we will consider the following artificial ungapped aligned sequences:

Example taken from Borodovsky & Ekisheva (2007) Problems and Solutions in
Biological Sequence Analysis. Cambridge University Press.
Phylogenetic trees to show how mutations have occurred
Matrix of accepted point mutation counts (A)

Rows and columns: A B C D G H I J (entries left blank)
Try to calculate yourself…
Each edge of a given tree is associated with the ungapped alignment of the two
sequences connected by this edge. Thus, any tree shown above generates 6
alignments.

For example, the first phylogenetic tree generates the following alignments:

ABIJ ABIJ ABGH ABGH ABIJ ABIJ
ABGH ABIJ ACGH DBGH ADIJ CBIJ

Two of the other trees generate:

ABIH ABIH ABGH ABGH ABIJ ABIJ
ABGH ABIJ ACGH DBGH ADIJ CBIJ

ABGJ ABGJ ABGH ABGH ABIJ ABIJ
ABGH ABIJ ACGH DBGH ADIJ CBIJ

These alignments can be used to assess the "relative mutability" of each amino acid.

Number of occurrences of A: 40
B:
C:
Relative mutability (mj)
The relative mutability is defined as the ratio of the total number of times
that amino acid j has changed in all the pairwise alignments (in our case
6 x 4 = 24 alignments) to the number of times that j has occurred in these
alignments, i.e.

$m_j = \dfrac{\text{number of changes of } j}{\text{number of occurrences of } j}$

Amino acids A B I H G J C D
Changes (substitutions)
Number of occurrences
Relative mutability (mj)

The relative mutability accounts for the fact that different amino acids have
different mutation rates. It can thus be read as the probability that a given
amino acid mutates.
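The counting behind the relative mutability can be sketched in Python. This is only an illustration under stated assumptions: the first three tree roots (ABIJ, ABIH, ABGJ) are read off the alignments listed above, while the fourth root (ABGH) is an assumption inferred from the stated total of 6 x 4 = 24 alignments; the function names are hypothetical.

```python
from collections import Counter

# Roots of the four trees. ABIJ, ABIH and ABGJ appear in the alignments
# listed above; ABGH is an ASSUMPTION inferred from "6 x 4 = 24 alignments".
ROOTS = ["ABIJ", "ABIH", "ABGJ", "ABGH"]

def tree_alignments(root):
    """Return the 6 ungapped pairwise alignments generated by one tree."""
    return [
        (root, "ABGH"), (root, "ABIJ"),
        ("ABGH", "ACGH"), ("ABGH", "DBGH"),
        ("ABIJ", "ADIJ"), ("ABIJ", "CBIJ"),
    ]

def relative_mutability(alignments):
    """Count changes and occurrences per amino acid, then take the ratio."""
    changes, occurrences = Counter(), Counter()
    for top, bottom in alignments:
        for x, y in zip(top, bottom):
            occurrences[x] += 1
            occurrences[y] += 1
            if x != y:          # a substitution counts for both residues
                changes[x] += 1
                changes[y] += 1
    m = {aa: changes[aa] / occurrences[aa] for aa in occurrences}
    return changes, occurrences, m

alignments = [pair for root in ROOTS for pair in tree_alignments(root)]
changes, occurrences, m = relative_mutability(alignments)
print(occurrences["A"], changes["A"], m["A"])
```

Under these assumptions the count for A agrees with the 40 occurrences quoted above, which is a useful sanity check before filling in the rest of the table by hand.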
Relative mutability of the 20 amino acids

• Trp and Cys are the least mutable

• Cys is known to have several unique, indispensable functions. Big groups
like Trp or Phe are less mutable due to their particular chemistry.

• Asn, Ser, Asp and Glu are the most mutable


Effective frequency (fi)
The notion of effective frequency fi takes into account differences in how
strongly the primary structure is conserved in proteins with different
functional roles.
Effective frequency of the 20 amino acids determined for the original
alignment data (Dayhoff et al., 1978)

These values sum to 1. If the 20 amino acids were equally represented in proteins,
these values would all be 0.05; instead, amino acids vary in their frequency of
occurrence.
Mutational probability matrix (M)
Let us define Mij as the probability of the amino acid in column j having
been substituted by an amino acid in row i over a given evolutionary
time unit.

Non-diagonal elements of M:

$M_{ij} = \lambda \, m_j \, \dfrac{A_{ij}}{\sum_{k \neq j} A_{kj}}$

Diagonal elements of M:

$M_{jj} = 1 - \lambda \, m_j$

Mij: probability of amino acid j (column) changing into i (row) in the
evolutionary period
Aij: number of times j changes to i
λ: proportionality constant used to connect matrix M with an evolutionary
time scale
mj: relative mutability of amino acid j

In our example M is an 8 x 8 matrix with rows and columns A B C D G H I J.

If A is mutated, the probability that it is mutated into D is:
A_DA / (A_BA + A_CA + A_DA) = 4/8

The probability that A is mutated into D is (before scaling by λ):
M_DA = [8/40] x [4/8] = 4/40

The probability that A is not mutated is:
M_AA = 1 - 8/40 = 32/40
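The arithmetic above can be reproduced for a whole column of M with a short Python sketch. It is illustrative only: the function name and the "self" key are hypothetical, λ defaults to 1 to match the unscaled numbers above, and while A_DA = 4 and the total of 8 changes come from the text, the split A_BA = 0, A_CA = 4 is an inference from the toy alignments.

```python
from fractions import Fraction

def mutation_column(counts, occurrences, lam=Fraction(1)):
    """Column j of M; counts[i] is A_ij, the number of j -> i changes."""
    total_changes = sum(counts.values())
    m_j = Fraction(total_changes, occurrences)   # relative mutability of j
    column = {i: lam * m_j * Fraction(a, total_changes)
              for i, a in counts.items()}
    column["self"] = 1 - lam * m_j               # diagonal element M_jj
    return column

# A_DA = 4 and 8 changes in total are given in the text; A_BA = 0 and
# A_CA = 4 are ASSUMED from the toy alignments.
col_A = mutation_column({"B": 0, "C": 4, "D": 4}, occurrences=40)
print(col_A["D"], col_A["self"])   # 1/10 and 4/5, i.e. 4/40 and 32/40
```

Exact fractions are used so that the column visibly sums to 1, which is the property that makes M a probability matrix.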
PAM0 Matrix
PAM0  A(Ala)  R(Arg)  N(Asn)  D(Asp)  C(Cys)  Q(Gln)  E(Glu)  G(Gly)
A     100     0       0       0       0       0       0       0
R     0       100     0       0       0       0       0       0
N     0       0       100     0       0       0       0       0
D     0       0       0       100     0       0       0       0
C     0       0       0       0       100     0       0       0
Q     0       0       0       0       0       100     0       0
E     0       0       0       0       0       0       100     0
G     0       0       0       0       0       0       0       100
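In our example, λ = 0.0261. This value follows from requiring that, on
average, 99% of the residues remain unchanged, with a frequency of 0.125 for
each of the 8 amino acids:

0.99 = 0.125 x 8 – λ [0.125 (0.2 + 0.2 + 0.167 + 0.167 + 0.167 + 0.167 + 1 + 1)]
0.99 = 1 – λ [0.125 (0.2 + 0.2 + 0.167 + 0.167 + 0.167 + 0.167 + 1 + 1)]
0.99 = 1 – λ [0.125 x 3.068]
0.99 = 1 – 0.3835 λ
0.3835 λ = 0.01
λ = 0.026075…

The resulting mutation probability matrix (PAM1) has rows and columns
A B C D G H I J; its entries are not reproduced here.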
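As a check on the derivation of λ above, the same equation can be solved numerically. This is a minimal sketch: the function name is hypothetical, and the mutabilities are the rounded values used in the derivation (0.167 rather than 1/6).

```python
def scaling_constant(freqs, mutabilities, unchanged=0.99):
    """Solve sum_j f_j * (1 - lam * m_j) = unchanged for lam."""
    weighted = sum(f * m for f, m in zip(freqs, mutabilities))
    return (sum(freqs) - unchanged) / weighted

freqs = [0.125] * 8                                   # 8 amino acids, equal frequency
mutabilities = [0.2, 0.2, 0.167, 0.167, 0.167, 0.167, 1.0, 1.0]
lam = scaling_constant(freqs, mutabilities)
print(round(lam, 4))

# The frequency-weighted diagonal should come back to 0.99.
diagonal_mean = sum(f * (1 - lam * m) for f, m in zip(freqs, mutabilities))
```

Changing `unchanged` to another value would define a different evolutionary unit; 0.99 is what makes the result a PAM1 matrix.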
PAM1 Matrix
This defines an evolutionary period: the period during which 1% of all
residues are mutated (counting accepted mutations only).

For clarity, the values have been multiplied by 10000.

The diagonal gives the probability of still observing the same residue after
1 PAM. Averaged over all amino acids, it represents the 99% of cases in which
no mutation occurs.
From PAM1 to PAM2, PAM100, PAM250, etc.

$M^{(2)} = M^{(1)} M^{(1)}$

$M^{(K)} = \{M^{(1)}\}^{K}$

PAM2 = PAM1^2
PAM100 = PAM1^100
PAM250 = PAM1^250
PAMn = PAM1^n
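The extrapolation from PAM1 to PAMn is a plain matrix power, which can be sketched in pure Python as follows. The 2 x 2 matrix here is a hypothetical two-letter stand-in chosen only so the example runs; it is not Dayhoff data, and the helper names are assumptions.

```python
def mat_mul(X, Y):
    """Multiply two square matrices given as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(M, k):
    """Return M**k for k >= 0 by binary exponentiation."""
    n = len(M)
    # Identity matrix: no change at all, i.e. the PAM0 case.
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    base = [row[:] for row in M]
    while k:
        if k & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        k >>= 1
    return result

# HYPOTHETICAL 2-letter PAM1-like matrix (each column sums to 1).
pam1 = [[0.99, 0.01],
        [0.01, 0.99]]
pam2 = mat_pow(pam1, 2)
pam250 = mat_pow(pam1, 250)
print(pam2[0][0], pam250[0][0])
```

Even in this toy case the diagonal of the 250th power drops to roughly one half, which illustrates why 250 PAM does not mean that every position has changed: repeated and back mutations are included in the count.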
PAM250 derived by Dayhoff for the 20 amino acids

For clarity, the values have been multiplied by 100

This matrix corresponds to an evolutionary period giving 250 mutations per 100
amino acids (i.e. an evolutionary distance of 250 PAM), and is referred to as
the PAM250 matrix.
From probabilities to scores
• So far, we have obtained a probability matrix, but we would like a scoring
matrix.

Why do we go from a mutation probability matrix to a log odds matrix?

• We want a scoring matrix so that when we do a pairwise alignment (or a
BLAST search) we know what score to assign to two aligned amino acid
residues.

• Logarithms are easier to use for a scoring system. They allow us to sum
the scores of aligned residues (rather than having to multiply them).

TAHGK
YSDGD
Salignment = s(T,Y) + s(A,S) + s(H,D) + s(G,G) + s(K,D)
• A score should reflect the significance of an alignment occurring as a
result of an evolutionary process with respect to what we could expect
by chance.

• A score should involve the ratio of the probability under the nonrandom
(evolutionary) model to the probability under the random model.
• The cells in a log odds matrix consist of an “odds ratio”:
(the probability that an alignment is authentic) /
(the probability that the alignment was random)

• The score Sij for an alignment of residues i,j is given by:

$S_{ij} = 10 \log_{10} \dfrac{(M^K)_{ij}}{P_i}$

(MK)ij: the probability that the aligned pair of amino acid residues (i, j) represents
an authentic alignment (mutation accepted by evolution)
Pi: independent probability of amino acid i occurring in this position
PAM250 matrix
Normalized Frequencies of Amino acids
Asn 0.040 His 0.034
Ser 0.070 Arg 0.041
Asp 0.047 Lys 0.081
Glu 0.050 Pro 0.051
Ala 0.087 Gly 0.089
Thr 0.058 Tyr 0.030
Ile 0.037 Phe 0.040
Met 0.015 Leu 0.085
Gln 0.038 Cys 0.033
Val 0.065 Trp 0.010
S (A → A) = 10 log10 (0.13/0.087) = 1.744 (~2)
S (R → R) = 10 log10 (0.17/0.041) = 6.18 (~6)
S (N → N) = 10 log10 (0.06/0.040) = 1.76 (~2)

S (N → A) = 10 log10 (0.09/0.087) ≈ 0
S (A → N) = 10 log10 (0.04/0.04) = 0
S (R → N) = 10 log10 (0.04/0.04) = 0
S (N → R) = 10 log10 (0.04/0.041) ≈ 0

Resulting scores for A, R and N:

    A   R   N
A   2  -2   0
R  -2   6   0
N   0   0   2
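The conversion from a PAM250 probability to a log odds score shown above can be sketched in a few lines of Python. The function name is hypothetical, and the input values are the rounded probabilities and frequencies quoted in the worked examples.

```python
import math

def log_odds(m_ij, p_i):
    """Return 10 * log10(M_ij / p_i) and its value rounded to an integer."""
    raw = 10 * math.log10(m_ij / p_i)
    return raw, round(raw)

examples = [("A->A", 0.13, 0.087),
            ("R->R", 0.17, 0.041),
            ("N->N", 0.06, 0.040),
            ("N->A", 0.09, 0.087)]
for label, m_ij, p_i in examples:
    raw, score = log_odds(m_ij, p_i)
    print(f"{label}: {raw:.2f} -> {score}")
```

Rounding to the nearest integer is what turns the small positive value for N → A into a neutral score of 0.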
PAM250 log odds scoring matrix
What do the numbers mean in a log odds matrix?
S( W, W ) = 10 log10 ( 0.55 / 0.010 ) = 17.4

S(W, W) = 17
Probability of replacement relative to chance: ( Mij / pi ) = x
Then
17 = 10 log10 x
1.7 = log10 x
10^1.7 = x ≈ 50

• A score of +17 for tryptophan means that this alignment is 50 times more
likely than a chance alignment of two tryptophan residues.

• A score of +2 indicates that the amino acid replacement occurs 1.6 times as
frequently as expected by chance.

• A score of 0 is neutral.

• A score of –10 indicates that the alignment of the two amino acids is one
tenth as frequent as the chance alignment of these amino acids.
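The interpretation of the scores above amounts to inverting S = 10 log10 x. A minimal Python sketch (the function name is an assumption) makes the numbers easy to reproduce:

```python
def odds_ratio(score):
    """Invert S = 10 * log10(x) to recover the odds ratio x."""
    return 10 ** (score / 10)

for s in (17, 2, 0, -10):
    print(s, round(odds_ratio(s), 2))
```

Because the scale is logarithmic, every 10 score units correspond to a factor of 10 in the odds ratio.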
Scoring an alignment
    C   S   T   P   A
C  12
S   0   2
T  -2   1   3
P  -3   1   0   6
A  -2   1   1   1   2

STPP
CTCA
Score = 0 + 3 + (-3) + 1 = 1
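Scoring an ungapped alignment is a sum of table lookups, one per aligned pair. The sketch below assumes the five-letter excerpt of the scoring matrix shown above (C, S, T, P, A); the dictionary layout and function names are hypothetical.

```python
# Lower-triangular scores from the excerpt above (C, S, T, P, A only).
SCORES = {
    ("C", "C"): 12,
    ("S", "C"): 0, ("S", "S"): 2,
    ("T", "C"): -2, ("T", "S"): 1, ("T", "T"): 3,
    ("P", "C"): -3, ("P", "S"): 1, ("P", "T"): 0, ("P", "P"): 6,
    ("A", "C"): -2, ("A", "S"): 1, ("A", "T"): 1, ("A", "P"): 1, ("A", "A"): 2,
}

def pair_score(a, b):
    """Look up s(a, b); the matrix is symmetric, so try both orders."""
    return SCORES[(a, b)] if (a, b) in SCORES else SCORES[(b, a)]

def alignment_score(seq1, seq2):
    """Sum the pair scores of an ungapped alignment."""
    if len(seq1) != len(seq2):
        raise ValueError("ungapped alignment needs equal-length sequences")
    return sum(pair_score(a, b) for a, b in zip(seq1, seq2))

print(alignment_score("STPP", "CTCA"))  # 0 + 3 + (-3) + 1 = 1
```

Storing only the lower triangle mirrors how the matrix is usually printed; the lookup in both orders restores the symmetry.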
Comparing two proteins with a PAM1 matrix gives
completely different results than PAM250!

Consider two distantly related proteins. A PAM40 matrix is not forgiving of
mismatches and penalizes them severely. With this matrix you find almost no
matches.

hsrbp, 136 CRLLNLDGTC


btlact, 3 CLLLALALTC
* ** * **

A PAM250 matrix is very tolerant of mismatches.


24.7% identity in 81 residues overlap; Score: 77.0; Gap frequency: 3.7%
hsrbp, 26 RVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWDV
btlact, 21 QTMKGLDIQKVAGTWYSLAMAASD-ISLLDAQSAPLRVYVEELKPTPEGDLEILLQKWEN
* **** * * * * ** *

hsrbp, 86 --CADMVGTFTDTEDPAKFKM
btlact, 80 GECAQKKIIAEKTKIPAVFKI
** * ** **
How to choose the appropriate PAM matrix?
Altschul SF (1991) Amino acid substitution matrices from an information
theoretic perspective. J Mol Biol. 219:555-65.

• PAM120 matrix is the most appropriate for database searches

• PAM200 matrix is the most appropriate for comparing two specific
proteins with suspected homology

Remark:
In the PAM matrices, the index indicates the percentage of substitution
per position.

Higher indexes are more appropriate for more distant proteins
(PAM250 is better than PAM100 for distant proteins).
Limitations in PAM matrix

• Based on only one original dataset

• Examines proteins with few differences (at least 85% identity)

• Based mainly on small globular proteins, so the matrix is biased

• Errors in PAM-1 are magnified in the extrapolation to PAM-250
