PB Bioinfo L4 2023
accepted mutations
To illustrate how the PAM substitution matrices have been derived,
we will consider the following artificial ungapped aligned sequences:
Example taken from Borodovsky & Ekisheva (2007) Problems and Solutions in Biological sequence analysis.
Cambridge Univ Press.
Phylogenetic trees to show how mutations have occurred
Matrix of accepted point mutation counts (A)
Try to calculate it yourself…
  A B C D G H I J
A
B
C
D
G
H
I
J
Each edge of a given tree is associated with the ungapped alignment of the two
sequences connected by this edge. Thus, each tree shown above generates 6
alignments.
For example, the first phylogenetic tree generates the following alignments:
ABIJ ABIJ ABGH ABGH ABIJ ABIJ
ABGH ABIJ ACGH DBGH ADIJ CBIJ
These alignments can be used to assess the "relative mutability" of each amino acid.
Number of occurrences of A: 40
B:
C:
Relative mutability (mj)
The relative mutability of amino acid j is defined as the ratio of the total number
of times that j has changed in all the pairwise alignments (in our case
6x4=24 alignments) to the number of times that j has occurred in these
alignments, i.e.
$$m_j = \frac{\text{number of changes of } j}{\text{number of occurrences of } j}$$
Amino acids A B I H G J C D
Changes (substitutions)
Number of occurrences
Relative mutability (mj)
The relative mutability accounts for the fact that different amino acids have
different mutation rates. It can be read as the probability that a given amino acid mutates.
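The counting behind the relative mutability can be sketched in a few lines of Python. This is only an illustration under two assumptions: it uses just the six alignments listed for the first tree (not all 24), and it counts a mismatch as one change for each of the two residues involved. The names `ALIGNMENTS` and `relative_mutability` are made up for this sketch.

```python
from collections import Counter

# The six edge alignments generated by the first tree (copied from the text).
ALIGNMENTS = [
    ("ABIJ", "ABGH"),
    ("ABIJ", "ABIJ"),
    ("ABGH", "ACGH"),
    ("ABGH", "DBGH"),
    ("ABIJ", "ADIJ"),
    ("ABIJ", "CBIJ"),
]


def relative_mutability(alignments):
    """Return (changes, occurrences, mutability) per amino acid."""
    changes = Counter()
    occurrences = Counter()
    for top, bottom in alignments:
        for x, y in zip(top, bottom):
            occurrences[x] += 1
            occurrences[y] += 1
            if x != y:
                # Assumption: a mismatch counts as one change for both residues.
                changes[x] += 1
                changes[y] += 1
    mutability = {aa: changes[aa] / occurrences[aa] for aa in occurrences}
    return changes, occurrences, mutability


changes, occurrences, mutability = relative_mutability(ALIGNMENTS)
for aa in sorted(occurrences):
    print(aa, changes[aa], occurrences[aa], round(mutability[aa], 3))
```

For these six alignments, A changes 2 times in 10 occurrences, giving a relative mutability of 0.2. The count of 40 quoted for A refers to all 24 alignments.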
Relative mutability of the 20 amino acids
The normalized amino acid frequencies sum to 1. If the 20 amino acids were equally
represented in proteins, these frequencies would all be 0.05; instead, amino acids
vary in their frequency of occurrence.
Mutational probability matrix (M)
Let Mij be the probability that the amino acid in column j has
been substituted by the amino acid in row i over a given evolutionary
time unit.
$$M_{ij} = \frac{\lambda\, m_j\, A_{ij}}{\sum_{k \neq j} A_{kj}} \quad (i \neq j), \qquad M_{jj} = 1 - \lambda\, m_j$$
Mij: probability of amino acid j (column) changing into i (row) in the evolutionary period
Aij: number of times j changes to i
λ: proportionality constant used to connect matrix M with an evolutionary time scale
mj: relative mutability of amino acid j
  A B C D G H I J
A
B
C
D
G
H
I
J
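As a rough illustration of how M could be filled in, the sketch below assumes the usual Dayhoff form: off-diagonal entries M_ij = λ m_j A_ij / Σ_k A_kj and diagonal entries M_jj = 1 − λ m_j. The counts and mutabilities were tallied by hand from the six alignments of the first tree only, and `LAMBDA = 0.1` is an arbitrary placeholder, not a value from the lecture.

```python
# Symmetric accepted-mutation counts tallied from the six alignments of the
# first tree (illustrative data only, not the full 24-alignment matrix).
PAIRS = {("A", "C"): 1, ("A", "D"): 1, ("B", "C"): 1,
         ("B", "D"): 1, ("G", "I"): 1, ("H", "J"): 1}
MUTABILITY = {"A": 0.2, "B": 0.2, "C": 1.0, "D": 1.0,
              "G": 0.2, "H": 0.2, "I": 1 / 7, "J": 1 / 7}
LAMBDA = 0.1  # hypothetical value, chosen only for the demonstration


def count(i, j):
    """A_ij, looked up symmetrically; zero when the pair was never observed."""
    return PAIRS.get((i, j), PAIRS.get((j, i), 0))


def mutation_probability_matrix(mutability, lam):
    """Return M as nested dicts, M[i][j] = P(column j is replaced by row i)."""
    residues = sorted(mutability)
    M = {i: {} for i in residues}
    for j in residues:
        column_total = sum(count(k, j) for k in residues if k != j)
        for i in residues:
            if i == j:
                M[i][j] = 1.0 - lam * mutability[j]
            elif column_total:
                M[i][j] = lam * mutability[j] * count(i, j) / column_total
            else:
                M[i][j] = 0.0
    return M


M = mutation_probability_matrix(MUTABILITY, LAMBDA)
for j in sorted(MUTABILITY):
    # Each column should sum to 1: residue j either stays or becomes another.
    print(j, round(sum(M[i][j] for i in M), 6))
```

Note that M is not symmetric even though A is: the column for a highly mutable residue such as C carries larger off-diagonal values than the column for A.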
PAM1 Matrix
The PAM1 matrix defines an evolutionary period: the period during which
1% of all amino acid positions have undergone an accepted mutation.
The diagonal gives the probability of still observing the same residue after 1 PAM.
On average, the diagonal therefore accounts for the 99% of positions that are not mutated.
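The slides do not spell out how λ is fixed for PAM1. One common choice, assumed here, is to pick λ so that the frequency-weighted probability of a change equals 1%, i.e. λ = 0.01 / Σ_j f_j m_j, where f_j is the frequency of amino acid j. The tallies below come from the six alignments of the first tree only and serve purely as toy data; `pam1_lambda` is a hypothetical helper name.

```python
# Tallies from the six alignments of the first tree (illustrative data only).
CHANGES = {"A": 2, "B": 2, "C": 2, "D": 2, "G": 1, "H": 1, "I": 1, "J": 1}
OCCURRENCES = {"A": 10, "B": 10, "C": 2, "D": 2, "G": 5, "H": 5, "I": 7, "J": 7}


def pam1_lambda(changes, occurrences, target=0.01):
    """Return (lambda, frequencies, mutabilities) for a given target change rate."""
    total = sum(occurrences.values())
    freq = {aa: occurrences[aa] / total for aa in occurrences}
    mutability = {aa: changes[aa] / occurrences[aa] for aa in occurrences}
    # Expected fraction of positions that change when lambda = 1.
    expected_change = sum(freq[aa] * mutability[aa] for aa in occurrences)
    return target / expected_change, freq, mutability


lam, freq, mutability = pam1_lambda(CHANGES, OCCURRENCES)
# Frequency-weighted average of the diagonal entries M_jj = 1 - lambda * m_j.
unchanged = sum(freq[aa] * (1 - lam * mutability[aa]) for aa in freq)
print(lam, unchanged)
```

With this toy data the weighted diagonal comes out at 0.99, matching the "99% not mutated" reading of PAM1.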
From PAM1 to PAM2, PAM100, PAM250, etc...
$$M^{(2)} = M^{(1)} M^{(1)}$$
$$M^{(K)} = \{M^{(1)}\}^{K}$$
PAM2 = PAM1^2
PAM100 = PAM1^100
PAMn = PAM1^n
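The extrapolation from PAM1 to PAMn is plain matrix exponentiation. The sketch below uses a made-up 2x2 column-stochastic matrix as a stand-in for PAM1 (a real PAM1 is 20x20) and hypothetical helpers `mat_mul` and `mat_pow`.

```python
def mat_mul(X, Y):
    """Multiply two square matrices given as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(M, n):
    """Return M raised to the integer power n by repeated squaring."""
    size = len(M)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    base = [row[:] for row in M]
    while n > 0:
        if n & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        n >>= 1
    return result


# Toy stand-in for PAM1: entry [i][j] is P(column j becomes row i).
PAM1 = [[0.99, 0.02],
        [0.01, 0.98]]

pam2 = mat_pow(PAM1, 2)
pam250 = mat_pow(PAM1, 250)
print(pam2)
print(pam250)
```

The columns of every power still sum to 1, while the diagonal shrinks as n grows: the longer the evolutionary period, the less likely a residue is to be observed unchanged.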
PAM250 derived by Dayhoff for the 20 amino acids
• Logarithms are easier to use for a scoring system. They allow us to sum
the scores of aligned residues (rather than having to multiply them).
TAHGK
YSDGD
S_alignment = s(T,Y) + s(A,S) + s(H,D) + s(G,G) + s(K,D)
• A score should reflect the significance of an alignment occurring as a
result of an evolutionary process with respect to what we could expect
by chance.
• A score should involve the ratio between the probability derived from a
nonrandom (evolutionary) model and the probability derived from a random model.
• The cells in a log odds matrix are built from an “odds ratio”:
(the probability that an alignment is authentic) / (the probability that the alignment was random)
$$s_{ij} = 10 \log_{10} \frac{(M^{K})_{ij}}{p_i}$$
(M^K)ij: the probability that the aligned pair of amino acid residues (i, j) represents
an authentic alignment (mutation accepted by evolution)
pi: independent probability of amino acid i occurring in this position
PAM250 matrix
Normalized Frequencies of Amino acids
Asn 0.040 His 0.034
Ser 0.070 Arg 0.041
Asp 0.047 Lys 0.081
Glu 0.050 Pro 0.051
Ala 0.087 Gly 0.089
Thr 0.058 Tyr 0.030
Ile 0.037 Phe 0.040
Met 0.015 Leu 0.085
Gln 0.038 Cys 0.033
Val 0.065 Trp 0.010
Worked examples using the frequencies above:
S (A → A) = 10 log10 (0.13/0.087) = 1.744 (~2)
S (R → R) = 10 log10 (0.17/0.041) = 6.18 (~6)
S (N → N) = 10 log10 (0.06/0.040) = 1.76 (~2)
S (N → A) = 10 log10 (0.09/0.087) = 0.15 (~0)
S (A → N) = 10 log10 (0.04/0.04) = 0
S (R → N) = 10 log10 (0.04/0.04) = 0
S (N → R) = 10 log10 (0.04/0.041) = -0.11 (~0)
Resulting log odds scores:
   A  R  N
A  2 -2  0
R -2  6  0
N  0  0  2
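The worked scores above can be checked with a short script. It simply evaluates 10·log10(M/p) for the probability and frequency pairs quoted in the examples. The function name `log_odds` and the `EXAMPLES` table are assumptions of this sketch.

```python
import math


def log_odds(m_ij, p_i):
    """Log odds score in tenths of a log10 unit: 10 * log10(M_ij / p_i)."""
    return 10 * math.log10(m_ij / p_i)


# (label, mutation probability, background frequency) copied from the examples.
EXAMPLES = [
    ("A -> A", 0.13, 0.087),
    ("R -> R", 0.17, 0.041),
    ("N -> N", 0.06, 0.040),
    ("N -> A", 0.09, 0.087),
    ("N -> R", 0.04, 0.041),
]

for label, m, p in EXAMPLES:
    score = log_odds(m, p)
    print(label, round(score, 2), round(score))
```

Rounding each score to the nearest integer gives the entries used in the scoring matrix.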
PAM250 log odds scoring matrix
What do the numbers mean in a log odds matrix?
S( W, W ) = 10 log10 ( 0.55 / 0.010 ) = 17.4
S(W, W) = 17
Let x be the odds ratio ( Mij / pi ). Then
17 = 10 log10 x
1.7 = log10 x
x = 10^1.7 ≈ 50
• A score of +17 for tryptophan means that this alignment is 50 times more
likely than a chance alignment of two tryptophan residues.
• A score of +2 indicates that the amino acid replacement occurs 1.6 times as
frequently as expected by chance.
• A score of 0 is neutral.
• A score of -10 indicates that the replacement of one amino acid by the other is
one tenth as frequent as the chance alignment of these amino acids.
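Going from a score back to an odds ratio is the inverse operation, x = 10^(S/10). A minimal sketch follows; `odds_ratio` is a hypothetical helper name.

```python
def odds_ratio(score):
    """Invert S = 10 * log10(x): return the odds ratio x for a log odds score."""
    return 10 ** (score / 10)


for s in (17, 2, 0, -10):
    print(s, round(odds_ratio(s), 2))
```

This gives roughly 50 for a score of +17, about 1.6 for +2, exactly 1 for 0, and 0.1 for -10, in line with the interpretations listed above.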
Scoring an alignment
C 12
S  0  2
T -2  1  3
P -3  1  0  6
A -2  1  1  1  2
   C  S  T  P  A

STPP
CTCA
0 + 3 + (-3) + 1 = 1
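Scoring an ungapped alignment amounts to looking up each aligned pair and summing. The sketch below hard-codes the lower-triangular scores shown for C, S, T, P and A. `SCORES`, `pair_score` and `alignment_score` are names invented for this example.

```python
# Lower triangle of the scoring matrix for C, S, T, P, A (values from the text).
SCORES = {
    ("C", "C"): 12,
    ("S", "C"): 0, ("S", "S"): 2,
    ("T", "C"): -2, ("T", "S"): 1, ("T", "T"): 3,
    ("P", "C"): -3, ("P", "S"): 1, ("P", "T"): 0, ("P", "P"): 6,
    ("A", "C"): -2, ("A", "S"): 1, ("A", "T"): 1, ("A", "P"): 1, ("A", "A"): 2,
}


def pair_score(x, y):
    """Look up a symmetric score; only one triangle is stored."""
    return SCORES[(x, y)] if (x, y) in SCORES else SCORES[(y, x)]


def alignment_score(seq1, seq2):
    """Sum the pair scores of two equal-length, ungapped sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return sum(pair_score(x, y) for x, y in zip(seq1, seq2))


print(alignment_score("STPP", "CTCA"))
```

For STPP against CTCA the four pair scores are 0, 3, -3 and 1, so the total is 1.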
Comparing two proteins with a PAM1 matrix gives
completely different results than PAM250!
hsrbp, 86 --CADMVGTFTDTEDPAKFKM
btlact, 80 GECAQKKIIAEKTKIPAVFKI
** * ** **
How to choose the appropriate PAM matrix?
Altschul SF (1991) Amino acid substitution matrices from an information theoretic
perspective. J Mol Biol. 219:555-65.
Remark:
In the PAM matrices, the index indicates the percentage of substitution
per position.