Bioinformatics Session5

Bioinformatics (BIO213)

Session 5

Slide content: Various textbooks, Internet sources


Multiplication rule
• Let A and B be 2 events.
• When 2 events are independent, the probability of both A and B occurring:
• P(A and B) = P(A ⋂ B) = P(A).P(B)

Example 1:
Rolling a die twice: probability that the first roll is any odd number (A) and the
second roll is the number 4 (B)
P(A ⋂ B) = (3/6).(1/6) = 3/36 = 1/12

Example 2:
Roll a six-sided die and then flip a coin. 
These two events are independent.
The probability of rolling a 1 is 1/6. The probability of a head is 1/2.
The probability of rolling a 1 and getting a head is 1/6 x 1/2 = 1/12.
Independent event: an event that has no effect on the probability of another event occurring.
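The two examples can be checked by listing every equally likely outcome and counting. This is a minimal sketch; the helper name probability is an illustrative assumption, not part of the slides.

```python
from fractions import Fraction
from itertools import product

def probability(event, outcomes):
    """Exact probability of `event` over a list of equally likely outcomes."""
    favourable = sum(1 for o in outcomes if event(o))
    return Fraction(favourable, len(outcomes))

# Example 1: two rolls of a six-sided die (36 equally likely outcomes).
two_rolls = list(product(range(1, 7), repeat=2))
p_a = probability(lambda o: o[0] % 2 == 1, two_rolls)   # first roll is odd
p_b = probability(lambda o: o[1] == 4, two_rolls)       # second roll is 4
p_ab = probability(lambda o: o[0] % 2 == 1 and o[1] == 4, two_rolls)

# Example 2: one die roll followed by one coin flip (12 outcomes).
die_and_coin = list(product(range(1, 7), "HT"))
p_one_and_head = probability(lambda o: o == (1, "H"), die_and_coin)

print(p_ab, p_a * p_b)    # 1/12 1/12
print(p_one_and_head)     # 1/12
```

Counting the joint event directly gives the same value as P(A).P(B), which is what independence means here.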
The likelihood ratio
• The score should reflect the odds that the aligned sequences s and s′ are
evolutionarily related.
• The score is high if the odds that the 2 aligned sequences are related are high.
Likelihood score to log-likelihood score

Multiplication by 10 to deal with decimals


WHY Log?
Log makes your life more convenient

i.e., Sa,b < 0 when a and b are more likely to be aligned by chance than in related
sequences at 1 PAM distance
PAMn substitutions
• For sequences having an evolutionary distance of n PAM units
• n PAM units does not necessarily mean n% divergence, because
substitutions can occur at the same site multiple times.

Let P be the 1 PAM transition matrix.

For an n-step Markov chain, the transition probabilities are given by P^n (P multiplied by itself n times).
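A minimal sketch of the n-step idea, assuming a hypothetical 2-letter alphabet instead of the 20 amino acids. The matrix P1 and its 1% change per step are toy numbers chosen for illustration, not Dayhoff's data.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]

def mat_pow(p, n):
    """Return the n-step transition matrix P^n by repeated multiplication (n >= 1)."""
    result = p
    for _ in range(n - 1):
        result = mat_mul(result, p)
    return result

# Hypothetical 1-step matrix: each letter changes with probability 0.01 per step.
P1 = [[0.99, 0.01],
      [0.01, 0.99]]

P100 = mat_pow(P1, 100)
P250 = mat_pow(P1, 250)

# The diagonal entry is the probability that a site shows its original letter.
print(round(P100[0][0], 3))
print(round(P250[0][0], 3))
```

Even after 100 steps the diagonal entry stays well above zero, because repeated and reverse substitutions at the same site mean that n steps do not produce n% observed divergence.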
What do the scores in the PAM250 matrix signify?
A score of +17 for tryptophan matching tryptophan indicates that this
correspondence is 50 times more frequent than the chance alignment.
Let si,j = +17 and let the probability of replacement qij/pi = x.
Then +17 = 10 log10 x;
log10 x = 1.7
x = 50.
A score of −6 indicates that the correspondence of the two amino acids in an
alignment that accurately represents homology occurs at about one quarter the
frequency of chance alignment of these amino acids (log10 x = −0.6, x ≈ 0.25).
BLOSUM matrices
Important Alternative to PAM: BLOSUM Scoring Matrices
• BLOcks SUbstitution Matrices
• Log-likelihood matrix (Henikoff & Henikoff, 1992, 96)
• BLOSUM was developed on the BLOCKS database, consisting of >500
groups of local multiple alignments.
• Blocks: conserved regions of distantly related proteins.
• Focus on conserved regions (blocks) of proteins that are distantly related to
each other (less similar than Dayhoff’s set of sequences).
• Distantly related sequences → we can observe multiple substitutions at the
same position.
• Distantly related → large variation, prevents tree construction
BLOSUM scoring matrices
• If the alignment is correct, aligned residues will be related by their evolutionary
history and the alignment is expected to contain useful info on substitution
preferences.
The BLOSUM scoring scheme:
Dayhoff’s

Sij refers to the score of amino acid i aligning with j and qij are the
positive target frequencies
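A minimal sketch of a BLOSUM-style score, assuming the commonly cited half-bit form Sij = 2 log2(qij / eij), where eij is the pair frequency expected by chance (pi.pj for identical residues, 2.pi.pj for different residues). The function name and the frequencies below are hypothetical, chosen only so the arithmetic is easy to follow.

```python
import math

def blosum_score(q_ij, p_i, p_j, same_residue):
    """Half-bit log-odds score: 2 * log2(q_ij / e_ij), rounded to an integer.

    q_ij: observed (target) frequency of the pair i, j in the blocks.
    e_ij: frequency expected by chance, p_i * p_j if i == j,
          otherwise 2 * p_i * p_j (the pair is unordered).
    """
    e_ij = p_i * p_j if same_residue else 2 * p_i * p_j
    return round(2 * math.log2(q_ij / e_ij))

# Hypothetical frequencies, not taken from the BLOCKS database.
print(blosum_score(q_ij=0.02, p_i=0.05, p_j=0.05, same_residue=True))     # 6
print(blosum_score(q_ij=0.001, p_i=0.05, p_j=0.04, same_residue=False))   # -4
```

A pair seen 8 times more often than chance scores +6, and a pair seen 4 times less often than chance scores −4.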
PAM vs BLOSUM

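• High BLOSUM number corresponds to low PAM number (closely related sequences).
• Low BLOSUM number corresponds to high PAM number (distantly related sequences).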
Assignment: Why is BLOSUM 62 the all-weather matrix?
• Write a report explaining why BLOSUM 62 is preferred over BLOSUM 60 or
BLOSUM 70.
• Hint: How did they figure out that BLOSUM 62 works better than 60
or 70?
Pairwise Alignment & Limits of Detection: The “Twilight Zone”
When we compare two protein sequences, how many mutations can occur between
them before their differences make them unrecognizable?

Multiple sequence alignment of a portion of the glyceraldehyde 3‐phosphate
dehydrogenase (GAPDH) protein from 13 organisms

Between human beta globin and myoglobin, the relationship is much less obvious.
Two randomly diverging protein sequences change in a
negatively exponential fashion

Plot: observed number of amino acid identities per 100 residues of two sequences
vs. number of changes that must have occurred (PAM250 units).

The twilight zone (Doolittle, 1987) refers to the evolutionary distance corresponding to about 20%
identity between two proteins.
Proteins with this degree of amino acid sequence identity may be homologous, but such homology is
difficult to detect.
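Percent identity for a pair of aligned sequences can be sketched as below. This assumes identity is counted over columns where neither sequence has a gap, which is one of several conventions. The function name, the threshold constant and the two sequences are illustrative.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of non-gap aligned columns in which the two residues are identical."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b)
               if x != gap and y != gap]
    if not columns:
        return 0.0
    matches = sum(1 for x, y in columns if x == y)
    return 100.0 * matches / len(columns)

TWILIGHT_ZONE = 20.0  # approximate percent identity quoted above

# Toy aligned sequences (hypothetical, not real proteins).
identity = percent_identity("HEAGAWGHEE", "HEA-AWPHEE")
print(round(identity, 1))          # 88.9
print(identity <= TWILIGHT_ZONE)   # False
```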
Global sequence alignment
Global Sequence Alignment: Needleman and Wunsch Algorithm
• One of the first and most important algorithms for aligning two
protein sequences
• Important because it produces an optimal alignment of protein
or DNA sequences, even allowing the introduction of gaps.
• The result is optimal, and not all possible alignments need to
be evaluated.
• An exhaustive pairwise comparison would be too
computationally expensive to perform.
How many alignments need to be compared with a simple search?
• For 2 sequences of length n, the number of possible global
alignments grows exponentially with n.
• If both sequences are 100 AAs in length: ≈ 4 × 10^57 possible
alignments.
• If both sequences are 110 AAs in length: ≈ 4 × 10^63 possible
alignments.
Pairwise Alignment via Dynamic Programming
• Dynamic programming: solve an instance of a problem by
taking advantage of solutions for subparts of the problem.
– reduce problem of best alignment of two sequences to
best alignment of all prefixes of the sequences.

– avoid recalculating the scores already considered.


Example: Fibonacci sequence 1, 1, 2, 3, 5, 8, 13, 21, 34, ...
How does dynamic programming work?
• A divide-and-conquer strategy:
• Break the problem into smaller subproblems.
• Solve the smaller problems optimally.
• Use the sub-problem solutions to construct an optimal solution for the
original problem.
• Dynamic programming can be applied only to problems
exhibiting the properties of overlapping subproblems and optimal
substructure. Examples include
• Travelling salesman problem
• Finding the best chess move
Dynamic programming idea
• If we are asked to calculate the nth Fibonacci number, we can
do that with the following equation
• Fibonacci sequence: each number is the sum of the two preceding numbers.
Fib(n) = Fib(n-1) + Fib(n-2), for n > 1
• To solve the overall problem (i.e., Fib(n)), we broke it down into
two smaller subproblems (which are Fib(n-1) and Fib(n-2))
Characteristics of Dynamic Programming 
1. Overlapping Subproblems
Subproblems are smaller versions of the original problem. Any problem has overlapping
sub-problems if finding its solution involves solving the same subproblem multiple times.

E.g., for the Fibonacci numbers: to find fib(4), we
need to break it down into the following sub-problems:

fib(4) = fib(3) + fib(2)
fib(3) = fib(2) + fib(1)
fib(2) = fib(1) + fib(0)

We can see the overlapping subproblem pattern here,
as fib(2) has been evaluated twice and fib(1) has been
evaluated 3 times.

2. Optimal Substructure Property (Recursion)


Fib(n) = Fib(n-1) + Fib(n-2)
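The repeated work can be made visible by counting calls in a naive recursive implementation. This is a minimal sketch that assumes the base cases Fib(0) = 0 and Fib(1) = 1; the calls counter is an illustrative addition.

```python
from collections import Counter

calls = Counter()  # how many times fib(k) is evaluated, for each k

def fib(n):
    """Naive recursive Fibonacci, assuming Fib(0) = 0 and Fib(1) = 1."""
    calls[n] += 1
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)

result = fib(4)
print(result)                # 3
print(calls[2], calls[1])    # 2 3
```

The counts match the pattern described above: fib(2) is evaluated twice and fib(1) three times for a single call to fib(4).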
Dynamic Programming Methods
1. Top-down with Memoization
Here, we try to solve the bigger problem by recursively finding the solution to
smaller sub-problems.
When we solve a sub-problem, we cache its result so that we don’t have to
solve it multiple times; instead we retrieve the saved result.
The technique of storing the results of solved sub-problems is called memoization.
2. Bottom-up with Tabulation
We solve the problem “bottom-up” (i.e., by solving all the related
sub-problems first).
This is done by filling up an n-dimensional table (matrix). Based on the
results in the table, the solution to the top/original problem is then computed.
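Both methods can be sketched on the Fibonacci example, assuming Fib(0) = 0 and Fib(1) = 1. The function names are illustrative, and functools.lru_cache is used here as one convenient way to cache results; a hand-written dictionary would work equally well.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fib_memo(n):
    """Top-down: recurse, but cache each Fib(k) the first time it is solved."""
    if n < 2:
        return n
    return fib_memo(n - 1) + fib_memo(n - 2)

def fib_table(n):
    """Bottom-up: fill a 1-dimensional table from Fib(0) up to Fib(n)."""
    table = [0, 1] + [0] * max(0, n - 1)
    for k in range(2, n + 1):
        table[k] = table[k - 1] + table[k - 2]
    return table[n]

print(fib_memo(9), fib_table(9))      # 34 34
print(fib_memo(30), fib_table(30))    # 832040 832040
```

In both versions each sub-problem is solved once, so the work grows linearly with n instead of exponentially as in the naive recursion.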
Next class:
• Dynamic programming idea
• Needleman-Wunsch (Global sequence alignment)
• Smith-Waterman (Local sequence alignment)
• Salient differences between global and local alignment
• BLAST
• Statistics
