
Sequence alignment methods

Chakresh

Slides are for academic purpose only


What is the purpose of sequence
alignment?
• Identification of homology and homologous sites in
related sequences
• Inference of the evolutionary history that led to the differences in observed sequences
• Sequence alignment is useful for function prediction, database searching, gene finding, studying sequence divergence, and sequence assembly
• SIMILARITY makes no statement about descent from a
common ancestor. (Convergent versus Divergent
evolution.)
• HOMOLOGY is sequence similarity that can be attributed to descent from a common ancestor
Continued
• Orthologous - Homologous sequences in different species. These sequences usually retain the same function in the two species.
• Homologous sequences are said to be orthologous when
they are direct descendants of a sequence in the common
ancestor (i.e., without having undergone a gene
duplication event)
• Paralogous - Homologous sequences in the same species that arose by means of gene duplication. Homologous sequences in two organisms A and B are also paralogous when they are descendants of two different copies of a sequence that was created by a duplication event in the genome of the common ancestor
• Gene duplication: one gene is duplicated into multiple copies, which are therefore free to evolve and assume new functions
Similarity is NOT equal to Homology
Measures of sequence similarity
• Hamming Distance: defined
between two strings of equal
length, is the number of positions
with mismatching characters.
• Levenshtein Distance: or edit
distance, between two strings of
not necessarily equal length, is
the minimal number of 'edit
operations' required to change
one string into the other, where
an edit operation is a deletion,
insertion or alteration of a single
character in either sequence
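Both distances can be computed in a few lines. The following is a minimal Python sketch of the two definitions above; the function names are illustrative, and the Levenshtein part assumes a unit cost for every edit operation.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein_distance(a, b):
    """Minimal number of single-character deletions, insertions or alterations."""
    # prev[j] is the distance between the prefix of a handled so far and b[:j]
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # delete x
                            curr[j - 1] + 1,           # insert y
                            prev[j - 1] + (x != y)))   # keep or alter
        prev = curr
    return prev[-1]
```

For example, `hamming_distance("ATA", "AGA")` is 1 and `levenshtein_distance("AA", "AGA")` is 1 (a single insertion).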
Types of comparisons and alignment methods
According to sequence coverage: LOCAL or GLOBAL
According to number of sequences: TWO SEQUENCES or THREE OR MORE SEQUENCES

TWO SEQUENCES (Pairwise alignment)
• LOCAL: Database search against query sequences (BLAST algorithm)
• GLOBAL: Comparison of two sequences; first step in multiple alignment

THREE OR MORE SEQUENCES (Multiple alignment)
• LOCAL: Defining consensus sequences, protein structural motifs and domains, regulatory elements in DNA etc.
• GLOBAL: Determination of conserved residues and domains; introductory step in molecular phylogenetic analysis
Introduction to sequence alignment
Given two text strings:
First string = a b c d e
Second string = a c d e f
a reasonable alignment would be
a b c d e -
a - c d e f

We must choose criteria so that an algorithm can choose the best alignment.

Source: Lesk, Introduction to Bioinformatics
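One possible criterion is a simple column-by-column score. The sketch below is an illustration only: the function name and the default scores (match = 1, mismatch = 0, gap = 0) are assumptions chosen to agree with the scoring used later in these slides.

```python
def score_alignment(top, bottom, match=1, mismatch=0, gap=0, gap_char="-"):
    """Sum a score over the columns of an already aligned pair of strings."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have the same length")
    score = 0
    for x, y in zip(top, bottom):
        if x == gap_char or y == gap_char:
            score += gap        # column contains a gap
        elif x == y:
            score += match      # identical characters
        else:
            score += mismatch   # different characters
    return score
```

With the default scores, `score_alignment("abcde-", "a-cdef")` is 4; with `gap=-1` the same alignment scores 2.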


Causes for sequence (dis)similarity

mutation: a nucleotide at a certain location is replaced by another nucleotide (e.g.: ATA → AGA)

insertion: at a certain location one new nucleotide is inserted in between two existing nucleotides (e.g.: AA → AGA)

deletion: at a certain location one existing nucleotide is deleted (e.g.: ACTG → AC-G)

indel: an insertion or a deletion


Alignment Methods
1. Dot plot methods/Dot matrix
2. Dynamic programming/ Rigorous
algorithms
– Needleman-Wunsch (global)
– Smith-Waterman (local)

3. K-tuple methods [Heuristic algorithms] (faster but approximate)
• BLAST
• FASTA
Dot plot method
• Dot plots are most likely the oldest visual representation used to compare two sequences (Maizel and Lenk 1981). In its simplest form, a dot is produced at position (i,j) iff character number i in the first sequence is the same as character number j in the second sequence.
• More elaborate forms use sliding windows and a threshold value for two windows to be considered as matched
• Useful to identify repeats, insertions and deletions
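The simplest form described above can be sketched in Python as follows. This is a text-only illustration: the function name and the use of "*" as the dot character are assumptions.

```python
def dotplot(seq1, seq2, mark="*"):
    """Return one text row per character of seq2.

    Column i of row j holds `mark` if seq1[i] == seq2[j], else a space.
    """
    return ["".join(mark if a == b else " " for a in seq1) for b in seq2]


if __name__ == "__main__":
    seq = "ABRACADABRACADABRA"
    print("  " + seq)                      # sequence 1 along the top
    for letter, row in zip(seq, dotplot(seq, seq)):
        print(letter + " " + row)          # sequence 2 down the side
```

Plotting a sequence against itself always gives the main diagonal; repeats show up as additional parallel diagonals.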
Dotplot:

A dotplot gives an overview of all possible alignments

[Figure: dotplot of Sequence 1 (T A C A T T A C G T A C, horizontal axis) against Sequence 2 (vertical axis, top to bottom: A T T C A C A T A).]
Insertions / Deletions in a Dotplot

[Figure: dotplot of Sequence 1 (T A C T G T T C A T, horizontal axis) against Sequence 2 (T A C T G T C A T, vertical axis).]

T A C T G - T C A T
| | | | |   | | | |
T A C T G T T C A T
The dotplot (1)
• A simple picture that gives an overview of the
similarities between two sequences
• Dotplot showing identities between short name (DOROTHYHODGKIN) and full name (DOROTHYCROWFOOTHODGKIN) of a famous protein crystallographer:

Letters corresponding to isolated matches are shown in non-bold type. The longest matching regions, shown in boldface, are the first and last names DOROTHY and HODGKIN. Shorter matching regions, such as the OTH of dorOTHy and crowfoOTHodgkin, or the RO of doROthy and cROwfoot, are noise.

Source: Lesk, Introduction to Bioinformatics


Dotplot Method/Dot matrix analysis
Dotplot showing identities between a repetitive sequence (ABRACADABRACADABRA) and itself. The repeats appear on several subsidiary diagonals parallel to the main diagonal.

Source: Lesk, Introduction to Bioinformatics


The dotplot (3)
Dotplot showing
identities between
the palindromic
sequence MAX I
STAY AWAY AT
SIX AM and itself.
The palindrome
reveals itself as a
stretch of matches
perpendicular to the
main diagonal.

Source: Lesk, Introduction to Bioinformatics


Dotplot/dot matrix parameters
• Evaluating similarity between 2 sequences
• Window size – the number of nucleotides compared each time (usually an odd number)
• Stringency – the minimum number of nucleotides in the window that must match so that a dot can be placed
• Mismatch Limit – the maximum number of nucleotides in the window that may not match so that a dot can still be placed
• Mismatch Limit = Window size - Stringency
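These parameters translate directly into a windowed comparison. The sketch below is one possible reading, with assumptions stated: the function name is illustrative, windows are compared without gaps, and the dot is recorded at the start position of each window pair (plotting programs may place it at the window centre instead).

```python
def windowed_dotplot(seq1, seq2, window, stringency):
    """Return (i, j) start positions of window pairs that pass the stringency."""
    mismatch_limit = window - stringency   # Mismatch Limit = Window size - Stringency
    dots = []
    for i in range(len(seq1) - window + 1):
        for j in range(len(seq2) - window + 1):
            # count mismatching positions inside the two windows
            mismatches = sum(a != b for a, b in
                             zip(seq1[i:i + window], seq2[j:j + window]))
            if mismatches <= mismatch_limit:
                dots.append((i, j))
    return dots
```

Setting the stringency equal to the window size allows no mismatches, so only exactly matching windows produce a dot.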
Word Size Algorithm
• Word method: a certain word size is fixed; when a matching word is found, a dot is placed between the two sequences

[Figure: word-size dotplot (Word Size = 3) of the sequences T A C G G T A T G and A C A G T A T C.]
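The word method can be sketched by indexing every word of one sequence and looking up the words of the other. This is a minimal Python illustration; the function name and the choice of a dictionary index are assumptions, not something taken from the slides.

```python
from collections import defaultdict


def word_matches(seq1, seq2, word_size):
    """Return (i, j) pairs where the words starting at seq1[i] and seq2[j] are identical."""
    # index every word of seq2 by its start position
    index = defaultdict(list)
    for j in range(len(seq2) - word_size + 1):
        index[seq2[j:j + word_size]].append(j)
    dots = []
    for i in range(len(seq1) - word_size + 1):
        # .get() looks the word up without adding new keys to the index
        for j in index.get(seq1[i:i + word_size], []):
            dots.append((i, j))
    return dots
```

For the two sequences of the figure and a word size of 3, the shared words are GTA and TAT, so `word_matches("TACGGTATG", "ACAGTATC", 3)` returns `[(4, 3), (5, 4)]` (0-based positions).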
Advancement of Dotplot (Word
Method)
• Sliding window techniques:
– The window size/stringency and the mismatch cutoff are predefined. Starting from the first position of one sequence, a window is compared with the windows of the other sequence, and a dot is placed if the number of mismatches is within the permissible limit (i.e. the mismatch cutoff). The window is then shifted by one position and the comparison is repeated, and so on. E.g. if the window size is 8 and the mismatch cutoff is 3, the program will place a dot if there are at most 3 mismatches out of 8 base pairs.
– The same procedure has to be followed for the other sequence.
– Improves visibility and reduces noise
Dotplot
(Window = 130 / Stringency = 9)
[Figure: dotplot of one hemoglobin chain against another.]

Dotplot
(Window = 18 / Stringency = 10)
[Figure: dotplot of one hemoglobin chain against another.]
Considerations

• The window/stringency method is more sensitive than the word size method (ambiguities are permitted).

• The smaller the window, the larger the weight of statistical (unspecific) matches.

• With large windows the sensitivity for short sequences is reduced.

• Insertions/deletions are not treated explicitly.


Terms to be used
• Alignment
• Global (Needleman/Wunsch techniques)
• Local
• Pairwise alignment
• Multiple alignment
Alignment methods
• Rigorous algorithms = Dynamic
Programming
– Needleman-Wunsch (global)
– Smith-Waterman (local)
• Heuristic algorithms
(faster but approximate)
• BLAST
• FASTA
Global vs. Local
Alignments
• Global alignment algorithms start at the beginning of two sequences and add gaps to each until the end of one is reached.
• needle (Needleman & Wunsch) creates an end-to-end alignment.
• Local alignment algorithms find the region (or regions) of highest similarity between two sequences and build the alignment outward from there.
Longest common subsequence
problem
• Substrings are consecutive parts of a
string, while subsequences need not be.
This means that a substring of a string is
always a subsequence of the string, but a
subsequence of a string is not always a
substring of the string.
• Solution: the dynamic programming (DP) method
• The DP method is applicable to problems exhibiting the properties of overlapping subproblems and optimal substructure
• Dynamic programming was originally used in the 1940s by Richard Bellman for optimization
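As an illustration of the DP idea, the length of the longest common subsequence can be computed by filling a table of subproblem results. The following is a minimal Python sketch; the function name is illustrative.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b."""
    # table[i][j] = LCS length of the prefixes a[:i] and b[:j]
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, start=1):
        for j, y in enumerate(b, start=1):
            if x == y:
                # extend the best subsequence of the two shorter prefixes
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                # drop the last character of one of the two prefixes
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]
```

For example, `lcs_length("abcde", "acdef")` is 4 (the subsequence a c d e). Each table cell is computed once and reused, which is the overlapping-subproblems property in action.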
Global Alignment

Two sequences sharing several regions of local similarity:

1 AGGATTGGAATGCTCAGAAGCAGCTAAAGCGTGTATGCAGGATTGGAATTAAAGAGGAGGTAGACCG.... 67
|||||||||||||| | | | |||| || | | | ||
1 AGGATTGGAATGCTAGGCTTGATTGCCTACCTGTAGCCACATCAGAAGCACTAAAGCGTCAGCGAGACCG 70
Global Alignment
(Needleman -Wunsch)
• The Needleman-Wunsch algorithm creates a
global alignment over the length of both
sequences (needle)
• Global algorithms are often not effective for highly diverged sequences: they do not reflect the biological reality that two sequences may only share limited regions of conserved sequence.
– Sometimes two sequences may be derived from
ancient recombination events where only a single
functional domain is shared.
• Global methods are useful when you want to force
two sequences to align over their entire length
Needleman-Wunsch algorithm (Global Alignment)
Steps in dynamic programming
• Assumption
• Initialization
• Matrix fill (scoring)
• Traceback (alignment)
Assumption

• $S_{i,j} = 1$ if the residue at position i of sequence #1 is the same as the residue at position j of sequence #2 (match score); otherwise
• $S_{i,j} = 0$ (mismatch score)
• $w = 0$ (gap penalty)
Initialization

• Create a matrix with M + 1 columns and N + 1 rows, where M and N correspond to the sizes of the sequences to be aligned.
Matrix Fill Step
For each position, $M_{i,j}$ is defined to be the maximum score at position i,j; i.e.

$$
M_{i,j} = \max
\begin{cases}
M_{i-1,j-1} + S_{i,j} & \text{(match/mismatch in the diagonal)} \\
M_{i,j-1} + w & \text{(gap in sequence \#1)} \\
M_{i-1,j} + w & \text{(gap in sequence \#2)}
\end{cases}
$$
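The matrix fill can be sketched in Python as below. This is an illustration under stated assumptions: the function name is hypothetical, sequence #1 runs along the columns and sequence #2 along the rows, and the gap value is added to the score as in the formula, so a penalty is passed as a negative number.

```python
def nw_fill(seq1, seq2, match=1, mismatch=0, gap=0):
    """Fill the Needleman-Wunsch score matrix (rows: seq2, columns: seq1)."""
    rows, cols = len(seq2) + 1, len(seq1) + 1
    M = [[0] * cols for _ in range(rows)]
    # first row and first column: a run of k gaps scores k * gap
    for c in range(1, cols):
        M[0][c] = c * gap
    for r in range(1, rows):
        M[r][0] = r * gap
    for r in range(1, rows):
        for c in range(1, cols):
            s = match if seq1[c - 1] == seq2[r - 1] else mismatch
            M[r][c] = max(M[r - 1][c - 1] + s,   # match/mismatch on the diagonal
                          M[r - 1][c] + gap,     # step down: gap in seq1
                          M[r][c - 1] + gap)     # step right: gap in seq2
    return M
```

With the default scores, `nw_fill("GAATTCAGTTA", "GGATCGA")[-1][-1]` is 6, the maximum alignment score of the two test sequences.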
Matrix Fill Step

• Note that in the example, $M_{i-1,j-1}$ will be red, $M_{i,j-1}$ will be green and $M_{i-1,j}$ will be blue.
• Using this information, the score at position 1,1 in the matrix can be calculated. Since the first residue in both sequences is a G, $S_{1,1} = 1$, and by the assumptions stated at the beginning, $w = 0$.
• Thus, $M_{1,1} = \max[M_{0,0} + 1,\ M_{1,0} + 0,\ M_{0,1} + 0] = \max[1, 0, 0] = 1$.
• A value of 1 is then placed in position 1,1 of the scoring matrix.
Matrix Fill Step (continued)
• Using the same techniques as described for column 2, we can fill in column 3.
• After filling in all of the values, the score matrix is as follows:
      G  A  A  T  T  C  A  G  T  T  A
   0  0  0  0  0  0  0  0  0  0  0  0
G  0  1  1  1  1  1  1  1  1  1  1  1
G  0  1  1  1  1  1  1  1  2  2  2  2
A  0  1  2  2  2  2  2  2  2  2  2  3
T  0  1  2  2  3  3  3  3  3  3  3  3
C  0  1  2  2  3  3  4  4  4  4  4  4
G  0  1  2  2  3  3  4  4  5  5  5  5
A  0  1  2  3  3  3  4  5  5  5  5  6
Trace back step
• After the matrix fill step, the maximum
alignment score for the two test
sequences is 6. The trace back step
determines the actual alignment(s) that
result in the maximum score.
• The trace back step begins in the M I,J
position in the matrix, i.e. the position that
leads to the maximal score. In this case,
there is a 6 in that location.
Trace back step
• Trace back takes the current cell and looks to
the neighbor cells that could be direct
predecessors.
• This means it looks to the neighbor to the left;
horizontal (gap in sequence #2), the diagonal
neighbor (match/mismatch), and the neighbor
above it; vertical (gap in sequence #1).
• In this case, the neighbors are marked in red.
They are all also equal to 5.
Trace back step
Current alignment

Seq #1  A
        |
Seq #2  A
So now we look at the current cell and
determine which cell is its direct
predecessor. In this case, it is the cell with
the red 5.
Trace back step
The current alignment is

Seq #1  T A
          |
Seq #2  _ A
Trace back step
• Once again, the direct predecessor
produces a gap in sequence #2.

Current alignment is

Seq #1  T T A
            |
Seq #2  _ _ A
Trace back step
• Continuing on with the trace back step, we
eventually get to a position in column 0
row 0 which tells us that trace back is
completed. One possible maximum
alignment is :
Output

GAATTCAGTTA
GGA_TC_G__A
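The fill and trace back steps can be combined in one short Python sketch. Assumptions: the function name is illustrative, "_" marks a gap as in the output above, the gap value is added to the score (pass a negative number for a penalty), and ties between predecessors are broken in the order diagonal, left, up. Because several alignments can reach the maximum score, the alignment returned may differ from the one shown on the slide.

```python
def needleman_wunsch(seq1, seq2, match=1, mismatch=0, gap=0):
    """Return (score, aligned seq1, aligned seq2) for one optimal global alignment."""
    rows, cols = len(seq2) + 1, len(seq1) + 1
    M = [[0] * cols for _ in range(rows)]
    for c in range(1, cols):
        M[0][c] = c * gap
    for r in range(1, rows):
        M[r][0] = r * gap
    for r in range(1, rows):
        for c in range(1, cols):
            s = match if seq1[c - 1] == seq2[r - 1] else mismatch
            M[r][c] = max(M[r - 1][c - 1] + s, M[r - 1][c] + gap, M[r][c - 1] + gap)

    # trace back from the bottom-right cell to row 0, column 0
    r, c = rows - 1, cols - 1
    top, bottom = [], []
    while r > 0 or c > 0:
        if r > 0 and c > 0:
            s = match if seq1[c - 1] == seq2[r - 1] else mismatch
        if r > 0 and c > 0 and M[r][c] == M[r - 1][c - 1] + s:
            top.append(seq1[c - 1]); bottom.append(seq2[r - 1])   # diagonal
            r, c = r - 1, c - 1
        elif c > 0 and M[r][c] == M[r][c - 1] + gap:
            top.append(seq1[c - 1]); bottom.append("_")           # left: gap in seq2
            c -= 1
        else:
            top.append("_"); bottom.append(seq2[r - 1])           # up: gap in seq1
            r -= 1
    return M[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))
```

Calling `needleman_wunsch("GAATTCAGTTA", "GGATCGA")` returns a score of 6 together with one alignment that contains six matching columns.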
Question
• Seq 1: AAGTGTGGTCCG
• Seq 2: AAATTGTGTGTCC

• Take these two sequences and find the optimal global alignment score using the dynamic programming method
• Write a computer program for the same
Alternative solution

Giving an alignment of:
G_AATTCAGTTA
GG_A_TC_G__A
Sequence alignment with gap: Initialization
gap penalty = 1
match score = 1
mismatch score = 0

        A   C   T   C   G
    0  -1  -2  -3  -4  -5
A  -1
C  -2
A  -3
G  -4
T  -5
A  -6
G  -7
Matrix fill (scoring)
        A   C   T   C   G
    0  -1  -2  -3  -4  -5
A  -1   1   0  -1  -2  -3
C  -2   0   2   1   0  -1
A  -3  -1   1   2   1   0
G  -4  -2   0   1   2   2
T  -5  -3  -1   1   1   2
A  -6  -4  -2   0   1   1
G  -7  -5  -3  -1   0   2
The Recurrence Relations

$$
F(i,j) = \max
\begin{cases}
F(i-1,j-1) + s(x_i, y_j) \\
F(i-1,j) - w \\
F(i,j-1) - w
\end{cases}
$$

where w is the gap penalty, subtracted for each gap.
Tools for Global and Local alignment
Traceback (alignment)
gap penalty = 1
match score = 1
mismatch score = 0

        A   C   T   C   G
    0  -1  -2  -3  -4  -5
A  -1   1   0  -1  -2  -3
C  -2   0   2   1   0  -1
A  -3  -1   1   2   1   0
G  -4  -2   0   1   2   2
T  -5  -3  -1   1   1   2
A  -6  -4  -2   0   1   1
G  -7  -5  -3  -1   0   2

Output:
AC--TCG
ACAGTAG
Local Alignment
(Smith-Waterman)
• Local alignment
– Identify the most similar sub-region shared
between two sequences
– Smith-Waterman

– EMBOSS: water
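A minimal Python sketch of the Smith-Waterman idea follows. It is not the EMBOSS water implementation: the function name and the scoring values (match = 2, mismatch = -1, gap = -1) are assumptions chosen for illustration. The two differences from the global method are that cell scores never drop below 0 and that the trace back starts at the highest-scoring cell and stops at the first 0.

```python
def smith_waterman(seq1, seq2, match=2, mismatch=-1, gap=-1):
    """Return (score, aligned part of seq1, aligned part of seq2) for one best local alignment."""
    rows, cols = len(seq2) + 1, len(seq1) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for r in range(1, rows):
        for c in range(1, cols):
            s = match if seq1[c - 1] == seq2[r - 1] else mismatch
            # 0 lets a new local alignment start at any cell
            H[r][c] = max(0, H[r - 1][c - 1] + s, H[r - 1][c] + gap, H[r][c - 1] + gap)
            if H[r][c] > best:
                best, best_pos = H[r][c], (r, c)

    # trace back from the best cell until a cell with score 0 is reached
    r, c = best_pos
    top, bottom = [], []
    while r > 0 and c > 0 and H[r][c] > 0:
        s = match if seq1[c - 1] == seq2[r - 1] else mismatch
        if H[r][c] == H[r - 1][c - 1] + s:
            top.append(seq1[c - 1]); bottom.append(seq2[r - 1])
            r, c = r - 1, c - 1
        elif H[r][c] == H[r][c - 1] + gap:
            top.append(seq1[c - 1]); bottom.append("-")   # gap in seq2
            c -= 1
        else:
            top.append("-"); bottom.append(seq2[r - 1])   # gap in seq1
            r -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))
```

For example, `smith_waterman("TTTACGTTT", "GGACGGG")` returns `(6, "ACG", "ACG")`: only the shared sub-region is reported, and the dissimilar flanks are ignored.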
