Sequence Alignment Methods Final
Chakresh
[Figure: dotplot of Sequence 1 (T A C A T T A C G T A C, horizontal axis) against Sequence 2 (A T T C A C A T A, vertical axis)]
Insertions / Deletions in a Dotplot
[Figure: dotplot of Sequence 1 (T A C T G T T C A T, horizontal axis) against Sequence 2 (T A C T G T C A T, vertical axis)]
T A C T G - T C A T
| | | | |   | | | |
T A C T G T T C A T
The dotplot (1)
• A simple picture that gives an overview of the
similarities between two sequences
• Dotplot showing identities between the short name (DOROTHYHODGKIN)
and the full name (DOROTHYCROWFOOTHODGKIN) of a famous protein
crystallographer:
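A minimal sketch of how such a dotplot could be computed, assuming the simplest rule of one dot per identical pair of letters. The function names dotplot and render are illustrative and do not come from the slides.

```python
def dotplot(horizontal, vertical):
    """Return a grid where grid[i][j] is True when vertical[i] == horizontal[j]."""
    return [[v == h for h in horizontal] for v in vertical]


def render(grid, horizontal, vertical):
    """Draw the grid as text: '*' marks an identity, '.' marks no match."""
    lines = ["  " + horizontal]
    for letter, row in zip(vertical, grid):
        lines.append(letter + " " + "".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


short_name = "DOROTHYHODGKIN"
full_name = "DOROTHYCROWFOOTHODGKIN"

# Full name on the horizontal axis, short name on the vertical axis.
grid = dotplot(full_name, short_name)
print(render(grid, full_name, short_name))
```

The printed grid shows two diagonal runs: DOROTHY on the main diagonal, and HODGKIN shifted to the right by the eight letters of CROWFOOT, which is how an insertion appears in a dotplot.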
[Figure: dotplot of T A C G G T A T G against A C A G T A T C, Word Size = 3]
Advancement of Dotplot (Word Method)
• Sliding window techniques:
– The window size/stringency and the mismatch cutoff are
predefined. Starting at the first position of one sequence,
a window of that size is compared with the other sequence,
allowing mismatches up to the permissible limit (i.e. the
mismatch cutoff). The window is then shifted by one position
and the comparison is repeated, and so on. Wherever a window
matches, a dot is placed. E.g. if the window size is 8 and the
mismatch cutoff is 3, the program places a dot when there are
at most 3 mismatches out of 8 base pairs.
– The same procedure is followed along the other
sequence.
– Improves visibility and reduces noise
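The sliding-window rule above can be sketched as follows. This is an illustration under stated assumptions, not the implementation of any particular dotplot program: the function name window_dotplot and its parameters are hypothetical, and a dot is recorded at the start position of each pair of windows that passes the cutoff.

```python
def window_dotplot(seq1, seq2, window=8, max_mismatches=3):
    """Return (i, j) start positions where seq2[i:i+window] and
    seq1[j:j+window] differ in at most max_mismatches positions."""
    dots = []
    for i in range(len(seq2) - window + 1):
        for j in range(len(seq1) - window + 1):
            mismatches = sum(
                a != b for a, b in zip(seq2[i:i + window], seq1[j:j + window])
            )
            if mismatches <= max_mismatches:
                dots.append((i, j))
    return dots


# Exact words of size 3 (no mismatches allowed) shared by two short sequences.
exact_words = window_dotplot("TACGGTATG", "ACAGTATC", window=3, max_mismatches=0)
print(exact_words)
```

With a window of 3 and no mismatches allowed, only the shared words GTA and TAT produce dots. Raising max_mismatches adds dots, so the cutoff trades sensitivity against noise.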
Dotplot
(Window = 130 / Stringency = 9)
[Figure: dotplot of two hemoglobin chains]
Dotplot
(Window = 18 / Stringency = 10)
[Figure: dotplot of two hemoglobin chains]
Considerations
1 AGGATTGGAATGCTCAGAAGCAGCTAAAGCGTGTATGCAGGATTGGAATTAAAGAGGAGGTAGACCG.... 67
|||||||||||||| | | | |||| || | | | ||
1 AGGATTGGAATGCTAGGCTTGATTGCCTACCTGTAGCCACATCAGAAGCACTAAAGCGTCAGCGAGACCG 70
Global Alignment (Needleman-Wunsch)
• The Needleman-Wunsch algorithm creates a
global alignment over the length of both
sequences (EMBOSS: needle)
• Global algorithms are often not effective for highly
diverged sequences: they do not reflect the biological
reality that two sequences may only share limited
regions of conserved sequence.
– Sometimes two sequences may be derived from
ancient recombination events where only a single
functional domain is shared.
• Global methods are useful when you want to force
two sequences to align over their entire length
Needleman-Wunsch algorithm (Global Alignment)
Steps in dynamic programming
• Assumption
• Initialization
• Matrix fill (scoring)
• Traceback (alignment)
Assumption
      G A A T T C A G T T A
    0 0 0 0 0 0 0 0 0 0 0 0
  G 0 1 1 1 1 1 1 1 1 1 1 1
  G 0 1 1 1 1 1 1 1 2 2 2 2
  A 0 1 2 2 2 2 2 2 2 2 2 3
  T 0 1 2 2 3 3 3 3 3 3 3 3
  C 0 1 2 2 3 3 4 4 4 4 4 4
  G 0 1 2 2 3 3 3 4 5 5 5 5
  A 0 1 2 3 3 3 3 4 5 5 5 6
Trace back step
• After the matrix fill step, the maximum
alignment score for the two test
sequences is 6. The trace back step
determines the actual alignment(s) that
result in the maximum score.
• The trace back step begins in the M(I,J)
position in the matrix (the last row and last
column), i.e. the position that leads to the
maximal score. In this case, there is a 6 in
that location.
• Trace back takes the current cell and looks to
the neighbor cells that could be direct
predecessors.
• This means it looks to the neighbor to the left
(horizontal; gap in sequence #2), the diagonal
neighbor (match/mismatch), and the neighbor
above it (vertical; gap in sequence #1).
• In this case, the neighbors are marked in red.
They are all also equal to 5.
current alignment
Seq #1 A
|
Seq #2 A
So now we look at the current cell and
determine which cell is its direct
predecessor. In this case, it is the cell with
the red 5.
The current alignment is
Seq #1  T A
          |
Seq #2  _ A
• Once again, the direct predecessor
produces a gap in sequence #2.
Current alignment is
Seq #1  T T A
            |
Seq #2  _ _ A
• Continuing with the trace back step, we
eventually reach the position in column 0,
row 0, which tells us that trace back is
complete. One possible maximum
alignment is:
Output
GAATTCAGTTA
GGA_TC_G__A
Question
• Seq 1 AAGTGTGGTCCG
• Seq 2 AAATTGTGTGTCC
An alternative trace back for GAATTCAGTTA and GGATCGA gives an alignment of:
G_AATTCAGTTA
GG_A_TC_G__A
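As a quick check, the two alignments of GAATTCAGTTA and GGATCGA shown above can be scored column by column. The sketch below assumes match = 1, mismatch = 0 and gap = 0. The slides do not state the scoring scheme for this example, so these values are an assumption, chosen because they are consistent with the stated maximum score of 6. The function name score_alignment is illustrative.

```python
def score_alignment(top, bottom, match=1, mismatch=0, gap=0, gap_char="_"):
    """Sum the column scores of two aligned strings of equal length."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for a, b in zip(top, bottom):
        if a == gap_char or b == gap_char:
            total -= gap          # a gap column costs the gap penalty
        elif a == b:
            total += match        # identical letters
        else:
            total += mismatch     # different letters
    return total


first = score_alignment("GAATTCAGTTA", "GGA_TC_G__A")
second = score_alignment("G_AATTCAGTTA", "GG_A_TC_G__A")
print(first, second)
```

Under this assumed scheme both alignments reach the same score, which illustrates that trace back can yield more than one maximum alignment.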
Sequence alignment with gap:
Initialization
gap penalty = 1
match score = 1
mismatch score = 0

         A   C   T   C   G
     0  -1  -2  -3  -4  -5
 A  -1
 C  -2
 A  -3
 G  -4
 T  -5
 A  -6
 G  -7
Matrix fill (scoring)

         A   C   T   C   G
     0  -1  -2  -3  -4  -5
 A  -1   1   0  -1  -2  -3
 C  -2   0   2   1   0  -1
 A  -3  -1   1   2   1   0
 G  -4  -2   0   1   2   2
 T  -5  -3  -1   1   1   2
 A  -6  -4  -2   0   1   1
 G  -7  -5  -3  -1   0   2
The Recurrence Relations
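\[
F(i, j) = \max
\begin{cases}
F(i-1, j-1) + s(x_i, y_j) \\
F(i-1, j) - w \\
F(i, j-1) - w
\end{cases}
\]
where s(x_i, y_j) is the match or mismatch score and w is the gap penalty.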
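The recurrence can be sketched in a few lines. This is a minimal illustration, assuming the scoring values given on the slides (match = 1, mismatch = 0, gap penalty w = 1) and assuming that the border cells are initialised to the negated gap penalty times the index. The function name nw_fill is illustrative; x labels the rows (index i) and y labels the columns (index j).

```python
def nw_fill(x, y, match=1, mismatch=0, gap=1):
    """Fill the score matrix F, where F[i][j] scores x[:i] against y[:j]."""
    rows, cols = len(x) + 1, len(y) + 1
    F = [[0] * cols for _ in range(rows)]
    # Initialization: a leading run of gaps costs the gap penalty each time.
    for i in range(1, rows):
        F[i][0] = -gap * i
    for j in range(1, cols):
        F[0][j] = -gap * j
    # Matrix fill: each cell takes the best of its three predecessors.
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(
                F[i - 1][j - 1] + s,  # diagonal: match or mismatch
                F[i - 1][j] - gap,    # from above: gap in y
                F[i][j - 1] - gap,    # from the left: gap in x
            )
    return F


F = nw_fill("ACAGTAG", "ACTCG")
for row in F:
    print(row)
```

The bottom-right cell holds the score of the best global alignment of the two sequences, which is 2 for this pair.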
Tools for Global and Local alignment
Traceback (alignment)
gap penalty = 1
match score = 1
mismatch score = 0

         A   C   T   C   G
     0  -1  -2  -3  -4  -5
 A  -1   1   0  -1  -2  -3
 C  -2   0   2   1   0  -1
 A  -3  -1   1   2   1   0
 G  -4  -2   0   1   2   2
 T  -5  -3  -1   1   1   2
 A  -6  -4  -2   0   1   1
 G  -7  -5  -3  -1   0   2

Output:
AC--TCG
ACAGTAG
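The traceback can be sketched as below, again assuming match = 1, mismatch = 0 and gap penalty = 1, with border cells initialised to the negated gap penalty times the index. The function name nw_align is illustrative. When several predecessors are possible, this sketch prefers the diagonal, then the cell above, then the cell to the left. That tie-breaking order is an arbitrary choice made here and is not taken from the slides.

```python
def nw_align(x, y, match=1, mismatch=0, gap=1):
    """Globally align x (rows) and y (columns); return (aligned_x, aligned_y)."""
    rows, cols = len(x) + 1, len(y) + 1
    F = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        F[i][0] = -gap * i
    for j in range(1, cols):
        F[0][j] = -gap * j
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] - gap, F[i][j - 1] - gap)

    # Traceback: walk from the bottom-right cell back to row 0, column 0.
    i, j = len(x), len(y)
    out_x, out_y = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if x[i - 1] == y[j - 1] else mismatch
            if F[i][j] == F[i - 1][j - 1] + s:      # diagonal predecessor
                out_x.append(x[i - 1])
                out_y.append(y[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and F[i][j] == F[i - 1][j] - gap:  # predecessor above: gap in y
            out_x.append(x[i - 1])
            out_y.append("-")
            i -= 1
        else:                                       # predecessor to the left: gap in x
            out_x.append("-")
            out_y.append(y[j - 1])
            j -= 1
    return "".join(reversed(out_x)), "".join(reversed(out_y))


aligned_rows, aligned_cols = nw_align("ACAGTAG", "ACTCG")
print(aligned_cols)
print(aligned_rows)
```

For these two sequences each step of the traceback has a single valid predecessor, so the tie-breaking order does not affect the result.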
Local Alignment
(Smith-Waterman)
• Local alignment
– Identify the most similar sub-region shared
between two sequences
– Smith-Waterman
– EMBOSS: water
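A minimal sketch of the Smith-Waterman idea, for comparison with the global method. The scoring values used here (match = 1, mismatch = -1, gap penalty = 1) are assumptions chosen for illustration and are not given on the slides. The function name smith_waterman is also illustrative. The two differences from the global fill are that a cell score never drops below zero, and that traceback starts at the highest-scoring cell and stops at the first zero.

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap=1):
    """Return (best score, aligned part of a, aligned part of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # Scores are floored at zero so an alignment can start anywhere.
            H[i][j] = max(0, H[i - 1][j - 1] + s, H[i - 1][j] - gap, H[i][j - 1] - gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Traceback from the best cell until a zero score is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] - gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


score, local_a, local_b = smith_waterman("TACGGTATG", "ACAGTATC")
print(score)
print(local_a)
print(local_b)
```

Only the most similar sub-region of each sequence is reported; the unaligned ends are left out instead of being padded with gaps.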