Professional Documents
Culture Documents
DP and Edit Dist
DP and Edit Dist
DP and Edit Dist
You are free to use these slides. If you do, please sign the
guestbook (www.langmead-lab.org/teaching-materials), or email
me (ben.langmead@gmail.com) and tell me briefly how you’re
using them. For original Keynote files, email me.
Beyond approximate matching: sequence similarity
In many settings, Hamming and edit distance are too simple. Biologically-relevant
distances require algorithms. We will expand our tool set accordingly.
X: G T A G C G G C G
|| |||||| AKA insertion in X or deletion in Y
Y: G T - G C G G C G
Gap in Y
Alignment
X: G C G T A T G A G G C T A - A C G C
|| |||| ||||| ||||
Y: G C - T A T G C G G C T A T A C G C
editDistance(x, y) ≤ hammingDistance(x, y)
editDistance(x, y) ≥ | | x | - | y | |
x: GCGTATGCGGCTAACGC Operations:
M = match, R = replace,
y: GCTATGCGGCTATACGC I = insert into x, D = delete from x
x: GCGTATGCGGCTAACGC
|| MMD
y: G C - T A T G C G G C T A T A C G C
x: GCGTATGCGGCTA-ACGC
|| |||||||||| MMDMMMMMMMMMMI
y: G C - T A T G C G G C T A T A C G C
x: GCGTATGCGGCTA-ACGC
|| |||||||||| |||| MMDMMMMMMMMMMIMMMM
y: G C - T A T G C G G C T A T A C G C
Edit distance
x: GCGTATGCGGCTA-ACGC
|| |||||||||| |||| MMDMMMMMMMMMMIMMMM Distance = 2
y: G C - T A T G C G G C T A T A C G C
x: GCGTATGAGGCTA-ACGC
|| |||| ||||| |||| MMDMMMMRMMMMMIMMMM Distance = 3
y: G C - T A T G C G G C T A T A C G C
x: the longest----
||||||| DDDDMMMMMMMIIII Distance = 8
y: - - - - l o n g e s t d a y
Edit distance
D[i, j]: edit distance between length-i prefix of x and length-j prefix of y
i
x:
y:
j
Think in terms of edit transcript. Optimal transcript for D[i, j] can be built
by extending a shorter one by 1 operation. Only 3 options:
D[i, j] is minimum of the three, and D[|x|, |y|] is the overall edit distance
Edit distance
Much better
Edit distance: dynamic programming
ϵ is empty y
string
ϵ G C T A T G C C A C G C Let n = | x |, m = | y |
ϵ
G D: (n+1) x (m+1) matrix
C D[i, j] = edit distance b/t
G length-i prefix of x and
T length-j prefix of y
A
D: x T
G
C
A
C
G
C
Edit distance: dynamic programming
y
ϵ G C T A T G C C A C G C
ϵ
G D[6, 5]
C D[5, 5]
G
D[5, 6]
T
A
D: x T D[6, 6]
G Value in a cell depends upon its upper,
C left, and upper-left neighbors
A 8
C < D[i 1, j] + 1 upper
left upper-left
G D[i, j] = min
:
D[i, j 1] + 1
D[i 1, j 1] + (x[i 1], y[j 1])
C
Edit distance: dynamic programming
D
=
numpy.zeros((len(x)+1,
len(y)+1),
dtype=int)
First few lines
D[0,
1:]
=
range(1,
len(y)+1)
of edDistDp:
D[1:,
0]
=
range(1,
len(x)+1)
ϵ G C T A T G C C A C G C
ϵ 0 1 2 3 4 5 6 7 8 9 10 11 12 Initialize D[0, j] to j,
G 1 D[i, 0] to i
C 2
G 3
T 4
A 5
T 6
G 7
C 8
A 9
C 10
G 11
C 12
Edit distance: dynamic programming
for
i
in
xrange(1,
len(x)+1):
Loop from
for
j
in
xrange(1,
len(y)+1):
edDistDp:
delt
=
1
if
x[i-‐1]
!=
y[j-‐1]
else
0
D[i,
j]
=
min(D[i-‐1,
j-‐1]+delt,
D[i-‐1,
j]+1,
D[i,
j-‐1]+1)
ϵ G C T A T G C C A C G C
ϵ 0 1 2 3 4 5 6 7 8 9 10 11 12 Fill remaining cells from
G 1 top row to bottom and
C 2 from left to right
G 3
T 4
A 5 etc
T 6
G 7
C 8
A 9
C 10
G 11
C 12
Edit distance: dynamic programming
for
i
in
xrange(1,
len(x)+1):
Loop from
for
j
in
xrange(1,
len(y)+1):
edDistDp:
delt
=
1
if
x[i-‐1]
!=
y[j-‐1]
else
0
D[i,
j]
=
min(D[i-‐1,
j-‐1]+delt,
D[i-‐1,
j]+1,
D[i,
j-‐1]+1)
ϵ G C T A T G C C A C G C
ϵ 0 1 2 3 4 5 6 7 8 9 10 11 12 Fill remaining cells from
G 1 ? top row to bottom and
C 2 from left to right
G 3
T 4 What goes here in i=1,j=1?
A 5 x[i-‐1] = y[j-‐1] = ‘G ‘,
T 6 so delt =
0
G 7
C 8 D[i,
j]
=
min(D[i-‐1,
j-‐1]+delt,
D[i-‐1,
j]+1,
A 9
D[i,
j-‐1]+1)
=
min(0
+
0,
1
+
1,
1
+
1)
C 10
=
0
G 11
C 12
Edit distance: dynamic programming
for
i
in
xrange(1,
len(x)+1):
Loop from
for
j
in
xrange(1,
len(y)+1):
edDistDp:
delt
=
1
if
x[i-‐1]
!=
y[j-‐1]
else
0
D[i,
j]
=
min(D[i-‐1,
j-‐1]+delt,
D[i-‐1,
j]+1,
D[i,
j-‐1]+1)
ϵ G C T A T G C C A C G C
ϵ 0 1 2 3 4 5 6 7 8 9 10 11 12 Fill remaining cells from
G 1 0 1 2 3 4 5 6 7 8 9 10 11 top row to bottom and
C 2 1 0 1 2 3 4 5 6 7 8 9 10 from left to right
G 3 2 1 1 2 3 3 4 5 6 7 8 9
T 4 3 2 1 2 2 3 4 5 6 7 8 9
A 5 4 3 2 1 2 3 4 5 5 6 7 8
T 6 5 4 3 2 1 2 3 4 5 6 7 8
G 7 6 5 4 3 2 1 2 3 4 5 6 7
C 8 7 6 5 4 3 2 1 2 3 4 5 6
A 9 8 7 6 5 4 3 2 2 2 3 4 5
C 10 9 8 7 6 5 4 3 2 3 2 3 4 Edit distance for x, y
G 11 10 9 8 7 6 5 4 3 3 3 2 3
C 12 11 10 9 8 7 6 5 4 4 3 3 2
Edit distance: dynamic programming
for
i
in
xrange(1,
len(x)+1):
Loop from
for
j
in
xrange(1,
len(y)+1):
edDistDp:
delt
=
1
if
x[i-‐1]
!=
y[j-‐1]
else
0
D[i,
j]
=
min(D[i-‐1,
j-‐1]+delt,
D[i-‐1,
j]+1,
D[i,
j-‐1]+1)
ϵ G C T A T G C C A C G C
ϵ 0 1 2 3 4 5 6 7 8 9 10 11 12 Could we have filled the
G 1 cells in a different order?
C 2
G 3
T 4
A 5 etc
T 6
G 7
C 8
A 9
C 10
G 11
C 12
Edit distance: dynamic programming
for
j
in
xrange(1,
len(y)+1):
Switched
for
i
in
xrange(1,
len(x)+1):
delt
=
1
if
x[i-‐1]
!=
y[j-‐1]
else
0
D[i,
j]
=
min(D[i-‐1,
j-‐1]+delt,
D[i-‐1,
j]+1,
D[i,
j-‐1]+1)
ϵ G C T A T G C C A C G C
ϵ 0 1 2 3 4 5 6 7 8 9 10 11 12
G 1 Yes: e.g. invert the loops
C 2
G 3
T 4
A 5
T 6 etc
G 7
C 8
A 9
C 10
G 11
C 12
Edit distance: dynamic programming
ϵ G C T A T G C C A C G C
ϵ 0 1 2 3 4 5 6 7 8 9 10 11 12
G 1 Or by anti-diagonal
C 2
G 3
T 4 etc
A 5
T 6
G 7
C 8
A 9
C 10
G 11
C 12
Edit distance: dynamic programming
ϵ G C T A T G C C A C G C
ϵ 0 1 2 3 4 5 6 7 8 9 10 11 12
G 1 Or blocked
C 2
G 3
T 4
A 5
T 6
G 7
C 8
A 9
C 10
G 11 etc
C 12
Edit distance: getting the alignment
ϵ G C T A T G C C A C G C
ϵ 0 1 2 3 4 5 6 7 8 9 10 11 12
G 1 0 1 2 3 4 5 6 7 8 9 10 11
C 2 1 0 1 2 3 4 5 6 7 8 9 10
G 3 2 1 1 2 3 3 4 5 6 7 8 9
T 4 3 2 1 2 2 3 4 5 6 7 8 9
A 5 4 3 2 1 2 3 4 5 5 6 7 8
T 6 5 4 3 2 1 2 3 4 5 6 7 8
G 7 6 5 4 3 2 1 2 3 4 5 6 7
C 8 7 6 5 4 3 2 1 2 3 4 5 6
A 9 8 7 6 5 4 3 2 2 2 3 4 5
C 10 9 8 7 6 5 4 3 2 3 2 3 4 A: From here
G 11 10 9 8 7 6 5 4 3 3 3 2 3 Q: How did I get here?
C 12 11 10 9 8 7 6 5 4 4 3 3 2
Edit distance: getting the alignment
ϵ G C T A T G C C A C G C
ϵ 0 1 2 3 4 5 6 7 8 9 10 11 12
G 1 0 1 2 3 4 5 6 7 8 9 10 11
C 2 1 0 1 2 3 4 5 6 7 8 9 10
G 3 2 1 1 2 3 3 4 5 6 7 8 9
T 4 3 2 1 2 2 3 4 5 6 7 8 9
A 5 4 3 2 1 2 3 4 5 5 6 7 8
T 6 5 4 3 2 1 2 3 4 5 6 7 8
G 7 6 5 4 3 2 1 2 3 4 5 6 7
C 8 7 6 5 4 3 2 1 2 3 4 5 6
A 9 8 7 6 5 4 3 2 2 2 3 4 5 A: From here
C 10 9 8 7 6 5 4 3 2 3 2 3 4 Q: How did I get here?
G 11 10 9 8 7 6 5 4 3 3 3 2 3
C 12 11 10 9 8 7 6 5 4 4 3 3 2
Edit distance: getting the alignment
ϵ G C T A T G C C A C G C
ϵ 0 1 2 3 4 5 6 7 8 9 10 11 12
G 1 0 1 2 3 4 5 6 7 8 9 10 11
C 2 1 0 1 2 3 4 5 6 7 8 9 10
G 3 2 1 1 2 3 3 4 5 6 7 8 9
T 4 3 2 1 2 2 3 4 5 6 7 8 9
A 5 4 3 2 1 2 3 4 5 5 6 7 8
T 6 5 4 3 2 1 2 3 4 5 6 7 8 A: From here
G 7 6 5 4 3 2 1 2 3 4 5 6 7 Q: How did I get here?
C 8 7 6 5 4 3 2 1 2 3 4 5 6
A 9 8 7 6 5 4 3 2 2 2 3 4 5
C 10 9 8 7 6 5 4 3 2 3 2 3 4
G 11 10 9 8 7 6 5 4 3 3 3 2 3
C 12 11 10 9 8 7 6 5 4 4 3 3 2
Edit distance: getting the alignment
ϵ G C T A T G C C A C G C
ϵ 0 1 2 3 4 5 6 7 8 9 10 11 12
G 1 0 1 2 3 4 5 6 7 8 9 10 11
Alignment:
C 2 1 0 1 2 3 4 5 6 7 8 9 10
G 3 2 1 1 2 3 3 4 5 6 7 8 9 GCGTATG-CACGC
T 4 3 2 1 2 2 3 4 5 6 7 8 9 || |||| |||||
GC-TATGCCACGC
A 5 4 3 2 1 2 3 4 5 5 6 7 8
T 6 5 4 3 2 1 2 3 4 5 6 7 8
G 7 6 5 4 3 2 1 2 3 4 5 6 7
C 8 7 6 5 4 3 2 1 2 3 4 5 6 Edit transcript:
A 9 8 7 6 5 4 3 2 2 2 3 4 5 MMDMMMMIMMMMM
C 10 9 8 7 6 5 4 3 2 3 2 3 4
G 11 10 9 8 7 6 5 4 3 3 3 2 3
C 12 11 10 9 8 7 6 5 4 4 3 3 2
Edit distance: summary
FillIng matrix is O(mn) space and time, and yields edit distance