Professional Documents
Culture Documents
Bio 4
Bio 4
Bio 4
(4)
BOYER-MOORE
DR. IBRAHIM ZAGHLOUL
String Definitions
2
String Definitions
3
String Definitions
4
Exact matching
5
Exact Matching
6
Exact matching: naïve algorithm
7
Exact matching: naïve algorithm
8
Exact matching: naïve algorithm
9
Can we improve on the naïve algorithm?
P: word
T: There would have been a time for such a word
word
P: word
T: There would have been a time for such a word
word
word skip!
word skip!
word
10
Boyer-Moore
Learn from character comparisons to skip pointless alignments
P: word
T: There would have been a time for such a word
word
Boyer, RS and Moore, JS. "A fast string searching algorithm." Communications of theACM 20.10 (1977): 762-772.
11
Boyer-Moore: Bad character rule
Upon mismatch, skip alignments until:
(a) mismatch becomes a match, or
(b) P moves past mismatched character.
(c) If there was no mismatch, don't skip
Step 1:
T: G C T T C T G C T A C C T T T T G C G C G C G C G C G G A
A CCTTTTGC Case (a)
P:
Step 2:
T: G C T T C T G C T A C C T T T T G C G C G C G C G C G G A
A CCTTTTGC Case (b)
P:
T: G C T T C T G C T A C C T T T T G C G C G C G C G C G G A
Step 3: A CCTTTTGC Case (c)
P:
Step 4: T: G C T T C T G C T A C C T T T T G C G C G C G C G C G G A A
P: CCTTTTGC
(etc)
Boyer-Moore: Bad character rule
Step 1:
T: G C T T C T G C T A C C T T T T G C G C G C G C G C G G A A
P: C C T T T T G C
T: G C T T C T G C T A C C T T T T G C G C G C G C G C G G A A
Step 2:
P: CCTTTTGC
Step 3:
T: G C T T C T G C T A C C T T T T G C G C G C G C G C G G A
A CCTTTTGC
P:
Up to step 3, we skipped 8 alignments
Step 2: T: C G T G C C T A C T T A C T T A C T T A C T T A C G C G A A
P: CTTACTTAC
Step 3: T: C G T G C C T A C T T A C T T A C T T A C T T A C G C G A A
P: CTTACTTAC
Boyer-Moore: Good suffix rule
Let t = substring matched by inner loop; skip until (a) there
are no mismatches between P and tor (b) P moves past t
t
Step 1:
T: C G T G C C T A C T T A C T T A C T T A C T T A C G C G A A
P: C T T A C T T A C
t occurs in its entirety to the left within P
t
Step 2: T: C G T G C C T A C T T A C T T A C T T A C T T A C G C G A A
P: CTTACTTAC
prefix of P matches a suffix of t
Step 3: T: C G T G C C T A C T T A C T T A C T T A C T T A C G C G A A
P: CTTACTTAC
Case (a) has two subcases according to whether t occurs in its entirety to the
left within P (as in step 1), or a prefix of P matches a suffix of t (as in step 2)
Boyer-Moore: Putting it together
How to combine bad character and good suffix rules?
T: G T T A T A G C T G A T C G C G G C G T A G C G G C G A A
P: GTAGCGGCG
Step 1:
T: G T T A T A G C T G A T C G C G G C G T A G C G G C G A A
P: G T A G C G G C G bc:6,gs:0 bad character
Step 2:
T: G T T A T A G C T G A T C G C G G C G T A G C G G C G A A
P: GTAGCGGCG bc:0, gs:2 good suffix
Step 3: T: G T T A T A G C T G A T C G C G G C G T A G C G G C G A A
P: GTAGCGGCG bc:2, gs:7 good suffix
T: GTTATAGCTGATCGCGGCGTAGCGGCGAA
Step 4:
P: GTAGCGGCG
11 characters of T we ignored
Step 1:
T: G T T A T A G C T G A T C G C G G C G T A G C G G C G A A
P: G T A G C G G C G
Step 2:
T: G T T A T A G C T G A T C G C G G C G T A G C G G C G A A
P: GTAGCGGCG
T: G T T A T A G C T G A T C G C G G C G T A G C G G C G A A
Step 3:
P: GTAGCGGCG
Step 4:
T: G T T A T A G C T G A T C G C G G C G T A G C G G C G A A
P: GTAGCGGCG
Skipped 15 alignments
Boyer-Moore: Preprocessing
Pre-calculate skips for all possible mismatch scenarios!
For bad character rule and P = TCGC:
P
T C G C
A
Σ C - -
G -
T -
Boyer-Moore: Preprocessing
Pre-calculate skips for all possible mismatch scenarios!
For bad character rule and P = TCGC:
P
T C G C
A 0 1 2 3
T: A A T C A A T A G C Skip: 1 alignments
(2 shifts)
Σ C 0 - 0 - P: T C G C
G 0 1 - 0
T - 0 1 2
This can be constructed efficiently.
Boyer-Moore: Preprocessing
Pre-calculate skips for all possible mismatch scenarios!
For bad character rule and P = TCGC:
P
T C G C T: A A T C A A T C G C
P: T C G C
A 0 1 2 3
Σ C 0 - 0 -
T: A A T C A A T C G C Skip: 3 alignments
(4 shifts)
G 0 1 - 0 P: TCGC
T - 0 1 2
This can be constructed efficiently.
Boyer-Moore: Preprocessing
Pre-calculate skips for all possible mismatch scenarios!
For bad character rule and P = TCGC:
P
T: A A T C A A T A G C
T C G C P: T C G C
A 0 1 2 3 T: A A T C A A T A G C
Σ C 0 - 0 - P: TCGC
G 0 1 - 0 T: A A T C A A T C G C
T - 0 1 2 P: TCGC
This can be constructed efficiently.
Boyer-Moore: Preprocessing
Pre-calculate skips for all possible mismatch scenarios!
For bad character rule and P = G TAG C G G C G
P
G T A G C G G C G P: G T A G C G G C G
A
Σ C
G
T
T: G T T A T A G C T G A T C G C G G C G T A G C G G C G A A
P: G T A G C G G C G
Boyer-Moore: Preprocessing
Pre-calculate skips for all possible mismatch scenarios!
For bad character rule and P = G TAG C G G C G
P
G T A G C G G C G P: G T A G C G G C G
A -
Σ C - -
G - - - - -
T -
T: G T T A T A G C T G A T C G C G G C G T A G C G G C G A A
P: G T A G C G G C G
Boyer-Moore: Preprocessing
Pre-calculate skips for all possible mismatch scenarios!
For bad character rule and P = G TAG C G G C G
P
G T A G C G G C G P: G T A G C G G C G
A 0 1 - 0 1 2 3 4 5
Σ C 0 1 2 3 - 0 1 - 0
G - 0 1 - 0 - - 0 -
T 0 - 0 1 2 3 4 5 6
T: G T T A T A G C T G A T C G C G G C G T A G C G G C G A A
P: G T A G C G G C G
Boyer-Moore: Preprocessing
Pre-calculate skips for all possible mismatch scenarios!
For bad character rule and P = G TAG C G G C G
P
G T A G C G G C G P: G T A G C G G C G
A 0 1 - 0 1 2 3 4 5
Σ C 0 1 2 3 - 0 1 - 0
G - 0 1 - 0 - - 0 -
T 0 - 0 1 2 3 4 5 6
T: G T T A T A G C T G A T C G C G G C G T A G C G G C G A A
P: GTAGCGGCG
Boyer-Moore: Preprocessing
Pre-calculate skips for all possible mismatch scenarios!
For bad character rule and P = G TAG C G G C G
P
G T A G C G G C G P: G T A G C G G C G
A 0 1 - 0 1 2 3 4 5
Σ C 0 1 2 3 - 0 1 - 0
G - 0 1 - 0 - - 0 -
T 0 - 0 1 2 3 4 5 6
T: G T T A T A G C T G A T C G C G G C G T A G C G G C G A A
P: GTAGCGGCG
Boyer-Moore: Preprocessing
Pre-calculate skips for all possible mismatch scenarios!
For bad character rule and P = G TAG C G G C G
P
G T A G C G G C G P: G T A G C G G C G
A 0 1 - 0 1 2 3 4 5
Σ C 0 1 2 3 - 0 1 - 0
G - 0 1 - 0 - - 0 -
T 0 - 0 1 2 3 4 5 6
T: G T T A T A G C T G A T C G C G G C G T A G C G G C G A A
P: GTAGCGGCG
Boyer-Moore: Preprocessing
Pre-calculate skips for all possible mismatch scenarios!
For bad character rule and P = G TAG C G G C G
P
G T A G C G G C G P: G T A G C G G C G
A 0 1 - 0 1 2 3 4 5
Σ C 0 1 2 3 - 0 1 - 0
G - 0 1 - 0 - - 0 -
T 0 - 0 1 2 3 4 5 6
T: G T T A T A G C T G A T C G C G G C G T A G C G G C G A A
P: GTAGCGGCG
Boyer-Moore: Preprocessing
Pre-calculate skips for all possible mismatch scenarios!
For bad character rule and P = G TAG C G G C G
P
G T A G C G G C G P: G T A G C G G C G
A 0 1 - 0 1 2 3 4 5
Σ C 0 1 2 3 - 0 1 - 0
G - 0 1 - 0 - - 0 -
T 0 - 0 1 2 3 4 5 6
T: G T T A T A G C T G A T C G C G G C G T A G C G G C G A A
P: GTAGCGGCG
Boyer-Moore: Preprocessing
Pre-calculate skips for all possible mismatch scenarios!
For bad character rule and P = G TAG C G G C G
P
G T A G C G G C G P: G T A G C G G C G
A 0 1 - 0 1 2 3 4 5
Σ C 0 1 2 3 - 0 1 - 0
G - 0 1 - 0 - - 0 -
T 0 - 0 1 2 3 4 5 6
T: G T T A T A G C T G A T C G C G G C G T A G C G G C G A A
P: GTAGCGGCG
Boyer-Moore: Preprocessing
Pre-calculate skips for all possible mismatch scenarios!
For bad character rule and P = G TAG C G G C G
P
G T A G C G G C G P: G T A G C G G C G
A 0 1 - 0 1 2 3 4 5
Σ C 0 1 2 3 - 0 1 - 0
G - 0 1 - 0 - - 0 -
T 0 - 0 1 2 3 4 5 6
T: G T T A T A G C T G A T C G C G G C G T A G C G G C G A A
P: GTAGCGGCG
Boyer-Moore: Preprocessing
P T
Results
Preprocessing: Boyer-Moore
P
P reprocess P: M a k e l o o k u p
tables for bad character &
good suffix rules T
Boyer-Moore
Results
Preprocessing
P r e p r o c e s s T ( o ff l i n e ) :
Genome or sequence that does not have frequent changes.
P
Preprocessing
Algorithm that preprocesses T is
offline. Otherwise,algorithm is online.
T Online or offline?
• Naïve algorithm
• Boyer-Moore
P
• Web search engine
• Read alignment
You have a pattern P and you expect to receive several texts T1,
T2, T3, ..., and you would like to match P against each text as it
arrives. It is better to use an algorithm that: