Lecture-11 String Matching
CT065-3.5-3
String Matching
Level 3 – Computing (Software Engineering)
Topic & Structure of Lesson
Module Code and Module Title Title of Slides Slide 2 (of 37)
Learning Outcomes
Keywords
• String/pattern matching
• Brute Force/Naïve String Search
• KMP algorithm
• Rabin-Karp algorithm
The Pattern Matching Problem
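Given a text string T of length n and a pattern string P of length m, find an index i at which P occurs as a substring of T.
That is,
P = T[i..i+m-1].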
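The condition above can be checked directly with string slicing. This is a minimal sketch, assuming 0-based indexing; the helper name `occurs_at` and the sample text are illustrative choices, not part of the slides.

```python
def occurs_at(T: str, P: str, i: int) -> bool:
    """Return True if pattern P occurs in text T starting at index i (0-based)."""
    m = len(P)
    # T[i:i + m] is Python's spelling of T[i..i+m-1]
    return 0 <= i <= len(T) - m and T[i:i + m] == P


T = "abacaabaccabacabaabb"  # sample text, chosen for illustration only
P = "abacab"                # same pattern as in the failure-function table
matches = [i for i in range(len(T)) if occurs_at(T, P, i)]
```

Here `matches` collects every index i for which P = T[i..i+m-1] holds.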
Example:
Brute Force Pattern Matching
27 character comparisons
etc. etc.
Brute Force Efficiency
Worst case: O(nm) character comparisons, which is quadratic when m is proportional to n.
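To make the cost concrete, here is a sketch of brute-force matching that also counts character comparisons. The function name and the comparison counter are assumptions added for illustration, not taken from the slides.

```python
def brute_force_search(T: str, P: str):
    """Return (index of first match or -1, number of character comparisons)."""
    n, m = len(T), len(P)
    comparisons = 0
    for i in range(n - m + 1):       # try every candidate shift
        j = 0
        while j < m:
            comparisons += 1         # one text/pattern character comparison
            if T[i + j] != P[j]:
                break                # mismatch: give up on this shift
            j += 1
        if j == m:                   # all m characters matched
            return i, comparisons
    return -1, comparisons


# A bad case: every shift matches three characters before failing.
worst = brute_force_search("a" * 10, "aaab")
```

With n = 10 and m = 4 there are 7 shifts of 4 comparisons each, so the search fails after 28 comparisons, which is close to the n·m bound.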
Disadvantages
KMP Algorithm
• Knuth-Morris-Pratt Algorithm
• Avoids wasting the information gathered from earlier comparisons
• Uses a failure function (“partial match” table)
Failure Function
j      0   1   2   3   4   5
P[j]   a   b   a   c   a   b
f(j)  -1   0   0   1   0   1
• f(0) = -1
• f(1) → no proper suffix for “a”. f(1) = 0.
• f(2) → suffix for “ab” = “b”. Can’t find prefix of P that is
“b”. f(2) = 0.
• f(3) → suffix for “aba” = “ba” or “a”. Prefix “a” of length 1
exists. f(3) = 1.
• f(4) → suffix for “abac” = “bac” / “ac” / “c”. Does not exist.
f(4) = 0
• f(5) → suffix for “abaca” found (“a”). So f(5) = 1.
Algorithm KMPFailureFunction(P):
    Input: String P (pattern) of length m
    Output: The failure function f for P, which maps j to the length of the longest prefix of P that is a proper suffix of P[0…j - 1]
    i ← 2 {current index position of P}
    j ← 0 {length of the currently matched prefix of P}
    f(0) ← -1 {first two values are fixed}
    f(1) ← 0
    while i < m do
        if P[i - 1] = P[j]
            {prefix of length j + 1 matched}
            f(i) ← j + 1
            i ← i + 1
            j ← j + 1
        else if j > 0
            {mismatch, but can fall back}
            j ← f(j)
        else
            {match not found}
            f(i) ← 0
            i ← i + 1
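The pseudocode translates almost line for line into Python. The sketch below keeps the slide convention f(0) = -1; the function name and the handling of patterns shorter than two characters are assumptions added here.

```python
def kmp_failure_function(P: str) -> list:
    """Failure function with the convention f[0] = -1, f[1] = 0."""
    m = len(P)
    f = [0] * m
    if m == 0:
        return f             # assumption: empty pattern gives an empty table
    f[0] = -1                # fixed value; f[1] stays 0 when m >= 2
    i, j = 2, 0              # i: entry being filled, j: matched prefix length
    while i < m:
        if P[i - 1] == P[j]:
            f[i] = j + 1     # prefix of length j + 1 matched
            i += 1
            j += 1
        elif j > 0:
            j = f[j]         # mismatch, but can fall back
        else:
            f[i] = 0         # match not found
            i += 1
    return f


table = kmp_failure_function("abacab")
```

For the pattern "abacab" this reproduces the row f(j) = -1 0 0 1 0 1 from the table.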
KMP Algorithm
no comparison
needed for “a”
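A sketch of the search loop that uses the failure function is given below. It assumes the same f(0) = -1 convention as the table; the names `failure` and `kmp_search` are illustrative. On a mismatch after j matched characters, the pattern index falls back to f[j] instead of restarting, so characters already known to match are not compared again.

```python
def failure(P: str) -> list:
    """Failure function, f[0] = -1 convention (same scheme as the slides)."""
    m = len(P)
    f = [0] * m
    if m:
        f[0] = -1
    i, j = 2, 0
    while i < m:
        if P[i - 1] == P[j]:
            f[i] = j + 1
            i, j = i + 1, j + 1
        elif j > 0:
            j = f[j]
        else:
            f[i] = 0
            i += 1
    return f


def kmp_search(T: str, P: str) -> int:
    """Return the index of the first occurrence of P in T, or -1."""
    n, m = len(T), len(P)
    if m == 0:
        return 0                  # assumption: empty pattern matches at 0
    f = failure(P)
    i = j = 0                     # i indexes T, j indexes P
    while i < n:
        if T[i] == P[j]:
            i, j = i + 1, j + 1
            if j == m:
                return i - m      # full match ends at i - 1
        elif j > 0:
            j = f[j]              # reuse the prefix that is already matched
        else:
            i += 1                # no partial match to reuse
    return -1


pos = kmp_search("abacaabaccabacabaabb", "abacab")
```

The text index i never moves backwards, which is the source of the linear running time.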
KMP Efficiency
Rabin-Karp Algorithm
Problems
Rabin-Karp Algorithm
hsub ← hash(sub[1..m])
hs ← hash(T[1..m])
for i from 1 to n - m + 1
    if hs = hsub
        if T[i..i+m-1] = sub
            return i
    if i < n - m + 1
        hs ← hash(T[i+1..i+m])
return not found
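The same idea in Python, as a sketch only. The hash used here (a sum of character codes) is an assumption chosen for simplicity; it is recomputed from scratch for each window, as in the pseudocode, and 0-based indexing replaces the 1-based indexing of the slides.

```python
def simple_hash(s: str) -> int:
    """Toy hash: sum of character codes (an illustrative choice)."""
    return sum(ord(c) for c in s)


def rabin_karp(T: str, sub: str) -> int:
    """Return the index of the first occurrence of sub in T, or -1."""
    n, m = len(T), len(sub)
    if m > n:
        return -1
    hsub = simple_hash(sub)
    for i in range(n - m + 1):
        hs = simple_hash(T[i:i + m])      # O(m) work for every window
        # Equal hashes only suggest a match, so verify the characters.
        if hs == hsub and T[i:i + m] == sub:
            return i
    return -1


pos = rabin_karp("abacaabaccabacabaabb", "abacab")
```

The explicit string comparison is needed because different substrings can share a hash value.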
Hash Functions
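• Recomputing hash(T[i+1..i+m]) from scratch at every shift takes O(m) time and can be time-consuming.
• Instead, update the previous hash in constant time. For a hash that sums character values: hash(T[i+1..i+m]) = hash(T[i..i+m-1]) - T[i] + T[i+m]
• This is called a rolling hash function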
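The update rule can be sketched as follows, assuming the simple additive hash (sum of character codes). The function name and the count of spurious hits are illustrative additions. Practical implementations usually use a polynomial hash modulo a prime, which collides far less often than a plain sum.

```python
def rabin_karp_rolling(T: str, sub: str):
    """Return (index of first match or -1, number of spurious hash hits)."""
    n, m = len(T), len(sub)
    if m > n:
        return -1, 0
    hsub = sum(ord(c) for c in sub)
    hs = sum(ord(c) for c in T[:m])      # hash of the first window only
    spurious = 0
    for i in range(n - m + 1):
        if hs == hsub:
            if T[i:i + m] == sub:
                return i, spurious
            spurious += 1                # same hash, different string
        if i + m < n:
            # Roll the window: drop T[i], take in T[i + m], in O(1) time.
            hs = hs - ord(T[i]) + ord(T[i + m])
    return -1, spurious


result = rabin_karp_rolling("bacabc", "abc")
```

In this example the windows "bac" and "cab" have the same sum as "abc", so two hash hits are rejected by the character comparison before the real match at index 3.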
Rolling Hash
Rabin-Karp Algorithm
11 hash comparisons
etc. etc.
Rabin-Karp Efficiency
Multiple Searches
Summary
Next Lesson
• Hashing
– Approach
– Collision detection
– Recovery