
Algorithmics

CT065-3.5-3

String Matching
Level 3 – Computing (Software Engineering)
Topic & Structure of Lesson

• The String/Pattern Matching Problem
– Introduction
• String Search Algorithms
– Brute Force Pattern Matching Algorithm
– KMP Algorithm
– Rabin-Karp Algorithm

Module Code and Module Title Title of Slides Slide 2 (of 37)
Learning Outcomes

By the end of this lesson you should be able to:
• Explain the classic problem of
string/pattern matching
• Comprehend the concepts behind various
string matching algorithms

Keywords

• String/pattern matching
• Brute Force/Naïve String Search
• KMP algorithm
• Rabin-Karp algorithm

The Pattern Matching Problem

Given a text string T of length n and a pattern string P of length m, find an index i at which P occurs as a substring of T, that is,

T[i] = P[0], T[i+1] = P[1], …, T[i+m-1] = P[m-1].

Equivalently,

P = T[i..i+m-1].
The Pattern Matching Problem

Example:

Assume T = “abacaabaccabacabaabb” and P = “abacab”.

Then P is a substring of T, where P = T[10..15].
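
The example can be checked mechanically. The following is a minimal sketch in Python (the language is an assumption; the slides themselves use only pseudocode) that relies on the built-in `str.find` rather than on any of the algorithms discussed later:

```python
# Text and pattern from the example above.
T = "abacaabaccabacabaabb"
P = "abacab"

# str.find returns the index of the first occurrence, or -1 if there is none.
i = T.find(P)
print(i)

# The slice T[i : i + m] corresponds to T[i..i+m-1] in the slide notation.
print(T[i:i + len(P)] == P)
```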

Brute Force Pattern Matching

• Exact string matching
• Naïve string search

Test all possible placements of P relative to T.

Brute Force Pattern Matching

Algorithm BruteForceMatch(T, P):
Input: String T of length n, and P of length m
Output: Starting index of first substring of T matching P, or an indication that there is none
for i ← 0 to n – m do
    j ← 0
    while (j < m and T[i+j] = P[j]) do
        j ← j + 1
    if j = m then
        return i
return “No substring of T matching P.”
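
The pseudocode above translates almost line for line into Python. This is a sketch only: the function name `brute_force_match` and the choice of returning -1 as the "no match" indication are assumptions, not part of the slides.

```python
def brute_force_match(text: str, pattern: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    n, m = len(text), len(pattern)
    for i in range(n - m + 1):          # every possible placement of P in T
        j = 0
        # Advance j while the characters under the current placement agree.
        while j < m and text[i + j] == pattern[j]:
            j += 1
        if j == m:                      # all m characters matched
            return i
    return -1                           # no substring of T matches P


print(brute_force_match("abacaabaccabacabaabb", "abacab"))
```

When the pattern is longer than the text, `range(n - m + 1)` is empty and the function falls through to -1.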

Brute Force Pattern Matching

[Figure: worked example of brute-force matching, 27 character comparisons]

Brute Force Efficiency

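Running time for this algorithm:

O((n – m + 1)m) = O(nm)

Worst-case running time for this algorithm (for example, when m is about n/2):

quadratic, O(n^2)

When m is almost equal to n, only a few placements are tested, so the cost is only O(m).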
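
To see the (n – m + 1)m bound reached, one can count character comparisons on an adversarial input. The sketch below is illustrative: the helper `count_comparisons` and the input of repeated "a" characters with a pattern ending in "b" are assumptions chosen so that every placement fails only on its last character.

```python
def count_comparisons(text: str, pattern: str) -> int:
    """Count the character comparisons made by brute-force matching."""
    n, m = len(text), len(pattern)
    comparisons = 0
    for i in range(n - m + 1):
        j = 0
        while j < m:
            comparisons += 1            # one comparison of T[i+j] with P[j]
            if text[i + j] != pattern[j]:
                break
            j += 1
        if j == m:                      # match found, stop searching
            break
    return comparisons


n, m = 10, 5
# Every placement matches m - 1 characters and then fails on the final "b".
worst = count_comparisons("a" * n, "a" * (m - 1) + "b")
print(worst, (n - m + 1) * m)
```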

Disadvantages

• Brute Force does not use the information gained from the pattern characters that matched before a mismatch.
• This knowledge is thrown away and comparison starts again from scratch at the next incremental placement of the pattern.

KMP Algorithm

• Knuth-Morris-Pratt Algorithm
• Avoids wastage of information gathered
• Uses a failure function (“partial match” table)

KMP Algorithm

Algorithm KMPMatch(T, P):
Input: String T of length n, and P of length m
Output: Starting index of first substring of T matching P, or an indication that there is none
f ← KMPFailureFunction(P)   {construct the failure function f for P}
i ← 0   {beginning of current match in T}
j ← 0   {current position in P}
while i + j < n do

KMP Algorithm (continued)

    if P[j] = T[i + j] then
        j ← j + 1
        if j = m then
            return i   {a match!}
    else
        i ← i + j - f(j)
        if j > 0 then   {no match, but we have advanced in P}
            j ← f(j)   {j indexes just after prefix of P that must match}
return “No substring of T matching P.”

Failure Function

• Main idea of KMP is to preprocess the pattern string P so as to compute a failure function f that indicates the proper shift of P so that, to the largest extent possible, we can reuse previously performed comparisons.
• The search never moves backward in T: a text character that has already matched is never compared again.

Failure Function

• For each position j in pattern P, find the length of the longest prefix of P that is also a proper suffix of the part of P leading up to, but not including, position j; after a mismatch at j, this is the position in P from which matching resumes.

Failure Function

• Consider the pattern string P = “abacab”. The KMP failure function for P is shown below:
j      0   1   2   3   4   5
P[j]   a   b   a   c   a   b
f(j)  -1   0   0   1   0   1

Failure Function

• For each j > 0, f(j) = length of the longest prefix of P that is also a proper suffix of P[0…j - 1].
• We also use the convention that f(0) = -1 (no possibility of backtracking).

Failure Function
j      0   1   2   3   4   5
P[j]   a   b   a   c   a   b
f(j)  -1   0   0   1   0   1
• f(0) = -1
• f(1) → no proper suffix for “a”. f(1) = 0.
• f(2) → suffix for “ab” = “b”. Can’t find prefix of P that is
“b”. f(2) = 0.
• f(3) → suffix for “aba” = “ba” or “a”. Prefix “a” of length 1
exists. f(3) = 1.
• f(4) → suffix for “abac” = “bac” / “ac” / “c”. Does not exist.
f(4) = 0
• f(5) → suffix for “abaca” found (“a”). So f(5) = 1.
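
The case-by-case reasoning above can be reproduced by applying the definition directly. The sketch below is deliberately naive and is not the KMP preprocessing algorithm; the name `failure_by_definition` is an assumption, and it simply tries every proper prefix/suffix length for each j.

```python
def failure_by_definition(pattern: str) -> list[int]:
    """Compute f(j) for every j straight from the definition (slow but simple)."""
    if not pattern:
        return []
    f = [-1]                            # convention: f(0) = -1
    for j in range(1, len(pattern)):
        head = pattern[:j]              # P[0..j-1]
        best = 0
        for k in range(1, j):           # proper lengths only: 1 .. j-1
            if head[:k] == head[-k:]:   # prefix of length k equals suffix of length k
                best = k
        f.append(best)
    return f


print(failure_by_definition("abacab"))
```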
Failure Function

Algorithm KMPFailureFunction(P):
Input: String P (pattern) of length m
Output: The failure function f for P, which maps j to the length of the longest prefix of P that is a proper suffix of P[0…j – 1]
i ← 2   {current index position of P}
j ← 0   {length of the prefix matched so far}
f(0) ← -1   {first two values are fixed}
f(1) ← 0
Failure Function (continued)

while i < m do
    if P[i – 1] = P[j] then   {prefix of length j + 1 matched}
        f(i) ← j + 1
        i ← i + 1
        j ← j + 1
    else if j > 0 then   {mismatch, but can fall back}
        j ← f(j)
    else   {match not found}
        f(i) ← 0
        i ← i + 1
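
Putting the failure function and the search loop together, a possible Python rendering looks like this. It is a sketch that follows the pseudocode's variable names; the function names `kmp_failure` and `kmp_match`, the -1 return value for "no match", and the handling of an empty pattern are assumptions.

```python
def kmp_failure(pattern: str) -> list[int]:
    """Failure function with the convention f[0] = -1."""
    m = len(pattern)
    if m == 0:
        return []
    f = [-1] + [0] * (m - 1)            # f[0] = -1 and f[1] = 0 are fixed
    i, j = 2, 0                         # i: entry being filled, j: matched prefix length
    while i < m:
        if pattern[i - 1] == pattern[j]:    # prefix of length j + 1 matched
            f[i] = j + 1
            i += 1
            j += 1
        elif j > 0:                         # mismatch, but can fall back
            j = f[j]
        else:                               # no prefix matches
            f[i] = 0
            i += 1
    return f


def kmp_match(text: str, pattern: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    f = kmp_failure(pattern)
    i = j = 0                           # i: start of match in T, j: position in P
    while i + j < n:
        if pattern[j] == text[i + j]:
            j += 1
            if j == m:
                return i                # a match
        else:
            i = i + j - f[j]            # shift the pattern; i + j never decreases
            if j > 0:
                j = f[j]                # resume after the prefix known to match
    return -1


print(kmp_failure("abacab"))
print(kmp_match("abacaabaccabacabaabb", "abacab"))
```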

KMP Algorithm

[Figure: worked example of KMP matching; no comparison needed for “a”]

KMP Efficiency

• Efficiency of building failure function
– O(m), where m is length of pattern
• Efficiency of searching
– At most 2n character comparisons, i.e. O(n), where n is length of the text
• Therefore, efficiency of KMP
– O(n) + O(m) = O(n + m)

Rabin-Karp Algorithm

• Uses hash functions to convert (sub)strings into numeric (hash) values
– e.g. hash(“hello”) = 5.
• If 2 strings are equal, then their hash values are also equal
– Look for a substring of T whose hash value equals the hash value of pattern P

Problems

• There are many more possible strings than hash values, so to keep hash values small some different strings must hash to the same value.
• Strings might not match even if their hash values do, so every hash match must be verified. Checking for string equality can take a long time for long strings.
• A good hash function is needed to prevent this from happening too often and produce a good average search time.

Rabin-Karp Algorithm

Algorithm RabinKarp(T, P):
Input: String T of length n, and P of length m
Output: Starting index of first substring of T matching P, or an indication that there is none
hsub ← hash(P[0…m-1])
hs ← hash(T[0…m-1])
Rabin-Karp Algorithm (continued)

for i from 0 to n - m
    if hs = hsub
        if T[i..i+m-1] = P
            return i
    if i < n - m
        hs ← hash(T[i+1..i+m])
return not found
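
A runnable version of this loop needs a concrete hash function. The sketch below assumes a polynomial hash with base 101 reduced modulo a large prime, updated incrementally from one window to the next; the modulus, the function name `rabin_karp`, and the -1 return value are all assumptions rather than something the slides specify.

```python
def rabin_karp(text: str, pattern: str,
               base: int = 101, mod: int = 1_000_000_007) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    if m > n:
        return -1
    high = pow(base, m - 1, mod)        # weight of the leading character
    hsub = hs = 0
    for k in range(m):                  # hash of P and of the first window of T
        hsub = (hsub * base + ord(pattern[k])) % mod
        hs = (hs * base + ord(text[k])) % mod
    for i in range(n - m + 1):
        # Equal hashes may be a collision, so verify the actual characters.
        if hs == hsub and text[i:i + m] == pattern:
            return i
        if i + m < n:
            # Drop T[i], shift by one digit, append T[i+m].
            hs = ((hs - ord(text[i]) * high) * base + ord(text[i + m])) % mod
    return -1


print(rabin_karp("abacaabaccabacabaabb", "abacab"))
```

Python's `%` always returns a non-negative result for a positive modulus, so the subtraction in the update needs no extra correction.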

Hash Functions

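• Recomputing hash(T[i+1..i+m]) from scratch at every step takes O(m) time per shift.
• Rather, update it in constant time: hash(T[i+1..i+m]) is obtained from hash(T[i..i+m-1]) by removing the contribution of T[i] and adding that of T[i+m].
• Called a rolling hash function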

Rolling Hash

Rolling hash example:

Treat each character as a digit of a number written in some base, usually a prime number.

So, for the string “ab” and a base of 101, its hash value would be 97 * 101^1 + 98 * 101^0 = 9895. (ASCII values of a and b are 97 and 98, respectively)
Rolling Hash

Assume we have a string “abacab” and we are searching for a pattern of length 3. Once the hash value of “aba” is calculated, the hash value of “bac” can be easily calculated from it:

Hash value of “aba” = 97 * 101^2 + 98 * 101^1 + 97 * 101^0 = 989497 + 9898 + 97 = 999492
Hash value of “bac” = (999492 – 989497) * 101 + 99 * 101^0 = 1009594
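
The arithmetic above can be reproduced in a few lines. This sketch uses base 101 with no modulus so that the numbers agree with the example; the helper names `poly_hash` and `roll` are assumptions.

```python
BASE = 101


def poly_hash(s: str) -> int:
    """Hash s as a base-101 number whose digits are the character codes."""
    h = 0
    for ch in s:
        h = h * BASE + ord(ch)
    return h


def roll(h: int, old: str, new: str, m: int) -> int:
    """Slide a window of length m: drop the leading char old, append new."""
    return (h - ord(old) * BASE ** (m - 1)) * BASE + ord(new)


h_aba = poly_hash("aba")
h_bac = roll(h_aba, "a", "c", 3)        # "aba" -> "bac"
print(h_aba, h_bac)                     # 999492 1009594
print(h_bac == poly_hash("bac"))        # rolling agrees with hashing from scratch
```

In practice the hash would be reduced modulo some number to keep it small, which is what makes collisions possible.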

Rabin-Karp Algorithm

[Figure: worked example of Rabin-Karp matching, 11 hash comparisons]

Rabin-Karp Efficiency

• Efficiency of calculating hash value
– O(m), where m is length of pattern
• Efficiency of searching
– Average = O(n + m)
– Worst case = O(nm)
• Therefore, efficiency of Rabin-Karp
– Average = O(n + m)
– Worst case = O(nm)

Multiple Searches

• Due to its bad worst-case search time, Rabin-Karp is inferior to KMP for single pattern searching
• Rabin-Karp is preferable when searching for multiple patterns
• Other algorithms search for k patterns at O(n) time each. Total = O(n k)
• Rabin-Karp → checking whether a hash is in a hash table of pattern hashes costs O(1) expected time. So, treating the pattern length m as a constant, O(n) + O(k) = O(n + k) expected

Multiple Searches

Algorithm RabinKarpSet(T, P):
Input: String T of length n, and set of strings P, each of length m
Output: Starting index of first substring of T matching any pattern in P, or an indication that there is none
hsubs ← empty set

Multiple Searches (continued)

for each pattern in P
    insert hash(pattern[0…m-1]) into hsubs
hs ← hash(T[0…m-1])
for i from 0 to n - m
    if hs ∈ hsubs
        if T[i…i+m-1] = a pattern in P with hash hs
            return i
    if i < n - m
        hs ← hash(T[i+1…i+m])
return not found
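
The multi-pattern variant can be sketched in Python as follows. As assumptions for illustration: all patterns share the same length m (as the pseudocode requires), the hash is a base-101 polynomial hash modulo a large prime, and a dictionary maps each hash value to the patterns that produce it so that a hash hit can be verified against real strings. The name `rabin_karp_set` is hypothetical.

```python
def rabin_karp_set(text: str, patterns: set[str],
                   base: int = 101, mod: int = 1_000_000_007) -> int:
    """Return the first index where any pattern occurs in text, or -1."""
    if not patterns:
        return -1
    m = len(next(iter(patterns)))       # assumed common length of all patterns
    n = len(text)
    if m == 0 or m > n:
        return -1

    def poly_hash(s: str) -> int:
        h = 0
        for ch in s:
            h = (h * base + ord(ch)) % mod
        return h

    # Hash table: hash value -> patterns with that hash (handles collisions).
    hsubs: dict[int, set[str]] = {}
    for p in patterns:
        hsubs.setdefault(poly_hash(p), set()).add(p)

    high = pow(base, m - 1, mod)
    hs = poly_hash(text[:m])
    for i in range(n - m + 1):
        # Expected O(1) lookup, then verify the window against real patterns.
        if hs in hsubs and text[i:i + m] in hsubs[hs]:
            return i
        if i + m < n:
            hs = ((hs - ord(text[i]) * high) * base + ord(text[i + m])) % mod
    return -1


print(rabin_karp_set("abacaabaccabacabaabb", {"caab", "ccab"}))
```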

Summary

• String/Pattern Matching Algorithms
– Brute Force / Naïve Search
– Knuth-Morris-Pratt
– Rabin-Karp

Next Lesson

• Hashing
– Approach
– Collision detection
– Recovery

