Lecture-11 String Matching

Algorithmics
CT065-3.5-3
String Matching
Level 3 – Computing (Software Engineering)
Topic & Structure of Lesson
• The String/Pattern Matching Problem
– Introduction
• String Search Algorithms
– Brute Force Pattern Matching Algorithm
– KMP Algorithm
– Rabin-Karp Algorithm
Learning Outcomes
By the end of this lesson you should be able to:
• Explain the classic problem of
string/pattern matching
• Comprehend the concepts behind various
string matching algorithms
Keywords
• String/pattern matching
• Brute Force/Naïve String Search
• KMP algorithm
• Rabin-Karp algorithm
The Pattern Matching Problem
Given a text string T of length n and a pattern string P of length m (m ≤ n), find an index i at which P occurs as a substring of T, that is,
T[i] = P[0], T[i+1] = P[1], …, T[i+m-1] = P[m – 1].
Equivalently,
P = T[i..i+m-1].
The Pattern Matching Problem
Example:
Assume T = “abacaabaccabacabaabb” and P = “abacab”.
Then P is a substring of T, where P = T[10..15].
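The example can be checked with a short Python sketch. It relies on Python's built-in str.find, which returns the index of the first occurrence (or -1); the variable names are illustrative only.

```python
# Text and pattern from the example above.
T = "abacaabaccabacabaabb"
P = "abacab"

# str.find returns the index of the first occurrence of P in T, or -1.
i = T.find(P)
print(i)                  # prints 10

# Python slices exclude the end index, so T[10..15] is written T[10:16].
print(T[i:i + len(P)])    # prints abacab
```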
Brute Force Pattern Matching
• Exact string matching
• Naïve string search
Test all possible placements of P relative to T.
Algorithm BruteForceMatch(T, P):
Input: String T of length n, and P of length m
Output: Starting index of first substring of T matching P, or an indication otherwise
for i ← 0 to n – m do
    j ← 0
    while (j < m and T[i+j] = P[j]) do
        j ← j + 1
    if j = m then
        return i
return “No substring of T matching P.”
[Figure: example run of brute-force matching; 27 character comparisons]
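The brute-force pseudocode translates almost line for line into Python. The following is a minimal sketch; the function name and the choice of returning -1 as the "no match" indication are assumptions, not part of the slides.

```python
def brute_force_match(text, pattern):
    """Return the first index where pattern occurs in text, or -1 if none."""
    n, m = len(text), len(pattern)
    # Try every possible placement of the pattern against the text.
    for i in range(n - m + 1):
        j = 0
        # Advance while the characters at this placement keep matching.
        while j < m and text[i + j] == pattern[j]:
            j += 1
        if j == m:
            return i  # all m characters matched
    return -1  # no substring of text matches pattern


print(brute_force_match("abacaabaccabacabaabb", "abacab"))  # prints 10
```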
Brute Force Efficiency
Running time for this algorithm:
O((n – m + 1)m) = O(nm)
Worst-case running time for this algorithm (for example, when m is about n/2):
quadratic, O(n²)
Disadvantages
• Brute Force does not use the information gained from characters that have already been compared.
• That knowledge is thrown away, and the comparison starts over at the next incremental placement of the pattern.
KMP Algorithm
• Knuth-Morris-Pratt Algorithm
• Avoids wasting the information gathered from earlier comparisons
• Uses a failure function (“partial match” table)
KMP Algorithm
Algorithm KMPMatch(T, P):
Input: String T of length n, and P of length m
Output: Starting index of first substring of T matching P, or an indication otherwise
f ← KMPFailureFunction(P) {construct the failure function f for P}
i ← 0 {beginning of current match in T}
j ← 0 {position of current character in P}
while i + j < n do
KMP Algorithm (continued)
    if P[j] = T[i + j] then
        j ← j + 1
        if j = m then
            return i {a match!}
    else
        i ← i + j - f(j)
        if j > 0 then {no match, but we have advanced in P}
            j ← f(j) {j indexes just after prefix of P that must match}
return “No substring of T matching P.”
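A minimal Python sketch of the search loop above. To keep the block self-contained it takes the failure table as a parameter instead of computing it; the table used in the example is the one for “abacab” (f(0) = -1). The function name and the -1 return value for "no match" are assumptions.

```python
def kmp_match(text, pattern, failure):
    """KMP search given a precomputed failure table (failure[0] == -1).

    Returns the first index where pattern occurs in text, or -1 if none.
    """
    n, m = len(text), len(pattern)
    if m == 0:
        return 0  # assumption: the empty pattern matches at index 0
    i = 0  # beginning of the current candidate match in text
    j = 0  # position of the current character in pattern
    while i + j < n:
        if pattern[j] == text[i + j]:
            j += 1
            if j == m:
                return i  # a match
        else:
            # Shift the pattern; failure[0] == -1 makes this i + 1 when j == 0.
            i = i + j - failure[j]
            if j > 0:
                j = failure[j]  # resume just after the prefix known to match
    return -1


# Failure table for "abacab".
table = [-1, 0, 0, 1, 0, 1]
print(kmp_match("abacaabaccabacabaabb", "abacab", table))  # prints 10
```

Note that i is updated before j, so the shift uses the old value of j, exactly as in the pseudocode.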
Failure Function
• Main idea of KMP is to preprocess the pattern string P so as to compute a failure function f that indicates the proper shift of P so that, to the largest extent possible, we can reuse previously performed comparisons.
• The search never moves backwards in T, so a text character that has already been matched is never compared again.
Failure Function
• For each position j in pattern P, find the length of the longest proper prefix of P that is also a suffix of P[0…j – 1], the part of P leading up to, but not including, position j; this is how far we have to fall back in P before resuming the search.
Failure Function
• Consider the pattern string P = “abacab”; the KMP failure function for the string P is shown below:
j      0  1  2  3  4  5
P[j]   a  b  a  c  a  b
f(j)  -1  0  0  1  0  1
Failure Function
• For each j, f(j) = length of the longest prefix of P that is also a proper suffix of P[0…j - 1].
• We also use the convention that f(0) = -1 (no possibility of backtracking).
Failure Function
j      0  1  2  3  4  5
P[j]   a  b  a  c  a  b
f(j)  -1  0  0  1  0  1
• f(0) = -1
• f(1) → “a” has no non-empty proper suffix. f(1) = 0.
• f(2) → proper suffix of “ab” = “b”. No prefix of P equals “b”. f(2) = 0.
• f(3) → proper suffixes of “aba” = “ba” or “a”. Prefix “a” of length 1 exists. f(3) = 1.
• f(4) → proper suffixes of “abac” = “bac” / “ac” / “c”. None is a prefix of P. f(4) = 0.
• f(5) → proper suffix “a” of “abaca” is also a prefix of P. So f(5) = 1.
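The values in the table can be reproduced directly from the definition. The sketch below is a deliberately slow check, not the KMP preprocessing algorithm itself; the function name is hypothetical and a non-empty pattern is assumed.

```python
def failure_by_definition(pattern):
    """Compute f(j) straight from the definition (assumes a non-empty pattern).

    f(0) = -1 by convention; for j >= 1, f(j) is the length of the longest
    prefix of pattern that is also a proper suffix of pattern[0..j-1].
    """
    table = [-1]
    for j in range(1, len(pattern)):
        head = pattern[:j]       # the part leading up to position j
        best = 0
        for k in range(1, j):    # k < j keeps the suffix proper
            if head[:k] == head[-k:]:
                best = k
        table.append(best)
    return table


print(failure_by_definition("abacab"))  # prints [-1, 0, 0, 1, 0, 1]
```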
Failure Function
Algorithm KMPFailureFunction(P):
Input: String P (pattern) of length m
Output: The failure function f for P, which maps j to the length of the longest prefix of P that is a suffix of P[0…j – 1]
i ← 2 {current index position of P}
j ← 0 {beginning of current match in P}
f(0) ← -1 {first two values are fixed}
f(1) ← 0
Failure Function (continued)
while i < m do
    if P[i – 1] = P[j] {prefix of length j + 1 matched}
        f(i) ← j + 1
        i ← i + 1
        j ← j + 1
    else if j > 0 {mismatch, but can fall back}
        j ← f(j)
    else {match not found}
        f(i) ← 0
        i ← i + 1
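The table-building pseudocode can be sketched in Python as follows. The function name and the handling of patterns shorter than two characters are assumptions added so the sketch runs on any input.

```python
def kmp_failure_function(pattern):
    """Build the KMP failure table in O(m) time, with f(0) = -1."""
    m = len(pattern)
    if m == 0:
        return []      # assumption: an empty pattern has an empty table
    f = [0] * m        # f[1] = 0 is already in place when m >= 2
    f[0] = -1
    i = 2              # current index position of the pattern
    j = 0              # length of the prefix matched so far
    while i < m:
        if pattern[i - 1] == pattern[j]:
            # prefix of length j + 1 matched
            f[i] = j + 1
            i += 1
            j += 1
        elif j > 0:
            # mismatch, but can fall back to a shorter prefix
            j = f[j]
        else:
            # no prefix matches here
            f[i] = 0
            i += 1
    return f


print(kmp_failure_function("abacab"))  # prints [-1, 0, 0, 1, 0, 1]
```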
KMP Algorithm
[Figure: example run of KMP; no comparison needed for “a”]
KMP Efficiency
• Efficiency of building failure function
– O(m), where m is length of pattern
• Efficiency of searching
– At most 2n character comparisons = O(n), where n is length of text
• Therefore, efficiency of KMP
– O(n) + O(m) = O(n + m)
Rabin-Karp Algorithm
• Uses hash functions, which convert (sub)strings into numeric (hash) values
– e.g. hash(“hello”) = 5.
• If 2 strings are equal, then their hash values are also equal
– Look for a substring of T whose hash value equals the hash value of pattern P
Problems
• There are many different strings, so to keep hash values small some strings must hash to the same value.
• Strings might not match even if their hash values do, so every hash match must be confirmed by comparing the strings. Checking for string equality can take a long time for long strings.
• A good hash function is needed to prevent this from happening too often and produce a good average search time.
Algorithm RabinKarp(T, P):
Input: String T of length n, and P of length m
Output: Starting index of first substring of T matching P, or an indication otherwise
hsub ← hash(P[0…m-1])
hs ← hash(T[0…m-1])
for i from 0 to n - m
    if hs = hsub
        if T[i..i+m-1] = P
            return i
    if i < n - m
        hs ← hash(T[i+1..i+m])
return not found
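A minimal Python sketch of this loop. As a stated assumption, it uses a deliberately weak hash (the sum of the character codes) so that the hash of the next window can be updated in constant time; the function name and the -1 return value are also assumptions.

```python
def rabin_karp(text, pattern):
    """Rabin-Karp search with a simple additive hash (sum of character codes).

    Returns the first index where pattern occurs in text, or -1 if none.
    """
    n, m = len(text), len(pattern)
    if m > n:
        return -1
    h_pattern = sum(ord(c) for c in pattern)
    h_window = sum(ord(c) for c in text[:m])
    for i in range(n - m + 1):
        # Equal hashes only suggest a match; confirm by comparing the strings.
        if h_window == h_pattern and text[i:i + m] == pattern:
            return i
        if i < n - m:
            # Slide the window: drop text[i], take in text[i + m].
            h_window += ord(text[i + m]) - ord(text[i])
    return -1


print(rabin_karp("abacaabaccabacabaabb", "abacab"))  # prints 10
print(rabin_karp("ba", "ab"))                        # prints -1
```

The second call shows why the string comparison is needed: “ba” and “ab” have the same additive hash but are different strings.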
Hash Functions
• Recomputing hash(T[i+1..i+m]) from scratch at every step can be time-consuming.
• Rather, update the previous value; e.g. for a hash that simply adds up character values, hash(T[i+1..i+m]) = hash(T[i..i+m-1]) - T[i] + T[i+m]
• Called a rolling hash function
Rolling Hash
Rolling hash example:
Treat each string as a number written in some base, usually a prime number, with the character codes as its digits.
So, for the string “ab” and a base of 101, its hash value would be 97 * 101^1 + 98 * 101^0 = 9797 + 98 = 9895. (ASCII values of a and b are 97 and 98, respectively)
Rolling Hash
Assume we have a string “abacab” and are searching for a pattern of size 3. Once the hash value of “aba” is calculated, the hash value of “bac” can also be easily calculated:
Hash value of “aba” = 97 * 101^2 + 98 * 101^1 + 97 * 101^0 = 989497 + 9898 + 97 = 999492
Hash value of “bac” = (999492 – 989497) * 101 + 99 * 101^0 = 1009594
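The arithmetic above can be reproduced with a short Python sketch. The names poly_hash and roll are hypothetical, and no modulus is applied so that the numbers agree with the worked example; a practical implementation would normally reduce the values modulo some number to keep them small.

```python
BASE = 101  # base used in the worked example


def poly_hash(s, base=BASE):
    """Hash s as a number in the given base, with character codes as digits."""
    h = 0
    for ch in s:
        h = h * base + ord(ch)
    return h


def roll(h, outgoing, incoming, m, base=BASE):
    """Update the hash of a length-m window when it slides one position."""
    # Remove the leading digit, shift left by one digit, append the new digit.
    return (h - ord(outgoing) * base ** (m - 1)) * base + ord(incoming)


print(poly_hash("ab"))               # prints 9895
h_aba = poly_hash("aba")
print(h_aba)                         # prints 999492
print(roll(h_aba, "a", "c", 3))      # prints 1009594
print(poly_hash("bac"))              # prints 1009594
```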
[Figure: example run of Rabin-Karp; 11 hash comparisons]
Rabin-Karp Efficiency
• Efficiency of calculating hash value
– O(m), where m is length of pattern
• Efficiency of searching
– Average = O(n + m)
– Worst case = O(nm)
• Therefore, efficiency of Rabin-Karp
– Average = O(n + m)
– Worst case = O(nm)
Multiple Searches
• Due to its bad worst-case search time, Rabin-Karp is inferior to KMP for single pattern searching
• Rabin-Karp is preferable when performing multiple pattern searching
• Other algorithms search for k patterns at O(n) time each. Total = O(n k)
• Rabin-Karp → checking whether the current hash is one of the pattern hashes stored in a hash table costs expected O(1) time. So, expected O(n) + O(k) = O(n + k)
Multiple Searches
Algorithm RabinKarpSet(T, P):
Input: String T of length n, and set of strings P, each of length m
Output: Starting index of first substring of T matching any pattern in P, or an indication otherwise
hsubs ← empty set
Multiple Searches (continued)
for each pattern in P
    insert hash(pattern[0…m-1]) into hsubs
hs ← hash(T[0…m-1])
for i from 0 to n - m
    if hs ∈ hsubs
        if T[i…i+m-1] = a pattern in P with hash hs
            return i
    if i < n - m
        hs ← hash(T[i+1…i+m])
return not found
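A minimal Python sketch of the multiple-pattern search. The assumptions are a polynomial rolling hash with base 101 reduced modulo 1,000,000,007 (both arbitrary choices), a dictionary from hash value to the patterns with that hash, and a return value of (index, pattern) or None; none of these details come from the slides.

```python
def rabin_karp_set(text, patterns, base=101, mod=1_000_000_007):
    """Find the first window of text equal to any of the equal-length patterns.

    Returns (index, pattern) for the first match, or None if there is none.
    """
    patterns = set(patterns)
    if not patterns:
        return None
    m = len(next(iter(patterns)))  # assumption: all patterns have length m
    n = len(text)
    if m == 0 or m > n:
        return None

    def poly_hash(s):
        h = 0
        for ch in s:
            h = (h * base + ord(ch)) % mod
        return h

    # Map each hash value to the patterns that produce it.
    by_hash = {}
    for p in patterns:
        by_hash.setdefault(poly_hash(p), set()).add(p)

    high = pow(base, m - 1, mod)   # weight of the leading character
    hs = poly_hash(text[:m])
    for i in range(n - m + 1):
        # Expected O(1) lookup; confirm with a real string comparison.
        if hs in by_hash and text[i:i + m] in by_hash[hs]:
            return i, text[i:i + m]
        if i < n - m:
            # Roll the hash: drop text[i], append text[i + m].
            hs = ((hs - ord(text[i]) * high) * base + ord(text[i + m])) % mod
    return None


print(rabin_karp_set("abacaabaccabacabaabb", {"cab", "acc"}))  # prints (7, 'acc')
```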
Summary
• String/Pattern Matching Algorithms
– Brute Force/ Naïve Search
– Knuth-Morris-Pratt
– Rabin-Karp
Next Lesson
• Hashing
– Approach
– Collision detection
– Recovery
