
Rabin-Karp String Matching Algorithm

CSE-8101

Presented By: Marish Kr. Gupta


National Institute of Technical Teachers Training and Research Chandigarh

Advanced Algorithms(CSE-8101)

Contents:
String Matching Problem
Application of String Matching
Introduction to Rabin Karp (RK) Algorithm
Terminology used in RK Algorithm
Example
RK Algorithm
Complexity
Weakness of RK algorithm


String Matching
The object of string matching is to find the location of a specific text pattern within a larger body of text (e.g., a sentence, a paragraph, or a book). As with most algorithms, the main considerations for string matching are speed and efficiency. A number of string searching algorithms exist today, e.g. Brute Force, Rabin-Karp, and Knuth-Morris-Pratt.


String Matching Problem


Finding all occurrences of a pattern in a text, as needed in text-editing programs.

Assume:
the text is an array T[1..n] of length n
the pattern is an array P[1..m] of length m ≤ n
the elements of P and T are characters drawn from a finite alphabet Σ, e.g. Σ = {0,1} or Σ = {a,..,z}
P and T are called strings of characters


String Matching Problem(Cont.)


P occurs with shift s in T (or P occurs beginning at position s+1 in T) if 0 ≤ s ≤ n-m and T[s+1..s+m] = P[1..m], that is, if T[s+j] = P[j] for 1 ≤ j ≤ m.
If P occurs with shift s in T, then we call s a valid shift; otherwise, we call s an invalid shift.
String-matching problem: finding all valid shifts with which a given pattern P occurs in a given text T.
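The definition of a valid shift can be checked directly. The sketch below is only an illustration of the definition, not part of the RK algorithm: the function name `valid_shifts` is hypothetical, and because Python strings are 0-based, the slice `text[s:s + m]` plays the role of T[s+1..s+m].

```python
def valid_shifts(text: str, pattern: str) -> list[int]:
    """Return every shift s with text[s:s+m] == pattern (0-based slices)."""
    n, m = len(text), len(pattern)
    # Shifts range over 0 <= s <= n - m, exactly as in the definition.
    # If m > n the range is empty and no shift is reported.
    return [s for s in range(n - m + 1) if text[s:s + m] == pattern]


print(valid_shifts("abcabaabcabac", "abaa"))  # [3]
print(valid_shifts("aaaa", "aa"))             # [0, 1, 2]
```

This brute-force check compares up to m characters at each of the n-m+1 shifts, which is the cost the hashing approach tries to avoid.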


String Matching Problem(Cont.)


Total running time = sum of preprocessing and matching times


Application of String Matching


Text Editor

Encryption
Search Engine Database


Rabin Karp
Rabin and Karp proposed (1987) a string-matching algorithm that uses hashing to seek a pattern, i.e. a substring, within a text. The algorithm makes use of elementary number-theoretic notions such as the equivalence of two numbers modulo a third number. It calculates a hash value for the pattern and for each m-character subsequence of the text to be compared. If the hash values are unequal, the algorithm moves on and calculates the hash value for the next m-character subsequence.

Rabin Karp(Cont.)
If the hash values are equal, the algorithm does a Brute Force comparison between the pattern and the m-character subsequence. In this way, there is only one hash comparison per text subsequence, and the character-by-character Brute Force comparison is needed only when the hash values match.


Notations Used in RK Algorithm


Let Σ = {0,1,2,...,9}. We can view a string of k consecutive characters as representing a length-k decimal number.
Let p denote the decimal number for P[1..m].
Let t_s denote the decimal value of the length-m substring T[s+1..s+m] of T[1..n], for s = 0, 1, ..., n-m.
t_s = p if and only if T[s+1..s+m] = P[1..m]; thus s is a valid shift if and only if t_s = p.

p = P[m] + 10(P[m-1] + 10(P[m-2] + ... + 10(P[2] + 10 P[1]) ... ))

We can compute p in O(m) time using this nested form (Horner's rule).
Similarly, we can compute t_0 from T[1..m] in O(m) time.
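The nested formula for p can be evaluated left to right with one multiplication and one addition per digit. The following is a minimal sketch under the assumption that the strings contain only decimal digits; the name `horner_value` and the sample digits are illustrative choices, not taken from the slides.

```python
def horner_value(digits: str, d: int = 10) -> int:
    """Evaluate a digit string as a radix-d number in O(m) steps."""
    value = 0
    for ch in digits:
        # One step of the nested form: shift left by one digit, add the next digit.
        value = d * value + int(ch)
    return value


pattern = "31415"
text = "2359023141526739921"
p = horner_value(pattern)                  # value of P[1..m]
t0 = horner_value(text[:len(pattern)])     # value of T[1..m]
print(p, t0)  # 31415 23590
```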

Notations Used in RK Algorithm(Cont.)


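t_{s+1} can be computed from t_s in constant time, provided 10^{m-1} is precomputed:

t_{s+1} = 10(t_s - 10^{m-1} T[s+1]) + T[s+m+1]

Subtracting 10^{m-1} T[s+1] removes the high-order digit, multiplying by 10 shifts the number left by one digit, and adding T[s+m+1] brings in the new low-order digit.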
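As a quick check of the constant-time update t_{s+1} = 10(t_s - 10^{m-1} T[s+1]) + T[s+m+1], the sketch below slides a 5-digit window by one position. The function name `roll` and the sample window are illustrative assumptions; `high` stands for the precomputed 10^{m-1}.

```python
def roll(t_s: int, old_digit: int, new_digit: int, high: int) -> int:
    """Slide the window one digit to the right.

    high is the precomputed value 10**(m - 1) of the leading digit position.
    """
    return 10 * (t_s - high * old_digit) + new_digit


m = 5
high = 10 ** (m - 1)
# Window 31415 followed by the digit 2: drop the leading 3, append the 2.
print(roll(31415, 3, 2, high))  # 14152
```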


However, p and t_s may be too large to work with conveniently.

Is there a simple solution?

Yes: the computation of p and t_0 and the recurrence are all done modulo a suitable modulus q. In general, with a d-ary alphabet {0,1,...,d-1}, q is chosen such that dq fits within a computer word.

Notations Used in RK Algorithm(Cont.)


The recurrence equation can be rewritten as

t_{s+1} = (d(t_s - T[s+1]h) + T[s+m+1]) mod q,

where h ≡ d^{m-1} (mod q) is the value of the digit "1" in the high-order position of an m-digit text window.
Note that t_s ≡ p (mod q) does not imply that t_s = p. However, if t_s ≢ p (mod q), then t_s ≠ p, and the shift s is invalid.
We use t_s ≡ p (mod q) as a fast heuristic test to rule out the invalid shifts. Any shift that passes the test is a hit, and further testing is done to eliminate spurious hits: an explicit test to check whether P[1..m] = T[s+1..s+m].
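The modular recurrence can be sanity-checked against a direct computation of each window's value mod q. This is a sketch under stated assumptions: the text consists of decimal digits, d = 10, and the small modulus q = 13 and the name `rolling_hashes` are illustrative choices.

```python
def rolling_hashes(text: str, m: int, d: int = 10, q: int = 13) -> list[int]:
    """Return t_s mod q for every shift s, using the modular recurrence."""
    n = len(text)
    if m == 0 or m > n:
        return []
    h = pow(d, m - 1, q)  # value of the high-order digit position, mod q
    t = 0
    for ch in text[:m]:   # t_0 by Horner's rule, reduced mod q at each step
        t = (d * t + int(ch)) % q
    hashes = [t]
    for s in range(n - m):
        # Drop text[s], shift, append text[s + m]; Python's % keeps the result in 0..q-1.
        t = (d * (t - int(text[s]) * h) + int(text[s + m])) % q
        hashes.append(t)
    return hashes


text = "2359023141526739921"
rolled = rolling_hashes(text, 5)
direct = [int(text[s:s + 5]) % 13 for s in range(len(text) - 5 + 1)]
print(rolled == direct)  # True
```

Each update costs a constant number of arithmetic operations on values smaller than dq, which is why q is chosen so that dq fits in a machine word.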



RK Algorithm Example

[Figure: worked example of the RK algorithm. Reference: Cormen]
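A worked example in the spirit of the textbook one can be reproduced in a few lines. The digits below (text 2359023141526739921, pattern 31415, q = 13) follow the example in Cormen's textbook; that the slide's figure used the same digits is an assumption. The helper name `classify_hits` is hypothetical, and the hash of each window is computed directly rather than by the recurrence, purely to show which shifts are hits.

```python
def classify_hits(text: str, pattern: str, q: int) -> list[tuple[int, str, str]]:
    """List every shift whose window matches the pattern mod q, labelled valid or spurious."""
    m = len(pattern)
    p = int(pattern) % q
    hits = []
    for s in range(len(text) - m + 1):
        window = text[s:s + m]
        if int(window) % q == p:  # hash test passes: this shift is a hit
            kind = "valid" if window == pattern else "spurious"
            hits.append((s, window, kind))
    return hits


print(int("31415") % 13)  # 7
print(classify_hits("2359023141526739921", "31415", 13))
# [(6, '31415', 'valid'), (12, '67399', 'spurious')]
```

Both 31415 and 67399 leave remainder 7 when divided by 13, so the hash test alone cannot tell them apart; only the explicit comparison rejects shift 12.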

Rabin Karp Algorithm


RABIN-KARP-MATCHER(T, P, d, q)
 1  n ← length[T]
 2  m ← length[P]
 3  h ← d^{m-1} mod q
 4  p ← 0
 5  t_0 ← 0
 6  for i ← 1 to m                // Preprocessing: compute p as the value of P[1..m] mod q and t_0 as the value of T[1..m] mod q
 7      do p ← (dp + P[i]) mod q
 8         t_0 ← (dt_0 + T[i]) mod q
 9  for s ← 0 to n - m            // Matching: iterates through all possible shifts s
10      do if p = t_s             // hit: t_s = T[s+1..s+m] mod q; must check if valid or spurious
11            then if P[1..m] = T[s+1..s+m]    // true means valid shift
12                    then print "Pattern occurs with shift" s
13         if s < n - m
14            then t_{s+1} ← (d(t_s - T[s+1]h) + T[s+m+1]) mod q    // gets t_{s+1} for the next iteration
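The pseudocode translates almost line for line into Python. The version below is a sketch rather than a reference implementation. It assumes characters are mapped to digits with `ord`, returns the valid shifts instead of printing them, and uses default values d = 256 and q = 1_000_003 that are arbitrary illustrative choices. It also returns an empty list for an empty pattern, a convention not taken from the slides.

```python
def rabin_karp(text: str, pattern: str, d: int = 256, q: int = 1_000_003) -> list[int]:
    """Return all valid shifts (0-based) of pattern in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    h = pow(d, m - 1, q)  # line 3: d^(m-1) mod q
    p = t = 0
    for i in range(m):    # lines 6-8: preprocessing by Horner's rule mod q
        p = (d * p + ord(pattern[i])) % q
        t = (d * t + ord(text[i])) % q
    shifts = []
    for s in range(n - m + 1):  # line 9: try every shift
        # Lines 10-12: on a hit, verify explicitly to reject spurious hits.
        if p == t and text[s:s + m] == pattern:
            shifts.append(s)
        if s < n - m:           # lines 13-14: roll the hash to the next window
            t = (d * (t - ord(text[s]) * h) + ord(text[s + m])) % q
    return shifts


print(rabin_karp("abcabaabcabac", "abaa"))                     # [3]
print(rabin_karp("2359023141526739921", "31415", d=10, q=13))  # [6]
```

Because every hit is verified, the result is correct for any q; the choice of q only affects how many spurious hits have to be checked.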


Complexity of RK Algorithm
All characters are interpreted as radix-d digits.
h is initialized to the value of the high-order digit position of an m-digit window.
p and t_0 are computed in O(m + m) = O(m) time: the loop of lines 6-8 takes O(m) time.
The loop of line 9 takes Θ((n-m+1)m) time in the worst case, when every shift is a hit and must be verified.
The overall worst-case running time is O((n-m+1)m).
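The worst case arises when every shift is a hit, so that each of the n-m+1 windows is verified character by character. The sketch below counts that verification work; the function name `count_verification_work` and the parameters d = 256, q = 101 are illustrative assumptions.

```python
def count_verification_work(text: str, pattern: str,
                            d: int = 256, q: int = 101) -> tuple[int, int]:
    """Return (number of hits, number of character comparisons spent verifying them)."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return 0, 0
    h = pow(d, m - 1, q)
    p = t = 0
    for i in range(m):
        p = (d * p + ord(pattern[i])) % q
        t = (d * t + ord(text[i])) % q
    hits = comparisons = 0
    for s in range(n - m + 1):
        if p == t:
            hits += 1
            for j in range(m):  # explicit verification, stops at the first mismatch
                comparisons += 1
                if text[s + j] != pattern[j]:
                    break
        if s < n - m:
            t = (d * (t - ord(text[s]) * h) + ord(text[s + m])) % q
    return hits, comparisons


# Every shift is a valid hit: (n - m + 1) * m = 16 * 5 comparisons.
print(count_verification_work("a" * 20, "a" * 5))  # (16, 80)
# No window hashes to the pattern's value here, so nothing is verified.
print(count_verification_work("a" * 20, "b" * 5))  # (0, 0)
```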


Weakness of RK Algorithm
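Spurious hits: every spurious hit costs an extra brute-force comparison of up to m characters.

Searching for a single character gains nothing from hashing. The algorithm works well for large patterns.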


References:
http://www.wordiq.com/definition/Rabin-Karp_string_search_algorithm
http://www.eecs.harvard.edu/~ellard/Q-97/HTML/root/node43.html
http://harvestsoft.net/rabinkarp.htm
Thomas H. Cormen, Introduction to Algorithms, Second Edition, Page: 794-798.


Thank You
