Download as pdf or txt
Download as pdf or txt
You are on page 1of 32

Algoritmen & Datastructuren

2012 2013
Substring Search
(Slides by Sedgewick)
Philip Dutr & Benedict Brown
Dept. of Computer Science, K.U.Leuven

Goal: Find pattern of length M in a text of length N

Usually M << N

Copyright Ph.Dutr, Spring 2013

Applications
Parsers
Spam filters
Digital libraries
Word processors
Web search engines
Natural language processing
Computational molecular biology
Feature detection in digitized images

Copyright Ph.Dutr, Spring 2013

Application: spam filtering


Identify patterns indicative
of spam.

PROFITS
L0SE WE1GHT
herbal V1AGRA
There is no catch
L0W M0RTGAGE RATES
This is a one-time mailing.
This message is sent in
compliance with spam
regulations

Copyright Ph.Dutr, Spring 2013

Application: electronic surveillance


Look for certain phrases in email traffic

Detonate Bomb
Attack Belgians
Ik zal een taart in het gezicht van Leonard werpen

Huge amount of data, incoming streams, no backup

Copyright Ph.Dutr, Spring 2013

Application: page scraping


Goal. Extract relevant data from web page.
E.g. Find string delimited by <b> and </b> after first
occurrence of pattern Last Trade

Copyright Ph.Dutr, Spring 2013

Brute force
Check all possible starting positions,
check entire length of pattern

Copyright Ph.Dutr, Spring 2013

Brute force

Copyright Ph.Dutr, Spring 2013

Brute force: worst case

Brute-force algorithm can be slow if text and pattern


are repetitive

Worst case: ~ M.N character comparisons

M.(N-M+1) = M.N M.M + M ~ M.N if M << N


Copyright Ph.Dutr, Spring 2013

Challenges

In many applications, backup in data to be avoided

Solution 1: maintain buffer (memory!)


Solution 2: avoid backup

Challenge: we want linear time complexity!


10

Copyright Ph.Dutr, Spring 2013

Brute force: alternate implementation

Same sequence of character compares

11

i points to end of sequence of already-matched chars in text


j stores number of already-matched chars
(end of sequence in pattern)

Copyright Ph.Dutr, Spring 2013

Knuth-Morris-Pratt (KMP)

Intuition: Suppose we are searching in text for pattern


BAAAAAAAAA (alphabet = {A,B})

Suppose we match 5 chars, with mismatch on 6th char


We know previous 6 characters in text are BAAAAB
Don't need to back up text pointer!

12

Copyright Ph.Dutr, Spring 2013

Deterministic finite state automaton (DFA)

DFA is abstract string-searching machine.

13

Finite number of states (including start and halt)


Exactly one transition for each char in alphabet
Accept if sequence of transitions leads to halt state

Copyright Ph.Dutr, Spring 2013

Interpretation of Knuth-Morris-Pratt DFA

Interpretation of DFA state after reading in txt[i]?

State-number = number of characters in pattern that have


been matched.
length of longest prefix of pat[] that is at end of txt[0..i]

E.g. DFA is in state 3 after reading in character txt[6]

14

Copyright Ph.Dutr, Spring 2013

KMP trace

15

Copyright Ph.Dutr, Spring 2013

KMP Implementation

Differences from brute-force implementation:

Text pointer i never decrements (no backup needed)


Need to precompute dfa[][] from pattern (some work)

Running time

Simulate DFA on text: at most N character accesses

Build DFA: how to do efficiently?

16

Copyright Ph.Dutr, Spring 2013

How to build DFA from pattern?


Match transition

first j characters of pattern


have already been matched

next char matches

If in state j and next char c == pat.charAt(j),


then go to state j+1.

now first j+1 characters of


pattern have been matched

17

Copyright Ph.Dutr, Spring 2013

How to build DFA from pattern?

Mismatch transition

To compute dfa[c][j]:

If in state j and next char c != pat.charAt(j), then the last j characters


of input are pat[1..j-1], followed by c
Simulate pat[1..j-1] on DFA (so far ) and take transition c

Running time: Seems to require j steps


Simulation does not start from scratch: remember state X
18

Copyright Ph.Dutr, Spring 2013

19

Copyright Ph.Dutr, Spring 2013

20

Copyright Ph.Dutr, Spring 2013

KMP Analysis

KMP accesses no more than M + N chars

M for constructing DFA


N for actual search

KMP constructs dfa[][] in time and space


proportional to R.M

21

R = size of alphabet

Copyright Ph.Dutr, Spring 2013

The Art of Computer Programming

Donald
Knuth

"If you think you're a really good programmer . . . read


(Knuth's) Art of Computer Programming . . . You should
definitely send me a rsum if you can read the whole
thing. (Bill Gates)
22

Copyright Ph.Dutr, Spring 2013

Boyer-Moore

Intuition:

23

Scan characters in pattern from right to left


Can skip M text chars when finding one not in the pattern

Copyright Ph.Dutr, Spring 2013

Boyer-Moore

How much to skip?

right[c] = rightmost occurrence of character c in pattern

-1 for characters not in pattern

24

Copyright Ph.Dutr, Spring 2013

Boyer-Moore

25

Copyright Ph.Dutr, Spring 2013

Boyer-Moore

26

Copyright Ph.Dutr, Spring 2013

Boyer-Moore Analysis

~ N/M character compares

(characters in pattern < characters in alphabet)

Worst-case: ~ M.N

27

KMP-like rules to avoid repetition linear ~ N

Copyright Ph.Dutr, Spring 2013

Rabin-Karp

Basic idea = modular hashing

28

Compute a hash of pattern characters 0 to M-1


For each i, compute a hash of text characters i to M+i-1
If pattern hash = text substring hash, check for a match

Copyright Ph.Dutr, Spring 2013

Rabin-Karp

Intuition: M-digit, base-R integer, modulo Q

ti = txt.charAt(i)

Horner's method

29

Linear-time method to evaluate degree-M polynomial

Copyright Ph.Dutr, Spring 2013

Rabin-Karp

Can update hash function in constant time!

30

Copyright Ph.Dutr, Spring 2013

Rabin-Karp

31

Copyright Ph.Dutr, Spring 2013

Rabin-Karp

Las Vegas variant

If hash is equal, then check pattern


100% correct

Extremely likely to run in linear time (but worst case


M.N)

Monte Carlo variant

If hash is equal, we assume pattern is equal as well


Always runs in linear time
Extremely likely to return correct answer

32

Probability 1/Q choose large Q


Copyright Ph.Dutr, Spring 2013

You might also like