
Information Processing & Management Vol. 29, No. 6, pp. 775-791, 1993          0306-4573/93 $6.00 + .00
Printed in Great Britain.                              Copyright © 1993 Pergamon Press Ltd.

A FAST STRING-SEARCHING ALGORITHM


FOR MULTIPLE PATTERNS

NORIYOSHI URATANI*
Science and Technical Research Laboratories,
NHK (Japan Broadcasting Corporation),
Tokyo 157, Japan

and

MASAYUKI TAKEDA†
Interdisciplinary Graduate School of Engineering Science,
Kyushu University 39, Kasuga 816, Japan

(Received 25 July 1991; accepted in final form 14 September 1992)

Abstract-The string-searching problem is to find all occurrences of pattern(s) in a text
string. The Aho-Corasick (AC) string-searching algorithm simultaneously finds all occurrences
of multiple patterns in one pass through the text. On the other hand, the Boyer-Moore
(BM) algorithm is understood to be the fastest algorithm for a single pattern. By combining
the ideas of these two algorithms, we present an efficient string-searching algorithm
for multiple patterns. The algorithm runs in sublinear time, on the average, as the BM
algorithm does, and its preprocessing time is linearly proportional to the sum of the
lengths of the patterns, like the AC algorithm.

1. INTRODUCTION

The string searching problem is defined as follows:

Given two strings p_1 p_2 ... p_m, called a pattern, and a_1 a_2 ... a_n, called a text,
find all indices j with a_j a_{j+1} ... a_{j+m-1} = p_1 p_2 ... p_m.
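The definition translates directly into an obvious quadratic-time reference implementation, which is useful later as a correctness oracle for the faster algorithms (a sketch in Python with 0-indexed positions; the function name is ours, not the paper's):

```python
def naive_search(text, patterns):
    """All (j, pattern) pairs such that pattern occurs in text at index j."""
    return [(j, p)
            for p in patterns
            for j in range(len(text) - len(p) + 1)
            if text[j:j + len(p)] == p]
```

For example, naive_search("the-greatest-artist", ["trace", "artist", "smart"]) reports the single occurrence of artist at index 13.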

Knuth, Morris, and Pratt presented an algorithm that solves this problem, which runs
in time O(n) (Knuth et al., 1974). Aho and Corasick have shown how it can be extended
to the multiple pattern problem (Aho & Corasick, 1975). In their method, a finite-state
machine is built which simultaneously recognizes all occurrences of the patterns in a sin-
gle pass through the text. Such a finite-state machine is called a pattern-matching machine,
PMM for short. The running time is also O(n).
Both the KMP and AC algorithms scan the patterns from left to right. Boyer and
Moore noticed that more information about the text can be gained if scanning is done
from right to left (Boyer & Moore, 1977). It is understood that the BM algorithm, based
on this idea, is the fastest algorithm for a single pattern. The running time on the average
is sublinear.
To characterize the efficiency of a string-searching algorithm, we use the probe rate,
defined as the ratio between the total number of references to the text and the length of
the text. By sublinear, we mean that the probe rate is less than one. For the AC algorithm
the probe rate is exactly one, both in the average and the worst cases.
We extend the BM algorithm to multiple patterns using the idea of the PMM. Our algorithm
runs in sublinear time on the average, as the BM algorithm does. For a single
pattern, it runs faster than the BM algorithm. Moreover, the construction of the PMM requires
only time linearly proportional to the sum of the lengths of the patterns, like the AC
algorithm. Our algorithm is most suitable when all patterns are sufficiently long and the
alphabet size is large, but the number of patterns is not very large.

*Presently at ATR Interpreting Telephony Research Laboratories, Soraku-gun, Kyoto 619-02, Japan.
†Presently at the Department of Electrical Engineering, Kyushu University 36, Fukuoka 812, Japan.


It should be mentioned that Commentz-Walter reported an extension of the BM algorithm
to multiple patterns (Commentz-Walter, 1979). But it has received little attention,
probably because her method is a straightforward modification of BM and the construction
of the algorithm is incomplete. Kowalski and Meltzer sketched another extension of
the BM algorithm to multiple patterns (Kowalski & Meltzer, 1984). But it is for patterns
of uniform length. Our algorithm is simpler and runs faster than these two algorithms.
The organization of this paper is as follows: In the next two sections, we describe our
string-searching algorithm and show how to construct a PMM from a collection of pat-
terns. We then prove the correctness of the algorithm. After this discussion, we give a the-
oretical analysis of the efficiency of the algorithm, and present the empirical evidence that
supports the sublinearity of the algorithm on the average. Furthermore, we compare our
method with the BM method for a single pattern.

2. OUR ALGORITHM

The basic idea of the BM algorithm is to match the pattern from right to left. We
apply this idea to the multiple pattern problem as in the following example.

EXAMPLE 1. Suppose that the collection of patterns is {trace, artist, smart} and the text
is "the-greatest-artist...". First, we align the right ends of these patterns. Since the
length of the shortest pattern is five, we start matching at the fifth character of the text.
In the figures below, * denotes a success in the matching and x denotes a mismatch.
 trace
artist
 smart
 the-greatest-artist...
     x

Since g is known not to occur in any pattern, we can shift the patterns to the right by five,
the length of the shortest pattern.
      trace
     artist
      smart
 the-greatest-artist...
         x*
Now, the current character e in the text matches the e in the pattern trace, so we move
the pointer on the text to the left by one. Then we have a mismatch. Since the checked sub-
string te of the text does not occur in any pattern, we can again shift the patterns to the
right by five (move the pointer to the right by six) and we have:
           trace
          artist
           smart
 the-greatest-artist...
               x

The current character r makes a mismatch. So as to align this r with the r in smart, we
shift the patterns to the right by one.

            trace
           artist
            smart
 the-greatest-artist...
             x***

From right to left, we successfully match t, r, and a, and fail at the next
character, -. Consider occurrences in the patterns of the checked string -art. So as to align
its suffix art with the art in artist, we shift the patterns to the right by three positions
(and move the pointer to the right by six).

               trace
              artist
               smart
 the-greatest-artist...
              ******

This time we discover that each character of the pattern artist matches the corresponding
character in the text. For a text of 19 characters, we made only 14 references, and six
of them were required to confirm the final match.
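In the terminology of the Introduction, this example already exhibits a sublinear probe rate; a quick check in Python:

```python
# Example 1: 14 references to the text against a text length of 19.
references, text_length = 14, 19
print(round(references / text_length, 3))   # 0.737, i.e., probe rate < 1
```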
As above, the idea of matching from the right is also effective for multiple patterns.
To realize this technique, we use a PMM like the AC algorithm. Our PMM consists of three
functions: goto, output, and failure. Figure 1 shows the PMM for the patterns in Example
1. The goto and output functions are nearly the same as those of the AC algorithm.
They form a trie which represents the set of reversed patterns. The failure function is used
to shift the patterns to the right when a mismatch occurs.
By using such a PMM, we perform matching as follows: Starting at state 0 and scanning
the text backward from the minlen-th character (minlen is the minimum length of the patterns),
we make state transitions by the goto function as many times as possible, until we encounter
a state where the next state on the current character is not defined (goto transitions).
If we reach the state corresponding to the left end of a pattern, then the
pattern occurs at that position. If we reach a state from which a goto transition is impossible,
we move the pointer on the text forward by the amount indicated by the failure function,
and start at state 0 again. In the case of Example 1, the PMM moves as in Fig. 2. The
text-searching algorithm is summarized in Algorithm 1.

The solid arrows represent the goto function. The underlined strings
adjacent to the states are their outputs.

(a) The goto and output functions.


[Table: the failure function f(state, c), for states 0-15 and characters a, c, e, i, m, r, s, t, and else.]

The numerical values are the amounts by which the text pointer is
moved to the right in case of a mismatch. The asterisks mean
undefined.
(b) The failure function.
Fig. 1. The PMM for {trace, artist, smart}.

the-greatest-artist
Stage 1:  0
Stage 2:  1 ← 0
Stage 3:  0
Stage 4:  13 ← 12 ← 6 ← 0
Stage 5:  11 ← 10 ← 9 ← 8 ← 7 ← 6 ← 0

The numbers below the text are states of the PMM. A sequence of
numbers beginning with 0 represents a series of state transitions
made in one stage. For example, the sequence in Stage 2 means
the transition from state 0 to 1 on the symbol e. Since g(1, t) = fail,
no more transitions are possible; therefore the text pointer is moved
to the right by f(1, t) = 6 and then the next stage is started at state 0.

Fig. 2. The moves of the PMM.

3. CONSTRUCTION OF PMM

In this section we present the algorithms for constructing a PMM from a collection of
patterns. We begin by defining some basic notions and notations.
Let Σ be a finite set of characters, called the alphabet. A string is a finite sequence of
symbols juxtaposed. The length of a string w, denoted by |w|, is the number of symbols
composing the string. The empty string, denoted by ε, is the string consisting of zero symbols.
Thus |ε| = 0. The set of all strings over the alphabet Σ is denoted by Σ*. Let Σ+ =
Σ* − {ε}. When a string w can be written as uxv, the string x is called a substring of w. Furthermore,
if u (v) is empty, then x is called a prefix (suffix) of w.
We shall describe how to construct a PMM from a collection of patterns. The process
has two parts. In the first part, we compute the goto function g and the output function
out. In the second part, we then compute the failure function f from the goto function.
To compute the goto function, we construct a goto graph. This is the same as that of
the AC algorithm, except that it is constructed for the reversed patterns. The solid arrows
and the vertices in Fig. 3 show the goto graph for the patterns {trace, artist, smart}.

Algorithm 1. Text-Searching Algorithm.

input: A text string T = a_1 a_2 ... a_n and a pattern matching machine with functions
g, f, and out.
output: Locations at which patterns occur in T.
method:
/* $ is a symbol which is known not to occur in any pattern. */
/* minlen is the length of the shortest pattern. */
begin
  a_0 := $;
  q := minlen;
  while q ≤ n do
    state := 0;
    while g(state, a_q) > 0 do
      state := g(state, a_q);
      if out(state) ≠ empty then print q, out(state) endif;
      q := q − 1
    endwhile;
    q := q + f(state, a_q)
  endwhile
end.

The broken arrows represent the function φ, where broken arrows to
the state 0 from all but the states 0, 5, 9, 10, and 11 are omitted.

Fig. 3. The goto graph and the function φ.

Notice that there is a loop from state 0 to 0 in the goto graph. It is added for the sake of
convenience in computing the failure function. After the completion of the PMM, we ignore it.
The algorithm for computing the goto and output functions is summarized in Algorithm
2. In Algorithm 2, we also compute the functions depth and left_end, which are required

Algorithm 2. Construction of the goto and output functions.

input: A collection of patterns K = {α_1, α_2, ..., α_k}.
output: Functions g, out, depth, and left_end.
method:
/* large is a positive integer which is much larger than the lengths of the patterns. */
/* We initialize g to fail (<0) and out to empty. */
begin
  minlen := large;
  newstate := 1; depth(0) := 0;
  for i := 1 to k do enter(α_i, i) endfor;
  for each c such that g(0, c) = fail do g(0, c) := 0 endfor
end.

procedure enter(a_1 a_2 ... a_m, i):
begin
  state := 0; j := m;
  while (g(state, a_j) ≠ fail) and (j > 0) do
    state := g(state, a_j);
    j := j − 1
  endwhile;
  while j > 0 do
    g(state, a_j) := newstate;
    depth(newstate) := m − j + 1;
    state := newstate;
    newstate := newstate + 1;
    j := j − 1
  endwhile;
  left_end(i) := state;
  out(state) := {a_1 a_2 ... a_m};
  minlen := min(minlen, m)
end.
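For illustration, Algorithm 2 can be transliterated into a dict-based sketch (Python; the representation and the names are our assumptions, not the paper's array-based implementation):

```python
def build_goto(patterns):
    """Algorithm 2 sketch: a trie of the reversed patterns.
    Returns the goto trie plus the out, depth, left_end, and minlen data."""
    goto, out, depth, left_end = [dict()], [[]], [0], []
    for pat in patterns:
        state = 0
        for c in reversed(pat):          # insert the pattern right to left
            if c not in goto[state]:
                goto[state][c] = len(goto)
                goto.append(dict()); out.append([])
                depth.append(depth[state] + 1)
            state = goto[state][c]
        left_end.append(state)           # state for the left end of pat
        out[state].append(pat)
    return goto, out, depth, left_end, min(map(len, patterns))
```

For the patterns of Example 1 this produces 16 states, with the left ends of trace, artist, and smart at depths 5, 6, and 5, respectively.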

to compute the failure function. Note that depth(s) denotes the length of the path
from state 0 to state s, and left_end(j) denotes the state corresponding to the left end of
the jth pattern.
Now, we consider the construction of the failure function f. Let a_1 a_2 ... a_n be the text
string. Suppose that, starting at state 0, we have just scanned the characters a_{i+1}, ..., a_j
backward to reach state s, and then a mismatch occurs at a_i. To compute the amount by
which we can move the pointer on the text, we shall consider the right-most occurrence of
the scanned string a_i a_{i+1} ... a_j in the patterns. Thus, the following two cases should be
distinguished (see Fig. 4).

1. The scanned string a_i a_{i+1} ... a_j is identical to a substring of some pattern.
2. A suffix of the string a_{i+1} ... a_j is identical to a prefix of some pattern.

To be precise, let u be the matched string a_{i+1} ... a_j, and a the mismatching character
a_i. Let s be the current state of the PMM, and let

V1(s, a) = {v ∈ Σ+ | auv is a suffix of a pattern},

V2(s) = {v ∈ Σ+ | for some suffix y of u, yv is a pattern}.

Note that V2(s) is not empty, since any pattern is a member of V2(s). Let v be the shortest
string in V1(s, a) ∪ V2(s). The failure function f is then defined by

f(s, a) = |u| + |v|.

Note that f is defined only when g(s, a) = fail or 0.
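This definition can also be evaluated by exhaustive search, which is useful for checking the linear-time construction below. A brute-force sketch in Python (identifying the state with the matched string u itself; the naming is ours):

```python
def failure(u, a, patterns):
    """f(state(u), a) = |u| + |v|, where v is the shortest string such that
    a+u+v is a suffix of some pattern (case V1), or y+v is a whole pattern
    for some suffix y of u (case V2)."""
    suffixes = {p[i:] for p in patterns for i in range(len(p) + 1)}  # SUF(K)
    best = None
    for s in suffixes:                       # V1: a·u·v is a pattern suffix
        if s.startswith(a + u) and len(s) > len(a + u):
            cand = len(s) - len(a + u)       # cand = |v|
            best = cand if best is None else min(best, cand)
    for j in range(len(u) + 1):              # V2: y·v is a pattern, y in SUF(u)
        y = u[j:]
        for p in patterns:
            if p.startswith(y) and len(p) > len(y):
                cand = len(p) - len(y)
                best = cand if best is None else min(best, cand)
    return len(u) + best
```

For the patterns of Example 1, failure("e", "t", ...) = 6 and failure("art", "-", ...) = 6, matching the pointer moves made in Stages 2 and 4 of the example.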


Let

shift1(s, a) = { min{|uv| | v ∈ V1(s, a)}   if V1(s, a) ≠ ∅
              { large                       otherwise,

and

shift2(s) = min{|uv| | v ∈ V2(s)}.

Here large denotes a sufficiently large positive integer. If we have computed the functions
shift1 and shift2, the failure function f can then be obtained by

f(s, a) = min{shift1(s, a), shift2(s)}.

To compute the functions shift1 and shift2, we need an auxiliary function φ. The function
φ is similar to the failure function of the AC algorithm. The broken arrows in Fig. 3
[Figure: the mismatching character a_i and the matched portion of the text, aligned against occurrences (1) and (2) in the patterns.]

Fig. 4. Occurrences of the scanned string in the patterns.


show an example of the function φ, where broken arrows to state 0 from all but states 0,
5, 9, 10, and 11 are omitted.
The algorithm for computing the function φ is summarized in Algorithm 3.1.
By using the function φ, we can compute the functions shift1 and shift2 in time linearly
proportional to the total length of the patterns. The algorithms are summarized in
Algorithms 3.2 and 3.3, respectively. In the next section, we prove the correctness of these
algorithms.

Algorithm 3.1. Construction of the function φ.

input: Goto function g.
output: Function φ.
method:
begin
  queue := empty;
  for each c such that g(0, c) = s > 0 do
    queue := queue·s;
    φ(s) := 0
  endfor;
  while queue ≠ empty do
    let queue = r·tail;
    queue := tail;
    for each c such that g(r, c) = s ≠ fail do
      queue := queue·s;
      state := φ(r);
      while g(state, c) = fail do state := φ(state) endwhile;
      φ(s) := g(state, c)
    endfor
  endwhile
end.

Algorithm 3.2. Construction of the function shift1.

input: Functions g and φ.
output: Function shift1.
method:
/* We initialize shift1 to large. */
begin
  queue := empty;
  for each c such that g(0, c) = s > 0 do queue := queue·s endfor;
  while queue ≠ empty do
    let queue = r·tail;
    queue := tail;
    for each c such that g(r, c) = s ≠ fail do
      queue := queue·s;
      state := φ(r);
      while g(state, c) = fail do
        shift1(state, c) := min{shift1(state, c), depth(r)};
        state := φ(state)
      endwhile;
      if g(state, c) = 0 then
        shift1(state, c) := min{shift1(state, c), depth(r)}
      endif
    endfor
  endwhile
end.

Algorithm 3.3. Construction of the function shift2.

input: Functions g, φ, depth, and left_end.
output: Function shift2.
method:
/* We initialize shift2 to large. */
begin
  for j := 1 to k do
    state := φ(left_end(j));
    while state > 0 do
      shift2(state) := min{shift2(state), depth(left_end(j))};
      state := φ(state)
    endwhile
  endfor;
  shift2(0) := minlen;
  queue := {0};
  while queue ≠ empty do
    let queue = r·tail;
    queue := tail;
    for each c such that g(r, c) = s > 0 do
      queue := queue·s;
      shift2(s) := min{shift2(s), shift2(r) + 1}
    endfor
  endwhile
end.
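Putting Algorithms 1, 2, 3.1, 3.2, and 3.3 together gives a complete runnable sketch. The following Python transcription is an illustration under our own conventions (a dict-based goto instead of the paper's two-dimensional array, 0-indexed match positions); it is not the authors' C implementation:

```python
from collections import deque

FAIL = -1

def build_pmm(patterns):
    """Algorithms 2, 3.1, 3.2, 3.3: the goto trie over reversed patterns,
    the output function, and the combined failure function f."""
    goto, out, depth, parent, left_end = [dict()], [[]], [0], [0], []
    for pat in patterns:                       # Algorithm 2 (enter)
        s = 0
        for c in reversed(pat):
            if c not in goto[s]:
                goto[s][c] = len(goto)
                goto.append(dict()); out.append([])
                depth.append(depth[s] + 1); parent.append(s)
            s = goto[s][c]
        left_end.append(s)
        out[s].append(pat)
    minlen = min(map(len, patterns))
    large = sum(map(len, patterns)) + 1

    def g(s, c):
        # The paper's goto: undefined characters at the root loop to state 0.
        if c in goto[s]:
            return goto[s][c]
        return 0 if s == 0 else FAIL

    # Algorithm 3.1: the AC-style auxiliary function phi, by BFS.
    phi, order = [0] * len(goto), []
    queue = deque(goto[0].values())
    while queue:
        r = queue.popleft()
        order.append(r)
        for c, s in goto[r].items():
            queue.append(s)
            st = phi[r]
            while g(st, c) == FAIL:
                st = phi[st]
            phi[s] = g(st, c)

    # Algorithm 3.2: shift1(s, c), from substring occurrences.
    shift1 = {}
    for r in order:
        for c in goto[r]:
            st = phi[r]
            while g(st, c) == FAIL:
                shift1[st, c] = min(shift1.get((st, c), large), depth[r])
                st = phi[st]
            if g(st, c) == 0:
                shift1[st, c] = min(shift1.get((st, c), large), depth[r])

    # Algorithm 3.3: shift2(s), from prefix occurrences of whole patterns.
    shift2 = [large] * len(goto)
    for le in left_end:
        st = phi[le]
        while st > 0:
            shift2[st] = min(shift2[st], depth[le])
            st = phi[st]
    shift2[0] = minlen
    for s in order:                            # BFS order: parents come first
        shift2[s] = min(shift2[s], shift2[parent[s]] + 1)

    def f(s, c):
        return min(shift1.get((s, c), large), shift2[s])

    return g, out, f, minlen

def search(text, patterns):
    """Algorithm 1: all (start, pattern) occurrences, 0-indexed starts."""
    g, out, f, minlen = build_pmm(patterns)
    t = '\0' + text                            # t[0] plays the sentinel $
    q, n, hits = minlen, len(text), []
    while q <= n:
        state = 0
        while g(state, t[q]) > 0:
            state = g(state, t[q])
            for pat in out[state]:
                hits.append((q - 1, pat))      # leftmost scanned character
            q -= 1
        q += f(state, t[q])
    return hits
```

On Example 1, search("the-greatest-artist", ["trace", "artist", "smart"]) reports the single occurrence of artist starting at index 13.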

4. THE CORRECTNESS

In this section we prove the correctness of the text-searching algorithm. We denote by
PRE(w) and SUF(w) the sets of prefixes and suffixes of a string w, respectively. We
extend these to sets of strings by

PRE(X) = ∪_{w∈X} PRE(w),    SUF(X) = ∪_{w∈X} SUF(w),

where X is a set of strings.


Suppose that K = {α_1, α_2, ..., α_k} is the set of patterns to be searched for. Let Q
be the set of states appearing in the goto graph constructed from K. Define a mapping
state : SUF(K) → Q by

state(aw) = g(state(w), a)    (w, aw ∈ SUF(K), a ∈ Σ).

In our example, state(trace) = 5, state(tist) = 9, and so on. Note that the mapping state
is a bijection. Moreover, we define a relation ⊢φ on Q by

s ⊢φ t ⇔ φ(s) = t,

where s, t ∈ Q.
The following lemma, which is essentially the same as Aho and Corasick's (Aho &
Corasick, 1975, Lemma 1 on page 337), characterizes the function φ computed by Algorithm
3.1.
[Fig. 5.]

Proof. By induction on the length of u. □

We immediately get the following corollary.

COROLLARY
If u, v ∈ SUF(K) and v ≠ ε, then r ≠ 0, g(r, a) ≠ fail, and there exists a sequence of
states r_0, ..., r_n ∈ Q (n > 0) such that

r_0 = r,    r_n = s,
r_{i+1} = φ(r_i)    (i = 0, ..., n − 1),
g(r_i, a) = fail    (i = 1, ..., n − 1),

where state(u) = s.

Hence it is sufficient to show the following:

1. V1(s, a) = ∅ ⇔ A(s, a) = ∅.
2. When V1(s, a) ≠ ∅,

min{|uv| | v ∈ V1(s, a)} = min{depth(r) | r ∈ A(s, a)}.

Since A(s, a) ⊆ {state(uv) | v ∈ V1(s, a)}, V1(s, a) = ∅ implies A(s, a) = ∅. Suppose that
V1(s, a) ≠ ∅. Let v be the shortest string in V1(s, a), and let r = state(uv). Then, there
exists a sequence of states r_0, ..., r_n ∈ Q (n > 0) such that

r_0 = r,    r_n = s,
r_{i+1} = φ(r_i)    (i = 0, ..., n − 1).

We wish to show r ∈ A(s, a). To do this, we must show that g(r_i, a) = fail for all i with
1 ≤ i < n. Suppose for a contradiction that there exists j, 1 ≤ j < n, such that g(r_j, a) ≠
fail. Then, there exists v′ ∈ Σ+ with state(uv′) = r_j, and thus we obtain

v′ ∈ V1(s, a),    |v′| < |v|,

which is a contradiction. So we have

min{|uv| | v ∈ V1(s, a)} ≥ min{depth(r) | r ∈ A(s, a)}.

The converse inequality is clear. □


LEMMA III
Algorithm 3.3 correctly computes the function shift2. That is, for any state-character
pair (s, a) ∈ Q × Σ such that g(s, a) = fail (or 0),

shift2(s) = min{|uv| | v ∈ V2(s)},

where state(u) = s and

V2(s) = {v ∈ Σ+ | there exists y ∈ SUF(u) such that yv ∈ K}.

(See Fig. 6.)

Fig. 6. Construction of the function shift2.



Proof. Let

B(s) = {r ∈ state(K) | r ⊢φ+ s}.

It is easy to see that the function shift2 produced by Algorithm 3.3 satisfies

shift2(s) = { minlen                                              if s = 0
            { min({depth(r) | r ∈ B(s)} ∪ {shift2(fath(s)) + 1})  if s > 0,

where fath(s) denotes the father of s. The lemma is clear for s = 0. When s > 0, let s′ be
the father of s. Then

V2(s) = {v ∈ Σ+ | uv ∈ K} ∪ V2(s′).

So it is clear. □

The next theorem follows from Lemmas II and III.

THEOREM I
Suppose that u ∈ SUF(K), a ∈ Σ, and au ∉ SUF(K). Let v be the shortest string in
Σ+ which satisfies at least one of the following conditions:

1. auv ∈ SUF(K).
2. There exists y ∈ SUF(u) such that yv ∈ K.

Then

f(state(u), a) = |u| + |v|.

Now, let us prove the correctness of Algorithm 1, the text-searching algorithm.


LEMMA IV
Let a_1 a_2 ... a_n be the text string, and set a_0 := $ (the symbol which is known not to
occur in any pattern). Let

u = a_{p+1} ... a_q ∈ SUF(K)    (0 ≤ p ≤ q ≤ n),

s = state(u)

(if p = q then u = ε). When g(s, a_p) = fail or 0, if a_i ... a_j ∈ K and q < j then p +
f(s, a_p) ≤ j.
Proof. By Theorem I. □
THEOREM II
Algorithm 1 finds all occurrences of the patterns in the text.

5. THEORETICAL ANALYSIS

The goto function g can be stored in a two-dimensional array of size |Σ| × I, where
I is the number of states of the PMM. This enables us to determine the value of g(s, σ) in
constant time for each state s and symbol σ. The failure function f can be stored together with
g in the same array, since the value of f(s, σ) is defined only when the value of g(s, σ) is
fail. The output function can be stored in a one-dimensional array of size I. The other
functions do not require more than O(|Σ|·I) space; thus the total space requirement is
O(|Σ|·I). The construction of the PMM takes only time linearly proportional to the total
length of the patterns.

We shall now discuss the efficiency of the text-searching algorithm. As with the BM
algorithm, the worst-case running time is not linear in the text length. The probe rate in
the worst case is the maximum pattern length plus one.
Now we analyze the average-case behavior of the algorithm. The average-case analysis
of BM-type algorithms is known to be difficult. For simplicity, we assume that: (a)
texts and patterns are random strings over an alphabet of size q in which each character has
an equally likely chance of occurring; (b) all the patterns have the same length, say m. The
second assumption may seem unacceptable. However, when the patterns have various
lengths, let m be the minimum length. The performance is then nearly the same as in the
case where all the patterns are of length m.
An execution of the algorithm consists of a series of stages, in each of which zero or more goto
transitions are made before finding a mismatch, and then the patterns are shifted to the
right. The probe rate is the expected value of the ratio between the number of references
to the text and the distance by which we get to shift the patterns to the right in one stage.
The probability that exactly j goto transitions are made in one stage is denoted by
Pr(T = j). The probability that exactly j goto transitions are made until a mismatch,
and then the patterns are shifted to the right by d, is denoted by Pr(T = j ∧ S = d). The
probe rate P can then be written

P = [Σ_j (j + 1)·Pr(T = j)] / [Σ_{j,d} d·Pr(T = j ∧ S = d)].    (1)

If the random variables T and S were independent, we would have

Pr(T = j ∧ S = d) = Pr(T = j)·Pr(S = d | T = j).    (2)

Let t_j = 1 − (1 − (1/q)^j)^k, where k is the number of patterns. It is easy to see that

Pr(T = j) = { t_j − t_{j+1}    (0 ≤ j < m)
            { t_j              (j = m),    (3)

and

Pr(S = d | T = j) = { t_{min(j+1, m−d)} · Π_{i=m−d+1}^{m−1} (1 − t_{min(j+1, i)})    (1 < d ≤ m)
                    { t_{min(j+1, m−1)}                                              (d = 1).    (4)

Since eqn 2 can be proved to be a good approximation when q ≫ 1, the probe rate P is
given by

P = [Σ_{j=0}^{m} (j + 1)·Pr(T = j)] / [Σ_{j=0}^{m} Pr(T = j)·Σ_{d=1}^{m} d·Pr(S = d | T = j)].    (5)

Figure 7 shows the relation between k and m with P = 1, for q = 2, 16, and 94, based
on eqn 5. The algorithm should be used only for q, k, and m with P < 1. In the other cases the
AC algorithm should be used, because its probe rate is always 1.
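Equations 3-5 are straightforward to evaluate numerically. A Python sketch of the model (our transcription; the d = 1 case is the one that makes the conditional shift probabilities sum to one):

```python
def probe_rate(q, k, m):
    """Approximate probe rate P of the algorithm (eqns 3-5), assuming k random
    patterns of equal length m and a random text over an alphabet of size q."""
    t = [1 - (1 - q ** -j) ** k for j in range(m + 1)]        # t_j
    pr_T = [t[j] - t[j + 1] for j in range(m)] + [t[m]]       # eqn (3)

    def pr_S(d, j):                                           # eqn (4)
        p = t[min(j + 1, m - d)]
        for i in range(m - d + 1, m):
            p *= 1 - t[min(j + 1, i)]
        return p

    num = sum((j + 1) * pr_T[j] for j in range(m + 1))
    den = sum(pr_T[j] * sum(d * pr_S(d, j) for d in range(1, m + 1))
              for j in range(m + 1))
    return num / den                                          # eqn (5)
```

For instance, a single pattern of length 5 over a 94-character alphabet gives P well below one, while 100 patterns of length 2 over a binary alphabet push P above one, in line with the qualitative picture of Fig. 7.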

6. EMPIRICAL EVIDENCE

We implemented the algorithm in the C language on a Sun-3 workstation, and then


tested the efficiency on random data and on real data.
First we describe the experimental results on random data. Texts and patterns were
randomly generated strings over alphabets of size q = 2, 16, and 94, in which each character
in the alphabet has an equally likely chance of occurring. The number of patterns k

Fig. 7. The relation among alphabet size q, pattern length m, and number of patterns k when probe
rate P = 1, assuming all patterns are of length m.

varied from 1 to 100. The patterns were all of length m. The results are shown in Fig. 8.
The solid lines show the empirical values of the probe rate. The broken lines show the
theoretical values based on eqn 5. They agree well with the empirical values except in
the case of q = 2. (Recall that eqn 5 is an approximation for q ≫ 1.)

Fig. 8. The probe rate on random data.
The probe rate of multiple use of the BM algorithm for q = 16 and m = 4 is also
shown; it is linearly proportional to the number of patterns.
The probe rate of the AC method is always 1. The efficiency of our method proves
to be higher than that of the AC method when the alphabet size is large and the number of
patterns is not very large. (This is also borne out by actual time comparisons in our
implementation.)
Now we present the experimental results on real data. Texts and patterns were abstracts
and key-phrases from an INSPEC tape, respectively. Experiments were performed as follows:
For each k, we chose at random k patterns from 20,000 key-phrases, and then searched for
them within about 1.2 MB of text by the two methods: our algorithm and multiple use of
BM. The process was repeated 100 times. Although the average length of the 20,000 key-phrases
was as long as 17.5, there were short terms in the set of patterns. The results
for k = 10 and k = 20 are plotted in Fig. 9. The minimum, maximum, and average values of the
probe rate P of the two methods are shown in Table 1. It is observed from Fig. 9 that
the probe rate of our method depends on the minimum length of the patterns. The efficiency
of our method is higher than that of the multiple use of BM. This is true even in an extreme
case with k = 2 where one pattern is short (length 2) and the other is long (length 19). It can
be seen from Table 1b that, when k ≤ 100, the probe rate of our method is less than 1 unless
the minimum pattern length is 1 or 2.
To conclude, for the INSPEC data, our algorithm should be used when the number of
patterns is not very large (≤ 100), unless the minimum length of the patterns is 1 or 2. In
the other cases the AC algorithm should be used.

7. COMPARISON WITH THE BM METHOD


We shall compare our method with the BM method for a single pattern.

EXAMPLE 2. Suppose that for the pattern ABBCABCA we have:

pattern: ABBCABCA
text:    ....CBCA.......
             x***

By the BM method, using delta2, we try to find the right-most plausible reoccurrence of
the matched string BCA. Since we can shift the pattern by 3, we get

Table 1. Comparison of the probe rate P of the two methods

                              P of our method         P of multiple use of BM
No. patterns   Av. minlen   Min.    Max.    Av.      Min.     Max.     Av.

(a) including the cases of minlen = 1, 2
  2            13.40        0.097   0.586   0.178    0.132    0.642    0.236
  5             8.04        0.162   1.071   0.306    0.393    1.655    0.591
 10             5.91        0.215   1.376   0.478    0.851    2.839    1.239
 20             4.26        0.337   1.510   0.611    1.852    3.858    2.363
 30             3.31        0.423   1.602   0.786    2.964    5.326    3.644
 40             3.34        0.440   1.730   0.848    3.875    6.896    4.726
 50             2.95        0.506   1.760   0.975    5.014    7.762    5.970
100             2.25        0.678   1.960   1.214   10.555   15.778   11.824

(b) excluding the cases of minlen = 1, 2
  2            13.75        0.097   0.395   0.166    0.132    0.571    0.225
  5             8.37        0.162   0.551   0.284    0.393    0.954    0.561
 10             6.63        0.215   0.614   0.392    0.851    1.662    1.123
 20             4.59        0.337   0.758   0.561    1.852    2.902    2.286
 30             4.01        0.423   0.826   0.655    2.964    4.025    3.445
 40             4.18        0.440   0.851   0.675    3.875    5.014    4.433
 50             3.85        0.506   0.886   0.727    5.014    6.431    5.666
100             3.33        0.678   0.979   0.891   10.555   12.500   11.424

Fig. 9. Results on INSPEC data: probe rate against minimum length of patterns, (a) when the
number of patterns is 10; (b) when the number of patterns is 20.

pattern:    ABBCABCA
text:    ....CBCA.......

and we go on matching. However, it is obvious that a mismatch will occur.
By contrast, by our method we try to find the right-most occurrence of the scanned
string CBCA. So we can shift the pattern by 7 and get

pattern:        ABBCABCA
text:    ....CBCA.......

As above, our method can often shift the pattern farther than the BM method. A stricter
argument follows: Let pat be a pattern and patlen be its length. Put

delta2′(s) = delta2(patlen − depth(s)).



If we replaced f(state, a_q) of Algorithm 1 by

max{delta1(a_q), delta2′(state)},

we would obtain the BM algorithm. The functions delta1 and delta2′ are characterized in the
following lemma.
LEMMA V
Suppose that bu ∈ SUF(pat), b ∈ Σ. Let v be the shortest string in Σ+ which satisfies
either 1 or 2:

1. There exists a symbol c ∈ Σ such that c ≠ b and cuv ∈ SUF(pat).
2. There exists y ∈ SUF(u) such that yv = pat.

Let s = state(u). Then

delta2′(s) = |u| + |v|.

On the other hand, let σ ∈ Σ and let w be the shortest string in Σ* which satisfies either 3
or 4:

3. σw ∈ SUF(pat).
4. w = pat.

Then

delta1(σ) = |w|.
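Condition 3 picks out the rightmost occurrence of σ in pat, and condition 4 covers characters absent from pat, so delta1 is the familiar BM bad-character table. A sketch in Python (the naming is ours):

```python
def make_delta1(pat):
    """delta1 per Lemma V: |w|, where w is the shortest string with
    sigma·w in SUF(pat) (the distance from the rightmost occurrence of
    sigma to the right end of pat), or w = pat if sigma never occurs."""
    table = {}
    for i, c in enumerate(pat):        # later (more rightward) occurrences win
        table[c] = len(pat) - 1 - i
    return lambda sigma: table.get(sigma, len(pat))
```

For pat = ABBCABCA this gives delta1(A) = 0, delta1(B) = 2, delta1(C) = 1, and delta1(D) = 8.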

Commentz-Walter presented an extension of the BM algorithm to multiple patterns
(Commentz-Walter, 1979). She also used a PMM. Her method, essentially, differs from
ours only in the amount by which the patterns are shifted when a mismatch occurs. Suppose
that we have just matched the text substring a_{i+1} ... a_j backward and we have a mismatch
at a_i. To compute the amount, she considered the following three cases:

1. The matched string a_{i+1} ... a_j is identical to a substring of some pattern.
2. A suffix of the matched string a_{i+1} ... a_j is identical to a prefix of some pattern.
3. The mismatching character a_i occurs in some pattern.

Roughly speaking, 1 and 2 correspond to delta2 of BM, and 3 to delta1. That is, if a
mismatch occurs, the C-W method (and the BM method) tries to find occurrences of the matched
string and of the mismatching character separately. Thus, our method often shifts the patterns
farther than the C-W method does.
It should be pointed out that the construction of the C-W algorithm is incomplete. The
amount concerning case 2 is denoted by shift2. As described in the proof of Lemma III,
shift2 can be computed by

shift2(s) = { minlen                                              if s = 0
            { min({depth(r) | r ∈ B(s)} ∪ {shift2(fath(s)) + 1})  if s > 0,

where

B(s) = {r ∈ state(K) | r ⊢φ+ s}.

It is stated in the paper (Commentz-Walter, 1979) that B(s) can be replaced by

B′(s) = {r ∈ state(K) | φ(r) = s}.

However, this is not correct. (Consider the value of shift2(12) in Fig. 10.)

Fig. 10. A counterexample for C-W.

8. CONCLUSION

We have presented a fast string-searching algorithm for multiple patterns, and given
a proof of the correctness of the algorithm. Moreover, we have seen that our algorithm is
not only an extension of the BM algorithm, but also better than it for a single pattern.
The minimum pattern length, the alphabet size, and the number of patterns play an
important part when considering which algorithm to use for finding multiple patterns.
From the empirical evidence on INSPEC data it can be concluded that the AC algorithm
should be used when the minimum pattern length is 1 or 2, or the number of patterns is
very large (> 100). Our algorithm should be used in all other cases.
It has been seen that the efficiency of our algorithm is higher than that of the multiple use of
the BM algorithm. The disadvantage of using our algorithm instead of the BM algorithm is
the O(|Σ|·I) space requirement for the PMM, where I is the total length of the patterns. The BM
algorithm requires only O(|Σ| + m) space. Another disadvantage might be the preprocessing
time for constructing the PMM, but it is negligible compared with the text-searching
time when the text length is large.
The worst-case running time of our algorithm is not linear in the text length, like the
BM algorithm. This is because these algorithms "forget" all information about the scanned
part of the text when they slide the patterns after a mismatch. The BM algorithm has been
improved to run in linear time even in the worst case (Knuth et al., 1977; Galil, 1979;
Apostolico & Giancarlo, 1986). Our algorithm also can be improved to run in linear time,
but this is only of theoretical interest because the constant factor is very large.

REFERENCES

Aho, A.V., & Corasick, M.J. (1975). Efficient string matching: An aid to bibliographic search. Communications
of the ACM, 18(6), 333-340.
Apostolico, A., & Giancarlo, R. (1986). The Boyer-Moore-Galil string searching strategies revisited. SIAM Journal
on Computing, 15(1), 98-105.
Boyer, R.S., & Moore, J.S. (1977). A fast string searching algorithm. Communications of the ACM, 20(10),
762-772.
Commentz-Walter, B. (1979). A string matching algorithm fast on the average. In H.A. Maurer (Ed.), Lecture
notes in computer science 71 (pp. 118-132). Springer-Verlag.
Galil, Z. (1979). On improving the worst case running time of the Boyer-Moore string matching algorithm. Communications
of the ACM, 22(9), 505-508.
Knuth, D.E., Morris, J.H., & Pratt, V.R. (1974). Fast pattern matching in strings. Technical Report CS-74-440,
Stanford University, Stanford, CA.
Knuth, D.E., Morris, J.H., & Pratt, V.R. (1977). Fast pattern matching in strings. SIAM Journal on Computing,
6(2), 323-350.
Kowalski, G., & Meltzer, A. (1984, June 20-22). New multi-term high speed text search algorithms. Paper presented
at the First International Conference on Computers and Applications, Beijing, China.
Takeda, M. (1988, October). A proof of the correctness of Uratani's string searching algorithm. Technical Report
RIFIS-TR-CS-7, Research Institute of Fundamental Information Science, Kyushu University.
Uratani, N. (1988). A fast algorithm for matching patterns. IECEJ Technical Report, SS87-27 (in Japanese).
