Programming Techniques

G. Manacher, S.L. Graham, Editors
A Fast String Searching Algorithm

Robert S. Boyer
Stanford Research Institute

J Strother Moore
Xerox Palo Alto Research Center
An algorithm is presented that searches for the location, "i," of the first occurrence of a character string, "pat," in another string, "string." During the search operation, the characters of pat are matched starting with the last character of pat. The information gained by starting the match at the end of the pattern often allows the algorithm to proceed in large jumps through the text being searched. Thus the algorithm has the unusual property that, in most cases, not all of the first i characters of string are inspected. The number of characters actually inspected (on the average) decreases as a function of the length of pat. For a random English pattern of length 5, the algorithm will typically inspect i/4 characters of string before finding a match at i. Furthermore, the algorithm has been implemented so that (on the average) fewer than i + patlen machine instructions are executed. These conclusions are supported with empirical evidence and a theoretical analysis of the average behavior of the algorithm. The worst case behavior of the algorithm is linear in i + patlen, assuming the availability of array space for tables linear in patlen plus the size of the alphabet.

Key Words and Phrases: bibliographic search, computational complexity, information retrieval, linear time bound, pattern matching, text editing

CR Categories: 3.74, 4.40, 5.25
Copyright 1977, Association for Computing Machinery, Inc. General permission to republish, but not for profit, all or part of this material is granted provided that ACM's copyright notice is given and that reference is made to the publication, to its date of issue, and to the fact that reprinting privileges were granted by permission of the Association for Computing Machinery.

Authors' present addresses: R.S. Boyer, Computer Science Laboratory, Stanford Research Institute, Menlo Park, CA 94025. This work was partially supported by ONR Contract N00014-75-C-0816. J S. Moore was in the Computer Science Laboratory, Xerox Palo Alto Research Center, Palo Alto, CA 94304, when this work was done. His current address is Computer Science Laboratory, SRI International, Menlo Park, CA 94025.
1. Introduction
Suppose that pat is a string of length patlen and we wish to find the position i of the leftmost character in the first occurrence of pat in some string string:

pat:    AT-THAT
string: ... WHICH-FINALLY-HALTS.--AT-THAT-POINT ...

The obvious search algorithm considers each character position of string and determines whether the successive patlen characters of string starting at that position match the successive patlen characters of pat. Knuth, Morris, and Pratt [4] have observed that this algorithm is quadratic. That is, in the worst case, the number of comparisons is on the order of i * patlen.¹

Knuth, Morris, and Pratt have described a linear search algorithm which preprocesses pat in time linear in patlen and then searches string in time linear in i + patlen. In particular, their algorithm inspects each of the first i + patlen - 1 characters of string precisely once.
We now present a search algorithm which is usually "sublinear": It may not inspect each of the first i + patlen - 1 characters of string. By "usually sublinear" we mean that the expected value of the number of inspected characters in string is c * (i + patlen), where c < 1 and gets smaller as patlen increases. There are patterns and strings for which worse behavior is exhibited. However, Knuth, in [5], has shown that the algorithm is linear even in the worst case.

The actual number of characters inspected depends on statistical properties of the characters in pat and string. However, since the number of characters inspected on the average decreases as patlen increases, our algorithm actually speeds up on longer patterns. Furthermore, the algorithm is sublinear in another sense: It has been implemented so that on the average it requires the execution of fewer than i + patlen machine instructions per search.
The organization of this paper is as follows: In the next two sections we give an informal description of the algorithm and show an example of how it works. We then define the algorithm precisely and discuss its efficient implementation. After this discussion we present the results of a thorough test of a particular machine code implementation of our algorithm. We compare these results to similar results for the Knuth, Morris, and Pratt algorithm and the simple search algorithm. Following this empirical evidence is a theoretical analysis which accurately predicts the performance measured. Next we describe some situations in which it may not be advantageous to use our algorithm. We conclude with a discussion of the history of our algorithm.

¹ The quadratic nature of this algorithm appears when initial substrings of pat occur often in string. Because this is a relatively rare phenomenon in string searches over English text, this simple algorithm is practically linear in i + patlen and therefore acceptable for most applications.
2. Informal Description
The basic idea behind the algorithm is that more information is gained by matching the pattern from the right than from the left. Imagine that pat is placed on top of the left-hand end of string so that the first characters of the two strings are aligned. Consider what we learn if we fetch the patlenth character, char, of string. This is the character which is aligned with the last character of pat.

Observation 1. If char is known not to occur in pat, then we know we need not consider the possibility of an occurrence of pat starting at string positions 1, 2, . . . , or patlen: Such an occurrence would require that char be a character of pat.
Observation 2. More generally, if the last (rightmost) occurrence of char in pat is delta1 characters from the right end of pat, then we know we can slide pat down delta1 positions without checking for matches. The reason is that if we were to move pat by less than delta1, the occurrence of char in string would be aligned with some character it could not possibly match: Such a match would require an occurrence of char in pat to the right of the rightmost.

Therefore unless char matches the last character of pat we can move past delta1 characters of string without looking at the characters skipped; delta1 is a function of the character char obtained from string. If char does not occur in pat, delta1 is patlen. If char does occur in pat, delta1 is the difference between patlen and the position of the rightmost occurrence of char in pat.
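For instance, if pat is AT-THAT (so patlen is 7), then delta1("T") = 0, delta1("A") = 1, delta1("H") = 2, delta1("-") = 4, and delta1(char) = 7 for any char that does not occur in pat.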
Now suppose that char matches the last character of pat. Then we must determine whether the previous character in string matches the second from the last character in pat. If so, we continue backing up until we have matched all of pat (and thus have succeeded in finding a match), or else we come to a mismatch at some new char after matching the last m characters of pat.
In this latter case, we wish to shift pat down to consider the next plausible juxtaposition. Of course, we would like to shift it as far down as possible.

Observation 3(a). We can use the same reasoning described above, based on the mismatched character char and delta1, to slide pat down k so as to align the two known occurrences of char. Then we will want to inspect the character of string aligned with the last character of pat. Thus we will actually shift our attention down string by k + m. The distance k we should slide pat depends on where char occurs in pat. If the rightmost occurrence of char in pat is to the right of the mismatched character (i.e. within that part of pat we have already passed) we would have to move pat backwards to align the two known occurrences of char. We would not want to do this. In this case we say that delta1 is "worthless" and slide pat forward by k = 1 (which is always sound). This shifts our attention down string by 1 + m. If the rightmost occurrence of char in pat is to the left of the mismatch, we can slide forward by k = delta1(char) - m to align the two occurrences of char. This shifts our attention down string by delta1(char) - m + m = delta1(char).
However, it is possible that we can do better than this.

Observation 3(b). We know that the next m characters of string match the final m characters of pat. Let this substring of pat be subpat. We also know that this occurrence of subpat in string is preceded by a character (char) which is different from the character preceding the terminal occurrence of subpat in pat. Roughly speaking, we can generalize the kind of reasoning used above and slide pat down by some amount so that the discovered occurrence of subpat in string is aligned with the rightmost occurrence of subpat in pat which is not preceded by the character preceding its terminal occurrence in pat. We call such a reoccurrence of subpat in pat a "plausible reoccurrence." The reason we said "roughly speaking" above is that we must allow for the rightmost plausible reoccurrence of subpat to "fall off" the left end of pat. This is made precise later.
Therefore, according to Observation 3(b), if we have matched the last m characters of pat before finding a mismatch, we can move pat down by k characters, where k is based on the position in pat of the rightmost plausible reoccurrence of the terminal substring of pat having m characters. After sliding down by k, we want to inspect the character of string aligned with the last character of pat. Thus we actually shift our attention down string by k + m characters. We call this distance delta2, and we define delta2 as a function of the position j in pat at which the mismatch occurred. k is just the distance between the terminal occurrence of subpat and its rightmost plausible reoccurrence and is always greater than or equal to 1. m is just patlen - j.

In the case where we have matched the final m characters of pat before failing, we clearly wish to shift our attention down string by 1 + m or delta1(char) or delta2(j), according to whichever allows the largest shift. From the definition of delta2 as k + m where k is always greater than or equal to 1, it is clear that delta2 is at least as large as 1 + m. Therefore we can shift our attention down string by the maximum of just the two deltas. This rule also applies when m = 0 (i.e. when we have not yet matched any characters of pat), because in that case j = patlen and delta2(j) ≥ 1.
3. Example
In the following example we use an "↑" under string to indicate the current char. When this "pointer" is pushed to the right, imagine that it drags the right end of pat with it (i.e. imagine pat has a hook on its right end). When the pointer is moved to the left, keep pat fixed with respect to string.
pat:        AT-THAT
string: ... WHICH-FINALLY-HALTS.--AT-THAT-POINT ...
                  ↑
Since "F" is known not to occur in pat, we can appeal
to Observation 1 and move the pointer (and thus pat)
down by 7:
pat:               AT-THAT
string: ... WHICH-FINALLY-HALTS.--AT-THAT-POINT ...
                         ↑
Appealing to Observation 2, we can move the pointer
down 4 to align the two hyphens:
pat:                   AT-THAT
string: ... WHICH-FINALLY-HALTS.--AT-THAT-POINT ...
                             ↑
Now char matches its opposite in pat. Therefore we
step left by one:
pat:                   AT-THAT
string: ... WHICH-FINALLY-HALTS.--AT-THAT-POINT ...
                            ↑
Appealing to Observation 3(a), we can move the pointer to the right by 7 positions because "L" does not occur in pat.² Note that this only moves pat to the right by 6.
pat:                         AT-THAT
string: ... WHICH-FINALLY-HALTS.--AT-THAT-POINT ...
                                   ↑
Again char matches the last character of pat. Stepping
to the left we see that the previous character in string
also matches its opposite in pat. Stepping to the left a
second time produces:
pat:                         AT-THAT
string: ... WHICH-FINALLY-HALTS.--AT-THAT-POINT ...
                                 ↑
Noting that we have a mismatch, we appeal to Observation 3(b). The delta2 move is best since it allows us to push the pointer to the right by 7 so as to align the discovered substring "AT" with the beginning of pat.³
pat:                              AT-THAT
string: ... WHICH-FINALLY-HALTS.--AT-THAT-POINT ...
                                        ↑
This time we discover that each character of pat matches the corresponding character in string so we have found the pattern. Note that we made only 14 references to string. Seven of these were required to confirm the final match. The other seven allowed us to move past the first 22 characters of string.
² Note that delta2 would allow us to move the pointer to the right only 4 positions in order to align the discovered substring "T" in string with its second from last occurrence at the beginning of the word "THAT" in pat.

³ The delta1 move only allows the pointer to be pushed to the right by 4 to align the hyphens.
4. The Algorithm
We now specify the algorithm. The notation pat(j) refers to the jth character in pat (counting from 1 on the left).

We assume the existence of two tables, delta1 and delta2. The first has as many entries as there are characters in the alphabet. The entry for some character char will be denoted by delta1(char). The second table has as many entries as there are character positions in the pattern. The jth entry will be denoted by delta2(j). Both tables contain non-negative integers.

The tables are initialized by preprocessing pat, and their entries correspond to the values delta1 and delta2 referred to earlier. We will specify their precise contents after it is clear how they are to be used.
Our search algorithm may be specified as follows:

stringlen ← length of string.
i ← patlen.
top: if i > stringlen then return false.
     j ← patlen.
loop: if j = 0 then return i + 1.
      if string(i) = pat(j)
        then
          j ← j - 1.
          i ← i - 1.
          goto loop.
        close;
      i ← i + max(delta1(string(i)), delta2(j)).
      goto top.
If the above algorithm returns false, then pat does not occur in string. If the algorithm returns a number, then it is the position of the left end of the first occurrence of pat in string.
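For readers who wish to experiment with the algorithm, the following Python sketch is a direct transcription of the search loop above into 0-indexed form. The helper names build_delta1 and build_delta2 (sketched below) are ours, not part of the original implementation.

def search(pat, string, delta1, delta2):
    # A sketch of the algorithm above in Python, using 0-indexed strings.
    # delta1 is a dict from character to skip distance; delta2 is a list
    # such that delta2[j] corresponds to the paper's delta2(j + 1).
    patlen = len(pat)
    stringlen = len(string)
    i = patlen - 1                 # string index under pat's last character
    while i < stringlen:
        j = patlen - 1             # pat index, scanning right to left
        while j >= 0 and string[i] == pat[j]:
            j -= 1
            i -= 1
        if j < 0:
            return i + 1           # left end of the first occurrence
        i += max(delta1.get(string[i], patlen), delta2[j])
    return None                    # pat does not occur (the paper returns false)

For example, search("AT-THAT", text, build_delta1("AT-THAT"), build_delta2("AT-THAT")) returns the 0-indexed position of the first occurrence, or None.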
The delta1 table has an entry for each character char in the alphabet. The definition of delta1 is:

delta1(char) = if char does not occur in pat, then patlen; else patlen - j, where j is the maximum integer such that pat(j) = char.
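A sketch of the corresponding table construction in Python (0-indexed, so the paper's patlen - j becomes patlen - 1 - j; characters not in pat are handled by the caller's default of patlen):

def build_delta1(pat):
    # Later occurrences overwrite earlier ones, so each character ends up
    # mapped to the distance from its rightmost occurrence to the end of pat.
    patlen = len(pat)
    return {ch: patlen - 1 - j for j, ch in enumerate(pat)}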
The delta2 table has one entry for each of the integers from 1 to patlen. Roughly speaking, delta2(j) is (a) the distance we can slide pat down so as to align the discovered occurrence (in string) of the last patlen - j characters of pat with its rightmost plausible reoccurrence, plus (b) the additional distance we must slide the "pointer" down so as to restart the process at the right end of pat. To define delta2 precisely we must define the rightmost plausible reoccurrence of a terminal substring of pat. To this end let us make the following conventions: Let $ be a character that does not occur in pat and let us say that if i is less than 1 then pat(i) is $. Let us also say that two sequences of characters [c1 . . . cn] and [d1 . . . dn] "unify" if for all i from 1 to n either ci = di or ci = $ or di = $.
Finally, we define the position of the rightmost plausible reoccurrence of the terminal substring which starts at position j + 1, rpr(j), for j from 1 to patlen, to be the greatest k less than or equal to patlen such that [pat(j + 1) . . . pat(patlen)] and [pat(k) . . . pat(k + patlen - j - 1)] unify and either k ≤ 1 or pat(k - 1) ≠ pat(j).⁴ (That is, the position of the rightmost plausible reoccurrence of the substring subpat, which starts at j + 1, is the rightmost place where subpat occurs in pat and is not preceded by the character pat(j) which precedes its terminal occurrence, with suitable allowances for either the reoccurrence or the preceding character to fall beyond the left end of pat. Note that rpr(j) may be negative because of these allowances.)
Thus the distance we must slide pat to align the discovered substring which starts at j + 1 with its rightmost plausible reoccurrence is j + 1 - rpr(j). The distance we must move to get back to the end of pat is just patlen - j. delta2(j) is just the sum of these two. Thus we define delta2 as follows:

delta2(j) = patlen + 1 - rpr(j).

To make this definition clear, consider the following two examples:

j:          1  2  3  4  5  6  7  8  9
pat:        A  B  C  X  X  X  A  B  C
delta2(j): 14 13 12 11 10  9 11 10  1

j:          1  2  3  4  5  6  7  8  9
pat:        A  B  Y  X  C  D  E  Y  X
delta2(j): 17 16 15 14 13 12  7 10  1
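The following Python sketch computes delta2 directly from the definitions of unify and rpr above. It is deliberately naive (quadratic in patlen, whereas Knuth [5] gives a linear-time construction), and the helpers use 1-based indices to match the text; our reading is that a reoccurrence may fall off the left end of pat but not extend past its right end.

def build_delta2(pat):
    patlen = len(pat)

    def unify(k, j, m):
        # Do [pat(k) ... pat(k + m - 1)] and [pat(j+1) ... pat(patlen)] unify?
        # pat(i) is the wildcard $ for i < 1; positions beyond the right end
        # of pat cannot take part in a reoccurrence.
        for d in range(m):
            i = k + d
            if i > patlen:
                return False
            if i >= 1 and pat[i - 1] != pat[j + d]:
                return False
        return True

    def rpr(j):
        # Greatest k <= patlen whose window unifies with the terminal substring
        # and is not preceded by pat(j); may be negative ("falls off" the left).
        m = patlen - j
        for k in range(patlen, j - patlen, -1):
            if unify(k, j, m) and (k <= 1 or pat[k - 2] != pat[j - 1]):
                return k

    return [patlen + 1 - rpr(j) for j in range(1, patlen + 1)]

Running it on the examples reproduces the tables just given; e.g. build_delta2("ABCXXXABC") returns [14, 13, 12, 11, 10, 9, 11, 10, 1].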
5. Implementation Considerations
The most frequently executed part of the algorithm is the code that embodies Observations 1 and 2. The following version of our algorithm is equivalent to the original version provided that delta0 is a table containing the same entries as delta1 except that delta0(pat(patlen)) is set to an integer large which is greater than stringlen + patlen (while delta1(pat(patlen)) is always 0).
stringlen ← length of string.
i ← patlen.
if i > stringlen then return false.
fast: i ← i + delta0(string(i)).
      if i ≤ stringlen then goto fast.
undo: if i ≤ large then return false.
      i ← (i - large) - 1.
      j ← patlen - 1.
slow: if j = 0 then return i + 1.
      if string(i) = pat(j)
        then
          j ← j - 1.
          i ← i - 1.
          goto slow.
        close;
      i ← i + max(delta1(string(i)), delta2(j)).
      goto fast.
⁴ Note that when j = patlen, the two sequences [pat(patlen + 1) . . . pat(patlen)] and [pat(k) . . . pat(k - 1)] are empty and therefore unify. Thus, rpr(patlen) is simply the greatest k less than or equal to patlen such that k ≤ 1 or pat(k - 1) ≠ pat(patlen).
Of course we do not actually have two versions of delta1. Instead we use only delta0, and in place of delta1 in the max expression we merely use the delta0 entry unless it is large (in which case we use 0).

Note that the fast loop just scans down string, effectively looking for the last character pat(patlen) in pat, skipping according to delta1. (delta2 can be ignored in this case since no terminal substring has yet been matched, i.e. delta2(patlen) is always less than or equal to the corresponding delta1.) Control leaves this loop only when i exceeds stringlen. The test at undo decides whether this situation arose because all of string has been scanned or because pat(patlen) was hit (which caused i to be incremented by large). If the first case obtains, pat does not occur in string and the algorithm returns false. If the second case obtains, then i is restored (by subtracting large) and we enter the slow loop which backs up checking for matches. When a mismatch is found we skip ahead by the maximum of the original delta1 and delta2 and reenter the fast loop.
We est i mat e t hat 80 per cent of t he t i me spent in
searchi ng is spent in t he f as t l oop.
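In Python the same fast/slow structure can be sketched as follows (an illustration, not the PDP-10 code; for simplicity the slow loop rechecks the character that stopped the fast loop, costing one extra comparison per stop, and the build_delta1 and build_delta2 helpers are the sketches given earlier):

def search_fast(pat, string):
    patlen, stringlen = len(pat), len(string)
    delta1 = build_delta1(pat)
    delta2 = build_delta2(pat)
    large = stringlen + patlen + 1          # any integer > stringlen + patlen
    delta0 = dict(delta1)
    delta0[pat[-1]] = large                 # stop the fast loop on pat's last character
    i = patlen - 1
    while True:
        # fast loop: skip through string until pat's last character is hit
        while i < stringlen:
            d = delta0.get(string[i], patlen)
            if d == large:
                break
            i += d
        else:
            return None                     # string exhausted: no occurrence
        # slow loop: back up through string checking for a match
        j = patlen - 1
        while j >= 0 and string[i] == pat[j]:
            i -= 1
            j -= 1
        if j < 0:
            return i + 1
        # mismatch: skip by the max of the original delta1 and delta2
        i += max(delta1.get(string[i], patlen), delta2[j])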
The fast loop can be coded in four machine instructions:

fast: char ← string(i).
      i ← i + delta0(char).
      skip the next instruction if i > stringlen.
      goto fast.
undo: . . .
We have implemented this algorithm in PDP-10 assembly language. In our implementation we have reduced the number of instructions in the fast loop to three by translating i down by stringlen; we can then test i against 0 and conditionally jump to fast in one instruction.

On a byte addressable machine it is easy to implement "char ← string(i)" and "i ← i + delta0(char)" in one instruction each. Since our implementation was in PDP-10 assembly language we had to employ byte pointers to access characters in string. The PDP-10 instruction set provides an instruction for incrementing a byte pointer by one but not by other amounts. Our code therefore employs an array of 200 indexing byte pointers which we use to access characters in string in one indexed instruction (after computing the index) at the cost of a small (five-instruction) overhead every 200 characters. It should be noted that this trick only makes up for the lack of direct byte addressing; one can expect our algorithm to run somewhat faster on a byte-addressable machine.
6. Empirical Evidence
We have exhaustively tested the above PDP-10 implementation on random test data. To gather the test patterns we wrote a program which randomly selects a substring of a given length from a source string. We used this program to select 300 patterns of
length patlen, for each patlen from 1 to 14. We then used our algorithm to search for each of the test patterns in its source string, starting each search in a random position somewhere in the first half of the source string. All of the characters for both the patterns and the strings were in primary memory (rather than a secondary storage medium such as a disk).

We measured the cost of each search in two ways: the number of references made to string and the total number of machine instructions that actually got executed (ignoring the preprocessing to set up the two tables).

By dividing the number of references to string by the number of characters i - 1 passed before the pattern was found (or string was exhausted), we obtained the number of references to string per character passed. This measure is independent of the particular implementation of the algorithm. By dividing the number of instructions executed by i - 1, we obtained the average number of instructions spent on each character passed. This measure depends upon the implementation, but we feel that it is meaningful since the implementation is a straightforward encoding of the algorithm as described in the last section.

We then averaged these measures across all 300 samples for each pattern length.

Because the performance of the algorithm depends upon the statistical properties of pat and string (and hence upon the properties of the source string from which the test patterns were obtained), we performed this experiment for three different kinds of source strings, each of length 10,000. The first source string consisted of a random sequence of 0's and 1's. The second source string was a piece of English text obtained from an online manual. The third source string was a random sequence of characters from a 100-character alphabet.

[Fig. 1. Empirical cost: references to string per character passed, plotted against pattern length for the three source strings.]

In Figure 1 the average number of references to string per character in string passed is plotted against the pattern length for each of three source strings. Note that the number of references to string per character passed is less than 1. For example, for an English pattern of length 5, the algorithm typically inspects 0.24 characters for every character passed. That is, for every reference to string the algorithm passes about 4 characters, or, equivalently, the algorithm inspects only about a quarter of the characters it passes when searching for a pattern of length 5 in an English text string. Furthermore, the number of references per character drops as the patterns get longer. This evidence supports the conclusion that the algorithm is "sublinear" in the number of references to string.

For comparison, it should be noted that the Knuth, Morris, and Pratt algorithm references string precisely 1 time per character passed. The simple search algorithm references string about 1.1 times per character passed (determined empirically with the English sample above).
[Fig. 2. Empirical cost: instructions executed per character passed, plotted against pattern length (data labels: 3.56, 0.473, 0.266).]
In Figure 2 the average number of instructions executed per character passed is plotted against the pattern length. The most obvious feature to note is that the search speeds up as the patterns get longer. That is, the total number of instructions executed in order to pass over a character decreases as the length of the pattern increases.

Figure 2 also exhibits a second interesting feature of our implementation of the algorithm: For sufficiently
large alphabets and sufficiently long patterns the algorithm executes fewer than 1 instruction per character passed. For example, in the English sample, less than 1 instruction per character is executed for patterns of length 5 or more. Thus this implementation is "sublinear" in the sense that it executes fewer than i + patlen instructions before finding the pattern at i. This means that no algorithm which references each character it passes could possibly be faster than ours in these cases (assuming it takes at least one instruction to reference each character).
The best alternative algorithm for finding a single substring is that of Knuth, Morris, and Pratt. If that algorithm is implemented in the extraordinarily efficient way described in [4, pp. 11-12] and [2, Item 179],⁵ then the cost of looking at a character can be expected to be at least 3 - p instructions, where p is the probability that a character just fetched from string is equal to a given character of pat. Hence a horizontal line at 3 - p instructions/character represents the best (and, practically, the worst) the Knuth, Morris, and Pratt algorithm can achieve.

The simple string searching algorithm (when coded with a 3-instruction fast loop⁶) executes about 3.3 instructions per character (determined empirically on the English sample above).
As noted, the preprocessing time for our algorithm (and for Knuth, Morris, and Pratt) has been ignored. The cost of this preprocessing can be made linear in patlen (this is discussed further in the next section) and is trivial compared to a reasonably long search. We made no attempt to code this preprocessing efficiently. However, the average cost (in our implementation) ranges from 160 instructions (for strings of length 1) to about 500 instructions (for strings of length 14). It should be explained that our code uses a block transfer instruction to clear the 128-word delta1 table at the beginning of the preprocessing, and we have counted this single instruction as though it were 128 instructions. This accounts for the unexpectedly large instruction count for preprocessing a one-character pattern.
7. Theoretical Analysis
The preprocessing for delta1 requires an array the size of the alphabet. Our implementation first initializes all entries of this array to patlen and then sets up delta1 in a linear scan through the pattern. Thus our preprocessing for delta1 is linear in patlen plus the size of the alphabet.

⁵ This implementation automatically compiles pat into a machine code program which implicitly has the skip table built in and which is executed to perform the search itself. In [2] they compile code which uses the PDP-10 capability of fetching a character and incrementing a byte address in one instruction. This compiled code executes at least two or three instructions per character fetched from string, depending on the outcome of a comparison of the character to one from pat.

⁶ This loop avoids checking whether string is exhausted by assuming that the first character of pat occurs at the end of string. This can be arranged ahead of time. The loop actually uses the same three instruction codes used by the above-referenced implementation of the Knuth, Morris, and Pratt algorithm.
At a slight loss of efficiency in the search speed one could eliminate the initialization of the delta1 array by storing with each entry a key indicating the number of times the algorithm has previously been called. This approach still requires initializing the array the first time the algorithm is used.

To implement our algorithm for extremely large alphabets, one might implement the delta1 table as a hash array. In the worst case, accessing delta1 during the search itself could require order patlen instructions, significantly impairing the speed of the algorithm. Hence the algorithm as it stands almost certainly does not run in time linear in i + patlen for infinite alphabets.

Knuth, in analyzing the algorithm, has shown that it still runs in linear time when delta1 is omitted, and this result holds for infinite alphabets. Doing this, however, will drastically degrade the performance of the algorithm on the average. In [5] Knuth exhibits an algorithm for setting up delta2 in time linear in patlen.
From the preceding empirical evidence, the reader can conclude that the algorithm is quite good in the average case. However, the question of its behavior in the worst case is nontrivial. Knuth has recently shed some light on this question. In [5] he proves that the execution of the algorithm (after preprocessing) is linear in i + patlen, assuming the availability of array space linear in patlen plus the size of the alphabet. In particular, he shows that in order to discover that pat does not occur in the first i characters of string, at most 6 * i characters from string are matched with characters in pat. He goes on to say that the constant 6 is probably much too large, and invites the reader to improve the theorem. His proof reveals that the linearity of the algorithm is entirely due to delta2.

We now analyze the average behavior of the algorithm by presenting a probabilistic model of its performance. As will become clear, the results of this analysis will support the empirical conclusions that the algorithm is usually "sublinear" both in the number of references to string and the number of instructions executed (for our implementation).
The analysis below is based on the following simplifying assumption: Each character of pat and string is an independent random variable. The probability that a character from pat or string is equal to a given character of the alphabet is p.

Imagine that we have just moved pat down string to a new position and that this position does not yield a match. We want to know the expected value of the ratio between the cost of discovering the mismatch and the distance we get to slide pat down upon finding the mismatch. If we define the cost to be the total number of references made to string before discovering the mismatch, we can obtain the expected value of the average number of references to string per character passed. If we define the cost to be the total number of machine instructions executed in discovering the mismatch, we can obtain the expected value of the number of instructions executed per character passed.

In the following we say "only the last m characters of pat match" to mean "the last m characters of pat match the corresponding m characters in string but the (m + 1)-th character from the right end of pat fails to match the corresponding character in string."
The expected value of the ratio of cost to characters passed is given by:

        sum(m = 0 to patlen - 1) cost(m) * prob(m)
  ----------------------------------------------------------------
  sum(m = 0 to patlen - 1) prob(m) * (sum(k) skip(m, k) * k)
where cost(m) is the cost associated with discovering that only the last m characters of pat match; prob(m) is the probability that only the last m characters of pat match; and skip(m, k) is the probability that, supposing only the last m characters of pat match, we will get to slide pat down by k.

Under our assumptions, the probability that only the last m characters of pat match is:

prob(m) = p^m * (1 - p) / (1 - p^patlen).

(The denominator is due to the assumption that a mismatch exists.)
The probability that we will get to slide pat down by k is determined by analyzing how i is incremented. However, note that even though we increment i by the maximum max of the two deltas, this will actually only slide pat down by max - m, since the increment of i also includes the m necessary to shift our attention back to the end of pat. Thus when we analyze the contributions of the two deltas we speak of the amount by which they allow us to slide pat down, rather than the amount by which we increment i. Finally, recall that if the mismatched character char occurs in the already matched final m characters of pat, then delta1 is worthless and we always slide by delta2. The probability that delta1 is worthless is just (1 - (1 - p)^m). Let us call this probdelta1worthless(m).

The conditions under which delta1 will naturally let us slide forward by k can be broken down into four cases as follows: (a) delta1 will let us slide down by 1 if char is the (m + 2)-th character from the righthand end of pat (or else there are no more characters in pat) and char does not occur to the right of that position (which has probability (1 - p)^m * (if m + 1 = patlen then 1 else p)). (b) delta1 allows us to slide down k, where 1 < k < patlen - m, provided the rightmost occurrence of char in pat is m + k characters from the right end of pat (which has probability p * (1 - p)^(k+m-1)). (c) When patlen - m > 1, delta1 allows us to slide past patlen - m characters if char does not occur in pat at all (which has probability (1 - p)^(patlen-1), given that we know char is not the (m + 1)-th character from the right end of pat). Finally, (d) delta1 never allows a slide longer than patlen - m (since the maximum value of delta1 is patlen).

Thus we can define the probability probdelta1(m, k) that when only the last m characters of pat match, delta1 will allow us to move down by k as follows:
probdelta1(m, k) =
  if k = 1 then (1 - p)^m * (if m + 1 = patlen then 1 else p);
  elseif 1 < k < patlen - m then p * (1 - p)^(k+m-1);
  elseif k = patlen - m then (1 - p)^(patlen-1);
  else (i.e. k > patlen - m) 0.

(It should be noted that we will not put these formulas into closed form, but will simply evaluate them to verify the validity of our empirical evidence.)
We now perform a similar analysis for delta2; delta2 lets us slide down by k if (a) doing so sets up an alignment of the discovered occurrence of the last m characters of pat in string with a plausible reoccurrence of those m characters elsewhere in pat, and (b) no smaller move will set up such an alignment. The probability probpr(m, k) that the terminal substring of pat of length m has a plausible reoccurrence k characters to the left of its first character is:

probpr(m, k) = if m + k < patlen then (1 - p) * p^m else p^(patlen-k).
Of course, k is just the distance delta2 lets us slide provided there is no earlier reoccurrence. We can therefore define the probability probdelta2(m, k) that, when only the last m characters of pat match, delta2 will allow us to move down by k recursively as follows:

probdelta2(m, k) = probpr(m, k) * (1 - sum(n = 1 to k - 1) probdelta2(m, n)).
We slide down by the maximum allowed by the two deltas (taking adequate account of the possibility that delta1 is worthless). If the values of the deltas were independent, the probability that we would actually slide down by k would just be the sum of the products of the probabilities that one of the deltas allows a move of k while the other allows a move of less than or equal to k.

However, the two moves are not entirely independent. In particular, consider the possibility that delta1 is worthless. Then the char just fetched occurs in the last m characters of pat and does not match the (m + 1)-th. But if delta2 gives a slide of 1 it means that sliding these m characters to the left by 1 produces a match. This implies that all of the last m characters of pat are equal to the character m + 1 from the right. But this character is known not to be char. Thus char cannot occur in the last m characters of pat, violating the hypothesis that delta1 was worthless. Therefore if delta1 is worthless, the probability that delta2 specifies a skip of 1 is 0 and the probability that it specifies one of the larger skips is correspondingly increased.
This interaction between the two deltas is also felt (to a lesser extent) for the next m possible delta2's, but we ignore these (and in so doing accept that our analysis may predict slightly worse results than might be expected, since we allow some short delta2 moves when longer ones would actually occur).

The probability that delta2 will allow us to slide down by k when only the last m characters of pat match, assuming that delta1 is worthless, is:

probdelta2'(m, k) = if k = 1 then 0
  else probpr(m, k) * (1 - sum(n = 1 to k - 1) probdelta2'(m, n)).
Finally, we can define skip(m, k), the probability that we will slide down by k if only the last m characters of pat match:

skip(m, k) = if k = 1
  then probdelta1(m, 1) * probdelta2(m, 1)
  else probdelta1worthless(m) * probdelta2'(m, k)
       + sum(n = 1 to k - 1) probdelta1(m, k) * probdelta2(m, n)
       + sum(n = 1 to k - 1) probdelta1(m, n) * probdelta2(m, k)
       + probdelta1(m, k) * probdelta2(m, k).
Now let us consider the two alternative cost functions. In order to analyze the number of references to string per character passed over, cost(m) should just be m + 1, the number of references necessary to confirm that only the last m characters of pat match.

In order to analyze the number of instructions executed per character passed over, cost(m) should be the total number of instructions executed in discovering that only the last m characters of pat match. By inspection of our PDP-10 code:

cost(m) = if m = 0 then 3 else 12 + 6 * m.
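The model is easy to evaluate numerically. The Python sketch below transcribes the formulas of this section (the function and variable names are ours); under this transcription, expected_ratio(5, 0.09, lambda m: m + 1) should come out near the 0.24 references per character quoted earlier for English patterns of length 5.

def expected_ratio(patlen, p, cost):
    # Expected cost per character passed, per the model in this section.
    q = 1.0 - p

    def prob(m):                        # only the last m characters of pat match
        return p**m * q / (1.0 - p**patlen)

    def probdelta1(m, k):               # delta1 alone would slide pat down by k
        if k == 1:
            return q**m * (1.0 if m + 1 == patlen else p)
        if 1 < k < patlen - m:
            return p * q**(k + m - 1)
        if k == patlen - m:
            return q**(patlen - 1)
        return 0.0                      # delta1 never slides past patlen - m

    def probpr(m, k):                   # plausible reoccurrence k to the left
        return q * p**m if m + k < patlen else p**(patlen - k)

    # pd2[m][k] = probdelta2(m, k); pd2w is the primed variant used
    # when delta1 is worthless (no slide of 1 is possible then).
    pd2 = [[0.0] * (patlen + 1) for _ in range(patlen)]
    pd2w = [[0.0] * (patlen + 1) for _ in range(patlen)]
    for m in range(patlen):
        s = sw = 0.0
        for k in range(1, patlen + 1):
            pd2[m][k] = probpr(m, k) * (1.0 - s)
            s += pd2[m][k]
            if k > 1:
                pd2w[m][k] = probpr(m, k) * (1.0 - sw)
                sw += pd2w[m][k]

    def skip(m, k):
        if k == 1:
            return probdelta1(m, 1) * pd2[m][1]
        worthless = 1.0 - q**m          # probdelta1worthless(m)
        total = worthless * pd2w[m][k] + probdelta1(m, k) * pd2[m][k]
        for n in range(1, k):
            total += probdelta1(m, k) * pd2[m][n] + probdelta1(m, n) * pd2[m][k]
        return total

    num = sum(cost(m) * prob(m) for m in range(patlen))
    den = sum(prob(m) * sum(k * skip(m, k) for k in range(1, patlen + 1))
              for m in range(patlen))
    return num / den

Passing cost = lambda m: 3 if m == 0 else 12 + 6 * m instead gives the instruction-count model plotted in Figure 4.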
We have computed the expected value of the ratio of cost per character skipped by using the above formulas (and both definitions of cost). We did so for pattern lengths running from 1 to 14 (as in our empirical evidence) and for the values of p appropriate for the three source strings used: For a random binary string p is 0.5, for an arbitrary English string it is (approximately) 0.09, and for a random string over a 100-character alphabet it is 0.01. The value of p for English was determined using a standard frequency count for the alphabetic characters [3] and empirically determining the frequency of space, carriage return, and line feed to be 0.23, 0.03, and 0.03, respectively.⁷

In Figure 3 we have plotted the theoretical ratio of references to string per character passed over against the pattern length.
[Fig. 3. Theoretical cost: references to string per character passed, plotted against pattern length (data label: 0.24).]

[Fig. 4. Theoretical cost: instructions executed per character passed, plotted against pattern length.]

⁷ We have determined empirically that the algorithm's performance on truly random strings where p = 0.09 is virtually identical to its performance on English strings. In particular, the reference count and instruction count curves generated by such random strings are almost coincident with the English curves in Figures 1 and 2.
The most important fact to observe in Figure 3 is that the algorithm can be expected to make fewer than i + patlen references to string before finding the pattern at location i. For example, for English text strings of length 5 or greater, the algorithm may be expected to make less than (i + 5)/4 references to string. The comparable figure for the Knuth, Morris, and Pratt algorithm is of course precisely i. The figure for the intuitive search algorithm is always greater than or equal to i.

The reason the number of references per character passed decreases more slowly as patlen increases is that for longer patterns the probability is higher that the character just fetched occurs somewhere in the pattern, and therefore the distance the pattern can be moved forward is shortened.
In Figure 4 we have plotted the theoretical ratio of the number of instructions executed per character passed versus the pattern length. Again we find that our implementation of the algorithm can be expected (for sufficiently large alphabets) to execute fewer than i + patlen instructions before finding the pattern at location i. That is, our implementation is usually "sublinear" even in the number of instructions executed. The comparable figure for the Knuth, Morris, and Pratt algorithm is at best (3 - p) * (i + patlen - 1).⁸ For the simple search algorithm the expected value of the number of instructions executed per character passed is (approximately) 3.28 (for p = 0.09).
It is difficult to fully appreciate the role played by delta2. For example, if the alphabet is large and patterns are short, then computing and trying to use delta2 probably does not pay off much (because the chances are high that a given character in string does not occur anywhere in pat and one will almost always stay in the fast loop ignoring delta2).⁹ Conversely, delta2 becomes very important when the alphabet is small and the patterns are long (for now execution will frequently leave the fast loop; delta1 will in general be small because many of the characters in the alphabet will occur in pat and only the terminal substring observations could cause large shifts). Despite the fact that it is difficult to appreciate the role of delta2, it should be noted that the linearity result for the worst case behavior of the algorithm is due entirely to the presence of delta2.
Comparing the empirical evidence (Figures 1 and 2) with the theoretical evidence (Figures 3 and 4, respectively), we note that the model is completely accurate for English and the 100-character alphabet. The model predicts much better behavior than we actually experience in the binary case. Our only explanation is that since delta2 predominates in the binary alphabet and sets up alignments of the pattern and the string, the algorithm backs up over longer terminal substrings of the pattern before finding mismatches. Our analysis ignores this phenomenon.

However, in summary, the theoretical analysis supports the conclusion that on the average the algorithm is sublinear in the number of references to string and, for sufficiently large alphabets and patterns, sublinear in the number of instructions executed (in our implementation).
8. Caveat Programmer
It should be observed that the preceding analysis has assumed that string is entirely in primary memory and that we can obtain the ith character in it in one instruction after computing its byte address. However, if string is actually on secondary storage, then the characters in it must be read in.¹⁰ This transfer will entail some time delay equivalent to the execution of, say, w instructions per character brought in, and (because of the nature of computer I/O) all of the first i + patlen - 1 characters will eventually be brought in whether we actually reference all of them or not. (A representative figure for w for paged transfers from a fast disk is 5 instructions/character.) Thus there may be a hidden cost of w instructions per character passed over.
According to the statistics presented above one might expect our algorithm to be approximately three times faster than the Knuth, Morris, and Pratt algorithm (for, say, English strings of length 6) since that algorithm executes about three instructions to our one. However, if the CPU is idle for the w instructions necessary to read each character, the actual ratios are closer to w + 3 instructions than to w + 1 instructions. Thus for paged disk transfers our algorithm can only be expected to be roughly 4/3 faster (i.e. 5 + 3 instructions to 5 + 1 instructions) if we assume that we are idle during I/O. Thus for large values of w the difference between the various algorithms diminishes if the CPU is idle during I/O.

Of course, in general, programmers (or operating systems) try to avoid the situation in which the CPU is idle while awaiting an I/O transfer by overlapping I/O with some other computation. In this situation, the chances are that our algorithm will be I/O bound (we will search a page faster than it can be brought in), and indeed so will that of Knuth, Morris, and Pratt if w > 3. Our algorithm will require that fewer CPU cycles be devoted to the search itself so that if there are other jobs to perform, there will still be an overall advantage in using the algorithm.
⁸ Although the Knuth, Morris, and Pratt algorithm will fetch each of the first i + patlen - 1 characters of string precisely once, sometimes a character is involved in several tests against characters in pat. The number of such tests (each involving three instructions) is bounded by log_φ(patlen), where φ is the golden ratio.

⁹ However, if the algorithm is implemented without delta2, recall that, in exiting the slow loop, one must now take the max of delta1 and patlen - j + 1 to allow for the possibility that delta1 is worthless.

¹⁰ We have implemented a version of our algorithm for searching through disk files. It is available as the subroutine FFILEPOS in the latest release of INTERLISP-10. This function uses the TENEX page mapping capability to identify one file page at a time with a buffer area in virtual memory. In addition to being faster than reading the page by conventional methods, this means the operating system's memory management takes care of references to pages which happen to still be in memory, etc. The algorithm is as much as 50 times faster than the standard INTERLISP-10 FILEPOS function (depending on the length of the pattern).
There are several situations in which it may not be advisable to use our algorithm. If the expected penetration i at which the pattern is found is small, the preprocessing time is significant and one might therefore consider using the obvious intuitive algorithm.

As previously noted, our algorithm can be most efficiently implemented on a byte-addressable machine. On a machine that does not allow byte addresses to be incremented and decremented directly, two possible sources of inefficiency must be addressed: The algorithm typically skips through string in steps larger than 1, and the algorithm may back up through string. Unless these processes are coded efficiently, it is probably not worthwhile to use our algorithm.

Furthermore, it should be noted that because the algorithm can back up through string, it is possible to cross a page boundary more than once. We have not found this to be a serious source of inefficiency. However, it does require a certain amount of code to handle the necessary buffering (if page I/O is being handled directly as in our FFILEPOS). One beauty of the Knuth, Morris, and Pratt algorithm is that it avoids this problem altogether.
A final situation in which it is unadvisable to use our algorithm is if the string matching problem to be solved is actually more complicated than merely finding the first occurrence of a single substring. For example, if the problem is to find the first of several possible substrings or to identify a location in string defined by a regular expression, it is much more advantageous to use an algorithm such as that of Aho and Corasick [1]. It may of course be possible to design an algorithm that searches for multiple patterns or instances of regular expressions by using the idea of starting the match at the right end of the pattern. However, we have not designed such an algorithm.
9. Historical Remarks
Our earliest formulation of the algorithm involved only delta1 and implemented Observations 1, 2, and 3(a). We were aware that we could do something along the lines of delta2 and Observation 3(b), but did not precisely formulate it. Instead, in April 1974, we coded the delta1 version of the algorithm in Interlisp, merely to test its speed. We considered coding the algorithm in PDP-10 assembly language but abandoned the idea as impractical because of the cost of incrementing byte pointers by arbitrary amounts.

We have since learned that R.W. Gosper, of Stanford University, simultaneously and independently discovered the delta1 version of the algorithm (private communication).
In April 1975, we started thinking about the implementation again and discovered a way to increment byte pointers by indexing through a table. We then formulated a version of delta2 and coded the algorithm more or less as it is presented here. This original definition of delta2 differed from the current one in the following respect: If only the last m characters of pat (call this substring subpat) were matched, delta2 specified a slide to the second from the rightmost occurrence of subpat in pat (allowing this occurrence to "fall off" the left end of pat) but without any special consideration of the character preceding this occurrence.
The average behavior of that version of the algorithm was virtually indistinguishable from that presented in this paper for large alphabets, but was somewhat worse for small alphabets. However, its worst case behavior was quadratic (i.e. required on the order of i * patlen comparisons). For example, consider searching for a pattern of the form CA(BA)^r in a string of the form ((XX)^r(AA)(BA)^r)* (e.g. r = 2, pat = "CABABA," and string = "XXXXAABABAXXXXAABABA . . ."). The original definition of delta2 allowed only a slide of 2 if the last "BA" of pat was matched before the next "A" failed to match. Of course in this situation this only sets up another mismatch at the same character in string, but the algorithm had to reinspect the previously inspected characters to discover it. The total number of references to string in passing i characters in this situation was (r + 1) * (r + 2) * i/(4r + 2), where r = (patlen - 2)/2. Thus the number of references was on the order of i * patlen.
However, on the average the algorithm was blindingly fast. To our surprise, it was several times faster than the string searching algorithm in the Tenex TECO text editor. This algorithm is reputed to be quite an efficient implementation of the simple search algorithm because it searches for the first character of pat one full word at a time (rather than one byte at a time).

In the summer of 1975, we wrote a brief paper on the algorithm and distributed it on request.
In December 1975, Ben Kuipers of the M.I.T. Artificial Intelligence Laboratory read the paper and brought to our attention the improvement to delta2 concerning the character preceding the terminal substring and its reoccurrence (private communication). Almost simultaneously, Donald Knuth of Stanford University suggested the same improvement and observed that the improved algorithm could certainly make no more than order (i + patlen) * log(patlen) references to string (private communication).

We mentioned this improvement in the next revision of the paper and suggested an additional improvement, namely the replacement of both delta1 and delta2 by a single two-dimensional table. Given the mismatched char from string and the position j in pat at which the mismatch occurred, this table indicated the distance to the last occurrence (if any) of the substring [char, pat(j + 1), . . . , pat(patlen)] in pat. The revised paper concluded with the question of whether this improvement or a similar one produced an algorithm
which was at worst linear and on the average "sublinear."
In January 1976, Knuth [5] proved that the simpler
improvement in fact produces linear behavior, even in
the worst case. We therefore revised the paper again
and gave delta2 its current definition.
In April 1976, R.W. Floyd of Stanford University
discovered a serious statistical fallacy in the first version
of our formula giving the expected value of the ratio
of cost to characters passed. He provided us (private
communication) with the current version of this for-
mula.
Thomas Standish, of the University of California at Irvine, has suggested (private communication) that the implementation of the algorithm can be improved by fetching larger bytes in the fast loop (i.e. bytes containing several characters) and using a hash array to encode the extended delta1 table. Provided the difficulties at the boundaries of the pattern are handled efficiently, this could improve the behavior of the algorithm enormously since it exponentially increases the effective size of the alphabet and reduces the frequency of common characters.
Acknowledgments. We would like to thank B. Kuipers, of the M.I.T. Artificial Intelligence Laboratory, for his suggestion concerning delta2 and D. Knuth, of Stanford University, for his analysis of the improved algorithm. We are grateful to the anonymous reviewer for Communications who suggested the inclusion of evidence comparing our algorithm with that of Knuth, Morris, and Pratt, and for the warnings contained in Section 8. B. Mont-Reynaud, of the Stanford Research Institute, and L. Guibas, of Xerox Palo Alto Research Center, proofread drafts of this paper and suggested several clarifications. We would also like to thank E. Taft and E. Fiala of Xerox Palo Alto Research Center for their advice regarding machine coding the algorithm.
Received June 1975; revised April 1976
References
1. Aho, A.V., and Corasick, M.J. Fast pattern matching: An aid to bibliographic search. Comm. ACM 18, 6 (June 1975), 333-340.
2. Beeler, M., Gosper, R.W., and Schroeppel, R. Hakmem. Memo No. 239, M.I.T. Artificial Intelligence Lab., M.I.T., Cambridge, Mass., Feb. 29, 1972.
3. Dewey, G. Relativ Frequency of English Speech Sounds. Harvard U. Press, Cambridge, Mass., 1923, p. 185.
4. Knuth, D.E., Morris, J.H., and Pratt, V.R. Fast pattern matching in strings. TR CS-74-440, Stanford U., Stanford, Calif., 1974.
5. Knuth, D.E., Morris, J.H., and Pratt, V.R. Fast pattern matching in strings. (To appear in SIAM J. Comput.)