
Intern. J. Computer Math., 2002, Vol. 79(8), pp. 867–888

ON-LINE APPROXIMATE STRING SEARCHING ALGORITHMS: SURVEY AND EXPERIMENTAL RESULTS

P. D. MICHAILIDIS and K. G. MARGARITIS*

Parallel and Distributed Processing Laboratory, Department of Applied Informatics, University of Macedonia, 156 Egnatia Str., P.O. Box 1591, 54006, Thessaloniki, Greece

(Received 9 March 2000)

The problem of approximate string searching comprises two classes of problems: string searching with k mismatches
and string searching with k differences. In this paper we present a short survey and experimental results for well
known sequential approximate string searching algorithms. We consider algorithms based on different approaches
including dynamic programming, deterministic finite automata, filtering, counting and bit parallelism. We
compare these algorithms in terms of running time against pattern length and for several values of k for four
different kinds of text: binary alphabet, alphabet of size 8, English alphabet and DNA alphabet. Finally, we
compare the experimental results of the algorithms with their theoretical complexities.

Keywords: String searching; Hamming distance; Edit distance; String searching with k mismatches; String searching
with k differences

C.R. Categories: F.2.2; H.3.3; I.5.4

1. INTRODUCTION

String searching is a very important component of many problems, including text processing,
information retrieval, data base operations, library systems, compilers, command interpreters,
DNA processing, signal processing, error correction, speech and pattern recognition and sev-
eral other fields [HD80, Aoe94, Ste94, Nav98]. The basic string searching problem can be
defined as follows: given an alphabet Σ (a finite set of characters), a short pattern string P = P[1]P[2]...P[m] of length m and a large text string T = T[1]T[2]...T[n] of length n, where both the pattern and the text are sequences of characters from Σ with m ≤ n, the string searching problem consists of finding one, or more generally all, exact occurrences of the pattern P in the text T. Surveys and experimental results for well known algorithms for this string searching problem can be found in [Aho90, CR94, Ste94, MM99a, Smi82, DB86, BY89, Leq95, MF96, MM99b].

*Corresponding author. E-mail: {panosm, kmarg}@macedonia.uom.gr; http://macedonia.uom.gr/~{panosm, kmarg}

ISSN 0020-7160 print; ISSN 1029-0265 online © 2002 Taylor & Francis Ltd
DOI: 10.1080/00207160290032981

The approximate string searching problem is a generalization of the exact string searching
problem, which involves finding substrings of a text string close to a given pattern string.
More specifically, the approximate string searching problem can be formally stated as follows: given an alphabet Σ, a short pattern string P of length m, a large text string T of length n with m ≤ n, an integer k ≥ 0 and a distance function d, the problem consists of finding all the substrings S of T such that d(P, S) ≤ k.
The distance d(P, S) between two strings P and S over an alphabet Σ is the cost of the minimum-cost sequence of operations needed to transform P into S. The cost of a sequence of operations is the sum of the costs of the individual operations. The cost of an operation is a positive real number.
In particular, in string searching applications the most interesting operations are: (a) chan-
ging one character to another single character (or a substitution), (b) deleting one character
from the given string (or a deletion), and (c) inserting a single character into the given string
(or an insertion).
There are several distance functions; two very well known functions are the Hamming dis-
tance and Levenshtein distance which are used in this paper. The Hamming distance between
two strings of equal length is defined as the number of positions with mismatching characters
in the two strings. In other words, it allows only substitutions, which cost 1. The approximate
string searching problem with d being the Hamming distance is called string searching with k
mismatches.
The Levenshtein or edit distance between two strings of not necessarily equal lengths is
the minimum number of character insertions, deletions and substitutions, each of cost 1,
required to transform the one string into the other. Algorithms for computing the edit distance
between a pair of strings are presented in [WF74, MP80, Ukk85a]. The approximate string
searching problem with d being the Levenshtein or edit distance is called string searching
with k differences (or sometimes string searching with k errors). Together the above two pro-
blems are called approximate string searching.
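As an illustration of the edit distance computation that underlies the k differences problem, the following is a minimal C sketch of the classical dynamic programming recurrence of Wagner and Fischer [WF74]. The fixed buffer size and the function name are our own assumptions for the example, not part of the algorithms surveyed here.

#include <string.h>

#define MAXLEN 128   /* assumed bound on string lengths for this sketch */

/* Edit distance by the classical recurrence: D[i][j] is the distance
   between the prefixes a[1..i] and b[1..j]; insertions, deletions and
   substitutions all cost 1. */
int edit_distance(const char *a, const char *b)
{
    int D[MAXLEN + 1][MAXLEN + 1];
    int la = strlen(a), lb = strlen(b), i, j;

    for (i = 0; i <= la; i++) D[i][0] = i;   /* delete all of a[1..i] */
    for (j = 0; j <= lb; j++) D[0][j] = j;   /* insert all of b[1..j] */

    for (i = 1; i <= la; i++)
        for (j = 1; j <= lb; j++) {
            int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            int d = D[i - 1][j - 1] + cost;               /* substitution or match */
            if (D[i - 1][j] + 1 < d) d = D[i - 1][j] + 1; /* deletion  */
            if (D[i][j - 1] + 1 < d) d = D[i][j - 1] + 1; /* insertion */
            D[i][j] = d;
        }
    return D[la][lb];
}

For example, edit_distance("survey", "surgery") returns 2 (one substitution and one insertion).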
The solutions to these two problems differ depending on whether the algorithm has to be on-line (that is, the text is not known in advance) or off-line (the text can be preprocessed). In this paper, we focus on on-line algorithms for these two problems. There are numerous algorithms for the approximate string searching problem; see for example the reviews of [GG88, Aho90, Ste94, BY96, JTU96, Nav98, MM99c]. In general, an on-line approximate string searching algorithm consists of two phases: the preprocessing phase of P and the searching phase of P in T. The preprocessing phase involves gathering information about the pattern which can be used for a fast implementation of primitive operations in the searching phase, or constructing a finite automaton that recognizes all strings at a distance at most k from the pattern. The searching phase consists of scanning the text or constructing an array in order to find all approximate occurrences of the pattern in the text. In general, the searching phase is based on one of several approaches, including dynamic programming/classical, deterministic finite automata, filtering, counting and bit-parallelism algorithms.
More specifically, for the string searching with k mismatches problem, the algorithms can be divided into four categories:
• Classical algorithms: Brute-Force algorithm, Landau–Vishkin [LV86] algorithm, Galil–Giancarlo [GG86] algorithm, and Tarhio–Ukkonen [TU93] algorithm.
• Deterministic finite automata algorithms: Partial-DFA [BYG94, Nav97b] algorithm.
• Counting algorithms: Grossi–Luccio [GL89] algorithm, BY-filter/counting [BYG94] algorithms, EM [EMC96] algorithm, Pevzner–Waterman [PW95] algorithm, and Baeza–Yates–Perleberg [BYP96] algorithm.
• Bit-parallelism algorithms: Shift-Or [BYG92] algorithm, Dermouche [Der95] algorithm, and BNDM [NR98] algorithm.

Similarly, the algorithms for the string searching with k differences problem are divided into four categories:

• Dynamic programming algorithms: Sellers [Sel80] algorithm, CUTOFF [Ukk85b] algorithm, Landau–Vishkin [LV88, LV89] algorithm, Galil–Park [GP90] algorithm, Ukkonen–Wood [UW93] algorithm and Chang–Lampe [CL92] algorithm.
• Deterministic finite automata algorithms: Ukkonen [Ukk85b] algorithm, Wu–Manber–Myers [WMM96] algorithm and Partial-DFA [BYG94, Kur96, Nav97b] algorithm.
• Filtering algorithms: Tarhio–Ukkonen [TU93] algorithm, COUNT [GL89, JTU96, Nav97a] algorithm, Maximal matches [Ukk92] algorithm, Chang–Lawler [CL94] algorithm, Takaoka [Tak94] algorithm, Sutinen–Tarhio [ST95] algorithm and the algorithm of Baeza-Yates and Perleberg with the exact partitioning technique [BYP96].
• Bit-parallelism algorithms: Wu–Manber [WM92] algorithm, Baeza–Yates–Navarro [BYN96a, BYN96b, BYN99] algorithm, BNDM [NR98] algorithm, Wright [Wri94] algorithm and Myers [Mye98, Mye99] algorithm.

It is clear that with such a variety of different approaches to the same problem it is difficult to select an appropriate algorithm for each approximate string searching problem. The theoretical analyses given in the literature are useful, but it is important that the theory is complemented with sufficiently extensive experimental comparisons.
Several experiments on the string searching with k differences problem have already been reported [JTU96]. In [JTU96], Jokinen et al. compare the practical running time of seven algorithms only for the string searching with k differences problem. More specifically, they compare two algorithms (Sellers [Sel80] and CUTOFF [Ukk85b]) based on the dynamic programming approach, two algorithms (Galil–Park [GP90] and Ukkonen–Wood [UW93]) based on the diagonal transition approach and three algorithms (Tarhio–Ukkonen [TU93], [JTU96] and Maximal matches [Ukk92]) based on the filtering approach; they do not consider the bit-parallelism algorithms or the algorithms for the k mismatches problem. In this paper we report extensive experiments on the running times of well known and recent algorithms for the k mismatches and k differences problems, respectively. Finally, we examine whether these experiments confirm the theoretical analysis of the algorithms.
This paper is organized as follows: in the next section we briefly describe the algorithms
tested both for the k-mismatches and the k-differences problem. In the third section we de-
scribe the experimental methodology, including the test environment, the types of test data and
the measures used for the comparison of the algorithms. In section four we present the results of
our experiments, in the form of performance tables and graphs. Finally, we present some con-
clusions and suggest some further research issues.

2. APPROXIMATE STRING SEARCHING ALGORITHMS

In this section we give the formal description of string searching with k mismatches and
string searching with k differences as well as the sequential algorithms tested. However,
for the details and the coding of the algorithms, the reader is referred to [MM99c] and the
original references. We start by reviewing the basic algorithms from each category for the
string searching with k mismatches problem. Finally, in all the algorithms below we suppose
that the pattern and the text are stored in arrays P[1..m] and T[1..n].

2.1 String Searching with k Mismatches


2.1.1 Problem Definition
Given an alphabet Σ of size |Σ|, a short pattern string P = P[1]P[2]...P[m] of length m and a large text string T = T[1]T[2]...T[n] of length n, where m, n > 0 and m << n, and an integer maximum number of mismatches allowed k ≥ 0, find all the text positions j such that P matches T at position j with at most k positions in which P and T have different characters. We say that there is an approximate occurrence of P at position j of T.

2.1.2 Algorithms for String Searching with k Mismatches


CLASSICAL APPROACH
The classical string searching algorithms are based on character comparisons.
The Brute-Force algorithm (in short, BF algorithm), which is the simplest, performs character comparisons between the text substring and the complete pattern from left to right, counting the number of mismatches found. If more than k mismatches have been found, the pattern is shifted exactly one position to the right. When the end of the pattern is reached, an approximate occurrence is reported. It requires no preprocessing phase. This algorithm has O(mn) worst-case time complexity.
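A minimal C sketch of the BF strategy just described is given below; the function name and the reporting format are arbitrary choices for the illustration, not the implementation used in our experiments.

#include <stdio.h>
#include <string.h>

/* Brute-force search with k mismatches: align P at every text
   position, count mismatching characters from left to right, and
   abandon the alignment as soon as more than k are found. */
void bf_mismatches(const char *P, const char *T, int k)
{
    int m = strlen(P), n = strlen(T);
    int i, j;

    for (j = 0; j + m <= n; j++) {        /* every alignment of P in T */
        int mismatches = 0;
        for (i = 0; i < m; i++)
            if (P[i] != T[j + i] && ++mismatches > k)
                break;                    /* too many: shift by one */
        if (mismatches <= k)
            printf("occurrence with %d mismatches at position %d\n",
                   mismatches, j + 1);
    }
}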
Landau and Vishkin [LV86] developed the first efficient algorithm for this problem (in short, LV). Their approach is similar to the Knuth–Morris–Pratt algorithm [KMP77] in that an array derived from preprocessing the pattern is employed as the text string is examined from left to right, and known information is exploited to reduce the number of character comparisons required. The preprocessing phase takes O(km log m) time and the searching phase takes O(kn) time. The extra space required by the algorithm is O(k(m + n)). While it is fast, the space required is unacceptable for practical purposes. In our experiments we include the improved version of the LV algorithm using a window of size O(m²) to process the text, instead of the O(mn) array suggested in the original paper [LV86]. Therefore, this algorithm decreases the extra space to O(km), which is acceptable in practice.
The next algorithm is the Tarhio–Ukkonen algorithm (in short, TU) [TU93], which is based on the Boyer–Moore–Horspool (BMH) algorithm for exact string searching [Hor80]. The preprocessing phase of the TU algorithm takes O(m + k|Σ|) time and O(k|Σ|) space. The searching phase of the TU algorithm needs O(mn) time in the worst case. However, the expected running time is O(kn(k/|Σ| + 1/(m − k))) for random strings.

COUNTING APPROACH
This approach does not use character comparisons like the classical algorithms; instead it uses arithmetical operations, i.e. it uses counters for the positions of the text.
In this category we present only the Baeza–Yates–Perleberg algorithm (in short, BYP) [BYP96], which is a very practical and simple solution to the string searching with k mismatches problem and whose performance is independent of k. This algorithm runs in O(n) worst-case time if all the characters in P are distinct and in O(n + R) worst-case time if there are identical characters in P, where R is the total number of ordered pairs of positions at which P and T match. Assuming the characters to be equiprobable, the average running time is O((1 + m/|Σ|)n), irrespective of the number of distinct pattern characters. Finally, the running time for preprocessing is O(2m + |Σ|) and the space requirement for this algorithm is O(m + |Σ|).
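To make the counting idea concrete, the sketch below keeps one counter per text alignment and credits every position at which a pattern character matches the aligned text character; alignments reaching m − k credited positions have at most k mismatches. This is a simplified illustration of the counting approach with an explicit counter array, not necessarily the exact formulation of the BYP algorithm.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Counting sketch for k mismatches: for every text character, credit
   each alignment in which it would match the corresponding pattern
   character; an alignment with at least m-k credits has at most k
   mismatches. */
void counting_mismatches(const char *P, const char *T, int k)
{
    int m = strlen(P), n = strlen(T);
    int *count = calloc(n, sizeof(int));   /* one counter per alignment */
    int i, j;

    for (j = 0; j < n; j++)
        for (i = 0; i < m; i++)
            if (P[i] == T[j] && j - i >= 0 && j - i + m <= n)
                count[j - i]++;            /* alignment starting at j-i */

    for (j = 0; j + m <= n; j++)
        if (count[j] >= m - k)
            printf("occurrence with at most %d mismatches at position %d\n",
                   k, j + 1);
    free(count);
}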

TABLE I Time and space complexities for string searching with k mismatches

Algorithm | Worst-case search time | Average-case search time | Preprocessing time | Extra space
BF        | mn                     | kn                       | –                   | 1
LV        | kn                     | kn                       | km log m            | km
TU        | mn                     | kn(k/|Σ| + 1/(m − k))    | m + k|Σ|            | k|Σ|
BYP       | n                      | (1 + m/|Σ|)n             | 2m + |Σ|            | m + |Σ|
SO        | mn log k / w           | mn log k / w             | (|Σ| + m) log k / w | |Σ| + m log k / w

BIT-PARALLELISM APPROACH
This is another technique in common use in string searching [BY92]. It was first proposed in [BYG92] for the exact string searching problem. This technique uses the intrinsic parallelism of the bit manipulations inside a computer word to perform many operations in parallel (we denote by w the number of bits in the computer word). Furthermore, the bit-parallelism approach has become a general way to simulate simple nondeterministic finite automata (NFA) instead of converting them to deterministic ones. Finally, this numerical approach provides many advantages such as simplicity, flexibility and no buffering.
From this category we include in our experiments the numerical algorithm Shift-Or (in short, SO) [BYG92], introduced by Baeza-Yates and Gonnet. This algorithm handles mismatches by essentially counting them, up to k, with counters of log2 k bits, but it does not handle the operations of the k differences problem, i.e. deletions and insertions. Moreover, the more bits are needed to represent individual states, the shorter the patterns that can be considered. In our experimental study this algorithm runs only up to m = 8 characters, since the word size is 32 bits and 4 bits are needed to represent each state.
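As background for the SO family, the following C sketch shows the exact-matching Shift-Or of Baeza-Yates and Gonnet [BYG92] on which the mismatch-counting version builds; the counting variant replaces the single bit per pattern position with a small counter. The sketch assumes that m does not exceed the machine word size and is not the code used in our experiments.

#include <stdio.h>
#include <string.h>

/* Exact-matching Shift-Or sketch: bit i of the state word is 0 when
   P[1..i+1] matches the text ending at the current position, so a
   clear bit m-1 signals an exact occurrence. */
void shift_or_search(const char *P, const char *T)
{
    int m = strlen(P), n = strlen(T);
    unsigned long B[256];          /* bit mask per alphabet character */
    unsigned long state = ~0UL;    /* all bits set: nothing matched yet */
    int i, j;

    for (i = 0; i < 256; i++)
        B[i] = ~0UL;
    for (i = 0; i < m; i++)        /* clear bit i for pattern char P[i] */
        B[(unsigned char)P[i]] &= ~(1UL << i);

    for (j = 0; j < n; j++) {
        state = (state << 1) | B[(unsigned char)T[j]];
        if ((state & (1UL << (m - 1))) == 0)
            printf("exact occurrence ending at position %d\n", j + 1);
    }
}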
The known time and space complexities of several algorithms for the Hamming distance problem are shown in Table I for both the worst case and the average case.

2.2 String Searching with k Differences


2.2.1 Problem Definition
Given an alphabet Σ of size |Σ|, a short pattern string P = P[1]P[2]...P[m] of length m and a large text string T = T[1]T[2]...T[n] of length n, where m, n > 0 and m << n, and an integer maximum number of differences allowed k ≥ 0, find all the text positions j such that the edit distance (i.e. number of differences) between P and some substring of T ending at T[j] is at most k. We say that there is an approximate occurrence of P at position j of T.

2.2.2 Algorithms for String Searching with k Differences


DYNAMIC PROGRAMMING APPROACH
The dynamic programming approach is a classical solution which has been proposed independently by many researchers, notably by Wagner and Fischer [WF74], for computing the edit distance between two strings: the distances between longer and longer prefixes of the strings are successively evaluated from previous values until the final result is obtained. Later, Sellers [Sel80] converted this classical solution into a search algorithm (in short, SEL) in order to find all approximate occurrences of P in T. This algorithm has running time O(mn) in the worst and average case. There are many results that improve the SEL algorithm and take advantage of the geometric properties of the dynamic programming array (i.e. values in neighbor cells differ by at most one) [Ukk85a] in order to compute kn instead of mn entries. For example, Ukkonen [Ukk85b] developed an algorithm called CUTOFF whose expected running time is O(kn), by computing only a part of the dynamic programming array.
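A minimal C sketch of the column-wise dynamic programming search in the spirit of the SEL algorithm is given below; it keeps a single column of the array and reports every text position where the last cell is at most k. The names and the fixed pattern bound are our own assumptions, not the implementation tested here.

#include <stdio.h>
#include <string.h>

#define MAXM 128   /* assumed maximum pattern length for this sketch */

static int min3(int a, int b, int c)
{
    int m = a < b ? a : b;
    return m < c ? m : c;
}

/* Sellers-style search: C[i] holds the edit distance between P[1..i]
   and the best substring of T ending at the current text position;
   C[0] stays 0 because an occurrence may start anywhere in T. */
void sellers_search(const char *P, const char *T, int k)
{
    int m = strlen(P), n = strlen(T);
    int C[MAXM + 1], i, j;

    for (i = 0; i <= m; i++)       /* column for the empty text prefix */
        C[i] = i;

    for (j = 1; j <= n; j++) {
        int prev_diag = C[0];      /* value of C[i-1] in the previous column */
        C[0] = 0;
        for (i = 1; i <= m; i++) {
            int prev_i = C[i];     /* old C[i] before it is overwritten */
            int cost = (P[i - 1] == T[j - 1]) ? 0 : 1;
            C[i] = min3(prev_diag + cost,   /* substitute or match  */
                        C[i - 1] + 1,       /* delete P[i]          */
                        prev_i + 1);        /* insert T[j]          */
            prev_diag = prev_i;
        }
        if (C[m] <= k)
            printf("occurrence with at most %d differences ends at %d\n",
                   k, j);
    }
}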
Subsequently, new algorithms were developed that are based on the diagonal transition approach. The basic idea of the diagonal transition algorithms is the fact that the values along the diagonals of the dynamic programming array are monotonically increasing. The algorithms are based on computing in constant time the positions where the values along the diagonals are incremented. There are four algorithms based on the diagonal transition approach: Brute Force, Landau–Vishkin [LV88, LV89], Galil–Park [GP90] and Ukkonen–Wood [UW93]. In our experiments we include the Galil–Park algorithm (in short, GP). The preprocessing phase of the GP algorithm takes O(m²) space and time and the searching phase takes O(kn) time in both the worst and the average case. Like the Landau–Vishkin algorithm, it uses reference triples that represent matching substrings of the pattern and the text.
Finally, the Chang–Lampe algorithm [CL92] (in short, CL) is a variation of dynamic programming and is also a very efficient and practical algorithm. This adaptation of the simple dynamic programming approach is based on a "column partition" approach, and has expected time O(kn/√|Σ|). The running time for preprocessing and the space requirement for this algorithm are O(m|Σ|).

DETERMINISTIC FINITE AUTOMATA APPROACH


Although this approach is rather old, it has received little attention. It is based on re-expressing the problem by means of an automaton. The basic idea is to convert the general automaton into a deterministic one and to reduce the states and the memory requirements.
Ukkonen [Ukk85b] proposed the idea of such a deterministic finite automaton (DFA). However, this algorithm has the disadvantage that a large number of automaton states may be generated. As a result, we have large time and space requirements which may limit the applicability of this algorithm.
Later, Wu et al. looked again into this problem [WMM96]. The idea was to trade some time for space using a Four Russians technique [ADKF75], giving an O(kn/log n) expected time algorithm, which is a log factor improvement over the CUTOFF O(kn) expected time algorithm, and an O(mn/log n) time algorithm in the worst case, using O(n + m|Σ|/log n) space for the universal lookup array. The running time for preprocessing is O(m|Σ|).
Finally, [Kur96] and [Nav97b] proposed another way to reduce the space requirements. It is an adaptation of [BYG94], who first proposed it for the Hamming distance. The idea was to build the automaton in lazy form, i.e. build only the states and transitions actually reached in the processing of the text. This algorithm has running time O(n + m min(t, n)), where t is the total number of transitions in the complete automaton, and it requires O(min(|Σ|, n) min(m, |Σ|)) space in the worst case. However, the average time complexity for this algorithm is O(n + mt(1 − e^(−n/t))), where t is the total number of transitions in the complete automaton. Finally, this algorithm is limited to m ≤ 10 in our experiments, because for longer patterns it requires large amounts of memory.
In our experiments we include two algorithms from this category, the Wu–Manber–Myers
algorithm (in short, WMM) and the partial DFA algorithm (in short, PDFA).

FILTERING APPROACH
This method is a much newer trend and it is currently very active. It is based on finding fast algorithms to discard large areas of the text that cannot contain a match, and applying another algorithm, based on the simple dynamic programming approach, to the rest.

First, Tarhio and Ukkonen [TU93] devised an approximate string searching algorithm (in short, TUD) that uses Boyer–Moore–Horspool techniques [BM77, Hor80] to filter the text. The preprocessing phase takes O((k + |Σ|)m) time and the space required by this algorithm is O(|Σ|m). The searching phase of the TUD algorithm takes O(mn/k) time in the worst case and O((|Σ|/(|Σ| − 2k)) kn (k/(|Σ| + 2k − 2) + 1/m)) in the average case.
Navarro [Nav97a] developed an algorithm (in short, COUNT), based on the ideas of [JTU96] and [GL89], which is a filter based on counting matching positions. In other words, the key idea is to search for substrings of the text whose distribution of characters differs from the distribution of characters in the pattern at most as much as is possible under k differences. The preprocessing phase of the COUNT algorithm takes O(|Σ| + m) time and the searching phase takes O(n) time if the number of verifications is negligible. Finally, this algorithm uses O(|Σ|) space.
Wu and Manber [WM92] proposed a simple filter which is called the pattern partition approach. This approach is based on the following fact: an occurrence with at most k differences of a pattern of length m implies that at least one substring of length r in the pattern matches a substring of the text occurrence exactly, where r = ⌊m/(k + 1)⌋. There are many ways to use this idea. Perhaps the simplest one, used in [WM92], is to search for the first k + 1 consecutive blocks of size r of the pattern P. If any of the blocks is an exact match, we try to extend the match, checking if there are at most k differences. This idea was used in conjunction with the extension of the SO algorithm [BYG92] to string matching with differences. We call the combination of the pattern partition approach with the SO algorithm MULTIWM. This algorithm has O(mn/w) time complexity. Further, in our experiments this algorithm is limited to m ≤ 31.
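The sketch below illustrates the pattern partition idea in isolation: the pattern is split into k + 1 blocks of length r and each exact block hit marks a candidate text area for verification. For simplicity it scans each block naively rather than with the multipattern Shift-Or used by MULTIWM, and the verification step (for example with a routine like sellers_search above) is only indicated; it is an illustrative sketch, not the tested code.

#include <stdio.h>
#include <string.h>

/* Pattern partition filter sketch: an occurrence of P with at most k
   differences must contain one of the k+1 blocks of length r = m/(k+1)
   exactly, so each exact block hit defines a candidate area to verify
   with a dynamic programming routine. */
void partition_filter(const char *P, const char *T, int k)
{
    int m = strlen(P), n = strlen(T);
    int r = m / (k + 1);                 /* block length */
    int b, j;

    if (r == 0)                          /* the filter assumes m >= k+1 */
        return;

    for (b = 0; b <= k; b++) {
        const char *block = P + b * r;   /* b-th block of the pattern */
        for (j = 0; j + r <= n; j++)
            if (strncmp(T + j, block, r) == 0) {
                int start = j - b * r - k;    /* candidate area start */
                if (start < 0) start = 0;
                printf("block %d hits at %d: verify the text area from %d\n",
                       b, j + 1, start + 1);
            }
    }
}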
Finally, Baeza-Yates and Perleberg [BYP96] suggested an algorithm (in short, BYPEP) which combines the pattern partition approach with traditional multiple string searching algorithms. The simplest algorithm is to build an Aho–Corasick machine [AC75] (the extension of the KMP algorithm [KMP77, MM99a] to search for multiple patterns) for the k + 1 blocks of length r. For every match found, we extend the match, checking if there are at most k differences, by using the standard dynamic programming algorithm to compute the edit distance between two strings. This algorithm with the AC machine has O(n) expected search time for k = O(m/log m), using O(m²) extra space. Moreover, the searching phase of the above algorithm can be improved by using a multiple string searching algorithm based on the Boyer–Moore approach [CW79].

BIT-PARALLELISM APPROACH
We have seen how this technique is applied to the k mismatches problem. The approach can be applied in a similar way to the k differences problem. There are two main alternatives: parallelization of the non-deterministic finite automaton (NFA) and parallelization of the dynamic programming array.
The Wu and Manber algorithm [WM92] (in short, WM) uses this approach to simulate the automaton by rows. This algorithm has a preprocessing phase which requires O(m|Σ| + k⌈m/w⌉) time. The searching phase then runs in O(kn⌈m/w⌉) time in the worst and average case, which is O(kn) for patterns typical in text searching (i.e. m ≤ w). Moreover, this algorithm requires O(m|Σ|) space. In our experiments, this algorithm is limited to m ≤ 31.
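For illustration, the following C sketch shows a row-wise bit-parallel NFA simulation in the spirit of the WM algorithm, written in Shift-And form: R[d] keeps the active states of the NFA row that allows d differences. It assumes m does not exceed the word size and k is small; it is a simplified sketch, not the agrep code or the implementation used in our experiments.

#include <stdio.h>
#include <string.h>

#define MAXK 8   /* assumed bound on k for this sketch */

void wm_search(const char *P, const char *T, int k)
{
    int m = strlen(P), n = strlen(T);
    unsigned long B[256] = {0};
    unsigned long R[MAXK + 1], newR[MAXK + 1];
    int d, i, j;

    for (i = 0; i < m; i++)                 /* match mask per character */
        B[(unsigned char)P[i]] |= 1UL << i;
    for (d = 0; d <= k; d++)                /* d leading deletions allowed */
        R[d] = (1UL << d) - 1;

    for (j = 0; j < n; j++) {
        unsigned long mask = B[(unsigned char)T[j]];
        newR[0] = ((R[0] << 1) | 1UL) & mask;             /* exact row */
        for (d = 1; d <= k; d++)
            newR[d] = ((R[d] << 1) & mask)                /* match        */
                    | R[d - 1]                            /* insertion    */
                    | (R[d - 1] << 1)                     /* substitution */
                    | (newR[d - 1] << 1)                  /* deletion     */
                    | 1UL;                                /* initial state */
        memcpy(R, newR, sizeof(R));
        if (R[k] & (1UL << (m - 1)))
            printf("occurrence with at most %d differences ends at %d\n",
                   k, j + 1);
    }
}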
Baeza-Yates and Navarro [BYN96a, BYN96b, BYN99] proposed another algorithm (in short, BYN) which parallelizes the NFA by diagonals using bits of the computer word. The preprocessing phase of the BYN algorithm takes O(|Σ| + m min(m, |Σ|)) time and it requires O(|Σ|) space. The search phase needs O(n) time in the worst and average case. This algorithm is limited to m ≤ 9 for w = 32 bits in our experiments.

TABLE II Time and space complexities for string searching with k differences

Algorithm | Worst-case search time | Average-case search time | Preprocessing time | Extra space
SEL       | mn                     | mn                                          | –                   | mn
CUTOFF    | mn                     | kn                                          | –                   | m
GP        | kn                     | kn                                          | m²                  | m²
CL        | mn                     | kn/√|Σ|                                     | m|Σ|                | m|Σ|
WMM       | mn/log n               | kn/log n                                    | m|Σ|                | n + m|Σ|/log n
PDFA      | n + m min(t, n)        | n + mt(1 − e^(−n/t))                        | –                   | min(|Σ|, n) min(m, |Σ|)
TUD       | mn/k                   | (|Σ|/(|Σ| − 2k)) kn(k/(|Σ| + 2k − 2) + 1/m) | (k + |Σ|)m          | m|Σ|
COUNT     | mn                     | n                                           | |Σ| + m             | |Σ|
MULTIWM   | mn/w                   | mn/w                                        | |Σ| + m             |
BYPEP     | –                      | n, for k = O(m/log m)                       | m                   | m²
WM        | kn⌈m/w⌉                | kn⌈m/w⌉                                     | m|Σ| + k⌈m/w⌉       | m|Σ|
BYN       | n                      | n                                           | |Σ| + m min(m, |Σ|) | |Σ|
MYE       | mn/w                   | kn/w                                        | m|Σ|                | |Σ|

Finally, Myers [Mye98, Mye99] developed an algorithm (in short, MYE) which is based on a bit-parallel simulation of the dynamic programming array. The parallelization has optimal speedup, and the time complexity is O(kn/w) on average and O(mn/w) in the worst case. The preprocessing phase of the MYE algorithm requires O(m|Σ|) time and O(|Σ|) space. In our experimental study this algorithm is limited to m ≤ 31.
The known time and space complexities of several algorithms for the edit distance problem are shown in Table II for both the worst case and the average case.
We must note that the SEL, CUTOFF, PDFA, COUNT, BYPEP and MULTIWM algorithms were developed for the string searching with k differences problem. However, they can be applied to the string searching with k mismatches problem with slight modifications. Therefore, we developed the algorithms SELM, CUTOFFM, PDFAM, COUNTM, BYPEPM and MULTIWMM for the k mismatches problem and included them in our experimental study.

3. EXPERIMENTAL METHODOLOGY

In this section we present the testing methodology used in our experiments in order to compare the relative performance of the approximate string searching algorithms. The parameters which describe the performance of the algorithms are:
(a) The text size,
(b) The pattern length,
(c) The number of allowed mismatches or differences, and
(d) The alphabet size.
It is known that none of the algorithms is optimal or best in all four cases. Therefore, the main goal of our experimental study is to explore the practical performance of the algorithms and to verify their theoretical analysis against the length of the pattern (short and long patterns) and against the number of allowed mismatches or differences (small and large values of k) under various alphabets of different sizes (or types of text), i.e. a binary alphabet, an alphabet of size 8, the English alphabet and the DNA alphabet, which have different characteristics.

3.1 Test Environment


The experiments were run on a Sun UltraSparc-1 with a 143 MHz clock and 64 MB RAM (a 32-bit machine) and a 2.1 GB local hard disk. The operating system is Solaris 2.5. During all experiments, this machine was not performing other heavy tasks (or processes). The data structures used in the testing were all in physical memory during the experiments. Finally, the algorithms presented in Section 2 have been implemented in the ANSI C programming language [KR78] in a homogeneous way, so as to keep their comparison significant, using the compiler cc.

3.2 Types of Test Data


We note that, because the performance of the approximate string searching algorithms depends upon statistical properties of the pattern and of the text string from which the test patterns were obtained, experiments were performed on four different types of text: binary alphabet, alphabet of size 8, English alphabet and DNA alphabet.

Binary alphabet: The alphabet is Σ = {0, 1}. The text consists of 150,000 characters and was randomly built. For each pattern length between 3 and 100 we search ten randomly built patterns.

Alphabet of size 8: The alphabet is Σ = {a, b, c, d, e, f, g, h}. The text consists of 150,000 characters and was randomly built. In addition, for each pattern length between 3 and 100 we search ten randomly built patterns.

English alphabet: We used an English-language document from a web page. The alphabet consists of 70 different characters. The text consists of 148,188 characters, and the ten patterns of each length from 3 to 100 characters were chosen at random from words inside the text.

DNA alphabet: The DNA alphabet consists of the four nucleotides a, c, g and t (standing for adenine, cytosine, guanine and thymine, respectively) used to encode DNA. Therefore, the alphabet is Σ = {a, c, g, t}. The text consists of 997,642 characters and we search ten patterns of each length from 10 to 100 characters. Finally, the text and the patterns are a portion of the GenBank DNA database, as distributed by Hume and Sunday [HS91].

3.3 Measures of Comparison


For the comparison of the approximate string searching algorithms we used the practical running time as the measure. The running time is the total time of calling an algorithm to search for a pattern in the text, including the preprocessing time for building the auxiliary arrays. The running time is obtained by calling the C function clock() and is measured in seconds.
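For concreteness, the timing measurement can be organized as in the following C fragment, which wraps one search call between two clock() readings; the searcher function pointer and its arguments are placeholders, not the actual interface of our test programs.

#include <time.h>

/* Timing harness sketch: the elapsed processor time of one search call,
   including any preprocessing done inside it, in seconds. */
double time_search(void (*search_algorithm)(const char *, const char *, int),
                   const char *P, const char *T, int k)
{
    clock_t start, end;

    start = clock();
    search_algorithm(P, T, k);
    end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}

For example, time_search(bf_mismatches, pattern, text, 3) would return the running time of the brute-force routine sketched in Section 2.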
Thus, we measured the running time of all the algorithms in Section 2 in order to examine the effect of the pattern length and the effect of the absolute value of k. We performed two test series:
(a) We measured the effect of the pattern length in a test series with varying m = 3, 4, 8, 10, 20, 30, 40, 60, 80, 100 and fixed k = 3. In the case of the DNA alphabet we used longer patterns because this alphabet has biological applications involving long patterns. For this reason, in this alphabet we measured the effect of the pattern length in a test series with varying m = 10, 20, 30, 40, 50, 100 and fixed k = 3, and

(b) We measured the effect of an absolute k in three test sub-series:
(b.1) with varying k = 1, 2, 4, 6, 8 and fixed m = 8, except for the DNA alphabet,
(b.2) with varying k = 1, 8, 10, 15, 19 and fixed m = 20, and
(b.3) with varying k = 1, 6, 13, 25, 40 and fixed m = 50.
Finally, to decrease random variation, the results of the algorithms are averages of ten runs with different patterns of each length.

4. EXPERIMENTAL RESULTS

In the previous sections we have briefly presented the most well known approximate string
searching algorithms and the experimental methodology of our tests. In this section, we pre-
sent the experimental results both for the string searching algorithms with k mismatches and
the string searching algorithms with k differences. The performance of the algorithms for
these two problems was measured on ten iterations over the four types of text. In particular,
the performance of each algorithm was plotted against the length of the pattern and against
the absolute k for each type of text.

4.1 Results for the String Searching Algorithms with k Mismatches


4.1.1 Performance versus the Pattern Length
Here, we report the performance and theoretical results for k = 3 and variable pattern length. Figures 1–4 show the practical running time for a binary alphabet, an alphabet of size 8, an English alphabet and a DNA alphabet, respectively. Further, Figures 5–8 show the theoretical time complexity for a binary alphabet, an alphabet of size 8, an English alphabet and a DNA alphabet, respectively. We observe that there is a general agreement between the experimental and theoretical results of the algorithms in most cases. However, the experimental results of the filtering algorithms (such as COUNTM, BYPEPM and MULTIWMM) do not follow the theoretical calculation of time complexity. We must notice that only for large alphabets, such as the English alphabet, and only for long patterns do the experimental results of the COUNTM algorithm agree with the theoretical calculations.

FIGURE 1 |Σ| = 2 and k = 3.
FIGURE 2 |Σ| = 8 and k = 3.
FIGURE 3 |Σ| = 70 and k = 3.
FIGURE 4 |Σ| = 4 and k = 3.
FIGURE 5 |Σ| = 2 and k = 3.
FIGURE 6 |Σ| = 8 and k = 3.
FIGURE 7 |Σ| = 70 and k = 3.
FIGURE 8 |Σ| = 4 and k = 3.

4.1.2 Performance versus the Values of k


Figures 9–12 show the practical running time for a binary alphabet, an alphabet of size 8, an English alphabet and a DNA alphabet, respectively, with patterns of size m = 8, m = 20 and m = 50 and all possible values of k. Figures 13–16 show the theoretical time complexity for a binary alphabet, an alphabet of size 8, an English alphabet and a DNA alphabet, respectively. We can observe that the theoretical time complexity of the filtering algorithms group is not confirmed in practice for all alphabet sizes and values of k.

FIGURE 9 |Σ| = 2 and m = 8, 20, 50.
FIGURE 10 |Σ| = 8 and m = 8, 20, 50.
FIGURE 11 |Σ| = 70 and m = 8, 20, 50.
FIGURE 12 |Σ| = 4 and m = 20, 50.
FIGURE 13 |Σ| = 2 and m = 8, 20, 50.
FIGURE 14 |Σ| = 8 and m = 8, 20, 50.
FIGURE 15 |Σ| = 70 and m = 8, 20, 50.
FIGURE 16 |Σ| = 4 and m = 20, 50.

4.2 Results for the String Searching Algorithms with k Differences


4.2.1 Performance versus the Pattern Length
Figures 17–20 show the practical running time for a binary alphabet, an alphabet of size 8, an English alphabet and a DNA alphabet, respectively. Figures 21–24 show the theoretical running time for a binary alphabet, an alphabet of size 8, an English alphabet and a DNA alphabet, respectively. We observe that the group of filtering algorithms, such as the COUNT, MULTIWM and BYPEP algorithms, in most cases does not confirm the theoretical time complexity measures. Furthermore, in our experiments the CL algorithm, which is based on the dynamic programming approach, does not agree completely with the theoretical results for all alphabet sizes. Finally, the computational behavior of the remaining algorithms generally agrees with the theory in most cases.

FIGURE 17 |Σ| = 2 and k = 3.
FIGURE 18 |Σ| = 8 and k = 3.
FIGURE 19 |Σ| = 70 and k = 3.
FIGURE 20 |Σ| = 4 and k = 3.
FIGURE 21 |Σ| = 2 and k = 3.
FIGURE 22 |Σ| = 8 and k = 3.
FIGURE 23 |Σ| = 70 and k = 3.
FIGURE 24 |Σ| = 4 and k = 3.

4.2.2 Performance versus the Values of k


Figures 25–28 show the practical running time for a binary alphabet, an alphabet of size 8, an English alphabet and a DNA alphabet, respectively, with patterns of size m = 8, m = 20 and m = 50 and all possible values of k. Figures 29–32 show the theoretical time complexity for a binary alphabet, an alphabet of size 8, an English alphabet and a DNA alphabet, respectively. We can observe that the theoretical analysis of filtering algorithms such as the COUNT, MULTIWM and BYPEP algorithms is not valid in practice in most cases. In addition, the CL and MYE algorithms do not present the expected behavior. Finally, the remaining algorithms present practical performance according to the theoretical results.

FIGURE 25 |Σ| = 2 and m = 8, 20, 50.
FIGURE 26 |Σ| = 8 and m = 8, 20, 50.
FIGURE 27 |Σ| = 70 and m = 8, 20, 50.
FIGURE 28 |Σ| = 4 and m = 20, 50.
FIGURE 29 |Σ| = 2 and m = 8, 20, 50.
FIGURE 30 |Σ| = 8 and m = 8, 20, 50.
FIGURE 31 |Σ| = 70 and m = 8, 20, 50.
FIGURE 32 |Σ| = 4 and m = 20, 50.

4.2.3 General Remarks


Based on the empirical results it can be concluded that in all cases the BF, SELM, and SEL algorithms are linear in running time. These algorithms produce relatively good running time results despite their simplicity. More specifically, the BF algorithm is the best approach, with the exception of the SO and PDFAM algorithms, when the pattern size is very small or k is close to m. In addition, the SEL algorithm is also among the fastest when m is small or k is close to m. This observation is valid for all the alphabets. It should be noted that the BF, SELM and SEL algorithms have no special memory requirements and no complex coding. However, these algorithms perform poorly for large patterns and for large values of k.
Our experiments confirm the O(kn) expected running time of the CUTOFFM, CUTOFF
and GP algorithms although they are quadratic in the theoretical worst case. In addition,
we note that our measurements do not confirm the statement of Chang and Lampe [CL92]
according to which the CL algorithm is always faster than the CUTOFF algorithm. Espe-
cially, for long patterns and small alphabets, the CL algorithm is much slower than the CUT-
OFF algorithm. However, our experiments showed that the CL algorithm is better than the
CUTOFF algorithm for large values of k. Finally, we must notice that the CUTOFFM and
CUTOFF algorithms are the best approaches for all alphabets when k is relatively large.
We have experimentally shown that the PDFAM and PDFA algorithms outperform all the
other algorithms for small values of k and small pattern lengths. In addition, the WMM algo-
rithm achieves subquadratic worst case running time and very good expected running time
for large values of k and patterns. This algorithm has the advantage that it can work for
patterns that contain a class of characters, a complement of a character or a class and
don't care symbols. However, a main drawback of these methods is that they require large
amounts of memory.
The theoretical analysis of filtering algorithms is not valid in practice in most cases. Algo-
rithms COUNTM, COUNT, BYPEPM, BYPEP, MULTIWMM and MULTIWM allow the
text to be scanned in linear expected time. Here, we must observe that the BYPEPM,
BYPEP, MULTIWMM and MULTIWM algorithms have slightly better performance than
the COUNTM and COUNT algorithms only for small values of k. On the other hand, for large values of k their running time is higher than that of the rest of the algorithms. The COUNTM and COUNT algorithms are simple and fast in practice for small values of k. Further, they are fast for long patterns and large alphabets (i.e. the alphabet of size 8 and the English alphabet). However, they are not faster than the best sublinear filters because they inspect all text characters.
According to our experimental study, we observe that the BYP algorithm achieves O(n) worst
case time independent of k and without restrictions on m. It can be easily adapted to find the
"best match" (smallest k). This is desirable in many cases where a bound on k is not known a
priori. Also, the BYP algorithm has been used for two dimensional text searching [BYR93].
Finally, the running time of the TU and TUD algorithms decreases as the pattern length and the alphabet size increase. This fact supports the theoretical evidence that the TU and TUD algorithms are sublinear in average running time. Therefore, those algorithms perform well on average for small k and for large patterns and alphabets.
We experimentally demonstrated that all bit-parallelism algorithms, with the exception of the WM algorithm, are among the fastest for typical text searching. More specifically, for small patterns the SO, BYN and MYE algorithms scan the text in linear time, regardless of the value of k. It can also be seen that they are the fastest for small patterns and medium values of k for all the alphabets. The WM algorithm is also linear according to our experiments, but it is a less efficient scheme nowadays. Those algorithms are fairly simple to implement and are also very flexible. Additionally, those algorithms can be applied to cases where the pattern may contain a class of characters and don't care symbols. We have not studied other cases, such as very long patterns, because the bit-parallelism algorithms do not perform as well as the other algorithms there.

5. CONCLUSIONS

We have presented the results of an extensive set of experiments on the most well known approximate string searching algorithms, based on the dynamic programming, deterministic finite automata, filtering and bit-parallelism approaches. We now report the general conclusions regarding the algorithms and their testing procedures.
As a general conclusion we can say that testing the algorithms on four different types of text (binary alphabet, alphabet of size 8, English alphabet and DNA alphabet) indicates that varying parameters such as the pattern length, the value of k and the alphabet size can produce slightly different algorithm performances. Therefore, our experimental study showed that none of the algorithms, for either the k mismatches or the k differences problem, is the best for all values of the problem parameters.
Finally, we discuss a number of directions for future research arising from this paper. First, we will present data parallel algorithms for the three distributed problems related to exact and approximate string searching using simple sequential algorithms: we search for all the occurrences of a pattern in a text, where the pattern and/or the text can be either single strings or multiple ones. In other words, we can easily extend the simple sequential algorithms which were described in Section 2 for the classical Single Pattern versus Single Text (SPST) problem to data parallel algorithms for the three other problems: the Single Pattern versus Multiple Text (SPMT) problem, the Multiple Pattern versus Single Text (MPST) problem and finally the Multiple Pattern versus Multiple Text (MPMT) problem. Second, we will report an extensive experimental study of data parallel algorithms for the three distributed string searching problems using the Message Passing Interface (MPI).

References
[AC75] Aho, A. and Corasick, M. (1975) Efficient string matching: An aid to bibliographic search,
Communications of the ACM 18(6), 333–340.
[ADKF75] Arlazarov, V.L., Dinic, E.A., Kronrod, M.A. and Faradzev, I.A. (1975) On economic construction of the
transitive closure of a directed graph, Dokl. Akad. Nauk SSSR, Vol. 194, pp. 487–488, 1970 (in Russian).
English translation in Soviet Math. Dokl., Vol. 11, pp. 1209–1210.
[Aho90] Aho, A.V. (1990) Algorithms for finding patterns in strings. In: van Leeuwen, J. (ed.) Handbook of Theoretical
Computer Science (Elsevier Science Publishers, Amsterdam), Chapter 5, pp. 255–300.
[Aoe94] Aoe, J. (1994) Computer Algorithms — String Pattern Matching Strategies (IEEE Computer Society
Press, Los Alamitos, California).
[BM77] Boyer, R.S. and Moore, J.S. (1977) A fast string searching algorithm, Communications of the ACM 20(10),
762–772.
[BY89] Baeza-Yates, R.A. (1989) Algorithms for string searching: A survey, ACM SIGIR Forum 23(3–4), 34–58.
[BY92] Baeza-Yates, R.A. (1992) Text retrieval: Theory and practice. In: Proc. of the 12th IFIP World Computer
Congress, North-Holland, Madrid, Spain, pp. 465–476.
[BY96] Baeza-Yates, R.A. (1996) A unified view of string matching algorithms. In: Proc. SOFSEM 96: Theory
and Practice of Informatics, Lecture Notes in Computer Science, No. 1175 (Springer-Verlag, Berlin), pp.
1–15.
[BYG92] Baeza-Yates, R.A. and Gonnet, G.H. (1992) A new approach to text searching, Communications of the
ACM 35(10), 74–82.
[BYG94] Baeza Yates, R.A. and Gonnet, G. (1994) Fast string matching with mismatches, Information and
Computation 108(2), 187–199.
[BYN96a] Baeza-Yates, R.A. and Navarro, G. (1996) A fast heuristic for approximate string matching. In: Proc. of
the 3rd South American Workshop on String Processing, (Carleton University Press), pp. 47–63.
[BYN96b] Baeza-Yates, R.A. and Navarro, G. (1996) A faster algorithm for approximate string matching. In: Proc.
of the 7th Annual Symposium on Combinatorial Pattern Matching, No. 1075 (Springer-Verlag, Berlin),
pp. 1–23.
[BYN99] Baeza-Yates, R.A. and Navarro, G. (1999) A faster algorithm for approximate string matching,
Algorithmica 23(2), 127–158.
[BYP96] Baeza-Yates, R.A. and Perleberg, C.H. (1996) Fast and practical approximate string matching,
Information Processing Letters 59(1), 21–27.
[BYR93] Baeza-Yates, R.A. and Regnier, M. (1993) Fast two dimensional pattern matching, Information
Processing Letters 45(1), 51–57.
[CL92] Chang, W.I. and Lampe, J. (1992) Theoretical and Empirical Comparisons of approximate string
matching algorithms. In: Proc. of the 3rd Annual Symposium on Combinatorial Pattern Matching, No.
664 (Springer-Verlag, Berlin), pp. 175–184.
[CL94] Chang, W. and Lawler, E. (1994) Sublinear approximate string matching and biological applications,
Algorithmica 12(4/5), 327–344
[CR94] Crochemore, M. and Rytter, W. (1994) Text algorithms, (Oxford University Press).
[CW79] Commentz-Walter, B. (1979) A string matching algorithm fast on the average. In: Proc. of the 6th
International Colloquium on Automata, Languages and Programming, No. 71 (Springer-Verlag, Berlin),
pp. 118–132.
[DB86] Davies, G. and Bowsher, S. (1986) Algorithms for pattern matching, Software-Practice and Experience
16(6), 575–601.
[Der95] Dermouche, A. (1995) A fast algorithm for string matching with mismatches, Information Processing
Letters 55(2), 105–110.
[EMC96] El-Mabrouk, N. and Crochemore, M. (1996) Boyer-Moore strategy to efficient approximate string
matching. In: Proc. of the 7th Annual Symposium on Combinatorial Pattern Matching, No. 1075 (Springer-
Verlag, Berlin), pp. 24–38.
[GG86] Galil, Z. and Giancarlo, R. (1986) Improved string matching with k mismatches, Sigact News 17(4), 52–
54.
[GG88] Galil, Z. and Giancarlo, R. (1988) Data structures and algorithms for approximate string matching,
Journal of Complexity 4(1), 33–72.
[GL89] Grossi, R. and Luccio, F. (1989) Simple and efficient string matching with k mismatches, Information
Processing Letters 33(3), 113–120.
[GP90] Galil, Z. and Park, K. (1990) An improved algorithm for approximate string matching, SIAM Journal of
Computing 19(6), 989–999.
[HD80] Hall, P. and Dowling, G. (1980) Approximate string matching, ACM Computing Surveys 12(4), 381–
402.
[Hor80] Horspool, N. (1980) Practical fast searching in strings, Software Practice and Experience 10(6), 501–
506.
[HS91] Hume, A. and Sunday, D. (1991) Fast string searching, Software-Practice and Experience 21(11), 1221–
1248.
[JTU96] Jokinen, P., Tarhio, J. and Ukkonen, E. (1996) A comparison of approximate string matching algorithms,
Software-Practice and Experience 26(12), 1439–1458.
[KMP77] Knuth, D.E., Morris, J.H. and Pratt, V.R. (1977) Fast pattern matching in strings, SIAM Journal on
Computing 6(1), 322–350.
[KR78] Kernighan, B. and Ritchie, D. (1978) The C Programming Language (Prentice-Hall, Englewood Cliffs,
NJ).
[Kur96] Kurtz, S. (1996) Approximate string searching under weighted edit distance. In: Proc. of the 3rd South
American Workshop on String Processing (Carleton University Press), pp. 156–170.
[Leq95] Lecroq, T. (1995) Experimental results on string matching algorithms, Software-Practice and
Experience 25(7), 727–765.
[LV86] Landau, G.M. and Vishkin, U. (1986) Efficient string matching with k mismatches, Theoretical
Computer Science 43(2–3), 239–249.
[LV88] Landau, G.M. and Vishkin, U. (1988) Fast string matching with k differences, Journal of Computer and
System Sciences 37(1), 63–78.
[LV89] Landau, G.M. and Vishkin, U. (1989) Fast parallel and serial approximate string matching, Journal of
Algorithms 10(2), 157–169.
[MF96] Manolopoulos, Y. and Faloutsos, C. (1996) Experimenting with pattern matching algorithms,
Information Sciences 90(1–4), 75–89.
[MM99a] Michailidis, P. and Margaritis, K. (1999) String Matching Algorithms, Technical Report, Dept. of
Applied Informatics, University of Macedonia.
[MM99b] Michailidis, P. and Margaritis, K. (1999) String Matching Algorithms: Survey and Experimental Results,
Technical Report, Dept. of Applied Informatics, University of Macedonia.
[MM99c] Michailidis, P. and Margaritis, K. (1999) A Survey of On-line Approximate String Matching Algorithms,
Technical Report, Dept. of Applied Informatics, University of Macedonia.
[MP80] Masek, W.J. and Paterson, M.S. (1980) A faster algorithm computing string edit distances, Journal of
Computer and System Sciences 20(1), 18–31.
[Mye98] Myers, G. (1998) A fast bit-vector algorithm for approximate pattern matching based on dynamic
programming. In: Proc. of the 9th Annual Symposium on Combinatorial Pattern Matching, No. 1448
(Springer-Verlag, Berlin), pp. 1–13.
[Mye99] Myers, G. (1999) A fast bit-vector algorithm for approximate pattern matching based on dynamic
programming, Journal of the Association for Computing Machinery 46(3), 395–415.
[Nav97a] Navarro, G. (1997) Multiple approximate string matching by counting. In: Proc. of the 4th South
American Workshop on String Processing (Carleton University Press), pp. 125–139.
[Nav97b] Navarro, G. (1997) A partial deterministic automaton for approximate string matching. In: Proc. of the
4th South American Workshop on String Processing (Carleton University Press), pp. 112–124.
[Nav98] Navarro, G. (1998) Approximate Text Searching, Ph.D. Thesis, University of Chile, Dept. of Computer
Science.
[NR98] Navarro, G. and Raffinot, M. (1998) A bit-parallel approach to suffix automata: Fast extended string
matching. In: Proc. of the 9th Annual Symposium on Combinatorial Pattern Matching, No. 1448
(Springer-Verlag, Berlin), pp. 14–33.
[PW95] Pevzner, P. and Waterman, M. (1995) Multiple filtration and approximate pattern matching,
Algorithmica 13(1–2), 135–154.
[Sel80] Sellers, P.H. (1980) The Theory and Computation of Evolutionary Distance: Pattern Recognition,
Journal of Algorithms 1(4), 359–373.
[Smi82] Smit, G. de V. (1982) A comparison of three string matching algorithms, Software-Practice and
Experience 12(1), 57–66.
[ST95] Sutinen, E. and Tarhio, J. (1995) On using q-gram locations in approximate string matching. In: Proc.
3rd Annual European Symposium, No. 979 (Springer-Verlag, Berlin), pp. 327–340.
[Ste94] Stephen, G.A. (1994) String Searching Algorithms (World Scientific Press).
[Tak94] Takaoka, T. (1994) Approximate pattern matching with samples. In: Proc. of the 5th International
Symposium on Algorithms and Computation, No. 834 (Springer-Verlag, Berlin), pp. 234–242.
[TU93] Tarhio, J. and Ukkonen, E. (1993) Approximate Boyer-Moore string matching, SIAM Journal on
Computing 22(2), 243–260.
[Ukk85a] Ukkonen, E. (1985) Algorithms for approximate string matching, Information and Control 64(1–3), 100–118.
[Ukk85b] Ukkonen, E. (1985) Finding approximate patterns in strings, Journal of Algorithms 6(1), 132–137.
[Ukk92] Ukkonen, E. (1992) Approximate string matching with q-grams and maximal matches, Theoretical
Computer Science 92(1), 191–211.
[UW93] Ukkonen, E. and Wood, D. (1993) Approximate string matching with suffix automata, Algorithmica
10(5), 353–364.
[WF74] Wagner, R.A. and Fischer, M.J. (1974) The string to string correction problem, Journal of the
Association for Computing Machinery 21(1), 168–173.
[WM92] Wu, S. and Manber, U. (1992) Fast text searching allowing errors, Communications of the ACM 35(10),
83–91.
[WMM96] Wu, S., Manber, U. and Myers, G. (1996) A subquadratic algorithm for approximate limited expression
matching, Algorithmica 15(1), 50–67.
[Wri94] Wright, A. (1994) Approximate string matching using within-word parallelism, Software-Practice and
Experience 24(4), 337–362.
