An Efficient Text Pattern Matching Algorithm For Retrieving Information From Desktop
Indian Journal of Science and Technology, Vol 9(43), DOI: 10.17485/ijst/2016/v9i43/95454, November 2016 ISSN (Online) : 0974-5645
Abstract
Objectives: To retrieve the information after analyzing the contents of the documents which are stored in the desktop
by applying string matching algorithms. Methods/Statistical Analysis: To analyze the content of the documents, the
various pattern matching algorithms are used to find all the occurrences of a limited set of patterns within an input text
or input document. In order to perform this task, this research work used four existing string matching algorithms; they
are Brute Force algorithm, Knuth-Morris-Pratt algorithm (KMP), Boyer Moore algorithm and Rabin Karp algorithm. This
work also proposes three new string matching algorithms. They are Enhanced Boyer Moore algorithm, Enhanced Rabin
Karp algorithm and Enhanced Knuth-Morris-Pratt algorithm. Findings: For experimentation, this work has used two
types of documents, i.e. .txt and .docx. Performance measures used are search time, number of iterations and accuracy.
From the experimental results, it is observed that the enhanced KMP algorithm gives better accuracy compared to the other
string matching algorithms. Application/Improvements: Normally, these algorithms are used in the fields of text mining,
document classification, content analysis and plagiarism detection. In future, these algorithms have to be enhanced further to
improve their performance, and various other types of documents will be used for experimentation.
Keywords: Brute Force, Boyer Moore, Information Retrieval, Knuth-Morris-Pratt, Pattern Matching, Rabin Karp
window pattern matching, prefix matching, suffix matching and longest common subsequence algorithms6.
This paper is structured as follows. Section II presents the methodology of this research work. Results and discussion are given in Section III, and Section IV gives the conclusion.

In1, string matching algorithms that perform character comparison efficiently are presented; they are therefore used for DNA searching, protein sequence searching and English text searching. Connection is a file system search tool that combines traditional content-based search with context information collected from user activity. By tracing file system calls, Connection can identify sequential relationships between files and use them to extend and reorder conventional content search results. This tool improved both average recall and average precision over an advanced content-only search system. String searching algorithms play a major role in detecting patterns in text.

In7, a new Enhanced Checking and Skipping Algorithm (ECSA) was introduced. The new algorithm enhances traditional string searching algorithms by converting character comparison into character access, using a condition-type character access instead of a number comparison, and by starting the comparison at the latest mismatch from the previous check, which in turn raises the probability of finding a mismatch earlier, if there is any. The results show that the enhanced algorithm performs better than other existing algorithms.

In8, Boyer and Moore presented their algorithm in 1977. At that time, it was considered the most efficient string searching algorithm. The algorithm performs character comparisons in reverse order, from right to left, and does not require the complete pattern to be searched in case of a mismatch. It uses two shifting rules to shift the pattern to the right when a match or a mismatch occurs. The time and space complexity of the preprocessing phase is O(m + |Σ|) and the worst case running time of the searching phase is O(nm + |Σ|). The best case of the Boyer-Moore algorithm is O(n/m).

In9, several string matching algorithms are described, and the space and time complexities of those algorithms are reported. The authors assessed the performance of the algorithms and verified them on biological sequences. The functional and structural relationship of a biological sequence is calculated from relationships on that particular sequence.

In10, a method is discussed for developing general computing applications that are flexible and adjustable for users. In this view, however, Information Retrieval (IR) is frequently defined in terms of locating and delivering particular documents to a user to satisfy their information needs. In most cases, morphological variants of words have related semantic interpretations and can be treated as equivalent for the purpose of IR applications. The Context-Aware Stemming (CAS) algorithm is proposed, which is an improved version of the widely used Porter's stemmer. Because only meaningful stemmed words are generated as the stemmer output, the results illustrate that the proposed algorithm considerably reduces the error rate of Porter's algorithm from 76.7% to 6.7% without compromising the efficiency of Porter's algorithm.

In11, it is observed that proper classification of text documents entails information retrieval, machine learning and Natural Language Processing (NLP) techniques. The aim is to focus on important approaches to automatic text classification based on machine learning techniques, viz. supervised, unsupervised and semi-supervised. The authors present a review of several text classification approaches under the machine learning paradigm.

In12, the general methodology of an Intrusion Detection System (IDS) is described in terms of model compatibility: it detects the destruction happening on the network using particular models and orders. To perform this task, the spontaneous behavior of the network is measured by modeling, and in the next step it is used as a draft model for identifying unusual behavior. The study aims to determine and select the most efficient procedure for this task by investigating, applying and gathering all kinds of model compatibility techniques, so that the best result is achieved in matching known attacks with the original models.

2. Methodology
The main goal of this research work is to retrieve information from the desktop by analyzing the contents of documents using string matching algorithms. To perform this task, this research work uses four existing string matching algorithms: the Brute Force algorithm, the Knuth-Morris-Pratt (KMP) algorithm, the Boyer Moore algorithm and the Rabin Karp algorithm. This work also proposes three new string matching algorithms: the Enhanced Boyer Moore algorithm, the Enhanced Rabin Karp algorithm and the Enhanced Knuth-Morris-Pratt algorithm. The performance factors used are the time taken to search for the pattern, the number of iterations required and the accuracy, for single word search, multiple words search and a file search. Figure 1 represents the architecture of this research work.

2 Vol 9 (43) | November 2016 | www.indjst.org Indian Journal of Science and Technology
R. Janani and S. Vijayarani

In the Brute Force algorithm there is no preprocessing stage and it needs only constant extra space. The main advantage of this algorithm is that it is very easy to implement, but it is very slow compared to other algorithms2. The time complexity of this algorithm is O(mn) and the expected number of character comparisons is 2n.
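The brute force scan described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `brute_force_search` and the comparison counter (a stand-in for the "number of iterations" measure) are assumptions introduced here.

```python
def brute_force_search(text, pattern):
    """Return (match_positions, comparisons) for a naive left-to-right scan.

    Hypothetical helper for illustration; the comparison counter is an
    assumed proxy for an iteration count, not the paper's exact measure.
    """
    n, m = len(text), len(pattern)
    positions = []
    comparisons = 0
    if m == 0 or m > n:
        return positions, comparisons
    # No preprocessing: try every alignment of the pattern against the text.
    for start in range(n - m + 1):
        j = 0
        while j < m:
            comparisons += 1
            if text[start + j] != pattern[j]:
                break  # mismatch, move the window one position to the right
            j += 1
        if j == m:
            positions.append(start)
    return positions, comparisons


positions, comparisons = brute_force_search("abracadabra", "abra")
print(positions)  # [0, 7]
```

Only a few integer variables are kept besides the output list, which matches the constant extra space noted above; the nested loops are what give the O(mn) worst case.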
it returns the value; otherwise the next substring value is matched to calculate the string of length m.

Algorithm 2: Boyer Moore Horspool

Algorithm 3: Rabin-Karp

1. The prefix function, Π: The prefix function Π for a pattern summarizes the knowledge of how the pattern matches against shifts of itself. This information can be used to avoid useless shifts of the pattern “p”. In other words, it avoids backtracking on the string “S”.
2. The KMP matcher: With string “S”, pattern “p” and prefix function “Π” as inputs, the occurrence of “p” in “S” is found and the algorithm returns the number of shifts of “p” after which the occurrence is found.
3. Running-time analysis: The running time for computing the prefix function is Θ(m) and the running time of the matching function is Θ(n).

Algorithm 4: Knuth–Morris–Pratt
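The prefix function and the matcher described in the list above can be sketched in a few lines. This is the textbook form of Knuth-Morris-Pratt, given under the assumption that it corresponds to the missing algorithm listing; the names `prefix_function` and `kmp_search` are chosen here for illustration.

```python
def prefix_function(pattern):
    """pi[i] = length of the longest proper prefix of pattern[:i + 1]
    that is also a suffix of it. Runs in time linear in len(pattern)."""
    pi = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        # Fall back to shorter borders until the next character fits.
        while k > 0 and pattern[i] != pattern[k]:
            k = pi[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        pi[i] = k
    return pi


def kmp_search(text, pattern):
    """Return every shift at which pattern occurs in text."""
    if not pattern:
        return []
    pi = prefix_function(pattern)
    shifts = []
    k = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        # The text index i never moves backwards: no backtracking on the text.
        while k > 0 and ch != pattern[k]:
            k = pi[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            shifts.append(i - len(pattern) + 1)
            k = pi[k - 1]  # keep going to find overlapping occurrences
    return shifts


print(prefix_function("ababaca"))    # [0, 0, 1, 2, 3, 0, 1]
print(kmp_search("abababa", "aba"))  # [0, 2, 4]
```

Because the text position only advances, the matching loop is linear in the text length, which is the Θ(n) bound stated in item 3.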
Algorithm 7: Enhanced Rabin Karp

string matching problems. This algorithm checks the characters of the pattern from right to left6. On a mismatch or a complete match of the full pattern, it uses two functions to shift the window from left to right: the good suffix shift and the bad character shift. The searching phase of the algorithm has O(nm) time complexity and the best performance is O(n/m)18.
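The right-to-left comparison and the bad character shift can be illustrated with the Horspool simplification of Boyer Moore. Note the swap: this sketch uses only a bad character table and omits the good suffix shift, so it is not the full two-rule algorithm described above. The name `horspool_search` is an assumption made for this example.

```python
def horspool_search(text, pattern):
    """Return all start positions of pattern in text (Horspool variant)."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    # Bad character table: for each character in pattern[:-1], the distance
    # from its last occurrence to the end of the pattern. Later occurrences
    # overwrite earlier ones, so the smallest safe shift is kept.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    positions = []
    start = 0
    while start <= n - m:
        j = m - 1
        # Compare from the rightmost pattern character towards the left.
        while j >= 0 and text[start + j] == pattern[j]:
            j -= 1
        if j < 0:
            positions.append(start)
        # Shift by the table entry of the text character under the
        # pattern's last position; unseen characters allow a full shift of m.
        start += shift.get(text[start + m - 1], m)
    return positions


print(horspool_search("abracadabra", "abra"))  # [0, 7]
```

When the text character under the last pattern position does not occur in the pattern, the window jumps by the full pattern length m, which is where the O(n/m) best case comes from.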
First, calculate the state transition table S from the pattern P; the pattern may be a single line, multiple lines or a file. Then set the pointer value and state values. If the pointer value is smaller than the pattern and text values, read the characters from right to left, beginning with the rightmost one. If a match occurs, return the index of the character; otherwise shift the pointer value, and repeat the same process at each level until the pattern is found or not found.

Algorithm 6: Enhanced Boyer Moore Horspool

4.3 Enhanced Knuth-Morris-Pratt Algorithm
The Knuth-Morris-Pratt algorithm is one of the efficient string matching algorithms. It searches for occurrences of a pattern p within a main text t by using the observation that, when a mismatch occurs, the word itself contains sufficient information to determine where the next match can begin, thus avoiding re-examination of previously matched characters.
The enhanced KMP algorithm uses a bit table to discover the mismatch of the pattern in an input text. This algorithm performs the comparison from left to right. It uses the bit table for the comparison: if a match is found, it returns the index in the text; otherwise it checks the next bit.
Figure 2 gives the performance analysis of the Brute Force Algorithm for the text file (a1.txt).

Table 3. Performance analysis of Brute Force Algorithm for docx files (a1.docx)
Input           Time (ms)  Number of Iterations  Relevancy (%)
Single Word     0          3                     100
Multiple Words  19         20                    100
File            53         51                    91

Figure 3 gives the performance analysis of the Brute Force Algorithm for the docx file (a1.docx).
Table 4 compares performance measures such as time, number of iterations and relevancy of the Boyer Moore algorithm and the Enhanced Boyer Moore algorithm for the text file (a1.txt). From the analysis, the Enhanced Boyer Moore algorithm gives better results than the existing algorithm.
Table 4. Performance analysis of Boyer Moore Algorithm and Enhanced Boyer Moore Algorithm for text files (a1.txt)
                Boyer Moore Algorithm                           Enhanced Boyer Moore Algorithm
Input           Time (ms)  Number of Iterations  Relevancy (%)  Time (ms)  Number of Iterations  Relevancy (%)
Single Word     0          3                     100            0          2                     100
Multiple Words  15         16                    100            12         15                    99
File            36         38                    92             32         25                    95

Table 5. Performance analysis of Boyer Moore Algorithm and Enhanced Boyer Moore Algorithm for docx files (a1.docx)
                Boyer Moore Algorithm                           Enhanced Boyer Moore Algorithm
Input           Time (ms)  Number of Iterations  Relevancy (%)  Time (ms)  Number of Iterations  Relevancy (%)
Single Word     0          4                     99             0          3                     100
Multiple Words  18         20                    97             17         15                    99
File            50         38                    90             45         35                    99
Table 6. Performance analysis of Rabin Karp algorithm and Enhanced Rabin Karp Algorithm for text files (a1.txt)
                Rabin Karp Algorithm                            Enhanced Rabin Karp Algorithm
Input           Time (ms)  Number of Iterations  Relevancy (%)  Time (ms)  Number of Iterations  Relevancy (%)
Single Word     0          2                     100            0          1                     100
Multiple Words  22         16                    100            20         12                    100
File            45         24                    95             30         23                    96

Table 7. Performance analysis of Rabin Karp algorithm and Enhanced Rabin Karp Algorithm for docx files (a1.docx)
                Rabin Karp Algorithm                            Enhanced Rabin Karp Algorithm
Input           Time (ms)  Number of Iterations  Relevancy (%)  Time (ms)  Number of Iterations  Relevancy (%)
Single Word     0          3                     100            0          2                     100
Multiple Words  31         19                    100            26         17                    100
File            46         20                    90             31         18                    97
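To make the comparison in Table 7 easier to read, the relative reduction in time and iterations can be computed directly from the reported values. The sketch below is only arithmetic on the table's numbers; the helper name `relative_reduction` and the choice to report a percentage reduction are assumptions of this example, not measures used in the paper.

```python
def relative_reduction(baseline, enhanced):
    """Percentage by which `enhanced` is lower than `baseline`.

    Returns 0.0 when the baseline is 0 (e.g. the 0 ms single-word rows),
    since no relative change can be computed there.
    """
    if baseline == 0:
        return 0.0
    return 100.0 * (baseline - enhanced) / baseline


# Values copied from Table 7 (a1.docx): (Rabin Karp, Enhanced Rabin Karp).
table7 = {
    "Multiple Words": {"time_ms": (31, 26), "iterations": (19, 17)},
    "File": {"time_ms": (46, 31), "iterations": (20, 18)},
}

for row, measures in table7.items():
    for name, (base, enh) in measures.items():
        print(f"{row:15s} {name:10s} {relative_reduction(base, enh):5.1f}% lower")
```

For the File row this gives a reduction of about one third in search time (46 ms to 31 ms) and 10% in iterations (20 to 18).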
Table 8. Performance analysis of Knuth-Morris-Pratt Algorithm and Enhanced Knuth-Morris-Pratt Algorithm for text files (a1.txt)

Table 9. Performance analysis of Knuth-Morris-Pratt Algorithm and Enhanced Knuth-Morris-Pratt Algorithm for docx files (a1.docx)
                Knuth-Morris-Pratt Algorithm                    Enhanced Knuth-Morris-Pratt Algorithm
Input           Time (ms)  Number of Iterations  Relevancy (%)  Time (ms)  Number of Iterations  Relevancy (%)
Single Word     0          3                     100            0          2                     100
Multiple Words  18         11                    100            15         9                     100
File            39         18                    100            30         12                    100
analysis, the Enhanced Knuth-Morris-Pratt algorithm performs well when compared to the existing algorithm.
Table 9 shows the performance measures of the Knuth-Morris-Pratt Algorithm and the Enhanced Knuth-Morris-Pratt Algorithm for docx files (a1.docx). From the experimental results, the Enhanced Knuth-Morris-Pratt algorithm performs well when compared to the existing algorithm.

Figure 9. Performance accuracy for Knuth-Morris-Pratt Algorithm.
Figure 10. Sample output of Knuth-Morris-Pratt Algorithm for text file (a1.txt).
Figure 11 shows the sample output of the enhanced Knuth-Morris-Pratt algorithm for the docx file (a1.docx).

Figure 11. Sample output of Knuth-Morris-Pratt Algorithm for docx file (a1.docx).

6. Conclusion
Information retrieval (IR) is used to identify the relevant documents in a large document collection that match a user’s query. The main goal of an information retrieval system is to discover the significant information that satisfies the user's information needs. Desktop search is where the information sources are the files stored on a personal computer, including email and web pages, based on content analysis. To analyze the content, various pattern matching algorithms are used to find all the occurrences of a limited set of patterns inside an input text or input document. A string matching algorithm matches the pattern exactly or approximately within the input document. String matching algorithms play a vital role in the field of information retrieval. These algorithms are widely used to retrieve information from the desktop and to find one or all occurrences of a pattern in a large document collection.
This research work analyzes the performance measures of existing and enhanced string matching algorithms. The performance factors are time, number of iterations and accuracy for a single line, multiple lines and a file. From the analysis, among the existing algorithms the KMP algorithm gives the best accuracy for all the inputs. Among the enhanced algorithms, the enhanced KMP algorithm gives the best accuracy. Between the existing and enhanced KMP algorithms, the enhanced KMP algorithm gives the better accuracy.

7. References
1. Verma A, Kaur I, Singh I. Comparative analysis of data mining tools and techniques for information retrieval. Indian Journal of Science and Technology. 2016 Mar; 9(11):1–16.
2. Al-Mazroi A, Rashid NA. A Fast Hybrid Algorithm for the Exact String Matching Problem. American Journal of Engineering and Applied Sciences. 2011; 4(1):102–07.
3. Shweta C, Dharmadhikari D, Ingle M, Kulkarni P. Empirical Studies on Machine Learning Based Text Classification Algorithms. Advanced Computing: An International Journal (ACIJ). 2011; 2(6):161–69.
4. Bist AS. Pattern matching algorithms for computer virus detection. International Journal of Engineering Sciences and Research Technology. 2013; 2(1):28–9.
5. Naser MAS, Rashid NA, FaizAboalmaaly M. Quick-Skip search hybrid algorithm for the exact string matching problem. International Journal of Computer Theory and Engineering. 2012; 4(2):1–7.
6. Jony AI. Analysis of Multiple String Pattern Matching Algorithm. International Journal of Advanced Computer Science and Information Technology (IJACSIT). 2014; 3(4):344–53.
7. Moh’dMhashi M, Alwakeel M. New Enhanced Exact String, Searching Algorithm. IJCSNS International Journal of Computer Science and Network Security. 2010; 10(1):1–10.
8. Boyer RS, Moore JS. A fast string searching algorithm. Communications of the ACM. 1977; 20(10):762–72.
9. Pandiselvam P, Marimuthu T, Lawrance R. A Comparative Study on String Matching Algorithms of Biological Sequences, Springer Berlin Heidelberg. 2009; 510–17.
10. Charras C, Lecroq T, Daniel J. A Very fast string searching algorithm for small alphabets and long patterns, Combinatorial Pattern Matching, 9th Annual Symposium, CPM 98 Piscataway, New Jersey, USA. 2005; 1448:54–8.
11. Robert S, Boyer B, Moore JS. A fast string Searching Algorithm. Communications of the ACM. 1997; 20(10):762–72.
12. Hossein G, Shokoufeh S, Abozar S. A Survey of Pattern Matching Algorithm in Intrusion Detection System Tehran, Iran. Indian Journal of Science and Technology. 2016 Jun; 9(21):1–7.
13. Rahul M, Diwate B, Satish J, Alaspurkar A. Study of Different Algorithms for Pattern Matching. International Journal of Advanced Research in Computer Science and Software Engineering. 2013; 3(3):1–8.
14. Bhandari J, Kumar A. String Matching Rules Used By Variants of Boyer-Moore Algorithm. Journal of Global Research in Computer Science. 2014; 5(1).
15. Shivaji SK, Prabhudeva S. Plagiarism Detection by using Karp-Rabin and String Matching Algorithm Together. International Journal of Computer Applications. 2015; 116(23):1–5.
16. Wahlstrom S. Evaluation of String Searching Algorithms, Italy. 2004; 1–22.
17. Gope AP, Behera RN. A Novel Pattern Matching Algorithm in Genome Sequence Analysis. (IJCSIT) International Journal of Computer Science and Information Technologies. 2014; 5(4):5450–57.
18. Jony AI. Analysis of Multiple String Pattern Matching Algorithms. International Journal of Advanced Computer Science and Information Technology (IJACSIT). 2014; 3(4):344–53.
19. Harini R, Chandrasekar C. Efficient Sequential Pattern Matching Algorithm for Classified Brain Image. Indian Journal of Science and Technology. 2015 Jul; 8(14):1–10.