Java Web Crawler


Web-based Crawler Utilizing Multiple Searching and String Matching Algorithms


Johnathan Wiltberger
Johns Hopkins University
Whiting School of Engineering
Engineering for Professionals
Email: jwiltbe1@johnshopkins.edu
Abstract: In the field of computer science, one will invariably
encounter the Internet and the vast amount of information
held therein. To make better use of that information,
one must be able to search for and find relevant
material within web domains to further either one's
knowledge or one's objective. This document outlines a tool
for that purpose: a web crawler that offers multiple
searching algorithms as well as several string matching
algorithms. The references for this document include
journal articles and web sites that contributed to
the compilation of the crawler. A comparison is drawn
between Googlebot (the web crawler used by Google, Inc.) and
the proposed crawler. With a choice of multiple searching and
string matching algorithms, one has a more dynamic and
versatile way of optimizing searches to obtain more relevant
and profitable results.

I. INTRODUCTION
The World Wide Web currently contains at least roughly 2
billion indexed web pages [1]. This does not count
un-indexed pages, whose number is much larger. The main
method of indexing and searching all of these sites currently
is through web crawlers. A web crawler is an application that
systematically browses the web, going from link to link, and
indexing the sites it comes upon. These sites then can be stored
in memory (for smaller, single-use web crawlers) or stored in
databases for traversal later. One well-known web crawler is
Googlebot, which Google uses to crawl and index sites
for later use in its search engine [2]. There are multiple
theories about which crawling approach performs best; however, it
seems that the situation determines which crawling method is the
most appropriate and efficient.
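As a rough illustration of going from link to link while keeping the index in memory, the following sketch performs a breadth-first traversal. It is not the crawler described in this paper: the class name InMemoryCrawlSketch, the method crawl, and the in-memory map that stands in for the web are all assumptions made for the example, and no network access is performed.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: the "web" is an in-memory map from a page to the links it contains.
class InMemoryCrawlSketch {

    // Traverses pages breadth-first from the seed and returns them in discovery order.
    static Set<String> crawl(Map<String, List<String>> linkGraph, String seed) {
        Set<String> discovered = new LinkedHashSet<>(); // in-memory index of pages seen so far
        Deque<String> frontier = new ArrayDeque<>();    // pages waiting to be expanded
        discovered.add(seed);
        frontier.add(seed);
        while (!frontier.isEmpty()) {
            String page = frontier.poll();
            List<String> links = linkGraph.getOrDefault(page, Collections.<String>emptyList());
            for (String link : links) {
                if (discovered.add(link)) {             // add() returns false for known pages
                    frontier.add(link);
                }
            }
        }
        return discovered;
    }

    public static void main(String[] args) {
        // Assumed toy site; a real crawler would fetch each page over HTTP instead.
        Map<String, List<String>> site = new HashMap<>();
        site.put("/", Arrays.asList("/about", "/blog"));
        site.put("/about", Arrays.asList("/"));
        site.put("/blog", Arrays.asList("/blog/post-1", "/about"));
        System.out.println(crawl(site, "/"));
    }
}
```

The set of discovered pages doubles as the index and as the guard against revisiting a page, which is what keeps the traversal from looping on sites whose pages link back to each other.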
With many web crawlers, the process is to visit a top-level
domain, search for links within the page, and follow those
links. Meanwhile, the crawler caches these links within
a database of some sort and passes control of the database
to some searching application, potentially a search engine, to
use for searching. This approach is extremely efficient if the
goal is to build a database that will be used multiple times
and the storage space is available. However, it may be
excessive for a simple, one-time query over a specific
domain.
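The step of searching a page for links can be sketched as follows. This is only an illustration under stated assumptions: the class LinkExtractorSketch and its methods are hypothetical names, the regular expression is a rough approximation that a proper HTML parser would replace in practice, and restricting links to the host of the starting URL is an assumed way of staying inside a targeted domain.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of link discovery within a single page.
class LinkExtractorSketch {

    // Matches href="..." or href='...' inside an <a ...> tag; group 2 is the link target.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1", Pattern.CASE_INSENSITIVE);

    // Returns the raw href values in the order they appear in the page.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(2));
        }
        return links;
    }

    // Resolves each link against the page URL and keeps those on the same host.
    static List<String> sameHostLinks(String pageUrl, String html) {
        URI base = URI.create(pageUrl);
        List<String> result = new ArrayList<>();
        for (String href : extractLinks(html)) {
            try {
                URI resolved = base.resolve(href);
                if (base.getHost() != null && base.getHost().equalsIgnoreCase(resolved.getHost())) {
                    result.add(resolved.toString());
                }
            } catch (IllegalArgumentException e) {
                // Malformed href (for example, one containing spaces): skip it.
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<a href=\"/about\">About</a> <a href='http://other.example/x'>Elsewhere</a>";
        System.out.println(sameHostLinks("http://example.com/index.html", html));
    }
}
```

Links returned by sameHostLinks would then be fed back into the crawler's frontier, while off-host links are dropped rather than followed.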
The crawler discussed in this paper systematically crawls
and matches query strings based on a user's input on a
case-by-case basis. The user inputs their targeted domain, the
query they are looking for, and their choices of searching
algorithm and string matching algorithm. These inputs are
then used to develop the initial crawling strategy employed by
the crawler.
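One way such user choices could drive a crawl is sketched below. The specific algorithms are assumptions made for illustration, since this section does not name the ones the crawler offers: breadth-first and depth-first traversal stand in for the searching algorithms, and naive matching and Knuth-Morris-Pratt stand in for the string matching algorithms. The class, enum, and method names are likewise hypothetical, and pages are represented by in-memory maps rather than fetched from a domain.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class ConfigurableCrawlSketch {

    // Assumed option names; the paper does not list its algorithms in this section.
    enum SearchAlgorithm { BREADTH_FIRST, DEPTH_FIRST }
    enum MatchAlgorithm { NAIVE, KNUTH_MORRIS_PRATT }

    // Naive matching: try every alignment of the query against the text.
    static boolean naiveContains(String text, String query) {
        for (int start = 0; start + query.length() <= text.length(); start++) {
            int i = 0;
            while (i < query.length() && text.charAt(start + i) == query.charAt(i)) i++;
            if (i == query.length()) return true;
        }
        return false;
    }

    // Knuth-Morris-Pratt: a failure table avoids re-reading text characters.
    static boolean kmpContains(String text, String query) {
        if (query.isEmpty()) return true;
        int[] fail = new int[query.length()];
        for (int i = 1, k = 0; i < query.length(); i++) {
            while (k > 0 && query.charAt(i) != query.charAt(k)) k = fail[k - 1];
            if (query.charAt(i) == query.charAt(k)) k++;
            fail[i] = k;
        }
        for (int i = 0, k = 0; i < text.length(); i++) {
            while (k > 0 && text.charAt(i) != query.charAt(k)) k = fail[k - 1];
            if (text.charAt(i) == query.charAt(k)) k++;
            if (k == query.length()) return true;
        }
        return false;
    }

    // Returns, in visit order, the pages whose text contains the query.
    static List<String> search(Map<String, List<String>> links, Map<String, String> pageText,
                               String seed, String query,
                               SearchAlgorithm search, MatchAlgorithm match) {
        List<String> hits = new ArrayList<>();
        Set<String> visited = new HashSet<>();
        Deque<String> frontier = new ArrayDeque<>();
        frontier.add(seed);
        while (!frontier.isEmpty()) {
            // Taking from the front gives breadth-first order; from the back, depth-first.
            String page = search == SearchAlgorithm.BREADTH_FIRST
                    ? frontier.pollFirst() : frontier.pollLast();
            if (!visited.add(page)) continue; // already visited through another link
            String text = pageText.getOrDefault(page, "");
            boolean found = match == MatchAlgorithm.NAIVE
                    ? naiveContains(text, query) : kmpContains(text, query);
            if (found) hits.add(page);
            for (String next : links.getOrDefault(page, Collections.<String>emptyList())) {
                if (!visited.contains(next)) frontier.addLast(next);
            }
        }
        return hits;
    }
}
```

Both matchers return the same answer for any input; the choice affects only how much work is done per page, while the choice of traversal affects the order in which matching pages are reported.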
Using these methods, the crawler has the ability to quickly
query through a domain for a search string of interest without
the backend processing of building a database. This gives an
ordinary user the ability to customize how they crawl a domain
without needing to obtain equipment for database
storage or to spend the preliminary time and effort
building a database of links to search. Although it may not be
as thorough, this solution is easily deployable on many small
to medium-sized domains.
The rest of the paper is structured as follows. Section
II reviews related work on the subject. Section
III presents the methodologies used in the crawler.
Section IV discusses findings collected
while using the crawler and analyzes its
performance. Finally, Section V discusses future work and
presents the conclusion.
II. RELATED WORK
III. METHODOLOGY
IV. FINDINGS AND ANALYSIS
V. CONCLUSION
The conclusion goes here.
ACKNOWLEDGMENT
The authors would like to thank...
REFERENCES
[1] https://www.worldwidewebsize.com; Accessed 03/20/2014 0848
[2] http://en.wikipedia.org/wiki/Web_crawler; Accessed 03/20/2014 0854
