A Web Crawler Detection Algorithm Based On Web Page Member List
Abstract—Following the wide use of search engines, the impact Web crawlers have on Web sites should not be ignored. After analyzing the navigational patterns of Web crawlers in Web logs, a new algorithm based on the Web page member list is proposed. The algorithm constructs one member list for every Web page and one show table for every visitor. The experiment shows that the new algorithm can detect unknown crawlers as well as unfriendly crawlers that do not obey the Standard for Robot Exclusion.

Keywords—Search engine; Web crawler detection; Web page member list; Web log

I. INTRODUCTION

A Web crawler (also called a Web robot or Web spider) is a program that automatically traverses the Web's hypertext structure by retrieving a document and recursively retrieving all documents that it references. Web crawlers are often used as resource discovery and retrieval tools for Web search engines such as Google, Baidu, etc.
But the crawlers' automatic visits to Web sites also cause many problems. First, for reasons of business secrecy, many e-commerce Web sites do not want unauthorized crawlers to retrieve information from their sites. Second, many e-commerce Web sites need to analyze their visitors' browsing behavior, but such analysis can be severely distorted by the presence of Web crawlers [1,2]. Third, many government Web sites do not want their information collected and indexed by crawlers for various reasons. Fourth, poorly designed crawlers often consume large amounts of network and server resources, affecting the visits of normal customers. So it is necessary for Web site managers to detect Web crawlers among all visitors, and to take proper measures to redirect the crawlers or to stop responding to HTTP requests coming from unauthorized crawlers.

The commonly used detection method is to set up a database of known crawlers [3] and compare the IP address and User-Agent fields of each HTTP request message against the known crawlers (a minimal matching sketch is given below). But this method can detect only well-known crawlers.
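This database check amounts to a simple lookup. The following Python sketch is illustrative only: the structure of the crawler database is an assumption, not the paper's, and the two sample entries are copied from Table III below.

    # Known-crawler database check: flag a request whose client IP or
    # User-Agent matches a known crawler. Entries are illustrative.
    KNOWN_CRAWLERS = {
        'ips': {'64.68.82.135', '202.108.249.130'},
        'agent_substrings': ('Googlebot', 'BaiduSpider'),
    }

    def is_known_crawler(ip, user_agent):
        if ip in KNOWN_CRAWLERS['ips']:
            return True
        return any(s in user_agent for s in KNOWN_CRAWLERS['agent_substrings'])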
information collected and indexed by the crawlers for some (1) when a person inputs a URL address in the browser
reason. Fourth, poorly-designed crawlers often consume lots (eg. Microsoft Internet Explorer), the browser will send out
of network and server resources, affecting the visit of normal the HTTP request to the target server. According to the
customers. So, it is necessary for the Web site managers to HTTP protocol [5], the server will check whether it has the
detect Web crawlers from all the visitors, and take proper document specified by the URL after it has received the
measures to redirect the Web crawlers or stop responding to request. If it does have, it sends out that document.
HTTP requests coming from the unauthorized crawlers. Otherwise, it gives out an error message. And, the browser
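The following Python sketch applies all three heuristics to an access log. It assumes a combined-log-format record, groups requests by client IP as a rough stand-in for user sessions, and uses purely illustrative thresholds; none of these details come from the paper itself.

    import re

    # One combined-log-format record: IP, timestamp, request line,
    # status, size, referrer, user agent.
    LOG_PATTERN = re.compile(
        r'(?P<ip>\S+) \S+ \S+ \[[^\]]+\] '
        r'"(?P<method>\S+) (?P<url>\S+) [^"]*" \d+ \S+ '
        r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"')

    def suspicious_sessions(log_lines, empty_ref_ratio=0.9, head_ratio=0.5):
        """Group requests by IP and flag sessions matching heuristics (1)-(3)."""
        sessions = {}
        for line in log_lines:
            m = LOG_PATTERN.match(line)
            if not m:
                continue
            s = sessions.setdefault(m.group('ip'),
                                    {'total': 0, 'robots': 0,
                                     'empty_ref': 0, 'head': 0})
            s['total'] += 1
            if m.group('url').endswith('/robots.txt'):  # heuristic (1)
                s['robots'] += 1
            if m.group('referrer') in ('-', ''):        # heuristic (2)
                s['empty_ref'] += 1
            if m.group('method') == 'HEAD':             # heuristic (3)
                s['head'] += 1
        flagged = {}
        for ip, s in sessions.items():
            reasons = []
            if s['robots'] > 0:
                reasons.append('requested robots.txt')
            if s['empty_ref'] >= empty_ref_ratio * s['total']:
                reasons.append('mostly empty referrers')
            if s['head'] >= head_ratio * s['total']:
                reasons.append('mostly HEAD requests')
            if reasons:
                flagged[ip] = reasons
        return flagged

As the text notes, each of these signals alone is unreliable, so a session flagged by this sketch is only a candidate crawler.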
To solve this problem, Pang-Ning Tan and Vipin Kumar [1] adopted the C4.5 decision tree algorithm to classify crawler visits and human visits based on the characteristics of the crawlers' access patterns. Their method can effectively detect unknown crawlers, but it is somewhat complicated. In this paper, after analyzing the access patterns of Web crawlers, we propose a new, simple but effective algorithm to detect Web crawlers based on the Web page member list.

II. THE DIFFERENCES BETWEEN CRAWLER VISITS AND HUMAN VISITS

There are great differences between crawler visits and human visits.

(1) When a person inputs a URL address in a browser (e.g., Microsoft Internet Explorer), the browser sends an HTTP request to the target server. According to the HTTP protocol [5], after receiving the request the server checks whether it has the document specified by the URL. If it does, it sends out that document; otherwise, it returns an error message. The browser then parses the document sent by the server. If it is a single object, such as a picture, the browser shows it directly. If it is an HTML document, the browser analyzes the embedded and linked objects in the document (such as image files, animation files, script files, cascading style sheet files, frames, etc.), and then continuously and automatically sends HTTP requests to the server until all the embedded objects have been requested. On the server side, the server sends out all the requested documents in order after receiving the client's requests. When the browser has received all the
If the crawler wants to request the embedded objects, the time interval is usually larger than 30 seconds.

Step 3: To judge whether one uid is a crawler, the simple method is to check the ShowNumber fields of its ShowTable. If all the ShowNumber fields corresponding to its visited URLs equal 0, we can conclude that the uid is a crawler.

TABLE III. THE CRAWLERS DETECTED FROM THE LOGS

    IP address         Agent        Visiting robots.txt or not
    64.68.82.135       Googlebot    yes
    202.108.249.130    BaiduSpider  yes
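Steps 1 and 2 are not preserved in this fragment; the sketch below therefore assumes that a visitor's ShowTable maps each visited URL to a ShowNumber counting how many objects from that page's member list the visitor subsequently fetched. Under that assumption, the update rule and the Step 3 test might look like:

    def update_show_table(show_table, member_lists, requested_url):
        """Assumed update rule (Steps 1-2 are missing from this fragment):
        when a visitor requests an object belonging to the member list of
        a page it has already visited, increment that page's ShowNumber."""
        for page, members in member_lists.items():
            if page in show_table and requested_url in members:
                show_table[page] += 1

    def is_crawler(show_table):
        """Step 3 as described above: the uid is judged a crawler when
        every ShowNumber field of its visited URLs equals 0, i.e. it
        never fetched any embedded member of any page it visited."""
        return all(n == 0 for n in show_table.values())

Under this rule, a browser-driven visit raises at least some ShowNumber above zero, while a crawler that fetches pages without their embedded objects leaves every field at 0.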
A limitation of the algorithm is that if the Web pages of a site contain only plain text, with no images or sounds, the algorithm may regard a human visitor as a crawler. Likewise, if a person uses a very simple browser, or sets his browser not to display images and not to play sounds, the algorithm may mistake a human visit for a crawler visit. For future work, we would like to take the objects and hyperlinks of the Web page into account to detect search engine crawlers more effectively.

REFERENCES

[1] Pang-Ning Tan, Vipin Kumar. "Discovery of Web robot sessions based on their navigational patterns," Data Mining and Knowledge Discovery, Vol. 6, pp. 9-35, January 2002.
[2] Omer Duskin, Dror G. Feitelson. "Distinguishing humans from robots in Web search logs: Preliminary results using query rates and intervals," Workshop on Web Search Click Data '09, Barcelona, Spain, 2009.
[3] The Web robots database. http://www.robotstxt.org/wc/active.html
[4] Robots exclusion. http://www.robotstxt.org/wc/exclusion.html
[5] Hypertext transfer protocol - HTTP/1.1. http://www.w3.org
[6] Web characterization terminology & definitions sheet. http://www.w3.org/1999/05/WCA-terms/01/