2012 4th International Conference on Intelligent Human-Machine Systems and Cybernetics

A Web Crawler Detection Algorithm Based on Web Page Member List

Weigang Guo, Yong Zhong, Jianqin Xie


School of Electronic and Information Engineering
Foshan University
Foshan, Guangdong Province, China
1049621@qq.com

Abstract—With the widespread use of search engines, the impact that Web crawlers have on Web sites can no longer be ignored. After analyzing the navigational patterns of Web crawlers in Web logs, a new detection algorithm based on the Web page member list is proposed. The algorithm constructs one member list for every Web page and one show table for every visitor. Experiments show that the new algorithm can detect unknown crawlers as well as unfriendly crawlers that do not obey the Standard for Robot Exclusion.

Keywords- Search engine; Web crawler detection; Web page member list; Web log

I. INTRODUCTION

A Web crawler (also called a Web robot or Web spider) is a program that automatically traverses the Web's hypertext structure by retrieving a document and recursively retrieving all documents that it references. Web crawlers are often used as resource discovery and retrieval tools for Web search engines such as Google, Baidu, etc.

But the crawlers' automatic visits to Web sites also cause many problems. First, for reasons of business confidentiality, many E-commerce Web sites do not want unauthorized crawlers to retrieve information from their sites. Second, many E-commerce Web sites need to analyze their visitors' browsing behavior, but such analysis can be severely distorted by the presence of Web crawlers [1,2]. Third, many government Web sites do not want their information collected and indexed by crawlers for various reasons. Fourth, poorly designed crawlers often consume large amounts of network and server resources, affecting the visits of normal customers. So, it is necessary for Web site managers to detect Web crawlers among all visitors and to take proper measures, such as redirecting the crawlers or not responding to HTTP requests coming from unauthorized crawlers.

The most commonly used detection method is to set up a database of known crawlers [3] and to compare the IP address and User-Agent fields of the HTTP request messages against it. But this method can detect only well-known crawlers. There are three simple techniques for detecting unknown crawlers from Web logs (a brief illustrative sketch follows the list):

(1) According to the SRE (Standard for Robot Exclusion) [4], whenever a crawler visits a Web site, it should first request a file called robots.txt. So, by examining the user sessions generated from Web logs, new crawlers that follow the SRE can be found. However, the standard is voluntary, and many crawlers do not obey it.

(2) Most crawlers do not assign any value to the "referrer" field of their HTTP request messages, so in the Web log the "referrer" field is empty (="-"). Therefore, if a user session contains a large number of requests with empty referrer fields, the visitor is a "suspicious" crawler. But, as Web browsers can sometimes generate HTTP messages with empty referrer values, this method is not reliable either.

(3) When checking the validity of a hyperlink structure, most crawlers use the HEAD request method to reduce the burden on Web servers. Therefore, one can examine user sessions with many HEAD requests to discover potential crawlers. Similarly, as Web browsers can sometimes generate HEAD requests as well, this method is also not reliable.

To solve the problem, Pang-Ning Tan and Vipin Kumar [1] adopted the C4.5 decision tree algorithm to distinguish crawler visits from human visits based on the characteristics of the crawlers' access patterns. Their method can effectively detect unknown crawlers, but it is rather complicated. In this paper, after analyzing the access patterns of Web crawlers, we propose a new, simple but effective algorithm for detecting Web crawlers based on the Web page member list.
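As a rough illustration of the three simple checks above (this sketch is not part of the original paper), a session could be screened as follows in Python; the record fields ('url', 'method', 'referrer') and the thresholds are assumptions made only for the example.

# Illustrative sketch: flag a "suspicious" session using the three simple
# techniques above. A session is assumed to be a list of parsed log records
# (dicts with 'url', 'method' and 'referrer' keys); the thresholds are arbitrary.
def is_suspicious_session(records, empty_ref_ratio=0.9, head_ratio=0.5):
    total = len(records)
    if total == 0:
        return False
    # (1) a crawler that follows the SRE requests robots.txt
    if any(r['url'].endswith('/robots.txt') for r in records):
        return True
    # (2) most requests carry an empty ("-") referrer field
    empty_refs = sum(1 for r in records if r['referrer'] in ('-', ''))
    if empty_refs / total >= empty_ref_ratio:
        return True
    # (3) many HEAD requests, typically used to validate hyperlinks
    heads = sum(1 for r in records if r['method'] == 'HEAD')
    if heads / total >= head_ratio:
        return True
    return False

As the discussion above notes, each of these checks can be evaded by crawlers or triggered by ordinary browsers, which motivates the member-list approach developed in the rest of the paper.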
II. THE DIFFERENCES BETWEEN CRAWLER VISITS AND HUMAN VISITS

There are great differences between crawler visits and human visits.

(1) When a person enters a URL in a browser (e.g. Microsoft Internet Explorer), the browser sends an HTTP request to the target server. According to the HTTP protocol [5], after receiving the request the server checks whether it has the document specified by the URL. If it does, it sends out that document; otherwise, it returns an error message. The browser then parses the document it receives from the server. If it is a single document, such as a picture, the browser shows it directly. If it is an HTML document, the browser analyzes the embedded and linked objects in the document (such as image files, animation files, script files, cascading style sheet files, frames, etc.) and then continuously and automatically sends HTTP requests to the server until all the embedded objects have been requested. On the server side, the server sends out all the requested documents in turn after receiving the client's requests.

When the browser has received all the embedded objects, it "assembles" them and generates the complete Web page as seen from the human point of view. So, one request by a person may generate several records in the Web server log, and all the embedded objects "show up" in the log.

(2) A crawler behaves differently. Usually, after getting a URL (assume it points to an HTML document) from the list of URLs waiting to be visited, the crawler sends an HTTP request to the target server. After receiving the server's response, the crawler also analyzes the embedded objects and hyperlinks within the received HTML document and adds the embedded hyperlinks to the waiting list according to its visiting rules. The treatment of embedded objects (such as image files, animation files, script files, cascading style sheet files, frames, etc.) may differ: some search engine crawlers add the URLs of these objects to the waiting list as well, while others discard them, or modify the links to the objects in the HTML document rather than requesting them directly. But one thing is the same: crawlers do not request the embedded objects from the server immediately. Therefore, one crawler request leaves only one record in the server log, and that record represents exactly the crawler's request.

III. CONSTRUCTION OF THE WEB PAGE MEMBER LIST

Definition 1: A Web page is a collection of information, consisting of one or more Web resources, intended to be rendered simultaneously, and identified by a single URL. More specifically, a Web page consists of a Web resource with zero, one, or more embedded Web resources intended to be rendered as a single unit, and referred to by the URL of the one Web resource which is not embedded [6].

Definition 2: The member list of a Web page is defined by a three-tuple:

t = <webpage, memberset, n>

where webpage is the URL of the Web page, memberset is the set of the URLs of all the embedded objects, represented as {m1, m2, …, mn}, mi (i = 1, …, n) is the URL of an embedded object, and n is the number of members.

The embedded objects mi include: 1) multimedia files (images, sounds, animations, etc.) defined by the SRC attribute of the IMG, BGSOUND, EMBED and OBJECT tags of HTML; 2) frames defined by the SRC attribute of the FRAME and IFRAME tags of HTML; 3) cascading style sheet files linked by the HREF attribute of the LINK tag of HTML; 4) script files linked by the SRC attribute of the SCRIPT tag of HTML; 5) Java applet class files linked by the CODE attribute of the APPLET tag of HTML.

For example, suppose a Web page sample.htm consists of two frames, list.htm and welcome.htm, each frame contains one picture (littledog.jpg and bigdog.jpg respectively), and welcome.htm links a cascading style sheet file style.css. Then, supposing all the files are in the root directory of the Web site, the member list of sample.htm is:

<sample.htm, {list.htm, welcome.htm, littledog.jpg, bigdog.jpg, style.css}, 5>

The number of members of a multimedia file is 0.

We developed an HTML analyzer that analyzes the HTML tags and their attributes and generates the Web page member list for every file in the Web site. All the member lists together constitute the member list set of the Web site.
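The paper's HTML analyzer itself is not published; the following Python sketch shows one way to build the three-tuple of Definition 2 for a single HTML document, covering the five member categories listed above. It follows the paper's attribute list (e.g. the SRC attribute of OBJECT) and, unlike the sample.htm example, it does not recurse into embedded frames.

# Sketch of a member-list builder for one HTML document (assumptions noted above).
from html.parser import HTMLParser

SRC_TAGS = {'img', 'bgsound', 'embed', 'object', 'frame', 'iframe', 'script'}

class MemberListBuilder(HTMLParser):
    """Collects the URLs of embedded objects as defined in Section III."""
    def __init__(self):
        super().__init__()
        self.members = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in SRC_TAGS and attrs.get('src'):
            self.members.append(attrs['src'])                    # categories 1), 2), 4)
        elif tag == 'link' and (attrs.get('rel') or '').lower() == 'stylesheet' \
                and attrs.get('href'):
            self.members.append(attrs['href'])                   # category 3)
        elif tag == 'applet' and attrs.get('code'):
            self.members.append(attrs['code'])                   # category 5)

def member_list(url, html_text):
    """Returns the three-tuple <webpage, memberset, n> for one HTML document."""
    builder = MemberListBuilder()
    builder.feed(html_text)
    members = list(dict.fromkeys(builder.members))   # preserve order, drop duplicates
    return (url, members, len(members))

Running member_list over every HTML file of the site would yield the member list set used below; multimedia files themselves simply get n = 0.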
IV. THE DETECTION ALGORITHM

Step 1: Data preprocessing. Sort the Web logs by the IP field, agent field and time field as the first, second and third sort keys, then treat the records with the same IP and agent fields as one visitor's visiting records and assign them a label uid. Each record of the log files can then be represented as:

R = <uid, url, time>

where uid is the label of a distinct visitor, url is the requested URL resource, and time is the request time. The visiting record set of a user is thus represented as:

S = <uid, {(url1, time1), …, (urlk, timek)}>

where k is the total number of visiting records.
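A minimal sketch of Step 1 is given below, assuming the raw log lines have already been parsed into dictionaries with 'ip', 'agent', 'url' and 'time' fields (the field names are illustrative, not the paper's).

# Sketch of Step 1: group parsed log records into per-visitor sessions.
from itertools import groupby

def build_sessions(records):
    # sort by IP, agent and time, as described above
    records = sorted(records, key=lambda r: (r['ip'], r['agent'], r['time']))
    sessions = {}    # uid -> [(url1, time1), ..., (urlk, timek)]
    groups = groupby(records, key=lambda r: (r['ip'], r['agent']))
    for uid, (_, group) in enumerate(groups):
        sessions[uid] = [(r['url'], r['time']) for r in group]
    return sessions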
Step 2: Construct a ShowTable (see Table I) for every visitor, recording the actual attendance of the members of the visited Web pages. In the ShowTable, url is the URL of the Web page, NumberOfMember is the number of members of the Web page, obtained from the Web site's member list set, and ShowNumber is the number of those members that appear in the visitor's visiting record set.

The algorithm for computing ShowNumber is as follows:

for each r ∈ S do
    if the URL type of r is a multimedia file (images, sounds)
        then ShowNumber := 0;
    else
        for each member of r do
            { judge whether this member appears in its close succeeding sequence;
              if it does appear
              then { ShowNumber := ShowNumber + 1;
                     delete this member's record } }
            next member;
        end for
    next r;
end for

TABLE I. SHOWTABLE: RECORDS THE ACTUAL ATTENDANCE OF THE MEMBERS OF THE VISITED WEB PAGES FOR EVERY VISITOR.

url                NumberOfMember    ShowNumber
index.htm          6                 0
introduction.htm   1                 0
myideas.htm        4                 0
bigdog.jpg         0                 0
…                  …                 …

Here, the close succeeding sequence of r means all the visiting records behind r within a certain time interval in the visiting record set S. The interval can range from 0 to 30 seconds. The reason is that, if the visitor is a person, the browser usually requests the embedded objects within 0~5 seconds, and the request intervals will not exceed 30 seconds; otherwise the visitor becomes impatient and gives up or leaves.

For crawler visits, if the crawler requests the embedded objects at all, then according to its retrieval strategy the time interval is usually larger than 30 seconds.

Step 3: To judge whether a uid is a crawler, the simple method is to check the ShowNumber field of its ShowTable. If the ShowNumber fields of all its visited URLs equal 0, we can regard the uid as a crawler.
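A Python transcription of Steps 2 and 3 might look as follows. It is a sketch under several assumptions: the session is the time-ordered list of (url, time) pairs from Step 1 with times in seconds, member_lists maps each page URL to its member set from Section III, multimedia files are recognized by file extension, and a member is matched to a later request by a simple URL-suffix comparison.

# Sketch of Steps 2 and 3 (assumptions noted above).
MULTIMEDIA_EXT = ('.jpg', '.gif', '.png', '.mp3', '.wav', '.swf')   # illustrative list

def show_table(session, member_lists, window=30):
    table = {}       # url -> [NumberOfMember, ShowNumber]
    used = set()     # indices of records already counted as appearing members
    for i, (url, t) in enumerate(session):
        if url.lower().endswith(MULTIMEDIA_EXT):
            table[url] = [0, 0]          # a multimedia file has no members of its own
            continue
        members = member_lists.get(url, ())
        show = 0
        for m in members:
            # scan the close succeeding sequence: records after r within the window
            for j in range(i + 1, len(session)):
                u, tt = session[j]
                if tt - t > window:
                    break
                if j not in used and u.endswith(m):
                    show += 1
                    used.add(j)          # "delete this member's record"
                    break
        table[url] = [len(members), show]
    return table

def is_crawler(table):
    # Step 3: every ShowNumber of the visited URLs equals 0
    return all(show == 0 for _, show in table.values())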

V. EXPERIMENTS

Our experiments were performed on the Foshan University server logs (http://202.192.168.175) collected from July 1st to July 31st, 2009. The structure of the Web site is simple, and its Web pages contain only text and a small number of images, sounds, frames, Java applets, JavaScript files and cascading style sheet files. The member lists of the Web pages of the top two layers of the Web site are shown in Table II.

The Web logs from July 1st to July 31st, 2009 contain a total of 30328 records. Using the detection algorithm proposed in this paper, we got five crawlers from the Web logs, which are shown in Table III. The ShowTables of the top-two-layer Web pages visited by Googlebot (IP: 64.68.82.135) and BaiduSpider (IP: 202.108.249.130) are shown in Table IV. Table V and Table VI show the ShowTables of the anonymous agents (-) with IP addresses 138.15.164.37 and 138.15.164.25. We can see that the crawler from 138.15.164.37 retrieves only image files, while the crawler from 138.15.164.25 retrieves only MP3 music files. These two IP addresses are probably in the same local area network as IP 138.15.164.14 (whose agent is Lachesis). A reasonable explanation is that IP 138.15.164.14 retrieves only HTML documents and extracts the URLs of the image and music files from them; IP 138.15.164.37 and IP 138.15.164.25 then crawl the image files and the music files respectively. These three IP addresses probably belong to the same organization or company.

TABLE II. THE MEMBER LISTS OF THE WEB PAGES OF THE TOP TWO LAYERS OF THE WEB SITE (HTTP://202.192.168.175)

URL                   Number of members   Type of members
/index.htm            6                   image files: 3; sound files: 1; CSS files: 1; JavaScript files: 1
/media/index.htm      1                   image files: 1
/search/index.htm     3                   image files: 1; frame files: 2
/myideas/index.htm    2                   image files: 1; CSS files: 1
/students/index.htm   2                   image files: 2
/cheng/index.htm      1                   image files: 1
/ella/index.htm       2                   image files: 1; Java applet files: 1
/mp3/index.htm        3                   image files: 2; MP3 files: 1

TABLE III. THE CRAWLERS DETECTED FROM THE LOGS

IP address        Agent                Visiting robots.txt or not
64.68.82.135      Googlebot            yes
202.108.249.130   BaiduSpider          yes
192.160.51.70     LinkScan/11.0+Unix   no
210.72.21.199     HTML_GET_APP         yes
138.15.164.14     Lachesis             no
138.15.164.37     -                    no
138.15.164.25     -                    no

TABLE IV. THE SHOWTABLE OF GOOGLEBOT AND BAIDUSPIDER

URL                   NumberOfMember   ShowNumber
/index.htm            6                0
/media/index.htm      1                0
/search/index.htm     3                0
/myideas/index.htm    2                0
/students/index.htm   2                0
/cheng/index.htm      1                0
/ella/index.htm       2                0
/flash/index.htm      3                0

TABLE V. THE SHOWTABLE OF IP: 138.15.164.37

URL                           NumberOfMember   ShowNumber
/media/images/dian2.jpg       0                0
/media/images/diehe3.jpg      0                0
/media/images/diehe3.jpg      0                0
/media/images/diehe3.jpg      0                0
/cheng/041059/03.gif          0                0
/Ella/Shopping/image001.gif   0                0
……

TABLE VI. THE SHOWTABLE OF IP: 138.15.164.25

URL                        NumberOfMember   ShowNumber
/mp3/whitetree.mp3         0                0
/mp3/songtoremember.mp3    0                0
/mp3/Sunflowers.mp3        0                0
/mp3/ThoseFlowers.mp3      0                0
……

VI. CONCLUSIONS

Our detection algorithm is simple, but it is effective, has high accuracy, and needs only a few records to decide whether a visitor is a crawler.

The weakness of the algorithm is that, if the Web pages of a site contain only plain text and no images or sounds, the algorithm may regard a human visitor as a crawler. Likewise, if a person uses a very simple browser, or configures the browser not to display images and not to play sounds, the algorithm may mistake a human visit for a crawler visit. For future work, we would like to also take the objects and hyperlinks of the Web pages into account in order to detect search engine crawlers more effectively.

REFERENCES

[1] Pang-Ning Tan, Vipin Kumar. "Discovery of Web robot sessions based on their navigational patterns," Data Mining and Knowledge Discovery, Vol. 6, pp. 9-35, January 2002.
[2] Omer Duskin, Dror G. Feitelson. "Distinguishing humans from robots in Web search logs: preliminary results using query rates and intervals," Workshop on Web Search Click Data '09, Barcelona, Spain, 2009.
[3] The Web robots database. http://www.robotstxt.org/wc/active.html
[4] Robots exclusion. http://www.robotstxt.org/wc/exclusion.html
[5] Hypertext Transfer Protocol - HTTP/1.1. http://www.w3.org
[6] Web Characterization Terminology & Definitions Sheet. http://www.w3.org/1999/05/WCA-terms/01/
