WEBTracker: A Web Crawler for Maximizing Bandwidth Utilization
Md. Ruhul Amin, Assistant Professor, Dept. of CSE, Shahjalal University of Science and Technology
Mohiul Alam Prince, Software Engineer, Structured Data Systems Ltd, Dhaka
Md. Akter Hussain, Assistant Professor, Dept. of CSE, Shahjalal University of Science and Technology

ABSTRACT

The most challenging task for a web crawler is to download content fast enough to fully utilize the available bandwidth while processing the downloaded data quickly enough that the downloader never starves. Our scalable web crawling system, named WEBTracker, has been designed to meet this challenge. It can be used efficiently in a distributed environment to maximize download throughput. WEBTracker has a Central Crawler Server that administers all the crawler nodes. At each crawler node, a Crawler Manager runs the downloader and manages the downloaded content. The Central Crawler Server and its Crawler Managers are members of a Distributed File System, which ensures synchronized distributed operation of the system. In this paper, we concentrate only on the architecture of a web crawling node, which is owned by the Crawler Manager. We show that our crawler architecture makes efficient use of the allocated bandwidth, keeps the processor lightly loaded while processing downloaded content, and makes efficient use of run-time memory.

Keywords: WEBTracker, Web Crawler, Information Retrieval, World Wide Web.

1. INTRODUCTION

A web crawler automatically collects web content from the World Wide Web and stores it in a repository [1-4]. Web crawlers are mainly used to create a copy of all visited pages for later processing by a search engine, which indexes the downloaded pages to provide fast search services [5-8]. Crawlers can also be used for automating maintenance tasks on a website, such as checking links or validating HTML code, for harvesting e-mail addresses (usually for spam), and for collecting information on a particular topic. The main steps [1-4, 9-11] in web crawling, sketched in the code below, are:

1. Start with a URL queue that initially contains some seed URLs.

2. Remove a URL from the queue and download that page.

3. Extract the links from that page and store the page in secondary storage.

4. Store the extracted links in the queue so that they can be fetched in future.

5. Repeat steps 2, 3, and 4 until the queue is empty.
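The following is a minimal sketch of this loop. The helper functions fetch_page and extract_links, the regular-expression link extraction and the page limit are illustrative assumptions, not part of WEBTracker:

# Minimal sketch of the basic crawl loop described in steps 1-5 above.
from collections import deque
from urllib.request import urlopen
from urllib.parse import urljoin
import re

def fetch_page(url):
    # Download the raw page content for a URL (placeholder downloader).
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")

def extract_links(base_url, html):
    # Very rough href extraction; a real crawler would use an HTML parser.
    return [urljoin(base_url, href) for href in re.findall(r'href="([^"]+)"', html)]

def crawl(seed_urls, max_pages=100):
    queue = deque(seed_urls)          # step 1: URL queue seeded with some URLs
    seen = set(seed_urls)
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()         # step 2: remove a URL and download the page
        try:
            html = fetch_page(url)
        except Exception:
            continue
        pages[url] = html             # step 3: store the page (here, in memory)
        for link in extract_links(url, html):
            if link not in seen:      # step 4: enqueue links to fetch later
                seen.add(link)
                queue.append(link)
    return pages                      # step 5: loop until the queue is empty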

Developing a web crawler that downloads a few pages per second for a short period of time is an easy task, but building a high-performance system that can download hundreds of millions of pages over several weeks presents a number of challenges. It requires extensive experimental work on the system design to ensure I/O and network efficiency and manageability of the downloaded content. Hence, to develop a high-performance web crawler we need to address the following challenges [1-4]:

1. An efficient multi-threaded web content downloader.

2. Maintaining politeness according to the robots exclusion protocol.

3. Efficient URL extraction, normalization and duplicate URL detection.

4. Detecting duplicate web content during web crawling.

5. Web content management for fast storage and retrieval.

6. Bandwidth management and a well-organized distribution policy.

To achieve these goals we have developed a scalable distributed crawling system, WEBTracker, in which the Central Crawler Server provides domain links to the Crawler Manager running at each crawling node. The Crawler Manager runs the downloader, which is never interrupted unless its URL queue is empty. The downloaded web content is processed in batch mode for URL extraction, URL-seen checking and content storage. All of our algorithms require little run-time memory and are optimized for file-based operations. The implementation is now under experimental observation, and the performance measurements of a crawling node are presented in this paper. In a nutshell, the implemented web crawler shows that a high-performance web crawler should have two properties:

1. A downloader that is effectively never paused.

2. All other work of the crawler is batch processed.

2. ARCHITECTURE OF A WEB CRAWLING NODE

The Crawler Manager owns the web crawling node and controls the following four major modules:

1. Downloader

2. LinkExtractorManager

3. URLSeen

4. HostHandler

The basic architecture of a WEBTracker crawler node is shown in Figure 1.


Figure 1. Basic Architecture of a Web Crawling Node.

Table 1 below defines several important terms that are required to understand how our implemented web crawler works.

Table 1. Terms used by our Crawler.

Term | Function | Producer Module | Consumer Module | Elements/Block
Host Queue | Maintains a queue of the downloading domains | Central Crawler Server or User | Crawler Manager | Maximum 100
RepositoryPath | Save domain contents | Crawler Manager | Downloader | -
WebPagePathBlockRepo | Save physical storage path of web page | Crawler Manager | Link Extractor Manager | 100
ExtractedURLBlockRepo | Save extracted URLs (internal) from the downloaded contents | Link Extractor Manager | URL Seen | 5000
ExternalURLBlockRepo | Save extracted URLs (external) from the downloaded contents | Link Extractor Manager | Host Handler | 1000
UnseenURLBlockRepo | Save unseen URLs from unchecked extracted URLs | URL Seen | Crawler Manager | -

Each Producer Module maintains its own configuration file, which keeps track of how many blocks have been created in each repository, and each Consumer Module maintains its own configuration file to keep track of how many blocks have been processed from the repositories; a rough sketch of this bookkeeping is given after Figure 2. The file structure of the WEBTracker repository is shown in Figure 2.
Figure 2. File Structure used by our Crawler (the repository Root contains a Domain Contents directory and a DB directory, each holding per-domain folders for the WebPage files and the WebPage Path, Extracted URL, External URL, Unseen URL and URL blocks).
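As a hedged illustration of this producer/consumer bookkeeping, a block repository can be modeled as numbered block files plus two counters. The file names, JSON config format and directory layout below are assumptions for illustration only, not WEBTracker's actual on-disk format:

# Illustrative sketch of a block repository with producer/consumer counters.
import json, os

class BlockRepo:
    def __init__(self, path, block_size):
        self.path = path
        self.block_size = block_size      # e.g. 5000 URLs per block for ExtractedURLBlockRepo
        os.makedirs(path, exist_ok=True)
        self.conf = os.path.join(path, "config.json")   # assumed config file name
        state = {"produced": 0, "consumed": 0}
        if os.path.exists(self.conf):
            with open(self.conf) as f:
                state = json.load(f)
        self.produced, self.consumed = state["produced"], state["consumed"]

    def _save(self):
        with open(self.conf, "w") as f:
            json.dump({"produced": self.produced, "consumed": self.consumed}, f)

    def write_block(self, items):
        # Producer side: append a new numbered block and record it in the config file.
        block_file = os.path.join(self.path, f"block_{self.produced}.txt")
        with open(block_file, "w") as f:
            f.write("\n".join(items))
        self.produced += 1
        self._save()

    def next_block(self):
        # Consumer side: read the next unprocessed block, if any, and advance the counter.
        if self.consumed >= self.produced:
            return None
        block_file = os.path.join(self.path, f"block_{self.consumed}.txt")
        with open(block_file) as f:
            items = f.read().splitlines()
        self.consumed += 1
        self._save()
        return items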

2.1 Crawler Manager

This module is responsible for maintaining communication with the Central Crawler Server and for controlling the four modules: Downloader, LinkExtractorManager, URLSeen and HostHandler. In a distributed crawling system, the Crawler Manager is started by a command from the Central Crawler Server; in a single-node web crawling system, it can be started directly. On the crawling node, the Crawler Manager then searches for information about a previous crawl. If such information is found, it first restores the previous configuration and resumes crawling exactly from the point where the last crawl ended. Otherwise it simply reads the list of domains that are to be downloaded and starts crawling.

The important functions of the crawler manager are given below:

2.1.1 Reading Robots.txt file

For each domain, the crawler manager reads the robots.txt file from the domain URL [1-4, 12, 13] and creates an object representing the rules written in the file. The crawler manager uses an individual thread for each domain. This configuration is consulted every time a URL from that domain is crawled.
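A minimal sketch of this step using Python's standard robots.txt parser; the per-domain cache, user-agent string and fallback behaviour are assumptions, not WEBTracker's implementation:

# Sketch: per-domain robots.txt objects consulted before each download.
from urllib import robotparser

class RobotsCache:
    def __init__(self, user_agent="WEBTracker"):
        self.user_agent = user_agent
        self.parsers = {}                      # one parsed robots.txt object per domain

    def allowed(self, domain, url):
        # Lazily fetch and parse robots.txt the first time a domain is seen.
        if domain not in self.parsers:
            rp = robotparser.RobotFileParser()
            rp.set_url(f"http://{domain}/robots.txt")
            try:
                rp.read()
            except Exception:
                rp = None                      # on failure, fall back to allowing the URL
            self.parsers[domain] = rp
        rp = self.parsers[domain]
        return True if rp is None else rp.can_fetch(self.user_agent, url)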

2.1.2 Managing Download

The list of domains is inserted into the HostQueue for crawling. For a brand new domain, an unseen-URL repository file named UnseenUrlBlockRepo is created for the domain, and the domain name itself is used as the starting URL. The crawler manager pops a domain name from the host queue and then reads a URL of this domain from the corresponding unseen-URL block repository. It then assigns this URL to a downloader in a separate thread. If no URL is available, the crawler manager asks the link extractor manager to provide unseen links for that domain, which is also done in a separate thread. Meanwhile, the crawler manager continues its work in its own independent thread and is never interrupted by this management work. The maximum number of downloading threads is set according to the configuration file of the crawler manager.
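A rough sketch of this scheduling logic, with one download thread per popped domain. The queue names and the three helper callbacks are illustrative assumptions:

# Sketch: the crawler manager pops a domain, takes one unseen URL, and hands it
# to a downloader thread, asking the link extractor for more URLs when empty.
import queue, threading

host_queue = queue.Queue()                 # HostQueue of domains ready to crawl
MAX_DOWNLOAD_THREADS = 16                  # taken from the manager's configuration file (assumed value)
download_slots = threading.Semaphore(MAX_DOWNLOAD_THREADS)

def manage_downloads(read_unseen_url, download, request_links):
    # read_unseen_url(domain) -> next URL from UnseenUrlBlockRepo, or None
    # download(domain, url)   -> downloader module (runs in its own thread)
    # request_links(domain)   -> ask the link extractor manager for more unseen URLs
    while True:
        domain = host_queue.get()          # blocks until a domain is available
        url = read_unseen_url(domain)
        if url is None:
            # No URL available: request links in a separate thread, never blocking the manager.
            threading.Thread(target=request_links, args=(domain,), daemon=True).start()
            continue
        download_slots.acquire()           # respect the maximum number of download threads
        def run(d=domain, u=url):
            try:
                download(d, u)
            finally:
                download_slots.release()
                host_queue.put(d)          # post-download: put the domain back in the HostQueue
        threading.Thread(target=run, daemon=True).start()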

2.1.3 Post Download Processing

When the downloader finishes downloading a URL, it sends an acknowledgement to the crawler manager. The crawler manager then adds this domain to the HostQueue again, writes the physical storage path of the downloaded page to the repository named WebPagePathBlockRepo, and updates the corresponding configuration files as required.

2.2 Downloader

The Downloader module is the simplest module of this project. The crawler manager calls this module with a URL and a repository name (RepositoryPath) for saving the downloaded pages. For a particular domain, only one URL (page) is downloaded at a time, in an individual thread. After the download completes, the module sends an acknowledgement to the crawler manager, which updates the management files and starts another download from that domain.

Table 2. Pseudo code of Downloader Module.

Downloader(url)

1. request for url

2. save page

3. Push this domain to CrawlerManager Queue.


Figure 3. Downloader Architecture
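A hedged, runnable rendering of the Downloader in Table 2. The acknowledgement callback and the hash-based file naming are assumptions, not WEBTracker's actual storage scheme:

# Sketch of the Downloader module from Table 2: fetch one URL, save the page
# under RepositoryPath, then acknowledge the Crawler Manager.
import hashlib, os
from urllib.request import urlopen

def downloader(url, repository_path, acknowledge):
    # 1. request for url
    with urlopen(url, timeout=10) as response:
        content = response.read()
    # 2. save page (file name derived from a hash of the URL -- an assumption)
    os.makedirs(repository_path, exist_ok=True)
    name = hashlib.sha1(url.encode()).hexdigest() + ".html"
    path = os.path.join(repository_path, name)
    with open(path, "wb") as f:
        f.write(content)
    # 3. push this domain back to the CrawlerManager queue via the acknowledgement
    acknowledge(url, path)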

2.3 Link Extractor Manager

The link extractor manager is an essential module of this web crawler. Its main task is to extract links from the downloaded web pages of a particular domain. For each domain it creates two lists of unseen URLs: one list of internal URLs (within the same domain), saved in a repository named ExtractedUrlBlockRepo, and one list of external URLs (pointing to other domains), saved in a repository named ExternalUrlBlockRepo. For the link extraction of each domain, the link extractor manager creates an individual thread to serve unseen URLs. When the crawler manager needs URLs for a particular domain, it calls the link extractor manager, which first tries to serve the request immediately by providing unseen URLs already extracted from the downloaded pages of that domain. Otherwise the link extractor manager assigns a thread to extract links from the unprocessed downloaded web pages of that domain, whose path locations are found in the repository named WebPagePathBlockRepo. The extracted URLs are written to the repository named ExtractedUrlBlockRepo. As soon as link extraction for a web page is completed, the link extractor manager calls the URL seen module to obtain the unseen URLs from ExtractedUrlBlockRepo and sends an acknowledgement to the crawler manager. In general, the link extractor manager parses at most 5000 URLs from the unprocessed web pages and stores them as a block in ExtractedUrlBlockRepo. The URL seen module then processes this block in batch mode to obtain the unseen URLs.

Table 3: Pseudo code of LinkExtractorManager and Link Extract Module.

LinkExtractorManager()
1. for each element in Queue
2.   if requestFromCrawlerManager
3.     if this host has extracted links
4.       send them to UrlSeen
5.       open new block to store
6.     else
7.       host = requestedHost
8.   else
9.     host = pop(Queue)
10.  if availPage(host)
11.    LinkExtract(hostPage)
12. wait
13. if any push in Queue
14.   run from 1

LinkExtract(page)
1. extract links of that page
2. filter links
3. save own-domain links in a block
4. save other-domain links in a different block
5. if block has more than 5000 links
6.   send it to UrlSeen
7.   open new block
8. push this host to LinkExtractor Queue
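As an illustration of the internal/external split described in Section 2.3, the sketch below uses Python's standard HTML parser; block handling and the 5000-URL threshold are omitted, and the function names are our own:

# Sketch: extract links from a downloaded page and split them into internal
# URLs (same domain, destined for ExtractedUrlBlockRepo) and external URLs
# (other domains, destined for ExternalUrlBlockRepo).
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

def extract_links(page_url, html, domain):
    collector = LinkCollector()
    collector.feed(html)
    internal, external = [], []
    for href in collector.hrefs:
        url = urljoin(page_url, href)
        if urlparse(url).scheme not in ("http", "https"):
            continue                              # filter out mailto:, javascript:, etc.
        if urlparse(url).netloc == domain:
            internal.append(url)                  # same domain -> ExtractedUrlBlockRepo
        else:
            external.append(url)                  # other domain -> ExternalUrlBlockRepo
    return internal, external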


S ta r t

C ra w le r M a n a g e r R eq u est fo r U R L

Y es I s n o tifie d ?

R e q u e s t H o s t L is t
No

I s th e re a n y I s th e r e a n y h o s t w h o s e
S en d S een U rl r e q u e s t fo r lin k ? No
th r e a d is n o t r u n n in g ? No W a it

Y es

I s th e r e a n y p a th o f
th a t h o s t ? No Ig n o re
S e n d th a t lin k Y e s /N o I s th e re a n y lin k fo r
Y es
b lo c k to U rlS e e n re q u e s te d h o s t ?
No

Y es
No

G e t th a t p a th a n d
s e t th is h o s t a s a
I s th e re a n y th r e a d r u n n in g .
r u n n in g f o r th is h o s t ?
N o tif y

L in k e x tr a c t
No th r e a d r u n
Y es

Is th is b lo c k h a s W r ite lin k to a
m a x im u m n u m b e r o f b lo c k a n d s e t th is
URL? h o s t a s a f in is h

Figure 4: Architectural Diagram of Link Extractor Manager

2.4 URL Seen

This module identifies the unseen URLs of a particular domain and filters out URLs that are already in the queue or have been downloaded earlier. For each request from the link extractor manager, this module assigns a thread to each block of unchecked URLs in the repository ExtractedUrlBlockRepo. The module builds a red-black tree from the unique URLs retrieved from UnseenUrlBlockRepo, which contains all the unique URLs of that domain collected since crawling started. For a block of unchecked extracted URLs, the module searches each URL in the red-black tree to check its uniqueness; if the URL is found in the tree, it is filtered out. At the end of this process, the module saves the unseen URLs to the repository UnseenUrlBlockRepo, updates the total number of already downloaded links and total unique links in the configuration files of the corresponding domain, and sends an acknowledgement to the link extractor manager.
Table 4: Pseudo code of URL Seen Module.

UrlSeen(Domain)
1. read block of unique URLs of Domain
2. make RB-tree with unique URL block
3. read block of unchecked URLs
4. for each url
5.   if RB-tree.check(url) is not found
6.     insert in RB-tree
7.     save in checked URL block
8. push this host to CrawlerManager Queue

Figure 5: URL Seen Architecture.
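A simplified sketch of the batch check in Table 4. WEBTracker builds a red-black tree from UnseenUrlBlockRepo, whereas this illustration uses Python's built-in set for the same membership test; the function name and arguments are our own:

# Sketch of the URL Seen step: load the domain's already-known unique URLs,
# then filter a block of unchecked extracted URLs down to the unseen ones.
def url_seen(unique_urls, unchecked_block):
    # unique_urls: all URLs of the domain known so far (paper: kept in a red-black tree)
    # unchecked_block: one block (up to 5000 URLs) from ExtractedUrlBlockRepo
    known = set(unique_urls)          # stand-in for the RB-tree of unique URLs
    unseen = []
    for url in unchecked_block:
        if url not in known:          # not found in the tree -> the URL is unseen
            known.add(url)            # insert so duplicates inside the block are caught too
            unseen.append(url)
    return unseen                     # to be appended to UnseenUrlBlockRepo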

2.5 Host Handler

This module is an important part of our system. In a single-node web crawler, it provides the unseen external domain URLs to the Crawler Manager; in a distributed crawling system, it sends those URLs to the Central Crawler Server instead. The host handler collects the unseen host URLs from the file named ExternalUrlBlockRepo.

Table 5: Pseudo code of HostHandler Module.

HostHandler()
1. read block for download
2. if sufficient for requestedHost
3.   send them
4. else
5.   read all different host blocks generated by the LinkExtractor
6.   find those which are not downloaded
7.   merge all hosts for download
8.   send them
9.   store rest of hosts for download

Figure 6: Host Handler Architecture.
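A rough sketch of the selection step in Table 5. The downloaded-hosts record and the requested-count parameter are assumed bookkeeping, not WEBTracker's interface:

# Sketch of the Host Handler: from the external-URL blocks, pick hosts that
# have not been downloaded yet and hand them out on request.
from urllib.parse import urlparse

def handle_hosts(external_url_blocks, downloaded_hosts, requested_count):
    # external_url_blocks: lists of URLs read from ExternalUrlBlockRepo
    # downloaded_hosts: set of domains already crawled (assumed bookkeeping)
    candidates = []
    seen = set()
    for block in external_url_blocks:
        for url in block:
            host = urlparse(url).netloc
            if host and host not in downloaded_hosts and host not in seen:
                seen.add(host)
                candidates.append(host)
    to_send = candidates[:requested_count]       # send as many hosts as requested
    leftover = candidates[requested_count:]      # store the rest for future requests
    return to_send, leftover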
3. PERFORMANCE ANALYSIS

We have used a single-node web crawler to measure the performance of our implemented system. After a couple of test runs and bug fixes, we recorded the data of the latest experiment. Our crawler machine has 2 GB of RAM, a Core2Duo 2.8 GHz processor and 2 Mbps (15 MB per minute) of bandwidth for crawling. The experiment was started with 40 seed URLs, and bandwidth utilization was recorded for 1000 minutes. Figures 7 to 10 show the outcomes of this experiment.

Figure 7 shows the number of pages downloaded every minute; the average is 220.294 pages per minute. Figure 8 shows the download size in KB per minute; the average download size is 10.493 MB per minute, whereas the maximum bandwidth provided to our system is 15 MB per minute. From Figure 7 we see that from minute 551 to minute 951 the number of pages downloaded per minute decreases, while Figure 8 shows that the downloaded data size is larger over that span. This means that the product of average page size and number of pages remains approximately constant, so we conclude that the crawler's performance is stable.

Figure 9 shows the number of HTTP requests sent every minute by the implemented crawler; the average is 244.638 requests per minute. Figure 10 shows the number of HTTP errors received every minute; the average is 58.222 errors per minute. The experimental results contain no unusual spikes, which indicates that the implemented crawler ran without glitches. Therefore, as we deploy more crawlers, the total bandwidth consumption should scale up accordingly.
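For reference, the bandwidth figures above follow from a short conversion; the utilization percentage is our own derived figure based on the reported averages:

\[
2\,\text{Mbps} = \frac{2 \times 60}{8}\,\text{MB/min} = 15\,\text{MB/min},
\qquad
\text{utilization} \approx \frac{10.493\,\text{MB/min}}{15\,\text{MB/min}} \approx 70\%.
\]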

Figure 7: Crawler Statistics for Total Downloaded Pages per Minute (X-axis: Time in Min, Y-axis: Downloaded Page Number).
Figure 8: Crawler Statistics for Total Download Size in KB per Minute (X-axis: Time in Min, Y-axis: Downloaded Page Size).
Figure 9: Crawler Statistics for Total Number of HTTP Requests per Minute (X-axis: Time in Min, Y-axis: Request Number).
Figure 10: Crawler Statistics for Total Number of HTTP Errors per Minute (X-axis: Time in Min, Y-axis: Number of Errors).

4. CONCLUSION

In this paper we have discussed how bandwidth utilization can be maximized for a web crawler. The most important technique we have used is that we never interrupt the downloading threads. All other post-download management work is also done in independent threads. Moreover, all the tasks of link extraction, URL seen and host handling are completed in batch-processing mode. Hence the processor of our system remains lightly loaded most of the time. Since most of our algorithms are file based, the RAM utilization is also low. All these considerations ensure that the downloader can use the system at full capacity most of the time. That is how we achieved roughly 10 MB per minute of web page downloads on a 15 MB per minute bandwidth allocation. In our implementation we have maintained politeness explicitly: only one URL of a domain is downloaded at a time. We have not presented a content-seen module, since it is outside the scope of this paper. Our future work is to deploy a distributed web crawling system with 20 web crawling nodes and at least 1 Gbps of bandwidth.

5. ACKNOWLEDGMENT

We are grateful to the Dept. of CSE, Shahjalal University of Science and Technology, for providing technological support for conducting this research in the IR Lab.

6. REFERENCES

[1] Cho, J. and Garcia-Molina, H. and Page, L. 1998. Efficient Crawling Through URL Ordering. Seventh International
World-Wide Web Conference (WWW 1998), April 14-18, 1998, Brisbane, Australia.
[2] Boldi P, Codenotti B, Santini M, Vigna S. Ubicrawler: Scalability and fault-tolerance issues. Poster Proceedings of the
11th International World Wide Web Conference, Honolulu, HI, 2002. ACM Press: New York, 2002.
[3] Paolo Boldi, Bruno Codenotti, Massimo Santini, and Sebastiano Vigna. UbiCrawler: A scalable fully distributed web
crawler. Software: Practice & Experience, 34(8):711-726, 2004.
[4] Vladislav Shkapenyuk and Torsten Suel. Design and Implementation of a High-Performance Distributed Web Crawler,
Proceedings of the 18th International Conference on Data Engineering, p.357, February 26-March 01, 2002 .
[5] Sergey Brin and Lawrence Page. "The anatomy of a large-scale hypertextual web search engine". In Proceedings of the Seventh International World-Wide Web Conference, Brisbane, Australia, April 1998.
[6] Google, http://www.google.com, Last Visited: 2011-11-07
[7] Paolo Boldi , Massimo Santini , Sebastiano Vigna, PageRank: Functional dependencies, ACM Transactions on

Information Systems (TOIS), v.27 n.4, p.1-23, November 2009, DOI = 10.1145/1629096.1629097.
[8] Arvind Arasu, Junghoo Cho, Hector Garcia-Molina, Andreas Paepcke, and Sriram Raghavan. “Searching the web”.
ACM Transactions on Internet Technology, 1(1):2–43, 2001
[9] Paolo Boldi, Bruno Codenotti, Massimo Santini, and Sebastiano Vigna. Trovatore: “Towards a highly scalable
distributed web crawler”. In Poster Proc. of Tenth International World Wide Web Conference, pages 140–141, Hong
Kong, China, 2001.
[10] Budi Yuwono, Savio L. Lam, Jerry H. Ying, and Dik L. Lee. “A world wide web resource discovery system”. In
Proceedings of the Fourth International World-Wide Web Conference, Darmstadt, Germany, April 1995, DOI =
10.1.1.51.4920.
[11] Ziv Bar-Yossef , Alexander Berg , Steve Chien , Jittat Fakcharoenphol , Dror Weitz, Approximating Aggregate Queries
about Web Pages via Random Walks, Proceedings of the 26th International Conference on Very Large Data Bases,
p.535-544, September 10-14, 2000
[12] The Robots Exclusion Standard, http://www.robotstxt.org/wc/exclusion.html, Last Visited: 2011-08-23

[13] Robots Exclusion Protocol, http://info.webcrawler.com/mak/projects/robots/exclusion.html, Last Visited: 2011-08-23
[14] J. Talim , Z. Liu , Ph. Nain , E. G. Coffman, Jr., Controlling the robots of Web search engines, ACM SIGMETRICS
Performance Evaluation Review, v.29 n.1, p.236-244, June 2001 DOI=10.1145/384268.378788.
[15] I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann, second edition, 1999, ISBN: 1558605703.
