WEBTracker: A Web Crawler for Maximizing Bandwidth Utilization
Md. Ruhul Amin, Assistant Professor, Dept. of CSE, Shahjalal University of Science and Technology
Mohiul Alam Prince, Software Engineer, Structured Data Systems Ltd, Dhaka
Md. Akter Hussain, Assistant Professor, Dept. of CSE, Shahjalal University of Science and Technology

ABSTRACT

The most challenging task for a web crawler is to download content fast enough to fully utilize the available bandwidth while processing the downloaded data quickly enough that the downloader never starves. Our scalable web crawling system, named WEBTracker, has been designed to meet this challenge. It can be used efficiently in a distributed environment to maximize download throughput. WEBTracker has a Central Crawler Server that administers all the crawler nodes. At each crawler node, a Crawler Manager runs the downloader and manages the downloaded content. The Central Crawler Server and its Crawler Managers are members of a Distributed File System, which ensures synchronized distributed operation of the system. In this paper, we concentrate only on the architecture of a web crawling node, which is owned by the Crawler Manager. We show that our crawler architecture makes efficient use of the allocated bandwidth, keeps the processor lightly loaded while processing downloaded content, and makes efficient use of run-time memory.

Keywords: WEBTracker, Web Crawler, Information Retrieval, World Wide Web.

1. INTRODUCTION

A web crawler automatically collects web content from the World Wide Web and stores it in a repository [1-4]. Web crawlers are mainly used to create a copy of all visited pages for later processing by a search engine, which indexes the downloaded pages to provide fast search services [5-8]. Crawlers can also be used for automating maintenance tasks on a website, such as checking links or validating HTML code, for harvesting e-mail addresses (usually for spam), and for collecting information on a particular topic. The main steps [1-4, 9-11] in web crawling, sketched in the code below, are:

1. Start with a URL queue that initially contains some seed URLs.

2. Remove a URL from the queue and download that page.

3. Extract the links from that page and store the page in secondary storage.

4. Store the extracted links in the queue so that they can be fetched in future.

5. Repeat steps 2, 3, and 4 until the queue is empty.
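The following is a minimal sketch of this loop. The helper functions fetch_page and extract_links, the regular-expression link extraction and the page limit are illustrative assumptions, not part of WEBTracker:

# Minimal sketch of the basic crawl loop described in steps 1-5 above.
from collections import deque
from urllib.request import urlopen
from urllib.parse import urljoin
import re

def fetch_page(url):
    # Download the raw page content for a URL (placeholder downloader).
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")

def extract_links(base_url, html):
    # Very rough href extraction; a real crawler would use an HTML parser.
    return [urljoin(base_url, href) for href in re.findall(r'href="([^"]+)"', html)]

def crawl(seed_urls, max_pages=100):
    queue = deque(seed_urls)          # step 1: URL queue seeded with some URLs
    seen = set(seed_urls)
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()         # step 2: remove a URL and download the page
        try:
            html = fetch_page(url)
        except Exception:
            continue
        pages[url] = html             # step 3: store the page (here, in memory)
        for link in extract_links(url, html):
            if link not in seen:      # step 4: enqueue links to fetch later
                seen.add(link)
                queue.append(link)
    return pages                      # step 5: loop until the queue is empty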

Developing a web crawler that downloads a few pages per second for a short period of time is an easy task, but building a high-performance system that can download hundreds of millions of pages over several weeks presents a number of challenges. It requires extensive experimental work on the system design to ensure I/O and network efficiency and manageability of the downloaded content. Hence, to develop a high-performance web crawler we need to address the following challenges [1-4]:

1. An efficient multi-threaded web content downloader.

2. Maintaining politeness according to the robots exclusion protocol.

3. Efficient URL extraction, normalization and duplicate URL detection.

4. Detecting duplicate web content during web crawling.

5. Web content management for fast storage and retrieval.

6. Bandwidth management and a well-organized distribution policy.

To achieve these goals we have developed a scalable distributed crawling system, WEBTracker, in which the Central Crawler Server provides domain links to the Crawler Manager running at each crawling node. The Crawler Manager runs the downloader, which is never interrupted unless its URL queue is empty. The downloaded web content is processed in batch mode for URL extraction, URL-seen checking and content storage. All of our algorithms require little run-time memory and are optimized for file-based operations. The implementation is now under experimental observation, and the performance measurements of a crawling node are presented in this paper. In a nutshell, the implemented web crawler shows that a high-performance web crawler should have two properties:

1. A downloader that is effectively never paused.

2. All other work of the crawler is batch processed.

2. ARCHITECTURE OF A WEB CRAWLING NODE

The Crawler Manager owns the web crawling node and controls the following four major modules:

1. Downloader

2. LinkExtractorManager

3. URLSeen

4. HostHandler

The basic architecture of a WEBTracker crawler node is shown in Figure 1.


Figure 1. Basic Architecture of a Web Crawling Node.

Table 1 below defines several important terms that are required to understand how our implemented web crawler works.

Table 1. Terms used by our Crawler.

Term | Function | Producer Module | Consumer Module | Elements/Block
Host Queue | Maintains a queue of the downloading domains | Central Crawler Server or User | Crawler Manager | Maximum 100
RepositoryPath | Save domain contents | Crawler Manager | Downloader | -
WebPagePathBlockRepo | Save physical storage path of web page | Crawler Manager | Link Extractor Manager | 100
ExtractedURLBlockRepo | Save extracted URLs (internal) from the downloaded contents | Link Extractor Manager | URL Seen | 5000
ExternalURLBlockRepo | Save extracted URLs (external) from the downloaded contents | Link Extractor Manager | Host Handler | 1000
UnseenURLBlockRepo | Save unseen URLs from unchecked extracted URLs | URL Seen | Crawler Manager | -

Each Producer Module maintains its own configuration file, which keeps track of how many blocks have been created in each repository, and each Consumer Module maintains its own configuration file to keep track of how many blocks have been processed from the repositories; a rough sketch of this bookkeeping is given after Figure 2. The file structure of the WEBTracker repository is shown in Figure 2.
Figure 2. File Structure used by our Crawler (the repository Root contains a Domain Contents directory and a DB directory, each holding per-domain folders for the WebPage files and the WebPage Path, Extracted URL, External URL, Unseen URL and URL blocks).
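As a hedged illustration of this producer/consumer bookkeeping, a block repository can be modeled as numbered block files plus two counters. The file names, JSON config format and directory layout below are assumptions for illustration only, not WEBTracker's actual on-disk format:

# Illustrative sketch of a block repository with producer/consumer counters.
import json, os

class BlockRepo:
    def __init__(self, path, block_size):
        self.path = path
        self.block_size = block_size      # e.g. 5000 URLs per block for ExtractedURLBlockRepo
        os.makedirs(path, exist_ok=True)
        self.conf = os.path.join(path, "config.json")   # assumed config file name
        state = {"produced": 0, "consumed": 0}
        if os.path.exists(self.conf):
            with open(self.conf) as f:
                state = json.load(f)
        self.produced, self.consumed = state["produced"], state["consumed"]

    def _save(self):
        with open(self.conf, "w") as f:
            json.dump({"produced": self.produced, "consumed": self.consumed}, f)

    def write_block(self, items):
        # Producer side: append a new numbered block and record it in the config file.
        block_file = os.path.join(self.path, f"block_{self.produced}.txt")
        with open(block_file, "w") as f:
            f.write("\n".join(items))
        self.produced += 1
        self._save()

    def next_block(self):
        # Consumer side: read the next unprocessed block, if any, and advance the counter.
        if self.consumed >= self.produced:
            return None
        block_file = os.path.join(self.path, f"block_{self.consumed}.txt")
        with open(block_file) as f:
            items = f.read().splitlines()
        self.consumed += 1
        self._save()
        return items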

2.1 Crawler Manager

This module is responsible for maintaining communication with the Central Crawler Server and for controlling the four modules: Downloader, LinkExtractorManager, URLSeen and HostHandler. In a distributed crawling system, the Crawler Manager is started by a command from the Central Crawler Server; in a single-node web crawling system, it can be started directly. On the crawling node, the Crawler Manager then searches for information about a previous crawl. If such information is found, it first restores the previous configuration and resumes crawling exactly from the point where the last crawl ended. Otherwise it simply reads the list of domains that are to be downloaded and starts crawling.

The important functions of the crawler manager are given below:

2.1.1 Reading Robots.txt file

For each domain, the crawler manager reads the robots.txt file from the domain URL [1-4, 12, 13] and creates an object representing the rules written in the file. The crawler manager uses an individual thread for each domain. This configuration is consulted every time a URL from that domain is crawled.
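A minimal sketch of this step using Python's standard robots.txt parser; the per-domain cache, user-agent string and fallback behaviour are assumptions, not WEBTracker's implementation:

# Sketch: per-domain robots.txt objects consulted before each download.
from urllib import robotparser

class RobotsCache:
    def __init__(self, user_agent="WEBTracker"):
        self.user_agent = user_agent
        self.parsers = {}                      # one parsed robots.txt object per domain

    def allowed(self, domain, url):
        # Lazily fetch and parse robots.txt the first time a domain is seen.
        if domain not in self.parsers:
            rp = robotparser.RobotFileParser()
            rp.set_url(f"http://{domain}/robots.txt")
            try:
                rp.read()
            except Exception:
                rp = None                      # on failure, fall back to allowing the URL
            self.parsers[domain] = rp
        rp = self.parsers[domain]
        return True if rp is None else rp.can_fetch(self.user_agent, url)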

2.1.2 Managing Download

The list of domains is inserted into the HostQueue for crawling. For a brand new domain, an unseen-URL repository file named UnseenUrlBlockRepo is created for the domain, and the domain name itself is used as the starting URL. The crawler manager pops a domain name from the host queue and then reads a URL of this domain from the corresponding unseen-URL block repository. It then assigns this URL to a downloader in a separate thread. If no URL is available, the crawler manager asks the link extractor manager to provide unseen links for that domain, which is also done in a separate thread. Meanwhile, the crawler manager continues its work in its own independent thread and is never interrupted by this management work. The maximum number of downloading threads is set according to the configuration file of the crawler manager.
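A rough sketch of this scheduling logic, with one download thread per popped domain. The queue names and the three helper callbacks are illustrative assumptions:

# Sketch: the crawler manager pops a domain, takes one unseen URL, and hands it
# to a downloader thread, asking the link extractor for more URLs when empty.
import queue, threading

host_queue = queue.Queue()                 # HostQueue of domains ready to crawl
MAX_DOWNLOAD_THREADS = 16                  # taken from the manager's configuration file (assumed value)
download_slots = threading.Semaphore(MAX_DOWNLOAD_THREADS)

def manage_downloads(read_unseen_url, download, request_links):
    # read_unseen_url(domain) -> next URL from UnseenUrlBlockRepo, or None
    # download(domain, url)   -> downloader module (runs in its own thread)
    # request_links(domain)   -> ask the link extractor manager for more unseen URLs
    while True:
        domain = host_queue.get()          # blocks until a domain is available
        url = read_unseen_url(domain)
        if url is None:
            # No URL available: request links in a separate thread, never blocking the manager.
            threading.Thread(target=request_links, args=(domain,), daemon=True).start()
            continue
        download_slots.acquire()           # respect the maximum number of download threads
        def run(d=domain, u=url):
            try:
                download(d, u)
            finally:
                download_slots.release()
                host_queue.put(d)          # post-download: put the domain back in the HostQueue
        threading.Thread(target=run, daemon=True).start()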

2.1.3 Post Download Processing

When the downloader finishes downloading a URL, it sends an acknowledgement to the crawler manager. The crawler manager then adds this domain to the HostQueue again, writes the physical storage path of the downloaded page to the repository named WebPagePathBlockRepo, and updates the corresponding configuration files as required.

2.2 Downloader

The Downloader module is the simplest module of this project. The crawler manager calls this module with a URL and a repository name (RepositoryPath) for saving the downloaded pages. For a particular domain, only one URL (page) is downloaded at a time, in an individual thread. After the download completes, the module sends an acknowledgement to the crawler manager, which updates the management files and starts another download from that domain.

Table 2. Pseudo code of Downloader Module.

Downloader(url)

1. request for url

2. save page

3. Push this domain to CrawlerManager Queue.


Figure 3. Downloader Architecture
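A hedged, runnable rendering of the Downloader in Table 2. The acknowledgement callback and the hash-based file naming are assumptions, not WEBTracker's actual storage scheme:

# Sketch of the Downloader module from Table 2: fetch one URL, save the page
# under RepositoryPath, then acknowledge the Crawler Manager.
import hashlib, os
from urllib.request import urlopen

def downloader(url, repository_path, acknowledge):
    # 1. request for url
    with urlopen(url, timeout=10) as response:
        content = response.read()
    # 2. save page (file name derived from a hash of the URL -- an assumption)
    os.makedirs(repository_path, exist_ok=True)
    name = hashlib.sha1(url.encode()).hexdigest() + ".html"
    path = os.path.join(repository_path, name)
    with open(path, "wb") as f:
        f.write(content)
    # 3. push this domain back to the CrawlerManager queue via the acknowledgement
    acknowledge(url, path)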

2.3 Link Extractor Manager

The link extractor manager is an essential module of this web crawler. Its main task is to extract links from the downloaded web pages of a particular domain. For each domain it creates two lists of unseen URLs: one list of internal URLs (within the same domain), saved in a repository named ExtractedUrlBlockRepo, and one list of external URLs (pointing to other domains), saved in a repository named ExternalUrlBlockRepo. For the link extraction of each domain, the link extractor manager creates an individual thread to serve unseen URLs. When the crawler manager needs URLs for a particular domain, it calls the link extractor manager, which first tries to serve the request immediately by providing unseen URLs already extracted from the downloaded pages of that domain. Otherwise the link extractor manager assigns a thread to extract links from the unprocessed downloaded web pages of that domain, whose path locations are found in the repository named WebPagePathBlockRepo. The extracted URLs are written to the repository named ExtractedUrlBlockRepo. As soon as link extraction for a web page is completed, the link extractor manager calls the URL seen module to obtain the unseen URLs from ExtractedUrlBlockRepo and sends an acknowledgement to the crawler manager. In general, the link extractor manager parses at most 5000 URLs from the unprocessed web pages and stores them as a block in ExtractedUrlBlockRepo. The URL seen module then processes this block in batch mode to obtain the unseen URLs.

Table 3: Pseudo code of LinkExtractorManager and Link Extract Module.

LinkExtractorManager()
1. for each element in Queue
2.   if requestFromCrawlerManager
3.     if this host has extracted links
4.       send them to UrlSeen
5.       open new block to store
6.     else
7.       host = requestedHost
8.   else
9.     host = pop(Queue)
10.  if availPage(host)
11.    LinkExtract(hostPage)
12. wait
13. if any push in Queue
14.   run from 1

LinkExtract(page)
1. extract links of that page
2. filter links
3. save own-domain links in a block
4. save other-domain links in a different block
5. if block has more than 5000 links
6.   send it to UrlSeen
7.   open new block
8. push this host to LinkExtractor Queue
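As an illustration of the internal/external split described in Section 2.3, the sketch below uses Python's standard HTML parser; block handling and the 5000-URL threshold are omitted, and the function names are our own:

# Sketch: extract links from a downloaded page and split them into internal
# URLs (same domain, destined for ExtractedUrlBlockRepo) and external URLs
# (other domains, destined for ExternalUrlBlockRepo).
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

def extract_links(page_url, html, domain):
    collector = LinkCollector()
    collector.feed(html)
    internal, external = [], []
    for href in collector.hrefs:
        url = urljoin(page_url, href)
        if urlparse(url).scheme not in ("http", "https"):
            continue                              # filter out mailto:, javascript:, etc.
        if urlparse(url).netloc == domain:
            internal.append(url)                  # same domain -> ExtractedUrlBlockRepo
        else:
            external.append(url)                  # other domain -> ExternalUrlBlockRepo
    return internal, external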


S ta r t

C ra w le r M a n a g e r R eq u est fo r U R L

Y es I s n o tifie d ?

R e q u e s t H o s t L is t
No

I s th e re a n y I s th e r e a n y h o s t w h o s e
S en d S een U rl r e q u e s t fo r lin k ? No
th r e a d is n o t r u n n in g ? No W a it

Y es

I s th e r e a n y p a th o f
th a t h o s t ? No Ig n o re
S e n d th a t lin k Y e s /N o I s th e re a n y lin k fo r
Y es
b lo c k to U rlS e e n re q u e s te d h o s t ?
No

Y es
No

G e t th a t p a th a n d
s e t th is h o s t a s a
I s th e re a n y th r e a d r u n n in g .
r u n n in g f o r th is h o s t ?
N o tif y

L in k e x tr a c t
No th r e a d r u n
Y es

Is th is b lo c k h a s W r ite lin k to a
m a x im u m n u m b e r o f b lo c k a n d s e t th is
URL? h o s t a s a f in is h

Figure 4: Architectural Diagram of Link Extractor Manager

2.4 URL Seen

This module identifies the unseen URLs of a particular domain and filters out URLs that are already in the queue or have been downloaded earlier. For each request from the link extractor manager, this module assigns a thread to each block of unchecked URLs in the repository ExtractedUrlBlockRepo. The module builds a red-black tree from the unique URLs retrieved from UnseenUrlBlockRepo, which contains all the unique URLs of that domain collected since crawling started. For a block of unchecked extracted URLs, the module searches each URL in the red-black tree to check its uniqueness; if the URL is found in the tree, it is filtered out. At the end of this process, the module saves the unseen URLs to the repository UnseenUrlBlockRepo, updates the total number of already downloaded links and total unique links in the configuration files of the corresponding domain, and sends an acknowledgement to the link extractor manager.
Table 4: Pseudo code of URL Seen Module.

UrlSeen(Domain)
1. read block of unique URLs of Domain
2. make RB-tree with unique URL block
3. read block of unchecked URLs
4. for each url
5.   if RB-tree.check(url) is not found
6.     insert in RB-tree
7.     save in checked URL block
8. push this host to CrawlerManager Queue

Figure 5: URL Seen Architecture.
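A simplified sketch of the batch check in Table 4. WEBTracker builds a red-black tree from UnseenUrlBlockRepo, whereas this illustration uses Python's built-in set for the same membership test; the function name and arguments are our own:

# Sketch of the URL Seen step: load the domain's already-known unique URLs,
# then filter a block of unchecked extracted URLs down to the unseen ones.
def url_seen(unique_urls, unchecked_block):
    # unique_urls: all URLs of the domain known so far (paper: kept in a red-black tree)
    # unchecked_block: one block (up to 5000 URLs) from ExtractedUrlBlockRepo
    known = set(unique_urls)          # stand-in for the RB-tree of unique URLs
    unseen = []
    for url in unchecked_block:
        if url not in known:          # not found in the tree -> the URL is unseen
            known.add(url)            # insert so duplicates inside the block are caught too
            unseen.append(url)
    return unseen                     # to be appended to UnseenUrlBlockRepo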

2.5 Host Handler

This module is an important part of our system. In a single-node web crawler, it provides the unseen external domain URLs to the Crawler Manager; in a distributed crawling system, it sends those URLs to the Central Crawler Server instead. The host handler collects the unseen host URLs from the file named ExternalUrlBlockRepo.

Table 5: Pseudo code of HostHandler Module.

HostHandler()
1. read block for download
2. if sufficient for requestedHost
3.   send them
4. else
5.   read all different host blocks generated by the LinkExtractor
6.   find those which are not downloaded
7.   merge all hosts for download
8.   send them
9.   store rest of hosts for download

Figure 6: Host Handler Architecture.
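A rough sketch of the selection step in Table 5. The downloaded-hosts record and the requested-count parameter are assumed bookkeeping, not WEBTracker's interface:

# Sketch of the Host Handler: from the external-URL blocks, pick hosts that
# have not been downloaded yet and hand them out on request.
from urllib.parse import urlparse

def handle_hosts(external_url_blocks, downloaded_hosts, requested_count):
    # external_url_blocks: lists of URLs read from ExternalUrlBlockRepo
    # downloaded_hosts: set of domains already crawled (assumed bookkeeping)
    candidates = []
    seen = set()
    for block in external_url_blocks:
        for url in block:
            host = urlparse(url).netloc
            if host and host not in downloaded_hosts and host not in seen:
                seen.add(host)
                candidates.append(host)
    to_send = candidates[:requested_count]       # send as many hosts as requested
    leftover = candidates[requested_count:]      # store the rest for future requests
    return to_send, leftover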
3. PERFORMANCE ANALYSIS

We have used a single-node web crawler to measure the performance of our implemented system. After a couple of test runs and bug fixes, we recorded the data of the latest experiment. Our crawler machine has 2 GB of RAM, a Core2Duo 2.8 GHz processor and 2 Mbps (15 MB per minute) of bandwidth for crawling. The experiment was started with 40 seed URLs, and bandwidth utilization was recorded for 1000 minutes. Figures 7 to 10 show the outcomes of this experiment.

Figure 7 shows the number of pages downloaded every minute; the average is 220.294 pages per minute. Figure 8 shows the download size in KB per minute; the average download size is 10.493 MB per minute, whereas the maximum bandwidth provided to our system is 15 MB per minute. From Figure 7 we see that from minute 551 to minute 951 the number of pages downloaded per minute decreases, while Figure 8 shows that the downloaded data size is larger over that span. This means that the product of average page size and number of pages remains approximately constant, so we conclude that the crawler's performance is stable.

Figure 9 shows the number of HTTP requests sent every minute by the implemented crawler; the average is 244.638 requests per minute. Figure 10 shows the number of HTTP errors received every minute; the average is 58.222 errors per minute. The experimental results contain no unusual spikes, which indicates that the implemented crawler ran without glitches. Therefore, as we deploy more crawlers, the total bandwidth consumption should scale up accordingly.
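For reference, the bandwidth figures above follow from a short conversion; the utilization percentage is our own derived figure based on the reported averages:

\[
2\,\text{Mbps} = \frac{2 \times 60}{8}\,\text{MB/min} = 15\,\text{MB/min},
\qquad
\text{utilization} \approx \frac{10.493\,\text{MB/min}}{15\,\text{MB/min}} \approx 70\%.
\]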

Figure 7: Crawler Statistics for Total Downloaded Pages per Minute (X-axis: Time in Min, Y-axis: Downloaded Page Number).
Figure 8: Crawler Statistics for Total Download Size in KB per Minute (X-axis: Time in Min, Y-axis: Downloaded Page Size).
Figure 9: Crawler Statistics for Total Number of HTTP Requests per Minute (X-axis: Time in Min, Y-axis: Request Number).
Figure 10: Crawler Statistics for Total Number of HTTP Errors per Minute (X-axis: Time in Min, Y-axis: Number of Errors).

4. CONCLUSION

In this paper we have discussed how bandwidth utilization can be maximized for a web crawler. The most important technique we have used is that we never interrupt the downloading threads. All other post-download management work is also done in independent threads. Moreover, all the tasks of link extraction, URL seen and host handling are completed in batch-processing mode. Hence the processor of our system remains lightly loaded most of the time. Since most of our algorithms are file based, the RAM utilization is also low. All these considerations ensure that the downloader can use the system at full capacity most of the time. That is how we achieved roughly 10 MB per minute of web page downloads on a 15 MB per minute bandwidth allocation. In our implementation we have maintained politeness explicitly: only one URL of a domain is downloaded at a time. We have not presented a content-seen module, since it is outside the scope of this paper. Our future work is to deploy a distributed web crawling system with 20 web crawling nodes and at least 1 Gbps of bandwidth.

5. ACKNOWLEDGMENT

We are grateful to the Dept. of CSE, Shahjalal University of Science and Technology, for providing technological support for conducting this research in the IR Lab.

6. REFERENCES

[1] Cho, J. and Garcia-Molina, H. and Page, L. 1998. Efficient Crawling Through URL Ordering. Seventh International
World-Wide Web Conference (WWW 1998), April 14-18, 1998, Brisbane, Australia.
[2] Boldi P, Codenotti B, Santini M, Vigna S. Ubicrawler: Scalability and fault-tolerance issues. Poster Proceedings of the
11th International World Wide Web Conference, Honolulu, HI, 2002. ACM Press: New York, 2002.
[3] Paolo Boldi, Bruno Codenotti, Massimo Santini, and Sebastiano Vigna. UbiCrawler: A scalable fully distributed web
crawler. Software: Practice & Experience, 34(8):711-726, 2004.
[4] Vladislav Shkapenyuk and Torsten Suel. Design and Implementation of a High-Performance Distributed Web Crawler,
Proceedings of the 18th International Conference on Data Engineering, p.357, February 26-March 01, 2002 .
[5] Sergey Brin and Lawrence Page. "The anatomy of a large-scale hypertextual web search engine". In Proceedings of the Seventh International World-Wide Web Conference, Brisbane, Australia, April 1998.
[6] Google, http://www.google.com, Last Visited: 2011-11-07
[7] Paolo Boldi , Massimo Santini , Sebastiano Vigna, PageRank: Functional dependencies, ACM Transactions on

Information Systems (TOIS), v.27 n.4, p.1-23, November 2009, DOI = 10.1145/1629096.1629097.
[8] Arvind Arasu, Junghoo Cho, Hector Garcia-Molina, Andreas Paepcke, and Sriram Raghavan. “Searching the web”.
ACM Transactions on Internet Technology, 1(1):2–43, 2001
[9] Paolo Boldi, Bruno Codenotti, Massimo Santini, and Sebastiano Vigna. Trovatore: “Towards a highly scalable
distributed web crawler”. In Poster Proc. of Tenth International World Wide Web Conference, pages 140–141, Hong
Kong, China, 2001.
[10] Budi Yuwono, Savio L. Lam, Jerry H. Ying, and Dik L. Lee. “A world wide web resource discovery system”. In
Proceedings of the Fourth International World-Wide Web Conference, Darmstadt, Germany, April 1995, DOI =
10.1.1.51.4920.
[11] Ziv Bar-Yossef , Alexander Berg , Steve Chien , Jittat Fakcharoenphol , Dror Weitz, Approximating Aggregate Queries
about Web Pages via Random Walks, Proceedings of the 26th International Conference on Very Large Data Bases,
p.535-544, September 10-14, 2000
[12] The Robots Exclusion Standard, http://www.robotstxt.org/wc/exclusion.html, Last Visited: 2011-08-23

[13] Robots Exclusion Protocol, http://info.webcrawler.com/mak/projects/robots/exclusion.html, Last Visited: 2011-08-23
[14] J. Talim , Z. Liu , Ph. Nain , E. G. Coffman, Jr., Controlling the robots of Web search engines, ACM SIGMETRICS
Performance Evaluation Review, v.29 n.1, p.236-244, June 2001 DOI=10.1145/384268.378788.
[15] I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann, second edition, 1999, ISBN: 1558605703.
