WebTracker Paper - SUST Journal
Bandwidth Utilization
Md. Ruhul Amin, Assistant Professor, Dept. of CSE, Shahjalal University of Science and Technology
Mohiul Alam Prince, Software Engineer, Structured Data Systems Ltd, Dhaka
Md. Akter Hussain, Assistant Professor, Dept. of CSE, Shahjalal University of Science and Technology
ABSTRACT
The most challenging part of a web crawler is to download content at the fastest possible rate, so that the allocated bandwidth is fully utilized, while processing the downloaded data quickly enough that the downloader never starves. Our scalable web crawling system, named WEBTracker, has been designed to meet this challenge. It can be used very efficiently in a distributed environment to maximize downloading. WEBTracker has a Central Crawler Server that administers all the crawler nodes. At each crawler node, a Crawler Manager runs the downloader and manages the downloaded contents. The Central Crawler Server and its Crawler Managers are members of a Distributed File System, which ensures synchronized distributed operation of the system. In this paper, we concentrate only on the architecture of a web crawling node, which is owned by the Crawler Manager. We show that our crawler architecture makes efficient use of the allocated bandwidth, keeps the processor only lightly loaded with the processing of downloaded contents, and makes efficient use of memory.
1. INTRODUCTION
A web crawler automatically collects web content from the World Wide Web and stores this content in storage [1-4]. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine, which will index the downloaded pages to provide faster search services [5-8]. Crawlers can also be used for automating maintenance tasks on a website, checking links or validating HTML code, harvesting e-mail addresses (usually for spam), and collecting information on a particular topic. The main steps [1-4, 9-11] in web crawling are:
1. Select a URL from the list of URLs waiting to be downloaded.
2. Download the web page at that URL.
3. Extract the links from that page and store the page in secondary storage.
4. Add the unseen extracted links to the list of URLs to be downloaded.
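The loop formed by these steps can be made concrete with a short sketch. The following Java listing is illustrative only (the class and method names are ours, not WEBTracker's); it shows the four steps operating on a simple in-memory frontier:

import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

// Minimal sketch of the generic crawl loop listed above. Class and method
// names here are illustrative only; they are not WEBTracker's actual API.
public class SimpleCrawlLoop {

    public static void crawl(String seedUrl, int maxPages) {
        Queue<String> frontier = new ArrayDeque<>(); // URLs waiting to be downloaded
        Set<String> seen = new HashSet<>();          // URLs already queued or downloaded

        frontier.add(seedUrl);
        seen.add(seedUrl);

        int downloaded = 0;
        while (!frontier.isEmpty() && downloaded < maxPages) {
            String url = frontier.poll();                 // step 1: select a URL
            String page = fetch(url);                     // step 2: download the page
            if (page == null) continue;                   // skip failed downloads
            store(url, page);                             // step 3: store the page
            for (String link : extractLinks(page)) {      // step 3: extract its links
                if (seen.add(link)) {                     // step 4: queue only unseen links
                    frontier.add(link);
                }
            }
            downloaded++;
        }
    }

    // Placeholders; the module sketches in Section 2 show possible implementations.
    private static String fetch(String url) { return ""; }
    private static void store(String url, String page) { }
    private static List<String> extractLinks(String page) { return List.of(); }
}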
To develop a web crawler that downloads a few pages per second for a short period of time is an easy task, but building a high-performance system that can download hundreds of millions of pages over several weeks presents a number of challenges. It requires extensive experimental work on the system design to ensure I/O and network efficiency as well as manageability of the downloaded contents. Hence, to develop a high-performance web crawler, we need to solve the challenges of bandwidth utilization, I/O and network efficiency, and management of the downloaded contents.
To achieve these goals we have developed a scalable distributed crawling system, WEBTracker, in which the Central Crawler Server provides the domain links to the Crawler Manager running at each crawling node. The Crawler Manager runs the downloader, which is never interrupted unless its URL queue is empty. The downloaded web contents are processed in batch mode for URL extraction, URL seen checking and content storage. All of our algorithms require little run-time memory and are optimized for file-based operations. The implementation is now under experimental observation, and the performance measures of a crawling node are shown in this paper. In a nutshell, the implemented web crawler shows that the allocated bandwidth can be utilized efficiently while the processor and memory remain lightly loaded.
2. ARCHITECTURE OF A WEB CRAWLING NODE
The Crawler Manager owns the web crawling node and controls the following four major modules:
1. Downloader
2. LinkExtractorManager
3. URLSeen
4. HostHandler
Table 1 below defines a couple of very important terms that are required to understand the way our implemented web crawler works.
The producer modules maintain their own configuration files, which keep track of how many blocks have been created in each repository, and the consumer modules maintain their own configuration files to keep track of how many blocks have been processed from the repositories. The file structure of the WEBTracker repository is shown in Figure 2.
Figure 2: File structure of the WEBTracker repository (Root → Domain Contents DB → WebPage, WebPage Path, Extracted URL, External URL, Unseen URL and URL blocks).
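The producer/consumer bookkeeping described above can be sketched as follows. This is a minimal illustration assuming a one-file-per-repository configuration with two counters (blocksCreated, blocksProcessed); the actual WEBTracker file format may differ:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

// Illustrative sketch of the producer/consumer block bookkeeping described above.
// File names and property keys are assumptions, not WEBTracker's exact layout.
public class BlockRepositoryConfig {

    private final Path configFile;
    private final Properties props = new Properties();

    public BlockRepositoryConfig(Path repositoryDir) throws IOException {
        this.configFile = repositoryDir.resolve("repo.config");
        if (Files.exists(configFile)) {
            try (var in = Files.newInputStream(configFile)) {
                props.load(in);
            }
        }
    }

    // Called by a producer module after it closes a full block file.
    public synchronized int nextCreatedBlock() throws IOException {
        int created = Integer.parseInt(props.getProperty("blocksCreated", "0")) + 1;
        props.setProperty("blocksCreated", Integer.toString(created));
        save();
        return created;
    }

    // Called by a consumer module after it finishes processing one block.
    public synchronized int nextProcessedBlock() throws IOException {
        int processed = Integer.parseInt(props.getProperty("blocksProcessed", "0")) + 1;
        props.setProperty("blocksProcessed", Integer.toString(processed));
        save();
        return processed;
    }

    // A consumer may only read blocks that a producer has already completed.
    public synchronized boolean hasUnprocessedBlock() {
        int created = Integer.parseInt(props.getProperty("blocksCreated", "0"));
        int processed = Integer.parseInt(props.getProperty("blocksProcessed", "0"));
        return processed < created;
    }

    private void save() throws IOException {
        try (var out = Files.newOutputStream(configFile)) {
            props.store(out, "WEBTracker block repository state");
        }
    }
}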
2.1 Crawler Manager
This module is responsible for maintaining communication with the Central Crawler Server as well as controlling all four modules: Downloader, LinkExtractorManager, URLSeen and HostHandler. The task of the Crawler Manager is initiated by the command of the Central Crawler Server in a distributed crawling system; otherwise it can be initiated directly in a single-node web crawling system. In the crawling node, the Crawler Manager then searches for information about any previous crawl. If such information is found, it first adjusts itself to the previous configuration and resumes crawling exactly from the point where the last crawl ended. Otherwise it only reads the list of domains that are required to be crawled.
For each domain, the Crawler Manager reads the robots.txt file from the domain URL [1-4, 12, 13] and creates an object from the configuration written in that file. The Crawler Manager uses an individual thread for each domain, and this configuration is respected every time a URL of the domain is crawled.
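A rough sketch of this per-domain robots.txt handling is given below. It is an assumption-laden illustration (it honours only the Disallow rules of the User-agent: * group) rather than WEBTracker's actual implementation:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;

// Rough sketch of per-domain robots.txt handling. It only honours "Disallow"
// rules in the "User-agent: *" group; a production crawler would also handle
// Allow, Crawl-delay and agent-specific groups.
public class RobotsConfig {

    private final List<String> disallowedPrefixes = new ArrayList<>();

    public static RobotsConfig fetch(String domain) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request =
                HttpRequest.newBuilder(URI.create("http://" + domain + "/robots.txt")).build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());

        RobotsConfig config = new RobotsConfig();
        boolean inDefaultGroup = false;
        for (String line : response.body().split("\n")) {
            line = line.trim();
            if (line.toLowerCase().startsWith("user-agent:")) {
                inDefaultGroup = line.substring(11).trim().equals("*");
            } else if (inDefaultGroup && line.toLowerCase().startsWith("disallow:")) {
                String path = line.substring(9).trim();
                if (!path.isEmpty()) config.disallowedPrefixes.add(path);
            }
        }
        return config;
    }

    // Consulted before every URL of this domain is handed to the downloader.
    public boolean isAllowed(String path) {
        for (String prefix : disallowedPrefixes) {
            if (path.startsWith(prefix)) return false;
        }
        return true;
    }
}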
The list of domains is inserted into the HostQueue for crawling. For a brand new domain, an unseen-URL repository file named UnseenUrlBlockRepo is created for the domain, and the domain name is used as the starting URL. The Crawler Manager pops a domain name from the host queue and then reads a URL of this domain from the corresponding unseen URL block repository. The Crawler Manager then assigns this URL to a downloader in a separate thread. If no URL is available, the Crawler Manager calls the link extractor manager to provide unseen links of that domain, which is also done in a separate thread. Meanwhile, the Crawler Manager continues its work in its own independent thread and is never interrupted by the other management work. The maximum number of downloading threads is bounded by a configurable limit.
When the downloader finishes downloading a URL, it sends an acknowledgement to the Crawler Manager. The Crawler Manager then adds this domain to the HostQueue again and writes the path of the downloaded page to the WebPagePathBlockRepo repository.
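The behaviour described in the last two paragraphs can be condensed into the following sketch. The helper methods are stand-ins for the file-based repository operations and are not WEBTracker's actual API:

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

// Condensed sketch of the Crawler Manager loop described above. The helper
// methods at the bottom stand in for the repository operations; their names
// are our own assumptions.
public class CrawlerManagerSketch {

    private final BlockingQueue<String> hostQueue = new LinkedBlockingQueue<>();
    private final ExecutorService downloadPool = Executors.newFixedThreadPool(16); // configurable thread limit

    public void run() throws InterruptedException {
        while (true) {
            String host = hostQueue.take();            // pop the next domain to serve
            String url = readUnseenUrl(host);          // one URL from the domain's UnseenUrlBlockRepo
            if (url == null) {
                requestLinkExtraction(host);           // ask the LinkExtractorManager in another thread
                continue;                              // the manager itself never blocks on this
            }
            downloadPool.submit(() -> {
                String pagePath = downloadAndSave(url);  // only one page per domain at a time (politeness)
                recordPagePath(host, pagePath);          // append the page path to WebPagePathBlockRepo
                hostQueue.add(host);                     // acknowledgement: the host becomes eligible again
            });
        }
    }

    // Placeholders for the file-based repository operations.
    private String readUnseenUrl(String host) { return null; }
    private String downloadAndSave(String url) { return ""; }
    private void recordPagePath(String host, String path) { }
    private void requestLinkExtraction(String host) { }
}

Because a host re-enters the HostQueue only after its current download completes, at most one page of a domain is being downloaded at any time, which is how the politeness constraint mentioned in the conclusion is enforced.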
2.2 Downloader
The Downloader module is the simplest module of this project. The Crawler Manager calls this module with a URL and a repository path (RepositoryPath) for saving the downloaded page. For a particular domain, only one URL or page is downloaded at a time, in an individual thread. After the download completes, the downloader sends an acknowledgement to the Crawler Manager so that it can update the management files and start another new download from that domain.
Table 2: Pseudo code of Downloader Module.
Downloader(url)
1. download the page at url
2. save page
3. acknowledge the Crawler Manager
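A minimal downloader along these lines could look as follows; the file-naming scheme inside the repository path is an assumption made for illustration:

import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

// Minimal sketch of the Downloader module: fetch one URL and save the body
// under the repository path given by the Crawler Manager.
public class DownloaderSketch {

    private static final HttpClient CLIENT = HttpClient.newHttpClient();

    // Returns the path of the saved page so the Crawler Manager can record it.
    public static Path download(String url, Path repositoryPath)
            throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<String> response = CLIENT.send(request, HttpResponse.BodyHandlers.ofString());

        Files.createDirectories(repositoryPath);
        Path file = repositoryPath.resolve(Integer.toHexString(url.hashCode()) + ".html");
        Files.writeString(file, response.body());
        return file;  // the caller sends the acknowledgement to the Crawler Manager
    }
}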
2.3 Link Extractor Manager
The link extractor manager is an essential module of this web crawler. Its main task is to extract links from the downloaded web pages of a particular domain. It mainly creates two lists of unseen URLs: for a particular domain, one list contains the internal URLs (within the same domain), saved in a repository named ExtractedUrlBlockRepo, and the other contains the external URLs (to other domains), saved in a repository named ExternalUrlBlockRepo. For the link extraction of each domain, the link extractor manager creates an individual thread to serve the unseen URLs. When the Crawler Manager needs URLs for a particular domain, it calls the link extractor manager, which first tries to serve the request immediately by providing unseen URLs already extracted from the downloaded pages of the corresponding domain. Otherwise the link extractor manager assigns a thread to extract links from the unprocessed downloaded web pages of the corresponding domain, whose path locations are found in the repository named WebPagePathBlockRepo. The extracted URLs from the web pages are written to the repository named ExtractedUrlBlockRepo. As soon as the link extraction for a web page is completed, the link extractor manager calls the URL seen module to obtain the unseen URLs from ExtractedUrlBlockRepo and sends an acknowledgement to the Crawler Manager. Generally the link extractor manager parses at most 5000 URLs from the unprocessed web pages and stores them as a block in ExtractedUrlBlockRepo. The URL seen module then processes this block in a separate thread.
Table 3: Pseudo code of Link Extractor Manager Module.
LinkExtractorManager()
...
5. open new block to store
...
9. host = pop(Queue)
10. if availPage(host)
11. LinkExtract(hostPage)
12. wait

LinkExtract(page)
...
5. if block has more than 5000 links
...
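The core of the LinkExtract step, splitting the links of one page into internal and external URLs, can be sketched as follows. A production extractor would use a real HTML parser; the regular expression here is only illustrative:

import java.net.URI;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Simplified sketch of per-page link extraction: pull href values out of a
// downloaded page and split them into internal and external URLs.
public class LinkExtractSketch {

    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    public static void extract(String pageHtml, String baseUrl,
                               List<String> internalOut, List<String> externalOut) {
        URI base = URI.create(baseUrl);
        Matcher m = HREF.matcher(pageHtml);
        while (m.find()) {
            try {
                URI link = base.resolve(m.group(1));   // make relative links absolute
                String host = link.getHost();
                if (host == null) continue;            // skip mailto:, javascript:, etc.
                if (host.equalsIgnoreCase(base.getHost())) {
                    internalOut.add(link.toString());  // destined for ExtractedUrlBlockRepo
                } else {
                    externalOut.add(link.toString());  // destined for ExternalUrlBlockRepo
                }
            } catch (IllegalArgumentException malformed) {
                // ignore links that cannot be parsed as URIs
            }
        }
    }
}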
Figure: Flowchart of the Link Extractor Manager, showing how a request for URLs from the Crawler Manager is handled: checking whether unseen links are already available for the requested host, whether unprocessed page paths of that host remain, whether a link extraction thread is already running for the host, and when extracted link blocks (of at most the maximum number of URLs) are written and sent to UrlSeen.
2.4 URL Seen
This module is used to identify the unseen URLs of a particular domain and filter out the URLs that are already in the queue or were downloaded earlier. On a request from the link extractor manager, this module assigns a thread to each block of unchecked URLs in the repository ExtractedUrlBlockRepo. The module builds a red-black tree from the unique URLs retrieved from UnseenUrlBlockRepo, which contains all the unique URLs of that domain collected since the starting point. For a block of unchecked extracted URLs, the module searches for each URL in the red-black tree to check its uniqueness; if the link is found in the tree, it is filtered out. At the end of this process, the module saves the unseen URLs to the repository UnseenUrlBlockRepo, updates the total number of already downloaded links and total unique links in the configuration files of the corresponding domain, and sends an acknowledgement to the link extractor manager.
Table 4: Pseudo code of URL Seen Module.
UrlSeen(Domain)
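Since java.util.TreeSet is backed by a red-black tree, the URL seen check for one block can be sketched directly with it. The one-URL-per-line file format is an assumption for illustration:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Sketch of the URL Seen check for one domain. TreeSet is backed by a
// red-black tree, matching the data structure described above.
public class UrlSeenSketch {

    // Returns the URLs of the extracted block that were never seen before and
    // appends them to the domain's UnseenUrlBlockRepo file.
    public static List<String> filterUnseen(Path unseenUrlBlockRepo, Path extractedUrlBlock)
            throws IOException {
        TreeSet<String> known = new TreeSet<>();      // red-black tree of all unique URLs so far
        if (Files.exists(unseenUrlBlockRepo)) {
            known.addAll(Files.readAllLines(unseenUrlBlockRepo));
        }

        List<String> newlySeen = new ArrayList<>();
        for (String url : Files.readAllLines(extractedUrlBlock)) {
            if (known.add(url)) {                     // add() returns false for duplicates
                newlySeen.add(url);
            }
        }

        Files.write(unseenUrlBlockRepo, newlySeen,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);  // persist the unseen URLs
        return newlySeen;
    }
}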
2.5 Host Handler
This module is an important part of our system. For a single-node web crawler, this module provides the unseen external domain URLs to the Crawler Manager; in a distributed crawling system, it sends those URLs to the Central Crawler Server instead. The host handler collects the unseen host URLs from the ExternalUrlBlockRepo blocks generated by the link extractor manager.
Table 5: Pseudo code of Host Handler Module.
HostHandler()
...
3. send them
4. else
5. read all the different-host blocks which are generated by the LinkExtractor
6. find those which are not downloaded
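A simplified sketch of this host handling step is shown below; the directory layout (one file per external URL block) is an assumption made for illustration:

import java.io.IOException;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Sketch of the Host Handler step: gather external host names produced by the
// link extractor, drop those already known, and return the rest so they can be
// handed to the Crawler Manager (or, in the distributed setting, sent to the
// Central Crawler Server).
public class HostHandlerSketch {

    public static Set<String> collectNewHosts(Path externalUrlBlockRepo, Set<String> knownHosts)
            throws IOException {
        List<Path> blocks;
        try (Stream<Path> files = Files.list(externalUrlBlockRepo)) { // one file per external URL block
            blocks = files.collect(Collectors.toList());
        }

        Set<String> newHosts = new LinkedHashSet<>();
        for (Path block : blocks) {
            for (String url : Files.readAllLines(block)) {
                String host = URI.create(url).getHost();
                if (host != null && !knownHosts.contains(host)) {
                    newHosts.add(host);                // unseen host: forward it for crawling
                }
            }
        }
        return newHosts;
    }
}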
3. BANDWIDTH UTILIZATION
We have used a single-node web crawler to measure the performance of our implemented system. After a couple of test runs and bug fixes, we took the data of the latest experiment. Our crawler machine has 2 GB of RAM, a Core2Duo 2.8 GHz processor and 2 Mbps (15 MB per minute) of bandwidth for crawling the data. The experiment was started with 40 seed URLs, and we recorded the bandwidth utilization measurements for 1000 minutes. Figures 7 to 10 present these measurements.
In Figure 7, we show the number of pages downloaded in every minute. The average number of pages downloaded per minute by our crawler is 220.294. In Figure 8 we show the download size in KB per minute. The average download size of the web crawler is 10.493 MB per minute, whereas the maximum bandwidth provided to our system is 15 MB per minute. From Figure 7 we see that, from minute 551 to minute 951, the number of pages downloaded per minute decreases, while Figure 8 shows that the downloaded data size is larger over the same span. This means the total downloaded data per minute, i.e. the product of the number of pages and the average page size, stays approximately the same. Hence we can conclude that the crawler performance is stable.
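As a quick consistency check (our own back-of-the-envelope arithmetic, not a reported measurement), the two averages imply an average page size of roughly 49 KB and a bandwidth utilization of roughly 70%:

\[
\text{avg.\ page size} \approx \frac{10.493 \times 1024\ \text{KB/min}}{220.294\ \text{pages/min}} \approx 48.8\ \text{KB/page},
\qquad
\text{utilization} \approx \frac{10.493\ \text{MB/min}}{15\ \text{MB/min}} \approx 70\%.
\]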
Figure 9 shows the number of HTTP requests sent in every minute by the implemented crawler; the average number of HTTP requests sent per minute is 244.638. Figure 10 shows the number of HTTP errors received by the crawler in every minute; the average is 58.222 HTTP errors per minute. The experimental results show no unusual spikes, which indicates that the implemented crawler ran without glitches. Therefore, as we deploy more crawlers, the total bandwidth consumption is expected to scale accordingly.
Figure 7: Crawler statistics for total downloaded pages per minute (X-axis: time in minutes, Y-axis: number of downloaded pages).
Figure 8: Crawler statistics for total download size in KB per minute (X-axis: time in minutes, Y-axis: downloaded page size in KB).
Figure 9: Crawler statistics for total number of HTTP requests per minute (X-axis: time in minutes, Y-axis: number of requests).
Figure 10: Crawler statistics for total number of HTTP errors per minute (X-axis: time in minutes, Y-axis: number of errors).
4. CONCLUSION
In this paper we have discussed how bandwidth utilization can be maximized for a web crawler. The most important technique we have used is that we never interrupt the downloading threads. All other post-download management work is also done in independent threads. Moreover, all the tasks of link extraction, URL seen checking and host handling are completed in batch processing mode. Hence, most of the time the processor of our system remains lightly loaded. Since most of our algorithms are file based, the RAM utilization is also low. All these considerations ensure that the downloader can use the system at full capacity most of the time. That is why we could achieve about 10 MB per minute of web page downloading over a 15 MB per minute bandwidth allocation. In our implementation we have maintained politeness explicitly (only one URL of a domain can be downloaded at a time). We did not present any module for content-seen checking, since it is outside the scope of this paper. Our future work is to deploy a distributed web crawling system with 20 web crawling nodes.
5. ACKNOWLEDGMENT
We are grateful to the Dept. of CSE, Shahjalal University of Science and Technology, for providing us technological support for conducting this research.
6. REFERENCES
[1] Cho, J. and Garcia-Molina, H. and Page, L. 1998. Efficient Crawling Through URL Ordering. Seventh International
World-Wide Web Conference (WWW 1998), April 14-18, 1998, Brisbane, Australia.
[2] Boldi P, Codenotti B, Santini M, Vigna S. Ubicrawler: Scalability and fault-tolerance issues. Poster Proceedings of the
11th International World Wide Web Conference, Honolulu, HI, 2002. ACM Press: New York, 2002.
[3] Paolo Boldi, Bruno Codenotti, Massimo Santini, and Sebastiano Vigna. UbiCrawler: A scalable fully distributed web
crawler. Software: Practice & Experience, 34(8):711-726, 2004.
[4] Vladislav Shkapenyuk and Torsten Suel. Design and Implementation of a High-Performance Distributed Web Crawler. Proceedings of the 18th International Conference on Data Engineering, p.357, February 26-March 01, 2002.
[5] Sergey Brin and Lawrence Page. "The anatomy of a large-scale hypertextual web search engine". In Proceedings of the Seventh International World-Wide Web Conference, Brisbane, Australia, April 1998.
[6] Google, http://www.google.com, Last Visited: 2011-11-07
[7] Paolo Boldi , Massimo Santini , Sebastiano Vigna, PageRank: Functional dependencies, ACM Transactions on
Information Systems (TOIS), v.27 n.4, p.1-23, November 2009, DOI = 10.1145/1629096.1629097.
[8] Arvind Arasu, Junghoo Cho, Hector Garcia-Molina, Andreas Paepcke, and Sriram Raghavan. “Searching the web”.
ACM Transactions on Internet Technology, 1(1):2–43, 2001
[9] Paolo Boldi, Bruno Codenotti, Massimo Santini, and Sebastiano Vigna. Trovatore: “Towards a highly scalable
distributed web crawler”. In Poster Proc. of Tenth International World Wide Web Conference, pages 140–141, Hong
Kong, China, 2001.
[10] Budi Yuwono, Savio L. Lam, Jerry H. Ying, and Dik L. Lee. “A world wide web resource discovery system”. In
Proceedings of the Fourth International World-Wide Web Conference, Darmstadt, Germany, April 1995, DOI =
10.1.1.51.4920.
[11] Ziv Bar-Yossef , Alexander Berg , Steve Chien , Jittat Fakcharoenphol , Dror Weitz, Approximating Aggregate Queries
about Web Pages via Random Walks, Proceedings of the 26th International Conference on Very Large Data Bases,
p.535-544, September 10-14, 2000
[12] The Robots Exclusion Standard, http://www.robotstxt.org/wc/exclusion.html, Last Visited: 2011-08-23