
Fuzzy Based Approach to URL Assignment in Dynamic Web Crawler

Raghav Sharma
Computer Science & Engineering Department
PEC University of Technology
Chandigarh, India
rghvsharma4@gmail.com

Rajesh Bhatia
Computer Science & Engineering Department
PEC University of Technology
Chandigarh, India
rbhatiapatiala@gmail.com

Abstract— The WWW is a huge collection of unorganized documents. A web crawler is the process by which search engines build their databases from this unorganized web. A crawler that interacts with millions of web pages has to be made efficient in order to make a search engine powerful. The fast-increasing size of the web therefore necessitates the parallelization of web crawlers to enhance the download rate. This paper reviews different parallel web crawling techniques in the literature and proposes an approach for URL assignment in a dynamic parallel web crawler using fuzzy logic. The approach addresses two important aspects of a crawler: first, to create a crawling framework with load balancing among the parallel crawlers; second, to make the crawling process fast by using parallel crawlers with efficient network access.

Keywords— Static Parallel Crawler, Dynamic Parallel Crawler, Fuzzy Logic.

I. INTRODUCTION

A crawler is a program that downloads and stores Web pages, often for a Web search engine. A crawler plays a vital role in data mining algorithms in many fields of research, e.g. mining Twitter data for opinion mining or finding the success ratio of projects on funding sites like Kickstarter [1, 2]. Generally, a web crawler starts its work from a single seed URL, placing it in a queue Q0 in which it keeps all the URLs still to be processed. From there, it extracts a URL based on some ordering, downloads that page, extracts any URLs found in the downloaded page, and puts them in the same queue. It repeats this cycle until it is stopped.
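This loop can be made concrete in a short sketch. The following is a minimal single-process illustration of the queue-driven cycle described above, not the paper's implementation; fetch_page and extract_links are hypothetical stand-ins for a real HTTP client and HTML link parser.

from collections import deque

def crawl(seed_url, fetch_page, extract_links, max_pages=1000):
    """Minimal sketch of the basic crawl cycle: take a URL from the
    queue Q0, download the page, enqueue newly discovered links."""
    q0 = deque([seed_url])       # Q0: the queue of URLs still to process
    visited = set()              # pages already downloaded
    while q0 and len(visited) < max_pages:
        url = q0.popleft()       # "some ordering": FIFO here
        if url in visited:
            continue
        visited.add(url)
        page = fetch_page(url)               # hypothetical HTTP helper
        for link in extract_links(page):     # hypothetical link parser
            if link not in visited:
                q0.append(link)
    return visited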
The major difficulty for a single-process web crawler is that crawling the web may take months, and in the meantime a number of web pages may have changed and are thus no longer useful to end users. So, to minimize the download time, search engines execute multiple crawlers simultaneously, known as parallel web crawlers.

An appropriate architecture for a parallel crawler demands that the overlap among the web pages downloaded by the different parallel agents be negligible. Further, the coverage rate of the web should not be compromised within each parallel agent's range. Next, the quality of the web crawled should not be less than that of a single centralized crawler [4]. To achieve all this, the communication overhead must also be taken into account, so as to achieve a tradeoff among the objectives and build an optimized crawler.

Besides these challenges, the advantages of a parallel crawler over a single-process crawler are [4]:

• Scalability: With millions of pages being added to the web daily, it is almost impossible to crawl the web with a single-process crawler.

• Network Load Dispersion: With parallel crawlers, we can disperse the load over multiple regions rather than overloading one local network.

• Network Load Reduction: By letting parallel agents crawl local data (of the same country or region as the crawler), pages only have to travel through the local network, thereby reducing the overall network load.

Further, to reduce the overlap among the pages downloaded by the parallel crawlers, the parallel agents need to coordinate. On that basis, a parallel crawler can be implemented in three ways [4]:

• Static Parallel Crawler: The web is partitioned by some logic and each crawler knows its own partition to crawl, so there is no need for a central coordinator.

• Dynamic Parallel Crawler: A central coordinator assigns URLs to the different parallel agents based on some logic, i.e. the web is partitioned by the central coordinator at run time.

• Independent Parallel Crawler: There is no coordination among the parallel agents; each agent continues crawling from its own seed URL. The overlap can be significant in this case unless the domains of the crawl agents are limited and entirely different for each agent.
A. Static Parallel Crawler

As discussed, a static parallel crawler needs no central coordinator. Instead, we need a good method to partition the web before crawling begins. A number of partitioning schemes have been proposed, as follows [5] (a sketch of the two hash-based schemes follows this list):

• URL Hash Based: A page is sent to a parallel agent based on the hash value of its URL. During a crawl, a parallel agent may therefore be unable to crawl URLs of the same site, because they produce different hash values, which leads to interpartition links.

• Site Hash Based: The hash value is calculated only on the site name of the URL, so URLs of the same site are crawled by the same parallel agent, resulting in fewer interpartition links and, in turn, less communication bandwidth.

• Hierarchical: Partitioning is done on the basis of attributes like country, language, or the type of URL extension.
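The difference between the two hash-based schemes comes down to which string is hashed. A minimal sketch follows, assuming N agents numbered 0 to N-1; the use of MD5 here is our illustrative choice, not prescribed by [5].

import hashlib
from urllib.parse import urlparse

def url_hash_partition(url, n_agents):
    """URL hash based: hash the full URL, so two pages of the same
    site may land on different agents (interpartition links)."""
    digest = hashlib.md5(url.encode()).hexdigest()
    return int(digest, 16) % n_agents

def site_hash_partition(url, n_agents):
    """Site hash based: hash only the host name, so all URLs of
    one site are crawled by the same agent."""
    site = urlparse(url).netloc
    digest = hashlib.md5(site.encode()).hexdigest()
    return int(digest, 16) % n_agents

With this split, site_hash_partition maps http://www.abc.com/a.html and http://www.abc.com/b.html to the same agent, while url_hash_partition may separate them, which is exactly the source of interpartition links.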
One concern in the literature on static parallel crawlers is the mode of job division among the parallel agents. There are different modes of job division: firewall, crossover, and exchange [5]. Under the first mode, each parallel agent crawls pages in its own partition only, neglecting the interpartition links. Under the second mode, a parallel agent primarily crawls links of its own partition, and only when no links are left to crawl in that partition does it move on to interpartition links. Under the third mode, parallel agents communicate through message exchanges whenever they encounter an interpartition link, so as to increase coverage and decrease overlap.

The drawbacks of the static parallel crawler are as follows:

• Scalability: In order to reduce the overlap and increase the coverage, there should be N! connections for transferring URLs to the appropriate parallel agents when the number of parallel crawlers is N.

• Quality of web pages crawled: Each parallel agent is unaware of the web crawled by the other agents. The agents therefore have no global image of the crawled web, and the decision of URL selection is based entirely on a subset of the crawled web, namely the part crawled by that agent itself.

B. Dynamic Parallel Crawler

As discussed, in a dynamic parallel crawler a central coordinator manages the assignment of URLs to the different crawl agents. The architecture of the dynamic web crawler is as follows:

Figure 1: Architecture of Dynamic Parallel Crawler

The dynamic parallel crawler starts its work from the central coordinator, as depicted in figure 1. Each crawl agent behaves as a separate single-process crawler, receiving the seed URL for its domain from the central coordinator. It then downloads pages from the web, extracts the URL links from each downloaded page, and sends a link to the central coordinator for assignment whenever it lies outside the domain of the crawl agent. The domain of each parallel agent is implementation specific. Further, the dynamic parallel crawler has a number of advantages, explained as follows (a rough sketch of the coordination pattern follows this list):

• Crawling Decision: The static parallel crawler suffers from poor crawling decisions, i.e. which web page to crawl next, because no crawling agent has a complete view of the crawled web. In the dynamic parallel crawler, the central coordinator has a global image of the web, and the decisions about URL selection and assignment are taken by the central coordinator, not by the crawl agents [6].

• Scalability: In the dynamic parallel crawler, only N connections to the central coordinator are needed for URL assignment when the number of crawl agents is N. If a new crawl agent is added to the system, only one socket connection is required between that crawl agent and the central coordinator.

• Minimizing Web Server Load: An important requirement for a web crawler is that it should not overload a server with its requests. It has been observed that a web page contains many links to pages of the same web server. The crawl agents send the extracted URL links to the central coordinator, which returns only the most important unvisited links; since it is unlikely that all pages of one server are always important, the load on any single web server decreases.
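As a rough sketch of this coordination pattern (our illustration, not the concrete architecture of [6]): the coordinator keeps the global visited set and an importance-scored frontier, and hands out the best unvisited URL while capping how many URLs of one server are outstanding. The score_url function and the per-server cap are assumed for illustration.

from urllib.parse import urlparse

class CentralCoordinator:
    """Sketch: global frontier plus visited set; assigns the most
    important unvisited URL to a requesting crawl agent."""
    def __init__(self, score_url, per_server_cap=2):
        self.score_url = score_url        # hypothetical importance function
        self.per_server_cap = per_server_cap
        self.frontier = {}                # url -> importance score
        self.visited = set()
        self.in_flight = {}               # server -> URLs currently assigned

    def submit(self, urls):
        """Called by agents with the links they extracted."""
        for url in urls:
            if url not in self.visited:
                self.frontier[url] = self.score_url(url)

    def next_url(self):
        """Return the best unvisited URL whose server is not saturated."""
        for url in sorted(self.frontier, key=self.frontier.get, reverse=True):
            server = urlparse(url).netloc
            if len(self.in_flight.get(server, [])) < self.per_server_cap:
                del self.frontier[url]
                self.visited.add(url)
                self.in_flight.setdefault(server, []).append(url)
                return url
        return None

    def mark_done(self, url):
        """Called by an agent once the page has been downloaded."""
        server = urlparse(url).netloc
        if url in self.in_flight.get(server, []):
            self.in_flight[server].remove(url)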
The design of a dynamic web crawler poses a number of challenges too, which need to be addressed:

• Which distribution algorithm should be used for URL assignment?

• How to distribute jobs to the different crawlers based on their health, i.e. how to select the crawler for a URL so as to optimize load balancing?

• How to manage the already crawled pages so as to avoid replication of pages in the database?

The main focus of this paper is URL assignment strategies, one of the important functionalities of the dynamic parallel crawler.

The paper is organized as follows: related work on dynamic URL assignment is reviewed in section II; section III describes the proposed fuzzy technique for URL assignment; section IV describes the fuzzy phase of the technique, including the benefits of the proposed architecture; section V concludes the paper.
II. RELATED WORK

A. Hash Based Approach for URL Assignment

The basic hash-based approach to URL assignment uses a key value computed from each URL to determine the crawl agent that will parse it. [7] proposed an architecture for URL assignment that transforms the URL into a set of numerical values representing the coordinates of a vector (x, y, z) in 3D space. Using such a transformation, a number of values can be generated from a single URL. [7] used the standard URI definition of RFC 3986 to split the URL.

A Uniform Resource Identifier (URI) is a string of characters used to identify a resource on the internet. In [7], the URL string is split into three parts: scheme and domain; path; and query and fragment. These elements serve as the coordinate functions of the vector space. For example, the URL "http://www.abc.com/index.php?q=1#session" corresponds to the following coordinates:

  URI Part           Substring            Coordinate
  Scheme & Domain    http://www.abc.com   X = 3487
  Path               /index.php?          Y = 2241
  Query & Fragment   q=1#session          Z = 744

  Table 1: Coordinates of URI parts

In this way, the URI structure is transformed into a 3D vector space over which a fuzzy clustering technique can be applied to assign a particular URL to a specific crawl agent. The advantage of this scheme is that it is easy to implement, but it does not reflect the locality structure of the links.
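A minimal sketch of the transformation: split the URL into its RFC 3986 components and encode each part as one coordinate. [7] does not spell out the numeric encoding behind the values in Table 1, so char_value below (a plain sum of code points) is only an assumed placeholder for it.

from urllib.parse import urlsplit

def char_value(s):
    """Assumed placeholder encoding (sum of code points); the actual
    numeric mapping used in [7] is not reproduced here."""
    return sum(ord(c) for c in s)

def url_to_vector(url):
    """Split a URL into scheme+domain, path, and query+fragment
    (RFC 3986 parts) and encode each part as one coordinate."""
    p = urlsplit(url)
    x = char_value(p.scheme + "://" + p.netloc)
    y = char_value(p.path)
    tail = ("?" + p.query if p.query else "") + \
           ("#" + p.fragment if p.fragment else "")
    z = char_value(tail)
    return (x, y, z)

# Each URL becomes one point in 3D space; a (fuzzy) clustering over
# these points then decides which crawl agent serves which region.
url_to_vector("http://www.abc.com/index.php?q=1#session")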
B. Domain Specific Approach for URL Assignment

One proposed approach to the dynamic partitioning of the web is based on the domains of the crawl agents. It is motivated by the fact that web pages are more likely to link to pages that are relevant to the domain of the page itself [8]. So, after retrieving a web page, the crawl agent has to analyze it in order to predict its relevance to one of the many domains of all crawl agents. Breaking a domain down into sub-domains can then add new crawl agents, increasing the scalability of the system [8].

The domain-oriented partitioning approach requires the initial seed URLs of the various domains to be gathered; these can be represented by hub pages, which consist primarily of links to pages that are highly relevant to the various domains.

Once a web page is downloaded, it is fed to the parser, classifier, and dispatcher modules. The role of the parser module is to extract the HTML components of the page, in particular the list of new unvisited URLs specified in the href attributes of the anchor tags. Next, the classifier module analyzes the domain of the web page and adds the page to the repository associated with that domain. It also tags the URL of the downloaded page with its domain in the same database that stores the unvisited URLs.

Finally, the URL dispatcher sends the discovered URLs to the central coordinator, performing the following steps [8]:

• It restores the relative addresses of hyperlinks to absolute addresses, which is important because a number of documents can refer to the same URL.

• It tries to predict the domains of the discovered URLs with the help of the tagged source URLs from which they were discovered, reflecting the hyperlinked behavior of the web, i.e. web pages are most likely to link to pages of the same domain.

• It checks the discovered pages for duplicates against the already visited pages of the same domain partition pool.
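A compact sketch of these three dispatcher steps, assuming a domain_tags store filled by the classifier and a visited_by_domain index; the data structures are our illustration of the description in [8], not its code.

from urllib.parse import urljoin

def dispatch(source_url, raw_links, domain_tags, visited_by_domain):
    """Sketch of the URL dispatcher: absolutize links, predict their
    domain from the source page's tag, and drop duplicates."""
    source_domain = domain_tags.get(source_url, "unknown")
    dispatched = []
    for link in raw_links:
        # Step 1: restore relative hyperlinks to absolute addresses.
        absolute = urljoin(source_url, link)
        # Step 2: pages tend to link within their own domain, so the
        # source page's domain tag serves as the prediction.
        predicted = source_domain
        # Step 3: duplicate check against the visited pool of that domain.
        if absolute in visited_by_domain.get(predicted, set()):
            continue
        dispatched.append((absolute, predicted))
    return dispatched   # handed on to the central coordinator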
C. Virtualized URL Assignment

In this approach, the concept of virtualization is used to run the parallel crawlers. Multiple cores of a multicore processor are treated as virtual machines that interact with each other through a shared memory region using VMCI (Virtual Machine Communication Interface). The virtual machines are also treated as clusters, and URLs belonging to a cluster are served by the virtual machines of the corresponding cluster [9]. Initially, an injector module is needed to provide the seed URLs, which are used by the clustering module for cluster formation. The clustering module calculates the hash value of each URL and, using the URI, identifies the cluster to which the URL belongs. The URL is then assigned to a virtual machine depending on the threshold value of that machine, which is decided according to the availability of the virtual machine.
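In outline, the assignment rule can be sketched as follows (the VMCI plumbing of [9] is not shown, and the cluster and threshold structures are assumed): hash the URL's host to pick a cluster, then hand the URL to a machine of that cluster that is still below its load threshold.

import hashlib
from urllib.parse import urlparse

def assign_to_vm(url, clusters, load, threshold):
    """Sketch: clusters is a list of lists of VM ids; load maps a VM id
    to its current queue length; threshold is the per-VM capacity."""
    host = urlparse(url).netloc
    idx = int(hashlib.md5(host.encode()).hexdigest(), 16) % len(clusters)
    for vm in clusters[idx]:               # machines of the URL's cluster
        if load[vm] < threshold:           # availability check
            load[vm] += 1
            return vm
    return None                            # cluster saturated; retry later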
III. PROPOSED FUZZY BASED APPROACH FOR URL ASSIGNMENT

After a systematic literature survey of parallel crawlers and of the material associated with the URL assignment problem, we can safely say that limited work has been done to exploit fuzzy logic in parallel crawlers. In this work, we propose a parallel architecture in which the systems can be geographically distributed. As shown in figure 2, the system has three main components: a fuzzy logic controller, a URL distributor, and a set of parallel agents.

Figure 2: Architecture of a Parallel Web Crawler
a) Fuzzy Logic Controller

The main task of this component is load optimization among the parallel agents, achieved by monitoring the health of the parallel agents at regular intervals. The approach is discussed in depth in section IV.

b) URL Distributor

The URL distributor selects a set of URLs from the database and, with the help of the fuzzy logic controller, distributes them to the different parallel agents while optimizing the load. It connects to the crawlers over HTTP connections and receives, analyzes, and stores their results, after aggregation, in the central repository.

c) Parallel Agents

Each parallel agent works like a single centralized crawler, starting from a single URL or a set of URLs and analyzing every page specified by each URL. The link extractor analyzes a page to identify its links and makes a list of them. The page downloader takes a new URL from this list and downloads the page from the internet if it is not already in the local cache; it then sends this page back to the link extractor for analysis. The parallel agents keep track of their own health based on the attributes depicted in figure 2, which are discussed in the next section. Further, they send the list of URLs, together with the corresponding health values, to the central coordinator at regular time intervals (crawl sessions), so that the fuzzy logic controller can distribute the load according to the health of each machine.
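A sketch of the per-agent health record that such a session report could carry, using the four attributes defined in the next section; the field names mirror the paper's symbols, while the report format itself is an assumption.

from dataclasses import dataclass

@dataclass
class AgentHealth:
    """Health attributes an agent reports after each crawl session."""
    a: float      # A: last average page size downloaded, in kB
    td: float     # Td: last average time to download p pages, seconds
    tdd: float    # Tdd: last average time to save p pages to disk, seconds
    tdns: float   # Tdns: last average DNS resolution time, seconds

def session_report(agent_id, extracted_urls, health):
    """What a parallel agent sends to the central coordinator
    at the end of a crawl session."""
    return {"agent": agent_id, "urls": extracted_urls, "health": health}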
IV. FUZZY PHASE

In the current context, fuzzy logic can be exploited for multiple tasks, including:

a) Assignment of a URL to the priority queue for crawling sessions.

b) Assignment of a URL to the machine or core used for invoking crawling sessions in the parallel crawlers.

The fuzzy logic controller is responsible for implementing the fuzzy logic for the above tasks. The technique makes the following assumptions, shown in the table in figure 2:

• "A" is the last average page size, in kB, downloaded by a machine "m".

• "Td" is the last average time, in seconds, taken by a machine "m" to download "p" pages.

• "Tdd" is the last average time, in seconds, taken by a machine "m" to save the "p" pages to the disk "d".

• "Tdns" is the last average time, in seconds, taken by a machine "m" for DNS resolution.

• "AL" = {normal, small, large} is the set of linguistic variables describing "A" over a discrete range of values.

• "TdL" = {less, more} is the set of linguistic variables describing "Td" over a discrete range of values.

• "TddL" = {less, more} is the set of linguistic variables describing "Tdd" over a discrete range of values.

• "TdnsL" = {less, more} is the set of linguistic variables describing "Tdns" over a discrete range of values.

So, the Input-Set is {AL, TdL, TddL, TdnsL} and the Output-Set is {Mc1, Mc2, Mc3}, where Mc1, Mc2, Mc3 are the parallel crawling agents.

Since fuzzy logic is a form of many-valued logic that deals with reasoning that is approximate rather than fixed and exact [10], the process goes through the following steps (a sketch of steps b) to d) follows this list):

a) Determination of the input variables explained above.

b) Fuzzification, already described above in terms of linguistic variables: defining the values of each fuzzy set together with their ranges. This is done with the help of membership functions, whose shapes depend on the distribution of the input variables and may, for instance, be triangular.

c) For each set of input values, running the inference engine: determining which rule or set of rules fires (the antecedents) and how the fuzzy controller modifies the consequents to assign the URLs, for instance, to the queue "q" of a particular machine "m" (accumulation). Inferences can be stated as:

• "If AveragePageSize is small and TdL is not more, then Machine is Machine1."

• "If AveragePageSize is large and Tdd is more, then Machine is Machine2."

d) Defuzzification: converting the fuzzy outputs into crisp values (by calculating the centroid or the maxima, for example) and producing the final output, which in our case is the index of the machine to which the URL will be assigned. Defuzzification is in fact the process of producing a quantifiable result in fuzzy logic, given fuzzy sets and their corresponding membership degrees.

e) Evaluation of the performance of the crawling agents based on the defuzzified output.
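The following sketch walks through steps b) to d) for a two-machine case with two of the four inputs. All numeric breakpoints are assumed, the memberships use simple shoulder ramps rather than full triangles for brevity, and the second rule approximates Tdd by Td so that the sketch stays two-input; only the linguistic variables and the form of the rules come from the text.

def ramp_down(x, lo, hi):
    """Left-shoulder membership: 1 below lo, 0 above hi, linear between."""
    if x <= lo:
        return 1.0
    if x >= hi:
        return 0.0
    return (hi - x) / (hi - lo)

def ramp_up(x, lo, hi):
    """Right-shoulder membership: 0 below lo, 1 above hi, linear between."""
    return 1.0 - ramp_down(x, lo, hi)

def fuzzify(a_kb, td_s):
    """Step b: map crisp health readings to linguistic memberships.
    The breakpoints (kB, seconds) are assumed for illustration."""
    return {
        ("A", "small"): ramp_down(a_kb, 50, 150),
        ("A", "large"): ramp_up(a_kb, 50, 150),
        ("Td", "less"): ramp_down(td_s, 2, 8),
        ("Td", "more"): ramp_up(td_s, 2, 8),
    }

def infer(mu):
    """Step c: evaluate the rules. AND = min, NOT = 1 - membership."""
    # "If AveragePageSize is small and TdL is not more, then Machine1"
    r1 = min(mu[("A", "small")], 1.0 - mu[("Td", "more")])
    # "If AveragePageSize is large and Tdd is more, then Machine2"
    # (Tdd approximated by Td in this two-input sketch)
    r2 = min(mu[("A", "large")], mu[("Td", "more")])
    return {"Machine1": r1, "Machine2": r2}

def defuzzify(activations):
    """Step d: maxima defuzzification -- output the machine whose
    accumulated activation is strongest."""
    return max(activations, key=activations.get)

# Example: an agent reporting A = 30 kB and Td = 1.5 s fuzzifies to
# "small"/"less", rule 1 fires fully, and the URL goes to Machine1.
print(defuzzify(infer(fuzzify(30, 1.5))))   # -> Machine1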
The merits of this approach to the URL assignment problem of the parallel crawler are:

• It supports the load balancing property of the parallel crawl agents, which is required to maintain equilibrium.

• It takes into account external features like the time taken for DNS resolution, which reflects the network congestion in a particular span of time.

• The health of the crawl agents is monitored at regular time intervals, through which the system can be scaled up at any time if there is an unbalanced state due to the uncontrolled behavior of the web, thus increasing the robustness of the system.
V. CONCLUSION

In this paper, we have reviewed some basic concepts of the parallel web crawler along with its implementation in both static and dynamic forms. The static parallel crawlers, as discussed, are simple to build but have a number of drawbacks; these are overcome by the dynamic form of the parallel crawler, at the cost of more difficult implementation of its modules. Though very little work in the literature has focused on the dynamic parallel crawler, this paper discusses different architectures for an important phase of the dynamic parallel crawler, namely how to distribute URLs from the URL frontier to the various concurrently executing crawling process threads, which is an orthogonal problem (the URL assignment problem). Finally, a new approach for URL assignment to crawl agents, based on monitoring their health through fuzzy logic, is proposed, which provides a number of advantages.

References

[1] Etter, Vincent, Matthias Grossglauser, and Patrick Thiran, "Launch hard or go home!: Predicting the success of Kickstarter campaigns." Proceedings of the First ACM Conference on Online Social Networks. ACM, 2013.
[2] Pak, Alexander, and Patrick Paroubek, "Twitter as a Corpus for Sentiment Analysis and Opinion Mining." LREC, 2010.
[3] Divakar Yadav, A. K. Sharma, Sonia, Jorge Marato, "An approach to design incremental parallel web crawler." Journal of Theoretical and Applied Information Technology, Vol. 43, 2012.
[4] Junghoo Cho and Hector Garcia-Molina, "Parallel Crawlers." Proceedings of the 11th International Conference on World Wide Web, 2002.
[5] Fatemeh Ahmadi-Abkenari and Ali Selamat, "An architecture for a focused trend parallel Web crawler with the application of clickstream analysis." Information Sciences, Vol. 184, Elsevier, 2011.
[6] Debajyoti Mukhopadhyay, Sajal Mukherje, Soumya Ghosh, Saheli Kar, Young-Chon Kim, "Architecture of a Scalable Dynamic Parallel WebCrawler with High Speed Downloadable Capability for a Web Search Engine." 6th International Workshop on MSPT Proceedings, 2006.
[7] A. Guerriero, F. Ragni, C. Martines, "A dynamic URL assignment method for parallel web crawler." Computational Intelligence for Measurement Systems and Applications (CIMSA), 2010 IEEE International Conference on. IEEE, 2010.
[8] Gupta, Sonali, Komal Bhatia, and Pikakshi Manchanda, "WebParF: A Web Partitioning Framework for Parallel Crawler." International Journal on Computer Science and Engineering, Aug. 2013.
[9] Bhaginath, Wani Rohit, Sandip Shingade, and Mahesh Shirole, "Virtualized dynamic URL assignment web crawling model." Advances in Engineering and Technology Research (ICAETR), 2014 International Conference on. IEEE, 2014.
[10] Rondeau, L., R. Ruelas, L. Levrat, and M. Lamotte, "A defuzzification method respecting the fuzzification." Fuzzy Sets and Systems 86, no. 3 (1997): 311-320.
[11] Y. Wan, H. Tong, "URL Assignment Algorithm of Crawler in Distributed System Based on Hash." IEEE International Conference on Networking, Sensing and Control (ICNSC 2008), Hainan, China, 6-8 April 2008, pp. 1632-1635. IEEE, 2008.
[12] Huang, Qiuyan, Qingzhong Li, and Zhongmin Yan, "A Novel URL Assignment Model Based on Multi-objective Decision Making Method." Web Information Systems and Applications Conference (WISA), 2012 Ninth, pp. 31-34. IEEE, 2012.
[13] Marin, Mauricio, Rodrigo Paredes, and Carolina Bonacic, "High-performance priority queues for parallel crawlers." Proceedings of the 10th ACM Workshop on Web Information and Data Management. ACM, 2008.