2010 2nd International Conference on Education Technology and Computer (ICETC)

Adaptive Focused Crawling Based on Link Analysis

Debashis Hati, Assistant Professor, School of Computer Engineering, KIIT University, Bhubaneswar, India (d_hati@yahoo.com)
Biswajit Sahoo, Associate Dean, School of Computer Engineering, KIIT University, Bhubaneswar, India (sh_biswajit@yahoo.co.in)
Amritesh Kumar, Research Associate, School of Computer Engineering, KIIT University, Bhubaneswar, India (amritesh.kiit@gmail.com)

Abstract—A web search engine is designed to search for information on the World Wide Web (WWW). Crawlers are software which can traverse the internet and retrieve web pages by hyperlinks. In the face of the large number of spam websites, traditional web crawlers cannot function well to solve this problem. Focused crawlers utilize semantic web technologies to analyze the semantics of hyperlinks and web documents. The focused crawler of a special-purpose search engine aims to selectively seek out pages that are relevant to a pre-defined set of topics, rather than to exploit all regions of the Web. A focused crawler is a program used for searching information related to some topics of interest on the Internet. The main property of focused crawling is that the crawler does not need to collect all web pages, but selects and retrieves relevant pages only. As the crawler is only a computer program, it cannot determine by itself how relevant a web page is. The major problem is how to retrieve the maximal set of relevant and high-quality pages. In our proposed approach, we calculate the score of an unvisited URL based on its anchor text relevancy, the similarity of its description in the Google search engine to the topic keywords, the similarity of its cohesive text to the topic keywords, and the relevancy scores of its parent pages. The relevancy score is calculated based on the vector space model.

Keywords—crawler; focused crawler; vector space model

I. INTRODUCTION

The enormous growth of the World Wide Web (WWW) in recent years has made it important to perform resource discovery efficiently. Consequently, several new ideas have been proposed; among them a key technique is focused crawling, which is able to crawl particular topical portions of the WWW quickly without having to explore all web pages. A Web crawler searches through all the Web servers to find information about a particular topic. However, searching all the Web servers and pages is not realistic, given the growth of the Web and its refresh rates [5]. Crawling the Web quickly and entirely is an expensive, unrealistic goal because of the required hardware and network resources. Focused crawling is designed to traverse a subset of the Web to gather documents on a specific topic. It also aims to identify the promising links that lead to target documents, and to avoid off-topic searches. Focused crawlers support decentralizing the crawling process, which is a more scalable approach. Crawlers are software which can traverse the internet and retrieve web pages by hyperlinks. Given the large number of websites, traditional web crawlers cannot function well to solve this problem. The focused crawler of a special-purpose search engine aims to selectively seek out pages that are relevant to a pre-defined set of topics, rather than to exploit all regions of the Web. Focused crawlers aim to search only the subset of the web related to a specific topic, and offer a potential solution to the problem [2, 4]. A focused crawler is developed to collect web pages relevant to topics of interest from the Internet. General-purpose search engines, such as Google (www.google.com), have provided us with many facilities and become very popular. However, they have some disadvantages, because a general-purpose search engine aims to cover the network as completely as possible. So, it usually returns many web pages users are not interested in [1].

Vertical search engines use focused crawlers as their key component and develop specific algorithms to select web pages relevant to some pre-defined set of topics. Therefore, it is extremely important for such a search engine to effectively build up a semantic pattern for specific topics. The traditional process of a focused web crawler is to harvest a collection of web documents that are focused on the topical subspaces. Focused crawlers are the main element in building domain-specific search engines. They traverse the web collecting only data relevant to a predefined topic while at the same time neglecting off-topic pages [3]. The crawler is kept focused through a crawling strategy which determines the relevancy degree of a web page to the predefined topic; depending on this degree, a decision is made whether to download the web page or not [4].

In this paper, our proposed approach calculates a link score for each unvisited URL based on its anchor text relevancy, the similarity of its description in the Google search engine to the topic keywords, the similarity of its cohesive text to the topic keywords, and the relevancy scores of its parent pages. The relevancy score is calculated based on the vector space model.

The rest of this paper is organized as follows. We discuss related work on focused crawling in Section II. In Section III, we introduce our proposed architecture. In Section IV, we present our proposed approach. In Section V, our proposed algorithm is given, and in Section VI, we conclude the paper.

II. RELATED WORK

A focused crawler is a program used for searching information related to some topics of interest on the Internet. The main property of focused crawling is that the crawler does not need to collect all web pages, but selects and retrieves relevant pages only. Because the crawler is only a computer program, it cannot determine by itself how relevant a web page is [1]. In order to find pages of a particular type or on a particular topic, focused crawlers aim to identify links that are likely to lead to target documents, and to avoid links to off-topic branches. The concept of prioritizing unvisited URLs on the crawl frontier for specific searching goals is not new, however; Fish-Search and Shark-Search were among the earliest algorithms for crawling for pages with keywords specified in the query [2].

In Fish-Search, the system is query driven. Starting from a set of seed pages, it considers only those pages that have content matching a given query (expressed as a keyword query or a regular expression) and their neighborhoods (pages pointed to by these matched pages). Shark-Search is a modification of Fish-Search which differs in two ways: a child inherits a discounted value of the score of its parent, and this score is combined with a value based on the anchor text that occurs around the link in the web page [6].

Many researchers have based their approaches on link analysis. For example, "Effective Focused Crawling Based on Content and Link Structure Analysis" performs link analysis based on a URL score, an anchor score and a relevance score, as does "HAWK: A Focused Crawler with Content and Link Analysis". Others are based on page rank values, for example "An Application of Improved PageRank in Focused Crawler", based on the To-page rank value, and an improvement of PageRank for focused crawlers based on the T page rank. Others are based on ontologies, for example "A Survey in Semantic Web Technologies-Inspired Focused Crawlers" and "A Transport Service Ontology-based Focused Crawler". Some crawlers have been developed based on meta search and content block partition, such as "A Framework of a Hybrid Focused Web Crawler", and some are rule based, for example "Design of an Enhanced Rule Based Focused Crawler" and the URL rule based focused crawler.

The working process of a focused crawler consists of two main steps. The first step is to determine the starting URLs and specify the user interest; the crawler is unable to traverse the Internet without starting URLs. The second step is the crawling method. From a theoretical point of view, a focused crawler smartly selects a direction in which to traverse the Internet. A clever route selection method is to arrange URLs so that the most relevant ones are located in the first part of the queue; the queue is then sorted by relevancy in descending order [7]. The performance and efficiency of a focused crawler are mainly determined by the ordering strategy that decides the order of page retrieval.

III. PROPOSED ARCHITECTURE

The frontier contains the list of unvisited URLs maintained by the crawler and is initialized with the seed URLs. The web page downloader fetches URLs from the frontier and downloads the corresponding pages from the internet. The parser and extractor extract information such as the terms and the hyperlink URLs from a downloaded page. The relevance calculator computes the relevance of a page with respect to the topic and assigns scores to the URLs extracted from the page. The topic filter analyzes whether the content of a parsed page is related to the topic or not. If the page is relevant, the URLs extracted from it are added to the URL queue; otherwise, they are ignored. The proposed architecture is shown in Fig. 1.

[Figure 1. The Proposed Architecture.]
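To make the interaction of these components concrete, the following is a minimal Python sketch of one possible control flow; the helper callables fetch_page, parse, relevance and link_score are assumed to be supplied by the implementer and are illustrative, not a reference implementation of the paper's system.

import heapq

def crawl(seed_urls, weight_table, threshold, fetch_page, parse, relevance, link_score):
    # Frontier: priority queue of (negative score, URL), initialized with seed URLs.
    frontier = [(-1.0, u) for u in seed_urls]
    heapq.heapify(frontier)
    visited, collected = set(), []
    while frontier:
        _, url = heapq.heappop(frontier)      # highest-scoring unvisited URL first
        if url in visited:
            continue
        visited.add(url)
        html = fetch_page(url)                # web page downloader
        terms, out_links = parse(html)        # parser and extractor
        score = relevance(weight_table, terms)  # relevance calculator
        if score >= threshold:                # topic filter
            collected.append(url)
            for link in out_links:            # score and enqueue extracted URLs
                heapq.heappush(frontier, (-link_score(link), link))
        # if the page is off-topic, its extracted URLs are ignored
    return collected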
IV. PROPOSED APPROACH

A. Seed URL Extraction

In our proposed approach, seed URLs are extracted with the help of one search engine known as threesearches.com. We put a query into this search engine and it shows the results of three popular search engines: Google, Yahoo, and MSN Search. We take the resulting URLs which are common to all three search engines or common to at least two of them. URLs which are common to all three search engines are assumed to be the most relevant results for the query, and these URLs become seed URLs. A URL which is common to the results of only two search engines is, by experiment, assumed not to be the most relevant for the topic but still relevant, so we also place it in the seed URL category. For example, we put the query "computer books" on threesearches.com and extract the common results of all three search engines (Google, Yahoo and MSN Search). Here, two of the outputs are www.freecomputerbooks.com and www.computer-book.us.
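As a rough illustration of this selection rule (a sketch, not the paper's actual implementation), the following function keeps as seed URLs those results returned by at least two of the three engines; the three input lists are assumed to have been obtained beforehand.

def select_seed_urls(google_results, yahoo_results, msn_results):
    # Count in how many of the three result lists each URL appears.
    counts = {}
    for results in (google_results, yahoo_results, msn_results):
        for url in set(results):
            counts[url] = counts.get(url, 0) + 1
    # URLs common to all three engines are taken as the most relevant seeds;
    # URLs common to exactly two engines are still treated as relevant seeds.
    most_relevant = [u for u, c in counts.items() if c == 3]
    relevant = [u for u, c in counts.items() if c == 2]
    return most_relevant + relevant

# Example usage (with hypothetical result lists for the query "computer books"):
# seeds = select_seed_urls(google, yahoo, msn)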
B. Topic Specific Weight Table Construction

The weight table defines the crawling target. The topic name is sent as a query to the Google web search engine and the first k results are retrieved. The retrieved pages are parsed. To avoid indexing useless words, a text retrieval system often associates a stop list with a set of documents. A stop list is a set of words that are deemed "irrelevant." For example, a, the, of, for, with, and so on are stop words, even though they may appear frequently. Words are stemmed using the Porter stemming algorithm. For example, the group of words drug, drugged, and drugs share a common word stem, drug, and can be viewed as different occurrences of the same word.

Starting with a set of "d" documents and a set of "t" terms, we can model each document as a vector "v" in the t-dimensional space Rt, which is why this method is called the vector-space model. Let the term frequency be the number of occurrences of term "t" in the document "d", i.e., freq(d, t). The (weighted) term-frequency matrix TF(d, t) measures the association of a term "t" with respect to the given document "d." It is generally defined as 0 if the document does not contain the term, and nonzero otherwise:

    TF(d, t) = 0                              if freq(d, t) = 0
    TF(d, t) = 1 + log(1 + log(freq(d, t)))   otherwise

where freq(d, t) is the term frequency, i.e., the total number of occurrences of term "t" in the document (in the nonzero case).

Besides the term frequency measure, there is another important measure, called inverse document frequency (IDF), which represents the scaling factor, or the importance, of a term "t." If a term "t" occurs in many documents, its importance will be scaled down due to its reduced discriminative power. For example, the term database systems may likely be less important if it occurs in many research papers in a database system conference. IDF(t) is defined by the following formula:

    IDF(t) = log((1 + |d|) / |dt|)

where "d" is the document collection and "dt" is the set of documents containing term "t."

In a complete vector-space model, TF and IDF are combined together, which forms the TF-IDF measure:

    TF-IDF(d, t) = TF(d, t) × IDF(t)

The words are ordered by their weights and a certain number of words with the highest weights are extracted as the topic keywords. After that, the weights are normalized as:

    W = Wi / Wmax    (1)

where "Wi" is the weight of keyword "i" and "Wmax" is the weight of the keyword with the highest weight.

For example, we have taken the topic keyword "computer books." For the Topic Specific Weight Table construction, we put "computer books" as a query into the Google web search engine and retrieve the first 7 results. After removing stop words, except the word computer (we know that computer is a stop word, but since our query is "computer books" we treat computer as an important word), and stemming the words, the term frequency (TF) and inverse document frequency (IDF) of each word are calculated to obtain the weights. Here, we have taken the top 10 most frequently occurring words. The term "books" appears 331 times in the combination of all 7 results, so the term frequency of book is:

    1 + log(1 + log 331) = 1 + log(1 + 2.519827994) = 1 + 0.546521441 = 1.546521441

Now the weights are normalized by equation (1) and the Topic Specific Weight Table is created (see TABLE I).

TABLE I. TOPIC SPECIFIC WEIGHT TABLE

    Terms       Weight
    Web         1
    Post        0.985815783
    Ebook       0.678071903
    Site        0.677361939
    Linux       0.673904001
    Computer    0.422007352
    Java        0.411178185
    Book        0.20311801
    Free        0.202325996
    Program     0.197359782
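A small sketch of this weight-table construction is given below. It assumes the k retrieved result pages have already been tokenized, stop-worded and stemmed into lists of terms, and it uses base-10 logarithms, which is consistent with the worked example above; the function names and the top_n parameter are illustrative assumptions.

import math
from collections import Counter

def tf(freq):
    # TF = 0 if the term is absent, otherwise 1 + log(1 + log(freq)); base-10 logs assumed.
    return 0.0 if freq == 0 else 1.0 + math.log10(1.0 + math.log10(freq))

def idf(term, doc_sets):
    # IDF(t) = log((1 + |d|) / |dt|), where dt is the set of documents containing t.
    dt = sum(1 for doc in doc_sets if term in doc)
    return math.log10((1.0 + len(doc_sets)) / dt) if dt else 0.0

def build_weight_table(documents, top_n=10):
    # documents: list of term lists, one per retrieved result page.
    combined = Counter(t for doc in documents for t in doc)
    doc_sets = [set(doc) for doc in documents]
    weights = {t: tf(f) * idf(t, doc_sets) for t, f in combined.items()}
    top = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    if not top or top[0][1] == 0.0:
        return {}
    w_max = top[0][1]
    return {term: w / w_max for term, w in top}   # normalization W = Wi / Wmax

# Sanity check against the worked example above: 331 occurrences of "books".
# print(tf(331))   # about 1.546521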
"databases and storage." "Alexa, amazon elastic compute
cloud, amazon mechanical turk, amazon payments, amazon
simple storage service, amazon web services, apis, archives
and records management, article, articles, aws, aws user
C. Relevancy Calculation group, books, cloud computing, cultural property, data
The weight of words in page corresponding to the mining, database, database storage, databases, developer
keyword in the Topic Specific Weight Table is calculated. forum, developer tools, devpay, dynamic, ec2, file, flexible
The weight calculation of words in page uses the same payments service, freedom of act, grid computing, image,
approach which is used by Topic Specific Weight table information, information age, information literacy,
weight calculation. In our proposed approach, it uses a information technology, journalism, journals, library science,
cosine similarity measure to calculate the relevance of the management, media, modem, movie, newspapers in america,
page on a particular topic. online library, photo, research, retrieval, simple queue
service, simpledb, storage, text utility computing, web
application development, web hosting, website."
Now, we fmd out only two words of topic keywords are
Relevance (t. p) = available in related set. So, the Anchor_Relevancy_Score
of "databases and storage" anchor text is 0.2 because only
two words web and books of topic keywords are present in
the related set.
Here, "t" is the topic specific weight table, "p" is the web Relevancy_Score_URL_Description is the relevancy score
page under investigation, and "wkt" and "wkp" are the of URL description with respect to topic keywords. We put
weights of keyword "k" in the weight table and in the web the URL as a query in the Google Search Engine with the
page respectively. name"description of URL name" and fmd out top 10 results
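The cosine measure above can be written directly as a short function. In this sketch both the weight table and the page are assumed to already be dictionaries mapping keywords to weights; this is our reading of the formula rather than code taken from the paper.

import math

def relevance(weight_table, page_weights):
    # Relevance(t, p) = sum_k(wkt * wkp) / (sqrt(sum_k wkt^2) * sqrt(sum_k wkp^2))
    keywords = set(weight_table) | set(page_weights)
    dot = sum(weight_table.get(k, 0.0) * page_weights.get(k, 0.0) for k in keywords)
    norm_t = math.sqrt(sum(w * w for w in weight_table.values()))
    norm_p = math.sqrt(sum(w * w for w in page_weights.values()))
    return dot / (norm_t * norm_p) if norm_t and norm_p else 0.0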
D. Link Score Calculation

The focused crawler fetches web pages from the internet. It fetches the seed pages first because, by experiment, the seed pages have been found to be the most relevant to the topics. After fetching the seed pages from the internet, our approach calculates the Link Score. The Link Score assigns scores to the unvisited links extracted from a downloaded page, using the weight information of pages that have already been crawled and the metadata of the hyperlinks:

    Link_Score(i) = Anchor_Relevancy_Score(i) + Relevancy_Score_URL_Description(i)
                    + Cohesive_Text_Relevancy_Score_of_URL(i)
                    + [Relevancy(P1) + Relevancy(P2) + ... + Relevancy(Pn)]

Anchor_Relevancy_Score is the relevancy score between the topic keywords and the anchor text. We find the related words of the anchor text with the help of a tool, and then determine what percentage of the topic keywords is present in this set of related words, because the more topic keywords appear among the related words of the anchor text, the more relevant the anchor text is to the topics. Our proposed approach calculates Anchor_Relevancy_Score because the anchor text describes some information about the URL; it is the textual information about the URL. For example, among the many URLs in the seed page we have taken one URL, "http://www.freecomputerbooks.com/dbCategory.html", whose anchor text in this seed page is "databases and storage."

The related words of the anchor text "databases and storage" are: "Alexa, amazon elastic compute cloud, amazon mechanical turk, amazon payments, amazon simple storage service, amazon web services, apis, archives and records management, article, articles, aws, aws user group, books, cloud computing, cultural property, data mining, database, database storage, databases, developer forum, developer tools, devpay, dynamic, ec2, file, flexible payments service, freedom of act, grid computing, image, information, information age, information literacy, information technology, journalism, journals, library science, management, media, modem, movie, newspapers in america, online library, photo, research, retrieval, simple queue service, simpledb, storage, text utility computing, web application development, web hosting, website."

We find that only two of the topic keywords are present in this related set. So, the Anchor_Relevancy_Score of the anchor text "databases and storage" is 0.2, because only two topic keywords, web and books, are present in the related set.

Relevancy_Score_URL_Description is the relevancy score of the URL description with respect to the topic keywords. We put the URL as a query into the Google search engine in the form "description of URL name", take the top 10 results, and then find the top 10 weighted words after calculating TF × IDF. Our proposed approach calculates Relevancy_Score_URL_Description because the description gives detailed information about the URL, i.e., what information it contains or what it describes.

For example, we put the query "description of http://www.freecomputerbooks.com/dbCategory.html" into the Google search engine and find the weights of the topic keywords among the top 10 weighted words. We then calculate Relevancy_Score_URL_Description by the R(t, p) calculation, and the resulting score R(t, p) is 0.747750787 (see TABLE II). Table II shows the weights of the topic keywords in the description of the URL.

TABLE II. WEIGHTS OF TOPIC KEYWORDS IN THE DESCRIPTION OF THE URL

    Terms       Weight
    Web         0.864186362
    Post        0.738989594
    Ebook       0.738989594
    Site        0.738989594
    Linux       0
    Computer    0.945507125
    Java        0
    Book        0.954019645
    Free        0.968022093
    Program     0.909094439
Cohesive_Text_Relevancy_Score_of_URL is the score of the URL with respect to the topics in the sentences surrounding it. For the extraction of cohesive text, one sentence or a group of meaningful sentences around the anchor link has to be considered. A sentence can be identified as starting with a capital letter and ending with a period (dot). The following algorithm describes the steps for extracting cohesive text (a possible rendering in code is sketched after this subsection):
1. Identify the anchor link in the page.
2. Extract a sentence in the backward direction of the anchor link, if any.
3. If this sentence starts with the words 'It', 'This', or 'And', then extract one more sentence in the backward direction, if any.
4. Repeat steps 2 and 3 until the sentence starts with a word excluding the words mentioned in step 3.
5. Extract a sentence in the forward direction of the anchor tag, if any.

After calculating Relevancy_Score_URL_Description, our proposed approach calculates Cohesive_Text_Relevancy_Score_of_URL because it gives information about the semantic similarity of the URL with respect to the topic keywords. It calculates how many topic keywords exist in the division to which the particular URL belongs, out of the total topic keywords that exist in the topic specific table. Sometimes there are no words surrounding the anchor link; in that case its value will be 0. Here, for the link http://www.freecomputerbooks.com/dbCategory.html the Cohesive_Text_Relevancy_Score_of_URL is 0.1.
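One possible rendering of the five extraction steps in Python is shown below. It assumes the page has already been split into a list of sentences and that the index of the sentence containing the anchor link is known; both assumptions go beyond what the paper specifies.

def extract_cohesive_text(sentences, anchor_index):
    # Step 1: the sentence containing the anchor link.
    selected = [sentences[anchor_index]]
    # Steps 2-4: walk backwards, continuing while the extracted sentence
    # starts with 'It', 'This' or 'And'.
    i = anchor_index - 1
    while i >= 0:
        selected.insert(0, sentences[i])
        words = sentences[i].split()
        first_word = words[0] if words else ""
        if first_word not in ("It", "This", "And"):
            break
        i -= 1
    # Step 5: one sentence in the forward direction of the anchor tag, if any.
    if anchor_index + 1 < len(sentences):
        selected.append(sentences[anchor_index + 1])
    return " ".join(selected)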
E. Relevancy Calculation of Parent Pages

For example, we have taken the URL "http://www.freecomputerbooks.com/dbCategory.html". We extract all the inlinks of this URL; to keep the example short, we calculate the relevancy scores of only 5 parent pages. The 5 parent pages of this URL are:
1. http://freecomputerbooks.com/webAjaxBooks.html
2. http://freecomputerbooks.com/Programming-Scala.html
3. http://freecomputerbooks.com/the-microsoft-net-developer-ebook.html
4. http://freecomputerbooks.com/Web-Style-Guide.html
5. http://freecomputerbooks.com/langBasicBooks.html

The relevancy scores R(t, p) of these 5 parent pages with respect to the topic keywords are:
1. 0.701533168
2. 0.806545640
3. 0.812362450
4. 0.786754980
5. 0.704356870
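Putting the four components together, the link score of an unvisited URL can be sketched as below. The component values are passed in, since each is computed as described in the preceding subsections; this is our reading of the Link_Score formula, not code from the paper.

def link_score(anchor_relevancy, description_relevancy, cohesive_relevancy, parent_relevancies):
    # Link_Score(i) = Anchor_Relevancy_Score(i) + Relevancy_Score_URL_Description(i)
    #                 + Cohesive_Text_Relevancy_Score_of_URL(i) + sum of parent page relevancies
    return anchor_relevancy + description_relevancy + cohesive_relevancy + sum(parent_relevancies)

# Reproducing the running example for http://www.freecomputerbooks.com/dbCategory.html:
parents = [0.701533168, 0.806545640, 0.812362450, 0.786754980, 0.704356870]
print(link_score(0.2, 0.747750787, 0.1, parents))   # approximately 4.8593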
V. PROPOSED ALGORITHM

Step 1: Extract seed pages from threesearches.com.
Step 2: Extract all terms and links from the seed page.
/* We extract all terms and links from the seed page, or from any other page, with a Java program. */
Step 3: for i = 1 to n
/* where n is the total number of links in the web page */
Step 4: for j = 1 to m
/* where m is the total number of parent pages of each link(i) */
Step 5: Calculate the relevancy score of each parent page.
/* Calculate the relevancy score R(t, p) of each parent page with respect to the topic keywords. Repeat step 5 until the relevancy scores of all parent pages have been calculated. */
Step 6: Calculate Relevancy_Score_URL_Description.
/* This is in the first loop, because the second loop finishes in step 5. */
Step 7: Calculate Anchor_Relevancy_Score.
Step 8: Calculate link_score(i).
/* The first for loop runs from step 3 to step 8. It is executed until link_score(i) has been calculated for all outlinks of a particular page. */

From link_score(i), our proposed approach identifies which outlink of a web page is more relevant to the topics. This is also based on a threshold value: if link_score(i) is greater than the threshold value, then this outlink is considered more relevant to the topics, and based on the link score a priority is assigned that determines which URL will be fetched first from the frontier by the focused crawler.

Link_score(http://www.freecomputerbooks.com/dbCategory.html) = 0.2 + 0.747750787 + 0.1 + [0.701533168 + 0.806545640 + 0.812362450 + 0.786754980 + 0.704356870] = 4.859303875
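The eight steps can be read as two nested loops: an outer loop over the out-links of a page and an inner loop over each link's parent pages. A hedged Python rendering is given below; the helper callables are assumed to implement the calculations of Section IV and their names are illustrative.

def score_out_links(out_links, weight_table, relevance, parent_pages_of,
                    description_score_of, anchor_score_of, cohesive_score_of, threshold):
    prioritized = []
    for link in out_links:                                     # Step 3: for i = 1 to n
        parent_scores = [relevance(weight_table, p)            # Steps 4-5: relevancy of each
                         for p in parent_pages_of(link)]       # parent page of link(i)
        desc = description_score_of(link)                      # Step 6
        anchor = anchor_score_of(link)                         # Step 7
        cohesive = cohesive_score_of(link)                     # cohesive text score (Section IV.D)
        score = anchor + desc + cohesive + sum(parent_scores)  # Step 8: link_score(i)
        if score > threshold:                                  # keep only sufficiently
            prioritized.append((score, link))                  # relevant out-links
    prioritized.sort(reverse=True)                             # highest link score fetched first
    return prioritized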

VI. CONCLUSION AND FUTURE WORK

Focused crawlers are becoming a more and more important topic, and focused crawling methods are important members of the search engine family. One of the key problems of vertical search engines is to develop an effective algorithm for topic-specific searching and similarity measurement. One approach to solving this problem is to analyze URL attributes, compute scores for these attributes, and, based on these scores, let the focused crawler identify which URL is more relevant to the topic keywords. A major open issue for future work is to perform more extensive tests with a large volume of web pages. Future work also includes code optimization and URL queue optimization, because crawler efficiency depends not only on retrieving the maximum number of relevant pages but also on finishing the operation as soon as possible.

REFERENCES

[1] X. Zhang, T. Zhou, Z. Yu and D. Chen, "URL Rule Based Focused Crawlers", IEEE International Conference on e-Business Engineering, 2008.
[2] A. Pal, D. S. Tomar and S. C. Shrivastava, "Effective Focused Crawling Based on Content and Link Structure Analysis", (IJCSIS) International Journal of Computer Science and Information Security, Vol. 2, No. 1, June 2009.
[3] Y. Zhang, C. Yin and F. Yuan, "An Application of Improved PageRank in Focused Crawler", Fourth International Conference on Fuzzy Systems and Knowledge Discovery (FSKD 2007), IEEE.
[4] Q. Cheng, W. Beizhan and W. Pianpian, "Efficient focused crawling strategy using combination of link structure and content similarity", IEEE, 2008.
[5] X. Chain and X. Zhang, "HAWK: A Focused Crawler with Content and Link Analysis", IEEE International Conference on e-Business Engineering, 2008.
[6] S. Chakrabarti, M. van den Berg and B. Dom, "Focused crawling: a new approach to topic-specific Web resource discovery", 8th International WWW Conference, May 1999.
[7] M. Yuvarani, N. Ch. S. N. Iyengar and A. Kannan, "LSCrawler: A Framework for an Enhanced Focused Web Crawler Based on Link Semantics", Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence.
