Adaptive Focus
Abstract—A web search engine is designed to search for information on the World Wide Web (WWW). Crawlers are software which can traverse the internet and retrieve web pages by hyperlinks. In the face of the large number of spam websites, traditional web crawlers cannot function well to solve this problem. Focused crawlers utilize semantic web technologies to analyze the semantics of hyperlinks and web documents. The focused crawler of a special-purpose search engine aims to selectively seek out pages that are relevant to a pre-defined set of topics, rather than to exploit all regions of the Web. A focused crawler is a program used for searching information related to some topics of interest on the Internet. The main property of focused crawling is that the crawler does not need to collect all web pages, but selects and retrieves only relevant pages. As the crawler is only a computer program, it cannot by itself determine how relevant a web page is; the major problem is how to retrieve the maximal set of relevant, high-quality pages. In our proposed approach, we calculate the score of an unvisited URL based on the relevancy of its anchor text, the similarity of its description in the Google search engine with the topic keywords, the similarity of its cohesive text with the topic keywords, and the relevancy scores of its parent pages. The relevancy score is calculated based on the vector space model.

Keywords—crawler; focused crawler; vector space model

I. INTRODUCTION

The enormous growth of the World Wide Web (WWW) in recent years has made it important to perform resource discovery efficiently. Consequently, several new ideas have been proposed; among them, a key technique is focused crawling, which is able to crawl particular topical portions of the WWW quickly without having to explore all web pages. A web crawler searches through all the web servers to find information about a particular topic. However, searching all the web servers and their pages is not realistic, given the growth of the Web and its refresh rates [5]. Crawling the Web quickly and entirely is an expensive, unrealistic goal because of the required hardware and network resources. Focused crawling is designed to traverse a subset of the Web to gather documents on a specific topic. It also aims to identify the promising links that lead to target documents and to avoid off-topic searches. Focused crawlers also support decentralizing the crawling process, which is a more scalable approach.

Crawlers are software which can traverse the internet and retrieve web pages by hyperlinks. In the large space of websites, traditional web crawlers cannot function well to solve this problem. The focused crawler of a special-purpose search engine aims to selectively seek out pages that are relevant to a pre-defined set of topics, rather than to exploit all regions of the Web. Focused crawlers aim to search only the subset of the web related to a specific topic, and offer a potential solution to the problem [2, 4]. A focused crawler is developed to collect web pages relevant to topics of interest from the Internet. General-purpose search engines, such as Google (www.google.com), have provided us with many facilities and have become very popular. However, they have some disadvantages: because a general-purpose search engine aims to cover as much of the network as possible, it usually returns many web pages users are not interested in [1].

Vertical search engines use focused crawlers as their key component and develop specific algorithms to select web pages relevant to a pre-defined set of topics. Therefore, it is extremely important for a search engine to effectively build up a semantic pattern for specific topics. The traditional process of a focused web crawler is to harvest a collection of web documents that are focused on topical subspaces. Focused crawlers are the main element in building domain-specific search engines. They traverse the web collecting only data relevant to a predefined topic while at the same time neglecting off-topic pages [3]. The crawler is kept focused through a crawling strategy which determines the degree of relevancy of a web page to the predefined topic; depending on this degree, a decision is made whether to download the web page or not [4].

In this paper, our proposed approach calculates a link score. First, we calculate the score of an unvisited URL based on the relevancy of its anchor text, the similarity of its description in the Google search engine with the topic keywords, the similarity of its cohesive text with the topic keywords, and the relevancy scores of its parent pages. The relevancy score is calculated based on the vector space model.

The rest of this paper is organized as follows. Section II discusses related work on focused crawling. Section III introduces our proposed architecture. Section IV presents our proposed approach. Section V presents our proposed algorithm, and Section VI concludes the paper.
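A crawling strategy of this kind — score every candidate link and download only those whose relevancy degree clears a threshold — can be illustrated as a priority-queue loop. This is a generic sketch, not the implementation proposed later in this paper; fetch_page and score_link are hypothetical callables supplied by the caller.

```python
import heapq

def focused_crawl(seed_urls, score_link, fetch_page, threshold, max_pages=100):
    """Generic focused-crawl loop: keep a frontier ordered by link score and
    download only links whose score exceeds the threshold."""
    # heapq is a min-heap, so scores are negated to pop the best link first.
    frontier = [(-1.0, url) for url in seed_urls]  # seed pages get top priority
    heapq.heapify(frontier)
    visited, collected = set(), []
    while frontier and len(collected) < max_pages:
        neg_score, url = heapq.heappop(frontier)
        if url in visited or -neg_score < threshold:
            continue  # skip already-seen or off-topic links
        visited.add(url)
        page = fetch_page(url)            # hypothetical downloader
        collected.append(url)
        for link in page["outlinks"]:
            if link not in visited:
                heapq.heappush(frontier, (-score_link(page, link), link))
    return collected
```

The frontier ordering is what makes the crawl "focused": promising links are expanded first, and links scoring below the threshold are never fetched at all.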
V4-456
2010 2nd International Conference on Education Technology and Computer (ICETC)
in this search engine, and it shows the results of the three most popular search engines: Google, Yahoo, and MSN Search. We take the resulting URLs which are common to all three search engines, or common to at least two of them. For URLs which are common to all three search engines (Google, Yahoo, and MSN Search), we assume that those common result URLs are the most relevant for the query, and thus these URLs are the seed URLs. By experiment, a URL which is common to the results of two search engines also belongs to the relevant category of seed URLs: we assume that such a URL is not the most relevant for the topics, but it is still relevant, so we put it in the seed URL category as well. For example, we put the query "computer books" on threesearches.com, and the common results of all three search engines (Google, Yahoo, and MSN Search) are extracted. Here, two outputs are www.freecomputerbooks.com and www.computer-book.us.

B. Topic Specific Weight Table Construction

The weight table defines the crawling target. The topic name is sent as a query to the Google web search engine and the first k results are retrieved. The retrieved pages are parsed. To avoid indexing useless words, a text retrieval system often associates a stop list with a set of documents. A stop list is a set of words that are deemed "irrelevant." For example, a, the, of, for, with, and so on are stop words, even though they may appear frequently. Words are stemmed using the Porter stemming algorithm. For example, the group of words drug, drugged, and drugs share a common word stem, drug, and can be viewed as different occurrences of the same word.

Starting with a set "d" of documents and a set "t" of terms, we can model each document as a vector "v" in the t-dimensional space R^t, which is why this method is called the vector-space model. Let the term frequency be the number of occurrences of term "t" in document "d", i.e., freq(d, t). The (weighted) term-frequency matrix TF(d, t) measures the association of a term "t" with respect to the given document "d." It is generally defined as 0 if the document does not contain the term, and nonzero otherwise:

TF(d, t) = 0                               if freq(d, t) = 0
TF(d, t) = 1 + log(1 + log(freq(d, t)))    otherwise

where freq(d, t) is the term frequency, i.e., the total number of occurrences of term "t" in the document (in the nonzero case).

Besides the term frequency measure, there is another important measure, called inverse document frequency (IDF), that represents the scaling factor, or the importance, of a term "t." If a term "t" occurs in many documents, its importance will be scaled down due to its reduced discriminative power. For example, the term database systems may likely be less important if it occurs in many research papers in a database systems conference. IDF(t) is defined by the following formula:

IDF(t) = log((1 + |d|) / |d_t|)

where "d" is the document collection and "d_t" is the set of documents containing term "t."

In a complete vector-space model, TF and IDF are combined together, which forms the TF-IDF measure:

TF-IDF(d, t) = TF(d, t) × IDF(t)

We order the words by their weights and extract a certain number of words with high weight as the topic keywords. After that, the weights are normalized as:

W = Wi / Wmax    (1)

where "Wi" is the weight of keyword "i" and "Wmax" is the weight of the keyword with the highest weight.

For example, we have taken the topic keyword "computer books." For Topic Specific Weight Table construction, we put "computer books" as a query in the Google web search engine and the first 7 results are retrieved. After removing stop words, except the word computer (we know that computer is a stop word, but our query is "computer books", so we take the word computer as an important word), and stemming the words, the term frequency (TF) and inverse document frequency (IDF) of each word are calculated to obtain the weights. Here, we have taken the top 10 most frequently occurring words.

The term "books" appears 331 times in the combination of all 7 results. So, the term frequency of book is:

1 + log(1 + log 331) = 1 + log(1 + 2.519827994) = 1 + 0.546521441 = 1.546521441

Now the weight is normalized by equation (1) and the Topic Specific Weight Table is created (see TABLE I).

Terms       Weight
Web         1
Post        0.985815783
Ebook       0.678071903
Site        0.677361939
Linux       0.673904001
Computer    0.422007352
Java        0.411178185
Book        0.20311801
Free        0.202325996
Program     0.197359782

C. Relevancy Calculation

The weight of the words in a page corresponding to the keywords in the Topic Specific Weight Table is calculated. The weight calculation of words in a page uses the same approach that is used for the Topic Specific Weight Table weight calculation. Our proposed approach uses a cosine similarity measure to calculate the relevance of the page to a particular topic:

Relevance(t, p) = ( Σ_k w_kt · w_kp ) / ( sqrt(Σ_k w_kt²) · sqrt(Σ_k w_kp²) )

Here, "t" is the topic specific weight table, "p" is the web page under investigation, and "w_kt" and "w_kp" are the weights of keyword "k" in the weight table and in the web page, respectively.

D. Link Score Calculation

The focused crawler fetches web pages from the internet. After finding the seed pages, the focused crawler fetches them first, because experiments have shown that the seed pages are the most relevant to the topics. After fetching the seed pages, our approach calculates the Link Score. The Link Score assigns scores to the unvisited links extracted from a downloaded page, using the information of pages that have already been crawled and the metadata of hyperlinks:

Link_Score(i) = Anchor_Relevancy_Score(i) + Relevancy_Score_URL_Description(i) + Cohesive_Text_Relevancy_Score_of_URL(i) + [Relevancy(P1) + Relevancy(P2) + ... + Relevancy(Pn)]

Anchor_Relevancy_Score is the relevancy score between the topic keywords and the anchor text. Our proposed approach calculates Anchor_Relevancy_Score because the anchor text describes some information about the URL; it is the textual information about the URL. We find the related words of the anchor text with the help of a tool, and we find what percentage of the topic keywords is present in the set of related words, because the more topic keywords appear in the set of related words of the anchor text, the more relevant the anchor text is to the topics. For example, in a seed page there are a number of URLs. We have taken one URL, "http://www.freecomputerbooks.com/dbCategory.html", whose anchor text in this seed page is "databases and storage."

There are a number of related words of the anchor text "databases and storage": "Alexa, amazon elastic compute cloud, amazon mechanical turk, amazon payments, amazon simple storage service, amazon web services, apis, archives and records management, article, articles, aws, aws user group, books, cloud computing, cultural property, data mining, database, database storage, databases, developer forum, developer tools, devpay, dynamic, ec2, file, flexible payments service, freedom of act, grid computing, image, information, information age, information literacy, information technology, journalism, journals, library science, management, media, modem, movie, newspapers in america, online library, photo, research, retrieval, simple queue service, simpledb, storage, text utility computing, web application development, web hosting, website."

Now, we find that only two words of the topic keywords are available in the related set. So, the Anchor_Relevancy_Score of the "databases and storage" anchor text is 0.2, because only two words of the topic keywords, web and books, are present in the related set.

Relevancy_Score_URL_Description is the relevancy score of the URL description with respect to the topic keywords. We put the URL as a query in the Google search engine in the form "description of URL name", find the top 10 results, and then find the top 10 weighted words after calculating TF × IDF. Our proposed approach calculates Relevancy_Score_URL_Description because it gives detailed information about the URL, i.e., what information it contains or what it describes.

For example, we put the query "description of http://www.freecomputerbooks.com/dbCategory.html" in the Google search engine and find the weights of the topic keywords among the 10 weighted words. Now, we calculate Relevancy_Score_URL_Description by the R(t, p) calculation, and the score R(t, p) is 0.747750787 (see TABLE II). Table II shows the weights of the topic keywords in the description of the URL.

Terms       Weight
Web         0.864186362
Post        0.738989594
Ebook       0.738989594
Site        0.738989594
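Taken together, the weight-table construction (Section B), the cosine relevance measure R(t, p) (Section C), and the Link Score (Section D) can be sketched in a few small functions. This is a minimal sketch under our own naming, not the paper's code: base-10 logarithms are assumed (they reproduce the worked "books" calculation), and the related-word set for an anchor text is assumed to come from an external tool, as in the paper.

```python
import math

def tf(freq):
    """Weighted term frequency: 0 if the term is absent, else 1 + log(1 + log(freq))."""
    return 0.0 if freq == 0 else 1.0 + math.log10(1.0 + math.log10(freq))

def idf(total_docs, docs_with_term):
    """Inverse document frequency: log((1 + |d|) / |d_t|)."""
    return math.log10((1.0 + total_docs) / docs_with_term)

def normalize(weights):
    """Equation (1): divide every weight by the highest weight."""
    w_max = max(weights.values())
    return {term: w / w_max for term, w in weights.items()}

def relevance(table, page):
    """Cosine similarity R(t, p) between the topic-specific weight table
    and a page's term-weight dictionary."""
    num = sum(w * page.get(k, 0.0) for k, w in table.items())
    den = (math.sqrt(sum(w * w for w in table.values()))
           * math.sqrt(sum(w * w for w in page.values())))
    return num / den if den else 0.0

def anchor_relevancy_score(topic_keywords, related_words):
    """Fraction of the topic keywords found in the anchor text's related-word set."""
    return sum(1 for k in topic_keywords if k in related_words) / len(topic_keywords)

def link_score(anchor_rel, desc_rel, cohesive_rel, parent_scores):
    """Link_Score(i): sum of the three component scores plus all parent relevancies."""
    return anchor_rel + desc_rel + cohesive_rel + sum(parent_scores)
```

For instance, tf(331) reproduces the term frequency 1.546521441 computed for "books" above, and two hits out of ten topic keywords give the anchor score of 0.2 from the "databases and storage" example.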
sentences around the anchor link have to be considered. A sentence can be identified as starting with a capital letter and ending with a period (dot). The following algorithm describes the steps for extracting the cohesive text:
1. Identify the anchor link in the page.
2. Extract a sentence in the backward direction of the anchor link, if any.
3. If this sentence starts with the word 'It', 'This', or 'And', then extract one more sentence in the backward direction, if any.
4. Repeat steps 2 & 3 until the sentence starts with a word excluding the words mentioned in step 3.
5. Extract a sentence in the forward direction of the anchor tag, if any.

After calculating Relevancy_Score_URL_Description, our proposed approach calculates Cohesive_Relevancy_Score_of_URL because it gives information about the semantic similarity of the URL with respect to the topic keywords. It calculates how many topic keywords exist in the division to which the particular URL belongs, out of the total topic keywords that exist in the topic specific table. Sometimes there are no words surrounding the anchor link; in that case its value will be 0. Here, for the link http://www.freecomputerbooks.com/dbCategory.html, the Cohesive_Relevancy_Score_of_URL is 0.1.

E. Relevancy Calculation of Parent Pages

/* where n is the total number of links in web pages */
Step 4: for j = 1 to m
/* where m is the total number of parent pages of each link(i) */
Step 5: Calculate the relevancy score of each parent page.
/* Calculate the relevancy score R(t, p) of each parent page with respect to the topic keywords. Repeat step 5 until the relevancy score of every parent page has been calculated. */
Step 6: Calculate Relevancy_Score_URL_description.
/* It is in the first loop because the second loop is finished in step 5. */
Step 7: Calculate Anchor_Relevancy_Score.
Step 8: Calculate link_score(i).
/* The first for loop runs from step 3 to step 8. It is executed until link_score(i) of every outlink in a particular page has been calculated. */

From link_score(i), our proposed approach identifies which outlink in a web page is more relevant to the topics. It is also based on a threshold value: if link_score(i) is greater than the threshold value, then this outlink is more relevant to the topics, and based on the link score a priority is assigned that determines which URL will be fetched first from the frontier by the focused crawler.

Link_score(http://www.freecomputerbooks.com/dbCategory.html) = 0.2 + 0.747750787 + 0.1 + [0.701533168 + 0.806545640 + 0.812362450 + 0.786754980 + 0.704356870] = 4.859303875
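The five-step cohesive-text extraction described in this section can be sketched as follows. This is an illustrative sketch: sentence splitting (capital letter to period) is assumed to have been done already, so the input is a list of sentences plus the index of the sentence containing the anchor link.

```python
# Continuation words from step 3 of the extraction algorithm.
CONTINUATION_WORDS = {"It", "This", "And"}

def extract_cohesive_text(sentences, anchor_idx):
    """Return the cohesive text around the sentence containing the anchor link.

    Steps 2-4: walk backward from the anchor, taking each preceding sentence;
    stop once a taken sentence starts with a word other than 'It'/'This'/'And'.
    Step 5: take one sentence forward, if any.
    """
    collected = [sentences[anchor_idx]]
    i = anchor_idx - 1
    while i >= 0:                              # step 2: extract a sentence backward
        collected.insert(0, sentences[i])
        words = sentences[i].split()
        first_word = words[0] if words else ""
        if first_word not in CONTINUATION_WORDS:
            break                              # step 4: stop at a non-continuation sentence
        i -= 1                                 # step 3: extract one more backward
    if anchor_idx + 1 < len(sentences):
        collected.append(sentences[anchor_idx + 1])   # step 5: one forward
    return " ".join(collected)
```

If the anchor sentence has no surrounding sentences at all, only the anchor sentence itself is returned, matching the paper's note that the score defaults to 0 when no surrounding words exist.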
retrieving the maximum number of relevant pages but also on finishing the operation as soon as possible.

REFERENCES

[1] X. Zhang, T. Zhou, Z. Yu and D. Chen, "URL Rule Based Focused Crawlers", IEEE International Conference on e-Business Engineering, 2008.
[2] A. Pal, D. S. Tomar and S. C. Shrivastava, "Effective Focused Crawling Based on Content and Link Structure Analysis", International Journal of Computer Science and Information Security (IJCSIS), Vol. 2, No. 1, June 2009.
[3] Y. Zhang, C. Yin and F. Yuan, "An Application of Improved PageRank in Focused Crawler", Fourth International Conference on Fuzzy Systems and Knowledge Discovery (FSKD 2007), IEEE.
[4] Q. Cheng, W. Beizhan and W. Pianpian, "Efficient focused crawling strategy using combination of link structure and content similarity", IEEE, 2008.
[5] X. Chain and X. Zhang, "HAWK: A Focused Crawler with Content and Link Analysis", IEEE International Conference on e-Business Engineering, 2008.
[6] S. Chakrabarti, M. van den Berg and B. Dom, "Focused crawling: a new approach to topic-specific Web resource discovery", 8th International WWW Conference, May 1999.
[7] M. Yuvarani, N. Ch. S. N. Iyengar and A. Kannan, "LSCrawler: A Framework for an Enhanced Focused Web Crawler based on Link Semantics", Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence.