
MIT International Journal of Computer Science & Information Technology, Vol. 2, No. 1, Jan. 2012, pp. 12-15
ISSN 2230-7621, MIT Publications

Web Link Structure Mining of the World Wide Web using Hyperlink Induced Topic Search

Shivangi Dhall, Assistant Professor, M.I.T., Moradabad; e-mail: shivangi_dhall077@yahoo.co.in
Bharat Bhushan Agarwal, Reader, C.E.T., I.F.T.M. University, Moradabad; e-mail: bharat_innocent@yahoo.co.in
Swarna Chaudhary, Assistant Professor, M.I.T., Moradabad; e-mail: swarna.biet@gmail.com

ABSTRACT
The World-Wide Web provides every internet citizen with access to an abundance of information, but it becomes increasingly
difficult to identify the relevant pieces of information. To be able to cope with the abundance of available information,
users of the WWW need to rely on intelligent tools that assist them in finding, sorting, and filtering the available information.
The Web is a hypertext body of approximately 300 million pages that continues to grow at roughly a million pages per day.
Today, index-based search engines are the primary tool by which users search the Web for information. Experienced
users can make effective use of such engines for tasks that can be solved by searching for tightly constrained keywords and
phrases. These search engines are, however, unsuited for a wide range of equally important tasks. We develop algorithms
that exploit the hyperlink structure of the WWW for information discovery and categorization, the construction of high-
quality resource lists, and the analysis of on-line hyperlinked communities. In this paper, we discuss problems with the HITS
(Hyperlink-Induced Topic Search) algorithm, which capitalizes on hyperlinks to extract topic-bound communities of web
pages.
Keywords: Link Structure, HITS, Hubs, Authorities.

INTRODUCTION

The World Wide Web contains an enormous amount of information, but it can be exceedingly difficult for users to locate resources that are both high in quality and relevant to their information needs. There are a number of fundamental reasons for this. The Web is a hypertext corpus of enormous size, approximately three hundred million Web pages as of this writing, and it continues to grow at a phenomenal rate. But the variation in pages is even worse than the raw scale of the data: the set of Web pages taken as a whole has almost no unifying structure, with variability in authoring style and content that is far greater than in traditional collections of text documents. This level of complexity makes it impossible to apply techniques from database management and information retrieval in an off-the-shelf fashion.

Index-based search engines for the WWW have been one of the primary tools by which users of the Web search for information. The largest of such search engines exploit the fact that modern storage technology makes it possible to store and index a large fraction of the WWW; they can therefore build giant indices that allow one to quickly retrieve the set of all Web pages containing a given word or string. A user typically interacts with them by entering query terms and receiving a list of Web pages that contain the given terms.

Experienced users can make effective use of such search engines for tasks that can be solved by searching for tightly constrained keywords and phrases; however, these search engines are not suited for a wide range of equally important tasks. In particular, a topic of any breadth will typically contain several thousand or several million relevant Web pages; at the same time, a user will be willing to look at only an extremely small number of these pages. How, from this sea of pages, should a search engine select the correct ones?

Our work begins from two central observations. First, in order to distill a large search topic on the WWW down to a size that will make sense to a human user, we need a means of identifying the most definitive or authoritative Web pages on the topic. This notion of authority adds a crucial second dimension to the notion of relevance: we wish not only to locate a set of relevant pages, but rather the relevant pages of the highest quality. Second, the Web consists not only of pages but of hyperlinks that connect one page to another; and this hyperlink structure contains an enormous amount of latent human annotation that can be extremely valuable for automatically inferring notions of authority. Specifically, the creation of a hyperlink by the author of a Web page represents an implicit type of endorsement of the page being pointed to; by mining the collective judgement contained in the set of such endorsements, we can obtain a richer understanding of both the relevance and quality of the Web's contents.

There are many ways that one could try using the link structure of the Web to infer notions of authority, and some of these are much more effective than others. This is not surprising: the link structure implies an underlying social structure in the way that pages and links are created, and it is an understanding of this social organization that can provide us with the most leverage. Our goal in designing algorithms for mining link information is to develop techniques that take advantage of what we observe about the intrinsic social organization of the Web.

HITS: COMPUTING HUBS AND AUTHORITIES

Web page content can be categorized into two main categories: (1) textual content and (2) structural content imposed by the hyperlinks. While information retrieval methods concentrate on the textual content to determine the relevance of a web page to a text query, the set of possible matches is too large for a single user due to the size of the World Wide Web. It is also possible that a page relevant to a keyword query may not contain that keyword. The link structure provides additional information that can be used to rank the web pages relevant to a query.

The HITS algorithm is a very popular and effective algorithm for ranking web pages based on the link structure among a set of web pages. The basic idea of the HITS algorithm is to identify a small sub-graph of the Web and apply link analysis on this sub-graph to locate the authorities and hubs for the given query. The sub-graph that is chosen depends on the user query.

HITS is a link analysis algorithm that analyzes hyperlinks to uncover two types of pages. Authorities are pages that contain useful information about the query topic. Hubs are pages that contain pointers to good information sources, i.e., they provide collections of links to authorities. Obviously, both types of pages are typically connected (Figure 1): good hubs contain pointers to many good authorities, and good authorities are pointed to by many good hubs. So, there is a mutual reinforcement relationship between hubs and authorities.

Figure 1: Hubs & Authorities

For a given query, HITS will find authorities and hubs. We now describe the HITS algorithm [1], which computes lists of hubs and authorities for WWW search topics.

The main process of the HITS algorithm is as follows [5, 3]. After data preparation, an iterative process of weight propagation is initiated to determine numerical estimates of the hub and authority weights. First, for each page i in the base set, the algorithm defines a non-negative authority weight, a(i), and a non-negative hub weight, h(i), both of which are initialized to a uniform constant. A page with high values of a(i) and h(i) is regarded as a useful page.

Notice that before the iterative process, all the weights should be normalized to satisfy the conditions Σ_i a(i)^2 = 1 and Σ_i h(i)^2 = 1. Then the authority and hub weights are updated based on the following equations:

a(i) = Σ_{j→i} h(j)    (1)

h(i) = Σ_{i→j} a(j)    (2)

As shown in equation (1), the authority weight of page i, a(i), equals the sum of the hub weights of all pages in the base set linking to page i. Meanwhile, as shown in equation (2), the hub weight of page i, h(i), equals the sum of the authority weights of all pages in the base set that page i links to.

The calculation is repeated until a(i) and h(i) converge to stable values. Suppose the base set includes pages 1, 2, ..., n. Let the adjacency matrix B be an n×n matrix, where B(i, j) equals 1 if page i points to page j, and 0 otherwise. Similarly, let a = (a_1, a_2, ..., a_n)^T be the authority weight vector, and h = (h_1, h_2, ..., h_n)^T be the hub weight vector. Thus, we have

h = B a,    a = B^T h,

where B^T is the transpose of B. Unfolding these two equations once, we have

h = B B^T h = (B B^T) h,    a = B^T B a = (B^T B) a.

Unfolding them k times, we have

h = (B B^T)^k h,    a = (B^T B)^k a.

According to linear algebra, these two sequences of iterations, when satisfying the normalization condition, converge to the principal eigenvectors of B B^T and B^T B respectively [2]. It has also been proved that the authority and hub weights are intrinsic features of the collected linked pages and cannot be influenced by the initial settings.

Finally, the HITS algorithm outputs a list which contains the pages with large hub weights and the pages with large authority weights for a given topic.

A page's authority weight is proportional to the sum of the hub weights of the pages that link to it (Kleinberg [5]). Similarly, a page's hub weight is proportional to the sum of the authority weights of the pages that it links to. Figure 2 shows an example of the calculation of authority and hub scores.
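Update rules (1) and (2), together with the normalization step, can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the five-page link graph below is hypothetical, and `links` maps each page to the set of pages it points to (i.e. the rows of the adjacency matrix B).

```python
import math

def hits(links, iterations=50):
    """Iteratively compute authority and hub weights for a link graph.

    links maps each page to the set of pages it points to."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    # Initialize all weights to a uniform constant.
    a = {p: 1.0 for p in pages}
    h = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Equation (1): a(i) = sum of h(j) over pages j that link to i.
        a = {p: sum(h[j] for j in pages if p in links.get(j, ())) for p in pages}
        # Equation (2): h(i) = sum of a(j) over pages j that i links to.
        h = {p: sum(a[j] for j in links.get(p, ())) for p in pages}
        # Normalize so the squared weights sum to 1.
        for w in (a, h):
            norm = math.sqrt(sum(v * v for v in w.values())) or 1.0
            for p in w:
                w[p] /= norm
    return a, h

# Hypothetical example: pages 1-3 act as hubs pointing at pages 4-5.
links = {1: {4, 5}, 2: {4, 5}, 3: {4}}
a, h = hits(links)
print(max(a, key=a.get))  # 4: the page endorsed by all three hubs
```

Note that each round updates the authority weights from the previous hub weights and then the hub weights from the fresh authority weights, which matches the mutual-reinforcement description above.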

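The eigenvector characterization above can be checked numerically: iterating a ← (B^T B) a with normalization should reach a fixed point of B^T B up to scaling, i.e. its principal eigenvector. The sketch below is an illustration on a hypothetical 4-page adjacency matrix, not data from the paper.

```python
import math

def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(A, M):
    Mt = transpose(M)
    return [[sum(x * y for x, y in zip(row, col)) for col in Mt] for row in A]

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

# Hypothetical 4-page graph: B[i][j] = 1 if page i points to page j.
B = [[0, 1, 1, 0],
     [0, 0, 1, 0],
     [0, 0, 0, 1],
     [1, 0, 1, 0]]

BtB = matmul(transpose(B), B)

# Unfolded authority iteration: a <- (B^T B) a, normalized each step.
a = normalize([1.0] * 4)
for _ in range(100):
    a = normalize(matvec(BtB, a))

# At convergence, (B^T B) a is parallel to a, so a is an eigenvector.
Ba = matvec(BtB, a)
eigenvalue = sum(x * y for x, y in zip(Ba, a))  # Rayleigh quotient
residual = math.sqrt(sum((x - eigenvalue * y) ** 2 for x, y in zip(Ba, a)))
print(residual < 1e-9)  # True
```

Because B^T B is symmetric and non-negative, this power iteration settles on the dominant eigenvector regardless of the uniform starting weights, which is exactly the insensitivity to initial settings claimed above.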
Figure 2: Calculation of hubs & authorities

CONSTRAINTS OF HITS

The following are the constraints of the HITS algorithm [6]:

Hubs and authorities: It is not easy to distinguish between hubs and authorities because many sites are hubs as well as authorities.

Topic drift: Sometimes HITS may not produce the documents most relevant to the user's query because of equivalent weights.

Automatically generated links: HITS gives equal importance to automatically generated links, which may not produce relevant topics for the user's query.

Efficiency: The HITS algorithm is not efficient in real time.

Mutually reinforced relationships between hosts: Sometimes a set of documents on one host points to a single document on a second host, or a single document on one host points to a set of documents on a second host. These situations can produce wrong definitions of a good hub or a good authority.

PROBLEMS WITH THE HITS ALGORITHM

To clarify problems with the HITS algorithm, we traced Kleinberg's experiments. We picked 9 query topics for our study: "abortion", "Artificial Intelligence", "censorship", "Harvard", "jaguar", "Kyoto University", "Olympic", "search engine", and "Toyota". Of these query topics, all but "Kyoto University" and "Toyota" were used in [4] and [1]. Though we fixed the parameters r, d, and a text-based search engine for collecting the root set in order to examine Kleinberg's experiments rigorously, we observed that the HITS algorithm performed poorly in several of our test cases. In this paper, we focus the discussion on the topic "Artificial Intelligence" as a successful example and the topic "Harvard" as an unsuccessful example.

Topic: Artificial Intelligence

The extracted top 5 authorities and hubs for "Artificial Intelligence" in our experiment are shown in Table 1. The decimal fractions shown to the left of the URLs represent authority weights (ap) and hub weights (hp) respectively.

TABLE 1: Authorities and hubs of "Artificial Intelligence"

ap    Authorities
.372  http://www.cs.washington.edu/research/jair/home.html
.298  http://www.aaai.org/
.294  http://www.ai.mit.edu/
.272  http://ai.iit.nrc.ca/ai point.html
.234  http://sigart.acm.org/

hp    Hubs
.228  http://yonezakiwww.cs.titech.ac.jp/member/hidekazu/Work/AI.html
.228  http://www.cs.berkeley.edu/russell/ai.html
.204  http://uscia1.usc.clu.edu/pantonio/cco360/AIWeb.htm
.181  http://www.scms.rgu.ac.uk/staff/asga/ai.html
.171  http://www.ex.ac.uk/ESE/IT/ai.html

(Note: ap and hp represent authority weight and hub weight respectively.)

The top authority was the home page of JAIR (Journal of Artificial Intelligence Research); the second authority was AAAI (American Association for Artificial Intelligence), followed by the MIT AI laboratory. That is, famous organizations related to Artificial Intelligence based in the United States were successfully extracted. This AI community was supplemented by hubs, which consisted of researchers' personal Web pages (e.g. S. Russell at UCB).

Topic: Harvard

In Kleinberg's experiment, the authorities for "Harvard" were related to Harvard University, e.g. the homepage of Harvard University, Harvard Law School, Harvard Business School, and so on. However, in our experiment, Web pages authored by a financial consulting company were extracted (see Table 2). These pages did not relate to the query "Harvard".

TABLE 2: Authorities and hubs of "Harvard"

ap    Authorities
.130  http://www.wetradefutures.com/investment.asp
.130  http://www.wetradefutures.com/trend.htm
.130  http://www.wetradefutures.com/market technology.htm
.130  http://www.wetradefutures.com/florida investment.htm
.130  http://www.wetradefutures.com/investing investment.htm

hp    Hubs
.247  http://www.profittaker.net/data.htm
.247  http://www.profittaker.org/new twentyseven.htm
.247  http://profittaker.com/sunday trader more.htm
.247  http://www.profittaker.cc/system software.htm
.247  http://www.futureforecasts.com/contact phone.htm

(Note: ap and hp represent authority weight and hub weight respectively.)
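The "mutually reinforced relationships between hosts" constraint and the "Harvard" result above can be reproduced on a toy graph: a handful of pages from a single hypothetical site that all link to one another out-score a page endorsed by independent hubs. The `hits` routine below is a minimal re-implementation of update rules (1) and (2) for illustration, not the experimental code, and all page names are invented.

```python
import math

def hits(links, iterations=50):
    """Minimal HITS iteration; links maps a page to the pages it points to."""
    pages = set(links) | {q for ts in links.values() for q in ts}
    a = {p: 1.0 for p in pages}
    h = {p: 1.0 for p in pages}
    for _ in range(iterations):
        a = {p: sum(h[j] for j in pages if p in links.get(j, ())) for p in pages}
        h = {p: sum(a[j] for j in links.get(p, ())) for p in pages}
        for w in (a, h):
            norm = math.sqrt(sum(v * v for v in w.values())) or 1.0
            for p in w:
                w[p] /= norm
    return a, h

# Four pages of one hypothetical organization all link to each other...
site = ["org1", "org2", "org3", "org4"]
links = {p: {q for q in site if q != p} for p in site}
# ...while a genuinely useful page is endorsed by two independent hubs.
links["hub1"] = {"relevant"}
links["hub2"] = {"relevant"}

a, _ = hits(links)
print(max(a, key=a.get).startswith("org"))  # True: the clique wins
print(a["relevant"])  # vanishingly small: the clique absorbs the weight
```

In the principal eigenvector of B^T B, the weight of the independently endorsed page is driven toward zero, which mirrors how the mutually linked wetradefutures/profittaker pages crowded out Harvard-related pages in Table 2.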

In this case, the 56 highest-ranked authorities had the same authority weight, and the 5 highest-ranked hubs had the same hub weight. By checking the contents of these pages, we found that these authorities and hubs were authored by a single organization.

CONCLUSION

The WWW has grown into a hypertext environment of enormous complexity, and the process underlying its growth has been driven in a chaotic fashion by the individual actions of numerous participants. This paper covers the basics of the HITS algorithm and its constraints on the web. The main purpose of this paper is to explore the HITS algorithm, which performs information retrieval by observing the hyperlink structure of the web, and some of its problems. Future work will apply the HITS algorithm on the Web and observe and improve upon the constraints of the algorithm.

REFERENCES

[1] D. Gibson, J. Kleinberg, and P. Raghavan, "Inferring Web communities from link topology," in Proc. 9th ACM Conference on Hypertext and Hypermedia (HyperText 98), pp. 225-234, Pittsburgh, PA, June 1998.

[2] G. Golub and C.F. Van Loan, Matrix Computations, Johns Hopkins University Press, 1989.

[3] J. Han and M. Kamber, Data Mining: Concepts and Techniques (2nd Ed.), Academic Press, 2006.

[4] J. Kleinberg, "Authoritative sources in a hyperlinked environment," Research Report RJ 10076 (91892), IBM, 1997.

[5] J.M. Kleinberg, "Authoritative Sources in a Hyperlinked Environment," in Proceedings of the Ninth ACM-SIAM Symposium on Discrete Algorithms, New York, 1998, pp. 668-677.

[6] S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins, "Mining the Link Structure of the World Wide Web," IEEE Computer, Vol. 32, Issue 8, pp. 60-67, 1999.
