ISSN 2230-7621 MIT Publications
ABSTRACT
The World-Wide Web provides every internet citizen with access to an abundance of information, but it becomes increasingly
difficult to identify the relevant pieces of information. To be able to cope with the abundance of available information,
users of the WWW need to rely on intelligent tools that assist them in finding, sorting, and filtering the available information.
The Web is a hypertext body of approximately 300 million pages that continues to grow at roughly a million pages per day.
Today, index-based search engines for the Web are the primary tool by which users search for information. Experienced
users can make effective use of such engines for tasks that can be solved by searching for tightly constrained keywords and
phrases. These search engines are, however, unsuited for a wide range of equally important tasks. We develop algorithms
that exploit the hyperlink structure of the WWW for information discovery and categorization, the construction of
high-quality resource lists, and the analysis of on-line hyperlinked communities. In this paper, we discuss problems with the
HITS (Hyperlink-Induced Topic Search) algorithm, which capitalizes on hyperlinks to extract topic-bound communities of web
pages.
Keywords: Link Structure, HITS, Hubs, Authorities.
There are many ways that one could try using the link structure of the Web to infer notions of authority, and some of these are much more effective than others. This is not surprising: the link structure implies an underlying social structure in the way that pages and links are created, and it is an understanding of this social organization that can provide us with the most leverage. Our goal in designing algorithms for mining link information is to develop techniques that take advantage of what we observe about the intrinsic social organization of the Web.

HITS: COMPUTING HUBS AND AUTHORITIES

Web page content can be divided into two main categories: (1) textual content and (2) structural content imposed by the hyperlinks. While information retrieval methods concentrate on the textual content to determine the relevance of a web page to a text query, the set of possible matches is too large for a single user due to the size of the World Wide Web. It is also possible that a page relevant to a keyword query may not contain that keyword. The link structure provides additional information that can be used to rank the web pages relevant to a query.

The HITS algorithm is a very popular and effective algorithm for ranking web pages based on the link structure among a set of web pages. The basic idea of the HITS algorithm is to identify a small sub-graph of the Web and apply link analysis on this sub-graph to locate the authorities and hubs for the given query. The sub-graph that is chosen depends on the user query.

HITS is a link analysis algorithm that analyzes hyperlinks to uncover two types of pages: authorities, which are pages that contain useful information about the query topic, and hubs, which contain pointers to good information sources, i.e., they provide collections of links to authorities.

Both types of pages are typically connected (Figure 1): good hubs contain pointers to many good authorities, and good authorities are pointed to by many good hubs. So, there is a mutual reinforcement relationship between hubs and authorities.

Figure 1: Hubs & Authorities

For a given query, HITS will find authorities and hubs. We now describe the HITS algorithm [1], which computes lists of hubs and authorities for WWW search topics.

The main process of the HITS algorithm is as follows [5, 3]. After data preparation, an iterative process of weight propagation is initiated to determine numerical estimates of the hub and authority weights. First, for each page $i$ in the base set, the algorithm defines a non-negative authority weight, $a(i)$, and a non-negative hub weight, $h(i)$, both of which are initialized to a uniform constant. A page with high values of $a(i)$ and $h(i)$ is regarded as a useful page.

Notice that before the iterative process, all the weights should be normalized to satisfy the conditions $\sum_i a(i)^2 = 1$ and $\sum_i h(i)^2 = 1$. Then the authority and hub weights are updated based on the following equations:

\[ a(i) = \sum_{j \to i} h(j) \tag{1} \]

\[ h(i) = \sum_{i \to j} a(j) \tag{2} \]

As shown in equation (1), the authority weight of page $i$, $a(i)$, equals the sum of the hub weights of all the pages in the base set that link to page $i$. Meanwhile, as shown in equation (2), the hub weight of page $i$, $h(i)$, equals the sum of the authority weights of all the pages in the base set that page $i$ links to.

The calculation is repeated many times until $a(i)$ and $h(i)$ converge to stable values. Suppose the base set includes $n$ pages, numbered $1, 2, \ldots, n$. Let the adjacency matrix $B$ be an $n \times n$ matrix, where $B(i, j)$ equals 1 if page $i$ points to page $j$, and 0 otherwise. Similarly, let $a = (a_1, a_2, \ldots, a_n)^T$ be the authority weight vector, and $h = (h_1, h_2, \ldots, h_n)^T$ be the hub weight vector. Thus, we have

\[ h = B a, \qquad a = B^T h, \]

where $B^T$ is the transpose of $B$. Unfolding these two equations once, we have

\[ h = B B^T h = (B B^T) h, \qquad a = B^T B a = (B^T B) a. \]

Unfolding these two equations $k$ times, we have

\[ h = B a = B B^T h = (B B^T)^2 h = \cdots = (B B^T)^k h, \]

\[ a = B^T h = B^T B a = (B^T B)^2 a = \cdots = (B^T B)^k a, \]

where each equality stands for one further substitution of the update rules, so the vector on the right-hand side is always an earlier estimate.

According to linear algebra, these two sequences of iterations, when satisfying the normalization condition, will converge to the principal eigenvectors of $B B^T$ and $B^T B$ respectively [2]. It is also proved that the authority and hub weights are intrinsic features of the linked pages collected and cannot be influenced by the initial settings.

Finally, the HITS algorithm outputs a list which contains the pages with large hub weights and the pages with large authority weights for a given topic.

A page's authority weight is proportional to the sum of the hub weights of the pages that link to it (Kleinberg [21]). Similarly, a page's hub weight is proportional to the sum of the authority weights of the pages that it links to. Figure 2 shows an example of the calculation of authority and hub scores.
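
The weight-propagation iteration described in this section can be illustrated with a short Python sketch. It is not taken from the paper: the function name `hits`, the parameters `adjacency`, `max_iter` and `tol`, the choice to re-normalize after every update, the stopping rule, and the three-link toy graph are all assumptions made for illustration. It applies equation (1), then equation (2), and repeats until the weights stop changing.

```python
import math


def hits(adjacency, max_iter=100, tol=1e-10):
    """Return (authority, hub) weight lists for a 0/1 adjacency matrix.

    adjacency[i][j] == 1 means page i links to page j.
    max_iter and tol are illustrative stopping parameters (assumptions).
    """
    n = len(adjacency)

    def normalize(v):
        # Scale v so that the sum of squared weights equals 1.
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v] if norm > 0 else list(v)

    # Uniform initial weights, already normalized.
    a = normalize([1.0] * n)
    h = normalize([1.0] * n)

    for _ in range(max_iter):
        # Equation (1): a(i) is the sum of h(j) over pages j linking to i.
        new_a = normalize(
            [sum(h[j] for j in range(n) if adjacency[j][i]) for i in range(n)]
        )
        # Equation (2): h(i) is the sum of a(j) over pages j that i links to.
        new_h = normalize(
            [sum(new_a[j] for j in range(n) if adjacency[i][j]) for i in range(n)]
        )
        # Assumed stopping rule: largest change in any weight is below tol.
        change = max(
            max(abs(x - y) for x, y in zip(new_a, a)),
            max(abs(x - y) for x, y in zip(new_h, h)),
        )
        a, h = new_a, new_h
        if change < tol:
            break
    return a, h


# Hypothetical base set of four pages: 0 -> 2, 0 -> 3, 1 -> 2.
links = [
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
authority, hub = hits(links)
print("authority:", authority)
print("hub:", hub)
```

In this toy graph, page 2 is pointed to by both linking pages and page 0 points to both linked pages, so page 2 receives the largest authority weight and page 0 the largest hub weight, while pages with no in-links (or no out-links) get an authority (or hub) weight of zero.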
MIT International Journal of Computer Science & Information Technology, Vol. 2, No. 1, Jan. 2012, pp. (12-15) 14