
RELEVANCE PROPAGATION MODEL USING DOCUMENT AND LINK SCORES PROPAGATION

Idir Chibane
Supelec
Supelec, Plateau de Moulon, 3 rue Joliot Curie, 91 192 Gif/Yvette, France
Idir.Chibane@supelec.fr

Bich-Liên Doan
Supelec
Supelec, Plateau de Moulon, 3 rue Joliot Curie, 91 192 Gif/Yvette, France
Bich-Lien.Doan@supelec.fr

ABSTRACT
Web search engines have become indispensable in our daily life to help us find the information we need. Several search
tools, such as Google, use links to select the documents matching a query. In this paper, we propose a new
ranking function that combines content rank and link rank based on the propagation of scores over links. This function propagates
scores from source pages to destination pages in relation with query terms. We assessed our ranking function with
experiments over two test collections, WT10g and .GOV. We conclude that propagating link scores according to query
terms provides a significant improvement for information retrieval.

KEYWORDS
Information retrieval, hypertext systems, link analysis, web, relevance propagation, test collection.

1. INTRODUCTION
A major focus of information retrieval (IR) research is on developing strategies for identifying documents
“relevant” to a given query. In traditional IR, the evidence of relevance is thought to reside in the text content of
documents. Consequently, the fundamental strategy of traditional IR is to rank documents according to their
estimated degree of relevance based on such measures as term similarity or term occurrence probability. In
the Web setting, however, information can reside outside the textual content of documents. For example,
links between pages can be used to improve the term-based estimation of document relevance. Furthermore,
hyperlinks, being the most important source of evidence in Web documents, have been the subject of much
research exploring retrieval strategies based on link analysis.
The explosive growth of the Web has led to a surge of research activity in the area of IR on the World Wide
Web. Ranking has always been an important component of any information retrieval system (IRS). In the
case of Web search, its importance becomes critical. Given the size of the Web (Google counted more than 8.16
billion Web pages in August 2005¹), it is imperative to have ranking functions that capture the user's needs. To
this end, the Web offers a rich context of information which is expressed through links. In recent years,
several information retrieval methods using information about the link structure have been developed.
Currently, most systems based on link structure information combine content with a popularity measure of
the page to rank query results. Google's PageRank (Brin et al., 1998) and Kleinberg's HITS (Kleinberg,
1999) are two fundamental algorithms that exploit the link structure among Web pages. A number of
extensions of these two algorithms have also been proposed, such as (Lempel et al., 2000) (Haveliwala, 2002)
(Kamvar et al., 2003) (Jeh et al., 2003) (Deng et al., 2004) and (Xue-Mei et al., 2004). All these link analysis
algorithms are based on two assumptions: (1) If there is a link from page A to page B, then we may assume

that page A endorses and recommends the content of page B. (2) Pages that are co-cited by a certain page are
likely to share the same topic as well as to help retrieval. The power of hyperlink analysis comes from the
fact that it uses the content of other pages to rank the current page. Ideally, these pages were created by
authors independent of the author of the original page, thus adding an unbiased factor to the ranking.

¹ http://www.zorgloob.com/2005/08/8-168-684-336-pages-pour-google.asp
The study of existing systems enabled us to conclude that most ranking functions using link
structure do not depend on query terms, and that as a consequence the precision of the returned results decreases significantly. In
this paper we investigate, theoretically and experimentally, the application of link analysis to ranking pages
on the Web. The rest of this paper is organized as follows. In Section 2, we review recent work on link
analysis: we first review the related literature on link analysis ranking algorithms and also present some
extensions of these algorithms. In Section 3, we present our information retrieval model with the new ranking
function. In Section 4, we show experimental results on multiple queries using the proposed algorithm,
including a comparative study of different algorithms. In Section 5, we summarize our main contributions
and discuss possible new applications for our proposed method.

2. PREVIOUS WORK
Unlike traditional IR, the Web contains both content and link structures, which have provided many new
dimensions for exploring better IR techniques. In the early days, Web content and structure were analyzed
independently. Typical approaches such as (Hawking, 2000) (Craswell et al., 2003) (Craswell et al., 2004)
use the TF-IDF (Salton et al., 1975) of the query terms in the page to compute a relevance score, and use
hyperlinks to compute a query-independent importance score (e.g. PageRank (Brin et al., 1998)). These
two scores are then combined to rank the retrieved documents.
In recent years, some new methodologies that explore the inter-relationship between content and link
structures have been introduced. (Qin et al., 2005) divide these methods into two categories: one is to
enhance link analysis with the assistance of content information (Kleinberg, 1999) (Lempel et al., 2000)
(Haveliwala, 2002) (Amento et al., 2000) (Chakrabarti, 2001) (Chakrabarti et al., 2001) (Ingongngam et al.,
2003); the other is relevance propagation, which propagates content information with the assistance of Web
structure (Mcbryan, 1994) (Song et al., 2004) (Shakery et al., 2003).
For the first category, HITS (Kleinberg, 1999) is the representative. The HITS algorithm first constructs a
query-specific sub-graph, and then computes authority and hub scores on this sub-graph to rank the
documents. Kleinberg distinguishes between two different notions of relevance: an authority is a page that is
relevant in itself, and a hub is a page that is relevant because it contains links to many related authorities. To
identify good hubs and authorities, Kleinberg's procedure exploits the graph structure of the Web. Given
a query, the procedure first constructs a focused sub-graph G, and then computes hub and
authority scores for each node of G. In order to quantify the quality of a page as a hub and an authority,
Kleinberg associated every page with a hub weight and an authority weight. According to the mutually reinforcing
relationship between hubs and authorities, Kleinberg defined the hub weight to be the sum of the authority
weights of the nodes that the hub points to, and the authority weight to be the sum of the hub weights of the
nodes that point to this authority. Let A denote the n-dimensional vector of authority weights, where Ai is the
authority weight of the page pi, and let H denote the n-dimensional vector of hub weights, where Hi is the
hub weight of the page pi. The computation of authority and hub weights is given by the following formula:

$$A_i = \sum_{j \in In(p_i)} H_j \qquad \text{and} \qquad H_i = \sum_{j \in Out(p_i)} A_j \qquad [1]$$

where In(pi) is the set of pages that link to the page pi, Out(pi) is the set of pages that pi links to, and Hi and Ai
are the hub and authority weights of the page pi. Generally speaking, these methods conduct link
analysis on a sub-graph which is sampled from the whole Web graph by considering the content of the Web
pages.
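As an aside, the iterative computation behind formula [1] can be sketched in a few lines. The sketch below is ours, not Kleinberg's reference implementation; in particular, the normalisation step is an assumption added to keep the weights bounded across iterations.

```python
import numpy as np

def hits(adjacency, iterations=50):
    """Iterative HITS sketch; adjacency[i][j] = 1 if page i links to page j."""
    A = np.asarray(adjacency, dtype=float)
    hub = np.ones(A.shape[0])
    auth = np.ones(A.shape[0])
    for _ in range(iterations):
        auth = A.T @ hub              # authority: sum of hub weights of in-linking pages
        hub = A @ auth                # hub: sum of authority weights of pages linked to
        auth /= np.linalg.norm(auth)  # normalisation keeps the weights bounded
        hub /= np.linalg.norm(hub)
    return auth, hub

# Toy graph: page 0 links to pages 1 and 2, page 1 links to page 2.
auth, hub = hits([[0, 1, 1], [0, 0, 1], [0, 0, 0]])
print(auth, hub)
```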
For the second category, many relevance propagation methods were proposed to refine the content of web
pages by propagating content-based attributes through web structure. For example, (Mcbryan, 1994) (Brin et
al., 1998) propagate anchor text from one page to another to expand the feature set of web pages. (Shakery et
al., 2003) propagates the relevance score of a page to another page through the hyperlink between them.
(Song et al., 2004) propagates query term frequency from child pages to parent pages in the sitemap tree.
They first construct a sitemap for each website based on URL analysis, and then propagate query term
frequency along the parent-child relationships in the sitemap tree as follows:

$$f'_t(p) = (1+\alpha)\, f_t(p) + \frac{1-\alpha}{|Child(p)|} \sum_{q \in Child(p)} f_t(q) \qquad [2]$$

where f′t(p) is the occurrence frequency of term t in page p after propagation, ft(p) is the original
occurrence frequency of term t in page p, and Child(p) is the set of child pages of p in the sitemap tree.
(Qin et al., 2005) propose a generic relevance propagation framework that can be used to derive many
existing propagation models:

$$f_t^{k+1}(p) = \alpha \cdot f_t^{0}(p) + (1-\alpha) \cdot \frac{1}{|Child(p)|} \sum_{q \in Child(p)} f_t^{k}(q) \qquad [3]$$
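To make the framework concrete, here is a minimal sketch of an iterative propagation in the spirit of formula [3]. The data structures (dicts keyed by page identifiers) and the toy sitemap are our own illustrative assumptions, not part of the cited models.

```python
def propagate_term_frequency(f0, children, alpha=0.5, iterations=10):
    """Iterative propagation in the spirit of formula [3], for one term t.
    f0: page -> original frequency f_t^0(p); children: page -> list of child pages."""
    f = dict(f0)
    for _ in range(iterations):
        f_next = {}
        for p in f0:
            kids = children.get(p, [])
            spread = sum(f[q] for q in kids) / len(kids) if kids else 0.0
            f_next[p] = alpha * f0[p] + (1 - alpha) * spread
        f = f_next
    return f

# Hypothetical two-level sitemap: the root absorbs part of its children's frequency.
print(propagate_term_frequency({"root": 1, "a": 4, "b": 2}, {"root": ["a", "b"]}))
```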

3. RELEVANCE PROPAGATION MODEL


Intuitively, the content similarity of a page to a query, on its own, may not be sufficient for selecting a key
resource, and the link information (the neighborhood of the page) can be useful in finding key resources. A good
resource is a page whose content is related to the query topic, and which has links from other resources that
are also related to the query topic. So, two factors are important in selecting good resources: the content of
the page, and the relevance of the pages which link to the page. The idea underlying our work is that the
popularity of a page depends on the scores of the pages that point to it with respect to the query terms. Motivated by these
observations, we propose a function that combines link rank and document rank related to the query terms.
Specifically, we define the “hyper-relevance” score of each page as a function of two variables: the content
similarity of the page to the query, and a weighted sum of the scores of the pages that point to this page.
Formally, the relevance propagation function can be written as:

$$Rank(p_i, q) = \partial\big(Rank_{DR}(p_i, q),\; Rank_{LR}(p_i, q)\big) \qquad [4]$$


where Rank(pi, q) is the hyper-relevance score of the page pi, RankDR(pi, q) is the content similarity between
the page pi and the query q, and RankLR(pi, q) is the link rank based on the propagation of scores over links
according to query terms, given by:

$$Rank_{LR}(p_i, q) = \sum_{p_j \to p_i} \frac{w_{link}(p_j, p_i) \cdot Rank_{DR}(p_j, q)}{e_i} \qquad [5]$$

where wlink(pj, pi) is the weight assigned to the link from page pj to page pi, and
ei is the number of pages that point to the page pi. In principle, the choice of the function ∂ could be
arbitrary. An interesting choice is a linear combination of the two variables, shown below:

$$Rank(p_i, q) = \alpha \cdot Rank_{DR}(p_i, q) + (1-\alpha) \cdot Rank_{LR}(p_i, q) \qquad [6]$$


where α is a parameter which can be set between 0 and 1. It allows us to measure the impact of our link rank
function on the ranking of query results.
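As a small illustration, the combination in [6] amounts to a one-line re-scoring step; the page identifiers and scores below are hypothetical.

```python
def hyper_relevance(rank_dr, rank_lr, alpha=0.15):
    # Formula [6]: alpha = 1 reduces to the content-only baseline.
    return alpha * rank_dr + (1 - alpha) * rank_lr

# Hypothetical (content score, link score) pairs for three pages, re-ranked.
pages = {"p1": (0.8, 0.1), "p2": (0.5, 0.6), "p3": (0.3, 0.9)}
print(sorted(pages, key=lambda p: hyper_relevance(*pages[p]), reverse=True))
```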
3.1 Text Processing
Our system processes documents by first removing HTML tags and punctuation, and then excluding high-
frequency terms using a stop-word list. After punctuation and stop-word removal, the system replaces
each word by its representative class (root) using the Porter stemming algorithm (Porter, 1980). To represent
documents and queries, we used the vector space model (Salton et al., 1975). This choice is justified by its
success in the Web community and the satisfactory results that it generates.
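A minimal sketch of this pipeline might look as follows, assuming NLTK's implementation of the Porter stemmer; the stop-word list here is a tiny illustrative stand-in for the one actually used.

```python
import re
from nltk.stem import PorterStemmer  # the stemmer of (Porter, 1980)

STOP_WORDS = {"the", "a", "of", "and", "to", "in"}  # tiny illustrative list

def preprocess(html):
    """Sketch of the 3.1 pipeline: strip tags and punctuation, drop stop words, stem."""
    text = re.sub(r"<[^>]+>", " ", html)           # remove HTML tags
    tokens = re.findall(r"[a-z]+", text.lower())   # keep alphabetic tokens only
    stemmer = PorterStemmer()
    return [stemmer.stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("<p>The ranking of linked documents</p>"))
# -> ['rank', 'link', 'document']
```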

3.2 Scoring Function Details


The primary innovation in our work arises from the use of a ranking function which depends on both the content and the
neighborhood of the page according to query terms. This dependence brings the results found by a traditional
IR model closer to the user's needs. Our ranking function is based on two measures. The first is traditional
and widely used in current systems: the cosine measure. It computes the cosine of the angle between the
query and document vectors. This measure is defined as follows:

$$Rank_{DR}(p_i, q) = \frac{p_i \cdot q}{\|p_i\| \cdot \|q\|} = \frac{\sum_{t_i \in p_i \cap q} w_{t_i, p_i} \cdot w_{t_i, q}}{\sqrt{\sum_{t_i \in p_i} w_{t_i, p_i}^2} \cdot \sqrt{\sum_{t_i} w_{t_i, q}^2}} \qquad [7]$$

where wti,pi and wti,q, i = 1 to t (the total number of terms in the entire collection), are the term weights assigned to the different
terms of the page pi and the query q respectively. The best known term-weighting schemes use weights which
are given by:

  
$$w_{t_i, p} = \left(0.5 + 0.5 \cdot \frac{tf(t_i, p)}{\max_{t_j \in p} tf(t_j, p)}\right) \cdot \log\left(\frac{|D|}{df(t_i)}\right) \qquad [8]$$

where tf(ti, p) is the Term Frequency (TF), which measures the number of times the term ti appears in the page
p, and df(ti) is the Document Frequency (DF), which measures the number of documents in which the term ti
appears in the entire document collection. |D| is the number of pages of the entire collection D.
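A compact sketch of formulas [7] and [8] over sparse term-weight vectors might look as follows; the document-frequency table and collection size are assumed to be precomputed, and the sample values are hypothetical.

```python
import math
from collections import Counter

def term_weights(tokens, doc_freq, n_docs):
    """Augmented-TF * IDF weights of formula [8] for one page (or query)."""
    tf = Counter(tokens)
    max_tf = max(tf.values())
    return {t: (0.5 + 0.5 * tf[t] / max_tf) * math.log(n_docs / doc_freq[t])
            for t in tf}

def cosine(w_page, w_query):
    """Cosine similarity of formula [7] over sparse weight vectors."""
    dot = sum(w_page[t] * w_query[t] for t in w_page.keys() & w_query.keys())
    norms = (math.sqrt(sum(w * w for w in w_page.values()))
             * math.sqrt(sum(w * w for w in w_query.values())))
    return dot / norms if norms else 0.0

# Hypothetical collection of 1000 documents and a two-term query.
df = {"link": 50, "rank": 20, "web": 300}
page_w = term_weights(["link", "link", "rank", "web"], df, 1000)
query_w = term_weights(["link", "rank"], df, 1000)
print(cosine(page_w, query_w))
```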
The second measure is a structural measure that takes link structure information into account. In order to
understand our function, we start with the following assumption: “We considered that a page is well-known
for a term t of a query if it contains incoming links from pages which have the term t” (Doan et al., 2005).
This measure can be computed as follows.
Let q be a query containing nbtq terms, and consider a page pj retrieved by a traditional IR system. Let
nbtq(pj) be the number of query terms that pj contains. We denote by ei the number of pages that point to a
page pi (incoming links) and si the number of outgoing links of pi. The main idea of our structural
measure is to weight links according to the number of query terms contained in the source page for an
incoming link, and in the destination page for an outgoing link. In the following sections, we consider only
incoming links.

$$Rank_{LR}(p_i, q) = \sum_{p_j \to p_i} \frac{w_{link}(p_j, p_i) \cdot Rank_{DR}(p_j, q)}{e_i} \qquad [9]$$

where wlink(pj, pi) is the weight of the link between the pages pj and pi. The more query terms the page pj
contains, the higher the weight of the link between the two pages. This weight is defined as follows:

$$w_{link}(p_j, p_i) = \frac{nbt_q(p_j) \cdot \beta}{nbt_q} \qquad [10]$$
where β is a parameter between 0 and 1 that satisfies the following condition:

$$\sum_{k=1}^{nbt_q} C_{nbt_q}^{k} \cdot \frac{k}{nbt_q} \cdot \beta = 1 \;\Rightarrow\; \sum_{k=1}^{nbt_q} \frac{nbt_q!}{k!\,(nbt_q-k)!} \cdot \frac{k}{nbt_q} \cdot \beta = 1 \;\Rightarrow\; \sum_{k=1}^{nbt_q} \frac{(nbt_q-1)!}{(k-1)!\,(nbt_q-k)!} \cdot \beta = 1 \qquad [11]$$

where $C_{nbt_q}^{k}$ is the number of term subsets that contain exactly k terms of the query. By recurrence, we have:

$$\sum_{k=1}^{nbt_q} \frac{(nbt_q-1)!}{(k-1)!\,(nbt_q-k)!} = 2^{\,nbt_q-1} \;\Rightarrow\; \beta = \frac{1}{2^{\,nbt_q-1}}$$

By replacing β with its value in equation [10], we obtain:

$$w_{link}(p_j, p_i) = \frac{nbt_q(p_j)}{nbt_q \cdot 2^{\,nbt_q-1}} \qquad [12]$$

For example, for a query of three terms (nbtq = 3), β = 1/4, and a source page containing two of the three query terms is assigned a link weight of 2/(3·4) = 1/6.

By substituting [12] into function [9], we obtain the following function, which we use in our
experiments:

$$Rank_{LR}(p_i, q) = \sum_{p_j \to p_i} \frac{nbt_q(p_j) \cdot Rank_{DR}(p_j, q)}{nbt_q \cdot 2^{\,nbt_q-1} \cdot e_i} \qquad [13]$$
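A direct sketch of formula [13] follows. The input structures (an in-link map, precomputed content scores, and per-page query-term counts) are our own assumptions about how the data would be held in memory, and the sample values are hypothetical.

```python
def link_rank(page, in_links, rank_dr, nbt_q_of, nbt_q):
    """Sketch of formula [13] for one page.
    in_links: page -> list of pages pointing to it (the p_j -> p_i edges);
    rank_dr: content scores Rank_DR(p_j, q); nbt_q_of: query terms per page;
    nbt_q: number of terms in the query."""
    sources = in_links.get(page, [])
    if not sources:
        return 0.0
    beta = 1.0 / 2 ** (nbt_q - 1)   # normalisation constant derived in [11]
    e_i = len(sources)              # in-degree of the destination page
    return sum(nbt_q_of[pj] * beta / nbt_q * rank_dr[pj]
               for pj in sources) / e_i

# Hypothetical three-term query; two pages point to "p1".
print(link_rank("p1", {"p1": ["p2", "p3"]},
                {"p2": 0.6, "p3": 0.4}, {"p2": 2, "p3": 1}, 3))
```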

4. EMPIRICAL EVALUATIONS
In this section, experiments were conducted to evaluate the performance and efficiency of our model. We
first introduce the experimental settings and some implementation issues, and then present experimental
results and discussions.

4.1 Experimental Settings


To avoid corpus bias, two different data collections were used in our experiments. One is the WT10g
corpus, which was crawled from the Web in early 2000. This corpus has been used as the data collection of the Web
Track since TREC 2000. It contains 1,692,096 pages, of which 1,295,841 have incoming links and
1,532,012 have outgoing links. The other data set is the “.GOV” corpus, which was crawled from the
.gov domain in early 2002. This corpus has been used as the data collection of the Web Track since TREC 2002.
It contains in total 1,053,110 pages with 11,164,829 hyperlinks. According to Soboroff (Soboroff, 2002),
there is no great difference between the structure of the two corpora, WT10g and .GOV, and the real structure of the Web.
However, the WT10g collection is smaller than the .GOV collection (10 GB for WT10g against 18 GB
for .GOV). The following table shows the characteristics of each of the two collections:
Table 1. Characteristics of the WT10g and .GOV test collections

                                                 WT10g              .GOV
  Number of documents                            1,692,096          1,247,753
  Number of documents with incoming links        1,295,841 (76.5%)  1,067,339 (85.5%)
  Number of documents with outgoing links        1,532,012 (90.5%)  1,146,213 (91.9%)
  Average number of incoming links per page      5.26               10.4
  Average number of outgoing links per page      6.22               9.69
  Number of queries                              50                 50
  Average number of relevant pages per query     52.54              31.48
4.2 Experimental Setup and Results
In this section we present an experimental evaluation of the algorithm that we propose, as well as of some other
existing algorithms, and we study the rankings they produce. In our experiments, the precision over the 11
standard recall levels (0%, 10%, …, 100%) is the main evaluation metric, and we also evaluate the
precision at 5 and 10 retrieved documents (P@5 and P@10). We carried out 50 queries on the WT10g collection and
another 50 queries on the .GOV collection using different ranking methods. We compare several categories of
algorithms: content-only, popularity with the PageRank algorithm, and our algorithm based on combining link
and document rank. The dependency between precision at 10 retrieved documents and α on both the WT10g and
.GOV collections is illustrated in Figure 2, in which all the curves converge to the baseline when α=1.
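For reference, the cutoff metric used throughout this section can be sketched as follows; the ranked list and relevance set below are hypothetical.

```python
def precision_at_k(ranked, relevant, k):
    """P@k: fraction of the top-k retrieved pages that are relevant."""
    return sum(1 for p in ranked[:k] if p in relevant) / k

# Hypothetical run: 3 of the top 5 results are relevant, so P@5 = 0.6.
print(precision_at_k(["a", "b", "c", "d", "e"], {"a", "c", "e"}, 5))
```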
Figure 1. Average precision on the 11 standard recall levels of three ranking functions (contents-only, PageRank and
combining link and document rank algorithms) carried out on WT10g and .GOV collections.
[Figure 1 shows two panels, one per collection. Each panel plots precision against the 11 standard recall levels (R0% to R100%) for three curves: Contents-Only, Combining Link and Document Rank (α=0.15 on WT10g, α=0.25 on .GOV), and PageRank.]

As can be seen from Figure 1, the PageRank algorithm performs worst on both the WT10g
and .GOV collections. With the PageRank algorithm, a page has the same score (or popularity value) for all
queries performed on the system, because PageRank takes the popularity of a page into account independently of the query
terms; this is why its results are worse. Combining link and document rank is clearly better than the
baseline, and it is the best among all the methods. The performance of our method increases significantly
when α decreases; that is, the more importance we give to the link rank, the better the results. The best
value of α for placing more relevant documents at the top of the ranked list is 0.15 for the WT10g collection and 0.25 for the
.GOV collection. However, if we do not take the textual content of the page into account in the computation of the
ranking function, retrieval performance decreases. This result shows the importance of page content in
the computation of document relevance to a given query.
Figure 2. Average precision at 10 retrieved documents according to the parameter α for the link function on both the
WT10g and .GOV collections.
[Figure 2 shows two panels, one per collection. Each panel plots P@10 against α ∈ {0.05, 0.1, …, 1} for the combined function α·RankDR(p,q) + (1−α)·RankLR(p,q), with the Contents-Only baseline as reference.]

We also compare the different algorithms with the average precision at 5, 10 and 20 retrieved documents
(P@5, P@10 and P@20). The performance of our algorithm is still better than that of the baseline algorithm on both
the WT10g and .GOV collections, which means that there are more relevant documents at the top of the ranked list. From
Table 2, we can see that combining link rank with document rank performs better than the baseline,
whether on P@5, P@10 or P@20, on both collections. For example, our method achieves 20%
and 25% improvements over the baseline algorithm on the average precision at five retrieved documents (P@5) on the
WT10g and .GOV collections respectively.
Table 2. P@5, P@10 and P@20 comparison of three ranking functions (contents-only, PageRank and combining link
and document rank algorithms) carried out on the WT10g and .GOV collections.

  WT10g                                   P@5            P@10           P@20
  Baseline algorithm (Contents-Only)      14.89%         15.11%         26.81%
  0.15*RankDR(p,q)+0.85*RankLR(p,q)       17.87% (+20%)  15.74% (+4%)   29.15% (+9%)
  PageRank                                2.12% (-85%)   4.89% (-67%)   16.17% (-40%)

  .GOV                                    P@5            P@10           P@20
  Baseline algorithm (Contents-Only)      11.20%         9.80%          16.40%
  0.25*RankDR(p,q)+0.75*RankLR(p,q)       14.00% (+25%)  11.40% (+16%)  17.40% (+6%)
  PageRank                                2.00% (-82%)   1.60% (-83%)   3.80% (-77%)

5. CONCLUSION
Several algorithms based on the link analysis approach have been developed, but until now many experiments have
shown no significant gain compared to methods based on the content of the page only. In this
paper, we introduce an approach for combining content and link rank based on the propagation of scores over links
according to query terms. During the computation, the algorithm that we propose propagates a portion of the
rank scores of the source Web pages to the destination Web pages in accordance with the query terms. We
performed experimental evaluations of our algorithm using the TREC-9 (WT10g) and .GOV test collections. We
found that this algorithm significantly outperforms content-only retrieval. The conclusion drawn from our
experiments is that propagating link scores according to query terms provides a significant improvement
over the baseline method based on content only. More studies and experiments will be conducted;
e.g., we will weight inter-host and intra-host links differently in the score propagation. We also plan to test this
framework at the level of semantic blocks to see the structural effects of blocks on the ranking of query results.
Finally, new measures representing additional semantic information may be explored.

ACKNOWLEDGEMENT
We would like to thank the anonymous reviewers for their insightful comments.

REFERENCES
Amento, B. et al., 2000. Does Authority Mean Quality? Predicting Expert Quality Ratings of Web Pages. In Proceedings of ACM SIGIR 2000, pp. 296-303.
Brin, S. and Page, L., 1998. The Anatomy of a Large-Scale Hypertextual Web Search Engine. In Proceedings of WWW7.
Craswell, N. and Hawking, D., 2003. Overview of the TREC 2003 Web Track. In the 12th TREC.
Craswell, N. and Hawking, D., 2004. Overview of the TREC 2004 Web Track. In the 13th TREC.
Chakrabarti, S., 2001. Integrating the Page Object Model with Hyperlinks for Enhanced Topic Distillation and Information Extraction. In the 10th WWW.
Chakrabarti, S. et al., 2001. Enhanced Topic Distillation Using Text, Markup Tags, and Hyperlinks. In Proceedings of the 24th ACM SIGIR, pp. 208-216.
Deng, C. et al., 2004. Block-based Web Search. Microsoft Research Asia.
Doan, B. and Chibane, I., 2005. Expérimentations sur un modèle de recherche d'information utilisant les liens hypertextes des pages Web. Revue des Nouvelles Technologies de l'Information (RNTI-E-3), numéro spécial Extraction et Gestion des Connaissances (EGC'2005), Vol. 1, Cépaduès-Editions, pp. 257-262.
Hawking, D., 2000. Overview of the TREC-9 Web Track. In the 9th TREC.
Haveliwala, T.H., 2002. Topic-Sensitive PageRank: A Context-Sensitive Ranking Algorithm for Web Search. In Proceedings of the Eleventh International Conference on World Wide Web, pp. 517-526, ACM Press.
Ingongngam, P. and Rungsawang, A., 2003. Report on the TREC 2003 Experiments Using Web Topic-Centric Link Analysis. In the 12th TREC.
Jeh, G. and Widom, J., 2003. Scaling Personalized Web Search. In Proceedings of the Twelfth International World Wide Web Conference.
Kamvar, S. et al., 2003. Exploiting the Block Structure of the Web for Computing PageRank. Stanford University Technical Report.
Kleinberg, J., 1999. Authoritative Sources in a Hyperlinked Environment. Journal of the ACM, Vol. 46, No. 5, pp. 604-622.
Lempel, R. and Moran, S., 2000. The Stochastic Approach for Link-Structure Analysis (SALSA) and the TKC Effect. In Proceedings of the 9th International World Wide Web Conference.
McBryan, O., 1994. GENVL and WWWW: Tools for Taming the Web. In Proceedings of the 1st WWW.
Porter, M.F., 1980. An Algorithm for Suffix Stripping. Program, Vol. 14, No. 3, pp. 130-137.
Qin, T. et al., 2005. A Study of Relevance Propagation for Web Search. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.
Salton, G. et al., 1975. A Theory of Term Importance in Automatic Text Analysis. Journal of the American Society for Information Science.
Shakery, A. and Zhai, C.X., 2003. Relevance Propagation for Topic Distillation: UIUC TREC 2003 Web Track Experiments. In the 12th TREC.
Song, R. et al., 2004. Microsoft Research Asia at Web Track and Terabyte Track of TREC 2004. In the 13th TREC.
Soboroff, I., 2002. Do TREC Web Collections Look Like the Web? SIGIR Forum, Vol. 36, No. 2.
Xue-Mei, J. et al., 2004. Exploiting PageRank at Different Block Level. In Proceedings of the International Conference on Web Information Systems Engineering.
