Web-Search Ranking Algorithms
What stands behind Google's success?

Evgeny Dantsin
Department of Computer Science
Roosevelt University
Chicago, USA
UAM, September 2007
Query
Let's ask about Bill Gates ...
Results
... and we'll get about 53,800,000 results.
Ranking
Relevance and authority: Web users are interested in
webpages that are not only relevant, but also authoritative.
Web-search ranking algorithms: Algorithms that
evaluate the authority of webpages.
Question: Any ideas about measuring the authority of
webpages?
Naive approach: Number of visits. Drawbacks?
Use the link structure of the Web!
Simplest Approach Based on Links
Approach. A link from page q to page p is thought of as an
endorsement of the authority of page p. The rank of page p
is defined to be the number of other pages that point to p.
Science. Similarity to citation in scientific literature: article q
refers to article p.
Social networks. Similarity to recommendations in social
networks: q recommends p.
Drawbacks?
A few citations from experts may be more important than
many citations from non-experts.
Easy to falsify.
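A minimal sketch of the in-degree approach and of why it is easy to falsify. The toy link list, the page names, and the function name `in_degree_rank` are illustrative assumptions, not part of the original slides.

```python
from collections import Counter


def in_degree_rank(links):
    """Rank each page by the number of distinct other pages that link to it."""
    pages = {page for link in links for page in link}
    # Ignore self-links and repeated links: only *other* pages count, once each.
    distinct = {(q, p) for q, p in links if q != p}
    counts = Counter(p for _, p in distinct)
    return {page: counts[page] for page in pages}


# Hypothetical toy Web: (q, p) means "page q links to page p".
links = [("a", "c"), ("b", "c"), ("c", "d"), ("a", "d"), ("d", "d")]
ranks = in_degree_rank(links)

# Falsifying the rank: five throwaway pages that all link to "d".
spam = [("spam%d" % i, "d") for i in range(5)]
boosted = in_degree_rank(links + spam)
print(ranks["d"], boosted["d"])  # prints: 2 7
```

The five spam pages have no incoming links themselves, yet each one counts as much as an endorsement from an established page.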
Early Methods of Ranking and Link Analysis
Mid-1990s. Ranking in early web search engines
(Infoseek, AltaVista, HotBot, etc.): number of visits, number
of incoming links, etc. Poor quality.
New methods of ranking:
Jon Kleinberg: HITS (Jan 1998)
Sergey Brin and Larry Page: PageRank (Apr 1998)
Jon Kleinberg
Professor of Computer Science
at Cornell University
B.S. from Cornell, 1993
Ph.D. from MIT, 1996
Many awards, including
Nevanlinna Prize (2006)
Sergey Brin and Larry Page
Co-authors of PageRank
Google's co-founders
Sergey Brin: B.S. from University of Maryland, College Park, 1993
M.S. from Stanford University, 1995
Larry Page: B.S. from University of Michigan, Ann Arbor, 1995
M.S. from Stanford University, 1998
HITS and PageRank: Model of the Web
Model. Web is a directed graph:
a page is a vertex
a link from page q to page p is a directed edge from q to p
Web Size (2006): 34.7 billion pages indexed by Google
Web Structure (2005):
CORE (strongly connected component) 65%
IN (pages that have paths to CORE) 20%
OUT (pages reachable from CORE) 11%
OTHERS (all other pages) 4%
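A small sketch of how the four regions above could be found on a toy directed graph. It assumes a seed page already known to lie in CORE; the function names (`reachable`, `bow_tie`) and the example links are hypothetical.

```python
from collections import deque


def reachable(start, adj):
    """Breadth-first search: all vertices reachable from start via adj."""
    seen = {start}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        for w in adj.get(v, ()):
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return seen


def bow_tie(links, seed):
    """Split pages into CORE, IN, OUT, OTHERS relative to a seed page in CORE."""
    forward, backward, pages = {}, {}, set()
    for q, p in links:  # (q, p): a directed edge from q to p
        forward.setdefault(q, set()).add(p)
        backward.setdefault(p, set()).add(q)
        pages.update((q, p))
    down = reachable(seed, forward)   # pages reachable from the seed
    up = reachable(seed, backward)    # pages that have a path to the seed
    core = down & up
    return {"CORE": core, "IN": up - core, "OUT": down - core,
            "OTHERS": pages - down - up}


links = [("a", "b"), ("b", "c"), ("c", "a"), ("i", "a"), ("c", "o"), ("x", "y")]
parts = bow_tie(links, seed="a")
print(parts)
```

On the real Web graph this would need a scalable strongly-connected-components algorithm; the sketch only illustrates the definitions.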
From In-Degree to PageRank
The In-Degree approach. A link from page q to page p
is thought of as an endorsement of the authority of page p.
The rank of page p is defined to be the number of other
pages that point to p.
Drawbacks.
Easy to falsify.
Idea of PageRank.
The authority of page p is determined not only by how
many pages point to p, but also whether the authority of
these pages is high or low.
The random surfer model and Markov chains.
Raw Idea
System of linear equations
Rank of page p:
$$ r(p) = \sum_{q \in B_p} \frac{r(q)}{|q|} $$
where
$B_p$ = {all pages pointing to p}
|q| = number of outgoing links on page q
Problems with solving:
Large size of the system
Solution is not unique (the system can have many different solutions)
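A short sketch of the non-uniqueness problem: on a toy graph, a rank vector and any multiple of it both satisfy the rank equations. The graph, the candidate vectors, and the helper name `residual` are assumptions made for illustration.

```python
def residual(r, out_links):
    """Largest violation of r(p) = sum of r(q)/|q| over pages q pointing to p."""
    incoming = {p: 0.0 for p in out_links}
    for q, targets in out_links.items():
        for p in targets:
            incoming[p] += r[q] / len(targets)  # q passes r(q)/|q| to each target
    return max(abs(r[p] - incoming[p]) for p in out_links)


# Hypothetical toy Web: page -> pages it links to.
out_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}

r1 = {"a": 2.0, "b": 1.0, "c": 2.0}
r2 = {p: 10 * v for p, v in r1.items()}   # the same vector, scaled
ones = {p: 1.0 for p in out_links}        # not a solution

print(residual(r1, out_links), residual(r2, out_links), residual(ones, out_links))
```

Fixing the scale, for example by requiring the ranks to sum to 1, removes this particular ambiguity.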
Random Surfer Model
Raw Version
Random walk:
A random surfer starts from a page chosen at random.
At each page, the surfer clicks on a link picked at random
and moves to the corresponding page.
Question: What is the probability that the surfer arrives at
page p at step t ?
Answer: The probability can be computed recursively
using
$$ r_t(p) = \sum_{q \in B_p} \frac{r_{t-1}(q)}{|q|} $$
where
$r_k(p)$ denotes the probability of arriving at page p at step k;
the initial probabilities $r_0(p)$ are defined to be 1/n
(n is the total number of pages).
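The recursion can be sketched in a few lines. The sketch assumes that every page has at least one outgoing link and that every link target is itself a key of `out_links`; the toy graph and the function names are illustrative.

```python
def surf_step(r, out_links):
    """Compute r_t from r_{t-1}: each page q passes r(q)/|q| along each link."""
    nxt = {p: 0.0 for p in out_links}
    for q, targets in out_links.items():
        for p in targets:
            nxt[p] += r[q] / len(targets)
    return nxt


def surf(out_links, steps):
    """Probabilities of being at each page after the given number of steps."""
    n = len(out_links)
    r = {p: 1.0 / n for p in out_links}  # r_0(p) = 1/n
    for _ in range(steps):
        r = surf_step(r, out_links)
    return r


# Hypothetical toy Web: page -> pages it links to.
out_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
r = surf(out_links, steps=50)
print({p: round(v, 4) for p, v in r.items()})
```

On this particular graph the probabilities settle near 0.4, 0.2, and 0.4 for a, b, and c; other graphs need not settle at all.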
Random Walks, Markov Chains, and Ranking
Markov chains: The surfer's random walk can be
equivalently described in terms of Markov chains.
Ranking: The ranks are stationary probabilities of the
surfer's Markov chain
$$ r(p) = \lim_{k \to \infty} r_k(p) = \text{frequency of visiting page } p $$
Problem: The process may not converge, so the limit
above may not exist!
Google's solution: Slight adjustment of the Markov chain
to guarantee the convergence.
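A toy illustration of the convergence problem, assuming the unadjusted random walk with uniform initial probabilities. The three-page graph is a made-up example: a and b link to each other, and c links to a.

```python
def surf_step(r, out_links):
    """One step of the unadjusted random walk."""
    nxt = {p: 0.0 for p in out_links}
    for q, targets in out_links.items():
        for p in targets:
            nxt[p] += r[q] / len(targets)
    return nxt


# Hypothetical toy Web: page -> pages it links to.
out_links = {"a": ["b"], "b": ["a"], "c": ["a"]}
r = {p: 1.0 / len(out_links) for p in out_links}

history = []
for _ in range(6):
    r = surf_step(r, out_links)
    history.append(tuple(round(r[p], 3) for p in ("a", "b", "c")))

for t, probs in enumerate(history, start=1):
    print(t, probs)
```

The probabilities of a and b swap at every step (0.667 and 0.333, then 0.333 and 0.667), so the sequence has no limit.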
Adjusting the Random Surfer Model
Idea of adjustment: Every page must be reachable from
every other page (irreducibility).
Adjusted random walk:
At each page, the surfer proceeds as follows:
1. With probability 1 - ε, the surfer clicks on a link picked at
random and moves to the corresponding page.
2. With probability ε, the surfer gets bored and jumps to a page
chosen at random from all pages.
The probability ε is called the jumping probability; its value
affects the rate of the convergence.
Adjusted recursive relation:
$$ PR_t(p) = (1 - \varepsilon) \sum_{q \in B_p} \frac{PR_{t-1}(q)}{|q|} + \frac{\varepsilon}{n} $$
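A minimal sketch of the adjusted recursion, not Google's production algorithm. It assumes every page has at least one outgoing link (pages without links are not covered by the relation above); the stopping rule, the parameter names `eps`, `tol`, and `max_iter`, and the toy graph are illustrative choices.

```python
def pagerank(out_links, eps=0.15, tol=1e-8, max_iter=1000):
    """Iterate the adjusted relation until successive vectors differ by < tol.

    Returns (ranks, number_of_iterations).
    """
    n = len(out_links)
    pr = {p: 1.0 / n for p in out_links}          # PR_0(p) = 1/n
    for iteration in range(1, max_iter + 1):
        nxt = {p: eps / n for p in out_links}     # jumping term eps/n
        for q, targets in out_links.items():
            share = (1.0 - eps) * pr[q] / len(targets)
            for p in targets:
                nxt[p] += share                   # link-following term
        change = sum(abs(nxt[p] - pr[p]) for p in out_links)
        pr = nxt
        if change < tol:
            return pr, iteration
    return pr, max_iter


# Hypothetical toy Web on which the unadjusted walk oscillates forever.
out_links = {"a": ["b"], "b": ["a"], "c": ["a"]}
pr, iterations = pagerank(out_links)
print({p: round(v, 4) for p, v in pr.items()}, iterations)
```

Page c has no incoming links, so its rank comes from the jumping term alone and equals eps/n.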
Computation Issues
Only a few are mentioned here ...
Convergence and accuracy:
Trade-off when choosing ε: the smaller ε, the better the
adequacy of ranking, but the slower the convergence.
Google's original value was ε = 0.15 (convergence: about
100 iterations, accuracy: about $10^{-8}$).
PageRank updating: Special algorithms are used to avoid
total recomputation of PageRank.
PageRank in the Google Toolbar: Ranks are displayed using
a logarithmic scale.
Google's distribution: In the jumping term, Google uses
its own distribution instead of the uniform one. The choice of
the jumping distribution affects neither mathematical nor
computational aspects, but it allows Google to (slightly)
manipulate ranks.
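A sketch of two of the points above on a toy graph: the iteration count grows as the jumping probability shrinks, and a non-uniform jumping distribution shifts the ranks. The `jump` parameter, the biased distribution, and the graph are assumptions for illustration; Google's actual distribution is not described in the slides.

```python
def pagerank(out_links, eps, jump=None, tol=1e-8, max_iter=10000):
    """PageRank with an optional jumping distribution (uniform if omitted).

    Returns (ranks, number_of_iterations). Assumes every page has out-links.
    """
    n = len(out_links)
    if jump is None:
        jump = {p: 1.0 / n for p in out_links}
    pr = {p: 1.0 / n for p in out_links}
    for iteration in range(1, max_iter + 1):
        nxt = {p: eps * jump[p] for p in out_links}   # generalized jumping term
        for q, targets in out_links.items():
            share = (1.0 - eps) * pr[q] / len(targets)
            for p in targets:
                nxt[p] += share
        change = sum(abs(nxt[p] - pr[p]) for p in out_links)
        pr = nxt
        if change < tol:
            return pr, iteration
    return pr, max_iter


# Hypothetical toy Web: page -> pages it links to.
out_links = {"a": ["b"], "b": ["a"], "c": ["a"]}

# Trade-off: iterations needed for each jumping probability.
steps = {eps: pagerank(out_links, eps)[1] for eps in (0.05, 0.15, 0.5)}
print(steps)

# Manipulation: a jumping distribution biased toward page "c".
uniform, _ = pagerank(out_links, 0.15)
biased, _ = pagerank(out_links, 0.15, jump={"a": 0.1, "b": 0.1, "c": 0.8})
print(round(uniform["c"], 3), round(biased["c"], 3))
```

The exact iteration counts depend on the graph and on the stopping tolerance, so only their ordering should be read from this sketch.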
Strength or Weakness?
PageRank is query-independent.
Strength
Efficiency. Ranks are computed in advance, not during
query processing.
Page owners have very little chance of affecting their
PageRank scores.
Weakness
PageRank does not distinguish between pages that are
authoritative in general and pages that are authoritative on
the query topic.
Idea of HITS
Observation. It is not necessary that an authoritative page
points to other authoritative pages. Instead, there are
special pages that act as hubs that contain collections of
links to authoritative pages.
Two weights. Every page has two weights:
The hub weight captures the quality of the page as a
pointer to useful resources.
The authority weight captures the quality of the page as a
resource itself.
Mutual reinforcing. A good hub points to good authorities,
while a good authority is pointed to by good hubs.
Iterative Computation
1. Initial step. All hub and authority weights are set to 1:
$$ h_0(p) = a_0(p) = 1 \quad \text{for all pages } p $$
2. Updating. At each iteration, the weights are updated using
$$ a_t(p) = \sum_{q \in B_p} h_{t-1}(q) \quad \text{and} \quad h_t(p) = \sum_{q \in F_p} a_t(q) $$
where $B_p$ = {all pages pointing to p} and $F_p$ = {all pages that p points to}.
3. Normalization. After each iteration, a normalization
operation is applied so that the vectors of weights become
unit vectors.
4. Convergence. The algorithm iterates until the vectors
converge (the convergence has been proved).
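The four steps above can be sketched as follows. This is a minimal sketch, assuming Euclidean normalization and a fixed number of iterations in place of a convergence test; the toy graph and the names `hits` and `normalize` are illustrative.

```python
import math


def normalize(weights):
    """Scale the weights to a unit vector (Euclidean norm)."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {p: w / norm for p, w in weights.items()} if norm else dict(weights)


def hits(out_links, iterations=50):
    """Return (hub, authority) weights. Every link target must be a key."""
    pages = list(out_links)
    in_links = {p: [] for p in pages}
    for q, targets in out_links.items():
        for p in targets:
            in_links[p].append(q)
    hub = {p: 1.0 for p in pages}    # step 1: all weights start at 1
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Step 2: authorities from the previous hubs, hubs from the new authorities.
        auth = {p: sum(hub[q] for q in in_links[p]) for p in pages}
        hub = {p: sum(auth[q] for q in out_links[p]) for p in pages}
        # Step 3: normalize both vectors.
        auth, hub = normalize(auth), normalize(hub)
    return hub, auth


# Hypothetical toy graph: two hub pages pointing to two resource pages.
out_links = {"h1": ["x", "y"], "h2": ["x"], "x": [], "y": []}
hub, auth = hits(out_links)
print({p: round(w, 3) for p, w in hub.items()})
print({p: round(w, 3) for p, w in auth.items()})
```

Page x is endorsed by both hubs and gets the larger authority weight; h1 points to both resources and gets the larger hub weight.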
More About HITS
Smaller graph. PageRank operates on the entire Web
graph. HITS operates on a smaller graph G that is built in
two steps:
1. Include all pages with the query keywords into G.
2. Add all pages that either point to pages in G or are pointed
to by pages in G.
Query dependence. Unlike PageRank, HITS is
query-dependent.
HITS in search engines. HITS is used in Ask Jeeves
(www.ask.com).
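A sketch of the two-step construction of G. It assumes that "pages with the query keywords" means pages whose whitespace-separated text contains every query word; the function name `build_hits_graph` and the toy pages are hypothetical.

```python
def build_hits_graph(out_links, page_text, query_words):
    """Return (root, base): pages matching the query, and root plus neighbours."""
    words = {w.lower() for w in query_words}
    # Step 1: pages whose text contains all query words (an assumed matching rule).
    root = {p for p, text in page_text.items()
            if words <= set(text.lower().split())}
    # Step 2: add pages that point to root pages or are pointed to by them.
    base = set(root)
    for q, targets in out_links.items():
        for p in targets:
            if q in root:
                base.add(p)
            if p in root:
                base.add(q)
    return root, base


# Hypothetical toy Web.
page_text = {
    "a": "bill gates biography",
    "b": "microsoft history",
    "c": "famous people index",
    "d": "gates foundation founded by bill",
    "e": "cooking recipes",
}
out_links = {"a": ["b"], "b": [], "c": ["a"], "d": [], "e": ["b"]}

root, base = build_hits_graph(out_links, page_text, ["Bill", "Gates"])
print(sorted(root), sorted(base))  # prints: ['a', 'd'] ['a', 'b', 'c', 'd']
```

Page e links only to b, which was itself added in step 2, so e stays outside G.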
Conclusions
HITS and PageRank are web-search ranking algorithms
that exploit math techniques (linear algebra, Markov
chains).
Skeptical about the role of theory in information
technology? Think about Google's success.
James Clerk Maxwell: There is nothing more practical
than a good theory.