Professional Documents
Culture Documents
CAIM: Cerca I Anàlisi D'informació Massiva: FIB, Grau en Enginyeria Informàtica
CAIM: Cerca I Anàlisi D'informació Massiva: FIB, Grau en Enginyeria Informàtica
Fall 2018
http://www.cs.upc.edu/~caim
1 / 44
5. Web Search. Architecture of simple IR systems
Searching the Web, I
When documents are interconnected
3 / 44
How Google worked in 1998
Notation:
4 / 44
Some components
6 / 44
The inverter (sorter), I
7 / 44
The inverter (sorter), II
8 / 44
The inverter (sorter), III
The above can be done concurrently on different sets of
documents:
9 / 44
The inverter (sorter), IV
10 / 44
Searching the Web, I
When documents are interconnected
11 / 44
Searching the Web, II
Meaning of a hyperlink
12 / 44
Pagerank, I
The idea that made Google great
Intuition:
A page is important if it is pointed to by other important pages
13 / 44
Pagerank, II
Definitions
14 / 44
Pagerank, III
Example
p1 p2
p1 = +
3 2
p3
p2 = + p4
2
p1
p3 =
3
X pj p1 p2 p3
pi = p4 = + +
out(j) 3 2 2
j→i
1 = p1 + p2 + p3 + p4
15 / 44
Pagerank, IV
Formally
Equations
P pj
I pi = j:(j,i)∈E out(j) for each i ∈ V
Pn
i=1 pi =1
I
If |V | = n
I n + 1 equations
I n unknowns
16 / 44
Pagerank, V
Example, revisited
1 1
p1 3 2 0 0 p1
p2 0 1
= 1 0 2 1 · p2
p3 0 0 0 p3
3
1 1 1
p4 3 2 2 0 p4
namely:
P p~ = M T p~ and additionally
i pi = 1
Whose solutions is:
p~ is the eigenvector of matrix M T associated to eigenvalue 1
17 / 44
Pagerank, VI
Example, revisited
18 / 44
Pagerank, VII
Example, revisited
Adjacency matrix
1 0 1 1
1 0 0 1
A=
0
1 0 1
0 1 0 0
1/3 0 1/3 1/3 1/3 1/2 0 0
1/2 0 0 1/2 0 0 1/2 1
M =
0 1/2 0 1/2
MT =
1/3 0
0 0
0 1 0 0 1/3 1/2 1/2 0
19 / 44
Pagerank, VIII
Example, revisited
1 1 1 1 1
1 0 1 1 3 0 3 3 3 2 0 0
1 0 0 1 1 0 0 1 0 0 1
MT = 2 1
A= 2
M = 1
2
1
0 1 0 1 0 0 1 0 0 0
2 2 3
1 1 1
0 1 0 0 0 1 0 0 3 2 2 0
Question:
Why do we need to row-normalize and transpose A?
Answer:
X pj
I Row normalization: because pi =
out(j)
j:(j,i)∈E
X pj
I Transpose: because pi = , that is,
out(j)
j:(j,i)∈E
pi depends on i’s incoming edges
20 / 44
Pagerank, IX
It is just about solving a system of linear equations!
.. but
I How do we know a solution exists?
I How do we know it has a single solution?
I How can we compute it efficiently?
For example, the graph on the left has no solution.. (check it!)
but the one on the right does
21 / 44
Pagerank, X
How do we know a solution exists?
Theorem (Perron-Frobenius)
If M is stochastic, then it has at least one stationary vector, i.e.,
one non-zero vector p such that
M T p = p.
22 / 44
Pagerank, XI
Equivalently: the random surfer view
Now assume M is the transition probability matrix between
states in G
1/3 0 1/3 1/3
1/2 0 0 1/2
M =
0 1/2 0 1/2
0 1 0 0
23 / 44
Pagerank, XII
The random surfer view
p~(t) := M T p~(t − 1)
I In the limit t → ∞:
I p~(t) = p~(t + 1) = p~(t + 2) = .. = p~
I so p~(t) = M T p~(t − 1)
I p~(t) converges to a solution p s.t. p = M T p (the pagerank
solution)!
24 / 44
Pagerank, XIII
Random surfer example
1 1
3 2 0 0
0 1
0 2 1
MT =
1 0 0 0
3
1 1 1
3 2 2 0
I p~(0)T = (1, 0, 0, 0)
I p~(1)T = (1/3, 0, 1/3, 1/3)
I p~(2)T = (0.11, 0.50, 0.11, 0.28)
I ..
I p~(10)T = (0.26, 0.35, 0.09, 0.30)
I p~(11)T = (0.26, 0.35, 0.09, 0.30)
25 / 44
Pagerank, XIV
An algorithm to solve the eigenvector problem (find p s.t. p = M T p)
26 / 44
Pagerank, XV
Convergence of the Power method: aperiodicity required
1/4 1 1/2
1/4
, or 0 , or 0
1/4 0 1/2
1/4 0 0
27 / 44
Pagerank, XVI
Convergence of the Power method: strong connectedness required
28 / 44
Pagerank, XVII
A useful theorem from Markov chain theory
Theorem
If a matrix M is strongly connected and aperiodic, then:
T
P p~ = p~ has exactly one non-zero solution such that
M
I
i pi = 1
I 1 is the largest eigenvalue of M T
I the Power method converges to the p~ satisfying M T p~ = p~,
from any initial non-zero p~(0)
I Furthermore, we have exponential fast convergence
29 / 44
Pagerank, XVIII
Guaranteeing aperiodicity and strong connectedness
Observe that:
I G is stochastic
1
I .. because G is a weighted average of M and n J, which
are also stochastic
I for each integer k > 0, there is a non-zero probablity path
of length k from every state to any other state of G
I .. implying that G is strongly connected and aperiodic
I and so the Power method will converge on G, and fast!
30 / 44
Pagerank, XIX
Teleportation in the random surfer view
The meaning of λ
I With probability λ, the random surfer follows a link in
current page
I With probability 1 − λ, the random surfer jumps to a
random page in the graph (teleportation)
31 / 44
Pagerank, XX
Excercise, I
p1 0 1 1 1 1 1 1 1 p1
p2 2 1 0 0 0 1 1 1 1
1 1 p2
= 31 + ·
p3 3 0 0 0 3 4 1 1 1 1 p3
3
1
p4 3 0 0 0 1 1 1 1 p4
32 / 44
Pagerank, XXI
Exercise, II
Answer:
1−λ
n
p~ = (I − λM T )−1 ...
1−λ
n
33 / 44
Topic-sensitive Pagerank, I
34 / 44
Topic-sensitive Pagerank, II
35 / 44
HITS, I
Hypertext Induced Text Search
36 / 44
HITS, II
Definition of authority and hub value (ai and hi )
Associate to each page i an authority value ai and a hub
value hi
I vector of all authority values is ~a
I vector of all hub values is ~h
h4
38 / 44
HITS, IV
Example
a4
39 / 44
HITS, V
Update rule for ~a and ~h
40 / 44
HITS, VI
The power method for finding ~a and ~h
41 / 44
HITS, VII
HITS algorithm
42 / 44
HITS, VIII
HITS algorithm illustrated
43 / 44
HITS vs. Pagerank
44 / 44