
review articles

doi:10.1145/1953122.1953146
PageRank: Standing on the Shoulders of Giants

The roots of Google’s PageRank can be traced back to several early, and equally remarkable, ranking techniques.

by Massimo Franceschet

PageRank³ is a Web page ranking technique that has been a fundamental ingredient in the development and success of the Google search engine. The method is still one of the many signals Google uses to determine which pages are most important.ᵃ The main idea behind PageRank is to determine the importance of a Web page in terms of the importance assigned to the pages hyperlinking to it. In fact, this thesis is not new, and has been previously successfully exploited in different contexts. We review the PageRank method and link it to some renowned previous techniques that we have found in the fields of Web information retrieval, bibliometrics, sociometry, and econometrics.

In 1945 Vannevar Bush wrote a celebrated article in The Atlantic Monthly entitled “As We May Think” describing a futuristic device he called Memex.⁵ Bush wrote:

    Wholly new forms of encyclopedias will appear, ready made with a mesh of associative trails running through them, ready to be dropped into the Memex and there amplified.

Bush’s prediction came true in 1989, when Tim Berners-Lee proposed the Hypertext Markup Language (HTML) to keep track of experimental data at the European Organization for Nuclear Research (CERN). In the original far-sighted proposal in which Berners-Lee attempts to persuade CERN management to adopt the new global hypertext system we can read the following paragraphᵇ:

    We should work toward a universal linked information system, in which generality and portability are more important than fancy graphics techniques and complex extra facilities. The aim would be to allow a place to be found for any information or reference which one felt was important, and a way of finding it afterwards. The result should be sufficiently attractive to use that the information contained would grow past a critical threshold.

key insights

• Great, pioneering research paved the way for PageRank, the well-known algorithm of Brin and Page that became the basis for the Google search engine and that is still one of the many signals Google uses to determine which pages are most important.

• This article examines the mathematical predecessors to PageRank in ostensibly disparate disciplines such as Web information retrieval, bibliometrics, sociometry, and econometrics.

• In tracing these early efforts we find the fundamental idea behind PageRank (its circular thesis that a Web page is important if it is pointed to by other important pages) was not entirely new, and had basis in work going back decades.

ᵃ http://www.google.com/corporate/tech.html
ᵇ http://www.w3.org/History/1989/proposal.html

Illustration by John Hersey

92 Communications of the ACM | June 2011 | Vol. 54 | No. 6



As we all know, the proposal was accepted and later implemented in a mesh (this was the only name that Berners-Lee originally used to describe the Web) of interconnected documents that rapidly grew beyond the CERN threshold, as Berners-Lee anticipated, and became the World Wide Web.

Today, the Web is a huge, dynamic, self-organized, and hyperlinked data source, very different from traditional document collections that are nonlinked, mostly static, centrally collected and organized by specialists. These features make Web information retrieval quite different from traditional information retrieval and call for new search abilities, like automatic crawling and indexing of the Web. Moreover, early search engines ranked responses using only a content score, which measures the similarity between the page and the query. One simple example is just a count of the number of times the query words occur on the page, or perhaps a weighted count with more weight on title words. These traditional query-dependent techniques suffered under the gigantic size of the Web and the death grip of spammers.

In 1998, Sergey Brin and Larry Page revolutionized the field of Web information retrieval by introducing the notion of an importance score, which gauges the status of a page, independently from the user query, by analyzing the topology of the Web graph. The method was implemented in the famous PageRank algorithm and both the traditional content score and the new importance score were efficiently combined in a new search engine named Google.

Ranking Web Pages using PageRank

We briefly recall how the PageRank method works keeping the mathematical machinery to the minimum. Interested readers can more thoroughly investigate the topic in a recent book by Langville and Meyer that elegantly describes the science of search engine rankings in a rigorous yet playful style.¹⁶

We start by providing an intuitive interpretation of PageRank in terms of random walks on graphs.²² The Web is viewed as a directed graph of pages connected by hyperlinks. A random surfer starts from an arbitrary page and simply keeps clicking on successive links at random, bouncing from page to page. The PageRank value of a page corresponds to the relative frequency the random surfer visits that page, assuming that the surfer goes on infinitely. The more time spent by the random surfer on a page, the higher the PageRank importance of the page.

A little more formally, the method can be described as follows. Let us denote by $q_i$ the number of distinct outgoing (hyper)links of page i. Let $H = (h_{i,j})$ be a square matrix of size equal to the number n of Web pages such that $h_{i,j} = 1/q_i$ if there exists a link from page i to page j and $h_{i,j} = 0$ otherwise. The value $h_{i,j}$ can be interpreted as the probability that the random surfer moves from page i to page j by clicking on one of the distinct links of page i. The PageRank $\pi_j$ of page j is recursively defined as

$$\pi_j = \sum_i \pi_i h_{i,j}$$

or, in matrix notation, $\pi = \pi H$. Hence, the PageRank of page j is the sum of the PageRank scores of pages i linking to j, weighted by the probability of going from i to j. In words, the PageRank thesis reads as follows:

    A Web page is important if it is pointed to by other important pages.

There are in fact three distinct factors that determine the PageRank of a page: the number of links it receives; the link propensity, that is, the number of outgoing links, of the linking pages; and the PageRank of the linking pages. The first factor is not surprising: the more links a page receives, the more important it is perceived. Reasonably, the link value depreciates proportionally to the number of links given out by a page: endorsements coming from parsimonious pages are worthier than those emanated by spendthrift ones. Finally, not all pages are created equal: links from important pages are more valuable than those from obscure ones.

Unfortunately, this ideal model has two problems that prevent the solution of the system. The first one is due to the presence of dangling nodes, that is, pages with no forward links.ᶜ These pages capture the random surfer indefinitely. Notice that a dangling node corresponds to a row in matrix H with all entries equal to 0. To tackle the problem of dangling nodes, the corresponding rows in H are replaced by the uniform probability vector $u = \frac{1}{n} e$, where e is a vector of length n with all components equal to 1. Alternatively, one may use any fixed probability vector in place of u. This means the random surfer escapes from the dangling page by jumping to a randomly chosen page. We call S the resulting matrix.

ᶜ The term dangling refers to the fact that many dangling nodes are in fact pendent Web pages found by the crawling spiders but whose links have not yet been explored.

The second problem with the ideal model is that the surfer can get trapped into a bucket of the Web graph, which is a reachable strongly connected component without outgoing edges toward the rest of the graph. The solution proposed by Brin and Page is to replace matrix S by the Google matrix

$$G = \alpha S + (1 - \alpha) E$$

where E is the teleportation matrix with identical rows each equal to the uniform probability vector u, and $\alpha$ is a free parameter of the algorithm often called the damping factor. Alternatively, a fixed personalization probability vector v can be used in place of u. In particular, the personalization vector can be exploited to bias the result of the method toward certain topics. The interpretation of the new system is that, with probability $\alpha$ the random surfer moves forward by following links, and, with the complementary probability $1 - \alpha$ the surfer gets bored of following links and enters a new destination in the browser’s URL line, possibly unrelated to the current page. The surfer is hence teleported, like a Star Trek character, to that page, even if there exists no link connecting the current and the destination pages in the Web universe. The inventors of PageRank propose to set the damping factor $\alpha = 0.85$, meaning that after about five link clicks the random surfer chooses a random page.

The PageRank vector is then defined as the solution of equation:

$$\pi = \pi G \tag{1}$$

An example is provided in Figure 1. Node A is a dangling node, while nodes B and C form a bucket. Notice the dynamics of the method: page C receives just one link but from the most important page B; its importance is much higher than that of page E, which receives many more links, but from anonymous pages. Pages G, H, I, L, and M do not receive endorsements; their scores correspond to the minimum amount of status of each page.

Figure 1. A PageRank instance with solution. Each node is labeled with its PageRank score. Scores have been normalized to sum to 100. We assumed α = 0.85.
Node scores: A 3.3, B 38.4, C 34.3, D 3.9, E 8.1, F 3.9, G 1.6, H 1.6, I 1.6, L 1.6, M 1.6.

Typically, the normalization condition $\sum_i \pi_i = 1$ is also added. In this case, since $\pi E = u$ whenever the entries of $\pi$ sum to 1, Equation 1 becomes $\pi = \alpha \pi S + (1 - \alpha) u$. The latter distinguishes two factors contributing to the PageRank vector: an endogenous factor equal to $\pi S$, which takes into consideration the real topology of the Web graph, and an exogenous factor equal to the uniform probability vector u, which can be interpreted as a minimal amount of status assigned to each page independently of the hyperlink graph. The parameter $\alpha$ balances between these two factors.

Computing the PageRank Vector

Does Equation 1 have a solution? Is the solution unique? Can we efficiently compute it? The success of the PageRank method rests on the answers to these queries. Luckily, all these questions have nice answers.

Thanks to the dangling nodes patch, matrix S is a stochastic matrix,ᵈ and clearly the teleportation matrix E is also stochastic. It follows that G is stochastic as well, since it is defined as a convex combination of stochastic matrices S and E. It is easy to show that, if G is stochastic, Equation 1 always has at least one solution. Hence, we have at least one PageRank vector. Having two independent PageRank vectors, however, would already be too much: which one should we use to rank Web pages? Here, a fundamental result of algebra comes to the rescue: the Perron–Frobenius theorem.⁷, ²⁴ It states that, if A is an irreducibleᵉ nonnegative square matrix, then there exists a unique vector x, called the Perron vector, such that $xA = rx$, $x > 0$, and $\sum_i x_i = 1$, where r is the maximum eigenvalue of A in absolute value, that algebraists call the spectral radius of A. The Perron vector is the left dominant eigenvector of A, that is, the left eigenvector associated with the largest eigenvalue in magnitude.

ᵈ This simply means that all rows sum up to 1.
ᵉ A matrix is irreducible if and only if the directed graph associated with it is strongly connected, that is, for every pair i and j of graph nodes there are paths leading from i to j and from j to i.

The matrix S is most likely reducible, since experiments have shown that the Web has a bow-tie structure fragmented into four main continents that are not mutually reachable, as first observed by Broder et al.⁴ Thanks to the teleportation trick, however, the graph of matrix G is strongly connected. Hence G is irreducible and the Perron–Frobenius theorem applies.ᶠ Therefore, a positive PageRank vector exists and is furthermore unique.

ᶠ Since G is stochastic, its spectral radius is 1.

Interestingly, we can arrive at the same result using Markov theory.²⁰ The above described random walk on the Web graph, modified with the teleportation jumps, naturally induces a finite-state Markov chain, whose transition matrix is the stochastic matrix G. Since G is irreducible, the chain has a unique stationary distribution corresponding to the PageRank vector.

A last crucial question remains: can we efficiently compute the PageRank vector? The success of PageRank is largely due to the existence of a fast method to compute its values: the power method, a simple iteration method to find the dominant eigenpair of a matrix developed by von Mises and Pollaczek-Geiringer.³¹ It works as follows on the Google matrix G. Let $\pi^{(0)} = u = \frac{1}{n} e$. Repeatedly compute $\pi^{(k+1)} = \pi^{(k)} G$ until $\|\pi^{(k+1)} - \pi^{(k)}\| < \epsilon$, where $\|\cdot\|$ measures the distance between the two successive PageRank vectors and $\epsilon$ is the desired precision.

The convergence rate of the power method is approximately the rate at which $\alpha^k$ approaches 0: the closer $\alpha$ is to unity, the lower the convergence speed of the power method. If, for instance, $\alpha = 0.85$, as many as 43 iterations are sufficient to gain 3 digits of accuracy, and 142 iterations are enough for 10 digits of accuracy. Notice that the power method applied to matrix G can be easily expressed in terms of matrix H, which, unlike G, is a very sparse matrix that can be stored using a linear amount of memory with respect to the size of the Web.

Standing on the Shoulders of Giants

Dwarfs standing on the shoulders of giants is a Western metaphor meaning “One who develops future intellectual pursuits by understanding the research and works created by notable thinkers of the past.”ᵍ The metaphor was famously uttered by Isaac Newton: “If I have seen a little further it is by standing on the shoulders of Giants.” Moreover, “Stand on the shoulders of giants” is Google Scholar’s motto: “the phrase is our acknowledgement that much of scholarly research involves building on what others have already discovered.”

ᵍ From the Wikipedia page for Standing on the shoulders of giants.

There are many giants upon whose shoulders PageRank firmly stands: Markov,²⁰ Perron,²⁴ Frobenius,⁷ von Mises and Pollaczek-Geiringer³¹ provided at the beginning of the 1900s the necessary mathematical machinery to investigate and effectively solve the PageRank problem. Moreover, the circular PageRank thesis has been previously exploited in different contexts, including Web information retrieval, bibliometrics, sociometry, and econometrics. In the following, we review these contributions and link them to the PageRank method. Table 1 shows a brief summary of PageRank history. All the ranking techniques surveyed in this paper have been implemented in R²⁷ and the code is freely available at the author’s Web page.

Table 1. PageRank history

Year   Author                              Contribution
1906   Markov                              Markov theory²⁰
1907   Perron                              Perron theorem²⁴
1912   Frobenius                           Perron–Frobenius theorem⁷
1929   von Mises and Pollaczek-Geiringer   Power method³¹
1941   Leontief                            Econometric model¹⁸
1949   Seeley                              Sociometric model²⁹
1952   Wei                                 Sport ranking model³²
1953   Katz                                Sociometric model¹¹
1965   Hubbell                             Sociometric model¹⁰
1976   Pinski and Narin                    Bibliometric model²⁶
1998   Kleinberg                           HITS¹⁴
1998   Brin and Page                       PageRank³

Hubs and Authorities on the Web

Hypertext Induced Topic Search (HITS) is a Web page ranking method proposed by Kleinberg.¹⁴, ¹⁵ The connections between HITS and PageRank are striking. Despite the close conceptual, temporal, and even geographical proximity of the two approaches, it appears that HITS and PageRank have been developed independently. In fact, both papers presenting PageRank³ and HITS¹⁵ are today citational blockbusters: the PageRank article collected 6,167 citations, while the HITS paper has been cited 4,617 times.ʰ

ʰ Source: Google Scholar on February 5, 2010.

HITS thinks of Web pages as authorities and hubs. The HITS circular thesis reads as follows:

    Good authorities are pages that are pointed to by good hubs and good hubs are pages that point to good authorities.

Let $L = (l_{i,j})$ be the adjacency matrix of the Web graph, i.e., $l_{i,j} = 1$ if page i links to page j and $l_{i,j} = 0$ otherwise. We denote with $L^T$ the transpose of L. HITS defines a pair of recursive equations as follows, where x is the authority vector containing the authority scores and y is the hub vector containing the hub scores:

$$x^{(k)} = L^T y^{(k-1)}, \qquad y^{(k)} = L x^{(k)} \tag{2}$$

where $k \geq 1$ and $y^{(0)} = e$, the vector of all ones. The first equation tells us that authoritative pages are those pointed to by good hub pages, while the second equation claims that good hubs are pages that point to authoritative pages. Notice that Equation 2 is equivalent to

$$x^{(k)} = L^T L \, x^{(k-1)}, \qquad y^{(k)} = L L^T y^{(k-1)} \tag{3}$$

It follows that the authority vector x is the dominant right eigenvector of the authority matrix $A = L^T L$, and the hub vector y is the dominant right eigenvector of the hub matrix $H = L L^T$. This is very similar to the PageRank method, except the use of the authority and hub matrices instead of the Google matrix.

To compute the dominant eigenpair (eigenvector and eigenvalue) of the authority matrix we can again exploit the power method as follows: let $x^{(0)} = e$. Repeatedly compute $x^{(k)} = A x^{(k-1)}$ and normalize $x^{(k)} = x^{(k)} / m(x^{(k)})$, where $m(x^{(k)})$ is the signed component of maximal magnitude, until the desired precision is achieved. It follows that $x^{(k)}$ converges to the dominant eigenvector x (the authority vector) and $m(x^{(k)})$ converges to the dominant eigenvalue (the spectral radius, which is not necessarily 1). The hub vector y is then given by $y = Lx$. While the convergence of the power method is guaranteed, the computed solution is not necessarily unique, since the authority and hub matrices are not necessarily irreducible. A modification similar to the teleportation trick used for the PageRank method can be applied to HITS to recover the uniqueness of the solution.³⁵

An example of HITS is given in Figure 2. We stress the difference among importance, as computed by PageRank, and authority and hubness, as computed by HITS. Page B is both important and authoritative, but it is not a good hub. Page C is important but by no means authoritative. Pages G, H, and I are neither important nor authoritative, but they are the best hubs of the network, since they point to good authorities only. Notice that the hub score of B is 0 although B has one outgoing edge; unfortunately for B, the only page C linked by B has no authority. Similarly, C has no authority because it is pointed to only by B, whose hub score is zero. This shows the difference between indegree and authority, as well as between outdegree and hubness. Finally, we observe that nodes with null authority scores (respectively, null hub scores) correspond to isolated nodes in the graph whose adjacency matrix is the authority matrix A (respectively, the hub matrix H).

Figure 2. A HITS instance with solution (compare with PageRank scores in Figure 1). Each node is labeled with its authority (top) and hub (bottom) scores. Scores have been normalized to sum to 100. The dominant eigenvalue for both authority and hub matrices is 10.7.
Node scores (authority/hub): A 4.7/0.0, B 45.9/0.0, C 0.0/8.1, D 5.3/8.9, E 38.9/9.9, F 5.3/14.9, G 0.0/14.9, H 0.0/14.9, I 0.0/14.9, L 0.0/6.8, M 0.0/6.8.

An advantage of HITS with respect to PageRank is that it provides two scores at the price of one. The user is hence provided with two rankings: the most authoritative pages about the research topic, which can be exploited to investigate in depth a research subject, and the most hubby pages, which correspond to portal pages linking to the research topic from which a broad search can be started. A disadvantage of HITS is the higher susceptibility of the method to spamming: while it is difficult to add incoming links to our favorite page, the addition of outgoing links is much easier. This leads to the possibility of purposely inflating the hub score of a page, indirectly influencing also the authority scores of the pointed pages.

HITS is related to a matrix factorization technique known as singular value decomposition.⁶ According to this technique, the adjacency matrix L can be written as the matrix product $U S V^T$, where the columns of U, called left-singular vectors, are the orthonormal eigenvectors of the hub matrix $L L^T$, the columns of V, called right-singular vectors, are the orthonormal eigenvectors of the authority matrix $L^T L$, and S is a diagonal matrix whose diagonal elements, called singular values, correspond to the square roots of the eigenvalues of the hub matrix (or, equivalently, of the authority matrix). It follows that the HITS authority and hub vectors correspond, respectively, to the right- and left-singular vectors associated with the highest singular value of L.

HITS also has a connection to bibliometrics.⁶ Two typical bibliometric methods to identify similar publications are co-citation, in which publications are related when they are cited by the same papers, and co-reference, in which papers are related when they cite the same papers. The authority matrix is a co-citation matrix and the hub matrix is a co-reference matrix. Indeed, since $A = L^T L$, the element $a_{i,j}$ of the authority matrix contains the number of times pages i and j are both linked by a third page ($a_{i,i}$ is the number of inlinks of i). Moreover, since $H = L L^T$, the element $h_{i,j}$ of the hub matrix contains the number of times both pages i and j link to a third page ($h_{i,i}$ is the number of outlinks of i). Hence, good authorities are pages that are frequently co-cited with other good authorities, and good hubs are pages that frequently co-reference with other good hubs.

A subsequent algorithm that incorporates ideas from both PageRank and HITS is SALSA¹⁷: like HITS, SALSA computes both authority and hub scores, and like PageRank, these scores are obtained from Markov chains.


Bibliometrics

Bibliometrics, also known as scientometrics, is the quantitative study of the process of scholarly publication of research achievements. The most mundane aspect of this branch of information and library science is the design and application of bibliometric indicators to determine the influence of bibliometric units like scholars and academic journals. The Impact Factor is, undoubtedly, the most popular and controversial journal bibliometric indicator available at the moment. It is defined, for a given journal and a fixed year, as the mean number of citations in the year to papers published in the two previous years. It was proposed in 1963 by Eugene Garfield, the founder of the Institute for Scientific Information (ISI), working together with Irv Sher.⁸ Journal Impact Factors are currently published in the popular Journal Citation Reports by Thomson-Reuters, the new owner of the ISI.

The Impact Factor does not take into account the importance of the citing journals: citations from highly reputed journals are weighted as those from obscure journals. In 1976 Gabriel Pinski and Francis Narin developed an innovative journal ranking method.²⁶ The method measures the influence of a journal in terms of the influence of the citing journals. The Pinski and Narin thesis is:

    A journal is influential if it is cited by other influential journals.

This is the same circular thesis of the PageRank method. Given a source time window $T_1$ and a previous target time window $T_2$, the journal citation system can be viewed as a weighted directed graph in which nodes are journals and there is an edge from journal i to journal j if there is some article published in i during $T_1$ that cites an article published in j during $T_2$. The edge is weighted with the number $c_{i,j}$ of such citations from i to j. Let $c_i = \sum_j c_{i,j}$ be the total number of cited references of journal i.

In the method described by Pinski and Narin, a citation matrix $H = (h_{i,j})$ is constructed such that $h_{i,j} = c_{i,j} / c_j$. The coefficient $h_{i,j}$ is the amount of citations received by journal j from journal i per reference given out by journal j. For each journal an influence score is determined which measures the relative journal performance per given reference. The influence score $\pi_j$ of journal j is defined as

$$\pi_j = \sum_i \pi_i h_{i,j} = \sum_i \pi_i \frac{c_{i,j}}{c_j}$$

or, in matrix notation:

$$\pi = \pi H \tag{4}$$

Hence, journals j with a large total influence $\pi_j c_j$ are those that receive significant endorsements from influential journals. Notice that the influence per reference score $\pi_j$ of a journal j is a size-independent measure, since the formula normalizes by the number of cited references $c_j$ contained in articles of the journal, which is an estimation of the size of the journal. Moreover, the normalization neutralizes the effect of journal self-citations, which are citations between articles in the same journal. These citations are indeed counted both at the numerator and at the denominator of the influence score formula. This avoids overinflating journals that engage in the practice of opportunistic self-citations.

It can be proved that the spectral radius of matrix H is 1, hence the influence score vector corresponds to the dominant eigenvector of H.⁹ In principle, the uniqueness of the solution and the convergence of the power method to it are not guaranteed. Nevertheless, both properties are not difficult to obtain in real cases. If the citation graph is strongly connected, then the solution is unique. When journals belong to the same research field, this condition is typically satisfied. Moreover, if there exists a self-loop in the graph, that is, an article that cites an article in the same journal, then the power method converges.

Figure 3 provides an example of the Pinski and Narin method. Notice that the graph is strongly connected and has a self-loop, hence the solution is unique and can be computed with the power method. Both journals A and C receive the same number of citations and give out the same number of references. Nevertheless, the influence of A is bigger, since it is cited by a more influential journal (B instead of D). Furthermore, A and D receive the same number of citations from the same journals, but D is larger than A, since it contains more references, hence the influence of A is higher.

Figure 3. An instance with solution of the journal ranking method proposed by Pinski and Narin. Nodes are labeled with influence scores and edges with the citation flow between journals. Scores have been normalized to sum to 100.
Node scores: A 28.0, B 44.7, C 12.4, D 14.9.

Similar recursive methods have been independently proposed by Liebowitz and Palmer¹⁹ and Palacios-Huerta and Volij²³ in the context of ranking of economics journals. Recently, various PageRank-inspired bibliometric indicators to evaluate the importance of journals using the academic citation network have been proposed and extensively tested: journal PageRank,² Eigenfactor,³⁴ and SCImago.²⁸

Sociometry

Sociometry, the quantitative study of social relationships, contains remarkably old PageRank predecessors. Sociologists were the first to use the network approach to investigate the properties of groups of people related in some way. They devised measures like indegree, closeness, betweenness, as well as eigenvector centrality, which are still used today in modern (not necessarily social) network analysis.²¹ In particular, eigenvector centrality uses the same central ingredient of PageRank applied to a social network:

    A person is prestigious if he is endorsed by prestigious people.

John R. Seeley in 1949 is probably the first in this context to use the circular argument of PageRank.²⁹ Seeley reasons in terms of social relationships among children: each child chooses other children in a social group with a non-negative strength. The author notices that the total choice strength received by each child is inadequate as an index of popularity, since it does not consider the popularity of the chooser. Hence, he proposes to define the popularity of a child as a function of the popularity of those children who chose the child, and the popularity of the choosers as a function of the popularity of those who chose them and so on, in an “indefinitely repeated reflection.” Seeley exposes the problem in terms of linear equations and uses Cramer’s rule to solve the linear system. He does not discuss the issue of uniqueness.

Another model was proposed in 1953 by Leo Katz.¹¹ Katz views a social network as a directed graph where nodes are people and person i is connected by an edge to person j if i chooses, or endorses, j. The status of member j is defined as the number of weighted paths reaching j in the network, a generalization of the indegree measure. Long paths are weighted less than short ones, since endorsements devalue over long chains. Notice that this method indirectly takes account of who endorses as well as how many endorse an individual: if a node i points to a node j and i is reached by many paths, then the paths leading to i arrive also at j in one additional step.

Katz builds an adjacency matrix $L = (l_{i,j})$ such that $l_{i,j} = 1$ if person i chooses person j and $l_{i,j} = 0$ otherwise. He defines a matrix $W = \sum_{k=1}^{\infty} (aL)^k$, where a is an attenuation constant. Notice that the (i, j) component of $L^k$ is the number of paths of length k from i to j, and this number is attenuated by $a^k$ in the computation of W. Hence, the (i, j) component of the limit matrix W is the weighted number of arbitrary paths from i to j. Finally, the status of member j is $\pi_j = \sum_i w_{i,j}$, that is, the number of weighted paths reaching j. If the attenuation factor $a < 1/r(L)$, with $r(L)$ the spectral radius of L, then the above series for W converges.

Figure 4 illustrates the method with an example. Notice the important role of the attenuation factor: when it is large (close to $1/r(L)$), long paths are devalued smoothly, and Katz scores are strongly correlated with PageRank ones. In the shown example, PageRank and Katz methods provide the same ranking of nodes when the attenuation factor is 0.9. On the other hand, if the attenuation factor is small (close to 0), then the contribution given by paths longer than 1 rapidly declines, and thus Katz scores converge to indegrees, the number of incoming links of nodes. In the example, when the attenuation factor drops to 0.1, nodes C and E switch their positions in the ranking: node E, which receives many short paths, significantly increases its score, while node C, which is the destination of just one short path and many (devalued) long ones, significantly decreases its score.

Figure 4. An example of the Katz model using two attenuation factors: a = 0.9 and a = 0.1 (the spectral radius of the adjacency matrix L is 1). Each node is labeled with the Katz score corresponding to a = 0.9 (top) and a = 0.1 (bottom). Scores have been normalized to sum to 100.
Node scores (a = 0.9 / a = 0.1): A 2.7/5.7, B 46.4/39.6, C 41.9/8.8, D 2.9/7.9, E 3.2/30.1, F 2.9/7.9, G 0.0/0.0, H 0.0/0.0, I 0.0/0.0, L 0.0/0.0, M 0.0/0.0.

Figure 5. An instance of the Hubbell model with solution: each node is labeled with its prestige score and each edge is labeled with the endorsement strength between the connected members; negative strength is highlighted with dashed edges. The minimal amount of status has been fixed to 0.2 for all members.

In 1965 Charles H. Hubbell generalized the proposal of Katz.¹⁰ Given a set of members of a social context, Hubbell defines a matrix $W = (w_{i,j})$

j un e 2 0 1 1 | vo l . 5 4 | n o. 6 | c ommu n i cati o n s o f the acm 99


review articles

such that wi, j is the strength at which i acknowledging the negative judgment economy of a country may be divided
endorses j. Interestingly, these weights given by the other members. into any desired number of sectors,
can be arbitrary, and in particular, they Equation 5 is equivalent to π(I − W) = called industries, each consisting of
can be negative. The prestige of a mem- v, where I is the identity matrix, that is firms producing a similar product.
ber is recursively defined in terms of . The series con- Each industry requires certain inputs
the prestige of the endorsers and takes verge if and only if the spectral radius in order to produce a unit of its own
account of the endorsement strengths: of W is less than 1. It is now clear that product, and sells its products to other
the Hubbell model is a generalization industries to meet their ingredient
π = πW + v (5) of the Katz model to general matrices requirements. The aim is to find prices
that adds an initial exogenous input v. for the unit of product produced by
The term v is an exogenous vector such Indeed, the Katz equation for social sta- each industry that guarantee the repro-
that vi is a minimal amount of status tus is , where e is a vector of ducibility of the economy, which holds
assigned to i from outside the system. all ones. In an unpublished note Vigna when each sector balances the costs
The original aspects of the method traces the history of the mathematics for its inputs with the revenues of its
are the presence of an exogenous ini- of spectral ranking and shows there is outputs. In 1973, Leontief earned the
tial input and the possibility of giving a reduction from the path summation Nobel Prize in economics for his work
negative endorsements. A consequence formulation of Hubbell–Katz to the on the input–output model. An exam-
of negative endorsements is that the eigenvector formulation with teleporta- ple is provided in Table 2.
status of an actor can also be nega- tion of PageRank and vice versa.30 In the Let qi, j denote the quantity produced
tive. An actor who receives a positive mapping the attenuation constant is by the ith industry and used by the jth
(respectively, negative) judgment from the counterpart of the PageRank damp- industry, and qi be the total quantity
a member of positive status increases ing factor, and the exogenous vector produced by sector i, that is, .
(respectively, decreases) his prestige. corresponds to the PageRank person- Let A = (ai, j) be such that ai, j = qi, j/qj; each
On the other hand, and interestingly, alization vector. The interpretation of coefficient ai, j represents the amount
receiving a positive judgment from a PageRank as a sum of weighted paths is of product (produced by industry) i con-
member of negative status makes a neg- also investigated by Baeza-Yates et al.1 sumed by industry j that is necessary to
ative contribution to the prestige of the Spectral ranking methods have also produce a unit of product j. Let πj be the
endorsed member (if you are endorsed been exploited to rank sport teams in price for the unit of product produced
by some person affiliated to the Mafia competitions that involve teams play- by each industry j. The reproducibility
your reputation might drop indeed). ing in pairs.13, 32 The underlying idea is of the economy holds when each sector
Moreover, receiving a negative endorse- that a team is strong if it won against j balances the costs for its inputs with
ment from a member of negative status other strong teams. Much of the art of the revenues of its outputs, that is:
makes a positive contribution to the the sport ranking problem is how to
prestige of the endorsed person (if the define the matrix entries ai,  j expressing
same Mafioso opposes you, then your how much team i is better than team j
reputation might raise). (e.g., we could pick ai,  j to be 1 if j beats
Figure 5 shows an example for the i, 0.5 if the game ended in a tie, and 0
Hubbell model. Notice that Charles otherwise).12 By dividing each balance equation by qj
does not receive any endorsement and we have
hence has the minimal amount of sta- Econometrics
tus given by default to each member. We conclude with a succinct descrip-
David receives only negative judgments; tion of the input–output model devel-
interestingly, the fact that he has a posi- oped in 1941 by Nobel Prize winner
tive self-opinion further decreases his Wassily W. Leontief in the field of or, in matrix notation,
status. A better strategy for him, know- econometrics—the quantitative study
ing in advance of his negative status, of economic principles.18 According to π = π A (6)
would be to negatively judge himself, the Leontief input–output model, the
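The Hubbell model of Equation 5 lends itself to a small numerical sketch. The Python fragment below is only an illustration: the three-member endorsement matrix `W`, the exogenous vector `v`, and the helper names `vec_mat` and `hubbell_status` are assumptions made here, not data or code from Hubbell's paper or from Figure 5. It accumulates the partial sums of the path series by iterating the recursion π ← πW + v, which settles only when the spectral radius of W is below 1.

```python
def vec_mat(x, M):
    """Row vector times matrix: (x M)_j = sum_i x_i * M[i][j]."""
    return [sum(x[i] * M[i][j] for i in range(len(M))) for j in range(len(M[0]))]


def hubbell_status(W, v, tol=1e-12, max_iter=10_000):
    """Iterate pi <- pi W + v, i.e. accumulate v (I + W + W^2 + ...)."""
    pi = list(v)
    for _ in range(max_iter):
        nxt = [p + e for p, e in zip(vec_mat(pi, W), v)]
        if max(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    raise RuntimeError("no convergence: is the spectral radius of W below 1?")


# Hypothetical endorsements: W[i][j] is the strength at which member i endorses j.
# Member 1 gives member 2 a negative endorsement (-0.4).
W = [
    [0.0, 0.5, 0.0],
    [0.2, 0.0, -0.4],
    [0.0, 0.3, 0.0],
]
v = [1.0, 1.0, 1.0]  # assumed exogenous status, one unit per member

status = hubbell_status(W, v)
print([round(s, 4) for s in status])
```

In this toy network the member who is negatively endorsed by a member of positive status ends up with a status below the exogenous unit, in line with the qualitative discussion of negative endorsements.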

Table 2. An input–output table for an economy with three sectors with the balance solution.

              Agriculture   Industry   Family   Total   Price   Revenue
Agriculture           7.5          6     16.5      30      20       600
Industry             14            6     30        50      15       750
Family               80          180     40       300       3       900
Cost                600          750    900

Each row shows the output of a sector to other sectors of the economy. Each column shows the inputs received by a sector from other sectors. For each sector we also show total quantity produced, equilibrium unitary price, total cost, and total revenue. Notice that each sector balances costs and revenues.
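The balance solution reported in Table 2 can be checked with a few lines of Python. The sketch below is an illustration under stated assumptions rather than Leontief's own procedure: it builds the coefficients a_{i,j} = q_{i,j}/q_j from the quantities in the table, finds a price vector satisfying π = πA by plain power iteration, and fixes the otherwise arbitrary scale by setting the Family price to 3, as in the table. The function names `technical_coefficients` and `equilibrium_prices` are hypothetical.

```python
# Quantities from Table 2: Q[i][j] is the output of sector i used by sector j.
SECTORS = ["Agriculture", "Industry", "Family"]
Q = [
    [7.5, 6.0, 16.5],
    [14.0, 6.0, 30.0],
    [80.0, 180.0, 40.0],
]


def technical_coefficients(Q):
    """Return A with a[i][j] = q[i][j] / q[j], and the totals q[i] = sum_j q[i][j]."""
    n = len(Q)
    totals = [sum(row) for row in Q]
    A = [[Q[i][j] / totals[j] for j in range(n)] for i in range(n)]
    return A, totals


def equilibrium_prices(A, tol=1e-12, max_iter=10_000):
    """Power iteration pi <- pi A; A has spectral radius 1, so no rescaling is applied."""
    n = len(A)
    pi = [1.0] * n
    for _ in range(max_iter):
        nxt = [sum(pi[i] * A[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    raise RuntimeError("power iteration did not converge")


A, totals = technical_coefficients(Q)
raw = equilibrium_prices(A)

# Prices are defined up to a positive factor; assume Family = 3 as in Table 2.
scale = 3.0 / raw[2]
prices = [scale * p for p in raw]

n = len(Q)
costs = [sum(prices[i] * Q[i][j] for i in range(n)) for j in range(n)]
revenues = [prices[j] * totals[j] for j in range(n)]

for name, p, c, r in zip(SECTORS, prices, costs, revenues):
    print(f"{name:12s} price={p:6.2f} cost={c:7.2f} revenue={r:7.2f}")
```

The iteration converges here because every entry of A is positive; for a reducible table the price vector need not be unique and a plain iteration of this kind would not be a safe choice.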


Hence, highly remunerated industries (industries j with high total revenue $\pi_j q_j$) are those that receive substantial inputs from highly remunerated industries, a circularity that closely resembles the PageRank thesis [25]. With the same argument used by Geller [9] for the Pinski and Narin bibliometric model we can show that the spectral radius of matrix $A$ is 1, thus the equilibrium price vector $\pi$ is the dominant eigenvector of matrix $A$. Such a solution always exists, although it might not be unique, unless $A$ is irreducible. Notice the striking similarity of the Leontief closed model with that proposed by Pinski and Narin. An open Leontief model adds an exogenous demand and creates a surplus of revenue (profit). It is described by the equation $\pi = \pi A + v$, where $v$ is the profit vector. Hubbell himself observes the similarity between his model and the Leontief open model [10].

It might seem disputable to juxtapose the PageRank and Leontief methods. To be sure, the original motivation of Leontief's work was to give a formal method to find equilibrium prices for the reproducibility of the economy and to use the method to estimate the impact on the entire economy of a change in demand in any sector of the economy. Leontief, to the best of our limited knowledge, was not motivated by an industry ranking problem. On the other hand, the motivation underlying the other methods described in this paper is the ranking of a set of homogeneous entities. Despite the original motivations, however, there are more than coincidental similarities between the Leontief open and closed models and the other ranking methods described in this paper. These connections motivated the discussion of the Leontief contribution, which is probably the least known among the surveyed methods within the computing community.

Conclusion
The classic notion of quality of information is related to the judgment given by a few field experts. PageRank introduced an original notion of quality of information found on the Web: the collective intelligence of the Web, formed by the opinions of the millions of people that populate this universe, is exploited to determine the importance, and ultimately the quality, of that information.

Consider the difference between expert evaluation and collective evaluation. The former tends to be intrinsic, subjective, deep, slow, and expensive. By contrast, the latter is typically extrinsic, democratic, superficial, fast, and low cost. Interestingly, the dichotomy between these two evaluation methodologies is not peculiar to information found on the Web. In the context of assessment of academic research, peer review (the evaluation of scholarly publications given by peer experts working in the same field of the publication) plays the role of expert evaluation. Collective evaluation consists in gauging the importance of a contribution through the bibliometric practice of counting and analyzing citations received by the publication from the academic community. Citations generally witness the use of information and acknowledge intellectual debt. Eigenfactor [34], a PageRank-inspired bibliometric indicator, is among the most interesting recent proposals to collectively evaluate the status of academic journals. The consequences of a shift from peer review to bibliometric evaluation are currently hotly debated in the academic community [33].

Acknowledgments
The author thanks Enrico Bozzo, Sebastiano Vigna, and the anonymous referees for positive and critical comments on the early drafts of this paper. Sebastiano Vigna was the first to point out the contribution of John R. Seeley.

References
1. Baeza-Yates, R.A., Boldi, P., Castillo, C. Generic damping functions for propagating importance in link-based ranking. Internet Math. 3, 4 (2007), 445–478.
2. Bollen, J., Rodriguez, M.A., de Sompel, H.V. Journal status. Scientometrics 69, 3 (2006), 669–687.
3. Brin, S., Page, L. The anatomy of a large-scale hypertextual web search engine. Comput. Netw. ISDN Syst. 30, 1–7 (1998), 107–117.
4. Broder, A.Z., Kumar, R., Maghoul, F., Raghavan, P., Rajagopalan, S., Stata, R., Tomkins, A., Wiener, J.L. Graph structure in the Web. Comput. Netw. 33, 1–6 (2000), 309–320.
5. Bush, V. As we may think. Atl. Month. 176, 1 (1945), 101–108.
6. Ding, C.H.Q., Zha, H., He, X., Husbands, P., Simon, H.D. Link analysis: Hubs and authorities on the World Wide Web. SIAM Rev. 46, 2 (2004), 256–268.
7. Frobenius, G. Über matrizen aus nicht negativen elementen. In Sitzungsberichte der Preussischen Akademie der Wissenschaften zu Berlin, 1912, 456–477.
8. Garfield, E., Sher, H. New factors in the evaluation of scientific literature through citation indexing. Am. Doc. 14 (1963), 195–201.
9. Geller, N.L. On the citation influence methodology of Pinski and Narin. Inf. Process. Manage. 14, 2 (1978), 93–95.
10. Hubbell, C.H. An input–output approach to clique identification. Sociometry 28 (1965), 377–399.
11. Katz, L. A new status index derived from sociometric analysis. Psychometrika 18 (1953), 39–43.
12. Keener, J.P. The Perron–Frobenius theorem and the ranking of football teams. SIAM Rev. 35, 1 (1993), 80–93.
13. Kendall, M.G. Further contributions to the theory of paired comparisons. Biometrics 11, 1 (1955), 43–62.
14. Kleinberg, J.M. Authoritative sources in a hyperlinked environment. In ACM–SIAM Symposium on Discrete Algorithms, 1998, 668–677.
15. Kleinberg, J.M. Authoritative sources in a hyperlinked environment. J. ACM 46, 5 (1999), 604–632.
16. Langville, A.N., Meyer, C.D. Google's PageRank and Beyond: The Science of Search Engine Rankings. Princeton University Press, 2006.
17. Lempel, R., Moran, S. The stochastic approach for link-structure analysis (SALSA) and the TKC effect. Comput. Netw. 33, 1–6 (2000), 387–401.
18. Leontief, W.W. The Structure of American Economy, 1919–1929. Harvard University Press, Cambridge, 1941.
19. Liebowitz, S.J., Palmer, J.P. Assessing the relative impacts of economics journals. J. Econ. Lit. 22 (1984), 77–88.
20. Markov, A. Rasprostranenie zakona bol'shih chisel na velichiny, zavisyaschie drug ot druga. Izvestiya Fiziko-matematicheskogo obschestva pri Kazanskom universitete, 2-ya seriya 15, 94 (1906), 135–156.
21. Newman, M.E.J. Network Analysis: An Introduction. Oxford University Press, Oxford, U.K., 2010.
22. Page, L., Brin, S., Motwani, R., Winograd, T. The PageRank citation ranking: Bringing order to the Web. Technical Report 1999–66, Stanford InfoLab, November 1999. Retrieved June 1, 2010, from http://ilpubs.Stanford.edu:8090/422/
23. Palacios-Huerta, I., Volij, O. The measurement of intellectual influence. Econometrica 72 (2004), 963–977.
24. Perron, O. Zur theorie der matrices. Math. Ann. 64, 2 (1907), 248–263.
25. Pillai, S.U., Suel, T., Cha, S. The Perron–Frobenius theorem: Some of its applications. IEEE Signal Process. Mag. 22, 2 (2005), 62–75.
26. Pinski, G., Narin, F. Citation influence for journal aggregates of scientific publications: Theory, with application to the literature of physics. Inf. Process. Manage. 12, 5 (1976), 297–312.
27. R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2007. ISBN 3-900051-07-0. Accessed June 1, 2010, at http://www.R-project.org
28. SCImago. SJR—SCImago Journal & Country Rank. Accessed June 1, 2010, at http://www.scimagojr.com, 2007.
29. Seeley, J.R. The net of reciprocal influence: A problem in treating sociometric data. Can. J. Psychol. 3 (1949), 234–240.
30. Vigna, S. Spectral ranking, 2010. Retrieved June 1, 2010, from http://arxiv.org/abs/0912.0238
31. von Mises, R., Pollaczek-Geiringer, H. Praktische verfahren der gleichungsauflösung. Z. Angew. Math. Mech. 9, 58–77 (1929), 152–164.
32. Wei, T.H. The algebraic foundations of ranking theory. PhD thesis, Cambridge University, 1952.
33. Weingart, P. Impact of bibliometrics upon the science system: Inadvertent consequences? Scientometrics 62, 1 (2005), 117–131.
34. West, J., Althouse, B., Bergstrom, C., Rosvall, M., Bergstrom, T. Eigenfactor.org—Ranking and mapping scientific knowledge. Accessed June 1, 2010, at http://www.eigenfactor.org, 2007.
35. Zheng, A.X., Ng, A.Y., Jordan, M.I. Stable algorithms for link analysis. In International ACM SIGIR Conference on Research and Development in Information Retrieval, 2001, 258–266.

Massimo Franceschet (massimo.franceschet@udine.it) is a researcher in the Department of Mathematics and Computer Science, University of Udine, Italy.

© 2011 ACM 0001-0782/11/06 $10.00
