Project Part 2
PAGE RANK ALGORITHM

ABSTRACT
How web pages are ranked and displayed within a search is not a mystery. It involves applied mathematics and solid computer science for a correct implementation, expressed through vectors, matrices, and related mathematical notation. The matrices encode the link structure of the web and the movement of a web surfer. As links are added every day and the number of websites grows into the billions, changes in the web's link structure affect PageRank, so search algorithms need continual improvement. This report examines various algorithms for calculating the PageRank of a webpage and presents a simple model of a page using Python.
CHAPTER 1
INTRODUCTION
Introduction:
CHAPTER 2
PAGERANK
History:
PageRank was developed at Stanford University by Larry Page and Sergey Brin as part of a research project about a new kind of search engine. While just one of many factors that determine the ranking of Google search results, PageRank continues to provide the basis for all of Google's web search tools.
PageRank was influenced by citation analysis, first developed by Eugene
Garfield in the 1950s at the University of Pennsylvania, and by Hyper Search, developed by
Massimo Marchiori at the University of Padua. In the same year PageRank was introduced
(1998), Jon Kleinberg published his important work on HITS. Google’s founders cite
Garfield, Marchiori, and Kleinberg in their original paper.
In the early 90's, the first search engines used text-based ranking systems that decided which pages were relevant based on their text. There are many drawbacks to this approach. For example, a search with the keyword "Internet" can be problematic: the surfer might get a page containing the keyword "Internet" that carries no actual information about the Internet.
Moreover, suppose we wanted to find some information about Google. We
type in the word "Google" and expect that "www.google.in" would be the most relevant site
to our query. However, there may be millions of pages on the web using the word Google,
and “www.google.in” may not be the one that uses it most often. Suppose we decided to write
a web site that contains the word "google" a billion times and nothing else. Would it then
make sense for our web site to be the first one displayed by a search engine? The answer is
obviously no.
If a search engine simply counted occurrences of the query words, it would not make sense for the page with the most occurrences to be displayed first. There might be millions of web pages containing the searched word, and when the search engine returns all those pages, the result is useless to the surfer, who does not have the patience to go through every page containing the word to reach the one he or she is searching for. Usually, the user expects the relevant page to appear within the top 20-30 results provided by the search engine. A modern search engine therefore provides the best, most appropriate results first, unlike the older text-ranking method.
One of the most influential algorithms is the PageRank algorithm, used by the Google search engine. The main idea behind the PageRank algorithm is that the importance of a web page is predicted by the pages linking to it. If we create a web page i that has a hyperlink to page j, then page j is considered important. On the other hand, if page j has a backlink from page k (like www.google.com), we can say that k transfers some of its importance to j (i.e., k asserts that j is important). We can iteratively assign a rank to each page based on the number of pages that point to it.
Quoting from the original Google paper, PageRank is defined like this:
We assume page A has pages T1,…,Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1. We usually set d to 0.85. There are more details about d in the next section. Also, C(A) is defined as the number of links going out of page A. The PageRank of a page A is given as follows:
PR(A) = (1-d) + d(PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
Note that the PageRanks form a probability distribution over web pages, so the sum of all web pages' PageRanks will be one.
PageRank, or PR(A), can be calculated using a simple iterative algorithm, and corresponds to a "random surfer" who is given a web page at random and keeps clicking on links, never hitting "back", but who eventually gets bored and starts on another random page. The probability that the random surfer visits a page is its PageRank, and the damping factor d is the probability at each page that the "random surfer" will get bored and request another random page. One important variation is to only add the damping factor d to a single page, or a group of pages.
This allows for personalization and can make it nearly impossible to deliberately mislead the
system in order to get a higher ranking. Another intuitive justification is that a page can have
a high PageRank if there are many pages that point to it, or if there are some pages that point
to it and have a high PageRank. Intuitively, pages that are well cited from many places
around the web are worth looking at. Also, pages that have perhaps only one citation from
something like the Yahoo! homepage are also generally worth looking at. If a page was not
high quality, or was a broken link, it is quite likely that Yahoo's homepage would not link to
it. PageRank handles both these cases and everything in between by recursively propagating
weights through the link structure of the web.
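As a sketch of how this iterative calculation can be carried out, the following Python snippet applies the PR formula above to a small, made-up four-page web (the link structure and the iteration count are illustrative assumptions, not part of the original paper):

```python
# A minimal sketch of the iterative PageRank update described above, using the
# original (non-normalized) formula PR(A) = (1-d) + d * sum(PR(T)/C(T)).
# The four-page link structure below is a made-up example.
links = {
    "A": ["B", "C"],   # page A links out to pages B and C
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

def pagerank(links, d=0.85, iterations=50):
    pages = list(links)
    pr = {p: 1.0 for p in pages}          # initial guess for every page
    for _ in range(iterations):
        new_pr = {}
        for p in pages:
            # sum PR(T)/C(T) over every page T that links to p
            incoming = sum(pr[t] / len(links[t]) for t in pages if p in links[t])
            new_pr[p] = (1 - d) + d * incoming
        pr = new_pr
    return pr

ranks = pagerank(links)
print(ranks)  # page C, with the most inlinks, ends up with the highest score
```

Page D receives no inlinks, so its score settles at the minimum value 1-d = 0.15, exactly as the damping-factor discussion below predicts.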
Damping Factor:
From this we can see that the click-through probability is a good way to prevent spam pages and pages without outgoing links from absorbing the PageRank of others. If we use the value 1, the endless link-clicking process will end up in spam sites; if 0, we have a random restart at every step and a uniform distribution. In short, whatever the number of inlinks, there is a probability (1 - damping factor) that guarantees every page a minimum PageRank. A value between 0.85 and 0.9 therefore balances accurate calculation against convergence speed and keeps the PageRank from growing without bound.
Fig 2 Graphical Demonstration of a five-page web
Usually, the most important pages will have more inlinks, and inlinks from important pages have more effect on a page's PageRank than inlinks from marginal pages. The calculation of PageRank is recursive. The method for evaluating PageRank starts by representing the directed web graph as a square matrix of dimension n × n, called the adjacency matrix A, where n is the number of webpages. If webpage i has l_i ≥ 1 links to other webpages and links to webpage j, then A_ij = 1/l_i; otherwise A_ij = 0.
A = [ 0    0.5  0    0.5  0   ]
    [ 0    0    1    0    0   ]
    [ 1    0    0    0    0   ]
    [ 0    0    0    0    1   ]
    [ 0    0    0    0    0   ]
The number of links equals the number of non-zero elements in the adjacency (hyperlink) matrix.
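As a sketch, the matrix above can be built in Python from the outlink lists of the five-page graph (the 0-indexed outlinks dictionary below mirrors the matrix, where page 4 is the dangling node):

```python
# Building the hyperlink matrix A for the five-page example:
# A[i][j] = 1/l_i if page i (with l_i outlinks) links to page j, else 0.
outlinks = {0: [1, 3], 1: [2], 2: [0], 3: [4], 4: []}  # page 4 is dangling
n = 5
A = [[0.0] * n for _ in range(n)]
for i, targets in outlinks.items():
    for j in targets:
        A[i][j] = 1.0 / len(targets)   # each outlink gets an equal share

for row in A:
    print(row)
# first row: [0.0, 0.5, 0.0, 0.5, 0.0] — page 0 splits its weight between pages 1 and 3
```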
Dangling Nodes:
Web pages with no outlinks are called dangling nodes; the remaining web pages, having at least one outlink, are non-dangling nodes. To execute PageRank we must decide how to deal with dangling nodes, and this decision influences the PageRank we compute. For the graph above, a dangling node can be handled by replacing its row of the hyperlink matrix A with a probability distribution vector w: a stochastic n-dimensional row vector whose entries sum to 1, where n is the number of web pages (nodes). The resulting matrix has the form
P = A + dw,
where d is the dangling-node indicator, an n-dimensional column vector with
d_i = 1 if l_i = 0, and d_i = 0 otherwise.    (1)
The most common choice for w is the uniform row vector:
w = [1/n  1/n  ...  1/n] = [1/5  1/5  1/5  1/5  1/5]
This new matrix P contains no zero row, meaning every node has at least one outlink. On this ground, the matrix P for the 5-page directed graph above can be written as:
P = A + dw = [ 0    0.5  0    0.5  0   ]
             [ 0    0    1    0    0   ]
             [ 1    0    0    0    0   ]
             [ 0    0    0    0    1   ]
             [ 1/5  1/5  1/5  1/5  1/5 ]
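A sketch of this dangling-node patch in Python, using the entries of the five-page example above: every all-zero row of the hyperlink matrix is replaced by the uniform vector w = [1/n, ..., 1/n].

```python
# The dangling-node fix P = A + dw: replace each all-zero row of the
# hyperlink matrix with the uniform distribution w = [1/n, ..., 1/n].
n = 5
A = [
    [0, 0.5, 0, 0.5, 0],
    [0, 0,   1, 0,   0],
    [1, 0,   0, 0,   0],
    [0, 0,   0, 0,   1],
    [0, 0,   0, 0,   0],   # dangling node: no outlinks
]
w = [1.0 / n] * n
P = [w[:] if not any(row) else row[:] for row in A]
print(P[4])  # the dangling row becomes [0.2, 0.2, 0.2, 0.2, 0.2]
```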
CHAPTER 3
GENERATING PAGE RANK
Suppose that page P_j has l_j links. If one of those links is to page P_i, then P_j will pass on 1/l_j of its importance to P_i. The importance ranking of P_i is then the sum of all the contributions made by pages linking to it. That is, if we denote the set of pages linking to P_i by B_i, then
I(P_i) = Σ_{P_j ∈ B_i} I(P_j)/l_j    (3)
This leads to the matrix A with entries
A_ij = 1/l_j if P_j ∈ B_i, and A_ij = 0 otherwise.    (4)
Notice that A has some special properties. First, its entries are all nonnegative. Also, the sum of the entries in a column is one unless the page corresponding to that column has no links. Matrices in which all the entries are nonnegative and the sum of the entries in every column is one are called stochastic. We will also form a vector I = [I(P_1), ..., I(P_n)] whose components are the PageRanks, that is, the importance rankings, of all the pages. The condition above defining the PageRank may be expressed as
I = AI    (5)
In other words, the vector I is an eigenvector of the matrix A with eigenvalue 1. It is called the stationary vector of A.
Consider the following graph with different nodes and links connecting them:
Figure 3: a directed web graph on the nodes A, B, C, D, and E.
The matrix A for the directed graph, with the entry in row i and column j describing the link from page i to page j, is given as below:
A = [ 0    1/3  1/3  0    1/3 ]
    [ 1/3  0    0    1/3  1/3 ]
    [ 1/2  0    0    0    1/2 ]
    [ 0    1/3  1/3  0    1/3 ]
    [ 0    0    0    0    0   ]
The table below shows the number of inlinks for each node.
Node               A  B  C  D  E
Number of inlinks  2  2  2  1  4
To illustrate the calculation of the PageRank of Fig 3, we use the following steps:
Step 1: Take 0.85 * a page's PageRank, and divide it by the number of outlinks on the page.
Step 2: Add that amount onto a new total for each page it's being passed to.
Step 3: Add 0.15 to each of those totals.
As we start at zero, we have 0.85*0 = 0, so each page begins with 0.15, since 0.15 + 0 = 0.15. But we still have to account for the importance passed along links. So now the calculations become:
Page A links to pages B, C and E. Page A's PageRank is 0.15, so it will add 0.85 * 0.15 = 0.1275 to the new PageRank scores of the pages it links to. There are three of them, so each gets 0.0425.
Page B links to pages A, D and E. Page B's PageRank is 0.15, so it will also add 0.85 * 0.15 = 0.1275 to the new PageRank scores of the pages it links to; since it links to pages A, D and E, each gets 0.0425.
Page C links to pages A and E, so each gets 0.06375. Page D links to pages B, C and E, so each gets 0.0425. Page E links to none.
As a result:
Page C: 0.15 (base) + 0.0425 (from Page A) + 0.0425 (from Page D) = 0.235
Page D: 0.15 (base) + 0.0425 (from Page B) = 0.1925
Page E: 0.15 (base) + 0.0425 (from A) + 0.0425 (from B) + 0.06375 (from C) + 0.0425 (from D) = 0.34125
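The three steps above can be sketched in Python. The outlink lists below are chosen so that one pass reproduces the totals computed in the text; they are a reading of the example, not part of the original:

```python
# One pass of the three-step calculation above, starting every page at 0.15.
# Assumed outlink lists: A->B,C,E; B->A,D,E; C->A,E; D->B,C,E; E->none.
outlinks = {"A": "BCE", "B": "ADE", "C": "AE", "D": "BCE", "E": ""}
pr = {p: 0.15 for p in outlinks}

new_pr = {p: 0.15 for p in outlinks}             # Step 3: the 0.15 base
for page, targets in outlinks.items():
    if targets:
        share = 0.85 * pr[page] / len(targets)   # Step 1: split 0.85 * PR over outlinks
        for t in targets:
            new_pr[t] += share                   # Step 2: pass it on
print(new_pr)  # Page C ends near 0.235, Page D near 0.1925, Page E near 0.34125
```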
Computing I:
There are different ways of calculating eigenvectors, but the challenge here is that the hyperlink matrix H is a 45.3 billion × 45.3 billion matrix! Studies show that a web page has on average 10 outgoing links, meaning almost all but 10 entries in each column are 0.
Let us consider the power method for calculating the eigenvector. In this method, we begin by choosing a vector I^(0) (commonly the uniform vector) as a candidate for I and then produce a sequence of vectors I^(K) such that
I^(K+1) = A I^(K)
There are issues regarding the convergence of the sequence of vectors (I^(n)): the matrix under consideration must satisfy certain conditions.
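A minimal sketch of the power method in Python; the two-page column-stochastic matrix used to exercise it is an illustrative assumption:

```python
# The power method: repeatedly apply the column-stochastic matrix A
# to a starting vector until successive iterates stop changing.
def power_method(A, n_iter=100, tol=1e-10):
    n = len(A)
    I = [1.0 / n] * n                    # start from the uniform vector
    for _ in range(n_iter):
        new_I = [sum(A[i][j] * I[j] for j in range(n)) for i in range(n)]
        if max(abs(a - b) for a, b in zip(new_I, I)) < tol:
            return new_I                 # converged
        I = new_I
    return I

# Illustrative two-page matrix: page 1 sends all its weight to page 2,
# while page 2 splits its weight between both pages.
A = [[0.0, 0.5],
     [1.0, 0.5]]
I = power_method(A)
print(I)  # converges to about [1/3, 2/3]
```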
Figure 4: a two-page web in which page 1 links to page 2 and page 2 has no outlinks. Its matrix is
A = [ 0  0 ]
    [ 1  0 ]
Applying the algorithm defined above, the measure of importance of both pages converges to zero, indicating nothing about the relative importance of these pages. The problem arises because page 2 has no links going out. Consequently, page 2 takes some of the importance from page 1 in each iterative step but does not pass it on to any other page, draining all the importance from the web. Pages with no outgoing links are called dangling nodes, and there are, of course, many of them in the real web. We'll now modify A.
Probabilistic interpretation of A:
Assume that we are on a particular web page and we randomly follow one of its links to another page; i.e., if we are on page P_j with l_j links, one of which takes us to page P_i, the probability that we next end up on page P_i is 1/l_j.
As we surf randomly, let T_j be the fraction of time that we spend on page P_j. Then the fraction of time that we spend on page P_i coming from its link on page P_j is T_j/l_j. If we end up on page P_i, we must have come from some page linking to it, which means
T_i = Σ_{P_j ∈ B_i} T_j/l_j
From the equation we defined for the PageRank rankings, we see that I(P_i) = T_i, which can be understood as: a web page's PageRank is the fraction of time a random surfer spends on that page. Notice that, given this interpretation, it is natural to require that the sum of the entries in the PageRank vector I be 1, since we are considering fractions of time spent on each page.
There is a problem with the above description: if we surf randomly, at some point we might end up at a dangling node. To overcome this, we pretend that a dangling node has a link to every page in the web. The hyperlink matrix A is then modified by replacing each column of zeroes (if any) with a column in which each entry is 1/n, where n is the total number of web pages. Let this matrix be denoted by S.
For the two-page web of Figure 4,
S = [ 0  1/2 ]        and        I = [ 1/3 ]
    [ 1  1/2 ]                       [ 2/3 ]
meaning P2 has twice the measure of importance of P1, which seems reasonable now.
Note: S is also a column-stochastic matrix.
Let B be a matrix (of the same size as A) whose entries are all zero except for the columns corresponding to the dangling nodes, in which each entry is 1/n; then S = A + B.
Fig 5: a three-page web in which page 1 links to page 3, page 3 links to page 2, and page 2 links to page 1.
Here
S = [ 0  1  0 ]
    [ 0  0  1 ]
    [ 1  0  0 ]
and letting I^(0) = [1, 0, 0]' and using the power method, we see that
I^(1) = [0, 0, 1]'   I^(2) = [0, 1, 0]'   I^(3) = [1, 0, 0]'   I^(4) = [0, 0, 1]'   …
The iterates cycle forever. In this case the power method fails because S has other eigenvalues of the same magnitude as 1, so the sequence never converges.
Fig 6: a five-page web in which pages 3, 4, and 5 form a sub-web with links coming in but none going out.
Here
S = [ 0    1/2  0    0    0   ]
    [ 1/2  0    0    0    0   ]
    [ 0    1/2  0    1/2  1/2 ]
    [ 0    0    1/2  0    1/2 ]
    [ 1/2  0    1/2  1/2  0   ]
and letting I^(0) = [1, 0, 0, 0, 0]', the power method gives
I^(1) = [0, 0.5, 0, 0, 0.5]'
I^(2) = [0.25, 0, 0.5, 0.25, 0]'
I^(3) = [0, 0.125, 0.125, 0.25, 0.5]'
…
I^(13) = [0, 0.0001, 0.3325, 0.3332, 0.3341]'
I^(14) = [0, 0, 0.3337, 0.3333, 0.3328]'
I^(15) = [0, 0, 0.3331, 0.3333, 0.3335]'
Hence I = [0, 0, 0.333, 0.333, 0.333]', where the PageRanks assigned to page 1 and page 2 are 0, which is unsatisfactory, as page 1 and page 2 have links coming in and going out of them.
The problem here is that this web contains a smaller web inside it, i.e., pages 3, 4, 5 form a web of their own. Links come into this sub-web formed by pages 3, 4, 5, but none go out. Just as in the example of the dangling node, these pages form an "importance sink" that drains the importance out of the other two pages. In mathematical terms, the power method does not work here because S is not irreducible.
CHAPTER 4
GOOGLE MATRIX
Google matrix:
We will modify S to get a matrix which is irreducible and has 1 as a simple eigenvalue. As it stands now, our movement while surfing randomly is determined by S: either we follow one of the links on the current page or, if we are at a page with no links, we randomly choose any other page to move to. To make our modification, we first choose a parameter α with 0 < α < 1.
Now suppose we move in a slightly different way: with probability α we are guided by S, and with probability 1 − α we choose the next page at random. We obtain the Google matrix
G = αS + (1 − α)(1/n)J
where J is the matrix all of whose entries are 1.
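As a sketch, the Google matrix can be formed and iterated in Python. The matrix S below is the five-page example with the sink among pages 3-5 (as read from Fig 6), and α = 0.85 is the usual choice:

```python
# Forming G = alpha*S + (1-alpha)*(1/n)*J and finding its stationary vector
# by power iteration. S is the five-page matrix whose sink (pages 3,4,5)
# defeated the plain power method.
alpha = 0.85
S = [
    [0,   0.5, 0,   0,   0  ],
    [0.5, 0,   0,   0,   0  ],
    [0,   0.5, 0,   0.5, 0.5],
    [0,   0,   0.5, 0,   0.5],
    [0.5, 0,   0.5, 0.5, 0  ],
]
n = len(S)
# (1-alpha)/n is the contribution of the all-ones matrix J to every entry
G = [[alpha * S[i][j] + (1 - alpha) / n for j in range(n)] for i in range(n)]

I = [1.0 / n] * n
for _ in range(100):
    I = [sum(G[i][j] * I[j] for j in range(n)) for i in range(n)]
print([round(x, 3) for x in I])
# pages 3-5 still rank highest, but pages 1 and 2 now receive nonzero rank
```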
Parameter α:
The rate of convergence of the power method is governed by the ratio |λ2|/|λ1|, where λ1 is the eigenvalue with maximum magnitude and λ2 is the eigenvalue closest in magnitude to λ1. Hence the power method converges slowly if |λ2| is close to |λ1|.
CHAPTER 5
Regardless of being good pages or spam pages, the number of web sites keeps growing. For this reason, memory is a real issue in PageRank: the matrix can exceed the capacity of main memory, so compressing the data may be the solution. In such cases a modified version of PageRank is used, or else I/O-efficient computations are implemented. Generally, the time complexity of an algorithm is measured by the number of data-processing steps, such as additions and multiplications. But when the data are huge, larger even than main memory, the computational problem becomes more complex; it is then the number of disk accesses, rather than the running time, that matters. Cached data is much faster to access than data in main memory, so the algorithms should be cache-friendly. There is also the alternative of compressing the data to fit in main memory, but that too requires a modification of the PageRank algorithm. As the PageRank vector has to be consulted for each query to be processed, speeding up the process requires the help of cache memory. The best-known technique is the gap method. The idea here is that a page tends to have inlinks from pages labelled close to it; for example, a page labelled 50 will most probably have inlinks from pages labelled 49 and 51. This is the locality principle.
Another idea is reference encoding, which compares the similar adjacency lists of two pages. The first vector contains a 1 in the i-th position if the corresponding adjacency list entry i is shared by x and y; the second vector lists all entries in the adjacency list of y that are not found in the adjacency list of its reference x. This is not widely used, as it raises the problem of finding which page should serve as the reference page for another. Since the PageRank vector itself is large and completely dense, covering over 8 billion pages, and must be consulted to process each user query, a technique to compress the PageRank vector has also been suggested. This encoding of the PageRank vector aims to keep the ranking information cached in main memory, thus speeding query processing.
Advances:
Google Panda is a change to Google's search-results ranking algorithm that was first released on February 23, 2011. The change aimed to lower the rank of low-quality or thin sites and return higher-quality sites near the top of the search results. CNET reported a surge in the rankings of news websites and social networking sites, and a drop in rankings for sites containing large amounts of advertising. This change reportedly affected the rankings of almost 12 percent of all search results. Soon after the Panda rollout, many websites, including Google's webmaster forum, became filled with complaints of scrapers/copyright infringers getting better rankings than sites with original content. At one point, Google publicly asked for data points to help detect scrapers better. Google's Panda has received several updates since the original rollout in February 2011, and the effect went global in April 2011. To help affected publishers, Google published an advisory on its blog, giving some direction for self-evaluation of a website's quality. Google has provided a list of 23 bullet points on its blog answering the question "What counts as a high-quality site?" that is supposed to help webmasters step into Google's mind-set.
Google Panda was built through an algorithm update that used artificial
intelligence in a more sophisticated and scalable way than previously possible. Human
quality testers rated thousands of websites based on measures of quality, including design,
trustworthiness, speed and whether or not they would return to the website. Google’s new
Panda machine-learning algorithm, made possible by and named after engineer Navneet
Panda, was then used to look for similarities between websites people found to be high
quality and low quality.
Google Penguin is the code name for a Google algorithm update that was first announced on April 24, 2012, the day it went live. The update is aimed at decreasing the search-engine rankings of websites that violate Google's Webmaster Guidelines by using black-hat SEO techniques, such as keyword stuffing, cloaking, participating in link schemes, and deliberate creation of duplicate content.
In January 2012, the so-called page layout algorithm update was released, which targeted websites with little content above the fold. The strategic goal that Panda, Penguin, and the page layout update share is to display higher-quality websites at the top of Google's search results. However, the sites that were down-ranked as the result of these updates have different sets of characteristics.
CHAPTER 6
PYTHON
Python:
Python is a widely used general-purpose, high-level programming language. It was initially designed by Guido van Rossum in 1991 and is developed by the Python Software Foundation. It was designed with an emphasis on code readability, and its syntax allows programmers to express concepts in fewer lines of code. Python is a programming language that lets you work quickly and integrate systems more efficiently.
                     Java                                   Python
Compilation          Compiled programming language          Interpreted programming language
Code length          Longer code as compared to Python      3-5 times shorter than equivalent Java programs
Syntax               Blocks delimited by curly braces;      No semicolons or curly braces;
                     statements end with a semicolon        indentation defines blocks
Typing               Strongly typed: the exact datatype     Dynamically typed: no need to declare
                     of variables must be declared          the exact datatype of variables
Execution speed      Much faster than Python                Expected to run slower than Java programs
Multiple             Partially supported, through           Both single and multiple
inheritance          interfaces                             inheritance supported
Uses of Python:
Python is easy to use, powerful, and versatile, making it a great choice for beginners
and experts alike. Python’s readability makes it a great first programming language — it
allows us to think like a programmer and not waste time with confusing syntax. For instance,
look at the following code to print “hello world” in Java and Python.
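The comparison the text refers to looks like this (a sketch; the Java version is shown as a comment for contrast):

```python
# Python: one line of real code, no boilerplate.
message = "hello world"
print(message)

# The equivalent Java program needs a full class and a main method:
# public class Hello {
#     public static void main(String[] args) {
#         System.out.println("hello world");
#     }
# }
```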
Instead of focusing on how to get our code to even run, we'll be able to focus on learning actual programming concepts. And once we have those tools under our belt, if we move on to other languages, we'll be able to easily understand a given piece of code.
But don't think that because Python is easy to use, it's a wimpy language. Python is incredibly powerful: there's a reason companies like Google, Dropbox, Spotify, and Netflix use it.