
PageRank explained in simple terms!

Algorithm | Big Data | Business Analytics | Graphs & Networks | Intermediate

In my previous article, we talked about information retrieval and about how a machine can read
the context from free text. Now let’s talk about the biggest web information retrieval engine: Google!
Imagine you had to create Google search in a world devoid of any search engine. What basic
rules would you code to build such a search engine? If your answer is to use a Term Frequency or TF-IDF kind
of framework, consider the following case:

A user enters the query “Harvard Business School” and expects the first link to be
“http://www.harvard.edu/”. But what would your algorithm do? It would try to find the pages that contain the
word “Harvard” the maximum number of times, since “Business” and “School” will come out as common words.
Now, it is quite possible that the keyword “Harvard” is not repeated many times on Harvard’s own
website, while websites of business school consultants or articles about business schools repeat
it often. As a result, these websites achieve a much higher rank than the actual
business school website.
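
To make the problem concrete, here is a minimal sketch of a pure term-count ranking. The two page texts and the names "harvard.edu" and "consultant-blog" are made up for illustration; they are not taken from any real website.

```python
from collections import Counter

# Hypothetical page texts, invented purely for illustration.
pages = {
    "harvard.edu": "welcome to harvard university admissions research and campus life",
    "consultant-blog": "harvard mba tips how to get into harvard why harvard essays matter",
}

def term_frequency_score(text, query_terms):
    """Count how many times the query terms occur in the text."""
    counts = Counter(text.lower().split())
    return sum(counts[term] for term in query_terms)

# "business" and "school" are treated as common words, so only "harvard" counts.
query = ["harvard"]
scores = {name: term_frequency_score(text, query) for name, text in pages.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

With these toy texts the consultant page outranks the university page simply because it repeats the keyword more often.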

But do search engines like Google face this challenge today? Obviously not! This is because they rely
on an algorithm known as PageRank. In this article, we will discuss the concept of PageRank. In the next
article, we will take this algorithm a step forward by leveraging it to find the most important packages in R.

An artificial web world

Imagine a web which has only 4 web pages, which are linked to each other. Each of the boxes below
represents a web page. The words written in black and italics are the links between pages.
For instance, the web page “Tavish” has 3 outgoing links, one to each of the other three web pages. Now, let’s
draw a simpler directed graph of this ecosystem.

Here is the basic intuition behind how Google ranks a page: the page with the maximum number of incoming
links is the most important page. In the current example, we see that the “Kunal Jain” page comes out as the
most significant page.
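
A minimal sketch of this counting rule is shown here. The link structure is an assumption: only Tavish's three outgoing links come from the text above, while the remaining links and the placeholder names "Page C" and "Page D" are hypothetical, chosen so that the outcome matches the one described.

```python
from collections import Counter

# Hypothetical link structure: only Tavish's three outgoing links come from the
# article; the other links and the names "Page C" / "Page D" are assumptions.
links = {
    "Tavish": ["Kunal Jain", "Page C", "Page D"],
    "Kunal Jain": ["Tavish", "Page C"],
    "Page C": ["Kunal Jain"],
    "Page D": ["Kunal Jain"],
}

# Count how many links point to each page.
incoming = Counter(target for targets in links.values() for target in targets)
most_linked, count = incoming.most_common(1)[0]
print(most_linked, count)
```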

 
Mathematical Formulation of Google Page Rank

The first step of the formulation is to build a direction matrix. Each cell of this matrix holds the proportion
of a page’s outflow that goes to another page. For instance, Tavish (TS) has 3 outgoing links, which makes each proportion 1/3.
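
One possible way to build such a matrix is sketched here, under the assumption that each column belongs to the page the links leave from and each row to the page they arrive at. Apart from Tavish's three outgoing links, the link structure and the names "Page C" and "Page D" are hypothetical.

```python
from fractions import Fraction

# Hypothetical link structure: only Tavish's three outgoing links come from the
# article; everything else here is an assumption for illustration.
links = {
    "Tavish": ["Kunal Jain", "Page C", "Page D"],
    "Kunal Jain": ["Tavish", "Page C"],
    "Page C": ["Kunal Jain"],
    "Page D": ["Kunal Jain"],
}
pages = list(links)

def proportion_matrix(links, pages):
    """Return A where A[i][j] is the share of page j's outgoing links that point to page i."""
    index = {page: i for i, page in enumerate(pages)}
    n = len(pages)
    A = [[Fraction(0)] * n for _ in range(n)]
    for source, targets in links.items():
        for target in targets:
            A[index[target]][index[source]] = Fraction(1, len(targets))
    return A

A = proportion_matrix(links, pages)
for page, row in zip(pages, A):
    print(page.ljust(10), [str(value) for value in row])
```

Every column adds up to 1 because a page's outflow is split completely among the pages it links to.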

Now imagine a bot that keeps following the outgoing links. What share of its total time
will this bot spend on each of these pages? This can be expressed mathematically as the following equation:

A * X = X

Here, A is the proportions matrix mentioned above.

X is the column of probabilities of the bot being on each of these pages.

Solving this equation, we see that Kunal Jain’s page comes out to be the most important in this universe,
which matches our intuition.
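
One common way to solve A * X = X is to start from equal probabilities and apply A repeatedly (power iteration). The sketch below assumes a hypothetical 4-page matrix: only Tavish's column (1/3 to each other page) comes from the article, and the other columns and the names "Page C" and "Page D" are made up so that the example can run.

```python
# Column j holds the outgoing proportions of page j (hypothetical values,
# except for Tavish's column, which follows the article).
pages = ["Tavish", "Kunal Jain", "Page C", "Page D"]
A = [
    [0,   1/2, 0, 0],  # flow into Tavish
    [1/3, 0,   1, 1],  # flow into Kunal Jain
    [1/3, 1/2, 0, 0],  # flow into Page C
    [1/3, 0,   0, 0],  # flow into Page D
]

def power_iteration(A, steps=200):
    """Start from equal probabilities and apply X <- A * X a fixed number of times."""
    n = len(A)
    x = [1.0 / n] * n
    for _ in range(steps):
        x = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
    return x

X = power_iteration(A)
for page, probability in zip(pages, X):
    print(page.ljust(10), round(probability, 4))
```

For this assumed matrix the exact solution is 3/14, 6/14, 4/14 and 1/14, so the bot spends the largest share of its time on the Kunal Jain page.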

Teleportation adjustments

Now, imagine a scenario where we have only 2 web pages: A and B. A has a link to B, but B has no outgoing
links. In such a case, if you try solving the equation, you will get all zeros for X. This looks unreasonable, as B
looks more important than A, yet our algorithm gives the same importance to both. To solve
this problem, a new concept of teleportation was introduced. We add a constant probability, alpha,
of jumping to each of these pages. This compensates for instances where a user teleports from one web page to
another without following any link. Hence, the equation is modified to the following:

(1-alpha) * A * X + alpha * b = X

Here, b is a constant unit column matrix and alpha is the proportion of teleportation. The most common value
taken for alpha is 0.15, although it can vary from case to case.
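
Here is a minimal sketch of the modified equation on the two-page example. It assumes that b spreads the teleportation evenly (every entry is 1/n) and that the result is rescaled to sum to 1 at the end; both choices are assumptions made for illustration. The matrix is called link_matrix here to avoid a clash with the page named A.

```python
ALPHA = 0.15  # teleportation proportion, the common value quoted above

pages = ["A", "B"]
# link_matrix[i][j] is the share of page j's outgoing links that point to page i.
# A links only to B; B has no outgoing links, so its column is all zeros.
link_matrix = [
    [0.0, 0.0],
    [1.0, 0.0],
]
# Assumption: b spreads the teleportation evenly, so every entry is 1/n.
b = [1.0 / len(pages)] * len(pages)

def solve_with_teleportation(matrix, b, alpha, steps=100):
    """Apply X <- (1 - alpha) * A * X + alpha * b a fixed number of times."""
    n = len(matrix)
    x = list(b)
    for _ in range(steps):
        x = [
            (1 - alpha) * sum(matrix[i][j] * x[j] for j in range(n)) + alpha * b[i]
            for i in range(n)
        ]
    return x

raw = solve_with_teleportation(link_matrix, b, ALPHA)
total = sum(raw)  # less than 1 because page B passes nothing on
scores = {page: value / total for page, value in zip(pages, raw)}
print(scores)
```

With teleportation the solution is no longer all zeros, and B ends up with a higher score than A, which is what we expected.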
 

Other uses of PageRank & End Notes

In this article, we discussed the most significant use of PageRank. But the use of PageRank is in no way
restricted to search engines. Here are a few other uses of PageRank:

1. Finding how well connected a person is on social media: One of the unexplored territories in social
media analytics is network information. Using this network information, we can estimate how
influential a user is and therefore prioritize our efforts to please the most influential customers.
Networks can be easily analyzed using the PageRank algorithm.
2. Fraud detection in the pharmaceutical industry: Many countries, including the US, struggle with a
high percentage of medical fraud. Such fraud can be spotted using the PageRank algorithm.
3. Understanding the importance of packages in any programming language: The PageRank algorithm can also
be used to understand the layers of packages used in languages like R and Python. We will take up this
topic in our next article.

Thinkpot: Can you think of more uses of the PageRank algorithm? Share with us useful links on leveraging
the PageRank algorithm in various fields.



Article Url - https://www.analyticsvidhya.com/blog/2015/04/pagerank-explained-simple/

Tavish Srivastava
Tavish Srivastava, co-founder and Chief Strategy Officer of Analytics Vidhya, is an IIT Madras graduate
and a passionate data-science professional with 8+ years of diverse experience in markets including
the US, India and Singapore, domains including Digital Acquisitions, Customer Servicing and Customer
Management, and industries including Retail Banking, Credit Cards and Insurance. He is fascinated by the
idea of artificial intelligence inspired by human intelligence and enjoys every discussion, theory or even
movie related to this idea.
