
BigData Analytics

Module 6: Big Data Analytics Applications

Faculty Name : Ms. Varsha Sanap


Dr. Vivek kumar Singh
Lecture 42

Link Analysis
Why Page Rank?

• Within a few years of the birth of the web in 1990, there were over a
dozen search engines that users could use to search for information.
• Shortly after it was introduced in 1995, AltaVista became the most popular
among them.
• These search engines would categorize web pages according to the
topics that the pages themselves specified.
• But the problem with these early search engines was that unethical web
page writers used deceptive techniques to attract traffic to their pages.
• For example, a local rug-cleaning service might list “pizza” as a topic in
their web page header, just to attract people looking to order a pizza for
dinner.
• These and other tricks rendered early search engines nearly useless.

3 Link Analysis
Contd…

• To overcome the problem, various page ranking systems were attempted.
• The objective was to rank a page based upon its popularity among users
who really did want to view its contents.
• One way to estimate that is to count how many other pages have a link to
that page.
• For example, there might be 100,000 links to 
https://en.wikipedia.org/wiki/Renaissance,
but only 100 to https://en.wikipedia.org/wiki/Ernest_Renan,
so the former would be given a much higher rank than the latter.
• But simply counting the links to a page will not work either. For example,
the rug-cleaning service could simply create 100 bogus web pages, each
containing a link to the page they want users to view.

PageRank

• In 1996, Larry Page and Sergey Brin, while students at Stanford University, invented their PageRank algorithm.
• It simulates the web itself, represented by a very large directed graph, in
which each web page is represented by a node in the graph, and each
page link is represented by a directed edge in the graph.
• The WWW hyperlink structure forms a huge directed graph where
– the nodes represent web pages
– the directed edges are the hyperlinks

[Figure: example directed graph with four nodes, PAGE 1, PAGE 2, PAGE 3 and PAGE 4]
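
As a minimal sketch, such a graph can be stored as an adjacency mapping from each page to the pages it links to. The four page names follow the figure, but the specific links below are assumed purely for illustration, and `web_graph` and `edges` are hypothetical names.

```python
# Hypothetical link structure: these edges are assumptions, not taken from the slide.
web_graph = {
    "PAGE 1": ["PAGE 2", "PAGE 3"],
    "PAGE 2": ["PAGE 3"],
    "PAGE 3": ["PAGE 4"],
    "PAGE 4": ["PAGE 1"],
}

def edges(graph):
    """Return every hyperlink as a (source, target) pair, i.e. a directed edge."""
    return [(src, dst) for src, targets in graph.items() for dst in targets]

for src, dst in edges(web_graph):
    print(f"{src} -> {dst}")
```

Each key is a node and each list entry is one directed edge, so the number of outbound links of a page is simply the length of its list.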

PageRank Algorithm

• What is the basic concept behind Google’s PageRank?

“PageRank works by counting the number and quality of links to a
page to determine a rough estimate of how important the website is.
The underlying assumption is that more important websites are
likely to receive more links from other websites.”

PageRank Algorithm: Graph Representation of WWW

• Inbound link: a link into the given page from another page.
• Outbound link: a link from the given page to another page, on the same site or on another site.
• Dangling link: a link that points to a page with no outgoing links.
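
The three kinds of link can be read straight off an adjacency mapping. The sketch below assumes a hypothetical four-page web in which page "D" has no outgoing links; the graph and the function names are illustrative, not part of the lecture.

```python
# Hypothetical four-page web; page "D" has no outgoing links.
links = {
    "A": ["B", "C"],
    "B": ["C", "D"],
    "C": ["A"],
    "D": [],
}

def inbound_links(graph, page):
    """Pages that link to `page`."""
    return [src for src, targets in graph.items() if page in targets]

def outbound_links(graph, page):
    """Pages that `page` links to."""
    return list(graph[page])

def dangling_links(graph):
    """Links (source, target) whose target page has no outgoing links."""
    return [(src, dst) for src, targets in graph.items()
            for dst in targets if not graph.get(dst)]

print(inbound_links(links, "C"))   # ['A', 'B']
print(outbound_links(links, "A"))  # ['B', 'C']
print(dangling_links(links))       # [('B', 'D')]
```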

Crawling the Web with BFS

• First, we have to know the topology of the WWW.
• This is why web crawlers were developed.
• There are several methods to traverse a graph:
– Breadth First Search: this is the method usually used for web crawling
– Depth First Search: has lots of applications, but is not that useful in this case
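
The breadth-first traversal mentioned above can be sketched on an in-memory graph. This is only an illustration: the link structure is assumed, the start node "A" is hypothetical, and a real crawler would obtain each page's neighbours by fetching the page and extracting its hyperlinks.

```python
from collections import deque

# Hypothetical link structure, assumed for illustration only.
links = {
    "A": ["B", "C"],
    "B": ["D", "F"],
    "C": ["D"],
    "D": ["H"],
    "F": ["G"],
    "G": [],
    "H": [],
}

def bfs_order(graph, start):
    """Return the pages reachable from `start` in breadth-first order."""
    visited = {start}
    queue = deque([start])
    order = []
    while queue:
        page = queue.popleft()          # take the oldest discovered page first
        order.append(page)
        for neighbour in graph.get(page, []):
            if neighbour not in visited:
                visited.add(neighbour)  # mark on discovery so it is queued once
                queue.append(neighbour)
    return order

print(bfs_order(links, "A"))
```

The queue is what makes the traversal breadth-first: all pages one link away from the start are visited before any page two links away.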

BFS

[Figure: step-by-step breadth-first traversal of an example graph, repeated over several slides; the node labels that survive are B, C, D, F, G and H]

Lecture 43

Link Analysis: PageRank Computation
PageRank Algorithm: Original Formula

• The original PageRank formula with summation:
• PR(A) = (1 - d) + d (PR(T1)/C(T1) + PR(T2)/C(T2) + ... + PR(Tn)/C(Tn))
• PR(A): the PageRank of page A. The formula is recursive, because PR(A) depends on the PageRank of other pages.
• PR(Ti): the PageRank of each page Ti that links to page A
• C(Ti): the number of outbound links on page Ti
• d: the damping factor, in the range 0 to 1
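
A single application of the formula can be written as a small function. The input values below are hypothetical, and d = 0.85 is an assumed choice; the slides only state that d lies between 0 and 1.

```python
def pagerank_of(inbound, d=0.85):
    """Apply PR(A) = (1 - d) + d * sum(PR(Ti) / C(Ti)).

    `inbound` is a list of (PR(Ti), C(Ti)) pairs, one per page Ti linking to A.
    """
    return (1 - d) + d * sum(pr / c for pr, c in inbound)

# Hypothetical inputs: two pages link to A.
# T1 has rank 0.5 and 2 outbound links; T2 has rank 0.25 and 1 outbound link.
pr_a = pagerank_of([(0.5, 2), (0.25, 1)], d=0.85)
print(round(pr_a, 4))  # 0.575
```

Here the sum is 0.5/2 + 0.25/1 = 0.5, so PR(A) = 0.15 + 0.85 × 0.5 = 0.575.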

• The page ranks must be initialized before the first iteration: every page is given the same rank 1/n, where n is the number of pages.
• Because the formula is recursive, these initial values are only a starting guess, so the update is repeated for several iterations until the ranks converge.
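
The initialize-then-iterate procedure can be sketched as follows. This is one possible implementation under stated assumptions: the three-page graph is hypothetical, d = 0.85, the tolerance and the iteration cap are arbitrary choices, and every page is assumed to have at least one outbound link.

```python
def pagerank(graph, d=0.85, tol=1e-9, max_iter=1000):
    """Iterate PR(A) = (1 - d) + d * sum(PR(Ti) / C(Ti)) until the ranks settle.

    Assumes every page has at least one outbound link (no dangling pages).
    """
    n = len(graph)
    ranks = {page: 1 / n for page in graph}          # equal initial rank 1/n
    for _ in range(max_iter):
        new_ranks = {}
        for page in graph:
            # Sum the contributions of every page that links to `page`.
            incoming = sum(ranks[src] / len(targets)
                           for src, targets in graph.items() if page in targets)
            new_ranks[page] = (1 - d) + d * incoming
        change = max(abs(new_ranks[p] - ranks[p]) for p in graph)
        ranks = new_ranks
        if change < tol:                             # converged
            break
    return ranks

# Hypothetical three-page web used only to exercise the function.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
for page, rank in sorted(ranks.items()):
    print(page, round(rank, 3))
```

With the (1 - d) form of the formula and no dangling pages, the ranks add up to n at convergence rather than to 1, whatever the starting values were.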

PageRank Algorithm

• The iterative formula:
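
One form of the iterative update that is consistent with the numbers in the worked example of this lecture is the simplified rule without a damping factor. This is an assumption inferred from those numbers; the iteration index $t$ and the set $B(P_i)$ of pages linking to $P_i$ are symbols introduced here for illustration.

```latex
PR_{t+1}(P_i) = \sum_{P_j \in B(P_i)} \frac{PR_t(P_j)}{C(P_j)},
\qquad PR_0(P_i) = \frac{1}{n}
```

As in the original formula, $C(P_j)$ is the number of outbound links on page $P_j$ and $n$ is the number of pages.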

25 Link Analysis
Example

[Figure: directed graph with four pages A, B, C and D]

Page   Iteration 0   Iteration 1   Iteration 2   PageRank
A      1/4           1/12          1.5/12        1
B      1/4           2.5/12        2/12          2
C      1/4           4.5/12        4.5/12        4
D      1/4           4/12          4/12          3
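
The values in this example can be reproduced with a few lines of code. The sketch assumes a link structure that is not stated in the text (A links to B and C, B to D, C to A, B and D, D to C) and an undamped update; both are assumptions, chosen because they yield the same numbers as the table.

```python
from fractions import Fraction

# Assumed link structure (not stated in the text): one graph whose
# undamped updates match the values of the example.
links = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["A", "B", "D"],
    "D": ["C"],
}

def step(graph, ranks):
    """One undamped update: each page shares its rank equally among its links."""
    new_ranks = {page: Fraction(0) for page in graph}
    for src, targets in graph.items():
        share = ranks[src] / len(targets)
        for dst in targets:
            new_ranks[dst] += share
    return new_ranks

ranks = {page: Fraction(1, len(links)) for page in links}   # iteration 0: 1/4 each
for iteration in (1, 2):
    ranks = step(links, ranks)
    # Express every value in twelfths to match the notation of the example.
    print(iteration, {page: f"{float(r * 12):g}/12" for page, r in ranks.items()})
```

Exact fractions are used so the results can be compared with the table without rounding. After two iterations C holds the largest value and A the smallest, which is the ordering given in the last column.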

Matrix Representation

• Instead of the iterative approach, in which the values are updated one by one, we can use matrix operations to carry out many calculations at the same time.

Example

[Figure: directed graph with four pages A, B, C and D]

Page   Iteration 0   Iteration 1   Iteration 2   PageRank
A      1/4           1/12          2/12          1
B      1/4           2.5/12        15/12         4
C      1/4           6/12          4.5/12        2
D      1/4           4/12          13.5/12       3

Matrix Representation

• We can conclude that we have to multiply the matrix with a vector on every iteration.
• What is the initial vector? It is the initial page rank assigned to every page.
• If we make several iterations, the vector again tends to the equilibrium value.
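
The matrix view can be sketched in plain Python. Everything specific here is an assumption made for illustration: the four-page link structure, the absence of a damping factor, and the choice of 100 iterations.

```python
# Assumed link structure, chosen only for illustration.
links = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["A", "B", "D"],
    "D": ["C"],
}
pages = sorted(links)
n = len(pages)

# M[i][j] is the share of page j's rank that flows to page i,
# i.e. 1 / C(j) if page j links to page i, otherwise 0.
M = [[(1 / len(links[src]) if dst in links[src] else 0.0) for src in pages]
     for dst in pages]

def multiply(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

vector = [1 / n] * n            # initial vector: equal rank for every page
for _ in range(100):            # repeated multiplication approaches equilibrium
    vector = multiply(M, vector)

print({page: round(rank, 3) for page, rank in zip(pages, vector)})
```

Each column of `M` sums to 1, so the total rank is preserved by every multiplication, and one product updates all pages at once instead of one by one.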

30 Link Analysis
Thank You
