3 Hadoop


•Fundamentals of Big Data Analytics

•Hadoop
Outline
• Background
• Components and Architecture
• Running Examples
• Word Count
• PageRank
• Hands-on in Lab

Slides Courtesy: Mining of Massive Datasets (http://www.mmds.org) + others
What is a Server?
◼ Servers are computers that provide “services” to “clients”

◼ They are typically designed for reliability and to service a large number
of requests

◼ Organizations typically require many physical servers to provide various


services (Web, Email, Database, etc.)

◼ Server hardware is becoming more powerful and compact

3
Compact Servers
◼ Organizations would like to conserve the amount of floor space
dedicated to their computer infrastructure

◼ For large-scale installations, compact servers are used


◼ This helps with
▪ Floor Space
▪ Manageability
▪ Scalability
▪ Power and Cooling

4
Racks
◼ Equipment (e.g., servers) is typically mounted in racks

5
What is a Data Center?
◼ A facility used to house computer systems and components,
such as networking and storage systems, cooling, UPS, air
filters

◼ A data center typically houses a large number of


heterogeneous networked computer systems

◼ A data center can occupy one room of a building, one or more


floors, or an entire building
6
Google now has more than 2.5 million servers

7
Microsoft now has more than one million servers

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 8


What is Hadoop?
What is Hadoop?
◼ A framework for distributed processing of large data sets across clusters of
commodity computers using simple programming models

◼ Can scale up from single servers to thousands of machines


▪ each offering local computation and storage

◼ Designed to detect and handle failures at the application layer


▪ delivering a highly-available service on top of a cluster of computers, each of
which may be prone to failures
Who uses Hadoop?

Amazon, Facebook, Google, Twitter, New York Times, Yahoo! …. many more
Motivation: Google Example
◼ 20+ billion web pages x 20KB = 400+ TB
◼ 1 computer reads 30-35 MB/sec from disk
▪ ~4 months to read the web
◼ ~1,000 hard drives to store the web
◼ Takes even more to do something useful
with the data!
◼ Today, a standard architecture for such problems is
emerging:
▪ Cluster of commodity Linux nodes
▪ Commodity network (ethernet) to connect them
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 12
MapReduce
◼ Challenges:
▪ How to distribute computation?
▪ Distributed/parallel programming is hard

◼ Map-reduce addresses all of the above


▪ Google’s computational/data manipulation model
▪ Elegant way to work with big data

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 13


Single Node Architecture

[Figure: a single node with CPU, memory, and disk — the setting for machine learning, statistics, and “classical” data mining]

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 14


Cluster Architecture
[Figure: racks of commodity nodes (each with CPU, memory, and disk) connected by a switch per rack; a 2-10 Gbps backbone links the racks, with ~1 Gbps between any pair of nodes in the same rack]

◼ Each rack contains 16-64 nodes
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 15


Large-scale Computing
◼ Large-scale computing for data mining
problems on commodity hardware
◼ Challenges:
▪ How do you distribute computation?
▪ How can we make it easy to write distributed programs?
▪ Machines fail:
▪ One server may stay up for 3 years (~1,000 days)
▪ If you have 1,000 servers, expect to lose one per day
▪ People estimated Google had ~2.5M machines in 2022
▪ At that scale, ~2,500 machines fail every day!

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 16


Idea and Solution
◼ Issue: Copying data over a network takes time
◼ Idea:
▪ Bring computation close to the data
▪ Store files multiple times for reliability
◼ Map-reduce addresses these problems
▪ Google’s computational/data manipulation model
▪ Elegant way to work with big data
▪ Storage Infrastructure – File system
▪ Google: GFS. Hadoop: HDFS
▪ Programming model
▪ Map-Reduce

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 17


Hadoop Ecosystem

18
Core Components of Hadoop
◼ The project includes these modules:
▪ Hadoop Common: The common utilities that support the other
Hadoop modules.
▪ Hadoop Distributed File System (HDFS™): A distributed file system
that provides high-throughput access to application data.
▪ Hadoop YARN: A framework for job scheduling and cluster resource
management.
▪ Hadoop MapReduce: A programming model for large scale data
processing.

19
Storage Infrastructure
◼ Problem:
▪ If nodes fail, how to store data persistently?
◼ Answer:
▪ Distributed File System:
▪ Provides global file namespace
▪ Google GFS; Hadoop HDFS;
◼ Typical usage pattern
▪ Huge files (100s of GB to TB)
▪ Data is rarely updated in place
▪ Reads and appends are common
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 20
Distributed File System
◼ Chunk servers
▪ File is split into contiguous chunks
▪ Typically each chunk is 64-128 MB (the default block size has grown over the years)
▪ Each chunk is replicated (usually 2x or 3x; set via dfs.replication in hdfs-site.xml)
▪ Try to keep replicas in different racks
◼ Master node
▪ a.k.a. Name Node in Hadoop’s HDFS
▪ Stores metadata about where files are stored
▪ Might be replicated
◼ Client library for file access (see the sketch after this slide)
▪ Talks to the master to find chunk servers
▪ Connects directly to chunk servers to access data
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 21
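A minimal sketch (not from the slides) of how a client program uses the HDFS client library: the Configuration picks up core-site.xml/hdfs-site.xml, the FileSystem handle talks to the NameNode for metadata, and the file data itself streams to/from the chunk (Data) nodes. The path and replication factor are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();     // reads core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);         // metadata requests go to the NameNode (master)
    Path p = new Path("/data/sample.txt");        // hypothetical path
    try (FSDataOutputStream out = fs.create(p)) { // the bytes flow directly to DataNodes
      out.writeUTF("hello hdfs");
    }
    fs.setReplication(p, (short) 3);              // per-file replication factor
  }
}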
22
Distributed File System
◼ Reliable distributed file system
◼ Data kept in “chunks” spread across machines
◼ Each chunk replicated on different machines
▪ Seamless recovery from disk or machine failure

[Figure: chunks C0, C1, C2, C3, C5 and D0, D1, … spread and replicated across chunk servers 1, 2, 3, …, N]
Bring computation directly to the data!

Chunk servers also serve as compute servers


J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 23
Programming Model: MapReduce

Warm-up task:
◼ We have a huge text document

◼ Count the number of times each


distinct word appears in the file

◼ Sample application:
▪ Analyze web server logs to find popular URLs

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 24


Task: Word Count
◼ Case 1:
▪ File too large for memory, but all <word, count> pairs fit in memory
◼ Case 2:
▪ Count occurrences of words: words(doc.txt) | sort | uniq -c
▪ where words takes a file and outputs the words in it, one per line
◼ Case 2 captures the essence of MapReduce
▪ The great thing is that it is naturally parallelizable

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 25


MapReduce: Overview
◼ Sequentially read a lot of data
◼ Map:
▪ Extract something you care about
◼ Group by key: Sort and Shuffle
◼ Reduce:
▪ Aggregate, summarize, filter or transform
◼ Write the result
Outline stays the same, Map and Reduce
change to fit the problem

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 26


MapReduce: The Map Step

[Figure: input key-value pairs (k, v) are fed through map calls, producing intermediate key-value pairs]

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 27


MapReduce: The Reduce Step

[Figure: intermediate key-value pairs are grouped by key into key-value groups (k, <v, v, …>); reduce is applied to each group, producing output key-value pairs]

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 28


More Specifically
◼ Input: a set of key-value pairs
◼ Programmer specifies two methods:
▪ Map(k, v) → <k’, v’>*
▪ Takes a key-value pair and outputs a set of key-value pairs
▪ E.g., key is the filename, value is a single line in the file
▪ There is one Map call for every (k,v) pair
▪ Reduce(k’, <v’>*) → <k’, v’’>*
▪ All values v’ with same key k’ are reduced together
and processed in v’ order
▪ There is one Reduce function call per unique key k’

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 29


MapReduce: Word Counting
MAP (provided by the programmer): read input and produce a set of key-value pairs
Group by key: collect all pairs with the same key
Reduce (provided by the programmer): collect all values belonging to the key and output

Only sequential reads of the big document are needed.

[Figure: the big document — “The crew of the space shuttle Endeavor recently returned to Earth as ambassadors, harbingers of a new era of space exploration. Scientists at NASA are saying that the recent assembly of the Dextre bot is the first step in a long-term space-based man/machine partnership. ‘The work we're doing now -- the robotics we're doing -- is what we're going to need …’” — flows through the three steps:
MAP emits (The, 1), (crew, 1), (of, 1), (the, 1), (space, 1), (shuttle, 1), (Endeavor, 1), (recently, 1), …
Group by key collects (crew, 1) (crew, 1); (space, 1); (the, 1) (the, 1) (the, 1); (shuttle, 1); (recently, 1); …
Reduce outputs (crew, 2), (space, 1), (the, 3), (shuttle, 1), (recently, 1), … as (key, value) pairs]
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 30
MapReduce: Word Counting - Again
[Slides 31-36: step-by-step figures walking through the same word-count example]

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 31-36


Word Count Using MapReduce - Pseudocode

map(key, value):
    // key: document name; value: text of the document
    for each word w in value:
        emit(w, 1)

reduce(key, values):
    // key: a word; values: an iterator over counts
    result = 0
    for each count v in values:
        result += v
    emit(key, result)

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 37


Map-Reduce: Environment
Map-Reduce environment takes care of:
◼ Partitioning the input data
◼ Scheduling the program’s execution across a
set of machines
◼ Performing the group by key step
◼ Handling machine failures
◼ Managing required inter-machine communication

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 38


Map-Reduce: A diagram
[Figure: a big document flows through three phases]
◼ MAP: read input and produce a set of key-value pairs
◼ Group by key: collect all pairs with the same key (hash merge, shuffle, sort, partition)
◼ Reduce: collect all values belonging to the key and output

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 39


Map-Reduce: In Parallel

All phases are distributed with many tasks doing the work
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 40
Data Flow
◼ Input and final output are stored on a distributed file system
(FS):
▪ Scheduler tries to schedule map tasks “close” to physical storage
location of input data

◼ Intermediate results are stored on the local FS of Map and Reduce workers

◼ Output is often input to another MapReduce task

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 41


Coordination: Master
◼ Master node takes care of coordination:
▪ Task status: (idle, in-progress, completed)
▪ Idle tasks get scheduled as workers become available
▪ When a map task completes, it sends the master the location and
sizes of its R intermediate files, one for each reducer
▪ Master pushes this info to reducers

◼ Master pings workers periodically to detect failures

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 42


How many Map and Reduce jobs?
◼ M map tasks, R reduce tasks
◼ Rule of thumb:
▪ Make M much larger than the number of nodes in the
cluster
▪ One DFS chunk per map is common
▪ Improves dynamic load balancing and speeds up
recovery from worker failures
◼ Usually R is smaller than M
▪ Because output is spread across R files

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 43


Refinement

◼ What is the problem / performance bottleneck in this model of execution?

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 44


Refinement: Combiners

◼ Sits between the map and the shuffle


▪ Do some of the reducing while you’re waiting for other stuff to
happen
▪ Avoid moving all of that data over the network
◼ Only applicable when
▪ order of reduce values doesn’t matter
▪ effect is cumulative

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 45


Refinement: Combiners

◼ Often a Map task will produce many pairs of the form (k,v1), (k,v2),
… for the same key k
▪ E.g., popular words in the word count example
◼ Can save network time by pre-aggregating values in
the mapper:
▪ combine(k, list(v1)) -> v2
▪ The combiner is usually the same as the reduce function (see the sketch after this slide)
◼ Works only if the reduce function is commutative and associative

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 46
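A small self-contained simulation (plain Java, not the Hadoop API) of what a combiner buys: each “mapper” pre-aggregates its own (word, 1) pairs before the shuffle, so far fewer pairs cross the network. The input strings are made up; in real Hadoop this is enabled with Job.setCombinerClass, as in the word-count driver shown later in these notes.

import java.util.HashMap;
import java.util.Map;

public class CombinerSketch {
  // "map + combine" for one input split: one (word, count) pair per distinct word
  static Map<String, Integer> mapAndCombine(String split) {
    Map<String, Integer> local = new HashMap<>();
    for (String w : split.toLowerCase().split("\\s+"))
      local.merge(w, 1, Integer::sum);          // combining happens on the map side
    return local;
  }

  public static void main(String[] args) {
    String[] splits = { "the crew of the space shuttle", "the space crew returned" };
    Map<String, Integer> reduced = new HashMap<>();
    int shuffled = 0;
    for (String s : splits) {
      Map<String, Integer> combined = mapAndCombine(s);
      shuffled += combined.size();              // pairs that would actually be shuffled
      combined.forEach((w, c) -> reduced.merge(w, c, Integer::sum));  // the reduce step
    }
    System.out.println(reduced + "  (pairs shuffled: " + shuffled + ")");
  }
}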


Refinement: Combiners

◼ Back to our word counting example:
▪ The combiner combines the values of all keys of a single mapper (single machine)
▪ Benefit: Much less data needs to be copied and shuffled!

[Slides 47-52: step-by-step figures illustrating map-side combining on the word-count example]
Java MapReduce
◼ We need three things: a map function, a reduce function, and some code to run the job.
◼ The map function is represented by the Mapper class, which declares an abstract map() method.
◼ The reduce function is similarly represented by the Reducer class, which declares an abstract reduce() method.
53
[Slides 54-56: the accompanying Java code, shown as images; a sketch follows below]
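The code on slides 54-56 is not recoverable from this transcript. Below is a minimal sketch that closely follows the standard Hadoop word-count example; class names and paths are illustrative.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: for each word in the input line, emit (word, 1)
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: sum all counts for the same word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) sum += val.get();
      result.set(sum);
      context.write(key, result);
    }
  }

  // Driver: configure and submit the job
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // the combiner discussed earlier
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}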
Computing Page Rank using Map Reduce

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 57


The Web as a Directed Graph

[Figure: Page A links to Page B via a hyperlink; the anchor text sits on Page A]

Assumption 1: A hyperlink between pages denotes


author perceived relevance (quality signal)

Assumption 2: The text in the anchor of the hyperlink


describes the target page (textual context)
58
Ranking Webpages by Popularity
◼ The popularity of a webpage is calculated independently of the page content
◼ Two well-known methods
▪ PageRank
▪ By Brin and Page (1998), and implemented in the Google search engine
▪ HITS (Hypertext Induced Topic Search)
▪ By Kleinberg (1998), adopted by search engine Teoma (ask.com)

59
PageRank
◼ PageRank is a numeric value that represents how important a
page is on the web.

◼ Webpage importance
▪ A link from page A to page B is a vote by A for B
▪ If page A is itself more important, then A’s vote for B should carry more weight
▪ The more votes a page receives, the more important it must be

◼ How can we model this importance?


60
PageRank
◼ Importance Computation
▪ The importance of a page is
distributed to pages that it points
to.
▪ It is the aggregation of the importance shares of the pages that point to it.
▪ If a page has 5 outlinks, the importance
of the page is divided into 5 and each
link receives one fifth share of the
importance.

62
PageRank
◼ Each webpage Pi has a score, denoted by r(Pi), or called the pagerank
of Pi
◼ The value of r(Pi) is the sum of the normalized pageranks of all webpages pointing into Pi:

    r(Pi) = Σ over Pj in BPi of r(Pj) / |Pj|

▪ BPi is the set of webpages pointing to Pi
▪ |Pj| is the number of out-links from page Pj
▪ The normalized pagerank of Pj means r(Pj) / |Pj|: the pagerank of Pj is shared equally by all webpages Pj points to

63
Example
[Figure: pages P1, P2, and P3 all link to P5; from the divisors below, P1 has 2 out-links, P2 has 1, and P3 has 3]

Suppose the pageranks of P1, P2, P3, and P4 are known:

Pages:     P1   P2   P3   P4
Pagerank:  3.5  1.2  4.2  1.0

The pagerank of P5 is computed as

r(P5) = 3.5/2 + 1.2/1 + 4.2/3 = 4.35

64
Computation of Pagerank
◼ Problem
▪ In the beginning, all pageranks are unknown. How to determine the
first pagerank value?
◼ Solution
▪ Give an initial pagerank to every webpage
▪ E.g., 1/n where n is the total number of webpages
▪ Perform the calculation of pagerank iteratively
▪ Use the pagerank formula to update the pagerank of every webpage
▪ Repeat the above step a number of times until the pagerank values are stable
(converge)

65
Iterative Procedure
◼ Let rk(Pi) be the pagerank of page Pi at iteration k
▪ Starting with r0(Pi) = 1/n for all pages Pi
◼ At iteration k+1, the pagerank of every page Pi is updated using the pageranks at iteration k:

    rk+1(Pi) = Σ over Pj in BPi of rk(Pj) / |Pj|

66
Example
◼ Consider the 6-page graph on the right
[Figure: the example graph; reading its edges off the formulas below, P1→{P2, P3}, P3→{P1, P2, P5}, P4→{P5, P6}, P5→{P4, P6}, P6→{P4}; P2 has no out-links]
◼ Iteration 0
▪ r0(Pi) = 1/6 for i = 1, 2, ..., 6
◼ Iteration 1
▪ r1(P1) = r0(P3) / 3 = 0.0556
▪ r1(P2) = r0(P1) / 2 + r0(P3) / 3 = 0.1389
▪ r1(P3) = r0(P1) / 2 = 0.0833
▪ r1(P4) = r0(P5) / 2 + r0(P6) = 0.25
▪ r1(P5) = r0(P3) / 3 + r0(P4) / 2 = 0.1389
▪ r1(P6) = r0(P4) / 2 + r0(P5) / 2 = 0.1667
▪ (This iteration is reproduced in the code sketch after this slide.)

67
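A minimal single-machine sketch (not Hadoop) of the iterative procedure on the example graph; the out-link lists are read off the update formulas above (P2 has no out-links, which becomes important on later slides).

public class PageRankIteration {
  public static void main(String[] args) {
    int n = 6;
    // outLinks[i] lists the pages that page i+1 points to (0-based indices)
    int[][] outLinks = {
        {1, 2},        // P1 -> P2, P3
        {},            // P2 has no out-links (dangling page)
        {0, 1, 4},     // P3 -> P1, P2, P5
        {4, 5},        // P4 -> P5, P6
        {3, 5},        // P5 -> P4, P6
        {3}            // P6 -> P4
    };
    double[] r = new double[n];
    java.util.Arrays.fill(r, 1.0 / n);            // r0(Pi) = 1/n
    for (int iter = 1; iter <= 2; iter++) {
      double[] next = new double[n];
      for (int j = 0; j < n; j++)                 // each page shares rk(Pj) / |Pj|
        for (int i : outLinks[j])
          next[i] += r[j] / outLinks[j].length;
      r = next;
      System.out.println("iteration " + iter + ": " + java.util.Arrays.toString(r));
      // iteration 1 ≈ [0.0556, 0.1389, 0.0833, 0.25, 0.1389, 0.1667]
    }
  }
}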
Example (cont)
◼ After 20 iterations, the pages ranked by pagerank are:

Rank   Page
1      P4
2      P6
3      P5
4      P2
5, 6   P1, P3

68
Scalability Issue in PageRank Computation

◼ The standard PageRank equation computes the rank of a single page at a time

◼ How can we compute it for all pages at once?

▪ Using matrix multiplication: compute the PageRank of all pages at one time in each iteration

70
Matrix and Vector notations – A Recap

◼ Matrices are like tables of data (numbers)


◼ An n × m matrix consists of n rows and m columns, each of
the nm entries consists of a number
◼ A vector is a matrix with either a single row or a single
column
▪ A row vector is a 1 × m matrix
▪ A column vector is an n × 1 matrix
◼ Examples: a 2 × 3 matrix, a 3 × 1 column vector, a 1 × 4 row vector

71
Matrix Multiplication – A Recap
◼ An n × m matrix A multiplied with an r × s matrix B,
denoted as A × B (or just AB)
▪ Must have m = r, i.e., the number of columns of A = the
number of rows of B
◼ Let matrix C = A×B (or just AB)
▪ C will be an n × s matrix
◼ Let Aij, Bij, and Cij denote the entry at row i column j
of A, B and C, respectively
Cij = Σk = 1 to m Aik * Bkj

72
Example
C = A × B (reproduced in the code sketch after this slide)

◼ C11= 0*0 + 1*3 + 1*4 + 2*1 = 9


◼ C21 = 3*0 + 0*3 + 0*4 + 1*1 = 1
◼ C31 = 0*0 + 1*3 + 0*4 + 0*1 = 3
◼ C12 = 0*1 + 1*0 + 1*0 + 2*1 = 2
◼ C22 = 3*1 + 0*0 + 0*0 + 1*1 = 4
◼ C32 = 0*1 + 1*0 + 0*0 + 0*1 = 0
73
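The matrices A and B were shown as images, but their entries can be recovered from the dot products above: A is the 3 × 4 matrix with rows (0 1 1 2), (3 0 0 1), (0 1 0 0), and B is the 4 × 2 matrix with columns (0 3 4 1) and (1 0 0 1). A small sketch that reproduces C:

public class MatMul {
  public static void main(String[] args) {
    int[][] a = { {0, 1, 1, 2}, {3, 0, 0, 1}, {0, 1, 0, 0} };   // 3 x 4
    int[][] b = { {0, 1}, {3, 0}, {4, 0}, {1, 1} };             // 4 x 2
    int n = a.length, m = a[0].length, s = b[0].length;
    int[][] c = new int[n][s];
    for (int i = 0; i < n; i++)
      for (int j = 0; j < s; j++)
        for (int k = 0; k < m; k++)
          c[i][j] += a[i][k] * b[k][j];                          // Cij = sum_k Aik * Bkj
    for (int[] row : c) System.out.println(java.util.Arrays.toString(row));
    // prints [9, 2], [1, 4], [3, 0]
  }
}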
Scalar Multiplication and Matrix Addition – A Recap

◼ Scalar multiplication involves a number k multiplies with a


matrix M
▪ Every entry of M is multiplied by k
▪ E.g.,

◼ Matrix addition is addition of two matrices


▪ Corresponding entries from two matrices are added
▪ E.g.,

74
PageRank Vector
◼ For n pages, all the n PageRank values can be represented in a
1 × n row vector (denoted as π)
π = ( r(P1) r(P2) ... r(Pn) )
◼ Let π(k) denote the vector at iteration k of the iterative
procedure
◼ At iteration 0
▪ uniform initialization: π(0) = ( 1/n 1/n ... 1/n )

75
Example
◼ For the given graph, the matrix H is the row-normalized hyperlink matrix
▪ every non-zero row sums to 1
[Figure: the 6-page example graph and its matrix H]
▪ E.g., H56 = 1/2: there is a link from P5 to P6 and P5 has two out-links, so the pagerank P5 contributes to P6 is r(P5)/2

What is the difference between the adjacency matrix and the row-normalized hyperlink matrix?


76
Update PageRank at Iteration k
◼ Performed by a single vector-matrix multiplication
    π(k+1) = π(k) H
    (1 × n)   (1 × n)(n × n)
◼ This agrees with the earlier per-page formula, because Hji = 1/|Pj| if there is a link from Pj to Pi, and Hji = 0 otherwise

77
Re-run the Example

◼ π(0) = (1/6 1/6 1/6 1/6 1/6 1/6)

◼ π(1) = π(0) H
▪ π(1)1 = 1/6 * 1/3 = .0556
▪ π(1)2 = 1/6 * 1/2 + 1/6 * 1/3 = .1389
▪ π(1)3 = 1/6 * 1/2 = .0833
▪ π(1)4 = 1/6 * 1/2 + 1/6 * 1 = .25
▪ π(1)5 = 1/6 * 1/3 + 1/6 * 1/2 = .1389
▪ π(1)6 = 1/6 * 1/2 + 1/6 * 1/2 = .1667
◼ π(1) = (.0556 .1389 .0833 .25 .1389 .1667)

78
Re-run the Example
◼ π(1) = π(0) H = (.0556 .1389 .0833 .25 .1389 .1667)
◼ π(2) = π(1) H = (.0278 .0556 .0278 .2361 .1528 .1944)
◼ … (these two products are reproduced in the code sketch after this slide)

79
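A small sketch of the same computation written literally as π(k+1) = π(k) H, using the row-normalized H of the example graph (row 2 is all zero); two multiplications reproduce π(1) and π(2) above.

public class VectorMatrixPageRank {
  static double[] times(double[] pi, double[][] h) {      // row vector times matrix
    double[] next = new double[pi.length];
    for (int j = 0; j < h[0].length; j++)
      for (int i = 0; i < pi.length; i++)
        next[j] += pi[i] * h[i][j];
    return next;
  }
  public static void main(String[] args) {
    double t = 1.0 / 3;
    double[][] h = {
        {0, .5, .5, 0, 0, 0},    // P1 -> P2, P3
        {0, 0, 0, 0, 0, 0},      // P2: dangling, row of zeros
        {t, t, 0, 0, t, 0},      // P3 -> P1, P2, P5
        {0, 0, 0, 0, .5, .5},    // P4 -> P5, P6
        {0, 0, 0, .5, 0, .5},    // P5 -> P4, P6
        {0, 0, 0, 1, 0, 0}       // P6 -> P4
    };
    double[] pi = {1/6.0, 1/6.0, 1/6.0, 1/6.0, 1/6.0, 1/6.0};
    pi = times(pi, h);   // ≈ (.0556 .1389 .0833 .25 .1389 .1667)
    pi = times(pi, h);   // ≈ (.0278 .0556 .0278 .2361 .1528 .1944)
    System.out.println(java.util.Arrays.toString(pi));
  }
}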
Problem: Rank Sinks
◼ Rank sinks – pages that accumulate more and more pagerank at each iteration
▪ For the given graph, pages 4, 5 & 6 are the rank sinks
▪ Why?
▪ Meanwhile pages 1, 2 & 3 end up with zero pagerank
▪ E.g., after 20 iterations, π(20) = (0 0 0 .2667 .1333 .2)
◼ It is difficult to rank pages 1, 2 & 3 if they all have zero pagerank
80
Final PageRank
◼ Initial allocation: (1/6, 1/6, 1/6, 1/6, 1/6, 1/6)
◼ Final values: (0, 0, 0, 0.2667, 0.1333, 0.2)
◼ Total 0.6 — 0.4 of the pagerank is missing
◼ The 0.4 was lost via vertex 2, so the vector no longer adds up to 1
▪ One approach is to normalize the final answer (divide by 0.6), giving (0, 0, 0, 0.444, 0.222, 0.333)

81
PageRank for Vertex 2 – Explained

◼ At each step, your pagerank is what is passed to you from your in-neighbours
◼ 0.4 was passed to vertex 2 and is lost, as it wasn’t passed on (vertex 2 has no out-links); summing the second entry of π(0), π(1), π(2), …:
0.167 + 0.139 + 0.056 + 0.023 + 0.009 + 0.004 + ... = 0.4

82
Adjustments to the Basic Setting

◼ Dangling pages
▪ Pages having no out-links (e.g., vertex 2 in the example graph)

◼ Any examples of dangling pages on the Web?
83
Adjustments to the Basic Setting
◼ Examples of dangling pages on the Web
▪ pdf files
▪ image files
▪ data tables
▪ pages with no hyperlinks, etc.
◼ The random surfer can’t proceed forward from these pages

◼ Adjustment (teleporting)
▪ The random surfer, after entering a dangling node, can now jump to any page at random (i.e., with equal probability)

84
Recall The Previous Example
π(k+1) = π(k) H

◼ π(0) = (1/6 1/6 1/6 1/6 1/6 1/6)
◼ π(1) = π(0) H
▪ π(1)1 = 1/6 * 1/3 = .0556
▪ π(1)2 = 1/6 * 1/2 + 1/6 * 1/3 = .1389
▪ π(1)3 = 1/6 * 1/2 = .0833
▪ π(1)4 = 1/6 * 1/2 + 1/6 * 1 = .25
▪ π(1)5 = 1/6 * 1/3 + 1/6 * 1/2 = .1389
▪ π(1)6 = 1/6 * 1/2 + 1/6 * 1/2 = .1667
◼ π(1) = (.0556 .1389 .0833 .25 .1389 .1667)
85
Adjustments to the Basic Setting (1/2)
◼ The rows of the matrix H that are all zeros are replaced by rows of (1/n 1/n ... 1/n)

◼ For the running example graph, the 2nd row of the hyperlink matrix is changed; H’ is the modified matrix
▪ H’ is a stochastic matrix, since every row sums to 1

What is the difference between a stochastic matrix and the row-normalized hyperlink matrix?
86
Adjustments to the Basic Setting (2/2)
◼ Repeating our matrix multiplication process with the stochastic matrix H’:

    π(k+1) = π(k) H’

◼ we now get a probability vector
▪ The entries of π(k+1) sum to 1
▪ Benefit: no pagerank is lost

▪ Hence, the stochastic matrix H’ gives us a probability vector
87
Adjustments to the Basic Setting
◼ Allow teleporting to any page at any time
▪ With probability α, the random surfer follows one of the hyperlinks on the current page
▪ With probability 1-α, the random surfer randomly selects a page (out of the n pages) to teleport to

◼ The modified hyperlink matrix is called the Google matrix G,

    G = αH’ + (1-α)(1/n)I

▪ I here is an n×n matrix with all entries equal to 1
▪ Why do we need I?
▪ so the surfer can jump to any page at random with probability 1-α
▪ (A sketch that builds G and iterates it on the example graph follows below.)
88
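A minimal sketch (not from the slides) that builds H’ from H by replacing the dangling row with (1/n … 1/n), forms G = αH’ + (1-α)(1/n)·(all-ones), and iterates π(k+1) = π(k) G on the example graph; α = 0.9 and the iteration count are illustrative.

public class GoogleMatrix {
  public static void main(String[] args) {
    int n = 6;
    double alpha = 0.9;                          // damping factor (slide uses 0.9; Google uses 0.85)
    double[][] h = new double[n][n];             // row-normalized hyperlink matrix H
    h[0][1] = 0.5; h[0][2] = 0.5;                // P1 -> P2, P3
    /* row 1 (P2) stays all zero: dangling page */
    h[2][0] = h[2][1] = h[2][4] = 1.0 / 3;       // P3 -> P1, P2, P5
    h[3][4] = h[3][5] = 0.5;                     // P4 -> P5, P6
    h[4][3] = h[4][5] = 0.5;                     // P5 -> P4, P6
    h[5][3] = 1.0;                               // P6 -> P4

    double[][] g = new double[n][n];
    for (int i = 0; i < n; i++) {
      double rowSum = 0;
      for (double x : h[i]) rowSum += x;
      for (int j = 0; j < n; j++) {
        double hPrime = (rowSum == 0) ? 1.0 / n : h[i][j];  // dangling-row fix -> H'
        g[i][j] = alpha * hPrime + (1 - alpha) / n;         // teleporting -> G
      }
    }

    double[] pi = new double[n];
    java.util.Arrays.fill(pi, 1.0 / n);
    for (int iter = 0; iter < 100; iter++) {                // pi(k+1) = pi(k) G
      double[] next = new double[n];
      for (int j = 0; j < n; j++)
        for (int i = 0; i < n; i++)
          next[j] += pi[i] * g[i][j];
      pi = next;
    }
    System.out.println(java.util.Arrays.toString(pi));      // converges to a probability vector
  }
}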
Example
◼ G = αH’ + (1-α)(1/n)I
◼ Let α = 0.9
[Figure: the resulting 6 × 6 matrix G for the example graph]

89
Google’s Adjusted PageRank Method

π(k+1) = π(k) G
◼ With the Google matrix, the pagerank vector converges to a stable value
[Figure: the successive iterates for the example graph]
90
Meaning of α - The Damping Factor
◼ α is the probability that the surfer doesn’t get bored and continues to click on page links
◼ 1-α is the probability of jumping to a random web page
◼ 1/(1-α) is the average number of steps before a jump to a random page
◼ α = 0.85 (Google): 1/(1-0.85) = 100/15 ≈ 6.67

91
The Damping Factor: Effect on Convergence

α       Number of iterations
0.5     34
0.75    81
0.8     104
0.85    142
0.9     219
0.95    449
0.99    2,292
0.999   23,015

◼ α should be relatively large to reduce the effect of random teleporting
◼ α = 0.85 is a balance between the computation requirement and the quality of the output

G = αH’ + (1-α)(1/n)I
92
PageRank using MapReduce

93
PageRank using MapReduce
Recall The Previous Example: π(k+1) = π(k) H

◼ π(0) = (1/6 1/6 1/6 1/6 1/6 1/6)
◼ π(1) = π(0) H
▪ π(1)1 = 1/6 * 1/3 = .0556
▪ π(1)2 = 1/6 * 1/2 + 1/6 * 1/3 = .1389
▪ π(1)3 = 1/6 * 1/2 = .0833
▪ π(1)4 = 1/6 * 1/2 + 1/6 * 1 = .25
▪ π(1)5 = 1/6 * 1/3 + 1/6 * 1/2 = .1389
▪ π(1)6 = 1/6 * 1/2 + 1/6 * 1/2 = .1667
◼ π(1) = (.0556 .1389 .0833 .25 .1389 .1667)

94
PageRank using MapReduce – PEGASUS


95
PEGASUS
◼ The first open-source peta-scale graph mining library
◼ Based on Hadoop

◼ Handles graphs with billions of nodes and edges

◼ Unifies seemingly different graph mining tasks

◼ Built around Generalized Iterated Matrix-Vector multiplication (GIM-V)

96
PEGASUS
◼ Linear runtime in the number of edges

◼ Scales up well with the number of available machines

◼ A combination of optimizations can speed it up by up to 5x

◼ Used to analyze Yahoo’s web graph (around 6.7 billion edges)

97
PEGASUS: Real World Applications
◼ PEGASUS can be useful for finding patterns, outliers, and

interesting observations
❑ Connected Components of Real Networks
❑ PageRanks of Real Networks

❑ Diameter of Real Networks

98
GIM-V
◼ Generalized Iterated Matrix-Vector multiplication

◼ Main idea

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 99


GIM-V Base: Naïve Multiplication
◼ How can we implement matrix-by-vector multiplication in MapReduce? (A Hadoop-style sketch of the two stages follows below.)
▪ Stage 1 performs the combine2 operation: it combines each column j of the matrix with element j of the vector

▪ Stage 2 performs combineAll: it sums the partial results from Stage 1 and assigns the new vector to the old vector

100
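A condensed Hadoop-style sketch of the two stages. It assumes text inputs where each matrix line is “i j m_ij” and each vector line is “j v_j”; these formats, the class names, and the omitted drivers (plus the identity-style Stage 2 mapper that re-parses Stage 1 output) are assumptions for illustration, not PEGASUS’s actual code.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class NaiveMatVec {

  // Stage 1 map: key everything by column id j, so column j of the matrix
  // meets vector element v_j in the same reducer.
  public static class Stage1Mapper extends Mapper<LongWritable, Text, LongWritable, Text> {
    public void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] f = line.toString().trim().split("\\s+");
      if (f.length == 3)   // matrix entry "i j m_ij" -> (j, "M i m_ij")
        ctx.write(new LongWritable(Long.parseLong(f[1])), new Text("M " + f[0] + " " + f[2]));
      else                 // vector entry "j v_j"    -> (j, "V v_j")
        ctx.write(new LongWritable(Long.parseLong(f[0])), new Text("V " + f[1]));
    }
  }

  // Stage 1 reduce (combine2): emit one partial product m_ij * v_j per matrix
  // entry, keyed by the destination row i.
  public static class Stage1Reducer extends Reducer<LongWritable, Text, LongWritable, DoubleWritable> {
    public void reduce(LongWritable col, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      double vj = 0.0;
      List<String> matrixEntries = new ArrayList<>();      // buffered as "i m_ij"
      for (Text t : values) {
        String[] f = t.toString().split(" ");
        if (f[0].equals("V")) vj = Double.parseDouble(f[1]);
        else matrixEntries.add(f[1] + " " + f[2]);
      }
      for (String e : matrixEntries) {
        String[] f = e.split(" ");
        ctx.write(new LongWritable(Long.parseLong(f[0])),
                  new DoubleWritable(Double.parseDouble(f[1]) * vj));
      }
    }
  }

  // Stage 2 reduce (combineAll): sum the partial products for each row i,
  // producing the new vector element.
  public static class Stage2Reducer extends Reducer<LongWritable, DoubleWritable, LongWritable, DoubleWritable> {
    public void reduce(LongWritable row, Iterable<DoubleWritable> parts, Context ctx)
        throws IOException, InterruptedException {
      double sum = 0.0;
      for (DoubleWritable p : parts) sum += p.get();
      ctx.write(row, new DoubleWritable(sum));
    }
  }
}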
GIM-V Base: Naïve Multiplication
[Slides 101-103: worked illustration of the two stages, shown as figures]

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 101-103
PageRank using MapReduce (1/7)
Recall The Previous Example: π(k+1) = π(k) H

◼ π(0) = (1/6 1/6 1/6 1/6 1/6 1/6)
◼ π(1) = π(0) H
▪ π(1)1 = 1/6 * 1/3 = .0556
▪ π(1)2 = 1/6 * 1/2 + 1/6 * 1/3 = .1389
▪ π(1)3 = 1/6 * 1/2 = .0833
▪ π(1)4 = 1/6 * 1/2 + 1/6 * 1 = .25
▪ π(1)5 = 1/6 * 1/3 + 1/6 * 1/2 = .1389
▪ π(1)6 = 1/6 * 1/2 + 1/6 * 1/2 = .1667
◼ π(1) = (.0556 .1389 .0833 .25 .1389 .1667)

104
PageRank using MapReduce – PEGASUS (2/7)

◼ π(0) = (1/6 1/6 1/6 1/6 1/6 1/6)

◼ Written as a matrix-times-column-vector product: the matrix below (labelled H’ on the slide) is the hyperlink matrix of the example graph arranged so that multiplying it by the column vector v = π(0) gives the next iterate

        matrix                         v
  0     0     1/3   0     0     0      1/6
  1/2   0     1/3   0     0     0      1/6
  1/2   0     0     0     0     0      1/6
  0     0     0     0     1/2   1      1/6
  0     0     1/3   1/2   0     0      1/6
  0     0     0     1/2   1/2   0      1/6
105
PageRank using MapReduce – PEGASUS (3/7)
Stage 1 Map(): generates (key, value) pairs from the matrix and the vector

Example: a matrix entry (r1, c1, v1) is output by the mapper as (c1, (r1, v1)); a vector entry (c1, v1) is output as (c1, v1)

Group by column id (non-zero matrix entries only):
▪ Column 1: vector 1/6; matrix (r2, 1/2), (r3, 1/2)
▪ Column 2: vector 1/6; no non-zero matrix entries
▪ Column 3: vector 1/6; matrix (r1, 1/3), (r2, 1/3), (r5, 1/3)
▪ Column 4: vector 1/6; matrix (r5, 1/2), (r6, 1/2)
▪ Column 5: vector 1/6; matrix (r4, 1/2), (r6, 1/2)
▪ Column 6: vector 1/6; matrix (r4, 1)
106
PageRank using MapReduce – PEGASUS (4/7)
Stage 1 Reduce(): performs the combine2 operation, i.e., multiplies each matrix entry in its column by the vector value for that column

(6 reducers are used here for the sake of simplicity; non-zero outputs only)
▪ Reducer 1 (column 1): (r2, 1/12), (r3, 1/12)
▪ Reducer 2 (column 2): no non-zero output
▪ Reducer 3 (column 3): (r1, 1/18), (r2, 1/18), (r5, 1/18)
▪ Reducer 4 (column 4): (r5, 1/12), (r6, 1/12)
▪ Reducer 5 (column 5): (r4, 1/12), (r6, 1/12)
▪ Reducer 6 (column 6): (r4, 1/6)


107
PageRank using MapReduce – PEGASUS (5/7)
Stage 1 Reduce() output, re-keyed by row # (the destination row id replaces the column id):
▪ From reducer 1: (R2, 1/12), (R3, 1/12)
▪ From reducer 3: (R1, 1/18), (R2, 1/18), (R5, 1/18)
▪ From reducer 4: (R5, 1/12), (R6, 1/12)
▪ From reducer 5: (R4, 1/12), (R6, 1/12)
▪ From reducer 6: (R4, 1/6)


108
PageRank using MapReduce – PEGASUS (6/7)
Stage 2 Map(): simply emits the received (k,v) pairs

Group by Row Id

109
PageRank using MapReduce – PEGASUS (7/7)
Stage 2 Reduce(): performs the combineAll operation to sum up the corresponding values and assign the result to the new vector

R1: 0 + 0 + 1/18 + 0 + 0 + 0      = 0.0556
R2: 1/12 + 0 + 1/18 + 0 + 0 + 0   = 0.1389
R3: 1/12 + 0 + 0 + 0 + 0 + 0      = 0.0833
R4: 0 + 0 + 0 + 0 + 1/12 + 1/6    = 0.25
R5: 0 + 0 + 1/18 + 1/12 + 0 + 0   = 0.1389
R6: 0 + 0 + 0 + 1/12 + 1/12 + 0   = 0.1667

This matches π(1) = (.0556 .1389 .0833 .25 .1389 .1667) computed earlier.


J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 110
Shortcomings of MapReduce
• Reads from and writes to disk before and after each Map and Reduce phase
– Not efficient for iterative tasks, e.g., machine learning
• Only for batch processing
– No interactivity or streaming data

Apache Hadoop and Spark: Introduction and Use Cases for Data Analysis, Afzal Godil,
Information Access Division, ITL, NIST
