Hadoop
Outline
• Background
• Components and Architecture
• Running Examples
• Word Count
• PageRank
• Hands-on in Lab
Servers
◼ Servers are typically designed for reliability and to service a large number of requests
Compact Servers
◼ Organizations would like to conserve the amount of floor space
dedicated to their computer infrastructure
Racks
◼ Equipment (e.g., servers) is typically placed in racks
What is a Data Center?
◼ A facility used to house computer systems and components,
such as networking and storage systems, cooling, UPS, air
filters
Microsoft now has more than one million servers
Amazon, Facebook, Google, Twitter, New York Times, Yahoo! …. many more
Motivation: Google Example
◼ 20+ billion web pages x 20KB = 400+ TB
◼ 1 computer reads 30-35 MB/sec from disk
▪ ~4 months to read the web
◼ ~1,000 hard drives to store the web
◼ Takes even more to do something useful
with the data!
◼ Today, a standard architecture for such problems is
emerging:
▪ Cluster of commodity Linux nodes
▪ Commodity network (ethernet) to connect them
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
MapReduce
◼ Challenges:
▪ How to distribute computation?
▪ Distributed/parallel programming is hard
[Figure: the classical single-machine picture — one node with a CPU and memory running machine learning/statistics]
Hadoop Core Components
◼ The project includes these modules:
▪ Hadoop Common: The common utilities that support the other
Hadoop modules.
▪ Hadoop Distributed File System (HDFS™): A distributed file system
that provides high-throughput access to application data.
▪ Hadoop YARN: A framework for job scheduling and cluster resource
management.
▪ Hadoop MapReduce: A programming model for large scale data
processing.
Storage Infrastructure
◼ Problem:
▪ If nodes fail, how to store data persistently?
◼ Answer:
▪ Distributed File System:
▪ Provides global file namespace
▪ Google GFS; Hadoop HDFS;
◼ Typical usage pattern
▪ Huge files (100s of GB to TB)
▪ Data is rarely updated in place
▪ Reads and appends are common
Distributed File System
◼ Chunk servers
▪ File is split into contiguous chunks
▪ Typically each chunk is 64–128 MB (the default chunk size has grown over the years)
▪ Each chunk is replicated (usually 2x or 3x), as configured in hdfs-site.xml
▪ Try to keep replicas in different racks
◼ Master node
▪ a.k.a. Name Node in Hadoop’s HDFS
▪ Stores metadata about where files are stored
▪ Might be replicated
◼ Client library for file access
▪ Talks to master to find chunk servers
▪ Connects directly to chunk servers to access data
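Chunk size and replication factor are configurable in HDFS. A minimal hdfs-site.xml sketch (the property names are the standard HDFS ones; the values shown are illustrative choices, not required defaults):

```xml
<configuration>
  <!-- Block ("chunk") size: 128 MB, expressed in bytes -->
  <property>
    <name>dfs.blocksize</name>
    <value>134217728</value>
  </property>
  <!-- Number of replicas kept for each block -->
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
```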
Distributed File System
◼ Reliable distributed file system
◼ Data kept in “chunks” spread across machines
◼ Each chunk replicated on different machines
▪ Seamless recovery from disk or machine failure
[Figure: chunks C0–C5 and D0–D1 spread across chunk servers 1, 2, 3, ..., N, with each chunk stored on multiple servers]
Warm-up task:
◼ We have a huge text document
◼ Count the number of times each distinct word appears in the file
◼ Sample application:
▪ Analyze web server logs to find popular URLs
[Figure: MapReduce data flow — map tasks turn input key-value pairs into intermediate key-value pairs; a group-by-key step collects the values for each key into key-value groups; reduce tasks turn each group into output key-value pairs]
[Figure: word counting on a big document, read sequentially — the map phase emits a pair per word occurrence, e.g. (The, 1), (crew, 1), (space, 1); grouping and reducing by key yields counts such as (crew, 2) and (the, 3)]
MapReduce: Word Counting - Again
map(key, value):
  // key: document name; value: text of the document
  for each word w in value:
    emit(w, 1)

reduce(key, values):
  // key: a word; values: an iterator over counts
  result = 0
  for each count v in values:
    result += v
  emit(key, result)
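The pseudocode can be exercised end-to-end with a small in-memory simulation. This is a plain-Python sketch of the map → group-by-key → reduce pipeline, not actual Hadoop code:

```python
from collections import defaultdict

def map_fn(doc_name, text):
    # key: document name; value: text of the document
    for word in text.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # key: a word; values: an iterable of counts
    return (word, sum(counts))

def word_count(documents):
    # Map phase: emit (word, 1) for every word occurrence
    intermediate = []
    for name, text in documents.items():
        intermediate.extend(map_fn(name, text))
    # Shuffle phase: group all values by key
    groups = defaultdict(list)
    for word, count in intermediate:
        groups[word].append(count)
    # Reduce phase: sum the counts for each word
    return dict(reduce_fn(w, c) for w, c in groups.items())

docs = {"d1": "the crew of the space shuttle", "d2": "the space race"}
print(word_count(docs))
# {'the': 3, 'crew': 1, 'of': 1, 'space': 2, 'shuttle': 1, 'race': 1}
```

In real Hadoop the shuffle is performed by the framework between the map and reduce tasks; here it is a simple dictionary.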
Group by key: collect all pairs with the same key (hash merge, shuffle, sort, partition)
Reduce: collect all values belonging to the key and output
All phases are distributed with many tasks doing the work
Data Flow
◼ Input and final output are stored on a distributed file system
(FS):
▪ Scheduler tries to schedule map tasks “close” to physical storage
location of input data
◼ Often a Map task will produce many pairs of the form (k,v1), (k,v2), … for the same key k
▪ E.g., popular words in the word count example
◼ Can save network time by pre-aggregating values in the mapper:
▪ combine(k, list(v1)) -> v2
▪ Combiner is usually the same as the reduce function
◼ Works only if the reduce function is commutative and associative
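The saving can be seen in a small plain-Python sketch (assuming the word-count setting): running the reduce logic as a combiner inside each map task shrinks the number of pairs that must cross the network:

```python
from collections import Counter

def map_output(text):
    # Raw mapper output: one (word, 1) pair per word occurrence
    return [(w, 1) for w in text.split()]

def combine(pairs):
    # Combiner: pre-aggregate counts locally before the shuffle.
    # Valid here because addition is commutative and associative.
    agg = Counter()
    for word, count in pairs:
        agg[word] += count
    return list(agg.items())

raw = map_output("to be or not to be")
combined = combine(raw)
print(len(raw), len(combined))  # 6 4 — six raw pairs shrink to four pre-aggregated pairs
```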
Java MapReduce
◼ We need three things: a map function, a reduce function, and
some code to run the job.
◼ The map function is represented by the Mapper class, which
declares an abstract map() method.
◼ The reduce function is similarly defined using a Reducer class.
Computing PageRank using MapReduce
[Figure: a hyperlink with anchor text pointing from Page A to Page B]
PageRank
◼ PageRank is a numeric value that represents how important a
page is on the web.
◼ Webpage importance
▪ A link from page A to page B is a vote by A for B
▪ If page A is itself important, its vote for B carries more weight
▪ The more votes a page receives, the more important it must be
PageRank
◼ Importance Computation
▪ The importance of a page is distributed to the pages that it points to
▪ A page's importance is the aggregation of the importance shares of the pages that point to it
▪ If a page has 5 outlinks, its importance is divided into 5 and each link receives a one-fifth share
PageRank
◼ Each webpage Pi has a score, denoted by r(Pi), or called the pagerank
of Pi
◼ The value of r(Pi) is the sum of the normalized pageranks of all
webpages pointing into Pi
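Written as a formula (with B(Pi) the set of pages linking into Pi, and |Pj| the number of outlinks of page Pj):

```latex
r(P_i) = \sum_{P_j \in B(P_i)} \frac{r(P_j)}{|P_j|}
```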
Example
[Figure: pages P1–P4 with links into page P5]
Suppose the pageranks of P1, P2, P3, and P4 are known:

Pages    P1   P2   P3   P4
Pagerank 3.5  1.2  4.2  1.0
Computation of Pagerank
◼ Problem
▪ In the beginning, all pageranks are unknown. How to determine the
first pagerank value?
◼ Solution
▪ Give an initial pagerank to every webpage
▪ E.g., 1/n where n is the total number of webpages
▪ Perform the calculation of pagerank iteratively
▪ Use the pagerank formula to update the pagerank of every webpage
▪ Repeat the above step a number of times until the pagerank values are stable
(converge)
Iterative Procedure
◼ Let rk(Pi) be the PageRank of page Pi at
iteration k
▪ Starting with r0(Pi) = 1/n for all pages Pi
◼ At iteration k+1, the pagerank of every page Pi
is updated using the pageranks at iteration k
Example
◼ Consider the graph on the right [figure: a directed graph on six pages, labelled 1–6]
◼ Iteration 0
▪ r0(Pi) = 1/6 for i = 1, 2, ..., 6
◼ Iteration 1 [values computed in the figure]
Example (cont)
◼ After 20 iterations, the pages rank as follows:

Rank  Page
1     P4
2     P6
3     P5
4     P2
5, 6  P1, P3
Issue in PageRank Computation
Scalability Issue in PageRank Computation
Matrix and Vector notations – A Recap
Matrix Multiplication – A Recap
◼ An n × m matrix A multiplied with an r × s matrix B,
denoted as A × B (or just AB)
▪ Must have m = r, i.e., the number of columns of A = the
number of rows of B
◼ Let matrix C = A×B (or just AB)
▪ C will be an n × s matrix
◼ Let Aij, Bij, and Cij denote the entry at row i column j
of A, B and C, respectively
Cij = Σ(k = 1 to m) Aik × Bkj
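A direct transcription of this formula into plain Python (no NumPy), as a quick sanity check:

```python
def matmul(A, B):
    # A: n x m, B: m x s, each stored as a list of rows; returns C = A x B (n x s)
    n, m, s = len(A), len(B), len(B[0])
    assert all(len(row) == m for row in A), "columns of A must equal rows of B"
    # C[i][j] = sum over k of A[i][k] * B[k][j]
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(s)]
            for i in range(n)]

A = [[1, 2],
     [3, 4]]
B = [[5, 6],
     [7, 8]]
print(matmul(A, B))  # [[19, 22], [43, 50]]
```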
PageRank Vector
◼ For n pages, all the n PageRank values can be represented in a
1 × n row vector (denoted as π)
π = ( r(P1) r(P2) ... r(Pn) )
◼ Let π(k) denote the vector at iteration k of the iterative
procedure
◼ At iteration 0
▪ uniform initialization: π(0) = ( 1/n 1/n ... 1/n )
Example
◼ For the given graph, the matrix H is the row-normalised hyperlink matrix
▪ each non-zero row sums to 1
[Figure: the six-page graph and its matrix H]
Re-run the Example
◼ π(0) = (1/6 1/6 1/6 1/6 1/6 1/6)
∙ π(1) = π(0) H
▪ π(1)1 = 1/6 * 1/3 = .0556
▪ π(1)2 = 1/6 * 1/2 + 1/6 * 1/3 = .1389
▪ π(1)3 = 1/6 * 1/2 = .0833
▪ π(1)4 = 1/6 * 1/2 + 1/6 * 1 = .25
▪ π(1)5 = 1/6 * 1/3 + 1/6 * 1/2 = .1389
▪ π(1)6 = 1/6 * 1/2 + 1/6 * 1/2 = .1667
∙ π(1) = (.0556 .1389 .0833 .25 .1389 .1667)
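The iteration can be reproduced in plain Python. The matrix H below encodes the example graph as inferred from the worked values (edges 1→2, 1→3, 3→1, 3→2, 3→5, 4→5, 4→6, 5→4, 5→6, 6→4, with page 2 having no outlinks) — treat that edge list as an assumption read off the numbers:

```python
def next_rank(pi, H):
    # One step of the PageRank iteration: pi(k+1) = pi(k) H
    n = len(pi)
    return [sum(pi[i] * H[i][j] for i in range(n)) for j in range(n)]

# Row-normalised hyperlink matrix H for the six-page example graph
H = [
    [0,   1/2, 1/2, 0,   0,   0  ],  # page 1 -> 2, 3
    [0,   0,   0,   0,   0,   0  ],  # page 2 is dangling (no outlinks)
    [1/3, 1/3, 0,   0,   1/3, 0  ],  # page 3 -> 1, 2, 5
    [0,   0,   0,   0,   1/2, 1/2],  # page 4 -> 5, 6
    [0,   0,   0,   1/2, 0,   1/2],  # page 5 -> 4, 6
    [0,   0,   0,   1,   0,   0  ],  # page 6 -> 4
]

pi = [1/6] * 6                      # pi(0): uniform initialisation
pi1 = next_rank(pi, H)
print([round(x, 4) for x in pi1])   # [0.0556, 0.1389, 0.0833, 0.25, 0.1389, 0.1667]
pi2 = next_rank(pi1, H)
print([round(x, 4) for x in pi2])   # [0.0278, 0.0556, 0.0278, 0.2361, 0.1528, 0.1944]
```

Both vectors match the worked values on the slides.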
Re-run the Example
◼ π(1) = π(0) H = (.0556 .1389 .0833 .25 .1389 .1667)
◼ π(2) = π(1) H = (.0278 .0556 .0278 .2361 .1528 .1944)
◼ …
Problem…
◼ Rank sinks – pages that accumulate more and more pagerank at each iteration
▪ For the given graph, pages 4, 5 & 6 are the rank sinks, while pages 1, 2 & 3 get zero PageRank
▪ Why?
▪ E.g., after 20 iterations, π(20) = (0 0 0 .2667 .1333 .2)
◼ It is difficult to rank pages 1, 2 & 3 if they all have zero PageRank
Final PageRank
◼ Initial allocation: (1/6, 1/6, 1/6, 1/6, 1/6, 1/6)
◼ Final values: (0, 0, 0, 0.2667, 0.1333, 0.2)
PageRank for Vertex 2 – Explained
◼ 0.4 was passed to vertex 2 and is lost, as it wasn't passed on (from the second column)
0.167 + 0.139 + 0.056 + 0.023 + 0.009 + 0.004 + ... = 0.4
Adjustments to the Basic Setting
◼ Dangling pages
▪ Pages having no outlinks
◼ Any examples of dangling pages on the Web?
Adjustments to the Basic Setting
◼ Dangling page examples on the Web
▪ pdf files
▪ image files
▪ data tables
▪ pages with no hyperlinks, etc.
◼ The random surfer can't proceed forward from these pages
◼ Adjustment (teleporting)
▪ After entering a dangling node, the random surfer can now jump to any page at random (i.e., with equal probability)
Recall the Previous Example
π(k+1) = π(k) H
◼ π(0) = (1/6 1/6 1/6 1/6 1/6 1/6)
∙ π(1) = π(0) H = (.0556 .1389 .0833 .25 .1389 .1667)
Adjustments to the Basic Setting (1/2)
◼ The rows of the matrix H with all zeros are replaced by rows of (1/n 1/n ... 1/n)
◼ For the running example graph, the 2nd row of the hyperlink matrix is changed; H' is the modified matrix
π(k+1) = π(k) H'
Adjustments to the Basic Setting
◼ Allow teleporting to any page at any time
▪ With probability α, the random surfer follows one of the hyperlinks on the current page
▪ With probability 1−α, the random surfer randomly selects a page (out of the n pages) to teleport to
G = αH' + (1−α)(1/n)I
▪ I is an n×n matrix with all entries equal to 1
▪ Why do we need I?
▪ the surfer can jump to any page at random with probability 1−α
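A plain-Python sketch of building G and iterating to convergence. The six-page H' is assumed from the running example, with the dangling row (page 2) replaced by 1/n; the edge list behind it is inferred from the worked values:

```python
def google_matrix(H_prime, alpha):
    # G = alpha * H' + (1 - alpha) * (1/n) * I, where I is the all-ones matrix
    n = len(H_prime)
    return [[alpha * H_prime[i][j] + (1 - alpha) / n for j in range(n)]
            for i in range(n)]

def pagerank(G, tol=1e-10, max_iter=1000):
    # Power iteration: pi(k+1) = pi(k) G until the vector stops changing
    n = len(G)
    pi = [1 / n] * n
    for _ in range(max_iter):
        new = [sum(pi[i] * G[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(new, pi)) < tol:
            return new
        pi = new
    return pi

# Running-example H with the dangling row (page 2) replaced by 1/n
H_prime = [
    [0,   1/2, 1/2, 0,   0,   0  ],
    [1/6, 1/6, 1/6, 1/6, 1/6, 1/6],
    [1/3, 1/3, 0,   0,   1/3, 0  ],
    [0,   0,   0,   0,   1/2, 1/2],
    [0,   0,   0,   1/2, 0,   1/2],
    [0,   0,   0,   1,   0,   0  ],
]

G = google_matrix(H_prime, alpha=0.9)
pi = pagerank(G)
print(all(abs(sum(row) - 1) < 1e-9 for row in G))  # True: G is row-stochastic
print(all(x > 0 for x in pi))                      # True: every page now gets a positive rank
```

Unlike the unadjusted iteration, no page ends up with zero PageRank, because G has no zero entries.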
Example
◼ G = αH' + (1−α)(1/n)I
◼ Let α = 0.9
Google's Adjusted PageRank Method
π(k+1) = π(k) G
◼ With the Google matrix, the pagerank vector converges to a stable value
Meaning of α – The Damping Factor
◼ α is the probability that the surfer doesn't get bored and continues to click on page links
◼ 1−α is the probability of a jump to a random web page
◼ 1/(1−α) is the average number of steps before a jump to a random page
◼ α = 0.85 (Google): 1/(1−0.85) = 100/15 ≈ 6.67
The Damping Factor: Effect on Convergence

α      Number of iterations
0.5    34
0.75   81
0.8    104
0.85   142
0.9    219
0.95   449
0.99   2,292
0.999  23,015

◼ α should be relatively large to reduce the effect of random teleporting
◼ α = 0.85 is a balance between computation requirements and the quality of the output
G = αH' + (1−α)(1/n)I
PageRank using MapReduce
Recall the previous example: π(k+1) = π(k) H
◼ π(0) = (1/6 1/6 1/6 1/6 1/6 1/6)
∙ π(1) = π(0) H = (.0556 .1389 .0833 .25 .1389 .1667)
PageRank using MapReduce – PEGASUS
PEGASUS
◼ The first open-source peta-scale graph mining library
◼ Based on Hadoop
PEGASUS
◼ Runtime is linear in the number of edges
PEGASUS: Real World Applications
◼ PEGASUS can be useful for finding patterns, outliers, and
interesting observations
❑ Connected Components of Real Networks
❑ PageRanks of Real Networks
GIM-V
◼ Generalized Iterated Matrix-Vector multiplication
◼ Main idea: two MapReduce stages
▪ Stage 1: the combine2 operation combines columns of the matrix with rows of the vector
▪ Stage 2: combineAll aggregates the partial results from stage 1 and assigns the new vector to the old vector
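The two stages can be sketched as a MapReduce-style simulation in plain Python — an illustrative re-implementation, not PEGASUS code. Stage 1 joins each matrix cell with the vector entry for its column and multiplies; stage 2 groups the partial products by row id and sums them:

```python
from collections import defaultdict

def gimv_matvec(cells, vector):
    # cells: list of (row, col, value) matrix triples
    # vector: dict {index: value}
    # Stage 1 (combine2): key matrix cells by column id, join each with the
    # matching vector entry, and emit (row, m_rc * v_c) partial products
    partials = []
    for r, c, m in cells:
        partials.append((r, m * vector[c]))
    # Stage 2 (combineAll): group partial products by row id and sum them
    # to obtain the new vector
    result = defaultdict(float)
    for r, p in partials:
        result[r] += p
    return dict(result)

# 2x2 example: M = [[1, 2], [3, 4]] as (row, col, value) triples, v = (5, 6)
cells = [(0, 0, 1.0), (0, 1, 2.0), (1, 0, 3.0), (1, 1, 4.0)]
v = {0: 5.0, 1: 6.0}
print(gimv_matvec(cells, v))  # {0: 17.0, 1: 39.0}
```

One such matrix–vector pass corresponds to one PageRank iteration when the cells hold the transposed hyperlink matrix and the vector holds the current pageranks.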
GIM-V Base: Naïve Multiplication
PageRank using MapReduce (1/7)
Recall the previous example: π(k+1) = π(k) H
◼ π(0) = (1/6 1/6 1/6 1/6 1/6 1/6)
∙ π(1) = π(0) H = (.0556 .1389 .0833 .25 .1389 .1667)
PageRank using MapReduce – PEGASUS (2/7)
The hyperlink matrix is stored transposed, so that the update π(k+1) = π(k) H can be computed as the matrix–vector product Hᵀ v, with v = π(0)ᵀ:

Hᵀ                             v
0    0    1/3   0    0    0    1/6
1/2  0    1/3   0    0    0    1/6
1/2  0    0     0    0    0    1/6
0    0    0     0    1/2  1    1/6
0    0    1/3   1/2  0    0    1/6
0    0    0     1/2  1/2  0    1/6
PageRank using MapReduce – PEGASUS (3/7)
Stage 1 Map(): generates (key, value) pairs from the matrix and vector files
▪ A matrix cell (r1, c1, v1) (the value at row r1, column c1) is output by the mapper as (c1, (r1, v1)), i.e., keyed by column id
▪ A vector entry (c1, v1) is output by the mapper as (c1, v1)
Group by Column Id: each matrix value in column c is multiplied by the matching vector entry, giving partial products keyed by row id:

C1: (R1, 0) (R2, 1/12) (R3, 1/12) (R4, 0) (R5, 0) (R6, 0)
C2: (R1, 0) (R2, 0) (R3, 0) (R4, 0) (R5, 0) (R6, 0)
C3: (R1, 1/18) (R2, 1/18) (R3, 0) (R4, 0) (R5, 1/18) (R6, 0)
C4: (R1, 0) (R2, 0) (R3, 0) (R4, 0) (R5, 1/12) (R6, 1/12)
C5: (R1, 0) (R2, 0) (R3, 0) (R4, 1/12) (R5, 0) (R6, 1/12)
C6: (R1, 0) (R2, 0) (R3, 0) (R4, 1/6) (R5, 0) (R6, 0)

Group by Row Id: the partial products are then regrouped by row id for Stage 2.
PageRank using MapReduce – PEGASUS (7/7)
Stage 2 Reduce(): performs the combineAll operation to sum up the partial values for each row id and assigns the result as the new vector
Apache Hadoop and Spark: Introduction and Use Cases for Data Analysis, Afzal Godil,
Information Access Division, ITL, NIST