Hadoop
Outline
• Background
• Components and Architecture
• Running Examples
• Word Count
• PageRank
• Hands-on in Lab
Servers
◼ Servers are typically designed for reliability and to service a large number of requests
Compact Servers
◼ Organizations would like to conserve the amount of floor space
dedicated to their computer infrastructure
Racks
◼ Equipment (e.g., servers) is typically placed in racks
What is a Data Center?
◼ A facility used to house computer systems and components,
such as networking and storage systems, cooling, UPS, air
filters
Microsoft now has more than one million servers
Amazon, Facebook, Google, Twitter, New York Times, Yahoo! …. many more
Motivation: Google Example
◼ 20+ billion web pages x 20KB = 400+ TB
◼ 1 computer reads 30-35 MB/sec from disk
▪ ~4 months to read the web
◼ ~1,000 hard drives to store the web
◼ Takes even more to do something useful
with the data!
◼ Today, a standard architecture for such problems is
emerging:
▪ Cluster of commodity Linux nodes
▪ Commodity network (ethernet) to connect them
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
MapReduce
◼ Challenges:
▪ How to distribute computation?
▪ Distributed/parallel programming is hard
[Figure: the classical single-machine picture — one node with a CPU and memory running machine learning/statistics]
Hadoop Core Components
◼ The project includes these modules:
▪ Hadoop Common: The common utilities that support the other
Hadoop modules.
▪ Hadoop Distributed File System (HDFS™): A distributed file system
that provides high-throughput access to application data.
▪ Hadoop YARN: A framework for job scheduling and cluster resource
management.
▪ Hadoop MapReduce: A programming model for large scale data
processing.
Storage Infrastructure
◼ Problem:
▪ If nodes fail, how to store data persistently?
◼ Answer:
▪ Distributed File System:
▪ Provides global file namespace
▪ Google GFS; Hadoop HDFS;
◼ Typical usage pattern
▪ Huge files (100s of GB to TB)
▪ Data is rarely updated in place
▪ Reads and appends are common
Distributed File System
◼ Chunk servers
▪ File is split into contiguous chunks
▪ Typically each chunk is 64–128 MB (the default chunk size has grown over the years)
▪ Each chunk is replicated (usually 2x or 3x), as configured in hdfs-site.xml
▪ Try to keep replicas in different racks
◼ Master node
▪ a.k.a. Name Node in Hadoop’s HDFS
▪ Stores metadata about where files are stored
▪ Might be replicated
◼ Client library for file access
▪ Talks to master to find chunk servers
▪ Connects directly to chunk servers to access data
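Chunk size and replication factor are configurable in HDFS. A minimal hdfs-site.xml sketch (the property names are the standard HDFS ones; the values shown are illustrative choices, not required defaults):

```xml
<configuration>
  <!-- Block ("chunk") size: 128 MB, expressed in bytes -->
  <property>
    <name>dfs.blocksize</name>
    <value>134217728</value>
  </property>
  <!-- Number of replicas kept for each block -->
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
```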
Distributed File System
◼ Reliable distributed file system
◼ Data kept in “chunks” spread across machines
◼ Each chunk replicated on different machines
▪ Seamless recovery from disk or machine failure
[Figure: chunks C0–C5 and D0–D1 spread across chunk servers 1, 2, 3, ..., N, with each chunk stored on multiple servers]
Warm-up task:
◼ We have a huge text document
◼ Count the number of times each distinct word appears in the file
◼ Sample application:
▪ Analyze web server logs to find popular URLs
[Figure: MapReduce data flow — map tasks turn input key-value pairs into intermediate key-value pairs; a group-by-key step collects the values for each key into key-value groups; reduce tasks turn each group into output key-value pairs]
[Figure: word counting on a big document, read sequentially — the map phase emits a pair per word occurrence, e.g. (The, 1), (crew, 1), (space, 1); grouping and reducing by key yields counts such as (crew, 2) and (the, 3)]
MapReduce: Word Counting - Again
map(key, value):
  // key: document name; value: text of the document
  for each word w in value:
    emit(w, 1)

reduce(key, values):
  // key: a word; values: an iterator over counts
  result = 0
  for each count v in values:
    result += v
  emit(key, result)
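The pseudocode can be exercised end-to-end with a small in-memory simulation. This is a plain-Python sketch of the map → group-by-key → reduce pipeline, not actual Hadoop code:

```python
from collections import defaultdict

def map_fn(doc_name, text):
    # key: document name; value: text of the document
    for word in text.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # key: a word; values: an iterable of counts
    return (word, sum(counts))

def word_count(documents):
    # Map phase: emit (word, 1) for every word occurrence
    intermediate = []
    for name, text in documents.items():
        intermediate.extend(map_fn(name, text))
    # Shuffle phase: group all values by key
    groups = defaultdict(list)
    for word, count in intermediate:
        groups[word].append(count)
    # Reduce phase: sum the counts for each word
    return dict(reduce_fn(w, c) for w, c in groups.items())

docs = {"d1": "the crew of the space shuttle", "d2": "the space race"}
print(word_count(docs))
# {'the': 3, 'crew': 1, 'of': 1, 'space': 2, 'shuttle': 1, 'race': 1}
```

In real Hadoop the shuffle is performed by the framework between the map and reduce tasks; here it is a simple dictionary.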
Group by key: collect all pairs with the same key (hash merge, shuffle, sort, partition)
Reduce: collect all values belonging to the key and output
All phases are distributed with many tasks doing the work
Data Flow
◼ Input and final output are stored on a distributed file system
(FS):
▪ Scheduler tries to schedule map tasks “close” to physical storage
location of input data
◼ Often a Map task will produce many pairs of the form (k,v1), (k,v2), … for the same key k
▪ E.g., popular words in the word count example
◼ Can save network time by pre-aggregating values in the mapper:
▪ combine(k, list(v1)) -> v2
▪ Combiner is usually the same as the reduce function
◼ Works only if the reduce function is commutative and associative
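The saving can be seen in a small plain-Python sketch (assuming the word-count setting): running the reduce logic as a combiner inside each map task shrinks the number of pairs that must cross the network:

```python
from collections import Counter

def map_output(text):
    # Raw mapper output: one (word, 1) pair per word occurrence
    return [(w, 1) for w in text.split()]

def combine(pairs):
    # Combiner: pre-aggregate counts locally before the shuffle.
    # Valid here because addition is commutative and associative.
    agg = Counter()
    for word, count in pairs:
        agg[word] += count
    return list(agg.items())

raw = map_output("to be or not to be")
combined = combine(raw)
print(len(raw), len(combined))  # 6 4 — six raw pairs shrink to four pre-aggregated pairs
```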
Java MapReduce
◼ We need three things: a map function, a reduce function, and
some code to run the job.
◼ The map function is represented by the Mapper class, which
declares an abstract map() method.
◼ The reduce function is similarly defined using a Reducer class.
Computing PageRank using MapReduce
[Figure: a hyperlink with anchor text pointing from Page A to Page B]
PageRank
◼ PageRank is a numeric value that represents how important a
page is on the web.
◼ Webpage importance
▪ A link from page A to page B is a vote by A for B
▪ If page A is itself important, its vote for B carries more weight
▪ The more votes a page receives, the more important it must be
PageRank
◼ Importance Computation
▪ The importance of a page is distributed to the pages that it points to
▪ A page's importance is the aggregation of the importance shares of the pages that point to it
▪ If a page has 5 outlinks, its importance is divided into 5 and each link receives a one-fifth share
PageRank
◼ Each webpage Pi has a score, denoted by r(Pi), or called the pagerank
of Pi
◼ The value of r(Pi) is the sum of the normalized pageranks of all
webpages pointing into Pi
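Written as a formula (with B(Pi) the set of pages linking into Pi, and |Pj| the number of outlinks of page Pj):

```latex
r(P_i) = \sum_{P_j \in B(P_i)} \frac{r(P_j)}{|P_j|}
```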
Example
[Figure: pages P1–P4 with links into page P5]
Suppose the pageranks of P1, P2, P3, and P4 are known:

Pages    P1   P2   P3   P4
Pagerank 3.5  1.2  4.2  1.0
Computation of Pagerank
◼ Problem
▪ In the beginning, all pageranks are unknown. How to determine the
first pagerank value?
◼ Solution
▪ Give an initial pagerank to every webpage
▪ E.g., 1/n where n is the total number of webpages
▪ Perform the calculation of pagerank iteratively
▪ Use the pagerank formula to update the pagerank of every webpage
▪ Repeat the above step a number of times until the pagerank values are stable
(converge)
Iterative Procedure
◼ Let rk(Pi) be the PageRank of page Pi at
iteration k
▪ Starting with r0(Pi) = 1/n for all pages Pi
◼ At iteration k+1, the pagerank of every page Pi
is updated using the pageranks at iteration k
Example
◼ Consider the graph on the right [figure: a directed graph on six pages, labelled 1–6]
◼ Iteration 0
▪ r0(Pi) = 1/6 for i = 1, 2, ..., 6
◼ Iteration 1 [values computed in the figure]
Example (cont)
◼ After 20 iterations, the pages rank as follows:

Rank  Page
1     P4
2     P6
3     P5
4     P2
5, 6  P1, P3
Issue in PageRank Computation
Scalability Issue in PageRank Computation
Matrix and Vector notations – A Recap
Matrix Multiplication – A Recap
◼ An n × m matrix A multiplied with an r × s matrix B,
denoted as A × B (or just AB)
▪ Must have m = r, i.e., the number of columns of A = the
number of rows of B
◼ Let matrix C = A×B (or just AB)
▪ C will be an n × s matrix
◼ Let Aij, Bij, and Cij denote the entry at row i column j
of A, B and C, respectively
Cij = Σ(k = 1 to m) Aik × Bkj
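A direct transcription of this formula into plain Python (no NumPy), as a quick sanity check:

```python
def matmul(A, B):
    # A: n x m, B: m x s, each stored as a list of rows; returns C = A x B (n x s)
    n, m, s = len(A), len(B), len(B[0])
    assert all(len(row) == m for row in A), "columns of A must equal rows of B"
    # C[i][j] = sum over k of A[i][k] * B[k][j]
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(s)]
            for i in range(n)]

A = [[1, 2],
     [3, 4]]
B = [[5, 6],
     [7, 8]]
print(matmul(A, B))  # [[19, 22], [43, 50]]
```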
PageRank Vector
◼ For n pages, all the n PageRank values can be represented in a
1 × n row vector (denoted as π)
π = ( r(P1) r(P2) ... r(Pn) )
◼ Let π(k) denote the vector at iteration k of the iterative
procedure
◼ At iteration 0
▪ uniform initialization: π(0) = ( 1/n 1/n ... 1/n )
Example
◼ For the given graph, the matrix H is the row-normalised hyperlink matrix
▪ each non-zero row sums to 1
[Figure: the six-page graph and its matrix H]
Re-run the Example
◼ π(0) = (1/6 1/6 1/6 1/6 1/6 1/6)
∙ π(1) = π(0) H
▪ π(1)1 = 1/6 * 1/3 = .0556
▪ π(1)2 = 1/6 * 1/2 + 1/6 * 1/3 = .1389
▪ π(1)3 = 1/6 * 1/2 = .0833
▪ π(1)4 = 1/6 * 1/2 + 1/6 * 1 = .25
▪ π(1)5 = 1/6 * 1/3 + 1/6 * 1/2 = .1389
▪ π(1)6 = 1/6 * 1/2 + 1/6 * 1/2 = .1667
∙ π(1) = (.0556 .1389 .0833 .25 .1389 .1667)
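The iteration can be reproduced in plain Python. The matrix H below encodes the example graph as inferred from the worked values (edges 1→2, 1→3, 3→1, 3→2, 3→5, 4→5, 4→6, 5→4, 5→6, 6→4, with page 2 having no outlinks) — treat that edge list as an assumption read off the numbers:

```python
def next_rank(pi, H):
    # One step of the PageRank iteration: pi(k+1) = pi(k) H
    n = len(pi)
    return [sum(pi[i] * H[i][j] for i in range(n)) for j in range(n)]

# Row-normalised hyperlink matrix H for the six-page example graph
H = [
    [0,   1/2, 1/2, 0,   0,   0  ],  # page 1 -> 2, 3
    [0,   0,   0,   0,   0,   0  ],  # page 2 is dangling (no outlinks)
    [1/3, 1/3, 0,   0,   1/3, 0  ],  # page 3 -> 1, 2, 5
    [0,   0,   0,   0,   1/2, 1/2],  # page 4 -> 5, 6
    [0,   0,   0,   1/2, 0,   1/2],  # page 5 -> 4, 6
    [0,   0,   0,   1,   0,   0  ],  # page 6 -> 4
]

pi = [1/6] * 6                      # pi(0): uniform initialisation
pi1 = next_rank(pi, H)
print([round(x, 4) for x in pi1])   # [0.0556, 0.1389, 0.0833, 0.25, 0.1389, 0.1667]
pi2 = next_rank(pi1, H)
print([round(x, 4) for x in pi2])   # [0.0278, 0.0556, 0.0278, 0.2361, 0.1528, 0.1944]
```

Both vectors match the worked values on the slides.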
Re-run the Example
◼ π(1) = π(0) H = (.0556 .1389 .0833 .25 .1389 .1667)
◼ π(2) = π(1) H = (.0278 .0556 .0278 .2361 .1528 .1944)
◼ …
Problem…
◼ Rank sinks – pages that accumulate more and more pagerank at each iteration
▪ For the given graph, pages 4, 5 & 6 are the rank sinks, while pages 1, 2 & 3 get zero PageRank
▪ Why?
▪ E.g., after 20 iterations, π(20) = (0 0 0 .2667 .1333 .2)
◼ It is difficult to rank pages 1, 2 & 3 if they all have zero PageRank
Final PageRank
◼ Initial allocation: (1/6, 1/6, 1/6, 1/6, 1/6, 1/6)
◼ Final values: (0, 0, 0, 0.2667, 0.1333, 0.2)
PageRank for Vertex 2 – Explained
◼ 0.4 was passed to vertex 2 and is lost, as it wasn't passed on (from the second column)
0.167 + 0.139 + 0.056 + 0.023 + 0.009 + 0.004 + ... = 0.4
Adjustments to the Basic Setting
◼ Dangling pages
▪ Pages having no outlinks
◼ Any examples of dangling pages on the Web?
Adjustments to the Basic Setting
◼ Dangling page examples on the Web
▪ pdf files
▪ image files
▪ data tables
▪ pages with no hyperlinks, etc.
◼ The random surfer can't proceed forward from these pages
◼ Adjustment (teleporting)
▪ After entering a dangling node, the random surfer can now jump to any page at random (i.e., with equal probability)
Recall the Previous Example
π(k+1) = π(k) H
◼ π(0) = (1/6 1/6 1/6 1/6 1/6 1/6)
∙ π(1) = π(0) H = (.0556 .1389 .0833 .25 .1389 .1667)
Adjustments to the Basic Setting (1/2)
◼ The rows of the matrix H with all zeros are replaced by rows of (1/n 1/n ... 1/n)
◼ For the running example graph, the 2nd row of the hyperlink matrix is changed; H' is the modified matrix
π(k+1) = π(k) H'
Adjustments to the Basic Setting
◼ Allow teleporting to any page at any time
▪ With probability α, the random surfer follows one of the hyperlinks on the current page
▪ With probability 1−α, the random surfer randomly selects a page (out of the n pages) to teleport to
G = αH' + (1−α)(1/n)I
▪ I is an n×n matrix with all entries equal to 1
▪ Why do we need I?
▪ the surfer can jump to any page at random with probability 1−α
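A plain-Python sketch of building G and iterating to convergence. The six-page H' is assumed from the running example, with the dangling row (page 2) replaced by 1/n; the edge list behind it is inferred from the worked values:

```python
def google_matrix(H_prime, alpha):
    # G = alpha * H' + (1 - alpha) * (1/n) * I, where I is the all-ones matrix
    n = len(H_prime)
    return [[alpha * H_prime[i][j] + (1 - alpha) / n for j in range(n)]
            for i in range(n)]

def pagerank(G, tol=1e-10, max_iter=1000):
    # Power iteration: pi(k+1) = pi(k) G until the vector stops changing
    n = len(G)
    pi = [1 / n] * n
    for _ in range(max_iter):
        new = [sum(pi[i] * G[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(new, pi)) < tol:
            return new
        pi = new
    return pi

# Running-example H with the dangling row (page 2) replaced by 1/n
H_prime = [
    [0,   1/2, 1/2, 0,   0,   0  ],
    [1/6, 1/6, 1/6, 1/6, 1/6, 1/6],
    [1/3, 1/3, 0,   0,   1/3, 0  ],
    [0,   0,   0,   0,   1/2, 1/2],
    [0,   0,   0,   1/2, 0,   1/2],
    [0,   0,   0,   1,   0,   0  ],
]

G = google_matrix(H_prime, alpha=0.9)
pi = pagerank(G)
print(all(abs(sum(row) - 1) < 1e-9 for row in G))  # True: G is row-stochastic
print(all(x > 0 for x in pi))                      # True: every page now gets a positive rank
```

Unlike the unadjusted iteration, no page ends up with zero PageRank, because G has no zero entries.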
Example
◼ G = αH' + (1−α)(1/n)I
◼ Let α = 0.9
Google's Adjusted PageRank Method
π(k+1) = π(k) G
◼ With the Google matrix, the pagerank vector converges to a stable value
Meaning of α – The Damping Factor
◼ α is the probability that the surfer doesn't get bored and continues to click on page links
◼ 1−α is the probability of a jump to a random web page
◼ 1/(1−α) is the average number of steps before a jump to a random page
◼ α = 0.85 (Google): 1/(1−0.85) = 100/15 ≈ 6.67
The Damping Factor: Effect on Convergence

α      Number of iterations
0.5    34
0.75   81
0.8    104
0.85   142
0.9    219
0.95   449
0.99   2,292
0.999  23,015

◼ α should be relatively large to reduce the effect of random teleporting
◼ α = 0.85 is a balance between computation requirements and the quality of the output
G = αH' + (1−α)(1/n)I
PageRank using MapReduce
Recall the previous example: π(k+1) = π(k) H
◼ π(0) = (1/6 1/6 1/6 1/6 1/6 1/6)
∙ π(1) = π(0) H = (.0556 .1389 .0833 .25 .1389 .1667)
PageRank using MapReduce – PEGASUS
PEGASUS
◼ The first open-source peta-scale graph mining library
◼ Based on Hadoop
PEGASUS
◼ Runtime is linear in the number of edges
PEGASUS: Real World Applications
◼ PEGASUS can be useful for finding patterns, outliers, and
interesting observations
❑ Connected Components of Real Networks
❑ PageRanks of Real Networks
GIM-V
◼ Generalized Iterated Matrix-Vector multiplication
◼ Main idea: two MapReduce stages
▪ Stage 1: the combine2 operation combines columns of the matrix with rows of the vector
▪ Stage 2: combineAll aggregates the partial results from stage 1 and assigns the new vector to the old vector
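The two stages can be sketched as a MapReduce-style simulation in plain Python — an illustrative re-implementation, not PEGASUS code. Stage 1 joins each matrix cell with the vector entry for its column and multiplies; stage 2 groups the partial products by row id and sums them:

```python
from collections import defaultdict

def gimv_matvec(cells, vector):
    # cells: list of (row, col, value) matrix triples
    # vector: dict {index: value}
    # Stage 1 (combine2): key matrix cells by column id, join each with the
    # matching vector entry, and emit (row, m_rc * v_c) partial products
    partials = []
    for r, c, m in cells:
        partials.append((r, m * vector[c]))
    # Stage 2 (combineAll): group partial products by row id and sum them
    # to obtain the new vector
    result = defaultdict(float)
    for r, p in partials:
        result[r] += p
    return dict(result)

# 2x2 example: M = [[1, 2], [3, 4]] as (row, col, value) triples, v = (5, 6)
cells = [(0, 0, 1.0), (0, 1, 2.0), (1, 0, 3.0), (1, 1, 4.0)]
v = {0: 5.0, 1: 6.0}
print(gimv_matvec(cells, v))  # {0: 17.0, 1: 39.0}
```

One such matrix–vector pass corresponds to one PageRank iteration when the cells hold the transposed hyperlink matrix and the vector holds the current pageranks.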
GIM-V Base: Naïve Multiplication
PageRank using MapReduce (1/7)
Recall the previous example: π(k+1) = π(k) H
◼ π(0) = (1/6 1/6 1/6 1/6 1/6 1/6)
∙ π(1) = π(0) H = (.0556 .1389 .0833 .25 .1389 .1667)
PageRank using MapReduce – PEGASUS (2/7)
The hyperlink matrix is stored transposed, so that the update π(k+1) = π(k) H can be computed as the matrix–vector product Hᵀ v, with v = π(0)ᵀ:

Hᵀ                             v
0    0    1/3   0    0    0    1/6
1/2  0    1/3   0    0    0    1/6
1/2  0    0     0    0    0    1/6
0    0    0     0    1/2  1    1/6
0    0    1/3   1/2  0    0    1/6
0    0    0     1/2  1/2  0    1/6
PageRank using MapReduce – PEGASUS (3/7)
Stage 1 Map(): generates (key, value) pairs from the matrix and vector files
▪ A matrix cell (r1, c1, v1) (the value at row r1, column c1) is output by the mapper as (c1, (r1, v1)), i.e., keyed by column id
▪ A vector entry (c1, v1) is output by the mapper as (c1, v1)
Group by Column Id: each matrix value in column c is multiplied by the matching vector entry, giving partial products keyed by row id:

C1: (R1, 0) (R2, 1/12) (R3, 1/12) (R4, 0) (R5, 0) (R6, 0)
C2: (R1, 0) (R2, 0) (R3, 0) (R4, 0) (R5, 0) (R6, 0)
C3: (R1, 1/18) (R2, 1/18) (R3, 0) (R4, 0) (R5, 1/18) (R6, 0)
C4: (R1, 0) (R2, 0) (R3, 0) (R4, 0) (R5, 1/12) (R6, 1/12)
C5: (R1, 0) (R2, 0) (R3, 0) (R4, 1/12) (R5, 0) (R6, 1/12)
C6: (R1, 0) (R2, 0) (R3, 0) (R4, 1/6) (R5, 0) (R6, 0)

Group by Row Id: the partial products are then regrouped by row id for Stage 2.
PageRank using MapReduce – PEGASUS (7/7)
Stage 2 Reduce(): performs the combineAll operation to sum up the partial values for each row id and assigns the result as the new vector
Apache Hadoop and Spark: Introduction and Use Cases for Data Analysis, Afzal Godil,
Information Access Division, ITL, NIST