Chapter 3
Similarity Search
Hadoop File System
Spark
Hypothesis Testing
Streaming
Graph Analysis
MapReduce
Recommendation Systems
TensorFlow
Deep Learning
Big Data Analytics, The Class
Classical Data Analytics
[Diagram: a single machine: CPU, Memory (64 GB), Disk]
IO Bounded
Reading a word from disk versus main memory: ~10⁵× slower!
Reading many contiguously stored words is faster per word, but fast modern disks still only reach ~150 MB/s for sequential reads.
Classical Big Data
Classical focus: efficient use of disk (e.g. Apache Lucene / Solr).
Classical limitation: still bounded when needing to process all of a large file.
[Diagram: single machine: CPU, Memory, Disk]
How to solve?
Classical limitation: still bounded when needing to process all of a large file.
Distributed Architecture
[Diagram: a top-level switch (~10 Gbps) connects per-rack switches (~1 Gbps each); Rack 1, Rack 2, ... each hold many nodes]
Distributed Architecture
In reality, modern setups often have multiple CPUs and disks per server, but we will model them as if there were one machine per CPU-disk pair.
Distributed Architecture (Cluster)
[Diagram: the same rack/switch topology: a ~10 Gbps switch over per-rack ~1 Gbps switches]
Distributed Architecture (Cluster)
Challenges for IO Cluster Computing:
1. Nodes fail: about 1 in 1000 nodes fails each day.
→ Duplicate the data.
2. The network is a bottleneck: typically 1-10 Gb/s throughput.
→ Bring computation to the nodes, rather than data to the nodes.
3. Traditional distributed programming is often ad-hoc and complicated.
→ Stipulate a programming system that can easily be distributed.
Distributed Filesystem
Distributed Filesystem
Characteristics for Big Data Tasks:
Large files (i.e. >100 GB to TBs)
Reads are most common
No need to update in place (append preferred)
Distributed Filesystem
[Diagram: two files, C and D (https://opensource.com/life/14/8/intro-apache-hadoop-big-data)]
Distributed Filesystem
[Diagram: files C and D split into chunks C0-C5 and D0-D5, distributed and replicated across the cluster's data nodes]
Distributed Filesystem
Chunk servers (on Data Nodes)
File is split into contiguous chunks
Typically each chunk is 16-64 MB
Each chunk is replicated (usually 2x or 3x)
Try to keep replicas in different racks
Components of a Distributed Filesystem
Chunk servers (on Data Nodes)
File is split into contiguous chunks
Typically each chunk is 16-64 MB
Each chunk is replicated (usually 2x or 3x)
Try to keep replicas in different racks
Name node (aka master node)
Stores metadata about where files are stored
Might be replicated or distributed across data nodes
Client library for file access
Talks to the master to find chunk servers
Connects directly to chunk servers to access data
(A sketch of how these three components interact follows.)
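A minimal in-memory sketch of the three components in Python. The class and method names (NameNode, ChunkServer, get_chunk_locations, etc.) are illustrative assumptions, not an actual HDFS/GFS API:

class ChunkServer:                       # runs on a data node
    def __init__(self):
        self.chunks = {}                 # chunk_id -> bytes
    def read_chunk(self, chunk_id):
        return self.chunks[chunk_id]

class NameNode:                          # the master: metadata only
    def __init__(self):
        self.files = {}                  # path -> [(chunk_id, [replica servers])]
    def get_chunk_locations(self, path):
        return self.files[path]

class Client:                            # client library for file access
    def __init__(self, name_node):
        self.name_node = name_node
    def read(self, path):
        data = []
        # 1. ask the master where the chunks live
        for chunk_id, replicas in self.name_node.get_chunk_locations(path):
            # 2. read each chunk directly from a chunk server,
            #    falling back to another replica if one has failed;
            #    no file data ever flows through the name node
            for server in replicas:
                try:
                    data.append(server.read_chunk(chunk_id))
                    break
                except KeyError:
                    continue
        return b"".join(data)

# tiny demo: one file, two chunks; s2 lost its copy of C1,
# so the client falls back to the replica on s1
s1, s2 = ChunkServer(), ChunkServer()
s1.chunks['C0'] = b'hello '; s2.chunks['C0'] = b'hello '
s1.chunks['C1'] = b'world'
nn = NameNode(); nn.files['/f'] = [('C0', [s1, s2]), ('C1', [s2, s1])]
print(Client(nn).read('/f'))             # b'hello world'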
What is MapReduce
"|" is the Linux "pipe" symbol: it passes stdout from the first process to stdin of the next.
What is MapReduce
Map: extract what you care about.
Sort and shuffle.
Reduce: aggregate, summarize.
What is MapReduce
Easy as 1, 2, 3!
Step 1: Map → Step 2: Sort / Group-by → Step 3: Reduce
(2) The Sort / Group-by Step
What is MapReduce
Group by key (system handles):
(k1', v1'), (k2', v2'), ... → (k1', (v1', v1'', ...)), (k2', (v2', v2'', ...)), ...
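The group-by step can be sketched in a few lines of Python, assuming the mappers have already emitted their (key, value) pairs (sort_and_shuffle is an illustrative name, not a real MapReduce API):

from itertools import groupby
from operator import itemgetter

def sort_and_shuffle(pairs):
    # group (key, value) pairs by key, as the system does
    # between the map and reduce steps
    pairs = sorted(pairs, key=itemgetter(0))
    return [(k, [v for _, v in group])
            for k, group in groupby(pairs, key=itemgetter(0))]

# sort_and_shuffle([('a', 1), ('b', 1), ('a', 1)])
#   -> [('a', [1, 1]), ('b', [1])]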
Example: Word Count
tokenize(document) | sort | uniq -c
Map: extract what you care about. → Sort and shuffle. → Reduce: aggregate, summarize.
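On a single machine the whole pipeline fits in a few lines of Python; a sketch assuming a naive whitespace tokenizer, with Counter playing the role of sort | uniq -c:

from collections import Counter

def tokenize(document):
    return document.lower().split()      # naive tokenizer (assumption)

def word_count(document):
    # single-machine equivalent of: tokenize(document) | sort | uniq -c
    return Counter(tokenize(document))

# word_count("the quick brown fox jumps over the lazy dog")['the'] -> 2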
Example: Word Count
[Figure: word count over document chunks, as map, sort/shuffle, and reduce (Leskovec et al., 2014; http://www.mmds.org/)]
The programmer supplies just these two methods; the system handles the rest:

@abstractmethod
def map(k, v):      # k: input key (e.g. chunk id); v: input value (e.g. chunk text)
    pass            # emits (k', v') pairs

@abstractmethod
def reduce(k, vs):  # k: an intermediate key; vs: all values grouped under k
    pass            # emits the aggregated result
Example: Word Count (v1)
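A minimal sketch of what version 1 looks like: emit (word, 1) per token and let reduce sum the ones (assumes the tokenize helper sketched above):

def map(k, v):
    # v is a chunk of text; emit (word, 1) for every token
    return [(w, 1) for w in tokenize(v)]

def reduce(k, vs):
    # vs is the list of 1s collected for word k by sort/shuffle
    return (k, sum(vs))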
Example: Word Count (v2)

def map(k, v):
    counts = dict()
    for w in tokenize(v):        # count each word within the chunk
        try:                     # (try/except is faster than
            counts[w] += 1       #  "if w in counts")
        except KeyError:
            counts[w] = 1
    return counts.items()        # emit one (word, count) pair per chunk

def reduce(k, vs):
    return (k, sum(vs))          # sum the per-chunk counts
Distributed Architecture (Cluster)
Challenges for IO Cluster Computing:
1. Nodes fail: about 1 in 1000 nodes fails each day.
→ Duplicate the data (Distributed FS).
2. The network is a bottleneck: typically 1-10 Gb/s throughput.
→ Bring computation to the nodes, rather than data to the nodes (Sort and Shuffle).
3. Traditional distributed programming is often ad-hoc and complicated.
→ Stipulate a programming system that can easily be distributed (MapReduce).
Example: Relational Algebra
Select
Project
Union, Intersection, Difference
Natural Join
Grouping
Example: Relational Algebra
Select
R(A1, A2, A3, ...): relation R with attributes A*.
Return only those attribute tuples where condition C is true.

def map(k, v):
    # v is a list of attribute tuples: [(...,), (...,), ...]
    r = []
    for t in v:
        if satisfies_C(t):       # condition C as a predicate function
            r.append((t, t))     # ("t satisfies C" in the original pseudocode)
    return r

def reduce(k, vs):
    r = []
    for v in vs:
        r.append((k, v))         # pass matching tuples through unchanged
    return r
Example: Relational Algebra
Natural Join
Given R1(A, B) and R2(B, C), return Rjoin
-- the union of all pairs of tuples that match on the given attributes.

def map(k, v):
    # k is 'R1' or 'R2'; v is (A, B) for R1, (B, C) for R2
    # B is the matched attribute
    if k == 'R1':
        (a, b) = v
        return (b, ('R1', a))    # key on the join attribute b
    if k == 'R2':
        (b, c) = v
        return (b, ('R2', c))

def reduce(k, vs):
    r1, r2, rjn = [], [], []
    for (s, x) in vs:            # separate values by source relation
        if s == 'R1':
            r1.append(x)
        else:
            r2.append(x)
    for a in r1:                 # join: all pairs sharing b (which is k)
        for c in r2:
            rjn.append(('Rjoin', (a, k, c)))   # k is b
    return rjn
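A toy end-to-end run, assuming the sort_and_shuffle helper sketched earlier (the data values are made up for illustration):

inputs = [('R1', (1, 'x')), ('R1', (2, 'y')), ('R2', ('x', 'cat'))]
mapped = [map(k, v) for k, v in inputs]      # [('x', ('R1', 1)), ...]
for b, vs in sort_and_shuffle(mapped):
    print(reduce(b, vs))
# b == 'x' prints [('Rjoin', (1, 'x', 'cat'))]; 'y' has no R2 match -> []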
Data Flow
The Programmer provides: map and reduce.
The system handles (keys are hash-partitioned to reducers):
● Partitioning
● Scheduling map / reducer execution
● Group by key
[Figures: data flow from input chunks through map tasks, hash partitioning in sort/shuffle, and reduce tasks]
Data Flow: Reduce Tasks
Version 1: few reduce tasks (same number of reduce tasks as nodes).
[Figure: nodes 1-5 over time; reduce tasks represented by time to complete; some tasks take much longer, so every node waits on the slowest task]
Version 2: more reduce tasks (more reduce tasks than nodes).
[Figure: the same work split into smaller tasks scheduled onto free nodes; the last task now completes much earlier]
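Why more, smaller tasks finish earlier can be seen with a toy greedy scheduler in Python (a sketch only; real MapReduce scheduling is more involved):

import heapq

def makespan(task_times, n_nodes):
    # greedy: each task goes to whichever node frees up first;
    # returns when the last node finishes
    nodes = [0.0] * n_nodes
    heapq.heapify(nodes)
    for t in task_times:
        heapq.heappush(nodes, heapq.heappop(nodes) + t)
    return max(nodes)

# the same 18 units of reduce work on 5 nodes:
print(makespan([10, 3, 2, 2, 1], 5))         # v1: 5 big tasks   -> 10.0
print(makespan([4, 3, 3, 3, 2, 2, 1], 5))    # v2: 7 smaller ones -> 4.0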
Communication Cost Model
Communication Cost: Natural Join
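As a sketch, following the treatment in (Leskovec et al., 2014; http://www.mmds.org/): the communication cost of an algorithm is the sum of the input sizes of all of its tasks, and for the natural join R1(A, B) ⋈ R2(B, C):
● Map tasks read both relations: |R1| + |R2| tuples.
● Reduce tasks read what map emits, one key-value pair per input tuple: another |R1| + |R2|.
● Total communication cost: O(|R1| + |R2|); the final output is not counted until it is consumed by a further task.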
MapReduce: Final Considerations
● Performance Refinements:
○ Combiners (like word count version 2, but done via reduce)
■ Run reduce right after map, on the same node, before passing results on to the reducers (the MapTask can execute it)
■ Reduces communication cost
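A minimal sketch of a combiner for word count, assuming the map/reduce pair from version 1 above; this works because summing is associative and commutative:

def combine(k, vs):
    # runs inside the MapTask, on that node's output only,
    # before anything crosses the network
    return (k, sum(vs))

# without a combiner: a chunk containing 'the' 1000 times sends ('the', 1) 1000 times
# with the combiner:  the node sends the single pair ('the', 1000)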