CSE545 sp20 (5) 3-3
● Similarity Search
● Hadoop File System
● Spark
● Hypothesis Testing
● Streaming
● Graph Analysis
● MapReduce
● Recommendation Systems
● Tensorflow
● Deep Learning
Limitations of Spark

Spark is fast for being so flexible:
● Fast: RDDs in memory + lazy evaluation: optimized chain of operations.
● Flexible: Many transformations -- can contain any custom code.

However:
● Hadoop MapReduce can still be better for extreme IO and for data that will not fit in memory across the cluster.
A multi-dimensional matrix. [Image: i.stack.imgur.com]
What is TensorFlow?
(Bakshi, 2016, "What is Deep Learning? Getting Started With Deep Learning")
(Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., ... & Ghemawat, S. (2016). TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467.)
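Since a tensor is just such a multi-dimensional matrix, here is a minimal illustration of tensors of increasing rank (the constant values and names are arbitrary; assumes the TensorFlow 1.x API used in these slides, though the same code works in 2.x):

import tensorflow as tf

# A tensor is a multi-dimensional matrix: rank 0 (scalar), rank 1 (vector),
# rank 2 (matrix), rank 3 and beyond (higher-dimensional arrays).
scalar = tf.constant(3.0)                          # shape: ()
vector = tf.constant([1.0, 2.0, 3.0])              # shape: (3,)
matrix = tf.constant([[1.0, 2.0], [3.0, 4.0]])     # shape: (2, 2)
cube   = tf.constant([[[1.0], [2.0]],
                      [[3.0], [4.0]]])             # shape: (2, 2, 1)
print(scalar.shape, vector.shape, matrix.shape, cube.shape)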
TensorFlow

A simpler example: a graph in which tensors a and b feed a matrix-multiply node, mm(A, B), that produces c:

c = tensorflow.matmul(a, b)
TensorFlow

Example: a graph computing

d = b + c
e = c + 2
a = d * e

session: defines the environment in which operations run (like a Spark context).
devices: the specific devices (CPUs or GPUs) on which to run the session.
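As a rough sketch of how this graph could be built and then run with the TensorFlow 1.x API shown on these slides (the constant values chosen for b and c are arbitrary):

import tensorflow as tf  # TensorFlow 1.x API (tf.compat.v1 in TF 2.x)

# Build the graph: nothing is computed yet, only the operations are recorded.
b = tf.constant(3.0, name="b")
c = tf.constant(4.0, name="c")
d = tf.add(b, c, name="d")        # d = b + c
e = tf.add(c, 2.0, name="e")      # e = c + 2
a = tf.multiply(d, e, name="a")   # a = d * e

# The session defines the environment (like a Spark context) in which the
# operations actually run, on whatever devices are available.
with tf.Session() as sess:
    print(sess.run(a))            # 42.0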
Ingredients of TensorFlow

● tensors* (*technically, operations that work with tensors)
  ○ variables - persistent, mutable tensors: tf.Variable(initial_value, name)
  ○ constants - constant tensors: tf.constant(value, type, name)
  ○ placeholders - tensors fed from data: tf.placeholder(type, shape, name)
  (*technically, these are still operations)
● operations
  ○ an abstract computation (e.g. matrix multiply, add)
  ○ executed by device kernels
● graph
● session: defines the environment in which operations run (like a Spark context)
● devices: the specific devices (CPUs or GPUs) on which to run the session
Ingredients of TensorFlow: the session

The session:
● Places operations on devices
● Stores the values of variables (when not distributed)
● Carries out execution: eval() or run()
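A minimal sketch putting these ingredients together, again assuming the TensorFlow 1.x API (the shapes, names, and values here are arbitrary):

import numpy as np
import tensorflow as tf  # TensorFlow 1.x API (tf.compat.v1 in TF 2.x)

# Ingredients: a variable (persistent, mutable), a constant, and a placeholder
# that will be fed from data at run time.
W = tf.Variable(tf.ones([3, 1]), name="W")                   # variable
bias = tf.constant(0.5, dtype=tf.float32, name="bias")       # constant
X = tf.placeholder(tf.float32, shape=(None, 3), name="X")    # placeholder

# Operations: a matrix multiply followed by an add, executed by device kernels.
y_pred = tf.matmul(X, W) + bias

with tf.Session() as sess:                 # the session runs the graph on devices
    sess.run(tf.global_variables_initializer())
    batch = np.random.rand(4, 3).astype(np.float32)
    print(sess.run(y_pred, feed_dict={X: batch}))  # feed the placeholder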
Distributed TensorFlow

Typical use-case:
Determine the weights, W, of a function, f, such that |ε| is minimized:

f(X|W) = Ŷ = Y + ε
ε = Ŷ - Y

[Figure: a data matrix with feature columns X1, X2, X3, ..., Xm and a target column Y; f is given weights w1, w2, ..., wp (typically, p >= m). Typically, f is very complex!]

W is determined through gradient descent: back-propagating error across the network that defines f. (rasbt)
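As a concrete toy illustration of these quantities, assuming a simple linear f and made-up numbers (everything below is illustrative, not from the slides):

import numpy as np

# Toy data: 4 examples, 3 features (X), observed targets (Y).
X = np.array([[1., 2., 3.],
              [4., 5., 6.],
              [7., 8., 9.],
              [1., 0., 1.]])
Y = np.array([14., 30., 51., 5.])

# A candidate weight vector W for a linear f(X|W) = X @ W.
W = np.array([1., 2., 3.])

Y_hat = X @ W          # Ŷ = f(X|W)
eps = Y_hat - Y        # ε = Ŷ - Y
print(Y_hat, eps, np.abs(eps).sum())  # gradient descent adjusts W to shrink |ε|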
Weights Derived from Gradients

In the standard linear equation, the prediction is a matrix multiply: Ŷ = XW.
Thus, the error gradient with respect to W can be written in closed form. [equations shown on slide]

How to update W?

1. Matrix Solution:
Pro: Easy; good for compute-bound problems. Con: Requires the data to fit in worker memories.
2. Distribute data:
  a. Each node finds parameters for a subset of the data
  b. Needs a mechanism for updating parameters:
    i. Centralized parameter server
    ii. Distributed All-Reduce
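For the "Matrix Solution", a minimal single-machine sketch using the standard least-squares normal equation (this is the textbook closed form under an assumed linear model, not code from the slides; the data are synthetic):

import numpy as np

# Toy data: N examples, m features, with a bias column of ones prepended.
N, m = 1000, 3
X = np.c_[np.ones((N, 1)), np.random.rand(N, m)]
true_W = np.array([0.5, 2.0, -1.0, 0.25])
Y = X @ true_W + 0.01 * np.random.randn(N)

# Closed-form "matrix solution": W = (XᵀX)⁻¹ XᵀY.
# Easy and good when compute-bound, but X (and XᵀX) must fit in memory.
W = np.linalg.solve(X.T @ X, X.T @ Y)
print(W)  # ≈ true_W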
Model Parallelism

Pro: Parameters can be localized. Con: High communication cost for transferring intermediate data.

[Figure: model parallelism - tensors transferred between Machine A (CPU:0, CPU:1) and Machine B (GPU:0)]
Data Parallelism: Distributing Data

[Figure: the N rows of (X, y) are split into consecutive batches of batch_size rows: rows 0 to batch_size-1, ..., rows N-batch_size to N.]

Learn parameters (i.e. weights) for each batch, given a graph with a cost function and optimizer: 𝛳batch0, 𝛳batch1, 𝛳batch2, ...

Combine the parameters, update the parameters of each node, and repeat.
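A simplified, framework-free sketch of this data-parallel loop (averaging per-batch parameter estimates is just one possible combine rule; real systems usually combine gradients through a parameter server or all-reduce, and the data, learning rate, and step count below are made up):

import numpy as np

def train_on_batch(theta, X_batch, y_batch, lr=0.5):
    """One gradient step of linear regression on a single node's batch."""
    grad = 2.0 / len(X_batch) * X_batch.T @ (X_batch @ theta - y_batch)
    return theta - lr * grad

# Toy data split into batches, one per (simulated) worker node.
N, m, batch_size = 1000, 3, 250
X = np.random.rand(N, m)
y = X @ np.array([2.0, -1.0, 0.5]) + 0.01 * np.random.randn(N)
batches = [(X[i:i + batch_size], y[i:i + batch_size])
           for i in range(0, N, batch_size)]

theta = np.zeros(m)
for step in range(500):
    # Each node updates its own copy of the parameters on its batch...
    local_thetas = [train_on_batch(theta, Xb, yb) for Xb, yb in batches]
    # ...then the parameters are combined and pushed back to every node.
    theta = np.mean(local_thetas, axis=0)
print(theta)  # ≈ [2.0, -1.0, 0.5]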
Gradient Descent for Linear Regression: Batch Gradient Descent
(Géron, 2017)
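A hedged reconstruction of what such a batch gradient descent example typically looks like, in the spirit of Géron (2017) but with made-up toy data, using the TensorFlow 1.x API:

import numpy as np
import tensorflow as tf  # assumes the TensorFlow 1.x API used in these slides

# Made-up toy data: m examples, 3 features, plus a bias column of ones.
m = 100
X_data = np.random.rand(m, 3).astype(np.float32)
y_data = X_data @ np.array([[2.0], [-1.0], [0.5]], dtype=np.float32) + 0.1

X = tf.constant(np.c_[np.ones((m, 1), dtype=np.float32), X_data], name="X")
y = tf.constant(y_data, name="y")
W = tf.Variable(tf.random_uniform([4, 1], -1.0, 1.0), name="W")

y_pred = tf.matmul(X, W, name="predictions")               # Ŷ = f(X|W)
error = y_pred - y                                          # ε = Ŷ - Y
mse = tf.reduce_mean(tf.square(error), name="mse")          # cost function
gradients = 2.0 / m * tf.matmul(tf.transpose(X), error)     # ∇W MSE (closed form)
training_op = tf.assign(W, W - 0.1 * gradients)             # W := W - η ∇W MSE

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(2000):
        sess.run(training_op)
    print(sess.run(W))   # ≈ [[0.1], [2.0], [-1.0], [0.5]] (bias first)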
Distributed TensorFlow
(Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., ... & Kudlur, M. (2016, November). TensorFlow: A system for large-scale machine learning. In OSDI (Vol. 16, pp. 265-283).)
Distributed TensorFlow

Distributed:
● Locally: across processors (CPUs, GPUs, TPUs)
● Across a cluster: multiple machines, each with multiple processors

Parallelisms:
● Data Parallelism: all nodes doing the same thing on different subsets of the data
● Graph/Model Parallelism: different portions of the model on different devices

Model Updates:
● Asynchronous Parameter Server
● Synchronous AllReduce (doesn't work with Model Parallelism)
Local Distribution
[Figure: Program 1 and Program 2 placing operations across devices on Machine A (CPU:0, CPU:1) and Machine B (GPU:0)]
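Placing operations on specific local devices is done with device scopes; a minimal sketch, assuming the TensorFlow 1.x API (device strings like "/cpu:0" and "/gpu:0" are the standard names, and the constants are arbitrary):

import tensorflow as tf  # TensorFlow 1.x API

# Pin individual operations to specific local devices.
with tf.device("/cpu:0"):
    a = tf.constant([[1.0, 2.0], [3.0, 4.0]], name="a")
with tf.device("/gpu:0"):
    b = tf.matmul(a, a, name="b")

# log_device_placement prints which device each op actually ran on;
# allow_soft_placement lets TensorFlow fall back to CPU if no GPU exists.
config = tf.ConfigProto(log_device_placement=True, allow_soft_placement=True)
with tf.Session(config=config) as sess:
    print(sess.run(b))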
Parallelisms

Model Parallelism: [Figure: different portions of the graph on different devices; tensors transferred between Machine A (CPU:0, CPU:1) and Machine B (GPU:0)]

Data Parallelism: [Figure: every node runs the same graph on a different subset of the data]
Asynchronous Parameter Server
[Figure: a "ps" job (task 0) holding the parameters, with "worker" jobs (task 0, task 1) sending asynchronous updates]
(Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., ... & Kudlur, M. (2016, November). TensorFlow: A system for large-scale machine learning. In OSDI (Vol. 16, pp. 265-283).)
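As a rough sketch of how such a ps/worker cluster is declared with the TensorFlow 1.x distributed API (the host:port addresses are placeholders; this only does anything when one process per task is actually started on a real cluster, and TF 2.x replaces this pattern with tf.distribute strategies):

import tensorflow as tf  # TensorFlow 1.x distributed API

# Hypothetical addresses; each process in the cluster is one task of one job.
cluster = tf.train.ClusterSpec({
    "ps":     ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})

# Each process announces which job/task it is, e.g. worker task 0:
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# replica_device_setter puts variables (the parameters) on the "ps" job and the
# rest of the graph on this worker; workers then push updates to the parameter
# server asynchronously.
with tf.device(tf.train.replica_device_setter(cluster=cluster)):
    W = tf.Variable(tf.zeros([3, 1]), name="W")
    X = tf.placeholder(tf.float32, shape=(None, 3), name="X")
    y_pred = tf.matmul(X, W)

with tf.Session(server.target) as sess:
    sess.run(tf.global_variables_initializer())
    # ...each worker would now run its training loop against the shared W.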
Summary