CSE545 sp20 (5) 3-3
● Similarity Search
● Hadoop File System
● Spark
● Hypothesis Testing
● Streaming
● Graph Analysis
● MapReduce
● Recommendation Systems
● Tensorflow
● Deep Learning
Limitations of Spark

Spark is fast for being so flexible:
● Fast: RDDs in memory + lazy evaluation: optimized chain of operations.
● Flexible: Many transformations -- can contain any custom code.

However:
● Hadoop MapReduce can still be better for extreme IO and for data that will not fit in memory across the cluster.
A multi-dimensional matrix. [Image: i.stack.imgur.com]
What is TensorFlow?
(Bakshi, 2016, "What is Deep Learning? Getting Started With Deep Learning")
(Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., ... & Ghemawat, S. (2016). TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467.)
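Since a tensor is just such a multi-dimensional matrix, here is a minimal illustration of tensors of increasing rank (the constant values and names are arbitrary; assumes the TensorFlow 1.x API used in these slides, though the same code works in 2.x):

import tensorflow as tf

# A tensor is a multi-dimensional matrix: rank 0 (scalar), rank 1 (vector),
# rank 2 (matrix), rank 3 and beyond (higher-dimensional arrays).
scalar = tf.constant(3.0)                          # shape: ()
vector = tf.constant([1.0, 2.0, 3.0])              # shape: (3,)
matrix = tf.constant([[1.0, 2.0], [3.0, 4.0]])     # shape: (2, 2)
cube   = tf.constant([[[1.0], [2.0]],
                      [[3.0], [4.0]]])             # shape: (2, 2, 1)
print(scalar.shape, vector.shape, matrix.shape, cube.shape)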
TensorFlow

A simpler example: a graph in which tensors a and b feed a matrix-multiply node, mm(A, B), that produces c:

c = tensorflow.matmul(a, b)
TensorFlow

Example: a graph computing

d = b + c
e = c + 2
a = d * e

session: defines the environment in which operations run (like a Spark context).
devices: the specific devices (CPUs or GPUs) on which to run the session.
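As a rough sketch of how this graph could be built and then run with the TensorFlow 1.x API shown on these slides (the constant values chosen for b and c are arbitrary):

import tensorflow as tf  # TensorFlow 1.x API (tf.compat.v1 in TF 2.x)

# Build the graph: nothing is computed yet, only the operations are recorded.
b = tf.constant(3.0, name="b")
c = tf.constant(4.0, name="c")
d = tf.add(b, c, name="d")        # d = b + c
e = tf.add(c, 2.0, name="e")      # e = c + 2
a = tf.multiply(d, e, name="a")   # a = d * e

# The session defines the environment (like a Spark context) in which the
# operations actually run, on whatever devices are available.
with tf.Session() as sess:
    print(sess.run(a))            # 42.0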
Ingredients of TensorFlow

● tensors* (*technically, operations that work with tensors)
  ○ variables - persistent, mutable tensors: tf.Variable(initial_value, name)
  ○ constants - constant tensors: tf.constant(value, type, name)
  ○ placeholders - tensors fed from data: tf.placeholder(type, shape, name)
  (*technically, these are still operations)
● operations
  ○ an abstract computation (e.g. matrix multiply, add)
  ○ executed by device kernels
● graph
● session: defines the environment in which operations run (like a Spark context)
● devices: the specific devices (CPUs or GPUs) on which to run the session
Ingredients of TensorFlow: the session

The session:
● Places operations on devices
● Stores the values of variables (when not distributed)
● Carries out execution: eval() or run()
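A minimal sketch putting these ingredients together, again assuming the TensorFlow 1.x API (the shapes, names, and values here are arbitrary):

import numpy as np
import tensorflow as tf  # TensorFlow 1.x API (tf.compat.v1 in TF 2.x)

# Ingredients: a variable (persistent, mutable), a constant, and a placeholder
# that will be fed from data at run time.
W = tf.Variable(tf.ones([3, 1]), name="W")                   # variable
bias = tf.constant(0.5, dtype=tf.float32, name="bias")       # constant
X = tf.placeholder(tf.float32, shape=(None, 3), name="X")    # placeholder

# Operations: a matrix multiply followed by an add, executed by device kernels.
y_pred = tf.matmul(X, W) + bias

with tf.Session() as sess:                 # the session runs the graph on devices
    sess.run(tf.global_variables_initializer())
    batch = np.random.rand(4, 3).astype(np.float32)
    print(sess.run(y_pred, feed_dict={X: batch}))  # feed the placeholder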
Distributed TensorFlow

Typical use-case:
Determine the weights, W, of a function, f, such that |ε| is minimized:

f(X|W) = Ŷ = Y + ε
ε = Ŷ - Y

[Figure: a data matrix with feature columns X1, X2, X3, ..., Xm and a target column Y; f is given weights w1, w2, ..., wp (typically, p >= m). Typically, f is very complex!]

W is determined through gradient descent: back-propagating error across the network that defines f. (rasbt)
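As a concrete toy illustration of these quantities, assuming a simple linear f and made-up numbers (everything below is illustrative, not from the slides):

import numpy as np

# Toy data: 4 examples, 3 features (X), observed targets (Y).
X = np.array([[1., 2., 3.],
              [4., 5., 6.],
              [7., 8., 9.],
              [1., 0., 1.]])
Y = np.array([14., 30., 51., 5.])

# A candidate weight vector W for a linear f(X|W) = X @ W.
W = np.array([1., 2., 3.])

Y_hat = X @ W          # Ŷ = f(X|W)
eps = Y_hat - Y        # ε = Ŷ - Y
print(Y_hat, eps, np.abs(eps).sum())  # gradient descent adjusts W to shrink |ε|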
Weights Derived from Gradients

In the standard linear equation, the prediction is a matrix multiply: Ŷ = XW.
Thus, the error gradient with respect to W can be written in closed form. [equations shown on slide]

How to update W?

1. Matrix Solution:
Pro: Easy; good for compute-bound problems. Con: Requires the data to fit in worker memories.
2. Distribute data:
  a. Each node finds parameters for a subset of the data
  b. Needs a mechanism for updating parameters:
    i. Centralized parameter server
    ii. Distributed All-Reduce
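For the "Matrix Solution", a minimal single-machine sketch using the standard least-squares normal equation (this is the textbook closed form under an assumed linear model, not code from the slides; the data are synthetic):

import numpy as np

# Toy data: N examples, m features, with a bias column of ones prepended.
N, m = 1000, 3
X = np.c_[np.ones((N, 1)), np.random.rand(N, m)]
true_W = np.array([0.5, 2.0, -1.0, 0.25])
Y = X @ true_W + 0.01 * np.random.randn(N)

# Closed-form "matrix solution": W = (XᵀX)⁻¹ XᵀY.
# Easy and good when compute-bound, but X (and XᵀX) must fit in memory.
W = np.linalg.solve(X.T @ X, X.T @ Y)
print(W)  # ≈ true_W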
Model Parallelism

Pro: Parameters can be localized. Con: High communication cost for transferring intermediate data.

[Figure: model parallelism - tensors transferred between Machine A (CPU:0, CPU:1) and Machine B (GPU:0)]
Data Parallelism: Distributing Data

[Figure: the N rows of (X, y) are split into consecutive batches of batch_size rows: rows 0 to batch_size-1, ..., rows N-batch_size to N.]

Learn parameters (i.e. weights) for each batch, given a graph with a cost function and optimizer: 𝛳batch0, 𝛳batch1, 𝛳batch2, ...

Combine the parameters, update the parameters of each node, and repeat.
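A simplified, framework-free sketch of this data-parallel loop (averaging per-batch parameter estimates is just one possible combine rule; real systems usually combine gradients through a parameter server or all-reduce, and the data, learning rate, and step count below are made up):

import numpy as np

def train_on_batch(theta, X_batch, y_batch, lr=0.5):
    """One gradient step of linear regression on a single node's batch."""
    grad = 2.0 / len(X_batch) * X_batch.T @ (X_batch @ theta - y_batch)
    return theta - lr * grad

# Toy data split into batches, one per (simulated) worker node.
N, m, batch_size = 1000, 3, 250
X = np.random.rand(N, m)
y = X @ np.array([2.0, -1.0, 0.5]) + 0.01 * np.random.randn(N)
batches = [(X[i:i + batch_size], y[i:i + batch_size])
           for i in range(0, N, batch_size)]

theta = np.zeros(m)
for step in range(500):
    # Each node updates its own copy of the parameters on its batch...
    local_thetas = [train_on_batch(theta, Xb, yb) for Xb, yb in batches]
    # ...then the parameters are combined and pushed back to every node.
    theta = np.mean(local_thetas, axis=0)
print(theta)  # ≈ [2.0, -1.0, 0.5]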
Gradient Descent for Linear Regression: Batch Gradient Descent
(Géron, 2017)
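A hedged reconstruction of what such a batch gradient descent example typically looks like, in the spirit of Géron (2017) but with made-up toy data, using the TensorFlow 1.x API:

import numpy as np
import tensorflow as tf  # assumes the TensorFlow 1.x API used in these slides

# Made-up toy data: m examples, 3 features, plus a bias column of ones.
m = 100
X_data = np.random.rand(m, 3).astype(np.float32)
y_data = X_data @ np.array([[2.0], [-1.0], [0.5]], dtype=np.float32) + 0.1

X = tf.constant(np.c_[np.ones((m, 1), dtype=np.float32), X_data], name="X")
y = tf.constant(y_data, name="y")
W = tf.Variable(tf.random_uniform([4, 1], -1.0, 1.0), name="W")

y_pred = tf.matmul(X, W, name="predictions")               # Ŷ = f(X|W)
error = y_pred - y                                          # ε = Ŷ - Y
mse = tf.reduce_mean(tf.square(error), name="mse")          # cost function
gradients = 2.0 / m * tf.matmul(tf.transpose(X), error)     # ∇W MSE (closed form)
training_op = tf.assign(W, W - 0.1 * gradients)             # W := W - η ∇W MSE

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(2000):
        sess.run(training_op)
    print(sess.run(W))   # ≈ [[0.1], [2.0], [-1.0], [0.5]] (bias first)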
Distributed TensorFlow
(Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., ... & Kudlur, M. (2016, November). TensorFlow: A system for large-scale machine learning. In OSDI (Vol. 16, pp. 265-283).)
Distributed TensorFlow

Distributed:
● Locally: across processors (CPUs, GPUs, TPUs)
● Across a cluster: multiple machines, each with multiple processors

Parallelisms:
● Data Parallelism: all nodes doing the same thing on different subsets of the data
● Graph/Model Parallelism: different portions of the model on different devices

Model Updates:
● Asynchronous Parameter Server
● Synchronous AllReduce (doesn't work with Model Parallelism)
Local Distribution
[Figure: Program 1 and Program 2 placing operations across devices on Machine A (CPU:0, CPU:1) and Machine B (GPU:0)]
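Placing operations on specific local devices is done with device scopes; a minimal sketch, assuming the TensorFlow 1.x API (device strings like "/cpu:0" and "/gpu:0" are the standard names, and the constants are arbitrary):

import tensorflow as tf  # TensorFlow 1.x API

# Pin individual operations to specific local devices.
with tf.device("/cpu:0"):
    a = tf.constant([[1.0, 2.0], [3.0, 4.0]], name="a")
with tf.device("/gpu:0"):
    b = tf.matmul(a, a, name="b")

# log_device_placement prints which device each op actually ran on;
# allow_soft_placement lets TensorFlow fall back to CPU if no GPU exists.
config = tf.ConfigProto(log_device_placement=True, allow_soft_placement=True)
with tf.Session(config=config) as sess:
    print(sess.run(b))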
Parallelisms

Model Parallelism: [Figure: different portions of the graph on different devices; tensors transferred between Machine A (CPU:0, CPU:1) and Machine B (GPU:0)]

Data Parallelism: [Figure: every node runs the same graph on a different subset of the data]
Asynchronous Parameter Server
[Figure: a "ps" job (task 0) holding the parameters, with "worker" jobs (task 0, task 1) sending asynchronous updates]
(Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., ... & Kudlur, M. (2016, November). TensorFlow: A system for large-scale machine learning. In OSDI (Vol. 16, pp. 265-283).)
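As a rough sketch of how such a ps/worker cluster is declared with the TensorFlow 1.x distributed API (the host:port addresses are placeholders; this only does anything when one process per task is actually started on a real cluster, and TF 2.x replaces this pattern with tf.distribute strategies):

import tensorflow as tf  # TensorFlow 1.x distributed API

# Hypothetical addresses; each process in the cluster is one task of one job.
cluster = tf.train.ClusterSpec({
    "ps":     ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})

# Each process announces which job/task it is, e.g. worker task 0:
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# replica_device_setter puts variables (the parameters) on the "ps" job and the
# rest of the graph on this worker; workers then push updates to the parameter
# server asynchronously.
with tf.device(tf.train.replica_device_setter(cluster=cluster)):
    W = tf.Variable(tf.zeros([3, 1]), name="W")
    X = tf.placeholder(tf.float32, shape=(None, 3), name="X")
    y_pred = tf.matmul(X, W)

with tf.Session(server.target) as sess:
    sess.run(tf.global_variables_initializer())
    # ...each worker would now run its training loop against the shared W.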
Summary