
CHAPTER 12
Distributing TensorFlow Across Devices and Servers
Multiple Devices on a Single Machine
To get started, we'll look at how to set up your environment to use multiple GPU cards on a single machine. This can often provide a major performance boost without the added complexity of a distributed setup. We'll cover the necessary installation steps, including downloading and configuring the CUDA and cuDNN libraries. Then we'll dive into how to place operations on specific devices, either manually with device blocks or by letting TensorFlow's placement algorithm (the simple placer) decide. We'll also discuss strategies for managing GPU memory to avoid running out of resources when running multiple TensorFlow processes on the same machine.
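A minimal sketch of both ideas, using the TensorFlow 1.x API (the 40% memory fraction is an illustrative value):

    import tensorflow as tf

    # Pin operations to specific devices with device blocks.
    with tf.device("/gpu:0"):
        a = tf.Variable(3.0, name="a")
    with tf.device("/gpu:1"):
        b = tf.Variable(4.0, name="b")
    c = a * b  # no device block: the placer decides where this runs

    # Cap this process at 40% of each GPU's memory so that several
    # TensorFlow processes can share the same cards.
    config = tf.ConfigProto()
    config.gpu_options.per_process_gpu_memory_fraction = 0.4
    config.allow_soft_placement = True  # fall back to CPU if no GPU kernel

    with tf.Session(config=config) as sess:
        sess.run(tf.global_variables_initializer())
        print(sess.run(c))  # 12.0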
Parallel Execution
Evaluating Source Nodes
When TensorFlow runs a graph, it first identifies the list of nodes that need to be evaluated and counts their dependencies. It then starts evaluating the nodes with zero dependencies, known as source nodes. If these nodes are placed on separate devices, they will be executed in parallel.
Leveraging Thread Pools
TensorFlow manages thread pools on each device to parallelize operations. Some operations also have multithreaded kernels that can use the intra-op thread pool to split a computation across multiple threads on the same device.
Controlling Parallelism
You can control the number of threads used in the inter-op and intra-op thread pools by setting the appropriate configuration options. This allows you to fine-tune the parallelism to match the characteristics of your hardware and workload.
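A minimal sketch of those options in the TensorFlow 1.x API (the thread counts are illustrative):

    import tensorflow as tf

    config = tf.ConfigProto(
        inter_op_parallelism_threads=2,  # threads running independent ops in parallel
        intra_op_parallelism_threads=4)  # threads one multithreaded kernel may use

    with tf.Session(config=config) as sess:
        pass  # operations run under the configured thread pools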
Multiple Devices Across Multiple Servers
Defining a Cluster
To run a TensorFlow graph across multiple servers, you first need to define a cluster. A cluster is composed of one or more TensorFlow servers, called tasks, typically spread across several machines. Each task belongs to a job, which is a named group of tasks with a common role, such as storing model parameters or performing computations.
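As a sketch, a two-job cluster might be specified like this (host names and ports are illustrative):

    import tensorflow as tf

    # One "ps" job to hold parameters, one "worker" job to compute.
    cluster_spec = tf.train.ClusterSpec({
        "ps":     ["machine-a.example.com:2221"],
        "worker": ["machine-a.example.com:2222",
                   "machine-b.example.com:2222"],
    })

    # Each task starts its own server, identified by job name and task index.
    server = tf.train.Server(cluster_spec, job_name="worker", task_index=0)
    server.join()  # block and serve requests (omit this in a client process)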
Placing Operations
You can use device blocks to pin operations on any device managed by any task in the cluster, specifying the job name, task index, device type, and device index. TensorFlow also provides the replica_device_setter() function to automatically distribute variables across parameter servers in a round-robin fashion.
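For instance, a sketch assuming a two-task "ps" job like the cluster spec above:

    import tensorflow as tf

    # Pin an operation to an exact device managed by a task in the cluster.
    with tf.device("/job:ps/task:0/cpu:0"):
        v = tf.Variable(0.0, name="v")

    # Or let replica_device_setter() distribute variables round-robin
    # across the parameter servers while other ops stay on the worker.
    with tf.device(tf.train.replica_device_setter(ps_tasks=2)):
        w1 = tf.Variable(0.0, name="w1")  # goes to /job:ps/task:0
        w2 = tf.Variable(0.0, name="w2")  # goes to /job:ps/task:1
        s = w1 + w2                       # not a variable: stays on /job:worker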
Sharing State
In a distributed setup, variable state is managed by resource containers located on the cluster, not by individual sessions. This allows multiple sessions to seamlessly share the same variables, even if they are connected to different servers. TensorFlow also provides queues and readers that can be shared across sessions to enable asynchronous data loading.
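A sketch of two clients sharing one variable through the cluster (the server addresses are illustrative):

    import tensorflow as tf

    # The variable's state lives in a resource container on the cluster,
    # keyed by its name, not inside either session.
    counter = tf.Variable(0, name="counter")
    increment = tf.assign_add(counter, 1)

    # Two sessions connected to two different servers of the same cluster
    # still see the same "counter".
    sess1 = tf.Session("grpc://machine-a.example.com:2222")
    sess2 = tf.Session("grpc://machine-b.example.com:2222")
    sess1.run(counter.initializer)
    sess1.run(increment)
    print(sess2.run(counter))  # 1: the state was shared through the cluster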
Efficient Data Loading
Preloading the Data
For datasets that can fit in memory, you can preload the training data into a variable and use that variable in your graph. This ensures the data is only transferred once from the client to the cluster, rather than being repeatedly loaded and fed through placeholders.
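A minimal sketch of this pattern (the random array stands in for a real training set):

    import numpy as np
    import tensorflow as tf

    X_train = np.random.rand(10000, 20).astype(np.float32)  # stand-in data

    # Non-trainable and in no collections, so optimizers and savers ignore it.
    X_init = tf.placeholder(tf.float32, shape=X_train.shape)
    X = tf.Variable(X_init, trainable=False, collections=[], name="X")

    with tf.Session() as sess:
        # The data crosses the client/cluster boundary exactly once, here.
        sess.run(X.initializer, feed_dict={X_init: X_train})
        # ...training ops can now read X with no feed_dict at each step...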

Reading Directly from the Graph
For larger datasets, you can use TensorFlow's reader operations to read data directly from the filesystem, without the data ever passing through the client. This allows you to build a data loading pipeline that runs in parallel with the training computations.
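A sketch of such a pipeline for CSV files (file names and column layout are illustrative):

    import tensorflow as tf

    # A queue of filenames feeds the reader; records never pass
    # through the client.
    filename_queue = tf.train.string_input_producer(["data_0.csv", "data_1.csv"])
    reader = tf.TextLineReader(skip_header_lines=1)
    key, value = reader.read(filename_queue)

    # Parse each line: here, two float features and an integer target.
    x1, x2, target = tf.decode_csv(value, record_defaults=[[0.], [0.], [0]])
    features = tf.stack([x1, x2])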

Multithreaded Readers
To further improve data loading throughput, you can use TensorFlow's Coordinator and QueueRunner classes to manage multiple threads that simultaneously read from multiple files and push the data into a shared queue.
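A sketch of the pattern; here the enqueue op pushes random vectors as a stand-in for real file reads:

    import tensorflow as tf

    # Five threads will run enqueue_op concurrently, filling a shared queue.
    instance_queue = tf.FIFOQueue(capacity=100, dtypes=[tf.float32], shapes=[[2]])
    enqueue_op = instance_queue.enqueue([tf.random_uniform([2])])
    queue_runner = tf.train.QueueRunner(instance_queue, [enqueue_op] * 5)

    mini_batch = instance_queue.dequeue_many(32)

    with tf.Session() as sess:
        coord = tf.train.Coordinator()
        threads = queue_runner.create_threads(sess, coord=coord, start=True)
        try:
            for step in range(10):
                batch = sess.run(mini_batch)  # consumes what the threads produce
        finally:
            coord.request_stop()  # signal every thread to stop...
            coord.join(threads)   # ...and wait for them to finish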

Convenience Functions
TensorFlow provides several convenience functions, such as string_input_producer() and shuffle_batch(), to simplify the creation of these asynchronous data loading pipelines.
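Combined, they collapse the hand-built queue and threads above to a few lines (file names, column layout, and queue sizes are illustrative):

    import tensorflow as tf

    filename_queue = tf.train.string_input_producer(["data_0.csv", "data_1.csv"])
    reader = tf.TextLineReader(skip_header_lines=1)
    _, value = reader.read(filename_queue)
    x1, x2, target = tf.decode_csv(value, record_defaults=[[0.], [0.], [0]])

    # shuffle_batch() adds a RandomShuffleQueue plus the threads that fill it.
    X_batch, y_batch = tf.train.shuffle_batch(
        [tf.stack([x1, x2]), target],
        batch_size=32, capacity=1000, min_after_dequeue=100)

    with tf.Session() as sess:
        coord = tf.train.Coordinator()
        threads = tf.train.start_queue_runners(sess, coord=coord)  # all runners
        X_val, y_val = sess.run([X_batch, y_batch])
        coord.request_stop()
        coord.join(threads)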


Parallelizing Neural Networks
One Network per Device
The simplest way to parallelize neural networks is to place each one on a different device, either on the same machine or across multiple machines in a cluster. This is perfect for hyperparameter tuning or serving a high volume of queries.
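For instance, a sketch of two candidate models with different hyperparameters, one per GPU (layer sizes are illustrative):

    import tensorflow as tf

    def build_network(X, n_hidden, scope):
        # One small independent network; nothing is shared between scopes.
        with tf.variable_scope(scope):
            hidden = tf.layers.dense(X, n_hidden, activation=tf.nn.relu)
            return tf.layers.dense(hidden, 1)

    X = tf.placeholder(tf.float32, shape=[None, 20])
    with tf.device("/gpu:0"):
        output_a = build_network(X, n_hidden=50, scope="candidate_a")
    with tf.device("/gpu:1"):
        output_b = build_network(X, n_hidden=100, scope="candidate_b")
    # The two networks have no dependencies on each other, so their
    # training ops can run in parallel.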
In-Graph Replication
For parallelizing the training of a large ensemble of neural networks, you can create a single graph containing all the networks, each placed on a different device, plus the computations needed to aggregate the individual predictions.
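A sketch of in-graph replication for a three-model ensemble (devices and layer sizes are illustrative):

    import tensorflow as tf

    X = tf.placeholder(tf.float32, shape=[None, 20])
    predictions = []
    for i in range(3):
        with tf.device("/gpu:%d" % i):              # one replica per device
            with tf.variable_scope("replica_%d" % i):
                hidden = tf.layers.dense(X, 50, activation=tf.nn.relu)
                predictions.append(tf.layers.dense(hidden, 1))

    # The aggregation lives in the same graph as all the replicas.
    ensemble_prediction = tf.reduce_mean(tf.stack(predictions), axis=0)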
Between-Graph Replication
Alternatively, you can create separate graphs for each neural network and coordinate their execution using queues, with one client handling the input distribution and another aggregating the outputs.
Scalable Performance
By leveraging the distributed capabilities of TensorFlow, you can achieve near-linear speedups when parallelizing neural network training or inference, allowing you to tackle much larger and more complex models.
Distributing Computations Efficiently
Placement Strategies
Carefully placing operations on the appropriate devices is crucial for optimizing performance. TensorFlow provides several options, from manual device blocks to automatic placement algorithms, allowing you to control the distribution of computations.
Controlling Parallelism
Managing the thread pools and controlling the degree of parallelism is important to avoid bottlenecks and ensure efficient utilization of your hardware resources. TensorFlow gives you knobs to tune the inter-op and intra-op parallelism.
Optimizing Data Flow
Minimizing data transfers between devices and servers is key to achieving high performance. Techniques like preloading data, using readers, and managing queues can help optimize the data flow and reduce communication overhead.
Coordinating Asynchronous Computations
Leveraging Queues
TensorFlow's queues provide a powerful mechanism for coordinating asynchronous computations, such as loading data in the background while training a model. Queues allow you to decouple the data pipeline from the training pipeline, improving overall throughput.
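A sketch of a queue hosted on the cluster, so one client can feed data while another trains (the device and server address are illustrative):

    import tensorflow as tf

    # Both clients build this queue; shared_name makes it the same queue.
    with tf.device("/job:ps/task:0"):
        queue = tf.FIFOQueue(capacity=10, dtypes=[tf.float32],
                             shapes=[[2]], shared_name="shared_q")

    # Producer client: keeps pushing training instances in the background.
    instance = tf.placeholder(tf.float32, shape=[2])
    enqueue_op = queue.enqueue([instance])
    with tf.Session("grpc://machine-a.example.com:2221") as sess:
        sess.run(enqueue_op, feed_dict={instance: [1.0, 2.0]})

    # A consumer client (typically a separate program) builds the same queue
    # and simply dequeues, so its training loop never waits on file I/O:
    # batch = queue.dequeue()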

Controlling Dependencies
Adding control dependencies between operations can help you postpone the execution of memory-intensive or communication-heavy computations until they are truly needed, allowing other operations to run in parallel and improving resource utilization.
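A minimal sketch; z stands in for a memory- or bandwidth-hungry operation:

    import tensorflow as tf

    x = tf.Variable(1.0)
    a = x + 2.0
    b = x + 3.0

    # z will not start until both a and b have finished, even though it
    # does not use their values; a and b remain free to run in parallel.
    with tf.control_dependencies([a, b]):
        z = tf.sqrt(x)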

Managing State
The distributed nature of TensorFlow's resource containers allows you to seamlessly share variables, queues, and other stateful objects across multiple sessions, simplifying the coordination of your distributed computations.

Leveraging Coordinators
TensorFlow's Coordinator and QueueRunner classes make it easier to manage the lifecycle of asynchronous threads, ensuring they start and stop gracefully and avoiding deadlocks or other concurrency issues.
Achieving Scalable Performance
Technique and benefits:
Distributing computations across multiple devices: reduces training time for large neural networks and allows exploring larger hyperparameter spaces.
Efficient data loading pipelines: ensures data is available when needed, without becoming a bottleneck.
Coordinating asynchronous computations: improves resource utilization and enables overlapping of data loading and training.
Leveraging TensorFlow's distributed capabilities: provides a scalable and flexible framework for building high-performance machine learning applications.
Conclusion
Unlocking the Power of Distributed Computing:
By mastering the techniques for distributing TensorFlow computations across devices and servers, you can unlock the true potential of your hardware resources and tackle much larger and more complex machine learning problems. Whether it's speeding up the training of neural networks, exploring a wider range of hyperparameters, or serving high volumes of queries, the distributed capabilities of TensorFlow provide a powerful and flexible foundation for building scalable, high-performance applications.

Embracing the Future of Distributed ML:
As the field of machine learning continues to evolve, the ability to efficiently leverage distributed computing resources will become increasingly important. By understanding the concepts and techniques covered in this chapter, you'll be well-equipped to build the next generation of powerful, scalable machine learning systems using TensorFlow.
