DOI:10.1145/3360307

Google’s TPU supercomputers train deep neural networks 50x faster than general-purpose supercomputers running a high-performance computing benchmark.

BY NORMAN P. JOUPPI, DOE HYUN YOON, GEORGE KURIAN, SHENG LI, NISHANT PATIL, JAMES LAUDON, CLIFF YOUNG, AND DAVID PATTERSON

A Domain-Specific Supercomputer for Training Deep Neural Networks
THE RECENT SUCCESS of deep neural networks (DNNs) has inspired a resurgence in domain-specific architectures (DSAs) to run them, partially as a result of the deceleration of microprocessor performance improvement due to the slowing of Moore’s Law.17 DNNs have two phases: training, which constructs accurate models, and inference, which serves those models. Google’s Tensor Processing Unit (TPU) offered 50x improvement in performance per watt over conventional architectures for inference.19,20 We naturally asked whether a successor could do the same for training. This article explores how Google built the first production DSA for the much harder training problem, first deployed in 2017.

Computer architects try to create designs that maximize performance on a set of benchmarks while minimizing costs, such as fabrication or operating cost.16 In the case of DSAs like Google’s TPUs, many of the principles and experiences from decades of building general-purpose CPUs change or do not apply. For example, here are features that the inference TPU (TPUv1) and the training TPU (TPUv2) share but are uncommon in CPUs:

˲ 1–2 large cores versus 32–64 small cores in server CPUs.
˲ The computational heavy lifting is handled by two-dimensional (2D) 128x128- or 256x256-element systolic arrays of multipliers per core, versus either a few scalar multipliers or SIMD (one-dimensional, 16–32-element) multipliers per core in CPUs.
˲ Using narrower data (8–16 bits) to improve efficiency of computation and memory versus 32–64 bits in CPUs.
˲ Dropping general-purpose features irrelevant for DNNs but critical for CPUs such as caches and branch predictors.

The most effective DNN training is supervised learning, where we start with a huge (sometimes billion-example) training dataset of known-correct (input, result) pairs. Pairs might be an image and what it depicts or an audio waveform and the phoneme it represents. We also start with a neural network model, which transforms the input into the result through an intensive calculation of weights (also called parameters); the weights are random initially. Models are typically defined as a graph of layers, where a layer contains a linear algebra part (often a matrix multiplication or convolution using the weights) followed by a nonlinear activation function (often a scalar function, applied elementwise; we call the results activations). Training “learns” weights that raise the likelihood of correctly mapping from input to result.

For some kinds of input data, an embedding at the start of the model transforms from sparse representations into a dense representation suitable for linear algebra; embeddings also contain weights.27,29 Embeddings might use vectors where features can be represented by notions of distance between vectors. Embeddings involve table lookups, link traversal, and variable-length data fields, so they are irregular and memory intensive.
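To make the table-lookup nature of embeddings concrete, here is a minimal NumPy sketch of a dense embedding lookup. The table size, feature ids, and the sum reduction are illustrative assumptions, not Google’s production embedding code.

    import numpy as np

    # Hypothetical sizes for illustration; production embedding tables are far larger.
    vocab_size, embed_dim = 100_000, 128
    rng = np.random.default_rng(0)
    table = rng.normal(size=(vocab_size, embed_dim)).astype(np.float32)  # trainable weights

    # A sparse input: variable-length lists of feature ids per example.
    ids_per_example = [[17, 42_301], [99_999], [5, 5, 23_456]]

    # Gather rows (irregular, memory-bound accesses) and reduce each list to one dense vector.
    dense = np.stack([table[ids].sum(axis=0) for ids in ids_per_example])
    print(dense.shape)  # (3, 128): dense activations ready for the linear-algebra layers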
How do we get from random initial weights to trained weights? Current best practices use variants of stochastic gradient descent (SGD).31 SGD consists of many iterations of three steps: forward propagation, backpropagation, and weight update. Forward propagation takes a randomly chosen training example, applies its inputs to the model, and runs the calculation through the layers to produce a result (which, with the random initial weights, is garbage the first time). Forward propagation is functionally similar to DNN inference, and if we were building an inference accelerator, we could stop there. For training, this is less than a third of the story. SGD next measures the difference or error between the model’s result and the known good result from the training set using a loss function. Then backpropagation runs the model in reverse, layer-by-layer, to produce a set of error/loss values for each layer’s output. These losses measure the deviation from the desired output. Last, weight update combines the input of each layer with the loss value to calculate a set of deltas—changes to weights—which, when added to the weights, would have resulted in nearly zero loss. Updates can have small magnitude. Shrinking further, updates are scaled down by the learning rate to keep SGD numerically stable. Moreover, a suite of algorithmic refinements—including momentum,30 batch normalization,18 and optimizers such as Adaptive Gradient (AdaGrad)14—require their own state and alter the SGD algorithm to reduce the number of steps to achieve desired accuracy.

Each SGD step makes a tiny adjustment to the weights that improves the model with respect to a single (input, result) pair. Each pass through the entire dataset is an epoch; DNNs typically take tens to hundreds of epochs to train. SGD gradually transforms the random initial weights into a trained model, sometimes capable of superhuman accuracy.
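The three SGD steps can be written down directly. The following toy NumPy sketch of a training loop uses a made-up two-layer model; the sizes, loss function, and learning rate are arbitrary illustrations meant only to show where forward propagation, backpropagation, and weight update fit, not how a TPU executes them.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy one-hidden-layer model; weights start random, as in the text.
    x = rng.normal(size=(32, 64)).astype(np.float32)        # one minibatch of inputs
    y = rng.normal(size=(32, 8)).astype(np.float32)         # known-correct results
    w1 = rng.normal(scale=0.1, size=(64, 128)).astype(np.float32)
    w2 = rng.normal(scale=0.1, size=(128, 8)).astype(np.float32)
    lr = 0.01                                               # learning rate

    for step in range(100):
        # 1. Forward propagation: run the layers to produce a result.
        h = np.maximum(x @ w1, 0.0)                         # matmul + ReLU activation
        out = h @ w2
        loss = 0.5 * np.sum((out - y) ** 2) / x.shape[0]    # loss vs. the known result

        # 2. Backpropagation: run the model in reverse, layer by layer,
        #    multiplying by transposed weight matrices to get per-layer errors.
        d_out = (out - y) / x.shape[0]
        d_w2 = h.T @ d_out
        d_h = (d_out @ w2.T) * (h > 0)                      # derivative of the activation
        d_w1 = x.T @ d_h

        # 3. Weight update: apply small deltas, scaled down by the learning rate.
        w1 -= lr * d_w1
        w2 -= lr * d_w2
        if step % 20 == 0:
            print(step, float(loss))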


Given this background, we can compare inference and training. Both share some computational elements including matrix multiplications, convolutions, and activation functions, so inference and training DSAs might have similar functional units. Key architectural aspects where the requirements differ include:

˲ Harder parallelization: Each inference is independent, so a simple cluster of servers with DSA chips can scale up inference. A training run iterates over millions of examples, coordinating across parallel resources because it must produce a single consistent set of weights for the model. The number of examples processed in parallel, and the time to evaluate that multiple-example minibatch—often shortened to batch—directly affect total end-to-end training time. A step is the computation to process one minibatch.
˲ More computation: Backpropagation requires derivatives for every computation in a model. It includes activation functions (some of which are transcendental), and multiplication by transposed weight matrices.
˲ More memory: Weight update accesses intermediate values from forward and back propagation, vastly upping storage requirements; temporary storage can be 10x weight storage. For inference, a small activation working set can usually be kept on chip.
˲ More programmability: Training algorithms and models are continually changing, so a machine restricted to current best-practice algorithms during design could rapidly become obsolete.
˲ Wider data: Quantized arithmetic—8-bit integer instead of 32-bit floating point (FP)—can work for inference, as in TPUv1, but reduced-precision training is an active research area.21,25 The challenge is sufficiently capturing the SGD sum of many small weight updates to preserve the accuracy of using 32-bit FP arithmetic to train models.

After explaining the TPUv2 architecture, we describe the domain-specific language (TensorFlow) and compiler (XLA) for TPUv2 and compare the architecture and technology choices for the TPUv2 versus a GPU, the most popular computer for DNN training. Later, we compare performance per chip and full supercomputers of TPUs and GPUs using production applications and the MLPerf benchmarks.

key insights
˽ With the slowing of Moore’s Law, ML breakthroughs require innovation in computer architecture.
˽ The increasing importance and appetite for ML training justifies its own custom supercomputer.
˽ The co-design of an ML-specific programming system (TensorFlow), compiler (XLA), architecture (TPU), floating-point arithmetic (Brain float16), interconnect (ICI), and chip (TPUv2/v3) let production ML applications scale at 96%–99% of perfect linear speedup, with 10x gains in performance/Watt over the most efficient general-purpose supercomputers.

Designing a Domain-Specific Supercomputer
In 2014, when the TPUv2 project began, the landscape for high-performance machine learning computation was very different from today. Training took place on clusters of CPUs. State-of-the-art parallel training used asynchronous SGD,12 in part to tolerate tail latencies in shared clusters. Parallel training also divided CPUs into a bipartite graph of workers (running the SGD loop) and parameter servers (hosting weights and adding updates to them).

The DNN training computation appetite appeared unlimited. (Indeed, the computation requirements for the largest training runs grew 10x annually from 2012 to 2018.2) Thus, in 2014 we chose to build a DSA supercomputer instead of clustering CPU hosts with DSA chips. The first reason is that training time is huge. Table 1 shows that one TPUv2 chip would take two to 16 months to train a single Google production application, so a typical application might want to use hundreds of chips. Second, DNN wisdom is that bigger datasets plus bigger machines lead to bigger breakthroughs. Moreover, results like AutoML use 50x more computation to find DNN models that achieve higher accuracy scores than the best models of human DNN experts.42

Table 1. Days to train production programs on one TPUv2 chip.
MLP0: 475   MLP1: 117   CNN0: 63   CNN1: 115   RNN0: 77   RNN1: 147

Designing a DSA supercomputer interconnect. The critical architecture feature of a modern supercomputer is how its chips communicate: what is the speed of a link; what is the interconnect topology; does it have centralized versus distributed switches; and so on. This choice is much easier for a DSA supercomputer, as the communication patterns are limited and known. For training, most traffic is an all-reduce over weight updates from all nodes of the machine.

If we distribute switch functionality into each chip rather than as a stand-alone unit, the all-reduction can be built in a dimension-balanced, bandwidth-optimal way for a 2D torus topology (see Figure 1). An on-device switch provides virtual-circuit, deadlock-free routing. To enable a 2D torus, the chip has four custom Inter-Core Interconnect (ICI) links, each running at 496Gbits/s per direction in TPUv2. ICI enables direct connections between chips to form a supercomputer using only 13% of each chip (see Figure 3). Direct links simplify rack-level deployment, but in a multi-rack system the racks must be adjacent.

Figure 1. A 2D-torus topology. TPUv2 uses a 16x16 2D torus.

One measure of an interconnect is its bisection bandwidth—the bandwidth available between two halves of a network for the worst-case split. The TPUv2 supercomputer uses a 16x16 2D torus (256 chips), which is 32 links x 496Gbits/s = 15.9 Terabits/s of bisection bandwidth. As a comparison, a separate Infiniband switch (used in CPU clusters) that connected 64 hosts (each with, say, four DSA chips) has 64 ports using “only” 100Gbit/s links and a bisection bandwidth of at most 6.4 Terabits/s. Our TPUv2 supercomputer provides 2.5x the bisection bandwidth of conventional cluster switches while skipping the cost of the Infiniband network cards, Infiniband switch, and the communication delays of going through the CPU hosts of clusters.
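The bisection-bandwidth numbers above follow from simple arithmetic; the short Python check below only reproduces the figures quoted in the text.

    # Recompute the bisection-bandwidth figures quoted in the text.
    chips_per_side = 16                 # 16x16 2D torus of TPUv2 chips
    link_gbits = 496                    # ICI link speed per direction (Gbit/s)

    # Cutting the torus in half severs one link per column in each wraparound
    # direction, so 2 * 16 = 32 links cross the cut.
    tpu_links_cut = 2 * chips_per_side
    tpu_bisection_tbits = tpu_links_cut * link_gbits / 1000
    print(f"TPUv2 torus bisection: {tpu_bisection_tbits:.1f} Tbit/s")        # ~15.9

    # Infiniband comparison from the text: a 64-port switch with 100 Gbit/s links.
    ib_bisection_tbits = 64 * 100 / 1000
    print(f"64-port 100G switch bisection: {ib_bisection_tbits:.1f} Tbit/s")  # 6.4
    print(f"Ratio: {tpu_bisection_tbits / ib_bisection_tbits:.1f}x")          # ~2.5x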
Fortuitously, building a fast interconnect inspired algorithmic advances. With dedicated hardware, and sharding the examples of a minibatch over nodes of the machine, there is little tail latency, and synchronous parallel training becomes possible. Internal studies5 suggested that synchronous training could beat asynchronous SGD with equivalent resources. Asynchronous training introduces heterogeneity plus parameter servers that eventually limit parallelization, as the weights get sharded and the bandwidth from parameter servers to workers becomes a bottleneck. Synchronous training eliminated the parameter servers, allowing peer-to-peer communication among workers and using the all-reduce to ensure workers begin and end each parallel step with consistent copies of weights.

Synchronous training has two phases in the critical path—a compute phase and a communication phase that reconciles the weights across learners. The slowest learners and slowest messages through the network limit performance of such a synchronous system. Since the communication phase is in the critical path, a fast interconnect that quickly reconciles weights across learners with well-controlled tail latencies is critical for fast training. The ICI network is key to the excellent TPU supercomputer scaling results; later we show 96%–99% of perfect linear scaleup.
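The ICI all-reduce implementation itself is not described here; the following NumPy sketch of a generic ring all-reduce (reduce-scatter followed by all-gather) only illustrates what the communication phase computes: every worker ends the step holding the identical sum of all workers’ gradients, which it then applies to its local copy of the weights.

    import numpy as np

    def ring_all_reduce(grads):
        # Generic ring all-reduce: reduce-scatter, then all-gather. `grads` holds
        # one gradient vector per worker; on return, every worker has the
        # element-wise sum of all of them. The real TPU all-reduce over ICI
        # differs in detail; this is only a teaching sketch.
        n = len(grads)
        chunks = [np.array_split(g.copy(), n) for g in grads]   # n chunks per worker

        # Reduce-scatter: after n-1 steps, worker w holds the full sum of chunk w.
        for step in range(n - 1):
            for w in range(n):
                src = (w - step - 1) % n      # chunk this worker forwards next
                chunks[(w + 1) % n][src] += chunks[w][src]

        # All-gather: circulate the completed chunks so every worker has every sum.
        for step in range(n - 1):
            for w in range(n):
                src = (w - step) % n
                chunks[(w + 1) % n][src] = chunks[w][src].copy()

        return [np.concatenate(c) for c in chunks]

    # Example: 4 workers, each with a local gradient; all end with the same sum.
    rng = np.random.default_rng(0)
    local = [rng.normal(size=8) for _ in range(4)]
    reduced = ring_all_reduce(local)
    assert all(np.allclose(r, sum(local)) for r in reduced)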
Designing a DSA supercomputer node. The TPUv2 node of the supercomputer followed the main ideas of TPUv1: a large two-dimensional matrix multiply unit (MXU) using a systolic array to reduce area and energy plus large, software-controlled on-chip memories instead of caches. The large MXUs of the TPUs rely on large batch sizes, which amortize memory accesses for weights—performance often increases when memory traffic reduces.

Shallue et al.32 examined the effect of increasing batch size on training time, and found three regions for all models (as seen in Table 2):
1. Perfect scaling region: Each doubling of batch size halves the number of training steps.
2. Diminishing returns region: Increasing batch size still reduces the number of steps, but more slowly.
3. Maximum data parallelism region: Increasing batch size provides no benefits whatsoever.

Table 2. Batch sizes for the three regions of Shallue.32 LM1B, Fashion MNIST, and Imagenet are standard DNN datasets.
Model                          Perfect   Diminishing   Maximum
Transformer on LM1B            ≤256      256–4096      ≥4096
Simple CNN on Fashion MNIST    ≤512      512–2048      ≥2048
ResNet-50 on Imagenet          ≤8192     8192–65536    ≥65536

Such scaling while preserving accuracy required tuning the learning rate, batch size, and other hyperparameters. Fortunately for TPUs, these recent results show that batch sizes of 256–8,192 scale perfectly without losing accuracy, which makes large MXUs an attractive option for high performance.

Unlike TPUv1, TPUv2 uses two cores per chip. Global wires on a chip don’t scale with shrinking feature size, so their relative delay increases. Given that training can use many processors, two smaller TensorCores per chip prevented the excessive latencies of a single large full-chip core. We stopped at two because it is easier to efficiently generate programs for two brawny cores per chip than numerous wimpy cores.

Figure 2. Block diagram of a TensorCore (our internal development name for a TPU core, and not related to the Tensor Cores of NVIDIA GPUs).

Figure 2 shows the six major blocks of a TensorCore and Figure 3 shows their placement in the TPUv2 chip:
1. Inter-Core Interconnect (ICI). Explained earlier.
2. High Bandwidth Memory (HBM). TPUv1 was memory bound for most of its applications.20 We solved its memory bottleneck by using High Bandwidth Memory (HBM) DRAM in TPUv2. It offers 20 times the bandwidth of TPUv1 by using an interposer substrate that connects the TPUv2 chip via thirty-two 128-bit buses to four short stacks of DRAM chips. Conventional servers support many more DRAM chips, but at a much lower bandwidth of at most eight 64-bit busses.
3. The Core Sequencer fetches VLIW (Very Long Instruction Word) instructions from the core’s on-chip, software-managed Instruction Memory (Imem), executes scalar operations using a 4K 32-bit scalar data memory (Smem) and 32 32-bit scalar registers (Sregs), and forwards vector instructions to the VPU. The 322-bit VLIW instruction can launch eight operations: two scalar, two vector ALU, vector load and store, and a pair of slots that queue data to and from the matrix multiply and transpose units. The XLA compiler schedules loading Imem via independent overlays of code, as, unlike conventional CPUs, there is no instruction cache.


4. The Vector Processing Unit (VPU) performs vector operations using a large on-chip vector memory (Vmem) with 32K 128 x 32-bit elements (16MiB), and 32 2D vector registers (Vregs) that each contain 128 x 8 32-bit elements (4 KiB). The VPU streams data to and from the MXU through decoupling FIFOs. The VPU collects and distributes data to Vmem via data-level parallelism (2D matrix and vector functional units) and instruction-level parallelism (8 operations per instruction).

Your beautiful DSA can fail if best-practice algorithms change, rendering it prematurely obsolete. We handled such a crisis during our design in 2015: supporting batch normalization.18 Briefly, batch normalization subtracts out the mean and divides by the standard deviation of a batch, making the values look like samples from the normal distribution. In practice, it both improves prediction accuracy and reduces time-to-train up to 14x! Batch normalization emerged early in 2015, and the results made it a must-do for us. We divided it into vector additions and multiplications over the batch, plus one inverse-square-root calculation. However, the vector operation count was high. We thus added a second SIMD dimension to our vector unit, making its registers and ALUs 128x8 (rather than just 1D 128-wide), and added an inverse square root operation to the transcendental unit.
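As a concrete view of that decomposition, here is a small NumPy sketch of batch normalization written as vector additions and multiplications over the batch plus the single inverse-square-root step. The epsilon, scale, and shift parameters are the usual ones and the sizes are made up.

    import numpy as np

    def batch_norm(x, gamma, beta, eps=1e-3):
        # Batch normalization as vector ops over the batch plus one rsqrt.
        mean = x.mean(axis=0)                        # vector adds (reduce over the batch)
        var = ((x - mean) ** 2).mean(axis=0)         # vector multiplies and adds
        inv_std = 1.0 / np.sqrt(var + eps)           # the single inverse-square-root step
        return gamma * (x - mean) * inv_std + beta   # elementwise scale and shift

    rng = np.random.default_rng(0)
    acts = rng.normal(loc=3.0, scale=2.0, size=(256, 128)).astype(np.float32)
    out = batch_norm(acts, gamma=np.ones(128, np.float32), beta=np.zeros(128, np.float32))
    print(out.mean(), out.std())   # ~0 and ~1: looks like samples from a normal distribution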

5. The MXU produces 32-bit FP products from 16-bit FP inputs that accumulate in 32 bits. All other computations are in 32-bit FP except for results going directly to an MXU input, which are converted to 16-bit FP.

The MXUs are large, but we reduced their size from 256x256 in TPUv1 to 128x128 and have multiple MXUs per chip. The bandwidth required to feed and obtain results from an MXU is proportional to its perimeter, while the computation it provides is proportional to its area. Larger arrays provide more compute per byte of interface bandwidth, but larger arrays can be inefficient. Simulations show that convolutional model utilization of four 128x128 MXUs is 37%–48%, which is 1.6x that of a single 256x256 MXU (22%–30%), yet the four MXUs take about the same die area. The reason is that some convolutions are naturally smaller than 256x256, so sections of the MXU would be idle. Sixteen 64x64 MXUs would have a little higher utilization (38%–52%) but would need more area. The reason is that MXU area is determined either by the logic for the multipliers or by the wires on its perimeter for the inputs, outputs, and control. In our technology, for 128x128 and larger, the MXU’s area is limited by the multipliers, but the area for 64x64 and smaller MXUs is limited by the I/O and control wires.
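A back-of-the-envelope sketch of that area-versus-perimeter argument: compute grows with the array’s area while the operands that must cross its edges grow with its perimeter, so the multiply-accumulates available per operand fed roughly doubles with each doubling of the array dimension. The sketch ignores control wiring and the utilization effects just described.

    # MACs per cycle grow with area (n*n); operands fed per cycle grow with the
    # perimeter (one per row plus one per column), so the ratio scales as n/2.
    for n in (64, 128, 256):
        macs_per_cycle = n * n
        operands_per_cycle = 2 * n
        print(n, macs_per_cycle / operands_per_cycle)   # 32.0, 64.0, 128.0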
6. The Transpose Reduction Permute Unit does 128x128 matrix transposes, reductions, and permutations of the VPU lanes.

Figure 3. TPUv2 chip floor plan. It has two TensorCores; node fabric data and NF controllers move on-chip data.

Alternative DSA supercomputer node designs. The TPUv1 article evaluated hypothetical alternatives that examined the changes in performance while varying the MXU size, clock rate, and the memory bandwidth.20 We need not hypothesize here, as we implemented and deployed two versions of the training architecture: TPUv2 and TPUv3. TPUv3 has ≈1.35x the clock rate, ICI bandwidth, and memory bandwidth plus twice the number of MXUs, so peak performance rises 2.7x. Liquid cools the chip to allow 1.6x more power. We also expanded the TPUv3 supercomputer to 1,024 chips (see Figure 4). Table 3 lists key features of the three TPU generations along with a contemporary GPU (NVIDIA Volta) that we’ll compare to below.

Figure 4. A TPUv2 supercomputer has up to 256 chips and is 18-ft. long (top). A TPUv3 supercomputer consisting of up to 1,024 chips (below) is about 7-ft. tall and 36-ft. long. A TPUv2 board (center) holds four air-cooled chips and a TPUv3 board (right) also has four chips but uses liquid cooling.

Table 3. Key processor features. We cannot reveal technology details of our chip partner. Although it is in a larger, older technology, the TPUv2 die size is less than 3/4s of the GPU. TPUv3 is 6% larger in that same technology. TDP stands for Thermal Design Power. The Volta has 80 symmetric multiprocessors.

Feature                            TPUv1         TPUv2               TPUv3                Volta
Peak TeraFLOPS/Chip                92 (8b int)   46 (16b), 3 (32b)   123 (16b), 4 (32b)   125 (16b), 16 (32b)
Network links x Gbits/s/Chip       --            4 x 496             4 x 656              6 x 200
Max chips/supercomputer            --            256                 1024                 Varies
Peak PetaFLOPS/supercomputer       --            11.8                126                  Varies
Bisection Terabits/supercomputer   --            15.9                42.0                 Varies
Clock Rate (MHz)                   700           700                 940                  1530
TDP (Watts)/Chip                   75            280                 450                  450
TDP (Kwatts)/supercomputer         --            124                 594                  Varies
Die Size (mm²)                     <331          <611                <648                 815
Chip Technology                    28nm          >12nm               >12nm                12nm
Memory size (on-/off-chip)         28MiB/8GiB    32MiB/16GiB         32MiB/32GiB          36MiB/32GiB
Memory GB/s/Chip                   34            700                 900                  900
MXUs/Core                          1             1                   2                    8
MXU Size                           256x256       128x128             128x128              4x4
Cores/Chip                         1             2                   2                    80
Chips/CPU Host                     4             4                   8                    8 or 16
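The peak 16-bit figures in Table 3 can be rebuilt from the block counts it lists (cores per chip, MXUs per core, MXU dimension, and clock), counting a multiply-accumulate as two floating-point operations. The short Python check below is only that arithmetic, not a performance model.

    # Reproduce the peak 16-bit TFLOPS figures in Table 3 from the block counts.
    def peak_tflops(cores, mxus_per_core, dim, clock_ghz):
        macs = cores * mxus_per_core * dim * dim     # multiply-accumulators per chip
        return macs * 2 * clock_ghz / 1e3            # 2 FLOPs per MAC, clock in GHz

    print(peak_tflops(cores=2, mxus_per_core=1, dim=128, clock_ghz=0.700))  # ~45.9 (TPUv2: 46)
    print(peak_tflops(cores=2, mxus_per_core=2, dim=128, clock_ghz=0.940))  # ~123.2 (TPUv3: 123)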
The TPUv3 die size is only 6% larger than TPUv2 in the same technology despite having twice as many MXUs per TensorCore, simply because the engineers had a better idea beforehand of the layout challenges of the major blocks in TPUv2, which led to a more efficient floor plan for TPUv3.

Designing DSA supercomputer arithmetic. Peak performance is ≥8x higher when using 16-bit FP instead of 32-bit FP for matrix multiply (see Table 3), so it’s vital to use 16-bit to get highest performance. While we could have built an MXU using standard IEEE fp16 and fp32 floating point formats (see Figure 5), we first checked the accuracy of 16-bit operations for DNNs. We found that:

˲ Matrix multiplication outputs and internal sums must remain in fp32.
˲ The 5-bit exponent of fp16 matrix multiplication inputs leads to failure of computations that go outside its narrow range, which the 8-bit exponent of fp32 avoids.
˲ Reducing the matrix multiplication input mantissa size from fp32’s 23 bits to 7 bits did not hurt accuracy.


The resulting brain floating format (bf16), shown in Figure 5, keeps the same 8-bit exponent as fp32. Given the same exponent size, there is no danger of losing the small update values to FP underflow of a smaller exponent, so all programs in this article used bf16 on TPUs without much difficulty. Beyond our experience that it works for training production applications, a recent Intel study corroborated its benefits.21 However, fp16 requires adjustments to training software (loss scaling) to deliver convergence and efficiency. Loss scaling preserves the effect of small gradients by scaling losses up to fit the smaller exponent range of fp16.26

Figure 5. IEEE FP and Brain float formats. All formats have an implicit leading mantissa bit in normal operation.
IEEE fp32: sign (1 bit), exponent (8 bits), mantissa (23 bits)
IEEE fp16: sign (1 bit), exponent (5 bits), mantissa (10 bits)
bf16:      sign (1 bit), exponent (8 bits), mantissa (7 bits)

As the size of an FP multiplier scales with the square of the mantissa width, the bf16 multiplier is half the size and energy of an fp16 multiplier: 8² / 11² ≈ 0.5 (accounting for the implicit leading mantissa bit). Bf16 delivers a rare combination: reducing hardware and energy while simplifying software by making loss scaling unnecessary. Thus, ARM and Intel have revealed future chips with bf16.
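bf16 behavior is easy to emulate on ordinary hardware by keeping only the top 16 bits of an fp32 value. The NumPy sketch below uses truncation for simplicity (real hardware typically rounds), mimics the MXU convention of 16-bit inputs with 32-bit accumulation, and shows the underflow difference that makes fp16, unlike bf16, need loss scaling.

    import numpy as np

    def to_bf16(x):
        # Emulate bf16 by zeroing the low 16 bits of fp32 (truncation, for simplicity).
        bits = np.asarray(x, dtype=np.float32).view(np.uint32)
        return (bits & np.uint32(0xFFFF0000)).view(np.float32)

    rng = np.random.default_rng(0)
    a = rng.normal(size=(128, 128)).astype(np.float32)
    b = rng.normal(size=(128, 128)).astype(np.float32)

    # MXU-style arithmetic: 16-bit inputs, products accumulated in 32-bit FP.
    c_bf16_in = to_bf16(a) @ to_bf16(b)        # inputs rounded to bf16, math done in fp32
    c_fp32 = a @ b
    print(np.max(np.abs(c_bf16_in - c_fp32)))  # modest error from the 7-bit input mantissas

    # bf16 keeps fp32's 8-bit exponent, so tiny gradients survive; fp16 underflows to 0.
    tiny = np.float32(1e-30)
    print(to_bf16(np.array([tiny]))[0], np.float16(tiny))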
Designing a DSA Supercomputer Compiler
The next step was getting software for our hardware. To program CPUs and GPUs for machine learning, a framework such as TensorFlow (TF)1 specifies the model and data operations machine-independently. TF is a domain-specific library built on Python. NVIDIA GPU-dependent work is supported by a combination of the CUDA language, the CuBLAS and CuDNN libraries, and the TensorRT system. TPUv2/v3s also use TF, with the new system XLA (for accelerated linear algebra) handling the TPU-dependent mapping. XLA also targets CPUs and GPUs. Like many systems that map from domain-specific languages to code, XLA integrates a high-level library and a compiler. A TF front end generates code in an intermediate representation for XLA.

It might seem difficult to get great performance from a programming system based on Python like TF. However, ML frameworks offer both a higher level of expressiveness and the potential for much better optimization information than lower-level languages like C++. TF programs are graphs of operations, where multi-dimensional array operations are first-class citizens:

˲ They operate on multi-dimensional arrays explicitly, rather than implicitly via nested loops as in C++.
˲ They use explicit, analyzable, and bounded data access patterns versus arbitrary access patterns like C++.
˲ They have known memory aliasing behavior, unlike C++.

These three factors allow the XLA compiler to safely and correctly transform programs in ways that traditional compilers rarely attain.

XLA does whole-program analysis and optimization. With 2D vector registers and compute units in TPUv2/v3, the layout of data in both compute units and memory is critical to performance, perhaps more than for a vector or SIMD processor. Building efficient code for vector machines, with 1D memory and compute units, is well understood. For the MXU, two 2D inputs interact to produce a 2D output. Each operand has a memory layout, which gets transformed into a layout in 2D registers, which in turn must be fed at the exact moment to meet systolic array timing in the MXU. (A systolic array reduces register accesses by choreographing data flowing from different directions to regularly arrive at cross points that combine them.) Depending on layout choices, the 2D register dimensions of 128 and 8 might not be filled, lowering ALU and memory utilization. Moreover, lacking caches, XLA manages all memory transfers, including code overlays and DMA pushes to remote nodes over ICI.
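To illustrate the choreography in that parenthetical, here is a toy, cycle-by-cycle NumPy simulation of one common systolic dataflow (output-stationary, with skewed operands entering from two edges). It is a teaching sketch, not the MXU’s actual design.

    import numpy as np

    def systolic_matmul(A, B):
        # Toy simulation of an output-stationary systolic array: A values flow
        # left-to-right, B values flow top-to-bottom, and each processing element
        # (PE) multiplies the pair passing through it and adds the product to a
        # local accumulator, so operands are reused across a whole row or column.
        M, K = A.shape
        K2, N = B.shape
        assert K == K2
        acc = np.zeros((M, N))      # one accumulator per PE (this becomes C)
        a_reg = np.zeros((M, N))    # A value currently held by each PE
        b_reg = np.zeros((M, N))    # B value currently held by each PE
        for t in range(K + M + N - 2):
            a_reg[:, 1:] = a_reg[:, :-1]        # shift A one PE to the right
            b_reg[1:, :] = b_reg[:-1, :]        # shift B one PE downward
            for i in range(M):                  # feed skewed A at the left edge
                k = t - i                       # row i is delayed by i cycles
                a_reg[i, 0] = A[i, k] if 0 <= k < K else 0.0
            for j in range(N):                  # feed skewed B at the top edge
                k = t - j
                b_reg[0, j] = B[k, j] if 0 <= k < K else 0.0
            acc += a_reg * b_reg                # every PE does one multiply-accumulate
        return acc

    rng = np.random.default_rng(0)
    A, B = rng.normal(size=(5, 7)), rng.normal(size=(7, 3))
    assert np.allclose(systolic_matmul(A, B), A @ B)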

XLA exploits the huge parallelism that an input TF dataflow graph represents. Beyond the parallelism of operations (“ops”) in a graph, each op can comprise millions of multiplications and additions on data tensors of millions of elements. XLA maps this abundant parallelism across hundreds of chips in a supercomputer, a few cores per chip, multiple units per core, and thousands of multipliers and adders inside each functional unit. The domain-specific TF language and XLA representation allow precise reasoning about memory use at every point in the program. There are no “aliasing” issues where the compiler must determine whether two pointers might address the same memory—every piece of memory corresponds to a known program variable or temporary. The XLA compiler is free to slice, tile, and lay out memory and operations to best use the on-chip memory bandwidth and to reduce the memory footprint on chip or off chip.

TPUs use a VLIW architecture to express instruction-level parallelism to the many compute units of a TensorCore. XLA uses standard VLIW compilation techniques including loop unrolling, instruction scheduling, and software pipelining to keep all compute units busy and to simultaneously move data through the memory hierarchy to feed them.

Given a memory layout of data, operator fusion can reduce memory use and boost performance. Fusion is a traditional compiler optimization—but applied now to 2D data—that combines ops to reduce memory traffic compared to executing operators sequentially. For example, fusing a matrix multiplication with a following activation function skips writing and reading the intermediate products from memory. Table 4 shows the speedup from the fusion optimization on 2D data is from 1.8 to 6.3.

Table 4. XLA speedup on TPUv2 with fusion versus without fusion.
MLP0: 1.8   MLP1: 2.0   CNN0: 2.2   CNN1: 4.8   RNN0: 2.4   RNN1: 1.8   SSD: 2.4   NMT: 3.0   Mask R-CNN: 2.0   Transformer: 2.0   ResNet-50: 6.3
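The XLA fusion pass itself is not shown here, but the memory-traffic effect it exploits can be sketched in NumPy: the unfused version writes and re-reads the whole matmul result, while the fused version applies the activation to each tile of rows while the tile is still small enough to stay on chip, never materializing the full intermediate. Shapes and tile size are arbitrary.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=(4096, 1024)).astype(np.float32)
    w = rng.normal(size=(1024, 1024)).astype(np.float32)

    def unfused(x, w):
        # Writes the full matmul result to memory, then reads it back for the activation.
        tmp = x @ w
        return np.maximum(tmp, 0.0)

    def fused(x, w, rows_per_tile=256):
        # Applies the activation tile by tile, so only a tile-sized intermediate exists.
        out = np.empty((x.shape[0], w.shape[1]), np.float32)
        for r in range(0, x.shape[0], rows_per_tile):
            out[r:r + rows_per_tile] = np.maximum(x[r:r + rows_per_tile] @ w, 0.0)
        return out

    assert np.allclose(unfused(x, w), fused(x, w))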
The TF intermediate form for XLA has thousands of ops. The number of ops increases when programmers cannot combine existing ops if composition is inefficient. Alas, expanding the number of ops is an engineering challenge, since software libraries need to be developed for CPUs, GPUs, and TPUs. The hope was that the XLA compiler could synthesize these thousands of ops from a smaller set of primitive ops.

The XLA team needed only 96 ops as the compiler’s target, reducing work for the library/compiler by enhancing composability. For example, XLA has a single op for convolution (kConvolution), letting the compiler handle all the memory layout variations. The TF intermediate form has nine; for example, Conv2D, Conv2dBackpropFilter, DepthwiseConv2dNative, and DepthwiseConv2dNativeBackpropFilter. For the CNN1 program, the XLA compiler fused 63 different operations with at least one kConvolution.

Since ML platforms and DSAs offered a new set of compiler challenges, it was unclear how fast they would improve. Table 5 shows the median gain over only six months for MLPerf from version 0.5 to 0.6 was 1.3x for GPUs and 2.1x for TPUs! (Perhaps the younger XLA compiler has more opportunity to improve than the more mature CUDA stack.) One reason for the large gain is the focus on benchmarks, but production applications also advanced. Increasing bf16 use, optimizing model architecture, and XLA generating better code sped up CNN0 by 1.8x in 15 months, and improving partitioning/placement for embeddings and XLA optimizations accelerated MLP0 by 1.65x.

Table 5. Speedup of MLPerf 0.6 over 0.5 in six months.
         ResNet50   SSD   MaskRCNN   NMT   Transformer   Median
Volta    1.3        1.2   1.8        1.0   2.0           1.3
TPUv3    1.4        1.4   3.5        2.1   3.0           2.1

Contrasting GPU and TPU Architectures
As details of TPU and GPU architectures are now public, let us compare TPU and GPU choices before we compare performance.

Multi-chip parallelization is built into TPUs through ICI and supported through all-reduce operations plumbed through XLA to TF. Similar-sized multi-chip GPU systems use a tiered networking approach, with NVIDIA’s NVLink inside a chassis and host-controlled InfiniBand networks and switches to tie multiple chassis together.

TPUs offer bf16 FP arithmetic designed for DNNs inside 128x128 systolic arrays, which halves the die area and energy versus IEEE fp16 FP multipliers. Volta GPUs have also embraced reduced-precision systolic arrays, with a finer granularity—4x4 or 16x16 depending on hardware or software descriptions—while using fp16 rather than bf16, so they may require software to perform loss scaling plus extra die area and energy.

TPUs are dual-core, in-order machines, where the XLA compiler overlaps computation, memory, and network activities. GPUs are latency-tolerant many-core machines, where each core has many threads and thus very large (20MiB) register files. Threading hardware plus CUDA coding conventions support overlapped operations.

TPUs use software-controlled 32MiB scratchpad memories that the compiler schedules, while Volta hardware manages a 6MiB cache and software manages a 7.5MiB scratchpad memory. The XLA compiler directs sequential DRAM accesses typical of DNNs via direct memory access (DMA) controllers on TPUs, while GPUs use multithreading plus coalescing hardware for them. Thottethodi and Vijaykumar35 concluded that when compared to TPUs: “[GPUs] incur high overhead in performance, area, and energy due to heavy multithreading which is unnecessary for DNNs which have prefetchable, sequential memory accesses. The systolic organization [of TPUs] ... capture[s] DNNs’ data reuse while being simple by avoiding multithreading.”


In addition to the contrasting architectural choices, TPU and GPU chips use different technologies, die areas, clock rates, and power. Table 6 gives three related cost measures of these systems: approximate die size adjusted for technology; power for a 16-chip system; and cloud price per chip. The GPU adjusted die size is more than twice that of the TPUs, which suggests the capital cost of the chips is at least double, since there would be at least twice as many TPU dies per wafer. GPU power is 1.3x–1.6x higher, which suggests higher operating expenses, as the total cost of ownership is correlated with power.19 Finally, the hourly rental prices on Google Cloud Engine are 1.6x–2.9x higher for the GPU. These three different measures consistently suggest TPUv2 and TPUv3 are roughly half to three-fourths as expensive as the Volta GPU.

Table 6. Adjusted comparison of GPU and TPU. Die sizes are adjusted by the square of the technology, as the semiconductor technology for TPUs is similar but larger and older than that of the GPU. We picked 15nm for TPUs based on the information in Table 3. Thermal Design Power (TDP) is for 16-chip systems. TPUs come with a host CPU; the GPU price adds the price of an n1-standard-16 CPU.

         Die size   Adjusted die size   TDP (kW)   Cloud price   Relative to GPU: Die   TDP    Price
Volta    815        815                 12.0       $3.24         1.00                   1.00   1.00
TPUv2    <611       <391                7.7        $1.13         <0.5                   0.64   0.35
TPUv3    <648       <415                9.3        $2.00         <0.5                   0.78   0.62

Performance Evaluation
In computer architecture, we “grade on a curve” versus “grade on an absolute scale,” so we need to measure performance relative to the competition. Before showing the performance of TPU supercomputers, we must establish the virtues of a single chip, for a 1024x speedup from 1,024 wimpy chips is uninteresting.

We first compare training performance on a standard set of ML benchmarks and Google production applications for the TPUv2/v3 chips and the Volta GPU chip; TPUv3 and Volta are about the same speed. We then check if four MXUs per chip in TPUv3 really helped, or if other bottlenecks in the TPUv3 chip made the extra MXUs superfluous; they helped! We conclude the chip comparison by looking at inference for TPUv2/v3 versus TPUv1; TPUv2/v3 are much faster.

Having established the merits of the TPU chips, we then evaluate the TPUv2/v3 supercomputer. The first step is to see how well it scales; we see 96%–99% of perfect linear speedup at 1,024 chips. We then compare the fraction of peak performance and performance per Watt of TPU and traditional supercomputers; TPUs have 5x–10x better performance per Watt.

Chip performance: TPUv2/v3 versus the Volta GPU. Figure 6 shows the performance of TPUv3 and the Volta GPU relative to TPUv2 for two sets of programs. The first set is five programs that Google and NVIDIA both submitted to MLPerf 0.6 in May 2019; both use 16-bit multiplication, with NVIDIA software performing loss scaling. The geometric mean speedup of these programs over TPUv2 is 1.8 for TPUv3 and 1.9 for Volta.

Figure 6. Performance per chip relative to TPUv2 for five MLPerf 0.6 benchmarks and six production applications.

We also wanted to measure the performance of production workloads. We chose six production applications similar to what we used for TPUv1 as representative of Google’s workload:

˲ In MultiLayer Perceptrons (MLP), each new layer of a model is a set of nonlinear functions of a weighted sum of all outputs (fully connected) from a prior one. This classic DNN usually has text as input. MLP0 is unpublished, but MLP1 is RankBrain,9 which ranks search results for a Web page.
˲ In Convolutional Neural Networks (CNN), each ensuing layer is a set of nonlinear functions of weighted sums of spatially nearby subsets of outputs from the prior layer. CNNs usually have images as inputs. CNN0 is AlphaZero, a reinforcement learning algorithm with extensive use of CNNs, which mastered the games chess, Go, and shogi.34 CNN1 is a Google-internal model for image recognition.
˲ In Recurrent Neural Networks (RNN), each subsequent model layer is a collection of nonlinear functions of weighted sums of outputs and the previous state. Sequence prediction problems, such as language translation, use RNNs. RNN0 is RNMT+6 and RNN1 is Improved LAS.8
We recently compared the representative datacenter workloads by model type for inference on TPUv120 versus TPUv2/v3 for training. Table 7 illustrates the fast-changing nature of DNNs. We originally used the name LSTM (Long Short-Term Memory) for TPUv1 applications, a type of RNN. Although sampled three years apart—July 2016 versus April 2019—we were still surprised that CNNs were a much larger part of datacenter training, and that a new model, Transformer36—published the year that TPUv2 was deployed—was as popular as RNNs. (Transformer is part of MLPerf 0.5.)

Table 7. Google’s inference (July 2016) and training (April 2019) workloads by DNN model type.
DNN Model     TPUv1 July 2016   TPUv3 April 2019
MLP           61%               27%
RNN           29%               21%
CNN           5%                24%
Transformer   --                21%

Transformer is intended for the same tasks as RNNs, such as translation, but is considerably faster since it lends itself to parallelization while RNNs have sequential dependencies. The layers of Transformer are a mix of MLPs and attention layers.4 Attention is the key new mechanism used in Transformer; it lets neural networks look up data associatively, in a memory-like structure whose indices themselves are learned. The components of attention resemble those of other layers, including matrix multiplications and dot products, which map well to TPU hardware. One difference is that attention matrices grow with sequence length, adding dynamic shape and memory requirements that complicate some optimizations done by XLA. The success of this recent model (see Figure 6) highlights TPU programmability.

The geometric mean speedup of the six production applications was 1.8 for TPUv3 but only 0.4 for Volta, primarily because they use the 8x-slower fp32 on GPUs instead of fp16 (Table 3). These are large production applications that are continuously improved, and not simple benchmarks, so it’s a lot of work to get them to run at all, and more to run well. As noted earlier, application programmers focus on TPUs, since they are in everyday use, so there is little urge to include the loss scaling needed for fp16. (TF kernels for embeddings have not been developed for GPUs, so we exclude MLPs from the GPU geometric mean as they could not run.)

Is TPUv3 memory bound or compute bound? While the peak compute improvement of TPUv3 over TPUv2 is 2.7x, the improvements in memory bandwidth, ICI bandwidth, and clock rate are only ≈1.35x. We wondered whether the extra MXUs in TPUv3 would be underutilized due to bottlenecks elsewhere. Figure 6 shows that one production application runs a bit higher than the memory improvement at 1.4x, but the other five and all the MLPerf 0.6 benchmarks run much faster, at 1.6x to 2.3x. The large application batch sizes and sufficient on-chip storage enabled these good results. As the MXUs are not a large part of the chip (Figure 3), doubling the MXUs in TPUv3 clearly proved beneficial.

Inference on a training chip: TPUv2/v3 vs. TPUv1. What about inference speed? Running it on a training chip—which works since it is like the forward pass—could help applications that require frequent training on fresh data. TPUv2/v3 do not support 8-bit integer data types, so inference uses bf16. One upside of using the same arithmetic for training and inference is that ML experts don’t need to do extra work—called quantization—to ensure the same accuracy of the DNN model.

One danger is that the larger batch sizes needed to run efficiently on TPUv2/v3 could hurt inference latency. Fortunately, we have DNN models that can meet their latency targets with batch sizes of greater than 1,000. With billions of daily users, inferences per second across the whole data center fleet can be very high.

The LSTM0 benchmark, for instance, ran at 48 inferences per second with a response time of 122ms on TPUv1.19 TPUv2 runs it 5.6x as fast with a 2.8x lower response time (44ms) at the same batch size. The lower latency in turn allows larger batches than on TPUv1 to be served in production yet still meet latency targets. With larger batches, the throughput rose to 11x with a latency improvement of 2x (58ms) versus TPUv1. TPUv3 reduces latency 1.3x (45ms) versus TPUv2 at the same batch size.

DSA supercomputer scaling performance. Alas, only ResNet-50 from MLPerf 0.6 can scale beyond 1,000 TPUs and GPUs. Figure 7 shows three ResNet-50 results. Ying et al. published a ResNet-50 result on TPUv3 that delivered 77% of perfect linear scaleup at 1,024 chips,41 but the TPUv3 version for MLPerf 0.6 only runs at 52%. The difference is in MLPerf’s ground rules. MLPerf requires including evaluation in the training time. (Evaluation runs a holdout dataset after a model training finishes to determine its accuracy.) Like Ying et al., most researchers exclude it when reporting performance. More unusually, MLPerf requires running evaluation at the end of every four epochs to deter benchmark cheating. ML developers would never evaluate that frequently. For MLPerf 0.6, NVIDIA ran ResNet-50 on a cluster of 96 DGX-2H, each with 16 Voltas connected via Infiniband switches, at 41% of linear scaleup for 1,536 chips.

Figure 7. Supercomputer scaling: TPUv3 and Volta (speedup versus number of chips for MLP0, MLP1, CNN0, CNN1, RNN0, RNN1, and ResNet-50 on TPUv3, and ResNet-50 on Volta).


MLPerf 0.6 benchmarks are much smaller than the production applications; Table 8 shows that time to train them on one TPUv2 chip is orders of magnitude less than in Table 1. Thus, we include six production applications largely to show substantial programs that can scale to supercomputer size. The MLPs are limited by embeddings and run only at 14% and 40% of perfect linear scaleup on 1,024 TPUv3 chips, but one application runs at 96% and three at 99%!

Table 8. Days to train MLPerf 0.5 benchmarks on one TPUv2 chip. See Table 1 for time to train production applications.
ResNet50: 0.8   SSD: 0.3   Mask R-CNN: 1.9   GNMT: 0.2   Transformer: 0.3

Note that CNN1 is an image recognition DNN much like ResNet101. It scales much better on TPUs because Google’s internal image datasets are much larger than what ResNet50 uses (Imagenet).

Traditional vs. DSA supercomputer performance. Traditional supercomputers measure performance using the high-performance computing (HPC) benchmark Linpack and ranking in the Top500 (top500.org). The related Green500 list re-ranks the Top500 based on performance per Watt. For these large computers to get utilization above 60%, HPC expands the size of the matrix being solved (weak scaling), for which Linpack has long been criticized within HPC.13 The TPU scaleup, however, uses production programs on real-world datasets.

Table 9 shows where the PetaFLOPs/second and FLOPs/Watt of AlphaZero on TPUv2/v3 would rank in the Top500 and Green500 lists. This comparison is imperfect: conventional supercomputers crunch 32- and 64-bit data rather than the 16- and 32-bit data of TPUs. However, TPUs are running a real application on real data versus a weakly scaled benchmark on synthetic data. TPUv3 has 44x the FLOPS/Watt of Tianhe and 10x that of SaturnV and ABCI.

Table 9. Traditional versus TPU supercomputer Top500 and Green500 rank (June 2019) for Linpack and AlphaZero. See article for caveats about comparing Linpack on 64-bit floating point to ML training on 16-bit floating point.
Name      Cores   Benchmark   Data        PetaFlop/s   % of Peak   Megawatts   GFlop/Watt   Top500   Green500
Tianhe    4865k   Linpack     32/64 bit   61.4         61%         18.48       3.3          4        57
SaturnV   22k     Linpack     32/64 bit   1.1          59%         0.97        5.1          469      1
ABCI      392k    Linpack     32/64 bit   19.9         61%         1.65        14.4         8        3
TPUv2     0.5k    AlphaZero   16/32 bit   9.9          84%         0.12        79.9         22       2
TPUv3     2k      AlphaZero   16/32 bit   86.9         70%         0.59        146.3        4        1

The Fujitsu ABCI supercomputer in Table 9 includes 2,176 Intel CPUs along with 4,352 Volta GPUs. Besides running Linpack, Fujitsu submitted a ResNet-50 result for MLPerf 0.6 using 2,048 GPUs. Table 10 shows the time to train ResNet-50 in MLPerf 0.6 and the number of chips for an NVIDIA GPU cluster, the Fujitsu ABCI supercomputer, and a Google TPUv3 supercomputer. Fujitsu varied from the strict MLPerf 0.6 closed-division guidelines of the other submissions—they changed the LARS optimizer and the momentum hyperparameter—so it’s not an apples-to-apples comparison. These changes improve performance by 10%–15%, which would also help NVIDIA and TPUv3.

Table 10. Time to train on the ResNet-50 benchmark from MLPerf 0.6 for supercomputers from NVIDIA, Fujitsu, and Google.
                NVIDIA cluster           ABCI Supercomputer        TPUv3 Supercomputer
Chips           1536 Voltas + 192 CPUs   2048 Voltas + 1024 CPUs   1024 TPUv3s + 128 CPUs
Time to train   80 seconds               70 seconds                77 seconds

Related Work
A survey documents over 25 years of custom neural network chips,3 but recent DNN successes led to an explosion in their development. Most designs focus on inference; far fewer, including the TPUv2/v3, target training. We are not aware of any other results that show state-of-the-art accuracy on working DSA hardware for training.

Of the five training startups, SambaNova has not yet published. Cerebras uses a whole silicon wafer to build their system, essentially treating 84 large “dies” as a single unit.24 Each “die” has 220MB of SRAM along with about 5k cores, yielding a total of 18GB of on-chip memory and 400k cores that collectively use 15 kilowatts. Like GraphCore, there is no DRAM in the system, so they target small batch sizes to reduce memory needs. The GraphCore15 GC2 chip holds 1,216 Intelligence Processing Units that each support seven threads and have a peak performance of 100 GFLOPS, or 122 TFLOPS per chip, almost identical to the peak performance of TPUv3 and Volta. It relies on 300MB of on-chip SRAM for memory, with two GC2 chips per PCIe board. The Habana Gaudi38 has eight VLIW SIMD cores, four stacks of HBM2 memory, bf16 arithmetic, and eight 100Gbit/sec Ethernet links to connect many chips together to form larger systems. Wave Computing’s28 Dataflow Processing Unit chip has 16k processors, 8k arithmetic units, and 16MB of on-chip memory, and its novelty is relying on asynchronous logic instead of a clock. It has external DRAM, offering both Hybrid Memory Cube and DDR4 ports. As of February 2020, none of the five training startups has reported training accuracy or time-to-solution.

Academic training studies include the DianNao family of architectures (one of which trains)7 and ScaleDeep;37 to our knowledge, neither has been fabricated. Several studies explored reduced-precision training with accelerator construction in mind. Intel’s Flexpoint22 is a block FP format,39 although those developers switched to using bf16 for their DNN chips.40 De Sa et al.10 reduced precision and relaxed cache coherence. HALP11 also made algorithmic changes to reduce quantization noise and uses 8-bit integers to train some models. None is yet available in a commercial system.
TPUv2/v3 are not the first domain-specific supercomputers to show large efficiency, performance, and scaling gains. Anton systems33 showed two order-of-magnitude speedups over traditional supercomputers on molecular dynamics workloads. They also resulted from hardware/software/algorithm codesign, with custom chips, interconnect, and arithmetic.

Conclusion
Benchmarks suggest the TPUv3 chip performs similarly to the contemporary Volta GPU chip, but parallel scaling for production applications is stronger for the TPUv3 supercomputer:

˲ Three scale to 1,024 chips at 99% linear speedup;
˲ One scales to 1,024 chips at 96% linear speedup; and
˲ Two scale to 1,024 chips but are limited by embeddings.

Remarkably, a TPUv3 supercomputer runs a production application using real-world data at 70% of peak performance, higher than general-purpose supercomputers run the Linpack benchmark using weak scaling of manufactured data. Moreover, TPU supercomputers with 256–1,024 chips running a production application have 5x–10x the performance/Watt of the #1 traditional supercomputer on the Green500 list running Linpack, and 24x–44x that of the #4 supercomputer on the Top500 list. Reasons for this success include the built-in ICI network, large systolic arrays, and bf16 arithmetic, which we expect will become a standard data type for DNN DSAs.

TPUv2/v3 have smaller dies in an older semiconductor process and lower cloud prices despite being less mature at many levels of the hardware/software system stack than CPUs and GPUs. These good results despite technological disadvantages suggest the TPU approach is cost-effective and can deliver high architectural efficiency into the future.

Going forward, our ravenous DNN colleagues want the fastest computer that we can build.2 Despite Moore’s Law ending, we expect the demand for faster DNN-specific supercomputers to grow even more quickly than Moore predicted. Trying to satisfy that demand without the help of Moore’s Law offers exciting new challenges for computer architects for at least a decade.17

Acknowledgments
The authors analyzed TPU systems that involved contributions from many Googlers. Many thanks to the hardware and software teams and engineers for making TPU supercomputers possible, including Paul Barham, Eli Bendersky, Dehao Chen, Chiachen Chou, Jeff Dean, Peter Hawkins, Blake Hechtman, Mark Heffernan, Robert Hundt, Michael Isard, Fritz Kruger, Naveen Kumar, Sameer Kumar, Chris Leary, Hyouk-Joong Lee, David Majnemer, Lifeng Nai, Thomas Norrie, Tayo Oguntebi, Andy Phelps, Bjarke Roune, Brennan Saeta, Julian Schrittwieser, Andy Swing, Shibo Wang, Tao Wang, Yujing Zhang, and many more.

References
1. Abadi, M. et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. 2016; arXiv preprint arXiv:1603.04467.
2. Amodei, D. and Hernandez, D. AI and compute. 2018; https://blog.openai.com/aiandcompute.
3. Asanović, K. Programmable neurocomputing. The Handbook of Brain Theory and Neural Networks, 2nd Edition. M.A. Arbib, ed. MIT Press, 2002.
4. Bahdanau, D., Cho, K. and Bengio, Y. Neural machine translation by jointly learning to align and translate. 2014; arXiv preprint arXiv:1409.0473.
5. Chen, J. et al. Revisiting distributed synchronous SGD. 2016; arXiv preprint arXiv:1604.00981.
6. Chen, M.X. et al. The best of both worlds: Combining recent advances in neural machine translation. 2018; arXiv preprint arXiv:1804.09849.
7. Chen, Y. et al. DaDianNao: A machine-learning supercomputer. In Proceedings of the 47th Int’l Symp. on Microarchitecture, (2014), 609–622.
8. Chiu, C.C. et al. State-of-the-art speech recognition with sequence-to-sequence models. In Proceedings of the IEEE Int’l Conference on Acoustics, Speech and Signal Processing, (Apr. 2018), 4774–4778.
9. Clark, J. Google turning its lucrative Web search over to AI machines. Bloomberg Technology, Oct. 26, 2015.
10. De Sa, C. et al. Understanding and optimizing asynchronous low-precision stochastic gradient descent. In Proceedings of the 44th Int’l Symp. on Computer Architecture, (2017), 561–574.
11. De Sa, C. et al. High-accuracy low-precision training. 2018; arXiv preprint arXiv:1803.03383.
12. Dean, J. et al. Large scale distributed deep networks. Advances in Neural Information Processing Systems, (2012), 1223–1231.
13. Dongarra, J. The HPC challenge benchmark: A candidate for replacing Linpack in the Top500? In Proceedings of the SPEC Benchmark Workshop, (Jan. 2007); www.spec.org/workshops/2007/austin/slides/Keynote_Jack_Dongarra.pdf.
14. Duchi, J., Hazan, E. and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. J. Machine Learning Research 12 (July 2011), 2121–2159.
15. Graphcore Intelligence Processing Unit; https://www.graphcore.ai/products/ipu.
16. Hennessy, J.L. and Patterson, D.A. Computer Architecture: A Quantitative Approach, 6th Edition. Elsevier, 2019.
17. Hennessy, J.L. and Patterson, D.A. A new golden age for computer architecture. Commun. ACM 62, 2 (Feb. 2019), 48–60.
18. Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. 2015; arXiv preprint arXiv:1502.03167.
19. Jouppi, N.P. et al. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Int’l Symp. on Computer Architecture, (June 2017), 1–12.
20. Jouppi, N.P., Young, C., Patil, N. and Patterson, D. A domain-specific architecture for deep neural networks. Commun. ACM 61, 9 (Sept. 2018), 50–59.
21. Kalamkar, D. et al. A study of BFLOAT16 for deep learning training. 2019; arXiv preprint arXiv:1905.12322.
22. Köster, U. et al. Flexpoint: An adaptive numerical format for efficient training of deep neural networks. In Proceedings of the 31st Conf. on Neural Information Processing Systems, (2017).
23. Kung, H.T. and Leiserson, C.E. Algorithms for VLSI processor arrays. Introduction to VLSI Systems, 1980.
24. Lie, S. Wafer scale deep learning. In Proceedings of the IEEE Hot Chips 31 Symp., (Aug. 2019).
25. Mellempudi, N. et al. Mixed precision training with 8-bit floating point. 2019; arXiv preprint arXiv:1905.12334.
26. Micikevicius, P. et al. Mixed precision training. 2017; arXiv preprint arXiv:1710.03740.
27. Mikolov, T. et al. Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems (2013), 3111–3119.
28. Nicol, C. A dataflow processing chip for training deep neural networks. In Proceedings of the IEEE Hot Chips 29 Symp., (Aug. 2017).
29. Olah, C. Deep learning, NLP, and representations. Colah’s blog, 2014; http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/.
30. Polyak, B.T. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics 4, 5 (1964), 1–17.
31. Robbins, H. and Monro, S. A stochastic approximation method. The Annals of Mathematical Statistics 22, 3 (Sept. 1951), 400–407.
32. Shallue, C.J. et al. Measuring the effects of data parallelism on neural network training. 2018; arXiv preprint arXiv:1811.03600.
33. Shaw, D.E. et al. Anton, a special-purpose machine for molecular dynamics simulation. Commun. ACM 51, 7 (July 2008), 91–97.
34. Silver, D. et al. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science 362, 6419 (2018), 1140–1144.
35. Thottethodi, M. and Vijaykumar, T. Why the GPGPU is less efficient than the TPU for DNNs. Computer Architecture Today Blog, 2019; www.sigarch.org/why-the-gpgpu-is-less-efficient-than-the-tpu-for-dnns/.
36. Vaswani, A. et al. Attention is all you need. Advances in Neural Information Processing Systems (2017), 5998–6008.
37. Venkataramani, S. et al. ScaleDeep: A scalable compute architecture for learning and evaluating deep networks. In Proceedings of the 45th Int’l Symp. on Computer Architecture, (2017), 13–26.
38. Ward-Foxton, S. Habana debuts record-breaking AI training chip. (June 2019); https://www.eetimes.com/document.asp?doc_id=1334816.
39. Wilkinson, J.H. Rounding Errors in Algebraic Processes, 1st Edition. Prentice Hall, Englewood Cliffs, NJ, 1963.
40. Yang, A. Deep learning training at scale: Spring Crest Deep Learning Accelerator (Intel Nervana NNP-T). In Proceedings of Hot Chips, (Aug. 2019); www.hotchips.org/hc31/HC31_1.12_Intel_Intel.AndrewYang.v0.92.pdf.
41. Ying, C. et al. Image classification at supercomputer scale. 2018; arXiv preprint arXiv:1811.06992.
42. Zoph, B. and Le, Q.V. Neural architecture search with reinforcement learning. 2019; arXiv preprint arXiv:1611.01578.

Norman P. Jouppi is a Distinguished Hardware Engineer at Google, Mountain View, CA, USA.
Doe Hyun Yoon is a staff software engineer at Google, Mountain View, CA, USA.
George Kurian is a senior staff software engineer at Google, Mountain View, CA, USA.
Sheng Li is a staff software engineer and tech lead on ML Accelerator Optimization at Scale at Google, Mountain View, CA, USA.
Nishant Patil is a senior staff software engineer at Google, Mountain View, CA, USA.
James Laudon is an engineering director at Google, Mountain View, CA, USA.
Cliff Young is a software engineer at Google, Mountain View, CA, USA.
David Patterson is a Distinguished Engineer at Google, Mountain View, CA, USA, a professor of the Graduate School at the University of California, Berkeley, CA, USA, and Director of the RISC-V International Open Source Laboratory at Berkeley, CA, and Shenzhen, China.

Copyright held by authors/owners.
