Professional Documents
Culture Documents
Torchrec
Torchrec
Maxim Naumov, Ajit Mathews, Lin Qiao, Mikhail Smelyanskiy, Bill Jia, Vijay Rao
Facebook, Menlo Park, USA
Abstract
10000
GPT-3
Deep learning recommendation models (DLRMs) have been AlphaGoZero
used across many business-critical services at Facebook and 1000 AlphaZero
are the single largest AI application in terms of infrastructure
demand in its data-centers. In this paper, we present Neo, a 100
Petaflop/s-days
software-hardware co-designed system for high-performance DLRM-2022
distributed training of large-scale DLRMs. Neo employs a 10 Xception BERT
DLRM-2021
novel 4D parallelism strategy that combines table-wise, row-
wise, column-wise, and data parallelism for training massive 1
1 BERT
1 Introduction VGG
0.1 AlexNet AlphaZero
Deep learning recommendation models (DLRMs) are ubiq- ResNet
uitously used by online companies, including Amazon for 0.01 GoogLeNet Xception
† dheevatsa@fb.com, † haoyc@fb.com
‡ These 0.001
authors contributed equally 2010 2012 2014 2016 2018 2020 2022
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are not Figure 1. Comparing deep learning models in total amount
made or distributed for profit or commercial advantage and that copies bear of compute, in petaflop/s-days (top) [44] and model capacity
this notice and the full citation on the first page. Copyrights for components (bottom).
of this work owned by others than ACM must be honored. Abstracting with
credit is permitted. To copy otherwise, or republish, to post on servers or to
redistribute to lists, requires prior specific permission and/or a fee. Request selecting items in its catalog [34, 36, 58], Netflix for show-
permissions from permissions@acm.org.
Conference’17, July 2017, Washington, DC, USA
ing movie options [12, 27], and Google for displaying per-
© 2021 Association for Computing Machinery. sonalized advertisements [6, 8, 17]. They have also been
ACM ISBN 978-x-xxxx-xxxx-x/YY/MM. . . $15.00 adopted by standard benchmarking organizations, such as
https://doi.org/10.1145/nnnnnnn.nnnnnnn MLCommons (MLPerf) [37, 51]. At Facebook, we have been
1
using recommendation models extensively for ranking and DNN models, but they are not designed to match the perfor-
click through rate (CTR) prediction, including news feed mance characteristics of DLRMs. Specifically, hardware plat-
and search services [13, 15, 41, 46]. DLRMs are the single forms for DNN training are generally optimized for central-
largest AI application in terms of infrastructure demand in ized inter-node communications (e.g., parameter servers [3])
data centers. and/or AllReduce communications (e.g., Horovod [53] and
Unlike conventional deep neural networks (DNNs) with NCCL [1]). However, as identified in Section 3, performant
mainly compute-intensive operators (e.g., convolution and and scalable DLRM training requires efficient hardware sup-
matrix multiplication), DLRMs combine compute-intensive port for a mixture of diverse communication patterns, in-
components with up to thousands of data-intensive embed- cluding AllReduce, AlltoAll, ReduceScatter, OneToMany,
ding operators, each with a different resource requirement and ManyToMany.
and performance characteristic [42]. As a result, DLRMs gen-
erally exhibit much lower arithmetic intensity and larger 1.1 Our Approach
model sizes compared to their computer vision [7, 16, 28, We present Neo, a software-hardware co-designed system
56, 59], natural language processing [4, 9, 61], and reinforce- for fast and scalable DLRM training building on top of three
ment learning counterparts [54, 55], with models having key techniques.
trillions of parameters being deployed in practice, as shown
in Figure 1. Existing software and hardware solutions tai- 4D parallelism. To enable fast and scalable training of
lored for DNNs achieve only suboptimal performance and the massive embedding operators in DLRMs, it is crucial to
limited scalability for DLRM training due to the following effectively balance the workload distribution across GPUs
software/hardware limitations. while minimizing communication costs. We introduce a 4D
On the software side, existing deep learning frameworks parallelism strategy that combines table-wise, row-wise, column-
parallelize DNN training typically using either data, model wise, and data parallelism to jointly optimize the paralleliza-
or pipeline parallelism [3, 31, 47]. Frameworks that support tion performance of embedding operators. Additionally, Neo
combinations of these strategies are generally designed for also supports applying 4D parallelism in a recursive manner
specific DNN applications [14, 20, 40, 49]. However, exist- at different levels of hardware hierarchy to further improve
ing parallelization strategies designed and optimized for load balance and hardware efficiency.
compute-intensive DNN models achieve limited performance High-performance embedding computation. Neo em-
and scalability for DLRMs. In particular, data parallelism re- ploys two novel optimizations to minimize the computa-
quires each device to save a replica of the entire model and tional costs and memory requirements of embedding opera-
therefore does not support DLRMs with up to trillions of tors. First, we introduce a hybrid kernel fusion technique that
parameters [31]. Moreover, a DLRM cannot be directly par- fuses (1) multiple embedding operators and (2) embedding
allelized using model or pipeline parallelism due to the data- computations and their parameter updates all in a single
dependent behavior of its embedding operators. Specifically, CUDA kernel. This is realized by co-designing the optimiza-
processing different training samples may require accesses tion algorithms and software implementation of embedding
to different embedding parameters depending on the cate- operators. Second, to provide sufficient memory capacity for
gorical inputs of each sample. This data-dependent behavior DLRM training, Neo uses a software-managed caching mecha-
makes it infeasible to statically partition a DLRM’s trainable nism to leverage the memory hierarchy of modern hardware
parameters into disjoint subsets while satisfying data depen- platforms. Finally, a variety of compression techniques are
dencies for all training samples, a necessity for employing further applied to minimize memory requirements.
model and pipeline parallelism.
In addition, today’s DNN frameworks are designed and Hardware platform design. We introduce ZionEX , a new
optimized for compute-intensive DNN computations and hardware platform co-designed with Neo’s 4D parallelism to
miss critical optimizations for data-intensive embedding op- optimize inter-node communications for distributed DLRM
erators. Specifically, DLRM models contain up to thousands training. ZionEX supports a fully-connected topology across
of embedding operators. The forward processing, backward all GPUs in the cluster by using a dedicated RDMA over
propagation, and gradient synchronization for these embed- Converged Ethernet (RoCE) based scale-out network. This
ding operators require launching thousands of CUDA kernels topology design promotes high-performance data transfers
in a training iteration and consume up to terabytes of aggre- for the performance-dominating AlltoAll and ManyToMany
gated GPU device memory, introducing significant runtime communications in distributed DLRM training. Meanwhile,
overheads and memory requirements. ZionEX supports both the RDMA and GPUDirect communi-
On the hardware side, modern hardware platforms such as cation protocols and retains flexible intra-node GPU fabric.
GPU-based clusters provide significant capability boost for This enables high-performance DLRM training on ZionEX ,
while ensuring compatibility with existing data-center in-
frastructure to allow wide deployment of ZionEX .
2
Table 1. Sample DLRM time to train latency resources demand Parameter Servers
EASGD
Figure 4. Embedding table sharding schemes with different implications on the communication cost, load balancing and
memory requirement. Bottom MLP is omitted in this figure for simplicity of illustration.
sample 2 sample 1
1 row 1
Tensor
receiving and distributing batch 𝑖 + 1 using a separate stream. Output Tensor
row 2 row 2
2 sample 1
row 3 row 3
To minimize the interference, we overlap the input AlltoAll 6 row 4 row 4
1 row 5 sample 2
of batch 𝑖 + 1 with the forward propagation of top MLP of row 5
row 6 pooling
3 row 6
operator
batch 𝑖 where no communication is involved. In addition, we 6
▽row 1
overlap the pooled embedding AlltoAll with the forward ▽sample 1 ▽row 2
Gradient
▽row 3 Sparse
propagation of bottom MLP to hide latency. ▽sample 2 ▽row 4 Optimizer
Output Gradient ▽row 5
▽row 6
Embedding Gradient
5 Embedding Operators
A major difference between DLRM models and conventional (a) Embedding computation workflow
deep neural networks is leveraging categorical features such D
Table 1, Batch 1: pooling L1,1 rows
as users, posts, or pages. The DLRM models used in produc- Table 1, Batch 2: pooling L1, 2 rows
each of which corresponds to a dedicated embedding operator. Table 1, Batch B: pooling L1,B rows
T
D
Optimizing the runtime performance of DLRM’s embedding L1,1 Table 2, Batch 1: pooling L2,1 rows
L1,2
operators requires addressing two key challenges. First, the ...
Table 2 H2 B
T x B segments Backward
forward processing, backward propagation, and gradient up-
...
... D + Output Tensor
dates for the embedding operators require launching thou- ... Optimizer
LT,B (Fusion)
sands of GPU kernels in each training iteration, introducing Categorical feature inputs Table T HT
Table T, Batch B: pooling LT,B rows
significant GPU kernel launch overhead. Second, some em- T tables and B batches
minimize the CUDA kernel launch overhead and allow each Gradient Sorting Gradient Aggregation
GPU worker to only launch two kernels (i.e., one for forward ▽row 1
row 1
and one for back propagation and parameter update). Second, ▽sample 1
▽row 1
▽row 2 Sparse row 2
for parallelizing the computation of the embedding operators, ▽sample 2 ▽row 3 row 3
Optimizer
we propose column-wise parallelism and row-wise parallelism ▽row 6
Output row 6
▽row 6
in addition to data and model parallelism. The combinations Gradient Sorted Embedding Embedding
of these four parallelism dimensions enable Neo to support Gradients Parameter
embedding tables with up to trillions of parameters. Finally, (c) Fusing embedding backward and sparse optimizer
Neo exploits a series of memory saving techniques that lever-
age the memory hierarchy of the ZionEX platform to ensure Figure 5. Embedding operator optimizations
sufficient memory capacity for DLRM.
handling non-linearity in advanced optimizers such as Ada-
5.1 Kernel Fusion Grad [10], LAMB [65], and Adam [25]. For example, both
Neo uses a hybrid kernel fusion mechanism to minimize the sample 1 and 2 in Figure 5a contribute to the gradients of the
CUDA kernel launch overhead for performing embedding embedding vector 1 and 6. Directly sending these gradients
computations in a training iteration. First, instead of apply- to a non-linear sparse optimizer without aggregation would
ing a separate embedding lookup for each embedding table, result in incorrect updates to the embedding tables.
Neo fuses multiple embedding lookups on the same GPU To guarantee correctness while maximizing performance,
into a single CUDA kernel (Figure 5b), which improves the Neo applies gradient sorting by rows so that gradients to
parallelism and bandwidth utilization and reduces the over- the same embedding rows are processed by a single CUDA
head of launching multiple CUDA kernels on GPUs. thread block, as shown in Figure 5c. Gradient aggregation is
Second, Neo also fuses the backward pass with the sparse subsequently applied within each CUDA thread block using
optimizer to further reduce kernel launch overhead and avoid much faster but smaller GPU shared memory.
materializing gradients to the embedding tables. The key Neo’s hybrid fusion technique for embedding operators
challenge of such fusion is avoiding potential race-condition lead to three performance benefits. First, Neo reduces the
across gradient updates from different training samples and memory requirement for embedding operators by avoiding
6
allocating GPU device memory for embedding gradients.
8S Glue-less UPI Topology (Twisted Hyper Cube)
Second, the memory accesses to GPU device memory are
minimized by using GPU shared memory to save interme-
diate embedding gradients. Finally, kernel fusion improves NIC CPU NIC CPU NIC CPU NIC CPU NIC CPU NIC CPU NIC CPU NIC CPU
…
…
customized 32-way set-associative software cache [63] us- ZionEX
reader
ing least recently used (LRU) or least frequently used (LFU) Training cluster
cache replacement policies, where the associativity matches Data ingestion service
the warp size of GPUs. This enables fine grain control of (c) Overall training architecture.
caching and replacement, allowing it to be tuned for target
model characteristics. Note that UVM is bounded by PCIe Figure 6. The system architectures of Zion, ZionEX plat-
bandwidth, while Neo’s software cache can bridge the gap forms and the overall training system.
for the bandwidth between PCIe and HBM ( 50× difference).
The software cache improves the end-to-end performance of
DLRM workloads by approximately 15% compared to UVM.
To further reduce the memory requirement of embed-
6.1 Previous Platform: Zion
ding operators, Neo also employs a variety of compression
techniques, such as a row-wise sparse optimizer, low/mixed- Zion [39] introduced in 2019 was designed as a high-performance
precision training using a high-precision cache backed by hardware platform for training DLRMs. While Zion offers
low precision embedding tables, and advanced factorization significantly improved capabilities at single-node level, it
techniques. We omit the details of Neo’s compression tech- falls short as a distributed platform not being extensible to
niques due to space limit and will elaborate on them in the meet the rapidly growing DLRM training requirements. We
final paper. critically appraise its limitations, but other platforms based
on a similar design share the same limitations; we discuss
6 ZionEX: Hardware Platform Design those platforms in Section 9.
Figure 6a shows the architecture of a Zion node, which has
We begin by describing the limitations of our previous hard- 8 CPU sockets with 1.5 TB memory, 8 GPUs, and 8 network
ware platform for DLRM in Section 6.1. Section 6.2 introduces interface cards (NICs). It provides a powerful heterogeneous
ZionEX , a new hardware platform for DLRM. We also outline super node design for training DLRM by (1) offloading com-
the design principles used in the development of ZionEX . pute heavy layers of DLRM (e.g., MLPs) onto GPUs and (2)
leveraging CPUs for large embedding operators on the rela-
∗ PyToch FBGEMM_GPU library: https://github.com/pytorch/ tively cheaper DRAM instead of HBM for accommodating
FBGEMM/tree/master/fbgemm_gpu TB-scale DLRM models on a single node.
7
However, this heterogeneous design introduces a number GPUs on a different node through the dedicated low-latency
of challenges to software design and performance. For exam- high-bandwidth RoCE NIC, without involving host CPUs.
ple, it’s critical to balance the workload on CPUs and GPUs In addition, ZionEX also has a frontend NIC connected to
to ensure maximum overlap. This requires elaborate pipelin- each CPU. Data ingestion goes through the regular fron-
ing between CPUs and GPUs and partitioning DLRM into tend network and PCIe, without interfering with activations
fine-grained tasks using an accurate cost model. In addition, or gradients. The host CPUs are only used to setup input
heterogeneous training of DLRM also introduces non-trivial batches and marshal the training process.
runtime overheads, such as increased data transfers between
Capability. With ZionEX we ensure that the platform is
CPUs and GPUs and inter-socket communication. Finally,
compatible with existing infrastructure and can be widely
a critical missing component in Zion is that each NIC is di-
deployed within our data-centers, without causing major dis-
rectly attached to a CPU. As a result, all of the inter-node
ruptions. This is critical for being able to effectively leverage
communications (e.g., gradient synchronization and tensor
the capability of the platform and make it readily available to
transformation) necessitate CPU intervention and additional
across variety of applications and uses-cases. We achieve this
GPU-CPU transfers. Furthermore these NICs are connected
by making the ZionEX platform compliant with the standard
to the common shared data-center network infrastructure,
Open Rack specifications [2] , which covers the compati-
which introduces overheads from network congestion, inter-
bility with other infrastructure components such as power,
ference and are constrained to use more data center-friendly
cooling, mechanicals and cabling. Furthermore designing
topologies, protocols (TCP/IP) which are sub-optimal for
the platform to be modular and relying on open standards
distributed training.
based technologies, for instance - the ethernet based network
Although each Zion node is equipped with 8x 100Gbps
fabric for high-performance scale out solution.
NIC bandwidth, in reality we found it is very difficult to
Fig. 6c shows the overall training platform, along with the
scale out to multiple nodes due to networking overheads.
dis-aggregated data-ingestion service. This supports stream-
With today’s increasing demand on modeling size of DLRM
ing input data from a network store such as Tectonic [45]
models, Zion is not able to scale well and fully utilize the
and perform light-weight data pre-processing operations in
powerful hardware resources.
a distributed fashion. So that the data-ingestion is not a bot-
6.2 ZionEX tleneck for the end-to-end training and to ensure sufficient
throughput in feeding ZionEX trainers.
To address these shortcomings, we introduce ZionEX , which
we have designed to be more scalable than Zion with im-
7 Implementation
proved network capabilities, while retaining its flexibility
and core advantages, such as the OAM form factor, modular We detail the implementation of high-performance scalable
design [39, 57], and flexible intra-node accelerator fabric [68]. training for DLRMs described above. We built a high-performance
Figure 6b shows the overall system architecture. We briefly training software stack for DLRMs using PyTorch [47], with
highlight ZionEX ’s core design principles: efficient CUDA implementation for most deep learning op-
erators via the ATen library, and automatic handling of pa-
Scalability. Both Zion and ZionEX support heterogeneous rameter replication and gradient synchronization with over-
training of DLRM, but the most striking difference is that lapped back-propagation and AllReduce via the PyTorch
ZionEX is designed with sufficient scale-up and scale-out DistributedDataParallel library [31]. We have enabled
network capabilities. As shown in Figure 6b, ZionEX em- the following components for efficient DLRM training.
ploys a dedicated RDMA over Converged Ethernet (RoCE)
NIC for each of the GPUs connected via PCIe switches to 7.1 Data ingestion
allow for a dedicated inter-node connectivity (isolated from Data ingestion is a key component to ensure end-to-end
common data-center network) and importantly support more training performance especially for DLRMs, which typically
efficient RDMA/GPUDirect communication protocols [41]. process through order(s) of magnitude larger amount of
These ZionEX nodes can be connected with a dedicated back- data than other typical DNN models. We observe that data
end network to form a cluster for distributed scalable train- ingestion, if left unoptimized, can incur significant latency
ing. The extensible design of ZionEX allows for scaling the and introduce non-trivial overheads to pipelining.
backend network to interconnect many thousands of nodes, Originally designed for a distributed asynchronous CPU
forming a data-center scale AI training cluster. setup, our readers and data pre-processing module stores the
High Performance. As a scale-out solution, we offload offsets and indices† of each sparse feature in separate tensors
the entire DLRM to GPUs to fully leverage the massive paral- per embedding table. As a result, a DLRM with hundreds of
lelism and high memory bandwidth to accelerate MLPs and embedding tables can easily get a thousand input tensors per
embedding computations. To transfer tensors and synchro- † Please refer to the interface of nn.EmbeddingBag https://pytorch.org/
3X 20% AllReduce
Comm (Exposed)
1X
0%
8 16 32 64 128
serialized exposed serialized exposed
# of GPUs
8 GPUs (single node) 128 GPUs
Baseline (BS=64k)
row. Then we use FP16 precision on embedding tables. These
1.2
+ optimized sharding two optimizations collectively bring model memory foot-
1.0 print from 96TB down to 24TB, just fitting under the 4TB
+ fp16 emb
0.8 HBM + 24TB DRAM memory hierarchy. On the massive em-
+ Quantized comms bedding tables, we enable row-wise sharding to distribute the
0.6
+ Larger batch size (256K) tables to multiple nodes and adjust the training flow to use
0.4
AlltoAll with bucketization and ReduceScatter as shown
0.2
in Figure 4b. With UVM enabled and HBM used as a cache,
0.0 we are able to train model-F with throughput as high as 1.7
Model-A
MQPS, demonstrating capability of our HW/SW co-designed
solution to push beyond the current state-of-the-art.
Figure 11. Training throughput improvements enabled by
optimized sharding, reduced precision embeddings, quan-
tized communications and larger batch sizes. 9 Related work
Researchers have proposed various system level innovations
to tackle the challenges brought by extremely large mod-
does not provide direct throughput benefit, Neo can leverage
els. DeepSpeed [49] fully shards model parameters, gradi-
the head room to strike a better balance. As a consequence,
ents and optimizer states across all nodes, and reconstructs
the training throughput is increased by another 20% due to
necessary states on the fly using checkpoint partitioning
improved load balancing.
and rematerialization [19, 26] to drastically reduce mem-
Next, to address the increased AlltoAll latency, we in-
ory usage. GShard [30] trains a massive translation model
corporate quantized collective communications proposed
with mixture of experts that are sharded across accelerators
in [64], which directly reduce the communication volume.
through annotation of parallelization strategy at tensor level.
For model-A, we validate that using FP16 in forward AlltoAll
FlexFlow [20] uses automatic search to discover the best
and BF16 in backward AlltoAll provides almost 30% speedup
parallelization strategy of operators in the graph. Building
without any training quality loss.
on this direction of auto-parallelization, these recent papers
Lastly, we increase the global batch size from 64K to 256K.
[38, 60] use optimal synthesis and reinforcement learning to
This directly increases activation sizes, which helps satu-
find optimized device placement to further improve paral-
rate GPUs and communication bandwidth better, while be-
lelization without the need for manual intervention. How-
ing complimentary to all other optimizations. With appro-
ever, these general systems are not specifically designed for
priately tuned optimizer/hyper-parameters, we are able to
highly sparse recommendation models.
achieve on-par training quality, however more comprehen-
To that end, Alibaba introduced XDL [21], an industry-
sive experimentation is warranted since large batch training
scale training system designed for high-dimensional sparse
of DLRMs is not as well studied and will be part of future
data. XDL incorporates optimizations such as hierarchical
work. Collectively, these techniques unlock an 87% improve-
sample compression, workflow pipelining, zero copy and
ment on training throughput compared to TF32 training with
CPU binding to improve training efficiency of the sparse
a 64K global batch size.
part of the model. Kraken [62] targets at more efficient on-
8.6 Model Capacity Limitation Study line training with decoupled key-value fetching and em-
bedding, codesigned cache eviction policy with ML domain
We use model-F as an example to push the model capacity on knowledge for the embedding tables, memory efficient op-
the prototype system. Unlike model-A or model-I, efficiently timizers for the sparse and dense part of the model, and a
training model-F presents 2 different challenges. First, with non-colocated deployment model that allows the inference
12T parameters, model-F can easily require up to 96TB of servers and parameter servers to grow independently. [23]
memory using a naive training approach, far exceeding the optimizes CPU-based DLRM training through lock-free em-
total memory available on a 16-node cluster§ . Second, the bedding table update, tuned loop tiling for dense MLP, the
model has only a few massive embedding tables with ~10B AlltoAll communication primitive and a new split-SGD
rows and 256 columns, each requiring multi-node worth of implementation that takes advantage of the bits aliasing in
GPU and host memory to train. FP32 and BFloat16 to reduce memory footprint. Baidu’s AI-
§ Considering FP32 precision and doubled size for optimizer states Box [67] takes a different approach to horizontal scaling and
12𝑒12 × 4 × 2 = 96𝑒12. The prototype cluster has in total 4TB HBM and focuses on fitting training of large recommendation models
24TB DRAM in a single node. AIBox hides serving latency by pipelining
12
network, disk and CPU/GPU tasks, and reduces model up- Yang, Anil Agrawal, Viswesh Sankaran, Daniel Montgomery,
date overhead and improves SSD life span through a grouped James Taylor, Jeff Anderson, Amithash Prasad, Patrick Williams,
hashing scheme and a multi-level in-memory hashing sys- Harsha Bojja, Arrow Luo, Changduk Kim, James Le, Rachel
tem. W Wang, Vignesh Mathimohan, Shockely Chen, Doug Wimer,
Much attention is given to communication performance James Allen, Vidya Rajasekaran, Kelly Zuckerman, Wenyin
as it has become a major bottleneck in distributed train- Fu, Valentin Andrei, Matt Skach, Philipp Keller, Olivier Raginel,
ing at cluster and datacenter scale. BytePS and ByteSched- Danielle Costantino. We also like to thank other reviewers
uler [22, 48] harnesses idle CPU and network resources and who have gone through multiple drafts of this paper, provid-
better communication scheduling to improve parameter ex- ing helpful inputs.
change efficiency. However, in a homogeneous training clus-
ter where each job spans multiple nodes, there are reduced
opportunities for finding and exploiting spare network re-
sources, resulting in a sub-optimal use of such approach.
SwitchML and ATP [29, 52] leverages programmable net-
work switches to perform in-network aggregation for cross-
rack bandwidth reduction in datacenter environments. [5, 35]
discovers and exploits datacenter network locality and forms
optimized and dynamic aggregation routes through learning
and optimal synthesis. Alternatively, these papers [32, 33]
address the communication overheads by using various quan-
tization schemes to reduce communication volume.
10 Conclusion
DLRMs are an important class of models widely used by
many internet companies for a wide range of applications.
They can often be the single largest AI application in terms
of infrastructure demand in data-centers. These models have
atypical requirements compared to other types of deep learn-
ing models, but they still follow a similar trend of rapid rate
of growth that is common across all deep learning-based ap-
plications. This growth constantly pushes the performance
boundary required of the underlying software stack and
hardware platform.
In this paper we co-design a solution that enables us to
run models with trillions of parameters, while attaining 40×
faster total training time for production recommendation
models. With novel SW features such as 4D parallelism, high-
performance operator implementations, hybrid kernel fusion
and hierarchical memory support. On the HW-side, the ex-
tensible ZionEX platform allows for scaling up to the full
data center with thousands of nodes, thus enabling a data
center-scale AI training cluster to continue catering to the
growing demands of deep learning models.
Acknowledgements
We would like to acknowledge all of the help from mem-
bers of the hardware, datacenter and infrastructure teams,
without which we could not have achieved any of the above
reported results. This includes among others Jenny Yu, Matt
Hoover, Hao Shen, Damien Chong, Jeff Puglis, Garnnet Thomp-
son, Peter Bracewell, Anthony Chan, Wei Zhang, Michael Haken,
Tiffany Jin, Joshua Held, Cheng Chen, Yin Hang, Ben Kim,
Tyler Hart, Gada Badeer, Ahmed Qaid, Peichen Chang, Zhengyu
13
References Aditya Kalro, et al. Applied machine learning at facebook: A datacenter
[1] Nvidia collective communications library (nccl), https://developer. infrastructure perspective. In 2018 IEEE International Symposium on
nvidia.com/nccl. High Performance Computer Architecture (HPCA), pages 620–629. IEEE,
[2] Ocp open rack standard (v2), https://www.opencompute.org/wiki/ 2018.
Open_Rack/SpecsAndDesigns#RACK_Standards. [16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep
[3] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng residual learning for image recognition, 2015.
Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu [17] Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and
Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irv- Tat-Seng Chua. Neural collaborative filtering. In Proc. 26th Int. Conf.
ing, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Man- World Wide Web, pages 173–182, 2017.
junath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry [18] Xinran He, Junfeng Pan, Ou Jin, Tianbing Xu, Bo Liu, Tao Xu, Yanxin
Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Shi, Antoine Atallah, Ralf Herbrich, Stuart Bowers, and Joaquin
Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Quiñonero Candela. Practical lessons from predicting clicks on ads at
Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete facebook. In Proceedings of the Eighth International Workshop on Data
Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Mining for Online Advertising, ADKDD’14, 2014.
Zheng. TensorFlow: Large-scale machine learning on heterogeneous [19] Paras Jain, Ajay Jain, Aniruddha Nrusimha, Amir Gholami, Pieter
systems, 2015. Software available from tensorflow.org. Abbeel, Kurt Keutzer, Ion Stoica, and Joseph E Gonzalez. Checkmate:
[4] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Breaking the memory wall with optimal tensor rematerialization. arXiv
Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, preprint arXiv:1910.02653, 2019.
Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, [20] Zhihao Jia, Matei Zaharia, and Alex Aiken. Beyond data and model
Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, parallelism for deep neural networks. arXiv preprint arXiv:1807.05358,
Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, 2018.
Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, [21] Biye Jiang, Chao Deng, Huimin Yi, Zelin Hu, Guorui Zhou, Yang
Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Zheng, Sui Huang, Xinyang Guo, Dongyue Wang, Yue Song, et al. Xdl:
Sutskever, and Dario Amodei. Language models are few-shot learners, an industrial deep learning framework for high-dimensional sparse
2020. data. In Proceedings of the 1st International Workshop on Deep Learning
[5] Zixian Cai, Zhengyang Liu, Saeed Maleki, Madanlal Musuvathi, Todd Practice for High-Dimensional Sparse Data, pages 1–9, 2019.
Mytkowicz, Jacob Nelson, and Olli Saarikivi. Synthesizing optimal [22] Yimin Jiang, Yibo Zhu, Chang Lan, Bairen Yi, Yong Cui, and Chuanx-
collective algorithms. In Proceedings of the 26th ACM SIGPLAN Sympo- iong Guo. A unified architecture for accelerating distributed DNN
sium on Principles and Practice of Parallel Programming, pages 62–75, training in heterogeneous gpu/cpu clusters. In 14th USENIX Sympo-
2021. sium on Operating Systems Design and Implementation (OSDI 20), pages
[6] Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar 463–479. USENIX Association, November 2020.
Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, [23] Dhiraj Kalamkar, Evangelos Georganas, Sudarshan Srinivasan, Jian-
Mustafa Ispir, Rohan Anil, Zakaria Haque, Lichan Hong, Vihan Jain, Xi- ping Chen, Mikhail Shiryaev, and Alexander Heinecke. Optimizing
aobing Liu, and Hemal Shah. Wide and deep learning for recommender deep learning recommender systems training on cpu cluster archi-
systems. arXiv:1606.07792, 2016. tectures. In Proceedings of the International Conference for High Per-
[7] François Chollet. Xception: Deep learning with depthwise separable formance Computing, Networking, Storage and Analysis, SC ’20. IEEE
convolutions, 2017. Press, 2020.
[8] Paul Covington, Jay Adams, and Emre Sargin. Deep neural networks [24] Narenda Karmarker and Richard M. Karp. The differencing method of
for YouTube recommendations. In Proc. 10th ACM Conf. Recommender set partitioning. Technical report, USA, 1983.
Systems, pages 191–198, 2016. [25] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic
[9] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. optimization. arXiv preprint arXiv:1412.6980, 2014.
Bert: Pre-training of deep bidirectional transformers for language [26] Marisa Kirisame, Steven Lyubomirsky, Altan Haan, Jennifer Brennan,
understanding, 2019. Mike He, Jared Roesch, Tianqi Chen, and Zachary Tatlock. Dynamic
[10] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient tensor rematerialization. arXiv preprint arXiv:2006.09616, 2020.
methods for online learning and stochastic optimization. Journal of [27] Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization
machine learning research, 12(7), 2011. techniques for recommender systems. Computer, (8):30–37, 2009.
[11] Assaf Eisenman, Kiran Kumar Matam, Steven Ingram, Dheevatsa Mudi- [28] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet
gere, Raghuraman Krishnamoorthi, Murali Annavaram, Krishnakumar classification with deep convolutional neural networks. In Proceedings
Nair, and Misha Smelyanskiy. Check-n-run: A checkpointing system of the 25th International Conference on Neural Information Processing
for training recommendation models, 2020. Systems - Volume 1, NIPS’12, page 1097–1105, Red Hook, NY, USA,
[12] Carlos A. Gomez-Uribe and Neil Hunt. The netflix recommender sys- 2012. Curran Associates Inc.
tem: Algorithms, business value, and innovation. ACM Trans. Manage. [29] ChonLam Lao, Yanfang Le, Kshiteej Mahajan, Yixi Chen, Wenfei Wu,
Inf. Syst., 6(4), December 2016. Aditya Akella, and Michael Swift. Atp: In-network aggregation for
[13] U. Gupta, C. Wu, X. Wang, M. Naumov, B. Reagen, D. Brooks, B. Cottel, multi-tenant learning.
K. Hazelwood, M. Hempstead, B. Jia, H. S. Lee, A. Malevich, D. Mudi- [30] Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen,
gere, M. Smelyanskiy, L. Xiong, and X. Zhang. The architectural Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and
implications of facebook’s dnn-based personalized recommendation. Zhifeng Chen. Gshard: Scaling giant models with conditional com-
In 2020 IEEE International Symposium on High Performance Computer putation and automatic sharding. arXiv preprint arXiv:2006.16668,
Architecture (HPCA), pages 488–501, 2020. 2020.
[14] Aaron Harlap, Deepak Narayanan, Amar Phanishayee, Vivek Seshadri, [31] Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter Noordhuis,
Nikhil Devanur, Greg Ganger, and Phil Gibbons. Pipedream: Fast and Teng Li, Adam Paszke, Jeff Smith, Brian Vaughan, Pritam Damania,
efficient pipeline parallel dnn training, 2018. et al. Pytorch distributed: experiences on accelerating data parallel
[15] Kim Hazelwood, Sarah Bird, David Brooks, Soumith Chintala, Utku training. Proceedings of the VLDB Endowment, 13(12):3005–3018, 2020.
Diril, Dmytro Dzhulgakov, Mohamed Fawzy, Bill Jia, Yangqing Jia,
14
[32] Hyeontaek Lim, David G Andersen, and Michael Kaminsky. 3lc: Light- [46] Jongsoo Park, Maxim Naumov, Protonu Basu, Summer Deng, Ar-
weight and effective traffic compression for distributed machine learn- avind Kalaiah, Daya Khudia, James Law, Parth Malani, Andrey Male-
ing. arXiv preprint arXiv:1802.07389, 2018. vich, Satish Nadathur, Juan Pino, Martin Schatz, Alexander Sidorov,
[33] Yujun Lin, Song Han, Huizi Mao, Yu Wang, and William J Dally. Deep Viswanath Sivakumar, Andrew Tulloch, Xiaodong Wang, Yiming Wu,
gradient compression: Reducing the communication bandwidth for Hector Yuen, Utku Diril, Dmytro Dzhulgakov, Kim Hazelwood, Bill
distributed training. arXiv preprint arXiv:1712.01887, 2017. Jia, Yangqing Jia, Lin Qiao, Vijay Rao, Nadav Rotem, Sungjoo Yoo,
[34] Romain Lopez, Inderjit Dhillon S., and Michael Jordan I. Learning from and Mikhail Smelyanskiy. Deep learning inference in facebook data
extreme bandit feedback. In Proc. Association for the Advancement of centers: Characterization, performance optimizations and hardware
Artificial Intelligence, 2021. implications, 2018.
[35] Liang Luo, Peter West, Arvind Krishnamurthy, Luis Ceze, and Jacob [47] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Brad-
Nelson. Plink: Discovering and exploiting datacenter network locality bury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein,
for efficient cloud-based distributed training, 2020. Luca Antiga, et al. Pytorch: An imperative style, high-performance
[36] Yifei Ma, Balakrishnan (Murali) Narayanaswamy, Haibin Lin, and Hao deep learning library. In Advances in neural information processing
Ding. Temporal-contextual recommendation in real-time. KDD ’20, systems, pages 8026–8037, 2019.
page 2291–2299, New York, NY, USA, 2020. Association for Computing [48] Yanghua Peng, Yibo Zhu, Yangrui Chen, Yixin Bao, Bairen Yi, Chang
Machinery. Lan, Chuan Wu, and Chuanxiong Guo. A generic communication
[37] Peter Mattson, Christine Cheng, Cody Coleman, Greg Diamos, Paulius scheduler for distributed dnn training acceleration. In Proceedings of
Micikevicius, David Patterson, Hanlin Tang, Gu-Yeon Wei, Peter Bailis, the 27th ACM Symposium on Operating Systems Principles, pages 16–29,
Victor Bittorf, David Brooks, Dehao Chen, Debojyoti Dutta, Udit Gupta, 2019.
Kim Hazelwood, Andrew Hock, Xinyuan Huang, Atsushi Ike, Bill Jia, [49] Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He.
Daniel Kang, David Kanter, Naveen Kumar, Jeffery Liao, Guokai Ma, Zero: Memory optimizations toward training trillion parameter models.
Deepak Narayanan, Tayo Oguntebi, Gennady Pekhimenko, Lillian In SC20: International Conference for High Performance Computing,
Pentecost, Vijay Janapa Reddi, Taylor Robie, Tom St. John, Tsuguchika Networking, Storage and Analysis, pages 1–16. IEEE, 2020.
Tabaru, Carole-Jean Wu, Lingjie Xu, Masafumi Yamazaki, Cliff Young, [50] Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. Hog-
and Matei Zaharia. Mlperf training benchmark, 2020. wild!: A lock-free approach to parallelizing stochastic gradient descent.
[38] Azalia Mirhoseini, Hieu Pham, Quoc Le, Mohammad Norouzi, Samy Advances in neural information processing systems, 24:693–701, 2011.
Bengio, Benoit Steiner, Yuefeng Zhou, Naveen Kumar, Rasmus Larsen, [51] V. J. Reddi, C. Cheng, D. Kanter, P. Mattson, G. Schmuelling, C. Wu,
and Jeff Dean. Device placement optimization with reinforcement B. Anderson, M. Breughe, M. Charlebois, W. Chou, R. Chukka, C. Cole-
learning. 2017. man, S. Davis, P. Deng, G. Diamos, J. Duke, D. Fick, J. S. Gardner,
[39] Dheevatsa Mudigere and Whitney Zhao. Hw/sw co-design for future I. Hubara, S. Idgunji, T. B. Jablin, J. Jiao, T. S. John, P. Kanwar, D. Lee,
ai platforms - large memory unified training platform (zion). In 2019 J. Liao, A. Lokhmotov, F. Massa, P. Meng, P. Micikevicius, C. Osborne,
OCP Regional Summit, Amsterdam, 2019. G. Pekhimenko, A. T. R. Rajan, D. Sequeira, A. Sirasao, F. Sun, H. Tang,
[40] Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGres- M. Thomson, F. Wei, E. Wu, L. Xu, K. Yamada, B. Yu, G. Yuan, A. Zhong,
ley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi P. Zhang, and Y. Zhou. Mlperf inference benchmark. In 2020 ACM/IEEE
Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, and 47th Annual International Symposium on Computer Architecture (ISCA),
Matei Zaharia. Efficient large-scale language model training on GPU pages 446–459, 2020.
clusters. CoRR, abs/2104.04473, 2021. [52] Amedeo Sapio, Marco Canini, Chen-Yu Ho, Jacob Nelson, Panos Kalnis,
[41] Maxim Naumov, John Kim, Dheevatsa Mudigere, Srinivas Sridharan, Changhoon Kim, Arvind Krishnamurthy, Masoud Moshref, Dan RK
Xiaodong Wang, Whitney Zhao, Serhat Yilmaz, Changkyu Kim, Hec- Ports, and Peter Richtárik. Scaling distributed machine learning with
tor Yuen, Mustafa Ozdal, Krishnakumar Nair, Isabel Gao, Bor-Yiing in-network aggregation. arXiv preprint arXiv:1903.06701, 2019.
Su, Jiyan Yang, and Mikhail Smelyanskiy. Deep learning training in [53] Alexander Sergeev and Mike Del Balso. Horovod: fast and easy dis-
facebook data centers: Design of scale-up and scale-out systems, 2020. tributed deep learning in TensorFlow. arXiv preprint arXiv:1802.05799,
[42] Maxim Naumov, Dheevatsa Mudigere, Hao-Jun Michael Shi, Jianyu 2018.
Huang, Narayanan Sundaraman, Jongsoo Park, Xiaodong Wang, Udit [54] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou,
Gupta, Carole-Jean Wu, Alisson G. Azzolini, Dmytro Dzhulgakov, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan
Andrey Mallevich, Ilia Cherniavskii, Yinghai Lu, Raghuraman Krish- Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, and
namoorthi, Ansha Yu, Volodymyr Kondratenko, Stephanie Pereira, Demis Hassabis. Mastering chess and shogi by self-play with a general
Xianjie Chen, Wenlin Chen, Vijay Rao, Bill Jia, Liang Xiong, and Misha reinforcement learning algorithm, 2017.
Smelyanskiy. Deep learning recommendation model for personaliza- [55] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis
tion and recommendation systems. CoRR, abs/1906.00091, 2019. Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker,
[43] NVIDIA. Unified memory in CUDA 6. Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan
https://developer.nvidia.com/blog/unified-memory-in-cuda-6/, 2021. Hui, Laurent Sifre, George Driessche, Thore Graepel, and Demis
Accessed: 2021-03-31. Hassabis. Mastering the game of go without human knowledge.
[44] OpenAI. Ai and compute. https://openai.com/blog/ai-and-compute/, Nature, 550:354–359, 10 2017.
2018. [56] Karen Simonyan and Andrew Zisserman. Very deep convolutional
[45] Satadru Pan, Theano Stavrinos, Yunqiao Zhang, Atul Sikaria, Pavel Za- networks for large-scale image recognition, 2015.
kharov, Abhinav Sharma, Shiva Shankar P, Mike Shuey, Richard Ware- [57] M. Smelyanskiy. Zion: Facebook next-generation large memory train-
ing, Monika Gangapuram, Guanglei Cao, Christian Preseau, Pratap ing platform. In 2019 IEEE Hot Chips 31 Symposium (HCS), 2019.
Singh, Kestutis Patiejunas, JR Tipton, Ethan Katz-Bassett, and Wyatt [58] Brent Smith and Greg Linden. Two decades of recommender systems
Lloyd. Facebook’s tectonic filesystem: Efficiency from exascale. In at amazon.com. IEEE Internet Computing, 21(3):12–18, May 2017.
19th USENIX Conference on File and Storage Technologies (FAST 21), [59] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed,
pages 217–231. USENIX Association, February 2021. Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew
Rabinovich. Going deeper with convolutions, 2014.
15
[60] Jakub Tarnawski, Amar Phanishayee, Nikhil R Devanur, Divya Ma-
hajan, and Fanny Nina Paravecino. Efficient algorithms for device
placement of dnn graph operators. arXiv preprint arXiv:2006.16423,
2020.
[61] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion
Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention
is all you need, 2017.
[62] M. Xie, K. Ren, Y. Lu, G. Yang, Q. Xu, B. Wu, J. Lin, H. Ao, W. Xu, and
J. Shu. Kraken: Memory-efficient continual learning for large-scale
real-time recommendations. In 2020 SC20: International Conference for
High Performance Computing, Networking, Storage and Analysis (SC),
pages 1–17, Los Alamitos, CA, USA, nov 2020. IEEE Computer Society.
[63] Jie Amy Yang, Jianyu Huang, Jongsoo Park, Ping Tak Peter Tang, and
Andrew Tulloch. Mixed-precision embedding using a cache, 2020.
[64] Jie Amy Yang, Jongsoo Park, Srinivas Sridharan, and Ping Tak Peter
Tang. Training deep learning recommendation model with quantized
collective communications. 2020.
[65] Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Sri-
nadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, and
Cho-Jui Hsieh. Large batch optimization for deep learning: Training
bert in 76 minutes. arXiv preprint arXiv:1904.00962, 2019.
[66] Sixin Zhang, Anna E Choromanska, and Yann LeCun. Deep learning
with elastic averaging sgd. Advances in neural information processing
systems, 28:685–693, 2015.
[67] Weijie Zhao, Jingyuan Zhang, Deping Xie, Yulei Qian, Ronglai Jia, and
Ping Li. Aibox: Ctr prediction model training on a single node. In
Proceedings of the 28th ACM International Conference on Information
and Knowledge Management, pages 319–328, 2019.
[68] Whiteny Zhao, Dheevatsa Mudigere, Xiaodong Wang, Jongsoo Park,
John Kim, and Mikhail Smelyanskiy. Accelerator fabric in facebook
zion training system. In 2019 IEEE/ACM International Symposium on
Networks-on-Chip (NOCS), 2019.
[69] Qinqing Zheng, Bor-Yiing Su, Jiyan Yang, Alisson Azzolini, Qiang
Wu, Ou Jin, Shri Karandikar, Hagay Lupesko, Liang Xiong, and Eric
Zhou. Shadowsync: Performing synchronization in the background
for highly scalable distributed training. CoRR, 2003.03477, 2020.
16
Appendix-A
We collected and developed a set of operator-level benchmarks which we have also open sourced as part of PARAM bench¶ ,
to evaluate the representative problem sizes and shapes on the candidate hardware platforms and to better understand the
throughput and latency in compute, memory, and communications.
Compute Benchmark
GEMM benchmark. This benchmark calls cuBLAS GemmEx routine to compute matrix multiplications on configurable
problem sizes with multiple precision choices. On the V100 GPU, this benchmark supports FP32 GEMM on the CUDA core
and FP16 mixed-precision GEMM on Tensor Core. On the A100 GPU, it additionally supports TF32 GEMM and BF16 GEMM
on the Tensor Core.
The benchmark results are shown in Figures 12.
10.0.1 MLP benchmark. This benchmark implements the following multilayer perceptron (MLP) layers:
• Batch size = 128, 256, 512, 1024, 2048, 4096;
• 20 MLP layers, where each layer is 1K×1K , 2K×2K and 4K×4K;
• Each layer has ReLU and final layers has SoftMax;
• Both backward and forward passes, including SGD update as the optimizer after the backward pass;
• Precision support: FP16, BF16, TF32, FP32.
The batch size, layer dimension, and number of layers can be configured to the customized number. We implemented this MLP
benchmark using C++, directly implementing FC and FCGradients in the MLP layer using cuBLAS SGEMM/GemmEx function,
ReLU with cuDNN cudnnActivationForward/ cudnnActivationBackward function, SoftMax with cudnnSoftmaxForward in the
forward a customized CUDA kernel for the backward pass, and SGD optimizer with cuBLAS axpy function. This benchmark
can be used to project the performance of V100/A100 GPUs using a minimal MLP network without the framework overhead in
PyTorch. The benchmark results are shown in Figures 13 and 14.
Memory Benchmark
This benchmark evaluates the achieved memory bandwidth of the embedding kernels described in Section 5. To eliminate the
L2 cache effects, a random tensor with 40 MB data (A100 L2 cache size) is allocated to flush the cache.
• Support the evaluation forward and backward pass (the backward pass is fused with optimizer);
• Precision Support: FP32 and FP16;
• Number of rows: 1000000, Number of tables: 64, Embedding dimension: 128, Pooling size: 32, rows per thread block: 32.
¶ https://github.com/facebookresearch/param
300
BS=128 BS=256 BS=512 BS=1024 BS=2048 BS=4096
250
200
Mesaured TF/s
150
100
50
0
4096x4096xB
1024x1024xB
1024x2000xB
4096x4096xB
1024x1024xB
1024x2000xB
4096x4096xB
1024x1024xB
1024x2000xB
Bx4096x4096
Bx1024x1024
Bx4096x40928
Bx1024x2000
Bx40928x4096
Bx4096x4096
Bx1024x1024
Bx4096x40928
Bx1024x2000
Bx40928x4096
Bx4096x4096
Bx1024x1024
Bx4096x40928
Bx1024x2000
Bx40928x4096
4096x40928xB
4096x40928xB
4096x40928xB
Figure 12. GEMM performance (TF/s) for V100 FP16 vs. A100 FP16/BF16.
17
The benchmark results are shown in Figures 15 and 16.
Communication Benchmark
Low-level collective communication benchmarks, e.g. NVIDIA’s NCCL tests or OSU MPI benchmarks, have the following
limitations:
• Do not capture the behavior of actual workloads, i.e. exact message sizes, sequence of collective operations, etc. Instead
these benchmarks support power-of-two message sizes - helpful to detect network trends.
• Limited to one specific communication library. As the name suggests, NCCL tests works only with NCCL and OSU MPI
benchmarks is limited to MPI.
The PARAM comms benchmarks addresses these gaps by:
• Creating common abstractions across platforms (e.g. NVIDIA GPUs, x86 CPUs, Google TPU etc.) to help standardize the
benchmarking logic.
120
BS=128 BS=256 BS=512 BS=1024 BS=2048 BS=4096
100
80
Mesaured TF/s
60
40
20
0
Bx1Kx1K Bx2Kx2K Bx4Kx4K Bx1Kx1K Bx2Kx2K Bx4Kx4K Bx1Kx1K Bx2Kx2K Bx4Kx4K
V100, FP32 A100, FP32 A100, TF32
Figure 13. MLP performance for V100 FP32 vs. A100 FP32/TF32.
300
BS=128 BS=256 BS=512 BS=1024 BS=2048 BS=4096
250
200
Mesaured TF/s
150
100
50
0
Bx1Kx1K Bx2Kx2K Bx4Kx4K Bx1Kx1K Bx2Kx2K Bx4Kx4K Bx1Kx1K Bx2Kx2K Bx4Kx4K
V100, FP16 A100, FP16 A100, BF16
Figure 14. MLP performance for V100 FP16 vs. A100 FP16/BF16.
18
1600
1400
1200
Measured BW (GB/s)
1000
V100 (GB/s), FP32
200
0
0 10000 20000 30000 40000 50000 60000 70000
Batchsize
Figure 15. Achieved embedding lookup forward bandwidth using FP32 vs. FP16 on V100 vs. A100.
1600
1400
V100 (GB/s), SGD, FP32
1200
V100 (GB/s), SGD, FP16
Measured BW (GB/s)
0
0 10000 20000 30000 40000 50000 60000 70000
Batchsize
Figure 16. Achieved embedding lookup backward+optimizer bandwidth using FP32 vs. FP16 on V100 vs. A100.
• Using PyTorch Process Group APIs to provide a portable interface across different communication libraries (e.g. NCCL,
MPI, and UCC).
PARAM comms benchmarks supports two types of collective benchmarks:
• Bench mode: Simplest mode of operation similar to NCCL tests. Run single collective in blocking or non-blocking
manner across fixed set of message sizes (e.g. power of 2 message sizes). This is mainly used for low-level HW testing
• Replay mode: Replays a trace of collective communication calls to mimic exact workload behavior in terms of collective
sizes.
Figure 17 presents AlltoAll and AllReduce benchmark scaling for power-of-two message sizes on 128 GPUs. AlltoAll
achieves 7GB/s and is primarily limited by scale-out bandwidth (12.5 GB/s peak; 10.5 GB/s achievable on V100). AllReduce
achieves higher bandwidth since it uses NVLINK more effectively.
19
70
60
50
Measured BW (GB/s)
40
Alltoall (GB/s)
30
Allreduce (GB/s)
20
10
0
0 50000000 100000000 150000000 200000000 250000000 300000000
20