Download as pdf or txt
Download as pdf or txt
You are on page 1of 80

VOLUME 40, NUMBER 2 MARCH/APRIL 2020

Hot Chips

www.computer.org/micro
!
W
O

HOST
N
ER
T
IS
EG

2020
4–7 May 2020 • San Jose, CA
R

IEEE INTERNATIONAL SYMPOSIUM


ON HARDWARE-ORIENTED
SECURITY AND TRUST
4–7 May 2020 • San Jose, CA, USA • DoubleTree by Hilton

Join dedicated professionals at the IEEE International Symposium on Hardware


Oriented Security and Trust (HOST) for an in-depth look into hardware-based security
research and development.
Key Topics:
• Semiconductor design, test  • Cryptography
and failure analysis and cryptanalysis
• Computer architecture • Imaging and microscopy
• Systems security
Discover innovations from outside your sphere of influence at HOST. Learn about new
research that is critical to your future projects. Meet face-to-face with researchers and
experts for inspiration, solutions, and practical ideas you can put to use immediately.

REGISTER NOW: www.hostsymposium.org


EDITOR-IN-CHIEF IEEE MICRO STAFF
Lizy K. John, University of Texas, Austin Journals Production Manager: Joanna Gojlik,
j.gojlik@ieee.org
EDITORIAL BOARD Peer-Review Administrator:
micro-ma@computer.org
R. Iris Bahar, Brown University
Publications Portfolio Manager: Kimberly Sperka
Mauricio Breternitz, University of Lisbon
Publisher: Robin Baldwin
David Brooks, Harvard University
Senior Advertising Coordinator: Debbie Sims
Bronis de Supinski, Lawrence Livermore
IEEE Computer Society Executive Director:
Nation al Lab
Melissa Russell
Shane Greenstein, Harvard Business School
Natalie Enright Jerger, University of Toronto IEEE PUBLISHING OPERATIONS
Hyesoon Kim, Georgia Institute of Technology
John Kim, Korea Advanced Institute of Science Senior Director, Publishing Operations:
and Technology Dawn Melley
Hsien-Hsin (Sean) Lee, Taiwan Semiconductor Director, Editorial Services: Kevin Lisankie
Manufacturing Company Director, Production Services: Peter M. Tuohy
Richard Mateosian Associate Director, Editorial Services:
Tulika Mitra, National University of Singapore Jeffrey E. Cichocki
Trevor Mudge, University of Michigan, Ann Arbor Associate Director, Information Conversion
Onur Mutlu, ETH Zurich and Editorial Support: Neelam Khinvasara
Vijaykrishnan Narayanan, The Pennsylvania Senior Art Director: Janet Dudar
State University Senior Manager, Journals Production:
Per Stenstrom, Chalmers University of Patrick Kempf
Technology CS MAGAZINE OPERATIONS COMMITTEE
Richard H. Stern, George Washington
University Law School Sumi Helal (Chair), Irena Bojanova,
Sreenivas Subramoney, Intel Corporation Jim X. Chen, Shu-Ching Chen,
Carole-Jean Wu, Arizona State University Gerardo Con Diaz, David Alan Grier,
Lixin Zhang, Chinese Academy of Sciences Lizy K. John, Marc Langheinrich, Torsten Möller,
David Nicol, Ipek Ozkaya, George Pallis,
ADVISORY BOARD VS Subrahmanian
David H. Albonesi, Erik R. Altman, Pradip Bose, CS PUBLICATIONS BOARD
Kemal Ebcioglu, Lieven Eeckhout,
Fabrizio Lombardi (VP for Publications),
Michael Flynn, Ruby B. Lee, Yale Patt,
Alfredo Benso, Cristiana Bolchini,
James E. Smith, Marc Tremblay
Javier Bruguera, Carl K. Chang, Fred Douglis,
Subscription change of address: Sumi Helal, Shi-Min Hu, Sy-Yen Kuo,
address.change@ieee.org Avi Mendelson, Stefano Zanero, Daniel Zeng
Missing or damaged copies:
COMPUTER SOCIETY OFFICE
help@computer.org
IEEE MICRO
c/o IEEE Computer Society
10662 Los Vaqueros Circle
Los Alamitos, CA 90720 USA +1 (714) 821-8380

IEEE Micro (ISSN 0272-1732) is published bimonthly by the IEEE Computer Society. IEEE Headquarters, Three Park Ave., 17th Floor, New York,
NY 10016-5997; IEEE Computer Society Headquarters, 2001 L St., Ste. 700, Washington, DC 20036; IEEE Computer Society Publications Office,
10662 Los Vaqueros Circle, PO Box 3014, Los Alamitos, CA 90720. Postmaster: Send address changes and undelivered copies to IEEE, Member-
ship Processing Dept., 445 Hoes Ln., Piscataway, NJ 08855. Periodicals postage is paid at New York, NY, and at additional mailing offices. Canadian
GST #125634188. Canada Post Corp. (Canadian distribution) Publications Mail Agreement #40013885. Return undeliverable Canadian addresses
to 4960-2 Walker Road; Windsor, ON N9A 6J3. Printed in USA. Reuse rights and reprint permissions: Educational or personal use of this material is
permitted without fee, provided such use: 1) is not made for profit; 2) includes this notice and a full citation to the original work on the first page of the
copy; and 3) does not imply IEEE endorsement of any third-party products or services. Author and their companies are permitted to post the accepted
version of IEEE-copyrighted material on their own webservers without permission, provided that the IEEE copyright notice and a full citation to the
original work appear on the first screen of the posted copy. An accepted manuscript is a version which has been revised by the author to incorporate
review suggestions, but not the published version with copy-editing, proofreading, and formatting added by IEEE. For more information, please go to
ieee.org/publications_standards/publications/rights/paperversionpolicy.html. Permission to reprint/republish this material for commercial, advertising,
or promotional purposes or for creating new collective works for resale or redistribution must be obtained from IEEE by writing to the IEEE Intellectual
Property Rights Office, 445 Hoes Lane, Piscataway, NJ 08854-4141 or pubs-permissions@ieee.org. ©2020 by IEEE. All rights reserved. Abstracting
and library use: Abstracting is permitted with credit to the source. Libraries are permitted to photocopy for private use of patrons, provided the per-copy
fee indicated in the code at the bottom of the first page is paid through the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923.
Editorial: Unless otherwise stated, bylined articles, as well as product and service descriptions, reflect the author’s or firm’s opinion. Inclusion in IEEE
Micro does not necessarily constitute an endorsement by IEEE or the Computer Society. All submissions are subject to editing for style, clarity, and
space. IEEE prohibits discrimination, harassment, and bullying. For more information, visit ieee.org/web/aboutus/whatis/policies/p9-26.html.
March/April 2020
Volume 40 Number 2

Special Issue
Guest Editors’ Introduction
45 T he AMD “Zen 2”
Processor
David Suggs, Mahesh Subramony, and

6 T he Hot Chips
Renaissance
Dan Bouvier

Christos Kozyrakis and Ian Bratt


Published by the IEEE Computer
Society Theme Articles
53 T he Arm Neoverse N1
Platform: Building Blocks
for the Next-Gen

8 M LPerf: An Industry
Standard Benchmark
Cloud-to-Edge
Infrastructure SoC
Suite for Machine Learning Andrea Pellegrini, Nigel Stephens,
Performance Magnus Bruce, Yasuo Ishii, Joseph Pusdesris,
Peter Mattson, Vijay Janapa Reddi, Abhishek Raja, Chris Abernathy,
Christine Cheng, Cody Coleman, Jinson Koppanalil, Tushar Ringe,
Greg Diamos, David Kanter, Ashok Tummala, Jamshed Jalal,
Paulius Micikevicius, David Patterson, Mark Werkheiser, and Anitha Kona
Guenther Schmuelling, Hanlin Tang,

63 T
Gu-Yeon Wei, and Carole-Jean Wu eraPHY: A Chiplet
Technology for

17 H abana Labs Purpose-Built


AI Inference and Training
Low-Power, High-Bandwidth
In-Package Optical I/O
Processor Architectures: Mark Wade, Erik Anderson,
Scaling AI Training Systems Shahab Ardalan, Pavan Bhargava,
Using Standard Ethernet With Sidney Buchbinder, Michael L. Davenport,
John Fini, Haiwei Lu, Chen Li, Roy Meade,
Gaudi Processor Chandru Ramamurthy, Michael Rust,
Eitan Medina and Eran Dagan Forrest Sedgwick, Vladimir Stojanovic,
Derek Van Orden, Chong Zhang,

25 C ompute Solution for Chen Sun, Sergey Y. Shumarayev,


Conor O’Keeffe, Tim T. Hoang,
Tesla’s Full Self-Driving
David Kehlet, Ravi V. Mahajan,
Computer Matthew T. Guzy, Allen Chan, and
Emil Talpes, Debjit Das Sarma, Tina Tran
Ganesh Venkataramanan, Peter Bannon,
Bill McGee, Benjamin Floering,
Ankit Jalote, Christopher Hsiong, Sahil Arora,
Atchyuth Gorti, and Gagandeep S. Sachdev

36 R
TX on—The NVIDIA
Turing GPU
John Burgess
Image credit: Image licensed by Ingram Publishing

COLUMNS AND DEPARTMENTS


From the Editor-in-Chief
4 Did ML Chips Heat Up the
Chip Design Arena?
Lizy Kurian John

Micro Economics
74 Expertise at Our Fingertips
Shane Greenstein
From the Editor-in-Chief

Did ML Chips Heat Up the Chip


Design Arena?
Lizy Kurian John
The University of Texas at Austin

& WELCOME TO THE March/April 2020 issue of In addition to the seven Hot Chips articles, this
IEEE Micro. This issue features selected articles issue also features a Micro Economics column by
from the 31st Hot Chips Symposium, held at Shane Greenstein, “Expertise at our Fingertips,”
Stanford University in August 2019. The Memorial discussing how crowdsourced Wikipedia has ren-
Auditorium at Stanford was crowded with record dered Encyclopaedia Britannica obsolete. Please
attendance eager to hear on the newest emerging enjoy this article that presents a comparison of
chips. Whether it is machine learning acceleration Wikipedia and Encyclopaedia Britannica on cost,
or sheer increase in compute ability, the chip reliability, coverage, contributor base, ease of
design arena has become hotter than ever. A lot of use, etc.
money is pouring into designing special purpose Let me also provide an overview of what to
and general purpose chips. IEEE Micro is pleased expect in upcoming issues. The May/June issue
to present for our readers selected articles based will be the popular “Top Picks” Special Issue
on the presentations at the Hot Chips Symposium. which presents the best of the best from papers
Christos Kozyrakis of the University of California, in computer architecture conferences in 2019.
Berkeley and Ian Bratt of ARM served as guest edi- Prof. Hyesoon Kim of Georgia Tech and a selec-
tors for this issue. They have compiled an excel- tion committee from industry and academia have
lent selection of articles on emerging chips and selected 12 papers from about 100 articles that
systems from the Symposium, including articles were submitted in response to the Top Picks call
from Tesla, Habana Labs, NVIDIA, AMD, ARM, and for papers. Readers can look forward to an amaz-
Ayar Labs. Please read the Guest Editors’ Intro- ing collection of excellent articles in May/June.
duction to get a preview of the seven papers, Many thematic special issues are planned for
which include two articles on machine learning the remainder of 2020. Themes include Agile/
acceleration, one on machine learning bench- Open Source Hardware, Biology Inspired Comput-
marks, two on high-end chips from AMD and ARM, ing, Machine Learning for Systems, and Chip
an article on the newest GPU from NVIDIA, and
Design 2020. The July/August issue will be a Spe-
one on optical die-to-die interconnects. Thanks to
cial Issue on “Agile/Open Source Hardware.” Tre-
the editors, authors, and reviewers who worked
vor Carlson from the National University of
hard to put this issue together.
Singapore and Yungang Bao from ICT (China) will
be guest editing this special issue. The “Biology-
Inspired Computing” theme will be guest edited
Digital Object Identifier 10.1109/MM.2020.2978375 by Abhishek Bhattacharjee from Yale University.
Date of current version 18 March 2020. Milad Hashemi of Google and Heiner Litz of

0272-1732 ß 2020 IEEE Published by the IEEE Computer Society IEEE Micro
4
University of California, Santa Cruz will guest edit 3) https://www.computer.org/digital-library/
the “Machine Learning for Systems” theme. The magazines/mi/call-for-papers-special-issue-
year will conclude with themes “Chip Design on-chip-design-2020
2020” guest edited by Prof. Jaydeep Kulkarni of
the University of Texas, and “Commercial Prod- IEEE Micro is interested in submissions
ucts 2020” guest edited by David Patterson and on any aspect of chip/system design or
Sophia Shao of University of California, Berkeley. architecture.
We invite readers to submit to these Special Hope you enjoy the articles presented in this
Issues. Please find the open calls at issue. Happy reading!

1) https://www.computer.org/publications/ Lizy Kurian John is currently a Cullen Trust for


Higher Education Endowed Professor with the
author-resources/calls-for-papers
Department of Electrical and Computer Engineering,
2) https://www.computer.org/digital-library/
The University of Texas at Austin. Contact her at
magazines/mi/call-for-papers-special-issue-
ljohn@ece.utexas.edu.
on-commercial-products-2020

March/April 2020
5
Guest Editors’ Introduction

The Hot Chips Renaissance


Christos Kozyrakis Ian Bratt
Stanford University Arm

& THE 31ST ANNUAL Hot Chips symposium was and a large number of startup companies are now
held at Stanford University in August 2019. As competing to design the most effective chips for
the guest editors for this special issue of IEEE existing and emerging applications.
Micro, we are pleased to introduce a selection of For this special issue of IEEE Micro, we
articles based on the best presentations from selected seven talks that capture these trends
the conference program. and asked the authors to extend them into full
The 2019 Hot Chips program reflected the articles.
renaissance in the chips industry that John Hen-
nessy and Dave Patterson described in their 2017
MACHINE LEARNING AND
Turing award lecture and attracted a record high
INTELLIGENCE ACCELERATION
number of attendees. Specifically, we observed
In “MLPerf: An Industry Standard Benchmark
three important trends. The most important
Suite for Machine Learning Performance,” Mattson
development is the widespread focus on accelera-
et al. describe the challenges in developing effec-
tion of machine learning (ML) applications.
tive benchmarks for ML training
Roughly half of the conference talks
and inference workloads in a fast-
described chips of various sizes and The 2019 Hot Chips
moving field. The first two rounds
uses for which ML is a primary appli- program reflected the
of the MLPerf Training benchmark
cation driver. The second trend is renaissance in the
have already motivated significant
the increasing use of novel chips industry that
John Hennessy and improvements in performance and
approaches to overcome scaling
Dave Patterson scalability of popular ML software
and efficiency limitations of modern
described in their 2017 stacks.
systems. The program featured talks
Turing award lecture In “Habana Labs Purpose-Built
on in-package optical I/O, process-
and attracted a record AI Inference and Training Proces-
ing-in-memory, and wafer-scale inte- high number of sor Architectures: Scaling AI
gration. The final trend is the attendees. Training Systems Using Standard
broader set of companies that pro-
Ethernet With Gaudi Processor,”
duce cutting-edge chips. The tradi-
Medina and Dagan summarize
tional semiconductor vendors (for
the design of the Goya inference processor and
CPUs, GPUs, or FPGAs), hyperscale companies,
the Gaudi training processor. They also describe
how to build systems of various scales from
Digital Object Identifier 10.1109/MM.2020.2977409 these specialized ML chips using commodity
Date of current version 18 March 2020. networking interfaces.

0272-1732 ß 2020 IEEE Published by the IEEE Computer Society IEEE Micro
6
In “Compute Solution for Tesla’s Full Self-Driv- OPTICAL DIE-TO-DIE
ing Computer,” Talpes et al. describe a custom- INTERCONNECTS
built chip for autonomous driving. The Tesla chip In “TeraPHY: A Chiplet Technology for Low-
integrates fixed function units and is programma- Power, High-Bandwidth In-Package Optical I/O,”
ble to cores to strike the right balance between Wade et al. from Ayar Labs and Intel summarize the
efficiency and flexibility. technology for in-package optical
interconnects for high bandwidth,
The presentations from energy efficient communication.
NEW GPU Hot Chips 31 and all Space limitations prevent us
ARCHITECTURE previous years are from featuring more articles in this
In “RTX On––The NVIDIA available at http://www.
issue. Nevertheless, the presenta-
Turing GPU,” Burgess presents hotchips.org. We
tions from Hot Chips 31 and all
the new streaming multiproces- encourage you to
previous years are available at
sor and ML accelerator in the explore this exciting
archive, as well as http://www.hotchips.org. We
latest NVIDIA GPU chip. The
contribute to and encourage you to explore this
Turing chip also features a ray
attend the Hot Chips 32 exciting archive, as well as contrib-
tracing accelerators that sup-
conference in August ute to and attend the Hot Chips 32
ports real-time frame rates.
of this year. conference in August of this year.

NEW PROCESSOR CORES Christos Kozyrakis is a Professor of electrical


In “The AMD Zen 2 Processor,” Suggs et al. engineering and computer science with Stanford
present the performance, energy efficiency, and University. His research interests include hardware
security features in the latest AMD processor architectures and system software for cloud comput-
ing and emerging workloads. He is a Fellow of ACM
core.
and IEEE. He is the corresponding author of this arti-
In “The Arm Neoverse N1 Platform: Building
cle. Contact him at christos@cs.stanford.edu.
Blocks for the Next-Gen Cloud-to-Edge Infrastruc-
ture SoC,” Pellegrini et al. describe the first line Ian Bratt is an Arm Fellow. He has worked on multi-
of ARM processor cores designed specifically for core CPUs, GPUs, and now NPUs. Contact him at
cloud and server workloads. ian.bratt@arm.com.

March/April 2020
7
Theme Article: Hot Chips

MLPerf: An Industry
Standard Benchmark Suite
for Machine Learning
Performance
Peter Mattson Paulius Micikevicius
Google Brain NVIDIA
Vijay Janapa Reddi David Patterson
Harvard University Google Brain and University of California,
Berkeley
Christine Cheng
Intel Guenther Schmuelling
Microsoft Azure AI Infrastructure
Cody Coleman
Stanford University Hanlin Tang
Intel
Greg Diamos
Landing AI Gu-Yeon Wei
Harvard University
David Kanter
Real World Technologies Carole-Jean Wu
Facebook and Arizona State University

Abstract—In this article, we describe the design choices behind MLPerf, a machine learning
performance benchmark that has become an industry standard. The first two rounds of the
MLPerf Training benchmark helped drive improvements to software-stack performance and
scalability, showing a 1.3 speedup in the top 16-chip results despite higher quality targets
and a 5.5 increase in system scale. The first round of MLPerf Inference received over 500
benchmark results from 14 different organizations, showing growing adoption.

& MACHINE LEARNING (ML) is transforming multi- and software development. Most modern ML sys-
ple industries, leading to a surge in hardware tems are built atop deep neural networks which
are computationally demanding to train and
Digital Object Identifier 10.1109/MM.2020.2974843
deploy, thus their increasing use in industry is
driving the rapid development of specialized hard-
Date of publication 18 February 2020; date of current version
ware architectures and software frameworks.
18 March 2020.

0272-1732 ß 2020 IEEE Published by the IEEE Computer Society IEEE Micro
8
We need a performance benchmark to evalu- Designing an ML benchmark suite requires
ate these competing ML systems. By providing answering additional questions.
clear metrics, benchmarking aligns research,
 Implementation equivalence: ML accelerator
engineering and marketing, and competitors
across the industry in pursuit of the same objec- architecture varies and there is no standard
tives. For general-purpose computing, a consor- ML software stack. Submitters need to reim-
tium of chip vendors built the SPEC benchmark1 plement the benchmark for their hardware.
in 1988, focusing the competition that drove How do we ensure that implementations are
computing perfor- equivalent enough for fair comparison?
 Training hyperparameter equivalence: for
mance for the next
MLPerf was founded in training benchmarks, which hyperpara-
three decades. Ear- 2018 to combine the
lier ML benchmarks meters are tunable?
best of prior efforts: a 
include DeepBench2 Training convergence variance: for training
broad benchmark set
which focused benchmarks, convergence times have rela-
with a time-to-
on deep learning convergence metric tively high variance. How do we make mean-
primitives, Fathom3 and the support of an ingful measurements?
 Inference weight equivalence: for inference
which introduced academic/industry
a field-spanning set consortium. benchmarks, are retrained or sparsified
weights allowed?
of ML benchmarks,
and DAWNBench4 which proposed time-to-
This section describes how the MLPerf
convergence as a metric. See the work by Mattson
benchmark suites answer these questions.
et al.5 and Reddi et al.6 for more on related work.
MLPerf was founded in 2018 to combine the
best of prior efforts: a broad benchmark set with MLPerf Training
Benchmark Definition We specify an MLPerf
a time-to-convergence metric and the support of
Training benchmark5 as training a model on a
an academic/industry consortium. MLPerf con-
specific data set to reach a target quality. For
tains two suites of ML benchmarks: one for train-
example, one benchmark measures training on
ing,5 and one for inference.6 MLPerf has released
the ImageNet data set until the image classifica-
two rounds of results for the training suite and
tion top-1 accuracy reaches 75.9%. However, this
one round for the inference suite. Comparing the
basic definition does not answer one critical
two rounds of training data shows MLPerf is
question: do we specify which model to train?
encouraging improvements in performance and
Specifying the model enables apples-to-apples
scalability; comparing all three rounds shows
performance comparisons of software or hard-
growing adoption. The remainder of this article
ware alternatives because it requires all alterna-
describes the design choices faced by anyone
tives to process the same workload. However,
seeking to benchmark ML performance, and how
not specifying the model encourages model
MLPerf navigated those choices to become a
improvements and hardware–software codesign.
nascent industry standard.
We created two divisions of results: a Closed
Division that requires using a specific model for
direct comparisons, and an Open Division that
DESIGN CHOICES
allows the use of any model to support model
Designing any benchmark suite requires
innovation.
answering three big questions.

 Benchmark definition: How to specify a mea- Metric Definition There are two obvious met-
surable task? rics for training performance: throughput, the
 Metric definition: How to measure number of data processed per second, and time-
performance? to-train, the wall clock time it takes for the model
 Benchmark selection: What set of tasks to to reach a target quality. These metrics could
measure? also be normalized by cost or power, which we

March/April 2020
9
Hot Chips

Table 1. MLPerf Training v0.5 Benchmarks. We then tried to select one or two specific
benchmarks within each area. In doing so, we
Area Problem Data set Model chose models based on four characteristics.
Image
Vision ImageNet7 ResNet7
recognition  Maturity: We sought models that were near
7 7
Object detection COCO SSD state-of-the-art but also showed evidence of
Object growing adoption.
COCO7 Mask R CNN7 
segmentation Variety: We chose models that included a
WMT Eng.- range of constructs such as a convolutional
Language Translation GNMT7
German7 neural network (CNN), a recurrent neural net-
WMT Eng.- work (RNN), attention, and an embedding
Translation Transformer7
German7 table.
 Complexity: We chose model sizes to reflect
Commerce Recommendation Movielens-20M7 NCF7
current and anticipated market demands.
Research RL Go, 99 board MiniGo7  Practicality: We chose only benchmarks with
available data sets and models.

will address later on. Throughput has advan- Table 1 shows the benchmarks in MLPerf
tages. First, it is computationally inexpensive to Training v0.5. We chose ResNet as having rela-
measure because you do not train the model to tively high accuracy and wide adoption. We
completion. Second, it has a relatively low vari- added SSD and Mask R-CNN to cover two differ-
ance because the compute cost per datum is ent points in the important vision complexity
constant in most models. However, throughput space. We chose transformer and GNMT for
can be increased at the cost of time-to-train translation to increase variety by including
by using optimizations like lower precision attention and an RNN. We chose MiniGo for RL
numerics or larger batch sizes. because it did not require an even more compu-
We chose to use time-to-train because it accu- tationally expensive physics simulation. MLPerf
rately reflects the primary goal in choosing a Training presently omits medical imaging,
training system: to fully train models as quickly speech-to-text, text-to-speech, NLP, time series,
as possible. Unfortunately, it is computationally and GANs. Future versions of MLPerf will
expensive to fully train models. Furthermore, address these applications, starting with BERT
the number of epochs needed to train a model in MLPerf training v0.7.
varies due to random weight initialization and
stochastic floating-point ordering effects. How- Implementation Equivalence ML bench-
ever, we feel time-to-train is the least-bad alter- marks cannot function like conventional bench-
native available. marks in which fixed code is executed because it
Benchmark Selection Once we chose how to is not currently possible to write portable, scal-
specify a benchmark, we needed to select a set able, high-performance ML code. There is no sin-
of benchmarks. We first divided ML applications gle ML framework supported by all architectures.
into five broad areas. Furthermore, ML code needs to be tuned for the
architecture and system scale in order to achieve
 Vision: image classification, object detection, high performance.
segmentation, medical imaging. Instead, MLPerf allows submitters to reimple-
 Speech: speech to text, text to speech. ment the benchmarks. However, this flexibility
 Language: translation, natural language proc- raises the question of implementation equiva-
essing (NLP). lence. MLPerf requires that all submitters to the
 Commercial: recommenders, time-series. closed division use the same model to enable
 Research: reinforcement learning (RL) for apples-to-apples hardware comparisons, but
games or robotics, generative adversarial what does it mean to use the same model? MLPerf
networks (GANs). provides a functional but unoptimized reference

IEEE Micro
10
implementation. MLPerf rules require performing Table 2. MLPerf Inference v0.5 Benchmarks.
the same set of mathematical operations as the
reference implementation to produce each out- Area Task Data set Model

put, using the same optimizer to update the Image ImageNet


Vision Resnet50-v1.57
classification (224224)7
weights, and using the same preprocessing and
evaluation methods. Submitters are allowed to Image ImageNet MobileNets-v1
Vision
classification (224224)7 224p7
reorder data channels and parallel operations,
Object COCO
use different numerical representations, and Vision SSD-ResNet347
detection (12001200)7
make a few other whitelisted changes.
Object COCO SSD-
Vision
Hyperparameter Tuning Different systems detection (300300)7 MobileNets-v17
need different hyperparameters for optimal per- Language
Machine WMT Eng.-
GNMT7
formance. Systems have varying levels of paral- translation German7

lelism and therefore demand different batch


sizes (numbers of inputs between weight
MLPerf balances variance and computation
updates). Similarly, different numerical repre-
cost by averaging over a number of runs but still
sentations affect training behavior and require
accepting relatively high variance. MLPerf aver-
changes to the learning rate schedule and other
ages five runs for relatively stable vision
optimizer hyperparameters.
benchmarks producing results that are roughly
Finding the optimal hyperparameters requi-
2.5% and ten runs for the other higher
res exploring a many-dimensional space where
variance benchmarks producing results that are
each point is evaluated by training an ML model
roughly 5%.
to convergence, which can take days on a single
processor. Therefore, allowing submitters to do
an unlimited hyperparameter search conveys an MLPerf Inference
advantage on those with the most computational Benchmark Definition We define an MLPerf
resources and/or best hyperparameter search inference benchmark6 as processing a series of
strategy. inputs to a trained model to produce outputs
Currently, MLPerf tries to level the hyper- that meet a quality target. We expand on this
parameter playing field while still allowing suffi- basic definition with four measurement scenar-
cient flexibility in two ways. ios shown in black in Figure 1, each of which
addresses a class of use cases.
1. Search limits: MLPerf limits which hyperpara-
meters can be tuned. 1. Single stream: a series of inputs are proc-
2. Hyperparameter borrowing: MLPerf allows essed one after the other as in, for example, a
submitters to “borrow” hyperparameters cell mobile vision application.
from other submissions and update their 2. Multiple stream: fixed-size batches of inputs
results prior to posting final results. are processed one after the other as in, for
example, automotive vision.
Variance The time to train a model to a target
3. Server: inputs arrive according to a Poisson
quality has relatively high variance. The time to
distribution as in, for example, an online
train is roughly proportional to the number of
translation service.
epochs (passes over the training data) required.
4. Offline: all inputs are available at once as in,
The number of epochs required varies because
for example, a photo labeling application.
the starting weights are initialized to random val-
ues and because of nondeterministic floating-point MLPerf inference then defines a benchmark
ordering. This variance can be reduced by averag- for each of the above scenarios. Like MLPerf
ing the results of multiple runs. However, the Training, MLPerf Inference has a Closed Division
reduction is proportional to the square root of the which mandates a specific model to enable
number of runs and each run is computationally direct comparisons and an Open Division which
costly; often multiple days on a single processor. allows any model to encourage innovation.

March/April 2020
11
Hot Chips

equivalence for the Closed Divi-


sion. MLPerf Inference handles
implementation equivalence by
introducing two fundamental
constraints.

1. All implementations must use a


standardized load generator
that implements the four sce-
narios and measures the corre-
sponding metrics.
2. All implementations must use
the same reference weights.

Figure 1. MLPerf inference scenarios and metrics. In addition to these two fun-
damental constraints, there is a
short blacklist of forbidden opti-
Metric Definition The ideal metric for measur- mizations including incorporating additional
ing the performance of an inference system weights or other information about the data set,
varies with the use case. For instance, a mobile caching results to take advantage of repeated
vision application needs low latency, while inputs, or sparsely evaluating the weights.
an offline photo application demands high
throughput. For this reason, each MLPerf infer- Quantization, Retraining, and Sparsity
ence benchmark scenario has a different metric, Inference systems can use quantized, retrained,
as shown in red in Figure 1. or sparsified weights to increase computational
1. Single stream: latency. efficiency at the cost of reduced accuracy,
2. Multiple stream: number of streams subject which may or may not match market require-
to a latency bound. ments and can offer an unfair advantage in
3. Server: Poisson-arrival queries per second benchmarking. Different applications have dif-
subject to a latency bound. ferent tolerance for accuracy loss, making it
4. Offline: throughput. challenging to set an inference benchmark qual-
ity target. Further, retraining and sparsification
Benchmark Selection The benchmark selec- techniques are a research area with significant
tion for MLPerf Inference was also driven by proprietary technology and allowing either pro-
maturity, diversity, complexity, and practicality. vides an advantage to those with the best
However, we also needed to choose models with methods.
complexities suitable to a spectrum of hardware For the initial version of MLPerf inference
ranging from mobile devices to servers and sup- we took a simple approach by not allowing
port whatever models we chose in multiple sce- retraining or sparsification. Most quality tar-
narios. Thus, the initial version of MLPerf gets are set to 99% of the quality that can be
Inference focuses only on the most common achieved with 32-bit floating point weights.
vision tasks at different complexities. It comple- These targets are somewhat lower than might
ments the vision models with one moderate-size be required for many applications, but we only
language model to increase model diversity. allow post-training quantization. For future ver-
Future versions will expand this model selection sions, we are investigating more flexible
and better align it with training. approaches to accuracy and allowing retrain-
ing and/or sparsification.
Implementation Equivalence MLPerf Infer-
ence gives submitters the ability to reimplement Presentation
the models to handle software stack and hardware Results or Single Summary Score Given a
diversity, again raising the question of model set of benchmark results, there are still choices

IEEE Micro
12
Figure 2. Speedup in the fastest 16-chip entry from MLPerf Training version v0.5 to v0.6, despite more timed
work due to increased quality targets as shown.

about how to present them in the most informa- MLPerf currently provides scale information
tive way. For instance, should results for one sys- in the form of chip counts and future versions
tem on multiple benchmarks be combined will include power measurements. MLPerf does
to produce a single summary “score”? A score not provide price because it is not a physical
provides a consistent and easy way to communi- quantity and can vary over time and market.
cate the bottom line. However, it assumes MLPerf does not normalize because the most
that systems are designed for general perfor- important scale factor is different for different
mance and that all benchmarks in the suite mat- uses.
ter equally. We provide results instead of a single
summary score because the range of ML use RESULTS
cases, from automotive vision to online recom- Benchmark suites aim to drive technological
mendations, makes these assumptions incorrect: progress on faster hardware and software; we can
most users care about a subset of the bench- measure this progress by comparing the best
marks and many hardware architectures are results on the benchmark suite over time. Compar-
specialized. ing two rounds of MLPerf Training results shows
progress in software-stack performance and scal-
Scale Information and Normalization ability. The two rounds of results were collected
Another presentation question is: should the approximately six months apart, and are driven by
results be normalized or include scale informa- the same ML accelerators. Figure 2 compares five
tion? Different systems have different scale fac- benchmarks that did not change significantly
tors such as price, power consumption, and chip between the two rounds, though the target quality
count. If system A performs slightly better than levels of three out of five benchmarks were
system B, but consumes twice as much power, increased, which increases the amount of work
which is a better design? In order to help con- being timed.8 The fastest 16-chip entries across the
sumers of benchmark results better utilize them, two rounds show an average 1.3 speedup despite
the results could be presented with one or more more work required. Figure 3 shows an average
scale factors as additional information or nor- 5.5 increase in system scale across the two
malized by a specific scale factor. rounds as submitters were able to effectively utilize

March/April 2020
13
Hot Chips

Figure 3. Increase in the number of chips used in the system that produced the fastest overall score from
MLPerf Training version v0.5 to v0.6.

more chips.8 Some of these improvements are well as growth in adoption. Figure 4 shows that
benchmark-specific and some would have occurred the systems submitted, ranging from embedded
without MLPerf, but many, based on first-hand devices to cloud scale data center solutions,
observations of the authors, are generic and moti- cover a very wide range of hardware that spans
vated by MLPerf. Over time, we expect similar four orders of magnitude in terms of perfor-
improvements in hardware. mance.9 The number of submissions and results
Since we only have a single round of MLPerf in each round of MLPerf is increasing: v0.5 had 3
inference results, we cannot yet assess if it is submitters and 40þ results, training v0.6 had 5
driving performance improvements but can submitters and 60þ results, inference v0.5 had
assess how well it handles diverse hardware as 14 submitters and 500þ results.

Figure 4. Normalized performance distribution in log scale from results in the closed division.

IEEE Micro
14
CONCLUSION involvement of engineers and researchers who
ML is a developing field and the MLPerf bench- are interested in helping us make MLPerf better
mark suites will need to evolve with the field; we by going to mlperf.org/get-involved.
have created an organization to enable that evolu-
tion. MLPerf inference and training are both
driven by active working groups (WGs): a sub- ACKNOWLEDGMENTS
mitter’s WG that maintains the rules, a special MLPerf would not be possible without: B.
topics WG that explores deep technical issues, Anderson, P. Bailis, V. Bittorf, M. Breughe, D.
and a results WG that handles submission review Brooks, M. Charlebois, D. Chen, W. Chou, R.
and results presentation. Other WGs focus on spe- Chukka, S. Davis, P. Deng, J. Duke, D. Dutta, D.
cific topics such as power measurement or new Fick, J. S. Gardner, U. Gupta, K. Hazelwood, A.
benchmarks. We are creating a legal entity to Hock, X. Huang, I. Hubara, S. Idgunji, T. B. Jablin,
provide a long term foundation for the effort. B. Jia, J. Jiao, D. Kang, P. Kanwar, N. Kumar, D.
We are developing a long-term benchmark Lee, J. Liao, A. Lokhmotov, F. Massa, P. Meng, D.
roadmap. We aim to add new benchmarks to fully Narayanan, T. Oguntebi, C. Osborne, G. Pekhi-
cover the five large ML areas we initially identi- menko, L. Pentecost, A. T. R. Rajan, T. Robie, D.
fied: vision, speech, language, commerce, and Sequeira, A. Sirasao, T. St. John, F. Sun, H. Tang,
research. Over time, we will retire and replace M. Thomson, F. Wei, E. Wu, L. Xu, K. Yamada, C.
benchmarks to keep Young, B. Yu, G. Yuan, M. Zaharia, P. Zhang, A.
pace with the field Zhong, Y. Zhou, and many others.
The MLPerf effort is now
and to reduce the
supported by more
temptation to tune
than 65 companies and & REFERENCES
for benchmarks
researchers from eight 1. K. M. Dixit, “The SPEC benchmarks,” Parallel Comput.,
rather than real educational institutions. vol. 17, no. 10/11, pp. 1195–1209, 1991.
applications. We are We welcome the 2. Baidu. “DeepBench: Benchmarking deep learning
recruiting a panel of involvement of
operations on different hardware,” 2017. [Online].
academic and indus- engineers and
Available: https://github.com/baidu-research/DeepBench
try advisors for researchers who are
3. R. Adolf, S. Rama, B. Reagen, G.-Y. Wei, and
each area to ensure interested in helping us
make MLPerf better by D. Brooks, “Fathom: Reference workloads for modern
that MLPerf bench-
going to mlperf.org/ deep learning methods,” in Proc. IEEE Int. Symp.
marks are neutrally
get-involved. Workload Characterization, 2016, pp. 1–10.
driven by research
4. C. Coleman et al., “DAWNBench: An end-to-end deep
and industry needs.
learning benchmark and competition,” in Proc. 31st
Other future work includes the following.
Conf. Neural Inf. Process. Syst., 2017.
 Creating a mobile application that can run 5. P. Mattson et al., “MLPerf training benchmark,” 2019,
select MLPerf inference benchmarks on arXiv:1910.01500.
smartphones. 6. V. Reddi et al., “MLPerf inference benchmark,” 2019,
 Improving reference implementations as arXiv:1911.02549.
starting points for development. 7. [Online]. Available: https://mlperf.org/dataset-
 Producing a “hyperparameter table” that maps and-model-credits/ presents full dataset and model
system scale and precision to recommended citations.
hyperparameters for each benchmark. 8. [Online]. Available: https://mlperf.org/training-results-
 Developing better large public data sets for 0-5/ and https://mlperf.org/training-results-0-6/
benchmarking and other purposes. present complete results. https://github.com/mlperf/
 Developing better software best practices for training_results_v0.6 and https://github.com/mlperf/
ML benchmarking and experimentation. training_results_v0.5 contain system details.
9. [Online]. Available: https://mlperf.org/inference-results/
The MLPerf effort is now supported by more
present complete results and https://github.com/
than 65 companies and researchers from eight
mlperf/inference_results_v0.5 contains system details.
educational institutions. We welcome the

March/April 2020
15
Hot Chips

Peter Mattson is currently a General Chair with Paulius Micikevicius is currently a Distinguished
MLPerf. He leads the ML Performance Metrics team Engineer with NVIDIA. He received the Ph.D. degree
at Google Brain. He received the Ph.D. degree from from the University of Central Florida, Orlando, FL,
Stanford University, Stanford, CA, USA. Contact him USA. Contact him at pauliusm@nvidia.com.
at petermattson@google.com.
David Patterson is currently a Google Brain Dis-
Vijay Janapa Reddi is currently an Inference tinguished Engineer. He is a U.C. Berkeley Professor.
Chair with MLPerf. He is an Associate Professor with He is a Vice-Chair Board of Directors of RISC-V Foun-
Harvard University, Cambridge, MA, USA. He dation. He received the Ph.D. degree from the Uni-
received the Ph.D. degree from Harvard University. versity of California, Los Angeles, CA, USA. Contact
Contact him at vj@ece.utexas.edu. him at pattrsn@cs.berkeley.edu.

Christine Cheng is currently an Inference Chair with Guenther Schmuelling is currently an Inference
MLPerf. She is a Sr. Machine Learning Optimization Results Chair with MLPerf. He is a Principal Tech
Engineer with Intel, M.S., Stanford University, Stanford, Lead. He is with the Microsoft Azure AI Infrastructure.
CA, USA. Contact her at christine.cheng@intel.com. Contact him at guschmue@microsoft.com.

Cody Coleman is currently a Research Chair Hanlin Tang is currently a Sr. Director of AI Lab
with MLPerf. He is currently working toward the with Intel. He received the Ph.D. degree from Har-
Ph.D. degree in computer science with Stanford vard University, Cambridge, MA, USA. Contact him
University, Stanford, CA, USA. He received the at hanlin.tang@intel.com.
M. Eng. degree from Massachusetts Institute of
Technology, Cambridge, MA, USA. Contact him at Gu-Yeon Wei is currently a Robert and Suzanne
cody@cs.stanford.edu. Case Professor of electrical engineering and com-
puter science with Harvard University, Cambridge,
Greg Diamos is currently a Data sets Chair with MA, USA. He received the Ph.D. degree from Stan-
MLPerf. He leads AI Transformations team at Land- ford University, Stanford, CA, USA. Contact him at
ing AI. He received the Ph.D. degree from Georgia gywei@g.harvard.edu.
Tech, Atlanta, GA, USA. Contact him at gregory.
diamos@gmail.com. Carole-Jean Wu is currently an Inference Chair
with MLPerf. She is an Applied Research Scientist
David Kanter is currently an Inference and Power with Facebook. She is an Associate Professor
Chair with MLPerf. He is with Real World Technolo- at Arizona State University, Tempe, AZ, USA.
gies. He received the B.S. degree in mathematics She received the Ph.D. degree from Princeton
and the B.A. degree in economics from the University University, Princeton, NJ, USA. Contact her at
of Chicago, Chicago, IL, USA. Contact him at carolejeanwu@fb.com.
dkanter@gmail.com.

IEEE Micro
16
Theme Article: Hot Chips

Habana Labs Purpose-


Built AI Inference and
Training Processor
Architectures: Scaling AI
Training Systems Using
Standard Ethernet With
Gaudi Processor
Eitan Medina and Eran Dagan
Habana Labs

Abstract—The growing computational requirements of AI applications are challenging


today’s general-purpose CPU and GPU architectures and driving the need for purpose-
built, programmable AI solutions. Habana Labs designed its Goya processor to meet the
high throughput/low latency demands of Inference workloads, and its Gaudi processor for
throughput combined with massive scale up and scale out capability needed to speed
training workloads efficiently. To address the need for scaling training, Habana is the first
AI chip developer to integrate standard Ethernet onto a training processor.

& WHILE DATACENTER INFERENCE workloads have clear that purpose-built architectures optimized
typically run on CPUs and training on GPUs, it is for AI workloads are needed to provide the
required performance and features, These AI
processors must be fully programmable, sup-
Digital Object Identifier 10.1109/MM.2020.2975185
ported by robust software tools and enable data
Date of publication 28 February 2020; date of current version
centers to reduce total cost of ownership. In
18 March 2020.

March/April 2020 Published by the IEEE Computer Society 0272-1732 ß 2020 IEEE
17
Hot Chips

addition, the ability to scale-up and scale-out to processors. However, to achieve acceleration,
meet increasing computational needs, while fos- high processor utilization is required. Scaling AI
tering an ecosystem of suppliers with standard workloads has been a focus of Habana-Lab’s
interfaces and communication protocols, is training processor architecture from its initial
needed for AI purpose-built processors. stages and is considered fundamentally a net-
This article discusses the characteristics of working challenge. Habana Lab’s approach to
AI training and inference processors and the scaling AI training is the focus of this article and
inherent differences between their architecture is described in detail.
requirements, describing the Goya and Gaudi Despite the differences between inference and
high-level architecture and focusing on Gaudi’s training, both processors must be fully program-
unique approach for scaling AI training in data mable to handle popular frameworks and com-
centers. pilers. The processor should also be easily
customizable to perform well on new workloads
TRAINING AND INFERENCE and allow customers to migrate from other
REQUIREMENT DIFFERENCES processors while preserving their algorithmic
Neural networks inference and training execu- codebase.
tion share several functional building blocks; how-
ever there are key attributes that drive material
architecture differences. For training, the key met- GOYA INFERENCE PROCESSOR
ric is time to converge to an accuracy goal, which HIGH-LEVEL ARCHITECTURE
is closely correlated to the processor’s through- As training and inference architectures
put. For inference, throughput (at a given share some of the same functional building
accuracy level) is the primary per- blocks, the Goya architecture
formance metric, although some This article discusses laid the groundwork for Gaudi.
inference applications are also sen- the characteristics of The Goya Inference Processor is
sitive to latency (dictated by end- AI training and based on the scalable architec-
user applications). Accuracy is criti- inference processors ture of Habana’s Tensor-Proc-
cal for training and inference work- and the inherent essing Core (TPC) and includes
loads, but in the case of inference, it differences between a cluster of eight programmable
is acceptable for some end-applica- their architecture cores. TPC is Habana’s proprie-
tions to tradeoff accuracy for more requirements, tary core designed to support
describing the Goya
throughputs.5,6 deep learning workloads. It is a
and Gaudi high-level
Another inherent difference VLIW SIMD vector processor
architecture and
between training and inference is with Instruction-Set-Architec-
focusing on Gaudi’s
the required memory capacity. In unique approach for ture and hardware tailored to
inference, only the last layer of acti- scaling AI training in serve deep learning workloads
vations needs to be stored, as data centers. efficiently. The TPC is C/Cþþ
opposed to training which requires programmable, providing the
calculating the gradients as part of user with maximum flexibility to
the back-propagation flow. Thus, all activations innovate, coupled with many workload-ori-
for all layers typically must be stored and the ented features such as: General Matrix Multiply
required internal memory capacity is signifi- (GEMM) operation acceleration, special-func-
cantly larger; this memory capacity requirement tions dedicated hardware, tensor addressing,
grows substantially with the depth of the net- and latency hiding capabilities. The TPC
work. On top of that, wider data types which are natively supports these mixed-precision data
typically deployed in training processors add types: FP32, INT32/16/8, UINT32/16/8. To
additional memory capacity requirement. achieve maximum hardware efficiency, Habana
Very long training tasks performed on large Labs SynapseAI quantizer tool selects the
data sets are often scaled-up and scaled-out to appropriate data type by balancing throughput
clusters of hundreds and even thousands of and performance versus accuracy. For

IEEE Micro
18
Figure 2. BERT language model performance.

Goya Configuration2:
Figure 1. RESNET-50 throughput and latency. Hardware: Goya HL-100 PCIe Card; CPU-
XEON-E5
Software: Ubuntu-v-16.04; SynapseAI-v-0.1.6
predictability and low latency, Goya is based Workload implementation: Precision INT8
GPU Configuration4:
on software-managed, on-die memory along
Hardware Configuration: T4; Host
with programmable DMAs. For robustness, all Supermicro SYS-4029GP-TRT-T4
memories are ECC-protected. Software Configuration: TensorRT-5.1;
All Goya engines (TPCs, GEMM, and DMA) Synthetic data set; Container-19.03-py3
Workload implementation: Precision INT8
can operate concurrently and communicate
To quantify and benchmark the BERT work-
via shared memory. For external interface, the
load, Nvidia’s demo release was used. The demo
processor uses PCIe Gen416 enabling com-
runs a Question answering task, and given a
munication to any host of choice, FPGA or
question and context, determines the span of
peer-to-peer communication with another
the answer within the context (Tested date:
Goya. The processor includes two 64-b chan-
August 2019).
nels of DDR4 memory interface with max
capacity of 16 GB. The Goya architecture sup- Data set: SQuAD; Topology: BERT_BASE,
ports mixed precision of both integer and Layers ¼ 12 Hidden_Size ¼ 768, Heads ¼ 12
floating points, which allows it to flexibly sup- Intermediate_Size ¼ 3072 Max_Seq_Len ¼
128.
port different workloads and applications,
under quantization controls that the user can As presented in Figure 2, Goya delivers
specify. higher throughput in sentences-per-second, as
well as lower latency. In addition, while the T4
Goya Performance2 saturates for throughput, Goya can scale further
with batch of 24, delivering more than twice the
Although Goya is a cost-efficient processor
built upon TSMC’s mature 16 nm process, throughput, while running at half the latency.
Goya leads in performance benchmarks for Goya Configuration:
both throughput and latency compared with Hardware: Goya-HL-100; Xeon-Gold-6152
alternative Inference GPUs designed using @2.10GHz
Software: Ubuntu-v-16.04.4; SynapseAI-v-
more advanced process nodes. This achieve-
0.2.0-1173
ment is rooted in its purpose-built architec- GPU Configuration:
ture. Below are performance examples from Hardware: T4; CPU-Xeon-Gold-6154@3GHz/
different types of deep learning applications. 16GB/4-VMs
Software: Ubuntu-18.04.2.x86_64-gnu; CUDA-
Workload performance results for vision Ver-10.1, cudnn7.5; TensorRT-5.1.5.0;
classification (Resnet-50/ImageNet data set)
and Natural Language Processing (BERT) are
presented in Figure 1 and compare to the cor- GAUDI TRAINING PROCESSOR HIGH
responding T4-GPU performance. As presented LEVEL ARCHITECTURE
below, Goya’s Resnet-50 performance consis- Goya Inference architecture provided a good
tently delivers higher throughput, as well as foundation for the Gaudi training processor with
lower latency, across all batch sizes. the following key defining goals.

March/April 2020
19
Hot Chips

both inside the server and rack (termed as scale-


up), as well as for scaling across racks (scale-
out). These can be connected directly between
Gaudi processors, or through any number of
standard Ethernet switches.
As presented in Figure 3, Gaudi uses a cluster
of eight TPC-2.0 cores (second generation TPC
core) that natively supports these data types—
FP32, BF16, INT32/16/8, and UINT32/16/8.
The Gaudi memory architecture includes on-
Figure 3. Gaudi architecture-block diagram.
die SRAM and TPC local memories. In addition,
the chip package integrates four high bandwidth
1. Deliver performance at scale with high power memory devices, providing total capacity of 32 GB
efficiency and high throughput at low batch and 1-TB/s bandwidth. The PCIe interface pro-
sizes (crucial to enable near-linear scaling for vides a host interface and supports both Gen 3.0
data parallel models where the data set is and 4.0.
split across many processors). Compared with competing architectures, cus-
2. Enable native Ethernet Scale-out: using RDMA tomers do not need to add an array of PCIe
over Converged Ethernet (RoCE.v2) to scale switches and dedicated NICs. In addition, Ether-
AI systems to any needed size, leveraging an net switches, unlike proprietary switches, are
existing ecosystem of suppliers (e.g., stan- available from many vendors with numerous
dard Ethernet switches/cables) and reducing options in port-count (from small switches to
system complexity, power, and cost. 12.8 Tb/s with 128 ports of 100 GbE in a single
3. Promote standard form factors such as the chip), providing customers flexibility and
Open Compute Project (OCP) Accelerator improved efficiency.
Module (OAM) to enhance customer system
design flexibility.
GAUDI PERFORMANCE
4. Provide software infrastructure and tools
A single Gaudi Resnet-50 training delivers
supporting popular frameworks and ML com-
performance of more than 1650 images/s of
pilers, coupled with rich TPC kernel library
throughput while dissipating 140 W. This single-
and user-friendly development tools, to
card performance was achieved using low batch
enable optimization and customization.
size of 64, which enables scaling the perfor-
mance in close-to-linear scale, up to very large
Gaudi is the industry’s sole AI Processor that
systems (e.g., 13,200 images/s for 8 Gaudi chips
integrates on-chip RDMA over Converged Ether-
or 844,800 images/s for 640 Gaudi chips). Such
net (RoCE.v2) engines. These engines play a criti-
linear scale capability is rooted in Gaudi’s pur-
cal role in the interprocessor communication
pose-built architecture which delivers superior
needed during the training process. By integrat-
performance while running at low batch-size and
ing this functionality on-chip and supporting
latency.3
bidirectional networking throughput of up to 2
Tb/s, customers can build systems of any size.
The Gaudi processor embeds 20 pairs of 56-Gb/s GAUDI SCALE-OUT SOLUTION
Tx/Rx PAM4 serializers/deserializers (SerDes) Two main scale strategies are commonly
that can be configured as 10 ports of 100 Gb used today when training deep learning models:
Ethernet, 20 ports of 50-Gb/25-Gb Ethernet, or Data parallelism and model parallelism.3
any combination in between. These ports are
designed to scale out the inter-Gaudi communi- Data Parallelism
cation by integrating a complete communication In data parallelism, every machine (typically a
engine on-die. This native integration allows cus- single processor) has a complete copy of the
tomers to use the same scaling methodology, model. Each processor receives a portion of the

IEEE Micro
20
data, performs the training locally, then commu-
nicates its gradients to the other processors;
these updates are combined and then shared
back with all participating processors, before the
next iteration of training can begin with new data.

Model Parallelism Training


Model parallelism is typically employed when
the model is too big to fit in one machine. With
model parallelism, different machines in the dis-
tributed system are responsible for computing
different parts of a single neural network. For
example, each layer (or a portion of a layer)
Figure 4. HLS-1: system topology.
in the neural network may be assigned to a dif-
ferent machine. In another example of model
parallelism, parameters of a single layer are dis- offers 8100 GbE connectivity. The HL-205 is a
tributed across different machines. Model paral- standard OCP-OAM compliant card that can be
lelism requires low-latency and extremely high- used in optimized servers for maximum perfor-
throughput connectivity between processors. mance and offers 10100 GbE connectivity. HLS-
These can easily require several orders of magni- 1 (see Figure 4) is a system provided by Habana,
tude higher communication bandwidth betw- containing eight Gaudi-based HL-205 Mezzanine
een machines/processors than data-parallelism cards and dual PCIe switches.
alone. In HLS-1, the Gaudi processors are connected
in an all-to-all connection on the baseboard PCB,
using seven of the ten 100 GbE ports of each
BUILDING AI TRAINING SYSTEMS Gaudi. The remaining three ports from each
WITH GAUDI Gaudi are available to scale out the solution over
System Building Blocks Ethernet ports on the HLS-1. Any host can man-
Gaudi leverages advantages of standard age the Gaudi system through the PCIe ports
Ethernet networking technology for scaling. using PCIe cables. Such a system topology is
Each Gaudi chip implements 10 ports of 100 Gb optimal in both data parallelism (where HLS-1
Ethernet (or 20 ports of 50 GbE). Integrating net- serves as the first reduction hierarchy) and in
working directly into the AI processor chip ena- model-data parallelism hybrid, using all-to-all
bles a nimble system, extensible with virtually connectivity within the HLS-1 for model para-
unlimited communication capacity. By combin- llelism together with intra-HLS-1 connectivity
ing multiple Gaudi processors with Ethernet for data parallelism (intra- and intersystem
switching, limitless possibilities are available for connectivity).
distributing training across any number of Gaudi
processors: a single Gaudi, a single rack, multi- Gaudi System With Maximum Scale-Out
ple racks, and clusters of thousands Gaudis or Figure 5 shows a system containing eight
more. Gaudi chips and interfaces, 416 lanes of PCIe
As Gaudi uses off-the-shelf Ethernet, many Gen4 cables that can be connected to an
different systems and network configurations external host servers, and up to 80100 Gb
can be devised. The following examples describe Ethernet links (using 20 QSFP-DD connectors).
a few of many potential system implementations The external Ethernet links can be connected
that can be built using Gaudi’s processor-based to any switching hierarchy. Such configuration
cards and systems in combination with widely can be optimized to implement model parallel-
available enterprise-grade Ethernet switches. ism in large scale and can readily handle data
The HL-200 is a standard dual-slot PCIe card parallelism or a combination of model and data
that fits in existing servers with PCIe slots and parallelism.

March/April 2020
21
Hot Chips

in order to form a much larger training farm con-


taining hundreds or thousands of Gaudi process-
ors. In this example, each HLS-1 is connected to
the switch with four cables, carrying a total of
16100 GbE.
A customer optimizing for model parallelism
may design their own system using HL-205 cards
and optimize the connectivity to use more or
fewer ports

Topologies for Data Parallelism


Figure 7 presents a larger system built using
the Gaudi system as a basic component. It shows
three reduction (recursive calculation used to
Figure 5. Gaudi system with maximum scale-out. aggregate a set of data) levels: one within the sys-
tem, between 11 Gaudi systems and between 12
islands. Altogether, this system hosts 81112 ¼
1056 Gaudi cards. Larger systems can be built
with an additional aggregation layer or with less
bandwidth per Gaudi.
Such hierarchical fabric can also serve a com-
bination of data and model parallelism. Each
Gaudi system can be used for model parallelism,
whereas data parallelism can be used between
Gaudi systems. Normally, model parallelism
requires higher bandwidth and lower latency
than data parallelism. The ratio of Ethernet
switching to Gaudi accelerators can offer the
system designer the freedom to scale network-
ing bandwidth and compute capacity as needed.
Ethernet switch chips, available from multiple
vendors, can also be integrated inside the Gaudi
Figure 6. Example rack configuration.
box for higher communication bandwidth.

Gaudi-Based Training Rack Topologies for Model Parallelism


Figure 6 shows a configuration where six Since Gaudi uses standard Ethernet connec-
Gaudi systems (48 Gaudi devices in total) are tivity, very large-scale systems can be built with
connected to a single Ethernet switch. That all-to-all connectivity utilizing a single network-
switch can be further connected to other racks ing hop.

Figure 7. Hierarchical fabric.

IEEE Micro
22
Figure 8. Fully connected, single-hop system.

Figure 8 shows such a large scale system. A The Goya software stack allows interfacing
system with 128 Gaudi chips can be built in an with popular frameworks and compilers, sup-
analog manner, connecting a single 100 GbE port porting models trained on any platform. The
per Gaudi (thus 8100 per Gaudi System) to any flow starts by deconstructing the trained topol-
of the five 128-port 100 GbE switches. Such a ogy and some example data into an internal
large-scale, single-hop system with full connec- representation. The quantization process then
tivity is only possible when integrating Ethernet follows, allowing the user to improve throughput
directly in the deep learning accelerator. while maintaining negligible accuracy loss. The
graph compiler then maps the execution to
Goya’s building blocks, using the available
SOFTWARE DEVELOPMENT TOOLS library kernels (provided by Habana and easily
Habana SynapseAI is a comprehensive soft- extendible by customers to develop tailor-made
ware toolkit that simplifies the development and kernels). The execution recipe is generated and
deployment of deep learning models for mass- scheduled by the runtime APIs.
market use. The SynapseAI tool suite enables The Gaudi software suite, leverages many of
users to execute algorithms, efficiently using Goya aspects, with the addition of training opti-
its high-level software abstraction. However, mized TPC kernels and the following additional
advanced users can perform further optimiza- capabilities.
tions and add their own proprietary code using
 Multistream execution of networking and
the provided software development tools. These
tools include an LLVM-based C compiler, simula- compute. Streams can be synchronized with
tor, debugger, and profiling capabilities. The one another at high performance and with
tools also facilitate the development of custom- low runtime overhead.
 JIT compiler, capable of fusing and compil-
ized TPC kernels to augment the extensive ker-
nel library provided by Habana. ing multiple layers together, thereby

Figure 9. Gaudi/Goya SW stacks.

March/April 2020
23
Hot Chips

increasing utilization and exploiting hard- handle much bigger models. This means raising
ware resources. the bar on what AI training can do.
 Habana Communication Library (HCL): tailor- Finally, and most importantly for data cen-
made to Gaudi’s high-performance RDMA ters, is the need to avoid being locked-in to pro-
communication capabilities. HCL exposes all prietary interfaces. By avoiding processors that
required primitives such as reduce, all-reduce, come with proprietary system interfaces, and
gather, broadcast, etc. Gaudi and Goya soft- instead insisting on standards-based scaling,
ware stacks are both presented in Figure 9. Habana’s customers can avoid locking them-
selves to any particular vendor. With standard
SUMMARY Ethernet scaling they can enjoy a competitive
Architectural requirements of Inference and ecosystem and replace AI processor suppliers
training share enough similarities that the Goya without reconstructing their infrastructure.
inference design provided a good foundation for With Gaudi all these benefits can be realized.
the Gaudi training
chip. However, the
designs diverge By avoiding processors
& REFERENCES
when it comes to that come with 1. E. Medina, “Habana labs approach to scaling AI
addressing the proprietary system training,” Proc. Hot Chips 31, CA, USA. 2019. [Online].
unique scaling interfaces, and Available: http://www.hotchips.org/hc31/
requirements of AI instead insisting on HC31_1.14_HabanaLabs.Eitan_Medina.v9.pdf
training. Habana standards-based
2. Goya Inference Platform Whitepaper, Habana Labs,
scaling, Habana’s
integrates RoCE 2019. [Online]. Available: https://habana.ai/wp-
customers can avoid
RDMA in the pro- content/uploads/2019/06/Goya-Whitepaper-Inference-
locking themselves to
cessor chip itself any particular vendor. Performance.pdf
and has enabled With standard Ethernet 3. Gaudi Training Platform Whitepaper, Habana Labs,
scaling AI like never scaling they can enjoy a 2019. [Online]. Available: https://habana.ai/wp-
before. competitive content/uploads/2019/06/Habana-Gaudi-Training-
The on-chip, ecosystem and replace Platform-whitepaper.pdf
RoCE integration AI processor suppliers 4. NVIDIA Tesla Deep Learning Product Performance.
provides Habana’s without reconstructing 2019. [Online]. Available: https://developer.nvidia.
customers the con- their infrastructure. com/deep-learning-performance-training-inference
trol they need over 5. ML Perf Training Benchmark, 2019. [Online].
their systems, gaining unprecedented design flexi- Available: https://arxiv.org/pdf/1910.01500.pdf
bility to build any needed system while easily scal- 6. DAWNBench, “An end-to-end deep learning
ing the capacity from a single processor to benchmark and competition,” 2018. [Online].
hundreds and even thousands of processors. With Available: https://cs.stanford.edu/deepakn/assets/
GPUs that rely on proprietary system interfaces, papers/dawnbench-sosp17.pdf
systems hit a bandwidth wall while trying to scale 7. C. J. Shallue and J. Lee, “Measuring the effects of data
beyond 16 GPUs. Using Gaudi with off-the-shelf parallelism on neural network training,” 2019. [Online].
Ethernet switches, model parallel training can be Available: https://arxiv.org/pdf/1811.03600.pdf
performed across a much larger system and

IEEE Micro
24
Theme Article: Hot Chips

Compute Solution for


Tesla’s Full Self-Driving
Computer
Emil Talpes, Debjit Das Sarma,
Ganesh Venkataramanan, Peter Bannon,
Bill McGee, Benjamin Floering, Ankit Jalote,
Christopher Hsiong, Sahil Arora,
Atchyuth Gorti, and Gagandeep S. Sachdev
Autopilot Hardware, Tesla Motors Inc.

Abstract—Tesla’s full self-driving (FSD) computer is the world’s first purpose-built computer
for the highly demanding workloads of autonomous driving. It is based on a new System on a
Chip (SoC) that integrates industry standard components such as CPUs, ISP, and GPU, together
with our custom neural network accelerators. The FSD computer is capable of processing up to
2300 frames per second, a 21 improvement over Tesla’s previous hardware and at a lower
cost, and when fully utilized, enables a new level of safety and autonomy on the road.

PLATFORM AND CHIP GOALS The heart of the FSD computer is the world’s
& THE PRIMARY GOAL of Tesla’s full self-driving first purpose-built chip for autonomy. We pro-
(FSD) computer is to provide a hardware platform vide hardware accelerators with 72 TOPs for neu-
for the current and future data processing ral network inference, with utilization exceeding
demands associated with full self-driving. In addi- 80% for the inception workloads with a batch size
tion, Tesla’s FSD computer was designed to be ret- of 1. We also include a set of CPUs for control
rofitted into any Tesla vehicle made since October needs, ISP, GPU, and video encoders for various
2016. This introduced major constraints on form preprocessing and postprocessing needs. All of
factor and thermal envelope, in order to fit into these are integrated tightly to meet very aggres-
older vehicles with limited cooling capabilities. sive TDP of sub-40-W per chip.
The system includes two instances of the FSD
chip that boot independently and run indepen-
Digital Object Identifier 10.1109/MM.2020.2975764 dent operating systems. These two instances
Date of publication 24 February 2020; date of current version also allow independent power supply and sen-
18 March 2020. sors that ensure an exceptional level of safety

March/April 2020 Published by the IEEE Computer Society 0272-1732 ß 2020 IEEE
25
Hot Chips

the unmarked area of the


chip consists of periph-
erals, NOC fabrics, and
memory interfaces. Each
NNA has 32-MB SRAM and
96  96 MAC array. At 2
GHz, each NNA provides
36 TOPs, adding up to 72
TOPs total for the FSD
chip.
The FSD SoC, as shown
in Figure 2(b), provides
general-purpose CPU
Figure 1. FSD Computer with two Tesla FSD chips in dual configurations including sensors cores that run most of the
like Cameras. autopilot algorithms.
Every few milliseconds,
for the system. The computer as shown in new input frames are received through a dedi-
Figure 1 meets the form, fit, and interface level cated image signal processor where they get pre-
compatibility with the older hardware. processed before being stored in the DRAM.
Once new frames are available in the main mem-
ory, the CPUs instruct the NNA accelerators to
FSD CHIP start processing them. The accelerators control
The FSD chip is a 260-mm2 die that has about the data and parameters streaming into their
250 million gates or 6 billion transistors, manu- local SRAM, as well as the results streaming back
factured in 14-nm FinFet technology by Samsung. to the DRAM. Once the corresponding result
As shown in Figure 2, the chip is packaged in a frames have been sent out to the DRAM, the accel-
37.5 mm  37.5 mm Flip Chip BGA Package. The erators trigger an interrupt back to the CPU com-
chip is qualified to AEC-Q100 Grade2 reliability plex. The GPU is available for any postprocessing
standards. tasks that might require algorithms not sup-
Figure 2(a) shows the major blocks in the chip. ported by the NNA accelerators.
We designed the two instances of neural-network
accelerator (NNA) from scratch and we chose Chip Design Methodology
industry standard IPs such as A72 CPUs, G71 Our design approach was tailored to meet
GPU, and ISPs for the rest of the system. Rest of aggressive development timelines. To that end, we

Figure 2. (a) FSD chip die photo with major blocks. (b) SoC block diagram.

IEEE Micro
26
Figure 3. Inception network, convolution loop, and execution profile.

decided to build a custom accelerator since this every layer sequentially. An object is detected
provides the highest leverage to improve perfor- after the final layer.
mance and power consumption over the previous As shown in Figure 3, more than 98% of all
generation. We used hard or soft IPs available in operations belong to convolutions. The algo-
the technology node for the rest of the SoC blocks rithm for convolution consists of a seven deep
to reduce the development of schedule risk. nested loop, also shown in Figure 3. The compu-
We used a mix of industry-standard tools and tation within the innermost loop is a multiply-
open source tools such as verilator for extensive accumulate (MAC) operation. Thus, the primary
simulation of our design. Verilator simulations goal of our design is to perform a very large num-
were particularly well suited for very long tests ber of MAC operations as fast as possible, with-
(such as running entire neural networks), where out blowing up the power budget.
they yielded up to 50 speedup over commer- Speeding up convolutions by orders of mag-
cial simulators. On the other hand, design com- nitude will result in less frequent operations,
pilation under verilator is very slow, so we relied such as quantization or pooling, to be the bottle-
on commercial simulators for quick turnaround neck for the overall performance if their perfor-
and debug during the RTL development phase. mance is substantially lower. These operations
In addition to simulations, we extensively used are also optimized with dedicated hardware to
hardware emulators to ensure a high degree of improve the overall performance.
functional verification of the SoC.
For the accelerator’s timing closure, we set a Convolution Refactorization and Dataflow
very aggressive target, about 25% higher than The convolution loop, with some refactoring,
the final shipping frequency of 2 GHz. This allows is shown in Figure 4(a). A closer examination
the design to run well below Vmax, delivering reveals that this is an embarrassingly parallel
the highest performance within our power bud- problem with lots of opportunities to process the
get, as measured after silicon characterization. MAC operations in parallel. In the convolution
loop, the execution of the MAC operations within
the three innermost loops, which determine the
NEURAL NETWORK ACCELERATOR length of each dot product, is largely sequential.
Design Motivation However, the computation within the three outer
The custom NNA is used is to detect a prede- loops, namely for each image, for each output
fined set of objects, including, but not limited to channel, for all the pixels within each output
lane lines, pedestrians, different kinds of vehicles, channel, is parallelizable. But it is still a hard
at a very high frame rate and with modest power problem due to the large memory bandwidth
budget, as outlined in the platform goals. requirement and a significant increase in power
Figure 3 shows a typical inception convolu- consumption to support such a large parallel
tional neural network.1,2 The network has many computation. So, for the rest of the paper, we will
layers and the connections indicating flow of com- focus mostly on these two aspects.
pute data or activations. Each pass through this First thing to note is that working on multiple
network involves an image coming in, and various images in parallel is not feasible for us. We can-
features or activations, being constructed after not wait for all the images to arrive to start the

March/April 2020
27
Hot Chips

Figure 4. Convolution refactoring and dataflow.

compute for safety reasons since it increases the computation for the same set of pixels within all
latency of the object detection. We need to start output channels use the same input data.
processing images as soon as they arrive. Figure 4(b)–(d) also illustrates the dataflow
Instead, we will parallelize the computation with the above refactoring for a convolution
across multiple-output channels and multiple- layer. The same output pixels of the successive
output pixels within each output channel. output channels are computed by sharing the
Figure 4(a) shows the refactored convolution input activation, and successive output pixels
loop, optimizing the data reuse to reduce power within the same output channel are computed
and improve the realized computational band- by sharing the input weights. This sharing of
width. We merge the two dimensions of each data and weights for the dot product computa-
output channel and flatten them into one dimen- tion is instrumental in utilizing a large compute
sion in the row-major form, as shown in step (2) bandwidth while reducing the power by mini-
of Figure 4(a). This provides many output pixels mizing the number of loads to move data
to work on, in parallel, without losing local conti- around.
guity of the input data required.
We also swap the loop for iterating over the Compute Scheme
output channels with the loop for iterating over The algorithm described in the last section
the pixels within each output channel, as shown with the refactored convolution lends itself to a
in steps (2) and (3) of Figure 4(a). For a fixed compute scheme with the dataflow as shown in
group of output pixels, we first iterate on a subset Figure 5. A scaled-down version of the physical
of output channels before we move onto the next 96  96 MAC array is shown in the middle for the
group of output pixels for the next pass. One brevity of space, where each cell consists of a
such pass, combining a group of output pixels unit implementing a MAC operation with a single
within a subset of output channels, can be per- cycle feedback loop. The rectangular grids on
formed as a parallel computation. We continue the top and left are virtual and indicate data
this process until we exhaust all pixels within the flow. The top grid, called the data grid here,
first subset of output channels. Once all pixels shows a scaled-down version of 96 data elements
are exhausted, we move to the next subset of in each row, while the left grid, called the weight
output channels and repeat the process. This grid here, shows a scaled-down version of 96
enables us to maximize data sharing, as the weights in each column. The height and width of

IEEE Micro
28
groups of eight. This reduces the power con-
sumed to shift the accumulator data significantly.
Another important feature in the MAC engine
is the overlap of the MAC and SIMD operations.
While accumulator values are being pushed
down to the SIMD unit for postprocessing, the
next pass of the convolution gets started immedi-
ately in the MAC array. This overlapped computa-
tion increases the overall utilization of the
computational bandwidth, obviating dead cycles.

Design Principles and Instruction Set


Figure 5. Compute scheme. Architecture
The previous section describes the dataflow
of our computation. For the control flow, we
the data and weight grids equal the length of the focused on simplicity and power efficiency. An
dot-product. average application executed on modern out-of-
The computation proceeds as follows: the order CPUs and GPGPUs 5–7 burns most of the
first row of the data grid and the first column of energy outside of the computational unit to
the weight grid are broadcast across all the 96 move the instructions and data and in the expen-
rows and 96 columns of the MAC array, respec- sive structures such as caches, register files, and
tively, over a few cycles in a pipelined manner. branch predictors.8 Furthermore, such control
Each cell computes an MAC operation with the structures also introduce significant design com-
broadcast data and weight locally. In the next plexity. Our goal was to design a computer
cycle, the second row of the data grid and the where almost all the profligate control struc-
second column of the weight grid are broadcast tures are eliminated, and the execution of the
in a pipelined manner, and the MAC computation workloads spend all the energy on what matters
in each cell is executed similarly. This computa- most for performance, the MAC engine. To that
tion process continues until all the rows and col- end, we implemented a very flexible yet profi-
umns of the data and weight grids have been cient state machine where all the expensive con-
broadcast and all the MAC operations have fin- trol flows are built into the state machine, such
ished. Thus, each MAC unit computes the dot- as loop constructs and fusion.
product locally with no data movement within Another very important performance and
the MAC array, unlike the systolic array compu- power optimization feature is the elimination of
tations implemented in many other process- DRAM reads and writes during the convolution
ors.3,4 This results in lower power and less cell flow. For inference, the output data of each layer
area than systolic array implementations. is consumed by dependent layers and can be
When all the MAC operations are completed, overwritten. After loading the initial set of activa-
the accumulator values are ready to be pushed tion data, this machine operates entirely from an
down to the SIMD unit for post-processing. This SRAM embedded in the compute engine itself.
creates the first 96  96 output slice as shown in This design philosophy is outlined in the
Figure 5. The postprocessing, which most com- final section. We trade off fine-grain programma-
monly involves a quantization operation, is per- bility that requires expensive control structures
formed in the 96-wide SIMD unit. The 96-wide for a flexible state machine with coarse grain
SIMD unit is bandwidth matched with the 96 ele- programmability. The state machine driven
ment accumulator output associated with each control mechanism lends itself to a very com-
output channel. Accumulator rows in the MAC pact yet powerful and flexible ISA. There are
array are shifted down to the SIMD unit at the only seven main instructions, with a variety
rate of one row per cycle. Physically, the accumu- of additional control fields that set up the
lator rows shift only once every eight cycles, in state machine to perform different tasks: data

March/April 2020
29
Hot Chips

Figure 6. Typical network program.

movement in and out of the SRAM (DMA-read producer/consumer ordering is maintained using
and DMA-write), dot product (CONVOLUTION, explicit dependency flags.
DECONVOLUTION, INNER-PRODUCT), and pure A typical program is shown in Figure 6. The
SIMD (SCALE, ELTWISE). program starts with several DMA-read opera-
Data movement instructions are 32-byte longs tions, bringing data and weights into the acceler-
and encode the source and destination address, ator’s SRAM. The parser inserts them in a queue
length, and dependency flags. Compute instruc- and stops at the first compute instruction. Once
tions are 256-byte long and encode the input the data and weights for the pending compute
addresses for up to three tensors (input activa- instruction become available in the SRAM, their
tions and weights or two activations tensors, out- corresponding dependency flags get set and the
put results), tensor shapes and dependency flags. compute instruction can start executing in paral-
They also encode various parameters describing lel with other queued DMA operations.
the nature of computation (padding, strides, dila- Dependency flags are used to track both data
tion, data type, etc.), processing order (row-first availability and buffer use. The DMA-in operation
or column-first), optimization hints (input and at step 6 overwrites one of the buffers sourced
output tensor padding, precomputed state by the preceding convolution (step 5) as shown
machine fields), fused operations (scale, bias, in Figure 6. Thus, it must not start executing
pooling). All compute instructions can be fol- before its destination flag (F0) gets cleared at
lowed by a variable number of SIMD instructions the end of the convolution. However, using a dif-
that describe a SIMD program to be run on all ferent destination buffer and flag would allow
dot-product outputs. As a result, the dot product the DMA-in operation to execute in parallel with
layers (CONVOLUTION, DECONVOLUTION) can the preceding convolution.
be fused with simple operations (quantization, Our compiler takes high-level network rep-
scale, ReLU) or more complex math functions resentations in Caffe format and converts them
such as Sigmoid, Tanh, etc. to a sequence of instructions similar to the one
in Figure 6. It analyzes the compute graph and
orders it according to the dataflow, fusing or
NETWORK PROGRAMS partitioning layers to match the hardware
The accelerator can execute DMA and Com- capabilities. It allocates SRAM space for inter-
pute instructions concurrently. Within each kind, mediate results and weights tensors and man-
the instructions are executed in order but can be ages execution order through dependency
reordered between them for concurrency. The flags.

IEEE Micro
30
Figure 7. NNA Microarchitecture.

NNA MICROARCHITECTURE Each accumulator cell is built around two 30-


The NNA, as shown in Figure 7, is organized bit registers: an accumulator and a shift register.
around two main datapaths (dot-product engine Once a compute sequence is completed, the dot
and SIMD unit) and the state machines that inter- product result is copied into the shift register
pret the program, generate streams of memory and the accumulator is cleared. This allows the
requests, and control the data movement into results to shift out through the SIMD engine
and out of the datapaths. while the next compute phase starts in the dot
product engine.
Dot Product Engine
As described in the “Compute Scheme” sec- SIMD Unit
tion, the dot-product engine is a 96  96 array The SIMD unit is a 96-wide datapath that can
of MAC cells. Each cell takes two 8-bit integer execute a full set of arithmetic instructions. It
inputs (signed or unsigned) and multiplies reads 96 values at a time from the dot product
them together, adding the result to a 30-bit engine (one accumulator row) and executes a
wide local accumulator register. There are postprocessing operation as a sequence of
many processors that deploy floating-point instructions (SIMD program). A SIMD program
operations with single precision or half-preci- cannot access the SRAM directly and it does not
sion floating-point (FP) data and weight for support flow control instructions (branches).
inference. Our integer MAC compute has The same program is executed for every group
enough range and precision to execute all of 96 values unloaded from the MAC array.
Tesla workloads with the desired accuracy and The SIMD unit is programmable with a rich
consumes an order of magnitude lower power instruction set with various data types, 8-bit,
than the ones with FP arithmetic.9 16-bit, and 32-bit integers and single-precision
During every cycle, the array receives two floating point (FP32). The instruction set also
vectors with 96 elements each and it multiplies provides for conditional execution for control
every element of the first vector with every ele- flow. The input data is always 30-bit wide (cast
ment of the second vector. The results are accu- as int32) and the final output is always 8-bit
mulated in place until the end of the dot product wide (signed or unsigned int8), but the interme-
sequence when they get unloaded to the SIMD diate data formats can be different than the
engine for further processing. input or output.

March/April 2020
31
Hot Chips

Since most common SIMD programs can be reordered, but requests coming from different
represented by a single instruction, called Fuse- sources can be prioritized to minimize the bank
dReLu (fused quantization, scale, ReLU), the conflicts.
instruction format allows fusing any arithmetic During inference, weights tensors are always
operation with shift and output operations. The static and can be laid out in the SRAM to ensure
FusedReLu instruction is fully pipelined, allow- an efficient read pattern. For activations, this is
ing the full 96  96 dot-product engine to be not always possible, so the accelerator stores
unloaded in 96 cycles. More complex postpro- recently read data in a 1-kB cache. This helps to
cessing sequences require additional instruc- minimize SRAM bank conflicts by eliminating
tions, increasing the unloading time of the Dot back-to-back reads of the same data. To reduce
Product Engine. Some complex sequences are bank conflicts further, the accelerator can pad
built out of FP32 instructions and conditional input and/or output data using different patterns
execution. The 30-bit accumulator value is con- hinted by the network program.
verted to an FP32 operand in the beginning of
such SIMD programs, and the FP32 result is con- Control Logic
verted back to the 8-bit integer output at the end As shown in Figure 7, the control logic is split
of the SIMD program. between several distinct state machines: Com-
mand Sequencer, Network Sequencer, Address
Pooling Support and Data sequencers, and SIMD Unit.
After postprocessing in the SIMD unit, the Each NNA can queue up multiple network
output data can also be conditionally routed programs and execute them in-order. The Com-
through a pooling unit. This allows the most fre- mand Sequencer maintains a queue of such pro-
quent small-kernel pooling operations (2  2 and grams and their corresponding status registers.
3  3) to execute in the shadow of the SIMD exe- Once a network runs to completion, the acceler-
cution, in parallel with the earlier layer produc- ator triggers an interrupt in the host system.
ing the data. The pooling hardware implements Software running on one of the CPUs can exam-
aligners to align the output pixels that were rear- ine the completion status and re-enable the net-
ranged to optimize convolution, back to the orig- work to process a new input frame.
inal format. The pooling unit has three 96-byte  The Network Sequencer interprets the pro-
96-byte pooling arrays with byte-level control. gram instructions. As described earlier, instruc-
The less frequent larger kernel pooling opera- tions are long data packets which encode enough
tions execute as convolution layers in the dot- information to initialize an execution state
product engine. machine. The Network Sequencer decodes this
information and steers it to the appropriate con-
Memory Organization sumer, enforces dependencies and synchronizes
The NNA uses a 32-MB local SRAM to store the machine to avoid potential race-conditions
weights and activations. To achieve high band- between producer and consumer layers.
width and high density at the same time, the Once a compute instruction has been
SRAM is implemented using numerous relatively decoded and steered to its execution state
slow, single ported banks. Multiple such banks machine, the Address Sequencer then generates
can be accessed every cycle, but to maintain the a stream of SRAM addresses and commands for
high cell density, a bank cannot be accessed in the computation downstream. It partitions the
consecutive cycles. output space in sections of up to 96  96 elements
Every cycle the SRAM can provide up to 384 and, for each such section, it sequences through
bytes of data through two independent read all the terms of the corresponding dot-product.
ports, 256-byte and 128-byte wide. An arbiter pri- Weights packets are preordered in the SRAM
oritizes requests from multiple sources (weights, to match the execution, so the state machine sim-
activations, program instructions, DMA-out, etc.) ply streams them in groups of 96 consecutive
and orders them through the two ports. Requests bytes. Activations, however, do not always come
coming from the same source cannot be from consecutive addresses and they often must

IEEE Micro
32
Figure 8. Achieved utilization versus MAC array dimension.

be gathered from up to 96 distinct SRAM loca- primary concerns are always tied to its operat-
tions. In such cases, the Address Sequencer must ing clock frequency. A high clock frequency
generate multiple load addresses for each packet. makes it easier to achieve the target perfor-
To simplify the implementation and allow a high mance, but it typically requires some logic sim-
clock frequency, the 96-elements packet is parti- plifications which in turn hurt the utilization of
tioned into 12 slices of 8 elements each. Each slice specific algorithms.
is serviced by a single load operation, so the maxi- We decided to optimize this design for deep
mum distance between its first and last element convolutional neural networks with a large num-
must be smaller than 256 bytes. Consequently, a ber of input and output channels. The 192 bytes
packet of 96 activations can be formed by issuing of data and weights that the SRAM provides to
between 1 and 12 independent load operations. the MAC array every cycle can be fully utilized
Together with control information, load data only for layers with a stride of 1 or 2 and layers
is forwarded to the Data Sequencer. Weights are with higher strides tend to have poorer
captured in a prefetch buffer and issues to execu- utilization.
tion as needed. Activations are stored in the Data The accelerator’s utilization can vary signifi-
Cache, from where 96 elements are gathered and cantly depending on the size and shape of the
sent to the MAC array. Commands to the data- MAC array, as shown in Figure 8. Both the incep-
path are also funneled from the Data Sequencer, tion-v4 and the Tesla Vision network show signif-
controlling execution enable, accumulator shift, icant sensitivity to the height of the MAC array.
SIMD program start, store addresses, etc. While processing more output channels at the
The SIMD processor executes the same pro- same time can hurt overall utilization, adding
gram for each group of 96 accumulator results that capability is relatively cheap since they all
unloaded from the MAC array. It is synchronized share the same input data. Increasing the width
by control information generated within the of the array does not hurt utilization as much,
Address Sequencer, and it can decode, issue, but it requires significantly more hardware
and execute a stream of SIMD arithmetic instruc- resources. At our chosen design point (96  96
tions. While the SIMD unit has its own register MAC array), the average utilization for these net-
file and it controls the data movement in the works is just above 80%.
datapath, it does not control the destination Another tradeoff we had to evaluate is the
address where the result is stored. Store SRAM size. Neural networks are growing in size,
addresses and any pooling controls are gener- so adding as much SRAM as possible could be a
ated by the Address Sequencer when it selects way to future-proof the design. However, a signifi-
the 96  96 output slice to be worked on. cantly larger SRAM would grow the pipeline
depth and the overall area of the chip, increasing
both power consumption and the total cost of the
ARCHITECTURAL DECISIONS AND system. On the other hand, a convolutional layer
RESULTS too large to fit in SRAM can always be broken into
When implementing very wide machines like multiple smaller components, potentially paying
our MAC array and SIMD processor, the some penalty for spilling and filling data to the

March/April 2020
33
Hot Chips

DRAM. We chose 32 MB of SRAM per accelerator 2. W. Rawat and Z. Wang, “Deep convolutional neural
based on the needs of our current networks and networks for image classification: A comprehensive
on our medium-term scaling projections. review,” Neural Comput., vol. 29, no. 9, , pp. 2352–2449,
Sep. 2017.
3. K. Sato, C. Young, and D. Patterson, “An in-depth look
CONCLUSION at Google’s first tensor processing unit,” Google Cloud
Tesla’s FSD Computer provides an excep- Platform Blog, May 12, 2017.
tional 21 performance uplift over commer- 4. N. P. Jouppi et al., “In-datacenter of a performance
cially available solutions used in our previous analysis tensor processing unit,’’ in Proc. 44th Annu.
hardware while reducing cost, all at a modest Int. Symp. Comput. Archit., 2017, vol. 1, pp. 1–12.
25% extra power. This level of performance 5. I. Cutress, “AMD zen 2 microarchitecture analysis:
was achieved by the uncompromising adher- Ryzen 3000,’’ AnandTech, Jun. 10, 2019.
ence to the design principle we started with. 6. “NVIDIA volta AI architecture,” NVIDIA, 2018. [Online].
At every step, we maximized the utilization of Available: https://www.nvidia.com/en-us/data-center/
the available compute bandwidth with a high volta-gpu-architecture/
degree of data reuse and a minimalistic design 7. J. Choquette, “Volta: Programmability and performance,”
for the control flow. This FSD Computer will Nvidia, Hot Chips, 2017. [Online]. Available: https://www.
be the foundation for advancing the FSD fea- hotchips.org/wp-content/uploads/hc_archives/hc29/
ture set. HC29.21-Monday-Pub/HC29.21.10-GPU-Gaming-Pub/
The key learning HC29.21.132-Volta-Choquette-NVIDIA-Final3.pdf
Tesla’s FSD Computer
from this work has provides an exceptional 8. M. Horowitz, “Computing’s energy problem,” in IEEE
been the tradeoff 21x performance uplift Int. Solid-State Circuits Conf. Dig. Tech. Papers, 1999,
between efficiency over commercially pp. 10–14.
and flexibility. A available solutions used 9. M. Komorkiewicz, M. Kluczewski, and M. Gorgon,
custom solution in our previous “Floating point HOG implementation of for real-time
with fixed-function hardware while
multiple object detection,” in Proc. 22nd Int. Conf.
hardware offers the reducing cost, all at a
Field Programm. Logic Appl., 2012, pp. 711–714.
highest efficiency, modest 25% extra
while a fully pro- power. This level of
performance was
grammable solution Emil Talpes is a Principal Engineer with Tesla, Palo
achieved by the
is more flexible but Alto, CA, USA, where he is responsible for the archi-
uncompromising
significantly less tecture and micro-architecture of inference and train-
adherence to the
efficient. We finally ing hardware. Previously, he was a principal member
design principle we
settled on a solu- of the technical staff at AMD, working on the micro-
started with.
tion with a con- architecture of 86 and ARM CPUs. He received the
figurable fixed- Ph.D. degree in computer engineering from Carne-
gie Mellon University, Pittsburgh, PA, USA. Contact
function hardware that executes the most
him at etalpes@tesla.com.
common functions very efficiently but added a
programmable SIMD unit, which executes less
common functions at a lower efficiency. Our Debjit Das Sarma is a Principal Autopilot Hard-
knowledge of the Tesla workloads deployed ware Architect with Tesla, Palo Alto, CA, USA, where
for inference allowed us to make such a trade- he is responsible for the architecture and micro-
off with a high level of confidence. architecture of inference and training hardware.
Prior to Tesla, he was a Fellow and Chief Architect
of several generations of 86 and ARM processors
& REFERENCES at AMD. His research interests include computer
architecture and arithmetic with focus on deep
1. Y. LeCun, P. Haffner, L. Bottou, and Y. Bengio, “Object learning solutions. He received the Ph.D. degree in
recognition with gradient-based learning,” in computer science and engineering from Southern
Proceeding: Shape, Contour and Grouping in Computer Methodist University, Dallas, TX, USA. Contact him
Vision. New York, NY, USA: Springer-Verlag, 1999. at ddassarma@tesla.com.

IEEE Micro
34
Ganesh Venkataramanan is a Senior Director Ankit Jalote is a Senior Staff Autopilot Hardware
Hardware with Tesla, Palo Alto, CA, USA, and Engineer. He is interested in the field of computer
responsible for Silicon and Systems. Before architecture and the hardware/software relationship
forming the Silicon team with Tesla, he led AMD’s in machine learning applications. He received the
CPU group that was responsible for many genera- Master’s degree in electrical and computer engineer-
tions of 86 and ARM cores. His contributions ing from Purdue University, West Lafayette, IN, USA.
include industry’s first 86-64 chip, first Dual- Contact him at ajalote@tesla.com.
Core 86 and all the way to Zen core. He
received the Master’s degree from IIT Delhi, Delhi, Christopher Hsiong is a Staff Autopilot Hardware
India, in the field of integrated electronics Engineer with Tesla, Palo Alto, CA, USA. His
and Bachelor’s degree from Bombay Uni- research interests include computer architecture,
versity, Mumbai, Maharashtra. Contact him at machine learning, and deep learning architecture.
gvenkataramanan@tesla.com. He received the Graduate degree from the University
of Michigan Ann Arbor, Ann Arbor, MI, USA. Contact
Peter Bannon is a VP of hardware engineering him at chsiong@tesa.com.
with Tesla, Palo Alto, CA, USA. He leads the team
that created the Full Self Driving computer that is Sahil Arora is a member of Technical Staff with
used in all Tesla vehicles. Prior to Tesla, he was the Tesla, Palo Alto, CA, USA. His research interests
Lead Architect on the first 32b ARM CPU used in the are machine learning, microprocessor architecture,
iPhone 5 and built the team that created the 64b microarchitecture design, and FPGA design. He
ARM processor in the iPhone 5s. He has been received the Master’s degree in electrical engineer-
designing computing systems for over 30 years at ing from Cornell University, Ithaca, NY, USA, in 2008.
Apple, Intel, PA Semi, and Digital Equipment Corp. Contact him at saarora@tesla.com.
Contact him at pbannon@tesla.com.
Atchyuth Gorti is a Senior Staff Autopilot Hard-
Bill McGee is a Principal Engineer leading a ware Engineer with Tesla, Palo Alto, CA, USA. His
machine learning compiler team, mainly foc- research interests include testability, reliability, and
used on distributed model training on custom safety. He received the Master’s degree from the
hardware. He received the BSSEE degree in Indian Institute of Technology, Bombay, in reliability
microelectronic engineering from Rochester Institute engineering. Contact him at agorti@tesla.com.
of Technology, Rochester, NY, USA. Contact him at
bill@mcgeeclan.org. Gagandeep S Sachdev is a Staff Hardware Engi-
neer with Tesla, Palo Alto, CA, USA. He has worked as
Benjamin Floering is a Senior Staff Hardware a Design Engineer with AMD and ARM. His research
Engineer with Tesla, Palo Alto, CA, USA, whose interests include computer architecture, neural net-
research interests include low power design as well works, heterogeneous computing, performance anal-
as high-availability and fault tolerant computing. He ysis and optimization, and simulation methodology.
is also a member of IEEE. He received the BSEE He received the Master’s degree from University of
degree from Case Western Reserve University, Utah, Salt Lake City, UT, USA, in computer engineer-
Cleveland, OH, USA, and the MSEE degree from Uni- ing, with research topic of compiler-based cache
versity of Illinois at Urbana-Champaign, Champaign, management in many core systems. Contact him at
IL, USA. Contact him at floering@ieee.org. gsachdev@tesla.com.

March/April 2020
35
Theme Article: Hot Chips

RTX on—The NVIDIA


Turing GPU
John Burgess
NVIDIA

Abstract—NVIDIA’s latest processor family, the Turing GPU, was designed to realize
a vision for next-generation graphics combining rasterization, ray tracing, and deep
learning. It includes fundamental advancements in several key areas: streaming
multiprocessor efficiency, a Tensor Core for accelerated AI inferencing, and an RTCore for
accelerated ray tracing. With these innovations, Turing unlocks both real-time ray-tracing
performance and deep-learning inference in consumer, professional, and datacenter
solutions.

& AT NVIDIA, WE imagined a future where real- Efficient New SM Core


time graphics combines rasterization, ray trac- In 2017, we introduced the Volta GPU archi-
ing, and deep learning. NVIDIA’s latest processor tecture, targeted at high-performance computing
family, the Turing GPU, was created to realize and deep-learning training in the V100 solution.2
that vision. Turing builds aggressively on the foundation of
The largest Turing GPU, Titan RTX, com- Volta, bringing those advancements into the
prises 18.6 billion transistors in a 12-nm process, consumer GPU space.
including several thousand programmable
Enhanced L1 Data Cache
processing elements, industry-first support for
Compared to the previous consumer genera-
GDDR6 memory, and high-bandwidth NVLink for
tion, Pascal, Turing SM has twice the instruction
multi-GPU connectivity.
schedulers, simplified issue logic, and leverages
While Turing is packed with features and horse-
a large, fast, low-latency L1 data cache (Figure 1,
power,1 we made fundamental advancements in
left).
several key areas—streaming multiprocessor (SM)
This new L1 cache architecture exploits a
efficiency, a Tensor Core for AI inferencing, and an
powerful topological change. In Pascal, global
RTCore for ray-tracing acceleration.
memory and texture data were accessed
through the fixed-function texture processing
Digital Object Identifier 10.1109/MM.2020.2971677 pipeline. In Turing, global and shared memory
Date of publication 4 February 2020; date of current version accesses share a path directly to the RAM, pro-
18 March 2020. viding twice the L1 bandwidth and a significant

0272-1732 ß 2020 IEEE Published by the IEEE Computer Society IEEE Micro
36
Figure 1. Turing GPU SM, comprising four subcores and a memory interface (MIO). Math throughput,
memory bandwidth, L1 data cache topology, register file and cache capacity, and a new uniform datapath
were all designed or modified to increase processor efficiency over the previous generation.

reduction in hit latency. Because the L1 data thread (SIMT) with independent thread schedul-
cache and shared memory RAM are unified, ing model introduced in Volta.3
the tagged capacity can be configured based In a traditional SIMD/vector architecture
on the workloads running on the GPU. (Figure 3, top), control flow resides on a scalar
The SM accesses an L2 cache across a shared thread, and the developer and compiler promote
crossbar, and on Turing, the L2 capacity dou- certain operations to the SIMD/vector lanes to
bled to 6 MB. run in parallel. A vector execution mask specifies
which lanes ignore the vector operation, if
Concurrent Execution of Floating-Point and needed. In our solution (Figure 3, bottom), con-
Integer Math trol flow resides in each SIMT thread. On Turing,
Inside each of the four subcores of the Turing when warp-uniform data are detected, the com-
SM (Figure 1, right), we doubled register file piler with hardware assist promotes operations
capacity, redesigned the branch unit, and added to an independent uniform datapath, essentially
fast general purpose FP16 math. a “reverse vectorization.” This promotion is
Saturating the FP32 datapath requires only valid even if one or more of the threads in the
half the issue bandwidth, allowing another warp are currently diverged within its own con-
datapath to execute concurrently. Figure 2 trol flow.
shows the performance impact of concurrent This optimization is invisible to the program-
execution of floating-point and integer instruc- mer, but the following simplified machine code
tions, across several gaming workloads. On aver- snippet demonstrating bindless constant mem-
age, 36 integer instructions co-execute with ory access illustrates the mechanism:
every 100 floating-point instructions. Typical
integer operations include address computation ULDC.64 UR20, [UR6 þ 0x18], !UP7
and floating-point compares. UIADD UR6, UR3, UR10
FMUL R15, R4, cx[UR20][0x64]
Uniform Datapath and Uniform Register File ULOP.AND UR2, UR3, 0xfffff
The Turing SM exploits redundant computa-
tion and data across multiple threads in a warp, The U-prefixed, or uniform, instructions oper-
while preserving the single instruction multiple ate on U-prefixed registers, and a subsequent

March/April 2020
37
Hot Chips

TOPS 4-bit integer math, on a GeForce GTX


2080Ti GPU.
The Tensor Core is tightly integrated into the
SM subcore (Figure 1). It performs a single multi-
thread collaborative matrix math operation over
four to eight clocks, which saves operand and
memory bandwidth by transparently sharing
data across threads.
This fine-grained integration provides maxi-
mum algorithmic flexibility. For example, different
activation functions or batch normalization var-
iants can be interleaved on the general purpose
datapaths alongside the matrix operations exe-
cuted on the Tensor Core. This integration also
naturally leverages the large capacity and band-
width of the register file and shared memory.
Figure 2. Concurrent execution of floating-point and integer As a result, the Tesla T4 datacenter product
math, across representative shaders from several popular games. accelerates all AI workloads flexibly, achieving
speedups of over five times the Pascal-based
SIMT instruction uses the uniform result to index Tesla P4 solution on DeepSpeech2, and up to 36x
into a constant bank. This indirection avoids the that of CPU-based solutions for natural language
synchronization overhead of dynamically updat- processing (Figure 5). The capability to acceler-
ing a block of constant memory still in use by ate all AI workloads is critical, as multiple net-
other work elements. Simply enabling bindless works chained together provide higher level
constants using the uniform datapath yielded a solutions, e.g., speech recognition coupled with
12% speedup on Forza Motorsport 7. natural language processing to enable interac-
With these and other efficiency improvements, tive agents.
SM performance increased by more than 50% We expect deep learning to disrupt gaming and
across a wide variety of shaders (Figure 4). Some professional graphics as it has disrupted other
workloads show a benefit of over two times. application spaces. Examples of dynamic neural
graphics include deep learning supersampling
Tensor Core for Accelerated Deep (DLSS), which both anti-aliases and upscales
Learning game images for dramatically improved frame
Alongside the new SM efficiency improve- rates and image quality; style transfer and content
ments, we added a specialized AI accelerator creation with GauGAN,4 which enables an artist to
called a Tensor Core. The first Tensor Core was paint a simple segmentation map image and auto-
introduced in Volta, and again, we leveraged matically generate beautiful and plausible land-
that technology as a foundation upon which to scape paintings; and AI slow-motion video on a
innovate further. Integrating this new Tensor professional workstation, automatically generat-
Core in the SM enables real-time deep-learning ing the missing video frames from a standard
inference on a consumer GPU for the first time. video. Deep-learning-based graphics has only just
The Tensor Core efficiently accelerates the begun, and Turing makes it possible in real time.
computation of matrix multiply–accumulate
operations, the fundamental building block of
deep learning. On Pascal, matrix multiply–accu- RTcore for Accelerated Ray Tracing
mulate operations are computed serially, while We also invented a new accelerator for ray trac-
on the Turing Tensor Core this operation exe- ing, making Turing RTX the first shipping GPU with
cutes in parallel across tiles of streamed data, hardware-accelerated ray tracing. The RTCore
achieving a throughput of 114 TFLOPS of FP16 provides up to 10 GRays/s traversal and intersec-
math, or 228 TOPS 8-bit integer math, or 455 tion throughput. Overall, Turing RTX is seven

IEEE Micro
38
times the Pascal ray-tracing performance, mean-
ing that real-time ray tracing is finally available.
Figure 6 is a screenshot from the interactive
demo Attack from Outer Space, built with Epic
Games’ Unreal Engine 4 and a 2019 DXR Spotlight
award winner.5 The image demonstrates multi-
ple advanced rendering techniques including
soft shadows underneath the cars on the street,
and glossy reflections in the robot’s chest plate
and the puddles on the ground. These are calcu-
lated dynamically in real time in a destructible,
animated environment, and made possible by
accelerated ray tracing.
Beyond using rays for shadows or reflections,
one can employ even more advanced techniques
to generate ever more realistic images, e.g.,
path-traced global illumination, in which rays
represent virtual photons in a physical simula-
tion. Figure 7 is a simplified diagram represent-
ing the path-tracing process.
A ray, represented as an origin and a direc-
tion, is used to test what geometry is visible
from a point in the scene.

 A camera ray, labeled 1 in the figure, tests


what geometry is behind each pixel in the Figure 3. Traditional scalar/vector architecture (top) versus
image being rendered. For performance, a Turing SIMT with uniform datapath (bottom). SIMT avoids the
hybrid renderer may use traditional raster- requirement for programmers to explicitly vectorize parallel tasks.
ization for this step. The promotion of uniform operations and data for efficiency works
 A reflection ray, labeled 2A, tests what light even in the presence of diverged threads in an SIMT warp.
scattered toward the viewer from some
 A shadow ray, labeled 3, tests whether an
other part of the scene. For rougher reflec-
tions, one direction is randomly chosen occluder blocks light from arriving at a surface.
from a distribution of likely scattering direc- Rather than test all light sources, typically only
tions. Refraction rays, labeled 2B, may also one is randomly chosen, but a special case for
be appropriate for transparent materials. the sun, labeled 4, may be desired.

Figure 4. Representative shader workloads extracted from various games and graphics benchmarks show
up to two times performance benefit from the Turing SM.

March/April 2020
39
Hot Chips

Figure 5. Accelerated deep-learning inference with Turing-based Tesla T4 datacenter solution, with speeds
up to 36 times faster than CPU-based servers. T4 can flexibly apply chained deep-learning tasks like speech
recognition followed by natural language processing to enable higher level solutions like intelligent agents.

 These rays are generated recursively (labeled hours to generate a single frame. Recent GPUs
5, 6, and 7). reduce the cost to minutes or seconds, but real-
time performance has been unachievable.
Chains of rays representing reflections, refrac-
The fundamental building blocks of the
tions, and light source queries create paths
algorithm are as follows:
between the camera and the light sources to
determine the color of light transported to each  sampling (What direction to shoot a ray?);
pixel in the image. This technique is commonly  traversal and intersection (What did the ray
used to generate CGI movies; however, due in hit?);
part to the inherent random sampling, the brute  material evaluation (How does the light scat-
force computation typically takes many CPU ter at that hit point?).

Figure 6. Attack from Outer Space interactive ray-tracing demo by Christian Hecht. Soft shadows, glossy
reflections, and indirect illumination combine to produce realistic lighting in the UE4 game engine.

IEEE Micro
40
Figure 7. Path-traced global illumination, simplified. Primary/camera rays, reflection/refraction rays, and
shadow rays combine recursively to simulate the color of light scattered from the scene to each pixel in the
rendered image.

Sampling and material evaluation are not yet taking thousands of instruction slots, and poten-
suitable for fixed-function acceleration. Techni- tially tens of thousands of cycles of latency per ray.
ques for these differ across renderers, requiring Finally, when the appropriate hit point is located
significant flexibility. Traversal and intersection, for the ray, the material at that point is evaluated.
however, are the most expensive components of On Turing RTX, the new RTCore replaces that
global illumination, has essentially annealed to a software emulation (Figure 8, bottom), performing
common algorithm in the industry, and is ripe the tree traversal and the ray/box and ray/triangle
for further acceleration. intersection tests. A ray query is sent from the SM
In pre-RTX GPU ray tracing (Figure 8, top), to the RTCore, the RTCore fetches and decodes
traversal and intersection were done on the SM memory representing part of the bounding vol-
core. To avoid testing every triangle in a scene ume tree, and uses dedicated evaluators to test
against each ray query, a bounding volume hier- each ray against the box or, at the leaves of the
archy or a tree of axis-aligned bounding boxes is tree, the triangles that make up the scene. It does
created over the triangles in the scene. this repeatedly, optionally keeping track of the
When a ray probe is launched, this tree is tra- closest intersection found. When the appropriate
versed, and each successive child box is tested intersection point is determined, the result is
against the ray, culling the distant geometry effi- returned to the SM for further processing.
ciently. At the leaves of the tree, the actual triangles The RTCore is faster and more efficient than
making up the surfaces in the scene are tested software emulation and frees up the SM to do
against the ray. This tree traversal is fundamentally other work including programmable sampling
pointer-chasing through memory interleaved with or material shading in parallel. We carefully
complex, precision-sensitive intersection tests, interfaced the RTCore and the SM for both

March/April 2020
41
Hot Chips

Figure 8. Ray tracing before (above) and after (below) Turing RTX. The RTCore performs ray/scene traversal
and intersection tasks, freeing the SM to do concurrent work such as material evaluation and denoising.

performance and flexibility, enabling a developer use path tracing. Each plot is a timeline of the
to optionally create custom intersection pro- frame on a different configuration. At the top, on
grams that run on the SM for non-triangle geome- Pascal, path tracing is possible, but five frames
try such as spheres that traditional rasterization per second (fps) is much too slow to be playable.
cannot easily handle. In the middle, on Turing without the RTCores
Figure 9 shows the performance of one frame enabled, the more efficient design is already
of Quake II RTX, a game recently remastered to twice as fast at 10 fps. The purple and gray on this

IEEE Micro
42
Figure 9. Path-tracing performance on one frame of Quake II RTX. Turing efficiency improvements yield a
two-times speedup over Pascal, while Turing RTCores provide a further speedup for a total seven times faster
frame rate and real-time performance.

timeline show the floating-point and integer data- indirect lighting, and more are evident, bringing
paths executing in parallel. Enabling the RTCores, a dramatically updated look to a fun game from
their work shown in green in the bottom plot, the past.
yields a further leap to real-time speeds at 34 fps Developers are adding ray-traced effects to
and an overall seven times speedup. upcoming games at a rapid pace as well. Every
In the rendered image of this frame (Figure 9, graphics API and major engine has added sup-
inset), reflections, refractions, soft shadows, port. Real-time ray tracing has finally arrived.

Figure 10. Professional rendering before (left) and after (right) AI-based denoising, using a Quadro RTX
workstation. The SM and RTCores are used to trace a few (noisy) paths per pixel, and the Tensor Core
accelerated AI denoiser estimates the completed image, together providing a speedup of several orders of
magnitude versus a CPU server.

March/April 2020
43
Hot Chips

Turing: Greater Than the Sum of CONCLUSION


Its Parts At NVIDIA, we envisioned the future of
Because these three key architectural invest- graphics, but it requires greater efficiency and
ments were carefully integrated, they make new breakthrough acceleration. With a new more
Turing far greater efficient SM that is more than 1.5 times faster than
than the sum of its At NVIDIA, we the previous generation, new Tensor Cores, which
parts. By leverag- envisioned the future unlock real-time deep-learning inference, and the
ing the SM, RTCore, of graphics, but it new RTCore, which yields more than 7 times faster
and Tensor Core requires greater ray-tracing performance, the Turing GPU is built
simultaneously, efficiency and new to realize that vision for next-generation graphics.
RTX-enabled pro- breakthrough
fessional film and acceleration. With a & REFERENCES
design renderers new more efficient SM
that is more than 1. NVIDIA, “NVIDIA Turing GPU architecture: Graphics
can now achieve
1.5 times faster than reinvented,” 2018. [Online]. Available: https://www.
interactive path
the previous nvidia.com/content/dam/en-zz/Solutions/design-
tracing with high-
generation, new Tensor visualization/technologies/turing-architecture/NVIDIA-
quality materials Cores, which unlock Turing-Architecture-Whitepaper.pdf
and AI denoising. real-time deep-learning 2. NVIDIA, “NVIDIA Tesla V100 GPU,” 2017. [Online].
Figure 10 shows inference, and the new Available: https://images.nvidia.com/content/volta-
a path-traced image RTCore, which yields architecture/pdf/volta-architecture-whitepaper.pdf
in split screen. more than 7 times
3. J. Choquette, O. Giroux and D. Foley, “Volta:
On the left, a few faster ray-tracing
Performance and programmability,” IEEE Micro,
randomized paths performance, the
vol. 38, no. 2, pp. 42–52, Mar. 2018.
per pixel have been Turing GPU is built to
4. T. Park, M.-Y. Liu, T.-C. Wang, and J.-Y. Zhu,
ray traced and realize that vision for
next-generation “Semantic image synthesis with spatially-adaptive
the materials evalu-
graphics. normalization,” in Proc. IEEE Conf. Comput. Vis.
ated. The image is
Pattern Recognit., 2019, pp. 2337–2346.
noisy due to the
5. C. Hecht, “NVIDIA DXR spotlight winners—Attack from
random sampling inherent in the path tracing
outer space,” 2019. [Online]. Available: https://
algorithm. Noise-free images often take thou-
developer.nvidia.com/dxr-spotlight-winners
sands of samples per pixel to converge, which is
6. C. R. A. Chaitanya et al., “Interactive reconstruction of
slow even with accelerated ray tracing.
Monte Carlo image sequences using a recurrent
However, an AI-based denoiser6 using the
denoising autoencoder,” ACM Trans. Graph., vol. 36,
Tensor Cores can estimate the completed image no. 4, Jul. 2017.
with as few as 1% of the typical samples,
efficiently removing the sampling noise in a post- John Burgess is a Senior Director of GPU architec-
ture with NVIDIA, Santa Clara, CA, USA. Most recently,
process pass (Figure 10, right). The SM, RTCore,
he has led the development of the SM architecture for
and Tensor Core work together efficiently to
the Volta and Turing GPU families, including dedicated
render the final image. This level of quality and hardware for accelerated deep learning and ray trac-
interactive performance in a professional render- ing. He received the Ph.D. degree in physics from the
ing would not be possible with any of these three University of Texas at Austin, Austin, TX, USA. Contact
elements missing. him at jburgess@NVIDIA.com.

IEEE Micro
44
Theme Article: Hot Chips

The AMD “Zen 2”


Processor
David Suggs, Mahesh Subramony, and
Dan Bouvier
Advanced Micro Devices Inc.

Abstract—The “Zen 2” processor is designed to meet the needs of diverse markets spanning
server, desktop, mobile, and workstation. The core delivers significant performance and
energy-efficiency improvements over “Zen” by microarchitectural changes including a new
TAGE branch predictor, a double-size op cache, and a double-width floating-point unit.
Building upon the core design, a modular chiplet approach provides flexibility and
scalability up to 64 cores per socket with a total of 256 MB of L3 cache.

& THE ZEN 2 processor provides a single core doubled from a combination of technology and
design that is leveraged by multiple solutions, microarchitecture improvements. Second, IPC
with focused goals for improving upon the prede- is increased by approximately 15% from the
cessor Zen processor. The primary targets for microarchitectural changes in the in-order front-
the core were advancements in instructions per end, integer execute, floating-point/vector exe-
cycle (IPC), energy efficiency, and security. Build- cute, load/store, and cache hierarchy.
ing on the new core, the solutions for server and
client aimed to promote design reuse across
markets, increase core count, and improve IO ENERGY EFFICIENCY
capability. Achieving these goals required inno- The Zen 2 core microarchitecture originally
vations in microarchitecture, process technol- targeted lower energy per clock cycle than the
ogy, chiplets, and on-package interconnect.1 prior generation, independent of process tech-
nology improvements. This by itself was an
aggressive goal, because increased IPC means
CPU CORE
more activity each cycle, resulting in added
The Zen 2 CPU core, shown in Figure 1, has
switching energy. Neutral energy per cycle would
two primary areas of improvement over its
require additional design optimization, including
predecessor, Zen. First, energy efficiency is
improvements in branch predictor accuracy,

Digital Object Identifier 10.1109/MM.2020.2974217 


AMD Zen 2 CPU-based system scored an estimated 15% higher than previous
Date of publication 17 February 2020; date of current version generation AMD Zen based system using estimated SPECint_base2006 results.
SPEC and SPECint are registered trademarks of the Standard Performance
18 March 2020. Evaluation Corporation. See www.spec.org. GD-141.

March/April 2020 Published by the IEEE Computer Society 0272-1732 ß 2020 IEEE
45
Hot Chips

data, and these checks are performed prior to


speculation with the data. This can prevent micro-
architectural state from being updated unless per-
mission checks pass, which allows the Zen family
to be affected by fewer security vulnerabilities.4

PREDICTION, FETCH, AND DECODE


The in-order front-end of the Zen 2 core
includes branch prediction, instruction fetch,
and decode. The branch predictor in Zen 2 fea-
tures a two-level conditional branch predictor.
To increase prediction accuracy, the L2 predic-
tor has been upgraded from a perceptron predic-
tor in Zen to a tagged geometric history length
(TAGE) predictor in Zen 2.5 TAGE predictors pro-
Figure 1. Core block diagram. vide high accuracy per bit of storage capacity.
However, they do multiplex read data from mul-
tiple tables, requiring a timing tradeoff versus
higher op cache hit rate, and a dedication to
perceptron predictors. For this reason, TAGE
continuous clock and data gating improvements.
was a good choice for the longer-latency L2 pre-
Zen 2 achieved significantly better results than
dictor while keeping perceptron as the L1 pre-
the original targets: an estimated 1.15 IPC
dictor for best timing at low latency.
increase while delivering a 1.17 improvement in
The branch capacity in Zen 2 is nearly double
energy per cycle. These two factors, together
that of Zen. The L0 BTB was increased from 8 to
with an estimated 14–7-nm process technology
16 entries. The L1 BTB was increased from 256
improvement in energy per cycle of 1.47, pro-
to 512 entries. The L2 BTB was increased from
duce double the number of instructions per unit
4096 to 7168 entries. The indirect target array
energy, as measured in silicon.y
was increased from 512 to 1024 entries. The com-
bination of improved conditional predictor and
SECURITY increased branch capacity allows Zen 2 to target
Security is a top design consideration in mod- a 30% lower mispredict rate than Zen.
ern processor design. Over the past few years, The instruction cache and op cache configura-
security researchers have identified an increas- tions in Zen 2 are reoptimized for better perfor-
ing number of security threats in processor mance. The op cache, containing previously
microarchitecture. Some examples include Spec- decoded instructions, was doubled in capacity
tre v2 (indirect branch target injection) and from 2048 to 4096 fused instructions. The L1
Spectre v4 (speculative store bypass).2 Once instruction cache was halved in size from 64 to 32
aware of these vulnerabilities, AMD provided kB to make room for the larger op cache. This pro-
mitigations for Zen, and AMD then extended the vides better overall performance and improved
mitigations in Zen 2 with dedicated hardware energy efficiency. The L1 instruction cache pro-
implementations that improve performance.3 vides better utilization due to increasing associa-
Beyond targeted mitigations for security tivity from four to eight ways. The op cache also
issues, the Zen family includes conscious design covers more microarchitectural cases of instruc-
choices that improve security. Specifically, the tion fusion, which increases effective throughput
hardware checks permissions prior to consuming and op utilization throughout the core.

y
Testing conducted by AMD Performance Labs as of 7/12/2019 with 2nd Gen-
eration Ryzen and 3rd Generation Ryzen engineering samples using estimated INTEGER EXECUTE
SPECint_base2006 results. PC manufacturers may vary configurations yield-
The Zen 2 core features a distributed execu-
ing different results. SPEC and SPECint are registered trademarks of the Stan-
dard Performance Evaluation Corporation. See www.spec.org tion engine, with separate schedulers, registers,

IEEE Micro
46
and execution units for integer and floating- its potential without significant performance
point/vector operations. The integer engine impact on the low-parallelism thread.
operates on general-purpose registers and gen-
erates addresses for loads and stores. The float-
ing-point/vector engine operates on vector FLOATING-POINT/VECTOR EXECUTE
registers. The Zen 2 integer engine focused on The Zen 2 floating-point/vector engine has
increasing issue width to provide more through- doubled the data path width from 128 bits (Zen)
put and growing the out-of-order window size to to 256 bits. Both cores support AVX-256 instruc-
expose more program parallelism. tions, but Zen double-pumps operations using its
The Zen core had the foundation for two 128-bit data paths whereas Zen 2 supports native
loads and one store per cycle, but with just two operation with its 256-bit data paths. The vector
address generation units (AGUs), Zen was not PRF width is also doubled to 256 bits. Registers
able to sustain this throughput in the steady can now be renamed on a 256-bit granularity
state. The Zen 2 core adds a third AGU, unlock- instead of a 128-bit granularity. The effective
ing this throughput potential and providing a capacity of the vector PRF is therefore doubled
more balanced processor. for AVX-256 code, even though the number of vec-
A major component of window size is the tor PRF entries remains the same at 160.
scheduler queue size. Like its predecessor, Zen 2 A significant consideration with physically
has four fully distributed arithmetic-logic unit doubling the data path is the potential for
(ALU) scheduler queues, one per ALU. Zen 2 switching activity spikes that could cause elec-
increases the size of each queue from 14 entries trical design current (EDC) specifications to be
to 16 entries. The AGU scheduler remains the exceeded. A simplistic approach to mitigating
same size, but it is now upgraded from two sepa- this issue would be to immediately throttle fre-
rate, distributed 14-entry queues each feeding quency and reduce voltage when AVX-256
an AGU to a single 28-entry queue feeding all instructions are detected. However, this would
three AGUs. The unified AGU queue has more unnecessarily penalize programs that make
effective capacity due to removing the potential occasional use of AVX-256 instructions. To opti-
for queue imbalance. The unified queue is also mize performance, Zen 2 builds an intelligent
better able to prioritize picking of the oldest EDC manager which monitors activity over mul-
ready ops, resulting in reduced mis-speculation tiple clock cycles and throttles execution only
from out-of-order loads. when necessary.
Other window size components are incre-
ased, including growing the physical register file
(PRF) from 168 to 180 entries and the re-order LOAD/STORE AND L1D/L2 CACHES
buffer (ROB) from 192 to 224 entries. These both The load/store unit and level 1 data (L1D)
work to allow more ops in the window, exposing cache provide more throughput and larger struc-
more program parallelism. tures. An important component of overall win-
Finally, the execution engine improves simul- dow size, the store queue size was increased
taneous multithreading fairness. It is possible from 44 to 48 entries. The L2 data translation loo-
for one thread with inherently low parallelism kaside buffer was increased from 1536 to 2048
(for example, a pointer chase through main entries, now supporting 1-GB pages installed as
memory) to consume many of the ALU or AGU splintered 2-MB pages.
scheduler resources without benefit. The second The 32-kB 8-way L1D cache maximum
thread may have high inherent parallelism but throughput was increased in Zen 2 due to two
be unable to realize its performance potential factors. First, read and write bandwidth are dou-
due to insufficient scheduler resources. New fair- bled through an increase in width from 128 to
ness hardware detects this condition and slows 256 bits, matching the vector data path width.
the rate at which the low-parallelism thread can Second, the third AGU provides 50% more sus-
allocate into the scheduler. This gives the high- tained load/store operations. Combined, these
parallelism thread an opportunity to approach net Zen 2 three times the loadþstore bandwidth.

March/April 2020
47
Hot Chips

Figure 2. Cache Hierachy and CPU Complex.

The Zen 2 L2 remains 512 kB, and it is 8-way power efficiency, mitigating the cost was another
set-associative with 12-cycle load-to-use latency. challenge. The technology shrink factor did not
Zen 2 has new prefetch throttling capability apply equally across all circuits. Specifically scal-
that can reduce the aggressiveness of data pre- ing some analog circuitry, for example those used
fetching when memory bandwidth utilization is in system IOs, did not benefit enough compared
high and prefetching is not being effective. This to the technology cost increase. This led to the
is a particularly important to performance for adoption of a chiplet strategy. This strategy
high core-count, constrained memory band- defines SoCs using a hybrid process technology,
width processors such as those used in server allowing each chiplet to be manufactured in its
or high-end desktop. optimal technology node. The SoCs built using the
hybrid process technology married one or more
CORE COMPLEX AND L3 CACHE of the second-generation area-optimized CCD
A core complex (CCX) is composed of four chiplets in the advanced node and an IO-die chip-
Zen 2 cores and a shared level-3 (L3) cache. The let in a mature node. This resulted in construction
L3 cache has four slices connected with a highly of cost-effective, high-performance SoCs as well as
tuned fabric/network. Each L3 slice consists of offering configurable solutions to broaden the
an L3 controller, which reads and writes the L3 product portfolio.
cache macro, and a cluster core interface that
communicates with a core. The four slices of L3
are accessible by any core within the CCX. The ON-PACKAGE DIE-TO-DIE
distributed L3 cache control provides the design INTERCONNECT
with improved control granularity. Each slice of An important requirement to making the chip-
L3 contains 4 MB of data for a total of 16 MB of let strategy viable was an optimized on-package
L3 per CCX. The L3 cache is 16-way set-associa- die-to-die interconnect. The interconnect was
tive and is populated from L2 cache victims. The required to support various product configura-
L3 is protected by DECTED ECC for reliability. A tions while meeting power, bandwidth and
CPU Core Die (CCD) chiplet is composed of two latency metrics. A new on-package Infinity Fabric
CCXs, for a total of eight cores, 16 threads, and (IFOP) link was designed to meet these require-
32 MB of L3 cache as shown in Figure 2. ments and allow efficient communication bet-
ween the core chiplet and the IO chiplet. The
Chiplet Strategy: Challenges and Solutions IFOP link was optimized for a short channel reach
Delivering Zen 2 to market across multiple plat- and responsible for carrying both data and con-
forms in a short period of time was a key design trol fabric communication. Each IO-die chiplet to
challenge. While the leading edge 7-nm technol- CCD chiplet connection is made using an indepen-
ogy is a key element to “Zen2” performance and dent point-to-point instance of the IFOP link.

IEEE Micro
48
Figure 3. “Matisse” SoC. Figure 4. “Rome” SoC.

“MATISSE” SOC contains up to 1000 mm2 of cumulative silicon


The “Matisse” desktop SoC shown in Figure 3 area. The full assembly results in up to 64 cores
utilizes up to two CCD chiplets with a single IO- with a total of 256 MB of L3 cache. A directory
die chiplet in an AM4 micro-PGA package. Cop- structure within the Server IO-die manages
per pillars were used to attach the die to the probe traffic efficiently.
package instead of traditional solder bumps to All memory and IOs are hosted by a single IO-
accommodate the signal connectivity require- die resulting in significant generational latency
ments between the chiplets. This enabled man- improvement. Decoupled clocking within the IO-
aging mixed processes on the single package. die enables efficient management IO-die thermal
The IO-die features two 64-bit channels of budget. This opens additional power budget for
DDR4-3200 memory for a peak bandwidth of increased CPU frequency. With support for eight
51.2 GB/s. The device includes 24 lanes of PCIe channels of DDR4-3200 memory and 64 lanes of
Gen4 lanes for a peak native IO bandwidth of PCIe Gen4, “Rome” delivers balanced memory
48 GB/s. The noncore clocking within the IO-die and IO to support the high CPU core count.
has more degrees of freedom when compared to “Rome” maintains backward socket-level
its predecessor, allowing decoupling of the Infin- compatibility with the first generation EPYC plat-
ity Fabric (FCLK) clock and DDR Memory Clock form. “Rome” adds advanced platform security
(MEMCLK). This enables more flexibility for features like SEV-IO and SEV-ES. The chiplet
power management. It also increased tuning flex- approach allows “Rome” to deliver up to twice
ibility for over-clocking the device when probing the socket-to-socket bandwidth, up to twice the
the limits of the system. IO-operations (IOPs) per socket, and up to
The “Matisse” SoC maintained backward com- twice the PCIe bandwidth. This results in near
patibility to the AM4 socket/platform infrastruc- twice the application performance of the previ-
ture. Voltage rail compatibility and IO connectivity ous generation.
compatibility was achieved by use of integrated Reuse and modularity were important for
low dropout regulators and firmware control. time-to-market delivery of “Rome” and “Matisse.”
The CCD chiplet is used in both products. The IO-
die chiplets, while unique for each SoC, use
“ROME” SOC highly leveraged IP. The same IO-die used in
The “Rome” server SoC, shown in Figure 4, “Matisse” was also packaged in a standalone BT1
contains up to eight CCD chiplets paired with a BGA package as the X570 chipset. Paired with
single IO-die chiplet. “Rome” is packaged in an “Matisse” the X570 chipset opened first to market
LGA package known as SP3. The SP3 package PCIe Gen4 connectivity in a premium desktop PC.

March/April 2020
49
Hot Chips

Figure 5. “Matisse” application and 1080p gaming performance compared to Ryzen 7 2700X.

PERFORMANCE photorealistic 3-D images via raytracing. With


The resulting generational performance uplift higher scores being better, the third-generation
of the “Matisse” SoC is very compelling, as Ryzen processor scores approximately 17% to
shown in Figure 5. Refer to Table 1 for the config- 24% higher than its predecessor (Ryzen 7 2700X)
urations and system parameters for the perfor- using the same number of CPU cores. Further,
mance measurements. “Matisse” scores approximately 67%–80% higher
Cinebench R20 is a real-world cross-platform using the same thermal budget (see the last row
test suite to evaluate performance scalability of Table 1) as the predecessor, exploiting 50%
reflecting advancements in CPU and rendering more cores.
technologies. The Persistence of Vision Raytracer Adobe Premiere Pro content creation software
(POV-Ray) is a software tool utilized for rendering for video editors uses a highly parallel CPU-
accelerated software encoder. Handbrake is an
open-source, GPL-licensed, multiplatform, multi-
Table 1. Matisse performance evaluation hardware and software
threaded video encoder. With higher scores associ-
configuration.
ated with lesser time to complete the task, the
CPUs “Matisse” processor completes the task approxi-
Ryzen 9 3900X 12 cores 24 threads (Zen2) mately 16%–18% faster when compared to its prede-
cessor at the same core count and approximately
Ryzen 7 3800X 8 cores 16 threads (Zen2)
34%–35% faster using the same thermal budget (see
Ryzen 7 2700X 8 cores 16 threads (Zen1) the last row of Table 1).
Motherboard AMD Reference Board The combination of high responsiveness,
affordable prices and unit volume growth in
Memory 216 GB Dual-Rank DDR4-3200
emerging markets has made 1080p a steadfast
Operating option for gamers around the world. Using aver-
Windows 10 v1903
system
age frames per second (FPS) as a performance
Security metric, the “Matisse” processor improves exam-
Windows 10 v1903 Default
mitigations
ple game titles by approximately 11%–34% over
GPU GeForce RTX 2080 its predecessor, shown in Figure 5. The higher
Platform AM4
clock-speeds, higher IPC, and larger L3 cache,
combined with new synergy with the Windows
Infrastructure TDC ¼ 95A, EDC ¼ 140A, TDP ¼ 105W, PPT ¼ scheduler provide the key ingredients for the
Limits 141.8W, Tjmax ¼ 95C.
performance uplift.

IEEE Micro
50
Table 2. “Rome” application performance
compared to 2P Intel Zeon Platinum 8280 power
server.

Application % Improvement
ESI VPS–NEON4M Upto 58%

Altair Radioss
Upto 72%
13.3.1

LS-DYNA R9.3.0 Upto 79%

STAR-CCMþ
Upto 95%
13.06.012

ANSYS Fluent 19.1 Upto 95%

Figure 6. PCIe Gen4 vs Gen3 performance. High performance computing applications


benefit significantly from the scalability of “Rome”
High-end post-production tools and certain especially when used in a 2P configuration.
categories of game VFX benefit highly from nonlin- “Rome” performed up to 58%–95% higher versus a
ear edition (NLE) performance, which is limited 2P Intel Xeon Platinum 8280 power server,z shown
by the IO bandwidth. The 3DMark PCIE Express in Table 2.
feature test evaluates the performance of vertex
animation across a field of wheat-like objects. CONCLUSION
CrystalDiskMark measures the sequential READ– Zen 2 powers the next generation of AMD
WRITE performance of the storage system. With the desktop and server processor products. It fea-
introduction of native PCIe Gen4, the “Matisse” tures up to 2 instructions per unit energy com-
processor scores up to 69% higher in the feature pared to its predecessor. The products were
test and approximately 35%–51% higher in the designed as chiplet-based solutions enabling effi-
sequential performance tests over the same sys- cient targeting of technology while balancing
tem running PCIe Gen3 (shown in Figure 6). power and cost. Significant leverage of the chiplet
When compared to the first generation EPYC and underlying design facilitated faster deploy-
processors and using maximum core count, ment of a diverse product stack. As a result of the
“Rome” delivers up to twice the socket-to-socket higher IPC, higher area utilization, improved
bandwidth, up to twice the IO-operations (IOPs) security features, and higher power efficiency of
per socket, up to twice the PCIe bandwidth and Zen 2, AMD delivered high performance, efficient,
nearly four times the peak theoretical FLOPs. and secure SoCs to market in the form of 3rd Gen-
ESI Virtual Performance Solution (VPS) is eration Ryzen “Matisse” desktop processor and
used for crash simulations during the design of 2nd Generation EPYC “Rome” processor.
occupant safety systems primarily for the auto-
motive industry. LS-DYNA is a general-purpose
multiphysics, finite-element analysis program
& REFERENCES
capable of simulating complex real-world prob- 1. D. Suggs, D. Bouvier, M. Subramony, and K. Lepak,
lems. It is used by the automotive, aerospace, “Zen 2,” Hot Chips, vol. 31, 2019.
construction, military, manufacturing and bioen- 2. P. Kocher et al., “Spectre attacks: Exploiting
gineering industries. Altair’s PBS Professional is speculative execution,” in Proc. Symp. Secur. Privacy,
a fast, powerful workload manager designed to 2019, pp. 1–19.
improve productivity, optimize utilization & effi-
z
ciency, and simplify administration for HPC clus- Based on AMD internal testing of ANSYS FLUENT 19.1, lm6000_16m bench-
mark; LSTC LS-DYNA R9.3.0, neon benchmark; of Altair RADIOSS 2018, T10M
ters, clouds and supercomputers. ANSYS Fluent benchmark; ESI VPS 2018.0, NEON4m benchmark; and Siemens PLM STAR-
CCMþ 14.02.009, kcs_with_physics benchmark as of July 17, 2019 of a 2P
is a computational fluid dynamics benchmark
EPYC 7742 powered reference server versus a 2P Intel Xeon Platinum 8280
widely used in almost every industry sector. powered server. Results may vary.

March/April 2020
51
Hot Chips

3. AMD, “Indirect branch control extension,” 2019. Mahesh Subramony is a principal member of
Technical Staff with Advanced Micro Devices Inc.
[Online]. Available. https://developer.amd.com/wp-
(AMD), and was the SoC Architect for the third
content/resources/Architecture_Guidelines_Update_
generation Ryzen “Matisse” desktop processors.
Indirect_Branch_Control.pdf
Since 2003, he has been with AMD in architecture
4. AMD, “Speculation behavior in AMD microarchitectures,” and design on SoCs spanning mobile, desktop,
2019. [Online]. Available. https://www.amd.com/system/ and servers. He received the M.S. degree in com-
files/documents/security-whitepaper.pdf puter engineering from the University of Minnesota,
5. A. Seznec and P. Michaud, “A case for (partially)- Twin Cities, and the B.Tech. degree in electrical
tagged geometric history length predictors,” engineering from the College of Engineering,
J. Instruction Level Parallelism, vol. 8, 2006. [Online]. Thiruvananthapuram. Contact him at mahesh.
Available. https://www.jilp.org/howtoref.html subramony@amd.com.

David Suggs is a Fellow with Advanced Micro Devi- Dan Bouvier is the Corporate VP and Client Prod-
ces Inc. (AMD), Santa Clara, CA, USA, where he was ucts Chief Architect for Advanced Micro Devices Inc.
the chief architect for the Zen 2 CPU core. Previously, (AMD), Ryzen products. He has defined the past five
he was the architect of the op cache and the instruc- generations of AMD notebook and desktop process-
tion decoder for the Zen core. He has been working in ors. During his 32 year career, he has focused on
architecture and design since 1993 on projects span- high-performance processors, SoCs, and systems.
ning CPU cores, north bridges, south bridges, DSPs, Prior to joining AMD in 2009, he was a processor CTO
voice telephony, and PC sound cards. He received for AMCC and before that the director of Advanced
the M.S.E.E. degree from the University of Texas at Processor Architecture for PowerPC processors at
Austin and the MBA degree from St. Edward’s Univer- Freescale/Motorola. He received the B.S. degree in
sity. He is the corresponding author of this article. electrical engineering from Arizona State University.
Contact him at david.suggs@amd.com. Contact him at dan.bouvier@amd.com.

IEEE Micro
52
Theme Article: Hot Chips

The Arm Neoverse N1


Platform: Building Blocks
for the Next-Gen Cloud-
to-Edge Infrastructure
SoC
Andrea Pellegrini, Nigel Stephens,
Magnus Bruce, Yasuo Ishii,
Joseph Pusdesris, Abhishek Raja,
Chris Abernathy, Jinson Koppanalil,
Tushar Ringe, Ashok Tummala,
Jamshed Jalal, Mark Werkheiser, and
Anitha Kona
Arm

Abstract—Recent years have seen an explosion of demand for high-performance, high-


efficiency compute available at scale. This demand has skyrocketed with the move to the
public cloud and 5G networking, where compute nodes must operate within strict latency
constraints and power budgets. The Neoverse N1 platform is Arm’s latest high end offering
from a scalable portfolio of IP for high performance and energy efficient machines.

& NEOVERSE N1 IS the new platform of Arm On one end of the spectrum, the Neoverse N1 plat-
IPs that enables partners to develop systems with form is well suited for high-performance systems
competitive performance and world-leading with up to 128 cores organized on an 88 mesh.
power efficiency across a wide range of markets. At the same time, customers targeting deploy-
ments with strict power and area constraints can
rely on Neoverse N1 to create high-efficiency and
Digital Object Identifier 10.1109/MM.2020.2972222
high-performance systems composed of a dozen
Date of publication 7 February 2020; date of current version
or fewer general-purpose cores.
18 March 2020.

March/April 2020 Published by the IEEE Computer Society 0272-1732 ß 2020 IEEE
53
Hot Chips

The Neoverse N1 core implements the v8.2-A acquire, store-release) the introduction of limited
A32, T32, and A64 Arm instruction sets and ordering regions, support for atomic instruc-
includes many infrastructure-focused improve- tions, and support for persistent memory.
ments, such as security features, virtualization We also included several features to signifi-
host extensions, large system extensions and cantly harden Neoverse N1 against known secu-
RAS extensions. rity vulnerabilities in the following.
Our projections
NEOVERSE N1 is the  Privileged access-never (PAN): protects the
and silicon meas- new platform of Arm OS kernel from being “spoofed” into reading
urements show that IPs that enables
or writing user code or data on behalf of
the Neoverse N1 partners to develop
malicious programs.
core performs at systems with
Unprivileged Access Override (UAO): allows the
least 1.6 better competitive
performance and OS kernel to more efficiently manage user code
for most workloads
world-leading power sections that are marked as execute-only for
than Arm’s previous
efficiency across a protection.
design deployed in
wide range of markets. Stage 2 execute-never: allows the hypervisor to
infrastructure––the
prevent an OS kernel and/or application from
Arm Cortex-A72––
executing pages containing writable data, to pre-
with some cloud-native workloads performing up
vent some exploits.
to 2.5 faster. This important speed up was
 Side-channel protection: introduces a range
achieved without compromising our best in class
of new speculation controls, speculation bar-
power-efficiency. Additionally, scalability signifi-
rier, and prediction restriction instructions
cantly improved thanks to a completely rede-
that allow software to mitigate microarchitec-
signed coherent mesh, cache hierarchy, and
tural side-channel attacks on speculative exe-
system IP backplane. As a result, Arm’s silicon
cution across different execution contexts.
partners using Neoverse N1 in their designs have
numerous opportunities to organize and opti- Virtualization is the backbone of much of the
mize these components to satisfy their general- modern IT infrastructure, and Neoverse N1
purpose compute needs and can take advantage implements enhancements to extend virtualiza-
of the many connectivity options for tightly cou- tion support and reduce its overhead:
pling accelerators through technologies such as Hardware update of access/dirty bits: auto-
AMBA, CCIX, and PCIe. matically updates status bits in page table
entries, avoiding a trap to the OS or hypervisor.
The VMID extension to 16 bits: increases the
ARCHITECTURE maximum number of simultaneously active vir-
The Neoverse N1 cores implement many of tual machines supported by the address transla-
the recent extensions to the base Armv8-A archi- tion system to 65536.
tecture that were introduced to improve the per- Virtual Host Extension (VHE): more efficient
formance, scalability, robustness and security of support for Type 2 (hosted) hypervisors, such
highly virtualized server, and network infrastruc- as KVM, building on the Type 1 (native) hypervi-
ture workloads on many-core processors.1 sor support already introduced in base Armv8-A.
These architecture extensions include sup- Finally, we extended our performance moni-
port for dedicated instructions to accelerate toring infrastructure to support enhanced PMU
inference ML workloads through support of IEEE events and statistical profiling.
half-precision floating-point (FP16) and int8 dot
product instructions. Neoverse N1 also includes NEOVERSE N1 CORE
the new CRC32 instructions to accelerate storage The Neoverse N1 core is designed to achieve
applications. Additionally, particular focus has high performance while maintaining the perfor-
been placed on finer handling of Arm’s relaxed mance power area (PPA) advantage point estab-
memory ordering through the LDAPR instruction lished with Cortex-A72. To achieve this goal,
(load with ordering semantics similar to load- the team designed the microarchitecture from

IEEE Micro
54
scratch and focused on features to enhance infra- instruction footprints. The predictor also
structure-focused many-core CPU performance. employs a 64-entry microBTB and a 16-entry
Neoverse N1 supports an aggressive out-of- nano-BTB to minimize bubbles in the front-end.
order superscalar pipeline and implements a 4- Neoverse N1 also significantly improves both
wide front-end with the capability of dispatch- latency and accuracy of the indirect branch pre-
ing/committing up to eight instructions per diction algorithm. The branch direction predic-
cycle. The core deploys three tor is also optimized to target
ALUs, a branch execution unit, behaviors observed on many server
The Neoverse N1 core
two Advanced SIMD units, and workloads: once a prediction is made,
is designed to achieve
two load/store execution high performance while the predicted address is stored into a
units. The minimum mispre- maintaining the 12-entry fetch queue which tracks
diction penalty is 11-cycle, performance power future fetch transactions.
and we introduced many opti- area (PPA) advantage Once the branch predictor creates a
mizations in order to preserve point established with next fetch address, the address is fed
a short pipeline without losing Cortex-A72. To achieve into a fully associative 48-entry instruc-
power efficiency. this goal, the team tion TLB and a 4-way set-associative 64-
The next sections describe designed the kB I-cache to read out the instruction
the detail of Neoverse-N1 core microarchitecture from opcode. The I-cache can deliver up to
scratch and focused on
microarchitecture. The first 16 B of instructions per cycle.
features to enhance
two sections describe the core Since the branch predictor has
infrastructure-focused
front-end and back-end. The fol- higher bandwidth than the I-cache,
many-core CPU
lowing sections detail the inter- performance. unless the pipeline was recently flushed
action of the core with the due to a branch misprediction, the
memory subsystem, security fetch queue typically holds a few pend-
features, and features added to target the infra- ing transactions. To mitigate branch misprediction
structure market. This section will conclude penalty, I-cache reads are overlapped with I-cache
with a few figures of merit about the core tag matching. After the fetch queue reaches a
implementation. threshold in the number of fetch transactions, the
I-cache read operation is serialized to maximize
Core Front-End efficiency. The Neoverse N1 core can support up
The Neoverse N1 core can fetch up to four to eight outstanding I-cache refill requests to the
instructions per cycle to feed its high-perfor- higher cache hierarchy.
mance back-end. One of the biggest improve- The stream of fetched instructions is then
ments from Cortex-A72 is a decoupled branch forwarded to a 4-wide decoder, where an instruc-
prediction, which realizes a branch predictor tion may be cracked into multiple simpler inter-
directed prefetch, where the branch predictor nal macrooperations. Each decode lane can
can run ahead even if the front-end pipeline is decode one Arm instruction per cycle, and the
waiting for instruction cache (I-cache) miss refill most frequently used instructions (e.g., simple
responses. Even if the I-cache pipeline is stalled ALU, branch and load/store) are decoded as a
on an instruction fetch miss, speculative fetch single macrooperation. To simplify and speed-
addresses provided by the branch predictor can up the decode process, the I-cache can store par-
continue to access the I-cache and resolve tially decoded instructions.
misses through early prefetches.
The branch predictor employs a large 6K- Core Back-End
entry main branch target buffer with 3-cycle Decoded instructions are renamed before
access latency to retrieve branches’ target being dispatched to the out-of-order engine. The
addresses without accessing the I-cache. Such a renaming unit can receive up to four macroops
sizeable BTB unit helps maintain target history per cycle. Each macroop can be cracked to up to
for a large number of branches, which benefits two microoperations during the renaming pro-
cloud and server workloads with large cess. Therefore, up to eight microoperations can

March/April 2020
55
Hot Chips

be dispatched into the out-of-order engine multiple cores can be configured in a cluster of
each cycle. Additionally, the rename unit can auto- cores containing a snoop filter and an optional L3
matically eliminate simple register-to-register data cluster cache. The cluster cache can be up to 2
movement instructions through its rename tables. MB, with a load-to-use latency ranging between 28
Once the microops are dispatched, instruc- and 33 cycles, depending on the configuration. A
tion status is tracked in the commit and the Neoverse N1 SoC can support up to 256 MB of
issue queues. The commit queue can track up to shared system-level cache. In the event none of
128 microoperations. The commit unit tracks a these caches are effective at filtering a requested
dispatched instruction until all prior instruc- memory address, the Neoverse N1 core employs a
tions are committed, and up to eight microops “cache-miss” predictor which bypasses the whole
can be committed per cycle. cache hierarchy and all snoop filters to issue a
The issue queue tracks the availability of “Prefetch Target” request to compatible memory
source operands required to execute corre- controllers, reducing the incurred miss latency.
sponding micro operations. When all its source Neoverse N1 employs a next generation data
operands are available, an instruction is picked prefetcher, which is similar to the one deployed
and issued to the correct execution pipeline. Neo- in the Cortex-A76 core, but with key improve-
verse N1 supports a distributed issue queue with ments for large scale systems. This updated pre-
more than 100 microoperations to increase the fetcher achieves high coverage and accuracy on
overall out-of-order window size. When the issue a variety of access patterns ranging from simple
queue is empty, dis- streams and strides to sophisticated spatial pat-
patched instruction terns. Such a prefetcher coordinates requests to
Neoverse N1 employs
can bypass such a multiple levels of cache and across virtual mem-
a next generation data
queue to minimize ory pages, preloading both TLBs and caches.
prefetcher, which is
latency. Finally, multiple cache replacement policies
similar to the one
Neoverse N1 deployed in the were designed and tuned to work in coordination
employs multiple Cortex-A76 core, but with these prefetchers, resulting in our first pre-
pipelines for each with key improvements fetch-aware replacement policy.
type of instruction: for large scale
four integer execu- systems. Security Features
tion pipelines, two During the development of the Neoverse
load/store pipelines, and two advanced SIMD N1 core, side channel attacks exploiting specu-
pipelines. As needed, each pipeline can forward lative execution2,3 were reported and several
its results to the others. architectural and microarchitectural mitigati-
ons were introduced to address these security
Memory Subsystem vulnerabilities.
The memory architecture for Neoverse N1 Neoverse N1 implements some of the Arm v8.5
is designed to enable larger, faster, and more scal- architecture features, such as the speculative
able caches than its predecessors. The 64-kB 4- store bypass safe (SSBS) control bit, and the SSBB
way set associative L1 data cache (D-cache) has a and PSSBB (Speculative Store Bypass Barrier)
4-cycle load to use latency and a bandwidth of 32 instructions. These newly introduced barriers
B/cycle. The core-private 8-way set associative L2 allow software to actively protect against Spectre
cache is up to 1 MB in size and has a load-to-use Variant 4 exploits by preventing load instructions
latency of 11 cycles. The Neoverse N1 core can from returning data written to a matching virtual
also be configured with smaller L2 cache sizes of or physical memory location by speculatively exe-
256 and 512 kB with a load-to-use latency of nine cuted store instructions prior to the barrier.4
cycles. The L2 cache connects to the system via Spectre Variant 2 attempts to exploit the
an AMBA 5 CHI interface with 16 B data channels. branch predictor by injecting branch targets
The Neoverse N1 core can directly interface to the that cause the victim process to speculate
mesh interconnect enabling minimum latency to through a specific code path. To address this
the system-level-cache and DRAM. Alternatively, threat, we designed a hardware mechanism to

IEEE Micro
56
prevent consumption of malicious target injec-
tions.4 Malicious software cannot inject branch
target information to control speculative behav-
ior of the victim process since the branch pre-
dictor in the Neoverse N1 core prevents a
process from using the predicted branch trained
by a different process.

Infrastructure Focused Core Features


Server systems targeting modern cloud
deployments typically require multi-socket plat-
forms with high-core counts and large memory
capacity: Neoverse N1 implements a number of
features targeted to this class of machines.
One of these features is the support for hard-
ware coherent I-caches. Like other architectures,
the Arm architecture does not require I-caches to
maintain coherent through hardware mecha-
nisms. Hence, on legacy Arm systems, software
must issue the necessary cache maintenance Figure 1. Neoverse N1 core floorplan used for our
operations whenever memory containing instruc- reference design, which includes 64-kB I-cache,
tions is modified. Typically, these invalidations 64-kB D-cache, and a 1-MB core-private L2.
are broadcast to all cores within the same coher-
ency domain to realize transparent instruction data, it generates an exception. If the poisoned
memory access. Unfortunately, such broadcasts data are never consumed by the core, the data
can limit scalability on high core count systems. will eventually be evicted from the core caches
Neoverse N1 eliminates this bottleneck by imple- but will retain the poison information in the sys-
menting a fully hardware coherent I-cache that tem-level-cache or DRAM.
requires no software maintenance and leverages Finally, Neoverse N1 increases the ASID and
the same hardware coherency mechanisms uti- VMID widths to 16 bits, allowing for more guest
lized by the data cache. To ensure software com- operating systems and applications within each
patibility, unnecessary I-cache maintenance guest. Neoverse N1 also extends the physical
operations still issued by legacy software are address width to 48 bits, allowing systems to
treated as “no operation” in the core. For more support up to 256 TB of physical memory.
recent software, Neoverse N1 includes a status
bit that allows users to discover that the I-cache Implementation
is hardware coherent, hence cache maintenance The Neoverse N1 core design was fully evalu-
instructions are not necessary. ated within our internal development environ-
Another infrastructure feature added by Neo- ment. Figure 1 shows the core floorplan of our
verse N1 is the support for the Arm v8.2 RAS reference implementation, which employs a 64-kB
architecture. The RAS architecture provides a I-cache, a 64-kB D-cache, and 1-MB L2 cache.
framework for detecting, classifying, and report- Our optimized physical implementations indi-
ing errors that is consistent across all compo- cate that the core and L2 cache can reach an
nents of the SoC. Neoverse N1 also adds the operating frequency of 3.1 GHz. When executing
ability to defer errors from one component to an intense integer workload, the estimated power
another. For example, an uncorrectable data consumption for a 7-nm implementation of a core
ECC error in DRAM can be propagated to a core is 1.0 and 1.8 W when clocked at 2.6 and 3.1 GHz,
which can cache the corrupted data but flag the respectively. The core area is estimated to be
erroneous word as “poisoned.” When an instruc- 1.15 mm2 for a 512-kB L2 configuration and 1.40
tion, such as a load consumes the poisoned mm2 for a 1-MB configuration. Our models project

March/April 2020
57
Hot Chips

that a 64-core reference system can achieve 190 Some key performance enhancement features
SPECint2017 rate (estimated). For such a system, of CMN-600 include the following.
the total SOC power is projected to be 105 W.
 Support for Arm and PCIe architecture
atomic transactions at the home nodes. This
allows atomic transactions to be issued by
NEOVERSE N1 COHERENT MESH the cores to the home nodes, where they are
INTERCONNECT (CMN-600)
resolved. The capability to execute far
The CMN-600 product family is Arm’s second-
atomics operations improves performance
generation, highly configurable, mesh-based
for contended variable updates.
coherent interconnect based on CHI cache coher-
 Prefetch hint can be issues from a Neoverse
ent protocol specification. CHI is a packet-based,
N1 core directly to the memory controllers in
point-to-point, topology agnostic, layered archi-
order to minimize DRAM latency on cache
tecture protocol. The coherent interconnect is a
misses. Other features to reduce data latency
vital component to enable many-core systems to
include: direct memory transfer from mem-
scale without compromising latency and available
ory controllers to requesting cores and
memory bandwidth. A mesh topology was chosen
direct cache transfer from peer cores to the
to address those challenges and is designed to
requesting core. In aggregate, we estimate
support one clock cycle delay per hop. CMN-600
these features to reduce data latency on the
can scale from a 12 mesh to a 88 mesh and is
interconnect by up to 37%.
designed to operate at up to 2-GHz clock fre-
 Cache stashing, which allows an IO peripheral
quency. Customers can configure mesh size,
such as a PCIe endpoint to place incoming
topology, and bisection BW to match the architec-
data on various levels of the cache hierarchy
ture that best fits their PPA targets.
(SLC, L3, and L2) to enable quicker access to
CMN-600 includes a distributed set of fully
this data. When SLC stashing is enabled, sili-
coherent home nodes which are software-
con measurements show improvements up to
configurable hash address-interleaved. Software-
33% packet/second on a single core and up to
configurability allows customers to support
60% on multicore tests for DPDK L3 forward-
different hash interleaving granularity, with the
ing tests. Further improvements are expected
minimum interleave being 64 B. Such configurabil-
for applications that can stash IO data in the
ity enables traffic distribution and traffic isolation
cores’ private L2 or in the core cluster L3.
based on the characteristics of the targeted appli-
cations and allows affinity-based system cache CMN-600 is designed to support high
groups (SCG) allocation, which helps with traffic throughput IO traffic from various requesters
localization in bigger systems. such as DMA, PCIe, etc., and can achieve full
Each HN-F slice includes a snoop filter and a PCIe Gen4 upstream and downstream band-
system level cache (SLC) with enhanced replace- width. PCIe or DMA writes can be stashed in the
ment policies. System architects that adopt SLC or directly into the core caches. Direct
the Neoverse N1 platform can choose the num- stashing to CPU caches allows improved perfor-
ber of HN-F slices to deploy based on system mance and avoids SLC pollution.
cache capacity needed and system bandwidth CMN-600 provides at-speed self-hosted debug
requirement, and total SLC capacity can range and trace capabilities with distributed debug
from 0 to 256 MB. monitors within the interconnect. Our intercon-
The SLC is a victim cache for core clusters with nect supports programmable transaction tagging
adaptive cache allocation based on data-sharing and tracing, which can be used for statistical pro-
detection. An SLC also acts as a DRAM cache for IO filing and end to end latency breakdown analysis.
Requestors (PCIe, DMA, etc.) with a smart alloca-
tion policy. In addition, the SLC supports software Multichip Support Using CCIX
programmable source-based cache capacity con- CMN-600 supports CCIX protocol (Cache
trol that mitigates the “noisy-neighbor” shared Coherent Interconnect for Accelerators) to
system cache thrashing problem. coherently connect hardware accelerators such

IEEE Micro
58
Figure 2. CMN-600 based scale-up server node with two compute dies and acceleration.

Multichip Support Using CCIX
CMN-600 supports the CCIX protocol (Cache Coherent Interconnect for Accelerators) to coherently connect hardware accelerators such as GPUs, smart NICs, smart storage, FPGAs, DSPs, etc. to a CMN-600-based host node. By extending the benefits of full cache coherency to these hardware accelerators, Neoverse N1 enables true peer processing with shared memory, which also eliminates the need for software to initialize transfers of data between devices. CMN-600 also supports CCIX independent memory expansion where the CCIX link is used to communicate with memory on a remote chip. CMN-600 leverages the same CCIX connection to enable symmetrical multiprocessing (SMP) across multiple chips to enable homogeneous computing. In order to enable this link for SMP use cases, the CCIX link can be augmented with special features to communicate Arm ISA-specific information that is not required for the host-accelerator use case.

Figure 2 shows a scalable system where multiple CMN-600s form a host node for homogeneous computing while connecting to hardware accelerators for heterogeneous compute use cases. Figure 3 graphically shows the configuration ranges available to Arm partners thanks to the CMN-600. This architecture is designed to scale from small and efficient edge systems all the way to high-performance cloud deployments.

IO Memory Management and Interrupt Handling
Arm's latest system MMU, MMU-600, supports stage-1, stage-2, and nested address translations with address space mapping and security mechanisms to prevent unauthorized accesses. In a typical system, these two translation stages are managed by the guest operating system and the hypervisor, respectively.
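As a rough mental model of what stage-1 plus stage-2 ("nested") translation means, the toy C sketch below composes two lookups: a guest-managed table from virtual address to intermediate physical address, and a hypervisor-managed table from intermediate physical address to physical address. The single-level 4-kB tables and the names stage1_s1tbl/stage2_s2tbl are illustrative stand-ins, not the multilevel Armv8 page-table walk that MMU-600 actually performs.

    /* Toy model of nested translation: stage 1 maps a guest virtual address
     * (VA) to an intermediate physical address (IPA) using guest-managed
     * tables; stage 2 maps the IPA to a physical address (PA) using
     * hypervisor-managed tables. */
    #include <stdint.h>
    #include <stdbool.h>
    #include <stdio.h>

    #define PAGE_SHIFT 12
    #define NPAGES     16                      /* tiny address space for illustration */

    static uint64_t stage1_s1tbl[NPAGES];      /* VA page  -> IPA page (guest OS)   */
    static uint64_t stage2_s2tbl[NPAGES];      /* IPA page -> PA page  (hypervisor) */

    static bool translate(uint64_t va, uint64_t *pa)
    {
        uint64_t off = va & ((1u << PAGE_SHIFT) - 1);
        uint64_t vpn = va >> PAGE_SHIFT;
        if (vpn >= NPAGES) return false;       /* stage-1 fault */
        uint64_t ipa_pn = stage1_s1tbl[vpn];
        if (ipa_pn >= NPAGES) return false;    /* stage-2 fault */
        uint64_t pa_pn = stage2_s2tbl[ipa_pn];
        *pa = (pa_pn << PAGE_SHIFT) | off;     /* nested result */
        return true;
    }

    int main(void)
    {
        stage1_s1tbl[1] = 5;                   /* guest maps VA page 1 to IPA page 5 */
        stage2_s2tbl[5] = 9;                   /* hypervisor maps IPA page 5 to PA 9 */
        uint64_t pa;
        if (translate((1u << PAGE_SHIFT) + 0x40, &pa))
            printf("PA = 0x%llx\n", (unsigned long long)pa);   /* prints 0x9040 */
        return 0;
    }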
MMU-600 supports PCIe Address Translation Services to allow PCIe-based IO devices or accelerators (masters) to prefetch translations well in advance and place them in device-managed address translation caches, hence avoiding the translation overhead in the MMU. Support for the PCIe page request interface further enhances system performance by enabling devices to use unpinned pages and virtual memory.

The Neoverse N1 platform supports PCIe root complexes with single-root IO virtualization function, which allows virtualized PCIe functions to be integrated into a system to provide IO virtualization. In a PCIe root complex, each virtual function (VF) and physical function (PF) pair mapping is assigned a unique PCI Express requester ID that is mapped to a unique StreamID in the system to match the Arm architecture requirements.
Figure 3. System scalability of CMN-600.
MMU-600 maps virtual addresses to physical addresses using the StreamID pairs. With support for up to 2^24 StreamIDs, MMU-600 allows simultaneous mapping of millions of PCIe VFs. In a virtualized environment, the VF is assigned to a virtual processing element (VPE) and the system traffic flows directly between VF and VPE. As a result, the IO overhead in the software emulation layer is diminished, significantly reducing the overhead of a virtualized environment compared to a nonvirtualized one.

With total bandwidth support of up to 64 GB/s per IO interface, MMU-600 is architected to support throughput requirements for next-generation PCIe Gen5.
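As a rough cross-check of that figure, assuming a x16 PCIe Gen5 port at 32 GT/s per lane with 128b/130b encoding and ignoring packet overhead:

    32 GT/s × 16 lanes × (128/130) ÷ 8 bits per byte ≈ 63 GB/s per direction,

which is consistent with the 64-GB/s-per-IO-interface design target quoted above.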
GIC-600 is a GICv3 architectural-specification-compliant interrupt controller with enhanced support for a large number of cores and multiple-chip configurations. GIC-600 structurally consists of interrupt translation service (ITS) blocks, a distributor, and redistributors. The ITS block translates PCIe message signaled interrupts (MSI/MSI-X) to Arm locality-specific peripheral interrupts (LPIs). The distributor manages interrupt routing and directs interrupts to the appropriate core cluster that services the interrupts. In a virtualized environment core interrupts are virtualized, and the incoming physical interrupts are mapped by a hypervisor to a VM. The inter-socket or inter-chiplet messages are ported to the native communication transport protocol supported by the system.
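Conceptually, the ITS turns an inbound MSI into an LPI by indexing per-device translation tables with the DeviceID supplied by the bus and the EventID carried in the MSI payload. The C sketch below illustrates that lookup only; the table shapes, sizes, and the example NIC mapping are assumptions for illustration rather than GIC-600 internals.

    /* Conceptual sketch of ITS behavior: (DeviceID, EventID) -> (LPI, target
     * redistributor). Unmapped combinations are simply discarded. */
    #include <stdint.h>
    #include <stdio.h>

    #define MAX_DEVICES 4
    #define MAX_EVENTS  8

    struct itt_entry {
        uint32_t lpi_intid;   /* LPI interrupt number (>= 8192 in GICv3) */
        uint32_t target_rd;   /* redistributor (core) that receives it   */
    };

    static struct itt_entry itt[MAX_DEVICES][MAX_EVENTS];

    static int its_translate(uint32_t device_id, uint32_t event_id,
                             struct itt_entry *out)
    {
        if (device_id >= MAX_DEVICES || event_id >= MAX_EVENTS)
            return -1;                     /* unmapped: drop the MSI */
        *out = itt[device_id][event_id];
        return 0;
    }

    int main(void)
    {
        /* e.g., a NIC (DeviceID 2) signalling queue 3 raises LPI 8195 on core 1 */
        itt[2][3] = (struct itt_entry){ .lpi_intid = 8195, .target_rd = 1 };

        struct itt_entry e;
        if (its_translate(2, 3, &e) == 0)
            printf("MSI -> LPI %u on redistributor %u\n", e.lpi_intid, e.target_rd);
        return 0;
    }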
N1 SOFTWARE DEVELOPMENT PLATFORM
In order to create a proof point for our technology, we taped out a test chip based on Neoverse N1 IPs called the N1 software development platform (N1 SDP). This system consists of four Neoverse N1 cores, configured as two pairs of two-core clusters. Each core has 64-kB private L1 I/D caches and a 1-MB private L2 cache. Each cluster connects its two cores through a DynamIQ Shared Unit (DSU), which is configured with a 1-MB shared L3 cluster cache. The system is configured with two DDR4-3200 memory controllers and two PCIe Gen4 root complexes, one of which supports CCIX for attachment of cache-coherent IO devices or to support multichip configurations. A 4×2 CMN-600 coherent mesh network connects all the high-performance on-chip components. N1 SDP provides early N1 silicon samples and serves as a software development and evaluation environment for customers and partners.

REAL-WORLD PERFORMANCE
We evaluated the performance of N1 systems extensively both pre-silicon as well as in silicon implementations such as N1 SDP. Our projections and silicon measurements show that N1 systems match or exceed performance on currently available cloud instances in many relevant workloads. Single-core performance was improved from Cortex-A72 by 65% and 100% on average for integer and floating-point workloads, respectively. System-level performance uplifts are much higher thanks to the multipliers offered by the unprecedented scalability of our CMN-600 mesh interconnect.

Beyond targeting general performance improvements, we spent significant effort optimizing the system for common behaviors observed in server and networking workloads. For example, a class of workloads we focused on is high-throughput HTTP servers, such as NGINX. NGINX is a highly concurrent, high-performance application that can be used as a web server, reverse proxy, and API gateway. Neoverse N1 performance uplifts for this class of workloads are directly related to the following.

1. Memory latency and bandwidth: up to 2× increase in memcpy bandwidth vs Cortex-A72 (a minimal way to measure such a figure is sketched after this list).
2. Context switch: up to 2.5× faster than Cortex-A72.
3. Core front-end: significant reduction in branch mispredicts (7×) and cache misses (2×) vs Cortex-A72.
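For item 1, a simple way to measure a memcpy-bandwidth figure of this kind is to time repeated large copies and divide bytes moved by elapsed time, as sketched below. The 256-MiB buffer size and repetition count are arbitrary illustrative choices; published comparisons would of course also control frequency, memory configuration, and compiler.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    int main(void)
    {
        const size_t bytes = 256u << 20;          /* 256 MiB, larger than any cache */
        const int reps = 10;
        char *src = malloc(bytes), *dst = malloc(bytes);
        if (!src || !dst) return 1;
        memset(src, 1, bytes);                    /* touch pages before timing */
        memset(dst, 0, bytes);

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < reps; i++)
            memcpy(dst, src, bytes);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("memcpy bandwidth: %.2f GB/s\n", (double)bytes * reps / secs / 1e9);
        free(src); free(dst);
        return 0;
    }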
These stressors are very common with throughput applications such as MemcacheD and HHVM. Overall, Neoverse N1 can reach 2.5× higher throughput on an NGINX static web-server versus a similarly configured Cortex-A72-based system.

Another class of applications we focused our attention on are runtime frameworks such as Java Virtual Machines and .NET Frameworks. These runtime environments are the foundation of much of the applications running in the cloud
and are a natural target for our design. At a high level, on Neoverse N1 we focused on a few relevant stressors for these workloads.

1. Object management: up to 2.4× more memory allocations and 1.6× faster in copying characters on Java microbenchmarks vs Cortex-A72.
2. Managing the instruction footprint: on a Java-based benchmark, the I-cache miss rate and branch mispredictions were reduced by 1.4× and 2.25× versus Cortex-A72, respectively.
3. Process synchronization for garbage collection: locking throughput and latency improved by 2× thanks to the Large System Extensions Arm atomic instructions.
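The effect described in item 3 comes from replacing exclusive-monitor retry loops with single atomic instructions. The small C example below is a generic illustration, not code from the study: when built for Armv8.1-A or later (for example with -march=armv8.1-a on GCC or Clang), the __atomic built-in can be emitted as a single LSE LDADD instruction rather than an LDXR/STXR loop, which is what improves locking throughput and latency under contention.

    #include <stdint.h>
    #include <stdio.h>

    static uint64_t counter;

    static void add_one(void)
    {
        /* Atomic fetch-and-add; maps to LDADD on LSE-capable cores. */
        __atomic_fetch_add(&counter, 1, __ATOMIC_RELAXED);
    }

    int main(void)
    {
        for (int i = 0; i < 1000; i++)
            add_one();
        printf("counter = %llu\n", (unsigned long long)counter);
        return 0;
    }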
We expect to see higher performance gains as Neoverse N1 systems become more broadly available for software optimizations and application tuning.5 At the time of writing, Arm partners report that initial evaluations of real-world workloads on systems deploying Neoverse N1 show up to 40% better performance compared to similarly configured systems currently on the market.

CONCLUSION
The Neoverse N1 platform provides Arm's partners with the high-performance IPs necessary to architect a general compute solution for addressing the infrastructure market. These building blocks offer the versatility, performance, features, and power-area efficiency to succeed in the infrastructure market. We anticipate high-core-count designs based on Neoverse N1 to be deployed in public cloud as an alternative architecture for main compute nodes, enabling lower total cost of ownership for data center operators and edge installations of cloud compute while delivering greater design diversity. We fully expect Neoverse N1 to also find a home in more advanced network, storage, and security appliances as well as on edge compute installations deployed by network operators with design points starting at eight cores.

ACKNOWLEDGMENTS
We thank Mike Filippo, Ann Chin, and the many Arm engineers in Austin, TX and worldwide who contributed to Neoverse N1 definition, architecture, design, physical implementation, and software optimizations.

REFERENCES
1. Arm Architecture Reference Manual Armv8, for Armv8-A architecture profile documentation. [Online]. Available: https://developer.arm.com/docs/ddi0487/latest
2. M. Lipp et al., "Meltdown: Reading kernel memory from user space," in Proc. 27th USENIX Conf. Secur. Symp., 2018, pp. 973–990.
3. P. Kocher et al., "Spectre attacks: Exploiting speculative execution," in Proc. IEEE Symp. Secur. Privacy, San Francisco, CA, USA, 2019, pp. 1–19.
4. Arm, Vulnerability of Speculative Processors to Cache Timing Side-Channel Mechanism. [Online]. Available: https://developer.arm.com/support/arm-security-updates/speculative-processor-vulnerability
5. Arm Neoverse N1 Software Optimization Guide documentation. [Online]. Available: https://developer.arm.com/docs/swog309707/a

Andrea Pellegrini is a senior principal engineer with Arm, Austin, TX, USA. He leads the performance and workloads team for servers and networking. He received the Ph.D. degree from the University of Michigan, Ann Arbor, and the B.E. and M.E. degrees in computer engineering from the Università di Bologna, Italy. He is the corresponding author of this article. Contact him at Andrea.Pellegrini@arm.com.

Nigel Stephens joined Arm, Austin, TX, USA, in 2008 to contribute to the development of the Armv8-A architecture, with a focus on the new AArch64 instruction set architecture and its related software ABIs. He went on to become the lead architect with overall responsibility for Arm's A-profile instruction sets. Recent projects have included leading the design of the Scalable Vector Extension (SVE) for HPC, and its successor SVE2. He was appointed an Arm Fellow in 2015. Contact him at Nigel.Stephens@arm.com.
Magnus Bruce is a senior principal engineer with Arm, Austin, TX, USA, focused on memory system microarchitecture and coherent interconnects. He received the Master of Engineering degree in electrical engineering from the University of Florida. Contact him at Magnus.Bruce@arm.com.

Yasuo Ishii is a principal engineer with Arm, Austin, TX, USA, where he is responsible for instruction fetch and branch predictor microarchitecture. He received the Ph.D. degree in computer science from the University of Tokyo. Contact him at Yasuo.Ishii@arm.com.

Joseph Pusdesris is a staff engineer with Arm, Austin, TX, USA, focused on bridging the gap between performance exploration, modeling, and RTL design. He received the MSE degree in computer science from the University of Michigan. Contact him at Joseph.Pusdesris@arm.com.

Abhishek Raja is a staff engineer with Arm, Austin, TX, USA, focused on CPU memory system microarchitecture. He received the M.S. degree in electrical engineering from the University of Washington. Contact him at Abhishek.Raja@arm.com.

Chris Abernathy is a CPU lead architect with Arm, Austin, TX, USA. He received the M.S. degree in electrical and computer engineering from the University of Texas. Contact him at Chris.Abernathy@arm.com.

Jinson Koppanalil is a senior principal engineer with Arm. He is a technical lead responsible for the development of Arm's CPU products. He received the M.S. degree in computer engineering from North Carolina State University, Raleigh. Contact him at Jinson.Koppanalil@arm.com.

Tushar Ringe is a principal engineer with the Systems IP group, Arm, Austin, TX, USA. He is part of the Microarchitecture team responsible for delivering Coherent Interconnect IP. His research interests include various architectures, such as CHI/AXI-based protocols, microarchitecture exploration, and validation efficiency improvements with special focus on IO traffic performance passing through Interconnect IP. He received the M.Tech. degree in microelectronics from the Indian Institute of Technology Madras, India. Contact him at Tushar.Ringe@arm.com.

Ashok Tummala is a principal engineer with the Systems IP group, Arm, Austin, TX, USA. He is part of the Architecture, Microarchitecture, and Design team working on Arm's Coherent Interconnects. His research interests include various industry-standard I/O bus protocols including PCIe, CCIX, USB, etc., H/W and S/W coherency, system architectures, and heterogeneous computing using hardware accelerators. He received the Ph.D. degree in electrical and computer engineering from the University of Texas at San Antonio. Contact him at Ashok.Tummala@arm.com.

Jamshed Jalal is a distinguished engineer with the Systems IP group, Arm, Austin, TX, USA. He is the lead architect on Arm's various families of Coherent Interconnects and also a key contributor to Arm's AMBA5 CHI protocol specification. His research interests include H/W and S/W coherency, system architectures ranging from client to enterprise, performance analysis, and memory technologies. He received the Bachelor's degree in electrical and computer engineering from Oklahoma State University. Contact him at Jamshed.Jalal@arm.com.

Mark Werkheiser is a distinguished engineer with the Systems IP group, Arm, Austin, TX, USA. He is the technical lead responsible for Arm's Coherent Interconnect IP. His research interests include developing scalable and configurable coherent interconnect microarchitecture. He received the Master's degree from the University of Wisconsin-Madison. Contact him at Mark.Werkheiser@arm.com.

Anitha Kona is a senior principal infrastructure system architect with the Central Technology group, Arm, Austin, TX, USA. She is the lead system architect for N1 SDP and other server and networking class systems being developed at Arm. She received the Master's degree from Mississippi State University. Contact her at Anitha.Kona@arm.com.
Theme Article: Hot Chips

TeraPHY: A Chiplet Technology for Low-Power, High-Bandwidth In-Package Optical I/O

Mark Wade, Erik Anderson, Shahab Ardalan, Pavan Bhargava, Sidney Buchbinder, Michael L. Davenport, John Fini, Haiwei Lu, Chen Li, Roy Meade, Chandru Ramamurthy, Michael Rust, Forrest Sedgwick, Vladimir Stojanovic, Derek Van Orden, Chong Zhang, and Chen Sun
Ayar Labs, Inc.

Sergey Y. Shumarayev, Conor O'Keeffe, Tim T. Hoang, David Kehlet, Ravi V. Mahajan, Matthew T. Guzy, Allen Chan, and Tina Tran
Intel Corporation

Digital Object Identifier 10.1109/MM.2020.2976067
Date of publication 24 February 2020; date of current version 18 March 2020.

Abstract—In this article, we present TeraPHY, a monolithic electronic–photonic chiplet technology for low-power and low-latency, multi-Tb/s chip-to-chip communications. Integration of the TeraPHY optical technology with the open-source advanced interconnect bus interface enables communication between chips at board, rack, and row level at the energy and latency cost of in-package interconnect. This enables the design of logically connected but physically separated large-scale and high-performance digital systems. The copackaging integration approach is demonstrated by integrating the TeraPHY die into the Intel Stratix10 FPGA multichip package.

EMERGING MACHINE-LEARNING, high-performance computing, and digital signal processing applications (digital beamforming for radar and 5G) require high-performance distributed computing with access to large pools of storage and memory, Figure 1. The key to scalability of such systems is the high-bandwidth, low-latency, and energy-efficient interconnect fabric that enables ubiquitous off-package chip-to-chip communication.
Figure 1. Emerging memory-semantic fabrics connecting CPUs, GPUs, and various ASIC and FPGA-based accelerators to pools of shared storage and memory.

Electrical I/O is fast approaching performance limitations due to signal integrity issues as data rates increase to 112 Gb/s and beyond per pin in a pin-count constrained package. State-of-the-art systems-on-chip (SoCs) are already capable of outputting 10+ Tb/s of data throughput, but mostly to a copackaged high-bandwidth memory (HBM) due to the off-package electrical I/O limitations. Figure 2 illustrates these tradeoffs between distance and electrical I/O efficiency metrics (bandwidth density and energy efficiency). The in-package electrical I/O technologies like wide-parallel interfaces for HBM and logical chip-to-chip communication (e.g., AIB—CHIPS LR/SR) have the highest bandwidth density and energy efficiency but only span distances of a few mm, which is enough to communicate between two chips in the same package. The energy efficiency and bandwidth density decrease rapidly with distance as more power-hungry SerDes electrical links are needed. Finally, the optics pluggable solutions can span multiple kilometers but are the most power hungry and have the smallest bandwidth density. The performance figure of merit (bandwidth density–energy-efficiency product) gap between these off-board solutions and in-package interconnects is approximately four orders of magnitude.

Figure 2. Interconnect metrics versus reach tradeoffs8 and TeraPHY technology capabilities.

TeraPHY technology aims to bridge this gap and enable off-package and off-board chip-to-chip communication at the energy, latency, and bandwidth-density of in-package interconnects, effectively removing the I/O bottleneck and enabling scalability and design of logically connected, physically distributed large-scale digital systems.

Integrating optics in the same package as the SoC has emerged as one of the only viable paths to scale beyond the chip-to-chip electrical I/O bottleneck. However, this path represents a major paradigm shift, and there has yet to be a demonstration of in-package optical I/O with high-performance digital SoCs (FPGAs, CPUs, GPUs, ASICs, etc.) that solves latency, bandwidth density, power, and cost constraints. While the multichip packaging ecosystem has already been established through the integration of HBM and electrical I/O chiplets, the challenges for ubiquitous adoption of in-package optics include: a suitable low-power electrical interface between a digital core and an electro-optical chiplet, a scalable electro-optic packaging solution and ecosystem that is compatible with attaching optical fiber to a package, and an optical technology that provides energy efficiency and bandwidth density that meets the performance required for 10+ Tb/s of chip-to-chip I/O.

In this article, we present TeraPHY, a Terabit-speed optical PHY chiplet with an advanced interconnect bus (AIB)1 electrical interface and O-band dense-wavelength division multiplexing (O-DWDM) optical interface, integrated in-package with an Intel Stratix10 (S10) FPGA using Intel's Embedded Multi-die Interconnect Bridge (EMIB)2 packaging technology.

CHIPLET-BASED IN-PACKAGE OPTICAL I/O
Chiplet-based technologies have emerged as a scalable approach to integrating heterogeneous electrical functionality (high-speed
analog, memories, custom accelerators, etc.) inside multichip packages (MCP). To support this approach, an ecosystem has been created to define scalable electrical interfaces and packaging technologies. Until now, the focus has been on integrating heterogeneous electrical chiplets, but the same ecosystem can be leveraged to support next-generation systems based on in-package optical I/O.

Figure 3. Ayar Labs' optical I/O architecture based on TeraPHY technology.

Figure 3 illustrates the Ayar Labs' optical I/O architecture based on the TeraPHY technology. The TeraPHY chiplet is co-packaged with the SoC and is powered by the off-package multiport, multiwavelength optical supply (SuperNova). Since TeraPHY chiplets are implemented via monolithic integration of photonic devices with transistors, they can host a variety of electrical interfaces (from wide-parallel to high-speed serial) to the SoC that adapt to the chosen packaging technology (organic substrates or 2.5D integration).

In this article, we illustrate the MCP integration with the S10 FPGA,3 which communicates with the two TeraPHY chiplets through an AIB interface. The AIB interface is a wide parallel digital interface. The TeraPHY chiplet contains an array of SerDes which transfer data between the AIB interface and the higher data rate optical channels. This integration scheme avoids the need for repeated SerDes interfaces between chips that waste significant power and area resources.

To achieve the density of pins required to escape high bandwidths from the SoC, Intel's EMIB technology is used. EMIB allows a high density of electrical connections in the AIB regions of the S10 and TeraPHY electrical bump maps while supporting relaxed pitch bumps elsewhere, overcoming the substrate size limitation and connectivity yield risks of traditional silicon interposer technology.

TERAPHY TECHNOLOGY
Figure 4. Dense wavelength division multiplexed photonic link architecture of the TeraPHY chiplet photonic macros based on ring resonators.

The TeraPHY chiplets employ dense wavelength division multiplexing, Figure 4. In each TeraPHY optical macro, multiple ring-resonator-based modulators, with resonant wavelengths spaced nominally to match the multiple laser wavelengths, are coupled to the waveguide on the transmit side. Multiple resonant photodetectors are coupled to the same waveguide on the receive side to receive the corresponding wavelength channel. Ring wavelength tuning and control circuits are used to lock the ring resonators to the corresponding laser wavelengths and compensate for laser wavelength drift, process, and environmental temperature variations.
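One simple way to picture the locking behavior is a dither-and-hold loop: nudge the ring's thermal tuner, keep the move if the monitored photocurrent indicates the ring moved toward resonance, and otherwise back off and reverse direction. The C sketch below implements that hill-climbing idea against a simulated readout; it is a conceptual illustration under those assumptions, not a description of the TeraPHY tuning and control circuits.

    #include <stdint.h>
    #include <stdio.h>

    /* Simulated hardware: photocurrent peaks when the heater DAC code hits the
     * (unknown to the loop) resonance code. Stands in for real readouts. */
    static const double resonance_code = 700.0;
    static uint16_t heater_code = 512;

    static void set_heater_dac(uint16_t code) { heater_code = code; }
    static double read_photocurrent(void)
    {
        double d = (double)heater_code - resonance_code;
        return 1.0 / (1.0 + d * d / 400.0);          /* Lorentzian-like peak */
    }

    /* One hill-climbing step: keep the dither only if it moved us toward
     * resonance, otherwise revert and flip the search direction. */
    static void ring_lock_step(uint16_t *dac, int *direction)
    {
        double before = read_photocurrent();
        uint16_t trial = (uint16_t)(*dac + *direction);
        set_heater_dac(trial);
        if (read_photocurrent() >= before) {
            *dac = trial;
        } else {
            set_heater_dac(*dac);
            *direction = -*direction;
        }
    }

    int main(void)
    {
        uint16_t dac = 512;
        int dir = 1;
        set_heater_dac(dac);
        for (int i = 0; i < 400; i++)
            ring_lock_step(&dac, &dir);
        printf("settled at DAC code %u (resonance at %g)\n", dac, resonance_code);
        return 0;
    }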
The TeraPHY chiplet is based on a 45-nm Silicon-on-Insulator CMOS process. The same silicon layers used to form the transistor active
region and gates are used to form the photonic components, Figure 5.

Figure 5. 300-mm TeraPHY wafer; TeraPHY chiplet cross-section indicating the close proximity of the transistors and photonic devices; SEMs of transistors and representative photonic devices.

This technology has previously demonstrated the first-ever microprocessor chip to directly communicate using light,4 a full multiwavelength microring-based optical link transmitting real network packets,5 and a 16×25-Gb/s (400G) microring-based WDM transmitter macro with 0.8-pJ/bit energy efficiency.6 Figure 6 shows several select results from optical TeraPHY macros, including a dynamically configurable 20–100-Gb/s/wavelength NRZ/PAM4 microring transmitter that achieves 0.7 pJ/bit energy efficiency at 100-Gb/s PAM4.

Figure 6. TeraPHY test results. (a) 3 mm × 3 mm TeraPHY test chip die photo. (b) Zoom in on Tx macro. (c) 16×25-Gb/s WDM photonic transmitter. (d) Rx bathtub curve. (e) Zoom in on Rx macro. (f) Eye diagrams from dynamically tunable transmitter.

Monolithic integration of high-performance transistors and photonic devices enables highly efficient photonic links where the low-parasitics connections between transistors and photonic devices lead to record energy efficiencies, and dense connectivity leads to the use of unique tuning and control schemes that enable the utilization of energy- and area-efficient ring-resonator-based photonic devices.

EMIB AND AIB
EMIB2 is a cost-effective approach to in-package high-density interconnect of heterogeneous chiplets. This MCP approach is often referred to as 2.5D package integration. Figure 7 shows more detail on this physical technology approach.

Figure 7. Intel's EMIB packaging technology illustrating embedded silicon bridges enabling dense connectivity between chiplets with 55-μm bumps in the interface regions and increased yield with larger bumps in the non-AIB interface regions.
Figure 8. Top: EMIB MCP with two TeraPHY interfaces; Middle: zoom-in of the mixed-bump pitch and EMIB AIB interface region. Bottom: TeraPHY die with the SEM zoom-in into the mixed bump pitch region (denser 55-μm bumps in the AIB interface region).

EMIB allows for mixed bump pitches and sizes between those used in conventional flip-chip packaging and the tighter microbump pitch needed to support high-density routing. In initial EMIB deployments, two routing layers for signals at 2-μm signal width and spacing were utilized. At 2 Gb/s per signal, that gave a maximum bandwidth density across two layers of 1 Tb/s/mm. The remaining two layers are used for power and ground.
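The 1-Tb/s/mm figure follows directly from those numbers. Assuming the quoted 2-μm width plus 2-μm space (a 4-μm routing pitch per signal):

    (1000 μm/mm ÷ 4 μm per signal) × 2 Gb/s per signal × 2 signal layers = 250 × 4 Gb/s = 1 Tb/s/mm.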
While this MCP platform leverages Intel's EMIB technology, AIB interfaces are defined to work equally well with alternative technologies (e.g., silicon interposers, organic substrates, etc.). In AIB, data bits are transmitted over a parallel bus using single-ended, full-swing signaling. AIB uses a clock-forwarded architecture where the clock is transmitted along with and in the same direction as the data. This scheme is suitable for USR (ultrashort reach) and XSR (extrashort reach) channels. The simplicity of AIB was chosen to maximize bandwidth density (Gb/s/mm² and Gb/s/mm) and minimize energy cost (pJ/bit), the two most important metrics for chiplet electrical interfaces.

INTEGRATION WITH INTEL STRATIX10 FPGA
Figure 8 illustrates the EMIB MCP designed to integrate two TeraPHY chiplets. The SEM of the TeraPHY chiplet indicates mixed bump pitch/size between the denser AIB interface region and the larger bumps covering the rest of the chip.

Figure 9. Floorplan of the TeraPHY chiplet showing AIB interface, ten optical Tx/Rx macros, and fiber array, with TeraPHY macro and Tx/Rx slice insets.

Figure 9 shows the layout and organization of the 8.9 mm × 5.5 mm TeraPHY chiplet. The AIB interface is on the left, the digital glue logic and TeraPHY optical macros are in the middle, and the optical connector area is to the right of the chiplet. Each TeraPHY chiplet supports a maximum of 960-Gb/s Tx/Rx across the AIB interface, which is comprised of 24 channels running at 40 Gb/s each, at 2 Gb/s per pin. There are ten optical Tx/Rx macros, each with eight wavelengths, supporting a maximum of 256 Gb/s per macro. The "digital glue" is a reconfigurable crossbar that
can map groups of four AIB channels to groups of four optical wavelengths within each of macros 1–8 and support multicast and broadcast. The digital reconfiguration enables workload-timescale path reconfiguration without physical connection rewiring.
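The electrical and optical budgets quoted above can be cross-checked with simple arithmetic:

    24 AIB channels × 40 Gb/s = 960 Gb/s per direction per chiplet
    10 macros × 8 wavelengths × 32 Gb/s per wavelength = 10 × 256 Gb/s = 2.56 Tb/s of aggregate optical capacity

The 32-Gb/s-per-wavelength rate is inferred here from the stated 256 Gb/s per eight-wavelength macro rather than quoted directly; either way, the optical side comfortably exceeds the AIB-limited 960 Gb/s, which is why the crossbar can remap channels to wavelengths without creating a bottleneck.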
Each TeraPHY optical macro consists of transmit/receive slices for each wavelength (including ring tuning control, adaptation, clock and data recovery logic, and link monitoring circuits), and a shared clocking subsystem and digital control logic layer. The Tx/Rx slice inset illustrates the degree of density achieved with monolithic integration where photonic devices are in close proximity with the Tx/Rx slice analog front-end circuitry.

Since the TeraPHY chiplet is a silicon die with a standard CMOS back-end-of-line, it is compliant with standard flip-chip assembly techniques. Figure 10 illustrates the EMIB MCPs with TeraPHY and S10 chips after the flip-chip assembly step utilizing standard flip-chip reflow equipment for EMIB MCP packages. The optical signals escape the TeraPHY chiplet through the substrate side of the die, preventing complex interactions between the optical signal escape and the electrical signal escape. After the flip-chip assembly, commercial off-the-shelf optical fiber arrays are used to optically connect to the TeraPHY chiplets.

Figure 10. Top: Assembled EMIB MCPs with TeraPHY and S10 chips. Bottom: Fully assembled S10 MCP with TeraPHY chiplets post fiber attach.

CONCLUSIONS
In-package optics represent a major milestone for the next generation of high-performance SoCs and have the potential to remove the bandwidth-distance tradeoff created by electrical I/O. We presented the Ayar Labs TeraPHY chiplet integrated into an AIB-connected, EMIB-enabled MCP package with an Intel Stratix10 FPGA, paving the way for fully TeraPHY-populated optical packaging enabling <5 pJ/bit multi-Tb/s die-to-die IO connectivity at <10 ns SoC-to-SoC latency (excluding fiber) and with up to 2 km in reach. This technology is revolutionary to high-performance compute and emerging machine-learning applications.

ACKNOWLEDGMENTS
The authors would like to thank Dr. W. Chappell, Dr. G. Keeler, Dr. D. Green, and Mr. A. Olofsson for their support. This work was supported by the DARPA MTO office7 under Contract HR00111790020 and Contract N66001-19-9-4010.

REFERENCES
1. Intel, "AIB specification," 2019. [Online]. Available: https://www.intel.com/content/www/us/en/architecture-and-technology/programmable/heterogeneous-integration/overview.html
2. R. Mahajan et al., "Embedded multi-die interconnect bridge (EMIB)—A high density, high band-width packaging interconnect," in Proc. 66th Electron. Compon. Technol. Conf., Jun. 2016, pp. 557–565.
3. S. Shumarayev, "Stratix 10: Intel's 14nm heterogeneous FPGA system-in-package (SiP) platform," in Proc. Hot Chips 29 Symp., 2017. [Online]. Available: https://www.hotchips.org/wp-content/uploads/hc_archives/hc29/HC29.22-Tuesday-Pub/HC29.22.50-FPGA-Pub/HC29.22.523-Hetro-Mod-Platform-Shumanrayev-Intel-Final.pdf
4. C. Sun et al., "Single-chip microprocessor that communicates directly using light," Nature, vol. 528, no. 7583, pp. 534–538, Dec. 24, 2015.
5. M. S. Akhter et al., "WaveLight: A monolithic low latency silicon-photonics communication platform for the next-generation disaggregated cloud data centers," in Proc. IEEE 25th Annu. Symp. High-Perform. Interconnects, 2017, pp. 25–28.
6. M. Wade et al., "A bandwidth-dense, low power electronic-photonic platform and architecture for multi-Tbps optical I/O," in Proc. Eur. Conf. Opt. Commun., 2018, pp. 1–3.
7. Common Heterogeneous Integration and IP Reuse Strategies (CHIPS), 2019. [Online]. Available: https://www.darpa.mil/program/common-heterogeneous-integration-and-ip-reuse-strategies
8. G. Keeler, DARPA ERI Summit, 2019.

Mark Wade is currently the CTO of Ayar Labs, Emeryville, CA, USA, and is interested in high-speed electronic–photonic systems. He received the Ph.D. degree in electrical engineering from the University of Colorado Boulder, Boulder, CO, USA, in 2015. Contact him at mark@ayarlabs.com.

Erik Anderson is currently a Hardware Engineer with Ayar Labs, Emeryville, CA, USA, and is currently working toward the Ph.D. degree with the University of California, Berkeley, Berkeley, CA, USA. He received the B.S. degree in engineering from Northwest Nazarene University, Nampa, ID, USA, and the M.S. degree in electrical engineering from Columbia University, New York, NY, USA. His research interests include integrated photonic devices and VLSI design. Contact him at erik@ayarlabs.com.

Shahab Ardalan has been with Ayar Labs, Emeryville, CA, USA, since 2017, where he is working on high-speed SerDes links for monolithic optical communication. He was with the AMS R&D group at Gennum Corporation from 2007 to 2010. In 2010, he joined San Jose State University as an Assistant Professor and Director of the center for analog and mixed signal, where he was teaching and conducting research on topics of analog and mixed-signal integrated circuits. He received the Ph.D. degree from the University of Waterloo, Waterloo, ON, Canada, in 2007. He is a Senior Member of the IEEE. Contact him at shahab@ayarlabs.com.

Pavan Bhargava is currently a Senior IC Design Engineer with Ayar Labs, Emeryville, CA, USA. He is currently working toward the Ph.D. degree in electrical engineering with UC Berkeley, Berkeley, CA, USA. He received the B.S. degree in electrical engineering from the University of Maryland-College Park, College Park, MD, USA, in 2014. His research focuses on developing optical receivers for high-speed I/O, and on designing fully integrated optical beam steering systems for solid-state LIDAR. Contact him at pavan@ayarlabs.com.

Sidney Buchbinder is currently a Hardware Engineer with Ayar Labs, Emeryville, CA, USA. His primary research interests are photonic design automation and verification. He received the B.S. degree in electrical engineering from the California Institute of Technology, Pasadena, CA, USA, in 2015, and is currently working toward the Ph.D. degree with UC Berkeley, Berkeley, CA, USA. Contact him at sidney@ayarlabs.com.

Michael L. Davenport is a co-founder of Quintessent, Inc. Previously, he worked at Ayar Labs and Intel on laser devices for use in data networking. His expertise is centered around the design and fabrication of integrated circuit lasers, particularly mode-locked lasers, widely tunable lasers, and lasers integrated with silicon photonics. He received the Ph.D. degree from the University of California, Santa Barbara, CA, USA, in 2017. Contact him at mike@ayarlabs.com.

John Fini is currently the Director of Optical Design with Ayar Labs, Emeryville, CA, USA. He is interested in silicon photonics design and the automation of layout and design. He received the Ph.D. degree from the Massachusetts Institute of Technology, Cambridge, MA, USA, in 2001. Contact him at john@ayarlabs.com.

Haiwei Lu is currently a Principal Engineer with Ayar Labs, Emeryville, CA, USA. He is working on photonics assembly and packaging. He received the
Ph.D. degree in chemistry from the University of California, Riverside, CA, USA. Contact him at haiwei@ayarlabs.com.

Chen Li is currently a Senior Mechanical Engineer with Ayar Labs, Emeryville, CA, USA. He is interested in thermal characterization and management of electronic and silicon photonic devices, as well as novel packaging of silicon photonics. He received the Ph.D. degree in mechanical engineering from the University of Michigan, Ann Arbor, MI, USA, in 2019. Contact him at chen.li@ayarlabs.com.

Roy Meade is currently the VP of Manufacturing with Ayar Labs, Emeryville, CA, USA. He has more than 20 years of experience in CMOS and is an inventor on more than 70 patents. He received the MBA degree from Duke University, Durham, NC, USA, and the M.S. and B.S. degrees in mechanical engineering from Georgia Tech, Atlanta, GA, USA. He is a Senior Member of the IEEE. Contact him at roy@ayarlabs.com.

Chandru Ramamurthy is currently a Staff Engineer with Ayar Labs, Emeryville, CA, USA, and is interested in high-speed design, radiation-hardened design, and design technology co-optimization. He received the Ph.D. degree in electrical engineering from Arizona State University, Tempe, AZ, USA, in 2017. Contact him at chandru@ayarlabs.com.

Michael Rust is currently a Photonic Test Engineer with Ayar Labs, Emeryville, CA, USA. He is interested in integrated photonics based I/O and topological quantum computation. He received the B.S. degree in physics and the B.S.E.E. degree in computer architecture from the University of Texas at Austin, Austin, TX, USA, and the M.S. degree in computer science from Georgia Tech, Atlanta, GA, USA. Contact him at michael@ayarlabs.com.

Forrest Sedgwick is currently the Director of Test Engineering with Ayar Labs, Emeryville, CA, USA. His research interest is in photonic systems for communications. He received the Ph.D. degree in electrical engineering from the University of California at Berkeley, Berkeley, CA, USA, in 2007. Contact him at forrest@ayarlabs.com.

Vladimir Stojanovic is currently a Professor of electrical engineering and computer sciences with the University of California, Berkeley, Berkeley, CA, USA, and Chief Architect with Ayar Labs, Emeryville, CA, USA. He was also with Rambus, Inc., Los Altos, CA, USA, from 2001 through 2004, and with MIT as an Associate Professor from 2005 to 2013. He received the Ph.D. degree in electrical engineering from Stanford University, Stanford, CA, USA, in 2005, and the Dipl. Ing. degree from the University of Belgrade, Serbia, in 1998. He is a Senior Member of IEEE. Contact him at vladimir@ayarlabs.com.

Derek Van Orden is currently a Photonics Design Engineer with Ayar Labs, Emeryville, CA, USA. His primary interest is the design of high-speed electro-optic modulators and photodetectors leveraging microring resonators on CMOS platforms. He received the Ph.D. degree in physics from the University of California San Diego, La Jolla, CA, USA, in 2011. Contact him at derek@ayarlabs.com.

Chong Zhang is currently a Principal Engineer with Ayar Labs, Emeryville, CA, USA. He is working on photonics and electrical packaging. He received the Ph.D. degree in mechanical engineering and the M.S. degree in optics from the University of Central Florida, Orlando, FL, USA, in 2008. Contact him at chong@ayarlabs.com.

Chen Sun is currently the Chief Scientist and VP of Silicon Engineering with Ayar Labs, Emeryville, CA, USA. He is interested in VLSI design and photonics I/O. He received the B.S. degree from the University of California Berkeley, Berkeley, CA, USA, in 2009, and the S.M. and Ph.D. degrees in electrical engineering from the Massachusetts Institute of Technology, Cambridge, MA, USA, in 2011 and 2015, respectively. Contact him at chen@ayarlabs.com.

Sergey Shumarayev is currently a Senior Principal Engineer with the Intel Programmable Solution Group CTO office. He is responsible for Interconnect Strategy and is interested in heterogeneous multichip package integration. He received the B.S. degree in electrical engineering from the University of California Berkeley, Berkeley, CA, USA, and the M.S. degree in electrical engineering from Cornell University, Ithaca, NY, USA. Contact him at sergey.yuryevich.shumarayev@intel.com.

Conor O'Keeffe is currently a Research Engineer with the CTO group of Intel's Programmable Solutions Group. He is interested in SoC architecture, RFIC design, and wireless infrastructure. He received the honors degree in electronic and communication engineering from the University of South Wales, U.K. Contact him at conor.o.keeffe@intel.com.

Tim Hoang is currently a Hardware Architect with Intel. He is interested in programmable circuits, I/O,
analog and mixed-signal IP, and 2.5D die-to-die interfaces and protocols. He received the B.S. degree in electrical engineering and computer science from the University of California Berkeley, Berkeley, CA, USA. Contact him at tim.tri.hoang@intel.com.

David Kehlet is currently a Researcher with Intel working on pathfinding for FPGA technology including high-speed interfaces. He received the B.S. and M.S. degrees in electrical engineering from Stanford University, Stanford, CA, USA. Contact him at david.kehlet@intel.com.

Ravi V. Mahajan is currently an Intel Fellow and has worked on many microelectronics packaging technologies. He received the Ph.D. degree in mechanical engineering from Lehigh University, Bethlehem, PA, USA. Contact him at ravi.v.mahajan@intel.com.

Matthew T. Guzy has held several technical positions at Intel in areas focusing on packaging and assembly. He is interested in next-generation chip-on-wafer processes and emerging packaging technologies. He received the Ph.D. degree in chemical engineering from Virginia Polytechnic Institute and State University, Blacksburg, VA, USA, in 2003. Contact him at matthew.t.guzy@intel.com.

Allen Chan is currently a Principal Engineer at Intel with the Programmable Solutions Group's CTO Office. He received the B.S. degree in electrical engineering from the University of California Davis, Davis, CA, USA. Contact him at allen.chan@intel.com.

Tina Tran is currently an SOC Hardware Engineer with Intel and is interested in transceiver IP design and integrating chiplet technologies in FPGA platforms. She received the B.S. degree in electrical engineering and computer science from the University of California Berkeley, Berkeley, CA, USA. Contact her at tina.c.zhong@intel.com.
Department: Micro Economics

Expertise at Our Fingertips

Shane Greenstein
Harvard Business School

Digital Object Identifier 10.1109/MM.2020.2971915
Date of current version 18 March 2020.

WHEN MY BROTHER-IN-LAW moved out of state, he gifted my household his beautiful leather-bound set of Encyclopedia Britannica. In spite of their age, they contain answers to any number of questions. How many species of penguins inhabit Antarctica? Who was husband to Cleopatra, last queen of Egypt? When was Billie Holiday born? Nobody in my household ever touches them. Everybody uses Wikipedia.

This column contrasts the economics behind yesterday's compendium of expertise and today's crowd-sourced wiki. Any comparison, even a coarse one, will show that prices fell dramatically. No other conclusion can emerge. Did quantity and quality of answers improve? How about their accuracy and reliability? That is less obvious.

CHEAPER PRICES
Start with prices. Britannica reached its peak sales in 1990 when a set of books cost a household around $1500, just under $3000 in contemporary 2020 dollars. The leather-bound volumes cost 30% more. Most households purchased these with monthly payments, say, $30–$50 a month. That was 1%–2% of median U.S. family income. Rich and well-to-do middle class families bought them.

The price of Wikipedia differs substantially. The ungated web site costs nothing to use, though it is also not entirely free to users. Think of it this way. Users pay charges for internet access. A portion of that expenditure anticipates using Wikipedia.

It cannot add up to much. For most households Wikipedia constitutes much less than 4% of their surfing. The average broadband subscription and smartphone data contract in the United States is around $40–$60 a month and around $60–$80 a month, respectively. The portion of expenditure attributable to Wikipedia cannot exceed $3–$5 a month. Any way you look at it, expenditure declined more than 90%, and became a trifling fraction of median household income.

That does not account for the biggest drop, which does not involve money. It concerns time.

In 1990, Britannica sold more than 100,000 copies in the United States. Over several decades several million households bought and owned a volume of books. Tens of millions of households had a set from a competitor—e.g., World Book, Colliers, and so on. Everybody else went to their local library. In contrast, today, four out of five U.S. households access the internet at home, or on their phone, which means anywhere. If a price could be put on convenience, it would show a massive drop because internet access is so widespread.
Because Wikipedia uses less time, it is just less hassle per transaction. That enables a crazy difference in the scale of use. In one hour Wikipedia receives 4.8 million visitors for content in English. Not all of those readers came from the United States, but so what? One hour.

Scale changed in other ways. Britannica contains 120 000 articles at most. To facilitate sales of books, Britannica's managers decided long ago to cap the total volume of space occupied by the books on shelves. Editors shortened the included articles to make room for the new. That has not changed much over the decades.

Wikipedia contains a broader scope, and more information. Due to the negligible cost of storage Wikipedia faces no constraint on its breadth or length. At last count Wikipedia contains just under six million articles in English, and it continues to grow. For example, the entry for Penguins—i.e., the animal—contains more than eight thousand words, while the entry for The Penguin—i.e., the villain from Batman—gets five thousand. Cleopatra's entry receives more than 25 000 words. Billie Holiday gets more than 10 000, while Billie Eilish, recent Grammy winner at age 18, has 5000 words on her page.

Britannica does not lose on every dimension. Constraints led Britannica to impose a singular sensibility on all articles. Everything came from an established expert. Every article got attention from professional editors, so the best writing is truly magnificent.

Wikipedia goes to the other extreme. While broad, it applies a porous filter. It contains a massive sampling of material that Britannica does not. It has something for everyone, and plenty that many readers do not want. After all, one expert's junk is another reader's appropriate topic for an online encyclopedia.

That is a crucial difference, so let's reiterate. Britannica has great content, and most of it is inconvenient to access. Wikipedia contains plenty of everything, both nutritious news and saccharine sweet sophistry, all of it within reach with little effort.

SUPPLY OF AUTHORITY
The costs of production declined too. Britannica's expenses in 1990—i.e., to support worldwide distribution—reached $650 million, around $1.3 billion in today's dollars. Wikipedia's expenses—i.e., again, to support worldwide distribution—reached less than $100 million. Wikipedia costs at least 90% less to produce.

By design, and out of necessity, Britannica was selective. It showed only the final draft of an article. Also by design and out of necessity, Wikipedia is a work in progress. It shows everything. The cost of an extra web page is negligible, and so is the cost of another paragraph, picture, link, and an article's entire editorial evolution. The only effective constraint comes from the AI bots that rid the website of abusive language, and from the editors' collective sense of what belongs.

How does Wikipedia save so much production cost? Those editors are volunteers. A total of 250 000 of them regularly edit the site each month in all its languages. They add content, and incorporate suggestions from tens of millions who add something small. More than 350 employees support them.

Can volunteers write as well as experts? As it turns out, sometimes yes, when the crowd is big enough.

Two colleagues and I recently investigated editors' behavior in the most uncomfortable setting at Wikipedia, the pages for U.S. politics.1 We found that many editors start off biased. They show up and spout their slanted opinion. Unlike the majority, a future editor sticks around, at least for a short while. Of those, who remains for the long haul? We estimate that only 10%–20% stay for more than a year. Most leave after a month or so, most often after encountering others with extreme and opposing views. Most interesting, those who stay lose their biases, and start fostering a neutral point of view, the site's highest aspiration for all its content. In short, Wikipedia does not devolve into hopeless arguments because the moderates decide to stay and edit the crowd.

In another project, we compared the political slants of close to four thousand articles from
Britannica and Wikipedia.2 The articles covered nearly identical topics in U.S. politics, and tend to be popular. The biases of the articles in Britannica and Wikipedia became similar when the Wikipedia article received considerable editorial attention. The most edited achieved something akin to a neutral point of view. They were almost always longer too, containing a wider sampling of opinion.

Articles matched by topic are not representative of all of Wikipedia, however. Due to its broad sampling of topics, Wikipedia contains many more articles, unmatched to anything within Britannica. But it comes with a drawback. Many of these lack editing, which generated the potential for uncorrected grammatical errors, factual mistakes, and narrow sampling of opinion.

Therein lies the subtle difference between expert and crowd. The distribution of unfinished articles is enormous at Wikipedia because, believe it or not, 250 000 editors is nowhere near what the site requires. There are plenty of passages that need attention and do not get it.

Each organization acts accordingly. Britannica flaunts its expertise, while Wikipedia flaunts its sourcing from the crowd. Britannica claims reliability, while Wikipedia openly declares caveat emptor, recommending that anybody double check the answer against other sources.

Do readers check? Why should they check objective, verifiable, and noncontroversial facts? For example, there are eight species of penguins in Antarctica, Cleopatra's last husband was Marc Antony, and Billie Holiday was born in 1915. It took one editor little time to enter that information. If the first draft erred, somebody fixed it long ago. At Wikipedia the metadata for dates, numerical descriptions, historical accounts, and minutiae of science shout their own accuracy.

On the other hand, neither Britannica, nor Wikipedia, can escape subjective, nonverifiable, and controversial content. How good was Burgess Meredith in his campy performance as the Penguin? Was Cleopatra considered charismatic? Why does Billie Eilish's music and fashion appeal to young listeners? Whereas Britannica retained its authority by asking readers to defer to the expert's opinion, Wikipedia invites answers for such questions from many sources, and tells readers to check elsewhere too.

While Wikipedia exposes the artificial illusion of using a single source of expertise, it leaves the crowd's authority open to second guessing. The reader gains control, but loses assurance in the exchange.

CONCLUSION
When questions come up in my household, I go straight to Wikipedia. It takes more time to put on reading glasses than it does to voice the answer from the small screen. The children listen while pretending not to, and so we move forward.

What a bountiful, convenient, and dangerous gift for the generation tied to small screens. The user is in charge, but it comes with a catch. A modern reader needs to take the time to don their thinking cap. But who takes the time and effort? And does anyone really possess enough judgment to second guess it all?

REFERENCES
1. S. Greenstein, G. Y. Gu, and F. Zhu, forthcoming, "Ideological segregation among online collaborators: Evidence from Wikipedians," Manage. Sci. [Online]. Available: http://dx.doi.org/10.2139/ssrn.2851934
2. S. Greenstein and F. Zhu, "Do experts or crowd-based models produce more bias? Evidence from encyclopedia Britannica and Wikipedia," MIS Quart., vol. 42, no. 3, pp. 945–959, 2018.

Shane Greenstein is a professor at the Harvard Business School. Contact him at sgreenstein@hbs.edu.