2020-Mar - Hot Chips - ARM Neoverse N1
[Cover: Hot Chips. www.computer.org/micro. Register now: 4–7 May 2020, San Jose, CA.]
IEEE Micro (ISSN 0272-1732) is published bimonthly by the IEEE Computer Society. IEEE Headquarters, Three Park Ave., 17th Floor, New York,
NY 10016-5997; IEEE Computer Society Headquarters, 2001 L St., Ste. 700, Washington, DC 20036; IEEE Computer Society Publications Office,
10662 Los Vaqueros Circle, PO Box 3014, Los Alamitos, CA 90720. Postmaster: Send address changes and undelivered copies to IEEE, Member-
ship Processing Dept., 445 Hoes Ln., Piscataway, NJ 08855. Periodicals postage is paid at New York, NY, and at additional mailing offices. Canadian
GST #125634188. Canada Post Corp. (Canadian distribution) Publications Mail Agreement #40013885. Return undeliverable Canadian addresses
to 4960-2 Walker Road; Windsor, ON N9A 6J3. Printed in USA. Reuse rights and reprint permissions: Educational or personal use of this material is
permitted without fee, provided such use: 1) is not made for profit; 2) includes this notice and a full citation to the original work on the first page of the
copy; and 3) does not imply IEEE endorsement of any third-party products or services. Authors and their companies are permitted to post the accepted
version of IEEE-copyrighted material on their own webservers without permission, provided that the IEEE copyright notice and a full citation to the
original work appear on the first screen of the posted copy. An accepted manuscript is a version which has been revised by the author to incorporate
review suggestions, but not the published version with copy-editing, proofreading, and formatting added by IEEE. For more information, please go to
ieee.org/publications_standards/publications/rights/paperversionpolicy.html. Permission to reprint/republish this material for commercial, advertising,
or promotional purposes or for creating new collective works for resale or redistribution must be obtained from IEEE by writing to the IEEE Intellectual
Property Rights Office, 445 Hoes Lane, Piscataway, NJ 08854-4141 or pubs-permissions@ieee.org. ©2020 by IEEE. All rights reserved. Abstracting
and library use: Abstracting is permitted with credit to the source. Libraries are permitted to photocopy for private use of patrons, provided the per-copy
fee indicated in the code at the bottom of the first page is paid through the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923.
Editorial: Unless otherwise stated, bylined articles, as well as product and service descriptions, reflect the author’s or firm’s opinion. Inclusion in IEEE
Micro does not necessarily constitute an endorsement by IEEE or the Computer Society. All submissions are subject to editing for style, clarity, and
space. IEEE prohibits discrimination, harassment, and bullying. For more information, visit ieee.org/web/aboutus/whatis/policies/p9-26.html.
March/April 2020, Volume 40, Number 2

Special Issue

Guest Editors’ Introduction
6   The Hot Chips Renaissance

8   MLPerf: An Industry Standard Benchmark Suite for Machine Learning Performance
    Peter Mattson, Vijay Janapa Reddi, Christine Cheng, Cody Coleman, Greg Diamos, David Kanter, Paulius Micikevicius, David Patterson, Guenther Schmuelling, Hanlin Tang, Gu-Yeon Wei, and Carole-Jean Wu

36  RTX On—The NVIDIA Turing GPU
    John Burgess

45  The AMD “Zen 2” Processor
    David Suggs, Mahesh Subramony, and Dan Bouvier

    Cloud-to-Edge Infrastructure SoC
    Andrea Pellegrini, Nigel Stephens, Magnus Bruce, Yasuo Ishii, Joseph Pusdesris, Abhishek Raja, Chris Abernathy, Jinson Koppanalil, Tushar Ringe, Ashok Tummala, Jamshed Jalal, Mark Werkheiser, and Anitha Kona

63  TeraPHY: A Chiplet Technology for Low-Power, High-Bandwidth In-Package Optical I/O

Micro Economics
74  Expertise at Our Fingertips
    Shane Greenstein
From the Editor-in-Chief

WELCOME TO THE March/April 2020 issue of IEEE Micro. This issue features selected articles from the 31st Hot Chips Symposium, held at Stanford University in August 2019. The Memorial Auditorium at Stanford was crowded, with record attendance eager to hear about the newest emerging chips. Whether the driver is machine learning acceleration or a sheer increase in compute ability, the chip design arena has become hotter than ever, and a lot of money is pouring into designing special-purpose and general-purpose chips. IEEE Micro is pleased to present for our readers selected articles based on the presentations at the Hot Chips Symposium.

Christos Kozyrakis of Stanford University and Ian Bratt of ARM served as guest editors for this issue. They have compiled an excellent selection of articles on emerging chips and systems from the Symposium, including articles from Tesla, Habana Labs, NVIDIA, AMD, ARM, and Ayar Labs. Please read the Guest Editors’ Introduction for a preview of the seven papers, which include two articles on machine learning acceleration, one on machine learning benchmarks, two on high-end chips from AMD and ARM, an article on the newest GPU from NVIDIA, and one on optical die-to-die interconnects. Thanks to the editors, authors, and reviewers who worked hard to put this issue together.

In addition to the seven Hot Chips articles, this issue also features a Micro Economics column by Shane Greenstein, “Expertise at Our Fingertips,” discussing how the crowdsourced Wikipedia has rendered Encyclopaedia Britannica obsolete. The column compares Wikipedia and Encyclopaedia Britannica on cost, reliability, coverage, contributor base, ease of use, and more.

Let me also provide an overview of what to expect in upcoming issues. The May/June issue will be the popular “Top Picks” Special Issue, which presents the best of the best from papers in computer architecture conferences in 2019. Prof. Hyesoon Kim of Georgia Tech and a selection committee from industry and academia have selected 12 papers from about 100 articles submitted in response to the Top Picks call for papers. Readers can look forward to an amazing collection of excellent articles in May/June.

Many thematic special issues are planned for the remainder of 2020. Themes include Agile/Open Source Hardware, Biology-Inspired Computing, Machine Learning for Systems, and Chip Design 2020. The July/August issue will be a Special Issue on “Agile/Open Source Hardware,” guest edited by Trevor Carlson of the National University of Singapore and Yungang Bao of ICT (China). The “Biology-Inspired Computing” theme will be guest edited by Abhishek Bhattacharjee of Yale University. Milad Hashemi of Google and Heiner Litz of the University of California, Santa Cruz will guest edit the “Machine Learning for Systems” theme. The year will conclude with “Chip Design 2020,” guest edited by Prof. Jaydeep Kulkarni of the University of Texas, and “Commercial Products 2020,” guest edited by David Patterson and Sophia Shao of the University of California, Berkeley.

We invite readers to submit to these Special Issues. Please find the open calls online, including:
https://www.computer.org/digital-library/magazines/mi/call-for-papers-special-issue-on-chip-design-2020

IEEE Micro is interested in submissions on any aspect of chip/system design or architecture.

Hope you enjoy the articles presented in this issue. Happy reading!

Digital Object Identifier 10.1109/MM.2020.2978375
Date of current version 18 March 2020.
0272-1732 © 2020 IEEE. Published by the IEEE Computer Society.
Guest Editors’ Introduction

THE 31ST ANNUAL Hot Chips symposium was held at Stanford University in August 2019. As the guest editors for this special issue of IEEE Micro, we are pleased to introduce a selection of articles based on the best presentations from the conference program.

The 2019 Hot Chips program reflected the renaissance in the chips industry that John Hennessy and Dave Patterson described in their 2017 Turing award lecture, and it attracted a record high number of attendees. Specifically, we observed three important trends. The most important development is the widespread focus on acceleration of machine learning (ML) applications. Roughly half of the conference talks described chips of various sizes and uses for which ML is a primary application driver. The second trend is the increasing use of novel approaches to overcome the scaling and efficiency limitations of modern systems. The program featured talks on in-package optical I/O, processing-in-memory, and wafer-scale integration. The final trend is the broader set of companies that produce cutting-edge chips. The traditional semiconductor vendors (for CPUs, GPUs, or FPGAs), hyperscale companies, and a large number of startup companies are now competing to design the most effective chips for existing and emerging applications.

For this special issue of IEEE Micro, we selected seven talks that capture these trends and asked the authors to extend them into full articles.

MACHINE LEARNING AND INTELLIGENCE ACCELERATION

In “MLPerf: An Industry Standard Benchmark Suite for Machine Learning Performance,” Mattson et al. describe the challenges in developing effective benchmarks for ML training and inference workloads in a fast-moving field. The first two rounds of the MLPerf Training benchmark have already motivated significant improvements in the performance and scalability of popular ML software stacks.

In “Habana Labs Purpose-Built AI Inference and Training Processor Architectures: Scaling AI Training Systems Using Standard Ethernet With Gaudi Processor,” Medina and Dagan summarize the design of the Goya inference processor and the Gaudi training processor. They also describe how to build systems of various scales from these specialized ML chips using commodity networking interfaces.

In “Compute Solution for Tesla’s Full Self-Driving Computer,” Talpes et al. describe a custom-built chip for autonomous driving. The Tesla chip integrates fixed-function units and programmable cores to strike the right balance between efficiency and flexibility.

NEW GPU ARCHITECTURE

In “RTX On—The NVIDIA Turing GPU,” Burgess presents the new streaming multiprocessor and ML accelerator in the latest NVIDIA GPU chip. The Turing chip also features a ray tracing accelerator that supports real-time frame rates.

OPTICAL DIE-TO-DIE INTERCONNECTS

In “TeraPHY: A Chiplet Technology for Low-Power, High-Bandwidth In-Package Optical I/O,” Wade et al. from Ayar Labs and Intel summarize the technology for in-package optical interconnects for high-bandwidth, energy-efficient communication.

Space limitations prevent us from featuring more articles in this issue. Nevertheless, the presentations from Hot Chips 31 and all previous years are available at http://www.hotchips.org. We encourage you to explore this exciting archive, as well as contribute to and attend the Hot Chips 32 conference in August of this year.

Digital Object Identifier 10.1109/MM.2020.2977409
Date of current version 18 March 2020.
Theme Article: Hot Chips

MLPerf: An Industry Standard Benchmark Suite for Machine Learning Performance

Peter Mattson, Google Brain
Vijay Janapa Reddi, Harvard University
Christine Cheng, Intel
Cody Coleman, Stanford University
Greg Diamos, Landing AI
David Kanter, Real World Technologies
Paulius Micikevicius, NVIDIA
David Patterson, Google Brain and University of California, Berkeley
Guenther Schmuelling, Microsoft Azure AI Infrastructure
Hanlin Tang, Intel
Gu-Yeon Wei, Harvard University
Carole-Jean Wu, Facebook and Arizona State University

Abstract—In this article, we describe the design choices behind MLPerf, a machine learning performance benchmark that has become an industry standard. The first two rounds of the MLPerf Training benchmark helped drive improvements to software-stack performance and scalability, showing a 1.3× speedup in the top 16-chip results despite higher quality targets, and a 5.5× increase in system scale. The first round of MLPerf Inference received over 500 benchmark results from 14 different organizations, showing growing adoption.

MACHINE LEARNING (ML) is transforming multiple industries, leading to a surge in hardware and software development. Most modern ML systems are built atop deep neural networks, which are computationally demanding to train and deploy; their increasing use in industry is thus driving the rapid development of specialized hardware architectures and software frameworks.

Digital Object Identifier 10.1109/MM.2020.2974843
Date of publication 18 February 2020; date of current version 18 March 2020.
We need a performance benchmark to evaluate these competing ML systems. By providing clear metrics, benchmarking aligns research, engineering, and marketing, and competitors across the industry, in pursuit of the same objectives. For general-purpose computing, a consortium of chip vendors built the SPEC benchmark [1] in 1988, focusing the competition that drove computing performance for the next three decades. Earlier ML benchmarks include DeepBench [2], which focused on deep learning primitives; Fathom [3], which introduced a field-spanning set of ML benchmarks; and DAWNBench [4], which proposed time-to-convergence as a metric. See the work by Mattson et al. [5] and Reddi et al. [6] for more on related work.

MLPerf was founded in 2018 to combine the best of these prior efforts: a broad benchmark set with a time-to-convergence metric and the support of an academic/industry consortium. MLPerf contains two suites of ML benchmarks: one for training [5] and one for inference [6]. MLPerf has released two rounds of results for the training suite and one round for the inference suite. Comparing the two rounds of training data shows MLPerf is encouraging improvements in performance and scalability; comparing all three rounds shows growing adoption. The remainder of this article describes the design choices faced by anyone seeking to benchmark ML performance, and how MLPerf navigated those choices to become a nascent industry standard.

DESIGN CHOICES

Designing any benchmark suite requires answering three big questions.

- Benchmark definition: How to specify a measurable task?
- Metric definition: How to measure performance?
- Benchmark selection: What set of tasks to measure?

Designing an ML benchmark suite requires answering additional questions.

- Implementation equivalence: ML accelerator architectures vary, and there is no standard ML software stack. Submitters need to reimplement the benchmark for their hardware. How do we ensure that implementations are equivalent enough for fair comparison?
- Training hyperparameter equivalence: For training benchmarks, which hyperparameters are tunable?
- Training convergence variance: For training benchmarks, convergence times have relatively high variance. How do we make meaningful measurements?
- Inference weight equivalence: For inference benchmarks, are retrained or sparsified weights allowed?

This section describes how the MLPerf benchmark suites answer these questions.

MLPerf Training

Benchmark Definition. We specify an MLPerf Training benchmark [5] as training a model on a specific data set to reach a target quality. For example, one benchmark measures training on the ImageNet data set until the image classification top-1 accuracy reaches 75.9%. However, this basic definition does not answer one critical question: do we specify which model to train? Specifying the model enables apples-to-apples performance comparisons of software or hardware alternatives because it requires all alternatives to process the same workload. However, not specifying the model encourages model improvements and hardware–software codesign. We created two divisions of results: a Closed Division that requires using a specific model for direct comparisons, and an Open Division that allows the use of any model to support model innovation.

Metric Definition. There are two obvious metrics for training performance: throughput, the number of data processed per second, and time-to-train, the wall clock time it takes for the model to reach a target quality. These metrics could also be normalized by cost or power, which we
will address later on. Throughput has advantages. First, it is computationally inexpensive to measure because you do not train the model to completion. Second, it has a relatively low variance because the compute cost per datum is constant in most models. However, throughput can be increased at the cost of time-to-train by using optimizations like lower precision numerics or larger batch sizes.

We chose to use time-to-train because it accurately reflects the primary goal in choosing a training system: to fully train models as quickly as possible. Unfortunately, it is computationally expensive to fully train models. Furthermore, the number of epochs needed to train a model varies due to random weight initialization and stochastic floating-point ordering effects. However, we feel time-to-train is the least-bad alternative available.

Benchmark Selection. Once we chose how to specify a benchmark, we needed to select a set of benchmarks. We first divided ML applications into five broad areas.

- Vision: image classification, object detection, segmentation, medical imaging.
- Speech: speech to text, text to speech.
- Language: translation, natural language processing (NLP).
- Commercial: recommenders, time-series.
- Research: reinforcement learning (RL) for games or robotics, generative adversarial networks (GANs).

We then tried to select one or two specific benchmarks within each area. In doing so, we chose models based on four characteristics.

- Maturity: We sought models that were near state-of-the-art but also showed evidence of growing adoption.
- Variety: We chose models that included a range of constructs such as a convolutional neural network (CNN), a recurrent neural network (RNN), attention, and an embedding table.
- Complexity: We chose model sizes to reflect current and anticipated market demands.
- Practicality: We chose only benchmarks with available data sets and models.

Table 1 shows the benchmarks in MLPerf Training v0.5. We chose ResNet as having relatively high accuracy and wide adoption. We added SSD and Mask R-CNN to cover two different points in the important vision complexity space. We chose Transformer and GNMT for translation to increase variety by including attention and an RNN. We chose MiniGo for RL because it did not require an even more computationally expensive physics simulation. MLPerf Training presently omits medical imaging, speech-to-text, text-to-speech, NLP, time series, and GANs. Future versions of MLPerf will address these applications, starting with BERT in MLPerf Training v0.7.

Table 1. MLPerf Training v0.5 benchmarks.

| Area     | Problem             | Data set            | Model           |
|----------|---------------------|---------------------|-----------------|
| Vision   | Image recognition   | ImageNet [7]        | ResNet [7]      |
| Vision   | Object detection    | COCO [7]            | SSD [7]         |
| Vision   | Object segmentation | COCO [7]            | Mask R-CNN [7]  |
| Language | Translation         | WMT Eng.-German [7] | GNMT [7]        |
| Language | Translation         | WMT Eng.-German [7] | Transformer [7] |
| Commerce | Recommendation      | Movielens-20M [7]   | NCF [7]         |
| Research | RL                  | Go, 9×9 board       | MiniGo [7]      |

Implementation Equivalence. ML benchmarks cannot function like conventional benchmarks, in which fixed code is executed, because it is not currently possible to write portable, scalable, high-performance ML code. There is no single ML framework supported by all architectures. Furthermore, ML code needs to be tuned for the architecture and system scale in order to achieve high performance.

Instead, MLPerf allows submitters to reimplement the benchmarks. However, this flexibility raises the question of implementation equivalence. MLPerf requires that all submitters to the closed division use the same model to enable apples-to-apples hardware comparisons, but what does it mean to use the same model? MLPerf provides a functional but unoptimized reference implementation. MLPerf rules require performing the same set of mathematical operations as the reference implementation to produce each output.

Table 2. MLPerf Inference v0.5 benchmarks. (Columns: Area, Task, Data set, Model.)
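The time-to-train measurement described in the Metric Definition discussion above can be sketched in a few lines. This is an illustrative sketch, not MLPerf's actual harness (the real rules fix the model, data set, and quality target per benchmark); `train_one_epoch` and `evaluate` are hypothetical stand-ins for a real training loop and quality evaluation.

```python
import time

def time_to_train(train_one_epoch, evaluate, target_quality, max_epochs=100):
    """Wall-clock seconds until the model reaches the target quality.

    train_one_epoch and evaluate are hypothetical callables standing in
    for a real training loop; the clock includes evaluation, and a run
    that never converges is discarded rather than scored.
    """
    start = time.monotonic()
    for _ in range(max_epochs):
        train_one_epoch()
        if evaluate() >= target_quality:
            return time.monotonic() - start
    raise RuntimeError("did not converge within max_epochs")

# Because the number of epochs to converge varies with random weight
# initialization and floating-point ordering, a meaningful measurement
# aggregates several complete runs rather than trusting a single one.
def mean_time_to_train(runs, *args, **kwargs):
    times = [time_to_train(*args, **kwargs) for _ in range(runs)]
    return sum(times) / len(times)
```

Note how this captures the cost trade-off discussed above: unlike a throughput measurement, every scored run must train all the way to the quality target.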
MLPerf Inference

Metric Definition. The ideal metric for measuring the performance of an inference system varies with the use case. For instance, a mobile vision application needs low latency, while an offline photo application demands high throughput. For this reason, each MLPerf Inference benchmark scenario has a different metric, as shown in red in Figure 1.

1. Single stream: latency.
2. Multiple stream: number of streams subject to a latency bound.
3. Server: Poisson-arrival queries per second subject to a latency bound.
4. Offline: throughput.

Figure 1. MLPerf inference scenarios and metrics.

Benchmark Selection. The benchmark selection for MLPerf Inference was also driven by maturity, diversity, complexity, and practicality. However, we also needed to choose models with complexities suitable to a spectrum of hardware ranging from mobile devices to servers, and to support whatever models we chose in multiple scenarios. Thus, the initial version of MLPerf Inference focuses only on the most common vision tasks at different complexities. It complements the vision models with one moderate-size language model to increase model diversity. Future versions will expand this model selection and better align it with training.

Implementation Equivalence. MLPerf Inference gives submitters the ability to reimplement the models to handle software stack and hardware diversity, again raising the question of model equivalence. In addition to these two fundamental constraints, there is a short blacklist of forbidden optimizations, including incorporating additional weights or other information about the data set, caching results to take advantage of repeated inputs, or sparsely evaluating the weights.

Quantization, Retraining, and Sparsity. Inference systems can use quantized, retrained, or sparsified weights to increase computational efficiency at the cost of reduced accuracy, which may or may not match market requirements and can offer an unfair advantage in benchmarking. Different applications have different tolerances for accuracy loss, making it challenging to set an inference benchmark quality target. Further, retraining and sparsification techniques are a research area with significant proprietary technology, and allowing either provides an advantage to those with the best methods.

For the initial version of MLPerf Inference, we took a simple approach by not allowing retraining or sparsification. Most quality targets are set to 99% of the quality that can be achieved with 32-bit floating-point weights. These targets are somewhat lower than might be required for many applications, but we only allow post-training quantization. For future versions, we are investigating more flexible approaches to accuracy and allowing retraining and/or sparsification.

Presentation

Results or Single Summary Score. Given a set of benchmark results, there are still choices about how to present them in the most informative way. For instance, should results for one system on multiple benchmarks be combined to produce a single summary “score”? A score provides a consistent and easy way to communicate the bottom line. However, it assumes that systems are designed for general performance and that all benchmarks in the suite matter equally. We provide results instead of a single summary score because the range of ML use cases, from automotive vision to online recommendations, makes these assumptions incorrect: most users care about a subset of the benchmarks, and many hardware architectures are specialized.

Scale Information and Normalization. Another presentation question is: should the results be normalized or include scale information? Different systems have different scale factors such as price, power consumption, and chip count. If system A performs slightly better than system B but consumes twice as much power, which is the better design? In order to help consumers of benchmark results better utilize them, the results could be presented with one or more scale factors as additional information, or normalized by a specific scale factor.

MLPerf currently provides scale information in the form of chip counts, and future versions will include power measurements. MLPerf does not provide price because it is not a physical quantity and can vary over time and market. MLPerf does not normalize because the most important scale factor is different for different uses.

RESULTS

Benchmark suites aim to drive technological progress on faster hardware and software; we can measure this progress by comparing the best results on the benchmark suite over time. Comparing two rounds of MLPerf Training results shows progress in software-stack performance and scalability. The two rounds of results were collected approximately six months apart, and are driven by the same ML accelerators. Figure 2 compares five benchmarks that did not change significantly between the two rounds, though the target quality levels of three of the five benchmarks were increased, which increases the amount of work being timed [8]. The fastest 16-chip entries across the two rounds show an average 1.3× speedup despite the additional work required. Figure 3 shows an average 5.5× increase in system scale across the two rounds, as submitters were able to effectively utilize more chips [8]. Some of these improvements are benchmark-specific and some would have occurred without MLPerf, but many, based on first-hand observations of the authors, are generic and motivated by MLPerf. Over time, we expect similar improvements in hardware.

Figure 2. Speedup in the fastest 16-chip entry from MLPerf Training version v0.5 to v0.6, despite more timed work due to increased quality targets.

Figure 3. Increase in the number of chips used in the system that produced the fastest overall score from MLPerf Training version v0.5 to v0.6.

Since we only have a single round of MLPerf Inference results, we cannot yet assess whether it is driving performance improvements, but we can assess how well it handles diverse hardware as well as growth in adoption. Figure 4 shows that the systems submitted, ranging from embedded devices to cloud-scale data center solutions, cover a very wide range of hardware that spans four orders of magnitude in terms of performance [9]. The number of submissions and results in each round of MLPerf is increasing: training v0.5 had 3 submitters and 40+ results, training v0.6 had 5 submitters and 60+ results, and inference v0.5 had 14 submitters and 500+ results.

Figure 4. Normalized performance distribution in log scale from results in the closed division.
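The cross-round comparisons above aggregate per-benchmark gains into a single average speedup. One standard way to aggregate ratios (the article does not state its exact averaging method) is the geometric mean; the sketch below uses made-up times, not MLPerf data.

```python
import math

def geomean_speedup(old_times, new_times):
    """Geometric mean of per-benchmark speedups (old_time / new_time).

    The geometric mean is the conventional aggregate for ratios: unlike
    an arithmetic mean of speedups, swapping which system is treated as
    the baseline simply inverts the result.
    """
    ratios = [old / new for old, new in zip(old_times, new_times)]
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# Illustrative (hypothetical) v0.5 vs v0.6 time-to-train values in minutes
# for five benchmarks; these are NOT published MLPerf results.
v05 = [88.0, 14.0, 28.0, 11.0, 21.0]
v06 = [62.0, 11.0, 24.0, 8.0, 16.0]
print(f"average speedup: {geomean_speedup(v05, v06):.2f}x")
```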
CONCLUSION

ML is a developing field, and the MLPerf benchmark suites will need to evolve with the field; we have created an organization to enable that evolution. MLPerf inference and training are both driven by active working groups (WGs): a submitter’s WG that maintains the rules, a special topics WG that explores deep technical issues, and a results WG that handles submission review and results presentation. Other WGs focus on specific topics such as power measurement or new benchmarks. We are creating a legal entity to provide a long-term foundation for the effort.

We are developing a long-term benchmark roadmap. We aim to add new benchmarks to fully cover the five large ML areas we initially identified: vision, speech, language, commerce, and research. Over time, we will retire and replace benchmarks to keep pace with the field and to reduce the temptation to tune for benchmarks rather than real applications. We are recruiting a panel of academic and industry advisors for each area to ensure that MLPerf benchmarks are neutrally driven by research and industry needs.

Other future work includes the following.

- Creating a mobile application that can run select MLPerf inference benchmarks on smartphones.
- Improving reference implementations as starting points for development.
- Producing a “hyperparameter table” that maps system scale and precision to recommended hyperparameters for each benchmark.
- Developing better large public data sets for benchmarking and other purposes.
- Developing better software best practices for ML benchmarking and experimentation.

The MLPerf effort is now supported by more than 65 companies and researchers from eight educational institutions. We welcome the involvement of engineers and researchers who are interested in helping us make MLPerf better by going to mlperf.org/get-involved.

ACKNOWLEDGMENTS

MLPerf would not be possible without: B. Anderson, P. Bailis, V. Bittorf, M. Breughe, D. Brooks, M. Charlebois, D. Chen, W. Chou, R. Chukka, S. Davis, P. Deng, J. Duke, D. Dutta, D. Fick, J. S. Gardner, U. Gupta, K. Hazelwood, A. Hock, X. Huang, I. Hubara, S. Idgunji, T. B. Jablin, B. Jia, J. Jiao, D. Kang, P. Kanwar, N. Kumar, D. Lee, J. Liao, A. Lokhmotov, F. Massa, P. Meng, D. Narayanan, T. Oguntebi, C. Osborne, G. Pekhimenko, L. Pentecost, A. T. R. Rajan, T. Robie, D. Sequeira, A. Sirasao, T. St. John, F. Sun, H. Tang, M. Thomson, F. Wei, E. Wu, L. Xu, K. Yamada, C. Young, B. Yu, G. Yuan, M. Zaharia, P. Zhang, A. Zhong, Y. Zhou, and many others.

REFERENCES

1. K. M. Dixit, “The SPEC benchmarks,” Parallel Comput., vol. 17, no. 10/11, pp. 1195–1209, 1991.
2. Baidu, “DeepBench: Benchmarking deep learning operations on different hardware,” 2017. [Online]. Available: https://github.com/baidu-research/DeepBench
3. R. Adolf, S. Rama, B. Reagen, G.-Y. Wei, and D. Brooks, “Fathom: Reference workloads for modern deep learning methods,” in Proc. IEEE Int. Symp. Workload Characterization, 2016, pp. 1–10.
4. C. Coleman et al., “DAWNBench: An end-to-end deep learning benchmark and competition,” in Proc. 31st Conf. Neural Inf. Process. Syst., 2017.
5. P. Mattson et al., “MLPerf training benchmark,” 2019, arXiv:1910.01500.
6. V. Reddi et al., “MLPerf inference benchmark,” 2019, arXiv:1911.02549.
7. Full data set and model citations. [Online]. Available: https://mlperf.org/dataset-and-model-credits/
8. Complete results. [Online]. Available: https://mlperf.org/training-results-0-5/ and https://mlperf.org/training-results-0-6/; system details: https://github.com/mlperf/training_results_v0.5 and https://github.com/mlperf/training_results_v0.6
9. Complete results. [Online]. Available: https://mlperf.org/inference-results/; system details: https://github.com/mlperf/inference_results_v0.5
Hot Chips
Peter Mattson is currently a General Chair with MLPerf. He leads the ML Performance Metrics team at Google Brain. He received the Ph.D. degree from Stanford University, Stanford, CA, USA. Contact him at petermattson@google.com.

Vijay Janapa Reddi is currently an Inference Chair with MLPerf. He is an Associate Professor with Harvard University, Cambridge, MA, USA. He received the Ph.D. degree from Harvard University. Contact him at vj@ece.utexas.edu.

Christine Cheng is currently an Inference Chair with MLPerf. She is a Sr. Machine Learning Optimization Engineer with Intel. She received the M.S. degree from Stanford University, Stanford, CA, USA. Contact her at christine.cheng@intel.com.

Cody Coleman is currently a Research Chair with MLPerf. He is currently working toward the Ph.D. degree in computer science with Stanford University, Stanford, CA, USA. He received the M.Eng. degree from Massachusetts Institute of Technology, Cambridge, MA, USA. Contact him at cody@cs.stanford.edu.

Greg Diamos is currently a Data sets Chair with MLPerf. He leads the AI Transformations team at Landing AI. He received the Ph.D. degree from Georgia Tech, Atlanta, GA, USA. Contact him at gregory.diamos@gmail.com.

David Kanter is currently an Inference and Power Chair with MLPerf. He is with Real World Technologies. He received the B.S. degree in mathematics and the B.A. degree in economics from the University of Chicago, Chicago, IL, USA. Contact him at dkanter@gmail.com.

Paulius Micikevicius is currently a Distinguished Engineer with NVIDIA. He received the Ph.D. degree from the University of Central Florida, Orlando, FL, USA. Contact him at pauliusm@nvidia.com.

David Patterson is currently a Google Brain Distinguished Engineer. He is a U.C. Berkeley Professor. He is a Vice-Chair of the Board of Directors of the RISC-V Foundation. He received the Ph.D. degree from the University of California, Los Angeles, CA, USA. Contact him at pattrsn@cs.berkeley.edu.

Guenther Schmuelling is currently an Inference Results Chair with MLPerf. He is a Principal Tech Lead with the Microsoft Azure AI Infrastructure. Contact him at guschmue@microsoft.com.

Hanlin Tang is currently a Sr. Director of the AI Lab with Intel. He received the Ph.D. degree from Harvard University, Cambridge, MA, USA. Contact him at hanlin.tang@intel.com.

Gu-Yeon Wei is currently a Robert and Suzanne Case Professor of electrical engineering and computer science with Harvard University, Cambridge, MA, USA. He received the Ph.D. degree from Stanford University, Stanford, CA, USA. Contact him at gywei@g.harvard.edu.

Carole-Jean Wu is currently an Inference Chair with MLPerf. She is an Applied Research Scientist with Facebook. She is an Associate Professor at Arizona State University, Tempe, AZ, USA. She received the Ph.D. degree from Princeton University, Princeton, NJ, USA. Contact her at carolejeanwu@fb.com.
IEEE Micro
Theme Article: Hot Chips
Digital Object Identifier 10.1109/MM.2020.2975185
Date of publication 28 February 2020; date of current version 18 March 2020.
March/April 2020. Published by the IEEE Computer Society. 0272-1732 © 2020 IEEE

WHILE DATACENTER INFERENCE workloads have typically run on CPUs and training on GPUs, it is clear that purpose-built architectures optimized for AI workloads are needed to provide the required performance and features. These AI processors must be fully programmable, supported by robust software tools, and enable data centers to reduce total cost of ownership. In
addition, the ability to scale up and scale out to meet increasing computational needs, while fostering an ecosystem of suppliers with standard interfaces and communication protocols, is needed for AI purpose-built processors.

This article discusses the characteristics of AI training and inference processors and the inherent differences between their architecture requirements, describing the Goya and Gaudi high-level architectures and focusing on Gaudi's unique approach for scaling AI training in data centers.

TRAINING AND INFERENCE REQUIREMENT DIFFERENCES

Neural network inference and training execution share several functional building blocks; however, there are key attributes that drive material architecture differences. For training, the key metric is time to converge to an accuracy goal, which is closely correlated with the processor's throughput. For inference, throughput (at a given accuracy level) is the primary performance metric, although some inference applications are also sensitive to latency (dictated by end-user applications). Accuracy is critical for both training and inference workloads, but in the case of inference, it is acceptable for some end-applications to trade off accuracy for more throughput.5,6

Another inherent difference between training and inference is the required memory capacity. In inference, only the last layer of activations needs to be stored, as opposed to training, which requires calculating the gradients as part of the back-propagation flow. Thus, all activations for all layers typically must be stored, and the required internal memory capacity is significantly larger; this memory capacity requirement grows substantially with the depth of the network. On top of that, the wider data types typically deployed in training processors add further memory capacity requirements.

Very long training tasks performed on large data sets are often scaled up and scaled out to clusters of hundreds and even thousands of processors. However, to achieve acceleration, high processor utilization is required. Scaling AI workloads has been a focus of Habana Labs' training processor architecture from its initial stages and is considered fundamentally a networking challenge. Habana Labs' approach to scaling AI training is the focus of this article and is described in detail.

Despite the differences between inference and training, both processors must be fully programmable to handle popular frameworks and compilers. The processor should also be easily customizable to perform well on new workloads and allow customers to migrate from other processors while preserving their algorithmic codebase.

GOYA INFERENCE PROCESSOR HIGH-LEVEL ARCHITECTURE

As training and inference architectures share some of the same functional building blocks, the Goya architecture laid the groundwork for Gaudi. The Goya Inference Processor is based on the scalable architecture of Habana's Tensor Processing Core (TPC) and includes a cluster of eight programmable cores. The TPC is Habana's proprietary core designed to support deep learning workloads. It is a VLIW SIMD vector processor with an instruction set architecture and hardware tailored to serve deep learning workloads efficiently. The TPC is C/C++ programmable, providing the user with maximum flexibility to innovate, coupled with many workload-oriented features such as General Matrix Multiply (GEMM) operation acceleration, dedicated special-function hardware, tensor addressing, and latency-hiding capabilities. The TPC natively supports the following mixed-precision data types: FP32, INT32/16/8, and UINT32/16/8. To achieve maximum hardware efficiency, the Habana Labs SynapseAI quantizer tool selects the appropriate data type by balancing throughput and performance versus accuracy. For
predictability and low latency, Goya is based on software-managed, on-die memory along with programmable DMAs. For robustness, all memories are ECC-protected.

All Goya engines (TPCs, GEMM, and DMA) can operate concurrently and communicate via shared memory. For external interfaces, the processor uses PCIe Gen4 ×16, enabling communication to any host of choice, an FPGA, or peer-to-peer communication with another Goya. The processor includes two 64-b channels of DDR4 memory interface with a maximum capacity of 16 GB. The Goya architecture supports mixed precision of both integer and floating point, which allows it to flexibly support different workloads and applications, under quantization controls that the user can specify.

Goya Performance2

Although Goya is a cost-efficient processor built upon TSMC's mature 16-nm process, Goya leads in performance benchmarks for both throughput and latency compared with alternative inference GPUs designed using more advanced process nodes. This achievement is rooted in its purpose-built architecture. Below are performance examples from different types of deep learning applications.

Workload performance results for vision classification (ResNet-50/ImageNet data set) and natural language processing (BERT) are presented in Figures 1 and 2, respectively, and compared to the corresponding T4 GPU performance. As presented below, Goya's ResNet-50 performance consistently delivers higher throughput, as well as lower latency, across all batch sizes.

Figure 1. ResNet-50 throughput and latency.

Goya Configuration2:
Hardware: Goya HL-100 PCIe card; CPU: Xeon E5
Software: Ubuntu v16.04; SynapseAI v0.1.6
Workload implementation: Precision INT8

GPU Configuration4:
Hardware: T4; Host: Supermicro SYS-4029GP-TRT-T4
Software: TensorRT-5.1; Synthetic data set; Container 19.03-py3
Workload implementation: Precision INT8

To quantify and benchmark the BERT workload, Nvidia's demo release was used. The demo runs a question-answering task: given a question and a context, it determines the span of the answer within the context (tested date: August 2019).

Data set: SQuAD; Topology: BERT_BASE, Layers = 12, Hidden_Size = 768, Heads = 12, Intermediate_Size = 3072, Max_Seq_Len = 128.

As presented in Figure 2, Goya delivers higher throughput in sentences per second, as well as lower latency. In addition, while the T4 saturates for throughput, Goya can scale further with a batch of 24, delivering more than twice the throughput while running at half the latency.

Figure 2. BERT language model performance.

Goya Configuration:
Hardware: Goya HL-100; Xeon Gold 6152 @2.10 GHz
Software: Ubuntu v16.04.4; SynapseAI v0.2.0-1173

GPU Configuration:
Hardware: T4; CPU: Xeon Gold 6154 @3 GHz/16 GB/4 VMs
Software: Ubuntu 18.04.2 x86_64-gnu; CUDA Ver 10.1, cudnn7.5; TensorRT-5.1.5.0

GAUDI TRAINING PROCESSOR HIGH-LEVEL ARCHITECTURE

The Goya Inference architecture provided a good foundation for the Gaudi training processor with the following key defining goals.
data, performs the training locally, then communicates its gradients to the other processors; these updates are combined and then shared back with all participating processors, before the next iteration of training can begin with new data.
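The data-parallel exchange described above is the classic all-reduce pattern. The sketch below illustrates one such iteration in plain Python with gradients as lists of floats; it is a toy illustration of the pattern, not Habana's implementation, which performs the collective over the network (e.g., via HCL over RoCE).

```python
# Minimal sketch of one data-parallel training iteration: every worker
# computes a local gradient, the gradients are combined (here a simple
# element-wise average standing in for all-reduce), and the combined
# update is shared back to all workers before the next iteration.
def all_reduce_mean(local_grads):
    """Average the workers' gradients element-wise; every worker gets a copy."""
    n_workers = len(local_grads)
    length = len(local_grads[0])
    combined = [sum(g[i] for g in local_grads) / n_workers for i in range(length)]
    return [list(combined) for _ in range(n_workers)]

def training_step(weights, per_worker_grads, lr=0.1):
    """Apply the averaged gradient to the (replicated) weights."""
    reduced = all_reduce_mean(per_worker_grads)[0]   # identical on every worker
    return [w - lr * g for w, g in zip(weights, reduced)]

print(training_step([1.0, 1.0], [[0.2, 0.4], [0.6, 0.0]]))
# -> [0.96, 0.98] (up to floating-point rounding)
```

Real systems implement the combine-and-share step with ring or tree collectives so that bandwidth, not a central parameter server, sets the scaling limit.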
Figure 8. Fully connected, single-hop system.
Figure 8 shows such a large-scale system. A system with 128 Gaudi chips can be built in an analog manner, connecting a single 100 GbE port per Gaudi (thus 8 × 100 GbE per Gaudi system) to any of the five 128-port 100 GbE switches. Such a large-scale, single-hop system with full connectivity is only possible when integrating Ethernet directly in the deep learning accelerator.

SOFTWARE DEVELOPMENT TOOLS

Habana SynapseAI is a comprehensive software toolkit that simplifies the development and deployment of deep learning models for mass-market use. The SynapseAI tool suite enables users to execute algorithms efficiently using its high-level software abstraction. However, advanced users can perform further optimizations and add their own proprietary code using the provided software development tools. These tools include an LLVM-based C compiler, a simulator, a debugger, and profiling capabilities. The tools also facilitate the development of customized TPC kernels to augment the extensive kernel library provided by Habana.

The Goya software stack allows interfacing with popular frameworks and compilers, supporting models trained on any platform. The flow starts by deconstructing the trained topology and some example data into an internal representation. The quantization process then follows, allowing the user to improve throughput while maintaining negligible accuracy loss. The graph compiler then maps the execution to Goya's building blocks, using the available library kernels (provided by Habana and easily extendible by customers to develop tailor-made kernels). The execution recipe is then generated and scheduled by the runtime APIs.

The Gaudi software suite leverages many Goya aspects, with the addition of training-optimized TPC kernels and the following additional capabilities.

- Multistream execution of networking and compute. Streams can be synchronized with one another at high performance and with low runtime overhead.
- JIT compiler, capable of fusing and compiling multiple layers together, thereby
increasing utilization and exploiting hardware resources.
- Habana Communication Library (HCL): tailor-made to Gaudi's high-performance RDMA communication capabilities. HCL exposes all required primitives such as reduce, all-reduce, gather, broadcast, etc.

The Gaudi and Goya software stacks are both presented in Figure 9.

SUMMARY

The architectural requirements of inference and training share enough similarities that the Goya inference design provided a good foundation for the Gaudi training chip. However, the designs diverge when it comes to addressing the unique scaling requirements of AI training. Habana integrates RoCE RDMA in the processor chip itself and has enabled scaling AI like never before.

The on-chip RoCE integration provides Habana's customers the control they need over their systems, gaining unprecedented design flexibility to build any needed system while easily scaling the capacity from a single processor to hundreds and even thousands of processors. With GPUs that rely on proprietary system interfaces, systems hit a bandwidth wall while trying to scale beyond 16 GPUs. Using Gaudi with off-the-shelf Ethernet switches, model-parallel training can be performed across a much larger system and handle much bigger models. This means raising the bar on what AI training can do.

Finally, and most importantly for data centers, there is the need to avoid being locked in to proprietary interfaces. By avoiding processors that come with proprietary system interfaces, and instead insisting on standards-based scaling, Habana's customers can avoid locking themselves to any particular vendor. With standard Ethernet scaling, they can enjoy a competitive ecosystem and replace AI processor suppliers without reconstructing their infrastructure. With Gaudi, all these benefits can be realized.

REFERENCES

1. E. Medina, "Habana Labs approach to scaling AI training," in Proc. Hot Chips 31, CA, USA, 2019. [Online]. Available: http://www.hotchips.org/hc31/HC31_1.14_HabanaLabs.Eitan_Medina.v9.pdf
2. Goya Inference Platform Whitepaper, Habana Labs, 2019. [Online]. Available: https://habana.ai/wp-content/uploads/2019/06/Goya-Whitepaper-Inference-Performance.pdf
3. Gaudi Training Platform Whitepaper, Habana Labs, 2019. [Online]. Available: https://habana.ai/wp-content/uploads/2019/06/Habana-Gaudi-Training-Platform-whitepaper.pdf
4. NVIDIA Tesla Deep Learning Product Performance, 2019. [Online]. Available: https://developer.nvidia.com/deep-learning-performance-training-inference
5. MLPerf Training Benchmark, 2019. [Online]. Available: https://arxiv.org/pdf/1910.01500.pdf
6. DAWNBench, "An end-to-end deep learning benchmark and competition," 2018. [Online]. Available: https://cs.stanford.edu/deepakn/assets/papers/dawnbench-sosp17.pdf
7. C. J. Shallue and J. Lee, "Measuring the effects of data parallelism on neural network training," 2019. [Online]. Available: https://arxiv.org/pdf/1811.03600.pdf
Theme Article: Hot Chips
Abstract—Tesla's full self-driving (FSD) computer is the world's first purpose-built computer for the highly demanding workloads of autonomous driving. It is based on a new System on a Chip (SoC) that integrates industry-standard components such as CPUs, an ISP, and a GPU, together with our custom neural network accelerators. The FSD computer is capable of processing up to 2300 frames per second, a 21× improvement over Tesla's previous hardware at a lower cost, and, when fully utilized, enables a new level of safety and autonomy on the road.
Digital Object Identifier 10.1109/MM.2020.2975764
Date of publication 24 February 2020; date of current version 18 March 2020.

PLATFORM AND CHIP GOALS

THE PRIMARY GOAL of Tesla's full self-driving (FSD) computer is to provide a hardware platform for the current and future data processing demands associated with full self-driving. In addition, Tesla's FSD computer was designed to be retrofitted into any Tesla vehicle made since October 2016. This introduced major constraints on form factor and thermal envelope, in order to fit into older vehicles with limited cooling capabilities.

The heart of the FSD computer is the world's first purpose-built chip for autonomy. We provide hardware accelerators with 72 TOPs for neural network inference, with utilization exceeding 80% for the inception workloads with a batch size of 1. We also include a set of CPUs for control needs, an ISP, a GPU, and video encoders for various preprocessing and postprocessing needs. All of these are integrated tightly to meet a very aggressive TDP of sub-40 W per chip.

The system includes two instances of the FSD chip that boot independently and run independent operating systems. These two instances also allow independent power supplies and sensors that ensure an exceptional level of safety
Figure 2. (a) FSD chip die photo with major blocks. (b) SoC block diagram.
Figure 3. Inception network, convolution loop, and execution profile.
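The convolution loop that Figure 3 refers to can be written out as a naive sketch: a seven-deep nest over images, output channels, output rows and columns, input channels, and kernel rows and columns, with a multiply-accumulate (MAC) in the innermost body. Loop order, variable names, and the stride-1/no-padding simplification are illustrative, not Tesla's exact formulation.

```python
# Naive seven-deep convolution loop (stride 1, no padding) with a
# multiply-accumulate in the innermost body. Tensors are nested lists:
# act[image][in_ch][y][x], wt[out_ch][in_ch][ky][kx].
def conv2d(act, wt):
    images, in_chs = len(act), len(act[0])
    H, W = len(act[0][0]), len(act[0][0][0])
    out_chs, KH, KW = len(wt), len(wt[0][0]), len(wt[0][0][0])
    OH, OW = H - KH + 1, W - KW + 1
    out = [[[[0.0] * OW for _ in range(OH)] for _ in range(out_chs)]
           for _ in range(images)]
    for n in range(images):                    # 1: for each image
        for oc in range(out_chs):              # 2: for each output channel
            for oy in range(OH):               # 3: for each output row
                for ox in range(OW):           # 4: for each output column
                    acc = 0.0
                    for ic in range(in_chs):       # 5: for each input channel
                        for ky in range(KH):       # 6: for each kernel row
                            for kx in range(KW):   # 7: for each kernel column
                                # innermost body: one MAC operation
                                acc += act[n][ic][oy + ky][ox + kx] * wt[oc][ic][ky][kx]
                    out[n][oc][oy][ox] = acc
    return out
```

The outer loops (image, output channel, output pixels) are the parallelizable ones discussed in the text, while the three innermost loops form each dot product.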
decided to build a custom accelerator, since this provides the highest leverage to improve performance and power consumption over the previous generation. We used hard or soft IPs available in the technology node for the rest of the SoC blocks to reduce development schedule risk.

We used a mix of industry-standard tools and open-source tools such as Verilator for extensive simulation of our design. Verilator simulations were particularly well suited for very long tests (such as running entire neural networks), where they yielded up to a 50× speedup over commercial simulators. On the other hand, design compilation under Verilator is very slow, so we relied on commercial simulators for quick turnaround and debug during the RTL development phase. In addition to simulations, we extensively used hardware emulators to ensure a high degree of functional verification of the SoC.

For the accelerator's timing closure, we set a very aggressive target, about 25% higher than the final shipping frequency of 2 GHz. This allows the design to run well below Vmax, delivering the highest performance within our power budget, as measured after silicon characterization.

NEURAL NETWORK ACCELERATOR

Design Motivation

The custom NNA is used to detect a predefined set of objects, including, but not limited to, lane lines, pedestrians, and different kinds of vehicles, at a very high frame rate and with a modest power budget, as outlined in the platform goals.

Figure 3 shows a typical inception convolutional neural network.1,2 The network has many layers, with the connections indicating the flow of compute data, or activations. Each pass through this network involves an image coming in and various features, or activations, being constructed after every layer sequentially. An object is detected after the final layer.

As shown in Figure 3, more than 98% of all operations belong to convolutions. The algorithm for convolution consists of a seven-deep nested loop, also shown in Figure 3. The computation within the innermost loop is a multiply-accumulate (MAC) operation. Thus, the primary goal of our design is to perform a very large number of MAC operations as fast as possible, without blowing up the power budget.

Speeding up convolutions by orders of magnitude will cause less frequent operations, such as quantization or pooling, to become the bottleneck for the overall performance if their performance is substantially lower. These operations are therefore also optimized with dedicated hardware to improve the overall performance.

Convolution Refactorization and Dataflow

The convolution loop, with some refactoring, is shown in Figure 4(a). A closer examination reveals that this is an embarrassingly parallel problem with lots of opportunities to process the MAC operations in parallel. In the convolution loop, the execution of the MAC operations within the three innermost loops, which determine the length of each dot product, is largely sequential. However, the computation within the three outer loops, namely for each image, for each output channel, and for all the pixels within each output channel, is parallelizable. But it is still a hard problem due to the large memory bandwidth requirement and a significant increase in power consumption to support such a large parallel computation. So, for the rest of the paper, we will focus mostly on these two aspects.

The first thing to note is that working on multiple images in parallel is not feasible for us. We cannot wait for all the images to arrive to start the
compute for safety reasons, since it increases the latency of the object detection. We need to start processing images as soon as they arrive. Instead, we will parallelize the computation across multiple output channels and multiple output pixels within each output channel.

Figure 4(a) shows the refactored convolution loop, optimizing the data reuse to reduce power and improve the realized computational bandwidth. We merge the two dimensions of each output channel and flatten them into one dimension in the row-major form, as shown in step (2) of Figure 4(a). This provides many output pixels to work on, in parallel, without losing the required local contiguity of the input data.

We also swap the loop for iterating over the output channels with the loop for iterating over the pixels within each output channel, as shown in steps (2) and (3) of Figure 4(a). For a fixed group of output pixels, we first iterate on a subset of output channels before we move onto the next group of output pixels for the next pass. One such pass, combining a group of output pixels within a subset of output channels, can be performed as a parallel computation. We continue this process until we exhaust all pixels within the first subset of output channels. Once all pixels are exhausted, we move to the next subset of output channels and repeat the process. This enables us to maximize data sharing, as the computation for the same set of pixels within all output channels uses the same input data.

Figure 4(b)–(d) also illustrates the dataflow with the above refactoring for a convolution layer. The same output pixels of the successive output channels are computed by sharing the input activations, and successive output pixels within the same output channel are computed by sharing the input weights. This sharing of data and weights for the dot-product computation is instrumental in utilizing a large compute bandwidth while reducing the power by minimizing the number of loads to move data around.

Compute Scheme

The algorithm described in the last section with the refactored convolution lends itself to a compute scheme with the dataflow as shown in Figure 5. A scaled-down version of the physical 96 × 96 MAC array is shown in the middle for brevity of space, where each cell consists of a unit implementing a MAC operation with a single-cycle feedback loop. The rectangular grids on the top and left are virtual and indicate data flow. The top grid, called the data grid here, shows a scaled-down version of 96 data elements in each row, while the left grid, called the weight grid here, shows a scaled-down version of 96 weights in each column. The height and width of
groups of eight. This reduces the power consumed to shift the accumulator data significantly.

Another important feature in the MAC engine is the overlap of the MAC and SIMD operations. While accumulator values are being pushed down to the SIMD unit for postprocessing, the next pass of the convolution gets started immediately in the MAC array. This overlapped computation increases the overall utilization of the computational bandwidth, obviating dead cycles.
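The benefit of overlapping the MAC and SIMD phases can be illustrated with a toy cycle-count model. The numbers, and the assumption that a pass's MAC work takes at least as long as the previous pass's SIMD drain, are illustrative only, not Tesla's measured figures.

```python
# Toy cycle-count model of overlapping MAC and SIMD phases. Without
# overlap, each pass costs mac_cycles + simd_cycles. With overlap, the
# SIMD drain of pass k hides under the MAC work of pass k+1 (assuming
# mac_cycles >= simd_cycles), so only the final drain is exposed.
def total_cycles(passes, mac_cycles, simd_cycles, overlap):
    if not overlap:
        return passes * (mac_cycles + simd_cycles)
    return passes * mac_cycles + simd_cycles

print(total_cycles(10, 96, 96, overlap=False))  # -> 1920
print(total_cycles(10, 96, 96, overlap=True))   # -> 1056
```

In this toy case, overlapping cuts the total from 1920 to 1056 cycles, i.e., the array approaches one pass every 96 cycles instead of every 192.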
movement in and out of the SRAM (DMA-read and DMA-write), dot product (CONVOLUTION, DECONVOLUTION, INNER-PRODUCT), and pure SIMD (SCALE, ELTWISE).

Data movement instructions are 32 bytes long and encode the source and destination addresses, length, and dependency flags. Compute instructions are 256 bytes long and encode the input addresses for up to three tensors (input activations and weights, or two activation tensors, and output results), tensor shapes, and dependency flags. They also encode various parameters describing the nature of the computation (padding, strides, dilation, data type, etc.), processing order (row-first or column-first), optimization hints (input and output tensor padding, precomputed state machine fields), and fused operations (scale, bias, pooling). All compute instructions can be followed by a variable number of SIMD instructions that describe a SIMD program to be run on all dot-product outputs. As a result, the dot-product layers (CONVOLUTION, DECONVOLUTION) can be fused with simple operations (quantization, scale, ReLU) or more complex math functions such as Sigmoid, Tanh, etc.

NETWORK PROGRAMS

The accelerator can execute DMA and Compute instructions concurrently. Within each kind, the instructions are executed in order, but they can be reordered between the kinds for concurrency. The producer/consumer ordering is maintained using explicit dependency flags.

A typical program is shown in Figure 6. The program starts with several DMA-read operations, bringing data and weights into the accelerator's SRAM. The parser inserts them in a queue and stops at the first compute instruction. Once the data and weights for the pending compute instruction become available in the SRAM, their corresponding dependency flags get set and the compute instruction can start executing in parallel with other queued DMA operations.

Dependency flags are used to track both data availability and buffer use. The DMA-in operation at step 6 overwrites one of the buffers sourced by the preceding convolution (step 5), as shown in Figure 6. Thus, it must not start executing before its destination flag (F0) gets cleared at the end of the convolution. However, using a different destination buffer and flag would allow the DMA-in operation to execute in parallel with the preceding convolution.

Our compiler takes high-level network representations in Caffe format and converts them to a sequence of instructions similar to the one in Figure 6. It analyzes the compute graph and orders it according to the dataflow, fusing or partitioning layers to match the hardware capabilities. It allocates SRAM space for intermediate results and weight tensors and manages execution order through dependency flags.
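The dependency-flag discipline can be sketched as a toy in-order interpreter. The instruction fields and flag names (F0, F1, ...) below are hypothetical and greatly simplified; in particular, this model runs everything sequentially rather than modeling the concurrent DMA and compute queues.

```python
# Toy model of dependency flags: a compute instruction waits until the
# flags set by its producer DMAs are raised (data availability), and an
# instruction that reuses a buffer waits until the consumer has cleared
# that buffer's flag (buffer use). Fields: (name, wait_set, wait_clear,
# sets, clears). Illustrative only, not the actual NNA ISA.
def run_program(program):
    flags = {}
    trace = []
    for name, wait_set, wait_clear, sets, clears in program:
        assert all(flags.get(f, False) for f in wait_set), f"{name}: data not ready"
        assert all(not flags.get(f, False) for f in wait_clear), f"{name}: buffer in use"
        trace.append(name)
        for f in sets:
            flags[f] = True
        for f in clears:
            flags[f] = False
    return trace

program = [
    ("load_weights",  [],           [],     ["F0"], []),      # F0: weights in SRAM
    ("load_acts",     [],           [],     ["F1"], []),      # F1: activations in SRAM
    ("conv1",         ["F0", "F1"], [],     ["F2"], ["F0"]),  # clears F0: buffer now free
    ("load_weights2", [],           ["F0"], ["F0"], []),      # overwrite only after F0 cleared
    ("conv2",         ["F0", "F2"], [],     [],     []),
]
print(run_program(program))
# -> ['load_weights', 'load_acts', 'conv1', 'load_weights2', 'conv2']
```

As in the Figure 6 discussion, giving `load_weights2` a different destination buffer and flag would remove its wait-for-clear dependency and let it run concurrently with `conv1`.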
Figure 7. NNA Microarchitecture.
Since most common SIMD programs can be represented by a single instruction, called FusedReLu (fused quantization, scale, ReLU), the instruction format allows fusing any arithmetic operation with shift and output operations. The FusedReLu instruction is fully pipelined, allowing the full 96 × 96 dot-product engine to be unloaded in 96 cycles. More complex postprocessing sequences require additional instructions, increasing the unloading time of the dot-product engine. Some complex sequences are built out of FP32 instructions and conditional execution. The 30-bit accumulator value is converted to an FP32 operand at the beginning of such SIMD programs, and the FP32 result is converted back to the 8-bit integer output at the end of the SIMD program.

Pooling Support

After postprocessing in the SIMD unit, the output data can also be conditionally routed through a pooling unit. This allows the most frequent small-kernel pooling operations (2 × 2 and 3 × 3) to execute in the shadow of the SIMD execution, in parallel with the earlier layer producing the data. The pooling hardware implements aligners to align the output pixels, which were rearranged to optimize convolution, back to the original format. The pooling unit has three 96-byte × 96-byte pooling arrays with byte-level control. The less frequent larger-kernel pooling operations execute as convolution layers in the dot-product engine.

Memory Organization

The NNA uses a 32-MB local SRAM to store weights and activations. To achieve high bandwidth and high density at the same time, the SRAM is implemented using numerous relatively slow, single-ported banks. Multiple such banks can be accessed every cycle, but to maintain the high cell density, a bank cannot be accessed in consecutive cycles.

Every cycle, the SRAM can provide up to 384 bytes of data through two independent read ports, 256-byte and 128-byte wide. An arbiter prioritizes requests from multiple sources (weights, activations, program instructions, DMA-out, etc.) and orders them through the two ports. Requests coming from the same source cannot be reordered, but requests coming from different sources can be prioritized to minimize the bank conflicts.

During inference, weight tensors are always static and can be laid out in the SRAM to ensure an efficient read pattern. For activations, this is not always possible, so the accelerator stores recently read data in a 1-kB cache. This helps to minimize SRAM bank conflicts by eliminating back-to-back reads of the same data. To reduce bank conflicts further, the accelerator can pad input and/or output data using different patterns hinted by the network program.

Control Logic

As shown in Figure 7, the control logic is split between several distinct state machines: the Command Sequencer, the Network Sequencer, the Address and Data Sequencers, and the SIMD Unit.

Each NNA can queue up multiple network programs and execute them in order. The Command Sequencer maintains a queue of such programs and their corresponding status registers. Once a network runs to completion, the accelerator triggers an interrupt in the host system. Software running on one of the CPUs can examine the completion status and re-enable the network to process a new input frame.

The Network Sequencer interprets the program instructions. As described earlier, instructions are long data packets which encode enough information to initialize an execution state machine. The Network Sequencer decodes this information and steers it to the appropriate consumer, enforces dependencies, and synchronizes the machine to avoid potential race conditions between producer and consumer layers.

Once a compute instruction has been decoded and steered to its execution state machine, the Address Sequencer generates a stream of SRAM addresses and commands for the computation downstream. It partitions the output space into sections of up to 96 × 96 elements and, for each such section, it sequences through all the terms of the corresponding dot product.

Weight packets are preordered in the SRAM to match the execution, so the state machine simply streams them in groups of 96 consecutive bytes. Activations, however, do not always come from consecutive addresses, and they often must
IEEE Micro
be gathered from up to 96 distinct SRAM locations. In such cases, the Address Sequencer must generate multiple load addresses for each packet. To simplify the implementation and allow a high clock frequency, the 96-element packet is partitioned into 12 slices of 8 elements each. Each slice is serviced by a single load operation, so the maximum distance between its first and last element must be smaller than 256 bytes. Consequently, a packet of 96 activations can be formed by issuing between 1 and 12 independent load operations.

Together with control information, load data is forwarded to the Data Sequencer. Weights are captured in a prefetch buffer and issued to execution as needed. Activations are stored in the Data Cache, from where 96 elements are gathered and sent to the MAC array. Commands to the datapath are also funneled from the Data Sequencer, controlling execution enable, accumulator shift, SIMD program start, store addresses, etc.

The SIMD processor executes the same program for each group of 96 accumulator results unloaded from the MAC array. It is synchronized by control information generated within the Address Sequencer, and it can decode, issue, and execute a stream of SIMD arithmetic instructions. While the SIMD unit has its own register file and controls the data movement in the datapath, it does not control the destination address where the result is stored. Store addresses and any pooling controls are generated by the Address Sequencer when it selects the 96 × 96 output slice to be worked on.

ARCHITECTURAL DECISIONS AND RESULTS

When implementing very wide machines like our MAC array and SIMD processor, the primary concerns are always tied to the operating clock frequency. A high clock frequency makes it easier to achieve the target performance, but it typically requires some logic simplifications, which in turn hurt the utilization of specific algorithms.

We decided to optimize this design for deep convolutional neural networks with a large number of input and output channels. The 192 bytes of data and weights that the SRAM provides to the MAC array every cycle can be fully utilized only for layers with a stride of 1 or 2; layers with higher strides tend to have poorer utilization.

The accelerator's utilization can vary significantly depending on the size and shape of the MAC array, as shown in Figure 8. Both the inception-v4 and the Tesla Vision networks show significant sensitivity to the height of the MAC array. While processing more output channels at the same time can hurt overall utilization, adding that capability is relatively cheap since they all share the same input data. Increasing the width of the array does not hurt utilization as much, but it requires significantly more hardware resources. At our chosen design point (96 × 96 MAC array), the average utilization for these networks is just above 80%.

Figure 8. Achieved utilization versus MAC array dimension.

Another tradeoff we had to evaluate is the SRAM size. Neural networks are growing in size, so adding as much SRAM as possible could be a way to future-proof the design. However, a significantly larger SRAM would grow the pipeline depth and the overall area of the chip, increasing both power consumption and the total cost of the system. On the other hand, a convolutional layer too large to fit in SRAM can always be broken into multiple smaller components, potentially paying some penalty for spilling and filling data to the DRAM. We chose 32 MB of SRAM per accelerator based on the needs of our current networks and on our medium-term scaling projections.
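The slicing scheme above can be sketched as a toy model: each 8-element slice must fit within one 256-byte load window, and consecutive slices that fall in the same window can share a load, which is how a 96-element gather costs anywhere from 1 to 12 loads. The function below only illustrates the stated constraints; the load-sharing rule and all names are assumptions, not the actual Address Sequencer logic.

```python
SLICES = 12
SLICE_ELEMS = 8
LOAD_WINDOW = 256  # bytes one load can cover

def loads_for_packet(addrs):
    """Count loads needed to gather one 96-activation packet.

    addrs holds 96 byte addresses, one per element, in packet order.
    Each 8-element slice must span fewer than 256 bytes; consecutive
    slices landing in the current load's window reuse that load.
    """
    assert len(addrs) == SLICES * SLICE_ELEMS
    loads, base = 0, None
    for s in range(SLICES):
        sl = addrs[s * SLICE_ELEMS:(s + 1) * SLICE_ELEMS]
        lo, hi = min(sl), max(sl)
        assert hi - lo < LOAD_WINDOW, "slice spans more than one load window"
        if base is not None and base <= lo and hi - base < LOAD_WINDOW:
            continue  # the current load already covers this slice
        loads += 1
        base = lo
    return loads

# Fully consecutive activations: a single load covers all 12 slices.
consecutive = list(range(96))
# Widely strided activations: every slice needs its own load.
strided = [s * 1024 + e for s in range(SLICES) for e in range(SLICE_ELEMS)]
```

With `consecutive` the function returns 1 and with `strided` it returns 12, matching the 1-to-12 range quoted in the text.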
CONCLUSION

Tesla's FSD Computer provides an exceptional 21× performance uplift over commercially available solutions used in our previous hardware while reducing cost, all at a modest 25% extra power. This level of performance was achieved by the uncompromising adherence to the design principle we started with. At every step, we maximized the utilization of the available compute bandwidth with a high degree of data reuse and a minimalistic design for the control flow. This FSD Computer will be the foundation for advancing the FSD feature set.
The key learning from this work has been the tradeoff between efficiency and flexibility. A custom solution with fixed-function hardware offers the highest efficiency, while a fully programmable solution is more flexible but significantly less efficient. We finally settled on a solution with configurable fixed-function hardware that executes the most common functions very efficiently but added a programmable SIMD unit, which executes less common functions at a lower efficiency. Our knowledge of the Tesla workloads deployed for inference allowed us to make such a tradeoff with a high level of confidence.

Emil Talpes is a Principal Engineer with Tesla, Palo Alto, CA, USA, where he is responsible for the architecture and micro-architecture of inference and training hardware. Previously, he was a principal member of the technical staff at AMD, working on the microarchitecture of x86 and ARM CPUs. He received the Ph.D. degree in computer engineering from Carnegie Mellon University, Pittsburgh, PA, USA. Contact him at etalpes@tesla.com.

Debjit Das Sarma is a Principal Autopilot Hardware Architect with Tesla, Palo Alto, CA, USA, where he is responsible for the architecture and micro-architecture of inference and training hardware. Prior to Tesla, he was a Fellow and Chief Architect of several generations of x86 and ARM processors at AMD. His research interests include computer architecture and arithmetic with focus on deep learning solutions. He received the Ph.D. degree in computer science and engineering from Southern Methodist University, Dallas, TX, USA. Contact him at ddassarma@tesla.com.
Ganesh Venkataramanan is a Senior Director Hardware with Tesla, Palo Alto, CA, USA, and responsible for Silicon and Systems. Before forming the Silicon team with Tesla, he led AMD's CPU group that was responsible for many generations of x86 and ARM cores. His contributions include industry's first x86-64 chip, first Dual-Core x86, and all the way to the Zen core. He received the Master's degree from IIT Delhi, Delhi, India, in the field of integrated electronics and Bachelor's degree from Bombay University, Mumbai, Maharashtra. Contact him at gvenkataramanan@tesla.com.

Peter Bannon is a VP of hardware engineering with Tesla, Palo Alto, CA, USA. He leads the team that created the Full Self Driving computer that is used in all Tesla vehicles. Prior to Tesla, he was the Lead Architect on the first 32b ARM CPU used in the iPhone 5 and built the team that created the 64b ARM processor in the iPhone 5s. He has been designing computing systems for over 30 years at Apple, Intel, PA Semi, and Digital Equipment Corp. Contact him at pbannon@tesla.com.

Bill McGee is a Principal Engineer leading a machine learning compiler team, mainly focused on distributed model training on custom hardware. He received the BSSEE degree in microelectronic engineering from Rochester Institute of Technology, Rochester, NY, USA. Contact him at bill@mcgeeclan.org.

Benjamin Floering is a Senior Staff Hardware Engineer with Tesla, Palo Alto, CA, USA, whose research interests include low power design as well as high-availability and fault tolerant computing. He is also a member of IEEE. He received the BSEE degree from Case Western Reserve University, Cleveland, OH, USA, and the MSEE degree from University of Illinois at Urbana-Champaign, Champaign, IL, USA. Contact him at floering@ieee.org.

Ankit Jalote is a Senior Staff Autopilot Hardware Engineer. He is interested in the field of computer architecture and the hardware/software relationship in machine learning applications. He received the Master's degree in electrical and computer engineering from Purdue University, West Lafayette, IN, USA. Contact him at ajalote@tesla.com.

Christopher Hsiong is a Staff Autopilot Hardware Engineer with Tesla, Palo Alto, CA, USA. His research interests include computer architecture, machine learning, and deep learning architecture. He received the Graduate degree from the University of Michigan Ann Arbor, Ann Arbor, MI, USA. Contact him at chsiong@tesla.com.

Sahil Arora is a member of Technical Staff with Tesla, Palo Alto, CA, USA. His research interests are machine learning, microprocessor architecture, microarchitecture design, and FPGA design. He received the Master's degree in electrical engineering from Cornell University, Ithaca, NY, USA, in 2008. Contact him at saarora@tesla.com.

Atchyuth Gorti is a Senior Staff Autopilot Hardware Engineer with Tesla, Palo Alto, CA, USA. His research interests include testability, reliability, and safety. He received the Master's degree from the Indian Institute of Technology, Bombay, in reliability engineering. Contact him at agorti@tesla.com.

Gagandeep S Sachdev is a Staff Hardware Engineer with Tesla, Palo Alto, CA, USA. He has worked as a Design Engineer with AMD and ARM. His research interests include computer architecture, neural networks, heterogeneous computing, performance analysis and optimization, and simulation methodology. He received the Master's degree from University of Utah, Salt Lake City, UT, USA, in computer engineering, with research topic of compiler-based cache management in many core systems. Contact him at gsachdev@tesla.com.
Theme Article: Hot Chips
Abstract—NVIDIA’s latest processor family, the Turing GPU, was designed to realize
a vision for next-generation graphics combining rasterization, ray tracing, and deep
learning. It includes fundamental advancements in several key areas: streaming
multiprocessor efficiency, a Tensor Core for accelerated AI inferencing, and an RTCore for
accelerated ray tracing. With these innovations, Turing unlocks both real-time ray-tracing
performance and deep-learning inference in consumer, professional, and datacenter
solutions.
0272-1732 © 2020 IEEE. Published by the IEEE Computer Society.
Figure 1. Turing GPU SM, comprising four subcores and a memory interface (MIO). Math throughput,
memory bandwidth, L1 data cache topology, register file and cache capacity, and a new uniform datapath
were all designed or modified to increase processor efficiency over the previous generation.
reduction in hit latency. Because the L1 data cache and shared memory RAM are unified, the tagged capacity can be configured based on the workloads running on the GPU.

The SM accesses an L2 cache across a shared crossbar, and on Turing, the L2 capacity doubled to 6 MB.

Concurrent Execution of Floating-Point and Integer Math

Inside each of the four subcores of the Turing SM (Figure 1, right), we doubled register file capacity, redesigned the branch unit, and added fast general-purpose FP16 math.

Saturating the FP32 datapath requires only half the issue bandwidth, allowing another datapath to execute concurrently. Figure 2 shows the performance impact of concurrent execution of floating-point and integer instructions across several gaming workloads. On average, 36 integer instructions co-execute with every 100 floating-point instructions. Typical integer operations include address computation and floating-point compares.

Uniform Datapath and Uniform Register File

The Turing SM exploits redundant computation and data across multiple threads in a warp, while preserving the single instruction multiple thread (SIMT) with independent thread scheduling model introduced in Volta.3

In a traditional SIMD/vector architecture (Figure 3, top), control flow resides on a scalar thread, and the developer and compiler promote certain operations to the SIMD/vector lanes to run in parallel. A vector execution mask specifies which lanes ignore the vector operation, if needed. In our solution (Figure 3, bottom), control flow resides in each SIMT thread. On Turing, when warp-uniform data are detected, the compiler with hardware assist promotes operations to an independent uniform datapath, essentially a "reverse vectorization." This promotion is valid even if one or more of the threads in the warp are currently diverged within their own control flow.

This optimization is invisible to the programmer, but the following simplified machine code snippet demonstrating bindless constant memory access illustrates the mechanism:

  ULDC.64  UR20, [UR6 + 0x18], !UP7
  UIADD    UR6, UR3, UR10
  FMUL     R15, R4, cx[UR20][0x64]
  ULOP.AND UR2, UR3, 0xfffff

The U-prefixed, or uniform, instructions operate on U-prefixed registers, and a subsequent
times the Pascal ray-tracing performance, mean-
ing that real-time ray tracing is finally available.
Figure 6 is a screenshot from the interactive
demo Attack from Outer Space, built with Epic
Games’ Unreal Engine 4 and a 2019 DXR Spotlight
award winner.5 The image demonstrates multi-
ple advanced rendering techniques including
soft shadows underneath the cars on the street,
and glossy reflections in the robot’s chest plate
and the puddles on the ground. These are calcu-
lated dynamically in real time in a destructible,
animated environment, and made possible by
accelerated ray tracing.
Beyond using rays for shadows or reflections,
one can employ even more advanced techniques
to generate ever more realistic images, e.g.,
path-traced global illumination, in which rays
represent virtual photons in a physical simula-
tion. Figure 7 is a simplified diagram represent-
ing the path-tracing process.
A ray, represented as an origin and a direc-
tion, is used to test what geometry is visible
from a point in the scene.
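Concretely, a ray is just an origin plus a direction, with points along it given by origin + t·direction; a visibility query returns the smallest nonnegative t at which the ray meets some geometry. A minimal sketch against a single sphere follows — the function and scene are illustrative only, not any real API:

```python
import math

def ray_sphere_t(origin, direction, center, radius):
    """Smallest t >= 0 where origin + t*direction hits the sphere,
    or None if the ray misses it entirely."""
    ox, oy, oz = (origin[i] - center[i] for i in range(3))
    dx, dy, dz = direction
    # Substitute the ray equation into |p - center|^2 = r^2, solve for t.
    a = dx * dx + dy * dy + dz * dz
    b = 2.0 * (ox * dx + oy * dy + oz * dz)
    c = ox * ox + oy * oy + oz * oz - radius * radius
    disc = b * b - 4.0 * a * c
    if disc < 0:
        return None  # no real roots: the ray misses
    t = (-b - math.sqrt(disc)) / (2.0 * a)  # nearer root first
    if t < 0:
        t = (-b + math.sqrt(disc)) / (2.0 * a)
    return t if t >= 0 else None
```

A ray shot from the origin along +z toward a unit sphere centered at (0, 0, 5) hits at t = 4; aimed well off-axis, the function returns None.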
Figure 4. Representative shader workloads extracted from various games and graphics benchmarks show
up to two times performance benefit from the Turing SM.
Figure 5. Accelerated deep-learning inference with Turing-based Tesla T4 datacenter solution, with speeds
up to 36 times faster than CPU-based servers. T4 can flexibly apply chained deep-learning tasks like speech
recognition followed by natural language processing to enable higher level solutions like intelligent agents.
These rays are generated recursively (labeled 5, 6, and 7). Chains of rays representing reflections, refractions, and light source queries create paths between the camera and the light sources to determine the color of light transported to each pixel in the image. This technique is commonly used to generate CGI movies; however, due in part to the inherent random sampling, the brute-force computation typically takes many CPU hours to generate a single frame. Recent GPUs reduce the cost to minutes or seconds, but real-time performance has been unachievable.

The fundamental building blocks of the algorithm are as follows:

sampling (What direction to shoot a ray?);
traversal and intersection (What did the ray hit?);
material evaluation (How does the light scatter at that hit point?).
Figure 6. Attack from Outer Space interactive ray-tracing demo by Christian Hecht. Soft shadows, glossy
reflections, and indirect illumination combine to produce realistic lighting in the UE4 game engine.
Figure 7. Path-traced global illumination, simplified. Primary/camera rays, reflection/refraction rays, and
shadow rays combine recursively to simulate the color of light scattered from the scene to each pixel in the
rendered image.
Sampling and material evaluation are not yet suitable for fixed-function acceleration. Techniques for these differ across renderers, requiring significant flexibility. Traversal and intersection, however, are the most expensive components of global illumination, have essentially annealed to a common algorithm in the industry, and are ripe for further acceleration.

In pre-RTX GPU ray tracing (Figure 8, top), traversal and intersection were done on the SM core. To avoid testing every triangle in a scene against each ray query, a bounding volume hierarchy, or a tree of axis-aligned bounding boxes, is created over the triangles in the scene.

When a ray probe is launched, this tree is traversed, and each successive child box is tested against the ray, culling the distant geometry efficiently. At the leaves of the tree, the actual triangles making up the surfaces in the scene are tested against the ray. This tree traversal is fundamentally pointer-chasing through memory interleaved with complex, precision-sensitive intersection tests, taking thousands of instruction slots and potentially tens of thousands of cycles of latency per ray. Finally, when the appropriate hit point is located for the ray, the material at that point is evaluated.

On Turing RTX, the new RTCore replaces that software emulation (Figure 8, bottom), performing the tree traversal and the ray/box and ray/triangle intersection tests. A ray query is sent from the SM to the RTCore; the RTCore fetches and decodes memory representing part of the bounding volume tree, and uses dedicated evaluators to test each ray against the box or, at the leaves of the tree, the triangles that make up the scene. It does this repeatedly, optionally keeping track of the closest intersection found. When the appropriate intersection point is determined, the result is returned to the SM for further processing.

The RTCore is faster and more efficient than software emulation and frees up the SM to do other work, including programmable sampling or material shading in parallel. We carefully interfaced the RTCore and the SM for both
Figure 8. Ray tracing before (above) and after (below) Turing RTX. The RTCore performs ray/scene traversal
and intersection tasks, freeing the SM to do concurrent work such as material evaluation and denoising.
performance and flexibility, enabling a developer to optionally create custom intersection programs that run on the SM for non-triangle geometry, such as spheres, that traditional rasterization cannot easily handle.

Figure 9 shows the performance of one frame of Quake II RTX, a game recently remastered to use path tracing. Each plot is a timeline of the frame on a different configuration. At the top, on Pascal, path tracing is possible, but five frames per second (fps) is much too slow to be playable. In the middle, on Turing without the RTCores enabled, the more efficient design is already twice as fast at 10 fps. The purple and gray on this
Figure 9. Path-tracing performance on one frame of Quake II RTX. Turing efficiency improvements yield a
two-times speedup over Pascal, while Turing RTCores provide a further speedup for a total seven times faster
frame rate and real-time performance.
timeline show the floating-point and integer datapaths executing in parallel. Enabling the RTCores, their work shown in green in the bottom plot, yields a further leap to real-time speeds at 34 fps and an overall seven times speedup.

In the rendered image of this frame (Figure 9, inset), reflections, refractions, soft shadows, indirect lighting, and more are evident, bringing a dramatically updated look to a fun game from the past.

Developers are adding ray-traced effects to upcoming games at a rapid pace as well. Every graphics API and major engine has added support. Real-time ray tracing has finally arrived.
Figure 10. Professional rendering before (left) and after (right) AI-based denoising, using a Quadro RTX
workstation. The SM and RTCores are used to trace a few (noisy) paths per pixel, and the Tensor Core
accelerated AI denoiser estimates the completed image, together providing a speedup of several orders of
magnitude versus a CPU server.
Theme Article: Hot Chips
Abstract—The “Zen 2” processor is designed to meet the needs of diverse markets spanning
server, desktop, mobile, and workstation. The core delivers significant performance and
energy-efficiency improvements over “Zen” by microarchitectural changes including a new
TAGE branch predictor, a double-size op cache, and a double-width floating-point unit.
Building upon the core design, a modular chiplet approach provides flexibility and
scalability up to 64 cores per socket with a total of 256 MB of L3 cache.
THE ZEN 2 processor provides a single core design that is leveraged by multiple solutions, with focused goals for improving upon the predecessor Zen processor. The primary targets for the core were advancements in instructions per cycle (IPC), energy efficiency, and security. Building on the new core, the solutions for server and client aimed to promote design reuse across markets, increase core count, and improve IO capability. Achieving these goals required innovations in microarchitecture, process technology, chiplets, and on-package interconnect.1

CPU CORE

The Zen 2 CPU core, shown in Figure 1, has two primary areas of improvement over its predecessor, Zen. First, energy efficiency is doubled from a combination of technology and microarchitecture improvements. Second, IPC is increased by approximately 15% from the microarchitectural changes in the in-order front-end, integer execute, floating-point/vector execute, load/store, and cache hierarchy.

ENERGY EFFICIENCY

The Zen 2 core microarchitecture originally targeted lower energy per clock cycle than the prior generation, independent of process technology improvements. This by itself was an aggressive goal, because increased IPC means more activity each cycle, resulting in added switching energy. Neutral energy per cycle would require additional design optimization, including improvements in branch predictor accuracy,
March/April 2020. Published by the IEEE Computer Society. 0272-1732 © 2020 IEEE.
† Testing conducted by AMD Performance Labs as of 7/12/2019 with 2nd Generation Ryzen and 3rd Generation Ryzen engineering samples using estimated SPECint_base2006 results. PC manufacturers may vary configurations yielding different results. SPEC and SPECint are registered trademarks of the Standard Performance Evaluation Corporation. See www.spec.org

INTEGER EXECUTE

The Zen 2 core features a distributed execution engine, with separate schedulers, registers,
and execution units for integer and floating-point/vector operations. The integer engine operates on general-purpose registers and generates addresses for loads and stores. The floating-point/vector engine operates on vector registers. The Zen 2 integer engine focused on increasing issue width to provide more throughput and growing the out-of-order window size to expose more program parallelism.

The Zen core had the foundation for two loads and one store per cycle, but with just two address generation units (AGUs), Zen was not able to sustain this throughput in the steady state. The Zen 2 core adds a third AGU, unlocking this throughput potential and providing a more balanced processor.

A major component of window size is the scheduler queue size. Like its predecessor, Zen 2 has four fully distributed arithmetic-logic unit (ALU) scheduler queues, one per ALU. Zen 2 increases the size of each queue from 14 entries to 16 entries. The AGU scheduler remains the same size, but it is now upgraded from two separate, distributed 14-entry queues, each feeding an AGU, to a single 28-entry queue feeding all three AGUs. The unified AGU queue has more effective capacity due to removing the potential for queue imbalance. The unified queue is also better able to prioritize picking of the oldest ready ops, resulting in reduced mis-speculation from out-of-order loads.

Other window size components are increased, including growing the physical register file (PRF) from 168 to 180 entries and the re-order buffer (ROB) from 192 to 224 entries. These both work to allow more ops in the window, exposing more program parallelism.

Finally, the execution engine improves simultaneous multithreading fairness. It is possible for one thread with inherently low parallelism (for example, a pointer chase through main memory) to consume many of the ALU or AGU scheduler resources without benefit. The second thread may have high inherent parallelism but be unable to realize its performance potential due to insufficient scheduler resources. New fairness hardware detects this condition and slows the rate at which the low-parallelism thread can allocate into the scheduler. This gives the high-parallelism thread an opportunity to approach its potential without significant performance impact on the low-parallelism thread.

FLOATING-POINT/VECTOR EXECUTE

The Zen 2 floating-point/vector engine has doubled the data path width from 128 bits (Zen) to 256 bits. Both cores support AVX-256 instructions, but Zen double-pumps operations using its 128-bit data paths, whereas Zen 2 supports native operation with its 256-bit data paths. The vector PRF width is also doubled to 256 bits. Registers can now be renamed on a 256-bit granularity instead of a 128-bit granularity. The effective capacity of the vector PRF is therefore doubled for AVX-256 code, even though the number of vector PRF entries remains the same at 160.

A significant consideration with physically doubling the data path is the potential for switching activity spikes that could cause electrical design current (EDC) specifications to be exceeded. A simplistic approach to mitigating this issue would be to immediately throttle frequency and reduce voltage when AVX-256 instructions are detected. However, this would unnecessarily penalize programs that make occasional use of AVX-256 instructions. To optimize performance, Zen 2 builds an intelligent EDC manager, which monitors activity over multiple clock cycles and throttles execution only when necessary.

LOAD/STORE AND L1D/L2 CACHES

The load/store unit and level 1 data (L1D) cache provide more throughput and larger structures. An important component of overall window size, the store queue size was increased from 44 to 48 entries. The L2 data translation lookaside buffer was increased from 1536 to 2048 entries, now supporting 1-GB pages installed as splintered 2-MB pages.

The 32-kB 8-way L1D cache maximum throughput was increased in Zen 2 due to two factors. First, read and write bandwidth are doubled through an increase in width from 128 to 256 bits, matching the vector data path width. Second, the third AGU provides 50% more sustained load/store operations. Combined, these net Zen 2 three times the load+store bandwidth.
The Zen 2 L2 remains 512 kB, and it is 8-way set-associative with 12-cycle load-to-use latency. Zen 2 has new prefetch throttling capability that can reduce the aggressiveness of data prefetching when memory bandwidth utilization is high and prefetching is not being effective. This is particularly important to performance for high core-count, constrained-memory-bandwidth processors such as those used in server or high-end desktop.

CORE COMPLEX AND L3 CACHE

A core complex (CCX) is composed of four Zen 2 cores and a shared level-3 (L3) cache. The L3 cache has four slices connected with a highly tuned fabric/network. Each L3 slice consists of an L3 controller, which reads and writes the L3 cache macro, and a cluster core interface that communicates with a core. The four slices of L3 are accessible by any core within the CCX. The distributed L3 cache control provides the design with improved control granularity. Each slice of L3 contains 4 MB of data for a total of 16 MB of L3 per CCX. The L3 cache is 16-way set-associative and is populated from L2 cache victims. The L3 is protected by DECTED ECC for reliability. A CPU Core Die (CCD) chiplet is composed of two CCXs, for a total of eight cores, 16 threads, and 32 MB of L3 cache, as shown in Figure 2.

Chiplet Strategy: Challenges and Solutions

Delivering Zen 2 to market across multiple platforms in a short period of time was a key design challenge. While the leading-edge 7-nm technology is a key element of "Zen 2" performance and power efficiency, mitigating the cost was another challenge. The technology shrink factor did not apply equally across all circuits. Specifically, scaling some analog circuitry, for example that used in system IOs, did not benefit enough compared to the technology cost increase. This led to the adoption of a chiplet strategy. This strategy defines SoCs using a hybrid process technology, allowing each chiplet to be manufactured in its optimal technology node. The SoCs built using the hybrid process technology married one or more of the second-generation area-optimized CCD chiplets in the advanced node with an IO-die chiplet in a mature node. This resulted in construction of cost-effective, high-performance SoCs as well as offering configurable solutions to broaden the product portfolio.

ON-PACKAGE DIE-TO-DIE INTERCONNECT

An important requirement to making the chiplet strategy viable was an optimized on-package die-to-die interconnect. The interconnect was required to support various product configurations while meeting power, bandwidth, and latency metrics. A new on-package Infinity Fabric (IFOP) link was designed to meet these requirements and allow efficient communication between the core chiplet and the IO chiplet. The IFOP link was optimized for a short channel reach and is responsible for carrying both data and control fabric communication. Each IO-die chiplet to CCD chiplet connection is made using an independent point-to-point instance of the IFOP link.
Figure 3. "Matisse" SoC.
Figure 4. "Rome" SoC.
Figure 5. “Matisse” application and 1080p gaming performance compared to Ryzen 7 2700X.
Table 2. "Rome" application performance compared to a 2P Intel Xeon Platinum 8280 powered server.

Application             % Improvement
ESI VPS-NEON4M          Up to 58%
Altair Radioss 13.3.1   Up to 72%
STAR-CCM+ 13.06.012     Up to 95%
Mahesh Subramony is a principal member of Technical Staff with Advanced Micro Devices Inc. (AMD), and was the SoC architect for the third-generation Ryzen "Matisse" desktop processors. Since 2003, he has been with AMD in architecture and design on SoCs spanning mobile, desktop, and servers. He received the M.S. degree in computer engineering from the University of Minnesota, Twin Cities, and the B.Tech. degree in electrical engineering from the College of Engineering, Thiruvananthapuram. Contact him at mahesh.subramony@amd.com.

David Suggs is a Fellow with Advanced Micro Devices Inc. (AMD), Santa Clara, CA, USA, where he was the chief architect for the Zen 2 CPU core. Previously, he was the architect of the op cache and the instruction decoder for the Zen core. He has been working in architecture and design since 1993 on projects spanning CPU cores, north bridges, south bridges, DSPs, voice telephony, and PC sound cards. He received the M.S.E.E. degree from the University of Texas at Austin and the MBA degree from St. Edward's University. He is the corresponding author of this article. Contact him at david.suggs@amd.com.

Dan Bouvier is the Corporate VP and Client Products Chief Architect for Advanced Micro Devices Inc. (AMD), Ryzen products. He has defined the past five generations of AMD notebook and desktop processors. During his 32-year career, he has focused on high-performance processors, SoCs, and systems. Prior to joining AMD in 2009, he was a processor CTO for AMCC and before that the director of Advanced Processor Architecture for PowerPC processors at Freescale/Motorola. He received the B.S. degree in electrical engineering from Arizona State University. Contact him at dan.bouvier@amd.com.
Theme Article: Hot Chips
NEOVERSE N1 IS the new platform of Arm IPs that enables partners to develop systems with competitive performance and world-leading power efficiency across a wide range of markets. On one end of the spectrum, the Neoverse N1 platform is well suited for high-performance systems with up to 128 cores organized on an 8×8 mesh. At the same time, customers targeting deployments with strict power and area constraints can rely on Neoverse N1 to create high-efficiency and high-performance systems composed of a dozen or fewer general-purpose cores.

Digital Object Identifier 10.1109/MM.2020.2972222
Date of publication 7 February 2020; date of current version 18 March 2020.
0272-1732 © 2020 IEEE. Published by the IEEE Computer Society.
The Neoverse N1 core implements the v8.2-A A32, T32, and A64 Arm instruction sets and includes many infrastructure-focused improvements, such as security features, virtualization host extensions, large system extensions, and RAS extensions.

Our projections and silicon measurements show that the Neoverse N1 core performs at least 1.6× better for most workloads than Arm's previous design deployed in infrastructure––the Arm Cortex-A72––with some cloud-native workloads performing up to 2.5× faster. This important speedup was achieved without compromising our best-in-class power efficiency. Additionally, scalability significantly improved thanks to a completely redesigned coherent mesh, cache hierarchy, and system IP backplane. As a result, Arm's silicon partners using Neoverse N1 in their designs have numerous opportunities to organize and optimize these components to satisfy their general-purpose compute needs and can take advantage of the many connectivity options for tightly coupling accelerators through technologies such as AMBA, CCIX, and PCIe.

ARCHITECTURE
The Neoverse N1 cores implement many of the recent extensions to the base Armv8-A architecture that were introduced to improve the performance, scalability, robustness, and security of highly virtualized server and network infrastructure workloads on many-core processors.1

These architecture extensions include support for dedicated instructions to accelerate ML inference workloads through IEEE half-precision floating-point (FP16) and int8 dot product instructions. Neoverse N1 also includes the new CRC32 instructions to accelerate storage applications. Additionally, particular focus has been placed on finer handling of Arm's relaxed memory ordering through the LDAPR instruction (load with ordering semantics similar to load-acquire, store-release), the introduction of limited ordering regions, support for atomic instructions, and support for persistent memory.

We also included several features to significantly harden Neoverse N1 against known security vulnerabilities:

Privileged access-never (PAN): protects the OS kernel from being "spoofed" into reading or writing user code or data on behalf of malicious programs.

Unprivileged Access Override (UAO): allows the OS kernel to more efficiently manage user code sections that are marked as execute-only for protection.

Stage 2 execute-never: allows the hypervisor to prevent an OS kernel and/or application from executing pages containing writable data, to prevent some exploits.

Side-channel protection: introduces a range of new speculation controls, speculation barrier, and prediction restriction instructions that allow software to mitigate microarchitectural side-channel attacks on speculative execution across different execution contexts.

Virtualization is the backbone of much of the modern IT infrastructure, and Neoverse N1 implements enhancements to extend virtualization support and reduce its overhead:

Hardware update of access/dirty bits: automatically updates status bits in page table entries, avoiding a trap to the OS or hypervisor.

The VMID extension to 16 bits: increases the maximum number of simultaneously active virtual machines supported by the address translation system to 65536.

Virtual Host Extension (VHE): more efficient support for Type 2 (hosted) hypervisors, such as KVM, building on the Type 1 (native) hypervisor support already introduced in base Armv8-A.

Finally, we extended our performance monitoring infrastructure to support enhanced PMU events and statistical profiling.

NEOVERSE N1 CORE
The Neoverse N1 core is designed to achieve high performance while maintaining the performance-power-area (PPA) advantage point established with Cortex-A72. To achieve this goal, the team designed the microarchitecture from scratch and focused on features to enhance infrastructure-focused many-core CPU performance.

Neoverse N1 supports an aggressive out-of-order superscalar pipeline and implements a 4-wide front-end with the capability of dispatching/committing up to eight instructions per cycle. The core deploys three ALUs, a branch execution unit, two Advanced SIMD units, and two load/store execution units. The minimum misprediction penalty is 11 cycles, and we introduced many optimizations in order to preserve a short pipeline without losing power efficiency.

The next sections describe the details of the Neoverse N1 core microarchitecture. The first two sections describe the core front-end and back-end. The following sections detail the interaction of the core with the memory subsystem, security features, and features added to target the infrastructure market. This section concludes with a few figures of merit about the core implementation.

Core Front-End
The Neoverse N1 core can fetch up to four instructions per cycle to feed its high-performance back-end. One of the biggest improvements over Cortex-A72 is decoupled branch prediction, which realizes branch-predictor-directed prefetch, where the branch predictor can run ahead even if the front-end pipeline is waiting for instruction cache (I-cache) miss refill responses. Even if the I-cache pipeline is stalled on an instruction fetch miss, speculative fetch addresses provided by the branch predictor can continue to access the I-cache and resolve misses through early prefetches.

The branch predictor employs a large 6K-entry main branch target buffer (BTB) with 3-cycle access latency to retrieve branches' target addresses without accessing the I-cache. Such a sizeable BTB helps maintain target history for a large number of branches, which benefits cloud and server workloads with large instruction footprints. The predictor also employs a 64-entry microBTB and a 16-entry nano-BTB to minimize bubbles in the front-end. Neoverse N1 also significantly improves both latency and accuracy of the indirect branch prediction algorithm. The branch direction predictor is also optimized to target behaviors observed on many server workloads: once a prediction is made, the predicted address is stored into a 12-entry fetch queue which tracks future fetch transactions.

Once the branch predictor creates a next fetch address, the address is fed into a fully associative 48-entry instruction TLB and a 4-way set-associative 64-kB I-cache to read out the instruction opcode. The I-cache can deliver up to 16 B of instructions per cycle.

Since the branch predictor has higher bandwidth than the I-cache, unless the pipeline was recently flushed due to a branch misprediction, the fetch queue typically holds a few pending transactions. To mitigate branch misprediction penalty, I-cache reads are overlapped with I-cache tag matching. After the fetch queue reaches a threshold in the number of fetch transactions, the I-cache read operation is serialized to maximize efficiency. The Neoverse N1 core can support up to eight outstanding I-cache refill requests to the higher cache hierarchy.

The stream of fetched instructions is then forwarded to a 4-wide decoder, where an instruction may be cracked into multiple simpler internal macrooperations. Each decode lane can decode one Arm instruction per cycle, and the most frequently used instructions (e.g., simple ALU, branch, and load/store) are decoded as a single macrooperation. To simplify and speed up the decode process, the I-cache can store partially decoded instructions.

Core Back-End
Decoded instructions are renamed before being dispatched to the out-of-order engine. The renaming unit can receive up to four macroops per cycle. Each macroop can be cracked into up to two microoperations during the renaming process. Therefore, up to eight microoperations can
be dispatched into the out-of-order engine each cycle. Additionally, the rename unit can automatically eliminate simple register-to-register data movement instructions through its rename tables.

Once the microops are dispatched, instruction status is tracked in the commit and the issue queues. The commit queue can track up to 128 microoperations. The commit unit tracks a dispatched instruction until all prior instructions are committed, and up to eight microops can be committed per cycle.

The issue queue tracks the availability of source operands required to execute corresponding microoperations. When all its source operands are available, an instruction is picked and issued to the correct execution pipeline. Neoverse N1 supports a distributed issue queue with more than 100 microoperations to increase the overall out-of-order window size. When the issue queue is empty, dispatched instructions can bypass the queue to minimize latency.

Neoverse N1 employs multiple pipelines for each type of instruction: four integer execution pipelines, two load/store pipelines, and two Advanced SIMD pipelines. As needed, each pipeline can forward its results to the others.

Memory Subsystem
The memory architecture for Neoverse N1 is designed to enable larger, faster, and more scalable caches than its predecessors. The 64-kB 4-way set-associative L1 data cache (D-cache) has a 4-cycle load-to-use latency and a bandwidth of 32 B/cycle. The core-private 8-way set-associative L2 cache is up to 1 MB in size and has a load-to-use latency of 11 cycles. The Neoverse N1 core can also be configured with smaller L2 cache sizes of 256 and 512 kB with a load-to-use latency of nine cycles. The L2 cache connects to the system via an AMBA 5 CHI interface with 16-B data channels. The Neoverse N1 core can directly interface to the mesh interconnect, enabling minimum latency to the system-level cache and DRAM. Alternatively, multiple cores can be configured in a cluster of cores containing a snoop filter and an optional L3 cluster cache. The cluster cache can be up to 2 MB, with a load-to-use latency ranging between 28 and 33 cycles, depending on the configuration. A Neoverse N1 SoC can support up to 256 MB of shared system-level cache. In the event none of these caches are effective at filtering a requested memory address, the Neoverse N1 core employs a "cache-miss" predictor which bypasses the whole cache hierarchy and all snoop filters to issue a "Prefetch Target" request to compatible memory controllers, reducing the incurred miss latency.

Neoverse N1 employs a next-generation data prefetcher, which is similar to the one deployed in the Cortex-A76 core, but with key improvements for large-scale systems. This updated prefetcher achieves high coverage and accuracy on a variety of access patterns ranging from simple streams and strides to sophisticated spatial patterns. Such a prefetcher coordinates requests to multiple levels of cache and across virtual memory pages, preloading both TLBs and caches. Finally, multiple cache replacement policies were designed and tuned to work in coordination with these prefetchers, resulting in our first prefetch-aware replacement policy.

Security Features
During the development of the Neoverse N1 core, side-channel attacks exploiting speculative execution2,3 were reported, and several architectural and microarchitectural mitigations were introduced to address these security vulnerabilities.

Neoverse N1 implements some of the Armv8.5 architecture features, such as the speculative store bypass safe (SSBS) control bit, and the SSBB and PSSBB (Speculative Store Bypass Barrier) instructions. These newly introduced barriers allow software to actively protect against Spectre Variant 4 exploits by preventing load instructions from returning data written to a matching virtual or physical memory location by speculatively executed store instructions prior to the barrier.4

Spectre Variant 2 attempts to exploit the branch predictor by injecting branch targets that cause the victim process to speculate through a specific code path. To address this threat, we designed a hardware mechanism to prevent consumption of malicious target injections.4 Malicious software cannot inject branch target information to control speculative behavior of the victim process since the branch predictor in the Neoverse N1 core prevents a process from using the predicted branch trained by a different process.
that a 64-core reference system can achieve 190 SPECint2017 rate (estimated). For such a system, the total SoC power is projected to be 105 W.

NEOVERSE N1 COHERENT MESH INTERCONNECT (CMN-600)
The CMN-600 product family is Arm's second-generation, highly configurable, mesh-based coherent interconnect based on the CHI cache coherent protocol specification. CHI is a packet-based, point-to-point, topology-agnostic, layered architecture protocol. The coherent interconnect is a vital component to enable many-core systems to scale without compromising latency and available memory bandwidth. A mesh topology was chosen to address those challenges and is designed to support one clock cycle delay per hop. CMN-600 can scale from a 1×2 mesh to an 8×8 mesh and is designed to operate at up to 2-GHz clock frequency. Customers can configure mesh size, topology, and bisection BW to match the architecture that best fits their PPA targets.

CMN-600 includes a distributed set of fully coherent home nodes which are software-configurable hash address-interleaved. Software configurability allows customers to support different hash interleaving granularity, with the minimum interleave being 64 B. Such configurability enables traffic distribution and traffic isolation based on the characteristics of the targeted applications and allows affinity-based system cache group (SCG) allocation, which helps with traffic localization in bigger systems.

Each HN-F slice includes a snoop filter and a system level cache (SLC) with enhanced replacement policies. System architects that adopt the Neoverse N1 platform can choose the number of HN-F slices to deploy based on the system cache capacity needed and system bandwidth requirement, and total SLC capacity can range from 0 to 256 MB.

The SLC is a victim cache for core clusters with adaptive cache allocation based on data-sharing detection. An SLC also acts as a DRAM cache for IO requestors (PCIe, DMA, etc.) with a smart allocation policy. In addition, the SLC supports software-programmable source-based cache capacity control that mitigates the "noisy-neighbor" shared system cache thrashing problem.

Some key performance enhancement features of CMN-600 include the following.

Support for Arm and PCIe architecture atomic transactions at the home nodes. This allows atomic transactions to be issued by the cores to the home nodes, where they are resolved. The capability to execute far atomic operations improves performance for contended variable updates.

Prefetch hints can be issued from a Neoverse N1 core directly to the memory controllers in order to minimize DRAM latency on cache misses. Other features to reduce data latency include: direct memory transfer from memory controllers to requesting cores and direct cache transfer from peer cores to the requesting core. In aggregate, we estimate these features to reduce data latency on the interconnect by up to 37%.

Cache stashing, which allows an IO peripheral such as a PCIe endpoint to place incoming data on various levels of the cache hierarchy (SLC, L3, and L2) to enable quicker access to this data. When SLC stashing is enabled, silicon measurements show improvements of up to 33% packets/second on a single core and up to 60% on multicore tests for DPDK L3 forwarding tests. Further improvements are expected for applications that can stash IO data in the cores' private L2 or in the core cluster L3.

CMN-600 is designed to support high-throughput IO traffic from various requesters such as DMA, PCIe, etc., and can achieve full PCIe Gen4 upstream and downstream bandwidth. PCIe or DMA writes can be stashed in the SLC or directly into the core caches. Direct stashing to CPU caches allows improved performance and avoids SLC pollution.

CMN-600 provides at-speed self-hosted debug and trace capabilities with distributed debug monitors within the interconnect. Our interconnect supports programmable transaction tagging and tracing, which can be used for statistical profiling and end-to-end latency breakdown analysis.

Multichip Support Using CCIX
CMN-600 supports the CCIX protocol (Cache Coherent Interconnect for Accelerators) to coherently connect hardware accelerators such as GPUs, smart NICs, smart storage, FPGAs, DSPs, etc. to a CMN-600 based host node. By extending the benefits of full cache coherency to these hardware accelerators, Neoverse N1 enables true peer processing with shared memory, which also eliminates the need for software to initialize transfers of data between devices. CMN-600 also supports CCIX independent memory expansion where the CCIX link is used to communicate with memory on a remote chip.

CMN-600 leverages the same CCIX connection to enable symmetrical multiprocessing (SMP) across multiple chips to enable homogeneous computing. In order to enable this link for SMP use cases, the CCIX link can be augmented with special features to communicate Arm ISA-specific information that is not required for the host-accelerator use case.

Figure 2 shows a scalable system where multiple CMN-600s form a host node for homogeneous computing while connecting to hardware accelerators for heterogeneous compute use cases.

Figure 2. CMN-600 based scale-up server node with two compute dies and acceleration.

Figure 3 graphically shows the configuration ranges available to Arm partners thanks to the CMN-600. This architecture is designed to scale from small and efficient edge systems all the way to high-performance cloud deployments.

IO Memory Management and Interrupt Handling
Arm's latest System MMU, MMU-600, supports stage-1, stage-2, and nested address translations with address space mapping and security mechanisms to prevent unauthorized accesses. In a typical system, these two translation stages are managed by the guest operating system and the hypervisor, respectively.

MMU-600 supports PCIe Address Translation Services to allow PCIe-based IO devices or accelerators (masters) to prefetch translations well in advance and place them in device-managed address translation caches, hence avoiding the translation overhead in the MMU. Support for the PCIe page request interface further enhances system performance by enabling devices to use unpinned pages and virtual memory.

The Neoverse N1 platform supports PCIe root complexes with single-root IO virtualization function, which allows virtualized PCIe functions to be integrated into a system to provide IO virtualization. In a PCIe root complex, each virtual function (VF) and physical function (PF) pair mapping is assigned a unique PCI Express requester ID that is mapped to a unique StreamID in the system to match the Arm architecture requirements.
MMU-600 maps virtual addresses to physical addresses using the StreamID pairs. With support for up to 2^24 StreamIDs, MMU-600 allows simultaneous mapping of millions of PCIe VFs. In a virtualized environment, the VF is assigned to a virtual processing element (VPE) and the system traffic flows directly between VF and VPE. As a result, the IO overhead in the software emulation layer is diminished, significantly reducing the overhead of a virtualized environment compared to a nonvirtualized one.

With total bandwidth support of up to 64 GB/s per IO interface, MMU-600 is architected to support throughput requirements for next-generation PCIe Gen5.

GIC-600 is a GICv3 architectural-specification-compliant interrupt controller with enhanced support for large numbers of cores and multiple-chip configurations. GIC-600 structurally consists of interrupt translation service (ITS) blocks, a distributor, and redistributors. The ITS block translates PCIe message signaled interrupts (MSI/MSI-X) to Arm locality-specific peripheral interrupts (LPIs). The distributor manages interrupt routing and directs interrupts to the appropriate core cluster that services the interrupts. In a virtualized environment core interrupts are virtualized, and the incoming physical interrupts are mapped by a hypervisor to a VM. The inter-socket or inter-chiplet messages are ported to the native communication transport protocol supported by the system.

N1 SOFTWARE DEVELOPMENT PLATFORM
In order to create a proof point for our technology, we taped out a test chip based on Neoverse N1 IPs called the N1 software development platform (N1 SDP). This system consists of four Neoverse N1 cores, configured as two pairs of two-core clusters. Each core has 64-kB private L1 I/D caches and a 1-MB private L2 cache. Each cluster connects its two cores through a DynamIQ Shared Unit (DSU), which is configured with a 1-MB shared L3 cluster cache. The system is configured with two DDR4-3200 memory controllers and two PCIe Gen4 root complexes, one of which supports CCIX for attachment of cache-coherent IO devices or to support multichip configurations. A 4×2 CMN-600 coherent mesh network connects all the high-performance on-chip components.

N1 SDP provides early N1 silicon samples and serves as a software development and evaluation environment for customers and partners.

REAL-WORLD PERFORMANCE
We evaluated the performance of N1 systems extensively both pre-silicon as well as in silicon implementations such as N1 SDP. Our projections and silicon measurements show that N1 systems match or exceed the performance of currently available cloud instances in many relevant workloads. Single-core performance was improved over Cortex-A72 by 65% and 100% on average for integer and floating-point workloads, respectively. System-level performance uplifts are much higher thanks to the multipliers offered by the unprecedented scalability of our CMN-600 mesh interconnect.

Beyond targeting general performance improvements, we spent significant effort optimizing the system for common behaviors observed in server and networking workloads. For example, a class of workloads we focused on is high-throughput HTTP servers, such as NGINX. NGINX is a highly concurrent, high-performance application that can be used as a web server, reverse proxy, and API gateway. Neoverse N1 performance uplift for this class of workloads is directly related to the following.

1. Memory latency and bandwidth: up to 2× increase in memcpy bandwidth vs. Cortex-A72.
2. Context switch: up to 2.5× faster than Cortex-A72.
3. Core front-end: significant reduction in branch mispredicts (7×) and cache misses (2×) vs. Cortex-A72.

These stressors are very common with throughput applications such as MemcacheD and HHVM. Overall, Neoverse N1 can reach 2.5× higher throughput on an NGINX static web server versus a similarly configured Cortex-A72 based system.

Another class of applications we focused our attention on are runtime frameworks such as Java Virtual Machines and .Net Frameworks. These runtime environments are the foundation of much of the applications running in the cloud
and are a natural target for our design. At a high level, on Neoverse N1 we focused on a few relevant stressors for these workloads.

1. Object management: up to 2.4× more memory allocations and 1.6× faster in copying characters.
2. [...]: branch mispredictions were reduced by 1.4× and 2.25× versus Cortex-A72, respectively.
3. Synchronization: locking throughput and latency improved by 2× thanks to the Large System Extensions Arm atomic instructions.

We expect to see higher performance gains as Neoverse N1 systems become more broadly available for software optimizations and application tuning.5 At the time of writing, Arm partners report that initial evaluations of real-world workloads on systems deploying Neoverse N1 show up to 40% better performance compared to similarly configured systems currently on the market.

CONCLUSION
The Neoverse N1 platform provides Arm's partners with the high-performance IPs necessary to architect a general compute solution for addressing the infrastructure market. These building blocks offer the versatility, performance, features, and power-area efficiency to succeed in the infrastructure market. We anticipate high-core-count designs based on Neoverse N1 to be deployed in public cloud as an alternative architecture for main compute nodes, enabling lower total cost of ownership for data center operators and edge installations of cloud compute while delivering greater design diversity. We fully expect Neoverse N1 to also find a home in more advanced network, storage, and security appliances as well as in edge compute installations deployed by network operators with design points starting at eight cores.

ACKNOWLEDGMENTS
We thank Mike Filippo, Ann Chin, and the many Arm engineers in Austin, TX, and worldwide who contributed to Neoverse N1 definition, architecture, design, physical implementation, and software optimizations.

REFERENCES
1. Arm Architecture Reference Manual Armv8, for Armv8-A architecture profile.
2. M. Lipp et al., "Meltdown: Reading kernel memory from user space," in Proc. 27th USENIX Conf. Secur. Symp., 2018.
3. P. Kocher et al., "Spectre attacks: Exploiting speculative execution," in Proc. IEEE Symp. Secur. Privacy, San Francisco, CA, USA, 2019, pp. 1–19.
4. Arm, "Vulnerability of speculative processors to cache timing side-channel mechanism." [Online]. Available: https://developer.arm.com/support/arm-security-updates/speculative-processor-vulnerability
5. Arm Neoverse N1 Software Optimization Guide. [Online]. Available: https://developer.arm.com/docs/swog309707/a

Andrea Pellegrini is a senior principal engineer with Arm, Austin, TX, USA. He leads the performance and workloads team for servers and networking. He received the Ph.D. degree from the University of Michigan, Ann Arbor, and the B.E. and M.E. degrees in computer engineering from the Università di Bologna, Italy. He is the corresponding author of this article. Contact him at Andrea.Pellegrini@arm.com.

Nigel Stephens joined Arm, Austin, TX, USA, in 2008 to contribute to the development of the Armv8-A architecture, with a focus on the new AArch64 instruction set architecture and its related software ABIs. He went on to become the lead architect with overall responsibility for Arm's A-profile instruction sets. Recent projects have included leading the design of the Scalable Vector Extension (SVE) for HPC, and its successor SVE2. He was appointed an Arm Fellow in 2015. Contact him at Nigel.Stephens@arm.com.
Magnus Bruce is a senior principal engineer with Arm, Austin, TX, USA, focused on memory system microarchitecture and coherent interconnects. He received the Master of Engineering degree in electrical engineering from the University of Florida. Contact him at Magnus.Bruce@arm.com.

Yasuo Ishii is a principal engineer with Arm, Austin, TX, USA, where he is responsible for instruction fetch and branch predictor microarchitecture. He received the Ph.D. degree in computer science from the University of Tokyo. Contact him at Yasuo.Ishii@arm.com.

Joseph Pusdesris is a staff engineer with Arm, Austin, TX, USA, focused on bridging the gap between performance exploration, modeling, and RTL design. He received the MSE degree in computer science from the University of Michigan. Contact him at Joseph.Pusdesris@arm.com.

Abhishek Raja is a staff engineer with Arm, Austin, TX, USA, focused on CPU memory system microarchitecture. He received the M.S. degree in electrical engineering from the University of Washington. Contact him at Abhishek.Raja@arm.com.

Chris Abernathy is a CPU lead architect with Arm, Austin, TX, USA. He received the M.S. degree in electrical and computer engineering from the University of Texas. Contact him at Chris.Abernathy@arm.com.

Jinson Koppanalil is a senior principal engineer with Arm. He is a technical lead responsible for the development of Arm’s CPU products. He received the M.S. degree in computer engineering from North Carolina State University, Raleigh. Contact him at Jinson.Koppanalil@arm.com.

Tushar Ringe is a principal engineer with the Systems IP group, Arm, Austin, TX, USA. He is part of the Microarchitecture team responsible for delivering Coherent Interconnect IP. His research interests include various architectures, such as CHI/AXI-based protocols, microarchitecture exploration, and validation efficiency improvements with special focus on I/O traffic performance passing through Interconnect IP. He received the M.Tech. degree in microelectronics from the Indian Institute of Technology Madras in India. Contact him at Tushar.Ringe@arm.com.

Ashok Tummala is a principal engineer with the Systems IP group, Arm, Austin, TX, USA. He is part of the Architecture, Microarchitecture, and Design team working on Arm’s Coherent Interconnects. His research interests include various industry-standard I/O bus protocols including PCIe, CCIX, USB, etc., H/W and S/W coherency, system architectures, and heterogeneous computing using hardware accelerators. He received the Ph.D. degree in electrical and computer engineering from the University of Texas at San Antonio. Contact him at Ashok.Tummala@arm.com.

Jamshed Jalal is a distinguished engineer with the Systems IP group, Arm, Austin, TX, USA. He is the lead architect on Arm’s various families of Coherent Interconnects and also a key contributor to Arm’s AMBA5 CHI protocol specification. His research interests include H/W and S/W coherency, system architectures ranging from client to enterprise, performance analysis, and memory technologies. He received the Bachelor’s degree in electrical and computer engineering from Oklahoma State University. Contact him at Jamshed.Jalal@arm.com.

Mark Werkheiser is a distinguished engineer with the Systems IP group, Arm, Austin, TX, USA. He is the technical lead responsible for Arm’s Coherent Interconnect IP. His research interests include developing scalable and configurable coherent interconnect microarchitecture. He received the Master’s degree from the University of Wisconsin-Madison. Contact him at Mark.Werkheiser@arm.com.

Anitha Kona is a senior principal infrastructure system architect with the Central Technology group, Arm, Austin, TX, USA. She is the lead system architect for N1 SDP and other server- and networking-class systems being developed at Arm. She received the Master’s degree from Mississippi State University. Contact her at Anitha.Kona@arm.com.
IEEE Micro
Theme Article: Hot Chips
TeraPHY: A Chiplet Technology for Low-Power, High-Bandwidth In-Package Optical I/O
Mark Wade, Erik Anderson, Shahab Ardalan, Pavan Bhargava, Sidney Buchbinder, Michael L. Davenport, John Fini, Haiwei Lu, Chen Li, Roy Meade, Chandru Ramamurthy, Michael Rust, Forrest Sedgwick, Vladimir Stojanovic, Derek Van Orden, Chong Zhang, and Chen Sun
Ayar Labs, Inc.

Sergey Y. Shumarayev, Conor O’Keeffe, Tim T. Hoang, David Kehlet, Ravi V. Mahajan, Matthew T. Guzy, Allen Chan, and Tina Tran
Intel Corporation
March/April 2020. Published by the IEEE Computer Society. 0272-1732 © 2020 IEEE.
CHIPLET-BASED IN-PACKAGE OPTICAL I/O

Figure 2. Interconnect metrics versus reach tradeoffs8 and TeraPHY technology capabilities.

Chiplet-based technologies have emerged as a scalable approach to integrating heterogeneous electrical functionality (high-speed analog, memories, custom accelerators, etc.) inside multichip packages (MCP). To support this approach, an ecosystem has been created to define scalable electrical interfaces and packaging technologies. Until now, the focus has been on integrating heterogeneous electrical chiplets, but the same ecosystem can be leveraged to support next-generation systems based on in-package optical I/O.

Figure 3. Ayar Labs’ optical I/O architecture based on TeraPHY technology.

Figure 3 illustrates the Ayar Labs optical I/O architecture based on the TeraPHY technology. The TeraPHY chiplet is co-packaged with the SoC and is powered by the off-package multiport, multiwavelength optical supply (SuperNova). Since TeraPHY chiplets are implemented via monolithic integration of photonic devices with transistors, they can host a variety of electrical interfaces (from wide-parallel to high-speed serial) to the SoC that adapt to the chosen packaging technology (organic substrates or 2.5D integration).

In this article, we illustrate the MCP integration with the S10 FPGA,3 which communicates with the two TeraPHY chiplets through an AIB interface. The AIB interface is a wide parallel digital interface. The TeraPHY chiplet contains an array of SerDes which transfer data between the AIB interface and the higher data rate optical channels. This integration scheme avoids the need for repeated SerDes interfaces between chips that waste significant power and area resources.

To achieve the density of pins required to escape high bandwidths from the SoC, Intel’s EMIB technology is used. EMIB allows a high density of electrical connections in the AIB regions of the S10 and TeraPHY electrical bump maps while supporting relaxed-pitch bumps elsewhere, overcoming the substrate size limitation and connectivity yield risks of traditional silicon interposer technology.

TERAPHY TECHNOLOGY

The TeraPHY chiplets employ dense wavelength division multiplexing (Figure 4). In each TeraPHY optical macro, multiple ring-resonator-based modulators, with resonant wavelengths spaced nominally to match the multiple laser wavelengths, are coupled to the waveguide on the transmit side. Multiple resonant photodetectors are coupled to the same waveguide on the receive side to receive the corresponding wavelength channel. Ring wavelength tuning and control circuits are used to lock the ring resonators to the corresponding laser wavelengths and compensate for laser wavelength drift, process variation, and environmental temperature variations.

Figure 4. Dense wavelength division multiplexed photonic link architecture of the TeraPHY chiplet photonic macros based on ring resonators.

The TeraPHY chiplet is based on a 45 nm Silicon-on-Insulator CMOS process. The same silicon layers used to form the transistor active …
Figure 9. Floorplan of the TeraPHY chiplet showing AIB interface, ten optical Tx/Rx macros, and fiber array, with TeraPHY macro and Tx/Rx slice insets.
CONCLUSIONS

In-package optics represent a major milestone for the next generation of high-performance SoCs and have the potential to remove the bandwidth-distance tradeoff created by electrical I/O. We presented the Ayar Labs TeraPHY chiplet integrated into an AIB-connected, EMIB-enabled MCP package with an Intel Stratix 10 FPGA, paving the way for fully TeraPHY-populated optical packaging enabling <5 pJ/bit multi-Tb/s die-to-die I/O connectivity at <10 ns SoC-to-SoC latency (excluding fiber) and with up to 2 km in reach. This technology is revolutionary for high-performance compute and emerging machine-learning applications.

Figure 10. Top: Assembled EMIB MCPs with TeraPHY and S10 chips. Bottom: Fully assembled S10 MCP with TeraPHY chiplets post fiber attach.
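The headline figures quoted in the conclusions (<5 pJ/bit at multi-Tb/s, ten optical macros per chiplet) can be sanity-checked with quick arithmetic. The per-macro data rate below is an assumed round number chosen for illustration, not a figure from the article.

```python
# Back-of-envelope chiplet I/O power from the article's headline figures.
# The per-macro data rate (256 Gb/s) is an illustrative assumption.
MACROS = 10              # optical Tx/Rx macros per TeraPHY chiplet
RATE_GBPS = 256          # assumed aggregate data rate per macro, Gb/s
ENERGY_PJ_PER_BIT = 5    # the article's stated upper bound, pJ/bit

total_tbps = MACROS * RATE_GBPS / 1000                    # chiplet bandwidth, Tb/s
power_w = total_tbps * 1e12 * ENERGY_PJ_PER_BIT * 1e-12   # I/O power, W
```

Under these assumptions a chiplet moves 2.56 Tb/s for about 12.8 W of I/O power, which gives a feel for why sub-5 pJ/bit matters at multi-Tb/s scale.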
…can map groups of four AIB channels to groups of four optical wavelengths within each of macros 1–8 and support multicast and broadcast. The digital reconfiguration enables workload-timescale path reconfiguration without physical connection rewiring.

Each TeraPHY optical macro consists of transmit/receive slices for each wavelength (including ring tuning control, adaptation, clock and data recovery logic, and link monitoring circuits), and a shared clocking subsystem and digital control logic layer. The Tx/Rx slice …

ACKNOWLEDGMENTS

The authors would like to thank Dr. W. Chappell, Dr. G. Keeler, Dr. D. Green, and Mr. A. Olofsson for their support. This work was supported by the DARPA MTO office7 under Contract HR00111790020 and Contract N66001-19-9-4010.

REFERENCES

1. Intel, “AIB specification,” 2019. [Online]. Available: https://www.intel.com/content/www/us/en/architecture-and-technology/programmable/heterogeneous-integration/overview.html
2. R. Mahajan et al., “Embedded multi-die interconnect bridge (EMIB)—A high density, high band-width packaging interconnect,” in Proc. 66th Electron. Compon. Technol. Conf., Jun. 2016, pp. 557–565.
3. S. Shumarayev, “Stratix 10: Intel’s 14nm heterogeneous FPGA system-in-package (SiP) platform,” in Proc. Hot Chips 29 Symp., 2017. [Online]. Available: https://www.hotchips.org/wp-content/uploads/hc_archives/hc29/HC29.22-Tuesday-Pub/HC29.22.50-FPGA-Pub/HC29.22.523-Hetro-Mod-Platform-Shumanrayev-Intel-Final.pdf
4. C. Sun et al., “Single-chip microprocessor that communicates directly using light,” Nature, vol. 528, no. 7583, pp. 534–538, Dec. 24, 2015.
5. M. S. Akhter et al., “WaveLight: A monolithic low latency silicon-photonics communication platform for the next-generation disaggregated cloud data centers,” in Proc. IEEE 25th Annu. Symp. High-Perform. Interconnects, 2017, pp. 25–28.
6. M. Wade et al., “A bandwidth-dense, low power electronic-photonic platform and architecture for multi-Tbps optical I/O,” in Proc. Eur. Conf. Opt. Commun., 2018, pp. 1–3.
7. Common Heterogeneous Integration and IP Reuse Strategies (CHIPS), 2019. [Online]. Available: https://www.darpa.mil/program/common-heterogeneous-integration-and-ip-reuse-strategies
8. G. Keeler, DARPA ERI Summit, 2019.

Shahab Ardalan has been with Ayar Labs, Emeryville, CA, USA, since 2017, where he works on high-speed SerDes links for monolithic optical communication. He was with the AMS R&D group at Gennum Corporation from 2007 to 2010. In 2010, he joined San Jose State University as an Assistant Professor and Director of the center for analog and mixed signal, where he taught and conducted research on analog and mixed-signal integrated circuits. He received the Ph.D. degree from the University of Waterloo, Waterloo, ON, Canada, in 2007. He is a Senior Member of the IEEE. Contact him at shahab@ayarlabs.com.

Pavan Bhargava is currently a Senior IC Design Engineer with Ayar Labs, Emeryville, CA, USA. He is currently working toward the Ph.D. degree in electrical engineering with UC Berkeley, Berkeley, CA, USA. He received the B.S. degree in electrical engineering from the University of Maryland-College Park, College Park, MD, USA, in 2014. His research focuses on developing optical receivers for high-speed I/O, and on designing fully integrated optical beam-steering systems for solid-state LIDAR. Contact him at pavan@ayarlabs.com.

Sidney Buchbinder is currently a Hardware Engineer with Ayar Labs, Emeryville, CA, USA. His primary research interests are photonic design automation and verification. He received the B.S. degree in electrical engineering from the California Institute of Technology, Pasadena, CA, USA, in 2015, and is currently working toward the Ph.D. degree with UC Berkeley, Berkeley, CA, USA. Contact him at sidney@ayarlabs.com.

Haiwei Lu is currently a Principal Engineer with Ayar Labs, Emeryville, CA, USA. He is working on photonics assembly and packaging. He received the Ph.D. degree in chemistry from the University of California, Riverside, CA, USA. Contact him at haiwei@ayarlabs.com.

Chen Li is currently a Senior Mechanical Engineer with Ayar Labs, Emeryville, CA, USA. He is interested in thermal characterization and management of electronic and silicon photonic devices, as well as novel packaging of silicon photonics. He received the Ph.D. degree in mechanical engineering from the University of Michigan, Ann Arbor, MI, USA, in 2019. Contact him at chen.li@ayarlabs.com.

Roy Meade is currently the VP of Manufacturing with Ayar Labs, Emeryville, CA, USA. He has more than 20 years of experience in CMOS and is an inventor on more than 70 patents. He received the MBA degree from Duke University, Durham, NC, USA, and the M.S. and B.S. degrees in mechanical engineering from Georgia Tech, Atlanta, GA, USA. He is a Senior Member of the IEEE. Contact him at roy@ayarlabs.com.

Chandru Ramamurthy is currently a Staff Engineer with Ayar Labs, Emeryville, CA, USA, and is interested in high-speed design, radiation-hardened design, and design technology co-optimization. He received the Ph.D. degree in electrical engineering from Arizona State University, Tempe, AZ, USA, in 2017. Contact him at chandru@ayarlabs.com.

Michael Rust is currently a Photonic Test Engineer with Ayar Labs, Emeryville, CA, USA. He is interested in integrated photonics based I/O and topological quantum computation. He received the B.S. degree in physics and the B.S.E.E. degree in computer architecture from the University of Texas at Austin, Austin, TX, USA, and the M.S. degree in computer science from Georgia Tech, Atlanta, GA, USA. Contact him at michael@ayarlabs.com.

Forrest Sedgwick is currently the Director of Test Engineering with Ayar Labs, Emeryville, CA, USA. His research interest is in photonic systems for communications. He received the Ph.D. degree in electrical engineering from the University of California at Berkeley, Berkeley, CA, USA, in 2007. Contact him at forrest@ayarlabs.com.

Vladimir Stojanovic is currently a Professor of electrical engineering and computer sciences with the University of California, Berkeley, Berkeley, CA, USA, and Chief Architect with Ayar Labs, Emeryville, CA, USA. He was also with Rambus, Inc., Los Altos, CA, USA, from 2001 through 2004 and with MIT as Associate Professor from 2005–2013. He received the Ph.D. degree in electrical engineering from Stanford University, Stanford, CA, USA, in 2005, and the Dipl. Ing. degree from the University of Belgrade, Serbia, in 1998. He is a Senior Member of IEEE. Contact him at vladimir@ayarlabs.com.

Derek Van Orden is currently a Photonics Design Engineer with Ayar Labs, Emeryville, CA, USA. His primary interest is the design of high-speed electro-optic modulators and photodetectors leveraging microring resonators on CMOS platforms. He received the Ph.D. degree in physics from the University of California San Diego, La Jolla, CA, USA, in 2011. Contact him at derek@ayarlabs.com.

Chong Zhang is currently a Principal Engineer with Ayar Labs, Emeryville, CA, USA. He is working on photonics and electrical packaging. He received the Ph.D. degree in mechanical engineering and the M.S. degree in optics from the University of Central Florida, Orlando, FL, USA, in 2008. Contact him at chong@ayarlabs.com.

Chen Sun is currently the Chief Scientist and VP of Silicon Engineering with Ayar Labs, Emeryville, CA, USA. He is interested in VLSI design and photonics I/O. He received the B.S. degree from the University of California Berkeley, Berkeley, CA, USA, in 2009, and the S.M. and Ph.D. degrees in electrical engineering from the Massachusetts Institute of Technology, Cambridge, MA, USA, in 2011 and 2015, respectively. Contact him at chen@ayarlabs.com.

Sergey Shumarayev is currently a Senior Principal Engineer with the Intel Programmable Solutions Group CTO office. He is responsible for Interconnect Strategy and is interested in heterogeneous multichip package integration. He received the B.S. degree in electrical engineering from the University of California Berkeley, Berkeley, CA, USA, and an M.S. degree in electrical engineering from Cornell University, Ithaca, NY, USA. Contact him at sergey.yuryevich.shumarayev@intel.com.

Conor O’Keeffe is currently a Research Engineer with the CTO group of Intel’s Programmable Solutions Group. He is interested in SoC architecture, RFIC design, and wireless infrastructure. He received the honors degree in electronic and communication engineering from the University of South Wales, U.K. Contact him at conor.o.keeffe@intel.com.

Tim Hoang is currently a Hardware Architect with Intel. He is interested in programmable circuits, I/O, analog and mixed-signal IP, and 2.5D die-to-die interfaces and protocols. He received the B.S. degree in electrical engineering and computer science from the University of California Berkeley, Berkeley, CA, USA. Contact him at tim.tri.hoang@intel.com.

David Kehlet is currently a Researcher with Intel working on pathfinding for FPGA technology including high-speed interfaces. He received the B.S. and M.S. degrees in electrical engineering from Stanford University, Stanford, CA, USA. Contact him at david.kehlet@intel.com.

Ravi V. Mahajan is currently an Intel Fellow and has worked on many microelectronics packaging technologies. He received the Ph.D. degree in mechanical engineering from Lehigh University, Bethlehem, PA, USA. Contact him at ravi.v.mahajan@intel.com.

Matthew T. Guzy has held several technical positions at Intel in areas focusing on packaging and assembly. He is interested in next-generation chip-on-wafer processes and emerging packaging technologies. He received the Ph.D. degree in chemical engineering from Virginia Polytechnic Institute and State University, Blacksburg, VA, USA, in 2003. Contact him at matthew.t.guzy@intel.com.

Allen Chan is currently a Principal Engineer at Intel with the Programmable Solutions Group’s CTO Office. He received the B.S. degree in electrical engineering from the University of California Davis, Davis, CA, USA. Contact him at allen.chan@intel.com.

Tina Tran is currently an SOC Hardware Engineer with Intel and is interested in transceiver IP design and integrating chiplet technologies in FPGA platforms. She received the B.S. degree in electrical engineering and computer science from the University of California Berkeley, Berkeley, CA, USA. Contact her at tina.c.zhong@intel.com.
Department: Micro Economics
Digital Object Identifier 10.1109/MM.2020.2971915. Date of current version 18 March 2020. 0272-1732 © 2020 IEEE. Published by the IEEE Computer Society.

WHEN MY BROTHER-IN-LAW moved out of state, he gifted my household his beautiful leather-bound set of Encyclopedia Britannica. In spite of their age, they contain answers to any number of questions. How many species of penguins inhabit Antarctica? Who was husband to Cleopatra, last queen of Egypt? When was Billie Holiday born? Nobody in my household ever touches them. Everybody uses Wikipedia.

This column contrasts the economics behind yesterday’s compendium of expertise and today’s crowd-sourced wiki. Any comparison, even a coarse one, will show that prices fell dramatically. No other conclusion can emerge. Did quantity and quality of answers improve? How about their accuracy and reliability? That is less obvious.

CHEAPER PRICES

Start with prices. Britannica reached its peak sales in 1990, when a set of books cost a household around $1500, just under $3000 in contemporary 2020 dollars. The leather-bound volumes cost 30% more. Most households purchased these with monthly payments, say, $30–$50 a month. That was 1%–2% of median U.S. family income. Rich and well-to-do middle-class families bought them.

The price of Wikipedia differs substantially. The ungated web site costs nothing to use, though it is also not entirely free to users. Think of it this way. Users pay charges for internet access. A portion of that expenditure anticipates using Wikipedia.

It cannot add up to much. For most households Wikipedia constitutes much less than 4% of their surfing. The average broadband subscription and smartphone data contract in the United States is around $40–$60 a month and around $60–$80 a month, respectively. The portion of expenditure attributable to Wikipedia cannot exceed $3–$5 a month. Any way you look at it, expenditure declined more than 90% and became a trifling fraction of median household income.

That does not account for the biggest drop, which does not involve money. It concerns time.

In 1990, Britannica sold more than 100,000 copies in the United States. Over several decades several million households bought and owned a volume of books. Tens of millions of households had a set from a competitor—e.g., World Book, Colliers, and so on. Everybody else went to their local library. In contrast, today, four out of five U.S. households access the internet at home or on their phone, which means anywhere. If a price could be put on convenience, it would show a massive drop because internet access is so widespread.
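The price comparison above reduces to simple arithmetic using only the column's own figures; the midpoints of the quoted ranges are taken here, and the Wikipedia figure is the column's upper-bound estimate.

```python
# The column's monthly price figures, reduced to a single comparison.
britannica_monthly = (30 + 50) / 2   # $30–$50/month installment plan, midpoint
wikipedia_monthly = (3 + 5) / 2      # at most $3–$5/month of access fees, midpoint

# Fraction of monthly encyclopedia spending eliminated.
decline = 1 - wikipedia_monthly / britannica_monthly
```

Even granting Wikipedia its upper-bound share of the internet bill, spending falls roughly 90%, consistent with the column's "more than 90%" claim once the bounds rather than midpoints are used.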
Because Wikipedia uses less time, it is just less hassle per transaction. That enables a crazy difference in the scale of use. In one hour Wikipedia receives 4.8 million visitors for content in English. Not all of those readers came from the United States, but so what? One hour.

Scale changed in other ways. Britannica contains 120,000 articles at most. To facilitate sales of books, Britannica’s managers decided long ago to cap the total volume of space occupied by the books on shelves. Editors shortened the included articles to make room for the new. That has not changed much over the decades.

Wikipedia contains a broader scope, and more information. Due to the negligible cost of storage, Wikipedia faces no constraint on its breadth or length. At last count Wikipedia contains just under six million articles in English, and it continues to grow. For example, the entry for Penguins—i.e., the animal—contains more than eight thousand words, while the entry for The Penguin—i.e., the villain from Batman—gets five thousand. Cleopatra’s entry receives more than 25,000 words. Billie Holiday gets more than 10,000, while Billie Eilish, recent Grammy winner at age 18, has 5000 words on her page.

Britannica does not lose on every dimension. Constraints led Britannica to impose a singular sensibility on all articles. Everything came from an established expert. Every article got attention from professional editors, so the best writing is truly magnificent.

Wikipedia goes to the other extreme. While broad, it applies a porous filter. It contains a massive sampling of material that Britannica does not. It has something for everyone, and plenty that many readers do not want. After all, one expert’s junk is another reader’s appropriate topic for an online encyclopedia.

That is a crucial difference, so let’s reiterate. Britannica has great content, and most of it is inconvenient to access. Wikipedia contains plenty of everything, both nutritious news and saccharine sweet sophistry, all of it within reach with little effort.

SUPPLY OF AUTHORITY

The costs of production declined too. Britannica’s expenses in 1990—i.e., to support worldwide distribution—reached $650 million, around $1.3 billion in today’s dollars. Wikipedia’s expenses—i.e., again, to support worldwide distribution—reached less than $100 million. Wikipedia costs at least 90% less to produce.

By design, and out of necessity, Britannica was selective. It showed only the final draft of an article. Also by design and out of necessity, Wikipedia is a work in progress. It shows everything. The cost of an extra web page is negligible, and so is the cost of another paragraph, picture, link, and an article’s entire editorial evolution. The only effective constraint comes from the AI bots that rid the website of abusive language, and from the editors’ collective sense of what belongs.

How does Wikipedia save so much production cost? Those editors are volunteers. A total of 250,000 of them regularly edit the site each month in all its languages. They add content, and incorporate suggestions from tens of millions who add something small. More than 350 employees support them.

Can volunteers write as well as experts? As it turns out, sometimes yes, when the crowd is big enough.

Two colleagues and I recently investigated editors’ behavior in the most uncomfortable setting at Wikipedia, the pages for U.S. politics.1 We found that many editors start off biased. They show up and spout their slanted opinion. Unlike the majority, a future editor sticks around, at least for a short while. Of those, who remains for the long haul? We estimate that only 10%–20% stay for more than a year. Most leave after a month or so, most often after encountering others with extreme and opposing views. Most interesting, those who stay lose their biases, and start fostering a neutral point of view, the site’s highest aspiration for all its content. In short, Wikipedia does not devolve into hopeless arguments because the moderates decide to stay and edit the crowd.

In another project, we compared the political slants of close to four thousand articles from
Britannica and Wikipedia.2 The articles covered nearly identical topics in U.S. politics, and tend to be popular. The biases of the articles in Britannica and Wikipedia became similar when the Wikipedia article received considerable editorial attention. The most edited achieved something akin to a neutral point of view. They were almost always longer too, containing a wider sampling of opinion.

Articles matched by topic are not representative of all of Wikipedia, however. Due to its broad sampling of topics, Wikipedia contains many more articles, unmatched to anything within Britannica. But it comes with a drawback. Many of these lack editing, which generated the potential for uncorrected grammatical errors, factual mistakes, and narrow sampling of opinion.

Therein lies the subtle difference between expert and crowd. The distribution of unfinished articles is enormous at Wikipedia because, believe it or not, 250,000 editors is nowhere near what the site requires. There are plenty of passages that need attention and do not get it.

Each organization acts accordingly. Britannica flaunts its expertise, while Wikipedia flaunts its sourcing from the crowd. Britannica claims reliability, while Wikipedia openly declares caveat emptor, recommending that anybody double check the answer against other sources.

Do readers check? Why should they check objective, verifiable, and noncontroversial facts? For example, there are eight species of penguins in Antarctica, Cleopatra’s last husband was Marc Antony, and Billie Holiday was born in 1915. It took one editor little time to enter that information. If the first draft erred, somebody fixed it long ago. At Wikipedia the metadata for dates, numerical descriptions, historical accounts, and minutia of science shout their own accuracy.

On the other hand, neither Britannica nor Wikipedia can escape subjective, nonverifiable, and controversial content. How good was Burgess Meredith in his campy performance as the Penguin? Was Cleopatra considered charismatic? Why does Billie Eilish’s music and fashion appeal to young listeners? Whereas Britannica retained its authority by asking readers to defer to the expert’s opinion, Wikipedia invites answers for such questions from many sources, and tells readers to check elsewhere too.

While Wikipedia exposes the artificial illusion of using a single source of expertise, it leaves the crowd’s authority open to second guessing. The reader gains control, but loses assurance in the exchange.

CONCLUSION

When questions come up in my household, I go straight to Wikipedia. It takes more time to put on reading glasses than it does to voice the answer from the small screen. The children listen while pretending not to, and so we move forward.

What a bountiful, convenient, and dangerous gift for the generation tied to small screens. The user is in charge, but it comes with a catch. A modern reader needs to take the time to don their thinking cap. But who takes the time and effort? And does anyone really possess enough judgment to second guess it all?

REFERENCES

1. S. Greenstein, G. Y. Gu, and F. Zhu, “Ideological segregation among online collaborators: Evidence from Wikipedians,” Manage. Sci., forthcoming. [Online]. Available: http://dx.doi.org/10.2139/ssrn.2851934
2. S. Greenstein and F. Zhu, “Do experts or crowd-based models produce more bias? Evidence from encyclopedia Britannica and Wikipedia,” MIS Quart., vol. 42, no. 3, pp. 945–959, 2018.

Shane Greenstein is a professor at the Harvard Business School. Contact him at sgreenstein@hbs.edu.