
VOLUME 40, NUMBER 5 SEPTEMBER/OCTOBER 2020

Machine Learning for Systems


Mini-Theme: Biology and Systems Interactions

www.computer.org/micro
IEEE Computer Society Has You Covered!

WORLD-CLASS CONFERENCES — Stay ahead of the curve by attending one of our 200+ globally recognized conferences.

DIGITAL LIBRARY — Easily access over 780k articles covering world-class peer-reviewed content in the IEEE Computer Society Digital Library.

CALLS FOR PAPERS — Discover opportunities to write and present your ground-breaking accomplishments.

EDUCATION — Strengthen your resume with the IEEE Computer Society Course Catalog and its range of offerings.

ADVANCE YOUR CAREER — Search the new positions posted in the IEEE Computer Society Jobs Board.

NETWORK — Make connections that count by participating in local Region, Section, and Chapter activities.

Explore all of the member benefits at www.computer.org today!
EDITOR-IN-CHIEF
Lizy K. John, University of Texas, Austin

EDITORIAL BOARD
R. Iris Bahar, Brown University
Christopher Batten, Cornell University
Mauricio Breternitz, University of Lisbon
Bronis de Supinski, Lawrence Livermore National Lab
Yasuko Eckert, AMD Research
Natalie Enright Jerger, University of Toronto
Maya Gokhale, Lawrence Livermore National Laboratory
Shane Greenstein, Harvard Business School
Hyesoon Kim, Georgia Institute of Technology
John Kim, Korea Advanced Institute of Science and Technology
Hsien-Hsin (Sean) Lee, Taiwan Semiconductor Manufacturing Company
Richard Mateosian
Tulika Mitra, National University of Singapore
Trevor Mudge, University of Michigan, Ann Arbor
Vijaykrishnan Narayanan, The Pennsylvania State University
Richard H. Stern, George Washington University Law School
Sreenivas Subramoney, Intel Corporation
Carole-Jean Wu, Arizona State University
Lixin Zhang, Chinese Academy of Sciences

ADVISORY BOARD
David H. Albonesi, Erik R. Altman, Pradip Bose, Kemal Ebcioglu, Lieven Eeckhout, Michael Flynn, Ruby B. Lee, Yale Patt, James E. Smith, Marc Tremblay

IEEE MICRO STAFF
Journals Production Manager: Joanna Gojlik, j.gojlik@ieee.org
Peer-Review Administrator: micro-ma@computer.org
Publications Portfolio Manager: Kimberly Sperka
Publisher: Robin Baldwin
Senior Advertising Coordinator: Debbie Sims
IEEE Computer Society Executive Director: Melissa Russell

IEEE PUBLISHING OPERATIONS
Senior Director, Publishing Operations: Dawn Melley
Director, Editorial Services: Kevin Lisankie
Director, Production Services: Peter M. Tuohy
Associate Director, Editorial Services: Jeffrey E. Cichocki
Associate Director, Information Conversion and Editorial Support: Neelam Khinvasara
Senior Art Director: Janet Dudar
Senior Manager, Journals Production: Patrick Kempf

CS MAGAZINE OPERATIONS COMMITTEE
Sumi Helal (Chair), Irena Bojanova, Jim X. Chen, Shu-Ching Chen, Gerardo Con Diaz, David Alan Grier, Lizy K. John, Marc Langheinrich, Torsten Möller, David Nicol, Ipek Ozkaya, George Pallis, VS Subrahmanian

CS PUBLICATIONS BOARD
Fabrizio Lombardi (VP for Publications), Alfredo Benso, Cristiana Bolchini, Javier Bruguera, Carl K. Chang, Fred Douglis, Sumi Helal, Shi-Min Hu, Sy-Yen Kuo, Avi Mendelson, Stefano Zanero, Daniel Zeng

Subscription change of address: address.change@ieee.org
Missing or damaged copies: help@computer.org

COMPUTER SOCIETY OFFICE
IEEE MICRO
c/o IEEE Computer Society
10662 Los Vaqueros Circle
Los Alamitos, CA 90720 USA
+1 (714) 821-8380

IEEE Micro (ISSN 0272-1732) is published bimonthly by the IEEE Computer Society. IEEE Headquarters, Three Park Ave., 17th Floor, New York,
NY 10016-5997; IEEE Computer Society Headquarters, 2001 L St., Ste. 700, Washington, DC 20036; IEEE Computer Society Publications Office,
10662 Los Vaqueros Circle, PO Box 3014, Los Alamitos, CA 90720. Postmaster: Send address changes and undelivered copies to IEEE, Member-
ship Processing Dept., 445 Hoes Ln., Piscataway, NJ 08855. Periodicals postage is paid at New York, NY, and at additional mailing offices. Canadian
GST #125634188. Canada Post Corp. (Canadian distribution) Publications Mail Agreement #40013885. Return undeliverable Canadian addresses
to 4960-2 Walker Road; Windsor, ON N9A 6J3. Printed in USA. Reuse rights and reprint permissions: Educational or personal use of this material is
permitted without fee, provided such use: 1) is not made for profit; 2) includes this notice and a full citation to the original work on the first page of the
copy; and 3) does not imply IEEE endorsement of any third-party products or services. Authors and their companies are permitted to post the accepted
version of IEEE-copyrighted material on their own webservers without permission, provided that the IEEE copyright notice and a full citation to the
original work appear on the first screen of the posted copy. An accepted manuscript is a version which has been revised by the author to incorporate
review suggestions, but not the published version with copy-editing, proofreading, and formatting added by IEEE. For more information, please go to
ieee.org/publications_standards/publications/rights/paperversionpolicy.html. Permission to reprint/republish this material for commercial, advertising,
or promotional purposes or for creating new collective works for resale or redistribution must be obtained from IEEE by writing to the IEEE Intellectual
Property Rights Office, 445 Hoes Lane, Piscataway, NJ 08854-4141 or pubs-permissions@ieee.org. ©2020 by IEEE. All rights reserved. Abstracting
and library use: Abstracting is permitted with credit to the source. Libraries are permitted to photocopy for private use of patrons, provided the per-copy
fee indicated in the code at the bottom of the first page is paid through the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923.
Editorial: Unless otherwise stated, bylined articles, as well as product and service descriptions, reflect the author’s or firm’s opinion. Inclusion in IEEE
Micro does not necessarily constitute an endorsement by IEEE or the Computer Society. All submissions are subject to editing for style, clarity, and
space. IEEE prohibits discrimination, harassment, and bullying. For more information, visit ieee.org/web/aboutus/whatis/policies/p9-26.html.
September/October 2020, Volume 40, Number 5

Guest Editors' Introduction
6 Machine Learning for Systems
Heiner Litz and Milad Hashemi

Theme Articles
8 A Taxonomy of ML for Systems Problems
Martin Maas
17 A Programmable Approach to Neural Network Compression
Vinu Joseph, Ganesh Gopalakrishnan, Saurav Muralidharan, Michael Garland, and Animesh Garg
26 A Single-Shot Generalized Device Placement for Large Dataflow Graphs
Yanqi Zhou, Sudip Roy, Amirali Abdolrashidi, Daniel Lin-Kit Wong, Peter Ma, Qiumin Xu, Azalia Mirhoseini, and James Laudon
37 RELEQ: A Reinforcement Learning Approach for Automatic Deep Quantization of Neural Networks
Ahmed T. Elthakeb, Prannoy Pilligundla, Fatemehsadat Mireshghallah, Amir Yazdanbakhsh, and Hadi Esmaeilzadeh
46 Enhancing Model Parallelism in Neural Architecture Search for Multidevice Systems
Cheng Fu, Huili Chen, Zhenheng Yang, Farinaz Koushanfar, Yuandong Tian, and Jishen Zhao
56 TSA-NoC: Learning-Based Threat Detection and Mitigation for Secure Network-on-Chip Architecture
Ke Wang, Hao Zheng, and Ahmed Louri

Guest Editor's Introduction
64 Biology and Systems Interactions
Abhishek Bhattacharjee

Theme Articles
65 Accelerating Genome Analysis: A Primer on an Ongoing Journey
Mohammed Alser, Zülal Bingöl, Damla Senol Cali, Jeremie Kim, Saugata Ghose, Can Alkan, and Onur Mutlu
76 PurpleDrop: A Digital Microfluidics-Based Platform for Hybrid Molecular-Electronics Applications
Ashley Stephenson, Max Willsey, Jeff McBride, Sharon Newman, Bichlien Nguyen, Christopher Takahashi, Karin Strauss, and Luis Ceze

COLUMNS AND DEPARTMENTS
From the Editor-in-Chief
4 Machine Learning for Systems, Biological Computing, and More
Lizy Kurian John
Micro Economics
88 Triggers, Transmissions, and Adjustments
Shane Greenstein

Published by the IEEE Computer Society
Image credit: shutterstock.com/whiteMocca.
From the Editor-in-Chief

Machine Learning for Systems, Biological Computing, and More

Lizy Kurian John
The University of Texas at Austin

ARTIFICIAL INTELLIGENCE (AI) and machine learning (ML) are influencing everything on the planet. With the capability of modern computers, it has become feasible to employ ML for many everyday applications. Deep neural networks have been demonstrated to surpass human capability in image detection and recognition applications. Can ML augment human ability to design efficient computer systems? Can ML be used to design the very same systems that are running the ML applications?

IEEE Micro presents six articles on how the computer architecture community has explored the use of ML models to improve and optimize the computing systems that we build. ML for system design is a recent research direction. Prof. Heiner Litz of the University of California Santa Cruz and Dr. Milad Hashemi of Google guest edited the ML for systems theme. Prof. Litz and Dr. Hashemi spent a significant amount of time acquiring submissions and reviewing them. In the end, they chose six papers that cover topics such as a taxonomy of ML for systems problems, the use of ML for automatic neural network compression, partitioning of large DNNs using ML, automatic deep quantization of DNNs using ML, automatic optimization of deep neural networks, and learning-based threat detection. Enjoy these six papers, and hopefully they help you incorporate such techniques into your research and/or products.

Litz and Hashemi present a guest editorial giving a more detailed introduction to the six articles on this topic. Use it to get an overview to guide your reading.

In addition to the ML for systems theme, this issue also presents a minitheme: biological computing. The efficiency of many processes in nature is admirable, and it is not uncommon to try to achieve the same in man-made systems. The computing that occurs in the human brain, for example, has admirably high energy efficiency. Two interesting questions are: Can computing help biology? Can biology help computing? Dr. Abhishek Bhattacharjee of Yale University presents a minitheme on biological computing. Articles were solicited on topics that lie at the intersection of biology and computing. After a peer-review process, two articles were selected for this minitheme. The first article is on the possibilities offered by accelerators to advance the life sciences. It presents genome

Digital Object Identifier 10.1109/MM.2020.3017880
Date of current version 1 September 2020.

0272-1732 © 2020 IEEE. Published by the IEEE Computer Society.
analysis algorithms and opportunities that modern hardware offers to enhance them. The second article in this theme is on DNA storage using a microfluidic platform. DNA storage is certainly unconventional computing. Prof. Bhattacharjee has written an introduction to the minitheme. Read his introduction and the two articles in this theme.

I take this opportunity to thank all three guest editors for their efforts in attracting many submissions and getting them reviewed. It is the effort of such dedicated guest editors and authors that results in the excellent articles we present to you.

The aforementioned eight articles describe some interesting developments that are happening in computer system design. There are many more interesting developments happening around us. A field known as neurosymbolic AI is emerging. The "neuro" part of the term "neurosymbolic" refers to deep learning neural networks, which have experienced unprecedented growth in the last ten years, and the "symbolic" part refers to the early mainstream approach of creating AI using logical relations. Essentially, the development is to combine the two branches of AI to help each other. Symbolic AI can inject common-sense reasoning and domain knowledge into deep learning, helping large neural networks train faster with less data. Neural networks can make symbolic AI systems smarter by breaking the world into symbols. There is ongoing work at MIT, Google, and elsewhere around this topic, which is also referred to by terms such as neural logic machines and logical neural networks.1,2,3

In addition to the theme articles, this issue also features a Micro Economics column by Shane Greenstein of Harvard Business School, titled "Triggers, Transmissions, and Adjustments," describing the impact of COVID on the economy and comparing it to the two previous economic busts: the dot-com bust of 2000 and the financial meltdown of 2008. He observes that every crisis leads to many adjustments and notes the current redirection toward remote work and telemedicine. Many companies are extending remote work for their employees until the end of the year, and some of the regulatory hurdles to telemedicine disappeared in the light of the pandemic. As the need for both of these suddenly arose with COVID, it was interesting to note that the hardware and software advances to enable them were in good shape. He gives credit to all the investment in hardware and software that made remote work and telemedicine possible.

I hope AI for systems and biological computing enables you to design more awesome systems that help humanity get through the next global challenge. I hope that this special issue takes our readers to unconventional and new areas. I hope you architect new and unconventional chips, using established and unconventional methodologies. Perhaps AI may assist you in ways you were not thinking of earlier. Let us embark on the opportunities presented to us by the new and emerging technologies, whether they are biological or artificial.

Best wishes dealing with your personal and professional challenges as the new school year starts in online, hybrid, or in-person environments.

Until next time.

REFERENCES
1. J. Mao, C. Gan, P. Kohli, J. B. Tenenbaum, and J. Wu, "The neuro-symbolic concept learner: Interpreting scenes, words, and sentences from natural supervision," in Proc. Int. Conf. Learn. Representations, 2019. [Online]. Available: https://arxiv.org/abs/1904.12584
2. H. Dong, J. Mao, T. Lin, C. Wang, L. Li, and D. Zhou, "Neural logic machines," in Proc. Int. Conf. Learn. Representations, 2019. [Online]. Available: https://arxiv.org/abs/1904.11694
3. R. Riegel et al., "Logical neural networks," IBM, Jun. 2020. [Online]. Available: https://arxiv.org/pdf/2006.13155.pdf

Lizy Kurian John is a Cullen Trust for Higher Education Endowed Professor with the Electrical and Computer Engineering Department, The University of Texas at Austin. Contact her at ljohn@ece.utexas.edu.
Guest Editors' Introduction

Machine Learning for Systems

Heiner Litz, University of California Santa Cruz
Milad Hashemi, Google

SPECIALIZED COMPUTER SYSTEMS have driven the performance and capability of deep learning over the past decade.1 However, as machine learning models and systems improve, there is a growing opportunity to also use these models to improve how we design, architect, optimize, and automate computer systems and software. This is a challenging area, from both a learning and a systems perspective. Systems often impose tight size, latency, or reliability constraints on learning mechanisms that do not arise in other applications of machine learning, such as computer vision or natural language processing. From a learning perspective, systems is a challenging application, where input features are often large and sparse, action spaces are gigantic, and generalization is a key attribute.

Yet, despite these challenges, combining learned systems with computers has the potential to revolutionize the field. This is a young area, and given the historical constraints of both systems and ML, interdisciplinary research is challenging. Therefore, we are excited to highlight six excellent papers that hit many different applications of what is possible in this area.

The first article sets the stage on machine learning in systems by introducing "A Taxonomy of ML for Systems Problems." It provides a classification of problems in the systems domain and analyzes their applicability to machine learning. The article correctly observes that machine learning is not a one-size-fits-all technique and that it needs to be carefully designed for a particular problem. An important contribution of this article is that it advises readers about which problem types are amenable to machine learning and which are better addressed with classic approaches.

The second article, "A Programmable Approach to Neural Network Compression," explores compression and pruning techniques to reduce the size and computational complexity of deep neural networks. While in many cases compression can reduce the size of DNNs significantly with almost no loss in accuracy, determining the optimal technique and intensity of pruning and compression is challenging. This article introduces a programmable system to easily explore and automatically optimize the compression strategy, enabling significant performance gains and storage savings.

Digital Object Identifier 10.1109/MM.2020.3016551
Date of current version 1 September 2020.

In "A Single-Shot Generalized Device Placement for Large Dataflow Graphs," a new approach is proposed to automatically optimize the training phase of large deep neural networks. Large DNNs are generally trained on networks of GPUs or other training accelerators, and hence the optimal partitioning of the DNN among these devices, while considering the communication patterns, is challenging. This article proposes machine learning techniques to address this challenging systems problem, significantly outperforming classical graph partitioning and placement techniques.

In "RELEQ: A Reinforcement Learning Approach for Automatic Deep Quantization of Neural Networks," the authors describe a reinforcement learning mechanism to optimize deep neural network architectures. The technique automatically explores the best bit precisions for storing weights in neural networks, achieving compression with minimal loss of accuracy. The technique leverages proximal policy optimization to navigate the large design space of possible encodings.

"Enhancing Model Parallelism in Neural Architecture Search for Multidevice Systems" is another article that analyzes automatic optimization techniques for improving the performance of deep neural networks. In this case, the article optimizes the neural network architecture depending on the target hardware. The approach trades accuracy against inference latency to achieve high performance on a range of hardware, including mobile devices and GPUs.

Our issue concludes with an article proposing machine learning techniques to improve the security of hardware systems, particularly networks-on-chip. In "TSA-NoC: Learning-Based Threat Detection and Mitigation for Secure Network-on-Chip Architecture," the authors describe how machine learning techniques can be used to detect abnormal network behavior induced by hardware trojans and how to automatically reconfigure the network to address security vulnerabilities.

REFERENCE
1. D. Hernandez and T. B. Brown, "Measuring the algorithmic efficiency of neural networks," 2020, arXiv:2005.04305.

Heiner Litz is with the University of California Santa Cruz. Contact him at: hlitz@ucsc.edu.

Milad Hashemi is with Google. Contact him at: miladhashemi@utexas.edu.

Theme Article: Machine Learning for Systems

A Taxonomy of ML for Systems Problems

Martin Maas
Google Research, Brain Team

Abstract—Machine learning has the potential to significantly improve systems, but only
under certain conditions. We describe a taxonomy to help identify whether or not machine
learning should be applied to particular systems problems, and which approaches are
most promising. We believe that this taxonomy can help practitioners and researchers
decide how to most effectively use machine learning in their systems, and provide the
community with a framework and vocabulary to discuss different approaches for applying
machine learning in systems.

MACHINE LEARNING (ML) has transformed many research areas, from image recognition to natural language processing. ML has also had a significant impact on computer systems and inspired the development of new systems for designing and training ML models (e.g., TensorFlow), as well as new hardware (e.g., TPUs).

In contrast to such Systems for ML research, ML for Systems is only now seeing more attention. While ML has long been used in areas such as branch prediction, recent work has shown promising results in caching, compilers, and cluster scheduling. These advances indicate that ML could hold the key to improving many areas in computer systems. However, these successes hide the fact that ML does not always lead to the immediate wins that its popularity promises. Applying ML to systems does not always outperform highly tuned non-ML solutions, and even if ML improves a particular metric, its resource cost does not always justify the improvement.

This article makes the case that while ML has the potential to improve systems, it does so only in certain cases. Furthermore, different ML techniques are suitable for different problems. We therefore categorize systems problems and develop a taxonomy for identifying whether ML can be applied, and what strategies might be suitable. We also provide a bibliography1 that matches existing work to this taxonomy. We believe that our approach can help practitioners and researchers decide how to most effectively use ML in their systems and provide the research community with a framework to discuss ML for Systems strategies.

Digital Object Identifier 10.1109/MM.2020.3012883
Date of publication 30 July 2020; date of current version 1 September 2020.

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
BACKGROUND
"Systems" is a broad term. To ground discussion in a common terminology, we therefore focus on "system policies." Given a software or hardware component that makes decisions related to the execution of computer programs, a system policy describes how these decisions are made. Compiler passes, branch predictors, and memory allocators are all examples of system policies. When we talk about ML for Systems, we therefore mean "using ML in the implementation of a system policy." Specifically, we focus on supervised and reinforcement learning (RL); other techniques such as learning-to-learn, transfer learning, or representation learning have applications in systems as well,2 but are out of scope for this taxonomy.

System policies typically fall into four categories that oftentimes correspond to the degree to which a system has been optimized.

• An ad hoc policy based on assumptions at the time of development. Consider an inlining policy in a compiler: An ad hoc heuristic could inline all functions with fewer than 10 instructions.
• An empirically tuned policy that has been optimized for a set of benchmarks. This is the type of policy often published in research papers. For example, such a policy could consider the call graph and apply carefully crafted inlining rules, chosen to optimize performance across a set of benchmarks (e.g., SPEC).
• A data-driven policy that optimizes toward a specific target. In contrast to an empirically tuned policy that uses benchmarks as a proxy for the real target, this policy is tuned to the target itself (e.g., feedback-directed inlining).
• An adaptive data-driven policy that does not make the same decision for the same target every time, but adapts online in response to its own decisions. An example is trace-based JIT compilers that re-evaluate and revise inlining decisions over time.

ML is often defined as the ability of a program to learn from experience. By this definition, data-driven policies are a form of ML, although potentially a rudimentary one. Most data-driven policies collect databases of examples and learn from them, either by exploring a search space (e.g., autotuners) or by building a lookup table and using it in future executions (e.g., profile-guided optimization). However, we typically start explicitly calling this approach ML only when we use tools from the ML literature, such as support vector machines, decision trees, or neural networks.

What sets ML approaches apart from lookup tables is that they can potentially generalize to unseen cases. An early example is neural branch predictors. Recently, there has been an explosion of such techniques, ranging from learning compiler optimizations3 to cluster scheduling.4 Note that a complicated ML technique is not a necessary requirement to generalize. For example, work on index structures has shown that while it is possible to learn a key distribution using neural networks, a similar goal can be achieved by fitting splines to its cumulative distribution function.5

Note that highly tuned and complex heuristics are similar to data-driven policies. For example, compilers have cost models to accurately predict performance and generalize to unseen programs, but only recently has this problem been revisited using modern ML techniques such as neural networks.6 A corollary is that if a system has been well-tuned, it can be difficult to improve the baseline with a learned policy. One interpretation is that engineers have used a "real-life version of gradient descent" to move the system to a local optimum, not unlike what ML would do. This shows that applying ML to systems does not represent a fundamental departure from systems research, but only provides a new set of tools.

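The index-structure result cited above (generalizing by fitting a spline to a key distribution's cumulative distribution function5) is concrete enough to sketch. The snippet below is a hypothetical illustration, not code from the cited work: it samples knots from the empirical CDF of a sorted key array and linearly interpolates between them to predict where a key is stored.

```python
import bisect
import random

# Hypothetical sketch of a spline-based learned index (illustrative only):
# sample (key, position) knots from the empirical CDF of a sorted key array,
# then predict a key's position by interpolating between surrounding knots.

def build_spline_index(keys, num_knots=16):
    """Sample evenly spaced (key, position) knots from a sorted key array."""
    n = len(keys)
    step = max(1, (n - 1) // (num_knots - 1))
    knots = [(keys[i], i) for i in range(0, n, step)]
    if knots[-1][1] != n - 1:
        knots.append((keys[-1], n - 1))  # always include the last element
    return knots

def predict_position(key, knots):
    """Estimate a key's index by linear interpolation between two knots."""
    knot_keys = [k for k, _ in knots]
    j = bisect.bisect_right(knot_keys, key)
    if j == 0:
        return 0
    if j == len(knots):
        return knots[-1][1]
    (k0, p0), (k1, p1) = knots[j - 1], knots[j]
    if k1 == k0:
        return p0
    return int(p0 + (p1 - p0) * (key - k0) / (k1 - k0))

random.seed(0)
keys = sorted(random.randrange(1_000_000) for _ in range(10_000))
knots = build_spline_index(keys)
guess = predict_position(keys[5000], knots)
# `guess` lands within one knot interval of index 5000; a short local
# search around the estimate then finds the exact slot.
```

The prediction error is bounded by the knot spacing, so a binary search restricted to one interval around the estimate completes the lookup; adding knots trades memory for accuracy, which is the tradeoff the taxonomy asks practitioners to weigh.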

This aspect is often obscured in the discussion of ML for Systems, in part due to the popularity of end-to-end learning. Modern ML techniques can learn very complex behavior, and it is therefore possible to train models that learn complex policies end-to-end. We have seen this approach in areas ranging from RL for caches, to end-to-end cluster scheduling, to RL for compilers. While these approaches often work, they are not always data efficient, consume large amounts of resources, and sometimes do not conclusively outperform strong baselines. Many problems have a known structure that can be captured in a handwritten heuristic. However, end-to-end learning has to learn this structure from scratch and may re-learn known facts, at the cost of maximizing performance on the otherwise intractable part.

We therefore argue that effectively applying ML to systems requires identifying which part of a systems policy requires ML, and developing specific ML techniques for this part. This is supported by the fact that many recent successes of ML for Systems have focused on specific subproblems rather than end-to-end learning (e.g., learning-based index structures, learned cost models6). Note that these learning techniques were used in areas that were already data-driven. As such, there was already an interface for the learning technique to fit into the conventional portion of the system, as well as strong baselines. Meanwhile, when learning replaces an end-to-end heuristic, it can be hard to attribute which gains are due to shifting to a data-driven approach versus ML. The resulting tradeoff space can be difficult to reason about, in part because the vocabulary to discuss the way ML is used is often missing.

CONSTRAINTS AND TRADEOFFS
When considering whether to apply ML, several tradeoffs need to be considered: The problem needs to be well-suited to ML, the deployment constraints need to allow for ML, and suitable data needs to be available. We now discuss these considerations in turn.

Problem Suitability

• Target: System policies have different optimization metrics (e.g., resource utilization or the worst-case latency). ML can help for metrics that are difficult to reason about, while metrics that can be optimized analytically might be addressed without ML.
• Data-driven baselines: If the main benefit of ML is to make a heuristic data-driven, simple data-driven methods should be tried first. In some cases, this is ML (e.g., because the input features are too complex), but in others, a lookup table might yield most of the benefits. For example, table-based branch predictors are competitive with ML approaches.
• High-dimensional input space: If the number of possible inputs is small, a lookup table can be used to memorize all predictions instead of using ML for generalization.

Deployment Constraints

• Latency: ML for Systems differs from areas such as NLP in its ultralow latency requirements. OSs and runtimes often need to make decisions in micro/nanoseconds, branch predictors in cycles. In contrast, even small neural networks often take hundreds of microseconds and GBRTs take tens of microseconds. More complex models can take tens of milliseconds. Before applying ML, it is therefore necessary to identify the latency requirements—offline policies (e.g., in compilers) are often more latency-tolerant than online policies (e.g., in schedulers).
• Space/time overheads: Even if prediction latency is low, models that run often can consume significant resources (CPU/DRAM). In our case study, we use a model for memory allocation, with 250 000 allocations/second. The model takes >100 ms; running it at every allocation is thus infeasible. To use ML in such a setting, formulating the problem so that prediction results can be cached is crucial.
• Custom hardware: Scenarios where latencies are large (e.g., compilers) can use GPUs/TPUs, and hardware policies (e.g., branch predictors, prefetchers) can have custom implementations. Other models typically run on the CPU, which can be inefficient for neural networks. Compiling and inlining the model directly into the code can help.
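The "Space/time overheads" bullet above argues that when a model is too expensive to run on every decision (e.g., 250 000 allocations/second), the problem should be formulated so that predictions can be cached. The sketch below is hypothetical and not from the article's case study (all names are invented for illustration): it memoizes a stand-in lifetime model per allocation-site hash, so the model runs once per distinct site rather than once per allocation.

```python
import functools
import traceback

# Hypothetical sketch (names invented for illustration): memoize an expensive
# lifetime model per allocation context, so repeated allocations from the
# same site are served from a cache instead of re-running the model.

model_invocations = 0

def expensive_lifetime_model(context_hash):
    """Stand-in for a learned model that classifies an allocation as
    short- or long-lived; imagine each call costing far too much to
    run on the allocation fast path."""
    global model_invocations
    model_invocations += 1
    return "short-lived" if context_hash % 2 == 0 else "long-lived"

@functools.lru_cache(maxsize=65536)
def predict_lifetime(context_hash):
    # Cache hit: one dict lookup. Cache miss: one model invocation.
    return expensive_lifetime_model(context_hash)

def allocation_context_hash():
    """Hash the current call stack as a cheap allocation-site identifier."""
    stack = traceback.extract_stack()[:-1]
    return hash(tuple((frame.filename, frame.lineno) for frame in stack))

# Simulate many allocations coming from a handful of distinct sites.
sites = [allocation_context_hash() + i for i in range(4)]
for _ in range(10_000):
    for site in sites:
        predict_lifetime(site)

# Despite 40,000 simulated allocations, the model ran only 4 times.
```

Formulating the policy around a stable cache key (here, a stack hash) is the crucial step: it turns a per-allocation model cost into a per-site cost that is amortized across all allocations sharing that site.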
10 IEEE Micro
- Risk/robustness/interpretability: Models sometimes mispredict and systems need to adapt. The specific use case dictates the robustness requirements and risk, and many non-ML systems policies are not 100% robust themselves. However, many ML models are opaque, which makes problems more difficult to track down when they occur. In high-risk scenarios, it can therefore be required to rely on interpretable approaches, even if they yield lower accuracy (e.g., decision trees).

Data Availability

- Privacy/Security: It needs to be ensured that the data used for learning does not expose sensitive data (e.g., encryption keys).
- Offline Versus Online Learning: Some policies learn online (e.g., branch predictors) while others train offline. For the latter, ML needs to be integrated into the development and QA cycle (e.g., is the model deployed with a new binary or updated dynamically?). Meanwhile, online training can be challenging for expensive models that require accelerators to train.
- Distribution Shifts: Models take time to train and require quality control. ML therefore introduces challenges in cases where the output distribution of the policy shifts quickly. Such cases may require an online approach.

Applying ML requires trading off these priorities. For example, it is sometimes more important to be robust than achieve perfect accuracy. Even if ML achieves better accuracy on a given task, it may therefore not be suitable.

TAXONOMY OF ML FOR SYSTEMS
To give researchers and practitioners a framework to reason about ML for Systems, we divide ML for Systems problems into five categories that we believe capture most problems (they can overlap). We start by classifying existing work.

- Anomaly detection: Detect when a system does not behave as expected (e.g., system failures, security incidents, interference, performance regressions7).
- Forecasting: Predict future behavior of a system. This includes program speculation (e.g., prefetching and branch prediction), object lifetime prediction,8 cardinality estimation in databases, and system or network resource demand forecasting.
- Extrapolation: Given a policy for a known subset of inputs, extend it to unseen inputs. This can include classification tasks in schedulers (e.g., whether a program is scale-up or scale-out4), cost models in compilers, or selecting configuration parameters.
- Discovery: Generate a new/better policy. This includes policies that could not be handwritten, either because the rules are too counterintuitive for a human to come up with (e.g., caching policies identified through learning) or because they are based on a large amount of data. Examples include custom inlining heuristics based on performance profiles or data-specific index structures for databases.
- Optimization: Exploring a potentially large space to find a good or optimum solution (e.g., ML for hardware design, autotuners).

The full bibliography is available online.1 Note that all of these use cases can and are being solved without ML. We argue that this list represents a hierarchy of how hard it is to achieve improvements with ML, from easiest to hardest.

Anomaly Detection
Detecting faults in mechanical and control systems is one of the traditional uses of AI. Anomaly detection is an attractive target for ML because it is often data-driven already: Many anomaly detectors start from a set of examples and cluster them to detect outliers. Modern ML adds new tools, such as autoencoders, where the reconstruction error measures anomaly.7

Baselines: Simple data-driven baselines should be tried first (e.g., clustering). If a baseline is used that is not data-driven, the main lift from ML might be the result of switching from a heuristic to a data-driven approach.
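As a concrete illustration of such a simple data-driven baseline, the sketch below flags outliers by distance from the cluster of normal examples (a z-score test). The latency values and the 3-standard-deviation threshold are assumptions for the example, not values from the article.

```python
import statistics

def fit_baseline(normal_samples):
    """Model 'normal' behavior by its mean and standard deviation."""
    mu = statistics.fmean(normal_samples)
    sigma = statistics.pstdev(normal_samples)
    return mu, sigma

def is_anomaly(x, mu, sigma, k=3.0):
    """Flag points more than k standard deviations from the mean."""
    return abs(x - mu) > k * sigma

# Example: request latencies (ms) observed during normal operation.
normal = [10.2, 9.8, 10.5, 10.1, 9.9, 10.3, 10.0, 9.7]
mu, sigma = fit_baseline(normal)
print(is_anomaly(10.4, mu, sigma))  # False: within the normal cluster
print(is_anomaly(35.0, mu, sigma))  # True: far outside it
```

If a baseline this simple already captures most anomalies, the additional lift from a neural model may be modest.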


Strategies: The first step is to select input features and model their statistical properties. Before using complex techniques, it is worth trying clustering approaches that show a difference between normal and anomalous examples. Complex techniques can be appropriate when the features cannot be readily encoded for clustering (e.g., if they represent graphs), or if it is unknown which of a large set of features to use. In such cases, complex neural networks such as graph neural networks or recurrent neural networks can be appropriate, either used as an embedding, or in an autoencoder setting.

Forecasting
Most computer systems use some form of forecasting, either implicitly or explicitly. Any form of hardware speculation relies on forecasting (e.g., prefetching), and schedulers adjust resources based on predicting future resource usage from current/past usage.

Baselines: Because most systems use forecasting, it is important to determine whether these baselines are data-driven or heuristic-based. If they are not data-driven, prior to creating a complex ML-based approach, a simpler data-driven baseline should be tried. In many cases, the simplest baseline is a lookup table that maps input features to predictions (e.g., branch predictors or inline caches).

Strategies: ML models can provide benefits over table-based forecasting when generalization is required. For example, recent work2 that learned from application-level features in storage shows that a lookup table approach degrades over time as previously unseen features appear. In such cases, different models can be applied, from decision trees to neural networks.

Extrapolation
Most extrapolation in systems is heuristic-driven and often performs classification. For example, a scheduler might compare counters against hard-coded thresholds.

Baselines: Extrapolation should start from a data-driven baseline. However, as previously observed, highly tuned heuristics are similar to data-driven policies and may therefore be appropriate baselines as well.

Strategies: Extrapolation strategies are often problem-specific. Collaborative filtering4 has been used (e.g., for workload classification), but supervised learning also works for many areas, such as predictions based on stack traces8 or learning memory access patterns.9

Discovery
Discovery is about designing previously unknown policies, such as new caching strategies.10 There are two variants: Discovering a new general policy that is intended to be universally used, and learning a specialized policy for a particular set of workloads (i.e., data-driven).

Baselines: Since the goal of discovery is to find a new algorithm, it should be evaluated in the same way as existing approaches of the same (data or nondata-driven) type.

Strategies: A common strategy is RL, where a policy is learned by exploring different decisions in a simulator or real environment and updating the policy based on a reward. While most problems can be framed as RL, it is in practice harder to train a model using it. Simpler approaches are available: One approach is to use an expensive method to solve instances of the problem offline (e.g., SAT/ILP solvers) and use them as training inputs for imitation learning.3 Another alternative is to design several parameterized subpolicies and learn a policy that picks the best one (i.e., a bandit-based approach).

Optimization
In some cases, ML can be used to solve a static optimization problem (e.g., a complex scheduling problem). In contrast to discovery, this approach is not learning a general policy to solve new problem instances, but is using ML to explore the search space of a specific instance.

Baselines: There are many well-known optimization techniques, including genetic algorithms and simulated annealing, some of which have been shown to work in the same areas as ML (e.g., playing video games11). While ML techniques can be used for optimization (e.g., gradient descent is a form of optimization), this alone

Figure 1. How to decide which ML approach to use.

does not constitute learning and is not necessarily better than alternatives.

When applying ML to an optimization problem, it is important to identify whether the goal is to learn a policy that transfers to new problem instances (i.e., discovery) or whether it is to solve a standalone instance. While discovery problems need to be evaluated based on their zero-shot performance for a previously unseen example, optimization problems need to be evaluated against other optimization techniques, such as genetic algorithms. Both baseline and ML approach need to be evaluated with the same amount of resources.

Strategies: RL has been successfully used in optimization problems, by learning a policy that selects the next design decision, combined with a value function that estimates the quality of a particular choice. The policy function and value function can potentially be reused when solving a new optimization problem, leading to transferability. It is, however, possible that these functions overfit to a particular optimization example, so transferability cannot be taken for granted.

For design spaces that are low-dimensional, Bayesian optimization frameworks can work well. It is also important to not dismiss alternative optimization strategies in favor of ML, particularly if transferability is not required.

CHOOSING AN ML STRATEGY
We now discuss how to determine whether a problem is suitable for ML. While major improvements from ML have been shown in all five areas, the further we go from "anomaly detection" to "optimization," the more a model has to learn about how the system works, and the more data/examples are required.

The first step is to check whether the input features are predictive of the output. For low-dimensional prediction problems, this can be done with a lookup table baseline that stores a prediction for every possible input. For higher dimensional problems, replaying a run with an oracle (e.g., in a simulator) is an alternative. This indicates the headroom.

The next decision is the scope of the learned system policy. Almost every system could be framed as an end-to-end RL problem. However, such a model needs to not only learn the statistical properties of the data but also everything about its environment. It can therefore be advantageous to separate the prediction problem from the rest of the system, to limit the complexity of the function that needs to be learned. As shown in the next section, it is sometimes possible to decompose an end-to-end problem into a supervised learning portion and a (traditional) algorithmic portion that consumes these predictions. Alternatively, the latter (reduced) problem may be solved with ML itself (e.g., RL).

Once the learning problem has been defined, an ML technique needs to be chosen. We frame our recommendations as a decision diagram (see Figure 1). For each presented ML type, different learning strategies are available. For example, supervised learning could use neural networks or decision trees. One important factor


Figure 2. ML for memory allocation.8 (a) Visualizing memory fragmentation. (b) Reinforcement learning approach. (c) Imitation learning approach. (d) Decomposed approach using supervised learning and a new type of memory allocator. (e) C++ server workload memory reduction (running against synthetic requests).8 (f) Synthetic memory trace. (g) Final steady-state fragmentation. (h) Allocation/inference latency. (i) ILP solver scalability.

is how to deploy a model within a system, based on resource constraints. Deployment ranges from compiling a model directly into code to running the model offline (e.g., at compile time). The uniqueness of systems problems and their constraints may necessitate new ML techniques.

CASE STUDY
We demonstrate how the previous insights apply to recent work on ML for memory allocation.8 The goal of this work was to reduce fragmentation in C++ workloads with huge (2 MiB) pages and varying memory footprints. Since C++ cannot move objects, long-lived objects can prevent entire pages from being released to the OS [see Figure 2(a)], causing fragmentation. To solve this problem, a memory allocator needs to reason about object lifetimes and group objects with similar lifetimes together.

This requires the memory allocator to predict future behavior of the application, which suggests that this could be a target for ML. The

allocator represents the following system policy: Given a sequence of allocation requests (each with a size and an unknown lifetime), the allocator needs to place objects in virtual address space such that the number of live 2 MiB pages (i.e., pages containing at least one object) is minimized. At the time of allocation, we know the current stack trace and the size of the allocation. As such, we need to solve a forecasting problem.

Our baseline is TCMalloc, which organizes objects by size but ignores lifetime. Figure 2(e) shows used memory and actual memory footprint for running a server workload against synthetic inputs. The baseline incurred over 2× fragmentation (footprint/used) on average, over 4× at low memory usage. Our goal was to reduce this fragmentation. Since the trace is large (110M allocations), we also generated a synthetic driver and baseline that replicates similar behavior with 5000 allocations [see Figure 2(f)].

Reinforcement Learning
This looks like a perfect setup for RL, with a sequence of decisions (where to place each object) and a reward function (memory fragmentation). We therefore started collecting traces and built a simulator to replay these traces, which would allow an RL policy to learn a good allocation strategy. However, several constraints make RL challenging for this scenario. First, the state space is large and complex, as there are 2^64 addresses. Furthermore, the number of allocations is large (millions/min), and rewards are sparse, creating credit assignment challenges.

We implemented a naive DQN model that observes the state of all hugepages and picks an allocation target for our synthetic trace. This simple model outperformed the baseline, but needs to run every allocation and takes 2 ms (TCMalloc's fast path takes 8.3 ns). Even if optimizations improved this latency by 1000×, this approach would thus be impractical.

Imitation Learning
Given full allocation traces of a program, the problem can be solved offline. It reduces to a two-dimensional packing problem, which can be solved retroactively using an ILP solver. Given such a solution, we could train a policy against it. This yields low fragmentation [see Figure 2(g)], but the ILP approach does not scale [see Figure 2(i)], as applications have >10M allocations/minute (the solver did not even solve the full synthetic trace). Furthermore, just like with RL, we would need a DQN that learns from these solutions, and running such a model every allocation is impractical.

Decomposed Supervised Learning
We tackle these challenges by breaking up the prediction problem. Instead of learning object placement end-to-end, we decomposed it into a supervised learning portion that predicts (potentially incorrectly) the lifetime of an object based on its stack trace. We built a new memory allocation algorithm (LLAMA) that relies on these predictions and can detect when past predictions were incorrect. Using this approach, we can avoid running the model at every allocation, because results can be cached (we use a hash table and identify stack traces based on a cheap fingerprinting mechanism).

Following the rules laid out in this article, we validated our approach by using it with a lifetime oracle in the simulator. We then replaced the oracle with a lookup table, but found that this table did not generalize across different versions of an application. We then decided to use an LSTM neural network that we compiled directly into the CPU code to maximize performance (a similar approach could be used for a DQN).

This model is easier to train than RL, because training is supervised. This also means that training does not require full traces from entire runs (like RL or imitation learning). We can instead sample individual allocations. Even so, running the LSTM every allocation takes 144 ms and is therefore impractical [see Figure 2(h)]. However, since predictions now depend only on the current stack trace and no state, we can cache predictions, bringing prediction latency down to 20 ns. LLAMA reduces steady-state fragmentation by 43% in Figure 2(e) (up to 78% for other workloads8).
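A minimal sketch of this caching scheme: fingerprint the stack trace cheaply and look the prediction up in a hash table, falling back to the expensive model only on a miss. The fingerprint and the toy model below are simplified stand-ins for the mechanisms in the paper, invented for illustration.

```python
class LifetimePredictionCache:
    """Cache lifetime predictions keyed by a cheap stack-trace fingerprint,
    so the expensive model runs only for previously unseen stack traces."""

    def __init__(self, model):
        self.model = model   # expensive: e.g., an LSTM over the stack trace
        self.table = {}      # fingerprint -> predicted lifetime class
        self.misses = 0

    @staticmethod
    def fingerprint(stack_trace):
        # Stand-in for a cheap fingerprinting mechanism: hash the tuple
        # of frame identifiers (return addresses in a real allocator).
        return hash(tuple(stack_trace))

    def predict(self, stack_trace):
        key = self.fingerprint(stack_trace)
        if key not in self.table:
            self.misses += 1
            self.table[key] = self.model(stack_trace)
        return self.table[key]

# Toy model: deeper allocation sites -> longer-lived objects (illustrative only).
toy_model = lambda trace: "long" if len(trace) > 3 else "short"

cache = LifetimePredictionCache(toy_model)
trace = ["main", "handle_request", "parse", "alloc"]
for _ in range(1000):
    cache.predict(trace)  # hits the table after the first call
print(cache.misses)  # 1
```

Because the prediction depends only on the stack trace and no allocator state, a hash-table hit can replace a full model invocation, which is what makes the amortized 20 ns lookup possible.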


empirically-tuned policy (i.e., past memory allocators) to a data-driven policy. Since a lookup table alone was not sufficient, we applied ML to learn an embedding of stack traces, using supervised learning. The other parts of the problem (selecting how to allocate memory based on the prediction) were solved with conventional heuristics that tolerate mispredictions.

CONCLUSION
ML for Systems is an emerging area. To maximize its potential, ML needs to be used in the most effective way and evaluated against suitable baselines. Meanwhile, systems-specific requirements such as low latency and large input/output spaces necessitate systems-specific innovation on the ML side. Our goal is to establish a framework and vocabulary to discuss these alternatives and tradeoffs. We see this article and the accompanying bibliography1 as a contribution toward this discussion and believe that these points are going to be increasingly important as we are seeing more ML for Systems work.

ACKNOWLEDGMENTS
The author would like to thank A. Klimovic, A. Goldie, A. Mirhoseini, C. Raffel, H. Lim, J. Laudon, M. Phothilimthana, M. Abadi, R. Singh, R. Frostig, and S. Roy for their feedback.

REFERENCES
1. Supplementary material/bibliography, 2020. [Online]. Available: https://github.com/google-research/ml-for-systems-taxonomy
2. G. Zhou and M. Maas, "Multi-task learning for storage systems," in Proc. ML Syst. Workshop, 2019. [Online]. Available: http://mlforsystems.org/assets/papers/neurips2019/multi_task_zhou_2019.pdf
3. C. Mendis, C. Yang, Y. Pu, S. Amarasinghe, and M. Carbin, "Compiler auto-vectorization with imitation learning," in Proc. Adv. Neural Inf. Process. Syst., 2019, pp. 14625–14635.
4. C. Delimitrou and C. Kozyrakis, "Quasar: Resource-efficient and QoS-aware cluster management," in Proc. 19th Int. Conf. Archit. Support Program. Lang. Oper. Syst., 2014, pp. 127–144.
5. T. Neumann, "The case for B-tree index structures," 2017. [Online]. Available: http://databasearchitects.blogspot.com/2017/12/the-case-for-b-tree-index-structures.html
6. C. Mendis, A. Renda, S. Amarasinghe, and M. Carbin, "Ithemal: Accurate, portable and fast basic block throughput estimation using deep neural networks," in Proc. Int. Conf. Mach. Learn., 2019, pp. 4505–4515.
7. M. Alam, J. Gottschlich, N. Tatbul, J. S. Turek, T. Mattson, and A. Muzahid, "A zero-positive learning approach for diagnosing software performance regressions," in Proc. Adv. Neural Inf. Process. Syst., 2019, pp. 11627–11639.
8. M. Maas, D. G. Andersen, M. Isard, M. M. Javanmard, K. S. McKinley, and C. Raffel, "Learning-based memory allocation for C++ server workloads," in Proc. 25th Int. Conf. Archit. Support Program. Lang. Oper. Syst., 2020, pp. 541–556.
9. M. Hashemi et al., "Learning memory access patterns," in Proc. Int. Conf. Mach. Learn., 2018, pp. 1919–1928.
10. D. S. Berger, "Towards lightweight and robust machine learning for CDN caching," in Proc. 17th ACM Workshop Hot Topics Netw., 2018, pp. 134–140.
11. F. P. Such, V. Madhavan, E. Conti, J. Lehman, K. O. Stanley, and J. Clune, "Deep neuroevolution: Genetic algorithms are a competitive alternative for training deep neural networks for reinforcement learning," 2017, arXiv:1712.06567.

Martin Maas is a Research Scientist at Google. His research interests span language runtimes, computer architecture, systems, and machine learning. Maas received the Ph.D. degree in computer science from UC Berkeley. Contact him at mmaas@google.com.

Theme Article: Machine Learning for Systems

A Programmable Approach to Neural Network Compression

Vinu Joseph and Ganesh L. Gopalakrishnan, University of Utah
Animesh Garg, University of Toronto and Vector Institute
Saurav Muralidharan and Michael Garland, NVIDIA

Abstract—Deep neural networks (DNNs) frequently contain far more weights, represented at a higher precision, than are required for the specific task they are trained to perform. Consequently, they can often be compressed using techniques such as weight pruning and quantization that reduce both the model size and inference time without appreciable loss in accuracy. However, finding the best compression strategy and corresponding target sparsity for a given DNN, hardware platform, and optimization objective currently requires expensive, frequently manual, trial-and-error experimentation. In this article, we introduce a programmable system for model compression called CONDENSA. Users programmatically compose simple operators, in Python, to build more complex and practically interesting compression strategies. Given a strategy and user-provided objective (such as minimization of running time), CONDENSA uses a novel Bayesian optimization-based algorithm to automatically infer desirable sparsities. Our experiments on four real-world DNNs demonstrate memory footprint and hardware runtime throughput improvements of 188× and 2.59×, respectively, using at most ten samples per search.

Digital Object Identifier 10.1109/MM.2020.3012391
Date of publication 28 July 2020; date of current version 1 September 2020.

MODERN DEEP NEURAL networks (DNNs) are complex and often contain millions of parameters spanning dozens or even hundreds of layers.1,2 This complexity translates into substantial memory and runtime costs on hardware platforms at all scales. Recent work has demonstrated that DNNs are often overprovisioned and can be compressed without appreciable loss of accuracy. Model compression can be used to reduce both

September/October 2020 | Published by the IEEE Computer Society | 0272-1732 © 2020 IEEE

model memory footprint and inference latency using techniques such as weight pruning, quantization, and low-rank factorization.3-5 Unfortunately, the requirements of different compression contexts—DNN structure, target hardware platform, and the user's optimization objective—are often in conflict. The recommended compression strategy for reducing inference latency may be different from that required to reduce total memory footprint. For example, in a convolutional neural network (CNN), reducing inference latency may require pruning filters to realize speedups on real hardware,4 while reducing memory footprint may be accomplished by zeroing out individual weights. Similarly, even for the same optimization objective, say reducing inference latency, one may employ filter pruning for a CNN, while pruning 2-D blocks of nonzero weights for a language translation network such as Transformer, since the latter has no convolutional layers. Thus, it is crucial to enable convenient expression of alternative compression schemes; however, none of today's model compression approaches help the designer tailor compression schemes to their needs.

Current approaches to model compression also require manual specification of compression hyperparameters, such as target sparsity—the proportion of zero-valued parameters in the compressed model versus the original. However, with current approaches, finding the best sparsity often becomes a trial-and-error search, with each such trial having a huge cost (often multiple days for large models such as BERT), as each trial involves training the compressed model to convergence, only to find (in most cases) that the compression objectives are not met. The main difficulty faced by such unguided approaches is that sparsities vary unpredictably with changes in the compression context, making it very difficult to provide users with any guidelines whatsoever. Therefore, automatic and sample-efficient approaches that minimize the number of trials are crucial to support the needs of designers who must adapt a variety of neural networks to a broad spectrum of platforms targeting a wide range of tasks.

To address the above-mentioned problems of flexible expression of compression strategies, automated compression hyperparameter inference, and sample efficiency, we make the following contributions.

1) We present CONDENSA, a new framework for programmable neural network compression. CONDENSA supports the expression of the overall compression strategy in Python using operators provided by its compression library. Since each strategy is a Python function, users are able to programmatically compose elementary schemes to build much more complex and practically interesting schemes.
2) We present a novel sample-efficient algorithm based on Bayesian optimization (B.O.) in CONDENSA for automatically inferring optimal sparsities based on a user-provided objective function. Given CONDENSA's ability to support the expression of meaningful high-level objective functions—for example, the throughput (images per second) of a CNN—users are freed from the burden of having to specify compression hyperparameters manually.
3) We demonstrate the effectiveness of CONDENSA on three image classification and language modeling tasks, resulting in memory footprint reductions of up to 188× and runtime throughput improvements of up to 2.59× using at most ten samples per search.

BACKGROUND
For a given task such as image classification, assume we have trained a large reference model w = argmin_w L(w), where L(·) denotes a loss function (e.g., cross-entropy on a given training set), and w ∈ R^P. Model compression refers to finding a smaller model Θ that can be applied to the same task and ideally achieves the same accuracy as w. Model compression can be performed in various ways, and CONDENSA currently supports two commonly used techniques: pruning and quantization. In pruning, nonzero values from w are eliminated or "pruned" to obtain Θ. Pruning is usually performed using some kind of thresholding (e.g., magnitude-based) and can be unstructured (prune any nonzero value) or structured (prune only blocks of nonzeros).4,3 On the other hand, quantization retains the number of parameters in Θ but assigns parameters in

Listing 1. Example usage of the CONDENSA library.

w one of K codebook values, where the codebook may be fixed or adaptive. CONDENSA supports low-precision approximation, which refers
to assigning each parameter in w a correspond-
ing lower-precision representation (for example,
converting from 32-bit to 16-bit floating-point).
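The two techniques just described—magnitude-based unstructured pruning and low-precision approximation—can be sketched in a few lines. This is an illustrative sketch, not CONDENSA's implementation; the function names and the example weights are invented.

```python
import struct

def magnitude_prune(weights, sparsity):
    """Unstructured pruning: zero out the smallest-magnitude fraction
    `sparsity` of the weights (ties may prune slightly more)."""
    k = int(len(weights) * sparsity)  # number of weights to prune
    threshold = sorted(abs(w) for w in weights)[k - 1] if k else -1.0
    return [0.0 if abs(w) <= threshold else w for w in weights]

def to_half_precision(weights):
    """Low-precision approximation: round each value to its nearest
    IEEE 754 half-precision (16-bit) representation."""
    return [struct.unpack('e', struct.pack('e', w))[0] for w in weights]

w = [0.01, -1.5, 0.3, -0.02, 2.0, 0.1]
pruned = magnitude_prune(w, sparsity=0.5)  # zero the 3 smallest magnitudes
print(pruned)  # [0.0, -1.5, 0.3, 0.0, 2.0, 0.0]
quantized = to_half_precision(pruned)      # surviving values rounded to fp16
```

The target sparsity passed to `magnitude_prune` is exactly the hyperparameter that the article's Bayesian optimizer is designed to infer automatically.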
Automating model compression involves
finding both an optimal compression strategy
for a given w, along with its corresponding com-
pression hyperparameters such as target spar-
sity with minimal manual intervention. Current
state-of-the-art frameworks in this domain
include AMC3 and AutoCompress,5 which use
reinforcement learning and simulated annealing,
respectively, to automatically find desirable tar-
get sparsities for a fixed pruning strategy. CON-
DENSA, in contrast, supports the programmable
expression of a wide variety of compression
strategies (not just pruning). Also, in the context
of automated model compression, each sample
corresponds to training the compressed model
to convergence, and can be extremely expensive
to compute; unfortunately, techniques such as
reinforcement learning, which is used in AMC,3 can be highly sample-inefficient.6 To minimize the number of samples drawn, CONDENSA uses a novel and sample-efficient B.O.-based algorithm for automatically arriving at desirable target sparsities.

CONDENSA FRAMEWORK
In CONDENSA, users specify compression schemes to systematically describe how a given reference model w is transformed into a compressed version. Schemes are expressed as Python functions and can utilize a rich set of compression operators provided by the CONDENSA library. Integrating with the Python ecosystem makes the expression of common compression patterns more natural. For example, operators can be combined with conditional statements to selectively compress layers based on properties of the input DNN and/or target hardware platform. The CONDENSA library provides operators and prebuilt schemes for unstructured and structured (filter and block) pruning, and quantization. Listing 1 provides a concrete example of invoking CONDENSA to compress a model.

CONDENSA System Design
Figure 1 provides a high-level overview of the CONDENSA framework. As shown on the left-hand side of the figure, a user compresses a pretrained model w by specifying a compression scheme and an objective function f. Both the compression scheme and objective are specified in Python using operators from the CONDENSA library; alternatively, users may choose from a selection of commonly used built-in schemes and objectives. Apart from the operator library, the core framework, shown in the middle of the figure, consists primarily of two components: 1) the constrained Bayesian optimizer for inferring optimal target sparsities; and 2) the "learning-compression" (L-C) optimizer7 for accuracy recovery. The remainder of this section describes both the Bayesian and L-C optimizers in more detail.

Sample-Efficient Bayesian Optimization
It is intuitive to split the problem of finding optimal target sparsities into two stages: 1) find the highest target sparsity that loses at most ε accuracy w.r.t the original uncompressed model w; and 2) in a constrained sparsity regime obtained from stage 1), optimize a user-provided objective function f (e.g., throughput, or


Figure 1. CONDENSA framework overview. The user provides the pretrained model (w), a compression scheme, and an objective function f. CONDENSA uses the Bayesian and L-C optimizers to infer an optimal target sparsity s and corresponding compressed model Θ.

memory footprint) and return the solution as the final sparsity. For both stages, CONDENSA utilizes B.O., as shown in Figure 1.

B.O. is an optimization framework based on continually updating a probabilistic model with measurements of a function to be optimized.8 Given a set of parameters to be optimized, B.O. makes black-box calls to the objective, updates the probabilistic model with the new information, and selects the next point to evaluate using an acquisition function that combines information about the expectation and uncertainty of a function value under the probabilistic model. CONDENSA employs a Gaussian Process (G.P.) model for B.O. due to its favorable statistical and computational characteristics.9 It is worth highlighting that B.O. leverages principled Bayesian inference to tradeoff exploration and exploitation and is sample-efficient for nonconvex black-box functions such as the ones optimized by CONDENSA.8

In CONDENSA's two-stage optimization pipeline, we first find a sparsity ratio s_acc that constrains the model accuracy function A to the provided ε. We then constrain the sparsity search space to (0, s_acc) while optimizing the user-provided objective function f. Note that we assume that A decreases monotonically w.r.t. sparsity in the region (0, s_acc). For each stage, CONDENSA uses a distinct acquisition function to guide the next best point for function evaluation.

Stage 1: Solving Accuracy Constraints: Recall that in the first stage of the sparsity inference process, we aim to find the highest sparsity s_acc that loses at most ε accuracy w.r.t. the original reference model w. To this end, we first define a Level-Set L that represents Acc(w) − ε and aim to find the point on the accuracy curve of the compressed model that intersects with L; the sparsity corresponding to this solution will be s_acc. We propose a novel acquisition function to find s_acc named domain-restricted upper confidence bound (DR-UCB).

DR-UCB builds upon an existing level-set black-box optimization technique named ILS-UCB,10 which is characterized by two properties. 1) It prioritizes searching in the region where the level set intersects the accuracy curve. 2) It does not seek to precisely learn the shape of the entire accuracy curve. However, in CONDENSA, since accuracy values can be safely assumed to decrease monotonically with increasing sparsity, we notice that it is also possible to progressively restrict the search domain of sparsities based on whether the currently sampled point meets the level-set constraints. In DR-UCB, we exploit this property to greatly improve sample efficiency over ILS-UCB. Mathematically, we define s_t, the sparsity value sampled at iteration t using DR-UCB, as follows:

    s_t = argmax_s (1 − γ)σ(s) − γ|μ(s) − L|    (1)
    s.t. s_t > s_i ∀ i ∈ [0, t − 1]; B_f(s_t) ≥ L.

Here, B_f represents the L-C accuracy function, and s_t is 1) greater than all the previous sparsities s_i; and 2) satisfies the level-set constraint B_f(s_t) ≥ L. We achieve this by minimizing the difference between the GP's mean curve μ(s) and the level set using the term |μ(s) − L| in (1); the parameter γ controls the tradeoff between exploitation and exploration. Algorithm 1 illustrates how DR-UCB is employed to efficiently find s_acc.

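To make (1) and the domain-restricted search concrete, the following is a minimal, dependency-free sketch, not CONDENSA's actual implementation: `mu` and `sigma` stand in for the GP posterior mean and standard deviation, `level` for L, and a simple bisection-style probe replaces the GP-backed acquisition maximization in the search loop. All names are illustrative.

```python
def dr_ucb_score(s, mu, sigma, level, gamma=0.95):
    # Acquisition value from (1): (1 - gamma) * sigma(s) - gamma * |mu(s) - L|.
    return (1.0 - gamma) * sigma(s) - gamma * abs(mu(s) - level)

def next_sparsity(candidates, mu, sigma, level, s_prev, gamma=0.95):
    # Domain restriction: only sparsities above every previous sample qualify.
    feasible = [s for s in candidates if s > s_prev]
    return max(feasible, key=lambda s: dr_ucb_score(s, mu, sigma, level, gamma))

def find_s_acc(bf, level, iters=10):
    # Toy search loop in the spirit of DR-UCB: raise the lower end of the
    # search domain whenever the sampled point meets the level-set constraint.
    # A bisection probe stands in for the GP-backed acquisition step.
    lo, hi, s_acc = 0.0, 1.0, 0.0
    for _ in range(iters):
        s_t = (lo + hi) / 2.0
        if s_t > s_acc and bf(s_t) >= level:  # black-box accuracy evaluation
            lo, s_acc = s_t, s_t              # restrict domain to (s_t, 1)
        else:
            hi = s_t
    return s_acc                              # highest sparsity meeting L
```

With a monotonically decreasing accuracy curve, the score in `dr_ucb_score` is maximized where μ(s) crosses L; e.g., for a toy curve bf(s) = 1 − s² and level 0.91, ten iterations of `find_s_acc` converge to s_acc ≈ 0.3, where the curve crosses the level set.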
20 IEEE Micro
Algorithm 1. Bayesian Sparsity Inference with Domain Restriction

 1: procedure BO-DR-UCB(B_f, L, T)
        ▷ B_f: function to optimize
        ▷ L: level set
        ▷ T: # iterations
 2:   GP ← GP-Regressor.initialize()
 3:   s_0 ← 0; D ← (0, 1); X ← ∅
 4:   for t ← 1, 2, ..., T − 1 do
 5:     s_t ← argmax_D DR-UCB(s | X_0:t−1)
 6:     y_t ← B_f(s_t)
 7:     if s_t > s_t−1 and y_t ≥ L then
 8:       D ← (s_t, 1)
 9:     end if
10:     X_0:t ← {X_0:t−1, (s_t, y_t)}
11:     GP.Update(X_0:t)
12:   end for
13:   return s_T−1
14: end procedure

Stage 2: Optimizing the User-Defined Objective: Once we find a sparsity s_acc that satisfies the user-provided accuracy constraints in stage 1, our next objective is to find the final sparsity s that optimizes the user-defined objective function f in the constrained sparsity domain (0, s_acc). For this, we employ the upper and lower confidence bound (UCB/LCB) acquisition functions for function maximization and minimization, respectively.9

Accuracy Recovery Using L-C

As described earlier in this section, given a reference model w, compression scheme, and target sparsity (obtained automatically by the Bayesian optimizer), CONDENSA tries to recover any accuracy lost due to compression. In this article, we use the recently proposed L-C algorithm7 for accuracy recovery, which formulates model compression as a constrained optimization problem. L-C naturally supports all the compression operators supported by CONDENSA while providing optimality guarantees whenever possible. Due to space restrictions, we refer the reader to the article by Carreira-Perpiñán and Idelbayev7 for a more detailed description of the L-C algorithm.

EVALUATION

We conduct extensive experiments and fully analyze CONDENSA on three tasks: 1) image classification on the CIFAR-10 data set11 using the VGG-192 and ResNet561 neural networks; 2) image classification on the ILSVRC (ImageNet) task12 using the VGG-16 neural network2; and 3) language modeling on WikiText-213 using a two-layer LSTM network described by Yu et al.14 We optimize the networks in each task for two distinct objectives described below.

Objective 1: Minimize Memory Footprint: The memory footprint of a model is defined as the number of bytes consumed by the model's nonzero parameters. Reducing the footprint below a threshold value is desirable, especially for memory-constrained devices such as mobile phones, and can be accomplished through either pruning or quantization, or both. For reducing footprint, we define a compression scheme that performs unstructured pruning of each learnable layer (except batch normalization layers), and then quantizes it to half-precision floating-point, yielding an additional 2× reduction. We denote this scheme by P+Q and implement it using the CONDENSA library as follows:

    from schemes import Compose, Prune, Quantize
    scheme = Compose([Prune(), Quantize(float16)])

Objective 2: Maximize Throughput: Inference throughput is defined as the number of input samples processed by a model per second and is commonly used for measuring real-world performance. For CIFAR-10 and ImageNet, we measure hardware inference throughput of the compressed model in the objective function. We use an NVIDIA Titan V GPU with the TensorRT 5 framework to obtain throughput data. For WikiText-2, due to the lack of optimized block-sparse kernels for PyTorch, we measure the floating-point operations (FLOPs) of the compressed model instead as a proxy for inference performance. To improve throughput, we focus on removing entire blocks of nonzeros, such as convolutional filters, since they have been shown to improve performance on real-world hardware.4 For CIFAR-10 and ImageNet, we use filter pruning, since all the networks we consider are CNNs. In WikiText-2, we employ block pruning with a block size of 5.

Bayesian Optimizer Settings: We use a Gaussian Process prior with the Matern kernel (ν = 2.5), a length scale of 1.0, and an α value of 0.1 with normalization of the predictions. For the DR-UCB acquisition function, we use a γ value of 0.95 for all our
September/October 2020
21
Machine Learning for Systems

Table 1. CONDENSA performance results on CIFAR-10, ImageNet, and WikiText-2.

Method           | Dataset    | Network  | s    | Accuracy             | r_c     | Throughput
Baseline         | CIFAR-10   | VGG19-BN | –    | 92.98%               | 1×      | 1×
CONDENSA P+Q     | CIFAR-10   | VGG19-BN | 0.99 | 93.26%               | 188.23× | N/A
CONDENSA Filter  | CIFAR-10   | VGG19-BN | 0.79 | 93.34%               | 1.35×   | 2.59×
Baseline         | CIFAR-10   | ResNet56 | –    | 92.75%               | 1×      | 1×
AMC3             | CIFAR-10   | ResNet56 | N/A  | 90.1%                | N/A     | s_F = 2×
CONDENSA P+Q     | CIFAR-10   | ResNet56 | 0.95 | 91.42%               | 31.14×  | N/A
CONDENSA Filter  | CIFAR-10   | ResNet56 | 0.63 | 93.18%               | 1.14×   | 1.17×
Baseline         | ImageNet   | VGG16-BN | –    | 91.50%               | 1×      | 1×
Filter Pruning4  | ImageNet   | VGG16-BN | –    | 89.80%               | ~4×     | N/A
AutoCompress5    | ImageNet   | VGG16-BN | N/A  | 90.90%               | 6.4×    | N/A
AMC3             | ImageNet   | VGG16-BN | N/A  | 90.1%                | N/A     | s_F = 1.25×
CONDENSA P+Q     | ImageNet   | VGG16-BN | 0.93 | 89.89%               | 29.29×  | N/A
CONDENSA Filter  | ImageNet   | VGG16-BN | 0.12 | 90.25%               | 1×      | 1.16×
Baseline         | WikiText-2 | LSTM     | –    | Log-Perplexity: 4.70 | 1×      | 1×
Lottery Ticket14 | WikiText-2 | LSTM     | N/A  | Log-Perplexity: 4.70 | ~10×    | N/A
CONDENSA P+Q     | WikiText-2 | LSTM     | 0.92 | Log-Perplexity: 4.75 | 4.2×    | N/A
CONDENSA Block   | WikiText-2 | LSTM     | 0.60 | Log-Perplexity: 4.62 | 1.1×    | s_F = 2.14×

Here, s represents the target sparsity obtained by CONDENSA, r_c is the memory footprint reduction, and s_F the FLOP reduction. The level set, represented by ε, is set to 2% below baseline in all experiments.
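As a concrete reading of the r_c column: the footprint is the number of bytes held by nonzero parameters, so for a P+Q-style scheme the reduction can be estimated with the back-of-the-envelope sketch below. The parameter counts are hypothetical, and the sketch ignores sparse-index storage overheads, which is one reason measured ratios such as 188.23× fall short of the ideal.

```python
def footprint_bytes(num_nonzero, bytes_per_param):
    # Footprint = bytes consumed by the model's nonzero parameters.
    return num_nonzero * bytes_per_param

def footprint_reduction(total_params, sparsity, dense_bytes=4, pruned_bytes=2):
    # Dense fp32 baseline vs. pruned weights stored in fp16 (the extra 2x).
    baseline = footprint_bytes(total_params, dense_bytes)
    kept = round(total_params * (1.0 - sparsity))
    return baseline / footprint_bytes(kept, pruned_bytes)
```

At sparsity 0.99, for example, this gives an ideal 200× reduction, the same order of magnitude as Table 1's 188.23× for VGG19-BN under P+Q.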

experiments, with a bias toward sampling more in the area of the level set, with the intention that the Bayesian optimizer results in a favorable sparsity level in as few samples as possible. We implemented DR-UCB using the fmfn/BO package.*

Results

We present the memory footprint reductions and inference throughput improvements obtained by CONDENSA for each of the three tasks we evaluate in Table 1. For each task, we list the target sparsity obtained by the CONDENSA Bayesian optimizer (s in the table), its corresponding accuracy/perplexity (top-1 accuracy, top-5 accuracy, and log perplexity for CIFAR-10, ImageNet, and WikiText-2, respectively), memory footprint reductions using pruning and quantization (column labeled r_c), and inference throughput/FLOP improvements using filter/block pruning.

We also compare our approach with recent work on automated model compression. For CIFAR-10 and ImageNet, we compare our results with AMC3 and AutoCompress,5 and for WikiText-2, we compare with Yu et al.14 Since AMC3 and Yu et al.14 do not report actual runtime numbers on hardware, we report the corresponding FLOP improvements instead (values marked s_F). We also use FLOP reduction as a metric for LSTM block pruning, as described above.

Using the P+Q scheme designed to minimize memory footprint, CONDENSA is able to obtain compression ratios of up to 188×, which surpasses those of frameworks such as AutoCompress. While AMC and AutoCompress only report theoretical FLOP improvements on CIFAR-10 and ImageNet, the filter pruning strategy implemented using CONDENSA yields real-world runtime improvements of up to 2.59× on an NVIDIA Titan V GPU. Since AMC and AutoCompress do not report the number of samples evaluated to arrive at their

*https://github.com/fmfn/BayesianOptimization
Figure 2. CONDENSA sparsity profiles for VGG19-BN and ResNet56 for CIFAR-10. Column 1 shows the problem of the form "minimize memory footprint with a lower bound on accuracy," while Column 2 illustrates "maximize throughput with a lower bound on accuracy." The D-C line (gray) shows accuracy values if no accuracy recovery with L-C is performed. Note that the x-axis ranges are different: the plots on the left have sparsities ranging from 0.9 to 1.0 while those on the right have values ranging from 0 to 1.

solutions, we are unable to directly compare sample efficiencies with these frameworks; however, we notice that CONDENSA obtains desirable model sparsities using a fixed ten iterations per search in all experiments. Finally, while we set the level set to be 2% below the accuracy of the reference model in all our experiments, we notice that CONDENSA-compressed models often exceed baseline accuracy.

Sparsity Profile Analysis

Figure 2 illustrates how a compressed model's accuracy, inference performance, and memory footprint vary w.r.t. sparsity for the CIFAR-10 task. All three of these functions are assumed to be unknown in our problem formulation, but we compute them explicitly here to better understand the quality of solutions produced by CONDENSA. For each figure, compression accuracies (shown in green) are obtained by running the L-C algorithm to convergence for 100 sparsities ranging from 0.9 to 1.0 (for pruning + quantization), and from 0 to 1 for the filter and block pruning schemes; collecting each such point requires between 30 min and 8 h of time on a single NVIDIA Tesla V100 GPU. Inference throughput, FLOPs, and memory footprint data are collected for each compressed model and depicted by red lines in the figures (right-hand-side y-axis). We also show direct compression (D-C) accuracies in gray for comparison;
Figure 3. TensorRT runtimes and compression ratios of convolutional layers in VGG19-BN (filter pruning).

direct compression refers to applying a compression scheme to a model without any subsequent attempt at accuracy recovery. In each figure, the sparsity found by CONDENSA is shown as a black vertical dashed line.

We notice three important trends in Figure 2.

1) CONDENSA consistently finds solutions near the "knee" of the L-C accuracy curves, signifying the effectiveness of the DR-UCB acquisition function.
2) Local minima/maxima are avoided while optimizing the objective function, demonstrating that the UCB acquisition function for objective function optimization is working as expected.
3) The knees of the D-C accuracy curves occur at significantly lower sparsities; the L-C optimizer, on the other hand, is able to recover accuracies up to much higher target sparsities.

Layerwise Compression Analysis

In this section, we analyze how improving throughput using compression translates to execution time improvements for each layer on actual hardware. For this experiment, we focus on VGG-19 on CIFAR-10, since it has a relatively simple structure and is easy to analyze on a layer-by-layer basis. We use filter pruning with a target sparsity of 0.79 (found by the Bayesian optimizer, as described in Table 1) for this experiment. Figure 3 shows layer-by-layer mean runtimes collected over 100 runs using TensorRT (blue bars, left y-axis), and compression ratios (green line, right y-axis) for filter pruning. We only show data for convolutional layers as they dominate computation time for this network. We make two key observations: 1) runtime speedups on real hardware are largely correlated with compression ratios, but may be affected by hardware and implementation details (e.g., compare conv13 with conv14 in the figure); and 2) higher compression ratios and corresponding speedups occur for the later layers of the network, which indicates that distributing a given global sparsity evenly across network layers may not always be optimal, and algorithms such as L-C are essential to automatically finding desirable distributions of sparsity across layers.

CONCLUSIONS

This article has presented CONDENSA, a flexible programming system for DNN compression and corresponding hyperparameter optimization. We have demonstrated CONDENSA's effectiveness and ease-of-use on a range of state-of-the-art DNNs for image classification and language modeling, and achieved memory footprint reductions of up to 188× and runtime throughput improvements of up to 2.59× using at most ten samples per search.

ACKNOWLEDGMENTS

This work was supported in part by DARPA under Contract HR0011-18-3-0007; in part by the National Science Foundation (NSF) under Award CCF-1704715; and in part by a CIFAR AI Chair award. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the U.S. Government. Distribution Statement "A" (Approved for Public Release, Distribution Unlimited).

REFERENCES

1. K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 770–778.
2. K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in Proc. Int. Conf. Learn. Representations, 2015.
3. Y. He, J. Lin, Z. Liu, H. Wang, L.-J. Li, and S. Han, "AMC: AutoML for model compression and acceleration on mobile devices," in Proc. 10th Eur. Conf. Comput. Vis., 2018, pp. 784–800.
4. Y. He, X. Zhang, and J. Sun, "Channel pruning for accelerating very deep neural networks," in Proc. IEEE Int. Conf. Comput. Vis., 2017, vol. 2, pp. 1398–1406.
5. N. Liu, X. Ma, Z. Xu, Y. Wang, J. Tang, and J. Ye, "AutoCompress: An automatic DNN structured pruning framework for ultra-high compression rates," in Proc. AAAI Conf. Artif. Intell., 2020, pp. 4876–4883.
6. V. Mnih et al., "Playing Atari with deep reinforcement learning," 2013, arXiv:1312.5602.
7. M. A. Carreira-Perpiñán and Y. Idelbayev, ""Learning-compression" algorithms for neural net pruning," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 8532–8541.
8. D. R. Jones, M. Schonlau, and W. J. Welch, "Efficient global optimization of expensive black-box functions," J. Global Optim., vol. 13, no. 4, pp. 455–492, 1998.
9. N. Srinivas, A. Krause, S. M. Kakade, and M. Seeger, "Gaussian process optimization in the bandit setting: No regret and experimental design," in Proc. 27th Int. Conf. Mach. Learn., 2010.
10. A. Garg et al., "Tumor localization using automated palpation with Gaussian process adaptive sampling," in Proc. IEEE Int. Conf. Autom. Sci. Eng., 2016, pp. 194–200.
11. A. Krizhevsky, V. Nair, and G. Hinton, "The CIFAR-10 dataset," 2014. [Online]. Available: http://www.cs.toronto.edu/~kriz/cifar.html
12. J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2009, pp. 248–255.
13. S. Merity, C. Xiong, J. Bradbury, and R. Socher, "Pointer sentinel mixture models," 2016, arXiv:1609.07843.
14. H. Yu, S. Edunov, Y. Tian, and A. S. Morcos, "Playing the lottery with rewards and multiple languages: Lottery tickets in RL and NLP," 2019, arXiv:1906.02768.

Vinu Joseph is currently working toward the Ph.D. degree in computer science at the University of Utah. Contact him at vinu@cs.utah.edu.

Ganesh L. Gopalakrishnan is a Professor of computer science with the School of Computing, University of Utah. Contact him at ganesh@cs.utah.edu.

Saurav Muralidharan is a Senior Research Scientist in the Programming Systems & Applications research group with NVIDIA. Contact him at sauravm@nvidia.com.

Michael Garland is the Senior Director for Programming Systems & Applications research with NVIDIA. Contact him at mgarland@nvidia.com.

Animesh Garg is an Assistant Professor of computer science with the University of Toronto, and a Faculty Member with the Vector Institute, Canada. Contact him at garg@cs.toronto.edu.
Theme Article: Machine Learning for Systems

A Single-Shot Generalized Device Placement for Large Dataflow Graphs

Yanqi Zhou, Sudip Roy, Amirali Abdolrashidi, Daniel Lin-Kit Wong, Peter Ma, Qiumin Xu, Azalia Mirhoseini, and James Laudon
Google

Abstract—With increasingly complex neural network architectures and heterogeneous device characteristics, finding a reasonable graph partitioning and device placement strategy is challenging. While there have been prior attempts at learned approaches for solving device placement, these approaches are computationally expensive, unable to handle large graphs consisting of over 50 000 nodes, and do not generalize well to unseen graphs. To address all these limitations, we propose an efficient single-shot, generalized deep RL method (SGDP) based on a scalable sequential attention mechanism over a graph neural network that is transferable to new graphs. On a diverse set of representative deep learning models, our method on average achieves a 20% improvement over human placement and an 18% improvement over the prior art with 15× faster convergence. We are the first to demonstrate superhuman performance on an 8-layer recurrent neural network language model and an 8-layer GNMT consisting of over 50 000 nodes, on 8 GPUs. We provide rationales and a sensitivity study on model architecture selections.

NEURAL NETWORKS HAVE demonstrated remarkable scalability: improved performance can usually be achieved by training a larger model on a larger dataset.6,7 Training such large models efficiently while meeting device constraints, like memory limitations, necessitates partitioning of the underlying dataflow graphs for the models across multiple devices. However, devising a good partitioning and placement of the dataflow graphs requires deep understanding of the model architecture, optimizations performed by

Digital Object Identifier 10.1109/MM.2020.3015188
Date of publication 7 August 2020; date of current version 1 September 2020.

26
0272-1732 © 2020 IEEE Published by the IEEE Computer Society IEEE Micro
domain-specific compilers, as well as the device characteristics, and is therefore extremely hard even for experts.

Graph partitioning and device placement can be specified through a programming interface or through compiler optimizations. ML practitioners often rely on their domain knowledge to determine a reasonable partitioning and device mapping for computational graphs. For example, programmers can manually assign devices to operations through a programming interface such as TensorFlow and Mesh-TensorFlow. However, relying solely on the model architecture while ignoring the effect of the partitioning on subsequent compiler optimizations like operation scheduling can lead to suboptimal placements and consequently under-utilization of available devices. Alternatively, a compiler can apply heuristics to annotate graphs and assign devices to tensors or operations. The heuristics not only lead to suboptimal configurations but also need to be constantly modified to accommodate new cases arising from previously unseen model architectures.

The goal of automatic device placement is to find the optimal assignment of operations to devices such that the end-to-end execution time for a single step is minimized and all device constraints like memory limitations are satisfied. Since this objective function is nondifferentiable, prior approaches1,3,11 have explored solutions based on reinforcement learning (RL). However, these RL policies are impractical to use in a real production compiler for several reasons. First, they are designed for small to medium sized computation graphs and do not demonstrate strong performance on large graphs where device placement is truly needed. Second, these RL policies are usually not transferable and require training a new policy from scratch for each individual graph. This makes such approaches impractical due to the significant amount of compute required for the policy search itself, at times offsetting gains made by the reduced step time.

In this article, we propose an end-to-end, single-shot deep RL method for device placement (SGDP) where the learned policy is generalizable to new graphs. For the speed of training, we propose a single-shot placement using a re-engineered Transformer-XL network. Instead of generating placement decisions one node at a time,1,3,16 the policy network generates decisions for the entire graph in a single-shot fashion. In order to handle large graphs consisting of over 50 000 nodes, we use a Transformer-XL based on segmented recurrent attention10,18 that partitions the input sequences and generates placement decisions one sequence at a time while using caching to track intersequence dependencies. The segmented Transformer-XL removes any hard constraints such as hierarchical grouping of operations3 or colocation heuristics to reduce the placement complexity.1 For generalization, we apply a graph neural network (GNN) to encode operation features and dependencies into a trainable graph representation, and learn the graph representation end-to-end with the placement policy decisions.

Both our graph-embedding network and placement network can be jointly trained in an end-to-end fashion using a supervised reward, without the need to manipulate the loss functions at multiple levels. We empirically show that the network learns flexible placement policies at a per-node granularity and can scale to problems over 50 000 nodes. By transferring the learned graph embeddings and placement policies, we are able to achieve faster convergence and thus use less resources to obtain high-quality placements.

Our contributions can be summarized as follows.

1) An end-to-end deep RL framework to automatically learn graph partitioning and device placement in a single-shot fashion. Our method is demonstrated to be 15× faster than the prior SoTA based on a hierarchical LSTM model.1,3
2) A scalable placement network with an efficient recurrent attention mechanism, which eliminates the need for an explicit grouping stage before placement. Our method handles large graphs consisting of over 50 000 nodes and is the first to demonstrate superhuman placement performance on large problems such as 8-layer GNMT and an 8-layer recurrent neural network language model (RNNLM).
3) An end-to-end device placement network that can generalize to arbitrary and held-out graphs. This is enabled by jointly learning a transferable GNN along with the placement network.
4) Superior empirical performance over a wide set of important workloads in computer vision, speech, and NLP (InceptionV3, AmoebaNet, RNNs, GNMT, Transformer-XL,10 WaveNet).
5) Detailed rationales and sensitivity studies on model architecture selections for the policy network, compared against LSTMs, MLPs, and graph attention networks (GANs).20

RELATED WORK

Model-Level Parallelism

Model-level parallelism partitions a neural network model among multiple devices, and each device is responsible for the weight updates of the assigned operations or layers. Model-level parallelism enables training large models that exceed the size constraint of the device memory. There are different forms of model-level parallelism, and many of them are supported at the programming language level. Mesh-TensorFlow13 is a language built on top of TensorFlow that provides a general class of distributed tensor computations. While data-parallelism can be viewed as splitting tensors and operations along the "batch" dimension, in Mesh-TensorFlow the user can specify any tensor-dimensions to be split across any dimensions of a multidimensional mesh of processors. FlexFlow2 introduces SOAP, a more comprehensive search space of parallelization strategies for DNNs, which allows parallelization of a DNN in the Sample, Operator, Attribute, and Parameter dimensions. It uses guided randomized search to find a parallelization strategy. GPipe12 proposed pipeline parallelism, by automatically splitting a minibatch of training examples into smaller micro-batches. By pipelining the execution across micro-batches, accelerators can operate in parallel. PipeDream19 introduces pipeline parallelism with more flexibility, allowing gradient updates of multiple minibatches to happen in parallel. However, this introduces staleness and consistency issues for weight updates. In addition, both GPipe and PipeDream partition a model at the granularity of layers instead of operations. Instead of proposing different parallelism strategies or programming primitives, our work focuses on a general deep RL algorithmic solution for automating device placement at operation granularity.

Automatic Device Placement

RL has been used for device placement of a given dataflow graph1 and demonstrated runtime reduction over human-crafted placement and conventional heuristics. For improved scalability, a hierarchical device placement (HDP) strategy3 has been proposed that clusters operations into groups before placing the operation groups onto devices. Spotlight11 applies proximal policy optimization (PPO) and cross-entropy minimization to lower training overhead. Both HDP and Spotlight rely on LSTM controllers that are difficult to train and struggle to capture very long-term dependencies over large graphs. In addition, both methods are restricted to process only a single graph at a time, and cannot generalize to arbitrary and held-out graphs. Placeto16 represents the first attempt to generalize device placement using a graph embedding network. But like HDP, Placeto also relies on hierarchical grouping and only generates a placement for one node at each time step. Our approach leverages a recurrent attention mechanism and generates the whole graph placement at once. This significantly reduces the training time for the controller. We also demonstrate the generalization ability of SGDP over a wider set of important workloads.

Compiler Optimization

REGAL8,9 uses deep RL to optimize the execution cost of computation graphs in a static compiler. The method leverages the policy's ability to transfer to new graphs to improve the quality of the genetic algorithm for the same objective budget. However, REGAL does not show strong
Figure 1. Overview of SGDP: An end-to-end placement network that combines graph embedding and
sequential attention. N: Number of nodes. h: Hidden size. d: Number of devices.

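The flow in Figure 1 can be illustrated with a small, dependency-free sketch. All names and the toy callables below are assumptions for illustration, not SGDP's actual networks: node features are encoded into embeddings, a shared policy head maps each embedding to per-device logits, and every node's device is then chosen in one pass over the graph, i.e., in a single shot.

```python
import math

def softmax(logits):
    # Convert per-device logits into a probability distribution.
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def place_graph(node_features, embed, policy_head):
    # Single-shot placement: one device decision per node, computed for the
    # whole graph in a single pass over the node embeddings. A stochastic
    # policy would sample from probs; here we decode greedily (argmax).
    embeddings = [embed(x) for x in node_features]
    placements = []
    for h in embeddings:
        probs = softmax(policy_head(h))  # distribution over d devices
        placements.append(max(range(len(probs)), key=probs.__getitem__))
    return placements                    # one action a_v per node v
```

A trained setup would use the graph embedding network for `embed` and the attention-based placement network for `policy_head`; any callables work in this sketch, e.g., greedy decoding over two devices.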
empirical results on large graphs, especially graphs consisting of over 50 000 nodes such as 8-layer GNMT and 8-layer RNNLM. In addition, REGAL relies on a performance model to approximate the rewards, while we demonstrate superior performance using real machine measurements in an online learning fashion.

END-TO-END PLACEMENT POLICY

Given a dataflow graph G(V, E), where V represents atomic computational operations (ops) and E represents the data dependence, our goal is to learn a policy π : G ↦ D that assigns a placement D ∈ D for all the ops in the given graph G ∈ G, to maximize the reward r_{G,D} defined based on the runtime. D is the set of allocated devices, which can be a mixture of CPUs and GPUs. In this article, we represent the policy π_θ as a neural network parameterized by θ.

Unlike prior works that focus on a single graph only, the RL objective in SGDP is defined to simultaneously reduce the expected runtime of the placements over a set of N dataflow graphs:

    J(θ) = E_{G∼G, D∼π_θ(G)}[r_{G,D}] ≈ (1/N) Σ_G E_{D∼π_θ(G)}[r_{G,D}].    (1)

In the following, we refer to the case when N = 1 as individual training and the case when N > 1 as batch training. We optimize the objective above using PPO15 for improved sample efficiency.

Figure 1 shows an overview of the proposed end-to-end device placement network. Our proposed policy network π_θ consists of a graph embedding network that learns the graphical representation of any dataflow graph, and a placement network that learns a placement strategy over the given graph embeddings. The two components are jointly trained in an end-to-end fashion. The policy p(a|G) is applied to make a set of decisions at each node. These decisions, denoted as a_v for each v ∈ V across all nodes, form one action a = {a_{v∈V}}. One decision corresponds to playing one arm of a multibandit problem, and specifying the entire a corresponds to playing several arms together in a single shot. Note the architecture is designed to be invariant over the underlying graph topology, enabling us to apply the same learned policy to a wide set of input graphs with different structures.

Graph Embedding Network

We leverage GNNs4,5 to capture the topological information encoded in the dataflow graph. Most graph embedding frameworks are inherently transductive and can only generate embeddings for a given fixed graph. These transductive methods do not efficiently extrapolate to handle unseen nodes (e.g., in evolving graphs), and cannot learn to generalize to unseen graphs. GraphSAGE4 is an inductive framework that leverages node attribute information to efficiently generate representations on previously unseen data. While our proposed framework is generic,
through a fully connected layer f ðlþ1Þ

ðlÞ
hðlþ1Þ
v ¼ f ðlþ1Þ ðconcatðhðlÞ
v ; hN ðvÞ ÞÞ: (3)

Different from GraphSAGE, parameters in our


graph embedding network are trained jointly
with a placement network via stochastic gradi-
ent descent with PPO, in a supervised fashion, as
described in the “End-to-End Placement Policy”
Figure 2. Normalized runtime to SGDP-one (the section. That is, we replace the unsupervised
lower the better). Comparison of different policy loss with our task-specific objective.
network architectures: MLPs consists of graph
embedding network with MLPs. GATs incorporates Placement Network
an attention of neighbor nodes into a GNN. SGDP- The GNN works as a feature aggregation net-
one is our method. Results for GATs on all 8-layer work that learns a trainable feature representa-
input graphs, GNMT, and AmoebaNet are missing tion for the computational graph, we still need a
as GATs fails to generate valid placements. policy network that produces actions on a per
node basis. Given hv ’s, the policy network produ-
we adopt the feature aggregation scheme pro- ces av ’s through conditionally independent pre-
posed in GraphSAGE to model the dependencies dictions, where the prediction for one node v
between the operations and build a general, end- does not depend on the prediction of other nodes
to-end device placement method for a wide set Y Y
pðajGÞ ¼ pðav jGÞ ¼ pðav jfðhv ÞÞ: (4)
of dataflow graphs. v v
In SGDP, nodes and edges in the dataflow
Next, we discuss the selection of a proper neu-
graph are represented as the concatenation of
ral network model to create a policy network.
their meta features (e.g., operation type, output
shape, adjacent node ids) and are further Multilayer Perceptron (MLPs) While f can
encoded by the graph embedding network into a be represented using MLPs, where the MLPs is
trainable representation. The graph embedding shared across all nodes for prediction the place-
process consists of multiple iterations, and the ment output distributions. However, MPLs lack a
computation procedure for the lth iteration can dependence tracking mechanism across nodes.
be outlined as follows. In practise, the placement of one node can be
First, each node v 2 V aggregates the feature determined by the placement of another node,
representations of its neighbors, fhðlÞ u ; 8u 2 where the placed node may consume a large size
ðlÞ
N ðvÞg, into a single vector hN ðvÞ . This aggrega- of data produced by the other node.
tion outcome is a function of all previously gen-
erated representations, including the initial LSTM The conventional LSTM models proposed
representations defined based on the input node for language tasks usually target a shorter
features. In this article, we use the following sequence length. For example, in a language
aggregation function with max pooling: task, a typical sequence length is between a few
hundred to a thousand. However, in the device
ðlÞ placement problem, models truly needs device
hN ðvÞ ¼ maxðsðW ðlÞ hðlÞ ðlÞ
u þ b Þ; 8u 2 N ðvÞÞ
placement can consists of over 50 000 of nodes.
(2) HDP3 has been proposed to address this issue,
ðlÞ ðlÞ
where ðW , b Þ define an affine transform and s however, the proposed grouper network comes
stands for the sigmoid activation function. We with limited flexibility and generality. For exam-
then concatenate the node’s current representa- ple, the grouper network leverages an aggre-
tion, hðlÞ
v , with the aggregated neighborhood vec- gated feature representation by averaging
ðlÞ
tor, hN ðvÞ , and feed this concatenated vector feature vectors for nodes within the same group.
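The aggregation and update steps of (2) and (3), together with the per-node policy of (4), can be sketched in NumPy. This is only a minimal illustration: the feature sizes, random weights, and the tanh/softmax heads are assumptions for the sketch, not the article's actual parameterization.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d, num_devices = 8, 4                        # toy feature size and device count
W = rng.normal(size=(d, d))                  # W^(l) of Eq. (2)
b = np.zeros(d)                              # b^(l) of Eq. (2)
Wf = rng.normal(size=(2 * d, d))             # toy fully connected layer f^(l+1)
Wp = rng.normal(size=(d, num_devices))       # policy head shared across nodes

def aggregate(h_neighbors):
    """Eq. (2): max-pool sigmoid(W h_u + b) over the neighbors u of node v."""
    return np.max(sigmoid(h_neighbors @ W.T + b), axis=0)

def update(h_v, h_nbhd):
    """Eq. (3): h_v^(l+1) = f^(l+1)(concat(h_v^(l), h_N(v)^(l)))."""
    return np.tanh(np.concatenate([h_v, h_nbhd]) @ Wf)

h_v = rng.normal(size=d)                     # current representation of node v
h_neighbors = rng.normal(size=(3, d))        # representations of three neighbors

h_next = update(h_v, aggregate(h_neighbors))
p_av = softmax(h_next @ Wp)                  # Eq. (4): per-node device distribution
print(p_av.shape)                            # one probability per candidate device
```

Because the policy head is shared and applied independently per node, as in (4), the joint placement distribution factorizes over nodes.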

The nondifferentiable grouping procedure prevents training the graph-embedding and placement networks end-to-end.

Graph Attention. An attention network can learn internode dependence and the relative importance of dependencies across an entire graph. An intuitive way is to incorporate an attention mechanism into the GNN to enable nodes to attend over their neighborhood's features, following a self-attention strategy. GATs20 has several strengths: first, it is efficient and parallelizable across node-neighbor pairs; second, it can assign arbitrary weights to the neighbors; third, the model can generalize to completely unseen graphs. However, GATs only supports local attention, in contrast to the long-term global attention of a conventional Transformer network. Moreover, its limited scalability prevents it from handling large graphs consisting of over 10 000 nodes.

Our Method. As we increase the input graph size, the complexities of a GNN and an attention network scale up. Due to the reduction used in the aggregation layers in GraphSAGE, the GNN's complexity usually scales linearly with the number of nodes, but for an attention network the complexity is $O(N^2)$. Therefore, designing a more scalable attention network is critical for large input graphs.

We propose to use a transformer-based attentive network to generate operation placements in an end-to-end fashion. As the graph embedding already contains spatial (topological) information for each node, we remove the positional embedding in the original transformer to prevent the model from overfitting node identifications. To capture long-term dependencies efficiently among a large set of nodes, we adopt the segment-level recurrence introduced in Transformer-XL,10,17 where hidden states computed for the previous set of nodes are cached (with gradient flows disabled) and reused as an extended context during the training of the next segment. This reduces the complexity to $O(N^2/k)$, where $k$ is the number of segments.

Besides achieving a longer context than a GAT, we empirically find the segment-level recurrent attention much faster than a conventional LSTM-based GNMT model. In our experimental evaluation, we compare both the performance and speedup of our placement network with that of the LSTM-based HDP.

To improve policy optimization over the large search space [$O(d^N)$, where $d$ is the number of devices and $N$ is the number of nodes], we apply an additional attention mask to the last layer of the feature map generated by the recurrent attention policy network. The generated action mask is multiplied position-wise with the actions to selectively choose nodes to place. Intuitively, this enables selecting important ops in the graph to change placements while minimizing the cuts for the entire graph.

EXPERIMENT

Experimental Setup
Workloads: We evaluate SGDP using the computational graphs of six diverse architectures from different domains. To create a larger set of workloads, we vary architectural parameters such as the number of layers for each of these workloads. All our workloads are implemented in TensorFlow. Further details about the graphs are in Appendix A.

Runtime Measurement: For the placement task, where TensorFlow provides an API for device assignment, our experiments are evaluated on actual hardware with a configuration of one Intel Broadwell CPU and up to eight Nvidia P100 GPUs.

Baselines: We choose three different baselines against which we compare the performance of SGDP along various metrics in the "Performance on Individual Graphs" section. They are the default heuristics-based optimization used in TensorFlow (METIS), a human-expert solution, and finally solutions found using a learning-based strategy, HDP.3 For the sensitivity study in the "Sensitivity Study on Model Architectures" section, we compare against MLPs and GATs.20

Performance on Individual Graphs
We evaluate SGDP by training the model separately on six important deep learning computation graphs, including RNN Language Modeling, GNMT, Transformer-XL, Inception, AmoebaNet, and WaveNet. We name this approach SGDP-one. Since TensorFlow provides an API for assigning operation placement, all reported measurements for the placement task are on real hardware. As
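The segment-level recurrence idea can be sketched as follows. This is a toy NumPy illustration of the mechanism (the cached previous-segment states extend the attention context, and the number of pairwise attention scores drops to roughly $O(N^2/k)$), not the article's TensorFlow implementation; in a real framework the cache would carry no gradients.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def segment_recurrent_attention(H, seg_len):
    """Each segment attends over itself plus the cached hidden states of the
    previous segment (gradient-free in a real framework)."""
    N, d = H.shape
    out, cache = np.zeros_like(H), None
    for s in range(0, N, seg_len):
        q = H[s:s + seg_len]
        ctx = q if cache is None else np.vstack([cache, q])  # extended context
        out[s:s + seg_len] = softmax(q @ ctx.T / np.sqrt(d)) @ ctx
        cache = q                        # becomes the next segment's context
    return out

H = np.random.default_rng(0).normal(size=(16, 4))
assert segment_recurrent_attention(H, seg_len=4).shape == H.shape

# Pairwise attention scores: full attention costs N^2; k segments that each
# attend over at most 2*(N/k) cached-plus-current states cost about 2*N^2/k.
N, k = 1024, 8
full_cost = N * N
segmented_cost = k * ((N // k) * 2 * (N // k))
print(full_cost // segmented_cost)   # k/2 times fewer scores
```

With k = 8 segments, the toy cost model above computes four times fewer pairwise scores than full attention, which is the scaling argument behind the O(N^2/k) claim.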


Table 1. Runtime comparison between SGDP-one, human expert (HP), TensorFlow METIS, and HDP on six graphs (RNNLM, GNMT, Transformer-XL, Inception, AmoebaNet, and WaveNet). Search speedup is the policy network training time speedup compared to HDP (reported values are averages of six runs).

| Model (#devices) | SGDP-one (s) | HP (s) | METIS (s) | HDP (s) | Runtime speedup over HP / HDP | Search speedup over HDP |
|---|---|---|---|---|---|---|
| 2-layer RNNLM (2) | 0.173 | 0.192 | 0.355 | 0.191 | 9.9% / 9.4% | 2.95x |
| 4-layer RNNLM (4) | 0.210 | 0.239 | 0.503 | 0.251 | 13.8% / 16.3% | 1.76x |
| 8-layer RNNLM (8) | 0.320 | 0.332 | OOM | 0.764 | 3.8% / 58.1% | 27.8x |
| 2-layer GNMT (2) | 0.301 | 0.384 | 0.344 | 0.327 | 27.6% / 14.3% | 30x |
| 4-layer GNMT (4) | 0.350 | 0.469 | 0.466 | 0.432 | 34% / 23.4% | 58.8x |
| 8-layer GNMT (8) | 0.440 | 0.562 | OOM | 0.693 | 21.7% / 36.5% | 7.35x |
| 2-layer Transformer-XL (2) | 0.223 | 0.268 | 0.37 | 0.262 | 20.1% / 17.4% | 40x |
| 4-layer Transformer-XL (4) | 0.230 | 0.27 | OOM | 0.259 | 17.4% / 12.6% | 26.7x |
| 8-layer Transformer-XL (8) | 0.350 | 0.46 | OOM | 0.425 | 23.9% / 16.7% | 16.7x |
| Inception (2) b32 | 0.229 | 0.312 | OOM | 0.301 | 26.6% / 23.9% | 13.5x |
| Inception (2) b64 | 0.423 | 0.731 | OOM | 0.498 | 42.1% / 29.3% | 21.0x |
| AmoebaNet (4) | 0.394 | 0.44 | 0.426 | 0.418 | 26.1% / 6.1% | 58.8x |
| 2-stack 18-layer WaveNet (2) | 0.317 | 0.376 | OOM | 0.354 | 18.6% / 11.7% | 6.67x |
| 4-stack 36-layer WaveNet (4) | 0.659 | 0.988 | OOM | 0.721 | 50% / 9.4% | 20x |
| GEOMEAN | - | - | - | - | 20.5% / 18.2% | 15x |
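The "runtime speedup" percentages in Table 1 are consistent with reading them as the relative runtime reduction, 1 − t_SGDP-one / t_baseline; a quick check against the first row:

```python
# Check the first row of Table 1: runtime speedup as relative reduction,
# 1 - t_SGDP-one / t_baseline, against the reported 9.9% / 9.4%.
t_sgdp, t_hp, t_hdp = 0.173, 0.192, 0.191    # seconds, 2-layer RNNLM (2 devices)

over_hp = (1 - t_sgdp / t_hp) * 100
over_hdp = (1 - t_sgdp / t_hdp) * 100
print(f"{over_hp:.1f}% / {over_hdp:.1f}%")   # 9.9% / 9.4%
```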

shown in Table 1, SGDP-one consistently outperforms human expert placement (HP), TensorFlow METIS placement, and HDP. Overall, SGDP-one achieves on average 20.5% and 18.2% runtime reduction across the evaluated 14 graphs, compared to HP and HDP, respectively. Importantly, with the efficient end-to-end single-shot placement, SGDP-one has a 15× speedup in convergence time of the placement.

Scalability Analysis: SGDP is designed to scale up to extremely large graphs, consisting of over 80 000 nodes (8-layer GNMT). Therefore, unlike any of the prior works, including HDP,3 REGAL,9 and Placeto,16 we can demonstrate superhuman performance on large graphs such as 8-layer GNMT (21.7%/36.5% better than HP/HDP) and 8-layer RNNLM (3.8%/58.1% better than HP/HDP). Of the related state-of-the-art work, Placeto16 and REGAL9 do not provide any results on 8-layer RNNLM or 8-layer GNMT (more than 50 000 nodes), and HDP3 reports inferior performance on 8-layer RNNLM and 8-layer GNMT compared with human placement.

Sensitivity Study on Model Architectures
We compare our method with two alternative architectures (MLPs and GATs), as explained in the "Placement Network" section. An LSTM-based HDP has already been compared in Table 1, so we leave it out of this section. SGDP-one consistently outperforms both MLPs and GATs, by an average of 10% and 7%, respectively (we include only valid placements for GATs). GATs fails to generate valid placements for large graphs consisting of over 10 000 nodes, such as 8-layer RNNLM, 2-layer GNMT, 8-layer Transformer-XL, and AmoebaNet. The results imply that SGDP yields better performance with an attention network, as compared to MLPs, and provides better scalability with a decoupled segmented attention network, as compared to GATs.

Generalization
SGDP enables the training of multiple heterogeneous graphs in a single batch (SGDP-batch). We empirically show that SGDP-batch generates better placements for many workloads such as Transformer-XL (7.6%), WaveNet (15%), and 8-layer GNMT (8%). Table 2 compares the runtime of 11 tasks using SGDP-batch. SGDP-batch yields slightly better runtime compared to SGDP-one in the majority of the tasks, while being only slightly worse on AmoebaNet. Compared to training graphs separately, SGDP-batch reduces network parameters and enables transfer learning among different graphs.

Table 2. Runtime comparison of SGDP-batch versus SGDP-one.

| Model | Speedup | Model | Speedup |
|---|---|---|---|
| 2-layer RNNLM | 0 | Inception | 0 |
| 4-layer RNNLM | 5% | AmoebaNet | -5% |
| 2-layer GNMT | 0 | 4-stack 36-layer WaveNet | 3.3% |
| 4-layer GNMT | 0 | 2-stack 18-layer WaveNet | 15% |
| 2-layer Transformer-XL | 7.6% | 8-layer Transformer-XL | 1.5% |
| 4-layer Transformer-XL | 3% | | |

We test on unseen graphs from different workloads by pretraining SGDP on different subsets of five workloads, excluding the entire workload of the unseen graph. For example, for an RNNLM input graph, all RNNLM models are excluded from the pretraining dataset. During pretraining, we perturb the number of layers, hidden size, and batch size of the input graphs to augment the data and add more randomness to the input data. SGDP-zeroshot directly runs inference on the pretrained SGDP model to generate placements for the target graph. SGDP-finetune further trains the pretrained SGDP model for an additional 50 training steps and generates placements using the fine-tuned SGDP model. We find that SGDP-finetune almost matches the performance of SGDP-one, degrading the placement runtime on average by only 1.2% compared to SGDP-one, and significantly outperforms both human placement and HDP. SGDP-zeroshot completely eliminates the training for the target unseen graphs, while being only 3.7% worse on average than SGDP-one and over 10% better than human placement. This indicates that both the graph embedding and the learned policies transfer and generalize to unseen data.

CONCLUSION
We propose an efficient single-shot, generalized deep RL method (SGDP) and demonstrate superior performance on a wide set of representative deep learning models, including Inception-v3, AmoebaNet, RNNLM, GNMT, Transformer-XL, and WaveNet. Our method on average achieves 20% improvement over human experts and 18% improvement over the prior art with 15× faster convergence, being the first to demonstrate superhuman performance on large graphs consisting of over 50 000 nodes.

APPENDIX A
INPUT GRAPHS
We used a variety of widely used workloads from computer vision, speech, and NLP. In this section, we give a detailed explanation of the selected models and hyperparameters.

Inception-V3
Inception-V3 is a multibranch convolutional network used for a variety of computer vision tasks, including classification, recognition, and generation. The network consists of blocks made of multiple branches of convolutional and pooling operations. Within a block, the branches of ops can be executed in parallel. However, the model is mostly sequential, as the outputs of each block are concatenated together to form the input to the next block. We use a batch size of 64. The TensorFlow graph of this model contains 24 713 operations.

AmoebaNet
AmoebaNet is an automatically designed neural network that yields SOTA performance on ImageNet. Similar to Inception-V3, it contains Inception-like blocks called cells, which receive a direct input from the previous cell and a skip input from the cell before it. We use a batch size of 64. The TensorFlow graph contains 9430 operations.

Recurrent Neural Network Language Model
RNNLM is made of many LSTM cells organized in a grid structure. The processing of each LSTM


cell only depends on the results of two other cells (from the previous layer and from the previous time step), which makes the concurrent execution of many LSTM cells possible given enough hardware resources. We use a batch size of 64 and a hidden size of 2048. The corresponding TensorFlow graph contains 9021 operations for a 2-layer model. The number of ops grows roughly proportionally with the number of layers.

GNMT
Neural machine translation with an attention mechanism has an architecture similar to that of RNNLM, but its many hidden states make it far more computationally expensive than RNNLM. We use a batch size of 64. The original 2-layer encoder-decoder consists of 28 044 operations; an extended 4-layer version consists of 46 600 operations; and an even larger 8-layer version consists of 83 712 operations.

Transformer-XL
Transformer-XL10 is a modified version of the Transformer18 that supports segment-level recurrence and a novel positional encoding scheme. This innovation enables learning dependence that is 80% longer than RNNs and 450% longer than vanilla Transformers. We use a Transformer-XL with a batch size of 64, sequence length of 256, segment length of 64, model hidden dimension of 500, feed-forward hidden dimension of 1000, 10 heads, and head dimension of 50. The 2-layer Transformer-XL contains 2618 operations. The number of ops grows roughly proportionally with the number of layers.

WaveNet
WaveNet is a generative model for speech synthesis. The model is fully probabilistic and autoregressive, with the predictive distribution for each audio sample conditioned on all previous ones. We use a WaveNet model with a batch size of 64 and a receptive field size of 2048 (9 layers per stack). A 5-stack WaveNet contains 4374 operations and a 10-stack WaveNet contains 8516 operations.

APPENDIX B
HYPERPARAMETERS
We list all the selected hyperparameters in our experiments for reproducibility in Tables 3 and 4.

Table 3. Hyperparameters for the policy network. gs layers: GraphSAGE layers; gs knn: GraphSAGE maximum neighbors; trf d model: dimension of the Transformer-XL model; trf n head: number of attention heads; trf layers: number of Transformer-XL layers; trf d head: dimension of each attention head; trf d inner: dimension of the inner hidden size in the positionwise feed-forward layer.

| Parameters | Value | Parameters | Value |
|---|---|---|---|
| gs layers | 4 | gs dim | 128 |
| gs knn | 5 | trf layers | 2 |
| trf d model | 128 | trf n head | 5 |
| trf d head | 25 | trf d inner | 256 |

Table 4. Hyperparameters for PPO.

| Parameters | Value | Parameters | Value |
|---|---|---|---|
| learning rate | 0.5 | num of rollouts | 800 |
| minibatches | 40 | epochs | 20 |
| epsilon | 0.2 | entropy | 0.5 |
| optimizer | Adam | | |

ACKNOWLEDGMENTS
This work was done during an internship at Google.

REFERENCES
1. A. Mirhoseini et al., "Device placement optimization with reinforcement learning," in Proc. Int. Conf. Mach. Learn., 2017.
2. Z. Jia et al., "Beyond data and model parallelism for deep neural networks," in Proc. 35th Int. Conf. Mach. Learn., 2018.
3. A. Mirhoseini et al., "A hierarchical model for device placement," in Proc. Int. Conf. Learn. Representations, 2018.
4. W. L. Hamilton et al., "Inductive representation learning on large graphs," in Proc. Conf. Neural Inf. Process. Syst., 2017.
5. K. Xu et al., "How powerful are graph neural networks?," in Proc. Int. Conf. Learn. Representations, 2019.
6. J. Hestness et al., "Deep learning scaling is predictable, empirically," 2017, arXiv:1712.00409.
7. N. Shazeer et al., "Outrageously large neural networks: The sparsely-gated mixture-of-experts layer," in Proc. Int. Conf. Learn. Representations, 2017.

8. A. Paliwal et al., "REGAL: Transfer learning for fast optimization of computation graphs," Knowl. Discovery Database, 2019.
9. A. Paliwal et al., "Reinforced genetic algorithm learning for optimizing computation graphs," in Proc. Int. Conf. Learn. Representations, 2020.
10. Z. Dai et al., "Transformer-XL: Attentive language models beyond a fixed-length context," in Proc. 57th Annu. Meeting Assoc. Comput. Linguistics, 2019, pp. 2978–2988.
11. I. Sutskever et al., "Spotlight: Optimizing device placement for training deep neural networks," in Proc. Conf. Neural Inf. Process. Syst., 2018.
12. Y. Huang et al., "GPipe: Efficient training of giant neural networks using pipeline parallelism," in Proc. NeurIPS, 2019.
13. N. Shazeer et al., "Mesh-TensorFlow: Deep learning for supercomputers," in Proc. NeurIPS, 2018.
14. B. Cheung et al., "Superposition of many models into one," in Proc. NeurIPS, 2019.
15. J. Schulman et al., "Proximal policy optimization algorithms," 2017, arXiv:1707.06347.
16. R. Addanki et al., "Placeto: Learning generalizable device placement algorithms for distributed machine learning," in Proc. NeurIPS, 2019.
17. Z. Dai, "Improving deep generative modeling with applications," 2019.
18. A. Vaswani et al., "Attention is all you need," in Proc. Conf. Neural Inf. Process. Syst., 2017.
19. D. Narayanan et al., "PipeDream: Generalized pipeline parallelism for DNN training," in Proc. Symp. Oper. Syst. Principles, 2019.
20. P. Veličković et al., "Graph attention networks," in Proc. Int. Conf. Learn. Representations, 2018.

Yanqi Zhou is currently a Research Scientist with Google Brain. She works on generalizing machine learning to optimize systems problems, including compiler graph optimizations and hardware accelerator design. In addition, she builds large-scale deep learning models for speech and language tasks. Zhou received her Ph.D. degree from Princeton University, working on configurable computer architectures and resource provisioning for clouds. Contact her at yanqiz@google.com.

Sudip Roy is currently a Senior Research Scientist with Google AI. He is interested in the design and development of systems for machine learning and also in applying machine learning to solve optimization problems that arise in systems. He has worked on a variety of problems in this space, including infrastructure for large-scale distributed machine learning, data management solutions for managing data lakes, using reinforcement learning to solve optimization problems in machine learning compilers, and geo-replicated transaction processing systems. Roy received a Ph.D. degree in computer science from Cornell University. Contact him at sudipr@google.com.

Amirali Abdolrashidi is currently working toward the Ph.D. degree in computer science and engineering from the University of California, Riverside, where he works on speeding up data-dependent workloads on GPU architectures. Abdolrashidi received the M.S. degree in electrical engineering from New York University, in 2014, and during his Software Engineering Internship with Google, he worked to improve the performance of deep learning applications through prioritized fusion of operations. Contact him at abdolrashidi@gmail.com.

Daniel Lin-Kit Wong is currently working toward the Ph.D. degree with the Computer Science Department, Carnegie Mellon University. He is a systems builder and hacker with a focus on systems design and distributed systems. His research focus has been machine learning for caching. Contact him at wonglkd@gmail.com.

Peter Ma is currently a Software Engineer with Google, and he works on machine learning accelerator architecture and machine learning platform performance. Ma received the Ph.D. degree in mechanical engineering with a Ph.D. minor in computational and mathematical engineering from Stanford University. Contact him at pcma@google.com.

Qiumin Xu is currently a Senior Software Engineer with Google Brain, working on the performance of machine learning accelerators. Xu received the Ph.D. degree in electrical engineering from the University of Southern California. Contact her at qiuminxu@google.com.


Azalia Mirhoseini is currently a Senior Research Scientist with Google Brain, where she works on deep reinforcement learning based approaches to solve problems in computer systems. Mirhoseini received the Ph.D. degree in electrical and computer engineering from Rice University. She was the recipient of a number of awards, including the MIT Technology Review 35 Under 35 award, the Best Ph.D. Thesis Award at Rice, and a Gold Medal in the National Math Olympiad in Iran. Her work has been covered in various media outlets, including MIT Technology Review and IEEE Spectrum. Contact her at azalia@google.com.

James Laudon is currently an Engineering Director with Google Brain. His research interests focus on hardware and software co-design for high-performance systems, and he is currently working on domain-specific computer architectures for machine learning. Before joining Google Brain, he was Founder and Site Director for the Google Madison office. He has contributed to the architecture and implementation of multiple computer systems, including the Stanford DASH, SGI Origin 2000, and Sun UltraSPARC T1. Laudon received the B.S. degree in electrical engineering from the University of Wisconsin—Madison and the M.S. and Ph.D. degrees in electrical engineering from Stanford University. Contact him at jlaudon@google.com.

Theme Article: Machine Learning for Systems

RELEQ: A Reinforcement Learning Approach for Automatic Deep Quantization of Neural Networks

Ahmed T. Elthakeb, Prannoy Pilligundla, and Fatemehsadat Mireshghallah
University of California San Diego

Hadi Esmaeilzadeh
University of California San Diego

Amir Yazdanbakhsh
Google Brain

Abstract—Deep quantization (below eight bits) can significantly reduce the DNN computation and storage by decreasing the bitwidth of network encodings. However, without arduous manual effort, this deep quantization can lead to significant accuracy loss, leaving it in a position of questionable utility. We propose a systematic approach to tackle this problem, by automating the process of discovering the bitwidths through an end-to-end deep reinforcement learning framework (RELEQ). This framework utilizes the sample efficiency of proximal policy optimization to explore the exponentially large space of possible assignments of the bitwidths to the layers. We show how RELEQ can balance speed and quality, and provide a heterogeneous bitwidth assignment for quantization of a large variety of deep networks with minimal accuracy loss (≤0.3% loss) while minimizing the computation and storage costs. With these DNNs, RELEQ enables conventional hardware and a custom DNN accelerator to achieve a 2.2× speedup over 8-bit execution.

Digital Object Identifier 10.1109/MM.2020.3009475


Date of publication 15 July 2020; date of current version
1 September 2020.

September/October 2020. Published by the IEEE Computer Society. 0272-1732 © 2020 IEEE

DEEP NEURAL NETWORKS (DNNs) have made waves across a variety of domains;1 however, their compute efficiency has become a major constraint in unlocking further applications and capabilities. To this end, quantization of neural networks provides a path forward, as it reduces the bitwidth of operations and the memory footprint. For example, in many scenarios, the bottleneck of running DNNs is in transferring the weights and data between main memory and compute cores. Using 8-bit integers rather than 32-bit, we instantly speed up the memory transfer by 4×.

Albeit alluring, quantization can lead to significant accuracy loss if not employed with diligence. To that end, two fundamental problems need to be addressed: 1) developing learning techniques that can perform quantized training of DNNs; and 2) designing algorithms that identify the appropriate bitwidth per layer while preserving accuracy. This article takes on the second challenge, as there are inspiring efforts that have developed techniques for quantized training.2,3

However, this possibility (discovering bitwidths) is manually laborious: to preserve accuracy, the bitwidth varies across individual layers and different DNNs.2,3 Each layer has a different role and unique properties in terms of weight distribution; hence, it displays different sensitivity toward quantization. Nonetheless, layerwise quantization opens a rather exponentially large hyperparameter space, especially when quantization below eight bits is considered. For example, ResNet-20 exposes a hyperparameter space of size $8^l = 8^{20} > 10^{18}$, where $l = 20$ is the number of layers and 8 is the number of possible bitwidths. This exponentially large hyperparameter space grows with the number of layers, making it impractical to exhaustively assess and determine the bitwidth for each layer.

We develop an end-to-end framework, dubbed RELEQ, which exploits the sample efficiency of proximal policy optimization4 to explore the quantization hyperparameter space. The RL agent starts from a full-precision previously trained model and learns the sensitivity of final classification accuracy with respect to the bitwidth of each layer, determining its bitwidth while keeping classification accuracy almost intact. Observing that the quantization bitwidth for a given layer affects the accuracy of subsequent layers, our framework implements a long short-term memory (LSTM)-based RL framework, which enables selecting bitwidths with the context of previous layers' bitwidths. Rigorous evaluations with a variety of networks (AlexNet, CIFAR, LeNet, SVHN, VGG-11, ResNet-20, and MobileNet) show that RELEQ can effectively find heterogeneous bitwidths with minimal accuracy loss (≤0.3% loss) while minimizing the computation and storage cost. The results (see Table 1) show that there is a high variance in bitwidths across the layers of these networks. With the seven
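The size of the layerwise search space quoted above is easy to verify:

```python
# Layerwise bitwidth search space for ResNet-20: 8 candidate bitwidths
# for each of the 20 layers.
choices, layers = 8, 20
space = choices ** layers
print(space)             # 1152921504606846976
print(space > 10 ** 18)  # True
```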

Table 1. Benchmark DNNs and their deep quantization with RELEQ.

| Network | Dataset | Quantization Bitwidths | Average Bitwidth | Accuracy Loss (%) |
|---|---|---|---|---|
| AlexNet | ImageNet | {8, 4, 4, 4, 4, 4, 4, 8} | 5 | 0.08 |
| SimpleNet | CIFAR10 | {5, 5, 5, 5, 5} | 5 | 0.30 |
| LeNet | MNIST | {2, 2, 3, 2} | 2.25 | 0.00 |
| MobileNet | ImageNet | {8, 5, 6, 6, 4, 4, 7, 8, 4, 6, 8, 5, 5, 8, 6, 7, 7, 7, 6, 8, 6, 8, 8, 6, 7, 5, 5, 7, 8, 8} | 6.43 | 0.26 |
| ResNet-20 | CIFAR10 | {8, 2, 2, 3, 2, 2, 2, 3, 2, 3, 3, 3, 2, 2, 2, 3, 2, 2, 2, 2, 2, 8} | 2.81 | 0.12 |
| 10-Layers | SVHN | {8, 4, 4, 4, 4, 4, 4, 4, 4, 8} | 4.80 | 0.00 |
| VGG-11 | CIFAR10 | {8, 5, 8, 5, 6, 6, 6, 6, 8} | 6.44 | 0.17 |
| VGG-16 | CIFAR10 | {8, 8, 8, 6, 8, 6, 8, 6, 8, 6, 8, 6, 8, 6, 8, 8} | 7.25 | 0.10 |
benchmark DNNs, RELEQ enables conventional hardware5 as well as a custom DNN accelerator6 to achieve a 2.2×-2.7× speedup over 8-bit execution. These results suggest that RELEQ takes an effective first step toward automating the deep quantization of neural networks.

RL FOR DEEP QUANTIZATION OF DNNs

Method Overview
RELEQ trains an RL agent that determines the bitwidth for each layer of the network. RELEQ explores the search space of the bitwidths, layer by layer. The underlying optimization problem is multiobjective (higher accuracy, lower compute, and reduced memory); however, preserving the accuracy is the primary objective. With this formulation of the RL problem, RELEQ employs the state-of-the-art proximal policy optimization (PPO)4 to train its policy and value networks. This section details the components and the research path we have examined to design them.

State Space Embedding
Interplay between layers: We design the state space to consider sensitivities and interplay between layers by including knowledge about the bitwidth of previous layers, the index of the layer under quantization, the layer size, and weight statistics (e.g., standard deviation).

However, this information is incomplete without knowing the accuracy of the network given a set of bitwidths and the state of quantization for the entire network. As such, the parameters used to embed the state space of the RELEQ agent are categorized across two different axes.

1) "LayerSpecific" parameters, which are unique to the layer (layer index, layer dimensions, weight statistics), versus "NetworkSpecific" parameters, which characterize the entire network as the agent steps forward during the training process (state of quantization and relative accuracy).
2) "Static" parameters that do not change during the training process versus "Dynamic" parameters that change during training depending on the actions taken by the agent while it explores the search space, such as the state of quantization and relative accuracy.

State of quantization and relative accuracy: The "NetworkSpecific" parameters reflect some indication of the state of quantization and relative accuracy. StateofQuantization is a metric to evaluate the benefit of quantization for the network, and it is calculated using the compute and memory costs of each layer. For a neural network with $L$ layers, we define the compute cost of layer $l$ as the number of Multiply-Accumulate (MAcc) operations ($n_l^{MAcc}$), where $l = 0, \ldots, L$. Additionally, since RELEQ only quantizes weights, we define the memory cost of layer $l$ as the number of weights ($n_l^{w}$) scaled by the ratio of memory access energy ($E_{MemoryAccess}$) to MAcc computation energy ($E_{MAcc}$), which is estimated to be around 120×.

It is intuitive to consider that the sum of memory and compute costs scales linearly with the number of bits for each layer ($n_l^{bits}$). The $n_{max}^{bits}$ term is the maximum bitwidth among the predefined set of bitwidths that are available for the RL agent to pick from. Finally, the StateofQuantization ($State_{Quantization}$) is the normalized sum over all layers ($L$) that accounts for the total memory and compute costs of the entire network:

$$State_{Quantization} = \frac{\sum_{l=0}^{L} \left( n_l^{w} \cdot \frac{E_{MemoryAccess}}{E_{MAcc}} + n_l^{MAcc} \right) \cdot n_l^{bits}}{\sum_{l=0}^{L} \left( n_l^{w} \cdot \frac{E_{MemoryAccess}}{E_{MAcc}} + n_l^{MAcc} \right) \cdot n_{max}^{bits}} \quad (1)$$

Besides the potential benefits, captured by $State_{Quantization}$, RELEQ considers the StateofRelativeAccuracy to gauge the effects of quantization on the classification performance. To that end, the StateofRelativeAccuracy ($State_{Accuracy}$) is defined as the ratio of the current accuracy ($Acc_{Curr}$), with the current bitwidths for all layers during the RL training, to the accuracy of the network when it runs with full precision ($Acc_{FullP}$):

$$State_{Accuracy} = \frac{Acc_{Curr}}{Acc_{FullP}} \quad (2)$$
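A minimal sketch of equations (1) and (2) follows; the layer sizes are made-up illustration values, and only the ~120× energy ratio comes from the text's estimate:

```python
# Sketch of Eqs. (1) and (2). Layer sizes below are made up for illustration.
E_RATIO = 120.0        # E_MemoryAccess / E_MAcc, estimated at ~120x in the text
N_BITS_MAX = 8         # largest bitwidth the agent may pick

layers = [             # (num weights n_l^w, num MAcc ops n_l^MAcc, bitwidth n_l^bits)
    (2400, 1.2e6, 8),
    (4800, 2.4e6, 4),
    (9600, 4.8e6, 4),
]

def state_of_quantization(layers):
    """Eq. (1): bit-weighted memory+compute cost, normalized by the max-bit cost."""
    cost = lambda n_w, n_mac, bits: (n_w * E_RATIO + n_mac) * bits
    numer = sum(cost(w, m, b) for w, m, b in layers)
    denom = sum(cost(w, m, N_BITS_MAX) for w, m, _ in layers)
    return numer / denom

def state_of_accuracy(acc_curr, acc_full_precision):
    """Eq. (2): accuracy relative to the full-precision network."""
    return acc_curr / acc_full_precision

print(round(state_of_quantization(layers), 3))  # < 1.0 once layers drop below 8 bits
print(round(state_of_accuracy(0.912, 0.921), 3))
```

A network held entirely at the maximum bitwidth yields a StateofQuantization of exactly 1.0, so the metric directly measures how much cost the chosen bitwidths save relative to the 8-bit baseline.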


Figure 1. (a) Action spaces: (i) flexible action space (used in RELEQ); (ii) alternative action space with
restricted movement. (b) Reward shaping with three different formulations as functions of the optimization
objectives: state of relative accuracy and state of quantization: (i) proposed formulation; (ii) direct division; and
(iii) direct subtraction. The color palette shows the intensity of the reward. (c) Overview of RELEQ, which starts
from a pretrained network and delivers its corresponding deeply quantized network.

Given these embeddings of the observations from the environment, the RELEQ agent can take actions, described next.

Flexible Action Space
The RELEQ agent steps through each layer sequentially and chooses the bitwidth of a layer from a discrete set of bitwidths, which are provided as possible choices. Figure 1(a)-(i) shows the representation of the action space in which the set of bitwidths is {1, 2, 3, 4, 5, 6, 7, 8}. As depicted, the agent can flexibly choose to change the bitwidth of a given layer from any bitwidth to any other bitwidth. An alternative [Figure 1(a)-(ii)] that we experimented with was to only allow the RELEQ agent to increment, decrement, or keep the current bitwidth of the layer (B(t)). The experimentation showed that the convergence is much longer than with the aforementioned flexible action space, which is therefore used, as it encourages more exploration.

Asymmetric Reward Formulation for Accuracy
While the state space embedding focuses on the interplay between the layers and the action space provides flexibility, the reward formulation for RELEQ aims to simultaneously preserve accuracy and minimize the bitwidth of the layers. This requirement creates an asymmetry between accuracy and bitwidth reduction, which is a core objective of RELEQ. The following reward-shaping formulation provides the asymmetry and puts more emphasis on maintaining the accuracy, as illustrated with different color intensities in Figure 1(b)-(i). This reward uses the same terms StateQuantization and StateAcc from the "State Space Embedding" section.
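The two action-space designs can be contrasted in a few lines of pure Python. In the sketch below, the clamping of the restricted variant at the boundaries of the bitwidth range is an assumed edge-case behavior, not something spelled out in the article:

```python
BITWIDTHS = [1, 2, 3, 4, 5, 6, 7, 8]  # discrete set of candidate bitwidths

def flexible_actions(current_bitwidth):
    """Figure 1(a)-(i): jump from any bitwidth to any other."""
    return list(BITWIDTHS)

def restricted_actions(current_bitwidth):
    """Figure 1(a)-(ii): increment, decrement, or keep the current
    bitwidth, clamped to the valid range (clamping assumed)."""
    return sorted({max(min(current_bitwidth + d, BITWIDTHS[-1]),
                       BITWIDTHS[0]) for d in (-1, 0, 1)})
```

The flexible variant always exposes all eight choices, which is why it encourages broader exploration than the three-choice restricted variant.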

40 IEEE Micro
Reward Shaping:
  reward = 1.0 - (StateQuantization)^a
  if (StateAcc < th) then
    reward = -1.0
  else
    Accdiscount = StateAcc^(b / StateAcc)
    reward = reward * Accdiscount
  end if
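The shaping above translates directly into code. The sketch below is a literal pure-Python rendering, with the a, b, and th settings the article reports fixing after some trials (a = 0.2, b = 0.4, th = 0.4):

```python
A, B, TH = 0.2, 0.4, 0.4  # the a, b, th values reported in the article

def shaped_reward(state_q, state_acc):
    """Asymmetric reward shaping: favors low StateQuantization but
    discounts the reward as relative accuracy drops, and fully
    penalizes bitwidth combinations below the recoverability
    threshold th."""
    reward = 1.0 - state_q ** A
    if state_acc < TH:
        return -1.0  # accuracy loss deemed unrecoverable
    return reward * state_acc ** (B / state_acc)
```

An agent at full relative accuracy with deep quantization (small state_q) approaches the maximum reward, while any point with relative accuracy below 0.4 is pinned to -1.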

This formulation 1) produces a smooth reward gradient as the agent approaches the optimum quantization combination; and 2) the varying 2-D gradient speeds up the agent's convergence time. In the reward formulation, th is the threshold for relative accuracy below which the accuracy loss may not be recoverable, and those bitwidths are completely unacceptable. After some trials, we observe that a = 0.2, b = 0.4, and th = 0.4 provide reasonable convergence times and accuracy-quantization tradeoff; thus, we fixed them throughout the experiments. While Figure 1(b)-(i) shows the aforementioned formulation, Figure 1(b)-(ii) and (iii) depict two other alternatives. Figure 1(b)-(ii) is based on StateAcc / StateQuantization, while Figure 1(b)-(iii) is based on StateAcc - StateQuantization. In summary, based on our experiments, the formulation of Figure 1(b)-(i) offers faster convergence.

Figure 2. Action (bitwidth selection) probability evolution over training episodes for LeNet.

Network Architecture of Policy and Value Networks
Both policy and value are functions of state, so the state space, defined in the "State Space Embedding" section, is encoded as a vector and fed as input to an LSTM layer, which acts as the first hidden layer for both the policy and value networks. Apart from the LSTM, the policy network has two fully connected hidden layers of 128 neurons each, and the number of neurons in its final output layer is equal to the number of available bitwidths the agent can choose from, whereas the value network has two fully connected hidden layers of 128 and 64 neurons each. Based on our evaluations, the LSTM enables the RELEQ agent to converge almost 1.33× faster in comparison to a network with only fully connected layers.

PUTTING IT ALL TOGETHER: RELEQ IN ACTION
As discussed in the "RL for Deep Quantization of DNNs" section, state, action, and reward enable the RELEQ agent to maneuver the search space with an objective of quantizing the neural network with minimal loss in accuracy. We use linear quantization as proposed by Mishra et al.7 Figure 1(c) depicts the entire workflow for RELEQ, and this section gives an overview of how everything fits together in practice.

Learning the policy. Policy, in terms of neural network quantization, is to learn to choose the optimal bitwidth for each layer in the network. Figure 2 shows the evolution of the RELEQ agent's bitwidth selection probabilities for all layers of LeNet over training episodes, which reveals how the agent's policy changes with respect to selecting a bitwidth per layer. As indicated on the graph, the end results suggest the following quantization patterns: 2, 2, 2, 2 or 2, 2, 3, 2 bits. For the first two convolution layers, the agent ends up assigning the highest probability to two bits. For the third layer (FC1), the probabilities of two bits and three bits are very close. Finally, for the fourth layer (FC2), the agent again tends to select two bits, however with relatively smaller confidence compared to layers one and two. With these observations, we can infer that bitwidth probability profiles are not uniform across all layers. As such, the agent distinguishes between the layers, understands the sensitivity of the objective function to different layers, and accordingly chooses the bitwidths.
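To make the policy readout concrete: the policy network's final layer produces one logit per candidate bitwidth, and a standard softmax head (assumed here; the article only states that probabilities are produced) turns these into the per-layer selection probabilities plotted in Figure 2. The snippet below sketches just that final step in pure Python; the logits are made-up numbers, as the real ones come from the LSTM-plus-fully-connected network described above.

```python
import math

BITWIDTHS = [1, 2, 3, 4, 5, 6, 7, 8]  # the discrete action set

def action_probabilities(logits):
    """Numerically stable softmax over the final-layer logits,
    one logit per candidate bitwidth."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def most_likely_bitwidth(logits):
    """Greedy readout of the learned policy for one layer."""
    probs = action_probabilities(logits)
    return BITWIDTHS[probs.index(max(probs))]
```

For a trained LeNet policy, the logits for the convolution layers would peak at the entry for two bits, so the greedy readout returns 2.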


Figure 3. (a) Performance: quantization space and its Pareto frontier for (i) CIFAR-10; (ii) LeNet; (iii) SVHN;
and (iv) VGG-11. (b) Convergence: the evolution of reward and its basic elements: State of Relative Accuracy
for (i) CIFAR-10; (ii) SVHN. State of Quantization for (iii) CIFAR-10; (iv) SVHN, as the agent learns through the
episodes. The last plot (v) shows an alternative view by depicting the evolution of reward for MobileNet. The
trends are similar for the other networks.

EXPERIMENTAL RESULTS

Quantization Levels With RELEQ
Table 1 provides a summary of results with respect to layerwise quantization bitwidths achieved by RELEQ. At the onset of the agent's exploration, all layers are initialized to 8 bits. As the agent learns the optimal policy, each layer converges with a high probability to a particular bitwidth. As shown in the "Quantization Bitwidths" column of Table 1, RELEQ quantization policies show a spectrum of varying bitwidth assignments to the layers. The bitwidth for MobileNet varies with an irregular pattern, which averages to 6.43. ResNet-20 achieves mostly 2 and 3 bits, again with an irregular interleaving that averages to 2.81. In many cases, there is significant heterogeneity in the bitwidths, and a uniform assignment of the bits is not always the desired choice to preserve accuracy. These results demonstrate that RELEQ automatically distinguishes different layers and their varying importance with respect to accuracy while choosing their respective bitwidths. As shown in the "Accuracy Loss" column of Table 1, the deeply quantized networks with RELEQ have less than 0.30% loss in accuracy. To assess the quality of these bitwidth assignments, we conduct a Pareto analysis on the DNNs for which we could populate the search space.

Validation: Pareto Analysis
Figure 3(a) depicts the solution space for four benchmarks (CIFAR-10, LeNet, SVHN, and VGG-11). Each point on these charts is a unique combination of bitwidths that are assigned to the layers of the network. The boundary of the solutions denotes the Pareto frontier and is highlighted by a dashed line. The solution found by RELEQ is marked out using an arrow and lies on the desired section of the Pareto frontier, where the accuracy loss can be recovered through fine-tuning, which demonstrates the quality of the obtained solutions. It is worth noting that, as a result of the moderate size of these four networks, it was possible to enumerate the design space, obtain the Pareto frontier, and assess the RELEQ quantization policy for each network. However, such enumeration is infeasible for state-of-the-art deep networks (e.g., MobileNet, AlexNet), which further stresses the importance of the automation and efficacy of RELEQ.

Learning and Convergence Analysis
An appropriate evidence for the correctness of a formulated RL problem is the ability of the agent to consistently yield improved solutions. Figure 3(b) shows [through different quantities (i)-(v)] that RELEQ consistently yields solutions that increasingly preserve the accuracy (maximize rewards), while seeking to minimize the number of bits assigned to each layer (minimizing the state of quantization), and eventually converges to a rather stable solution. The trends are similar for the other networks.
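The Pareto analysis used for validation can be reproduced in a few lines: a bitwidth assignment is Pareto-optimal if no other assignment achieves both a lower average bitwidth and a higher accuracy. A minimal sketch in pure Python (the candidate points in the usage example are illustrative, not the article's measured data):

```python
def pareto_frontier(points):
    """points: list of (avg_bits, accuracy) tuples, one per bitwidth
    combination. Returns the subset not dominated by any other point,
    where lower bits and higher accuracy are both better."""
    frontier = []
    for bits, acc in points:
        dominated = any(b2 <= bits and a2 >= acc and (b2, a2) != (bits, acc)
                        for b2, a2 in points)
        if not dominated:
            frontier.append((bits, acc))
    return sorted(frontier)
```

For instance, with candidates [(2.0, 0.90), (3.0, 0.95), (3.0, 0.80), (4.0, 0.95)], only the first two survive: the third is dominated on accuracy, the fourth on bits.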

Table 2. Speedup and energy reduction with RELEQ over ADMM.8

Network | Dataset  | Technique: Bitwidths                              | RELEQ speedup on TVM | RELEQ speedup on Stripes | Energy improvement of RELEQ on Stripes
AlexNet | ImageNet | RELEQ: {8,4,4,4,4,4,4,8}; ADMM: {8,5,5,5,5,3,3,8} | 1.20× | 1.22× | 1.25×
LeNet   | MNIST    | RELEQ: {2,2,3,2}; ADMM: {5,3,2,3}                 | 1.42× | 1.86× | 1.87×

Execution Time and Energy Benefits With RELEQ
Deep quantization with conventional hardware: RELEQ's solution can be deployed on conventional hardware, such as general-purpose CPUs, to provide improvements. To manifest this, we evaluated RELEQ using TVM5 on an Intel Core i7-4790 CPU. Figure 4(a) shows the speedup for each benchmark using the TVM compiler. The baseline is the 8-bit runtime for inference. RELEQ's solution offers, on average, 2.2× speedup over the baseline as the result of merely quantizing the weights, which reduces the amount of computation and data transfer during inference.

Deep quantization with custom hardware accelerators: To further demonstrate the energy and performance benefits of the solution found by RELEQ, we evaluate it on Stripes,6 a custom accelerator designed for DNNs, which exploits bit-serial computation to support flexible bitwidths for DNN operations. Figure 4(b) shows the speedup and energy reduction benefits of RELEQ's solution on Stripes. The baseline is the 8-bit inference execution. RELEQ's solutions yield, on average, 2.0× speedup and an additional 2.7× energy reduction. MobileNet achieves 1.2× speedup, which is coupled with a similar degree of energy reduction. On the other end of the spectrum, ResNet-20 and LeNet achieve 3.0× and 4.0× benefits, respectively.

Figure 4. (a) Speedup with RELEQ for conventional hardware using TVM over the baseline run using 8 bits. (b) Energy reduction and speedup with RELEQ for Stripes over the baseline execution when the accelerator is running 8-bit DNNs.

Speedup and Energy Reduction Over ADMM
We compare RELEQ's solution in terms of speedup and energy reduction against ADMM,8 another procedure for finding quantization bitwidths. As shown in Table 2, RELEQ's solution provides 1.25× energy reduction and 1.22× average speedup over ADMM with Stripes for AlexNet, and the benefits are higher for LeNet.

RELATED WORK
RELEQ is the initial step in utilizing reinforcement learning to automatically find the bitwidths for the layers of DNNs such that their accuracy is preserved.

Reinforcement learning for automatic tuning: RL-based methods have attracted much attention within neural architecture search (NAS) after obtaining competitive performance on the CIFAR-10 dataset employing RL as the search strategy.9 Different RL approaches differ in how they represent the agent's policy. Zoph and Le9 used an RNN trained by policy gradient to sequentially sample a string that in turn encodes a neural architecture.

Aside from NAS, He et al.10 employed RL to prune existing architectures, where a policy gradient method is used to automatically find the compression ratio for different layers of a network.

Techniques for selecting bitwidths: Recent work ADMM8 runs a binary search to minimize the total square quantization error in order to decide the bitwidths for the layers. Then, they use an

September/October 2020
43
Machine Learning for Systems

iterative optimization technique for fine-tuning. Other work11 focused on binarized neural networks. There is a concurrent work, HAQ,12 which also uses RL in the context of quantization. The following highlights some of the differences. RELEQ uses a unique reward formulation and shaping that enables simultaneously optimizing for two objectives (accuracy and reduced computation with lower bitwidth) within a unified RL process. In contrast, HAQ utilizes accuracy in the reward formulation and then adjusts the RL solution through an approach that sequentially decreases the layer bitwidths to stay within a predefined resource budget. This approach also makes HAQ focused more toward a specific hardware platform, whereas we are after a strategy that can generalize. Additionally, we provide a systematic study of different design decisions and achieve significant performance gains across diverse benchmarks. The initial version of our work13 predates HAQ, and it is the first to use RL for quantization. HAQ was later published in CVPR, and we published the initial version of RELEQ in the NeurIPS ML for Systems Workshop.

CONCLUSION
This article set out to define the automated discovery of bitwidths for the layers while complying with the constraint of maintaining the accuracy. As such, this article offered the RL framework that was able to effectively navigate the huge search space of quantization and automatically quantize a variety of networks, leading to significant performance and energy benefits. The results suggest that a diligent design of our RL framework, which considers multiple concurrent objectives, can automatically yield high-accuracy, yet deeply quantized, networks.

ACKNOWLEDGMENTS
This work was supported in part by the National Science Foundation (NSF) under Awards CN#1703812, ECCS#1609823, and CCF#1553192; in part by the Semiconductor Research Corporation (SRC) under Contract #2019-SD-2884; in part by the Air Force Office of Scientific Research (AFOSR) Young Investigator Program (YIP) under Award #FA9550-17-1-0274; in part by the National Institute of Health (NIH) under Award #R01EB028350; in part by the Air Force Research Laboratory (AFRL) and the Defense Advanced Research Projects Agency (DARPA) under Agreements #FA8650-20-2-7009 and #HR0011-18-C-0020; and in part by gifts from Microsoft, Google, Qualcomm, and Xilinx. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of Microsoft, Google, Qualcomm, Xilinx, SRC, NSF, AFOSR, NIH, AFRL, DARPA, or the U.S. Government.

REFERENCES
1. Y. LeCun, Y. Bengio, and G. E. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.
2. S. Zhou, Z. Ni, X. Zhou, H. Wen, Y. Wu, and Y. Zou, "DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients," CoRR, 2016.
3. C. Zhu, S. Han, H. Mao, and W. J. Dally, "Trained ternary quantization," in Proc. ICLR, 2017.
4. J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," 2017, arXiv:1707.06347.
5. T. Chen et al., "TVM: End-to-end optimization stack for deep learning," 2017, arXiv:1802.04799.
6. P. Judd, J. Albericio, T. H. Hetherington, T. M. Aamodt, and A. Moshovos, "Stripes: Bit-serial deep neural network computing," in Proc. 49th Annu. IEEE/ACM Int. Symp. Microarchit., 2016, pp. 1–12.
7. A. K. Mishra, E. Nurvitadhi, J. J. Cook, and D. Marr, "WRPN: Wide reduced-precision networks," in Proc. ICLR, 2018.
8. S. Ye et al., "A unified framework of DNN weight pruning and weight clustering/quantization using ADMM," 2018, arXiv:1811.01907.
9. B. Zoph and Q. V. Le, "Neural architecture search with reinforcement learning," in Proc. ICLR, 2017.

10. Y. He, J. Lin, Z. Liu, H. Wang, L.-J. Li, and S. Han, "AMC: AutoML for model compression and acceleration on mobile devices," in Proc. 13th Eur. Conf. Comput. Vision, 2018.
11. I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, "Quantized neural networks: Training neural networks with low precision weights and activations," J. Mach. Learn. Res., vol. 18, 2017.
12. K. Wang, Z. Liu, Y. Lin, J. Lin, and S. Han, "HAQ: Hardware-aware automated quantization," Nov. 21, 2018, arXiv:1811.08886.
13. A. T. Elthakeb, P. Pilligundla, F. Mireshghallah, A. Yazdanbakhsh, and H. Esmaeilzadeh, "ReLeQ: A reinforcement learning approach for deep quantization of neural networks," Nov. 5, 2018, arXiv:1811.01704. [Online]. Available: http://arxiv.org/abs/1811.01704

Ahmed T. Elthakeb is currently working toward the Ph.D. degree in the Alternative Computing Technologies (ACT) Lab, University of California San Diego. His current research interests include developing cross-stack solutions to improve the performance and energy efficiency of machine learning algorithms.

Prannoy Pilligundla is currently working toward the Master's degree in computer science with the University of California San Diego. His research focus is on designing frameworks and cross-platform solutions for accelerating machine learning applications.

Fatemehsadat Mireshghallah is currently working toward the Ph.D. degree in computer science with the University of California San Diego. Her research focuses on deep learning and privacy.

Amir Yazdanbakhsh joined Google Brain as a Research Scientist in 2019 following a one-year AI residency. Yazdanbakhsh received the Ph.D. degree in computer science from Georgia Institute of Technology. His research interests include machine learning, computer architecture, and programming languages for hardware design.

Hadi Esmaeilzadeh is currently the inaugural holder of the Halicioglu Chair in Computer Architecture, with the rank of Associate Professor in computer science and engineering, with the University of California San Diego, where he was awarded early tenure. His research interests include the intersection of architecture, machine intelligence, system design, and software engineering.

Theme Article: Machine Learning for Systems

Enhancing Model Parallelism in Neural Architecture Search for Multidevice System

Cheng Fu and Huili Chen, University of California, San Diego
Yuandong Tian, Facebook AI Research
Zhenheng Yang, Facebook AI Research
Jishen Zhao, University of California, San Diego
Farinaz Koushanfar, University of California, San Diego

Abstract—Neural architecture search (NAS) finds favorable network topologies for better task performance. Existing hardware-aware NAS techniques target only the reduction of inference latency on single-CPU/GPU systems, and the searched models can hardly be parallelized. To address this issue, we propose ColocNAS, the first synchronization-aware, end-to-end NAS framework that automates the design of parallelizable neural networks for multidevice systems while maintaining a high task accuracy. ColocNAS defines a new search space with elaborated connectivity to reduce device communication and synchronization. ColocNAS consists of three phases: 1) offline latency profiling that constructs a lookup table of inference latency of various networks for online runtime approximation; 2) differentiable latency-aware NAS that simultaneously minimizes inference latency and task error; and 3) reinforcement-learning-based device placement fine-tuning to further reduce the latency of the deployed model. Extensive evaluation corroborates ColocNAS's effectiveness in reducing inference latency while preserving task accuracy.

Digital Object Identifier 10.1109/MM.2020.3004538
Date of publication 26 June 2020; date of current version 1 September 2020.

0272-1732 © 2020 IEEE. Published by the IEEE Computer Society.
DEEP NEURAL NETWORKS (DNNs) are increasingly adopted in various fields due to their unprecedented performance. Enormous architecture-level advancements have been proposed to improve the performance and efficiency of DNN execution.1 In the meantime, neural architecture search (NAS)2,3 enables a new line of research that aims to design efficient DNN topologies for target hardware platforms. Existing hardware-friendly NAS techniques4–6 aim to identify efficient models on single-CPU/GPU systems by introducing the notion of FLOPs, model size, or predicted latency into the searching process. However, in modern real-time and cloud computing scenarios, multiple computation components co-exist and are shared by different tasks or users.

There are two types of parallelism when executing a neural network (NN) on a multidevice platform: 1) model parallelism partitions operations into different parts and assigns them onto different devices; and 2) data parallelism distributes the inference data onto multiple devices and duplicates the entire model on each device. However, data parallelism has several limitations. 1) Data parallelism can increase the throughput of the application but not the end-to-end inference latency; thus, it is not suitable for real-time applications. 2) Data parallelism may incur severe communication overhead during the setup stage since the inputs need to be distributed to each available device. 3) Data parallelism is infeasible when the model size is too large to fit on a single device with constrained memory. With the increasing model size of state-of-the-art NNs, constraint 3) is becoming more stringent, especially for edge devices with very limited memory. In contrast, emerging multidevice computing systems and device-to-device communication techniques (e.g., 5G communication, NVLINKs, and PCIe Gen 4) can benefit model parallelism. Because our goal is to reduce application latency, model-level parallelism is a promising candidate.

The designed search spaces of conventional NAS can be classified into two main categories (see Figure 2). 1) Layer-wise search. Existing latency-oriented model searching methods4,5 explore model architectures that are analogous to the chain-like wired MobileNetV2 with the building block shown in Figure 2(a). This topology is hard to parallelize across multiple computation components due to data dependence. 2) Cell-structure search. Another line of methods2,7 proposes to search a building block, called a cell, on small tasks (e.g., CIFAR-10) and transfer the searched cell for training on a large task (e.g., ImageNet). The models in this space have certain levels of parallelism due to the concurrent execution of different blocks inside each cell. However, the concatenation (concat) operations between cells prevent further parallelism since they synchronize the execution and increase communication overhead. To overcome these constraints, ColocNAS is motivated to automate the design of suitable NN architectures for multidevice platforms to achieve low inference latency and high task accuracy.

Figure 1. (a) ColocNAS design overview and (b) roadmap between accuracy and latency of NAS methods. DP+NN refers to device placement (DP) policy and neural network (NN) architecture for deploying. ColocNAS significantly outperforms other NAS methods when running on multiple devices.

Designing an efficient model to achieve model parallelism across devices is a nontrivial task. It is challenging because of the following. C1) The searched architecture needs to keep a high accuracy, while reducing its runtime overhead. C2) While NNs with more complicated wiring patterns have a higher chance for model parallelism, predicting their execution time is challenging for latency-aware NAS. C3) The actual latency of a specific network architecture in a multidevice


Figure 2. Example of basic building blocks in three types of search space. (a) Layer-wise searched model. (b) Cell-based searched model. (c) New search space proposed in ColocNAS. The yellow circle refers to the concatenation node. b_n^i denotes the nth block in the ith cell (i.e., layer).

setting depends on the device placement policy, which is unknown ahead of time.

To tackle these challenges, we propose ColocNAS, a differentiable NAS framework that effectively searches network architectures for the given hardware environment. ColocNAS resolves all the aforementioned challenges (C1–C3) using the following solutions [as shown in Figure 2(a)]. S1) ColocNAS designs a new search space with less synchronization and communication overhead by exploring elaborate connectivity patterns in the NN. S2) Based on the observation that similar computation graphs yield closer latency, ColocNAS leverages offline latency profiling and k-nearest neighbors (kNN) with graph similarity to predict the latency of searched models in the online phase. S3) ColocNAS employs uniform workload distribution as the expert placement policy to facilitate offline latency profiling and to guide online latency-aware NAS. This policy leverages the intrinsic structural features of the DNN in the new search space. After online NAS, the placement of the searched model is further fine-tuned using a reinforcement learning (RL)-based algorithm to reduce its runtime on the target platform.

Our framework offers a holistic solution to designing efficient and accurate NNs on multidevice systems. Extensive experiments show that ColocNAS reduces inference latency by a large margin, while achieving a competitive accuracy and latency as shown in Figure 1(b). Our work sheds light on a new dimension (device-level parallelism) of hardware-aware NAS algorithms.

ColocNAS DESIGN
ColocNAS uses differentiable neural architecture search by combining two gradient-based methods4,7 to solve the problem of topology design. We formulate the neural architecture search problem as a nonconvex optimization problem as shown in

min_{a∈A} min_{w_a} L(a, w_a).   (1)

A promising architecture yields small latency and high task accuracy. Here, A is the new search space proposed in ColocNAS, a ∈ A is a set of continuous variables that specify a possible architecture, and w_a is the weight parameter of the network. L is the loss function that penalizes both accuracy degradation and the increase of inference latency. The design flow of ColocNAS is detailed in Algorithm 1.

Algorithm 1. ColocNAS Design Flow.
INPUT: Prototyping Hardware Devices (t); Device Number (N_t); ArchTable Size (N_arch); Cell Number (N_c); Possible Operations (Ops).
OUTPUT: Fine-Tuned Device Placement Policy P_RL; Model M_out.
1: Offline Profile:
   M_i ← Random_Generate(N_t, Ops, N_t, N_c)
   P_i ← Expert_Placement_Policy(M_i)
   ArchTable ← Hardware_Profile(M_i, P_i, t), for i = 1, ..., N_arch
2: Online Searching and Training:
   M_arch ← Searching(ArchTable, Ops, N_t, N_c)
   M_out ← Training_Model(M_arch)
3: RL Fine-Tuning:
   P_RL ← RL_FineTuning(M_out, t)
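The three phases of Algorithm 1 can be read as the following control-flow skeleton. Everything here is a stub: the helper callables (`random_architecture`, `expert_placement`, `profile_latency`, `search`, `train`, `rl_finetune`) are hypothetical placeholders for the components the article describes, shown only to make the data flow between the phases explicit.

```python
def coloc_nas(n_arch, ops, n_devices, n_cells, random_architecture,
              expert_placement, profile_latency, search, train,
              rl_finetune):
    """Skeleton of Algorithm 1: offline profiling, online search
    and training, then RL-based placement fine-tuning."""
    # Phase 1: offline profiling builds the ArchTable latency lookup.
    arch_table = []
    for _ in range(n_arch):
        arch = random_architecture(ops, n_devices, n_cells)
        placement = expert_placement(arch)  # uniform workload split
        arch_table.append((arch, profile_latency(arch, placement)))
    # Phase 2: online differentiable search guided by the table.
    model = train(search(arch_table, ops, n_devices, n_cells))
    # Phase 3: RL fine-tunes the device placement of the final model.
    return model, rl_finetune(model)
```

The key design point visible even in the skeleton is that the expensive hardware profiling happens once, offline, and the online search only consults the resulting table.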

Search Space
Previous works search models with layer-wise or cell-based structures, as shown in Figure 2(a) and (b). The layer-wise structure is suitable for a single CPU device, as the identified model topology is very regularized and its execution is fast due to data locality. However, the resulting chain-like model is not suitable when multiple devices are available for parallelism, as each layer can only start execution when the outputs of all its previous layers are computed. For the cell-structure search space, its evaluation can be parallelized. Yet, the concatenation at the end of each cell works as a synchronization unit that collects all the results of the blocks inside the cell before starting the computation in the next cell. This synchronization incurs severe communication overhead (the red arrow in Figure 2) and hinders computation parallelism. As such, we propose a new search space that reduces the participants of the synchronization unit in order to minimize the delay of computation, as shown in Figure 2(c). ColocNAS uses the same definition of search blocks and cells while introducing a new block connectivity pattern by "duplicating" the concatenation nodes.

The traditional cell consists of an ordered sequence of N blocks where each block has two input edges (i, j), as shown in Figure 2(b). An edge can be selected from two sources: i) previous blocks in the current cell; ii) the output of the previous two cells. Each edge is associated with an operation o^(i,j) that is determined after the search is performed. Assuming the inputs of the block are x_1 and x_2, the output of the block is computed as follows:

b_j = Σ_{i=1}^{2} o^(i,j)(x_i).   (2)

The intermediate results from the N blocks are concatenated together to form a single cell output x^i. Since we identify that this concatenation node is the bottleneck for achieving device-level parallelism, the new search space is defined as follows. Instead of reducing toward a single concatenation node in each cell, the blocks of the ith cell will be evenly connected to C different concatenation nodes, resulting in outputs x^i = [x_1^i, x_2^i, ..., x_C^i]^T. For the nth block in the ith cell, we define block connectivity parameters b_n^(i-1) as the probability over each possible connection between the current block and any concatenation node in the (i-1)th layer. They are used to compute the weighted-sum input x_n^(i-1) from the previous layer i-1 as

x_n^(i-1) = softmax(b_n^(i-1)) · x^(i-1) = Σ_{c=1}^{C} [exp(b_{n,c}^(i-1)) / Σ_{c'=1}^{C} exp(b_{n,c'}^(i-1))] x_c^(i-1).   (3)

Here, b_n^(i-1) is a vector of size 1-by-C, and x_n^(i-2) is computed from b_n^(i-2) using the same method as (3).

Note that ColocNAS explores optimal network architectures by solving the bilevel optimization problem defined in (1). The optimization variable is a = {b, ops}, where b is the block connectivity parameter and the softmax of ops gives the probability vectors over the predefined set of DL operations for every possible connection (the candidate operations are the same as given by Cai et al.5). At the end of the search, we choose the connectivity of the nth block in the ith cell from the previous two cells as x_c^(i-1) with c = arg max_c b_{n,c}^(i-1) and x_c^(i-2) with c = arg max_c b_{n,c}^(i-2).

Through comparison experiments, we find that using Gumbel softmax4 over b_c^i in (3) can improve both the accuracy and latency of the searched model compared to the softmax function.

Given x_n^(i-1) and x_n^(i-2) as the computed inputs from the previous two cells for the ith cell, the output of each block b_n^i is the weighted sum over ops7 during the architecture search.

Offline Hardware-Bounded Latency Profiling
ColocNAS integrates latency-aware, gradient-based NAS for the target devices into the searching process to make low-latency models preferable. Recall that the optimization variable a in our search space is a set of probabilistic variables; thus, the inference latency given a specific choice of a is also a random variable lat. We use the expectation of the latency variable to assess the quality of the architecture choice a. However, computing the latency expectation over the architecture probability E_lat(p_a) on the given devices is challenging because of the following. 1) Measuring the real latency value of each architecture during the searching process


is infeasible due to the prohibitive latency cost. 2) The latency of a complex graph is hard to predict while considering the delay incurred by data movement. 3) The latency value of an architecture choice is highly dependent on the placement policy, which makes it hard to compare the optimal latency values of different network topologies. To resolve the above problems, the first stage of ColocNAS is an offline latency profiling step that measures the inference time of diverse network architectures on the target devices.

Besides, to disentangle the complexity of model placement on the given hardware, we use a predefined expert placement policy for all searched architectures. In particular, our expert placement policy uniformly distributes each concatenation node and the associated operations onto all existing devices. By doing so, we reduce the overhead of synchronization and cross-device communication compared to the traditional cell-based search space, as shown in Figure 2(c).

It is also intractable to profile the latency values of all the architectures in the entire search space (~10^20 models). Leveraging the observation that similar NNs yield closer runtime latency, we propose a method to approximate the latency of a complicated wired NN using kNN. We randomly generate architectures and profile their latency values on the target hardware using the expert placement policy to obtain a lookup table ArchTable. The distance between two NNs can be measured by using graph edit distance.8 To approximate the latency expectation over an architecture parameter, we first convert a into the corresponding probability variable p_a, which includes the following two parts:

p_ops = Softmax(ops)    (4)

p_b = GumbelSoftmax(b).    (5)

Given a specific choice of the probability parameter p_a = {p_b, p_ops}, we sample N neural network architectures arch_1, arch_2, ..., arch_N. The expectation value of the latency variable is then approximated as

E_lat(p_a) ≈ (1/N) Σ_{n=1}^{N} kNN(arch_n, ArchTable)    (6)

where the kNN(·) function returns the average latency of the top-k closest architecture latencies for input arch_n.

Online Latency-Aware Architecture Searching
ColocNAS aims to find a network architecture with low latency and high task accuracy. As such, we define the loss function of ColocNAS's online searching phase as follows:

L(a, w_a) = CE(a, w_a) + λ_1 L_lat    (7)

L_lat = exp(−||p_a − p̂_a||²) · E_lat(p_a).    (8)

Here, CE(a, w_a) is the cross-entropy loss given the architecture parameter a and the weights w_a. λ_1 is a scaling factor within the range [0, 1] that controls the tradeoff between accuracy and latency. E_lat(p_a) is the expected latency of the specific architecture parameter obtained from (6). For the latency term, we add a regularization term exp(−||p_a − p̂_a||²), where p̂_a = (1/N) Σ_{n=1}^{N} arch_n, to make the latency term differentiable with respect to the architecture parameters a. If the sample mean value of the architecture probability p̂_a is far from the true one p_a, the corresponding latency term will be weighted less. Note that after the model is searched, we train the model from scratch to obtain the final accuracy.

RL-Based Device Placement Fine Tuning
Once the model is searched and trained, it can be directly deployed onto the target hardware systems using the expert placement policy that uniformly distributes the workload to each available device. However, this heuristic-based device placement policy may not be optimal considering the heterogeneous property of the underlying computing platforms. To further optimize the latency of the searched architecture G, the last step of ColocNAS leverages policy gradient for device placement on the given hardware. We apply the RL-based method proposed by Mirhoseini et al.,9 where the policy network is an attentional autoencoder that takes the NN graph as input. Each input in the encoding sequence is the concatenated embedding of the operation type, connectivity, and output shape of each block in the graph. The decoder sequentially generates the placement policy P.
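The sampling-plus-lookup estimate in (6) can be sketched in a few lines. This is a toy stand-in: the list-of-dicts ArchTable layout, the field names, and the default per-op mismatch distance are all hypothetical (the paper measures architecture similarity with graph edit distance8).

```python
import heapq

def knn_latency(arch, arch_table, k=5, dist=None):
    """kNN(arch_n, ArchTable): mean latency of the k profiled architectures
    closest to `arch`. Default distance is a toy per-op mismatch count."""
    d = dist or (lambda a, b: sum(x != y for x, y in zip(a, b)))
    nearest = heapq.nsmallest(k, arch_table, key=lambda e: d(arch, e["arch"]))
    return sum(e["latency"] for e in nearest) / len(nearest)

def expected_latency(sampled_archs, arch_table, k=5):
    """E_lat(p_a) ~ (1/N) * sum_n kNN(arch_n, ArchTable), averaged over the
    N architectures sampled from the probability variable p_a."""
    return sum(knn_latency(a, arch_table, k) for a in sampled_archs) / len(sampled_archs)
```

With a profiled table in hand, expected_latency supplies the E_lat(p_a) value that the search loss in (7) consumes.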

50 IEEE Micro
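Read literally, (7) and (8) combine into a scalar loss such as the sketch below. This is a plain-Python illustration with the probabilities flattened into vectors; a real implementation would keep p_a and p̂_a as differentiable tensors so the latency term can backpropagate into the architecture parameters.

```python
import math

def latency_term(p_a, p_hat, e_lat):
    """L_lat = exp(-||p_a - p_hat||^2) * E_lat(p_a), Eq. (8): the further the
    sample mean p_hat drifts from p_a, the less the latency term weighs."""
    sq_dist = sum((p - q) ** 2 for p, q in zip(p_a, p_hat))
    return math.exp(-sq_dist) * e_lat

def search_loss(ce, p_a, p_hat, e_lat, lam=0.5):
    """L(a, w_a) = CE(a, w_a) + lambda_1 * L_lat, Eq. (7); lam in [0, 1]
    trades task accuracy against latency."""
    return ce + lam * latency_term(p_a, p_hat, e_lat)
```

Setting lam = 0 recovers a purely accuracy-driven search, which is exactly the ablation reported later in the experiments.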
The goal is to learn the policy π(P|G, θ) to minimize the objective J(θ) = E_{P∼π(P|G,θ)}[R(P) | G]. Here, R(P) is the expectation of the reward (execution time).
The policy gradients are computed via the REINFORCE equation10 as follows:

∇_θ J(θ) = E_{P∼π(P|G,θ)}[R(P) · ∇_θ log π(P|G, θ)].    (9)

By sampling K placements following the placement policy probability P ∼ π(·|G, θ), the expectation of the reward R(P) can be estimated and the policy gradient is computed as

∇_θ J(θ) ≈ (1/K) Σ_{i=1}^{K} (R(P_i) − B) · ∇_θ log π(P_i|G, θ).    (10)

Here, B is the mean value of the rewards computed from the K sampled placements. ColocNAS leverages the intrinsic structure of the architecture in the new search space and performs device placement only for blocks and concatenation nodes [see Figure 2(c)]. As a result, the training of the policy network π(P|G, θ) converges much faster compared to the original one by Mirhoseini et al.9

EXPERIMENTS
In this section, we demonstrate ColocNAS's effectiveness of model-level parallelism and accurate classification performance on CIFAR-10 and ImageNet tasks compared to the state-of-the-art NAS methods.

Experimental Setup
We run ColocNAS on different device configurations and evaluate the searched architectures in terms of their test accuracy and inference latency.

Searching phase setup. ColocNAS searches two types of cells, namely, a normal cell and a reduction cell. We put the reduction cell at the 1/3 and 2/3 locations of the neural architecture. After each reduction cell, we double the number of output channels in the network. The operations searched in the reduction cell have stride = 2.
The model used in the searching phase has eight cells. Larger networks can be built by stacking multiple normal cells in between the reduction cells. We alternately update the two variables (a and w_a) to solve the bilevel optimization problem in (1). In particular, we use second-order approximations to update the architecture parameter a. All possible operations in the search space and relevant searching details are the same as DARTS.7

Hardware platforms and corresponding search space. To prove the generality of ColocNAS on different platforms, we test our method on two types of GPUs and six different device configurations: 1) 2/3/4 Tesla K80 GPUs with Gen3 PCIe device connections and 2) 2/3/4 Tesla V100 GPUs with NVLink connections. The device placement algorithm and latency profiling for the cell structure are implemented in TensorFlow v1.14. For each device setting, we set the number of concatenation nodes to be the same as the number of devices to facilitate the expert placement policy. Since the device configuration determines the search space, as shown in Figure 2, we name the search spaces with 2/3/4 GPUs ColocNAS-SP-2/ColocNAS-SP-3/ColocNAS-SP-4. We denote the searched architectures as ColocNAS-b4c2/ColocNAS-b6c3/ColocNAS-b4c4, which have 4/6/4 blocks and 2/3/4 concatenation nodes in each cell for hardware settings with 2/3/4 GPUs, respectively.

Latency Prediction Results
Recall that ColocNAS applies the kNN method to estimate end-to-end latency. We sample 10 000 architectures that are mapped using the expert placement policy to the available GPUs on both the CIFAR-10 (32×32×3) and ImageNet (224×224×3) datasets. We set the total number of cells to be 14 and 20 for ImageNet and CIFAR-10, respectively. For CIFAR-10 and ImageNet evaluation, we added one/two initial stem cells to downscale the raw images, respectively.
Figure 3 shows the effectiveness of our kNN-based latency predictor. The latency is profiled in search space ColocNAS-SP-4 with ImageNet input (batch size of 32) on 4 Tesla K80 GPUs. We use 5% of the profiled data for testing. The profiling takes 2.5 days on 8 GPUs. The RMSE is 0.22 ms on the test data, which is adequate for the latency prediction of complicated graphs. Note that this offline profiling is a one-time process that can be reused to search different NN architectures when tuning the λ_1 parameter in (7).
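The estimator in (10) is only a few lines of code. The sketch below treats the sampled placements, their rewards, and the score function grad_logp as given; in the paper these come from the attentional encoder–decoder policy network, which is elided here.

```python
def reinforce_gradient(placements, rewards, grad_logp):
    """grad_theta J ~ (1/K) * sum_i (R(P_i) - B) * grad_theta log pi(P_i|G, theta),
    with the baseline B set to the mean reward of the K samples (Eq. 10).

    placements: K sampled placements P_i
    rewards:    R(P_i) for each sample (execution time in the paper)
    grad_logp:  callable returning the score vector for one placement
    """
    K = len(placements)
    baseline = sum(rewards) / K
    dim = len(grad_logp(placements[0]))
    grad = [0.0] * dim
    for P, R in zip(placements, rewards):
        score = grad_logp(P)
        for j in range(dim):
            grad[j] += (R - baseline) * score[j] / K
    return grad
```

Subtracting the mean-reward baseline B does not bias the estimator but reduces its variance, which is one reason modest K suffices in practice.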


Results of Latency-Aware NAS

CIFAR-10 Benchmark. After the architecture is searched, we train the weight parameters of the resulting network for 1500 epochs using a batch size of 96 and the Adam optimizer. As shown in Table 1, ColocNAS maintains a competitive test error rate (2.36% on average) across different hardware settings. In the meantime, ColocNAS reduces the inference latency by a large margin (1.24×/1.34×/1.50× faster on 2/3/4 Tesla K80 GPUs compared to DARTS 2-GPU) on multiple devices, thanks to the parallelizable search space and the differentiable latency guidance. ColocNAS uses a search space with elaborated connectivity and an online kNN-based latency lookup; thus, its search cost is higher than that of DARTS (see Table 1). Empirical results show that ColocNAS is still among the fastest NAS techniques due to our gradient-based method and the CIFAR-10 proxy.

ImageNet Results. Using the normal and reduction cells found by the latency-aware searching step, we stack 14 cells to build an NN for ImageNet evaluation. The network is trained for 360 epochs with a power cosine learning rate and the SGD optimizer with Nesterov momentum. The comparison results are shown in Table 2. Compared to the state-of-the-art NAS methods, our model shows a competitive error rate on the test dataset (25.47% on average). Furthermore, our model outperforms all other state-of-the-art

Figure 3. Effectiveness of the kNN-based end-to-end latency predictor on ImageNet data with 4 Tesla K80 GPUs. We determine K = 5 using cross-validation, which yields an RMSE of 0.22 ms on the test set.
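The K = 5 choice in the Figure 3 caption can be reproduced with a small cross-validation sweep; this sketch assumes a hypothetical (arch, latency) record layout for the profiled data and a caller-supplied distance function.

```python
import math

def choose_k(train, val, dist, k_values=range(1, 11)):
    """Pick K for the kNN latency predictor by minimizing RMSE on a held-out
    split of profiled architectures (the paper holds out 5% for testing)."""
    def predict(arch, k):
        near = sorted(train, key=lambda e: dist(arch, e[0]))[:k]
        return sum(lat for _, lat in near) / k

    def rmse(k):
        errs = [(predict(a, k) - lat) ** 2 for a, lat in val]
        return math.sqrt(sum(errs) / len(errs))

    return min(k_values, key=rmse)
```

The same held-out split that validates the predictor's 0.22-ms RMSE can drive this sweep, so choosing K adds no extra profiling cost.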
Table 1. Performance comparison between ColocNAS and the state-of-the-art NAS methods on classification. The test error and inference latency on CIFAR-10 benchmarks are shown here. For cell-based methods, the latency is tested by reconstructing the model using open-source code from DARTS.7 "-" indicates the model for CIFAR-10 is not publicly available. We use a batch size of 32 for all latency evaluation. The latency of ColocNAS is tested on 2/3/4 Tesla K80 GPUs after RL fine-tuning, and "DARTS 2-GPU" is tested on 2 GPUs using RL placement. Other baselines are tested on a single Tesla K80 GPU.

Model           | Search space | Search method | Search cost (GPU hours) | # Params | Test error (%) | Latency (per image) | Normalized reduction (%)
DenseNet-BC     | -            | manual        | -    | 25.6M | 3.46 | 14.8 ms | 0.0
NAONet          | cell         | gradient      | 4.8K | 10.6M | 3.18 | 8.4 ms  | -43.2
PNAS            | cell         | SMBO          | 6K   | 3.2M  | 3.41 | 3.14 ms | -78.8
Amoeba-A [11]   | cell         | evolution     | 76K  | 3.2M  | 3.12 | 3.10 ms | -79.1
Amoeba-B [11]   | cell         | evolution     | 76K  | 2.8M  | 2.55 | 3.04 ms | -79.4
DARTS (2nd) [7] | cell         | gradient      | 24   | 3.2M  | 2.76 | 3.12 ms | -78.9
NASNet-A [2]    | cell         | RL            | 48K  | 3.3M  | 2.65 | 3.24 ms | -78.1
DARTS 2-GPU [7] | cell         | gradient      | 24   | 3.2M  | 2.76 | 3.01 ms | -79.7
ColocNAS-b4c2   | new space    | gradient      | 41   | 3.4M  | 2.31 | 2.43 ms | -83.6
ColocNAS-b6c3   | new space    | gradient      | 49   | 3.5M  | 2.35 | 2.24 ms | -84.9
ColocNAS-b4c4   | new space    | gradient      | 51   | 3.4M  | 2.42 | 2.01 ms | -86.4
Table 2. Transferability classification error on ImageNet benchmarks. For NAS in the layer-wise search space, the latency is tested by using open-source evaluation code from the papers. "-" indicates the number is not shown in the original paper or the model is not publicly available. The latency results in the table are tested on Tesla K80 GPUs. The RL setting for the baselines is evaluated on 2 GPUs. With more GPUs, the baselines would yield the same latency, as the models can hardly be parallelized across devices.

Model                    | Search space | Search method | Search cost (GPU hours) | # Params | # FLOPs | Test accuracy (Top-1 %) | Latency (-RL/+RL) | Normalized reduction (%)
MobileNetV2              | -            | -         | -    | 3.4M | 300M | 72.0 | 3.51/3.51 ms | 0.0
ShuffleNetV2 (1.5×)      | -            | -         | -    | 3.5M | 299M | 72.6 | 3.32/3.32 ms | -5.4
Amoeba-A [11]            | cell         | evolution | 756K | 5.1M | 555M | 74.5 | 3.13/3.02 ms | -14.0
NASNet-A [2]             | cell         | RL        | 48K  | 5.3M | 564M | 74.0 | 3.22/3.10 ms | -11.7
DARTS (2nd) [7]          | cell         | gradient  | 24   | 4.9M | 595M | 73.1 | 3.09/2.98 ms | -15.1
MnasNet-65 [6]           | layer-wise   | RL        | 91K  | 3.6M | 270M | 73.0 | -            | -
MnasNet [6]              | layer-wise   | RL        | 91K  | 4.2M | 317M | 74.0 | -            | -
FBNet-A [4]              | layer-wise   | gradient  | 216  | 4.3M | 249M | 73.0 | 2.93/2.93 ms | -16.5
FBNet-B [4]              | layer-wise   | gradient  | 216  | 4.5M | 295M | 74.1 | 3.41/3.41 ms | -2.8
ProxylessNAS-mobile [5]  | layer-wise   | gradient  | 200  | 6.9M | -    | 74.6 | 3.95/3.95 ms | +12.5
ProxylessNAS-GPU [5]     | layer-wise   | gradient  | 200  | 7.1M | -    | 75.1 | 2.80/2.80 ms | -20.2
ColocNAS-b4c2            | new space    | gradient  | 41   | 4.7M | 546M | 74.9 | 2.67/2.51 ms | -28.5
ColocNAS-b6c3            | new space    | gradient  | 49   | 4.8M | 572M | 74.6 | 2.41/2.23 ms | -36.5
ColocNAS-b4c4            | new space    | gradient  | 51   | 4.8M | 566M | 74.1 | 2.18/2.07 ms | -41.0

NAS methods in terms of the inference latency when deploying in a multiple-device hardware environment. With 2/3/4-GPU settings, ColocNAS achieves 1.57×/1.77×/1.91× execution speedup compared to ProxylessNAS-mobile on Tesla K80. For Tesla V100 GPUs, ColocNAS yields a similar speedup compared to DARTS and ProxylessNAS. Applying the RL placement, we directly tested the models searched for Tesla K80 on the new hardware systems; the latencies per image for ColocNAS-b4c2/b6c3/b4c4 are 0.66/0.61/0.54 ms on 2/3/4 Tesla V100 GPUs, which are 1.12×/1.21×/1.35× faster than DARTS (0.74 ms) and 1.28×/1.39×/1.57× faster than ProxylessNAS-mobile (0.85 ms). This corroborates that ColocNAS preserves a high task accuracy while reducing the latency overhead. We also notice that, without the guidance of the hardware-bounded latency estimation [λ_1 = 0 in (7)], the latency incurred by ColocNAS increases by +0.23 ms (ColocNAS-b4c4) on Tesla K80 GPUs. Based on our observation, the latency regularization term encourages computation-efficient operations on the timing-critical paths in the model. In addition, it explicitly helps to reduce the number of connections across blocks compared to model searching without latency guidance.


Performance Discussion. One can observe the trade-off between accuracy and latency from Tables 1 and 2. ColocNAS aims to achieve hardware-aware NAS for fast model inference by reducing the communication overhead between multiple computing platforms. However, the reduction in communication would incur accuracy degradation, since the information exchange between the sub-graphs on each device is reduced.
Tables 1 and 2 show that the latency improvement of ColocNAS is sublinear with the number of computing devices. This is caused by the nonnegligible communication overhead between devices and the system control overhead.

CONCLUSION
In this article, we propose ColocNAS, an end-to-end, differentiable neural architecture searching framework that is aware of the hardware-bounded latency. ColocNAS introduces an innovative parallelizable search space that reduces synchronization/device communication for model parallelism. For the first time, ColocNAS is able to maximize the device utilization of multiple computing platforms by decoupling the task of hardware-aware NAS into three steps: offline hardware latency profiling, online latency-aware searching, and RL-based device placement fine-tuning. As a result, ColocNAS automates the design of neural architectures that achieve high task accuracy and low runtime latency for a given hardware setting. We perform extensive experiments and corroborate that ColocNAS reduces the inference latency by a large margin in a multidevice environment while maintaining a competitive task accuracy compared to the state-of-the-art hardware-aware NAS methods.

ACKNOWLEDGMENTS
This work was supported by NSF 1829525 and by the SRC/DARPA Center for Research on Intelligent Storage and Processing-in-Memory.

REFERENCES
1. V. Sze, Y.-H. Chen, T.-J. Yang, and J. S. Emer, "Efficient processing of deep neural networks: A tutorial and survey," Proc. IEEE, vol. 105, no. 12, pp. 2295–2329, Dec. 2017.
2. B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, "Learning transferable architectures for scalable image recognition," in Proc. IEEE Conf. Comput. Vision Pattern Recognit., 2018, pp. 8697–8710.
3. B. Zoph and Q. V. Le, "Neural architecture search with reinforcement learning," in Proc. 5th Int. Conf. Learn. Representations, 2017. [Online]. Available: https://openreview.net/forum?id=r1Ue8Hcxg
4. B. Wu et al., "FBNet: Hardware-aware efficient ConvNet design via differentiable neural architecture search," in Proc. IEEE Conf. Comput. Vision Pattern Recognit., 2019, pp. 10734–10742.
5. H. Cai, L. Zhu, and S. Han, "ProxylessNAS: Direct neural architecture search on target task and hardware," in Proc. Int. Conf. Learn. Representations, 2019. [Online]. Available: https://arxiv.org/pdf/1812.00332.pdf
6. M. Tan et al., "MnasNet: Platform-aware neural architecture search for mobile," in Proc. IEEE Conf. Comput. Vision Pattern Recognit., 2019, pp. 2820–2828.
7. H. Liu, K. Simonyan, and Y. Yang, "DARTS: Differentiable architecture search," in Proc. 7th Int. Conf. Learn. Representations, 2019. [Online]. Available: https://openreview.net/forum?id=S1eYHoC5FX
8. X. Gao, B. Xiao, D. Tao, and X. Li, "A survey of graph edit distance," Pattern Anal. Appl., vol. 13, no. 1, pp. 113–129, 2010.
9. A. Mirhoseini et al., "Device placement optimization with reinforcement learning," in Proc. 34th Int. Conf. Mach. Learn., vol. 70, 2017, pp. 2430–2439.
10. R. J. Williams, "Simple statistical gradient-following algorithms for connectionist reinforcement learning," Mach. Learn., vol. 8, no. 3/4, pp. 229–256, 1992.
11. E. Real, A. Aggarwal, Y. Huang, and Q. V. Le, "Regularized evolution for image classifier architecture search," in Proc. AAAI Conf. Artif. Intell., 2019, vol. 33, pp. 4780–4789.

Cheng Fu is currently working toward the Ph.D. degree with the University of California, San Diego. His research interests include efficient ML algorithms for hardware and machine learning for binary analysis. His advisor is Professor Jishen Zhao. He interned at Facebook AI Research and worked on an automated machine learning project. Contact him at cfu@eng.ucsd.edu.

Huili Chen is currently working toward the Ph.D. degree with the University of California, San Diego. Her research interests span diverse aspects of machine learning (ML) systems, including intellectual property (IP) protection, adversarial attack detection, and the application of advanced ML algorithms to solve long-standing problems in other domains. Contact her at huc044@ucsd.edu.

Zhenheng Yang is currently a Research Scientist with Facebook AI. His research interests include computer vision and machine learning, specifically weakly supervised learning, 3-D perception, and activity reasoning. Yang received the Ph.D. degree from the University of Southern California. Contact him at zhenhengy@gmail.com.

Farinaz Koushanfar is currently a Professor and Henry Booker Faculty Scholar with the Department of Electrical and Computer Engineering, University of California, San Diego, where she directs the Adaptive Computing and Embedded Systems Lab. She is the Co-Founder and Co-Director of the University of California, San Diego, Center for Machine-Integrated Computing and Security. Koushanfar received the Ph.D. degree from the University of California, Berkeley. She is a Fellow of the Kavli Foundation Frontiers of the National Academy of Engineering. Contact her at farinaz@ucsd.edu.

Yuandong Tian is currently a Research Scientist and Manager with Facebook AI Research. His research interests include the theory and practice of deep learning, sequential decision making, and computer vision. Tian received the Ph.D. degree from the Robotics Institute, Carnegie Mellon University. Contact him at yuandong@fb.com.

Jishen Zhao is currently an Assistant Professor with the Computer Science and Engineering Department, University of California, San Diego. Her research spans and stretches the boundary between computer architecture and system software, machine learning, and system codesign. Zhao received the Ph.D. degree in computer science and engineering from the Pennsylvania State University. Contact her at jzhao@eng.ucsd.edu.

Theme Article: Machine Learning for Systems

TSA-NoC: Learning-Based Threat Detection and Mitigation for Secure Network-on-Chip Architecture
Ke Wang, Hao Zheng, and Ahmed Louri
George Washington University

Abstract—Networks-on-chip (NoCs) are playing a critical role in modern multicore


architecture, and NoC security has become a major concern. Maliciously implanted
hardware Trojans (HTs) inject faults into on-chip communications that saturate the
network, resulting in the leakage of sensitive data via side channels and significant
performance degradation. While existing techniques protect NoCs by detecting and
isolating HT-infected components, they inevitably incur occasional inaccurate detection
with considerable network latency and power overheads. We propose TSA-NoC, a learning-
based design framework for secure and efficient on-chip communication. The proposed
TSA-NoC uses an artificial neural network for runtime HT-detection with higher accuracy.
Furthermore, we propose a deep-reinforcement-learning-based adaptive routing design for
HT mitigation with the aim of minimizing network latency and maximizing energy efficiency.
Simulation results show that TSA-NoC achieves up to 97% HT-detection accuracy, 70%
improved energy efficiency, and 29% reduced network latency as compared to state-of-the-
art HT-mitigation techniques.

AS TECHNOLOGY SCALES, modern multiprocessors have pushed for a paradigm shift from computation-centric to communication-centric systems. Networks-on-Chip (NoCs) are becoming increasingly critical yet vulnerable to various security threats,1–4 especially to hardware trojans (HTs).4 Maliciously implanted HTs inject transient faults in transmitted flits/packets, causing misrouting and unnecessary packet re-transmissions.

Digital Object Identifier 10.1109/MM.2020.3003576
Date of publication 19 June 2020; date of current version 1 September 2020.

56
0272-1732 © 2020 IEEE. Published by the IEEE Computer Society. IEEE Micro
These retransmissions consume massive NoC resources (e.g., link and buffer) and saturate the transmission channels, resulting in data leakage, significant performance degradation, and even denial-of-service.2,3
Significant research has been devoted to securing NoCs against HT-based attacks.5,6 A majority of these countermeasures mitigate HTs by detecting and isolating the HT-infected NoC components. In the HT-detection aspect, existing works use fault history logging (FHL),7 runtime threshold monitoring (RTM) on link-error/packet-injection rates, or built-in self-testing hardware.6 These techniques monitor fault-related NoC attributes (e.g., temperature and buffer/link utilization) at runtime and label the corresponding component as HT-infected if any attribute value exceeds its corresponding manually set threshold. However, since massive NoC attributes are correlated with transient faults and interact with each other,8 designing the thresholds for HT detection is complicated. These thresholds, if selected carelessly, can cause false detection or misdetection, additional power consumption, and increased network latency.
In the HT-isolation aspect, conventional solutions use regional routing algorithms2,3 to isolate the HT-infected components. Unfortunately, these techniques limit the network throughput, forbid communication via certain channels, and detour the packets to avoid infected regions. This inevitably increases network latency. Therefore, the challenge is to design a secure architecture that promptly and accurately detects and isolates HTs with minimal performance loss.
In this article, we propose TSA-NoC, a learning-based, high-performance, and energy-efficient NoC design for secured on-chip communication. In TSA-NoC, we enhance the router architecture with a learning-based per-router HT-detection (DetectANN) module, a bypass channel, and a SmartRoute controller for HT-isolation. The major contributions of this article include the following.

• Improved HT detection using DetectANN: DetectANN uses an artificial neural network (ANN) to automatically identify HT-injected faults by recognizing abnormal network traffic behaviors (e.g., an unexpectedly high error rate) and improve the accuracy of HT detection.
• Improved HT-isolation using SmartRoute: After HT detection, the routers are dynamically categorized into HT-free and HT-infected routers. A low-cost bypass channel using simple switch logic is proposed to bypass the HT-infected routers while maintaining network connectivity. To balance traffic loads among low-throughput bypass channels and high-throughput routers and improve the overall network performance, the SmartRoute controller uses DRL to handle diverse traffic patterns by dynamically applying the most suitable routing algorithms, thus minimizing network latency and power consumption.

We evaluate the performance of the proposed TSA-NoC architecture with the PARSEC benchmarks. Simulation results show that TSA-NoC provides enhanced HT-detection accuracy and improved HT mitigation with reduced power consumption and network latency as compared to conventional HT-mitigation techniques.

BACKGROUND AND MOTIVATION
Transient Faults in NoC
Transient faults may manifest during any stage of transmission8 in NoCs. To mitigate these faults in a typical NoC architecture, each flit is encoded with error correction codes (ECCs) before being propagated to the downstream router. At each hop of transmission, the ECC decoder examines the correctness of each flit and initiates a negative-acknowledgment signal for data retransmission if an error occurs.

HT Attacks on NoCs
HTs have been shown to infect NoC links9 and router microarchitectures.5 Typically, HTs are inserted in the layout during the fabrication phase of the IC design cycle. After being implanted, they usually remain dormant and slip past hardware diagnosis until they are activated by attackers. Once activated, they inject transient faults by forcing bit-flips in links or disrupting ECC in routers. These faults can lead to massive retransmissions, back pressure, and network saturation.
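The per-hop ECC/NACK protocol described above can be modeled with a toy retransmission loop. The channel and checker callables, the retry cap, and the give-up policy are all hypothetical illustrations, not part of the TSA-NoC design.

```python
def send_flit(flit, channel, ecc_ok, max_retries=8):
    """Transmit one flit across one hop: the downstream ECC decoder checks it,
    and a failed check (NACK) triggers retransmission, as in a typical NoC."""
    for attempt in range(1, max_retries + 1):
        received = channel(flit)       # link may inject a transient bit-flip
        if ecc_ok(received):           # decoder accepts the flit: implicit ACK
            return received, attempt
    raise RuntimeError("flit not delivered within retry budget")
```

An HT that keeps forcing bit-flips makes every hop spin in this loop; that is precisely the retransmission back pressure that saturates the network.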


Figure 1. Proposed TSA-NoC router architecture. The router comprises a DetectANN design for HT detection, a bypass channel for HT isolation, and a SmartRoute controller. Solid and dashed arrows represent data paths and control signals, respectively.

Figure 2. Overview of the proposed TSA-NoC design.

Conventional HT-Mitigation Techniques
A majority of the existing HT-mitigation techniques use a detect–isolate approach.2,3,6,7 For HT detection, built-in diagnosis hardware6 periodically checks the correctness of the circuitry's logic operations, which inevitably stalls the application execution to retain HT-detection accuracy. To mitigate this limitation, runtime software-based mechanisms such as FHL7 and RTM are proposed to detect suspicious HT-infected components by dynamically capturing abnormalities in NoC behavior. FHL records the pattern of transient faults in all the transmitted packets to identify HT-infected channels. Similarly, RTM monitors the retransmission rate and marks out suspicious HT-infected components that exceed a certain retransmission rate. However, both techniques rely on manually set thresholds to identify suspicious NoC behavior, thus often leading to false HT detection.
For HT isolation, SurfNoC3 eliminates the transmissions between high-security and low-security domains by alternately reserving and scheduling transmission channels in each dimension exclusively for a specific domain. NIBR2 partitions the virtual channels of each router to transmit data flows exclusively with high-security demands. However, the static network partition results in poor network utilization, thus affecting network performance.

PROPOSED TSA-NoC DESIGN
We propose TSA-NoC, a learning-based HT-detection and mitigation framework for secure NoC architecture. In TSA-NoC, we propose an enhanced router architecture, as shown in Figure 1, consisting of an improved HT-detection design using an ANN (DetectANN), a bypass channel, and an enhanced HT-mitigation mechanism (SmartRoute) to improve network performance with DRL.
The overview of the proposed TSA-NoC framework is shown in Figure 2. DetectANN monitors network attributes, learns from runtime network activities, and automatically identifies HT-infected components by recognizing abnormal network behaviors (e.g., high error rate and retransmission). Based on the HT-detection results from DetectANN, the SmartRoute controller categorizes packets into ones whose source and destination nodes are HT-free (high-security packets) and ones whose source and/or destination nodes are HT-infected (low-security packets). For high-security packets, all the components in the transmission path should be HT-free for security, while the transmission paths of the latter packets are allowed to contain HT-infected nodes. To isolate the HT-infected routers to protect the high-security packets, a bypass channel is proposed to bypass the HT-infected routers while maintaining network connectivity. Since the simple switch logic of the bypass channel inevitably degrades the network

throughput, a routing mechanism that avoids transmitting intense traffic through the bypass channels should be applied. Moreover, to better utilize network resources, especially the isolated routers, a different routing algorithm may be needed for the low-security packets. To this end, we propose SmartRoute, which proactively selects the most suitable routing algorithm to balance traffic loads in the low-throughput bypass channels and high-throughput routers, respectively, and improve overall network performance.

Learning-Based Runtime HT-Detection Using DetectANN
We implement a per-router DetectANN to perform runtime HT-detection with improved accuracy and minimized timing and power overheads. Unlike the static thresholds used by FHL and RTM, DetectANN eschews human engineering, monitors NoC attributes, and automatically detects HTs by learning how to accurately recognize the abnormal behavior(s) of the local router through complex and interrelated NoC attributes. Since HTs are hard to detect when dormant, to identify activated HTs in a timely manner while reducing the computational overhead of DetectANN, HT-detection is performed iteratively.
Construction of DetectANN. DetectANN is a fully connected ANN with an input layer, a middle layer, and an output layer. Previous works have shown that some system attributes are highly correlated with transient faults in NoCs.8 In TSA-NoC, we explore 12 fault-related NoC attributes as inputs, including buffer utilization (number of occupied virtual channels) for each input port (+x, −x, +y, −y, and local core), link utilization (input flits per cycle) for each input port (+x, −x, +y, −y, and local core), local operation temperature, and the transient error rate in the last epoch. The middle layer utilizes all the attribute values and maps them to the classification of whether the router is HT-infected. As HT-detection is a binary classification problem, a single-hidden-layer construction can mitigate overfitting and is sufficient to deliver the desired accuracy. For optimized detection accuracy and computational/storage overhead, we implement 30 neurons in the single hidden layer (a detailed discussion is in the "Evaluation and Analysis" section). The middle layer uses the Sigmoid activation function. For each neuron j in the middle layer, the output is y_j = sigmoid(Σ_{i=0}^{12} x_i · w_i). The output layer indicates the binary classification result: HT-free or HT-infected.
Training the Proposed DetectANN. The proposed DetectANN is trained offline. To build the training set, applications are first executed in an HT-free system while runtime attributes are recorded. These attributes are used for the input layer of DetectANN, and the desired output is HT-free. Then, the same applications are executed multiple times with HT-infected NoC components. For a better training result, HTs are randomly implanted each time. DetectANN monitors the same attributes, and routers with implanted HTs are labeled as HT-infected.
Avoidance of False-Positives/Negatives. As discussed, inaccurate HT-detection can lead to performance degradation. False-positives can be a problem when an HT-free router is always labeled as HT-infected. In TSA-NoC, even if an HT-free router is mistakenly labeled as HT-infected, the detection result will be updated at the next epoch. As the trained DetectANN has a high HT-detection accuracy, the wrongly labeled router has a high chance of being labeled correctly at the next epoch. By doing so, the penalty of isolating that HT-free router is limited to one epoch. Therefore, the false-positive problem can be mitigated. False-negatives are common in conventional designs, in which an HT-infected router is labeled as HT-free when the HT is not activated. The proposed DetectANN resolves this problem by monitoring the runtime NoC behaviors consecutively and providing HT-detection results every 2000 cycles. As DetectANN utilizes the average attribute values within the epoch, it is able to sensitively capture the anomalous behavior of HT-infected routers, even if the HTs are triggered on for only a short period of time. Therefore, the false-negative problem is resolved.

Learning-Based Dynamic HT-Mitigation Using SmartRoute
We propose a learning-based HT-mitigation mechanism for efficient HT-isolation. We implement a bypass channel and a per-router SmartRoute controller to dynamically route high-security packets without traversing HT-infected

September/October 2020
59
Machine Learning for Systems

components and utilize the bypassed routers to generates a direct map between the optimal action
propagate low-security packets without degrad- and a given state. The problem formulation is as
ing network performance. There is no need to follows.
restrict the transmission paths of the low-secu- State and Action Space. We select a set of net-
rity packets, since they are already HT-infected. work-related attributes to represent the state
When isolating HT-infected routers with vector s for SmartRoute, which include the 12
bypass channels, the simple switch logic of the attributes used in DetectANN, local router label
bypass channel could limit the throughput of (HT-free or HT-infected), and packet injection rates
given path directions. TSA-NoC addresses this of different network dimensions. The action space
problem by intelligently balancing traffic-loads {a1 ; a2 ; a3 } comprises three routing algorithms:
with various routing algorithms (O1TURN, West- O1TURN, West-First, and Negative-First.
First, and Negative-First) using a SmartRoute con- Reward Function. The goal of the DRL agent is
troller. The rationale behind this design is two- to select actions that can maximize the long-term
fold: 1) avoid injecting into bypass channels and return R for any given state. In SmartRoute, we
2) optimize the worst case throughput of different use Q-learning11 to estimate the expected long-
NoC traffic patterns.10 The O1TURN routing term return for each state-action pair, recorded as
dynamically applies XY or YX routing for each R ¼ Qðs; aÞ. At each epoch, the agent selects the
packet to better utilize the network spatially under action with the highest Qðs; aÞ. Next, it observes
normal traffic loads. West-First and Negative-First the immediate reward r and the new state s0 . The
restrict different types of turns and achieve lower Qðs; aÞ value is updated using the following rule:
latency and less dynamic power consumption
than O1TURN under intense traffic loads.10 Note Qðs; aÞ ¼ ð1  aÞQðs; aÞ þ a½r þ g max Qðs;0 a0 Þ:
a0
that the TSA-NoC router has multiple virtual chan- (1)
nels to avoid protocol and routing deadlocks.
The variables a and g are DRL parameters
Since the HT-detection results from Detec-
called learning rate and discount rate, respec-
tANN vary periodically during runtime, selecting
tively. The immediate reward r implies minimizing
the most suitable routing algorithm that can han-
the latency and power consumption. Therefore,
dle the dynamic interactions between diverse
we define the immediate reward rt in (1) at epoch
traffic patterns and limited NoC resources is
t as follows:
complex. Therefore, we propose the use of DRL
to automatically balance the tradeoffs among rt ¼ ðPowert  Latencyt Þ1 (2)
the different routing algorithms to achieve bet-
ter system-level performance for high-security The Latencyt and Powert values are obtained
and low-security packets. by average end-to-end latency and power con-
DRL-Based Control Policy. The adaptive routing sumption (static and dynamic), respectively.
algorithm is applied iteratively to avoid the timing The DRL agents select actions according to
overhead incurred by NoC reconfiguration and the Q-table. To eliminate storing overheads, the
packet draining. The length of each iteration Q-table in conventional RL is replaced with a
(epoch) is identical to that of DetectANN. At each neural network.
epoch, the DRL-based SmartRoute controller moni- The working of the DRL of SmartRoute con-
tors NoC attributes and suggests an action (apply- troller is shown in Figure 3. At each epoch, the
ing one of the routing algorithms) with the highest router first uses the feature values in the state
expected long-term return11 in terms of network vector s as inputs of the expanded ANN. The
performance and energy efficiency. The network ANN then calculates the Q-values of all the possi-
attributes will change with the action selection, ble state-action pairs in the state entry. The
resulting in a new state at the next epoch. The router suggests an action a, which has the maxi-
changes in performance and energy metrics are mum Qðs; aÞ-value for the next epoch. All routers
evaluated to update the reward accordingly. The vote with their selected actions for packets that
DRL-based control policy continues to evolve require HT-free transmission paths, and the rout-
based on the NoC historical activities and ing algorithm with the highest score is selected.
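As an illustration, the per-epoch update of (1) and the reward of (2) fit in a few lines. This is our sketch, not the article's code: the tabular Q here stands in for the small ANN that SmartRoute actually uses, and the class and method names are ours.

```python
import random

# Illustrative sketch of SmartRoute's control loop: tabular Q-learning with
# the update rule of Eq. (1), the reward of Eq. (2), and epsilon-greedy
# selection over the three routing algorithms. SmartRoute itself replaces
# the Q-table with a small neural network to avoid storage overhead.

ACTIONS = ("O1TURN", "West-First", "Negative-First")

class SmartRouteSketch:
    def __init__(self, alpha=0.1, gamma=0.9, epsilon=0.05, seed=0):
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.rng = random.Random(seed)
        self.q = {}  # (state, action) -> expected long-term return

    @staticmethod
    def reward(power, latency):
        # Eq. (2): r_t = (Power_t * Latency_t)^-1
        return 1.0 / (power * latency)

    def select_action(self, state):
        # epsilon-greedy: with small probability, explore a random action
        if self.rng.random() < self.epsilon:
            return self.rng.choice(ACTIONS)
        return max(ACTIONS, key=lambda a: self.q.get((state, a), 0.0))

    def update(self, state, action, r, next_state):
        # Eq. (1): Q(s,a) = (1 - alpha)Q(s,a) + alpha[r + gamma max_a' Q(s',a')]
        best_next = max(self.q.get((next_state, a), 0.0) for a in ACTIONS)
        old = self.q.get((state, action), 0.0)
        self.q[(state, action)] = ((1.0 - self.alpha) * old
                                   + self.alpha * (r + self.gamma * best_next))
```

At each 2000-cycle epoch, a router would call `select_action` with its observed state, apply the winning routing algorithm, then call `update` with the reward computed from the measured power and latency.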

Figure 3. Working of DRL of SmartRoute.

Upon taking the action a, the NoC system transitions to a new state s′. The NoC system then provides an immediate reward r, which is used to update Q(s, a). We implement a five-cycle window between two consecutive epochs to inform routers of the upcoming actions and store on-the-fly flits in router buffers.

Figure 4. Simulation results of the proposed TSA-NoC: (a) average end-to-end latency, (b) average power consumption, and (c) average energy efficiency. Results are normalized to the FHL+SurfNoC baseline.

EVALUATION AND ANALYSIS
Simulation Setup
We implement the proposed TSA-NoC architecture in the Gem5 full-system simulator. We implement 64 out-of-order CPUs with 2-level cache in an 8 × 8 2D-mesh network. Additionally, we implement a runtime error injection module consisting of NoC fault and thermal models (DSENT, HotSpot, and VARIUS) to realistically simulate transient errors. We compare the performance of TSA-NoC (DetectANN+SmartRoute) with three techniques, namely FHL+SurfNoC, DetectANN+SurfNoC, and DetectANN+NIBR, with the PARSEC benchmark. We train the DetectANN and SmartRoute with a semi-real dataset generated with synthetic traffic and part of the PARSEC benchmark applications (dedup, facesim, freqmine, and swaption). The remaining PARSEC applications are used in the testing phase.

Before executing each benchmark application, we randomly select 10% of the total transmission links in the NoC and implant HTs in them for runtime fault injection. The testing phase of each application lasts for the entire application execution time. The area overhead of TSA-NoC is evaluated using the Synopsys Design Compiler with the 32-nm library.

For DetectANN and SmartRoute, we set the epoch size to 2000 cycles. The Q-values are initialized to 0. The learning rate α and the discount rate γ are set to 0.1 and 0.9, respectively. Additionally, the DRL agents have a small probability of ε = 0.05 to select a random action for exploring unvisited state–action pairs.

Performance Analysis
Average Network Latency: Figure 4(a) shows the normalized average end-to-end packet latency of all the transmitted packets. TSA-NoC achieves an average of 29% end-to-end latency reduction over the baseline. Note that the proposed TSA-NoC using SmartRoute can improve upon DetectANN+SurfNoC by an additional 13% over baseline. This illustrates that DRL-based dynamic routing can further improve overall network latency.

Power Consumption: We evaluate static and dynamic power consumption for all the techniques used. For TSA-NoC, the power consumption of the learning-based TSA-router (with DetectANN and SmartRoute), intermediate links, and bypass channels are included. We first model the static power of all components with Synopsys Design Compiler. Afterward, the captured power


parameters are fed to the full-system simulator for accurate dynamic power simulation. Figure 4(b) shows that TSA-NoC reduces overall power consumption by an average of 18% over the baseline. The majority of power saving is from dynamic power reduction.

Energy Efficiency: Energy efficiency is defined as packets × energy^−1, where energy equals overall power consumption (of all NoC components, the proposed DetectANN, and SmartRoute) multiplied by benchmark execution time. Figure 4(c) shows that the proposed framework improves energy efficiency by an average of 70% compared to baseline.

HT-Detection Accuracy: The HT-detection accuracy is calculated as the ratio of the number of identified HTs to the total number of implanted HTs within a full execution of each benchmark. In this simulation, we vary the middle layer size of DetectANN. The DetectANN is trained with randomly distributed HT-generated bit-flips, while in the testing phase, the HT-generated bit-flips follow three different distributions: normal, uniform, and Poisson. Figure 5(a) shows that, for all distributions, the proposed DetectANN improves HT-detection accuracy by 39% on average over the FHL baseline, with 30 neurons in the middle layer.

Figure 5. Sensitivity study: (a) HT-detection accuracy, (b) epoch size, (c) discount rate γ, and (d) ε-greedy factor.

Sensitivity Analysis
Impact of Middle Layer Size of DetectANN: We vary the size of the middle layer to study its impact on HT-detection accuracy in Figure 5(a). The HT-detection accuracy improves as the size of the middle layer increases. For the best accuracy, area consumption, and timing overhead, we use 30 neurons in the middle layer.

Impact of Epoch Size of DRL: In this test, we vary the length of the DRL epoch from 1000 to 10 000 clock cycles. As shown in Figure 5(b), increasing the epoch size negatively impacts network latency and energy-delay product (EDP) due to coarse-grain control (lower network latency and EDP indicate better performance). Alternatively, aggressively reducing the length of epochs also leads to performance degradation, as the timing overhead of DRL would be notable.

Impact of Discount Rate γ of DRL: Figure 5(c) depicts the impact of the discount rate γ on network performance. As shown in Figure 5(c), both network latency and EDP are initially improved with larger γ. However, aggressively increasing γ can also result in slow DRL convergence, which leads to performance degradation. The best performance is achieved when γ equals 0.9.

Impact of Exploration Factor ε of DRL: The impact of ε on network performance is shown in Figure 5(d). As ε increases, the agent explores unvisited state-action pairs more frequently, which is beneficial for training DRL. However, when ε equals 1, the router will take actions completely at random. As shown in Figure 5(d), the best performance is achieved when ε equals 0.05.

Overhead Analysis
Timing Overhead: The timing overhead is induced by calculating and updating the weights for DetectANN and SmartRoute. In the worst case, for each epoch, the computation overheads are 90 and 150 cycles for DetectANN and SmartRoute, respectively. We use two sets of different epochs for the monitoring and controlling to minimize the negative effect of this latency. The two sets of epochs are offset by the ANN computation time, which can pipeline the overhead effectively. By doing so, the calculation of ANNs does not block monitoring or controlling. Therefore, the use of ANN does not negatively impact the overall performance.

Area and Power Overhead: The proposed DetectANN and SmartRoute require additional ALUs (adders, multipliers, and Sigmoid function) and SRAM storage in each router. The proposed DetectANN consumes an additional 425.2-μm² area for ALUs and 718.7-μm² area for SRAM, incurring 0.9% area overhead over a conventional router

in total. The DRL logic consumes 956.7-μm² ALU area and 1617.2-μm² SRAM area, which implies 2.1% area overhead. Furthermore, the power overheads of DetectANN and DRL are 0.086 and 0.195 mW, respectively.

CONCLUSIONS
In this article, we proposed TSA-NoC, a learning-enabled, high-performance, and energy-efficient design framework for secure on-chip communication. The TSA-NoC consists of an ANN-based HT-detection design (DetectANN) and a DRL-based adaptive routing mechanism (SmartRoute). The proposed DetectANN detects HT-infected NoC components promptly and accurately at runtime. SmartRoute isolates the HT-infected components and deploys dynamic routing algorithms to optimize system-level performance. Full-system evaluations show that TSA-NoC improves HT-detection accuracy, network latency, and energy efficiency over existing techniques.

ACKNOWLEDGMENTS
This work was supported by NSF under Grants CCF-1547035 and CCF-1702980.

REFERENCES
1. H. Wassel et al., "Networks on chip with provable security properties," IEEE Micro, vol. 34, no. 3, pp. 57–68, May/Jun. 2014.
2. T. Boraten and A. K. Kodi, "Securing NoCs against timing attacks with non-interference based adaptive routing," in Proc. 12th IEEE/ACM Int. Symp. Network-on-Chip, 2018, pp. 1–8.
3. H. Wassel et al., "SurfNoC: A low latency and provably non-interfering approach to secure networks-on-chip," in Proc. 40th Annu. Int. Symp. Comput. Archit., 2013, pp. 583–594.
4. M. Tehranipoor and F. Koushanfar, "A survey of hardware Trojan taxonomy and detection," IEEE Des. Test Comput., vol. 27, no. 1, pp. 10–25, Jan./Feb. 2010.
5. D. M. Ancajas, K. Chakraborty, and S. Roy, "Fort-NoCs: Mitigating the threat of a compromised NoC," in Proc. 51st ACM/EDAC/IEEE Des. Autom. Conf., 2014, pp. 1–6.
6. K. Xiao and M. Tehranipoor, "BISA: Built-in self-authentication for preventing hardware Trojan insertion," in Proc. IEEE Int. Symp. Hardware-Oriented Secur. Trust, 2013, pp. 45–50.
7. H. Salmani, "COTD: Reference-free hardware Trojan detection and recovery based on controllability and observability in gate-level netlist," IEEE Trans. Inf. Forensics Secur., vol. 12, no. 2, pp. 338–350, Feb. 2017.
8. K. Wang et al., "IntelliNoC: A holistic design framework for energy-efficient and reliable on-chip communication for manycores," in Proc. 46th Annu. Int. Symp. Comput. Archit., 2019, pp. 1–12.
9. Q. Yu and J. Frey, "Exploiting error control approaches for hardware Trojans on network-on-chip links," in Proc. IEEE Int. Symp. Defect Fault Tolerance VLSI Nanotechnol. Syst., 2013, pp. 266–271.
10. A. Singh, "Load-balanced routing in interconnection networks," Ph.D. thesis, Stanford Univ., Stanford, CA, USA, 2005.
11. R. Sutton et al., Reinforcement Learning: An Introduction. Cambridge, MA, USA: MIT Press, 2018.

Ke Wang is currently working toward the Ph.D. degree in computer engineering with George Washington University. His research work focuses on high-performance, energy-efficient, and reliable interconnect designs using machine learning-based optimization. Contact him at cory@gwu.edu.

Hao Zheng is currently working toward the Ph.D. degree in computer engineering with George Washington University. His research interests are in the areas of computer architecture and parallel computing, with emphasis on on-chip interconnects. Contact him at haozheng@gwu.edu.

Ahmed Louri is currently the David and Marilyn Karlgaard Endowed Chair Professor of Electrical and Computer Engineering with George Washington University. Louri received the Ph.D. degree in computer engineering from the University of Southern California in 1988. He conducts research in the broad area of computer architecture and parallel computing. He is a Fellow of IEEE, and currently serves as the Editor-in-Chief of the IEEE Transactions on Computers. Contact him at louri@gwu.edu.
Guest Editor’s Introduction

Biology and Systems Interactions


Abhishek Bhattacharjee
Yale University

BIOLOGICAL SYSTEMS CARRY out computation and store information with efficiency levels and complexity that far exceed those of synthetic computer systems. Consider, for example, that biological neural networks are estimated to offer at least four orders of magnitude better ops/Joule than synthetic neural networks, or the storage advantages offered by molecular and DNA substrates, or the fact that sophisticated molecular pathways and mechanisms permit biological systems to interface with the most efficient sensing systems that we know of. At the same time, advancing our understanding of complex biological systems rests on continued scaling of the computational capabilities of high-performance and efficient computer systems, hardware, and software.

This special issue contains two articles on this intersection of systems and biology. In the first article, Alser, Bingol, Kim, Cali, Ghose, Alkan, and Mutlu present an overview of the field of genome analysis and the role of algorithmic methods, hardware accelerators, and their interplay in their advancement. Genome analysis—with its emphasis on measurement and comparison of features such as DNA sequence structures, variations, expressions, and regulation—is an incredibly important and exciting field critical to the existential questions facing humankind (not least because of its implications on the COVID-19 pandemic). The authors focus on the read mapping stage of genome analysis pipelines, because of its computational requirements, and offer researchers' perspectives on key problems that may be ripe for future studies.

While the first article tackles the possibilities offered by emerging hardware to advance the life sciences, the second article studies the design of tools and infrastructure to promote next-generation hybrid molecular–electronic systems. This article, authored by Stephenson, Willsey, McBride, Newman, Nguyen, Takahashi, Strauss, and Ceze, focuses on a full-stack digital microfluidic platform for hybrid molecular–electronic system studies, culminating in a study of DNA data storage. The authors offer a vertical approach by describing the design principles of the system, its closed-loop operation with vision and capacitive sensing, on-board magnetic bead extraction, and much more. Of particular interest to computer architects may be the principles by which the authors express computation and protocols and use them to bridge the molecular and computational components of their microfluidic platform.

These two articles address but a small sample of the myriad research questions at the intersection of the life sciences and computer systems. We hope, however, that they spur research discussions and studies that help propel a virtuous research cycle between biology and computer systems design, where both fields advance from innovations and ties with the other.

Abhishek Bhattacharjee is an Associate Professor of Computer Science at Yale University. Contact him at yxie4@miners.utep.edu.

Digital Object Identifier 10.1109/MM.2020.3016852
Date of current version 1 September 2020.

0272-1732 © 2020 IEEE Published by the IEEE Computer Society IEEE Micro
Theme Article: Biology and Systems Interactions

Accelerating Genome Analysis: A Primer on an Ongoing Journey

Mohammed Alser, ETH Zürich
Zülal Bingöl, Bilkent University
Damla Senol Cali, Carnegie Mellon University
Jeremie Kim, ETH Zürich and Carnegie Mellon University
Saugata Ghose, University of Illinois at Urbana–Champaign and Carnegie Mellon University
Can Alkan, Bilkent University
Onur Mutlu, ETH Zürich, Carnegie Mellon University, and Bilkent University

Abstract—Genome analysis fundamentally starts with a process known as read mapping, where sequenced fragments of an organism's genome are compared against a reference genome. Read mapping is currently a major bottleneck in the entire genome analysis pipeline, because state-of-the-art genome sequencing technologies are able to sequence a genome much faster than the computational techniques employed to analyze the genome. We describe the ongoing journey in significantly improving the performance of read mapping. We explain state-of-the-art algorithmic methods and hardware-based acceleration approaches. Algorithmic approaches exploit the structure of the genome as well as the structure of the underlying hardware. Hardware-based acceleration approaches exploit specialized microarchitectures or various execution paradigms (e.g., processing inside or near memory). We conclude with the challenges of adopting these hardware-accelerated read mappers.

Digital Object Identifier 10.1109/MM.2020.3013728
Date of publication 3 August 2020; date of current version 1 September 2020.

GENOME ANALYSIS is the foundation of many scientific and medical discoveries, and serves as a key enabler of personalized medicine. This analysis is currently limited by the inability of modern genome sequencing technologies to read an organism's complete genome. Instead, sequencing machines extract smaller random fragments of an organism's DNA sequence, known as reads. While the human genome contains over three billion bases (i.e., A, C, G, T in DNA), the length of a read is orders of magnitude


smaller, ranging from a few hundred bases (for short reads) to a few million bases (for long reads). Computers are used to perform genome assembly, which reassembles read fragments back into an entire genome sequence. Genome assembly is currently the bottleneck to quickly and accurately determining an individual's entire genome, due to the complex algorithms and large datasets used for assembly.

A widely used approach for genome assembly is to perform sequence alignment, which compares read fragments against a known reference genome (i.e., a complete representative DNA sequence for a particular species). A process known as read mapping matches each read generated from sequencing to one or more possible locations within the reference genome, based on the similarity between the read and the reference sequence segment at that location. Unfortunately, the bases in a read may not be identical to the bases in the reference genome at the location that the read actually comes from. These differences may be due to 1) sequencing errors (up to 0.1% in short reads and up to 20% in long reads) during extraction; and 2) genetic mutations that are specific to the individual organism's DNA and may not exist in the reference genome. Due to these potential differences, the similarity between a read and a reference sequence segment must be identified using an approximate string matching (ASM) algorithm. The possible genetic differences between the reference genome and the sequenced genome are then identified using genomic variant calling algorithms.

The ASM performed during read mapping typically uses a computationally expensive dynamic programming (DP) algorithm. This time-consuming algorithm has long been a major bottleneck in the entire genome analysis pipeline, accounting for over 70% of the execution time of read mapping.1 The vast majority of read mappers, such as the widely used minimap2,2 are implemented as software running on CPUs. We refer readers to a comprehensive survey3 for a discussion of state-of-the-art CPU-based read mappers. Accelerating ASM can help bridge the wide performance gap between sequencing machines and CPU-based read mapping algorithms, but faces three key challenges.

1) Due to the large datasets that a read mapper operates on, it generates a large amount of data movement between the CPU and main memory. The CPU accesses off-chip main memory through a pin-limited bus known as the memory channel, and a high amount of data movement across the memory channel is extremely costly in terms of both execution time and energy.4,5
2) Modern sequencing machines generate read fragments at an exponentially higher rate than prior sequencing technologies, with their growth far outpacing the growth in computational power in recent years.6 For example, the Illumina NovaSeq 6000 system can sequence about 48 human whole genomes at 30× genome coverage (the average number of times a genomic base is sequenced) in about two days. However, analyzing (performing mapping and variant calling) the sequencing data of a single human genome requires over 32 CPU hours on a 48-core Intel Xeon processor, 23 of which are spent on read mapping.7
3) The first two challenges worsen when a metagenomic sample is profiled, where the sample donor is unknown. This requires matching the extracted reads to thousands of reference genomes. Increasing the number of CPUs used for genome analysis decreases the overall analysis time, but significantly increases energy consumption and hardware costs. Cloud computing platforms are a potential alternative to distribute the workload at a reasonable cost, but are disallowed due to data protection guidelines in many countries.26

As a result, there is a dire need for new computational techniques that can quickly process and analyze a tremendous number of extracted reads in order to drive cutting-edge advances in the genetic applications space.8 Many works boost the performance of existing and new read mappers using new algorithms, hardware/software co-design, and hardware accelerators. Our goal in this work is to survey a prominent set of these three types of acceleration efforts for guiding the design of new highly efficient read mappers. To this end, we 1) discuss various state-of-the-art
Figure 1. (a) Three steps of read mapping in genome analysis: 1) indexing, 2) pre-alignment filtering, and 3) sequence alignment. (b) Overview of the existing approaches to accelerating each step of read mapping.

mechanisms and techniques that improve the execution time of read mapping using different modern high-performance computing architectures; and 2) highlight the challenges, in the last section, that system architects and programmers must address to enable the widespread adoption of hardware-accelerated read mappers.

READ MAPPING
The main goal of read mapping is to locate possible subsequences of the reference genome sequence that are similar to the read sequence while allowing at most E edits, where E is the edit distance threshold. Commonly allowed edits include deletion, insertion, and substitution of characters in one or both sequences. Mapping billions of reads to the reference genome is computationally expensive.1,8,9 Therefore, most read mapping algorithms apply two key heuristic steps, indexing and filtering, to reduce the number of reference genome segments that need to be compared with each read.

The three steps of read mapping are shown in Figure 1(a). First, a read mapper indexes the reference genome by using substrings (called seeds) from each read to quickly identify all potential mapping locations of each read in the reference genome. Second, the mapper uses filtering heuristics to examine the similarity for every sequence pair (a read sequence and one potential matching segment in the reference genome identified during indexing). These filtering heuristics aim to eliminate most of the dissimilar sequence pairs. Third, the mapper performs sequence alignment (using ASM) to check whether or not the remaining sequence pairs that are identified by filtering to be similar are actually similar. The alignment step examines all possible prefixes of two sequences and tracks the prefixes that provide the highest possible alignment score (known as optimal alignment). The alignment score is a quantitative representation of the quality of an alignment for a given user-defined scoring function (computed based on the number of edits and/or matches). Alignment algorithms typically use DP-based approaches to avoid re-examining the same prefixes many times. These DP-based algorithms provide the most accurate alignment results


compared to other non-DP algorithms, but they have quadratic time and space complexity [i.e., O(m²) for a sequence length of m]. Sequence alignment calculates information about the alignment such as the alignment score, edit distance, and the type of each edit. Edit distance is defined as the minimum number of changes needed to convert a sequence into the other sequence. Such information is typically output by read mapping into a sequence alignment/map (SAM) file. Given the time spent on read mapping, all three steps have been targeted for acceleration. Figure 1(b) summarizes the different acceleration approaches, and we discuss a set of such works in the following sections.

ACCELERATING INDEXING
The indexing operation generates a table that is indexed by the contents of a seed, and identifies all locations where the seed exists in the reference genome. Indexing needs to be done only once for a reference genome, and eliminates the need to perform ASM across the entire genome. During read mapping, a seed from a read is looked up in the table, and only the corresponding locations are used for ASM (as only they can match the entire read). The major challenge with indexing is choosing the appropriate length and number of to-be-indexed seeds, as they can significantly impact the memory footprint and overall performance of read mapping.2 Querying short seeds potentially leads to a large number of mapping locations that need to be checked for a string match. The use of long reads requires extracting from each read a large number of seeds, as the sequencing error rate is much higher in long reads. This affects 1) the number of times we query the index structure; and 2) the number of retrieved mapping locations. Thus, there are two key approaches used for accelerating the indexing step [see Figure 1(b)].

Reducing the Number of Seeds
Read mapping algorithms (e.g., minimap22) typically reduce the number of seeds that are stored in the index structure by finding the minimum representative set of seeds (called minimizers) from a group of adjacent seeds within a genomic region. The representative set can be calculated by imposing an ordering (e.g., lexicographically or by hash value) on a group of adjacent seeds and storing only the seed with the smallest order. Read mappers also apply heuristics to avoid examining the mapping locations of a seed that occur more times than a user-defined threshold value.2 Various data structures have been proposed and implemented to both reduce the storage cost of the indexing data structure and improve the algorithmic runtime of identifying the mapping locations within the indexing data structure. One example of such data structures is the FM-index (implemented by Langarita et al.10), which provides a compressed representation of the full-text index, while allowing for querying the index without the need for decompression. This approach has two main advantages.

1) We can query seeds of arbitrary lengths, which helps to reduce the number of queried seeds.
2) It typically has a smaller (by 1.5×–2×) memory footprint compared to that of the indexing step of minimap2.2

However, one major bottleneck of FM-indexes is that locating the exact matches by querying the FM-index is significantly slower than that of classical indexes.10,11 BWA-MEM211 proposes an uncompressed version of the FM-index that is at least 10× larger than the compressed FM-index to speed up the querying step by 2×.

Reducing Data Movement During Indexing
RADAR12 observes that the indexing step is memory intensive, because the large number of random memory accesses dominates computation. The authors propose a processing-in-memory (PIM) architecture that stores the entire index inside the memory and enables querying the same index concurrently using a large number of ASIC compute units. The amount of data movement is reduced from tens of gigabytes to a few bytes for a single query task, allowing RADAR to balance memory accesses with computation, and thus provide speedups and energy savings.

ACCELERATING PRE-ALIGNMENT FILTERING
After finding one or more potential mapping locations of the read in the reference genome,

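The table-based indexing and lookup flow described above can be sketched in a few lines (an illustrative toy with a made-up seed length and function names of my own, not the code of minimap2 or any other cited mapper):

```python
# Illustrative sketch of hash-based seed indexing; seed length k and all
# names here are my own choices, not any cited mapper's implementation.
from collections import defaultdict

def build_index(reference, k=5):
    """Index the reference once: map every length-k seed to all of its
    occurrence positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index

def candidate_locations(read, index, k=5):
    """Look up each seed of the read; only the returned locations need to
    be verified with approximate string matching (ASM)."""
    candidates = set()
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], ()):
            candidates.add(pos - offset)  # where the whole read would start
    return candidates

idx = build_index("ACGTACGTGGTACCA")
print(sorted(candidate_locations("ACGTG", idx)))  # prints [4]
```

Shortening k inflates the candidate set and the table, which is exactly the footprint/performance tradeoff discussed above; minimizers shrink both by keeping only a representative subset of seeds.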
68 IEEE Micro
Sidebar: Related Works on Pre-alignment Filtering Using the Pigeonhole Principle

Pigeonhole-filtering-based pre-alignment filtering can accelerate read mappers even without specialized hardware. For example, the adjacency filter [1] accelerates sequence alignment by up to 19×. The accuracy and speed of pre-alignment filtering with the pigeonhole principle have been rapidly improved over the last seven years. Shifted Hamming Distance (SHD) [2] uses SIMD-capable CPUs to provide high filtering speed, but supports a sequence length up to only 128 base pairs due to the SIMD register widths. GateKeeper [3] utilizes the large amounts of parallelism offered by FPGA architectures to accelerate SHD and overcome such sequence length limitations. MAGNET [4] provides a comprehensive analysis of all sources of filtering inaccuracy of GateKeeper and SHD. Shouji [5] leverages this analysis to improve the accuracy of pre-alignment filtering by up to two orders of magnitude compared to both GateKeeper and SHD, using a new algorithm and a new FPGA architecture. SneakySnake [6] achieves up to four orders of magnitude higher filtering accuracy compared to GateKeeper and SHD by mapping the pre-alignment filtering problem to the single net routing (SNR) problem in VLSI chip layout. SNR finds the shortest routing path that interconnects two terminals on the boundaries of a VLSI chip layout in the presence of obstacles. SneakySnake is the only pre-alignment filter that works on CPUs, GPUs, and FPGAs. GenCache [7] proposes to perform highly parallel pre-alignment filtering inside the CPU cache to reduce data movement and improve energy efficiency, with about 20% cache area overhead. GenCache shows that using different existing pre-alignment filters together, each of which operates only for a given edit distance threshold (e.g., using SHD only when E is between 1 and 5), provides a 2.5× speedup over GenCache with a single pre-alignment filter.

REFERENCES
1. Hongyi Xin et al., "Accelerating read mapping with FastHASH," BMC Genomics, 2013.
2. Hongyi Xin et al., "Shifted Hamming Distance: A fast and accurate SIMD-friendly filter to accelerate alignment verification in read mapping," Bioinformatics, 2015.
3. Mohammed Alser et al., "GateKeeper: A new hardware architecture for accelerating pre-alignment in DNA short read mapping," Bioinformatics, 2017.
4. Mohammed Alser et al., "MAGNET: Understanding and improving the accuracy of genome pre-alignment filtering," Transactions on Internet Research, 2017.
5. Mohammed Alser et al., "Shouji: A fast and efficient pre-alignment filter for sequence alignment," Bioinformatics, 2019.
6. Mohammed Alser et al., "SneakySnake: A fast and accurate universal genome pre-alignment filter for CPUs, GPUs, and FPGAs," arXiv:1910.09020 [q-bio.GN], 2019.
7. Anirban Nag et al., "GenCache: Leveraging in-cache operators for efficient sequence alignment," in Proc. 52nd Int. Symp. Microarchitecture, 2019.

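The filters surveyed in the sidebar all build on the same pigeonhole check; a toy pure-software illustration (my own simplification — the cited works use SIMD, FPGA, or in-cache hardware, and more careful windowing):

```python
# Toy illustration of pigeonhole-based pre-alignment filtering; names and
# details are my own, not those of SHD, GateKeeper, Shouji, or GenCache.
def pigeonhole_pass(read, segment, e):
    """If read and segment are within e edits, at least one of e+1
    nonoverlapping pieces of the read must appear unedited in the segment.
    Returns False only when the pair is guaranteed to exceed e edits."""
    n = e + 1
    size = len(read) // n
    for i in range(n):
        lo = i * size
        hi = len(read) if i == n - 1 else lo + size
        if read[lo:hi] in segment:  # one edit-free piece is enough to pass
            return True
    return False  # no piece survives: skip expensive alignment

print(pigeonhole_pass("ACGTACGT", "ACGAACGT", e=1))  # True (1 substitution)
print(pigeonhole_pass("AAAAAAAA", "CCCCCCCC", e=1))  # False: filtered out
```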
ACCELERATING PRE-ALIGNMENT FILTERING
After finding one or more potential mapping locations of the read in the reference genome, the read mapper checks the similarity between each read and each segment extracted at these mapping locations in the reference genome. These segments can be similar or dissimilar to the read, though they share common seeds. To avoid examining dissimilar sequences using computationally expensive sequence alignment algorithms, read mappers typically use filtering heuristics that are called pre-alignment filters. The key idea of pre-alignment filtering is to quickly estimate the number of edits between two given sequences and use this estimation to decide whether or not the computationally expensive DP-based alignment calculation is needed; if not, a significant amount of time is saved by avoiding DP-based alignment. If two genomic sequences differ by more than the edit distance threshold, then the two sequences are identified as dissimilar sequences and hence DP calculation is not needed. In practice, only genomic sequence pairs with an edit distance less than or equal to a user-defined threshold (i.e., E) provide useful data for most genomic studies.¹,³,¹³ Pre-alignment filters use one of four major approaches to quickly filter out the dissimilar sequence pairs: 1) the pigeonhole principle; 2) base counting; 3) q-gram filtering; or 4) sparse DP. Long read mappers typically use q-gram filtering or sparse DP, as their performance scales linearly with read length and independently of the edit distance.

Pigeonhole Principle
The pigeonhole principle states that if E items are put into E+1 boxes, then one or more boxes would be empty.

September/October 2020
69
Biology and Systems Interactions

This principle can be applied to detect dissimilar sequences and discard them from the candidate sequence pairs used for ASM. If two sequences differ by E edits, then they should share at least a single subsequence (free of edits) among E+1 nonoverlapping subsequences,¹ where E is the edit distance threshold. For a read of length m, if there are no more than E edits between the read and the reference segment, then the read and reference segment are considered similar if they share at most E+1 nonoverlapping subsequences, with a total length of at least m − E. The problem of identifying these E+1 nonoverlapping subsequences is highly parallelizable, as these subsequences are independent of each other. Shouji¹ exploits the pigeonhole principle to reduce the search space and provide a scalable architecture that can be implemented for any values of m and E, by examining common subsequences independently and rapidly with high parallelism. Shouji accelerates sequence alignment by 4.2–18.8× without affecting the alignment accuracy. We refer the reader to the sidebar for a brief discussion of several other related works.

Base Counting
The base counting filter compares the numbers of bases (A, C, G, T) in the read with the corresponding base counts in the reference segment. If one sequence has, for example, three more Ts than another sequence, then their alignment has at least three edits. If the difference in count is greater than E, then the two sequences are dissimilar and the reference segment is discarded. Such a simple filtering approach rejects a significant fraction of dissimilar sequences (e.g., 49.8%–80.4% of sequences, as shown in GASSST¹⁴) and thus avoids a large fraction of expensive verification computations required by sequence alignment algorithms.

Q-Gram Filtering Approach
The q-gram filtering approach considers all of the sequence's possible overlapping substrings of length q (known as q-grams). Given a sequence of length m, there are m − q + 1 overlapping q-grams that are obtained by sliding a window of length q over the sequence. A single difference in one of the sequences can affect at most q overlapping q-grams. Thus, E differences can affect no more than q × E q-grams, where E is the edit distance threshold. The minimum number of shared q-grams between two similar sequences is therefore (m − q + 1) − (q × E). This filtering approach requires very simple operations (e.g., sums and comparisons), which makes it attractive for hardware acceleration, such as in GRIM-Filter.¹³ GRIM-Filter exploits the high memory bandwidth and computation capability in the logic layer of 3-D-stacked memory to accelerate q-gram filtering in the DRAM chip itself, using a new representation of the reference genome that is friendly to in-memory processing. q-gram filtering is robust in handling only a small number of edits, as the presence of multiple edits in a single q-gram is significantly underestimated (e.g., counted as a single edit).

Sparse Dynamic Programming
Sparse DP algorithms exploit the exact matches (seeds) shared between a read and a reference segment to reduce execution time. These algorithms exclude the corresponding locations of these seeds from the estimation of the number of edits between the two sequences, as they were already detected as exact matches during indexing. Sparse DP filtering techniques apply DP-based alignment algorithms only between every two nonoverlapping seeds to quickly estimate the total number of edits. This approach is also known as chaining, and is used in minimap2.²

ACCELERATING SEQUENCE ALIGNMENT
After filtering out most of the mapping locations that lead to dissimilar sequence pairs, read mapping calculates the sequence alignment information for every read and reference segment extracted at each mapping location. Sequence alignment calculation is typically accelerated using one of two approaches: 1) accelerating the DP-based algorithms using hardware accelerators without altering algorithmic behavior; and 2) developing heuristics that sacrifice the optimality of the alignment score solution in order to reduce alignment time. With the first approach, it is challenging to rapidly calculate sequence alignment of long reads with high parallelism. As long reads have high sequencing error rates (up to 20% of the read length), the edit distance threshold for long reads is typically higher than that for short reads, which results in calculating more entries in the DP matrix compared to that of short reads. The use of heuristics (i.e., the second approach) helps to reduce the number of calculated entries in the DP matrix and hence allows both the execution time and memory footprint to grow only linearly with read length (as opposed to quadratically with classical DP). Next, we describe the two approaches in detail.

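Both approaches build on the same underlying recurrence; as a reference point, a minimal sketch (my own) of unit-cost DP-based edit distance, whose time grows quadratically with sequence length even though only two rows are kept in memory:

```python
# Minimal unit-cost DP (edit distance) baseline; this is an illustrative
# sketch, not the production kernel of any accelerator discussed here.
def edit_distance(a, b):
    prev = list(range(len(b) + 1))          # row for the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,     # deletion
                           cur[j - 1] + 1,  # insertion
                           prev[j - 1] + (ca != cb)))  # match/substitution
        prev = cur
    return prev[-1]

print(edit_distance("kitten", "sitting"))   # prints 3
```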
Accurate Alignment Accelerators
From a hardware perspective, sequence alignment acceleration has five directions: 1) using SIMD-capable CPUs; 2) using multicore CPUs and GPUs; 3) using FPGAs; 4) using ASICs; and 5) using PIM architectures. Traditional DP-based algorithms are typically accelerated by computing only the necessary regions (i.e., diagonal vectors) of the DP matrix rather than the entire matrix, as proposed in Ukkonen's banded algorithm.²⁷ This reduces the search space of the DP-based algorithm and reduces computation time. The number of diagonal bands required for computing the DP matrix is 2E+1, where E is the edit distance threshold. For example, the number of entries in the banded DP matrix for a 2 Mb long read can be 1.2 trillion. Parasail¹⁵ and KSW2 (used in minimap2²) exploit both Ukkonen's banded algorithm and SIMD-capable CPUs to compute banded alignment for a sequence pair with a configurable scoring function. SIMD instructions offer significant parallelism to the matrix computation by executing the same vector operation on multiple operands at once. KSW2 is nearly as fast as Parasail when KSW2 does not use heuristics (explained in the next section).

The multicore architecture of CPUs and GPUs provides the ability to compute alignments of many independent sequence pairs concurrently. GASAL2¹⁶ exploits the multicore architecture of both CPUs and GPUs for highly parallel computation of sequence alignment with a user-defined scoring function. Unlike other GPU-accelerated tools, GASAL2 transfers the bases to the GPU without encoding them into binary format, and hides the data transfer time by overlapping GPU and CPU execution. GASAL2 is up to 20× faster than Parasail (when executed with 56 CPU threads). BWA-MEM2¹¹ accelerates the banded sequence alignment of its predecessor (BWA-MEM) by up to 11.6×, by leveraging multicore and SIMD parallelism. However, to achieve such levels of acceleration, BWA-MEM2 builds an index structure that is 6× larger than that of minimap2. Other designs, such as FPGASW,¹⁷ exploit the very large number of hardware execution units in FPGAs to form a linear systolic array. Each execution unit in the systolic array is responsible for computing the value of a single entry of the DP matrix. The systolic array computes a single vector of the matrix at a time. The data dependency between the entries restricts the systolic array to computing the vectors sequentially (e.g., top-to-bottom, left-to-right, or in an antidiagonal manner). FPGASW has a similar execution time as its GPU implementation, but is 4× more power efficient.

Specialized hardware accelerators (i.e., ASIC designs) provide application-specific, power- and area-efficient solutions to accelerate sequence alignment. For example, GenAx¹⁸ is composed of SillaX, a sequence alignment accelerator, and a second accelerator for finding seeds. SillaX supports both a configurable scoring function and traceback operations. SillaX is more efficient for short reads than for long reads, as it consists of an automata processor whose performance scales quadratically with the edit distance. GenAx is 31.7× faster than the predecessor of BWA-MEM2 (i.e., BWA-MEM) for short reads.

Recent PIM architectures such as RAPID¹⁹ exploit the ability to perform computation inside or near the memory chip to enable efficient sequence alignment. RAPID modifies the DP-based alignment algorithm to make it friendly to in-memory parallel computation by calculating two DP matrices: one for calculating substitutions and exact matches and another for calculating insertions and deletions. RAPID claims that this approach efficiently enables higher levels of parallelism compared to traditional DP algorithms. The two main benefits of RAPID and such PIM-based architectures are higher performance and higher energy efficiency,⁴,⁵ as they alleviate the need to transfer data between the main memory and the CPU cores through slow and energy-hungry buses, while providing a high degree of parallelism with the help of PIM. RAPID is on average 11.8× faster and 212.7× more power efficient than a 384-GPU cluster implementation of sequence alignment, known as CUDAlign.²⁰

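The banded computation described above restricts the DP to the 2E+1 diagonals around the main one; a scalar simplification of my own (none of the SIMD or systolic parallelism of the cited designs is modeled here):

```python
# My own scalar sketch of banded edit distance: only cells within the
# 2E+1 diagonals |i - j| <= e are computed.
def banded_edit_distance(a, b, e):
    """Edit distance restricted to the band; returns None if it exceeds e."""
    INF = e + 1                              # saturating "too many edits"
    prev = [j if j <= e else INF for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i if i <= e else INF]
        for j in range(1, len(b) + 1):
            if abs(i - j) > e:               # outside the band
                cur.append(INF)
                continue
            best = prev[j - 1] + (a[i - 1] != b[j - 1])   # diagonal
            if abs(i - 1 - j) <= e:
                best = min(best, prev[j] + 1)             # deletion
            if abs(i - (j - 1)) <= e:
                best = min(best, cur[j - 1] + 1)          # insertion
            cur.append(min(best, INF))
        prev = cur
    return prev[len(b)] if prev[len(b)] <= e else None

print(banded_edit_distance("ACGT", "AGT", e=1))   # prints 1
```

Only O(E) cells per row are meaningful, which is the source of the band's savings; the cited CPU/FPGA designs additionally compute many band cells in parallel.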

Heuristic-Based Alignment Accelerators
The second direction is to limit the functionality of the alignment algorithm or sacrifice the optimality of the alignment solution in order to reduce execution time. The use of restrictive functionality and heuristics limits the possible applications of the algorithms that utilize this direction. Examples of limiting functionality include limiting the scoring function, or accelerating only the computation of the DP matrix without performing the backtracking step.²¹ There are several existing algorithms and corresponding hardware accelerators that limit scoring function flexibility. Levenshtein distance and Myers's bit-vector algorithm are examples of algorithms whose scoring functions are fixed, such that they penalize all types of edits equally when calculating the total alignment score. Restrictive scoring functions reduce the total execution time of the alignment algorithm and reduce the bit-width requirement of the register that accommodates the value of each entry in the DP matrix. ASAP²² accelerates Levenshtein distance calculation by up to 63.3× using FPGAs compared to its CPU implementation. The use of a fixed scoring function, as in Edlib²³ (the state-of-the-art implementation of Myers's bit-vector algorithm), helps to outperform Parasail (which uses a flexible scoring function) by 12–1000×. One downside of fixed-function scoring is that it may lead to the selection of a suboptimal sequence alignment.

There are other algorithms and hardware architectures that provide low alignment time by trading off accuracy. Darwin⁸ builds a customized hardware architecture to speed up the alignment process, by dividing the DP matrix into overlapping submatrices and processing each submatrix independently using systolic arrays. Darwin provides three orders of magnitude speedup compared to Edlib.²³ Dividing the DP matrix (known as the Four-Russians Method) enables significant parallelism during DP matrix computation, but it leads to suboptimal alignment calculation.¹⁴ Darwin claims that choosing a large submatrix size (at least 320 × 320) and ensuring sufficient overlap (at least 128 entries) between adjacent submatrices may provide optimal alignment calculation for some datasets.

There are other proposals that limit the number of calculated entries of the DP matrix based on one of two approaches: 1) using sparse DP; or 2) using a greedy approach to maintain a high alignment score. Both approaches suffer from providing suboptimal alignment calculation.²⁴,²⁵ The first approach uses the same sparse DP algorithm used for pre-alignment filtering but as an alignment step, as done in the exonerate tool.²⁴ The second approach is employed in X-drop,²⁵ which 1) avoids calculating entries (and their neighbors) whose alignment scores are more than X below the highest score seen so far (where X is a user-specified parameter); and 2) stops early when a high alignment score is not possible. The X-drop algorithm is guaranteed to find the optimal alignment between relatively similar sequences for only some scoring functions.²⁵ A similar algorithm (known as Z-drop) makes KSW2 at least 2.6× faster than Parasail.
DISCUSSION AND FUTURE OPPORTUNITIES
Despite more than two decades of attempts, bridging the performance gap between sequencing machines and read mapping is still challenging. We summarize four main challenges below.

First, we need to accelerate the entire read mapping process rather than its individual steps. Accelerating only a single step of read mapping limits the overall achieved speedup according to Amdahl's Law. Illumina and NVIDIA have recently started following a more holistic approach, and they claim to accelerate genome analysis by more than 48×, mainly by using specialization and hardware/software codesign. Illumina has built an FPGA-based platform, called DRAGEN (https://www.illumina.com/products/by-type/informatics-products/dragen-bio-it-platform.html), that accelerates all steps of genome analysis, including read mapping and variant calling. DRAGEN reduces the overall analysis time from 32 CPU hours to only 37 min.⁷ NVIDIA has built Parabricks, a software suite accelerated using the company's latest GPUs. Parabricks (https://developer.nvidia.com/clara-parabricks) can analyze whole human genomes at 30× coverage in about 45 min.

Second, we need to reduce the high amount of data movement that takes place during genome analysis. Moving data 1) between compute units and main memory; 2) between multiple hardware accelerators; and 3) between the sequencing machine and the computer performing the analysis incurs high costs in terms of execution time and energy. These costs are a significant barrier to enabling efficient analysis that can keep up with sequencing technologies, and some recent works try to tackle this problem.⁴,⁵,¹³ GenASM⁹ is a framework that uses bitvector-based ASM to accelerate multiple steps of the genome analysis pipeline, and is designed to be implemented inside 3-D-stacked memory. Through a combination of hardware–software co-design to unlock parallelism, and PIM to reduce data movement, GenASM can perform 1) pre-alignment filtering for short reads; 2) sequence alignment for both short and long reads; and 3) whole genome alignment, among other use cases. For short/long read alignment, GenASM achieves 111×/116× speedup over state-of-the-art software read mappers while reducing power consumption by 33×/37×. DRAGEN reduces data movement between the sequencing machine and the computer performing analysis by adding specialized hardware support inside the sequencing machine for data compression. However, this still requires movement of compressed data. Performing read mapping inside the sequencing machine itself can significantly improve efficiency by eliminating sequencer-to-computer movement, and embedding a single specialized chip for read mapping within a portable sequencing device can potentially enable new applications of genome sequencing (e.g., rapid surveillance of new diseases such as COVID-19, near-patient testing, bringing precision medicine to remote locations). Unfortunately, efforts in this direction remain very limited.

Third, we need to develop flexible hardware architectures that do not conservatively limit the range of supported parameter values at design time. Commonly used read mappers (e.g., minimap2) have different input parameters, each of which has a wide range of input values. For example, the edit distance threshold is typically user defined and can be very high (15%–20% of the read length) for recent long reads. A configurable scoring function is another example, as it determines the number of bits needed to store each entry of the DP matrix (e.g., DRAGEN imposes a restriction on the maximum frequency of seed occurrence). Due to rapid changes in sequencing technologies (e.g., high sequencing error rate and longer read lengths),²⁸ these design restrictions can quickly make specialized hardware obsolete. Thus, read mappers need to adapt their algorithms and their hardware architectures to be modular and scalable so that they can be implemented for any sequence length and edit distance threshold based on the sequencing technology.

Fourth, we need to adapt existing genomic data formats for hardware accelerators or develop more efficient file formats. Most sequencing data is stored in the FASTQ/FASTA format, where each base takes a single byte (8 bits) of memory. This encoding is inefficient, as only 2 bits (3 bits when the ambiguous base, N, is included) are needed to encode each DNA base. The sequencing machine converts sequenced bases into FASTQ/FASTA format, and hardware accelerators convert the file contents into unique (for each accelerator) compact binary representations for efficient processing. This process, which requires multiple format conversions, wastes time.


For example, only 43% of the sequence alignment time in BWA-MEM2¹¹ is spent on calculating the DP matrix, while 33% of the sequence alignment time is spent on preprocessing the input sequences for loading into SIMD registers, as reported by Ahmed et al.¹¹ To address this inefficiency, we need to widely adopt efficient hardware-friendly formats, such as UCSC's 2bit format (https://genome.ucsc.edu/goldenPath/help/twoBit), to maximize the benefits of hardware accelerators and reduce resource utilization. We are not aware of any recent read mapper that uses such formats.

The acceleration efforts we highlight in this article represent state-of-the-art efforts to reduce current bottlenecks in the genome analysis pipeline. We hope that these efforts and the challenges we discuss provide a foundation for future work in accelerating read mappers and developing other genome sequence analysis tools.

ACKNOWLEDGMENTS
The work of Onur Mutlu's SAFARI Research Group was supported by funding from Intel, the Semiconductor Research Corporation, VMware, and the National Institutes of Health (NIH).

REFERENCES
1. M. Alser, H. Hassan, A. Kumar, O. Mutlu, and C. Alkan, "Shouji: A fast and efficient pre-alignment filter for sequence alignment," Bioinformatics, vol. 35, pp. 4255–4263, 2019.
2. H. Li, "Minimap2: Pairwise alignment for nucleotide sequences," Bioinformatics, vol. 34, pp. 3094–3100, 2018.
3. M. Alser et al., "Technology dictates algorithms: Recent developments in read alignment," 2020, arXiv:2003.00110.
4. O. Mutlu, S. Ghose, J. Gómez-Luna, and R. Ausavarungnirun, "Processing data where it makes sense: Enabling in-memory computation," Microprocessors Microsyst., vol. 67, pp. 28–41, 2019.
5. S. Ghose, A. Boroumand, J. S. Kim, J. Gómez-Luna, and O. Mutlu, "Processing-in-memory: A workload-driven perspective," IBM J. Res. Develop., vol. 63, no. 6, pp. 3:1–3:19, 2019.
6. Z. D. Stephens et al., "Big data: Astronomical or genomical?," PLoS Biol., vol. 13, 2015, Art. no. e1002195.
7. A. Goyal et al., "Ultra-fast next generation human genome sequencing data processing using DRAGEN® Bio-IT processor for precision medicine," Open J. Genetics, vol. 7, pp. 9–19, 2017.
8. Y. Turakhia, G. Bejerano, and W. J. Dally, "Darwin: A genomics co-processor provides up to 15,000× acceleration on long read assembly," in Proc. Archit. Support Program. Lang. Oper. Syst., 2018, pp. 199–213.
9. D. Senol Cali et al., "GenASM: A low-power, memory-efficient approximate string matching acceleration framework for genome sequence analysis," in Proc. 53rd Int. Symp. Microarchitecture, 2020.
10. R. Langarita et al., "Compressed sparse FM-index: Fast sequence alignment using large k-steps," IEEE/ACM Trans. Comput. Biol. Bioinformatics, to be published, doi: 10.1109/TCBB.2020.3000253.
11. M. Vasimuddin, S. Misra, H. Li, and S. Aluru, "Efficient architecture-aware acceleration of BWA-MEM for multicore systems," in Proc. IEEE Int. Parallel Distrib. Process. Symp., 2019, pp. 314–324.
12. W. Huangfu, S. Li, X. Hu, and Y. Xie, "RADAR: A 3D-ReRAM based DNA alignment accelerator architecture," in Proc. Des. Autom. Conf., 2018, pp. 1–6.
13. J. S. Kim et al., "GRIM-Filter: Fast seed location filtering in DNA read mapping using processing-in-memory technologies," BMC Genomics, vol. 19, 2018, Art. no. 89.
14. G. Rizk and D. Lavenier, "GASSST: Global alignment short sequence search tool," Bioinformatics, vol. 26, pp. 2534–2540, 2010.
15. J. Daily, "Parasail: SIMD C library for global, semi-global, and local pairwise sequence alignments," BMC Bioinformatics, vol. 17, 2016, Art. no. 81.
16. N. Ahmed, J. Lévy, S. Ren, H. Mushtaq, K. Bertels, and Z. Al-Ars, "GASAL2: A GPU accelerated sequence alignment library for high-throughput NGS data," BMC Bioinformatics, vol. 20, 2019, Art. no. 520.
17. X. Fei, Z. Dan, L. Lina, M. Xin, and Z. Chunlei, "FPGASW: Accelerating large-scale Smith–Waterman sequence alignment application with backtracking on FPGA linear systolic array," Interdisciplinary Sci.: Life Sci., vol. 10, pp. 176–188, 2018.
18. D. Fujiki et al., "GenAx: A genome sequencing accelerator," in Proc. 45th Annu. Int. Symp. Comput. Archit., 2018, pp. 69–82.
19. S. Gupta, M. Imani, B. Khaleghi, V. Kumar, and T. Rosing, "RAPID: A ReRAM processing in-memory architecture for DNA sequence alignment," in Proc. IEEE/ACM Int. Symp. Low Power Electron. Des., 2019, pp. 1–6.
20. E. F. de Oliveira Sandes, G. Miranda, X. Martorell, E. Ayguade, G. Teodoro, and A. C. Magalhaes Melo, "CUDAlign 4.0: Incremental speculative traceback for exact chromosome-wide alignment in GPU clusters," IEEE Trans. Parallel Distrib. Syst., vol. 27, no. 10, pp. 2838–2850, Oct. 2016.
21. P. Chen, C. Wang, X. Li, and X. Zhou, "Accelerating the next generation long read mapping with the FPGA-based system," IEEE/ACM Trans. Comput. Biol. Bioinformatics, vol. 11, no. 5, pp. 840–852, Sep.–Oct. 2014.
22. S. S. Banerjee et al., "ASAP: Accelerated short-read alignment on programmable hardware," IEEE Trans. Comput., vol. 68, no. 3, pp. 331–346, Mar. 2019.
23. M. Šošić and M. Šikić, "Edlib: A C/C++ library for fast, exact sequence alignment using edit distance," Bioinformatics, vol. 33, pp. 1394–1395, 2017.
24. G. S. C. Slater and E. Birney, "Automated generation of heuristics for biological sequence comparison," BMC Bioinformatics, vol. 6, 2005, Art. no. 31.
25. Z. Zhang, S. Schwartz, L. Wagner, and W. Miller, "A greedy algorithm for aligning DNA sequences," J. Comput. Biol., vol. 7, pp. 203–214, 2000.
26. B. Langmead and A. Nellore, "Cloud computing for genomic data analysis and collaboration," Nature Rev. Genetics, vol. 19, no. 4, p. 208, 2018.
27. E. Ukkonen, "Algorithms for approximate string matching," Inform. Control, vol. 64, no. 1–3, pp. 100–118, 1985.
28. D. Senol Cali, J. S. Kim, S. Ghose, C. Alkan, and O. Mutlu, "Nanopore sequencing technology and tools for genome assembly: Computational analysis of the current state, bottlenecks and future directions," Briefings Bioinf., vol. 20, no. 4, pp. 1542–1559, 2019.

Mohammed Alser is currently with ETH Zürich. Contact him at alserm@inf.ethz.ch.

Zülal Bingöl is currently with Bilkent University. Contact her at zulal.bingol@bilkent.edu.tr.

Damla Senol Cali is currently with Carnegie Mellon University. Contact her at dsenol@andrew.cmu.edu.

Jeremie Kim is currently with ETH Zürich and Carnegie Mellon University. Contact him at jeremie.kim@inf.ethz.ch.

Saugata Ghose is currently with the University of Illinois at Urbana–Champaign and Carnegie Mellon University. Contact him at ghose@illinois.edu.

Can Alkan is currently with Bilkent University. Contact him at calkan@cs.bilkent.edu.tr.

Onur Mutlu is currently with ETH Zürich, Carnegie Mellon University, and Bilkent University. Contact him at omutlu@gmail.com.
Theme Article: Biology and Systems Interactions

PurpleDrop: A Digital
Microfluidics-Based
Platform for Hybrid
Molecular-Electronics
Applications
Ashley Stephenson, Max Willsey, Jeff McBride, and Sharon Newman
University of Washington

Bichlien Nguyen
University of Washington and Microsoft

Christopher Takahashi
University of Washington

Karin Strauss
University of Washington and Microsoft

Luis Ceze
University of Washington

Abstract—Molecular manipulation and analysis are the cornerstone of life sciences. With the recent advances in molecular data storage and computing, it has become an increasingly exciting and viable alternative for the post-CMOS scaling era. Widespread use of molecular manipulation/analysis and data storage/computing requires a scalable and low-cost platform for hybrid molecular-electronics systems. This enables us to build on the best of what molecular and electronics systems can offer. In this article, we present PurpleDrop, a full-stack digital microfluidic platform for hybrid molecular-electronic systems in multiple domains, and focus on DNA data storage as a use case. We describe its design principles and relevant aspects such as closed-loop operation with computer vision and capacitive sensing, on-board magnetic bead extraction, and polymerase chain reaction, among other primitives. Importantly, we emphasize the ability to express and execute protocols and computation that include molecular and computational components.

Digital Object Identifier 10.1109/MM.2020.3005615
Date of publication 30 June 2020; date of current version 1 September 2020.

0272-1732 © 2020 IEEE. Published by the IEEE Computer Society.

THE SEMICONDUCTOR INDUSTRY has seen a profound trend in miniaturization driven by scientific and technological advancements that have enabled us to generate, manipulate, and store massive amounts of data. However, as we approach the physical limits of conventional storage and computing hardware, alternative approaches must be investigated.

DNA data storage systems are a prime example of molecule-based systems capable of achieving far greater information density (10^18 bytes/mm³) and stability (thousands of years) than conventional magnetic, optical, and solid state storage technologies. DNA is also likely to have eternal format relevance, owing to interest in DNA for health and biotechnology applications.

Molecular systems uniquely require macro-level physical storage and manipulation, e.g., mixing, splitting, heating, or diluting various molecules in solution. At present, these operations are most commonly performed manually by humans in laboratories. Automation is key if we are to develop sophisticated and robust hybrid molecular-electronic systems1 that combine the best features of electronic computing with the benefits of molecular density and stability.

We have developed a digital microfluidic (DMF) automation platform, PurpleDrop, in response to this need and demonstrated its potential in hybrid molecular-electronic systems. PurpleDrop aims to 1) minimize cost and hardware complexity, 2) maximize scalability, and 3) maximize modularity. Furthermore, PurpleDrop is designed with a larger system in mind: we have designed a complementary software stack to make the most of DMF's flexibility without sacrificing reliability.

In this article, we present the PurpleDrop platform and demonstrate its application in molecular computing. Using data storage as a case study, we first provide background on modern DNA data storage and present the need for future automation and scale. Building on previous work demonstrating DNA data storage with microfluidic retrieval via PurpleDrop,2 we now turn our attention to automating subsequent steps in the DNA data storage pipeline. Finally, we provide a detailed description of the PurpleDrop platform and explain the functional considerations that informed its design.

HYBRID MOLECULAR-ELECTRONIC SYSTEMS
Electronic systems are well understood, and can perform a wide variety of tasks in extremely specific and controllable ways. Molecular systems, however, offer remarkable advantages in performance and energy efficiency.

In the molecular domain, computation occurs in the same solution in which data-containing molecules reside, making it possible to perform many molecular computational processes in parallel. Performing an operation on a few molecules takes the same amount of time as for trillions of molecules, with the caveat that some operations could take hours to complete. Given these qualities, molecular systems are particularly well suited for applications with large data sets that can handle the high latency that comes along with the high throughput.

In the case of DNA data storage, combining the extreme density and parallelism of the molecular domain with the precision of electronic systems could yield a more efficient and robust system than either domain in isolation.

Figure 1. DNA data storage workflow. Digital data is encoded into a nucleotide sequence used to synthesize DNA "files." DNA files can be stored in pools containing many files if they are tagged with unique identifiers (corresponding to PCR primers). To recover the original digital data, random access selectively retrieves and amplifies only the file(s) of interest from the pool. The selected files are then sequenced and decoded. A notable gap between synthesis and sequencing exists, and automation of this workflow requires an efficient solution for the physical storage and retrieval of DNA data.

Archival DNA Data Storage
Figure 1 shows a DNA data storage system. Such systems encode binary data into DNA sequences of nucleotides (A, C, G, and T) that are

translated into manufactured molecules (or oligonucleotides) by a chemical process and stored in molecular pools. To recover the stored information, the DNA molecules must be sequenced (read) and decoded back into digital data. Previous work in DNA data storage has covered synthesis, encoding and decoding, DNA preservation, random access protocols, and sequencing technologies.

However, achieving high density data storage in practice will require an automated mechanism for physical storage, organization, and retrieval of files in molecular form, a need illustrated by the gap in the modern DNA data storage pipeline depicted in Figure 1.

Physical Storage and Random Access
The manner in which DNA data is physically stored in a system is an essential consideration in how it is retrieved. DNA data must be in solution for sample preparation and sequencing steps, but storing DNA in liquid form requires separate vessels that reduce information density. DNA can be easily isolated and more densely stored when dehydrated, and multiple file pools can be stored in close proximity if dehydrated onto a substrate such as glass.2

To "write" DNA files to a storage substrate, samples must be deposited in a spatially isolated arrangement, treated with preservatives, and dehydrated onto the surface for indefinite storage. The "read" process involves rehydrating a sample with water, performing random access via polymerase chain reaction (PCR) (or other molecular methods), and preparing the sample for sequencing via PCR and other heat-dependent operations.

Random access allows specific data to be retrieved without wastefully sequencing an entire pool of data. DNA files are tagged during synthesis with unique addresses (short strands of DNA corresponding to specific files) that enable retrieval of only those strands. Larger addresses use more nucleotides in the strand, but allow more data to be stored in a pool.

Storing DNA in physically isolated pools enables information density gains because physically isolated pools can share the same molecular address space. This leads to a hierarchical addressing scheme where the address of a file encodes both the physical location of the pool and the molecular address within that pool.2

Figure 2. (a) PurpleDrop's modular features enable fluidic operations essential to hybrid molecular-electronic systems such as automated DNA data storage. (PurpleDrop in a hybrid molecular-electronic system.) (b) PurpleDrop is comprised of two main PCBs: a parent board that houses electronic components (A), and a removable child board that contains the electrode surface (B). Peripheral hardware systems include three on-board heaters (F), a computer vision system consisting of a camera on a 3-D printed mount (C), and a fluid input-output system powered by miniature peristaltic pumps (D) that pump fluids onto and off the board via capillary tube ports (E). A motorized magnet (not pictured) is mounted on the underside of the electrode board. (The PurpleDrop DMF device.)

Molecular Random Access Methods
Random access can be implemented in various ways in molecular systems.

PCR is widely used in molecular biology to rapidly make many copies of specific DNA samples in solution. It involves mixing a number of reagents, including enzymes, small sequences of single-stranded DNA called primers, individual nucleotides, and the DNA to be copied, and controlling their temperature, typically done by an instrument called a thermocycler. When the temperature is raised, DNA "melts," and the two sides of its double helix come apart. As the solution cools, primers attach to the single-stranded sides of the DNA, creating double-stranded sections of DNA. This enables enzymes to also attach at the double-stranded to single-stranded

transitions and perform a copy of the DNA by adding complementary individual nucleotides as they traverse the molecule. In DNA data storage systems, molecules representing a file are tagged with a unique primer target site (address). PCR-based random access selects specific DNA files from a pool (molecular random access) using the unique primers associated with those files, after the pool is physically retrieved from a library containing multiple DNA pools (physical coordinate random access).

Separation-based purification methods such as magnetic bead wash can also be used to accomplish random access. In this modality, magnetic beads coated with DNA strands that selectively bind to a specific file in the pool are used to extract files via separation and washing. Purified samples are amplified via PCR and sequenced.

At present, retrieval, purification, and amplification steps are typically performed by humans and expensive, bulky thermocyclers. Streamlined, scalable DNA data storage will require fluid handling solutions for integrated molecular processing steps such as file retrieval and PCR.

PURPLEDROP: A SIMPLE, FLEXIBLE DMF DEVICE
Microfluidic technology has potential applications in a broad range of areas such as synthetic biology, medical diagnostics, automated experimentation, and molecular computing. PurpleDrop was inspired by its DNA data storage use case, but designed with a greater vision in mind: a full-stack microfluidic platform that is cost effective, reliable, and capable enough to be the foundation of scalable computer systems with molecular components.

The PurpleDrop hardware, shown in Figure 2(b), and complementary software stack, Puddle,3 manage design tradeoffs by keeping hardware costs low and compensating with sophisticated software. PurpleDrop is comprised of consumer materials, and can be assembled outside of a clean room, with an assembled form factor of approximately 110 × 140 × 100 mm.

PurpleDrop's modular features, depicted in Figure 2(a), can be combined in numerous ways to enable operations such as those required for DNA data storage. Together, PurpleDrop and


Puddle provide a platform that could allow users to easily combine computation and fluidic operations to implement hybrid molecular-electronic systems.

Microfluidics Background
Microfluidic devices can perform chemical and biological protocols at smaller scales, lower cost, and higher precision than humans.4,5 These devices all center around the manipulation of small fluid volumes, but different approaches offer different tradeoffs among size, cost, flexibility, and reliability.

Channel-based devices are manufactured ahead-of-time as a set of channels designed to meet the needs of a particular protocol. These devices can offer some flexibility by including configurable valves,6,7 but their design is largely fixed-function. On the other hand, liquid handling robots feature mechanical arms or gantries that hold pipettes. The robots can be programmatically controlled, offering a great deal of flexibility, but they are large and expensive, costing from many thousands to hundreds of thousands of dollars.

A more recent technology, DMF, moves droplets around a grid of electrodes using a phenomenon called electrowetting on dielectric (EWOD).4,5 In EWOD DMF, droplets sit on a substrate material that is patterned with electrodes, insulated by a dielectric layer, and coated with a hydrophobic layer. The dielectric layer prevents electrolysis of droplets, and the hydrophobic layer reduces the surface energy in contact with the droplets, making them easier to move. Some devices have a second plate with a hydrophobic coating added on top, effectively sandwiching the droplets between the two plates. The top plate typically consists of a continuous ground electrode made from conductive, transparent indium tin oxide glass.

Applying voltages to electrodes changes the wettability of droplets on the surface. To move a droplet, an electrode adjacent to it is activated while the electrode directly beneath it is deactivated, generating an electrowetting force that pulls the droplet onto the active electrode. This process is repeated to move droplets along any path of adjacent electrodes.

DMF devices can be highly programmable, offering flexibility at small size and low cost. These features are highly desirable in the molecular computing space, where automation and parallelization are essential.

Key PurpleDrop Features and Operations
PurpleDrop is designed to support several operations required by automated DNA data storage while navigating tradeoffs among cost, reliability, and scalability.

Moving Droplets
Electrode substrates are commonly made from silicon or glass, making them smooth and level but requiring costly clean room processes. Printed circuit boards (PCBs) can also be used as cheaper, more durable substrates, sacrificing some smoothness.8 PCBs can also accommodate multiple electrical wiring layers, increasing addressability, while glass and silicon substrates (outside of expensive integrated circuit-style fabrication) are typically limited to one layer with conduction lines patterned onto the surface.

PurpleDrop follows OpenDrop's9 design with a removable "child" electrode board that consists of a two-layer PCB with an array of 127 electrodes. The parent PCB contains the electrical components. Child electrode boards can be easily recoated or replaced upon wear or contamination with minimal cost and effort.

The dielectric and hydrophobic layers have similarly significant influence over performance. PurpleDrop's electrode PCB is coated with a 25.4-µm-thick polyimide tape dielectric layer (Caplinq, PIT0.5S-UT/100), with a top coat of Teflon AF 1600 (spin-coated at 98 xg for 60 s, DuPont/Chemours) for hydrophobicity. We chose polyimide tape over more traditional DMF dielectric materials such as Parylene due to its advantages in cost, dielectric strength, durability, and fabrication complexity. Polyimide tape is inexpensive, costing about $0.07 per electrode board before scale. In contrast, Parylene C costs approximately $100 per board.

Furthermore, in previous design iterations using Parylene C (5 µm applied via vapor deposition), we faced significant challenges with the gaps between electrodes interfering with droplet movement. The copper electrodes are patterned

with 1 oz of copper, which corresponds to a height of approximately 1.4 mil above the FR4. As a result, droplets frequently snagged on electrode edges when transitioning between electrodes, leading to droplet fragmentation or immobilization. In contrast to conformal coatings like Parylene, the polyimide tape forms a relatively level surface that covers the gaps in between electrodes and allows droplets to smoothly transition between electrodes.

Fluid Management
We have implemented computer vision and capacitance sensing subsystems that provide dynamic droplet tracking, volume detection, and error correction to enhance reliability.

Volume sensing and droplet tracking provide critical information about the number and size of droplets on the board. This allows a higher level system like Puddle to perform fluid resource management by automatically placing and routing droplets. Furthermore, these features enable dynamic error correction, where the system validates that the operations occurring on the board match those executed by the software.

The computer vision system3 consists of a Raspberry Pi camera on a 3-D-printed mount. Droplets are detected based on color.

The presence of a conductive fluid between an electrode and the top-plate increases the capacitance between the two conductors, with the amount of the increase depending on the area covered by the fluid. The capacitance sensing system measures this capacitance to detect the presence of a droplet and measure its volume. The PurpleDrop electrodes and top-plate are switched between the high voltage supply (V_HV) and 0 V at 500 Hz to generate an ac actuation voltage. During each cycle, the total charge transfer is measured while the active electrodes are driven from 0 V to V_HV, by integrating the voltage across a sense resistor connected between the top-plate and its driver. This provides capacitance measurement at a rate of 500 Hz, which is significantly faster than a typical camera and image processing. Figure 3(a) shows a diagram of the capacitance measurement system.

In order to measure the relationship of capacitance to drop volume, a drop of water was placed on PurpleDrop and slowly enlarged by adding approximately 0.5 µL at a time. At each step, a camera was used to measure the area covered by the drop, and capacitance was recorded. Volume was calculated using the known height of the gap between the top-plate and the electrode board. Results for this test are shown in Figure 3(b).

Figure 3. (a) Block diagram of capacitance sensing circuitry: An analog integrator circuit and microcontroller are used to measure the total charge transferred while selected electrodes are charged. (b) The measured capacitance plotted for varying droplet sizes.

Capacitance sensing has several advantages over the computer vision system: lower latency, lower computational load, and robustness to lighting changes. Using capacitance sensing in conjunction with computer vision could lead to more robust and precise sensing than either system alone.

Polymerase Chain Reaction
Filler fluids such as silicone oil are commonly used in two-plate DMF devices to reduce droplet resistance,

lower the driving voltage requirements, and slow evaporation;4 however, fluid media complicate certain operations. Inputting or removing droplets from the device is much simpler in an air medium, where introducing unwanted air bubbles is not a concern. We found that using an air medium was also advantageous for thermocycling protocols like PCR. When using a silicone oil medium (Clearco PSF-2cSt), we observed bubbles beginning to form around droplets at the heating site around 80 °C, with significant bubbling and fragmentation of the droplet above 90 °C. This made it impossible to move droplets away from the heaters for the remaining steps in the protocol. We decided instead to operate in air and utilize the computer vision and fluid input/output systems to provide automated droplet replenishment.3

PurpleDrop's fluid input/output system is driven by small peristaltic pumps (Takasago Fluidic Systems), each fitted with a capillary tube that serves as the interface between the pump and the electrode surface. The other port of the pump is fitted with tubing that can be coupled to arbitrary reservoirs or devices.

PurpleDrop has three heaters used for thermocycling, and compensates for the increased droplet evaporation in air by using the computer vision system (capacitance sensing could also be used) to monitor droplet volumes and provide real-time feedback to Puddle. When droplets fall below a threshold volume, a replenishment water droplet is pumped onto the board and carried to the reaction site. This process could extend the number of possible PCR cycles indefinitely.

Bead Extraction
The magnetic-bead-based random access implementation discussed earlier is well matched with PurpleDrop's DNA data storage architecture and hardware. PurpleDrop is equipped with a motorized magnet that facilitates magnetic bead wash. A small servo motor mounted near the underside of the electrode board moves a magnet underneath droplets containing magnetic beads. Once in place, the magnet causes the beads to aggregate at the bottom of the droplet. With the magnet held in place, the supernatant droplet (i.e., the fluid without the beads being held by the magnet) is moved away via electrowetting until complete separation from the pellet [see Figure 6(a)]. After separation, washing and resuspension steps complete the purification process.

Implementing Hybrid Molecular-Electronic Primitives With PurpleDrop
To demonstrate how PurpleDrop can automate DNA data storage and random access, we implemented several key operations: file storage and retrieval in a DNA library, magnetic-bead-based random access for file selection from a pool, and file amplification for sequencing preparation.

Figure 4. DNA data storage pipeline with microfluidic storage, retrieval, and sequencing preparation. Digital data is encoded into nucleotide sequences and synthesized into DNA. DNA files are stored in spatially isolated pools of dehydrated DNA, organized into a "library" of pools. Each library is contained on a glass cartridge with a unique ID, and these cartridges are stored in a deck. An address system mapping files to specific libraries, pools, and PCR primers is used to locate files. Cartridges are loaded onto PurpleDrop for microfluidic file retrieval with magnetic-bead-based random access. Retrieved files are amplified via PCR and output to a sequencer for data recovery.

DNA Data Storage and Retrieval
We previously evaluated PurpleDrop as a file retrieval mechanism in a high density DNA data storage library.2 Libraries consist of dehydrated pools of DNA files organized in arrays on the droplet-facing sides of PurpleDrop's glass top plates, referred to as cartridges when containing DNA data (see Figure 4). Water droplets are used to retrieve and prepare files for sequencing, as depicted in Figure 5.
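The evaporation-compensation loop described under "Key PurpleDrop Features and Operations" (monitor droplet volume, pump in water when it falls below a threshold) can be sketched as a simple feedback loop. This is an illustrative sketch only: the `read_volume_ul`/`dispense_ul` callbacks, the thresholds, and the simulated evaporation rate are all hypothetical, not the actual Puddle interface.

```python
def replenish_loop(read_volume_ul, dispense_ul, target_ul, threshold_ul, max_steps=100):
    """Closed-loop evaporation compensation: whenever the measured droplet
    volume falls below threshold_ul, dispense enough water to restore it
    to target_ul. Returns the volume history for inspection."""
    history = []
    for _ in range(max_steps):
        vol = read_volume_ul()            # e.g., from capacitance or vision
        history.append(vol)
        if vol < threshold_ul:
            dispense_ul(target_ul - vol)  # pump a replenishment droplet
    return history

# Simulated hardware: the droplet loses 0.2 uL to evaporation per reading.
class FakeBoard:
    def __init__(self, vol):
        self.vol = vol
    def read(self):
        self.vol -= 0.2                   # evaporation between readings
        return self.vol
    def pump(self, amount):
        self.vol += amount

board = FakeBoard(vol=4.0)
hist = replenish_loop(board.read, board.pump, target_ul=4.0,
                      threshold_ul=3.0, max_steps=50)
```

In the simulation, the droplet shrinks between readings and the loop tops it back up, so the volume oscillates between the threshold and the target instead of evaporating away, which is the property that lets PCR run for an unbounded number of cycles in air.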

Figure 5. Puddle code snippet (a) and associated diagram (b) depicting DNA file location and retrieval via
PurpleDrop. File locations are given in cardinal coordinates with C referring to the central file in the
configuration shown. A key maps each file to a physical location in the library and a PCR primer. Retrieved files
can be output directly to a sequencer. (c) Sequencing results from the work by Amin et al.2 show successful
recovery of the retrieved central file. (a) Puddle retrieval code. (b) DNA file retrieval. (c) Sequencing coverage
of each discovered file in cardinal coordinates (log scale).
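Figure 5(a) shows real Puddle retrieval code; as a plain-Python illustration of the same flow, the sketch below looks up a file's cartridge location and primer in a key, rehydrates the spot, and carries the droplet to an output port. Every identifier here (`FILE_KEY`, `retrieve`, the coordinates) is invented for this sketch and is not the Puddle API.

```python
# Hypothetical address key mapping file IDs to a physical spot on the
# cartridge (cardinal coordinates, as in Figure 5) and a PCR primer.
FILE_KEY = {
    "C": {"spot": (2, 2), "primer": "primer_C"},
    "N": {"spot": (2, 0), "primer": "primer_N"},
    "S": {"spot": (2, 4), "primer": "primer_S"},
}

def retrieve(file_id, output_port=(0, 0)):
    """Sketch of the retrieval flow: rehydrate the dehydrated DNA spot
    with a water droplet, then carry the droplet to the output port for
    downstream amplification and sequencing."""
    entry = FILE_KEY[file_id]
    return [
        ("dispense_water_at", entry["spot"]),   # rehydrate the DNA spot
        ("move_droplet_to", output_port),       # transport via electrowetting
        ("tag_for_pcr_with", entry["primer"]),  # primer used for amplification
    ]

plan = retrieve("C")
```

Because the plan is ordinary program data, a host computer can interleave such fluidic steps with computation, which is the hybrid molecular-electronic model the article argues for.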

Using this architecture, we demonstrated that it is possible to store dehydrated files in DNA on PurpleDrop (with an estimated storage density of 50 TB per 50 × 50 mm cartridge), and successfully retrieve and recover the files without contamination. This storage scheme allows for physical isolation, enabling pools to share the same addressing space and increasing the storage capacity of the system. Microfluidic retrieval provides the needed actuation for accessing and transporting data, with the added benefit of introducing near-data processing opportunities.
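Because physically isolated pools can reuse the same molecular address space, a file address factors into a physical part (cartridge, pool) and a molecular part (primer), and capacity multiplies across the levels. A minimal sketch, with all structure and numbers hypothetical rather than taken from the system:

```python
# Hypothetical hierarchical address: the same primer library is reused
# in every pool, so the pieces compose into a globally unique address.
def make_address(cartridge_id, pool_xy, primer_id):
    return {"cartridge": cartridge_id, "pool": pool_xy, "primer": primer_id}

def total_files(num_cartridges, pools_per_cartridge, primers_per_pool):
    # Physical isolation lets every pool share one molecular address space,
    # so total addressable files multiply across the hierarchy.
    return num_cartridges * pools_per_cartridge * primers_per_pool

addr = make_address("cart-07", (3, 5), "primer-12")
capacity = total_files(num_cartridges=10, pools_per_cartridge=96,
                       primers_per_pool=100)
```

With the illustrative numbers above, ten cartridges of 96 pools each address 96,000 files using only 100 distinct primers, whereas a flat molecular address space would need 96,000 distinct primers.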

Figure 6. (a) Motorized magnet is moved underneath the electrode board and positioned under the droplet
containing DNA and functionalized magnetic beads. The magnetic beads are pulled out of solution with
specific DNA attached and the supernatant droplet is moved away. (b) Retrieval with file A-specific beads
results in qPCR amplification of File A that takes place in fewer cycles than for file B and the control groups.
We attribute the latter to primer–dimer formation during qPCR. PCR amplification on PurpleDrop could be
calibrated to amplify only file A by limiting the number of cycles to 30, i.e., after amplification of file A and
before amplification of file B. (a) Magnetic bead wash on PurpleDrop. (b) Quantitative PCR (qPCR)
amplification for magnetic-bead-based random access.
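The bead-based separation in Figure 6(a) follows a fixed sequence of fluidic steps: bind, pellet, remove supernatant, wash, and elute. The sketch below lists that sequence as data; the step names and wash count are illustrative, not PurpleDrop commands.

```python
def bead_wash_protocol(num_washes=3):
    """Sketch of magnetic-bead purification: engage the magnet to pellet
    the beads, move the supernatant away, wash the pellet, then elute."""
    steps = [
        "mix_beads_with_pool",       # beads bind the target file's primers
        "incubate",
        "engage_magnet",             # servo moves magnet under the droplet
        "move_supernatant_away",     # electrowetting separates fluid from pellet
    ]
    for _ in range(num_washes):
        steps += ["dispense_ethanol_80pct", "wash_pellet",
                  "move_supernatant_away"]
    steps += ["disengage_magnet", "add_resuspension_buffer",
              "elute_and_output"]
    return steps

protocol = bead_wash_protocol()
```

Expressing the protocol as data makes it easy for a controller like Puddle to replay, log, and error-check each step against what the sensors observe on the board.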


For DNA data storage to become a reality outside of the research laboratory, scalable physical storage with automated file retrieval, as well as automated file preparation for sequencing, are imperative. We envision PurpleDrop filling the gap in the data pipeline between DNA synthesis and sequencing, as illustrated in Figure 1. In addition to file retrieval, random access, PCR, and other sequencing preparation steps are promising candidates for DMF automation.

Magnetic-Bead-Based Random Access
To demonstrate magnetic-bead-based random access on PurpleDrop, we synthesized a pool containing two unique DNA files, files A and B, in equal concentrations. Magnetic beads (Invitrogen, 65601) functionalized with (i.e., attached to) primers for file A were synthesized to selectively extract file A from the pool containing both files via bead wash. Control beads functionalized with a dummy primer (i.e., coated with a primer not found in the pool to reduce nonspecific binding that can occur with uncoated beads) were used to confirm selectivity.

All reagents were loaded onto the electrode board, and the magnetic beads were mixed and incubated with the file pool. The magnet was then used to extract the beads from the droplet as described in the "Key PurpleDrop Features and Operations" section. Following separation, multiple washing steps with 80% ethanol were performed to remove any unbound molecules from the pelleted beads. Finally, resuspension buffer was used to elute and output the purified sample. Quantitative PCR (qPCR) was used to assess random access performance [see Figure 6(b)]. qPCR of the PurpleDrop-purified sample showed amplification of file A, with some amplification of file B appearing to occur multiple cycles later. We attribute the latter to primer–dimer formation during qPCR. Primer–dimers are common PCR byproducts that form when primers attach to each other and undergo enzymatic strand extension. These strands are generally much shorter than file strands, and can be detected by measuring the strand lengths of the qPCR reaction products. The control bead experiments showed comparable late-cycle amplification of both files, which confirms that no file selection took place. These results suggest that PurpleDrop could be used to locate and selectively retrieve files stored in high density libraries.

PCR
PurpleDrop is well positioned to automate protocols such as PCR at miniaturized scales that reduce reagent consumption, equipment cost, and human involvement. Using the automated replenishment approach described in the "Key PurpleDrop Features and Operations" section, we demonstrated the first fully automated execution of PCR with replenishment in air.3 We have also demonstrated basic sequencing preparation steps on PurpleDrop, with automated output to a sequencer through the fluidic input/output system.3 PurpleDrop's versatile features enable a diverse set of fluidic operations that can be combined to implement automated DNA storage and retrieval in a low-cost, scalable, and efficient way.

DISCUSSION
We have presented PurpleDrop, a DMF platform designed with hybrid-molecular systems in mind. PurpleDrop provides necessary features for implementing these systems, i.e., storing, retrieving, and manipulating DNA, and integrates with computing systems that can provide higher level control. We expect this combination of programmability and biological primitives to be critical for future hybrid-molecular systems.

While this work focused on data storage, we have designed PurpleDrop to capture the flexibility and accessibility of DMF technology in a general way. Combined with future work, PurpleDrop and other DMF devices could impact a diverse range of domains from diagnostics to high throughput experimentation.

ACKNOWLEDGMENTS
This work was supported in part by a Grant from DARPA under the Molecular Informatics program, in part by the National Science Foundation EAGER Grant 1841188, and in part by a sponsored research agreement and gifts from Microsoft.

REFERENCES
1. D. Carmean, L. Ceze, G. Seelig, K. Stewart, K. Strauss, and M. Willsey, "DNA data storage and hybrid molecular-electronic computing," Proc. IEEE, vol. 107, no. 1, pp. 63–72, Jan. 2019.
2. S. Newman et al., "High density DNA data storage library via dehydration with digital microfluidic retrieval," Nature Commun., vol. 10, no. 1, 2019, Art. no. 1706.
3. M. Willsey et al., "Puddle: A dynamic, error-correcting, full-stack microfluidics platform," in Proc. 24th Int. Conf. Architectural Support Program. Lang. Oper. Syst., 2019, pp. 183–197.
4. K. Choi, A. H. Ng, R. Fobel, and A. R. Wheeler, "Digital microfluidics," Annu. Rev. Analytical Chem., vol. 5, no. 1, pp. 413–440, 2012.
5. R. B. Fair, "Digital microfluidics: Is a true lab-on-a-chip possible?" Microfluidics Nanofluidics, vol. 3, pp. 245–281, 2007.
6. A. M. Amin, M. Thottethodi, T. Vijaykumar, S. Wereley, and S. C. Jacobson, "Aquacore: A programmable architecture for microfluidics," ACM SIGARCH Comput. Architecture News, vol. 35, no. 2, pp. 254–265, 2007.
7. J. Paul Urbanski, W. Thies, C. Rhodes, S. Amarasinghe, and T. Thorsen, "Digital microfluidics using soft lithography," Lab Chip, vol. 6, no. 1, pp. 96–104, 2006.
8. J. Gong and C. Kim, "Direct-referencing two-dimensional-array digital microfluidics using multilayer printed circuit board," J. Microelectromech. Syst., vol. 17, no. 2, pp. 257–264, 2008.
9. M. Alistar and U. Gaudenz, "OpenDrop: An integrated do-it-yourself platform for personal use of biochips," Bioengineering, vol. 4, no. 2, 2017, Art. no. 45.

Ashley Stephenson is currently working toward the Ph.D. degree in computer science and engineering with the University of Washington, Seattle. Her research is focused on applying microfluidic technology and machine learning for intelligent automation in synthetic and molecular biology. Stephenson received the B.S. degree in biomedical engineering and electrical engineering from the University of Michigan, Ann Arbor. Contact her at ashsteph@cs.washington.edu.

Max Willsey is currently working toward the Ph.D. degree in computer science and engineering with the University of Washington, Seattle. His research interests center around programming languages, with a current focus on developing programming models for microfluidic chips. Willsey received the B.S. degree in computer science from Carnegie Mellon University. Contact him at mwillsey@cs.washington.edu.

Jeff McBride is currently a Research Engineer with the Molecular Information Systems Lab, University of Washington, Seattle. His research is focused on developing robust and affordable microfluidic technologies for wet lab automation. McBride received the M.S. degree in computer engineering from Virginia Commonwealth University. Contact him at mcbridejc@gmail.com.

Sharon Newman is currently working toward a Ph.D. degree in bioengineering with Stanford University. Her research interests are focused on increasing access to medical technologies, with an emphasis on combining bioinformatics, image processing, and biological assay development for early disease detection. Newman received the B.S. degree in bioengineering from the University of Washington, Seattle. Contact her at newmans@stanford.edu.

Bichlien Nguyen is currently a Senior Researcher with Microsoft Research. Her research interests include sustainable computing and DNA data storage systems. Nguyen received the Ph.D. degree in chemistry from Washington University in St. Louis. Contact her at bnguy@microsoft.com.


Chris Takahashi is currently a Senior Research Scientist with the Molecular Information Systems Lab, University of Washington, Seattle. His research interests include DNA data storage systems and wet lab automation. Takahashi received the Ph.D. degree in synthetic biology from the University of Washington. Contact him at cnt@cs.washington.edu.

Karin Strauss is currently a Principal Research Manager with Microsoft, and an Affiliate Professor in computer science and engineering with the University of Washington, Seattle. Her research lies at the intersection of computer architecture, systems, and biology. Strauss received the Ph.D. degree in computer science from the University of Illinois at Urbana-Champaign. Contact her at kstrauss@microsoft.com.

Luis Ceze is currently a Professor in computer science and engineering with the University of Washington, Seattle. His research focuses on the intersection of computer architecture, programming languages, machine learning, and molecular biology. Ceze received the Ph.D. degree in computer science from the University of Illinois at Urbana-Champaign. Contact him at luisceze@cs.washington.edu.

Department: Micro Economics

Triggers, Transmissions,
and Adjustments
Shane Greenstein
Harvard Business School

TAKE A STEP back from the daily details of events. Compare the recession unfolding in the United States at this moment with the two previous downturns. Today’s economic events share a surprising set of common features with the dot-com boom and bust of 1997 to 2001, and fewer similarities with the financial meltdown of 2008–2009. For reasons I will explain later, we ought to be thankful for the latter.

To be sure, these are not obvious similarities. How do we arrive at such an observation? These similarities arise from the economic processes behind triggers, transmissions, and adjustments. While that might sound like the vocabulary of a foreign language, the explanation should be intuitive to many at technology firms, who tend to be first responders to the triggers, transmissions, and adjustments of economic recessions.

SKELETON OF RECESSIONS

A trigger is the event that interrupts an economy, usually moving it away from a path of economic growth. Money can be saved or made from foreseeing a trigger sooner rather than later, so analysts often jump the gun on their arrival. That said, triggers do not arise out of the ether. Usually a trigger runs counter to a prevailing view. A prevailing view is a general set of opinions and theories held by most analysts. Until the trigger arrives, the prevailing view remains largely consistent with observable facts and trends.

Triggers are never complete surprises, however. Usually somebody can see the disaster coming and, unwittingly, plays the role of a modern Cassandra, with their warnings not heeded. For example, some analysts warned about unjustified valuations of startups during the dot-com boom, and some professionals foresaw the issues inherent in nontransparent and mispriced securities for subprime mortgages. After the fact, dissertations have been written about why their warnings caught the attention of so few listeners.

Transmission mechanisms turn triggers into recessions by lowering demand in many parts of the economy, by interrupting supply, or by both at the same time. As an example, the dot-com bust eliminated the wealth of many investors, and—oversimplifying for the sake of brevity—reduced consumption and shuttered value chains affiliated with many online businesses. That led to a redirection of aggregate investment. As another example, the collapse of Lehman Brothers in 2008 generated a liquidity crisis at virtually every U.S. bank, and—oversimplifying again—that hindered lending activity everywhere, because no bank was sure how to price

Digital Object Identifier 10.1109/MM.2020.3014536
Date of current version 1 September 2020.

0272-1732 © 2020 IEEE. Published by the IEEE Computer Society.
risky assets. Shuttered value chains and less lending led to unemployment.

There are professional economists who spend their entire careers studying every angle of every trigger and transmission in these types of events in every country that has ever experienced them. So it is necessarily an oversimplification to characterize what can happen. That said, based on watching many other countries, analysts had speculated that a major pandemic could act as a trigger for an economic recession in the United States.

As this pandemic played out, the coronavirus generated a rapid decline in consumption in the spring at restaurants, bars, stadiums, stages, parks, schools, and other places where people gather. It also interrupted the supply of plenty of entertainment, among other immediate effects, as well as imposed costs on medical services in the areas where infection spiked. The reactions to this trigger have been concentrated in some regions and not others, and at different times, as if Jackson Pollock painted the economic effects all over a U.S. map. There have been myriad reactions, also typically localized.

Broadly construed, few of the adjustments to this transmission surprised anybody. Someone who loses a job usually looks for alternatives, and if they fail, they cut back on their own consumption. If investors perceive a decline in opportunity, they usually redirect their efforts to new opportunities with better returns, or cut back on investments. Firms that foresee a decline in demand usually cut back on production and reduce staff.

However, the economy experienced a gazillion adjustments all at once, everywhere you look, in all directions. Because everyone adjusts at the same time, the sum total of this behavior is difficult to forecast in the aggregate. The whole differs from the sum of its parts.

Despite that complexity, one thing always happens during these moments: news outlets and analysts find ample audiences for publicly accessible information and analysis, because readers and viewers urgently need to adjust their behavior to match new trends. Analysts forecast, learn to reassess as events unfold, update, and keep forecasting. The hunger for updated information reaches a rare intensity.

Nobody can predict when the present hunger for information will subside, but, as in every other prior situation, it will do so eventually. I should stress "eventually." In complicated economic times, such as in the United States right now, slow adjustment toward a prevailing view is more likely than fast adjustment.

As a historical illustration, consider the emergence of a new prevailing view around the dot-com boom and bust. It is hard to remember, but the new prevailing view took a long time to emerge. The first downturn in spring of 2000 met with considerable resistance among optimistic investors, who stuck to the old prevailing view that things would bounce back. It took more than a year and a half to convince most investors that revenue would not cover the costs in most of these businesses. Then investors had to relearn which metrics were prescient and which were misleading. It took several years before new startups of online firms grew again.

UPDATES

It will be a long time before analysts settle most open questions, and a new prevailing view emerges. As an illustration of why, let us tally some of the changes in the prevailing view in the last few months, and recognize the challenges for technology firms.

Since March there have been vociferous public debates about the trigger itself. Every firm in social media has had to follow these developments, and decide what action, if any, to take against misinformation. What is the rate of infection? (It varies by location.) Does it spread by air? (Yes, primarily this way.) Do masks slow down the spread? (Yes, but some designs are better than others.) Does the summer heat kill it? (No, it is not like the flu.) Can bad cases be treated in a hospital? (Some can, but not easily.)


The misinformation will not cease anytime soon, least of all when a vaccine emerges, so these firms will continue to be challenged.

Another set of open questions haunts the firms who supply software to organizations that permitted remote work on a temporary basis, and have begun to consider sustaining those practices. For example, Google just told its employees to work remotely through January 2021. More broadly, the extent of experimentation has yielded a new stream of insights about best practices, and firms in the office software business have been scrambling to keep up. Do not expect the implications for product design to settle anytime soon.

A related debate has been taking place among data carriers, who built capacity to support heavy data loads in downtown areas. Remote work this spring moved data loads to residential areas during the daytime. Maintenance and investment had to be redirected, and quickly. Those carriers have begun to plan for next year. Should they expect more of the same next spring? Once again, that is an open question.

As it turned out, some open questions about adjustments resolved themselves quickly. Just not many. For example, many medical practices have adopted a range of communications technologies for telemedicine consults. In light of the years of resistance to telemedicine, feel free to blame your favorite government entity for needing a crisis to finally get around to approving it. I prefer to give credit to all the effort that went into hardware and software investment that finally made this last step incremental.

In other words, just as occurred 20 years ago during the dot-com boom and its aftermath, the conversation is urgent, extensive, constantly shifting, and unceasing. It is not confined to experts. It infuses every public square that hosts the conversation. Many technology firms find themselves altering their core strategies and actions due to changes in the frontier of that conversation.

REACHING A PREVAILING VIEW

Frequent changes in the prevailing view, as we presently are experiencing, are a symptom of bad economic times. Good economies are much more boring.

It is not as if Cassandras took over the prevailing view today, but most professionals now believe that most parts of the economy have started to decline. Here are some grim facts: The second quarter of 2020 was the worst economic quarter in the United States since the 1940s, and every large bank in the United States has put aside tens of billions of dollars for anticipated bankruptcies.

Unlike the financial crisis of 2008, however, there has not been any destruction from a financial panic. That could have happened, but did not. This is the sense in which the present economy resembles the financial crisis. In the spring, the Federal Reserve implemented a policy to introduce liquidity into the financial system, and to help localities with their financing. The situation contained the same potential for disaster. Why not the same outcome? The Fed learned from the experience of 2009.

If policy makers had not learned, it could have been even worse. It is one of the few bright spots in an otherwise bleak situation.

Shane Greenstein is a Professor at the Harvard Business School. Contact him at sgreenstein@hbs.edu.

NEW EVENT
IEEE QUANTUM WEEK
12–16 OCTOBER 2020
REGISTRATION IS OPEN!
qce.quantum.ieee.org
The Future Directions Quantum Initiative invites you to IEEE Quantum Week
2020—the inaugural IEEE International Conference on Quantum Computing
and Engineering (QCE).

Council on Superconductivity
