VOLUME 40, NUMBER 3 MAY/JUNE 2020

Top Picks

www.computer.org/micro
IEEE Computer Society Has You Covered!

WORLD-CLASS CONFERENCES — Stay ahead of the curve by attending one of our 200+ globally recognized conferences.

DIGITAL LIBRARY — Easily access over 780k articles covering world-class peer-reviewed content in the IEEE Computer Society Digital Library.

CALLS FOR PAPERS — Discover opportunities to write and present your ground-breaking accomplishments.

EDUCATION — Strengthen your resume with the IEEE Computer Society Course Catalog and its range of offerings.

ADVANCE YOUR CAREER — Search the new positions posted in the IEEE Computer Society Jobs Board.

NETWORK — Make connections that count by participating in local Region, Section, and Chapter activities.

Explore all of the member benefits at www.computer.org today!
EDITOR-IN-CHIEF
Lizy K. John, University of Texas, Austin

EDITORIAL BOARD
R. Iris Bahar, Brown University
Mauricio Breternitz, University of Lisbon
David Brooks, Harvard University
Bronis de Supinski, Lawrence Livermore National Lab
Shane Greenstein, Harvard Business School
Natalie Enright Jerger, University of Toronto
Hyesoon Kim, Georgia Institute of Technology
John Kim, Korea Advanced Institute of Science and Technology
Hsien-Hsin (Sean) Lee, Taiwan Semiconductor Manufacturing Company
Richard Mateosian
Tulika Mitra, National University of Singapore
Trevor Mudge, University of Michigan, Ann Arbor
Onur Mutlu, ETH Zurich
Vijaykrishnan Narayanan, The Pennsylvania State University
Per Stenstrom, Chalmers University of Technology
Richard H. Stern, George Washington University Law School
Sreenivas Subramoney, Intel Corporation
Carole-Jean Wu, Arizona State University
Lixin Zhang, Chinese Academy of Sciences

ADVISORY BOARD
David H. Albonesi, Erik R. Altman, Pradip Bose, Kemal Ebcioglu, Lieven Eeckhout, Michael Flynn, Ruby B. Lee, Yale Patt, James E. Smith, Marc Tremblay

IEEE MICRO STAFF
Journals Production Manager: Joanna Gojlik, j.gojlik@ieee.org
Peer-Review Administrator: micro-ma@computer.org
Publications Portfolio Manager: Kimberly Sperka
Publisher: Robin Baldwin
Senior Advertising Coordinator: Debbie Sims
IEEE Computer Society Executive Director: Melissa Russell

IEEE PUBLISHING OPERATIONS
Senior Director, Publishing Operations: Dawn Melley
Director, Editorial Services: Kevin Lisankie
Director, Production Services: Peter M. Tuohy
Associate Director, Editorial Services: Jeffrey E. Cichocki
Associate Director, Information Conversion and Editorial Support: Neelam Khinvasara
Senior Art Director: Janet Dudar
Senior Manager, Journals Production: Patrick Kempf

CS MAGAZINE OPERATIONS COMMITTEE
Sumi Helal (Chair), Irena Bojanova, Jim X. Chen, Shu-Ching Chen, Gerardo Con Diaz, David Alan Grier, Lizy K. John, Marc Langheinrich, Torsten Möller, David Nicol, Ipek Ozkaya, George Pallis, VS Subrahmanian

CS PUBLICATIONS BOARD
Fabrizio Lombardi (VP for Publications), Alfredo Benso, Cristiana Bolchini, Javier Bruguera, Carl K. Chang, Fred Douglis, Sumi Helal, Shi-Min Hu, Sy-Yen Kuo, Avi Mendelson, Stefano Zanero, Daniel Zeng

Subscription change of address: address.change@ieee.org
Missing or damaged copies: help@computer.org

COMPUTER SOCIETY OFFICE
IEEE Micro, c/o IEEE Computer Society, 10662 Los Vaqueros Circle, Los Alamitos, CA 90720 USA, +1 (714) 821-8380

IEEE Micro (ISSN 0272-1732) is published bimonthly by the IEEE Computer Society. IEEE Headquarters, Three Park Ave., 17th Floor, New York,
NY 10016-5997; IEEE Computer Society Headquarters, 2001 L St., Ste. 700, Washington, DC 20036; IEEE Computer Society Publications Office,
10662 Los Vaqueros Circle, PO Box 3014, Los Alamitos, CA 90720. Postmaster: Send address changes and undelivered copies to IEEE, Member-
ship Processing Dept., 445 Hoes Ln., Piscataway, NJ 08855. Periodicals postage is paid at New York, NY, and at additional mailing offices. Canadian
GST #125634188. Canada Post Corp. (Canadian distribution) Publications Mail Agreement #40013885. Return undeliverable Canadian addresses
to 4960-2 Walker Road; Windsor, ON N9A 6J3. Printed in USA. Reuse rights and reprint permissions: Educational or personal use of this material is
permitted without fee, provided such use: 1) is not made for profit; 2) includes this notice and a full citation to the original work on the first page of the
copy; and 3) does not imply IEEE endorsement of any third-party products or services. Authors and their companies are permitted to post the accepted
version of IEEE-copyrighted material on their own webservers without permission, provided that the IEEE copyright notice and a full citation to the
original work appear on the first screen of the posted copy. An accepted manuscript is a version which has been revised by the author to incorporate
review suggestions, but not the published version with copy-editing, proofreading, and formatting added by IEEE. For more information, please go to
ieee.org/publications_standards/publications/rights/paperversionpolicy.html. Permission to reprint/republish this material for commercial, advertising,
or promotional purposes or for creating new collective works for resale or redistribution must be obtained from IEEE by writing to the IEEE Intellectual
Property Rights Office, 445 Hoes Lane, Piscataway, NJ 08854-4141 or pubs-permissions@ieee.org. ©2020 by IEEE. All rights reserved. Abstracting
and library use: Abstracting is permitted with credit to the source. Libraries are permitted to photocopy for private use of patrons, provided the per-copy
fee indicated in the code at the bottom of the first page is paid through the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923.
Editorial: Unless otherwise stated, bylined articles, as well as product and service descriptions, reflect the author’s or firm’s opinion. Inclusion in IEEE
Micro does not necessarily constitute an endorsement by IEEE or the Computer Society. All submissions are subject to editing for style, clarity, and
space. IEEE prohibits discrimination, harassment, and bullying. For more information, visit ieee.org/web/aboutus/whatis/policies/p9-26.html.
May/June 2020, Volume 40, Number 3

Special Issue

Guest Editor's Introduction
The 2019 Top Picks in Computer Architecture
Hyesoon Kim

Theme Articles

Unveiling the Hardware and Software Implications of Microservices in Cloud and Edge Systems
Yu Gan, Yanqi Zhang, Dailun Cheng, Ankitha Shetty, Priyal Rathi, Nayan Katarki, Ariana Bruno, Justin Hu, Brian Ritchken, Brendon Jackson, Kelvin Hu, Meghna Pancholi, Yuan He, Brett Clancy, Chris Colen, Fukang Wen, Catherine Leung, Siyuan Wang, Leon Zaruvinsky, Mateo Espinosa, Rick Lin, Zhongling Liu, Jake Padilla, and Christina Delimitrou

MAESTRO: A Data-Centric Approach to Understand Reuse, Performance, and Hardware Cost of DNN Mappings
Hyoukjun Kwon, Prasanth Chatarasi, Vivek Sarkar, Tushar Krishna, Michael Pellauer, and Angshuman Parashar

Energy-Efficient Video Processing for Virtual Reality
Yue Leng, Jian Huang, Chi-Chun Chen, Qiuyue Sun, and Yuhao Zhu

Towards General-Purpose Acceleration: Finding Structure in Irregularity
Vidushi Dadu, Jian Weng, Sihao Liu, and Tony Nowatzki

Varifocal Storage: Dynamic Multiresolution Data Storage
Yu-Ching Hu, Murtuza Lokhandwala, Te I, and Hung-Wei Tseng

AsmDB: Understanding and Mitigating Front-End Stalls in Warehouse-Scale Computers
Nayana Prasad Nagendra, Grant Ayers, David I. August, Hyoun Kyu Cho, Svilen Kanev, Christos Kozyrakis, Trivikram Krishnamurthy, Heiner Litz, Tipp Moseley, and Parthasarathy Ranganathan

Extending the Frontier of Quantum Computers With Qutrits
Pranav Gokhale, Jonathan M. Baker, Casey Duckering, Frederic T. Chong, Natalie C. Brown, and Kenneth R. Brown

Architecting Noisy Intermediate-Scale Quantum Computers: A Real-System Study
Prakash Murali, Norbert M. Linke, Margaret Martonosi, Ali Javadi Abhari, Nhung Hong Nguyen, and Cinthia Huerta Alderete

Speculative Taint Tracking (STT): A Comprehensive Protection for Speculatively Accessed Data
Jiyong Yu, Mengjia Yan, Artem Khyzha, Adam Morrison, Josep Torrellas, and Christopher W. Fletcher

MicroScope: Enabling Microarchitectural Replay Attacks
Dimitrios Skarlatos, Mengjia Yan, Bhargava Gopireddy, Read Sprabery, Josep Torrellas, and Christopher W. Fletcher

Creating Foundations for Secure Microarchitectures With Data-Oblivious ISA Extensions
Jiyong Yu, Lucas Hsiung, Mohamad El Hajj, and Christopher W. Fletcher

Trace Wringing for Program Trace Privacy
Deeksha Dangwal, Weilong Cui, Joseph McMahan, and Timothy Sherwood

COLUMNS AND DEPARTMENTS

From the Editor-in-Chief
Enjoy These Top Picks, While You Work From Home!
Lizy Kurian John

Micro Economics
Pandemics and the Dismal Technology Economy
Shane Greenstein

Image credit: Image licensed by Ingram Publishing.
Published by the IEEE Computer Society.
From the Editor-in-Chief

Enjoy These Top Picks, While You Work From Home!

Lizy Kurian John
University of Texas at Austin

Digital Object Identifier 10.1109/MM.2020.2993184
Date of current version 22 May 2020.

THE NOVEL CORONAVIRUS has taken all of us to uncharted territories in our lives. However, as computer engineers and scientists, we can be proud of the integral role computers play in getting everybody connected as lockdowns and shelter-in-place isolations take place around the world! The importance of secure, low-power, and high-performance chips and systems cannot be overemphasized at this time. Microprocessors and microsystems are increasingly relevant in everyday lives as well as in computing for medical discoveries.

While the pandemic has taken us to unfamiliar territories, IEEE Micro is presenting to you the very familiar IEEE Micro Top Picks issue. For more than a decade, IEEE Micro has had this tradition of evaluating articles from the previous year's architecture conferences and selecting those with the most novelty and potential for long-term impact. IEEE Micro is upholding the tradition this year as well. Any article published in top architecture conferences during 2019 was eligible to compete for the Top Picks honor. In total, 96 submissions were received, from which 12 articles were chosen to represent the cream of the crop of 2019.

Professor Hyesoon Kim of Georgia Tech chaired this year's selection committee. Hyesoon and 28 experts from academia and industry worked hard to identify 12 Top Picks and 14 Honorable Mention articles. An article recognized as a Top Pick was invited to prepare a submission for inclusion in this special issue. The articles in this special issue are intended for a broader audience than the original conference articles. These articles also focus more on their potential impact. The honorable mentions are high-quality articles that unfortunately could not be included in the special issue due to space constraints. They are listed in the Guest Editor's Introduction. Interested readers can locate them in the original conference proceedings or the IEEE/ACM Digital Library.

The purpose of the Top Picks issue has been multifold. First and foremost, Top Picks was originally instituted to present the "best of the best" of the preceding year's architecture research contributions to a broader audience, including industry and other fields. A second goal of Top Picks is to recognize excellent research in the field and bestow this honor on the researchers who conducted the outstanding work that resulted in these articles. It is critically important for our field to honor our budding researchers and help them to shape their careers. The Top Picks honor has been seen to be instrumental in achieving faculty positions, leading research positions in industry, and prestigious research grants. Third, the writing style intended for a broader audience makes it easy for beginning graduate students to understand the state of the art of the field as they ponder topics to work on for their doctorate degrees. Above all, I expect these articles to be enjoyable reads for all the readers of IEEE Micro.

I take this opportunity to express my gratitude to Hyesoon and the selection committee members, who spent countless hours during the Christmas and New Year season to evaluate the submissions, and who conducted a face-to-face meeting and deliberated a whole day to perform this important selection. Hyesoon and the committee conducted a multistep selection process that tried to reduce the impact of the discussion order of the articles. I wish to express my special thanks to Hyesoon and the selection committee for the thoughtful process and the hard work.

The Top Picks articles belong to four themes: first, cloud and accelerators; second, acceleration from understanding applications; third, quantum computing; and fourth, security and privacy. One may recognize that these are highly relevant topics with or without the pandemic. A comprehensive article written by Hyesoon Kim serves as an excellent introduction to the compendium.

I want to personally congratulate all the Top Picks authors for their fantastic work. It is a significant honor to rise to the top in a competition where each candidate article is already recognized as an excellent piece of work. I hope that these works have significant impact on future computer systems.

In addition to the Top Picks articles, this issue also features a Micro Economics column by Shane Greenstein of Harvard Business School, titled "Pandemics and the Dismal Technology Economy." This article discusses the economic crisis brought on by the pandemic. The author talks about the predictable features of such a crisis even when the pandemic itself is unprecedented; however, he recognizes that it is difficult to make predictions on when the economy might turn around. He also discusses the increased appeal of streaming services and online communication tools.

To all the computer architects and chip designers: while COVID-19 may have altered your daily routines, the demand for microchips and systems is only increasing. Medical research, data analytics to predict outbreaks, online communication and conferencing tools, and security and privacy for various applications all point to the need for increased research to design hardware and software that efficiently meet the demands of the emerging era.

I hope this special issue is thought provoking for our readers and helps shape the field for many years to come. I also hope that researchers in the field intensify their efforts to design better chips and systems to help medical research and ordinary daily lives. Additionally, I encourage readers to submit to IEEE Micro. IEEE Micro is interested in submissions on any aspect of chip/system design or architecture.

May the Top Picks articles bring some happy reading to you amidst this coronavirus pandemic!

Lizy Kurian John is a Cullen Trust for Higher Education Endowed Professor in the Electrical and Computer Engineering Department, University of Texas at Austin. Contact her at ljohn@ece.utexas.edu.
Guest Editor’s Introduction

The 2019 Top Picks in Computer


Architecture
Hyesoon Kim
Georgia Institute of Technology

& IT IS MY pleasure to introduce the “2019 Top Twenty eight selection committee members (see
Picks in Computer Architecture.” This annual the “Selection Committee” sidebar) read the
publication presents 12 articles selected from three-page documentations along with the origi-
major computer architecture conferences of the nal conference papers (single-blind review pro-
year. The 12 papers are recognized for their cess by nature). In keeping with the successful
importance, mainly the long-term impact and two-round ranking-based review process of the
influence on the industry and other researchers. past several years, the PC members first catego-
The selection committee members put enor- rized each article as either a top pick, an honor-
mous effort into picking the papers. We asked able mention, or not a top pick. They also
what the criteria should be for the top picks, and ranked the articles. After the first round of
then we tried to answer that question by looking
reviews, all PC members participated in online
for significant improvement over previous work,
discussions to decide which articles should
establishing a new area.
move to the second round. In the first round,
As in prior years, only 12 articles could be
all the articles were assigned at least four
selected to appear in this special issue. The
reviewers, and in the second round, the articles
selection committee chose 14 additional high-
had at least four additional reviewers.
quality articles to be recognized as honorable
This year, as we expanded our research areas
mentions. I strongly encourage you to read
into special accelerators that rely on emerging
these articles (see the “Honorable Mentions”
technologies, we found it particularly challeng-
sidebar).
ing to ensure all reviewers understood the
underlying technologies. Because the selection
REVIEW PROCESS process is concerned more with the impact of
This year’s review process built on previous the work rather than evaluating its technical
years’ selection processes. Authors submitted a accuracy, technical expertise is less critical than
three-page document that contained a two-page for main conference reviews. Nonetheless, when
summary of the article and one page of support- several papers cover similar topics, it is also
ing arguments for long-term impact and influ- important to identify those worthy of nomina-
ence on other researchers and industry. tion based on technical merits. To overcome the
limitations on available expertise, we increased
Digital Object Identifier 10.1109/MM.2020.2992834 the number of reviewers for such emerging tech-
Date of current version 22 May 2020. nology based papers.

0272-1732 ß 2020 IEEE Published by the IEEE Computer Society IEEE Micro
6
& SELECTION COMMITTEE  Josep Torrellas, University of Illinois
Urbana-Champaign (UIUC)
 Arka Basu, Indian Institute of Science (IISc)  Lisa Wu Wills, Duke University
 
Babak Falsafi, Ecole Polytechnique Fe  de
rale de  Mike O’Connor, NVIDIA/UT-Austin
Lausanne (EPFL)  Mohit Tiwari, UT Austin
 Boris Grot, University of Edinburgh  Onur Mutlu, ETH Zurich/CMU
 Christopher Fletcher, University of Illinois  Parthasarathy Ranganathan, Google
Urbana-Champaign (UIUC)  Rajeev Balasubramonian, University of Utah
 nez, Texas A&M University
Daniel Jime  Ravi Iyer, Intel
 Dmitry Ponomarev, SUNY Binghamton  Reetuparna Das, University of Michigan
 Edward Suh, Cornell University  Thomas Wenisch, University of Michigan/Google
 Gennady Pekhimenko, University of Toronto  Tor Aamodt, The University of British Columbia
 Jangwoo Kim, Seoul National University  Tushar Krishna, Georgia Institute of Technology
 Jayasena Nuwan, AMD  Ulya Karpuzcu, University of Minnesota
 Jishen Zhao, University of California San Diego  Vijay Janapa Reddi, Harvard University
 John Kim, KAIST  Yunji Chen, Institute of Computing Technology
 Jose Joao, Arm Research
Chinese Academy of Sciences (ICT–CAS)

PC MEETING

The in-person (now, we need to differentiate in-person versus virtual!) PC meeting occurred on January 10, 2020, on the campus of the Georgia Institute of Technology, Atlanta, GA, USA. Of the 28 PC members from three continents, 24 attended the meeting and 4 could not attend due to last-minute emergencies. The discussion order was loosely determined by the articles' overall score and rank. Articles with similar topics were grouped together so as to provide more consistent evaluations and effective discussions.

The PC meeting was conducted in two phases in order to minimize the influence of the discussion order. In the first phase, we made preliminary decisions about the articles' outcomes. In the second phase, we adjusted the results, e.g., by deselecting articles from the top pick pool or rescuing articles from the honorable mention pool.

Because voting was often the key instrument in deciding the outcome of the articles, we instituted a rigorous set of voting procedures. First, only reviewers of the article voted on whether the article was a top pick or an honorable mention. If the vote was above 60% among reviewers, the article became a candidate for top picks or honorable mentions (unless the vote was unanimous for top pick, in which case the article was finalized as a top pick in the first phase). If not, the vote went to all PC members excluding those with conflicts. If that vote was above 50%, then the article became a candidate for top picks or honorable mentions. Since some articles had many conflicts, for each article we precounted the number of votes needed to top 50%. All vote results during the meeting were recorded and later used to decide the order of discussions in the second phase. This two-phase mechanism was critical because, when PC members voted, they wanted to see all the articles before making final decisions.
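The thresholds above compose into a simple decision rule. The following sketch is our own illustration of that rule as described in the text, not a tool the committee used; the function and variable names are hypothetical, and the unanimity check is simplified to a single yes/no vote.

```python
# Illustrative sketch of the first-phase voting thresholds described above;
# not the committee's tooling. Names and structure are hypothetical.
def first_phase_outcome(reviewer_yes, reviewers, pc_yes, pc_eligible):
    """reviewer_yes/reviewers: votes among the article's own reviewers.
    pc_yes/pc_eligible: fallback vote among all non-conflicted PC members."""
    if reviewer_yes == reviewers:
        return "top pick (finalized)"   # unanimous among reviewers
    if reviewer_yes / reviewers > 0.60:
        return "candidate"              # >60% among reviewers
    if pc_yes / pc_eligible > 0.50:
        return "candidate"              # >50% among the non-conflicted PC
    return "not advanced"

# Example: 3 of 4 reviewers in favor clears the 60% bar.
print(first_phase_outcome(3, 4, 0, 24))  # -> candidate
```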
Guest Editor’s Introduction

HONORABLE MENTIONS

- "Buffets: An Efficient and Composable Storage Idiom for Explicit Decoupled Data Orchestration," by Michael Pellauer, Yakun Sophia Shao, Jason Clemons, Neal Crago, Kartik Hegde, Rangharajan Venkatesan, Stephen W. Keckler, Christopher W. Fletcher, and Joel Emer (ASPLOS 2019).
- "ExTensor: An Accelerator for Sparse Tensor Algebra," by Kartik Hegde, Hadi Asghari-Moghaddam, Michael Pellauer, Neal Crago, Aamer Jaleel, Edgar Solomonik, Joel Emer, and Christopher Fletcher (MICRO 2019).
- "A Formal Analysis of the NVIDIA PTX Memory Consistency Model," by Daniel Lustig, Sameer Sahasrabuddhe, and Olivier Giroux (ASPLOS 2019).
- "SpecShield: Shielding Speculative Data from Microarchitectural Covert Channels," by Kristin Barber, Anys Bacha, Li Zhou, Yinqian Zhang, and Radu Teodorescu (PACT 2019).
- "Designing Vertical Processors in Monolithic 3D," by Bhargava Gopireddy and Josep Torrellas (ISCA 2019).
- "CIDR: A Cost-Effective In-Line Data Reduction System for Terabit-per-Second Scale SSD Arrays," by Mohammadamin Ajdari, Pyeongsu Park, Joonsung Kim, Dongup Kwon, and Jangwoo Kim (HPCA 2019).
- "Practical Byte-Granular Memory Blacklisting using Califorms," by Hiroshi Sasaki, Miguel A. Arroyo, M. Tarek Ibn Ziad, Koustubha Bhat, Kanad Sinha, and Simha Sethumadhavan (MICRO 2019).
- "Simba: Scaling Deep-Learning Inference with Multi-Chip-Module-Based Architecture," by Yakun Sophia Shao, Jason Clemons, Rangharajan Venkatesan, Brian Zimmer, Matthew Fojtik, Nan Jiang, Ben Keller, Alicia Klinefelter, Nathaniel Pinckney, Priyanka Raina, Stephen G. Tell, Yanqing Zhang, William J. Dally, Joel Emer, C. Thomas Gray, Brucek Khailany, and Stephen W. Keckler (MICRO 2019).
- "The Accelerator Wall: Limits of Chip Specialization," by Adi Fuchs and David Wentzlaff (HPCA 2019).
- "TensorDIMM: A Practical Near-Memory Processing Architecture for Embeddings and Tensor Operations in Deep Learning," by Youngeun Kwon, Yunjae Lee, and Minsoo Rhu (MICRO 2019).
- "ComputeDRAM: In-Memory Compute Using Off-the-Shelf DRAMs," by Fei Gao, Georgios Tziantzioulis, and David Wentzlaff (MICRO 2019).
- "NDA: Preventing Speculative Execution Attacks at Their Source," by Ofir Weisse, Ian Neal, Kevin Loughlin, Thomas F. Wenisch, and Baris Kasikci (MICRO 2019).
- "Janus: Optimizing Memory and Storage Support for Non-Volatile Memory System," by Sihang Liu, Korakit Seemakhupt, Gennady Pekhimenko, Aasheesh Kolli, and Samira Khan (ISCA 2019).
- "D-RaNGe: Using Commodity DRAM Devices to Generate True Random Numbers with Low Latency and High Throughput," by Jeremie S. Kim, Minesh Patel, Hasan Hassan, Lois Orosa, and Onur Mutlu (HPCA 2019).

SELECTED ARTICLES

Cloud and Accelerators

The importance of cloud computing and accelerators continues to grow in the architecture community. The challenges of evaluating cloud and edge systems are presented in "Unveiling the Hardware and Software Implications of Microservices in Cloud and Edge Systems" by Gan et al. The article presents a new open-source benchmark suite for cloud microservices, along with evaluations on real systems to guide future architecture designs. Accelerating DNNs is an ongoing topic in architecture conferences, as shown in "MAESTRO: A Data-Centric Approach to Understand Reuse, Performance, and Hardware Cost of DNN Mappings" by Kwon et al. This article presents an analytical cost-benefit analysis framework that considers the cost of DNN mapping tradeoffs in terms of data reuse. Accelerating virtual reality is becoming more important, especially in the current environment. The article "Energy-Efficient Video Processing for Virtual Reality" by Leng et al. seeks to improve energy efficiency through architectural support, based on characterizations and evaluations of VR prototypes.

Accelerations From Understanding Applications

Understanding target application characteristics generates synergetic architectural improvements. A generalized acceleration framework for irregular workloads is presented in "Towards General-Purpose Acceleration: Finding Structure in Irregularity" by Dadu et al. The article "Varifocal Storage: Dynamic Multiresolution Data Storage" by Hu et al. proposes a dynamic multiresolution storage system to help approximate computing. Understanding applications can also be extended to warehouse-scale computer (WSC) applications. A profiling and code analysis tool that allows identification of critical code segments, together with proposed solutions to improve the performance of WSCs, is presented in "AsmDB: Understanding and Mitigating Front-End Stalls in Warehouse-Scale Computers" by Nagendra et al.

Quantum Computing

2019 was the year that quantum computing architecture/compilers became one of the mainstream computer architecture research topics. Two articles were selected to represent the research challenges in quantum computing. The article "Extending the Frontier of Quantum Computers With Qutrits" by Gokhale et al. presents the use of three-level qutrits and also evaluates the system-level impact of qutrit-based quantum computing. The article also provides good background material on quantum computing. A measurement-based full-stack characterization of basic quantum computing applications on real systems is presented in "Architecting Noisy Intermediate-Scale Quantum Computers: A Real-System Study" by Murali et al.

Security and Privacy

As a reflection of new computing design requirements from security and privacy, three articles on security and one article on privacy are presented in this issue. The article "Speculative Taint Tracking (STT): A Comprehensive Protection for Speculatively Accessed Data" by Yu et al. provides a framework that tracks the flow of speculative instructions through covert channels. The article "MicroScope: Enabling Microarchitectural Replay Attacks" by Skarlatos et al. presents a means of replaying code by forcing microarchitectural replay based on address translations. An ISA extension design methodology to improve security while considering performance is presented in "Creating Foundations for Secure Microarchitectures With Data-Oblivious ISA Extensions" by Yu et al. Finally, the article "Trace Wringing for Program Trace Privacy" by Dangwal et al. presents a compression method to generate traces that limit information leakage; memory trace wringing for cache simulation is shown as an example.

CONCLUSION

I hope you will enjoy reading these articles, and I encourage you to explore the full conference versions of both the Top Pick and Honorable Mention selections. The authors made significant efforts to write versions that can be read by a broad audience.

ACKNOWLEDGMENTS

I would like to thank Lizy Kurian John, the Editor-in-Chief of IEEE Micro, for providing support and guidance at every stage of issue preparation. I would also like to thank Vijay J. Reddi for handling papers that I had conflicts with, and J. Zhao for handling papers that both Vijay and I had conflicts with. I also thank the previous year's guest editor, S. Dwarkadas, for her input. Attending her Top Picks meeting last year was very helpful in preparing for this year's meeting. I thank M. Qureshi for providing valuable input during the discussions, and the submission chair Y. Kim, R. Hadidi, and the volunteer students J. Lee and B. Asgari, who helped organize the PC meeting. I thank all the PC members who have diligently read the articles and put enormous effort into selecting the finalists. Finally, I thank all the authors who have submitted their work.

Hyesoon Kim is an Associate Professor with the School of Computer Science, Georgia Institute of Technology. Her research areas include the intersection of computer architectures and compilers, with an emphasis on heterogeneous architectures such as GPUs and accelerators. She is a recipient of the NSF CAREER Award and is a member of the Micro Hall of Fame. She is a senior member of IEEE. Contact her at hyesoon@cc.gatech.edu.
Theme Article: Top Picks

Unveiling the Hardware and Software Implications of Microservices in Cloud and Edge Systems

Yu Gan, Yanqi Zhang, Dailun Cheng, Ankitha Shetty, Priyal Rathi, Nayan Katarki, Ariana Bruno, Justin Hu, Brian Ritchken, Brendon Jackson, Kelvin Hu, Meghna Pancholi, Yuan He, Brett Clancy, Chris Colen, Fukang Wen, Catherine Leung, Siyuan Wang, Leon Zaruvinsky, Mateo Espinosa, Rick Lin, Zhongling Liu, Jake Padilla, and Christina Delimitrou
Cornell University

Digital Object Identifier 10.1109/MM.2020.2985960
Date of publication 22 April 2020; date of current version 22 May 2020.

Abstract—Cloud services progressively shift from monolithic applications to complex graphs of loosely coupled microservices. This article aims at understanding the implications microservices have across the system stack, from hardware acceleration and server design, to operating systems and networking, cluster management, and programming frameworks. Toward this effort, we have designed and open-sourced DeathStarBench, a benchmark suite for interactive microservices that is both representative and extensible.

CLOUD COMPUTING NOW powers applications from every domain of human endeavor, which require ever-improving performance, responsiveness, and scalability.2,5,6,8 Many of these applications are interactive, latency-critical services that must meet strict performance (throughput and tail latency) and availability constraints, while also handling frequent software updates.4-7,12 The past five years have seen a significant shift in the way cloud services are designed, from large monolithic implementations, where the entire functionality of a service is implemented in a single binary, to large graphs of single-concerned and loosely coupled microservices.1,10 This shift is becoming increasingly pervasive, with large cloud providers, such as Amazon, Twitter, Netflix, Apple, and eBay having already adopted the microservices application model, and Netflix reporting more than 200 unique microservices in their ecosystem, as of the end of 2016.1

The increasing popularity of microservices is justified by several reasons. First, they promote composable software design, simplifying and accelerating development, with each microservice being responsible for a small subset of the application's functionality. The richer the functionality of cloud services becomes, the more the modular design of microservices helps manage system complexity. They similarly facilitate deploying, scaling, and updating individual microservices independently, avoiding long development cycles, and improving elasticity. For applications that are updated on a daily basis, modifying, recompiling, and testing a large monolith is both cumbersome and prone to bugs.

Second, microservices enable programming language and framework heterogeneity, with each tier developed in the most suitable language, only requiring a common API for microservices to communicate with each other, typically over remote procedure calls (RPC) or a RESTful API. In contrast, monoliths limit the languages used for development, and make frequent updates cumbersome and error-prone.

Finally, microservices separate failure domains across application tiers, allowing cleaner error isolation, and simplifying correctness and performance debugging, unlike in monoliths, where resolving bugs often involves troubleshooting the entire service. This also makes them applicable to Internet-of-Things (IoT) applications that often host mission-critical computation.

Figure 1. Differences in the deployment of monoliths and microservices.

Figure 1 shows the deployment differences between a traditional monolithic service and an application built with microservices. While the entire monolith is scaled out on multiple servers, microservices allow individual components of the end-to-end application to be elastically scaled, with microservices of complementary resources bin-packed on the same physical server. Even though modularity in cloud services was already part of the service-oriented architecture (SOA) design approach, the fine granularity of microservices and their independent deployment create hardware and software challenges different from those in traditional SOA workloads.

Despite their advantages, microservices represent a significant departure from the way cloud services are traditionally designed, and have broad implications in both hardware and software, changing a lot of the assumptions current warehouse-scale systems are designed with. For example, since dependent microservices are typically placed on different physical machines, they put a lot more pressure on high-bandwidth and low-latency networking than traditional applications. Furthermore, the dependencies between microservices introduce backpressure effects between dependent tiers, leading to cascading QoS violations that propagate and amplify through the system, making performance debugging expensive in both resources and time.11

Given the increasing prevalence of microservices in both cloud and IoT settings, it is imperative to study both their opportunities and challenges. Unfortunately, most academic work on cloud systems is limited to the available open-source applications, which are in their majority monolithic designs. This not only prevents a wealth of interesting research questions from being explored, but can also lead to misdirected research efforts whose results do not translate to the way real cloud services are implemented.

DEATHSTARBENCH SUITE

Our article,10 presented at ASPLOS'19, addresses the lack of representative and open-source benchmarks built with microservices, and quantifies the opportunities and challenges of this new application model across the system stack.
Benchmark Suite Design: We have designed, implemented, and open-sourced a set of end-to-end applications built with interactive microservices, representative of popular production online services using this application model. Specifically, the benchmark suite includes a social network, a media service, an e-commerce shop, a hotel reservation site, a secure banking system, and a coordination control platform for UAV swarms. Across all applications, we adhere to the design principles of representativeness, modularity, extensibility, software heterogeneity, and end-to-end operation.

Each service includes tens of microservices in different languages and programming models, including node.js, Python, C/C++, Java, Javascript, Scala, and Go, and leverages open-source applications, such as NGINX, memcached, MongoDB, Cylon, and Xapian. To create the end-to-end services, we built custom RPC and RESTful APIs using popular open-source frameworks like Apache Thrift and gRPC. Finally, to track how user requests progress through microservices, we have developed a lightweight distributed tracing system, transparent to the user and similar to Dapper and Zipkin, that tracks requests at RPC granularity, associates RPCs belonging to the same end-to-end request, and records traces in a centralized database. We study both traffic generated by real users of the services and synthetic loads generated by open-loop workload generators.
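To make the measurement setup concrete, the sketch below shows what an open-loop load generator looks like: requests are issued on a fixed schedule (here, Poisson arrivals), independently of when earlier responses return, so a slow service cannot throttle the offered load. This is our own minimal illustration, not the suite's actual generator; the target URL and rate are hypothetical.

```python
# Minimal open-loop load generator sketch (not the suite's actual tooling).
# Arrivals follow a Poisson schedule, so the offered load stays constant
# even when the service slows down. Target endpoint is hypothetical.
import random
import threading
import time
import urllib.request

TARGET = "http://localhost:9001/"   # hypothetical front-end endpoint
RATE_QPS = 50                        # offered load, requests per second
latencies_ms = []

def issue_request():
    t0 = time.time()
    try:
        urllib.request.urlopen(TARGET, timeout=5).read()
        latencies_ms.append((time.time() - t0) * 1e3)
    except OSError:
        latencies_ms.append(float("inf"))   # count failures against the tail

def run(duration_s=10):
    deadline = time.time() + duration_s
    while time.time() < deadline:
        # Fire-and-forget: do not wait for the response before the next send.
        threading.Thread(target=issue_request, daemon=True).start()
        time.sleep(random.expovariate(RATE_QPS))   # Poisson inter-arrivals

if __name__ == "__main__":
    run()
    time.sleep(5)                    # let in-flight requests drain
    done = sorted(latencies_ms)
    if done:
        print(f"p99 latency: {done[int(0.99 * (len(done) - 1))]:.1f} ms")
```

Measuring tail percentiles with an open-loop client like this is what exposes queueing effects that closed-loop clients, which self-throttle under load, tend to hide.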
Applications in DeathStarBench

Figure 2. Graph of microservices in Social Network.

Social Network: The end-to-end service implements a broadcast-style social network with unidirectional follow relationships. Figure 2 shows the architecture of the end-to-end service. Users (clients) send requests over http, which first reach a load balancer, implemented with nginx. Once a specific webserver is selected, also in nginx, the latter uses a php-fpm module to talk to the microservices responsible for composing and displaying posts, as well as microservices for advertisements and search engines. All messages downstream of php-fpm are Apache Thrift RPCs. Users can create posts embedded with text, media, links, and tags to other users. Their posts are then broadcast to all their followers. Users can also read, favorite, and repost posts, as well as reply publicly or send a direct message to another user. The application also includes machine learning plugins, such as user recommender engines, a search service using Xapian, and microservices to record and display user statistics, e.g., number of followers, and to allow users to follow, unfollow, or block other accounts. The service's backend uses memcached for caching and MongoDB for persistent storage of posts, profiles, media, and recommendations. The service is broadly deployed at our institution, currently servicing several hundred users. We also use this deployment to quantify the tail at scale effects of microservices.
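The fan-out above (an nginx front end calling Thrift back ends) is also where the RPC-level tracing described earlier hooks in. The sketch below is a deliberately simplified, self-contained illustration of the core idea: every tier forwards one trace ID, so spans recorded independently by each tier can be reassembled into an end-to-end request. It is not DeathStarBench code; the real services use nginx, php-fpm, and Apache Thrift, and the service names, ports, and in-memory "trace database" here are hypothetical stand-ins.

```python
# Sketch of Dapper/Zipkin-style trace propagation across two toy tiers.
import json
import threading
import time
import urllib.request
import uuid
from http.server import BaseHTTPRequestHandler, HTTPServer

TRACE_DB = []  # stand-in for the centralized trace database

def record_span(trace_id, service, t0):
    TRACE_DB.append({"trace_id": trace_id, "service": service,
                     "duration_us": (time.time() - t0) * 1e6})

def reply(handler, payload):
    body = json.dumps(payload).encode()
    handler.send_response(200)
    handler.send_header("Content-Length", str(len(body)))
    handler.end_headers()
    handler.wfile.write(body)

class UserStore(BaseHTTPRequestHandler):
    """Back-end tier: returns a canned user record."""
    def do_GET(self):
        t0 = time.time()
        trace_id = self.headers.get("X-Trace-Id", "untraced")
        record_span(trace_id, "user-store", t0)
        reply(self, {"user": "alice", "followers": 42})
    def log_message(self, *args):  # silence per-request logging
        pass

class ComposePost(BaseHTTPRequestHandler):
    """Mid-tier: calls the back end, forwarding the same trace ID."""
    def do_GET(self):
        t0 = time.time()
        trace_id = self.headers.get("X-Trace-Id") or uuid.uuid4().hex
        req = urllib.request.Request("http://localhost:9002/",
                                     headers={"X-Trace-Id": trace_id})
        with urllib.request.urlopen(req) as resp:
            user = json.loads(resp.read())
        record_span(trace_id, "compose-post", t0)
        reply(self, {"post": "hello", "by": user["user"]})
    def log_message(self, *args):
        pass

if __name__ == "__main__":
    for port, handler in [(9002, UserStore), (9001, ComposePost)]:
        srv = HTTPServer(("localhost", port), handler)
        threading.Thread(target=srv.serve_forever, daemon=True).start()
    urllib.request.urlopen("http://localhost:9001/").read()
    # Both spans carry the same trace ID, so the end-to-end request can be
    # reassembled across tiers.
    print(json.dumps(TRACE_DB, indent=2))
```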
Figure 3. Graph of microservices in Media Service.

Media Service: The application implements an end-to-end service for browsing movie information, as well as reviewing, rating, renting, and streaming movies. Figure 3 shows the architecture of the end-to-end service. As with the social network, a client request hits the load balancer, which distributes requests among multiple nginx webservers. Users can search and browse information about movies, including their plot, photos, videos, cast, and review information, as well as insert new reviews in the system for a specific movie by logging into their account. Users can also select to rent a movie, which involves a payment authentication module to verify that the user has enough funds, and a video streaming module using nginx-hls, a production nginx module for HTTP live streaming. The actual movie files are stored in NFS, to avoid the latency and complexity of accessing chunked records from nonrelational databases, while movie reviews are kept in memcached and MongoDB instances. Movie information is maintained in a sharded and replicated MySQL database. The application also includes movie and advertisement recommenders, as well as a couple of auxiliary services for maintenance and service discovery, which are not shown in the figure.

E-Commerce Site: The service implements an e-commerce site for clothing. The design draws inspiration from, and uses several components of, the open-source Sockshop application. The application front-end in this case is a node.js service. Clients can use the service to browse the inventory using catalogue, a Go microservice that mines the back-end memcached and MongoDB instances holding information about products. Users can also place orders (Go) by adding items to their cart (Java). After they log in (Go) to their account, they can select shipping options (Java), process their payment (Go), and obtain an invoice (Java) for their order. Finally, the service includes a recommender engine for suggested products, and microservices for creating an item wishlist (Java) and displaying current discounts.

Hotel Reservation: The service implements a hotel reservation site, where users can browse information about hotels and complete reservations. The service is primarily written in Go, with the backend tiers implemented using memcached and MongoDB. Users can filter hotels according to ratings, price, location, and availability. They also receive recommendations on hotels they may be interested in.

Banking System: The service implements a secure banking system that processes payments, loan requests, and credit card transactions. Users interface with a node.js front-end, similar to the one in the e-commerce site, to log in to their account, search information about the bank, or contact a representative. Once logged in, a user can process a payment, pay their credit card bill, browse information about loans or request one, and obtain information about wealth management options. Most microservices are written in Java and Javascript. The back-end databases use memcached and MongoDB instances.

IoT Swarm Coordination: Finally, we explore an environment where applications run both on the cloud and on edge devices. The service coordinates the routing of a swarm of programmable drones, which perform image recognition and obstacle avoidance. We have designed two versions of this service. In the first, the majority of the computation happens on the drones, including the motion planning, image recognition, and obstacle avoidance, with the cloud only constructing the initial route per drone and holding persistent copies of sensor data. This architecture avoids the high network latency between cloud and edge; however, it is limited by the on-board resources. In the second version, the cloud is responsible for most of the computation. It performs motion control, image recognition, and obstacle avoidance for all drones, using the ardrone-autonomy and Cylon libraries, in OpenCV and Javascript, respectively. The edge devices are only responsible for collecting sensor data and transmitting them to the cloud, as well as recording some diagnostics using a local node.js logging service. In this case, almost every action suffers the cloud-edge network latency, although services benefit from the additional cloud resources. We use 24 programmable Parrot AR2.0 drones, together with a backend cluster of 20 two-socket, 40-core servers.

Adoption

DeathStarBench is open-source software under a GPL license (https://github.com/delimitrou/DeathStarBench). The project is currently in use by several tens of research groups both in academia and industry. In addition to the open-source project, we have also deployed the social network as an internal social network at Cornell University, currently used by over 500 students, and have used execution traces for several
research studies, including using ML in root cause analysis for interactive microservices.11

HARDWARE AND SOFTWARE IMPLICATIONS OF MICROSERVICES

We have used DeathStarBench to quantify the implications microservices have on cloud hardware and software, and what these implications mean for computer engineers. Below, we summarize the main findings from our study.

Server Design

We first quantified how effective current datacenter architectures are at running microservices, as well as how datacenter hardware needs to change to better accommodate their performance and resource requirements. There has been a lot of work on whether small servers can replace high-end platforms in the cloud.3 Despite the power benefits of simple cores, interactive services still achieve better latency in servers optimized for single-thread performance. Microservices offer an appealing target for simple cores, given the small amount of computation per microservice. Figure 4 (top row) shows the change in tail latency as load increases and frequency decreases using running average power limit (RAPL) for five popular, open-source single-tier interactive services: nginx, memcached, MongoDB, Xapian, and Recommender. We compare these against the five end-to-end services (bottom row).

Figure 4. Tail latency with increasing load and decreasing frequency (RAPL) for traditional monolithic cloud applications, and the five end-to-end DeathStarBench services. Lighter colors (yellow) denote QoS violations.

As expected, most interactive services are sensitive to frequency scaling. Among the monolithic workloads, MongoDB is the only one that can tolerate almost minimum frequency at maximum load, due to it being I/O-bound. The other four single-tier services experience increased latency as frequency drops, with Xapian being the most sensitive, followed by nginx and memcached. Looking at the same study for microservices reveals that, perhaps counterintuitively, they are much more sensitive to poor single-thread performance than traditional cloud applications, despite the small amount of per-microservice processing. The reasons behind this are the strict, microsecond-level tail latency requirements of individual microservices, which put more pressure on low and predictable latency, emphasizing the need for hardware and software techniques that eliminate latency jitter. Out of the five end-to-end services (we omit Swarm-Edge, since compute happens on the edge devices), the Social Network and E-commerce are most sensitive to low frequency, while the Swarm service is the least sensitive, primarily because it is bound by the cloud-edge communication latency, as opposed to compute speed.
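For readers who want to reproduce this kind of sweep, power capping with RAPL is scriptable on Linux through the intel_rapl powercap driver. The sketch below is our own hedged illustration of that mechanism, not the study's actual harness; it assumes an Intel CPU with the driver loaded, the usual sysfs paths, and root privileges, and it ignores energy-counter wraparound for brevity.

```python
# Sketch: sweep a package power cap via the Linux intel_rapl powercap
# driver (requires root). Illustrative only; not the study's harness.
import pathlib
import time

PKG = pathlib.Path("/sys/class/powercap/intel-rapl:0")  # CPU package 0

def read_power_w(interval_s=1.0):
    """Estimate package power by differencing the energy counter
    (counter wraparound ignored for brevity)."""
    e0 = int((PKG / "energy_uj").read_text())
    time.sleep(interval_s)
    e1 = int((PKG / "energy_uj").read_text())
    return (e1 - e0) / 1e6 / interval_s

def set_power_cap_w(watts):
    (PKG / "constraint_0_power_limit_uw").write_text(str(int(watts * 1e6)))

if __name__ == "__main__":
    print("package:", (PKG / "name").read_text().strip())
    for cap in (95, 75, 55, 35):        # hypothetical cap schedule in watts
        set_power_cap_w(cap)
        # ...run the load generator here and record tail latency...
        print(f"cap {cap} W -> measured {read_power_w():.1f} W")
```

Lowering the cap forces the package to lower frequencies under load, which is how a latency-versus-frequency surface like Figure 4 can be swept without BIOS changes.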
Networking and OS Overheads

Microservices spend a large fraction of their time processing network requests for RPCs or other RESTful APIs. While for traditional monolithic cloud services only a small amount of time goes toward network processing, with microservices this time increases to 36.3% on average, and on occasion over 50% of the end-to-end latency, causing the system's resource bottlenecks to change drastically. To this end, we also explored the potential hardware acceleration has to address the network requirements of microservices for low-latency and high-throughput network processing. Specifically, we use a bump-in-the-wire setup, seen in Figure 5(a) and similar to the one given by Firestone et al.,9 to offload the entire TCP stack on a Virtex 7 FPGA using Vivado HLS. The FPGA is placed between the NIC and the top-of-rack switch, and is connected to both with matching transceivers, acting as a filter on the network. We maintain the PCIe connection between the host and the FPGA for accelerating other services, such as the machine learning models in the recommender engines, during periods of low network load.
Figure 5. (a) Overview of the FPGA configuration for RPC acceleration, and (b) the performance benefits of acceleration in terms of network and end-to-end tail latency.

Figure 5(b) shows the speedup from acceleration on network processing latency alone, and on the end-to-end latency of each of the services. Network processing latency improves by 10-68x over native TCP, whereas end-to-end tail latency improves by 43% and up to 2.2x. For interactive, latency-critical services, where even a small improvement in tail latency is significant, network acceleration provides a major boost in performance.

Cluster Management

A major challenge with microservices has to do with cluster management. Even though the cluster manager can elastically scale out individual microservices on demand instead of the entire monolith, dependencies between microservices introduce backpressure effects and cascading QoS violations that propagate through the system, hurting quality of service (QoS). Backpressure can additionally trick the cluster manager into penalizing or upsizing a highly utilized microservice, even though its saturation is the result of backpressure from another, potentially not-saturated service. Not only does this not solve the performance issue, but it can on occasion make it worse, by admitting more traffic into the system. The more complex the dependence graph between microservices, the more pronounced such issues become. Figure 6 shows the microservices dependence graphs for three major cloud service providers, and for one of our applications (Social Network). The perimeter of the circle (or sphere surface) shows the different microservices, and edges show dependencies between them. Such dependencies are difficult for developers or users to describe, and furthermore, they change frequently, as old microservices are swapped out and replaced by newer services.

Figure 6. Microservices graphs for three production clouds, and our Social Network.

Figure 7. Cascading QoS violations in Social Network compared to per-microservice CPU utilization.

Figure 7 shows the impact of cascading QoS violations in the Social Network service. Darker colors show tail latency closer to nominal operation for a given microservice in Figure 7(a), and low utilization in Figure 7(b). Brighter colors signify high per-microservice tail latency and high CPU utilization. Microservices are ordered based on the service architecture, from the back-end services at the top, to the front-end at the bottom. Figure 7(a) shows that once the back-end service at the top experiences high tail latency, the hotspot propagates to its upstream services, and all the way to the front-end. Utilization in this case can be misleading. Even though the saturated back-end services have high utilization in Figure 7(b), microservices in the middle of the figure have even higher utilization, without this translating to QoS violations.

Conversely, there are microservices with relatively low utilization and degraded performance, for example, due to waiting on a blocking/synchronous request from another, saturated tier. This highlights the need for cluster managers that account for the impact dependencies between microservices have on end-to-end performance when allocating resources.
Figure 8. (a) Microservices taking longer than monoliths to recover from a QoS violation, even (b) in the presence of autoscaling mechanisms.

Finally, the fact that hotspots propagate between tiers means that once microservices experience a QoS violation, they need longer to recover than traditional monolithic applications, even in the presence of autoscaling mechanisms, which most cloud providers employ. Figure 8 shows such a case for Social Network implemented with microservices, and as a monolith in Java. In both cases, the QoS violation is detected at the same time. However, while the cluster manager can simply instantiate new copies of the monolith and rebalance the load, autoscaling takes longer to improve performance. This is because, as shown in Figure 8(b), the autoscaler simply upsizes the resources of saturated services—seen by the progressively darker colors of highly utilized microservices. However, services with the highest utilization are not necessarily the culprits of a QoS violation, and it takes the system much longer to identify the correct source behind the degraded performance and upsize it. As a result, by the time the culprit is identified, long queues have already built up, which take considerable time to drain.
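The misattribution described above is easy to reproduce in miniature. In the hedged sketch below (our own toy, not the article's experiment), a front tier calls a one-worker back tier synchronously; nearly all front-tier workers show up as "busy" even though the back tier is the actual bottleneck, which is exactly the signal that fools utilization-driven autoscalers.

```python
# Toy backpressure demo (not from the article): a front tier blocking on a
# saturated back tier looks "busy," misleading utilization-based autoscaling.
import threading
import time

back_tier = threading.Semaphore(1)       # back tier: a single slow worker
busy_front = 0
lock = threading.Lock()

def front_worker(n_requests=20):
    global busy_front
    for _ in range(n_requests):
        with lock:
            busy_front += 1
        with back_tier:                   # blocks: backpressure from below
            time.sleep(0.05)              # back tier's ~20 req/s ceiling
        with lock:
            busy_front -= 1

threads = [threading.Thread(target=front_worker) for _ in range(8)]
for t in threads:
    t.start()
time.sleep(0.5)
# Nearly 8/8 front workers appear busy, yet upsizing the front tier would
# not help; the single back-tier worker is the culprit.
print(f"front-tier workers busy: {busy_front}/8")
for t in threads:
    t.join()
```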
Serverless Programming Frameworks

Microservices are often used interchangeably with serverless compute frameworks. Serverless enables fine-grained, short-lived cloud execution, and is well suited for interactive applications with ample parallelism and intermittent activity. We evaluated our end-to-end applications on AWS's serverless platform, AWS Lambda, and showed that despite avoiding the high costs of reserved idle resources, and enabling more elastic scaling in the presence of short load bursts, serverless also results in less predictable performance for interactive microservices, for three reasons. First, on current serverless platforms, communication between dependent functions (serverless tasks) happens via remote persistent storage, S3, in AWS's case. This not only introduces high, but also unpredictable latency, as S3 is subject to long queueing delays and rate limiting. Second, because VMs hosting serverless tasks are terminated after a given amount of idle time, when new tasks are spawned, they need to reload any external dependencies, introducing nonnegligible overheads. Finally, the placement of serverless tasks is up to the cloud operator's scheduler, and hence prone to long scheduling delays and unpredictable performance due to contention from external jobs sharing system resources. Overall, we show that while similar, microservices and serverless compute each present different system challenges, and are well suited for different application classes.

Tail at Scale Effects

Finally, we explore the system implications of microservices at large scale. Tail at scale effects are well documented in warehouse-scale computers,4 and refer to system (performance, availability, efficiency) issues that specifically arise due to a system's scale. In this article, we use the large-scale deployment of the social network application with hundreds of real users to study the impact of cascading performance hotspots, request skews, and slow servers on end-to-end performance.

Figure 9. (a) Cascading hotspots in the large-scale Social Network deployment, and (b) tail at scale effects from slow servers.

Figure 9(a) shows the performance impact of dependencies between microservices on 100 EC2 instances. Microservices on the y-axis are again ordered from the back-end at the top to the front-end at the bottom. While initially all microservices are behaving nominally, at t = 260 s the middle tiers, and specifically composePost and readPost, become saturated due to a switch routing misconfiguration that overloaded one instance of each microservice, instead of load balancing requests across different instances. This in turn causes their downstream services to saturate, causing a similar waterfall pattern in per-tier latency to the one in Figure 7. Toward the end of the sampled time (t > 500 s) the back-end services also become saturated for a similar reason, causing microservices earlier in the critical path to saturate. This is especially evident for microservices in the middle of the y-axis (bright yellow), whose performance was already degraded from the previous QoS violation. To allow the system to recover, we employed rate limiting, which constrains the admitted traffic until hotspots dissipate. Even though rate limiting is effective, it affects user experience by dropping a fraction of requests.

Finally, Figure 9(b) shows the impact of a small number of slow servers on overall QoS as cluster size increases. We purposely slow down a small fraction of servers by enabling aggressive power management, which we already saw is detrimental to performance. For large clusters (>100 instances), when 1% or more of servers behave poorly, QPS under QoS is almost zero, as these servers host at least one microservice on the critical path, degrading QoS. Even for small clusters (40 instances), a single slow server is the most the service can sustain and still achieve some QPS under QoS. Finally, we compare the impact of slow servers in clusters of equal size for the monolithic design of Social Network. In this case, QPS is higher even as cluster sizes grow, since a single slow server only affects the instance of the monolith hosted on it, while the other instances operate independently. The only exception is the back-end databases, which even for the monolith are shared across application instances, and sharded across machines. If one of the slow servers is hosting a database shard, all requests directed to that instance are degraded. The more complex an application's microservices graph, the more impactful slow servers are, as the probability that a service on the critical path will be slowed down increases.

As with the hardware and cluster management implications above, these results again emphasize the need for hardware and software techniques that improve performance predictability at scale without hurting latency and resource efficiency.
microservices graph, the more impactful slow also quantified the potential hardware accelera-
servers are, as the probability that a service on tion has toward addressing the performance
the critical path will be slowed-down increases. requirements of interactive microservices, and

May/June 2020
17
Top Picks

showed that programmable acceleration can 6. C. Delimitrou and C. Kozyrakis, “Quasar: Resource-
greatly reduce one of the primary overheads of efficient and QoSAware cluster management,” in
multitier services; network processing. Proc. 19th Int. Conf. Archit. Support Program.
As microservices continue to evolve, it is Lang. Oper. Syst., Salt Lake City, UT, USA, 2014,
essential for datacenter hardware, operating pp. 127–144.
and networking systems, cluster managers, and 7. C. Delimitrou and C. Kozyrakis, “HCloud: Resource-
programming frameworks to also evolve with efficient provisioning in shared cloud systems,” in
them, to ensure that their prevalence does not Proc. 21st Int. Conf. Archit. Support Program. Lang.
come at a performance and/or efficiency loss. Oper. Syst., Apr. 2016, pp. 473–488.
Both DeathStarBench and the resulting study of 8. C. Delimitrou and C. Kozyrakis, “Bolt: I know what you
the system implications of microservices are a did last summer... In the cloud,” in Proc. 22nd Int. Conf.
call to action for the research community to fur- Archit. Support Program. Lang. Oper. Syst., Apr. 2017,
ther explore the opportunities and challenges of pp. 599–613.
this emerging application model. 9. D. Firestone et al., “Azure accelerated networking:
Smartnics in the public cloud,” in Proc. 15th USENIX
ACKNOWLEDGMENTS Symp. Netw. Syst. Design Implementation, 2018,
We sincerely thank C. Kozyrakis, D. Sanchez, pp. 51–66.
D. Lo, as well as the academic and industrial users 10. Y. Gan et al., “An open-source benchmark suite for
of the benchmark suite, and the anonymous microservices and their hardware-software
reviewers for their feedback on earlier versions of implications for cloud and edge systems,” in Proc.
this article. This work was supported in part by 24th Int. Conf. Archit. Support Program. Lang. Oper.
an NSF CAREER award, in part by NSF grant CNS- Syst., Apr. 2019, pp. 3–18.
1422088, in part by a Google Faculty Research 11. Y. Gan et al., “Seer: Leveraging big data to
Award, in part by a Alfred P. Sloan Foundation Fel- navigate the complexity of performance debugging
lowship, in part by a Facebook Faculty Research in cloud microservices,” in Proc. 24th Int. Conf.
Award, in part by a John and Norma Balen Sesqui- Archit. Support Program. Lang. Oper. Syst., Apr.
centennial Faculty Fellowship, and in part by gen- 2019, pp. 19–33.
erous donations from Google Compute Engine, 12. D. Lo, L. Cheng, R. Govindaraju, P. Ranganathan, and
Windows Azure, and Amazon EC2. C. Kozyrakis, “Heracles: Improving resource
efficiency at scale,” in Proc. 42nd Annu. Int. Symp.
Comput. Archit., 2015, pp. 450–462.
& REFERENCES
1. “The evolution of microservices,” 2016. [Online].
Available: https://www.slideshare.net/adriancockcroft/ Yu Gan is currently working toward the Ph.D. degree
evolution-of-microservices-craft-conference
with the School of Electrical and Computer Engineer-
ing, Cornell University, where he works on cloud
2. L. Barroso, U. Hoelzle, and P. Ranganathan, The
computing and root cause analysis for interactive
Datacenter as a Computer: An Introduction to the
microservices. He is a student member of IEEE and
Design of Warehouse-Scale Machines. Morgan & ACM. Contact him at yg397@cornell.edu.
Claypool: San Rafael, CA, USA, 2018.
3. S. Chen, S. Galon, C. Delimitrou, S. Manne, and Yanqi Zhang is currently working toward the Ph.D.
J. F. Martinez, “Workload characterization of degree with the School of Electrical and Computer
interactive cloud services on big and small server Engineering, Cornell University, where he works on
platforms,” in Proc. Int. Symp. Workload cloud systems and resource management for inter-
Characterization, Oct. 2017, pp. 125–134. active microservices. He is a student member of
4. J. Dean and L. A. Barroso, “The tail at scale,” IEEE and ACM. Contact him at yz2297@cornell.edu.
Commun. ACM, vol. 56 no. 2, pp. 74–80, 2013.
5. C. Delimitrou and C. Kozyrakis, “Paragon: QoS-aware Dailun Cheng is currently working toward the
scheduling for heterogeneous datacenters,” in Proc. M.Eng. degree with the School of Electrical and
18th Int. Conf. Archit. Support Program. Lang. Oper. Computer Engineering, Cornell University. Contact
Syst., Houston, TX, USA, 2013, pp. 77–88. him at dc924@cornell.edu.

IEEE Micro
18
Ankitha Shetty is currently working toward Chris Colen is currently working toward the M.Eng.
the M.Eng. degree with the School of Computer degree with the School of Computer Science, Cornell
Science, Cornell University. Contact him at University. Contact him at cdc99@cornell.edu.
aas394@cornell.edu.
Fukang Wen is currently working toward the M.Eng.
Priyal Rathi is currently working toward the M.Eng. degree with the School of Computer Science, Cornell
degree with the School of Computer Science, Cornell University. Contact him at fw224@cornell.edu.
University. Contact him at pr348@cornell.edu.
Catherine Leung is currently working toward the
Nayan Katarki is currently working toward the M.Eng. degree with the School of Computer Science,
M.Eng. degree with the School of Electrical and Cornell University. Contact him at chl66@cornell.edu.
Computer Engineering, Cornell University. Contact
him at nk646@cornell.edu. Siyuan Wang is currently working toward the M.Eng.
degree with the School of Computer Science, Cornell
Ariana Bruno is currently working toward the University. Contact him at sw884@cornell.edu.
M.Eng. degree with the School of Electrical and
Computer Engineering, Cornell University. Contact Leon Zaruvinsky is currently working toward the
him at amb633@cornell.edu. M.Eng. degree with the School of Computer Science,
Cornell University. Contact him at laz37@cornell.edu.
Justin Hu is currently working toward the M.Eng.
degree with the School of Computer Science, Cornell Mateo Espinosa is currently working toward the
University. Contact him at jh2625@cornell.edu. M.Eng. degree with the School of Computer Science,
Cornell University. Contact him at me326@cornell.edu.
Brian Ritchken is currently working toward the
M.Eng. degree with the School of Electrical and Rick Lin is currently working toward the M.Eng.
Computer Engineering, Cornell University. Contact degree with the School of Electrical and Computer
him at bjr96@cornell.edu. Engineering, Cornell University. Contact him at
cl2545@cornell.edu.
Brendon Jackson is currently working toward the
M.Eng. degree with the School of Electrical and Zhongling Liu is currently working toward the
Computer Engineering, Cornell University. Contact M.Eng. degree with the School of Electrical and
him at btj28@cornell.edu. Computer Engineering, Cornell University. Contact
him at zl682@cornell.edu.
Kelvin Hu is currently working toward the M.Eng.
degree with the School of Computer Science, Cornell Jake Padilla is currently working toward the
University. Contact him at sh2442@cornell.edu. M.Eng. degree with the School of Computer Science,
Cornell University. Contact him at jsp264@cornell.edu.
Meghna Pancholi is currently working toward the
B.S. degree with the School of Computer Science, Christina Delimitrou is currently an Assistant
Cornell University. Contact him at mp832@cornell.edu. Professor with the School of Electrical and Computer
Engineering, Cornell University, where she works on
Yuan He is currently working toward the M.Eng. computer architecture and distributed systems.
degree with the School of Electrical and Computer Her research interests include resource-efficient data-
Engineering, Cornell University. Contact him at centers, scheduling and resource management with
yh772@cornell.edu. quality-of-service guarantees, emerging cloud and IoT
application models, and cloud security. Delimitrou
Brett Clancy is currently working toward the M.Eng. received the Ph.D. degree in electrical engineering
degree with the School of Computer Science, Cornell from Stanford University. She is a member of IEEE and
University. Contact him at bjc265@cornell.edu. ACM. Contact her at delimitrou@cornell.edu.

May/June 2020
19
Theme Article: Top Picks

MAESTRO: A Data-Centric
Approach to Understand
Reuse, Performance, and
Hardware Cost of DNN
Mappings
Hyoukjun Kwon, Prasanth Chatarasi, Michael Pellauer and Angshuman Parashar
Vivek Sarkar, and Tushar Krishna NVIDIA Corp
Georgia Tech

Abstract—The efficiency of an accelerator depends on three factors—mapping, deep


neural network (DNN) layers, and hardware—constructing extremely complicated design
space of DNN accelerators. To demystify such complicated design space and guide the
DNN accelerator design for better efficiency, we propose an analytical cost model,
MAESTRO. MAESTRO receives DNN model description and hardware resources
information as a list, and mapping described in a data-centric representation we propose
as inputs. The data-centric representation consists of three directives that enable
concise description of mappings in a compiler-friendly form. MAESTRO analyzes various
forms of data reuse in an accelerator based on inputs quickly and generates more than 20
statistics including total latency, energy, throughput, etc., as outputs. MAESTRO’s fast
analysis enables various optimization tools for DNN accelerators such as hardware design
exploration tool we present as an example.

Digital Object Identifier 10.1109/MM.2020.2985963


Date of publication 22 April 2020; date of current version 22
May 2020.

0272-1732 ß 2020 IEEE Published by the IEEE Computer Society IEEE Micro
20
Figure 1. High-level overview of mapping a high-dimensional DNN layer (CONV2D in this figure) to an
accelerator with 2-D PE array. Note that tile scheduling also needs to be done within spatial partitioning; we omit
it for simplicity. (a) An Overview of Mapping CONV2D to an Accelerator. (b) High-level Tool flow of MAESTRO.

& DEEP NEURAL NETWORK (DNN) inference accel- mapping) is challenging because it requires
erators achieve high performance by exploiting deep understanding of complex interaction of
parallelism over hundreds of processing ele- hardware components, mapping, and DNN
ments (PEs) and high energy efficiency by maxi- layers. In particular, data reuse in scratchpad
mizing data reuse within PEs and on-chip memory hierarchy in DNN accelerators is one of
scratchpads.1–4 The efficiency (performance and the key behaviors, which is critical for energy
energy efficiency) of a DNN accelerator depends efficiency, thus the prime optimization target of
on three factors depicted in Figure 1: 1) the DNN accelerators. Data reuse pattern is dictated
workload (DNN layers), 2) the amount and type by dataflow,1 which are data/computation tile
of available hardware resources (hardware), and scheduling and spatial partitioning strategies
3) the mapping strategy of a DNN layer on the without actual tile size as described in Figure 1
target hardware (mapping). That is, we can pre- (a). To systematically and analytically model the
dict the efficiency (latency, energy, buffer data reuse for DNN accelerators’ efficiency esti-
requirement, etc.) of an accelerator when we mation, we need a precise and thorough descrip-
have full parameters for those three factors, tion of mapping and a framework to analyze data
which can guide the DNN accelerator design for reuse of a mapping on target hardware and the
better efficiency. One critical requirement on the DNN layer.
efficiency estimation is that it needs to be fast Therefore, we propose a data-centric repre-
since the design space (e.g., 480 million valid sentation of mapping that enables precise
designs in our hardware DSE even if we fix the descriptions of all the possible mappings in a
target mapping and layer) is huge, and we need concise and compiler-friendly manner. Leverag-
to query the efficiency of candidate designs in ing the compiler-friendly format, we develop
the search space when we search for an optimal MAESTRO, a comprehensive cost-benefit analy-
design. How do we implement such a fast effi- sis framework based on systematic data reuse
ciency estimation framework that thoroughly analysis. As shown in Figure 1(b), MAESTRO
considers all the parameters of the three receives the three factors—DNN layer, hard-
factors that determine the efficiency of DNN ware, and mapping—as inputs and generates
accelerators? more than 20 estimated statistics including
Such demands led to the development of an latency, energy, the number of buffer accesses,
analytical cost model instead of cycle-accurate buffer size requirement, etc. We validated the
simulators. Analytically, modeling the complex performance statistics of MAESTRO against
high-dimensional DNN accelerator design space cycle-accurate RTL simulation results5 and
over the three factors (DNN layer, hardware, and reported performance in a previous work6 with

May/June 2020
21
Top Picks

Table 1. The taxonomy of data reuse in DNN accelerators and shows, input tile 3 is mapped on all the PEs,
implementation choices for each. We highlight implementation used which implies the spatial reuse opportunities.
in the example reuse patterns with red texts. Dataflow implies data reuse opportunities,
and we can categorize data reuse in DNN acceler-
ators into four types (data reuse taxonomy),
which we summarize in Table 1. Each data reuse
type requires proper hardware support to
exploit the data reuse opportunity as actual data
reuse. We discuss those four reuse types
grouped in communication type as follows:
Spatial/Temporal Multicast. When the spa-
tial/temporal reuse opportunities are in input
tensors (i.e., filter and input activation), the
reused data can be multicasted to multiple PEs
(spatial reuse) or over time (temporal reuse).
The examples in Table 1 show such a pattern
based on fanout NoC (spatial multicast), which
delivers data to multiple PEs at the same time,
and buffer (temporal multicast).
In the spatial multicast example, tiles 1 and 2
are delivered to PE1 and PE2 at the same time
the accuracy of 96.1% on average. MAESTRO pro- leveraging the multicast capability of fanout
vides fast cost-benefit estimation based on an hardware. Alternatively, store-and-forward style
analytical model, which took 493 ms to analyze implementation such as systolic arrays is avail-
the entire Resent50 layers7 on a 256PE NVDLA- able with tradeoff of hardware cost and latency.
style2 accelerator on a laptop with i9-9980H CPU In the temporal multicast example, the same
with 16 GB of memory. MAESTRO supports data tile appears over time in the same PE (PE1).
arbitrary layer sizes and a variety of layer opera- That is, we send the data to the future for reuse
tions from state-of-the-art DNN models, which in the future (i.e., store the data in a buffer and
includes CONV1D, CONV2D, fully connected (FC) read it in the future). Therefore, temporal multi-
layer, depthwise separable convolution, up-scale cast, which is reading the same stored data over
convolution, etc. time, requires a buffer, as shown in Table 1.
Spatial/Temporal Reduction. When the spa-
DATA REUSE IN DNN tial reuse opportunities are in the output activa-
ACCELERATORS tion tensor, the reuse pattern in hardware is
Data reuse is the key behavior in DNN acceler- spatial reduction, which accumulates partial out-
ator that improves both latency and energy via puts (or, partial sums) for an output across multi-
reducing the number of remote buffer accesses ple PEs. The example in Table 1 shows an
(i.e., global buffer),1; 8 which is determined by example reuse pattern based on store-and-for-
dataflow. Data reuse opportunities exist when ward hardware. We observe that the output tiles
the dataflow assigns the same set of data tiles 1 and 2 are moving to the next PE over time,
over consecutive time on the same PE (i.e., reuse which illustrates pipelined accumulation to the
in time) or across multiple PEs but not over con- right direction assuming that PEs are receiving
secutive time (i.e., reuse in space). We define new operands from above (i.e., a row of a systolic
those opportunities as temporal and spatial array). Alternatively, fanin hardware such as
reuse opportunities. For example, in the example reduction tree can support the spatial reduction.
dataflow in Figure 1, output tiles (orange tiles) In contrast, the temporal reuse opportunities
remain the same in time 0 and 1, which implies imply that we compute partial sums over time
the temporal reuse opportunities. Within time 1, and accumulate them within the same location.
as the spatial partitioning example in Figure 1 This type of reuse requires a buffer since

IEEE Micro
22
Figure 2. Example CONV1D operation and mapping of the example on an accelerator. We represent the
mapping in both computation and data space, where each point corresponds to a partial sum and a data,
respectively. We use 1-based indices in this example.

intermediate results need to be stored and read We show an example of mapping on three-PE
again in the future, which effectively indicates accelerator in computation and data space
multiple read-modify-write to a buffer. The exam- in Figure 2(b). In this example mapping, we map
ple in Table 1 shows such a reuse pattern, where three partial sum computation to each PE, and
the output tile 1 appears at the same PE over each PE collaboratively compute partial outputs
time. (accumulated partial sums) on the same set of
To identify the reuse opportunities in arbi- outputs. When the PE array finishes computa-
trary mappings, we need a precise representa- tion in a tile (time=0 in the example), the PE
tion of mapping and systematically infer data array receives the next computation tile
reuse from the description. For those two (time=1 in the example). The next computation
goals, we present a data-centric representation tile is in the direction of loop index x0 . We project
of mapping, which is concise and compiler the same mapping on the data space as shown
friendly. in Figure 2, using the array subscripts in the
loop nest of CONV1D operation in Figure 2(a).
That is, partial sum at (x,0 s) requires weight at s,
DESCRIBING MAPPINGS input at x0 +s, and output at x,0 as shown in the
We use a CONV1D operation described loop body of in Figure 2(a). In the example, we
in Figure 2 as an example operation to introduce observe that the data space explicitly shows
our mapping description. As described data reuse behavior; mapped filter values do not
in Figure 2, CONV1D operation can be under- move over time, which implies that the example
stood as a sliding window operation of a filter mapping is based on a weight-stationary style
vector on a input vector, where individual multi- dataflow. This implies that inferring data reuse
plication results within a filter window are accu- can be significantly simplified when we describe
mulated to generated one output value in the mapping in the data space, which can facilitate a
output vector. When we project the loop indices fast analysis framework of DNN accelerator’s
in the loop nest in Figure 2(a), we obtain compu- efficiency.
tation space in Figure 2(b) where loop indices Motivated by the observation, we introduce
are on each axis, and partial sums are projected data-centric mapping directives that directly
in the plane. We also construct data space of describe the mapping in data space.
each vector as shown in Figure 2(b), where the
corresponding data index is on the axis. Note Data-centric mapping directives
that the data index is not the same as the loop We introduce three data-centric mapping
index (e.g., the input data index x is computed directives in Figure 3(a). Temporal and spatial
using loop indices x0 +s). Therefore, we denote map directives describe data mapping that
data indices using underlined index in this exam- changes in time and space (PEs), respectively.
ple. Note that output and filter indices x0 and s That is, temporal map corresponds to a normal
are identical to the loop indices x0 and s in this for loop in loop nest while spatial map corre-
simple example operation. sponds to a parallel for loop. Those two mapping

May/June 2020
23
Top Picks

Figure 3. Introductory example of data-centric directives. (a) Syntax of data-centric directives. (b) Semantics
of two mapping directives based on an example description process on the example CONV1D mapping
in Figure 2. (c) Capability of data-centric mapping directives that can describe a variety of mapping styles.

directives take three parameters: Mapping size, output vector in Figure 3(b), we observe that
offset, and dimension. The mapping size speci- the starting index of mapping changes over
fies the number of data points (in tensors, map- time 3, which implies that the temporal offset is
ping size in the target dimension since a 3. For filter vector, we observe that the starting
mapping constructs a high-dimensional volume) index of mapping for each PE changes by 1,
mapped on each PE. The offset describes how which implies that the spatial offset is 1. Note
the mapping is updated over time on temporal that spatial map can also involve temporal
map and space on spatial map. Cluster directive aspect as the mapping on the filter vector
specifies the hierarchical organization of PEs, in Figure 3(b); after processing all the computa-
which enables us to explore multiple parallel tion that involves the first data tile on filter, the
dimensions in a mapping. data tile will move on to the next position. This
To understand the syntax and semantics of happens when the number of PEs is not suffi-
data-centric mapping directives, in Figure 3, we cient to cover entire spatially mapped dimen-
provide an example process to determine a cor- sion (also known as spatial folding), and an
responding data mapping description of the implicit temporal offset of (spatial offset) 
example mapping in Figure 2(b). We omit the (number of PEs) is applied. Finally, we write the
input tensor because input tensor data mapping dimension on which we describe the data map-
can be easily inferred from the mapping of out- ping, then we obtain the data-centric mapping
put and filter. We first determine if the mapping description of each data mapping, as shown in
is in time or space by checking the mapped the resulting data mapping description column
data are the same or different (i.e., paralleliza- in Figure 3(b). To specify the entire example
tion) across PEs. Next, we check the number of mapping, we need to specify the order of
data points mapped on each PE to determine changes in data tile between output and filter
the mapping size, which are three and one for vectors. Since filter is updated in a slower man-
output and filter, respectively, in the example. ner, we place the data mapping description of
To determine the offset parameter, we check filter above, and write that of output below, like
the temporal and spatial offset on temporal and we specify the update order in loop nest (outer-
spatial map, respectively. For example, for most loop index changes slower).

IEEE Micro
24
Capability of Mapping Directives running VGG16 and AlexNet, respectively. The
Using the data-centric directives, we can latency estimated by MAESTRO are within 3.9%
describe a variety of mappings if it maps con- absolute error of the cycle-accurate RTL simula-
secutive data points in a regular manner (i.e., tion and reported processing delay6 on average.
affine loop subscripts when described in a
loop nest representation). Figure 3(c) shows
the capability of the data-centric directive by CASE STUDIES
showing the changes in the resulting mapping With MAESTRO, we perform deeper case
when we update the base representation we studies about the costs-benefit tradeoff of vari-
obtained in Figure 3(b). When we change the ous mappings when applied to different DNN
directive order, we describe a different order operations. We evaluate five distinct mapping
of data tile update in dimensions. This effec- styles listed in Figure 4(a) in the “Case Study I:
tively changes the stationary vector from The Impact of Mapping Choices” section and the
weight to output, which changes the temporal preference of each mapping to different DNN
data reuse opportunities. When we change the operators. For energy estimation, we multiply
spatial dimension, then we exploit the parallel- activity counts with base energy values from
ism in a different dimension, as the third exam- Cacti13 simulation (28 nm, 2 kB L1 scratchpad,
ple in Figure 3(b) and (c) shows. Finally, if we and 1 MB shared L2 buffer). We also present dis-
change the mapping size (we accordingly tinct design space of an early layer (wide and
update the offset to keep the description shallow) and a late layer (narrow and deep) to
legal), we change the amount of mapped filter show the dramatically different hardware prefer-
and output, as shown in Figure 3(c) and (d). ence of different DNN layers and mapping in the
Based on the fact that data reuse is explicit “Case Study II: Hardware Design-Parameters and
in data dimension and the capability of data- Implementation Analysis” section.
centric directives, we implement an analytical
cost-benefit analysis framework for DNN accel- Case Study I: The Impact of Mapping Choices
erators, MAESTRO. We discuss a high-level Figure 4(b) shows the DNN-operator granu-
overview of MAESTRO next and discuss larity estimation of latency and energy of each
insights from the case studies we performed mapping across five state-of-the-art DNN models
based on MAESTRO next. listed in the “Case Studies” section. Note that
this should be considered a comparison of map-
ping—not of actual designs, which can contain
ANALYTICAL COST MODEL several low-level implementation differences,
Based on the data-centric directives we dis- e.g., custom implementations of logic/memory
cussed, we built a cost-benefit analysis frame- blocks, process technology, etc. We observe
work that considers all of the three factors— that KC-P style mapping provides overall low
DNN layers, hardware, and mapping—with pre- latency and energy. However, the energy effi-
cise modeling of data reuse. MAESTRO consists ciency in VGG16 is worse than YR-P (Eyeriss1
of five preliminary engines: Tensor, cluster, style) mapping, and the latency is worse than
reuse, performance analysis, and cost analysis. YX-P (Shidiannao14 style) mapping in UNet. This
In the article, we focus on the high-level idea is based on the different preference toward map-
without details such as edge case handling, mul- ping of each DNN operator. YX-P provides short
tiple layers, and multiple level hierarchy, etc. We latency to segmentation networks like UNet,
present implementation details in our web page which has wide activation (e.g., 572  572 in the
and open-source repository. We validated input layer) and recovers the original activation
MAESTRO’s performance model against RTL sim- dimension at the end via up-scale convolution
ulation and reported processing delay of two (e.g., transposed convolutions). Such a prefer-
accelerators—MAERI5 and Eyeriss6 when ence to the YX-P style is mainly based on its par-
allelization strategy: It exploits parallelism over

https://maestro.ece.gatech.edu/ both of row and column dimensions in

May/June 2020
25
Top Picks

Figure 4. Summary of case studies. (a) List of mappings used in case study I. (b) Results of the case study I.
Top and bottom rows present latency and energy, respectively. We apply 256 PEs and 32 GBps NoC
bandwidth. We use five different DNN models; Resnet50,7 VGG16,9 ResNeXt50,10 MobileNetV2,11 and
UNet.12 The right-most column presents the average results across models for each DNN operator type and
the adaptive mapping case. We compare the number of input channels and the input activation height to
identify early and late layers (If C > Y, late layer. Else, early layer). (c) Design space of KC-P and YR-P-based
accelerators. We highlight the design space of an early and a late layer to show their significantly different
hardware preference. We apply area/power constraints based on Eyeriss6 to the DSE. The color of each data
point indicates the number of PEs. We mark the throughput- and energy-optimized designs using stars and
crosses. (d) The impact of multicast capability, bandwidth, and buffer size. Design points are selected from
the upper-most design space in (c). The name of design points refer to the differences from the throughput-
optimal reference point. Dark rows represent the efficiency of the selected design point.

activation. The energy efficiency of YR-P map- almost similar (difference < 11%), so the KC-P
ping in VGG16 is based on its high reuse factor mapping provides similar energy efficiency as
(the number of local accesses per fetch) in early YR-P in these cases. This can also be observed in
layers. The YR-P mapping has 5.8 and 15.17 the late layer (blue) bars in Figure 4(b) bottom-
higher activation and filter reuse factors, respec- row plots.
tively, in early layers. However, in late layers, The diverse preference to mappings of differ-
the reuse factors of YR-P and KC-P mapping are ent DNN operators motivates us to employ

IEEE Micro
26
optimal mapping for each DNN operator type. can be observed in the area-throughput plot
We refer such an approach as adaptive mapping in Figure 4(c). YR-P mapping requires low NoC
and present the benefits in the right-most col- bandwidth so it does not show the same behav-
umn of Figure 4(b), the average case analysis ior as KC-P mapping. However, with more strin-
across entire models in the DNN operator granu- gent area and power constraints, YR-P mapping
larity. By employing the adaptive approach, we will show the same behavior.
could observe a potential 37% latency and 10% During DSE runs, MAESTRO reports buffer
energy reduction. Such an optimization opportu- requirements for each mapping and the DSE tool
nity can be exploited by flexible accelerators like places the exact amount buffers MAESTRO
Flexflow15 and MAERI5 or via heterogeneous reported. Contrary to intuition, larger buffer
accelerators that employ multiple subaccelera- sizes do not always provide high throughput, as
tors with various mapping styles in a single DNN shown in buffer-throughput plots in Figure 4
accelerator chip. (plots in the second column). The optimal points
regarding the throughput per buffer size are in
Case Study II: Hardware Design-Parameters the top-left region of the buffer-throughput plots.
and Implementation Analysis The existence of such points indicates that the
Using MAESTRO, we implement a hardware tiling strategy of the mapping (mapping sizes
design space exploration (DSE) tool that in our directive representation) significantly
searches four hardware parameters (the number affects the efficiency of buffer use. We observe
of PEs, L1 buffer size, L2 buffer size, and that the throughput-optimized designs have a
NoC bandwidth) optimized for either energy effi- moderate number of PEs and buffer sizes, imply-
ciency, throughput, or energy-delay-product ing that hardware resources need to be distrib-
(EDP) within given hardware area and power uted not only to PEs but also to NoC and buffers
constraints. The DSE tool receives the same set for high PE utilization. Likewise, we observe that
of inputs as MAESTRO with hardware area/ the buffer amount does not directly increase
power constraints and the area/power of build- throughput and energy efficiency. These results
ing blocks synthesized with the target technol- imply that all the components are intertwined,
ogy. For the cost of building blocks, we and they need to be well-balanced to obtain a
implement float/fixed point multiplier and adder, highly efficient accelerator.
bus, bus arbiter, and global/local scratchpad in We also observe the impact of hardware sup-
RTL and synthesis them using 28-nm technology. port for each data reuse type, discussed
For bus and arbiter cost, we fit the costs into a in Table 1. Figure 4(d) shows such design points
linear and quadratic model using regression found in the design space of KC-P mapping on
because bus cost increases linearly and arbiter VGG16-conv2 layer presented in the first row of
cost increases quadratically (e.g., matrix Figure 4(c). The reference design point is the
arbiter). throughput-optimized design represented as a
Using the DSE tool, we explore the design star in the first row of Figure 4(c). When band-
space of KC-P and YR-P mapping accelerators. width gets smaller, the throughput significantly
We set the area and power constraint as 16 mm2 drops, but energy remains similar. However, the
and 450 mW, which is the reported chip area lack of spatial multicast or reduction support
and power of Eyeriss.6 We plot the entire design resulted in approximately 47% energy increase,
space we explored in Figure 4(c). Whether an as the third and fourth design points shows.
accelerator can achieve peak throughput
depends on not only the number of PEs but also
NoC bandwidth. In particular, although an accel- CONCLUSION
erator has sufficient number of PEs to exploit Fast modeling of cost-benefit space of DNN
the maximum degree of parallelism a mapping accelerators is critical for automated optimiza-
allows, if the NoC does not provide sufficient tion tools since the design space is huge and
bandwidth, the accelerator suffers a communica- high dimensional based on hundreds of DNN
tion bottleneck in the NoC. Such design points model, hardware, and mapping parameters. In

May/June 2020
27
Top Picks

this article, we presented a methodology to 2. “Nvdla deep learning accelerator,” 2017. [Online].
enable fast cost-benefit estimation of a DNN Available: http://nvdla.org.
accelerator on a given DNN model and mapping, 3. A. Parashar et al., “Scnn: An accelerator for
which consists of a compiler-friendly data-cen- compressed-sparse convolutional neural networks,”
tric representation of mappings and an analyti- in Proc. Int. Symp. Comput. Archit., 2017,
cal cost-benefit estimation framework that pp. 27–40.
exploits the explicit data reuse in data space in 4. N. P. Jouppi et al., “In-datacenter performance
data-centric repre- analysis of a tensor processing unit,” in Proc. IEEE
sentations. To ana- Using MAESTRO, we
Int. Symp. Comput. Archit., 2017, pp. 1–12.
lytically estimate the show that no single 5. H. Kwon, A. Samajdar, and T. Krishna, “Maeri:
costs and benefits, mapping and no single Enabling flexible dataflow mapping over DNN
we demystify data hardware is ideal for all accelerators via reconfigurable interconnects,” in
reuse in hardware the DNN layers, which Proc. Int. Conf. Archit. Support Program. Lang. Oper.
and required hard- implies the complexity Syst., 2018, pp. 461–475.
ware support and of the DNN accelerator 6. Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze,
apply the observa- design space. Using “Eyeriss: An energy-efficient reconfigurable
tion into the ana- hardware design accelerator for deep convolutional neural networks,”
lytical cost-benefit space exploration IEEE J. Solid-State Circuits, vol. 52, no. 1, pp. 127–
framework we
estimation frame- 138, Jan. 2017.
implemented using
work, MAESTRO. 7. K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual
MAESTRO, we also
Using MAESTRO, show that hardware
learning for image recognition,” in Proc. IEEE Conf.
we show that no sin- features can Comput. Vis. Pattern Recognit., 2016, pp. 770–778.
gle mapping and no significantly impact the 8. A. Parashar et al., “Timeloop: A systematic approach
single hardware is throughput to DNN accelerator evaluation,” in Proc. IEEE Int.
ideal for all the and energy. Symp. Perform. Anal. Syst. Softw., Mar. 2019,
DNN layers, which pp. 304–315.
implies the complex- 9. K. Simonyan and A. Zisserman, “Very deep
ity of the DNN accelerator design space. Using convolutional networks for large-scale image
hardware design space exploration framework recognition,” in Proc. Int. Conf. Learn.
we implemented using MAESTRO, we also show Representations, 2015. [Online]. Available: https://iclr.
that hardware features can significantly impact cc/archive/www/doku.php%3Fid=iclr2015:accepted-
the throughput and energy. Those cases show main.html
that the capability of MAESTRO for various anal-  r, Z. Tu, and K. He,
10. S. Xie, R. Girshick, P. Dolla
ysis problems on DNN accelerator design space. “Aggregated residual transformations for deep
In addition to the case studies we performed, neural networks,” in Proc. IEEE Conf. Comput. Vis.
MAESTRO also facilitates many other optimiza- Pattern Recognit., 2017, pp. 1492–1500.
tion (e.g., neural architecture search specialized 11. M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.
for a target accelerator, mapping search for a tar- Chen, “MobileNetV2: Inverted Residuals and Linear
get accelerator, etc.) frameworks based on its Bottlenecks,” in Proc. IEEE Conf. Comput. Vis. Pattern
speed and accuracy, which will lead to broad Recognit., 2018, pp. 4510–4520.
impact on various areas (DNN model design, 12. O. Ronneberger, P. Fischer, and T. Brox, “U-net:
compiler, architecture, etc.) in the DNN accelera- Convolutional networks for biomedical image
tor domain. segmentation,” in Proc. Int. Conf. Med. Image Comput.
Comput.-Assisted Intervention, 2015, pp. 234–241.
13. N. Muralimanohar, R. Balasubramonian, and N. P.
& REFERENCES Jouppi, “Cacti 6.0: A tool to model large caches,” HP
1. Y.-H. Chen, J. Emer, and V. Sze, “Eyeriss: A spatial Laboratories, vol. 27, p. 28, 2009.
architecture for energy-efficient dataflow for 14. Z. Du et al., “Shidiannao: Shifting vision processing
convolutional neural networks,” in Proc. Int. Symp. closer to the sensor,” in Proc. Int. Symp. Comput.
Comput. Archit., 2016, pp. 367–379. Archit, 2015, pp. 92–104.

IEEE Micro
28
15. W. Lu, G. Yan, J. Li, S. Gong, Y. Han, and X. Li, Tushar Krishna is an Assistant Professor in the
School of Electrical and Computer Engineering,
“Flexflow: A flexible dataflow accelerator architecture
Georgia Institute of Technology, where he also holds
for convolutional neural networks,” in Proc. Int. Symp.
the ON Semiconductor Junior Professorship. His
High Perform. Comput. Archit., 2017, pp. 553–564.
research interests include computer architecture,
on-chip interconnection networks, and deep learning
Hyoukjun Kwon is currently working toward the accelerators. Krishna received the Ph.D. degree in
Ph.D. degree in the College of Computing, Georgia electrical engineering and computer science from
Institute of Technology. His research interest includes Massachusetts Institute of Technology. He received
communication-centric and flexible accelerator design the NSF CRII Award in 2018. He is a member of IEEE
and modeling mappings on spatial accelerators. Kwon and ACM. Contact him at tushar@ece.gatech.edu.
received B.S. degrees in environmental materials sci-
ence and in computer science and engineering from Michael Pellauer is a Senior Research Scientist at
Seoul National University. He is a student member of NVIDIA. His research interests are building domain
IEEE. Contact him at hyoukjun@gatech.edu. specific accelerators, with a special emphasis on
deep learning and sparse tensor algebra. Pellauer
Prasanth Chatarasi is a senior Ph.D. student received the Ph.D. degree from Massachusetts Insti-
advised by Prof. Vivek Sarkar and Dr. Jun Shirako in tute of Technology, the Masters degree from Chalm-
the School of Computer Science, Georgia Institute of ers University of Technology, and the Bachelor’s
Technology. His research focuses on advancing degree from Brown University. Contact him at
compiler optimizations for high-performance appli- mpellauer@nvida.com.
cations on general-purpose and domain-specific
parallel architectures. In the past, he focused on Angshuman Parashar is a Senior Research
enhancing traditional compilation techniques for Scientist at NVIDIA. His research interests are in
both sequential and explicitly parallel programs building, evaluating, and programming spatial and
for performance optimizations and debugging data-parallel architectures, with a present focus
on general-purpose architectures. Contact him at on automated mapping of machine learning
cprasanth@gatech.edu. algorithms onto architectures based on explicit
decoupled data orchestration. Parashar received
Vivek Sarkar is a Professor and the Stephen Flem- the Ph.D. degree in computer science and engi-
ing Chair for Telecommunications in the College of neering from the Pennsylvania State University
Computing at Georgia Institute of Technology, where (2007), and the B.Tech. degree in computer
he conducts research in multiple aspects of software science and engineering from the Indian Institute
for parallel computing. He is a Fellow of ACM and of Technology, Delhi, India (2002). Contact him at
IEEE. Contact him at vsarkar@gatech.edu. aparashar@nvidia.com.

May/June 2020
29
Theme Article: Top Picks

Energy-Efficient Video
Processing for Virtual
Reality
Yue Leng and Jian Huang Chi-Chun Chen, Qiuyue Sun, and Yuhao Zhu
University of Illinois at Urbana–Champaign University of Rochester

Abstract—Virtual reality (VR) has huge potential to enable radically new applications,
behind which spherical panoramic video processing is one of the backbone techniques.
However, current VR systems reuse the techniques designed for processing conventional
planar videos, resulting in significant energy inefficiencies. Our characterizations show
that operations that are unique to processing 360 VR content constitute 40% of the
total processing energy consumption. We present EVR, an end-to-end system for
energy-efficient VR video processing. EVR recognizes that the major contributor to the VR
tax is the projective transformation (PT) operations. EVR mitigates the overhead of PT
through two key techniques: semantic-aware streaming on the server and hardware-
accelerated rendering on the client device. Real system measurements show that EVR
reduces the energy of VR rendering by up to 58%, which translates to up to 42% energy
saving for VR devices.

& VIRTUAL (VR) has profound social


REALITY conventional planar videos, 360 videos embed
impact in transformative ways. For instance, panoramic views of the scene. As users change
immersive VR experience is shown to reduce the viewing angle, the VR device renders differ-
patient pain more effectively than traditional ent parts of the scene, mostly on a head-
medical treatments, and is seen as a promising mounted display (HMD), providing an immersive
solution to the opioid epidemic. One of the key experience.
use-cases of VR is 360 video processing. Unlike A major challenge in VR video processing
today is the excessive power consumption of VR
devices. Our measurements show that rendering
Digital Object Identifier 10.1109/MM.2020.2985692 720p VR videos in 30 frames per second (FPS)
Date of publication 6 April 2020; date of current version 22 consistently consumes about 5 W of power,
May 2020. which is twice as much power than rendering

0272-1732 ß 2020 IEEE Published by the IEEE Computer Society IEEE Micro
30
conventional planar videos and exceeds the that is specialized for PT. We implement an
thermal design point (TDP) of typical mobile EVR prototype on an Amazon AWS server
devices.4 The device power requirement instance and an NVIDA Jetson TX2 board
will only grow as users demand combined with a Xilinx Zynq-7000
higher frame-rate and resolu- A major challenge in FPGA. Real system measurements
tion, presenting a practical VR video processing show that EVR reduces the energy
challenge to the energy- and today is the excessive of VR rendering by up to 58%,
thermal-constrained mobile VR power consumption of which translates to up to 42%
devices. VR devices. Our energy saving for VR devices.
The excessive device power measurements show
is mainly attributed to the fun- that rendering 720p VR
videos in 30 frames
damental mismatch between ENERGY
per second (FPS)
today’s VR system design phi- CHARACTERIZATIONS
consistently consumes
losophy and the nature of VR A VR system involves two distinct
about 5 W of power,
videos. Today’s VR video sys- which is twice as much
stages: capture and rendering. VR
tems are designed to reuse power than rendering videos are captured by special cam-
well-established techniques conventional planar eras, which generate 360 images
designed for conventional pla- videos and exceeds that are best presented in the spheri-
nar videos.1 This strategy the thermal design cal format. The spherical images are
accelerates the deployment of point (TDP) of typical then projected to planar frames
VR videos, but causes signifi- mobile devices. through one of the spherical-to-pla-
cant energy overhead. More nar projections, such as the equirec-
specifically, VR videos are streamed and proc- tangular projection. The planar video
essed as conventional planar videos. As a result, is either directly live-streamed to client devices
once on-device, each VR frame goes through a for rendering (e.g., broadcasting a sports event),
sequence of spherical–planar projective trans- or published to a content provider, such as
formations (PT) that correctly render a user’s YouTube or Facebook, and then streamed to
current viewing area on the display. The PT client devices upon requests. Alternatively, the
operations are pure overhead uniquely associ- streamed videos can also be persisted in the
ated with processing VR videos—operations local storage on a client device for future play-
that we dub “VR tax.” Our characterizations back. This article focuses on client-side VR con-
show that “VR tax” is responsible for about 40% tent rendering, i.e., after a VR video is captured,
of the processing energy consumption, a lucra- because rendering directly impacts VR devices’
tive target for optimizations. energy efficiency.
We present EVR, an end-to-end system for Rendering VR videos consumes excessive
energy-efficient VR video processing. EVR rec- power on the VR device, which is particularly
ognizes that the major contributor to the VR problematic as VR devices are energy and ther-
tax is the PT operations. EVR mitigates the mal constrained. This section characterizes the
overhead of PT through two key techniques: energy consumption of VR devices. Although
semantic-aware streaming (SAS) on the server there are many prior studies that focused on
and hardware-accelerated rendering (HAR) on energy measurement of mobile devices such as
the client device. EVR uses SAS to reduce smartphones and smartwatches, this is the first
the chances of executing PT on VR devices by such study that specifically focuses on VR devi-
prerendering 360  frames in the cloud. ces. We show that the energy profiles between
Different from conventional prerendering tech- VR devices and traditional mobile devices are
niques, SAS exploits the key semantic informa- different.
tion inherent in VR content that is previously We conduct studies on a recently published VR
ignored. Complementary to SAS, HAR miti- video dataset, which consists of head movement
gates the energy overhead of on-device ren- traces from 59 real users viewing different 360
dering through a new hardware accelerator VR videos on YouTube.3 We replay the traces

May/June 2020
31
Top Picks

insignificant, contributing to only about 9%,


7%, and 4% of the total energy consumption,
respectively. This indicates that optimizing
network, display, and storage would lead to
marginal energy reductions. More lucrative
energy reductions come from optimizing com-
pute and memory.

Contribution of VR Operations We further


find that energy consumed by executing opera-
tions that are uniquely associated with process-
Figure 1. Power and energy characterizations of VR device. ing VR videos constitutes a significant portion of
(a) Power distribution across the major components in a VR the compute and memory energy. Such opera-
device. (b) Contribution of the PT operations to the compute and tions mainly consist of PT operations. We show
memory energy. the energy contribution of the PT operations to
the total compute and memory energy as a
to mimic realistic VR viewing behaviors. We stacked bar chart in Figure 1(b). On average, PT
assemble a custom VR device based on the NVIDIA contributes to about 40% of the total compute
Jetson TX2 development board in order to con- and memory energy, and is up to 53% in the case
duct fine-grained power measurements on the of video Rhino. The PT operations exercise the
hardware, which are infeasible in off-the-shelf VR SoC more than the DRAM as is evident in their
devices. We refer readers to the “Evaluation higher contributions to compute energy than
Methodology” section for a complete experimen- memory energy.
tal setup. Overall, our results show that the PTs would
be an ideal candidate for energy optimizations.
Power and Energy Distribution
We breakdown the device power consump- ENERGY-EFFICIENT VR WITH EVR
tion into five major components: display, net- The goal of EVR is to reduce energy consump-
work (WiFi), storage (eMMC), memory (DRAM), tion of VR devices by optimizing the core compo-
and compute (SoC). The storage system is nents that contribute to the “VR tax.” We present
involved mainly for temporary caching. We an end-to-end energy-efficient VR system with
show the power distribution across the five com- optimization techniques distributed across the
ponents for the five VR video workloads in cloud server and the VR client.
Figure 1(a). The power consumption is averaged
across the entire viewing period. Thus, the Semantics-Aware Streaming
power consumption of each component is pro- Our key idea is to leverage video-inherent
portional to its energy consumption. semantic information that is largely ignored by
We make two important observations. First, today’s VR servers. SAS specifically focuses on
the device consistently draws a power con- one particular form of semantic information:
sumption of about 5 W across all five VR vid- visual object. We show that users tend to focus
eos. As a comparison, the TDP of a mobile on objects in VR content, and object trajecto-
device, i.e., the power that the cooling system ries provide a proxy for predicting user view-
is designed to sustainably dissipate, is around ing areas. We leverage a recently published VR
3.5 W,4 clearly indicating the need to reduce video data set,3 which consists of head move-
power consumption. ment traces from 59 real users viewing differ-
Second, unlike traditional smartphone and ent 360 VR videos on YouTube. We further
smartwatch applications where network, confirm that users track the same set of
display, and storage consume significant objects across frame rather than frequently
energy,2,7,9 the energy consumptions of the switching objects. Specifically, we measure
three components in a VR device are relatively the time durations during which users keep

IEEE Micro
32
user’s current focus. One could also imagine
other design alternatives. For instance, the client
could send the current (desired) FOV to the
cloud service, which returns another FOV video
if there happens to be one that matches the
desired FOV. We leave it as future work to
explore the full design space of the dynamic
component.
Figure 2. Cumulative distribution of tracking
durations.
Hardware-Accelerated Rendering
tracking the movement of the same object, and We propose a new hardware accelerator, PT
show the results in Figure 2 as a cumulative engine (PTE), that performs efficient PTs. We
distribution plot. On average, users spend design the PTE as an SoC IP block that replaces
about 47% of time tracking an object for at the GPU and collaborates with other IPs such as
least 5 s. the Video Codec and Display Processor for VR
The near 100% frame coverage in many vid- video rendering. Figure 3 shows how PTE fits
eos as the number of identified objects increases into a complete VR hardware architecture.
indicates that the server can effectively predict The PTE takes in frames that are decoded from
user viewing area solely based on the visual the video codec, and produces FOV frames to
objects without sophisticated client-side mecha- the frame buffer for display. If a frame is already
nisms such as using machine learning models to prepared by the cloud server as a projected FOV
predict users’ head movement.5,10 This observa- frame, the PTE sends it to the frame buffer
tion frees the resource-constrained VR from per- directly; otherwise the input frame goes through
forming additional work and simplifies the client the PTE’s datapath to generate the FOV frame.
design. The GPU can remain idle during VR video play-
SAS has two major components: First, a static back to save power.
and offline analysis component that extracts The bulk of the PTE is a set of PT units (PTU)
objects from the VR video upon ingestion that exploits the pixel-level parallelism. The
and generates a set of FOV videos that could be pixel memory (P-MEM) holds the pixel data for
directly visualized once on a VR device; second, the incoming input frame, and the sample mem-
a dynamic and runtime serving component that ory (S-MEM) holds the pixel data for the FOV
streams FOV videos on demand to the VR device. frame that is to be sent to the frame buffer. The
We augment the new FOV video with metadata PTE uses DMA to transfer the input and FOV
that corresponds to the head orientation for frame data. The PTE also provides a set of
each frame. Once the FOV video together with
its associated metadata is on the client side and
before a FOV frame is sent to the display, the VR
client compares the desired viewing area indi-
cated by the head motion sensor with the meta-
data associated with the frame. If the two match,
the client directly visualizes the frame on the dis-
play, bypassing the PT operations. Otherwise,
the client system requests the original video seg-
ment from the cloud, essentially falling back to
the normal VR rendering mode.
Note that in our current design, we make the
simplification that the client will always fall back
to the regular VR processing flow upon FOV-miss
and restart streaming FOV videos based on Figure 3. Overview of the augmented hardware architecture.

May/June 2020
33
Top Picks

We propose a new hardware accelerator, the PT engine (PTE), that performs efficient PTs. We design the PTE as an SoC IP block that replaces the GPU and collaborates with other IPs such as the video codec and display processor for VR video rendering. Figure 3 shows how the PTE fits into a complete VR hardware architecture. The PTE takes in frames that are decoded from the video codec, and produces FOV frames to the frame buffer for display. If a frame is already prepared by the cloud server as a projected FOV frame, the PTE sends it to the frame buffer directly; otherwise, the input frame goes through the PTE's datapath to generate the FOV frame. The GPU can remain idle during VR video playback to save power.

The bulk of the PTE is a set of PT units (PTUs) that exploit pixel-level parallelism. The pixel memory (P-MEM) holds the pixel data for the incoming input frame, and the sample memory (S-MEM) holds the pixel data for the FOV frame that is to be sent to the frame buffer. The PTE uses DMA to transfer the input and FOV frame data. The PTE also provides a set of memory-mapped registers for configuration purposes. The configurability allows the PTE to adapt to different popular projection methods and VR device parameters, such as FOV size and display resolution. The configurability ensures the PTE's flexibility without the overhead of the general-purpose programmability that GPUs introduce.

The P-MEM and S-MEM must be properly sized to minimize DRAM traffic. Holding the entire input frame and FOV frame would require the P-MEM and S-MEM to match the video resolution (e.g., 4K) and the display resolution (e.g., 1440p), respectively, requiring tens of megabytes of on-chip memory, which is prohibitively large in practice. Interestingly, we find that the filtering step (the only step in the PT algorithm that accesses pixel data) operates much like a stencil operation that possesses two properties. First, PT accesses only a block of adjacent pixels for each input. Second, the accessed pixel blocks tend to overlap between adjacent inputs. Thus, the P-MEM and S-MEM are designed to hold several lines of pixels of the input and FOV frames, which is similar to the line buffer used in image signal processor (ISP) designs.6
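The line-buffer observation can be made concrete with a small C sketch. This is only an analogy under simplifying assumptions (a 2x2 bilinear footprint, 8-bit pixels, and a four-line window); the buffer sizing and the sampling code are placeholders rather than the PTE's actual design.

```c
#include <stdint.h>

#define FRAME_W 3840   /* assumed input frame width (4K video)            */
#define LINES   4      /* rows kept on chip; a few suffice because the    */
                       /* filter footprints of adjacent samples overlap   */

/* A line buffer modeled in C: it holds the LINES most recently fetched
 * input rows, so a 2x2 filter footprint that overlaps between adjacent
 * output pixels is served without re-reading DRAM. */
typedef struct {
    uint8_t row[LINES][FRAME_W];
    int     top;       /* frame row number currently held in row[0] */
} line_buf_t;

static uint8_t lb_pixel(const line_buf_t *lb, int y, int x)
{
    return lb->row[y - lb->top][x];   /* y assumed inside the window */
}

/* Bilinear sample at fractional coordinates (fx, fy): the filtering
 * step of PT touches only a 2x2 block of adjacent pixels, as noted. */
static uint8_t lb_bilinear(const line_buf_t *lb, float fx, float fy)
{
    int   x = (int)fx, y = (int)fy;
    float ax = fx - x, ay = fy - y;
    float p = (1 - ax) * (1 - ay) * lb_pixel(lb, y,     x)
            + ax       * (1 - ay) * lb_pixel(lb, y,     x + 1)
            + (1 - ax) * ay       * lb_pixel(lb, y + 1, x)
            + ax       * ay       * lb_pixel(lb, y + 1, x + 1);
    return (uint8_t)p;
}
```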

EVR Implementation
Building on top of the two optimizing primitives, SAS and HAR, we design EVR. EVR includes a cloud component and a client component. The cloud component extracts object semantics from VR videos upon ingestion, and prerenders a set of miniature videos that contain only the user viewing areas and that could be directly rendered as planar videos, by leveraging the powerful computing resources on the cloud. The client component retrieves the miniature video with object semantics, and leverages the specialized accelerator for energy-efficient on-device rendering if the original full video is required. For VR applications whose content comes from panoramic videos available on the VR device, HAR can accelerate the video rendering with lower energy overhead. We implement EVR in a prototype system, where the cloud service is hosted on an AWS instance while the client is deployed on a customized platform that combines the NVIDIA TX2 and Xilinx Zynq-7000 development boards, which can represent a typical VR client device.

EVALUATION

Evaluation Methodology
Usage Scenarios. We evaluate three EVR variants, each applying to a different use case, to demonstrate EVR's effectiveness and general applicability. The three variants are as follows.

- S: leverages SAS without HAR.
- H: uses HAR without SAS.
- S+H: combines the two techniques.

Energy Evaluation Framework. Our energy evaluation framework considers the five important components of a VR device: network, display, storage, memory, and compute. The network, memory, and compute power can be directly measured from the TX2 board through the onboard Texas Instruments INA 3221 voltage monitor IC. We also use a 2560 x 1440 AMOLED display that is used in the Samsung Gear VR, and its power is measured in our evaluation. We estimate the storage energy using an empirical eMMC energy model7 driven by the storage traffic traces.

Baseline. We compare against a baseline that is implemented on the TX2 board and that does not use SAS and HAR. The baseline is able to deliver a real-time (30-FPS basis) user experience. Our goal is to show that EVR can effectively reduce the energy consumption with little loss of user experience.

Benchmark. To faithfully represent real VR user behaviors, we use a recently published VR video data set,3 which consists of head movement traces from 59 real users viewing different 360-degree VR videos on YouTube. The videos have a 4K (3840 x 2160) resolution, which is regarded as providing an immersive VR experience. The data set is collected using the Razer Open Source Virtual Reality HDK2 HMD with an FOV of 110 x 110 degrees, and records users' real-time head movement traces. We replay the traces to emulate readings from the IMU sensor and thereby mimic realistic VR viewing behaviors. This trace-driven methodology ensures the reproducibility of our results.
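A trace-replay loop of the kind described here might look like the following C sketch. The record layout (yaw/pitch per timestamp) and the callback interface are assumptions for illustration, since the data set's exact format is not reproduced in this article.

```c
#include <stdio.h>

/* Hypothetical trace record: capture time in milliseconds and the head
 * orientation reported by the HMD at that instant. */
typedef struct {
    double ts_ms;
    float  yaw, pitch;
} head_sample_t;

/* Replays a recorded head-movement trace, invoking the same callback the
 * live IMU-sensor path would, so every run sees identical "sensor"
 * readings and results stay reproducible. */
static void replay_trace(FILE *f, void (*on_sample)(float yaw, float pitch))
{
    head_sample_t s;
    while (fread(&s, sizeof s, 1, f) == 1)
        on_sample(s.yaw, s.pitch);   /* timing control elided for brevity */
}
```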

Results
Energy Reductions. On average, S and H achieve 22% and 38% compute energy savings, respectively. S+H combines SAS and HAR and delivers an average 41%, and up to 58%, energy saving. The compute energy savings across applications are directly proportional to the PT operation's contribution to the processing energy, as shown in Figure 1(b). For instance, Paris and Elephant have lower energy savings because their PT operations contribute less to the total compute energy consumption.

The trend is similar for the total device energy savings. S+H achieves on average 29% and up to 42% energy reduction. The energy reduction increases the VR viewing time, and also reduces the heat dissipation and, thus, provides a better viewing experience.

User Experience Impact. We also quantify user experience both quantitatively and qualitatively. Quantitatively, we evaluate the percentage of FPS degradation introduced by EVR compared to the baseline. We show that the FPS drop rate averaged across 59 users is only about 1%. Lee et al. reported that a 5% FPS drop is unlikely to affect user perception.8 We assessed qualitative user experience and confirmed that the FPS drop is visually indistinguishable and that EVR delivers smooth user experiences. Although the goal of EVR is not to save bandwidth, EVR does reduce the network bandwidth requirement through SAS, which transmits only the pixels that fall within the user's sight.

CONCLUSION
We anticipate that EVR will have a significant long-term impact on VR technologies and their applications of tomorrow. We summarize the main contributions as follows: EVR provides the first energy characterization study of VR devices and demonstrates their major energy overhead; EVR develops the first hardware accelerator for the critical PT operations in VR video processing; and EVR provides a systematic approach to improving the energy efficiency of VR applications with cloud/client codesign.

Energy Characterization of VR Devices
Although there are many prior studies that focused on energy measurement of mobile devices, such as smartphones and smartwatches, this is the first such study that specifically focuses on VR devices. We show that the energy profiles of VR devices are significantly different from those of traditional mobile devices. Our results suggest that we must rethink conventional system-level power/energy optimizations in the context of VR processing.

Implication on Hardware IP Block for VR
This article provides a case in point for future mobile SoCs to integrate VR-specific and VR-optimized IP blocks, and our principal idea of bypassing the GPU will be critical to those designs. We design the PTE as a standalone IP block in order to enable modularity and ease distribution. Alternatively, the PTE logic could be tightly integrated into either the video codec or the display processor. Indeed, many new designs of the display processor have started integrating functionalities that used to be executed on GPUs, such as color space conversion. Such a tight integration would let the display processor directly perform PT operations before scanning out the frame to the display, and thus reduce the memory traffic induced by writing the FOV frames from the PTE to the frame buffer.

Cloud/Client Codesign for VR
EVR provides a cloud/client collaborative approach to improving the energy efficiency of VR devices. In EVR, SAS and HAR have different tradeoffs. On one hand, HAR is applicable regardless of where the VR videos come from, but it does not completely remove the overhead of the PT operation. On the other hand, SAS potentially removes the PT operation altogether, but relies on the VR video being published to a cloud server first. We show that combining the two, when applicable, achieves the best energy efficiency.


REFERENCES
1. White Paper: 360-Degree Video Rendering. [Online]. Available: https://community.arm.com/graphics/b/blog/posts/white-paper-360-degreevideo-rendering
2. X. Chen, N. Ding, A. Jindal, C. Hu, M. Gupta, and R. Vannithamby, "Smartphone energy drain in the wild: Analysis and implications," ACM SIGMETRICS Performance Eval. Rev., vol. 43, no. 1, pp. 151–164, 2015.
3. X. Corbillon, F. Simone, and G. Simon, "360-degree video head movement dataset," in Proc. 8th ACM Multimedia Syst. Conf., 2017, pp. 199–204.
4. M. Halpern, Y. Zhu, and V. Reddi, "Mobile CPU's rise to power: Quantifying the impact of generational mobile CPU design trends on performance, energy, and user satisfaction," in Proc. Int. Symp. High-Performance Comput. Archit., 2016, pp. 64–76.
5. B. Haynes, A. Minyaylov, M. Balazinska, L. Ceze, and A. Cheung, "VisualCloud demonstration: A DBMS for virtual reality," in Proc. ACM Int. Conf. Manage. Data, 2017, pp. 1615–1618.
6. J. Hegarty et al., "Darkroom: Compiling high-level image processing code into hardware pipelines," in Proc. SIGGRAPH, 2014.
7. J. Huang, A. Badam, R. Chandra, and E. Nightingale, "WearDrive: Fast and energy-efficient storage for wearables," in Proc. USENIX Annu. Tech. Conf., 2015, pp. 613–625.
8. K. Lee et al., "Outatime: Using speculation to enable low-latency continuous interaction for mobile cloud gaming," in Proc. 13th Annu. Int. Conf. Mobile Syst., Appl., Services, 2015, pp. 151–165.
9. J. Li, A. Badam, R. Chandra, S. Swanson, B. Worthington, and Q. Zhang, "On the energy overhead of mobile storage systems," in Proc. File Storage Technol., 2014, pp. 105–118.
10. F. Qian, L. Ji, B. Han, and V. Gopalakrishnan, "Optimizing 360 video delivery over cellular networks," in Proc. 5th Workshop All Things Cellular: Oper., Appl. Challenges, 2016, pp. 1–6.

Yue Leng is currently a Software Engineer with Airbnb, San Francisco, CA, USA. Leng received the M.S. degree in computer engineering from the University of Illinois at Urbana–Champaign in 2019. Contact her at yueleng2@illinois.edu.

Jian Huang is currently an Assistant Professor with the Electrical and Computer Engineering Department, University of Illinois at Urbana–Champaign. Huang received the Ph.D. degree from the Georgia Institute of Technology in 2017. Contact him at jianh@illinois.edu.

Chi-Chun Chen is currently a Compiler Engineer with Cray, Inc., Seattle, WA, USA. Chen received the M.S. degree in computer science from the University of Rochester in 2019. Contact him at cchen120@ur.rochester.edu.

Qiuyue Sun is currently a senior undergraduate student with the Computer Science Department, University of Rochester. Contact her at qsun15@u.rochester.edu.

Yuhao Zhu is currently an Assistant Professor with the Computer Science Department, University of Rochester. Zhu received the Ph.D. degree from The University of Texas at Austin in 2017. He is the corresponding author of this article. Contact him at yzhu@rochester.edu.

Theme Article: Top Picks

Towards General-Purpose
Acceleration: Finding
Structure in Irregularity
Vidushi Dadu, Jian Weng, Sihao Liu, and
Tony Nowatzki
University of California Los Angeles

Abstract—Programmable hardware accelerators (e.g., vector processors, GPUs) have been extremely successful at targeting algorithms with regular control and memory patterns to achieve order-of-magnitude performance and energy efficiency improvements. However, they perform far under the peak on important irregular algorithms, like those from graph processing, database querying, genomics, advanced machine learning, and others. This work posits that the primary culprit is specific forms of irregular control flow and memory access. By capturing the problematic behavior at a domain-agnostic level, we propose an accelerator that is sufficiently general, matches domain-specific accelerator performance, and significantly outperforms traditional CPUs and GPUs.

THE SLOWING IMPROVEMENTS of technology scaling are raising the demand for specialized hardware accelerators, especially for increasingly difficult problems. While general-purpose data-processing hardware, like GPUs or other vector architectures, is effective on regular algorithms, those with irregularity in their control flow or memory access patterns suffer in performance. As evidence, many domain-specific accelerators have been proposed for "irregular" domains like graph processing,3,9 compressed neural networks,4,6,10 databases,12 and genomics. Compared to such architectures, GPUs lose in performance and/or energy efficiency by an order of magnitude. On the other hand, domain-agnostic architectures are widely applicable, which is valuable for economies of scale and robustness to algorithm change. An important question then is whether it is possible to build a programmable accelerator that is equally as capable as GPUs and vector processors, but better suited to irregular algorithms.


The first step toward this goal is to recognize what makes an algorithm irregular. In computer architecture, the concept of irregularity is often used informally to indicate behavior that causes inefficiency. We argue that, to first order, the root cause of most irregularity is data dependence. Data dependence can appear in many forms, including control, memory address calculation, conditional memory accesses, reuse, and parallelism structure.

Consider the concrete example of sorting algorithms. Merge sort has regular memory and irregular control. The memory that is read or written at each step is predetermined, but the control flow decisions depend on the relative order of the lists to be merged at each step. This data-dependent control prevents speculative execution from being effective, and it also prevents vectorization (we cannot know the operands for each lane in advance). Conversely, the radix-sort algorithm has regular control, but irregular memory access (bin increment and scatter). This prevents prefetching from being effective, and also prevents vectorization with standard instructions. In general, data dependence interferes with or complicates the fundamental mechanisms that parallel processors use to extract performance.

Insight: To address these challenges, our key observation is that it is not necessary to handle arbitrary irregularity because data dependence manifests in common forms across domains. This work suggests two specific forms are critical: stream join and alias-free indirection (AF-Indirect).

Stream join is defined by in-order processing of data, where only the relative order of consumption and production of new data is dependent on control decisions. These joins are surprisingly common, including merge sort, database joins, and inner-product sparse tensor operations. AF-Indirect is characterized by memory access with data-dependent addresses, but where the only memory dependencies are read–modify–write. Relevant kernels include radix sort, outer-product sparse-tensor operations, hash joins, histograms, and synchronous graph processing (e.g., PageRank).

These data-dependence forms are not mutually exclusive; in fact, they can be thought of as different ways to relax regular (non-data-dependent) algorithms (see Figure 1). An example of a workload that requires both forms is triangle counting in graphs. At each vertex, AF-Indirect is used to locate a neighbor node's adjacency list, and stream join can be used to find common neighbors (each indicating a triangle).

Figure 1. Restricted data-dependence forms cover many algorithms. (a) Algorithm classification. (b) Dependence forms coverage.

Approach: Critically, we find that these restricted forms of data-dependence can serve as abstractions, which can be exploited in hardware. We use this insight to construct our approach to design a "general-purpose" accelerator. Specifically, we start with an architecture known to work well for regular algorithms: a systolic-style coarse-grained reconfigurable architecture (CGRA) with streaming memory support. We then develop hardware and software mechanisms for our two restricted data-dependence forms.

Our design is called the sparse processing unit (SPU). SPU supports fully pipelined stream joins with a systolic CGRA augmented with a novel dataflow-control model. SPU supports high-bandwidth AF-Indirect (load/store/update) with a banked scratchpad with aggressive reordering and embedded compute units for atomic update. Data dependence complicates the support for finer grain data types [naive subword single-instruction–multiple-data (SIMD) is insufficient]. Therefore, we add support to SPU to enable decomposing the reconfigurable network and wide memory access into power-of-two finer grain resources while maintaining data-dependence semantics.
Evaluation and contribution: We study machine learning (ML) as our primary domain, and graph processing and databases to demonstrate generality. SPU achieves between 1.8x and 7x speedup on artificial intelligence (AI)/ML applications, and SPU's ability to retain performance on dense algorithms leads to 4.5x speedup. On graph and database applications, SPU achieves similar performance to domain-specific accelerators with modest performance and power overheads.

Our primary contributions in this work are the identification of the two common exploitable data-dependence forms, and an ISA and hardware mechanisms to support them. More broadly, we believe that taking a domain-agnostic approach can lead to novel insights and foster knowledge transfer across domains.

EXPLOITABLE DATA-DEPENDENCE FORMS
We observe that two restricted forms of data dependence are sufficient to cover many algorithms: stream join and AF-Indirect. In this section, we first define these forms and give intuition on their performance challenges for existing architectures, and then overview our proposal.

Preliminary term—"Streams": Both of the dependence forms rely on the concept of stream abstractions, so we briefly explain. Streams are simply an ordered sequence of values. Relevant to this work are memory streams, which are sequences of loads or stores with a well-defined pattern.7,12 Streams are similar to vector accesses, but have no fixed length.

Stream Join
An interesting class of algorithms iterates over each input (each stream) in order, but the total order of operations (and perhaps whether an output is produced) is data dependent. Two relevant kernels are shown in Figure 2. Sparse vector multiplication (a) iterates over two sparse index lists (in CSR format) where indices are stored in sorted order, and performs the multiplication if there is a match. The core of the merge kernel (b) iterates over two sorted lists, and at each step outputs the smaller item. Even though the data structures, data types, and purpose are very different, their relationship to data dependence is the same: they both have stream access, but the relative ordering of stream consumption is data dependent (they reuse data from some stream multiple times).

Stream-join definition: A program region that is regular except that the reuse of stream data and the production of outputs may depend on the data.

Problem with CPUs/GPUs and motivation: Because of their data-dependent nature, stream joins introduce branch mispredictions for CPUs. For GPGPUs, vectorization becomes difficult due to control divergence of single-instruction–multiple-threads (SIMT) lanes; also, the memory pattern can diverge between lanes, causing bank conflicts.

To visualize the problem for CPUs, see Figure 2, which shows both the traditional dataflow and the proposed stream-join dataflow representation for the examples above. Here, black arrows represent data dependence, and green arrows indicate control.

Figure 2(a) shows that the inner product dataflow can be mapped to a dataflow-based processor like an out-of-order core, but only at low throughput. To explain, note that there is a loop-carried dependence through the control-dependent increment and memory access. This prevents perfect pipelining, and the throughput is limited to one instance of this computation every n cycles, where n is the total latency of these instructions.

Insight: Our insight is that, from the perspective of the memory, the control dependence is mostly unnecessary, as most loads at the line granularity will be performed anyway. Therefore, to break the dependence, we need to separate the loads from computation (this is what memory streams do), then expose a pipelined mechanism for controlling the order of data consumption. In the sparse vector example, we would like to reuse the larger of the two index values for consideration (data-dependent reuse). If the comparison instruction can treat its inputs like a queue, and specify the reuse behavior (i.e., pop the smaller element), this can be accomplished in a pipelined fashion.
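For concreteness, the inner-product kernel of Figure 2(a) reduces in C to the loop below. Which pointer advances each iteration depends on comparing the values just loaded, which is exactly the loop-carried control dependence discussed above (the code is our own rendering, not taken from the article's figure).

```c
/* Sparse dot product over two CSR index/value lists, indices sorted.
 * Which index advances each iteration is decided by comparing data,
 * so the address of the next load depends on the current loads: the
 * loop-carried control dependence that defeats prediction and SIMD. */
float sparse_dot(const int *idx_a, const float *val_a, int na,
                 const int *idx_b, const float *val_b, int nb)
{
    float acc = 0.0f;
    int i = 0, j = 0;
    while (i < na && j < nb) {
        if (idx_a[i] == idx_b[j])
            acc += val_a[i++] * val_b[j++];  /* match: multiply-accumulate */
        else if (idx_a[i] < idx_b[j])
            i++;   /* pop the smaller element; the larger is reused */
        else
            j++;
    }
    return acc;
}
```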


Figure 2. Example restricted dependence form algorithms.

Also, the multiply–accumulate operation is performed on only matching indices, so we should discard some of these computations/data. Therefore, in addition to data-dependent reuse, we also require data-dependent discard.

The merge example [see Figure 2(b)] has a surprisingly similar form and control dependence loop to the sparse multiplication, where the computation is replaced by selecting the smaller item. A similar approach of decoupling streams and applying data-dependent reuse and discard will break the control dependence loop and enable high throughput.

Our stream-join proposal: We find the desired behavior can be accomplished with a simple and novel control flow model for full-throughput systolic execution. In this model, each instruction may reuse its inputs, discard the computation, or reset a register based on a dataflow input. Figure 2 shows the examples written in this model.

To enable flexible control interpretation, each instruction embeds a simple configurable mapping function from the instruction output and control input to the control operations:

f(inst_out, control_in) -> reuse1, reuse2, discard, reset.
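A software rendition of this execution model is sketched below in C: the index streams arrive through FIFOs, and the comparison outcome plays the role of the instruction output in the mapping function above, selecting which inputs to pop (consume) versus keep (reuse) and when the multiply is discarded. The FIFO representation and helper names are stand-ins for hardware ports, not SPU's ISA.

```c
/* Stream-join rendition of the sparse dot product: index streams are
 * modeled as FIFOs (array plus head pointer); values travel in lockstep
 * with their indices. In hardware the multiply would issue every cycle
 * and be discarded on a mismatch; this model simply skips it. */
typedef struct { const int *data; int head, len; } fifo_t;

static int fifo_front(const fifo_t *f) { return f->data[f->head]; }

float stream_join_dot(fifo_t *ia, const float *va,
                      fifo_t *ib, const float *vb)
{
    float acc = 0.0f;
    while (ia->head < ia->len && ib->head < ib->len) {
        int a = fifo_front(ia), b = fifo_front(ib);
        int cmp = (a > b) - (a < b);   /* the "instruction output" */
        if (cmp == 0) {                /* match: keep the product  */
            acc += va[ia->head] * vb[ib->head];
            ia->head++; ib->head++;    /* pop both inputs          */
        } else if (cmp < 0) {
            ia->head++;                /* pop smaller, reuse larger */
        } else {
            ib->head++;                /* discard: no accumulate    */
        }
    }
    return acc;
}
```

Because the loads are issued by decoupled memory streams, only this pop/keep decision remains on the critical path, which is what allows one join step per cycle.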
Alias-Free Indirection
Many algorithms rely on indirect read, write, and update to memory, often showing up as a[f(b[i])]. Figure 2 shows two examples: The sparse-vector/sparse-matrix outer product (c) works by performing all combinations of nonzero multiplications, and accumulating into the correct location in a dense output vector. Histogram (d) is straightforward. Both perform a read–modify–write access to an indirect location. This can be viewed as two dependent streams. Another important observation is that there are no unknown aliases between streams—the only dependence is between the load and store of the indirect update.
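For concreteness, here is the a[f(b[i])] pattern as it appears in a C histogram; every iteration performs a read–modify–write to a data-dependent address, and, as noted above, the keys stream never aliases the bins array.

```c
/* Histogram: the canonical a[f(b[i])] update. The only dependence is
 * the read-modify-write on bins[]; keys[] never aliases it, which is
 * the alias freedom that AF-Indirect exposes to the hardware. */
void histogram(unsigned *bins, int nbins, const unsigned *keys, int n)
{
    for (int i = 0; i < n; i++)
        bins[keys[i] % nbins]++;   /* indirect, data-dependent address */
}
```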

Figure 3. SPU microarchitecture. (a) Sparse Processing Unit (SPU). (b) SPU Core. (c) Scratchpad Controller. (d) DGRA
Processing Element.

AF-Indirect definition: A program region that is regular (including no implicit dependencies) except that the address of one memory stream may depend on another, and a stream can encode a read–modify–write operation.

Problem for CPUs/GPUs and our motivation: On CPUs, indirect memory access is possible with scatter/gather; however, the throughput is limited given the limited ports to read/write a vector-length number of cache lines simultaneously. Also, not leveraging alias-freedom means a reliance on expensive load-store queues.

Although GPUs can use their banked scratchpads for faster indirect access, the following two reasons limit the indirect throughput. 1. No reordering of requests across subsequent vector warp accesses.11 Doing so in a GPU would require dependence checking of in-flight accesses, as they cannot guarantee alias freedom. 2. Atomic updates to the same scratchpad bank are not pipelined even though they access different memory locations. The reason is that the lock bits for atomic operations are shared among multiple addressable locations.1 The coarse-granularity locking is required to reduce the locking overhead.

To visualize the inefficiency of a typical GPU scratchpad, see Figure 2, which shows how scratchpad vector requests (corresponding to indirect read and atomic update, respectively) are served on a GPU. For simplicity, we assume a warp size of 8. As GPUs do not reorder requests across warps, the update request vector is issued after the completion of all read requests. For updates in the GPU, we assume one lock bit per scratchpad bank. Here, the three-cycle nonpipelineable operation further worsens the overhead of bank conflicts.

Insight: Our insight is that the dependence check is not required between subsequent requests (e.g., corresponding to different static loads) if alias freedom is known. Further, the atomic operations required in these algorithms are often low-latency integer arithmetic logic unit (ALU) operations. Therefore, for the dependence check, the maximum number of possible conflicting addresses is usually low (i.e., one less than the atomic update latency). Hence, we can compare against absolute addresses instead of relying on a serializing lock-bit mechanism.

Our AF-Indirect proposal: We find that the desired behavior can be accomplished by 1) exposing alias-freedom in the hardware–software interface to enable interleaving across vectors; and 2) storing the absolute addresses of pending atomic updates (maximum 2) to enable pipelining of nonconflicting addresses. Figure 2 shows how SPU is able to reorder requests, and is also able to pipeline atomic update requests with no initiation bubbles. A stall is introduced only in the presence of "real" dependencies; for example, see cycle 4 in the AF-Indirect reordering in Figure 2. This is limited to a maximum two-cycle bubble.
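That scheduling rule can be modeled behaviorally in C, as in the sketch below; the two-entry pending set mirrors the stated two-cycle bound, but the structure and names are our simplifications, not SPU's RTL.

```c
#include <stdbool.h>
#include <stdint.h>

#define PENDING 2   /* matches the stated maximum two-cycle bubble */

/* Per-bank state: absolute addresses of atomic updates still in flight. */
typedef struct {
    uint32_t addr[PENDING];
    bool     valid[PENDING];
} bank_t;

/* An update may issue this cycle iff its absolute address conflicts
 * with no pending update. There are no per-region lock bits, so updates
 * to *different* addresses in the same bank keep the pipeline full. */
static bool can_issue(const bank_t *b, uint32_t addr)
{
    for (int i = 0; i < PENDING; i++)
        if (b->valid[i] && b->addr[i] == addr)
            return false;   /* "real" dependence: short stall */
    return true;
}
```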
SPARSE PROCESSING UNIT
In this section, we first overview the primary aspects of the design, and then provide the details of the stream-join-enabled systolic CGRA and the banked memory exposed to knowledge of AF-Indirect.

Figure 3(a) shows the proposed SPU architecture. SPU cores are integrated into a mesh network-on-chip (NoC).


Each core is composed of the specialized memory and compute fabric, the decomposable granularity reconfigurable architecture (DGRA), together with a control core for coordination among streams.

Communication/synchronization: SPU provides two specialized mechanisms for communication. First, we include a multicast capability in the network. Data can be broadcast to a subset of cores, using the relative offset in the scratchpad. As a specialization for loading main memory, cores issue their load requests to a centralized memory stream engine, and data can be multicast from there to the relevant cores. For synchronizing on data readiness, SPU uses a dataflow-tracker-like mechanism to wait on a count of remote-scratchpad writes.

SPU Core
The basic operation of each core [see Figure 3(b)] is that the control core will first configure the DGRA for a particular dataflow computation, and then send stream commands to the scratchpad controller to read data from or write data to the DGRA, which itself has an input and output port interface to buffer data.

Stream-join compute fabric (DGRA): We augment a systolic CGRA to support stream-join control and dataflow computation with arbitrary data types. Figure 3(d) shows the microarchitecture of a DGRA processing element (PE) (green color represents control).

To implement control interpretation, we add a control lookup table (CLT) to each functional unit (FU), which determines a mapping between the control inputs and possible control operations. This mapping is configured along with the dataflow computation graph. During dataflow operation, the CLT consumes one of the dataflow inputs to produce control signals for the ALU (discard), associated registers (reset), and the FIFOs connected to the ALU inputs (reuse).

In the DGRA, we enable each coarse-grained resource to be decomposed into powers-of-two finer grain resources. For computation, the decomposable PE can split each coarse-grained input into multiple finer-grained inputs [16-b inputs in Figure 3(d)], which are used to feed two separate lower granularity ALUs. Correspondingly, the CLT and registers are also composable.

To route the data from PEs, the network of the DGRA is decomposable into multiple parallel finer-grain subnetworks (minimum 8 b). For flexible routing, we add the ability for incoming values to shift one subnetwork per switch hop.
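The decomposition idea has a well-known software analogue, shown below in C: a 64-bit word treated as four independent 16-bit lanes, with masking preventing carries from crossing lane boundaries. This is only an analogy for the decomposable PE; the lane width is chosen to match the 16-b example in Figure 3(d).

```c
#include <stdint.h>

/* Add two 64-bit operands as four independent 16-bit lanes, a software
 * analogue of one coarse-grained ALU decomposed into finer-grain ALUs.
 * The mask-and-xor form stops carries at each 16-bit lane boundary. */
static uint64_t add16x4(uint64_t a, uint64_t b)
{
    const uint64_t low15 = 0x7FFF7FFF7FFF7FFFULL; /* low 15 bits per lane */
    uint64_t sum  = (a & low15) + (b & low15);    /* no inter-lane carry  */
    uint64_t msbs = (a ^ b) & ~low15;             /* top bits, carry-free */
    return sum ^ msbs;
}
```

In the DGRA, the same splitting applies not only to the ALUs but also to the routing network and wide memory accesses, as described above.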
Alias-freedom-exposed banked memory: Because our workloads often require a mix of linear and indirect arrays simultaneously, for example, a streaming read of indices (direct) and associated values (indirect), we begin our design with two logical scratchpad memories, one highly banked and one linear. In this design, both exist within the same address space. Hence, memory streams may access locations in a remote core's scratchpad using a similar interface for linear and indirect streams.

The role of the scratchpad controller [see Figure 3(c)] is to generate requests for reads/writes to the linear scratchpad, and reads/writes/updates to the indirect scratchpad. A control unit assigns the scratchpad streams, and their state is maintained in either the linear or the indirect stream address generation logic. The controller then selects between any concurrent streams for address generation and sends the result to the associated scratchpad to maximize expected bandwidth. The linear address generator's operation is simple: create wide scratchpad requests using the linear access pattern.

The indirect address generator creates a vector of requests by combining each element of the stream of addresses (coming from the compute fabric, explained in the "Exploitable Data-Dependence Forms" section) with each element in the parent stream (i.e., b[i] in a[f(b[i])]). This vector of requests is sent to an arbitrated crossbar for distribution to banks, and a set of queues buffers requests for each static random access memory (SRAM) bank until they can be serviced.

Since there are no conflicts among indirect read/write requests, those requests are serviced from the top of the bank queue as soon as the scratchpad data bus becomes available. Atomic update requests can be serviced when both the scratchpad read and write buses are available and the updated address does not conflict with the pending updates issued from the same bank.
As the ordering of the data returned from read requests is critical for dataflow operations, we employ an indirect read reorder buffer (IROB) that maintains incomplete requests in a circular buffer (see Figure 2). IROB entries are deallocated in order when a request's data is sent to the compute unit.
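The IROB's in-order deallocation discipline can be sketched in C as a small circular buffer, as below; the entry count, field names, and the omitted occupancy checks are illustrative simplifications.

```c
#include <stdbool.h>
#include <stdint.h>

#define IROB_ENTRIES 16          /* illustrative size */

/* Indirect read reorder buffer: entries are allocated in request order,
 * filled whenever their bank responds (possibly out of order), and freed
 * strictly in order so data reaches the compute fabric in the sequence
 * the dataflow expects. Full/empty tracking is elided for brevity. */
typedef struct {
    uint32_t data[IROB_ENTRIES];
    bool     ready[IROB_ENTRIES];
    int      head, tail;         /* free from head, allocate at tail */
} irob_t;

static int irob_alloc(irob_t *q)                 /* returns entry id */
{
    int id = q->tail;
    q->ready[id] = false;
    q->tail = (q->tail + 1) % IROB_ENTRIES;
    return id;
}

static void irob_fill(irob_t *q, int id, uint32_t v)
{
    q->data[id]  = v;
    q->ready[id] = true;         /* bank responses may arrive out of order */
}

static bool irob_pop(irob_t *q, uint32_t *out)   /* in-order dealloc */
{
    if (!q->ready[q->head])
        return false;            /* oldest request still outstanding */
    *out = q->data[q->head];
    q->head = (q->head + 1) % IROB_ENTRIES;
    return true;
}
```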
Control ISA: We leverage an open-source stream-dataflow ISA7 for the control core's implementation of streams, and add support for indirect reads/writes/updates, the stream-join dataflow model, and typed dataflow graphs. The ISA contains stream instructions for data transfer, including reading/writing to main memory and scratchpad.

METHODOLOGY
SPU: We implemented SPU's DGRA in Chisel, with an industry 28-nm technology. We built an SPU simulator in gem5, using a RISCV ISA for the control core.

Architecture comparison points: Table 1 shows the characteristics of the architectures we compare against, including their on-chip memory sizes, FU composition, and memory bandwidth. We also address whether an inorder processor is sufficient by comparing against "SPU-inorder," where the DGRA is replaced by an array of eight inorder cores (total of 512 cores). For reference, we also compared against a dual-socket Intel Skylake CPU, with 24 cores.

Table 1. Characteristics of evaluated architectures.

Characteristics  GPU       SPU-inorder  SPU
Processor        GP104     In-order     SPU-core
Cache+Scratch    4064 kB   2560 kB      2560 kB
Cores            1792      512          64 SPU cores
FP32 Unit        3584      2048         2432
FP64 Unit        112       512          160
Max Bw           243 GB/s  256 GB/s     256 GB/s

Workload implementations: We implement SPU kernels (both dense/sparse) for each workload, and use a combination of libraries and hand-written code to compare against CPU/GPU versions.

EVALUATION
Our evaluation broadly addresses the question of whether restricted data-dependence forms exposed to an ISA (and exploited in hardware) can help achieve general-purpose acceleration.

Figure 4. Overall performance.

Comparison to general-purpose accelerators: Figure 4 shows how SPU fares against the CPU and GPU for workloads across ML, graph processing, and databases.

The workloads with a stream-join pattern—kernel support vector machines (KSVM), TPCH sort-heavy queries (SH), and gradient boosting decision trees (GBDT)—achieve up to 10x speedup over the CPU due to avoiding the throughput-limiting cyclic dependence loop and lower computational density. The GPU also suffers from hardware underutilization, as control leads to masking in vector lanes.

On workloads with AF-Indirect—fully connected layer (FC), convolution layer (CONV), arithmetic circuits (AC), Graph, and TPCH non-sort-heavy queries (N-SH)—both the GPU and SPU use a histogram-based approach. However, SPU's aggressive reordering of indirect updates in the compute-enabled scratchpad far outperforms the limited reordering in the GPU.

Finally, the ability to support both stream-join and AF-Indirect enables the efficient use of new compression techniques like run-length encoding. These techniques effectively reduce the required memory bandwidth, thus improving performance.

Even though SPU-inorder can relieve some of the vectorization overheads suffered by the GPU, it is insufficient due to lower peak throughput.

Domain accelerator comparison: Accelerators for FC,4 CONV,10 and Graphs3 all employ compute-enabled banked memory to achieve high indirect throughput.


SPU is able to remain within 57% of their performance. The difference is due to other specializations, e.g., the higher radix NoC in the Graph application-specific integrated circuit (ASIC) and the higher buffer access bandwidth in the CONV ASIC.

In non-sort-heavy database workloads, the dense version of SPU performs similarly to the ASIC. With efficient stream joins, SPU is able to catch up to and surpass the database ASIC, which spends significant area resources on specialized sorting units.

Benefit of decomposability: In general, we achieve datawidth-proportional speedup by adding decomposability. In comparison to subword-SIMD, SPU can see 2.6x speedup by being able to vectorize run-length decoding, which involves control-serializing computation. Similarly, branches in AC also benefit from decomposability (3x). Overall, SPU achieves a geomean speedup of 2.12x with decomposability.

Area and power: The two major sources of SPU's area are the scratchpad banks and the DGRA, together occupying more than two-thirds of the total; the DGRA is the major contributor to power (assuming all PEs are active). Compared to the whole design, adding stream-join control in the systolic CGRA increases area by 6.6% and power by 17.5%. Decomposability costs another 3.1% area and 7.7% power.

CONCLUSION
This work identifies two forms of data-dependence, which are highly specializable and are broadly applicable to a variety of algorithms. By defining a specialized execution model and codesigned hardware, SPU, we enabled the efficient acceleration of a large range of workloads. We observed up to order-of-magnitude speedups and significant power reductions compared to modern CPUs and GPUs while remaining flexible.

More important than the proposed design is how the approach of identifying and abstracting common dependence forms can influence the field.

Systematic understanding of irregular accelerators: Our restricted forms of data-dependence can be used to classify and understand the fundamental capabilities of domain-specific accelerators. Table 2 shows the scope of several existing domain-specific accelerators. What we see is that each generally specializes for only one form out of stream-join, AF-Indirect, or regular algorithms.

Table 2. Analysis of related works.

Exploitable dependence forms  Specialized architectures
No data-dependence            TPUv1—Dense ML; GPU—Dense; LSSD8—Dense; SPU
Stream-Join                   Q10012—Database; Sparse ML6—Sparse Algebra; ExTensor5—Tensor; SPU
AF-Indirect                   SCNN10—DNN CONV; EIE4—DNN FC; OuterSPACE9—Sparse Algebra; Graphicionado3—Graphs; SPU

Beyond simply understanding the space, this way of viewing an algorithm's interaction with architecture can improve the portability of techniques across domains. Consider the context of accelerators for sparse linear algebra. SPU's design can join sparse lists at one element per cycle (per PE). An idea proposed for a sparse ML accelerator is to vectorize the join,6 so that N elements can be joined at once from each list (requiring NxN comparisons). The ExTensor accelerator,5 designed for multidimensional sparse tensor ops, goes further. It demonstrates that a hierarchical list intersection (a form of stream join) can be more work efficient by skipping a variable number of unmatched items in a single step. To further reduce the memory bandwidth overhead of sparsity, SparTen2 proposed a bit-vector representation of indices. Thus, the matched indices can be found using efficient bit-level operations.
These optimizations can apply to SPU. More importantly, by finding structure and commonality in the dependence forms across domains, it becomes clear how to apply these optimizations to other superficially different problems, like database join or decision tree training.

Impact on general-purpose processors: This work focused on reconfigurable dataflow-like processors for implementing dependence-form specialization. While it was convenient, other architectures can equally benefit from such specialization.

- Indirection in GPUs: A conceivable extension to a GPU ISA could enable the annotation of a program region as being alias-free indirect (informed by the programmer or compiler). This would allow GPU scratchpads to eliminate memory dependence checking and enable aggressive reordering, leading to a reduced impact of bank conflicts and higher throughput. NVIDIA's tensor core is precedent that such specialization is feasible.
- Stream-join SIMD: Stream-join control could be supported in a CPU, for example, through extensions to SIMD operations. An approach could be to add specialized instructions, which allow treating registers as FIFOs, where the branch instructions may control the order of data consumption (using a simple finite-state machine at the FIFOs).
- Hybrid FPGAs: Recent FPGAs (Xilinx Alveo) include neural network accelerator units, demonstrating the need for specialization of even reconfigurable hardware. Increasing these units' flexibility to be similar to SPU could simultaneously provide many of the same efficiency benefits as an ASIC while also retaining the fundamental value proposition of FPGAs: broad workload efficiency while retaining fine-grain reprogrammability.

Other exploitable data-dependence forms: It is possible that there may be alternate formulations or definitions of restricted data-dependence forms, which could lead to new opportunities for specialization. For example, a coarser grain form of data dependence than we have explored is data-dependent parallelism (aka dynamic parallelism). At the other end of the spectrum could be data-dependent data types, where, at a fine grain, the data-type size is chosen to meet the precision requirements. One could imagine exposing these forms as first-class primitives in the hardware/software interface, and each could be plausibly useful in many domains.

Effect on algorithms: Finally, we think it is important not to forget that systems and application experts are constantly innovating new algorithms, and are now doing so with deep knowledge of the underlying hardware. Our results support the notion that a rigid architecture can limit certain algorithmic approaches from being viable. Therefore, we believe that incorporating support for structured irregularity into existing and new programmable architectures can lead to innovations in novel algorithms and data structures.

ACKNOWLEDGMENTS
We would like to thank G. Van den Broeck and A. Choi for their insights and help with arithmetic circuits workloads. We would also like to thank D. Ott and P. Subrahmanyam for their thoughtful conversations on the nature of irregularity and data dependence. This work was supported in part by the National Science Foundation under Grant CCF-1751400 and Grant CCF-1937599 and in part by gift funding from VMware.

REFERENCES
1. J. Gomez-Luna, J. M. Gonzalez-Linares, J. I. Benavides Benitez, and N. Guil Mata, "Performance modeling of atomic additions on GPU scratchpad memory," IEEE Trans. Parallel Distrib. Syst., vol. 24, no. 11, pp. 2273–2282, Nov. 2013.


2. A. Gondimalla, N. Chesnut, M. Thottethodi, and T. N. Vijaykumar, "SparTen: A sparse tensor accelerator for convolutional neural networks," in Proc. 52nd Annu. IEEE/ACM Int. Symp. Microarchit., 2019, pp. 151–165.
3. T. J. Ham, L. Wu, N. Sundaram, N. Satish, and M. Martonosi, "Graphicionado: A high-performance and energy-efficient accelerator for graph analytics," in Proc. 49th Annu. IEEE/ACM Int. Symp. Microarchit., Oct. 2016, pp. 1–13.
4. S. Han et al., "EIE: Efficient inference engine on compressed deep neural network," in Proc. 43rd Annu. Int. Symp. Comput. Archit., 2016, pp. 243–254.
5. K. Hegde et al., "ExTensor: An accelerator for sparse tensor algebra," in Proc. 52nd Annu. IEEE/ACM Int. Symp. Microarchit., 2019, pp. 319–333.
6. A. K. Mishra, E. Nurvitadhi, G. Venkatesh, J. Pearce, and D. Marr, "Fine-grained accelerators for sparse machine learning workloads," in Proc. 22nd Asia South Pacific Design Autom. Conf., 2017, pp. 635–640.
7. T. Nowatzki, V. Gangadhar, N. Ardalani, and K. Sankaralingam, "Stream-dataflow acceleration," in Proc. 44th Annu. Int. Symp. Comput. Archit., 2017, pp. 416–429.
8. T. Nowatzki, V. Gangadhar, K. Sankaralingam, and G. Wright, "Pushing the limits of accelerator efficiency while retaining programmability," in Proc. IEEE Int. Symp. High Perform. Comput. Archit., Mar. 2016, pp. 27–39.
9. S. Pal et al., "OuterSPACE: An outer product based sparse matrix multiplication accelerator," in Proc. IEEE Int. Symp. High Perform. Comput. Archit., Feb. 2018, pp. 724–736.
10. A. Parashar et al., "SCNN: An accelerator for compressed-sparse convolutional neural networks," in Proc. 44th Annu. Int. Symp. Comput. Archit., 2017, pp. 27–40.
11. NVIDIA Whitepaper, "CUDA C best practices guide," May 2019. [Online]. Available: https://docs.nvidia.com/cuda/pdf/CUDA_C_Best_Practices_Guide.pdf
12. L. Wu, A. Lottarini, T. K. Paine, M. A. Kim, and K. A. Ross, "Q100: The architecture and design of a database processing unit," in Proc. 19th Int. Conf. Archit. Support Program. Lang. Oper. Syst., 2014, pp. 255–268.

Vidushi Dadu is currently working toward the Ph.D. degree with the Department of Computer Science, University of California Los Angeles. Her current research focuses on hardware–software codesign to enable general-purpose acceleration. Dadu received the B.Tech. degree in electronics and communication engineering from the Indian Institute of Technology Roorkee. She is a student member of IEEE. Contact her at vidushi.dadu@cs.ucla.edu.

Jian Weng is currently working toward the Ph.D. degree with the Department of Computer Science, University of California Los Angeles. His research interests include analyzing and designing reconfigurable spatial architectures along with the associated compilation techniques. Weng received the B.Eng. degree in computer science from Shanghai Jiao Tong University. He is a member of the Association of Computing Machinery. Contact him at jian.weng@cs.ucla.edu.

Sihao Liu is currently working toward the Ph.D. degree with the Department of Computer Science, University of California Los Angeles. His research interests include spatial architecture prototyping and design space exploration. Liu received the B.Eng. degree in electrical engineering from Xi'an Jiaotong University. He is a student member of IEEE. Contact him at sihao@cs.ucla.edu.

Tony Nowatzki is currently an Assistant Professor with the Department of Computer Science, University of California Los Angeles. His research interests include architecture and compiler codesign and novel hardware/software interfaces. Nowatzki received the Ph.D. degree in computer science from the University of Wisconsin-Madison. He is a member of IEEE. Contact him at tjn@cs.ucla.edu.

Theme Article: Top Picks

Varifocal Storage:
Dynamic Multiresolution
Data Storage
Yu-Ching Hu, University of California, Riverside
Te I, Google
Murtuza Lokhandwala, North Carolina State University
Hung-Wei Tseng, University of California, Riverside

Abstract—Varifocal storage (VS) presents a new architecture that coordinates application demands, hardware accelerators, and intelligent data storage devices to efficiently support various input resolutions of system components, but still maintain the flexibility and quality without additional costs. Instead of faithfully shipping the raw data, the cross-layer design of VS allows an intelligent storage device to work directly with the running application to generate and deliver data sets in the desired resolution and quality before going through the narrower system interconnect. In this way, VS minimizes the bandwidth demand from the data source and allows hardware accelerators to work on received data without additional preprocessing. Without programmers' hints, VS achieves 1.46x speedup on a computer with approximate hardware accelerators.

FOLLOWING THE HINTS of Amdahl's law, computer architecture/system designers always try to "make the common case fast" and focus on optimizing the most time-consuming component. However, it is easy for us to forget that the common case changes all the time and that optimizing the most common case can also introduce new overhead.

Modern computer systems and applications rely intensively on heterogeneous hardware accelerators and algorithms inspired by approximate computing to significantly shrink the execution time and improve the energy efficiency of compute kernels, but leave other architectural components the same as in traditional, exact computing. As a result, we have now reached a point where the side effects of approximate computing on accelerators, namely moving data and adjusting data resolutions (e.g., data precision levels, summarized results, intermediate results, and sampled contexts), have overtaken compute kernels as a new bottleneck in many applications.


This research presents varifocal storage (VS), a new architecture that coordinates application demands, hardware accelerators, and intelligent data storage devices to efficiently support various input resolutions of system components while still maintaining flexibility and quality without additional costs. VS revisits the task allocation of approximate computing on general-purpose computers to place tasks such as raw-data retrieval, data-resolution adjustment, and quality control in the most appropriate place within a system, from a full-stack system design perspective. Instead of faithfully shipping the raw data, the storage device in VS can work directly with the running application to generate and deliver data sets in the desired resolution and quality before they go through the narrower system interconnect. In this way, VS minimizes the bandwidth demand from the data source and decreases the most latency-critical data-transfer overhead.

We evaluate VS by running a wide range of applications on our prototype SSD. The ideal, programmer-directed VS achieves 1.52x speedup on average over conventional approximate computing, while the automatic VS still achieves 1.46x speedup without programmers' hints.

DEMAND OF PRESENTING DATA SETS IN DIFFERENT RESOLUTIONS
Figure 1 illustrates the architecture of a modern heterogeneous computer. In addition to a CPU and a set of DRAM modules, where the computer hosts the operating system and provides a synchronization point for runtime data storage, the computer may incorporate other computing units, including general-purpose computing on graphics processing units, digital signal processors, or tensor processing units (TPUs), to accelerate the execution of the compute kernels of workloads. These accelerators usually accept data in different precisions (e.g., 32 bits in regular GPU cores, 16 bits in Tensor Cores, 8 bits in TPUs) from the host processor architecture (e.g., 64 bits). The computer also stores input/output data persistently using solid state drives (SSDs) or storage over the network through a network interface card (NIC).

Figure 1. Architecture of a modern heterogeneous computer.

These heterogeneous hardware components exchange data through the system interconnect (i.e., PCIe), where the root complex is nowadays located on the CPU. Due to the limited total links available from the root complex, the system usually allocates a relatively larger number of links to high-throughput accelerators (e.g., x16 PCIe lanes for GPUs). The remaining components (e.g., SSDs, NICs) can only use relatively smaller numbers of PCIe links or even have to share the links with other peripherals through a PCIe switch.

As the computer may 1) use a data set for different purposes (e.g., an application can request an image in resolutions as high as 7680 x 4320 pixels to display or edit, but inferencing in a machine learning application only requires 1/64 of that resolution) or 2) compute data on accelerators with different precisions, the computer usually stores persistent data with high-resolution content and needs to generate the input data in the desired resolution dynamically. Figure 2 explains the resulting data-processing pipeline of running approximate-computing applications or using these hardware accelerators in modern heterogeneous computers.

The computer first needs to issue I/O commands for the storage device to access raw data from its internal data arrays and then transfer the raw data through the underlying system interconnect while simultaneously serving other data-access requests.

Once the host computer receives a chunk of data, the CPU can start producing data sets in lower resolutions. The compute kernel can then perform computations using the resolution-adjusted data sets. If the kernel can leverage a hardware accelerator, the system must additionally exchange data among different components through the interconnects before the accelerator can compute on the prepared data.

With these highly optimized approximate-computing-based acceleration techniques but relatively limited bandwidth for data exchange, the latency of retrieving and preparing data for approximate-compute kernels becomes the most critical stage in the data-processing pipeline. Figure 3 compares the latency of receiving raw data chunks from a high-end NVM-Express (NVMe) storage device against the execution time of performing approximate/mixed-precision compute kernels on the same data chunks using an NVIDIA Tesla T4 GPU for a set of applications. Using a highly optimized I/O library, the overhead of receiving/preparing data sets exceeds the kernel execution as the most critical stage in a majority of these applications. We expect the gap in Figure 3 to grow with the relatively fast evolution of hardware accelerators, but slowly improving I/O and storage technologies.

Figure 2. Data-processing pipeline of approximate applications using the conventional execution model.

Figure 3. Data-preparation overhead compared against the execution time of performing compute kernels on the same amount of data.

VS SYSTEM ARCHITECTURE
Figure 4 shows VS in a heterogeneous computer system. VS revisits the storage-system stack to allow the device to dynamically produce data with different resolutions on demand. The VS core layer resides inside the storage device to change the data resolutions presented to applications. The VS layer interacts with existing system I/O interfaces and provides an extended interface for resolution adjustments. The VS layer also works together with the SSD management layer (i.e., the flash translation layer in flash-based SSDs) to locate the requested data.

The host system needs an extended kernel driver and API functions for the applications to send requests, exchange data, and receive feedback from the VS core layer. The host application interacts with the API and sends commands specifying operators that VS should apply to the raw data.

Figure 4. System architecture of VS.

The VS core layer supports a set of operators that are especially effective for applications that contain high data-level parallelism, but are able to tolerate inaccuracies in data sets.


The VS core layer is also where the system performs mechanisms that automatically determine the most appropriate data resolution for quality control. The host application can optionally enable VS's quality control mechanisms through VS's API and the kernel driver.

Figure 5 demonstrates the data-processing pipeline in VS. By using operators and quality-control mechanisms inside the storage device, VS allows the storage device to send adjusted data sets to the system main memory, instead of always sending raw data as conventional storage devices do. Then, the approximate computing components can directly work on lower resolution data, without additional conversion on the host CPU.

Figure 5. Data-processing pipeline of VS.

The proposed VS architecture allows the system to successfully tackle the challenges of performance, quality, flexibility, and cost in accommodating various types of workloads on modern heterogeneous computers.

Performance. VS improves performance through 1) exploiting the richer internal bandwidth and 2) reducing the total volume of data movements in the system interconnect.

As Figure 1 depicts, the controllers found in modern datacenter SSDs, including the controllers in the prototype SSD that we used for this work, support multiple concurrent channels. The internal bandwidth of the prototype SSD in this article can reach up to 8 GB/s by enabling 16 channels. However, conventional architectures and storage interfaces waste the rich internal bandwidth of storage devices, since the application only works on the host computer and exchanges data with the SSD using limited PCIe bandwidth. In contrast, because VS adjusts data using the SSD controller, the mechanisms in the VS core layer directly interact with raw data without the dynamics of the system interconnect bandwidth and the competition with other hardware components on the CPU-memory bus. In addition, VS frees up CPU resources to tackle more useful workloads, leading to performance gains for approximate applications on the host side.

Since approximate computing works on lower resolution data sets, the compute kernels usually consume fewer bytes than exact computing ones. VS sends only adjusted data to compute units, so that VS reduces the size of data going through the system interconnect. In this way, VS improves the total latency of transferring data, mitigates the idle time in compute units, and decreases the bandwidth demand in the rest of the system interconnect links.

Although lossy and lossless data-compression algorithms help us to reduce data size and save I/O bandwidth, the overhead of decompressing data on the destination computing device can easily cancel the benefit of reducing data-transfer time.1

Quality. VS provides an additional layer of quality control for applications by capturing low-quality input that fails the requirement before the data leave the storage device. Without VS, existing quality-control mechanisms must request full-size, raw data from the storage device, and compare subsets of results for exact and approximate computation.2–6

Flexibility. In the VS architecture, the storage device dynamically generates data with different resolutions to accommodate the demands of diverse applications. Without an architecture like VS, the storage system must store multiple versions of a shared data set or provide raw data to the host for preprocessing, hurting either space efficiency or performance.

If the storage device stores data using lossy algorithms to save bandwidth, the system sacrifices support for exact computing. Similarly, existing research on "approximate storage systems" proposes to store data using unreliable memory cells, but not to faithfully store raw data. Therefore, systems that use them can neither support exact computation nor dynamically generate data in different resolutions.7,8

Cost. The VS core layer can leverage existing SSD controllers and minimize extra hardware costs for the following reasons. 1) Empirical studies,9,10 as well as our measurements in the unmodified prototype SSD, reveal that the SSD controller cores are mostly idle due to the relatively long latency of accessing NVM devices and the overprovisioning of processing power. 2) The critical path of the data-access pipeline is determined by either the access time of flash chips or the latency of the DMA stage, leaving slack that can be taken up by VS to apply operators without the need for additional accelerators. 3) SSD controllers enjoy the benefits of exclusive resources within the storage device and can perform data adjustment more efficiently than the host CPU.

The rest of this section will briefly describe the current programming model, operators, quality control mechanisms, and architectural support of VS.

Programming Model
To prepare an application to take advantage of the VS model, the programmer uses the VS library to specify data resolutions and retrieve adjusted data for the application. These library functions help the application to set up 1) the operators required to read data and whether any quality control mechanism is enabled, and 2) the parameters that allow the underlying storage device to adjust data, as well as the control variables that quality control mechanisms use to assure the quality of the adjusted data.

Figure 6 shows the KMeans code with VS function calls inserted. The modified KMeans code initiates VS by calling vs_setup to set the desired operator, resolution, and data format. VS starts adjusting data only if the application calls the vs_read function. This function resembles the existing Linux read function except that 1) the resulting data size may differ from the requested data size, since operators will trim data sizes in most cases, and 2) the function will provide feedback regarding the resolution that VS selects. If the program calls a regular read function to read data, VS will act as conventional data storage and not change the data resolution.

If VS successfully adjusts the data, the application can use a compute kernel that supports lower resolution input (e.g., cluster_approximate) to further reduce the total execution time of the program. Depending on the approximate compute kernels that the application uses, the programmer can choose different VS operators for data adjustments when calling the vs_setup function. In addition to the programmer's choice of resolutions, the programmer can optionally enable VS's quality control mechanisms, Autofocus and iFilter. Autofocus can automatically decide the resolution using a set of control variables for a chosen operator. The decisions that Autofocus makes are usually more conservative than those of a programmer, but Autofocus can nonetheless help applications adapt to data sets. If a given application can apply multiple versions of approximate kernels for different VS operators, the programmer can use the iFilter mechanism to let VS choose both the most appropriate operators and resolutions for each data set.
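To make this flow concrete, the sketch below mirrors the structure of the Figure 6 example in Python. vs_setup, vs_read, and cluster_approximate are the names used above; the wrapper class, the argument encodings, and the fallback behavior are illustrative assumptions rather than the actual VS library interface.

import os

class VSFile:
    """Hypothetical host-side wrapper for a VS-compliant device file.

    A real implementation would issue the extended NVMe commands that
    VS adds; this stub only records the request and falls back to a
    plain read, reporting that no adjustment took place.
    """

    def __init__(self, path):
        self.fd = os.open(path, os.O_RDONLY)
        self.requested = None

    def vs_setup(self, operator, resolution, data_format):
        # Record the desired operator, resolution, and data format, as
        # the modified KMeans code does before reading.
        self.requested = (operator, resolution, data_format)

    def vs_read(self, length):
        # Resembles read(2), except that the result may be smaller than
        # `length` (operators trim data) and the resolution that VS
        # actually selected is returned as feedback. This stub performs
        # no adjustment, so it reports None as the selected resolution.
        data = os.read(self.fd, length)
        return data, None

# Usage, mirroring Figure 6 (cluster_approximate/cluster_exact stand for
# the application's own kernels and are not defined here):
#
#   f = VSFile("/dev/nvme0n1")
#   f.vs_setup(operator="pack", resolution="half", data_format="float64")
#   buf, res = f.vs_read(64 << 20)
#   centroids = cluster_approximate(buf) if res else cluster_exact(buf)

The important part is the feedback path: the caller learns which resolution it actually received and can dispatch to an exact or an approximate kernel accordingly.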

Figure 6. KMeans code sample with inserted VS function calls.

VS Operators
VS provides a set of operators to adjust data resolutions and exposes these operators through the NVMe interface as well as the system API. VS operators are selected under the following criteria: 1) The computation overhead must match
the processing power inside the storage device. Therefore, VS can minimize the impact on access latency and power consumption and avoid extra hardware costs. 2) A wide range of applications must be able to apply the operator, thereby allowing for more efficient use of valuable device resources (VS identifies the most useful operators from previous efforts11,12). 3) The operator must allow VS to take advantage of mismatches between external and internal bandwidths and downsize the outgoing data. These operators can flexibly support various resolutions and accommodate exact computing.

The current VS framework supports the following categories of operators for diverse data types; minimal code sketches of each category appear after the list.

Data Packing: The data-packing operator trims the data set size by using fewer bytes to express each item and by condensing the layout in memory. Since the data-packing operator translates raw data into a less-precise data type (e.g., double→float→half or int64→int32→short→char), it can potentially decrease accuracy.

Quantization: The quantization operator rescales the raw values into a smaller value space while preserving the relative order of values. The quantization operator is applicable when the application requires a large value space.

Reduction/Tiling: The reduction operator applies a function (e.g., average) to a group of input values and yields a single output value. After applying a reduction operator, VS sends only the resulting value of each group to reduce the amount of data passing through the system interconnect.

Sampling: The sampling operator chooses a subset of items from the raw data and sends the selected items to the host computer. Operators in this category can perform uniform/random data selection or report only the most representative data. The sampling operator can potentially achieve the same effect as that of loop perforation but without any code modification.
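The NumPy sketch below gives one minimal implementation of each operator category. The semantics follow the descriptions above, while the concrete parameters (16-bit packing, 256-level quantization, groups of four, 25% sampling) are illustrative assumptions; a real VS device would run the equivalent logic on the SSD controller.

import numpy as np

def pack(data):
    # Data packing: re-express each item with fewer bytes
    # (here, float64 -> float16).
    return data.astype(np.float16)

def quantize(data, levels=256):
    # Quantization: rescale values into a smaller value space while
    # preserving their relative order.
    lo, hi = data.min(), data.max()
    return np.round((data - lo) / (hi - lo) * (levels - 1)).astype(np.uint8)

def reduce_tile(data, group=4):
    # Reduction/Tiling: apply a function (here, average) to each group
    # of inputs and emit one value per group.
    return data[: len(data) // group * group].reshape(-1, group).mean(axis=1)

def sample(data, fraction=0.25, seed=0):
    # Sampling: forward only a uniformly chosen subset of the raw items.
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(data), size=int(len(data) * fraction), replace=False)
    return data[np.sort(idx)]

raw = np.random.default_rng(1).normal(size=1 << 20)   # stand-in raw data
for op in (pack, quantize, reduce_tile, sample):
    print(f"{op.__name__:12s} -> {raw.nbytes // op(raw).nbytes}x smaller")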

First-Level Quality Control in the Storage Devices
If the input quality is too far from the original, approximate computing can hardly generate meaningful results.3 Therefore, we designed two quality control mechanisms, Autofocus and iFilter, both of which control the quality of adjusted data, avoiding the cases of sending data that fail the quality control knobs to the host before the computation occurs. In contrast, all previous research projects censor computation results and always require at least part of the raw data to be present in the host main memory as well as computed using exact computing.

Autofocus allows the programmer to simply specify the desired VS operator, letting VS decide the most appropriate resolution such that every checked piece of the adjusted data successfully passes the predefined, operator-dependent threshold values. If low-resolution data fail the quality control knobs, VS will reject the low-resolution data and gradually use higher resolutions until the quality fits the demand, and it will apply this resolution to the same data set later. Utilizing another important observation from previous research, that a small subset of input data is representative of the rest of the input data in approximate-computing applications that tolerate inaccuracies,3 Autofocus selects the resolution using only a small portion of the raw input data from a requested data set and then monitors the quality of the adjusted input data.

iFilter can work without programmer input and is more effective than Autofocus for applications having compute kernels that are compatible with multiple VS operators. The iFilter algorithm is similar to the Autofocus algorithm in that it selects the most appropriate resolution for each compatible operator, except that iFilter will keep track of the resolution and the resulting data size for each operator. After selecting an operator that passes all quality control variables and generates the smallest data size among all passing operators, iFilter will enter the monitoring phase as in Autofocus.

Building a VS-Compliant Storage Device
Building a VS-compliant storage device means tackling challenges associated with 1) providing a hardware/software interface that allows applications to describe the resolutions and quality of the target data, and 2) minimizing the computational overhead/cost of adjusting data resolutions. VS overcomes the former challenge by extending the NVMe interface; this requires the fewest modifications to the system stack and applications. VS addresses the latter challenge by exploiting the idle cycles available in modern SSD controllers.

RESULTS
In this article, we built a VS-compliant SSD by extending a commercialized, datacenter-class SSD. We attached the VS-compliant SSD to a high-end heterogeneous machine with a GPU. The host operating system contains the extended NVMe driver to support additional VS NVMe commands. We measured the performance of the resulting system with several workloads that span a wide range of applications.

Figure 7 shows the relative end-to-end latency of running complete workloads, with the conventional approximate computing approach using GPU-accelerated kernels as the baseline. Since VS efficiently prepares input data sets in storage devices for approximate computing kernels running on the GPU, the manual, programmer-directed VS leads to a speedup of 1.52× for these applications. VS also achieved an average energy savings of 32% for applications compared to the conventional approximate computing.

Figure 7. Speedup of the end-to-end latency.

Using Autofocus to dynamically select the desired data resolutions, these applications achieve an average speedup of 1.43×. Without any programmer intervention, iFilter can improve performance by 1.46×. In contrast, exact computing is 7% slower than the conventional approximate computing.

Several groups of our workloads shared the same data sets but applied different operators and resolutions to accommodate each individual demand. Without an architecture like VS, the storage system must store multiple versions of a shared data set or provide raw data to the host for preprocessing, hurting either space efficiency or performance.

We also compare our mechanisms with other alternatives. Autofocus outperforms a state-of-the-art quality control mechanism by up to 2.86× because VS does not require the storage device to deliver raw data to the host. This article also measured that VS outperforms the best compression algorithm for each data set by 4.40× since VS incurs zero overhead in decoding data on the host side.

CONCLUSION
As an optimization usually comes with its own overhead, we really need to holistically revisit the interactions among different system/architectural components to negate the introduced side effects to scale up performance with new technologies. This article, in particular, demonstrates the case of modern heterogeneous computers running both exact and approximate computing applications. By making the demand of applications visible to the storage system and also making the computing capability available for preparing data in different resolutions, this article fundamentally alleviates the side effects of the conventional, single-point-optimization design—the data adjustment and movement overhead.

The resulting architecture not only hides the latency of data adjustment within the NVM data access pipeline of the storage device, but more importantly, it exploits the fact that approximate computing only needs lower resolution inputs to further reduce the data volume flowing through the system interconnect. Without a full-stack, holistic design like this work, we can never take full advantage of approximate computing.


This article leverages the success of near-data/in-storage processing to implement the proposed idea. However, in addition to the "offloading computation" that prior work has focused mainly on, this article reveals the potential of "offering new features" (e.g., the quality control mechanisms in VS) to streamline the rest of the computation. As the budget for building storage devices is usually very limited, near-data/in-storage processors cannot compete with host CPUs and hardware accelerators. We hope this work can inspire researchers to invent more "add-on" features that fit the capabilities of these devices to improve application performance.

We also expect the outcome of this article to inspire researchers to further discover issues introduced by local optimizations and to consider the presence of heterogeneous computing resources, or intelligent data storage and I/O devices, to achieve global optimizations as we demonstrated in this article. More research on hardware/software interfaces that do not hide the power of the hardware but maintain good tradeoffs on programmability, simplicity, flexibility, and efficiency is necessary for emerging computer architectures.

ACKNOWLEDGMENTS
We would like to thank the AI infrastructure group and academic relations group from Facebook for providing insights on datacenter workloads and for research award funding. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Quadro P5000 GPU used in the early phase of this research. We also appreciate Professor Steven Swanson from the University of California, San Diego, for supporting the development of our prototype SSD. This work was sponsored by the two National Science Foundation Awards 1940046 and 1940048. This work was also supported by new faculty startup funds from North Carolina State University and the University of California, Riverside.

REFERENCES
1. Y. Li et al., "A network-centric hardware/algorithm co-design to accelerate distributed training of deep neural networks," in Proc. 51st Annu. IEEE/ACM Int. Symp. Microarchit., 2018, pp. 175–188.
2. D. S. Khudia, B. Zamirai, M. Samadi, and S. Mahlke, "Rumba: An online quality management system for approximate computing," in Proc. ACM/IEEE 42nd Annu. Int. Symp. Comput. Archit., Jun. 2015, pp. 554–566.
3. M. A. Laurenzano, P. Hill, M. Samadi, S. Mahlke, J. Mars, and L. Tang, "Input responsiveness: Using canary inputs to dynamically steer approximation," in Proc. 37th ACM SIGPLAN Conf. Program. Lang. Design Implementation, 2016, pp. 161–176.
4. H. Hoffmann, S. Sidiroglou, M. Carbin, S. Misailovic, A. Agarwal, and M. Rinard, "Dynamic knobs for responsive power-aware computing," in Proc. 16th Int. Conf. Archit. Support Program. Lang. Operating Syst., 2011, pp. 199–212.
5. A. Sampson, W. Dietl, E. Fortuna, D. Gnanapragasam, L. Ceze, and D. Grossman, "EnerJ: Approximate data types for safe and general low-power computation," in Proc. 32nd ACM SIGPLAN Conf. Program. Lang. Design Implementation, 2011, pp. 164–174.
6. X. Sui, A. Lenharth, D. S. Fussell, and K. Pingali, "Proactive control of approximate programs," in Proc. 21st Int. Conf. Archit. Support Program. Lang. Oper. Syst., 2016, pp. 607–621.
7. A. Sampson, J. Nelson, K. Strauss, and L. Ceze, "Approximate storage in solid-state memories," in Proc. 46th Annu. IEEE/ACM Int. Symp. Microarchit., 2013, pp. 25–36.
8. S. Ganapathy, A. Teman, R. Giterman, A. Burg, and G. Karakonstantis, "Approximate computing with unreliable dynamic memories," in Proc. IEEE 13th Int. New Circuits Syst. Conf., Jun. 2015, pp. 1–4.
9. J. Zhang and M. Jung, "FlashAbacus: A self-governing flash-based accelerator for low-power systems," in Proc. 13th EuroSys Conf., 2018, pp. 15:1–15:15.
10. G. Koo et al., "Summarizer: Trading communication with computing near storage," in Proc. 50th Annu. IEEE/ACM Int. Symp. Microarchit., 2017, pp. 219–231.
11. M. Samadi, J. Lee, D. A. Jamshidi, A. Hormati, and S. Mahlke, "SAGE: Self-tuning approximation for graphics engines," in Proc. 46th Annu. IEEE/ACM Int. Symp. Microarchit., 2013, pp. 13–24.
12. M. Samadi, D. A. Jamshidi, J. Lee, and S. Mahlke, "Paraprox: Pattern-based approximation for data-parallel applications," in Proc. 19th Int. Conf. Archit. Support Program. Lang. Oper. Syst., 2014, pp. 35–50.

Yu-Ching Hu is currently working toward the Ph.D. degree with the Department of Computer Science and Engineering, University of California, Riverside. His research interests focus on improving the performance of database and machine learning applications through optimizing their interactions with heterogeneous computing units and storage systems. He is a member of IEEE. Contact him at yhu130@ucr.edu.

Murtuza Lokhandwala is currently working as a Design Verification Engineer. His interests include system design and architecture for processors, digital design, and verification. Lokhandwala received the master's degree in computer engineering from North Carolina State University. Contact him at mlokhan@ncsu.edu.

Te I is currently a Software Engineer at Google, working in the Google Translate Team. Te I received the M.S. degree in computer science from North Carolina State University. Contact him at tei@google.com.

Hung-Wei Tseng is an Assistant Professor with the Department of Electrical and Computer Engineering, University of California, Riverside. His research interests include heterogeneous computer architectures and nonvolatile-memory-based storage systems, as well as their programming languages, runtime systems, compilers, and applications. Tseng received the Ph.D. degree in computer science from the University of California, San Diego. Contact him at htseng@ucr.edu.

Theme Article: Top Picks

AsmDB: Understanding and Mitigating Front-End Stalls in Warehouse-Scale Computers
Nayana Prasad Nagendra, Princeton University
Grant Ayers, Google
David I. August, Princeton University
Hyoun Kyu Cho and Svilen Kanev, Google
Christos Kozyrakis, Stanford University
Trivikram Krishnamurthy, Nvidia
Heiner Litz, University of California, Santa Cruz
Tipp Moseley and Parthasarathy Ranganathan, Google

Abstract—It is well known that the datacenters hosting today's cloud services waste a significant number of cycles on front-end stalls. However, prior work has provided little insight about the source of these front-end stalls and how to address them. This work analyzes the cause of instruction cache misses at a fleet-wide scale and proposes a new compiler-driven software code prefetching strategy that reduces instruction cache misses by 90%.

DUE TO THE continued growth of cloud-based digital services, warehouse-scale computers (WSC) are now serving billions of devices across the world. This massive growth necessitates improving the cost and efficiency of WSCs through microarchitectural and system software based optimizations.

WSC workloads are characterized by deep software stacks in which individual requests can traverse many layers of data retrieval, data
processing, communication, logging, and monitoring. As a result, the instruction working set sizes of WSC workloads today are often 100× larger than server-class L1 instruction caches (i-cache)1 and are currently expanding at rates of over 20% per year.2 As cache sizes have not improved significantly over the last many years, WSC workloads are becoming increasingly front-end bound. Thus, processors are no longer able to sustain a high instruction fetch rate, which manifests itself in large unrealized performance gains due to front-end stalls, which are dominated by increased i-cache misses. While prior work has identified the growing importance of this problem, to date, there has been little analysis of the sources of these misses and of available opportunities to address them.

We corroborate this challenge for our WSCs on Google web search leaf servers, in which 13.8% of the total performance potential is wasted due to "front-end latency," principally caused by i-cache misses. We also measured L1 i-cache miss rates of 11 misses per kilo-instruction, and a hot steady-state instruction working set of approximately 4 MiB. This is significantly larger than the sizes of the L1 and L2 caches on today's server CPUs, but small and hot enough to easily fit and remain in the shared L3 cache (typically 10s of MiB).1

To understand and improve the i-cache behavior of WSC applications, we focus on tools and techniques for "broad" acceleration* of thousands of WSC workloads. At the scale of a typical WSC server fleet, performance improvements of a few percentage points (and even sub-1% improvements) lead to millions of dollars in cost and energy savings, as long as they are widely applicable across workloads. To that end, our work provides three primary contributions: i) a methodology for analyzing instruction profiles at a fleet-wide scale; ii) detailed insights about code fragmentation and the perils of micro-optimization; and iii) a novel software-based code prefetch algorithm for reducing i-cache misses at fleet-wide scales.

* "Deep" acceleration would involve focusing on a handful of workloads and trying to recover most of the ~15% performance opportunity.

AsmDB: A WSC ASSEMBLY DATABASE
To enable the necessary horizontal analysis and optimization across the server fleet, we built a continuously updated assembly database (AsmDB) to collect instruction- and basic-block-level information for most observed CPU cycles across the thousands of real production services executing across the Google fleet. AsmDB aggregates instruction and control-flow data collected from hundreds of thousands of machines each day and grows by multiple TiB each week. We have been continuously populating AsmDB over several years with the goal of providing easy-to-query assembly-level information for nearly every unique instruction executed in our WSCs. We demonstrate several cases where AsmDB proves invaluable for front-end optimization, including spotting opportunities for manual optimizations, finding areas for improvement in existing compiler passes, and serving as a data source for a novel compiler-driven technique to improve i-cache hit rates.

AsmDB is an always-on, massive-scale fleet-wide performance monitoring system. It uses hardware support to collect bursty execution traces, performs fleet-wide temporal and spatial sampling, and leverages sophisticated offline postprocessing to construct full-program dynamic control-flow graphs. Collecting and processing profiling data from hundreds of thousands of machines is a daunting task by itself. However, we have carefully designed the system architecture such that it can capture and process profiling data in a cost-efficient way while still processing terabytes of data each week.
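As a mental model for the aggregation step, the toy sketch below folds sampled execution records into per-instruction and per-function cycle counts. The record layout and field names are our assumptions; the production system additionally reconstructs full dynamic control-flow graphs from the sampled traces.

from collections import Counter, defaultdict

def aggregate(samples):
    """Fold (binary, function, pc, weight) trace samples into
    per-instruction and per-function cycle counts."""
    per_insn, per_func = Counter(), defaultdict(int)
    for binary, function, pc, weight in samples:
        per_insn[(binary, function, pc)] += weight
        per_func[(binary, function)] += weight
    return per_insn, per_func

# Hypothetical records from two services' bursty execution traces.
samples = [
    ("search_leaf", "memcmp", 0x401000, 5),
    ("search_leaf", "memcmp", 0x401008, 3),
    ("ads_match",   "memcpy", 0x502000, 4),
]
per_insn, per_func = aggregate(samples)
print(max(per_func.items(), key=lambda kv: kv[1]))   # hottest function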


A fleet-wide assembly database, such as AsmDB, provides a scalable solution to search for performance antipatterns and opens up new opportunities for performance and total-cost-of-ownership optimizations. WSC servers typically execute thousands of unique applications, so the kernels that matter most across the fleet (the "datacenter tax"2) may not be significant for a single workload and are easy to overlook in application-by-application investigations. We leverage AsmDB's fleet-wide data in several case studies to understand and improve the i-cache utilization and IPC of WSC applications. We further correlate AsmDB with hardware performance counter profiles collected by a datacenter-wide profiling system—Google-wide profiling (GWP)4—to reason about specific patterns that affect front-end performance.

WSC APPLICATION ANALYSIS WITH AsmDB
WSC applications are well-known for their long instruction tails and flat execution profiles.2 Figure 1 shows that i-cache misses in WSCs have a similar long tail. It plots the cumulative distribution of dynamic instructions, and of L1-I and L2-I misses, over unique i-cache lines over a week of execution, fleet wide. The zoomed-in view of the graph shows that the miss cumulative distribution function (CDF) initially has a more significant slope than the instruction CDF, suggesting that there exist some pointwise manual optimizations with high potential performance gains. However, the distribution of misses quickly tapers off. In particular, addressing just two-thirds of dynamic misses requires optimizations in ~1M code locations, which is only conceivable leveraging automation. This points us toward exploring scalable, automated solutions—with compiler and/or hardware support and no developer intervention—to exploit these behaviors; a small computational sketch of this tail analysis appears at the end of this section.

Figure 1. Fleet-wide distribution of executed instructions, and L1- and L2-instruction misses over unique cache lines. Like instructions, misses also follow a long tail.

EFFECTS OF CODE FRAGMENTATION ON CACHES
Code bloat and unnecessary instruction complexity, especially in frequently executed code, can lead to excessive i-cache pressure. We analyze code bloat in Figure 2, leveraging AsmDB data. It plots the normalized function hotness (how often a particular function is called over a fixed period) versus the function's size in bytes for the 100 hottest functions in our WSCs. Perhaps unsurprisingly, it shows a loose negative correlation: Smaller functions are called more frequently. It also corroborates prior findings that low-level library functions (the "datacenter tax"2), and specifically memcpy and memcmp, are among the hottest in our examined workloads.

However, despite smaller functions being significantly more frequent, they are not the major source of i-cache misses. Overlaying miss profiles from GWP onto Figure 2 (shading), we notice that most observed cache misses lie in functions larger than 1 KiB in code size, with over half in functions larger than 5 KiB. Most functions of 5 KiB or larger exhibit inlined call stacks of ten or more layers in depth.

Figure 2. Normalized execution frequency versus function size for the top 100 hottest fleet-wide functions. memcmp is a clear outlier.
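The tail analysis referenced above reduces to a cumulative-distribution computation over per-cache-line event counts. The sketch below runs it on synthetic data; the heavy-tailed Pareto profile is only a stand-in for real fleet measurements.

import numpy as np

def lines_to_cover(counts, fraction):
    """Number of unique cache lines needed to cover `fraction` of all
    events (executions or misses), hottest lines first."""
    ordered = np.sort(np.asarray(counts))[::-1]
    cdf = np.cumsum(ordered) / ordered.sum()
    return int(np.searchsorted(cdf, fraction) + 1)

rng = np.random.default_rng(0)
miss_counts = rng.pareto(1.2, size=1_000_000)   # synthetic long tail

for f in (0.50, 0.67, 0.90):
    print(f"{f:.0%} of misses -> {lines_to_cover(miss_counts, f):,} lines")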

Figure 3. Fraction of hot code within a function among the 100 hottest fleet-wide functions. From left to right, "hot code" is defined as covering 90%, 99%, and 99.9% of execution.

While deep inlining is crucial for performance in workloads with flat callgraphs, it exponentially increases the amount of code loaded into the i-cache at each inline level, of which often only a small fraction is hot. Cold code brought into the cache, in addition to the necessary hot instructions, leads to hot/cold fragmentation and thus suboptimal utilization of the limited cache resources.

We more formally define fragmentation to be the fraction of code (in bytes) that is necessary to cover the last 10%, 1%, or 0.1% of executions of a function. Because functions are sequentially laid out in memory, these cold bytes are very likely to be brought into the cache by next-line prefetchers. Intuitively, this definition measures the fraction of i-cache capacity potentially wasted by loading cold cache lines; the code sketch at the end of this section shows the computation.

We find that intrafunction fragmentation is especially prevalent. Even after compiling with feedback-directed optimization, 50% of the code in all functions is cold, frequently interleaved with hot code sections, and thus practically never executed despite being likely to be in the cache. This is true even among the hottest and most well-optimized functions in our server fleet.

Using AsmDB data, we calculate this measure of fragmentation for the top 100 functions by execution count in our server fleet. Figure 3 plots it against the containing function size. If we consider code covering the last 1% of execution as "cold," 66 functions out of the 100 are comprised of more than 50% cold code. Even with a stricter definition of cold (<0.1%), 46 functions have more than 50% cold code. Perhaps not surprisingly, there is a loose correlation with function size—larger (more complex) functions tend to have a larger fraction of cold code.

We attribute the intrafunction fragmentation to the deep inlining that the compiler needs to perform when optimizing typical WSC flat execution profiles. Hence, this suggests that combining inlining with more aggressive hot/cold code splitting can achieve better i-cache utilization, freeing up the scarce capacity.

On a finer granularity, we find that individual cache lines are also often fragmented and waste cache capacity, especially for small functions. Unlike cold cache lines within a function, cold bytes in a cache line are always brought in along with the hot ones, introducing an even more significant performance issue. This suggests that there exist opportunities to improve the basic-block layout, at link or postlink time, when compiler profile information is precise enough to reason about specific cache lines.

We provide a concrete example of optimizing code bloat and fragmentation by focusing on memcmp, one of the hottest functions contributing to cache misses. memcmp clearly stands out of the correlation between call frequency and function size in Figure 2. It is both extremely frequent and, at almost 6 KiB of code, 10× larger than memcpy, which is conceptually of similar complexity. Examining its layout and execution patterns (see Figure 4) suggests that it does suffer from a high amount of fragmentation, as we observed fleet wide in the previous section. While covering 90% of executed instructions in memcmp requires only two cache lines, getting up to 99% coverage requires 41 lines, or 2.6 KiB of cache capacity. Not only is more than 50% of the code cold, it is also interspersed with hot regions, increasing the likelihood of being brought in by next-line prefetchers.
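The fragmentation metric defined earlier in this section is simple to compute from per-byte execution counts, as sketched below; the Zipf-distributed counts are a synthetic stand-in for AsmDB profiles.

import numpy as np

def fragmentation(exec_counts, tail=0.01):
    """Fraction of a function's bytes whose combined executions fall in
    the coldest `tail` fraction of its total execution count."""
    counts = np.sort(np.asarray(exec_counts, dtype=float))   # coldest first
    n_cold = np.searchsorted(np.cumsum(counts), tail * counts.sum())
    return n_cold / len(counts)

per_byte = np.random.default_rng(0).zipf(2.0, size=6 * 1024)   # ~6-KiB function
for tail in (0.10, 0.01, 0.001):
    print(f"last {tail:.1%} of executions -> "
          f"{fragmentation(per_byte, tail):.0%} of bytes are cold")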


Such code bloat is costly—performance counter data collected by GWP indicate that 8.2% of all i-cache misses among the 100 hottest functions are from memcmp alone.

Figure 4. Instruction execution profile for memcmp. 90% of dynamic instructions are contained in 2 cache lines; covering 99% of instructions requires 41 i-cache lines.

While conceptually simple, our version of memcmp was highly optimized for microbenchmarks and contained many code paths for specific input variations. We show that in WSC environments, where cache capacity is especially constrained, it is actually better to provide a reduced version of memcmp containing only a few paths, and that doing so improves fleet-wide performance by up to 1%.

SOFTWARE PREFETCHING FOR CODE
Looking into the instructions that lead to i-cache misses, we find that, while not particularly concentrated in specific code regions, most i-cache misses still share common characteristics. Specifically, missing instructions are often the target of control-flow-changing instructions with large jump distances.3 We find that distant branches and calls that are not amenable to traditional cache locality or next-line prefetching strategies account for a large fraction of cache misses among WSC applications.

For misses at the target of a distant jump, we propose and evaluate a profile-driven optimization technique that intelligently injects software prefetch instructions for code into the binary during compilation. We outline the design of the necessary "code prefetch" instruction, which is similar in nature to existing data prefetch instructions, except that it fetches into the L1 i-cache and utilizes the I-TLB instead of the D-TLB. The implementation of such an instruction has negligible hardware cost and complexity compared to pure hardware methods and is commercially viable today. While it can be implemented on top of a wide variety of hardware front-ends, we demonstrate its viability on a system that employs only a next-line instruction prefetcher.

Prefetching represents a prediction problem with a limited window of opportunity. Effective prefetches are both accurate and timely—they only bring in useful miss targets and do so neither too early nor too late, in order to minimize early evictions and cache pollution. As a result, an effective prefetcher would have high overall miss coverage. Our prefetch insertion algorithm uses profile feedback information from AsmDB and performance counter profiles to ensure timely prefetches with minimal overhead.

Several challenges arise for software prefetching techniques. Fan-in: The number of potential paths leading to a miss increases as the prefetch injection site is moved backward from a missed target. Figure 5 shows the fan-in for the top 20 i-cache misses from a web search profile. In several cases, the number of paths leading in to a single miss exceeds 100 even with a lookback of only ten instructions; the short sketch after this section reproduces the effect in miniature. Our approach leverages profiling information to insert only helpful prefetches, increasing coverage and minimizing fan-in. Fan-out poses another challenge in finding the prefetch injection site, as not all execution paths are likely to lead to the miss. We address this by pruning paths that exceed a maximum fan-out threshold. Furthermore, instruction prefetches themselves increase the code footprint and hence need to be inserted carefully.

Figure 5. Fan-in for some misses can grow very fast with distance, especially for library functions.
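The fan-in growth shown in Figure 5 can be reproduced in miniature by counting distinct backward control-flow paths within a bounded instruction lookback. The tiny CFG and block sizes below are invented for illustration.

# predecessors[b] lists the basic blocks that can branch to block b.
predecessors = {
    "miss": ["a", "b"],
    "a": ["c", "d"], "b": ["c", "e"],
    "c": [], "d": [], "e": [],
}
block_len = {b: 4 for b in predecessors}   # instructions per block (assumed)

def fan_in(block, budget):
    """Count distinct backward paths from `block` within `budget`
    instructions of lookback."""
    preds = predecessors[block]
    if budget <= 0 or not preds:
        return 1
    return sum(fan_in(p, budget - block_len[p]) for p in preds)

for lookback in (4, 8, 12):
    print(f"lookback {lookback:2d} -> {fan_in('miss', lookback)} paths")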

At its core, our prefetch injection strategy leverages the observation that the injection site of a prefetch instruction can be freely moved within the window of opportunity to minimize fan-in and fan-out. We call this approach dynamic window injection. At a high level, our prefetch procedure first constructs the execution history for each miss and then traverses the control flow graph in the reverse direction until it reaches the end of the instruction window, calculated based on the application-level IPC. Next, prefetch injection sites that have minimal fan-in and fan-out are searched for each miss among each of its execution paths. Prefetch instructions are then automatically inserted in the selected injection sites for the corresponding misses as part of the final linking steps; a simplified sketch of this site search appears at the end of this section.

We prototype the effects of our proposed software prefetching technique on memory traces from several WSC workloads. We evaluate on a modified version of the zsim simulator,5 using system parameters modeled against an Intel Haswell datacenter-scale server processor. We focus primarily on three WSC applications—a web search leaf node, an ads matching service, and a knowledge graph back-end. For each workload, we collect traces during a representative single-machine load test, which sends realistic loads to the server under test.

Figure 6 shows that our prefetching technique is able to eliminate 91%–96% of all i-cache misses, with a performance improvement proportional to the front-end boundedness of the application and the gap left by next-line prefetching (NLP). In all cases, fewer than 2.5% of additional dynamic instructions are added for code prefetches.

Figure 6. Miss coverage and performance improvement for the best-performing configuration for each workload.
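The site search just described can be sketched compactly. The backward traversal, the IPC-derived window, and the fan-out pruning follow the description above; the CFG encoding, the threshold of four, and the cost numbers are our assumptions.

def find_injection_sites(miss_block, preds, succs, block_len,
                         window, max_fanout=4):
    """Walk the CFG backward from `miss_block` and return blocks at least
    `window` instructions before the miss, pruning paths through blocks
    whose fan-out exceeds `max_fanout`."""
    sites, frontier, seen = set(), [(miss_block, 0)], set()
    while frontier:
        block, dist = frontier.pop()
        if (block, dist) in seen:
            continue
        seen.add((block, dist))
        for p in preds.get(block, []):
            if len(succs.get(p, [])) > max_fanout:
                continue                    # too speculative: prune fan-out
            d = dist + block_len[p]
            if d >= window:
                sites.add(p)                # early enough to hide the miss
            else:
                frontier.append((p, d))
    return sites

# Toy CFG: b0 -> {b1, b2} -> miss; window = miss latency x IPC (assumed).
preds = {"miss": ["b1", "b2"], "b1": ["b0"], "b2": ["b0"], "b0": []}
succs = {"b0": ["b1", "b2"], "b1": ["miss"], "b2": ["miss"]}
block_len = {"b0": 20, "b1": 10, "b2": 10, "miss": 5}
print(find_injection_sites("miss", preds, succs, block_len, window=30))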


LONG-TERM IMPLICATIONS
With increased technological growth, WSCs now serve billions of devices and applications across the planet. Due to their success, we expect an ever-greater reliance on WSCs in the near future, providing faster, more reliable, and more secure services to society. These increasing demands necessitate achieving higher performance for WSCs in order to be cost- and energy-efficient for WSC companies and their customers while simultaneously reducing the environmental impact on our world.

In combination with the slowdown of Moore's law, improving the efficiency of existing hardware in WSCs becomes even more critical. We analyzed a web search binary, showing that 68% of the CPU performance potential is lost due to pipeline stalls, of which 13.8% are due to the front-end not being able to deliver instructions fast enough.

This article addresses the front-end bottleneck on the following fronts.

• First, we have built a tool that is capable of collecting data from live datacenter applications at the granularity of instructions and at the scale of a WSC. We have described the architecture design decisions in detail, enabling other WSC operators to reproduce our system.
• Second, this article is the first work that shows detailed characterization studies of the processor front-end at the scale of a WSC by describing previously unreleased performance characteristics of WSC workloads.
• Third, we have proposed and evaluated a novel software-based code prefetch strategy to automatically and effectively reduce i-cache misses across large WSC workloads.

This work provides a powerful methodology to perform further at-scale research to obtain a detailed understanding of the microarchitectural characteristics and the interplay between current software and hardware. In addition, its reproducibility enables other WSC companies to perform similar research. Overall, such research would enable hardware vendors to work closely with software developers to better design future processors.

Our front-end characterization studies benefit the compiler and architecture communities in both academic and industrial settings. Our results on micro-optimizations, fragmentation, and code bloat can help in fine-tuning compiler passes, optimizing inlining strategies, and improving basic block layouts. Similarly, our studies provide valuable information to architecture researchers, exposing existing software loopholes that can be addressed with next-generation hardware designs.

Our work on software code prefetching serves as a strong case study for hardware vendors to provide support for a software code prefetch instruction and to implement such an instruction in the instruction set architecture (ISA). With this, compiler writers and software developers can leverage code prefetching and its resulting performance improvements in an automatic and scalable way.

More broadly, this article provides two insights, which we believe will have a significant and long-lasting impact on future research in the performance optimization and computer architecture domains. The first insight teaches the importance of enabling fleet-wide performance optimizations, which we also refer to as the Amdahl's law of WSC performance. Traditionally, performance optimizations have been focused on individual applications. In this approach, applications are profiled to determine the most compute-intensive regions, which yield the largest performance gains when optimized. However, this approach no longer applies to WSCs, as datacenters run thousands of different applications simultaneously. As a result, compute-intensive application-specific kernels are no longer worth optimizing. Instead, performance engineers need to focus on code that is shared among many applications in the fleet, representing the largest aggregated percentage of compute cycles.

The second insight teaches the importance of designing domain-specific general-purpose processors. WSCs have grown to a size at which designing domain-specific accelerators becomes feasible and cost-efficient. However, while this approach has proven successful for domains such as deep learning, most of the fleet cycles are still executed on general-purpose processors, as many applications are too complex and rapidly changing to render custom-designed hardware feasible. Nevertheless, as this article showed, the performance characteristics of WSC applications are fundamentally different from those of traditional applications such as the SPEC benchmark suite. WSC processors may differ with capabilities such as our proposed instruction prefetching mechanism, which may be of little use to SPEC applications but which delivers significant performance gains for datacenter applications.

In summary, the evidence is strong that this article will promote the research and development of new compiler techniques, new processor designs, and new ways of collecting and analyzing behaviors at the warehouse scale.

CONCLUSION
This work focused on understanding and improving i-cache behavior, which is a critical performance constraint for WSC applications. We developed AsmDB, a database for instruction and basic-block information across thousands of WSC production binaries, to characterize i-cache miss working sets and miss-causing instructions. We used these insights to motivate fine-grain layout optimizations to split hot and cold code and better utilize limited i-cache capacity. We also proposed a new feedback-driven optimization that inserts software instructions for code prefetching based on the control-flow information and miss profiles in AsmDB. This prefetching optimization can cover up to 96% of i-cache misses without significant changes to the processor and while requiring only very simple front-end fetch mechanisms.

ACKNOWLEDGMENTS
This work was supported by the NSF Award CCF-1823559. Nayana Prasad Nagendra and Grant Ayers contributed equally to this work.

REFERENCES
1. G. Ayers, J. H. Ahn, C. Kozyrakis, and P. Ranganathan, "Memory hierarchy for web search," in Proc. IEEE Int. Symp. High Perform. Comput. Archit., 2018, pp. 643–656.
2. S. Kanev et al., "Profiling a warehouse-scale computer," in Proc. Int. Symp. Comput. Archit., 2015, pp. 158–169.
3. R. Kumar, B. Grot, and V. Nagarajan, "Blasting through the frontend bottleneck with shotgun," in Proc. Archit. Support Program. Lang. Oper. Syst., 2018, pp. 30–42.
4. G. Ren, E. Tune, T. Moseley, Y. Shi, S. Rus, and R. Hundt, "Google-wide profiling: A continuous profiling infrastructure for data centers," IEEE Micro, vol. 30, no. 4, pp. 65–79, Jul./Aug. 2010.
5. D. Sanchez and C. Kozyrakis, "ZSim: Fast and accurate microarchitectural simulation of thousand-core systems," in Proc. Int. Symp. Comput. Archit., 2013, pp. 475–486.

Nayana Prasad Nagendra is currently working toward the Ph.D. degree with the Department of Computer Science, Princeton University. Her research interests include performance analysis and microarchitectural design with a focus on data centers. This work was done while she was an intern at Google. She is a student member of IEEE and ACM. Contact her at nagendra@cs.princeton.edu.

Grant Ayers is currently a Software Engineer at Google. His research interests include computer architecture, security, and accelerators. He joined Google after receiving the Ph.D. degree in computer science from Stanford University. This work was done while he was an intern at Google. Contact him at ayers@cs.stanford.edu.

David I. August is currently a Professor with the Department of Computer Science, Princeton University, where he directs the Liberty Research Group. His research interests include compilers and computer architectures. August received the Ph.D. degree in electrical and computer engineering from the University of Illinois at Urbana–Champaign. Contact him at august@princeton.edu.

Hyoun Kyu Cho is currently a Software Engineer at Google. His research interests include compiler optimization, parallel computing, and performance analysis. Cho received the Ph.D. degree in computer science and engineering from the University of Michigan at Ann Arbor. Contact him at netforce@google.com.

Svilen Kanev is currently a Software Engineer at Google, working on translating datacenter performance analysis insights into performance and TCO gains. He is broadly interested in anything that straddles the hardware-software interface. Kanev received the Ph.D. degree in computer science from Harvard University. Contact him at skanev@google.com.

Christos Kozyrakis is currently a Professor of electrical engineering and computer science with Stanford University. His research interests include hardware architectures and system software for cloud computing and emerging workloads. Kozyrakis received the Ph.D. degree in computer science from the University of California, Berkeley. He is a Fellow of IEEE and ACM. Contact him at christos@cs.stanford.edu.

Trivikram Krishnamurthy is currently a Senior Engineering Manager at Nvidia. Before joining Nvidia, he was a Software Engineer at Google. Krishnamurthy received the M.S. degree in electrical and computer engineering from the University of California, Santa Barbara. Contact him at trivikram.krishnamurthy@gmail.com.

Heiner Litz is currently an Assistant Professor in the Computer Science and Engineering Department, University of California, Santa Cruz (UCSC) and the Associate Director of the Center for Research in Storage Systems. His main research interests include computer architecture, operating systems, and storage with a focus on data centers. Before joining UCSC, he was a Researcher at Google. Litz received the Ph.D. degree from Mannheim University. He is a member of IEEE and ACM. Contact him at hlitz@ucsc.edu.

Tipp Moseley is currently a Principal Software Engineer at Google, where he works on datacenter-scale performance analysis. His research interests include compilers, operating systems, performance analysis, runtime systems, fault tolerance, and optimized lock-free data structures. Moseley received the Ph.D. degree in computer science from the University of Colorado at Boulder. Contact him at tipp@google.com.

Parthasarathy Ranganathan is currently a Distinguished Engineer at Google, where he is designing their next-generation systems. His research interests include systems architecture and management, power management, and energy efficiency for servers and datacenters. Ranganathan received the Ph.D. degree in computer engineering from Rice University. He is a Fellow of IEEE and ACM. Contact him at partha.ranganathan@google.com.

Theme Article: Top Picks

Extending the Frontier of Quantum Computers With Qutrits
Pranav Gokhale, Jonathan M. Baker, Casey Duckering, and Frederic T. Chong, University of Chicago
Kenneth R. Brown, Duke University
Natalie C. Brown, Georgia Institute of Technology

Abstract—We advocate for a fundamentally different way to perform quantum computation by using three-level qutrits instead of qubits. In particular, we substantially reduce the resource requirements of quantum computations by exploiting a third state for temporary variables (ancilla) in quantum circuits. Past work with qutrits has demonstrated only constant factor improvements, owing to the log2(3) binary-to-ternary compression factor. We present a novel technique using qutrits to achieve a logarithmic runtime decomposition of the Generalized Toffoli gate using no ancilla—an exponential improvement over the best qubit-only equivalent. Our approach features a 70× improvement in total two-qudit gate count over the qubit-only decomposition. This results in improvements for important algorithms for arithmetic and QRAM. Simulation results under realistic noise models indicate over 90% mean reliability (fidelity) for our circuit, versus under 30% for the qubit-only baseline. These results suggest that qutrits offer a promising path toward extending the frontier of quantum computers.

RECENT ADVANCES IN both hardware and software for quantum computation have demonstrated significant progress toward practical outcomes. In the coming years, we expect quantum computing will have important applications in fields ranging from machine learning and optimization to drug discovery. While early research efforts focused on longer term systems employing full error correction to execute large instances of algorithms like Shor factoring and Grover search, recent work has focused on noisy
intermediate scale quantum (NISQ) computation. The NISQ regime considers near-term machines with just tens to hundreds of quantum bits (qubits) and moderate errors.

Given the severe constraints on quantum resources, it is critical to fully optimize the compilation of a quantum algorithm in order to have successful computation. Prior architectural research has explored techniques such as mapping, scheduling, and parallelism to extend the amount of useful computation possible. In this article, we consider another technique: quantum trits (qutrits).

While quantum computation is typically expressed as a two-level binary abstraction of qubits, the underlying physics of quantum systems are not intrinsically binary. Whereas classical computers operate in binary states at the physical level (e.g., clipping above and below a threshold voltage), quantum computers have natural access to an infinite spectrum of discrete energy levels. In fact, hardware must actively suppress higher level states in order to achieve the two-level qubit approximation. Hence, using three-level qutrits is simply a choice of including an additional discrete energy level, albeit at the cost of more opportunities for error.

Prior work on qutrits (or, more generally, d-level qudits) identified only constant factor gains from extending beyond qubits. In general, this prior work1 has emphasized the information compression advantages of qutrits. For example, N qubits can be expressed as N/log2(3) qutrits, which leads to log2(3) ≈ 1.6 constant factor improvements in runtimes.

Our approach utilizes qutrits in a novel fashion, essentially using the third state as temporary storage, but at the cost of higher per-operation error rates. Under this treatment, the runtime (i.e., circuit depth or critical path) is asymptotically faster, and the reliability of computations is also improved. Moreover, our approach only applies qutrit operations in an intermediary stage: The input and output are still qubits, which is important for initialization and measurement on real devices.2,3

The net result of our work is to extend the frontier of what quantum computers can compute. In particular, the frontier is defined by the zone in which every machine qubit is a data qubit, for example, a 100-qubit algorithm running on a 100-qubit machine. This is indicated by the yellow region in Figure 1. In this frontier zone, we do not have room for nondata workspace qubits known as ancilla. The lack of ancilla in the frontier zone is a costly constraint that generally leads to inefficient circuits. For this reason, typical circuits instead operate below the frontier zone, with many machine qubits used as ancilla. This article demonstrates that ancilla can be substituted with qutrits, enabling us to extend the ancilla-free frontier zone of quantum computation.

BACKGROUND
A qubit is the fundamental unit of quantum computation. Compared to their classical counterparts, which take values of either 0 or 1, qubits may exist in a superposition of the two states. We designate these two basis states as |0⟩ and |1⟩ and can represent any qubit as |ψ⟩ = α|0⟩ + β|1⟩ with |α|² + |β|² = 1. |α|² and |β|² correspond to the probabilities of measuring |0⟩ and |1⟩, respectively.

Quantum states can be acted on by quantum gates, which preserve valid probability distributions that sum to 1 and guarantee reversibility. For example, the X gate transforms a state |ψ⟩ = α|0⟩ + β|1⟩ to X|ψ⟩ = β|0⟩ + α|1⟩. The X gate is also an example of a classical reversible operation, equivalent to the NOT operation. In quantum computation, we have a single irreversible operation called measurement, which transforms a quantum state into one of the two basis states with a given probability based on α and β.

In order to interact different qubits, two-qubit operations are used. The CNOT gate appears both in classical reversible computation and in quantum computation. It has a control qubit and a target qubit. When the control qubit is in the |1⟩ state, the CNOT performs a NOT operation on the target.
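This formalism is compact enough to check numerically. Below is a minimal NumPy sketch of a qubit state, the X gate, measurement probabilities, and the CNOT gate; the amplitudes 0.6 and 0.8 are arbitrary example values.

import numpy as np

ket0 = np.array([1, 0], dtype=complex)
ket1 = np.array([0, 1], dtype=complex)

X = np.array([[0, 1],
              [1, 0]], dtype=complex)     # swaps |0> and |1>

CNOT = np.array([[1, 0, 0, 0],            # applies X to the target
                 [0, 1, 0, 0],            # iff the control is |1>
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=complex)

psi = 0.6 * ket0 + 0.8 * ket1             # |psi> = 0.6|0> + 0.8|1>
print(np.abs(psi) ** 2)                   # measurement probs: [0.36 0.64]
print(np.abs(X @ psi) ** 2)               # after X: [0.64 0.36]

pair = CNOT @ np.kron(psi, ket0)          # entangles the two qubits
print(np.abs(pair) ** 2)                  # weight only on |00> and |11>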


The CNOT gate serves a special role in quantum computation, allowing quantum states to become entangled so that a pair of qubits cannot be described as two individual qubit states. Any operation may be conditioned on one or more controls.

Many classical operations, such as AND and OR gates, are irreversible and therefore cannot directly be executed as quantum gates. For example, consider the output of 1 from an OR gate with two inputs. With only this information about the output, the value of the inputs cannot be uniquely determined. These operations can be made reversible by the addition of extra, temporary ancilla bits initialized to |0⟩.

Physical systems in classical hardware are typically binary. However, in common quantum hardware, such as in superconducting and trapped ion computers, there is an infinite spectrum of discrete energy levels. The qubit abstraction is an artificial approximation achieved by suppressing all but the lowest two energy levels. Instead, the hardware may be configured to manipulate the lowest three energy levels by operating on qutrits. In general, such a computer could be configured to operate on any number of d levels. As d increases, the number of opportunities for error—termed error channels—increases. Here, we focus on d = 3, which is sufficient to achieve the desired improvements to the Generalized Toffoli gate.

Figure 1. The frontier of what quantum hardware can execute is the yellow region adjacent to the 45° line. In this region, each machine qubit is a data qubit. Typical circuits rely on nondata ancilla qubits for workspace and therefore operate below the frontier.

In a three-level system, we consider the computational basis states |0⟩, |1⟩, and |2⟩ for qutrits. A qutrit state |ψ⟩ may be represented analogously to a qubit as |ψ⟩ = α|0⟩ + β|1⟩ + γ|2⟩, where |α|² + |β|² + |γ|² = 1. Qutrits are manipulated in a similar manner to qubits; however, there are additional gates that may be performed on qutrits.

For instance, in quantum binary logic, there is only a single X gate. In ternary, there are three X gates, denoted X01, X02, and X12. Each of these Xij can be viewed as swapping |i⟩ with |j⟩ and leaving the third basis element unchanged. For example, for a qutrit |ψ⟩ = α|0⟩ + β|1⟩ + γ|2⟩, applying X02 produces X02|ψ⟩ = γ|0⟩ + β|1⟩ + α|2⟩. There are two additional nontrivial operations on a single trit: the X+1 and X−1 operations, which perform addition and subtraction modulo 3.

Just as single qubit gates have qutrit analogs, the same holds for two qutrit gates. For example, consider the CNOT operation, where an X gate is performed conditioned on the control being in the |1⟩ state. For qutrits, any of the X gates presented above may be performed, conditioned on the control being in any of the three possible basis states.

In order to evaluate a decomposition of a quantum circuit, we consider quantum circuit costs. The space cost of a circuit, i.e., the number of qubits (or qutrits), is referred to as circuit width. Requiring ancilla increases the circuit width and, therefore, the space cost of a circuit. The time cost for a circuit is its depth, given as the length of the critical path from input to output.
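The ternary gates above are simply 3x3 permutation matrices, as the following NumPy sketch shows; conditioning any of them on a control qutrit, as in the ternary CNOT analogs, is omitted for brevity.

import numpy as np

I3 = np.eye(3, dtype=complex)

def X_swap(i, j):
    # Xij: swap |i> and |j>, leaving the third basis state unchanged.
    m = I3.copy()
    m[[i, j]] = m[[j, i]]
    return m

X01, X02, X12 = X_swap(0, 1), X_swap(0, 2), X_swap(1, 2)
Xplus1 = np.roll(I3, 1, axis=0)      # X+1: |k> -> |k+1 mod 3>
Xminus1 = np.roll(I3, -1, axis=0)    # X-1: |k> -> |k-1 mod 3>

psi = np.array([0.6, 0.0, 0.8], dtype=complex)   # alpha|0> + gamma|2>
print(X02 @ psi)                     # amplitudes of |0> and |2> swapped
print(Xplus1 @ np.array([1, 0, 0]))  # |0> becomes |1>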

PRIOR WORK

Qudits
Qutrits, and more generally qudits, have been studied in past work both experimentally and theoretically. However, in the past work, qudits have conferred only an information compression advantage. For example, N qubits can be compressed to N/log2(d) qudits, giving only a constant-factor advantage1 at the cost of greater errors from operating qudits instead of qubits. Ultimately, the tradeoff between information compression and higher per-qudit errors has not been favorable in the past work.

Table 1. Asymptotic comparison of N-controlled gate decompositions. The total gate count for all circuits scales linearly (except for Barenco et al.,6 which scales quadratically). Our construction uses qutrits to achieve logarithmic depth without ancilla. We benchmark our circuit construction against Gidney,4 which is the asymptotically best ancilla-free qubit circuit.

            | This Work            | Gidney4 | He et al.5 | Barenco et al.6 | Wang and Perkowski7  | Lanyon et al.8
Depth       | log N                | N       | log N      | N^2             | N                    | N
Ancilla     | 0                    | 0       | N          | 0               | 0                    | 0
Qudit types | Controls are qutrits | Qubits  | Qubits     | Qubits          | Controls are qutrits | Target is d = N-level qudit
Constants   | Small                | Large   | Small      | Small           | Small                | Small

past research toward building practical quantum As in our approach, circuit constructions
computers has focused on qubits. from Wang and Perkowski,7 and Lanyon et al.8
This article introduces qutrit-based circuits, have attempted to improve the ancilla-free Gen-
which are asymptotically better than equivalent eralized Toffoli gate by using qudits. Wang and
qubit-only circuits. Unlike prior work, we dem- Perkowski7 achieves a linear circuit depth but
onstrate a compelling advantage in both run- by operating each control as a qutrit. The
time and reliability, thus justifying the use of Lanyon et al.8 construction, which has been
qutrits. demonstrated experimentally, achieves linear
circuit depths by operating the target as a
Generalized Toffoli Gate d ¼ N-level qudit.
The Toffoli gate itself is a simple extension of Our circuit construction, presented in the
the CNOT gate, but has two controls instead of “Generalized Toffoli Gate” section, has similar
one control. In a Toffoli gate, the NOT is applied if structure to the He design, which can be rep-
and only if both controls are j1i. Similarly, a Gener- resented as a binary tree of gates. However,
alized Toffoli gate has N controls and flips the tar- instead of storing temporary results with a lin-
get qubit if and only if all N control qubits are j1i. ear number of ancilla qubits, our circuit tem-
The Generalized Toffoli gate is an important primi- porarily stores information directly in the
tive used across a wide range of quantum algo- qutrit j2i state of the controls. Thus, no
rithms, and it has been the focus of extensive past ancilla are needed.
optimization work. Table 1 compares past circuit In our simulations, we benchmark our circuit
constructions for the Generalized Toffoli gate to construction against the Gidney construction4
our construction, which is presented in full in because it is the asymptotically best qubit cir-
“Generalized Toffoli Gate” section. cuit in the ancilla-free frontier zone. We label
Among prior work, Gidney,4 He et al.,5 and these two benchmarks as QUTRIT and QUBIT.
Barenco et al.6 designs are all qubit-only. The
three circuits have varying tradeoffs. While Gid-
ney and Barenco operate at the ancilla-free fron- CIRCUIT CONSTRUCTION
tier, they have large circuit depths: Linear with a In order for quantum circuits to be executable
large constant for Gidney and quadratic for Bare- on hardware, they are typically decomposed into
nco. While the He circuit achieves logarithmic single- and two- qudit gates. Performing efficient
depth, it requires an ancilla for each data qubit, low depth and low gate count decompositions is
effectively halving the effective potential of any important in both the NISQ regime and beyond.
given quantum hardware and operating far
below the frontier. Nonetheless, in practice, Key Intuition
most circuit implementations use these linear- We develop the intuition for how qutrits can
ancilla constructions due to their small depths be useful by considering the example of construct-
and gate counts. ing an AND gate. In the framework of quantum

May/June 2020
67
Top Picks

computing, which requires reversibility, AND is


not permitted directly. For example, consider a
two-input AND gate that outputs a 0. Given this
output value, the inputs cannot be uniquely deter-
mined since 00, 01, and 10 all yield an AND output
of 0. However, these operations can be made
reversible by the addition of an extra temporary
workspace bit initialized to 0. Using a single addi-
tional such ancilla, the AND operation can be com- Figure 2. Toffoli AND via qubits (top) versus qutrits
puted reversibly via the well-known Toffoli gate (a (bottom).
double-controlled NOT) in reversible computa-
Generalized Toffoli Gate
tion. While this approach works, it is expensive—
We now present our circuit decomposition
its decomposition into hardware-implementable
for the generalized Toffoli gate in Figure 3. The
one- and two-input gates requires at least six con-
decomposition is expressed in terms of three-
trolled-NOT gates and several single qubit gates,
qutrit gates (two controls, one target) instead of
as depicted in the top circuit of Figure 2.
single- and two- qutrit gates, because the circuit
However, if we break the qubit abstraction
can be understood purely classically at this gran-
and allow occupation of a higher qutrit energy
ularity. In actual implementation and in our sim-
level, the cost of the Toffoli AND operation is
ulation, we used a decomposition9 that requires
greatly diminished. The bottom circuit in Figure 2
six two-qutrit and seven single-qutrit physically
displays this decomposition using qutrits. The
implementable quantum gates.
goal is to elevate the jq2 ¼ 0i ancilla qubit to j1i
Our circuit decomposition is most intuitively
if and only if the top two control qubits are
understood by treating the left half of the circuit
both j1i. This is effectively an AND gate. First a
as a tree. The desired property is that the root of
j1i-controlled ðþ1 mod 3Þ is performed on q0 and
the tree q7 is j2i if and only if each of the 15 con-
q1 . This elevates q1 to j2i iff q0 and q1 were both
trols was originally in the j1i state. To verify this
j1i. Then, a j2i-controlled ðþ1 mod 2Þ gate is
property, we observe that the root q7 can only
applied to q2 . Therefore, q2 is elevated to j1i only
become j2i iff q7 was originally j1i and q3 and q11
when both q0 and q1 were j1i, as desired. The
were both previously j2i. At the next level of the
controls are restored to their original states by a
tree, we see q3 could have only been j2i if q3 was
j1i-controlled ð1 mod 3Þ gate, which resets q1 .
originally j1i and both q1 and q5 were previously
The key intuition is that temporary information
j2i, and similarly for the other triplets. At the
can be stored in the qutrit j2i state, rather
bottom level of the tree, the triplets are con-
than requiring an external ancilla. This allows
trolled on the j1i state, which are only activated
us to maximize the problem sizes that current-
when the even-index controls are all j1i. Thus, if
generation hardware can address.
any of the controls were not j1i, the j2i states
Now, notice that this AND decomposition is
would fail to propagate to the root of the tree.
actually also a Toffoli gate decomposition. In par-
The right half of the circuit performs uncomputa-
ticular, suppose that the bottom qubit is now arbi-
tion to restore the controls to their original state.
trary, rather than initialized to j0i. Then, the net
After each subsequent level of the tree struc-
effect of this circuit is to preserve the controls,
ture, the number of qubits under consideration
while flipping the bottom (target) qubit, if and
is reduced by a factor of  2. Thus, the circuit
only if q0 and q1 were both j1i. For this decomposi-
depth is logarithmic in N. Moreover, each qutrit
tion, since all inputs and outputs are binary qubits
is operated on by a constant number of gates, so
(we only occupy the j2i state temporarily during
the total number of gates is linear in N.
computation), we can insert this circuit construc-
tions into any preexisting qubit-only circuits.
Again, the key intuition in this decomposition is APPLICATION TO ALGORITHMS
that the qutrit j2i state can be used instead of The Generalized Toffoli gate is an important
ancilla to store temporary information. primitive in a broad range of quantum

IEEE Micro
68
algorithms. Here, we note two important applica-
tions of our circuit decomposition.

Arithmetic Circuits
The Generalized Toffoli is a key subcircuit
in many arithmetic circuits such as constant
addition, modular multiplication, and modular
exponentiation. The circuit for computing a
square root is also improved by a more effi-
cient Generalized Toffoli gate. As shown by
Gokhale,10 the circuit for the initial approxi-
pffiffiffi
mation to 1= x involves a sequence of stan-
dard Toffoli gates terminated by a large
OðnÞ-width Generalized Toffoli gate. Our cir-
cuit construction is directly applicable to this
terminal gate.

Quantum Machine Learning


A fundamental component of most algo-
rithms for quantum machine learning is a
quantum random access memory (QRAM).
QRAM has the classically familiar property of
mapping input index bits to output data bits. Figure 3. Our circuit decomposition for the Generalized Toffoli
However, unlike classical RAM, QRAM also gate is shown for 15 controls and 1 target. The inputs and outputs
acts over superpositions of qubits. The initiali- are both qubits, but we allow occupation of the j2i qutrit state in
zation of QRAM is often the bottleneck for between. The circuit has a tree structure and maintains the
quantum machine learning algorithms—an property that the root of each subtree can only be elevated to j2i if
expensive procedure for storing training data all of its control leaves were j1i. Thus, the U gate is only executed
into QRAM can negate any potential quantum if all controls are j1i. The right half of the circuit performs
advantage. uncomputation to restore the controls to their original state. This
However, the QRAM circuit is yet another construction applies more generally to any multiply controlled U
application that can be improved with the gate. Note that the three-input gates are decomposed into six
qutrit-assisted Generalized Toffoli gate. In partic- two-input and seven single-input gates in our actual simulation, as
ular, the flip-flop QRAM is bottlenecked by the based on the decomposition by Di and Wei.9
application of a wide Generalized Toffoli gate for
each classical bitstring stored to the QRAM.11
Noise Simulation
Thus, an efficient Generalized Toffoli gate
Our noise simulation procedure accounts for
reduces the cost of QRAM, relative to nonqutrit
both gate errors and idle errors, described
procedures.
below. To determine when to apply each gate
and idle error, we use Cirq’s scheduler which
SIMULATOR schedules each gate as early as possible, creat-
To simulate our circuit constructions, we ing a sequence of Moment’s of simultaneous
developed a qudit simulation library, built on gates. During each Moment, our noise simulator
Google’s Cirq Python library. Cirq is a qubit- applies a gate error to every qudit acted on.
based quantum circuit library and includes a Finally, the simulator applies an idle error to
number of useful abstractions for quantum every qudit. This noise simulation methodology
states, gates, circuits, and scheduling. Our soft- is consistent with previous simulation techni-
ware performs noise simulation, described in ques, which have accounted for either gate
the following. errors or idle errors.

May/June 2020
69
Top Picks

because our circuit constructions maintain


binary input and output, only occupying the
qutrit j2i states during intermediate computa-
tion. Therefore, the SPAM errors for our circuits
are identical to those for conventional qubit
circuits.
We chose noise models which represent real-
istic near-term machines. Our models encom-
passed both superconducting and trapped ion
platforms. For superconducting technology, we
simulated against four noise models: SC, SC+T1,
Figure 4. Exact circuit depths for all three benchmarked circuit SC+GATES, and SC+T1+GATES. For trapped
constructions for the N-controlled Generalized Toffoli up to ion technology, we simulated against three
N ¼ 200. QUBIT scales linearly in depth and is bested by benchmarks: TI_QUBIT, BARE_QUTRIT, and
QUTRIT’s logarithmic depth. DRESSED_QUTRIT.

Gate errors arise from the imperfect applica-


RESULTS
tion of quantum gates. Two-qudit gates are nois-
Figure 4 plots the exact circuit depths for
ier than single-qudit gates, so we apply different
the QUTRIT versus QUBIT circuit construction.
noise channels for the two. Idle errors arise from
Note that the QUBIT construction is linear in
the continuous decoherence of a quantum sys-
depth, with a high linearity constant. This is sig-
tem due to energy relaxation and interaction
nificantly improved by our QUTRIT construc-
with the environment.
tion, which scales logarithmically in N and has a
Gate errors are reduced by performing fewer
relatively small leading coefficient.
total gates, and idle errors are reduced by
Figure 5 plots the total number of two-qudit
decreasing the circuit depth. Since our circuit
gates for both circuit constructions. Our circuit
constructions asymptotically decrease the
construction is not asymptotically better in total
depth, this means our circuit constructions
gate count—both plots have linear scaling. How-
scale favorably in terms of asymptotically fewer
ever, as emphasized by the logarithmic vertical
idle errors.
axis, the linearity constant for our qutrit circuit
The ultimate metric of interest is the mean
is 70 smaller than for the equivalent ancilla-
fidelity, which captures the probability of overall
free qubit circuit.
successful execution. We do not consider state
We simulated these circuits under realistic
preparation and measurement (SPAM) errors,
noise models in parallel on over 100 n1-stan-
dard-4 Google Cloud instances. These simula-
tions represent over 20 000 CPU hours, which
was sufficient to estimate mean fidelity to
an error of 2s < 0:1% for each circuit-noise
model pair.
The full results of our circuit simulations are
shown in Figure 6. All simulations are for the 14-
input (13 controls, 1 target) Generalized Toffoli
gate. We simulated both circuit constructions
against each of our noise models (when applica-
ble), yielding the 11 bars in the figure.

Figure 5. Exact two-qudit gate counts for the two benchmarked


circuit constructions for the N-controlled Generalized Toffoli. Both DISCUSSION AND FUTURE WORK
plots scale linearly; however, the QUTRIT construction has a Figure 6 demonstrates that our QUTRIT
substantially lower linearity constant. construction (orange bars) can significantly

IEEE Micro
70
Figure 6. Circuit simulation results for all possible pairs of circuit constructions and noise models. Each bar
represents 1000+ trials, so the error bars are all 2s < 0:1%. Our QUTRIT construction significantly
outperforms the QUBIT construction.

outperform the ancilla-free QUBIT benchmark envision other advantages to higher radix quan-
(blue bars) in fidelity (success probability) by tum computing. For example, the information-
more than 10 000. compression advantage of qudits may be particu-
For the SC, SC+T1, and SC+GATES noise mod- larly well suited to the NISQ hardware, where
els, our qutrit constructions achieve between device connectivity—and therefore diameter—is
57–83% mean fidelity, whereas the ancilla-free a bottleneck. Compressing a qubit computation
qubit constructions all have almost 0% fidelity. via qudits would allow us to reduce the graph
Only the lowest error model, SC+T1+GATES diameter.
achieves modest fidelity of 26% for the QUBIT The results presented in this article are appli-
circuit, but in this regime, the qutrit circuit is cable to quantum computing in the near term on
close to 100% fidelity. machines that are expected within the next five
The trapped ion noise models achieve similar years. The net result of this article is to extend the
results—the DRESSED_QUTRIT and frontier of what is computable by
the BARE_QUTRIT achieve approxi- quantum hardware, and hence to
mately 95% fidelity via the QUTRIT cir- Clever use of qutrits accelerate the timeline for practi-
cuit, whereas the TI_QUBIT noise offers a path to more cal quantum computing. Emphat-
model has only 45% fidelity. Between sophisticated quantum ically, our results are driven by
computation today,
the dressed and bare qutrits, the the use of qutrits for asymptoti-
without needing to wait
dressed qutrit exhibits higher fidelity cally faster ancilla-free circuits.
for better hardware. We
than the bare qutrit, as expected. Moreover, we also improve lin-
are optimistic that
Moreover, the dressed qutrit is resil- continued hardware– earity constants by two orders of
ient to leakage errors, so the simula- software codesign may magnitudes. Finally, as verified
tion results should be viewed as a further extend the frontier by our circuit simulator coupled
lower bound on its advantage over the of quantum computers. with realistic noise models, our
qubit and bare qutrit. circuits are more reliable than
Our qutrit-assisted Generalized qubit-only equivalents. In sum,
Toffoli gate has already attracted interest from clever use of qutrits offers a path to more sophisti-
both device physics and algorithms communities. cated quantum computation today, without need-
To this end, major quantum software packages ing to wait for better hardware. We are optimistic
like Cirq are now compatible with qutrit (and that continued hardware–software codesign may
qudit) simulations. We have also been working further extend the frontier of quantum computers.
with hardware groups to experimentally imple-
ment the ideas presented here. One promising
direction is to use OpenPulse, an open standard ACKNOWLEDGMENTS
for pulse-level quantum control, to experimentally We would like to thank Michel Devoret and
demonstrate a generalized Toffoli gate. We also Steven Girvin for suggesting to investigate

May/June 2020
71
Top Picks

qutrits. We also acknowledge David Schuster for 10. P. Gokhale, “Implementation of square root function
helpful discussion on superconducting qutrits. using quantum circuits,” Undergraduate Awards,
This work was supported in part by EPiQC, an 2014.
NSF Expedition in Computing, under Grant CCF- 11. D. K. Park, F. Petruccione, and J.-K. K. Rhee, “Circuit-
1730449/1832377; in part by STAQ under Grant based quantum random access memory for classical
NSF Phy-1818914; and in part by DOE Grants DE- data,” Sci. Rep., vol. 9, no. 1, 2019, Art. no. 3949.
SC0020289 and DE-SC0020331. The work of Pra-
nav Gokhale was supported by the Department
Pranav Gokhale is currently working toward the
of Defense through the National Defense Science
Ph.D. degree with the University of Chicago. His
and Engineering Graduate Fellowship Program.
research focuses on breaking the abstraction
barrier between quantum hardware and software.
He is the founder of Super.tech. Contact him at
& REFERENCES pranavgokhale@uchicago.edu.

1. A. Pavlidis and E. Floratos, “Arithmetic circuits for Jonathan M. Baker is currently working toward
multilevel qudits based on quantum Fourier the Ph.D. degree with the University of Chicago. His
transform,” 2017, arXiv:1707.08834. research is primarily focused on vertical integration
2. J. Randall et al., “Efficient preparation and detection of of the quantum computing hardware–software stack.
microwave dressed-state qubits and qutrits with Contact him at jmbaker@uchicago.edu.
trapped ions,” Phys. Rev. A, vol. 91, 2015,
Casey Duckering is currently working toward the
Art. no. 012322.
Ph.D. degree with the University of Chicago, aiming
3. J. Randall, A. M. Lawrence, S. C. Webster, S. Weidt, to efficiently bring together quantum algorithms
N. V. Vitanov, and W. K. Hensinger, “Generation of with their physical implementation on quantum com-
high-fidelity quantum control methods for multilevel puters. Contact him at cduck@uchicago.edu.
systems,” Phys. Rev. A, vol. 98, 10 2018,
Art. no. 043414. Frederic T. Chong is the Seymour Goodman Pro-
4. C. Gidney, “Constructing large controlled nots,” 2015. fessor with the Department of Computer Science, Uni-
versity of Chicago. He is also Lead Principal
5. Y. He, M.-X. Luo, E. Zhang, H.-K. Wang, and
Investigator for the EPiQC Project (Enabling Practi-
X.-F. Wang, “Decompositions of n-qubit toffoli gates
cal-scale Quantum Computing), an NSF Expedition in
with linear circuit complexity,” Int. J. Theor. Phys.,
Computing. Contact him at chong@cs.uchicago.edu.
vol. 56, pp. 2350–2361, Jul. 2017.
6. A. Barenco et al., “Elementary gates for quantum Natalie C. Brown is currently working toward the
computation,” Phys. Rev. A, vol. 52, pp. 3457–3467, Ph.D. degree with Georgia Institute of Technology.
Nov. 1995. Her research focuses on leakage error correction
7. Y. Wang and M. Perkowski, “Improved complexity of and mitigation in topological surface codes. Contact
quantum oracles for ternary grover algorithm for graph her at natalie.c.brown@duke.edu.
coloring,” in Proc. 41st IEEE Int. Symp. Multiple-Valued
Kenneth R. Brown is an Associate Professor of
Logic, May 2011, pp. 294–301.
electrical and computer engineering with Duke Uni-
8. B. P. Lanyon et al., “Simplifying quantum logic using
versity and the Director of the NSF Software Enabled
higher-dimensional Hilbert spaces,” Nature Phys., Architectures for Quantum co-design (STAQ) project
vol. 5, pp. 134–140,, 2009. developing applications, software, and hardware
9. Y.-M. Di and H.-R. Wei, “Elementary gates for ternary for ion trap quantum computers. Contact him at
quantum logic circuit,” 2011, arXiv:1105.5485. kenneth.r.brown@duke.edu.

IEEE Micro
72
Theme Article: Top Picks

Architecting Noisy
Intermediate-Scale
Quantum Computers:
A Real-System Study
Prakash Murali Ali Javadi Abhari
Princeton University IBM T. J. Watson Research Center
Norbert M. Linke Nhung Hong Nguyen
Joint Quantum Institute, University of Maryland University of Maryland
Margaret Martonosi Cinthia Huerta Alderete
Princeton University University of Maryland and Instituto Nacional de

Astrofısica, Optica nica
y Electro

Abstract—Current quantum computers have very different qubit implementations,


instruction sets, qubit connectivity, and noise characteristics. Using real-system
evaluations on seven quantum systems from three leading vendors, our work explores
fundamental design questions concerning hardware choices, architecture, and compilation.

& QUANTUM COMPUTING (QC) is a fundamentally information. While the basic principles of QC have
new model of computation, which exploits been known since the 1980s, recent hardware
quantum mechanical phenomena to perform progress has ushered in the era of noisy intermedi-
computation. QC systems use qubits (quantum ate-scale quantum (NISQ) devices. These systems
bits) to represent information and gates represent an important milestone toward large
(quantum instructions) to manipulate quantum scale QC, and are expected to scale to 500–1000
qubits in coming years. In spite of being too error-
prone and resource-constrained for well-known
Digital Object Identifier 10.1109/MM.2020.2985683 applications like Shor’s factoring, NISQ systems
Date of publication 6 April 2020; date of current version 22 are capable of very powerful computations. Nota-
May 2020. bly, Google recently demonstrated a classically

May/June 2020 Published by the IEEE Computer Society 0272-1732 ß 2020 IEEE
73
Top Picks

intractable computation on an NISQ system with applications on seven systems from three lead-
54 qubits.1 ing vendors—IBM, Rigetti, and University of
Being early-stage, NISQ devices are highly Maryland. The systems studied represent differ-
diverse in terms of hardware and ent points in the design space,
architecture. Leading QC vendors with two leading qubit tech-
While the basic
including IBM, Rigetti, Google, principles of QC have nologies (superconducting and
IonQ, and others have adopted been known since the trapped ion qubits), different
very different approaches for build- 1980s, recent hardware connectivity topologies, pro-
ing hardware qubits. To support progress has ushered gramming interfaces, and noise
their qubit choices, vendors have in the era of noisy behavior. The diversity of sys-
also chosen different instruction intermediate-scale tems studied is important for
sets and hardware communication quantum (NISQ) understanding which aspects of
topologies. Further, QC systems devices. These systems QC design hold across different
also have variance in hardware represent an important design choices and which are
milestone toward large
noise, owing to fundamental more implementation specific.
scale QC, and are
challenges in qubit control and Our work represents the most
expected to scale to
manufacturing. While this diversity comprehensive cross-platform,
500–1000 qubits in
itself poses a challenge for efficient coming years. real-system measurements of QC
and portable application execu- prototypes ever performed.
tion, there is also a huge gap On the other hand, this design
between the QC hardware that is buildable now, space diversity also poses serious challenges for
and the resource requirements of compelling accurate comparative studies. In particular, our
real-world applications. Many interesting comparisons hinge on developing a toolflow and
applications demand large systems with several evaluation approach common to all platforms,
thousand quantum bits and high-precision oper- and yet not penalizing any particular platform
ations, but current hardware has less than while pursuing toolflow generality. Our toolflow,
100 qubits and error-prone operations. To fully TriQ, is the first top-to-bottom multivendor QC
attain practical and powerful QC, computer compiler toolflow. TriQ optimizes high-level lan-
architecture techniques and software toolchains guage programs for QC hardware by leveraging
must be employed to narrow the algorithm-to- deep but parameterized knowledge of the target
devices resource gap across a wide range of device characteristics, including the gate set,
algorithms and devices. connectivity, and noise profile. Importantly, TriQ
To this end, our article2 offers one of the avoids inefficiencies in vendor toolflows, offering
deepest explorations of cross-platform charac- up to two orders of magnitude higher reliability
teristics in QC systems, presenting a full-stack, compared to IBM’s Qiskit3 and Rigetti’s Quil4
benchmark-driven, hardware–software analysis. compiler which are the default toolchains for the
Viewing QC through the lens of computer archi- respective hardware. TriQ, therefore, allows us to
tecture, we evaluate important hardware design perform architectural analysis across diverse QC
decisions (qubit types, system size, connectivity, systems using high-level application performance
noise), the hardware–software interface (gate set measurements and is also a common compiler
choices), and software optimizations to tackle toolflow.
fundamental design questions: What instructions Our experiments with TriQ reveal several
should QC systems expose to software? Should architectural insights for QC systems. We quan-
instructions be unified in a device-independent tify the importance of gate set, ISA and connectiv-
ISA across different qubit types? How do hard- ity choices and offers design recommendations.
ware connectivity and noise characteristics We also evaluate the effects of hardware noise
impact benchmark performance? Can hardware on applications and the importance of software
limitations be overcome with a compiler? optimizations to mitigate such noise. Our results
To answer these questions, we use real- have also attracted significant academic and
system measurements to evaluate a suite of QC industry attention with vendors including IBM

IEEE Micro
74
and Rigetti incorporat-
ing our optimizations in
their compiler toolflows.
In coming years, hard-
ware and architectural
insights from our study
are likely to influence
QC.
Figure 1. Hardware qubit technology, native gate set, and software-visible gate set in
the systems used in our study. Each qubit technology lends itself to a set of native gates.
BACKGROUND
ON QC For programming, vendors expose these gates in a software-visible interface or construct
A qubit is the funda- composite gates with multiple native gates.
mental building block
a QC system. Unlike a
classical bit which is restricted to be either in here and refer the reader to our original paper
the state 0 or 1 at any instant, a qubit can exist for more details.2
in a superposition state where it is a probabilistic Figure 1 shows the different hardware qubit
combination of the two basis states. This prop- technologies used in IBM, Rigetti, and UMD sys-
erty allows an n-bit QC system to represent 2n tems. IBM and Rigetti use superconducting
basis states simultaneously, unlike classical qubits, while UMD uses trapped ion qubits. On
registers which can be in exactly one of the 2n one hand, these choices are similar to how clas-
values at any given time. To manipulate informa- sical computers can be realized using vacuum
tion, QC gates are implemented to operate on tubes, relay circuits or CMOS transistors. On the
one or more qubits, using some physical interac- other hand, qubit technologies are very different
tion such as a microwave or laser pulse. Similar and do not lend themselves to abstraction simi-
to universal gates in classical systems, QC com- lar to the ON–OFF switch abstraction in classical
putations can be expressed using a small univer- technologies. For example, on IBM’s supercon-
sal set of single (1Q) and two-qubit (2Q) gates. ducting qubits, the two-qubit interactions are
In particular, 2Q gates create entanglement achieved using the cross-resonance effect,
which is a key property exploited by algorithms. where one qubit is driven at the resonant fre-
To obtain classical output from the system, quency of another qubit using a coupled hard-
qubits are measured or readout, collapsing the ware resonator. In contrast, in UMD’s trapped
superposition state to either 0 or 1. ion qubits, two-qubit interactions are achieved
using collective motional modes of an ion chain,
QC ARCHITECTURE CHOICES mediated through laser pulses.
AND TRADEOFFS Owing to these fundamental differences, ven-
NISQ systems have very diverse hardware dors implement different native gates or microop-
and architecture. While classical metrics such as erations that are feasible on their platform.
performance (time) and area are important to Figure 1 shows these native 1Q and 2Q gates.
evaluate these options, a key figure of merit in Even among superconducting qubits, the native
the current NISQ regime is the likelihood of cor- interactions may be different. For example,
rect execution of applications. Owing to the Rigetti uses the controlled Z operation as the fun-
noise, a single execution of an application may damental 2Q operation instead of the cross-reso-
be corrupted by noise. Hence, programs are nance gate in IBM. Using these native gates,
typically run multiple times and the success rate vendors choose a software-visible programming
is measured as the fraction of trials which yields interface which includes either native gates
the correct answer. Toward understanding how themselves or composite gates which use multi-
system design affects success rate and perfor- ple native gates. These choices for software-
mance, we briefly discuss the key design choices visible gates also differ widely across vendors.

May/June 2020
75
Top Picks

Finally, qubit states are


extremely fragile and difficult
to control. On current sys-
tems, typical 2Q error rates
are 1% –10%. Gate error rates
also have large spatial and
temporal variations depend-
ing on the qubit technology.
Figure 3 shows these varia-
tions for an IBM system. If a
program is executed on a sub-
set of unreliable qubits, the
success rate is greatly dimin-
Figure 2. Characteristics of the devices used in our study. Each device has different ished. In addition, quantum
qubit and gate count (higher is better), coherence time (higher is better), error rates state “decoheres” or loses
(lower is better), and topology (dense connectivity is better). reliability exponentially with
time. That is, if a qubit is ini-
tialized, there is a short win-
Furthermore, QC devices have a qubit connec- dow of coherence time within which all gates in
tion topology which determines the amount of the program must be completed. On current
communication required to perform 2Q gates. As superconducting systems, coherence time is typ-
shown in Figure 2, device topology varies across ically less than 100 ms and varies across qubits.
systems, with sparse nearest-neighbor connec- However, 2Q gates are relatively fast, requiring
tivity in IBM and Rigetti, to full all-pairs connec- hundreds of nanoseconds. On trapped ion sys-
tivity in UMD. When full connectivity is not tems, coherence time is significantly longer
available, SWAP operations are used to enable 2Q (several seconds), but 2Q gates are slower,
gates between arbitrary pairs of qubits. These requiring several hundred microseconds.
SWAPs increase program duration and more Our work focuses on how these design trade-
importantly, worsen the success rate. The choice offs influence QC computer architecture and
of connectivity is not independent of the qubit software design. Toward this, we first develop
type. Trapped ion qubits naturally support a common compiler toolflow, TriQ, that maps
full connectivity, at least at small scales, while programs onto diverse QC systems. Enabled by
superconducting qubits typically use sparse con- TriQ, we perform real-system experiments to
nectivity because of difficulties in implementing explore the architectural design space.
dense physical interconnections.

TRIQ: FULL-STACK MULTIVENDOR


QC TOOLFLOW
Figure 4 illustrates the overall structure of
TriQ. TriQ accepts Scaffold5 programs as input.
Scaffold is a C-like quantum language which has
been used to develop large QC applications.
Using ScaffCC, Scaffold’s front-end compiler,
TriQ generates an intermediate representation
(IR) of the program and uses it as the input
for the subsequent optimization passes. TriQ
Figure 3. Daily variation of error rates of four also takes hardware and system-specific features
hardware supported two-qubit controlled NOT gates in such as gate sets, connectivity, and noise informa-
IBMQ14. The average error rate is approximately 8%, tion (from daily calibration logs for the systems)
but there is up to 9x variation across qubits and days. as configurable inputs. Hardware-dependent

IEEE Micro
76
optimization using these inputs
is a distinguishing feature of TriQ
that allows it to obtain high
success rates across platforms.
As output, TriQ generates opti-
mized code in the vendor-
specified assembly code.
To compile the IR, the first
step is to map program qubits
onto distinct hardware qubits.
For example, program qubits can
be assigned to hardware qubits Figure 4. Overview of the TriQ toolflow. Inputs are high-level Scaffold programs
according to the order they are and their inputs, as well as device-specific QC system properties. Output is
used in the program. This policy optimized code in one of three vendor-specific executable formats.
can result in high communication
overhead and poor success rate
when qubits participating in 2Q gates are not Second, TriQ schedules gates in the program
mapped close together. If program qubits are in a topologically-sorted order using the IR. This
mapped onto unreliable hardware qubits, it allows maximum operations to be executed in
can further worsen the success rate. Therefore, parallel, reducing the errors due to qubit deco-
TriQ uses a noise-adaptive mapping strategy herence. For devices which do not support full
which optimizes both communication and reli- connectivity, TriQ automatically inserts the nec-
ability. TriQ chooses a set of qubits that match essary communication operations to bring
well with the communication requirements of qubits into adjacent positions before executing
the application and simultaneously, it ensures 2Q gate. To improve success rates, TriQ incorpo-
that this set of qubits has low error rates for rates noise-awareness in this step by selecting
the instruction mix of the application. TriQ the lowest error rate paths for moving qubits,
implements this policy using a satisfiability rather than any shortest distance path.
modulo theory (SMT) optimization, solved Third, TriQ translates high-level IR gates into
using Microsoft’s Z3 SMT solver. device-specific IR. Using a set of legal code trans-
To flexibly target different devices, we formations that are provided as input, TriQ
designed the SMT optimization to work with an replaces IR gates with equivalent device-specific
abstract representation of the hardware. TriQ gates, e.g., OpenQASM code for IBM systems.
preprocesses the target device’s connectivity During this pass, TriQ also applies a 1Q gate
graph and gate error rate data and converts optimization where continuous sequences of 1Q
them to a reliability matrix representation. gates are compressed into shorter sequences.
For each pair of qubits, the matrix specifies the TriQ exploits knowledge of hardware error rates
reliability of the lowest error rate path for a 2Q in this step as well. On all three vendors, single
gate between the qubits. When two hardware qubit rotations gates along the Z-axis of the
qubits are far away in the communication topol- qubit have no error.6 While compressing gate
ogy, the reliability of the best path will be low. sequences, TriQ maximizes the use of these Z
It will also be low if all paths between the two rotations, further increasing success rates.
qubits have high error rate edges. Therefore,
using the matrix, TriQ can pick communication-
and reliability-optimized mappings. Since the REAL-SYSTEM ARCHITECTURAL
core functionality of the pass operates using STUDIES USING TRIQ
this matrix abstraction, we can flexibly compute We performed real-system measurements for
good mappings for any device topology and a set of 12 benchmarks on 7 QC systems. These
noise profile simply by changing compile-time benchmarks include important QC kernels such
inputs. as the Toffoli gate and quantum Fourier transform

May/June 2020
77
Top Picks

Figure 5. Success rate for 12 benchmarks on 7 systems. Success rates varies drastically across systems and is
influenced by error rates, qubit connectivity, and application-machine topology match. Benchmarks that are too large to be
mapped onto a machine are marked “X.” This comparison is intended to understand the impact of architectural design
choices such as gate set and connectivity on benchmark performance and is not intended to pick a winning technology,
vendor or implementation. Individual benchmark performance numbers may change over time. These measurements
represent a snapshot of the performance of these systems when we performed the experiments.

operation. To understand architectural choices, shows that machines with dense qubit connec-
we performed multiple experiments with each tivity are less sensitive to application character-
benchmark and system, varying the level of opti- istics and allow a wider variety of programs to
mization and the inputs used for compilation. We execute successfully. Compared to a baseline,
used three main variants of the compiler with TriQ’s communication optimizations offer up
increasing levels of optimization for gate sequen- to 22X reduction in 2Q gate counts. For certain
ces, communication and for noise-adaptivity and programs, this means the difference between a
a fourth baseline version with no optimization. failed execution where noise corrupts the output
We compared different executables in terms and a successful execution where the correct
of instruction count and success rate. Figure 5 answers dominate. When the architecture does
shows the measured success rates using TriQ’s not have full connectivity, compilers like ours
full optimizations. The key insights from our can allow applications to take maximum advan-
study are summarized next. tage of the available hardware resources.
Importance of Gate Set Specificity: We studied Importance of Noise Adaptivity: Our work
whether it is beneficial to expose native gates to shows that the noise variability in QC hardware
software, instead of abstracting them in a device- can be effectively mitigated by software techni-
independent gate set. When TriQ has information ques. By mapping programs onto reliable regions
about the native gate set, the gate optimization of the hardware and orchestrating communica-
passes offers significant benefits. TriQ expresses tion along reliable hardware paths, TriQ effec-
several program instructions using a small num- tively shields applications from spatiotemporal
ber of native gates, leading to an average 50% noise variations. These optimizations provide
reduction in the instruction count and up to 26% further average success rate gains of 2.8X over
increase in success rate. Therefore, unlike prior gate and communication optimizations, and
proposals for device-independent ISAs for QC sys- allows more applications to execute successfully.
tems,7; 8 our results show that such abstractions Put together, TriQ’s optimizations offer up to 1.5-
are detrimental to high success rates. We recom- 28X higher success rates than IBM’s Qiskit,
mend that vendors make the most low-level native Rigetti’s Quil compiler, and hand optimized code
gates in their devices software visible. As an anal- from UMD. Our work is the first to show that such
ogy to classical microprocessors, this is similar to optimizations are important even on trapped ion
making microoperations software visible.9 systems which have less variability. Noise varia-
Importance of Qubit Connectivity: Our work tions are likely in all near-term QC systems in the
demonstrates that the match between applica- next 5 to 10 years. Therefore, compilers like TriQ
tion communication requirements and device will be crucial for reliable program executions.
topology significantly crucially impacts success TriQ’s functionality is portable across
rates. Comparing near-neighbor versus fully- diverse platforms while still performing full
connected systems (like IBM and UMD systems) top-to-bottom optimizations for device and

IEEE Micro
78
application characteristics provided as compile- When they are not well-matched, successful
time inputs. Leveraging microarchitecture executions are unlikely.
details such as native gate sets and noise rates Our work also breaks new ground in QC
was the key to our improvements. Therefore, benchmarking by being distinct from the exist-
QC systems are not yet ready for device-indepen- ing practices of measuring isolated hardware
dent abstraction layers that hide and obstruct characteristics or benchmarking custom-
information flow between hardware and designed applications. On one hand, vendors
software. characterize systems in terms of metrics such as
gate error rates and qubit coherence times.
These metrics are isolated measurements for
IMPACT OF OUR WORK each hardware component, and not direct meas-
Recently, tech news was dominated by dis- urements of program behavior. TriQ enables
cussions of Google’s so-called “quantum suprem- direct and accurate measurements of program
acy” announcement and reactions from other behavior across widely divergent QC platforms.
scientists and QC vendors.1 While QC systems In classical computing, this is akin to the differ-
offering high revenue streams (e.g., as cloud ence between knowing characteristics like core
accelerators) are still in the future, clearly QC is counts and clock rate, versus knowing actual
increasing in importance and has reached an benchmark performance. On the other hand,
inflection point in terms of engineering achieve- vendors have developed benchmarking applica-
ments in real implementations. This makes our tions such as quantum volume. These methods
work extremely timely, with high potential for use a family of custom generated
impact. Just this year, several aca- circuits to measure hardware qual-
demic and industry vendors have Our study features ity. Our work does not field a pre-
already adjusted their compiler systems with different ferred benchmark, but instead
toolflows and aspects of their qubit, noise, and relies on a suite of diverse applica-
exposed gate sets in response to architectural attributes tions to understand the impact of
and provides important
our work. Our optimizations are hardware on applications. This is
insights for designing
already part of IBM’s Qiskit Terra similar to the difference between
better architecture and
compiler as of version 0.8 and benchmarking supercomputers
hardware. These
Rigetti’s Quil compiler version insights will likely with LINPACK or other dedicated
1.16. TriQ, open sourced at influence future QC algorithms, and measuring the per-
https://github.com/prakashmur- ISA design. formance of real applications. We
ali/TriQ is also the first compiler believe that this approach of appli-
for trapped ion systems. cation-based benchmarking will
Our study features systems become common practice in QC, much like how
with different qubit, noise, and architectural benchmark suites such as SPEC are used for clas-
attributes and provides important insights for sical benchmarking.
designing better architecture and hardware. Most importantly, our work represents a sig-
These insights will likely influence future QC nificant advance on the way to practically viable
ISA design. Although QC applications work QC, which requires us to close a five to six order
with any universal gate set, we demonstrate of magnitude gap between algorithm needs
that shielding the natural gates for a qubit and device capabilities. Our work demonstrates
technology by abstracting them into more methods for achieving up to two orders of mag-
common gates imposes severe reliability and nitude improvements in program success rates
performance overheads on NISQ systems. and our approaches work well across vendor
Future QC ISAs need to work in tandem with implementations. In a world where increasing
the underlying qubit technology. Our work qubit count comes only with great engineering
also underscores the importance of matching effort, our work offers substantial and orthogo-
the application’s communication requirements nal advances over underlying hardware progress
and hardware topology by codesigning them. alone.

May/June 2020
79
Top Picks

ACKNOWLEDGMENTS Norbert M. Linke is currently an Assistant Professor


with the University of Maryland, College Park, and a
This work was supported in part by EPiQC,
Fellow of the Joint Quantum Institute (JQI) working on
an NSF Expedition in Computing, under Grant
quantum algorithm implementations, quantum simula-
CCF-1730082. The work of Cinthia Huerta Alder-
tions, and quantum networking with trapped ions. He
ete was supported by CONACYT under Doctoral was a Postdoctoral Researcher and Research Scientist
Grant 455378. with the Group of Chris Monroe, University of Maryland.
Linke received the graduate degree from the University
of Ulm, and the doctorate degree from the University of
& REFERENCES Oxford. Contact him at linke@umd.edu.

1. Frank Arute et al. “Quantum supremacy using a Margaret Martonosi is currently the Hugh Trum-
programmable superconducting processor,” Nature, bull Adams ’35 Professor of Computer Science with
vol. 574, no. 7779, pp. 505–510, 2019. Princeton University. Her research focuses on com-
2. P. Murali, N. M. Linke, M. Martonosi, A. J. Abhari, puter architecture and hardware–software interface
issues in both classical and quantum systems. Mar-
N. H. Nguyen, and C. H. Alderete, “Full-stack, real-
tonosi received the Ph.D. degree in electrical engi-
system quantum computer studies: Architectural
neering from Stanford University. She is a Fellow of
comparisons and design insights,” in Proc. 46th Int.
IEEE and the Association for Computing Machinery
Symp. Comput. Archit., 2019, pp. 527–540.
(ACM). Contact her at mrm@princeton.edu.
3. IBM, “IBM Qiskit,” 2018, Accessed on: Jan. 1, 2020.
[Online]. Available: https://qiskit.org/ Ali Javadi Abhari is currently a Research Staff
4. Rigetti, “QuilC compiler,” 2020, Accessed on: Jan. 1, Member with IBM, Armonk, NY, USA, and a Manager
of the Quantum Compiler Group. His research interests
2020. [Online]. Available: https://github.com/rigetti/quilc
include quantum computing software, compilation,
5. A. J. Abhari et al., “Scaffold: Quantum programming
and architecture. Javadi Abhari received the
language,” Princeton University, Princeton, NJ, USA,
Ph.D. degree in electrical engineering from Princeton
Tech. Rep. TR-934-12, 2012.
University. Contact him at ali.javadi@ibm.com.
6. D. C. McKay, C. J. Wood, S. Sheldon, J. M. Chow, and
J. M. Gambetta, “Efficient z gates for quantum Nhung Hong Nguyen is currently a Ph.D. student
computing,” Phys. Rev. A, vol. 96, Aug. 2017, with Linke Lab, University of Maryland. She was a
Research Assistant with the Center for Quantum
Art. no. 022330.
Technology, Singapore, working in satellite quantum
7. X. Fu et al., “A microarchitecture for a
key distribution. Her research focuses on digital
superconducting quantum processor,” IEEE Micro,
quantum simulation, algorithms implementation, and
vol. 38, no. 3, pp. 40–47, May 2018.
error encoding on trapped ions. Nguyen received
8. A. W. Cross, L. S. Bishop, J. A. Smolin, and the B.S. degree in physics from Nanyang Technolog-
J. M. Gambetta, “Open quantum assembly language,” ical University, Singapore, working on surface spec-
2017, arXiv:1707.03429. [Online]. Available: https:// troscopy with neutral atoms. Contact her at
arxiv.org/abs/1707.03429 nhunghng@umd.edu.
9. M. V. Wilkes, “The best way to design an automatic
Cinthia Huerta Alderete is currently a Ph.D. stu-
calculating machine,” in The Early British Computer
dent with the National Institute of Astrophysics,
Conferences. Cambridge, MA, USA: MIT Press, 1989,
Optics and Electronics (INAOE), San Andrés Chol-
pp. 182–184.
ula, Mexico, currently on a research stay at the Joint
Quantum Institute, University of Maryland. Her
research is focused on, but not limited to, the simula-
Prakash Murali is currently a Ph.D. student in the tion of paraparticle oscillators in a trapped-ion sys-
Computer Science Department, Princeton University. tem. Aside from this topic, she had collaborated on a
His research focuses on accelerating the progress few projects based on the circuit implementation of
toward practical quantum computation using com- different phenomena in quantum physics. Contact
puter architecture and compilation techniques. her at aldehuer@umd.edu.
Contact him at pmurali@cs.princeton.edu.

IEEE Micro
80
Theme Article: Top Picks

Speculative Taint Tracking


(STT): A Comprehensive
Protection for Speculatively
Accessed Data
Jiyong Yu Adam Morrison
University of Illinois at Urbana–Champaign Tel Aviv University
Mengjia Yan Josep Torrellas and Christopher W. Fletcher
Massachusetts Institute of Technology University of Illinois at Urbana–Champaign
Artem Khyzha
Tel Aviv University

Abstract—Speculative execution attacks present an enormous security threat, capable of


reading arbitrary program data under malicious speculation, and later exfiltrating that
data over microarchitectural covert channels. This article proposes speculative taint
tracking (STT), a high-security and high-performance hardware mechanism to block these
attacks. The main idea is that it is safe to execute and selectively forward the results of
speculative instructions that read secrets, as long as we can prove that the forwarded
results do not reach potential covert channels. The technical core of the article is a new
abstraction to help identify all covert channels, and an architecture to quickly identify
when a covert channel is no longer a threat. We further conduct a detailed formal analysis
on the scheme and prove security in a companion document. When evaluated on
SPEC06 workloads, STT incurs 8.5% or 14.5% performance overhead relative to an
insecure machine.

& SPECULATIVE EXECUTION ATTACKS such as


Spectre5 have opened a new chapter in hardware
security. In these attacks, malicious speculative
Digital Object Identifier 10.1109/MM.2020.2985359 execution causes doomed-to-squash instruc-
Date of publication 6 April 2020; date of current version 22 tions to access and later transmit secrets over
May 2020. covert channels such as the cache.9 For

May/June 2020 Published by the IEEE Computer Society 0272-1732 ß 2020 IEEE
81
Top Picks

example, in Spectre V1 (see Figure 1) a branch To be secure and efficient, we address two key
misprediction enables the attacker to access challenges.
and leak/transmit arbitrary program data by
 First, we develop an abstraction that indi-
controlling the out-of-bounds address &array1
[off]. We refer to such data, which is brought cates how and when instructions can form
into the pipeline by a speculative instruction, as covert channels, so as to delay data forward-
secret. ing to the latest safe time.
 Second, we identify and develop a microarch-
A secure, but conservative, way to block all
speculative execution attacks— itecture to indicate exactly when data should
regardless of covert channel—is to be considered secret, so as to
delay executing all instructions that This article proposes a re-enable data forwarding at
can access a secret until such new abstraction the earliest safe time.
instructions become nonspecula- through which to view
tive. In nearly all attacks today, this covert channels on Challenge #1: New Abstractions
would imply blocking all loads until speculative microarchi- for Describing All
they are nonspeculative, which tectures, discovers Microarchitectural Covert
new points where
would be tantamount to disabling Channels
instructions can create
speculative execution. Covert channels come in dif-
covert channels, and
This article proposes a princi- ferent shapes and sizes. For
discovers a new class
pled, high-performance mechanism of covert channels example, attackers can monitor
that achieves the same security how loads interact with the
guarantee as the above conserva- cache,5 the timing of SIMD units,6
tive scheme. The key idea is that speculative execution pipeline port contention,2 branch pre-
execution is safe unless speculatively accessed dictor state,1 and more. To comprehensively
data (secrets) reaches a covert channel. In many block leakage through these different channels,
cases, speculative instructions either do not leak it is necessary to understand their common
secrets or do not form covert channels, and so characteristics.
can execute freely under speculation. For exam- To address this challenge, this article pro-
ple, the first load in Spectre V1 forms a covert poses a new abstraction through which to view
channel, but it only leaks the attacker-selected covert channels on speculative microarchitec-
address &array1[off]—not the secret data in tures, discovers new points where instructions
that address. Likewise, many instructions (e.g., can create covert channels, and discovers a new
simple arithmetic) do not form covert channels class of covert channels. We find that all covert
even if their operands are secret values. channels are one of two flavors, which we call
This article presents speculative taint track- explicit and implicit channels (which are related
ing (STT), a framework that tracks the flow of to explicit and implicit information flow8). In an
speculatively accessed data through in-flight explicit channel, data is directly passed to an
instructions (similar to dynamic information instruction whose execution creates operand-
flow tracking/DIFT7) until it is about to reach an dependent hardware resource usage, and that
instruction that may form a covert channel. STT resource usage reveals the data. For example,
then delays the forwarding of the data until how a load impacts the cache depends on the
the instruction becomes nonspeculative or load address.5 In an implicit channel, data indi-
the execution squashes due to mis-speculation. rectly influences how (or that) an instruction(s)
execute, and these changes in resource usage
reveal the data. For example, the instructions
executed after a branch reveal the branch predi-
cate.2,6 This article further defines subclasses of
the implicit channel, based on when the leakage
occurs and based on the nature of the secret-
Figure 1. Spectre variant 1. dependent condition that forms the channel.

IEEE Micro
82
Key Advance: Safe Prediction. Through its investigation of implicit channels, this article makes a key advance by showing how to use hardware predictors safely. Spectre attacks were born from attackers mistraining predictors to leak secrets. Through its abstraction for implicit channels, STT enforces a policy that prevents arbitrary predictor mistraining from leaking any secret data over any covert channel. The article shows how this enables existing predictors to stay enabled without leaking privacy, dramatically improving performance. In the future, we expect the idea of safe prediction to enable further innovation, e.g., by enabling the design of new predictors without fear of opening new security holes. Indeed, our follow-on work uses this idea to safely improve the performance of instructions that create explicit channels.11

Challenge #2: Mechanisms to Quickly and Safely Disable Protection
Once we have mechanisms to block secret data from reaching covert channels, the next question is when and how to disable that protection, if speculation turns out to be correct. This is crucial for performance, as delaying data forwarding longer than necessary increases the chance that delayed instructions reach the head of the reorder buffer (ROB) and block retirement.
STT tackles this problem with a safe but aggressive approach, by re-enabling data forwarding as soon as data becomes a function of retired register file state. This represents the earliest safe point, but is nontrivial to implement in hardware. For example, a delayed instruction's operand(s) may be the result of a complex dependence chain across many control flow and speculative operations. Intuitively, determining that data is a function of nonspeculative information would require retracing a backwards slice of the program's execution, which is costly to do quickly.
Despite the above challenges, STT proposes a simple hardware mechanism that can disable protection/re-enable forwarding for an arbitrary instruction in a single cycle, using hardware similar to traditional instruction wake-up logic. The key idea is that to determine whether data is a function of retired state, it is sufficient to determine whether the youngest load, whose return value influences the data, has become nonspeculative. Checking this condition is akin to tracking a single extra dependence for each instruction, as opposed to performing complex backwards slice tracking.

Security Guarantees and Formal Analysis
Alongside the main paper, we formally prove that STT enforces a novel form of noninterference3 with respect to speculatively accessed data. In a nutshell, we show that hardware resource usage patterns over time are independent of data that eventually squashes (covering microarchitectural interference- and timing-based attacks). We released a companion technical report12 with detailed formal analysis and a security proof for this property.

Putting It All Together
Putting everything together, STT provides both high security and high performance. It does not require partitioning or flushing microarchitectural resources, and does not require changes to the cache/memory subsystem or the software stack. When evaluated on SPEC06 workloads, STT incurs 8.5% or 14.5% performance overhead (depending on the threat model) relative to an insecure machine.

ATTACKER MODEL AND PROTECTION SCOPE
Attacker Model. STT assumes a powerful adversary that can monitor any microarchitectural covert channel from anywhere in the system, and induce arbitrary speculative execution to access secrets and create covert channels. For example, the attacker can monitor covert channels through the cache/memory system,5 data-dependent arithmetic,4 port contention,2 branch predictors,1 etc.
Scope: Protecting Speculatively Accessed Data. We distinguish attacks based on whether the access instruction is doomed-to-squash (transient) or bound to retire (nontransient). STT's goal is to block attacks involving doomed-to-squash access instructions, shown in Figure 2. These attacks can access data that a correct (not mis-speculated) execution would never access, which often results in being able to read from any location in memory. Attacks involving bound-to-retire access instructions are out of scope. They can only leak retired (or bound to retire) register file state, not arbitrary memory, and their leakage can be reasoned about by programmers or compilers and blocked using complementary techniques (e.g., Yu et al.10).

Figure 3. STT’s new classification schema for


covert channels.

can feature either an explicit or an implicit branch.


Figure 2. STT’s scope is to protect speculatively accessed data An explicit branch is a control-flow instruction,
from leaking over any microarchitectural covert channel. while an implicit branch is a conceptual branch
Protecting values which have retired is outside of scope. formed in the hardware due to an optimization.
For example, store-to-load forwarding between a
scope. They can only leak retired (or bound to store and a younger load can be viewed as an
retire) register file state, not arbitrary memory, implicit branch that checks for an address alias as
and their leakage can be reasoned about by pro- shown in Figure 4. Written this way, it is clear that
grammers or compilers and blocked using com- store-to-load forwarding can create a covert chan-
plementary techniques (e.g., Yu et al.10). nel: depending on whether there is an alias, the
processor either looks up the cache or forwards
from the local store queue.
ABSTRACTION FOR COVERT
CHANNELS Insights From Analysis of Implicit Channels
STT proposes a novel abstraction for covert Since it was proposed in the STT paper, the
channels (see Figure 3). In our abstraction, classification for implicit channels has proven to
covert channels are broken into two classes: be a robust and useful way to represent and pin-
explicit and implicit channels. An explicit channel, point the root cause of hardware security vulner-
related to explicit flow in information flow,8 is abilities. For example, in the NetSpectre attack,6
one where data (e.g., a secret) is directly passed it might be said that the attack root cause is
to an instruction whose execution creates oper- SIMD unit power-on time. STT’s abstraction
and-dependent hardware resource usage, and shows, however, that the root cause is an
that resource usage reveals the data. An example explicit branch, and that “fixing” the SIMD unit
is a load instruction’s changes to the cache does not prevent the attack.
state. An implicit channel, related to implicit Even more subtly, the abstraction demon-
flow,8 is one where data indirectly influences how strates and provides cases where implicit flow
(or whether) an instruction or several instruc- and privacy leakage do occur, despite not occur-
tions execute, and these changes in resource ring according to program semantics. Consider a
usage reveal the data. An example is a branch simple example where a load is control- and
instruction, whose outcome determines subse- data-independent of a sensitive branch, e.g., “if
quent instructions and thus whether some func- (secret == rV) { rX <- rW; } load rZ <- (rY);”.
tional unit is used. How this load executes is important, to under-
We further distinguish between implicit chan- stand whether a potential cache-based covert
nels, depending on when they leak secrets and channel exists. Traditional software-level anal-
what type of branch they feature. First, we find ysis would indicate that the execution of the
that implicit channels can leak at two points: when load is independent of the secret (the branch
a prediction is made (e.g., a branch prediction) outcome). Yet, on a speculative microarchitec-
and when a resolution occurs (e.g., a branch ture, STT’s abstraction would classify this
resolves). Second, we find that implicit channels code as a resolution-based implicit channel.

Figure 4. Rewriting a store-load pair as an implicit branch. implIf reveals a potential covert channel as a function of memory aliasing to the older store. This occurs if the microarchitecture supports store-to-load forwarding or memory-dependence speculation.
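Figure 4 itself is not reproduced here; the following C-style sketch conveys the rewriting it depicts (the implIf condition is a conceptual hardware branch, not a real instruction, and the variable names are illustrative):

    /* Program order: an older store followed by a younger load. */
    *st_addr = st_val;                /* older store  */
    val = *ld_addr;                   /* younger load */

    /* How a core with store-to-load forwarding effectively executes
       the load (the "implIf" implicit branch of Figure 4): */
    if (st_addr == ld_addr) {         /* implIf: address alias? (may be predicted) */
        val = st_val;                 /* forward from the store queue */
    } else {
        val = *ld_addr;               /* look up the cache */
    }

Either arm of the implicit branch leaves different microarchitectural traces, which is exactly the covert channel the caption describes.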
Figure 5. Resolution-based implicit channel due to secret-dependent pipeline squashes. When the branch (B) resolves, it leaks the secret based on whether a squash occurs, as this causes the younger load to execute once or twice. There is an analogous case when the (public) predictor state takes the branch.
As shown in Figure 5, if the branch mis-speculates and subsequently squashes, the load may execute either once or twice depending on the value of secret.
Finally, the abstraction applies to a large set of microarchitectural optimizations. For example, the representation of store-to-load forwarding (see Figure 4) also captures the behavior of memory-dependence speculation with a store set predictor. Here, the store set predictor is modeled as a prediction on the implicit branch (implIf in the figure). As we will see, being able to represent different optimizations as predictions on implicit branches will enable STT to apply a uniform mechanism to block leakage through a variety of structures (e.g., branch, store set, etc., predictors).

STT: DESIGN
Framework and Concepts
STT requires that the microarchitect define what instructions write secrets into registers (access instructions, mainly loads), what instructions can form explicit channels (transmitters), and what instructions form implicit channel branch predicates (for both explicit and implicit branches). Finally, the architect must define the Visibility Point, after which speculation is considered safe (e.g., at the point of the oldest unresolved branch, or at the head of the ROB). If the visibility point refers to an instruction older than an access instruction, we call the access instruction unsafe; otherwise it is considered safe.
We provide guidelines for microarchitects on identifying access and transmit instructions. An instruction should be classified as an access instruction if it has the potential to read a secret. Except for loads, there are only a handful of such instructions, which can be identified manually.
An instruction should be classified as a transmit instruction if its execution creates operand-dependent resource usage that can reveal the operand (partially or fully). Identifying implicit branches is similar: the architect must analyze whether the resource usage of some in-flight instruction changes as a function of some other instruction's operand. This definition can be formalized by analyzing (offline) how information flows in each functional unit at the SRAM-bit and flip-flop levels to determine whether resource usage depends on the input value, in the style of the OISA10 or GLIFT8 formal frameworks. Automatically performing such analysis is important future work.

Taint and Untaint Propagation
Conceptually, in each clock cycle, STT applies the following taint rules to instructions in the ROB:

• The output register of an access instruction is tainted if and only if the access instruction is unsafe.
• The output register of a non-access instruction is tainted if and only if at least one of its input operands is tainted.

In the implementation, taint propagation is piggybacked on the existing register renaming logic in an out-of-order core. Tainting is therefore fast. In contrast, it is difficult to propagate "untaint" to all dependencies of an access instruction that becomes safe in a single cycle. We address this with a single-cycle implementation for untaint in the "STT: Implementation" section.
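The two taint rules can be stated compactly in code. Below is a minimal C sketch of the per-cycle taint logic (the structure and field names are illustrative, not the paper's hardware interface; real STT folds this into the rename stage):

    #include <stdbool.h>

    typedef struct {
        bool is_access;       /* access instruction (mainly loads)              */
        bool is_unsafe;       /* younger than the visibility point => unsafe    */
        bool src_tainted[2];  /* taint bits of the renamed source operands      */
        bool dst_tainted;     /* taint bit of the output (destination) register */
    } Inst;

    /* Conceptually applied to every in-flight instruction each cycle. */
    void propagate_taint(Inst *i) {
        if (i->is_access) {
            /* Rule 1: tainted iff the access instruction is unsafe. */
            i->dst_tainted = i->is_unsafe;
        } else {
            /* Rule 2: tainted iff at least one input operand is tainted. */
            i->dst_tainted = i->src_tainted[0] || i->src_tainted[1];
        }
    }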

Unlike prior DIFT schemes,7 STT does not require tracking taint in any part of the memory system or across store-to-load forwarding. The reason is that because loads are access instructions, the taint of their output is determined only based on whether they have reached the visibility point. That is, the output of an unsafe load is always tainted.

Blocking Covert Channels
Given STT's rules for tainting/untainting data and its abstraction for covert channels, STT blocks all covert channels by applying a uniform rule across each type.

Blocking Explicit Channels. STT blocks explicit channels by delaying the execution of any transmit instruction whose operands are tainted until they become untainted. This scheme imposes relatively low overhead because it only delays the execution of transmit instructions if they have tainted operands. For example, a load that only reads a (potential) secret but does not transmit one—such as the load on line 2 in Figure 1—executes without delay. The load on line 3, however, will be delayed and eventually squashed, thereby defeating the attack.

Blocking Implicit Channels. STT blocks implicit channels by enforcing an invariant that the sequence of instructions fetched/executed/squashed never depends on tainted data. That is, STT makes the program counter independent of tainted data. To enforce this invariant efficiently, without needing to delay execution of instructions following a tainted branch, we introduce two general principles to neutralize the sources of implicit channels:

• Prediction-based implicit channels are eliminated by preventing tainted data from affecting the state of any predictor structure.
• Resolution-based implicit channels are eliminated by delaying the effects of branch resolution until the branch's predicate becomes untainted.

STT's principles can be applied to efficiently make any hardware predictor impossible to exploit as a covert channel for leaking speculatively accessed data.
Conceptually, the protection mechanism does not need to reason about whether an implicit channel is caused by an explicit or implicit branch: both types have a predicate and the policy with respect to the predicate is the same in both cases. The implementation, however, must identify the predicate. We illustrate this by showing how the STT microarchitecture handles explicit branches.
Applying Principle #1 (Prediction-Based Channels). STT requires that every frontend predictor structure be updated based only on untainted data. This makes the execution path fetched by the frontend unaffected by the output of unsafe access instructions. STT passes a branch's resolution results to the direct/indirect branch predictors only after the branch's predicate and target address become untainted; if the branch gets squashed before this, the predictor will not be updated.
Figure 6(c) demonstrates the effect of STT on a speculative execution of the code snippet in Figure 6(a), in which the branch B0 is mispredicted as taken. No matter how many experiments the attacker runs, the predicted direction of the branch B will not be a function of secret, because the branch predictor is not updated when B resolves. As a result, the execution path does not depend on secret (top versus bottom)—it only depends on the predicted branch direction (left versus right).
Applying Principle #2 (Resolution-Based Channels). STT delays squashing a branch that resolves as mispredicted until the branch's predicate becomes untainted. As a result, a doomed-to-squash branch with a tainted predicate (such as the branch B in Figure 6(c)) will never be squashed and re-executed, preventing the implicit channel leak discussed in the "Insights From Analysis of Implicit Channels" section. As Figure 6(c) shows, the doomed-to-squash branch B is eventually squashed once an older (mispredicted) branch with an untainted predicate squashes. Thus, the squash does not leak any information about the branch's resolution. Importantly, it is safe to resolve a branch as soon as its predicate becomes untainted, even if an older branch with a tainted predicate has not yet resolved.

Figure 6. STT executing the code in (a), which includes an untainted branch B0, an access instruction reading secret,
and an implicit channel (due to branch B). (a) Implicit channel formed through the squash/control dependency on B.
(b) When earlier branch B0 is predicted correctly. (c) When earlier branch B0 is predicted incorrectly (left: B predicts
taken, right: B predicts not taken).

STT only increases the latency of recovering from a tainted branch misprediction. For example, in Figure 6(b), the load does not execute immediately after B resolves. Fortunately, tainted branch mispredictions are only a small fraction of overall branch mispredictions, which are infrequent in the first place because successful speculation requires accurate branch prediction.
Implicit Branches. The STT paper applies the above principles to secure several common microarchitectural optimizations that can be formulated as implicit branches, namely: store-to-load forwarding, memory dependence speculation, and memory consistency speculation. In the process, the paper details various optimizations and cases which arise when dealing with implicit channels. In particular: whether the explicit/implicit branch has a prediction step, can be resolved early, or can be optimized in some other way. For example, because store-to-load forwarding can only result in two observable outcomes (issue the load or forward from a prior store), we hide which one occurs by unconditionally accessing the cache.

STT: IMPLEMENTATION
We previously assumed untaint information propagated along data dependencies instantly. This is difficult to implement in hardware because a word of tainted data may be a function of complex dependence chains involving many access instructions.
A tainted register needs to be untainted once all access instructions on which it depends reach the visibility point, i.e., become safe. Our key observation is that it suffices to track only when the youngest access instruction becomes safe, because instructions become nonspeculative in program order in the processor ROB. We call this youngest access instruction the youngest root of taint (YRoT).
Determining the YRoT is done through modifications to rename logic in the processor frontend. Specifically, the YRoT for an instruction X being renamed is given by the max of 1) the YRoT(s) of the instruction(s) producing the arguments for X, if those instructions are not access instructions; or 2) the ROB index of the instruction(s) producing the arguments for X, otherwise. (By convention, we assume the ROB index increases from ROB head to tail.) After rename, the YRoT is stored alongside the instruction in its reservation station and is conceptually an extra dependence for that instruction. When the visibility point changes, its new position is broadcast to in-flight instructions, akin to a normal writeback broadcast, and instructions whose YRoT is less than the visibility point's new position are allowed to execute (assuming their other dependencies are satisfied). The entire architecture requires modest changes to the frontend rename logic, storage in reservation stations for the YRoT, and logic to compare the YRoT to the visibility point, which is comparable to normal instruction wakeup logic.
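As a concrete illustration of the rename-time rule, a simplified C model of YRoT computation and wake-up follows (illustrative only; NO_YROT plays the role of the "-" entries in Figure 7, i.e., all producing access instructions have retired):

    #include <stdbool.h>

    #define NO_YROT (-1)  /* all dependent access instructions are retired */

    typedef struct {
        int  rob_index;   /* ROB position, increasing from head to tail */
        bool is_access;   /* access instruction (e.g., a load)          */
        int  yrot;        /* youngest root of taint                     */
    } Inst;

    static int max_int(int a, int b) { return a > b ? a : b; }

    /* Run once at rename; producers[] are the instructions writing
       x's source operands. */
    void compute_yrot(Inst *x, const Inst *const producers[], int n) {
        x->yrot = NO_YROT;
        for (int i = 0; i < n; i++) {
            const Inst *p = producers[i];
            int root = p->is_access ? p->rob_index  /* case 2 in the text */
                                    : p->yrot;      /* case 1 in the text */
            x->yrot = max_int(x->yrot, root);
        }
    }

    /* Evaluated when the visibility point's new position is broadcast. */
    bool may_execute(const Inst *x, int visibility_point) {
        return x->yrot < visibility_point;  /* plus normal operand readiness */
    }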
Figure 7 shows an example. Assume the Spectre attack model, i.e., that the visibility point will be set to the ROB index of the oldest unresolved branch. The ROB contains three unresolved branches (B1–B3) and a transmit instruction (M3) whose operand/address r8 is a function of the return value of two access instructions (M1 and M2). M3 is a transmit instruction (because it is a load) and can potentially leak secrets because mis-speculations on branches B1 and B2 can influence the data returned by loads M1 and M2, which in turn contribute to the address of M3 through data dependencies.
Figure 7. Example of YRoT tracking showing a snapshot of ROB state. Addition (add) instructions are used to represent arithmetic (non-loads). If the YRoT is set to "-," that means the instruction's youngest dependent access instruction is a part of retired state.

On one hand, the data dependence chain from load M1 all the way to load M3 is quite complex. That is, the instruction at ROB index 6 depends on index 5 and index 3, index 8 depends on 6, etc. Re-traversing this dataflow graph to propagate untaint, akin to tracing backwards slices, would be expensive. On the other hand, the YRoT dependence chain is relatively simple. Each instruction just tracks whichever is the youngest load that contributes to its dependence chain (e.g., load M2 for instructions 6, 8, and 9). When branches B1 and B2 resolve, the visibility point advances to branch B3 (ROB index 7). Since 7 is greater than 5 (the YRoT for the transmit instruction M3), M3 is now allowed to execute. Note, the dependence chain could have been more complex, with additional branches and arithmetic dependencies separating load M2 and load M3, but this would not change the moment that it is safe to execute load M3.
Importantly, the above scheme is only secure after applying STT's mechanisms to block both explicit and implicit channels (see the "STT: Design" section). That is, the scheme requires that r8 is not a function of speculative data at the exact moment load M2 becomes nonspeculative. This requires that branch B3 not be influenced by speculative data (achieved by protections for implicit channels) and that other intervening instructions that can cause explicit channels not execute until they are likewise safe (achieved by protections for explicit channels).

FORMAL ANALYSIS/SECURITY PROOF
We formally prove12 that STT enforces a novel notion of noninterference: at each step of the execution, the value of a doomed register—a register written to by a bound-to-squash access instruction—does not influence future visible events in the execution. This applies to all microarchitectural timing- and interference-based attacks. For instance, the property ensures that the program's completion time and hardware resource usage—for all hardware structures including cache, branch predictor, etc.—is completely independent of doomed values.
The key challenge in the analysis is how to avoid "looking into the future" to determine if an instruction is doomed to squash. We address this by running the STT machine alongside a nonspeculative in-order processor, which allows us to verify the STT machine's branch predictions and determine whether a prediction leads to mis-speculation.

Figure 8. Performance evaluation on SPEC06 and PARSEC benchmark suites. STT outperforms the baseline secure scheme (DelayExecute) with much smaller performance overhead, for both Spectre and Futuristic attacker models.

EVALUATION RESULTS
We evaluate STT on 21 SPEC and 9 PARSEC workloads. The results are shown in Figure 8. Relative to an insecure machine, STT adds only 13.0%/18.2% overhead (averaged across both SPEC and PARSEC benchmarks), depending on whether the attack model considers only control-flow speculation (Spectre) or all types of speculation (Futuristic). This indicates that defending against stronger attack models is viable with STT without sacrificing much performance. Compared to the baseline secure scheme (DelayExecute) described in the introduction, STT reduces overhead by 4.0× in the Spectre model and 10.5× in the Futuristic model, on average.

ACKNOWLEDGMENTS
We thank J. Emer, S. Adve, and S. Mukherjee for very helpful discussions. We would especially like to thank our colleagues at Intel who contributed significant feedback throughout the project's development, in particular, F. Liu, M. Fernandez, F. McKeen, and C. Rozas. This work was supported in part by NSF under Grant CNS-1816226; in part by Blavatnik ICRC at TAU, ISF under Grant 2005/17; and in part by an Intel Strategic Research Alliance Grant.

REFERENCES
1. O. Aciicmez, J.-P. Seifert, and C. K. Koc, "Predicting secret keys via branch prediction," in Proc. 7th Cryptographers' Track RSA Conf. Topics Cryptology, 2007, pp. 225–242, doi: 10.1007/11967668_15.
2. A. Bhattacharyya et al., "SMoTherSpectre: Exploiting speculative execution through port contention," in Proc. ACM SIGSAC Conf. Comput. Commun. Secur., 2019, pp. 785–800.
3. J. A. Goguen and J. Meseguer, "Security policies and security models," in Proc. IEEE Symp. Secur. Privacy, 1982, Art. no. 11.
4. J. Großschädl, E. Oswald, D. Page, and M. Tunstall, "Side-channel analysis of cryptographic software via early-terminating multiplications," in Proc. Int. Conf. Inf. Secur. Cryptology, 2009, pp. 176–192.
5. P. Kocher et al., "Spectre attacks: Exploiting speculative execution," in Proc. IEEE Symp. Secur. Privacy, 2019, pp. 1–19.
6. M. Schwarz, M. Schwarzl, M. Lipp, and D. Gruss, "NetSpectre: Read arbitrary memory over network," in Proc. Eur. Symp. Res. Comput. Secur., 2019, pp. 279–299.
7. G. E. Suh, J. W. Lee, D. Zhang, and S. Devadas, "Secure program execution via dynamic information flow tracking," in Proc. 11th Int. Conf. Archit. Support Program. Lang. Operating Syst., 2004, pp. 85–96.
8. M. Tiwari, H. M. Wassel, B. Mazloom, S. Mysore, F. T. Chong, and T. Sherwood, "Complete information flow tracking from the gates up," in Proc. 14th Int. Conf. Archit. Support Program. Lang. Oper. Syst., 2009, pp. 109–120.
9. Y. Yarom and K. Falkner, "Flush+Reload: A high resolution, low noise, L3 cache side-channel attack," in Proc. USENIX Secur. Symp., 2014, pp. 719–732.
10. J. Yu, L. Hsiung, M. E. Hajj, and C. W. Fletcher, "Data oblivious ISA extensions for side channel-resistant and high performance computing," in Proc. 26th Netw. Distrib. Syst. Secur. Symp., 2019. [Online]. Available: https://eprint.iacr.org/2018/808.
11. J. Yu, N. Mantri, J. Torrellas, A. Morrison, and C. W. Fletcher, "Speculative data-oblivious execution (SDO): Mobilizing safe prediction for safe and efficient speculative execution," in Proc. Int. Symp. Comput. Archit., 2020.
12. J. Yu, M. Yan, A. Khyzha, A. Morrison, J. Torrellas, and C. W. Fletcher, "Speculative taint tracking (STT): A formal analysis," Univ. Illinois at Urbana–Champaign and Tel Aviv Univ., Tech. Rep., 2019. [Online]. Available: http://cwfletcher.net/Content/Publications/Academics/TechReport/stt-formal-tr_micro19.pdf.

Jiyong Yu is currently working toward the Ph.D. degree at the University of Illinois at Urbana–Champaign. His research interests are in processor security. Contact him at jiyongy2@illinois.edu.

Mengjia Yan is an Assistant Professor in the Electrical Engineering and Computer Science Department, Massachusetts Institute of Technology. Her research interest lies in the areas of computer architecture and hardware security, with a focus on side channel attacks and defenses. Yan received the Ph.D. degree from the University of Illinois at Urbana–Champaign (UIUC). Contact her at mengjia@csail.mit.edu.

Artem Khyzha is a Postdoctoral Fellow in the School of Computer Science, Tel Aviv University. His research interests include formal methods for software and hardware systems. Khyzha received a joint Ph.D. degree in computer science from Technical University of Madrid and IMDEA Software Institute. Contact him at artkhyzha@mail.tau.ac.il.

Adam Morrison is an Assistant Professor in the School of Computer Science, Tel Aviv University. His research interests include high-performance trustworthy computer and systems architecture. Morrison received the Ph.D. degree in computer science from Tel Aviv University. Contact him at mad@cs.tau.ac.il.

Josep Torrellas is the Saburo Muroga Professor of Computer Science at the University of Illinois at Urbana–Champaign (UIUC). He is the Director of the Center for Programmable Extreme Scale Computing and past Director of the Illinois-Intel Parallelism Center. His research interests include computer architecture and parallel processing. Torrellas received the Ph.D. degree from Stanford University. Contact him at torrellas@cs.uiuc.edu.

Christopher W. Fletcher is an Assistant Professor in computer science at the University of Illinois at Urbana–Champaign. He has interests ranging from computer architecture to security to high-performance computing (ranging from theory to practice, algorithm to software to hardware). Fletcher received the Ph.D. degree from Massachusetts Institute of Technology in 2016. Contact him at cwfletch@illinois.edu.

Theme Article: Top Picks

MicroScope: Enabling Microarchitectural Replay Attacks

Dimitrios Skarlatos, University of Illinois at Urbana–Champaign
Mengjia Yan, Massachusetts Institute of Technology
Bhargava Gopireddy, Nvidia
Read Sprabery, Google
Josep Torrellas and Christopher W. Fletcher, University of Illinois at Urbana–Champaign

Abstract—A microarchitectural replay attack is a novel class of attack where an


adversary can denoise nearly arbitrary microarchitectural side channels in a single run of
the victim. The idea is to cause the victim to repeatedly replay by inducing pipeline
flushes. In this article, we design, implement, and demonstrate our ideas in a framework,
called MicroScope, that causes repeated pipeline flushes by inducing page faults. Our
main result shows that MicroScope can denoise the port contention channel of execution
units. Specifically, we show how MicroScope can reliably detect the presence or absence
of as few as two divide instructions in a single logical run of the victim program. We also
discuss the broader implications of microarchitectural replay attacks.

& IT IS NOW well understood that modern pro- secrets at many points in a program’s
cessors leak secrets over microarchitectural execution.
side and covert channels. These channels Yet, a fundamental challenge for attackers
are seemingly everywhere—from the cache1–3 exploiting these channels is that the channels
to the branch predictor4 and other struc- are notoriously noisy. This means that multiple
tures5,6—and are capable of leaking program measurements of the same event often return
wildly different values. This occurs, for example,
when attempting to glean secret-dependent con-
Digital Object Identifier 10.1109/MM.2020.2986204 trol flow by measuring port contention inside
Date of publication 16 April 2020; date of current version 22 the pipeline.6 As a result, the attacker requires
May 2020. that the victim program run many times (e.g.,

To eliminate this limitation, this work introduces microarchitectural replay attacks (MRAs), a new class of attacks that enables an attacker to offset the measurement variation for (i.e., to denoise) potentially any microarchitectural side channel, even if the victim code is executed only once. The key observation is that, in modern out-of-order speculative cores, a dynamic instruction may be forced to execute multiple times due to pipeline squashes caused by page faults, exceptions, or other events. By forcing the squash and reexecution of an instruction multiple times, the attacker can repeatedly measure the execution characteristics of such instruction. We call this attack an MRA.
This work also describes and implements a specific family of MRAs that are applicable in the context of Intel's Software Guard Extensions (SGX).7 Specifically, in this environment, the attacker controls the operating system (OS) and, while the attacker cannot see the victim's data directly, it controls the victim's demand paging. Now, suppose that there is an instruction I that, based on secret data, forms a noisy side or covert channel. In addition, suppose that the attacker finds a public-address load L that is older than I and is in the reorder buffer (ROB) at the same time as I. In this case, the attacker can arrange for L to page fault after a long page walk (e.g., by clearing the present bit of the corresponding page table entry, and evicting the multilevel page table entries from the cache). While the page walk is underway, I executes and the attacker observes a noisy sample. Then, the OS pretends to service the page fault but keeps the present bit cleared. As a result, L will go through the page walk again and I will execute again. This process is repeated many times, causing the replay of I an arbitrary number of times until the signal-to-noise ratio is reduced enough that the secret is leaked. All the while, the victim has logically run only once.
This article introduces microarchitectural replay attacks, and provides a complete prototype tool called MicroScope* that runs the attack on real hardware. We investigate the attack's ability to leak secrets in straight-line code, branches, and loops. As a proof of concept, we demonstrate how the attack can denoise the notoriously noisy side channel of pipeline port contention in a single run of the victim. Finally, we discuss how changing different parameters in the attack setup yields new flavors of the attack—e.g., enabling an attack to theoretically bias the output of a hardware instruction that generates true random numbers.
We released the full MicroScope framework as a kernel module, available at https://github.com/dskarlatos/MicroScope.

*The name MicroScope comes from the attack's ability to peer inside nearly any microarchitectural side channel.

BRIEF BACKGROUND
Secure enclaves,8 such as Intel's SGX,7 allow sensitive user-level code to run securely on a platform alongside an untrusted supervisor (i.e., an OS and/or hypervisor). Intel's SGX uses the OS for translation lookaside buffer (TLB) and page table management. Each page table entry contains a present bit, which identifies if the physical page is present in memory or not. If the bit is cleared, then the translation process fails and a page fault exception is raised. The OS is then invoked to handle it. To keep the TLB coherent while updating page table entries, the OS can selectively flush TLB entries through the INVLPG instruction.

SUMMARY OF THE MICROSCOPE ATTACK
MRAs are based on the key observation that modern hardware allows recently executed, but not retired, instructions to be rolled back and replayed if certain conditions are met. This behavior can be exploited to mount a variety of attacks (see the "Generalizing Microarchitectural Replay Attacks" section).

Figure 1. Timeline of a MicroScope attack. The Replayer is an untrusted OS or hypervisor process that forces
the Victim code to replay, enabling the Monitor to denoise and extract the secret information.

An MRA attack has three actors: Replayer, Victim, and Monitor. In a MicroScope attack, which is a type of MRA, the Replayer is a malicious OS or hypervisor that is responsible for page table management. The Victim is an application process that executes on some secret data that the attacker wishes to exfiltrate. The Monitor is a malicious process that performs auxiliary operations, such as causing contention and monitoring shared resources. Figure 1 shows the timeline of the interleaved execution of the Replayer, Victim, and Monitor for a MicroScope attack.

Attack Setup
MicroScope is enabled by what we call a Replay Handle. A replay handle can be any memory access instruction that occurs shortly before one or more security-sensitive instructions in program order.
In MicroScope, the Replayer sets up the attack by locating the page table entries required for virtual-to-physical translation of the replay handle. Then, it performs the following steps, shown in timeline 1 of Figure 1. First, it flushes the replay handle data from the cache. After that, it clears the present bit of the leaf page table entry of the replay handle. After that, it flushes the translation page table entries from the caches. Finally, it flushes the TLB entry that stores the virtual-to-physical translation of the replay handle access. Together, these steps will cause the replay handle to miss in the TLB, induce a hardware page walk to locate the translations, and eventually suffer a page fault. In the meantime, instructions that are younger than the replay handle, including the sensitive instruction(s), can execute speculatively but not retire.
Speculative Execution in the Shadow of Page Walks
After the attack is set up, the Replayer allows the Victim to resume execution and issue the replay handle, as shown in timeline 3 of Figure 1. The replay handle access misses in the L1 TLB, L2 TLB, and page walk cache (PWC), and initiates a page walk. The hardware page walker fetches the necessary page table entries sequentially, starting from page global directory (PGD), then page upper directory (PUD), page middle directory (PMD), and finally page table entry (PTE).
The Replayer can tune the duration of the speculative execution by choosing whether victim's page table entries are either present or absent from the cache hierarchy and PWC (shown in the arrows above timeline 3 of Figure 1). The speculative instructions executing in the shadow of the page walk may leave some state in the cache subsystem and/or create contention for hardware structures in the core. This allows the Monitor to perform a noisy measurement of the secret data. At the end of the page walk, the hardware raises a page fault exception and squashes the speculative instructions in the pipeline.
The Replayer is then invoked to handle the page fault. The operation is shown in timeline 2 of Figure 1. The Replayer chooses to keep the present bit cleared. Timeline 4 of Figure 1 shows the actions of the Victim. In this case, after the Victim resumes and reissues the replay handle, the whole process repeats. This process can be repeated as many times as desired to denoise and extract the secret information.

Monitoring Execution
The Monitor extracts the secret information from the Victim. Depending on the side channel being exploited, the monitor can cause resource contention in parallel to the victim execution, or prime and inspect the cache state in between replays. This is shown in timeline 5 of Figure 1, where the Monitor executes in parallel with the Victim's speculative execution.

Attack Summary
The attack has the following six steps.

1) The Replayer identifies a replay handle and prepares the attack—e.g., by priming microarchitectural state.
2) When the Victim executes the replay handle, it suffers a TLB miss followed by a page walk. The time taken by this step can be over 1000 cycles, and can be tuned as per the requirements of the attack.
3) In the shadow of the page walk and until the page fault is serviced, the Victim continues to execute speculatively past the replay handle into the sensitive region, potentially until the ROB is full.
4) The Monitor can cause and measure contention on shared hardware resources during the Victim's speculative execution, or inspect the hardware state left by the Victim's speculative execution.
5) When the replay handle triggers a page fault, the Replayer gains control and can optionally leave the present bit cleared in the PTE entry. This will induce another replay cycle that the Monitor can leverage to collect more information. Before the replay, the attacker may also prime the processor state for the next measurement.
6) When sufficient measurements have been gathered, the Replayer sets the present bit in the PTE entry. This enables the Victim to make forward progress.

With these steps, MicroScope can denoise a side channel formed by, potentially, any instruction(s)—even ones that expose a secret only once in straight-line code. Furthermore, the Replayer can then clear the present bit of a later replay handle in the application and proceed to monitor a later section of the code.

SIMPLE ATTACK EXAMPLES
Figure 2 shows several examples of codes that present opportunities for MicroScope attacks. Each example showcases a different use case.

Figure 2. Simple examples of codes that present opportunities for MicroScope attacks. (a) Single secret. (b) Loop secret. (c) Control flow secret.
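Figure 2's listings are not reproduced in this extraction. The following C-style sketches convey their structure; the code and line numbers are illustrative, chosen to be consistent with the line references in the discussion below:

    /* (a) Single secret */
    1:  ...
    2:  tmp = *public_addr;          /* replay handle (public address) */
    3:  ...
    4:  use(secret);                 /* transmit computation           */

    /* (b) Loop secret */
    1:  for (i = 0; i < N; i++) {
    2:      tmp = *public_addr;      /* replay handle                  */
    3:      ...
    4:      use(secret[i]);          /* transmit operation             */
    5:      ...
    6:      tmp2 = *public_addr2;    /* candidate pivot instruction    */
    7:  }

    /* (c) Control flow secret */
    1:  tmp = *public_addr;          /* replay handle                  */
    2:  if (secret) {
    3:      use_a();                 /* taken path transmit            */
    4:  } else {
    5:      use_b();                 /* not-taken path transmit        */
    6:  }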

Single-Secret Attack
Figure 2(a) shows a simple code that has a single secret. Line 2 accesses a public address (i.e., known to the OS). This access is the replay handle. After a few other instructions, sensitive code at Line 4 processes some secret data. We call this computation the transmit computation of the Victim, using terminology from prior work.9 The transmit computation may leave some state in the cache or may use specific functional units that create observable contention. The goal of the adversary is to extract the secret information. The adversary can obtain it by using MicroScope to repeatedly perform steps (2)–(5) from the "Attack Summary" section.

Loop-Secret Attack
We now consider the scenario where we want to monitor a given instruction in different iterations of a loop. We call this case Loop Secret, and show an example in Figure 2(b). In the code, the loop body has a replay handle and a transmit operation. In each iteration, the transmit operation accesses a different secret. The adversary wants to obtain the secrets of all the iterations. The challenging case is when the replay handle maps to the same physical data page in all the iterations.
This scenario highlights a common problem in side channel attacks: secret[i] and secret[i+1] may induce similar effects, making it hard to disambiguate between the two. For example, both secrets may be colocated in the same cache line, or induce similar pressure on the execution units. This fact severely impedes the ability to distinguish the two accesses.
MicroScope addresses this challenge by using a second memory instruction to move between the replay handles in different iterations. This second instruction is located after the transmit instruction in program order, and we call it the Pivot instruction. For example, in Figure 2(b), the instruction at Line 6 can act as the pivot.
MicroScope uses the pivot as follows. After the adversary infers secret[i] and is ready to proceed to extract secret[i+1], the adversary performs one additional action during step 6 in the "Attack Summary" section. Specifically, after setting the present bit in the PTE entry for the replay handle, it clears the present bit in the PTE entry for the pivot, and resumes the Victim's execution. As a result, all the Victim instructions before the pivot are retired, and a new page fault is incurred for the pivot.
When the Replayer is invoked to handle the pivot's page fault, it sets the present bit for the pivot and clears the present bit for the replay handle. When the Victim resumes execution, it retires all the instructions of the current iteration and proceeds to the next iteration, suffering a page fault in the replay handle. Steps 2–5 repeat again, enabling the monitoring of secret[i+1]. The process is repeated for all the iterations.

Control Flow Secret Attack
A final scenario that is commonly exploited using side channels is a secret-dependent branch condition. We call this case Control Flow Secret, and show an example in Figure 2(c). In the code, the direction of the branch is determined by a secret, which the adversary wants to extract.
As shown in the figure, the adversary uses a replay handle before the branch, and a transmit operation in both paths out of the branch. The adversary can extract the direction taken by the branch using at least two different types of side channels.
First, if lines 3 and 5 in Figure 2(c) access different cache lines, then the Monitor can perform a cache based side-channel attack to identify the cache line accessed, and deduce the branch direction. A second case is when the two paths out of the branch perform different computations. In this scenario, the Monitor can apply pressure on the functional units and, by monitoring contention, deduce the operation that the code performs and, hence, the branch direction.

Figure 3. Latencies measured by performing a port contention attack. (a) Victim executes two multiply operations. (b) Victim executes two division operations.

EVALUATION
We validated MRAs and MicroScope by denoising a notoriously noisy side channel: execution unit port contention.6 For this attack, we assume the SGX threat model. We use victim code similar to the one in Figure 2(c), where one side of the branch executes two division operations, and the other side executes two multiplication operations. The Replayer forces the replay of the code. Concurrently, the Monitor executes a loop with one division operation in each iteration. We measure the time taken by each iteration of the Monitor loop. If the Victim executes the code with the two multiplications, the Monitor instructions execute fast and, hence, no contention is measured. Figure 3(a) shows the latency of each iteration of the Monitor. We see that all but four of the samples take less than 120 cycles, which we identify to be the contention threshold for our machine.
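A minimal sketch of the Monitor's measurement loop in this experiment follows (x86 C using the __rdtsc() intrinsic; the sample count is illustrative and the 120-cycle threshold is specific to our machine):

    #include <stdint.h>
    #include <x86intrin.h>   /* __rdtsc() */

    #define SAMPLES 1000     /* illustrative */

    void monitor_divider_contention(uint64_t lat[SAMPLES]) {
        volatile double x = 123456789.0;
        for (int i = 0; i < SAMPLES; i++) {
            uint64_t t0 = __rdtsc();
            x = x / 1.000000001;     /* one division per iteration: contends
                                        with the Victim for the divide unit */
            uint64_t t1 = __rdtsc();
            lat[i] = t1 - t0;        /* iterations above the contention
                                        threshold imply the Victim divides */
        }
    }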
If, instead, the Victim executes the code with the two divisions, the Monitor instructions execute slowly. Figure 3(b) shows the latency of each iteration. We see that 64 measurements are now above the threshold of 120 cycles. The two cases are clearly distinguishable. Overall, MicroScope is able to detect the presence or absence of two division instructions, outside of a program loop, using replays.
We further used MicroScope to perform a cache side-channel attack on AES.10 MicroScope is able to extract the lines accessed in the AES tables without noise in a single run.

GENERALIZING MRAs

Figure 4. Generalized MRAs.

Figure 4 presents a generalized overview of MRAs. It has four parts: replay handle, replayed code, side channel, and strategy. In the attack that we described in the previous sections, the replay handle is a page fault-inducing load, the replayed code contains certain instructions that leak privacy, the side channels discussed are related to caches and functional unit ports, and the attacker's strategy is to unconditionally page fault until it has high confidence that it has extracted the secret. We now discuss how to create different attacks by changing some of these components.

Attacks on Program Integrity
MRAs can be used to violate program integrity, if the replayed instructions are nondeterministic. For example, suppose we replay the Intel true random number generator RDRAND instruction. If the attacker can read the return value through a side channel, it can selectively replay RDRAND until the value matches some desired criteria, effectively biasing the random number generation from the victim's perspective.†

†This attack does not work on Intel machines due to a technicality. We verified that modulo this technicality, it should work.

Attacks Using Different Replay Handles
MRAs can exploit many different sources of replay beyond page faults. For example, they can exploit transaction aborts in transactional memory. Other instructions enable a limited number of replays. They include any event that can squash speculative execution,11 such as a branch misprediction or a load-to-store alias.

Amplifying Physical Side Channels
MRAs may also be an effective tool to amplify physical channels, such as power and EM.12 The idea is that a replay attack can turn nonrepeating code into repeating code, which power meters, temperature sensors, and frequency analyzers can interpret better.

FUTURE RESEARCH DIRECTIONS AND APPLICATIONS
The work presented in this article can be extended in numerous ways. In this section, we describe some possible directions.

Potential Countermeasures Against MRAs
A future direction is to mitigate MRAs. We are actively working on this area. The root cause of MRAs is that individual dynamic instructions may execute more than once. This can be due to a variety of reasons (e.g., a page fault, a transaction abort, or a squash due to branch misprediction). Thus, it is clear that new, general security properties are required to comprehensively address these vulnerabilities.
The obvious defense against these attacks is for the hardware or the OS to insert a fence after each pipeline flush. However, corner cases, such as multiple instructions in close proximity, individually causing replays, need to be considered. In these cases, even if a fence is introduced after every pipeline flush, the adversary can extract information from the resulting multiple replays.
MicroScope relies on speculative execution to replay Victim instructions. Therefore, a defense solution that holistically blocks side effects caused by speculative execution can effectively block MicroScope. However, existing defense solutions have limited defense coverage, introduce substantial performance overhead, or require costly hardware.

Page fault-oriented defense mechanisms could be effective to defeat MicroScope. Unfortunately, solutions that rely on Intel TSX are not sufficient, since TSX itself creates a new mechanism with which to create replays, through transaction aborts. Thus, we believe further research is needed before applying either of the aforementioned defenses to any variant of MRA.

MRAs as a Speculative Defense Mechanism
While MicroScope is presented as an attack, its operation can be used to improve security defenses. This is because it provides a window into what speculative execution attacks, such as Spectre and Meltdown, can do. For example, a Spectre attack on a given branch cannot affect subsequent instructions that are more than an ROB-long distance away from the branch dynamically. MicroScope can provide this information, and thus can determine when an instruction needs protection.
Furthermore, MicroScope can be used to perform black-box analysis of microarchitectural structures, such as the ROB, load store queue (LSQ), and others. Controlled fine-grain microarchitectural replay capabilities can enable the reverse engineering of hardware structures. This approach can reveal timing, number of ports, and interconnect information and, more importantly, uncover previously unknown behavior of hardware units under speculative execution. Such information is not only useful for discovering previously unknown vulnerabilities, but can further provide a foundation for defense mechanisms.

Parallel Application Debugging
The mechanism that enables our current MicroScope prototype, namely the capture and reexecution of an ROB-sized set of instructions, is a very useful primitive in software development and debugging. Capturing these instructions can provide insights into the program state in a way that no current tool can. Replaying these instructions deterministically may provide a way to debug hard-to-reproduce software bugs, such as data races. Small enhancements may enable single-stepping the code, or the ability to change the direction of branches on the fly. Overall, with a good interface, MicroScope may become a unique debugging tool for sequential and parallel code.

REFERENCES
1. F. Liu, Y. Yarom, Q. Ge, G. Heiser, and R. B. Lee, "Last-level cache side-channel attacks are practical," in Proc. IEEE Symp. Secur. Privacy, May 2015, pp. 605–622.
2. Y. Yarom, D. Genkin, and N. Heninger, "CacheBleed: A timing attack on OpenSSL constant time RSA," in Proc. Int. Conf. Cryptographic Hardware Embedded Syst., 2016.
3. Y. Yarom and K. Falkner, "Flush+Reload: A high resolution, low noise, L3 cache side-channel attack," in Proc. USENIX Secur. Symp., 2014.
4. D. Evtyushkin, R. Riley, N. Abu-Ghazaleh, and D. Ponomarev, "Branchscope: A new side-channel attack on directional branch predictor," in Proc. Int. Conf. Archit. Support Program. Lang. Oper. Syst., 2018, pp. 693–707.
5. M. Yan, R. Sprabery, B. Gopireddy, C. Fletcher, R. Campbell, and J. Torrellas, "Attack directories, not caches: Side channel attacks in a non-inclusive world," in Proc. IEEE Symp. Secur. Privacy, 2019, pp. 56–72, doi: 10.1109/SP.2019.00004.
6. A. C. Aldaya, B. B. Brumley, S. U. Hassan, C. P. García, and N. Tuveri, "Port contention for fun and profit," in Proc. IEEE Symp. Secur. Privacy, 2019, pp. 870–887.
7. Intel, "Intel software guard extensions programming reference," 2013. [Online]. Available: https://software.intel.com/sites/default/files/329298-001.pdf
8. P. Subramanyan, R. Sinha, I. Lebedev, S. Devadas, and S. A. Seshia, "A formal foundation for secure remote execution of enclaves," in Proc. ACM SIGSAC Conf. Comput. Commun. Secur., 2017, pp. 2435–2450.
9. V. Kiriansky, I. A. Lebedev, S. P. Amarasinghe, S. Devadas, and J. Emer, "DAWG: A defense against cache timing attacks in speculative execution processors," in Proc. 51st Annu. IEEE/ACM Int. Symp. Microarchit., 2018, pp. 974–987.
10. OpenSSL, "Open source cryptography and SSL/TLS toolkit," 2019. [Online]. Available: www.openssl.org
11. C. Canella et al., "A systematic evaluation of transient execution attacks and defenses," in Proc. 28th USENIX Secur. Symp., 2019, pp. 249–266.
12. A. Nazari, N. Sehatbakhsh, M. Alam, A. Zajic, and M. Prvulovic, "EDDIE: EM-based detection of deviations in program execution," in Proc. ACM/IEEE 44th Annu. Int. Symp. Comput. Archit., 2017, pp. 333–346.

Dimitrios Skarlatos is a Ph.D. candidate at the University of Illinois at Urbana–Champaign. His research lies at the intersection of computer architecture, security, and operating systems. He builds practical solutions that improve the performance and bolster—or sometimes break—the security guarantees of computing systems. Contact him at skarlat2@illinois.edu.

Mengjia Yan is an Assistant Professor in the Electrical Engineering and Computer Science Department, Massachusetts Institute of Technology. Her research interest lies in the areas of computer architecture and hardware security, with a focus on side channel attacks and defenses. Yan received the Ph.D. degree from the University of Illinois at Urbana–Champaign (UIUC). Contact her at mengjia@csail.mit.edu.

Bhargava Gopireddy is a Senior Architect at Nvidia, where he works on energy-efficient GPU architectures. His research interests include energy efficient many-core architectures and architectural support for operating systems/security. Gopireddy received the Ph.D. degree in computer science from the University of Illinois at Urbana–Champaign in 2018. Contact him at bgopireddy@nvidia.com.

Read Sprabery completed his Ph.D. degree with a focus on cloud security in 2018 at the University of Illinois at Urbana–Champaign and now works as a security researcher at Google. Contact him at spraber2@illinois.edu.

Josep Torrellas is the Saburo Muroga Professor of Computer Science at the University of Illinois at Urbana–Champaign (UIUC). He is the Director of the Center for Programmable Extreme Scale Computing and past Director of the Illinois-Intel Parallelism Center. His research interests include computer architecture and parallel processing. Torrellas received the Ph.D. degree from Stanford University. Contact him at torrellas@cs.uiuc.edu.

Christopher W. Fletcher is an Assistant Professor in computer science at the University of Illinois at Urbana–Champaign. He has interests ranging from computer architecture to security to high-performance computing (ranging from theory to practice, algorithm to software to hardware). Fletcher received the Ph.D. degree from Massachusetts Institute of Technology in 2016. Contact him at cwfletch@illinois.edu.
Theme Article: Top Picks

Creating Foundations for Secure Microarchitectures With Data-Oblivious ISA Extensions

Jiyong Yu, University of Illinois at Urbana–Champaign
Lucas Hsiung, SiFive
Mohamad El Hajj and Christopher W. Fletcher, University of Illinois at Urbana–Champaign

Abstract—It is not possible to write microarchitectural side channel-free code on


commercial processors today. Even when we try, the resulting code is low performance. This
article’s goal is to lay an ISA-level foundation, called a Data-Oblivious ISA (OISA) extension,
to address these problems. The key idea with an OISA is to explicitly but abstractly specify
security policy, so that the policy can be decoupled from the microarchitecture and even the
threat model. Analogous to a traditional ISA, this enables an OISA to serve as a portable
security-centric abstraction for software while enabling security-aware implementation and
optimization flexibility for hardware. The article starts by giving a deep-dive in OISA principles
and formal definitions underpinning OISA security. We also provide a concrete OISA built on
top of RISC-V, an implementation prototype on the RISC-V BOOM microarchitecture, a formal
analysis and security argument, and finally extensive performance evaluation on a range of
data-oblivious benchmarks.

Digital Object Identifier 10.1109/MM.2020.2985366
Date of publication 6 April 2020; date of current version 22 May 2020.

& A, ARGUABLY THE, central problem in secure side channels. Depending on the microarchitec-
computer architecture today is how to reason ture, this may require closing side channels
about security amid the sea of different micro- through the cache, translation lookaside buffer
architectural side channel attacks. The prevailing (TLB), etc. That is, security is not tied to closing
approach to stop these attacks is to a specific side channel and the
block leakage stemming from one To our knowledge, this programmer works with a simple,
hardware structure at a time. For is the first proposal that portable guarantee across micro-
example, by partitioning or ran- provides a basis to architectures. On the efficiency
domizing the cache layout, we block all traditional side side, each microarchitecture can
block (or at least aggravate) cache channel and specula- choose how to implement the
timing attacks. Yet, many hardware tive execution attacks safe load operation in whatever
structures have been shown to leak on commercial-class way maximizes performance
secrets—from the cache to the microarchitectures. while preserving security (e.g.,
branch predictors,5 speculative by microcoding the load into sim-
execution,8 port contention,1 arithmetic unit tim- pler safe operations,2 or using hardware parti-
ing,2 etc. Given the many avenues to leak a secret, tioning,11 or using cryptographic techniques9).
it is paramount to explore holistic defenses that Safe loads are just one example. More gener-
provide a basis to block leakage through all hard- ally, deciding which instruction operands to des-
ware structures. ignate as safe opens a new, rich ISA design space
In this direction, the article proposes ISA which trades-off performance and hardware
design principles for what we call data-oblivious complexity.
ISAs (OISAs). The key idea with an OISA is to Beyond formulating design principles for
explicitly but abstractly specify security policy, OISAs, the article proposes a concrete OISA
so that the policy can be decoupled from the extension built on top of RISC-V, implements
microarchitecture and even the threat model. (and open sources) that OISA extension on the
Analogous to a traditional ISA, this enables an BOOM out-of-order (OoO) speculative RISC-V
OISA to serve as a portable security-centric core,3 and provides a formal analysis showing
abstraction for software while enabling security- how the OISA provides a basis to achieve nonin-
aware implementation and optimization flexibil- terference (“zero privacy leakage”) on an
ity for hardware. abstract OoO speculative machine. Crucially, the
The OISA proposed in the article annotates security analysis and principles are robust to
what data is confidential and what instruction modern attacks. Case in point, the article’s formal
operands are safe. Inspired by information flow analysis shows how the OISA soundly defeats
policies (in particular, the classic policy High Z speculative execution attacks (such as Spectre8)
Low), the hardware dynamically enforces that without introducing special case reasoning.
confidential data is never passed to unsafe oper- To our knowledge, this is the first proposal
ands, i.e., Confidential data Z Unsafe operands. that provides a basis to block all traditional side
Informally, “safe” in the article means “does not channel and speculative execution attacks on
create a microarchitectural side channel as a func- commercial-class microarchitectures.
tion of the operand” (we also provide formal defi-
nitions), but other notions of safety can be
retrofitted into the implementation without MOTIVATION: SECURE AND
changing the OISA or the programs that sit on EFFICIENT DATA-OBLIVIOUS
top of it. PROGRAMMING
OISAs enable high security, portability, and The OISA project came about by asking the
efficiency. Consider a simple example OISA following question: Is it possible today to write
instruction: a load with a safe address operand. microarchitectural side channel-free programs
Security-wise and portability-wise, the OISA on modern microarchitectures?
guarantees that when the load executes, the The answer is no. Consider the most conser-
address will not leak through microarchitectural vative approach used by practitioners, called
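To make the operand annotations concrete, here is a minimal sketch of what an OISA-style policy table might look like. This is our hypothetical Python rendering; the instruction names, the "oload" mnemonic, and the label strings are illustrative assumptions, not the article's actual encoding.

    # Hypothetical per-operand safety annotations (illustrative only).
    OISA_OPERAND_POLICY = {
        "add":    {"rs1": "safe",   "rs2": "safe"},
        "xor":    {"rs1": "safe",   "rs2": "safe"},
        "load":   {"addr": "unsafe"},  # ordinary load: address can leak
        "oload":  {"addr": "safe"},    # safe-address (oblivious) load
        "branch": {"pred": "unsafe", "target": "unsafe"},
    }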

MOTIVATION: SECURE AND EFFICIENT DATA-OBLIVIOUS PROGRAMMING
The OISA project came about by asking the following question: Is it possible today to write microarchitectural side channel-free programs on modern microarchitectures?
The answer is no. Consider the most conservative approach used by practitioners, called data-oblivious programming.* In a nutshell, a data-oblivious program is one whose hardware resource usage is independent of the program's inputs. To write such programs, the guidelines are to use only simple instructions, or otherwise ensure that complex instructions do not receive Confidential data as operands. For example, simple bitwise math is allowed, but memory operations/branches with Confidential data as addresses/predicates are not (out of fear of, e.g., cache-based/control flow-related side channels).
* Data-oblivious programming goes by several other names, e.g., "constant-time programming" and "programming in the circuit abstraction," depending on the community.
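For intuition, the following Python sketch shows the logic behind these guidelines; it is illustrative only (Python itself makes no timing guarantees, and real data-oblivious code is written with constant-time machine instructions). A secret-dependent choice is computed with bitwise masking rather than a branch, and a secret-indexed table read touches every entry instead of issuing a secret-dependent address.

    # Illustrative only: the *logic* of data-oblivious code.
    MASK64 = (1 << 64) - 1

    def ct_select(bit, a, b):
        # Return a if bit == 1 else b, using bitwise math, no branch.
        mask = (-bit) & MASK64            # all ones or all zeros
        return (a & mask) | (b & ~mask & MASK64)

    def oblivious_lookup(table, secret_idx):
        # Never issue a secret-dependent address: scan every entry
        # and keep only the wanted one.
        result = 0
        for i, value in enumerate(table):
            result = ct_select(int(i == secret_idx), value, result)
        return result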
Despite being extremely conservative, the abovementioned guidelines fail in light of ISA-invisible microarchitecture-specific optimizations. For example, on one microarchitecture, a simple integer addition might be safe (e.g., implemented as a single-cycle operation whose timing is independent of its inputs) while on another it might be unsafe (e.g., implemented as a bit-serial operation that skips runs of 0s to save time). The article describes 11 such optimizations, which have been proposed in the literature, or are otherwise known to be implemented already, which break data-oblivious program security. These include data-in-use optimizations (such as data-dependent arithmetic) and data-at-rest optimizations (such as cache compression).
In particular, the article points out for the first time that speculative execution breaks data-oblivious program security, by steering execution so that Confidential data is consumed by an instruction whose execution can leak privacy. This is nontrivial to see for realistic programs, given the conservative guidelines used to write data-oblivious code. For example, consider data-oblivious decryption:

    for (i = 0; i < NUM_ROUNDS; i++)
        state = OblDecryptRound(state, rkey[i])
    leak(state)

That is, perform a fixed number of decryption rounds, where each round works on a part of the secret key (rkey) and incrementally updates the round state (state). Here, we assume that OblDecryptRound, the round logic, is data oblivious. leak() is a proxy for an instruction that reveals its argument over a microarchitectural side channel.
This program is legal data-oblivious code: The branch outcome in each iteration is public information, the round logic is data oblivious, and only the plaintext is meant to be revealed after decryption is complete. Yet, unwanted privacy leaks because benign mispredictions can cause the round logic to exit early. In this example, an early mispredict of "not taken" allows the attacker to see state before all rounds complete, which allows it to perform cryptanalysis and recover the secret key rkey.

Core Issue: No Abstraction for Security
To summarize, data-oblivious programming today is insecure and slow. It is insecure because of ISA-invisible microarchitecture-specific optimizations. It is slow because, out of fear of leaking privacy, programmers are forced into using only the simplest of instructions.
The article sets out to address these issues by introducing new ISA-level abstractions for reasoning about security and enabling higher performance. A new ISA abstraction addresses the security problem by defining how instructions leak privacy across all compliant microarchitectures. It further enables higher performance by allowing data-oblivious programs to take advantage of higher performance instructions, as long as those instructions are deemed safe by the ISA, and gives microarchitects the ability to optimize those instructions subject to the ISA-prescribed security policies.

FORMAL DEFINITIONS FOR MICROARCHITECTURAL SIDE CHANNELS
To start, the article develops a security definition for microarchitectural side channel-free execution. There are two challenges. First, how to write the definition to account for any possible microarchitectural side channel. Second, how to write the definition so that it sheds insight on which instructions are "safe" from a microarchitectural side channel perspective.
To define privacy, we adopt a trace-based indistinguishability style definition inspired by the oblivious RAM (ORAM)7 literature.


Figure 1. Changes in resource usage, as a function of data, create microarchitectural side channels. If a
latch is shaded in a given clock cycle, then it means that there is (explicit) information flow from the operands
to that latch in that cycle. Assume operands A and B are two sets of distinct data values, meant to induce
different ALU timings.

We consider a program P which takes public data x and confidential data y as input. That program's execution trace, on a microarchitecture mArch, i.e., "all the atoms in the universe that are perturbed as a result of running P(x, y) on mArch," is denoted mArch(P(x, y)). The subset of this trace that the attacker can see (called the view) is denoted View(mArch(P(x, y))). For privacy, we require that the information in the View does not depend on confidential information, i.e., that View(mArch(P(x, y))) ≈ View(mArch(P(x, y′))) for all confidential data y and y′. In this setting, ≈ informally means "equal, given the capabilities of any computationally bounded adversary." For example, in ORAM schemes the view is the "memory access pattern" and ORAM seeks to make the memory access pattern independent of confidential data.
Next, we must define a view that captures any possible microarchitectural side channel that an arbitrary software-based attacker can monitor. This is nontrivial as the attacker can monitor many aspects of the program's execution. For example, its execution time, use of the cache, arithmetic units, etc. The article makes a key observation that all of these leakages can be modeled as confidential data-dependent changes in the program's hardware resource usage over time. For instance, both arithmetic units and cache sets are hardware resources and the fact that they are used at confidential data-dependent times is the crux of the attacks.
Then, the question is how to determine whether a hardware resource is currently being "used" by a program. (Note that whether a hardware resource is being "used" is independent of the logic values currently stored in that structure.) For this, we rely on an explicit gate-level information flow abstraction similar to GLIFT.10 Figure 1 shows an example using an arithmetic logic unit (ALU) with operand-independent and then operand-dependent timing.
First assume a single-cycle ALU (see Figure 1, Case 1). Suppose the input arrives and is stored in the input latches at the rising edge of cycle 1. Using terminology from information flow, we say the input latch is tainted in cycle 1. Now, regardless of the logic values of the input, the same latches are tainted in each cycle thereafter. That is, the output latches are tainted in cycle 2, etc. Because which latches are tainted when is independent of the operands, we say the single-cycle ALU does not form a microarchitectural side channel.
Next assume an ALU with operand-dependent timing (see Figure 1, Case 2). For example, a multiply operation that takes one or two cycles, depending on whether an operand is 0. In this case, depending on the input, the output latch is either tainted in cycle 2 or cycle 3. Because which latches are tainted when is dependent on the operands, we say this ALU can form a microarchitectural side channel.
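The following toy Python model, which is our illustration rather than the article's formalism, makes the two ALU cases concrete by writing each View as the per-cycle sequence of tainted-latch sets.

    # Toy model: a View as the per-cycle sets of tainted latches.
    def view_single_cycle_alu(a, b):
        # Inputs latched in cycle 1, output in cycle 2, for any inputs.
        return [{"in_a", "in_b"}, {"out"}]

    def view_variable_latency_mul(a, b):
        # Hypothetical multiplier: 1 cycle if an operand is 0, else 2.
        if a == 0 or b == 0:
            return [{"in_a", "in_b"}, {"out"}]
        return [{"in_a", "in_b"}, {"pipe"}, {"out"}]

    # The single-cycle View is input-independent; the variable-latency
    # View differs across operands, i.e., it forms a side channel.
    assert view_single_cycle_alu(0, 7) == view_single_cycle_alu(3, 7)
    assert view_variable_latency_mul(0, 7) != view_variable_latency_mul(3, 7)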

Putting everything together, we model the processor as a state machine composed of combinational logic and latches.** The subset of latches that store the Confidential input are denoted tainted at the start. Then, the View outputs a trace that indicates which subset of latches are tainted in each cycle. That is, hardware resource usage as a function of time. If the microarchitecture ensures that the View is independent of Confidential data, the microarchitecture does not leak privacy. Conversely, if the definition is not satisfied, we can pinpoint which instruction caused the problem by looking at where the Views diverged. (To note, the article defines taint propagation in a nonstandard way to model only explicit information flows. This prevents taint explosion, which would render the definition not useful. Implicit information flows are modeled by quantifying over all y′.)
** W.l.o.g., we treat any state element (flip-flop, SRAM cell, etc.) as a latch.

PRINCIPLES OF OISA DESIGN
The design principles for OISAs are twofold. First, the OISA should expose security guarantees in a microarchitecture-independent way. That is, programs written using an OISA should maintain the same security guarantees across all OISA-enabled microarchitectures. Second, OISAs should not preclude modern hardware performance optimizations, except when those optimizations have a chance to leak privacy.
To address these goals, the OISA abstraction proposed in the article has two parts. First, the OISA labels data to be confidential/public, to capture whether that data is a function of user secrets (i.e., the sensitive program inputs from the "Formal Definitions for Microarchitectural Side Channels" section). Second, the OISA specifies, for each instruction, whether each instruction operand is safe or unsafe.
Finally, compliant microarchitectures must monitor and take different actions based on what data is consumed by what instruction operands at runtime. Specifically, hardware must enforce the following rules (shown in Figure 2).

Figure 2. Protection policies, checked before each instruction executes.

• (Confidential ↛ Unsafe) When confidential data is presented to an unsafe operand: The hardware must delay that instruction's execution until it is nonspeculative. If the rule still applies when the instruction is nonspeculative, the program terminates with a fault (as continuing would constitute an information leak).
• (Confidential → Safe) When Confidential data is sent to a safe operand: The hardware designer must add mechanisms to enforce the security definition given that instruction's execution (see the "Formal Definitions for Microarchitectural Side Channels" section), for a specified view. For example, by disabling performance optimizations, scrubbing side effects and masking exceptions that occur as a function of confidential operands.
• (Public → Safe/Unsafe) When public data is sent to safe or unsafe operands, no special treatment is needed and execution can proceed without protection.
Despite these rules' simplicity, they provide both security and efficiency benefits. As we will see in the "Formal Analysis" section, they provide a uniform handling for both traditional- and speculative-execution-based attacks.8 Case in point, the only mention of speculation is a detail in the rule for Confidential ↛ Unsafe, where we say such an information flow delays the instruction's execution until it is nonspeculative. This removes false-positive violations due to benign misspeculations. At the same time, the rules enable blocking attacks with low overhead. Case in point, the rules encode some intuitive optimizations such as "Public data does not need protection" and "it is safe to compute on confidential data with safe instructions." The only situation where instruction execution is impeded is if confidential data is consumed by an unsafe operand.†
† This principle directly inspired our follow-on work to block, specifically, speculative execution attacks.12
the “Formal Definitions for Microarchitectural
Key Idea: Abstract security policies facilitate pro-
Side Channels” section). Second, the OISA speci-
gramming simplicity, implementation flexibility,
fies, for each instruction, whether each instruc-
and performance optimizations. Specifying instruc-
tion operand is safe or unsafe.
tion operand security policy abstractly, i.e., as
Finally, compliant microarchitectures must
safe/unsafe, provides significant flexibility to both
monitor and take different actions based on what
the ISA and hardware designer while simplifying
data is consumed by what instruction operands
programmer-level reasoning about security. At
at runtime. Specifically, hardware must enforce
the ISA level, an ISA designer can decide which
the following rules (shown in Figure 2).
instructions are sufficiently important to warrant
 (Confidential Z Unsafe) When confidential safe operands. These choices should be made
data is presented to an unsafe operand: The carefully: On one hand, safe operands impose a
hardware must delay that instruction’s exe- burden on hardware designers as the processor
cution until it is nonspeculative. If the rule must support mechanisms to uphold security viz.,
still applies when the instruction is nonspec- the “Formal Definitions for Microarchitectural
ulative, the program terminates with a fault
(as continuing would constitute an informa- y
This principle directly inspired our follow-on work to block, specifically,
tion leak). speculative execution attacks.12


On the other hand, safe operands do not specify an implementation strategy. Hardware designers can implement a given operation using simpler data-oblivious instructions,2 hardware partitioning,11 or cryptographic techniques9, depending on what is efficient given public parameters and the specific microarchitecture. In either case, programmers work with a simple guarantee: Confidential values will not be at risk when consumed by safe operands, and dynamic execution will be terminated when violations to this policy are detected.
Information flow policy and implementation. The abovementioned OISA framework describes a relatively simple security lattice (similar to {High, Low}),4 policy (High ↛ Low), and information flow propagation rule (as written, data should be marked Confidential if its value is interferent with the program's Confidential inputs, which implies a taint algebra similar to GLIFT10). This reflects the article's goal: to provide a comprehensive, but simple, privacy guarantee for data-oblivious programming while granting implementation flexibility to trade off design cost and performance. Given other use cases, these parameters, and how they are concretely enforced by an implementation, can be changed for a family of OISAs, a particular OISA, or a microarchitecture that implements that OISA, e.g., to support richer security lattices.6

Figure 3. Data-oblivious ISA policy when data is passed to an instruction operand.

DESIGN OF A CONCRETE OISA
With the principles in the "Principles of OISA Design" section, we now propose a baseline concrete OISA that can be easily implemented on top of common existing ISAs (e.g., x86, ARM, RISC-V). Figure 3 highlights the instructions included in the OISA, and which operands are safe/unsafe. Programmers that write data-oblivious code will recognize this as a formalization of the guidelines used by data-oblivious programs today (see the "Core Issue: No Abstraction for Security" section). Arithmetic represents all binary arithmetic operations with safe input operands. Classify promotes its operand from public to confidential. Declassify is the opposite of classify: it demotes its operand from confidential to public. Branch performs a conditional branch, but is only allowed to specify a public destination or check a public predicate. Load and store are only permitted to take public addresses. An important detail is that since declassify has the potential to make previously protected data vulnerable, the OISA requires that the declassify instruction be verified as corresponding to program semantics. For example, on a speculative microarchitecture, this would entail delaying such an instruction until it is nonspeculative.
Extension: Memory obliviousness via safe-address loads. A common bottleneck in existing data-oblivious code is the inability to use confidential data as a load address. Therefore, we propose a new set of instructions (an oblivious-memory extension) that enable memory-oblivious computation.9 Given the OISA design principles, enabling memory-oblivious instructions is conceptually simple. Instead of emulating memory obliviousness with dummy memory operations, we designate new load/store instructions whose address operand is safe. This gives hardware designers the ability to build secure and efficient implementations, e.g., using partitioning11 or oblivious RAM,7 for that specific operation.
Load instructions with safe address operands are just one example of how to accelerate secure computation with an OISA, and the article leaves extending our concrete OISA with additional safe instructions as future work. A key insight that motivates this direction is that many data-oblivious codes share common kernels (e.g., sorting) that become performance bottlenecks because the only available safe operations are simple instructions. By encapsulating these larger operations into new instructions with safe operands, a future OISA can potentially achieve constant-factor or even asymptotic performance improvements. For example, a sort implemented data obliviously with simple safe instructions may cost O(n log² n) operations if implemented as a bitonic sort. On the other hand, if sort is specified as a single safe instruction in the OISA, an implementation based on hardware partitioning can achieve O(n log n) time if implemented as a constant-time merge sort.

HARDWARE PROTOTYPE ON RISC-V BOOM
We prototype all hardware changes needed to support our OISA on top of the RISC-V BOOM processor (for "Berkeley OoO Machine").3 BOOM is the most sophisticated open RISC-V processor, featuring modern performance optimizations such as speculative and OoO execution, and is similar to commercial machines that run data-oblivious code today.
Microarchitectural changes to support the OISA are shown in Figure 4. The main changes are logic at instruction issue/execute to enforce the rules from the "Principles of OISA Design" section, storage/logic to implement the oblivious-memory extension, and logic to track and denote data as confidential/public. For the latter, we implement a hardware information flow tracking mechanism similar to hardware dynamic information flow tracking, but capable of checking and updating whether data is confidential/public (the data's label) at any stage in the pipeline.

Figure 4. Microarchitectural changes needed to support the OISA from the "Design of a Concrete OISA" section (including the oblivious-memory extension, denoted "omp"). Label stations check and enforce the transition rules from Figure 2.

FORMAL ANALYSIS
In parallel to our hardware prototype, we develop a formal analysis that models an abstract BOOM-class processor (OoO, speculative, superscalar), and describe how to map the abstract BOOM to our concrete BOOM prototype. Through this model, we prove that the OISA provides a basis to satisfy strong security definitions such as those we defined in the "Formal Definitions for Microarchitectural Side Channels" section. Our security analysis is general, and applies given any implementation of several important processor structures (e.g., it models the branch predictor as an arbitrary function that takes previous branch resolutions as input).
Importantly, we are able to prove security while allowing high-performance hardware optimizations (e.g., OoO, speculative execution) to remain enabled in the common case and without ever requiring hardware flushes to structures such as the cache or branch predictors.
Security intuition. Informally, to argue security, we need to show the following.
a) Each instruction's resource usage/side-effects are independent of Confidential data.
b) The sequence of instructions that are executed, i.e., the processor program counter (PC), is independent of Confidential data.‡
Condition (a) follows by definition by applying the rules in Figure 2 to each instruction as it executes. A more subtle point is that condition (b) also follows from applying the same rules. To see why, first consider a simple unpipelined, in-order processor with no speculation. In this case, it is clear condition (b) holds because the only instruction type from Figure 3 that changes the PC as a function of data is a branch, and the OISA requires that the branch predicate and target be Public data. What happens when we consider more advanced pipelines, e.g., with prediction and speculation? In that case, microarchitectural state outside of program semantics, e.g., the branch predictor state, influences the PC. To extend our security argument to these machines, we must extend what we mean by "resource usage/side-effect" to include these structures. Then, using induction, one can show that if conditions (a) and (b) hold up to fetching the ith instruction, the branch predictor state when fetching the (i + 1)th instruction is independent of Confidential data, and security follows.§
‡ Similar requirements on "not tainting" the PC also govern prior work.10
§ This idea to keep the predictors a function of Public data directly inspired the mechanism to block "implicit channels" in our follow-on work.12
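As a toy two-run illustration of condition (b), consider the decryption loop from the motivation section. The Python sketch below (ours, and highly simplified) shows that because the loop branch tests a Public bound and the round body uses Confidential data only in safe operations, the fetched PC stream is identical for any two keys.

    # Toy check of condition (b) on the decryption example.
    def run_pcs(rkey, num_rounds):
        pcs, pc, i, state = [], 0, 0, 0
        while pc != 3:
            pcs.append(pc)
            if pc == 0:        # round body: safe ops on Confidential data
                state ^= rkey[i]
                pc = 1
            elif pc == 1:      # loop branch: predicate is Public
                i += 1
                pc = 0 if i < num_rounds else 2
            else:              # pc == 2: leak(state), nonspeculative only
                pc = 3
        return pcs

    # Different Confidential keys, identical PC stream.
    assert run_pcs([1, 2, 3, 4], 4) == run_pcs([9, 9, 9, 9], 4)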


Example: Security against speculative execution attacks. The abovementioned reasoning shows how OISAs enable security against both nonspeculative and speculative attacks. Consider the example speculative execution attack on data-oblivious decryption from the "Motivation: Secure and Efficient Data-Oblivious Programming" section. This attack does not go through when using an OISA by invoking aforementioned conditions (a) and (b). That the branch predictor misspeculates and executes leak() prematurely is not a function of Confidential data due to condition (b). Furthermore, when leak() executes, it will be unconditionally stopped by the Confidential data ↛ Unsafe operands rule due to condition (a). Extending the analysis to attacks like Spectre,8 where the branch predictor is intentionally mistrained, reuses the same logic. That is, if conditions (a) and (b) hold, then the attacker's strategy for how to mistrain the branch predictor cannot be a function of Confidential data because the program has not leaked Confidential data up to this point. In that case, intentional mistraining looks the same to the analysis as accidental misspeculation, and security follows.

EVALUATION
We evaluate OISAs in terms of hardware area and performance over a range of existing data-oblivious programs (including linear algebra, data structures, and graph traversal). Area-wise, our proposal takes <5% of the area of the unmodified BOOM processor. Performance-wise, the OISA and hardware implementation provides an 8.8×/1.7× speedup on small/large datasets, respectively, relative to data-oblivious code running on commodity machines (and with the security and portability benefits stated before). We also show case studies, where the OISA speeds up constant-time AES by 4.4× and the memory-oblivious ZeroTrace9 library by 4.6× to several orders of magnitude, depending on parameters.
We have open-sourced our prototype design on the RISC-V BOOM processor at https://github.com/cwfletcher/oisa.

DISCUSSION AND FUTURE DIRECTIONS
OISAs can be extended in numerous directions, in particular as a way to compose existing hardware/software defensive mechanisms and as a novel backend for the data-oblivious stack.

Simplifying and Composing the Hardware Trusted Computing Base (TCB)
A major impediment to progress is that many hardware structures create side channels, and it is not clear whether the program is "secure" until all channels are blocked. OISAs dramatically simplify this problem, enabling a new incremental methodology for designing secure hardware and software.
In their most basic deployment, an OISA might opt to only support very basic safe instructions (e.g., bitwise operations and basic arithmetic). Such an OISA can likely be implemented with minimal changes to modern processors and already improves the state of security today. For example, by increasing our confidence that conservatively-written codes such as constant-time codes are really "constant time."
Beyond this basic deployment, however, OISAs provide a way for computer architects to plug-and-play their high-performance "point" defenses in a compositional way. For example, a safe load can be implemented by previously proposed partitioned or randomized cache architectures.11 Importantly, architects need only worry about how to implement the safe load. The generic OISA rules, e.g., Confidential data ↛ Unsafe operands, take care of the rest.

Composing With the Data-Oblivious Stack
Beyond writing side channel-free code for today's hardware, there is a rich literature in the applied cryptography community—ranging from algorithm/data structure design to language and compiler support—for performing secure multiparty and encrypted computation.

We make a key observation that the underlying programming abstraction assumed for those works is the same abstraction provided by an OISA. For example, a homomorphic encryption operation is akin to a safe instruction, just using a different implementation suitable for a different threat model. This enables a new, large-scale research agenda to port insights and advances made in the applied cryptography community to/from the microarchitectural side channel community. For example, we can enable high-level programming abstractions for writing OISA-secure code by adding a new OISA backend to existing data-oblivious compiler frameworks. At the same time, the notion of safe instructions provides a new theory to explore in applied cryptography. In particular, algorithm design in encrypted computation assumes only extremely simple safe operations (e.g., bit add or multiply). With an OISA, however, we can choose which operations support safe operands, and co-design algorithms with this in mind to improve performance.

REFERENCES
1. A. C. Aldaya, B. B. Brumley, S. U. Hassan, C. P. García, and N. Tuveri, "Port contention for fun and profit," in Proc. IEEE Symp. Secur. Privacy, 2019, pp. 870–887.
2. M. Andrysco, D. Kohlbrenner, K. Mowery, R. Jhala, S. Lerner, and H. Shacham, "On subnormal floating point and abnormal timing," in Proc. IEEE Symp. Secur. Privacy, 2015, pp. 623–639.
3. C. Celio, P.-F. Chiu, B. Nikolic, D. A. Patterson, and K. Asanović, "BOOM v2: An open-source out-of-order RISC-V core," Tech. Rep. UCB/EECS-2017-157, EECS Dept., Univ. California, Berkeley, CA, USA, 2017.
4. D. E. Denning, "A lattice model of secure information flow," Commun. ACM, vol. 19, pp. 236–243, May 1976.
5. D. Evtyushkin, R. Riley, N. C. Abu-Ghazaleh, and D. Ponomarev, "BranchScope: A new side-channel attack on directional branch predictor," in Proc. 23rd Int. Conf. Archit. Support Program. Lang. Oper. Syst., 2018, pp. 693–707.
6. A. Ferraiuolo, M. Zhao, A. C. Myers, and G. E. Suh, "HyperFlow: A processor architecture for nonmalleable, timing-safe information flow security," in Proc. ACM SIGSAC Conf. Comput. Commun. Secur., 2018, pp. 1583–1600.
7. O. Goldreich and R. Ostrovsky, "Software protection and simulation on oblivious RAMs," J. ACM, vol. 43, pp. 431–473, May 1996.
8. P. Kocher et al., "Spectre attacks: Exploiting speculative execution," in Proc. IEEE Symp. Secur. Privacy, 2019, pp. 1–19.
9. S. Sasy, S. Gorbunov, and C. W. Fletcher, "ZeroTrace: Oblivious memory primitives from Intel SGX," in Proc. Netw. Distrib. Syst. Secur. Symp., San Diego, CA, USA, Feb. 18–21, 2018. Available: http://dx.doi.org/10.14722/ndss.2018.23239
10. M. Tiwari, H. M. Wassel, B. Mazloom, S. Mysore, F. T. Chong, and T. Sherwood, "Complete information flow tracking from the gates up," in Proc. 14th Int. Conf. Archit. Support Program. Lang. Oper. Syst., 2009, pp. 109–120.
11. Z. Wang and R. B. Lee, "New cache designs for thwarting software cache-based side channel attacks," in Proc. 34th Annu. Int. Symp. Comput. Archit., 2007, pp. 494–505.
12. J. Yu, M. Yan, A. Khyzha, A. Morrison, J. Torrellas, and C. Fletcher, "Speculative taint tracking (STT): A comprehensive protection for speculatively accessed data," in Proc. 52nd Annu. IEEE/ACM Int. Symp. Microarchit., 2019, pp. 954–968.

Jiyong Yu is currently working toward the Ph.D. degree at the University of Illinois at Urbana–Champaign. His research interests are in processor security. Contact him at jiyongy2@illinois.edu.

Lucas Hsiung received the B.S. degree from the University of Illinois at Urbana–Champaign in 2019 and now works as a Security Verification Engineer at SiFive. Contact him at lucas.hsiung@sifive.com.

Mohamad El Hajj is currently working toward the M.S. degree at the University of Illinois at Urbana–Champaign with research interests in hardware, ECE, and security. Contact him at melhajj2@illinois.edu.

Christopher W. Fletcher is an Assistant Professor in computer science at the University of Illinois at Urbana–Champaign. He has interests ranging from computer architecture to security to high-performance computing (ranging from theory to practice, algorithm to software to hardware). Fletcher received the Ph.D. degree from Massachusetts Institute of Technology in 2016. Contact him at cwfletch@illinois.edu.

Theme Article: Top Picks

Trace Wringing for


Program Trace Privacy
Deeksha Dangwal, University of California, Santa Barbara
Weilong Cui, Google, Inc.
Joseph McMahan, University of Washington
Timothy Sherwood, University of California, Santa Barbara

Abstract—A quantitative approach to optimizing computer systems requires a good understanding of how applications exercise a machine, and real program traces from production environments lead to the clearest understanding. Unfortunately, even the simplest program traces can leak sensitive details about users, their recent activity, or even details of trade secret algorithms. Given the cleverness of attackers working to undo well-intentioned, but ultimately insufficient, anonymization techniques, many organizations have simply decided to cease making traces available. Trace wringing is a new formulation of the problem of sharing traces where one knows a priori how much information the trace is leaking in the worst case. The key idea is to squeeze as much information as possible out of the trace without completely compromising its usefulness for optimization. We demonstrate the utility of a wrung trace through cache simulation and examine the sensitivity of wrung traces to a class of attacks on Advanced Encryption Standard (AES) encryption.

Digital Object Identifier 10.1109/MM.2020.2986113
Date of publication 8 April 2020; date of current version 22 May 2020.

PRIVACY IN THE digital age has become increasingly difficult to achieve and a contentious topic. As technologies that capitalize on facial recognition, location services, and personal health tracking become mainstream, addressing these complex privacy issues is of foremost importance. Policy makers have put in place regulations on data protection through the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA). Computer scientists and engineers must develop systems and tools for embedding privacy into existing and new workflows. In this article, we describe a new approach to privacy, wringing, with particular applicability to the problem of sharing program traces.
When working toward application-tuned systems, developers often find themselves caught between the need to share information (so that partners can make intelligent design choices) and the need to hide information (to protect

proprietary methods or sensitive data). One place where this problem comes to a head is in the release of program traces; even the simplest memory access traces leak a tremendous amount of information. For example, we can capture the memory access behavior of a critical cryptographic function (which is known to be a function of the secret key), a set of lookups corresponding to the parsing of a social security number, or even detailed system configuration parameters that are considered a trade secret. While the sharing of these traces between technology partners can lead to more robust and high-performance systems, it can also leak highly sensitive information, and expose user data to security vulnerabilities. Today when such traces are needed, programmers may be asked to obfuscate the key algorithm behaviors to hide sensitive data or provide models of the system, which approximate the same behavior but omit sensitive parts. Hand-built models of the system are both tedious to code and of limited predictive power. Since there is no well-defined and well-trusted approach to this problem, developers are often forced to resort to rough human-language descriptions of the behavior of programs (e.g., it is 80% pointer-chasing). This leads to missed opportunities, frustrated optimization, and the design process ultimately suffers. Ideally, engineers would access methods to eliminate any sensitive information from the traces while still capturing the program behavior and its interaction with the underlying hardware. However, the extent to which "sensitive" data influences program behavior is rarely understood by a single party, and even harder to argue is that it is completely absent from a trace.
We present a new formulation of this problem of sharing traces where before release one knows (a priori) exactly how much information a trace is leaking in the worst case. The key idea, wringing, is to squeeze as much information as possible out of the trace without completely compromising its utility. In the ideal case, only the useful structure of the trace remains and all potentially sensitive data has been eliminated. While there is no known mechanism of quantifying the amount of sensitive data that remains in an arbitrary trace, we can at least say how much total information is shared, which provides a useful upper bound. If we share only a couple thousand bits about a trace, we can then be certain we are not giving away every user's social security number by accident. Reconstructing a useful trace from a few thousand bits of information is hard, but interestingly we are free to use any public information about the nature of these traces in helping us accomplish this. Compression, when taken to this extreme and lossy form, connects to privacy in this unexpected way. However, as is often the case in computer architecture, an important tradeoff remains between information leaked and ability of the trace to capture the program behavior.
We formalize this new approach specifically in the context of memory address traces in part because we have many prior trace analysis techniques to build on.7,9,12 To expose the tradeoff inherent to this problem, we explore a new class of memory trace synthesis techniques based on ideas from signal processing. By projecting the address space onto a wrapped 2-D heatmap, we decompose memory behavior into an orthogonal set of features that can then be replayed to reproduce the same "visible" patterns as the traces under examination. Specifically, we use a Hough-transformed3 trace to find both constant and strided access patterns. We find that for memory traces it is indeed possible for useful program behavior to be conveyed in only a few thousand bits. We demonstrate the utility of wrung traces through cache simulation with bounded leakage, and even examine the sensitivity of wrung traces to a class of attacks on AES encryption.

TRACE WRINGING AS A NEW GAME
The program traces we look at in this article are memory access traces specifically, but more generally fall into a class of traces useful for application-tuning and hardware–software co-optimization (as opposed to debug traces).


A program trace can contain a tremendous amount of information about the system under evaluation. But, as we know, such traces are invaluable for performance evaluation because they demonstrate the way the system actually behaves in the face of the workloads it must actually handle. While the behaviors are important at a high level, rarely are the specific elements of the trace critical. Rather it is the relationship between those elements and the proportions that they appear in the trace that is often the key. This is of course not a new insight; what we claim as new is the idea that we can formalize these schemes in such a way that it bounds the amount of information leaked about a system being traced.
Our privacy argument is simple: if we only share n bits about a specific trace, then we cannot leak more than n bits about that trace. In practice, this means that if we share only a few thousand bits of information about a trace, then nothing beyond those bits has been leaked. While it is not a perfect solution (some information might be lost), it says something useful about the maximum amount of information that can be leaked. For example, it should be impossible to recover an extensive list of social security numbers, sensitive health information, or even an entire set of secret keys from such a trace. To maximize privacy one wants to give away as little data as possible about the trace. However, to maximize utility the opposite is true. The question is then how little can one give away from the trace while still being useful?

Figure 1. Forcing a trace through a channel with a capacity of only a few bits bounds the amount of sensitive data shared. While public information such as prior non-private traces can be used in the creation of the code, the trace to be coded must not be known to the receiver. The objective is to minimize the number of bits shared while maximizing the utility of the proxy trace. We measure the utility in terms of whether or not certain utility tests are passed by the proxy and/or how close to the original test results they get. We present a signal processing approach to reduce the trace to an n-bit channel.

Answering this question requires an analysis across two metrics: information leaked and utility, as described in Figure 1. Information is surprisingly easy to quantify; it is the number of bits from the secret trace that needs to be "transmitted" between the full trace (which contains every address) and the proxy trace (which is a stand-in for the full trace and is ready for release). In Figure 1, Step 1 is to encode the secret trace. Note that any information from public traces or training data can be shared freely and even hard-coded into the "receiver," but in the end, everything you wish to share about the full trace must be represented in a single n-bit "packet" (Step 2 in Figure 1). Quantifying utility is harder and more use-case specific. For memory address traces, we define a distance function between cache miss-rates of trace vectors as one such function (Step 3 in Figure 1), but, in general, there are many other metrics one might use.
SIGNAL PROCESSING APPROACH TO WRINGING
Given the abovementioned constraints, the question is how to encode memory address trace behavior in a general, and yet incredibly compact, manner. Our compact representation must also capture the structure of these traces so that we can identify, describe, and quantify the patterns that we care most about. We present a signal processing pipeline for trace wringing. Our approach describes traces as a probabilistic grammar of generators coupled with very high level accounting of behavior over time. The "transmitted" bits encode both the structure and parameters of this scheme.
To understand our approach at a high level it is useful to start with a visual sense for the structure of such traces. We project the address trace onto a fixed-size modulo-mapping of the memory spaces to create a heatmap.
Figure 2 shows such a heatmap for gcc where instruction count (time) runs along the x-axis and the address runs along the y-axis. If we were to plot this for the entire memory, it would clearly be too large for such a graph (the distance between the stack and heap would dwarf any local behavior), so instead, we plot the address modulo a large power of two. Heatmaps such as this have the advantage of mapping addresses onto a more manageable space, but at the same time, keep the spatio–temporal structures that would actually impact a real cache.

Figure 2. Phases visible in the trace generated by gcc after k-means clustering. Each of the three colors in the bar marks a unique phase in the trace. Note, importantly, that phases reoccur over time.

Interesting and intuitive patterns emerge after looking over this graph. The flat horizontal lines in the graph are patterns of repeating access to a set of addresses. These are high temporal locality behaviors. Sharp diagonal lines, on the other hand, are regions of high spatial locality as addresses are accessed one after the other in succession. If we can concisely capture the character of these behaviors, without transmitting the addresses themselves, we can minimize the amount of information leaked. The modulo-memory heatmaps exhibit hierarchical organization. Globally, there exists a recurrence of similar patterns on the order of a few tens of thousands of instructions, i.e., the presence of program phases, and within them, we observe patterns that we associate with the more local memory access activity. In order to find some representative of the higher echelons of this hierarchy, we employ k-means clustering for program phase analysis.2,9 Rather than encoding the entire trace monolithically, we can encode just the k representative clusters independently. By breaking the pattern down into a set of simpler behaviors, we can then tackle them one-by-one. Figure 2 shows the result of running the phase detector on the memory address trace for gcc. Each of the three colors in the bar in the figure shows the occurrence of three unique phases in the memory access trace. The technique does a good job of lining up with the repeating structures in the heatmap. With these phases marked, we can encode the k representative clusters with log2 k bits.
Given that both strong temporal and spatial locality features show up as lines, decomposition into a set of line segments is a natural place to start. The Hough transform can be used to then find the locations and orientations of certain geometric primitives, such as lines, in the given space. We apply the Hough transformation, a popular computer vision technique for detecting patterns in images; for our features, we employ the Hough line transform. Specifically, we use the progressive probabilistic Hough transform,5 a rendition of the Hough transform algorithm that only performs voting on a subset of the input points. These input points are chosen based on certain features of the expected result, such as a threshold, the length of the expected line, interpolation strategies, and the angle of the line. By interleaving the voting process with line detection, this algorithm finds the most prevalent features first, while also minimizing the computational load. The progressive probabilistic Hough transform returns a set of lines, with each line's (x, y) coordinates in the modulo-memory heatmap space. We also introduce a variable, weight, for each line, which is a measure of darkness of the line. Some intuition about how the probabilistic Hough transform functions is described in Figure 3.
The list of phase identifiers (the result of clustering), the two (x, y) coordinates of each line segment detected by the Hough transforms, and the line's weight in the representative phase, create compact "information packets." The size of the total "transmission" is n and bounds the maximum amount of information leaked.
After phase detection and Hough-line transformation, we end up with a set of lines for each representative phase. We can see the decomposition of a phase of gcc into lines in Figure 4.

Figure 3. We capture information about lines we observe in trace heatmaps using the Hough transform. Here, we demonstrate its working. The points on the test image are surveyed for parameters in the polar coordinate space described as the Hough transform. The intersections describe the parameters of the detected lines. The final figure shows the probabilistic Hough lines, the more robust and efficient algorithm. For our heatmaps, we use the probabilistic Hough line algorithm.

Figure 4. Producing probabilistic Hough lines5 on top of the heatmap for a program phase in gcc. The colors are used to indicate distinct lines produced by the decomposition. Line length, threshold, line gap, and line slope are input parameters that can be used to adjust the tradeoff space between information leakage and utility. For example, choosing a small line length will ensure that even the smallest lines are captured, but at the cost of transmitting information about even the smallest features. Here, we see how the choice of parameters can be used as a knob to control the features we want to encode to find a suitable tradeoff point.

Each phase is also assigned a label indicating to which cluster it belongs, i.e., which representative phase "represents" it. Since the structural information of each phase is encoded in the Hough lines, we can generate an "address tracelet" for each phase using the representative's lines. Phases from the same cluster may occur intermittently and in different lengths. For all phases in the same cluster, we generate patterns continuously in a rotating fashion regardless of the length. Upon picking a Hough line at time t, we generate an address "segment" from that line based on a fixed segment length, which captures locality at a small granularity. The final proxy trace has the same length as the original trace and captures the most salient aspects of its behavior while at the same time leaking no more than n bits.
EVALUATION AND OVERVIEW OF RESULTS
In our pipeline, we pose the problem of sharing traces between technology partners with two subsystems. The trace-wringing subsystem minimizes the number of bits used to describe the trace structure, and the generator subsystem uses budgeted information to generate the proxy trace. Trace wringing includes the generation of heatmaps from memory access traces, phase analysis, decomposition of representative phases into lines, and creation of packets to transmit this information. Note that there are actually many possible values of n and different parameters will result in different tradeoffs between utility and privacy.
To evaluate the effectiveness of the approach, we take a subset of the SPEC2006 traces,6 wring them through our pipeline to a target number of bits, a bit budget, and evaluate the traces across a range of cache configurations with regards to miss rate.

Figure 5. (a) Heatmaps for the sensitive input trace gcc and (b) the trace-wrung proxy generated by our
pipeline. The heatmap of the trace-wrung proxy shows that both global and local features line up with the input
trace. All but the subtlest of patterns are present in the trace-wrung proxy.

We indeed observe that as the bits of information leakage increase, the proxy miss rate gets closer to the ground truth miss rate, which confirms that with more information shared, the proxy trace we reconstruct becomes more similar to the original trace in terms of structure. The exact nature of the tradeoff space is explained more in the paper. In our setting, we define utility in terms of similarity of cache miss rates. We measure and present cache miss rates in the article. In Figure 5, we compare the proxy heatmap generated for gcc against the ground truth. Our wrapped address space is of height 2048 (cache lines in the heatmap) and each column in the heatmap corresponds to 10 000 memory accesses. The figure illustrates that our approach is able to capture all but the subtlest of patterns.
It is also worth noting that we further explored real attacks on the resulting traces. Specifically, we choose to examine the trace to see if it is possible to recover an AES key using known attacks. AES attacks based on cache sets have been well-studied.8 We present the details of the attack in our paper. We find that trace-wrung proxies completely stop traditional AES attacks. We perform this attack on a set of traces collected from runs of AES with a random plaintext. We perform the same attack pre- and post-wringing. Prewringing, the attacker correctly guesses the upper 5 bits of all 16 key-bytes after 1838 encryptions. This is the maximal information that can be learned in a first-round attack with 8-byte cache lines. Postwringing, the attack guesses wrong for all 16 bytes of the key after 50 000 traces.

CONCLUSION AND FUTURE IMPACT
Looking further out, the conflict between the need to share information (to provide more optimal performance) and hide information (for privacy) is becoming increasingly fundamental in all of computer science. Threats to personal data privacy are emerging as a leading concern for users. While the European GDPR and CCPA put in place privacy and data protection requirements, the onus of implementing tools to understand and embed privacy into systems falls on engineers. Computer architects must start thinking more about privacy and provide infrastructure to enable privacy at all levels of computing. Trace wringing can be leveraged as one such tool.
We feel the tradeoff space exposed by trace wringing will open the doors to future work at the intersection of privacy and computer systems optimization. We hope to see more follow-up work that builds on years of community experience dealing with address traces and to encode common patterns in a general way. In many applications, striding memory behavior is an important component and we believe we are the first to connect the address trace analysis problem with the Hough transform. The resulting analysis is surprisingly robust to noise and can capture general striding behavior. While this approach is effective for the memory problems we examined, there is no shortage of opportunity to build on the techniques we lay out to create more robust and higher quality trace wringing systems. Fully leveraging the best synthetic trace generation,11 trace compression,1,4 quantitative information flow,10 and statistical modeling12 techniques and understanding what they each bring to the problem is one next step.


they each bring to possible, and perhaps more importantly, they


We feel the tradeoff
the problem is one can then be improved when fed back as a system
space exposed by
next step. Bringing trace wringing will open requirement.
the full algorithmic the doors to future work
power provided by at the intersection of & REFERENCES
the fact that any privacy and computer
public trace data systems optimization. 1. M. Burtscher, I. Ganusov, S. J. Jackson, J. Ke,
can be leveraged in We hope to see more P. Ratanaworabhan, and N. B. Sam, “The VPC trace-
the compression is follow-up work that compression algorithms,” IEEE Trans. Comput.,
also very promising. builds on years of com- vol. 54, no. 11, pp. 1329–1344, Nov. 2005.
This opportunity is munity experience 2. A. S. Dhodapkar and J. E. Smith, “Comparing
dealing with address program phase detection techniques,” in Proc. 36th
particularly inter-
traces and to encode
esting as it sits out- Annu. IEEE/ACM Int. Symp. Microarchit., 2003,
common patterns in a
side of any past pp. 217–227.
general way.
Bringing the full algorithmic power provided by the fact that any public trace data can be leveraged in the compression is also very promising. This opportunity is particularly interesting as it sits outside of any past lossy compression or synthetic trace scheme's ability to exploit (i.e., minimizing total data transferred is different than minimizing sensitive data transferred).

In application-tuning and system design, one can certainly understand how related problems exist with storage traces, cache coherence traffic, energy usage, user interaction data, and certainly location data. Clever, yet complex, techniques have been developed to address certain anonymity problems in the past, yet the reality is that they are often dependent on specific assumptions such as a lack of prior information, statistical distributions governing the data, or that the number of queries can be tightly bounded. Our wringing approach is very direct, and that comes with clarity as to what it does and does not do. It does not guarantee anything about how useful the resulting trace will really be for optimization. However, it does transform the problem of safe sharing into a measurable systems problem subject to the myriad tools we have at our disposal for common-case optimization. Furthermore, it does provide a strong and clear bound on the amount of useful information given by the trace. For the purposes of privacy engineering, this is exceedingly valuable.
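The arithmetic behind that bound is deliberately blunt: if the shared proxy can be regenerated from a k-bit description, then the proxy can reveal at most k bits about the original trace, no matter what analysis the recipient runs. A hypothetical budget check follows (the parameter names and numbers are invented for illustration, not the paper's interface):

```python
# Hypothetical numbers, not the paper's interface: the size of the
# description that regenerates the proxy upper-bounds what it can leak.
phases, bits_per_phase = 12, 800   # assumed wringing budget per program phase
leak_bound_bits = phases * bits_per_phase
print(f"shared proxy reveals at most {leak_bound_bits} bits")  # 9600 bits
```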
There is a consensus on the importance of building privacy into systems that deal with information about health, legal records and law enforcement, transportation and location, and other sensitive information. But the capability of today's tools and methodologies is limited. Trace wringing provides evidence that new methods that bound information sharing in useful ways are possible, and perhaps more importantly, they can then be improved when fed back as a system requirement.

We feel the tradeoff space exposed by trace wringing will open the doors to future work at the intersection of privacy and computer systems optimization. We hope to see more follow-up work that builds on years of community experience dealing with address traces and to encode common patterns in a general way.

REFERENCES

1. M. Burtscher, I. Ganusov, S. J. Jackson, J. Ke, P. Ratanaworabhan, and N. B. Sam, "The VPC trace-compression algorithms," IEEE Trans. Comput., vol. 54, no. 11, pp. 1329–1344, Nov. 2005.
2. A. S. Dhodapkar and J. E. Smith, "Comparing program phase detection techniques," in Proc. 36th Annu. IEEE/ACM Int. Symp. Microarchit., 2003, pp. 217–227.
3. R. O. Duda and P. E. Hart, "Use of the Hough transformation to detect lines and curves in pictures," Commun. ACM, vol. 15, no. 1, pp. 11–15, 1972.
4. E. N. Elnozahy, "Address trace compression through loop detection and reduction," in Proc. ACM SIGMETRICS Perform. Eval. Rev., 1999, vol. 27, pp. 214–215.
5. C. Galamhos, J. Matas, and J. Kittler, "Progressive probabilistic Hough transform for line detection," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., 1999, vol. 1, pp. 554–560.
6. J. L. Henning, "SPEC CPU2006 benchmark descriptions," ACM SIGARCH Comput. Archit. News, vol. 34, no. 4, pp. 1–17, 2006.
7. P. Michaud, "Online compression of cache-filtered address traces," in Proc. IEEE Int. Symp. Perform. Anal. Syst. Softw., 2009, pp. 185–194.
8. D. A. Osvik, A. Shamir, and E. Tromer, "Cache attacks and countermeasures: The case of AES," in Proc. Cryptographers' Track RSA Conf., Springer, 2006, pp. 1–20.
9. T. Sherwood, E. Perelman, G. Hamerly, and B. Calder, "Automatically characterizing large scale program behavior," ACM SIGARCH Comput. Archit. News, vol. 30, no. 5, pp. 45–57, 2002.
10. G. Smith, "On the foundations of quantitative information flow," in Proc. Int. Conf. Foundations Softw. Sci. Comput. Struct., Springer, 2009, pp. 288–302.
11. L. Van Ertvelde and L. Eeckhout, "Dispersing proprietary applications as benchmarks through code mutation," ACM SIGARCH Comput. Archit. News, vol. 36, pp. 201–210, 2008.
12. J. Weinberg and A. E. Snavely, "Accurate memory signatures and synthetic address traces for HPC applications," in Proc. 22nd Annu. Int. Conf. Supercomput., ACM, 2008, pp. 36–45.
Deeksha Dangwal is currently working toward the Ph.D. degree in computer architecture with the Department of Computer Science, University of California, Santa Barbara. Her research interests include computer architecture, privacy, and information theory. She is a student member of IEEE and ACM. She is the corresponding author of this article. Contact her at deeksha@cs.ucsb.edu.

Weilong Cui is currently a Software Engineer with Google. His research interests include statistical/economic-inspired methods and programming languages for computer architecture performance modeling, as well as novel micro/system-architecture and its interaction with software. Cui received the master's and bachelor's degrees in computer science from Peking University and the Ph.D. degree from the Department of Computer Science, University of California, Santa Barbara. Contact him at cuiwl@google.com.

Joseph McMahan is currently a Research Scientist with the University of Washington, Seattle. His research interests include computer architecture, security, formal methods, and machine learning. McMahan received the Ph.D. degree from the University of California, Santa Barbara. He is a member of ACM and IEEE. Contact him at jmcmahan@cs.washington.edu.

Timothy Sherwood is currently a Professor of computer science and the Associate Vice-Chancellor for Research with the University of California, Santa Barbara. He is a cofounder of the hardware security startup Tortuga Logic and the 2016 ACM SIGARCH Maurice Wilkes Awardee "for contributions to novel program analysis advancing architectural modeling and security." Sherwood received the B.S. degree in computer science from UC Davis, and the M.S. and Ph.D. degrees from UC San Diego. Contact him at sherwood@cs.ucsb.edu.

HOST 2020
IEEE INTERNATIONAL SYMPOSIUM ON HARDWARE-ORIENTED SECURITY AND TRUST
6–9 Dec. 2020 • San Jose, CA, USA • DoubleTree by Hilton

Join dedicated professionals at the IEEE International Symposium on Hardware-Oriented Security and Trust (HOST) for an in-depth look into hardware-based security research and development.

Key Topics:
• Semiconductor design, test, and failure analysis
• Computer architecture
• Systems security
• Cryptography and cryptanalysis
• Imaging and microscopy

Discover innovations from outside your sphere of influence at HOST. Learn about new research that is critical to your future projects. Meet face-to-face with researchers and experts for inspiration, solutions, and practical ideas you can put to use immediately.

REGISTER NOW: www.hostsymposium.org


Department: Micro Economics

Pandemics and the Dismal Technology Economy
Shane Greenstein
Harvard Business School

& WE SPEAK OF viruses, though not the ones forecasts are challenging to make, but less so in
that infect computers. Believe it or not, this insurance markets due to the abundance of prece-
follows a long tradition in economics. Two cen- dent. It is possible to ground a forecast in histori-
turies ago, T. Malthus first made his grim pre- cal patterns.
dictions about death from resource shortages. Forecasts become more difficult when every-
Economics has been known as the dismal sci- thing is hypothetical. For example, you may
ence ever since. have heard about “stress tests” for banks, which
Grim is the mood of the day. This is a painful were implemented after the financial mess in
moment for many stockholders, managers, and 2008. That “test”—more like an
personnel. Many readers may audit—focuses on whether the bank
not have had to think about In spite of the lack of can survive a rare and hypothetical
this topic since the financial precedent, today’s painful scenario. (Now that we are
panic of 2008, or maybe even economic crisis experiencing an actual scenario, we
the dot-com bust. broadly contains pre- will find out if the hypothetical is
Dismal as this topic is, let dictable features. Here planned appropriately.)
us understand how the pan- is why. The economy is Most of today’s dislocation realizes
one big circular flow of
demic could cause so much scenarios that had been hypothetical
expenditure in which
economic damage, and why it until a few months ago. This time
one person’s purchase
may take so long to recover. markets inherit the economic decline
is another person’s
I will keep it basic and focus sale, and that further from shutting down services in the
on the economics of this goes into somebody’s economy—more in a minute. There is
situation for the technology paycheck from which nothing in recent history like this.
economy. more purchases arise. The situation looks quite different
than it did in 2008 or 2001, where
the disruption in those instances origi-
SHORT-RUN EFFECTS nated with the mortgage lending crisis, and with
Analysts have a special set of models for sce-
the misdirected investments behind the dotcom
narios that arise with low probability and impose
boom.
high expense. You may have come across such
In spite of the lack of precedent, today’s eco-
models when buying life insurance or cata-
nomic crisis broadly contains predictable fea-
strophic medical insurance. These types of
tures. Here is why. The economy is one big
circular flow of expenditure in which one person’s
Digital Object Identifier 10.1109/MM.2020.2984182 purchase is another person’s sale, and that fur-
Date of current version 22 May 2020. ther goes into somebody’s paycheck from which

0272-1732 ß 2020 IEEE Published by the IEEE Computer Society IEEE Micro
118
In spite of the lack of precedent, today's economic crisis broadly contains predictable features. Here is why. The economy is one big circular flow of expenditure in which one person's purchase is another person's sale, and that further goes into somebody's paycheck from which more purchases arise. In addition, it just keeps going. Virtually every part of the U.S. economy has been finely tuned to expect this circular flow. Until recently this flow was predictable.

The pandemic made the expectations about flows obsolete. That introduces many dislocations. Here is the chain of causality around which all hypothetical forecasts revolve. Restaurants, bars, schools, theaters, and sporting events have closed in all major locations where people "shelter in place." As of this writing, more than a third of the population lives in areas with such lockdowns, and it could increase. Related parts of the economy also have slowed considerably: travel on airlines, staying in hotels, vacationing at beaches, and enjoying amusement parks.

Depending on how you count it, that interrupts somewhere between 10% and 15% of the expenditure in the US economy, and the jobs of somewhere between 10% and 20% of the labor force. In the third week of March over three million people—more than 2% of the labor force—filed new claims for unemployment. That has never before happened in one week.

What happens next? Two additional concepts help round out the big picture: When everybody cuts back a little, it adds up and reinforces the overall movements downward. Once everyone expects the worst, as we do now, then the low forecast becomes self-fulfilling. Economists call these "multiplier" effects and "self-reinforcing expectations," and it makes the whole worse than the sum of its parts.

By way of analogy, system engineers might recognize this phenomenon as secondary feedback effects. The sum of secondary feedbacks reinforces the first-order effects, and makes the system behave at a suboptimal equilibrium. Worst of all, once it settles (at a high level of unemployment), it does not move away easily.

The dot-com bust started from a different place, but events there illustrate how multiplier effects operate. In that case, the loss of financial confidence led many of those online firms to cancel orders for equipment at the same time—PCs, office networking, and application software. That caused liquidity issues at otherwise healthy and efficient suppliers. No funds came in for new sales, while inventories of final products piled up in warehouses, and the bills for last month's inputs and workforce needed attention.

It almost goes without saying, but gloomy forecasts about illiquidity are the mood of the day at many firms. Related, this is also part of the logic behind federal legislation to make credit available to business and cash to households. It alleviates suffering and delays liquidity crises.

The uncertainty about expenditure explains some of the "wait and see" remarks from financial analysts.

WHAT THIS DOES TO FIRMS

Despite the uncertainty, what can we say? Representative examples can illustrate.

Many firms get their flow of funds directly from expenditures for leisure and travel. When that expenditure drops, so too do sales at, for example, Airbnb, Priceline, Hotels.com, Expedia, Orbitz, and more. When restaurants receive fewer visits, so too does advertising on Open Table and Yelp and, to some extent, Google.

Related local transportation, such as Uber and Lyft, declines. Close to a third of their trips go to airports. Additionally, many people are sheltering at home, neither going to the local restaurant nor far away on a vacation. Furthermore, while Uber Eats might make more deliveries, it is not enough to make up for the overall decline.

The next few months are a lousy time to sell a consumer product in a retail outlet. Sales of iPhones and smart phones have declined, as consumers put off an optional purchase and retail outlets close for health safety. Same forecast for printers, home networks, and tablets. Every parts supplier within those supply chains should expect lower sales in the near term. That hurts sales at Apple, Samsung, Intel, Qualcomm, and HP, as well as distributors, like CDW, Staples, and Office Max. Ouch.

The next generation of online games and entertainment looks like a mixed bag. Online channels help. Freemium services will tend to do better than subscription services, if any of them does well at all. General use will go up, especially for established brands, such as Electronic Arts. But not many devices will be sold, such as Xbox or PlayStation, except through online channels.
What about Facebook and Google? Both have seen use go up, which means they sell more ads with more people online. With less shopping overall, however, each ad is not as valuable. The value of the ad-based business could increase or decline overall.

You may say: what about online retailing? Amazon should be able to take advantage of everyone shopping online instead of visiting malls. And perhaps some of their third-party suppliers will be fine as well (e.g., if they have hand sanitizer to sell). That said, the most profitable division at Amazon is AWS, and more broadly, nobody expects AWS, Azure, or Google Cloud to decline. The flexibility inherent in that business model gives them advantages over construction of new data centers or equipment purchases for business.

Streaming services appeal to all the people stuck at home. Established services, such as Netflix, Hulu, YouTube, and Amazon Prime Video, are positioned to do well. This environment also gives an opportunity to HBO Go, CBS, Sling, PlayStation Vue, and a few others.

That said, new launches are always risky, but that is especially so in this environment. For example, Quibi—with its hype and subscription price—probably would have been better off launching just a few months earlier, just as DisneyPlus did. Will Quibi gain traction in this environment? Anybody's guess.

Some online communication tools have gotten more use too, such as WhatsApp, Zoom, and Slack. Google Hangouts, Skype, and Webex have seen a resurgence, and so have lesser known services, such as Join.me, GoToMeeting, Stride, RingCentral, and BigMarker.

The carrier business will see a mix of experiences. Traffic has gone up for both wireline and wireless networks, as home networks press into capacity utilization far above anticipated levels. This situation puts pressure on users to increase wireless data contracts. Wireless firms also should benefit from pressure at households dropping broadband and going only with wireless Internet. Broadband carriers have seen home traffic increase, while pressures for cord-cutting increased (i.e., diminishing television contracts). The broadband business also has always had to manage a huge fraction of households who do not pay their bills on time, which also should grow worse over the next few months.

Finally, the financial side of technology will go through a somewhat predictable decline. No VC will have an IPO in the near term. Many startups with cash flow issues will be sold at fire sale prices to large buyers as acquisitions. As happened in 2009, many VCs will tend to cut off their worst performing firms, but which firms and when? Good luck with forecasting that.

UNANSWERED QUESTIONS

Most people want to know: how long will this last? Forecasting the recovery is especially difficult. Both biology and economics play a role. So does economic policy.

Many analysts want to know: Will any of the behavior exhibited during the period of home sheltering persist into the future?

As I write this, there is simply no precedent for making an educated guess. There are too many unknowable unknowns.

Shane Greenstein is a Professor at the Harvard Business School. Contact him at sgreenstein@hbs.edu.

PURPOSE: The IEEE Computer Society is the world's largest association of computing professionals and is the leading provider of technical information in the field.

MEMBERSHIP: Members receive the monthly magazine Computer, discounts, and opportunities to serve (all activities are led by volunteer members). Membership is open to all IEEE members, affiliate society members, and others interested in the computer field.

COMPUTER SOCIETY WEBSITE: www.computer.org

OMBUDSMAN: Direct unresolved complaints to ombudsman@computer.org.

CHAPTERS: Regular and student chapters worldwide provide the opportunity to interact with colleagues, hear technical experts, and serve the local professional community.

AVAILABLE INFORMATION: To check membership status, report an address change, or obtain more information on any of the following, email Customer Service at help@computer.org or call +1 714 821 8380 (international) or our toll-free number, +1 800 272 6657 (US):
• Membership applications
• Publications catalog
• Draft standards and order forms
• Technical committee list
• Technical committee application
• Chapter start-up procedures
• Student scholarship information
• Volunteer leaders/staff directory
• IEEE senior member grade application (requires 10 years practice and significant performance in five of those 10)

PUBLICATIONS AND ACTIVITIES
Computer: The flagship publication of the IEEE Computer Society, Computer publishes peer-reviewed technical content that covers all aspects of computer science, computer engineering, technology, and applications.
Periodicals: The society publishes 12 magazines and 18 journals. Refer to membership application or request information as noted above.
Conference Proceedings & Books: Conference Publishing Services publishes more than 275 titles every year.
Standards Working Groups: More than 150 groups produce IEEE standards used throughout the world.
Technical Committees: TCs provide professional interaction in more than 30 technical areas and directly influence computer engineering conferences and publications.
Conferences/Education: The society holds about 200 conferences each year and sponsors many educational activities, including computing science accreditation.
Certifications: The society offers three software developer credentials. For more information, visit www.computer.org/certification.

BOARD OF GOVERNORS MEETING
24–25 September 2020 in McLean, Virginia, USA

EXECUTIVE COMMITTEE
President: Leila De Floriani
President-Elect: Forrest Shull
Past President: Cecilia Metra
First VP: Riccardo Mariani; Second VP: Sy-Yen Kuo
Secretary: Dimitrios Serpanos; Treasurer: David Lomet
VP, Membership & Geographic Activities: Yervant Zorian
VP, Professional & Educational Activities: Sy-Yen Kuo
VP, Publications: Fabrizio Lombardi
VP, Standards Activities: Riccardo Mariani
VP, Technical & Conference Activities: William D. Gropp
2019–2020 IEEE Division VIII Director: Elizabeth L. Burd
2020–2021 IEEE Division V Director: Thomas M. Conte
2020 IEEE Division VIII Director-Elect: Christina M. Schober

BOARD OF GOVERNORS
Term Expiring 2020: Andy T. Chen, John D. Johnson, Sy-Yen Kuo, David Lomet, Dimitrios Serpanos, Hayato Yamana
Term Expiring 2021: M. Brian Blake, Fred Douglis, Carlos E. Jimenez-Gomez, Ramalatha Marimuthu, Erik Jan Marinissen, Kunio Uchiyama
Term Expiring 2022: Nils Aschenbruck, Ernesto Cuadros-Vargas, David S. Ebert, William Gropp, Grace Lewis, Stefano Zanero

EXECUTIVE STAFF
Executive Director: Melissa A. Russell
Director, Governance & Associate Executive Director: Anne Marie Kelly
Director, Finance & Accounting: Sunny Hwang
Director, Information Technology & Services: Sumit Kacker
Director, Marketing & Sales: Michelle Tubb
Director, Membership Development: Eric Berkowitz

COMPUTER SOCIETY OFFICES
Washington, D.C.: 2001 L St., Ste. 700, Washington, D.C. 20036-4928; Phone: +1 202 371 0101; Fax: +1 202 728 9614; Email: help@computer.org
Los Alamitos: 10662 Los Vaqueros Cir., Los Alamitos, CA 90720; Phone: +1 714 821 8380; Email: help@computer.org

MEMBERSHIP & PUBLICATION ORDERS
Phone: +1 800 678 4333; Fax: +1 714 821 4641; Email: help@computer.org

IEEE BOARD OF DIRECTORS
President: Toshio Fukuda
President-Elect: Susan K. "Kathy" Land
Past President: José M.F. Moura
Secretary: Kathleen A. Kramer
Treasurer: Joseph V. Lillie
Director & President, IEEE-USA: Jim Conrad
Director & President, Standards Association: Robert S. Fish
Director & VP, Educational Activities: Stephen Phillips
Director & VP, Membership & Geographic Activities: Kukjin Chun
Director & VP, Publication Services & Products: Tapan Sarkar
Director & VP, Technical Activities: Kazuhiro Kosuge

revised May 2020
