
VOLUME 39, NUMBER 4 JULY/AUGUST 2019

Secure Architectures

www.computer.org/micro
Call for 2019 Major
Awards Nominations
Deadline: 1 October 2019

Help Recognize Computing’s Most Prestigious Individuals


IEEE Computer Society awards recognize outstanding achievements and highlight
significant contributors in the teaching and R&D computing communities. All members
of the profession are invited to nominate individuals they consider most eligible to receive
international recognition through an appropriate society award.

Charles Babbage Award (Certificate/$1,000)
In recognition of significant contributions in the field of parallel computation.

Computer Pioneer Award (Silver Medal)
Pioneering concepts and development of the computing field.

Computer Entrepreneur Award (Sterling Silver Goblet)
Vision and leadership resulting in the growth of some segment of the computer industry.

W. Wallace McDowell Award (Certificate/$2,000)
Recent theoretical, design, educational, practical, or other tangible innovative contributions.

Edward J. McCluskey Technical Achievement Award (Certificate/$2,000)
Contributions to computer science or technology.

Taylor L. Booth Award (Bronze Medal/$5,000)
Contributions to computer science and engineering education.

Harry H. Goode Memorial Award (Bronze Medal/$2,000)
Information sciences, including seminal ideas, algorithms, computing directions, and concepts.

Computer Science & Engineering Undergraduate Teaching Award (Plaque/$2,000)
Recognizes outstanding contributions to undergraduate education.

Hans Karlsson Award (Plaque/$2,000)
Team leadership and achievement through collaboration in computing standards.

Harlan D. Mills Award (Plaque/$3,000)
Contributions to the practice of software engineering through the application of sound theory.

Richard E. Merwin Award for Distinguished Service (Bronze Medal/$5,000)
Outstanding volunteer service to the profession at large, including service to the IEEE Computer Society.

Nomination Deadline
Submit your nomination by 1 October 2019 to www.computer.org/awards
Contact us at awards@computer.org
EDITOR-IN-CHIEF
Lizy K. John, University of Texas, Austin

EDITORIAL BOARD
R. Iris Bahar, Brown University
Mauricio Breternitz, University of Lisbon
David Brooks, Harvard University
Bronis de Supinski, Lawrence Livermore National Lab
Shane Greenstein, Harvard Business School
Natalie Enright Jerger, University of Toronto
Hyesoon Kim, Georgia Institute of Technology
John Kim, Korea Advanced Institute of Science and Technology
Hsien-Hsin (Sean) Lee, Taiwan Semiconductor Manufacturing Company
Richard Mateosian
Tulika Mitra, National University of Singapore
Trevor Mudge, University of Michigan, Ann Arbor
Onur Mutlu, ETH Zurich
Vijaykrishnan Narayanan, The Pennsylvania State University
Per Stenstrom, Chalmers University of Technology
Richard H. Stern, George Washington University Law School
Sreenivas Subramoney, Intel Corporation
Carole-Jean Wu, Arizona State University
Lixin Zhang, Chinese Academy of Sciences

ADVISORY BOARD
David H. Albonesi, Erik R. Altman, Pradip Bose, Kemal Ebcioglu, Lieven Eeckhout, Michael Flynn, Ruby B. Lee, Yale Patt, James E. Smith, Marc Tremblay

Subscription change of address: address.change@ieee.org
Missing or damaged copies: help@computer.org

IEEE MICRO STAFF
Journals Coordinator: Joanna Gojlik, j.gojlik@ieee.org
Peer-Review Administrator: micro-ma@computer.org
Publications Portfolio Manager: Kimberly Sperka
Publisher: Robin Baldwin
Senior Advertising Coordinator: Debbie Sims
IEEE Computer Society Executive Director: Melissa Russell

IEEE PUBLISHING OPERATIONS
Senior Director, Publishing Operations: Dawn Melley
Director, Editorial Services: Kevin Lisankie
Director, Production Services: Peter M. Tuohy
Associate Director, Editorial Services: Jeffrey E. Cichocki
Associate Director, Information Conversion and Editorial Support: Neelam Khinvasara
Senior Art Director: Janet Dudar
Senior Managing Editor: Patrick Kempf

CS MAGAZINE OPERATIONS COMMITTEE
Sumi Helal (Chair), Irena Bojanova, Jim X. Chen, Shu-Ching Chen, Gerardo Con Diaz, David Alan Grier, Lizy K. John, Marc Langheinrich, Torsten Möller, David Nicol, Ipek Ozkaya, George Pallis, VS Subrahmanian

CS PUBLICATIONS BOARD
Fabrizio Lombardi (VP for Publications), Alfredo Benso, Cristiana Bolchini, Javier Bruguera, Carl K. Chang, Fred Douglis, Sumi Helal, Shi-Min Hu, Sy-Yen Kuo, Avi Mendelson, Stefano Zanero, Daniel Zeng

COMPUTER SOCIETY OFFICE
IEEE Micro
c/o IEEE Computer Society
10662 Los Vaqueros Circle
Los Alamitos, CA 90720 USA
+1 (714) 821-8380

IEEE Micro (ISSN 0272-1732) is published bimonthly by the IEEE Computer Society. IEEE Headquarters, Three Park Ave., 17th Floor, New York,
NY 10016-5997; IEEE Computer Society Headquarters, 2001 L St., Ste. 700, Washington, DC 20036; IEEE Computer Society Publications Office,
10662 Los Vaqueros Circle, PO Box 3014, Los Alamitos, CA 90720. Postmaster: Send address changes and undelivered copies to IEEE, Member-
ship Processing Dept., 445 Hoes Ln., Piscataway, NJ 08855. Periodicals postage is paid at New York, NY, and at additional mailing offices. Canadian
GST #125634188. Canada Post Corp. (Canadian distribution) Publications Mail Agreement #40013885. Return undeliverable Canadian addresses
to 4960-2 Walker Road; Windsor, ON N9A 6J3. Printed in USA. Reuse rights and reprint permissions: Educational or personal use of this material is
permitted without fee, provided such use: 1) is not made for profit; 2) includes this notice and a full citation to the original work on the first page of the
copy; and 3) does not imply IEEE endorsement of any third-party products or services. Authors and their companies are permitted to post the accepted
version of IEEE-copyrighted material on their own webservers without permission, provided that the IEEE copyright notice and a full citation to the
original work appear on the first screen of the posted copy. An accepted manuscript is a version that has been revised by the author to incorporate
review suggestions, but not the published version with copy-editing, proofreading, and formatting added by IEEE. For more information, please go to
ieee.org/publications_standards/publications/rights/paperversionpolicy.html. Permission to reprint/republish this material for commercial, advertising,
or promotional purposes or for creating new collective works for resale or redistribution must be obtained from IEEE by writing to the IEEE Intellectual
Property Rights Office, 445 Hoes Lane, Piscataway, NJ 08854-4141 or pubs-permissions@ieee.org. ©2019 by IEEE. All rights reserved. Abstracting
and library use: Abstracting is permitted with credit to the source. Libraries are permitted to photocopy for private use of patrons, provided the per-copy
fee indicated in the code at the bottom of the first page is paid through the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923.
Editorial: Unless otherwise stated, bylined articles, as well as product and service descriptions, reflect the author’s or firm’s opinion. Inclusion in IEEE
Micro does not necessarily constitute an endorsement by IEEE or the Computer Society. All submissions are subject to editing for style, clarity, and
space. IEEE prohibits discrimination, harassment, and bullying. For more information, visit ieee.org/web/aboutus/whatis/policies/p9-26.html.
July/August 2019
Volume 39, Number 4
Published by the IEEE Computer Society

Special Issue
6 Guest Editors' Introduction: Secure Architectures
Simha Sethumadhavan and Mohit Tiwari

Theme Articles
8 Leveraging Cache Management Hardware for Practical Defense Against Cache Timing Channel Attacks
Fan Yao, Hongyu Fang, Miloš Doroslovački, and Guru Venkataramani
35 FinalFilter: Asserting Security Properties of a Processor at Runtime
Cynthia Sturton, Matthew Hicks, Samuel T. King, and Jonathan M. Smith

Expert Opinion
17 Toward Postquantum Security for Embedded Cores
Rafael Misoczki, Sean Gulley, Vinodh Gopal, Martin G. Dixon, Hrvoje Vrsalovic, and Wajdi K. Feghali
27 Energy-Secure System Architectures (ESSA): A Workshop Report
Pradip Bose and Saibal Mukhopadhyay

General Interest
44 RASSA: Resistive Prealignment Accelerator for Approximate DNA Long Read Mapping
Roman Kaplan, Leonid Yavits, and Ran Ginosar
55 The Queuing-First Approach for Tail Management of Interactive Services
Amirhossein Mirhosseini and Thomas F. Wenisch

Columns and Departments
From the Editor-in-Chief
4 Secure Architectures
Lizy Kurian John
Micro Economics
66 The Aftermath of the Dyn DDOS Attack
Shane Greenstein

Image credit: ©istockphoto.com/ValeryBrozhinsky
Column: From the Editor-in-Chief

Secure Architectures
Lizy Kurian John
The University of Texas at Austin

WELCOME TO THE July/August 2019 issue of IEEE Micro, which presents to you a selection of articles on secure architectures and a few articles on other topics.

Computer security is becoming increasingly important due to the increased use of computers and the internet in our day-to-day lives. Attacks on computer systems are becoming very commonplace. Many recent attacks demonstrated security vulnerabilities in commodity hardware, pointing to the importance of secure hardware architectures. This special issue presents four articles on secure computer architectures. The topics discussed range from defense against cache timing channel attacks to asserting security properties of a processor at runtime. Prof. Simha Sethumadhavan of Columbia University and Prof. Mohit Tiwari of the University of Texas at Austin served as guest editors for the special issue. A comprehensive article written by Professors Sethumadhavan and Tiwari serves as an excellent introduction to the compendium on secure architectures.

The four articles from the secure architecture theme are accompanied very appropriately by a Micro Economics column related to computer security by Shane Greenstein. In the article titled "The Aftermath of the Dyn DDOS Attack," Greenstein writes about the October 2016 series of distributed denial-of-service (DDOS) attacks that caused many services and platforms to be unavailable for large segments of users in North America and Europe. The attacker targeted systems operated by the nameserver resolution provider Dyn, which performs approximately 10% of the nameserver services in the United States. Since nameserver resolution is essential for many businesses to operate, the attack affected a range of businesses including Netflix, CNBC, Twitter, Airbnb, and Etsy. Greenstein provides details on the market share of domain name system providers and alerts readers to an important vulnerability in today's internet, viz., many internet services are concentrated in a few providers. His article draws attention to the lack of redundancy in internet service providers.

In addition to the special issue articles on computer security, there are two other articles in this issue. The first one, "RASSA: Resistive Pre-Alignment Accelerator for Approximate DNA Long Read Mapping" by Kaplan et al., presents an in-memory parallel architecture for similarity search for genomic sequences. As personalized medicine based on gene mapping is emerging, hardware architectures to support sequence searches are increasingly relevant. One challenge in mapping long sequences is determining the optimal mapping location of every

Digital Object Identifier 10.1109/MM.2019.2924711
Date of current version 23 July 2019.

0272-1732 © 2019 IEEE. Published by the IEEE Computer Society.
read onto the reference sequence. In this article, Kaplan et al. present hardware acceleration of genomic mapping by using resistive memories (memristors), which are elements that store information by modulating the resistance of nanoscale storage elements. Memristor arrays facilitate simultaneous compare and mapping, and result in a highly parallel compute accelerator.

The second regular channel article is "The Queuing-First Approach for Tail Management of Interactive Services" by Mirhosseini and Wenisch. Cloud services are increasingly becoming popular; however, service latencies in the cloud are heavy-tailed. Some requests can take 100 times more time than the average. This article presents two solutions to mitigate this important problem: server pooling and common-case service acceleration.

I am also proud to write about a new Best Paper award for IEEE Micro. Starting this year, the best paper award will be given for articles published in IEEE Micro. The IEEE Computer Society has recently started a best paper award program to acknowledge and reward the best articles in each of the Transactions and Magazines sponsored by the Computer Society. Articles based on a conference paper are ineligible, and hence the IEEE Micro Top Picks articles would be ineligible for this award. The intent of the award is to recognize outstanding regular papers. The first award will be announced in a couple of months.

IEEE Micro is interested in submissions on any aspect of chip/system design or architecture. Please consider submitting articles to IEEE Micro, and remember, all regular articles will be eligible for the best paper award.

I hope you enjoy the secure architectures articles as well as the other articles presented in this issue. Happy reading!

Lizy Kurian John is a Cullen Trust for Higher Education Endowed Professor with the Electrical and Computer Engineering Department, the University of Texas at Austin, Austin, TX, USA. Contact her at ljohn@ece.utexas.edu.
Guest Editors’ Introduction

Secure Architectures
Simha Sethumadhavan, Columbia University
Mohit Tiwari, The University of Texas at Austin

HARDWARE IS THE bedrock on which all computing systems are built. Recently developed hardware ideas for enhancing software security from both academia and industry hold significant promise to improve software security. At the same time, recent hardware attacks on current commodity hardware have shown hardware to be a weak foundation for building secure systems. As we enter the Post-Moore's Law era, there are significant questions surrounding what would make security techniques more practical. Thus, security is a first-order problem for computer architects today, and this special edition highlights a few compelling examples across the system stack. This issue contains four articles on various facets of secure architectures.

The first paper, by Fan et al., addresses the (timely) problem of information leakage through timing, specifically through shared caches. Defenses have considered partitioning caches across security domains, as well as detecting attacks as they happen, as two separate research directions; this article proposes to use these approaches in conjunction, where the detection algorithm looks for anomalous cache occupancy patterns and triggers dynamic partitioning once an anomaly is detected. The article demonstrates ways to integrate this combined defense in off-the-shelf CPUs by using Intel's cache partitioning techniques, and shows the way for future work that combines prevention and detection.

The second article, by authors from Intel, is at the intersection of the extremely low-level firmware that often forms the root of trust for secure-hardware-based computing, and security against quantum-computing-driven cryptanalysis. This article considers several alternatives for such "post-quantum" cryptographic schemes, specifically with an eye toward the low-level implementation security issues that emerge in firmware. As secure enclave-based computing takes off, considering the foundation of secure firmware is an especially timely problem.

The third article is a report on a Workshop on Energy Secure Architectures. The workshop was organized by Dr. Bose from IBM and Dr. Mukhopadhyay from Georgia Tech, and focused on motivating the need to secure the on-chip energy management structures, which are assuming increased importance in the face of emerging technology trends. This article describes the challenges in designing energy management algorithms, attacks, and possible solutions.

Digital Object Identifier 10.1109/MM.2019.2925152
Date of current version 23 July 2019.
The final article, by Sturton et al., describes an idea called the FinalFilter. Despite the best efforts of designers to create secure designs, unforeseen inputs or events may cause the processor to perform actions that violate security. In this article, the authors describe how simple invariants can be used to protect against such problems. Unlike static assertions that are removed after design time, the FinalFilter invariant controls access to protected assets at all times because it is fabricated into the hardware.

Simha Sethumadhavan is an associate professor of computer science and the Chair of the Cybersecurity Center, Data Science Institute, Columbia University. He is an Alfred P. Sloan Research Fellow. Contact him at simha@columbia.edu.

Mohit Tiwari is an assistant professor in the Department of Electrical and Computer Engineering, The University of Texas at Austin. He was a Postdoctoral Fellow at UC Berkeley (2011–2013). He has a PhD from UC Santa Barbara. Contact him at tiwari@austin.utexas.edu.
Theme Article: Secure Architectures

Leveraging Cache Management Hardware for Practical Defense Against Cache Timing Channel Attacks

Fan Yao, University of Central Florida
Hongyu Fang, George Washington University
Miloš Doroslovački, George Washington University
Guru Venkataramani, George Washington University

Abstract—Sensitive information leakage through shared hardware structures is a growing security concern. In this article, we propose a practical protection framework against cache timing channel attacks by leveraging commercial off-the-shelf hardware support in last-level caches for cache monitoring and partitioning.

TIMING CHANNELS ARE a form of information leakage attack in which adversaries modulate and (or just) observe access timing to shared resources in order to exfiltrate secrets. Among various hardware-based information leakage attacks, cache timing channels have become notorious, since caches present the largest on-chip attack surface for adversaries to exploit, combined with high-bandwidth transfers.13 Previously proposed detection and defense techniques against cache timing attacks either explore hardware modifications or incur nontrivial performance overheads.6,12,16,17 For more effective system protection and widescale deployment, it is critical to explore ready-to-use and performance-friendly practical protection against cache timing channel attacks.

Digital Object Identifier 10.1109/MM.2019.2920814
Date of publication 4 June 2019; date of current version 23 July 2019.
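For readers who want to experiment with the kind of occupancy monitoring this article builds on: on Linux, Intel CMT counters are exposed through the resctrl filesystem. The sketch below is illustrative and not from the article; the monitoring-group name and path are assumptions, and a real run requires a kernel and CPU with resctrl/CMT support.

```python
import time
from pathlib import Path

# Hypothetical resctrl monitoring-group path; on a real system, resctrl must
# be mounted (mount -t resctrl resctrl /sys/fs/resctrl) and a monitoring
# group created and populated with the tasks of the domain to watch.
MON_FILE = "/sys/fs/resctrl/mon_groups/domain0/mon_data/mon_L3_00/llc_occupancy"

def read_llc_occupancy(path):
    """Read one LLC occupancy sample (in bytes) from a resctrl counter file."""
    return int(Path(path).read_text().strip())

def sample_occupancy(path, num_samples, interval_s=0.001):
    """Poll the occupancy counter, mimicking the article's 1000 samples/s rate."""
    samples = []
    for _ in range(num_samples):
        samples.append(read_llc_occupancy(path))
        time.sleep(interval_s)
    return samples
```

Sampling two mutually distrusting domains this way yields the per-domain occupancy traces that the analysis described in the rest of the article operates on.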
In this article, we propose a new framework that makes novel use of COTS hardware to thwart cache timing channels. We observe that cache block replacements by adversaries in cache timing channels reveal a distinctive pattern in their cache occupancy profiles, which could be a strong indicator for the presence of cache timing channels. We leverage Intel's Cache Monitoring Technology (CMT3), available in recent server-class processors, to perform fine-grained monitoring of LLC occupancy for individual application domains. We then apply signal processing techniques that characterize the communication strength with spy processes in cache timing channels. We further leverage LLC way allocation (i.e., CAT3) and repurpose it as a secure cache manager to dynamically partition the LLC for suspicious domains and disband their timing channel activity. Our mechanism avoids preemptively separating domains and, consequently, does not result in high performance overheads to benign application domains.

In comparison to our recent work, COTSKnight,19 the novel contributions of this article are as follows:

1) To defend against sophisticated adversaries that randomize interval times between transmissions, we augment our COTSKnight design to remove irrelevant occupancy trace segments using time warping (see the "Defense Against Advanced Adversaries" section).
2) We perform new experimental studies on virtualized environments that are prone to cache timing channel attacks, and demonstrate the efficacy of our approach (see the "Case Study on Virtualized Environments" section).
3) We identify futuristic threats (like multiple spies and evidence tampering), and discuss potential defense mechanisms using our proposed defense framework (see the "Discussion" section).

BACKGROUND
There are typically two processes involved in cache timing channels, namely, the trojan and spy in covert channels, and the victim and spy in side channels. The spy infers secrets from the trojan or the victim by observing the modulated latency of cache accesses.2 To exfiltrate secrets, the spy needs to determine a communication channel. In the case of covert channels, the trojan and spy may alternate their accesses to the cache temporally, while in side channels, the spy has to run in parallel to the victim process.17 This may vary along the space dimension as well (i.e., accessing a single cache location or alternating among multiple cache locations).

Recently, Intel's CMT has allowed for uniquely identifying each logical core with a resource monitoring ID3 and tracking the LLC usage for the mapped domains. CMT enables flexible monitoring of LLC occupancy at user-desired domain granularity, such as a core, a multithreaded application, or a virtual machine. With CAT, caches can be configured to have several different partitions of cache ways, called classes of service (CLOS), where evicting cache lines from other CLOS is restricted for a given domain.

THREAT MODEL
In this article, we focus on the sophisticated form of attacker that does not rely on any prior memory sharing and utilizes Prime + Probe-based techniques to launch attacks on the LLC simply by creating conflict misses (replacements) on cache sets. Attacks such as Flush + Reload require shared memory blocks, either through shared libraries or data sharing, which may be prohibited in practical settings. Therefore, we do not consider such forms of attacks. However, for Evict + Reload attacks, where cache replacements alter access latencies, our design would still be applicable. (See the "Discussion" section for details.)

WHY CACHE OCCUPANCY PATTERNS MATTER?
Regardless of whether a trojan intentionally communicates or a victim unintentionally leaks secrets to a spy, cache timing channels use one of the following encoding schemes: 1) ON–OFF encoding (where the spy uses the timing profile of a
single cache set group to infer bits/symbols13), and 2) pulse-position encoding (where the spy leverages the access timing of distinct cache set groups to infer each bit/symbol14).

Figure 1 illustrates the changes in cache occupancy under the two encoding methods. In ON–OFF encoding, when the trojan/victim accesses the cache and fetches its blocks, the trojan's cache occupancy should first increase and then decrease during the spy's probe, when trojan/victim-owned blocks are replaced. Similarly, the spy's cache footprint would first decrease due to the trojan/victim's filling in the cache blocks and then increase when the spy probes and fills the cache with its own data. When the trojan/victim does not access the cache, neither of the processes changes its respective cache occupancy. Under pulse-position encoding with two distinct cache set groups used by the trojan/victim (e.g., odd and even sets), we observe swing patterns in their cache occupancies.

Figure 1. LLC occupancy changes for trojan/victim and spy. (a) ON–OFF: trojan/victim idle. (b) ON–OFF: trojan/victim access. (c) Pulse-position: odd sets. (d) Pulse-position: even sets.

We make the following key observation here: cache timing channels fundamentally rely on cache block replacements that create swing patterns in the participating domains' cache occupancy, regardless of the specific timing channel protocol. By analyzing these repetitive swing patterns, there is a potential to uncover the communication strength in such attacks. We note that merely tracking cache misses on an adversary will not be sufficient, as an attacker may inflate cache misses (through issuing additional cache loads that create self-conflicts) on purpose to evade detection.

SYSTEM DESIGN
Here, we first discuss CMT-based cache occupancy monitoring and trace analysis for cache timing channel detection, and then outline the cache partitioning strategy to prevent information leakage.

Cache Occupancy Monitor and Pattern Analyzer
We leverage Intel CMT10 to obtain LLC occupancy data for each domain/context, which allows system administrators to flexibly define the monitoring granularity, e.g., hardware threads, applications, or even VMs.

The occupancy pattern analyzer performs the following steps to determine whether there is a cache timing channel between two domains.

First, the analyzer generates the time-differentiated cache occupancy changes for each domain. Assume that xi and yi are the cache occupancy sample vectors obtained within the ith window; we can then get the time-differentiated cache occupancy traces for each domain, denoted as Δxi,j and Δyi,j (i.e., the LLC occupancy difference between two consecutive samples). Figure 2(a) shows time-differentiated LLC occupancy traces for a timing channel
that implements a parallel protocol with pulse-position encoding.

In the second step, to capture the unique pairwise cache occupancy swing pattern in timing channels, we compute the product of Δxi and Δyi as zi. Based on the discussion in the "Why Cache Occupancy Patterns Matter?" section, negative values of zi occur when the cache occupancy patterns of the two processes move in opposite directions due to mutual cache evictions.

In the third step, our analyzer checks whether the z series contains repeating negative pulses that may be caused by intentional eviction over a longer period of time (denoting illegal communication activity). To capture the repetitive swing patterns, we perform power spectrum analysis in the frequency domain on ri, which is the autocorrelogram of z.

Figure 2(b) illustrates the autocorrelogram and power spectrum for a (victim, spy) pair in timing channels.13 We can visually observe a sharp peak around a frequency of 290 in the power spectrum, which represents a strong communication strength indicating timing channel activity (see COTSKnight19 for further details on this algorithm).

Figure 2. LLC occupancy traces, autocorrelogram, and power spectrum for a cache timing channel (with parallel pulse-position).13 (a) Parallel protocol with pulse-position encoding. (b) Autocorrelogram (left) and power spectrum (right).

Implementation
We implement our framework prototype on a real system with an Intel Xeon E5-2698 v4 processor. The processor comes with 16 CLOS and 20 LLC slices, and each LLC slice has 20 × 2048 64-byte blocks. The LLC occupancy MSR reading is sampled at 1000 samples/s.
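The three analysis steps above can be sketched in a few lines of NumPy. This is an illustrative reconstruction rather than the authors' implementation; the synthetic traces, the normalization, and the way the peak is extracted are assumptions.

```python
import numpy as np

def channel_signal_power(x, y):
    """Score a pair of LLC occupancy traces for timing-channel-like behavior.

    Step 1: time-differentiate the two occupancy traces.
    Step 2: multiply them pointwise; mutual evictions (one domain gaining
            occupancy while the other loses it) appear as negative pulses.
    Step 3: autocorrelate the product and take the peak of its power
            spectrum, which is large only for repetitive swing patterns.
    """
    dx = np.diff(np.asarray(x, dtype=float))
    dy = np.diff(np.asarray(y, dtype=float))
    z = dx * dy
    z = (z - z.mean()) / (z.std() + 1e-12)   # scale-free comparison
    r = np.correlate(z, z, mode="full")      # autocorrelogram of z
    power = np.abs(np.fft.rfft(r))           # power spectrum
    return power[1:].max()                   # peak, ignoring the DC bin

# Toy traces: a trojan/spy pair swinging in opposition every 16 samples,
# versus two independent random walks standing in for benign workloads.
t = np.arange(2048)
trojan = np.where((t // 16) % 2 == 0, 1500, 500)   # occupancy (cache blocks)
spy = 2000 - trojan                                # mirror-image swings

rng = np.random.default_rng(7)
benign_a = np.cumsum(rng.normal(size=2048))
benign_b = np.cumsum(rng.normal(size=2048))

attack_score = channel_signal_power(trojan, spy)
benign_score = channel_signal_power(benign_a, benign_b)
```

On traces like these, the attack pair's spectral peak dwarfs the benign pair's, mirroring the order-of-magnitude separation that the evaluation reports.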
Cache Way Allocation Manager
After the way allocation manager (allocator) is notified of identified suspicious domains by the analyzer, it configures the LLC using CAT to isolate the suspicious pairs, heuristically assigning nonoverlapping cache ways to each domain based on their ratio of LLC occupancy sizes during the last observation period. Our allocator evaluates two candidate policies, namely, the Aggressive Policy, which keeps suspicious domains separated until one of them finishes execution, and the Jail Policy, which partitions the two domains until a timeout period.

EVALUATION
Power Spectra for Cache Timing Channels
We set up attack variants of cache timing channels2,13–15 that utilize ON–OFF and pulse-position encoding for spy reception and perform accesses to the cache either serially (trojan and spy) or in parallel (victim and spy), as described in the "Why Cache Occupancy Patterns Matter?" section. In each case, we also ran alongside at least two SPEC2006 benchmarks with high LLC activity.8 The analyzer performs power spectrum analysis based on time-differentiated LLC occupancy traces for six combination pairs of processes. In all
Secure Architectures

trigger cache partitioning. Note that


we have analyzed the attack variants
with different transmission bit rates
(i.e., ranging from a few bps to several
kbps), numbers of cache sets, and
probe intervals. Our results showed
that our framework identifies all of
the trojan–spy domain pairs within
five consecutive analysis windows
after they start execution.
Figure 3. Power spectrum in attack variants including trojan/victim-spy pairs. Partition trigger rate for benign work-
(a) serial-on–off (b) para-pp loads and the corresponding perfor-
mance impact: Among all benign
workloads (each runs four SPEC2006
cases, our framework correctly identified tro- applications), only 6% of the domain pair popula-
jan/victim-spy processes since the pair consis- tion had LLC partitioning—these benchmarks cov-
tently had the highest power in the frequency domain. In fact, our experiments show that the attacker pair's peak power spectrum values are at least an order of magnitude higher than those of benign application pairs.

Figure 3(a) and (b) shows the analyzer's results on representative windows for two trojan/victim-spy pairs. In the serial-on–off attack, we observe a single concentrated and sharp power peak in the frequency domain, while the other data points are almost all zeros. This indicates the existence of a dominating signal in the time domain corresponding to the repetitive gain–loss occupancy pulses due to timing channel activity [see Figure 3(a)]. We also observe a similar isolated peak for the trojan/victim-spy pair in para-pp, as shown in Figure 3(b), where the signal power is even higher compared to the serial-on–off case.

We repeated several experiments with 60 benign workload pairs with high LLC activity time-overlapping at various random phases, and observed the peak signal power to be less than 5 about 80% of the time, and around 50 for only about 2% of the time. This shows that a vast majority of benign workload samples do not exhibit isolated peaks in the frequency domain, and the maximum signal power is significantly less than that of any known timing channel (all of which have signal strengths well over 100).

Effectiveness of Our Framework
Defeating Cache Timing Channels: We conservatively set the signal power threshold at 50, which covered 2% of the analysis window samples. Even when there are only two benign applications, it is worth noting that cache occupancy change patterns are typically random. Therefore, the signal power (which captures the periodic gain–loss patterns) will not be any higher. Our experiment shows that the LLC partitioning only minimally impacts applications that trigger partitioning (with less than 5% slowdown), and interestingly, we observe a performance boost for many of them (up to 9.2%). The overall average impact on all the applications that ran with a partitioned LLC was positive (about 1%). This shows that our framework can even help benign workloads while safeguarding systems against cache timing channels.

Runtime Overhead: Our framework implements the nonintrusive LLC occupancy monitoring only for mutually distrusting domains identified by the system administrator. Overall, the mechanism incurs less than 4% CPU utilization with four active mutually distrusting domains.

Defense Against Advanced Adversaries
In theory, advanced adversaries may use randomized interval times between bit transmissions. Let us imagine a trojan and spy that set up a predetermined pseudorandom number generator to decide the next waiting period before each bit transmission. Even in such cases, our framework can be adapted to recognize them through a signal preprocessing procedure called time warping,5 which removes irrelevant segments from the occupancy traces (those for which Δx and Δy are close to 0) and aligns the swing patterns.
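The segment-removal step can be sketched in a few lines. This is an illustrative reading of the time-warping idea, not the authors' code; the epsilon and the ±8 pulse amplitudes are made-up values:

```python
# Toy sketch: "time-warp" an LLC occupancy-change trace by dropping
# idle samples, assuming the trace is a list of per-interval occupancy
# deltas (dx, dy) for the two monitored domains.

def time_warp(dx, dy, eps=1):
    """Keep only samples where at least one domain's occupancy is
    actually changing (|dx| or |dy| above a small epsilon)."""
    return [(a, b) for a, b in zip(dx, dy)
            if abs(a) > eps or abs(b) > eps]

# Example: gain-loss swing pulses separated by random-length idle gaps.
dx = [0, 0, +8, -8, 0, 0, 0, 0, +8, -8, 0, +8, -8, 0, 0]
dy = [0, 0, -8, +8, 0, 0, 0, 0, -8, +8, 0, -8, +8, 0, 0]
warped = time_warp(dx, dy)
# The idle gaps are gone and the swings are back-to-back, so a
# periodicity analysis sees a regular pulse train again.
```

With the gaps removed, the swing pulses from the trojan and spy line up into a (near-)periodic sequence that the frequency-domain analysis can pick up.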
After this step, the periodic patterns are reconstructed, and the cadence of cache accesses from the adversaries will be recovered.
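With the trace warped, the analyzer's check reduces to looking for a dominant peak in the power spectrum. Below is a stdlib-only sketch of that idea (ours, not the paper's implementation; the threshold of 50 comes from the text, while the window length, pulse shape, and normalization are illustrative assumptions):

```python
# Toy sketch of a frequency-domain check on an occupancy trace: compute
# the DFT power at each positive frequency and take the maximum.
import cmath

def peak_signal_power(trace):
    n = len(trace)
    mean = sum(trace) / n
    x = [v - mean for v in trace]          # drop the DC component
    powers = [
        abs(sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n))) ** 2 / n
        for k in range(1, n // 2 + 1)      # positive frequencies only
    ]
    return max(powers)

# A periodic gain-loss pulse train concentrates power at one frequency,
# far above the (illustrative) detection threshold of 50.
pulses = [+8 if (t // 4) % 2 == 0 else -8 for t in range(64)]
assert peak_signal_power(pulses) > 50
```

A benign trace with random occupancy changes tends to spread its energy across many frequencies, so its peak power stays well below such a threshold.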
Figure 4 demonstrates the detection of this attack scenario. For illustration, we implement a prototype of this attack by setting up the trojan and spy as two threads within the same process, and configure the main thread to control the synchronization. In reality, two separate trojan/victim and spy need to be synchronized. Figure 4(a) shows the LLC occupancy trace for this attack with random distances between the swing pulses. We can see that, with time warping, high signal power peaks are observed [see Figure 4(b)]. Additionally, when this signal compression preprocessing step is applied to benign workloads, we do not observe any increase in the partition trigger rate.

Figure 4. Analysis of bit transmission at random intervals. (a) LLC occupancy changes for transmission with random intervals; the left half shows a snippet of the original trace with random bit intervals and the right half shows the time-warped trace. (b) Power spectrum of the time-warped LLC occupancy trace.

Case Study on Virtualized Environments
To evaluate the efficacy of our proposed framework, we perform a case study on a virtualized environment. This study is motivated by the growing trend in studying timing channel attacks in the cloud environment. We implement the para-on–off attack that works cross-VM (similar to Maurice et al.14).

We set up four KVM virtual machines where the trojan and spy run on two of the VMs, and simultaneously, two other VMs co-run representative cloud benchmarks, namely video streaming (stream) and memcached (memcd) from CloudSuite,7 both of which are highly cache-intensive. The trojan and spy are set to start the para-on–off attack at a random time between 0 and 300 s. We configure the allocator to use the Aggressive policy to demonstrate the effectiveness of LLC partitioning. Figure 5 shows the peak signal power between the trojan and spy VM pair and the way allocation determination during the entire execution. We can see that the trojan and spy start to initiate communication at around 188 s (when we start to observe increasing signal power). The peak signal power between the trojan and spy domain pair quickly climbs up to 126 at time 192.5 s, which is when steady covert communication has begun. This quickly triggers the allocator's action that splits the LLC ways between the trojan and spy VMs. Consequently, the maximum signal power drops back to nearly zero for the rest of the execution, effectively preventing any further timing channels. Note that during the 1-h experiment, the peak signal power values for the other domain pairs (involving CloudSuite applications) remained flat at values less than 3.

Figure 5. Peak signal power values for the trojan/spy pair and the allocator's way allocation for one-hour execution.

DISCUSSION
We propose a new framework that builds on COTS hardware and can be augmented with a host of signal processing techniques to eliminate noise, randomness, or distortion to unveil the timing channel activity. In this section, we discuss additional monitoring support and signal processing to detect futuristic attacks with sophisticated adversaries.

Using Multiple Spy Processes: A spy may try to evade our defense through potentially involving multiple processes that perform either time multiplexing (each process is active for a short period of time iteratively) or space multiplexing (each process touches a subregion of the target sets simultaneously) for timing channels. Our proposed framework can still effectively identify such malicious activities as it essentially monitors swing patterns in cache occupancy usage that could purposefully change cache access latencies for domains, as discussed in the "System Design" section. Further, CMT+CAT allows for dynamically defining security domains that can best isolate the capability and access boundary for each party (e.g., threads and processes run by the same user belong to the same domain). The cumulative LLC occupancy pattern among all the spy's processes in the same domain would preserve the correlated swing pattern that can be recognized by the analyzer.

Using clflush to Deflate LLC Occupancy: An adversary may attempt to tamper with evidence of its cache occupancy changes by compensating for the increase in its own cache occupancy through issuing the clflush instruction. To handle such scenarios, clflush's usage by suspicious domains may be tracked and the associated memory sizes can be accounted back to the issuing core, thus restoring the original occupancy data for analysis. Also, many system-level protections against the clflush instruction have been proposed, including constraining clflush to only be used in kernel space or just disabling it (e.g., Google NaCl). Therefore, clflush-based cache occupancy deflation can be handled easily.

Applicability of Our Technique to Other Cache Attacks: While we mainly evaluated the proposed technique using Prime+Probe-based attacks, our proposed framework can be applied to other cache attacks using evictions as well. For instance, in Evict+Reload attacks, the repetitive data loads by the victim and subsequent evictions by the spy will also introduce cache occupancy gain–loss patterns, which can be detected by our proposed framework.

Current Hardware Limitations and Opportunities: We observe that CMT currently supports a minimum precision of 20 cache sets. If attackers were to leverage a smaller number of sets to carry out the attack, they may potentially evade COTSknight's detection. While such attacks are possible, they are prone to high noise. As such, the limitations mentioned above are an artifact of the current CMT hardware, and not of our analysis approach per se. That said, we note that CMT was designed for improving performance bottlenecks, and not to detect cache timing channels. Our study highlights a novel use case for LLC monitoring, and we strongly believe that it would motivate processor vendors to support improved precision and bolster system security.

RELATED WORK
Cache-based timing channels have been widely studied,13,18 and hardware-based solutions have been proposed. CC-Hunter2 detects covert timing channels in caches by capturing fine-grained cache conflict miss patterns in hardware. ReplayConfusion17 records a program's memory accesses and replays them on a different machine to uncover covert channels on caches. Hunger et al.9 observe the destructive read property in contention-based covert channels and propose a solution based on anomaly detection.
Demme et al.4 apply machine learning techniques to architectural-level statistics to detect malware, including side channels. Fang et al.6 also use hardware prefetchers to defend against cache timing channels. CATalyst12 utilizes the CAT technology to reserve static cache partitions where secure pages are pinned upon request from applications. Differently, our proposed mechanism successfully defeats cache timing channels without application/user-level inputs and partition reservation. Bazm et al.1 leverage cache occupancy information to detect side-channel behavior in conjunction with other performance counters such as cache misses. However, their proposed technique makes anomalous behavior determinations based on cache footprint, which is subject to high false-positive alarms. In contrast, our framework analyzes cache occupancy gain–loss patterns that are shown to be the unique characteristic of parties involved in timing channel activity, which is both effective and efficient. Recently, DAWG11 has proposed secure cache partitioning by strictly isolating both cache hits and misses between application domains.

CONCLUSION
In this article, we proposed a novel framework to protect caches against timing channel attacks through smartly leveraging COTS support for cache monitoring and performance tuning. We implemented a prototype of our proposed technique on an Intel Xeon v4 server, and our experiments showed that our framework can successfully thwart several classes of cache timing channels in both native and virtualized environments with minimal performance overhead. We also discussed several futuristic threats and mechanisms to defeat such timing channels.

ACKNOWLEDGMENT
This work was supported by the U.S. National Science Foundation under Grant CNS-1618786, and by the Semiconductor Research Corporation Contract 2016-TS-2684. F. Yao performed this work as a graduate student at GWU.

REFERENCES
1. M. Bazm, T. Sautereau, M. Lacoste, M. Sudholt, and J. Menaud, "Cache-based side-channel attacks detection through Intel Cache Monitoring Technology and Hardware Performance Counters," in Proc. 3rd Int. Conf. Fog Mobile Edge Comput., 2018, pp. 7–12.
2. J. Chen and G. Venkataramani, "CC-Hunter: Uncovering covert timing channels on shared processor hardware," in Proc. 47th Annu. IEEE/ACM Int. Symp. Microarchit., 2014, pp. 216–228.
3. Intel Corporation, Intel 64 and IA-32 Architectures Software Developer's Manual, vol. 3B, 2016.
4. J. Demme et al., "On the feasibility of online malware detection with performance counters," in Proc. 40th Annu. Int. Symp. Comput. Archit., 2013, pp. 559–570.
5. M. G. Elfeky, W. G. Aref, and A. K. Elmagarmid, "WARP: Time warping for periodicity detection," in Proc. 5th IEEE Int. Conf. Data Mining, 2005, pp. 138–145.
6. H. Fang, S. S. Dayapule, F. Yao, M. Doroslovački, and G. Venkataramani, "Prefetch-guard: Leveraging hardware prefetches to defend against cache timing channels," in Proc. IEEE Int. Symp. Hardware Oriented Secur. Trust, 2018, pp. 187–190.
7. M. Ferdman et al., "Clearing the clouds: A study of emerging scale-out workloads on modern hardware," ACM SIGPLAN Notices, vol. 47, pp. 37–48, 2012.
8. J. L. Henning, "SPEC CPU2006 benchmark descriptions," ACM SIGARCH Comput. Archit. News, vol. 34, no. 4, pp. 1–17, 2006.
9. C. Hunger, M. Kazdagli, A. Rawat, A. Dimakis, S. Vishwanath, and M. Tiwari, "Understanding contention-based channels and using them for defense," in Proc. IEEE 21st Int. Symp. High Perform. Comput. Archit., 2015, pp. 639–650.
10. Intel, "Intel-CMT-CAT Package," 2017. [Online]. Available: https://github.com/01org/intel-cmt-cat
11. V. Kiriansky, I. Lebedev, S. Amarasinghe, S. Devadas, and J. Emer, "DAWG: A defense against cache timing attacks in speculative execution processors," in Proc. 51st Annu. IEEE/ACM Int. Symp. Microarchit., 2018, pp. 974–987.
12. F. Liu et al., "CATalyst: Defeating last-level cache side channel attacks in cloud computing," in Proc. IEEE Int. Symp. High Perform. Comput. Archit., 2016, pp. 406–418.
13. F. Liu, Y. Yarom, Q. Ge, G. Heiser, and R. B. Lee, "Last-level cache side-channel attacks are practical," in Proc. IEEE Symp. Secur. Privacy, 2015, pp. 605–622.
14. C. Maurice et al., "Hello from the other side: SSH over robust cache covert channels in the cloud," in Proc. Netw. Distrib. Syst. Secur. Symp., 2017, pp. 8–11.
15. T. Ristenpart, E. Tromer, H. Shacham, and S. Savage, "Hey, you, get off of my cloud: Exploring information leakage in third-party compute clouds," in Proc. 16th ACM Conf. Comput. Commun. Secur., 2009, pp. 199–212.
16. G. Venkataramani, J. Chen, and M. Doroslovački, "Detecting hardware covert timing channels," IEEE Micro, vol. 36, no. 5, pp. 17–27, Sep.–Oct. 2016.
17. M. Yan, Y. Shalabi, and J. Torrellas, "ReplayConfusion: Detecting cache-based covert channel attacks using record and replay," in Proc. 49th Annu. IEEE/ACM Int. Symp. Microarchit., 2016, pp. 1–14.
18. F. Yao, M. Doroslovački, and G. Venkataramani, "Are coherence protocol states vulnerable to information leakage?" in Proc. IEEE Int. Symp. High Perform. Comput. Archit., 2018, pp. 168–179.
19. F. Yao, H. Fang, M. Doroslovački, and G. Venkataramani, "COTSknight: Practical defense against cache timing channel attacks using cache monitoring and partitioning technologies," in Proc. HOST, 2019.

Fan Yao is an assistant professor of electrical and computer engineering at the University of Central Florida. His research interests include the areas of computer architecture, hardware and system security, and cloud computing. Contact him at fan.yao@ucf.edu.

Hongyu Fang is currently a graduate student at George Washington University. His research interests include signal processing, computer architecture, and security. Contact him at hongyufang_ee@email.gwu.edu.

Milos Doroslovački is an associate professor of electrical and computer engineering at George Washington University. His research area is adaptive signal processing with a focus on communications and distributed estimation. Contact him at doroslov@gwu.edu.

Guru Venkataramani is an associate professor of electrical and computer engineering at George Washington University. His research area is computer architecture, security, and energy optimization. Contact him at guruv@gwu.edu.

Expert Opinion
Toward Postquantum Security for Embedded Cores

Rafael Misoczki, Sean Gulley, Vinodh Gopal, Martin G. Dixon, Hrvoje Vrsalovic, and Wajdi K. Feghali
Intel Corporation
Digital Object Identifier 10.1109/MM.2019.2920203
Date of current version 23 July 2019. 0272-1732 © 2019 IEEE. Published by the IEEE Computer Society.

THE USE OF firmware agents—including their use to define the functionality of embedded cores—has proliferated on computer systems of all scales, especially servers. The agents are often not visible to the operating system as they independently perform configuration, monitoring, and certain control tasks. For example, the baseboard management controller (BMC) on server platforms runs a firmware stack (e.g., OpenBMC, https://github.com/openbmc/openbmc), has a network port, some peripherals on external buses (e.g., I2C, SPI, etc.), and storage. The BMC is effectively another full (but scaled-down) computing system on the server of which it is a component. While the BMC can directly affect the operation of a server (e.g., by controlling its power states), it does not interact with the OS. Its behavior is defined completely by its firmware.

Improvements in silicon manufacturing processes have reduced the size of these cores to the point where they can be physically embedded within the silicon dies of the system components (such as a CPU) on which they operate. In parallel, the volume and importance of their responsibilities have increased, beginning with power management and escalating to security operations that can affect functional safety. Given this, it is critical that they run only signed and authenticated code.

One of today's best practices to authenticate the firmware that runs on embedded cores is public key cryptography (digital signatures), which relies on FIPS-140 digital signature algorithms such as RSA and EC-DSA. However, quantum computing will render these algorithms useless, since factorizing integers and solving the discrete logarithm problem (i.e., the underlying security problems of RSA and EC-DSA) will be solvable in polynomial time.11 This implies that increasing RSA/ECC key sizes will be insufficient to defeat a quantum adversary.
Prof. Michele Mosca, from the Institute of Quantum Computing of the University of Waterloo, predicts that there is a 50% chance that the RSA-2048 public key cryptosystem will be broken by 2031 by quantum computers. In response to this threat, the National Institute of Standards and Technology (NIST) has started a competition for standardizing postquantum cryptography (PQC). Additionally, Grover's algorithm5 challenges AES-128 and, in general, symmetric cryptography algorithms of short key/digest size.

In response, cryptographers have proposed algorithms that can be classified into several families. Since embedded cores cannot house large keys and are not typically high-performance enough to handle complex operations, one interesting question to ask is how suitable these PQC algorithms are for embedded cores. In this paper, we will discuss the proliferation of embedded cores, explain why security depends upon the firmware running on them, provide an overview of the existing proposals for postquantum digital signatures, discuss what approaches seem the most reasonable for embedded cores, and discuss Intel's current direction.

FIRMWARE AGENT PROLIFERATION
In the 1990s, a processor such as the Pentium Pro Processor from Intel began using a firmware binary that defined or redefined certain operations of that CPU, which came to be known as Intel Microcode (or simply microcode). As the complexity of CPU and chipset designs grew, Intel added a separate microcontroller core (Foxton) to the design of the Core i7 processor codenamed Nehalem. This microcontroller was tasked with power management, and it ran in conjunction with the microcode. At the same time, the Platform Controller Hub began to add microcontrollers for audio, power management, and manageability. SoCs from the rest of the industry followed the same paradigm, embedding microcontrollers in lieu of hardwired logic. Today, the same processors contain even more embedded agents.

In a typical new design from Intel, there are microcontrollers to handle USB Type-C port switching, reconfiguration of analog lanes, the display engine, graphics memory translations, temperature compensation for analog circuits, hardware and software debugging support, image processing, asset administration and manageability (such as Intel's AMT features), security, packet processing, and more. These microcontrollers exist physically discrete in the silicon dies and execute independently of the execution of the primary Intel CPU(s) or governance by the operating system; their functionality and behavior are controlled by firmware that is usually contained in flash memory local to the microcontroller core itself (i.e., embedded in the same silicon die).

The proliferation of microcontrollers as embedded agents has been enabled by multiple factors: the availability of controllers in a variety of sizes due to improvements in manufacturing processes, the desire for flexibility of designs, and demand for in-the-field updates. Microcontroller cores are available in sizes ranging from ten thousand gates to hundreds of thousands of gates. For example, an exemplary microcontroller with 64 KB of SRAM occupies approximately 0.1 mm² on a 10-nm process, or around 0.1% of a client processor. Software tools and runtimes for popular cores (e.g., ARM-based) allow sophisticated development flows similar to regular desktop/server software; system designers can introduce new functionality to a computer system through these controllers using modern, high-level languages (e.g., Rust). This late-binding flexibility compensates for the lengthening of hardware development cycles and fabrication times, leading to more firmware agents working in conjunction in these designs.
On the other hand, such proliferation of the embedding of microcontrollers, their increased handling of operation-critical tasks for the overall system, and the easy access to sophisticated development tools have made microcontroller cores' firmware an attractive target for malfeasance. For product assurance, the firmware needs to be authenticated and protected against tampering.

FIRMWARE SIGNATURES
To assert that the code running on a given microcontroller originated from the expected source (the vendor) and has not been tampered with since it was deployed, the firmware image needs to have ways by which it can be signed and attested; there must exist the ability to assure that the firmware originated from a specific vendor (e.g., Intel) and that the bits that comprise the firmware binary have not been altered by a third party from the time it was installed to the time it is being loaded. This is commonly done with digital signature verification. A simplified example of performing this verification is to sign a message digest (or hash) of the firmware binary with a vendor private key and append it to the firmware. A system that has access to the firmware vendor's public key can then correctly verify the authenticity of this hash value. Any signature verification failure signals a modified firmware or the use of a foreign private key (i.e., the signature did not originate from the vendor). Thus, this mechanism prevents attackers from generating valid signatures of modified firmware.

In practice, firmware signature verification is complicated by the constraints on the environment in which the verification is to take place; since it usually occurs (at least once) very early in the system boot process, limited memory and/or computational resources are available. To provide a tangible, real-world example of firmware signatures, we discuss Intel's microcode patch.

Intel Microcode Signing
The constraints on microcode are significant. Microcode may be loaded multiple times from the power-on of a processor, and since it must be verified each time, the duration to decrypt and verify is critical to system responsiveness. Additionally, since main memory is not available in the earliest stages of initialization, microcode verification is done within the processor's cache, so it must fit into a relatively small footprint (e.g., 16 to 64 KB). Microcode as firmware mainly differs from a traditional microcontroller firmware image by its target processor core: the Core and Atom processors are multiple orders of magnitude higher performance than a typical embedded microcontroller core. While microcode, having the benefit of much faster execution via the main CPU, could potentially adopt cryptographically stronger (but more computation-resource-demanding) algorithms, those same algorithms may be unsuitable for the more constrained embedded cores.

STATE OF THE ART ON POSTQUANTUM DIGITAL SIGNATURES
There has been a recent surge in activity in the field of PQC, with several new schemes proposed every year and a rapidly growing community of researchers that scrutinize the robustness, performance, and ease of use of new and well-established postquantum cryptosystems. This increased interest may be related to the recently established standardization processes on PQC. The National Institute of Standards and Technology started a project to analyze possible PQC candidates for standardization. The Internet Engineering Task Force (IETF) has recently published Request for Comments (RFC) informational documents that specify stateful postquantum digital signatures such as the XMSS scheme.1 Experts from the International Standards Organization have also been working on a standing document on PQC and, more recently, on a study period on stateful hash-based signatures. In this section, we provide a brief analysis of the maturity, robustness, and performance of the signature schemes considered in the aforementioned PQC standardization processes.

Since most of the schemes discussed in this section have been submitted to the NIST PQC project, we will give some additional context about this process.
In November 2017, NIST received 19 submissions on postquantum stateless digital signatures for the 1st round of their PQC standardization competition. Among those, five were lattice-based, two code-based, seven multivariate-quadratic-based, and three symmetric-crypto-based schemes, plus two others that could not be classified in any of the previous categories. In January 2019, NIST selected only nine submissions to pass on to the second round of their competition. Among those, three are lattice-based, four are multivariate-quadratic-based, and two are symmetric-crypto-based schemes. According to NIST, the evaluation criteria used to select the second-round candidates were security (formal security proof and resistance to side-channel attacks), cost and performance (size of public key and signature, computational efficiency, and probability of failures), and algorithm and implementation characteristics (parameter flexibility, parallelism amenability, and simplicity). NIST defined a few security levels for their competition. Level 1 should match the postquantum security of AES-128, level 3 should match the postquantum security of AES-192, and level 5 should match the postquantum security of AES-256. Our performance analysis focuses on parameter sets that achieve security levels 1 and 5 (whenever possible), and on the reference implementations written by the scheme designers.

Symmetric-Crypto-Based Schemes
From a security perspective, these are the most conservative schemes since their security relies uniquely on the security of hash functions or block ciphers; thus, they do not introduce any additional security assumptions. This category splits into two subcategories: stateful and stateless schemes. Stateful signature schemes require the secure storage of some state (data) in between signature generations. Stateless signature schemes do not have this additional requirement. For example, RSA and EC-DSA are stateless schemes.

In case state management is a doable task, i.e., maintaining a piece of data securely in between signature generations, the stateful schemes should be considered. We remark that code-signing applications (such as firmware authentication) seem to be among the most suitable applications for stateful schemes since the manufacturer can carefully implement state management.

The XMSS scheme1 is a stateful hash-based signature that has recently been published as an informational RFC by the IETF. It can be regarded as an evolved version of the classical Merkle scheme. From a security perspective, XMSS enjoys a security proof in the standard model based on mild security assumptions on the hash function (e.g., pre-image and target collision resistance). From a performance perspective, since full collision resistance is not needed, XMSS can operate with smaller hash digests than the Merkle scheme. This leads to shorter signatures and faster processing. For 128 bits of quantum security, XMSS requires 64 bytes of public key and 2.44 KB for the signature, and verification takes about 760,000 cycles on a modern processor. For 64 bits of quantum security, XMSS requires 32 bytes of public key and 740 bytes of signature, and verification takes about 390,000 cycles on the same machine.

In the stateless subcategory, we consider SPHINCS+ and PICNIC. SPHINCS+12 is a stateless hash-based signature scheme that uses huge Merkle trees (e.g., height of 60). It is stateless because it selects the Merkle leaf nodes at random, and the chance of accidentally reusing the same leaf node twice (which would void its security guarantees) is negligible given the huge number of leaf nodes (e.g., 2^60). SPHINCS+ offers two sets of parameters per security level (thus six in total), one set optimized for speed and another optimized for small signatures. We focus our analysis on the speed-optimized parameters using SHA-256. For level 1, SPHINCS+ offers signatures of 16.57 KB and public keys of 32 bytes; signing takes 340 million cycles, and verification takes 14.98 million cycles ("SPHINCS+-SHA-256-128f-robust" parameter set). For level 5, these numbers grow to 48.06 KB, 64 bytes, 1,491.94 million cycles, and 37.59 million cycles ("SPHINCS+-SHA-256-256f-robust" parameter set), respectively.
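To make the hash-based family concrete, the toy Lamport one-time signature below shows how signing can rest on nothing but a hash function. This is an illustrative sketch (not XMSS or SPHINCS+ themselves; all code and names are ours):

```python
# Toy Lamport one-time signature over SHA-256, the kind of primitive that
# Merkle-tree constructions such as XMSS/SPHINCS+ build upon.
# Illustrative only: real schemes add Winternitz chains and Merkle trees.
import hashlib, secrets

H = lambda b: hashlib.sha256(b).digest()

def keygen():
    # 256 digest-bit positions x 2 values per bit -> 512 secret strings.
    sk = [[secrets.token_bytes(32) for _ in range(2)] for _ in range(256)]
    pk = [[H(s) for s in pair] for pair in sk]
    return sk, pk

def bits(msg):
    d = int.from_bytes(H(msg), "big")
    return [(d >> i) & 1 for i in range(256)]

def sign(sk, msg):
    # Reveals one preimage per digest bit; signing a second message with
    # the same sk leaks further preimages, hence "one-time" and the
    # state-management burden of stateful schemes.
    return [sk[i][b] for i, b in enumerate(bits(msg))]

def verify(pk, msg, sig):
    return all(H(s) == pk[i][b]
               for i, (s, b) in enumerate(zip(sig, bits(msg))))

sk, pk = keygen()
fw = b"firmware image v1"
sig = sign(sk, fw)
assert verify(pk, fw, sig)
assert not verify(pk, b"tampered image", sig)
```

Each key pair must sign exactly once; stateful schemes such as XMSS track which one-time keys have been consumed, while SPHINCS+ avoids the bookkeeping by randomizing over an enormous tree at the cost of much larger signatures.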
proof of knowledge. LowMC has been chosen bytes, signing takes 542,000 cycles and verifi-
as the underlying block cipher since it gives cation takes 88,000 cycles. For level 5, these
smaller signature sizes. Still, the signature for numbers change to 1.20 KB, 1.75 KB, 1.07 mil-
level 1 is 33.23 KB and the public key size is 32 lion cycles, and 186,000 cycles, respectively.
bytes (“picnic-L1-FS” parameter set). For level CRYSTALS-DILITHIUM uses module lattices
5, signature size is 129.74 KB and the public problems, which can be viewed in between the
key size is 64 bytes (“picnic-L5-FS” parameter ones used in Learning-With-Errors (LWE) and
set). For level 1, signing takes 137.91 million Ring-LWE problems. In other words, according
cycles and verification takes 90.63 million to the authors, they are just as efficient as Ring-
cycles, while for level 5 it takes 1,112.23 million LWE schemes but closer to the (stronger, more
and 736.31 million cycles, respectively. conservative) LWE underlying security problem.
The symmetric-crypto-based candidates For levels 1 and 3, CRYSTALS-DILITHIUM offers
have very strong security guarantees. In case signatures of size 1.99 and 3.28 KB, and public
state management is possible, stateful HBS keys of size 1.15 and 1.71 KB, respectively. The
schemes may be the most promising approach authors did not provide parameters for level 5.
since they are interesting from both security and In terms of speed, for level 1, it takes 1.3 million
performance perspectives. It is worth mention- cycles to sign and 272,000 cycles to verify. For
ing that digital signatures applied to verify firm- level 3, it takes 1.82 million and 510,000 cycles,
ware authenticity of embedded cores seem one respectively.
application where state management seems pos- qTesla is based on the well-known Ring-LWE
sible. The fact that the signatures are generated problem, and it offers good performance. On
by manufacturers (and not end users) that can April 14, 2019, researchers presented a potential
afford a robust signing facility with state man- attack against qTesla that may affect some of
Lattice-Based Schemes
In this category, we have Falcon,3 CRYSTALS-DILITHIUM,2 and qTesla.9 From the PQC families that introduce additional security assumptions, lattices are one of the most popular approaches. From a side-channel perspective, the Gaussian sampling process seems to offer some challenges to be implemented in a side-channel resilient way.
Falcon is based on the Short Integer Solution problem, known in the crypto community for some time, but applied to (structured) NTRU lattices. For level 1, Falcon signatures have 617 bytes and the public key has 897 bytes.
CRYSTALS-DILITHIUM is based on module lattice problems, which can be viewed as in between the ones used in Learning-With-Errors (LWE) and Ring-LWE problems. In other words, according to the authors, it is just as efficient as Ring-LWE schemes but closer to the (stronger, more conservative) LWE underlying security problem. For levels 1 and 3, CRYSTALS-DILITHIUM offers signatures of size 1.99 and 3.28 KB, and public keys of size 1.15 and 1.71 KB, respectively. The authors did not provide parameters for level 5. In terms of speed, for level 1, it takes 1.3 million cycles to sign and 272,000 cycles to verify. For level 3, it takes 1.82 million and 510,000 cycles, respectively.
qTesla is based on the well-known Ring-LWE problem, and it offers good performance. On April 14, 2019, researchers presented a potential attack against qTesla that may affect some of its parameter sets (qTesla’s authors have yet to respond). From a performance perspective, for level 1, qTesla offers signatures of size 1.3 KB and public keys of size 1.5 KB, and for level 5, signatures of size 5.9 KB and public keys of size 6.4 KB. Regarding speed, for level 1, signature generation takes 492,000 cycles and verification takes 82,000 cycles, while for level 5, signature generation takes 2.1 million cycles and verification takes 394,000 cycles.
Lattice-based cryptography is a popular PQC approach. One point of attention is the secure selection of parameters, which still seems to be challenging depending on the underlying security problem. From a performance perspective, all three lattice candidates offer reasonable performance and should be considered promising candidates.

Multivariate Quadratic (MQ)-Based Schemes
In this category, we have GeMSS,4 Rainbow,10 LUOV,6 and MQDSS.7 Several MQ schemes have been proposed in the past and subsequently broken. Security has become more stable in recent years; however, MQ remains the PQC family of digital signatures whose security is the
July/August 2019
21
Expert Opinion
Table 1. Comparison for security level 1 or the closest. Sizes in KB, speed in millions of cycles.
(Symmetric crypto: XMSS, SPHINCS+, PICNIC. Lattices: Falcon, CRYSTALS-DILITHIUM, qTesla. Multivariate quadratic: GeMSS, Rainbow, LUOV, MQDSS.)

                    XMSS   SPHINCS+  PICNIC  Falcon  CRYSTALS-DILITHIUM  qTesla  GeMSS   Rainbow  LUOV   MQDSS
Signature size      0.72   16.57     33.23   0.60    1.99                1.37    0.03    0.06     0.30   20.36
Public key size     0.03   0.03      0.03    0.87    1.15                1.50    417.40  149.00   12.10  0.04
Signing speed       –      340.00    137.91  0.54    1.37                0.49    690.00  0.40     5.40   26.63
Verification speed  0.39   14.98     90.63   0.08    0.27                0.08    29.10   0.15     4.30   19.84
least understood. The main benefit of MQ schemes is their compact signature sizes.
GeMSS is a scheme based on the hidden field equations underlying problem and can be seen as a variant of Quartz, a scheme proposed in 2001 that remains one of the few unbroken MQ schemes. GeMSS signatures are very compact: only 258 bits for level 1, and 588 bits for level 5. Public key sizes are 417 KB and 3,046.84 KB, respectively. However, GeMSS is not speed efficient: for level 1, signing takes 6,690 million cycles and verification takes 29 million cycles, while for level 5, signing takes 25,300 million cycles and verification takes 172 million cycles.
Rainbow is a signature scheme based on the well-known Unbalanced-Oil-and-Vinegar (UOV) signature scheme (which itself is based on the Oil-and-Vinegar scheme). From a practical perspective, for level 1, the signature is 512 bits long and the public key is 149.00 KB long, while signing takes 402,000 cycles and verification takes 155,000 cycles. For level 5, the signature is 1.59 KB and the public key is 1,227.10 KB long, while signing takes 3.6 million cycles and verification takes 2.3 million cycles.
Lifted-UOV (LUOV) is a scheme also based on the UOV scheme. The main difference from UOV consists of some optimizations to reduce the public key size (e.g., lifting the UOV public key to an extension field). For level 2, signature and public key sizes are 311 bytes and 12.1 KB, respectively. Signing takes 5.4 million cycles and verification takes 4.3 million cycles. For level 5, signature and public keys are 494 bytes and 75.5 KB, respectively, while signing takes 24 million cycles and verification takes 18 million cycles.
MQDSS is a scheme based on the combination of the Sakumoto–Shirai–Hiwatari (SSH) identification scheme with the Fiat-Shamir transform. This is a very innovative proposal. For level 1, its signature and public key sizes are 20 KB and 46 bytes, respectively, while signing takes 26 million cycles and verification takes 19 million cycles. For level 3, its signature and public key sizes are 42 KB and 64 bytes, respectively, while signing takes 85 million cycles and verification takes 62 million cycles.
MQ-based schemes offer interesting performance benefits, such as tiny signatures from GeMSS or tiny public keys from MQDSS. However, the field of multivariate-quadratic schemes would still benefit from a more comprehensive security analysis. The PQC standardization process may help by promoting these schemes and thus attracting an increasing number of researchers to expand the knowledge in this field and increase the confidence of potential users.
Tables 1 and 2 show the performance of the second-round candidates of the NIST PQC competition plus the XMSS scheme published in IETF RFC 8391. We acknowledge that these numbers (most of them obtained from the submission packages) were collected on different platforms, and therefore the speed numbers should be taken as a rough approximation of the actual performance, useful when considered from an orders-of-magnitude perspective.
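To get a feel for what these cycle counts mean on embedded cores, they can be converted into rough wall-clock estimates. The snippet below is a back-of-the-envelope reading aid only (the 100-MHz core is a hypothetical example, and the calculation ignores memory effects); the values are transcribed from the level-1 verification row of Table 1.

```python
# Level-1 verification speeds from Table 1, in millions of cycles.
verify_mcycles_l1 = {
    "XMSS": 0.39, "SPHINCS+": 14.98, "Picnic": 90.63, "Falcon": 0.08,
    "CRYSTALS-DILITHIUM": 0.27, "qTesla": 0.08, "GeMSS": 29.10,
    "Rainbow": 0.15, "LUOV": 4.30, "MQDSS": 19.84,
}

def verify_time_ms(scheme: str, core_mhz: float) -> float:
    """Order-of-magnitude verification latency on a core at the given
    clock: (cycles) / (cycles per second), reported in milliseconds."""
    return verify_mcycles_l1[scheme] * 1e6 / (core_mhz * 1e6) * 1e3

# e.g., on a hypothetical 100-MHz embedded microcontroller:
# XMSS ~ 3.9 ms, Falcon ~ 0.8 ms, Picnic ~ 906 ms
```

The spread of three orders of magnitude between the fastest and slowest verifiers is exactly why the embedded-core perspective discussed later in this article matters.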
22 IEEE Micro
Table 2. Comparison for security level 5 or the closest. Sizes in KB, speed in millions of cycles.
(Symmetric crypto: XMSS, SPHINCS+, PICNIC. Lattices: Falcon, CRYSTALS-DILITHIUM, qTesla. Multivariate quadratic: GeMSS, Rainbow, LUOV, MQDSS.)

                    XMSS   SPHINCS+  PICNIC    Falcon  CRYSTALS-DILITHIUM  qTesla  GeMSS     Rainbow   LUOV   MQDSS
Signature size      2.44   48.06     129.74    1.20    3.28                5.92    0.07      1.59      0.50   42.70
Public key size     0.06   0.06      0.06      1.75    1.71                6.43    3,046.84  1,227.10  75.50  0.06
Signing speed       –      1,491.94  1,112.23  1.07    1.82                2.15    25,300    3.64      24.00  85.26
Verification speed  0.74   37.59     736.31    0.18    0.51                0.39    172.00    2.39      18.00  62.30
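One convenient way to read the two tables together is to filter candidates against a byte budget. The helper below is purely a reading aid (the 10-KB threshold is an arbitrary example, not a recommendation); the sizes are transcribed from Table 2.

```python
# Level-5 sizes (KB) from Table 2: (signature, public key).
level5_kb = {
    "XMSS": (2.44, 0.06), "SPHINCS+": (48.06, 0.06), "Picnic": (129.74, 0.06),
    "Falcon": (1.20, 1.75), "CRYSTALS-DILITHIUM": (3.28, 1.71),
    "qTesla": (5.92, 6.43), "GeMSS": (0.07, 3046.84),
    "Rainbow": (1.59, 1227.10), "LUOV": (0.50, 75.50), "MQDSS": (42.70, 0.06),
}

def within_budget(budget_kb: float):
    """Schemes whose signature plus public key fit the given budget."""
    return sorted(n for n, (sig, pk) in level5_kb.items() if sig + pk <= budget_kb)

# within_budget(10.0) -> ['CRYSTALS-DILITHIUM', 'Falcon', 'XMSS']
```

Note how the combined metric changes the picture: GeMSS has by far the smallest signatures, but its multi-megabyte public key dominates the sum.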
Analysis of Available PQC Solutions
The most important criterion for the selection of PQC schemes should be security. The PQC transition offers considerably greater challenges than previous (symmetric) cryptographic transitions (for example, from 3DES to AES, or SHA-1 to SHA-2). Public key cryptosystems are considerably more complex than symmetric ones, and we still do not fully understand the capabilities of a typical, future quantum adversary. In this context, conservativeness seems to be the most reasonable stance.
Schemes based on symmetric crypto building blocks (e.g., hash or block ciphers) seem to be an extremely promising approach, as they do not introduce any additional security assumptions. In cases where state management is possible, XMSS is a strong candidate, as it offers interesting performance and a security proof in the standard model. In case state management is not possible, SPHINCS+ may be preferable, since it does not even need zero-knowledge proofs as seen in PICNIC. Where state management is not possible and the performance offered by stateless symmetric-crypto-based algorithms is not acceptable, the competing candidates from other PQC families should be considered. In this case, lattice-based algorithms seem to offer an interesting balance between security and performance. Some of the lattice-based underlying problems have gone through intense academic scrutiny for several years (in some cases, decades). One remark, however, is that the community may still benefit from a better understanding of how to choose secure parameters. From a performance perspective, lattice-based algorithms offer decent performance. For example, CRYSTALS-DILITHIUM has both signature and public key size in the low single-digit kilobytes, and speed in the low single-digit millions of cycles. Finally, multivariate-quadratic algorithms may have interesting performance advantages, such as the tiny signatures of 258 bits offered by GeMSS, but they are the candidates that would benefit the most from additional security assessments, given the recurrent attacks against MQ digital signature schemes throughout history.
In summary, if state management is possible, IETF schemes (e.g., XMSS) seem to be a very promising approach. If state management is not possible, the candidates of the second round of the NIST competition (following the prioritization described above: 1—stateless symmetric-crypto-based, 2—lattice-based, 3—MQ) seem to be a promising approach.

COST OF CRYPTOGRAPHY VS. ADOPTION
Generally, technology adoption increases as performance increases and cost decreases. We expect postquantum cryptography to follow suit. NIST recently published “The Economic Impacts of the Advanced Encryption Standard, 1996-2017,” which estimated a $250 billion impact. A key factor in the success of AES has been the phenomenal performance achieved in modern microprocessors. Galois Counter Mode (GCM) is a popular AES mode of operation used in the networking space that secures the majority of internet traffic. Consider the historic cost
of the computation of GCM, and the adoption rate, in the figure below, gathered from the ICSI Notary (https://notary.icsi.berkeley.edu/).
In the figure above, the “performance” of running the GCM algorithm—presented as “cost in cycles per byte,” where lower cost per byte indicates higher performance—corresponds to various Intel processors that launched on those dates, with a trend of continuously improving GCM performance. Notably, in 2013 there was a dramatic increase in GCM performance (indicated by a significant drop in cycles/byte), which we believe contributed to a steep rise of GCM adoption across the industry.
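The cost-per-byte metric translates directly into per-core throughput: at c cycles per byte on an f-GHz core, AES-GCM processes f/c GB per second. The numbers in the snippet below are illustrative stand-ins chosen for the example, not values taken from the ICSI data.

```python
def gcm_throughput_gbps(cycles_per_byte: float, core_ghz: float) -> float:
    """Per-core throughput in GB/s: (cycles per second) / (cycles per byte)."""
    return core_ghz / cycles_per_byte

# Illustrative: dropping from 2.0 to 0.65 cycles/byte on a 3-GHz core
# raises per-core AES-GCM throughput from 1.5 GB/s to roughly 4.6 GB/s.
```

This reciprocal relationship is why a step change in cycles/byte, like the 2013 one noted above, shows up as a multiplicative jump in achievable line rate.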
INTEL’S DIRECTION ON PQC AND FIRMWARE
Intel’s strategy is to continue securing platforms using cryptographically strong digital signature standards that execute efficiently. Intel will continue to be aligned with standardization organizations and will evaluate algorithms based on security assurance, hardware cost, and performance—including that in embedded cores. Until the full PQC transition has completed, a hybrid solution where two or more algorithms are executed in parallel—in order to remove dependence on a single algorithm or class of algorithms—seems an interesting approach. This should allow for more flexibility and minimize redesign time of embedded microcontroller systems should a single (class of) algorithm become infeasible to deploy for their firmware authentication.
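The hybrid approach can be expressed compactly: accept a firmware image only when every constituent signature verifies. The code below is an illustration under stated assumptions; the MAC-based stub verifiers merely stand in for a real classical (e.g., ECDSA) and PQC verifier pair, and nothing here reflects Intel's actual implementation.

```python
import hashlib
import hmac

def make_stub_verifier(key: bytes):
    """Stand-in for a real signature verifier (MAC-based, demo only)."""
    def verify(msg: bytes, tag: bytes) -> bool:
        expected = hmac.new(key, msg, hashlib.sha256).digest()
        return hmac.compare_digest(expected, tag)
    return verify

def verify_hybrid(message: bytes, sig_classical: bytes, sig_pqc: bytes,
                  verifiers) -> bool:
    """Accept a firmware image only if EVERY constituent signature checks
    out, so a break of any single algorithm (or algorithm class) is not
    enough to forge an acceptable image."""
    return all([
        verifiers["classical"](message, sig_classical),
        verifiers["pqc"](message, sig_pqc),
    ])
```

The AND-composition is the design point: the hybrid remains secure as long as at least one of the component algorithms remains unbroken, at the cost of carrying both signatures.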
CONCLUSION
Firmware authenticity is a key factor in the proper function and security assurance of a platform built with embedded microcontroller cores. With the need for PQC-secure cryptographic algorithms to continue to assure this authenticity, Intel is investigating possibilities currently under consideration in various standardization processes. Our priority in this selection is and will always be security. Regarding performance, we are specifically focusing on the capabilities of platforms’ embedded cores to execute PQC algorithms and looking for algorithms that have parameters allowing flexibility in fitting their execution to the capabilities of both current and planned embedded cores.

REFERENCES
1. A. Huelsing, D. Butin, S. Gazdag, J. Rijneveld, and A. Mohaisen, (2018). XMSS: eXtended Merkle Signature Scheme - Request for Comment 8391 (RFC 8391). Internet Engineering Task Force (IETF). Retrieved: 24 June, 2019, https://tools.ietf.org/html/rfc8391
2. V. Lyubashevsky, L. Ducas, E. Kiltz, T. Lepoint, P. Schwabe, G. Seiler, and D. Stehlé, (2017). CRYSTALS-DILITHIUM - A Submission to the NIST Post-Quantum Cryptography Standardization Project. National Institute of Standards and Technology (NIST). Retrieved: 24 June, 2019, https://pq-crystals.org/
3. T. Prest, P.-A. Fouque, J. Hoffstein, P. Kirchner, V. Lyubashevsky, T. Pornin, and Z. Zhang, (2017). Falcon - A Submission to the NIST Post-Quantum Cryptography Standardization Project. National Institute of Standards and Technology (NIST). Retrieved: 24 June, 2019, https://falcon-sign.info/
4. A. Casanova, J.-C. Faugère, G. Macario-Rat, J. Patarin, L. Perret, and J. Ryckeghem, (2017). GeMSS - A Submission to the NIST Post-Quantum Cryptography Standardization Project. National Institute of Standards and Technology (NIST). Retrieved: 24 June, 2019, https://www-polsys.lip6.fr/Links/NIST/GeMSS.html
5. L. K. Grover, “A fast quantum mechanical algorithm for database search,” in Proc. 28th Annu. ACM Symp. Theory Comput., 1996, pp. 212–219.
6. W. Beullens, B. Preneel, A. Szepieniec, and F. Vercauteren, (2017). LUOV - A Submission to the NIST Post-Quantum Cryptography Standardization Project. National Institute of Standards and Technology (NIST). Retrieved: 24 June, 2019, https://www.esat.kuleuven.be/cosic/pqcrypto/luov/
7. S. Samardjiska, M.-S. Chen, A. Hülsing, J. Rijneveld, and P. Schwabe, (2017). MQDSS - A Submission to the NIST Post-Quantum Cryptography Standardization Project. National Institute of Standards and Technology (NIST). Retrieved: 24 June, 2019, http://mqdss.org/
8. G. Zaverucha, M. Chase, D. Derler, S. Goldfeder, C. Orlandi, S. Ramacher, and V. Kolesnikov, (2017). Picnic - A Submission to the NIST Post-Quantum Cryptography Standardization Project. National Institute of Standards and Technology (NIST). Retrieved: 24 June, 2019, https://microsoft.github.io/Picnic/
9. N. Bindel, S. Akleylek, E. Alkim, P. S. Barreto, J. Buchmann, E. Eaton, and G. Zanon, (2017). qTesla - A Submission to the NIST Post-Quantum Cryptography Standardization Project. National Institute of Standards and Technology (NIST). Retrieved: 24 June, 2019, https://qtesla.org/
10. J. Ding, M.-S. Chen, A. Petzoldt, D. Schmidt, and B.-Y. Yang, (2017). Rainbow - A Submission to the NIST Post-Quantum Cryptography Standardization Project. National Institute of Standards and Technology (NIST).
11. P. Shor, “Algorithms for quantum computation: Discrete logarithms and factoring,” in Proc. 35th Annu. Symp. Foundations Comput. Sci., 1994, pp. 124–134, Santa Fe: IEEE Comput. Soc. Press.
12. D. Bernstein, C. Dobraunig, M. Eichlseder, S. Fluhrer, S. Gazdag, A. Hülsing, and F. Mendel, (2017). SPHINCS+ - A Submission to the NIST Post-Quantum Cryptography Standardization Project. National Institute of Standards and Technology (NIST). Retrieved: 24 June, 2019, https://sphincs.org/

Rafael Misoczki is a cryptographer/research scientist at Intel Labs. His work is focused on post-quantum cryptography and its application to secure update, root of trust, remote attestation, and other security flows. He has a PhD from the University of Paris (Pierre et Marie Curie), with a thesis on efficient constructions for post-quantum cryptography. He also holds an MSc in electrical engineering and a BSc in computer science from the University of Sao Paulo. Contact him at rafael.misoczki@intel.com.

Sean Gulley is a principal engineer in Intel’s Data Center Group, responsible for anticipating and accelerating new algorithm-intensive workloads. Since joining Intel in 2001, he has focused primarily on cryptography and compression HW and SW solutions for client and data center. He has a BS in computer engineering from Tufts University and an MS in electrical engineering from Stanford University. He has over 30 U.S. patents. Contact him at sean.gulley@intel.com.

Vinodh Gopal is a senior principal engineer at Intel, working in the Data Center Group. His work includes accelerators, instruction-set extensions for x86, and architectural enhancements to processors, in applications such as cryptography, integrity, compression, and analytics over a range of products. In 2019, he won the Intel Inventor of the Year Award. Contact him at vinodh.gopal@intel.com.

Martin G. Dixon is an Intel Fellow in the Intel Product Assurance and Security (IPAS) group and director of architecture at Intel Corporation. He is responsible for guiding future research and architecture decisions to secure Intel’s platforms. He has published a dozen academic papers in the field of computer architecture and holds 50 patents in the fields of computer architecture and cryptography. He has a bachelor’s degree in electrical and
computer engineering from Carnegie Mellon University. Contact him at martin.dixon@intel.com.

Hrvoje Vrsalovic has been involved in firmware and app-to-device interface software development in one form or another ever since being part of the original team that created Palm’s WebOS in 2008. Since then, he has worked on software—and sometimes hardware—of many “smart” consumer products, particularly wearables. He recently joined Intel’s IPAS group as a security architect. He has a BSc in computer science from UCSB and an MSc in electrical and computer engineering from Carnegie Mellon University. Contact him at harvey.vrsalovic@intel.com.

Wajdi K. Feghali is an Intel Fellow and the director of the Security and Algorithms Center of Innovation in the Data Center Group at Intel Corporation. He leads the development of cryptography, compression, data integrity, and data de-duplication hardware and software solutions with a focus on efficient performance across Intel products. He has been granted more than 50 U.S. patents, with numerous other patents pending, and is the author of several published technical papers. He has a bachelor’s degree in mathematics with a minor in computer science from the University of Ottawa. Contact him at wajdi.feghali@intel.com.
Energy-Secure System Architectures (ESSA): A Workshop Report

Pradip Bose, IBM T. J. Watson Research Center
Saibal Mukhopadhyay, Georgia Institute of Technology
MODERN MICROPROCESSOR CHIPS have multiple processing engines (or cores) that are architected to solve a variety of problems in individual and cooperative execution modes. In the current regime of commercial designs, we already see double-digit core counts; and if one considers the degree of hardware multithreading supported in each core, the number of hardware threads that can be supported in concurrent execution adds up to many dozens or scores. For example, IBM’s prior-generation POWER8 processor chip already supported up to 96 hardware threads via its 12 cores, each of which can execute in up to an eight-way simultaneously multithreaded mode. As explained in recent ISSCC technology trend data, while the core count growth has been steady, the clock frequency has saturated around the 4-GHz mark—mainly limited by power density (or temperature) constraints. Effective parallelization of application codes, supported by many-core/many-thread hardware engines, is the established trend in current computing. Since 96-thread POWER8 server chips have already been in the market for a few years, it is not unrealistic to expect around 50 cores and perhaps 200 hardware threads supported in a couple of generations. Of course, due to area pressures, one can expect to see leaner (simpler) cores with only modest single-thread performance growth.
This technology- and market-driven trend toward throughput-oriented (scale-out) designs implies a major challenge in terms of chip-level power and/or thermal management—in a regime where balanced performance growth (single-thread versus throughput) at affordable power becomes a steeper challenge over time. And, at the full system (i.e., server, rack, or data center) level, the challenge can be even greater. At whatever scale one is interested in managing such metrics (i.e., power or temperature, or even related ones, like system reliability),

Digital Object Identifier 10.1109/MM.2019.2921508
Date of current version 23 July 2019.
July/August 2019 Published by the IEEE Computer Society 0272-1732 © 2019 IEEE
on-chip and system-level power management control architectures will need to be carefully architected. These must ensure the right tradeoffs are applied at runtime to make sure that workload-dependent performance is maximized (at least to the extent that customer service-level agreements are met) while adhering to system-imposed power consumption limits.
Ever since on-chip and system-level power management architectures have become routine in the industry,1-3 concerns about reliable operation and associated security vulnerabilities have been present in the minds of both the designer and researcher community. What if the sense-and-actuate feedback control system(s) implemented in such a design had latent bugs, wherein a corner-case workload (launched maliciously or inadvertently) could disrupt the intended functionality and cause the system to fail? Could the chip or system incur irreparable physical damage? At a minimum, could a power virus attack result in significant performance degradation for other (regular) customer workloads—effectively signaling a denial-of-service attack? In fact, even before power management control systems were in vogue, research papers4 had demonstrated physical attacks, where thermally induced memory bit-flips could enable Trojan software to take over the full system, evading immediate detection. Later on, the Charlie Miller hacks of Apple laptop battery control loops created quite a stir,5 and data-center-level power (energy) attack vectors were demonstrated.6
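The latent-bug scenario is easiest to picture in the actuator path of a DVFS controller. The toy sketch below (an assumed interface, not any product's code) enforces the kind of invariant that software-driven fault attacks on power management exploit when it is missing: only voltage/frequency pairs from a vetted table ever reach the hardware.

```python
# Vetted (frequency MHz, voltage mV) operating performance points.
SAFE_OPP_TABLE = {(800, 900), (1200, 1000), (1600, 1100), (2000, 1200)}

class DVFSActuator:
    def __init__(self, opp_table):
        self.opp_table = opp_table
        self.current = min(opp_table)  # start at the lowest point

    def request_opp(self, freq_mhz: int, volt_mv: int) -> bool:
        """Apply a requested operating point only if it is in the vetted
        table; a software-issued out-of-range pair (e.g., high frequency
        at low voltage) is rejected instead of being driven to hardware."""
        if (freq_mhz, volt_mv) not in self.opp_table:
            return False  # reject: potential fault-induction attempt
        self.current = (freq_mhz, volt_mv)
        return True
```

Real controllers must also guard the transition path (ramp rates, lock times), but even this simple allow-list check removes the most direct software route to an unsafe electrical state.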
In light of the above motivational background, it is not surprising that researchers from IBM Research (led by Pradip Bose) initiated IBM internal research on a variety of power attacks and their mitigation around the year 2010. This research was supported by a DARPA seedling grant in 2011–2012, and this also helped that research team launch the ESSA workshop series in the beginning of 2011 [in conjunction with the International Symposium on Computer Architecture (ISCA)]. An early visionary paper on the ESSA theme was presented by Bose in 2012.14 This workshop series was interrupted for a few years, before being resurrected recently in conjunction with the hardware-security-focused conference HOST 2019. The workshop was jointly organized by Dr. Pradip Bose from IBM Research and Prof. Saibal Mukhopadhyay from Georgia Institute of Technology. The key new feature of ESSA 2019 was to bring circuit-level power-management experts within the ESSA community to explore a cross-layer approach to energy-secure processor design. In the next section, we provide a summary description of ESSA 2019, which was held on May 9–10 at Tysons Corner, VA, USA.

SUMMARY PROCEEDINGS OF ESSA-2019
The initial years of workshop offerings around the ESSA theme resulted in a successful spawning off of a few academic research projects—which was the underlying objective. Perhaps the best-known research that has been reported in the literature since that time is the CLKSCREW attack modality12 published by Simha Sethumadhavan’s group at Columbia University. This work not only demonstrates the use of software code segments to disrupt the power management controls in a processor, it also shows how side-channel attacks can be orchestrated around this basic attack paradigm. In general, the scope of “energy attacks” has expanded to side-channel attacks that exploit energy leaks of various types. As such, in ESSA 2019 (https://www.essa-workshop.org/), the technical scope of the workshop was expanded to broadly cover “the range of research being pursued within industry and academia in order to ensure robust and secure functionality while meeting the energy-related constraints of the green computing era.” The technical program of the workshop (https://www.essa-workshop.org/#program) consisted of one keynote, three visionary invited talks, three ESSA-relevant special invited talks, four contributed regular papers, and one panel session.

Keynote
The workshop began with an informative keynote address by John Marsh, who represented Linton Salmon, Program Manager of the ongoing DARPA program called System Security Integrated Through Hardware and Firmware (SSITH). This was a valuable readout of the most promising research projects that are currently being pursued in the mitigation of software-assisted hardware attacks. The list of currently active SSITH program performers and the key innovations of their approaches, as quoted from Marsh’s talk, is shown in Table 1.
Table 1. Summary description of SSITH projects.

Prime | Technical area | Point of contact | Technical approach
Charles Draper Lab | H/W Architecture | Arun Thomas, arun@draper.com | Every word has metadata and every instruction is checked, based on flexible security micro-policies defined in software (DSL); compartmentalization; PPASS workbench
Lockheed Martin | H/W Architecture | Jim Eiche, james.eiche@lmco.com | Combination of efficient tagging, fenced regions, protection domains, per-thread keying, and memory encryption; security hardware in parallel with CPU; no source changes (binary analysis)
MIT | H/W Architecture | Adam Chlipala, adamc@csail.mit.edu | Hardware security compiler with end-to-end formal verification; generic support for tagging policies; compartmentalized secure enclaves
SRI International | H/W Architecture | Robert Watson, robert.watson@cl.cam.ac.uk | Tags on every word of data and every instruction implement bounds checking and permissions; encapsulation; formal methods to verify security; security architecture extended to DMA engines
UC San Diego | H/W Architecture | Dean M. Tullsen, tullsen@cs.ucsd.edu | Anti-fragility approach learns from attacks; machine learning based on hardware performance registers; efficient x86 implementation leveraging micro-ops
University of Michigan | H/W Architecture | Todd Austin, austin@umich.edu | High entropy (hard to hack) plus rapid churning (no time to exploit) to mediate “undefined semantics”; tagging; encryption; relocation of memory areas
Galois | Test and Metrics | Joe Kiniry, kiniry@galois.com | Automated system security metrics framework and tools; objective analysis; use of formal methods; GFE IP development and support; RISC-V baseline for evaluation
Visionary Talks
The first visionary talk was delivered by Dr. Vivek De from Intel Labs. In his visionary talk, titled “Attack-Resistant Energy-Efficient SoC Design,” Dr. De provided an in-depth perspective on the fundamental issues and tradeoffs in the design of power-performance-area (PPA) efficient SoCs that can also be architected to be resilient to malicious attacks. Figure 1 depicts the basic features of resilient platforms, as presented in Dr. De’s talk. The software, firmware, and hardware layers of abstraction in the design stack that need to be co-designed to factor in targeted resiliency features, while meeting critical PPA metrics, were covered. In order to bake the “attack resistance” shield into the general framework of resilient platforms, the architecture of the secure roots of trust (including the use of secure and variation-tolerant PUF/TRNG) was described.

Figure 1. Resilient platforms.

The second part of the talk showed examples from recent literature where existing power management circuit techniques are leveraged and re-purposed to improve resistance to power- and electromagnetic-emission-based side-channel attacks. The talk showed
that circuit techniques can strongly impact the energy-security tradeoff in an SoC.
The second visionary talk was delivered by Prof. Simha Sethumadhavan from Columbia University. In his talk, titled “Software vectored fault attacks,” Sethumadhavan recalled the CLKSCREW attack work7,8 that his group had pioneered, and then painted a picture of the work that is emerging beyond that groundbreaking prior work. The CLKSCREW research exposed the vulnerabilities in classical DVFS-based power management in embedded systems (see Figure 2), and has since resulted in solutions to fix such gaps in security in a class of popular commercial mobile platforms driven by the Android OS. Sethumadhavan presented a wish list of future energy-secure architectures, with a focus on hardware security features that would need to be augmented in order to mitigate the threat of energy-sourced side-channel intrusions and physical attacks.

Figure 2. Abstract view of energy management.

Prof. Onur Mutlu’s (ETH Zurich) visionary talk, titled “Using commodity memory devices to support fundamental security primitives,” was an in-depth journey into the fundamentals of memory system architectures and the associated security-related vulnerabilities. Initially, Mutlu spoke about solution approaches to reduce data-movement-related power consumption through architectural innovations. Subsequently, a significant part of the talk was on ways to use memory devices to generate true random numbers with low latency and high throughput. Generation and evaluation of physically unclonable functions (PUFs) was another aspect that was covered. These capabilities were pointed out to be crucial elements in system-level security mechanisms. Another item covered was quick destruction of in-memory data (for DRAMs). The security-centric aspects of the presentation focused mainly on the speaker’s work published in HPCA 2018,15 HPCA 2019,16 and arXiv 2019.17
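To give a flavor of how noisy commodity-memory reads can be distilled into usable random bits, the snippet below shows the classic von Neumann debiasing step. This is a generic textbook illustration (the DRAM-based TRNG designs covered in the talk are considerably more involved): input pairs 01 and 10 map to output bits 0 and 1, while 00 and 11 are discarded.

```python
def von_neumann_extract(raw_bits):
    """Turn a biased-but-independent bit stream (e.g., reads of flaky
    memory cells) into unbiased output: 01 -> 0, 10 -> 1, 00/11 -> drop."""
    out = []
    for a, b in zip(raw_bits[::2], raw_bits[1::2]):
        if a != b:
            out.append(a)
    return out

# von_neumann_extract([0,1, 1,1, 1,0, 0,0, 0,1]) == [0, 1, 0]
```

The cost of the debiasing is throughput (at least half the raw bits are discarded), which is why the low-latency, high-throughput raw sources discussed in the talk matter.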
Special Invited Talks


Figure 2. Abstract view of energy management. Prof. Mingoo Seok (Colum-
bia University) presented a
paper (M. Seok, A. Tang, Z.
that circuit techniques can strongly impact the
Jiang, S. Sethumadhavan, “Blacklist core:
energy-security tradeoff in a SoC.
machine-learning-based power management tam-
The second visionary talk was delivered by
pering,”) where machine-learning-based dynamic
Prof. Simha Sethumadhavan from Columbia
operating performance point blacklisting is used
University. In his talk, titled: “Software vecto-
for mitigating software based power-management
red fault attacks,” Sethumadhavan recalled
tampering. Using CLKSCREW attack as an exam-
the CLKSCREW attack work7,8 that his group had
ple, Seok argued that static guard-banding could
pioneered, and then painted a picture of the
mitigate such attacks but it incurs performance
work that is emerging beyond that ground-
degradation, power efficiency loss, and long test-
breaking prior work. The CLKSCREW research
ing time. Instead, the talk introduced a detection-
exposed the vulnerabilities in classical DVFS-
then-mitigation approach, which uses a neural-net
based power management in embedded systems
model and detects a malicious command to put
(see Figure 2), and has since resulted in solu-
the system on an unsafe operating performance
tions to fix such gaps in security in a class of
point. If detected, it then mitigates the attack by
popular commercial mobile platforms driven
ignoring the command. The algorithm and hard-
by the Android OS. Sethumadhavan presented
ware realization of the technique shows the ability
a wish list of future energy-secure architectures,
to detect and mitigate CLKSCREW attempts at a
with a focus on hardware security features that
reasonably small amount of overhead in power,
would need to be augmented in order to mitigate
delay, and area. The talk illustrated that co-design
the threat of energy-sourced side channel intru-
of circuit and algorithm is necessary to optimally
sions and physical attacks.
tradeoff the energy and security behavior of an
Prof. Onur Mutlu’s (ETH Zurich) visionary
SoC.
talk, titled “Using commodity memory devices
Prof. Swaroop Ghosh (Penn State) spoke on:
to support fundamental security primitives,” was
“Security of persistent memories.” Excellent prop-
an in-depth journey into the fundamentals of erties, such as zero leakage, high-density, scalabil-
memory system architectures, and the associated ity, and high endurance, of emerging non-volatile
security-related vulnerabilities. Initially, Mutlu memories (NVMs) make them an attractive candi-
spoke about the solution approaches to reduce date for energy management in SoCs. However,
data movement related power consumption Ghosh pointed out that although NVMs can reap
through architectural innovations. Subsequently, energy and performance benefits they may face
a significant part of the talk was on ways to use new security issues that were not perceived
memory devices to generate true random num- before. His talk discussed several potential vulner-
bers with low latency and high throughput. Gener- abilities of NVMs, and how they can be exploited
ation and evaluation of physically unclonable to compromise data integrity (e.g., tampering and

30 IEEE Micro
row-hammering) and data privacy. The talk also explored circuit and system level methods for sensing and inhibiting attacks on NVMs. The talk reminded the audience that new technologies introduced for energy management can also lead to new security challenges.

Prof. Vijay Janapa Reddi (Harvard) spoke on: "Closing the performance, power and reliability gap in autonomous aerial machines." Reddi has been working on the topic area of "aerial computing," and this particular presentation addressed the fundamental issues of power-performance efficiency and resilience in designing the embedded processor engines that power autonomous drones. The connection between reliability and security, as also articulated in Sethumadhavan's visionary talk, was re-examined briefly in Reddi's presentation.

Contributed Regular Papers
The regular technical presentations consisted of the following contributed papers:

• K. Khatamifard, L. Wang, S. Kose, A. Das, and U. Karpuzcu, "A novel class of covert channels enabled by power budget sharing."
• A. Krishnan and P. Schaumont, "Hardware support for secure intermittent architectures."
• I. Tochukwu and A. Ismail, "Holistic hardware security assessment framework: A microarchitectural framework."
• D. Trilla, C. Hernandez, J. Abella, and F. Cazorla, "Four birds with one stone: On the use of time randomized processors and probabilistic analysis to address timing, reliability, energy and security in critical embedded autonomous systems."

It is well known that runtime power management is in charge of the optimal distribution of the power budget—a very critical shared resource—among system components. The paper by Khatamifard et al. argued that any system-wide shared resource can give rise to covert communication if not properly managed, and the power budget, unfortunately, does not represent an exception. The paper presented a proof-of-concept demonstration of covert communication exploiting a shared power budget and discussed the potential design space for countermeasures. The paper argued that a secure power management infrastructure must be aware of the potential threats associated with sharing power across multiple entities.

The paper by Krishnan et al. stressed the need for a secure power transition mechanism to convert the active system state into a protected non-volatile form and back in energy-harvesting-based IoT edge platforms. The paper observed that secure checkpointing is necessary, but is expensive to compute and requires hardware-accelerated cryptography and isolated secure non-volatile storage. The paper defined an energy-harvester subsystem interface that drives the optimized execution of a secure communication protocol such that wasted energy is eliminated and run-time performance is improved.

The paper by Tochukwu et al. presented the challenging but critical need to enable a holistic hardware security evaluation from the microarchitectural point of view. The paper took an important step in this direction by proposing a framework that categorizes threat models based on the microarchitectural components being targeted and provides a generic security metric that can be used to assess the vulnerability of components, as well as the system as a whole.

Finally, the paper by Trilla et al. observed that as the complexity and time-criticality of operations performed in autonomous vehicles continue to increase, conflicting requirements arise in designing such processors. On the one hand, a simple and predictable processor design facilitates verification of functional and nonfunctional metrics; on the other hand, using high-performance and complex processor designs with some degree of obfuscation can deliver high computing performance and security. The paper argued that time-randomized processors (TRPs), an alternative to traditional (deterministic) designs, can address these conflicting requirements. TRPs facilitate timing analysis via the use of statistical/probabilistic techniques, while also showing the capability to effectively tackle the challenges of reliability, security, and energy consumption. The paper reviewed the opportunities of TRPs and showed that they are a natural fit to fulfill the requirements of autonomous critical systems. The paper concluded that disruptive ideas in the macro- and microarchitecture may be necessary to design future energy-secure autonomous systems.

IMPACT OF THE ESSA THEME WORKSHOPS
In this section, we will briefly examine the ongoing impact of the ESSA theme workshop

July/August 2019
31
Expert Opinion

series in the context of identifying new security-related vulnerabilities and devising mitigation solutions thereof. So far, in terms of core security domain impact, the CLKSCREW work from Columbia7,8 is the leading example of new innovation and impact related to the ESSA theme. This work has been acknowledged by the Android Security Team (https://source.android.com/security/overview/acknowledgements): "Adrian Tang of Columbia University (CLKSCREW paper), CVE 2017-8252." Qualcomm also acknowledged and reported the fixing of bugs associated with the above-quoted CVE (common vulnerabilities and exposures) item.

Within IBM, an in-depth research study (led by Pradip Bose) was conducted, under the sponsorship of a small DARPA seedling grant (during 2011–2012), to assess the threat level imposed by maliciously launched power/thermal viruses. For commercial IBM high-end processor systems (e.g., POWER7 at that time), the study was not able to demonstrate any performance or functional degradation through user-level application software access alone. (Supervisory mode access to OS and/or power management firmware code is, of course, a different issue.) One of the saving factors, as assessed, was that the IBM POWER processors up to the POWER7 generation were not subject to performance throttling under even the highest power workloads conceivable. In other words, for such high-end server-class processors, the heat sink and cooling solution were over-designed to make sure that the worst-case applications would not result in exceeding the power limit that could be handled by the packaging-cum-cooling solution. Note that power or electromigration (EM)-based side channel attack vulnerabilities were not within the scope of this study; only physical damage and denial-of-service type attacks were under consideration at that time.

Figure 3. Peak temperature driven throttle points and performance effects for Intel Pentium IV (P4) and Pentium M (PM) class processors. Experiments done using the gcc workload within SPEC95 and different cooling solutions: 100 gallons per hour (gph) is the highest fluidic cooling rate used, resembling the commercial P4 package; 60 gph is a reduced cooling rate to represent a cheaper packaging solution. (Experimental data: courtesy Hendrik Hamann, IBM Research.)

One should note, however, that well before engaging in this POWER7-based study, the IBM researchers had studied the performance degradation characteristics of Intel's Pentium IV series processors. Figure 3 shows the experimental characterization data that compares the temperature-driven performance differences between Intel's Pentium IV (abbreviated here as P4) and Pentium M (PM) class processors. In this laboratory experiment (which was conducted back in 2005), the packaging lid of the processor was taken out and replaced by a cooling solution provided by a controllable (special) heat-transparent fluid flow. The thermal imaging, measurement, and calibration were conducted using a special infrared camera setup.9 As Figure 3 shows, the "normal" execution time of the full gcc workload on a P4 using a commercial grade cooling solution is 80 s, whereas if the cooling mimics a low-cost packaging solution, the execution time degrades to about 140 s due to the throttling-based dynamic temperature management built into these processors. In contrast, the lower power PM processor runs much cooler, without incurring any throttling, and executes the same workload in about 90 s, even

with the low-cost packaging solution. The experiments demonstrated the possibility that throttling-based performance degradation (at a very significant level) could be instigated for a given processor-package system product by launching a high-power (possibly synthetic) virus workload.

The literature on side-channel attack mechanisms that exploit power and/or EM monitoring has advanced quite a lot since the inception of the first ESSA workshop in 2011. This is evidenced even from some of the technical and visionary talks presented at ESSA 2019. The threat imposed by sharing of a common power budget in a multicore chip setting has been described in work by Sasaki et al.,11 and the paper by Khatamifard et al. presented at ESSA 2019 shows the consequence of this threat model in explicit terms.

While, at the system scale, the ESSA-themed workshops have studied the potential security threats introduced by power management solutions, there has also been significant progress in recent years in understanding the energy-security tradeoff at the circuit level. In particular, recent research threads have emerged in designing energy-efficient security engines, as well as in exploring on-chip power-management circuits for security. A specific example of this new direction has been in the domain of designing low-overhead techniques for improving the power- and EM-based side-channel-attack resistance of encryption engines. In particular, collaborative work between Georgia Tech and Intel Labs has demonstrated a set of studies where on-chip integrated voltage regulators (IVRs) and adaptive clocking circuits, introduced mostly for power management, have been leveraged to improve SCA resistance. A paper presented by Kar et al. at the 2017 International Solid-State Circuits Conference (ISSCC) demonstrated the promise of using inductive IVRs for inhibiting power attacks on AES engines. A second article from the group, authored by Singh et al. and published in the IEEE JOURNAL OF SOLID-STATE CIRCUITS (JSSC) in 2019, showed that an inductive IVR coupled with fine-grain dynamic voltage scaling and adaptive clocking can inhibit power- and EM-based side channel attacks on AES engines. More recently, an article presented at ISSCC 2019 by Singh et al. showed that on-chip low-dropout regulators, coupled with adaptive clocking and fine-grain DVFS, provide security against power-/EM-based side-channel attacks on AES engines. There are many more examples in the recent literature where circuit-level studies are being performed and techniques are being developed to enable a bottom-up approach to energy-secure hardware designs. The advancements in circuit-level security research showed the need for engaging the circuit community within the ESSA theme, and ESSA 2019 took a positive step toward this goal. The success is evident from the talks presented at ESSA 2019 by Dr. De, Dr. Ghosh, and Dr. Seok, all of which have pointed out the need for circuit-level research in this domain.

FUTURE DIRECTIONS
In this section, we provide our view of the future directions of research and development within the ESSA theme. One of the early research agenda items at IBM that fell out of the ESSA theme was that of guarded power management.10 In this solution approach, the baseline power management architecture is protected through a guard mechanism. The latter is a higher level monitor-and-control system that observes the operation of the baseline architecture through specialized activity (performance) counters. Anomalies detected in the observed counter-based signatures can serve to trigger mitigation actions. These could include adapting the hardware parameters of the baseline mechanism on-the-fly. The above-referenced work was pursued with robust power management in mind. Future work must connect unreliable control loops explicitly to vulnerabilities from a mainstream security research viewpoint. In other words, the guarded power management principle must be tested as a mitigation technique against CLKSCREW-inspired attacks.

The thrust of research in support of power reduction in the wake of GPU-centric high-performance compute nodes has led to techniques like adaptive voltage guard-band management (e.g., J. Leng et al., MICRO 2015). In future accelerator-rich systems, the task of balancing power,


performance, and reliability will have to be managed using systematic hardware-software management systems. Purely software-based scheduling heuristics will need to be supported by hardware-based monitors. As we progress toward many-core processor chips, old-style on-chip power control architectures (with a single, centralized management unit) will give way to scalable, distributed control and management systems. An initial vision of so-called swarm power management architectures has been portrayed in recent invited papers (e.g., the one at DATE 2018).12

ACKNOWLEDGMENTS
The ESSA 2019 workshop,13 the proceedings (and anticipated impact) of which were summarized in this article, would not have been possible without the help of an active and supportive program committee. The valuable contributions of our web and publicity chair, David Trilla, are to be noted in particular. We are grateful to the organizing committee of the HOST 2019 symposium for their support.

REFERENCES
1. S. Gunther and R. Singhal, "Next generation Intel microarchitecture (Nehalem) family: Architectural insights and power management," presented at the Intel Developer Forum, San Francisco, CA, USA, Mar. 2008.
2. M. Floyd et al., "Introducing the energy management features of the POWER7 chip," IEEE Micro, vol. 31, no. 2, pp. 60–75, Mar./Apr. 2011.
3. T. Webel et al., "Robust power management in the IBM z13," IBM J. Res. Dev., vol. 59, no. 4/5, pp. 16:1–16:12, Jul./Sep. 2015.
4. S. Govindavajhala and A. W. Appel, "Using memory errors to attack a virtual machine," in Proc. IEEE Symp. Secur. Privacy, 2003, pp. 154–165.
5. C. Miller, "Battery firmware hacking," presented at Black Hat USA, Aug. 2011. [Online]. Available: https://www.blackhat.com/html/bh-us-11/bh-us-11-briefings.html#Miller
6. Z. Wu, M. Xie, and H. Wang, "Energy attack on server systems," in Proc. 5th USENIX Workshop Offensive Technol., 2011, p. 8.
7. A. Tang, S. Sethumadhavan, and S. Stolfo, "CLKSCREW: Exposing the perils of security-oblivious energy management," in Proc. USENIX Secur. Symp., 2017, pp. 1057–1074.
8. A. Tang, S. Sethumadhavan, and S. Stolfo, "Motivating security-aware energy management," IEEE Micro, vol. 38, no. 3, pp. 98–106, May/Jun. 2018.
9. H. Hamann, A. Weger, J. Lacey, Z. Hu, and P. Bose, "Hotspot-limited microprocessors: Direct temperature and power distribution measurements," IEEE J. Solid-State Circuits, vol. 42, no. 1, pp. 56–65, Jan. 2007.
10. N. Madan, A. Buyuktosunoglu, P. Bose, and M. Annavaram, "A case for guarded power gating for multi-core processors," in Proc. 17th Int. Symp. High Perform. Comput. Arch. (HPCA), Feb. 2011, pp. 291–300.
11. H. Sasaki, A. Buyuktosunoglu, A. Vega, and P. Bose, "Mitigating power contention: A scheduling based approach," IEEE Comput. Archit. Lett., vol. 16, no. 1, pp. 60–63, 2017.
12. A. Vega, A. Buyuktosunoglu, and P. Bose, "Energy-secure swarm power management," in Proc. Design Autom. Test Eur. (DATE), 2018, pp. 1652–1657.
13. ESSA 2019 Workshop. [Online]. Available: https://www.essa-workshop.org/
14. P. Bose et al., "Power management of multi-core chips: Challenges and pitfalls," in Proc. Design Autom. Test Eur. (DATE), 2012, pp. 977–982.
15. J. Kim et al., "D-RaNGe: Using commodity DRAM devices to generate true random numbers with low latency and high throughput," in Proc. Int. Symp. High Perform. Comput. Arch. (HPCA), Feb. 2019.
16. J. Kim et al., "The DRAM latency PUF: Quickly evaluating physical unclonable functions by exploiting the latency-reliability tradeoff in modern commodity DRAM devices," in Proc. Int. Symp. High Perform. Comput. Arch. (HPCA), Feb. 2018.
17. L. Orosa et al., "Dataplant: In-DRAM security mechanisms for low-cost devices," 2019, arXiv:1902.07344. [Online]. Available: https://arxiv.org/abs/1902.07344

Pradip Bose is with the IBM T. J. Watson Research Center, New York. Contact him at pbose@us.ibm.com.

Saibal Mukhopadhyay is with the Georgia Institute of Technology. Contact him at saibal@ece.gatech.edu.


FinalFilter: Asserting Security Properties of a Processor at Runtime

Cynthia Sturton, University of North Carolina at Chapel Hill
Samuel T. King, University of California, Davis
Matthew Hicks, Virginia Tech
Jonathan M. Smith, University of Pennsylvania and DARPA

Digital Object Identifier 10.1109/MM.2019.2921509
Date of current version 23 July 2019.
0272-1732 © 2019 IEEE. Published by the IEEE Computer Society.

IN AN IDEAL world, it would be possible to build a provably correct and secure processor. However, the complexity of today's processors puts this ideal out of reach. The complete verification of a modern processor remains intractable. Statically verifying even a simple security property—for example, "hardware privilege escalation never occurs"—remains beyond the state of the art in formal verification.

Testing can complement formal verification methods, yet testing is incomplete and bugs in the hardware that leave it vulnerable continue to elude test suites. Further, a crafty malicious actor can evade typical testing coverage metrics. Recent efforts, including that of three of the authors, have explored the use of static analysis on the design files (e.g., hardware description level source code or gate-level netlists) to find suspicious circuitry.1–3 These techniques rely on heuristics to define patterns that indicate a likely trojan and then search for instances in the design that match the pattern. However, malicious circuitry that does not match the pattern will be missed, as will inadvertent bugs that open vulnerabilities. By the time the weakness is uncovered, the hardware is already in the end user's hands and vulnerable to attack.

In the absence of a full proof of correctness, what is needed is a final filter: a runtime verification technique that works—postdeployment—to detect and respond to security property violations as they occur during execution. In this article, we make the case for final filters using our tool, FinalFilter, as a case study.

FINALFILTER
Prior research, including our own, has shown that assertions hard-coded into the design can be a cheap and effective way to verify the correctness of any single execution run.4,5 Assertions can cover properties that would be intractable to prove statically for the current state of the art. The downside
is that, like all execution monitors, this approach cannot prove that the property can never be violated, only that if such a violation occurs the monitor will catch it. As such, a final filter is a verification approach that is complementary to, and should be used in conjunction with, existing testing and static verification methods.

We extend the basic idea of an assertion-based execution monitor to make it configurable so that the set of properties being monitored can be updated postdeployment to reflect new information about exploitable vulnerabilities in the design. FinalFilter is a reconfigurable, runtime verification system that monitors the state and events of the processor for invalid updates to privileged registers.

The mechanism of a final filter is simple and presents a small attack surface. Yet, making it configurable does add complexity. To minimize FinalFilter's cost to the system's trustworthiness, we formally verify the correctness properties of its component modules and of the composed system. Finally, we show how to verify key properties for individual configurations.

As a formally verified execution monitor, FinalFilter guarantees that any trace violating a given security property will be detected at the point of violation. This is independent of how the violation occurs or what the root cause is.

Figure 1. Processor design flow with FinalFilter: (a) Hardware description language implementation of the instruction set specification. (b) Vulnerability is accidentally or maliciously opened in the processor. (c) FinalFilter is added to the design as the last action,6 with taps directly on the outputs of ISA state storing elements. (d) FinalFilter dynamically verifies the properties encoded by trusted software. FinalFilter triggers existing repair/recovery approaches in the event of an invariant violation. FinalFilter continues to protect the repair/recovery software.

THREAT MODEL
The trusted computing base for FinalFilter includes our specification and verification process and tools, the fabrication process and tools, and the filter's current configuration.

Lifecycle Assumptions
Referring to Figure 1, we assume we are the last ones to touch the processor design. We rely on orthogonal techniques to ensure that FinalFilter is not tampered with in the supply chain, which includes fabrication of the processor and shipping to the end user.

Architectural Scope
FinalFilter protects privileged instruction set architecture (ISA)-level registers. FinalFilter does not detect side-channel attacks, as doing so requires knowledge of more than the current trace of execution. The focus of this paper is the integer core of the processor. Notably, we assume the memory hierarchy is correct.

Attacker Model
The attacker is free to take any action not precluded by our assumptions, either in hardware or in software. This includes an attacker capable of creating and exploiting a hardware defect. An example might be a defect that causes the processor to return from an exception without restoring the privilege level.

DESIGN
FinalFilter enforces properties over privileged ISA state and events necessary for the security of software running on the processor. An example property that we will return to is, "the processor transitions from user mode to supervisor mode if, and only if, there is an interrupt or exception." Any

processor that correctly implements the specification must satisfy this property. Proving this property statically requires a proof across all possible execution traces—currently an intractable task. Yet, as an execution monitor, FinalFilter can verify the property for every trace that is executed. Monitoring is done by a set of hardware-based assertions over architecturally visible states and events.

FinalFilter is designed to be used in conjunction with existing software-level recovery and repair tools. For example, BlueChip,1 a tool developed by three of the authors, can route execution around vulnerable circuitry. FinalFilter provides precise introspection points and can support a variety of repair and recovery approaches.

Three aspects of the design are worth noting.

1) FinalFilter is reconfigurable after deployment and can protect multiple security-critical properties concurrently.
2) FinalFilter's design is formally specified and its implementation proven correct.
3) Execution overhead is incurred only in the rare case that a processor violates one of the monitored security properties.

The key insight that allowed us to make the monitor both reconfigurable and able to handle multiple invariants concurrently is that many security properties can be implemented as a Boolean combination of more simple assertions, and these simple component assertions are usually in one of only a few forms. Users can specify a number of simple component assertions and combine them into one or more complex assertions that monitor hardware state.

Running Example
We use security invariants (or just invariants) to describe properties of the ISA that must be true of a secure implementation—that, if violated, would open an exploitable vulnerability. Invariants are dynamically verified by one or more assertions over architecturally visible state.

Consider the following component of the privilege escalation property mentioned before:

I0 := A change in processor mode from low privilege to high privilege is caused only by an exception or a reset.

Invariant I0 is a statement that the instruction set specification says must be true of the system at all points of execution. It can be written as a concrete assertion in terms of the ISA-level state in the following way:

A0 := assert(risingEdge(SR[SM]) → (NPC[31:12] = 0) ∧
             risingEdge(SR[SM]) → (NPC[7:0] = 0) ∨
             risingEdge(SR[SM]) → (reset = 1))

where SR[SM] represents the supervisor mode bit of the processor's status register, and an exception is indicated by the next program counter NPC pointing to an exception vector start address. The address will always be of the form 0x00000X00, where the "X" indicates a don't-care value. (This might seem as if it leaves the door open for a processor attack that escalates privilege while executing at an address that matches the form 0x00000X00, but it does not. Pages in that address range have supervisor permissions set, which implies that code executing in that address range is already in supervisor mode. If the processor attack attempts to allow user mode execution of supervisor mode pages, FinalFilter includes an invariant to detect such misbehavior.)

We break A0 into three component assertions.

Aa := assert(risingEdge(SR[SM]) → (NPC[31:12] = 0))
Ab := assert(risingEdge(SR[SM]) → (NPC[7:0] = 0))
Ac := assert(risingEdge(SR[SM]) → (reset = 1))

Each of these individual assertions is evaluated at each step of execution, and the results are appropriately combined to form a statement that is equivalent to A0.

Invariant Monitor
FinalFilter reads in ISA-level state and outputs a signal indicating whether any of the programmed invariants were violated. It works essentially as a programmable finite state machine. Configuration data programs the machine with which invariants to check, and ISA-level state acts as the input to the machine. The number of invariants it can monitor concurrently depends on the complexity of the associated component assertions and the number of assertion blocks built into the monitor.

Using our running example, we now describe each module in the configurable monitor, shown in its configured state in Figure 2. In our system, we
July/August 2019
37
Expert Opinion

refer to Aa, Ab, and Ac as component assertions, and A0 as simply an assertion. The difference is that A_number is the implementation of an invariant, a combination of component assertions, whereas A_letter represents a component assertion corresponding to one assertion block in the configurable monitor.

Figure 2. FinalFilter configured with assertion A0. Starting from the top of the figure, the components are: ISA-level state, Routing block, Logic blocks, Assert blocks, and Merge block. The Routing block sends ISA-level state elements to the Logic blocks; the Logic blocks condense multibit state and constant inputs down to a single-bit output that is sent to the Assert block; the Assert block compares the previous value of its inputs to the current value, outputting the result as a one-bit value to the Merge block; the Merge block combines the Assertion block results to form a higher level result that indicates if the programmed invariants still hold; this result is tied to the processor's exception generation logic.

Routing. The Routing block is responsible for feeding the desired ISA-level state to the Logic blocks. The configuration data determines which state element gets routed to which Logic block. To accommodate arbitrary outputs, each Routing block output is 32 bits wide, with zero padding as required. In our running example, SR[SM] is output to Logic blocks 0, 2, and 4, NPC is output to Logic blocks 1 and 3, and reset is output to Logic block 5, as shown in Figure 2.

Logic. Each Logic block implements a comparison operator. Given two inputs A and B, the configuration data can select one comparison operator from the set {=, ≠, ≤, <, ≥, >}. Additionally, the configuration data can choose to mask off some portion of A or B, or both, or it can substitute a constant value for the value in B. Returning to our running example, Logic block 1 will evaluate NPC & 0xfffff000 = 0 and output the result. Logic block 3 will evaluate NPC & 0x000000ff = 0 and output the result, and Logic block 5 will evaluate reset = 1 and output the result. Logic blocks 0, 2, and 4 will evaluate SR[SM] = 1 and output the result.

Assert. The Assert block implements component assertions of the form p → q, possibly across several clock cycles (e.g., if p is true, then three cycles later q is true). If it is ever the case that p is true while q is false, the assertion is triggered and the output of the Assert block will be high. In our example, each of Aa, Ab, and Ac is implemented in its own Assert block. The consequent q is always a combinational proposition over ISA state at a single step of execution: it is stateless and is given by the current value sent by the Logic block. However, the antecedent p can be stateful, possibly depending on previous values sent from the Logic block. For example, the individual assertions in our example all have the antecedent risingEdge(SR[SM]). This proposition is true at time t if and only if SR[SM] is low at time t − 1 and high at time t. The Logic block will output a signal that is high whenever SR[SM] is high, and the Assert block will determine when a rising edge of SR[SM] is seen. FinalFilter allows antecedents in one of three forms: p ∈ {True, ¬s_{t−1} ∧ s_t, s_{t−n}}. In other words, p can be defined as True, in which case the assertion will trigger whenever q is false; or p can be defined to be the rising edge of some ISA state s; or p can be defined to be the value of ISA state s at time t − n, where n is also configurable.

The Assert block uses four of the industry standard Open Verification Library assertions:

• always(expression): expression must always be true;
• edge(type, trigger, expression): expression must be true when the trigger goes from 0 to 1 (type = positive);
• next(trigger, expression, cycles): expression must be true cycles clock ticks after trigger goes from 0 to 1;
• delta(signal, min, max): when signal changes value, the difference must be between min and max, inclusive.
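The semantics of these four assertion forms can be mirrored as predicates over recorded signal traces. The following is an illustrative sketch only: the real OVL components are synthesizable Verilog monitors, and these Python analogues (with `edge` simplified to positive edges) merely model the behavior described above.

```python
# Trace-level analogues of the four OVL assertion forms used by the
# Assert block. A trace is a list of per-cycle values; these functions
# model the semantics only and are not the synthesizable monitors.

def always(expr):
    """always(expression): the expression holds on every cycle."""
    return all(expr)

def edge(trigger, expr):
    """edge(positive, trigger, expression): the expression holds on
    every cycle where trigger rises from 0 to 1."""
    return all(expr[t] for t in range(1, len(trigger))
               if trigger[t - 1] == 0 and trigger[t] == 1)

def nxt(trigger, expr, cycles):
    """next(trigger, expression, cycles): the expression holds `cycles`
    ticks after each rising edge of trigger (edges too close to the end
    of the trace are not checked in this sketch)."""
    return all(expr[t + cycles] for t in range(1, len(trigger) - cycles)
               if trigger[t - 1] == 0 and trigger[t] == 1)

def delta(signal, lo, hi):
    """delta(signal, min, max): whenever the signal changes value, the
    magnitude of the change is between min and max, inclusive."""
    return all(lo <= abs(b - a) <= hi
               for a, b in zip(signal, signal[1:]) if a != b)
```

For instance, `edge([0, 1, 1, 0, 1], [0, 1, 0, 0, 1])` checks the expression only on the two cycles where the trigger rises, which is exactly the gating the Assert block performs on its antecedent.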

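Putting the Logic and Assert pieces of the running example together, the whole invariant check can be modeled in a few lines of software. This is a behavioral sketch of the A0 check, not the hardware implementation: each architectural state is reduced to just the three fields the assertion reads.

```python
# Behavioral sketch of invariant A0 over two consecutive ISA-level
# states. Each state is reduced to the fields the assertion reads:
# SR_SM (supervisor-mode bit), NPC (next program counter), reset.

def rising_edge(prev, curr):
    """risingEdge(SR[SM]): low at time t-1 and high at time t."""
    return prev["SR_SM"] == 0 and curr["SR_SM"] == 1

def a0_holds(prev, curr):
    """A0 = (Aa and Ab) or Ac: a user-to-supervisor transition is legal
    only if NPC is an exception vector (form 0x00000X00) or reset is
    asserted."""
    if not rising_edge(prev, curr):
        return True          # antecedent false: assertion holds trivially
    a_a = (curr["NPC"] & 0xFFFFF000) == 0   # Logic block 1: NPC[31:12] = 0
    a_b = (curr["NPC"] & 0x000000FF) == 0   # Logic block 3: NPC[7:0] = 0
    a_c = curr["reset"] == 1                # Logic block 5: reset = 1
    return (a_a and a_b) or a_c

user_mode = {"SR_SM": 0, "NPC": 0x00000000, "reset": 0}
trap      = {"SR_SM": 1, "NPC": 0x00000700, "reset": 0}  # exception vector
escalate  = {"SR_SM": 1, "NPC": 0xDEADBEEF, "reset": 0}  # illegal escalation
```

Here `a0_holds(user_mode, trap)` accepts the transition into an exception vector, while `a0_holds(user_mode, escalate)` flags the hardware privilege escalation that the monitor exists to catch.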
Merge. The Merge block takes the outputs from the Assert blocks and combines them as prescribed by the configuration data. It can be viewed as a configurable truth table. The inputs to the truth table are the Assert block outputs—the component assertions Aa, Ab, and Ac in our running example. The function defining how the component assertions combine (i.e., the out function) is configurable at run time. The truth table is implemented as a hierarchy of look-up tables. For example, with 16 Assert blocks, rather than a single lookup table with 2^16 rows, the monitor would have four lookup tables with six inputs (2^6 rows) each. The outputs of the three first-level lookup tables make up the input to a second-level lookup table, the output of which is the output of the Merge block.

We can now complete our running example. Let erra be the output of the Assert block for Aa, and let errb and errc be the outputs of the Assert blocks for Ab and Ac, respectively. Remembering that the output of each Assert block will be high when the assert triggers, i.e., when the invariant is violated, we combine the results of the component assertions in the following way:

err0 = (erra | errb) & errc.

As desired, err0 will be high whenever A0 is false, i.e., whenever the A0 assertion is triggered.

Configuration Data. The configuration data are provided by trusted software (e.g., the system BIOS) at initialization (originally, we imagine configuration coming from processor or motherboard manufacturers). It is the mechanism by which FinalFilter is configured, and portions of the configuration data are fed into each block at the appropriate stage.

VERIFICATION
We used the commercial model checking tool Cadence SMV for the verification of the configurable assertion fabric. For each component of FinalFilter shown in Figure 2, we formally specified its behavior and verified that the implementation meets the specification.

In most cases, formally specifying a component's behavior involved little more than extracting the information from the design documents. However, in two cases, the process of formalizing the specification brought out ambiguities in the design, and it was necessary to revisit the design phase of the process. During the course of verification, we found one implementation error: a logical AND was used where an OR was needed.

Ultimately, the monitor's behavior is determined by the configuration data, and it is up to the processor or motherboard manufacturer to provide a correct configuration. A misconfigured fabric could fail to provide the intended protections. We guard against misconfigurations in three ways.

First, we protect against invalid configurations that would result in unpredictable results. Built into the design of each block is a check that the incoming configuration data are well formed. We verify that if any of the individual components report an invalid configuration, then FinalFilter will not fire any assertion failures. This behavior represents a tradeoff in the design space. On the one hand, an accidentally misconfigured fabric, which will never trigger an assertion, is not protecting the user. On the other hand, never firing in the presence of misconfigured data has the benefit of being a stable behavior—it is what exists today. An alternative is to always fire when the fabric is misconfigured, but this would give an attacker an avenue for launching a denial-of-service attack, making FinalFilter a new avenue of attack, something we wish to avoid.

Second, we built a software tool to generate the configuration data from higher level assertion statements. Although only prototypical, we hope that further developing this tool will make generating correct configuration data relatively easy for the user.

Third, we built a validation tool to prove properties about individual configurations. We prove the following sanity checks on the configuration data:

• There are assertions configured.
• None of the assertions are unsatisfiable (e.g., the following does not occur: {True → q ∧ ¬q}).
• The configured assertions, as a whole, are satisfiable (e.g., the following does not occur: {p → q, p → ¬q}).

July/August 2019
39
Expert Opinion

• Assertions are not trivially violated (e.g., the following does not occur: {p → ¬p}).

If any of these checks fail, a misconfiguration error is reported along with information about the offending assertion(s). The user can run this tool before loading the configuration data into FinalFilter. We used the z3 SMT solver as the back end to this tool.

We note that while we formally verify the functional correctness of each module in the filter, we manually audit the connection between modules. That is, we manually check that every module's output signals are appropriately tied to the next module's input signals. There is no logic involved in the composition, and our naming convention made the checks straightforward. Our end-to-end verification of the invalid configuration signals, mentioned above, does not rely on this manual audit.

EVALUATION
To evaluate the performance and efficacy of FinalFilter, we implement it inside the OR1200 processor. The OR1200 is an open source, 32-bit RISC processor with a five-stage pipeline, separate data and instruction caches, and MMU support for virtual memory. It is popular as a research prototype and has been used in industry as well7; it is representative of what you would see in a mid-range phone today.

We wrote a program that automatically generates the FinalFilter hardware for a given number of Assert blocks to support. Generating the hardware programmatically makes it easy to explore the effect of tuning different parameters, and creates a regular naming and connection pattern that allows us to verify the structural connections of arbitrary filters using an induction-type approach.

For a complete system capable of booting Linux, we implemented the processor and filter combination as the heart of a system-on-chip that includes DDR2 memory, an Ethernet controller, and a UART controller. We implemented the system-on-chip on the FPGA that comes with the Xilinx XUP-V5 development board. We conservatively clock the system at 50 MHz.

Figure 3. Hardware overhead with respect to the number of assertions supported by the configurable assertion fabric, evaluated at four optimization levels. The range in the number of assertions represents the range in protection required by the processors in our analyzed set from AMD. The vertical line represents the average number of assertions required to protect the processors in our analyzed set. As a reference point, previous work on deployed-bug patching entails hardware overheads of up to 200% and run time overheads of up to 100% in the common case.

Hardware Area Overhead
Figure 3 shows how the hardware area overhead changes as the number of assertions supported by FinalFilter increases. We built filters with support for as few as 1 assertion to as many as 17 assertions (the number required to protect all AMD processors we analyzed in our previous study on security-critical processor bugs5).

The figure contains data at four points in the fabric design space:

1) None. No optimization; this favors expressibility over overhead.
2) One State. This optimization uses Logic blocks with only one state input. Logic blocks were the biggest contributor to the area of the fabric, and 83% of our security-critical invariants used only one input to the Logic block. This also reduces the number of required Routing blocks by 50%.
3) Top six. This optimization replaces the Routing blocks with new Routing blocks capable of handling the six most frequently used state elements. We observe that 76% of invariants require the same six ISA-level state items.
4) Both. This includes the two previous optimizations.

USING FINALFILTER
Using FinalFilter requires having a meaningful set of properties to monitor. In prior work, we took a manual approach to develop a set of security critical properties.5 We studied errata documents to learn what types of exploitable errors can occur, and we studied the architecture's specification documents to develop a set of properties necessary—though not sufficient—to protect security critical state of the processor.

In subsequent work, one of the authors has developed a semiautomated method for learning new security properties using information gleaned from known exploitable bugs8; and demonstrated that properties developed for one RISC processor may be suitable for use, after some translation, on a second RISC processor, even across architectures.9 However, the development of security-critical properties for use with FinalFilter or any property-based verification method is still in its infancy and more research is needed.

Case Study
We configured FinalFilter with 18 assertions we found to be critical to security in our prior work.5 We then introduced into the processor 14 vulnerabilities from a mix of previously published hardware attacks and attacks based on exploitable vulnerabilities from several years of AMD processor errata. For each one, we wrote a user-space program that exploits the vulnerability and reports if the attack was successful. FinalFilter is expressive enough to implement all 18 invariants, and the configured filter detects all of the attacks.

PRIOR WORK IN DYNAMIC VERIFICATION
FinalFilter builds on a line of research that uses dynamic verification to catch and patch functional bugs postdeployment. For example, DIVA10 is a simplified checker core that verifies the computation results of the full-featured core before the processor commits the results to the ISA level. Narayanasamy et al.11 use instruction rewriting routines to avoid triggering a bug that is found postdeployment.

In this article, we have not addressed the problem of measuring coverage. Boulé et al.12 add circuitry to assertions to track and measure coverage. The question of what is a meaningful coverage metric for a set of security properties is an open one, but it is critical: such a measure can give an indication of the number of "unknown unknowns" that remain unprotected.

CONCLUSION
Design-time verification alone is insufficient; some exploitable vulnerabilities will make it through. FinalFilter, a last line of defense—one that can be formally verified—protects security critical properties of the processor core. We believe the idea is broadly applicable and in future work will be exploring the use of a final filter for commercial architectures and for modules outside the processor core.

ACKNOWLEDGMENT
The authors would like to thank the editors for their insightful comments and suggestions, and S. Bellovin for his advice and the phrase "final filter."

REFERENCES
1. M. Hicks, M. Finnicum, S. T. King, M. M. K. Martin, and J. M. Smith, "Overcoming an untrusted computing base: Detecting and removing malicious hardware automatically," in Proc. IEEE Symp. Secur. Privacy, 2010, pp. 159–172.
2. J. Zhang, F. Yuan, L. Wei, Z. Sun, and Q. Xu, "VeriTrust: Verification for hardware trust," in Proc. ACM Des. Autom. Conf., 2013, pp. 61:1–61:8.
3. A. Waksman, M. Suozzo, and S. Sethumadhavan, "FANCI: Identification of stealthy malicious logic using boolean functional analysis," in Proc. ACM Conf. Comput. Commun. Secur., 2013, pp. 697–708.
4. M. Bilzor, C. Irvine, T. Huffmire, and T. Levin, "Security checkers: Detecting processor malicious inclusions at runtime," in Proc. IEEE Hardware Oriented Secur. Trust, 2011, pp. 34–39.


5. M. Hicks, C. Sturton, S. T. King, and J. M. Smith, "SPECS: A lightweight runtime mechanism for protecting software from security-critical processor bugs," in Proc. ACM Conf. Architectural Support Programming Lang. Oper. Syst., 2015, pp. 517–529.
6. A. Waksman and S. Sethumadhavan, "Silencing hardware backdoors," in Proc. IEEE Symp. Secur. Privacy, 2011, pp. 49–63.
7. R. Rubenstein, "Open source MCU core steps in to power third generation chip," Jan. 2014. [Online]. Available: http://www.newelectronics.co.uk/electronics-technology/open-source-mcu-core-steps-in-to-power-third-generation-chip/59110/
8. R. Zhang, N. Stanley, C. Griggs, A. Chi, and C. Sturton, "Identifying security critical properties for the dynamic verification of a processor," in Proc. ACM Conf. Architectural Support Programming Lang. Oper. Syst., 2017, pp. 541–554.
9. R. Zhang, C. Deutschbein, P. Huang, and C. Sturton, "End-to-end automated exploit generation for diagnosing processor designs," in Proc. IEEE/ACM Symp. Microarchit., 2018, pp. 815–827.
10. T. M. Austin, "DIVA: A reliable substrate for deep submicron microarchitecture design," in Proc. ACM/IEEE Symp. Microarchit. (MICRO), Haifa, Israel, Nov. 1999, pp. 196–207. [Online]. Available: http://www.eecs.umich.edu/~taustin/papers/MICRO32-diva.pdf
11. S. Narayanasamy, B. Carneal, and B. Calder, "Patching processor design errors," in Proc. IEEE Int. Conf. Comput. Des., Oct. 2006, pp. 491–498. [Online]. Available: http://cseweb.ucsd.edu/~calder/papers/ICCD-06-HWPatch.pdf
12. M. Boulé, J. Chenard, and Z. Zilic, "Adding debug enhancements to assertion checkers for hardware emulation and silicon debug," in Proc. Int. Conf. Comput. Des., 2006, pp. 294–299.

Cynthia Sturton is an assistant professor and Peter Thacher Grauer Fellow at the University of North Carolina at Chapel Hill. She leads the Hardware Security @ UNC research group to investigate the use of static and dynamic analysis techniques to protect against vulnerable hardware designs. Her research is funded by several National Science Foundation awards, the Semiconductor Research Corporation, Intel, a Junior Faculty Development Award from the University of North Carolina, and a Google Faculty Research Award. She was recently awarded the Computer Science Departmental Teaching Award at the University of North Carolina. She has a BSE from Arizona State University and an MS and a PhD from the University of California, Berkeley. Contact her at csturton@cs.unc.edu.

Matthew Hicks is an assistant professor at Virginia Tech, working at the intersection of security, architecture, and embedded systems, with special emphasis on analog-domain hardware security. Contact him at mdhicks2@VT.edu.

Samuel T. King was a professor for eight years at the University of Illinois Urbana-Champaign. He then left his tenured position at UIUC to push himself intellectually and professionally in industry. He is currently with the Computer Science Department at the University of California, Davis. He is interested in building systems for fighting fraud and rethinking our notion of digital identity. He has a PhD from the University of Michigan, an MS from Stanford University, and a BS from UCLA. Contact him at kingst@ucdavis.edu.

Jonathan M. Smith is currently a program manager in the Information Innovation Office (I2O) at the Defense Advanced Research Projects Agency (DARPA), on leave from the University of Pennsylvania, where he holds the Olga and Alberico Pompa Professorship of Engineering and Applied Science and is a professor of computer and information science. He was previously a Member of Technical Staff at Bell Telephone Laboratories and Bell Communications Research, joining Penn in 1989 after receiving his PhD from Columbia University. He previously served as a Program Manager at DARPA in 2004–2006, and was awarded the Office of the Secretary of Defense Medal for Exceptional Public Service in 2006. He became an IEEE Fellow in 2001. Contact him at jms@cis.upenn.edu.

General Interest

RASSA: Resistive
Prealignment Accelerator for
Approximate DNA Long
Read Mapping
Roman Kaplan, Leonid Yavits, and
Ran Ginosar
Technion—Israel Institute of Technology

Abstract—DNA read mapping is a computationally expensive bioinformatics task, required for genome assembly and consensus polishing. It requires finding the best-fitting location for each DNA read on a long reference sequence. A novel resistive approximate similarity search accelerator (RASSA) exploits charge distribution and parallel in-memory processing to reflect a mismatch count between DNA sequences. RASSA implementation of DNA long-read prealignment outperforms the state-of-the-art solution, minimap2, by 16–77× with comparable accuracy and provides two orders of magnitude higher throughput than GateKeeper, a short-read prealignment hardware architecture implemented in FPGA.

Digital Object Identifier 10.1109/MM.2018.2890253
Date of publication 28 December 2018; date of current version 23 July 2019.
0272-1732 © 2018 IEEE. Published by the IEEE Computer Society.

CONSTRUCTING HUMAN DNA sequence in real time is paramount to development of precision medicine1 and on-site pathogen detection of disease outbreaks.2 Single-molecule, real-time sequencing from Pacific Biosciences3 (PacBio) and Oxford Nanopore Technologies4 (ONT) are new technologies that can produce long reads within minutes, potentially enabling real-time genomic analysis. However, long-read DNA sequencing poses new challenges. First, long reads contain many thousands of base pairs (bps). Second, long reads tend to exhibit about 15%–20% insertion, deletion (indel), and substitution errors.3,4

To construct a complete host sequence, in case a reference sequence exists (from a previously sequenced organism), long reads are mapped to high-similarity locations of the reference sequence. Determining the edit distance between every mapped read and the
reference sequence requires a computationally intensive local alignment procedure (e.g., Smith–Waterman).4 Its computational time complexity is typically O(nm) for two sequences with lengths n and m. Reference sequences vary from several millions to billions of bps. It is therefore computationally prohibitive to perform optimal alignment of every long read with the entire reference sequence.

Read mappers (e.g., minimap6 and minimap27) find regions of high similarity (mappings) between reads or between a read and a reference sequence, followed by an alignment step to determine the exact edit distance and verify that the mapping is correct. In case a prealignment algorithm identifies a specific region in the reference suitable for mapping, the alignment can be performed only on that region, reducing alignment's duration and resource requirements.8 Therefore, read mapping can be viewed as a two-step process: prealignment filtering and accurate alignment verification. The prealignment step reduces the problem size for aligners by narrowing the regions to ones with potentially high-scoring alignment.

Existing prealignment hardware solutions9,10 target short reads (up to several hundred bps), which contain a small number of indel and substitution errors (less than 5%) and have a different error profile than that of PacBio or ONT long reads.3,4 A high edit distance threshold is required for mapping long but error-prone reads.9 However, current solutions have high false positive rates when the edit distance is high (i.e., greater than 15). Thus, the current solutions for short reads are not applicable for long reads.

Approximate computing techniques are known to trade accuracy for speed or energy efficiency. In case of long reads, multiple errors are a natural part of the sequencing output. Therefore, DNA long-read prealignment filtering inherently tolerates the imprecision.

With the end of Dennard scaling and the slowdown of Moore's law, novel hardware solutions for data-intensive problems are researched. Emerging technologies such as resistive memories enable new architectures with better performance and energy efficiency. Resistive approximate Hamming distance solutions exist.11 However, these do not provide the parallelism required to support a high-throughput application, such as DNA read mapping.

In this work, we present RASSA, a resistive approximate similarity search accelerator architecture for DNA long-read prealignment filtering. RASSA is a massively parallel in-memory processor, facilitating simultaneous comparison of a long read with a reference sequence. The outputs of RASSA are locations on the reference sequence where alignment may result in a high score. The key performance breakthrough of RASSA is achieved by applying the similarity search in parallel to the entire reference. While the complexity of alignment is O(mn), RASSA employs in-memory parallel computing on O(n) memory cells to reduce the computation time to O(m), where m and n are read and reference lengths, respectively.

RASSA employs resistive elements, memristors, serving at the same time as single-bit storage elements and comparators. Additional evaluation transistors translate mismatch scores into voltage levels, which are converted into digital values using analog-to-digital converters (ADCs). Further processing determines the most likely overlap candidates.

This paper makes the following contributions.

1) RASSA, an in-memory processing resistive approximate similarity search accelerator, is introduced. The parallel processing architecture is presented bottom-up, from the memristor-based bitcell to base pair encoding and up to a complete RASSA system.
2) RASSA-based implementation of long read prealignment filtering is developed.
3) Evaluation of RASSA's prealignment filtering accuracy and comparative analysis of its execution time and throughput is conducted.

BACKGROUND
The following two sections provide concise background on the problem: DNA read mapping and the memristor device technology.
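Before turning to the background material, the computation RASSA accelerates can be sketched in software for intuition: slide a read chunk along the reference, count mismatching bps at each offset, and keep offsets whose mismatch rate falls well below the roughly 75% expected when unrelated random DNA sections are compared. The function name and the toy threshold below are illustrative, not taken from the paper.

```python
def prealign(reference, chunk, threshold=0.5):
    """Report (offset, mismatch_count) pairs where the chunk's
    mismatch rate against the reference drops below the threshold."""
    m = len(chunk)
    candidates = []
    for off in range(len(reference) - m + 1):
        # Hamming-style comparison at this overlap position.
        mismatches = sum(r != c
                         for r, c in zip(reference[off:off + m], chunk))
        if mismatches / m <= threshold:
            candidates.append((off, mismatches))
    return candidates
```

RASSA evaluates every offset of this loop simultaneously across its memory rows rather than serially, which is where the parallel speedup over software comes from.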


DNA Read Mapping
DNA sequencers output fragmented regions of DNA called reads. The reads originate from random locations in the genome and may be overlapping. If a DNA sequence from the same species exists, the DNA reads are matched to such existing (reference) sequence [see Figure 1(a)]. The main assumption behind this approach is that the reference and the reads originate from the same species and, therefore, contain a small number of differences (typically less than 1%). Since the main computational effort in this process is placing the reads correctly on the reference sequence, it is called read mapping.

Figure 1. (a) Mapping of long DNA reads onto existing reference sequence. Red colored bps represent mismatching bps between the reference and reads. (b) Single RASSA bitcell. (c) and (d) Example of two DNA bps comparison. One bp matches the compared pattern, preventing match line charge loss (c). The next bp mismatches, causing match line voltage reduction (d).

One common read mapping technique is individual mapping of fixed-length segments (seeds) of a read.6,7 Each seed is used as a key in a precalculated hash table, whose values are the seed locations in the reference sequence. Such locations are then extended and a precise alignment is performed. Once all reads are placed in their positions, the new sequence is constructed from the overlapping regions of the reads. A disadvantage of this approach is its high memory requirements, long running time, and complexity.

Resistive Memories
Resistive memories store information by modulating the resistance of nanoscale storage elements, called memristors. They are nonvolatile, free of leakage power, and emerge as potential alternatives to charge-based memories,12 including NAND flash. Memristors are two-terminal devices, where the resistance of the device is changed by the electrical current or voltage. The resistance of the memristor is bounded by a minimum resistance RON (low resistive state) and a maximum resistance ROFF (high resistive state).

RASSA: RESISTIVE APPROXIMATE SIMILARITY SEARCH ACCELERATOR
Analyzing long-read alignments to a reference sequence reveals long fragments of indel-free high-similarity regions [as in Figure 1(a)]. These regions usually contain tens of bps with few substitutions, and can sometimes reach hundreds of bps (for example, in case of high-accuracy CCS reads).3 This has motivated us to use simple Hamming distance as a heuristic to find the overlap positions of long reads against a reference sequence. To overcome indels and find high-similarity sections, all possible overlap positions of a read against a reference are examined. Each bp has four values, therefore the probability of a mismatch when two random bps are compared is 3/4. Comparing two random sections of equal length from DNA sequences leads, on average, to 75% mismatching bps. However, when high-similarity fragments are compared, the running Hamming distance may drop significantly below the 75% average, thus indicating a potential mapping location.

RASSA is a resistive memory based massively parallel processing-in-memory accelerator. It allows storing (typically, a data element per
memory row) and in situ processing of large data sets. RASSA enables comparing a key pattern with the entire data set in parallel. Each number of mismatches (of the key pattern versus the data element in each memory row) causes a specific voltage drop, allowing the number of mismatching locations (called a mismatch score) to be quantified. The mismatch score is compared with a predefined threshold value to detect the locations that have the desired degree of similarity with the compared pattern, indicating a viable mapping location. The following sections describe RASSA functionality, encoding of DNA bps, RASSA system architecture, and hardware evaluation.

DNA Base Pair Encoding and Mismatch Evaluation
Figure 1(b) presents the RASSA bitcell, containing two transistors and one memristor (2T1R). Each memristor serves as a single-bit storage element and a single-bit comparator, enabled by the selector transistor.

A compare operation consists of two phases: precharge and evaluation. During precharge, the match line is precharged to a certain voltage level. At the same time, the evaluation transistor in each bitcell is on to discharge the evaluation point (created by the diffusion capacitances of the selector and the evaluation transistors).

During the evaluation phase, if the selector transistor is on, a low memristor resistance (RON) allows charge to pass from the match line to the evaluation point. The charge distribution causes the match line voltage to drop. Sensing the voltage of the match line compared with a reference voltage (of zero mismatch) allows quantifying the number of mismatches, producing the mismatch score.

Encoding. RASSA reserves four bitcells to store a DNA bp. There are four nucleotide bases: A, C, G, and T. Each DNA bp is encoded using one-hot encoding as "1000," "0100," "0010," and "0001," respectively. While it is possible to encode the four nucleotide bases using two bits (for example, "00," "01," "10," and "11" for A, C, G, and T, respectively), such encoding would result in a different number of mismatching bits depending on the specific pair (for example, two mismatching bits in the case of an A–T or C–G mismatch, or one mismatching bit in the case of an A–C or A–G mismatch), leading to ambiguous results. Since a mismatch is signaled by reduced match line voltage (caused by charge redistribution), a match should block charge flow. One-hot encoding assures that at most one mismatch may happen in each group of four bitcells. For instance, in 60 bitcells, at most 15 mismatches may be observed. Therefore, in this work, a memristor in the high resistive state (ROFF) is considered logic "1," while RON is considered logic "0."

Mismatch Evaluation. During a compare operation, the compared (key) pattern is applied to the gates of the selector transistors of all bitcells. If certain groups of bitcells need to be ignored (masked out) during comparison, zero is applied to the gates of the selector transistors of such bitcells. Figure 1(c) shows a stored "A" nucleotide symbol and a compare pattern of "A." The comparison results in a match, so there is no charge redistribution path (through an ROFF memristor). Figure 1(d) shows a mismatch, where the stored pattern is "G" and the key pattern is "T." The mismatch results in charge redistribution through an RON memristor, causing a match line voltage drop. Figure 2(a) shows all possible match line voltage levels during the evaluation phase for mismatch scores of 0 through 15. The match line is sensed by an ADC (see the System Architecture section). The timing of such sensing, in addition to the per-cell transistor capacitance variations, may lead to inaccuracies in the mismatch score. For example, for a match line shared by up to 60 bitcells, the mismatch score error could be ±1. If the number of bitcells sharing the match line is more than 120, the mismatch score error could reach ±3.

System Architecture
The main component of RASSA is the 2T1R array, divided into Word Rows [see Figure 2(c)], further divided into Sub-Words [see Figure 2(b)]. All Word Rows are connected in parallel to the Key Pattern register [see Figure 2(d)]. The ADC is the largest and most energy-consuming component of a Sub-Word. Therefore, in order to use only a 4-bit ADC, supporting mismatch scores of 0 through 15, the Sub-Word is limited to 60 bitcells. There are 16 Sub-Words within a Word Row,

Figure 2. (a) Match line voltage levels for each mismatch score between zero (top curve) and 15 (bottom curve) mismatches. Every voltage level at the sampling point is converted by the ADC. (b)–(d) Bottom-up block diagram of RASSA. (b) Single Sub-Word, composed of 60 bitcells in a NOR-like structure. (c) Single Word Row containing 16 Sub-Words, capable of holding 240 DNA bps. (d) Complete RASSA diagram containing N Word Rows (N = 2^17). (e) Accumulating mismatch scores in the analog domain (see the Conclusions and Future Research Directions section).

amounting to 960 bitcells per word, designed for storing and comparing up to 240 DNA bps per cycle. In each compare operation, a compare pattern is applied to all active bitcell bit lines.

The match line voltage of each Sub-Word is sampled by the ADC and converted into a 4-bit mismatch score [right side of Figure 2(b)]. The ADC reference voltage and voltage level differences are set according to the match line values for each mismatch score, as demonstrated in Figure 2(a). The 16 Sub-Word ADC outputs are summed up to produce the mismatch score for the entire Word Row [see Figure 2(b)]. All such scores are then compared with a threshold value, in parallel, to indicate the Word Rows (corresponding to sequence locations) with the desired degree of similarity.

Timing, Power, and Area Breakdown
A Sub-Word circuit is designed, placed, and routed using the 28-nm CMOS High-k Metal Gate library from GlobalFoundries for transistor sizing, timing, and power analysis. We perform Spectre simulations for the FF and SS corners at 70 °C and nominal voltage. Timing analysis shows that an operational frequency of 1 GHz is possible. For a single Sub-Word, the precharge energy is 1.6 fJ, while the evaluation energy (ADC and control line switching) is 98.4 fJ. For a single Word, containing 16 Sub-Words, adders, and a threshold comparator, a single compare cycle energy is 1791 fJ.

We have manually laid out a RASSA bitcell. The total Word Row area in 28-nm technology, including the bitcells, ADC, adders, and comparator, is 1598 μm². Bitcell transistors occupy 4%, adders and the threshold comparator occupy 28%, and the ADC occupies 68% of the Word Row area. This allows placing 131 k (2^17) 960-bit (240-bp) Word Rows, storing 31.5 M DNA bps, on a single 209-mm² die. Its worst-case power consumption at 1 GHz is 235 W. Table 1 summarizes the RASSA system parameters.

Loading the reference sequence to RASSA is performed on each Word Row separately and requires two cycles per Word Row: one cycle to write all logic "0"s, and another cycle to write all logic "1"s. Given a reference sequence of length L, the number of cycles required for its loading equals 2 × ⌈L/240⌉.
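The one-hot encoding and the loading cost just described can be mirrored in a short software model. All names here are ours, not the authors' tooling; the mismatch-count rule follows the bitcell behavior described in the Encoding and Mismatch Evaluation sections.

```python
from math import ceil

# One-hot encoding of the four nucleotide bases, as in the paper.
ONE_HOT = {"A": "1000", "C": "0100", "G": "0010", "T": "0001"}

def encode(seq):
    """Encode a DNA string into the bitcell pattern of a Word Row."""
    return "".join(ONE_HOT[bp] for bp in seq)

def mismatch_score(stored_bits, key_bits):
    # A cell discharges the match line when the key applies a '1' to a
    # selector whose memristor stores '0' (RON); with one-hot groups
    # this fires exactly once per differing bp, so 15 bps per
    # 60-bitcell Sub-Word keep the score within a 4-bit ADC range.
    return sum(k == "1" and s == "0"
               for s, k in zip(stored_bits, key_bits))

def load_cycles(L, row_bps=240):
    # Each 240-bp Word Row is written in two cycles (all "0"s, then
    # all "1"s): 2 * ceil(L / 240) cycles for an L-bp reference.
    return 2 * ceil(L / row_bps)
```

For example, comparing a stored "G" against a key "T" contributes one to the score, while "A" against "A" contributes nothing, matching the charge-redistribution behavior of Figure 1(c) and (d).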

Table 1. RASSA system parameters for the 28-nm node process.

Parameter                 Value
DNA bps per row (bits)    240 (960)
Words per IC              131 k (2^17)
Memory size (DNA bps)     31.5 M
Frequency                 1 GHz
Single IC power           235 W
Single IC area            209 mm^2

DNA READ PREALIGNMENT FILTERING WITH RASSA

A single compare operation in RASSA finds the mismatch score between the key pattern and the contents of each Word Row. The reference sequence is stored in RASSA (contiguously, one 240-bp fragment per Word Row). A fixed-size chunk (e.g., 200 bps) of the read is fed in as a key pattern. The mismatch score approximates the correlation between the read chunk and the reference sequence. A long read contains multiple chunks; therefore, the compare operations are performed multiple times, in all possible positions of a read chunk vis-à-vis a Word Row reference fragment, sometimes involving two neighboring Word Rows.

The number of Word Rows in RASSA defines the number of overlap positions examined simultaneously. In a single cycle, ⌈n/240⌉ (where n is the reference sequence length) distinct positions on the reference sequence are examined simultaneously. To cover all possible positions, the read chunk is shifted by one bp, and the compare is repeated 240 times (resembling the concept of correlation).

Figure 3(a)–(d) illustrates the comparison of a read chunk against a reference sequence in RASSA for several cases. In these examples, a chunk length of 200 bp is used [see Figure 3(a)]. A multicycle compare operation matches a 200-bp chunk against all its possible locations vis-à-vis the reference sequence. In the first compare cycle [see Figure 3(b)], the first chunk of the read is compared with the first 200 bps of each RASSA Word Row.

Following the completion of 41 (= 240 − 200 + 1) cycles, the 200-bp chunk is compared against reference data residing in two Word Rows. Such a two-Word Row compare requires two cycles. The even-cycle mismatch score [see Figure 3(c)] is added to the score of the following odd cycle [of the Word Row below, Figure 3(d)] and compared with the threshold [see Figure 3(d), right]. Before every even cycle, the compare pattern is shifted by one bp to the right, shortening the even-cycle compare pattern and extending the pattern in the odd cycle by one bp [see Figure 3(c) and (d), right]. After 439 (= 41 + 199 × 2) cycles, a 200-bp chunk has been compared against all reference sequence positions. The compare operation repeats for the rest of the 200-bp read chunks.

Figure 3(e) presents the concept of edit detection in RASSA. For simplicity, the chunk length in this example is 30 bp. Three types of edits are shown on the left of the figure. On the right, the mismatch score is presented for all relative shifts of the chunk versus the reference.

The average mismatch score in any mismatching location is 75% (the probability of an individual bp mismatch is 0.75). In the exact match case (top of the figure), the mismatch score is 0. A substitution results in a very low mismatch score, easily detectable by RASSA. The second row of Figure 3(e) shows two substitution errors, leading to a mismatch score of 2/30 = 6.7%. The third row of Figure 3(e) shows an insertion error. The longest matching section is 18 bp, to the right of the insertion, which leads to a mismatch score of (30 − 18)/30 = 40%. With the appropriate threshold, such a scenario is detectable by RASSA and identified as a potential mapping.

Finally, in the fourth row of Figure 3(e), a deletion error is presented. Deletions are handled similarly to insertions. In this example, the longest matching section is also 18 bp. The mismatch score is (30 − 18)/30 = 40% as well, and it is likewise detectable by RASSA as a potential mapping. While a substitution results in a much lower mismatch score, RASSA is capable of detecting indels just as confidently, by setting the threshold accordingly.*

*In legitimate mismatching locations, the mismatch score is distributed binomially. The threshold is set such that the probability of the mismatch score falling below it is sufficiently low. For example, P(mismatch score < 50%) = 0.0008 for a 30-bp chunk. For a 200-bp chunk, a similar probability is reached at a threshold of 65%.
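The footnote's false-accept probabilities can be checked from the binomial model it describes: at a random, truly mismatching location, each bp mismatches with probability 0.75, so the chance that a chunk's mismatch score slips under the threshold follows from the binomial CDF. A sketch (the function name and framing are ours):

```python
from math import ceil, comb

def p_score_below(chunk_bp: int, threshold: float, p_mismatch: float = 0.75) -> float:
    """P(mismatch score < threshold) at a random non-matching location,
    modeling the mismatch count as Binomial(chunk_bp, p_mismatch)."""
    max_mism = ceil(threshold * chunk_bp) - 1  # largest count below the threshold
    return sum(
        comb(chunk_bp, k) * p_mismatch**k * (1.0 - p_mismatch) ** (chunk_bp - k)
        for k in range(max_mism + 1)
    )
```

Here p_score_below(30, 0.50) evaluates to roughly 8 × 10^-4, matching the footnote's 0.0008, and p_score_below(200, 0.65) stays of the same order, which is why the threshold can be raised for longer chunks.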
General Interest
Figure 3. Illustration of a single long-read chunk examination in RASSA. (a) Long read is divided into chunks,
each 200 bp long. (b, left and right) First chunk is compared against the reference sequence in multiple locations
(simultaneously). (c) and (d) First chunk overlaps with reference sequence bps from two Word Rows. (c, left) First
part of the chunk compared with the last bps of the Word Row. (c, right) All Sub-Word mismatch scores are
summed up and stored (compare to threshold does not take place). (d, left) Second part of the chunk is
compared with the first bps of the next Word row. (d, right) All Sub-Word mismatch scores, including the previous
cycle result from the above Word Row, are summed up and compared with a threshold. Following this step, the
chunk is shifted right by one position (relative to the reference) and steps (c) and (d) are repeated. (e) Edit types
and the mismatch score found by RASSA. (f) Example of a 30-bp read chunk containing insertion, deletion and
substitution errors is compared against the reference sequence divided into 50-bp Word Rows. (g) Mismatch
score versus cycle number for the first 21 cycles of comparison of the example in (f). The minimal mismatch score is below the threshold and is achieved in the 9th cycle. The threshold is determined empirically per data set.
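The sliding search the caption walks through (shift the chunk by one bp, rescore the overlap, compare against the threshold) can be mimicked in plain software. This is a behavioral sketch of the filtering criterion only; the helper names and toy data are ours, and it does not model RASSA's parallel in-memory circuit:

```python
import random

def mismatch_score(chunk: str, window: str) -> float:
    """Fraction of mismatching bps between a chunk and an equal-length
    reference window; the quantity RASSA tests against its threshold."""
    return sum(a != b for a, b in zip(chunk, window)) / len(chunk)

def candidate_positions(reference: str, chunk: str, threshold: float) -> list[int]:
    """Reference offsets whose mismatch score falls below the threshold."""
    n = len(chunk)
    return [
        pos
        for pos in range(len(reference) - n + 1)
        if mismatch_score(chunk, reference[pos : pos + n]) < threshold
    ]

# Plant a 30-bp chunk (with one substitution) at offset 60 of a random reference.
random.seed(7)
reference = "".join(random.choice("ACGT") for _ in range(200))
chunk = reference[60:90]
chunk = chunk[:5] + ("A" if chunk[5] != "A" else "C") + chunk[6:]
```

At the planted offset the score is 1/30 ≈ 3.3%, far below a 50% threshold, while unrelated offsets hover near the 75% random-mismatch level, so candidate_positions(reference, chunk, 0.5) recovers the planted location.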
Figure 3(f) and (g) illustrates the mismatch score for the sliding-window search of RASSA in the presence of multiple errors. Figure 3(f) presents an example of a reference sequence and a read chunk containing a high-similarity region. All possible edit types (substitution, insertion, and deletion) exist in the chunk. Figure 3(g) illustrates the mismatch score as a function of the compare cycle (the relative read chunk position). During most cycles, the mismatch score is above the threshold. When the chunk is in its valid mapping location ["min mismatch position" at cycle 9 in Figure 3(f)], the mismatch score is significantly lower than the random average 75% level. Setting the custom threshold at, for instance, 50% allows efficient overlap of read chunks with a number of edits (substitutions as well as indels).

Translating the Output of RASSA to Mapping Locations

RASSA compares every chunk of every read against the entire reference sequence. The
probability of a false positive match is extremely low. Therefore, we assume that every compare that results in a mismatch score below the predefined threshold indicates a valid mapping location for the entire read. The output of RASSA is a bit vector, one bit per Word Row. The index of the Word Row, together with the iteration number and relative position of the chunk within the read, provides an exact coordinate of a potential mapping location. In most examined cases, a single read has a single mapping location indicated by a single compare from a single chunk.

In some other cases, a single chunk or multiple chunks produce multiple potential mapping locations. In such cases, the distance between consecutive locations is examined, starting from the lowest coordinate. If the distance between two consecutive locations is smaller than the read length, the location associated with the higher coordinate is discarded. Otherwise, both locations are kept for further processing. With this selection heuristic, nearby potential mapping locations from a single chunk or multiple chunks are combined, while distant locations are treated as separate mapping locations.

The mapping locations identified by RASSA can further be verified by alignment (e.g., the Smith–Waterman algorithm)4 and used by assembly or error correction programs.8 Unmapped reads can either be discarded (in case of a high sequencing coverage) or be mapped with a seed-and-extend mapper and then verified by an alignment algorithm.

EVALUATION

We compare RASSA with two existing solutions: minimap2,7 a state-of-the-art read mapping tool, and GateKeeper,9 a state-of-the-art short-read prealignment hardware accelerator.

Comparison With Minimap2

Our evaluation focuses on accuracy and speedup. Accuracy is measured by two criteria: 1) sensitivity: the percentage of correctly mapped reads; and 2) false positives: the percentage of incorrect mappings out of all mappings by RASSA.

Methodology. Minimap2 is run with the parameters "-x map-pb" and "-x map-ont," invoking its execution for overlap detection without the alignment step. With such parameters, minimap2 functions as a prealignment filter. The parameters also invoke the appropriate heuristics for PacBio and ONT reads, in addition to enabling multithreading and SIMD extensions. To evaluate RASSA's prealignment filtering accuracy, we use minimap2 as a golden reference. Speedup is calculated as the ratio of minimap2 execution time, without indexing, to RASSA execution time. The accuracy and speedup of RASSA were obtained using an in-house simulator. We assume that the reference sequence has already been loaded into RASSA prior to execution.

To find the number of incorrect output locations that might increase the total alignment time, we contrasted prealignment followed by alignment with alignment without prealignment. We used part of the E.coli PacBio dataset, consisting of about 1000 reads. Total prealignment by RASSA took 20 ms, and the following alignment needed to be applied to only a 70-kbp subset of the reference, taking minimap2 1490 ms. In contrast, the same alignment applied without the prealignment stage took 3000 ms, about twice the time. Therefore, we decided that reads with more than two output locations by RASSA will be discarded and treated as incorrectly mapped.

Data Sets. We use five publicly available datasets, three from PacBio and two from ONT, taken from two organisms: E.coli K-12 MG1655 and Saccharomyces cerevisiae W303 (yeast). Both reference sequences are available at the NCBI (https://www.ncbi.nlm.nih.gov/). PacBio data sets were taken from https://github.com/PacificBiosciences/DevNet/wiki/Datasets. Error rates,13 including the share of insertions, deletions, and mismatches (I, D, M), are also presented.

E.coli
- PacBio: 100 000 reads from one SMRT cell, 5245 bps on average. Error rate: 14.2% (I:41.7%, D:21.2%, M:37.1%)
- PacBio CCS: 260 000 high-quality CCS reads from 16 SMRT cells, 940 bps on average. Error rate: 1% (I:5%, D:19.5%, M:75.5%)
- ONT (from http://lab.loman.net/2016/07/30/nanopore-r9-data-release/): 165 000 R9 1D
Table 2. Sensitivity, fraction of exact mappings, and speedup of RASSA compared to minimap2.

                    Large Chunk (200 bps)                 Small Chunk (100 bps)
Data sets           Sensitivity  False pos.  Speedup      Sensitivity  False pos.  Speedup
E.coli PacBio       79.3%        13.4%       25×          83.2%        13.6%       16×
E.coli PacBio CCS   96.3%        8.9%        43×          96.2%        6.9%        24×
E.coli ONT          88.8%        10.5%       48×          87.6%        12.4%       31×
Yeast PacBio        69.8%        8.7%        77×          72%          11.8%       51×
Yeast ONT*          85.9%        34.9%       31×          85.1%        39.2%       49×

*minimap2 mapped only about 20% of all reads, with 50% of mappings having a quality score lower than 60 (a score of 60 indicates a high-confidence mapping).
reads, 9009 bps on average. Error rate: 20.2% (I:14.5%, D:37.2%, M:48.3%)

Yeast
- PacBio: 100 000 reads from one SMRT cell, 6294 bps on average. Error rate: 14% (I:5%, D:19.5%, M:75.5%)
- ONT (ERR789757 from NCBI): 30 000 R7.3 2D MinION reads, 11,337 bps on average. Error rate: 13.4% (I:23.3%, D:35.7%, M:41%)

Table 2 presents the accuracy results for all five data sets above. Chunk sizes of 200 and 100 bps and the corresponding thresholds were determined empirically, trading off accuracy and performance. Small changes of the threshold induce only marginal changes in accuracy. For most data sets, a 55% threshold was used on 200-bp chunks and 45% on 100-bp chunks; for the Yeast PacBio case, we used 45% and 40%, respectively.

Speedup. We compare RASSA execution time with that of minimap2, executed on a server with a 16-core 2-GHz Intel Xeon E5-2650 CPU and 64 GB of RAM. Table 2 shows that RASSA achieves a 16×–77× speedup over minimap2.† We note that the yeast dataset has fewer reads than E.coli, but a longer reference sequence (11.7 Mbp versus 4.6 Mbp), which might cause the longer execution time on minimap2. In contrast, RASSA is insensitive to the reference sequence length, and its execution time is determined by the length of a read chunk. RASSA produces output (on average, one mapping per read) at a rate of 50,000–500,000 reads/s, enabling multiple simultaneously executing instances of a typical alignment algorithm.

†The overlap detection stage of minimap2, without alignment.

Throughput Comparison With GateKeeper

We compare RASSA throughput (the number of examined mapping locations per second) with that of GateKeeper.9 GateKeeper was implemented in a Virtex-7 FPGA using a Xilinx VC709 board running at 250 MHz.

GateKeeper is designed to compare short reads with a reference sequence. Table 3 shows the throughput, in billions of examined mapping locations per second (BEML/s), of RASSA and GateKeeper on the two short-read data sets used in GateKeeper's evaluation9: 100-bp reads and 300-bp reads. The RASSA frequency is adjusted to that of GateKeeper. In addition, we show the average RASSA throughput for 200-bp reads, equivalent to the chunk lengths used in Table 2.

Table 3. RASSA and GateKeeper9 throughput (billions of examined mapping locations per second, BEML/s).

Read lengths    GateKeeper    RASSA @ 250 MHz
100 bp          1.7 BEML/s    226.8 BEML/s
200 bp          -             175.2 BEML/s
300 bp          0.2 BEML/s    142.8 BEML/s

RASSA outperforms GateKeeper by more than two orders of magnitude. When applied to short-read mapping prealignment, RASSA covers a read with one chunk. Consequently, RASSA takes 1–3 cycles (for the read lengths used in Table 3) to find the mismatch score in all Word Rows in parallel. GateKeeper, on the other hand, is reported to process up to 140 (20)
mapping locations of 100-bp (300-bp) reads in parallel, affected by the edit distance threshold.

CONCLUSIONS AND FUTURE RESEARCH DIRECTIONS

This paper presents RASSA, an in-memory-processing parallel architecture of a resistive approximate similarity search accelerator. We apply RASSA to the long-read DNA mapping problem. The length of reads, coupled with a low read quality, poses a challenge for existing mappers, optimized for high-quality short reads. The read mapping process is data and compute intensive, making it a target for acceleration. RASSA addresses the challenge by breaking long reads into short chunks and by applying full correlation. By allowing faster mapping on large data sets, we potentially make a step toward real-time pathogen or genome sequence completion.

We compared the RASSA accuracy and execution time with that of minimap2, a state-of-the-art mapping solution, on five long-read data sets taken from two organisms. Our evaluation shows that RASSA can outperform minimap2 by 16×–77×. In addition, we compared RASSA's throughput, measured in examined mapping locations per second, with that of GateKeeper, a state-of-the-art short-read prealignment hardware accelerator. We find that RASSA can outperform GateKeeper by more than two orders of magnitude.

This work can be extended in several ways. First, RASSA can be applied to read-to-read overlap finding, which requires finding overlaps between pairs of reads. Read-to-read overlap finding is an important first step in de novo genome assembly6 (constructing the host DNA sequence without a reference sequence), a problem more computationally challenging than read mapping. Second, a detailed design space exploration needs to be performed. For example, RASSA can further be optimized in terms of hardware cost: higher density can be achieved by sharing ADCs among multiple Sub-Words and by applying analog computations, as presented in Figure 2(e). RASSA mapping and resistive CAM alignment5 may be combined into a single high-performance in-memory mapper/aligner. Finally, thanks to its use of short chunks, RASSA can be effectively applied to short reads.

REFERENCES

1. J. L. Jameson and D. L. Longo, "Precision medicine—Personalized, problematic, and promising," Obstetrical Gynecological Survey, vol. 70, no. 10, pp. 612–614, 2015.
2. J. Quick et al., "Real-time, portable genome sequencing for Ebola surveillance," Nature, vol. 530, no. 7589, pp. 228–232, 2016.
3. A. Rhoads and K. F. Au, "PacBio sequencing and its applications," Genom., Proteomics Bioinform., vol. 13, no. 5, pp. 278–289, 2015.
4. T. Laver et al., "Assessing the performance of the Oxford Nanopore Technologies MinION," Biomol. Detection Quantification, vol. 3, pp. 1–8, 2015.
5. R. Kaplan, L. Yavits, R. Ginosar, and U. Weiser, "A resistive CAM processing-in-storage architecture for DNA sequence alignment," IEEE Micro, vol. 37, pp. 20–28, 2017.
6. H. Li, "Minimap and Miniasm: Fast mapping and de novo assembly for noisy long sequences," Bioinformatics, vol. 32, no. 14, pp. 2103–2110, 2016.
7. H. Li, "Minimap2: Pairwise alignment for nucleotide sequences," Bioinformatics, vol. 34, pp. 3094–3100, 2018.
8. K. Berlin, S. Koren, C. S. Chin, J. P. Drake, J. M. Landolin, and A. M. Phillippy, "Assembling large genomes with single-molecule sequencing and locality-sensitive hashing," Nature Biotechnol., vol. 33, no. 6, pp. 623–630, 2015.
9. M. Alser, H. Hassan, H. Xin, O. Ergin, O. Mutlu, and C. Alkan, "GateKeeper: A new hardware architecture for accelerating pre-alignment in DNA short read mapping," Bioinformatics, vol. 33, no. 21, pp. 3355–3363, 2017.
10. J. S. Kim et al., "GRIM-filter: Fast seed location filtering in DNA read mapping using processing-in-memory technologies," BMC Genom., vol. 19, no. 2, p. 89, 2018.
11. M. Imani, A. Rahimi, D. Kong, T. Rosing, and J. M. Rabaey, "Exploring hyperdimensional associative memory," in Proc. IEEE Int. Symp. High Perform. Comput. Archit., 2017, pp. 445–456.
12. H. Akinaga and H. Shima, "Resistive random access memory (ReRAM) based on metal oxides," Proc. IEEE, vol. 98, no. 12, pp. 2237–2251, Dec. 2010.
13. J. L. Weirather et al., "Comprehensive comparison of Pacific Biosciences and Oxford Nanopore Technologies and their applications to transcriptome analysis," version 2, F1000Research, 2017.

Roman Kaplan is currently a PhD student in the Faculty of Electrical Engineering, Technion—Israel Institute of Technology, under the supervision of Prof. R. Ginosar. Between 2009 and 2014, he was a software engineer. His research interests are parallel computer architectures, in-data accelerators for machine learning, and bioinformatics. He has a BSc and an MSc from the Faculty of Electrical Engineering, Technion. Contact him at romankap@gmail.com.

Leonid Yavits is currently a Postdoctoral Fellow in electrical engineering at Technion—Israel Institute of Technology. He co-founded VisionTech, where he co-designed a single-chip MPEG2 codec. VisionTech was acquired by Broadcom in 2001. He coauthored a number of patents and research papers on SoC and ASIC. His research interests include non-von Neumann computer architectures, accelerators, and processing in memory. He has an MSc and a PhD in electrical engineering from Technion. Contact him at leonid.yavits@nububbles.com.

Ran Ginosar is a professor with the Department of Electrical Engineering and serves as the head of the VLSI Systems Research Center at the Technion—Israel Institute of Technology. He joined the Technion faculty in 1983 and was a visiting associate professor with the University of Utah during 1989–1990, and a visiting faculty with Intel Research Labs in 1997–1999. His research interests include VLSI architecture, manycore computers, asynchronous logic and synchronization, networks on chip, and biologic implant chips. He has co-founded several companies in various areas of VLSI systems. He has a BSc (summa cum laude) from Technion and a PhD from Princeton University, both in electrical and computer engineering. Contact him at ran@ee.technion.ac.il.
The Queuing-First Approach for Tail Management of Interactive Services
Amirhossein Mirhosseini and
Thomas F. Wenisch
University of Michigan
Abstract—Managing high-percentile tail latencies is key to designing user-facing cloud services. Rare system hiccups or unusual code paths make some requests take 10×–100× longer than the average. Prior work seeks to reduce tail latency primarily by addressing the root causes of slow requests. However, often the bulk of requests comprising the tail are not these rare slow-to-execute requests. Rather, due to head-of-line blocking, most of the tail comprises requests enqueued behind slow-to-execute requests. Under high-disparity service distributions, queuing effects drastically magnify the impact of rare system hiccups and can result in high tail latencies even under modest load. We demonstrate that improving the queuing behavior of a system often yields greater benefit than mitigating the individual system hiccups that increase service time tails. We suggest two general directions to improve system queuing behavior, server pooling and common-case service acceleration, and discuss circumstances where each is most beneficial.
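The abstract's central claim, that a rare hiccup inflates the tails of many ordinary requests through head-of-line blocking, can be reproduced with a toy experiment: a first-come-first-served single-server queue in which a tiny fraction of requests carries a large extra delay. The sketch below uses the Lindley recursion; all parameter values and names are ours, chosen only to illustrate the effect, not to reproduce the authors' BigHouse simulations:

```python
import random

def p99_sojourn(load: float = 0.5, n: int = 200_000, p_hic: float = 0.001,
                hic_mean: float = 100.0, seed: int = 1) -> float:
    """99th-percentile sojourn time of an FCFS single-server queue with
    Poisson arrivals and exponential service (mean 1), where a fraction
    p_hic of requests suffers an extra exponential 'hiccup' delay."""
    rng = random.Random(seed)
    mean_service = 1.0 + p_hic * hic_mean
    arrival_rate = load / mean_service          # keeps utilization at `load`
    wait, sojourns = 0.0, []
    for _ in range(n):
        service = rng.expovariate(1.0)          # nominal work
        if rng.random() < p_hic:
            service += rng.expovariate(1.0 / hic_mean)  # rare slow request
        sojourns.append(wait + service)
        # Lindley recursion: waiting time seen by the next arrival
        wait = max(0.0, wait + service - rng.expovariate(arrival_rate))
    sojourns.sort()
    return sojourns[int(0.99 * n)]
```

With these defaults, hiccups touch only 0.1% of requests, yet the simulated 99th-percentile sojourn time comes out far above the hiccup-free case (p_hic = 0) at the same 50% load, because many nominal requests are enqueued behind a stalled one; this mirrors the M/G/1-versus-M/M/1 comparison discussed in the body of the article.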
ONLINE DATA-INTENSIVE (OLDI) services (e.g., web search) traverse terabytes of data with strict latency targets.1 Managing high-percentile tail latencies is a key problem in designing such services. First, to guarantee user satisfaction, services must meet strict response-time service-level objectives (SLOs), especially for tail latencies.2,3 Second, such services typically communicate via fan-out patterns wherein datasets are "sharded" across numerous "leaf" servers and their responses are aggregated before responding to the user. As such, overall latency is often dictated by the slowest leaves (i.e., the "tail at scale" effect4).

Digital Object Identifier 10.1109/MM.2019.2897671
Date of current version 23 July 2019.
July/August 2019. Published by the IEEE Computer Society. 0272-1732 © 2019 IEEE
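The fan-out arithmetic behind the "tail at scale" effect is worth making concrete: if the aggregator must wait for every leaf, a request is slow whenever any leaf is. A sketch, assuming independent leaves (the function name and numbers are ours):

```python
def p_request_slow(p_leaf_slow: float, fanout: int) -> float:
    """Probability that a fan-out request misses its latency target when
    each of `fanout` independent leaves misses with p_leaf_slow."""
    return 1.0 - (1.0 - p_leaf_slow) ** fanout
```

With a 1% leaf-level tail (p_leaf_slow = 0.01) and 100 leaves, about 63% of user requests experience some leaf's 99th-percentile latency, which is why leaf tail latencies, rather than means, dictate overall latency.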
High tail latencies arise from two effects. First, such applications' service time distributions include outlying requests that take much longer (10×–100× or more) than the mean.5 Some requests may require exceptional processing time depending on their arguments (e.g., search engines1,6) or query types (e.g., sets versus gets in key-value stores5,7). Some requests are delayed by system interference, such as from garbage collection, page deduplication, synchronous huge-page compaction, or network stack impediments.4,8 In other cases, scheduler inefficiencies, power state transitions, suboptimal interrupt routing, poor NUMA node allocation, or virtualization effects may contribute to long tail latencies.9 Finally, interference from colocated workloads can cause slowdown due to contention for shared caches, memory bandwidth, or global resources like network cards or switches.10,11

A second key contributor to applications' end-to-end latency distribution is queuing effects.3 Queuing arises at numerous layers, causing some requests to wait for others4; whereas queuing also affects average performance, its effect on tail latency may be catastrophic. To achieve performance stability, systems must be engineered such that the overall request arrival rate is lower than the aggregate system capacity (service rate). However, as both rates fluctuate, arrivals may temporarily outstrip service capacity, causing requests to queue. Queueing delay is most apparent under high system load. However, in this paper, we make the case that queuing effects drastically magnify the impact of rare system events/hiccups and can result in high tail latencies even under modest load. Due to head-of-line (HoL) blocking, many requests are delayed by an exceptionally slow one that stalls a server/core; these delayed requests account for a bulk of the latency distribution tail.

Through stochastic queuing simulation,12 we show that improving a system's queuing behavior often yields much greater benefit than mitigating the individual system hiccups that directly increase service time tails. We suggest two general directions for improving system queuing behavior: server pooling, and common-case service acceleration (CCSA). Server pooling is the practice of redesigning system architecture to change single-server ("scale-out") queues into multiserver ("scale-up") ones; that is, rather than enqueuing requests at distinct servers/cores, a single queue is shared among many (i.e., converting c G/G/1 queues into a G/G/c). Server pooling greatly reduces queuing delay and can completely eliminate queueing with enough servers (i.e., high enough c). Pooling smooths fluctuations in both arrivals and service, making the system behave more like one with deterministic interarrival and service times. Especially for high-disparity service time distributions (i.e., rare system events/hiccups), server pooling reduces the overall tail latency by breaking HoL blocking and preventing nominal requests from waiting behind exceptionally long ones. Even a modest degree of concurrency allows many short requests to drain past stalled ones, substantially reducing weight in the latency distribution tail.

CCSA improves systems' queuing behavior by deploying optimizations that target common-case service behavior (as opposed to optimizations that directly target rare/slow requests or hiccups). It may seem counterintuitive to improve tail latency by optimizing typical-case request performance, but queuing delays are greatly impacted by the average load, which depends more on typical-case service time than rare cases.

In single-server systems, CCSA has little impact when the service variance is excessively high (i.e., HoL blocking is common), as nominal requests queue behind rare, slow ones regardless of how fast the nominal requests are processed. But, if there is sufficient concurrency (e.g., by using server pooling) that slow requests rarely occupy all servers, then CCSA provides enormous benefit by allowing nominal requests to drain past slow ones, drastically reducing wait time. Importantly, we show that, with concurrency, CCSA is more effective than reducing either the length or the probability of rare hiccups. Since finding and mitigating tail events is hard due to their myriad causes,13 we
believe this observation is encouraging—we can reduce tail latency without engaging in "whack-a-mole" with rare system hiccups.

In short, we argue that cloud system designers should invest optimization effort first into 1) reducing HoL blocking through higher concurrency and improved queuing discipline (i.e., server pooling) and then into 2) optimizing common-case performance to improve mean service time. Both of these approaches may have greater impact and are easier to achieve than directly pinpointing and mitigating rare cases and hiccups; whereas server pooling smooths out arrival and service variability, CCSA reduces the effective system load. The relative impact of the two approaches depends critically on the system load and service time variance. CCSA's effectiveness improves as service times become more normal and/or concurrency increases. We build a simple regression model on concurrency and service time variance to estimate HoL blocking and indicate whether server pooling or CCSA is more beneficial in reducing tail latency. System designers can use this model to guide optimization effort and estimate its impact.

BACKGROUND AND METHODOLOGY

Most interactive cloud services can be modeled as A/S/c queuing systems (based on Kendall's notation14), where A specifies the request interarrival time distribution, S the service time distribution, and c the number of concurrent servers. Regardless of the distributions, the average arrival rate (λ) must be lower than the average aggregate service rate of all servers (μc, with μ as the average service rate of a single server); otherwise, requests queue without bound.

The most common queuing models used in analytical studies are M/M/c systems (M stands for Markovian), where both interarrival and service times follow exponential distributions. It can be shown that the exponential distribution is the only continuous distribution with the memoryless property (i.e., occurrence of events is independent of the system's history).14 An interesting property of exponentially distributed random variables is the constant ratio between their mean values and all of their quantiles (including the median and all percentile tails), as shown in

P(S > a · E(S)) = e^(−a).   (1)

Due to the memoryless property of exponential distributions, M/M/c queuing systems can be easily analyzed with continuous-time Markov chains and have closed-form solutions for many of their parameters, such as average waiting and sojourn (waiting plus service) times.

Neither interarrival nor service times of interactive cloud services are perfectly modeled by exponential distributions. But since requests usually originate from a large pool of independent sources (e.g., many distinct users), they typically mimic Poisson (memoryless) arrivals; prior studies have observed that interarrival time distributions usually have small coefficients of variation (mostly between 1 and 2).12 As such, interarrival processes can be well approximated with an exponential distribution (CV = 1) with little fidelity loss.15 Service time distributions, in contrast, may have long tails; some requests encounter rare hiccups that increase service time by 10×–100× (or even more) over the mean—much larger than the ratio of the 99th-percentile and mean values in the exponentially distributed service times of M/M/c systems [≈4.6, based on (1)]. Hence, interactive cloud services are often investigated using M/G/c queuing models (G stands for General).5,16

Unfortunately, M/G/c queuing models do not have closed-form solutions for average waiting/sojourn times, and the accuracy of existing approximations, which use only a few moments, is poor.17 Furthermore, to the best of our knowledge, there is no widely used approximation for the waiting/sojourn time quantiles of these systems. Thus, we use stochastic queuing simulation, based on the BigHouse methodology,12 to measure the tail latency of such M/G/c systems. We
simulate the queuing system until we achieve 95% confidence intervals of 5% error in reported results. We consider the first-come–first-served (FCFS) queuing discipline as prior work9,18 shows it to be the best non-preemptive scheduling policy when tail latency is the metric of interest.

We model "nominal" request performance by drawing service times from an exponential distribution with mean 1/μn. Then, to represent rare/slow requests, which we call "hiccups," with probability ph we add an additional delay drawn from a second exponential distribution with mean 1/μh. We vary both ph and the ratio μn/μh in our experiments. This hybrid model is similar to the dual-branch hyperexponential distribution, which is widely used as a phase-type distribution for approximating heavy-tailed systems.14 We study analytical distributions as they are easier to understand and their parameters can be tuned to model various real scenarios.

The intent of our approach is to model the near-memoryless nominal behavior of cloud services and then overlay an independent distribution to model hiccups. We consider hiccups that are 1) 10× longer than average, affecting 1% of requests, and 2) 100× longer, affecting 0.1% of requests.

1) The first one represents unusual code paths that arise in, e.g., web search. As an example, Microsoft observes a bimodal distribution for Bing search,6 wherein most requests incur latencies close to the mean but occasional requests require an order of magnitude more processing time due to their complicated search queries. They report a 27× ratio between the 99th-percentile tail and the median latency (which is usually smaller than the mean). Similarly, Google reports a 1-ms median leaf service time with a 99th-percentile tail latency of 10 ms.4
2) This one represents rare pauses that arise due to system activities and interference. As an example, Wang et al.19 studied a multitier web application and identified a similar bimodal distribution incorporating rare requests with less than 1% probability that take 30×–40× longer than the mean due to garbage collection or voltage/frequency state transitions.

THE QUEUEING-FIRST APPROACH

Requests may incur an end-to-end latency in a high-percentile tail either because the request itself incurred a rare hiccup or due to queuing delays. Queuing greatly magnifies the impact of few, rare hiccups by causing nominal requests to queue behind one with a hiccup and incur high sojourn times. With deterministic or memoryless service times, queuing arises primarily due to request bursts, wherein the instantaneous arrival rate exceeds the average service rate. However, with high-disparity service time distributions, queuing is caused mostly by HoL blocking, wherein the instantaneous service rate drops temporarily well below the average request arrival rate.

The differing nature of queuing has important implications. First, with high-disparity service, queuing can arise even at low load; when a slow request stalls the server for a long time, many requests may queue behind it, even if the arrival rate is low. Second, it increases the contribution of nominal requests to the sojourn-time tail; while hiccups directly impact few requests, such requests account for a large fraction of server utilization. As such, a substantial fraction of nominal requests queue behind the exceptional ones. As an example, in an M/G/1 queue where 0.1% of requests incur a 100× higher than nominal service time, the exceptional requests account for 10% server utilization. As a consequence of Poisson arrivals, 10% of requests arrive during such a slow service and may also contribute to the sojourn-time tail.

Figure 1(a) and (b) reports the normalized 99th-percentile tail latency of an M/M/1 system and its M/G/1 counterpart with the high-disparity service time distribution described above across various load levels. Figure 1(c) reports the fraction of sojourn time spent waiting by the 1% slowest requests for both M/M/1 and M/G/1 queues. Under low loads, wait time is usually small in M/M/1 systems and the sojourn-time tail is nearly the same as the service-time tail. However, queuing accounts for
due to “transient events,” such as JVM a significant fraction of tail latency when the

Figure 1. (a) Normalized service- and sojourn-time 99th percentile tail in an M/M/1 queue. (b) Normalized service- and sojourn-time 99th percentile tail in an M/G/1 queue. (c) Average % wait time in sojourn-time tail requests. (d) % of sojourn-time tail requests that are also in the service-time tail. The M/G/1 queue has an exponential service time distribution, but incorporates 100× hiccups that occur in 0.1% of the requests.

service time distribution is high disparity. Furthermore, since hiccups occur with a low probability (0.1%), they do not noticeably affect the service time 99th percentile tail. However, due to the HoL blocking, their impact on the sojourn-time tail is large under both low and high loads.

Figure 1(d) reports the percentage of requests in the sojourn-time tail that also contribute to the service-time tail. Under both low and high loads, the percentage is much higher in the M/M/1 system. With high disparity service times, HoL blocking in the M/G/1 system comprises the bulk of the tail—most sojourn-time tail requests are nominal requests that queue behind exceptionally slow ones. Furthermore, as shown in Figure 1(c), while the fraction of queuing delay relative to sojourn time in tail requests is higher in the M/G/1 system, queuing still accounts for more than half of sojourn time even in M/M/1 systems for loads over 30%.

The takeaway is that if a service incurs either high load or has a high disparity service time distribution, end-to-end tail latency is dominated by queuing effects. As a result, improving system queuing behavior is typically more effective than seeking to directly mitigate system hiccups that cause heavy/long tails. Finding and mitigating system hiccups is hard. As such, we advocate pursuing optimizations that address queuing behavior instead.

Server Pooling
Figure 2 contrasts two different models to compose multiple servers. In the scale-out model, each server has a separate request queue and a dispatcher/load balancer steers incoming requests into different queues such that the request arrival rate of all servers is balanced. In the scale-up model, instead, a single request queue is shared among all servers, which each fetch requests from the central request queue as they become idle. This model requires synchronization of the central request queue, but improves queuing.

It can be shown that the scale-up (M/G/c) organization always outperforms the scale-out organization (c × M/G/1) in principle (neglecting synchronization). First, in the scale-up organization, a server will not remain idle if there are requests waiting in the central queue. However, in scale-out systems, a server may remain idle if its own queue is empty even while other servers have outstanding requests. Second, when a request takes longer than average in a scale-out organization, all the requests behind it suffer from

Figure 2. Scale-out versus scale-up queuing organizations.


Figure 3. Normalized service-time (light bars) and sojourn-time (dark bars) tails of an M/G/1 queue under different scenarios. (a) 70% load, 100× hiccups affecting 0.1% of requests. (b) 70% load, 10× hiccups affecting 1% of requests. (c) 30% load, 100× hiccups affecting 0.1% of requests.

HoL blocking delays. In contrast, in scale-up architectures, requests may be serviced by any server; stalling at one server has little impact on system-wide instantaneous service rate. Several prior studies have observed that scale-up queuing systems outperform scale-out organizations.9,5 However, a large number of contemporary software systems use a scale-out queuing architecture as it is easier to implement.9 Implementing a scale-up model across multiple machines requires remote disaggregated memory accesses or a distributed data structure, which are difficult to implement and optimize. Even within a single

multicore server, implementing a scale-up model mandates either a single synchronized data structure or a work-stealing architecture, which incur coherence traffic and are difficult to scale.

We refer to the practice of consolidating c × M/G/1 servers into a single M/G/c system as server pooling. When service-time distributions are high disparity, HoL blocking becomes the main source of queuing delay (and tail latency) and the gap between the two queuing organizations grows. We argue that server pooling can play a key role in resolving HoL blocking under such service conditions and, hence, should be pursued despite higher implementation complexity. In fact, server pooling often reduces the tail latency more than directly mitigating the rare hiccups that cause exceptionally long service.

Figure 3 reports the normalized service/sojourn time tail latencies in an M/G/1 system with different service time distributions and system loads. The leftmost red bars represent tail latencies in the presence of rare hiccups. The next group of blue bars show the tail latency where the impact (i.e., duration/probability) of hiccups has been reduced. In particular, from left to right, these bars represent cases where hiccup duration is halved, their occurrence probability is halved, and where hiccups are fully eliminated. Finally, the cluster of green bars indicate server pooling cases with varying number of servers c. (We discuss the orange bars later.)

Figure 3(a) considers an exponential service time distribution with hiccups that occur 0.1% of the time and last 100× longer than the average service time under 70% system load. We make three observations: First, reducing the hiccup probability is considerably less effective at reducing the overall tail than reducing their duration. The intuition is that longer hiccups cause more requests to queue and hence exacerbate tails more than shorter but more frequent hiccups. Second, pooling only two servers reduces tail latency almost as much as halving hiccup durations. Whereas it may be challenging to implement high-concurrency data structures to enable a high degree of server pooling, sharing queues across just pairs of machines or cores is likely easier than finding and mitigating hiccups. Finally, with greater degrees of server pooling, queuing delay vanishes and the sojourn-time tail and service-time tail match. In such a scenario, end-to-end tail latency is even lower than in a system with no hiccups but without pooling.

Figure 3(b) reports the same results for hiccups 10× longer than the average occurring in 1% of requests. Whereas the general trend matches Figure 3(a), the gap between the service- and sojourn-time tails is noticeably smaller even though the total service time attributable to hiccups is the same (10 × 1% = 100 × 0.1%). As previously observed, longer hiccups introduce more severe HoL blocking and cause more nominal requests to queue behind the exceptional ones (despite lower hiccup probability). Nevertheless, in Figure 3(b), pooling across only two servers, despite hiccups, is enough to reduce the sojourn time tail below that of a system without server pooling and without hiccups. Figure 3(c) considers the same service time distribution as Figure 3(a), but under lower (30%) system load. Here, whereas queuing delays are typically near-negligible under low load, the high disparity service distribution nevertheless causes HoL blocking and a significant sojourn time tail. Interestingly, the ratio between the sojourn- and service-time tails is much higher than that seen in Figure 3(b) due to longer hiccups and higher HoL blocking, despite lower load. Furthermore, when HoL blocking is high but system load is low, pooling across two servers completely eliminates queuing delay.

In summary, server pooling is highly effective in eliminating HoL blocking and reducing queuing delays that otherwise arise due to rare system hiccups. Although pooling across many cores/machines is often challenging, encouragingly, we show that pooling across as few as two servers is often sufficient for large tail latency reductions.

A variety of steering and scheduling techniques can enable a scale-out system to more closely approximate scale-up system behavior. Examples include smart load-balancing schemes that steer requests to queues based on wait time estimates derived from metrics like queue occupancy, injecting replica requests to different queues and then cancelling the redundant requests,4 and various work-stealing approaches


that migrate tasks between queues.20 While these techniques typically fall short of an ideal M/G/c system, they still drastically reduce wait time and HoL blocking. Further, as shown by Wierman and Zwart,18 FCFS scheduling is only best for M/G/1 queues if the service-time distribution is light-tailed. Otherwise, variants of processor sharing outperform FCFS in terms of tail latency. Thus, should direct implementation of server pooling prove prohibitive in a particular system, time-multiplexing machines/cores among requests may provide an alternative to address queuing due to rare hiccups.

Common-Case Service Acceleration
Common-case service acceleration (CCSA) is another general approach to improve system queuing behavior. In this approach, rather than seek to mitigate the rare hiccups that cause high service times, instead, the system designer deploys optimizations that accelerate common-case behavior. As such, while CCSA directly reduces average service time, it has little effect on the tail of the service distribution. Conventional wisdom suggests that improving average service time does not improve tail latency; indeed, some prior work suggests trading off slower average performance to rein in tails.7,6 However, reducing average service time increases service rate, and hence reduces server utilization. Reduced utilization in turn reduces queuing delays.

Unlike server pooling, CCSA has little impact when HoL blocking is high, as nominal requests enqueue behind long ones regardless of how fast nominal requests are processed. The rightmost set of orange bars in Figure 3 report service and sojourn time tails under varying degrees of CCSA (i.e., different speedups of common-case service time). We observe a large benefit in Figure 3(b), where hiccups are relatively short and there is little HoL blocking; accelerating the common-case service time by only 20% (without affecting its tail) reduces the sojourn time tail almost as much as reducing the average hiccup length by half. Doubling the service rate reduces the sojourn time tail below that of a system with no hiccups or a system with a pooling degree of two. In the remaining cases [see Figure 3(a) and (c)], CCSA has only modest impact on the sojourn time tail, as there is more HoL blocking and insufficient concurrency for requests to avoid it.

CCSA provides greater benefit with higher concurrency (e.g., via server pooling). Even modest concurrency is sufficient to unlock CCSA's effectiveness; rare events are unlikely to occupy multiple servers at the same time, so nominal requests nearly always bypass a stalled server. As an example, Figure 4 considers the scenario from Figure 3(a) but in M/G/2 and M/G/3 systems (i.e., with server pooling). Not only is CCSA better than mitigating hiccups, it is also better than further increasing server pooling. We expect that CCSA will typically be easier to implement than hunting down and optimizing the underlying causes of rare performance hiccups as software developers are already incentivized to make the common case fast. Note that this approach is most beneficial for service distributions wherein, despite heavy/long tails, most of the system utilization arises from nominal requests. For example, in our modeled distributions, 10% of system utilization is spent on hiccups. However, in many power-law distributions, tail events contribute to 80%–90% of the distribution; CCSA would not be as effective in such scenarios.

Figure 4. Normalized sojourn time tail latency in an M/G/1 queue (100× hiccups in 0.1% of requests) with various degrees of server pooling and CCSA.

Discussion
We expect CCSA to be more beneficial than server pooling, as CCSA reduces the effective system load, while server pooling has no effect on

load/utilization. However, as we showed in the previous section, CCSA is only effective in the absence of server HoL blocking. When HoL blocking is frequent (e.g., service time variance is high), CCSA no longer provides benefit as nominal requests queue behind exceptionally long ones. In such scenarios, additional concurrency must be introduced to unleash CCSA's efficacy.

In single-server systems, service time variability is a good measure of HoL blocking. For example, in Figure 3(a), where CVservice = 4.2, CCSA has negligible impact; nominal requests wait behind slow ones. In contrast, in Figure 3(b) where CVservice = 1.6 (near the CVservice = 1.0 of M/M/∗ queues), CCSA is more effective than server pooling. However, CVservice only reflects HoL blocking in single-server systems. We suggest the interdeparture time variability of a saturated queue (when queuing probability is close to 1.0) to measure HoL blocking in multiserver queues. In saturated single-server queues, the interdeparture time distribution is the service time distribution (CVservice = CVdeparture). However, with multiple servers, departures interleave, reducing interdeparture time variability. For example, in Figure 4, which is similar to Figure 3(a) but with an additional 1–2 servers, the CVdeparture drops (from 3.0) to 1.7 and 1.1, respectively. As a result, in the M/G/2 case, CCSA yields almost the same benefit as pooling. In the M/G/3 case, where HoL blocking resembles that of an M/M/∗ queue with CVdeparture = 1.0, CCSA yields much better results than server pooling.

We find that a simple regression model can predict the CVdeparture of a saturated M/G/c queue based on its CVservice and the number of servers (c). We construct the model by simulating saturated queues with a set of high disparity distributions with different CVservice and measure their CVdeparture. We observe that small degrees of server pooling quickly reduce HoL blocking. Therefore, we postulate an exponential decay effect for the number of servers. Also, we note that CVdeparture may not decrease below 1.0 as the interdeparture process becomes near-memoryless around CVdeparture = 1.0, where the ratio of tail-to-average cases does not decrease through higher concurrency [see (1)]. As a result, we suggest a regression model of the form

CVdeparture ≈ (CVservice − 1) e^(−0.8(c−1)) + 1    (2)

and tune its parameter using the least squares method. We find its average error to be less than 13%.

Using this model, we can derive CVdeparture as a proxy for the HoL blocking rate and predict how it is affected by server pooling. Alternatively, cloud system architects may perform Stochastic Queueing Simulations, similar to our approach, and directly measure CVdeparture instead of predicting it. When the system approaches the CVdeparture = 1.0 of M/M/∗ queues, blocking becomes rare; the remaining tail of the sojourn time distribution is then primarily due to service time tails or high load. Under low load, queuing delays vanish with sufficient server pooling; remaining sojourn time tails reflect only service tails. Under high load, HoL blocking will no longer be the dominant source of queuing delays when sufficient concurrency has been introduced. As such, with sufficient server pooling (often just 2–3 servers), CCSA becomes more effective than further server pooling.

In short, we recommend developers follow a simple optimization sequence to address tail latency in their services: 1) introduce server pooling until HoL blocking is sufficiently mitigated; 2) if load is high, introduce CCSA; 3) if end-to-end tails remain unacceptable, only then seek to directly optimize rare, high service latencies.

CONCLUSION
Improving a system's queuing behavior often yields much greater benefit than mitigating the individual system hiccups that increase service time tails. We suggest two general directions for improving system queuing behavior—server pooling, and CCSA—which synergistically address queuing behaviors that often drive tail latency.
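The hiccup model and the pooling/CCSA recommendations above can be illustrated with a small stochastic queueing simulation. The sketch below is an editorial illustration under stated assumptions (the parameter values and the simple FCFS event loop are not the authors' simulator): service times are drawn from the dual-branch hyperexponential (100× hiccups in 0.1% of requests), and the 99th-percentile sojourn time is compared for M/G/1, a pooled M/G/2, and M/G/2 with a 2× common-case speedup.

```python
import random

def service_time(rng, mu_n=1.0, p_h=0.001, hiccup_scale=100.0, speedup=1.0):
    """Dual-branch hyperexponential service time: exponential nominal
    service (mean 1/mu_n) plus, with probability p_h, a hiccup drawn
    from a second exponential that is hiccup_scale times longer.
    CCSA (speedup > 1) accelerates only the nominal branch, leaving
    the service-time tail untouched."""
    s = rng.expovariate(mu_n * speedup)
    if rng.random() < p_h:
        s += rng.expovariate(mu_n / hiccup_scale)
    return s

def p99_sojourn(c=1, load=0.7, n=100_000, speedup=1.0, seed=42):
    """FCFS M/G/c with one shared queue (server pooling when c > 1).
    Returns the 99th-percentile sojourn (wait + service) time."""
    rng = random.Random(seed)
    lam = load * c            # Poisson arrival rate (~unit-mean nominal service)
    t = 0.0
    free = [0.0] * c          # time at which each server next becomes idle
    sojourns = []
    for _ in range(n):
        t += rng.expovariate(lam)          # next Poisson arrival
        s = service_time(rng, speedup=speedup)
        i = free.index(min(free))          # FCFS: earliest-free server
        start = max(t, free[i])            # wait whenever all servers are busy
        free[i] = start + s
        sojourns.append(free[i] - t)
    sojourns.sort()
    return sojourns[int(0.99 * n)]

baseline = p99_sojourn(c=1)
pooled = p99_sojourn(c=2)
ccsa = p99_sojourn(c=2, speedup=2.0)
print(f"p99 M/G/1: {baseline:.1f}  M/G/2: {pooled:.1f}  M/G/2 + CCSA: {ccsa:.1f}")
```

With these (assumed) parameters, pooling just two servers sharply cuts the sojourn-time tail relative to the M/G/1 baseline, and adding CCSA on top of modest pooling cuts it further—mirroring the qualitative trends of Figures 3 and 4, though the exact numbers depend on the illustrative parameters.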


ACKNOWLEDGMENTS
This work was supported by the Center for Applications Driving Architectures, one of six centers of JUMP, a Semiconductor Research Corporation program co-sponsored by DARPA.

REFERENCES
1. L. A. Barroso, J. Dean, and U. Hölzle, "Web search for a planet: The Google cluster architecture," IEEE Micro, vol. 23, no. 2, pp. 22–28, Mar./Apr. 2003.
2. S. Kanev et al., "Profiling a warehouse-scale computer," in Proc. ACM/IEEE 42nd Annu. Int. Symp. Comput. Archit., 2015, pp. 158–169.
3. C. Delimitrou and C. Kozyrakis, "Amdahl's law for tail latency," Commun. ACM, vol. 61, no. 8, pp. 65–72, 2018.
4. J. Dean and L. A. Barroso, "The tail at scale," Commun. ACM, vol. 56, no. 2, pp. 74–80, 2013.
5. H. Kasture and D. Sanchez, "TailBench: A benchmark suite and evaluation methodology for latency-critical applications," in Proc. IEEE Int. Symp. Workload Characterization, 2016, pp. 1–10.
6. M. E. Haque et al., "Few-to-many: Incremental parallelism for reducing tail latency in interactive services," ACM SIGPLAN Notices, vol. 50, no. 4, pp. 161–175, 2015.
7. C.-H. Hsu et al., "Adrenaline: Pinpointing and reining in tail queries with quick voltage boosting," in Proc. IEEE 21st Int. Symp. High Perform. Comput. Archit., 2015, pp. 271–282.
8. R. Kapoor, G. Porter, M. Tewari, G. M. Voelker, and A. Vahdat, "Chronos: Predictable low latency for data center applications," in Proc. 3rd ACM Symp. Cloud Comput., 2012, Art. no. 9.
9. J. Li, N. K. Sharma, D. R. Ports, and S. D. Gribble, "Tales of the tail: Hardware, OS, and application-level sources of tail latency," in Proc. ACM Symp. Cloud Comput., 2014, pp. 1–14.
10. D. Lo, L. Cheng, R. Govindaraju, P. Ranganathan, and C. Kozyrakis, "Heracles: Improving resource efficiency at scale," ACM SIGARCH Comput. Archit. News, vol. 43, no. 3, pp. 450–462, 2015.
11. A. Mirhosseini, A. Sriraman, and T. F. Wenisch, "Enhancing server efficiency in the face of killer microseconds," in Proc. IEEE 25th Int. Symp. High Perform. Comput. Archit., 2019, pp. 185–198.
12. D. Meisner, J. Wu, and T. F. Wenisch, "BigHouse: A simulation infrastructure for data center systems," in Proc. IEEE Int. Symp. Perform. Anal. Syst. Softw., 2012, pp. 35–45.
13. X. Yang, S. M. Blackburn, and K. S. McKinley, "Computer performance microscopy with Shim," in Proc. ACM/IEEE 42nd Annu. Int. Symp. Comput. Archit., 2015, pp. 170–184.
14. M. Harchol-Balter, Performance Modeling and Design of Computer Systems: Queueing Theory in Action. Cambridge, U.K.: Cambridge Univ. Press, 2013.
15. D. Meisner, C. M. Sadler, L. A. Barroso, W.-D. Weber, and T. F. Wenisch, "Power management of online data-intensive services," in Proc. 38th Annu. Int. Symp. Comput. Archit., 2011, pp. 319–330.
16. D. Meisner, B. T. Gold, and T. F. Wenisch, "PowerNap: Eliminating server idle power," ACM SIGPLAN Notices, vol. 44, no. 3, pp. 205–216, 2009.
17. V. Gupta, M. Harchol-Balter, J. Dai, and B. Zwart, "On the inapproximability of M/G/K: Why two moments of job size distribution are not enough," Queueing Syst., vol. 64, no. 1, pp. 5–48, 2010.
18. A. Wierman and B. Zwart, "Is tail-optimal scheduling possible?" Oper. Res., vol. 60, no. 5, pp. 1249–1257, 2012.
19. Q. Wang et al., "Detecting transient bottlenecks in n-tier applications through fine-grained analysis," in Proc. IEEE 33rd Int. Conf. Distrib. Comput. Syst., 2013, pp. 31–40.
20. J. Li et al., "Work stealing for interactive services to meet target latency," ACM SIGPLAN Notices, vol. 51, no. 8, p. 14, 2016.

Amirhossein Mirhosseini is currently working toward a PhD in computer science and engineering at the University of Michigan. His research interests center on computer architecture with particular emphasis on data centers, cloud computing, and microservices. He is a student member of the IEEE and the Association for Computing Machinery (ACM). Contact him at miramir@umich.edu.

Thomas F. Wenisch is an associate professor of electrical engineering and computer science and Associate Chair of External Affairs at the University of Michigan, Ann Arbor. His research interests center on computer architecture with particular emphasis on server and data center systems, memory persistency, multiprocessor systems, and performance evaluation methodology. He has a PhD in electrical and computer engineering from Carnegie Mellon University. He is a member of the IEEE and the Association for Computing Machinery (ACM). Contact him at twenisch@umich.edu.

Column: Micro Economics

The Aftermath of the Dyn DDOS Attack


Shane Greenstein
Harvard Business School

NOBODY KNOWS WHO organized the attack. It might have come from an angry gamer, or from a rogue spy, or, perhaps, an angry rogue spy playing games. The program hijacked many cameras and home devices, and redirected them to engineer a series of distributed denial of service (DDOS) attacks a few hours apart, all on October 21, 2016. By executing this novel and rather clever hijack of many devices for a DDOS attack, the attack exposed an important vulnerability in today's internet.

The attack contains one other element. It aimed at Dyn, who acts as a name resolver. Dyn enables Internet traffic by translating the site's domain name (URL) into the IP address where the server behind that domain is to be found. During the later phases of the attack, Dyn servers were unable to process users' requests, and as a result, users lost access to web domains contracting with Dyn, such as Netflix, CNBC, and Twitter. Other well-known firms also were disabled, such as Airbnb, Etsy, PlayStation Network, and Wikia.

This article focuses on the aftermath of this event, which did not get headlines, but illustrates an important feature of the situation. Specifically, how did users react? User behavior tells us something about the challenges facing suppliers, and in this case, it tells us about a basic challenge in network security today. It will take a bit of work to appreciate the lesson, and, let me tip my hand, the news is not good. The article provides a summary of a longer study done by a group of my colleagues and myself.1

What happened?
Start with the basics. A website owner can set up its own name resolver, or hire somebody to do it for them. Many years ago most firms did it themselves. Like a lot of things on the internet, over time a set of professional firms emerged, while many technically sophisticated firms still perform this function for themselves.

What do name resolvers do?
Start with the basics. It is a mouthful, but we need to understand it to understand what happened to Dyn. When an application (such as a web browser) wants to access a page or resource located at a known domain name, it can access DNS records and a corresponding IP address. In principle, the application submits a request to a DNS "resolver" asking for the IP address corresponding to a given domain name. The resolver queries a root nameserver, which replies with the corresponding TLD nameserver specified by the domain name (e.g., ".com"). The resolver then queries that TLD nameserver with the second component
Digital Object Identifier 10.1109/MM.2019.2919886 name (e.g., “.com”). The resolver then queries
Date of current version 23 July 2019. that TLD nameserver with the second component

0272-1732 © 2019 IEEE. Published by the IEEE Computer Society.
Figure 1. Market share for largest name servers.

of the domain name (e.g., "google"). The TLD nameserver retrieves that domain's authoritative nameservers (e.g., "ns1.google.com") and returns them to the resolver. Finally, the resolver queries one of the authoritative nameservers and receives a usable IP address for the domain. The IP address is passed back to the original application, which can use it to connect to the desired host. This entire process generally takes just milliseconds. In practice, many users request the same thing repeatedly, so it is possible to cache the answers to many of the intermediate steps, and that can speed up the resolution even more. More to the point, caching across many servers acts as a quasi-buffer against a DDOS attack, especially if the attack can be rebuffed in a short time.

The emergence of numerous lessons and automated standard practices has gone hand-in-hand with the emergence of professional firms. Such firms know how to provide services at scale. In turn, and as with many other professional markets, some of them became good at their service, and that performance attracted many customers.

Thus emerged an ironic outcome. While the internet contains many points of resiliency, the increasing concentration of services in a small number of providers has created concentrated points of failure. To say it another way, Dyn performed approximately 10% of nameserver services in the United States (prior to the attack), so bringing it down could bring down 10% of the internet's servers. Since name resolution also supports a range of other communications, such as email, it could also cripple communications at many firms. That is a big deal.

As should be obvious, the potential economic losses could be enormous. Many businesses, especially internet businesses, lose a lot of revenue from being down for a day.

That motivates a basic question: after this attack, what did users do? It did not take much professional experience to understand the vulnerability or the lessons. Businesses were vulnerable to a single point of failure. A website could construct a form of insurance by performing a simple act: maintaining multiple name servers with multihoming.

We set out to find out two things. First, how many firms maintain multihoming? Second, how did the use of resolvers change after the Dyn attack? Figure 1, which is taken from Figure 7 of our study,1 tells a big part of the answer we found. This figure shows the market shares for the largest providers of nameservers between late 2011 and the middle of 2017.

What did firms do?
Figure 1 shows that, long before the Dyn attack, name servers had embarked on a general trend toward more concentration. The growth of three firms—Dyn, AWS, and Cloudflare—drove this trend. Dyn's growth had already begun to level off by 2014, while AWS and Cloudflare have continued to grow unabated throughout the time period.

The figure actually does not tell the entire story. We did a little investigation, and found that both AWS and Cloudflare attract a high


fraction of the contracts from newly founded from customers of Dyn and somewhat from Neu-
firms. Older firms tend to use others. star (another large provider). No other provider
We also found that, as expected, over time saw a big change.
an increasing fraction of users contracted with You might reasonably reply that Cloudflare’s
providers instead of doing it themselves. Close to security model makes it difficult to multihome,
60% do it themselves in 2011, while close to 30% but does protect against this type of attack. Which
do it in 2017. In other words, the market shares rose during a time when the number of customers practically doubled.

The figure also illustrates the big surprise. The attack on Dyn had consequences for its commercial success. Many of its users seem to have blamed Dyn for the downtime, and shifted to another provider. Within a couple of months Dyn lost a quarter of its customers. Ouch.

But wait, that is not the only surprise. It is also important to notice what did not happen: We did not find much multihoming after the Dyn attack. Around 11% of existing domains multihomed prior to the Dyn attack, and about 18% did so after the attack. New domains multihomed at a rate of less than 5% prior to the attack, and about 8% did after it. In short, there was an uptick in multihoming, but the vast majority of sites continued to use a single provider.

Why does that matter? The least expensive insurance against another attack is multihoming with more than one resolver. Let us be clear about it: It is not expensive. It is just a minor hassle to maintain multiple suppliers. (And, moreover, many aspects of web administration are no worse a hassle. That is the job, after all.)

Not illustrated in the figure are minutiae of market shares. In case you were wondering, most of the multihoming among existing websites came […] is true. But what accounts for so little with all the others? Most users act as if they do not care.

CONCLUSION

Let us summarize. There was a DDoS attack on Dyn. It demonstrated a new vulnerability, and we are still vulnerable to another similar attack. In fact, we may be more vulnerable now that bad actors have watched the prior demonstration.

How did the user community act? The two big headlines are: 1) a fraction of customers acted as if they blamed Dyn, and took precautions; and 2) all but a small fraction of non-Dyn customers did not act as if they learned any lesson.

I do not know about you, but I do not feel any safer. Sigh.

REFERENCE

1. S. Bates, J. Bowers, S. Greenstein, J. Weinstock, and J. Zittrain, "In support of internet entropy: Mitigating an increasingly dangerous lack of redundancy in DNS resolution by major websites and services," NBER working paper 24317, 2018. [Online]. Available at SSRN: https://ssrn.com/abstract=3241740

Shane Greenstein is a professor at the Harvard Business School. Contact him at sgreenstein@hbs.edu.
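To make the column's prescription concrete: multihoming in this sense just means delegating a zone to name servers run by more than one DNS operator. The sketch below is a toy illustration, not the measurement method of the study cited above; the suffix-based provider heuristic and the example hostnames are my own assumptions for illustration. It shows how one might flag a domain whose entire NS set sits with a single provider:

```python
# Toy sketch: decide whether a domain "multihomes" across DNS providers.
# Heuristic (an assumption for illustration only): treat the last two
# labels of an NS hostname as a proxy for the operating provider, e.g.
# ns1.p19.dynect.net -> dynect.net. Real studies map name servers to
# operators far more carefully than this.

def apparent_provider(ns_host: str) -> str:
    """Collapse an NS hostname to its last two DNS labels."""
    labels = ns_host.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])

def is_multihomed(nameservers: list[str]) -> bool:
    """True if the NS set appears to span more than one operator."""
    return len({apparent_provider(ns) for ns in nameservers}) > 1

# A zone served only by one provider: a single point of failure.
print(is_multihomed(["ns1.p19.dynect.net", "ns2.p19.dynect.net"]))   # False
# The same zone with a second provider added: the cheap insurance.
print(is_multihomed(["ns1.p19.dynect.net", "ns-12.awsdns-01.org"]))  # True
```

In practice, fetching the NS set is as simple as `dig NS example.com`, and the "minor hassle" the column mentions is mostly keeping zone data synchronized between the two providers.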
68 IEEE Micro