2019-07 - Secure Architectures
Secure Architectures
www.computer.org/micro
IEEE Micro (ISSN 0272-1732) is published bimonthly by the IEEE Computer Society. IEEE Headquarters, Three Park Ave., 17th Floor, New York,
NY 10016-5997; IEEE Computer Society Headquarters, 2001 L St., Ste. 700, Washington, DC 20036; IEEE Computer Society Publications Office,
10662 Los Vaqueros Circle, PO Box 3014, Los Alamitos, CA 90720. Postmaster: Send address changes and undelivered copies to IEEE, Member-
ship Processing Dept., 445 Hoes Ln., Piscataway, NJ 08855. Periodicals postage is paid at New York, NY, and at additional mailing offices. Canadian
GST #125634188. Canada Post Corp. (Canadian distribution) Publications Mail Agreement #40013885. Return undeliverable Canadian addresses
to 4960-2 Walker Road; Windsor, ON N9A 6J3. Printed in USA. Reuse rights and reprint permissions: Educational or personal use of this material is
permitted without fee, provided such use: 1) is not made for profit; 2) includes this notice and a full citation to the original work on the first page of the
copy; and 3) does not imply IEEE endorsement of any third-party products or services. Author and their companies are permitted to post the accepted
version of IEEE-copyrighted material on their own webservers without permission, provided that the IEEE copyright notice and a full citation to the
original work appear on the first screen of the posted copy. An accepted manuscript is a version which has been revised by the author to incorporate
review suggestions, but not the published version with copy-editing, proofreading, and formatting added by IEEE. For more information, please go to
ieee.org/publications_standards/publications/rights/paperversionpolicy.html. Permission to reprint/republish this material for commercial, advertising,
or promotional purposes or for creating new collective works for resale or redistribution must be obtained from IEEE by writing to the IEEE Intellectual
Property Rights Office, 445 Hoes Lane, Piscataway, NJ 08854-4141 or pubs-permissions@ieee.org. ©2019 by IEEE. All rights reserved. Abstracting
and library use: Abstracting is permitted with credit to the source. Libraries are permitted to photocopy for private use of patrons, provided the per-copy
fee indicated in the code at the bottom of the first page is paid through the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923.
Editorial: Unless otherwise stated, bylined articles, as well as product and service descriptions, reflect the author’s or firm’s opinion. Inclusion in IEEE
Micro does not necessarily constitute an endorsement by IEEE or the Computer Society. All submissions are subject to editing for style, clarity, and
space. IEEE prohibits discrimination, harassment, and bullying. For more information, visit ieee.org/web/aboutus/whatis/policies/p9-26.html.
July/August 2019
Volume 39 Number 4
Special Issue
Guest Editors’ Introduction
Secure Architectures
Simha Sethumadhavan and Mohit Tiwari

General Interest
RASSA: Resistive Prealignment Accelerator for Approximate DNA Long Read Mapping
Roman Kaplan, Leonid Yavits, and Ran Ginosar
The Queuing-First Approach for Tail Management of Interactive Services
Amirhossein Mirhosseini and Thomas F. Wenisch

Published by the IEEE Computer Society
Image credit: ©istockphoto.com/ValeryBrozhinsky
Secure Architectures
Lizy Kurian John
The University of Texas at Austin
WELCOME TO THE July/August 2019 issue of IEEE Micro, which presents a selection of articles on secure architectures and a few articles on other topics.

Computer security is becoming increasingly important due to the increased use of computers and the internet in our day-to-day lives. Attacks on computer systems are becoming commonplace. Many recent attacks demonstrated security vulnerabilities in commodity hardware, pointing to the importance of secure hardware architectures. This special issue presents four articles on secure computer architectures. The topics discussed range from defense against cache timing channel attacks to asserting security properties of a processor at runtime.

Prof. Simha Sethumadhavan of Columbia University and Prof. Mohit Tiwari of the University of Texas at Austin served as guest editors for the special issue. A comprehensive article written by Professors Sethumadhavan and Tiwari serves as an excellent introduction to the compendium on secure architectures.

The four articles from the secure architecture theme are accompanied very appropriately by a Micro Economics column related to computer security by Shane Greenstein. In the article titled “The Aftermath of the Dyn DDOS Attack,” Greenstein writes about the October 2016 series of distributed denial-of-service (DDOS) attacks that made many services and platforms unavailable for large segments of users in North America and Europe. The attacker targeted systems operated by the nameserver resolution provider Dyn, which performs approximately 10% of the nameserver services in the United States. Since nameserver resolution is essential for many businesses to operate, the attack affected a range of businesses including Netflix, CNBC, Twitter, Airbnb, and Etsy. Greenstein provides details on the market share of domain name system providers and alerts readers to an important vulnerability in today’s internet, namely, that many internet services are concentrated in a few providers. His article draws attention to the lack of redundancy in internet service providers.

In addition to the special issue articles on computer security, there are two other articles in this issue. The first one, “RASSA: Resistive Pre-Alignment Accelerator for Approximate DNA Long Read Mapping” by Kaplan et al., presents an in-memory parallel architecture for similarity search for genomic sequences. As personalized medicine based on gene mapping is emerging, hardware architectures to support sequence searches are increasingly relevant. One challenge in mapping long sequences is determining the optimal mapping location of every read on to the reference sequence. In this article, Kaplan et al. present hardware acceleration of genomic mapping by using resistive memories (memristors), which are elements that store information by modulating the resistance of the nanoscale storage elements. Memristor arrays facilitate simultaneous compare and mapping and result in a highly parallel compute accelerator.

The second regular channel article is “The Queuing-First Approach for Tail Management of Interactive Services” by Mirhosseini and Wenisch. Cloud services are becoming increasingly popular; however, service latencies in the cloud are heavy-tailed. Some requests can take 100 times longer than the average. This article presents two solutions to mitigate this important problem: server pooling and common-case service acceleration.

I am also proud to write about a new Best Paper award for IEEE Micro. Starting this year, the best paper award will be given for articles published in IEEE Micro. IEEE Computer Society has recently started a best paper award program to acknowledge and reward the best articles in each of the Transactions and Magazines sponsored by the Computer Society. Articles based on a conference paper are ineligible, and hence the MICRO TopPicks articles would be ineligible for this award. The intent of the award is to recognize outstanding regular papers. The first award will be announced in a couple of months.

IEEE Micro is interested in submissions on any aspect of chip/system design or architecture. Please consider submitting articles to IEEE Micro and remember, all regular articles will be eligible for the best paper award.

Hope you enjoy the secure architectures as well as other articles presented in this issue. Happy reading!

Lizy Kurian John is a Cullen Trust for Higher Education Endowed Professor with the Electrical and Computer Engineering Department, the University of Texas at Austin, Austin, TX, USA. Contact her at ljohn@ece.utexas.edu.

Digital Object Identifier 10.1109/MM.2019.2924711
Date of current version 23 July 2019.
0272-1732 © 2019 IEEE. Published by the IEEE Computer Society.
Guest Editors’ Introduction
Secure Architectures
Simha Sethumadhavan, Columbia University
Mohit Tiwari, The University of Texas at Austin

HARDWARE IS THE bedrock on which all computing systems are built. Recently developed hardware ideas from both academia and industry hold significant promise for improving software security. At the same time, recent attacks on current commodity hardware have shown hardware to be a weak foundation for building secure systems. As we enter the post-Moore’s Law era, there are significant questions surrounding what would make security techniques more practical. Thus, security is a first-order problem for computer architects today, and this special edition highlights a few compelling examples across the system stack. This issue contains four articles on various facets of secure architectures.

The first paper by Fan et al. addresses the (timely) problem of information leakage through timing, specifically through shared caches. Defenses have considered partitioning caches across security domains and detecting attacks as they happen as two separate research directions. This article proposes to use these approaches in conjunction: the detection algorithm looks for anomalous cache occupancy patterns and triggers dynamic partitioning once an anomaly is detected. The article demonstrates ways to integrate this combined defense in off-the-shelf CPUs by using Intel’s cache partitioning techniques, and shows the way for future work that combines prevention and detection.

The second article, by authors from Intel, is at the intersection of extremely low-level firmware that often forms the root of trust for secure-hardware-based computing, and security against quantum-computing-driven cryptanalysis. This article considers several alternatives for such “post-quantum” cryptographic schemes, specifically with an eye toward low-level implementation security issues that emerge in firmware. As secure enclave-based computing takes off, considering the foundation of secure firmware is an especially timely problem.

The third article is a report on a Workshop on Energy Secure Architectures. The workshop was organized by Dr. Bose from IBM and Dr. Mukhopadhyay from Georgia Tech and focused on motivating the need to secure the on-chip energy management structures, which are assuming increased importance in the face of emerging technology trends. This article describes the challenges in designing energy management algorithms, attacks, and possible solutions.

The final article by Sturton et al. describes an idea called the FinalFilter. Despite the best efforts of designers to create secure designs, unforeseen inputs or events may cause the processor to perform actions that violate security. In this article, the authors describe how simple invariants can be used to protect against such problems. Unlike static assertions that are removed after design time, the FinalFilter invariants control access to protected assets at all times because they are fabricated into the hardware.

Simha Sethumadhavan is an associate professor of computer science and the Chair of Cybersecurity Center, Data Science Institute, Columbia University. He is an Alfred P. Sloan Research Fellow. Contact him at simha@columbia.edu.

Mohit Tiwari is an assistant professor in the Department of Electrical and Computer Engineering, The University of Texas at Austin. He was a Postdoctoral Fellow at UC Berkeley (2011–2013). He has a PhD from UC Santa Barbara. Contact him at tiwari@austin.utexas.edu.

Digital Object Identifier 10.1109/MM.2019.2925152
Date of current version 23 July 2019.
Theme Article: Secure Architectures
Leveraging Cache
Management Hardware for
Practical Defense Against
Cache Timing Channel
Attacks
Fan Yao, University of Central Florida
Hongyu Fang, George Washington University
Milos Doroslovacki, George Washington University
Guru Venkataramani, George Washington University

TIMING CHANNELS ARE a form of information leakage attack where adversaries modulate and (or just) observe access timing to shared resources in order to exfiltrate secrets. Among various hardware-based information leakage attacks, cache timing channels have become notorious, since caches present the largest on-chip attack surface for adversaries to exploit, combined with high-bandwidth transfers [13]. Previously proposed detection and defense techniques against cache timing attacks either explore hardware modifications or incur nontrivial performance overheads [6], [12], [16], [17]. For more effective system protection and widescale deployment, it is critical to explore ready-to-use and performance-friendly practical protection against cache timing channel attacks.

Digital Object Identifier 10.1109/MM.2019.2920814
Date of publication 4 June 2019; date of current version 23 July 2019.
In this article, we propose a new framework that makes novel use of commercial off-the-shelf (COTS) hardware to thwart cache timing channels. We observe that cache block replacements by adversaries in cache timing channels reveal a distinctive pattern in their cache occupancy profiles, which could be a strong indicator for the presence of cache timing channels. We leverage Intel’s Cache Monitoring Technology (CMT [3]), available in recent server-class processors, to perform fine-grained monitoring of last-level cache (LLC) occupancy for individual application domains. We then apply signal processing techniques that characterize the communication strength with spy processes in cache timing channels. We further leverage LLC way allocation (i.e., CAT [3]) and repurpose it as a secure cache manager to dynamically partition the LLC for suspicious domains and disband their timing channel activity. Our mechanism avoids preemptively separating domains and, consequently, does not result in high performance overheads to benign application domains.

In comparison to our recent work, COTSknight [19], the novel contributions of this article are as follows:

1) To defend against sophisticated adversaries that randomize interval times between transmissions, we augment our COTSknight design to remove irrelevant occupancy trace segments using time warping (see the “Defense Against Advanced Adversaries” section).
2) We perform new experimental studies on virtualized environments that are prone to cache timing channel attacks, and demonstrate the efficacy of our approach (see the “Case Study on Virtualized Environments” section).
3) We identify futuristic threats (like multiple spies and evidence tampering), and discuss potential defense mechanisms using our proposed defense framework (see the “Discussion” section).

BACKGROUND
There are typically two processes involved in cache timing channels, namely, the trojan and spy in covert channels, and victim and spy in side channels. The spy infers secrets from the trojan or the victim by observing the modulated latency of cache accesses [2]. To exfiltrate secrets, the spy needs to determine a communication channel. In the case of covert channels, the trojan and spy may alternate their accesses to the cache temporally, while in side channels, the spy has to run in parallel to the victim process [17]. This may vary along the space dimension as well (i.e., access a single cache location or alternate among multiple cache locations).

Intel’s CMT uniquely identifies each logical core with a resource monitoring ID [3] and tracks the LLC usage for the mapped domains. CMT enables flexible monitoring of LLC occupancy at user-desired domain granularity such as a core, a multithreaded application, or a virtual machine. With CAT, caches can be configured to have several different partitions on cache ways, called classes of service (CLOS), where evicting cache lines from other CLOS is restricted for a given domain.

THREAT MODEL
In this article, we focus on the sophisticated form of attacker that does not rely on any prior memory sharing, and utilizes Prime+Probe-based techniques to launch attacks on the LLC simply by creating conflict misses (replacement) on cache sets. Attacks such as Flush+Reload require shared memory blocks, either through shared libraries or data sharing, that may be prohibited in practical settings. Therefore, we do not consider such forms of attacks. However, for Evict+Reload attacks, where cache replacements alter access latencies, our design would still be applicable. (See the “Discussion” section for details.)

WHY CACHE OCCUPANCY PATTERNS MATTER?
Regardless of whether a trojan intentionally communicates or a victim unintentionally leaks secrets to a spy, cache timing channels use one of the following encoding schemes: 1) ON–OFF encoding (where spy uses timing profile of a
We make the following key observation here: Cache timing channels fundamentally rely on cache block replacements that create swing patterns in the participating domains’ cache occupancy, regardless of the specific timing channel protocols. By analyzing these repetitive swing patterns, there is a potential to uncover the communication strength in such attacks. We note that merely tracking cache misses on an adversary will not be sufficient, as an attacker may inflate cache misses (through issuing additional cache loads that create self-conflicts) on purpose to evade detection.
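The key observation above can be illustrated with a toy model. The sketch below is not the authors’ attack code or measurement setup; it assumes a single shared cache set of `ways` lines, an ON–OFF covert channel in which the spy refills the set before every bit and the trojan fills it only to send a 1, and a hypothetical `decoy_loads` parameter that stands in for the self-conflicting loads an attacker might issue to inflate its miss count. All function and parameter names are illustrative.

```python
def simulate_on_off(bits, ways=16, decoy_loads=0):
    """Toy ON-OFF channel on one shared cache set (illustrative assumption).

    Returns (spy_occupancy, trojan_occupancy, spy_misses), where the two
    occupancy traces are sampled after every prime and every transmit phase.
    """
    spy, trojan = 0, 0
    spy_occ, trojan_occ = [spy], [trojan]
    spy_misses = 0
    for bit in bits:
        # Prime/probe phase: the spy refills the set. Every line it lost
        # since the last refill shows up as a miss.
        spy_misses += ways - spy
        spy, trojan = ways, 0
        # Decoy loads conflict only with the spy's own lines, so they add
        # misses without changing how many lines the spy occupies.
        spy_misses += decoy_loads
        spy_occ.append(spy)
        trojan_occ.append(trojan)
        # Transmit phase: the trojan fills the set to send 1, idles to send 0.
        if bit:
            spy, trojan = 0, ways
        spy_occ.append(spy)
        trojan_occ.append(trojan)
    return spy_occ, trojan_occ, spy_misses


plain = simulate_on_off([1, 0, 1, 1])
noisy = simulate_on_off([1, 0, 1, 1], decoy_loads=10)

print(plain[0])            # spy occupancy:    [0, 16, 0, 16, 16, 16, 0, 16, 0]
print(plain[1])            # trojan occupancy: [0, 0, 16, 0, 0, 0, 16, 0, 16]
print(plain[2], noisy[2])  # miss counts differ: 48 88
print(plain[0] == noisy[0])  # occupancy trace is unchanged: True
```

In this toy run, whenever one party gains lines the other loses them, and the decoy loads change the miss count from 48 to 88 while leaving the occupancy trace untouched, which is the intuition behind monitoring occupancy rather than misses.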
SYSTEM DESIGN
Here, we first discuss CMT-based cache occupancy monitoring and trace analysis for cache timing channel detection, and then outline the cache partitioning strategy to prevent information leakage.

Figure 1. LLC occupancy changes for trojan/victim and spy. (a) ON-OFF: Trjn/Victim idle. (b) ON-OFF: Trjn/Victim access. (c) Pulse-pos: Odd sets. (d) Pulse-pos: Even sets.
that implements parallel protocol with pulse-position encoding.

In the second step, to capture the unique pairwise cache occupancy swing pattern in timing channels, we compute the product of Δx_i and Δy_i as z_i. Based on the discussion in the “Why Cache Occupancy Patterns Matter?” section, negative values of z_i occur when the cache occupancy patterns of the two processes move in opposite directions due to mutual cache evictions.

In the third step, our analyzer checks if the z series contains repeating negative pulses that may be caused by intentional eviction over a longer period of time (denoting illegal communication activity). To capture the repetitive swing patterns, we perform power spectrum analysis in the frequency domain on r_i, which is the autocorrelogram of z.

Figure 2(b) illustrates the autocorrelogram and power spectrum for a (victim, spy) pair in timing channels [13]. We can visually observe a sharp peak around a frequency of 290 in the power spectrum, which represents a strong communication strength indicating timing channel activity (see COTSknight [19] for further details on this algorithm).

Figure 2. LLC occupancy traces, autocorrelogram, and power spectrum for a cache timing channel (with parallel pulse-position) [13]. (a) Parallel protocol with pulse-position encoding. (b) Autocorrelogram (left) and power spectrum (right).

Implementation
We implement our framework prototype on a real system with an Intel Xeon E5-2698 v4 processor. The processor comes with 16 CLOS and 20 LLC slices, and each LLC slice has 20 × 2048 64-byte blocks. The LLC occupancy MSR reading is sampled at 1000/s.
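The analyzer steps described in this section (differentiate the two occupancy traces, multiply the differences into z, then inspect the power spectrum of the autocorrelogram of z) can be sketched as follows. This is a minimal illustration and not the COTSknight implementation: it assumes Δx_i is the first difference x_{i+1} − x_i, uses a biased, normalized autocorrelation estimate and a plain discrete Fourier transform, and runs on a synthetic pair of traces. All function names and the `max_lag` parameter are hypothetical.

```python
import cmath
import math


def differentiate(trace):
    """First difference of an occupancy trace (assumed form of the delta)."""
    return [b - a for a, b in zip(trace, trace[1:])]


def swing_series(x, y):
    """z_i = dx_i * dy_i; negative when the two occupancies move oppositely."""
    return [dx * dy for dx, dy in zip(differentiate(x), differentiate(y))]


def autocorrelogram(z, max_lag):
    """Biased, normalized autocorrelation r_k for lags 0 .. max_lag - 1."""
    n = len(z)
    mean = sum(z) / n
    c = [v - mean for v in z]
    energy = sum(v * v for v in c)
    if energy == 0:
        return [0.0] * max_lag  # flat series: no swing pattern at all
    return [sum(c[i] * c[i + k] for i in range(n - k)) / energy
            for k in range(max_lag)]


def power_spectrum(r):
    """Squared DFT magnitude of r for frequency bins 0 .. len(r) // 2."""
    n = len(r)
    spectrum = []
    for f in range(n // 2 + 1):
        acc = sum(r[k] * cmath.exp(-2j * math.pi * f * k / n) for k in range(n))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def peak(spectrum):
    """Strongest non-DC bin and its power."""
    f = max(range(1, len(spectrum)), key=spectrum.__getitem__)
    return f, spectrum[f]


# Synthetic traces: every 8 samples the trojan evicts the spy, then the spy
# takes the lines back (an assumption made purely for illustration).
spy = [16, 16, 0, 16, 16, 16, 16, 16] * 64
trojan = [0, 0, 16, 0, 0, 0, 0, 0] * 64

z = swing_series(spy, trojan)
freq, power = peak(power_spectrum(autocorrelogram(z, max_lag=64)))
print(freq)  # bin 8 of a 64-lag window, i.e. one swing every 8 samples
```

A bin index f from a window of `max_lag` lags corresponds to f × (sampling rate) / `max_lag` events per second, so with the 1000/s sampling mentioned above, bin 8 of 64 would correspond to 125 swings per second in this toy trace. How the peak power is thresholded to flag a domain pair is not shown here.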
Cache Way Allocation Manager
After the way allocation manager (allocator) is notified of suspicious domains by the analyzer, it configures the LLC using CAT to isolate the suspicious pairs by heuristically assigning nonoverlapping cache ways to each domain based on their ratio of LLC occupancy sizes during the last observation period. Our allocator evaluates two candidate policies: the Aggressive Policy, which keeps suspicious domains separated until one of them finishes execution, and the Jail Policy, which partitions the two domains until a timeout period expires.

EVALUATION
Power Spectra for Cache Timing Channels
We set up attack variants of cache timing channels [2], [13]–[15] that utilize ON–OFF and pulse-position encoding for spy reception and perform accesses to the cache either serially (trojan and spy) or in parallel (victim and spy), as described in the “Why Cache Occupancy Patterns Matter?” section. In each case, we also ran them alongside at least two SPEC2006 benchmarks with high LLC activity [8]. The analyzer performs power spectrum analysis based on time-differentiated LLC occupancy traces for six combination pairs of processes. In all
periodic patterns are reconstructed, and the cadence of cache accesses from adversaries will be recovered. Figure 4 demonstrates the detection of this attack scenario. For illustration, we implement a prototype of this attack by setting up the trojan and spy as two threads within the same process, and configure the main thread to control the synchronization. In reality, two separate trojan/victim and spy processes need to be synchronized. Figure 4(a) shows the LLC occupancy trace for this attack with random distances between the swing pulses. We can see that, with time warping, high signal power peaks are observed [see Figure 4(b)]. Additionally, when this signal compression preprocessing step is applied on benign workloads, we do not observe any increase in partition trigger rate.

Figure 4. Analysis of bit transmission at random intervals. (a) LLC occupancy changes for transmission with random intervals. Left half shows a snippet of the original trace with random bit intervals and right half shows the time-warped trace. (b) Power spectrum on time-warped LLC occupancy trace.

Case Study on Virtualized Environments
To evaluate the efficacy of our proposed framework, we perform a case study on a virtualized environment. This study is motivated by the growing trend in studying timing channel attacks in the cloud environment. We implement the para-on–off attack that works cross-VM (similar to Maurice et al. [14]).

We set up four KVM virtual machines where the trojan and spy run on two of the VMs, and simultaneously, two other VMs corun representative cloud benchmarks, namely video streaming (stream) and memcached (memcd) from CloudSuite [7], both of which are highly cache-intensive. The trojan and spy are set to start the para-on–off attack at a random time between 0 and 300 s. We configure the allocator to use the Aggressive policy to demonstrate the effectiveness of LLC partitioning. Figure 5 shows the peak signal power between the trojan and spy VM pair and the way allocation determination during the entire execution. We can see that the trojan and spy start to initiate communication at around 188 s (when we start to observe increasing signal power). The peak signal power between the trojan and spy domain pair quickly climbs up to 126 at time 192.5 s, which is when steady covert communication has begun. This quickly triggers the allocator’s action that splits the LLC ways between trojan and spy VMs. Consequently, the maximum signal power drops back to nearly zero for the rest of execution, effectively preventing any further timing channels. Note that during the 1-h experiment, the peak signal power values for the other domain pairs (involving CloudSuite applications) remained flat at values less than 3.

DISCUSSION
We propose a new framework that builds on COTS hardware and can be augmented with a host of signal processing techniques to eliminate
channels. Fang et al. [6] use hardware prefetchers to defend against cache timing channels. CATalyst [12] utilizes the CAT technology to reserve static cache partitions where secure pages are pinned upon request from applications. In contrast, our proposed mechanism successfully defeats cache timing channels without application/user-level inputs and partition reservation. Bazm et al. [1] leverage cache occupancy information to detect side channel behavior in conjunction with other performance counters such as cache misses. However, their proposed technique makes anomalous behavior determination based on cache footprint, which is subject to high false positive alarms. In contrast, our framework analyzes cache occupancy gain–loss patterns that are shown to be the unique characteristic for parties involved in timing channel activity, which is both effective and efficient. Recently, DAWG [11] has proposed secure cache partitioning by strictly isolating both cache hits and misses between application domains.

CONCLUSION
In this article, we proposed a novel framework to protect caches against timing channel attacks through smartly leveraging COTS support for cache monitoring and performance tuning. We implemented a prototype of our proposed technique on an Intel Xeon v4 server, and our experiments showed that our framework can successfully thwart several classes of cache timing channels in both native and virtualized environments with minimal performance overhead. We also discussed several futuristic threats and mechanisms to defeat such timing channels.

ACKNOWLEDGMENT
This work was supported by the U.S. National Science Foundation under Grant CNS-1618786, and by the Semiconductor Research Corporation Contract 2016-TS-2684. F. Yao performed this work as a graduate student at GWU.

REFERENCES
1. M. Bazm, T. Sautereau, M. Lacoste, M. Sudholt, and J. Menaud, “Cache-based side-channel attacks detection through Intel Cache Monitoring Technology and Hardware Performance Counters,” in Proc. 3rd Int. Conf. Fog Mobile Edge Comput., 2018, pp. 7–12.
2. J. Chen and G. Venkataramani, “CC-hunter: Uncovering covert timing channels on shared processor hardware,” in Proc. 47th Annu. IEEE/ACM Int. Symp. Microarchit., 2014, pp. 216–228.
3. Intel Corporation, Intel 64 and IA-32 Architectures Software Developer’s Manual, vol. 3B, 2016.
4. J. Demme et al., “On the feasibility of online malware detection with performance counters,” in Proc. 40th Annu. Int. Symp. Comput. Archit., 2013, pp. 559–570.
5. M. G. Elfeky, W. G. Aref, and A. K. Elmagarmid, “Warp: Time warping for periodicity detection,” in Proc. 5th IEEE Int. Conf. Data Mining, 2005, pp. 138–145.
6. H. Fang, S. S. Dayapule, F. Yao, M. Doroslovacki, and G. Venkataramani, “Prefetch-guard: Leveraging hardware prefetches to defend against cache timing channels,” in Proc. IEEE Int. Symp. Hardware Oriented Secur. Trust, 2018, pp. 187–190.
7. M. Ferdman et al., “Clearing the clouds: A study of emerging scale-out workloads on modern hardware,” ACM SIGPLAN Notices, vol. 47, pp. 37–48, 2012.
8. J. L. Henning, “SPEC CPU2006 benchmark descriptions,” ACM SIGARCH Comput. Archit. News, vol. 34, no. 4, pp. 1–17, 2006.
9. C. Hunger, M. Kazdagli, A. Rawat, A. Dimakis, S. Vishwanath, and M. Tiwari, “Understanding contention-based channels and using them for defense,” in Proc. IEEE 21st Int. Symp. High Perform. Comput. Archit., 2015, pp. 639–650.
10. Intel, “Intel-CMT-CAT Package,” 2017. [Online]. Available: https://github.com/01org/intel-cmt-cat
Expert Opinion
Toward Postquantum
Security for Embedded
Cores
Rafael Misoczki, Sean Gulley, Vinodh Gopal,
Martin G. Dixon, Hrvoje Vrsalovic, and
Wajdi K. Feghali
Intel Corporation
THE USE OF firmware agents (including their use to define the functionality of embedded cores) has proliferated on computer systems of all scales, especially servers. The agents are often not visible to the operating system as they independently perform configuration, monitoring, and certain control tasks. For example, the baseboard management controller (BMC) on server platforms runs a firmware stack (e.g., OpenBMC (https://github.com/openbmc/openbmc)), has a network port, some peripherals on external buses (e.g., I2C, SPI, etc.), and storage. The BMC is effectively another full (but scaled-down) computing system on the server of which it is a component. While the BMC can directly affect the operation of a server (e.g., by controlling its power states), it does not interact with the OS. Its behavior is defined completely by its firmware.

Improvements in silicon manufacturing processes have reduced the size of these cores to the point where they can be physically embedded within silicon dies of the system components (such as a CPU) on which they operate. In parallel, the volume and importance of their responsibilities have increased, beginning with power management and escalating to security operations that can affect functional safety. Given this, it is critical that they run only signed and authenticated code.

One of today’s best practices to authenticate the firmware that runs on embedded cores is public key cryptography (digital signatures), which relies on FIPS-140 digital signature algorithms such as RSA and EC-DSA. However, quantum computing will render these algorithms useless, since factorizing integers and solving the discrete logarithm problem (i.e., the underlying security problems of RSA and EC-DSA) will be solvable in polynomial time [11]. This implies that increasing RSA/ECC key sizes will be insufficient to defeat a quantum adversary. Prof. Michele

Digital Object Identifier 10.1109/MM.2019.2920203
Date of current version 23 July 2019.
hand, such proliferation of the embedding of microcontrollers, their increased handling of operation-critical tasks for the overall system, and the easy access to sophisticated development tools have made microcontroller cores’ firmware an attractive target for malfeasance. For product assurance, the firmware needs to be authenticated and protected against tampering.

FIRMWARE SIGNATURES
To assert that the code running on a given microcontroller originated from the expected source (the vendor) and has not been tampered with since it was deployed, the firmware image must be signed and attested: it must be possible to assure that the firmware originated from a specific vendor (e.g., Intel) and that the bits that comprise the firmware binary have not been altered by a third party between the time it was installed and the time it is loaded. This is commonly done with digital signature verification. A simplified example of performing this verification is to sign a message digest (or hash) of the firmware binary with a vendor private key and append the signature to the firmware. A system that has access to the firmware vendor’s public key can then verify the authenticity of this hash value. Any signature verification failure signals a modified firmware or the use of a foreign private key (i.e., the signature did not originate from the vendor). Thus, this mechanism prevents attackers from generating valid signatures of modified firmware.
In practice, firmware signature verification is complicated by the constraints on the environment in which the verification is to take place; since it usually occurs (at least once) very early in the system boot process, limited memory and/or computational resources are available. To provide a tangible, real-world example of firmware signatures, we discuss Intel’s microcode patch.

Intel Microcode Signing
The constraints on microcode are significant. Microcode may be loaded multiple times from the power-on of a processor, and since it must be verified each time, the duration to decrypt and verify is critical to system responsiveness. Additionally, since main memory is not available in the earliest stages of initialization, microcode verification is done within the processor’s cache, so it must fit into a relatively small footprint (e.g., 16 to 64 KB). Microcode as a firmware mainly differs from a traditional microcontroller firmware image by its target processor core: the Core and Atom processors are multiple orders of magnitude higher performance than a typical embedded microcontroller core. While microcode, having the benefit of a much faster execution via the main CPU, could potentially adopt cryptographically stronger (but more computation-resource demanding) algorithms, those same algorithms may be unsuitable for the more constrained embedded cores.

STATE OF ART ON POSTQUANTUM DIGITAL SIGNATURES
There has been a recent surge in activity in the field of PQC, with several new schemes proposed every year and a rapidly growing community of researchers that scrutinize the robustness, performance, and ease of use of new and well-established postquantum cryptosystems. This increased interest may be related to the recently established standardization processes on PQC. The National Institute of Standards and Technology (NIST) started a project to analyze possible PQC candidates for standardization. The Internet Engineering Task Force (IETF) has recently published informational Request for Comments (RFC) documents that specify stateful postquantum digital signatures such as the XMSS scheme.1 Experts from the International Organization for Standardization (ISO) have also been working on a standing document on PQC and, more recently, on a study period on stateful hash-based signatures. In this section, we provide a brief analysis of the maturity, robustness, and performance of the signature schemes considered in the aforementioned PQC standardization processes.
Since most of the schemes discussed in this section have been submitted to the NIST PQC project, we will give some additional context about this process. In November 2017, NIST
proof of knowledge. LowMC has been chosen as the underlying block cipher since it gives smaller signature sizes. Still, the signature for level 1 is 33.23 KB and the public key size is 32 bytes (“picnic-L1-FS” parameter set). For level 5, the signature size is 129.74 KB and the public key size is 64 bytes (“picnic-L5-FS” parameter set). For level 1, signing takes 137.91 million cycles and verification takes 90.63 million cycles, while for level 5 they take 1,112.23 million and 736.31 million cycles, respectively.
The symmetric-crypto-based candidates have very strong security guarantees. If state management is possible, stateful HBS schemes may be the most promising approach since they are attractive from both security and performance perspectives. It is worth mentioning that digital signatures applied to verify firmware authenticity of embedded cores seem to be one application where state management is possible. The fact that the signatures are generated by manufacturers (and not end users), who can afford a robust signing facility with state management capabilities, seems to facilitate the adoption of stateful HBS schemes in this scenario. Also, the other limitation of certain stateful HBS schemes, namely that they can issue only a limited number of signatures, does not seem to be a problem since manufacturers can, most of the time, predict how many firmware updates a device will receive during its lifetime. On the other hand, if state management is not possible, stateless schemes may be advisable; however, they are less attractive in terms of both size and speed.

Lattice-Based Schemes
In this category, we have Falcon,3 CRYSTALS-DILITHIUM,2 and qTesla.9 Among the PQC families that introduce additional security assumptions, lattices are one of the most popular approaches. From a side-channel perspective, the Gaussian sampling process seems challenging to implement in a side-channel-resilient way.
Falcon is based on the Short Integer Solution problem, known in the crypto community for some time, but it is applied to (structured) NTRU lattices. For level 1, Falcon signatures are 617 bytes and the public key is 897 bytes; signing takes 542,000 cycles and verification takes 88,000 cycles. For level 5, these numbers change to 1.20 KB, 1.75 KB, 1.07 million cycles, and 186,000 cycles, respectively.
CRYSTALS-DILITHIUM uses module lattice problems, which can be viewed as lying between the ones used in Learning-With-Errors (LWE) and Ring-LWE problems. In other words, according to the authors, they are just as efficient as Ring-LWE schemes but closer to the (stronger, more conservative) LWE underlying security problem. For levels 1 and 3, CRYSTALS-DILITHIUM offers signatures of size 1.99 and 3.28 KB, and public keys of size 1.15 and 1.71 KB, respectively. The authors did not provide parameters for level 5. In terms of speed, for level 1, it takes 1.3 million cycles to sign and 272,000 cycles to verify. For level 3, it takes 1.82 million and 510,000 cycles, respectively.
qTesla is based on the well-known Ring-LWE problem, and it offers good performance. On April 14, 2019, researchers presented a potential attack against qTesla that may affect some of its parameter sets (qTesla’s authors have yet to respond). From a performance perspective, for level 1, qTesla offers signatures of size 1.3 KB and public keys of size 1.5 KB, and for level 5 signatures of size 5.9 KB and public keys of size 6.4 KB. Regarding speed, for level 1, signature generation takes 492,000 cycles and verification takes 82,000 cycles, while for level 5, signature generation takes 2.1 million cycles and verification takes 394,000 cycles.
The lattice-based cryptography field is a popular PQC approach. One point of attention is the secure selection of parameters, which still seems to be challenging depending on the underlying security problem. From a performance perspective, all three lattice candidates offer reasonable performance and should be considered promising candidates.

Multivariate Quadratic (MQ)-Based Schemes
In this category, we have GeMSS,4 Rainbow,10 LUOV,6 and MQDSS.7 Several MQ schemes have been proposed in the past and subsequently broken. Security has become more stable in recent years; however, MQ remains the PQC family of digital signatures whose security is the
Table 1. Comparison for security level 1 or the closest. Sizes in KB, speed in millions of cycles.
Public key size      0.03   0.03     0.03     0.87   1.15   1.50   417.40   149.00   12.10   0.04
Signing speed        –      340.00   137.91   0.54   1.37   0.49   690.00   0.40     5.40    26.63
Verification speed   0.39   14.98    90.63    0.08   0.27   0.08   29.10    0.15     4.30    19.84
least understood. The main benefit of MQ schemes is the compact signature sizes.
GeMSS is a scheme based on the hidden field equations problem and can be seen as a variant of Quartz, a scheme proposed in 2001, which remains one of the few unbroken MQ schemes. GeMSS signatures are very compact: only 258 bits for level 1, and 588 bits for level 5. Public key sizes are 417 KB and 3,046.84 KB, respectively. However, GeMSS is not speed efficient: for level 1, signing takes 6,690 million cycles and verification takes 29 million cycles, while for level 5, signing takes 25,300 million cycles and verification takes 172 million cycles.
Rainbow is a signature scheme based on the well-known Unbalanced-Oil-and-Vinegar (UOV) signature scheme (which itself is based on the Oil-and-Vinegar scheme). From a practical perspective, for level 1, the signature is 512 bits long and the public key is 149.00 KB long, while signing takes 402,000 cycles and verification takes 155,000 cycles. For level 5, the signature is 159 KB and the public key is 1,227.10 KB long, while signing takes 3.6 million cycles and verification takes 2.3 million cycles.
Lifted-UOV (LUOV) is a scheme also based on the UOV scheme. The main difference from UOV consists of some optimizations to reduce the public key size (e.g., lifting the UOV public key to an extension field). For level 2, signature and public key sizes are 311 bytes and 12.1 KB, respectively. Signing takes 5.4 million cycles and verification takes 4.3 million cycles. For level 5, signature and public keys are 494 bytes and 75.5 KB, respectively, while signing takes 24 million cycles and verification takes 18 million cycles.
MQDSS is a scheme based on the combination of the Sakumoto–Shirai–Hiwatari (SSH) identification scheme with the Fiat-Shamir transform. This is a very innovative proposal. For level 1, its signature and public key sizes are 20 KB and 46 bytes long, respectively, while signing takes 26 million cycles and verification takes 19 million cycles. For level 3, its signature and public key sizes are 42 KB and 64 bytes long, respectively, while signing takes 85 million cycles and verification takes 62 million cycles.
MQ-based schemes offer interesting performance benefits, such as tiny signatures from GeMSS or tiny public keys from MQDSS. However, the field of multivariate-quadratic schemes would still benefit from a more comprehensive security analysis. The PQC standardization process may help by promoting these schemes and thus attracting an increasing number of researchers to expand the knowledge in this field and increase the confidence of potential users.
Tables 1 and 2 show the performance of the second round candidates of the NIST PQC competition plus the XMSS scheme published in IETF RFC8391. We acknowledge that these numbers (most of them obtained from the submission packages) were collected on different platforms, and therefore the speed numbers should be taken as a rough approximation of the actual performance, useful when considered from the orders-of-magnitude perspective.
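The firmware-signing flow described in this article (hash the firmware binary, sign the digest with a vendor private key, verify with the vendor public key) can be illustrated with the simplest hash-based one-time signature, Lamport's scheme. This is a minimal sketch under stated assumptions, not XMSS itself: XMSS uses a more compact one-time scheme with a Merkle tree on top, and the names `keygen`, `sign`, and `verify` are hypothetical helpers chosen for this example.

```python
import hashlib
import secrets

HASH_BITS = 256  # SHA-256 digest length in bits


def _h(data: bytes) -> bytes:
    """Hash helper used for both the firmware digest and the key material."""
    return hashlib.sha256(data).digest()


def keygen():
    """Return (private_key, public_key); the pair must sign ONE image only."""
    sk = [(secrets.token_bytes(32), secrets.token_bytes(32)) for _ in range(HASH_BITS)]
    pk = [(_h(s0), _h(s1)) for s0, s1 in sk]
    return sk, pk


def _digest_bits(firmware: bytes):
    """Bits of the firmware digest, most significant bit of each byte first."""
    digest = _h(firmware)
    return [(digest[i // 8] >> (7 - i % 8)) & 1 for i in range(HASH_BITS)]


def sign(sk, firmware: bytes):
    """Reveal one secret per digest bit (vendor side, private key required)."""
    return [sk[i][bit] for i, bit in enumerate(_digest_bits(firmware))]


def verify(pk, firmware: bytes, signature) -> bool:
    """Re-hash each revealed secret and compare with the public key (device side)."""
    if len(signature) != HASH_BITS:
        return False
    bits = _digest_bits(firmware)
    return all(_h(s) == pk[i][bit] for i, (s, bit) in enumerate(zip(signature, bits)))


if __name__ == "__main__":
    sk, pk = keygen()
    image = b"example firmware image v1"  # placeholder bytes, not a real image
    sig = sign(sk, image)
    print("genuine image accepted:", verify(pk, image, sig))
    print("modified image accepted:", verify(pk, image + b"\x00", sig))
```

Each key pair here may sign only a single image, because a signature reveals half of the private key; tracking which keys have been used is the state management burden that falls on the signing facility. With these parameters the signature is 8 KB and the public key 16 KB, which hints at why practical hash-based schemes use more compact constructions.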
Table 2. Comparison for security level 5 or the closest. Sizes in KB, speed in millions of cycles.
Public key size      0.06   0.06       0.06       1.75   1.71   6.43   3,046.84   1,227.10   75.50   0.06
Signing speed        –      1,491.94   1,112.23   1.07   1.82   2.15   25,300     3.64       24.00   85.26
Verification speed   0.74   37.59      736.31     0.18   0.51   0.39   172.00     2.39       18.00   62.30
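To make the orders-of-magnitude comparison concrete, the sketch below filters candidates against a memory footprint and a verification-cycle budget. The level 1 sizes and verification costs are the ones quoted in the text of this article (LUOV is omitted because the text gives level 2 figures for it). The budget values, the choice to count signature plus public key against the footprint, and the names `Candidate`, `fits`, and `shortlist` are assumptions made for illustration only.

```python
from dataclasses import dataclass

KB = 1024  # the article's tables are consistent with 1 KB = 1024 bytes


@dataclass(frozen=True)
class Candidate:
    name: str
    sig_bytes: float       # signature size in bytes
    pk_bytes: float        # public key size in bytes
    verify_cycles: float   # verification cost in CPU cycles


# Level 1 figures as quoted in the article text.
LEVEL1 = [
    Candidate("Picnic", 33.23 * KB, 32, 90.63e6),
    Candidate("Falcon", 617, 897, 88_000),
    Candidate("CRYSTALS-DILITHIUM", 1.99 * KB, 1.15 * KB, 272_000),
    Candidate("qTesla", 1.3 * KB, 1.5 * KB, 82_000),
    Candidate("GeMSS", 258 / 8, 417 * KB, 29e6),
    Candidate("Rainbow", 512 / 8, 149.00 * KB, 155_000),
    Candidate("MQDSS", 20 * KB, 46, 19e6),
]


def fits(c: Candidate, footprint_bytes: float, max_verify_cycles: float) -> bool:
    """True if signature plus public key fit the footprint and verification is fast enough."""
    return (c.sig_bytes + c.pk_bytes <= footprint_bytes
            and c.verify_cycles <= max_verify_cycles)


def shortlist(candidates, footprint_bytes: float, max_verify_cycles: float):
    """Candidates that meet both budgets, fastest verification first."""
    ok = [c for c in candidates if fits(c, footprint_bytes, max_verify_cycles)]
    return sorted(ok, key=lambda c: c.verify_cycles)


if __name__ == "__main__":
    # Hypothetical budgets: 16 KB of verification-time storage, 1 million cycles.
    for c in shortlist(LEVEL1, 16 * KB, 1_000_000):
        print(c.name, round((c.sig_bytes + c.pk_bytes) / KB, 2), "KB", int(c.verify_cycles), "cycles")
```

Under these assumed budgets only the three lattice candidates remain, which matches the article's remark that they offer reasonable performance. Because the cycle counts were collected on different platforms, such a filter is only meaningful for coarse thresholds.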
of the computation of GCM, and the adoption rate in the figure below, gathered from the ICSI Notary (https://notary.icsi.berkeley.edu/).
In the figure above, the “performance” of running the GCM algorithm (presented as “cost in cycles per byte,” where lower cost per byte indicates higher performance) corresponds to various Intel processors that launched on those dates, with a trend of continuously improving GCM performance. Notably, in 2013 there was a dramatic increase in GCM performance (indicated by a significant drop in the cost in cycles/byte), which we believe contributed to a steep rise of GCM adoption across the industry.

INTEL’S DIRECTION ON PQC AND FIRMWARE
Intel’s strategy is to continue securing platforms using cryptographically strong digital signature standards that execute efficiently. Intel will continue to be aligned with standardization organizations and will evaluate algorithms based on security assurance, hardware cost, and performance, including that in embedded cores. Until the full PQC transition has completed, a hybrid solution, where two or more algorithms are executed in parallel in order to remove dependence on a single algorithm or class of algorithms, seems an interesting approach. This should allow for more flexibility and minimize redesign time of embedded microcontroller systems should a single algorithm (or class of algorithms) become infeasible to deploy for their firmware authentication.

CONCLUSION
Firmware authenticity is a key factor in the proper function and security assurance of a platform built with embedded microcontroller cores. With the need for PQC-secure cryptographic algorithms to continue to assure this authenticity, Intel is investigating possibilities currently under consideration in various standardization processes. Our priority in this selection is and will always be security. Regarding performance, we are specifically focusing on the capabilities of platforms’ embedded cores to execute PQC algorithms and looking for algorithms that have parameters allowing flexibility in fitting their execution to the capabilities of both current and planned embedded cores.

REFERENCES
1. A. Huelsing, D. Butin, S. Gazdag, J. Rijneveld, and A. Mohaisen, (2018). XMSS: eXtended Merkle Signature Scheme - Request For Comment 8391 (RFC 8391). Internet Engineering Task Force (IETF), Retrieved: 24 June, 2019, https://tools.ietf.org/html/rfc8391
2. V. Lyubashevsky, L. Ducas, E. Kiltz, T. Lepoint, P. Schwabe, G. Seiler, and D. Stehle, (2017). CRYSTALS-DILITHIUM - A Submission to the NIST Post-Quantum Cryptography Standardization Project. National Institute of Standards and Technology (NIST). Retrieved: 24 June, 2019, https://pq-crystals.org/
3. T. Prest, P.-A. Fouque, J. Hoffstein, P. Kirchner, V. Lyubashevsky, T. Pornin, and Z. Zhang, (2017). Falcon - A Submission to the NIST Post-Quantum Cryptography Standardization Project. National Institute of Standards and Technology (NIST). Retrieved: 24 June, 2019, https://falcon-sign.info/
4. A. Casanova, J.-C. Faugere, G. Macario-Rat, J. Patarin, L. Perret, and J. Ryckeghem, (2017). GeMSS - A Submission to the NIST Post-Quantum Cryptography Standardization Project. National Institute of Standards and Technology (NIST). Retrieved: 24 June, 2019, https://www-polsys.lip6.fr/Links/NIST/GeMSS.html
5. L. K. Grover, “A fast quantum mechanical algorithm for database search,” in Proc. 28th Annu. ACM Symp. Theory Comput., 1996, pp. 212–219.
6. W. Beullens, B. Preneel, A. Szepieniec, and F. Vercauteren, (2017). LUOV - A Submission to the NIST Post-Quantum Cryptography Standardization Project. National Institute of Standards and Technology (NIST). Retrieved: 24 June, 2019, https://www.esat.kuleuven.be/cosic/pqcrypto/luov/
7. S. Samardjiska, M-S. Chen, A. Hulsing, J. Rijneveld, and P. Schwabe, (2017). MQDSS - A Submission to the NIST Post-Quantum Cryptography Standardization Project. National Institute of Standards and Technology (NIST). Retrieved: 24 June, 2019, http://mqdss.org/
8. G. Zaverucha, M. Chase, D. Derler, S. Goldfeder, C. Orlandi, S. Ramacher, and V. Kolesnikov, (2017). Picnic - A Submission to the NIST Post-Quantum Cryptography Standardization Project. National Institute of Standards and Technology (NIST). Retrieved: 24 June, 2019, https://microsoft.github.io/Picnic/
9. N. Bindel, S. Akleylek, E. Alkim, P. S. Barreto, J. Buchmann, E. Eaton, and G. Zanon, (2017). qTesla - A Submission to the NIST Post-Quantum Cryptography Standardization Project. National Institute of Standards and Technology (NIST). Retrieved: 24 June, 2019, https://qtesla.org/
10. J. Ding, M-S. Chen, A. Petzoldt, D. Schmidt, and B-Y. Yang, (2017). Rainbow - A Submission to the NIST Post-Quantum Cryptography Standardization Project. National Institute of Standards and Technology (NIST).
11. P. Shor, “Algorithms for quantum computation: Discrete logarithms and factoring,” in Proc. 35th Annu. Symp. Foundations Comput. Sci., 1994, pp. 124–134, Santa Fe: IEEE Comput. Soc. Press.
12. D. Bernstein, C. Dobraunig, M. Eichlseder, S. Fluhrer, S. Gazdag, A. Hülsing, and F. Mendel, (2017). SPHINCS+ - A Submission to NIST Post-Quantum Cryptography Standardization Project. National Institute of Standards and Technology (NIST). Retrieved: 24 June, 2019, https://sphincs.org/

Rafael Misoczki is a cryptographer/research scientist at Intel Labs. His work is focused on post-quantum cryptography and its application to secure update, root of trust, remote attestation, and other security flows. He has a PhD from the University of Paris (Pierre et Marie Curie), with a thesis on efficient constructions for post-quantum cryptography. He also holds an MSc in electrical engineering and a BSc in computer science from the University of Sao Paulo. Contact him at rafael.misoczki@intel.com.

Sean Gulley is a principal engineer at Intel’s Data Center Group responsible for anticipating and accelerating new algorithmic intensive workloads. Since joining Intel in 2001, he has focused primarily on cryptography and compression HW and SW solutions for client and data center. He has a BS in computer engineering from Tufts University and an MS in electrical engineering from Stanford University. He has over 30 U.S. patents. Contact him at sean.gulley@intel.com.

Vinodh Gopal is a senior principal engineer at Intel, working in the Data Center Group. His work includes accelerators, instruction-set extensions for x86 and architectural enhancements to processors, in applications such as cryptography, integrity, compression, and analytics over a range of products. In 2019, he won the Intel Inventor of the Year Award. Contact him at vinodh.gopal@intel.com.

Martin G. Dixon is an Intel Fellow in the Intel Product Assurance and Security (IPAS) group and director of architecture at Intel Corporation. He is responsible for guiding future research and architecture decisions to secure Intel’s platforms. He has published a dozen academic papers in the field of computer architecture and holds 50 patents in the field of computer architecture and cryptography. He has a bachelor’s degree in electrical and computer engineering from Carnegie Mellon University. Contact him at martin.dixon@intel.com.

Hrvoje Vrsalovic has been involved in firmware and app-to-device interface software development in one form or another ever since being part of the original team that created Palm’s WebOS in 2008. Since then, he has worked on software (and sometimes hardware) of many “smart” consumer products, particularly wearables. He recently joined Intel’s IPAS group as a security architect. He has a BSc in computer science from UCSB and an MSc in electrical and computer engineering from Carnegie Mellon University. Contact him at harvey.vrsalovic@intel.com.

Wajdi K. Feghali is an Intel Fellow and the director of the Security and Algorithms Center of Innovation in the Data Center Group at Intel Corporation. He leads the development of cryptography, compression, data integrity and data de-duplication hardware and software solutions with a focus on efficient performance across Intel products. He has been granted more than 50 U.S. patents, with numerous other patents pending, and is the author of several published technical papers. He has a bachelor’s degree in mathematics with a minor in computer science from the University of Ottawa. Contact him at wajdi.feghali@intel.com.
Expert Opinion
Energy-Secure System Architectures (ESSA): A Workshop Report
Pradip Bose, IBM T. J. Watson Research Center
Saibal Mukhopadhyay, Georgia Institute of Technology
Ever since on-chip and system-level power management architectures have become routine in the industry, concerns about reliable operation and associated security vulnerabilities have been present in the minds of both the designer and researcher community.

MODERN MICROPROCESSOR CHIPS have multiple processing engines (or cores) that are architected to solve a variety of problems in individual and cooperative execution modes. In the current regime of commercial designs, we already see double-digit core counts; and if one considers the degree of hardware multithreading supported in each core, the number of hardware threads that can be supported in concurrent execution adds up to many dozens or scores. For example, IBM’s prior generation POWER8 processor chip already supported up to 96 hardware threads via its 12 cores, each of which can execute in up to an eight-way simultaneously multithreaded mode. As explained in recent ISSCC technology trend data, while the core count growth has been steady, the clock frequency has saturated around the 4-GHz mark, mainly limited by power density (or temperature) constraints. Effective parallelization of application codes, supported by many-core/many-thread hardware engines, is the established trend in current computing. Since 96-thread POWER8 server chips have already been in the market for a few years, it is not unrealistic to expect around 50 cores and perhaps 200 hardware threads supported in a couple of generations. Of course, due to area pressures, one can expect to see leaner (simpler) cores with only modest single-thread performance growth.
This technology- and market-driven trend toward throughput-oriented (scale-out) designs implies a major challenge in terms of chip-level power and/or thermal management, in a regime where balanced performance growth (single-thread versus throughput) at affordable power becomes a steeper challenge over time. And, at the full system (i.e., server, rack, or data center) level, the challenge can be even greater. At whatever scale one is interested in managing such metrics (i.e., power or temperature, or even related ones, like system reliability),

Digital Object Identifier 10.1109/MM.2019.2921508
Date of current version 23 July 2019.
July/August 2019 Published by the IEEE Computer Society 0272-1732 © 2019 IEEE
Table 1. Summary description of SSITH projects.
row-hammering) and data privacy. The talk also explored circuit and system level methods for sensing and inhibiting attacks on NVMs. The talk reminded the audience that new technologies introduced for energy management can also lead to new security challenges.
Prof. Vijay Janapa Reddi (Harvard) spoke on: “Closing the performance, power and reliability gap in autonomous aerial machines.” Reddi has been working on the topic area of “aerial computing,” and this particular presentation addressed the fundamental issues of power-performance efficiency and resilience in designing the embedded processor engines that power autonomous drones. The connection between reliability and security, as also articulated in Sethumadhavan’s visionary talk, was re-examined briefly in Reddi’s presentation.

Contributed Regular Papers
The regular technical presentations consisted of the following contributed papers:
K. Khatamifard, L. Wang, S. Kose, A. Das, U. Karpuzcu, “A novel class of covert channels enabled by power budget sharing.”
A. Krishnan, P. Schaumont, “Hardware support for secure intermittent architectures.”
I. Tochukwu and A. Ismail, “Holistic hardware security assessment framework: a microarchitectural framework.”
D. Trilla, C. Hernandez, J. Abella, F. Cazorla, “Four birds with one stone: on the use of time randomized processors and probabilistic analysis to address timing, reliability, energy and security in critical embedded autonomous systems.”

It is well known that runtime power management is in charge of the optimal distribution of the power budget (a very critical shared resource) among system components. The paper by Khatamifard et al. argued that any system-wide shared resource can give rise to covert communication if not properly managed, and the power budget, unfortunately, is no exception. The paper presented a proof-of-concept demonstration of covert communication exploiting a shared power budget and discussed the potential design space for countermeasures. The paper argued that a secure power management infrastructure must be aware of the potential threats associated with sharing power across multiple entities.
The paper by Krishnan et al. stressed the need for a secure power transition mechanism to convert the active system state into a protected non-volatile form and back in energy-harvesting-based IoT edge platforms. The paper observed that secure checkpoints are necessary, but are expensive to compute and require hardware-accelerated cryptography and isolated secure non-volatile storage. The paper defined an energy-harvester subsystem interface that drives the optimized execution of a secure communication protocol such that wasted energy is eliminated and run-time performance is improved.
The paper by Tochukwu et al. presented the challenging but critical need to enable a holistic hardware security evaluation from the microarchitectural point of view. The paper introduced an important step in this direction by proposing a framework that categorizes threat models based on the microarchitectural components being targeted and provides a generic security metric that can be used to assess the vulnerability of components, as well as the system as a whole.
Finally, the paper by Trilla et al. discussed how the increasing complexity and time-criticality of operations performed in autonomous vehicles create conflicting requirements in designing such processors. On one hand, a simple and predictable processor design facilitates verification of functional and nonfunctional metrics; on the other hand, high-performance and complex processor designs with some degree of obfuscation can deliver high computing performance and security. The paper argued that time-randomized processors (TRP), an alternative to traditional (deterministic) designs, can address these conflicting requirements. TRP facilitates timing analysis via the use of statistical/probabilistic techniques, while also showing capabilities to effectively tackle the challenges of reliability, security, and energy consumption. The paper reviewed the TRP opportunities and showed that they are a natural fit to fulfill the requirements of autonomous critical systems. The paper showed that disruptive ideas in the macro- and microarchitecture may be necessary to design future energy-secure autonomous systems.

IMPACT OF THE ESSA THEME WORKSHOPS
In this section, we will briefly examine the ongoing impact of the ESSA theme workshop
with the low-cost packaging solution. The experiments demonstrated the possibility that throttling-based performance degradation (at a very significant level) could be instigated for a given processor-package system product by launching a high-power (possibly synthetic) virus workload.
The literature on side-channel attack mechanisms that exploit power and/or EM monitoring has advanced considerably since the inception of the first ESSA workshop in 2011. This is evidenced even by some of the technical and visionary talks presented at ESSA 2019. The threat imposed by sharing of a common power budget in a multicore chip setting has been described in work by Sasaki et al.,11 and the paper by Khatamifard et al. presented at ESSA 2019 shows the consequence of this threat model in explicit terms.
While at the system scale the ESSA-themed workshops have studied the potential security threats introduced by power management solutions, there has been significant progress in recent years in understanding the energy-security tradeoff at the circuit level. In particular, recent research threads have emerged in designing energy-efficient security engines, as well as exploring on-chip power-management circuits for security. A specific example of this new direction has been in the domain of designing low-overhead techniques for improving power and EM-based side-channel-attack (SCA) resistance of encryption engines. In particular, collaborative work between Georgia Tech and Intel Labs has produced a set of studies where on-chip integrated voltage regulators (IVRs) and adaptive clocking circuits, introduced mostly for power management, have been leveraged to improve SCA resistance. A paper presented by Kar et al. at the 2017 International Solid-State Circuits Conference (ISSCC) demonstrated the promise of using inductive IVRs for inhibiting power attacks on AES engines. A second article from the group, authored by Singh et al. and published in the IEEE JOURNAL OF SOLID STATE CIRCUITS (JSSC) in 2019, showed that an inductive IVR coupled with fine-grain dynamic voltage scaling and adaptive clocking can inhibit power and EM-based side-channel attacks on AES engines. More recently, an article presented at ISSCC 2019 by Singh et al. showed that on-chip low-dropout regulators, coupled with adaptive clocking and fine-grain DVFS, provide security against power-/EM-based side-channel attacks against AES engines. There are many more examples in recent literature where circuit level studies are being performed, and techniques are being developed to enable a bottom-up approach to energy-secure hardware designs. The advancements in circuit-level security research showed the need for engaging the circuit community within the ESSA theme, and ESSA 2019 took a positive step toward this goal. The success is evident from the talks presented at ESSA 2019 by Dr. De, Dr. Ghosh, and Dr. Seok, all of which pointed out the need for circuit level research in this domain.

FUTURE DIRECTIONS
In this section, we provide our view of the future directions of research and development within the ESSA theme. One of the early research agenda items at IBM that fell out of the ESSA theme was that of guarded power management.10 In this solution approach, the baseline power management architecture is protected through a guard mechanism. The latter is a higher level monitor-and-control system that observes the operation of the baseline architecture through specialized activity (performance) counters. Anomalies detected in the observed counter-based signatures can serve to trigger mitigation actions. These could include adapting the hardware parameters of the baseline mechanism on the fly. The above-referenced work was pursued with robust power management in mind. Future work must connect unreliable control loops explicitly to vulnerabilities from a mainstream security research viewpoint. In other words, the guarded power management principle must be tested as a mitigation technique against CLKSCREW-inspired attacks.
The thrust of research in support of power reduction in the wake of GPU-centric high-performance compute nodes has led to techniques like adaptive voltage guard-band management (e.g., J. Leng et al., MICRO 2015). In future accelerator-rich systems, the task of balancing power,
performance, and reliability will have to be managed using systematic hardware-software management systems. Purely software-based scheduling heuristics will need to be supported by hardware-based monitors. As we progress toward many-core processor chips, old-style on-chip power control architectures (with a single, centralized management unit) will give way to scalable, distributed control and management systems. An initial vision of so-called swarm power management architectures has been portrayed in recent invited papers (e.g., the one at DATE 2018¹¹).

ACKNOWLEDGMENTS
The ESSA 2019 workshop,¹³ the proceedings (and anticipated impact) of which were summarized in this article, would not have been possible without the help of an active and supportive program committee. The valuable contributions of our web and publicity chair, David Trilla, are to be noted in particular. We are grateful to the organizing committee of the HOST 2019 symposium for their support.

REFERENCES
1. S. Gunther and R. Singhal, “Next generation intel microarchitecture (nehalem) family: Architectural insights and power management,” presented at Intel Developer Forum, San Francisco, CA, USA, Mar. 2008.
2. M. Floyd et al., “Introducing the energy management features of the POWER7 chip,” IEEE Micro, vol. 31, no. 2, pp. 60–75, Mar./Apr. 2011.
3. T. Webel et al., “Robust power management in the IBM z13,” IBM J. R&D, vol. 59, no. 4/5, pp. 16-1–16-12, Jul./Sep. 2015.
4. S. Govindavajhala and A. W. Appel, “Using memory errors to attack a virtual machine,” in Proc. IEEE Symp. Secur. Privacy, 2003, pp. 154–165.
5. C. Miller, “Battery firmware hacking,” presented at BlackHat, August 2011. [Online]. Available: https://www.blackhat.com/html/bh-us-11/bh-us-11-briefings.html#Miller
6. Z. Wu, M. Xie, and H. Wang, “Energy attack on server systems,” in Proc. 5th USENIX Workshop Offensive Technol., 2011, p. 8.
7. A. Tang, S. Sethumadhavan, and S. Stolfo, “CLKSCREW: Exposing the perils of security oblivious energy management,” in Proc. USENIX Secur. Symp., 2017, pp. 1057–1074.
8. A. Tang, S. Sethumadhavan, and S. Stolfo, “Motivating security-aware energy management,” IEEE Micro, vol. 38, no. 3, pp. 98–106, May/Jun. 2018.
9. H. Hamann, A. Weger, J. Lacey, Z. Hu, and P. Bose, “Hotspot-limited microprocessors: direct temperature and power distribution measurements,” IEEE J. Solid State Circ., vol. 42, no. 1, pp. 56–65, Jan. 2007.
10. N. Madan, A. Buyuktosunoglu, P. Bose, and M. Annavaram, “A case for guarded power gating for multi-core processors,” in Proc. 17th Int. Symp. High Performance Comput. Arch., Feb. 2011, pp. 291–300.
11. H. Sasaki, A. Buyuktosunoglu, A. Vega, and P. Bose, “Mitigating power contention: A scheduling based approach,” Comput. Arch. Lett., vol. 16, no. 1, pp. 60–63, 2017.
12. A. Vega, A. Buyuktosunoglu, and P. Bose, “Energy-secure swarm power management,” in Proc. Design Test Eur., 2018, pp. 1652–1657.
13. ESSA-2019 Workshop. [Online]. Available: https://www.essa-workshop.org/
14. P. Bose et al., “Power management of multi-core chips: Challenges and pitfalls,” in Proc. Design Test Eur., 2012, pp. 977–982.
15. J. Kim et al., “DRaNGe: Using commodity DRAM devices to generate true random numbers with low latency and high throughput,” in Int’l. Symp. High Perform. Comput. Arch. (HPCA), Feb. 2018.
16. J. Kim et al., “The DRAM latency PUF: Quickly evaluating physical unclonable functions by exploiting the latency-reliability tradeoff in modern commodity DRAM devices,” in Int’l. Symp. High Perform. Comput. Arch. (HPCA), Feb. 2018.
17. L. Orosa et al., “Dataplant: in-DRAM security mechanisms for low-cost devices,” arxiv, 2019, [Online]. Available: https://arxiv.org/abs/1902.07344

Pradip Bose is with IBM T. J. Watson Research Center, New York. Contact him at: pbose@us.ibm.com.

Saibal Mukhopadhyay is with Georgia Institute of Technology. Contact him at: saibal@ece.gatech.edu.
Expert Opinion
FinalFilter: Asserting Security Properties of a Processor at Runtime

Cynthia Sturton, University of North Carolina at Chapel Hill
Samuel T. King, University of California, Davis
Matthew Hicks, Virginia Tech
Jonathan M. Smith, University of Pennsylvania and DARPA

Digital Object Identifier 10.1109/MM.2019.2921509
Date of current version 23 July 2019.

In an ideal world, it would be possible to build a provably correct and secure processor. However, the complexity of today’s processors puts this ideal out of reach. The complete verification of a modern processor remains intractable. Statically verifying even a simple security property (for example, “hardware privilege escalation never occurs”) remains beyond the state of the art in formal verification.

Testing can complement formal verification methods, yet testing is incomplete, and bugs in the hardware that leave it vulnerable continue to elude test suites. Further, a crafty malicious actor can evade typical testing coverage metrics. Recent efforts, including those of three of the authors, have explored the use of static analysis on the design files (e.g., hardware description language source code or gate-level netlists) to find suspicious circuitry.¹⁻³ These techniques rely on heuristics to define patterns that indicate a likely trojan and then search for instances in the design that match the pattern. However, malicious circuitry that does not match the pattern will be missed, as will inadvertent bugs that open vulnerabilities. By the time the weakness is uncovered, the hardware is already in the end user’s hands and vulnerable to attack.

In the absence of a full proof of correctness, what is needed is a final filter: a runtime verification technique that works postdeployment to detect and respond to security property violations as they occur during execution. In this article, we make the case for final filters using our tool, FinalFilter, as a case study.

FINALFILTER
Prior research, including our own, has shown that assertions hard-coded into the design can be a cheap and effective way to verify the correctness of any single execution run.⁴,⁵ Assertions can cover properties that would be intractable to prove statically for the current state of the art. The downside
July/August 2019 Published by the IEEE Computer Society 0272-1732 © 2019 IEEE
is that, like all execution monitors, this approach cannot prove that the property can never be violated, only that if such a violation occurs the monitor will catch it. As such, a final filter is a verification approach that is complementary to and should be used in conjunction with existing testing and static verification methods.

We extend the basic idea of an assertion-based execution monitor to make it configurable so that the set of properties being monitored can be updated postdeployment to reflect new information about exploitable vulnerabilities in the design. FinalFilter is a reconfigurable, runtime verification system that monitors the state and events of the processor for invalid updates to privileged registers.

The mechanism of a final filter is simple and presents a small attack surface. Yet, making it configurable does add complexity. To minimize FinalFilter’s cost to the system’s trustworthiness, we formally verify the correctness properties of its component modules and of the composed system. Finally, we show how to verify key properties for individual configurations.

As a formally verified execution monitor, FinalFilter guarantees that any trace violating a given

Figure 1. Processor design flow with FinalFilter: (a) Hardware description language implementation of the instruction set specification. (b) Vulnerability is accidentally or maliciously opened in the processor. (c) FinalFilter is added to the design as the last action,⁶ with taps directly on the outputs of ISA state-storing elements. (d) FinalFilter dynamically verifies the properties encoded by trusted software. FinalFilter triggers existing repair/recovery approaches in the event of an invariant violation. FinalFilter continues to protect the repair/recovery software.

THREAT MODEL
The trusted computing base for FinalFilter includes our specification and verification process and tools, the fabrication process and tools, and the filter’s current configuration.

Lifecycle Assumptions
Referring to Figure 1, we assume we are the last ones to touch the processor design. We rely on orthogonal techniques to ensure that FinalFilter is not tampered with in the supply chain, which includes fabrication of the processor and shipping to the end user.

Architectural Scope
FinalFilter protects privileged instruction set architecture (ISA)-level registers. FinalFilter does not detect side-channel attacks as doing so requires knowledge of more than the current trace of execution. The focus of this paper is the integer core of the processor. Notably, we assume the memory hierarchy is correct.

Attacker Model
The attacker is free to take any action not precluded by our assumptions, either in hardware or in software. This includes an attacker capable of creating and exploiting a hardware defect. An example might be a defect that causes the processor to return from an exception without restoring the privilege level.

DESIGN
FinalFilter enforces properties over privileged ISA state and events necessary for the security of software running on the processor. An example property that we will return to is, “the processor transitions from user mode to supervisor mode if, and only if, there is an interrupt or exception.” Any
processor that correctly implements the specification must satisfy this property. Proving this property statically requires a proof across all possible execution traces, currently an intractable task. Yet, as an execution monitor, FinalFilter can verify the property for every trace that is executed. Monitoring is done by a set of hardware-based assertions over architecturally visible states and events.

FinalFilter is designed to be used in conjunction with existing software-level recovery and repair tools. For example, BlueChip,¹ a tool developed by three of the authors, can route execution around vulnerable circuitry. FinalFilter provides precise introspection points and can support a variety of repair and recovery approaches.

Three aspects of the design are worth noting.

1) FinalFilter is reconfigurable after deployment and can protect multiple security-critical properties concurrently.
2) FinalFilter’s design is formally specified and its implementation proven correct.
3) Execution overhead is incurred only in the rare case that a processor violates one of the monitored security properties.

The key insight that allowed us to make the monitor both reconfigurable and able to handle multiple invariants concurrently is that many security properties can be implemented as a Boolean combination of simpler assertions, and these simple component assertions are usually in one of only a few forms. Users can specify a number of simple component assertions and combine them into one or more complex assertions that monitor hardware state.

Running Example
We use security invariants (or just invariants) to describe properties of the ISA that must be true of a secure implementation and that, if violated, would open an exploitable vulnerability. Invariants are dynamically verified by one or more assertions over architecturally visible state.

Consider the following component of the privilege escalation property mentioned before:

$I_0 \doteq$ A change in processor mode from low privilege to high privilege is caused only by an exception or a reset.

Invariant $I_0$ is a statement that the instruction set specification says must be true of the system at all points of execution. It can be written as a concrete assertion in terms of the ISA-level state in the following way:

$$A_0 \doteq \mathrm{assert}\big((\mathrm{risingEdge}(SR[SM]) \rightarrow (NPC[31{:}12] = 0)) \wedge (\mathrm{risingEdge}(SR[SM]) \rightarrow (NPC[7{:}0] = 0)) \vee (\mathrm{risingEdge}(SR[SM]) \rightarrow (reset = 1))\big)$$

where $SR[SM]$ represents the supervisor mode bit of the processor’s status register, and an exception is indicated by the next program counter $NPC$ pointing to an exception vector start address. The address will always be of the form 0x00000X00, where the “X” indicates a don’t-care value. (This might seem as if it leaves the door open for a processor attack that escalates privilege while executing at an address that matches the form 0x00000X00, but it does not. Pages in that address range have supervisor permissions set, which implies that code executing in that address range is already in supervisor mode. If the processor attack attempts to allow user mode execution of supervisor mode pages, FinalFilter includes an invariant to detect such misbehavior.)

We break $A_0$ into three component assertions.

$$A_a \doteq \mathrm{assert}(\mathrm{risingEdge}(SR[SM]) \rightarrow (NPC[31{:}12] = 0))$$
$$A_b \doteq \mathrm{assert}(\mathrm{risingEdge}(SR[SM]) \rightarrow (NPC[7{:}0] = 0))$$
$$A_c \doteq \mathrm{assert}(\mathrm{risingEdge}(SR[SM]) \rightarrow (reset = 1))$$

Each of these individual assertions is evaluated at each step of execution, and the results are appropriately combined to form a statement that is equivalent to $A_0$.

Invariant Monitor
FinalFilter reads in ISA-level state and outputs a signal indicating whether any of the programmed invariants were violated. It works essentially as a programmable finite state machine. Configuration data programs the machine with which invariants to check, and ISA-level state acts as the input to the machine. The number of invariants it can monitor concurrently depends on the complexity of the associated component assertions and the number of assertion blocks built into the monitor.

Using our running example, we now describe each module in the configurable monitor, shown in its configured state in Figure 2. In our system, we
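The decomposition of A0 into the component assertions Aa, Ab, and Ac can be mimicked in software. The Python sketch below is only an illustrative model, not FinalFilter's hardware: the `Step` record, its field names, the helper names, and the 32-bit width of NPC are assumptions made for the example. It evaluates the three component assertions on each step of a trace and combines them as (Aa and Ab) or Ac.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One observed step of ISA-level state (hypothetical record for this sketch)."""
    sm: int     # supervisor mode bit SR[SM]
    npc: int    # next program counter, assumed to be 32 bits wide
    reset: int  # reset signal


def bits(value, hi, lo):
    """Return the bit field value[hi:lo] as an integer."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


def component_assertions(prev, cur):
    """Evaluate Aa, Ab, Ac; each is an implication guarded by risingEdge(SR[SM])."""
    rising = prev.sm == 0 and cur.sm == 1
    a_a = (not rising) or bits(cur.npc, 31, 12) == 0
    a_b = (not rising) or bits(cur.npc, 7, 0) == 0
    a_c = (not rising) or cur.reset == 1
    return a_a, a_b, a_c


def a0_holds(prev, cur):
    """Combine the components as (Aa and Ab) or Ac."""
    a_a, a_b, a_c = component_assertions(prev, cur)
    return (a_a and a_b) or a_c


def first_violation(trace):
    """Return the index of the first step that violates A0, or None if none does."""
    for i in range(1, len(trace)):
        if not a0_holds(trace[i - 1], trace[i]):
            return i
    return None
```

For instance, a user-to-supervisor transition whose NPC is 0x00000C00 satisfies the model, while the same transition to 0x00001234 without a reset is flagged. Like the monitor it imitates, this check says nothing about traces that were never executed.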
Merge. The Merge block takes the outputs from the Assert blocks and combines them as prescribed by the configuration data. It can be viewed as a configurable truth table. The inputs to the truth table are the Assert block outputs (the component assertions $A_a$, $A_b$, and $A_c$ in our running example). The function defining how the component assertions combine (i.e., the out function) is configurable at run time. The truth table is implemented as a hierarchy of lookup tables. For example, with 16 Assert blocks, rather than a single lookup table with $2^{16}$ rows, the monitor would have four lookup tables with six inputs ($2^6$ rows) each. The outputs of the three first-level lookup tables make up the input to a second-level lookup table, the output of which is the output of the Merge block.

We can now complete our running example. Let $err_a$ be the output of the Assert block for $A_a$, and let $err_b$ and $err_c$ be the outputs of the Assert blocks for $A_b$ and $A_c$, respectively. Remembering that the output of each Assert block will be high when the assert triggers, i.e., when the invariant is violated, we combine the results of the component assertions in the following way:

$$err_0 = (err_a \mid err_b) \mathbin{\&} err_c$$

As desired, $err_0$ will be high whenever $A_0$ is false, i.e., whenever the $A_0$ assertion is triggered.

Configuration Data. The configuration data are provided by trusted software (e.g., the system BIOS) at initialization (originally, we imagine configuration coming from processor or motherboard manufacturers). It is the mechanism by which FinalFilter is configured, and portions of the configuration data are fed into each block at the appropriate stage.

VERIFICATION
We used the commercial model checking tool Cadence SMV for the verification of the configurable assertion fabric. For each component of FinalFilter shown in Figure 2, we formally specified its behavior and verified that the implementation meets the specification.

In most cases, formally specifying a component’s behavior involved little more than extracting the information from the design documents. However, in two cases, the process of formalizing the specification brought out ambiguities in the design, and it was necessary to revisit the design phase of the process. During the course of verification, we found one implementation error: a logical AND was used where an OR was needed.

Ultimately, the monitor’s behavior is determined by the configuration data, and it is up to the processor or motherboard manufacturer to provide a correct configuration. A misconfigured fabric could fail to provide the intended protections. We guard against misconfigurations in three ways.

First, we protect against invalid configurations that would result in unpredictable results. Built into the design of each block is a check that the incoming configuration data are well formed. We verify that if any of the individual components reports an invalid configuration, then FinalFilter will not fire any assertion failures. This behavior represents a tradeoff in the design space. On the one hand, an accidentally misconfigured fabric, which will never trigger an assertion, is not protecting the user. On the other hand, never firing in the presence of misconfigured data has the benefit of being a stable behavior: it is what exists today. An alternative is to always fire when the fabric is misconfigured, but this would give an attacker an avenue for launching a denial-of-service attack, making FinalFilter a new avenue of attack, something we wish to avoid.

Second, we built a software tool to generate the configuration data from higher level assertion statements. Although only prototypical, we hope that further developing this tool will make generating correct configuration data relatively easy for the user.

Third, we built a validation tool to prove properties about individual configurations. We prove the following sanity checks on the configuration data:

• There are assertions configured.
• None of the assertions are unsatisfiable (e.g., the following does not occur: $\{True \rightarrow q \wedge \neg q\}$).
• The configured assertions, as a whole, are satisfiable (e.g., the following does not occur: $\{p \rightarrow q;\ p \rightarrow \neg q\}$).
USING FINALFILTER
Using FinalFilter requires having a meaningful set of properties to monitor. In prior work, we took a manual approach to develop a set of security-critical properties.⁵ We studied errata documents to learn what types of exploitable errors can occur, and we studied the architecture’s specification documents to develop a set of properties necessary (though not sufficient) to protect security-critical state of the processor.

In subsequent work, one of the authors has developed a semiautomated method for learning new security properties using information gleaned from known exploitable bugs⁸ and demonstrated that properties developed for one RISC processor may be suitable for use, after some translation, on a second RISC processor, even across architectures.⁹ However, the development of security-critical properties for use with FinalFilter or any property-based verification method is still in its infancy and more research is needed.

Case Study
We configured FinalFilter with 18 assertions we found to be critical to security in our prior work.⁵ We then introduced into the processor 14 vulnerabilities from a mix of previously published hardware attacks and attacks based on exploitable vulnerabilities from several years of AMD processor errata. For each one, we wrote a user-space program that exploits the vulnerability and reports if the attack was successful. FinalFilter is expressive enough to implement all 18 invariants, and the

routines to avoid triggering a bug that is found postdeployment.

In this article, we have not addressed the problem of measuring coverage. Boule et al.¹² add circuitry to assertions to track and measure coverage. The question of what is a meaningful coverage metric for a set of security properties is an open one, but it is critical: such a measure can give an indication of the number of “unknown unknowns” that remain unprotected.

CONCLUSION
Design-time verification alone is insufficient; some exploitable vulnerabilities will make it through. FinalFilter, a last line of defense that can be formally verified, protects security-critical properties of the processor core. We believe the idea is broadly applicable, and in future work we will explore the use of a final filter for commercial architectures and for modules outside the processor core.

ACKNOWLEDGMENT
The authors would like to thank the editors for their insightful comments and suggestions, and S. Bellovin for his advice and the phrase “final filter.”

REFERENCES
1. M. Hicks, M. Finnicum, S. T. King, M. M. K. Martin, and J. M. Smith, “Overcoming an untrusted computing base: Detecting and removing malicious hardware automatically,” in Proc. IEEE Secur. Privacy, 2010,
5. M. Hicks, C. Sturton, S. T. King, and J. M. Smith, “SPECS: A lightweight runtime mechanism for protecting software from security-critical processor bugs,” in Proc. ACM Conf. Architectural Support Program. Lang. Oper. Syst., 2015, pp. 517–529.
6. A. Waksman and S. Sethumadhavan, “Silencing hardware backdoors,” in Proc. IEEE Symp. Secur. Privacy, 2011, pp. 49–63.
7. R. Rubenstein, “Open Source MCU core steps in to power third generation chip,” Jan. 2014. [Online]. Available: http://www.newelectronics.co.uk/electronics-technology/open-source-mcu-core-steps-in-to-power-third-generation-chip/59110/
8. R. Zhang, N. Stanley, C. Griggs, A. Chi, and C. Sturton, “Identifying security critical properties for the dynamic verification of a processor,” in Proc. ACM Conf. Architectural Support Programming Lang. Operating Syst., 2017, pp. 541–554.
9. R. Zhang, C. Deutschbein, P. Huang, and C. Sturton, “End-to-end automated exploit generation for diagnosing processor designs,” in Proc. IEEE/ACM Symp. Microarchit., 2018, pp. 815–827.
10. T. M. Austin, “DIVA: a reliable substrate for deep submicron microarchitecture design,” in Proc. ACM/IEEE MICRO, Haifa, Israel, Nov. 1999, pp. 196–207. [Online]. Available: http://www.eecs.umich.edu/taustin/papers/MICRO32-diva.pdf
11. S. Narayanasamy, B. Carneal, and B. Calder, “Patching processor design errors,” in Proc. IEEE Int. Conf. Comput. Des., Oct. 2006, pp. 491–498. [Online]. Available: http://cseweb.ucsd.edu/ calder/papers/ICCD-06-HWPatch.pdf
12. M. Boule, J. Chenard, and Z. Zilic, “Adding debug enhancements to assertion checkers for hardware emulation and silicon debug,” in Proc. Int. Conf. Comput. Des., 2006, pp. 294–299.

Cynthia Sturton is an assistant professor and Peter Thacher Grauer Fellow at the University of North Carolina at Chapel Hill. She leads the Hardware Security @ UNC research group to investigate the use of static and dynamic analysis techniques to protect against vulnerable hardware designs. Her research is funded by several National Science Foundation awards, the Semiconductor Research Corporation, Intel, a Junior Faculty Development Award from the University of North Carolina, and a Google Faculty Research Award. She was recently awarded the Computer Science Departmental Teaching Award at the University of North Carolina. She has a BSE from Arizona State University and an MS and a PhD from the University of California, Berkeley. Contact her at csturton@cs.unc.edu.

Matthew Hicks is an assistant professor at Virginia Tech, working at the intersection of security, architecture, and embedded systems, with special emphasis on analog-domain hardware security. Contact him at mdhicks2@VT.edu.

Samuel T. King was a professor for eight years at the University of Illinois Urbana-Champaign. He then left his tenured position at UIUC to push himself intellectually and professionally in industry. He is currently with the Computer Science Department at the University of California, Davis. He is interested in building systems for fighting fraud and rethinking our notion of digital identity. He has a PhD from the University of Michigan, an MS from Stanford University, and a BS from UCLA. Contact him at kingst@ucdavis.edu.

Jonathan M. Smith is currently a program manager in the Information Innovation Office (I2O) at the Defense Advanced Research Projects Agency (DARPA), on leave from the University of Pennsylvania, where he holds the Olga and Alberico Pompa Professorship of Engineering and Applied Science and is a professor of computer and information science. He was previously a Member of Technical Staff at Bell Telephone Laboratories and Bell Communications Research, joining Penn in 1989 after receiving his PhD from Columbia University. He previously served as a Program Manager at DARPA in 2004–2006, and was awarded the Office of the Secretary of Defense Medal for Exceptional Public Service in 2006. He became an IEEE Fellow in 2001. Contact him at jms@cis.upenn.edu.
General Interest
RASSA: Resistive Prealignment Accelerator for Approximate DNA Long Read Mapping
Roman Kaplan, Leonid Yavits, and Ran Ginosar
Technion – Israel Institute of Technology
0272-1732 © 2018 IEEE Published by the IEEE Computer Society IEEE Micro
reference sequence requires a computationally intensive local alignment procedure (e.g., Smith–Waterman).⁴ Its computational time complexity is typically $O(nm)$ for two sequences with lengths $n$ and $m$. Reference sequences vary from several millions to billions of base pairs (bps). It is therefore computationally prohibitive to perform optimal alignment of every long read with the entire reference sequence.

Read mappers (e.g., minimap⁶ and minimap2⁷) find regions of high similarity (mappings) between reads or between a read and a reference sequence, followed by an alignment step to determine the exact edit distance and verify that the mapping is correct. If a prealignment algorithm identifies a specific region in the reference suitable for mapping, the alignment can be performed only on that region, reducing the alignment’s duration and resource requirements.⁸ Therefore, read mapping can be viewed as a two-step process: prealignment filtering and accurate alignment verification. The prealignment step reduces the problem size for aligners by narrowing the regions to ones with potentially high-scoring alignment.

Existing prealignment hardware solutions⁹,¹⁰ target short reads (up to several hundred bps), which contain a small number of indel and substitution errors (less than 5%) and have a different error profile than that of PacBio or ONT long reads.³,⁴ A high edit distance threshold is required for mapping long but error-prone reads.⁹ However, current solutions have high false positive rates when the edit distance is high (i.e., greater than 15). Thus, the current solutions for short reads are not applicable to long reads.

Approximate computing techniques are known to trade accuracy for speed or energy efficiency. In the case of long reads, multiple errors are a natural part of the sequencing output. Therefore, DNA long-read prealignment filtering inherently tolerates the imprecision.

With the end of Dennard scaling and the slowdown of Moore’s law, novel hardware solutions for data-intensive problems are being researched. Emerging technologies such as resistive memories enable new architectures with better performance and energy efficiency. Resistive approximate Hamming distance solutions exist.¹¹ However, these do not provide the parallelism required to support a high-throughput application, such as DNA read mapping.

In this work, we present RASSA, a resistive approximate similarity search accelerator architecture for DNA long-read prealignment filtering. RASSA is a massively parallel in-memory processor, facilitating simultaneous comparison of a long read with a reference sequence. The outputs of RASSA are locations on the reference sequence where alignment may result in a high score. The key performance breakthrough of RASSA is achieved by applying the similarity search in parallel to the entire reference. While the complexity of alignment is $O(mn)$, RASSA employs in-memory parallel computing on $O(m)$ memory cells to reduce the computation time to $O(n)$, where $m$ and $n$ are read and reference lengths, respectively.

RASSA employs resistive elements, memristors, serving at the same time as single-bit storage elements and comparators. Additional evaluation transistors translate mismatch scores into voltage levels, which are converted into digital values using analog-to-digital converters (ADC). Further processing determines the most likely overlap candidates.

This paper makes the following contributions.

1) RASSA, an in-memory processing resistive approximate similarity search accelerator, is introduced. The parallel processing architecture is presented bottom-up, from the memristor-based bitcell to base pair encoding and up to a complete RASSA system.
2) RASSA-based implementation of long read prealignment filtering is developed.
3) Evaluation of RASSA’s prealignment filtering accuracy and comparative analysis of its execution time and throughput is conducted.

BACKGROUND
The following two sections provide concise background on the problem: DNA read mapping and the memristor device technology.
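The two-step view of read mapping described above (a cheap prealignment filter followed by accurate alignment on the surviving regions) can be illustrated with a purely sequential software model. The sketch below is an assumption-laden toy, not RASSA's algorithm: the function name is hypothetical, it counts substitutions only (a Hamming-style mismatch count, so indels are not modeled), and it visits reference offsets one at a time, which costs on the order of the read length times the reference length.

```python
def prealign_candidates(read, reference, max_mismatches):
    """Return reference offsets whose window differs from `read`
    in at most `max_mismatches` positions (substitutions only)."""
    m, n = len(read), len(reference)
    candidates = []
    for offset in range(n - m + 1):
        window = reference[offset:offset + m]
        mismatches = sum(1 for a, b in zip(read, window) if a != b)
        if mismatches <= max_mismatches:
            candidates.append(offset)
    return candidates
```

Only the offsets returned by such a filter would be handed to an exact aligner. An in-memory design evaluates the per-offset comparisons concurrently rather than in the loop shown here.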
memory row) and in situ processing of large data mismatching bit in the case of A–C or A–G mis-
sets. RASSA enables comparing a key pattern with match), leading to ambiguous results. Since a
the entire data set in parallel. Every number of mismatch is signaled by reduced match line volt-
mismatches (of the key pattern versus each data age (caused by charge redistribution), a match
element that is in each memory row) causes a spe- should block charge flow. One hot encoding
cific voltage drop, allowing quantifying the number of mismatching locations (called a mismatch score). The mismatch score is compared with a predefined threshold value to detect the locations that have the desired degree of similarity with the compared pattern, indicating a viable mapping location. The following sections describe RASSA functionality, encoding of DNA bp, RASSA system architecture, and hardware evaluation.

DNA Base Pair Encoding and Mismatch Evaluation
Figure 1(b) presents the RASSA bitcell, containing two transistors and one memristor (2T1R). Each memristor serves as a single bit storage element and a single bit comparator, enabled by the selector transistor.
A compare operation consists of two phases: the precharge and the evaluation. During precharge, the match line is precharged to a certain voltage level. At the same time, the evaluation transistor in each bitcell is on to discharge the evaluation point (created by the diffusion capacitances of the selector and the evaluation transistors).
During the evaluation phase, if the selector transistor is on, a low memristor resistance (R_ON) allows charge to pass from the match line to the evaluation point. The charge distribution causes the match line voltage to drop. Sensing the voltage of the match line compared with a reference voltage (of zero mismatch) allows quantifying the number of mismatches, producing the mismatch score.

assures that at most one mismatch may happen in each group of four bitcells. For instance, in 60 bitcells, at most 15 mismatches may be observed. Therefore, in this work, a memristor in high resistive state (R_OFF) is considered logic “1,” while R_ON is considered logic “0.”

Mismatch Evaluation. During a compare operation, the compared (key) pattern is applied to the gates of the selector transistors of all bitcells. If certain groups of bitcells need to be ignored (masked-out) during comparison, zero is applied to the gates of the selector transistors of such bitcells. Figure 1(c) shows a stored “A” nucleotide symbol and a compare pattern of “A.” The comparison results in a match, so there is no charge redistribution path (through an R_OFF memristor). Figure 1(d) shows a mismatch, where the stored pattern is “G” and the key pattern is “T.” The mismatch results in charge redistribution through an R_ON memristor, causing a match line voltage drop. Figure 2(a) shows all possible match line voltage levels during the evaluation phase for mismatch scores of 0 through 15. The match line is sensed by an ADC (System Architecture section). The timing of such sensing, in addition to the per-cell transistor capacitance variations, may lead to inaccuracies in the mismatch score. For example, for a match line shared by up to 60 bitcells, the mismatch score error could be 1. If the number of bitcells sharing the match line is more than 120, the mismatch score error could reach 3.
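The compare behavior described above can be sketched in a few lines of Python. This is an illustrative model only, assuming a one-hot code with one bitcell per nucleotide in each group of four; the bit order A, C, G, T and the function names are hypothetical and not taken from the article. Only the R_OFF = logic 1 / R_ON = logic 0 convention, masking by applying zero to the selector gates, and the bound of one mismatch per group come from the text. The sketch models the digital outcome, not the analog match line voltage.

```python
# Hypothetical bit order inside each group of four bitcells (an assumption).
BASES = "ACGT"

def encode_stored(base):
    """Stored group: logic 1 (R_OFF) in the cell of `base`, logic 0 (R_ON) elsewhere."""
    return [1 if b == base else 0 for b in BASES]

def encode_key(base):
    """Key applied to selector gates; None masks the group (all gates at zero)."""
    if base is None:
        return [0, 0, 0, 0]
    return [1 if b == base else 0 for b in BASES]

def mismatch_score(stored_seq, key_seq):
    """Count bitcells that open a discharge path: selector on and memristor in R_ON."""
    score = 0
    for stored_base, key_base in zip(stored_seq, key_seq):
        stored = encode_stored(stored_base)
        key = encode_key(key_base)
        score += sum(1 for s, k in zip(stored, key) if k == 1 and s == 0)
    return score

if __name__ == "__main__":
    print(mismatch_score("A", "A"))   # stored "A", key "A": match
    print(mismatch_score("G", "T"))   # stored "G", key "T": one mismatch
```

Because each key group drives at most one selector gate, a group contributes either zero or one to the score, which is why a 60-bitcell Sub-Word (15 groups) never exceeds a score of 15.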
Figure 2. (a) Match line voltage levels for each mismatch score between zero (top curve) and 15 (bottom curve)
mismatches. Every voltage level at the sampling point is converted by the ADC. (b)–(d) Bottom-up block diagram of
RASSA. (b) Single Sub-Word, composed of 60 bitcells in NOR-like structure. (c) Single Word Row containing 16 Sub-Words,
capable of holding 240 DNA bps. (d) Complete RASSA diagram containing N Word Rows (N = 2^17). (e) Accumulating
mismatch scores in analog domain (Conclusions and Future Research Directions section).
amounting to 960 bitcells per word, designed for storing and comparing up to 240 DNA bps per cycle. In each compare operation, a compare pattern is applied to all active bitcell bit lines.
The match line voltage of each Sub-Word is sampled by the ADC and converted into a 4-bit mismatch score [right side of Figure 2(b)]. The ADC reference voltage and voltage level differences are set according to the match line values for each mismatch score, as demonstrated in Figure 2(a). The 16 Sub-Word ADC outputs are summed up to produce the mismatch score for the entire Word Row [see Figure 2(b)]. All such scores are then compared with a threshold value, in parallel, to indicate the Word Rows (corresponding to sequence locations) with the desired degree of similarity.

Timing, Power, and Area Breakdown
A Sub-Word circuit is designed, placed, and routed using the 28-nm CMOS High-k Metal Gate library from Global Foundries for transistor sizing, timing, and power analysis. We perform Spectre simulations for the FF and SS corners at 70 °C and nominal voltage. Timing analysis shows that an operational frequency of 1 GHz is possible. For a single Sub-Word, the precharge energy is 1.6 fJ, while the evaluation energy (ADC and control line switching) is 98.4 fJ. For a single Word, containing 16 Sub-Words, adders, and threshold comparator, a single compare cycle energy is 1791 fJ.
We have manually laid out a RASSA bitcell. The total Word Row area in 28-nm technology, including the bitcells, ADC, adders, and comparator, is 1598 µm². Bitcell transistors occupy 4%, adders and threshold comparator occupy 28%, and the ADC occupies 68% of the Word Row area. This allows placing 131 k (2^17) 960-bit (240-bp) Word Rows, storing 31.5 M DNA bps, on a single 209-mm² die. Its worst-case power consumption at 1 GHz is 235 W. Table 1 summarizes the RASSA system parameters.
Loading the reference sequence to RASSA is performed on each Word Row separately and requires two cycles per Word Row, one cycle to write all logic “0s,” and another cycle to write all logic “1s.” Given a reference sequence of length L, the number of cycles required for its loading equals 2⌈L/240⌉.
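The system-level figures above follow from the per-Word-Row numbers by simple scaling. The short Python check below reproduces them, under the assumption that the Word Row area is 1598 square micrometres (the reading consistent with a roughly 209 mm² die) and that every Word Row performs a compare in every cycle, which is the worst case for power. The constant and function names are illustrative.

```python
import math

WORD_ROWS = 2 ** 17        # 131 072 Word Rows per IC
BP_PER_ROW = 240           # DNA base pairs stored per Word Row
ROW_AREA_UM2 = 1598        # Word Row area, assumed to be in square micrometres
ROW_ENERGY_FJ = 1791       # energy of one compare cycle in one Word Row
FREQ_HZ = 1e9              # 1 GHz operating frequency

# Total capacity in DNA base pairs (about 31.5 M).
capacity_bp = WORD_ROWS * BP_PER_ROW

# Total Word Row area in mm^2 (1 mm^2 = 1e6 um^2).
die_area_mm2 = WORD_ROWS * ROW_AREA_UM2 / 1e6

# Worst-case power: all rows compare every cycle (fJ -> J, then times frequency).
power_w = WORD_ROWS * ROW_ENERGY_FJ * 1e-15 * FREQ_HZ

def load_cycles(length_bp):
    """Cycles to load a reference of `length_bp` bps: two write cycles per Word Row."""
    return 2 * math.ceil(length_bp / BP_PER_ROW)

if __name__ == "__main__":
    print(capacity_bp, round(die_area_mm2, 1), round(power_w, 1))
    print(load_cycles(4_600_000))  # e.g., a 4.6-Mbp reference
```

The 16 Sub-Words account for 16 × (1.6 + 98.4) = 1600 fJ of the 1791 fJ per Word Row; the remainder is attributable to the adders and threshold comparator.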
Table 1. RASSA system parameters for 28-nm node process.
Parameter                  Value
DNA bps per row (bits)     240 (960)
Words per IC               131 k (2^17)
Memory size (DNA bps)      31.5 M
Frequency                  1 GHz
Single IC Power            235 W
Single IC Area             209 mm²

DNA READ PREALIGNMENT FILTERING WITH RASSA
A single compare operation in RASSA finds the mismatch score between the key pattern and the contents of each Word Row. The reference sequence is stored in RASSA (contiguously, 240-bp fragment per Word Row). A fixed-size chunk (e.g., 200 bps) of the read is fed in as a key pattern. The mismatch score approximates the correlation between the read chunk and the reference sequence. A long read contains multiple chunks, therefore the compare operations are performed multiple times, in all possible positions of a read chunk vis-à-vis a Word Row reference fragment, sometimes involving two neighboring Word Rows.
The number of Word Rows in RASSA defines the number of overlap positions examined simultaneously. In a single cycle, ⌈n/240⌉ (where n is the reference sequence length) distinct positions on the reference sequence are examined simultaneously. To cover all possible positions, the read chunk is shifted by one bp, and compare is repeated 240 times (resembling the concept of correlation).
Figure 3(a)–(d) illustrates the comparison of a read chunk against a reference sequence in RASSA for several cases. In these examples, a chunk length of 200 bp is used [see Figure 3(a)]. A multicycle compare operation matches a 200-bp chunk against all its possible locations vis-à-vis the reference sequence. In the first compare cycle [see Figure 3(b)], the first chunk of the read is compared with the first 200 bps of each RASSA Word Row.
Following the completion of 41 (= 240 − 200 + 1) cycles, the 200-bp chunk is compared against reference data residing in two Word Rows. Such a two-Word Row compare requires two cycles. The even cycle mismatch score [see Figure 3(c)] is added to the score of the following odd cycle [of the Word Row below, Figure 3(d)], and compared with the threshold [see Figure 3(d), right]. Before every even cycle, the compare pattern is shifted by one bp to the right, shortening the even cycle compare pattern and extending the pattern in the odd cycle by one bp [see Figure 3(c) and (d), right]. After 439 (= 41 + 199 × 2) cycles, a 200-bp chunk has been compared against all reference sequence positions. The compare operation repeats for the rest of the 200-bp read chunks.
Figure 3(e) presents the concept of edit detection in RASSA. For simplicity, the chunk length in this example is 30 bp. Three types of edits are shown on the left of the figure. On the right, the mismatch score is presented for all relative shifts of the chunk versus the reference.
The average mismatch score in any mismatching location is 75% (the probability of an individual bp mismatch is 0.75). In the exact match case (top of the figure), the mismatch score is 0. A substitution results in a very low mismatch score easily detectable by RASSA. The second row of Figure 3(e) shows two substitution errors, leading to the mismatch score of 2/30 = 6.7%. The third row of Figure 3(e) shows an insertion error. The longest matching section is 18 bp to the right of the insertion, which leads to a mismatch score of (30 − 18)/30 = 40%. With the appropriate threshold, such a scenario is detectable by RASSA and identified as a potential mapping.
Finally, in the fourth row of Figure 3(e), a deletion error is presented. Deletions are handled similarly to insertions. In this example, the longest matching section is also 18 bp. The mismatch score is (30 − 18)/30 = 40% as well, and is also detectable by RASSA as a potential mapping. While a substitution results in a much lower mismatch score, RASSA is capable of detecting indels just as confidently, by setting the threshold accordingly.

In legitimate mismatching locations, the mismatch score is distributed binomially. The threshold is set such that the probability of the mismatch score falling below it is sufficiently low. For example, P(mismatch score < 50%) = 0.0008 for a 30-bp chunk. For a 200-bp chunk, a similar probability is reached at the threshold of 65%.
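Two numeric claims in this section can be checked with a short Python sketch: the cycle count for sliding one chunk across all positions, and the probability that a random (non-mapping) location falls below the threshold. The sketch assumes independent per-bp mismatches with probability 0.75, as the text states, and uses an exact binomial sum. The function names and the integer-percent threshold argument are illustrative choices.

```python
from math import comb

def compare_cycles(row_bp=240, chunk_bp=200):
    """Cycles needed to slide one chunk over every position relative to a Word Row."""
    inside_one_row = row_bp - chunk_bp + 1   # chunk fits entirely in one Word Row
    straddling = chunk_bp - 1                # chunk spans two rows: two cycles each
    return inside_one_row + 2 * straddling

def prob_score_below(chunk_bp, threshold_pct, p_mismatch=0.75):
    """P(mismatch count < threshold_pct% of chunk_bp) for a Binomial(chunk_bp, p)."""
    # Smallest count that is NOT below the threshold (integer ceiling).
    limit = (threshold_pct * chunk_bp + 99) // 100
    q = 1.0 - p_mismatch
    return sum(comb(chunk_bp, k) * p_mismatch ** k * q ** (chunk_bp - k)
               for k in range(limit))

if __name__ == "__main__":
    print(compare_cycles())                 # 41 + 2 * 199
    print(prob_score_below(30, 50))         # 30-bp chunk, 50% threshold
    print(prob_score_below(200, 65))        # 200-bp chunk, 65% threshold
```

A lower threshold reduces false positives from random locations but also rejects reads with more edits, which is the trade-off the threshold setting has to balance.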
Figure 3. Illustration of a single long-read chunk examination in RASSA. (a) Long read is divided into chunks,
each 200 bp long. (b, left and right) First chunk is compared against the reference sequence in multiple locations
(simultaneously). (c) and (d) First chunk overlaps with reference sequence bps from two Word Rows. (c, left) First
part of the chunk compared with the last bps of the Word Row. (c, right) All Sub-Word mismatch scores are
summed up and stored (compare to threshold does not take place). (d, left) Second part of the chunk is
compared with the first bps of the next Word Row. (d, right) All Sub-Word mismatch scores, including the previous
cycle result from the above Word Row, are summed up and compared with a threshold. Following this step, the
chunk is shifted right by one position (relative to the reference) and steps (c) and (d) are repeated. (e) Edit types
and the mismatch score found by RASSA. (f) Example of a 30-bp read chunk containing insertion, deletion and
substitution errors is compared against the reference sequence divided into 50-bp Word Rows. (g) Mismatch
score versus cycle number for the first 21 cycles of comparison of the example in (f). The minimal mismatch score is
below the threshold and is reached in the 9th cycle. The threshold is determined empirically per data set.
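The sliding comparison summarized in panels (f) and (g) can be imitated in software. The following Python sketch is a simplified model, assuming a plain position-by-position comparison at every offset of the chunk along a single contiguous reference string. It ignores the Word Row boundaries and the two-cycle handling of straddling chunks, and the function names and the 50% default threshold are illustrative.

```python
def sliding_mismatch_scores(reference, chunk):
    """Fraction of mismatching positions for each offset of `chunk` along `reference`."""
    n = len(chunk)
    scores = []
    for offset in range(len(reference) - n + 1):
        window = reference[offset:offset + n]
        mismatches = sum(1 for r, c in zip(window, chunk) if r != c)
        scores.append(mismatches / n)
    return scores

def candidate_offsets(reference, chunk, threshold=0.5):
    """Offsets whose mismatch score falls below the threshold (potential mappings)."""
    scores = sliding_mismatch_scores(reference, chunk)
    return [i for i, s in enumerate(scores) if s < threshold]

if __name__ == "__main__":
    ref = "G" * 10 + "ACCTGATCAG" + "T" * 10
    chunk = "ACCTGTTCAG"                     # one substitution relative to ref[10:20]
    scores = sliding_mismatch_scores(ref, chunk)
    best = min(range(len(scores)), key=scores.__getitem__)
    print(best, scores[best])
```

In this model a substitution raises the score at the true offset by one position, whereas an indel shifts everything after it, so only the longest unshifted section still matches; that is why indels need a more permissive threshold than substitutions.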
Figure 3(f) and (g) illustrate the mismatch score for the sliding window search of RASSA, in the presence of multiple errors. Figure 3(f) presents an example of a reference sequence and a read chunk containing a high-similarity region. All possible edit types (substitution, insertion, and deletion) exist in the chunk. Figure 3(g) illustrates the mismatch score as a function of the compare cycle (the relative read chunk position). During most cycles, the mismatch score is above threshold. When the chunk is in its valid mapping location [“min mismatch position” at cycle 9 in Figure 3(f)], the mismatch score is significantly lower than the random average 75% level. Setting the custom threshold, for instance, at 50% allows efficient overlap of read chunks with a number of edits (substitutions as well as indels).

Translating the Output of RASSA to Mapping Locations
RASSA compares every chunk of every read against the entire reference sequence. The
probability of a false positive match is extremely low. Therefore, we assume that every compare that results in a mismatch score below the predefined threshold indicates a valid mapping location for the entire read. The output of RASSA is a bit vector, one bit per Word Row. The index of the Word Row, together with the iteration number and relative position of the chunk within the read, provides an exact coordinate of a potential mapping location. In most examined cases, a single read has a single mapping location indicated by a single compare from a single chunk.
In some other cases, a single or multiple chunks produce multiple potential mapping locations. In such cases, the distance between consecutive locations is examined, starting from the lowest coordinate. If the distance between two consecutive locations is smaller than the read length, the location associated with the higher coordinate is discarded. Otherwise, both locations are kept for further processing.
With this selection heuristic, nearby potential mapping locations from a single or multiple chunks are combined, while distant locations are treated as separate mapping locations.
The mapping locations identified by RASSA can further be verified by alignment (e.g., Smith–Waterman algorithm),4 and used by assembly or error correction programs.8 Unmapped reads can either be discarded (in case of a high sequencing coverage) or be mapped with a seed-and-extend mapper, and then verified by an alignment algorithm.

EVALUATION
We compare RASSA with two existing solutions, minimap2,7 a state-of-the-art read mapping tool, and GateKeeper,9 a state-of-the-art short-read prealignment hardware accelerator.

Comparison With Minimap2
Our evaluation focuses on accuracy and speedup. Accuracy is measured by two criteria: 1) sensitivity: correctly mapped reads; 2) false positives: percentage of incorrect mappings out of all mappings by RASSA.

Methodology. Minimap2 is run with the parameters “-x map-pb” and “-x map-ont,” invoking its execution for overlap detection without the alignment step. With such parameters, minimap2 functions as a prealignment filter. The parameters also invoke the appropriate heuristic for PacBio and ONT reads, in addition to enabling multithreading and SIMD extensions. To evaluate RASSA’s prealignment filtering accuracy, we use minimap2 as a golden reference. Speedup is calculated as the ratio of minimap2 execution time, without indexing, to RASSA execution time. The accuracy and speedup of RASSA were obtained using an in-house simulator. We assume that the reference sequence has already been loaded into RASSA prior to execution.
To find the number of incorrect output locations that might increase the total alignment time, we have contrasted prealignment followed by alignment with alignment without prealignment. We have used part of the E.coli PacBio dataset, consisting of about 1000 reads. Total prealignment by RASSA took 20 ms, and the following alignment needed to be applied to only a 70-kbp subset of the reference, taking minimap2 1490 ms. In contrast, the same alignment applied without the prealignment stage took 3000 ms, about twice the time. Therefore, we decided that reads with more than two output locations by RASSA will be discarded and treated as incorrectly mapped.

Data Sets. We use five publicly available datasets, three from PacBio and two from ONT, taken from two organisms: E.coli K-12 NG1655 and Saccharomyces cerevisiae W303 (yeast). Both reference sequences are available at the NCBI (https://www.ncbi.nlm.nih.gov/). PacBio data sets were taken from https://github.com/PacificBiosciences/DevNet/wiki/Datasets. Error rates,13 including the share of insertions, deletions, and mismatches (I,D,M) are also presented.
E.coli
PacBio: 100 000 reads from one SMRT cell, 5245 bps on average. Error rate: 14.2% (I:41.7%, D:21.2%, M:37.1%)
PacBio CCS: 260 000 high-quality CCS reads from 16 SMRT cells, 940 bps on average. Error rate: 1% (I:5%, D:19.5%, M:75.5%)
ONT (from http://lab.loman.net/2016/07/30/nanopore-r9-data-release/): 165 000 R9 1D
Table 2. Sensitivity, fraction of exact mappings, and speedup of RASSA compared to minimap2.
                     Large Chunk (200 bps)                        Small Chunk (100 bps)
Data sets            Sensitivity   False Positives   Speedup      Sensitivity   False Positives   Speedup
E.coli PacBio        79.3%         13.4%             25           83.2%         13.6%             16
E.coli PacBio CCS    96.3%         8.9%              43           96.2%         6.9%              24
E.coli ONT           88.8%         10.5%             48           87.6%         12.4%             31
Yeast PacBio         69.8%         8.7%              77           72%           11.8%             51
Yeast ONT            85.9%         34.9%             31           85.1%         39.2%             49
minimap2 mapped only about 20% of all reads, with 50% of mappings with lower quality score than 60 (indicates a high-confidence mapping).
reads, 9009 bps on average. Error rate: 20.2% (I:14.5%, D:37.2%, M:48.3%)
Yeast
PacBio: 100 000 reads from one SMRT cell, 6294 bps on average. Error rate: 14% (I:5%, D:19.5%, M:75.5%)
ONT (ERR789757 from NCBI): 30 000 R7.3 2D MinION reads, 11,337 bps on average. Error rate: 13.4% (I:23.3%, D:35.7%, M:41%)

Table 2 presents the accuracy results for all five data sets above. Chunk sizes of 200 and 100 bps and corresponding thresholds were determined empirically, trading off accuracy and performance. Small changes of threshold induce only marginal changes in accuracy. For most data sets, a 55% threshold was used on 200-bp chunks and 45% for 100-bp chunks; for the Yeast PacBio case, we used 45% and 40%, respectively.

Speedup. We compare RASSA execution time with that of minimap2, executed on a server with a 16-core 2-GHz Intel Xeon E5-2650 CPU and 64 GB of RAM. Table 2 shows that RASSA achieves 16–77× speedup over minimap2.† We note that the yeast dataset has fewer reads than E.coli, but a longer reference sequence (11.7 Mbp versus 4.6 Mbp), which might cause the longer execution time on minimap2. In contrast, RASSA is insensitive to the reference sequence length and its execution time is determined by the length of a read chunk.
RASSA produces output (on average, one mapping per read) at a rate of 50,000–500,000 reads/s, enabling multiple simultaneously executing instances of a typical alignment algorithm.

Throughput Comparison With GateKeeper
We compare RASSA throughput (the number of examined mapping locations per second) with that of GateKeeper.9 GateKeeper was implemented in a Virtex-7 FPGA using a Xilinx VC709 board running at 250 MHz.
GateKeeper is designed to compare short reads with a reference sequence. Table 3 shows the throughput in Billions of Examined Mapping Locations per second (BEML/s) of RASSA and GateKeeper on two short-read data sets used in reference 9: 100-bp reads and 300-bp reads. RASSA frequency is adjusted to that of GateKeeper. In addition, we show the average RASSA throughput for the 200-bp reads, equivalent to the chunk lengths used in Table 2.

Table 3. RASSA and GateKeeper9 throughput (billions of examined mapping locations per second, BEML/s).
Read Lengths     GateKeeper      RASSA @250 MHz
100 bp           1.7 BEML/s      226.8 BEML/s
200 bp           -               175.2 BEML/s

RASSA outperforms GateKeeper by more than two orders of magnitude. When applied to the short-read mapping prealignment, RASSA covers a read with one chunk. Consequently, RASSA takes 1–3 cycles (for the read lengths used in Table 3) to find the mismatch score in all Word Rows in parallel. GateKeeper, on the other hand, is reported to process up to 140 (20)
mapping locations of 100-bp (300-bp) reads in parallel, affected by the edit distance threshold.

CONCLUSIONS AND FUTURE RESEARCH DIRECTIONS
This paper presents RASSA, an in-memory processing parallel architecture of a resistive approximate similarity search accelerator. We apply RASSA to the long-read DNA mapping problem. The length of reads, coupled with a low read quality, poses a challenge for existing mappers, optimized for high-quality short reads. The read mapping process is data and compute intensive, making it a target for acceleration. RASSA addresses the challenge by breaking long reads into short chunks and by applying full correlation. By allowing faster mapping on large data sets, we potentially make a step toward real-time pathogen or genome sequence completion.
We compared the RASSA accuracy and execution time with that of minimap2, a state-of-the-art mapping solution, on five long-read data sets taken from two organisms. Our evaluation shows that RASSA can outperform minimap2 by 16–77×. In addition, we compared RASSA’s throughput, measured in examined mapping locations per second, with that of GateKeeper, a state-of-the-art short-read prealignment hardware accelerator. We find that RASSA can outperform GateKeeper by more than two orders of magnitude.
This work can be extended in several ways. First, RASSA can be applied to read-to-read overlap finding, which requires finding overlaps between pairs of reads. Read-to-read overlap finding is an important first step in de novo genome assembly6 (constructing the host DNA sequence without a reference sequence), a problem more computationally challenging than read mapping. Second, a detailed design space exploration needs to be performed. For example, RASSA can further be optimized in terms of hardware cost: higher density can be achieved by sharing ADCs among multiple Sub-Words and by applying analog computations, as presented in Figure 2(e). RASSA mapping and resistive CAM alignment5 may be combined into a single high-performance in-memory mapper/aligner. Finally, thanks to its use of short chunks, RASSA can be effectively applied to short reads.

REFERENCES
1. J. L. Jameson and D. L. Longo, “Precision medicine – Personalized, problematic, and promising,” Obstetrical Gynecological Survey, vol. 70, no. 10, pp. 612–614, 2015.
2. J. Quick et al., “Real-time, portable genome sequencing for Ebola surveillance,” Nature, vol. 530, no. 7589, pp. 228–232, 2016.
3. A. Rhoads and K. F. Au, “PacBio sequencing and its applications,” Genom., Proteomics Bioinform., vol. 13, no. 5, pp. 278–289, 2015.
4. T. Laver et al., “Assessing the performance of the oxford nanopore technologies minion,” Biomol. Detection Quantification, vol. 3, pp. 1–8, 2015.
5. R. Kaplan, L. Yavits, R. Ginosar, and U. Weiser, “A resistive CAM processing-in-storage architecture for DNA sequence alignment,” IEEE Micro, vol. 37, pp. 20–28, 2017.
6. H. Li, “Minimap and Miniasm: Fast mapping and de novo assembly for noisy long sequences,” Bioinformatics, vol. 32, no. 14, pp. 2103–2110, 2016.
7. H. Li, “Minimap2: Pairwise alignment for nucleotide sequences,” Bioinformatics, vol. 34, pp. 3094–3100, 2018.
8. K. Berlin, S. Koren, C. S. Chin, J. P. Drake, J. M. Landolin, and A. M. Phillippy, “Assembling large genomes with single-molecule sequencing and locality-sensitive hashing,” Nature Biotechnol., vol. 33, no. 6, pp. 623–630, 2015.
9. M. Alser, H. Hassan, H. Xin, O. Ergin, O. Mutlu, and C. Alkan, “GateKeeper: A new hardware architecture for accelerating pre-alignment in DNA short read mapping,” Bioinformatics, vol. 33, no. 21, pp. 3355–3363, 2017.
10. J. S. Kim et al., “GRIM-filter: Fast seed location filtering in DNA read mapping using processing-in-memory technologies,” BMC Genom., vol. 19, no. 2, p. 89, 2018.
11. M. Imani, A. Rahimi, D. Kong, T. Rosing, and J. M. Rabaey, “Exploring hyperdimensional associative memory,” in Proc. IEEE Int. Symp. High Perform. Comput. Archit., 2017, pp. 445–456.
12. H. Akinaga and H. Shima, “Resistive random access memory (ReRAM) based on metal oxides,” Proc. IEEE, vol. 98, no. 12, pp. 2237–2251, Dec. 2010.
13. J. L. Weirather et al., “Comprehensive comparison of pacific biosciences and oxford nanopore technologies and their applications to transcriptome analysis,” version 2, F1000Research, 2017.

Roman Kaplan is currently a PhD student in the Faculty of Electrical Engineering, Technion – Israel Institute of Technology, under the supervision of Prof. R. Ginosar. Between 2009 and 2014, he was a software engineer. His research interests are parallel computer architectures, in-data accelerators for machine learning, and bioinformatics. He has a BSc and an MSc from the Faculty of Electrical Engineering, Technion. Contact him at romankap@gmail.com.

Leonid Yavits is currently a Postdoctoral Fellow in electrical engineering at Technion – Israel Institute of Technology. He co-founded VisionTech where he co-designed a single chip MPEG2 codec. VisionTech was acquired by Broadcom in 2001. He coauthored a number of patents and research papers on SoC and ASIC. His research interests include non-von Neumann computer architectures, accelerators, and processing in memory. He has an MSc and a PhD in electrical engineering from Technion. Contact him at leonid.yavits@nububbles.com.

Ran Ginosar is a professor with the Department of Electrical Engineering and serves as the head of the VLSI Systems Research Center at the Technion – Israel Institute of Technology. He joined the Technion faculty in 1983 and was a visiting associate professor with the University of Utah during 1989–1990, and a visiting faculty with Intel Research Labs in 1997–1999. His research interests include VLSI architecture, manycore computers, asynchronous logic and synchronization, networks on chip, and biologic implant chips. He has co-founded several companies in various areas of VLSI systems. He has a BSc (summa cum laude) from Technion and a PhD from Princeton University, both in electrical and computer engineering. Contact him at ran@ee.technion.ac.il.
General Interest
The Queuing-First Approach for Tail Management of Interactive Services
Amirhossein Mirhosseini and Thomas F. Wenisch
University of Michigan

ONLINE DATA-INTENSIVE (OLDI) services (e.g., web search) traverse terabytes of data with strict latency targets.1 Managing high-percentile tail latencies is a key problem in designing such services. First, to guarantee user satisfaction, services must meet strict response time service-level objectives (SLOs), especially for tail latencies.2,3 Second, such services typically communicate via fan-out patterns wherein datasets are “sharded” across numerous “leaf” servers and their responses are aggregated before responding to the user. As such, overall latency is often dictated by the slowest leaves (i.e., the “tail at scale” effect4).

Digital Object Identifier 10.1109/MM.2019.2897671
Date of current version 23 July 2019.
July/August 2019 Published by the IEEE Computer Society 0272-1732 © 2019 IEEE
High tail latencies arise from two effects. First, such applications’ service time distributions include outlying requests that take much longer (10–100× or more) than the mean.5 Some requests may require exceptional processing time depending on their arguments (e.g., search engines1,6) or query types (e.g., sets versus gets in key-value stores5,7). Some requests are delayed by system interference, such as from garbage collection, page deduplication, synchronous hugepage compaction, or network stack impediments.4,8 In other cases, scheduler inefficiencies, power state transitions, suboptimal interrupt routing, poor NUMA node allocation, or virtualization effects may contribute to long tail latencies.9 Finally, interference from colocated workloads can cause slowdown due to contention for shared caches, memory bandwidth, or global resources like network cards or switches.10,11
Queuing effects are a second key contributor to applications’ end-to-end latency distribution.3 Queuing arises at numerous layers, causing some requests to wait for others.4 Although queuing also affects average performance, its effect on tail latency may be catastrophic. To achieve performance stability, systems must be engineered such that the overall request arrival rate is lower than the aggregate system capacity (service rate). However, as both rates fluctuate, arrivals may temporarily outstrip service capacity, causing requests to queue. Queuing delay is most apparent under high system load. However, in this paper, we make the case that queuing effects drastically magnify the impact of rare system events/hiccups and can result in high tail latencies even under modest load. Due to head-of-line (HoL) blocking, many requests are delayed by an exceptionally slow one that stalls a server/core; these delayed requests account for the bulk of the latency distribution tail.
Through stochastic queuing simulation,12 we show that improving a system’s queuing behavior often yields much greater benefit than mitigating the individual system hiccups that increase service time tails. We suggest two general directions for improving system queuing behavior: server pooling, and common-case service acceleration (CCSA). Server pooling is the practice of redesigning system architecture to change single-server (“scale-out”) queues into multiserver (“scale-up”) ones; that is, rather than enqueuing requests at distinct servers/cores, a single queue is shared among many (i.e., converting c G/G/1 queues into a G/G/c). Server pooling greatly reduces queuing delay and can completely eliminate queuing with enough servers (i.e., high enough c). Pooling smooths fluctuations in both arrivals and service, making the system behave more like one with deterministic interarrival and service times. Especially for high disparity service time distributions (i.e., rare system events/hiccups), server pooling reduces the overall tail latency by breaking HoL blocking and preventing nominal requests from waiting behind exceptionally long ones. Even a modest degree of concurrency allows many short requests to drain past stalled ones, substantially reducing weight in the latency distribution tail.
CCSA improves systems’ queuing behavior by deploying optimizations that target common-case service behavior (as opposed to optimizations that directly target rare/slow requests or hiccups). It may seem counterintuitive to improve tail latency by optimizing typical-case request performance, but queuing delays are greatly impacted by the average load, which depends more on typical-case service time than rare cases.
In single-server systems, CCSA has little impact when the service variance is excessively high (i.e., HoL blocking is common), as nominal requests queue behind rare, slow ones regardless of how fast the nominal requests are processed. But, if there is sufficient concurrency (e.g., by using server pooling) that slow requests rarely occupy all servers, then CCSA provides enormous benefit by allowing nominal requests to drain past slow ones, drastically reducing wait time. Importantly, we show that, with concurrency, CCSA is more effective than directly reducing either the length or the probability of rare hiccups. Since finding and mitigating tail events is hard due to their myriad causes,13 we
believe this observation is encouraging: we can reduce tail latency without engaging in “whack-a-mole” with rare system hiccups.
In short, we argue that cloud system designers should invest optimization effort first into 1) reducing HoL blocking through higher concurrency and improved queuing discipline (i.e., server pooling) and then into 2) optimizing common-case performance to improve mean service time. Both of these approaches may have greater impact and are easier to achieve than directly pinpointing and mitigating rare cases and hiccups; whereas server pooling smooths out arrival and service variability, CCSA reduces the effective system load. The relative impact of the two approaches depends critically on the system load and service time variance. CCSA’s effectiveness improves as service times become more normal and/or concurrency increases. We build a simple regression model on concurrency and service time variance to estimate HoL blocking and indicate whether server pooling or CCSA is more beneficial in reducing tail latency. System designers can use this model to guide optimization effort and estimate its impact.

BACKGROUND AND METHODOLOGY
Most interactive cloud services can be modeled as A/S/c queuing systems (based on Kendall’s notation14), where A specifies the request inter-arrival time distribution, S the service time distribution, and c the number of concurrent servers. Regardless of the distributions, the average arrival rate (λ) must be lower than the average aggregate service rate of all servers (μc, with μ as the average service rate of a single server); otherwise, requests queue without bound.
The most common queuing models used in analytical studies are M/M/c systems (M stands for Markovian), where both interarrival and service times follow exponential distributions. It can be shown that the exponential distribution is the only continuous distribution with the memoryless property (i.e., occurrence of events is independent of the system’s history).14 An interesting property of exponentially distributed random variables is the constant ratio between their mean values and all of their quantiles (including the median and all-percentile tails), as shown in

$$P\big(S > a\,E(S)\big) = e^{-a}. \tag{1}$$

Due to the memoryless property of exponential distributions, M/M/c queuing systems can be easily analyzed with continuous time Markov chains and have closed-form solutions for many of their parameters, such as average waiting and sojourn (waiting plus service) times.
Neither inter-arrival nor service times of interactive cloud services are perfectly modeled by exponential distributions. But since requests usually originate from a large pool of independent sources (e.g., many distinct users), they typically mimic Poisson (memoryless) arrivals; prior studies have observed that interarrival time distributions usually have small coefficients of variation (mostly, between 1 and 2).12 As such, interarrival processes can be well approximated with an exponential distribution (CV = 1) with little fidelity loss.15 Service time distributions, in contrast, may have long tails; some requests encounter rare hiccups that increase service time by 10–100× (or even more) over the mean, much larger than the ratio of the 99th percentile and mean values in the exponentially distributed service times of M/M/c systems [≈4.6, based on (1)]. Hence, interactive cloud services are often investigated using M/G/c queuing models (G stands for General).5,16
Unfortunately, M/G/c queuing models do not have closed-form solutions for average waiting/sojourn times and the accuracy of existing approximations, which use only a few moments, is poor.17 Furthermore, to the best of our knowledge, there is no widely used approximation for waiting/sojourn time quantiles of these systems. Thus, we use stochastic queuing simulation, based on the BigHouse methodology,12 to measure the tail latency of such M/G/c systems. We
Figure 1. (a) Normalized service- and sojourn-time 99th percentile tail in an M/M/1 queue. (b) Normalized service- and sojourn-time 99th percentile tail in an M/G/1 queue. (c) Average % wait time in sojourn-time tail requests. (d) % of sojourn-time tail requests that are also in the service-time tail. The M/G/1 queue has an exponential service time distribution, but incorporates 100× hiccups that occur in 0.1% of the requests.
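The single-server behavior summarized in Figure 1 can be approximated with a few lines of simulation. The following is a minimal sketch, not the BigHouse-based methodology used for the reported results. It assumes unit-mean exponential service times, models a hiccup as multiplying a request's service time by `hiccup_factor` (an assumed hiccup model), and uses the Lindley recursion for FIFO waiting times. All function and parameter names are illustrative.

```python
import math
import random


def mean_service(hiccup_prob, hiccup_factor):
    """Mean service time of the assumed mixture (base mean is 1.0)."""
    return (1.0 - hiccup_prob) + hiccup_prob * hiccup_factor


def percentile(values, q):
    """Nearest-rank percentile, q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100.0))
    return ordered[rank - 1]


def simulate_single_server(load, n, hiccup_prob, hiccup_factor, seed=1):
    """Simulate n requests through one FIFO server with Poisson arrivals.

    Returns (service_times, sojourn_times). The hiccup model (multiplying an
    exponential sample) is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    arrival_rate = load / mean_service(hiccup_prob, hiccup_factor)
    wait = 0.0
    services, sojourns = [], []
    for _ in range(n):
        s = rng.expovariate(1.0)
        if rng.random() < hiccup_prob:
            s *= hiccup_factor
        services.append(s)
        sojourns.append(wait + s)
        gap = rng.expovariate(arrival_rate)
        # Lindley recursion: waiting time seen by the next arrival.
        wait = max(0.0, wait + s - gap)
    return services, sojourns


if __name__ == "__main__":
    # Equation (1) with P = 0.01 gives a = ln(100) for exponential service.
    print("exponential p99/mean:", round(math.log(100), 2))
    for label, prob, factor in [("M/M/1", 0.0, 1.0), ("M/G/1 + hiccups", 0.001, 100.0)]:
        svc, soj = simulate_single_server(0.7, 200_000, prob, factor)
        mean = sum(svc) / len(svc)
        print(label, "service p99:", round(percentile(svc, 99) / mean, 1),
              "sojourn p99:", round(percentile(soj, 99) / mean, 1))
```

Comparing the two percentiles across the two configurations gives a rough sense of how much of the sojourn-time tail comes from queuing behind slow requests rather than from the slow requests themselves.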
service time distribution is high disparity. Furthermore, since hiccups occur with a low probability (0.1%), they do not noticeably affect the service time 99th percentile tail. However, due to the HoL blocking, their impact on the sojourn-time tail is large under both low and high loads.
Figure 1(d) reports the percentage of requests in the sojourn-time tail that also contribute to the service-time tail. Under both low and high loads, the percentage is much higher in the M/M/1 system. With high disparity service times, HoL blocking in the M/G/1 system comprises the bulk of the tail; most sojourn-time tail requests are nominal requests that queue behind exceptionally slow ones. Furthermore, as shown in Figure 1(c), while the fraction of queuing delay relative to sojourn time in tail requests is higher in the M/G/1 system, queuing still accounts for more than half of sojourn time even in M/M/1 systems for loads over 30%.
The takeaway is that if a service incurs either high load or has a high disparity service time distribution, end-to-end tail latency is dominated by queuing effects. As a result, improving system queuing behavior is typically more effective than seeking to directly mitigate system hiccups that cause heavy/long tails. Finding and mitigating system hiccups is hard. As such, we advocate pursuing optimizations that address queuing behavior instead.
Server Pooling
Figure 2 contrasts two different models to compose multiple servers. In the scale-out model, each server has a separate request queue and a dispatcher/load balancer steers incoming requests into different queues such that the request arrival rate of all servers is balanced. In the scale-up model, instead a single request queue is shared among all servers, which each fetch requests from the central request queue as they become idle. This model requires synchronization of the central request queue, but improves queuing.
It can be shown that the scale-up (M/G/c) organization always outperforms the scale-out organization (c × M/G/1) in principle (neglecting synchronization). First, in the scale-up organization, a server will not remain idle if there are requests waiting in the central queue. However, in scale-out systems, a server may remain idle if its own queue is empty even while other servers have outstanding requests. Second, when a request takes longer than average in a scale-out organization, all the requests behind it suffer from
Figure 2. Scale-out versus scale-up queuing organizations.
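The two organizations in Figure 2 can be compared on an identical request trace. The sketch below is an illustration under stated assumptions rather than the authors' simulator: the scale-out dispatcher picks a queue uniformly at random (one simple way to balance arrival rates), hiccups multiply an exponential service time, and synchronization cost is ignored. All names are hypothetical.

```python
import heapq
import math
import random


def make_workload(n, load_per_server, servers, hiccup_prob, hiccup_factor, seed=7):
    """Return a list of (arrival_time, service_time) pairs with Poisson arrivals."""
    rng = random.Random(seed)
    mean_s = (1.0 - hiccup_prob) + hiccup_prob * hiccup_factor
    rate = servers * load_per_server / mean_s
    now, jobs = 0.0, []
    for _ in range(n):
        now += rng.expovariate(rate)
        s = rng.expovariate(1.0)
        if rng.random() < hiccup_prob:
            s *= hiccup_factor  # assumed hiccup model
        jobs.append((now, s))
    return jobs


def scale_up(jobs, servers):
    """Shared FIFO queue: each request runs on the server that frees up first."""
    free_at = [0.0] * servers
    heapq.heapify(free_at)
    sojourns = []
    for arrival, service in jobs:
        start = max(arrival, heapq.heappop(free_at))
        finish = start + service
        heapq.heappush(free_at, finish)
        sojourns.append(finish - arrival)
    return sojourns


def scale_out(jobs, servers, seed=11):
    """Per-server FIFO queues fed by a random dispatcher (c x M/G/1)."""
    rng = random.Random(seed)
    free_at = [0.0] * servers
    sojourns = []
    for arrival, service in jobs:
        k = rng.randrange(servers)
        start = max(arrival, free_at[k])
        free_at[k] = start + service
        sojourns.append(free_at[k] - arrival)
    return sojourns


def percentile(values, q):
    """Nearest-rank percentile, q in (0, 100]."""
    ordered = sorted(values)
    return ordered[max(1, math.ceil(q * len(ordered) / 100.0)) - 1]


if __name__ == "__main__":
    jobs = make_workload(200_000, 0.7, 2, 0.001, 100.0)
    print("scale-out p99:", round(percentile(scale_out(jobs, 2), 99), 1))
    print("scale-up  p99:", round(percentile(scale_up(jobs, 2), 99), 1))
```

Because both functions consume the same trace, any difference in the reported percentiles comes from the queuing organization alone, not from sampling differences in the workload.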
Figure 3. Normalized service-time (light bars) and sojourn-time (dark bars) tails of an M/G/1 queue under different scenarios. (a) 70% load, 100× hiccups affecting 0.1% of requests. (b) 70% load, 10× hiccups affecting 1% of requests. (c) 30% load, 100× hiccups affecting 0.1% of requests.
multicore server, implementing a scale-up model mandates either a single synchronized data structure or a work-stealing architecture, which incur coherence traffic and are difficult to scale.
We refer to the practice of consolidating c × M/G/1 servers into a single M/G/c system as server pooling. When service-time distributions are high disparity, HoL blocking becomes the main source of queuing delay (and tail latency) and the gap between the two queuing organizations grows. We argue that server pooling can play a key role in resolving HoL blocking under such service conditions and, hence, should be pursued despite higher implementation complexity. In fact, server pooling often reduces the tail latency more than directly mitigating the rare hiccups that cause exceptionally long service.
Figure 3 reports the normalized service/sojourn time tail latencies in an M/G/1 system with different service time distributions and system loads. The leftmost red bars represent tail latencies in the presence of rare hiccups. The next group of blue bars show the tail latency where the impact (i.e., duration/probability) of hiccups has been reduced. In particular, from left to right, these bars represent cases where hiccup duration is halved, their occurrence probability is halved, and where hiccups are fully eliminated. Finally, the cluster of green bars indicate server pooling cases with varying number of servers c. (We discuss the orange bars later.)
Figure 3(a) considers an exponential service time distribution with hiccups that occur 0.1% of the time and last 100× longer than the average service time under 70% system load. We make three observations: First, reducing the hiccup probability is considerably less effective at reducing the overall tail than reducing their duration. The intuition is that longer hiccups cause more requests to queue and hence exacerbate tails more than shorter but more frequent hiccups. Second, pooling only two servers reduces tail latency almost as much as halving hiccup durations. Whereas it may be challenging to implement high-concurrency data structures to enable a high degree of server pooling, sharing queues across just pairs of machines or cores is likely easier than finding and mitigating hiccups. Finally, with greater degrees of server pooling, queuing delay vanishes and the sojourn-time tail and service-time tail match. In such a scenario, end-to-end tail latency is even lower than in a system with no hiccups but without pooling.
Figure 3(b) reports the same results for hiccups 10× longer than the average occurring in 1% of requests. Whereas the general trend matches Figure 3(a), the gap between the service- and sojourn-time tails is noticeably smaller even though the total service time attributable to hiccups is the same (10 × 1% = 100 × 0.1%). As previously observed, longer hiccups introduce more severe HoL blocking and cause more nominal requests to queue behind the exceptional ones (despite lower hiccup probability). Nevertheless, in Figure 3(b), pooling across only two servers, despite hiccups, is enough to reduce the sojourn time tail below that of a system without server pooling and without hiccups.
Figure 3(c) considers the same service time distribution as Figure 3(a), but under lower (30%) system load. Here, whereas queuing delays are typically near-negligible under low load, the high disparity service distribution nevertheless causes HoL blocking and a significant sojourn time tail. Interestingly, the ratio between the sojourn- and service-time tails is much higher than that seen in Figure 3(b) due to longer hiccups and higher HoL blocking, despite lower load. Furthermore, when HoL blocking is high but system load is low, pooling across two servers completely eliminates queuing delay.
In summary, server pooling is highly effective in eliminating HoL blocking and reducing queuing delays that otherwise arise due to rare system hiccups. Although pooling across many cores/machines is often challenging, encouragingly, we show that pooling across as few as two servers is often sufficient for large tail latency reductions.
A variety of steering and scheduling techniques can enable a scale-out system to more closely approximate scale-up system behavior. Examples include smart load-balancing schemes that steer requests to queues based on wait time estimates derived from metrics like queue occupancy, injecting replica requests to different queues and then cancelling the redundant requests,4 and various work-stealing approaches
load/utilization. However, as we showed in the previous section, CCSA is only effective in the absence of server HoL blocking. When HoL blocking is frequent (e.g., service time variance is high), CCSA no longer provides benefit as nominal requests queue behind exceptionally long ones. In such scenarios, additional concurrency must be introduced to unleash CCSA's efficacy.
In single-server systems, service time variability is a good measure of HoL blocking. For example, in Figure 3(a), where CV_service = 4.2, CCSA has negligible impact; nominal requests wait behind slow ones. In contrast, in Figure 3(b) where CV_service = 1.6 (near the CV_service = 1.0 of M/M/· queues), CCSA is more effective than server pooling. However, CV_service only reflects HoL blocking in single-server systems. We suggest the interdeparture time variability of a saturated queue (when queuing probability is close to 1.0) to measure HoL blocking in multiserver queues. In saturated single-server queues, the interdeparture time distribution is the service time distribution (CV_service = CV_departure). However, with multiple servers, departures interleave, reducing interdeparture time variability. For example, in Figure 4, which is similar to Figure 3(a) but with an additional 1–2 servers, the CV_departure drops (from 3.0) to 1.7 and 1.1, respectively. As a result, in the M/G/2 case, CCSA yields almost the same benefit as pooling. In the M/G/3 case, where HoL blocking resembles that of an M/M/· queue with CV_departure = 1.0, CCSA yields much better results than server pooling.
We find that a simple regression model can predict the CV_departure of a saturated M/G/c queue based on its CV_service and the number of servers (c). We construct the model by simulating saturated queues with a set of high disparity distributions with different CV_service and measure their CV_departure. We observe that small degrees of server pooling quickly reduce HoL blocking. Therefore, we postulate an exponential decay effect for the number of servers. Also, we note that CV_departure may not decrease below 1.0 as the interdeparture process becomes near-memoryless around CV_departure = 1.0, where the ratio of tail-to-average cases does not decrease through higher concurrency [see (1)]. As a result, we suggest a regression model of the form of
CV_departure ≈ (CV_service - 1) e^{-0.8(c-1)} + 1 (2)
and tune its parameter using the least squares method. We find its average error to be less than 13%.
Using this model, we can derive CV_departure as a proxy for the HoL blocking rate and predict how it is affected by server pooling. Alternatively, cloud system architects may perform Stochastic Queueing Simulations, similar to our approach, and directly measure CV_departure instead of predicting it. When the system approaches the CV_departure = 1.0 of M/M/· queues, blocking becomes rare; the remaining tail of the sojourn time distribution is then primarily due to service time tails or high load. Under low load, queuing delays vanish with sufficient server pooling; remaining sojourn time tails reflect only service tails. Under high load, HoL blocking will no longer be the dominant source of queuing delays when sufficient concurrency has been introduced. As such, with sufficient server pooling (often just 2–3 servers), CCSA becomes more effective than further server pooling.
In short, we recommend developers follow a simple optimization sequence to address tail latency in their services: 1) introduce server pooling until HoL blocking is sufficiently mitigated; 2) if load is high, introduce CCSA; 3) if end-to-end tails remain unacceptable, only then seek to directly optimize rare, high service latencies.
CONCLUSION
Improving a system's queuing behavior often yields much greater benefit than mitigating the individual system hiccups that increase service time tails. We suggest two general directions for improving system queuing behavior, server pooling and CCSA, which synergistically address queuing behaviors that often drive tail latency.
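As a companion to the regression model in (2), the following minimal sketch evaluates the fitted form and, separately, measures the interdeparture-time CV of a saturated multiserver queue by direct simulation. It is an illustration only: the hiccup model (multiplying a unit-mean exponential sample), the helper `servers_for_target`, and any CV threshold passed to it are assumptions that do not come from the article.

```python
import heapq
import math
import random


def predict_cv_departure(cv_service, servers, decay=0.8):
    """Equation (2): exponential decay of CV_departure toward 1.0 with more servers."""
    return (cv_service - 1.0) * math.exp(-decay * (servers - 1)) + 1.0


def servers_for_target(cv_service, target_cv, max_servers=64):
    """Smallest pool size whose predicted CV_departure is at or below target_cv.

    Hypothetical helper; the target is a design choice, not a value from the text.
    """
    for c in range(1, max_servers + 1):
        if predict_cv_departure(cv_service, c) <= target_cv:
            return c
    return None


def measure_cv_departure(servers, n=200_000, hiccup_prob=0.001,
                         hiccup_factor=100.0, seed=3):
    """Interdeparture-time CV of a saturated queue with the given pool size.

    Saturated means a server starts a new request the moment it finishes one.
    """
    rng = random.Random(seed)

    def draw():
        s = rng.expovariate(1.0)
        return s * hiccup_factor if rng.random() < hiccup_prob else s

    finish = [draw() for _ in range(servers)]
    heapq.heapify(finish)
    gaps, last = [], 0.0
    for _ in range(n):
        t = heapq.heappop(finish)       # next departure across all servers
        gaps.append(t - last)
        last = t
        heapq.heappush(finish, t + draw())
    mean = sum(gaps) / n
    var = sum((g - mean) ** 2 for g in gaps) / n
    return math.sqrt(var) / mean


if __name__ == "__main__":
    cv1 = measure_cv_departure(1)       # equals CV_service for one server
    for c in (1, 2, 3):
        print(c, "measured:", round(measure_cv_departure(c), 2),
              "predicted:", round(predict_cv_departure(cv1, c), 2))
```

With a single server the measured value is simply the service-time CV, which gives a convenient starting point for comparing the prediction against the simulated values at larger pool sizes.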
Column: Micro Economics
Digital Object Identifier 10.1109/MM.2019.2919886
Date of current version 23 July 2019.
0272-1732 © 2019 IEEE. Published by the IEEE Computer Society.
NOBODY KNOWS WHO organized the attack. It might have come from an angry gamer, or from a rogue spy, or, perhaps, an angry rogue spy playing games. The program hijacked many cameras and home devices, and redirected them to engineer a series of distributed denial of service (DDOS) attacks a few hours apart, all on October 21, 2016. By executing this novel and rather clever hijack of many devices for a DDOS attack, the attack exposed an important vulnerability in today's internet.
The attack contains one other element. It aimed at Dyn, which acts as a name resolver. Dyn enables Internet traffic by translating the site's domain name (URL) into the IP address where the server behind that domain is to be found. During the later phases of the attack, Dyn servers were unable to process users' requests, and as a result, users lost access to web domains contracting with Dyn, such as Netflix, CNBC, and Twitter. Other well-known firms also were disabled, such as Airbnb, Etsy, PlayStation Network, and Wikia.
This article focuses on the aftermath of this event, which did not get headlines, but illustrates an important feature of the situation. Specifically, how did users react? User behavior tells us something about the challenges facing suppliers, and in this case, it tells us about a basic challenge in network security today. It will take a bit of work to appreciate the lesson, and, let me tip my hand, the news is not good.
This article provides a summary of a longer study done by a group of my colleagues and myself.1
What happened?
Start with the basics. A website owner can set up its own name resolver, or hire somebody to do it for them. Many years ago most firms did it themselves. Like a lot of things on the internet, over time a set of professional firms emerged, while many technically sophisticated firms still perform this function for themselves.
What do name resolvers do? Start with the basics. It is a mouthful, but we need to understand it to understand what happened to Dyn. When an application (such as a web browser) wants to access a page or resource located at a known domain name, it uses DNS records to find the corresponding IP address. In principle, the application submits a request to a DNS "resolver" asking for the IP address corresponding to a given domain name. The resolver queries a root nameserver, which replies with the address of the TLD nameserver specified by the domain name (e.g., ".com"). The resolver then queries that TLD nameserver with the second component
of the domain name (e.g., "google"). The TLD nameserver retrieves that domain's authoritative nameservers (e.g., "ns1.google.com") and returns them to the resolver. Finally, the resolver queries one of the authoritative nameservers and receives a usable IP address for the domain. The IP address is passed back to the original application, which can use it to connect to the desired host. This entire process generally takes just milliseconds.
In practice, many users request the same thing repeatedly, so it is possible to cache the answers to many of the intermediate steps, and that can speed up the resolution even more. More to the point, caching across many servers acts as a quasi-buffer against a DDOS attack, especially if the attack can be rebuffed in a short time.
The emergence of numerous lessons and automated standard practices has gone hand-in-hand with the emergence of professional firms. Such firms know how to provide services at scale. In turn, and as with many other professional markets, some of them became good at their service, and that performance attracted many customers.
Thus emerged an ironic outcome. While the internet contains many points of resiliency, the increasing concentration of services in a small number of providers has created concentrated points of failure. To say it another way, Dyn performed approximately 10% of nameserver services in the United States (prior to the attack), so bringing it down could bring down 10% of the internet's servers. Since name resolution also supports a range of other communications, such as email, it could also cripple communications at many firms. That is a big deal.
As should be obvious, the potential economic losses could be enormous. Many businesses, especially internet businesses, lose a lot of revenue from being down for a day.
That motivates a basic question: after this attack, what did users do? It did not take much professional experience to understand the vulnerability or the lessons. Businesses were vulnerable to a single point of failure. A website could construct a form of insurance by performing a simple act: maintaining multiple name servers with multihoming.
We set out to find out two things. First, how many firms maintain multihoming? Second, how did the use of resolvers change after the Dyn attack? Figure 1, which is taken from Figure 7 of our study,1 tells a big part of the answer we found. This figure shows the market shares for the largest providers of nameservers between late 2011 and the middle of 2017.
Figure 1. Market share for largest name servers.
What did firms do?
Figure 1 shows that, long before the Dyn attack, name servers had embarked on a general trend toward more concentration. The growth of three firms (Dyn, AWS, and Cloudflare) drove this trend. Dyn's growth had already begun to level off by 2014, while AWS and Cloudflare have continued to grow unabated throughout the time period.
The figure actually does not tell the entire story. We did a little investigation, and found that both AWS and Cloudflare attract a high
fraction of the contracts from newly founded firms. Older firms tend to use others.
We also found that, as expected, over time an increasing fraction of users contracted with providers instead of doing it themselves. Close to 60% did it themselves in 2011, while close to 30% did in 2017. In other words, the market shares rose during a time when the number of customers practically doubled.
The figure also illustrates the big surprise. The attack on Dyn had consequences for its commercial success. Many of its users seem to have blamed Dyn for the downtime, and shifted to another provider. Within a couple of months Dyn lost a quarter of its customers. Ouch.
But wait, that is not the only surprise. It is also important to notice what did not happen: We did not find much multihoming after the Dyn attack. Around 11% of existing domains multihomed prior to the Dyn attack, and about 18% did so after the attack. New domains multihomed at a rate of less than 5% prior to the attack, and after the attack about 8% did. In short, there was an uptick in multihoming, but the vast majority of sites continued to use a single provider.
Why does that matter? The least expensive insurance against another attack is multihoming with more than one resolver. Let us be clear about it: It is not expensive. It is just a minor hassle to maintain multiple suppliers. (And, moreover, many aspects of web administration are no worse a hassle. That is the job, after all.)
Not illustrated in the figure are the minutiae of market shares. In case you were wondering, most of the multihoming among existing websites came from customers of Dyn and somewhat from Neustar (another large provider). No other provider saw a big change.
You might reasonably reply that Cloudflare's security model makes it difficult to multihome, but does protect against this type of attack. Which is true. But what accounts for so little multihoming among all the others? Most users act as if they do not care.
CONCLUSION
Let us summarize. There was a DDOS attack on Dyn. It demonstrated a new vulnerability, and we are still vulnerable to another similar attack. In fact, we may be more vulnerable now that bad actors have watched the prior demonstration.
How did the user community act? The two big headlines are: 1) a fraction of customers acted as if they blamed Dyn, and took precautions; and 2) all but a small fraction of non-Dyn customers did not act as if they learned any lesson.
I do not know about you, but I do not feel any safer. Sigh.
REFERENCE
1. S. Bates, J. Bowers, S. Greenstein, J. Weinstock, and J. Zittrain, "In support of internet entropy: Mitigating an increasingly dangerous lack of redundancy in DNS resolution by major websites and services," NBER working paper 24317, 2018. [Online]. Available at SSRN: https://ssrn.com/abstract=3241740
Shane Greenstein is a professor at the Harvard Business School. Contact him at sgreenstein@hbs.edu.
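The resolution walk and the multihoming argument made in this column can be illustrated with a toy model. The sketch below is purely hypothetical: it uses in-memory tables instead of real DNS queries, reserved example domain names, documentation-only IP addresses, and a cache with no expiry. None of the names or records come from the study.

```python
# Hypothetical zone data: reserved example names, documentation-only addresses.
ROOT = {"com": "tld-com", "net": "tld-net"}
TLD = {
    # example.com is multihomed across two providers; example.net uses one.
    "tld-com": {"example.com": ["ns1.provider-a.example", "ns1.provider-b.example"]},
    "tld-net": {"example.net": ["ns1.provider-a.example"]},
}
AUTHORITATIVE = {
    "ns1.provider-a.example": {"example.com": "192.0.2.10",
                               "example.net": "198.51.100.20"},
    "ns1.provider-b.example": {"example.com": "192.0.2.10"},
}


class Resolver:
    """Toy resolver: root -> TLD -> authoritative, with a simple answer cache."""

    def __init__(self, down=()):
        self.down = set(down)   # nameservers that are unreachable (e.g., under attack)
        self.cache = {}         # simplification: cached answers never expire

    def resolve(self, name):
        if name in self.cache:
            return self.cache[name]
        tld_server = ROOT[name.rsplit(".", 1)[-1]]      # step 1: ask the root
        nameservers = TLD[tld_server][name]              # step 2: ask the TLD
        for ns in nameservers:                           # step 3: ask an authoritative server
            if ns in self.down:
                continue                                 # fall back to the next provider
            address = AUTHORITATIVE[ns][name]
            self.cache[name] = address
            return address
        raise LookupError(f"no reachable nameserver for {name}")


if __name__ == "__main__":
    attacked = Resolver(down={"ns1.provider-a.example"})
    print(attacked.resolve("example.com"))   # multihomed: second provider answers
    try:
        attacked.resolve("example.net")      # single provider: resolution fails
    except LookupError as err:
        print("failed:", err)
```

In this model the single-provider domain becomes unreachable as soon as its one nameserver goes down, unless the answer is already cached, while the multihomed domain keeps resolving through its second provider.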
PURPOSE: The IEEE Computer Society is the world's largest association of computing professionals and is the leading provider of technical information in the field.
MEMBERSHIP: Members receive the monthly magazine Computer, discounts, and opportunities to serve (all activities are led by volunteer members). Membership is open to all IEEE members, affiliate society members, and others interested in the computer field.
COMPUTER SOCIETY WEBSITE: www.computer.org
OMBUDSMAN: Direct unresolved complaints to ombudsman@computer.org.
CHAPTERS: Regular and student chapters worldwide provide the opportunity to interact with colleagues, hear technical experts, and serve the local professional community.
EXECUTIVE COMMITTEE
President: Cecilia Metra
President-Elect: Leila De Floriani
Past President: Hironori Kasahara
First VP: Forrest Shull; Second VP: Avi Mendelson; Secretary: David Lomet; Treasurer: Dimitrios Serpanos; VP, Member & Geographic Activities: Yervant Zorian; VP, Professional & Educational Activities: Kunio Uchiyama; VP, Publications: Fabrizio Lombardi; VP, Standards Activities: Riccardo Mariani; VP, Technical & Conference Activities: William D. Gropp
2018–2019 IEEE Division V Director: John W. Walz
2019 IEEE Division V Director Elect: Thomas M. Conte
2019–2020 IEEE Division VIII Director: Elizabeth L. Burd