DOI:10.1145/3282307

A New Golden Age for Computer Architecture
WE BEGAN OUR Turing Lecture June 4, 2018,11 with a review of computer architecture since the 1960s. In addition to that review, here, we highlight current challenges and identify future opportunities, projecting another golden age for the field of computer architecture in the next decade, much like the 1980s when we did the research that led to our award, delivering gains in cost, energy, and security, as well as performance.

"Those who cannot remember the past are condemned to repeat it." —George Santayana, 1905

[...] engineers, including ACM A.M. Turing Award laureate Fred Brooks, Jr., thought they could create a single ISA that would efficiently unify all four of these ISA bases. They needed a technical solution for how computers as inexpensive as [...]

[...] designers then and now is the "brains" of the processor—the control hardware. Inspired by software programming, computing pioneer and Turing laureate Maurice Wilkes proposed how to simplify control. Control was specified as a two-dimensional array he called a "control store." Each column of the array corresponded to [...] microinstructions. The control store was implemented through memory, which was much less costly than logic gates.

The table here lists four models of the new System/360 ISA IBM announced April 7, 1964. The data paths vary by a factor of 8, memory capacity by a factor of 16, clock rate by nearly 4, performance by 50, and cost by nearly 6. [...]

[...] they took more clock cycles to execute a System/360 instruction. Facilitated by microprogramming, IBM bet the future of the company that the new ISA would revolutionize the computing industry and won the bet. IBM dominated its markets, and IBM mainframe descendants of the computer family announced 55 years ago still bring in $10 billion in revenue per year.
FEBRUARY 2019 | VOL. 62 | NO. 2 | COMMUNICATIONS OF THE ACM

turing lecture
As seen repeatedly, although the marketplace is an imperfect judge of technological issues, given the close ties between architecture and commercial computers, it eventually determines the success of architecture innovations that often require significant engineering investment.

Integrated circuits, CISC, 432, 8086, IBM PC. When computers began using integrated circuits, Moore's Law meant control stores could become much larger. Larger memories in turn allowed much more complicated ISAs. Consider that the control store of the VAX-11/780 from Digital Equipment Corp. in 1977 was 5,120 words × 96 bits, while its predecessor used only 256 words × 56 bits.

Some manufacturers chose to make microprogramming available by letting select customers add custom features they called "writable control store" (WCS). The most famous WCS computer was the Alto36 Turing laureates Chuck Thacker and Butler Lampson, together with their colleagues, created for the Xerox Palo Alto Research Center in 1973. It was indeed the first personal computer, sporting the first bit-mapped display and first Ethernet local-area network. The device controllers for the novel display and network were microprograms stored in a 4,096-word × 32-bit WCS.

Microprocessors were still in the 8-bit era in the 1970s (such as the Intel 8080) and programmed primarily in assembly language. Rival designers would add novel instructions to outdo one another, showing their advantages through assembly language examples.

Gordon Moore believed Intel's next ISA would last the lifetime of Intel, so he hired many clever computer science Ph.D.'s and sent them to a new facility in Portland to invent the next great ISA. The 8800, as Intel originally named it, was an ambitious computer architecture project for any era, certainly the most aggressive of the 1980s. It had 32-bit capability-based addressing, object-oriented architecture, variable-bit-length instructions, and its own operating system written in the then-new programming language Ada.

This ambitious project was alas several years late, forcing Intel to start an emergency replacement effort in Santa Clara to deliver a 16-bit microprocessor in 1979. Intel gave the new team 52 weeks to develop the new "8086" ISA and design and build the chip. Given the tight schedule, designing the ISA took only 10 person-weeks over three regular calendar weeks, essentially by extending the 8-bit registers and instruction set of the 8080 to 16 bits. The team completed the 8086 on schedule but to little fanfare when announced.

To Intel's great fortune, IBM was developing a personal computer to compete with the Apple II and needed a 16-bit microprocessor. IBM was interested in the Motorola 68000, which had an ISA similar to the IBM 360, but it was behind IBM's aggressive schedule. IBM switched instead to an 8-bit bus version of the 8086. When IBM announced the PC on August 12, 1981, the hope was to sell 250,000 PCs by 1986. The company instead sold 100 million worldwide, bestowing a very bright future on the emergency replacement Intel ISA.
Features of four models of the IBM System/360 family; IPS is instructions per second.

Model                        M30               M40               M50               M65
Datapath width               8 bits            16 bits           32 bits           64 bits
Control store size           4k × 50           4k × 52           2.75k × 85        2.75k × 87
Clock rate (ROM cycle time)  1.3 MHz (750 ns)  1.6 MHz (625 ns)  2 MHz (500 ns)    5 MHz (200 ns)
Memory capacity              8–64 KiB          16–256 KiB        64–512 KiB        128–1,024 KiB
Performance (commercial)     29,000 IPS        75,000 IPS        169,000 IPS       567,000 IPS
Performance (scientific)     10,200 IPS        40,000 IPS        133,000 IPS       563,000 IPS
Price (1964 $)               $192,000          $216,000          $460,000          $1,080,000
Price (2018 $)               $1,560,000        $1,760,000        $3,720,000        $8,720,000

Intel's original 8800 project was renamed iAPX-432 and finally announced in 1981, but it required several chips and had severe performance problems. It was discontinued in 1986, the year after Intel extended the 16-bit 8086 ISA in the 80386 by expanding its registers from 16 bits to 32 bits. Moore's prediction was thus correct that the next ISA would last as long as Intel did, but the marketplace chose the emergency replacement 8086 rather than the anointed 432. As the architects of the Motorola 68000 and iAPX-432 both learned, the marketplace is rarely patient.

Figure 1. University of California, Berkeley, RISC-I and Stanford University MIPS microprocessors.
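The factor-of-8/16/4/50 ratios quoted alongside the System/360 table can be checked directly against the table's own numbers. A small sketch (values copied from the table above; the performance ratio uses the scientific IPS column):

```python
# Low-end (M30) vs. high-end (M65) System/360 models, from the table above.
m30 = {"datapath_bits": 8, "mem_kib_max": 64, "clock_mhz": 1.3,
       "sci_ips": 10_200, "price_1964": 192_000}
m65 = {"datapath_bits": 64, "mem_kib_max": 1_024, "clock_mhz": 5.0,
       "sci_ips": 563_000, "price_1964": 1_080_000}

for key in m30:
    print(f"{key:14s} ratio M65/M30: {m65[key] / m30[key]:.1f}")

# Datapath 8x, memory 16x, clock ~3.8x ("nearly 4"), scientific
# performance ~55x ("by 50"), price ~5.6x ("nearly 6").
```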
From complex to reduced instruction set computers. The early 1980s saw several investigations into complex instruction set computers (CISC) enabled by the big microprograms in the larger control stores. With Unix demonstrating that even operating systems could use high-level languages, the critical question became: "What instructions would compilers generate?" instead of "What assembly language would programmers use?" Significantly raising the hardware/software interface created an opportunity for architecture innovation.

Turing laureate John Cocke and his colleagues developed simpler ISAs and compilers for minicomputers. As an experiment, they retargeted their research compilers to use only the simple register-register operations and load-store data transfers of the IBM 360 ISA, avoiding the more complicated instructions. They found that programs ran up to three times faster using the simple subset. Emer and Clark6 found 20% of the VAX instructions needed 60% of the microcode and represented only 0.2% of the execution time. One author (Patterson) spent a sabbatical at DEC to help reduce bugs in VAX microcode. If microprocessor manufacturers were going to follow the CISC ISA designs of the larger computers, he thought they would need a way to repair the microcode bugs. He wrote such a paper,31 but the journal Computer rejected it. Reviewers opined that it was a terrible idea to build microprocessors with ISAs so complicated that they needed to be repaired in the field. That rejection called into question the value of CISC ISAs for microprocessors. Ironically, modern CISC microprocessors do indeed include microcode repair mechanisms, but the main result of his paper rejection was to inspire him to work on less-complex ISAs for microprocessors—reduced instruction set computers (RISC).

These observations and the shift to high-level languages led to the opportunity to switch from CISC to RISC. First, the RISC instructions were simplified so there was no need for a microcoded interpreter. The RISC instructions were typically as simple as microinstructions and could be executed directly by the hardware. Second, the fast memory, formerly used for the microcode interpreter of a CISC ISA, was repurposed to be a cache of RISC instructions. (A cache is a small, fast memory that buffers recently executed instructions, as such instructions are likely to be reused soon.) Third, register allocators based on Gregory Chaitin's graph-coloring scheme made it much easier for compilers to efficiently use registers, which benefited these register-register ISAs.3 Finally, Moore's Law meant there were enough transistors in the 1980s to include a full 32-bit datapath, along with instruction and data caches, in a single chip.

For example, Figure 1 shows the RISC-I8 and MIPS12 microprocessors developed at the University of California, Berkeley, and Stanford University in 1982 and 1983, respectively, that demonstrated the benefits of RISC. These chips were eventually presented at the leading circuit conference, the IEEE International Solid-State Circuits Conference, in 1984.33,35 It was a remarkable moment when a few graduate students at Berkeley and Stanford could build microprocessors that were arguably superior to what industry could build.

These academic chips inspired many companies to build RISC microprocessors, which were the fastest for the next 15 years. The explanation is due to the following formula for processor performance:

Time/Program = Instructions/Program × Clock cycles/Instruction × Time/Clock cycle

DEC engineers later showed2 that the more complicated CISC ISA executed about 75% of the number of instructions per program as RISC (the first term), but in a similar technology CISC executed about five to six times as many clock cycles per instruction (the second term), making RISC microprocessors approximately 4× faster.

Such formulas were not part of computer architecture books in the 1980s, leading us to write Computer Architecture: A Quantitative Approach13 in 1989. The subtitle suggested the theme of the book: Use measurements and benchmarks to evaluate trade-offs quantitatively instead of relying more on the architect's intuition and experience, as in the past. The quantitative approach we used was also inspired by what Turing laureate Donald Knuth's book had done for algorithms.20
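The RISC-CISC comparison above can be replayed with the performance formula in a few lines of Python; the 0.75 instruction ratio and the roughly 5–6× cycles-per-instruction ratio are the approximate figures quoted from the DEC study, not exact measurements:

```python
# Iron law: Time/Program = (Instructions/Program) x (Cycles/Instruction) x (Time/Cycle)
def exec_time(instructions, cpi, cycle_time):
    return instructions * cpi * cycle_time

# Approximate figures from the DEC comparison quoted above:
# CISC ran ~0.75x as many instructions, but needed ~5-6x the cycles each.
risc_time = exec_time(instructions=1.0, cpi=1.0, cycle_time=1.0)   # normalized
cisc_time = exec_time(instructions=0.75, cpi=5.5, cycle_time=1.0)  # same technology

speedup = cisc_time / risc_time
print(f"RISC speedup over CISC: {speedup:.1f}x")  # roughly 4x, matching the text
```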
VLIW, EPIC, Itanium. The next ISA innovation was supposed to succeed both RISC and CISC. Very long instruction word (VLIW)7 and its cousin, the explicitly parallel instruction computer (EPIC), the name Intel and Hewlett Packard gave to the approach, used wide instructions with multiple independent operations bundled together in each instruction. VLIW and EPIC advocates at the time believed if a single instruction could specify, say, six independent operations—two data transfers, two integer operations, and two floating point operations—and compiler technology could efficiently assign operations into the six instruction slots, the hardware could be made simpler. Like the RISC approach, VLIW and EPIC shifted work from the hardware to the compiler.

Working together, Intel and Hewlett Packard designed a 64-bit processor based on EPIC ideas to replace the 32-bit x86. High expectations were set for the first EPIC processor, called Itanium by Intel and Hewlett Packard, but the reality did not match its developers' early claims. Although the EPIC approach worked well for highly structured floating-point programs, it struggled to achieve high performance for integer programs that had less predictable cache misses or less-predictable branches. As Donald Knuth later noted:21 "The Itanium approach ... was supposed to be so terrific—until it turned out that the wished-for compilers were basically impossible to write." Pundits noted delays and underperformance of Itanium and rechristened it "Itanic" after the ill-fated Titanic passenger ship. The marketplace again eventually ran out of patience, leading to a 64-bit version of the x86 as the successor to the 32-bit x86, and not Itanium. The good news is VLIW still matches narrower applications with small programs and simpler branches that omit caches, including digital-signal processing.

RISC vs. CISC in the PC and Post-PC Eras

AMD and Intel used 500-person design teams and superior semiconductor technology to close the performance gap between x86 and RISC. Again inspired by the performance advantages of pipelining simple vs. complex instructions, the instruction decoder translated the complex x86 instructions into internal RISC-like microinstructions on the fly. AMD and Intel then pipelined the execution of the RISC microinstructions. Any ideas RISC designers were using for performance—separate instruction and data caches, second-level caches on chip, deep pipelines, and fetching and executing several instructions simultaneously—could then be incorporated into the x86. AMD and Intel shipped roughly 350 million x86 microprocessors annually at the peak of the PC era in 2011. The high volumes and low margins of the PC industry also meant lower prices than RISC computers.

Given the hundreds of millions of PCs sold worldwide each year, PC software became a giant market. Whereas software providers for the Unix marketplace would offer different software versions for the different commercial RISC ISAs—Alpha, HP-PA, MIPS, Power, and SPARC—the PC market enjoyed a single ISA, so software developers shipped "shrink-wrap" software that was binary compatible with only the x86 ISA. A much larger software base, similar performance, and lower prices led the x86 to dominate both desktop computers and small-server markets by 2000.

Apple helped launch the post-PC era with the iPhone in 2007. Instead of buying microprocessors, smartphone companies built their own systems on a chip (SoC) using designs from other companies, including RISC processors from ARM. Mobile-device designers valued die area and energy efficiency as much as performance, disadvantaging CISC ISAs. Moreover, arrival of the Internet of Things vastly increased both the number of processors and the required trade-offs in die size, power, cost, and performance. This trend increased the importance of design time and cost, further disadvantaging CISC processors. In today's post-PC era, x86 shipments have fallen almost 10% per year since the peak in 2011, while chips with RISC processors have skyrocketed to 20 billion. Today, 99% of 32-bit and 64-bit processors are RISC.

Concluding this historical review, we can say the marketplace settled the RISC-CISC debate; CISC won the later stages of the PC era, but RISC is winning the post-PC era. There have been no new CISC ISAs in decades. To our surprise, the consensus on the best ISA principles for general-purpose processors today is still RISC, 35 years after their introduction.

Figure 2. Transistors per chip of Intel microprocessors vs. Moore's Law (1975 version).

Figure 3. Transistors per chip and power per mm2.
Current Challenges for Processor Architecture

"If a problem has no solution, it may not be a problem, but a fact—not to be solved, but to be coped with over time." —Shimon Peres

While the previous section focused on the design of the instruction set architecture (ISA), most computer architects do not design new ISAs but implement existing ISAs in the prevailing implementation technology. Since the late 1970s, the technology of choice has been metal oxide semiconductor (MOS)-based integrated circuits, first n-type metal–oxide semiconductor (nMOS) and then complementary metal–oxide semiconductor (CMOS). The stunning rate of improvement in MOS technology—captured in Gordon Moore's predictions—has been the driving factor enabling architects to design more-aggressive methods for achieving performance for a given ISA. Moore's original prediction in 196526 called for a doubling in transistor density yearly; in 1975, he revised it, projecting a doubling every two years.28 It eventually became called Moore's Law. Because transistor density grows quadratically while speed grows linearly, [...] approaches fundamental limits.

Accompanying Moore's Law was a projection made by Robert Dennard called "Dennard scaling,"5 stating that as transistor density increased, power consumption per transistor would drop, so the power per mm2 of silicon would be near constant. Since the computational capability of a mm2 of silicon was increasing with each new generation of technology, computers would become more energy efficient. Dennard scaling began to slow significantly in 2007 and faded to almost nothing by 2012 (see Figure 3).

Between 1986 and about 2002, the exploitation of instruction level parallelism (ILP) was the primary architectural method for gaining performance and, along with improvements in speed of transistors, led to an annual performance increase of approximately 50%. The end of Dennard scaling meant architects had to find more efficient ways to exploit parallelism.

To understand why increasing ILP caused greater inefficiency, consider a modern processor core like those from ARM, Intel, and AMD. Assume it has a 15-stage pipeline and can issue four instructions every clock cycle. It thus has up to 60 instructions in the pipeline at any moment in time, including approximately 15 branches, as they represent approximately 25% of executed instructions. To keep the pipeline full, branches are predicted and code is speculatively placed into the pipeline for execution. The use of speculation is both the source of ILP performance and of inefficiency. When branch prediction is perfect, speculation improves performance yet involves little added energy cost—it can even save energy—but when it "mispredicts" branches, the processor must throw away the incorrectly speculated instructions, and their computational work and energy are wasted. The internal state of the processor must also be restored to the state that existed before the mispredicted branch, expending additional time and energy.

Figure 4. Wasted instructions as a percentage of all instructions completed on an Intel Core i7 for a variety of SPEC integer benchmarks.

[Figure 5. Speedup for 1 to 64 processors under varying serial fractions of execution.]

To see how challenging such a design is, consider the difficulty of correctly predicting the outcome of 15 branches.
If a processor architect wants to limit wasted work to only 10% of the time, the processor must predict each branch correctly 99.3% of the time. Few general-purpose programs have branches that can be predicted so accurately.

To appreciate how this wasted work adds up, consider the data in Figure 4, showing the fraction of instructions that are effectively executed but turn out to be wasted because the processor speculated incorrectly. On average, 19% of the instructions are wasted for these benchmarks on an Intel Core i7. The amount of wasted energy is greater, however, since the processor must use additional energy to restore the state when it speculates incorrectly. Measurements like these led many to conclude architects needed a different approach to achieve performance improvements. The multicore era was thus born.

Multicore shifted responsibility for identifying parallelism and deciding how to exploit it to the programmer and to the language system. Multicore does not resolve the challenge of energy-efficient computation that was exacerbated by the end of Dennard scaling. Each active core burns power whether or not it contributes effectively to the computation. A primary hurdle is an old observation, called Amdahl's Law, stating that the speedup from a parallel computer is limited by the portion of a computation that is sequential. To appreciate the importance of this observation, consider Figure 5, showing how much faster an application runs with up to 64 cores compared to a single core, assuming different portions of serial execution, where only one processor is active. For example, when only 1% of the time is serial, the speedup for a 64-processor configuration is about 35. Unfortunately, the power needed is proportional to 64 processors, so approximately 45% of the energy is wasted.

Real programs have more complex structures of course, with portions that allow varying numbers of processors to be used at any given moment in time. Nonetheless, the need to communicate and synchronize periodically means most applications have some portions that can effectively use only a fraction of the processors. Although Amdahl's Law is more than 50 years old, it remains a difficult hurdle.

With the end of Dennard scaling, increasing the number of cores on a chip meant power is also increasing at nearly the same rate. Unfortunately, the power that goes into a processor must also be removed as heat. Multicore processors are thus limited by the thermal design power (TDP), or average amount of power the package and cooling system can remove. Although some high-end data centers may use more advanced packages and cooling technology, no computer users would want to put a small heat exchanger on their desks or wear a radiator on their backs to cool their cellphones. The limit of TDP led directly to the era of "dark silicon," whereby processors would slow the clock rate and turn off idle cores to prevent overheating. Another way to view this approach is that some chips can reallocate their precious power from the idle cores to the active ones.

An era without Dennard scaling, along with reduced Moore's Law and Amdahl's Law in full effect means inefficiency limits improvement in performance to only a few percent per year (see Figure 6). Achieving [...]

Figure 6. Growth of computer performance using integer programs (SPECintCPU), relative to the VAX11-780: CISC 2×/2.5 years (22%/year); RISC 2×/1.5 years (52%/year); end of Dennard scaling ⇒ multicore 2×/3.5 years (23%/year); Amdahl's Law ⇒ 2×/6 years (12%/year); end of the line ⇒ 2×/20 years (3%/year).

Figure 7. Potential speedup of matrix multiply in Python for four optimizations (from native Python to 62,806×).
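The branch-prediction arithmetic above is easy to verify: with roughly 15 branches in flight, the probability that all of them are predicted correctly is the per-branch accuracy raised to the 15th power. A short sketch:

```python
# With n branches in flight, P(all correct) = p**n for per-branch accuracy p.
n_branches = 15

for p in (0.95, 0.99, 0.993):
    all_correct = p ** n_branches
    print(f"per-branch accuracy {p:.1%}: whole pipeline correct {all_correct:.1%} of the time")

# 0.993**15 is about 0.90: even 99.3%-accurate prediction leaves
# roughly 10% of the speculated work wasted, as the text states.
```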
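Amdahl's Law, discussed above, can be stated in one line: with serial fraction s and n processors, the best speedup is 1/(s + (1−s)/n). A sketch; note the idealized formula gives a ceiling of about 39× for 1% serial on 64 processors, while the ~35× in the text is read from the figure:

```python
# Amdahl's Law: best speedup on n processors when fraction s is serial.
def amdahl_speedup(s, n):
    return 1.0 / (s + (1.0 - s) / n)

n = 64
for s in (0.01, 0.02, 0.05, 0.10):
    print(f"serial {s:.0%}: speedup on {n} cores = {amdahl_speedup(s, n):.1f}")

# Taking the ~35x read from the figure, utilization is 35/64 ~ 55%,
# so roughly 45% of the energy of the 64 cores is wasted, as the
# text observes.
```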
[Figure 8. TPU block diagram (not to scale): a host interface supplies instructions and data; DDR3 interfaces (30 GiB/s) feed a weight FIFO (weight fetcher); a unified buffer (local storage) feeds a systolic matrix unit at 165 GiB/s, followed by accumulators and activation and normalize/pool units.]
[...] high-level languages with dynamic typing and storage management. Unfortunately, such languages are typically interpreted and execute very inefficiently. Leiserson et al.24 used a small example—performing matrix multiply—to illustrate this inefficiency. As in Figure 7, simply rewriting the code in C from Python—a typical high-level, dynamically typed language—increases performance 47-fold. Using parallel loops running on many cores yields a factor of approximately 7. Optimizing the memory layout to exploit caches yields a factor of 20, and a final factor of 9 comes from using the hardware extensions for doing single instruction multiple data (SIMD) parallelism operations that are able to perform 16 32-bit operations per instruction. All told, the final, highly optimized version runs more than 62,000× faster on a multicore Intel processor compared to the original Python version. This is of course a small example, one might expect programmers to use an optimized library for it. Although it exaggerates the usual performance gap, there are likely many programs for which factors of 100 to 1,000 could be achieved.

An interesting research direction concerns whether some of the performance gap can be closed with new compiler technology, possibly assisted by architectural enhancements. Although the challenges in efficiently translating and implementing high-level scripting languages like Python are difficult, the potential gain is enormous. Achieving even 25% of the potential gain could result in Python programs running tens to hundreds of times faster. This simple example illustrates how great the gap is between modern languages emphasizing programmer productivity and traditional approaches emphasizing performance.

Domain-specific architectures. A more hardware-centric approach is to design architectures tailored to a specific problem domain and offer significant performance (and efficiency) gains for that domain, hence, the name "domain-specific architectures" (DSAs), a class of processors tailored for a specific domain—programmable and often Turing-complete but tailored to a specific class of applications. In this sense, they differ from application-specific integrated circuits (ASICs) that are often used for a single function with code that rarely changes. DSAs are often called accelerators, since they accelerate some of an application when compared to executing the entire application on a general-purpose CPU. Moreover, DSAs can achieve better performance because they are more closely tailored to the needs of the application; examples of DSAs include graphics processing units (GPUs), neural network processors used for deep learning, and processors for software-defined networks (SDNs). DSAs can achieve higher performance and greater energy efficiency for four main reasons:

First and most important, DSAs exploit a more efficient form of parallelism for the specific domain. For example, single-instruction multiple-data parallelism (SIMD) is more efficient than multiple instruction multiple data (MIMD) because it needs to fetch only one instruction stream and processing units operate in lockstep.9 Although SIMD is less flexible than MIMD, it is a good match for many DSAs. DSAs may also use VLIW approaches to ILP rather than speculative out-of-order mechanisms. As mentioned earlier, VLIW processors are a poor match for general-purpose code15 but for limited domains can be much more efficient, since the control mechanisms are simpler. In particular, most high-end general-purpose processors are out-of-order superscalars that require complex control logic for both instruction initiation and instruction completion. In contrast, VLIWs perform the necessary analysis and scheduling at compile-time, which can work well for an explicitly parallel program.
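The SIMD advantage shows up concretely in the Leiserson et al. matrix-multiply example above, where vector extensions supply the final factor. Multiplying the approximate per-step speedups quoted in the text gives the rough total; the exact figure reported for the fully optimized version is 62,806×:

```python
# Approximate per-step speedups for matrix multiply, from the text:
steps = {
    "rewrite Python in C":           47.0,
    "parallel loops on many cores":   7.0,
    "memory layout for caches":      20.0,
    "SIMD (16 x 32-bit ops/instr)":   9.0,
}

total = 1.0
for step, factor in steps.items():
    total *= factor
    print(f"{step:30s} x{factor:>5.1f}  cumulative x{total:>10,.0f}")

# The rounded factors multiply to ~59,000x, the same order of magnitude
# as the 62,806x reported for the fully optimized version.
```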
Second, DSAs can make more effective use of the memory hierarchy. Memory accesses have become much more costly than arithmetic computations, as noted by Horowitz.16 For example, accessing a block in a 32-kilobyte cache involves an energy cost approximately 200× higher than a 32-bit integer add. This enormous differential makes optimizing memory accesses critical to achieving high energy efficiency. General-purpose processors run code in which memory accesses typically exhibit spatial and temporal locality but are otherwise not very predictable at compile time. CPUs thus use multilevel caches to increase bandwidth and hide the latency in relatively slow, off-chip DRAMs. These multilevel caches often consume approximately half the energy of the processor but avoid almost all accesses to the off-chip DRAMs that require approximately 10× the energy of a last-level cache access.

Caches have two notable disadvantages:

When datasets are very large. Caches simply do not work well when datasets are very large and also have low temporal or spatial locality; and

When caches work well. When caches work well, the locality is very high, meaning, by definition, most of the cache is idle most of the time.

In applications where the memory-access patterns are well defined and discoverable at compile time, which is true of typical DSLs, programmers and compilers can optimize the use of the memory better than can dynamically allocated caches. DSAs thus usually use a hierarchy of memories with movement controlled explicitly by the software, similar to how vector processors operate. For suitable applications, user-controlled memories can use much less energy than caches.

Third, DSAs can use less precision when it is adequate. General-purpose CPUs usually support 32- and 64-bit integer and floating-point (FP) data. For many applications in machine learning and graphics, this is more accuracy than is needed. For example, in deep neural networks (DNNs), inference regularly uses 4-, 8-, or 16-bit integers, improving both data and computational throughput. Likewise, for DNN training applications, FP is useful, but 32 bits is enough and 16 bits often works.

Finally, DSAs benefit from targeting programs written in domain-specific languages (DSLs) that expose more parallelism, improve the structure and representation of memory access, and make it easier to map the application efficiently to a domain-specific processor.

Domain-Specific Languages

DSAs require targeting of high-level operations to the architecture, but trying to extract such structure and information from a general-purpose language like Python, Java, C, or Fortran is simply too difficult. Domain-specific languages (DSLs) enable this process and make it possible to program DSAs efficiently. For example, DSLs can make vector, dense matrix, and sparse matrix operations explicit, enabling the DSL compiler to map the operations to the processor efficiently. Examples of DSLs include Matlab, a language for operating on matrices, TensorFlow, a dataflow language used for programming DNNs, P4, a language for programming SDNs, and Halide, a language for image processing specifying high-level transformations.

The challenge when using DSLs is how to retain enough architecture independence that software written in a DSL can be ported to different architectures while also achieving high efficiency in mapping the software to the underlying DSA. For example, the XLA system translates TensorFlow to heterogeneous processors that use Nvidia GPUs or Tensor Processing Units (TPUs).40 Balancing portability among DSAs along with efficiency is an interesting research challenge for language designers, compiler creators, and DSA architects.

Example DSA: TPU v1. As an example DSA, consider the Google TPU v1, which was designed to accelerate neural net inference.17,18 The TPU has been in production since 2015 and powers applications ranging from search queries to language translation to image recognition to AlphaGo and AlphaZero, the DeepMind programs for playing Go and Chess. The goal was to improve the performance and energy efficiency of deep neural net inference by a factor of 10. As shown in Figure 8, the TPU organization is radically different from a general-purpose processor.

Figure 9. Agile hardware development methodology (C++ → FPGA → ASIC flow → tape-in → tape-out → big chip tape-out).
F E B R UA RY 2 0 1 9 | VO L. 6 2 | N O. 2 | C OM M U N IC AT ION S OF T HE ACM 57
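The explicitly software-managed data movement described above can be illustrated with a minimal Python sketch of tiled matrix multiply. The tile size and the per-tile "scratchpad" copies here are hypothetical stand-ins for a DSA's small on-chip memories, not any real accelerator's API:

```python
# Sketch: software-controlled data movement, as in a DSA's explicitly
# managed memory hierarchy. The tile copies stand in for small on-chip
# scratchpad memories; TILE is a hypothetical tile size.

TILE = 2  # hypothetical scratchpad-sized tile

def tiled_matmul(a, b, n):
    """Multiply two n x n matrices (lists of lists), moving one tile
    at a time into explicit local buffers before computing on it."""
    c = [[0] * n for _ in range(n)]
    for i0 in range(0, n, TILE):
        for j0 in range(0, n, TILE):
            for k0 in range(0, n, TILE):
                # The software explicitly "loads" tiles into local
                # buffers rather than relying on a hardware cache to
                # discover the locality at runtime.
                a_tile = [row[k0:k0 + TILE] for row in a[i0:i0 + TILE]]
                b_tile = [row[j0:j0 + TILE] for row in b[k0:k0 + TILE]]
                for i in range(len(a_tile)):
                    for j in range(len(b_tile[0])):
                        for k in range(len(b_tile)):
                            c[i0 + i][j0 + j] += a_tile[i][k] * b_tile[k][j]
    return c
```

Because every tile load is explicit, the programmer or compiler decides exactly when each block of data moves, which is the control a dynamically allocated cache takes away.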
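The matrix unit at the heart of such an accelerator is a systolic array, in which operands flow between neighboring multiply-accumulate cells. The toy Python simulation below sketches only the general output-stationary scheme on a hypothetical n × n grid; it is not the TPU's actual datapath:

```python
# Toy simulation of an output-stationary systolic array (a conceptual
# sketch, not the TPU's real design). Each grid cell (i, j) accumulates
# sum over k of A[i][k] * B[k][j] as skewed operands stream past it.

def systolic_matmul(A, B, n):
    acc = [[0] * n for _ in range(n)]   # per-cell running sums
    a_in = [[0] * n for _ in range(n)]  # A values flowing rightward
    b_in = [[0] * n for _ in range(n)]  # B values flowing downward
    for t in range(3 * n - 2):          # enough cycles to drain the array
        # Update cells back-to-front so each value moves one hop per cycle.
        for i in reversed(range(n)):
            for j in reversed(range(n)):
                # Edge cells take skewed inputs; inner cells take them
                # from their left and upper neighbors.
                a = a_in[i][j - 1] if j > 0 else (
                    A[i][t - i] if 0 <= t - i < n else 0)
                b = b_in[i - 1][j] if i > 0 else (
                    B[t - j][j] if 0 <= t - j < n else 0)
                acc[i][j] += a * b
                a_in[i][j], b_in[i][j] = a, b
    return acc
```

Each cell communicates only with its neighbors; that locality is what lets a hardware systolic array feed 256 × 256 multiply-accumulators every cycle without global wiring.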
The main computational unit is a matrix unit, a systolic array22 structure that provides 256 × 256 multiply-accumulates every clock cycle. The combination of 8-bit precision, highly efficient systolic structure, SIMD control, and dedication of significant chip area to this function means the number of multiply-accumulates per clock cycle is approximately 100× what a general-purpose single-core CPU can sustain. Rather than caches, the TPU uses a local memory of 24 megabytes, approximately double that of a 2015 general-purpose CPU with the same power dissipation. Finally, both the activation memory and the weight memory (including a FIFO structure that holds weights) are linked through user-controlled high-bandwidth memory channels. Using a weighted arithmetic mean based on six common inference problems in Google data centers, the TPU is 29× faster than a general-purpose CPU. Since the TPU requires less than half the power, it has an energy efficiency for this workload that is more than 80× better than a general-purpose CPU.

Summary

We have considered two different approaches to improve program performance by improving efficiency in the use of hardware technology: first, by improving the performance of modern high-level languages that are typically interpreted; and second, by building domain-specific architectures that greatly improve performance and efficiency compared to general-purpose CPUs. DSLs are another example of how to improve the hardware/software interface that enables architecture innovations like DSAs. Achieving significant gains through such approaches will require a vertically integrated design team that understands applications, domain-specific languages and related compiler technology, computer architecture and organization, and the underlying implementation technology. The need to vertically integrate and make design decisions across levels of abstraction was characteristic of much of the early work in computing before the industry became horizontally structured. In this new era, vertical integration has become more important, and teams that can examine and make complex trade-offs and optimizations will be advantaged.

This opportunity has already led to a surge of architecture innovation, attracting many competing architectural philosophies:

GPUs. Nvidia GPUs use many cores, each with large register files, many hardware threads, and caches;4

TPUs. Google TPUs rely on large two-dimensional systolic multipliers and software-controlled on-chip memories;17

FPGAs. Microsoft deploys field-programmable gate arrays (FPGAs) in its data centers that it tailors to neural network applications;10 and

CPUs. Intel offers CPUs with many cores enhanced by large multi-level caches and one-dimensional SIMD instructions, the kind of FPGAs used by Microsoft, and a new neural network processor that is closer to a TPU than to a CPU.19

In addition to these large players, dozens of startups are pursuing their own proposals.25 To meet growing demand, architects are interconnecting hundreds to thousands of such chips to form neural-network supercomputers.

This avalanche of DNN architectures makes for interesting times in computer architecture. It is difficult to predict in 2019 which (or even if any) of these many directions will win, but the marketplace will surely settle the competition just as it settled the architectural debates of the past.

Open Architectures

Inspired by the success of open source software, the second opportunity in computer architecture is open ISAs. To create a "Linux for processors," the field needs industry-standard open ISAs so the community can create open source cores, in addition to individual companies owning proprietary ones. If many organizations design processors using the same ISA, the greater competition may drive even quicker innovation. The goal is to provide processors for chips that cost from a few cents to $100.

The first example is RISC-V (called "RISC Five"), the fifth RISC architecture developed at the University of California, Berkeley.32 RISC-V has a community that maintains the architecture under the stewardship of the RISC-V Foundation (http://riscv.org/). Being open allows the ISA evolution to occur in public, with hardware and software experts collaborating before decisions are finalized. An added benefit of an open foundation is that the ISA is unlikely to expand primarily for marketing reasons, sometimes the only explanation for extensions of proprietary instruction sets.

RISC-V is a modular instruction set. A small base of instructions runs the full open source software stack, followed by optional standard extensions designers can include or omit depending on their needs. This base includes 32-bit address and 64-bit address versions. RISC-V can grow only through optional extensions; the software stack still runs fine even if architects do not embrace new extensions. Proprietary architectures generally require upward binary compatibility, meaning when a processor company adds a new feature, all future processors must also include it. Not so for RISC-V, whereby all enhancements are optional and can be deleted if not needed by an application. Here are the standard extensions so far, using initials that stand for their full names:

M. Integer multiply/divide;
A. Atomic memory operations;
F/D. Single/double-precision floating-point; and
C. Compressed instructions.

A third distinguishing feature of RISC-V is the simplicity of the ISA. While not readily quantifiable, here are two comparisons to the ARMv8 architecture, as developed by the ARM company contemporaneously:

Fewer instructions. RISC-V has many fewer instructions. There are 50 in the base that are surprisingly similar in number and nature to the original RISC-I.30 The remaining standard extensions—M, A, F, and D—add 53 instructions, and C adds another 34, totaling 137. ARMv8 has more than 500; and

Fewer instruction formats. RISC-V has many fewer instruction formats, six, while ARMv8 has at least 14.

Simplicity reduces the effort to both design processors and verify hardware correctness. As RISC-V targets range from data-center chips to IoT devices, design verification can be a significant part of the cost of development.

Fourth, RISC-V is a clean-slate design, starting 25 years later, letting its