DOI:10.1145/3282307

A New Golden Age for Computer Architecture
WE BEGAN OUR Turing Lecture June 4, 2018,11 with a review of computer architecture since the 1960s. In addition to that review, here, we highlight current challenges and identify future opportunities, projecting another golden age for the field of computer architecture in the next decade, much like the 1980s when we did the research that led to our award, delivering gains in cost, energy, and security, as well as performance.

"Those who cannot remember the past are condemned to repeat it." —George Santayana, 1905

[...] engineers, including ACM A.M. Turing Award laureate Fred Brooks, Jr., thought they could create a single ISA that would efficiently unify all four of these ISA bases. They needed a technical solution for how computers as inexpensive as [...]

[...] designers then and now is the "brains" of the processor—the control hardware. Inspired by software programming, computing pioneer and Turing laureate Maurice Wilkes proposed how to simplify control. Control was specified as a two-dimensional array he called a "control store." Each column of the array corresponded to [...] microinstructions. The control store was implemented through memory, which was much less costly than logic gates.

The table here lists four models of the new System/360 ISA IBM announced April 7, 1964. The data paths vary by a factor of 8, memory capacity by a factor of 16, clock rate by nearly 4, performance by 50, and cost by nearly 6. [...]

[...] they took more clock cycles to execute a System/360 instruction. Facilitated by microprogramming, IBM bet the future of the company that the new ISA would revolutionize the computing industry and won the bet. IBM dominated its markets, and IBM mainframe descendants of the computer family announced 55 years ago still bring in $10 billion in revenue per year.
FEBRUARY 2019 | VOL. 62 | NO. 2 | COMMUNICATIONS OF THE ACM

turing lecture
As seen repeatedly, although the marketplace is an imperfect judge of technological issues, given the close ties between architecture and commercial computers, it eventually determines the success of architecture innovations that often require significant engineering investment.

Integrated circuits, CISC, 432, 8086, IBM PC. When computers began using integrated circuits, Moore's Law meant control stores could become much larger. Larger memories in turn allowed much more complicated ISAs. Consider that the control store of the VAX-11/780 from Digital Equipment Corp. in 1977 was 5,120 words × 96 bits, while its predecessor used only 256 words × 56 bits.

Some manufacturers chose to make microprogramming available by letting select customers add custom features they called "writable control store" (WCS). The most famous WCS computer was the Alto36 Turing laureates Chuck Thacker and Butler Lampson, together with their colleagues, created for the Xerox Palo Alto Research Center in 1973. It was indeed the first personal computer, sporting the first bit-mapped display and first Ethernet local-area network. The device controllers for the novel display and network were microprograms stored in a 4,096-word × 32-bit WCS.

Microprocessors were still in the 8-bit era in the 1970s (such as the Intel 8080) and programmed primarily in assembly language. Rival designers would add novel instructions to outdo one another, showing their advantages through assembly language examples.

Gordon Moore believed Intel's next ISA would last the lifetime of Intel, so he hired many clever computer science Ph.D.'s and sent them to a new facility in Portland to invent the next great ISA. The 8800, as Intel originally named it, was an ambitious computer architecture project for any era, certainly the most aggressive of the 1980s. It had 32-bit capability-based addressing, object-oriented architecture, variable-bit-length instructions, and its own operating system written in the then-new programming language Ada.

This ambitious project was alas several years late, forcing Intel to start an emergency replacement effort in Santa Clara to deliver a 16-bit microprocessor in 1979. Intel gave the new team 52 weeks to develop the new "8086" ISA and design and build the chip. Given the tight schedule, designing the ISA took only 10 person-weeks over three regular calendar weeks, essentially by extending the 8-bit registers and instruction set of the 8080 to 16 bits. The team completed the 8086 on schedule but to little fanfare when announced.

To Intel's great fortune, IBM was developing a personal computer to compete with the Apple II and needed a 16-bit microprocessor. IBM was interested in the Motorola 68000, which had an ISA similar to the IBM 360, but it was behind IBM's aggressive schedule. IBM switched instead to an 8-bit bus version of the 8086. When IBM announced the PC on August 12, 1981, the hope was to sell 250,000 PCs by 1986. The company instead sold 100 million worldwide, bestowing a very bright future on the emergency replacement Intel ISA.
Features of four models of the IBM System/360 family; IPS is instructions per second.

Model                        M30               M40               M50               M65
Datapath width               8 bits            16 bits           32 bits           64 bits
Control store size           4k × 50           4k × 52           2.75k × 85        2.75k × 87
Clock rate (ROM cycle time)  1.3 MHz (750 ns)  1.6 MHz (625 ns)  2 MHz (500 ns)    5 MHz (200 ns)
Memory capacity              8–64 KiB          16–256 KiB        64–512 KiB        128–1,024 KiB
Performance (commercial)     29,000 IPS        75,000 IPS        169,000 IPS       567,000 IPS
Performance (scientific)     10,200 IPS        40,000 IPS        133,000 IPS       563,000 IPS
Price (1964 $)               $192,000          $216,000          $460,000          $1,080,000
Price (2018 $)               $1,560,000        $1,760,000        $3,720,000        $8,720,000

Intel's original 8800 project was renamed iAPX-432 and finally announced in 1981, but it required several chips and had severe performance problems. It was discontinued in 1986, the year after Intel extended the 16-bit 8086 ISA in the 80386 by expanding its registers from 16 bits to 32 bits. Moore's prediction was thus correct that the next ISA would last as long as Intel did, but the marketplace chose the emergency replacement 8086 rather than the anointed 432. As the architects of the Motorola 68000 and iAPX-432 both learned, the marketplace is rarely patient.

Figure 1. University of California, Berkeley, RISC-I and Stanford University MIPS microprocessors.
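The factor-of-8/16/4/50 ratios quoted alongside the System/360 table can be checked directly against the table's own numbers. A small sketch (values copied from the table above; the performance ratio uses the scientific IPS column):

```python
# Low-end (M30) vs. high-end (M65) System/360 models, from the table above.
m30 = {"datapath_bits": 8, "mem_kib_max": 64, "clock_mhz": 1.3,
       "sci_ips": 10_200, "price_1964": 192_000}
m65 = {"datapath_bits": 64, "mem_kib_max": 1_024, "clock_mhz": 5.0,
       "sci_ips": 563_000, "price_1964": 1_080_000}

for key in m30:
    print(f"{key:14s} ratio M65/M30: {m65[key] / m30[key]:.1f}")

# Datapath 8x, memory 16x, clock ~3.8x ("nearly 4"), scientific
# performance ~55x ("by 50"), price ~5.6x ("nearly 6").
```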
From complex to reduced instruction set computers. The early 1980s saw several investigations into complex instruction set computers (CISC) enabled by the big microprograms in the larger control stores. With Unix demonstrating that even operating systems could use high-level languages, the critical question became: "What instructions would compilers generate?" instead of "What assembly language would programmers use?" Significantly raising the hardware/software interface created an opportunity for architecture innovation.

Turing laureate John Cocke and his colleagues developed simpler ISAs and compilers for minicomputers. As an experiment, they retargeted their research compilers to use only the simple register-register operations and load-store data transfers of the IBM 360 ISA, avoiding the more complicated instructions. They found that programs ran up to three times faster using the simple subset. Emer and Clark6 found 20% of the VAX instructions needed 60% of the microcode and represented only 0.2% of the execution time. One author (Patterson) spent a sabbatical at DEC to help reduce bugs in VAX microcode. If microprocessor manufacturers were going to follow the CISC ISA designs of the larger computers, he thought they would need a way to repair the microcode bugs. He wrote such a paper,31 but the journal Computer rejected it. Reviewers opined that it was a terrible idea to build microprocessors with ISAs so complicated that they needed to be repaired in the field. That rejection called into question the value of CISC ISAs for microprocessors. Ironically, modern CISC microprocessors do indeed include microcode repair mechanisms, but the main result of his paper rejection was to inspire him to work on less-complex ISAs for microprocessors—reduced instruction set computers (RISC).

These observations and the shift to high-level languages led to the opportunity to switch from CISC to RISC. First, the RISC instructions were simplified so there was no need for a microcoded interpreter. The RISC instructions were typically as simple as microinstructions and could be executed directly by the hardware. Second, the fast memory, formerly used for the microcode interpreter of a CISC ISA, was repurposed to be a cache of RISC instructions. (A cache is a small, fast memory that buffers recently executed instructions, as such instructions are likely to be reused soon.) Third, register allocators based on Gregory Chaitin's graph-coloring scheme made it much easier for compilers to efficiently use registers, which benefited these register-register ISAs.3 Finally, Moore's Law meant there were enough transistors in the 1980s to include a full 32-bit datapath, along with instruction and data caches, in a single chip.

For example, Figure 1 shows the RISC-I8 and MIPS12 microprocessors developed at the University of California, Berkeley, and Stanford University in 1982 and 1983, respectively, that demonstrated the benefits of RISC. These chips were eventually presented at the leading circuit conference, the IEEE International Solid-State Circuits Conference, in 1984.33,35 It was a remarkable moment when a few graduate students at Berkeley and Stanford could build microprocessors that were arguably superior to what industry could build.

These academic chips inspired many companies to build RISC microprocessors, which were the fastest for the next 15 years. The explanation is due to the following formula for processor performance:

Time/Program = Instructions/Program × Clock cycles/Instruction × Time/Clock cycle

DEC engineers later showed2 that the more complicated CISC ISA executed about 75% of the number of instructions per program as RISC (the first term), but in a similar technology CISC executed about five to six times as many clock cycles per instruction (the second term), making RISC microprocessors approximately 4× faster.

Such formulas were not part of computer architecture books in the 1980s, leading us to write Computer Architecture: A Quantitative Approach13 in 1989. The subtitle suggested the theme of the book: Use measurements and benchmarks to evaluate trade-offs quantitatively instead of relying more on the architect's intuition and experience, as in the past. The quantitative approach we used was also inspired by what Turing laureate Donald Knuth's book had done for algorithms.20
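The RISC-CISC comparison above can be replayed with the performance formula in a few lines of Python; the 0.75 instruction ratio and the roughly 5–6× cycles-per-instruction ratio are the approximate figures quoted from the DEC study, not exact measurements:

```python
# Iron law: Time/Program = (Instructions/Program) x (Cycles/Instruction) x (Time/Cycle)
def exec_time(instructions, cpi, cycle_time):
    return instructions * cpi * cycle_time

# Approximate figures from the DEC comparison quoted above:
# CISC ran ~0.75x as many instructions, but needed ~5-6x the cycles each.
risc_time = exec_time(instructions=1.0, cpi=1.0, cycle_time=1.0)   # normalized
cisc_time = exec_time(instructions=0.75, cpi=5.5, cycle_time=1.0)  # same technology

speedup = cisc_time / risc_time
print(f"RISC speedup over CISC: {speedup:.1f}x")  # roughly 4x, matching the text
```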
VLIW, EPIC, Itanium. The next ISA innovation was supposed to succeed both RISC and CISC. Very long instruction word (VLIW)7 and its cousin, the explicitly parallel instruction computer (EPIC), the name Intel and Hewlett Packard gave to the approach, used wide instructions with multiple independent operations bundled together in each instruction. VLIW and EPIC advocates at the time believed if a single instruction could specify, say, six independent operations—two data transfers, two integer operations, and two floating point operations—and compiler technology could efficiently assign operations into the six instruction slots, the hardware could be made simpler. Like the RISC approach, VLIW and EPIC shifted work from the hardware to the compiler.

Working together, Intel and Hewlett Packard designed a 64-bit processor based on EPIC ideas to replace the 32-bit x86. High expectations were set for the first EPIC processor, called Itanium by Intel and Hewlett Packard, but the reality did not match its developers' early claims. Although the EPIC approach worked well for highly structured floating-point programs, it struggled to achieve high performance for integer programs that had less predictable cache misses or less-predictable branches. As Donald Knuth later noted:21 "The Itanium approach ... was supposed to be so terrific—until it turned out that the wished-for compilers were basically impossible to write." Pundits noted delays and underperformance of Itanium and rechristened it "Itanic" after the ill-fated Titanic passenger ship. The marketplace again eventually ran out of patience, leading to a 64-bit version of the x86 as the successor to the 32-bit x86, and not Itanium. The good news is VLIW still matches narrower applications with small programs and simpler branches that omit caches, including digital-signal processing.

RISC vs. CISC in the PC and Post-PC Eras

AMD and Intel used 500-person design teams and superior semiconductor technology to close the performance gap between x86 and RISC. Again inspired by the performance advantages of pipelining simple vs. complex instructions, the instruction decoder translated the complex x86 instructions into internal RISC-like microinstructions on the fly. AMD and Intel then pipelined the execution of the RISC microinstructions. Any ideas RISC designers were using for performance—separate instruction and data caches, second-level caches on chip, deep pipelines, and fetching and executing several instructions simultaneously—could then be incorporated into the x86. AMD and Intel shipped roughly 350 million x86 microprocessors annually at the peak of the PC era in 2011. The high volumes and low margins of the PC industry also meant lower prices than RISC computers.

Given the hundreds of millions of PCs sold worldwide each year, PC software became a giant market. Whereas software providers for the Unix marketplace would offer different software versions for the different commercial RISC ISAs—Alpha, HP-PA, MIPS, Power, and SPARC—the PC market enjoyed a single ISA, so software developers shipped "shrink-wrap" software that was binary compatible with only the x86 ISA. A much larger software base, similar performance, and lower prices led the x86 to dominate both desktop computers and small-server markets by 2000.

Apple helped launch the post-PC era with the iPhone in 2007. Instead of buying microprocessors, smartphone companies built their own systems on a chip (SoC) using designs from other companies, including RISC processors from ARM. Mobile-device designers valued die area and energy efficiency as much as performance, disadvantaging CISC ISAs. Moreover, arrival of the Internet of Things vastly increased both the number of processors and the required trade-offs in die size, power, cost, and performance. This trend increased the importance of design time and cost, further disadvantaging CISC processors. In today's post-PC era, x86 shipments have fallen almost 10% per year since the peak in 2011, while chips with RISC processors have skyrocketed to 20 billion. Today, 99% of 32-bit and 64-bit processors are RISC.

Concluding this historical review, we can say the marketplace settled the RISC-CISC debate; CISC won the later stages of the PC era, but RISC is winning the post-PC era. There have been no new CISC ISAs in decades. To our surprise, the consensus on the best ISA principles for general-purpose processors today is still RISC, 35 years after their introduction.

Figure 2. Transistors per chip of Intel microprocessors vs. Moore's Law (1975 version).

Figure 3. Transistors per chip and power per mm2.
Current Challenges for Processor Architecture

"If a problem has no solution, it may not be a problem, but a fact—not to be solved, but to be coped with over time." —Shimon Peres

While the previous section focused on the design of the instruction set architecture (ISA), most computer architects do not design new ISAs but implement existing ISAs in the prevailing implementation technology. Since the late 1970s, the technology of choice has been metal oxide semiconductor (MOS)-based integrated circuits, first n-type metal–oxide semiconductor (nMOS) and then complementary metal–oxide semiconductor (CMOS). The stunning rate of improvement in MOS technology—captured in Gordon Moore's predictions—has been the driving factor enabling architects to design more-aggressive methods for achieving performance for a given ISA. Moore's original prediction in 196526 called for a doubling in transistor density yearly; in 1975, he revised it, projecting a doubling every two years.28 It eventually became called Moore's Law. Because transistor density grows quadratically while speed grows linearly, [...] approaches fundamental limits.

Accompanying Moore's Law was a projection made by Robert Dennard called "Dennard scaling,"5 stating that as transistor density increased, power consumption per transistor would drop, so the power per mm2 of silicon would be near constant. Since the computational capability of a mm2 of silicon was increasing with each new generation of technology, computers would become more energy efficient. Dennard scaling began to slow significantly in 2007 and faded to almost nothing by 2012 (see Figure 3).

Between 1986 and about 2002, the exploitation of instruction level parallelism (ILP) was the primary architectural method for gaining performance and, along with improvements in speed of transistors, led to an annual performance increase of approximately 50%. The end of Dennard scaling meant architects had to find more efficient ways to exploit parallelism.

To understand why increasing ILP caused greater inefficiency, consider a modern processor core like those from ARM, Intel, and AMD. Assume it has a 15-stage pipeline and can issue four instructions every clock cycle. It thus has up to 60 instructions in the pipeline at any moment in time, including approximately 15 branches, as they represent approximately 25% of executed instructions. To keep the pipeline full, branches are predicted and code is speculatively placed into the pipeline for execution. The use of speculation is both the source of ILP performance and of inefficiency. When branch prediction is perfect, speculation improves performance yet involves little added energy cost—it can even save energy—but when it "mispredicts" branches, the processor must throw away the incorrectly speculated instructions, and their computational work and energy are wasted. The internal state of the processor must also be restored to the state that existed before the mispredicted branch, expending additional time and energy.

Figure 4. Wasted instructions as a percentage of all instructions completed on an Intel Core i7 for a variety of SPEC integer benchmarks.

[Figure 5. Speedup for 1 to 64 processors under varying serial fractions of execution.]

To see how challenging such a design is, consider the difficulty of correctly predicting the outcome of 15 branches.
If a processor architect wants to limit wasted work to only 10% of the time, the processor must predict each branch correctly 99.3% of the time. Few general-purpose programs have branches that can be predicted so accurately.

To appreciate how this wasted work adds up, consider the data in Figure 4, showing the fraction of instructions that are effectively executed but turn out to be wasted because the processor speculated incorrectly. On average, 19% of the instructions are wasted for these benchmarks on an Intel Core i7. The amount of wasted energy is greater, however, since the processor must use additional energy to restore the state when it speculates incorrectly. Measurements like these led many to conclude architects needed a different approach to achieve performance improvements. The multicore era was thus born.

Multicore shifted responsibility for identifying parallelism and deciding how to exploit it to the programmer and to the language system. Multicore does not resolve the challenge of energy-efficient computation that was exacerbated by the end of Dennard scaling. Each active core burns power whether or not it contributes effectively to the computation. A primary hurdle is an old observation, called Amdahl's Law, stating that the speedup from a parallel computer is limited by the portion of a computation that is sequential. To appreciate the importance of this observation, consider Figure 5, showing how much faster an application runs with up to 64 cores compared to a single core, assuming different portions of serial execution, where only one processor is active. For example, when only 1% of the time is serial, the speedup for a 64-processor configuration is about 35. Unfortunately, the power needed is proportional to 64 processors, so approximately 45% of the energy is wasted.

Real programs have more complex structures of course, with portions that allow varying numbers of processors to be used at any given moment in time. Nonetheless, the need to communicate and synchronize periodically means most applications have some portions that can effectively use only a fraction of the processors. Although Amdahl's Law is more than 50 years old, it remains a difficult hurdle.

With the end of Dennard scaling, increasing the number of cores on a chip meant power is also increasing at nearly the same rate. Unfortunately, the power that goes into a processor must also be removed as heat. Multicore processors are thus limited by the thermal design power (TDP), or average amount of power the package and cooling system can remove. Although some high-end data centers may use more advanced packages and cooling technology, no computer users would want to put a small heat exchanger on their desks or wear a radiator on their backs to cool their cellphones. The limit of TDP led directly to the era of "dark silicon," whereby processors would slow the clock rate and turn off idle cores to prevent overheating. Another way to view this approach is that some chips can reallocate their precious power from the idle cores to the active ones.

An era without Dennard scaling, along with reduced Moore's Law and Amdahl's Law in full effect means inefficiency limits improvement in performance to only a few percent per year (see Figure 6). Achieving [...]

Figure 6. Growth of computer performance using integer programs (SPECintCPU), relative to the VAX11-780: CISC 2×/2.5 years (22%/year); RISC 2×/1.5 years (52%/year); end of Dennard scaling ⇒ multicore 2×/3.5 years (23%/year); Amdahl's Law ⇒ 2×/6 years (12%/year); end of the line ⇒ 2×/20 years (3%/year).

Figure 7. Potential speedup of matrix multiply in Python for four optimizations (from native Python to 62,806×).
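The branch-prediction arithmetic above is easy to verify: with roughly 15 branches in flight, the probability that all of them are predicted correctly is the per-branch accuracy raised to the 15th power. A short sketch:

```python
# With n branches in flight, P(all correct) = p**n for per-branch accuracy p.
n_branches = 15

for p in (0.95, 0.99, 0.993):
    all_correct = p ** n_branches
    print(f"per-branch accuracy {p:.1%}: whole pipeline correct {all_correct:.1%} of the time")

# 0.993**15 is about 0.90: even 99.3%-accurate prediction leaves
# roughly 10% of the speculated work wasted, as the text states.
```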
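Amdahl's Law, discussed above, can be stated in one line: with serial fraction s and n processors, the best speedup is 1/(s + (1−s)/n). A sketch; note the idealized formula gives a ceiling of about 39× for 1% serial on 64 processors, while the ~35× in the text is read from the figure:

```python
# Amdahl's Law: best speedup on n processors when fraction s is serial.
def amdahl_speedup(s, n):
    return 1.0 / (s + (1.0 - s) / n)

n = 64
for s in (0.01, 0.02, 0.05, 0.10):
    print(f"serial {s:.0%}: speedup on {n} cores = {amdahl_speedup(s, n):.1f}")

# Taking the ~35x read from the figure, utilization is 35/64 ~ 55%,
# so roughly 45% of the energy of the 64 cores is wasted, as the
# text observes.
```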
[Figure 8. TPU block diagram (not to scale): a host interface supplies instructions and data; DDR3 interfaces (30 GiB/s) feed a weight FIFO (weight fetcher); a unified buffer (local storage) feeds a systolic matrix unit at 165 GiB/s, followed by accumulators and activation and normalize/pool units.]
[...] high-level languages with dynamic typing and storage management. Unfortunately, such languages are typically interpreted and execute very inefficiently. Leiserson et al.24 used a small example—performing matrix multiply—to illustrate this inefficiency. As in Figure 7, simply rewriting the code in C from Python—a typical high-level, dynamically typed language—increases performance 47-fold. Using parallel loops running on many cores yields a factor of approximately 7. Optimizing the memory layout to exploit caches yields a factor of 20, and a final factor of 9 comes from using the hardware extensions for doing single instruction multiple data (SIMD) parallelism operations that are able to perform 16 32-bit operations per instruction. All told, the final, highly optimized version runs more than 62,000× faster on a multicore Intel processor compared to the original Python version. This is of course a small example, one might expect programmers to use an optimized library for it. Although it exaggerates the usual performance gap, there are likely many programs for which factors of 100 to 1,000 could be achieved.

An interesting research direction concerns whether some of the performance gap can be closed with new compiler technology, possibly assisted by architectural enhancements. Although the challenges in efficiently translating and implementing high-level scripting languages like Python are difficult, the potential gain is enormous. Achieving even 25% of the potential gain could result in Python programs running tens to hundreds of times faster. This simple example illustrates how great the gap is between modern languages emphasizing programmer productivity and traditional approaches emphasizing performance.

Domain-specific architectures. A more hardware-centric approach is to design architectures tailored to a specific problem domain and offer significant performance (and efficiency) gains for that domain, hence, the name "domain-specific architectures" (DSAs), a class of processors tailored for a specific domain—programmable and often Turing-complete but tailored to a specific class of applications. In this sense, they differ from application-specific integrated circuits (ASICs) that are often used for a single function with code that rarely changes. DSAs are often called accelerators, since they accelerate some of an application when compared to executing the entire application on a general-purpose CPU. Moreover, DSAs can achieve better performance because they are more closely tailored to the needs of the application; examples of DSAs include graphics processing units (GPUs), neural network processors used for deep learning, and processors for software-defined networks (SDNs). DSAs can achieve higher performance and greater energy efficiency for four main reasons:

First and most important, DSAs exploit a more efficient form of parallelism for the specific domain. For example, single-instruction multiple-data parallelism (SIMD) is more efficient than multiple instruction multiple data (MIMD) because it needs to fetch only one instruction stream and processing units operate in lockstep.9 Although SIMD is less flexible than MIMD, it is a good match for many DSAs. DSAs may also use VLIW approaches to ILP rather than speculative out-of-order mechanisms. As mentioned earlier, VLIW processors are a poor match for general-purpose code15 but for limited domains can be much more efficient, since the control mechanisms are simpler. In particular, most high-end general-purpose processors are out-of-order superscalars that require complex control logic for both instruction initiation and instruction completion. In contrast, VLIWs perform the necessary analysis and scheduling at compile-time, which can work well for an explicitly parallel program.
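The SIMD advantage shows up concretely in the Leiserson et al. matrix-multiply example above, where vector extensions supply the final factor. Multiplying the approximate per-step speedups quoted in the text gives the rough total; the exact figure reported for the fully optimized version is 62,806×:

```python
# Approximate per-step speedups for matrix multiply, from the text:
steps = {
    "rewrite Python in C":           47.0,
    "parallel loops on many cores":   7.0,
    "memory layout for caches":      20.0,
    "SIMD (16 x 32-bit ops/instr)":   9.0,
}

total = 1.0
for step, factor in steps.items():
    total *= factor
    print(f"{step:30s} x{factor:>5.1f}  cumulative x{total:>10,.0f}")

# The rounded factors multiply to ~59,000x, the same order of magnitude
# as the 62,806x reported for the fully optimized version.
```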
Second, DSAs can make more effective use of the memory hierarchy. Memory accesses have become much more costly than arithmetic computations, as noted by Horowitz.16 For example, accessing a block in a 32-kilobyte cache involves an energy cost approximately 200× higher than a 32-bit integer add. This enormous differential makes optimizing memory accesses critical to achieving high energy efficiency. General-purpose processors run code in which memory accesses typically exhibit spatial and temporal locality but are otherwise not very predictable at compile time. CPUs thus use multilevel caches to increase bandwidth and hide the latency in relatively slow, off-chip DRAMs. These multilevel caches often consume approximately half the energy of the processor but avoid almost all accesses to the off-chip DRAMs that require approximately 10× the energy of a last-level cache access.

Caches have two notable disadvantages:

When datasets are very large. Caches simply do not work well when datasets are very large and also have low temporal or spatial locality; and

When caches work well. When caches work well, the locality is very high, meaning, by definition, most of the cache is idle most of the time.

In applications where the memory-access patterns are well defined and discoverable at compile time, which is true of typical DSLs, programmers and compilers can optimize the use of the memory better than can dynamically allocated caches. DSAs thus usually use a hierarchy of memories with movement controlled explicitly by the software, similar to how vector processors operate. For suitable applications, user-controlled memories can use much less energy than caches.

Third, DSAs can use less precision when it is adequate. General-purpose CPUs usually support 32- and 64-bit integer and floating-point (FP) data. For many applications in machine learning and graphics, this is more accuracy than is needed. For example, in deep neural networks (DNNs), inference regularly uses 4-, 8-, or 16-bit integers, improving both data and computational throughput. Likewise, for DNN training applications, FP is useful, but 32 bits is enough and 16 bits often works.

Finally, DSAs benefit from targeting programs written in domain-specific languages (DSLs) that expose more parallelism, improve the structure and representation of memory access, and make it easier to map the application efficiently to a domain-specific processor.

Domain-Specific Languages

DSAs require targeting of high-level operations to the architecture, but trying to extract such structure and information from a general-purpose language like Python, Java, C, or Fortran is simply too difficult. Domain-specific languages (DSLs) enable this process and make it possible to program DSAs efficiently. For example, DSLs can make vector, dense matrix, and sparse matrix operations explicit, enabling the DSL compiler to map the operations to the processor efficiently. Examples of DSLs include Matlab, a language for operating on matrices, TensorFlow, a dataflow language used for programming DNNs, P4, a language for programming SDNs, and Halide, a language for image processing specifying high-level transformations.

The challenge when using DSLs is how to retain enough architecture independence that software written in a DSL can be ported to different architectures while also achieving high efficiency in mapping the software to the underlying DSA. For example, the XLA system translates TensorFlow to heterogeneous processors that use Nvidia GPUs or Tensor Processing Units (TPUs).40 Balancing portability among DSAs along with efficiency is an interesting research challenge for language designers, compiler creators, and DSA architects.

Example DSA: TPU v1. As an example DSA, consider the Google TPU v1, which was designed to accelerate neural net inference.17,18 The TPU has been in production since 2015 and powers applications ranging from search queries to language translation to image recognition to AlphaGo and AlphaZero, the DeepMind programs for playing Go and Chess. The goal was to improve the performance and energy efficiency of deep neural net inference by a factor of 10. As shown in Figure 8, the TPU organization is radically different from a general-purpose processor.

Figure 9. Agile hardware development methodology (C++ → FPGA → ASIC flow → tape-in → tape-out → big chip tape-out).
F E B R UA RY 2 0 1 9 | VO L. 6 2 | N O. 2 | C OM M U N IC AT ION S OF T HE ACM 57
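The explicitly software-managed data movement described above can be illustrated with a minimal Python sketch of tiled matrix multiply. The tile size and the per-tile "scratchpad" copies here are hypothetical stand-ins for a DSA's small on-chip memories, not any real accelerator's API:

```python
# Sketch: software-controlled data movement, as in a DSA's explicitly
# managed memory hierarchy. The tile copies stand in for small on-chip
# scratchpad memories; TILE is a hypothetical tile size.

TILE = 2  # hypothetical scratchpad-sized tile

def tiled_matmul(a, b, n):
    """Multiply two n x n matrices (lists of lists), moving one tile
    at a time into explicit local buffers before computing on it."""
    c = [[0] * n for _ in range(n)]
    for i0 in range(0, n, TILE):
        for j0 in range(0, n, TILE):
            for k0 in range(0, n, TILE):
                # The software explicitly "loads" tiles into local
                # buffers rather than relying on a hardware cache to
                # discover the locality at runtime.
                a_tile = [row[k0:k0 + TILE] for row in a[i0:i0 + TILE]]
                b_tile = [row[j0:j0 + TILE] for row in b[k0:k0 + TILE]]
                for i in range(len(a_tile)):
                    for j in range(len(b_tile[0])):
                        for k in range(len(b_tile)):
                            c[i0 + i][j0 + j] += a_tile[i][k] * b_tile[k][j]
    return c
```

Because every tile load is explicit, the programmer or compiler decides exactly when each block of data moves, which is the control a dynamically allocated cache takes away.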
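The matrix unit at the heart of such an accelerator is a systolic array, in which operands flow between neighboring multiply-accumulate cells. The toy Python simulation below sketches only the general output-stationary scheme on a hypothetical n × n grid; it is not the TPU's actual datapath:

```python
# Toy simulation of an output-stationary systolic array (a conceptual
# sketch, not the TPU's real design). Each grid cell (i, j) accumulates
# sum over k of A[i][k] * B[k][j] as skewed operands stream past it.

def systolic_matmul(A, B, n):
    acc = [[0] * n for _ in range(n)]   # per-cell running sums
    a_in = [[0] * n for _ in range(n)]  # A values flowing rightward
    b_in = [[0] * n for _ in range(n)]  # B values flowing downward
    for t in range(3 * n - 2):          # enough cycles to drain the array
        # Update cells back-to-front so each value moves one hop per cycle.
        for i in reversed(range(n)):
            for j in reversed(range(n)):
                # Edge cells take skewed inputs; inner cells take them
                # from their left and upper neighbors.
                a = a_in[i][j - 1] if j > 0 else (
                    A[i][t - i] if 0 <= t - i < n else 0)
                b = b_in[i - 1][j] if i > 0 else (
                    B[t - j][j] if 0 <= t - j < n else 0)
                acc[i][j] += a * b
                a_in[i][j], b_in[i][j] = a, b
    return acc
```

Each cell communicates only with its neighbors; that locality is what lets a hardware systolic array feed 256 × 256 multiply-accumulators every cycle without global wiring.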
The main computational unit is a matrix unit, a systolic array22 structure that provides 256 × 256 multiply-accumulates every clock cycle. The combination of 8-bit precision, highly efficient systolic structure, SIMD control, and dedication of significant chip area to this function means the number of multiply-accumulates per clock cycle is approximately 100× what a general-purpose single-core CPU can sustain. Rather than caches, the TPU uses a local memory of 24 megabytes, approximately double that of a 2015 general-purpose CPU with the same power dissipation. Finally, both the activation memory and the weight memory (including a FIFO structure that holds weights) are linked through user-controlled high-bandwidth memory channels. Using a weighted arithmetic mean based on six common inference problems in Google data centers, the TPU is 29× faster than a general-purpose CPU. Since the TPU requires less than half the power, it has an energy efficiency for this workload that is more than 80× better than a general-purpose CPU.

Summary

We have considered two different approaches to improve program performance by improving efficiency in the use of hardware technology: first, by improving the performance of modern high-level languages that are typically interpreted; and second, by building domain-specific architectures that greatly improve performance and efficiency compared to general-purpose CPUs. DSLs are another example of how to improve the hardware/software interface that enables architecture innovations like DSAs. Achieving significant gains through such approaches will require a vertically integrated design team that understands applications, domain-specific languages and related compiler technology, computer architecture and organization, and the underlying implementation technology. The need to vertically integrate and make design decisions across levels of abstraction was characteristic of much of the early work in computing before the industry became horizontally structured. In this new era, vertical integration has become more important, and teams that can examine and make complex trade-offs and optimizations will be advantaged.

This opportunity has already led to a surge of architecture innovation, attracting many competing architectural philosophies:

GPUs. Nvidia GPUs use many cores, each with large register files, many hardware threads, and caches;4

TPUs. Google TPUs rely on large two-dimensional systolic multipliers and software-controlled on-chip memories;17

FPGAs. Microsoft deploys field-programmable gate arrays (FPGAs) in its data centers that it tailors to neural network applications;10 and

CPUs. Intel offers CPUs with many cores enhanced by large multi-level caches and one-dimensional SIMD instructions, the kind of FPGAs used by Microsoft, and a new neural network processor that is closer to a TPU than to a CPU.19

In addition to these large players, dozens of startups are pursuing their own proposals.25 To meet growing demand, architects are interconnecting hundreds to thousands of such chips to form neural-network supercomputers.

This avalanche of DNN architectures makes for interesting times in computer architecture. It is difficult to predict in 2019 which (or even if any) of these many directions will win, but the marketplace will surely settle the competition just as it settled the architectural debates of the past.

Open Architectures

Inspired by the success of open source software, the second opportunity in computer architecture is open ISAs. To create a "Linux for processors," the field needs industry-standard open ISAs so the community can create open source cores, in addition to individual companies owning proprietary ones. If many organizations design processors using the same ISA, the greater competition may drive even quicker innovation. The goal is to provide processors for chips that cost from a few cents to $100.

The first example is RISC-V (called "RISC Five"), the fifth RISC architecture developed at the University of California, Berkeley.32 RISC-V has a community that maintains the architecture under the stewardship of the RISC-V Foundation (http://riscv.org/). Being open allows the ISA evolution to occur in public, with hardware and software experts collaborating before decisions are finalized. An added benefit of an open foundation is that the ISA is unlikely to expand primarily for marketing reasons, sometimes the only explanation for extensions of proprietary instruction sets.

RISC-V is a modular instruction set. A small base of instructions runs the full open source software stack, followed by optional standard extensions designers can include or omit depending on their needs. This base includes 32-bit address and 64-bit address versions. RISC-V can grow only through optional extensions; the software stack still runs fine even if architects do not embrace new extensions. Proprietary architectures generally require upward binary compatibility, meaning when a processor company adds a new feature, all future processors must also include it. Not so for RISC-V, whereby all enhancements are optional and can be deleted if not needed by an application. Here are the standard extensions so far, using initials that stand for their full names:

M. Integer multiply/divide;
A. Atomic memory operations;
F/D. Single/double-precision floating-point; and
C. Compressed instructions.

A third distinguishing feature of RISC-V is the simplicity of the ISA. While not readily quantifiable, here are two comparisons to the ARMv8 architecture, as developed by the ARM company contemporaneously:

Fewer instructions. RISC-V has many fewer instructions. There are 50 in the base that are surprisingly similar in number and nature to the original RISC-I.30 The remaining standard extensions—M, A, F, and D—add 53 instructions, and C adds another 34, totaling 137. ARMv8 has more than 500; and

Fewer instruction formats. RISC-V has many fewer instruction formats, six, while ARMv8 has at least 14.

Simplicity reduces the effort to both design processors and verify hardware correctness. As RISC-V targets range from data-center chips to IoT devices, design verification can be a significant part of the cost of development.

Fourth, RISC-V is a clean-slate design, starting 25 years later, letting its