The End of Moore's Law & Faster General Purpose Computing, and a Road Forward

John Hennessy
Stanford University
March 2019
The End of an Era

• 40 years of stunning progress in microprocessor design
• 1.4x annual performance improvement for 40+ years ≈ 10^6x faster (throughput)!
• Three architectural innovations:
  • Width: 8 → 16 → 64 bits (~4x)
  • Instruction-level parallelism: 4-10 cycles per instruction to 4+ instructions per cycle (~10-20x)
  • Multicore: one processor to 32 cores (~32x)
• Clock rate: 3 MHz to 4 GHz (through technology & architecture)
• Made possible by IC technology:
  • Moore's Law: growth in transistor count
  • Dennard Scaling: power/transistor shrinks as speed & density increase
  • Power = frequency x CV^2 (see the note below)
  • Energy expended per computation was decreasing
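For reference, the dynamic-power relation behind these last two bullets, and why Dennard scaling kept power density flat, can be written out explicitly (a standard textbook form, not taken verbatim from the slide):

$$P_{\text{dynamic}} \approx \alpha \, f \, C \, V^{2}$$

where $\alpha$ is the activity factor, $f$ the clock frequency, $C$ the switched capacitance, and $V$ the supply voltage. Under classical Dennard scaling, shrinking linear dimensions by a factor $k$ gives $C \to C/k$ and $V \to V/k$ while allowing $f \to k f$, so power per transistor falls by roughly $1/k^{2}$, exactly offsetting the $k^{2}$ increase in transistor density: power per unit area stays constant and energy per computation falls.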

THREE CHANGES CONVERGE

• Technology
  • End of Dennard scaling: power becomes the key constraint
  • Slowdown in Moore's Law: transistors cost (even when unused)
• Architectural
  • Limitations and inefficiencies in exploiting instruction-level parallelism end the uniprocessor era.
  • Amdahl's Law and its implications end the "easy" multicore era
• Application focus shifts
  • From the desktop to individual mobile devices, ultrascale cloud computing, and IoT: new constraints.

UNIPROCESSOR PERFORMANCE (SINGLE CORE)

[Figure: performance = highest SPECInt by year; from Hennessy & Patterson [2018].]
MOORE’S LAW IN DRAMS

THE TECHNOLOGY SHIFTS
MOORE'S LAW SLOWDOWN IN INTEL PROCESSORS

[Figure: transistor counts in Intel processors versus the Moore's Law projection; annotation: 10X.]

Cost/transistor is slowing down faster, due to fab costs.
TECHNOLOGY, POWER, AND DENNARD SCALING

[Figure: technology node in nanometers and relative power/energy per nm^2, 2000-2020. Power consumption based on models in Esmaeilzadeh [2011].]

Energy scaling for a fixed task is better, since there are more & faster transistors.
ENERGY EFFICIENCY IS THE NEW METRIC

Battery lifetime determines effectiveness!

LCD is the biggest consumer; the CPU is close behind.

"Always on" assistants are likely to increase CPU demand.
AND IN THE CLOUD

[Figure: effective operating cost and capital costs (with amortization) for a datacenter: amortized servers, amortized networking equipment, other amortized infrastructure, shell and land, power (cooling + power infrastructure + electricity), and staff.]
END OF DENNARD SCALING IS A CRISIS

• Energy consumption has become more important to users
  • For mobile, IoT, and for large clouds
• Processors have reached their power limit
  • Thermal dissipation is maxed out (chips turn off to avoid overheating!)
  • Even with better packaging: heat and battery are limits.
• Architectural advances must increase energy efficiency
  • Reduce power or improve performance for the same power
• But most architectural techniques have reached their limits in energy efficiency!
  • 1982-2005: Instruction-level parallelism
    • Compiler and processor find parallelism
  • 2005-2017: Multicore
    • Programmer identifies parallelism
  • Caches: diminishing returns (small incremental improvements).
Instruction Level Parallelism Era
1982-2005

• Instruction-level parallelism achieves significant performance advantages
• Pipelining: 5 stages to 15+ stages to allow faster clock rates (energy cost neutralized by Dennard scaling)
• Multiple issue: <1 instruction/clock to 4+ instructions/clock
  • Significant increase in transistors to increase issue rate
• Why did it end?
  • Diminishing returns in efficiency
Getting More ILP

• Branches and memory aliasing are a major limit:
  • 4 instructions/clock x 15-deep pipeline ⇒ need more than 60 instructions "in flight"
• Speculation was introduced to allow this
  • Speculation involves predicting program behavior
    • Predict branches & predict matching memory addresses
  • If the prediction is accurate, proceed
  • If the prediction is inaccurate, undo the work and restart
• How good must branch prediction be? Very good!
  • 15-deep pipeline: ~4 branches in flight; 94% overall correctness ⇒ ~98.7% per branch
  • 60 instructions in flight: ~15 branches; 90% overall ⇒ ~99% per branch
  (see the arithmetic sketch below)
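The per-branch figures follow from requiring that every in-flight branch be predicted correctly. A minimal sketch of that arithmetic, assuming independent predictions (the independence assumption is mine, not from the slide):

```python
# Per-branch prediction accuracy needed so that *all* in-flight branches
# are correct with a target overall probability, assuming independence.
def per_branch_accuracy(overall: float, branches_in_flight: int) -> float:
    return overall ** (1.0 / branches_in_flight)

print(per_branch_accuracy(0.94, 4))   # ~0.985 -> close to the ~98.7% quoted above
print(per_branch_accuracy(0.90, 15))  # ~0.993 -> the ~99% quoted above
```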
WASTED WORK ON THE INTEL CORE I7

[Figure: work wasted / total work (0%-40%) across SPEC benchmarks such as PERLBENCH, BZIP2, GCC, MCF, GOBMK, HMMER, SJENG, LIBQUANTUM, H264REF, OMNETPP, ASTAR, XALANCBMK, MILC, NAMD, DEALII, SOPLEX, POVRAY, LBM, and SPHINX3.]

Data collected by Professor Lu Peng and student Ying Zhang at LSU.
The Multicore Era
2005-2017

• Make the programmer responsible for identifying parallelism via threads
• Exploit the threads on multiple cores
  • Increase cores if there are more transistors: easy scaling!
• Energy ≈ Transistor count ≈ Active cores
  • So, we need Performance ≈ Active cores
• But, Amdahl's Law says that this is highly unlikely
AMDAHL'S LAW LIMITS PERFORMANCE GAINS FROM PARALLEL PROCESSING

Speedup versus % "Serial" Processing Time

[Figure: speedup (up to 64x) versus processor count (0-68) for serial fractions of 1%, 2%, 4%, 6%, 8%, and 10%.]
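The curves are a direct plot of Amdahl's Law. A minimal sketch of the formula behind them (standard form; the specific printout is my illustration):

```python
# Amdahl's Law: speedup on n processors when a fraction `serial`
# of the work cannot be parallelized.
def amdahl_speedup(serial: float, n: int) -> float:
    return 1.0 / (serial + (1.0 - serial) / n)

# Even a small serial fraction caps a 64-core chip far below 64x.
for s in (0.01, 0.02, 0.04, 0.06, 0.08, 0.10):
    print(f"{s:.0%} serial, 64 cores -> {amdahl_speedup(s, 64):.1f}x")
```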
SECOND CHALLENGE TO HIGHER PERFORMANCE FROM MULTICORE: END OF DENNARD SCALING

• End of Dennard scaling means multicore scaling ends
  • Full scaling will mean "dark silicon," with cores OFF.
• Example
  • Today: 14 nm process, largest Intel multicore
    • Intel E7-8890: 24 cores, 2.2 GHz, TDP = 165 W (power limited)
    • Turbo (one core): 3.4 GHz. All cores @ 3.4 GHz = 255 W.
  • A 7 nm process could yield (estimates)
    • 64 cores; power unconstrained: 6 GHz & 365 W.
    • 64 cores; power constrained: 4 GHz & 250 W.

Power Limit   Active Cores
180 W         46/64
200 W         51/64
220 W         56/64
PUTTING THE CHALLENGES TOGETHER:
DENNARD SCALING + AMDAHL'S LAW

Speedup versus % "Serial" Processing Time

[Figure: speedup (up to 64x) versus processor (core) count, 1-65, for serial fractions of 1%, 2%, 4%, and 8%, under the power constraints from the previous slide.]
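One way to read the combined chart: cap the usable core count at the power-limited figures from the previous slide, then apply Amdahl's Law to the cores that remain on. A minimal sketch of that combination (the 46-of-64-core cap at 180 W comes from the previous slide; the code itself is my construction):

```python
def amdahl_speedup(serial: float, n: int) -> float:
    return 1.0 / (serial + (1.0 - serial) / n)

def power_limited_speedup(serial: float, cores_built: int, cores_on: int) -> float:
    # Only the cores that fit within the power budget can contribute.
    return amdahl_speedup(serial, min(cores_built, cores_on))

# 64-core chip, but only 46 cores fit in a 180 W budget.
for s in (0.01, 0.02, 0.04, 0.08):
    print(f"{s:.0%} serial -> {power_limited_speedup(s, 64, 46):.1f}x")
```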
SIDEBAR: INSTRUCTION SET EFFICIENCY

• RISC ideas were about improving efficiency:
  • 1980s: efficiency in the use of transistors
  • Less significant in CPUs in the 1990s: processors dominated by other things & Moore/Dennard in full operation
• RISC comeback, driven by the mobile world in the 2000s:
  • Energy efficiency crucial: small batteries, all-day operation
  • Silicon efficiency important for cost!
• 2020 and on: add design efficiency
  • With a growing spectrum of designs targeted to specific applications, the efficiency of design/verification matters increasingly.
What Opportunities Are Left?

▪ SW-centric
- Modern scripting languages are interpreted, dynamically typed, and encourage reuse
- Efficient for programmers but not for execution
▪ HW-centric
- The only path left is Domain Specific Architectures
- Do just a few tasks, but do them extremely well
▪ Combination
- Domain Specific Languages & Architectures
WHAT'S THE OPPORTUNITY?

Matrix multiply: relative speedup to a Python version (18-core Intel)

[Figure: step-by-step speedups (individual steps of roughly 7x to 50x) compounding to ~63,000x over the naive Python version.]

from: "There's Plenty of Room at the Top," Leiserson et al., to appear.
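To make the scale of that opportunity concrete, a minimal illustration (my own example, not the benchmark from the cited paper) of the gap between an interpreted triple loop and a tuned BLAS call:

```python
import time
import numpy as np

def naive_matmul(a, b):
    """Textbook triple loop, executed entirely by the Python interpreter."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            s = 0.0
            for p in range(k):
                s += a[i][p] * b[p][j]
            c[i][j] = s
    return c

n = 128
a, b = np.random.rand(n, n), np.random.rand(n, n)

t0 = time.perf_counter(); naive_matmul(a.tolist(), b.tolist()); t1 = time.perf_counter()
t2 = time.perf_counter(); _ = a @ b; t3 = time.perf_counter()
print(f"interpreted Python: {t1 - t0:.3f} s, NumPy/BLAS: {t3 - t2:.6f} s")
```

On a typical machine the BLAS call is several orders of magnitude faster; adding parallel cores, vectorization, and cache blocking on top of compiled code is what drives the slide's ~63,000x figure.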
DOMAIN SPECIFIC ARCHITECTURES (DSAs)

• Achieve higher efficiency by tailoring the architecture to the characteristics of the domain
• Not one application, but a domain of applications
  • Different from a strict ASIC
• Requires more domain-specific knowledge than GP processors need
• Design DSAs and processors for targeted environments
  • More variability than in GP processors
• Examples:
  • Neural network processors for machine learning
  • GPUs for graphics, virtual reality
• Good news: demand for performance is focused on such domains
WHERE DOES THE ENERGY GO IN GP CPUS?
CAN DSAS DO BETTER?

Function                                   Energy (picojoules)
8-bit add                                  0.03
32-bit add                                 0.1
FP multiply, 16-bit                        1.1
FP multiply, 32-bit                        3.7
Register file access*                      6
Control (per instruction, superscalar)     20-40
L1 cache access                            10
L2 cache access                            20
L3 cache access                            100
Off-chip DRAM access                       1,300-2,600

* Increasing the size or number of ports increases energy roughly proportionally.

From Horowitz [2016].
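A quick reading of the table (simple arithmetic on the numbers above) shows why data movement, not arithmetic, dominates energy:

```python
# Energy per operation in picojoules, taken from the table above (Horowitz [2016]).
energy_pj = {
    "8-bit add": 0.03,
    "32-bit FP multiply": 3.7,
    "L1 cache access": 10,
    "off-chip DRAM access (low end)": 1300,
}

dram = energy_pj["off-chip DRAM access (low end)"]
for op, e in energy_pj.items():
    print(f"one DRAM access costs about {dram / e:,.0f}x an '{op}'")
```

One off-chip DRAM access costs as much as tens of thousands of 8-bit adds, which is why DSAs concentrate on keeping data on-chip and reusing it.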
INSTRUCTION ENERGY BREAKDOWN

Load register (from L1 cache): control 53%, L1 D-cache access 18%, L1 I-cache access 18%, register file access 11%.

FP multiply (32-bit) from registers: control 60%, L1 cache access 20%, register file access 12%, 32-bit FP multiply 8%.
WHY DSAS CAN WIN (NO MAGIC)
TAILOR THE ARCHITECTURE TO THE DOMAIN

• Simpler parallelism for a specific domain (less control HW):
  • SIMD vs. MIMD
  • VLIW vs. speculative, out-of-order
• More effective use of memory bandwidth (on/off chip)
  • User-controlled memory versus caches
  • Processor + memory structures versus traditional
  • Program prefetching to off-chip memory when needed
• Eliminate unneeded accuracy
  • IEEE FP replaced by lower-precision FP
  • 32-bit/64-bit integers reduced to 8-16 bits
• Domain-specific programming model matches the application to the processor architecture
DOMAIN SPECIFIC LANGUAGES

DSAs require targeting high-level operations to the architecture
● Hard to start with a C- or Python-like language and recover the structure
● Need matrix, vector, or sparse matrix operations
● Domain Specific Languages specify these operations:
  ○ OpenGL, TensorFlow, P4
● If DSL programs retain architecture independence, interesting compiler challenges will exist
  ○ XLA

"XLA - TensorFlow, Compiled", XLA Team, March 6, 2017
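As a concrete illustration of the DSL-to-compiler flow (a sketch assuming a recent TensorFlow 2.x installation; this is my example, not code from the talk), a small layer expressed with high-level ops and handed to XLA for compilation:

```python
import tensorflow as tf

# The program states high-level operations (matmul, add, relu); XLA is free
# to fuse and tile them for whatever backend is available (CPU, GPU, TPU).
@tf.function(jit_compile=True)  # ask TensorFlow to compile this function with XLA
def dense_relu(x, w, b):
    return tf.nn.relu(tf.matmul(x, w) + b)

x = tf.random.normal([128, 256])
w = tf.random.normal([256, 512])
b = tf.zeros([512])
print(dense_relu(x, w, b).shape)  # (128, 512)
```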
Deep learning is causing a machine learning revolution

From "A New Golden Age in Computer Architecture: Empowering the Machine-Learning Revolution." Dean, J., Patterson, D., & Young, C. (2018). IEEE Micro, 38(2), 21-29.
TPU 1: High-level Chip Architecture for DNN Inference

▪ Matrix Unit: 65,536 (256x256) 8-bit multiply-accumulate units
▪ 700 MHz clock rate
▪ Peak: 92T operations/second
  ▪ 65,536 * 2 * 700M (see the check below)
▪ >25X as many MACs vs. GPU
▪ >100X as many MACs vs. CPU
▪ 4 MiB of accumulator memory
▪ 24 MiB of on-chip Unified Buffer (activation memory)
  ▪ 3.5X as much on-chip memory vs. GPU
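The 92T peak is just the MAC count times two operations per MAC (a multiply and an add) times the clock rate; a one-line check of the slide's arithmetic:

```python
macs = 256 * 256          # 65,536 8-bit multiply-accumulate units
ops_per_mac = 2           # each MAC counts as a multiply plus an add
clock_hz = 700e6          # 700 MHz
print(macs * ops_per_mac * clock_hz / 1e12)  # ~91.8 Tera-ops/s, i.e. the ~92T peak
```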
HOW IS SILICON USED: TPU-1 VS. CPU?

TPU-1 (excluding pads)
• Memory: 44%
• Compute: 39%
• Interface: 15%
• Control: 2%

CPU (Skylake core)
• Cache: 33%
• Control: 30%
• Compute: 21%
• Memory management: 12%
• Misc: 4%
Matrix Unit Systolic Array (Kung & Leiserson)

[Figure sequence: a 3x3 weight-stationary systolic array in the Matrix Unit (MXU) computing Y = WX with a 3x3 weight matrix W and batch size 3. The weights W11..W33 stay fixed in the array; the inputs X11..X33 stream in along one edge, skewed in time; partial products accumulate as they pass from PE to PE (W11X11, then W11X11 + W12X12, ...); once the pipeline fills, completed results such as Y11 = W11X11 + W12X12 + W13X13 and Y12 = W21X11 + W22X12 + W23X13 emerge on the accumulation edge, one wave per cycle.]

Systolic-array slides courtesy of Cliff Young @ Google.
Advantages: Matrix Unit Systolic Array in TPU-1

• Each operand is used up to 256 times!
• Nearest-neighbor communication replaces RF access:
  • Eliminates many reads/writes; reduces long wire delays
• For the 64K Matrix Unit:
  • Energy saved by eliminating register accesses > energy of the Matrix Unit!
  (a software sketch of the dataflow follows below)

[Figure: final state of the systolic array with all outputs computed, e.g. Y33 = W31X31 + W32X32 + W33X33.]
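To make the dataflow concrete, here is a small cycle-by-cycle software model of a weight-stationary systolic matrix multiply. This is a sketch of the general technique (the indexing and skewing are my own construction), not Google's MXU implementation:

```python
import numpy as np

def systolic_matmul(W, X):
    """Cycle-by-cycle model of a weight-stationary systolic array computing Y = W @ X.

    PE(i, j) holds W[i, j] for the whole computation. Inputs flow down the
    columns one PE per cycle, partial sums flow right along the rows one PE
    per cycle, so every operand is reused by many PEs over nearest-neighbor
    links instead of repeated register-file reads.
    """
    R, C = W.shape
    C2, B = X.shape
    assert C == C2
    Y = np.zeros((R, B))

    x_reg = np.zeros((R, C))   # inputs held in each PE at the end of the previous cycle
    s_reg = np.zeros((R, C))   # partial sums held in each PE at the end of the previous cycle

    for t in range(B + R + C):                  # enough cycles to drain the pipeline
        new_x = np.zeros((R, C))
        new_s = np.zeros((R, C))
        for i in range(R):
            for j in range(C):
                # Input arriving from above (injected at the top, skewed by column index).
                if i == 0:
                    b = t - j
                    x_in = X[j, b] if 0 <= b < B else 0.0
                else:
                    x_in = x_reg[i - 1, j]
                # Partial sum arriving from the left (zero at each row's left edge).
                s_in = 0.0 if j == 0 else s_reg[i, j - 1]
                new_x[i, j] = x_in
                new_s[i, j] = s_in + W[i, j] * x_in
                # A finished sum leaves the right edge of row i.
                if j == C - 1:
                    b = t - i - (C - 1)
                    if 0 <= b < B:
                        Y[i, b] = new_s[i, j]
        x_reg, s_reg = new_x, new_s
    return Y

W = np.arange(9, dtype=float).reshape(3, 3)
X = np.arange(9, 18, dtype=float).reshape(3, 3)
print(np.allclose(systolic_matmul(W, X), W @ X))  # True
```

Each input value is read from memory once and then reused by an entire row of PEs, which is the register-file-traffic saving the slide refers to.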
Performance/Watt on Inference TPU-1 vs CPU & GPU

Important caveat:
• TPU-1 uses 8-bit integer
• GPU uses FP

Log Rooflines for CPU, GPU, TPU

[Figure: log-scale roofline models for the TPU, GPU, and CPU.]
Training: A Much More Intensive Problem

[Figure: FP operations/second required to train in 3.5 months.]
Rapid Innovation

TPU v1 (deployed 2015): 92 teraops, inference only

Cloud TPU v2 (Cloud GA 2017, Pod Alpha 2018): 180 teraflops, 64 GB HBM, training and inference, generally available (GA)

Cloud TPU v3 (Cloud Beta 2018): 420 teraflops, 128 GB HBM, training and inference, beta

Enabled by simpler design, compatibility at the DSL level, and ease of verification.
Enabling Massive Computing Cycles for Training

Cloud TPU Pod (v2, 2017): 11.5 petaflops (64x TPU v2), 4 TB HBM, 2-D toroidal mesh network, training and inference, alpha

TPU v3 Pod (2018): >100 petaflops! (256x TPU v3), 32 TB HBM, liquid cooled, new chip architecture + larger-scale system


CHALLENGES AND OPPORTUNITIES

• Design of DSAs and DSLs
  • Optimizing the mapping to a DSA for portability & performance
  • DSAs & DSLs for new fields
  • Open problem: dealing with sparse data
▪ Make HW development more like software:
  ▪ Prototyping, reuse, abstraction
  ▪ Open HW stacks (ISA to IP libraries)
  ▪ Role of ML in CAD?
▪ Technology:
  ▪ Silicon: extend Dennard scaling and Moore's Law
  ▪ Packaging: use optics, enhance cooling
  ▪ Beyond Si: carbon nanotubes, quantum?
CONCLUDING THOUGHTS:
EVERYTHING OLD IS NEW AGAIN

• Dave Kuck, software architect for the Illiac IV (circa 1975):

"What I was really frustrated about was the fact, with Illiac IV, programming the machine was very difficult and the architecture probably was not very well suited to some of the applications we were trying to run. The key idea was that I did not think we had a very good match in Illiac IV between applications and architecture."
— Dave Kuck, ACM Oral History

• Achieving cost-performance in this era of DSAs will require matching the applications, languages, and architecture, and reducing design cost.


WHY DRAMS ARE HARD AND CRITICAL!

INTEL CORE I7: Theoretical CPI = 0.25

[Figure: achieved CPI across SPEC CPU2006 benchmarks (roughly 0.5 to 2.5), versus the theoretical CPI of 0.25.]

Data collected by Professor Lu Peng and student Ying Zhang at LSU.
