The End of Moore's Law & Faster General Purpose Computing, and a Road Forward

John Hennessy
Stanford University
March 2019
The End of an Era

• 40 years of stunning progress in microprocessor design
• 1.4x annual performance improvement for 40+ years ≈ 10^6x faster (throughput)!
• Three architectural innovations:
  • Width: 8 → 16 → 64 bits (~4x)
  • Instruction-level parallelism: 4-10 cycles per instruction to 4+ instructions per cycle (~10-20x)
  • Multicore: one processor to 32 cores (~32x)
• Clock rate: 3 MHz to 4 GHz (through technology & architecture)
• Made possible by IC technology:
  • Moore's Law: growth in transistor count
  • Dennard Scaling: power/transistor shrinks as speed & density increase
  • Power = frequency x CV^2 (see the note below)
  • Energy expended per computation was decreasing
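For reference, the dynamic-power relation behind these last two bullets, and why Dennard scaling kept power density flat, can be written out explicitly (a standard textbook form, not taken verbatim from the slide):

$$P_{\text{dynamic}} \approx \alpha \, f \, C \, V^{2}$$

where $\alpha$ is the activity factor, $f$ the clock frequency, $C$ the switched capacitance, and $V$ the supply voltage. Under classical Dennard scaling, shrinking linear dimensions by a factor $k$ gives $C \to C/k$ and $V \to V/k$ while allowing $f \to k f$, so power per transistor falls by roughly $1/k^{2}$, exactly offsetting the $k^{2}$ increase in transistor density: power per unit area stays constant and energy per computation falls.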

THREE CHANGES CONVERGE

• Technology
  • End of Dennard scaling: power becomes the key constraint
  • Slowdown in Moore's Law: transistors cost (even when unused)
• Architectural
  • Limitations and inefficiencies in exploiting instruction-level parallelism end the uniprocessor era.
  • Amdahl's Law and its implications end the "easy" multicore era
• Application focus shifts
  • From the desktop to individual mobile devices, ultrascale cloud computing, and IoT: new constraints.

UNIPROCESSOR PERFORMANCE (SINGLE CORE)

[Figure: performance = highest SPECInt by year; from Hennessy & Patterson [2018].]
MOORE’S LAW IN DRAMS

THE TECHNOLOGY SHIFTS
MOORE'S LAW SLOWDOWN IN INTEL PROCESSORS

[Figure: transistor counts in Intel processors versus the Moore's Law projection; annotation: 10X.]

Cost/transistor is slowing down faster, due to fab costs.
TECHNOLOGY, POWER, AND DENNARD SCALING

[Figure: technology node in nanometers and relative power/energy per nm^2, 2000-2020. Power consumption based on models in Esmaeilzadeh [2011].]

Energy scaling for a fixed task is better, since there are more & faster transistors.
ENERGY EFFICIENCY IS THE NEW METRIC

Battery lifetime determines effectiveness!

LCD is the biggest consumer; the CPU is close behind.

"Always on" assistants are likely to increase CPU demand.
AND IN THE CLOUD

[Figure: effective operating cost and capital costs (with amortization) for a datacenter: amortized servers, amortized networking equipment, other amortized infrastructure, shell and land, power (cooling + power infrastructure + electricity), and staff.]
END OF DENNARD SCALING IS A CRISIS

• Energy consumption has become more important to users
  • For mobile, IoT, and for large clouds
• Processors have reached their power limit
  • Thermal dissipation is maxed out (chips turn off to avoid overheating!)
  • Even with better packaging: heat and battery are limits.
• Architectural advances must increase energy efficiency
  • Reduce power or improve performance for the same power
• But most architectural techniques have reached their limits in energy efficiency!
  • 1982-2005: Instruction-level parallelism
    • Compiler and processor find parallelism
  • 2005-2017: Multicore
    • Programmer identifies parallelism
  • Caches: diminishing returns (small incremental improvements).
Instruction Level Parallelism Era
1982-2005

• Instruction-level parallelism achieves significant performance advantages
• Pipelining: 5 stages to 15+ stages to allow faster clock rates (energy cost neutralized by Dennard scaling)
• Multiple issue: <1 instruction/clock to 4+ instructions/clock
  • Significant increase in transistors to increase issue rate
• Why did it end?
  • Diminishing returns in efficiency
Getting More ILP

• Branches and memory aliasing are a major limit:
  • 4 instructions/clock x 15-deep pipeline ⇒ need more than 60 instructions "in flight"
• Speculation was introduced to allow this
  • Speculation involves predicting program behavior
    • Predict branches & predict matching memory addresses
  • If the prediction is accurate, proceed
  • If the prediction is inaccurate, undo the work and restart
• How good must branch prediction be? Very good!
  • 15-deep pipeline: ~4 branches in flight; 94% overall correctness ⇒ ~98.7% per branch
  • 60 instructions in flight: ~15 branches; 90% overall ⇒ ~99% per branch
  (see the arithmetic sketch below)
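The per-branch figures follow from requiring that every in-flight branch be predicted correctly. A minimal sketch of that arithmetic, assuming independent predictions (the independence assumption is mine, not from the slide):

```python
# Per-branch prediction accuracy needed so that *all* in-flight branches
# are correct with a target overall probability, assuming independence.
def per_branch_accuracy(overall: float, branches_in_flight: int) -> float:
    return overall ** (1.0 / branches_in_flight)

print(per_branch_accuracy(0.94, 4))   # ~0.985 -> close to the ~98.7% quoted above
print(per_branch_accuracy(0.90, 15))  # ~0.993 -> the ~99% quoted above
```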
WASTED WORK ON THE INTEL CORE I7

[Figure: work wasted / total work (0%-40%) across SPEC benchmarks such as PERLBENCH, BZIP2, GCC, MCF, GOBMK, HMMER, SJENG, LIBQUANTUM, H264REF, OMNETPP, ASTAR, XALANCBMK, MILC, NAMD, DEALII, SOPLEX, POVRAY, LBM, and SPHINX3.]

Data collected by Professor Lu Peng and student Ying Zhang at LSU.
The Multicore Era
2005-2017

• Make the programmer responsible for identifying parallelism via threads
• Exploit the threads on multiple cores
  • Increase cores if there are more transistors: easy scaling!
• Energy ≈ Transistor count ≈ Active cores
  • So, we need Performance ≈ Active cores
• But, Amdahl's Law says that this is highly unlikely
AMDAHL'S LAW LIMITS PERFORMANCE GAINS FROM PARALLEL PROCESSING

Speedup versus % "Serial" Processing Time

[Figure: speedup (up to 64x) versus processor count (0-68) for serial fractions of 1%, 2%, 4%, 6%, 8%, and 10%.]
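The curves are a direct plot of Amdahl's Law. A minimal sketch of the formula behind them (standard form; the specific printout is my illustration):

```python
# Amdahl's Law: speedup on n processors when a fraction `serial`
# of the work cannot be parallelized.
def amdahl_speedup(serial: float, n: int) -> float:
    return 1.0 / (serial + (1.0 - serial) / n)

# Even a small serial fraction caps a 64-core chip far below 64x.
for s in (0.01, 0.02, 0.04, 0.06, 0.08, 0.10):
    print(f"{s:.0%} serial, 64 cores -> {amdahl_speedup(s, 64):.1f}x")
```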
SECOND CHALLENGE TO HIGHER PERFORMANCE FROM MULTICORE: END OF DENNARD SCALING

• End of Dennard scaling means multicore scaling ends
  • Full scaling will mean "dark silicon," with cores OFF.
• Example
  • Today: 14 nm process, largest Intel multicore
    • Intel E7-8890: 24 cores, 2.2 GHz, TDP = 165 W (power limited)
    • Turbo (one core): 3.4 GHz. All cores @ 3.4 GHz = 255 W.
  • A 7 nm process could yield (estimates)
    • 64 cores; power unconstrained: 6 GHz & 365 W.
    • 64 cores; power constrained: 4 GHz & 250 W.

Power Limit   Active Cores
180 W         46/64
200 W         51/64
220 W         56/64
PUTTING THE CHALLENGES TOGETHER:
DENNARD SCALING + AMDAHL'S LAW

Speedup versus % "Serial" Processing Time

[Figure: speedup (up to 64x) versus processor (core) count, 1-65, for serial fractions of 1%, 2%, 4%, and 8%, under the power constraints from the previous slide.]
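One way to read the combined chart: cap the usable core count at the power-limited figures from the previous slide, then apply Amdahl's Law to the cores that remain on. A minimal sketch of that combination (the 46-of-64-core cap at 180 W comes from the previous slide; the code itself is my construction):

```python
def amdahl_speedup(serial: float, n: int) -> float:
    return 1.0 / (serial + (1.0 - serial) / n)

def power_limited_speedup(serial: float, cores_built: int, cores_on: int) -> float:
    # Only the cores that fit within the power budget can contribute.
    return amdahl_speedup(serial, min(cores_built, cores_on))

# 64-core chip, but only 46 cores fit in a 180 W budget.
for s in (0.01, 0.02, 0.04, 0.08):
    print(f"{s:.0%} serial -> {power_limited_speedup(s, 64, 46):.1f}x")
```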
SIDEBAR: INSTRUCTION SET EFFICIENCY

• RISC ideas were about improving efficiency:
  • 1980s: efficiency in the use of transistors
  • Less significant in CPUs in the 1990s: processors dominated by other things & Moore/Dennard in full operation
• RISC comeback, driven by the mobile world in the 2000s:
  • Energy efficiency crucial: small batteries, all-day operation
  • Silicon efficiency important for cost!
• 2020 and on: add design efficiency
  • With a growing spectrum of designs targeted to specific applications, the efficiency of design/verification matters increasingly.
What Opportunities Are Left?

▪ SW-centric
- Modern scripting languages are interpreted, dynamically typed, and encourage reuse
- Efficient for programmers but not for execution
▪ HW-centric
- The only path left is Domain Specific Architectures
- Do just a few tasks, but do them extremely well
▪ Combination
- Domain Specific Languages & Architectures
WHAT'S THE OPPORTUNITY?

Matrix multiply: relative speedup to a Python version (18-core Intel)

[Figure: step-by-step speedups (individual steps of roughly 7x to 50x) compounding to ~63,000x over the naive Python version.]

from: "There's Plenty of Room at the Top," Leiserson et al., to appear.
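To make the scale of that opportunity concrete, a minimal illustration (my own example, not the benchmark from the cited paper) of the gap between an interpreted triple loop and a tuned BLAS call:

```python
import time
import numpy as np

def naive_matmul(a, b):
    """Textbook triple loop, executed entirely by the Python interpreter."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            s = 0.0
            for p in range(k):
                s += a[i][p] * b[p][j]
            c[i][j] = s
    return c

n = 128
a, b = np.random.rand(n, n), np.random.rand(n, n)

t0 = time.perf_counter(); naive_matmul(a.tolist(), b.tolist()); t1 = time.perf_counter()
t2 = time.perf_counter(); _ = a @ b; t3 = time.perf_counter()
print(f"interpreted Python: {t1 - t0:.3f} s, NumPy/BLAS: {t3 - t2:.6f} s")
```

On a typical machine the BLAS call is several orders of magnitude faster; adding parallel cores, vectorization, and cache blocking on top of compiled code is what drives the slide's ~63,000x figure.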
DOMAIN SPECIFIC ARCHITECTURES (DSAs)

• Achieve higher efficiency by tailoring the architecture to the characteristics of the domain
• Not one application, but a domain of applications
  • Different from a strict ASIC
• Requires more domain-specific knowledge than GP processors need
• Design DSAs and processors for targeted environments
  • More variability than in GP processors
• Examples:
  • Neural network processors for machine learning
  • GPUs for graphics, virtual reality
• Good news: demand for performance is focused on such domains
WHERE DOES THE ENERGY GO IN GP CPUS?
CAN DSAS DO BETTER?

Function                                   Energy (picojoules)
8-bit add                                  0.03
32-bit add                                 0.1
FP multiply, 16-bit                        1.1
FP multiply, 32-bit                        3.7
Register file access*                      6
Control (per instruction, superscalar)     20-40
L1 cache access                            10
L2 cache access                            20
L3 cache access                            100
Off-chip DRAM access                       1,300-2,600

* Increasing the size or number of ports increases energy roughly proportionally.

From Horowitz [2016].
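A quick reading of the table (simple arithmetic on the numbers above) shows why data movement, not arithmetic, dominates energy:

```python
# Energy per operation in picojoules, taken from the table above (Horowitz [2016]).
energy_pj = {
    "8-bit add": 0.03,
    "32-bit FP multiply": 3.7,
    "L1 cache access": 10,
    "off-chip DRAM access (low end)": 1300,
}

dram = energy_pj["off-chip DRAM access (low end)"]
for op, e in energy_pj.items():
    print(f"one DRAM access costs about {dram / e:,.0f}x an '{op}'")
```

One off-chip DRAM access costs as much as tens of thousands of 8-bit adds, which is why DSAs concentrate on keeping data on-chip and reusing it.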
INSTRUCTION ENERGY BREAKDOWN

Load register (from L1 cache): control 53%, L1 D-cache access 18%, L1 I-cache access 18%, register file access 11%.

FP multiply (32-bit) from registers: control 60%, L1 cache access 20%, register file access 12%, 32-bit FP multiply 8%.
WHY DSAS CAN WIN (NO MAGIC)
TAILOR THE ARCHITECTURE TO THE DOMAIN

• Simpler parallelism for a specific domain (less control HW):
  • SIMD vs. MIMD
  • VLIW vs. speculative, out-of-order
• More effective use of memory bandwidth (on/off chip)
  • User-controlled memory versus caches
  • Processor + memory structures versus traditional
  • Program prefetching to off-chip memory when needed
• Eliminate unneeded accuracy
  • IEEE FP replaced by lower-precision FP
  • 32-bit/64-bit integers reduced to 8-16 bits
• Domain-specific programming model matches the application to the processor architecture
DOMAIN SPECIFIC LANGUAGES

DSAs require targeting high-level operations to the architecture
● Hard to start with a C- or Python-like language and recover the structure
● Need matrix, vector, or sparse matrix operations
● Domain Specific Languages specify these operations:
  ○ OpenGL, TensorFlow, P4
● If DSL programs retain architecture independence, interesting compiler challenges will exist
  ○ XLA

"XLA - TensorFlow, Compiled", XLA Team, March 6, 2017
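As a concrete illustration of the DSL-to-compiler flow (a sketch assuming a recent TensorFlow 2.x installation; this is my example, not code from the talk), a small layer expressed with high-level ops and handed to XLA for compilation:

```python
import tensorflow as tf

# The program states high-level operations (matmul, add, relu); XLA is free
# to fuse and tile them for whatever backend is available (CPU, GPU, TPU).
@tf.function(jit_compile=True)  # ask TensorFlow to compile this function with XLA
def dense_relu(x, w, b):
    return tf.nn.relu(tf.matmul(x, w) + b)

x = tf.random.normal([128, 256])
w = tf.random.normal([256, 512])
b = tf.zeros([512])
print(dense_relu(x, w, b).shape)  # (128, 512)
```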
Deep learning is causing a machine learning revolution

From "A New Golden Age in Computer Architecture: Empowering the Machine-Learning Revolution." Dean, J., Patterson, D., & Young, C. (2018). IEEE Micro, 38(2), 21-29.
TPU 1: High-level Chip Architecture for DNN Inference

▪ Matrix Unit: 65,536 (256x256) 8-bit multiply-accumulate units
▪ 700 MHz clock rate
▪ Peak: 92T operations/second
  ▪ 65,536 * 2 * 700M (see the check below)
▪ >25X as many MACs vs. GPU
▪ >100X as many MACs vs. CPU
▪ 4 MiB of accumulator memory
▪ 24 MiB of on-chip Unified Buffer (activation memory)
  ▪ 3.5X as much on-chip memory vs. GPU
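The 92T peak is just the MAC count times two operations per MAC (a multiply and an add) times the clock rate; a one-line check of the slide's arithmetic:

```python
macs = 256 * 256          # 65,536 8-bit multiply-accumulate units
ops_per_mac = 2           # each MAC counts as a multiply plus an add
clock_hz = 700e6          # 700 MHz
print(macs * ops_per_mac * clock_hz / 1e12)  # ~91.8 Tera-ops/s, i.e. the ~92T peak
```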
HOW IS SILICON USED: TPU-1 VS. CPU?

TPU-1 (excluding pads)
• Memory: 44%
• Compute: 39%
• Interface: 15%
• Control: 2%

CPU (Skylake core)
• Cache: 33%
• Control: 30%
• Compute: 21%
• Memory management: 12%
• Misc: 4%
Matrix Unit Systolic Array (Kung & Leiserson)

[Figure sequence: a 3x3 weight-stationary systolic array in the Matrix Unit (MXU) computing Y = WX with a 3x3 weight matrix W and batch size 3. The weights W11..W33 stay fixed in the array; the inputs X11..X33 stream in along one edge, skewed in time; partial products accumulate as they pass from PE to PE (W11X11, then W11X11 + W12X12, ...); once the pipeline fills, completed results such as Y11 = W11X11 + W12X12 + W13X13 and Y12 = W21X11 + W22X12 + W23X13 emerge on the accumulation edge, one wave per cycle.]

Systolic-array slides courtesy of Cliff Young @ Google.
Advantages: Matrix Unit Systolic Array in TPU-1

• Each operand is used up to 256 times!
• Nearest-neighbor communication replaces RF access:
  • Eliminates many reads/writes; reduces long wire delays
• For the 64K Matrix Unit:
  • Energy saved by eliminating register accesses > energy of the Matrix Unit!
  (a software sketch of the dataflow follows below)

[Figure: final state of the systolic array with all outputs computed, e.g. Y33 = W31X31 + W32X32 + W33X33.]
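To make the dataflow concrete, here is a small cycle-by-cycle software model of a weight-stationary systolic matrix multiply. This is a sketch of the general technique (the indexing and skewing are my own construction), not Google's MXU implementation:

```python
import numpy as np

def systolic_matmul(W, X):
    """Cycle-by-cycle model of a weight-stationary systolic array computing Y = W @ X.

    PE(i, j) holds W[i, j] for the whole computation. Inputs flow down the
    columns one PE per cycle, partial sums flow right along the rows one PE
    per cycle, so every operand is reused by many PEs over nearest-neighbor
    links instead of repeated register-file reads.
    """
    R, C = W.shape
    C2, B = X.shape
    assert C == C2
    Y = np.zeros((R, B))

    x_reg = np.zeros((R, C))   # inputs held in each PE at the end of the previous cycle
    s_reg = np.zeros((R, C))   # partial sums held in each PE at the end of the previous cycle

    for t in range(B + R + C):                  # enough cycles to drain the pipeline
        new_x = np.zeros((R, C))
        new_s = np.zeros((R, C))
        for i in range(R):
            for j in range(C):
                # Input arriving from above (injected at the top, skewed by column index).
                if i == 0:
                    b = t - j
                    x_in = X[j, b] if 0 <= b < B else 0.0
                else:
                    x_in = x_reg[i - 1, j]
                # Partial sum arriving from the left (zero at each row's left edge).
                s_in = 0.0 if j == 0 else s_reg[i, j - 1]
                new_x[i, j] = x_in
                new_s[i, j] = s_in + W[i, j] * x_in
                # A finished sum leaves the right edge of row i.
                if j == C - 1:
                    b = t - i - (C - 1)
                    if 0 <= b < B:
                        Y[i, b] = new_s[i, j]
        x_reg, s_reg = new_x, new_s
    return Y

W = np.arange(9, dtype=float).reshape(3, 3)
X = np.arange(9, 18, dtype=float).reshape(3, 3)
print(np.allclose(systolic_matmul(W, X), W @ X))  # True
```

Each input value is read from memory once and then reused by an entire row of PEs, which is the register-file-traffic saving the slide refers to.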
Performance/Watt on Inference TPU-1 vs CPU & GPU

Important caveat:
• TPU-1 uses 8-bit integer
• GPU uses FP

Log Rooflines for CPU, GPU, TPU

[Figure: log-scale roofline models for the TPU, GPU, and CPU.]
Training: A Much More Intensive Problem

[Figure: FP operations/second required to train in 3.5 months.]
Rapid Innovation

TPU v1 (deployed 2015): 92 teraops, inference only

Cloud TPU v2 (Cloud GA 2017, Pod Alpha 2018): 180 teraflops, 64 GB HBM, training and inference, generally available (GA)

Cloud TPU v3 (Cloud Beta 2018): 420 teraflops, 128 GB HBM, training and inference, beta

Enabled by simpler design, compatibility at the DSL level, and ease of verification.
Enabling Massive Computing Cycles for Training

Cloud TPU Pod (v2, 2017): 11.5 petaflops (64x TPU v2), 4 TB HBM, 2-D toroidal mesh network, training and inference, alpha

TPU v3 Pod (2018): >100 petaflops! (256x TPU v3), 32 TB HBM, liquid cooled, new chip architecture + larger-scale system


CHALLENGES AND OPPORTUNITIES

• Design of DSAs and DSLs
  • Optimizing the mapping to a DSA for portability & performance
  • DSAs & DSLs for new fields
  • Open problem: dealing with sparse data
▪ Make HW development more like software:
  ▪ Prototyping, reuse, abstraction
  ▪ Open HW stacks (ISA to IP libraries)
  ▪ Role of ML in CAD?
▪ Technology:
  ▪ Silicon: extend Dennard scaling and Moore's Law
  ▪ Packaging: use optics, enhance cooling
  ▪ Beyond Si: carbon nanotubes, quantum?
CONCLUDING THOUGHTS:
EVERYTHING OLD IS NEW AGAIN

• Dave Kuck, software architect for the Illiac IV (circa 1975):

"What I was really frustrated about was the fact, with Illiac IV, programming the machine was very difficult and the architecture probably was not very well suited to some of the applications we were trying to run. The key idea was that I did not think we had a very good match in Illiac IV between applications and architecture."
— Dave Kuck, ACM Oral History

• Achieving cost-performance in this era of DSAs will require matching the applications, languages, and architecture, and reducing design cost.


WHY DRAMS ARE HARD AND CRITICAL!

INTEL CORE I7: Theoretical CPI = 0.25

[Figure: achieved CPI across SPEC CPU2006 benchmarks (roughly 0.5 to 2.5), versus the theoretical CPI of 0.25.]

Data collected by Professor Lu Peng and student Ying Zhang at LSU.
