The End of an Era
John Hennessy
Stanford University
March 2019
THREE CHANGES CONVERGE
• Technology
  • End of Dennard scaling: power becomes the key constraint
  • Slowdown of Moore's Law: transistors still cost (even when unused)
• Architectural
  • Limitations and inefficiencies in exploiting instruction-level parallelism end the uniprocessor era
  • Amdahl's Law and its implications end the "easy" multicore era
• Application focus shifts
  • From the desktop to personal mobile devices, ultrascale cloud computing, and IoT: new constraints
UNIPROCESSOR PERFORMANCE (SINGLE CORE)
THE TECHNOLOGY SHIFTS
MOORE'S LAW SLOWDOWN IN INTEL PROCESSORS
[Figure: Intel processor transistor counts vs. the Moore's Law trend line; the gap has grown to about 10X. Cost per transistor is slowing down even faster, due to fab costs.]
TECHNOLOGY, POWER, AND DENNARD SCALING
[Figure: technology scaling and power, 2000-2020; power consumption based on models in Esmaeilzadeh [2011]. Energy scaling for a fixed task is better, since transistors become more numerous and faster.]
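As a refresher on why these curves flatten (a minimal sketch, not from the slides): under classic Dennard scaling, shrinking dimensions and voltage by 1/S while raising frequency by S keeps power density constant; once voltage can no longer scale, the same shrink multiplies power density.

```python
# Minimal sketch of Dennard scaling (illustrative, not from the slides).
# Dynamic power per transistor ~ C * V^2 * f, and S^2 more transistors
# fit in the same area after scaling linear dimensions by 1/S.

def power_density(S, voltage_scales=True):
    """Relative power density after scaling linear dimensions by 1/S."""
    C = 1 / S                              # capacitance shrinks with feature size
    V = 1 / S if voltage_scales else 1.0   # Dennard: V scales; post-Dennard: it can't
    f = S                                  # frequency rises with scaling
    per_transistor = C * V**2 * f
    return S**2 * per_transistor           # transistors per unit area grow as S^2

for S in (1, 2, 4):
    print(f"S={S}: Dennard {power_density(S):.2f}, "
          f"post-Dennard {power_density(S, voltage_scales=False):.2f}")
# Dennard stays at 1.0; without voltage scaling, power density grows as S^2.
```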
ENERGY EFFICIENCY IS THE NEW METRIC
• LCD is the biggest energy consumer; the CPU is close behind.
• "Always on" assistants are likely to increase CPU demand.
AND IN THE CLOUD
[Figure: datacenter cost breakdown: servers, networking equipment, amortized power and cooling infrastructure, electricity, staff, and other amortized infrastructure.]
END OF DENNARD SCALING IS A CRISIS
Instruction Level Parallelism Era
1982-2005
WASTED WORK ON THE INTEL CORE I7
[Figure: wasted work as a fraction of total work (0-40%) across SPEC CPU2006 benchmarks (PERLBENCH, BZIP2, GCC, MCF, GOBMK, HMMER, SJENG, LIBQUANTUM, H264REF, OMNETPP, ASTAR, XALANCBMK, MILC, NAMD, DEALII, SOPLEX, POVRAY, LBM, SPHINX3). Data collected by Professor Lu Peng and student Ying Zhang at LSU.]
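One plausible way to compute such a wasted-work fraction (my assumption; the slide does not spell out the methodology) is to compare micro-ops issued down speculative paths against micro-ops that actually retire, e.g. from hardware performance counters:

```python
# Hedged sketch: estimate mis-speculation waste from issued vs. retired
# micro-op counts (counter names and values below are hypothetical).

def wasted_work_fraction(uops_issued: int, uops_retired: int) -> float:
    """Fraction of issued work that was squashed rather than retired."""
    return (uops_issued - uops_retired) / uops_issued

# Hypothetical counter readings for one benchmark run:
print(f"{wasted_work_fraction(1_300_000_000, 1_000_000_000):.0%}")
# -> 23%, within the 0-40% range the Core i7 data above spans
```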
The Multicore Era
2005-2017
AMDAHL'S LAW LIMITS PERFORMANCE GAINS FROM PARALLEL PROCESSING
[Figure: speedup vs. processor count (0-68) for serial fractions of 2%, 4%, 6%, 8%, and 10%; even small serial fractions flatten the curves far below linear speedup.]
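The curves above are just Amdahl's Law evaluated at serial fractions of 2-10%; a minimal sketch:

```python
# Amdahl's Law, as plotted above: with serial fraction s, speedup on
# N processors is 1 / (s + (1 - s) / N), capped at 1/s regardless of N.

def amdahl_speedup(n_processors: int, serial_fraction: float) -> float:
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_processors)

for s in (0.02, 0.04, 0.06, 0.08, 0.10):
    print(f"serial {s:.0%}: 64 cores -> {amdahl_speedup(64, s):.1f}x "
          f"(limit {1 / s:.0f}x)")
# Even 2% serial code limits 64 processors to about 28x.
```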
SECOND CHALLENGE TO HIGHER PERFORMANCE FROM MULTICORE: END OF DENNARD SCALING
[Figure: a 220 W power budget limits active cores to roughly 56 of 64.]
PUTTING THE CHALLENGES TOGETHER:
DENNARD SCALING + AMDAHL'S LAW
[Figure: speedup vs. processors (cores), 1-65, with an 8% serial fraction; power limits plus Amdahl's Law keep speedup far below linear.]
SIDEBAR: INSTRUCTION SET EFFICIENCY
WHAT OPPORTUNITIES ARE LEFT?
▪ SW-centric
- Modern scripting languages are interpreted, dynamically typed, and encourage reuse
- Efficient for programmers, but not for execution
▪ HW-centric
- The only path left is Domain-Specific Architectures
- Do just a few tasks, but do them extremely well
▪ Combination
- Domain-Specific Languages & Architectures
WHAT'S THE OPPORTUNITY?
[Figure: speedups from software performance engineering on one computation: 9X, 20X, 7X, and 50X at individual steps, and 63,000X overall. From "There's Plenty of Room at the Top," Leiserson et al., to appear.]
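As a hedged illustration of this software-centric opportunity (not the paper's exact experiment), the same matrix multiply in interpreted Python versus NumPy's optimized native code already shows orders of magnitude; the 63,000X figure came from far more aggressive tuning (parallelism, vectorization, blocking):

```python
# Illustrative sketch: the same matrix multiply in pure, interpreted,
# dynamically typed Python vs. NumPy's optimized native BLAS path.
# Speedups are machine-dependent.
import time
import numpy as np

n = 256
A = [[float(i * n + j) for j in range(n)] for i in range(n)]
B = [[float(j * n + i) for j in range(n)] for i in range(n)]

start = time.perf_counter()
C = [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
     for i in range(n)]                       # interpreted triple loop
python_secs = time.perf_counter() - start

An, Bn = np.array(A), np.array(B)
start = time.perf_counter()
Cn = An @ Bn                                  # optimized native code
numpy_secs = time.perf_counter() - start

print(f"pure Python: {python_secs:.3f}s, NumPy: {numpy_secs:.5f}s, "
      f"speedup ~{python_secs / numpy_secs:.0f}x")
```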
DOMAIN SPECIFIC ARCHITECTURES (DSAs)
WHERE DOES THE ENERGY GO IN GP CPUS?
CAN DSAs DO BETTER?
[Figure: two pie charts of per-instruction energy in a general-purpose CPU (an FP multiply and a 32-bit operation). Control dominates (53% and 60%), followed by L1 I-cache and D-cache accesses (about 18% each) and register-file accesses (11-12%); the arithmetic itself is a small slice.]
WHY DSAs CAN WIN (NO MAGIC)
TAILOR THE ARCHITECTURE TO THE DOMAIN
DOMAIN SPECIFIC LANGUAGES
Deep learning is causing a machine learning revolution
TPU-1 (excluding pads):
• Memory: 44%
• Compute: 39%
• Interface: 15%
• Control: 2%
Matrix Unit Systolic Array
Computing Y = WX
[Figure: animation frames of a 3x3 weight-stationary systolic Matrix Unit (MXU). Weights W11..W33 sit in the array; inputs X11..X33 stream in from the left, skewed in time, and partial sums accumulate as they flow (W11X11, then W11X11 + W12X12, ...), until finished outputs emerge: Y11 = W11X11 + W12X12 + W13X13, Y21 = W11X21 + W12X22 + W13X23, Y12 = W21X11 + W22X12 + W23X13, ..., Y33 = W31X31 + W32X32 + W33X33.]
Advantages: Matrix Unit Systolic Array in TPU-1
• Each operand is used up to 256 times!
• Nearest-neighbor communication replaces register-file access:
  • eliminates many reads/writes and reduces long wire delays
• For the 64K-element (256 x 256) Matrix Unit:
  • the energy saved by eliminating register accesses exceeds the energy used by the Matrix Unit itself!
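To make the dataflow above concrete, here is a minimal cycle-by-cycle sketch in Python (my reconstruction under the slides' weight-stationary picture, not TPU RTL; the function name and register layout are illustrative). Weights stay put, each input element is reused across a whole row as it hops right, and partial sums accumulate as they hop down, so operands come from neighbors rather than a register file. The skewed injection mirrors the staggered X entries in the frames:

```python
# Sketch of a weight-stationary systolic array (reconstruction, not TPU RTL).
import numpy as np

def systolic_matvec(W: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Compute y = W @ x on an n x n grid of multiply-accumulate cells.

    Cell (k, b) holds stationary weight W[b, k]; x[k] streams rightward
    along row k (injected at cycle k, i.e. skewed), partial sums stream
    down column b, and y[b] leaves the bottom of column b at cycle n-1+b.
    """
    n = W.shape[0]
    x_reg = np.zeros((n, n))                   # per-cell activation register
    psum = np.zeros((n, n))                    # per-cell partial-sum register
    y = np.zeros(n)
    for t in range(2 * n - 1):
        x_reg[:, 1:] = x_reg[:, :-1].copy()    # activations hop right
        psum[1:, :] = psum[:-1, :].copy()      # partial sums hop down
        psum[0, :] = 0.0
        for k in range(n):                     # skewed injection at the left edge
            x_reg[k, 0] = x[k] if t == k else 0.0
        psum += W.T * x_reg                    # every cell MACs in parallel
        for b in range(n):                     # completed sums exit the bottom row
            if t == (n - 1) + b:
                y[b] = psum[n - 1, b]
    return y

W = np.arange(9.0).reshape(3, 3)
x = np.array([1.0, 2.0, 3.0])
assert np.allclose(systolic_matvec(W, x), W @ x)
```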
Performance/Watt on Inference: TPU-1 vs. CPU & GPU
Important caveat:
• TPU-1 uses 8-bit integer arithmetic
• The GPU uses floating point
Log Rooflines for CPU, GPU, TPU
[Figure: log-log roofline plots for the CPU, GPU, and TPU.]
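The roofline model behind these curves is just two lines: attainable throughput is the smaller of peak compute and memory bandwidth times arithmetic intensity. A sketch with illustrative numbers (TPU-1's 92 teraops comes from the talk; the bandwidths here are placeholders, not measured values):

```python
# Roofline model: attainable ops/s = min(peak compute, bandwidth * intensity).

def roofline(peak_ops: float, mem_bw: float, intensity: float) -> float:
    """Attainable ops/s at a given arithmetic intensity (ops per byte)."""
    return min(peak_ops, mem_bw * intensity)

# Hypothetical (peak ops/s, bytes/s) pairs for the three machine classes:
machines = {"CPU": (1e12, 50e9), "GPU": (10e12, 300e9), "TPU": (92e12, 34e9)}
for name, (peak, bw) in machines.items():
    ridge = peak / bw   # intensity where the machine turns compute-bound
    print(f"{name}: ridge point at {ridge:.0f} ops/byte")
```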
Training: A Much More Intensive Problem
[Figure: FP operations/second required to train within 3.5 months.]
Rapid Innovation
• TPU v1 (deployed 2015): 92 teraops; inference only
• Cloud TPU v2 (Cloud GA 2017, Pod Alpha 2018): 180 teraflops, 64 GB HBM; training and inference; generally available (GA)
• Cloud TPU v3 (Cloud Beta 2018): 420 teraflops, 128 GB HBM; training and inference; beta
CONCLUDING THOUGHTS:
EVERYTHING OLD IS NEW AGAIN
INTEL CORE I7: Theoretical CPI = 0.25
[Figure: achieved CPI on the Intel Core i7 across SPEC CPU2006 benchmarks, ranging from roughly 0.5 to 2.5 against the theoretical minimum of 0.25. Data collected by Professor Lu Peng and student Ying Zhang at LSU.]