Deep Learning Tutorial and Recent Trends by Song Han
Song Han
Stanford University
Feb 22, 2017
FPGA17, Monterey
Intro
Deep Learning is Changing Our Lives
Self-driving cars, machine translation, playing Go, and artistic style transfer.
Machine Learning 101: the Setup
Training: training hardware fits a model (the weights/parameters) to a training dataset.
Inference: the trained model is run on new inputs (e.g., a test image) to make predictions.
Models and training sets are growing fast:
- Image recognition, 16x more model compute: from 8 layers and 1.4 GFLOP at ~16% error to 152 layers and 22.6 GFLOP at ~3.5% error.
- Speech recognition, 10x more training ops: from 80 GFLOP with 7,000 hrs of data at ~8% error to 465 GFLOP with 12,000 hrs of data at ~5% error.
Open the Box for HW Design
Between the application and the hardware sits a question mark: this tutorial opens that box, treating the algorithm and the hardware together, for both inference and training.
Agenda
Algorithm and Hardware, for both Inference and Training:
1. Algorithms for Efficient Inference
2. Hardware for Efficient Inference
3. Algorithms for Efficient Training
4. Hardware for Efficient Training
The Problem: Large DNN Models on Mobile

To achieve this goal, we present a method to prune network connections in a manner that preserves the original accuracy. After an initial training phase, we remove all connections whose weight is lower than a threshold. This pruning converts a dense, fully-connected layer to a sparse layer. This first phase learns the topology of the network, learning which connections are important and removing the unimportant connections. We then retrain the sparse network so the remaining connections can compensate for the connections that have been removed. The phases of pruning and retraining may be repeated iteratively to further reduce network complexity. In effect, this training process learns the network connectivity in addition to the weights. (Han et al., NIPS 2015)
Part 1: Algorithms for Efficient Inference
1. Pruning
2. Weight Sharing
3. Quantization
4. Low Rank Approximation
5. Binary / Ternary Net
6. Winograd Transformation
Pruning Neural Networks
Han et al., Learning both Weights and Connections for Efficient Neural Networks, NIPS 2015

# Synapses in the Human Brain
Pruning happens in nature too: roughly 50 trillion synapses in a newborn, around 1,000 trillion at age one, and about 500 trillion in adolescence.
Retrain to Recover Accuracy
Figure: accuracy loss (0% to -4.5%) vs. parameters pruned away (40% to 100%). Pruning alone loses accuracy as sparsity grows; pruning followed by retraining, iterated, keeps the accuracy loss near zero even at high sparsity.
Han et al., Learning both Weights and Connections for Efficient Neural Networks, NIPS 2015
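The prune-and-retrain loop described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the single weight matrix, the toy quadratic loss, and the helper names (prune_by_threshold, retrain, grad_fn) are hypothetical stand-ins for a real network and training loop.

```python
# Minimal sketch of iterative magnitude pruning with retraining.
import numpy as np

def prune_by_threshold(weights, sparsity):
    """Zero out the smallest-magnitude weights so that `sparsity` fraction are removed."""
    threshold = np.quantile(np.abs(weights), sparsity)
    mask = (np.abs(weights) > threshold).astype(weights.dtype)
    return weights * mask, mask

def retrain(weights, mask, grad_fn, lr=1e-2, steps=100):
    """Retrain only the surviving connections: pruned weights stay at zero."""
    for _ in range(steps):
        grad = grad_fn(weights)            # stand-in for the gradient of the loss
        weights = (weights - lr * grad) * mask
    return weights

# Toy usage: a quadratic loss pulling the weights toward a random target.
rng = np.random.default_rng(0)
W = rng.normal(size=(256, 128))
target = rng.normal(size=W.shape)
grad_fn = lambda w: w - target

for sparsity in (0.5, 0.7, 0.9):           # iterative pruning: prune, retrain, repeat
    W, mask = prune_by_threshold(W, sparsity)
    W = retrain(W, mask, grad_fn)
    print(f"sparsity {sparsity:.0%}: nonzeros = {int(mask.sum())}")
```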
Pruning RNN and LSTM
Pruning also applies to recurrent networks, e.g., the NeuralTalk LSTM for image captioning.

Speedup and Energy Efficiency for the Pruned FC Layers (geometric mean over benchmarks)
Baselines:
- CPU: Intel Core i7-5930K, MKL CBLAS GEMV (dense) / MKL Sparse BLAS CSRMV (sparse)
- GPU: NVIDIA GeForce GTX Titan X, cuBLAS GEMV (dense) / cuSPARSE CSRMV (sparse)
- mGPU: NVIDIA Tegra K1, cuBLAS GEMV (dense) / cuSPARSE CSRMV (sparse)
Weight Sharing
Figure 2: Representing the matrix sparsity with relative index. Padding filler zero to prevent overflow.
Figure 3: Weight sharing by scalar quantization (top) and centroids fine-tuning (bottom). Weights are clustered and each weight stores only the index of its shared centroid; during fine-tuning, the gradients of all weights in the same cluster are grouped and reduced (summed) to update that centroid.
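A minimal sketch of the weight-sharing idea in Figure 3, under the assumption that k-means performs the scalar quantization (as in Deep Compression); the 4x4 toy matrix, the cluster count, the made-up gradient, and the learning rate are illustrative only.

```python
# Weight sharing via k-means, plus centroid fine-tuning by grouping gradients.
import numpy as np
from sklearn.cluster import KMeans

weights = np.array([[ 2.09, -0.98,  1.48,  0.09],
                    [ 0.05, -0.14, -1.08,  2.12],
                    [-0.91,  1.92,  0.00, -1.03],
                    [ 1.87,  0.00,  1.53,  1.49]])

k = 4                                              # 2-bit indices -> 4 shared centroids
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(weights.reshape(-1, 1))
indices = km.labels_.reshape(weights.shape)        # stored per weight (2 bits each)
centroids = km.cluster_centers_.ravel()            # stored once per layer

quantized = centroids[indices]                     # reconstructed weight matrix

# Centroid fine-tuning: group the dense gradient by cluster index and reduce (sum).
grad = np.random.default_rng(0).normal(size=weights.shape)
centroid_grad = np.array([grad[indices == c].sum() for c in range(k)])
centroids -= 0.01 * centroid_grad                  # gradient step on the shared values
```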
Results of Deep Compression: AlexNet on ImageNet
Han et al., Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, ICLR 2016 (Best Paper Award)
Quantizing the Weight and Activation
Qiu et al. Going Deeper with Embedded FPGA Platform for Convolutional Neural Network, FPGA16
Quantization Result
Qiu et al. Going Deeper with Embedded FPGA Platform for Convolutional Neural Network, FPGA16
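One common way to realize weight/activation quantization is dynamic fixed-point: keep k-bit integers plus a per-layer fractional length chosen to minimize quantization error. The sketch below is a simplified assumption of that recipe, not the exact procedure of Qiu et al.; the error metric, the candidate range for the fractional length, and the toy weights are all illustrative.

```python
# Minimal sketch of dynamic fixed-point quantization.
import numpy as np

def quantize_fixed_point(x, bits, frac_len):
    """Round x to a signed fixed-point grid with `bits` total bits and `frac_len` fractional bits."""
    scale = 2.0 ** frac_len
    qmin, qmax = -2 ** (bits - 1), 2 ** (bits - 1) - 1
    q = np.clip(np.round(x * scale), qmin, qmax)
    return q / scale

def best_frac_len(x, bits, candidates=range(-8, 16)):
    """Pick the fractional length that minimizes L2 error for this tensor."""
    errs = [(np.sum((x - quantize_fixed_point(x, bits, f)) ** 2), f) for f in candidates]
    return min(errs)[1]

w = np.random.default_rng(0).normal(scale=0.1, size=1000)
f = best_frac_len(w, bits=8)
w_q = quantize_fixed_point(w, bits=8, frac_len=f)
print(f"chosen fractional length: {f}, max error: {np.abs(w - w_q).max():.4f}")
```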
Low Rank Approximation for Conv
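As a rough illustration of the idea named above, the sketch below factorizes a weight matrix with a truncated SVD; the sizes, the kept rank, and the synthetic matrix are arbitrary choices, and real low-rank methods for conv layers decompose the filter tensor in more structured ways.

```python
# Low-rank approximation of a weight matrix: W (m x n) ~= U_r (m x r) @ V_r (r x n),
# cutting the multiply count from m*n down to r*(m+n).
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(512, 256)) @ rng.normal(size=(256, 1024)) / 256   # roughly low-rank

U, S, Vt = np.linalg.svd(W, full_matrices=False)
r = 64                                    # kept rank
U_r = U[:, :r] * S[:r]                    # (512, 64), singular values folded in
V_r = Vt[:r, :]                           # (64, 1024)

x = rng.normal(size=1024)
y_full, y_low = W @ x, U_r @ (V_r @ x)    # two thin matmuls instead of one big one
print(np.linalg.norm(y_full - y_low) / np.linalg.norm(y_full))   # relative error
```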
Binary / Ternary Net: Motivation
Figure: histogram (count vs. weight value) of the normalized full-precision weights and of the intermediate ternary weights after quantization: values in (-t, t) become 0, the rest become -1 or +1.
Binary-Weight-Network and XNOR-Network
Rastegari et al., XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks, ECCV 2016
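A minimal sketch of the Binary-Weight-Network approximation: each filter is replaced by a binary tensor times one scaling factor, with alpha taken as the mean absolute value of the weights as in the XNOR-Net paper; the random filter below is just an example.

```python
# Binary-Weight-Network: W ~= alpha * B, with B = sign(W) and alpha = mean(|W|).
# The convolution then needs only additions/subtractions plus one multiply per filter.
import numpy as np

def binarize_filter(w):
    """Return (alpha, b) such that w ~= alpha * b, with b in {-1, +1}."""
    alpha = np.abs(w).mean()            # optimal scaling factor (L1 mean)
    b = np.where(w >= 0, 1.0, -1.0)     # binary weights
    return alpha, b

w = np.random.default_rng(0).normal(size=(3, 3, 64))   # one conv filter
alpha, b = binarize_filter(w)
approx = alpha * b
print("relative reconstruction error:", np.linalg.norm(w - approx) / np.linalg.norm(w))
```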
Trained Ternary Quantization (TTQ)
Pipeline: Full-Precision Weight → Normalize → Intermediate Ternary Weight (quantize with threshold ±t: values in (-t, t) become 0, the rest become -1 / +1) → Scale → Final Ternary Weight in {-Wn, 0, +Wp}, where Wp and Wn are trained per-layer scaling factors. The feed-forward pass uses the ternary weights; back-propagation sends gradient1 to the stored full-precision weights and gradient2 to the scales Wp and Wn. At inference time only the ternary weights and the two scales are kept.
Zhu et al., Trained Ternary Quantization, ICLR 2017
and 2) maintain a constant sparsity r for all layers throughout training. By adjusting the hyper-parameter r we are able to obtain ternary weight networks with various sparsities. We use the first method and set t to 0.05 in experiments on the CIFAR-10 and ImageNet datasets, and use the second one to explore a wider range of sparsities in section 5.1.1.
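A minimal sketch of the ternarization step described above, using the threshold t = 0.05 from the excerpt; in the real method Wp and Wn are learned by back-propagation, whereas here they are fixed at 1.0 for illustration.

```python
# TTQ-style forward quantization of one layer's weights.
import numpy as np

def ttq_quantize(w_full, t=0.05, w_p=1.0, w_n=1.0):
    """Ternarize normalized full-precision weights into {-Wn, 0, +Wp}."""
    w = w_full / np.abs(w_full).max()        # normalize to [-1, 1]
    ternary = np.zeros_like(w)
    ternary[w > t] = w_p                     # positive weights share the scale Wp
    ternary[w < -t] = -w_n                   # negative weights share the scale Wn
    return ternary

w = np.random.default_rng(0).normal(scale=0.1, size=(64, 64))
w_t = ttq_quantize(w)
print(f"fraction of zero weights: {np.mean(w_t == 0):.1%}")
```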
Weight Evolution during Training
Figure 2: Ternary weight values (above) and the distribution of negative / zero / positive weights (below) over training epochs (0 to 150) for different layers of ResNet-20 on CIFAR-10.
Visualization of the TTQ Kernels
Lin, Zhouhan, et al., Neural Networks with Few Multiplications, arXiv:1510.03009, 2015.
Li, Fengfu, and Bin Liu, Ternary Weight Networks, arXiv:1605.04711, 2016.
Winograd Convolution
Substituting $U = G g G^T$ and $V = B^T d B$, we have:
$$Y = A^T \left[\, U \odot V \,\right] A \qquad (9)$$
Labeling tile coordinates as $(\tilde{x}, \tilde{y})$, we rewrite the convnet layer formula for a single image $i$, filter $k$, and tile coordinate $(\tilde{x}, \tilde{y})$ as:
$$Y_{i,k,\tilde{x},\tilde{y}} = \sum_{c=1}^{C} D_{i,c,\tilde{x},\tilde{y}} \ast G_{k,c}
= \sum_{c=1}^{C} A^T \left[\, U_{k,c} \odot V_{c,i,\tilde{x},\tilde{y}} \,\right] A
= A^T \left[\, \sum_{c=1}^{C} U_{k,c} \odot V_{c,i,\tilde{x},\tilde{y}} \,\right] A \qquad (10)$$
Thus we can reduce over C channels in transform space, and only then apply the inverse transform A to the sum. This amortizes the cost of the inverse transform over the number of channels.
Winograd, Arithmetic Complexity of Computations, volume 33, SIAM, 1980.
Lavin & Gray, Fast Algorithms for Convolutional Neural Networks, 2015.
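A minimal sketch of Winograd F(2x2, 3x3) for a single channel, using the standard B, G, A transform matrices from Lavin & Gray; it produces a 2x2 output tile from a 4x4 input tile with 16 elementwise multiplies instead of 36, and checks the result against direct (valid) correlation, which is what CNN "convolution" computes.

```python
# Winograd F(2x2, 3x3): Y = A^T [ (G g G^T) * (B^T d B) ] A  (elementwise product).
import numpy as np

B_T = np.array([[1,  0, -1,  0],
                [0,  1,  1,  0],
                [0, -1,  1,  0],
                [0,  1,  0, -1]], dtype=float)
G = np.array([[1,    0,   0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0,    0,   1]], dtype=float)
A_T = np.array([[1, 1,  1,  0],
                [0, 1, -1, -1]], dtype=float)

def winograd_2x2_3x3(d, g):
    """d: 4x4 input tile, g: 3x3 filter -> 2x2 output tile of the valid correlation."""
    U = G @ g @ G.T                      # transformed filter (can be precomputed)
    V = B_T @ d @ B_T.T                  # transformed input tile
    return A_T @ (U * V) @ A_T.T         # 16 elementwise multiplies, then inverse transform

rng = np.random.default_rng(0)
d, g = rng.normal(size=(4, 4)), rng.normal(size=(3, 3))
direct = np.array([[np.sum(d[i:i+3, j:j+3] * g) for j in range(2)] for i in range(2)])
print(np.allclose(winograd_2x2_3x3(d, g), direct))   # True
```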
Training in the Winograd Domain (Workshop track, ICLR 2017)

Producing 4 output pixels:
- Direct convolution: 4 x 9 = 36 multiplications (1x); pruning gives sparse weights [NIPS'15] (~3x fewer multiplies) and ReLU gives sparse activations (~3x fewer).
- Winograd convolution: 4 x 4 = 16 multiplications (2.25x fewer), but the kernel and activation transforms fill the zeros back in: dense weights (1x), dense activations (1x), so the overall saving is only 2.25x.

Liu et al., Efficient Sparse-Winograd Convolutional Neural Networks, submitted to ICLR 2017 workshop.
Figure 1: Combining Winograd convolution with sparse weights and activations. (a) The original Winograd-based convolution proposed by Lavin (2015) fills in the non-zeros in both the weights and activations. (b) Pruning the 4x4 transformed kernel restores sparsity to the weights. (c) Moving ReLU into the Winograd domain restores sparsity to the activations.
Solution: Fold ReLU into the Winograd Domain
Train, prune the transformed (4x4) kernel, and re-train: the weights stay sparse in the Winograd domain (~2.5x fewer multiplies), and applying ReLU to the transformed activations keeps the activations sparse as well, so the savings from pruning, ReLU sparsity and the Winograd transform combine.
Re-training continues until accuracy converges. We varied the pruning rate R from 20% to 70%. The first convolution layer is not included in pruning but is included in re-training.

Figure 2 shows accuracy as a function of density for the three architectures of Figure 1: the network of Figure 1c (which moves pruning and ReLU to the transform domain) can be pruned to 40% density without significant (>0.1%) loss of accuracy, while the conventional network of Figure 1a can only be pruned to 60% density before accuracy falls. (Plot: accuracy 92.2% to 93% vs. weight density 20% to 70%.)

Figure 3: Activation density of convolution layers of VGG-nagadomi. Whiskers show one standard deviation above and below the mean.
Liu et al., Efficient Sparse-Winograd Convolutional Neural Networks, submitted to ICLR 2017 workshop.
Part 2: Hardware for Efficient Inference
DianNao (Electric Brain)

Architecture: a Control Processor (CP) drives a three-stage Neural Functional Unit (NFU-1 multipliers, NFU-2 adder trees, NFU-3 activation functions), fed by three on-chip buffers with their own DMAs: NBin (input neurons), NBout (output neurons) and SB (synapses/weights). Tiling by a factor Tn keeps the datapath busy every cycle.

Layout at 65nm (Table 6): about 3.02 mm^2 and 485 mW total, with a 1.02 ns critical path. Breakdown by functional block:

Block | Area (um^2) | Area % | Power (mW) | Power %
ACCELERATOR (total) | 3,023,077 | 100% | 485 | 100%
SB | 1,153,814 | 38.17% | 105 | 22.65%
NFU | 846,563 | 28.00% | 132 | 27.22%
NBout | 433,906 | 14.35% | 92 | 19.97%
NBin | 427,992 | 14.16% | 91 | 19.76%
CP | 141,809 | 5.69% | 31 | 6.39%
AXIMUX | 9,767 | 0.32% | 8 | 2.65%
Other | 9,226 | 0.31% | 26 | 5.36%

Chen et al., DianNao: A Small-Footprint High-Throughput Accelerator for Ubiquitous Machine-Learning, ASPLOS 2014.

DaDianNao (multi-node, all weights in on-chip eDRAM)

Each node is a tile-based 28nm chip: 16 tiles, two central eDRAM banks, a fat-tree interconnect, and four HyperTransport (HT) links for inter-node communication. A node is about 68 mm^2, fits 12 million parameters in on-chip eDRAM, and consumes about 16 W; memory (tile eDRAMs plus the central eDRAM) accounts for roughly 38% of the power and the four HT interfaces for about half (50.14%). Layers that do not fit on one node are spread across nodes: a full NN with 59.48M synapses (118.96 MB at 16-bit) needs at least 4 nodes, and CONV3*/CONV4* need a 36-node system. On average, the 1-node, 4-node, 16-node and 64-node configurations are respectively 21.38x, 79.81x, 216.72x and 450.65x faster than the GPU baseline for inference. LRN and POOL layers scale best (up to 1340.77x for LRN2 on 64 nodes, since they need no inter-node communication), while CLASS layers scale less well (e.g., 72.96x for CLASS1 on 64 nodes) because of the large amount of inter-node communication.

Chen et al., DaDianNao: A Machine-Learning Supercomputer, MICRO 2014.

ShiDianNao (a vision accelerator placed next to the sensor)

Targets small CNNs whose entire model (up to 64K parameters) fits in on-chip SRAM, so no DRAM access is needed; the computation is mapped onto a 2D array of processing elements. Compared with DianNao, the buffers are much larger (NBin/NBout 64 KB each, SB 128 KB, instruction buffer 32 KB, vs. 1 KB / 1 KB / 16 KB / 8 KB), while the datapath is similar (16-bit data, 64 multipliers). At 65nm and 1 GHz the chip is 4.86 mm^2 and consumes about 320 mW, most of it in the NFU.

Du et al., ShiDianNao: Shifting Vision Processing Closer to the Sensor, ISCA 2015.
Eyeriss: Reduce Memory Access by Row-Stationary Dataflow
Chen et al., Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks, ISCA 2016

Dataflow Comparison: CONV Layers
Figure: normalized energy per MAC, broken down into ALU, RF, NoC, buffer and DRAM, for different CNN dataflows (WS, OSA, OSB, OSC, NLR, RS); the row-stationary (RS) dataflow minimizes the overall data-movement energy.
Chen et al., Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks, ISCA 2016
EIE: Reduce Memory Access by Compression

After pruning, the weight matrix is sparse, and after ReLU the input activation vector is sparse too, so EIE computes the layer directly in the compressed domain:
$$\tilde{b} = \mathrm{ReLU}\!\left(W \tilde{a}\right),\qquad
\tilde{a} = (a_0,\; a_1,\; 0,\; a_3)^T,\qquad
W = \begin{pmatrix}
w_{0,0} & w_{0,1} & 0 & w_{0,3}\\
0 & 0 & w_{1,2} & 0\\
0 & w_{2,1} & 0 & w_{2,3}\\
0 & 0 & 0 & 0\\
0 & 0 & w_{4,2} & w_{4,3}\\
w_{5,0} & 0 & 0 & 0\\
0 & 0 & 0 & w_{6,3}\\
0 & w_{7,1} & 0 & 0
\end{pmatrix}$$
Rows of W are logically interleaved across processing elements (PE0 holds rows 0, 4, ...; PE1 rows 1, 5, ...; and so on), and each PE stores its non-zero weights in a compressed sparse column format (column pointers 0, 1, 2, 3 plus relative row indices). Multiplications are performed only for non-zero activations against the non-zero weights in the corresponding columns.

Han et al., EIE: Efficient Inference Engine on Compressed Deep Neural Network, ISCA 2016 / Hot Chips 2016
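A minimal sketch of the computation EIE performs: a sparse matrix-vector product where each column's non-zero weights are stored as (row index, 4-bit codebook index) pairs and zero activations are skipped entirely. The tiny matrix, the 8-entry codebook, and the single-PE view are simplifications; the real EIE stores relative (delta) row indices and interleaves rows across PEs.

```python
# Compressed sparse matrix-vector product with codebook (weight-sharing) lookup.
import numpy as np

codebook = np.array([0.0, -1.0, -0.5, 0.5, 1.0, 1.5, -1.5, 2.0])   # shared "real" weights
W = np.zeros((8, 4))                                                # pruned dense matrix
W[0, 0], W[5, 0], W[2, 1], W[7, 1], W[1, 2], W[4, 2], W[0, 3], W[6, 3] = \
    codebook[[4, 3, 1, 2, 5, 4, 3, 1]]

# Build a CSC-like compressed form: per column, (row index, codebook index).
col_ptr, rows, codes = [0], [], []
for j in range(W.shape[1]):
    nz = np.nonzero(W[:, j])[0]
    rows.extend(nz)
    codes.extend(int(np.argmin(np.abs(codebook - W[i, j]))) for i in nz)
    col_ptr.append(len(rows))

def eie_matvec(a):
    """b = W @ a computed from the compressed representation, skipping zero activations."""
    b = np.zeros(W.shape[0])
    for j, aj in enumerate(a):
        if aj == 0.0:                      # activation sparsity: skip the whole column
            continue
        for p in range(col_ptr[j], col_ptr[j + 1]):
            b[rows[p]] += codebook[codes[p]] * aj   # weight decoded by table lookup
    return b

a = np.array([0.7, 0.0, -1.2, 0.3])        # sparse input activations
print(np.allclose(eie_matvec(a), W @ a))   # True
```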
EIE Architecture
Weight decode: the compressed DNN model stores 4-bit virtual weights (codebook indices) that a look-up table expands to 16-bit real weights for the ALU. Address accumulate: 4-bit relative indices are summed into absolute indices to locate each weight. The sparse format thus feeds the ALU with decoded weights and positions to produce the prediction result.

Figure (log scale): comparison across platforms: Core i7-5930K (22nm CPU), Titan X (28nm GPU), Tegra K1 (28nm mGPU), A-Eye (28nm FPGA), DaDianNao (28nm ASIC), TrueNorth (28nm ASIC), EIE with 64 PEs (45nm ASIC), and EIE with 256 PEs (28nm ASIC).
Han et al., EIE: Efficient Inference Engine on Compressed Deep Neural Network, ISCA 2016
Dynamic Fixed-Point + Lookup
Part 3: Efficient Training Algorithm
1. Batch Normalization
2. Model Distillation
3. DSD: Dense-Sparse-Dense Training
Batch Normalization: Normalization via Mini-Batch Statistics

Since the full whitening of each layer's inputs is costly and not everywhere differentiable, we make two necessary simplifications. The first is that instead of whitening the features in layer inputs and outputs jointly, we will normalize each scalar feature independently, by making it have the mean of zero and the variance of 1. For a layer with d-dimensional input $x = (x^{(1)} \ldots x^{(d)})$, we will normalize each dimension
$$\hat{x}^{(k)} = \frac{x^{(k)} - \mathrm{E}[x^{(k)}]}{\sqrt{\mathrm{Var}[x^{(k)}]}}$$
where the expectation and variance are computed over the training data set. As shown in (LeCun et al., 1998b), such normalization speeds up convergence, even when the features are not decorrelated.

Note that simply normalizing each input of a layer may change what the layer can represent. For instance, normalizing the inputs of a sigmoid would constrain them to the linear regime of the nonlinearity. To address this, we make sure that the transformation inserted in the network can represent the identity transform. Let the normalized values be $\hat{x}_{1 \ldots m}$, and their linear transformations be $y_{1 \ldots m}$. We refer to the transform $\mathrm{BN}_{\gamma,\beta}: x_{1 \ldots m} \rightarrow y_{1 \ldots m}$ as the Batch Normalizing Transform. We present the BN Transform in Algorithm 1. In the algorithm, $\epsilon$ is a constant added to the mini-batch variance for numerical stability.

Algorithm 1: Batch Normalizing Transform, applied to activation x over a mini-batch.
Input: values of x over a mini-batch $\mathcal{B} = \{x_{1 \ldots m}\}$; parameters to be learned: $\gamma, \beta$. Output: $\{y_i = \mathrm{BN}_{\gamma,\beta}(x_i)\}$.
$$\mu_{\mathcal{B}} \leftarrow \frac{1}{m}\sum_{i=1}^{m} x_i \quad\text{// mini-batch mean}$$
$$\sigma_{\mathcal{B}}^2 \leftarrow \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_{\mathcal{B}})^2 \quad\text{// mini-batch variance}$$
$$\hat{x}_i \leftarrow \frac{x_i - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}} \quad\text{// normalize}$$
$$y_i \leftarrow \gamma \hat{x}_i + \beta \equiv \mathrm{BN}_{\gamma,\beta}(x_i) \quad\text{// scale and shift}$$

The BN transform can be added to a network to manipulate any activation.

Ioffe et al., Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.
http://torch.ch/blog/2016/02/04/resnets.html
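A minimal sketch of Algorithm 1 applied to one scalar feature over a mini-batch (training mode); gamma and beta would be learned parameters, and inference would use running averages of the mini-batch statistics instead.

```python
# Batch Normalizing Transform for one feature over a mini-batch.
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """x: mini-batch of one scalar feature, shape (m,)."""
    mu = x.mean()                          # mini-batch mean
    var = x.var()                          # mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize
    return gamma * x_hat + beta            # scale and shift

x = np.random.default_rng(0).normal(loc=3.0, scale=2.0, size=32)
y = batch_norm(x, gamma=1.0, beta=0.0)
print(y.mean().round(6), y.std().round(3))   # ~0 and ~1
```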
Batch Normalization
Step 1: Preprocess the data. In practice, you may also see PCA and Whitening of the data.

Figure 2: Single-crop validation accuracy (0.4 to 0.8) of Inception and its batch-normalized variants (BN-Baseline, BN-x5, BN-x30, BN-x5-Sigmoid) vs. the number of training steps (5M to 30M); the BN variants reach the baseline accuracy in far fewer steps.
Ioffe et al., Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.
Tips for Using Batch Normalization
Model Distillation
Softened outputs reveal the dark knowledge
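A small illustration of what "softened outputs" means: raising the softmax temperature T spreads probability mass onto the wrong classes, exposing the teacher's similarity structure (Hinton et al.'s "dark knowledge"). The logits below are made up.

```python
# Temperature-softened softmax used for distillation targets.
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.array([9.0, 4.0, 1.0, -2.0])      # teacher logits for one example
print(softmax(logits, T=1))                   # hard: nearly one-hot
print(softmax(logits, T=5))                   # soft: relative class similarities visible
```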
DSD: Dense-Sparse-Dense Training

Figure 1: Dense-Sparse-Dense Training Flow (Dense → Pruning → Sparse → Re-Dense). The sparse training regularizes the model, and the final dense training restores the pruned weights (red), increasing the model capacity without overfitting.

DSD produces the same model architecture but can find a better optimization solution, arrives at a better local minimum, and achieves higher prediction accuracy across a wide range of deep neural networks (CNNs / RNNs / LSTMs).

Algorithm 1: Workflow of DSD training. Initialization: W(0) with W(0) ~ N(0, Σ). Output: W(t).
Initial Dense Phase

DSD: Intuition
Han et al., DSD: Dense-Sparse-Dense Training for Deep Neural Networks, ICLR 2017
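A minimal sketch of the Dense-Sparse-Dense flow under toy assumptions: a quadratic loss stands in for SGD on a real network, and the 50% pruning rate and learning rates are illustrative. Train dense, prune by magnitude and retrain under the sparsity mask, then drop the mask and retrain dense (typically at a lower learning rate).

```python
# Dense -> Sparse -> Re-Dense training sketch.
import numpy as np

def train(w, grad_fn, lr, steps, mask=None):
    for _ in range(steps):
        w = w - lr * grad_fn(w)
        if mask is not None:
            w = w * mask                  # sparse phase: pruned weights stay at zero
    return w

rng = np.random.default_rng(0)
w, target = rng.normal(size=1000), rng.normal(size=1000)
grad_fn = lambda w: w - target            # toy gradient

w = train(w, grad_fn, lr=0.1, steps=50)                        # Dense
thr = np.quantile(np.abs(w), 0.5)                              # prune 50% by magnitude
mask = (np.abs(w) > thr).astype(float)
w = train(w * mask, grad_fn, lr=0.1, steps=50, mask=mask)      # Sparse
w = train(w, grad_fn, lr=0.01, steps=50)                       # Re-Dense, no mask
```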
DSD: Results
Table 1: Overview of the neural networks, data sets and performance improvements from DSD.

Neural Network | Domain  | Dataset   | Type | Baseline  | DSD   | Abs. Imp. | Rel. Imp.
GoogLeNet      | Vision  | ImageNet  | CNN  | 31.1% (1) | 30.0% | 1.1%      | 3.6%
VGG-16         | Vision  | ImageNet  | CNN  | 31.5% (1) | 27.2% | 4.3%      | 13.7%
ResNet-18      | Vision  | ImageNet  | CNN  | 30.4% (1) | 29.3% | 1.1%      | 3.7%
ResNet-50      | Vision  | ImageNet  | CNN  | 24.0% (1) | 23.2% | 0.9%      | 3.5%
NeuralTalk     | Caption | Flickr-8K | LSTM | 16.8 (2)  | 18.5  | 1.7       | 10.1%
DeepSpeech     | Speech  | WSJ'93    | RNN  | 33.6% (3) | 31.6% | 2.0%      | 5.8%
DeepSpeech-2   | Speech  | WSJ'93    | RNN  | 14.5% (3) | 13.4% | 1.1%      | 7.4%

(1) Top-1 error. VGG/GoogLeNet baselines from the Caffe model zoo, ResNet from Facebook.
(2) BLEU score baseline from the NeuralTalk model zoo; higher is better.
(3) Word error rate. DeepSpeech-2 is trained with a portion of Baidu's internal dataset, with only max decoding, to show the effect of DSD improvement.
The DSD model zoo is online: https://songhan.github.io/DSD
Han et al., DSD: Dense-Sparse-Dense Training for Deep Neural Networks, ICLR 2017
DSD on Caption Generation (NeuralTalk examples)
1. Baseline: a boy in a red shirt is climbing a rock wall. Sparse: a young girl is jumping off a tree. DSD: a young girl in a pink shirt is swinging on a swing.
2. Baseline: a basketball player in a red uniform is playing with a ball. Sparse: a basketball player in a blue uniform is jumping over the goal. DSD: a basketball player in a white uniform is trying to make a shot.
3. Baseline: two dogs are playing together in a field. Sparse: two dogs are playing in a field. DSD: two dogs are playing in the grass.
4. Baseline: a man and a woman are sitting on a bench. Sparse: a man is sitting on a bench with his hands in the air. DSD: a man is sitting on a bench with his arms folded.
5. Baseline: a person in a red jacket is riding a bike through the woods. Sparse: a car drives through a mud puddle. DSD: a car drives through a forest.
6. Baseline: a boy is swimming in a pool. Sparse: a small black dog is jumping into a pool. DSD: a black and white dog is swimming in a pool.
7. Baseline: a group of people are standing in front of a building. Sparse: a group of people are standing in front of a building. DSD: a group of people are walking in a park.
8. Baseline: two girls in bathing suits are playing in the water. Sparse: two children are playing in the sand. DSD: two children are playing in the sand.
9. Baseline: a man in a red shirt and jeans is riding a bicycle down a street. Sparse: a man in a red shirt and a woman in a wheelchair. DSD: a man and a woman are riding on a street.
10. Baseline: a group of people sit on a bench in front of a building. Sparse: a group of people are standing in front of a building. DSD: a group of people are standing in a fountain.
11. Baseline: a man in a black jacket and a black jacket is smiling. Sparse: a man and a woman are standing in front of a mountain. DSD: a man in a black jacket is standing next to a man in a black shirt.
12. Baseline: a group of football players in red uniforms. Sparse: a group of football players in a field. DSD: a group of football players in red and white uniforms.
13. Baseline: a dog runs through the grass. Sparse: a dog runs through the grass. DSD: a white and brown dog is running through the grass.
Baseline model: Andrej Karpathy, NeuralTalk model zoo.
Part 4: Hardware for Efficient Training
CPUs for Training
CPUs Are Targeting Deep Learning
Intel Knights Landing (2016)
7 TFLOPS FP32
245W TDP
29 GFLOPS/W (FP32)
14nm process
Knights Mill: next gen Xeon Phi optimized for deep learning
Intel announced the addition of new vector instructions for deep learning
(AVX512-4VNNIW and AVX512-4FMAPS), October 2016
Slide Source: Sze et al., Survey of DNN Hardware, MICRO'16 Tutorial.
Image Source: Intel. Data Source: Next Platform.
GPUs for Training
GPUs Are Targeting Deep Learning
Nvidia PASCAL GP100 (2016)
3RU, 3200W (the DGX-1 server form factor)
Wrap-Up

Algorithms for Efficient Inference:
1. Pruning
2. Weight Sharing
3. Quantization
4. Low Rank Approximation
5. Binary / Ternary Net
6. Winograd Transformation

Algorithms for Efficient Training:
1. Batch Normalization
2. Model Distillation
3. DSD Training

Hardware for Efficient Inference (accelerators):
1. Saves BW by Better Data Flow
2. Saves BW by Compression
3. Replace Mul with Lookup
4. Separable Kernel + Transpose SRAM
5. Binary / Ternary NN

Hardware for Efficient Training:
1. CPU
2. GPU
3. GPU Cloud
Future: Intelligence on Mobile
Mobile devices have limited resources: battery constrained and cooling constrained (e.g., smart glasses, self-driving cars).

Outlook: the Path for Computation
PC (Computation) → Mobile-First (Mobile Computation) → AI-First (Brain-Inspired, Intelligent Computation)
Thank you!
stanford.edu/~songhan