
Deep Learning

Tutorial and Recent Trends

Song Han
Stanford University
Feb 22, 2017
FPGA17, Monterey
Intro

Song Han: PhD Candidate, Stanford
Bill Dally: Chief Scientist, NVIDIA; Professor, Stanford
Deep Learning is Changing Our Lives

Self-Driving
Machine Translation
Play Go
Artistic Style Transfer
Machine Learning 101: the Setup

Training: a training dataset, running on training hardware, produces the model (weights/parameters).
Inference: the trained model, running on inference hardware, is applied to a test image.


Models are Getting Larger

IMAGE RECOGNITION (16X larger model):
  AlexNet (2012): 8 layers, 1.4 GFLOP, ~16% error
  ResNet (2015): 152 layers, 22.6 GFLOP, ~3.5% error

SPEECH RECOGNITION (10X more training ops):
  Deep Speech 1 (2014): 80 GFLOP, 7,000 hrs of data, ~8% error
  Deep Speech 2 (2015): 465 GFLOP, 12,000 hrs of data, ~5% error

Dally, NIPS 2016 Workshop on Efficient Methods for Deep Neural Networks


The Problem of Deep Learning

Computation Intensive, Memory Intensive,


Difficult to Deploy

AlphaGo: 1920 CPUs and 280 GPUs,


$3000 electric bill per game
Given the power budget, Moore's Law is no longer
providing more computation
Improve the Efficiency of Deep Learning
by Algorithm-Hardware Co-Design
Application as Black Box:
  Application | Hardware

Open the Box for HW Design:
  Application <-> Hardware

Breaks the boundary between application and architecture


Agenda

1. Algorithms for Efficient Inference
2. Algorithms for Efficient Training
3. Hardware for Efficient Inference
4. Hardware for Efficient Training
The Problem: Large DNN Model

App developers suffer from the model size


The Problem: Large DNN Model

Hardware engineers suffer from the model size

larger model => more memory reference => more energy

Relative Energy Cost (45nm CMOS):
Operation               Energy [pJ]   Relative Cost
32 bit int ADD          0.1           1
32 bit float ADD        0.9           9
32 bit Register File    1             10
32 bit int MULT         3.1           31
32 bit float MULT       3.7           37
32 bit SRAM Cache       5             50
32 bit DRAM Memory      640           6400

Figure 1: Energy table for 45nm CMOS process [7]. Memory access is 3 orders of magnitude more energy expensive than simple arithmetic.
The Problem of Large DNN on Mobile

Hardware engineers suffer from the model size

larger model => more memory reference => more energy

Relative Energy Cost (45nm CMOS):
Operation               Energy [pJ]   Relative Cost
32 bit int ADD          0.1           1
32 bit float ADD        0.9           9
32 bit Register File    1             10
32 bit int MULT         3.1           31
32 bit float MULT       3.7           37
32 bit SRAM Cache       5             50
32 bit DRAM Memory      640           6400

Figure 1: Energy table for 45nm CMOS process [7]. Memory access is 3 orders of magnitude more energy expensive than simple arithmetic.

To achieve this goal, we present a method to prune network connections in a manner that preserves the original accuracy. After an initial training phase, we remove all connections whose weight is lower than a threshold. This pruning converts a dense, fully-connected layer to a sparse layer. This first phase learns the topology of the network, learning which connections are important and removing the unimportant connections. We then retrain the sparse network so the remaining connections can compensate for the connections that have been removed. The phases of pruning and retraining may be repeated iteratively to further reduce network complexity. In effect, this training process learns…
Part 1: Algorithms for Efficient Inference

1. Pruning

2. Weight Sharing

3. Quantization

4. Low Rank Approximation

5. Binary / Ternary Net

6. Winograd Transformation
Pruning Neural Networks

Han et al. Learning both Weights and Connections for Efficient Neural Networks, NIPS15
# Synapses in Human Brain

Newborn: 50 Trillion Synapses
1 year old: 1000 Trillion Synapses
Adolescent: 500 Trillion Synapses

Christopher A. Walsh. Peter Huttenlocher (1931-2013). Nature, 502(7470):172, 2013.
AlexNet

CONV layers: 3x pruned; FC layers: 10x pruned

Han et al. Learning both Weights and Connections for Efficient Neural Networks, NIPS 2015
Retrain to Recover Accuracy

[Plot: accuracy loss vs. parameters pruned away (40% to 100%). Curves: L2 regularization w/o retrain, L1 regularization w/o retrain, L1 regularization w/ retrain, L2 regularization w/ retrain, and L2 regularization w/ iterative prune and retrain.]

Han et al. Learning both Weights and Connections for Efficient Neural Networks, NIPS 2015
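The pruning recipe above fits in a few lines. A minimal NumPy sketch (not the paper's Caffe implementation); grad_fn, the learning rate, and the sparsity schedule are hypothetical placeholders for a real training loop:

import numpy as np

def prune_by_magnitude(W, sparsity):
    """Zero out the smallest-magnitude weights; return the pruned weights and the mask."""
    threshold = np.percentile(np.abs(W), sparsity * 100.0)
    mask = (np.abs(W) > threshold).astype(W.dtype)
    return W * mask, mask

def iterative_prune_and_retrain(W, grad_fn, lr=1e-3,
                                sparsity_schedule=(0.5, 0.7, 0.9),
                                retrain_steps=1000):
    """Alternate pruning and retraining; gradients of pruned weights are masked out."""
    mask = np.ones_like(W)
    for sparsity in sparsity_schedule:
        W, mask = prune_by_magnitude(W, sparsity)
        for _ in range(retrain_steps):
            W -= lr * grad_fn(W) * mask      # surviving connections keep learning
    return W, mask

# Hypothetical usage with a toy quadratic-loss gradient standing in for back-propagation:
target = np.random.randn(64, 64)
W, mask = iterative_prune_and_retrain(np.random.randn(64, 64), grad_fn=lambda W: W - target)
print("final density:", mask.mean())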
Pruning RNN and LSTM

Image captioning with recurrent networks; references include Mao et al.; Karpathy and Fei-Fei, "Deep Visual-Semantic Alignments for Generating Image Descriptions"; Vinyals et al.; Donahue et al.; Chen and Zitnick.

Slide credit: Fei-Fei Li, Andrej Karpathy & Justin Johnson, CS231n Lecture 10, 8 Feb 2016.
Pruning NeuralTalk and LSTM
Original: a basketball player in a white uniform is
playing with a ball
Pruned 90%: a basketball player in a white uniform is
playing with a basketball

Original : a brown dog is running through a grassy field


Pruned 90%: a brown dog is running through a grassy
area

Original : a man is riding a surfboard on a wave


Pruned 90%: a man in a wetsuit is riding a wave on a
beach

Original : a soccer player in red is running in the field


Pruned 95%: a man in a red shirt and black and white
black shirt is running through a field
Speedup for Pruned FC layer

[Bar chart: speedup of the pruned (sparse) FC layer over the dense baseline on CPU, GPU, and mGPU; geometric mean shown.]

Baseline:
  Intel Core i7 5930K (CPU): MKL CBLAS GEMV, MKL SPBLAS CSRMV
  NVIDIA GeForce GTX Titan X (GPU): cuBLAS GEMV, cuSPARSE CSRMV
  NVIDIA Tegra K1 (mGPU): cuBLAS GEMV, cuSPARSE CSRMV
Energy Efficiency for Pruned FC layer

[Bar chart: energy efficiency of the pruned (sparse) FC layer over the dense baseline on CPU, GPU, and mGPU; geometric mean shown.]

Baseline:
  Intel Core i7 5930K (CPU): MKL CBLAS GEMV, MKL SPBLAS CSRMV
  NVIDIA GeForce GTX Titan X (GPU): cuBLAS GEMV, cuSPARSE CSRMV
  NVIDIA Tegra K1 (mGPU): cuBLAS GEMV, cuSPARSE CSRMV
Part 1: Algorithms for Efficient Inference

1. Pruning

2. Weight Sharing

3. Quantization

4. Low Rank Approximation

5. Binary / Ternary Net

6. Winograd Transformation
Weight Sharing
[Figure 3 (Deep Compression): weight sharing by scalar quantization (top) and centroid fine-tuning (bottom). A 4x4 block of 32-bit float weights is clustered into 4 centroids; each weight is stored as a 2-bit cluster index; during fine-tuning, the gradients are grouped by cluster index, reduced, and used (scaled by the learning rate) to update the shared centroids.]

    weights (32 bit float)         cluster index (2 bit uint)    centroids    fine-tuned centroids
     2.09  -0.98   1.48   0.09      3  0  2  1                   3:  2.00      1.96
     0.05  -0.14  -1.08   2.12      1  1  0  3                   2:  1.50      1.48
    -0.91   1.92   0     -1.03      0  3  1  0                   1:  0.00     -0.04
     1.87   0      1.53   1.49      3  1  2  2                   0: -1.00     -0.97

Han et al. Deep Compression, ICLR 2016 (Best Paper Award)
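A minimal sketch of the weight-sharing step, using a simple 1-D k-means written from scratch (the paper's clustering initialization and fine-tuning details differ): the layer's weights are clustered into 2^b centroids, each weight is stored as a b-bit index, and fine-tuning updates each shared centroid with the sum of the gradients of the weights assigned to it.

import numpy as np

def kmeans_1d(values, k, iters=20):
    """Simple 1-D k-means: returns centroids and the cluster index of each value."""
    centroids = np.linspace(values.min(), values.max(), k)   # linear initialization
    for _ in range(iters):
        idx = np.argmin(np.abs(values[:, None] - centroids[None, :]), axis=1)
        for c in range(k):
            if np.any(idx == c):
                centroids[c] = values[idx == c].mean()
    return centroids, idx

def share_weights(weights, bits=2):
    """Replace each weight by a shared centroid; store only indices plus a small codebook."""
    flat = weights.ravel()
    centroids, idx = kmeans_1d(flat, 2 ** bits)
    return centroids, idx.reshape(weights.shape)

def finetune_centroids(centroids, idx, weight_grad, lr=0.01):
    """Group the dense gradient by cluster index, reduce it, and update the shared centroids."""
    for c in range(len(centroids)):
        members = (idx == c)
        if np.any(members):
            centroids[c] -= lr * weight_grad[members].sum()
    return centroids

W = np.array([[2.09, -0.98, 1.48, 0.09],
              [0.05, -0.14, -1.08, 2.12],
              [-0.91, 1.92, 0.0, -1.03],
              [1.87, 0.0, 1.53, 1.49]])      # the 4x4 example from the figure
codebook, indices = share_weights(W, bits=2)
W_quantized = codebook[indices]              # decoded weights used at inference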
Trained Quantization Changes Weight Distribution

[Figure: weight distribution of a layer before and after trained quantization; the continuous distribution collapses onto the discrete shared centroid values.]

Pipeline: Pruning -> Trained Quantization -> Huffman Coding

Han et al. Deep Compression, ICLR 2016 (Best Paper Award)
Bits Per Weight

Han et al. Deep Compression, ICLR 2016 (Best Paper Award)


Pruning + Trained Quantization

AlexNet on ImageNet
Han et al. Deep Compression, ICLR 2016 (Best Paper Award)
Results of Deep Compression

Network      Original Size   Compressed Size   Compression Ratio   Original Accuracy   Compressed Accuracy
LeNet-300    1070 KB         27 KB             40x                 98.36%              98.42%
LeNet-5      1720 KB         44 KB             39x                 99.20%              99.26%
AlexNet      240 MB          6.9 MB            35x                 80.27%              80.30%
VGGNet       550 MB          11.3 MB           49x                 88.68%              89.09%
GoogleNet    28 MB           2.8 MB            10x                 88.90%              88.92%
SqueezeNet   4.8 MB          0.47 MB           10x                 80.32%              80.35%

Han et al. Deep Compression, ICLR 2016 (Best Paper Award)


Part 1: Algorithms for Efficient Inference

1. Pruning

2. Weight Sharing

3. Quantization

4. Low Rank Approximation

5. Binary / Ternary Net

6. Winograd Transformation
Quantizing the Weight and Activation

Train with float

Quantizing the weight and activation:
  1. Gather the statistics for weights and activations
  2. Choose the proper radix point position
  3. Fine-tune in float format
  4. Convert to fixed-point format

Qiu et al. Going Deeper with Embedded FPGA Platform for Convolutional Neural Network, FPGA16
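A rough sketch of the dynamic fixed-point idea above (not Qiu et al.'s actual tool flow); the 8-bit width and the statistics-based choice of radix point are illustrative:

import numpy as np

def choose_fraction_bits(values, total_bits=8):
    """Pick the radix-point position so the largest observed magnitude still fits."""
    max_val = np.abs(values).max()
    int_bits = max(int(np.ceil(np.log2(max_val + 1e-12))) + 1, 1)  # +1 bit for sign
    return total_bits - int_bits            # remaining bits are fractional

def to_fixed_point(values, total_bits=8):
    """Quantize to signed fixed point with a data-dependent radix point."""
    frac_bits = choose_fraction_bits(values, total_bits)
    scale = 2.0 ** frac_bits
    qmin, qmax = -2 ** (total_bits - 1), 2 ** (total_bits - 1) - 1
    q = np.clip(np.round(values * scale), qmin, qmax)
    return q / scale, frac_bits             # dequantized values and radix position

w = np.random.randn(256, 256).astype(np.float32)
w_q, frac = to_fixed_point(w, total_bits=8)
print("fraction bits:", frac, "max error:", np.abs(w - w_q).max())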
Quantization Result

Qiu et al. Going Deeper with Embedded FPGA Platform for Convolutional Neural Network, FPGA16
Part 1: Algorithms for Efficient Inference

1. Pruning

2. Weight Sharing

3. Quantization

4. Low Rank Approximation

5. Binary / Ternary Net

6. Winograd Transformation
Low Rank Approximation for Conv

Zhang et al Efficient and Accurate Approximations of Nonlinear Convolutional Networks CVPR15


Low Rank Approximation for FC

Novikov et al Tensorizing Neural Networks, NIPS15
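A minimal sketch of low-rank approximation for a fully-connected layer using a truncated SVD (Novikov et al. actually use the more aggressive Tensor-Train decomposition; plain SVD is shown only to illustrate the idea): one m x n weight matrix is replaced by two thin rank-r factors.

import numpy as np

def low_rank_factorize(W, rank):
    """Approximate W (m x n) as A @ B with A: m x r and B: r x n, via truncated SVD."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]               # fold singular values into the left factor
    B = Vt[:rank, :]
    return A, B

W = np.random.randn(1024, 1024)
A, B = low_rank_factorize(W, rank=128)
# Parameters: 1024*1024 = 1.05M  ->  2 * 1024 * 128 = 0.26M (4x fewer)
x = np.random.randn(1024)
y_full = W @ x
y_lowrank = A @ (B @ x)                       # two small matmuls instead of one big one
print("relative error:", np.linalg.norm(y_full - y_lowrank) / np.linalg.norm(y_full))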


Part 1: Algorithms for Efficient Inference

1. Pruning

2. Weight Sharing

3. Quantization

4. Low Rank Approximation

5. Binary / Ternary Net

6. Winograd Transformation
Binary / Ternary Net: Motivation

[Figure: histograms of weight counts vs. weight value. The normalized full-precision weight distribution is thresholded at -t and +t and quantized to an intermediate ternary weight taking values {-1, 0, +1}.]
Binary-Weight-Network and XNOR-Network

Binarize both weights and inputs

Convolution as a binary dot product:
the dot product is implemented by XNOR and bit-counting operations

Rastegari et al. XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks ECCV 2016
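A minimal sketch of the binary dot product behind XNOR-Net, assuming weights and activations are already binarized to +/-1 (the real network also applies per-filter scaling factors, omitted here): pack the sign bits into bytes, then XNOR plus popcount replaces the multiply-accumulate.

import numpy as np

def pack_bits(signs):
    """Pack a +/-1 vector into a byte array (1 bit per element)."""
    bits = (signs > 0).astype(np.uint8)                     # +1 -> 1, -1 -> 0
    return np.packbits(bits)

def binary_dot(x_signs, w_signs):
    """Dot product of two +/-1 vectors via XNOR + popcount."""
    n = len(x_signs)
    xb, wb = pack_bits(x_signs), pack_bits(w_signs)
    xnor = np.bitwise_not(np.bitwise_xor(xb, wb))           # bit is 1 where signs agree
    matches = int(np.unpackbits(xnor)[:n].sum())            # popcount over the valid bits
    return 2 * matches - n                                  # agreements minus disagreements

x = np.where(np.random.randn(256) >= 0, 1.0, -1.0)
w = np.where(np.random.randn(256) >= 0, 1.0, -1.0)
assert binary_dot(x, w) == int(np.dot(x, w))                # same result, bit ops only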
Binary-Weight-Network and XNOR-Network

Rastegari et al. XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks ECCV 2016
Trained Ternary Quantization

[Figure: Trained Ternary Quantization pipeline. The full-precision weight is normalized, quantized to an intermediate ternary weight {-1, 0, +1} using thresholds -t and +t, and scaled by trained per-layer coefficients Wn (negative side) and Wp (positive side) to give the final ternary weight {-Wn, 0, +Wp}. Feed-forward uses the ternary weights; back-propagation sends gradient1 to the latent full-precision weights and gradient2 to the scaling coefficients; only the ternary weights are used at inference time.]

Zhu, Han, Mao, Dally. Trained Ternary Quantization, ICLR17
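A minimal sketch of the TTQ forward quantization step, assuming the threshold heuristic t x max|w| with t = 0.05 and learned scales Wp and Wn (the gradient paths to the latent full-precision weights and to the scales are omitted):

import numpy as np

def ternarize(w, Wp=1.0, Wn=1.0, t=0.05):
    """Quantize full-precision weights to {-Wn, 0, +Wp} with threshold t * max|w|."""
    delta = t * np.abs(w).max()
    q = np.zeros_like(w)
    q[w > delta] = Wp
    q[w < -delta] = -Wn
    return q

w = np.random.randn(64, 64) * 0.1
w_t = ternarize(w, Wp=0.8, Wn=1.2)        # asymmetric learned scales, as in TTQ
print("fraction of zero weights:", (w_t == 0).mean())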

Threshold: Δ_l = t × max(|w_l|)    (9)

…and 2) maintain a constant sparsity r for all layers throughout training. By adjusting the hyper-parameter r we are able to obtain ternary weight networks with various sparsities. We use the first method and set t to 0.05 in experiments on the CIFAR-10 and ImageNet datasets, and use the second one to explore a wider range of sparsities in Section 5.1.1.

Weight Evolution during Training

[Plot: ternary weight values Wp and Wn (top) and the percentage of negative / zero / positive weights (bottom) over 150 training epochs, for layers res1.0/conv1, res3.2/conv2, and the final linear layer.]

Figure 2: Ternary weights value (above) and distribution (below) with iterations for different layers of ResNet-20 on CIFAR-10.
Visualization of the TTQ Kernels

Zhu, Han, Mao, Dally. Trained Ternary Quantization, ICLR17



Error Rate on ImageNet

Zhu, Han, Mao, Dally. Trained Ternary Quantization, ICLR17



Related Works

Lin, Zhouhan, et al. "Neural networks with few multiplications." arXiv preprint
arXiv:1510.03009 (2015).

Introduce Binary and Ternary Connection.

Use probabilistic quantization method.

Introduce concept of latent weights.

Zhou, Shuchang, et al. "DoReFa-Net: Training Low Bitwidth Convolutional


Neural Networks with Low Bitwidth Gradients." arXiv preprint
arXiv:1606.06160 (2016).

Adopt Thresholding quantization method.


Related Works

Li, Fengfu, and Bin Liu. "Ternary Weight Networks." arXiv preprint
arXiv:1605.04711 (2016).

Treat quantization as an optimization problem:

Back-propagate using the identity mapping.

Rastegari, Mohammad, et al. "XNOR-Net: ImageNet Classification Using Binary


Convolutional Neural Networks." arXiv preprint arXiv:1603.05279(2016).

Same strategy on binary weights.


Part 1: Algorithms for Efficient Inference

1. Pruning

2. Weight Sharing

3. Quantization

4. Low Rank Approximation

5. Binary / Ternary Net

6. Winograd Transformation
Winograd Convolution

Substituting U = G g G^T and V = B^T d B, we have:

    Y = A^T [ U ⊙ V ] A    (9)

Labeling tile coordinates as (x~, y~), we rewrite the convnet layer formula (2) for a single image i, filter k, and tile coordinate (x~, y~) as:

    Y_{i,k,x~,y~} = Σ_{c=1..C} D_{i,c,x~,y~} ⊛ G_{k,c}
                  = Σ_{c=1..C} A^T [ U_{k,c} ⊙ V_{c,i,x~,y~} ] A
                  = A^T [ Σ_{c=1..C} U_{k,c} ⊙ V_{c,i,x~,y~} ] A    (10)

Thus we can reduce over C channels in transform space, and only then apply the inverse transform A to the sum. This amortizes the cost of the inverse transform over the number of channels.

Winograd. Arithmetic Complexity of Computations, volume 33. SIAM, 1980
Lavin & Gray, Fast Algorithms for Convolutional Neural Networks, 2015
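A small numeric check of the 1-D Winograd F(2,3) transform; the 2-D F(2x2,3x3) case used by Lavin & Gray nests the same matrices. Two outputs of a 3-tap filter cost 4 multiplies in the transform domain instead of 6:

import numpy as np

# Standard F(2,3) transform matrices
BT = np.array([[1, 0, -1, 0],
               [0, 1,  1, 0],
               [0, -1, 1, 0],
               [0, 1,  0, -1]], dtype=float)
G = np.array([[1,    0,   0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0,    0,   1]], dtype=float)
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=float)

def winograd_f23(d, g):
    """Two outputs of the correlation of a 4-sample input tile d with a 3-tap filter g."""
    U = G @ g                 # transform the filter (can be precomputed once)
    V = BT @ d                # transform the input tile
    return AT @ (U * V)       # 4 element-wise multiplies, then the inverse transform

d = np.random.randn(4)
g = np.random.randn(3)
direct = np.array([d[0]*g[0] + d[1]*g[1] + d[2]*g[2],
                   d[1]*g[0] + d[2]*g[1] + d[3]*g[2]])
assert np.allclose(winograd_f23(d, g), direct)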
Training in the Winograd Domain

Producing 4 output pixels:

Direct convolution:
  - 4*9 = 36 multiplications (1x)

Winograd convolution:
  - 4*4 = 16 multiplications (2.25x less)

[Figure 1 (Liu et al.): combining Winograd convolution with sparse weights and activations. (a) The original Winograd-based convolution proposed by Lavin (2015) fills in the non-zeros in both the weights and activations. (b) Pruning the 4x4 transformed kernel restores sparsity to the weights. (c) Moving ReLU into the Winograd domain restores sparsity to the activations.]

Liu et al. Efficient Sparse-Winograd Convolutional Neural Networks, submitted to ICLR 2017 workshop
Training in the Winograd Domain

Producing 4 output pixels:

Direct convolution:
  - 4*9 = 36 multiplications (1x)
  - sparse weight [NIPS'15] (3x)
  - sparse activation (ReLU) (3x)
  - Overall saving: 9x

Winograd convolution:
  - 4*4 = 16 multiplications (2.25x less)
  - dense weight (1x)
  - dense activation (1x)
  - Overall saving: 2.25x

Liu et al. Efficient Sparse-Winograd Convolutional Neural Networks, submitted to ICLR 2017 workshop
Solution: Fold ReLU into Winograd

Producing 4 output pixels:

Direct convolution:
  - 4*9 = 36 multiplications (1x)
  - sparse weight [NIPS'15] (3x)
  - sparse activation (ReLU) (3x)
  - Overall saving: 9x

Winograd convolution (pruning and ReLU moved into the transform domain):
  - 4*4 = 16 multiplications (2.25x less)
  - sparse weight (2.5x)
  - sparse activation (2.25x)
  - Overall saving: 12x

Liu et al. Efficient Sparse-Winograd Convolutional Neural Networks, submitted to ICLR 2017 workshop
Result

Figure 2: Test accuracy vs. density for the three architectures of Figure 1 on VGG-nagadomi. [Plot: test accuracy (92.2% to 93.4%) vs. weight density (20% to 70%) for spatial pruning, Winograd pruning, and Winograd + ReLU + pruning.]

…re-training until accuracy converges. We varied the pruning rate R from 20% to 70%. The first convolution layer is not included in pruning but is included in re-training.

Figure 2 shows accuracy as a function of density for the three architectures of Figure 1. The network of Figure 1c (which moves pruning and ReLU to the transform domain) can be pruned to 40% density without significant (> 0.1%) loss of accuracy. The conventional network of Figure 1a can only be pruned to 60% density before accuracy falls.

Figure 3: Activation density of convolution layers of VGG-nagadomi (spatial activations vs. transformed activations, conv1 to conv7 and overall). Whiskers show one standard deviation above and below the mean.

Liu et al. Efficient Sparse-Winograd Convolutional Neural Networks, submitted to ICLR 2017 workshop
Agenda

1. Algorithms for Efficient Inference
2. Algorithms for Efficient Training
3. Hardware for Efficient Inference
4. Hardware for Efficient Training
Diannao (Electric Brain)

[Figure (DianNao, 65nm layout and block diagram): a three-stage Neural Functional Unit (NFU-1/NFU-2/NFU-3), input buffer NBin, output buffer NBout, synapse buffer SB, DMAs, and a Control Processor (CP). Table 6 of the paper: total area 3,023,077 um^2, power 485 mW, critical path 1.02 ns; area and power are dominated by the buffers (NBin/NBout/SB), with the NFU a close second.]

- Diannao improved CNN computation efficiency by using dedicated functional units and memory buffers optimized for the CNN workload
- Multiplier + adder tree + shifter + non-linear lookup, orchestrated by instructions
- Weights in off-chip DRAM
- 452 GOP/s, 3.02 mm^2 and 485 mW

Chen et al. Diannao: A small-footprint high-throughput accelerator for ubiquitous machine-learning, ASPLOS 2014
Diannao and Friends

[Figures (DaDianNao / ShiDianNao): DaDianNao node floorplan and tile-based organization (16 tiles per node, each tile with an NFU and four eDRAM banks, two central eDRAM banks, a fat-tree interconnect, and four HyperTransport 2.0 links); ShiDianNao 65nm layout with its area/power breakdown.]

- DaDiannao (Bigger Computer) uses multi-chip and eDRAM to fit larger models. Each chip is 68 mm^2, fits 12 million parameters, and consumes 16 W
- ShiDiannao (Vision Computer) fits small models (up to 64K parameters) on-chip and maps the computation onto a 2D PE array. The chip is 4.86 mm^2 and consumes 320 mW

Chen et al. Diannao: A small-footprint high-throughput accelerator for ubiquitous machine-learning, ASPLOS 2014
Eyeriss: Reduce Memory Access by
Row-Stationary Dataflow

Chen et al Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks ISCA 2016
Eyeriss: Reduce Memory Access by
Row-Stationary Dataflow
Dataflow Comparison: CONV Layers

[Bar chart: normalized energy per MAC, broken down into ALU, RF, NoC, buffer, and DRAM, for the WS, OSA, OSB, OSC, NLR, and RS dataflows.]

RS uses 1.4x to 2.5x lower energy than other dataflows

[Chen et al., ISCA 2016]
Chen et al Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks ISCA 2016
EIE: Reduce Memory Access by Compression

Logically: b = ReLU(W × a), where the activation vector a = (0, a1, 0, a3) is sparse and the weight matrix W is sparse (for example, row 0 holds w0,0, w0,1, 0, w0,3 and row 1 holds 0, 0, w1,2, 0); the rows of W are interleaved across processing elements PE0 to PE3.

Physically: each PE stores only its non-zero weights in compressed form. For PE0:
    Virtual Weight:  W0,0  W0,1  W4,2  W0,3  W4,3
    Relative Index:  0     1     2     0     0
    Column Pointer:  0     1     2     3

Han et al. EIE: Efficient Inference Engine on Compressed Deep Neural Network, ISCA 2016, Hotchips 2016
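A minimal sketch of the computation EIE performs, assuming a plain CSC layout (column pointers, row indices, values) instead of EIE's 4-bit relative indices and shared codebook: zero activations are skipped entirely, and only the stored non-zero weights of each remaining column are touched.

import numpy as np

def sparse_matvec_relu(num_rows, col_ptr, row_idx, values, a):
    """b = ReLU(W @ a), scanning only non-zero activations and non-zero weights."""
    b = np.zeros(num_rows)
    for j, aj in enumerate(a):
        if aj == 0.0:                        # skip zero activations entirely
            continue
        for k in range(col_ptr[j], col_ptr[j + 1]):
            b[row_idx[k]] += values[k] * aj  # accumulate into the output
    return np.maximum(b, 0.0)                # ReLU

# Dense reference with the same sparsity pattern as the 8x4 matrix in the figure (made-up values)
W = np.array([[0.5, 0.2, 0.0, 0.1],
              [0.0, 0.0, 0.3, 0.0],
              [0.0, 0.4, 0.0, -0.2],
              [0.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 0.6, 0.7],
              [-0.1, 0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 0.9],
              [0.0, 0.8, 0.0, 0.0]])
a = np.array([0.0, 1.0, 0.0, 2.0])           # sparse input activations

# Build the compressed arrays once, offline
col_ptr, row_idx, values = [0], [], []
for j in range(W.shape[1]):
    nz = np.nonzero(W[:, j])[0]
    row_idx.extend(nz); values.extend(W[nz, j])
    col_ptr.append(len(row_idx))
b = sparse_matvec_relu(W.shape[0], np.array(col_ptr), np.array(row_idx), np.array(values), a)
assert np.allclose(b, np.maximum(W @ a, 0.0))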
EIE Architecture

[Figure 1 (EIE): efficient inference engine that works on the compressed deep neural network model. The compressed model is stored as encoded weights and relative indices in sparse format; a 4-bit virtual weight is decoded into a 16-bit real weight through a weight look-up table, a 4-bit relative index is accumulated into an absolute index/address, and the ALU accumulates into memory to produce the prediction result for an input image, word, or speech sample.]

Han et al. EIE: Efficient Inference Engine on Compressed Deep Neural Network, ISCA 2016


Comparison: Energy Efficiency

[Bar chart: energy efficiency in layers/J, log scale from 1 to 1,000,000, for Core-i7 5930k (22nm CPU), TitanX (28nm GPU), Tegra K1 (28nm mGPU), A-Eye (28nm FPGA), Da-DianNao (28nm ASIC), True-North (28nm ASIC), EIE 64PEs (45nm ASIC), and EIE 256PEs (28nm ASIC).]

Han et al. EIE: Efficient Inference Engine on Compressed Deep Neural Network, ISCA 2016
Dynamic Fixed-Point + Lookup

Sparse Activation + Sign-Magnitude Number Format

Separable Kernel + Transpose-Read SRAM


Binary / Ternary NN Accelerator

3:30 - 3:55: Accelerating Binarized Convolutional Neural Networks with Software-Programmable FPGAs
  Ritchie Zhao1, Weinan Song1, Wentao Zhang1, Tianwei Xing2, Jeng-Hau Lin3, Mani Srivastava2, Rajesh Gupta3, Zhiru Zhang1
  1Cornell University, 2UCLA, 3UCSD

9:25 - 9:50: FINN: A Framework for Fast, Scalable Binarized Neural Network Inference
  Yaman Umuroglu1,2, Nicholas J. Fraser1,3, Giulio Gambardella1, Michaela Blott1, Philip Leong3, Magnus Jahre2, Kees Vissers1
  1Xilinx Research Labs, 2Norwegian University of Science and Technology, 3University of Sydney
Agenda

1. Algorithms for Efficient Inference
2. Algorithms for Efficient Training
3. Hardware for Efficient Inference
4. Hardware for Efficient Training
Part 3: Efficient Training Algorithm

1. Batch Normalization
2. Model Distillation
3. DSD: Dense-Sparse-Dense Training
Batch Normalization

Whitening each layer's inputs is costly, so each scalar feature is normalized independently to zero mean and unit variance; a learned scale γ and shift β let the transform represent the identity. For a layer with d-dimensional input x = (x(1) ... x(d)):

    x̂(k) = (x(k) − E[x(k)]) / sqrt(Var[x(k)])

Algorithm 1: Batch Normalizing Transform, applied to activation x over a mini-batch.
    Input:  values of x over a mini-batch B = {x1...m}; parameters to be learned: γ, β
    Output: {yi = BN_γ,β(xi)}

    μ_B  ← (1/m) Σ_{i=1..m} xi                    // mini-batch mean
    σ²_B ← (1/m) Σ_{i=1..m} (xi − μ_B)²           // mini-batch variance
    x̂i   ← (xi − μ_B) / sqrt(σ²_B + ε)            // normalize
    yi   ← γ x̂i + β ≡ BN_γ,β(xi)                  // scale and shift

ε is a constant added to the mini-batch variance for numerical stability. The BN transform can be added to a network to manipulate any activation.

Ioffe et al. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
http://torch.ch/blog/2016/02/04/resnets.html
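A minimal NumPy sketch of Algorithm 1 (training-mode forward pass only; the backward pass and the running statistics used at inference are omitted):

import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Batch Normalizing Transform over a mini-batch x of shape (m, d)."""
    mu = x.mean(axis=0)                       # mini-batch mean, per feature
    var = x.var(axis=0)                       # mini-batch variance, per feature
    x_hat = (x - mu) / np.sqrt(var + eps)     # normalize
    return gamma * x_hat + beta               # scale and shift: y = BN_{gamma,beta}(x)

x = np.random.randn(32, 100) * 3.0 + 5.0      # mini-batch of 32 samples, 100 features
y = batch_norm_forward(x, gamma=np.ones(100), beta=np.zeros(100))
print(y.mean(), y.std())                      # approximately 0 and 1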
Step 1: Preprocess the data

In practice, you may also see PCA and Whitening of the data:
  PCA: the data has a diagonal covariance matrix
  Whitening: the covariance matrix is the identity matrix

Fei-Fei Li & Andrej Karpathy & Justin Johnson, Stanford CS231n course slides (Lecture 5, 20 Jan 2016)
Batch Normalization Helps Convergence

[Figure 2 (Ioffe & Szegedy): single-crop validation accuracy (0.4 to 0.8) of Inception and its batch-normalized variants (BN-Baseline, BN-x5, BN-x30, BN-x5-Sigmoid) vs. the number of training steps (5M to 30M), and the steps needed to match Inception's accuracy.]

Ioffe et al. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
Tips Using Batch Normalization

Increase learning rate.


Remove Dropout
Reduce the regularization.
Accelerate the learning rate decay.
Remove Local Response Normalization
Shuffle training examples
Reduce the photometric distortions.
Ioffe et al. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
Part 3: Efficient Training Algorithm

1. Batch Normalization
2. Model Distillation
3. DSD: Dense-Sparse-Dense Training
Softened outputs reveal the dark knowledge

Hinton et al. Dark knowledge / Distilling the Knowledge in a Neural Network


Softened outputs reveal the dark knowledge

Method: Divide the scores by a temperature to get a much softer distribution

Result: Start with a trained model that classifies 58.9% of the test frames correctly. The new model converges to 57.0% correct even when it is only trained on 3% of the data

Hinton et al. Dark knowledge / Distilling the Knowledge in a Neural Network
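A minimal sketch of the softening step, assuming the usual temperature-scaled softmax and a soft-target cross-entropy (the mixing with the hard-label loss is omitted); the logits below are made up for illustration:

import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """Softmax of logits / T; larger T gives a softer distribution."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)     # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """Cross-entropy between the softened teacher and student outputs."""
    p_teacher = softmax_with_temperature(teacher_logits, T)
    p_student = softmax_with_temperature(student_logits, T)
    return -(p_teacher * np.log(p_student + 1e-12)).sum(axis=-1).mean()

teacher = np.array([[10.0, 5.0, 1.0]])        # hypothetical teacher logits
student = np.array([[8.0, 6.0, 2.0]])         # hypothetical student logits
print(softmax_with_temperature(teacher, T=1.0))   # nearly one-hot
print(softmax_with_temperature(teacher, T=4.0))   # softened: the "dark knowledge" shows
print(distillation_loss(student, teacher, T=4.0))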


Part 3: Efficient Training Algorithm

1. Batch Normalization
2. Model Distillation
3. DSD: Dense-Sparse-Dense Training
DSD: Dense Sparse Dense Training

[Figure 1: Dense-Sparse-Dense training flow: Dense -> (Pruning, Sparsity Constraint) -> Sparse -> (Re-Dense, Increase Model Capacity) -> Dense. The sparse training regularizes the model, and the final dense training restores the pruned weights (red), increasing the model capacity without overfitting. Algorithm 1 of the paper gives the workflow, starting from a randomly initialized dense phase.]

DSD produces the same model architecture but can find a better optimization solution, arrives at a better local minimum, and achieves higher prediction accuracy across a wide range of deep neural networks (CNNs / RNNs / LSTMs).

Han et al. DSD: Dense-Sparse-Dense Training for Deep Neural Networks, ICLR 2017

DSD: Intuition

Learn the trunk first, then learn the leaves

Han et al. DSD: Dense-Sparse-Dense Training for Deep Neural Networks, ICLR 2017
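A minimal sketch of the three DSD phases on a single weight matrix; grad_fn is a hypothetical stand-in for back-propagation on real data, and the learning-rate and sparsity choices are illustrative:

import numpy as np

def dsd_train(W, grad_fn, steps=1000, lr=1e-3, sparsity=0.5):
    """Dense -> Sparse -> Dense training on one weight matrix (sketch)."""
    # Dense phase: ordinary training
    for _ in range(steps):
        W -= lr * grad_fn(W)
    # Sparse phase: prune the smallest weights and train under the sparsity constraint
    thresh = np.percentile(np.abs(W), sparsity * 100.0)
    mask = (np.abs(W) > thresh).astype(W.dtype)
    W *= mask
    for _ in range(steps):
        W -= lr * grad_fn(W) * mask
    # Re-dense phase: free the pruned weights (they restart from zero), lower learning rate
    for _ in range(steps):
        W -= (lr / 10.0) * grad_fn(W)
    return W

# Hypothetical usage with a toy quadratic-loss gradient:
target = np.random.randn(100, 100)
W_final = dsd_train(np.random.randn(100, 100), grad_fn=lambda W: W - target)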
DSD: Results

Table 1: Overview of the neural networks, data sets and performance improvements from DSD.

Neural Network   Domain    Dataset     Type   Baseline    DSD     Abs. Imp.   Rel. Imp.
GoogLeNet        Vision    ImageNet    CNN    31.1% (1)   30.0%   1.1%        3.6%
VGG-16           Vision    ImageNet    CNN    31.5% (1)   27.2%   4.3%        13.7%
ResNet-18        Vision    ImageNet    CNN    30.4% (1)   29.3%   1.1%        3.7%
ResNet-50        Vision    ImageNet    CNN    24.0% (1)   23.2%   0.9%        3.5%
NeuralTalk       Caption   Flickr-8K   LSTM   16.8 (2)    18.5    1.7         10.1%
DeepSpeech       Speech    WSJ'93      RNN    33.6% (3)   31.6%   2.0%        5.8%
DeepSpeech-2     Speech    WSJ'93      RNN    14.5% (3)   13.4%   1.1%        7.4%

(1) Top-1 error. VGG/GoogLeNet baselines from the Caffe model zoo, ResNet from Facebook.
(2) BLEU score baseline from the NeuralTalk model zoo, higher the better.
(3) Word error rate: DeepSpeech-2 is trained with a portion of a Baidu internal dataset with only max decoding to show the effect of the DNN improvement.

DSD Model Zoo is online: https://songhan.github.io/DSD

Han et al. DSD: Dense-Sparse-Dense Training for Deep Neural Networks, ICLR 2017
DSD on Caption Generation

Example 1. Baseline: a boy in a red shirt is climbing a rock wall. Sparse: a young girl is jumping off a tree. DSD: a young girl in a pink shirt is swinging on a swing.
Example 2. Baseline: a basketball player in a red uniform is playing with a ball. Sparse: a basketball player in a blue uniform is jumping over the goal. DSD: a basketball player in a white uniform is trying to make a shot.
Example 3. Baseline: two dogs are playing together in a field. Sparse: two dogs are playing in a field. DSD: two dogs are playing in the grass.
Example 4. Baseline: a man and a woman are sitting on a bench. Sparse: a man is sitting on a bench with his hands in the air. DSD: a man is sitting on a bench with his arms folded.
Example 5. Baseline: a person in a red jacket is riding a bike through the woods. Sparse: a car drives through a mud puddle. DSD: a car drives through a forest.

Figure 3: Visualization of DSD training improves the performance of image captioning.

Baseline model: Andrej Karpathy, Neural Talk model zoo.

…the forest from the background. The good performance of DSD training generalizes beyond these examples; more image caption results generated by DSD training are provided in the supplementary material (A. Supplementary Material: More Examples of DSD Training Improves the Performance of the NeuralTalk Auto-Caption System, images from the Flickr-8K test set).

Han et al. DSD: Dense-Sparse-Dense Training for Deep Neural Networks, ICLR 2017
DSD on Caption Generation

Example 1. Baseline: a boy is swimming in a pool. Sparse: a small black dog is jumping into a pool. DSD: a black and white dog is swimming in a pool.
Example 2. Baseline: a group of people are standing in front of a building. Sparse: a group of people are standing in front of a building. DSD: a group of people are walking in a park.
Example 3. Baseline: two girls in bathing suits are playing in the water. Sparse: two children are playing in the sand. DSD: two children are playing in the sand.
Example 4. Baseline: a man in a red shirt and jeans is riding a bicycle down a street. Sparse: a man in a red shirt and a woman in a wheelchair. DSD: a man and a woman are riding on a street.
Example 5. Baseline: a group of people sit on a bench in front of a building. Sparse: a group of people are standing in front of a building. DSD: a group of people are standing in a fountain.
Example 6. Baseline: a man in a black jacket and a black jacket is smiling. Sparse: a man and a woman are standing in front of a mountain. DSD: a man in a black jacket is standing next to a man in a black shirt.
Example 7. Baseline: a group of football players in red uniforms. Sparse: a group of football players in a field. DSD: a group of football players in red and white uniforms.
Example 8. Baseline: a dog runs through the grass. Sparse: a dog runs through the grass. DSD: a white and brown dog is running through the grass.

Baseline model: Andrej Karpathy, Neural Talk model zoo.
Agenda

1. Algorithms for Efficient Inference
2. Algorithms for Efficient Training
3. Hardware for Efficient Inference
4. Hardware for Efficient Training
CPUs for Training
CPUs Are Targeting Deep Learning
Intel Knights Landing (2016)

7 TFLOPS FP32

16GB MCDRAM 400 GB/s

245W TDP

29 GFLOPS/W (FP32)

14nm process

Knights Mill: next gen Xeon Phi optimized for deep learning
Intel announced the addition of new vector instructions for deep learning
(AVX512-4VNNIW and AVX512-4FMAPS), October 2016
Slide Source: Sze et al., Survey of DNN Hardware, MICRO'16 Tutorial.
Image Source: Intel. Data Source: Next Platform.
GPUs for Training
GPUs Are Targeting Deep Learning
Nvidia PASCAL GP100 (2016)

10/20 TFLOPS FP32/FP16


16GB HBM 750 GB/s
300W TDP
67 GFLOPS/W (FP16)
16nm process
160GB/s NV Link

Slide Source: Sze et al., Survey of DNN Hardware, MICRO'16 Tutorial.
Data Source: NVIDIA.
GPU Systems for Training
Systems for Deep Learning

Nvidia DGX-1 (2016)


170 TFLOPS
8 Tesla P100, Dual Xeon

NVLink Hybrid Cube Mesh


Optimized DL Software
7 TB SSD Cache

Dual 10GbE, Quad IB 100Gb

3RU 3200W

Slide Source: Sze et al., Survey of DNN Hardware, MICRO'16 Tutorial.
Data Source: NVIDIA.


Cloud Systems for Training
Cloud Systems for Deep Learning

Facebook's Deep Learning Machine: Big Sur

Open Rack Compliant

Powered by 8 Tesla M40 GPUs

2x Faster Training for Faster Deployment

2x Larger Networks for Higher Accuracy

Slide Source: Sze et al., Survey of DNN Hardware, MICRO'16 Tutorial.
Data Source: Facebook.
Wrap-Up

1. Algorithms for Efficient Inference
2. Algorithms for Efficient Training
3. Hardware for Efficient Inference
4. Hardware for Efficient Training
Wrap-Up

Algorithms for Efficient Inference:
1. Pruning
2. Weight Sharing
3. Quantization
4. Low Rank Approximation
5. Binary / Ternary Net
6. Winograd Transformation

Algorithms for Efficient Training:
1. Batch Normalization
2. Model Distillation
3. DSD Training

Hardware for Efficient Inference:
1. Saves BW by Better Data Flow
2. Saves BW by Compression
3. Replace Mul with Lookup
4. Separable Kernel + Transpose SRAM
5. Binary / Ternary NN Accelerator

Hardware for Efficient Training:
1. CPU
2. GPU
3. GPU Cloud
Future: Intelligence on Mobile

Phones, Drones, Robots, Glasses, Self-Driving Cars

Limited Resource
Battery Constrained
Cooling Constrained
Outlook: the Path for Computation

PC (Computation) -> Mobile-First (Mobile Computation) -> AI-First (Brain-Inspired Intelligent Computation)

Sundar Pichai, Google I/O, 2016
Thank you!
stanford.edu/~songhan
