Tutorial On DNN 02 Popular DNNs Datasets
Datasets
[Figure: ILSVRC image-classification accuracy by year (2012-2015), with winning entries such as Clarifai compared against human-level accuracy]
MNIST
• Digit classification
• 28x28 pixels (B&W)
• 10 classes
• 60,000 training images; 10,000 testing images
http://yann.lecun.com/exdb/mnist/
LeNet-5
• CONV layers: 2
• Fully connected layers: 2
• Weights: 60k
• MACs: 341k
• Digit classification (MNIST dataset)
• Sigmoid used for non-linearity
http://yann.lecun.com/exdb/lenet/
ImageNet
• Image classification
• ~256x256 pixels (color)
• 1000 classes
• 1.3M training images; 100,000 testing images (50,000 validation)
• For the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), accuracy on the classification task is reported as top-1 and top-5 error
Image source: http://karpathy.github.io/
http://www.image-net.org/challenges/LSVRC/
AlexNet (ILSVRC12 winner)
• CONV layers: 5
• Fully connected layers: 3
• Weights: 61M
• MACs: 724M
• Uses Local Response Normalization (LRN)
• ReLU used for non-linearity
[Krizhevsky et al., NeurIPS 2012]
[Figure: AlexNet feature maps, from the input through ofmap1 to ofmap8]
[Figure: AlexNet layer dimensions L1 to L8: 96 filters (11x11x3) in L1, 256 filters (5x5x48) in L2, 3x3 filters in L3-L5, 4096-unit fully connected layers in L6 and L7, and a 1000-way output in L8]
Large Sizes with Varying Shapes
AlexNet convolutional layer configurations:

Layer  Filter Size (RxS)  # Filters (M)  # Channels (C)  Stride
  1         11x11              96               3            4
  2          5x5              256              48            1
  3          3x3              384             256            1
  4          3x3              384             192            1
  5          3x3              256             192            1
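The per-layer weight and MAC counts implied by this table can be reproduced with a few lines of arithmetic. Note the output feature-map sizes (55x55, 27x27, 13x13) are not given in the table; they are assumed here from the standard AlexNet configuration:

```python
# Weight and MAC counts for AlexNet's CONV layers, computed from the
# table above. Output feature-map sizes are assumed (standard AlexNet).
layers = [
    # (R, S, M, C, out_h, out_w)
    (11, 11,  96,   3, 55, 55),
    ( 5,  5, 256,  48, 27, 27),
    ( 3,  3, 384, 256, 13, 13),
    ( 3,  3, 384, 192, 13, 13),
    ( 3,  3, 256, 192, 13, 13),
]

total_w = total_macs = 0
for i, (r, s, m, c, h, w) in enumerate(layers, 1):
    weights = r * s * c * m    # one RxSxC filter per output channel
    macs = weights * h * w     # each weight fires once per output position
    total_w += weights
    total_macs += macs
    print(f"CONV{i}: {weights:>9,} weights, {macs:>13,} MACs")

print(f"total: {total_w:,} weights, {total_macs:,} MACs")
```

This gives roughly 2.3M weights and 666M MACs for the five CONV layers; the remaining weights and MACs in the 61M/724M totals come from the three fully connected layers.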
[Figure: AlexNet processing pipeline from a 224x224 input image to 1000 output scores. L1: Conv (11x11) → Non-Linearity → Norm (LRN) → Max Pooling; L2: Conv (5x5) → Non-Linearity → Norm (LRN) → Max Pooling; L3: Conv (3x3) → Non-Linearity; L4: Conv (3x3) → Non-Linearity; L5: Conv (3x3) → Non-Linearity → Max Pool; L6: Fully Connect → Non-Linearity; L7: Fully Connect → Non-Linearity; L8: Fully Connect → 1000 scores]
VGGNet: Stacked Filters
• Deeper network means more weights
• Use a stack of smaller filters (3x3) to cover the same receptive field with fewer filter weights
• Non-linear activation inserted between each filter
Example: 5x5 filter (25 weights) → two 3x3 filters (18 weights)
[Figure: worked example on a 7x7 input showing that two stacked 3x3 filters cover the same 5x5 receptive field as a single 5x5 filter]
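The receptive-field equivalence can be checked numerically. In the linear case (no activation between the two layers), stacking two 3x3 filters is exactly a 5x5 convolution; the all-ones filters below are an illustrative choice, not from the slides:

```python
import numpy as np

def conv2d_valid(x, f):
    """Plain 2D cross-correlation, 'valid' padding, stride 1."""
    H, W = x.shape
    R, S = f.shape
    out = np.zeros((H - R + 1, W - S + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i+R, j:j+S] * f)
    return out

x = np.arange(49.0).reshape(7, 7)
f3 = np.ones((3, 3))                   # a symmetric 3x3 filter

# Two stacked 3x3 layers (with no non-linearity in between) ...
stacked = conv2d_valid(conv2d_valid(x, f3), f3)

# ... equal one 5x5 layer whose filter is the full convolution
# of the two 3x3 filters.
f5 = conv2d_valid(np.pad(f3, 2), f3)   # the equivalent 5x5 filter
single = conv2d_valid(x, f5)

print(np.allclose(stacked, single))    # True
print(2 * 3 * 3, "weights stacked vs", 5 * 5, "for one 5x5 filter")
```

With an activation between the two 3x3 layers the outputs are no longer identical, which is the point of the third bullet: the stack covers the same receptive field with fewer weights and adds an extra non-linearity.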
GoogLeNet (Inception)
• 9 Inception layers: go deeper!
• Each Inception module uses a 1x1 'bottleneck' to reduce the number of weights and multiplications
[Figure: 1x1 bottleneck example with 1x1x64 filters (filter1, filter2, ...) applied to a 56x56x64 input, producing a 56x56x32 output. Modified image from source: Stanford cs231n]
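The bottleneck's savings are easy to quantify. The shapes below (56x56 maps, 64 input/output channels, a 32-channel bottleneck) are assumed for illustration, loosely following the figure's 56x56x64 example:

```python
def conv_macs(h, w, r, s, c, m):
    """MACs for a conv layer: HxW output plane, M filters of RxSxC."""
    return h * w * r * s * c * m

H = W = 56           # assumed feature-map size
C, M = 64, 64        # assumed input / output channels
B = 32               # assumed bottleneck width

direct = conv_macs(H, W, 5, 5, C, M)
bottleneck = conv_macs(H, W, 1, 1, C, B) + conv_macs(H, W, 5, 5, B, M)
print(f"direct 5x5:            {direct:,} MACs")
print(f"1x1 bottleneck + 5x5:  {bottleneck:,} MACs "
      f"({direct / bottleneck:.2f}x fewer)")
```

Squeezing 64 channels down to 32 before the expensive 5x5 convolution roughly halves both the multiplications and the 5x5 layer's weights, since the 1x1 layer itself is comparatively cheap.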
ResNet: Training
Without short cuts, training and validation error increase with more layers; this is due to the vanishing gradient, not overfitting. Introduce a short-cut module to address this!
[Figure: error curves for plain vs. residual networks; thin curves denote training error, and bold curves denote validation error]
[Figure: ResNet architecture with 16 short-cut layers]
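The short-cut idea can be sketched in a few lines. This is a hypothetical fully-connected toy block (real ResNet blocks use convolutions and batch normalization), shown only to locate the identity addition:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """Toy residual block: y = ReLU(F(x) + x).

    The identity short cut lets gradients flow directly through the
    addition during backpropagation, mitigating vanishing gradients
    in very deep stacks.
    """
    f = relu(x @ w1) @ w2   # F(x): two weight layers with a non-linearity
    return relu(f + x)      # short-cut connection adds the input back

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
w1 = rng.standard_normal((8, 8))
w2 = rng.standard_normal((8, 8))
y = residual_block(x, w1, w2)
print(y.shape)              # (8,): output has the input's shape
```

Note the block's output shape must match its input shape for the addition to be valid; when it does not, ResNet uses a projection on the short cut.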
DenseNet
More skip connections! Connections come not only from the previous layer but from many past layers, to strengthen feature-map propagation and encourage feature reuse.
[Figure: dense blocks connected by transition layers]
[Huang et al., CVPR 2017]
DenseNet
Higher accuracy than ResNet with fewer weights and multiplications.

ResNeXt
• Improved accuracy vs. 'complexity' tradeoff compared to other ResNet-based models
• Used by the ILSVRC 2017 winner (team WMW)
[Xie et al., CVPR 2017]
Efficient DNN Models

Accuracy vs. Weights & OPs
[Figure: accuracy versus number of weights/operations for popular models, with arrows indicating how efficient models compress or expand relative to ResNet and GoogleNet]
Example: SqueezeNet
• Reduce the number of weights by reducing the number of input channels, "squeezing" them with 1x1 filters in a Fire module
• 50x fewer weights than AlexNet (no accuracy loss)
• However, 2.4x more operations than AlexNet*
[Figure: Fire module]

[Figure: GoogleNet/Inception v3 filter decomposition, where a 5x5 filter is decomposed into a 5x1 filter and a 1x5 filter applied sequentially (separable filters)]
Example: Inception V3
• Go deeper (v1: 22 layers → v3: 40+ layers) by reducing the number of weights per filter using filter decomposition
• ~3.5% higher accuracy than v1
• 5x5 filter → two 3x3 filters; 3x3 filter → 3x1 and 1x3 filters (separable filters)
[Figure: an RxSxC filter decomposed into an Rx1 filter followed by a 1xS filter]
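For a separable (rank-1) filter, the spatial decomposition is exact: applying an Rx1 filter and then a 1xS filter reproduces the RxS result with R+S instead of RxS weights. The filter values below are an illustrative choice, not from the slides:

```python
import numpy as np

def conv2d_valid(x, f):
    """Plain 2D cross-correlation, 'valid' padding, stride 1."""
    H, W = x.shape
    R, S = f.shape
    out = np.zeros((H - R + 1, W - S + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i+R, j:j+S] * f)
    return out

# A separable 5x5 filter is the outer product of a 5x1 and a 1x5 filter.
col = np.array([1.0, 4.0, 6.0, 4.0, 1.0]).reshape(5, 1)
row = np.array([1.0, 2.0, 1.0, 2.0, 1.0]).reshape(1, 5)
f5x5 = col @ row                                      # 25 weights

x = np.random.default_rng(0).standard_normal((9, 9))
direct = conv2d_valid(x, f5x5)                        # one 5x5 pass
sequential = conv2d_valid(conv2d_valid(x, col), row)  # 5x1 then 1x5

print(np.allclose(direct, sequential))  # True
print(f5x5.size, "weights vs", col.size + row.size)   # 25 vs 10
```

A general learned 5x5 filter is not exactly rank-1, so in practice the network learns the Rx1 and 1xS filters directly and accepts the restricted filter space in exchange for the weight savings.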
Example: Xception
• An Inception module based on depth-wise separable convolutions
• Claims to learn richer features with a similar number of weights as Inception V3 (i.e. more efficient use of weights)
– Similar performance on ImageNet; 4.3% better on a larger dataset (JFT)
– However, 1.5x more operations required than Inception V3
[Figure: spatial correlation captured by RxS filters applied per channel group (C/2 channels each) of the HxWxC input fmap, with cross-channel correlation captured separately, producing the ExF output fmaps]
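Decoupling spatial and cross-channel correlation changes the cost model: a depth-wise RxS filter per channel handles spatial correlation, and a 1x1 convolution mixes channels. The 56x56 map and 128-channel shapes below are assumed for illustration:

```python
def standard_conv_macs(h, w, r, s, c, m):
    """Standard conv: every one of the M filters spans all C channels."""
    return h * w * r * s * c * m

def depthwise_separable_macs(h, w, r, s, c, m):
    depthwise = h * w * r * s * c    # one RxS filter per input channel
    pointwise = h * w * c * m        # 1x1 conv mixes the C channels into M
    return depthwise + pointwise

# Assumed example shapes: 56x56 output, 3x3 filters, 128 -> 128 channels
std = standard_conv_macs(56, 56, 3, 3, 128, 128)
sep = depthwise_separable_macs(56, 56, 3, 3, 128, 128)
print(f"standard:  {std:,} MACs")
print(f"separable: {sep:,} MACs ({std / sep:.1f}x fewer)")
```

The same factorization applies to the weight counts, which is why depth-wise separable convolutions can spend a fixed weight budget on more layers or wider maps.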
Example: ShuffleNet
Shuffle the channel order so that channels are not isolated across groups (up to 4% increase in accuracy).
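The channel shuffle itself is just a reshape and transpose; a minimal sketch for a single (C, H, W) tensor, with the function name chosen here for illustration:

```python
import numpy as np

def channel_shuffle(x, groups):
    """Interleave channels across groups: reshape (C, H, W) to
    (groups, C // groups, H, W), swap the first two axes, and
    flatten back, so each group's output mixes all input groups."""
    c, h, w = x.shape
    assert c % groups == 0
    return (x.reshape(groups, c // groups, h, w)
             .transpose(1, 0, 2, 3)
             .reshape(c, h, w))

x = np.arange(6).reshape(6, 1, 1)  # 6 channels labeled 0..5
y = channel_shuffle(x, groups=2)
print(y.ravel())                   # [0 3 1 4 2 5]
```

With two groups, channels [0,1,2 | 3,4,5] become [0,3 | 1,4 | 2,5], so a following grouped convolution sees channels from both original groups.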
Summary
• Approaches used to improve accuracy by popular DNN models in the ImageNet Challenge
– Go deeper (i.e. more layers)
– Stack smaller filters and apply 1x1 bottlenecks to reduce the number of weights, so that the deeper models can fit into a GPU (faster training)
– Use multiple connections across layers (e.g. parallel and short cut)
• Efficient models aim to reduce the number of weights and the number of operations
– Most use some form of filter decomposition (spatial, depth and channel)
– Note: the number of weights and operations does not directly map to storage, speed and power/energy; it depends on the hardware!
• Filter shapes vary across layers and models
– Need flexible hardware!
Datasets
Image Classification Datasets
• Image classification/recognition
– Given an entire image → select 1 of N classes
– No localization (detection)

            MNIST          IMAGENET
Year        1998           2012
Resolution  28x28          256x256
Classes     10             1000
Training    60k            1.3M
Testing     10k            100k
Accuracy    0.21% error    2.25% top-5 error
            (ICML 2013)    (2017 winner)

http://rodrigob.github.io/are_we_there_yet/build/classification_datasets_results.html
Effectiveness of More Data
Accuracy increases logarithmically with the amount of training data.
Results from a Google-internal dataset, JFT-300M (300M images, 18,291 categories), orders of magnitude larger than ImageNet.
Beyond CNN (CONV and FC Layers)

Summary