Common CNN Architectures
Shusen Wang
Background

History of Neural Networks
• The first neural networks.
• Backpropagation (BP) was invented independently by different researchers.
• Yann LeCun used convolution to solve image problems; he used BP to learn the filters (LeNet).
• AlexNet won the ImageNet Challenge.

Traditional Approaches
• SIFT feature
• HOG feature
• LBP feature
Classic CNN Architectures
LeNet-5 [Yann LeCun et al. 1998]
• INPUT: 32×32
• C1: feature maps 6@28×28
• S2: feature maps 6@14×14
• C3: feature maps 16@10×10
• S4: feature maps 16@5×5
• C5: layer, 120 units
• F6: layer, 84 units
• OUTPUT: 10 units
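The feature-map sizes above follow mechanically from 5×5 convolutions (no padding, stride 1) and 2×2 subsampling, which can be checked with a few lines of arithmetic:

```python
# LeNet-5 feature-map sizes: a 5x5 convolution with no padding shrinks
# each side by 4; a 2x2 subsampling layer halves it.
def conv5(size):
    return size - 4

def pool2(size):
    return size // 2

s = 32                          # INPUT 32x32
s = conv5(s); assert s == 28    # C1: 6@28x28
s = pool2(s); assert s == 14    # S2: 6@14x14
s = conv5(s); assert s == 10    # C3: 16@10x10
s = pool2(s); assert s == 5     # S4: 16@5x5
s = conv5(s); assert s == 1     # C5: a 5x5 conv over a 5x5 map -> 120 units
```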
VGG16 [Simonyan & Zisserman 2014]
• Layers: 13 Conv + 3 FC layers.
• Number of trainable parameters: about 138 million.
• Paper: Very Deep Convolutional Networks for Large-Scale Image Recognition

Figure 5.19 Fine-tuning the last convolutional block of the VGG16 network:
• Conv block 1 (frozen): Convolution2D ×2, MaxPooling2D
• Conv block 2 (frozen): Convolution2D ×2, MaxPooling2D
• Conv block 3 (frozen): Convolution2D ×3, MaxPooling2D
• Conv block 4 (frozen): Convolution2D ×3, MaxPooling2D
• Conv block 5 (we fine-tune): Convolution2D ×3, MaxPooling2D
• Flatten, then our own fully connected classifier (Dense, Dense), which we also fine-tune.
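The 138-million figure can be checked by summing parameters over the standard VGG16 configuration. A quick sketch (assuming 224×224×3 inputs, 3×3 kernels throughout, and a bias per unit):

```python
# Parameter count for VGG16: 13 conv layers (3x3) + 3 FC layers.
conv_widths = [64, 64, 128, 128, 256, 256, 256, 512, 512, 512, 512, 512, 512]

params = 0
c_in = 3  # RGB input
for c_out in conv_widths:
    params += 3 * 3 * c_in * c_out + c_out  # weights + biases
    c_in = c_out

# After 5 poolings, the 224x224 input is 7x7, so Flatten yields 7*7*512 units.
fc_sizes = [7 * 7 * 512, 4096, 4096, 1000]
for n_in, n_out in zip(fc_sizes[:-1], fc_sizes[1:]):
    params += n_in * n_out + n_out

print(params)  # 138357544 (about 138 million)
```

Almost all of the parameters live in the first FC layer (25088 × 4096), which is part of the motivation for the global pooling used by later architectures.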
“Modern” CNN Architectures
GoogLeNet [Szegedy et al. 2014]

The Inception module (naïve version):
• The previous layer feeds four parallel branches: 1×1 convolutions, 3×3 convolutions, 5×5 convolutions, and 3×3 max pooling.
• The branch outputs are joined by filter concatenation along the depth dimension (DepthConcat); the slides' Keras fragment x_input = Input(shape=(d1, d2, 64*4)) reflects four 64-channel branches concatenated.
• The design follows the practical intuition that visual information should be processed at various scales simultaneously and then aggregated.
• With dimension reductions: 1×1 convolutions placed before the expensive 3×3 and 5×5 convolutions (and after the pooling) reduce the depth and the cost.

GoogLeNet:
• Inception: 21 Conv + 1 FC layers.
• Auxiliary outputs. During training: auxiliary classifiers (softmax0, softmax1) attached to intermediate layers add gradient signal to the lower layers. During test: they are removed, and only the final classifier (softmax2) is used to create the model used at inference time.
(Figure 3: GoogLeNet network with all the bells and whistles.)

The ILSVRC 2014 Classification Challenge:
• Task: classify each image into one of 1000 leaf-node categories in the ImageNet hierarchy.
• About 1.2 million images for training, 50,000 for validation, and 100,000 for testing; each image is associated with one ground-truth category.
• Two numbers are usually reported: the top-1 error rate, which compares the ground truth against the first predicted class, and the top-5 error rate, under which an image is deemed correctly classified if the ground truth is among the first 5 predicted classes.
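The savings from the 1×1 dimension reductions can be made concrete with a back-of-the-envelope multiplication count for a single 5×5 branch (sizes here are illustrative, roughly those of an early Inception stage):

```python
# Why the 1x1 "bottleneck" convolutions help: multiplication counts for a
# single 5x5 branch. Illustrative sizes: 28x28 feature map, 192 -> 32 channels.
d = 28            # spatial size
c_in, c_out = 192, 32
c_mid = 16        # bottleneck width of the 1x1 reduction

# Naive: a 5x5 conv straight from 192 channels to 32.
naive = d * d * 5 * 5 * c_in * c_out

# With reduction: a 1x1 conv (192 -> 16), then a 5x5 conv (16 -> 32).
reduced = d * d * 1 * 1 * c_in * c_mid + d * d * 5 * 5 * c_mid * c_out

print(naive, reduced)
assert reduced * 9 < naive   # the bottleneck cuts the cost by nearly 10x
```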
How Deep Can We Go?

(Figure: test error vs. iterations for plain 20-layer and 56-layer networks; the deeper 56-layer network has the higher test error.)

Simply stacking more plain layers can make the error worse, which motivates residual connections.
ResNet [He et al. 2015]

(Figure: plain vs. residual modern architectures built by stacking 3×3 conv layers: a 7×7 conv, 64, /2 stem, then 3×3 conv stages with 128, 256, and 512 filters, with pooling /2 between stages so the output size falls to 14 and then 7.)

The residual block:
• x is the input to the block; F(x) is the output of the stacked conv layers (with ReLU in between).
• A shortcut connection adds the input back: the block outputs F(x) + x, followed by ReLU.
• With residual blocks, the number of layers can be made very large (⋯).
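The residual computation is easy to sketch in numpy; here dense layers stand in for the 3×3 convolutions (the weights and shapes below are illustrative, not from the paper):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """A minimal residual block: y = relu(x + F(x)),
    where F is two linear layers standing in for the 3x3 convolutions."""
    fx = relu(x @ w1) @ w2   # F(x)
    return relu(x + fx)      # the shortcut adds x back before the final ReLU

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
w1 = rng.standard_normal((8, 8)) * 0.1
w2 = np.zeros((8, 8))   # F(x) = 0, so the block reduces to relu(x)
out = residual_block(x, w1, w2)
assert np.allclose(out, relu(x))
```

Driving F toward zero turns the block into the identity, which is one intuition for why very deep residual networks remain trainable.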
Classification Error: Top-1 vs. Top-5

ConvNet prediction (label: "wolf"):

class — predicted probability
• husky — 0.50
• wolf — 0.20
• dog — 0.18
• fox — 0.08
• snow — 0.01
• fur — 0.01
• forest — 0.01
• ⋯

Evaluated by the Top-1 error, this prediction is wrong!
For the same prediction, evaluated by the Top-5 error, the prediction is correct! ("wolf" is among the five highest-probability classes.)
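The slide's example can be checked mechanically (the class names and probabilities are those shown above):

```python
import numpy as np

# The slide's example: predicted probabilities and the true label "wolf".
classes = ["husky", "wolf", "dog", "fox", "snow", "fur", "forest"]
probs   = np.array([0.50, 0.20, 0.18, 0.08, 0.01, 0.01, 0.01])
label   = "wolf"

ranked = [classes[i] for i in np.argsort(probs)[::-1]]  # highest first

top1_correct = (ranked[0] == label)    # top-1: only the first guess counts
top5_correct = (label in ranked[:5])   # top-5: any of the first 5 guesses

print(top1_correct, top5_correct)  # False True
```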
Beyond Accuracy
Deep Learning on the Edge
Object Detection
Face Attributes
Solution 1: Cloud Computing
Downsides:
1. Sending data between the client and the server is relatively slow (compared with local computation on the device).
2. The user's privacy.
Solution 2: Edge Computing
Advantages:
1. Faster. (Computation is typically faster than communication.)
2. Avoids sending and receiving data.
   • Saves money.
   • Works without internet.

Disadvantages:
1. Energy consumption. (Heavy matrix and tensor computation.)
2. Costs local memory and storage. (Deep ConvNets have at least millions of parameters.)
MobileNet [Howard et al. 2017]
• Motivations:
   • Small enough to fit on an iPhone (only about 4 million parameters).
   • Less computation than standard ConvNets.
• Key idea: depthwise separable convolution.
• Paper: https://arxiv.org/pdf/1704.04861.pdf
• Further reading:
   • http://machinethink.net/blog/googles-mobile-net-architecture-on-iphone/
   • https://medium.com/@yu4u/why-mobilenet-and-its-variants-e-g-shufflenet-are-fast-1c7048b9618d
Convolution vs. Depthwise Convolution

The input is a d1×d2×c1 tensor (in this example, c1 = 3).

• Standard convolution (keras.layers.Conv2D): each output channel has its own k1×k2×c1 filter.
   • Parameters: k1×k2×c1 per output channel.
   • Cost: d1×d2×k1×k2×c1 scalar multiplications per output channel.
• Depthwise convolution (keras.layers.DepthwiseConv2D): each input channel is convolved with its own k1×k2 filter, producing a d1×d2×c1 tensor.
   • Parameters: k1×k2 per channel.
   • Cost: d1×d2×c1×k1×k2 scalar multiplications in total.
• 1×1 convolution (keras.layers.Conv2D with a 1×1 kernel): mixes channels at each position.
   • Parameters: c1 per output channel.
   • Cost: d1×d2×c1 scalar multiplications per output channel.

MobileNet: Depthwise Separable Conv + 1×1 Conv
• MobileNet replaces each standard convolution with a depthwise convolution followed by a 1×1 convolution, keeping the output shape while cutting both parameters and multiplications.
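The savings can be made concrete with hypothetical sizes (a 14×14 map, 3×3 kernel, 512→512 channels; the numbers are illustrative, not from the paper):

```python
# Cost of one layer: standard conv vs. depthwise separable conv.
# Illustrative sizes: 14x14 feature map, 3x3 kernel, 512 -> 512 channels.
d1, d2 = 14, 14       # spatial size
k1, k2 = 3, 3         # kernel size
c_in, c_out = 512, 512

# Standard convolution.
std_params = k1 * k2 * c_in * c_out
std_mults  = d1 * d2 * k1 * k2 * c_in * c_out

# Depthwise separable: depthwise (k1 x k2 per channel) + 1x1 conv.
sep_params = k1 * k2 * c_in + c_in * c_out
sep_mults  = d1 * d2 * c_in * k1 * k2 + d1 * d2 * c_in * c_out

print(std_params / sep_params)  # about 8.8x fewer parameters
print(std_mults / sep_mults)    # the same ratio of multiplications
```

The ratio works out to 1 / (1/c_out + 1/(k1·k2)), so with a 3×3 kernel the separable layer is always roughly 8–9 times cheaper.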
At each stage, MobileNet halves the width and height and doubles the depth:
112×112×64 tensor → 56×56×128 tensor → 28×28×256 tensor → 14×14×512 tensor → 7×7×1024 tensor

Why not use "Flatten" at the end? An AveragePool over the final 7×7 maps produces a 1024-vector.
Implementation in Keras

A separable convolution (keras.layers.SeparableConv2D) is a depthwise convolution (keras.layers.DepthwiseConv2D) followed by a 1×1 convolution (keras.layers.Conv2D).
Implementation in Keras

import keras

model = keras.applications.mobilenet.MobileNet(input_shape=None,
    alpha=1.0, depth_multiplier=1, dropout=1e-3,
    include_top=True, weights='imagenet',
    input_tensor=None, pooling=None, classes=1000)
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_2 (InputLayer)         (None, 224, 224, 3)       0
_________________________________________________________________
Summary
CNN Architectures