
Common CNN Architectures

Shusen Wang
Background

History of Neural Networks

[Timeline: 1943 – 1958 – 1960's – 1986 – 1989 – 1997 – 1998 – 2012]

• 1943: The first neural network.
• 1958: Perceptron.
• 1960's: Backpropagation (BP) was invented independently by different researchers.
• 1986: Hinton and others reinvented BP; it then became popular.
• 1989: Yann LeCun used convolution to solve image problems; he used BP to learn the filters.
• 1997: LSTM was invented.
• 1998: LeNet.
• 2012: AlexNet won the ImageNet Challenge.
• Also on the timeline: RNN was invented, and for a period neural networks were considered "dead" by most ML researchers.

Traditional Approaches

• SIFT Feature
• HOG Feature
• LBP Feature

Traditional Approaches

[Pipeline figure: hand-crafted features (SIFT, HOG, LBP) are fed into a linear model, which outputs the label, e.g., "dog".]
Classic CNN Architectures

LeNet-5 [Yann LeCun et al. 1998]

[Architecture figure: INPUT 32x32 → C1: feature maps 6@28x28 (convolutions) → S2: f. maps 6@14x14 (subsampling) → C3: f. maps 16@10x10 (convolutions) → S4: f. maps 16@5x5 (subsampling) → C5: layer 120 (full connection) → F6: layer 84 (full connection) → OUTPUT 10 (Gaussian connections)]

• Layers: 2 Conv + 2 FC layers.
• Paper: Gradient-based learning applied to document recognition
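
For reference, a minimal Keras sketch of the LeNet-5 layout shown above. The tanh activations and average-pooling subsampling follow the spirit of the original paper, but the Gaussian-connection output is replaced by a plain softmax, so treat this as an illustration rather than a faithful reimplementation.

from keras.layers import Input, Conv2D, AveragePooling2D, Flatten, Dense
from keras.models import Model

x_in = Input(shape=(32, 32, 1))                    # INPUT 32x32
z = Conv2D(6, (5, 5), activation="tanh")(x_in)     # C1: 6@28x28
z = AveragePooling2D(pool_size=(2, 2))(z)          # S2: 6@14x14
z = Conv2D(16, (5, 5), activation="tanh")(z)       # C3: 16@10x10
z = AveragePooling2D(pool_size=(2, 2))(z)          # S4: 16@5x5
z = Flatten()(z)
z = Dense(120, activation="tanh")(z)               # C5: 120
z = Dense(84, activation="tanh")(z)                # F6: 84
out = Dense(10, activation="softmax")(z)           # OUTPUT: 10
lenet5 = Model(x_in, out)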
AlexNet [Alex Krizhevsky et al. 2012]

• Layers: 5 Conv + 3 FC layers.
• Number of trainable parameters: 60M.
• Paper: ImageNet Classification with Deep Convolutional Neural Networks
• Keras implementation: http://dandxy89.github.io/ImageModels/
In [4]:

from keras.layers import (Input, Conv2D, MaxPooling2D, BatchNormalization,
                          ZeroPadding2D, Flatten, Dense, Dropout)
from keras.models import Model

DROPOUT = 0.5
N_CATEGORY = 100   # implied by the summary below (dense_5 has 100 outputs)

model_input = Input(shape = (227, 227, 3))

# First convolutional layer (96x11x11)
z = Conv2D(filters = 96, kernel_size = (11,11), strides = (4,4), activation = "relu")(model_input)
z = MaxPooling2D(pool_size = (3,3), strides = (2,2))(z)
z = BatchNormalization()(z)

# Second convolutional layer (256x5x5)
z = ZeroPadding2D(padding = (2,2))(z)
z = Conv2D(filters = 256, kernel_size = (5,5), strides = (1,1), activation = "relu")(z)
z = MaxPooling2D(pool_size = (3,3), strides = (2,2))(z)
z = BatchNormalization()(z)

# Remaining 3 convolutional layers
z = ZeroPadding2D(padding = (1,1))(z)
z = Conv2D(filters = 384, kernel_size = (3,3), strides = (1,1), activation = "relu")(z)

z = ZeroPadding2D(padding = (1,1))(z)
z = Conv2D(filters = 384, kernel_size = (3,3), strides = (1,1), activation = "relu")(z)

z = ZeroPadding2D(padding = (1,1))(z)
z = Conv2D(filters = 256, kernel_size = (3,3), strides = (1,1), activation = "relu")(z)

z = MaxPooling2D(pool_size = (3,3), strides = (2,2))(z)
z = Flatten()(z)

z = Dense(4096, activation = "relu")(z)
z = Dropout(DROPOUT)(z)

z = Dense(4096, activation = "relu")(z)
z = Dropout(DROPOUT)(z)

final_dim = 1 if N_CATEGORY == 2 else N_CATEGORY
final_act = "sigmoid" if N_CATEGORY == 2 else "softmax"
model_output = Dense(final_dim, activation = final_act)(z)

model = Model(model_input, model_output)
model.summary()

Layer (type)                 Output Shape              Param #
=================================================================
input_2 (InputLayer)         (None, 227, 227, 3)       0
conv2d_6 (Conv2D)            (None, 55, 55, 96)        34944
max_pooling2d_4 (MaxPooling2 (None, 27, 27, 96)        0
batch_normalization_3 (Batch (None, 27, 27, 96)        384
zero_padding2d_5 (ZeroPaddin (None, 31, 31, 96)        0
conv2d_7 (Conv2D)            (None, 27, 27, 256)       614656
max_pooling2d_5 (MaxPooling2 (None, 13, 13, 256)       0
batch_normalization_4 (Batch (None, 13, 13, 256)       1024
zero_padding2d_6 (ZeroPaddin (None, 15, 15, 256)       0
conv2d_8 (Conv2D)            (None, 13, 13, 384)       885120
zero_padding2d_7 (ZeroPaddin (None, 15, 15, 384)       0
conv2d_9 (Conv2D)            (None, 13, 13, 384)       1327488
zero_padding2d_8 (ZeroPaddin (None, 15, 15, 384)       0
conv2d_10 (Conv2D)           (None, 13, 13, 256)       884992
max_pooling2d_6 (MaxPooling2 (None, 6, 6, 256)         0
flatten_2 (Flatten)          (None, 9216)              0
dense_3 (Dense)              (None, 4096)              37752832
dropout_3 (Dropout)          (None, 4096)              0
dense_4 (Dense)              (None, 4096)              16781312
dropout_4 (Dropout)          (None, 4096)              0
dense_5 (Dense)              (None, 100)               409700
=================================================================
Total params: 58,692,452
Trainable params: 58,691,748
Non-trainable params: 704
VGG16 [Simonyan & Zisserman 2014]

• Layers: 13 Conv + 2 FC layers.
• Number of trainable parameters: 138M.
• Paper: Very Deep Convolutional Networks for Large-Scale Image Recognition

[Figure 5.19: Fine-tuning the last convolutional block of the VGG16 network. Conv blocks 1-4 (Convolution2D + MaxPooling2D stacks) are frozen; we fine-tune Conv block 5 and our own fully connected classifier (Flatten → Dense → Dense).]
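
Below is a minimal Keras sketch of the fine-tuning setup in Figure 5.19, assuming the built-in ImageNet weights of keras.applications.VGG16; the 256-unit Dense layer and the binary sigmoid output are illustrative assumptions, not the slide's exact classifier.

from keras.applications import VGG16
from keras.layers import Flatten, Dense
from keras.models import Model

base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))

# Freeze conv blocks 1-4; leave conv block 5 trainable for fine-tuning.
for layer in base.layers:
    layer.trainable = layer.name.startswith("block5")

# Our own fully connected classifier on top.
z = Flatten()(base.output)
z = Dense(256, activation="relu")(z)
out = Dense(1, activation="sigmoid")(z)

model = Model(base.input, out)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])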
“Modern” CNN Architectures
GoogLeNet/Inception [Szegedy et al. 2015]

• Stack of the "Inception" modules.
• Only 5M parameters. (VGG16 has 138M parameters!)

[Figure: a screenshot of the GoogLeNet paper, showing Figure 3 (the full network: stacked Inception modules built from parallel 1x1, 3x3, and 5x5 Conv and 3x3 MaxPool branches joined by DepthConcat, two auxiliary softmax heads, and a final AveragePool 7x7 → FC → softmax classifier) together with excerpts from Sections 6 (Training Methodology) and 7 (ILSVRC 2014 Classification Challenge).]
GoogLeNet/Inception [Szegedy et al. 2015]

• Stack of the "Inception" modules.
• Only 5M parameters. (VGG16 has 138M parameters!)

An Inception module:

[Figure: (a) Inception module, naive version - the previous layer feeds parallel 1x1, 3x3, and 5x5 convolutions plus a 3x3 max pooling branch, and the branch outputs are joined by filter concatenation. (b) Inception module with dimensionality reduction - 1x1 convolutions are added before the 3x3 and 5x5 convolutions and after the 3x3 max pooling. Shown next to the GoogLeNet paper screenshot.]
GoogLeNet/Inception [Szegedy et al. 2015]

An Inception module in Keras:

In [5]:

from keras.layers import Input, Conv2D, MaxPooling2D, concatenate

d1 = 100   # spatial height of the input feature map (assumed; only d2 is legible on the slide)
d2 = 100   # spatial width of the input feature map

x_input = Input(shape=(d1, d2, 64*4))

tower1 = Conv2D(64, (1,1), padding='same', activation='relu')(x_input)

tower2 = Conv2D(64, (1,1), padding='same', activation='relu')(x_input)
tower2 = Conv2D(64, (3,3), padding='same', activation='relu')(tower2)

tower3 = Conv2D(64, (1,1), padding='same', activation='relu')(x_input)
tower3 = Conv2D(64, (5,5), padding='same', activation='relu')(tower3)

tower4 = MaxPooling2D((3,3), strides=(1,1), padding='same')(x_input)
tower4 = Conv2D(64, (1,1), padding='same', activation='relu')(tower4)

x_output = concatenate([tower1, tower2, tower3, tower4], axis = 3)

Output: d1×d2×256 (the same shape as the input)

[Figure: the GoogLeNet paper screenshot with Figure 3 and the Inception module diagrams, shown alongside the code.]
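
Why the 1x1 convolutions before the 3x3 and 5x5 branches matter: they shrink the channel dimension before the expensive spatial convolutions. A back-of-the-envelope weight count for the 5x5 branch, using the channel sizes from the sketch above (biases ignored):

c_in, c_out, c_reduce = 256, 64, 64

direct_5x5 = 5*5 * c_in * c_out                              # 5x5 conv applied directly: 409,600 weights
with_1x1   = 1*1 * c_in * c_reduce + 5*5 * c_reduce * c_out  # 1x1 reduction, then 5x5: 16,384 + 102,400 = 118,784

# The dimensionality-reduced branch needs roughly 3.4x fewer weights (and multiplications).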
GoogLeNet/Inception [Szegedy et al. 2015]

Bottom layers:
• A simple ConvNet.
• 3 Conv + 2 Pooling + 2 Normalization layers.

[Figure: the bottom of the GoogLeNet diagram - input → Conv 7x7+2(S) → MaxPool 3x3+2(S) → LocalRespNorm → Conv 1x1+1(V) → Conv 3x3+1(S) → LocalRespNorm → MaxPool 3x3+2(S).]
GoogLeNet/Inception [Szegedy et al. 2015]

• Stack of 9 Inception modules.

[Figure: the middle of the GoogLeNet diagram - nine Inception modules (DepthConcat over parallel 1x1, 3x3, and 5x5 Conv and 3x3 MaxPool branches) interleaved with MaxPool 3x3+2(S) layers, shown next to the paper's Inception module diagrams (naive version and the version with dimensionality reduction).]
GoogLeNet/Inception [Szegedy et al. 2015]

Output layers:
• Using AveragePool instead of Flatten.
• The AveragePool layer reduces the number of parameters.

[Figure: the top of the GoogLeNet diagram - a 7×7×1024 feature map → AveragePool 7x7+1(V) → shape 1×1×1024 → FC → shape 1000 → SoftmaxActivation (softmax2).]

Question: If AveragePool is replaced by Flatten, then what will be #params in the FC layer?
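
A quick back-of-the-envelope answer (bias terms ignored):

# Weights in the final FC layer for the two ways of feeding the 7x7x1024 feature map into the 1000-way classifier.
avg_pool_in = 1 * 1 * 1024          # AveragePool output
flatten_in  = 7 * 7 * 1024          # Flatten output

print(avg_pool_in * 1000)           # 1,024,000 weights
print(flatten_in * 1000)            # 50,176,000 weights, about 49x more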
GoogLeNet/Inception [Szegedy et al. 2015]

Auxiliary outputs (softmax0 and softmax1):

During training:
• Inject additional gradient at lower layers.
• Make the optimization easier.

During test:
• Disable the auxiliary outputs.

[Figure: the GoogLeNet diagram with the two auxiliary classifier branches - AveragePool 5x5+3(V) → Conv 1x1+1(S) → FC → FC → SoftmaxActivation.]
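
A minimal Keras sketch of how auxiliary outputs can be wired up and then disabled. The toy backbone, layer sizes, and names are illustrative assumptions, not GoogLeNet's exact configuration; the 0.3 loss weight follows the paper's down-weighting of the auxiliary losses.

from keras.layers import Input, Conv2D, MaxPooling2D, GlobalAveragePooling2D, Dense
from keras.models import Model

x_in = Input(shape=(224, 224, 3))
z = Conv2D(64, (7, 7), strides=(2, 2), padding='same', activation='relu')(x_in)
z = MaxPooling2D((3, 3), strides=(2, 2))(z)
mid = Conv2D(128, (3, 3), padding='same', activation='relu')(z)      # a lower/middle stage

# Auxiliary head attached to the middle stage (used only during training).
aux = GlobalAveragePooling2D()(mid)
aux_out = Dense(1000, activation='softmax', name='aux_softmax')(aux)

# Main head on top of the deeper layers.
deep = Conv2D(256, (3, 3), padding='same', activation='relu')(mid)
main = GlobalAveragePooling2D()(deep)
main_out = Dense(1000, activation='softmax', name='main_softmax')(main)

# Training: both outputs contribute gradient; the auxiliary loss is down-weighted.
train_model = Model(x_in, [main_out, aux_out])
train_model.compile(optimizer='sgd',
                    loss='categorical_crossentropy',
                    loss_weights=[1.0, 0.3])

# Test: keep only the main output, i.e., the auxiliary head is disabled.
test_model = Model(x_in, main_out)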
How Deep Can We Go?

Classic architectures (sequential):
• LeNet-5: 2 Conv + 2 FC layers.
• AlexNet: 5 Conv + 3 FC layers.
• VGG16: 13 Conv + 2 FC layers.

Question: Why VGG16 and VGG19? Why not deeper classic architectures?
How Deep Can We Go?

Classic architectures (sequential):
• LeNet-5: 2 Conv + 2 FC layers.
• AlexNet: 5 Conv + 3 FC layers.
• VGG16: 13 Conv + 2 FC layers.

Question: Why VGG16 and VGG19? Why not deeper classic architectures?

Case Study: ResNet [He et al., 2015]
What happens when we continue stacking deeper layers on a "plain" convolutional neural network?
Answer: Deeper nets have worse training and test errors.

[Figure: training error and test error vs. iterations for 20-layer and 56-layer plain networks.]
How Deep Can We Go?

Classic architectures (sequential):
• LeNet-5: 2 Conv + 2 FC layers.
• AlexNet: 5 Conv + 3 FC layers.
• VGG16: 13 Conv + 2 FC layers.

Question: What makes deeper VGG nets worse?

Case Study: ResNet [He et al., 2015]
What happens when we continue stacking deeper layers on a "plain" convolutional neural network?
Answer: It is bad optimization, not overfitting.

[Figure: training error and test error vs. iterations for 20-layer and 56-layer plain networks.]
How Deep Can We Go?

Classic architectures (sequential):
• LeNet-5: 2 Conv + 2 FC layers.
• AlexNet: 5 Conv + 3 FC layers.
• VGG16: 13 Conv + 2 FC layers.

Question: What makes deeper VGG nets worse?
• The vanishing gradient problem.
• The derivative of the loss function w.r.t. a bottom layer can be vanishingly small.
• This makes deep nets difficult to optimize: the bottom layers are not well trained.
• The model is expressive enough, but the optimizer cannot find a good local minimum.
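
A toy numerical illustration (not from the slides): if each layer scales the backpropagated gradient by a factor slightly below 1 (0.8 is an assumed value), the gradient reaching the bottom layers shrinks exponentially with depth.

factor = 0.8
for depth in (20, 56):
    print(depth, factor ** depth)
# 20 layers: ~1.2e-02
# 56 layers: ~3.7e-06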
How Deep Can We Go?

Classic architectures (sequential):
• LeNet-5: 2 Conv + 2 FC layers.
• AlexNet: 5 Conv + 3 FC layers.
• VGG16: 13 Conv + 2 FC layers.

• Inception: 21 Conv + 1 FC layers.

Question: Why can Inception go deeper than VGG16?

[Figure: the GoogLeNet paper screenshot with the full network diagram.]
How Deep Can We Go?

Classic architectures (sequential):
• LeNet-5: 2 Conv + 2 FC layers.
• AlexNet: 5 Conv + 3 FC layers.
• VGG16: 13 Conv + 2 FC layers.

• Inception: 21 Conv + 1 FC layers.

Question: Why can Inception go deeper than VGG16?
• Auxiliary outputs inject additional gradient at lower layers.

[Figure: the GoogLeNet paper screenshot with the full network diagram, auxiliary softmax heads visible.]
How Deep Can We Go?

Classic architectures (sequential):
• LeNet-5: 2 Conv + 2 FC layers.
• AlexNet: 5 Conv + 3 FC layers.
• VGG16: 13 Conv + 2 FC layers.

Modern architectures:
• Inception: 21 Conv + 1 FC layers.
• ResNet: Up to 151 Conv + 1 FC layers.

ResNet's key idea: skip connections for curing the vanishing gradient.

[Figure from the ResNet paper: layer-by-layer comparison of VGG-19, a 34-layer plain network, and a 34-layer residual network (7x7 and 3x3 conv stacks with 64/128/256/512 filters, pooling, and a final avg pool → fc 1000).]
ResNet [He et al. 2015]

A block of ResNet:

[Figure 2. Residual learning: a building block - the input x passes through a weight layer, ReLU, and a second weight layer to produce F(x); the identity shortcut adds x, giving F(x) + x, followed by ReLU.]

[Figure: layer-by-layer comparison of VGG-19, a 34-layer plain network, and a 34-layer residual network.]
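
A minimal Keras sketch of the residual block above, assuming 3x3 convolutions with 64 filters; the BatchNormalization placement follows common practice and is an assumption rather than the slide's exact block.

from keras.layers import Input, Conv2D, BatchNormalization, Activation, add
from keras.models import Model

x_in = Input(shape=(56, 56, 64))

z = Conv2D(64, (3, 3), padding='same')(x_in)   # weight layer
z = BatchNormalization()(z)
z = Activation('relu')(z)                      # relu
z = Conv2D(64, (3, 3), padding='same')(z)      # weight layer
z = BatchNormalization()(z)                    # this is F(x)

z = add([z, x_in])                             # F(x) + x via the identity shortcut
z = Activation('relu')(z)                      # relu

res_block = Model(x_in, z)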
Winners of the ImageNet Challenge

[Figure: bar chart of ImageNet Challenge winners by year, showing the number of layers and the top-5 error rate of each winning model.]


Classification Error: Top-1 vs. Top-5

[Figure: an image labeled "wolf" is fed into a ConvNet, which outputs class probabilities.]

class      pred. probability
• husky    0.50
• wolf     0.20
• dog      0.18
• fox      0.08
• snow     0.01
• fur      0.01
• forest   0.01

Label: "wolf"

• Evaluated by the Top-1 error, this prediction is wrong!
• Evaluated by the Top-5 error, the prediction is correct!
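
A short sketch that checks the example above programmatically:

probs = {"husky": 0.50, "wolf": 0.20, "dog": 0.18, "fox": 0.08,
         "snow": 0.01, "fur": 0.01, "forest": 0.01}
ranked = sorted(probs, key=probs.get, reverse=True)

label = "wolf"
print(label == ranked[0])      # False -> counts as a Top-1 error
print(label in ranked[:5])     # True  -> correct under the Top-5 criterion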
Beyond the Accuracy

Deep Learning on the Edge

[Figure: example on-device applications of MobileNets - object detection, OCR & translation, face attributes, and speech recognition & translation. Photo by Juanedc (CC BY 2.0); Google Doodle by Sarah Harrison.]


Solution 1: Cloud Computing

[Figure, built up over several slides: the device sends its input to a server in the cloud, which runs the model and returns the result (e.g., "Bonjour").]
Solution 1: Cloud Computing

[Figure: a large ConvNet (the GoogLeNet paper's "Figure 3: GoogLeNet network with all the bells and whistles") running on the cloud server.]
• A linear layer with softmax loss as the classifier (pre- softmax2

dicting the same 1000 classes as the main classifier, but SoftmaxActivation

removed at inference time). FC

A schematic view of the resulting network is depicted in


AveragePool
7x7+1(V)

Figure 3. DepthConcat

Conv Conv Conv Conv


1x1+1(S) 3x3+1(S) 5x5+1(S) 1x1+1(S)

6. Training Methodology Conv


1x1+1(S)
Conv
1x1+1(S)
MaxPool
3x3+1(S)

GoogLeNet networks were trained using the DistBe- DepthConcat

lief [4] distributed machine learning system using mod- Conv


1x1+1(S)
Conv
3x3+1(S)
Conv
5x5+1(S)
Conv
1x1+1(S)
softmax1

est amount of model and data-parallelism. Although we Conv


1x1+1(S)
Conv
1x1+1(S)
MaxPool
3x3+1(S)
SoftmaxActivation

used a CPU based implementation only, a rough estimate MaxPool


3x3+2(S)
FC

suggests that the GoogLeNet network could be trained to DepthConcat FC

convergence using few high-end GPUs within a week, the


main limitation being the memory usage. Our training used
Conv Conv Conv Conv Conv
1x1+1(S) 3x3+1(S) 5x5+1(S) 1x1+1(S) 1x1+1(S)

asynchronous stochastic gradient descent with 0.9 momen- Conv


1x1+1(S)
Conv
1x1+1(S)
MaxPool
3x3+1(S)
AveragePool
5x5+3(V)

tum [17], fixed learning rate schedule (decreasing the learn- DepthConcat

ing rate by 4% every 8 epochs). Polyak averaging [13] was Conv


1x1+1(S)
Conv
3x3+1(S)
Conv
5x5+1(S)
Conv
1x1+1(S)

used to create the final model used at inference time. Conv Conv MaxPool
1x1+1(S) 1x1+1(S) 3x3+1(S)

Image sampling methods have changed substantially


over the months leading to the competition, and already
DepthConcat softmax0

converged models were trained on with other options, some-


Conv Conv Conv Conv
SoftmaxActivation
1x1+1(S) 3x3+1(S) 5x5+1(S) 1x1+1(S)

times in conjunction with changed hyperparameters, such Conv


1x1+1(S)
Conv
1x1+1(S)
MaxPool
3x3+1(S)
FC

as dropout and the learning rate. Therefore, it is hard to DepthConcat FC

give a definitive guidance to the most effective single way Conv


1x1+1(S)
Conv
3x3+1(S)
Conv
5x5+1(S)
Conv
1x1+1(S)
Conv
1x1+1(S)

to train these networks. To complicate matters further, some Conv Conv MaxPool AveragePool

of the models were mainly trained on smaller relative crops,


Name and ID
1x1+1(S) 1x1+1(S) 3x3+1(S) 5x5+3(V)

others on larger ones, inspired by [8]. Still, one prescrip- DepthConcat

tion that was verified to work very well after the competi- Conv
1x1+1(S)
Conv
3x3+1(S)
Conv
5x5+1(S)
Conv
1x1+1(S)

tion, includes sampling of various sized patches of the im- Conv


1x1+1(S)
Conv
1x1+1(S)
MaxPool
3x3+1(S)

age whose size is distributed evenly between 8% and 100% MaxPool


3x3+2(S)

of the image area with aspect ratio constrained to the inter- DepthConcat

val [ 34 , 43 ]. Also, we found that the photometric distortions Conv Conv Conv Conv

of Andrew Howard [8] were useful to combat overfitting to 1x1+1(S) 3x3+1(S) 5x5+1(S) 1x1+1(S)

the imaging conditions of training data. Conv


1x1+1(S)
Conv
1x1+1(S)
MaxPool
3x3+1(S)

DepthConcat

7. ILSVRC 2014 Classification Challenge Conv


1x1+1(S)
Conv
3x3+1(S)
Conv
5x5+1(S)
Conv
1x1+1(S)

Setup and Results Conv


1x1+1(S)
Conv
1x1+1(S)
MaxPool
3x3+1(S)

The ILSVRC 2014 classification challenge involves the


MaxPool
3x3+2(S)

task of classifying the image into one of 1000 leaf-node cat- LocalRespNorm

egories in the Imagenet hierarchy. There are about 1.2 mil- Conv
3x3+1(S)

lion images for training, 50,000 for validation and 100,000 Conv
1x1+1(V)

images for testing. Each image is associated with one LocalRespNorm

ground truth category, and performance is measured based MaxPool

on the highest scoring classifier predictions. Two num-


3x3+2(S)

bers are usually reported: the top-1 accuracy rate, which


Conv
7x7+2(S)

compares the ground truth against the first predicted class, input

and the top-5 error rate, which compares the ground truth
against the first 5 predicted classes: an image is deemed Figure 3: GoogLeNet network with all the bells and whistles.
correctly classified if the ground truth is among the top-5,
Solution 1: Cloud Computing

Downsides:
1. Sending data between the client and the server is relatively slow (compared with local computation on the device).
2. Sending data through a 3G/4G/5G network may incur charges.
3. The user's privacy is at risk.
Solution 2: Edge Computing

Run a ConvNet on the chip to make predictions.
(The ConvNet has been trained on the server.)
Solution 2: Edge Computing

Advantages:
1. Faster. (Computation is typically faster than communication.)
2. Avoid sending and receiving data.
   • Save money.
   • Works without internet.
3. Protect the user's privacy.

Run a ConvNet on the chip to make predictions (e.g., "Shiba Inu").
(The ConvNet has been trained on the server.)
Solution 2: Edge Computing

Advantages:
1. Faster. (Computation is typically faster than communication.)
2. Avoid sending and receiving data.
   • Save money.
   • Works without internet.
3. Protect the user's privacy.

Disadvantages:
1. Energy consumption. (Heavy matrix and tensor computation.)
2. Costs local memory and storage. (Deep ConvNets have at least millions of parameters.)

Run a ConvNet on the chip to make predictions (e.g., "Shiba Inu").
(The ConvNet has been trained on the server.)
Cloud Computing vs. Edge Computing

• Edge computing is more popular for deep learning tasks (at present).
• Research on deep learning at the edge targets:
  1. Fewer floating-point operations (FLOPs), and thus less energy consumption.
  2. Fewer network parameters, and thus less memory and storage.
MobileNet [Howard et al. 2017]

• Motivations
• Small enough to fit in an iPhone (just about 4M parameters).
• Less computation than the standard ConvNets.
• Key idea: Depthwise separable convolution.
• Paper: https://arxiv.org/pdf/1704.04861.pdf
• Further reading:
• http://machinethink.net/blog/googles-mobile-net-architecture-on-iphone/
• https://medium.com/@yu4u/why-mobilenet-and-its-variants-e-g-shufflenet-are-fast-1c7048b9618d
Convolution

convolution: keras.layers.Conv2D

One w_H×w_W×n_C filter is applied to an n_H×n_W×n_C input.
(Notation: n_H×n_W is the spatial size, n_C is the number of channels, and w_H×w_W is the filter's spatial size.)

In this example, n_C = 3.
Convolution

convolution: keras.layers.Conv2D
  One w_H×w_W×n_C filter
  n_H×n_W×n_C input → n_H×n_W×1 output

depthwise convolution: keras.layers.DepthwiseConv2D
  One w_H×w_W filter
  n_H×n_W×n_C input
Convolution

convolution: keras.layers.Conv2D
  One w_H×w_W×n_C filter
  n_H×n_W×n_C input → n_H×n_W×1 output

depthwise convolution: keras.layers.DepthwiseConv2D
  One w_H×w_W filter
  n_H×n_W×n_C input
  For j = 1 to n_C (each of the slices):
  • Compute the convolution of the j-th slice (shape n_H×n_W) and the filter (shape w_H×w_W).
  • Output: a matrix (shape n_H×n_W).
Convolution

convolution: keras.layers.Conv2D
  One w_H×w_W×n_C filter
  n_H×n_W×n_C input → n_H×n_W×1 output

depthwise convolution: keras.layers.DepthwiseConv2D
  One w_H×w_W filter
  n_H×n_W×n_C input → n_H×n_W×n_C output
  For j = 1 to n_C (each of the slices):
  • Compute the convolution of the j-th slice (shape n_H×n_W) and the filter (shape w_H×w_W).
  • Output: a matrix (shape n_H×n_W).
  Thus the final output is n_H×n_W×n_C.
Convolution

convolution: keras.layers.Conv2D
  One w_H×w_W×n_C filter
  n_H×n_W×n_C input → n_H×n_W×1 output

depthwise convolution: keras.layers.DepthwiseConv2D
  One w_H×w_W filter
  n_H×n_W×n_C input → n_H×n_W×n_C output
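To make the shape bookkeeping above concrete, here is a minimal Keras sketch (the 28×28×3 input and 5×5 filter size are arbitrary assumptions) comparing a standard convolution, which uses a single w_H×w_W×n_C filter, with a depthwise convolution; "same" padding keeps the spatial size n_H×n_W, as on the slides:

from keras import layers, models

inp = layers.Input(shape=(28, 28, 3))   # n_H = 28, n_W = 28, n_C = 3

# Standard convolution: one 5x5x3 filter -> a single 28x28x1 output slice.
std = layers.Conv2D(filters=1, kernel_size=(5, 5), padding="same", use_bias=False)(inp)

# Depthwise convolution: a 5x5 filter convolved with each input slice -> 28x28x3 output.
dw = layers.DepthwiseConv2D(kernel_size=(5, 5), padding="same", use_bias=False)(inp)

print(models.Model(inp, std).output_shape)  # (None, 28, 28, 1)
print(models.Model(inp, dw).output_shape)   # (None, 28, 28, 3)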
MobileNet: Depthwise Conv + 1×1 Conv

depthwise convolution: keras.layers.DepthwiseConv2D
  One w_H×w_W filter
  n_H×n_W×n_C input → n_H×n_W×n_C output

1×1 convolution: keras.layers.Conv2D
  One 1×1×n_C filter
  n_H×n_W×n_C input → n_H×n_W×1 output
MobileNet: Depthwise Separable Conv + 1×1 Conv

Standard convolution (keras.layers.Conv2D):
  n_H×n_W×n_C tensor → n_H×n_W×1 matrix

Depthwise separable replacement:
  n_H×n_W×n_C tensor
  → depthwise convolution (keras.layers.DepthwiseConv2D) → n_H×n_W×n_C tensor
  → 1×1 convolution (keras.layers.Conv2D) → n_H×n_W×1 matrix
MobileNet: Depthwise Separable Conv + 1×1 Conv

Standard convolution (keras.layers.Conv2D):
  n_H×n_W×n_C tensor → n_H×n_W×1 matrix
  w_H·w_W·n_C parameters (one filter)

Depthwise separable replacement:
  n_H×n_W×n_C tensor
  → depthwise convolution (keras.layers.DepthwiseConv2D): w_H·w_W parameters (one w_H×w_W filter)
  → n_H×n_W×n_C tensor
  → 1×1 convolution (keras.layers.Conv2D): n_C parameters (one 1×1×n_C filter)
  → n_H×n_W×1 matrix
MobileNet: Depthwise Separable Conv + 1×1 Conv

Standard convolution (keras.layers.Conv2D):
  n_H×n_W×n_C tensor → n_H×n_W×1 matrix
  n_H·n_W (#patches) × w_H·w_W·n_C (patch size) scalar multiplications

Depthwise separable replacement:
  n_H×n_W×n_C tensor
  → depthwise convolution (keras.layers.DepthwiseConv2D): n_H·n_W·n_C·w_H·w_W scalar multiplications
  → n_H×n_W×n_C tensor
  → 1×1 convolution (keras.layers.Conv2D): n_H·n_W·n_C scalar multiplications
  → n_H×n_W×1 matrix
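Plugging in concrete numbers shows the saving. The sketch below is plain Python with assumed sizes (14×14×512 input, 3×3 filters, 512 output channels; these numbers are illustrative, not from the slides), and it extends the per-filter counts above to a full layer, where the single depthwise pass is shared by all of the 1×1 filters:

# Assumed sizes for illustration: n_H = n_W = 14, n_C = 512, w_H = w_W = 3,
# and n_out = 512 output channels.
n_H, n_W, n_C, n_out = 14, 14, 512, 512
w_H, w_W = 3, 3

# Standard convolution: n_out filters, each of shape w_H x w_W x n_C.
standard_mults = n_H * n_W * w_H * w_W * n_C * n_out

# Depthwise separable: one shared depthwise pass + n_out 1x1xn_C filters.
separable_mults = n_H * n_W * n_C * w_H * w_W + n_H * n_W * n_C * n_out

print(standard_mults)                     # 462,422,016
print(separable_mults)                    # 52,283,392
print(standard_mults / separable_mults)   # roughly 8.8x fewer multiplications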
MobileNet
Input: 224×224×3 tensor
Conv2D

112×112×64 tensor
Halve the width and height, double the depth.
56×56×128 tensor

28×28×256 tensor

14×14×512 tensor

7×7×1024 tensor
Why not use “Flatten”?
AveragePool

1024 vector
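The answer in one calculation: flattening the 7×7×1024 feature map gives a 50,176-dimensional vector, so a following 1000-way dense layer alone would need about 50M weights, whereas average pooling collapses the map to a 1024-vector with no parameters at all. A minimal Keras sketch of the two heads (the 1000-class output is an assumption, matching ImageNet):

from keras import layers, models

feat = layers.Input(shape=(7, 7, 1024))     # the last 7x7x1024 feature map

# Head 1: Flatten -> Dense(1000): 7*7*1024*1000 + 1000 ≈ 50.2M parameters.
flat_head = layers.Dense(1000, activation="softmax")(layers.Flatten()(feat))

# Head 2: GlobalAveragePooling2D -> Dense(1000): 1024*1000 + 1000 ≈ 1.0M parameters.
gap_head = layers.Dense(1000, activation="softmax")(layers.GlobalAveragePooling2D()(feat))

print(models.Model(feat, flat_head).count_params())   # 50,177,000
print(models.Model(feat, gap_head).count_params())    # 1,025,000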
Implementation in Keras

separable convolution
keras.layers.SeparableConv2D

• In Keras, you can directly use SeparableConv2D to build MobileNet.


• SeparableConv2D = DepthwiseConv2D + Conv2D(1×1)

depthwise convolution (keras.layers.DepthwiseConv2D) + 1×1 convolution (keras.layers.Conv2D)
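A minimal sketch of that equivalence (the 32×32×16 input, 64 output channels, and 3×3 kernel are arbitrary assumptions): the fused SeparableConv2D layer reports the same parameter count as DepthwiseConv2D followed by a 1×1 Conv2D.

from keras import layers, models

inp = layers.Input(shape=(32, 32, 16))

# Two-step version: depthwise 3x3, then pointwise 1x1.
x = layers.DepthwiseConv2D(kernel_size=(3, 3), padding="same", use_bias=False)(inp)
x = layers.Conv2D(filters=64, kernel_size=(1, 1), use_bias=False)(x)
two_step = models.Model(inp, x)

# Fused version: SeparableConv2D = DepthwiseConv2D + Conv2D(1x1).
y = layers.SeparableConv2D(filters=64, kernel_size=(3, 3), padding="same", use_bias=False)(inp)
fused = models.Model(inp, y)

print(two_step.count_params())  # 3*3*16 + 16*64 = 1168
print(fused.count_params())     # 1168 as well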
Implementation in Keras

In [4]:
import keras
model = keras.applications.mobilenet.MobileNet(input_shape=None,
alpha=1.0, depth_multiplier=1, dropout=1e-3,
include_top=True, weights='imagenet',
input_tensor=None, pooling=None, classes=1000)
model.summary()
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_2 (InputLayer) (None, 224, 224, 3) 0
_________________________________________________________________
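A hedged usage sketch for the pretrained model created above ('dog.jpg' is a placeholder file name; preprocess_input and decode_predictions are the helper functions shipped with keras.applications.mobilenet):

import numpy as np
from keras.preprocessing import image
from keras.applications.mobilenet import preprocess_input, decode_predictions

img = image.load_img('dog.jpg', target_size=(224, 224))                 # placeholder path
x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))

preds = model.predict(x)                   # `model` is the MobileNet built above
print(decode_predictions(preds, top=5))    # top-5 ImageNet classes with scores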
Summary
CNN Architectures

• LeNet-5: 2 Conv + 2 FC layers.


• AlexNet: 5 Conv + 3 FC layers. classic architectures (sequential)
• VGG16: 13 Conv + 3 FC layers.
CNN Architectures

• LeNet-5: 2 Conv + 2 FC layers.


• AlexNet: 5 Conv + 3 FC layers. classic architectures (sequential)
• VGG16: 13 Conv + 3 FC layers.

Question: Can the classic architectures go deeper?


Answer: No.
• Deeper plain nets have worse training and test errors.
• Vanishing gradients make them hard to train.
CNN Architectures

• LeNet-5: 2 Conv + 2 FC layers.


• AlexNet: 5 Conv + 3 FC layers. classic architectures (sequential)
• VGG16: 13 Conv + 3 FC layers.

• Inception: 21 Conv + 1 FC layers.


modern architectures
• ResNet: Up to 151 Conv + 1 FC layers.
CNN Architectures

• LeNet-5: 2 Conv + 2 FC layers.


• AlexNet: 5 Conv + 3 FC layers. classic architectures (sequential)
• VGG16: 13 Conv + 3 FC layers.

• Inception: 21 Conv + 1 FC layers.


modern architectures
• ResNet: Up to 151 Conv + 1 FC layers.
Tricks:
• Inception: auxiliary output (gradient injection).
• ResNet: skip connection (sketched below).
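As a reminder of what a skip connection looks like, here is a minimal Keras sketch of one identity residual block (the 64 filters and 3×3 kernels are arbitrary; this illustrates the idea, not the exact block from the ResNet paper):

from keras import layers

def residual_block(x, filters=64):
    # y = F(x) + x: the skip connection lets the gradient bypass F.
    # Assumes x already has `filters` channels so the addition is valid.
    shortcut = x
    y = layers.Conv2D(filters, (3, 3), padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, (3, 3), padding="same")(y)
    y = layers.add([y, shortcut])            # the skip connection
    return layers.Activation("relu")(y)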
Reduce Memory and Computation

• Deploy a deep neural network to a smart device.


• Challenges:
• Computation ⇒ power consumption.
• Parameters ⇒ storage and memory.
Reduce Memory and Computation

• Deploy a deep neural network to a smart device.


• Challenges:
• Computation ⇒ power consumption.
• Parameters ⇒ storage and memory.
• MobileNet’s solutions:
• Separable Convolution instead of Tensor Convolution.
• AveragePool Layer instead of Flatten Layer. (As used in the Inception Net.)
