
Common CNN Architectures

Shusen Wang
Background

History of Neural Networks

[Timeline: 1943 – 1958 – 1960's – 1986 – 1989 – 1997 – 1998 – 2012]

• 1943: The first neural network.
• 1958: Perceptron.
• 1960's: Backpropagation (BP) was invented independently by different researchers.
• 1986: Hinton and others reinvented BP; it then became popular.
• 1989: Yann LeCun used convolution to solve image problems; he used BP to learn the filters.
• 1997: LSTM was invented.
• 1998: LeNet.
• 2012: AlexNet won the ImageNet Challenge.
• Also on the timeline: RNN was invented, and for a period neural networks were considered "dead" by most ML researchers.

Traditional Approaches

• SIFT Feature
• HOG Feature
• LBP Feature

Traditional Approaches

[Pipeline figure: hand-crafted features (SIFT, HOG, LBP) are fed into a linear model, which outputs the label, e.g., "dog".]
Classic CNN Architectures

LeNet-5 [Yann LeCun et al. 1998]

[Architecture figure: INPUT 32x32 → C1: feature maps 6@28x28 (convolutions) → S2: f. maps 6@14x14 (subsampling) → C3: f. maps 16@10x10 (convolutions) → S4: f. maps 16@5x5 (subsampling) → C5: layer 120 (full connection) → F6: layer 84 (full connection) → OUTPUT 10 (Gaussian connections)]

• Layers: 2 Conv + 2 FC layers.
• Paper: Gradient-based learning applied to document recognition
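
For reference, a minimal Keras sketch of the LeNet-5 layout shown above. The tanh activations and average-pooling subsampling follow the spirit of the original paper, but the Gaussian-connection output is replaced by a plain softmax, so treat this as an illustration rather than a faithful reimplementation.

from keras.layers import Input, Conv2D, AveragePooling2D, Flatten, Dense
from keras.models import Model

x_in = Input(shape=(32, 32, 1))                    # INPUT 32x32
z = Conv2D(6, (5, 5), activation="tanh")(x_in)     # C1: 6@28x28
z = AveragePooling2D(pool_size=(2, 2))(z)          # S2: 6@14x14
z = Conv2D(16, (5, 5), activation="tanh")(z)       # C3: 16@10x10
z = AveragePooling2D(pool_size=(2, 2))(z)          # S4: 16@5x5
z = Flatten()(z)
z = Dense(120, activation="tanh")(z)               # C5: 120
z = Dense(84, activation="tanh")(z)                # F6: 84
out = Dense(10, activation="softmax")(z)           # OUTPUT: 10
lenet5 = Model(x_in, out)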
AlexNet [Alex Krizhevsky et al. 2012]

• Layers: 5 Conv + 3 FC layers.
• Number of trainable parameters: 60M.
• Paper: ImageNet Classification with Deep Convolutional Neural Networks
• Keras implementation: http://dandxy89.github.io/ImageModels/
In [4]:

from keras.layers import (Input, Conv2D, MaxPooling2D, BatchNormalization,
                          ZeroPadding2D, Flatten, Dense, Dropout)
from keras.models import Model

DROPOUT = 0.5
N_CATEGORY = 100   # implied by the summary below (dense_5 has 100 outputs)

model_input = Input(shape = (227, 227, 3))

# First convolutional layer (96x11x11)
z = Conv2D(filters = 96, kernel_size = (11,11), strides = (4,4), activation = "relu")(model_input)
z = MaxPooling2D(pool_size = (3,3), strides = (2,2))(z)
z = BatchNormalization()(z)

# Second convolutional layer (256x5x5)
z = ZeroPadding2D(padding = (2,2))(z)
z = Conv2D(filters = 256, kernel_size = (5,5), strides = (1,1), activation = "relu")(z)
z = MaxPooling2D(pool_size = (3,3), strides = (2,2))(z)
z = BatchNormalization()(z)

# Remaining 3 convolutional layers
z = ZeroPadding2D(padding = (1,1))(z)
z = Conv2D(filters = 384, kernel_size = (3,3), strides = (1,1), activation = "relu")(z)

z = ZeroPadding2D(padding = (1,1))(z)
z = Conv2D(filters = 384, kernel_size = (3,3), strides = (1,1), activation = "relu")(z)

z = ZeroPadding2D(padding = (1,1))(z)
z = Conv2D(filters = 256, kernel_size = (3,3), strides = (1,1), activation = "relu")(z)

z = MaxPooling2D(pool_size = (3,3), strides = (2,2))(z)
z = Flatten()(z)

z = Dense(4096, activation = "relu")(z)
z = Dropout(DROPOUT)(z)

z = Dense(4096, activation = "relu")(z)
z = Dropout(DROPOUT)(z)

final_dim = 1 if N_CATEGORY == 2 else N_CATEGORY
final_act = "sigmoid" if N_CATEGORY == 2 else "softmax"
model_output = Dense(final_dim, activation = final_act)(z)

model = Model(model_input, model_output)
model.summary()

Layer (type)                 Output Shape              Param #
=================================================================
input_2 (InputLayer)         (None, 227, 227, 3)       0
conv2d_6 (Conv2D)            (None, 55, 55, 96)        34944
max_pooling2d_4 (MaxPooling2 (None, 27, 27, 96)        0
batch_normalization_3 (Batch (None, 27, 27, 96)        384
zero_padding2d_5 (ZeroPaddin (None, 31, 31, 96)        0
conv2d_7 (Conv2D)            (None, 27, 27, 256)       614656
max_pooling2d_5 (MaxPooling2 (None, 13, 13, 256)       0
batch_normalization_4 (Batch (None, 13, 13, 256)       1024
zero_padding2d_6 (ZeroPaddin (None, 15, 15, 256)       0
conv2d_8 (Conv2D)            (None, 13, 13, 384)       885120
zero_padding2d_7 (ZeroPaddin (None, 15, 15, 384)       0
conv2d_9 (Conv2D)            (None, 13, 13, 384)       1327488
zero_padding2d_8 (ZeroPaddin (None, 15, 15, 384)       0
conv2d_10 (Conv2D)           (None, 13, 13, 256)       884992
max_pooling2d_6 (MaxPooling2 (None, 6, 6, 256)         0
flatten_2 (Flatten)          (None, 9216)              0
dense_3 (Dense)              (None, 4096)              37752832
dropout_3 (Dropout)          (None, 4096)              0
dense_4 (Dense)              (None, 4096)              16781312
dropout_4 (Dropout)          (None, 4096)              0
dense_5 (Dense)              (None, 100)               409700
=================================================================
Total params: 58,692,452
Trainable params: 58,691,748
Non-trainable params: 704
VGG16 [Simonyan & Zisserman 2014]

• Layers: 13 Conv + 2 FC layers.
• Number of trainable parameters: 138M.
• Paper: Very Deep Convolutional Networks for Large-Scale Image Recognition

[Figure 5.19: Fine-tuning the last convolutional block of the VGG16 network. Conv blocks 1-4 (Convolution2D + MaxPooling2D stacks) are frozen; we fine-tune Conv block 5 and our own fully connected classifier (Flatten → Dense → Dense).]
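
Below is a minimal Keras sketch of the fine-tuning setup in Figure 5.19, assuming the built-in ImageNet weights of keras.applications.VGG16; the 256-unit Dense layer and the binary sigmoid output are illustrative assumptions, not the slide's exact classifier.

from keras.applications import VGG16
from keras.layers import Flatten, Dense
from keras.models import Model

base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))

# Freeze conv blocks 1-4; leave conv block 5 trainable for fine-tuning.
for layer in base.layers:
    layer.trainable = layer.name.startswith("block5")

# Our own fully connected classifier on top.
z = Flatten()(base.output)
z = Dense(256, activation="relu")(z)
out = Dense(1, activation="sigmoid")(z)

model = Model(base.input, out)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])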
“Modern” CNN Architectures
GoogLeNet/Inception [Szegedy et al. 2015]

• Stack of the "Inception" modules.
• Only 5M parameters. (VGG16 has 138M parameters!)

[Figure: a screenshot of the GoogLeNet paper, showing Figure 3 (the full network: stacked Inception modules built from parallel 1x1, 3x3, and 5x5 Conv and 3x3 MaxPool branches joined by DepthConcat, two auxiliary softmax heads, and a final AveragePool 7x7 → FC → softmax classifier) together with excerpts from Sections 6 (Training Methodology) and 7 (ILSVRC 2014 Classification Challenge).]
GoogLeNet/Inception [Szegedy et al. 2015]

• Stack of the "Inception" modules.
• Only 5M parameters. (VGG16 has 138M parameters!)

An Inception module:

[Figure: (a) Inception module, naive version - the previous layer feeds parallel 1x1, 3x3, and 5x5 convolutions plus a 3x3 max pooling branch, and the branch outputs are joined by filter concatenation. (b) Inception module with dimensionality reduction - 1x1 convolutions are added before the 3x3 and 5x5 convolutions and after the 3x3 max pooling. Shown next to the GoogLeNet paper screenshot.]
GoogLeNet/Inception [Szegedy et al. 2015]

An Inception module in Keras:

In [5]:

from keras.layers import Input, Conv2D, MaxPooling2D, concatenate

d1 = 100   # spatial height of the input feature map (assumed; only d2 is legible on the slide)
d2 = 100   # spatial width of the input feature map

x_input = Input(shape=(d1, d2, 64*4))

tower1 = Conv2D(64, (1,1), padding='same', activation='relu')(x_input)

tower2 = Conv2D(64, (1,1), padding='same', activation='relu')(x_input)
tower2 = Conv2D(64, (3,3), padding='same', activation='relu')(tower2)

tower3 = Conv2D(64, (1,1), padding='same', activation='relu')(x_input)
tower3 = Conv2D(64, (5,5), padding='same', activation='relu')(tower3)

tower4 = MaxPooling2D((3,3), strides=(1,1), padding='same')(x_input)
tower4 = Conv2D(64, (1,1), padding='same', activation='relu')(tower4)

x_output = concatenate([tower1, tower2, tower3, tower4], axis = 3)

Output: d1×d2×256 (the same shape as the input)

[Figure: the GoogLeNet paper screenshot with Figure 3 and the Inception module diagrams, shown alongside the code.]
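
Why the 1x1 convolutions before the 3x3 and 5x5 branches matter: they shrink the channel dimension before the expensive spatial convolutions. A back-of-the-envelope weight count for the 5x5 branch, using the channel sizes from the sketch above (biases ignored):

c_in, c_out, c_reduce = 256, 64, 64

direct_5x5 = 5*5 * c_in * c_out                              # 5x5 conv applied directly: 409,600 weights
with_1x1   = 1*1 * c_in * c_reduce + 5*5 * c_reduce * c_out  # 1x1 reduction, then 5x5: 16,384 + 102,400 = 118,784

# The dimensionality-reduced branch needs roughly 3.4x fewer weights (and multiplications).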
GoogLeNet/Inception [Szegedy et al. 2015]

Bottom layers:
• A simple ConvNet.
• 3 Conv + 2 Pooling + 2 Normalization layers.

[Figure: the bottom of the GoogLeNet diagram - input → Conv 7x7+2(S) → MaxPool 3x3+2(S) → LocalRespNorm → Conv 1x1+1(V) → Conv 3x3+1(S) → LocalRespNorm → MaxPool 3x3+2(S).]
GoogLeNet/Inception [Szegedy et al. 2015]

• Stack of 9 Inception modules.

[Figure: the middle of the GoogLeNet diagram - nine Inception modules (DepthConcat over parallel 1x1, 3x3, and 5x5 Conv and 3x3 MaxPool branches) interleaved with MaxPool 3x3+2(S) layers, shown next to the paper's Inception module diagrams (naive version and the version with dimensionality reduction).]
GoogLeNet/Inception [Szegedy et al. 2015]

Output layers:
• Using AveragePool instead of Flatten.
• The AveragePool layer reduces the number of parameters.

[Figure: the top of the GoogLeNet diagram - a 7×7×1024 feature map → AveragePool 7x7+1(V) → shape 1×1×1024 → FC → shape 1000 → SoftmaxActivation (softmax2).]

Question: If AveragePool is replaced by Flatten, then what will be #params in the FC layer?
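
A quick back-of-the-envelope answer (bias terms ignored):

# Weights in the final FC layer for the two ways of feeding the 7x7x1024 feature map into the 1000-way classifier.
avg_pool_in = 1 * 1 * 1024          # AveragePool output
flatten_in  = 7 * 7 * 1024          # Flatten output

print(avg_pool_in * 1000)           # 1,024,000 weights
print(flatten_in * 1000)            # 50,176,000 weights, about 49x more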
GoogLeNet/Inception [Szegedy et al. 2015]

Auxiliary outputs (softmax0 and softmax1):

During training:
• Inject additional gradient at lower layers.
• Make the optimization easier.

During test:
• Disable the auxiliary outputs.

[Figure: the GoogLeNet diagram with the two auxiliary classifier branches - AveragePool 5x5+3(V) → Conv 1x1+1(S) → FC → FC → SoftmaxActivation.]
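
A minimal Keras sketch of how auxiliary outputs can be wired up and then disabled. The toy backbone, layer sizes, and names are illustrative assumptions, not GoogLeNet's exact configuration; the 0.3 loss weight follows the paper's down-weighting of the auxiliary losses.

from keras.layers import Input, Conv2D, MaxPooling2D, GlobalAveragePooling2D, Dense
from keras.models import Model

x_in = Input(shape=(224, 224, 3))
z = Conv2D(64, (7, 7), strides=(2, 2), padding='same', activation='relu')(x_in)
z = MaxPooling2D((3, 3), strides=(2, 2))(z)
mid = Conv2D(128, (3, 3), padding='same', activation='relu')(z)      # a lower/middle stage

# Auxiliary head attached to the middle stage (used only during training).
aux = GlobalAveragePooling2D()(mid)
aux_out = Dense(1000, activation='softmax', name='aux_softmax')(aux)

# Main head on top of the deeper layers.
deep = Conv2D(256, (3, 3), padding='same', activation='relu')(mid)
main = GlobalAveragePooling2D()(deep)
main_out = Dense(1000, activation='softmax', name='main_softmax')(main)

# Training: both outputs contribute gradient; the auxiliary loss is down-weighted.
train_model = Model(x_in, [main_out, aux_out])
train_model.compile(optimizer='sgd',
                    loss='categorical_crossentropy',
                    loss_weights=[1.0, 0.3])

# Test: keep only the main output, i.e., the auxiliary head is disabled.
test_model = Model(x_in, main_out)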
How Deep Can We Go?

Classic architectures (sequential):
• LeNet-5: 2 Conv + 2 FC layers.
• AlexNet: 5 Conv + 3 FC layers.
• VGG16: 13 Conv + 2 FC layers.

Question: Why VGG16 and VGG19? Why not deeper classic architectures?
How Deep Can We Go?

Classic architectures (sequential):
• LeNet-5: 2 Conv + 2 FC layers.
• AlexNet: 5 Conv + 3 FC layers.
• VGG16: 13 Conv + 2 FC layers.

Question: Why VGG16 and VGG19? Why not deeper classic architectures?

Case Study: ResNet [He et al., 2015]
What happens when we continue stacking deeper layers on a "plain" convolutional neural network?
Answer: Deeper nets have worse training and test errors.

[Figure: training error and test error vs. iterations for 20-layer and 56-layer plain networks.]
How Deep Can We Go?

Classic architectures (sequential):
• LeNet-5: 2 Conv + 2 FC layers.
• AlexNet: 5 Conv + 3 FC layers.
• VGG16: 13 Conv + 2 FC layers.

Question: What makes deeper VGG nets worse?

Case Study: ResNet [He et al., 2015]
What happens when we continue stacking deeper layers on a "plain" convolutional neural network?
Answer: It is bad optimization, not overfitting.

[Figure: training error and test error vs. iterations for 20-layer and 56-layer plain networks.]
How Deep Can We Go?

Classic architectures (sequential):
• LeNet-5: 2 Conv + 2 FC layers.
• AlexNet: 5 Conv + 3 FC layers.
• VGG16: 13 Conv + 2 FC layers.

Question: What makes deeper VGG nets worse?
• The vanishing gradient problem.
• The derivative of the loss function w.r.t. a bottom layer can be vanishingly small.
• This makes deep nets difficult to optimize: the bottom layers are not well trained.
• The model is expressive enough, but the optimizer cannot find a good local minimum.
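
A toy numerical illustration (not from the slides): if each layer scales the backpropagated gradient by a factor slightly below 1 (0.8 is an assumed value), the gradient reaching the bottom layers shrinks exponentially with depth.

factor = 0.8
for depth in (20, 56):
    print(depth, factor ** depth)
# 20 layers: ~1.2e-02
# 56 layers: ~3.7e-06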
How Deep Can We Go?

Classic architectures (sequential):
• LeNet-5: 2 Conv + 2 FC layers.
• AlexNet: 5 Conv + 3 FC layers.
• VGG16: 13 Conv + 2 FC layers.

• Inception: 21 Conv + 1 FC layers.

Question: Why can Inception go deeper than VGG16?

[Figure: the GoogLeNet paper screenshot with the full network diagram.]
How Deep Can We Go?

Classic architectures (sequential):
• LeNet-5: 2 Conv + 2 FC layers.
• AlexNet: 5 Conv + 3 FC layers.
• VGG16: 13 Conv + 2 FC layers.

• Inception: 21 Conv + 1 FC layers.

Question: Why can Inception go deeper than VGG16?
• Auxiliary outputs inject additional gradient at lower layers.

[Figure: the GoogLeNet paper screenshot with the full network diagram, auxiliary softmax heads visible.]
How Deep Can We Go?

Classic architectures (sequential):
• LeNet-5: 2 Conv + 2 FC layers.
• AlexNet: 5 Conv + 3 FC layers.
• VGG16: 13 Conv + 2 FC layers.

Modern architectures:
• Inception: 21 Conv + 1 FC layers.
• ResNet: Up to 151 Conv + 1 FC layers.

ResNet's key idea: skip connections for curing the vanishing gradient.

[Figure from the ResNet paper: layer-by-layer comparison of VGG-19, a 34-layer plain network, and a 34-layer residual network (7x7 and 3x3 conv stacks with 64/128/256/512 filters, pooling, and a final avg pool → fc 1000).]
ResNet [He et al. 2015]

A block of ResNet:

[Figure 2. Residual learning: a building block - the input x passes through a weight layer, ReLU, and a second weight layer to produce F(x); the identity shortcut adds x, giving F(x) + x, followed by ReLU.]

[Figure: layer-by-layer comparison of VGG-19, a 34-layer plain network, and a 34-layer residual network.]
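
A minimal Keras sketch of the residual block above, assuming 3x3 convolutions with 64 filters; the BatchNormalization placement follows common practice and is an assumption rather than the slide's exact block.

from keras.layers import Input, Conv2D, BatchNormalization, Activation, add
from keras.models import Model

x_in = Input(shape=(56, 56, 64))

z = Conv2D(64, (3, 3), padding='same')(x_in)   # weight layer
z = BatchNormalization()(z)
z = Activation('relu')(z)                      # relu
z = Conv2D(64, (3, 3), padding='same')(z)      # weight layer
z = BatchNormalization()(z)                    # this is F(x)

z = add([z, x_in])                             # F(x) + x via the identity shortcut
z = Activation('relu')(z)                      # relu

res_block = Model(x_in, z)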
Winners of the ImageNet Challenge

[Figure: bar chart of ImageNet Challenge winners by year, showing the number of layers and the top-5 error rate of each winning model.]


Classification Error: Top-1 vs. Top-5

[Figure: an image labeled "wolf" is fed into a ConvNet, which outputs class probabilities.]

class      pred. probability
• husky    0.50
• wolf     0.20
• dog      0.18
• fox      0.08
• snow     0.01
• fur      0.01
• forest   0.01

Label: "wolf"

• Evaluated by the Top-1 error, this prediction is wrong!
• Evaluated by the Top-5 error, the prediction is correct!
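
A short sketch that checks the example above programmatically:

probs = {"husky": 0.50, "wolf": 0.20, "dog": 0.18, "fox": 0.08,
         "snow": 0.01, "fur": 0.01, "forest": 0.01}
ranked = sorted(probs, key=probs.get, reverse=True)

label = "wolf"
print(label == ranked[0])      # False -> counts as a Top-1 error
print(label in ranked[:5])     # True  -> correct under the Top-5 criterion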
Beyond the Accuracy

Deep Learning on the Edge

[Figure: example on-device applications of MobileNets - object detection, OCR & translation, face attributes, and speech recognition & translation. Photo by Juanedc (CC BY 2.0); Google Doodle by Sarah Harrison.]


Solution 1: Cloud Computing

[Figure, built up over several slides: the device sends its input to a server in the cloud, which runs the model and returns the result (e.g., "Bonjour").]
Solution 1: Cloud Computing

[Figure: a large ConvNet (the GoogLeNet paper's "Figure 3: GoogLeNet network with all the bells and whistles") running on the cloud server.]
• A linear layer with softmax loss as the classifier (pre- softmax2

dicting the same 1000 classes as the main classifier, but SoftmaxActivation

removed at inference time). FC

A schematic view of the resulting network is depicted in


AveragePool
7x7+1(V)

Figure 3. DepthConcat

Conv Conv Conv Conv


1x1+1(S) 3x3+1(S) 5x5+1(S) 1x1+1(S)

6. Training Methodology Conv


1x1+1(S)
Conv
1x1+1(S)
MaxPool
3x3+1(S)

GoogLeNet networks were trained using the DistBe- DepthConcat

lief [4] distributed machine learning system using mod- Conv


1x1+1(S)
Conv
3x3+1(S)
Conv
5x5+1(S)
Conv
1x1+1(S)
softmax1

est amount of model and data-parallelism. Although we Conv


1x1+1(S)
Conv
1x1+1(S)
MaxPool
3x3+1(S)
SoftmaxActivation

used a CPU based implementation only, a rough estimate MaxPool


3x3+2(S)
FC

suggests that the GoogLeNet network could be trained to DepthConcat FC

convergence using few high-end GPUs within a week, the


main limitation being the memory usage. Our training used
Conv Conv Conv Conv Conv
1x1+1(S) 3x3+1(S) 5x5+1(S) 1x1+1(S) 1x1+1(S)

asynchronous stochastic gradient descent with 0.9 momen- Conv


1x1+1(S)
Conv
1x1+1(S)
MaxPool
3x3+1(S)
AveragePool
5x5+3(V)

tum [17], fixed learning rate schedule (decreasing the learn- DepthConcat

ing rate by 4% every 8 epochs). Polyak averaging [13] was Conv


1x1+1(S)
Conv
3x3+1(S)
Conv
5x5+1(S)
Conv
1x1+1(S)

used to create the final model used at inference time. Conv Conv MaxPool
1x1+1(S) 1x1+1(S) 3x3+1(S)

Image sampling methods have changed substantially


over the months leading to the competition, and already
DepthConcat softmax0

converged models were trained on with other options, some-


Conv Conv Conv Conv
SoftmaxActivation
1x1+1(S) 3x3+1(S) 5x5+1(S) 1x1+1(S)

times in conjunction with changed hyperparameters, such Conv


1x1+1(S)
Conv
1x1+1(S)
MaxPool
3x3+1(S)
FC

as dropout and the learning rate. Therefore, it is hard to DepthConcat FC

give a definitive guidance to the most effective single way Conv


1x1+1(S)
Conv
3x3+1(S)
Conv
5x5+1(S)
Conv
1x1+1(S)
Conv
1x1+1(S)

to train these networks. To complicate matters further, some Conv Conv MaxPool AveragePool

of the models were mainly trained on smaller relative crops,


Name and ID
1x1+1(S) 1x1+1(S) 3x3+1(S) 5x5+3(V)

others on larger ones, inspired by [8]. Still, one prescrip- DepthConcat

tion that was verified to work very well after the competi- Conv
1x1+1(S)
Conv
3x3+1(S)
Conv
5x5+1(S)
Conv
1x1+1(S)

tion, includes sampling of various sized patches of the im- Conv


1x1+1(S)
Conv
1x1+1(S)
MaxPool
3x3+1(S)

age whose size is distributed evenly between 8% and 100% MaxPool


3x3+2(S)

of the image area with aspect ratio constrained to the inter- DepthConcat

val [ 34 , 43 ]. Also, we found that the photometric distortions Conv Conv Conv Conv

of Andrew Howard [8] were useful to combat overfitting to 1x1+1(S) 3x3+1(S) 5x5+1(S) 1x1+1(S)

the imaging conditions of training data. Conv


1x1+1(S)
Conv
1x1+1(S)
MaxPool
3x3+1(S)

DepthConcat

7. ILSVRC 2014 Classification Challenge Conv


1x1+1(S)
Conv
3x3+1(S)
Conv
5x5+1(S)
Conv
1x1+1(S)

Setup and Results Conv


1x1+1(S)
Conv
1x1+1(S)
MaxPool
3x3+1(S)

The ILSVRC 2014 classification challenge involves the


MaxPool
3x3+2(S)

task of classifying the image into one of 1000 leaf-node cat- LocalRespNorm

egories in the Imagenet hierarchy. There are about 1.2 mil- Conv
3x3+1(S)

lion images for training, 50,000 for validation and 100,000 Conv
1x1+1(V)

images for testing. Each image is associated with one LocalRespNorm

ground truth category, and performance is measured based MaxPool

on the highest scoring classifier predictions. Two num-


3x3+2(S)

bers are usually reported: the top-1 accuracy rate, which


Conv
7x7+2(S)

compares the ground truth against the first predicted class, input

and the top-5 error rate, which compares the ground truth
against the first 5 predicted classes: an image is deemed Figure 3: GoogLeNet network with all the bells and whistles.
correctly classified if the ground truth is among the top-5,
Solution 1: Cloud Computing

Downsides:
1. Sending data between the client and the server is relatively slow (compared with local computation on the device).
2. Sending data through a 3G/4G/5G network may incur charges.
3. The user's privacy is at risk.
Solution 2: Edge Computing

Run a ConvNet on the chip to make predictions.
(The ConvNet has been trained on the server.)
Solution 2: Edge Computing

Advantages:
1. Faster. (Computation is typically faster than communication.)
2. Avoid sending and receiving data.
   • Save money.
   • Works without internet.
3. Protect the user's privacy.

Run a ConvNet on the chip to make predictions (e.g., "Shiba Inu").
(The ConvNet has been trained on the server.)
Solution 2: Edge Computing

Advantages:
1. Faster. (Computation is typically faster than communication.)
2. Avoid sending and receiving data.
   • Save money.
   • Works without internet.
3. Protect the user's privacy.

Disadvantages:
1. Energy consumption. (Heavy matrix and tensor computation.)
2. Costs local memory and storage. (Deep ConvNets have at least millions of parameters.)

Run a ConvNet on the chip to make predictions (e.g., "Shiba Inu").
(The ConvNet has been trained on the server.)
Cloud Computing vs. Edge Computing

• Edge computing is more popular for deep learning tasks (at present).
• Research on deep learning at the edge targets:
  1. Fewer floating-point operations (FLOPs), and thus less energy consumption.
  2. Fewer network parameters, and thus less memory and storage.
MobileNet [Howard et al. 2017]

• Motivations
• Small enough to fit in an iPhone (just about 4M parameters).
• Less computation than the standard ConvNets.
• Key idea: Depthwise separable convolution.
• Paper: https://arxiv.org/pdf/1704.04861.pdf
• Further reading:
• http://machinethink.net/blog/googles-mobile-net-architecture-on-iphone/
• https://medium.com/@yu4u/why-mobilenet-and-its-variants-e-g-shufflenet-are-fast-1c7048b9618d
Convolution

convolution: keras.layers.Conv2D

One w_H×w_W×n_C filter is applied to an n_H×n_W×n_C input.
(Notation: n_H×n_W is the spatial size, n_C is the number of channels, and w_H×w_W is the filter's spatial size.)

In this example, n_C = 3.
Convolution

convolution: keras.layers.Conv2D
  One w_H×w_W×n_C filter
  n_H×n_W×n_C input → n_H×n_W×1 output

depthwise convolution: keras.layers.DepthwiseConv2D
  One w_H×w_W filter
  n_H×n_W×n_C input
Convolution

convolution: keras.layers.Conv2D
  One w_H×w_W×n_C filter
  n_H×n_W×n_C input → n_H×n_W×1 output

depthwise convolution: keras.layers.DepthwiseConv2D
  One w_H×w_W filter
  n_H×n_W×n_C input
  For j = 1 to n_C (each of the slices):
  • Compute the convolution of the j-th slice (shape n_H×n_W) and the filter (shape w_H×w_W).
  • Output: a matrix (shape n_H×n_W).
Convolution

convolution: keras.layers.Conv2D
  One w_H×w_W×n_C filter
  n_H×n_W×n_C input → n_H×n_W×1 output

depthwise convolution: keras.layers.DepthwiseConv2D
  One w_H×w_W filter
  n_H×n_W×n_C input → n_H×n_W×n_C output
  For j = 1 to n_C (each of the slices):
  • Compute the convolution of the j-th slice (shape n_H×n_W) and the filter (shape w_H×w_W).
  • Output: a matrix (shape n_H×n_W).
  Thus the final output is n_H×n_W×n_C.
Convolution

convolution: keras.layers.Conv2D
  One w_H×w_W×n_C filter
  n_H×n_W×n_C input → n_H×n_W×1 output

depthwise convolution: keras.layers.DepthwiseConv2D
  One w_H×w_W filter
  n_H×n_W×n_C input → n_H×n_W×n_C output
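To make the shape bookkeeping above concrete, here is a minimal Keras sketch (the 28×28×3 input and 5×5 filter size are arbitrary assumptions) comparing a standard convolution, which uses a single w_H×w_W×n_C filter, with a depthwise convolution; "same" padding keeps the spatial size n_H×n_W, as on the slides:

from keras import layers, models

inp = layers.Input(shape=(28, 28, 3))   # n_H = 28, n_W = 28, n_C = 3

# Standard convolution: one 5x5x3 filter -> a single 28x28x1 output slice.
std = layers.Conv2D(filters=1, kernel_size=(5, 5), padding="same", use_bias=False)(inp)

# Depthwise convolution: a 5x5 filter convolved with each input slice -> 28x28x3 output.
dw = layers.DepthwiseConv2D(kernel_size=(5, 5), padding="same", use_bias=False)(inp)

print(models.Model(inp, std).output_shape)  # (None, 28, 28, 1)
print(models.Model(inp, dw).output_shape)   # (None, 28, 28, 3)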
MobileNet: Depthwise Conv + 1×1 Conv

depthwise convolution: keras.layers.DepthwiseConv2D
  One w_H×w_W filter
  n_H×n_W×n_C input → n_H×n_W×n_C output

1×1 convolution: keras.layers.Conv2D
  One 1×1×n_C filter
  n_H×n_W×n_C input → n_H×n_W×1 output
MobileNet: Depthwise Separable Conv + 1×1 Conv

Standard convolution (keras.layers.Conv2D):
  n_H×n_W×n_C tensor → n_H×n_W×1 matrix

Depthwise separable replacement:
  n_H×n_W×n_C tensor
  → depthwise convolution (keras.layers.DepthwiseConv2D) → n_H×n_W×n_C tensor
  → 1×1 convolution (keras.layers.Conv2D) → n_H×n_W×1 matrix
MobileNet: Depthwise Separable Conv + 1×1 Conv

Standard convolution (keras.layers.Conv2D):
  n_H×n_W×n_C tensor → n_H×n_W×1 matrix
  w_H·w_W·n_C parameters (one filter)

Depthwise separable replacement:
  n_H×n_W×n_C tensor
  → depthwise convolution (keras.layers.DepthwiseConv2D): w_H·w_W parameters (one w_H×w_W filter)
  → n_H×n_W×n_C tensor
  → 1×1 convolution (keras.layers.Conv2D): n_C parameters (one 1×1×n_C filter)
  → n_H×n_W×1 matrix
MobileNet: Depthwise Separable Conv + 1×1 Conv

Standard convolution (keras.layers.Conv2D):
  n_H×n_W×n_C tensor → n_H×n_W×1 matrix
  n_H·n_W (#patches) × w_H·w_W·n_C (patch size) scalar multiplications

Depthwise separable replacement:
  n_H×n_W×n_C tensor
  → depthwise convolution (keras.layers.DepthwiseConv2D): n_H·n_W·n_C·w_H·w_W scalar multiplications
  → n_H×n_W×n_C tensor
  → 1×1 convolution (keras.layers.Conv2D): n_H·n_W·n_C scalar multiplications
  → n_H×n_W×1 matrix
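Plugging in concrete numbers shows the saving. The sketch below is plain Python with assumed sizes (14×14×512 input, 3×3 filters, 512 output channels; these numbers are illustrative, not from the slides), and it extends the per-filter counts above to a full layer, where the single depthwise pass is shared by all of the 1×1 filters:

# Assumed sizes for illustration: n_H = n_W = 14, n_C = 512, w_H = w_W = 3,
# and n_out = 512 output channels.
n_H, n_W, n_C, n_out = 14, 14, 512, 512
w_H, w_W = 3, 3

# Standard convolution: n_out filters, each of shape w_H x w_W x n_C.
standard_mults = n_H * n_W * w_H * w_W * n_C * n_out

# Depthwise separable: one shared depthwise pass + n_out 1x1xn_C filters.
separable_mults = n_H * n_W * n_C * w_H * w_W + n_H * n_W * n_C * n_out

print(standard_mults)                     # 462,422,016
print(separable_mults)                    # 52,283,392
print(standard_mults / separable_mults)   # roughly 8.8x fewer multiplications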
MobileNet
Input: 224×224×3 tensor
Conv2D

112×112×64 tensor
Halve the width and height, double the depth.
56×56×128 tensor

28×28×256 tensor

14×14×512 tensor

7×7×1024 tensor
Why not use “Flatten”?
AveragePool

1024 vector
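The answer in one calculation: flattening the 7×7×1024 feature map gives a 50,176-dimensional vector, so a following 1000-way dense layer alone would need about 50M weights, whereas average pooling collapses the map to a 1024-vector with no parameters at all. A minimal Keras sketch of the two heads (the 1000-class output is an assumption, matching ImageNet):

from keras import layers, models

feat = layers.Input(shape=(7, 7, 1024))     # the last 7x7x1024 feature map

# Head 1: Flatten -> Dense(1000): 7*7*1024*1000 + 1000 ≈ 50.2M parameters.
flat_head = layers.Dense(1000, activation="softmax")(layers.Flatten()(feat))

# Head 2: GlobalAveragePooling2D -> Dense(1000): 1024*1000 + 1000 ≈ 1.0M parameters.
gap_head = layers.Dense(1000, activation="softmax")(layers.GlobalAveragePooling2D()(feat))

print(models.Model(feat, flat_head).count_params())   # 50,177,000
print(models.Model(feat, gap_head).count_params())    # 1,025,000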
Implementation in Keras

separable convolution
keras.layers.SeparableConv2D

• In Keras, you can directly use SeparableConv2D to build MobileNet.


• SeparableConv2D = DepthwiseConv2D + Conv2D(1×1)

depthwise convolution (keras.layers.DepthwiseConv2D) + 1×1 convolution (keras.layers.Conv2D)
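A minimal sketch of that equivalence (the 32×32×16 input, 64 output channels, and 3×3 kernel are arbitrary assumptions): the fused SeparableConv2D layer reports the same parameter count as DepthwiseConv2D followed by a 1×1 Conv2D.

from keras import layers, models

inp = layers.Input(shape=(32, 32, 16))

# Two-step version: depthwise 3x3, then pointwise 1x1.
x = layers.DepthwiseConv2D(kernel_size=(3, 3), padding="same", use_bias=False)(inp)
x = layers.Conv2D(filters=64, kernel_size=(1, 1), use_bias=False)(x)
two_step = models.Model(inp, x)

# Fused version: SeparableConv2D = DepthwiseConv2D + Conv2D(1x1).
y = layers.SeparableConv2D(filters=64, kernel_size=(3, 3), padding="same", use_bias=False)(inp)
fused = models.Model(inp, y)

print(two_step.count_params())  # 3*3*16 + 16*64 = 1168
print(fused.count_params())     # 1168 as well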
Implementation in Keras

In [4]:
import keras
model = keras.applications.mobilenet.MobileNet(input_shape=None,
alpha=1.0, depth_multiplier=1, dropout=1e-3,
include_top=True, weights='imagenet',
input_tensor=None, pooling=None, classes=1000)
model.summary()
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_2 (InputLayer) (None, 224, 224, 3) 0
_________________________________________________________________
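A hedged usage sketch for the pretrained model created above ('dog.jpg' is a placeholder file name; preprocess_input and decode_predictions are the helper functions shipped with keras.applications.mobilenet):

import numpy as np
from keras.preprocessing import image
from keras.applications.mobilenet import preprocess_input, decode_predictions

img = image.load_img('dog.jpg', target_size=(224, 224))                 # placeholder path
x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))

preds = model.predict(x)                   # `model` is the MobileNet built above
print(decode_predictions(preds, top=5))    # top-5 ImageNet classes with scores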
Summary
CNN Architectures

• LeNet-5: 2 Conv + 2 FC layers.


• AlexNet: 5 Conv + 3 FC layers. classic architectures (sequential)
• VGG16: 13 Conv + 3 FC layers.
CNN Architectures

• LeNet-5: 2 Conv + 2 FC layers.


• AlexNet: 5 Conv + 3 FC layers. classic architectures (sequential)
• VGG16: 13 Conv + 3 FC layers.

Question: Can the classic architectures go deeper?


Answer: No.
• Deeper plain nets have worse training and test errors.
• Vanishing gradients make them hard to train.
CNN Architectures

• LeNet-5: 2 Conv + 2 FC layers.


• AlexNet: 5 Conv + 3 FC layers. classic architectures (sequential)
• VGG16: 13 Conv + 3 FC layers.

• Inception: 21 Conv + 1 FC layers.


modern architectures
• ResNet: Up to 151 Conv + 1 FC layers.
CNN Architectures

• LeNet-5: 2 Conv + 2 FC layers.


• AlexNet: 5 Conv + 3 FC layers. classic architectures (sequential)
• VGG16: 13 Conv + 3 FC layers.

• Inception: 21 Conv + 1 FC layers.


modern architectures
• ResNet: Up to 151 Conv + 1 FC layers.
Tricks:
• Inception: auxiliary output (gradient injection).
• ResNet: skip connection (sketched below).
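As a reminder of what a skip connection looks like, here is a minimal Keras sketch of one identity residual block (the 64 filters and 3×3 kernels are arbitrary; this illustrates the idea, not the exact block from the ResNet paper):

from keras import layers

def residual_block(x, filters=64):
    # y = F(x) + x: the skip connection lets the gradient bypass F.
    # Assumes x already has `filters` channels so the addition is valid.
    shortcut = x
    y = layers.Conv2D(filters, (3, 3), padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, (3, 3), padding="same")(y)
    y = layers.add([y, shortcut])            # the skip connection
    return layers.Activation("relu")(y)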
Reduce Memory and Computation

• Deploy a deep neural network to a smart device.


• Challenges:
• Computation ⇒ power consumption.
• Parameters ⇒ storage and memory.
Reduce Memory and Computation

• Deploy a deep neural network to a smart device.


• Challenges:
• Computation ⇒ power consumption.
• Parameters ⇒ storage and memory.
• MobileNet’s solutions:
• Separable Convolution instead of Tensor Convolution.
• AveragePool Layer instead of Flatten Layer. (As used in the Inception Net.)
