09 ConvNet Architectures
Abir Das
IIT Kharagpur
Agenda
LeNet-5
Review: LeNet-5
[LeCun et al., 1998]
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 9 - 6 May 1, 2018
§ Citation count of the paper as of Jan 31, 2021: 33,759
Source: CS231n course, Stanford University
AlexNet
Architecture:
CONV1
MAX POOL1
NORM1
CONV2
MAX POOL2
NORM2
CONV3
CONV4
CONV5
MAX POOL3
FC6
FC7
FC8
AlexNet
Input: 227x227x3 images
First layer (CONV1): 96 11x11 filters applied at stride 4
Q: What is the output volume size? Hint: (227-11)/4+1 = 55
Output volume: [55x55x96]
Q: What is the total number of parameters in this layer?
Parameters: (11*11*3)*96 = 35K
AlexNet
After CONV1: 55x55x96
Second layer (POOL1): 3x3 filters applied at stride 2
Q: What is the output volume size? Hint: (55-3)/2+1 = 27
Output volume: 27x27x96
Q: What is the number of parameters in this layer?
Parameters: 0!
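The CONV1 and POOL1 calculations above follow one formula; a quick sketch to check them (the function names are ours, not from the slides):

```python
def conv_output_size(input_size, filter_size, stride, pad=0):
    """Spatial output size of a conv/pool layer: (W - F + 2P) / S + 1."""
    return (input_size - filter_size + 2 * pad) // stride + 1

def conv_params(filter_size, in_depth, num_filters):
    """Weight count of a conv layer (biases ignored, as on the slide)."""
    return filter_size * filter_size * in_depth * num_filters

# CONV1: 96 11x11 filters at stride 4 on a 227x227x3 input
print(conv_output_size(227, 11, 4))  # 55 -> output volume 55x55x96
print(conv_params(11, 3, 96))        # 34848 (~35K)

# POOL1: 3x3 filters at stride 2 on the 55x55x96 volume
print(conv_output_size(55, 3, 2))    # 27 -> output volume 27x27x96; 0 parameters
```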
Imagenet Leaderboard
ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners
[Figure: winners by year; ZFNet improved hyperparameters over AlexNet; VGG has 19 layers, GoogLeNet 22 layers, the ResNet variants 152 layers]
ZFNet
AlexNet but:
CONV1: change from (11x11 stride 4) to (7x7 stride 2)
CONV3,4,5: instead of 384, 384, 256 filters use 512, 1024, 512
ImageNet top-5 error: 16.4% -> 11.7%
§ Citation count of the paper as of Jan 31, 2021: 11,396
VGG
Case Study: VGGNet
[Simonyan and Zisserman, 2014]
8 layers (AlexNet) -> 16-19 layers (VGGNet)
§ Receptive field: the region of the input space that a particular CNN feature (activation value) is computed from, i.e., the region it is "looking at"
VGG
§ Stacking 3x3 convolutions grows the receptive field: one layer sees 3 × 3, two layers see 5 × 5, three layers see 7 × 7 (the same as a single 7x7 conv, but with fewer parameters)
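The growth of the receptive field under stacked 3x3 convolutions can be checked with a small sketch (the helper below is our own illustration):

```python
def receptive_field(kernel_sizes):
    """Receptive field of a stack of stride-1 conv layers."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1  # each stride-1 layer adds (k - 1) to the field
    return rf

print(receptive_field([3]))        # 3
print(receptive_field([3, 3]))     # 5
print(receptive_field([3, 3, 3]))  # 7, same as a single 7x7 conv
```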
VGG
TOTAL memory: 15.2M * 4 bytes ~= 58MB / image (only forward! ~*2 for bwd)
TOTAL params: 138M parameters
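The 138M figure can be reproduced from the standard VGG16 configuration; the sketch below (our own check, counting weights and biases; not from the slides) is one way to verify it:

```python
# Standard VGG16 configuration: 13 convs (all 3x3) and 3 FC layers.
convs = [(3, 64), (64, 64),                   # conv1_x
         (64, 128), (128, 128),               # conv2_x
         (128, 256), (256, 256), (256, 256),  # conv3_x
         (256, 512), (512, 512), (512, 512),  # conv4_x
         (512, 512), (512, 512), (512, 512)]  # conv5_x
fcs = [(7 * 7 * 512, 4096), (4096, 4096), (4096, 1000)]

total = sum(3 * 3 * cin * cout + cout for cin, cout in convs)
total += sum(fin * fout + fout for fin, fout in fcs)
print(total)  # 138357544, the "138M parameters" on the slide
```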
Abir Das (IIT Kharagpur), CS60010, Feb 01 and 02, 2021
VGG
Details:
- ILSVRC'14: 2nd in classification, 1st in localization
- Similar training procedure as Krizhevsky 2012
- No Local Response Normalisation (LRN)
- Use VGG16 or VGG19 (VGG19 only slightly better, more memory)
- Use ensembles for best results
- FC7 features generalize well to other tasks
Video Classification
Examples from the UCF-101 dataset: ApplyEyeMakeup, CuttingInKitchen, BalanceBeam, TableTennisShot.
C3D
INPUT: [16x112x112x3] memory: 16*112*112*3=602K params: 0 (not counting biases)
CONV1a: [16x112x112x64] memory: 16*112*112*64=12.8M params: 3*(3*3*3)*64 = 5,184
POOL1: [16x56x56x64] memory: 16*56*56*64=3.2M params: 0
CONV2a: [16x56x56x128] memory: 16*56*56*128=6.4M params: 64*(3*3*3)*128 = 221,184
POOL2: [8x28x28x128] memory: 8*28*28*128=802K params: 0
CONV3a: [8x28x28x256] memory: 8*28*28*256=1.6M params: 128*(3*3*3)*256 = 884,736
CONV3b: [8x28x28x256] memory: 8*28*28*256=1.6M params: 256*(3*3*3)*256 = 1,769,472
POOL3: [4x14x14x256] memory: 4*14*14*256=200K params: 0
CONV4a: [4x14x14x512] memory: 4*14*14*512=401K params: 256*(3*3*3)*512 = 3,538,944
CONV4b: [4x14x14x512] memory: 4*14*14*512=401K params: 512*(3*3*3)*512 = 7,077,888
POOL4: [2x7x7x512] memory: 2*7*7*512=50K params: 0
CONV5a: [2x7x7x512] memory: 2*7*7*512=50K params: 512*(3*3*3)*512 = 7,077,888
CONV5b: [2x7x7x512] memory: 2*7*7*512=50K params: 512*(3*3*3)*512 = 7,077,888
POOL5: [1x4x4x512] memory: 1*4*4*512=8,192 params: 0
FC6: [4096] memory: 4096 params: 4*4*512*4096 = 33,554,432
FC7: [4096] memory: 4096 params: 4096*4096 = 16,777,216
FC8: [101] memory: 101 params: 4096*101 = 413,696
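A few of the parameter counts in the table can be spot-checked with a short sketch (our own helper, assuming 3x3x3 kernels and no biases, as in the table):

```python
def conv3d_params(in_depth, out_depth, k=3):
    """Weights of a 3D conv layer with a k x k x k kernel."""
    return in_depth * k ** 3 * out_depth

print(conv3d_params(3, 64))     # 5184      (CONV1a)
print(conv3d_params(64, 128))   # 221184    (CONV2a)
print(conv3d_params(512, 512))  # 7077888   (CONV4b, CONV5a, CONV5b)
print(4 * 4 * 512 * 4096)       # 33554432  (FC6)
```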
GoogLeNet
ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners
[Figure: winners by year; VGG has 19 layers, GoogLeNet 22 layers]
§ Citation count of the paper as of Jan 31, 2021: 27,825
GoogLeNet
1x1 convolutions: a 1x1 CONV with 32 filters on a 56x56x64 input (each filter has size 1x1x64 and performs a 64-dimensional dot product) preserves spatial dimensions but reduces depth: the output is 56x56x32.
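Since a 1x1 convolution is just a per-pixel dot product over depth, it can be sketched as a matrix multiply (our own NumPy illustration, not from the slides):

```python
import numpy as np

x = np.random.randn(56, 56, 64)  # input volume 56x56x64
w = np.random.randn(64, 32)      # 32 filters, each of size 1x1x64

y = x @ w                        # 64-dim dot product at each of the 56x56 positions
print(y.shape)                   # (56, 56, 32): spatial dims preserved, depth reduced
```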
GoogLeNet
- 22 layers
- Efficient "Inception" module
- No FC layers
- Only 5 million parameters! (12x less than AlexNet)
- ILSVRC'14 classification winner (6.7% top-5 error)
GoogLeNet
Naive Inception module:
Module input: 28x28x256
Parallel branches: 1x1 conv, 128 filters -> 28x28x128; 3x3 conv, 192 filters -> 28x28x192; 5x5 conv, 96 filters -> 28x28x96; 3x3 max pool -> 28x28x256
Concatenated output: 28x28x(128+192+96+256) = 28x28x672
Problem: the output depth (672) is larger than the input depth (256), so depth, memory and compute grow after every module.
GoogLeNet
Solution: "bottleneck" layers that use 1x1 convolutions to reduce feature depth
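To see why the bottleneck helps, one can count multiplies per branch; the sketch below (our own arithmetic, reusing the filter counts from the naive module) compares the naive 5x5 branch with a bottlenecked one:

```python
H = W = 28  # spatial size of the module input (28x28x256)

def conv_ops(num_filters, k, in_depth):
    # multiply count = output positions x filter volume
    return H * W * num_filters * k * k * in_depth

naive = conv_ops(128, 1, 256) + conv_ops(192, 3, 256) + conv_ops(96, 5, 256)
print(naive)  # 854196224 multiplies for the naive module's conv branches

# A 1x1 bottleneck down to 64 channels makes the 5x5 branch far cheaper:
bottlenecked = conv_ops(64, 1, 256) + conv_ops(96, 5, 64)
print(conv_ops(96, 5, 256), bottlenecked)  # 481689600 vs 133267456
```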
GoogLeNet
Case Study: GoogLeNet
Full GoogLeNet architecture:
- Stem network: Conv-Pool-2x Conv-Pool
- Stacked Inception modules: parallel 1x1 / 3x3 / 5x5 convs and a 3x3 max pool, joined by DepthConcat
- Classifier output: AveragePool, a single FC layer, and softmax (removed expensive FC layers!)
- Auxiliary classification outputs to inject additional gradient at lower layers
ResNet
ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners: "Revolution of Depth"
[Figure: winners by year; 152-layer ResNets vs 19-layer VGG and 22-layer GoogLeNet]
§ Citation count of the paper as of Jan 31, 2021: 68,522
ResNet
Very deep networks using residual connections
- 152-layer model for ImageNet
- ILSVRC'15 classification winner (3.57% top-5 error)
- Swept all classification and detection competitions in ILSVRC'15 and COCO'15!
[Figure: residual block computing F(x) + x, with an identity shortcut around two relu layers]
ResNet
Solution: use network layers to fit a residual mapping instead of directly trying to fit the desired underlying mapping:
H(x) = F(x) + x, i.e. F(x) = H(x) - x
"Plain" layers fit H(x) directly; a residual block uses its layers to fit the residual F(x), then adds the identity shortcut x before the final relu.
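A residual block can be sketched in a few lines; the toy below (our own, with fully-connected layers standing in for convolutions) shows that with zero weights the block reduces to the identity, which hints at why deep residual nets are easy to optimize:

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def residual_block(x, w1, w2):
    f = w2 @ relu(w1 @ x)  # F(x): two weight layers
    return relu(f + x)     # add the identity shortcut, then relu

# With all-zero weights, F(x) = 0 and the block is the identity (for x >= 0):
x = np.array([1.0, 2.0, 3.0])
w = np.zeros((3, 3))
print(residual_block(x, w, w))  # [1. 2. 3.]
```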
ResNet
[Figure 6, He et al., 2015: Training on CIFAR-10. Dashed lines denote training error, bold lines denote testing error. Left: plain networks (plain-20/32/44/56); the error of plain-110 is higher than 60% and not displayed. Middle: ResNets (ResNet-20/32/44/56/110). Right: ResNets with 110 and 1202 layers.]
[Figure, He et al., 2015: standard deviations of layer responses vs. layer index for plain-20/56 and ResNet-20/56/110.]
Table 7, He et al., 2015: object detection mAP (%) on the PASCAL VOC 2007 test set (training data: 07+12):
VGG-16: 73.2
ResNet-101: 76.4
Source: He et al., 2015
ResNet
Experimental Results
- Able to train very deep networks without degradation (152 layers on ImageNet, 1202 on CIFAR)
- Deeper networks now achieve lower training error, as expected
- Swept 1st place in all ILSVRC and COCO 2015 competitions
MobileNet-v1
§ Consider a 1 × 1 convolution applied to a D_F × D_F × M input feature map, producing a D_F × D_F × N output
§ Computational cost: 1 · 1 · M · D_F · D_F · N = D_F · D_F · M · N
§ What is the number of parameters? 1 · 1 · M · N
§ This operation is called 1 × 1 pointwise convolution
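The pointwise-convolution cost above is one half of MobileNet's depthwise separable convolution; a short sketch (our own example values for D_K, D_F, M, N) compares it with a standard convolution:

```python
DK, DF, M, N = 3, 14, 512, 512  # kernel size, feature-map size, in/out channels

standard  = DK * DK * M * N * DF * DF  # standard conv: D_K^2 * M * N * D_F^2
depthwise = DK * DK * M * DF * DF      # one D_K x D_K filter per input channel
pointwise = 1 * 1 * M * N * DF * DF    # the 1x1 pointwise conv above

print(standard, depthwise + pointwise)
print((depthwise + pointwise) / standard)  # = 1/N + 1/D_K^2, about 0.113
```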
MobileNet-v1 Structure
Experimental Results