
CNN Architectures

Common Performance Metrics
• Top-1 score: check whether a sample's top class (i.e. the one with the highest probability) matches its target label
• Top-5 score: check whether the target label is among the 5 predictions with the highest probabilities
• → Top-5 error: percentage of test samples for which the correct class was not among the top 5 predicted classes
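
A minimal sketch of how these metrics could be computed with NumPy (the array names probs and labels are illustrative, not from the slides):

import numpy as np

def top_k_error(probs, labels, k=5):
    """probs: (N, C) predicted class probabilities, labels: (N,) true class indices."""
    # indices of the k highest-probability classes per sample
    top_k = np.argsort(probs, axis=1)[:, -k:]
    # a sample counts as correct if its true label appears among those k classes
    correct = (top_k == labels[:, None]).any(axis=1)
    return 1.0 - correct.mean()

# top-1 and top-5 error on some random predictions
probs = np.random.rand(8, 1000)
probs /= probs.sum(axis=1, keepdims=True)
labels = np.random.randint(0, 1000, size=8)
print(top_k_error(probs, labels, k=1), top_k_error(probs, labels, k=5))
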

AlexNet

[Krizhevsky et al. NIPS’12] AlexNet


AlexNet

• The first conv layer uses a stride of 4 to reduce the spatial size significantly


• 96 filters
Q: What is the total number of parameters in this layer?
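
A quick check of this layer's parameter count, assuming AlexNet's standard 11x11 kernels over a 3-channel RGB input (the kernel size is not stated in the text above):

# 96 filters of size 11x11x3, each with one bias term
kernel_h, kernel_w, in_channels, num_filters = 11, 11, 3, 96
weights = kernel_h * kernel_w * in_channels * num_filters  # 34,848
biases = num_filters                                       # 96
print(weights + biases)                                    # 34,944 parameters
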
AlexNet

• Use of same convolutions
• As with LeNet: Width, Height ↓, Number of Filters ↑

AlexNet

• Softmax output layer for 1000 classes


AlexNet
• Similar to LeNet but much bigger (~1000 times)

• Use of ReLU instead of tanh/sigmoid

60M parameters
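
As a rough sanity check, the parameter count of torchvision's reference AlexNet (a slightly simplified variant of the original 2012 model) lands in the same ballpark:

import torchvision

model = torchvision.models.alexnet(weights=None)  # untrained reference implementation
num_params = sum(p.numel() for p in model.parameters())
print(f"{num_params / 1e6:.1f}M parameters")      # roughly 61M
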

VGGNet
• Striving for simplicity

• CONV = 3x3 filters with stride 1, same convolutions

• MAXPOOL = 2x2 filters with stride 2

[Simonyan and Zisserman ICLR’15] VGGNet
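
A minimal sketch of one VGG-style stage in PyTorch, following these conventions (3x3 same convolutions with stride 1, then a 2x2 max pool with stride 2); the channel counts are illustrative:

import torch.nn as nn

def vgg_block(in_channels, out_channels, num_convs=2):
    # num_convs 3x3 "same" convolutions, each followed by ReLU, then one 2x2 max pool
    layers = []
    for _ in range(num_convs):
        layers += [nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=1, padding=1),
                   nn.ReLU(inplace=True)]
        in_channels = out_channels
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

block = vgg_block(3, 64)  # e.g. the first VGG-16 stage: 224x224x3 -> 112x112x64
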


VGGNet (Conv = 3x3, s=1, same; Maxpool = 2x2, s=2)

• 2 consecutive convolutional layers, each with 64 filters
• What is the output size?
• As we go deeper, the number of filters is multiplied by 2

VGGNet
• Conv -> Pool -> Conv -> Pool -> Conv -> FC
• As we go deeper: Width, Height ↓, Number of Filters ↑

• Called VGG-16: 16 layers that have weights


138M parameters

• Large, but its simplicity makes it appealing

Inception Layer
• Tired of choosing filter sizes?

• Use them all!

• All same convolutions

• The 3x3 max pooling uses stride 1


Inception Layer: Computational Cost
• Input volume: 32x32x200; Conv 5x5x200 with 92 filters (+ ReLU), same padding → output volume 32x32x92

• Multiplications: (5x5x200) x (32x32x92) ≈ 470 million
  (each of the 32x32x92 output values requires 5x5x200 multiplications)

Inception Layer: Computational Cost
• Reduce the depth first with a 1x1 convolution: 32x32x200 → Conv 1x1x200, 16 filters (+ ReLU) → 32x32x16 → Conv 5x5x16, 92 filters (+ ReLU) → 32x32x92

• Multiplications: (1x1x200) x (32x32x16) + (5x5x16) x (32x32x92) ≈ 40 million

• Roughly a 10x reduction in the number of multiplications
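
A quick check of these multiplication counts:

def conv_mults(kh, kw, in_ch, out_h, out_w, out_ch):
    # standard convolution: one (kh * kw * in_ch)-dim dot product per output value
    return kh * kw * in_ch * out_h * out_w * out_ch

naive = conv_mults(5, 5, 200, 32, 32, 92)          # 471,040,000  (~470 million)
reduced = (conv_mults(1, 1, 200, 32, 32, 16)       # 1x1 bottleneck first
           + conv_mults(5, 5, 16, 32, 32, 92))     # 40,960,000   (~40 million)
print(naive, reduced, round(naive / reduced, 1))   # ratio ~11.5
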
Inception Layer

[Szegedy et al CVPR’15] GoogLeNet


Inception Layer: Dimensions
• Input: 28×28×192
• Parallel branches (all same convolutions):
  – 1×1 conv → 28×28×64
  – 1×1 conv (→ 28×28×96), then 3×3 conv → 28×28×128
  – 1×1 conv (→ 28×28×16), then 5×5 conv → 28×28×32
  – 3×3 max pool (→ 28×28×192), then 1×1 conv → 28×28×32
• Channel-wise concatenation of all branches → 28×28×256
• We do not want the max pool result to take up almost all of the output, hence the 1×1 conv after the pooling
• Figure: (b) Inception module with dimension reductions

[Szegedy et al CVPR'15] GoogLeNet
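
A minimal sketch of such an inception module in PyTorch (simplified: batch norm and other GoogLeNet details are omitted; channel counts as listed above):

import torch
import torch.nn as nn
import torch.nn.functional as F

class InceptionModule(nn.Module):
    def __init__(self, in_ch=192):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, 64, kernel_size=1)          # 1x1 branch
        self.b2_reduce = nn.Conv2d(in_ch, 96, kernel_size=1)   # 1x1 before 3x3
        self.b2 = nn.Conv2d(96, 128, kernel_size=3, padding=1)
        self.b3_reduce = nn.Conv2d(in_ch, 16, kernel_size=1)   # 1x1 before 5x5
        self.b3 = nn.Conv2d(16, 32, kernel_size=5, padding=2)
        self.b4 = nn.Conv2d(in_ch, 32, kernel_size=1)          # 1x1 after max pool

    def forward(self, x):
        y1 = F.relu(self.b1(x))
        y2 = F.relu(self.b2(F.relu(self.b2_reduce(x))))
        y3 = F.relu(self.b3(F.relu(self.b3_reduce(x))))
        y4 = F.relu(self.b4(F.max_pool2d(x, kernel_size=3, stride=1, padding=1)))
        return torch.cat([y1, y2, y3, y4], dim=1)              # 64+128+32+32 = 256 channels

out = InceptionModule()(torch.randn(1, 192, 28, 28))
print(out.shape)  # torch.Size([1, 256, 28, 28])
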
GoogLeNet: Using the Inception Layer

• A stack of inception blocks, with extra max pool layers to reduce dimensionality

[Szegedy et al CVPR'15] GoogLeNet

Xception Net
• "Extreme version of Inception": applies (modified) depthwise separable convolutions instead of normal convolutions
• 36 conv layers, structured into several modules with skip connections
• Outperforms Inception Net V3

[Chollet CVPR'17] Xception

But why?
Original convolution: 256 kernels of size 5x5x3
Multiplications: 256 x (5x5x3) x (8x8 locations) = 1,228,800

Depthwise convolution: 3 kernels of size 5x5x1
Multiplications: (5x5x3) x (8x8 locations) = 4,800

1x1 convolution: 256 kernels of size 1x1x3
Multiplications: 256 x (1x1x3) x (8x8 locations) = 49,152

→ Far fewer computations: 4,800 + 49,152 = 53,952 instead of 1,228,800
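
A minimal sketch of a depthwise separable convolution in PyTorch (channel sizes taken from the example above; the groups argument makes the first conv depthwise):

import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_ch=3, out_ch=256, kernel_size=5):
        super().__init__()
        # depthwise: one 5x5 filter per input channel (groups = in_ch)
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   padding=kernel_size // 2, groups=in_ch)
        # pointwise: 1x1 convolution that mixes the channels
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

y = DepthwiseSeparableConv()(torch.randn(1, 3, 8, 8))
print(y.shape)  # torch.Size([1, 256, 8, 8])
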
Just Keep Stacking More Layers?

Skip Connections

The Problem of Depth
• As we add more and more layers, training becomes harder

• Vanishing and exploding gradients

• How can we train very deep nets?

Residual Block
• Two layers, mapping the input $x^{l-1}$ to the output $x^{l+1}$:

  $x^{l} = f(W^{l} x^{l-1} + b^{l})$      (linear layer followed by the non-linearity $f$)
  $x^{l+1} = f(W^{l+1} x^{l} + b^{l+1})$

Residual Block
• The same two layers, plus a skip connection: the input $x^{l-1}$ is routed around the main path (the two linear layers) and added to its output

Residual Block
• With the skip connection, the output becomes

  $x^{l+1} = f(W^{l+1} x^{l} + b^{l+1} + x^{l-1})$

  instead of the plain $x^{l+1} = f(W^{l+1} x^{l} + b^{l+1})$

Residual Block
• The skip connection adds $x^{l-1}$ to the block output, so both must have the same dimensions

• Usually a same convolution is used, since we need the same dimensions
• Otherwise we need to convert the dimensions with a matrix of learned weights or with zero padding
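
A minimal sketch of a residual block in PyTorch (simplified; the channel counts and the learned 1x1 projection for mismatched dimensions are illustrative choices):

import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        # main path: two 3x3 "same" convolutions
        self.conv1 = nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1)
        self.conv2 = nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=1, padding=1)
        # skip path: identity if dimensions match, otherwise a learned 1x1 projection
        if stride != 1 or in_ch != out_ch:
            self.skip = nn.Conv2d(in_ch, out_ch, kernel_size=1, stride=stride)
        else:
            self.skip = nn.Identity()

    def forward(self, x):
        out = F.relu(self.conv1(x))
        out = self.conv2(out)
        return F.relu(out + self.skip(x))   # add the skip connection before the final non-linearity

y = ResidualBlock(64, 128, stride=2)(torch.randn(1, 64, 32, 32))
print(y.shape)  # torch.Size([1, 128, 16, 16])
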
ResNet Block
• Plain net: any two stacked layers (Weight Layer → ReLU → Weight Layer → ReLU) directly learn a mapping $H(x)$
• Residual net: the same two layers learn the residual $F(x)$, and an identity skip connection adds $x$, so the block outputs $\mathrm{ReLU}(F(x) + x)$

[He et al. CVPR'16] ResNet

ResNet

• SGD + Momentum (0.9)
• Learning rate 0.1, divided by 10 when the error plateaus
• Mini-batch size 256
• Weight decay of 1e-5
• No dropout

• ResNet-152: 60M parameters

[He et al. CVPR'16] ResNet
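
A sketch of this training setup in PyTorch (ReduceLROnPlateau is one way to implement the divide-by-10-on-plateau schedule; the torchvision ResNet-152 stands in for the model):

import torch
import torchvision

model = torchvision.models.resnet152(weights=None)   # ~60M parameters
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-5)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.1, patience=10)

# per epoch: train with mini-batches of size 256, then call
#   scheduler.step(validation_error)   # divides the lr by 10 when the error plateaus
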
ResNet
• If we make the network deeper, at some point performance starts to degrade

Why do ResNets Work?

• Consider appending a residual block to the end of a network NN, taking $x^{l-1}$ to $x^{l+1}$: how does this extra block affect the output?

• With the skip connection: $x^{l+1} = f(W^{l+1} x^{l} + b^{l+1} + x^{l-1})$

• If the learned weights and biases are driven towards zero ($W^{l+1} \approx 0$, $b^{l+1} \approx 0$), the block reduces to $x^{l+1} = f(x^{l-1})$, i.e. (with ReLU activations) effectively the identity

• The identity is easy for the residual block to learn

• Guaranteed it will not hurt performance, it can only improve

ImageNet Benchmark
ImageNet classification top-5 error (%), the "Revolution of Depth":

• ILSVRC 2010 (shallow): 28.2
• ILSVRC 2011 (shallow): 25.8
• AlexNet, 8 layers (ILSVRC 2012): 16.4
• ZF, 8 layers (ILSVRC 2013): 11.7
• VGG, 19 layers (ILSVRC 2014): 7.3
• GoogLeNet, 22 layers (ILSVRC 2014): 6.66
• Xception, 36 layers (2016): 5.5
• ResNet, 152 layers (ILSVRC 2015): 3.57
• Trimps-Soushen (ILSVRC 2016): 2.99
• SENet (ILSVRC 2017): 2.25

1x1 Convolutions

1x1 Convolution

• Input volume 32x32x3, a single 1x1 filter → output 32x32x1 (one output value per spatial position)

• Same as having a 3 neuron fully connected layer applied at every spatial position
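
A small sketch checking this equivalence in PyTorch: a 1x1 convolution and a linear layer with the same (copied) weights give identical results at every pixel (tensor shapes chosen to match the example above):

import torch
import torch.nn as nn

conv = nn.Conv2d(3, 1, kernel_size=1)            # 1x1 conv: 3 input channels -> 1 output channel
fc = nn.Linear(3, 1)                             # fully connected layer with 3 inputs
fc.weight.data = conv.weight.data.view(1, 3)     # copy the 1x1x3 kernel into the FC weights
fc.bias.data = conv.bias.data

x = torch.randn(1, 3, 32, 32)
out_conv = conv(x)                                        # (1, 1, 32, 32)
out_fc = fc(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)    # apply the FC layer to each pixel's 3 channels
print(torch.allclose(out_conv, out_fc, atol=1e-6))        # True
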
Fully Convolutional Network

Classification Network

• The network maps the whole input image to a single class label, e.g. "tabby cat"

FCN: Becoming Fully Convolutional

• Change the fully connected layers to convolution layers with 1x1 spatial output, so the network is "fully convolutional" and no layer has a pre-defined input size
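
A minimal sketch of this conversion in PyTorch; the sizes (a 512-channel 7x7 feature map feeding a 4096-unit FC layer) are illustrative, and the reshape assumes the usual channel-first flattening:

import torch.nn as nn

# fully connected head: only works for a fixed 7x7x512 input
fc_head = nn.Linear(512 * 7 * 7, 4096)

# equivalent convolutional head: a 7x7 convolution with 4096 filters
conv_head = nn.Conv2d(512, 4096, kernel_size=7)
# copy the FC weights into the conv kernel (same parameters, just reshaped)
conv_head.weight.data = fc_head.weight.data.view(4096, 512, 7, 7)
conv_head.bias.data = fc_head.bias.data

# conv_head gives a 1x1 output on a 7x7 input, but also accepts larger inputs
# and then produces a spatial map of class scores
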
FCN: Upsampling Output

Semantic Segmentation (FCN)

• Pixel-wise output and loss

• How do we go back to the input size?

[Long and Shelhamer CVPR'15] FCN

Types of Upsampling
• 1. Interpolation: e.g. nearest neighbor, bilinear, or bicubic interpolation of the original image

(Image: Michael Guerzhoy)
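
A sketch of these interpolation modes with PyTorch's functional API (the scale factor is an illustrative choice):

import torch
import torch.nn.functional as F

x = torch.randn(1, 64, 16, 16)   # a small feature map

up_nearest = F.interpolate(x, scale_factor=2, mode="nearest")
up_bilinear = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
up_bicubic = F.interpolate(x, scale_factor=2, mode="bicubic", align_corners=False)
print(up_nearest.shape)  # torch.Size([1, 64, 32, 32])
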


Types of Upsampling
• 2. Transposed convolution
  - Unpooling
  - Convolution filter (learned)
  - Unpooling + convs: ✅ efficient

[A. Dosovitskiy, TPAMI 2017] "Learning to Generate Chairs, Tables and Cars with Convolutional Networks"

• 2. Transposed convolution, also called up-convolution (example: input 3x3 → output 5x5)
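
A sketch of a transposed convolution in PyTorch that upsamples a 3x3 input to a 5x5 output (kernel size and stride are one possible choice that produces these dimensions):

import torch
import torch.nn as nn

up = nn.ConvTranspose2d(in_channels=1, out_channels=1, kernel_size=3, stride=1, padding=0)
x = torch.randn(1, 1, 3, 3)        # 3x3 input
y = up(x)                          # output size: (3 - 1) * 1 + 3 = 5
print(y.shape)                     # torch.Size([1, 1, 5, 5])
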
Refined Outputs
• If we apply a cascade of unpooling + conv operations, we arrive at the encoder-decoder architecture

• Even more refined: autoencoders with skip connections (aka U-Net)

U-Net

U-Net architecture: each blue box is a multi-channel feature map; the number of channels is denoted on top of the box, the spatial dimensions at its lower left edge. White boxes are copied feature maps.

[Ronneberger et al. MICCAI'15] U-Net

U-Net: Encoder
Left side: Contraction Path (Encoder)
• Captures context of the image
• Follows typical architecture of a CNN:
– Repeated application of 2 unpadded 3x3 convolutions
– Each followed by ReLU activation
– 2x2 maxpooling operation with stride 2 for downsampling
– At each downsampling step, # of channels is doubled
→ As before: Height, Width ↓, Depth ↑

[Ronneberger et al. MICCAI'15] U-Net

U-Net: Decoder
Right side: Expansion Path (Decoder)
• Upsampling to recover spatial locations for assigning class labels to each pixel
  – 2x2 up-convolution that halves the number of input channels
  – Skip connections: outputs of up-convolutions are concatenated with feature maps from the encoder
  – Followed by 2 ordinary 3x3 convs
  – Final layer: 1x1 conv to map 64 channels to # classes

• Height, Width ↑, Depth ↓

[Ronneberger et al. MICCAI’15] U-Net
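
A minimal sketch of one such decoder step in PyTorch (simplified: padded 3x3 convolutions are used instead of U-Net's unpadded ones, so no cropping of the skip features is needed; channel counts are illustrative):

import torch
import torch.nn as nn

def double_conv(in_ch, out_ch):
    # two 3x3 convolutions, each followed by ReLU (padded here for simplicity)
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    )

class UpBlock(nn.Module):
    """One decoder step: 2x2 up-convolution, concatenate the encoder skip, two 3x3 convs."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, in_ch // 2, kernel_size=2, stride=2)  # halves channels
        self.conv = double_conv(in_ch, out_ch)  # in_ch = upsampled channels + skip channels

    def forward(self, x, skip):
        x = self.up(x)
        x = torch.cat([skip, x], dim=1)   # skip connection: concatenate encoder features
        return self.conv(x)

# e.g. a decoder step starting from 1024 channels, with a 512-channel encoder skip map
block = UpBlock(1024, 512)
out = block(torch.randn(1, 1024, 28, 28), torch.randn(1, 512, 56, 56))
print(out.shape)  # torch.Size([1, 512, 56, 56])
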

