AIDL 2023S DL 08: CNN Architectures
CNN Architectures
for image classification
Associate Professor
Universitat Politècnica de Catalunya
Index
• CNN Architectures for Image classification
• The ImageNet large scale visual recognition challenge
• AlexNet
• VGG-Net
• GoogLeNet
• ResNet
• SENet
• DenseNet
• Comparison
ImageNet dataset (ILSVRC)
Large Scale Visual Recognition Challenge (2012-2017)
ImageNet
• Photographs collected from image search engines, with a non-uniform distribution of images per category (organized along the ImageNet hierarchy). Human labels collected via Amazon Mechanical Turk (~14M images, ~1M with bounding-box annotations)
2012 winner
• SuperVision team (Krizhevsky, Sutskever, Hinton)
• AlexNet CNN
• ~10-point margin over the other approaches
• A revolution in computer vision
• Top-5 error: 15.315%
Human top-5 error: ~5%
Previously: LeNet-5
• LeCun et al., 1998
• The output of the last convolutional layer is flattened into a single vector, which is input to a fully connected layer
Slide credit: J. Johnson
ImageNet: ILSVRC winners
• ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
ZFNet: improved hyperparameters over AlexNet
ZFNet
• Zeiler, Fergus, 2013
Refinement of AlexNet:
CONV1: change from (11x11 stride 4) to (7x7 stride 2)
CONV3,4,5: instead of 384, 384, 256 filters use 512, 1024, 512
ILSVRC 2013 winner: top-5 error 16.4% → 11.7%
Visualization can help propose better architectures: “A convnet model that uses the same components (filtering, pooling) but in reverse, so instead of mapping pixels to features does the opposite.”
M. Zeiler and R. Fergus, Visualizing and Understanding Convolutional Networks, ECCV 2014 (Best Paper Award winner)
ImageNet: ILSVRC winners
• ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
Deeper networks: VGG, GoogLeNet
VGGNet
• Simonyan, Zisserman (2014)
ILSVRC 2014: 11.7% → 7.3% top-5 error
More principled (regular) design (a code sketch follows below). Use only:
• 3x3 CONV, stride 1, pad 1
• 2x2 MAX POOL, stride 2
• After each pool, double the number of channels
2 versions
• VGG-16 (16 parameter layers)
• VGG-19 (19 parameter layers)
K. Simonyan and A. Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition, ICLR 2015
Slide credit: Stanford CS231n
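A minimal PyTorch sketch (an illustration, not the authors' code) of one VGG-style stage built from these rules; the stage sizes below follow VGG-16:

import torch
import torch.nn as nn

def vgg_stage(in_ch, out_ch, n_convs):
    # Stack 3x3 convolutions (stride 1, pad 1), then halve the
    # spatial resolution with 2x2 max pooling (stride 2).
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch,
                             kernel_size=3, stride=1, padding=1),
                   nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(2, stride=2))
    return nn.Sequential(*layers)

# VGG-16 convolutional body: channels double after each pool (up to 512)
body = nn.Sequential(vgg_stage(3, 64, 2), vgg_stage(64, 128, 2),
                     vgg_stage(128, 256, 3), vgg_stage(256, 512, 3),
                     vgg_stage(512, 512, 3))
print(body(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 512, 7, 7])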
VGGNet: most memory is in the early CONV layers
[Figure: per-layer memory, parameters and operations for VGG-16; legend: Convolution, Pooling, Softmax, Other]
Inception module
C. Szegedy et al., Going deeper with convolutions, CVPR 2015
Recall: 1x1 convolutions act at each spatial position as a linear projection across channels, so they can cheaply reduce (or expand) the channel depth
GoogLeNet
• The inception module:
Parallel paths with different receptive field sizes and operations are meant to capture sparse
patterns of correlations in the stack of feature maps
Uses 1x1 convolutions to reduce feature depth before expensive convolutions
NiN: M. Lin, Q. Chen, and S. Yan, Network in Network, ICLR 2014. Figure credit: Alexis Cook
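A hedged PyTorch sketch of such an inception module (most intermediate ReLUs omitted for brevity); the channel sizes in the example follow the paper's inception (3a) block:

import torch
import torch.nn as nn

class Inception(nn.Module):
    # Four parallel paths; the 1x1 convolutions (c3r, c5r) reduce the
    # channel depth before the expensive 3x3 and 5x5 convolutions.
    def __init__(self, in_ch, c1, c3r, c3, c5r, c5, pool_proj):
        super().__init__()
        self.p1 = nn.Conv2d(in_ch, c1, 1)
        self.p3 = nn.Sequential(nn.Conv2d(in_ch, c3r, 1), nn.ReLU(inplace=True),
                                nn.Conv2d(c3r, c3, 3, padding=1))
        self.p5 = nn.Sequential(nn.Conv2d(in_ch, c5r, 1), nn.ReLU(inplace=True),
                                nn.Conv2d(c5r, c5, 5, padding=2))
        self.pp = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_ch, pool_proj, 1))
    def forward(self, x):
        # Concatenate all paths along the channel dimension
        return torch.cat([self.p1(x), self.p3(x), self.p5(x), self.pp(x)], 1)

m = Inception(192, 64, 96, 128, 16, 32, 32)  # sizes of inception (3a)
print(m(torch.randn(1, 192, 28, 28)).shape)  # torch.Size([1, 256, 28, 28])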
GoogLeNet
• No large FC layers at the end. Instead, uses global average pooling to collapse spatial dimensions
and one linear layer to produce class scores (recall VGG-16: most parameters were in the FC
layers!)
[Figure: GoogLeNet classifier output: global average pooling + one linear layer replaces the expensive FC layers]
Slide credit: J. Johnson
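A minimal sketch of such a classifier head in PyTorch, assuming 1024 final feature maps as in GoogLeNet:

import torch
import torch.nn as nn

head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),  # global average pooling: (N, 1024, 7, 7) -> (N, 1024, 1, 1)
    nn.Flatten(),             # -> (N, 1024)
    nn.Linear(1024, 1000))    # one linear layer -> 1000 class scores

feats = torch.randn(8, 1024, 7, 7)  # output of the last conv stage
print(head(feats).shape)            # torch.Size([8, 1000])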
GoogLeNet
• Training using a loss only at the end of the network didn’t work well: the network is too deep and gradients don’t propagate cleanly
• Attach auxiliary classifiers at several intermediate points in the network; they also try to classify the image and receive a loss (with BatchNorm this trick is no longer needed). A loss-combination sketch follows below.
[Figure: GoogLeNet architecture with auxiliary classifiers attached at intermediate layers; the final classifier output removes the expensive FC layers]
C. Szegedy et al., Going deeper with convolutions, CVPR 2015
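A sketch (not the original training code) of how the three losses can be combined; the model is assumed to return the main logits plus the two auxiliary outputs, and the 0.3 weighting follows the paper:

import torch.nn as nn

def googlenet_loss(model, images, labels, aux_weight=0.3):
    # Assumed: model(images) returns (main_logits, aux1_logits, aux2_logits)
    criterion = nn.CrossEntropyLoss()
    main_out, aux1_out, aux2_out = model(images)
    return (criterion(main_out, labels)
            + aux_weight * criterion(aux1_out, labels)   # auxiliary losses,
            + aux_weight * criterion(aux2_out, labels))  # weighted down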
Inception v2, v3, v4
• Improvements:
• Regularize training with batch normalization, reducing the importance of auxiliary classifiers
• More variants of inception modules with aggressive factorization of filters
• Increase the number of feature maps while decreasing spatial resolution (pooling)
[Figure: Inception v2 grid-size reduction module: parallel stride-2 convolution and pooling branches]
C. Szegedy et al., Rethinking the inception architecture for computer vision, CVPR 2016
C. Szegedy et al., Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning, arXiv 2016
ImageNet: ILSVRC winners
• ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
Very deep networks
Residual Networks
• What happens when we continue stacking deeper layers on a plain CNN?
The 56-layer model performs worse on both training and test error. Since it also performs worse than the shallower model on the training set, the degradation is not due to overfitting.
• Hypothesis: it is an optimization problem; deeper models are harder to train, and in particular they do not learn the identity functions that would let them emulate shallower models
• Solution: copy the learned layers from the shallower model and set the additional layers to identity mappings
Residual Networks
• The residual module
• Introduce skip or shortcut connections (which existed in several forms in earlier literature): fit a residual mapping instead of directly trying to fit the desired underlying mapping
• Make it easy for network layers to represent the identity mapping
• The shortcut needs to skip at least two layers: with a single layer the block reduces to a linear layer plus the identity, and no advantage was observed
H(x) = F(x) + x
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, Deep Residual Learning for Image Recognition, CVPR 2016 (Best Paper)
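A minimal PyTorch sketch of a basic residual block for the same-shape case; blocks that change resolution or channel count additionally need a projection on the shortcut:

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    # The conv layers learn the residual F(x); the output is
    # H(x) = F(x) + x, so representing the identity just means
    # driving the conv weights toward zero.
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)
    def forward(self, x):
        f = self.bn2(self.conv2(self.relu(self.bn1(self.conv1(x)))))
        return self.relu(f + x)  # H(x) = F(x) + x

block = ResidualBlock(64)
print(block(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 64, 56, 56])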
Residual Networks
• Full architecture:
- A residual network is a stack of many residual blocks
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, Deep Residual Learning for Image Recognition, CVPR 2016 (Best Paper)
Residual Networks
• Uses the same aggressive stem as GoogLeNet to downsample the input 4x before applying residual blocks
Slide credit: J. Johnson
Residual Networks
• For deeper networks (ResNet-50 and beyond), use a bottleneck block to improve efficiency (similar to GoogLeNet):
• a 1x1 conv with 64 filters projects the 28x28x256 input down to 28x28x64
• the 3x3 conv then operates over only 64 feature maps
• a 1x1 conv with 256 filters projects back to 256 feature maps
Cost comparison (multiply-accumulates per spatial position):
• direct 3x3 convolution with 256 feature maps at input and output: 256 x 256 x 3 x 3 ≈ 600K
• bottleneck: 256 x 64 x 1 x 1 ≈ 16K, plus 64 x 64 x 3 x 3 ≈ 36K, plus 64 x 256 x 1 x 1 ≈ 16K, for a total of ≈ 70K
Slide credit: J. Johnson
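A sketch of this bottleneck block in PyTorch (BatchNorm layers omitted for brevity):

import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    # 1x1 reduce -> 3x3 -> 1x1 expand, plus the residual shortcut.
    def __init__(self, channels=256, mid=64):
        super().__init__()
        self.reduce = nn.Conv2d(channels, mid, 1)      # 256 -> 64 maps
        self.conv = nn.Conv2d(mid, mid, 3, padding=1)  # 3x3 on 64 maps only
        self.expand = nn.Conv2d(mid, channels, 1)      # 64 -> 256 maps
        self.relu = nn.ReLU(inplace=True)
    def forward(self, x):
        f = self.expand(self.relu(self.conv(self.relu(self.reduce(x)))))
        return self.relu(f + x)

blk = Bottleneck()
print(blk(torch.randn(1, 256, 28, 28)).shape)  # torch.Size([1, 256, 28, 28])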
Residual Networks
• Training ResNet in practice (a minimal optimizer sketch follows this list)
• Batch Normalization after every CONV layer
• Xavier/2 initialization from He et al.
• SGD + Momentum (0.9)
• Learning rate: 0.1, divided by 10 when validation error plateaus
• Mini-batch size 256
• Weight decay of 1e-5
• No dropout used
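A minimal sketch of this recipe with PyTorch's SGD and ReduceLROnPlateau scheduler; the model argument is a placeholder:

import torch

def make_optimizer(model):
    # SGD + momentum 0.9, LR 0.1, weight decay 1e-5, as in the recipe above
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                                momentum=0.9, weight_decay=1e-5)
    # Divide the learning rate by 10 when the validation error plateaus
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode='min', factor=0.1)
    return optimizer, scheduler

# After each epoch of mini-batch training (batch size 256):
# scheduler.step(val_error)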
• Experimental results
• able to train very deep networks without degrading (152 layers on ImageNet)
• deeper networks now achieve lower training error as expected
• MSRA: ILSVRC & COCO 2015 competitions
- ImageNet Classification: “ultra deep”, 152 layers
- ImageNet Detection: 16% better than 2nd
- ImageNet Localization: 27% better than 2nd
- COCO Detection: 11% better than 2nd
- COCO Segmentation: 12% better than 2nd
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, Deep Residual Learning for Image Recognition, CVPR 2016 (Best Paper)
ImageNet: ILSVRC winners
• ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
Model ensembles
• ImageNet 2016 winner: Model Ensembles
• Multi-scale ensemble of Inception, Inception-ResNet, ResNet, and Wide ResNet models
S. Xie, R. Girshick, P. Dollar, Z. Tu, and K. He, Aggregated Residual Transformations for Deep Neural Networks, CVPR 2017
ResNeXt on ImageNet
S. Xie, R. Girshick, P. Dollar, Z. Tu, and K. He, Aggregated Residual Transformations for Deep Neural Networks, CVPR 2017
ImageNet: ILSVRC winners
• ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
Squeeze and excitation networks
• SENet
• Hu, Albanie, Sun, Wu 2017
• ILSVRC 2017 winner (2.251% top-5 error) with an ensemble of SENets built on ResNeXt-152-SE
• SE block: “squeeze” each feature map to a single value with global average pooling, then “excite” with a small MLP that outputs one rescaling weight per channel
E.g., adding SE blocks to ResNet-50 yields almost the same accuracy as ResNet-101
J. Hu, S. Albanie, G. Sun, E. Wu, Squeeze-and-Excitation Networks, CVPR 2018
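A minimal PyTorch sketch of an SE block with the paper's reduction ratio of 16; in SE-ResNet it is applied to the residual branch F(x) before the shortcut addition, and in SE-Inception to the output of the inception module:

import torch
import torch.nn as nn

class SEBlock(nn.Module):
    # Squeeze: global average pooling turns each feature map into one number.
    # Excitation: a small bottleneck MLP outputs one weight in (0, 1) per
    # channel, which rescales the corresponding feature map.
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())
    def forward(self, x):
        n, c, _, _ = x.shape
        w = self.fc(x.mean(dim=(2, 3)))  # squeeze to (N, C), then excite
        return x * w.view(n, c, 1, 1)    # channel-wise rescaling

se = SEBlock(256)
print(se(torch.randn(2, 256, 28, 28)).shape)  # torch.Size([2, 256, 28, 28])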
ImageNet: ILSVRC winners
• ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
Densely Connected Neural Networks (DenseNets)
• Dense blocks, where each layer is connected to every other layer in a feedforward fashion: each layer receives the concatenated feature maps of all preceding layers as input
Dense block
G. Huang, Z. Liu, and L. van der Maaten, Densely Connected Convolutional Networks, CVPR 2017 (Best Paper Award)
https://towardsdatascience.com/review-densenet-image-classification-b6631a8ef803
Slide credit: J. Johnson
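A minimal PyTorch sketch of a dense block; real DenseNet layers also include a 1x1 bottleneck convolution, omitted here:

import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    # Each layer receives the concatenation of the block input and all
    # previous layers' outputs, and contributes `growth` new channels.
    def __init__(self, in_ch, growth=32, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Sequential(nn.BatchNorm2d(in_ch + i * growth),
                          nn.ReLU(inplace=True),
                          nn.Conv2d(in_ch + i * growth, growth, 3, padding=1))
            for i in range(n_layers))
    def forward(self, x):
        for layer in self.layers:
            x = torch.cat([x, layer(x)], 1)  # dense connectivity
        return x

blk = DenseBlock(64)
print(blk(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 192, 56, 56])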
Comparing models
• An Analysis of Deep Neural Network Models for Practical Applications, 2017 (A. Canziani, E. Culurciello, A. Paszke)
https://towardsdatascience.com/neural-network-architectures-156e5bad51ba
How to use a pre-trained network for a new task?
• In practice, very few people train a CNN from scratch (with random initialization), because it is relatively rare to have a dataset of sufficient size
• Instead, it is common to pretrain a ConvNet on a very large dataset (e.g. ImageNet, whose classification task contains 1.2 million images over 1000 categories)
• and then use the ConvNet either:
• as a fixed feature extractor for the task of interest (retrain only the classifier), if the new dataset is small
• as an initialization: keep the pretrained weights as a starting point and train the full network or only some of the higher layers, if the new dataset is of medium size (transfer learning)
How to use a pre-trained network for a new task?
• Strategy 1: use the network as a fixed feature extractor. Remove the final classification layers and use the remaining network as an off-the-shelf feature extractor. A sketch follows below.
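A sketch of this strategy using torchvision; the ResNet-50 backbone and the 10 target classes are illustrative choices:

import torch
import torchvision

model = torchvision.models.resnet50(weights='IMAGENET1K_V2')
for p in model.parameters():
    p.requires_grad = False  # freeze the whole pretrained network
model.fc = torch.nn.Linear(model.fc.in_features, 10)  # new classifier layer
# Only the new layer is trained; the rest acts as a fixed feature extractor
optimizer = torch.optim.SGD(model.fc.parameters(), lr=0.01, momentum=0.9)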
How to use a pre-trained network for a new task?
• Strategy 2: transfer learning (fine-tuning). Freeze the early layers, train the new prediction layer(s), and optionally fine-tune the higher layers. A sketch follows below.
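A sketch of this strategy under the same assumptions: freeze the lower, more generic stages and train the rest with a small learning rate:

import torch
import torchvision

model = torchvision.models.resnet50(weights='IMAGENET1K_V2')
for name, p in model.named_parameters():
    # Freeze the stem and the first two stages (the most generic features)
    if name.startswith(('conv1', 'bn1', 'layer1', 'layer2')):
        p.requires_grad = False
model.fc = torch.nn.Linear(model.fc.in_features, 10)  # new classifier layer
optimizer = torch.optim.SGD((p for p in model.parameters() if p.requires_grad),
                            lr=0.001, momentum=0.9)  # small learning rate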
How to use a pre-trained network for a new task?
[Figure: features in earlier layers are more generic; features in later layers are more specific to the pretraining task]
Summary
• Architectures for image classification
• LeNet: pioneer net for digit recognition
• AlexNet: smaller compute, still memory heavy, lower accuracy
• VGG: highest memory, most operations
• GoogLeNet: most efficient
• ResNet: moderate efficiency depending on the model, better accuracy
• SENet: squeeze-and-excitation blocks
• DenseNet: densely connected blocks
• Transfer learning: use a trained network for a new task (as feature extractor or fine-tuning)
Questions
2. What is the use of 1x1 convolutions? In which models are they used, and why?
4. What is global average pooling, and why is it used in GoogLeNet, ResNets and other models?
6. What is a Squeeze-and-Excitation block? How is it combined with ResNet or Inception modules?