
Deep Learning

CNN Architectures
for image classification

Verónica Vilaplana Besler


Veronica.vilaplana@upc.edu

Associate Professor
Universitat Politècnica de Catalunya
Index
• CNN Architectures for Image classification
• The ImageNet large scale visual recognition challenge
• AlexNet
• VGG-Net
• GoogLeNet
• ResNet
• SENet
• DenseNet
• Comparison

ImageNet dataset (ILSVRC)
Large Scale Visual Recognition Challenge (2012-2017)

ImageNet
• Photographs collected from image search engines, with a non-uniform distribution of images per category (following the ImageNet hierarchy). Human labels obtained via Amazon Mechanical Turk (14 M images, 1 M with bounding box annotations)

Dataset for the challenge
• 1.2 million training images
• 100 K test images
• 1000 object classes (categories)
• Balanced dataset
• Categories are leaves of the ImageNet hierarchy
• No overlap: the presence of one category implies the absence of another
• Suitable for softmax classification

Particularity: the 1000 classes contain 120 breeds of dogs!

www.image-net.org/challenges/LSVRC/
ILSVRC Metric
Metric: top-5 error rate
- The algorithm predicts at most 5 classes, in descending order of confidence
- If the correct class is among the first 5 predictions, the error is 0; otherwise it is 1

(Example images in the figure: Error = 0, Error = 0, Error = 1)
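To make the metric concrete, here is a minimal sketch (assuming PyTorch tensors of class scores and ground-truth labels; not part of the original slides) of how top-5 error can be computed for a batch:

```python
import torch

def top5_error(logits, targets):
    # logits: (N, num_classes) class scores; targets: (N,) ground-truth class indices
    top5 = logits.topk(5, dim=1).indices              # 5 most confident classes per image
    hit = (top5 == targets.unsqueeze(1)).any(dim=1)   # True if the correct class is among them
    return 1.0 - hit.float().mean().item()            # fraction of images with no top-5 hit

# toy example: 2 images, 10 classes
scores = torch.randn(2, 10)
labels = torch.tensor([3, 7])
print(top5_error(scores, labels))
```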


ImageNet: ILSVRC winners
Pre-2012 approaches based on
• SIFT, CSIFT, GIST, LBP, color statistics
• Fisher vector encoding
• Linear SVMs
• Performance started to saturate

2012 winner: AlexNet, the first CNN-based winner
• SuperVision team (Krizhevsky, Sutskever, Hinton)
• AlexNet CNN
• ~10% margin over the other approaches
• Revolution in computer vision
• Top-5 error 15.3%

Human performance: approximately 5%
Previously: LeNet-5
• LeCun et al., 1998

MNIST digit classification problem
• handwritten digits
• 60,000 training examples
• 10,000 test samples
• 10 classes
• 32x32 grayscale images

Architecture
• Conv filters were 5x5, applied at stride 1, no padding (Conv 6 5x5, then Conv 16 5x5)
• Sigmoid or tanh nonlinearity
• Subsampling (average pooling) layers were 2x2, applied at stride 2
• Fully connected layers at the end
• The output of the last convolutional layer is flattened to a single vector, which is the input to a fully connected layer
• i.e. the architecture is [CONV-POOL-CONV-POOL-FC-FC], about 60K parameters

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-based learning applied to document recognition, 1998.
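As a reference, a minimal PyTorch sketch of this [CONV-POOL-CONV-POOL-FC-FC] pattern (the FC widths 120 and 84 follow the original paper; roughly 60K parameters in total):

```python
import torch.nn as nn

# Simplified LeNet-5: tanh nonlinearity and average pooling, as described above.
lenet5 = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5, stride=1, padding=0),   # 32x32x1 -> 28x28x6
    nn.Tanh(),
    nn.AvgPool2d(kernel_size=2, stride=2),                  # -> 14x14x6
    nn.Conv2d(6, 16, kernel_size=5, stride=1, padding=0),   # -> 10x10x16
    nn.Tanh(),
    nn.AvgPool2d(kernel_size=2, stride=2),                  # -> 5x5x16
    nn.Flatten(),                                            # flatten to a single vector
    nn.Linear(16 * 5 * 5, 120), nn.Tanh(),
    nn.Linear(120, 84), nn.Tanh(),
    nn.Linear(84, 10),                                       # 10 digit classes
)
```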
AlexNet
• Krizhevsky, Sutskever, Hinton, 2012

• Similar framework to LeNet, but:
• More data and a bigger model
• 8 layers (5 convolutional, 3 fully connected)
• 650,000 units
• 60 million free parameters
• Max pooling, ReLU nonlinearities
• GPU implementation: trained on two GPUs for a week
• Dropout regularization / data augmentation
• Ensemble of 7 nets used in the ILSVRC challenge

A. Krizhevsky, I. Sutskever, and G. Hinton, ImageNet Classification with Deep Convolutional Neural Networks, NIPS 2012
AlexNet

Full (simplified) AlexNet architecture (8 layers):
[224x224x3] INPUT
[55x55x96] CONV1: 96 11x11 filters at stride 4, pad 0
[27x27x96] MAX POOL1: 3x3 filters at stride 2
[27x27x96] NORM1: Normalization layer
[27x27x256] CONV2: 256 5x5 filters at stride 1, pad 2
[13x13x256] MAX POOL2: 3x3 filters at stride 2
[13x13x256] NORM2: Normalization layer
[13x13x384] CONV3: 384 3x3 filters at stride 1, pad 1
[13x13x384] CONV4: 384 3x3 filters at stride 1, pad 1
[13x13x256] CONV5: 256 3x3 filters at stride 1, pad 1
[6x6x256] MAX POOL3: 3x3 filters at stride 2
[4096] FC6: 4096 neurons
[4096] FC7: 4096 neurons
[1000] FC8: 1000 neurons (class scores)

ILSVRC 2012 winner: 16.4% top-5 error; 15.4% with a 7-CNN ensemble

Details/Retrospectives:
- first use of ReLU
- used Norm layers (no longer common)
- heavy data augmentation
- dropout 0.5
- batch size 128
- SGD with momentum 0.9
- learning rate 1e-2, reduced by 10 manually when validation accuracy plateaus
- L2 weight decay 5e-4

Slide credit: Stanford cs231n
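A PyTorch sketch that follows the layer list above (single-GPU version; the NORM layers are local response normalization). Note that to actually obtain 55x55 after CONV1 the input is usually taken as 227x227, a well-known inconsistency in the original paper:

```python
import torch.nn as nn

alexnet = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4),               # CONV1 -> 55x55x96 (227x227 input)
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2),                     # POOL1 -> 27x27x96
    nn.LocalResponseNorm(5),                                    # NORM1
    nn.Conv2d(96, 256, kernel_size=5, stride=1, padding=2),    # CONV2 -> 27x27x256
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2),                     # POOL2 -> 13x13x256
    nn.LocalResponseNorm(5),                                    # NORM2
    nn.Conv2d(256, 384, kernel_size=3, stride=1, padding=1),   # CONV3 -> 13x13x384
    nn.ReLU(inplace=True),
    nn.Conv2d(384, 384, kernel_size=3, stride=1, padding=1),   # CONV4 -> 13x13x384
    nn.ReLU(inplace=True),
    nn.Conv2d(384, 256, kernel_size=3, stride=1, padding=1),   # CONV5 -> 13x13x256
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2),                      # POOL3 -> 6x6x256
    nn.Flatten(),
    nn.Linear(6 * 6 * 256, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),  # FC6
    nn.Linear(4096, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),          # FC7
    nn.Linear(4096, 1000),                                                    # FC8: class scores
)
```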
AlexNet

• Most of the memory usage is in the early convolution layers
• Nearly all parameters are in the fully connected layers
• Most floating-point ops occur in the convolution layers

Slide credit: J. Johnson
ImageNet: ILSVRC winners
• ImageNet Large Scale Visual Recognition Challenge (ILSVRC)

2013: ZFNet improved hyperparameters over AlexNet
ZFNet
• Zeiler, Fergus, 2013

Refinement of AlexNet:
CONV1: change from (11x11 stride 4) to (7x7 stride 2)
CONV3,4,5: instead of 384, 384, 256 filters use 512, 1024, 512
ILSVRC 2013 winner, top-5 error: 16.4% -> 11.7%

More trial and error…

Visualization can help in proposing better architectures: “A convnet model that uses the same components (filtering, pooling) but in reverse, so instead of mapping pixels to features does the opposite.”

M. Zeiler and R. Fergus, Visualizing and Understanding Convolutional Networks, ECCV 2014 (Best Paper Award winner)
ImageNet: ILSVRC winners
• ImageNet Large Scale Visual Recognition Challenge (ILSVRC)

2014: deeper networks
VGG
GoogLeNet
VGGNet
• Simonyan, Zisserman (2014)
ILSVRC 2014: 11.7% -> 7.3% top-5 error
More principled (regular) design

Large receptive fields replaced by stacks of 3x3 convolutions
• Two 3x3 conv layers have the same receptive field as a single 5x5 conv
• But have fewer parameters and take less computation

Use only:
• 3x3 CONV, stride 1, pad 1
• 2x2 MAX POOL, stride 2
• After each pool, double the number of channels

Shows that depth is a critical component for good performance (16-19 layers)

2 versions
• VGG-16 (16 parameter layers)
• VGG-19 (19 parameter layers)

K. Simonyan and A. Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition, ICLR 2015
Slide credit: Stanford cs231n
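A sketch of the VGG design rule in PyTorch (3x3 conv, stride 1, pad 1; 2x2 max pool, stride 2; channels doubled after each pool). The five-stage channel plan below is the VGG-16 one; the classifier FC layers are omitted:

```python
import torch.nn as nn

def vgg_stage(in_ch, out_ch, num_convs):
    # A stack of 3x3 convs (stride 1, pad 1) followed by a 2x2 max pool (stride 2)
    layers = []
    for i in range(num_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch,
                             kernel_size=3, stride=1, padding=1),
                   nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

# VGG-16 convolutional body: 2+2+3+3+3 = 13 conv layers (plus 3 FC layers = 16)
vgg16_body = nn.Sequential(
    vgg_stage(3, 64, 2),
    vgg_stage(64, 128, 2),
    vgg_stage(128, 256, 3),
    vgg_stage(256, 512, 3),
    vgg_stage(512, 512, 3),
)
```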
VGGNet
• Most memory is in the early CONV layers
• Most parameters are in the late FC layers
Slide credit: Stanford cs231n

AlexNet vs VGG-16: a much bigger network
• Memory: AlexNet 1.9 MB vs VGG-16 48.6 MB (25x)
• Parameters: AlexNet 61 M vs VGG-16 138 M (2.3x)
• Compute: AlexNet 0.7 GFLOP vs VGG-16 13.6 GFLOP (19.4x)
Slide credit: J. Johnson
GoogLeNet
• Szegedy, Liu, Jia, Sermanet et al. (2014)
Many innovations for efficiency: reduce parameter count, memory usage and computation

ILSVRC 2014 winner (6.7% top-5 error)

22 layers
- A stem network to downsample the input
- Inception modules that dramatically reduce the number of parameters
- Global average pooling instead of FC layers
- Auxiliary classifiers

Compared to AlexNet: 12x fewer parameters (5M vs 60M), 2x more compute, 6.67% top-5 error (vs 16.4%)

C. Szegedy et al., Going deeper with convolutions, CVPR 2015


GoogLeNet
• A stem network at the start aggressively downsamples the input (recall that in VGG-16 most of the computation was at the start)

Stem total, from 224 to 28 spatial resolution: Memory 7.5 MB, Params 124K, MFLOP 418
Compare VGG-16: Memory 42.9 MB (5.7x), Params 1.1M (8.9x), MFLOP 7485 (17.8x)

Slide credit: J. Johnson
GoogLeNet
• The Inception module: a local unit with parallel branches
• Local structure repeated many times throughout the network

9 inception modules: a network within a network…

C. Szegedy et al., Going deeper with convolutions, CVPR 2015
Recall: 1x1 convolutions

1x1 convolution layers are used to reduce dimensionality (the number of feature maps)

Example (figure): a 32x32x128 input passed through a 1x1 conv with 64 filters gives a 32x32x64 output. Each filter has size 1x1x128 and performs a 128-dimensional dot product at every spatial position.
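A two-line PyTorch illustration of the example above, assuming a 32x32 input with 128 feature maps:

```python
import torch
import torch.nn as nn

# 1x1 convolution as a per-pixel dot product across channels: reduces 128 maps to 64
x = torch.randn(1, 128, 32, 32)            # N x C x H x W
reduce = nn.Conv2d(128, 64, kernel_size=1)
y = reduce(x)
print(y.shape)                              # torch.Size([1, 64, 32, 32])
```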
GoogLeNet
• The Inception module:
Parallel paths with different receptive field sizes and operations are meant to capture sparse patterns of correlations in the stack of feature maps
Uses 1x1 convolutions to reduce feature depth before the expensive convolutions

Apply parallel filter operations on the input from the previous layer:
- multiple receptive field sizes for convolution (1x1, 3x3, 5x5)
- a pooling operation (3x3)

Concatenate all filter outputs together depth-wise

C. Szegedy et al., Going deeper with convolutions, CVPR 2015
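A sketch of such a module in PyTorch (branch widths are arguments; the values used below roughly match the first inception module of GoogLeNet, but they are only an illustration):

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    # Four parallel branches whose outputs are concatenated depth-wise;
    # 1x1 convolutions reduce depth before the 3x3 and 5x5 convolutions.
    def __init__(self, in_ch, c1, c3r, c3, c5r, c5, pool_proj):
        super().__init__()
        self.b1 = nn.Sequential(nn.Conv2d(in_ch, c1, 1), nn.ReLU(inplace=True))
        self.b2 = nn.Sequential(nn.Conv2d(in_ch, c3r, 1), nn.ReLU(inplace=True),
                                nn.Conv2d(c3r, c3, 3, padding=1), nn.ReLU(inplace=True))
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, c5r, 1), nn.ReLU(inplace=True),
                                nn.Conv2d(c5r, c5, 5, padding=2), nn.ReLU(inplace=True))
        self.b4 = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_ch, pool_proj, 1), nn.ReLU(inplace=True))

    def forward(self, x):
        # Concatenate all branch outputs along the channel (depth) dimension
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

m = InceptionModule(192, 64, 96, 128, 16, 32, 32)
print(m(torch.randn(1, 192, 28, 28)).shape)   # torch.Size([1, 256, 28, 28])
```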


GoogLeNet
• Global average pooling (GAP). No fully connected layers needed

NiN: Lin, Min, Qiang Chen, and Shuicheng Yan, "Network in network," ICLR 2014. Figure: Alexis Cook
GoogLeNet
• No large FC layers at the end. Instead, it uses global average pooling to collapse the spatial dimensions and one linear layer to produce the class scores (recall VGG-16: most parameters were in the FC layers!)

Classifier output (removes the expensive FC layers); compare with VGG-16

Slide credit: J. Johnson
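A sketch of this classifier head in PyTorch, assuming a 1024-channel final feature map as in GoogLeNet:

```python
import torch.nn as nn

# Global average pooling head: collapse each HxW feature map to one number,
# then a single linear layer produces the class scores.
gap_head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),     # N x 1024 x 7 x 7 -> N x 1024 x 1 x 1
    nn.Flatten(),                # -> N x 1024
    nn.Linear(1024, 1000),       # -> N x 1000 class scores
)
```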
GoogLeNet
• Training using only a loss at the end of the network didn’t work well: the network is too deep, and gradients don’t propagate cleanly
• Attach auxiliary classifiers at several intermediate points in the network that also try to classify the image and receive a loss (with BatchNorm this trick is no longer needed)

(Figure: auxiliary classifiers attached at intermediate points; classifier output without expensive FC layers)

...and no (big) fully connected layers needed!

C. Szegedy et al., Going deeper with convolutions, CVPR 2015
Inception v2, v3, v4
• Improvements:
• Regularize training with batch normalization, reducing the importance of the auxiliary classifiers
• More variants of inception modules, with aggressive factorization of filters
• Increase the number of feature maps while decreasing the spatial resolution (pooling)

(Figures: factorized Inception-v2 modules and the Inception-v4 / Inception-ResNet blocks)

C. Szegedy et al., Rethinking the inception architecture for computer vision, CVPR 2016
C. Szegedy et al., Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning, arXiv 2016
ImageNet: ILSVRC winners
• ImageNet Large Scale Visual Recognition Challenge (ILSVRC)

2015: very deep networks (ResNet)
Residual Networks
• What happens when we continue stacking deeper layers on a plain CNN?

A 56-layer model has higher training and test error than a shallower model: the deep model performs worse, but not because of overfitting (it is also worse on the training set)

• Hypothesis: it is an optimization problem; deeper models are harder to train, and in particular they don’t learn the identity functions needed to emulate shallower models
• Solution (in principle): copy the learned layers from the shallower model and set the additional layers to the identity mapping
Residual Networks
• The residual module
• Introduce skip or shortcut connections (which existed before in several forms in the literature): fit a residual mapping instead of directly trying to fit the desired underlying mapping
• Make it easy for network layers to represent the identity mapping
• For some reason, the shortcut needs to skip at least two layers

H(x) = F(x) + x

Use the layers to fit the residual F(x) = H(x) - x instead of fitting H(x) directly

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, Deep Residual Learning for Image Recognition, CVPR 2016 (Best Paper)
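A minimal PyTorch sketch of a basic residual block (identity shortcut only; stride and projection variants omitted):

```python
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    # Two 3x3 conv layers fit F(x); the shortcut adds x back so the block
    # computes H(x) = F(x) + x.
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))   # first 3x3 conv
        out = self.bn2(self.conv2(out))            # second 3x3 conv -> F(x)
        return self.relu(out + x)                  # H(x) = F(x) + x
```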
Residual Networks
• Full architecture:
- A residual network is a stack of many residual blocks
- Regular design, like VGG: every residual block has two 3x3 conv layers
- The network is divided into stages: the first block of each stage halves the resolution (with stride-2 convolutions) and doubles the number of channels

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, Deep Residual Learning for Image Recognition, CVPR 2016 (Best Paper)
Residual Networks
• Uses the same aggressive stem as GoogLeNet to downsample the input 4x before applying the residual blocks
• Like GoogLeNet, no big fully connected layers: instead, global average pooling and a single linear layer at the end

Slide credit: J. Johnson
Residual Networks
• For deeper networks (ResNet-50 and beyond)
• use a bottleneck block to improve efficiency (similar to GoogLeNet)

Bottleneck block (for a 28x28x256 input):
- 1x1 conv, 64 filters, to project down to 28x28x64
- 3x3 conv operates over only 64 feature maps
- 1x1 conv, 256 filters, to project back to 256 feature maps (28x28x256)

Cost comparison:
- Directly performing 3x3 convolutions with 256 feature maps at input and output: 256 x 256 x 3 x 3 ~ 600K operations
- Using 1x1 convolutions to reduce 256 to 64 feature maps, followed by 3x3 convolutions, followed by 1x1 convolutions to expand back to 256 maps:
  256 x 64 x 1 x 1 ~ 16K
  64 x 64 x 3 x 3 ~ 36K
  64 x 256 x 1 x 1 ~ 16K
  Total: ~70K

Slide credit: J. Johnson
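A sketch of the bottleneck block in PyTorch for the 28x28x256 case above (batch normalization omitted for brevity):

```python
import torch.nn as nn

class BottleneckBlock(nn.Module):
    # 1x1 conv reduces 256 -> 64, the 3x3 conv runs on only 64 maps,
    # and a final 1x1 conv projects back to 256 before the shortcut addition.
    def __init__(self, channels=256, bottleneck=64):
        super().__init__()
        self.reduce = nn.Conv2d(channels, bottleneck, kernel_size=1)
        self.conv3x3 = nn.Conv2d(bottleneck, bottleneck, kernel_size=3, padding=1)
        self.expand = nn.Conv2d(bottleneck, channels, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.reduce(x))     # 1x1: 256 -> 64
        out = self.relu(self.conv3x3(out))  # 3x3 on 64 maps only
        out = self.expand(out)              # 1x1: 64 -> 256
        return self.relu(out + x)           # residual addition
```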
Residual Networks
• Training ResNet in practice
• Batch Normalization after every CONV layer
• Xavier/2 initialization from He et al.
• SGD + Momentum (0.9)
• Learning rate: 0.1, divided by 10 when validation error plateaus
• Mini-batch size 256
• Weight decay of 1e-5
• No dropout used

• Experimental results
• able to train very deep networks without degrading (152 layers on ImageNet)
• deeper networks now achieve lower training error as expected
• MSRA: ILSVRC & COCO 2015 competitions
- ImageNet Classification: “ultra deep”, 152 layers
- ImageNet Detection: 16% better than 2nd
- ImageNet Localization: 27% better than 2nd
- COCO Detection: 11% better than 2nd
- COCO Segmentation: 12% better than 2nd

• ILSVRC 2015 top-5 error 3.6% (better than human performance)


Residual Networks
• Architectures for ImageNet

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, Deep Residual Learning for Image Recognition, CVPR 2016 (Best Paper)
ImageNet: ILSVRC winners
• ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
Model ensembles
• ImageNet 2016 winner: Model Ensembles
• Multi-scale ensemble of Inception, Inception-Resnet, Resnet, Wide Resnet models

Shao et al., 2016


Improving ResNets
• ResNeXt

S. Xie, R. Girshick, P. Dollar, Z. Tu, and K. He, Aggregated Residual Transformations for Deep Neural Networks, CVPR 2017

ResNeXt on ImageNet

S. Xie, R. Girshick, P. Dollar, Z. Tu, and K. He, Aggregated Residual Transformations for Deep Neural Networks, CVPR 2017
ImageNet: ILSVRC winners
• ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
Squeeze and excitation networks
• SENet
• Hu, Albanie, Sun, Wu, 2017
• ILSVRC 2017 winner (2.251% top-5 error) with ResNeXt-152-SE
• An ensemble of SENets

• SE block
1. Squeeze each channel to a single number by average pooling
2. A 2-layer neural network produces a vector of numbers (one per channel)
3. These numbers are used to weight the channels

• It is possible to construct an SE network (SENet) by simply stacking a collection of SE blocks

J. Hu, S. Albanie, G. Sun, E. Wu, Squeeze-and-Excitation Networks
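A sketch of an SE block in PyTorch following the three steps above (the reduction ratio 16 is the default used in the paper):

```python
import torch.nn as nn

class SEBlock(nn.Module):
    # Squeeze each channel with global average pooling, pass through a small
    # 2-layer MLP, and use the resulting per-channel weights (sigmoid output)
    # to rescale the feature maps.
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        n, c, _, _ = x.shape
        w = self.squeeze(x).view(n, c)          # one number per channel
        w = self.excite(w).view(n, c, 1, 1)     # per-channel weights
        return x * w                            # reweight the channels
```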


Squeeze and excitation networks
• The squeeze and excitation blocks can be added to existing architectures

E.g., by adding SE blocks to ResNet-50 one can expect almost the same accuracy as ResNet-101.
J. Hu, S. Albanie, G. Sun, E. Wu, Squeeze-and-Excitation Networks
ImageNet: ILSVRC winners
• ImageNet Large Scale Visual Recognition Challenge (ILSVRC)

Completion of the challenge:
The annual ImageNet competition was no longer held after 2017 -> it has moved to Kaggle.
Densely Connected Neural Networks (DenseNets)
• Dense blocks, where each layer is connected to every other layer in a feedforward fashion

• Alleviates the vanishing gradient problem, strengthens feature propagation, encourages feature reuse

G. Huang, Z. Liu, and L. van der Maaten, Densely Connected Convolutional Networks, CVPR 2017 (Best Paper Award)
https://towardsdatascience.com/review-densenet-image-classification-b6631a8ef803
Slide credit: J. Johnson
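A simplified PyTorch sketch of a dense block (BN-ReLU-Conv layers, each adding growth_rate new feature maps to the running concatenation; the 1x1 bottleneck and transition layers of the full DenseNet are omitted):

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    # Each layer receives the concatenation of all previous feature maps
    # and contributes growth_rate new ones.
    def __init__(self, in_ch, growth_rate=32, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            ch = in_ch + i * growth_rate
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(ch),
                nn.ReLU(inplace=True),
                nn.Conv2d(ch, growth_rate, kernel_size=3, padding=1, bias=False),
            ))

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            out = layer(torch.cat(features, dim=1))  # feed all previous features
            features.append(out)
        return torch.cat(features, dim=1)
```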
DenseNets

G. Huang, Z. Liu, and L. van der Maaten, Densely Connected Convolutional Networks, CVPR 2017 (Best Paper Award)

DenseNets

ImageNet validation error vs. number of parameters and test-time operations

G. Huang, Z. Liu, and L. van der Maaten, Densely Connected Convolutional Networks, CVPR 2017 (Best Paper Award)
Comparing models
• An Analysis of Deep Neural Network Models for Practical Applications, 2017 (A. Canziani, E. Culurciello, A. Paszke)

https://towardsdatascience.com/neural-network-architectures-156e5bad51ba
Comparing models
• An Analysis of Deep Neural Network Models for Practical Applications, 2017 (A. Canziani)

• AlexNet: small compute, memory heavy, lower accuracy
• VGG: highest memory, most operations
• GoogLeNet: most efficient
• Inception-v4: ResNet + Inception
• ResNet: moderate efficiency, high accuracy

Circle size: # parameters

https://towardsdatascience.com/neural-network-architectures-156e5bad51ba
How to use a pre-trained network for a new task?
• In practice, very few methods train a CNN from scratch (with random initialization), because it is relatively rare to have a dataset of sufficient size
• Instead, it is common to pretrain a ConvNet on a very large dataset (e.g. ImageNet, which contains 1.2 million images in 1000 categories)
• and then use the ConvNet either
• as a fixed feature extractor for the task of interest (retrain only the classifier), if the dataset is small
• as an initialization: use the old weights as initialization and train the full network or only some of the higher layers, if the dataset is of medium size: transfer learning
How to use a pre-trained network for a new task?
• Strategy 1: Use as a feature extractor
Remove the final classification layers and use the network as an off-the-shelf feature extractor

How to use a pre-trained network for a new task?
• Strategy 2: Transfer learning (fine-tuning)
- Train new prediction layer(s)
- Keep the earlier layers frozen or fine-tune them
- More data = retrain more of the network (or all of it)

How to use a pre-trained network for a new task?
(Figure: features in earlier layers are more generic; features in later layers are more specific to the original task)
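A sketch of both strategies in PyTorch, assuming a recent torchvision and a hypothetical 10-class target task:

```python
import torch.nn as nn
import torchvision

# Start from an ImageNet-pretrained ResNet-50 (any pretrained backbone would do)
model = torchvision.models.resnet50(weights=torchvision.models.ResNet50_Weights.DEFAULT)

# Strategy 1: fixed feature extractor -- freeze all pretrained weights,
# then replace the final layer; only the new layer will be trained.
for p in model.parameters():
    p.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, 10)   # new prediction layer (trainable by default)

# Strategy 2: transfer learning / fine-tuning -- same replacement of the last layer,
# but keep some (or all) of the pretrained layers trainable, typically with a smaller
# learning rate; the more data available, the more of the network is retrained.
```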
Summary
• Architectures for image classification
• LeNet: pioneering network for digit recognition
• AlexNet: smaller compute, still memory heavy, lower accuracy
• VGG: highest memory, most operations
• GoogLeNet: most efficient
• ResNet: moderate efficiency depending on the model, better accuracy
• Squeeze-and-excitation networks (SENet)
• Densely connected networks (DenseNet)
• Transfer learning: use a trained network for a new task (as feature extractor or fine-tuning)

• What is missing from the picture?


• Training tricks and details: initialization, regularization, normalization
• Training data augmentation
• Averaging classifier outputs over multiple crops/flips
• Ensembles of networks
• Meta-learning: learning to learn net architectures
Review questions
1. Analyze and compare AlexNet and VGG in terms of memory usage, number of parameters and operations per layer. Which layers need the most memory / parameters / operations?

2. What is the use of 1x1 convolutions? In which models are they used, and why?

3. What is an inception module?

4. What is global average pooling and why is it used in GoogLeNet, ResNets and other models?

5. What is a residual block?

6. What is a Squeeze-and-Excitation block? How is it combined with ResNet or Inception modules?
