
Deep Learning

CNN Architectures
for image classification

Verónica Vilaplana Besler


Veronica.vilaplana@upc.edu

Associate Professor
Universitat Politècnica de Catalunya
Index
• CNN Architectures for Image classification
• The ImageNet large scale visual recognition challenge
• AlexNet
• VGG-Net
• GoogLeNet
• ResNet
• SENet
• DenseNet
• Comparison

ImageNet dataset (ILSVRC)
Large Scale Visual Recognition Challenge (2012-2017)

ImageNet
• Photographs collected from image search engines, with a non-uniform distribution of images per category (following the ImageNet hierarchy). Human labels obtained via Amazon Mechanical Turk (14 M images, 1 M with bounding box annotations)

Dataset for the challenge
• 1.2 million training images
• 100 K test images
• 1000 object classes (categories)
• Balanced dataset
• Categories are leaves of the ImageNet hierarchy
• No overlap: the presence of one category implies the absence of another
• Suitable for softmax classification

Particularity: the 1000 classes contain 120 breeds of dogs!

www.image-net.org/challenges/LSVRC/
ILSVRC Metric
Metric: top-5 error rate
- The algorithm predicts at most 5 classes, in descending order of confidence
- If the correct class is among the first 5 predictions, the error is 0; otherwise it is 1

(Example images in the figure: Error = 0, Error = 0, Error = 1)
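To make the metric concrete, here is a minimal sketch (assuming PyTorch tensors of class scores and ground-truth labels; not part of the original slides) of how top-5 error can be computed for a batch:

```python
import torch

def top5_error(logits, targets):
    # logits: (N, num_classes) class scores; targets: (N,) ground-truth class indices
    top5 = logits.topk(5, dim=1).indices              # 5 most confident classes per image
    hit = (top5 == targets.unsqueeze(1)).any(dim=1)   # True if the correct class is among them
    return 1.0 - hit.float().mean().item()            # fraction of images with no top-5 hit

# toy example: 2 images, 10 classes
scores = torch.randn(2, 10)
labels = torch.tensor([3, 7])
print(top5_error(scores, labels))
```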


ImageNet: ILSVRC winners
Pre-2012 approaches based on
• SIFT, CSIFT, GIST, LBP, color statistics
• Fisher vector encoding
• Linear SVMs
• Performance started to saturate

2012 winner: AlexNet, the first CNN-based winner
• SuperVision team (Krizhevsky, Sutskever, Hinton)
• AlexNet CNN
• ~10% margin over the other approaches
• Revolution in computer vision
• Top-5 error 15.3%

Human performance: approximately 5%
Previously: LeNet-5
• LeCun et al., 1998

MNIST digit classification problem
• handwritten digits
• 60,000 training examples
• 10,000 test samples
• 10 classes
• 32x32 grayscale images

Architecture
• Conv filters were 5x5, applied at stride 1, no padding (Conv 6 5x5, then Conv 16 5x5)
• Sigmoid or tanh nonlinearity
• Subsampling (average pooling) layers were 2x2, applied at stride 2
• Fully connected layers at the end
• The output of the last convolutional layer is flattened to a single vector, which is the input to a fully connected layer
• i.e. the architecture is [CONV-POOL-CONV-POOL-FC-FC], about 60K parameters

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-based learning applied to document recognition, 1998.
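As a reference, a minimal PyTorch sketch of this [CONV-POOL-CONV-POOL-FC-FC] pattern (the FC widths 120 and 84 follow the original paper; roughly 60K parameters in total):

```python
import torch.nn as nn

# Simplified LeNet-5: tanh nonlinearity and average pooling, as described above.
lenet5 = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5, stride=1, padding=0),   # 32x32x1 -> 28x28x6
    nn.Tanh(),
    nn.AvgPool2d(kernel_size=2, stride=2),                  # -> 14x14x6
    nn.Conv2d(6, 16, kernel_size=5, stride=1, padding=0),   # -> 10x10x16
    nn.Tanh(),
    nn.AvgPool2d(kernel_size=2, stride=2),                  # -> 5x5x16
    nn.Flatten(),                                            # flatten to a single vector
    nn.Linear(16 * 5 * 5, 120), nn.Tanh(),
    nn.Linear(120, 84), nn.Tanh(),
    nn.Linear(84, 10),                                       # 10 digit classes
)
```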
AlexNet
• Krizhevsky, Sutskever, Hinton, 2012

• Similar framework to LeNet, but:
• More data and a bigger model
• 8 layers (5 convolutional, 3 fully connected)
• 650,000 units
• 60 million free parameters
• Max pooling, ReLU nonlinearities
• GPU implementation: trained on two GPUs for a week
• Dropout regularization / data augmentation
• Ensemble of 7 nets used in the ILSVRC challenge

A. Krizhevsky, I. Sutskever, and G. Hinton, ImageNet Classification with Deep Convolutional Neural Networks, NIPS 2012
AlexNet

Full (simplified) AlexNet architecture (8 layers):
[224x224x3] INPUT
[55x55x96] CONV1: 96 11x11 filters at stride 4, pad 0
[27x27x96] MAX POOL1: 3x3 filters at stride 2
[27x27x96] NORM1: Normalization layer
[27x27x256] CONV2: 256 5x5 filters at stride 1, pad 2
[13x13x256] MAX POOL2: 3x3 filters at stride 2
[13x13x256] NORM2: Normalization layer
[13x13x384] CONV3: 384 3x3 filters at stride 1, pad 1
[13x13x384] CONV4: 384 3x3 filters at stride 1, pad 1
[13x13x256] CONV5: 256 3x3 filters at stride 1, pad 1
[6x6x256] MAX POOL3: 3x3 filters at stride 2
[4096] FC6: 4096 neurons
[4096] FC7: 4096 neurons
[1000] FC8: 1000 neurons (class scores)

ILSVRC 2012 winner: 16.4% top-5 error; 15.4% with a 7-CNN ensemble

Details/Retrospectives:
- first use of ReLU
- used Norm layers (no longer common)
- heavy data augmentation
- dropout 0.5
- batch size 128
- SGD with momentum 0.9
- learning rate 1e-2, reduced by 10 manually when validation accuracy plateaus
- L2 weight decay 5e-4

Slide credit: Stanford cs231n
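A PyTorch sketch that follows the layer list above (single-GPU version; the NORM layers are local response normalization). Note that to actually obtain 55x55 after CONV1 the input is usually taken as 227x227, a well-known inconsistency in the original paper:

```python
import torch.nn as nn

alexnet = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4),               # CONV1 -> 55x55x96 (227x227 input)
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2),                     # POOL1 -> 27x27x96
    nn.LocalResponseNorm(5),                                    # NORM1
    nn.Conv2d(96, 256, kernel_size=5, stride=1, padding=2),    # CONV2 -> 27x27x256
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2),                     # POOL2 -> 13x13x256
    nn.LocalResponseNorm(5),                                    # NORM2
    nn.Conv2d(256, 384, kernel_size=3, stride=1, padding=1),   # CONV3 -> 13x13x384
    nn.ReLU(inplace=True),
    nn.Conv2d(384, 384, kernel_size=3, stride=1, padding=1),   # CONV4 -> 13x13x384
    nn.ReLU(inplace=True),
    nn.Conv2d(384, 256, kernel_size=3, stride=1, padding=1),   # CONV5 -> 13x13x256
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2),                      # POOL3 -> 6x6x256
    nn.Flatten(),
    nn.Linear(6 * 6 * 256, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),  # FC6
    nn.Linear(4096, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),          # FC7
    nn.Linear(4096, 1000),                                                    # FC8: class scores
)
```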
AlexNet

• Most of the memory usage is in the early convolution layers
• Nearly all parameters are in the fully connected layers
• Most floating-point ops occur in the convolution layers

Slide credit: J. Johnson
ImageNet: ILSVRC winners
• ImageNet Large Scale Visual Recognition Challenge (ILSVRC)

2013: ZFNet improved hyperparameters over AlexNet
ZFNet
• Zeiler, Fergus, 2013

Refinement of AlexNet:
CONV1: change from (11x11 stride 4) to (7x7 stride 2)
CONV3,4,5: instead of 384, 384, 256 filters use 512, 1024, 512
ILSVRC 2013 winner, top-5 error: 16.4% -> 11.7%

More trial and error…

Visualization can help in proposing better architectures: “A convnet model that uses the same components (filtering, pooling) but in reverse, so instead of mapping pixels to features does the opposite.”

M. Zeiler and R. Fergus, Visualizing and Understanding Convolutional Networks, ECCV 2014 (Best Paper Award winner)
ImageNet: ILSVRC winners
• ImageNet Large Scale Visual Recognition Challenge (ILSVRC)

2014: deeper networks
VGG
GoogLeNet
VGGNet
• Simonyan, Zisserman (2014)
ILSVRC 2014: 11.7% -> 7.3% top-5 error
More principled (regular) design

Large receptive fields replaced by stacks of 3x3 convolutions
• Two 3x3 conv layers have the same receptive field as a single 5x5 conv
• But have fewer parameters and take less computation

Use only:
• 3x3 CONV, stride 1, pad 1
• 2x2 MAX POOL, stride 2
• After each pool, double the number of channels

Shows that depth is a critical component for good performance (16-19 layers)

2 versions
• VGG-16 (16 parameter layers)
• VGG-19 (19 parameter layers)

K. Simonyan and A. Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition, ICLR 2015
Slide credit: Stanford cs231n
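A sketch of the VGG design rule in PyTorch (3x3 conv, stride 1, pad 1; 2x2 max pool, stride 2; channels doubled after each pool). The five-stage channel plan below is the VGG-16 one; the classifier FC layers are omitted:

```python
import torch.nn as nn

def vgg_stage(in_ch, out_ch, num_convs):
    # A stack of 3x3 convs (stride 1, pad 1) followed by a 2x2 max pool (stride 2)
    layers = []
    for i in range(num_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch,
                             kernel_size=3, stride=1, padding=1),
                   nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

# VGG-16 convolutional body: 2+2+3+3+3 = 13 conv layers (plus 3 FC layers = 16)
vgg16_body = nn.Sequential(
    vgg_stage(3, 64, 2),
    vgg_stage(64, 128, 2),
    vgg_stage(128, 256, 3),
    vgg_stage(256, 512, 3),
    vgg_stage(512, 512, 3),
)
```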
VGGNet
• Most memory is in the early CONV layers
• Most parameters are in the late FC layers
Slide credit: Stanford cs231n

AlexNet vs VGG-16: a much bigger network
• Memory: AlexNet 1.9 MB vs VGG-16 48.6 MB (25x)
• Parameters: AlexNet 61 M vs VGG-16 138 M (2.3x)
• Compute: AlexNet 0.7 GFLOP vs VGG-16 13.6 GFLOP (19.4x)
Slide credit: J. Johnson
GoogLeNet
• Szegedy, Liu, Jia, Sermanet et al. (2014)
Many innovations for efficiency: reduce parameter count, memory usage and computation

ILSVRC 2014 winner (6.7% top-5 error)

22 layers
- A stem network to downsample the input
- Inception modules that dramatically reduce the number of parameters
- Global average pooling instead of FC layers
- Auxiliary classifiers

Compared to AlexNet: 12x fewer parameters (5M vs 60M), 2x more compute, 6.67% top-5 error (vs 16.4%)

C. Szegedy et al., Going deeper with convolutions, CVPR 2015


GoogLeNet
• A stem network at the start aggressively downsamples the input (recall that in VGG-16 most of the computation was at the start)

Stem total, from 224 to 28 spatial resolution: Memory 7.5 MB, Params 124K, MFLOP 418
Compare VGG-16: Memory 42.9 MB (5.7x), Params 1.1M (8.9x), MFLOP 7485 (17.8x)

Slide credit: J. Johnson
GoogLeNet
• The Inception module: a local unit with parallel branches
• Local structure repeated many times throughout the network

9 inception modules: a network within a network…

C. Szegedy et al., Going deeper with convolutions, CVPR 2015
Recall: 1x1 convolutions

1x1 convolution layers are used to reduce dimensionality (the number of feature maps)

Example (figure): a 32x32x128 input passed through a 1x1 conv with 64 filters gives a 32x32x64 output. Each filter has size 1x1x128 and performs a 128-dimensional dot product at every spatial position.
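A two-line PyTorch illustration of the example above, assuming a 32x32 input with 128 feature maps:

```python
import torch
import torch.nn as nn

# 1x1 convolution as a per-pixel dot product across channels: reduces 128 maps to 64
x = torch.randn(1, 128, 32, 32)            # N x C x H x W
reduce = nn.Conv2d(128, 64, kernel_size=1)
y = reduce(x)
print(y.shape)                              # torch.Size([1, 64, 32, 32])
```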
GoogLeNet
• The Inception module:
Parallel paths with different receptive field sizes and operations are meant to capture sparse patterns of correlations in the stack of feature maps
Uses 1x1 convolutions to reduce feature depth before the expensive convolutions

Apply parallel filter operations on the input from the previous layer:
- multiple receptive field sizes for convolution (1x1, 3x3, 5x5)
- a pooling operation (3x3)

Concatenate all filter outputs together depth-wise

C. Szegedy et al., Going deeper with convolutions, CVPR 2015
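A sketch of such a module in PyTorch (branch widths are arguments; the values used below roughly match the first inception module of GoogLeNet, but they are only an illustration):

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    # Four parallel branches whose outputs are concatenated depth-wise;
    # 1x1 convolutions reduce depth before the 3x3 and 5x5 convolutions.
    def __init__(self, in_ch, c1, c3r, c3, c5r, c5, pool_proj):
        super().__init__()
        self.b1 = nn.Sequential(nn.Conv2d(in_ch, c1, 1), nn.ReLU(inplace=True))
        self.b2 = nn.Sequential(nn.Conv2d(in_ch, c3r, 1), nn.ReLU(inplace=True),
                                nn.Conv2d(c3r, c3, 3, padding=1), nn.ReLU(inplace=True))
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, c5r, 1), nn.ReLU(inplace=True),
                                nn.Conv2d(c5r, c5, 5, padding=2), nn.ReLU(inplace=True))
        self.b4 = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_ch, pool_proj, 1), nn.ReLU(inplace=True))

    def forward(self, x):
        # Concatenate all branch outputs along the channel (depth) dimension
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

m = InceptionModule(192, 64, 96, 128, 16, 32, 32)
print(m(torch.randn(1, 192, 28, 28)).shape)   # torch.Size([1, 256, 28, 28])
```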


GoogLeNet
• Global average pooling (GAP). No fully connected layers needed

NiN: Lin, Min, Qiang Chen, and Shuicheng Yan, "Network in network," ICLR 2014. Figure: Alexis Cook
GoogLeNet
• No large FC layers at the end. Instead, it uses global average pooling to collapse the spatial dimensions and one linear layer to produce the class scores (recall VGG-16: most parameters were in the FC layers!)

Classifier output (removes the expensive FC layers); compare with VGG-16

Slide credit: J. Johnson
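A sketch of this classifier head in PyTorch, assuming a 1024-channel final feature map as in GoogLeNet:

```python
import torch.nn as nn

# Global average pooling head: collapse each HxW feature map to one number,
# then a single linear layer produces the class scores.
gap_head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),     # N x 1024 x 7 x 7 -> N x 1024 x 1 x 1
    nn.Flatten(),                # -> N x 1024
    nn.Linear(1024, 1000),       # -> N x 1000 class scores
)
```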
GoogLeNet
• Training using only a loss at the end of the network didn’t work well: the network is too deep, and gradients don’t propagate cleanly
• Attach auxiliary classifiers at several intermediate points in the network that also try to classify the image and receive a loss (with BatchNorm this trick is no longer needed)

(Figure: auxiliary classifiers attached at intermediate points; classifier output without expensive FC layers)

...and no (big) fully connected layers needed!

C. Szegedy et al., Going deeper with convolutions, CVPR 2015
Inception v2, v3, v4
• Improvements:
• Regularize training with batch normalization, reducing the importance of the auxiliary classifiers
• More variants of inception modules, with aggressive factorization of filters
• Increase the number of feature maps while decreasing the spatial resolution (pooling)

(Figures: factorized Inception-v2 modules and the Inception-v4 / Inception-ResNet blocks)

C. Szegedy et al., Rethinking the inception architecture for computer vision, CVPR 2016
C. Szegedy et al., Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning, arXiv 2016
ImageNet: ILSVRC winners
• ImageNet Large Scale Visual Recognition Challenge (ILSVRC)

2015: very deep networks (ResNet)
Residual Networks
• What happens when we continue stacking deeper layers on a plain CNN?

A 56-layer model has higher training and test error than a shallower model: the deep model performs worse, but not because of overfitting (it is also worse on the training set)

• Hypothesis: it is an optimization problem; deeper models are harder to train, and in particular they don’t learn the identity functions needed to emulate shallower models
• Solution (in principle): copy the learned layers from the shallower model and set the additional layers to the identity mapping
Residual Networks
• The residual module
• Introduce skip or shortcut connections (which existed before in several forms in the literature): fit a residual mapping instead of directly trying to fit the desired underlying mapping
• Make it easy for network layers to represent the identity mapping
• For some reason, the shortcut needs to skip at least two layers

H(x) = F(x) + x

Use the layers to fit the residual F(x) = H(x) - x instead of fitting H(x) directly

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, Deep Residual Learning for Image Recognition, CVPR 2016 (Best Paper)
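A minimal PyTorch sketch of a basic residual block (identity shortcut only; stride and projection variants omitted):

```python
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    # Two 3x3 conv layers fit F(x); the shortcut adds x back so the block
    # computes H(x) = F(x) + x.
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))   # first 3x3 conv
        out = self.bn2(self.conv2(out))            # second 3x3 conv -> F(x)
        return self.relu(out + x)                  # H(x) = F(x) + x
```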
Residual Networks
• Full architecture:
- A residual network is a stack of many residual blocks
- Regular design, like VGG: every residual block has two 3x3 conv layers
- The network is divided into stages: the first block of each stage halves the resolution (with stride-2 convolutions) and doubles the number of channels

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, Deep Residual Learning for Image Recognition, CVPR 2016 (Best Paper)
Residual Networks
• Uses the same aggressive stem as GoogLeNet to downsample the input 4x before applying the residual blocks
• Like GoogLeNet, no big fully connected layers: instead, global average pooling and a single linear layer at the end

Slide credit: J. Johnson
Residual Networks
• For deeper networks (ResNet-50 and beyond)
• use a bottleneck block to improve efficiency (similar to GoogLeNet)

Bottleneck block (for a 28x28x256 input):
- 1x1 conv, 64 filters, to project down to 28x28x64
- 3x3 conv operates over only 64 feature maps
- 1x1 conv, 256 filters, to project back to 256 feature maps (28x28x256)

Cost comparison:
- Directly performing 3x3 convolutions with 256 feature maps at input and output: 256 x 256 x 3 x 3 ~ 600K operations
- Using 1x1 convolutions to reduce 256 to 64 feature maps, followed by 3x3 convolutions, followed by 1x1 convolutions to expand back to 256 maps:
  256 x 64 x 1 x 1 ~ 16K
  64 x 64 x 3 x 3 ~ 36K
  64 x 256 x 1 x 1 ~ 16K
  Total: ~70K

Slide credit: J. Johnson
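A sketch of the bottleneck block in PyTorch for the 28x28x256 case above (batch normalization omitted for brevity):

```python
import torch.nn as nn

class BottleneckBlock(nn.Module):
    # 1x1 conv reduces 256 -> 64, the 3x3 conv runs on only 64 maps,
    # and a final 1x1 conv projects back to 256 before the shortcut addition.
    def __init__(self, channels=256, bottleneck=64):
        super().__init__()
        self.reduce = nn.Conv2d(channels, bottleneck, kernel_size=1)
        self.conv3x3 = nn.Conv2d(bottleneck, bottleneck, kernel_size=3, padding=1)
        self.expand = nn.Conv2d(bottleneck, channels, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.reduce(x))     # 1x1: 256 -> 64
        out = self.relu(self.conv3x3(out))  # 3x3 on 64 maps only
        out = self.expand(out)              # 1x1: 64 -> 256
        return self.relu(out + x)           # residual addition
```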
Residual Networks
• Training ResNet in practice
• Batch Normalization after every CONV layer
• Xavier/2 initialization from He et al.
• SGD + Momentum (0.9)
• Learning rate: 0.1, divided by 10 when validation error plateaus
• Mini-batch size 256
• Weight decay of 1e-5
• No dropout used

• Experimental results
• able to train very deep networks without degrading (152 layers on ImageNet)
• deeper networks now achieve lower training error as expected
• MSRA: ILSVRC & COCO 2015 competitions
- ImageNet Classification: “ultra deep”, 152 layers
- ImageNet Detection: 16% better than 2nd
- ImageNet Localization: 27% better than 2nd
- COCO Detection: 11% better than 2nd
- COCO Segmentation: 12% better than 2nd

• ILSVRC 2015 top-5 error 3.6% (better than human performance)


Residual Networks
• Architectures for ImageNet

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, Deep Residual Learning for Image Recognition, CVPR 2016 (Best Paper)
ImageNet: ILSVRC winners
• ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
Model ensembles
• ImageNet 2016 winner: Model Ensembles
• Multi-scale ensemble of Inception, Inception-Resnet, Resnet, Wide Resnet models

Shao et al., 2016


Improving ResNets
• ResNeXt

S. Xie, R. Girshick, P. Dollar, Z. Tu, and K. He, Aggregated Residual Transformations for Deep Neural Networks, CVPR 2017

ResNeXt on ImageNet

S. Xie, R. Girshick, P. Dollar, Z. Tu, and K. He, Aggregated Residual Transformations for Deep Neural Networks, CVPR 2017
ImageNet: ILSVRC winners
• ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
Squeeze and excitation networks
• SENet
• Hu, Albanie, Sun, Wu, 2017
• ILSVRC 2017 winner (2.251% top-5 error) with ResNeXt-152-SE
• An ensemble of SENets

• SE block
1. Squeeze each channel to a single number by average pooling
2. A 2-layer neural network produces a vector of numbers (one per channel)
3. These numbers are used to weight the channels

• It is possible to construct an SE network (SENet) by simply stacking a collection of SE blocks

J. Hu, S. Albanie, G. Sun, E. Wu, Squeeze-and-Excitation Networks
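A sketch of an SE block in PyTorch following the three steps above (the reduction ratio 16 is the default used in the paper):

```python
import torch.nn as nn

class SEBlock(nn.Module):
    # Squeeze each channel with global average pooling, pass through a small
    # 2-layer MLP, and use the resulting per-channel weights (sigmoid output)
    # to rescale the feature maps.
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        n, c, _, _ = x.shape
        w = self.squeeze(x).view(n, c)          # one number per channel
        w = self.excite(w).view(n, c, 1, 1)     # per-channel weights
        return x * w                            # reweight the channels
```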


Squeeze and excitation networks
• The squeeze and excitation blocks can be added to existing architectures

E.g., by adding SE blocks to ResNet-50 one can expect almost the same accuracy as ResNet-101.
J. Hu, S. Albanie, G. Sun, E. Wu, Squeeze-and-Excitation Networks
ImageNet: ILSVRC winners
• ImageNet Large Scale Visual Recognition Challenge (ILSVRC)

Completion of the challenge:
The annual ImageNet competition was no longer held after 2017 -> it has moved to Kaggle.
Densely Connected Neural Networks (DenseNets)
• Dense blocks, where each layer is connected to every other layer in a feedforward fashion

• Alleviates the vanishing gradient problem, strengthens feature propagation, encourages feature reuse

G. Huang, Z. Liu, and L. van der Maaten, Densely Connected Convolutional Networks, CVPR 2017 (Best Paper Award)
https://towardsdatascience.com/review-densenet-image-classification-b6631a8ef803
Slide credit: J. Johnson
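A simplified PyTorch sketch of a dense block (BN-ReLU-Conv layers, each adding growth_rate new feature maps to the running concatenation; the 1x1 bottleneck and transition layers of the full DenseNet are omitted):

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    # Each layer receives the concatenation of all previous feature maps
    # and contributes growth_rate new ones.
    def __init__(self, in_ch, growth_rate=32, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            ch = in_ch + i * growth_rate
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(ch),
                nn.ReLU(inplace=True),
                nn.Conv2d(ch, growth_rate, kernel_size=3, padding=1, bias=False),
            ))

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            out = layer(torch.cat(features, dim=1))  # feed all previous features
            features.append(out)
        return torch.cat(features, dim=1)
```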
DenseNets

G. Huang, Z. Liu, and L. van der Maaten, Densely Connected Convolutional Networks, CVPR 2017 (Best Paper Award)

DenseNets

ImageNet validation error vs. number of parameters and test-time operations

G. Huang, Z. Liu, and L. van der Maaten, Densely Connected Convolutional Networks, CVPR 2017 (Best Paper Award)
Comparing models
• An Analysis of Deep Neural Network Models for Practical Applications, 2017 (A. Canziani, E. Culurciello, A. Paszke)

https://towardsdatascience.com/neural-network-architectures-156e5bad51ba
Comparing models
• An Analysis of Deep Neural Network Models for Practical Applications, 2017 (A. Canziani)

• AlexNet: small compute, memory heavy, lower accuracy
• VGG: highest memory, most operations
• GoogLeNet: most efficient
• Inception-v4: ResNet + Inception
• ResNet: moderate efficiency, high accuracy

Circle size: # parameters

https://towardsdatascience.com/neural-network-architectures-156e5bad51ba
How to use a pre-trained network for a new task?
• In practice, very few methods train a CNN from scratch (with random initialization), because it is relatively rare to have a dataset of sufficient size
• Instead, it is common to pretrain a ConvNet on a very large dataset (e.g. ImageNet, which contains 1.2 million images in 1000 categories)
• and then use the ConvNet either
• as a fixed feature extractor for the task of interest (retrain only the classifier), if the dataset is small
• as an initialization: use the old weights as initialization and train the full network or only some of the higher layers, if the dataset is of medium size: transfer learning
How to use a pre-trained network for a new task?
• Strategy 1: Use as a feature extractor
Remove the final classification layers and use the network as an off-the-shelf feature extractor

How to use a pre-trained network for a new task?
• Strategy 2: Transfer learning (fine-tuning)
- Train new prediction layer(s)
- Keep the earlier layers frozen or fine-tune them
- More data = retrain more of the network (or all of it)

How to use a pre-trained network for a new task?
(Figure: features in earlier layers are more generic; features in later layers are more specific to the original task)
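A sketch of both strategies in PyTorch, assuming a recent torchvision and a hypothetical 10-class target task:

```python
import torch.nn as nn
import torchvision

# Start from an ImageNet-pretrained ResNet-50 (any pretrained backbone would do)
model = torchvision.models.resnet50(weights=torchvision.models.ResNet50_Weights.DEFAULT)

# Strategy 1: fixed feature extractor -- freeze all pretrained weights,
# then replace the final layer; only the new layer will be trained.
for p in model.parameters():
    p.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, 10)   # new prediction layer (trainable by default)

# Strategy 2: transfer learning / fine-tuning -- same replacement of the last layer,
# but keep some (or all) of the pretrained layers trainable, typically with a smaller
# learning rate; the more data available, the more of the network is retrained.
```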
Summary
• Architectures for image classification
• LeNet: pioneering network for digit recognition
• AlexNet: smaller compute, still memory heavy, lower accuracy
• VGG: highest memory, most operations
• GoogLeNet: most efficient
• ResNet: moderate efficiency depending on the model, better accuracy
• Squeeze-and-excitation networks (SENet)
• Densely connected networks (DenseNet)
• Transfer learning: use a trained network for a new task (as feature extractor or fine-tuning)

• What is missing from the picture?


• Training tricks and details: initialization, regularization, normalization
• Training data augmentation
• Averaging classifier outputs over multiple crops/flips
• Ensembles of networks
• Meta-learning: learning to learn net architectures
Review questions
1. Analyze and compare AlexNet and VGG in terms of memory usage, number of parameters and operations per layer. Which layers need the most memory / parameters / operations?

2. What is the use of 1x1 convolutions? In which models are they used, and why?

3. What is an inception module?

4. What is global average pooling and why is it used in GoogLeNet, ResNets and other models?

5. What is a residual block?

6. What is a Squeeze-and-Excitation block? How is it combined with ResNet or Inception modules?
