
AN ANALYSIS OF CONVOLUTIONAL

NEURAL NETWORK ARCHITECTURES.


Raahul Singh
S20180010141
THE POWER OF
CONVOLUTIONAL NEURAL
NETWORKS
• CNNs are changing the world as we know it.

• They are producing revolutionary results in many fields, such as:


MEDICAL DIAGNOSTICS
IMAGE CLASSIFICATION
ROBOTICS
BUT WHAT MAKES THEM SO
POWERFUL ?
IN ORDER TO PROPERLY UNDERSTAND A CNN, WE
NEED TO LOOK AT WHAT WE HAD BEFORE CNNS
WERE MAINSTREAM.
STAGES OF A CLASSIFIER
• Segmenter – Separates objects from their surroundings.

• Feature Extractor – Gathers relevant information from the input and
eliminates irrelevant variability.

• Classifier – Categorizes the resulting feature representations into classes.


PERFORMANCE METRIC
• Overall performance was determined by the quality of the segmenter and the
feature extractor, both of which were hand-crafted.

• Thus, the pipeline was problem-specific and not portable to a new classification task.

• Prone to the fallacies of unjustified human assumptions.

• Cannot take into account the variability of real-world objects.
SOLUTION
• Keep preprocessing to a minimum. Feed raw pixel data.

• Use gradient-based learning, i.e., a neural network.

• However, this alone does not solve the problem of spatial invariance and the affine
transformations of real-world objects.

• Besides, feeding raw pixel data to a fully connected network means a combinatorial
explosion in the number of trainable parameters.
SOLUTION TO THE SOLUTION
• In principle, a fully connected NN of sufficient size would be able to
produce outputs that are invariant to affine transformations. However,
this would require learning similar weight patterns at various locations in the
network.

• Convolution.

• But why Convolution? Why not any other operator?


CONVOLUTION
• A measure of overlap.

• A matrix which describes a specific transformation to highlight a specific
feature. This filter, so to speak, is applied throughout the input to produce
a new image, called a feature map.

• Now you know how Snapchat makes you look pretty.

• A filter can extract localised patterns from the input.
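
A minimal sketch of this sliding-filter idea (not part of the original slides; assumes NumPy, with a hypothetical vertical-edge filter): the filter is swept over the input, and the overlap at each location becomes one entry of the feature map.

import numpy as np

def convolve2d(image, kernel):
    # Slide `kernel` over `image` (valid padding); the summed element-wise
    # product at each location is the "measure of overlap" there.
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    fmap = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            fmap[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return fmap

image = np.zeros((6, 6))
image[:, 3:] = 1.0                       # left half dark, right half bright
vertical_edge = np.array([[1., 0., -1.],
                          [1., 0., -1.],
                          [1., 0., -1.]])
print(convolve2d(image, vertical_edge))  # large magnitudes mark the vertical edge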


CONVOLUTION VISUALISED
FEATURE EXTRACTION
VISUALIZED
LARGE VALUES IN THE FEATURE MAP INDICATE THE
PRESENCE OF A FEATURE AT THE SPECIFIC
LOCATION WHERE THE FILTER IS APPLIED.
SMALL VALUES INDICATE
ABSENCE.
HOW DOES THIS SOLVE SPATIAL
INVARIANCE?
• Replication of weights, i.e., applying the same filter at all possible
locations of the image, thus capturing every occurrence of a
feature in the feature map.

• This also preserves the information about the local 2D relationships


between the pixels, something which a simple feedforward network does
not consider.

• The size of the filter sets its “receptive field”, the local neighbourhood of the
input that each output unit sees; see the parameter comparison below.
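
A small illustration of why this weight replication matters (not in the original slides; assumes PyTorch and hypothetical sizes): one shared 5×5 filter covers a whole 32×32 image with a handful of weights, where a fully connected layer needs over a million.

import torch.nn as nn

# Hypothetical sizes: one 32x32 grayscale input, one output map of the same size.
fc = nn.Linear(32 * 32, 32 * 32)                   # every pixel wired to every output
conv = nn.Conv2d(1, 1, kernel_size=5, padding=2)   # one 5x5 filter shared everywhere

def num_params(m):
    return sum(p.numel() for p in m.parameters())

print(num_params(fc))    # 1049600 (= 1024*1024 weights + 1024 biases)
print(num_params(conv))  # 26 (= 5*5 weights + 1 bias)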
SO, CONV NETS.
• They combine three architectural ideas to ensure some degree of invariance to
shifting, scaling and distortion.
• Local receptive fields, i.e., convolving filters that extract localized features.

• Shared weights, or “sliding these filters”.

• Spatial subsampling

• Each layer receives input from a set of units located in a small


neighbourhood in the previous layer.
SUB SAMPLING
• With these receptive fields, neurons can extract elementary features.

• These features become an input for the next layer which extracts more
sophisticated features.

• Once a feature has been extracted, its absolute position in the image is of
little consequence, given the many positions and poses in which an object can
appear.

• What matters is its relative position with respect to the other extracted
features, for these will be recombined to give higher features.
• Learning the exact locations of features has another pitfall: it leads to a loss of
generality.

• The network acquires unjustified certainty about the specific localization


of a feature.

• This is solved by subsampling or pooling layers, which downsample, i.e.,
reduce the resolution of the feature maps.

• Again, this has a biological inspiration.


EXAMPLE OF SUB SAMPLING:
MAX POOLING
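
A minimal sketch of 2×2 max pooling (not from the original slides; assumes PyTorch): each 2×2 neighbourhood is replaced by its strongest response, halving the resolution while keeping the detected features.

import torch
import torch.nn as nn

fmap = torch.tensor([[1., 3., 2., 0.],
                     [4., 6., 1., 1.],
                     [0., 2., 9., 5.],
                     [1., 1., 3., 7.]]).reshape(1, 1, 4, 4)

pool = nn.MaxPool2d(kernel_size=2, stride=2)   # keep only the max of each 2x2 block
print(pool(fmap).reshape(2, 2))
# tensor([[6., 2.],
#         [2., 9.]])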
• Since all the weights/values in a filter can be learned through backprop, a
CNN acts as its own feature extractor and can tune its filters according
to the problem at hand.

• A large degree of invariance to affine transformations can be achieved by a


sequential implementation of convolution to extract a feature set and
progressive reduction of resolution.

• This decrease in resolution is compensated for by an increase in the
richness of the feature-map representations, thus giving us our
coveted prize: generality.

• The deeper the network, the richer the features it can extract.
CASE STUDY 1: LE NET 5 (1998)
• Main ideas: convolution, local receptive fields, shared weights, spatial
subsampling.
•  LeNet-5 is a very simple network.
• It only has 7 layers, among which there are 3 convolutional layers (C1, C3
and C5), 2 sub-sampling (pooling) layers (S2 and S4), and 1 fully
connected layer (F6), which are followed by the output layer.
• Convolutional layers use 5 by 5 convolutions with stride 1.
• Sub-sampling layers are 2 by 2 average pooling layers. Tanh sigmoid
activations are used throughout the network.
• It was limited by the computational power and the small labeled datasets
available at the time.
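
A minimal PyTorch sketch of the layer sequence described above (not part of the original slides). Details of the original, such as the partial C3 connectivity table and the RBF output layer, are simplified away.

import torch
import torch.nn as nn

class LeNet5(nn.Module):
    # C1-S2-C3-S4-C5-F6-output, with tanh activations and 2x2 average pooling.
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5), nn.Tanh(),     # C1: 32x32 -> 6@28x28
            nn.AvgPool2d(2, stride=2),                     # S2: 6@14x14
            nn.Conv2d(6, 16, kernel_size=5), nn.Tanh(),    # C3: 16@10x10
            nn.AvgPool2d(2, stride=2),                     # S4: 16@5x5
            nn.Conv2d(16, 120, kernel_size=5), nn.Tanh(),  # C5: 120@1x1
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(120, 84), nn.Tanh(),                 # F6
            nn.Linear(84, num_classes),                    # output layer
        )

    def forward(self, x):
        return self.classifier(self.features(x))

print(LeNet5()(torch.randn(1, 1, 32, 32)).shape)  # torch.Size([1, 10])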
CASE STUDY: ALEXNET (2012)
AlexNet contains eight layers:

• 1st: Convolutional Layer: 96 kernels of size 11×11×3 


(stride: 4, pad: 0)
55×55×96 feature maps
Then 3×3 Overlapping Max Pooling (stride: 2)
27×27×96 feature maps
Then Local Response Normalization
27×27×96 feature maps

• 2nd: Convolutional Layer: 256 kernels of size 5×5×48 


(stride: 1, pad: 2)
27×27×256 feature maps
Then 3×3 Overlapping Max Pooling (stride: 2)
13×13×256 feature maps
Then Local Response Normalization
13×13×256 feature maps
• 3rd: Convolutional Layer: 384 kernels of size 3×3×256 
(stride: 1, pad: 1)
13×13×384 feature maps
• 4th: Convolutional Layer: 384 kernels of size 3×3×192 
(stride: 1, pad: 1)
13×13×384 feature maps
• 5th: Convolutional Layer: 256 kernels of size 3×3×192 
(stride: 1, pad: 1)
13×13×256 feature maps
Then 3×3 Overlapping Max Pooling (stride: 2)
6×6×256 feature maps
• 6th: Fully Connected (Dense) Layer of 
4096 neurons
• 7th: Fully Connected (Dense) Layer of 
4096 neurons
• 8th: Fully Connected (Dense) Layer of
Output: 1000 neurons (since there are 1000 classes)
• Softmax is used for calculating the loss.
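
A hedged PyTorch sketch of the first two stages listed above, as a single-tower variant (not the original two-GPU implementation, so the second layer's kernels see all 96 channels rather than 48). Note that the listed 55×55 output requires a 227×227 input (or equivalent padding); the paper's 224×224 figure is a well-known off-by-one.

import torch
import torch.nn as nn

stage = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(inplace=True),    # 96@55x55
    nn.MaxPool2d(kernel_size=3, stride=2),                                # 96@27x27
    nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),
    nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(inplace=True),  # 256@27x27
    nn.MaxPool2d(kernel_size=3, stride=2),                                # 256@13x13
    nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),
)
print(stage(torch.randn(1, 3, 227, 227)).shape)  # torch.Size([1, 256, 13, 13])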
DEVIATIONS FROM LE NET
• Before AlexNet, Tanh was the standard activation; AlexNet introduced ReLU,
which reaches a 25% training error rate about six times faster than Tanh.

• Tanh is a saturating function prone to vanishing gradients.
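
A tiny demonstration of the saturation argument (not from the slides; assumes PyTorch): the Tanh gradient collapses for large inputs, while the ReLU gradient stays at 1 for every positive input.

import torch

x = torch.tensor([-10., -1., 0., 1., 10.], requires_grad=True)
torch.tanh(x).sum().backward()
print(x.grad)   # ~[0.00, 0.42, 1.00, 0.42, 0.00]: vanishes as |x| grows

x = torch.tensor([-10., -1., 0., 1., 10.], requires_grad=True)
torch.relu(x).sum().backward()
print(x.grad)   # [0., 0., 0., 1., 1.]: constant for positive inputs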


LOCAL RESPONSE NORMALIZATION

• In AlexNet, local response normalization is used. Normalization


helps to speed up the convergence.
• Data Augmentation
Image translation and horizontal reflection (mirroring):
a random 224×224 crop is extracted from each 256×256 image, together with its
horizontal reflection. The size of the training set is thereby increased by a factor of
2048.
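
A rough equivalent of this augmentation recipe (not from the slides; assumes torchvision): a random 224×224 crop of the 256×256 training image plus a random mirror, which is roughly where the 32×32×2 = 2048 factor comes from.

from torchvision import transforms

train_augment = transforms.Compose([
    transforms.RandomCrop(224),              # random 224x224 patch of the 256x256 image
    transforms.RandomHorizontalFlip(p=0.5),  # horizontal reflection (mirroring)
    transforms.ToTensor(),
])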
DROPOUT
• In a layer that uses dropout, during training each neuron has a fixed
probability of being dropped, i.e., of not contributing to the forward pass
or participating in backpropagation. Thus, every neuron gets a chance to be
trained and does not come to depend too much on a few very “strong” neurons.
• At test time, no dropout is applied.
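
A minimal sketch of this behaviour (not from the slides; assumes PyTorch, which uses "inverted" dropout: survivors are rescaled during training instead of scaling outputs at test time, but the effect is equivalent to the paper's scheme).

import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)   # AlexNet used p = 0.5 in the first two FC layers
x = torch.ones(8)

drop.train()
print(drop(x))   # about half the activations zeroed, survivors scaled by 1/(1-p)

drop.eval()
print(drop(x))   # identity at test time: nothing is dropped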
• All these changes made AlexNet a breakthrough in the field of image
classification, and it is often credited with bringing deep learning to the
forefront of machine learning.
CASE STUDY: ZF NET (2013)
• ZF Net focused on visualising the transformations learned at each layer.
DECONVNET TECHNIQUES FOR VISUALIZATION
ZF NET : MODIFICATIONS OF ALEXNET BASED
ON VISUALIZATION RESULTS

• Reduced the 1st layer filter size from 11x11 to 7x7.


• Made the 1st layer stride of the convolution 2, rather than 4.
• By visualizing the convolutional network layer by layer, ZFNet adjusts the
layer hyperparameters such as filter size or stride of the AlexNet and
successfully reduces the error rates.
• It is important to note that although pooling is non-invertible and
only approximate reconstructions can be obtained, convolution is a
linear transformation and hence can be inverted (by applying the transposed filters).
This deconvolution is used to generate approximate visualizations
of the outputs of the intermediate layers.
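
A hedged PyTorch sketch of these two ingredients (not from the slides): max pooling can record its "switch" locations so it can be approximately undone, and a transposed convolution that reuses the learned filters maps a feature map back towards pixel space.

import torch
import torch.nn as nn

x = torch.randn(1, 1, 8, 8)

pool = nn.MaxPool2d(2, stride=2, return_indices=True)   # remember switch locations
unpool = nn.MaxUnpool2d(2, stride=2)
pooled, switches = pool(x)
print(unpool(pooled, switches).shape)   # torch.Size([1, 1, 8, 8]), non-max values lost

conv = nn.Conv2d(1, 4, kernel_size=3, padding=1)
deconv = nn.ConvTranspose2d(4, 1, kernel_size=3, padding=1)
deconv.weight.data = conv.weight.data   # apply the transposed version of the filters
print(deconv(conv(x)).shape)            # torch.Size([1, 1, 8, 8])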
CASE STUDY: VGG NET (2015)
VGG NET: CHANGES
• The Use of 3×3 Filters
• The number of parameters to train grows with the square of the filter size.
Hence, larger filters take longer to train and are more prone to overfitting.

• By using 2 layers of 3×3 filters, we have already covered a 5×5
area, as in the above figure. By using 3 layers of 3×3 filters, we have
already covered a 7×7 effective area. Thus, large filters such as the
11×11 in AlexNet and the 7×7 in ZFNet are not needed.

• For example:

• 1 layer of 11×11 filters: number of parameters = 11×11 = 121
5 layers of 3×3 filters: number of parameters = 3×3×5 = 45
The number of parameters is reduced by 63%.

• With fewer parameters to learn, convergence is faster and overfitting is
reduced (a small calculation follows below).
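
The same arithmetic as above in a few lines of Python (not from the slides); with C input and output channels the ratio is unchanged, since both counts scale by the same C² factor.

def stack_params(kernel, layers, channels):
    # weights in `layers` stacked conv layers with `channels` in and out
    return layers * kernel * kernel * channels * channels

for c in (1, 64):
    big = stack_params(11, 1, c)     # one 11x11 layer
    small = stack_params(3, 5, c)    # five 3x3 layers, same 11x11 receptive field
    print(c, big, small, f"{1 - small / big:.0%} fewer")
    # both cases print "63% fewer": the saving is independent of the channel count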
• As you can see, by adding more layers, the error rate keeps
decreasing progressively till it reaches a minimum, after which it starts
increasing.

• This can be interpreted as the start of overfitting by the network, or loss


of generalisation.

• This can be combated by using a larger, more varied dataset.


MULTI SCALE TRAINING AND
TESTING
• As objects appear at different scales within images, if we only train the network at a
single scale, we might miss detections or misclassify
objects at other scales. To tackle this, the authors propose multi-scale training.
• For single-scale training, an image is rescaled so that its smaller side equals 256 or 384,
i.e. S=256 or 384. Since the network accepts only 224×224 input images, the scaled
image is then cropped to 224×224.

• For multi-scale training, an image is rescaled so that its smaller side is drawn from the
range 256 to 512, i.e. S=[256;512], and then cropped to 224×224. Therefore, with a range of
S, we feed differently scaled objects into the network for training; a sketch follows below.
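
A minimal sketch of this scale jittering (not from the slides; assumes torchvision): the shorter side S is drawn from [256, 512] before the fixed 224×224 crop is taken.

import random
from torchvision import transforms

def multi_scale_transform():
    S = random.randint(256, 512)            # training scale S sampled per call
    return transforms.Compose([
        transforms.Resize(S),               # shorter image side rescaled to S
        transforms.RandomCrop(224),         # fixed 224x224 network input
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
    ])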
CASE STUDY: GOOGLENET
(2015)
• The network architecture is quite different from VGGNet, ZFNet, and
AlexNet. It contains 1×1 convolutions in the middle of the network,
and global average pooling is used at the end of the network instead
of fully connected layers.
WHAT IF WE COULD INTRODUCE A NON
LINEARITY IN THE CONVOLUTION LAYER ITSELF?

• Welcome to “Network in Network”.


• In GoogLeNet, 1×1 convolution is used as a dimension reduction module to reduce the
computation. By reducing the computation bottleneck, depth and width can be increased.
• Example:
• Without the use of 1×1 convolution
•  Number of operations =
(14×14×48)×(5×5×480) = 112.9M
• With the use of 1×1 convolution:
• Number of operations for 1×1 =
(14×14×16)×(1×1×480) = 1.5M
Number of operations for 5×5 =
(14×14×48)×(5×5×16) = 3.8M
Total number of operations = 1.5M + 3.8M = 5.3M
which is much, much smaller than 112.9M!
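
The same count reproduced in a few lines (not from the slides), counting multiplications as output positions × output channels × kernel size × input channels:

def conv_ops(out_h, out_w, out_c, k, in_c):
    return out_h * out_w * out_c * k * k * in_c

direct = conv_ops(14, 14, 48, 5, 480)                             # ~112.9M
bottleneck = conv_ops(14, 14, 16, 1, 480) + conv_ops(14, 14, 48, 5, 16)
print(f"{direct / 1e6:.1f}M vs {bottleneck / 1e6:.1f}M")          # 112.9M vs 5.3M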
• Think of NIN as a horizontal mini neural net, which is used in place of a
simple filter.
• Whereas pooling reduces the length and breadth of a layer, NIN helps in
reducing its depth (the number of channels).

• The added non-linearity adds to the richness of the features that a filter,
in this case the NIN, can extract, further combating spatial variance.
THE INCEPTION MODULE
• Previously, in nets like AlexNet and VGGNet, the convolution size is fixed for each
layer.
• Now, 1×1 conv, 3×3 conv, 5×5 conv, and 3×3 max pooling are applied in parallel
to the previous layer's output, and their results are stacked together again at the
output. When an image comes in, the network can choose the right
path.
• We can now appreciate the depth-wise dimensionality reduction that NIN
provides; a minimal module is sketched below.

• Without the 1×1 convolution as above, the total number of operations


would be gargantuan.
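
A simplified PyTorch sketch of such a module (not the full GoogLeNet code; activations after some convolutions are omitted for brevity). The channel numbers match GoogLeNet's "3a" block, giving 64 + 128 + 32 + 32 = 256 output channels.

import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    # Parallel 1x1, 3x3 and 5x5 convolutions plus a pooled branch; 1x1
    # bottlenecks cut the cost; outputs are stacked along the channel axis.
    def __init__(self, in_c, c1, c3_red, c3, c5_red, c5, pool_c):
        super().__init__()
        self.b1 = nn.Conv2d(in_c, c1, 1)
        self.b3 = nn.Sequential(nn.Conv2d(in_c, c3_red, 1), nn.ReLU(True),
                                nn.Conv2d(c3_red, c3, 3, padding=1))
        self.b5 = nn.Sequential(nn.Conv2d(in_c, c5_red, 1), nn.ReLU(True),
                                nn.Conv2d(c5_red, c5, 5, padding=2))
        self.bp = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_c, pool_c, 1))

    def forward(self, x):
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.bp(x)], dim=1)

block = InceptionBlock(192, 64, 96, 128, 16, 32, 32)
print(block(torch.randn(1, 192, 28, 28)).shape)  # torch.Size([1, 256, 28, 28])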
GLOBAL AVERAGE POOLING

• Previously, fully connected (FC) layers were used at the end of the network,
such as in AlexNet, where all inputs are connected to each output.

• Number of weights (connections) above = 7×7×1024×1024 = 51.3M

• In GoogLeNet, global average pooling is used near the end of the
network, averaging each feature map from 7×7 down to 1×1, as in the
figure above.

• Number of weights = 0
• Like the 1×1 convolution, this idea comes from NIN and is less prone to
overfitting.

• In the last few years, experts have turned to global average pooling (GAP)
layers to minimize overfitting by reducing the total number of parameters
in the model.

• Similar to max pooling layers, GAP layers are used to reduce the spatial
dimensions of a three-dimensional tensor. 

• However, GAP layers perform a more extreme type of dimensionality


reduction, where a tensor with dimensions [ h x w x d ] is reduced in size
to dimensions [ 1 x 1 x d ]. GAP layers reduce each [ h x w ] feature map
to a single number by simply taking the average of all [ hw ] values.
• Then softmax activation function is applied to yield the predicted
probability of each class.
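
A minimal sketch of this final stage (not from the slides; assumes PyTorch). The pooling itself has zero weights; GoogLeNet still keeps one linear layer after it to map the 1024 pooled values to the 1000 classes.

import torch
import torch.nn as nn

fmaps = torch.randn(1, 1024, 7, 7)       # final 7x7x1024 feature maps

gap = nn.AdaptiveAvgPool2d(1)            # average each 7x7 map down to 1x1: 0 weights
pooled = gap(fmaps).flatten(1)           # shape (1, 1024)

classifier = nn.Linear(1024, 1000)       # only the mapping to the 1000 classes remains
probs = torch.softmax(classifier(pooled), dim=1)
print(pooled.shape, probs.shape)         # torch.Size([1, 1024]) torch.Size([1, 1000])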
GLOBAL AVERAGE POOLING
VISUALIZATION
FINALLY, CONCLUSIONS
REFERENCES:

• Object Recognition with Gradient-Based Learning, Yann LeCun et al., 1999
• Gradient-Based Learning Applied to Document Recognition, Yann LeCun et al., 1998
• ImageNet Classification with Deep Convolutional Neural Networks, Alex Krizhevsky
et al., 2012
• Visualizing and Understanding Convolutional Networks, Matthew D. Zeiler and Rob
Fergus, 2013
• Very Deep Convolutional Networks for Large-Scale Image Recognition, Visual
Geometry Group, Department of Engineering Science, University of Oxford, 2015
• Network In Network, Min Lin, Qiang Chen, Shuicheng Yan, 2014
• Going Deeper with Convolutions, Google Inc. et al., 2015
