Convolutional Neural Networks
Problem: So far, our classifiers don’t
respect the spatial structure of images!
[Figure: a (2, 2) input image with pixel values 56, 231, 24, 2 is flattened into a (4,) vector, discarding the spatial layout]
Components of a Convolutional Network
Fully-Connected Layers, Activation Functions
Fully-Connected Layer
32x32x3 image -> stretch to 3072 x 1
[Figure: Input (1 x 3072) multiplied by a 10 x 3072 weight matrix W gives Output (1 x 10)]
Each output is 1 number: the result of taking a dot product between a row of W and the input (a 3072-dimensional dot product).
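A minimal NumPy sketch of this layer (random placeholder values, not trained weights):

```python
import numpy as np

# A CIFAR-style image: 3 channels, 32x32 pixels
image = np.random.rand(3, 32, 32)

# Fully-connected layer: stretch the image to a 3072-vector,
# then take 10 dot products (one per row of W)
x = image.reshape(3072)           # (3072,)
W = np.random.rand(10, 3072)      # 10 x 3072 weights
b = np.random.rand(10)            # one bias per output

scores = W @ x + b                # each score: a 3072-dim dot product
print(scores.shape)               # (10,)
```

Note that the reshape destroys the spatial structure: two pixels that were vertical neighbors end up 32 entries apart in `x`.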
Convolution Layer
3x32x32 image: preserve spatial structure
[Figure: the image as a volume with 32 height, 32 width, 3 depth/channels]
Convolution Layer
Convolve six 3x5x5 filters (a 6x3x5x5 weight tensor) with the 3x32x32 image, then stack the activations to get a 6x28x28 output image!
The output consists of 6 activation maps, each 1x28x28. There is also a 6-dim bias vector: one bias per filter.
Equivalently, the output is a 28x28 grid with a 6-dim feature vector at each point.
With a batch of 2 images (2x3x32x32), the same filters produce a batch of outputs (2x6x28x28).
In general, a batch of images N x Cin x H x W convolved with Cout filters of shape Cin x Kh x Kw (a Cout x Cin x Kh x Kw weight tensor, plus a Cout-dim bias vector) gives a batch of outputs N x Cout x H' x W'.
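A naive, loop-based NumPy sketch of the layer described above (random placeholder weights; real frameworks use much faster implementations):

```python
import numpy as np

def conv2d(x, w, b):
    """Naive convolution: x is (Cin, H, W), w is (Cout, Cin, K, K), b is (Cout,)."""
    Cout, Cin, K, _ = w.shape
    H, W_ = x.shape[1], x.shape[2]
    Hp, Wp = H - K + 1, W_ - K + 1          # output spatial size
    out = np.empty((Cout, Hp, Wp))
    for f in range(Cout):                   # one activation map per filter
        for i in range(Hp):
            for j in range(Wp):
                # dot product of filter f with the KxK window at (i, j)
                out[f, i, j] = np.sum(x[:, i:i+K, j:j+K] * w[f]) + b[f]
    return out

x = np.random.rand(3, 32, 32)        # 3x32x32 image
w = np.random.rand(6, 3, 5, 5)       # six 3x5x5 filters
b = np.random.rand(6)                # 6-dim bias vector
out = conv2d(x, w, b)
print(out.shape)                     # (6, 28, 28)
```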
Stacking Convolutions
Input: N x 3 x 32 x 32 -> Conv (W1: 6x3x5x5, b1: 6) -> ReLU -> first hidden layer: N x 6 x 28 x 28 -> further convs shrink the spatial size (28 -> 26 -> ...).
(For comparison, AlexNet's first layer uses 64 filters, each 3x11x11.)
A closer look at spatial dimensions
Input: 7x7
Filter: 3x3
Output: 5x5
In general: Input W, Filter K -> Output W – K + 1
Problem: feature maps "shrink" with each layer!
Solution: padding! Add zeros around the input.
[Figure: the 7x7 input surrounded by a border of zeros before convolving]
A closer look at spatial dimensions, with padding
Input: W
Filter: K
Padding: P
Output: W – K + 1 + 2P
Very common: set P = (K – 1) / 2 to make the output have the same size as the input!
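The output-size formula can be wrapped in a small helper (a sketch, not library code):

```python
def conv_output_size(W, K, P=0):
    """Spatial output size of a convolution: W - K + 1, plus 2P from padding."""
    return W - K + 2 * P + 1

assert conv_output_size(7, 3) == 5          # 7x7 input, 3x3 filter -> 5x5
assert conv_output_size(7, 3, P=1) == 7     # "same" padding: P = (K - 1) / 2
```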
Receptive Fields
For convolution with kernel size K, each element in the output depends on a K x K receptive field in the input.
Each successive convolution adds K – 1 to the receptive field size: with L layers the receptive field size is 1 + L * (K – 1).
(Be careful to distinguish the "receptive field in the input" from the "receptive field in the previous layer".)
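As a quick check of this rule (a sketch of the formula only, ignoring stride and padding):

```python
def receptive_field(L, K):
    """Receptive field after L conv layers with kernel size K: 1 + L * (K - 1)."""
    return 1 + L * (K - 1)

# Stacking 3x3 convs grows the receptive field by 2 per layer:
print([receptive_field(L, K=3) for L in (1, 2, 3)])   # [3, 5, 7]
```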
Worked example:
Input volume: 3 x 32 x 32
10 5x5 filters with stride 1, pad 2
Output volume: (32 – 5 + 2*2) / 1 + 1 = 32 per spatial dimension, i.e. 10 x 32 x 32.
Learnable parameters: each filter has 3*5*5 = 75 weights plus 1 bias, so 10 * 76 = 760.
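The same example, checked with the output-size formula:

```python
# 3 x 32 x 32 input; 10 5x5 filters with stride 1, pad 2
H, K, P, S = 32, 5, 2, 1
H_out = (H - K + 2 * P) // S + 1      # "same" padding keeps the size at 32
num_params = 10 * (3 * 5 * 5 + 1)     # 75 weights + 1 bias per filter
print(H_out, num_params)              # 32 760
```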
Example: 1x1 Convolution
1x1 CONV with 32 filters applied to a 64 x 56 x 56 input gives a 32 x 56 x 56 output (each filter has size 1x1x64, and performs a 64-dimensional dot product).
Stacking 1x1 conv layers gives an MLP operating on each input position.
Lin et al, "Network in Network", ICLR 2014
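A sketch of the 1x1 convolution as a per-position matrix multiply (random placeholder values):

```python
import numpy as np

# 1x1 convolution: a 64-dimensional dot product at every spatial position.
x = np.random.rand(64, 56, 56)          # 64 x 56 x 56 input
W = np.random.rand(32, 64)              # 32 filters, each 1x1x64
b = np.random.rand(32)

# For each position (h, w): out[:, h, w] = W @ x[:, h, w] + b
out = np.einsum('oc,chw->ohw', W, x) + b[:, None, None]
print(out.shape)                        # (32, 56, 56)
```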
Convolution Summary
Input: Cin x H x W
Hyperparameters:
- Kernel size: KH x KW
- Number of filters: Cout
- Padding: P
- Stride: S
Weight matrix: Cout x Cin x KH x KW, giving Cout filters of size Cin x KH x KW
Bias vector: Cout
Output size: Cout x H' x W' where:
- H' = (H – K + 2P) / S + 1
- W' = (W – K + 2P) / S + 1
Common settings:
- KH = KW (small square filters)
- P = (K – 1) / 2 ("same" padding)
- Cin, Cout = 32, 64, 128, 256 (powers of 2)
- K = 3, P = 1, S = 1 (3x3 conv)
- K = 5, P = 2, S = 1 (5x5 conv)
- K = 1, P = 0, S = 1 (1x1 conv)
- K = 3, P = 1, S = 2 (downsample by 2)
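A quick sanity check of these common settings against the output-size formula (integer division stands in for the floor that frameworks apply):

```python
def conv_out(W, K, P, S):
    """H' = (W - K + 2P) / S + 1, the convolution output-size formula."""
    return (W - K + 2 * P) // S + 1

# The common settings either preserve the spatial size or halve it:
assert conv_out(32, K=3, P=1, S=1) == 32   # 3x3 conv
assert conv_out(32, K=5, P=2, S=1) == 32   # 5x5 conv
assert conv_out(32, K=1, P=0, S=1) == 32   # 1x1 conv
assert conv_out(32, K=3, P=1, S=2) == 16   # downsample by 2
```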
Other types of convolution
So far, 2D convolution: Input Cin x H x W, Weights Cout x Cin x K x K.
1D convolution: Input Cin x W, Weights Cout x Cin x K.
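A naive NumPy sketch of 1D convolution (bias omitted for brevity):

```python
import numpy as np

def conv1d(x, w):
    """Naive 1D convolution: x is (Cin, W), w is (Cout, Cin, K)."""
    Cout, Cin, K = w.shape
    Wp = x.shape[1] - K + 1            # output length: W - K + 1
    out = np.empty((Cout, Wp))
    for f in range(Cout):
        for i in range(Wp):
            out[f, i] = np.sum(x[:, i:i+K] * w[f])
    return out

x = np.random.rand(8, 100)        # Cin=8 channels, length-100 signal
w = np.random.rand(16, 8, 5)      # Cout=16 filters of size Cin x K
y = conv1d(x, w)
print(y.shape)                    # (16, 96)
```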
Pooling Layers: another way to downsample
Example: 64 x 224 x 224 input -> 64 x 112 x 112 output.
Hyperparameters: kernel size, stride, pooling function.
Max Pooling: keep only the maximum value within each kernel window.
Introduces invariance to small spatial shifts. No learnable parameters!
Pooling Summary
Input: C x H x W
Hyperparameters:
- Kernel size: K
- Stride: S
- Pooling function (max, avg)
Output: C x H' x W' where:
- H' = (H – K) / S + 1
- W' = (W – K) / S + 1
Common settings:
- max, K = 2, S = 2
- max, K = 3, S = 2 (AlexNet)
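A NumPy sketch of max pooling for the common K = 2, S = 2 setting (the reshape trick below assumes non-overlapping windows, i.e. K = S):

```python
import numpy as np

def max_pool(x, K=2):
    """Max pooling with kernel K and stride K: no learnable parameters."""
    C, H, W = x.shape
    # Split H and W into (H//K, K) and (W//K, K), then take the max per window
    return x.reshape(C, H // K, K, W // K, K).max(axis=(2, 4))

x = np.random.rand(64, 224, 224)
out = max_pool(x)
print(out.shape)                   # (64, 112, 112): downsampled by 2
```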
Hierarchy of feature detectors in visual cortex
Components of a Convolutional Network
Convolution Layers, Pooling Layers, Fully-Connected Layers, Activation Functions, Normalization
Normalization example: x̂_(i,j) = (x_(i,j) – μ_j) / sqrt(σ_j² + ε)
Convolutional Networks
Classic architecture: [Conv, ReLU, Pool] x N, flatten, [FC, ReLU] x N, FC
Example: LeNet-5
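A quick shape trace of LeNet-5's conv/pool stack (a sketch that checks sizes only; it ignores details of the original network such as its sparse C3 connectivity):

```python
def conv_shape(C, H, W, Cout, K, P=0, S=1):
    """Shape after a conv layer, assuming square inputs and filters."""
    Hp = (H - K + 2 * P) // S + 1
    return Cout, Hp, Hp

def pool_shape(C, H, W, K=2):
    """Shape after K x K pooling with stride K."""
    return C, H // K, W // K

shape = (1, 32, 32)                        # grayscale 32x32 input
shape = conv_shape(*shape, Cout=6, K=5)    # Conv -> (6, 28, 28)
shape = pool_shape(*shape)                 # Pool -> (6, 14, 14)
shape = conv_shape(*shape, Cout=16, K=5)   # Conv -> (16, 10, 10)
shape = pool_shape(*shape)                 # Pool -> (16, 5, 5)
flat = shape[0] * shape[1] * shape[2]      # flatten before the FC layers
print(shape, flat)                         # (16, 5, 5) 400
```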
Ioffe and Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift", ICML 2015
Batch Normalization
Idea: "normalize" the outputs of a layer so they have zero mean and unit variance.
The normalized values are then scaled by γ and shifted by β. These two are learnable parameters: during training the network learns the optimal values of γ and β, enabling accurate normalization of each batch.
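A NumPy sketch of the batch-norm computation at training time (per-batch statistics only; the running averages used at test time are omitted):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """x: (N, D). Normalize each feature over the batch, then scale and shift."""
    mu = x.mean(axis=0)                      # per-feature batch mean
    var = x.var(axis=0)                      # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # zero mean, unit variance
    return gamma * x_hat + beta              # learnable scale γ and shift β

x = np.random.rand(32, 10) * 5 + 3           # a batch far from zero mean
out = batchnorm_forward(x, gamma=np.ones(10), beta=np.zeros(10))
print(out.mean(), out.std())                 # approximately 0 and 1
```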
Batch Normalization
Usually inserted after Fully-Connected or Convolutional layers, and before the nonlinearity:
... -> FC -> BN -> tanh -> FC -> BN -> tanh -> ...
x̂ = (x – E[x]) / sqrt(Var[x])
Batch Normalization
- Makes deep networks much easier to train!
- Allows higher learning rates, faster convergence
- Networks become more robust to initialization
- Acts as regularization during training
- Zero overhead at test-time: can be fused with the conv!
[Figure: ImageNet accuracy curves]
Components of a Convolutional Network
Convolution Layers, Pooling Layers, Fully-Connected Layers, Activation Functions
Convolution layers are usually the most computationally expensive!
Summary: Components of a Convolutional Network
Convolution Layers, Pooling Layers, Fully-Connected Layers, Activation Functions
Why are FC layers used?
• If one goes through the math, it becomes clear that without them, each output neuron (and thus the prediction for some class) would depend only on a subset of the input dimensions (pixels).
• The model would then be something along the lines of: decide whether an image belongs to class 1 based only on the first few "columns" of the image (or, depending on the architecture, rows, or some patch), whether it belongs to class 2 based on the next few columns (maybe overlapping), ..., and finally class K based on the last few columns. Data usually does not have this characteristic: you cannot classify an image of a cat from the first few columns while ignoring the rest.
• A fully-connected layer, however, gives the model the ability to mix signals: since every neuron is connected to every neuron in the next layer, information can flow between each input dimension (pixel location) and each output class, so the decision is truly based on the whole image.
Next time: CNN Architectures