Convolutional Neural Networks

Problem: So far, our classifiers don’t respect the spatial structure of images!

Solution: Define new computational nodes that operate on images!
[Figure: a 2x2 input image with pixel values 56, 231, 24, 2 is stretched into a column vector of shape (4,).]
Components of a Convolutional Network
Fully-Connected Layers, Activation Functions, Convolution Layers, Pooling Layers
Fully-Connected Layer
32x32x3 image -> stretch to 3072 x 1
Input: 3072 x 1
Weights: 10 x 3072
Output: 10 x 1
Each output is 1 number: the result of taking a dot product between a row of W and the input (a 3072-dimensional dot product).
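To make this concrete, here is a minimal PyTorch sketch of the fully-connected layer described above (the random input and the use of nn.Linear are illustrative choices, not part of the slide):

```python
import torch
import torch.nn as nn

# A 3x32x32 image, stretched into a single 3072-dim vector.
x = torch.randn(3, 32, 32)
x_flat = x.reshape(1, 3072)            # stretch to 3072

# Fully-connected layer: a 10 x 3072 weight matrix (plus a 10-dim bias).
fc = nn.Linear(in_features=3072, out_features=10)
scores = fc(x_flat)                    # each score is a 3072-dim dot product
print(scores.shape)                    # torch.Size([1, 10])
```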
Convolution Layer
3x32x32 image (3 depth / channels, 32 height, 32 width): preserve spatial structure.
3x5x5 filter: convolve the filter with the image, i.e. “slide over the image spatially, computing dot products”.
Filters always extend the full depth of the input volume.
Convolution Layer
Consider a 3x32x32 image and 6 filters, each 3x5x5 (weight tensor 6x3x5x5), plus a 6-dim bias vector.
Convolving gives 6 activation maps, each 1x28x28; stack the activations to get a 6x28x28 output image.
Equivalently: a 28x28 grid, with a 6-dim vector at each point.
With a batch of images: a 2x3x32x32 batch of inputs gives a 2x6x28x28 batch of outputs.
In general: an N x Cin x H x W batch of images, convolved with Cout filters of size Cin x Kh x Kw (weight tensor Cout x Cin x Kh x Kw, plus a Cout-dim bias vector), gives an N x Cout x H’ x W’ batch of outputs.
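A small PyTorch sketch of these shapes, assuming the 6-filter example from the slide (the random tensors are placeholders):

```python
import torch
import torch.nn as nn

x = torch.randn(2, 3, 32, 32)          # N x Cin x H x W batch of images

# Six 3x5x5 filters: weight tensor 6 x 3 x 5 x 5, plus a 6-dim bias vector.
conv = nn.Conv2d(in_channels=3, out_channels=6, kernel_size=5)
print(conv.weight.shape)               # torch.Size([6, 3, 5, 5])
print(conv.bias.shape)                 # torch.Size([6])

y = conv(x)                            # stride 1, no padding: 32 - 5 + 1 = 28
print(y.shape)                         # torch.Size([2, 6, 28, 28])
```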
Stacking Convolutions

Input: N x 3 x 32 x 32
Conv W1: 6x3x5x5, b1: 6 -> First hidden layer: N x 6 x 28 x 28
Conv W2: 10x6x3x3, b2: 10 -> Second hidden layer: N x 10 x 26 x 26
Conv W3: 12x10x3x3, b3: 12 -> ...

Q: What happens if we stack two convolution layers?
A: We get another convolution! (Recall y = W2 W1 x is a linear classifier)
Solution: put an activation function between them: Conv, ReLU, Conv, ReLU, Conv, ReLU, ...
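As a sketch, the three-layer stack above with ReLU nonlinearities between the convolutions might look like this in PyTorch (the batch size of 4 is arbitrary):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 6, kernel_size=5),   nn.ReLU(),   # -> N x 6 x 28 x 28
    nn.Conv2d(6, 10, kernel_size=3),  nn.ReLU(),   # -> N x 10 x 26 x 26
    nn.Conv2d(10, 12, kernel_size=3), nn.ReLU(),   # -> N x 12 x 24 x 24
)
x = torch.randn(4, 3, 32, 32)
print(model(x).shape)                  # torch.Size([4, 12, 24, 24])
```

Without the ReLUs, the composition of the three convolutions would itself be equivalent to a single convolution.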
What do convolutional filters learn?
First-layer conv filters are local image templates (they often learn oriented edges and opposing colors).
Example: AlexNet’s first layer has 64 filters, each 3x11x11.
A closer look at spatial dimensions

Input: 7x7
Filter: 3x3
Output: 5x5

In general: Input: W, Filter: K, Output: W – K + 1
Problem: Feature maps “shrink” with each layer!

Solution: padding – add zeros around the input.
In general: Input: W, Filter: K, Padding: P, Output: W – K + 1 + 2P
Very common: Set P = (K – 1) / 2 to make the output have the same size as the input!
Receptive Fields
For convolution with kernel size K, each element in the output depends on a K x K receptive field in the input.
Each successive convolution adds K – 1 to the receptive field size. With L layers the receptive field size is 1 + L * (K – 1).
(Note the distinction between “receptive field in the input” and “receptive field in the previous layer”.)
Problem: For large images we need many layers for each output to “see” the whole image.
Solution: Downsample inside the network.
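A quick way to see the problem is to evaluate the formula above; this small Python sketch (the chosen sizes are just for illustration) shows how many 3x3 layers are needed before the receptive field covers a 224-pixel input:

```python
# Receptive field after L stacked convolutions with kernel size K: 1 + L * (K - 1)
def receptive_field(num_layers: int, kernel_size: int) -> int:
    return 1 + num_layers * (kernel_size - 1)

for L in (1, 2, 10, 112):
    print(L, receptive_field(L, kernel_size=3))    # 3, 5, 21, 225
# Covering a 224-pixel input with 3x3 convolutions takes about 112 layers.
```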
Strided Convolution
Input: 7x7
Filter: 3x3
Stride: 2
Output: 3x3

In general: Input: W, Filter: K, Padding: P, Stride: S
Output: ((W – K + 2P) / S) + 1
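The output-size formula is easy to wrap in a small helper; this sketch just evaluates the expression from the slide:

```python
def conv_output_size(W: int, K: int, P: int = 0, S: int = 1) -> int:
    """Spatial output size of a convolution: ((W - K + 2P) / S) + 1."""
    return (W - K + 2 * P) // S + 1

print(conv_output_size(7, 3))          # 5  (plain 3x3 conv)
print(conv_output_size(7, 3, P=1))     # 7  ("same" padding, P = (K - 1) / 2)
print(conv_output_size(7, 3, S=2))     # 3  (the strided example above)
```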
Convolution Example

Input volume: 3 x 32 x 32
10 5x5 filters with stride 1, pad 2

Output volume size: ((32 + 2*2 – 5) / 1) + 1 = 32 spatially, so 10 x 32 x 32

Number of learnable parameters: 760
Parameters per filter: 3*5*5 + 1 (for bias) = 76; 10 filters, so total is 10 * 76 = 760

Number of multiply-add operations: 768,000
10*32*32 = 10,240 outputs; each output is the inner product of two 3x5x5 tensors (75 multiply-adds), so total is 75 * 10,240 = 768,000
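These numbers can be checked directly in PyTorch (a sketch; nn.Conv2d holds exactly the weights and biases counted above):

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=10, kernel_size=5, stride=1, padding=2)

print(sum(p.numel() for p in conv.parameters()))   # 760 learnable parameters
print(conv(torch.randn(1, 3, 32, 32)).shape)       # torch.Size([1, 10, 32, 32])

# 10*32*32 = 10,240 outputs, each a 3*5*5 = 75-dim inner product:
print(10 * 32 * 32 * 3 * 5 * 5)                    # 768000 multiply-adds
```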
Example: 1x1 Convolution
Input: 64 x 56 x 56 -> 1x1 CONV with 32 filters -> Output: 32 x 56 x 56
(each filter has size 1x1x64, and performs a 64-dimensional dot product)
Stacking 1x1 conv layers gives an MLP operating on each input position.
Lin et al, “Network in Network”, ICLR 2014
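A short PyTorch sketch of a 1x1 convolution with these shapes; the second block, stacking two 1x1 convs with ReLUs, is an illustrative example of the per-position MLP idea rather than anything from the slide:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 56, 56)

# Each 1x1 filter has size 1x1x64: a 64-dim dot product at every position.
conv1x1 = nn.Conv2d(64, 32, kernel_size=1)
print(conv1x1(x).shape)                       # torch.Size([1, 32, 56, 56])

# Stacking 1x1 convs (with ReLU) acts like an MLP applied at each position.
per_position_mlp = nn.Sequential(
    nn.Conv2d(64, 32, kernel_size=1), nn.ReLU(),
    nn.Conv2d(32, 32, kernel_size=1), nn.ReLU(),
)
print(per_position_mlp(x).shape)              # torch.Size([1, 32, 56, 56])
```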
Convolution Summary
Input: Cin x H x W
Hyperparameters:
- Kernel size: KH x KW
- Number of filters: Cout
- Padding: P
- Stride: S
Weight matrix: Cout x Cin x KH x KW, giving Cout filters of size Cin x KH x KW
Bias vector: Cout
Output size: Cout x H’ x W’ where:
- H’ = (H – KH + 2P) / S + 1
- W’ = (W – KW + 2P) / S + 1

Common settings:
- KH = KW (small square filters)
- P = (K – 1) / 2 (“same” padding)
- Cin, Cout = 32, 64, 128, 256 (powers of 2)
- K = 3, P = 1, S = 1 (3x3 conv)
- K = 5, P = 2, S = 1 (5x5 conv)
- K = 1, P = 0, S = 1 (1x1 conv)
- K = 3, P = 1, S = 2 (downsample by 2)
Other types of convolution
So far: 2D convolution. Input: Cin x H x W; Weights: Cout x Cin x K x K
1D convolution: Input: Cin x W; Weights: Cout x Cin x K
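The two cases differ only in the number of spatial dimensions; a small PyTorch sketch (the sizes are chosen arbitrarily):

```python
import torch
import torch.nn as nn

# 2D convolution: input Cin x H x W, weights Cout x Cin x K x K
conv2d = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3)
print(conv2d.weight.shape)                        # torch.Size([8, 3, 3, 3])
print(conv2d(torch.randn(1, 3, 32, 32)).shape)    # torch.Size([1, 8, 30, 30])

# 1D convolution: input Cin x W, weights Cout x Cin x K (e.g. audio or text)
conv1d = nn.Conv1d(in_channels=3, out_channels=8, kernel_size=3)
print(conv1d.weight.shape)                        # torch.Size([8, 3, 3])
print(conv1d(torch.randn(1, 3, 32)).shape)        # torch.Size([1, 8, 30])
```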
Pooling Layers: Another way to downsample
Example: 64 x 224 x 224 -> 64 x 112 x 112
Hyperparameters: kernel size, stride, pooling function

Max Pooling
Single depth slice; max pooling with 2x2 kernel size and stride 2:

  1 1 2 4
  5 6 7 8      ->    6 8
  3 2 1 0            3 4
  1 2 3 4

Introduces invariance to small spatial shifts. No learnable parameters!
Pooling Summary
Input: C x H x W
Hyperparameters:
- Kernel size: K
- Stride: S
- Pooling function (max, avg)
Output: C x H’ x W’ where H’ = (H – K) / S + 1 and W’ = (W – K) / S + 1
Common settings: max, K = 2, S = 2; max, K = 3, S = 2 (AlexNet)
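The max-pooling example above can be reproduced directly (a sketch using nn.MaxPool2d; the tensor is the 4x4 slice from the slide):

```python
import torch
import torch.nn as nn

x = torch.tensor([[[[1., 1., 2., 4.],
                    [5., 6., 7., 8.],
                    [3., 2., 1., 0.],
                    [1., 2., 3., 4.]]]])          # 1 x 1 x 4 x 4

pool = nn.MaxPool2d(kernel_size=2, stride=2)      # no learnable parameters
print(pool(x))
# tensor([[[[6., 8.],
#           [3., 4.]]]])
```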
Hierarchy of feature detectors in visual cortex
Components of a Convolutional Network
Fully-Connected Layers, Activation Functions, Convolution Layers, Pooling Layers, Normalization

Normalization: $\hat{x}_{i,j} = (x_{i,j} - \mu_j) / \sqrt{\sigma_j^2 + \varepsilon}$
Convolutional Networks
Classic architecture: [Conv, ReLU, Pool] x N, flatten, [FC, ReLU] x N, FC

Example: LeNet-5
Lecun et al, “Gradient-based learning applied to document recognition”, 1998
Example: LeNet-5

Layer                            Output Size     Weight Size
Input                            1 x 28 x 28
Conv (Cout=20, K=5, P=2, S=1)    20 x 28 x 28    20 x 1 x 5 x 5
ReLU                             20 x 28 x 28
MaxPool(K=2, S=2)                20 x 14 x 14
Conv (Cout=50, K=5, P=2, S=1)    50 x 14 x 14    50 x 20 x 5 x 5
ReLU                             50 x 14 x 14
MaxPool(K=2, S=2)                50 x 7 x 7
Flatten                          2450
Linear (2450 -> 500)             500             2450 x 500
ReLU                             500
Linear (500 -> 10)               10              500 x 10

As we go through the network:
- Spatial size decreases (using pooling or strided conv)
- Number of channels increases (total “volume” is preserved!)

Lecun et al, “Gradient-based learning applied to document recognition”, 1998
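A PyTorch sketch of the LeNet-5 variant in the table above (the original 1998 network differs in some details, e.g. its nonlinearities; this just reproduces the table’s shapes):

```python
import torch
import torch.nn as nn

lenet5 = nn.Sequential(
    nn.Conv2d(1, 20, kernel_size=5, padding=2, stride=1),    # 20 x 28 x 28
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),                   # 20 x 14 x 14
    nn.Conv2d(20, 50, kernel_size=5, padding=2, stride=1),   # 50 x 14 x 14
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),                   # 50 x 7 x 7
    nn.Flatten(),                                            # 2450
    nn.Linear(2450, 500),
    nn.ReLU(),
    nn.Linear(500, 10),                                      # class scores
)

x = torch.randn(1, 1, 28, 28)      # MNIST-sized grayscale input
print(lenet5(x).shape)             # torch.Size([1, 10])
```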


Problem: Deep networks are very hard to train!
The distribution of each layer’s inputs shifts during training, and the shifts can be large – this can affect training badly.
Idea: normalize the activations, then let the network move the entire collection to the appropriate location.
Batch Normalization
Ioffe and Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift”, ICML 2015

Idea: “Normalize” the outputs of a layer so they have zero mean and unit variance.
Why? Helps reduce “internal covariate shift”, improves optimization.

We can normalize a batch of N vectors, each of size D, by computing the per-element mean and variance over the batch, subtracting the mean, and dividing by the standard deviation. This is a differentiable function, so we can use it as an operator in our networks and backprop through it!

Batch normalization also introduces two learnable parameters, γ and β. These allow the network to learn for itself what mean and variance it wants to see in each element of the vector: during training the network learns optimal values of γ and β, which rescale and shift the normalized outputs of each batch.
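A minimal from-scratch sketch of the training-time forward pass, assuming a batch of N vectors of size D (the function name and sizes are illustrative):

```python
import torch

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Training-time batch norm for an N x D batch (a sketch, forward pass only)."""
    mu = x.mean(dim=0)                         # per-feature mean over the batch, shape (D,)
    var = x.var(dim=0, unbiased=False)         # per-feature variance, shape (D,)
    x_hat = (x - mu) / torch.sqrt(var + eps)   # zero mean, unit variance
    return gamma * x_hat + beta                # learnable scale (gamma) and shift (beta)

x = torch.randn(8, 4)                          # N = 8 vectors of size D = 4
y = batchnorm_forward(x, gamma=torch.ones(4), beta=torch.zeros(4))
print(y.mean(dim=0))                           # approximately 0
print(y.var(dim=0, unbiased=False))            # approximately 1
```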
Batch Normalization
Usually inserted after Fully-Connected or Convolutional layers, and before the nonlinearity:
FC -> BN -> tanh -> FC -> BN -> tanh -> ...

$\hat{x} = (x - \mathrm{E}[x]) / \sqrt{\mathrm{Var}[x]}$

Ioffe and Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift”, ICML 2015
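In PyTorch this placement would look roughly like the following (the layer sizes are arbitrary; BatchNorm1d and BatchNorm2d are the library’s built-in batch norm layers):

```python
import torch.nn as nn

# After a fully-connected layer, before the nonlinearity:
fc_net = nn.Sequential(
    nn.Linear(3072, 500),
    nn.BatchNorm1d(500),
    nn.Tanh(),
    nn.Linear(500, 10),
)

# The same pattern for convolutional layers uses BatchNorm2d:
conv_block = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),
    nn.BatchNorm2d(64),
    nn.ReLU(),
)
```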
Batch Normalization
- Makes deep networks much easier to train!
- Allows higher learning rates, faster convergence
- Networks become more robust to initialization
- Acts as regularization during training
- Zero overhead at test-time: can be fused with conv!
- Not well-understood theoretically (yet)
- Behaves differently during training and testing: this is a very common source of bugs!

[Figure: ImageNet accuracy vs. training iterations, with and without batch normalization]

Ioffe and Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift”, ICML 2015
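The train/test difference can be demonstrated with a tiny sketch (sizes arbitrary); forgetting to switch modes is the kind of bug the slide warns about:

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(4)
x = torch.randn(8, 4)

bn.train()               # training mode: normalize with this batch's statistics
y_train = bn(x)

bn.eval()                # test mode: normalize with running averages from training
y_test = bn(x)

print(torch.allclose(y_train, y_test))   # usually False: the two modes differ
```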
Summary: Components of a Convolutional Network
Convolution Layers, Pooling Layers, Fully-Connected Layers, Activation Functions
Convolution layers are the most computationally expensive!
Why are FC layers used?

• If one goes through the math, it becomes clear that without them each output neuron (and thus the prediction for some class) depends only on a subset of the input dimensions (pixels).

• This would be something along the lines of a model that decides whether an image belongs to class 1 based only on the first few "columns" of the image (or, depending on the architecture, rows, or some patch of the image), whether it belongs to class 2 based on the next few columns (maybe overlapping), ..., and finally whether it belongs to class K based on the last few columns. Data usually does not have this characteristic: you cannot classify an image of a cat from the first few columns while ignoring the rest.

• However, if you introduce a fully connected layer, you give the model the ability to mix signals: since every neuron is connected to every neuron in the next layer, there is a flow of information between each input dimension (pixel location) and each output class, so the decision is truly based on the whole image.
Next time: CNN Architectures
