Convolutional Neural Networks

Problem: So far, our classifiers don’t respect the spatial structure of images!

Solution: Define new computational nodes that operate on images!
[Figure: a 2x2 input image with pixel values 56, 231, 24, 2 is stretched into a column vector of shape (4,).]
Components of a Convolutional Network
Fully-Connected Layers, Activation Functions, Convolution Layers, Pooling Layers
Fully-Connected Layer
32x32x3 image -> stretch to 3072 x 1
Input: 3072 x 1
Weights: 10 x 3072
Output: 10 x 1
Each output is 1 number: the result of taking a dot product between a row of W and the input (a 3072-dimensional dot product).
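To make this concrete, here is a minimal PyTorch sketch of the fully-connected layer described above (the random input and the use of nn.Linear are illustrative choices, not part of the slide):

```python
import torch
import torch.nn as nn

# A 3x32x32 image, stretched into a single 3072-dim vector.
x = torch.randn(3, 32, 32)
x_flat = x.reshape(1, 3072)            # stretch to 3072

# Fully-connected layer: a 10 x 3072 weight matrix (plus a 10-dim bias).
fc = nn.Linear(in_features=3072, out_features=10)
scores = fc(x_flat)                    # each score is a 3072-dim dot product
print(scores.shape)                    # torch.Size([1, 10])
```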
Convolution Layer
3x32x32 image (3 depth / channels, 32 height, 32 width): preserve spatial structure.
3x5x5 filter: convolve the filter with the image, i.e. “slide over the image spatially, computing dot products”.
Filters always extend the full depth of the input volume.
Convolution Layer
Consider a 3x32x32 image and 6 filters, each 3x5x5 (weight tensor 6x3x5x5), plus a 6-dim bias vector.
Convolving gives 6 activation maps, each 1x28x28; stack the activations to get a 6x28x28 output image.
Equivalently: a 28x28 grid, with a 6-dim vector at each point.
With a batch of images: a 2x3x32x32 batch of inputs gives a 2x6x28x28 batch of outputs.
In general: an N x Cin x H x W batch of images, convolved with Cout filters of size Cin x Kh x Kw (weight tensor Cout x Cin x Kh x Kw, plus a Cout-dim bias vector), gives an N x Cout x H’ x W’ batch of outputs.
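A small PyTorch sketch of these shapes, assuming the 6-filter example from the slide (the random tensors are placeholders):

```python
import torch
import torch.nn as nn

x = torch.randn(2, 3, 32, 32)          # N x Cin x H x W batch of images

# Six 3x5x5 filters: weight tensor 6 x 3 x 5 x 5, plus a 6-dim bias vector.
conv = nn.Conv2d(in_channels=3, out_channels=6, kernel_size=5)
print(conv.weight.shape)               # torch.Size([6, 3, 5, 5])
print(conv.bias.shape)                 # torch.Size([6])

y = conv(x)                            # stride 1, no padding: 32 - 5 + 1 = 28
print(y.shape)                         # torch.Size([2, 6, 28, 28])
```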
Stacking Convolutions

Input: N x 3 x 32 x 32
Conv W1: 6x3x5x5, b1: 6 -> First hidden layer: N x 6 x 28 x 28
Conv W2: 10x6x3x3, b2: 10 -> Second hidden layer: N x 10 x 26 x 26
Conv W3: 12x10x3x3, b3: 12 -> ...

Q: What happens if we stack two convolution layers?
A: We get another convolution! (Recall y = W2 W1 x is a linear classifier)
Solution: put an activation function between them: Conv, ReLU, Conv, ReLU, Conv, ReLU, ...
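As a sketch, the three-layer stack above with ReLU nonlinearities between the convolutions might look like this in PyTorch (the batch size of 4 is arbitrary):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 6, kernel_size=5),   nn.ReLU(),   # -> N x 6 x 28 x 28
    nn.Conv2d(6, 10, kernel_size=3),  nn.ReLU(),   # -> N x 10 x 26 x 26
    nn.Conv2d(10, 12, kernel_size=3), nn.ReLU(),   # -> N x 12 x 24 x 24
)
x = torch.randn(4, 3, 32, 32)
print(model(x).shape)                  # torch.Size([4, 12, 24, 24])
```

Without the ReLUs, the composition of the three convolutions would itself be equivalent to a single convolution.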
What do convolutional filters learn?
First-layer conv filters are local image templates (they often learn oriented edges and opposing colors).
Example: AlexNet’s first layer has 64 filters, each 3x11x11.
A closer look at spatial dimensions

Input: 7x7
Filter: 3x3
Output: 5x5

In general: Input: W, Filter: K, Output: W – K + 1
Problem: Feature maps “shrink” with each layer!

Solution: padding – add zeros around the input.
In general: Input: W, Filter: K, Padding: P, Output: W – K + 1 + 2P
Very common: Set P = (K – 1) / 2 to make the output have the same size as the input!
Receptive Fields
For convolution with kernel size K, each element in the output depends on a K x K receptive field in the input.
Each successive convolution adds K – 1 to the receptive field size. With L layers the receptive field size is 1 + L * (K – 1).
(Note the distinction between “receptive field in the input” and “receptive field in the previous layer”.)
Problem: For large images we need many layers for each output to “see” the whole image.
Solution: Downsample inside the network.
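A quick way to see the problem is to evaluate the formula above; this small Python sketch (the chosen sizes are just for illustration) shows how many 3x3 layers are needed before the receptive field covers a 224-pixel input:

```python
# Receptive field after L stacked convolutions with kernel size K: 1 + L * (K - 1)
def receptive_field(num_layers: int, kernel_size: int) -> int:
    return 1 + num_layers * (kernel_size - 1)

for L in (1, 2, 10, 112):
    print(L, receptive_field(L, kernel_size=3))    # 3, 5, 21, 225
# Covering a 224-pixel input with 3x3 convolutions takes about 112 layers.
```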
Strided Convolution
Input: 7x7
Filter: 3x3
Stride: 2
Output: 3x3

In general: Input: W, Filter: K, Padding: P, Stride: S
Output: ((W – K + 2P) / S) + 1
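The output-size formula is easy to wrap in a small helper; this sketch just evaluates the expression from the slide:

```python
def conv_output_size(W: int, K: int, P: int = 0, S: int = 1) -> int:
    """Spatial output size of a convolution: ((W - K + 2P) / S) + 1."""
    return (W - K + 2 * P) // S + 1

print(conv_output_size(7, 3))          # 5  (plain 3x3 conv)
print(conv_output_size(7, 3, P=1))     # 7  ("same" padding, P = (K - 1) / 2)
print(conv_output_size(7, 3, S=2))     # 3  (the strided example above)
```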
Convolution Example

Input volume: 3 x 32 x 32
10 5x5 filters with stride 1, pad 2

Output volume size: ((32 + 2*2 – 5) / 1) + 1 = 32 spatially, so 10 x 32 x 32

Number of learnable parameters: 760
Parameters per filter: 3*5*5 + 1 (for bias) = 76; 10 filters, so total is 10 * 76 = 760

Number of multiply-add operations: 768,000
10*32*32 = 10,240 outputs; each output is the inner product of two 3x5x5 tensors (75 multiply-adds), so total is 75 * 10,240 = 768,000
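These numbers can be checked directly in PyTorch (a sketch; nn.Conv2d holds exactly the weights and biases counted above):

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=10, kernel_size=5, stride=1, padding=2)

print(sum(p.numel() for p in conv.parameters()))   # 760 learnable parameters
print(conv(torch.randn(1, 3, 32, 32)).shape)       # torch.Size([1, 10, 32, 32])

# 10*32*32 = 10,240 outputs, each a 3*5*5 = 75-dim inner product:
print(10 * 32 * 32 * 3 * 5 * 5)                    # 768000 multiply-adds
```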
Example: 1x1 Convolution
Input: 64 x 56 x 56 -> 1x1 CONV with 32 filters -> Output: 32 x 56 x 56
(each filter has size 1x1x64, and performs a 64-dimensional dot product)
Stacking 1x1 conv layers gives an MLP operating on each input position.
Lin et al, “Network in Network”, ICLR 2014
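A short PyTorch sketch of a 1x1 convolution with these shapes; the second block, stacking two 1x1 convs with ReLUs, is an illustrative example of the per-position MLP idea rather than anything from the slide:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 56, 56)

# Each 1x1 filter has size 1x1x64: a 64-dim dot product at every position.
conv1x1 = nn.Conv2d(64, 32, kernel_size=1)
print(conv1x1(x).shape)                       # torch.Size([1, 32, 56, 56])

# Stacking 1x1 convs (with ReLU) acts like an MLP applied at each position.
per_position_mlp = nn.Sequential(
    nn.Conv2d(64, 32, kernel_size=1), nn.ReLU(),
    nn.Conv2d(32, 32, kernel_size=1), nn.ReLU(),
)
print(per_position_mlp(x).shape)              # torch.Size([1, 32, 56, 56])
```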
Convolution Summary
Input: Cin x H x W
Hyperparameters:
- Kernel size: KH x KW
- Number of filters: Cout
- Padding: P
- Stride: S
Weight matrix: Cout x Cin x KH x KW, giving Cout filters of size Cin x KH x KW
Bias vector: Cout
Output size: Cout x H’ x W’ where:
- H’ = (H – KH + 2P) / S + 1
- W’ = (W – KW + 2P) / S + 1

Common settings:
- KH = KW (small square filters)
- P = (K – 1) / 2 (“same” padding)
- Cin, Cout = 32, 64, 128, 256 (powers of 2)
- K = 3, P = 1, S = 1 (3x3 conv)
- K = 5, P = 2, S = 1 (5x5 conv)
- K = 1, P = 0, S = 1 (1x1 conv)
- K = 3, P = 1, S = 2 (downsample by 2)
Other types of convolution
So far: 2D convolution. Input: Cin x H x W; Weights: Cout x Cin x K x K
1D convolution: Input: Cin x W; Weights: Cout x Cin x K
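The two cases differ only in the number of spatial dimensions; a small PyTorch sketch (the sizes are chosen arbitrarily):

```python
import torch
import torch.nn as nn

# 2D convolution: input Cin x H x W, weights Cout x Cin x K x K
conv2d = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3)
print(conv2d.weight.shape)                        # torch.Size([8, 3, 3, 3])
print(conv2d(torch.randn(1, 3, 32, 32)).shape)    # torch.Size([1, 8, 30, 30])

# 1D convolution: input Cin x W, weights Cout x Cin x K (e.g. audio or text)
conv1d = nn.Conv1d(in_channels=3, out_channels=8, kernel_size=3)
print(conv1d.weight.shape)                        # torch.Size([8, 3, 3])
print(conv1d(torch.randn(1, 3, 32)).shape)        # torch.Size([1, 8, 30])
```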
Pooling Layers: Another way to downsample
Example: 64 x 224 x 224 -> 64 x 112 x 112
Hyperparameters: kernel size, stride, pooling function

Max Pooling
Single depth slice; max pooling with 2x2 kernel size and stride 2:

  1 1 2 4
  5 6 7 8      ->    6 8
  3 2 1 0            3 4
  1 2 3 4

Introduces invariance to small spatial shifts. No learnable parameters!
Pooling Summary
Input: C x H x W
Hyperparameters:
- Kernel size: K
- Stride: S
- Pooling function (max, avg)
Output: C x H’ x W’ where H’ = (H – K) / S + 1 and W’ = (W – K) / S + 1
Common settings: max, K = 2, S = 2; max, K = 3, S = 2 (AlexNet)
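The max-pooling example above can be reproduced directly (a sketch using nn.MaxPool2d; the tensor is the 4x4 slice from the slide):

```python
import torch
import torch.nn as nn

x = torch.tensor([[[[1., 1., 2., 4.],
                    [5., 6., 7., 8.],
                    [3., 2., 1., 0.],
                    [1., 2., 3., 4.]]]])          # 1 x 1 x 4 x 4

pool = nn.MaxPool2d(kernel_size=2, stride=2)      # no learnable parameters
print(pool(x))
# tensor([[[[6., 8.],
#           [3., 4.]]]])
```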
Hierarchy of feature detectors in visual cortex
Components of a Convolutional Network
Fully-Connected Layers, Activation Functions, Convolution Layers, Pooling Layers, Normalization

Normalization: $\hat{x}_{i,j} = (x_{i,j} - \mu_j) / \sqrt{\sigma_j^2 + \varepsilon}$
Convolutional Networks
Classic architecture: [Conv, ReLU, Pool] x N, flatten, [FC, ReLU] x N, FC

Example: LeNet-5
Lecun et al, “Gradient-based learning applied to document recognition”, 1998
Example: LeNet-5

Layer                            Output Size     Weight Size
Input                            1 x 28 x 28
Conv (Cout=20, K=5, P=2, S=1)    20 x 28 x 28    20 x 1 x 5 x 5
ReLU                             20 x 28 x 28
MaxPool(K=2, S=2)                20 x 14 x 14
Conv (Cout=50, K=5, P=2, S=1)    50 x 14 x 14    50 x 20 x 5 x 5
ReLU                             50 x 14 x 14
MaxPool(K=2, S=2)                50 x 7 x 7
Flatten                          2450
Linear (2450 -> 500)             500             2450 x 500
ReLU                             500
Linear (500 -> 10)               10              500 x 10

As we go through the network:
- Spatial size decreases (using pooling or strided conv)
- Number of channels increases (total “volume” is preserved!)

Lecun et al, “Gradient-based learning applied to document recognition”, 1998
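A PyTorch sketch of the LeNet-5 variant in the table above (the original 1998 network differs in some details, e.g. its nonlinearities; this just reproduces the table’s shapes):

```python
import torch
import torch.nn as nn

lenet5 = nn.Sequential(
    nn.Conv2d(1, 20, kernel_size=5, padding=2, stride=1),    # 20 x 28 x 28
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),                   # 20 x 14 x 14
    nn.Conv2d(20, 50, kernel_size=5, padding=2, stride=1),   # 50 x 14 x 14
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),                   # 50 x 7 x 7
    nn.Flatten(),                                            # 2450
    nn.Linear(2450, 500),
    nn.ReLU(),
    nn.Linear(500, 10),                                      # class scores
)

x = torch.randn(1, 1, 28, 28)      # MNIST-sized grayscale input
print(lenet5(x).shape)             # torch.Size([1, 10])
```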


Problem: Deep networks are very hard to train!
The distribution of each layer’s inputs shifts during training, and the shifts can be large – this can affect training badly.
Idea: normalize the activations, then let the network move the entire collection to the appropriate location.
Batch Normalization
Ioffe and Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift”, ICML 2015

Idea: “Normalize” the outputs of a layer so they have zero mean and unit variance.
Why? Helps reduce “internal covariate shift”, improves optimization.

We can normalize a batch of N vectors, each of size D, by computing the per-element mean and variance over the batch, subtracting the mean, and dividing by the standard deviation. This is a differentiable function, so we can use it as an operator in our networks and backprop through it!

Batch normalization also introduces two learnable parameters, γ and β. These allow the network to learn for itself what mean and variance it wants to see in each element of the vector: during training the network learns optimal values of γ and β, which rescale and shift the normalized outputs of each batch.
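A minimal from-scratch sketch of the training-time forward pass, assuming a batch of N vectors of size D (the function name and sizes are illustrative):

```python
import torch

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Training-time batch norm for an N x D batch (a sketch, forward pass only)."""
    mu = x.mean(dim=0)                         # per-feature mean over the batch, shape (D,)
    var = x.var(dim=0, unbiased=False)         # per-feature variance, shape (D,)
    x_hat = (x - mu) / torch.sqrt(var + eps)   # zero mean, unit variance
    return gamma * x_hat + beta                # learnable scale (gamma) and shift (beta)

x = torch.randn(8, 4)                          # N = 8 vectors of size D = 4
y = batchnorm_forward(x, gamma=torch.ones(4), beta=torch.zeros(4))
print(y.mean(dim=0))                           # approximately 0
print(y.var(dim=0, unbiased=False))            # approximately 1
```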
Batch Normalization
Usually inserted after Fully-Connected or Convolutional layers, and before the nonlinearity:
FC -> BN -> tanh -> FC -> BN -> tanh -> ...

$\hat{x} = (x - \mathrm{E}[x]) / \sqrt{\mathrm{Var}[x]}$

Ioffe and Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift”, ICML 2015
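In PyTorch this placement would look roughly like the following (the layer sizes are arbitrary; BatchNorm1d and BatchNorm2d are the library’s built-in batch norm layers):

```python
import torch.nn as nn

# After a fully-connected layer, before the nonlinearity:
fc_net = nn.Sequential(
    nn.Linear(3072, 500),
    nn.BatchNorm1d(500),
    nn.Tanh(),
    nn.Linear(500, 10),
)

# The same pattern for convolutional layers uses BatchNorm2d:
conv_block = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),
    nn.BatchNorm2d(64),
    nn.ReLU(),
)
```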
Batch Normalization
- Makes deep networks much easier to train!
- Allows higher learning rates, faster convergence
- Networks become more robust to initialization
- Acts as regularization during training
- Zero overhead at test-time: can be fused with conv!
- Not well-understood theoretically (yet)
- Behaves differently during training and testing: this is a very common source of bugs!

[Figure: ImageNet accuracy vs. training iterations, with and without batch normalization]

Ioffe and Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift”, ICML 2015
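The train/test difference can be demonstrated with a tiny sketch (sizes arbitrary); forgetting to switch modes is the kind of bug the slide warns about:

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(4)
x = torch.randn(8, 4)

bn.train()               # training mode: normalize with this batch's statistics
y_train = bn(x)

bn.eval()                # test mode: normalize with running averages from training
y_test = bn(x)

print(torch.allclose(y_train, y_test))   # usually False: the two modes differ
```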
Summary: Components of a Convolutional Network
Convolution Layers, Pooling Layers, Fully-Connected Layers, Activation Functions
Convolution layers are the most computationally expensive!
Why are FC layers used?

• If one goes through the math, it becomes clear that without them each output neuron (and thus the prediction for some class) depends only on a subset of the input dimensions (pixels).

• This would be something along the lines of a model that decides whether an image belongs to class 1 based only on the first few "columns" of the image (or, depending on the architecture, rows, or some patch of the image), whether it belongs to class 2 based on the next few columns (maybe overlapping), ..., and finally whether it belongs to class K based on the last few columns. Data usually does not have this characteristic: you cannot classify an image of a cat from the first few columns while ignoring the rest.

• However, if you introduce a fully connected layer, you give the model the ability to mix signals: since every neuron is connected to every neuron in the next layer, there is a flow of information between each input dimension (pixel location) and each output class, so the decision is truly based on the whole image.
Next time: CNN Architectures
