
Lecture 7: Convolutional Networks

Justin Johnson, September 24, 2019


Reminder: A2

Due Monday, September 30, 11:59pm (Even if you enrolled late!)

Your submission must pass the validation script



Slight schedule change

Content originally planned for today got split into two lectures

Pushes the schedule back a bit:

A4 Due Date: Friday 11/1 -> Friday 11/8
A5 Due Date: Friday 11/15 -> Friday 11/22
A6 Due Date: Still Friday 12/6


Last Time: Backpropagation

Represent complex expressions as computational graphs. The forward pass computes outputs; the backward pass computes gradients: during the backward pass, each node in the graph receives upstream gradients and multiplies them by local gradients to compute downstream gradients.

[Figure: computational graph of a linear classifier (x, W -> * -> scores s -> hinge loss; W -> regularizer R; sum -> loss L), and a generic node f with its local, upstream, and downstream gradients.]


Problem: So far our classifiers don't respect the spatial structure of images!

f(x,W) = Wx stretches pixels into a column: a 2x2 input image with pixel values 56, 231, 24, 2 becomes a flat (4,) vector; a 32x32x3 image becomes a 3072-dimensional input x, passed through W1 to a 100-unit hidden layer h and through W2 to 10 output scores s.

Solution: Define new computational nodes that operate on images!


Components of a Fully-Connected Network

Fully-Connected Layers, Activation Function

x -> h -> s


Components of a Convolutional Network

Fully-Connected Layers, Activation Function, Convolution Layers, Pooling Layers, Normalization

x -> h -> s


Fully-Connected Layer

32x32x3 image -> stretch to 3072 x 1

Input: 3072 x 1. Weight matrix W: 10 x 3072. Output: 10 x 1.

Each output is 1 number: the result of taking a dot product between a row of W and the input (a 3072-dimensional dot product).


Convolution Layer

3x32x32 image: preserve the spatial structure (32 height x 32 width x 3 depth / channels).

3x5x5 filter: convolve the filter with the image, i.e. "slide over the image spatially, computing dot products". Filters always extend the full depth of the input volume.

Each position gives 1 number: the result of taking a dot product between the filter and a small 3x5x5 chunk of the image (i.e. 3*5*5 = 75-dimensional dot product + bias).


Convolving (sliding) the 3x5x5 filter over all spatial locations of the 3x32x32 image produces a 1x28x28 activation map.

Consider repeating with a second (green) filter: we get two 1x28x28 activation maps.


Consider 6 filters, each 3x5x5 (a 6x3x5x5 weight tensor), plus a 6-dim bias vector. Stack the activations to get a 6x28x28 output image: a 28x28 grid with a 6-dim vector at each point.

The same works on a batch of images: a 2x3x32x32 batch of inputs gives a 2x6x28x28 batch of outputs.

In general: a batch of images N x Cin x H x W, convolved with Cout filters of shape Cout x Cin x Kh x Kw plus a Cout-dim bias vector, gives a batch of outputs N x Cout x H' x W'.
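To make the shape bookkeeping concrete, here is a minimal PyTorch sketch (not from the slides) reproducing the running example: six 3x5x5 filters applied to a batch of 3x32x32 images.

import torch
import torch.nn as nn

x = torch.randn(2, 3, 32, 32)          # N x Cin x H x W
conv = nn.Conv2d(in_channels=3, out_channels=6, kernel_size=5)
print(conv.weight.shape)               # torch.Size([6, 3, 5, 5])  (Cout x Cin x Kh x Kw)
print(conv.bias.shape)                 # torch.Size([6])           (Cout-dim bias vector)
print(conv(x).shape)                   # torch.Size([2, 6, 28, 28])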


Stacking Convolutions

Input: N x 3 x 32 x 32
-> Conv (W1: 6x3x5x5, b1: 6), ReLU -> First hidden layer: N x 6 x 28 x 28
-> Conv (W2: 10x6x3x3, b2: 10), ReLU -> Second hidden layer: N x 10 x 26 x 26
-> Conv (W3: 12x10x3x3, b3: 12), ReLU -> ...

Q: What happens if we stack two convolution layers directly?
A: We get another convolution! (Recall that y = W2W1x is still a linear classifier.) That is why we insert an activation function such as ReLU between convolution layers.
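A small sketch (again assuming PyTorch) of the three-layer stack in this example, with ReLU between convolutions; the intermediate shapes match the figure.

import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Conv2d(3, 6, kernel_size=5),   # W1: 6x3x5x5, b1: 6
    nn.ReLU(),
    nn.Conv2d(6, 10, kernel_size=3),  # W2: 10x6x3x3, b2: 10
    nn.ReLU(),
    nn.Conv2d(10, 12, kernel_size=3), # W3: 12x10x3x3, b3: 12
    nn.ReLU(),
)

x = torch.randn(1, 3, 32, 32)         # N x 3 x 32 x 32
h1 = net[1](net[0](x))
h2 = net[3](net[2](h1))
print(h1.shape)                       # torch.Size([1, 6, 28, 28])
print(h2.shape)                       # torch.Size([1, 10, 26, 26])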
What do convolutional filters learn?

A linear classifier learns one template per class. An MLP learns a bank of whole-image templates. First-layer conv filters are local image templates (they often learn oriented edges and opposing colors); for example, AlexNet's first layer has 64 filters, each 3x11x11.


A closer look at spatial dimensions

Input: 7x7
Filter: 3x3
Output: 5x5

In general:
Input: W
Filter: K
Output: W - K + 1

Problem: Feature maps "shrink" with each layer!

Solution: padding. Add zeros around the input.

Input: W
Filter: K
Padding: P
Output: W - K + 1 + 2P

Very common: set P = (K - 1) / 2 to make the output have the same size as the input!


Receptive Fields

For convolution with kernel size K, each element in the output depends on a K x K receptive field in the input.

Each successive convolution adds K - 1 to the receptive field size. With L layers the receptive field size is 1 + L * (K - 1).

Be careful: "receptive field in the input" vs "receptive field in the previous layer". Hopefully clear from context!

Problem: For large images we need many layers for each output to "see" the whole image.
Solution: Downsample inside the network.
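A tiny sketch of the receptive-field formula above; the helper name is mine, not from the lecture.

def receptive_field(num_layers: int, kernel_size: int) -> int:
    """Receptive field of a stack of stride-1 convolutions: 1 + L * (K - 1)."""
    return 1 + num_layers * (kernel_size - 1)

for L in [1, 2, 3, 10, 100]:
    print(L, receptive_field(L, 3))   # 3, 5, 7, 21, 201
# Illustration: for a 224x224 input, more than 100 stacked 3x3 layers are
# needed before a single output element can "see" the whole image.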


Strided Convolution

Input: 7x7
Filter: 3x3
Stride: 2
Output: 3x3

In general:
Input: W
Filter: K
Padding: P
Stride: S
Output: (W - K + 2P) / S + 1
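A minimal helper capturing the output-size formula above; the function name is an illustration, not part of any library.

def conv_output_size(W: int, K: int, P: int = 0, S: int = 1) -> int:
    """Spatial output size: (W - K + 2P) / S + 1 (integer division; assumes it divides evenly)."""
    return (W - K + 2 * P) // S + 1

print(conv_output_size(7, 3))          # 5  (no padding, stride 1)
print(conv_output_size(7, 3, P=1))     # 7  ("same" padding, P = (K - 1) / 2)
print(conv_output_size(7, 3, S=2))     # 3  (strided convolution)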
Convolution Example

Input volume: 3 x 32 x 32
10 5x5 filters with stride 1, pad 2

Output volume size: (32 + 2*2 - 5)/1 + 1 = 32 spatially, so 10 x 32 x 32

Number of learnable parameters: 760
Parameters per filter: 3*5*5 + 1 (for bias) = 76; 10 filters, so total is 10 * 76 = 760

Number of multiply-add operations: 768,000
10*32*32 = 10,240 outputs; each output is the inner product of two 3x5x5 tensors (75 elems); total = 75 * 10,240 = 768K
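A short PyTorch sketch (assuming torch is available) reproducing the counts in this example: output size, learnable parameters, and multiply-add operations.

import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=10, kernel_size=5, stride=1, padding=2)

x = torch.randn(1, 3, 32, 32)
print(conv(x).shape)                               # torch.Size([1, 10, 32, 32])

num_params = sum(p.numel() for p in conv.parameters())
print(num_params)                                  # 760 = 10 * (3*5*5 + 1)

outputs = 10 * 32 * 32                             # number of output elements
macs = outputs * 3 * 5 * 5                         # one 75-dim dot product each
print(macs)                                        # 768000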
Example: 1x1 Convolution

A 64x56x56 input passed through a 1x1 conv with 32 filters (each filter has size 1x1x64, and performs a 64-dimensional dot product) gives a 32x56x56 output.

Stacking 1x1 conv layers gives an MLP operating on each input position.

Lin et al, "Network in Network", ICLR 2014
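A minimal sketch of this equivalence: a 1x1 convolution applies the same fully-connected layer at every spatial position. The weight-copying below is only there to make the two views directly comparable.

import torch
import torch.nn as nn

x = torch.randn(1, 64, 56, 56)

conv1x1 = nn.Conv2d(64, 32, kernel_size=1)
fc = nn.Linear(64, 32)
with torch.no_grad():                         # share weights between the two views
    fc.weight.copy_(conv1x1.weight.view(32, 64))
    fc.bias.copy_(conv1x1.bias)

y_conv = conv1x1(x)                                    # 1 x 32 x 56 x 56
y_fc = fc(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)   # same dot product at each position
print(torch.allclose(y_conv, y_fc, atol=1e-5))         # True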


Convolution Summary

Input: Cin x H x W
Hyperparameters:
- Kernel size: KH x KW
- Number of filters: Cout
- Padding: P
- Stride: S
Weight matrix: Cout x Cin x KH x KW, giving Cout filters of size Cin x KH x KW
Bias vector: Cout
Output size: Cout x H' x W' where:
- H' = (H - KH + 2P) / S + 1
- W' = (W - KW + 2P) / S + 1

Common settings:
- KH = KW (small square filters)
- P = (K - 1) / 2 ("same" padding)
- Cin, Cout = 32, 64, 128, 256 (powers of 2)
- K = 3, P = 1, S = 1 (3x3 conv)
- K = 5, P = 2, S = 1 (5x5 conv)
- K = 1, P = 0, S = 1 (1x1 conv)
- K = 3, P = 1, S = 2 (downsample by 2)
Other types of convolution

So far: 2D convolution
Input: Cin x H x W
Weights: Cout x Cin x K x K

1D convolution
Input: Cin x W
Weights: Cout x Cin x K

3D convolution (a Cin-dim vector at each point in the volume)
Input: Cin x H x W x D
Weights: Cout x Cin x K x K x K
PyTorch Convolution Layers

These layers are provided in PyTorch as torch.nn.Conv1d, torch.nn.Conv2d, and torch.nn.Conv3d.
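A brief usage sketch of those PyTorch modules; the specific channel counts are arbitrary examples, and the hyperparameter names follow the summary above.

import torch
import torch.nn as nn

conv1d = nn.Conv1d(in_channels=16, out_channels=32, kernel_size=3, padding=1)
conv2d = nn.Conv2d(in_channels=3,  out_channels=6,  kernel_size=5, stride=1, padding=2)
conv3d = nn.Conv3d(in_channels=3,  out_channels=8,  kernel_size=3, padding=1)

print(conv1d(torch.randn(2, 16, 100)).shape)       # torch.Size([2, 32, 100])
print(conv2d(torch.randn(2, 3, 32, 32)).shape)     # torch.Size([2, 6, 32, 32])
print(conv3d(torch.randn(2, 3, 8, 32, 32)).shape)  # torch.Size([2, 8, 8, 32, 32])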




Pooling Layers: Another way to downsample

Hyperparameters:
- Kernel size
- Stride
- Pooling function


Max Pooling

Single depth slice (4x4 input):

1 1 2 4
5 6 7 8
3 2 1 0
1 2 3 4

Max pooling with 2x2 kernel size and stride 2 gives a 2x2 output:

6 8
3 4

Introduces invariance to small spatial shifts. No learnable parameters!
Pooling Summary

Input: C x H x W
Hyperparameters:
- Kernel size: K
- Stride: S
- Pooling function (max, avg)
Output: C x H' x W' where
- H' = (H - K) / S + 1
- W' = (W - K) / S + 1
Learnable parameters: None!

Common settings: max, K = 2, S = 2; max, K = 3, S = 2 (AlexNet)
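A minimal PyTorch sketch reproducing the max-pooling example above.

import torch
import torch.nn as nn

x = torch.tensor([[1., 1., 2., 4.],
                  [5., 6., 7., 8.],
                  [3., 2., 1., 0.],
                  [1., 2., 3., 4.]]).reshape(1, 1, 4, 4)   # N x C x H x W

pool = nn.MaxPool2d(kernel_size=2, stride=2)
print(pool(x).squeeze())
# tensor([[6., 8.],
#         [3., 4.]])
print(sum(p.numel() for p in pool.parameters()))           # 0 -- no learnable parameters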


Convolutional Networks
Classic architecture: [Conv, ReLU, Pool] x N, flatten, [FC, ReLU] x N, FC

Example: LeNet-5

Lecun et al, “Gradient-based learning applied to document recognition”, 1998



Example: LeNet-5

Layer                          | Output Size  | Weight Size
Input                          | 1 x 28 x 28  |
Conv (Cout=20, K=5, P=2, S=1)  | 20 x 28 x 28 | 20 x 1 x 5 x 5
ReLU                           | 20 x 28 x 28 |
MaxPool(K=2, S=2)              | 20 x 14 x 14 |
Conv (Cout=50, K=5, P=2, S=1)  | 50 x 14 x 14 | 50 x 20 x 5 x 5
ReLU                           | 50 x 14 x 14 |
MaxPool(K=2, S=2)              | 50 x 7 x 7   |
Flatten                        | 2450         |
Linear (2450 -> 500)           | 500          | 2450 x 500
ReLU                           | 500          |
Linear (500 -> 10)             | 10           | 500 x 10

As we go through the network:
- Spatial size decreases (using pooling or strided conv)
- Number of channels increases (total "volume" is preserved!)
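A minimal PyTorch sketch of the network in this table (the lecture's modernized LeNet-5 variant with ReLU and max pooling); layer arguments follow the table.

import torch
import torch.nn as nn

lenet5 = nn.Sequential(
    nn.Conv2d(1, 20, kernel_size=5, padding=2, stride=1),   # 20 x 28 x 28
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),                  # 20 x 14 x 14
    nn.Conv2d(20, 50, kernel_size=5, padding=2, stride=1),  # 50 x 14 x 14
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),                  # 50 x 7 x 7
    nn.Flatten(),                                           # 2450
    nn.Linear(50 * 7 * 7, 500),
    nn.ReLU(),
    nn.Linear(500, 10),
)

x = torch.randn(1, 1, 28, 28)      # a single MNIST-sized input
print(lenet5(x).shape)             # torch.Size([1, 10])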


Problem: Deep networks are very hard to train!




Batch Normalization

Idea: "Normalize" the outputs of a layer so they have zero mean and unit variance.

Why? Helps reduce "internal covariate shift", improves optimization.

We can normalize a batch of activations like this: x̂ = (x - 𝞵) / 𝝈, where 𝞵 and 𝝈 are the mean and standard deviation computed over the batch. This is a differentiable function, so we can use it as an operator in our networks and backprop through it!

Ioffe and Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift", ICML 2015


For fully-connected inputs x of shape N x D:
- Per-channel mean 𝞵, shape D (computed over the batch)
- Per-channel std 𝝈, shape D (computed over the batch)
- Normalized x̂ = (x - 𝞵) / 𝝈, shape N x D

Problem: What if zero-mean, unit variance is too hard of a constraint?


Solution: add learnable scale and shift parameters ɣ, β (each of shape D) and output y = ɣ(x - 𝞵)/𝝈 + β, shape N x D. Learning ɣ = 𝝈, β = 𝞵 will recover the identity function!
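A from-scratch sketch of the computation described so far (training mode, fully-connected case); eps is the usual small constant added for numerical stability, and the function name is mine.

import torch

def batchnorm_train(x, gamma, beta, eps=1e-5):
    mu = x.mean(dim=0)                      # per-channel mean, shape D
    var = x.var(dim=0, unbiased=False)      # per-channel variance, shape D
    x_hat = (x - mu) / torch.sqrt(var + eps)
    return gamma * x_hat + beta             # learnable scale and shift

N, D = 8, 4
x = torch.randn(N, D) * 3 + 5
gamma, beta = torch.ones(D), torch.zeros(D)
y = batchnorm_train(x, gamma, beta)
print(y.mean(dim=0))                        # ~0 per channel
print(y.std(dim=0, unbiased=False))         # ~1 per channel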


Batch Normalization: Test-Time

Problem: the estimates 𝞵 and 𝝈 depend on the minibatch, so we can't compute them the same way at test time!

Solution: at test time, replace 𝞵 and 𝝈 with (running) averages of the values seen during training.


During testing batchnorm becomes a linear operator! It can be fused with the previous fully-connected or conv layer.
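A sketch of that fusion for a Conv2d followed by an eval-mode BatchNorm2d; the helper below is hand-rolled for illustration, not a PyTorch API.

import torch
import torch.nn as nn

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)   # gamma / sigma, per channel
    fused = nn.Conv2d(conv.in_channels, conv.out_channels,
                      conv.kernel_size, conv.stride, conv.padding, bias=True)
    with torch.no_grad():
        fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
        conv_bias = conv.bias if conv.bias is not None else torch.zeros(conv.out_channels)
        fused.bias.copy_((conv_bias - bn.running_mean) * scale + bn.bias)
    return fused

conv, bn = nn.Conv2d(3, 6, 3, padding=1), nn.BatchNorm2d(6)
bn.eval()                                  # test time: use running statistics
x = torch.randn(2, 3, 8, 8)
print(torch.allclose(bn(conv(x)), fuse_conv_bn(conv, bn)(x), atol=1e-5))  # True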


Batch Normalization for ConvNets

Batch normalization for fully-connected networks:
x: N x D; 𝞵, 𝝈: 1 x D; ɣ, β: 1 x D; y = ɣ(x - 𝞵)/𝝈 + β

Batch normalization for convolutional networks (spatial batchnorm, BatchNorm2D):
x: N x C x H x W; 𝞵, 𝝈: 1 x C x 1 x 1; ɣ, β: 1 x C x 1 x 1; y = ɣ(x - 𝞵)/𝝈 + β
Batchnorm is usually inserted after fully-connected or convolutional layers, and before the nonlinearity:

... -> FC -> BN -> tanh -> FC -> BN -> tanh -> ...


Batch normalization:
- Makes deep networks much easier to train!
- Allows higher learning rates, faster convergence
- Networks become more robust to initialization
- Acts as regularization during training
- Zero overhead at test-time: can be fused with conv!
- Not well-understood theoretically (yet)
- Behaves differently during training and testing: this is a very common source of bugs!

[Figure: ImageNet accuracy vs. training iterations, with and without batch normalization.]
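A tiny sketch of that pitfall: the same BatchNorm layer gives different outputs in train mode (batch statistics) and eval mode (running statistics), so remember to call model.train() / model.eval() appropriately.

import torch
import torch.nn as nn

bn = nn.BatchNorm1d(4)
x = torch.randn(8, 4)

bn.train()
y_train = bn(x)        # normalizes with this batch's mean/std (and updates running stats)
bn.eval()
y_eval = bn(x)         # normalizes with running mean/std accumulated during training
print(torch.allclose(y_train, y_eval))   # False (in general)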


Layer Normalization

Layer normalization for fully-connected networks: same behavior at train and test! Used in RNNs, Transformers.

Batch normalization: x: N x D; 𝞵, 𝝈: 1 x D; ɣ, β: 1 x D; y = ɣ(x - 𝞵)/𝝈 + β
Layer normalization: x: N x D; 𝞵, 𝝈: N x 1; ɣ, β: 1 x D; y = ɣ(x - 𝞵)/𝝈 + β

Ba, Kiros, and Hinton, "Layer Normalization", arXiv 2016


Instance Normalization

Instance normalization for convolutional networks: same behavior at train and test!

Batch normalization: x: N x C x H x W; 𝞵, 𝝈: 1 x C x 1 x 1; ɣ, β: 1 x C x 1 x 1; y = ɣ(x - 𝞵)/𝝈 + β
Instance normalization: x: N x C x H x W; 𝞵, 𝝈: N x C x 1 x 1; ɣ, β: 1 x C x 1 x 1; y = ɣ(x - 𝞵)/𝝈 + β

Ulyanov et al, "Improved Texture Networks: Maximizing Quality and Diversity in Feed-forward Stylization and Texture Synthesis", CVPR 2017


Comparison of Normalization Layers

[Figure: comparison of the dimensions normalized by batch norm, layer norm, instance norm, and group norm, from Wu and He, "Group Normalization", ECCV 2018.]


Group Normalization

Group normalization computes statistics over groups of channels within each sample (in between layer norm and instance norm).

Wu and He, "Group Normalization", ECCV 2018
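A sketch comparing the four normalization schemes by the dimensions they average over, for an activation tensor of shape N x C x H x W; the group count G is an arbitrary example.

import torch

N, C, H, W, G = 2, 6, 4, 4, 3        # G = number of channel groups (group norm)
x = torch.randn(N, C, H, W)

mu_batch    = x.mean(dim=(0, 2, 3), keepdim=True)   # 1 x C x 1 x 1  (batch norm)
mu_layer    = x.mean(dim=(1, 2, 3), keepdim=True)   # N x 1 x 1 x 1  (layer norm)
mu_instance = x.mean(dim=(2, 3), keepdim=True)      # N x C x 1 x 1  (instance norm)
mu_group    = x.reshape(N, G, C // G, H, W).mean(dim=(2, 3, 4), keepdim=True)  # N x G x 1 x 1 x 1

for name, mu in [("batch", mu_batch), ("layer", mu_layer),
                 ("instance", mu_instance), ("group", mu_group)]:
    print(name, tuple(mu.shape))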


Summary: Components of a Convolutional Network

Convolution Layers, Pooling Layers, Fully-Connected Layers, Activation Function, Normalization

x -> h -> s

The convolution layers are typically the most computationally expensive part of the network!

Problem: What is the right way to combine all these components?


Next time:
CNN Architectures

