
Lecture 7: Convolutional Networks

Justin Johnson, September 24, 2019


Reminder: A2

Due Monday, September 30, 11:59pm (Even if you enrolled late!)

Your submission must pass the validation script



Slight schedule change

Content originally planned for today got split into two lectures

Pushes the schedule back a bit:

A4 Due Date: Friday 11/1 -> Friday 11/8
A5 Due Date: Friday 11/15 -> Friday 11/22
A6 Due Date: Still Friday 12/6


Last Time: Backpropagation

Represent complex expressions as computational graphs. The forward pass computes outputs; the backward pass computes gradients: during the backward pass, each node in the graph receives upstream gradients and multiplies them by local gradients to compute downstream gradients.

[Figure: computational graph of a linear classifier (x, W -> * -> scores s -> hinge loss; W -> regularizer R; sum -> loss L), and a generic node f with its local, upstream, and downstream gradients.]


Problem: So far our classifiers don't respect the spatial structure of images!

f(x,W) = Wx stretches pixels into a column: a 2x2 input image with pixel values 56, 231, 24, 2 becomes a flat (4,) vector; a 32x32x3 image becomes a 3072-dimensional input x, passed through W1 to a 100-unit hidden layer h and through W2 to 10 output scores s.

Solution: Define new computational nodes that operate on images!


Components of a Fully-Connected Network

Fully-Connected Layers, Activation Function

x -> h -> s


Components of a Convolutional Network

Fully-Connected Layers, Activation Function, Convolution Layers, Pooling Layers, Normalization

x -> h -> s


Fully-Connected Layer

32x32x3 image -> stretch to 3072 x 1

Input: 3072 x 1. Weight matrix W: 10 x 3072. Output: 10 x 1.

Each output is 1 number: the result of taking a dot product between a row of W and the input (a 3072-dimensional dot product).


Convolution Layer

3x32x32 image: preserve the spatial structure (32 height x 32 width x 3 depth / channels).

3x5x5 filter: convolve the filter with the image, i.e. "slide over the image spatially, computing dot products". Filters always extend the full depth of the input volume.

Each position gives 1 number: the result of taking a dot product between the filter and a small 3x5x5 chunk of the image (i.e. 3*5*5 = 75-dimensional dot product + bias).


Convolving (sliding) the 3x5x5 filter over all spatial locations of the 3x32x32 image produces a 1x28x28 activation map.

Consider repeating with a second (green) filter: we get two 1x28x28 activation maps.


Consider 6 filters, each 3x5x5 (a 6x3x5x5 weight tensor), plus a 6-dim bias vector. Stack the activations to get a 6x28x28 output image: a 28x28 grid with a 6-dim vector at each point.

The same works on a batch of images: a 2x3x32x32 batch of inputs gives a 2x6x28x28 batch of outputs.

In general: a batch of images N x Cin x H x W, convolved with Cout filters of shape Cout x Cin x Kh x Kw plus a Cout-dim bias vector, gives a batch of outputs N x Cout x H' x W'.
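To make the shape bookkeeping concrete, here is a minimal PyTorch sketch (not from the slides) reproducing the running example: six 3x5x5 filters applied to a batch of 3x32x32 images.

import torch
import torch.nn as nn

x = torch.randn(2, 3, 32, 32)          # N x Cin x H x W
conv = nn.Conv2d(in_channels=3, out_channels=6, kernel_size=5)
print(conv.weight.shape)               # torch.Size([6, 3, 5, 5])  (Cout x Cin x Kh x Kw)
print(conv.bias.shape)                 # torch.Size([6])           (Cout-dim bias vector)
print(conv(x).shape)                   # torch.Size([2, 6, 28, 28])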


Stacking Convolutions

Input: N x 3 x 32 x 32
-> Conv (W1: 6x3x5x5, b1: 6), ReLU -> First hidden layer: N x 6 x 28 x 28
-> Conv (W2: 10x6x3x3, b2: 10), ReLU -> Second hidden layer: N x 10 x 26 x 26
-> Conv (W3: 12x10x3x3, b3: 12), ReLU -> ...

Q: What happens if we stack two convolution layers directly?
A: We get another convolution! (Recall that y = W2W1x is still a linear classifier.) That is why we insert an activation function such as ReLU between convolution layers.
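A small sketch (again assuming PyTorch) of the three-layer stack in this example, with ReLU between convolutions; the intermediate shapes match the figure.

import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Conv2d(3, 6, kernel_size=5),   # W1: 6x3x5x5, b1: 6
    nn.ReLU(),
    nn.Conv2d(6, 10, kernel_size=3),  # W2: 10x6x3x3, b2: 10
    nn.ReLU(),
    nn.Conv2d(10, 12, kernel_size=3), # W3: 12x10x3x3, b3: 12
    nn.ReLU(),
)

x = torch.randn(1, 3, 32, 32)         # N x 3 x 32 x 32
h1 = net[1](net[0](x))
h2 = net[3](net[2](h1))
print(h1.shape)                       # torch.Size([1, 6, 28, 28])
print(h2.shape)                       # torch.Size([1, 10, 26, 26])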
What do convolutional filters learn?

A linear classifier learns one template per class. An MLP learns a bank of whole-image templates. First-layer conv filters are local image templates (they often learn oriented edges and opposing colors); for example, AlexNet's first layer has 64 filters, each 3x11x11.


A closer look at spatial dimensions

Input: 7x7
Filter: 3x3
Output: 5x5

In general:
Input: W
Filter: K
Output: W - K + 1

Problem: Feature maps "shrink" with each layer!

Solution: padding. Add zeros around the input.

Input: W
Filter: K
Padding: P
Output: W - K + 1 + 2P

Very common: set P = (K - 1) / 2 to make the output have the same size as the input!


Receptive Fields

For convolution with kernel size K, each element in the output depends on a K x K receptive field in the input.

Each successive convolution adds K - 1 to the receptive field size. With L layers the receptive field size is 1 + L * (K - 1).

Be careful: "receptive field in the input" vs "receptive field in the previous layer". Hopefully clear from context!

Problem: For large images we need many layers for each output to "see" the whole image.
Solution: Downsample inside the network.
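A tiny sketch of the receptive-field formula above; the helper name is mine, not from the lecture.

def receptive_field(num_layers: int, kernel_size: int) -> int:
    """Receptive field of a stack of stride-1 convolutions: 1 + L * (K - 1)."""
    return 1 + num_layers * (kernel_size - 1)

for L in [1, 2, 3, 10, 100]:
    print(L, receptive_field(L, 3))   # 3, 5, 7, 21, 201
# Illustration: for a 224x224 input, more than 100 stacked 3x3 layers are
# needed before a single output element can "see" the whole image.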


Strided Convolution

Input: 7x7
Filter: 3x3
Stride: 2
Output: 3x3

In general:
Input: W
Filter: K
Padding: P
Stride: S
Output: (W - K + 2P) / S + 1
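A minimal helper capturing the output-size formula above; the function name is an illustration, not part of any library.

def conv_output_size(W: int, K: int, P: int = 0, S: int = 1) -> int:
    """Spatial output size: (W - K + 2P) / S + 1 (integer division; assumes it divides evenly)."""
    return (W - K + 2 * P) // S + 1

print(conv_output_size(7, 3))          # 5  (no padding, stride 1)
print(conv_output_size(7, 3, P=1))     # 7  ("same" padding, P = (K - 1) / 2)
print(conv_output_size(7, 3, S=2))     # 3  (strided convolution)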
Convolution Example

Input volume: 3 x 32 x 32
10 5x5 filters with stride 1, pad 2

Output volume size: (32 + 2*2 - 5)/1 + 1 = 32 spatially, so 10 x 32 x 32

Number of learnable parameters: 760
Parameters per filter: 3*5*5 + 1 (for bias) = 76; 10 filters, so total is 10 * 76 = 760

Number of multiply-add operations: 768,000
10*32*32 = 10,240 outputs; each output is the inner product of two 3x5x5 tensors (75 elems); total = 75 * 10,240 = 768K
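A short PyTorch sketch (assuming torch is available) reproducing the counts in this example: output size, learnable parameters, and multiply-add operations.

import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=10, kernel_size=5, stride=1, padding=2)

x = torch.randn(1, 3, 32, 32)
print(conv(x).shape)                               # torch.Size([1, 10, 32, 32])

num_params = sum(p.numel() for p in conv.parameters())
print(num_params)                                  # 760 = 10 * (3*5*5 + 1)

outputs = 10 * 32 * 32                             # number of output elements
macs = outputs * 3 * 5 * 5                         # one 75-dim dot product each
print(macs)                                        # 768000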
Example: 1x1 Convolution

A 64x56x56 input passed through a 1x1 conv with 32 filters (each filter has size 1x1x64, and performs a 64-dimensional dot product) gives a 32x56x56 output.

Stacking 1x1 conv layers gives an MLP operating on each input position.

Lin et al, "Network in Network", ICLR 2014
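A minimal sketch of this equivalence: a 1x1 convolution applies the same fully-connected layer at every spatial position. The weight-copying below is only there to make the two views directly comparable.

import torch
import torch.nn as nn

x = torch.randn(1, 64, 56, 56)

conv1x1 = nn.Conv2d(64, 32, kernel_size=1)
fc = nn.Linear(64, 32)
with torch.no_grad():                         # share weights between the two views
    fc.weight.copy_(conv1x1.weight.view(32, 64))
    fc.bias.copy_(conv1x1.bias)

y_conv = conv1x1(x)                                    # 1 x 32 x 56 x 56
y_fc = fc(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)   # same dot product at each position
print(torch.allclose(y_conv, y_fc, atol=1e-5))         # True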


Convolution Summary

Input: Cin x H x W
Hyperparameters:
- Kernel size: KH x KW
- Number of filters: Cout
- Padding: P
- Stride: S
Weight matrix: Cout x Cin x KH x KW, giving Cout filters of size Cin x KH x KW
Bias vector: Cout
Output size: Cout x H' x W' where:
- H' = (H - KH + 2P) / S + 1
- W' = (W - KW + 2P) / S + 1

Common settings:
- KH = KW (small square filters)
- P = (K - 1) / 2 ("same" padding)
- Cin, Cout = 32, 64, 128, 256 (powers of 2)
- K = 3, P = 1, S = 1 (3x3 conv)
- K = 5, P = 2, S = 1 (5x5 conv)
- K = 1, P = 0, S = 1 (1x1 conv)
- K = 3, P = 1, S = 2 (downsample by 2)
Other types of convolution

So far: 2D convolution
Input: Cin x H x W
Weights: Cout x Cin x K x K

1D convolution
Input: Cin x W
Weights: Cout x Cin x K

3D convolution (a Cin-dim vector at each point in the volume)
Input: Cin x H x W x D
Weights: Cout x Cin x K x K x K
PyTorch Convolution Layers

These layers are provided in PyTorch as torch.nn.Conv1d, torch.nn.Conv2d, and torch.nn.Conv3d.
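A brief usage sketch of those PyTorch modules; the specific channel counts are arbitrary examples, and the hyperparameter names follow the summary above.

import torch
import torch.nn as nn

conv1d = nn.Conv1d(in_channels=16, out_channels=32, kernel_size=3, padding=1)
conv2d = nn.Conv2d(in_channels=3,  out_channels=6,  kernel_size=5, stride=1, padding=2)
conv3d = nn.Conv3d(in_channels=3,  out_channels=8,  kernel_size=3, padding=1)

print(conv1d(torch.randn(2, 16, 100)).shape)       # torch.Size([2, 32, 100])
print(conv2d(torch.randn(2, 3, 32, 32)).shape)     # torch.Size([2, 6, 32, 32])
print(conv3d(torch.randn(2, 3, 8, 32, 32)).shape)  # torch.Size([2, 8, 8, 32, 32])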




Pooling Layers: Another way to downsample

Hyperparameters:
- Kernel size
- Stride
- Pooling function


Max Pooling

Single depth slice (4x4 input):

1 1 2 4
5 6 7 8
3 2 1 0
1 2 3 4

Max pooling with 2x2 kernel size and stride 2 gives a 2x2 output:

6 8
3 4

Introduces invariance to small spatial shifts. No learnable parameters!
Pooling Summary

Input: C x H x W
Hyperparameters:
- Kernel size: K
- Stride: S
- Pooling function (max, avg)
Output: C x H' x W' where
- H' = (H - K) / S + 1
- W' = (W - K) / S + 1
Learnable parameters: None!

Common settings: max, K = 2, S = 2; max, K = 3, S = 2 (AlexNet)
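A minimal PyTorch sketch reproducing the max-pooling example above.

import torch
import torch.nn as nn

x = torch.tensor([[1., 1., 2., 4.],
                  [5., 6., 7., 8.],
                  [3., 2., 1., 0.],
                  [1., 2., 3., 4.]]).reshape(1, 1, 4, 4)   # N x C x H x W

pool = nn.MaxPool2d(kernel_size=2, stride=2)
print(pool(x).squeeze())
# tensor([[6., 8.],
#         [3., 4.]])
print(sum(p.numel() for p in pool.parameters()))           # 0 -- no learnable parameters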


Convolutional Networks
Classic architecture: [Conv, ReLU, Pool] x N, flatten, [FC, ReLU] x N, FC

Example: LeNet-5

Lecun et al, “Gradient-based learning applied to document recognition”, 1998



Example: LeNet-5

Layer                          | Output Size  | Weight Size
Input                          | 1 x 28 x 28  |
Conv (Cout=20, K=5, P=2, S=1)  | 20 x 28 x 28 | 20 x 1 x 5 x 5
ReLU                           | 20 x 28 x 28 |
MaxPool(K=2, S=2)              | 20 x 14 x 14 |
Conv (Cout=50, K=5, P=2, S=1)  | 50 x 14 x 14 | 50 x 20 x 5 x 5
ReLU                           | 50 x 14 x 14 |
MaxPool(K=2, S=2)              | 50 x 7 x 7   |
Flatten                        | 2450         |
Linear (2450 -> 500)           | 500          | 2450 x 500
ReLU                           | 500          |
Linear (500 -> 10)             | 10           | 500 x 10

As we go through the network:
- Spatial size decreases (using pooling or strided conv)
- Number of channels increases (total "volume" is preserved!)
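A minimal PyTorch sketch of the network in this table (the lecture's modernized LeNet-5 variant with ReLU and max pooling); layer arguments follow the table.

import torch
import torch.nn as nn

lenet5 = nn.Sequential(
    nn.Conv2d(1, 20, kernel_size=5, padding=2, stride=1),   # 20 x 28 x 28
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),                  # 20 x 14 x 14
    nn.Conv2d(20, 50, kernel_size=5, padding=2, stride=1),  # 50 x 14 x 14
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),                  # 50 x 7 x 7
    nn.Flatten(),                                           # 2450
    nn.Linear(50 * 7 * 7, 500),
    nn.ReLU(),
    nn.Linear(500, 10),
)

x = torch.randn(1, 1, 28, 28)      # a single MNIST-sized input
print(lenet5(x).shape)             # torch.Size([1, 10])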


Problem: Deep networks are very hard to train!




Batch Normalization

Idea: "Normalize" the outputs of a layer so they have zero mean and unit variance.

Why? Helps reduce "internal covariate shift", improves optimization.

We can normalize a batch of activations like this: x̂ = (x - 𝞵) / 𝝈, where 𝞵 and 𝝈 are the mean and standard deviation computed over the batch. This is a differentiable function, so we can use it as an operator in our networks and backprop through it!

Ioffe and Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift", ICML 2015


For fully-connected inputs x of shape N x D:
- Per-channel mean 𝞵, shape D (computed over the batch)
- Per-channel std 𝝈, shape D (computed over the batch)
- Normalized x̂ = (x - 𝞵) / 𝝈, shape N x D

Problem: What if zero-mean, unit variance is too hard of a constraint?


Solution: add learnable scale and shift parameters ɣ, β (each of shape D) and output y = ɣ(x - 𝞵)/𝝈 + β, shape N x D. Learning ɣ = 𝝈, β = 𝞵 will recover the identity function!
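A from-scratch sketch of the computation described so far (training mode, fully-connected case); eps is the usual small constant added for numerical stability, and the function name is mine.

import torch

def batchnorm_train(x, gamma, beta, eps=1e-5):
    mu = x.mean(dim=0)                      # per-channel mean, shape D
    var = x.var(dim=0, unbiased=False)      # per-channel variance, shape D
    x_hat = (x - mu) / torch.sqrt(var + eps)
    return gamma * x_hat + beta             # learnable scale and shift

N, D = 8, 4
x = torch.randn(N, D) * 3 + 5
gamma, beta = torch.ones(D), torch.zeros(D)
y = batchnorm_train(x, gamma, beta)
print(y.mean(dim=0))                        # ~0 per channel
print(y.std(dim=0, unbiased=False))         # ~1 per channel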


Batch Normalization: Test-Time

Problem: the estimates 𝞵 and 𝝈 depend on the minibatch, so we can't compute them the same way at test time!

Solution: at test time, replace 𝞵 and 𝝈 with (running) averages of the values seen during training.


During testing batchnorm becomes a linear operator! It can be fused with the previous fully-connected or conv layer.
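A sketch of that fusion for a Conv2d followed by an eval-mode BatchNorm2d; the helper below is hand-rolled for illustration, not a PyTorch API.

import torch
import torch.nn as nn

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)   # gamma / sigma, per channel
    fused = nn.Conv2d(conv.in_channels, conv.out_channels,
                      conv.kernel_size, conv.stride, conv.padding, bias=True)
    with torch.no_grad():
        fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
        conv_bias = conv.bias if conv.bias is not None else torch.zeros(conv.out_channels)
        fused.bias.copy_((conv_bias - bn.running_mean) * scale + bn.bias)
    return fused

conv, bn = nn.Conv2d(3, 6, 3, padding=1), nn.BatchNorm2d(6)
bn.eval()                                  # test time: use running statistics
x = torch.randn(2, 3, 8, 8)
print(torch.allclose(bn(conv(x)), fuse_conv_bn(conv, bn)(x), atol=1e-5))  # True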


Batch Normalization for ConvNets

Batch normalization for fully-connected networks:
x: N x D; 𝞵, 𝝈: 1 x D; ɣ, β: 1 x D; y = ɣ(x - 𝞵)/𝝈 + β

Batch normalization for convolutional networks (spatial batchnorm, BatchNorm2D):
x: N x C x H x W; 𝞵, 𝝈: 1 x C x 1 x 1; ɣ, β: 1 x C x 1 x 1; y = ɣ(x - 𝞵)/𝝈 + β
Batchnorm is usually inserted after fully-connected or convolutional layers, and before the nonlinearity:

... -> FC -> BN -> tanh -> FC -> BN -> tanh -> ...


Batch normalization:
- Makes deep networks much easier to train!
- Allows higher learning rates, faster convergence
- Networks become more robust to initialization
- Acts as regularization during training
- Zero overhead at test-time: can be fused with conv!
- Not well-understood theoretically (yet)
- Behaves differently during training and testing: this is a very common source of bugs!

[Figure: ImageNet accuracy vs. training iterations, with and without batch normalization.]
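A tiny sketch of that pitfall: the same BatchNorm layer gives different outputs in train mode (batch statistics) and eval mode (running statistics), so remember to call model.train() / model.eval() appropriately.

import torch
import torch.nn as nn

bn = nn.BatchNorm1d(4)
x = torch.randn(8, 4)

bn.train()
y_train = bn(x)        # normalizes with this batch's mean/std (and updates running stats)
bn.eval()
y_eval = bn(x)         # normalizes with running mean/std accumulated during training
print(torch.allclose(y_train, y_eval))   # False (in general)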


Layer Normalization

Layer normalization for fully-connected networks: same behavior at train and test! Used in RNNs, Transformers.

Batch normalization: x: N x D; 𝞵, 𝝈: 1 x D; ɣ, β: 1 x D; y = ɣ(x - 𝞵)/𝝈 + β
Layer normalization: x: N x D; 𝞵, 𝝈: N x 1; ɣ, β: 1 x D; y = ɣ(x - 𝞵)/𝝈 + β

Ba, Kiros, and Hinton, "Layer Normalization", arXiv 2016


Instance Normalization

Instance normalization for convolutional networks: same behavior at train and test!

Batch normalization: x: N x C x H x W; 𝞵, 𝝈: 1 x C x 1 x 1; ɣ, β: 1 x C x 1 x 1; y = ɣ(x - 𝞵)/𝝈 + β
Instance normalization: x: N x C x H x W; 𝞵, 𝝈: N x C x 1 x 1; ɣ, β: 1 x C x 1 x 1; y = ɣ(x - 𝞵)/𝝈 + β

Ulyanov et al, "Improved Texture Networks: Maximizing Quality and Diversity in Feed-forward Stylization and Texture Synthesis", CVPR 2017


Comparison of Normalization Layers

[Figure: comparison of the dimensions normalized by batch norm, layer norm, instance norm, and group norm, from Wu and He, "Group Normalization", ECCV 2018.]


Group Normalization

Group normalization computes statistics over groups of channels within each sample (in between layer norm and instance norm).

Wu and He, "Group Normalization", ECCV 2018
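A sketch comparing the four normalization schemes by the dimensions they average over, for an activation tensor of shape N x C x H x W; the group count G is an arbitrary example.

import torch

N, C, H, W, G = 2, 6, 4, 4, 3        # G = number of channel groups (group norm)
x = torch.randn(N, C, H, W)

mu_batch    = x.mean(dim=(0, 2, 3), keepdim=True)   # 1 x C x 1 x 1  (batch norm)
mu_layer    = x.mean(dim=(1, 2, 3), keepdim=True)   # N x 1 x 1 x 1  (layer norm)
mu_instance = x.mean(dim=(2, 3), keepdim=True)      # N x C x 1 x 1  (instance norm)
mu_group    = x.reshape(N, G, C // G, H, W).mean(dim=(2, 3, 4), keepdim=True)  # N x G x 1 x 1 x 1

for name, mu in [("batch", mu_batch), ("layer", mu_layer),
                 ("instance", mu_instance), ("group", mu_group)]:
    print(name, tuple(mu.shape))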


Summary: Components of a Convolutional Network

Convolution Layers, Pooling Layers, Fully-Connected Layers, Activation Function, Normalization

x -> h -> s

The convolution layers are typically the most computationally expensive part of the network!

Problem: What is the right way to combine all these components?


Next time:
CNN Architectures

