
Convolutional Neural Networks

Muhammad Atif Tahir

Some slides from Waterloo (CS) and Andrew Ng (Coursera)
CNN
⚫ We know it is good to learn a small model.
⚫ From this fully connected model, do we really need all the edges?
⚫ Can some of these be shared?
Consider learning an image:
⚫ Some patterns are much smaller than the whole image.
⚫ A small region can be represented with fewer parameters, e.g. a “beak” detector.
Same pattern appears in different places:
⚫ They can be compressed!
⚫ What about training a lot of such “small” detectors, each of which must “move around”?
  For example, an “upper-left beak” detector and a “middle beak” detector.
⚫ They can be compressed to the same parameters.
A convolutional layer
A CNN is a neural network with some convolutional layers (and some other layers).
A convolutional layer has a number of filters that perform the convolution operation.
Each filter is a small pattern detector, e.g. the beak detector.
Convolution
These are the network parameters to be learned.

6 x 6 image:
1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0

Filter 1:          Filter 2:
 1 -1 -1           -1  1 -1
-1  1 -1           -1  1 -1
-1 -1  1           -1  1 -1

Each filter detects a small pattern (3 x 3).
Convolution with Filter 1, stride = 1

Take the dot product of the filter with each 3 x 3 patch of the 6 x 6 image.
Sliding one pixel at a time, the first two outputs are 3 and -1.
Convolution with Filter 1, stride = 2

With stride 2 the filter moves two pixels at a time, so the first row of outputs is 3 and -3.
Convolution with Filter 1, stride = 1

Sliding Filter 1 over the whole 6 x 6 image gives a 4 x 4 output:
 3 -1 -3 -1
-3  1  0 -3
-3 -3  0  1
 3 -2 -2 -1
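The convolution above can be reproduced with a few lines of NumPy. This is a minimal sketch for illustration (not part of the original slides); conv2d is a hypothetical helper, and what CNNs call “convolution” is implemented here as a sliding dot product.

```python
import numpy as np

# The 6 x 6 image and the 3 x 3 "beak detector" Filter 1 from the slides
image = np.array([[1, 0, 0, 0, 0, 1],
                  [0, 1, 0, 0, 1, 0],
                  [0, 0, 1, 1, 0, 0],
                  [1, 0, 0, 0, 1, 0],
                  [0, 1, 0, 0, 1, 0],
                  [0, 0, 1, 0, 1, 0]])
filter1 = np.array([[ 1, -1, -1],
                    [-1,  1, -1],
                    [-1, -1,  1]])

def conv2d(img, kernel, stride=1):
    """Slide the kernel over the image (no padding) and take a dot
    product at every position -- the operation shown on the slides."""
    k = kernel.shape[0]
    out = (img.shape[0] - k) // stride + 1
    result = np.zeros((out, out), dtype=int)
    for i in range(out):
        for j in range(out):
            patch = img[i*stride:i*stride + k, j*stride:j*stride + k]
            result[i, j] = int(np.sum(patch * kernel))
    return result

print(conv2d(image, filter1, stride=1))  # 4 x 4 map; first row [ 3 -1 -3 -1]
print(conv2d(image, filter1, stride=2))  # 2 x 2 map; first row [ 3 -3]
```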
Convolution with Filter 2, stride = 1

Repeat this for each filter. Filter 2 gives another 4 x 4 feature map:
-1 -1 -1 -1
-1 -1 -2  1
-1 -1 -2  1
-1  0 -4  3

Two 4 x 4 images, forming a 2 x 4 x 4 matrix.
Color image: RGB, 3 channels

For a colour image, the 6 x 6 image becomes a stack of 3 channels,
and each filter likewise has 3 channels (3 x 3 x 3), one per input channel.
Convolution v.s. Fully Connected

Convolution: the 3 x 3 filters operate directly on local patches of the 6 x 6 image.

Fully connected: the 6 x 6 image is first flattened into a vector x1, x2, …, x36
and fed into a fully connected layer.
Treating convolution with Filter 1 as a sparsely connected layer:
the 6 x 6 image is flattened into inputs 1–36.
The first output (value 3) is connected only to inputs 1, 2, 3, 8, 9, 13, 14, 15, 16 of the
top-left 3 x 3 patch, not to all 36 inputs.

Fewer parameters! Each output connects to only 9 inputs, not fully connected.
The next output (value -1) is likewise connected to only 9 inputs, one patch to the right,
and it reuses the same 9 weights as the first output.

Fewer parameters.
Shared weights: even fewer parameters.
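As a quick back-of-the-envelope check (an aside, not on the slides): connecting all 36 inputs to all 16 outputs of a fully connected layer needs 36 × 16 weights, while the convolutional view needs only the 9 shared filter weights.

```python
# Fully connected: each of the 16 outputs is connected to all 36 inputs
fc_weights = 36 * 16      # 576 weights (biases ignored)

# Convolution with one 3 x 3 filter: each output sees only 9 inputs,
# and all 16 output positions reuse the same 9 weights
conv_weights = 3 * 3      # 9 shared weights

print(fc_weights, conv_weights)   # 576 vs 9
```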

The whole CNN

Input image -> Convolution -> Max Pooling -> Convolution -> Max Pooling -> …
(Convolution and Max Pooling can repeat many times)
-> Flattened -> Fully Connected Feedforward network -> cat, dog, …
Max Pooling

The two 4 x 4 feature maps produced by Filter 1 and Filter 2:

Filter 1:              Filter 2:
 3 -1 -3 -1            -1 -1 -1 -1
-3  1  0 -3            -1 -1 -2  1
-3 -3  0  1            -1 -1 -2  1
 3 -2 -2 -1            -1  0 -4  3
Why Pooling

⚫ Subsampling pixels will not change the object: a subsampled bird is still a bird.
⚫ We can subsample the pixels to make the image smaller,
  so fewer parameters are needed to characterize the image.
A CNN compresses a fully connected network in two ways:
⚫ reducing the number of connections
⚫ sharing weights on the edges
Max pooling further reduces the complexity.
Max Pooling

Starting from the 6 x 6 image, convolution gives two 4 x 4 feature maps.
2 x 2 max pooling keeps the largest value in each 2 x 2 block, giving a new
but smaller 2 x 2 image per filter:

Filter 1:   3 0        Filter 2:  -1 1
            3 1                    0 3

Each filter is a channel of the new image.
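A minimal sketch of the 2 x 2 max pooling step (illustration only; max_pool2x2 is a hypothetical helper):

```python
import numpy as np

def max_pool2x2(fmap):
    """Keep the largest value in each non-overlapping 2 x 2 block."""
    h, w = fmap.shape
    out = np.zeros((h // 2, w // 2), dtype=int)
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            out[i // 2, j // 2] = fmap[i:i + 2, j:j + 2].max()
    return out

# The 4 x 4 feature map produced by Filter 1
fmap1 = np.array([[ 3, -1, -3, -1],
                  [-3,  1,  0, -3],
                  [-3, -3,  0,  1],
                  [ 3, -2, -2, -1]])
print(max_pool2x2(fmap1))   # [[3 0]
                            #  [3 1]]  -- the 2 x 2 map on the slide
```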
The whole CNN

Convolution -> Max Pooling produces a new image, smaller than the original.
The number of channels is the number of filters.
Convolution and Max Pooling can be repeated many times.
The whole CNN

Convolution -> Max Pooling -> (a new image) -> Convolution -> Max Pooling -> (a new image)
-> Flattened -> Fully Connected Feedforward network -> cat, dog, …
Flattening

The final 2 x 2 x 2 output (two pooled 2 x 2 maps) is flattened into an
8-dimensional vector and fed into a fully connected feedforward network.
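Flattening itself is just a reshape. A sketch under the assumption that the two pooled 2 x 2 maps are stacked channel-first (the exact ordering of the flattened vector depends on the layout convention):

```python
import numpy as np

# Two pooled 2 x 2 maps (one channel per filter), stacked channel-first
pooled = np.array([[[ 3, 0],
                    [ 3, 1]],     # Filter 1
                   [[-1, 1],
                    [ 0, 3]]])    # Filter 2

flat = pooled.flatten()           # length-8 vector for the fully connected network
print(flat)                       # [ 3  0  3  1 -1  1  0  3]
```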
Convolution Layer (Summary)

• A convolution layer takes an input feature map of dimension W × H × N and
  produces an output feature map of dimension Ŵ × Ĥ × M.
• Each layer is defined using the following parameters:
  • # Input channels (N)
  • # Output channels (M)
  • Kernel size
  • Padding
  • Stride
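For concreteness, here is how these parameters map onto a single Keras layer; this is a sketch only, and the values are placeholders chosen for illustration.

```python
from tensorflow.keras.layers import Conv2D

# One convolution layer defined by the parameters listed above
layer = Conv2D(filters=16,         # M, number of output channels
               kernel_size=(3, 3), # kernel size
               strides=(1, 1),     # stride
               padding='same')     # 'same' pads so that W_hat = W, H_hat = H
# N, the number of input channels, is taken from the input tensor itself.
```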
Convolution Layer (Summary)

Figure: a 5x5 input is convolved with a 3x3 kernel with stride=padding=1 to produce an output of size 5x5.
Figure: a 5x5 input is convolved with a 3x3 kernel with stride=2 and padding=1 to produce an output of size 3x3.
Parameters
• For one layer:
  • i, no. of input maps (or channels)
  • f, filter size (just the side length)
  • o, no. of output maps (or channels; this is also defined by how many filters are used)
• One filter is applied to every input map
• num_params = weights + biases = [i × (f×f) × o] + o
Parameters
Greyscale image with a 2×2 filter, output of 3 channels
• i = 1 (greyscale has only 1 channel)
• f = 2
• o = 3
• num_params = [i × (f×f) × o] + o = [1 × (2×2) × 3] + 3 = 15
RGB image with a 2×2 filter, output of 1 channel
• i = 3 (3 channels)
• f = 2
• o = 1
• num_params = [i × (f×f) × o] + o = [3 × (2×2) × 1] + 1 = 13
Image with 2 channels, with a 2×2 filter, and output of 3 channels
• i = 2 (2 channels)
• f = 2
• o = 3
• num_params = [i × (f×f) × o] + o = [2 × (2×2) × 3] + 3 = 27
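The three examples above can be checked directly with the formula; a small helper, named here only for illustration:

```python
def conv_params(i, f, o):
    """num_params = weights + biases = (i * f * f * o) + o"""
    return i * f * f * o + o

print(conv_params(i=1, f=2, o=3))   # 15  greyscale image, 2x2 filter, 3 output channels
print(conv_params(i=3, f=2, o=1))   # 13  RGB image, 2x2 filter, 1 output channel
print(conv_params(i=2, f=2, o=3))   # 27  2-channel image, 2x2 filter, 3 output channels
```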
Formula

• output = (input + 2p − k)/s + 1
• where k is the kernel size, p the padding and s the stride
• E.g. s = 1, p = 0, k = 3, input = 28
• output = (28 − 3)/1 + 1 = 26, i.e. 26x26

Problem: In AlexNet, the input image is of size 227x227x3. The first
convolutional layer has 96 kernels of size 11x11x3. The stride is 4 and the
padding is 0. Therefore, the size of the output image right after the first bank
of convolutional layers is (227 + 2×0 − 11)/4 + 1 = 55.

So, the output image is of size 55x55x96 (one channel for each kernel).
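The 26x26 example, the 5x5 figure examples, and the AlexNet problem all follow from the same formula (the helper name is illustrative):

```python
def conv_output_size(input_size, k, p=0, s=1):
    """output = (input + 2p - k) / s + 1"""
    return (input_size + 2 * p - k) // s + 1

print(conv_output_size(28, k=3, p=0, s=1))     # 26  -> 26 x 26
print(conv_output_size(5, k=3, p=1, s=1))      # 5   (figure example above)
print(conv_output_size(5, k=3, p=1, s=2))      # 3   (figure example above)
print(conv_output_size(227, k=11, p=0, s=4))   # 55  -> 55 x 55 x 96 with 96 kernels
```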
CNN in Keras
Only the network structure and the input format are modified (vector -> 3-D tensor).

Input_shape = (28, 28, 1): 28 x 28 pixels; 1 channel for black/white, 3 for RGB.

Convolution (there are 25 3x3 filters) -> Max Pooling -> Convolution -> Max Pooling
CNN in Keras
Only the network structure and the input format are modified (vector -> 3-D array).

Input: 1 x 28 x 28
Convolution -> 25 x 26 x 26   (How many parameters for each filter? 9)
Max Pooling -> 25 x 13 x 13
Convolution -> 50 x 11 x 11   (How many parameters for each filter? 225 = 25 x 9)
Max Pooling -> 50 x 5 x 5
CNN in Keras
Only the network structure and the input format are modified (vector -> 3-D array).

Input: 1 x 28 x 28
Convolution -> 25 x 26 x 26
Max Pooling -> 25 x 13 x 13
Convolution -> 50 x 11 x 11
Max Pooling -> 50 x 5 x 5
Flattened -> 1250
Fully connected feedforward network -> Output
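A minimal Keras sketch of the architecture on these slides. The hidden-layer width, the activations, and the 10-class softmax output are assumptions (the slides only say “fully connected feedforward network” and “Output”). Keras reports shapes channels-last, e.g. (26, 26, 25) rather than 25 x 26 x 26.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential([
    Conv2D(25, (3, 3), input_shape=(28, 28, 1)),  # -> 26 x 26 x 25
    MaxPooling2D((2, 2)),                         # -> 13 x 13 x 25
    Conv2D(50, (3, 3)),                           # -> 11 x 11 x 50
    MaxPooling2D((2, 2)),                         # -> 5 x 5 x 50
    Flatten(),                                    # -> 1250
    Dense(100, activation='relu'),                # assumed hidden layer size
    Dense(10, activation='softmax'),              # assumed 10-way output (e.g. digits)
])
model.summary()
```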
AlphaGo

The neural network takes the board as input and predicts the next move (19 x 19 positions).
The board is represented as a 19 x 19 matrix: black: 1, white: -1, none: 0.
A fully-connected feedforward network can be used, but a CNN performs much better.

AlphaGo’s policy network: the architecture on the original slide is quoted from their Nature article.
Note: AlphaGo does not use Max Pooling.
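As an illustration only (AlphaGo's actual input uses many more feature planes than this), a board state encoded the way the slide describes:

```python
import numpy as np

# The simple board encoding described on the slide:
# black stone = 1, white stone = -1, empty point = 0
board = np.zeros((19, 19), dtype=int)
board[3, 3] = 1      # a black stone (position chosen only for illustration)
board[15, 15] = -1   # a white stone

# Fed to a CNN as a 19 x 19 x 1 "image"; a fully connected network would
# instead see it flattened to a 361-dimensional vector.
cnn_input = board.reshape(19, 19, 1)
print(cnn_input.shape)   # (19, 19, 1)
```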
CNN in speech recognition

The input is a spectrogram (frequency vs. time), treated as an image by the CNN.
The filters move in the frequency direction.
CNN in text classification

Source of image:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.703.6858&rep=rep1&type=pdf
Convolutional NN

Convolutional Neural Networks are an extension of the traditional Multi-layer Perceptron, based on 3 ideas:
1. Local receptive fields
2. Shared weights
3. Spatial / temporal sub-sampling

See the LeCun paper (1998) on text recognition:
http://yann.lecun.com/exdb/publis/pdf/lecun-01a.pdf
What is a Convolutional NN?
A CNN is a multi-layer NN architecture:
 Convolutional + Non-Linear Layer
 Sub-sampling Layer
 Convolutional + Non-Linear Layer
 Fully connected layers
⚫ Supervised training

The early layers perform Feature Extraction; the final layers perform Classification.
What is a Convolutional NN?

Figure: Convolution + NL -> 2x2 Sub-sampling -> Convolution + NL -> …
CNN success story: ILSVRC 2012
ImageNet database: 14 million labeled images, 20K categories
ILSVRC: Classification
ImageNet Classifications 2012
ILSVRC 2012: top rankers
http://www.image-net.org/challenges/LSVRC/2012/results.html

N  Top-5 error  Algorithm                                      Team                Authors
1  0.153        Deep Conv. Neural Network                      Univ. of Toronto    Krizhevsky et al.
2  0.262        Features + Fisher Vectors + Linear classifier  ISI                 Gunji et al.
3  0.270        Features + FV + SVM                            OXFORD_VGG          Simonyan et al.
4  0.271        SIFT + FV + PQ + SVM                           XRCE/INRIA          Perronin et al.
5  0.300        Color desc. + SVM                              Univ. of Amsterdam  van de Sande et al.
ImageNet 2013: top rankers
http://www.image-net.org/challenges/LSVRC/2013/results.php

N  Top-5 error  Algorithm                           Team                  Authors
1  0.117        Deep Convolutional Neural Network   Clarifai              Zeiler
2  0.129        Deep Convolutional Neural Networks  Nat. Univ. Singapore  Min Lin
3  0.135        Deep Convolutional Neural Networks  NYU                   Zeiler, Fergus
4  0.135        Deep Convolutional Neural Networks  —                     Andrew Howard
5  0.137        Deep Convolutional Neural Networks  Overfeat, NYU         Pierre Sermanet et al.

ImageNet Classifications 2013
Conv Net Topology
⚫ 5 convolutional layers
⚫ 3 fully connected layers + soft-max
⚫ 650K neurons, 60 million weights
Why ConvNet should be Deep?

Rob Fergus, NIPS 2013
Conv Nets: beyond Visual Classification

CNN applications
CNN is a big hammer: plenty of low-hanging fruits, you just need the right nail!
Conv NN: Detection
Sermanet, CVPR 2014

Conv NN: Scene parsing
Farabet, PAMI 2013

CNN: indoor semantic labeling (RGBD)
Farabet, 2013

Conv NN: Action Detection
Taylor, ECCV 2010