Lecture 5: Convolutional Neural Networks
Recap from last time: a two-layer network x → W1 → h → W2 → s, with dimensions 3072 → 100 → 10. [Figure: at each node f in the computational graph, an “upstream gradient” arrives from the loss and “downstream gradients” are passed back toward the inputs.]
So far: backprop with scalars
Regular derivative dy/dx: if x changes by a small amount, how much does y change?
Backprop with Vectors
Loss L is still a scalar! Consider a node z = f(x, y), where x has dimension Dx, y has dimension Dy, and z has dimension Dz.
- “Upstream gradient”: dL/dz, a vector of size Dz. For each element of z, how much does it influence L?
- “Local gradients”: the Jacobian matrices dz/dx of size [Dx x Dz] and dz/dy of size [Dy x Dz].
- “Downstream gradients”: computed by matrix-vector multiply, dL/dx = (dz/dx)(dL/dz) of size Dx and dL/dy = (dz/dy)(dL/dz) of size Dy.
Gradients of variables wrt the loss always have the same dims as the original variable.
Backprop with Vectors: an example
f(x) = max(0, x) (elementwise ReLU)

4D input x:    4D output z:
[  1 ]         [ 1 ]
[ -2 ]         [ 0 ]
[  3 ]         [ 3 ]
[ -1 ]         [ 0 ]

4D upstream gradient dL/dz:
[  4 ]
[ -1 ]
[  5 ]
[  9 ]

The Jacobian dz/dx is diagonal: 1 where the input was positive, 0 elsewhere. The matrix-vector multiply therefore just masks the upstream gradient, giving the downstream gradient dL/dx = [4, 0, 5, 0].
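A minimal numpy sketch of this forward/backward pass (function names are illustrative, not from the lecture; the values match the example above):

```python
import numpy as np

def relu_forward(x):
    # Elementwise z = max(0, x); keep x around for the backward pass.
    return np.maximum(0, x), x

def relu_backward(dz, x):
    # The Jacobian dz/dx is diagonal with 1 where x > 0,
    # so the matrix-vector multiply reduces to masking.
    return dz * (x > 0)

x = np.array([1., -2., 3., -1.])
z, cache = relu_forward(x)           # z  = [1, 0, 3, 0]
dz = np.array([4., -1., 5., 9.])     # upstream gradient dL/dz
dx = relu_backward(dz, cache)        # dx = [4, 0, 5, 0]
```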
Backprop with Matrices (or Tensors)
Loss L is still a scalar! Now x has shape [Dx x Mx], y has shape [Dy x My], and z = f(x, y) has shape [Dz x Mz].
- “Upstream gradient”: dL/dz of shape [Dz x Mz]. For each element of z, how much does it influence L?
- “Local gradients”: generalized Jacobians, e.g. dz/dx of shape [(Dx x Mx) x (Dz x Mz)].
- “Downstream gradients”: dL/dx of shape [Dx x Mx] and dL/dy of shape [Dy x My], again obtained by a (generalized) matrix multiply of local and upstream gradients.
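In practice the generalized Jacobian is never formed explicitly; for a common node like z = w·x (matrix multiply), the downstream gradients collapse to ordinary matrix products. A minimal numpy sketch, with illustrative names:

```python
import numpy as np

def matmul_forward(w, x):
    # z = w @ x : w is [Dz, Dx], x is [Dx, Mx], z is [Dz, Mx]
    return w @ x

def matmul_backward(dz, w, x):
    # Downstream gradients have the same shapes as the inputs:
    dx = w.T @ dz    # [Dx, Mx], like x
    dw = dz @ x.T    # [Dz, Dx], like w
    return dw, dx

w = np.random.randn(10, 3072)
x = np.random.randn(3072, 1)
z = matmul_forward(w, x)
dz = np.random.randn(*z.shape)       # upstream gradient dL/dz
dw, dx = matmul_backward(dz, w, x)
assert dw.shape == w.shape and dx.shape == x.shape
```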
[Figure: the two-layer network x → W1 → h → W2 → s (3072 → 100 → 10), shown again as a recap before convolutional networks.]
A brief history:
- Frank Rosenblatt, ~1957: the Mark I Perceptron recognized letters of the alphabet, with weights adjusted by an explicit update rule.
- Widrow and Hoff, ~1960: Adaline/Madaline.
- Rumelhart et al., 1986: back-propagation with recognizable math.
- Hinton and Salakhutdinov, 2006: reinvigorated research in Deep Learning.
- Hubel and Wiesel, 1962: “Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex.”
- Neocognitron [Fukushima 1980].
- LeNet-5 [LeCun et al. 1998].
- “AlexNet” [Krizhevsky, Sutskever, Hinton 2012].
- ConvNets are everywhere today: face verification [Taigman et al. 2014], pose estimation [Toshev, Szegedy 2014], game playing [Guo et al. 2014], and image captioning (captions generated by Justin Johnson using NeuralTalk2).
Fully Connected Layer
A 32x32x3 image is stretched into a 3072 x 1 input vector. Multiplying by a 10 x 3072 weight matrix W gives a 10-dimensional activation. Each output is 1 number: the result of taking a dot product between a row of W and the input (a 3072-dimensional dot product).
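As a minimal numpy sketch of this layer (illustrative names):

```python
import numpy as np

x = np.random.randn(3072)        # 32*32*3 image, stretched to a vector
W = np.random.randn(10, 3072)    # weight matrix
s = W @ x                        # 10 outputs; each is a 3072-dim dot product
```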
Convolution Layer
The input is a 32x32x3 image: 32 height, 32 width, 3 depth. Convolve a 5x5x3 filter with the image, i.e. slide it over the image spatially, computing dot products. The filter extends through the full depth of the input volume.

At each position we get 1 number: the result of taking a dot product between the filter and a small 5x5x3 chunk of the image (i.e. 5*5*3 = 75-dimensional dot product + bias).

Sliding the filter over all spatial locations yields a 28x28x1 activation map.
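A naive numpy sketch of this operation, written as an explicit loop for clarity (names are illustrative):

```python
import numpy as np

image = np.random.randn(32, 32, 3)    # H x W x C
filt  = np.random.randn(5, 5, 3)      # one 5x5x3 filter
bias  = 0.1

out = np.zeros((28, 28))              # (32 - 5) + 1 = 28
for i in range(28):
    for j in range(28):
        chunk = image[i:i+5, j:j+5, :]           # a 5x5x3 chunk of the image
        out[i, j] = np.sum(chunk * filt) + bias  # 75-dim dot product + bias
```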
Convolution Layer
Each filter produces its own activation map. For example, with six 5x5x3 filters we get six separate 28x28 activation maps, stacked to form a 28x28x6 output.

A ConvNet is a sequence of convolution layers, interspersed with activation functions:

32x32x3 → CONV, ReLU (e.g. six 5x5x3 filters) → 28x28x6 → CONV, ReLU (e.g. ten 5x5x6 filters) → 24x24x10 → CONV, ReLU → …
A closer look at spatial dimensions:
- 7x7 input (spatially), 3x3 filter, applied with stride 1 => 5x5 output.
- 7x7 input (spatially), 3x3 filter, applied with stride 2 => 3x3 output!
- 7x7 input (spatially), 3x3 filter, applied with stride 3? Doesn’t fit! Cannot apply a 3x3 filter on a 7x7 input with stride 3.

Output size: (N - F) / stride + 1
e.g. N = 7, F = 3:
- stride 1 => (7 - 3)/1 + 1 = 5
- stride 2 => (7 - 3)/2 + 1 = 3
- stride 3 => (7 - 3)/3 + 1 = 2.33 :\

In practice it is common to zero-pad the border with P pixels, which generalizes the formula to:
(N + 2P - F) / stride + 1
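A tiny helper implementing this formula (hypothetical name, not from the slides), checked against the examples above:

```python
def conv_output_size(n, f, stride, pad=0):
    """Spatial output size: (N + 2P - F) / stride + 1."""
    size = (n + 2 * pad - f) / stride + 1
    assert size == int(size), "filter does not fit cleanly"
    return int(size)

print(conv_output_size(7, 3, stride=1))   # 5
print(conv_output_size(7, 3, stride=2))   # 3
# conv_output_size(7, 3, stride=3) -> AssertionError: doesn't fit
```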
Example:
Input volume 32x32x3; ten 5x5 filters with stride 1, pad 2.
- Output volume size: (32 + 2*2 - 5)/1 + 1 = 32 spatially, so the output is 32x32x10.
- Number of parameters in this layer: each filter has 5*5*3 + 1 = 76 params (+1 for the bias), so 76 * 10 = 760 parameters.
Convolution layer: summary
Let’s assume the input is W1 x H1 x C.
The conv layer needs 4 hyperparameters:
- Number of filters K
- The filter size F
- The stride S
- The zero padding P
This will produce an output of W2 x H2 x K, where:
- W2 = (W1 - F + 2P)/S + 1
- H2 = (H1 - F + 2P)/S + 1
Number of parameters: F*F*C*K weights and K biases.
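A small sketch that computes both the output shape and the parameter count from these formulas (hypothetical helper, not from the slides):

```python
def conv_layer_summary(w1, h1, c, k, f, s, p):
    # Output spatial dims from (W1 - F + 2P)/S + 1, and F*F*C*K + K params.
    w2 = (w1 - f + 2 * p) // s + 1
    h2 = (h1 - f + 2 * p) // s + 1
    params = f * f * c * k + k
    return (w2, h2, k), params

shape, params = conv_layer_summary(32, 32, 3, k=10, f=5, s=1, p=2)
print(shape, params)   # (32, 32, 10) 760
```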
Common settings:
- K = powers of 2, e.g. 32, 64, 128, 512
- F = 3, S = 1, P = 1
- F = 5, S = 1, P = 2
- F = 1, S = 1, P = 0
(btw, 1x1 convolution layers make perfect sense)

1x1 CONV with 32 filters on a 56x56x64 input: each filter has size 1x1x64 and performs a 64-dimensional dot product, producing a 56x56x32 output.
Example: CONV layer in PyTorch
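A minimal sketch using torch.nn.Conv2d, mirroring the running example (six 5x5 filters on a 32x32x3 input):

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=6, kernel_size=5,
                 stride=1, padding=0)

x = torch.randn(1, 3, 32, 32)   # batch of one 32x32x3 image (NCHW)
z = conv(x)
print(z.shape)                   # torch.Size([1, 6, 28, 28])
```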
Example: CONV layer in Keras
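The equivalent sketch with tf.keras.layers.Conv2D (Keras is channels-last by default):

```python
import tensorflow as tf

conv = tf.keras.layers.Conv2D(filters=6, kernel_size=5,
                              strides=1, padding="valid")

x = tf.random.normal((1, 32, 32, 3))   # batch of one 32x32x3 image (NHWC)
z = conv(x)
print(z.shape)                          # (1, 28, 28, 6)
```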
The brain/neuron view of a CONV layer
For a 32x32x3 image and a 5x5x3 filter: each output is 1 number, the result of taking a dot product between the filter and this part of the image (i.e. a 5*5*3 = 75-dimensional dot product). This is the same computation an artificial neuron performs, but with local connectivity.
Receptive field
The 5x5 region of the input that a neuron is connected to is called its receptive field. An activation map is a sheet of such neuron outputs: each neuron looks at a small region of the input, and all of them share the same filter parameters.
Reminder: Fully Connected Layer
Each neuron looks at the full input volume. The 32x32x3 image is stretched to 3072 x 1; the weights are 10 x 3072, and each activation is 1 number: the result of taking a dot product between a row of W and the input (a 3072-dimensional dot product).
two more layers to go: POOL/FC
Pooling layer
- makes the representations smaller and more manageable
- operates over each activation map independently:
MAX POOLING
Max pooling with 2x2 filters and stride 2, on a single depth slice:

input:          output:
1 1 2 4         6 8
5 6 7 8         3 4
3 2 1 0
1 2 3 4
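A minimal numpy sketch of this pooling step, using a reshape trick (illustrative):

```python
import numpy as np

x = np.array([[1, 1, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]])

h, w = x.shape
# Group each 2x2 window into its own axes, then take the max over them.
out = x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))
print(out)   # [[6 8]
             #  [3 4]]
```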
Pooling layer: summary
Let’s assume the input is W1 x H1 x C.
The pooling layer needs 2 hyperparameters:
- The spatial extent F
- The stride S
This produces an output of W2 x H2 x C, where:
- W2 = (W1 - F)/S + 1
- H2 = (H1 - F)/S + 1
Number of parameters: 0
Fully Connected Layer (FC layer)
- Contains neurons that connect to the entire input volume, as in ordinary Neural Networks
[ConvNetJS demo: training on CIFAR-10]
http://cs.stanford.edu/people/karpathy/convnetjs/demo/cifar10.html
Summary
- ConvNets stack CONV, ReLU, POOL, and FC layers.
- Output sizes follow (N + 2P - F)/S + 1; pooling layers have no parameters.