
Deep Learning

Vimal
BITS Pilani
Pilani Campus

Lecture No. 3 | Deep Networks


Time: 11 AM – 1 PM
Date: 14/05/2022
These slides are assembled by the instructor with grateful acknowledgement of the many
others who made their course materials freely available online.
Terminology ….

Image Credit: https://www.sumologic.com/blog/machine-learning-deep-learning/


So what is Deep (Machine) Learning?

A few different ideas:

(Hierarchical) Compositionality
Cascade of non-linear transformations
Multiple layers of representations

End-to-End Learning
Learning (goal-driven) representations
Learning to extract features

Distributed Representations
No single neuron “encodes” everything
Groups of neurons work together
Traditional Machine Learning
VISION: hand-crafted features (SIFT/HOG) [fixed] → your favorite classifier [learned] → “car”

SPEECH: hand-crafted features (MFCC) [fixed] → your favorite classifier [learned] → \ˈd ē p\

NLP: “This burrito place is yummy and fun!” → hand-crafted features (Bag-of-words) [fixed] → your favorite classifier [learned] → “+”
Slide Credit: Marc'Aurelio Ranzato, Yann LeCun
Hierarchical Compositionality
VISION: pixels → edge → texton → motif → part → object

SPEECH: sample → spectral band → formant → motif → phone → word

NLP: character → word → NP/VP/.. → clause → sentence → story

Slide Credit: Marc'Aurelio Ranzato, Yann LeCun


Building A Complicated Function
Given a library of simple functions

Compose into a complicated function

Slide Credit: Marc'Aurelio Ranzato, Yann LeCun


Building A Complicated Function
Given a library of simple functions

Idea 1: Linear Combinations
• Boosting
• Kernels
• …

Compose into a complicated function

Slide Credit: Marc'Aurelio Ranzato, Yann LeCun


Building A Complicated Function
Given a library of simple functions

Idea 2: Compositions
• Deep Learning
• Grammar models
• Scattering transforms…

Compose into a complicated function

Slide Credit: Marc'Aurelio Ranzato, Yann LeCun


Deep Learning = Hierarchical Compositionality

Low-Level Feature → Mid-Level Feature → High-Level Feature → Trainable Classifier → “car”

Feature visualization of convolutional net trained on ImageNet from [Zeiler & Fergus 2013]

Slide Credit: Marc'Aurelio Ranzato, Yann LeCun


Sparse DBNs
[Lee et al. ICML ‘09]
Figure courtesy: Quoc Le
So what is Deep (Machine) Learning?

A few different ideas:

(Hierarchical) Compositionality
Cascade of non-linear transformations
Multiple layers of representations

End-to-End Learning
Learning (goal-driven) representations
Learning to extract features

Distributed Representations
No single neuron “encodes” everything
Groups of neurons work together
Feature Engineering

• SIFT
• Spin Images
• HoG
• Textons

and many, many more…


Traditional Machine Learning (more accurately)
“Learned”
VISION: SIFT/HOG [fixed] → K-Means/pooling [unsupervised] → classifier [supervised] → “car”

SPEECH: MFCC [fixed] → Mixture of Gaussians [unsupervised] → classifier [supervised] → \ˈd ē p\

NLP: “This burrito place is yummy and fun!” → Parse Tree Syntactic [fixed] → n-grams [unsupervised] → classifier [supervised] → “+”
Slide Credit: Marc'Aurelio Ranzato, Yann LeCun
Deep Learning = End-to-End Learning
“Learned”
VISION: SIFT/HOG [fixed] → K-Means/pooling [unsupervised] → classifier [supervised] → “car”

SPEECH: MFCC [fixed] → Mixture of Gaussians [unsupervised] → classifier [supervised] → \ˈd ē p\

NLP: “This burrito place is yummy and fun!” → Parse Tree Syntactic [fixed] → n-grams [unsupervised] → classifier [supervised] → “+”
Slide Credit: Marc'Aurelio Ranzato, Yann LeCun
“Shallow” vs Deep Learning
“Shallow” models:
hand-crafted Feature Extractor [fixed] → “Simple” Trainable Classifier [learned]

Deep models:
Trainable Feature-Transform / Classifier → Trainable Feature-Transform / Classifier → Trainable Feature-Transform / Classifier
Learned Internal Representations


Slide Credit: Marc'Aurelio Ranzato, Yann LeCun
So what is Deep (Machine) Learning?

A few different ideas:

(Hierarchical) Compositionality
Cascade of non-linear transformations
Multiple layers of representations

End-to-End Learning
Learning (goal-driven) representations
Learning to extract features

Distributed Representations
No single neuron “encodes” everything
Groups of neurons work together
Distributed Representations Toy Example

Local vs Distributed

Slide Credit: Moontae Lee


Distributed Representations Toy Example

Can we interpret each dimension?

Slide Credit: Moontae Lee


Power of distributed representations!

Local

Distributed

Slide Credit: Moontae Lee


Power of distributed representations!

United States : Dollar :: Mexico:?

Slide Credit: Moontae Lee


Benefits of Deep/Representation Learning

(Usually) Better Performance


“Because gradient descent is better than you”
Yann LeCun

New domains without “experts”


RGBD/Lidar
Multi-spectral data
Gene-expression data
Unclear how to hand-engineer
Benefits of Deep/Representation Learning

Modularity!
Plug and play architectures!
Problems with Deep Learning

Problem#1: Non-Convex! Non-Convex! Non-Convex!


Depth>=3: most losses non-convex in parameters
Theoretically, all bets are off
Leads to stochasticity
different initializations → different local minima

Standard response #1
“Yes, but all interesting learning problems are non-convex”
For example, human learning
Order matters → wave hands → non-convexity

Standard response #2
“Yes, but it often works!”
Problems with Deep Learning

Problem#2: Lack of interpretability


Hard to track down what’s failing
Pipeline systems have “oracle” performances at each step
In end-to-end systems, it’s hard to know why things are not working

Standard response #1
Tricks of the trade: visualize features, add losses at different layers, pre-train to avoid
degenerate initializations…
“We’re working on it”

Standard response #2
“Yes, but it often works!”
Problems with Deep Learning

Problem#3: Lack of easy reproducibility


Direct consequence of stochasticity & non-convexity

Standard response #1
It’s getting much better
Standard toolkits/libraries/frameworks now available
PyTorch, TensorFlow, MxNet…

Standard response #2
“Yes, but it often works!”
Story So Far
• Characteristics of Deep Learning
• Hierarchical representation
• End-to-end learning
• Distributed representation
• Challenges of Deep Learning
• Design of perceptron networks
• pen and paper based
• MLP as universal
• Boolean Function
• Classifiers
• Function approximator
• Need for Depth
Agenda
• MLP Parameter Estimation via Empirical Risk Minimization

• Function Optimization Refresher

• Computational Graph

• Deep MLP Training


• Activation Functions
• Input Output Specification
• ……
Option 1: Construct by hand
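[Figure: hand-built MLP whose hidden perceptrons implement the four edges of a diamond-shaped decision region in the (X1, X2) plane, with vertices at (0,1), (1,0), (0,-1) and (-1,0); an output perceptron combines the hidden outputs]

Assuming simple perceptrons:
output = 1 if $\sum_i w_i x_i \ge T$, else 0

• Not possible for all but the simplest problems.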
Option 2: Automatic estimation
of an MLP

• Given the function to model, we can derive the parameters of the network to model it, through computation
Empirical Risk Minimization

• Given a training set of input-output pairs $(X_1, d_1), (X_2, d_2), \ldots, (X_N, d_N)$
– Error on the $i$th instance: $Div(f(X_i; W), d_i)$
– Empirical average error (Empirical Risk) on all training data:
$Loss(W) = \frac{1}{N} \sum_i Div(f(X_i; W), d_i)$
• Estimate the parameters to minimize the empirical estimate of expected error:
$\hat{W} = \arg\min_W Loss(W)$
– I.e. minimize the empirical risk over the drawn samples
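
The empirical risk above can be sketched in a few lines of Python. This is only an illustration: the one-parameter model `linear_model`, the divergence `squared_error`, the toy data and the candidate grid are all assumptions made for the example, not part of the lecture.

```python
def empirical_risk(f, W, data, div):
    """Average divergence between the model output f(X, W) and the target d."""
    return sum(div(f(X, W), d) for X, d in data) / len(data)

# Hypothetical one-parameter model and divergence, chosen for illustration.
def linear_model(x, w):
    return w * x

def squared_error(y, d):
    return (y - d) ** 2

# Toy training set generated by d = 2 * x.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

# Crude minimization: evaluate the empirical risk on a small grid of candidates.
candidates = [0.0, 1.0, 2.0, 3.0]
best_w = min(candidates,
             key=lambda w: empirical_risk(linear_model, w, data, squared_error))
print(best_w, empirical_risk(linear_model, best_w, data, squared_error))
```

A grid search is used only to keep the sketch short; the rest of the lecture replaces it with gradient-based minimization.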
Perceptron Learning Algorithm

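• Given $N$ training instances $(X_1, Y_1), (X_2, Y_2), \ldots, (X_N, Y_N)$
– $Y_i = +1$ or $-1$ (using a +1/-1 representation for classes to simplify notation)
• Initialize $W$
• Cycle through the training instances:
• do
– For $i = 1 \ldots N$
  $O(X_i) = sign(W^T X_i)$
  • If $O(X_i) \ne Y_i$
    $W = W + Y_i X_i$
• until no more classification errors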
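
A minimal sketch of the perceptron learning rule in Python, assuming the usual mistake-driven update (add $Y_i X_i$ to the weights on every misclassified instance). The zero initialization, the `max_epochs` cap, the convention sign(0) = +1 and the OR data set (with a constant 1 appended to each input to act as a bias) are choices made for this example.

```python
def sign(v):
    # Convention for this sketch: sign(0) is treated as +1.
    return 1 if v >= 0 else -1

def train_perceptron(X, Y, max_epochs=100):
    """Cycle through the data, correcting the weights on every mistake."""
    w = [0.0] * len(X[0])
    for _ in range(max_epochs):
        errors = 0
        for x, y in zip(X, Y):
            output = sign(sum(wj * xj for wj, xj in zip(w, x)))
            if output != y:
                # Misclassified: move the weights towards y * x.
                w = [wj + y * xj for wj, xj in zip(w, x)]
                errors += 1
        if errors == 0:  # until no more classification errors
            break
    return w

# OR function with +1/-1 labels; the last input is a constant 1 (bias).
X = [(0, 0, 1), (1, 0, 1), (0, 1, 1), (1, 1, 1)]
Y = [-1, 1, 1, 1]
w = train_perceptron(X, Y)
predictions = [sign(sum(wj * xj for wj, xj in zip(w, x))) for x in X]
print(w, predictions)
```

The loop terminates here because the OR data are linearly separable; on non-separable data only the `max_epochs` cap stops it.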
Perceptrons with differentiable
activation functions
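[Figure: perceptron with inputs $x_1, \ldots, x_N$, weights $w_1, \ldots, w_{N+1}$ (the last one a bias weight), a summing unit and an activation $\sigma$]

$z = \sum_i w_i x_i, \quad y = \sigma(z)$

• $\sigma(z)$ is a differentiable function of $z$
– $\frac{d\sigma(z)}{dz}$ is well-defined and finite for all $z$
• Using the chain rule, $y$ is a differentiable function of both inputs $x_i$ and weights $w_i$
• This means that we can compute the change in the output for small changes in either the input or the weights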
Overall network is differentiable

[Figure: multi-layer network with example weights $w^2_{1,1}$ and $w^2_{2,1}$ labeled]

$y$ = output of overall network
$w^k_{i,j}$ = weight connecting the $i$th unit of the $k$th layer to the $j$th unit of the $(k+1)$-th layer
$y^k_i$ = output of the $i$th unit of the $k$th layer

$y$ is differentiable w.r.t. both $w^k_{i,j}$ and $y^k_i$

• Every individual perceptron is differentiable w.r.t. its inputs and its weights (including the “bias” weight)
• By the chain rule, the overall function is differentiable w.r.t. every parameter (weight or bias)
– Small changes in the parameters result in measurable changes in output
Overall function is differentiable

• The overall function is differentiable w.r.t every parameter


– Small changes in the parameters result in measurable changes
in the output
– We will derive the actual derivatives using the chain rule later
Overall setting for “Learning” the
MLP

• Given a training set of input-output pairs $(X_1, d_1), (X_2, d_2), \ldots, (X_N, d_N)$
– $d_i$ is the desired output of the network in response to $X_i$
– $X_i$ and $d_i$ may both be vectors
• …we must find the network parameters such that the network produces the desired output for each training input
– Or a close approximation of it
– The architecture of the network must be specified by us
ERM for neural networks
Actual output of network:
$Y_i = f(X_i; W_1, W_2, \ldots, W_K)$

Desired output of network: $d_i$

Error on $i$-th training input: $Div(Y_i, d_i)$

Average training error (loss):
$Loss(W_1, \ldots, W_K) = \frac{1}{N} \sum_{i=1}^{N} Div(Y_i, d_i)$

– What is the exact form of Div()? More on this later

• Optimize network parameters to minimize the total error over all training inputs
Problem Statement
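• Given a training set of input-output pairs $(X_1, d_1), (X_2, d_2), \ldots, (X_N, d_N)$

• Minimize the following function
$Loss(W) = \frac{1}{N} \sum_i Div(f(X_i; W), d_i)$
w.r.t. $W$

• This is a problem of function minimization
– An instance of optimization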
• FUNCTION OPTIMIZATION Refresher
A brief note on derivatives..

• A derivative of a function at any point tells us how much a minute increment to the argument of the function will increment the value of the function
▪ For any $y = f(x)$, it is expressed as a multiplier $\alpha$ to a tiny increment $\Delta x$ to obtain the increment $\Delta y$ to the output: $\Delta y = \alpha \Delta x$
▪ Based on the fact that at a fine enough resolution, any smooth, continuous function is locally linear at any point
Scalar function of scalar argument

• When $x$ and $y$ are scalar, $y = f(x)$
▪ Derivative: $\Delta y = \alpha \Delta x$
▪ Often represented (using somewhat inaccurate notation) as $\frac{dy}{dx}$
▪ Or alternately (and more reasonably) as $f'(x)$
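Multivariate scalar function:
Scalar function of vector argument
$y = f(x), \quad \Delta y = \alpha \Delta x$
Note: $\Delta x = [\Delta x_1, \ldots, \Delta x_D]^T$ is now a vector

• Giving us that $\alpha$ is a row vector: $\alpha = [\alpha_1 \; \cdots \; \alpha_D]$
$\Delta y = \alpha_1 \Delta x_1 + \alpha_2 \Delta x_2 + \cdots + \alpha_D \Delta x_D$
• The partial derivative $\alpha_i$ gives us how $y$ increments when only $x_i$ is incremented
• Often represented as $\frac{\partial y}{\partial x_i}$
$\Delta y = \frac{\partial y}{\partial x_1} \Delta x_1 + \frac{\partial y}{\partial x_2} \Delta x_2 + \cdots + \frac{\partial y}{\partial x_D} \Delta x_D$

$\Delta y = \nabla_x y \, \Delta x$
• Where $\nabla_x y = \left[\frac{\partial y}{\partial x_1} \; \cdots \; \frac{\partial y}{\partial x_D}\right]$ (we will be using this symbol for vector and matrix derivatives)
o You may be more familiar with the term “gradient”, which is actually defined as the transpose of the derivative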
The problem of optimization
[Figure: curve f(x) with its global maximum, an inflection point, a local minimum and the global minimum marked]

• General problem of optimization: find the value of x where f(x) is minimum
Finding the minimum of a function

[Figure: curve f(x)]

• Find the value $x$ at which $f'(x) = 0$
– Solve $\frac{df(x)}{dx} = 0$
• The solution is a “turning point”
– Derivatives go from positive to negative or vice versa at this point
• But is it a minimum?
Turning Points
[Figure: curve annotated with the sign of its derivative: + on rising segments, - on falling segments, 0 at each turning point]

• Both maxima and minima have zero derivative
• Both are turning points
Derivatives of a curve

[Figure: f(x) and its derivative f′(x) plotted against x]

• Both maxima and minima are turning points
• Both maxima and minima have zero derivative
Derivative of the derivative of the curve
[Figure: f(x), f′(x) and f″(x) plotted against x]

• The second derivative f″(x) is negative at maxima and positive at minima!
A note on derivatives of functions of single variable

[Figure: curve with a maximum, an inflection point and a minimum marked as critical points; the derivative is 0 at each]

• All locations with zero derivative are critical points
– These can be local maxima, local minima, or inflection points
• The second derivative $\frac{d^2 f}{dx^2}$ is
– positive at minima
– negative at maxima
– zero at inflection points
• It’s a little more complicated for functions of multiple variables..
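
The second-derivative test can be tried on a concrete curve. The sketch below uses f(x) = x³ − 3x, an example function chosen here (it does not appear in the slides), with hand-derived first and second derivatives.

```python
def f(x):
    return x ** 3 - 3 * x

def df(x):      # first derivative of f
    return 3 * x ** 2 - 3

def d2f(x):     # second derivative of f
    return 6 * x

def classify(x, tol=1e-9):
    """Classify x using the first- and second-derivative tests."""
    if abs(df(x)) > tol:
        return "not critical"
    curvature = d2f(x)
    if curvature > tol:
        return "local minimum"
    if curvature < -tol:
        return "local maximum"
    return "second derivative is zero (possible inflection point)"

for x in (-1.0, 0.0, 1.0):
    print(x, f(x), classify(x))
```

Here x = −1 and x = 1 are the only critical points; x = 0 has a zero second derivative but a non-zero first derivative, so it is not a critical point at all.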
What about functions of multiple
variables?

• The optimum point is still a “turning” point
– At a minimum, shifting in any direction will increase the value
– For smooth functions, minuscule shifts will not result in any change at all
• We must find a point where shifting in any direction by a microscopic amount will not change the value of the function
A brief note on derivatives of
multivariate functions
The Gradient of a scalar function

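• The derivative $\nabla_X f(X)$ of a scalar function $f(X)$ of a multi-variate input $X$ is a multiplicative factor that gives us the change in $f(X)$ for tiny variations in $X$:
$\Delta f(X) = \nabla_X f(X) \, \Delta X$
– The gradient is the transpose of the derivative, $\nabla_X f(X)^T$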


Gradients of scalar functions with multi-variate inputs
• Consider $f(X) = f(x_1, x_2, \ldots, x_n)$
• Relation: $\Delta f(X) = \nabla_X f(X) \, \Delta X$

This is a vector inner product. To understand its behavior, let’s consider a well-known property of inner products
A well-known vector property

• The inner product between two vectors of fixed lengths is maximum when the two vectors are aligned
– i.e. when the angle between them is 0
Properties of Gradient

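• $\Delta f(X) = \nabla_X f(X) \, \Delta X$
– The inner product between $\nabla_X f(X)^T$ and $\Delta X$
• Fixing the length of $\Delta X$
– E.g. $|\Delta X| = 1$
• $\Delta f(X)$ is max if $\Delta X$ is aligned with $\nabla_X f(X)^T$
– i.e. the angle between $\nabla_X f(X)^T$ and $\Delta X$ is 0
– The function f(X) increases most rapidly if the input increment $\Delta X$ is perfectly aligned to $\nabla_X f(X)^T$
• The gradient is the direction of fastest increase in f(X)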
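
A quick numerical check of this property: take fixed-length steps in 360 directions around a point and see which direction increases the function most. The function f(x1, x2) = x1² + 3·x2, the point (1, 2) and the step length are arbitrary choices made for this sketch.

```python
import math

def f(x1, x2):
    return x1 ** 2 + 3.0 * x2

def gradient(x1, x2):
    # Hand-derived partial derivatives of f.
    return (2.0 * x1, 3.0)

x1, x2 = 1.0, 2.0
step = 1e-3  # fixed length of the input increment

best_deg, best_increase = None, float("-inf")
for deg in range(360):
    theta = math.radians(deg)
    dx1, dx2 = step * math.cos(theta), step * math.sin(theta)
    increase = f(x1 + dx1, x2 + dx2) - f(x1, x2)
    if increase > best_increase:
        best_deg, best_increase = deg, increase

g = gradient(x1, x2)
grad_deg = math.degrees(math.atan2(g[1], g[0]))
print(best_deg, grad_deg)
```

The best sampled direction should fall within one degree of the gradient direction, the resolution of the sampling.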


Gradient

[Figure: surface f(X) with the gradient vector $\nabla_X f(X)^T$ drawn at a point. Moving in the direction of $\nabla_X f(X)^T$ increases f(X) fastest; moving in the direction of $-\nabla_X f(X)^T$ decreases f(X) fastest]
The Hessian
Finding the minimum of a scalar
function of a multi-variate input

• The optimum point is a turning point – the


gradient will be 0
Unconstrained Minimization of
function (Multivariate)
Unconstrained Minimization of
function (Example)
Closed Form Solutions are not always
available
f(X)

• Often it is not possible to simply solve $\nabla_X f(X) = 0$
– The function to minimize/maximize may have an intractable form
• In these situations, iterative solutions are used
– Begin with a “guess” for the optimal $X$ and refine it iteratively until the correct value is obtained
The Approach of Gradient Descent

• Iterative solution:
– Start at some point
– Find direction in which to shift this point to decrease error
• This can be found from the derivative of the function
– A positive derivative → moving left decreases error
– A negative derivative → moving right decreases error
– Shift point in this direction
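The Approach of Gradient Descent

• Iterative solution: Trivial algorithm
▪ Initialize $x^0$
▪ While $f'(x^k) \ne 0$
  $x^{k+1} = x^k - sign(f'(x^k)) \cdot step$
• Identical to previous algorithm


The Approach of Gradient Descent

• Iterative solution: Trivial algorithm
▪ Initialize $x^0$
▪ While $f'(x^k) \ne 0$
  $x^{k+1} = x^k - \eta^k f'(x^k)$
• $\eta^k$ is the “step size”
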
Gradient descent/ascent
(multivariate)
• The gradient descent/ascent method to find the minimum or maximum of a function $f$ iteratively
– To find a maximum move in the direction of the gradient:
$x^{k+1} = x^k + \eta^k \nabla_x f(x^k)^T$
– To find a minimum move exactly opposite the direction of the gradient:
$x^{k+1} = x^k - \eta^k \nabla_x f(x^k)^T$
• Many solutions to choosing step size $\eta^k$
1. Fixed step size
– Use fixed value for $\eta^k$
What is the optimal step size?

• Step size is critical for fast optimization


• Will revisit this topic later
• For now, simply assume a potentially-iteration-dependent step size
Gradient descent convergence criteria

• The gradient descent algorithm converges when one of the following criteria is satisfied
– $|f(x^{k+1}) - f(x^k)| < \varepsilon_1$
– or $\|\nabla_x f(x^k)\| < \varepsilon_2$

Overall Gradient Descent Algorithm

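• Initialize: $x^0$, $k = 0$
• do
– $x^{k+1} = x^k - \eta^k \nabla_x f(x^k)^T$
– $k = k + 1$
• while $|f(x^k) - f(x^{k-1})| > \varepsilon$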
(C) Dhruv Batra
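
A minimal sketch of the overall gradient descent loop with a fixed step size, stopping when the function value changes by less than a tolerance. The quadratic test function, its hand-derived gradient, the step size `eta`, the tolerance `eps` and the iteration cap are assumptions chosen for this example.

```python
def gradient_descent(f, grad, x0, eta=0.1, eps=1e-10, max_iters=10000):
    """Move against the gradient until f stops changing by more than eps."""
    x = list(x0)
    fx = f(x)
    for k in range(1, max_iters + 1):
        g = grad(x)
        x_new = [xi - eta * gi for xi, gi in zip(x, g)]
        f_new = f(x_new)
        if abs(f_new - fx) < eps:   # convergence criterion
            return x_new, k
        x, fx = x_new, f_new
    return x, max_iters

# Example bowl-shaped function with its minimum at (3, -1).
def f(x):
    return (x[0] - 3.0) ** 2 + 2.0 * (x[1] + 1.0) ** 2

def grad(x):
    return [2.0 * (x[0] - 3.0), 4.0 * (x[1] + 1.0)]

x_min, iterations = gradient_descent(f, grad, x0=[0.0, 0.0])
print(x_min, iterations)
```

With this step size both coordinates shrink their error by a constant factor per iteration; a much larger `eta` would make the iterates diverge on the same function.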
How do we compute gradients?
• Analytic or “Manual” Differentiation
– Applying formulas from calculus

• Numerical Differentiation

• Automatic Differentiation
– Computation Graph

(C) Dhruv Batra


Numerical vs Analytic Gradients

Numerical gradient: slow :(, approximate :(, easy to write :)


Analytic gradient: fast :), exact :), error-prone :(

In practice: Derive analytic gradient, check your


implementation with numerical gradient.
This is called a gradient check.

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
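
A gradient check can be sketched as follows: approximate each partial derivative with a central difference and compare it with the hand-derived gradient. The test function f(x1, x2) = x1·x2 + sin(x1), the evaluation point and the step `h` are choices made for this example.

```python
import math

def numerical_gradient(f, x, h=1e-5):
    """Central-difference approximation of the gradient of f at x."""
    grad = []
    for i in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[i] += h
        x_minus[i] -= h
        grad.append((f(x_plus) - f(x_minus)) / (2.0 * h))
    return grad

def f(x):
    return x[0] * x[1] + math.sin(x[0])

def analytic_gradient(x):
    # Derived by hand: df/dx1 = x2 + cos(x1), df/dx2 = x1.
    return [x[1] + math.cos(x[0]), x[0]]

point = [0.5, -1.5]
numeric = numerical_gradient(f, point)
analytic = analytic_gradient(point)
max_diff = max(abs(n - a) for n, a in zip(numeric, analytic))
print(numeric, analytic, max_diff)
```

A large `max_diff` usually points to a mistake in the analytic derivation or its implementation.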


Computational Graph: Example

[Figure: computational graph with input nodes x1 and x2 and operation nodes sin( ) and *]

(C) Dhruv Batra


Logistic Regression as a Cascade
Given a library of simple functions

Compose into a complicated function

(C) Dhruv Batra Slide Credit: Marc'Aurelio Ranzato, Yann LeCun


Computational Graph

Any DAG of differentiable modules is allowed!

(C) Dhruv Batra Slide Credit: Marc'Aurelio Ranzato


Neural Network Training
• Step 1: Compute Loss on mini-batch [F-Pass]
• Step 2: Compute gradients w.r.t. parameters [B-Pass]
• Step 3: Use gradient to update parameters

*** θ is the same as w
(C) Dhruv Batra Slide Credit: Marc'Aurelio Ranzato, Yann LeCun
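
The three steps can be sketched for the smallest possible "network", a single sigmoid unit trained with cross-entropy loss. The OR data set, the fixed mini-batch split, the learning rate `eta` and the number of epochs are assumptions made for this illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(theta, batch):
    """Step 1 (F-pass): predictions and mean cross-entropy loss on a batch."""
    w1, w2, b = theta
    preds = [sigmoid(w1 * x[0] + w2 * x[1] + b) for x, _ in batch]
    loss = -sum(d * math.log(p) + (1 - d) * math.log(1 - p)
                for p, (_, d) in zip(preds, batch)) / len(batch)
    return preds, loss

def backward(preds, batch):
    """Step 2 (B-pass): gradient of the mean loss w.r.t. (w1, w2, b)."""
    n = len(batch)
    errs = [p - d for p, (_, d) in zip(preds, batch)]
    g1 = sum(e * x[0] for e, (x, _) in zip(errs, batch)) / n
    g2 = sum(e * x[1] for e, (x, _) in zip(errs, batch)) / n
    gb = sum(errs) / n
    return (g1, g2, gb)

data = [((0, 0), 0), ((1, 0), 1), ((0, 1), 1), ((1, 1), 1)]
batches = [data[:2], data[2:]]   # two fixed mini-batches
theta = (0.0, 0.0, 0.0)
eta = 0.5

_, initial_loss = forward(theta, data)
for epoch in range(2000):
    for batch in batches:
        preds, loss = forward(theta, batch)
        grads = backward(preds, batch)
        # Step 3: use the gradient to update the parameters.
        theta = tuple(t - eta * g for t, g in zip(theta, grads))
final_preds, final_loss = forward(theta, data)
print(initial_loss, final_loss, [round(p) for p in final_preds])
```

For this one-unit model the backward pass reduces to the closed form (prediction − target) × input; deeper networks obtain the same quantities by propagating gradients layer by layer.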
Key Computation: Forward-Prop

(C) Dhruv Batra Slide Credit: Marc'Aurelio Ranzato, Yann LeCun


Key Computation: Back-Prop

(C) Dhruv Batra Slide Credit: Marc'Aurelio Ranzato, Yann LeCun


Backpropagation: a simple example

$f(x, y, z) = (x + y) z$
e.g. x = -2, y = 5, z = -4

Intermediate value: $q = x + y$, so that $f = q z$

Want: $\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}, \frac{\partial f}{\partial z}$

Chain rule: $\frac{\partial f}{\partial x} = \frac{\partial f}{\partial q} \cdot \frac{\partial q}{\partial x}$
(upstream gradient × local gradient)

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
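
The example can be traced by hand in a few lines of Python. The function f(x, y, z) = (x + y)·z is the one used in the CS 231n version of this slide and is assumed here, since the extracted text only preserves the input values; the variable names are illustrative.

```python
# Forward pass with the input values from the slide.
x, y, z = -2.0, 5.0, -4.0
q = x + y          # intermediate node
f = q * z          # output node

# Backward pass: each gradient is [upstream gradient] x [local gradient].
df_df = 1.0                # gradient of the output w.r.t. itself
df_dq = df_df * z          # multiply node: local gradient w.r.t. q is z
df_dz = df_df * q          # multiply node: local gradient w.r.t. z is q
df_dx = df_dq * 1.0        # add node: local gradient w.r.t. x is 1
df_dy = df_dq * 1.0        # add node: local gradient w.r.t. y is 1

print(f, df_dx, df_dy, df_dz)
```

Note that the add node simply passes its upstream gradient to both inputs, while the multiply node swaps the input values.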


Another example:

$f(w, x) = \frac{1}{1 + e^{-(w_0 x_0 + w_1 x_1 + w_2)}}$

At every node: [upstream gradient] x [local gradient]

Add gate:
[0.2] x [1] = 0.2
[0.2] x [1] = 0.2 (both inputs!)

Multiply gate:
x0: [0.2] x [2] = 0.4
w0: [0.2] x [-1] = -0.2

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n


Computational graph representation may not be unique. Choose one where local gradients at each node can be easily expressed!

sigmoid function: $\sigma(x) = \frac{1}{1 + e^{-x}}$, with $\frac{d\sigma(x)}{dx} = (1 - \sigma(x)) \, \sigma(x)$

sigmoid gate:
[upstream gradient] x [local gradient]
[1.00] x [(1 - 0.73) (0.73)] = 0.2

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
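
The sigmoid-gate shortcut can be checked numerically. The sketch assumes the input values from the CS 231n version of this example (w0 = 2, x0 = -1, w1 = -3, x1 = -2, w2 = -3); they are not present in the extracted text, although they agree with the local gradients [2] and [-1] and the value 0.73 shown on the slides.

```python
import math

# Assumed inputs (from the CS 231n example, not stated in the text above).
w0, x0, w1, x1, w2 = 2.0, -1.0, -3.0, -2.0, -3.0

# Forward pass.
s = w0 * x0 + w1 * x1 + w2          # input to the sigmoid gate
f = 1.0 / (1.0 + math.exp(-s))      # sigmoid output

# Backward pass, treating the whole sigmoid as a single gate.
ds = 1.0 * (1.0 - f) * f            # [upstream] x [local] at the sigmoid gate
grad_w0 = ds * x0                   # multiply gate: local gradient is x0
grad_x0 = ds * w0                   # multiply gate: local gradient is w0
grad_w1 = ds * x1
grad_x1 = ds * w1
grad_w2 = ds * 1.0                  # add gate passes the gradient through

print(round(f, 2), round(ds, 2), round(grad_x0, 2), round(grad_w0, 2))
```

Collapsing the exp, add and reciprocal nodes into one sigmoid gate gives the same gradients with a single local derivative, (1 − σ)σ.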


How to Specify/Design
Perceptron Parameters
x1 OR x2 with Perceptron (Step Threshold)
Perceptron with hard/step threshold
Output = 0 if input w1*x1 + w2*x2 < threshold T
       = 1 if input >= T

x1 OR x2 Truth Table
x1 x2 Output
0  0  0
1  0  1
0  1  1
1  1  1

From the OR truth table:
w1*0 + w2*0 < T implies T > 0
w1*1 + w2*0 >= T implies w1 >= T
w1*0 + w2*1 >= T implies w2 >= T
w1*1 + w2*1 >= T implies w1 + w2 >= T

Choose T = 1 (you can choose any T > 0)
Then w1 = 1 (you can choose any value >= T)
Similarly choose w2 = 1 (you can choose any value >= T)
w1 + w2 >= T is automatically satisfied for w1 = w2 = 1 and T = 1

Thus, OR can be realized with w1=1, w2=1, T=1. It is important to note that other values of w1, w2, and T can also implement OR, as long as the above 4 inequalities are satisfied.
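
The hand-derived parameters can be verified against the truth table with a few lines of Python. The function name `perceptron` and the second parameter set (w1 = 2, w2 = 3, T = 0.5) are illustrative choices that also satisfy the four inequalities.

```python
def perceptron(x1, x2, w1=1.0, w2=1.0, T=1.0):
    """Hard-threshold perceptron: 1 if the weighted input reaches T, else 0."""
    return 1 if w1 * x1 + w2 * x2 >= T else 0

or_table = {(0, 0): 0, (1, 0): 1, (0, 1): 1, (1, 1): 1}

# Parameters from the derivation: w1 = w2 = T = 1.
outputs = {inputs: perceptron(*inputs) for inputs in or_table}

# Another parameter set that satisfies the same inequalities.
other = {inputs: perceptron(*inputs, w1=2.0, w2=3.0, T=0.5) for inputs in or_table}

print(outputs == or_table, other == or_table)
```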
Specify: Perceptron Parameters for
Classification

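[Figure: perceptron network with four hidden nodes and one output node, with weights and biases of +1, -1 and 0 marked on the connections]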

• Note: Different Choices of weights and bias are possible.


• Left hidden node implements x1=1 line
• Right hidden node implements x1=5 line
• 2nd node from left implements x2=x1+3 and 3rd from left implements x1+x2=9
• For + class, output of left hidden node = +1, for other nodes output = -1
Example: Functional Approximation in 1-D
Given the function values (xi ,yi) design an MLP
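x    y
0    1
2    2
3    4.5
5    12.5

[Figure: MLP with input x, hidden threshold nodes in pairs with thresholds (0, 2), (2, 3), (3, 5) and (5, 6) and input weights 1, connected to the output y with weights (1, -1), (2, -2), (4.5, -4.5) and (12.5, -12.5). The accompanying plot shows y against x as steps of height 1, 2, 4.5 and 12.5]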
Example: Functional Approximation in 1-D
Given the function values (xi ,yi) design an MLP
x    y
0    1
2    2
3    4.5
5    12.5

[Figure: input x feeds two hidden threshold nodes, T=3 and T=5, with input weights 1; they connect to the output y with weights 4.5 and -4.5. The accompanying plot shows y against x]

• First consider one data point, say (3, 4.5), and hidden nodes with hard thresholds T=3 and T=5 (output = 1 if input >= T else 0)
• Output node uses a ReLU activation; since the weighted sum is never negative here, y = sum of all weighted outputs from hidden nodes
• With the choice of weights in the above figure, y = 4.5 for 3<=x<5, and 0 otherwise.
• Note that the choice of T=5 is by design, given the set of discrete data points. Any T with 3 < T <= 5 (i.e. up to the next data point) for the right hidden node would work.
Example: Functional Approximation in 1-D
Given the function values (xi ,yi) design an MLP
x    y
0    1
2    2
3    4.5
5    12.5

[Figure: network with one pair of hidden threshold nodes per data point, and plot of y against x]

• Applying the method shown in the previous slide to all data points, the above network is obtained.
• Output y exactly matches the training data points.
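
The construction can be sketched in Python: each data point contributes one pair of threshold units, one switching its y value on and the other switching it off at the next data point. The helper names and the closing threshold `last_edge=6` (read off the x-axis of the figure) are assumptions made for this sketch.

```python
def step(x, threshold):
    """Hard-threshold hidden node: 1 if the input reaches the threshold."""
    return 1.0 if x >= threshold else 0.0

def build_network(points, last_edge):
    """Return (threshold, output weight) pairs, two hidden nodes per point."""
    edges = [x for x, _ in points] + [last_edge]
    units = []
    for i, (_, y) in enumerate(points):
        units.append((edges[i], y))        # switches +y on at x_i
        units.append((edges[i + 1], -y))   # switches it off at the next point
    return units

def predict(units, x):
    total = sum(weight * step(x, threshold) for threshold, weight in units)
    return max(0.0, total)                 # ReLU output node

points = [(0, 1.0), (2, 2.0), (3, 4.5), (5, 12.5)]
units = build_network(points, last_edge=6)
outputs = [predict(units, x) for x, _ in points]
print(outputs)
```

Between data points the network holds the value of the nearest point to the left, for example `predict(units, 4)` gives 4.5, so the fit is a staircase rather than a smooth curve.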
