
Deep Learning

BITS Pilani
Pilani Campus

Lecture No. 3| Deep Networks

Time: 11 AM – 1 PM
These slides are assembled by the instructor with grateful acknowledgement of the many
others who made their course materials freely available online.
Terminology ….

Image Credit:

So what is Deep (Machine) Learning?

A few different ideas:

(Hierarchical) Compositionality
Cascade of non-linear transformations
Multiple layers of representations

End-to-End Learning
Learning (goal-driven) representations
Learning to extract features

Distributed Representations
No single neuron “encodes” everything
Groups of neurons work together
Traditional Machine Learning

Vision: features (fixed) → your favorite classifier (learned) → “car”
Speech: features (fixed) → your favorite classifier (learned) → \ˈd ē p\
NLP: “This burrito place is yummy and fun!” → Bag-of-words features (fixed) → your favorite classifier (learned) → “+”
Slide Credit: Marc'Aurelio Ranzato, Yann LeCun
Hierarchical Compositionality

pixels → edge → texton → motif → part → object

sample → spectral → formant → motif → phone → word

character → word → NP/VP/.. → clause → sentence → story

Slide Credit: Marc'Aurelio Ranzato, Yann LeCun

Building A Complicated Function
Given a library of simple functions

Compose into a complicated function

Slide Credit: Marc'Aurelio Ranzato, Yann LeCun

Building A Complicated Function
Given a library of simple functions

Idea 1: Linear Combinations
• Boosting
• Kernels
• …

Compose into a complicated function

Slide Credit: Marc'Aurelio Ranzato, Yann LeCun

Building A Complicated Function
Given a library of simple functions

Idea 2: Compositions
• Deep Learning
• Grammar models
• Scattering transforms…

Compose into a complicated function

Slide Credit: Marc'Aurelio Ranzato, Yann LeCun

Deep Learning = Hierarchical Compositionality

Low-Level Feature → Mid-Level Feature → High-Level Feature → Trainable Classifier → “car”

Feature visualization of convolutional net trained on ImageNet from [Zeiler & Fergus 2013]

Slide Credit: Marc'Aurelio Ranzato, Yann LeCun

Sparse DBNs
[Lee et al. ICML ‘09]
Figure courtesy: Quoc Le
Feature Engineering

SIFT, Spin Images, HoG, Textons

and many many more….

Traditional Machine Learning (more accurately)

Vision: SIFT/HOG (fixed) → pooling (unsupervised) → classifier (supervised) → “car”
Speech: MFCC (fixed) → Mixture of Gaussians (unsupervised) → classifier (supervised) → \ˈd ē p\
NLP: “This burrito place is yummy and fun!” → Parse Tree Syntactic (fixed) → n-grams (unsupervised) → classifier (supervised) → “+”
Slide Credit: Marc'Aurelio Ranzato, Yann LeCun
Deep Learning = End-to-End Learning

Vision: SIFT/HOG (fixed) → pooling (unsupervised) → classifier (supervised) → “car”
Speech: MFCC (fixed) → Mixture of Gaussians (unsupervised) → classifier (supervised) → \ˈd ē p\
NLP: “This burrito place is yummy and fun!” → Parse Tree Syntactic (fixed) → n-grams (unsupervised) → classifier (supervised) → “+”
Slide Credit: Marc'Aurelio Ranzato, Yann LeCun
“Shallow” vs Deep Learning

“Shallow” models:
hand-crafted Feature Extractor (fixed) → “Simple” Trainable Classifier (learned)

Deep models:
Trainable Feature-Transform / Classifier → Trainable Feature-Transform / Classifier → Trainable Feature-Transform / Classifier
Learned Internal Representations
Slide Credit: Marc'Aurelio Ranzato, Yann LeCun
Distributed Representations Toy Example

Local vs Distributed

Slide Credit: Moontae Lee

Distributed Representations Toy Example

Can we interpret each dimension?

Slide Credit: Moontae Lee

Power of distributed representations!

United States : Dollar :: Mexico:?

Slide Credit: Moontae Lee

Benefits of Deep/Representation Learning

(Usually) Better Performance

“Because gradient descent is better than you”
Yann LeCun

New domains without “experts”

Multi-spectral data
Gene-expression data
Unclear how to hand-engineer
Benefits of Deep/Representation Learning

Plug and play architectures!
Problems with Deep Learning

Problem#1: Non-Convex! Non-Convex! Non-Convex!

Depth>=3: most losses non-convex in parameters
Theoretically, all bets are off
Leads to stochasticity
different initializations → different local minima

Standard response #1
“Yes, but all interesting learning problems are non-convex”
For example, human learning
Order matters → wave hands → non-convexity

Standard response #2
“Yes, but it often works!”
Problems with Deep Learning

Problem#2: Lack of interpretability

Hard to track down what’s failing
Pipeline systems have “oracle” performances at each step
In end-to-end systems, it’s hard to know why things are not working

Standard response #1
Tricks of the trade: visualize features, add losses at different layers, pre-train to avoid
degenerate initializations…
“We’re working on it”

Standard response #2
“Yes, but it often works!”
Problems with Deep Learning

Problem#3: Lack of easy reproducibility

Direct consequence of stochasticity & non-convexity

Standard response #1
It’s getting much better
Standard toolkits/libraries/frameworks now available
PyTorch, TensorFlow, MxNet…

Standard response #2
“Yes, but it often works!”
Story So Far
• Characteristics of Deep Learning
• Hierarchical representation
• End-to-end learning
• Distributed representation
• Challenges of Deep Learning
• Design of perceptron networks
• pen and paper based
• MLP as universal
• Boolean Function
• Classifiers
• Function approximator
• Need for Depth
• MLP Parameter Estimation via Empirical Risk

• Function Optimization Refresher

• Computational Graph

• Deep MLP Training

• Activation Functions
• Input Output Specification
• ……
Option 1: Construct by hand
[Figure: hand-constructed perceptron networks over the inputs X1 and X2]

Assuming simple perceptrons: output = 1 if $\sum_i w_i x_i \ge T$, else 0

• Not possible for all but the simplest problems.
Option 2: Automatic estimation
of an MLP

• Given the function to model, we can derive the parameters of the network to model it, through Empirical Risk Minimization

• Given a training set of input-output pairs $(X_1, d_1), (X_2, d_2), \dots, (X_N, d_N)$

– Error on the ith instance: $Div(f(X_i; W), d_i)$
– Empirical average error (Empirical Risk) on all training data:

$Loss(W) = \frac{1}{N} \sum_i Div(f(X_i; W), d_i)$

• Estimate the parameters to minimize the empirical estimate of expected error: $\hat{W} = \arg\min_W Loss(W)$

– I.e. minimize the empirical risk over the drawn samples
Perceptron Learning Algorithm

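• Given $N$ training instances $(X_1, Y_1), (X_2, Y_2), \dots, (X_N, Y_N)$

– $Y_i = +1$ or $-1$ (using a +1/-1 representation for classes to simplify notation)
• Initialize $W$
• Cycle through the training instances:
• do
– For $i = 1, \dots, N$
• $O(X_i) = \mathrm{sign}(W^T X_i)$
• If $O(X_i) \neq Y_i$: $W = W + Y_i X_i$
• until no more classification errors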
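
The update rule above can be sketched in a few lines of Python. This is a minimal illustration, not the instructor's code: the function names, the epoch cap `max_epochs`, the choice to fold the bias into the weight vector through a constant input of 1, and the OR-style toy data are all assumptions made for the example.

```python
def sign(v):
    # +1/-1 output; a sum of exactly 0 is treated as +1 (an arbitrary convention)
    return 1 if v >= 0 else -1


def train_perceptron(X, Y, max_epochs=100):
    """Cycle through the data, adding Y_i * X_i to W on every misclassification."""
    W = [0.0] * (len(X[0]) + 1)          # last entry acts as the bias weight
    for _ in range(max_epochs):          # cap is a safeguard for non-separable data
        errors = 0
        for x, y in zip(X, Y):
            xa = list(x) + [1.0]         # append constant input for the bias
            out = sign(sum(w * v for w, v in zip(W, xa)))
            if out != y:
                W = [w + y * v for w, v in zip(W, xa)]
                errors += 1
        if errors == 0:                  # "until no more classification errors"
            break
    return W


def predict(W, x):
    return sign(sum(w * v for w, v in zip(W, list(x) + [1.0])))


# Toy, linearly separable data: OR with classes encoded as +1/-1
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
Y = [-1, 1, 1, 1]
W = train_perceptron(X, Y)
predictions = [predict(W, x) for x in X]
```

The loop only terminates on its own when the classes are linearly separable, which is why the sketch adds an epoch cap.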
Perceptrons with differentiable activation functions

[Figure: perceptron with inputs $x_1, \dots, x_N$ and weights $w_1, \dots, w_{N+1}$]

$z = \sum_i w_i x_i$, $y = \sigma(z)$

• $\sigma(z)$ is a differentiable function of $z$
– $\frac{d\sigma(z)}{dz}$ is well-defined and finite for all $z$
• Using the chain rule, $y$ is a differentiable function of both inputs $x_i$ and weights $w_i$
• This means that we can compute the change in the output for small changes in either the input or the weights
Overall network is differentiable

$y$ = output of overall network
$w^k_{i,j}$ = weight connecting the ith unit of the kth layer to the jth unit of the k+1-th layer
$y^k_i$ = output of the ith unit of the kth layer

$y$ is differentiable w.r.t. both $w^k_{i,j}$ and $y^k_i$

• Every individual perceptron is differentiable w.r.t. its inputs and its weights (including “bias” weight)
• By the chain rule, the overall function is differentiable w.r.t. every parameter (weight or bias)
– Small changes in the parameters result in measurable changes in the output
Overall function is differentiable

• The overall function is differentiable w.r.t every parameter

– Small changes in the parameters result in measurable changes
in the output
– We will derive the actual derivatives using the chain rule later
Overall setting for “Learning” the MLP

• Given a training set of input-output pairs $(X_1, d_1), (X_2, d_2), \dots, (X_N, d_N)$

– $d$ is the desired output of the network in response to $X$
– $X$ and $d$ may both be vectors
• …we must find the network parameters such that the network produces the desired output for each training input
– Or a close approximation of it
– The architecture of the network must be specified by us
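ERM for neural networks
Actual output of network: $Y_i = net(X_i; W_1, W_2, \dots, W_K)$

Desired output of network: $d_i$

Error on i-th training input: $Div(Y_i, d_i; W_1, W_2, \dots, W_K)$

Average training error (loss): $Loss(W_1, W_2, \dots, W_K) = \frac{1}{N} \sum_{i=1}^{N} Div(Y_i, d_i; W_1, W_2, \dots, W_K)$

– What is the exact form of Div()? More on this later

• Optimize network parameters to minimize the total error over all training inputs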
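Problem Statement
• Given a training set of input-output pairs $(X_1, d_1), (X_2, d_2), \dots, (X_N, d_N)$

• Minimize the following function w.r.t. $W$:

$Loss(W) = \frac{1}{N} \sum_i Div(f(X_i; W), d_i)$

• This is a problem of function minimization

– An instance of optimization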
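
As a concrete reading of the empirical risk, the sketch below averages a divergence over a tiny training set. Everything specific here is an assumption for illustration: the model is a hypothetical one-input linear function, the divergence is squared error (the slides leave the form of Div() open), and the three data points are made up.

```python
def empirical_risk(f, W, data, div):
    """Average divergence between the model output f(x, W) and the target d."""
    return sum(div(f(x, W), d) for x, d in data) / len(data)


def linear(x, W):
    # Hypothetical model: y = W[0] * x + W[1]
    return W[0] * x + W[1]


def squared_error(y, d):
    # One possible choice of Div(); other divergences fit the same interface
    return (y - d) ** 2


# Made-up training pairs (x, d) that lie exactly on d = 2x + 1
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]

risk_good = empirical_risk(linear, (2.0, 1.0), data, squared_error)
risk_bad = empirical_risk(linear, (1.0, 0.0), data, squared_error)
```

Learning then amounts to searching for the `W` that makes this number as small as possible.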
A brief note on derivatives..


• A derivative of a function at any point tells us how much a minute increment to the argument of the function will increment the value of the function
▪ For any $y = f(x)$, expressed as a multiplier $\alpha$ to a tiny increment $\Delta x$ to obtain the increments $\Delta y$ to the output: $\Delta y = \alpha \Delta x$

▪ Based on the fact that at a fine enough resolution, any smooth, continuous function is locally linear at any point
Scalar function of scalar argument

• When $x$ and $y$ are scalar, $y = f(x)$

▪ Derivative: $\Delta y = \alpha \Delta x$

▪ Often represented (using somewhat inaccurate notation) as $\frac{dy}{dx}$

▪ Or alternately (and more reasonably) as $f'(x)$
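Multivariate scalar function:
Scalar function of vector argument
$y = f(\mathbf{x})$, $\Delta y = \alpha \Delta\mathbf{x}$
Note: $\Delta\mathbf{x}$ is now a vector, $\Delta\mathbf{x} = [\Delta x_1, \dots, \Delta x_D]^T$

• Giving us that $\alpha$ is a row vector: $\alpha = [\alpha_1, \dots, \alpha_D]$

$\Delta y = \alpha_1 \Delta x_1 + \alpha_2 \Delta x_2 + \dots + \alpha_D \Delta x_D$
• The partial derivative $\alpha_i$ gives us how $y$ increments when only $x_i$ is incremented
• Often represented as $\frac{\partial y}{\partial x_i}$:

$\Delta y = \frac{\partial y}{\partial x_1} \Delta x_1 + \frac{\partial y}{\partial x_2} \Delta x_2 + \dots + \frac{\partial y}{\partial x_D} \Delta x_D$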
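Multivariate scalar function:
Scalar function of vector argument
$\Delta y = \nabla_{\mathbf{x}} y \, \Delta\mathbf{x}$
Note: $\Delta\mathbf{x}$ is now a vector

• Where $\nabla_{\mathbf{x}} y = \left[\frac{\partial y}{\partial x_1}, \dots, \frac{\partial y}{\partial x_D}\right]$
(We will be using this symbol for vector and matrix derivatives)

o You may be more familiar with the term “gradient”, which is actually defined as the transpose of the derivative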
The problem of optimization

[Figure: curve f(x) with a global maximum, an inflection point, a local minimum and the global minimum marked]

• General problem of optimization: find the value of x where f(x) is minimum
Finding the minimum of a function


• Find the value $x$ at which $f'(x) = 0$

– Solve $\frac{df(x)}{dx} = 0$

• The solution is a “turning point”

– Derivatives go from positive to negative or vice versa at this point
• But is it a minimum?
Turning Points

[Figure: curve annotated with the sign of its derivative on either side of each turning point, and 0 at the turning points]

• Both maxima and minima have zero derivative

• Both are turning points
Derivatives of a curve

[Figure: f(x) and its derivative f’(x) plotted against x]

• Both maxima and minima are turning points

• Both maxima and minima have zero derivative
Derivative of the derivative of the curve

[Figure: f(x), f’(x) and f’’(x) plotted against x]

• Both maxima and minima are turning points

• Both maxima and minima have zero derivative

• The second derivative f’’(x) is –ve at maxima and +ve at minima!
A note on derivatives of functions of single variable

[Figure: f(x) with its critical points marked (maximum, minimum, inflection point), where the derivative is 0]

• All locations with zero derivative are critical points
– These can be local maxima, local minima, or inflection points

• The second derivative is

– $\ge 0$ at minima
– $\le 0$ at maxima
– Zero at inflection points

• It’s a little more complicated for functions of multiple variables..
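
The second-derivative test can be tried on a small example. The sketch below is an illustration under assumptions: the test functions f(x) = x^3 - 3x and g(x) = x^3 are chosen here and do not come from the slides, their derivatives are written out by hand, and a zero second derivative is reported as inconclusive rather than as a guaranteed inflection point.

```python
def classify_critical_point(second_derivative, x, tol=1e-9):
    """Classify a point where f'(x) = 0 using the sign of f''(x)."""
    s = second_derivative(x)
    if s > tol:
        return "local minimum"
    if s < -tol:
        return "local maximum"
    return "inconclusive (possible inflection point)"


# f(x) = x^3 - 3x: f'(x) = 3x^2 - 3 vanishes at x = -1 and x = +1, f''(x) = 6x
f_second = lambda x: 6.0 * x
at_minus_one = classify_critical_point(f_second, -1.0)
at_plus_one = classify_critical_point(f_second, 1.0)

# g(x) = x^3: g'(x) = 3x^2 vanishes at x = 0, g''(x) = 6x is also 0 there
g_second = lambda x: 6.0 * x
at_zero = classify_critical_point(g_second, 0.0)
```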
What about functions of multiple variables?

• The optimum point is still a “turning” point

– Shifting in any direction will increase the value
– For smooth functions, minuscule shifts will not result in any change at all
• We must find a point where shifting in any direction by a microscopic amount will not change the value of the function
A brief note on derivatives of
multivariate functions
The Gradient of a scalar function

• The derivative $\nabla_X f(X)$ of a scalar function $f(X)$ of a multi-variate input $X$ is a multiplicative factor that gives us the change in $f(X)$ for tiny variations in $X$:

$df(X) = \nabla_X f(X) \, dX$

– The gradient is the transpose of the derivative, $\nabla_X f(X)^T$

Gradients of scalar functions with multi-variate inputs
• Consider $f(X) = f(x_1, x_2, \dots, x_n)$

• Relation: $df(X) = \nabla_X f(X) \, dX$

This is a vector inner product. To understand its behavior let's consider a well-known property of inner products
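A well-known vector property

• The inner product between two vectors of fixed lengths is maximum when the two vectors are aligned
– i.e. when the angle $\theta$ between them is 0, since $\mathbf{u}^T \mathbf{v} = |\mathbf{u}| |\mathbf{v}| \cos\theta$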
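Properties of Gradient

• $df(X) = \nabla_X f(X) \, dX$
– The inner product between $\nabla_X f(X)^T$ and $dX$

• Fixing the length of $dX$

– E.g. $|dX| = 1$
• $df(X)$ is max if $dX$ is aligned with $\nabla_X f(X)^T$
– $\angle(\nabla_X f(X)^T, dX) = 0$

– The function f(X) increases most rapidly if the input increment $dX$ is perfectly aligned to $\nabla_X f(X)^T$

• The gradient is the direction of fastest increase in f(X)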


[Figure: gradient vector $\nabla_X f(X)^T$ drawn at a point X. Moving in this direction increases f(X) fastest; moving in the opposite direction decreases f(X) fastest.]
The Hessian
Finding the minimum of a scalar
function of a multi-variate input

• The optimum point is a turning point – the

gradient will be 0
Unconstrained Minimization of
function (Multivariate)
Unconstrained Minimization of
function (Example)
Closed Form Solutions are not always available

• Often it is not possible to simply solve $\nabla_X f(X) = 0$

– The function to minimize/maximize may have an intractable form
• In these situations, iterative solutions are used
– Begin with a “guess” for the optimal $X$ and refine it iteratively until the correct value is obtained
The Approach of Gradient Descent

• Iterative solution:
– Start at some point
– Find direction in which to shift this point to decrease error
• This can be found from the derivative of the function
– A positive derivative → moving left decreases error
– A negative derivative → moving right decreases error
– Shift point in this direction
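The Approach of Gradient Descent

• Iterative solution: Trivial algorithm

▪ Initialize $x^0$
▪ While $f'(x^k) \neq 0$: $x^{k+1} = x^k - \mathrm{sign}(f'(x^k)) \cdot \text{step}$

• Identical to previous algorithm

The Approach of Gradient Descent

• Iterative solution: Trivial algorithm

▪ Initialize $x^0$
▪ While $f'(x^k) \neq 0$: $x^{k+1} = x^k - \eta^k f'(x^k)$

• $\eta^k$ is the “step size”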
Gradient descent/ascent
• The gradient descent/ascent method finds the minimum or maximum of a function iteratively
– To find a maximum move in the direction of the gradient: $x^{k+1} = x^k + \eta^k \nabla_x f(x^k)^T$

– To find a minimum move exactly opposite the direction of the gradient: $x^{k+1} = x^k - \eta^k \nabla_x f(x^k)^T$

• Many solutions to choosing step size $\eta^k$

1. Fixed step size
– Use fixed value for $\eta^k$
What is the optimal step size?

• Step size is critical for fast optimization

• Will revisit this topic later
• For now, simply assume a potentially-
iteration-dependent step size
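Gradient descent convergence criteria

• The gradient descent algorithm converges when one of the following criteria is satisfied:

$|f(x^{k+1}) - f(x^k)| < \varepsilon_1$, or $\|\nabla_x f(x^k)\| < \varepsilon_2$

Overall Gradient Descent Algorithm

• Initialize: $x^0$, $k = 0$

• do
– $x^{k+1} = x^k - \eta^k \nabla_x f(x^k)^T$
– $k = k + 1$

• while $|f(x^k) - f(x^{k-1})| > \varepsilon$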
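
The loop above can be sketched directly in Python. This is a minimal illustration under stated assumptions: a fixed step size `eta`, a stopping rule based on the change in the function value, an iteration cap, and a made-up convex quadratic f(x1, x2) = (x1 - 3)^2 + 2(x2 + 1)^2 whose gradient is written by hand. None of these specifics come from the slides.

```python
def gradient_descent(f, grad, x0, eta=0.1, eps=1e-10, max_iter=10000):
    """Repeat x <- x - eta * grad(x) until f stops changing by more than eps."""
    x = list(x0)
    fx = f(x)
    for k in range(max_iter):
        g = grad(x)
        x_new = [xi - eta * gi for xi, gi in zip(x, g)]   # step against the gradient
        f_new = f(x_new)
        if abs(f_new - fx) < eps:                          # convergence criterion
            return x_new, k + 1
        x, fx = x_new, f_new
    return x, max_iter


def f(x):
    # Hypothetical objective with its minimum at (3, -1)
    return (x[0] - 3.0) ** 2 + 2.0 * (x[1] + 1.0) ** 2


def grad_f(x):
    # Hand-derived gradient of f
    return [2.0 * (x[0] - 3.0), 4.0 * (x[1] + 1.0)]


x_min, iterations = gradient_descent(f, grad_f, [0.0, 0.0])
```

With a step size that is too large for the curvature of `f`, the same loop would oscillate or diverge, which is why step-size selection gets its own discussion.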
(C) Dhruv Batra
How do we compute gradients?
• Analytic or “Manual” Differentiation
– Applying formulas from calculus

• Numerical Differentiation

• Automatic Differentiation
– Computation Graph

(C) Dhruv Batra

Numerical vs Analytic Gradients

Numerical gradient: slow :(, approximate :(, easy to write :)

Analytic gradient: fast :), exact :), error-prone :(

In practice: Derive analytic gradient, check your

implementation with numerical gradient.
This is called a gradient check.
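
A gradient check can be sketched as follows. The function f(x1, x2) = x1*x2 + sin(x1), the evaluation point and the step `h` are assumptions picked for illustration; the numerical gradient uses central differences, which is one common choice rather than the only one.

```python
import math


def f(x):
    # Hypothetical test function
    return x[0] * x[1] + math.sin(x[0])


def analytic_grad(x):
    # Derived by hand: df/dx1 = x2 + cos(x1), df/dx2 = x1
    return [x[1] + math.cos(x[0]), x[0]]


def numerical_grad(func, x, h=1e-5):
    """Central-difference estimate of each partial derivative."""
    grad = []
    for i in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[i] += h
        x_minus[i] -= h
        grad.append((func(x_plus) - func(x_minus)) / (2.0 * h))
    return grad


x = [1.0, 2.0]
max_diff = max(abs(a - n) for a, n in zip(analytic_grad(x), numerical_grad(f, x)))
```

A small `max_diff` suggests the hand-derived gradient is implemented correctly at this point; a large one usually points to a bug in the analytic version.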

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Computational Graph: Example

sin( ) *

x1 x2

(C) Dhruv Batra

Logistic Regression as a Cascade
Given a library of simple functions

Compose into a complicated function

(C) Dhruv Batra Slide Credit: Marc'Aurelio Ranzato, Yann LeCun

Computational Graph

Any DAG of differentiable modules is allowed!


(C) Dhruv Batra Slide Credit: Marc'Aurelio Ranzato

Neural Network Training
• Step 1: Compute Loss on mini-batch [F-Pass]
• Step 2: Compute gradients w.r.t. parameters [B-Pass]
• Step 3: Use gradient to update parameters

*** θ is the same as w
(C) Dhruv Batra Slide Credit: Marc'Aurelio Ranzato, Yann LeCun
Key Computation: Forward-Prop

(C) Dhruv Batra Slide Credit: Marc'Aurelio Ranzato, Yann LeCun

Key Computation: Back-Prop

(C) Dhruv Batra Slide Credit: Marc'Aurelio Ranzato, Yann LeCun

Backpropagation: a simple example

e.g. x = -2, y = 5, z = -4

Chain rule: [upstream gradient] x [local gradient]


Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
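
The function for this example did not survive extraction. The sketch below assumes it is f(x, y, z) = (x + y) * z, the function commonly used with these input values in the credited CS 231n material, and walks through the forward and backward passes by hand. Variable names such as `q` and `df_dq` are chosen here for illustration.

```python
# Inputs from the slide
x, y, z = -2.0, 5.0, -4.0

# Forward pass (assumed function: f = (x + y) * z)
q = x + y          # intermediate node
f = q * z          # output node

# Backward pass: each gradient = upstream gradient * local gradient
df_df = 1.0                 # gradient of the output w.r.t. itself
df_dq = df_df * z           # local gradient of q * z w.r.t. q is z
df_dz = df_df * q           # local gradient of q * z w.r.t. z is q
df_dx = df_dq * 1.0         # local gradient of x + y w.r.t. x is 1
df_dy = df_dq * 1.0         # local gradient of x + y w.r.t. y is 1
```

Under that assumption the forward pass gives q = 3 and f = -12, and the backward pass gives df/dz = 3 and df/dx = df/dy = -4.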

Another example:

[upstream gradient] x [local gradient]

[0.2] x [1] = 0.2
[0.2] x [1] = 0.2 (both inputs!)


Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Another example:

[upstream gradient] x [local gradient]

x0: [0.2] x [2] = 0.4
w0: [0.2] x [-1] = -0.2


Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Computational graph representation may not
be unique. Choose one where local gradients
at each node can be easily expressed!

sigmoid function

sigmoid gate

[upstream gradient] x [local gradient]

[1.00] x [(1 - 0.73) (0.73)]= 0.2


Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
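
The numbers on these slides are consistent with a sigmoid applied to a weighted sum. The sketch below assumes the function is f(w, x) = sigmoid(w0*x0 + w1*x1 + w2). The values w0 = 2 and x0 = -1 follow from the local gradients shown for x0 and w0; w1 = -3, x1 = -2 and w2 = -3 are assumptions chosen so that the sigmoid output is the 0.73 shown on the slide.

```python
import math

# w0 and x0 are implied by the slide; w1, x1, w2 are assumed values
w0, x0 = 2.0, -1.0
w1, x1 = -3.0, -2.0
w2 = -3.0

# Forward pass
s = w0 * x0 + w1 * x1 + w2            # input to the sigmoid gate
out = 1.0 / (1.0 + math.exp(-s))      # sigmoid gate output

# Backward pass through the sigmoid gate: d(sigmoid)/ds = (1 - out) * out
ds = 1.0 * (1.0 - out) * out          # [upstream] x [local]

# Add gates pass the gradient through unchanged; multiply gates swap inputs
dw0 = ds * x0
dx0 = ds * w0
dw1 = ds * x1
dx1 = ds * w1
dw2 = ds
```

Treating the whole sigmoid as one gate, instead of chaining exp, +1 and 1/x nodes, gives the same gradient with a single local derivative.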

How to Specify/Design
Perceptron Parameters
x1 OR x2 with Perceptron (Step Threshold)
Perceptron with hard/step threshold
Output = 0 if input w1*x1 + w2*x2 < threshold T
= 1 if input >= T

x1 OR x2 Truth Table
x1 x2 Output
0 0 0
1 0 1
0 1 1
1 1 1

From the OR truth table,
w1*0 + w2*0 < T implies T > 0
w1*1 + w2*0 >= T implies w1 >= T
w1*0 + w2*1 >= T implies w2 >= T
w1*1 + w2*1 >= T implies w1 + w2 >= T

Choose T = 1 (you can choose any T > 0)
Then w1 = 1 (you can choose any value >= T)
Similarly choose w2 = 1 (you can choose any value >= T)
w1 + w2 >= T is automatically satisfied for w1 = w2 = 1 and T = 1

Thus, OR can be realized with w1 = 1, w2 = 1, T = 1. It is important to note that other values of w1, w2, and T can also implement OR, as long as the above 4 inequalities are satisfied
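
The choice w1 = 1, w2 = 1, T = 1 can be checked against the OR truth table with a few lines of Python. The function name and the second parameter set (w1 = w2 = 5, T = 2) are illustrative assumptions; the second set is included only to show that other values satisfying the inequalities also work.

```python
def step_perceptron(x1, x2, w1=1.0, w2=1.0, T=1.0):
    """Hard-threshold perceptron: output 1 if w1*x1 + w2*x2 >= T, else 0."""
    return 1 if w1 * x1 + w2 * x2 >= T else 0


# Evaluate the perceptron on all four input combinations
or_table = {(a, b): step_perceptron(a, b) for a in (0, 1) for b in (0, 1)}

# A different parameter choice that also satisfies T > 0, w1 >= T, w2 >= T
alt_table = {(a, b): step_perceptron(a, b, w1=5.0, w2=5.0, T=2.0)
             for a in (0, 1) for b in (0, 1)}
```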
Specify: Perceptron Parameters for

[Figure: MLP with inputs x1 and x2, four hidden threshold nodes and one output node]

• Note: Different Choices of weights and bias are possible.

• Left hidden node implements x1=1 line
• Right hidden node implements x1=5 line
• 2nd node from left implements x2=x1+3 and 3rd from left implements x1+x2=9
• For + class, output of left hidden node = +1, for other nodes output = -1
Example: Functional Approximation in 1-D
Given the function values (xi, yi), design an MLP

x y
0 1
2 2
3 4.5
5 12.5

[Figure: two hidden nodes with thresholds T=3 and T=5, each with input weight 1, connected to the output node with weights 4.5 and -4.5; plot of y against x with ticks at x = 0, 2, 3, 5, 6]

• First consider one data point, say (3, 4.5), and hidden nodes with hard thresholds T=3 and T=5 (output = 1 if input >= T else 0)
• Output node uses a ReLU activation, i.e., y = max(0, sum of all weighted outputs from hidden nodes)
• With the choice of weights in the above figure, y = 4.5 for 3 <= x < 5, and 0 otherwise.
• Note that the choice of T=5 is by design, given the set of discrete data points. Any T > 3 up to the next data point (x = 5) would work for the right hidden node.
Example: Functional Approximation in 1-D
Given the function values (xi, yi), design an MLP

x y
0 1
2 2
3 4.5
5 12.5

[Figure: eight hidden threshold nodes with thresholds T = 0, 2, 2, 3, 3, 5, 5, 6, each with input weight 1, connected to one output node; plot of y against x with ticks at x = 0, 2, 3, 5, 6]

• Applying the method shown in the previous slide to all data points, the above network is obtained.
• Output y matches the training data exactly
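
The construction can be sketched in Python: each data point (x_k, y_k) gets a pair of hard-threshold hidden nodes, one switching on at x_k with output weight +y_k and one switching on at the next data point with output weight -y_k. This is an illustration of the idea, not the instructor's code. The function names are made up, and the width of the last interval (`last_width = 1`, giving the final threshold 6) is an assumption read off the thresholds listed on the slide.

```python
def step(x, T):
    # Hard-threshold hidden node with input weight 1
    return 1.0 if x >= T else 0.0


def relu(v):
    return max(0.0, v)


def build_network(points, last_width=1.0):
    """Return (threshold, output_weight) pairs, two hidden nodes per data point."""
    pts = sorted(points)
    units = []
    for k, (xk, yk) in enumerate(pts):
        x_next = pts[k + 1][0] if k + 1 < len(pts) else xk + last_width
        units.append((xk, yk))          # switches the value y_k on at x_k
        units.append((x_next, -yk))     # switches it off at the next data point
    return units


def mlp(x, units):
    # Output node: ReLU of the weighted sum of hidden-node outputs
    return relu(sum(w * step(x, T) for T, w in units))


points = [(0, 1), (2, 2), (3, 4.5), (5, 12.5)]
units = build_network(points)
outputs = [mlp(x, units) for x, _ in points]
```

The resulting function is piecewise constant: it reproduces the four training values exactly and holds each value until the next data point, so it says nothing reliable about inputs between or beyond the training points.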
