DL Slides 2

Deep Learning
Vimal
BITS Pilani
Pilani Campus
Lecture No. 3| Deep Networks

Time: 11 AM – 1 PM
Date:14/05/2022
These slides are assembled by the instructor with grateful acknowledgement of the many
others who made their course materials freely available online.
Terminology ….
Image Credit: https://www.sumologic.com/blog/machine-learning-deep-learning/

So what is Deep (Machine) Learning?
A few different ideas:
(Hierarchical) Compositionality
Cascade of non-linear transformations
Multiple layers of representations
End-to-End Learning
Learning (goal-driven) representations
Learning feature extraction
Distributed Representations
No single neuron “encodes” everything
Groups of neurons work together
Traditional Machine Learning
VISION: hand-crafted features (SIFT/HOG) [fixed] → your favorite classifier [learned] → “car”
SPEECH: hand-crafted features (MFCC) [fixed] → your favorite classifier [learned] → \ˈd ē p\
NLP: “This burrito place is yummy and fun!” → hand-crafted features (Bag-of-words) [fixed] → your favorite classifier [learned] → “+”
Slide Credit: Marc'Aurelio Ranzato, Yann LeCun
Hierarchical Compositionality
VISION: pixels → edge → texton → motif → part → object
SPEECH: sample → spectral band → formant → motif → phone → word
NLP: character → word → NP/VP/.. → clause → sentence → story

Building A Complicated Function
Given a library of simple functions, compose them into a complicated function.

Idea 1: Linear Combinations
• Boosting
• Kernels
• …

Idea 2: Compositions
• Deep Learning
• Grammar models
• Scattering transforms…
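The two ideas can be contrasted in a few lines of Python. This is only an illustrative sketch: the three functions in `library` and the names `linear_combination` and `composition` are assumptions chosen for the example, not part of the original slides.

```python
import math

# Hypothetical library of simple functions (chosen only for illustration).
library = [math.sin, math.tanh, lambda x: x * x]

def linear_combination(x, weights):
    """Idea 1: weighted sum of the simple functions, each applied to the same x."""
    return sum(w * g(x) for w, g in zip(weights, library))

def composition(x):
    """Idea 2: feed the output of each simple function into the next one."""
    for g in library:
        x = g(x)
    return x

y_linear = linear_combination(0.5, [1.0, 1.0, 1.0])  # sin(x) + tanh(x) + x^2
y_composed = composition(0.5)                        # (tanh(sin(x)))^2
```

In the first case the simple functions sit side by side; in the second they are stacked, which is the structure deep networks use.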

Deep Learning = Hierarchical Compositionality
Low-Level Feature → Mid-Level Feature → High-Level Feature → Trainable Classifier → “car”
Feature visualization of convolutional net trained on ImageNet from [Zeiler & Fergus 2013]

Sparse DBNs
[Lee et al. ICML ‘09]
Figure courtesy: Quoc Le
So what is Deep (Machine) Learning?
End-to-End Learning
Feature Engineering
SIFT, Spin Images, HoG, Textons, and many many more….

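Traditional Machine Learning (more accurately)
VISION: SIFT/HOG [fixed] → K-Means/pooling [unsupervised] → classifier [supervised] → “car”
SPEECH: MFCC [fixed] → Mixture of Gaussians [unsupervised] → classifier [supervised] → \ˈd ē p\
NLP: “This burrito place is yummy and fun!” → Parse Tree Syntactic [fixed] → n-grams [unsupervised] → classifier [supervised] → “+”
The unsupervised and supervised stages are the “Learned” part.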
Deep Learning = End-to-End Learning
VISION: SIFT/HOG → K-Means/pooling → classifier → “car”
SPEECH: MFCC → Mixture of Gaussians → classifier
NLP: n-grams → classifier
Every stage of the pipeline is now “Learned”.
“Shallow” vs Deep Learning
“Shallow” models: hand-crafted Feature Extractor [fixed] → “Simple” Trainable Classifier [learned]
Deep models: Trainable Feature-Transform / Classifier → Trainable Feature-Transform / Classifier → Trainable Feature-Transform / Classifier
Learned Internal Representations

End-to-End Learning
Distributed Representations Toy Example
Local vs Distributed
Slide Credit: Moontae Lee

Can we interpret each dimension?

Power of distributed representations!
Local
Distributed

United States : Dollar :: Mexico:?

Benefits of Deep/Representation Learning
(Usually) Better Performance

“Because gradient descent is better than you”
Yann LeCun
New domains without “experts”

RGBD/Lidar
Multi-spectral data
Gene-expression data
Unclear how to hand-engineer
Modularity!
Plug and play architectures!
Problems with Deep Learning
Problem#1: Non-Convex! Non-Convex! Non-Convex!

Depth>=3: most losses non-convex in parameters
Theoretically, all bets are off
Leads to stochasticity
different initializations → different local minima
Standard response #1
“Yes, but all interesting learning problems are non-convex”
For example, human learning
Order matters → wave hands → non-convexity
“Yes, but it often works!”
Problem#2: Lack of interpretability

Hard to track down what’s failing
Pipeline systems have “oracle” performances at each step
In end-to-end systems, it’s hard to know why things are not working
Tricks of the trade: visualize features, add losses at different layers, pre-train to avoid
degenerate initializations…
“We’re working on it”
Problem#3: Lack of easy reproducibility

Direct consequence of stochasticity & non-convexity
It’s getting much better
Standard toolkits/libraries/frameworks now available
PyTorch, TensorFlow, MXNet…
Story So Far
• Characteristics of Deep Learning
• Hierarchical representation
• End-to-end learning
• Distributed representation
• Challenges of Deep Learning
• Design of perceptron networks
• pen and paper based
• MLP as universal
• Boolean Function
• Classifiers
• Function approximator
• Need for Depth
Agenda
• MLP Parameter Estimation via Empirical Risk Minimization
• Function Optimization Refresher
• Computational Graph
• Deep MLP Training

• Activation Functions
• Input Output Specification
• ……
Option 1: Construct by hand
[Figure: hand-built perceptron network over inputs X1 and X2 for a region with corners (0,1), (1,0), (0,-1), (-1,0)]
Assuming simple perceptrons: output = 1 if the weighted sum of the inputs reaches the unit's threshold
• Not possible for all but the simplest problems.
Option 2: Automatic estimation of an MLP
• Given the function to model, we can derive the parameters of the network to model it, through computation
Empirical Risk Minimization
• Given a training set of input-output pairs $(X_1, d_1), (X_2, d_2), \ldots, (X_N, d_N)$
– Error on the i-th instance: $Div(f(X_i; W), d_i)$
– Empirical average error (Empirical Risk) on all training data: $Loss(W) = \frac{1}{N} \sum_i Div(f(X_i; W), d_i)$
• Estimate the parameters to minimize the empirical estimate of expected error: $\hat{W} = \arg\min_W Loss(W)$
– I.e. minimize the empirical risk over the drawn samples
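A minimal sketch of the empirical risk as an average of per-instance errors. The linear model, the squared-error divergence and the toy data are assumptions made for the example; the slides leave the exact form of the divergence open.

```python
def empirical_risk(f, W, X, d, div):
    """Average divergence between model outputs f(x, W) and targets d."""
    errors = [div(f(x, W), t) for x, t in zip(X, d)]
    return sum(errors) / len(errors)

def linear_model(x, W):
    # Hypothetical model: a plain weighted sum of the inputs.
    return sum(w * xi for w, xi in zip(W, x))

def squared_error(y, t):
    # One possible divergence; other choices are equally valid.
    return 0.5 * (y - t) ** 2

# Toy training set generated by d = 1*x1 + 2*x2.
X = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
d = [1.0, 2.0, 3.0]

risk_good = empirical_risk(linear_model, (1.0, 2.0), X, d, squared_error)
risk_bad = empirical_risk(linear_model, (0.0, 0.0), X, d, squared_error)
```

The parameters that generated the data give zero risk, while the all-zero parameters give a larger one; ERM searches for the parameters with the smallest such value.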
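Perceptron Learning Algorithm
• Given $N$ training instances $(X_1, y_1), (X_2, y_2), \ldots, (X_N, y_N)$
– $y_i = +1$ or $-1$ (using a +1/-1 representation for classes to simplify notation)
• Initialize $W$
• Cycle through the training instances:
• do
– For $i = 1, \ldots, N$
• $O(X_i) = \mathrm{sign}(W^T X_i)$
• If $O(X_i) \neq y_i$: $W = W + y_i X_i$
• until no more classification errors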
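The update loop above can be sketched as follows. Assumptions for this example: the bias is folded in as a constant 1 appended to each input, sign(0) is treated as +1, and a `max_epochs` cap is added so the loop also stops on data that is not linearly separable.

```python
def sign(z):
    # Convention assumed here: sign(0) = +1.
    return 1 if z >= 0 else -1

def train_perceptron(X, y, max_epochs=100):
    """Perceptron rule: on each misclassified instance, W <- W + y_i * X_i."""
    W = [0.0] * len(X[0])
    for _ in range(max_epochs):
        errors = 0
        for x_i, y_i in zip(X, y):
            out = sign(sum(w * v for w, v in zip(W, x_i)))
            if out != y_i:
                W = [w + y_i * v for w, v in zip(W, x_i)]
                errors += 1
        if errors == 0:  # "until no more classification errors"
            break
    return W

# OR function with +1/-1 labels; the last input component is the constant 1 (bias).
X = [(0, 0, 1), (0, 1, 1), (1, 0, 1), (1, 1, 1)]
y = [-1, 1, 1, 1]
W = train_perceptron(X, y)
predictions = [sign(sum(w * v for w, v in zip(W, x))) for x in X]
```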
Perceptrons with differentiable activation functions
[Figure: perceptron with inputs $x_1, \ldots, x_N$, weights $w_1, \ldots, w_N$ and bias weight $w_{N+1}$; $z = \sum_i w_i x_i + w_{N+1}$, $y = \sigma(z)$]
• $\sigma(z)$ is a differentiable function of $z$
– $\frac{d\sigma(z)}{dz}$ is well-defined and finite for all $z$
• Using the chain rule, $y$ is a differentiable function of both inputs $x_i$ and weights $w_i$
• This means that we can compute the change in the output for small changes in either the input or the weights
Overall network is differentiable
$Y$ = output of overall network
$w^k_{i,j}$ = weight connecting the ith unit of the kth layer to the jth unit of the k+1-th layer
$y^k_i$ = output of the ith unit of the kth layer
$Y$ is differentiable w.r.t both $w^k_{i,j}$ and $y^k_i$
• Every individual perceptron is differentiable w.r.t its inputs and its weights (including “bias” weight)
• By the chain rule, the overall function is differentiable w.r.t every parameter (weight or bias)
– Small changes in the parameters result in measurable changes in output
Overall function is differentiable
• The overall function is differentiable w.r.t every parameter

– Small changes in the parameters result in measurable changes
in the output
– We will derive the actual derivatives using the chain rule later
Overall setting for “Learning” the MLP
• Given a training set of input-output pairs $(X_1, d_1), (X_2, d_2), \ldots, (X_N, d_N)$
– $d_i$ is the desired output of the network in response to $X_i$
– $X_i$ and $d_i$ may both be vectors
• …we must find the network parameters such that the network produces the desired output for each training input
– Or a close approximation of it
– The architecture of the network must be specified by us
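ERM for neural networks
Actual output of network: $Y_i = net(X_i; W_1, W_2, \ldots, W_K)$
Desired output of network: $d_i$
Error on i-th training input: $Div(Y_i, d_i; W_1, W_2, \ldots, W_K)$
Average training error (loss): $Loss(W_1, W_2, \ldots, W_K) = \frac{1}{N} \sum_i Div(Y_i, d_i; W_1, W_2, \ldots, W_K)$
– What is the exact form of Div()? More on this later
• Optimize network parameters to minimize the total error over all training inputs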
Problem Statement
• Given a training set of input-output pairs $(X_1, d_1), (X_2, d_2), \ldots, (X_N, d_N)$
• Minimize the following function w.r.t $W$: $Loss(W) = \frac{1}{N} \sum_i Div(f(X_i; W), d_i)$
• This is a problem of function minimization
– An instance of optimization
FUNCTION OPTIMIZATION Refresher
A brief note on derivatives..
• A derivative of a function at any point tells us how much a minute increment to the argument of the function will increment the value of the function
▪ For any $y = f(x)$, expressed as a multiplier $\alpha$ to a tiny increment $\Delta x$ to obtain the increment $\Delta y$ to the output: $\Delta y = \alpha \Delta x$
▪ Based on the fact that at a fine enough resolution, any smooth, continuous function is locally linear at any point
Scalar function of scalar argument
• When $x$ and $y$ are scalar, $y = f(x)$
▪ Derivative: $\Delta y = \alpha \Delta x$
▪ Often represented (using somewhat inaccurate notation) as $\frac{dy}{dx}$
▪ Or alternately (and more reasonably) as $f'(x)$
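Multivariate scalar function: Scalar function of vector argument
$y = f(\mathbf{x})$, with $\Delta y = \alpha \Delta\mathbf{x}$
Note: $\Delta\mathbf{x}$ is now a vector, $\Delta\mathbf{x} = [\Delta x_1, \ldots, \Delta x_D]^T$
• Giving us that $\alpha$ is a row vector: $\alpha = [\alpha_1 \; \cdots \; \alpha_D]$
$\Delta y = \alpha_1 \Delta x_1 + \alpha_2 \Delta x_2 + \cdots + \alpha_D \Delta x_D$
• The partial derivative $\alpha_i$ gives us how $y$ increments when only $x_i$ is incremented
• Often represented as $\frac{\partial y}{\partial x_i}$
$\Delta y = \frac{\partial y}{\partial x_1} \Delta x_1 + \frac{\partial y}{\partial x_2} \Delta x_2 + \cdots + \frac{\partial y}{\partial x_D} \Delta x_D$
• Equivalently $\Delta y = \nabla_{\mathbf{x}} y \, \Delta\mathbf{x}$, where $\nabla_{\mathbf{x}} y = \left[ \frac{\partial y}{\partial x_1} \; \cdots \; \frac{\partial y}{\partial x_D} \right]$
We will be using this symbol for vector and matrix derivatives
o You may be more familiar with the term “gradient”, which is actually defined as the transpose of the derivative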
The problem of optimization
[Figure: curve f(x) with its global maximum, inflection point, local minimum and global minimum marked]
• General problem of optimization: find the value of x where f(x) is minimum
Finding the minimum of a function
• Find the value $x$ at which $f'(x) = 0$
– Solve $\frac{df(x)}{dx} = 0$
• The solution is a “turning point”
– Derivatives go from positive to negative or vice versa at this point
• But is it a minimum?
Turning Points
[Figure: curve with the sign of its derivative (+, 0, -) marked along it]
• Both maxima and minima have zero derivative
• Both are turning points
Derivatives of a curve
[Figure: f(x) and its derivative f’(x) plotted against x]
• Both maxima and minima are turning points
Derivative of the derivative of the curve
[Figure: f(x), f’(x) and f’’(x) plotted against x]
• Both maxima and minima are turning points
• The second derivative f’’(x) is –ve at maxima and +ve at minima!
A note on derivatives of functions of single variable
[Figure: curve with a maximum, an inflection point and a minimum marked as critical points where the derivative is 0]
• All locations with zero derivative are critical points
– These can be local maxima, local minima, or inflection points
• The second derivative $\frac{d^2 f}{dx^2}$ is
– positive at minima
– negative at maxima
– zero at inflection points
• It’s a little more complicated for functions of multiple variables..
What about functions of multiple variables?
• The optimum point is still a “turning” point
– Shifting in any direction will increase the value
– For smooth functions, minuscule shifts will not result in any change at all
• We must find a point where shifting in any direction by a microscopic amount will not change the value of the function
A brief note on derivatives of multivariate functions
The Gradient of a scalar function
• The derivative $\nabla_X f(X)$ of a scalar function $f(X)$ of a multi-variate input $X$ is a multiplicative factor that gives us the change in $f(X)$ for tiny variations in $X$: $df(X) = \nabla_X f(X) \, dX$
– The gradient is the transpose of the derivative, $\nabla_X f(X)^T$

Gradients of scalar functions with multi-variate inputs
• Consider $f(X) = f(x_1, x_2, \ldots, x_n)$
• Relation: $df(X) = \nabla_X f(X) \, dX$
This is a vector inner product. To understand its behavior, let's consider a well-known property of inner products.
A well-known vector property
• The inner product between two vectors of fixed lengths is maximum when the two vectors are aligned
– i.e. when the angle between them is 0
Properties of Gradient
• $df(X) = \nabla_X f(X) \, dX$
– The inner product between $\nabla_X f(X)^T$ and $dX$
• Fixing the length of $dX$
– E.g. $|dX| = 1$
• $df(X)$ is max if $dX$ is aligned with $\nabla_X f(X)^T$
– The angle between $\nabla_X f(X)^T$ and $dX$ is 0
– The function f(X) increases most rapidly if the input increment $dX$ is perfectly aligned to $\nabla_X f(X)^T$
• The gradient is the direction of fastest increase in f(X)

Gradient
[Figure: gradient vector $\nabla_X f(X)^T$; moving in this direction increases f(X) fastest, moving in the opposite direction $-\nabla_X f(X)^T$ decreases f(X) fastest]
The Hessian
Finding the minimum of a scalar
function of a multi-variate input
• The optimum point is a turning point – the

gradient will be 0
Unconstrained Minimization of function (Multivariate)
Unconstrained Minimization of function (Example)
Closed Form Solutions are not always available
• Often it is not possible to simply solve $\nabla_X f(X) = 0$
– The function to minimize/maximize may have an intractable form
• In these situations, iterative solutions are used
– Begin with a “guess” for the optimal $X$ and refine it iteratively until the correct value is obtained
The Approach of Gradient Descent
• Iterative solution:
– Start at some point
– Find direction in which to shift this point to decrease error
• This can be found from the derivative of the function
– A positive derivative → moving left decreases error
– A negative derivative → moving right decreases error
– Shift point in this direction
• Iterative solution: Trivial algorithm
▪ Initialize $x^0$
▪ While $f'(x^k) \neq 0$: $x^{k+1} = x^k - \mathrm{sign}(f'(x^k)) \cdot \text{step}$
• Identical to previous algorithm

• Iterative solution: Trivial algorithm
▪ Initialize $x^0$
▪ While $f'(x^k) \neq 0$: $x^{k+1} = x^k - \eta^k f'(x^k)$
• $\eta^k$ is the “step size”
Gradient descent/ascent
(multivariate)
• The gradient descent/ascent method to find the
minimum or maximum of a function iteratively
– To find a maximum move in the direction of the
gradient
– To find a minimum move exactly opposite the

direction of the gradient
• Many solutions to choosing step size $\eta^k$
1. Fixed step size
– Use fixed value for $\eta^k$
What is the optimal step size?
• Step size is critical for fast optimization
• Will revisit this topic later
• For now, simply assume a potentially iteration-dependent step size
Gradient descent convergence criteria
• The gradient descent algorithm converges when one of the following criteria is satisfied
– $|f(x^{k+1}) - f(x^k)| < \epsilon_1$
– $\|\nabla_x f(x^k)\| < \epsilon_2$
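Overall Gradient Descent Algorithm
• Initialize:
▪ $x^0$
▪ $k = 0$
• do
▪ $x^{k+1} = x^k - \eta^k \nabla_x f(x^k)^T$
▪ $k = k + 1$
• while $|f(x^k) - f(x^{k-1})| > \epsilon$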
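The loop above can be sketched in plain Python. Assumptions for this example: a fixed step size `eta`, a stopping test on the change in function value, a `max_iter` safety cap, and a simple quadratic test function; none of these specifics come from the slides.

```python
def gradient_descent(f, grad, x0, eta=0.1, eps=1e-10, max_iter=10000):
    """Repeat x <- x - eta * grad(x) until f stops changing by more than eps."""
    x = list(x0)
    for k in range(max_iter):
        g = grad(x)
        x_new = [xi - eta * gi for xi, gi in zip(x, g)]
        if abs(f(x_new) - f(x)) < eps:  # convergence criterion
            return x_new, k + 1
        x = x_new
    return x, max_iter

# Hypothetical test function with its minimum at (1, -2).
def f(x):
    return (x[0] - 1.0) ** 2 + 2.0 * (x[1] + 2.0) ** 2

def grad(x):
    return [2.0 * (x[0] - 1.0), 4.0 * (x[1] + 2.0)]

x_min, n_iter = gradient_descent(f, grad, [0.0, 0.0])
```

With too large a step size the iterates can overshoot and diverge, which is why the step size is described as critical.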
(C) Dhruv Batra
How do we compute gradients?
• Analytic or “Manual” Differentiation
– Applying formulas from calculus
• Numerical Differentiation
• Automatic Differentiation
– Computation Graph

Numerical vs Analytic Gradients
Numerical gradient: slow :(, approximate :(, easy to write :)

Analytic gradient: fast :), exact :), error-prone :(
In practice: Derive analytic gradient, check your

implementation with numerical gradient.
This is called a gradient check.
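A minimal gradient check might look like the sketch below. The test function, the evaluation point and the central-difference step `h` are assumptions chosen for illustration.

```python
import math

def f(x):
    # Hypothetical function used only to demonstrate the check.
    return x[0] * x[1] + math.sin(x[0])

def analytic_grad(x):
    # Derived by hand: df/dx0 = x1 + cos(x0), df/dx1 = x0.
    return [x[1] + math.cos(x[0]), x[0]]

def numerical_grad(f, x, h=1e-5):
    """Central-difference approximation, one coordinate at a time."""
    grad = []
    for i in range(len(x)):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[i] += h
        x_minus[i] -= h
        grad.append((f(x_plus) - f(x_minus)) / (2 * h))
    return grad

x = [1.0, 2.0]
max_diff = max(abs(a - n) for a, n in zip(analytic_grad(x), numerical_grad(f, x)))
```

A small `max_diff` suggests the hand-derived gradient is implemented correctly; a large one usually points to a bug in the analytic code.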
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Computational Graph: Example
[Figure: computational graph with inputs x1 and x2 feeding sin( ) and * nodes]

Logistic Regression as a Cascade
Compose into a complicated function
Slide Credit: Marc'Aurelio Ranzato, Yann LeCun

Computational Graph
Any DAG of differentiable modules is allowed!
Slide Credit: Marc'Aurelio Ranzato

Neural Network Training
• Step 1: Compute Loss on mini-batch [F-Pass]
• Step 2: Compute gradients w.r.t. parameters [B-Pass]
• Step 3: Use gradient to update parameters
Note: θ is the same as w
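The three steps can be sketched on the smallest possible model. Assumptions for this example: a one-input logistic regression with parameters `theta = [weight, bias]`, a cross-entropy loss, a hand-written backward pass, a toy mini-batch and a fixed step size `eta`.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(theta, batch):
    """Step 1 [F-Pass]: average cross-entropy loss on the mini-batch."""
    loss = 0.0
    for x, t in batch:
        p = sigmoid(theta[0] * x + theta[1])
        loss -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return loss / len(batch)

def backward(theta, batch):
    """Step 2 [B-Pass]: gradient of that loss w.r.t. theta."""
    g = [0.0, 0.0]
    for x, t in batch:
        p = sigmoid(theta[0] * x + theta[1])
        g[0] += (p - t) * x
        g[1] += (p - t)
    return [gi / len(batch) for gi in g]

batch = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]  # toy (input, label) pairs
theta = [0.0, 0.0]
eta = 0.5

loss_before = forward(theta, batch)
for _ in range(100):
    grad = backward(theta, batch)
    # Step 3: use the gradient to update the parameters.
    theta = [th - eta * g for th, g in zip(theta, grad)]
loss_after = forward(theta, batch)
```

Frameworks automate the backward pass, but the loop keeps this same forward, backward, update shape.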
Key Computation: Forward-Prop

Key Computation: Back-Prop

Backpropagation: a simple example
e.g. x = -2, y = 5, z = -4
Want: $\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}, \frac{\partial f}{\partial z}$
Chain rule: [upstream gradient] x [local gradient]
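The function itself did not survive in the slide text. The sketch below assumes it is f(x, y, z) = (x + y) * z, the small example commonly used with these input values; treat that choice, and the intermediate name `q`, as assumptions.

```python
# Forward pass (assumed function: f = (x + y) * z).
x, y, z = -2.0, 5.0, -4.0
q = x + y          # intermediate node
f = q * z          # output node

# Backward pass: each gradient = [upstream gradient] x [local gradient].
df_df = 1.0                # gradient of the output w.r.t. itself
df_dq = df_df * z          # local gradient of q*z w.r.t. q is z
df_dz = df_df * q          # local gradient of q*z w.r.t. z is q
df_dx = df_dq * 1.0        # local gradient of x+y w.r.t. x is 1
df_dy = df_dq * 1.0        # local gradient of x+y w.r.t. y is 1
```

Under this assumption the forward pass gives f = -12 and the backward pass gives gradients -4, -4 and 3 for x, y and z.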

Another example:
[upstream gradient] x [local gradient]

[0.2] x [1] = 0.2
[0.2] x [1] = 0.2 (both inputs!)
x0: [0.2] x [2] = 0.4
w0: [0.2] x [-1] = -0.2
215

Computational graph representation may not
be unique. Choose one where local gradients
at each node can be easily expressed!
sigmoid function
sigmoid gate
216

Computational graph representation may not
be unique. Choose one where local gradients
at each node can be easily expressed!
sigmoid function
sigmoid gate

[1.00] x [(1 - 0.73) (0.73)]= 0.2
217
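The numbers above can be reproduced with a short script. The function and inputs are not present in the extracted text; this sketch assumes a sigmoid of w0*x0 + w1*x1 + w2 with w0 = 2, x0 = -1, w1 = -3, x1 = -2, w2 = -3, values that are consistent with the local gradients [2] and [-1] shown for x0 and w0.

```python
import math

# Assumed inputs (not stated in the extracted slide text).
w0, x0, w1, x1, w2 = 2.0, -1.0, -3.0, -2.0, -3.0

# Forward pass.
z = w0 * x0 + w1 * x1 + w2          # pre-activation
s = 1.0 / (1.0 + math.exp(-z))      # sigmoid output, about 0.73

# Backward pass through the sigmoid gate: local gradient is (1 - s) * s.
ds_dz = 1.0 * (1.0 - s) * s         # about 0.2

# The add gate passes the gradient unchanged; the multiply gate swaps inputs.
grad_w0 = ds_dz * x0                # about -0.2
grad_x0 = ds_dz * w0                # about 0.4
grad_w1 = ds_dz * x1
grad_x1 = ds_dz * w1
grad_w2 = ds_dz
```

Treating the sigmoid as a single gate replaces four elementary nodes with one local gradient, which is the point of choosing a convenient graph.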

How to Specify/Design Perceptron Parameters
x1 OR x2 with Perceptron (Step Threshold)
Perceptron with hard/step threshold
Output = 0 if input w1*x1 + w2*x2 < threshold T
       = 1 if input >= T

x1 OR x2 Truth Table
x1  x2  Output
0   0   0
1   0   1
0   1   1
1   1   1

From the OR truth table:
w1*0 + w2*0 < T implies T > 0
w1*1 + w2*0 >= T implies w1 >= T
w1*0 + w2*1 >= T implies w2 >= T
w1*1 + w2*1 >= T implies w1 + w2 >= T
Choose T = 1 (you can choose any T > 0)
Then w1 = 1 (you can choose any value >= T)
Similarly choose w2 = 1 (you can choose any value >= T)
w1 + w2 >= T is automatically satisfied for w1 = w2 = 1 and T = 1
Thus, OR can be realized with w1 = 1, w2 = 1, T = 1. It is important to note that other values of w1, w2, and T can also implement OR, as long as the above 4 inequalities are satisfied.
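The chosen parameters can be checked against the truth table directly. The function name `perceptron` and the second parameter set are illustrative assumptions.

```python
def perceptron(x1, x2, w1=1, w2=1, T=1):
    """Hard-threshold perceptron: 1 if w1*x1 + w2*x2 >= T, else 0."""
    return 1 if w1 * x1 + w2 * x2 >= T else 0

# Evaluate on all four input combinations.
truth_table = {(x1, x2): perceptron(x1, x2) for x1 in (0, 1) for x2 in (0, 1)}

# Another choice that satisfies the same four inequalities.
alt_table = {(x1, x2): perceptron(x1, x2, w1=2, w2=3, T=0.5)
             for x1 in (0, 1) for x2 in (0, 1)}
```

Both parameter sets produce the OR table, showing that the solution is not unique.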
Specify: Perceptron Parameters for Classification
[Figure: MLP with four hidden nodes over inputs x1 and x2, with weights and biases marked on the connections]
• Note: Different choices of weights and bias are possible.
• Left hidden node implements the x1 = 1 line
• Right hidden node implements the x1 = 5 line
• 2nd node from left implements x2 = x1 + 3 and 3rd from left implements x1 + x2 = 9
• For + class, output of left hidden node = +1, for other nodes output = -1
Example: Functional Approximation in 1-D
Given the function values (xi, yi), design an MLP

x    y
0    1
2    2
3    4.5
5    12.5

[Figure: plot of the data points (x-axis ticks 0, 2, 3, 5, 6; y-axis ticks 1, 2, 4.5, 12.5) and an MLP with hard-threshold hidden nodes feeding one output node]
• First consider one data point, say (3, 4.5), and hidden nodes with hard thresholds T=3 and T=5 (output = 1 if input >= T else 0)
• Output node uses a ReLU activation, i.e., y = sum of all weighted outputs from hidden nodes (the sum is never negative here)
• With output weights 4.5 from the T=3 node and -4.5 from the T=5 node, y = 4.5 for 3 <= x < 5, and 0 otherwise.
• Note that the choice of T=5 is by design, given the set of discrete data points. Any T greater than 3, up to the next data point, would work for the right hidden node.
• Applying this method to all data points gives the full network, with hidden thresholds T = 0, 2, 2, 3, 3, 5, 5, 6.
• Output y matches the training data points exactly.
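The construction can be sketched as below. The (threshold, output weight) pairs are a reconstruction from the garbled figure and should be read as an assumption: each data point gets one node that switches its value on and one node that switches it off at the next threshold.

```python
def step(x, T):
    """Hard-threshold hidden unit: 1 if x >= T, else 0."""
    return 1 if x >= T else 0

# (threshold, output weight) pairs, reconstructed (assumed) from the figure.
hidden = [(0, 1.0), (2, -1.0),
          (2, 2.0), (3, -2.0),
          (3, 4.5), (5, -4.5),
          (5, 12.5), (6, -12.5)]

def mlp(x):
    """Output node: ReLU of the weighted sum of the hidden units."""
    total = sum(w * step(x, T) for T, w in hidden)
    return max(0.0, total)

data = [(0, 1.0), (2, 2.0), (3, 4.5), (5, 12.5)]
outputs = [mlp(x) for x, _ in data]
```

The resulting function is piecewise constant: it reproduces the four training values and holds each one until the next threshold.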
