DL Slides 2
Vimal
BITS Pilani, Pilani Campus
(Hierarchical) Compositionality
Cascade of non-linear transformations
Multiple layers of representations
End-to-End Learning
Learning (goal-driven) representations
Learning the feature extraction
Distributed Representations
No single neuron “encodes” everything
Groups of neurons work together
Traditional Machine Learning
VISION: image → hand-crafted features (SIFT/HOG) [fixed] → your favorite classifier [learned] → “car”
SPEECH: audio → hand-crafted features (MFCC) [fixed] → your favorite classifier [learned] → \ˈdēp\
NLP: “This burrito place is yummy and fun!” → hand-crafted features (Bag-of-words) [fixed] → your favorite classifier [learned] → “+”
Slide Credit: Marc'Aurelio Ranzato, Yann LeCun
Hierarchical Compositionality
VISION: pixels → edge → texton → motif → part → object
SPEECH: sample → spectral band → formant → motif → phone → word
NLP: character → word → NP/VP/.. → clause → sentence → story
Idea 2: Compositions
Compose into a complicated function
• Deep Learning
• Grammar models
• Scattering transforms…
[Figure: feature visualization of a convolutional net trained on ImageNet, from [Zeiler & Fergus 2013]; low-level features compose up to the object “car”]
Feature Engineering
[Figure: examples of hand-crafted features: SIFT, Spin Images, HoG, Textons]
SPEECH: audio → MFCC [fixed] → Mixture of Gaussians [unsupervised] → classifier [supervised] → \ˈdēp\
NLP: “This burrito place is yummy and fun!” → Syntactic parse tree [fixed] → n-grams [unsupervised] → classifier [supervised] → “+”
Slide Credit: Marc'Aurelio Ranzato, Yann LeCun
Deep Learning = End-to-End Learning
In deep learning, the entire pipeline below becomes “Learned”:
VISION: image → SIFT/HOG [fixed] → K-Means/pooling [unsupervised] → classifier [supervised] → “car”
SPEECH: audio → MFCC [fixed] → Mixture of Gaussians [unsupervised] → classifier [supervised] → \ˈdēp\
NLP: “This burrito place is yummy and fun!” → Syntactic parse tree [fixed] → n-grams [unsupervised] → classifier [supervised] → “+”
Slide Credit: Marc'Aurelio Ranzato, Yann LeCun
“Shallow” vs Deep Learning
“Shallow” models [Figure: a fixed feature extractor followed by a single trainable stage, in contrast to a deep stack of trainable stages]
Distributed Representations Toy Example
Local vs Distributed
[Figure: a local code dedicates one neuron per concept; a distributed code represents each concept as a pattern across a group of neurons]
Modularity! Plug-and-play architectures!
Problems with Deep Learning
Terminology…
Problems with Deep Learning
Standard response #1: “Yes, but all interesting learning problems are non-convex.”
– For example, human learning: order matters → wave hands → non-convexity
Standard response #2: “Yes, but it often works!”

Problems with Deep Learning
Standard response #1: “We're working on it.” Tricks of the trade: visualize features, add losses at different layers, pre-train to avoid degenerate initializations…
Standard response #2: “Yes, but it often works!”

Problems with Deep Learning
Standard response #1: It's getting much better. Standard toolkits/libraries/frameworks are now available: PyTorch, TensorFlow, MXNet…
Standard response #2: “Yes, but it often works!”
Story So Far
• Characteristics of Deep Learning
  – Hierarchical representation
  – End-to-end learning
  – Distributed representation
• Challenges of Deep Learning
• Design of perceptron networks
  – Pen-and-paper based
• MLP as universal
  – Boolean function
  – Classifier
  – Function approximator
• Need for depth
Agenda
• MLP Parameter Estimation via Empirical Risk Minimization
• Computational Graph
Option 1: Construct the network by hand
[Figure: hand-designed perceptron networks over inputs $X_1$, $X_2$ (weights of $\pm 1$ and a bias of $-4$), realizing simple decision regions]
• Assuming simple perceptrons: output $= 1$ if $\sum_i w_i X_i \ge T$, else $0$
• Not possible for all but the simplest problems.
Option 2: Automatic estimation of an MLP
• Estimate the parameters to minimize the empirical estimate of the expected error:
$$\widehat{W} = \arg\min_{W} \frac{1}{N} \sum_{i=1}^{N} \mathrm{div}\big(f(X_i; W),\, d_i\big)$$
– I.e., minimize the empirical risk over the drawn samples
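As a concrete sketch of this objective in Python (the model f, the divergence div, and the data below are illustrative stand-ins, not from the slides):

```python
import numpy as np

def empirical_risk(f, W, X, d, div):
    """Average divergence between f(X_i; W) and target d_i over N samples."""
    return np.mean([div(f(x, W), t) for x, t in zip(X, d)])

# Illustrative instantiation: linear model + squared-error divergence.
f = lambda x, W: W @ x
div = lambda y, t: 0.5 * np.sum((y - t) ** 2)

X = [np.array([1.0, 2.0]), np.array([3.0, -1.0])]   # two drawn samples
d = [np.array([1.0]), np.array([0.0])]              # their targets
W = np.zeros((1, 2))                                # all-zero parameters
print(empirical_risk(f, W, X, d, div))              # 0.25
```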
Perceptron Learning Algorithm
• Cycle through the training instances; update the weights only on misclassified instances (labels $d_i \in \{-1, +1\}$): $w \leftarrow w + \eta\, d_i X_i$
• until no more classification errors
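A minimal sketch of this loop, assuming $\pm 1$ labels and a bias input appended to each sample so the threshold is absorbed into the weights:

```python
import numpy as np

def perceptron_learn(X, d, eta=1.0, max_epochs=100):
    """Cycle through the training instances; update w only on
    misclassified instances, until no more classification errors."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for x, t in zip(X, d):                 # t in {-1, +1}
            if np.sign(w @ x) != t:            # misclassified instance
                w = w + eta * t * x            # shift w toward/away from x
                errors += 1
        if errors == 0:                        # converged: no errors left
            break
    return w

# OR function on {0,1}^2, with a constant bias input appended.
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
d = np.array([-1, 1, 1, 1])
print(perceptron_learn(X, d))
```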
Perceptrons with differentiable activation functions
[Figure: a unit with inputs $x_1, \ldots, x_N$, weights $w_1, \ldots, w_N$ and a bias term $w_{N+1}$; the affine sum $z$ is passed through an activation $f$]
$z = \sum_i w_i x_i + b, \qquad y = f(z)$
• $f$ is a differentiable function of $z$
  – $\frac{dy}{dz}$ is well-defined and finite for all $z$
• Using the chain rule, $y$ is a differentiable function of both the inputs $x_i$ and the weights $w_i$
• This means that we can compute the change in the output for small changes in either the input or the weights
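A small sketch with a sigmoid activation, showing how both $dy/dw_i$ and $dy/dx_i$ follow from $dy/dz$ via the chain rule (the values are illustrative):

```python
import numpy as np

def unit_forward_backward(w, x, b):
    """One differentiable unit: z = w.x + b, y = sigmoid(z).
    Returns y and its derivatives w.r.t. the weights and the inputs."""
    z = w @ x + b
    y = 1.0 / (1.0 + np.exp(-z))
    dy_dz = y * (1.0 - y)          # well-defined and finite for all z
    dy_dw = dy_dz * x              # chain rule: dz/dw_i = x_i
    dy_dx = dy_dz * w              # chain rule: dz/dx_i = w_i
    return y, dy_dw, dy_dx

w = np.array([0.5, -1.0]); x = np.array([2.0, 1.0]); b = 0.1
y, dy_dw, dy_dx = unit_forward_backward(w, x, b)
print(y, dy_dw, dy_dx)
```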
Overall network is differentiable
[Figure: a multilayer network with unit outputs $y^{(k)}_i$ and weights $w^{(k)}_{i,j}$]
• $w^{(k)}_{i,j}$ = weight connecting the $i$-th unit of the $k$-th layer to the $j$-th unit of the $(k+1)$-th layer
• $y^{(k)}_i$ = output of the $i$-th unit of the $k$-th layer
• $y$ = output of the overall network
• $y$ is differentiable w.r.t. both the weights $w^{(k)}_{i,j}$ and the unit outputs $y^{(k)}_i$
Multivariate scalar function: scalar function of a vector argument
$y = f(x_1, x_2, \ldots, x_D)$
▪ Derivative: $\nabla y = \left[\frac{\partial y}{\partial x_1}, \frac{\partial y}{\partial x_2}, \ldots, \frac{\partial y}{\partial x_D}\right]$
Note: the derivative is now a vector
Finding the minimum of a function
[Figure: $f(x)$ vs $x$, marking an inflection point, a local minimum, and the global minimum]
• General problem of optimization: find the value of $x$ where $f(x)$ is minimum
• At such points the derivative is zero: locations with $f'(x) = 0$ are critical points
  – These can be local maxima, local minima, or inflection points
A note on derivatives of functions of a single variable
[Figure: a curve with a maximum, a minimum, and an inflection point, all with zero derivative]
• All locations with zero derivative are critical points
  – These can be local maxima, local minima, or inflection points
• Relation: for small changes, $\Delta y \approx \frac{dy}{dx}\,\Delta x$
Gradients of scalar functions with multi-variate inputs
• Consider $y = f(\mathbf{X}) = f(x_1, \ldots, x_D)$
• Relation: $\Delta y \approx \nabla_{\mathbf{X}} f(\mathbf{X})\, \Delta \mathbf{X}$
• Gradient vector: $\nabla_{\mathbf{X}} f(\mathbf{X}) = \left[\frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_D}\right]$
• Moving in the direction $\nabla_{\mathbf{X}} f(\mathbf{X})^{T}$ increases $f$ fastest; moving in the opposite direction decreases $f$ fastest
The Hessian
• $\nabla^2_{\mathbf{X}} f$: the matrix of second partial derivatives, with entries $\frac{\partial^2 f}{\partial x_i \partial x_j}$
The Approach of Gradient Descent
Finding the minimum of a scalar function of a multi-variate input
• Iterative solution:
  – Start at some point
  – Find the direction in which to shift this point to decrease error
    • This can be found from the derivative of the function:
      – A positive derivative → moving left decreases error
      – A negative derivative → moving right decreases error
  – Shift the point in this direction
Overall Gradient Descent Algorithm
• Initialize:
  ▪ $w^{(0)}$
  ▪ $k = 0$
• do
  ▪ $w^{(k+1)} = w^{(k)} - \eta\, \nabla f\big(w^{(k)}\big)^{T}$
  ▪ $k \leftarrow k + 1$
• while $\big| f\big(w^{(k)}\big) - f\big(w^{(k-1)}\big) \big| > \varepsilon$
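A minimal sketch of this loop in Python on a simple quadratic (the step size $\eta$ and tolerance $\varepsilon$ are illustrative choices):

```python
import numpy as np

def gradient_descent(f, grad_f, w0, eta=0.1, eps=1e-8, max_iters=10000):
    """w_{k+1} = w_k - eta * grad f(w_k), until f stops decreasing
    by more than eps (or the iteration budget runs out)."""
    w = w0
    for _ in range(max_iters):
        w_next = w - eta * grad_f(w)
        if abs(f(w_next) - f(w)) <= eps:
            return w_next
        w = w_next
    return w

# Example: f(w) = (w1 - 3)^2 + 2*(w2 + 1)^2, minimized at (3, -1).
f = lambda w: (w[0] - 3) ** 2 + 2 * (w[1] + 1) ** 2
grad_f = lambda w: np.array([2 * (w[0] - 3), 4 * (w[1] + 1)])
print(gradient_descent(f, grad_f, np.zeros(2)))  # approx. [3, -1]
```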
(C) Dhruv Batra
How do we compute gradients?
• Analytic or “Manual” Differentiation
– Applying formulas from calculus
• Numerical Differentiation (a quick check is sketched below)
• Automatic Differentiation
– Computational Graph
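As a sanity check relating the first two options, here is a sketch comparing an analytic gradient with a central-difference numerical estimate (the function and evaluation point are illustrative):

```python
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    """Central differences: df/dx_i ~ (f(x + h e_i) - f(x - h e_i)) / 2h."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x); e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

f = lambda x: np.sin(x[0]) * x[1]      # analytic grad: [x2*cos(x1), sin(x1)]
x = np.array([1.0, 2.0])
analytic = np.array([x[1] * np.cos(x[0]), np.sin(x[0])])
print(np.allclose(numerical_gradient(f, x), analytic, atol=1e-6))  # True
```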
[Figure: a computational graph composing $\sin(\cdot)$ and $*$ over inputs $x_1$, $x_2$ into a complicated function]
Note: $\theta$ is the same as $w$ (both denote the parameters)
(C) Dhruv Batra; Slide Credit: Marc'Aurelio Ranzato, Yann LeCun
Key Computation: Forward-Prop
[Figure: forward propagation through a computational graph, computing intermediate values node by node]

Worked example: $f(x, y, z) = (x + y)\,z$, e.g. $x = -2$, $y = 5$, $z = -4$
• Forward pass: $q = x + y = 3$; $f = q\,z = -12$
• Want: $\frac{\partial f}{\partial x}$, $\frac{\partial f}{\partial y}$, $\frac{\partial f}{\partial z}$
• Chain rule: at each gate, the downstream gradient is the upstream gradient times the local gradient
  – $\frac{\partial f}{\partial z} = q = 3$
  – $\frac{\partial f}{\partial q} = z = -4$
  – $\frac{\partial f}{\partial x} = \frac{\partial f}{\partial q}\cdot\frac{\partial q}{\partial x} = -4 \cdot 1 = -4$
  – $\frac{\partial f}{\partial y} = \frac{\partial f}{\partial q}\cdot\frac{\partial q}{\partial y} = -4 \cdot 1 = -4$
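The same worked example as straight-line Python; each backward line multiplies the upstream gradient by the gate's local gradient:

```python
# Forward pass: f(x, y, z) = (x + y) * z
x, y, z = -2.0, 5.0, -4.0
q = x + y          # add gate:      q = 3
f = q * z          # multiply gate: f = -12

# Backward pass (reverse order), upstream x local gradient at each gate:
df_df = 1.0                 # base case
df_dz = df_df * q           # local grad of f = q*z w.r.t. z is q  -> 3
df_dq = df_df * z           # local grad w.r.t. q is z             -> -4
df_dx = df_dq * 1.0         # local grad of q = x+y w.r.t. x is 1  -> -4
df_dy = df_dq * 1.0         # local grad w.r.t. y is 1             -> -4
print(df_dx, df_dy, df_dz)  # -4.0 -4.0 3.0
```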
[The slides then walk a second, longer computational graph gate by gate, repeatedly applying upstream gradient × local gradient, ending at a sigmoid.]
Sigmoid function: $\sigma(x) = \frac{1}{1 + e^{-x}}$
Sigmoid gate: $\frac{d\sigma(x)}{dx} = \big(1 - \sigma(x)\big)\,\sigma(x)$; the local gradient follows directly from the gate's own forward output
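A one-gate sketch of this convenience: the backward pass reuses the value the gate already produced on the forward pass (the inputs are illustrative):

```python
import numpy as np

def sigmoid_gate(z, upstream):
    """Forward: s = sigmoid(z). Backward: local gradient (1 - s) * s,
    multiplied by the upstream gradient."""
    s = 1.0 / (1.0 + np.exp(-z))
    return s, upstream * (1.0 - s) * s

s, dz = sigmoid_gate(1.0, upstream=1.0)
print(s, dz)   # ~0.731, ~0.197
```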
Thus, OR can be realized with $w_1 = 1$, $w_2 = 1$, $T = 1$. It is important to note that other values of $w_1$, $w_2$, and $T$ can also implement OR, as long as the above 4 inequalities are satisfied.
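A quick check of these parameters against the OR truth table (using the convention output = 1 if the weighted sum reaches the threshold T):

```python
def perceptron(x1, x2, w1=1, w2=1, T=1):
    # Output 1 if the weighted sum reaches the threshold, else 0.
    return 1 if w1 * x1 + w2 * x2 >= T else 0

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, perceptron(x1, x2))  # matches OR: 0, 1, 1, 1
```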
Specify: Perceptron Parameters for Classification
[Figure: a small classification example with target outputs $\pm 1$; the perceptron's weights and threshold must be chosen to classify all points correctly]
[Figure: a piecewise-constant target function $y(x)$ over $0 \le x \le 6$, with steps at $x = 2, 3, 5$]
Example: Functional Approximation in 1-D
Given the function values $(x_i, y_i)$, design an MLP.

x    y
0    1
2    2
3    4.5
5    12.5

[Figure: an MLP with two hard-threshold hidden units (T = 3 and T = 5) feeding an output unit with weights +4.5 and −4.5]
• First consider one data point, say (3, 4.5), and hidden nodes with hard thresholds T = 3 and T = 5 (output = 1 if input ≥ T, else 0)
• The output node uses a ReLU activation, i.e., y = sum of all weighted outputs from the hidden nodes
• With the choice of weights in the figure above, y = 4.5 for 3 ≤ x < 5, and 0 otherwise
• Note that the choice of T = 5 is by design, given the set of discrete data points: any threshold greater than 3, up to the next data point, would work for the right hidden node
Example: Functional Approximation in 1-D (continued)
Given the function values $(x_i, y_i)$, design an MLP.
[Figure: an MLP with hard-threshold hidden units at T = 0, 2, 2, 3, 3, 5, 5, 6, whose paired output weights of opposite sign each contribute one rectangular pulse]
• Applying the method shown on the previous slide to all data points, the above network is obtained.
• The output y matches the training data points exactly.
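A numpy sketch of this construction. The exact weights on the slide are only partially recoverable, so the pairing below assumes the natural pulse construction consistent with the figure: each pair of hard-threshold units carves out one interval, with pulse height equal to the target value there.

```python
import numpy as np

# Paired hard-threshold hidden units: unit j outputs 1 if x >= T[j].
# Each +/- pair of output weights forms one rectangular pulse whose
# height is the target y on that interval (assumed pairing).
T = np.array([0, 2, 2, 3, 3, 5, 5, 6], dtype=float)
a = np.array([1, -1, 2, -2, 4.5, -4.5, 12.5, -12.5])

def mlp(x):
    h = (x >= T).astype(float)      # hidden layer of hard thresholds
    return max(0.0, a @ h)          # ReLU output node, as on the slide

for x, y in [(0, 1), (2, 2), (3, 4.5), (5, 12.5)]:
    assert mlp(x) == y              # matches every training point exactly
print([mlp(x) for x in (1.0, 4.0, 6.0)])   # [1.0, 4.5, 0.0]
```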