Chapter 4 Deep Neural Nets

Chapter – 4
Deep Neural Networks
Prof. Harshawardhan P. Ahire

Assistant Professor
Department of Electronics and Telecommunication Engineering
K. J. Somaiya Institute of Engineering and Information Technology Sion
Recall…
Using gradient ascent for linear classifiers
Key idea behind today’s lecture:
1. Define a linear classifier (logistic regression)
2. Define an objective function (likelihood)
3. Optimize it with gradient ascent (equivalently, gradient descent on the negative log-likelihood) to learn parameters
4. Predict the class with highest probability under the model
Recall…
This decision function isn’t differentiable: sign(x)
Use a differentiable function instead (for logistic regression, the logistic function).
Recall…
Logistic Regression
Data: Inputs are continuous vectors of length K. Outputs are discrete.
Model: Logistic function applied to dot product of parameters with input vector.
Learning: Finds the parameters that minimize some objective function.
Prediction: Output is the most probable class.
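The four parts of the logistic regression slide (data, model, learning, prediction) can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the function names, the learning rate, the epoch count, and the toy data are all assumptions, and the objective is taken to be the negative log-likelihood.

```python
import math

def logistic(u):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-u))

def predict_proba(theta, x):
    """Model: logistic function applied to the dot product of parameters and input."""
    return logistic(sum(t * xi for t, xi in zip(theta, x)))

def train(data, n_features, lr=0.5, epochs=200):
    """Learning: stochastic gradient descent on the negative log-likelihood."""
    theta = [0.0] * n_features
    for _ in range(epochs):
        for x, y in data:
            p = predict_proba(theta, x)
            # For one example, the gradient of -log p(y | x) w.r.t. theta is (p - y) * x.
            theta = [t - lr * (p - y) * xi for t, xi in zip(theta, x)]
    return theta

def predict(theta, x):
    """Prediction: output the most probable class (0 or 1)."""
    return 1 if predict_proba(theta, x) >= 0.5 else 0

# Hypothetical toy data: the second feature is a constant 1 acting as a bias term.
toy_data = [([2.0, 1.0], 1), ([3.0, 1.0], 1), ([-2.0, 1.0], 0), ([-3.0, 1.0], 0)]
theta = train(toy_data, n_features=2)
predictions = [predict(theta, x) for x, _ in toy_data]
```

On this linearly separable toy set the learned parameters classify every training point correctly; real data would also need a held-out set to check generalization.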
NEURAL NETWORKS
6
Learning highly non-linear functions
f: X → Y
• f might be a non-linear function
• X: (vector of) continuous and/or discrete variables
• Y: (vector of) continuous and/or discrete variables
Examples: the XOR gate, speech recognition
© Eric Xing @ CMU, 2006-2011

Perceptron and Neural Nets
• From biological neuron to artificial neuron (perceptron)
[Figure: perceptron. Inputs x1 and x2, weighted by w1 and w2, feed a linear combiner followed by a hard limiter with threshold θ, producing output Y]
• Activation function: Y = +1 if X ≥ θ, and Y = −1 if X < θ
• Artificial neuron networks
– supervised learning
– gradient descent
[Figure: input signals enter the input layer, pass through a middle layer, and leave the output layer as output signals]
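The perceptron described above (linear combiner followed by a hard limiter) can be written directly. The sketch below assumes the sign-style activation with outputs +1 and −1 and a "greater than or equal" comparison at the threshold. The weights and threshold used for the AND gate are hypothetical values chosen by hand, not learned.

```python
def hard_limiter(net_input, threshold):
    """Hard limiter: +1 if the net input reaches the threshold, otherwise -1."""
    return 1 if net_input >= threshold else -1

def perceptron(inputs, weights, threshold):
    """Linear combiner (weighted sum of inputs) followed by the hard limiter."""
    net_input = sum(w * x for w, x in zip(weights, inputs))
    return hard_limiter(net_input, threshold)

# Hand-picked (hypothetical) parameters that make the unit behave like an AND gate.
and_outputs = [perceptron([a, b], [1.0, 1.0], 1.5) for a in (0, 1) for b in (0, 1)]
```

With these values only the input (1, 1) produces +1. No single unit of this kind can reproduce XOR, which is one motivation for adding a middle layer.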
Connectionist Models
• Consider humans:
– Neuron switching time: ~0.001 second
– Number of neurons: ~10^10
– Connections per neuron: ~10^4 to 10^5
– Scene recognition time: ~0.1 second
– 100 inference steps doesn't seem like enough
– much parallel computation
• Properties of artificial neural nets (ANN)
– Many neuron-like threshold switching units
– Many weighted interconnections among units
– Highly parallel, distributed processes
[Figure: a biological neuron (dendrites, synapses, axon) next to an artificial unit (nodes, weights on the synapses)]

Motivation: Why is everyone talking about Deep Learning?
• Because a lot of money is invested in it…
– DeepMind: Acquired by Google for $400 million
– DNNResearch: Three-person startup (including Geoff Hinton) acquired by Google for an unknown price tag
– Enlitic, Ersatz, MetaMind, Nervana, Skylab: Deep Learning startups commanding millions of VC dollars
• Because it made the front page of the New York Times
Motivation: Why is everyone talking about Deep Learning?
Deep learning:
– Has won numerous pattern recognition competitions
– Does so with minimal feature engineering
This wasn’t always the case!
Since 1980s: Form of models hasn’t changed much, but lots of new tricks…
– More hidden units
– Better (online) optimization
– New nonlinear functions (ReLUs)
– Faster computers (CPUs and GPUs)
[Timeline: 1960s, 1980s, 1990s, 2006, 2016]
Background: A Recipe for Machine Learning
1. Given training data (example images labeled Face, Face, Not a face)
2. Choose each of these:
– Decision function
Examples: Linear regression, Logistic regression, Neural Network
– Loss function
Examples: Mean-squared error, Cross Entropy
Background: A Recipe for Machine Learning
1. Given training data
2. Choose each of these:
– Decision function
– Loss function
3. Define goal
4. Train with SGD (take small steps opposite the gradient)
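One common way to write steps 3 and 4 of the recipe is given below. The notation is an assumption made for illustration: N training pairs (x_i, y_i), a decision function f_θ with parameters θ, a loss ℓ, and a step size η_t. The first line is the goal (minimize the total loss); the second is one SGD step on a single sampled example i.

```latex
\theta^{*} = \arg\min_{\theta} \sum_{i=1}^{N} \ell\big(f_{\theta}(x_i),\, y_i\big)

\theta^{(t+1)} = \theta^{(t)} - \eta_t \, \nabla_{\theta}\, \ell\big(f_{\theta^{(t)}}(x_i),\, y_i\big)
```

The minus sign in the update is the "small steps opposite the gradient" of step 4.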
Background: A Recipe for Machine Learning
1. Given training data
2. Choose each of these:
– Decision function
– Loss function
3. Define goal
4. Train with SGD (take small steps opposite the gradient)
Gradients: Backpropagation can compute this gradient! And it’s a special case of a more general algorithm called reverse-mode automatic differentiation that can compute the gradient of any differentiable function efficiently!
Goals for Today’s Lecture
1. Explore a new class of decision functions (Neural Networks)
2. Consider variants of this recipe for training
Decision Functions: Linear Regression
[Figure: output y connected to inputs x1, x2, x3, …, xM by weights θ1, θ2, θ3, …, θM]
Decision Functions: Logistic Regression
[Figure: the same diagram, output y over weights θ1, …, θM. Later builds add the training examples Face, Face, Not a face, their labels 1, 1, 0, and a plot over axes x1 and x2]
Neural Network Model
[Figure: the inputs (independent variables) Age = 34, Gender = 2, Stage = 4 feed through weights into a hidden layer of sigmoid units (Σ), which feeds through further weights into the output unit. The output (dependent variable, the prediction) is 0.6, the “Probability of being Alive”]

“Combined logistic models”
[Figure, shown in several builds: the same network with inputs Age, Gender, Stage and output 0.6 (“Probability of being Alive”), each build highlighting a subset of the weights so that part of the network looks like a single logistic model]
Not really, no target for hidden units...
[Figure: the full network again]

Jargon Pseudo-Correspondence
• Independent variable = input variable
• Dependent variable = output variable
• Coefficients = “weights”
• Estimates = “targets”
Logistic Regression Model (the sigmoid unit)
[Figure: inputs Age = 34, Gender = 1, Stage = 4 (independent variables x1, x2, x3), each weighted by a coefficient (a, b, c), feed a single sigmoid unit Σ. The output (dependent variable p, the prediction) is 0.6, the “Probability of being Alive”]
Decision Functions: Neural Network
[Figure: input x1, x2, x3, …, xM feeds a hidden layer of D units (labeled a1, a2, …, aD in the first build and z1, z2, …, zD in the second), which feeds the output y]
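The network in the figure (M inputs, one hidden layer of D units, a single output) computes its output in two stages. The sketch below assumes sigmoid units in both layers and no bias terms. The weight names alpha and beta and all numeric values are hypothetical, chosen only to make the example concrete.

```python
import math

def sigmoid(u):
    """Logistic function used as the nonlinearity of every unit."""
    return 1.0 / (1.0 + math.exp(-u))

def forward(x, alpha, beta):
    """Forward pass of a one-hidden-layer network.

    alpha: D rows of M weights (input -> hidden), beta: D weights (hidden -> output).
    Returns the hidden activations z and the output y.
    """
    z = [sigmoid(sum(a * xi for a, xi in zip(row, x))) for row in alpha]
    y = sigmoid(sum(b * zj for b, zj in zip(beta, z)))
    return z, y

x = [1.0, 0.0, -1.0]                              # M = 3 inputs
alpha = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]      # D = 2 hidden units
beta = [1.0, -1.0]
z, y = forward(x, alpha, beta)
```

Each hidden unit is itself a logistic regression on the inputs, and the output unit is a logistic regression on the hidden activations.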
Building a Neural Net
[Figure, shown in several builds: output y over features x1, x2, …, xM; then a hidden layer of D = M units between input and output; then a hidden layer with D < M units]
Decision Boundary
• 0 hidden layers: linear classifier
– Hyperplanes
• 1 hidden layer
– Boundary of convex region (open or closed)
• 2 hidden layers
– Combinations of convex regions
[Figures: for each case, a network with inputs x1 and x2 and output y]
Example from Eric Postma via Jason Eisner
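A small worked case of why hidden layers matter: XOR cannot be separated by a single hyperplane, but one hidden layer of two threshold units is enough. The weights below are hand-picked, hypothetical values (not learned, and not taken from the slides), and the step activation outputs 0 or 1 for convenience.

```python
def step(u):
    """Threshold unit: 1 if the net input is positive, else 0."""
    return 1 if u > 0 else 0

def xor_net(x1, x2):
    """One hidden layer with two units, followed by one output unit."""
    h_or = step(x1 + x2 - 0.5)     # fires when at least one input is 1
    h_and = step(x1 + x2 - 1.5)    # fires only when both inputs are 1
    return step(h_or - h_and - 0.5)  # "OR but not AND" is XOR

xor_outputs = [xor_net(a, b) for a in (0, 1) for b in (0, 1)]
```

The two hidden units each define a half-plane, and the output unit combines them into the band between the two lines, which is a convex region.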

Decision Functions: Multi-Class Output
[Figure: network with K outputs y1, …, yK]
Decision Functions: Deeper Networks
Next lecture: Making the neural networks deeper
[Figure, shown in three builds: input x1, x2, x3, …, xM and Hidden Layer 1 (a1, a2, …, aD); then Hidden Layer 2 (b1, b2, …, bE) added; then Hidden Layer 3 (c1, c2, …, cF) added below the output y]
Decision Functions: Different Levels of Abstraction
• We don’t know the “right” levels of abstraction
• So let the model figure it out!
Face Recognition:
– Deep Network can build up increasingly higher levels of abstraction
– Lines, parts, regions
Example from Honglak Lee (NIPS 2010)
[Figure: output y over Hidden Layer 3 (c1, c2, …, cF), Hidden Layer 2 (b1, b2, …, bE), Hidden Layer 1 (a1, a2, …, aD), and input x1, x2, x3, …, xM]
ARCHITECTURES
44
Neural Network Architectures
Even for a basic Neural Network, there are
many design decisions to make:
1. # of hidden layers (depth)
2. # of units per hidden layer (width)
3. Type of activation function (nonlinearity)
4. Form of objective function
45
Activation Functions
Neural Network with sigmoid activation functions
[Figure: input x1, x2, x3, …, xM; hidden layer z1, z2, …, zD; output y]
Neural Network with arbitrary nonlinear activation functions
[Figure: the same network]
Sigmoid / Logistic Function
So far, we’ve assumed that the activation function (nonlinearity) is always the sigmoid function…
• A new change: modifying the nonlinearity
– The logistic is not widely used in modern ANNs
Alternate 1: tanh
Like the logistic function, but rescaled and shifted to the range [-1, +1]
Slide from William Cohen
[Figure: sigmoid vs. tanh, depth 4?, AISTATS 2010. Figure from Glorot & Bengio (2010)]
– ReLU often used in vision tasks
Alternate 2: rectified linear unit
Linear with a cutoff at zero
(Implementation: clip the gradient when you pass zero)
Soft version: log(exp(x)+1)
Doesn’t saturate (at one end)
Sparsifies outputs
Helps with vanishing gradient
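The activation functions named on these slides are short enough to write out. The sketch below is illustrative only. The helper names are assumptions, and relu_grad uses the common convention of a zero gradient at and below zero.

```python
import math

def sigmoid(x):
    """Logistic function, range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    """Hyperbolic tangent, range (-1, +1)."""
    return math.tanh(x)

def relu(x):
    """Rectified linear unit: linear with a cutoff at zero."""
    return max(0.0, x)

def relu_grad(x):
    """Gradient of ReLU: 1 on the positive side, 0 otherwise."""
    return 1.0 if x > 0 else 0.0

def softplus(x):
    """Soft version of ReLU: log(exp(x) + 1)."""
    return math.log(math.exp(x) + 1.0)

# tanh is the logistic function rescaled and shifted: tanh(x) = 2 * sigmoid(2x) - 1.
identity_gap = abs(tanh(0.7) - (2.0 * sigmoid(1.4) - 1.0))
```

ReLU outputs exactly zero for every negative input, which is the sense in which it sparsifies outputs, and its gradient stays at 1 for large positive inputs instead of shrinking as the sigmoid's does.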

Objective Functions for NNs
• Regression:
– Use the same objective as Linear Regression
– Quadratic loss (i.e. mean squared error)
• Classification:
– Use the same objective as Logistic Regression
– Cross-entropy (i.e. negative log likelihood)
– This requires probabilities, so we add an additional “softmax”
layer at the end of our network
53
Multi-Class Output
Softmax:
[Figure: input x1, x2, x3, …, xM; hidden layer a1, a2, …, aD; output y1, …, yK]
Cross-entropy vs. Quadratic loss
[Figure from Glorot & Bengio (2010)]
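A softmax layer turns K real-valued scores into K probabilities, which is what the cross-entropy objective needs. The sketch below shows softmax together with both losses for a single example. The function names and the example scores are assumptions, and subtracting the maximum score before exponentiating is a standard numerical-stability step rather than something stated in the slides.

```python
import math

def softmax(scores):
    """Map K real scores to K probabilities that sum to 1."""
    m = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, true_class):
    """Negative log likelihood of the true class."""
    return -math.log(probs[true_class])

def quadratic_loss(probs, true_class):
    """Squared error against a one-hot target, summed over the K outputs."""
    return sum((p - (1.0 if k == true_class else 0.0)) ** 2
               for k, p in enumerate(probs))

probs = softmax([2.0, 1.0, 0.1])          # hypothetical output scores, K = 3
ce = cross_entropy(probs, 0)
quad = quadratic_loss(probs, 0)
```

Cross-entropy grows without bound as the probability of the true class approaches zero, while the quadratic loss on probabilities is bounded.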

Objective Functions
Matching Quiz: Suppose you are given a neural net with a single output, y, and one hidden layer.
1) Minimizing sum of squared errors…
2) Minimizing sum of squared errors plus squared Euclidean norm of weights…
3) Minimizing cross-entropy…
4) Minimizing hinge loss…
…gives…
5) …MLE estimates of weights assuming target follows a Bernoulli with parameter given by the output value
6) …MAP estimates of weights assuming weight priors are zero mean Gaussian
7) …estimates with a large margin on the training data
8) …MLE estimates of weights assuming zero mean Gaussian noise on the output value
A. 1=5, 2=7, 3=6, 4=8
B. 1=5, 2=7, 3=8, 4=6
C. 1=7, 2=5, 3=5, 4=7
D. 1=7, 2=5, 3=6, 4=8
E. 1=8, 2=6, 3=5, 4=7
F. 1=8, 2=6, 3=8, 4=6
BACKPROPAGATION
59
Training: Backpropagation
• Question 1: When can we compute the gradients of the parameters of an arbitrary neural network?
• Question 2: When can we make the gradient computation efficient?
Training: Chain Rule
Given:
Chain Rule:
[Figure: computation graph from input x2 through intermediate quantities u1, u2, …, uJ to output y1]
Backpropagation is just repeated application of the chain rule from Calculus 101.
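For the graph in the figure, where an output y_i depends on an input x_k only through the intermediate quantities u_1, …, u_J, the multivariable chain rule is usually stated as follows. The function names g and h are assumed here for illustration.

```latex
\text{Given: } y = g(u), \qquad u = h(x)

\text{Chain rule: } \frac{\partial y_i}{\partial x_k}
  = \sum_{j=1}^{J} \frac{\partial y_i}{\partial u_j}\,\frac{\partial u_j}{\partial x_k}
  \qquad \text{for all } i, k
```

Each term in the sum corresponds to one path from x_k to y_i through a single u_j.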
Training: Chain Rule
[Figure: the same computation graph, from x2 through u1, u2, …, uJ to y1]
Backpropagation:
1. Instantiate the computation as a directed acyclic graph, where each intermediate quantity is a node.
2. At each node, store (a) the quantity computed in the forward pass and (b) the partial derivative of the goal with respect to that node’s intermediate quantity.
3. Initialize all partial derivatives to 0.
4. Visit each node in reverse topological order. At each node, add its contribution to the partial derivatives of its parents.
This algorithm is also called automatic differentiation in the reverse-mode.
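The four steps above can be sketched as a tiny reverse-mode automatic differentiation routine on scalars. This is an illustrative sketch under stated assumptions, not the lecture's implementation: the Node class and the helpers add, mul, sigmoid, and backward are hypothetical names, "parents" means the nodes a quantity was computed from, and the goal's own derivative is set to 1 before the reverse sweep (the derivative of the goal with respect to itself).

```python
import math

class Node:
    """One intermediate quantity in the computation graph (step 1)."""
    def __init__(self, value, parents=()):
        self.value = value        # (a) quantity computed in the forward pass
        self.grad = 0.0           # (b) d(goal)/d(this node), initialized to 0 (step 3)
        self.parents = parents    # pairs of (parent node, local partial derivative)

def add(a, b):
    return Node(a.value + b.value, ((a, 1.0), (b, 1.0)))

def mul(a, b):
    return Node(a.value * b.value, ((a, b.value), (b, a.value)))

def sigmoid(a):
    s = 1.0 / (1.0 + math.exp(-a.value))
    return Node(s, ((a, s * (1.0 - s)),))

def backward(goal):
    """Visit nodes in reverse topological order, pushing gradients to parents (step 4)."""
    order, seen = [], set()

    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)        # parents are appended before the node itself

    visit(goal)
    goal.grad = 1.0               # derivative of the goal with respect to itself
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += node.grad * local   # chain rule contribution

# Example: y = sigmoid(w * x + b) with hypothetical values.
w, x, b = Node(0.5), Node(2.0), Node(-1.0)
y = sigmoid(add(mul(w, x), b))
backward(y)
```

Here w * x + b is 0, so y is 0.5 and the sigmoid's local derivative is 0.25. The sweep then gives w.grad = 0.5, x.grad = 0.125, and b.grad = 0.25. Calling backward a second time on the same graph would accumulate into the existing gradients, so they would need to be reset to 0 first.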

Case 1: Logistic Regression
[Figure: output y connected to input x1, x2, x3, …, xM by weights θ1, θ2, θ3, …, θM]
Case 2: Neural Network
[Figure: input x1, x2, x3, …, xM; hidden layer z1, z2, …, zD; output y]
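Training: Chain Rule
Backpropagation:
1. Instantiate the computation as a directed acyclic graph, where each node represents a Tensor.
2. At each node, store (a) the quantity computed in the forward pass and (b) the partial derivatives of the goal with respect to that node’s Tensor.
3. Initialize all partial derivatives to 0.
4. Visit each node in reverse topological order. At each node, add its contribution to the partial derivatives of its parents.
This algorithm is also called automatic differentiation in the reverse-mode.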
Case 2: Neural Network
[Figure: the network with input x1, x2, x3, …, xM, hidden layer z1, z2, …, zD, and output y, with its computation divided into Module 1 through Module 5]
Background: A Recipe for Machine Learning
1. Given training data
2. Choose each of these:
– Decision function
– Loss function
3. Define goal
4. Train with SGD (take small steps opposite the gradient)
Gradients: Backpropagation can compute this gradient! And it’s a special case of a more general algorithm called reverse-mode automatic differentiation that can compute the gradient of any differentiable function efficiently!
Summary
1. Neural Networks…
– provide a way of learning features
– are highly nonlinear prediction functions
– (can be) a highly parallel network of logistic regression classifiers
– discover useful hidden representations of the input
2. Backpropagation…
– provides an efficient way to compute gradients
– is a special case of reverse-mode automatic differentiation
