A2.2 DNN Update 2
Jie Tang
Tsinghua University
Overview
2 Introduction
Example: Learning XOR
3 Backpropagation Algorithm
4 Activation Functions
5 Advanced Architecture
6 Summary
Deep Feedforward Networks
A deep feedforward network (DFN), also referred to as a multi-layer perceptron (MLP) or feedforward neural network, is an artificial neural network in which connections between the units do not form a cycle. It is the first and simplest type of ANN.
When extended to include feedback connections, such networks are called recurrent neural networks (RNNs).
[Figure: two example network diagrams, a feedforward network and a network with feedback connections]
Single-layer Perceptron
$$y = f(W^{\top} x) = f\Big(\sum_{i=1}^{3} W_i x_i + 1 \times b\Big)$$

[Figure: a single neuron: inputs $x_1, x_2, x_3$ with weights $W_1, W_2, W_3$, a bias unit $+1$ with weight $b$, and activation function $f$ producing the output $y$]

where f is the activation function:
– sigmoid function: $f(z) = \frac{1}{1 + \exp(-z)}$
– hyperbolic tangent: $\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$
– rectified linear function: $f(z) = \max(0, z)$

It is useful to remember their derivatives:
– sigmoid function: $f'(z) = f(z)(1 - f(z))$
– hyperbolic tangent: $f'(z) = 1 - (f(z))^2$
– rectified linear function: $f'(z) = 0$ if $z \le 0$, and $1$ otherwise
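As a concrete reference, a minimal NumPy sketch of these activations and their derivatives (the function names are ours, not from the slides):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def sigmoid_prime(z):
        s = sigmoid(z)
        return s * (1.0 - s)            # f'(z) = f(z)(1 - f(z))

    def tanh_prime(z):
        return 1.0 - np.tanh(z) ** 2    # f'(z) = 1 - (f(z))^2

    def relu(z):
        return np.maximum(0.0, z)

    def relu_prime(z):
        return (z > 0).astype(z.dtype)  # 0 for z <= 0, and 1 otherwise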
Softmax Units for Multinomial Output Distributions
A softmax output unit is defined by:
$$\mathrm{softmax}(z)_i = \frac{\exp(z_i)}{\sum_j \exp(z_j)}$$
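A direct implementation can overflow for large $z_i$; a common numerically stable variant (not shown on the slide) subtracts $\max(z)$ first, which leaves the result unchanged:

    import numpy as np

    def softmax(z):
        # Subtracting the max is safe: softmax(z) = softmax(z - c) for any c.
        e = np.exp(z - np.max(z))
        return e / np.sum(e)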
Deep Feedforward Networks
[Figure: a deep feedforward network with input units $x_i$, hidden units $h_i^{(l)}$, and outputs $y_i$]

– bias units $\{+1\}$, input layer $\{x_i\}$, hidden layer(s) $\{h_i^{(l)}\}$, output layer $\{y_i\}$
– Goal: approximate some function $f^*$
– e.g., a classifier $y = f^*(x)$ maps an input x onto a category y
– a feedforward network defines a mapping $y = f(x; \theta)$ and learns the parameters $\theta$ that give the best approximation to $f^*$

$$z^{(2)} = W^{(1)} x + b^{(1)}, \qquad h^{(2)} = f(z^{(2)})$$
$$z^{(l+1)} = W^{(l)} h^{(l)} + b^{(l)}, \qquad y = f(z^{(L)}) \tag{1}$$
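A minimal NumPy sketch of this forward computation, assuming ReLU hidden units, a linear output layer, and an arbitrary toy layer sizing (all our own assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    sizes = [3, 4, 4, 1]  # input, two hidden layers, output (toy shape)

    # One (W, b) pair per layer transition.
    Ws = [rng.normal(0, 0.1, (m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
    bs = [np.zeros(m) for m in sizes[1:]]

    def forward(x):
        h = x
        for W, b in zip(Ws[:-1], bs[:-1]):
            z = W @ h + b           # z^{(l+1)} = W^{(l)} h^{(l)} + b^{(l)}
            h = np.maximum(0.0, z)  # h^{(l+1)} = f(z^{(l+1)}), here ReLU
        return Ws[-1] @ h + bs[-1]  # linear output layer for simplicity

    y = forward(rng.normal(size=3))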
Outline
2 Introduction
Example: Learning XOR
3 Backpropagation Algorithm
4 Activation Functions
5 Advanced Architecture
6 Summary
Extending Linear Models
To represent non-linear functions of x, transform x using a non-linear function φ(x).

The "kernel trick" obtains a nonlinearity in a similar way:
– SVM: $f(x) = w^{\top} x + b \;\rightarrow\; f(x) = b + \sum_i \alpha_i\, \phi(x) \cdot \phi(x_i)$, where $\phi(x) \cdot \phi(x_i) = K(x, x_i)$

How, then, to choose the mapping φ?
– Use a very generic φ, such as an infinite-dimensional one, but generalization is poor.
– Manually engineer φ, but with little transfer between domains.
– The strategy of deep learning is to learn φ: we parametrize the representation as φ(x; θ) and use an optimization algorithm to find θ (see the sketch below). This captures the benefit of the first approach by being highly generic, using a very broad family φ(x; θ), while the human designer only needs to find the right general function family rather than precisely the right function.
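A hedged sketch of the contrast (names and shapes are our own illustration, not from the slides):

    import numpy as np

    rng = np.random.default_rng(0)

    # Manually engineered phi: a fixed, hand-chosen feature map.
    def phi_fixed(x):
        return np.array([x[0], x[1], x[0] * x[1]])

    # Learned phi(x; theta): a one-hidden-layer network whose
    # parameters theta = (W, c) are found by the optimizer.
    W = rng.normal(size=(3, 2))
    c = np.zeros(3)

    def phi_learned(x):
        return np.maximum(0.0, W @ x + c)

    # Either way, the final model is linear in the features.
    w, b = rng.normal(size=3), 0.0
    f = lambda x: w @ phi_learned(x) + b

With φ fixed, only w and b are fit; with φ learned, W and c are fit jointly with them.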
Example: Learning XOR
Linear model doesn’t fit
Solving the XOR problem by learning a representation:
– When $x_1 = 0$, the output has to increase with $x_2$.
– When $x_1 = 1$, the output has to decrease with $x_2$.
– A linear model $f(x; w, b) = x_1 w_1 + x_2 w_2 + b$ has to assign a single weight to $x_2$, so it cannot solve the problem.

A better solution: use a simple feedforward network with one hidden layer containing two hidden units. The nonlinear features map both $x = [1, 0]^{\top}$ and $x = [0, 1]^{\top}$ to a single point in feature space, $h = [1, 0]^{\top}$, which a linear model can then separate.
Feedforward Network for XOR
$$f(x; W, c, w, b) = w^{\top} \max\{0,\, W^{\top} x + c\} + b$$
Let
$$W = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}, \quad c = \begin{bmatrix} 0 \\ -1 \end{bmatrix}, \quad w = \begin{bmatrix} 1 \\ -2 \end{bmatrix}, \quad b = 0.$$
We get correct answers on all four points.

Now walk through how the model processes a batch of inputs:
– Build the design matrix X of all four points.
– Compute XW.
– Add c.
– Apply ReLU.
– Multiply by w.
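This walkthrough can be checked step by step in NumPy:

    import numpy as np

    X = np.array([[0, 0],   # design matrix: all four input points
                  [0, 1],
                  [1, 0],
                  [1, 1]])
    W = np.array([[1, 1],
                  [1, 1]])
    c = np.array([0, -1])
    w = np.array([1, -2])
    b = 0

    H = np.maximum(0, X @ W + c)  # compute XW, add c, apply ReLU
    y = H @ w + b                 # multiply by w (and add b = 0)
    print(y)                      # [0 1 1 0]: XOR of each input row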
Outline
2 Introduction
Example: Learning XOR
3 Backpropagation Algorithm
4 Activation Functions
5 Advanced Architecture
6 Summary
Loss Function
$$J(W, b) = \left[\frac{1}{n} \sum_{i=1}^{n} \frac{1}{2}\,\big\|\hat{y}^{(i)} - y^{(i)}\big\|^2\right] + \frac{\lambda}{2} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \big(W_{ji}^{(l)}\big)^2$$

The first term is the average squared error over the n training examples; the second is an L2 regularization (weight decay) term, where $s_l$ denotes the number of units in layer l.
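A direct NumPy transcription of J(W, b) (a sketch; `Ws` is a list of weight matrices and `Y_hat`, `Y` hold one example per row, names our own):

    import numpy as np

    def loss(Ws, Y_hat, Y, lam):
        n = Y.shape[0]
        # Average squared-error term over the n training examples.
        data_term = np.mean(0.5 * np.sum((Y_hat - Y) ** 2, axis=1))
        # L2 weight-decay term over all weights (biases are not penalized).
        reg_term = 0.5 * lam * sum(np.sum(W ** 2) for W in Ws)
        return data_term + reg_term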
Model Learning
Backpropagation
$$W_{ij}^{(l)} = W_{ij}^{(l)} - \alpha \frac{\partial}{\partial W_{ij}^{(l)}} J(W, b) \tag{2}$$
$$b_{i}^{(l)} = b_{i}^{(l)} - \alpha \frac{\partial}{\partial b_{i}^{(l)}} J(W, b) \tag{3}$$
Backpropagation
Forward propagation
The inputs x provide the initial information, which propagates up through the hidden units at each layer and finally produces $\hat{y}$.
Backward propagation
Often simply called backprop, it allows information from the cost to flow backwards through the network in order to compute the gradient.
Training
One iteration of gradient descent updates the parameters
W , b as follows:
$$W_{ij}^{(l)} = W_{ij}^{(l)} - \alpha \frac{\partial}{\partial W_{ij}^{(l)}} J(W, b) \tag{4}$$
$$b_{i}^{(l)} = b_{i}^{(l)} - \alpha \frac{\partial}{\partial b_{i}^{(l)}} J(W, b) \tag{5}$$
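As a minimal runnable sketch (the names `Ws`, `bs`, `grad_Ws`, `grad_bs`, and `alpha` are our own):

    import numpy as np

    def sgd_step(Ws, bs, grad_Ws, grad_bs, alpha):
        # Apply equations (4) and (5) to every layer's parameters.
        for l in range(len(Ws)):
            Ws[l] -= alpha * grad_Ws[l]  # W^(l) -= alpha * dJ/dW^(l)
            bs[l] -= alpha * grad_bs[l]  # b^(l) -= alpha * dJ/db^(l)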
Backpropagation: forward
Given a training example (x, y ), we will first run a “forward
pass” to compute all the activations throughout the network,
including the output value of the hypothesis ŷ .
[Figure: forward pass through a network with inputs $x_1, x_2, x_3$, one hidden layer, bias units, and output $\hat{y}$]
Backpropagation: error in the output layer
Then, for each node i in layer l, we would like to compute an "error term" $\delta_i^{(l)}$ that measures how much that node was "responsible" for any errors in our output.
For an output node, we can directly measure the difference between the network's activation and the true target value, and use that to define $\delta_i^{(L)}$.

[Figure: a five-layer network with parameters $(W^{(1)}, b^{(1)}), \dots, (W^{(4)}, b^{(4)})$: layer $L_1$ (input), layers $L_2$ to $L_4$ (hidden), layer $L_5$ (output)]

$$\delta_i^{(L)} = \frac{\partial}{\partial z_i^{(L)}} \, \frac{1}{2}\|y - \hat{y}\|^2 = -\big(y_i - f(z_i^{(L)})\big) \cdot f'(z_i^{(L)}) \tag{11}$$
Backpropagation: error backpropagation
For an output node, we can directly measure the difference between the network's activation and the true target value, and use that to define $\delta_i^{(L)}$.
For hidden units, we will compute $\delta_i^{(l)}$ based on a weighted average of the error terms of the nodes that use $z_i^{(l)}$ as an input.
Backpropagation: error backpropagation (2)
For hidden units, we compute $\delta_i^{(l)}$ as a weighted average of the error terms of the nodes that use $z_i^{(l)}$ as an input:

$$\delta_i^{(l)} = \Big( \sum_{j=1}^{s_{l+1}} W_{ji}^{(l)} \, \delta_j^{(l+1)} \Big) f'(z_i^{(l)})$$
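A sketch of both error terms, assuming sigmoid activations and the squared-error loss from the loss-function slide (function names are ours):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def f_prime(z):
        s = sigmoid(z)
        return s * (1.0 - s)

    # Output layer, eq. (11): delta_L = -(y - f(z_L)) * f'(z_L).
    def delta_output(y, z_L):
        return -(y - sigmoid(z_L)) * f_prime(z_L)

    # Hidden layer l: weighted sum of the next layer's error terms,
    # scaled by f'(z_l). W_l maps layer l to layer l+1.
    def delta_hidden(W_l, delta_next, z_l):
        return (W_l.T @ delta_next) * f_prime(z_l)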
Backpropagation: derivations
Backpropagation: summary (pseudo-code)
repeat
    set $\Delta W^{(l)} := 0$, $\Delta b^{(l)} := 0$ for all l;
    for i = 1 to n:
        perform forward propagation;
        use backpropagation to compute the error terms at each layer;
        compute the partial derivatives $\nabla_{W^{(l)}} J(W, b; x, y)$ and $\nabla_{b^{(l)}} J(W, b; x, y)$;
        $\Delta W^{(l)} := \Delta W^{(l)} + \nabla_{W^{(l)}} J(W, b; x, y)$;
        $\Delta b^{(l)} := \Delta b^{(l)} + \nabla_{b^{(l)}} J(W, b; x, y)$;
    update the parameters:
        $W^{(l)} := W^{(l)} - \alpha \left[ \frac{1}{n} \Delta W^{(l)} + \lambda W^{(l)} \right]$
        $b^{(l)} := b^{(l)} - \alpha \left[ \frac{1}{n} \Delta b^{(l)} \right]$
until convergence.
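Written out as a runnable NumPy sketch for a network with one hidden layer, assuming sigmoid activations, squared-error loss, and toy data (all sizes and names are our own placeholders):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    X = rng.normal(size=(8, 3))               # toy data: n = 8 examples
    Y = rng.normal(size=(8, 1))
    W1, b1 = rng.normal(0, 0.1, (4, 3)), np.zeros((4, 1))
    W2, b2 = rng.normal(0, 0.1, (1, 4)), np.zeros((1, 1))
    alpha, lam, n = 0.1, 1e-3, X.shape[0]

    for epoch in range(100):                  # "repeat ... until convergence"
        dW1 = np.zeros_like(W1); db1 = np.zeros_like(b1)
        dW2 = np.zeros_like(W2); db2 = np.zeros_like(b2)
        for i in range(n):
            x, y = X[i:i+1].T, Y[i:i+1].T
            # Forward propagation.
            z2 = W1 @ x + b1; h2 = sigmoid(z2)
            z3 = W2 @ h2 + b2; y_hat = sigmoid(z3)
            # Backpropagation: error terms at each layer.
            d3 = -(y - y_hat) * y_hat * (1 - y_hat)
            d2 = (W2.T @ d3) * h2 * (1 - h2)
            # Accumulate the partial derivatives.
            dW2 += d3 @ h2.T; db2 += d3
            dW1 += d2 @ x.T;  db1 += d2
        # Parameter update, with weight decay on W only.
        W2 -= alpha * (dW2 / n + lam * W2); b2 -= alpha * db2 / n
        W1 -= alpha * (dW1 / n + lam * W1); b1 -= alpha * db1 / n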
Question: What are the general limitations of the backpropagation rule?
Outline
2 Introduction
Example: Learning XOR
3 Backpropagation Algorithm
4 Activation Functions
5 Advanced Architecture
6 Summary
Activation functions
Logistic Sigmoid
Hyperbolic Tangent
Rectified Linear Units
Rectified linear units are an excellent default choice of hidden unit (Nair & Hinton, 2010).
Maxout Units
A maxout unit outputs the maximum over a group of k inputs:

$$g(z)_i = \max_{j \in G^{(i)}} z_j$$

where $G^{(i)}$ is the set of indices into the inputs for group i, $\{(i-1)k + 1, \dots, ik\}$.
A maxout unit can learn a piecewise linear, convex function
with up to k pieces. Maxout units can thus be seen as
learning the activation function itself rather than just the
relationship between units.
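A minimal sketch of this definition (our own illustration):

    import numpy as np

    def maxout(z, k):
        # z has m*k inputs; group i covers indices (i-1)k+1 .. ik
        # (1-based), i.e. the rows of the reshaped array below.
        return z.reshape(-1, k).max(axis=1)

    z = np.arange(8.0)      # 8 inputs
    print(maxout(z, k=4))   # 2 maxout units: [3. 7.]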
Question: What is the relationship between DNN and
Logistic Regression?
Outline
2 Introduction
Example: Learning XOR
3 Backpropagation Algorithm
4 Activation Functions
5 Advanced Architecture
6 Summary
Advanced Architecture
Residual Neural Networks
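The slides' figures are not reproduced here. The core idea of residual networks (He et al., 2016) is that each block learns a residual F(x) and adds its input back, y = F(x) + x, which makes very deep networks easier to optimize. A minimal fully-connected sketch (the original architecture uses convolutional layers; the names here are our own):

    import numpy as np

    def residual_block(x, W1, W2):
        # y = F(x) + x: the skip connection adds the input back
        # onto the learned residual F(x) = W2 relu(W1 x).
        F = W2 @ np.maximum(0.0, W1 @ x)
        return F + x

    rng = np.random.default_rng(0)
    d = 4
    W1, W2 = rng.normal(0, 0.1, (d, d)), rng.normal(0, 0.1, (d, d))
    y = residual_block(rng.normal(size=d), W1, W2)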
Outline
2 Introduction
Example: Learning XOR
3 Backpropagation Algorithm
4 Activation Functions
5 Advanced Architecture
6 Summary
Summary
[Figure: a feedforward neural network with inputs, a hidden layer, bias units, and output y]
Thanks.
HP: http://keg.cs.tsinghua.edu.cn/jietang/
Email: jietang@tsinghua.edu.cn