
Chapter 4

Multilayer Perceptron

Multilayer Perceptrons
[Figure: an MLP with weight matrices 𝑾^{(1)}, 𝑾^{(2)}, 𝑾^{(3)} connecting the input layer, hidden layer 1, hidden layer 2, and the output layer.]

Minibatch 𝑿 ∈ ℝ^{𝑛×𝑑} of 𝑛 examples, where each example has 𝑑 inputs (features).

Multilayer Perceptrons

[Figure: notation for a unit in the 𝑙-th layer — the weight matrix, the bias, the vector of inputs, the vector of outputs, the activation function, and the 𝑗-th unit of the 𝑙-th layer are labelled.]
Multilayer Perceptrons
𝑤_{𝑖𝑗}^{(𝑙)} is the weight of the connection between the 𝑖-th unit of the (𝑙 − 1)-th layer and the 𝑗-th unit of the 𝑙-th layer.
𝑏_{𝑖}^{(𝑙)} is the bias of the 𝑖-th unit of the 𝑙-th layer.
𝑾 – the weight matrix of a layer.
𝒃 – the bias vector of a layer.

Multilayer Perceptrons
Example: an MLP with 4 inputs, 3 outputs, and one hidden layer of 5 hidden units.
Activation function
The output of the 𝑖-th unit of the 𝑙-th layer is calculated by:

𝑎_𝑖^{(𝑙)} = 𝑓(𝒘_𝑖^{(𝑙)} 𝒂^{(𝑙−1)} + 𝑏_𝑖^{(𝑙)})

where 𝑓(·) is the nonlinear activation function.
In vector form, the output of all units of the 𝑙-th layer is calculated by:

𝒂^{(𝑙)} = 𝑓(𝑾^{(𝑙)} 𝒂^{(𝑙−1)} + 𝒃^{(𝑙)})
Sign function
Note that the sign function should not be used in an MLP: its gradient is zero almost everywhere, so it cannot propagate useful gradients during training.
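To make the layer equation concrete, here is a minimal NumPy sketch of one layer computing 𝒂^{(𝑙)} = 𝑓(𝑾^{(𝑙)} 𝒂^{(𝑙−1)} + 𝒃^{(𝑙)}) (illustrative only; the sizes and the choice of tanh are assumptions, not from the slides):

import numpy as np

def layer_forward(W, a_prev, b, f):
    # One layer: a^(l) = f(W^(l) a^(l-1) + b^(l))
    return f(W @ a_prev + b)

# Hypothetical sizes: a layer of 5 units fed by 4 activations.
rng = np.random.default_rng(0)
W = rng.normal(size=(5, 4))        # weight matrix W^(l)
b = rng.normal(size=(5,))          # bias vector b^(l)
a_prev = rng.normal(size=(4,))     # previous activations a^(l-1)
a = layer_forward(W, a_prev, b, np.tanh)
print(a.shape)                     # (5,)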

Activation function
Sigmoid function

𝜎(𝑥) = 1 / (1 + 𝑒^{−𝑥})
Sigmoids saturate and kill gradients. A very undesirable property of the sigmoid neuron is that when its activation saturates at either tail (near 0 or 1), the gradient in these regions is almost zero.
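The saturation effect is easy to see numerically. A small sketch (assumed, not from the slides) evaluating the sigmoid and its derivative 𝜎′(𝑥) = 𝜎(𝑥)(1 − 𝜎(𝑥)) at a few points:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # sigma'(x) = sigma(x) * (1 - sigma(x))

for x in (0.0, 2.0, 5.0, 10.0):
    print(x, sigmoid(x), sigmoid_grad(x))
# At x = 10 the gradient is about 4.5e-05: a unit stuck in the tail passes
# almost no gradient back, so its incoming weights barely change.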

Activation function
Tanh function

tanh(𝑥) = (1 − 𝑒^{−2𝑥}) / (1 + 𝑒^{−2𝑥}) = 2𝜎(2𝑥) − 1
Like the sigmoid neuron, its activations
saturate, but unlike the sigmoid neuron its
output is zero-centered.
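The identity tanh(𝑥) = 2𝜎(2𝑥) − 1 can be checked directly (a quick NumPy sanity check, not part of the slides):

import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
x = np.linspace(-3.0, 3.0, 7)
print(np.allclose(np.tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0))   # True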

Activation function
ReLU function
The Rectified Linear Unit has become very popular
in the last few years.
𝑓(𝑥) = max(0, 𝑥)
ReLU is non-differentiable at zero and unbounded above.
The gradient for negative inputs is zero, which means that for activations in that region the weights are not updated during backpropagation.
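A short sketch (assumed example) showing the zero gradient for negative inputs, with the usual subgradient convention of taking 0 at 𝑥 = 0:

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # Subgradient convention: use 0 at x = 0, where ReLU is non-differentiable.
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))        # [0.  0.  0.  0.5 2. ]
print(relu_grad(x))   # [0. 0. 0. 1. 1.]  -> no weight update for negative inputs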

Forward propagation
Forward propagation (or forward pass) refers to the calculation and storage of
intermediate variables (including outputs) for a neural network in order from the
input layer to the output layer.
Example: a network with two inputs 𝑥₁, 𝑥₂, one hidden layer of two units (𝑛₁, 𝑛₂), and a single output:

𝐱 = [𝑥₁, 𝑥₂]^𝑇

𝑎₁^{(1)} = 𝑓₁(𝑤_{1,1}^{(1)} 𝑥₁ + 𝑤_{2,1}^{(1)} 𝑥₂ + 𝑏₁^{(1)})
𝑎₂^{(1)} = 𝑓₁(𝑤_{1,2}^{(1)} 𝑥₁ + 𝑤_{2,2}^{(1)} 𝑥₂ + 𝑏₂^{(1)})
ŷ = 𝑓₂(𝑤_{1,1}^{(2)} 𝑎₁ + 𝑤_{2,1}^{(2)} 𝑎₂ + 𝑏^{(2)})

[Figure: input layer (𝑥₁, 𝑥₂) → hidden layer 1 (𝑛₁, 𝑛₂) → output layer (ŷ)]
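A minimal NumPy sketch of this forward pass (the weights, biases and the choice of 𝑓₁ = tanh, 𝑓₂ = identity are hypothetical; only the structure follows the figure):

import numpy as np

def forward(x, W1, b1, W2, b2, f1=np.tanh, f2=lambda z: z):
    # W1[i, j] stores w^(1)_{i+1, j+1} (input i+1 -> hidden unit j+1),
    # so the slide's scalar equations become z1 = W1.T @ x + b1.
    a1 = f1(W1.T @ x + b1)         # hidden activations a^(1)
    y_hat = f2(W2 @ a1 + b2)       # output y_hat
    return a1, y_hat

x  = np.array([1.0, 2.0])
W1 = np.array([[0.1, 0.3],
               [0.2, 0.4]])
b1 = np.zeros(2)
W2 = np.array([0.5, -0.5])         # w^(2)_{1,1}, w^(2)_{2,1}
b2 = 0.1
a1, y_hat = forward(x, W1, b1, W2, b2)
print(a1, y_hat)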

Chain Rule Gradient
We suppose that 𝑤 is a function of 𝑥, 𝑦 and that 𝑥, 𝑦 are functions of 𝑢, 𝑣. That is,
𝑤 = 𝑓(𝑥, 𝑦), 𝑥 = 𝑔(𝑢, 𝑣), 𝑦 = ℎ(𝑢, 𝑣)
The term "chain" is used because computing 𝑤 requires a chain of computations:
(𝑢, 𝑣) → (𝑥, 𝑦) → 𝑤

We will say 𝑤 is a dependent variable, 𝑢 and 𝑣 are independent variables and 𝑥 and
𝑦 are intermediate variables.
Since 𝑤 is a function of 𝑥 and 𝑦, it has partial derivatives ∂𝑤/∂𝑥 and ∂𝑤/∂𝑦.

Chain Rule Gradient
Since 𝑤 is also a function of 𝑢 and 𝑣, we can compute the partial derivatives ∂𝑤/∂𝑢 and ∂𝑤/∂𝑣.

The chain rule relates these derivatives by the following formulas

∂𝑤/∂𝑢 = (∂𝑤/∂𝑥)(∂𝑥/∂𝑢) + (∂𝑤/∂𝑦)(∂𝑦/∂𝑢)

∂𝑤/∂𝑣 = (∂𝑤/∂𝑥)(∂𝑥/∂𝑣) + (∂𝑤/∂𝑦)(∂𝑦/∂𝑣)
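A quick numerical check of these formulas (a sketch; the functions 𝑔, ℎ, 𝑓 below are chosen arbitrarily for the check):

import numpy as np

g = lambda u, v: u * v              # x = g(u, v)
h = lambda u, v: u + 2.0 * v        # y = h(u, v)
f = lambda x, y: x**2 + np.sin(y)   # w = f(x, y)
w_of = lambda u, v: f(g(u, v), h(u, v))

u, v, eps = 1.3, -0.7, 1e-6
x, y = g(u, v), h(u, v)
chain = (2.0 * x) * v + np.cos(y) * 1.0            # (dw/dx)(dx/du) + (dw/dy)(dy/du)
numeric = (w_of(u + eps, v) - w_of(u - eps, v)) / (2.0 * eps)
print(chain, numeric)   # the two values agree to ~1e-9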

Backpropagation
✓ Backpropagation refers to the method of calculating the gradient of neural
network parameters.
✓ In short, the method traverses the network in reverse order, from the output to
the input layer, according to the chain rule from calculus.
✓ The algorithm stores any intermediate variables (partial derivatives) required
while calculating the gradient with respect to some parameters.
✓ To update the weights and the biases, we need to define the loss function 𝐽(𝑾, 𝒃, 𝐱, 𝐲), where 𝑾 is the weight matrix, 𝒃 is the bias vector, and 𝐱, 𝐲 are the input and output vectors.
Backpropagation
The loss function is defined according to the Mean Square Error (MSE):

𝐽(𝑾, 𝒃, 𝐱, 𝐲) = (1/𝑁) Σ_{𝑛=1}^{𝑁} ‖𝒚_𝑛 − ŷ_𝑛‖²
It is difficult to take the gradient of the loss function with respect to the weight matrices directly.
Backpropagation computes the gradients ∂𝐽/∂𝑾^{(1)}, ∂𝐽/∂𝑾^{(2)}, …, ∂𝐽/∂𝑾^{(𝐿)} so that Gradient Descent can be applied to update the weight matrix of each layer.
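As a small illustration (a NumPy sketch with made-up numbers), the MSE loss above can be computed as:

import numpy as np

def mse_loss(Y, Y_hat):
    # J = (1/N) * sum_n || y_n - y_hat_n ||^2, one example per row
    return np.mean(np.sum((Y - Y_hat) ** 2, axis=1))

Y     = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # targets, N = 3
Y_hat = np.array([[0.9, 0.1], [0.2, 0.8], [0.7, 0.6]])   # predictions
print(mse_loss(Y, Y_hat))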

Backpropagation
Consider the weight of the 𝑗-th unit of the 𝐿-th layer:

∂𝐿/∂𝑤_{𝑖𝑗}^{(𝐿)} = (∂𝐿/∂𝑧_𝑗^{(𝐿)}) (∂𝑧_𝑗^{(𝐿)}/∂𝑤_{𝑖𝑗}^{(𝐿)})

Recall that 𝑧_𝑗^{(𝐿)} = Σ_𝑖 𝑤_{𝑖𝑗}^{(𝐿)} 𝑎_𝑖^{(𝐿−1)} + 𝑏_𝑗^{(𝐿)}, so it can be deduced that:

∂𝑧_𝑗^{(𝐿)}/∂𝑤_{𝑖𝑗}^{(𝐿)} = 𝑎_𝑖^{(𝐿−1)}

Defining 𝑒_𝑗^{(𝐿)} ≜ ∂𝐿/∂𝑧_𝑗^{(𝐿)}, the above equation becomes:

∂𝐿/∂𝑤_{𝑖𝑗}^{(𝐿)} = 𝑒_𝑗^{(𝐿)} 𝑎_𝑖^{(𝐿−1)}
Backpropagation
In the same manner, the gradient of the loss function with respect to the bias is:

∂𝐿/∂𝑏_𝑗^{(𝐿)} = (∂𝐿/∂𝑧_𝑗^{(𝐿)}) (∂𝑧_𝑗^{(𝐿)}/∂𝑏_𝑗^{(𝐿)}) = 𝑒_𝑗^{(𝐿)}

For the 𝑙-th layer, the gradient at the 𝑗-th unit is calculated as:

∂𝐿/∂𝑤_{𝑖𝑗}^{(𝑙)} = (∂𝐿/∂𝑧_𝑗^{(𝑙)}) (∂𝑧_𝑗^{(𝑙)}/∂𝑤_{𝑖𝑗}^{(𝑙)})

Recall that 𝑧_𝑗^{(𝑙)} = Σ_𝑖 𝑤_{𝑖𝑗}^{(𝑙)} 𝑎_𝑖^{(𝑙−1)} + 𝑏_𝑗^{(𝑙)}, so it can be deduced that ∂𝑧_𝑗^{(𝑙)}/∂𝑤_{𝑖𝑗}^{(𝑙)} = 𝑎_𝑖^{(𝑙−1)}.
Backpropagation

Defining 𝑒_𝑗^{(𝑙)} ≜ ∂𝐿/∂𝑧_𝑗^{(𝑙)}, the gradient of 𝐿 with respect to the weight is calculated as:

∂𝐿/∂𝑤_{𝑖𝑗}^{(𝑙)} = 𝑒_𝑗^{(𝑙)} 𝑎_𝑖^{(𝑙−1)}

Now consider the term ∂𝐿/∂𝑧_𝑗^{(𝑙)}; using the chain rule:

∂𝐿/∂𝑧_𝑗^{(𝑙)} = (∂𝐿/∂𝑎_𝑗^{(𝑙)}) (∂𝑎_𝑗^{(𝑙)}/∂𝑧_𝑗^{(𝑙)})
Backpropagation
Recall that the input of the 𝑘-th unit of the (𝑙 + 1)-th layer is 𝑧_𝑘^{(𝑙+1)} = 𝒘_𝑘^{(𝑙+1)} 𝒂^{(𝑙)} + 𝑏_𝑘^{(𝑙+1)}, and that 𝑎_𝑗^{(𝑙)} feeds into 𝑧_𝑘^{(𝑙+1)} for 𝑘 = 1, …, 𝑑^{(𝑙+1)}.

Then we can rewrite as follows:

∂𝐿/∂𝑎_𝑗^{(𝑙)} = Σ_{𝑘=1}^{𝑑^{(𝑙+1)}} (∂𝐿/∂𝑧_𝑘^{(𝑙+1)}) (∂𝑧_𝑘^{(𝑙+1)}/∂𝑎_𝑗^{(𝑙)}) = Σ_{𝑘=1}^{𝑑^{(𝑙+1)}} 𝑒_𝑘^{(𝑙+1)} 𝑤_{𝑗𝑘}^{(𝑙+1)} = 𝒘_{𝑗:}^{(𝑙+1)} 𝒆^{(𝑙+1)}

On the other hand, we have:

∂𝑎_𝑗^{(𝑙)}/∂𝑧_𝑗^{(𝑙)} = 𝑓′(𝑧_𝑗^{(𝑙)}), where 𝑓 is the activation function.
Backpropagation
Then we can rewrite as follows:

𝑒_𝑗^{(𝑙)} = ∂𝐿/∂𝑧_𝑗^{(𝑙)} = (𝒘_{𝑗:}^{(𝑙+1)} 𝒆^{(𝑙+1)}) 𝑓′(𝑧_𝑗^{(𝑙)})

Then:

∂𝐿/∂𝑤_{𝑖𝑗}^{(𝑙)} = (𝒘_{𝑗:}^{(𝑙+1)} 𝒆^{(𝑙+1)}) 𝑓′(𝑧_𝑗^{(𝑙)}) 𝑎_𝑖^{(𝑙−1)}

In the same manner, we have:

∂𝐿/∂𝑏_𝑗^{(𝑙)} = 𝑒_𝑗^{(𝑙)}
Backpropagation
Summary
(Unit by unit)
Perform forward propagation for each input and store the activation values 𝒂^{(𝑙)}.

At each unit of the output layer, calculate: 𝑒_𝑗^{(𝐿)} = ∂𝐿/∂𝑧_𝑗^{(𝐿)}

The gradients are: ∂𝐿/∂𝑤_{𝑖𝑗}^{(𝐿)} = 𝑎_𝑖^{(𝐿−1)} 𝑒_𝑗^{(𝐿)} and ∂𝐿/∂𝑏_𝑗^{(𝐿)} = 𝑒_𝑗^{(𝐿)}

At the 𝑙-th layer, calculate: 𝑒_𝑗^{(𝑙)} = (𝒘_{𝑗:}^{(𝑙+1)} 𝒆^{(𝑙+1)}) 𝑓′(𝑧_𝑗^{(𝑙)})

The gradients are: ∂𝐿/∂𝑤_{𝑖𝑗}^{(𝑙)} = 𝑒_𝑗^{(𝑙)} 𝑎_𝑖^{(𝑙−1)} and ∂𝐿/∂𝑏_𝑗^{(𝑙)} = 𝑒_𝑗^{(𝑙)}
Backpropagation
Summary
(In vector form)
Perform forward propagation for each input and store the activation values 𝒂^{(𝑙)}.

At the output layer, calculate: 𝒆^{(𝐿)} = ∂𝐿/∂𝒛^{(𝐿)}

The gradients are: ∂𝐿/∂𝑾^{(𝐿)} = 𝒂^{(𝐿−1)} 𝒆^{(𝐿)} and ∂𝐿/∂𝒃^{(𝐿)} = 𝒆^{(𝐿)}

At the 𝑙-th layer, calculate: 𝒆^{(𝑙)} = (𝑾^{(𝑙+1)} 𝒆^{(𝑙+1)}) ∗ 𝑓′(𝒛^{(𝑙)}), where ∗ is the element-wise product.

The gradients are: ∂𝐿/∂𝑾^{(𝑙)} = 𝒆^{(𝑙)} 𝒂^{(𝑙−1)} and ∂𝐿/∂𝒃^{(𝑙)} = 𝒆^{(𝑙)}
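A compact NumPy sketch of this vector-form procedure for a fully connected network, assuming tanh activations and a squared-error loss on a single example (the transposes are written out so the matrix shapes match the convention 𝒛^{(𝑙)} = 𝑾^{(𝑙)} 𝒂^{(𝑙−1)} + 𝒃^{(𝑙)}; it is an illustration under these assumptions, not the lecture's code):

import numpy as np

def f(z):  return np.tanh(z)
def fp(z): return 1.0 - np.tanh(z) ** 2      # f'(z)

def backprop(x, y, Ws, bs):
    # Forward pass: store z^(l) and a^(l)
    a, zs, acts = x, [], [x]
    for W, b in zip(Ws, bs):
        z = W @ a + b
        a = f(z)
        zs.append(z); acts.append(a)

    # Output layer: e^(L) = dL/dz^(L) for L = ||a^(L) - y||^2
    e = 2.0 * (acts[-1] - y) * fp(zs[-1])

    grads_W, grads_b = [], []
    for l in reversed(range(len(Ws))):
        grads_W.insert(0, np.outer(e, acts[l]))   # dL/dW^(l) = e^(l) (a^(l-1))^T
        grads_b.insert(0, e)                      # dL/db^(l) = e^(l)
        if l > 0:
            e = (Ws[l].T @ e) * fp(zs[l - 1])     # e^(l-1) = (W^(l)^T e^(l)) * f'(z^(l-1))
    return grads_W, grads_b

# Hypothetical 2-3-1 network, just to check the shapes.
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(3, 2)), rng.normal(size=(1, 3))]
bs = [np.zeros(3), np.zeros(1)]
gW, gb = backprop(np.array([0.5, -1.0]), np.array([1.0]), Ws, bs)
print([g.shape for g in gW])   # [(3, 2), (1, 3)]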

Multiclass Classification
[Figure: an MLP classifier with a ReLU hidden layer and a softmax output layer.]

Multiclass Classification

Forward propagation
𝒁^{(1)} = 𝑾^{(1)} 𝑿 + 𝒃^{(1)}
𝑨^{(1)} = max(𝒁^{(1)}, 0)
𝒁^{(2)} = 𝑾^{(2)} 𝑨^{(1)} + 𝒃^{(2)}
Ŷ = 𝑨^{(2)} = softmax(𝒁^{(2)})
Loss function

𝐽(𝑾, 𝒃; 𝑿, 𝒀) = −(1/𝑁) Σ_{𝑖=1}^{𝑁} Σ_{𝑗=1}^{𝐶} 𝑦_{𝑗𝑖} log(ŷ_{𝑗𝑖})
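A NumPy sketch of this forward pass and loss (shapes and initial values are assumptions; examples are stored as columns so that 𝒁 = 𝑾𝑿 + 𝒃 matches the equations above):

import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=0, keepdims=True)      # subtract the max for numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=0, keepdims=True)

def forward(X, W1, b1, W2, b2):
    Z1 = W1 @ X + b1                 # Z^(1)
    A1 = np.maximum(Z1, 0.0)         # A^(1) = max(Z^(1), 0)  (ReLU)
    Z2 = W2 @ A1 + b2                # Z^(2)
    return Z1, A1, Z2, softmax(Z2)   # Y_hat = softmax(Z^(2))

def cross_entropy(Y, Y_hat, eps=1e-12):
    # J = -(1/N) * sum_i sum_j y_ji * log(y_hat_ji), with Y one-hot of shape (C, N)
    return -np.sum(Y * np.log(Y_hat + eps)) / Y.shape[1]

# Hypothetical sizes: d = 4 features, 5 hidden units, C = 3 classes, N = 8 examples.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Y = np.eye(3)[:, rng.integers(0, 3, size=8)]          # one-hot labels, shape (3, 8)
W1, b1 = 0.1 * rng.normal(size=(5, 4)), np.zeros((5, 1))
W2, b2 = 0.1 * rng.normal(size=(3, 5)), np.zeros((3, 1))
Z1, A1, Z2, Y_hat = forward(X, W1, b1, W2, b2)
print(cross_entropy(Y, Y_hat))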

Multiclass Classification
Backpropagation
𝒆^{(2)} = ∂𝐽/∂𝒁^{(2)} = (1/𝑁)(Ŷ − 𝒀),  ∂𝐽/∂𝑾^{(2)} = 𝑨^{(1)} 𝒆^{(2)},  ∂𝐽/∂𝒃^{(2)} = Σ_{𝑛=1}^{𝑁} 𝒆_𝑛^{(2)}

𝒆^{(1)} = (𝑾^{(2)} 𝒆^{(2)}) ∗ 𝑓′(𝒁^{(1)}),  ∂𝐽/∂𝑾^{(1)} = 𝑨^{(0)} 𝒆^{(1)} = 𝑿 𝒆^{(1)}
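Continuing the forward-pass sketch above, these gradients can be written as follows (still an illustration; the transposes make the shapes consistent with the column-per-example layout):

def backward(X, Y, Z1, A1, Y_hat, W2):
    N = X.shape[1]
    E2  = (Y_hat - Y) / N                   # e^(2) = dJ/dZ^(2)
    dW2 = E2 @ A1.T                         # dJ/dW^(2)
    db2 = E2.sum(axis=1, keepdims=True)     # dJ/db^(2) = sum_n e_n^(2)
    E1  = (W2.T @ E2) * (Z1 > 0)            # e^(1) = (W^(2) e^(2)) * f'(Z^(1)), ReLU derivative
    dW1 = E1 @ X.T                          # dJ/dW^(1), with A^(0) = X
    db1 = E1.sum(axis=1, keepdims=True)
    return dW1, db1, dW2, db2

# With the variables from the forward sketch:
# dW1, db1, dW2, db2 = backward(X, Y, Z1, A1, Y_hat, W2)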

Next discussion: Multiclass Regression with MLP
