
Chapter 4

Multilayer Perceptron

Multilayer Perceptrons
[Figure: an MLP with weight matrices 𝑾^{(1)}, 𝑾^{(2)}, 𝑾^{(3)} connecting the input layer, hidden layer 1, hidden layer 2, and the output layer.]

Minibatch 𝑿 ∈ ℝ^{𝑛×𝑑} of 𝑛 examples, where each example has 𝑑 inputs (features).

Multilayer Perceptrons

[Figure: notation for a unit in the 𝑙-th layer — the weight matrix, the bias, the vector of inputs, the vector of outputs, the activation function, and the 𝑗-th unit of the 𝑙-th layer are labelled.]
Multilayer Perceptrons
𝑤_{𝑖𝑗}^{(𝑙)} is the weight of the connection between the 𝑖-th unit of the (𝑙 − 1)-th layer and the 𝑗-th unit of the 𝑙-th layer.
𝑏_{𝑖}^{(𝑙)} is the bias of the 𝑖-th unit of the 𝑙-th layer.
𝑾 – the weight matrix of a layer.
𝒃 – the bias vector of a layer.

Multilayer Perceptrons
Example: an MLP with 4 inputs, 3 outputs, and one hidden layer of 5 hidden units.
Activation function
The output of the 𝑖-th unit of the 𝑙-th layer is calculated by:

𝑎_𝑖^{(𝑙)} = 𝑓(𝒘_𝑖^{(𝑙)} 𝒂^{(𝑙−1)} + 𝑏_𝑖^{(𝑙)})

where 𝑓(·) is the nonlinear activation function.
In vector form, the output of all units of the 𝑙-th layer is calculated by:

𝒂^{(𝑙)} = 𝑓(𝑾^{(𝑙)} 𝒂^{(𝑙−1)} + 𝒃^{(𝑙)})
Sign function
Note that the sign function should not be used in an MLP: its gradient is zero almost everywhere, so it cannot propagate useful gradients during training.
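To make the layer equation concrete, here is a minimal NumPy sketch of one layer computing 𝒂^{(𝑙)} = 𝑓(𝑾^{(𝑙)} 𝒂^{(𝑙−1)} + 𝒃^{(𝑙)}) (illustrative only; the sizes and the choice of tanh are assumptions, not from the slides):

import numpy as np

def layer_forward(W, a_prev, b, f):
    # One layer: a^(l) = f(W^(l) a^(l-1) + b^(l))
    return f(W @ a_prev + b)

# Hypothetical sizes: a layer of 5 units fed by 4 activations.
rng = np.random.default_rng(0)
W = rng.normal(size=(5, 4))        # weight matrix W^(l)
b = rng.normal(size=(5,))          # bias vector b^(l)
a_prev = rng.normal(size=(4,))     # previous activations a^(l-1)
a = layer_forward(W, a_prev, b, np.tanh)
print(a.shape)                     # (5,)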

Activation function
Sigmoid function

𝜎(𝑥) = 1 / (1 + 𝑒^{−𝑥})
Sigmoids saturate and kill gradients. A very undesirable property of the sigmoid neuron is that when its activation saturates at either tail (near 0 or 1), the gradient in these regions is almost zero.
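The saturation effect is easy to see numerically. A small sketch (assumed, not from the slides) evaluating the sigmoid and its derivative 𝜎′(𝑥) = 𝜎(𝑥)(1 − 𝜎(𝑥)) at a few points:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # sigma'(x) = sigma(x) * (1 - sigma(x))

for x in (0.0, 2.0, 5.0, 10.0):
    print(x, sigmoid(x), sigmoid_grad(x))
# At x = 10 the gradient is about 4.5e-05: a unit stuck in the tail passes
# almost no gradient back, so its incoming weights barely change.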

Activation function
Tanh function

tanh(𝑥) = (1 − 𝑒^{−2𝑥}) / (1 + 𝑒^{−2𝑥}) = 2𝜎(2𝑥) − 1
Like the sigmoid neuron, its activations
saturate, but unlike the sigmoid neuron its
output is zero-centered.
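The identity tanh(𝑥) = 2𝜎(2𝑥) − 1 can be checked directly (a quick NumPy sanity check, not part of the slides):

import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
x = np.linspace(-3.0, 3.0, 7)
print(np.allclose(np.tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0))   # True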

Activation function
ReLU function
The Rectified Linear Unit has become very popular
in the last few years.
𝑓(𝑥) = max(0, 𝑥)
ReLU is non-differentiable at zero and unbounded above.
The gradient for negative inputs is zero, which means that for activations in that region the weights are not updated during backpropagation.
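A short sketch (assumed example) showing the zero gradient for negative inputs, with the usual subgradient convention of taking 0 at 𝑥 = 0:

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # Subgradient convention: use 0 at x = 0, where ReLU is non-differentiable.
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))        # [0.  0.  0.  0.5 2. ]
print(relu_grad(x))   # [0. 0. 0. 1. 1.]  -> no weight update for negative inputs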

Forward propagation
Forward propagation (or forward pass) refers to the calculation and storage of
intermediate variables (including outputs) for a neural network in order from the
input layer to the output layer.
Example: a network with two inputs 𝑥₁, 𝑥₂, one hidden layer of two units (𝑛₁, 𝑛₂), and a single output:

𝐱 = [𝑥₁, 𝑥₂]^𝑇

𝑎₁^{(1)} = 𝑓₁(𝑤_{1,1}^{(1)} 𝑥₁ + 𝑤_{2,1}^{(1)} 𝑥₂ + 𝑏₁^{(1)})
𝑎₂^{(1)} = 𝑓₁(𝑤_{1,2}^{(1)} 𝑥₁ + 𝑤_{2,2}^{(1)} 𝑥₂ + 𝑏₂^{(1)})
ŷ = 𝑓₂(𝑤_{1,1}^{(2)} 𝑎₁ + 𝑤_{2,1}^{(2)} 𝑎₂ + 𝑏^{(2)})

[Figure: input layer (𝑥₁, 𝑥₂) → hidden layer 1 (𝑛₁, 𝑛₂) → output layer (ŷ)]
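A minimal NumPy sketch of this forward pass (the weights, biases and the choice of 𝑓₁ = tanh, 𝑓₂ = identity are hypothetical; only the structure follows the figure):

import numpy as np

def forward(x, W1, b1, W2, b2, f1=np.tanh, f2=lambda z: z):
    # W1[i, j] stores w^(1)_{i+1, j+1} (input i+1 -> hidden unit j+1),
    # so the slide's scalar equations become z1 = W1.T @ x + b1.
    a1 = f1(W1.T @ x + b1)         # hidden activations a^(1)
    y_hat = f2(W2 @ a1 + b2)       # output y_hat
    return a1, y_hat

x  = np.array([1.0, 2.0])
W1 = np.array([[0.1, 0.3],
               [0.2, 0.4]])
b1 = np.zeros(2)
W2 = np.array([0.5, -0.5])         # w^(2)_{1,1}, w^(2)_{2,1}
b2 = 0.1
a1, y_hat = forward(x, W1, b1, W2, b2)
print(a1, y_hat)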

Chain Rule Gradient
We suppose that 𝑤 is a function of 𝑥, 𝑦 and that 𝑥, 𝑦 are functions of 𝑢, 𝑣. That is,
𝑤 = 𝑓(𝑥, 𝑦), 𝑥 = 𝑔(𝑢, 𝑣), 𝑦 = ℎ(𝑢, 𝑣)
The term "chain" is used because computing 𝑤 requires a chain of computations:
(𝑢, 𝑣) → (𝑥, 𝑦) → 𝑤

We will say 𝑤 is a dependent variable, 𝑢 and 𝑣 are independent variables and 𝑥 and
𝑦 are intermediate variables.
Since 𝑤 is a function of 𝑥 and 𝑦, it has partial derivatives ∂𝑤/∂𝑥 and ∂𝑤/∂𝑦.

Chain Rule Gradient
Since 𝑤 is also a function of 𝑢 and 𝑣, we can compute the partial derivatives ∂𝑤/∂𝑢 and ∂𝑤/∂𝑣.

The chain rule relates these derivatives by the following formulas

∂𝑤/∂𝑢 = (∂𝑤/∂𝑥)(∂𝑥/∂𝑢) + (∂𝑤/∂𝑦)(∂𝑦/∂𝑢)

∂𝑤/∂𝑣 = (∂𝑤/∂𝑥)(∂𝑥/∂𝑣) + (∂𝑤/∂𝑦)(∂𝑦/∂𝑣)
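A quick numerical check of these formulas (a sketch; the functions 𝑔, ℎ, 𝑓 below are chosen arbitrarily for the check):

import numpy as np

g = lambda u, v: u * v              # x = g(u, v)
h = lambda u, v: u + 2.0 * v        # y = h(u, v)
f = lambda x, y: x**2 + np.sin(y)   # w = f(x, y)
w_of = lambda u, v: f(g(u, v), h(u, v))

u, v, eps = 1.3, -0.7, 1e-6
x, y = g(u, v), h(u, v)
chain = (2.0 * x) * v + np.cos(y) * 1.0            # (dw/dx)(dx/du) + (dw/dy)(dy/du)
numeric = (w_of(u + eps, v) - w_of(u - eps, v)) / (2.0 * eps)
print(chain, numeric)   # the two values agree to ~1e-9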

Backpropagation
✓ Backpropagation refers to the method of calculating the gradient of neural
network parameters.
✓ In short, the method traverses the network in reverse order, from the output to
the input layer, according to the chain rule from calculus.
✓ The algorithm stores any intermediate variables (partial derivatives) required
while calculating the gradient with respect to some parameters.
✓ To update the weights and the biases, we need to define the loss function 𝐽(𝑾, 𝒃, 𝐱, 𝐲), where 𝑾 is the weight matrix, 𝒃 is the bias vector, and 𝐱, 𝐲 are the input and output vectors.
Backpropagation
The loss function is defined according to the Mean Square Error (MSE):

𝐽(𝑾, 𝒃, 𝐱, 𝐲) = (1/𝑁) Σ_{𝑛=1}^{𝑁} ‖𝒚_𝑛 − ŷ_𝑛‖²
It is difficult to take the gradient of the loss function with respect to the weight matrices directly.
Backpropagation computes the gradients ∂𝐽/∂𝑾^{(1)}, ∂𝐽/∂𝑾^{(2)}, …, ∂𝐽/∂𝑾^{(𝐿)} so that Gradient Descent can be applied to update the weight matrix of each layer.
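As a small illustration (a NumPy sketch with made-up numbers), the MSE loss above can be computed as:

import numpy as np

def mse_loss(Y, Y_hat):
    # J = (1/N) * sum_n || y_n - y_hat_n ||^2, one example per row
    return np.mean(np.sum((Y - Y_hat) ** 2, axis=1))

Y     = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # targets, N = 3
Y_hat = np.array([[0.9, 0.1], [0.2, 0.8], [0.7, 0.6]])   # predictions
print(mse_loss(Y, Y_hat))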

Backpropagation
Consider the weight of the 𝑗-th unit of the 𝐿-th layer:

∂𝐿/∂𝑤_{𝑖𝑗}^{(𝐿)} = (∂𝐿/∂𝑧_𝑗^{(𝐿)}) (∂𝑧_𝑗^{(𝐿)}/∂𝑤_{𝑖𝑗}^{(𝐿)})

Recall that 𝑧_𝑗^{(𝐿)} = Σ_𝑖 𝑤_{𝑖𝑗}^{(𝐿)} 𝑎_𝑖^{(𝐿−1)} + 𝑏_𝑗^{(𝐿)}, so it can be deduced that:

∂𝑧_𝑗^{(𝐿)}/∂𝑤_{𝑖𝑗}^{(𝐿)} = 𝑎_𝑖^{(𝐿−1)}

Defining 𝑒_𝑗^{(𝐿)} ≜ ∂𝐿/∂𝑧_𝑗^{(𝐿)}, the above equation becomes:

∂𝐿/∂𝑤_{𝑖𝑗}^{(𝐿)} = 𝑒_𝑗^{(𝐿)} 𝑎_𝑖^{(𝐿−1)}
Backpropagation
In the same manner, the gradient of the loss function with respect to the bias is:

∂𝐿/∂𝑏_𝑗^{(𝐿)} = (∂𝐿/∂𝑧_𝑗^{(𝐿)}) (∂𝑧_𝑗^{(𝐿)}/∂𝑏_𝑗^{(𝐿)}) = 𝑒_𝑗^{(𝐿)}

For the 𝑙-th layer, the gradient at the 𝑗-th unit is calculated as:

∂𝐿/∂𝑤_{𝑖𝑗}^{(𝑙)} = (∂𝐿/∂𝑧_𝑗^{(𝑙)}) (∂𝑧_𝑗^{(𝑙)}/∂𝑤_{𝑖𝑗}^{(𝑙)})

Recall that 𝑧_𝑗^{(𝑙)} = Σ_𝑖 𝑤_{𝑖𝑗}^{(𝑙)} 𝑎_𝑖^{(𝑙−1)} + 𝑏_𝑗^{(𝑙)}, so it can be deduced that ∂𝑧_𝑗^{(𝑙)}/∂𝑤_{𝑖𝑗}^{(𝑙)} = 𝑎_𝑖^{(𝑙−1)}.
Backpropagation

Defining 𝑒_𝑗^{(𝑙)} ≜ ∂𝐿/∂𝑧_𝑗^{(𝑙)}, the gradient of 𝐿 with respect to the weight is calculated as:

∂𝐿/∂𝑤_{𝑖𝑗}^{(𝑙)} = 𝑒_𝑗^{(𝑙)} 𝑎_𝑖^{(𝑙−1)}

Now consider the term ∂𝐿/∂𝑧_𝑗^{(𝑙)}; using the chain rule:

∂𝐿/∂𝑧_𝑗^{(𝑙)} = (∂𝐿/∂𝑎_𝑗^{(𝑙)}) (∂𝑎_𝑗^{(𝑙)}/∂𝑧_𝑗^{(𝑙)})
Backpropagation
Recall that the input of the 𝑘-th unit of the (𝑙 + 1)-th layer is 𝑧_𝑘^{(𝑙+1)} = 𝒘_𝑘^{(𝑙+1)} 𝒂^{(𝑙)} + 𝑏_𝑘^{(𝑙+1)}, and that 𝑎_𝑗^{(𝑙)} feeds into 𝑧_𝑘^{(𝑙+1)} for 𝑘 = 1, …, 𝑑^{(𝑙+1)}.

Then we can rewrite as follows:

∂𝐿/∂𝑎_𝑗^{(𝑙)} = Σ_{𝑘=1}^{𝑑^{(𝑙+1)}} (∂𝐿/∂𝑧_𝑘^{(𝑙+1)}) (∂𝑧_𝑘^{(𝑙+1)}/∂𝑎_𝑗^{(𝑙)}) = Σ_{𝑘=1}^{𝑑^{(𝑙+1)}} 𝑒_𝑘^{(𝑙+1)} 𝑤_{𝑗𝑘}^{(𝑙+1)} = 𝒘_{𝑗:}^{(𝑙+1)} 𝒆^{(𝑙+1)}

On the other hand, we have:

∂𝑎_𝑗^{(𝑙)}/∂𝑧_𝑗^{(𝑙)} = 𝑓′(𝑧_𝑗^{(𝑙)}), where 𝑓 is the activation function.
Backpropagation
Then we can rewrite as follows:

𝑒_𝑗^{(𝑙)} = ∂𝐿/∂𝑧_𝑗^{(𝑙)} = (𝒘_{𝑗:}^{(𝑙+1)} 𝒆^{(𝑙+1)}) 𝑓′(𝑧_𝑗^{(𝑙)})

Then:

∂𝐿/∂𝑤_{𝑖𝑗}^{(𝑙)} = (𝒘_{𝑗:}^{(𝑙+1)} 𝒆^{(𝑙+1)}) 𝑓′(𝑧_𝑗^{(𝑙)}) 𝑎_𝑖^{(𝑙−1)}

In the same manner, we have:

∂𝐿/∂𝑏_𝑗^{(𝑙)} = 𝑒_𝑗^{(𝑙)}
Backpropagation
Summary
(Unit by unit)
Perform forward propagation for each input and store the activation values 𝒂^{(𝑙)}.

At each unit of the output layer, calculate: 𝑒_𝑗^{(𝐿)} = ∂𝐿/∂𝑧_𝑗^{(𝐿)}

The gradients are: ∂𝐿/∂𝑤_{𝑖𝑗}^{(𝐿)} = 𝑎_𝑖^{(𝐿−1)} 𝑒_𝑗^{(𝐿)} and ∂𝐿/∂𝑏_𝑗^{(𝐿)} = 𝑒_𝑗^{(𝐿)}

At the 𝑙-th layer, calculate: 𝑒_𝑗^{(𝑙)} = (𝒘_{𝑗:}^{(𝑙+1)} 𝒆^{(𝑙+1)}) 𝑓′(𝑧_𝑗^{(𝑙)})

The gradients are: ∂𝐿/∂𝑤_{𝑖𝑗}^{(𝑙)} = 𝑒_𝑗^{(𝑙)} 𝑎_𝑖^{(𝑙−1)} and ∂𝐿/∂𝑏_𝑗^{(𝑙)} = 𝑒_𝑗^{(𝑙)}
Backpropagation
Summary
(In vector form)
Perform forward propagation for each input and store the activation values 𝒂^{(𝑙)}.

At the output layer, calculate: 𝒆^{(𝐿)} = ∂𝐿/∂𝒛^{(𝐿)}

The gradients are: ∂𝐿/∂𝑾^{(𝐿)} = 𝒂^{(𝐿−1)} 𝒆^{(𝐿)} and ∂𝐿/∂𝒃^{(𝐿)} = 𝒆^{(𝐿)}

At the 𝑙-th layer, calculate: 𝒆^{(𝑙)} = (𝑾^{(𝑙+1)} 𝒆^{(𝑙+1)}) ∗ 𝑓′(𝒛^{(𝑙)}), where ∗ is the element-wise product.

The gradients are: ∂𝐿/∂𝑾^{(𝑙)} = 𝒆^{(𝑙)} 𝒂^{(𝑙−1)} and ∂𝐿/∂𝒃^{(𝑙)} = 𝒆^{(𝑙)}
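A compact NumPy sketch of this vector-form procedure for a fully connected network, assuming tanh activations and a squared-error loss on a single example (the transposes are written out so the matrix shapes match the convention 𝒛^{(𝑙)} = 𝑾^{(𝑙)} 𝒂^{(𝑙−1)} + 𝒃^{(𝑙)}; it is an illustration under these assumptions, not the lecture's code):

import numpy as np

def f(z):  return np.tanh(z)
def fp(z): return 1.0 - np.tanh(z) ** 2      # f'(z)

def backprop(x, y, Ws, bs):
    # Forward pass: store z^(l) and a^(l)
    a, zs, acts = x, [], [x]
    for W, b in zip(Ws, bs):
        z = W @ a + b
        a = f(z)
        zs.append(z); acts.append(a)

    # Output layer: e^(L) = dL/dz^(L) for L = ||a^(L) - y||^2
    e = 2.0 * (acts[-1] - y) * fp(zs[-1])

    grads_W, grads_b = [], []
    for l in reversed(range(len(Ws))):
        grads_W.insert(0, np.outer(e, acts[l]))   # dL/dW^(l) = e^(l) (a^(l-1))^T
        grads_b.insert(0, e)                      # dL/db^(l) = e^(l)
        if l > 0:
            e = (Ws[l].T @ e) * fp(zs[l - 1])     # e^(l-1) = (W^(l)^T e^(l)) * f'(z^(l-1))
    return grads_W, grads_b

# Hypothetical 2-3-1 network, just to check the shapes.
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(3, 2)), rng.normal(size=(1, 3))]
bs = [np.zeros(3), np.zeros(1)]
gW, gb = backprop(np.array([0.5, -1.0]), np.array([1.0]), Ws, bs)
print([g.shape for g in gW])   # [(3, 2), (1, 3)]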

Multiclass Classification
[Figure: an MLP classifier with a ReLU hidden layer and a softmax output layer.]

Multiclass Classification

Forward propagation
𝒁^{(1)} = 𝑾^{(1)} 𝑿 + 𝒃^{(1)}
𝑨^{(1)} = max(𝒁^{(1)}, 0)
𝒁^{(2)} = 𝑾^{(2)} 𝑨^{(1)} + 𝒃^{(2)}
Ŷ = 𝑨^{(2)} = softmax(𝒁^{(2)})
Loss function

𝐽(𝑾, 𝒃; 𝑿, 𝒀) = −(1/𝑁) Σ_{𝑖=1}^{𝑁} Σ_{𝑗=1}^{𝐶} 𝑦_{𝑗𝑖} log(ŷ_{𝑗𝑖})
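A NumPy sketch of this forward pass and loss (shapes and initial values are assumptions; examples are stored as columns so that 𝒁 = 𝑾𝑿 + 𝒃 matches the equations above):

import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=0, keepdims=True)      # subtract the max for numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=0, keepdims=True)

def forward(X, W1, b1, W2, b2):
    Z1 = W1 @ X + b1                 # Z^(1)
    A1 = np.maximum(Z1, 0.0)         # A^(1) = max(Z^(1), 0)  (ReLU)
    Z2 = W2 @ A1 + b2                # Z^(2)
    return Z1, A1, Z2, softmax(Z2)   # Y_hat = softmax(Z^(2))

def cross_entropy(Y, Y_hat, eps=1e-12):
    # J = -(1/N) * sum_i sum_j y_ji * log(y_hat_ji), with Y one-hot of shape (C, N)
    return -np.sum(Y * np.log(Y_hat + eps)) / Y.shape[1]

# Hypothetical sizes: d = 4 features, 5 hidden units, C = 3 classes, N = 8 examples.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Y = np.eye(3)[:, rng.integers(0, 3, size=8)]          # one-hot labels, shape (3, 8)
W1, b1 = 0.1 * rng.normal(size=(5, 4)), np.zeros((5, 1))
W2, b2 = 0.1 * rng.normal(size=(3, 5)), np.zeros((3, 1))
Z1, A1, Z2, Y_hat = forward(X, W1, b1, W2, b2)
print(cross_entropy(Y, Y_hat))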

Multiclass Classification
Backpropagation
𝒆^{(2)} = ∂𝐽/∂𝒁^{(2)} = (1/𝑁)(Ŷ − 𝒀),  ∂𝐽/∂𝑾^{(2)} = 𝑨^{(1)} 𝒆^{(2)},  ∂𝐽/∂𝒃^{(2)} = Σ_{𝑛=1}^{𝑁} 𝒆_𝑛^{(2)}

𝒆^{(1)} = (𝑾^{(2)} 𝒆^{(2)}) ∗ 𝑓′(𝒁^{(1)}),  ∂𝐽/∂𝑾^{(1)} = 𝑨^{(0)} 𝒆^{(1)} = 𝑿 𝒆^{(1)}
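Continuing the forward-pass sketch above, these gradients can be written as follows (still an illustration; the transposes make the shapes consistent with the column-per-example layout):

def backward(X, Y, Z1, A1, Y_hat, W2):
    N = X.shape[1]
    E2  = (Y_hat - Y) / N                   # e^(2) = dJ/dZ^(2)
    dW2 = E2 @ A1.T                         # dJ/dW^(2)
    db2 = E2.sum(axis=1, keepdims=True)     # dJ/db^(2) = sum_n e_n^(2)
    E1  = (W2.T @ E2) * (Z1 > 0)            # e^(1) = (W^(2) e^(2)) * f'(Z^(1)), ReLU derivative
    dW1 = E1 @ X.T                          # dJ/dW^(1), with A^(0) = X
    db1 = E1.sum(axis=1, keepdims=True)
    return dW1, db1, dW2, db2

# With the variables from the forward sketch:
# dW1, db1, dW2, db2 = backward(X, Y, Z1, A1, Y_hat, W2)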

Next discussion: Multiclass Regression with MLP
