Chapter 2

Fundamentals of Deep Learning


Multilayer Perceptrons

Number of layers: $L$ = number of hidden layers + 1

[Figure: an MLP with an input layer, two hidden layers (hidden layer 1 and hidden layer 2), and an output layer; each node is a unit.]

[Figure: notation diagram labeling, for the $j$th unit of the $l$th layer, the unit's input and output, the weight matrix, the bias, the bias vector, the input vector, the output vector, and the activation function.]

Convention: bold symbols denote vectors or matrices; regular (non-bold) symbols denote scalars.

$w_{ij}^{l}$ is the weight of the connection between the $i$th unit of the $(l-1)$th layer and the $j$th unit of the $l$th layer.
$b_i^{l}$ is the bias of the $i$th unit of the $l$th layer.
$\mathbf{W}$ – the weight matrix.
$\mathbf{b}$ – the bias vector.

What does "learning" mean here?


Example: a single-layer perceptron


Input vector:
$\mathbf{x} = [x_0\; x_1\; x_2\; x_3\; x_4]^T$ (here $x_0$ is the bias input)

Weight vector:
$\mathbf{w} = [w_0\; w_1\; w_2\; w_3\; w_4]^T$

Dimension: $d = 4$ (the input has 4 dimensions, not counting the bias)
Unit output: $z$
Activation function: $\mathrm{sign}$


Flow in a Neural Network

$\mathbf{x} = [x_1\; x_2]^T$

[Figure: a 2–2–1 network with inputs $x_1, x_2$, hidden units $n_1, n_2$ (activation $f_1$), and one output unit (activation $f_2$).]

$a_1 = f_1(w_{1,1}^{0} x_1 + w_{2,1}^{0} x_2 + b_1^{0})$
$a_2 = f_1(w_{1,2}^{0} x_1 + w_{2,2}^{0} x_2 + b_2^{0})$
$\hat{y} = f_2(w_{1,1}^{1} a_1 + w_{2,1}^{1} a_2 + b^{1})$
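To make the flow concrete, here is a minimal NumPy sketch of this forward pass. The weight values, the tanh hidden activation $f_1$, and the linear output activation $f_2$ are illustrative assumptions, not values from the slides.

```python
import numpy as np

# W0[i, j] = w_{i+1, j+1}^0: weight from input i to hidden unit j (made-up values).
W0 = np.array([[0.5, -0.3],
               [0.2,  0.8]])
b0 = np.array([0.1, -0.1])      # hidden-layer biases b_1^0, b_2^0
W1 = np.array([0.7, -0.4])      # hidden -> output weights w_{1,1}^1, w_{2,1}^1
b1 = 0.05                       # output bias b^1

f1 = np.tanh                    # hidden activation (an assumed choice)
f2 = lambda s: s                # linear output activation (an assumed choice)

x = np.array([1.0, 2.0])        # input [x1, x2]
a = f1(x @ W0 + b0)             # a1, a2 exactly as in the equations above
y_hat = f2(a @ W1 + b1)
print(y_hat)
```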


Backpropagation
Backpropagation, short for "backward propagation of errors," is an algorithm for the supervised learning of artificial neural networks using gradient descent: the error is propagated backward through the network via derivatives of the loss.

The loss (error) function:

$$E(\mathbf{X}, \boldsymbol{\theta}) = \frac{1}{2N} \sum_{i=1}^{N} \left(\hat{y}_i - y_i\right)^2 \qquad \text{(mean squared error)}$$

$y_i$ – the target output (labeled data)
$\hat{y}_i$ – the predicted output of the network on the input $\mathbf{x}_i$
$N$ – the number of input–output pairs in the set $\{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)\}$
$\boldsymbol{\theta} \triangleq \mathbf{w} \cup \mathbf{b}$ – the set of all weights and biases
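As a quick sketch, the loss above is one line of NumPy (the prediction and target arrays here are made up for illustration):

```python
import numpy as np

def mse_loss(y_hat, y):
    """E(X, theta) = 1/(2N) * sum_i (y_hat_i - y_i)^2."""
    return np.sum((y_hat - y) ** 2) / (2 * len(y))

y = np.array([1.0, 2.0, 3.0])        # targets (made-up example)
y_hat = np.array([1.1, 1.9, 3.2])    # network predictions (made-up example)
print(mse_loss(y_hat, y))            # 0.01
```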


Gradient Descent

Local minimum: solve $f'(x) = 0$ to find $x^*$.

At any $x$, $f'(x)$ is the slope of the tangent to $f$ at that point.

Our objective is to find $\boldsymbol{\theta}$ such that the loss function $E(\mathbf{X}, \boldsymbol{\theta})$ is minimized:

$$E(\mathbf{X}, \boldsymbol{\theta}) = \frac{1}{2N} \sum_{i=1}^{N} \left(\hat{y}_i - y_i\right)^2$$

If $f'(x(t)) > 0$, then $x(t)$ is to the right of $x^*$.
If $f'(x(t)) < 0$, then $x(t)$ is to the left of $x^*$.
The new value of $x$ should therefore be obtained as:

$$x(t+1) = x(t) - \eta\, f'(x(t))$$

$\eta$ – the step size (learning rate)

The parameters are updated as follows:

$$\boldsymbol{\theta}(t+1) = \boldsymbol{\theta}(t) - \eta \nabla_{\boldsymbol{\theta}} E(\mathbf{X}, \boldsymbol{\theta}(t))$$

$\eta$ – the learning rate

$$E(\mathbf{X}, \boldsymbol{\theta}) = \frac{1}{2N} \sum_{i=1}^{N} \left(\hat{y}_i - y_i\right)^2$$

For a linear model, this loss can be written compactly as:

$$J(\mathbf{w}) = \frac{1}{2N} \left\lVert \mathbf{y} - \bar{\mathbf{x}}\mathbf{w} \right\rVert^2$$
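The update rule is a short loop in code. A minimal sketch, using a stand-in quadratic objective in place of $E$ (the learning rate and iteration count are illustrative):

```python
import numpy as np

def gradient_descent(grad, theta0, eta=0.1, n_iters=100):
    """Iterate theta(t+1) = theta(t) - eta * grad(theta(t))."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iters):
        theta = theta - eta * grad(theta)
    return theta

# Stand-in objective f(theta) = (theta - 3)^2, with gradient 2 * (theta - 3).
grad = lambda theta: 2.0 * (theta - 3.0)
print(gradient_descent(grad, theta0=[0.0]))   # approaches [3.], the minimizer
```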

Numerical Gradient

Backward difference:
$$f'(x_0) \approx \frac{f(x_0) - f(x_0 - \varepsilon)}{\varepsilon}$$

Forward difference:
$$f'(x_0) \approx \frac{f(x_0 + \varepsilon) - f(x_0)}{\varepsilon}$$

Central difference:
$$f'(x_0) \approx \frac{f(x_0 + \varepsilon) - f(x_0 - \varepsilon)}{2\varepsilon}$$
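These formulas are the standard way to check an analytic gradient numerically. A minimal sketch of the central difference, verified against a derivative we know:

```python
def numerical_derivative(f, x0, eps=1e-5):
    """Central difference: (f(x0 + eps) - f(x0 - eps)) / (2 * eps)."""
    return (f(x0 + eps) - f(x0 - eps)) / (2 * eps)

f = lambda x: x ** 2                  # known derivative: 2x
print(numerical_derivative(f, 3.0))   # ~6.0
```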


Gradient Descent with Momentum


Normal gradient descent: $\mathbf{w} = \mathbf{w} - \eta\, \nabla_{\mathbf{w}} E(\mathbf{w})$

If we define
$$v(t) \triangleq \eta\, \nabla_{\mathbf{w}} E(\mathbf{w})$$
then normal gradient descent reads $\mathbf{w}(t+1) = \mathbf{w}(t) - v(t)$.

Gradient descent with momentum keeps the update
$$\mathbf{w}(t+1) = \mathbf{w}(t) - v(t)$$
but replaces the velocity with
$$v(t) = \gamma\, v(t-1) + \eta\, \nabla_{\mathbf{w}} E(\mathbf{w})$$
in which $\gamma$ is typically chosen as 0.9.
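A minimal sketch of the momentum update, reusing the stand-in quadratic objective from the gradient descent sketch above (eta, gamma, and the iteration count are illustrative):

```python
import numpy as np

def gd_momentum(grad, w0, eta=0.1, gamma=0.9, n_iters=200):
    """w(t+1) = w(t) - v(t), where v(t) = gamma * v(t-1) + eta * grad(w(t))."""
    w = np.asarray(w0, dtype=float)
    v = np.zeros_like(w)
    for _ in range(n_iters):
        v = gamma * v + eta * grad(w)   # accumulate velocity
        w = w - v
    return w

grad = lambda w: 2.0 * (w - 3.0)        # minimum at w = 3
print(gd_momentum(grad, w0=[0.0]))      # overshoots, then settles near [3.]
```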


Stochastic Gradient Descent

• In typical gradient descent optimization, such as batch gradient descent, the batch is taken to be the whole dataset.
• Using the whole dataset moves toward the minimum in a less noisy, less random manner, but it becomes a problem when the dataset gets big: with a million samples, a single update of typical gradient descent must process every sample.
• In SGD, each iteration uses only a single sample, i.e., a batch size of one (see the sketch below).
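A minimal sketch of an SGD loop, one update per sample. The per-sample squared-error gradient and the toy data are assumptions for illustration:

```python
import numpy as np

def sgd(grad_i, w0, X, y, eta=0.05, n_epochs=200, seed=0):
    """One parameter update per sample (batch size of one)."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float)
    for _ in range(n_epochs):
        for i in rng.permutation(len(y)):        # shuffle every epoch
            w = w - eta * grad_i(w, X[i], y[i])  # step on a single sample
    return w

# Per-sample gradient of 0.5 * (w.x - y)^2 for a linear model.
grad_i = lambda w, x, y: (w @ x - y) * x

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])  # leading 1 = bias input
y = np.array([2.0, 3.0, 4.0])                       # exactly y = 1 + x
print(sgd(grad_i, w0=[0.0, 0.0], X=X, y=y))         # approaches [1., 1.]
```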

[Figure: comparison of the update paths of Batch Gradient Descent (smooth) and Stochastic Gradient Descent (noisy).]



Linear Regression

A given dataset is described as follows:

Height (cm)   Weight (kg)      Height (cm)   Weight (kg)
147           49               168           60
150           50               170           62
153           51               173           63
155           52               175           64
158           54               178           66
160           56               180           67
163           58               183           68
165           59


[Figure: scatter plot illustrating the relationship between the weight and the height in the dataset.]

$N = 13$
$d = 1$
$\mathbf{X} = [x_1, \ldots, x_N] = [147, 150, 153, 158, 163, 165, 168, \ldots, 183]$
$\mathbf{y} = [y_1, \ldots, y_N] = [49, 50, 51, 54, 58, 59, 60, \ldots, 68]$
$\mathbf{w} = [w_0\; w_1]^T$, $\bar{\mathbf{x}} = [x_0, x_1]$ with $x_0 = 1$

[Figure: a single unit with inputs $x_0, x_1$, weights $w_0, w_1$, and output $\hat{y}$.]

$$\hat{y} = f(\mathbf{w}^T \bar{\mathbf{x}}) = f(x_1 w_1 + w_0) = x_1 w_1 + w_0$$

Activation function: the linear (identity) function $f(s) = s$

The loss function:

$$E(\mathbf{w}) = \frac{1}{2N} \left\lVert \mathbf{y} - \bar{\mathbf{x}}\mathbf{w} \right\rVert^2$$

(Related notions: train error, validation error, test error. Underfitting – the train error is still high; overfitting – the model does well only on the training data; fitting – the model generalizes.)

The gradient is calculated as:

$$\nabla_{\mathbf{w}} E(\mathbf{w}) = \frac{1}{N}\, \bar{\mathbf{x}}^T \left( \bar{\mathbf{x}}\mathbf{w} - \mathbf{y} \right)$$

The weights are updated by:

$$\mathbf{w} = \mathbf{w} - \eta\, \nabla_{\mathbf{w}} E(\mathbf{w})$$

Loop until the loss converges. A sketch of this loop on the dataset above follows.
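Below is a minimal sketch of this loop on the 13 training pairs (the values elided by "..." above are completed from the dataset table earlier in the chapter). The height is standardized first so that a simple fixed learning rate converges; eta and the iteration count are illustrative choices:

```python
import numpy as np

x1 = np.array([147, 150, 153, 158, 163, 165, 168, 170, 173,
               175, 178, 180, 183], dtype=float)
y = np.array([49, 50, 51, 54, 58, 59, 60, 62, 63, 64, 66, 67, 68], dtype=float)

mu, sigma = x1.mean(), x1.std()
X_bar = np.column_stack([np.ones_like(x1), (x1 - mu) / sigma])  # [x0, x1 scaled]

w = np.zeros(2)
eta = 0.1
for _ in range(2000):
    grad = X_bar.T @ (X_bar @ w - y) / len(y)   # gradient of E(w)
    w -= eta * grad                             # w = w - eta * grad

# Undo the standardization to express y_hat = w1 * height + w0.
w1 = w[1] / sigma
w0 = w[0] - w[1] * mu / sigma
print(w0, w1)
```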

[Figure: the linear regression result on the dataset.]



Binary Classification

[Figure: two panels, "Given problem" and "Our goal" – a linearly separable two-class dataset, and the separating boundary we want to find.]

Matrix of inputs: $\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N] \in \mathbb{R}^{d \times N}$
$N$ – number of data points
$d$ – dimension of a point
Vector of outputs: $\mathbf{y} = [y_1, y_2, \ldots, y_N] \in \mathbb{R}^{N \times 1}$
If $\mathbf{x}_i$ belongs to the blue class, then $y_i = 1$; otherwise $y_i = -1$.


[Figure: a single unit with inputs $x_0, x_1, x_2$, weights $w_0, w_1, w_2$, and output $\hat{y}$.]

The predicted output of the network is determined by:

$$\hat{y} = f(x_1 w_1 + x_2 w_2 + w_0) = f(\mathbf{w}^T \bar{\mathbf{x}})$$

The value of $\hat{y}$ is either 1 or -1.

What should the activation function be?


[Figure: the same unit, now with the sign function as activation.]

The predicted output of the network is determined by:

$$\hat{y} = \operatorname{sgn}(x_1 w_1 + x_2 w_2 + w_0) = \operatorname{sgn}(\mathbf{w}^T \bar{\mathbf{x}})$$

Why?

$$\hat{y} = \operatorname{sgn}(x_1 w_1 + x_2 w_2 + w_0) = \operatorname{sgn}(\mathbf{w}^T \bar{\mathbf{x}})$$

The weight vector is initialized to 0.

A point $\bar{\mathbf{x}}_i$ is misclassified (i.e., belongs to the set $\mathbb{M}$) when:
$$\hat{y}_i \neq y_i \iff y_i \operatorname{sgn}(\mathbf{w}^T \bar{\mathbf{x}}_i) < 0$$

The 1st loss function:
$$E_1(\mathbf{w}) = \sum_{\bar{\mathbf{x}}_i \in \mathbb{M}} \left( -y_i \operatorname{sgn}(\mathbf{w}^T \bar{\mathbf{x}}_i) \right)$$

The 1st loss function:
$$E_1(\mathbf{w}) = \sum_{\bar{\mathbf{x}}_i \in \mathbb{M}} \left( -y_i \operatorname{sgn}(\mathbf{w}^T \bar{\mathbf{x}}_i) \right)$$

The 2nd loss function:
$$E(\mathbf{w}) = \sum_{\bar{\mathbf{x}}_i \in \mathbb{M}} \left( -y_i\, \mathbf{w}^T \bar{\mathbf{x}}_i \right)$$

The gradient of the loss at a misclassified point:
$$\nabla_{\mathbf{w}} E(\mathbf{w}, \bar{\mathbf{x}}_i, y_i) = -y_i \bar{\mathbf{x}}_i$$

The weight vector is updated by:
$$\mathbf{w} = \mathbf{w} + \eta\, y_i \bar{\mathbf{x}}_i$$
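A minimal sketch of the resulting perceptron algorithm (the toy data are made up; the slides do not fix eta or a stopping rule):

```python
import numpy as np

def perceptron(X_bar, y, eta=1.0, max_epochs=1000):
    """X_bar: (N, d+1) rows with a leading 1; y: labels in {-1, +1}."""
    w = np.zeros(X_bar.shape[1])                  # initialize weights to 0
    for _ in range(max_epochs):
        updated = False
        for x_i, y_i in zip(X_bar, y):
            if y_i * np.sign(w @ x_i) <= 0:       # misclassified point
                w = w + eta * y_i * x_i           # w = w + eta * y_i * x_bar_i
                updated = True
        if not updated:                           # no misclassified points left
            return w
    return w

# Toy linearly separable data: +1 above the line x2 = x1, -1 below it.
X = np.array([[0.0, 1.0], [1.0, 2.0], [1.0, 0.0], [2.0, 1.0]])
y = np.array([1, 1, -1, -1])
X_bar = np.column_stack([np.ones(len(y)), X])
print(perceptron(X_bar, y))
```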


Logistic Regression
Example: Probability of passing an exam versus hours of study.
A group of 20 students spends between 0 and 6 hours studying for an exam.
How does the number of hours spent studying affect the probability of the student
passing the exam?

Hours 0.50 0.75 1.00 1.25 1.50 1.75 1.75 2.00 2.25 2.50 2.75 3.00 3.25 3.50 4.00 4.25 4.50 4.75 5.00 5.50

Pass 0 0 0 0 0 0 1 0 1 0 1 0 1 0 1 1 1 1 1 1

Which activation function is suitable for this?

Sigmoid function

$$f(s) = \frac{1}{1 + e^{-s}}$$

Gradient:

$$f'(s) = \frac{e^{-s}}{\left(1 + e^{-s}\right)^2} = \frac{1}{1 + e^{-s}} \cdot \frac{e^{-s}}{1 + e^{-s}} = f(s)\left(1 - f(s)\right)$$

$\hat{y}_1 = f(\mathbf{w}^T \bar{\mathbf{x}}_i)$ – probability that the $i$th point belongs to class 1 (pass)
$\hat{y}_2 = 1 - f(\mathbf{w}^T \bar{\mathbf{x}}_i)$ – probability that the $i$th point belongs to class 0 (fail)
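Both formulas transcribe directly to code; the numerical check below reuses the central-difference idea from the gradient descent section:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def sigmoid_grad(s):
    f = sigmoid(s)
    return f * (1.0 - f)          # f'(s) = f(s) * (1 - f(s))

eps, s = 1e-5, 0.3
print(sigmoid_grad(s))                                    # analytic
print((sigmoid(s + eps) - sigmoid(s - eps)) / (2 * eps))  # numerical check
```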

For $z_i = f(\mathbf{w}^T \bar{\mathbf{x}}_i)$:

$$P(\mathbf{w}, \bar{\mathbf{x}}_i, y_i) = z_i^{y_i} \left(1 - z_i\right)^{1 - y_i}$$

If $y_i = 1$ then $P(\mathbf{w}, \bar{\mathbf{x}}_i, y_i) = z_i$
If $y_i = 0$ then $P(\mathbf{w}, \bar{\mathbf{x}}_i, y_i) = 1 - z_i$

$$\mathbf{w} = \underset{\mathbf{w}}{\operatorname{argmax}}\; P(\mathbf{w}, \bar{\mathbf{x}}_i, y_i)$$

The loss function (the negative log-likelihood, i.e., cross-entropy):

$$E(\mathbf{w}, \bar{\mathbf{x}}_i, y_i) = -\sum_{i=1}^{N} \left[ y_i \log z_i + \left(1 - y_i\right) \log\left(1 - z_i\right) \right]$$
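A minimal sketch of logistic regression on the hours/pass data above, trained by gradient descent on this loss. The gradient $\nabla_{\mathbf{w}} E = \sum_i (z_i - y_i)\,\bar{\mathbf{x}}_i$ follows from $f'(s) = f(s)(1 - f(s))$; eta and the iteration count are illustrative choices:

```python
import numpy as np

hours = np.array([0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 1.75, 2.00, 2.25, 2.50,
                  2.75, 3.00, 3.25, 3.50, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50])
passed = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1])

X_bar = np.column_stack([np.ones_like(hours), hours])   # x_bar = [1, hours]
w = np.zeros(2)
eta = 0.05
for _ in range(100_000):
    z = 1.0 / (1.0 + np.exp(-(X_bar @ w)))              # z_i = f(w^T x_bar_i)
    w -= eta * X_bar.T @ (z - passed) / len(passed)     # averaged gradient step
print(w)   # learned [w0, w1]; P(pass | hours) = sigmoid(w0 + w1 * hours)
```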
