EE2211 Introduction To Machine Learning
Lecture 8
Thomas Yeo
thomas.yeo@nus.edu.sg
Fundamental ML Algorithms:
Optimization, Gradient Descent
Review
• Supervised learning: given feature(s) x, we want to predict the target y
• Most supervised learning algorithms can be formulated as the following optimization problem:
  minimize over w:   Cost(w) = Data-Loss(w) + Reg(w)
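To make this concrete, here is a minimal NumPy sketch of such a cost for a linear model, assuming a squared-error data loss and an L2 regularizer; the names (X, y, w, lam) and these particular choices of loss and regularizer are illustrative, not the only options in this module.

```python
import numpy as np

def cost(w, X, y, lam=0.1):
    """Cost(w) = Data-Loss(w) + Reg(w) for a linear model y_hat = X @ w.

    Assumes a squared-error data loss and an L2 regularizer weighted by lam;
    other losses and regularizers from the lecture can be swapped in.
    """
    y_hat = X @ w                          # predictions of the linear model
    data_loss = np.sum((y_hat - y) ** 2)   # squared-error data loss
    reg = lam * np.sum(w ** 2)             # L2 regularization term
    return data_loss + reg

# Tiny usage example with made-up numbers
X = np.array([[1.0, 2.0],
              [1.0, 3.0],
              [1.0, 5.0]])   # first column of ones acts as the bias/offset
y = np.array([1.0, 2.0, 4.0])
print(cost(np.zeros(2), X, y))   # cost of the all-zero parameter vector
```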
Review
• For a linear model with two features, the prediction for the i-th sample is
  y_hat_i = w0 + w1 * x_i1 + w2 * x_i2
  where w0 is the bias/offset, x_i1 is feature 1 of the i-th sample, and x_i2 is feature 2 of the i-th sample
• In general, y_hat_i = w0 + w1 * x_i1 + ... + wd * x_id, where w0 is the bias/offset and x_ij is feature j of the i-th sample
Loss Function & Learning Model
Building Blocks of ML algorithms
• Learning model reflects our belief about the relationship between the features & the target
• Loss function is the penalty for predicting y_hat when the true value is y
• Regularization encourages less complex models
• Cost function is the final optimization criterion we want to minimize
• Optimization routine finds the solution that minimizes the cost function
Motivation for Gradient Descent
• Different learning functions, loss functions & regularizations give rise to different learning algorithms
• In polynomial regression (previous lectures), the optimal w can be written with the following "closed-form" formula (primal solution):
  w = (XᵀX)⁻¹ Xᵀ y
• For other models, losses, and regularizers there is often no closed-form solution, so we need an iterative optimization routine such as gradient descent
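A minimal NumPy sketch of this primal solution, assuming a plain least-squares fit with an invertible XᵀX (the made-up data below is only for illustration; with ridge regularization one would solve (XᵀX + λI)w = Xᵀy instead):

```python
import numpy as np

# Polynomial regression via the closed-form (primal) least-squares solution.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 5.0, 10.0])

degree = 2
X = np.vander(x, degree + 1, increasing=True)   # columns: 1, x, x^2

w = np.linalg.solve(X.T @ X, X.T @ y)           # w = (X^T X)^(-1) X^T y
print(w)       # fitted polynomial coefficients
print(X @ w)   # predictions on the training inputs
```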
Questions?
Gradient Descent Algorithm
• Goal: find the w that minimizes the cost function C(w)
• Initialize w at some starting value w(0) and choose a learning rate η > 0
• At each iteration, update the parameters in the direction opposite to the gradient:
  w(t+1) = w(t) − η ∇C(w(t))
• According to multi-variable calculus, if η is not too big, then C(w(t+1)) ≤ C(w(t)) => we get better after each iteration
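Why a small enough η guarantees improvement can be sketched with a first-order Taylor expansion of the cost around the current iterate; this supporting derivation is standard calculus rather than text taken from the slides:

```latex
% Gradient-descent step: w^{t+1} = w^t - \eta \nabla C(w^t).
% First-order Taylor expansion of C around w^t:
C(w^{t+1}) \approx C(w^t) - \eta \,\nabla C(w^t)^\top \nabla C(w^t)
           = C(w^t) - \eta \,\lVert \nabla C(w^t) \rVert^2
           \le C(w^t)
% For sufficiently small \eta > 0 the neglected higher-order terms are negligible,
% so the cost does not increase; for a large \eta the approximation (and hence
% the guarantee) breaks down, which is why too big a learning rate may diverge.
```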
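A minimal Python sketch of the full loop; the cost, data, and parameter names below (X, y, eta, num_iters) are illustrative assumptions, using the same linear least-squares setting as the earlier sketch:

```python
import numpy as np

def gradient_descent(grad, w0, eta=0.01, num_iters=1000):
    """Generic gradient descent: w <- w - eta * grad(w) at every iteration."""
    w = np.asarray(w0, dtype=float)
    for _ in range(num_iters):
        w = w - eta * grad(w)   # step opposite to the gradient direction
    return w

# Usage example: minimize the squared-error cost ||X w - y||^2 of a linear model.
X = np.array([[1.0, 2.0],
              [1.0, 3.0],
              [1.0, 5.0]])
y = np.array([1.0, 2.0, 4.0])
grad = lambda w: 2 * X.T @ (X @ w - y)   # gradient of ||X w - y||^2

w_star = gradient_descent(grad, w0=np.zeros(2), eta=0.01, num_iters=5000)
print(w_star)      # approaches the least-squares solution [-1, 1] for this data
print(X @ w_star)  # predictions on the training inputs
```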
Questions?
Find w to minimize f(w)
• The minimum is obvious by inspection, but let's do gradient descent anyway
  – Compute the gradient f'(w)
  – Initialize w and choose a learning rate η
  – At each iteration, update w ← w − η f'(w)
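The specific function and numbers used on these slides are not recoverable from this text, so the sketch below traces the same recipe on a hypothetical one-dimensional cost f(w) = w², whose minimum is obviously at w = 0:

```python
# Hypothetical 1-D example: minimize f(w) = w^2 by gradient descent.
f = lambda w: w ** 2
grad = lambda w: 2 * w       # derivative of f

w = 4.0                      # initial value (assumed)
eta = 0.1                    # learning rate (assumed)
for t in range(10):
    w = w - eta * grad(w)    # gradient descent update
    print(f"iter {t + 1}: w = {w:.4f}, f(w) = {f(w):.4f}")
# Each update multiplies w by (1 - 2*eta) = 0.8, so w shrinks steadily toward 0.
```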
What if the learning rate is too small?
• Same example as before: compute the gradient, initialize w, pick a learning rate η, and update w ← w − η f'(w) at each iteration
• If η is too small, each update barely changes w, so gradient descent still heads toward the minimum but converges very slowly
What if the learning rate is too big?
• Same example again, but with a large η: each update overshoots the minimum, the iterates oscillate or grow, and gradient descent may not converge
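Continuing the hypothetical f(w) = w² example, both failure modes are easy to see by just changing η; the particular rates below are assumptions chosen to illustrate the point:

```python
# Hypothetical f(w) = w^2 again: compare different learning rates.
grad = lambda w: 2 * w

def run(eta, steps=20, w0=4.0):
    w = w0
    for _ in range(steps):
        w = w - eta * grad(w)
    return w

print(run(eta=0.001))   # too small: after 20 steps w is still close to 4 (very slow)
print(run(eta=0.4))     # moderate:  w is essentially 0 after 20 steps
print(run(eta=1.1))     # too big:   |w| grows every step, so the iterates diverge
```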
Questions?
Different Learning Models
• Examples: linear, polynomial, sigmoid, ReLU, exponential
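A minimal sketch of what some of these model forms look like as functions of a scalar input; the linear, sigmoid, ReLU, and exponential forms below are the standard ones, with w and b as illustrative parameters:

```python
import numpy as np

# Standard forms of several learning models, for a scalar input x
# with illustrative parameters w (weight) and b (bias/offset).
def linear(x, w, b):
    return w * x + b

def sigmoid(x, w, b):
    return 1.0 / (1.0 + np.exp(-(w * x + b)))   # squashes the output into (0, 1)

def relu(x, w, b):
    return np.maximum(0.0, w * x + b)            # zero whenever w*x + b < 0

def exponential(x, w, b):
    return np.exp(w * x + b)                     # always positive

x = np.linspace(-3.0, 3.0, 7)
print(relu(x, w=1.0, b=0.0))      # [0, 0, 0, 0, 1, 2, 3]
print(sigmoid(x, w=1.0, b=0.0))   # values between 0 and 1
```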
Different Loss Functions
• Examples: squared error, binary, logistic loss
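A minimal sketch of these losses for a single prediction; the definitions below are the standard ones (and the y ∈ {−1, +1} label convention for the logistic loss is an assumption about the notation used on the slides):

```python
import numpy as np

# Standard single-sample losses; y_hat is the prediction, y the true value.
def squared_error_loss(y_hat, y):
    return (y_hat - y) ** 2

def binary_loss(y_hat, y):
    # 0/1 loss: penalty of 1 for a wrong prediction, 0 otherwise
    return float(y_hat != y)

def logistic_loss(score, y):
    # Assumes labels y in {-1, +1} and a real-valued model output (score)
    return np.log(1.0 + np.exp(-y * score))

print(squared_error_loss(2.5, 3.0))   # 0.25
print(binary_loss(1, 0))              # 1.0
print(logistic_loss(2.0, +1))         # ~0.13: confident, correct prediction
```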
Questions?
Summary
• Building blocks of machine learning algorithms
  – Learning model: reflects our belief about the relationship between the features & the target we want to predict
  – Loss function: penalty for a wrong prediction
  – Regularization: penalizes complex models
  – Optimization routine: finds the minimum of the overall cost function
• Gradient descent algorithm
  – At each iteration, compute the gradient & update the model parameters in the direction opposite to the gradient
  – If the learning rate is too big => may not converge
  – If the learning rate is too small => converges very slowly
• Different learning models, e.g., linear, polynomial, sigmoid, ReLU, exponential, etc.
• Different loss functions, e.g., squared error, binary, logistic, etc.