EE2211 Introduction To Machine Learning

Lecture 8

Thomas Yeo
thomas.yeo@nus.edu.sg

Electrical and Computer Engineering Department
National University of Singapore

Acknowledgement: EE2211 development team
Thomas Yeo, Kar-Ann Toh, Chen Khong Tham, Helen Zhou, Robby Tan & Haizhou

© Copyright EE, NUS. All Rights Reserved.

Course Contents
• Introduction and Preliminaries (Haizhou)
– Introduction
– Data Engineering
– Introduction to Probability and Statistics
• Fundamental Machine Learning Algorithms I (Helen)
– Systems of linear equations
– Least squares, Linear regression
– Ridge regression, Polynomial regression
• Fundamental Machine Learning Algorithms II (Thomas)
– Over-fitting, bias/variance trade-off
– Optimization, Gradient descent
– Decision Trees, Random Forest
• Performance and More Algorithms (Haizhou)
– Performance Issues
– K-means Clustering
– Neural Networks

Fundamental ML Algorithms:
Optimization, Gradient Descent

Module III Contents


• Overfitting, underfitting and model complexity
• Regularization
• Bias-variance trade-off
• Loss function
• Optimization
• Gradient descent
• Decision trees
• Random forest

Review
• Supervised learning: given feature(s) x, we want to predict target y
• Most supervised learning algorithms can be formulated as the following optimization problem:

      min_w  Data-Loss(w) + λ · Regularization(w)

• Data-Loss(w) quantifies the fitting error to the training set given parameters w: smaller error => better fit to the training data
• Regularization(w) penalizes more complex models
• For example, in the case of polynomial (ridge) regression (previous lectures):

      min_w  ‖Pw − y‖²  +  λ‖w‖²
             Data-Loss(w)   Reg(w)
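As a concrete sketch of the "data loss + regularization" cost, here is a minimal numpy version for polynomial ridge regression; the data points, polynomial order, and λ value are made up for illustration:

```python
import numpy as np

# Hypothetical 1-D training set (illustrative values only)
x = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
y = np.array([1.0, 1.8, 3.2, 5.9, 9.1])

order = 3
P = np.vander(x, order + 1, increasing=True)  # columns: 1, x, x^2, x^3

lam = 0.1  # regularization weight (lambda), chosen arbitrarily here

def cost(w):
    data_loss = np.sum((P @ w - y) ** 2)  # Data-Loss(w): squared fitting error
    reg = np.sum(w ** 2)                  # Regularization(w): penalizes large w
    return data_loss + lam * reg
```

With w = 0 the model predicts nothing, so the cost is just the total squared magnitude of the targets; any w that fits the data better lowers the first term.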
Review
• ŷᵢ = f(xᵢ; w) is the prediction for the i-th training sample; yᵢ is the target of the i-th training sample
• For a linear model with two features:

      ŷᵢ = w₀ + w₁·xᵢ₁ + w₂·xᵢ₂

  where w₀ is the bias/offset, xᵢ₁ is feature 1 of the i-th sample and xᵢ₂ is feature 2 of the i-th sample
• More generally, with d features:

      ŷᵢ = w₀ + w₁·xᵢ₁ + … + w_d·xᵢ_d

  where w₀ is the bias/offset and xᵢⱼ is feature j of the i-th sample
Loss Function & Learning Model
• Each piece of the optimization problem has a name:

      C(w)  =  Σᵢ L( f(xᵢ; w), yᵢ )  +  λ · Reg(w)

      C(w): Cost Function        L(ŷ, y): Loss Function
      f(xᵢ; w): Learning Model   Reg(w): Regularization
Building Blocks of ML algorithms
• Learning model f(x; w) reflects our belief about the relationship between the features & the target
• Loss function L(ŷ, y) is the penalty for predicting ŷ when the true value is y
• Regularization Reg(w) encourages less complex models
• Cost function C(w) is the final optimization criterion we want to minimize
• Optimization routine finds the solution that minimizes the cost function
Motivation for Gradient Descent
• Different learning functions f, loss functions L & regularization terms give rise to different learning algorithms
• In polynomial (ridge) regression (previous lectures), the optimal w can be written with the following “closed-form” formula (primal solution):

      w = (PᵀP + λI)⁻¹ Pᵀ y

• For other learning functions, loss functions & regularization terms, optimizing C(w) might not be so easy
• We usually have to estimate w iteratively with some algorithm
• The optimization workhorse for modern machine learning is gradient descent
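The closed-form primal solution can be computed directly with numpy; the synthetic data below is hypothetical, and `np.linalg.solve` is used instead of an explicit matrix inverse for numerical stability:

```python
import numpy as np

# Hypothetical noisy quadratic data; P is the polynomial feature matrix
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=20)
y = 2.0 + 3.0 * x - x ** 2 + 0.05 * rng.standard_normal(20)
P = np.vander(x, 3, increasing=True)  # columns: 1, x, x^2
lam = 1e-3

# Primal (closed-form) ridge solution: w = (P^T P + lam*I)^(-1) P^T y
w = np.linalg.solve(P.T @ P + lam * np.eye(P.shape[1]), P.T @ y)
```

A quick sanity check: at this w the gradient of the ridge cost, 2·Pᵀ(Pw − y) + 2λw, should be essentially zero.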
Questions?

Gradient Descent Algorithm
• Goal: find the w that minimizes the cost C(w)
• Initialize w⁽⁰⁾ and pick a learning rate η > 0
• At each iteration t, update the parameters in the direction opposite to the gradient:

      w⁽ᵗ⁺¹⁾ = w⁽ᵗ⁾ − η ∇C(w⁽ᵗ⁾)

• According to multi-variable calculus, if η is not too big, then C(w⁽ᵗ⁺¹⁾) ≤ C(w⁽ᵗ⁾) => we get a better w after each iteration
Gradient Descent Algorithm
• Possible convergence criteria
– Set a maximum number of iterations
– Check that the percentage or absolute change in w is below a threshold
– Check that the percentage or absolute change in C(w) is below a threshold
• Gradient descent can only find a local minimum
– Because the gradient = 0 at a local minimum, so w won’t change after that
[Figure: a cost curve with a local minimum and the global minimum marked]
• Many variations of gradient descent, e.g., change how the gradient is computed, or let the learning rate η decrease with increasing iteration t
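The update rule and one of the convergence criteria above (change in w below a threshold, plus a maximum iteration count) can be sketched as a generic loop; the quadratic test cost at the bottom is an illustrative choice, not from the slides:

```python
import numpy as np

def gradient_descent(grad, w0, eta=0.1, tol=1e-8, max_iter=10_000):
    """Generic gradient descent: w <- w - eta * grad(w).
    Stops at max_iter, or when the change in w falls below tol."""
    w = np.asarray(w0, dtype=float)
    for _ in range(max_iter):
        w_new = w - eta * grad(w)
        if np.linalg.norm(w_new - w) < tol:  # convergence criterion on w
            return w_new
        w = w_new
    return w

# Illustrative cost: C(w) = (w0 - 1)^2 + (w1 + 2)^2, gradient 2*(w - [1, -2])
w_star = gradient_descent(lambda w: 2 * (w - np.array([1.0, -2.0])),
                          w0=[0.0, 0.0], eta=0.1)
```

For this convex toy cost the loop converges to the unique (global) minimum at (1, −2); on a non-convex cost the same loop would stop at whichever local minimum it reaches first.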
Questions?

Find w to minimize C(w)
• Obviously the minimum corresponds to the point where the gradient is zero, but let’s do gradient descent
– Gradient: ∇C(w)
– Initialize w⁽⁰⁾, pick learning rate η
– At each iteration, w⁽ᵗ⁺¹⁾ = w⁽ᵗ⁾ − η ∇C(w⁽ᵗ⁾)

η is just nice => rapid convergence
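The notes do not show the actual function used on this slide, so as a stand-in take C(w) = w² (minimum at w = 0, gradient 2w); the initial point and learning rate below are hypothetical:

```python
# Toy stand-in for the slide's example: C(w) = w^2, gradient 2w, minimum at 0
def grad(w):
    return 2.0 * w

w, eta = 4.0, 0.4          # hypothetical initial point and learning rate
trace = [w]
for _ in range(10):
    w = w - eta * grad(w)  # each step multiplies w by (1 - 2*eta) = 0.2
    trace.append(w)
```

Each iteration shrinks w by a factor of 5 (4.0, 0.8, 0.16, ...), so within ten steps w is within about 10⁻⁶ of the minimum: a "just nice" η gives rapid convergence.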
What if learning rate is too small?
• Same example, but run gradient descent with a much smaller learning rate η
– At each iteration, w⁽ᵗ⁺¹⁾ = w⁽ᵗ⁾ − η ∇C(w⁽ᵗ⁾)

η too small => slow convergence
What if learning rate is too big?
• Same example, but run gradient descent with a much bigger learning rate η
– At each iteration, w⁽ᵗ⁺¹⁾ = w⁽ᵗ⁾ − η ∇C(w⁽ᵗ⁾)

η too big => overshoot the local minimum => slow convergence, or may never converge
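The three learning-rate regimes can be seen side by side on the same toy cost C(w) = w² (a stand-in, since the slides' function is not shown); the specific η values are illustrative:

```python
def run_gd(eta, w0=4.0, iters=50):
    """Gradient descent on the toy cost C(w) = w^2 (gradient 2w).
    Each step multiplies w by (1 - 2*eta)."""
    w = w0
    for _ in range(iters):
        w = w - eta * 2.0 * w
    return w

w_small = run_gd(eta=0.01)  # too small: barely moved after 50 steps
w_nice  = run_gd(eta=0.4)   # just nice: essentially at the minimum
w_big   = run_gd(eta=1.1)   # too big: |1 - 2*eta| > 1, overshoots and diverges
```

The per-step factor (1 − 2η) explains all three cases: near 1 means slow progress, near 0 means rapid convergence, and magnitude above 1 means each overshoot is bigger than the last, so the iteration never converges.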
Questions?

Different Learning Models
• The learning model f(x; w) can take many forms, e.g., linear, polynomial, sigmoid, ReLU, exponential
[Figures: plots of the ReLU and exponential learning models]
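The model families named above can be written down in a few lines; these one-feature parameterizations (a bias w₀ plus a slope w₁) are a minimal sketch, not the exact forms from the slides:

```python
import numpy as np

# Common choices for the learning model f(x; w), one feature for simplicity
def linear(x, w):       # f(x; w) = w0 + w1*x
    return w[0] + w[1] * x

def sigmoid(x, w):      # squashes the linear score into (0, 1)
    return 1.0 / (1.0 + np.exp(-(w[0] + w[1] * x)))

def relu(x, w):         # zero for negative scores, linear for positive
    return np.maximum(0.0, w[0] + w[1] * x)

def exponential(x, w):  # always positive, grows (or decays) exponentially
    return np.exp(w[0] + w[1] * x)
```

The choice among these encodes the belief about the feature-target relationship: e.g., a sigmoid suits probabilities, an exponential suits strictly positive growth.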
Different Loss Functions
• The loss function L(ŷ, y) can also take many forms, e.g., square error, binary (0/1), logistic
[Figure: plot of the logistic loss]
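The three loss families named above can likewise be sketched directly; the label conventions (y ∈ {−1, +1} for the logistic loss) are standard choices assumed here, not taken from the slides:

```python
import numpy as np

def square_loss(y_hat, y):
    # Squared error: penalty grows quadratically with the prediction error
    return (y_hat - y) ** 2

def binary_loss(y_hat, y):
    # 0/1 loss: 1 if the prediction is wrong, 0 if it is right
    return float(y_hat != y)

def logistic_loss(score, y):
    # y in {-1, +1}, score = f(x; w); log(1 + exp(-y * score)),
    # a smooth surrogate for the 0/1 loss (log1p for numerical stability)
    return np.log1p(np.exp(-y * score))
```

Unlike the 0/1 loss, the logistic loss is differentiable everywhere, which is exactly what gradient descent needs.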
Questions?

Summary
• Building blocks of machine learning algorithms
– Learning model: reflects our belief about the relationship between the features & the target we want to predict
– Loss function: penalty for a wrong prediction
– Regularization: penalizes complex models
– Optimization routine: finds the minimum of the overall cost function
• Gradient descent algorithm
– At each iteration, compute the gradient & update the model parameters in the direction opposite to the gradient
– If the learning rate is too big => may not converge
– If the learning rate is too small => converges very slowly
• Different learning models, e.g., linear, polynomial, sigmoid, ReLU, exponential, etc.
• Different loss functions, e.g., square error, binary, logistic, etc.
