
Applied

Machine Learning
Lecture 2
One variable (simple)
linear regression
Ekarat Rattagan, Ph.D.

1
Outline
2.1 Linear regression
2.2 Learning algorithm
2.3 Iterative method: Gradient descent
2.4 Gradient descent for linear regression

2
2.1 Linear regression

3
Supervised Learning: Regression
• Given a training set {(x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(m), y^(m))}
• Learn a function f(x) to predict y given x
– y is real-valued == regression

Notation:
m = number of training examples
x’s = “input” or “independent” variables, or features
y’s = “output”, “target”, or “dependent” variable

[Figure: September Arctic Sea Ice Extent (1,000,000 sq km) plotted by year, 1970–2020]
Data from G. Witt. Journal of Statistics Education, Volume 21, Number 1 (2013)

4
Training set = {(x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(m), y^(m))}

Credit: Tom M. Mitchell 5


2.2 Learning algorithm

6
Learning algorithm

• Objective: to find the best target function h ∈ H

hθ(x) = θ0 + θ1x (Model representation, L1, P40)

• We need to specify a performance measure
• Cost function

“A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.”

7
What is a cost function?
• A function that measures the performance of a Machine
Learning model for given data.
• It quantifies the error between estimated and true values.
• It is used for parameter estimation.

8
Cost function (Sum of Squared Errors)
hθ(x) = θ0 + θ1x
[Figure: scatter plot of the example points (0,1), (2,1), and (3,4), with the line hθ(x) fitted through them]

Cost function: J(θ0, θ1) = Σ_{i=1}^{m} (hθ(x^(i)) − y^(i))^2

9
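As a quick illustration, here is a minimal NumPy sketch of this cost, evaluated on the three example points from the plot above (the names `h` and `cost_sse` are illustrative, not from the slides):

```python
import numpy as np

x = np.array([0.0, 2.0, 3.0])   # example inputs from the scatter plot
y = np.array([1.0, 1.0, 4.0])   # corresponding targets

def h(theta0, theta1, x):
    """Hypothesis h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x

def cost_sse(theta0, theta1, x, y):
    """Sum of squared errors J(theta0, theta1) = sum_i (h(x^(i)) - y^(i))^2."""
    return np.sum((h(theta0, theta1, x) - y) ** 2)

print(cost_sse(0.0, 1.0, x, y))  # theta0=0, theta1=1: (0-1)^2 + (2-1)^2 + (3-4)^2 = 3.0
```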
Learning objective: minimize J(θ0, θ1) over θ0, θ1

That is, to find θ0 and θ1 so that the gap between hθ(x) and y is minimized.

10
Closed-form solution

• The Normal equation

θ = (XᵀX)⁻¹ Xᵀ y

Code: appendix
11
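The appendix holds the full code; a minimal sketch of the normal equation, assuming a small NumPy design matrix with a column of ones for the intercept θ0, looks like this:

```python
import numpy as np

x = np.array([0.0, 2.0, 3.0])
y = np.array([1.0, 1.0, 4.0])

# Design matrix: a column of ones (for theta0) next to the feature values (for theta1).
X = np.column_stack([np.ones_like(x), x])

# Normal equation: theta = (X^T X)^(-1) X^T y
theta = np.linalg.inv(X.T @ X) @ X.T @ y
print(theta)  # [theta0, theta1] minimizing the sum of squared errors
```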
Normal equation

• Limitation
• Non-invertible or Singular Matrix (det = 0)
• Some features are redundant

• Solution
• The Moore-Penrose Pseudo-inverse
• Singular Value Decomposition (SVD)

https://hadrienj.github.io/posts/Deep-Learning-Book-Series-2.9-The-Moore-Penrose-Pseudoinverse/
12
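As a sketch of the work-around above: `np.linalg.pinv` computes the Moore–Penrose pseudo-inverse via SVD, so it still returns a (minimum-norm) least-squares solution when XᵀX is singular, e.g., when one feature simply duplicates another (the toy matrix below is made up for illustration):

```python
import numpy as np

X = np.array([[1.0, 2.0, 2.0],
              [1.0, 0.0, 0.0],
              [1.0, 3.0, 3.0]])   # third column duplicates the second, so X^T X is singular
y = np.array([5.0, 1.0, 7.0])

# np.linalg.inv(X.T @ X) would fail here; the pseudo-inverse (computed via SVD) does not.
theta = np.linalg.pinv(X) @ y
print(theta)
```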
Normal equation

• Limitation
• Computational complexity O(n³)
• Slow when the number of features grows large (e.g., 100,000), because the matrix must be inverted

• Solution
• Another approach: iterative method

13
2.3 Iterative method:
(Batch) Gradient descent

14
Hypothesis: hθ(x) = θ0 + θ1x

Parameters: θ0, θ1

Cost Function: J(θ0, θ1) = Σ_{i=1}^{m} (hθ(x^(i)) − y^(i))^2

Goal: minimize J(θ0, θ1) over θ0, θ1

15
[Figures: on the left, the data and hθ(x) plotted against x for a fixed θ1 (a function of x); on the right, the cost J plotted as a function of the parameter θ1. Shown for several values of θ1, e.g., θ1 = 0.5 and θ1 = 0.]
Credit: Andrew NG

19
[Figure: surface plot of the cost J(θ0, θ1) over the parameters θ0 and θ1]
Credit: Andrew NG
20
Gradient descent

reference: https://arxiv.org/pdf/1609.04747.pdf

http://sites.science.oregonstate.edu/math/home/programs/undergrad/CalculusQuestStudyGuides/vcalc/grad/grad.html 21
Gradient descent

Repeat until convergence: θj := θj − α ∂/∂θj J(θ0, θ1), for j = 0 and j = 1.

Simultaneous update: compute the new values of θ0 and θ1 from the current parameters before overwriting either of them.
22
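A minimal sketch of the simultaneous update; `dJ_dtheta0` and `dJ_dtheta1` are stand-in functions for the two partial derivatives of whatever cost J is being minimized, not code from the slides:

```python
def gradient_descent_step(theta0, theta1, alpha, dJ_dtheta0, dJ_dtheta1):
    """One gradient descent step with a simultaneous update of theta0 and theta1."""
    # Both partial derivatives are evaluated at the *old* (theta0, theta1) ...
    temp0 = theta0 - alpha * dJ_dtheta0(theta0, theta1)
    temp1 = theta1 - alpha * dJ_dtheta1(theta0, theta1)
    # ... and only then are both parameters overwritten.
    return temp0, temp1
```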
Declare convergence

Declare convergence if J(θ0, θ1) decreases by less than some small threshold ε in one iteration.

23
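A minimal sketch of this stopping rule; `step`, `cost`, and the default threshold `eps` are illustrative placeholders rather than values from the slide:

```python
def run_until_converged(step, cost, theta, eps=1e-3, max_iters=100_000):
    """Repeat a gradient descent step until the cost decreases by less than eps in one iteration."""
    prev = cost(theta)
    for _ in range(max_iters):
        theta = step(theta)
        curr = cost(theta)
        if prev - curr < eps:   # declare convergence
            break
        prev = curr
    return theta
```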
If α is too small, gradient descent can be slow.

If α is too large, gradient descent can overshoot the minimum. It may fail to converge, or even diverge.

Figures from “Hands-on Machine Learning with Scikit-Learn, Keras, & TensorFlow, …” 24
2.4 Gradient descent
for linear regression

25
Gradient descent algorithm

Linear Regression Model

26
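Putting the model and the update rule together, here is a minimal sketch of batch gradient descent for the one-variable model. It scales the cost by 1/(2m), a common convention that changes the gradients only by a constant factor relative to the plain sum of squared errors on slide 9:

```python
import numpy as np

def gradient_descent_linreg(x, y, alpha=0.05, n_iters=5000):
    """Batch gradient descent for h_theta(x) = theta0 + theta1 * x."""
    m = len(y)
    theta0, theta1 = 0.0, 0.0
    for _ in range(n_iters):
        error = theta0 + theta1 * x - y           # h_theta(x^(i)) - y^(i) for all i at once
        grad0 = np.sum(error) / m                 # dJ/dtheta0, with J scaled by 1/(2m)
        grad1 = np.sum(error * x) / m             # dJ/dtheta1
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1   # simultaneous update
    return theta0, theta1

x = np.array([0.0, 2.0, 3.0])
y = np.array([1.0, 1.0, 4.0])
print(gradient_descent_linreg(x, y))   # approaches the normal-equation solution for the same data
```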
Gradient descent algorithm

How to get these two equations?


27
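One way to answer the question on the slide above: differentiate the cost with respect to each parameter. The derivation below assumes the 1/(2m)-scaled cost; with the plain sum of squared errors from slide 9 the same expressions appear, just without the 1/m factor and multiplied by 2:

```latex
J(\theta_0, \theta_1) = \frac{1}{2m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)^2,
\qquad h_\theta(x^{(i)}) = \theta_0 + \theta_1 x^{(i)}

\frac{\partial J}{\partial \theta_0}
  = \frac{1}{m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr),
\qquad
\frac{\partial J}{\partial \theta_1}
  = \frac{1}{m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)\, x^{(i)}
```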
[Figures, slides 28–36: on the left, hθ(x) plotted against x for fixed θ0 and θ1 (a function of x); on the right, the contour plot of J(θ0, θ1) as a function of the parameters, with successive gradient descent updates moving toward the minimum. Credit: Andrew NG]
Gradient Descent

• Limitation
• Need to choose the learning rate α
• Can be very slow on large datasets, which may also not fit in memory

37
Gradient Descent VS Normal Equation

Gradient descent:
• Requires choosing a learning rate α and running many iterative update steps
• Works well even when the number of features is large

Normal equation:
• No learning rate and no iterations, but inverting XᵀX is O(n³), so it becomes slow when the number of features is large
• When XᵀX is singular (e.g., redundant features), use the pseudo-inverse

https://www.geeksforgeeks.org/difference-between-gradient-descent-and-normal-equation/ 38
