
Applied Machine Learning
Lecture 3: Multiple Linear Regression

Ekarat Rattagan, Ph.D.
Outline
3.1 Multiple features
3.2 Gradient descent for multiple variables
3.3 Feature Scaling
3.4 Stochastic Gradient Descent
3.5 Mini-batch Gradient Descent
3.6 Polynomial regression

3.1 Multiple features

Single feature (variable)

Size (feet²)   Price ($1000)
2104           460
1416           232
1534           315
852            178
…              …
Multiple features (variables)

Size (feet²)   Number of bedrooms   Number of floors   Age of home (years)   Price ($1000)
2104           5                    1                    45                    460
1416           3                    2                    40                    232
1534           3                    2                    30                    315
852            2                    1                    36                    178
…              …                    …                    …                     …
Notation:
n = number of features
m = number of training examples
x^(i) = input (features) of the i-th training example
xj^(i) = value of feature j in the i-th training example
Hypothesis

Single variable:     hθ(x) = θ0 + θ1x

Multiple variables:  hθ(x) = θ0x0 + θ1x1 + … + θnxn = θᵀx   (multivariate linear regression)

θᵀx = [θ0, θ1, …, θn] [x0, x1, …, xn]ᵀ

For mathematical convenience, we define x0^(i) = 1.
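To make the vectorized form concrete, here is a small NumPy sketch (not part of the original slides); the data simply echoes the table above, and the function name h is illustrative:

```python
import numpy as np

# Illustrative feature matrix from the table above:
# size (feet^2), bedrooms, floors, age (years)
X = np.array([[2104., 5, 1, 45],
              [1416., 3, 2, 40],
              [1534., 3, 2, 30],
              [ 852., 2, 1, 36]])

# Prepend the convenience feature x0 = 1 so theta_0 acts as the intercept
X = np.hstack([np.ones((X.shape[0], 1)), X])

theta = np.zeros(X.shape[1])  # [theta_0, theta_1, ..., theta_n]

def h(theta, X):
    """Vectorized hypothesis: h_theta(x) = theta^T x for every row of X."""
    return X @ theta

print(h(theta, X))  # all-zero predictions while theta = 0
```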
3.2 Gradient descent for multiple variables
Hypothesis:     hθ(x) = θ0x0 + θ1x1 + … + θnxn = θᵀx

Cost function:  J(θ) = (1/2m) Σ_{i=1}^{m} (hθ(x^(i)) − y^(i))²

Gradient descent:
Repeat until convergence {
    θj := θj − α (1/m) Σ_{i=1}^{m} (hθ(x^(i)) − y^(i)) xj^(i)
    (simultaneously update θj for every j = 0, …, n)
}
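A minimal sketch of the cost function in NumPy (not from the slides), assuming X already contains the x0 = 1 column; the helper name compute_cost is illustrative:

```python
import numpy as np

def compute_cost(theta, X, y):
    """Squared-error cost J(theta) = (1/2m) * sum_i (h_theta(x^(i)) - y^(i))^2."""
    m = len(y)
    residuals = X @ theta - y
    return (residuals @ residuals) / (2 * m)

# Tiny illustrative example: x0 = 1 column plus one feature (size)
X = np.array([[1.0, 2104.0],
              [1.0, 1416.0]])
y = np.array([460.0, 232.0])
print(compute_cost(np.zeros(2), X, y))  # cost of the all-zero hypothesis
```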
Gradient Descent for a single variable

Single variable, i.e., j = 0 and 1:

Repeat until convergence {
    θ0 := θ0 − α (1/m) Σ_{i=1}^{m} (hθ(x^(i)) − y^(i)) x0^(i)
    θ1 := θ1 − α (1/m) Σ_{i=1}^{m} (hθ(x^(i)) − y^(i)) x1^(i)
}
Gradient Descent for multiple variables

Multiple variables, i.e., j = 0, 1, …, n:

Repeat until convergence {
    θ0 := θ0 − α (1/m) Σ_{i=1}^{m} (hθ(x^(i)) − y^(i)) x0^(i)
    θ1 := θ1 − α (1/m) Σ_{i=1}^{m} (hθ(x^(i)) − y^(i)) x1^(i)
    ⋮
    θn := θn − α (1/m) Σ_{i=1}^{m} (hθ(x^(i)) − y^(i)) xn^(i)
}
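These per-parameter updates can be written as one vectorized step. A possible NumPy sketch (illustrative names and defaults, not from the slides):

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, num_iters=1000):
    """Batch GD: every step uses all m examples and updates all theta_j simultaneously."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        gradient = X.T @ (X @ theta - y) / m  # (1/m) * sum_i (h_theta(x^(i)) - y^(i)) * xj^(i)
        theta = theta - alpha * gradient      # simultaneous update for every j
    return theta

# Usage, reusing an X (with x0 = 1 column) and y as in the earlier sketches:
# theta = batch_gradient_descent(X, y, alpha=0.1, num_iters=5000)
```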
3.3 Feature Scaling

Feature Scaling

[Figure: contour plots of the cost J(θ1, θ2) over θ1 and θ2, without and with feature scaling]

Source: https://stackoverflow.com/questions/46686924/why-scaling-data-is-very-important-in-neural-networklstm/46688787#46688787
Feature scaling

The goal is to transform features so that they are on a similar scale. This improves the performance and training stability of the model.

Standardization (or Z-score normalization):   x′i = (xi − μ) / s

Min-max scaling (or normalization):           x′i = (xi − xmin) / (xmax − xmin)

https://sebastianraschka.com/Articles/2014_about_feature_scaling.html
https://developers.google.com/machine-learning/data-prep/transform/normalization
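Both transformations are one line each in NumPy. A minimal sketch (not from the slides), assuming the features are stored column-wise in X:

```python
import numpy as np

def standardize(X):
    """Z-score normalization: x' = (x - mean) / std, applied column by column."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def min_max_scale(X):
    """Min-max scaling: x' = (x - min) / (max - min), applied column by column."""
    return (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Features on very different scales: size in feet^2 vs. number of bedrooms
X = np.array([[2104., 5.],
              [1416., 3.],
              [1534., 3.],
              [ 852., 2.]])
print(standardize(X))
print(min_max_scale(X))
```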
3.4 Stochastic Gradient Descent

Stochastic Gradient Descent

Batch gradient descent uses all m examples for each update:

    θj := θj − α (1/m) Σ_{i=1}^{m} (hθ(x^(i)) − y^(i)) xj^(i)

Stochastic gradient descent instead updates the parameters {θj} using a single example x^(i) at a time:

    θj := θj − α (hθ(x^(i)) − y^(i)) xj^(i)

Ref: Bottou, L. (2012). Stochastic gradient descent tricks. In Neural Networks: Tricks of the Trade (pp. 421–436). Springer, Berlin, Heidelberg.
Batch Gradient Descent vs. Stochastic Gradient Descent

Stochastic gradient descent:
1. Randomly shuffle (reorder) the training examples.
2. Repeat {
       for i := 1, …, m {
           θj := θj − α (hθ(x^(i)) − y^(i)) xj^(i)
           (for every j = 0, …, n)
       }
   }
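A minimal NumPy sketch of the shuffled, one-example-at-a-time update described above (not from the slides; names and defaults are illustrative):

```python
import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.01, num_epochs=10, seed=0):
    """SGD: shuffle the training set, then update theta after every single example."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_epochs):
        for i in rng.permutation(m):              # step 1: randomly shuffle the examples
            error = X[i] @ theta - y[i]           # h_theta(x^(i)) - y^(i)
            theta = theta - alpha * error * X[i]  # update every theta_j using one example
    return theta
```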
Learning schedule

• To reduce the stochastic noise, gradually reduce the learning rate:

    αt := α0 / (1 + α0 λ t)

Ref: Bottou, L. (2012). Stochastic gradient descent tricks. In Neural Networks: Tricks of the Trade (pp. 421–436). Springer, Berlin, Heidelberg.
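For illustration, the schedule can be computed directly from the formula above; the α0 and λ values below are arbitrary placeholders, not values from the lecture:

```python
def learning_rate(t, alpha0=0.1, lam=0.01):
    """Decaying schedule: alpha_t = alpha0 / (1 + alpha0 * lam * t)."""
    return alpha0 / (1.0 + alpha0 * lam * t)

# The step size shrinks over iterations, damping SGD's noise near the minimum.
print([round(learning_rate(t), 4) for t in (0, 100, 1000, 10000)])
```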
3.5 Mini-batch Gradient Descent

Mini-batch gradient descent
Batch gradient descent: Use all m examples in each iteration
Stochastic gradient descent: Use 1 example in each iteration
Mini-batch gradient descent: Use b examples in each iteration

Credit: Andrew NG
Mini-batch gradient descent
Say the mini-batch size is b (e.g., b = 10).
Repeat {
    for i := 1, 1+b, 1+2b, …, m−b+1 {
        θj := θj − α (1/b) Σ_{k=i}^{i+b−1} (hθ(x^(k)) − y^(k)) xj^(k)
        (for every j = 0, …, n)
    }
}
Credit: Andrew NG
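A possible NumPy sketch of the mini-batch update (not from the slides), assuming the examples are shuffled once per pass; all names and defaults are illustrative:

```python
import numpy as np

def minibatch_gradient_descent(X, y, b=10, alpha=0.01, num_epochs=10, seed=0):
    """Mini-batch GD: each update averages the gradient over b examples.
    b = 1 recovers stochastic GD; b = m recovers batch GD."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_epochs):
        order = rng.permutation(m)
        for start in range(0, m, b):
            batch = order[start:start + b]       # indices of the next b examples
            error = X[batch] @ theta - y[batch]  # h_theta - y on the mini-batch
            theta = theta - alpha * (X[batch].T @ error) / len(batch)
    return theta
```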
3.6 Polynomial regression

Housing prices prediction

hθ(x) = θ0 + θ1 × front + θ2 × depth

Rather than using the frontage and depth separately, one can also create a new feature, e.g., area = front × depth, and fit hθ(x) = θ0 + θ1 × area.
Polynomial regression

[Figure: Price (y) plotted against Size (x)]
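As a rough illustration (not from the slides), polynomial regression can be set up by adding powers of the size feature and then reusing ordinary linear regression; the numbers below echo the earlier table:

```python
import numpy as np

# Illustrative sizes and prices (same values as the earlier table)
size  = np.array([852., 1416., 1534., 2104.])
price = np.array([178., 232., 315., 460.])

# Build polynomial features [x, x^2, x^3] so a linear hypothesis can fit a curve
X_poly = np.column_stack([size, size**2, size**3])

# Feature scaling is essential here: size^3 is vastly larger than size
X_poly = (X_poly - X_poly.mean(axis=0)) / X_poly.std(axis=0)

# Add the x0 = 1 column; any of the gradient descent variants above can now fit theta
X_poly = np.hstack([np.ones((X_poly.shape[0], 1)), X_poly])
# theta = batch_gradient_descent(X_poly, price, alpha=0.1, num_iters=5000)
```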
