
Linear Regression

with Multiple Variables


Lecture 04

Silvia Ahmed, Mirza Mohammad Lutfe Elahi


CSE 445 Machine Learning ECE@NSU
Multiple Features
Size in feet2 (x)    Number of Bedrooms    Number of Floors    Age of home (years)    Price ($) in 1000's (y)
2104                 5                     1                   45                     460
1416                 3                     2                   40                     232
1534                 3                     2                   30                     315
852                  2                     1                   36                     178
…                    …                     …                   …                      …

• Notation
– n = number of features
– m = number of training examples
– x(i) = input (features) of the ith training example
– xj(i) = value of feature j in the ith training example



Multiple Features
• Multiple variables = multiple features
• In the original version we had
– x = house size (use this to predict)
– y = house price
• If in a new scheme we have more variables (such as
number of bedrooms, number of floors, age of the
home)
– x1, x2, x3, x4 are the four features
• x1 - size (feet squared)
• x2 - Number of bedrooms
• x3 - Number of floors
• x4 - Age of home (years)
– y is the output variable (price)
Hypothesis for Multiple Features

Previously: hθ(x) = θ0 + θ1x

hθ(x) = θ0 + θ1x1 + θ2x2 + θ3x3 + … + θnxn

E.g.:
hθ(x) = 80 + 0.1x1 + 0.01x2 + 3x3 - 2x4



Hypothesis for Multiple Features
• For convenience of notation, define x0 = 1, so every
  example i has an additional 0th feature
• So now the feature vector is an (n + 1)-dimensional
  vector indexed from 0
– This is a column vector called X
– Each example/sample has a column vector associated with it
• Parameters are also in a 0 indexed n + 1 dimensional
vector
– This is also a column vector called θ
– This vector is the same for each example



Hypothesis for Multiple Features
hθ(x) = θ0 + θ1x1 + θ2x2 + … + θnxn

X = [x0, x1, x2, …, xn]T ϵ ℝn+1        θ = [θ0, θ1, θ2, …, θn]T ϵ ℝn+1

hθ(x) = θTX



Hypothesis for Multiple Features
• hθ(x) = θT X
– θT is a [1 x (n+1)] matrix
– In other words, because θ is a column vector, the
  transposition operation transforms it into a row vector
– So before, θ was an [(n+1) x 1] matrix
– Now θT is a [1 x (n+1)] matrix
– Which means the inner dimensions of θT and X match, so
  they can be multiplied together
  • [1 x (n+1)] * [(n+1) x 1] = [1 x 1] = hθ(x)
  • So, in other words, the transpose of the parameter vector times an input
    example X gives a predicted hypothesis of dimension [1 x 1]
    (i.e. a single value)
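As a concrete illustration, here is a minimal NumPy sketch of this vectorized hypothesis, using the example parameter values from the earlier slide and one row of the housing table (a sketch, not the lecture's own code):

import numpy as np

# One training example with n = 4 features, plus the x0 = 1 entry.
x = np.array([1.0, 2104.0, 5.0, 1.0, 45.0])     # shape (n+1,)
theta = np.array([80.0, 0.1, 0.01, 3.0, -2.0])  # shape (n+1,)

# h_theta(x) = theta^T x  ->  a single predicted value
prediction = theta @ x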



Model for Multiple Features
Hypothesis: hθ(x) = θTX = θ0x0 + θ1x1 + θ2x2 + … + θnxn

Parameters: θ0, θ1, θ2, … θn

Cost function:
J(θ0, θ1, …, θn) = (1/(2m)) Σi=1..m (hθ(x(i)) − y(i))2

Gradient descent:
Repeat {
    θj := θj − α (∂/∂θj) J(θ)
} (simultaneously update for every j = 0, 1, … n)
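A short NumPy sketch of this cost function, assuming X is the [m x (n+1)] design matrix with the x0 = 1 column (illustrative, not the lecture's own code):

import numpy as np

def compute_cost(X, theta, y):
    # J(theta) = (1/(2m)) * sum_i (h_theta(x^(i)) - y^(i))^2
    m = len(y)
    errors = X @ theta - y
    return (errors @ errors) / (2 * m)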


Gradient Descent Algorithm
Previously (n = 1):
Repeat
{
θ0 := θ0 − α (1/m) Σi=1..m (hθ(x(i)) − y(i))

θ1 := θ1 − α (1/m) Σi=1..m (hθ(x(i)) − y(i)) · x(i)

(simultaneously update θ0, θ1)


}



Gradient Descent Algorithm
New algorithm (n ≥ 1):
Repeat
{
θj := θj − α (1/m) Σi=1..m (hθ(x(i)) − y(i)) · xj(i)
(simultaneously update θj for j = 0, 1, … n)
}

θ0 := θ0 − α (1/m) Σi=1..m (hθ(x(i)) − y(i)) · x0(i)

θ1 := θ1 − α (1/m) Σi=1..m (hθ(x(i)) − y(i)) · x1(i)

θ2 := θ2 − α (1/m) Σi=1..m (hθ(x(i)) − y(i)) · x2(i)


Gradient Descent Algorithm
• We're doing this for each j (0 until n)
as simultaneous update (like when n = 1)
• So, we reset/update θj to
– θj minus the learning rate (α) times the partial derivative of
  the cost function J(θ) with respect to θj
– In non-calculus words, this means that we do
• Learning rate
• Times 1/m (makes the math easier)
• Times the sum of
– The hypothesis taking in the variable vector, minus the actual value,
times the jth value in that variable vector for each example
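A minimal NumPy sketch of this simultaneous update, written against the same design-matrix convention as the earlier cost-function sketch (illustrative, not the lecture's code):

import numpy as np

def gradient_descent_step(X, theta, y, alpha):
    # Simultaneously update every theta_j:
    # theta_j := theta_j - alpha * (1/m) * sum_i (h_theta(x^(i)) - y^(i)) * x_j^(i)
    m = len(y)
    errors = X @ theta - y           # h_theta(x^(i)) - y^(i) for every example
    gradient = (X.T @ errors) / m    # one entry per feature j
    return theta - alpha * gradient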



Feature Scaling
• When a problem has multiple features
• Make sure those features are on a similar scale
  – This means gradient descent will converge more quickly
• E.g. x1 = size (0 – 2000 feet2)
       x2 = number of bedrooms (1 - 5)
• This means the contours of J(θ), if we plot θ1 vs. θ2, form a
  very tall and thin shape due to the huge range difference
• Running gradient descent can take a long time to find the
  global minimum

[Figure: elongated contours of J(θ) in the θ1–θ2 plane]
Feature Scaling
• Idea: Make sure features are on a similar scale

0 ≤ x1 ≤ 1
0 ≤ x2 ≤ 1

• You rescale each value of x1 and x2 by dividing by the
  max of each feature
• The contours become more like circles (as the features are
  scaled between 0 and 1)

[Figure: roughly circular contours of J(θ) in the θ1–θ2 plane]



Feature Scaling
• Get every feature into approximately a -1 ≤ xi ≤ 1 range
• Want to avoid large ranges, small ranges or very
different ranges from one another
• Rule of thumb regarding acceptable ranges
– -3 to +3 is generally fine; anything much bigger should be rescaled
– -1/3 to +1/3 is okay; anything much smaller should be rescaled

x0 = 1 (always)
0 ≤ x1 ≤ 3 (okay)
-2 ≤ x2 ≤ 0.5 (okay)
-100 ≤ x3 ≤ 100 (range too large – rescale)
-0.0001 ≤ x4 ≤ 0.0001 (range too small – rescale)
Mean Normalization
• Take a feature xi
– Replace it by (xi - mean)/max
– So your values all have an average of about 0

• E.g. -0.5 ≤ x1 ≤ 0.5

-0.5 ≤ x2 ≤ 0.5

• Instead of the max, can also use the standard deviation or
  (max - min)
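A small NumPy sketch of mean normalization, here dividing by (max − min); the data values are just the sizes and bedroom counts from the earlier table:

import numpy as np

def mean_normalize(X):
    # (x - mean) / (max - min) for each feature column
    mean = X.mean(axis=0)
    spread = X.max(axis=0) - X.min(axis=0)
    return (X - mean) / spread

X = np.array([[2104.0, 5.0],
              [1416.0, 3.0],
              [1534.0, 3.0],
              [ 852.0, 2.0]])
X_scaled = mean_normalize(X)   # both columns end up roughly in [-0.5, 0.5]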



Learning Rate α
• θj := θj − α (∂/∂θj) J(θ)

• Debugging: how to make sure gradient descent is working correctly
• How to choose the learning rate α



Learning Rate α
• θj := θj − α (∂/∂θj) J(θ)

[Figure: J(θ) plotted against the number of iterations (0–400),
decreasing and flattening out as gradient descent converges]

• The number of iterations needed varies a lot
  – 30 iterations
  – 3,000 iterations
  – 3,000,000 iterations
• Very hard to tell in advance how many iterations will be needed
• Can often make a guess based on a plot like this after the first 100 or so
  iterations



Learning Rate α
[Figure: J(θ) vs. number of iterations (0–400), levelling off as
gradient descent converges]

• Automatic convergence test
  – Declare convergence if J(θ) decreases by less than 10-3 in one iteration
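Such a test is a one-liner; a Python sketch (the 10-3 threshold follows the slide, the function name is illustrative):

def has_converged(previous_cost, current_cost, tol=1e-3):
    # Declare convergence when J(theta) decreases by less than tol in one iteration.
    return previous_cost - current_cost < tol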



Learning Rate α
[Figures: J(θ) vs. number of iterations increasing, or repeatedly
rising and falling, instead of decreasing]

• Gradient descent is not working
• Use a smaller α
Learning Rate α
• For sufficiently small α, J(θ) should decrease on every
iteration
• But if α is too small, gradient descent can be slow to
  converge
• So
– If α is too small: slow convergence
– If α is too large: J(θ) may not decrease on every iteration;
may not converge
• To choose α, try
…, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, …
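Putting the pieces together, a NumPy sketch that runs gradient descent for each candidate α and records the cost history so the J(θ) curves can be compared (the tiny data set is purely illustrative):

import numpy as np

def gradient_descent(X, y, alpha, num_iters=400, tol=1e-3):
    m = len(y)
    theta = np.zeros(X.shape[1])
    history = []
    for _ in range(num_iters):
        errors = X @ theta - y
        cost = (errors @ errors) / (2 * m)
        if history and history[-1] - cost < tol:    # automatic convergence test
            break
        history.append(cost)
        theta = theta - alpha * (X.T @ errors) / m  # simultaneous update
    return theta, history

# Tiny illustrative data set: an x0 = 1 column plus one scaled feature.
X = np.array([[1.0, 0.5], [1.0, -0.2], [1.0, 0.1], [1.0, -0.4]])
y = np.array([460.0, 232.0, 315.0, 178.0])

for alpha in (0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0):
    theta, history = gradient_descent(X, y, alpha)
    print(alpha, len(history), history[-1])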



Feature Selection
• House price prediction

• Two features
– Frontage - width of the plot of land along the road (x1)
– Depth - depth of the plot away from the road (x2)

• hθ(x) = θ0 + θ1 · frontage + θ2 · depth



Feature Selection
• You don't have to use just two features
– Can create new features

• Might decide that an important feature is the land area


– So, create a new feature: area (x3) = frontage × depth (see the sketch
  after this list)
  hθ(x) = θ0 + θ1 · area
• Area is a better indicator
• Often, by defining new features you may get a better
model
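A tiny NumPy sketch of that feature construction (the frontage and depth numbers are made up for illustration):

import numpy as np

frontage = np.array([60.0, 40.0, 55.0, 30.0])  # width of the plot along the road
depth = np.array([100.0, 80.0, 70.0, 50.0])    # depth of the plot away from the road

area = frontage * depth   # new feature x3 = frontage * depth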



Polynomial regression
• May fit the data better
• hθ(x) = θ0 + θ1x + θ2x2
– e.g. quadratic function



Polynomial regression
• For the housing data we could use a quadratic function
  – But it may not fit the data so well - a quadratic eventually
    curves back down, which would mean housing prices decrease
    when the size gets really big



Polynomial regression
• So instead we could use a cubic function
• hθ(x) = θ0 + θ1x + θ2x2 + θ3x3



Polynomial regression
• So instead we could use a cubic function
• hθ(x) = θ0 + θ1 x1 + θ2 x2 + θ3 x3
• hθ(x) = θ0 + θ1(size) + θ2 (size)2 + θ3 (size)3
x1 = (size)
x2 = (size)2
x3 = (size)3

Make sure to apply feature scaling.


size = 1 – 1000
(size)2 = 1 – 1000000
(size)3 = 1 – 109
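A short NumPy sketch of building these polynomial features and then mean-normalizing them, since their ranges differ by many orders of magnitude (the sizes are taken from the earlier table):

import numpy as np

size = np.array([2104.0, 1416.0, 1534.0, 852.0])

# x1 = size, x2 = size^2, x3 = size^3
X_poly = np.column_stack([size, size**2, size**3])

# Feature scaling: (x - mean) / (max - min) per column
X_scaled = (X_poly - X_poly.mean(axis=0)) / (X_poly.max(axis=0) - X_poly.min(axis=0))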
Choice of features
hθ(x) = θ0 + θ1(size) + θ2√(size)

• Instead of a conventional polynomial you could use
  variable^(1/something) - i.e. the square root, cube root,
  etc.
Gradient Descent
[Figure: contour plot of J(θ) showing the path gradient descent
takes toward the minimum]

• In order to minimize the cost function J(θ), the iterative
  algorithm takes many steps, over multiple iterations of
  gradient descent, to converge to the global minimum



Normal Equation
• For some linear regression problems the
normal equation provides a better solution
• So far we've been using gradient descent
– Iterative algorithm which takes steps to converge
• Normal equation solves θ analytically
– Solve for the optimum value of θ in one step
– Has some advantages and disadvantages



Normal Equation
• Simplified cost function
J(θ) = aθ2 + bθ + c
If 1D, θ ϵ ℝ, not a vector
• How do you minimize this?
– Set (d/dθ) J(θ) = … = 0
• Take the derivative of J(θ) with respect to θ
• Set that derivative equal to 0
• This allows you to solve for the value of θ which minimizes
  J(θ)



Normal Equation
• In our more complex problems;
– Here θ is an n + 1 dimensional vector of real numbers
– Cost function is a function of the vector value
θ ϵ ℝn+1
J(θ0, θ1, θ2, … θn) = (1/(2m)) Σi=1..m (hθ(x(i)) − y(i))2
• How do we minimize this function?
  – Take the partial derivative of J(θ) with respect to θj and set
    it to 0 for every j
    (∂/∂θj) J(θ) = … = 0 (for every j)



Normal Equation
• Solve for the values
  θ0, θ1, θ2, … θn
  which minimize J(θ)

• This would give the values of θ which minimize J(θ)


• If you work through the calculus, the derivation of the
  solution is pretty complex
  – We're not going to go through it here
  – Instead, here is what you need to know to implement this
    process



Normal Equation
Size in feet2 (x1)    Number of Bedrooms (x2)    Number of Floors (x3)    Age of home, years (x4)    Price ($) in 1000's (y)
2104                  5                          1                        45                         460
1416                  3                          2                        40                         232
1534                  3                          2                        30                         315
852                   2                          1                        36                         178

• Here
– n=4
– m=4



Normal Equation
• Add an extra column (the x0 feature)
• Construct a column vector y, an [m x 1] matrix
(x0)    Size in feet2 (x1)    Number of Bedrooms (x2)    Number of Floors (x3)    Age of home, years (x4)    Price ($) in 1000's (y)
1       2104                  5                          1                        45                         460
1       1416                  3                          2                        40                         232
1       1534                  3                          2                        30                         315
1       852                   2                          1                        36                         178

X = [ 1  2104  5  1  45
      1  1416  3  2  40
      1  1534  3  2  30
      1   852  2  1  36 ]    an [m x (n+1)] matrix

y = [ 460
      232
      315
      178 ]    an [m x 1] vector
• θ= (XT X)-1XT y
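A NumPy sketch of this computation on the table above. Note that with only m = 4 examples and n + 1 = 5 parameters, XT X is actually singular here (the "too many features" case discussed at the end of the lecture), so the pseudo-inverse is used in place of a plain inverse (illustrative, not the lecture's code):

import numpy as np

X = np.array([[1.0, 2104.0, 5.0, 1.0, 45.0],
              [1.0, 1416.0, 3.0, 2.0, 40.0],
              [1.0, 1534.0, 3.0, 2.0, 30.0],
              [1.0,  852.0, 2.0, 1.0, 36.0]])   # m x (n+1) design matrix
y = np.array([460.0, 232.0, 315.0, 178.0])      # m x 1 price vector

# theta = (X^T X)^-1 X^T y, with pinv standing in for the plain inverse
theta = np.linalg.pinv(X.T @ X) @ (X.T @ y)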
General Form
• m training examples and n features
• The design matrix (X)
– Each training example is an (n+1)-dimensional feature column
  vector
– X is constructed by taking each training example, transposing
  it (i.e. column -> row), and using it as a row of the design
  matrix X
– This creates an [m x (n+1)] matrix



General Form
• m training examples (x(1), y(1)), …, (x(m), y(m))
• n features

x(i) = [x0(i), x1(i), …, xn(i)]T ϵ ℝn+1

Design matrix:
X = [ (x(1))T
      (x(2))T
      …
      (x(m))T ]    an [m x (n+1)] matrix
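A small NumPy sketch of assembling the design matrix this way, stacking each example's transpose as a row (two rows of the housing table, purely for illustration):

import numpy as np

# Each training example as an (n+1)-dimensional vector, with x0 = 1 first.
example_1 = np.array([1.0, 2104.0, 5.0, 1.0, 45.0])
example_2 = np.array([1.0, 1416.0, 3.0, 2.0, 40.0])

# Each example's transpose becomes one row of the m x (n+1) design matrix.
X = np.vstack([example_1, example_2])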



General Form
• Concrete example with only one feature:

  E.g. X = [ 1  x(1)
             1  x(2)
             …
             1  x(m) ]    an [m x 2] matrix



Normal Equation
θ = (XT X)-1XT y

• (XT X)-1 is the inverse of the matrix XT X


– i.e. A = XT X
– A-1 = (XT X)-1

• No need to do feature scaling.



Gradient Descent vs Normal Equation
m training examples, n features

Gradient Descent
• Need to choose α
• Needs many iterations
• Works well even when n is massive (millions) – better suited to big data
  – What is a big n, though? 100 or even 1,000 is still (relatively) small
  – If n is 10,000 or more, then look at using gradient descent
• e.g. n = 10^6

Normal Equation
• No need to choose α
• No need to iterate
• Needs to compute (XT X)-1
  – This is the inverse of an n x n matrix
  – With most implementations, computing a matrix inverse grows as O(n3)
  – So not great – slow if n is large, and can be much slower
• e.g. n = 100, n = 1,000, n = 10,000



Normal Equation and Noninvertibility
Normal equation: θ = (XT X)-1XT y

• What if XT X is non-invertible? (singular/degenerate)
– Only some matrices are invertible
– This should be quite a rare problem
• Octave: pinv(X ' * X) * X ' * y
– pinv (pseudo inverse)
– This gets the right value even if (XT X) is non-invertible



Normal Equation and Noninvertibility
Normal equation: θ = (XT X)-1XT y

• What does it mean for XT X to be non-invertible?
• Normally there are two common causes
– Redundant features (linearly dependent)
  • e.g.
    – x1 = size in feet2
    – x2 = size in m2
    – x1 = (3.28)2 · x2



Normal Equation and Noninvertibility
Normal equation: θ = (XT X)-1XT y
• What does it mean for XT X to be non-invertible?
• Normally two common causes
– Too many features
  • e.g. m ≤ n (fewer training examples than parameters)
    – m = 10 and n = 100
    – θ ϵ ℝ100+1
  • Trying to fit 101 parameters from 10 training examples
  • It sometimes works, but it is not always a good idea
  • There is not enough data
  • Later we will look at why this may be too little data
• To solve this we can
  – Delete features
  – Use regularization (lets you use lots of features with a small training set)
