
Linear Regression

with Multiple Variables


Lecture 04

Silvia Ahmed, Mirza Mohammad Lutfe Elahi


CSE 445 Machine Learning ECE@NSU
Multiple Features
Size in feet2 (x)    Number of Bedrooms    Number of Floors    Age of home (years)    Price ($) in 1000's (y)
2104                 5                     1                   45                     460
1416                 3                     2                   40                     232
1534                 3                     2                   30                     315
852                  2                     1                   36                     178
…                    …                     …                   …                      …

• Notation
– n = number of features
– m = number of training examples
– x(i) = input (features) of the ith training example
– xj(i) = value of feature j in the ith training example



Multiple Features
• Multiple variables = multiple features
• In the original version we had
– x = house size (use this to predict)
– y = house price
• If in a new scheme we have more variables (such as
number of bedrooms, number of floors, age of the
home)
– x1, x2, x3, x4 are the four features
• x1 - size (feet squared)
• x2 - Number of bedrooms
• x3 - Number of floors
• x4 - Age of home (years)
– y is the output variable (price)
Hypothesis for Multiple Features

Previously: hθ(x) = θ0 + θ1x

hθ(x) = θ0 + θ1x1 + θ2x2 + θ3x3 + … + θnxn

E.g.:
hθ(x) = 80 + 0.1x1 + 0.01x2 + 3x3 - 2x4



Hypothesis for Multiple Features
• For convenience of notation, define x0 = 1, so every
  example i has an additional 0th feature
• So now the feature vector is an (n + 1)-dimensional
  vector indexed from 0
– This is a column vector called X
– Each example/sample has a column vector associated with it
• Parameters are also in a 0 indexed n + 1 dimensional
vector
– This is also a column vector called θ
– This vector is the same for each example



Hypothesis for Multiple Features
hθ(x) = θ0 + θ1x1 + θ2x2 + … + θnxn

X = [x0, x1, x2, …, xn]T ϵ ℝn+1        θ = [θ0, θ1, θ2, …, θn]T ϵ ℝn+1

hθ(x) = θTX



Hypothesis for Multiple Features
• hθ(x) = θT X
– θT is a [1 x (n+1)] matrix
– In other words, because θ is a column vector, the
  transposition operation transforms it into a row vector
– So before, θ was an [(n+1) x 1] matrix
– Now θT is a [1 x (n+1)] matrix
– Which means the inner dimensions of θT and X match, so
  they can be multiplied together
  • [1 x (n+1)] * [(n+1) x 1] = [1 x 1] = hθ(x)
  • So, in other words, the transpose of the parameter vector times an input
    example X gives a predicted hypothesis of dimension [1 x 1]
    (i.e. a single value)
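As a concrete illustration, here is a minimal NumPy sketch of this vectorized hypothesis, using the example parameter values from the earlier slide and one row of the housing table (a sketch, not the lecture's own code):

import numpy as np

# One training example with n = 4 features, plus the x0 = 1 entry.
x = np.array([1.0, 2104.0, 5.0, 1.0, 45.0])     # shape (n+1,)
theta = np.array([80.0, 0.1, 0.01, 3.0, -2.0])  # shape (n+1,)

# h_theta(x) = theta^T x  ->  a single predicted value
prediction = theta @ x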



Model for Multiple Features
Hypothesis: hθ(x) = θTX = θ0x0 + θ1x1 + θ2x2 + … + θnxn

Parameters: θ0, θ1, θ2, … θn

Cost function:
J(θ0, θ1, …, θn) = (1/(2m)) Σi=1..m (hθ(x(i)) − y(i))2

Gradient descent:
Repeat {
    θj := θj − α (∂/∂θj) J(θ)
} (simultaneously update for every j = 0, 1, … n)
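A short NumPy sketch of this cost function, assuming X is the [m x (n+1)] design matrix with the x0 = 1 column (illustrative, not the lecture's own code):

import numpy as np

def compute_cost(X, theta, y):
    # J(theta) = (1/(2m)) * sum_i (h_theta(x^(i)) - y^(i))^2
    m = len(y)
    errors = X @ theta - y
    return (errors @ errors) / (2 * m)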


Gradient Descent Algorithm
Previously (n = 1):
Repeat
{
θ0 := θ0 − α (1/m) Σi=1..m (hθ(x(i)) − y(i))

θ1 := θ1 − α (1/m) Σi=1..m (hθ(x(i)) − y(i)) · x(i)

(simultaneously update θ0, θ1)


}



Gradient Descent Algorithm
New algorithm (n ≥ 1):
Repeat
{
θj := θj − α (1/m) Σi=1..m (hθ(x(i)) − y(i)) · xj(i)
(simultaneously update θj for j = 0, 1, … n)
}

θ0 := θ0 − α (1/m) Σi=1..m (hθ(x(i)) − y(i)) · x0(i)

θ1 := θ1 − α (1/m) Σi=1..m (hθ(x(i)) − y(i)) · x1(i)

θ2 := θ2 − α (1/m) Σi=1..m (hθ(x(i)) − y(i)) · x2(i)


Gradient Descent Algorithm
• We're doing this for each j (0 until n)
as simultaneous update (like when n = 1)
• So, we reset/update θj to
– θj minus the learning rate (α) times the partial derivative of
  the cost function J(θ) with respect to θj
– In non-calculus words, this means that we do
• Learning rate
• Times 1/m (makes the math easier)
• Times the sum of
– The hypothesis taking in the variable vector, minus the actual value,
times the jth value in that variable vector for each example
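A minimal NumPy sketch of this simultaneous update, written against the same design-matrix convention as the earlier cost-function sketch (illustrative, not the lecture's code):

import numpy as np

def gradient_descent_step(X, theta, y, alpha):
    # Simultaneously update every theta_j:
    # theta_j := theta_j - alpha * (1/m) * sum_i (h_theta(x^(i)) - y^(i)) * x_j^(i)
    m = len(y)
    errors = X @ theta - y           # h_theta(x^(i)) - y^(i) for every example
    gradient = (X.T @ errors) / m    # one entry per feature j
    return theta - alpha * gradient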



Feature Scaling
• When a problem has multiple features
• Make sure those features are on a similar scale
  – This means gradient descent will converge more quickly
• E.g. x1 = size (0 – 2000 feet2)
       x2 = number of bedrooms (1 - 5)
• This means the contours of J(θ), if we plot θ1 vs. θ2, form a
  very tall and thin shape due to the huge range difference
• Running gradient descent can take a long time to find the
  global minimum

[Figure: elongated contours of J(θ) in the θ1–θ2 plane]
Feature Scaling
• Idea: Make sure features are on a similar scale

0 ≤ x1 ≤ 1
0 ≤ x2 ≤ 1

• You rescale each value of x1 and x2 by dividing by the
  max of each feature
• The contours become more like circles (as the features are
  scaled between 0 and 1)

[Figure: roughly circular contours of J(θ) in the θ1–θ2 plane]



Feature Scaling
• Get every feature into approximately a -1 ≤ xi ≤ 1 range
• Want to avoid large ranges, small ranges or very
different ranges from one another
• Rule of thumb regarding acceptable ranges
– -3 to +3 is generally fine; anything much bigger should be rescaled
– -1/3 to +1/3 is okay; anything much smaller should be rescaled

x0 = 1 (always)
0 ≤ x1 ≤ 3 (okay)
-2 ≤ x2 ≤ 0.5 (okay)
-100 ≤ x3 ≤ 100 (range too large – rescale)
-0.0001 ≤ x4 ≤ 0.0001 (range too small – rescale)
Mean Normalization
• Take a feature xi
– Replace it by (xi - mean)/max
– So your values all have an average of about 0

• E.g. -0.5 ≤ x1 ≤ 0.5

-0.5 ≤ x2 ≤ 0.5

• Instead of the max, can also use the standard deviation or
  (max - min)
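A small NumPy sketch of mean normalization, here dividing by (max − min); the data values are just the sizes and bedroom counts from the earlier table:

import numpy as np

def mean_normalize(X):
    # (x - mean) / (max - min) for each feature column
    mean = X.mean(axis=0)
    spread = X.max(axis=0) - X.min(axis=0)
    return (X - mean) / spread

X = np.array([[2104.0, 5.0],
              [1416.0, 3.0],
              [1534.0, 3.0],
              [ 852.0, 2.0]])
X_scaled = mean_normalize(X)   # both columns end up roughly in [-0.5, 0.5]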



Learning Rate α
• θj := θj − α (∂/∂θj) J(θ)

• Debugging: how to make sure gradient descent is working correctly
• How to choose the learning rate α



Learning Rate α
• θj := θj − α (∂/∂θj) J(θ)

[Figure: J(θ) plotted against the number of iterations (0–400),
decreasing and flattening out as gradient descent converges]

• The number of iterations needed varies a lot
  – 30 iterations
  – 3,000 iterations
  – 3,000,000 iterations
• Very hard to tell in advance how many iterations will be needed
• Can often make a guess based on a plot like this after the first 100 or so
  iterations



Learning Rate α
[Figure: J(θ) vs. number of iterations (0–400), levelling off as
gradient descent converges]

• Automatic convergence test
  – Declare convergence if J(θ) decreases by less than 10-3 in one iteration
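Such a test is a one-liner; a Python sketch (the 10-3 threshold follows the slide, the function name is illustrative):

def has_converged(previous_cost, current_cost, tol=1e-3):
    # Declare convergence when J(theta) decreases by less than tol in one iteration.
    return previous_cost - current_cost < tol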



Learning Rate α
[Figures: J(θ) vs. number of iterations increasing, or repeatedly
rising and falling, instead of decreasing]

• Gradient descent is not working
• Use a smaller α
Learning Rate α
• For sufficiently small α, J(θ) should decrease on every
iteration
• But if α is too small, gradient descent can be slow to
  converge
• So
– If α is too small: slow convergence
– If α is too large: J(θ) may not decrease on every iteration;
may not converge
• To choose α, try
…, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, …
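Putting the pieces together, a NumPy sketch that runs gradient descent for each candidate α and records the cost history so the J(θ) curves can be compared (the tiny data set is purely illustrative):

import numpy as np

def gradient_descent(X, y, alpha, num_iters=400, tol=1e-3):
    m = len(y)
    theta = np.zeros(X.shape[1])
    history = []
    for _ in range(num_iters):
        errors = X @ theta - y
        cost = (errors @ errors) / (2 * m)
        if history and history[-1] - cost < tol:    # automatic convergence test
            break
        history.append(cost)
        theta = theta - alpha * (X.T @ errors) / m  # simultaneous update
    return theta, history

# Tiny illustrative data set: an x0 = 1 column plus one scaled feature.
X = np.array([[1.0, 0.5], [1.0, -0.2], [1.0, 0.1], [1.0, -0.4]])
y = np.array([460.0, 232.0, 315.0, 178.0])

for alpha in (0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0):
    theta, history = gradient_descent(X, y, alpha)
    print(alpha, len(history), history[-1])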



Feature Selection
• House price prediction

• Two features
– Frontage - width of the plot of land along the road (x1)
– Depth - depth of the plot away from the road (x2)

• hθ(x) = θ0 + θ1 · frontage + θ2 · depth



Feature Selection
• You don't have to use just two features
– Can create new features

• Might decide that an important feature is the land area


– So, create a new feature: area (x3) = frontage × depth (see the sketch
  after this list)
  hθ(x) = θ0 + θ1 · area
• Area is a better indicator
• Often, by defining new features you may get a better
model
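A tiny NumPy sketch of that feature construction (the frontage and depth numbers are made up for illustration):

import numpy as np

frontage = np.array([60.0, 40.0, 55.0, 30.0])  # width of the plot along the road
depth = np.array([100.0, 80.0, 70.0, 50.0])    # depth of the plot away from the road

area = frontage * depth   # new feature x3 = frontage * depth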



Polynomial regression
• May fit the data better
• hθ(x) = θ0 + θ1x + θ2x2
– e.g. quadratic function



Polynomial regression
• For the housing data we could use a quadratic function
  – But it may not fit the data so well - a quadratic eventually
    curves back down, which would mean housing prices decrease
    when the size gets really big



Polynomial regression
• So instead we could use a cubic function
• hθ(x) = θ0 + θ1x + θ2x2 + θ3x3



Polynomial regression
• So instead we could use a cubic function
• hθ(x) = θ0 + θ1 x1 + θ2 x2 + θ3 x3
• hθ(x) = θ0 + θ1(size) + θ2 (size)2 + θ3 (size)3
x1 = (size)
x2 = (size)2
x3 = (size)3

Make sure to apply feature scaling.


size = 1 – 1000
(size)2 = 1 – 1000000
(size)3 = 1 – 109
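A short NumPy sketch of building these polynomial features and then mean-normalizing them, since their ranges differ by many orders of magnitude (the sizes are taken from the earlier table):

import numpy as np

size = np.array([2104.0, 1416.0, 1534.0, 852.0])

# x1 = size, x2 = size^2, x3 = size^3
X_poly = np.column_stack([size, size**2, size**3])

# Feature scaling: (x - mean) / (max - min) per column
X_scaled = (X_poly - X_poly.mean(axis=0)) / (X_poly.max(axis=0) - X_poly.min(axis=0))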
Choice of features
hθ(x) = θ0 + θ1(size) + θ2√(size)

• Instead of a conventional polynomial you could use
  variable^(1/something) - i.e. the square root, cube root,
  etc.
Gradient Descent
[Figure: contour plot of J(θ) showing the path gradient descent
takes toward the minimum]

• In order to minimize the cost function J(θ), the iterative
  algorithm takes many steps, over multiple iterations of
  gradient descent, to converge to the global minimum



Normal Equation
• For some linear regression problems the
normal equation provides a better solution
• So far we've been using gradient descent
– Iterative algorithm which takes steps to converge
• Normal equation solves θ analytically
– Solve for the optimum value of θ in one step
– Has some advantages and disadvantages



Normal Equation
• Simplified cost function
J(θ) = aθ2 + bθ + c
If 1D, θ ϵ ℝ, not a vector
• How do you minimize this?
– Set (d/dθ) J(θ) = … = 0
• Take the derivative of J(θ) with respect to θ
• Set that derivative equal to 0
• This allows you to solve for the value of θ which minimizes
  J(θ)



Normal Equation
• In our more complex problems;
– Here θ is an n + 1 dimensional vector of real numbers
– Cost function is a function of the vector value
θ ϵ ℝn+1
J(θ0, θ1, θ2, … θn) = (1/(2m)) Σi=1..m (hθ(x(i)) − y(i))2
• How do we minimize this function?
  – Take the partial derivative of J(θ) with respect to θj and set
    it to 0 for every j
    (∂/∂θj) J(θ) = … = 0 (for every j)



Normal Equation
• Solve for the values
  θ0, θ1, θ2, … θn
  which minimize J(θ)

• This would give the values of θ which minimize J(θ)


• If you work through the calculus, the derivation of the
  solution is pretty complex
  – We're not going to go through it here
  – Instead, here is what you need to know to implement this
    process



Normal Equation
Size in feet2 (x1)    Number of Bedrooms (x2)    Number of Floors (x3)    Age of home, years (x4)    Price ($) in 1000's (y)
2104                  5                          1                        45                         460
1416                  3                          2                        40                         232
1534                  3                          2                        30                         315
852                   2                          1                        36                         178

• Here
– n=4
– m=4



Normal Equation
• Add an extra column (the x0 feature)
• Construct a column vector y, an [m x 1] matrix
(x0)    Size in feet2 (x1)    Number of Bedrooms (x2)    Number of Floors (x3)    Age of home, years (x4)    Price ($) in 1000's (y)
1       2104                  5                          1                        45                         460
1       1416                  3                          2                        40                         232
1       1534                  3                          2                        30                         315
1       852                   2                          1                        36                         178

X = [ 1  2104  5  1  45
      1  1416  3  2  40
      1  1534  3  2  30
      1   852  2  1  36 ]    an [m x (n+1)] matrix

y = [ 460
      232
      315
      178 ]    an [m x 1] vector
• θ= (XT X)-1XT y
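A NumPy sketch of this computation on the table above. Note that with only m = 4 examples and n + 1 = 5 parameters, XT X is actually singular here (the "too many features" case discussed at the end of the lecture), so the pseudo-inverse is used in place of a plain inverse (illustrative, not the lecture's code):

import numpy as np

X = np.array([[1.0, 2104.0, 5.0, 1.0, 45.0],
              [1.0, 1416.0, 3.0, 2.0, 40.0],
              [1.0, 1534.0, 3.0, 2.0, 30.0],
              [1.0,  852.0, 2.0, 1.0, 36.0]])   # m x (n+1) design matrix
y = np.array([460.0, 232.0, 315.0, 178.0])      # m x 1 price vector

# theta = (X^T X)^-1 X^T y, with pinv standing in for the plain inverse
theta = np.linalg.pinv(X.T @ X) @ (X.T @ y)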
General Form
• m training examples and n features
• The design matrix (X)
– Each training example is an (n+1)-dimensional feature column
  vector
– X is constructed by taking each training example, transposing
  it (i.e. column -> row), and using it as a row of the design
  matrix X
– This creates an [m x (n+1)] matrix



General Form
• m training examples (x(1), y(1)), …, (x(m), y(m))
• n features

x(i) = [x0(i), x1(i), …, xn(i)]T ϵ ℝn+1

Design matrix:
X = [ (x(1))T
      (x(2))T
      …
      (x(m))T ]    an [m x (n+1)] matrix
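A small NumPy sketch of assembling the design matrix this way, stacking each example's transpose as a row (two rows of the housing table, purely for illustration):

import numpy as np

# Each training example as an (n+1)-dimensional vector, with x0 = 1 first.
example_1 = np.array([1.0, 2104.0, 5.0, 1.0, 45.0])
example_2 = np.array([1.0, 1416.0, 3.0, 2.0, 40.0])

# Each example's transpose becomes one row of the m x (n+1) design matrix.
X = np.vstack([example_1, example_2])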



General Form
• Concrete example with only one feature:

  E.g. X = [ 1  x(1)
             1  x(2)
             …
             1  x(m) ]    an [m x 2] matrix



Normal Equation
θ = (XT X)-1XT y

• (XT X)-1 is the inverse of the matrix XT X


– i.e. A = XT X
– A-1 = (XT X)-1

• No need to do feature scaling.



Gradient Descent vs Normal Equation
m training examples, n features

Gradient Descent
• Need to choose α
• Needs many iterations
• Works well even when n is massive (millions) – better suited to big data
  – What is a big n, though? 100 or even 1,000 is still (relatively) small
  – If n is 10,000 or more, then look at using gradient descent
• e.g. n = 10^6

Normal Equation
• No need to choose α
• No need to iterate
• Needs to compute (XT X)-1
  – This is the inverse of an n x n matrix
  – With most implementations, computing a matrix inverse grows as O(n3)
  – So not great – slow if n is large, and can be much slower
• e.g. n = 100, n = 1,000, n = 10,000



Normal Equation and Noninvertibility
Normal equation: θ = (XT X)-1XT y

• What if XT X is non-invertible? (singular/degenerate)
– Only some matrices are invertible
– This should be quite a rare problem
• Octave: pinv(X ' * X) * X ' * y
– pinv (pseudo inverse)
– This gets the right value even if (XT X) is non-invertible



Normal Equation and Noninvertibility
Normal equation: θ = (XT X)-1XT y

• What does it mean for XT X to be non-invertible?
• Normally there are two common causes
– Redundant features (linearly dependent)
  • e.g.
    – x1 = size in feet2
    – x2 = size in m2
    – x1 = (3.28)2 · x2



Normal Equation and Noninvertibility
Normal equation: θ = (XT X)-1XT y
• What does it mean for XT X to be non-invertible?
• Normally two common causes
– Too many features
  • e.g. m ≤ n (fewer training examples than parameters)
    – m = 10 and n = 100
    – θ ϵ ℝ100+1
  • Trying to fit 101 parameters from 10 training examples
  • It sometimes works, but it is not always a good idea
  • There is not enough data
  • Later we will look at why this may be too little data
• To solve this we can
  – Delete features
  – Use regularization (lets you use lots of features with a small training set)
