04: Linear Regression


Linear Regression

These slides were assembled by Eric Eaton, with grateful acknowledgement of the many others who made
their course materials freely available online. Feel free to reuse or adapt these slides for your own academic
purposes, provided that you include proper attribution. Please send comments and corrections to Eric.
Robot Image Credit: Viktoriya Sukhanova © 123RF.com
Regression
Given:
–  Data X = \{x^{(1)}, \ldots, x^{(n)}\} where x^{(i)} \in \mathbb{R}^d
–  Corresponding labels y = \{y^{(1)}, \ldots, y^{(n)}\} where y^{(i)} \in \mathbb{R}

[Figure: September Arctic Sea Ice Extent (1,000,000 sq km) vs. Year, 1975–2015, with linear and quadratic regression fits.]
Data from G. Witt. Journal of Statistics Education, Volume 21, Number 1 (2013)
Prostate Cancer Dataset
•  97 samples, partitioned into 67 train / 30 test
•  Eight predictors (features):
–  6 continuous (4 log transforms), 1 binary, 1 ordinal
•  Continuous outcome variable:
–  lpsa: log(prostate specific antigen level)

Based on slide by Jeff Howbert


Linear Regression
•  Hypothesis:

   y = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \ldots + \theta_d x_d = \sum_{j=0}^{d} \theta_j x_j

   Assume x_0 = 1

•  Fit model by minimizing sum of squared errors

Figures are courtesy of Greg Shakhnarovich


Least Squares Linear Regression
•  Cost function:

   J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2

•  Fit by solving \min_\theta J(\theta)
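As a minimal sketch of this cost function (assuming NumPy and a design matrix X whose first column is all ones; the function name is ours):

```python
import numpy as np

def cost(theta, X, y):
    """Least squares cost J(theta) = 1/(2n) * sum_i (h_theta(x^(i)) - y^(i))^2."""
    residuals = X @ theta - y              # h_theta(x^(i)) - y^(i) for every instance i
    return (residuals @ residuals) / (2 * len(y))
```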
Intuition Behind Cost Function

   J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2

For insight on J(), let's assume x \in \mathbb{R}, so \theta = [\theta_0, \theta_1]

Based on example by Andrew Ng
Intuition Behind Cost Function (continued)

(for fixed θ, h_θ(x) is a function of x)        (J is a function of the parameter θ_1)

[Figure: left panel – the training data (1,1), (2,2), (3,3) and the line h_θ(x) plotted over x ∈ [0, 3] for a fixed θ; right panel – J plotted against θ_1 ∈ [−0.5, 2.5].]

Example: for θ = [0, 0.5],

   J([0, 0.5]) = \frac{1}{2 \cdot 3} \left[ (0.5 - 1)^2 + (1 - 2)^2 + (1.5 - 3)^2 \right] \approx 0.58

Based on example by Andrew Ng
Intuition Behind Cost Function (continued)

For θ = [0, 0],

   J([0, 0]) = \frac{1}{2 \cdot 3} \left[ 1^2 + 2^2 + 3^2 \right] \approx 2.333

J(θ) is convex, so it has a single global minimum.

Based on example by Andrew Ng
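A quick numerical check of these worked values (a sketch assuming NumPy; the toy dataset (1,1), (2,2), (3,3) is inferred from the computation above):

```python
import numpy as np

# Toy dataset inferred from the slide: x = 1, 2, 3 with targets y = 1, 2, 3.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])   # first column is x0 = 1
y = np.array([1.0, 2.0, 3.0])

def cost(theta):
    residuals = X @ theta - y
    return (residuals @ residuals) / (2 * len(y))

print(round(cost(np.array([0.0, 0.5])), 2))  # 0.58
print(round(cost(np.array([0.0, 0.0])), 2))  # 2.33
```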
Intuition Behind Cost Function (continued)

[A sequence of image-only slides by Andrew Ng: each pairs the hypothesis h_θ(x) for a fixed θ (a function of x, left) with the corresponding cost J as a function of the parameters θ (right), for several different choices of θ.]
Basic Search Procedure
•  Choose an initial value for θ
•  Until we reach a minimum:
–  Choose a new value for θ to reduce J(θ)

[Figure by Andrew Ng: surface plot of J(θ_0, θ_1) over the (θ_0, θ_1) plane.]
Basic Search Procedure (continued)

Since the least squares objective function is convex, we don't need to worry about local minima.

[Figure by Andrew Ng: surface plot of J(θ_0, θ_1).]
Gradient Descent
•  Initialize θ
•  Repeat until convergence:

   \theta_j \leftarrow \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)        (simultaneous update for j = 0 ... d)

   α is the learning rate (small), e.g., α = 0.05

[Figure: J(θ) plotted against θ, with gradient descent steps of size α moving toward the minimum.]
Gradient Descent
•  Initialize θ
•  Repeat until convergence:

   \theta_j \leftarrow \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)        (simultaneous update for j = 0 ... d)

For linear regression:

   \frac{\partial}{\partial \theta_j} J(\theta)
      = \frac{\partial}{\partial \theta_j} \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2
      = \frac{\partial}{\partial \theta_j} \frac{1}{2n} \sum_{i=1}^{n} \left( \sum_{k=0}^{d} \theta_k x_k^{(i)} - y^{(i)} \right)^2
      = \frac{1}{n} \sum_{i=1}^{n} \left( \sum_{k=0}^{d} \theta_k x_k^{(i)} - y^{(i)} \right) \times \frac{\partial}{\partial \theta_j} \left( \sum_{k=0}^{d} \theta_k x_k^{(i)} - y^{(i)} \right)
      = \frac{1}{n} \sum_{i=1}^{n} \left( \sum_{k=0}^{d} \theta_k x_k^{(i)} - y^{(i)} \right) x_j^{(i)}
      = \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
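As a sanity check on this derivation, the analytic gradient can be compared against a finite-difference approximation (a sketch assuming NumPy; the data and θ below are synthetic, for illustration only):

```python
import numpy as np

def cost(theta, X, y):
    r = X @ theta - y
    return (r @ r) / (2 * len(y))

def gradient(theta, X, y):
    # (1/n) * sum_i (h_theta(x^(i)) - y^(i)) * x_j^(i), computed for all j at once
    return X.T @ (X @ theta - y) / len(y)

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(5), rng.normal(size=(5, 2))])  # x0 = 1 plus two features
y = rng.normal(size=5)
theta = rng.normal(size=3)

eps = 1e-6
numeric = np.array([
    (cost(theta + eps * e, X, y) - cost(theta - eps * e, X, y)) / (2 * eps)
    for e in np.eye(3)
])
print(np.allclose(gradient(theta, X, y), numeric))  # True
```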
Gradient Descent for Linear Regression
•  Initialize θ
•  Repeat until convergence:

   \theta_j \leftarrow \theta_j - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}        (simultaneous update for j = 0 ... d)

•  To achieve the simultaneous update:
–  At the start of each GD iteration, compute h_\theta(x^{(i)})
–  Use this stored value in the update step loop
•  Assume convergence when \| \theta_{new} - \theta_{old} \|_2 < \epsilon

   L2 norm:  \| v \|_2 = \sqrt{ \sum_i v_i^2 } = \sqrt{ v_1^2 + v_2^2 + \ldots + v_{|v|}^2 }
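Putting the pieces together, a minimal sketch of this loop (assuming NumPy, a design matrix X with a leading column of ones, and illustrative values for α, ε, and the iteration cap):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.05, eps=1e-6, max_iters=10000):
    """Batch gradient descent for least squares linear regression."""
    n, d_plus_1 = X.shape
    theta = np.zeros(d_plus_1)
    for _ in range(max_iters):
        h = X @ theta                                   # h_theta(x^(i)) for all i, computed once
        theta_new = theta - alpha * X.T @ (h - y) / n   # simultaneous update of all theta_j
        if np.linalg.norm(theta_new - theta) < eps:     # ||theta_new - theta_old||_2 < eps
            return theta_new
        theta = theta_new
    return theta
```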
Gradient Descent (illustration)

[A sequence of image-only slides by Andrew Ng showing successive gradient descent iterations: the left panel plots the current hypothesis h_θ(x) for fixed θ as a function of x; the right panel plots J as a function of the parameters θ. The initial hypothesis shown is h(x) = -900 - 0.1x.]
Choosing α

α too small:
•  Slow convergence

α too large:
•  Increasing value for J(θ)
•  May overshoot the minimum
•  May fail to converge
•  May even diverge

To see if gradient descent is working, print out J(θ) each iteration
•  The value should decrease at each iteration
•  If it doesn't, adjust α
Extending Linear Regression to More Complex Models
•  The inputs X for linear regression can be:
–  Original quantitative inputs
–  Transformations of quantitative inputs
   •  e.g., log, exp, square root, square, etc.
–  Polynomial transformations
   •  example: y = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3
–  Basis expansions
–  Dummy coding of categorical inputs
–  Interactions between variables
   •  example: x_3 = x_1 \cdot x_2

This allows use of linear regression techniques to fit non-linear datasets, as sketched below.
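A minimal sketch of building such transformed inputs (assuming NumPy; the raw features x1, x2 and their values are hypothetical, for illustration):

```python
import numpy as np

# Hypothetical raw features for a handful of instances.
x1 = np.array([1.0, 2.0, 3.0, 4.0])
x2 = np.array([0.5, 1.0, 1.5, 2.0])

# Design matrix with a bias column plus transformed and interaction features.
X = np.column_stack([
    np.ones_like(x1),   # x0 = 1 (bias)
    x1,                 # original quantitative input
    np.log(x2),         # transformation of a quantitative input
    x1 ** 2,            # polynomial term
    x1 * x2,            # interaction: x3 = x1 * x2
])
# X can now be fed to ordinary linear regression (gradient descent or closed form).
```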
Linear Basis Function Models
•  Generally,

   h_\theta(x) = \sum_{j=0}^{d} \theta_j \phi_j(x)

   where \phi_j is a basis function

•  Typically \phi_0(x) = 1, so that \theta_0 acts as a bias
•  In the simplest case, we use linear basis functions: \phi_j(x) = x_j

Based on slide by Christopher Bishop (PRML)


Linear Basis Function Models
•  Polynomial basis functions:
–  These are global; a small change in x affects all basis functions

•  Gaussian basis functions:
–  These are local; a small change in x only affects nearby basis functions. μ_j and s control location and scale (width).

Based on slide by Christopher Bishop (PRML)
Linear Basis Function Models
•  Sigmoidal basis functions:
–  These are also local; a small change in x only affects nearby basis functions. μ_j and s control location and scale (slope).

Based on slide by Christopher Bishop (PRML)
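A sketch of these basis functions in code (assuming NumPy, and using the standard forms from Bishop's PRML, which the slide images presumably showed: polynomial x^j, Gaussian exp(-(x - μ_j)^2 / (2 s^2)), and sigmoidal σ((x - μ_j)/s) with the logistic sigmoid):

```python
import numpy as np

def polynomial_basis(x, degree):
    # phi_j(x) = x^j, j = 0..degree  (global: changing x changes every feature)
    return np.column_stack([x ** j for j in range(degree + 1)])

def gaussian_basis(x, centers, s):
    # phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2))  (local, centered at mu_j with width s)
    return np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2 * s ** 2))

def sigmoidal_basis(x, centers, s):
    # phi_j(x) = sigma((x - mu_j) / s), sigma(a) = 1 / (1 + exp(-a))  (local, slope set by s)
    return 1.0 / (1.0 + np.exp(-(x[:, None] - centers[None, :]) / s))

x = np.linspace(0.0, 1.0, 5)
centers = np.array([0.25, 0.5, 0.75])
print(gaussian_basis(x, centers, s=0.1).shape)  # (5, 3): one column per basis function
```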


Example of Fitting a Polynomial Curve with a Linear Model

   y = \theta_0 + \theta_1 x + \theta_2 x^2 + \ldots + \theta_p x^p = \sum_{j=0}^{p} \theta_j x^j
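As a minimal sketch of such a fit (assuming NumPy; the synthetic data and the choice of degree p = 3 are illustrative), the polynomial model is just a linear model on the polynomial basis features:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=x.size)   # noisy target, for illustration

p = 3
X = np.column_stack([x ** j for j in range(p + 1)])          # columns: 1, x, x^2, x^3
theta, *_ = np.linalg.lstsq(X, y, rcond=None)                # least squares fit of the linear model
y_hat = X @ theta                                            # fitted polynomial curve
```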
Linear Basis Function Models
•  Basic linear model:

   h_\theta(x) = \sum_{j=0}^{d} \theta_j x_j

•  Generalized linear model:

   h_\theta(x) = \sum_{j=0}^{d} \theta_j \phi_j(x)

•  Once we have replaced the data by the outputs of the basis functions, fitting the generalized model is exactly the same problem as fitting the basic model
–  Unless we use the kernel trick – more on that when we cover support vector machines
–  Therefore, there is no point in cluttering the math with basis functions

Based on slide by Geoff Hinton
Linear Algebra Concepts
•  A vector in \mathbb{R}^d is an ordered set of d real numbers
–  e.g., v = [1, 6, 3, 4] is in \mathbb{R}^4
–  "[1, 6, 3, 4]" is a column vector:

   \begin{pmatrix} 1 \\ 6 \\ 3 \\ 4 \end{pmatrix}

–  as opposed to a row vector:

   \begin{pmatrix} 1 & 6 & 3 & 4 \end{pmatrix}

•  An m-by-n matrix is an object with m rows and n columns, where each entry is a real number:

   \begin{pmatrix} 1 & 2 & 8 \\ 4 & 78 & 6 \\ 9 & 3 & 2 \end{pmatrix}

Based on slides by Joseph Bradley
Linear Algebra Concepts
•  Transpose: reflect a vector/matrix across its main diagonal:

   \begin{pmatrix} a \\ b \end{pmatrix}^{\top} = \begin{pmatrix} a & b \end{pmatrix}
   \qquad
   \begin{pmatrix} a & b \\ c & d \end{pmatrix}^{\top} = \begin{pmatrix} a & c \\ b & d \end{pmatrix}

–  Note: (Ax)^{\top} = x^{\top} A^{\top}   (we'll define multiplication soon...)

•  Vector norms:
–  The Lp norm of v = (v_1, \ldots, v_k) is \left( \sum_i |v_i|^p \right)^{1/p}
–  Common norms: L1, L2
–  L_\infty = \max_i |v_i|
•  The length of a vector v is L2(v)

Based on slides by Joseph Bradley
Linear Algebra Concepts
•  Vector dot product:

   u \cdot v = (u_1 \; u_2) \cdot (v_1 \; v_2) = u_1 v_1 + u_2 v_2

–  Note: the dot product of u with itself equals \mathrm{length}(u)^2 = \| u \|_2^2

•  Matrix product:

   A = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}, \quad
   B = \begin{pmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{pmatrix}

   AB = \begin{pmatrix} a_{11}b_{11} + a_{12}b_{21} & a_{11}b_{12} + a_{12}b_{22} \\ a_{21}b_{11} + a_{22}b_{21} & a_{21}b_{12} + a_{22}b_{22} \end{pmatrix}

Based on slides by Joseph Bradley


Linear Algebra Concepts
•  Vector products:
–  Dot product:

   u \cdot v = u^{\top} v = (u_1 \; u_2) \begin{pmatrix} v_1 \\ v_2 \end{pmatrix} = u_1 v_1 + u_2 v_2

–  Outer product:

   u v^{\top} = \begin{pmatrix} u_1 \\ u_2 \end{pmatrix} (v_1 \; v_2) = \begin{pmatrix} u_1 v_1 & u_1 v_2 \\ u_2 v_1 & u_2 v_2 \end{pmatrix}

Based on slides by Joseph Bradley
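A minimal NumPy sketch of these operations (the values are illustrative only):

```python
import numpy as np

u = np.array([1.0, 2.0])
v = np.array([3.0, 4.0])
A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[5.0, 6.0], [7.0, 8.0]])

print(u @ v)                      # dot product: u1*v1 + u2*v2 = 11.0
print(np.outer(u, v))             # outer product: [[u1*v1, u1*v2], [u2*v1, u2*v2]]
print(A.T)                        # transpose
print(A @ B)                      # matrix product
print(np.linalg.norm(u))          # L2 norm (length) of u
print(np.linalg.norm(u, np.inf))  # L_infinity norm: max_i |u_i|
```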


Vectorization
•  Benefits of vectorization:
–  More compact equations
–  Faster code (using optimized matrix libraries)
•  Consider our model:

   h(x) = \sum_{j=0}^{d} \theta_j x_j

•  Let

   \theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_d \end{bmatrix}
   \qquad
   x^{\top} = \begin{bmatrix} 1 & x_1 & \ldots & x_d \end{bmatrix}

•  Can write the model in vectorized form as h(x) = \theta^{\top} x
Vectorization
•  Consider our model for n instances:

   h\!\left(x^{(i)}\right) = \sum_{j=0}^{d} \theta_j x_j^{(i)}

•  Let

   \theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_d \end{bmatrix} \in \mathbb{R}^{(d+1) \times 1}
   \qquad
   X = \begin{bmatrix} 1 & x_1^{(1)} & \ldots & x_d^{(1)} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_1^{(i)} & \ldots & x_d^{(i)} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_1^{(n)} & \ldots & x_d^{(n)} \end{bmatrix} \in \mathbb{R}^{n \times (d+1)}

•  Can write the model in vectorized form as h_\theta(x) = X\theta
Vectorization
•  For the linear regression cost function:

   J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2
             = \frac{1}{2n} \sum_{i=1}^{n} \left( \theta^{\top} x^{(i)} - y^{(i)} \right)^2

•  Let

   y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(n)} \end{bmatrix} \in \mathbb{R}^{n \times 1}

   Then

   J(\theta) = \frac{1}{2n} (X\theta - y)^{\top} (X\theta - y)

   where X \in \mathbb{R}^{n \times (d+1)}, \theta \in \mathbb{R}^{(d+1) \times 1}, (X\theta - y)^{\top} \in \mathbb{R}^{1 \times n}, and (X\theta - y) \in \mathbb{R}^{n \times 1}
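In code, the vectorized form computes the same value as the explicit sum; a sketch assuming NumPy and synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, d))])   # shape (n, d+1)
y = rng.normal(size=n)
theta = rng.normal(size=d + 1)                                # shape (d+1,)

# Loop form: J = 1/(2n) * sum_i (theta^T x^(i) - y^(i))^2
J_loop = sum((theta @ X[i] - y[i]) ** 2 for i in range(n)) / (2 * n)

# Vectorized form: J = 1/(2n) * (X theta - y)^T (X theta - y)
r = X @ theta - y
J_vec = (r @ r) / (2 * n)

print(np.isclose(J_loop, J_vec))  # True
```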
Closed Form Solution
•  Instead of using GD, solve for the optimal θ analytically
–  Notice that the solution is where \frac{\partial}{\partial \theta} J(\theta) = 0
•  Derivation:

   J(\theta) = \frac{1}{2n} (X\theta - y)^{\top} (X\theta - y)
            \propto \theta^{\top} X^{\top} X \theta - y^{\top} X \theta - \theta^{\top} X^{\top} y + y^{\top} y
            \propto \theta^{\top} X^{\top} X \theta - 2\theta^{\top} X^{\top} y + y^{\top} y

   Take the derivative and set it equal to 0, then solve for θ:

   \frac{\partial}{\partial \theta} \left( \theta^{\top} X^{\top} X \theta - 2\theta^{\top} X^{\top} y + y^{\top} y \right) = 0
   (X^{\top} X)\theta - X^{\top} y = 0
   (X^{\top} X)\theta = X^{\top} y

•  Closed form solution: \theta = (X^{\top} X)^{-1} X^{\top} y
Closed Form Solution
•  Can obtain θ by simply plugging X and y into

   \theta = (X^{\top} X)^{-1} X^{\top} y

   where

   X = \begin{bmatrix} 1 & x_1^{(1)} & \ldots & x_d^{(1)} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_1^{(i)} & \ldots & x_d^{(i)} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_1^{(n)} & \ldots & x_d^{(n)} \end{bmatrix}
   \qquad
   y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(n)} \end{bmatrix}

•  If X^{\top} X is not invertible (i.e., singular), we may need to:
–  Use the pseudo-inverse instead of the inverse
   •  In Python: numpy.linalg.pinv(a)
–  Remove redundant (not linearly independent) features
–  Remove extra features to ensure that d ≤ n
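A minimal sketch of the closed-form fit (assuming NumPy; pinv is used as the slide suggests, which also covers the singular case, and the function names are ours):

```python
import numpy as np

def fit_closed_form(X, y):
    """theta = (X^T X)^{-1} X^T y, computed with the pseudo-inverse for robustness."""
    return np.linalg.pinv(X.T @ X) @ X.T @ y

# Equivalent when X^T X is invertible, and more numerically stable:
def fit_normal_equations(X, y):
    return np.linalg.solve(X.T @ X, X.T @ y)
```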
Gradient Descent vs. Closed Form Solution

Gradient Descent:
•  Requires multiple iterations
•  Need to choose α
•  Works well when n is large
•  Can support incremental learning

Closed Form Solution:
•  Non-iterative
•  No need for α
•  Slow if n is large
–  Computing (X^{\top} X)^{-1} is roughly O(d^3), and forming X^{\top} X is O(n d^2)
Improving Learning: Feature Scaling
•  Idea: Ensure that features have similar scales

[Figure: the (θ_1, θ_2) parameter space before feature scaling and after feature scaling.]

•  Makes gradient descent converge much faster
Feature Standardization
•  Rescales features to have zero mean and unit variance
–  Let μ_j be the mean of feature j:

   \mu_j = \frac{1}{n} \sum_{i=1}^{n} x_j^{(i)}

–  Replace each value with:

   x_j^{(i)} \leftarrow \frac{x_j^{(i)} - \mu_j}{s_j}        for j = 1 ... d  (not x_0!)

   •  s_j is the standard deviation of feature j
   •  Could also use the range of feature j (max_j – min_j) for s_j

•  Must apply the same transformation to instances for both training and prediction
•  Outliers can cause problems
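A minimal sketch of this (assuming NumPy and a design matrix whose column 0 is the all-ones bias, which is left unscaled; the function name is ours):

```python
import numpy as np

def standardize(X_train, X_test):
    """Standardize features to zero mean, unit variance; column 0 (bias) is untouched."""
    mu = X_train[:, 1:].mean(axis=0)           # per-feature means from the training set
    s = X_train[:, 1:].std(axis=0)             # per-feature standard deviations
    X_train = X_train.copy()
    X_test = X_test.copy()
    X_train[:, 1:] = (X_train[:, 1:] - mu) / s
    X_test[:, 1:] = (X_test[:, 1:] - mu) / s   # same transformation applied at prediction time
    return X_train, X_test
```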
Quality of Fit

[Figure: three fits of Price vs. Size – underfitting (high bias), correct fit, and overfitting (high variance).]

Overfitting:
•  The learned hypothesis may fit the training set very well (J(\theta) \approx 0)
•  ...but fails to generalize to new examples

Based on example by Andrew Ng


Regularization
•  A method for automatically controlling the complexity of the learned hypothesis
•  Idea: penalize large values of θ_j
–  Can incorporate into the cost function
–  Works well when we have a lot of features, each of which contributes a bit to predicting the label

•  Can also address overfitting by eliminating features (either manually or via model selection)
Regularization
•  Linear regression objective function:

   J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2

   (first term: model fit to data; second term: regularization)

–  λ is the regularization parameter (λ ≥ 0)
–  No regularization on θ_0!
Understanding Regularization

   J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2

•  Note that \sum_{j=1}^{d} \theta_j^2 = \| \theta_{1:d} \|_2^2
–  This is the squared magnitude of the feature coefficient vector!

•  We can also think of this as:

   \sum_{j=1}^{d} (\theta_j - 0)^2 = \| \theta_{1:d} - \vec{0} \|_2^2

•  L2 regularization pulls coefficients toward 0
Understanding Regularization

   J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2

•  What happens if we set λ to be huge (e.g., 10^10)?
–  The penalty drives θ_1, ..., θ_d toward 0, leaving an (approximately) constant hypothesis h_θ(x) ≈ θ_0

[Figure: Price vs. Size, with the fitted hypothesis flattening to a horizontal line as the coefficients are driven to 0.]

Based on example by Andrew Ng


Regularized Linear Regression
•  Cost function:

   J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2

•  Fit by solving \min_\theta J(\theta)

•  Gradient update:

   \theta_0 \leftarrow \theta_0 - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)
   \theta_j \leftarrow \theta_j - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} - \alpha \lambda \theta_j

   (the last term is the regularization term)
Regularized Linear Regression

   J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2

   \theta_0 \leftarrow \theta_0 - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)
   \theta_j \leftarrow \theta_j - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} - \alpha \lambda \theta_j

•  We can rewrite the gradient step as:

   \theta_j \leftarrow \theta_j (1 - \alpha \lambda) - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
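A minimal sketch of one such regularized update step (assuming NumPy; lam denotes λ, the default values are illustrative, and the function name is ours):

```python
import numpy as np

def ridge_gradient_step(theta, X, y, alpha=0.05, lam=0.1):
    """One regularized gradient descent step; theta[0] (the bias) is not penalized."""
    n = len(y)
    grad = X.T @ (X @ theta - y) / n       # (1/n) * sum_i (h_theta(x^(i)) - y^(i)) * x_j^(i)
    penalty = lam * theta
    penalty[0] = 0.0                       # no regularization on theta_0
    return theta - alpha * (grad + penalty)
```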
Regularized Linear Regression
•  To incorporate regularization into the closed form solution:

   \theta = \left( X^{\top} X + \lambda \begin{bmatrix} 0 & 0 & 0 & \ldots & 0 \\ 0 & 1 & 0 & \ldots & 0 \\ 0 & 0 & 1 & \ldots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \ldots & 1 \end{bmatrix} \right)^{-1} X^{\top} y
Regularized Linear Regression (continued)
•  Can derive this the same way, by solving \frac{\partial}{\partial \theta} J(\theta) = 0
•  Can prove that for λ > 0, the inverse in the equation above always exists
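A minimal sketch of this regularized closed-form solution (assuming NumPy; lam denotes λ and the function name is ours):

```python
import numpy as np

def fit_ridge_closed_form(X, y, lam=0.1):
    """theta = (X^T X + lam * D)^{-1} X^T y, where D is the identity with D[0, 0] = 0."""
    d_plus_1 = X.shape[1]
    D = np.eye(d_plus_1)
    D[0, 0] = 0.0                          # do not regularize the bias term theta_0
    return np.linalg.solve(X.T @ X + lam * D, X.T @ y)
```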
