Regression
These slides were assembled by Eric Eaton, with grateful acknowledgement of the many others who made their course materials freely available online. Feel free to reuse or adapt these slides for your own academic purposes, provided that you include proper attribution. Please send comments and corrections to Eric.
Robot Image Credit: Viktoriya Sukhanova © 123RF.com
Regression
Given:
– Data $X = \{ x^{(1)}, \ldots, x^{(n)} \}$ where $x^{(i)} \in \mathbb{R}^d$
– Corresponding labels $y = \{ y^{(1)}, \ldots, y^{(n)} \}$ where $y^{(i)} \in \mathbb{R}$
[Figure: September Arctic Sea Ice Extent (1,000,000 sq km) vs. Year, 1975–2015, with linear and quadratic regression fits. Data from G. Witt, Journal of Statistics Education, Volume 21, Number 1 (2013).]
Prostate Cancer Dataset
• 97 samples, partitioned into 67 train / 30 test
• Eight predictors (features):
– 6 continuous (4 log transforms), 1 binary, 1 ordinal
• Continuous outcome variable:
– lpsa: log(prostate specific antigen level)
Intuition Behind Cost Function
$$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2$$
For insight on $J()$, let's assume $x \in \mathbb{R}$, so $\theta = [\theta_0, \theta_1]$.
[Figure: the training points plotted as $y$ vs. $x$ (left) and the corresponding cost curve (right).]
Worked example for $\theta = [0, 0.5]$ on three training points:
$$J([0, 0.5]) = \frac{1}{2 \cdot 3}\left[ (0.5 - 1)^2 + (1 - 2)^2 + (1.5 - 3)^2 \right] \approx 0.58$$
Plotting $J$ against $\theta_1$ shows that $J(\theta)$ is convex (bowl-shaped), so it has a single global minimum.
(Based on an example by Andrew Ng.)
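To make the worked example concrete, here is a small numerical check. This is a minimal sketch: the three training points (1, 1), (2, 2), (3, 3) are those implied by the calculation above, and all names are illustrative.

```python
import numpy as np

def cost(theta0, theta1, x, y):
    """J(theta) = 1/(2n) * sum_i (h_theta(x_i) - y_i)^2 for scalar inputs."""
    n = len(x)
    predictions = theta0 + theta1 * x          # h_theta(x) with x in R
    return np.sum((predictions - y) ** 2) / (2 * n)

x = np.array([1.0, 2.0, 3.0])   # toy inputs implied by the worked example
y = np.array([1.0, 2.0, 3.0])   # toy labels

print(cost(0.0, 0.5, x, y))     # ~0.583, matching J([0, 0.5]) above
print(cost(0.0, 1.0, x, y))     # 0.0 -- a perfect fit
```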
Intuition Behind Cost Function
[Several figure-only slides illustrating $J(\theta_0, \theta_1)$; slides by Andrew Ng.]
Basic Search Procedure
• Choose an initial value for $\theta$
• Until we reach a minimum:
– Choose a new value for $\theta$ to reduce $J(\theta)$
[Figure: $J(\theta_0, \theta_1)$ plotted over $(\theta_0, \theta_1)$; figure by Andrew Ng.]
Gradient Descent
• Initialize $\theta$
• Repeat until convergence:
$$\theta_j \leftarrow \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) \qquad \text{simultaneous update for } j = 0 \ldots d$$
[Figure: one-dimensional cost curve $J(\theta)$, annotated with the learning rate $\alpha$.]
Gradient Descent
• Initialize $\theta$
• Repeat until convergence:
$$\theta_j \leftarrow \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) \qquad \text{simultaneous update for } j = 0 \ldots d$$
For linear regression:
$$\frac{\partial}{\partial \theta_j} J(\theta) = \frac{\partial}{\partial \theta_j} \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2$$
$$= \frac{\partial}{\partial \theta_j} \frac{1}{2n} \sum_{i=1}^{n} \left( \sum_{k=0}^{d} \theta_k x_k^{(i)} - y^{(i)} \right)^2$$
$$= \frac{1}{n} \sum_{i=1}^{n} \left( \sum_{k=0}^{d} \theta_k x_k^{(i)} - y^{(i)} \right) \times \frac{\partial}{\partial \theta_j} \left( \sum_{k=0}^{d} \theta_k x_k^{(i)} - y^{(i)} \right)$$
$$= \frac{1}{n} \sum_{i=1}^{n} \left( \sum_{k=0}^{d} \theta_k x_k^{(i)} - y^{(i)} \right) x_j^{(i)}$$
$$= \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)}$$
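One way to verify this derivative is to compare it against a finite-difference approximation of $J$. A minimal sketch, assuming a design matrix with a leading column of ones and randomly generated toy data (all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, d))])  # leading column of ones
y = rng.normal(size=n)
theta = rng.normal(size=d + 1)

def J(theta):
    residuals = X @ theta - y
    return (residuals @ residuals) / (2 * n)

# Analytic gradient from the derivation: (1/n) * sum_i (h(x^(i)) - y^(i)) * x_j^(i)
analytic = (X.T @ (X @ theta - y)) / n

# Central finite differences, one coordinate j at a time
eps = 1e-6
numeric = np.array([
    (J(theta + eps * np.eye(d + 1)[j]) - J(theta - eps * np.eye(d + 1)[j])) / (2 * eps)
    for j in range(d + 1)
])
print(np.max(np.abs(analytic - numeric)))   # should be tiny (on the order of 1e-8 or smaller)
```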
Gradient Descent for Linear Regression
• Initialize $\theta$
• Repeat until convergence:
$$\theta_j \leftarrow \theta_j - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)} \qquad \text{simultaneous update for } j = 0 \ldots d$$
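A minimal NumPy sketch of this update rule (not from the slides): X is assumed to already include the leading column of ones, and the hyperparameters alpha and num_iters are illustrative defaults.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, num_iters=1000):
    """Batch gradient descent for linear regression.

    X : (n, d+1) design matrix whose first column is all ones
    y : (n,) vector of targets
    Returns the learned parameters theta, shape (d+1,).
    """
    n = X.shape[0]
    theta = np.zeros(X.shape[1])
    for _ in range(num_iters):
        residuals = X @ theta - y               # h_theta(x^(i)) - y^(i) for all i
        gradient = (X.T @ residuals) / n        # (1/n) * sum_i residual_i * x_j^(i)
        theta = theta - alpha * gradient        # simultaneous update of every theta_j
    return theta

# Usage sketch: X = np.column_stack([np.ones(len(x_raw)), x_raw]); theta = gradient_descent(X, y)
```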
Gradient Descent
[Several figure-only slides illustrating gradient descent progress; slides by Andrew Ng.]
Choosing α
• α too small: slow convergence
• α too large: increasing value for $J(\theta)$
Polynomial regression fits a degree-$p$ model:
$$y = \theta_0 + \theta_1 x + \theta_2 x^2 + \cdots + \theta_p x^p = \sum_{j=0}^{p} \theta_j x^j$$
Linear Basis Function Models
• Basic linear model: $h_\theta(x) = \sum_{j=0}^{d} \theta_j x_j$
• Generalized linear model: $h_\theta(x) = \sum_{j=0}^{d} \theta_j \phi_j(x)$
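For example, polynomial regression is the special case with basis functions $\phi_j(x) = x^j$. A small sketch of building such features (the helper name polynomial_basis is illustrative):

```python
import numpy as np

def polynomial_basis(x, degree):
    """Map scalar inputs x to the basis [1, x, x^2, ..., x^degree]."""
    return np.vander(x, N=degree + 1, increasing=True)

x = np.array([0.0, 0.5, 1.0, 1.5])
Phi = polynomial_basis(x, degree=3)    # shape (4, 4); column j holds phi_j(x) = x**j
# The model is still linear in theta:  h_theta(x) = Phi @ theta
```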
• Vector norms, e.g. for $x = (1, 6, 3, 4)$: $\|x\|_p = \left( \sum_i |x_i|^p \right)^{1/p}$
• Vector products:
– Dot product: $u \cdot v = u^{\top} v = \begin{pmatrix} u_1 & u_2 \end{pmatrix} \begin{pmatrix} v_1 \\ v_2 \end{pmatrix} = u_1 v_1 + u_2 v_2$
– Outer product: $u v^{\top} = \begin{pmatrix} u_1 \\ u_2 \end{pmatrix} \begin{pmatrix} v_1 & v_2 \end{pmatrix} = \begin{pmatrix} u_1 v_1 & u_1 v_2 \\ u_2 v_1 & u_2 v_2 \end{pmatrix}$
• Let
$$\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_d \end{bmatrix} \qquad x = \begin{bmatrix} 1 & x_1 & \ldots & x_d \end{bmatrix}^{\top}$$
• Can write the model in vectorized form as $h(x) = \theta^{\top} x$
Vectorization
• Consider our model for $n$ instances:
$$h\!\left(x^{(i)}\right) = \sum_{j=0}^{d} \theta_j x_j^{(i)}$$
• Let
$$\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_d \end{bmatrix} \in \mathbb{R}^{(d+1)\times 1} \qquad X = \begin{bmatrix} 1 & x_1^{(1)} & \ldots & x_d^{(1)} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_1^{(i)} & \ldots & x_d^{(i)} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_1^{(n)} & \ldots & x_d^{(n)} \end{bmatrix} \in \mathbb{R}^{n\times(d+1)}$$
• Can write the model in vectorized form as $h_\theta(x) = X\theta$
Vectorization
• For the linear regression cost function:
$$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2 = \frac{1}{2n} \sum_{i=1}^{n} \left( \theta^{\top} x^{(i)} - y^{(i)} \right)^2$$
• Let
$$y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(n)} \end{bmatrix} \in \mathbb{R}^{n \times 1}$$
Then, with $X \in \mathbb{R}^{n\times(d+1)}$ and $\theta \in \mathbb{R}^{(d+1)\times 1}$:
$$J(\theta) = \frac{1}{2n} (X\theta - y)^{\top} (X\theta - y)$$
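A sketch of this vectorized cost in NumPy, reusing the toy data from the earlier worked example (names are illustrative):

```python
import numpy as np

def cost_vectorized(theta, X, y):
    """J(theta) = 1/(2n) * (X theta - y)^T (X theta - y)."""
    residuals = X @ theta - y
    return (residuals @ residuals) / (2 * X.shape[0])

# Same toy data as the earlier worked example, now with a leading column of ones
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 2.0, 3.0])
print(cost_vectorized(np.array([0.0, 0.5]), X, y))   # ~0.583 again
```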
Closed Form Solution
• Instead of using GD, solve for the optimal $\theta$ analytically
– Notice that the solution is where $\frac{\partial}{\partial \theta} J(\theta) = 0$
• Derivation:
$$J(\theta) = \frac{1}{2n} (X\theta - y)^{\top} (X\theta - y) \qquad (\text{a } 1 \times 1 \text{ scalar})$$
$$\propto \theta^{\top} X^{\top} X \theta - y^{\top} X \theta - \theta^{\top} X^{\top} y + y^{\top} y$$
$$\propto \theta^{\top} X^{\top} X \theta - 2\,\theta^{\top} X^{\top} y + y^{\top} y$$
Take the derivative and set it equal to 0, then solve for $\theta$:
$$\frac{\partial}{\partial \theta} \left( \theta^{\top} X^{\top} X \theta - 2\,\theta^{\top} X^{\top} y + y^{\top} y \right) = 0$$
$$(X^{\top} X)\theta - X^{\top} y = 0$$
$$(X^{\top} X)\theta = X^{\top} y$$
Closed form solution: $\theta = (X^{\top} X)^{-1} X^{\top} y$
Closed Form Solution
• Can obtain $\theta$ by simply plugging $X$ and $y$ into
$$\theta = (X^{\top} X)^{-1} X^{\top} y$$
$$X = \begin{bmatrix} 1 & x_1^{(1)} & \ldots & x_d^{(1)} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_1^{(i)} & \ldots & x_d^{(i)} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_1^{(n)} & \ldots & x_d^{(n)} \end{bmatrix} \qquad y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(n)} \end{bmatrix}$$
• If $X^{\top} X$ is not invertible (i.e., singular), may need to:
– Use the pseudo-inverse instead of the inverse
• In Python: numpy.linalg.pinv(a)
– Remove redundant (not linearly independent) features
– Remove extra features to ensure that $d \le n$
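A sketch of the closed-form solution in NumPy; using the pseudo-inverse keeps it well-defined even when $X^{\top} X$ is singular, per the note above (the function name is illustrative):

```python
import numpy as np

def normal_equation(X, y):
    """Closed form theta = (X^T X)^{-1} X^T y, via the pseudo-inverse.

    numpy.linalg.pinv keeps this well-defined even when X^T X is singular
    (e.g., redundant features or d > n).
    """
    return np.linalg.pinv(X.T @ X) @ (X.T @ y)

# Equivalent shortcut: theta = np.linalg.pinv(X) @ y
```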
Gradient Descent vs. Closed Form Solution

Gradient Descent:
• Requires multiple iterations
• Need to choose α
• Works well when n is large
• Can support incremental learning

Closed Form Solution:
• Non-iterative
• No need for α
• Slow if n is large
– Forming and inverting $X^{\top} X$ costs roughly $O(nd^2 + d^3)$
Improving Learning: Feature Scaling
• Idea: Ensure that features have similar scales
[Figure: contours of $J(\theta)$ over $(\theta_1, \theta_2)$ before and after feature scaling.]
• Makes gradient descent converge much faster
Feature Standardization
• Rescales features to have zero mean and unit variance
– Let $\mu_j$ be the mean of feature $j$: $\mu_j = \frac{1}{n} \sum_{i=1}^{n} x_j^{(i)}$
– Replace each value with
$$x_j^{(i)} \leftarrow \frac{x_j^{(i)} - \mu_j}{s_j} \qquad \text{for } j = 1 \ldots d \text{ (not } x_0\text{!)}$$
• $s_j$ is the standard deviation of feature $j$
• Could also use the range of feature $j$ ($\max_j - \min_j$) for $s_j$
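A minimal standardization sketch; the intercept feature $x_0$ is excluded, as the slide warns, and the training-set statistics are reused on new data (an assumption about typical practice, not stated on the slide):

```python
import numpy as np

def standardize(X_raw):
    """Rescale each feature column to zero mean and unit variance.

    X_raw : (n, d) matrix of raw features (no column of ones yet, so the
            intercept feature x_0 is never rescaled).
    Returns the standardized features and (mu, sigma) for reuse on new data.
    """
    mu = X_raw.mean(axis=0)       # per-feature means mu_j
    sigma = X_raw.std(axis=0)     # per-feature standard deviations s_j
    return (X_raw - mu) / sigma, mu, sigma

# X_train_std, mu, sigma = standardize(X_train_raw)
# X_test_std = (X_test_raw - mu) / sigma    # reuse the training statistics
```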
[Figure: three fits of price vs. size.]
Overfitting
• The learned hypothesis may fit the training set very well ($J(\theta) \approx 0$)
• ...but fails to generalize to new examples
Regularization
• Linear regression objective function:
$$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2$$
Understanding Regularization
$$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2$$
• Note that $\sum_{j=1}^{d} \theta_j^2 = \|\theta_{1:d}\|_2^2$
– This is the magnitude of the feature coefficient vector!
[Figure: fits plotted against size, illustrating the effect of driving some coefficients toward zero.]
• Gradient update:
$$\frac{\partial}{\partial \theta_0} J(\theta): \quad \theta_0 \leftarrow \theta_0 - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)$$
$$\frac{\partial}{\partial \theta_j} J(\theta): \quad \theta_j \leftarrow \theta_j - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)} - \alpha \lambda \theta_j$$
(the last term, $-\alpha \lambda \theta_j$, is the regularization term)
Regularized Linear Regression
$$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{d} \theta_j^2$$
$$\theta_0 \leftarrow \theta_0 - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right)$$
$$\theta_j \leftarrow \theta_j - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)} - \alpha \lambda \theta_j$$
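A sketch of gradient descent with these regularized updates, assuming (as in the objective above) that λ multiplies the penalty directly and that $\theta_0$ is not penalized; hyperparameter values are illustrative:

```python
import numpy as np

def ridge_gradient_descent(X, y, lam=0.1, alpha=0.01, num_iters=1000):
    """Gradient descent for L2-regularized linear regression.

    X : (n, d+1) design matrix with a leading column of ones.
    The intercept theta_0 is excluded from the penalty, matching the
    separate theta_0 update above.
    """
    n = X.shape[0]
    theta = np.zeros(X.shape[1])
    for _ in range(num_iters):
        residuals = X @ theta - y
        gradient = (X.T @ residuals) / n
        gradient[1:] += lam * theta[1:]     # regularization term (skip theta_0)
        theta = theta - alpha * gradient
    return theta
```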
Regularized Linear Regression
• To incorporate regularization into the closed form solution:
$$\theta = \left( X^{\top} X + \lambda \begin{bmatrix} 0 & 0 & 0 & \cdots & 0 \\ 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \end{bmatrix} \right)^{-1} X^{\top} y$$
• Can derive this the same way, by solving $\frac{\partial}{\partial \theta} J(\theta) = 0$
• Can prove that for $\lambda > 0$, the inverse in the equation above exists
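A sketch of this regularized closed form in NumPy; the matrix added to $X^{\top} X$ is the identity with its (0, 0) entry zeroed so that $\theta_0$ is not regularized (the function name is illustrative):

```python
import numpy as np

def ridge_closed_form(X, y, lam=0.1):
    """theta = (X^T X + lam * E)^{-1} X^T y, where E is the identity
    matrix with E[0, 0] = 0 so the intercept theta_0 is not penalized."""
    E = np.eye(X.shape[1])
    E[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + lam * E, X.T @ y)
```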