10 Regression, Including Least-Squares Linear and Logistic Regression
Find $w$, $\alpha$ that minimizes $\sum_{i=1}^{n} (X_i \cdot w + \alpha - y_i)^2$
linregress.pdf (ISL, Figure 3.4) [An example of linear regression. The horizontal axes are the features $X_1$ and $X_2$.]
Now $X$ is an $n \times (d + 1)$ matrix; $w$ is a $(d + 1)$-vector. [We've added a column of all-1's to the end of $X$.]
[We rewrite the optimization problem above:]
Find $w$ that minimizes $|Xw - y|^2 = \mathrm{RSS}(w)$, for residual sum of squares
Optimize by calculus:
$\nabla \mathrm{RSS}(w) = 2 X^\top X w - 2 X^\top y = 0$
$\Rightarrow\; X^\top X w = X^\top y$  ⇐ the normal equations [a linear system of $d + 1$ equations in $d + 1$ unknowns]
If $X^\top X$ is invertible, the solution is $w = (X^\top X)^{-1} X^\top y = X^+ y$, where $X^+ = (X^\top X)^{-1} X^\top$ is the pseudoinverse of $X$.
[We never compute $X^+$ directly, but we are interested in the fact that $w$ is a linear transformation of $y$.]
[$X$ is usually not square, so $X$ can't have an inverse. However, every $X$ has a pseudoinverse $X^+$, and if $X^\top X$ is invertible, then $X^+$ is a "left inverse."]
Observe: $X^+ X = (X^\top X)^{-1} X^\top X = I$  ⇐ the $(d + 1) \times (d + 1)$ identity [which explains the name "left inverse"]
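[To make this concrete, here is a minimal NumPy sketch, with made-up data and variable names of my choosing, that fits $w$ three equivalent ways: by solving the normal equations, via the pseudoinverse, and with np.linalg.lstsq, which is preferred in practice because it never forms $X^\top X$ and is numerically more stable.]

import numpy as np

# Illustrative data: n = 5 sample points with d = 2 features each.
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(5, 2))
y = rng.normal(size=5)

# Append a column of all-1's so the bias alpha becomes the last weight.
X = np.hstack([X_raw, np.ones((5, 1))])          # n x (d+1) design matrix

# (1) Solve the normal equations X^T X w = X^T y.
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# (2) w = X^+ y via the pseudoinverse.
w_pinv = np.linalg.pinv(X) @ y

# (3) Least squares directly; avoids forming X^T X.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(w_normal, w_pinv), np.allclose(w_pinv, w_lstsq))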
Observe: the predicted values of $y$ are $\hat{y}_i = w \cdot X_i$ $\Rightarrow$ $\hat{y} = Xw = XX^+ y = Hy$
where $H = XX^+$, an $n \times n$ matrix, is called the hat matrix because it puts the hat on $y$.
[Ideally, H would be the identity matrix and we’d have a perfect fit, but if n > d + 1, then H is singular.]
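[A short self-contained check of these properties of $H$, again on made-up data: $H$ is idempotent, as a projection should be, and with $n > d + 1$ its rank is at most $d + 1$, so it is singular.]

import numpy as np

rng = np.random.default_rng(0)
X = np.hstack([rng.normal(size=(5, 2)), np.ones((5, 1))])   # n = 5, d+1 = 3
y = rng.normal(size=5)

H = X @ np.linalg.pinv(X)                 # the n x n hat matrix
y_hat = H @ y                             # puts the hat on y

print(np.allclose(H @ H, H))              # idempotent: projecting twice = once
print(np.linalg.matrix_rank(H))           # 3 = d+1 < n = 5, so H is singular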
Interpretation as a projection:
– $\hat{y} = Xw \in \mathbb{R}^n$ is a linear combination of columns of $X$ (one column per feature)
– For fixed $X$, varying $w$, $Xw$ ranges over the subspace of $\mathbb{R}^n$ spanned by the columns
[Figure: minimizing $|Xw - y|^2$ picks the point $\hat{y} = Xw$ nearest to $y$ in the subspace spanned by $X$'s columns (at most $d + 1$ dimensions); that is, $\hat{y}$ is the orthogonal projection of $y$ onto that subspace.]
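[Because $\hat{y}$ is an orthogonal projection, the residual $y - \hat{y}$ is perpendicular to every column of $X$; that is exactly what the normal equations $X^\top (Xw - y) = 0$ say. A quick numerical confirmation, on illustrative data:]

import numpy as np

rng = np.random.default_rng(0)
X = np.hstack([rng.normal(size=(5, 2)), np.ones((5, 1))])
y = rng.normal(size=5)

w, *_ = np.linalg.lstsq(X, y, rcond=None)
residual = y - X @ w
print(np.allclose(X.T @ residual, 0))     # residual is orthogonal to X's columns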
[Recall the logistic loss $L(z, y) = -y \ln z - (1 - y) \ln(1 - z)$ and the cost function $J(w) = \sum_{i=1}^{n} L(s(X_i \cdot w), y_i)$.]
logloss0.pdf, loglosspt7.pdf [Plots of the loss $L(z, y)$ for $y = 0$ (left) and $y = 0.7$ (right). As you might guess, the left function is minimized at $z = 0$, and the right function is minimized at $z = 0.7$. These loss functions are always convex.]
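[A small sketch confirming numerically that $L(z, y)$ is minimized at $z = y$; the grid and the function name are my own choices, not from the notes.]

import numpy as np

def logistic_loss(z, y):
    # L(z, y) = -y ln z - (1 - y) ln(1 - z)
    return -y * np.log(z) - (1 - y) * np.log(1 - z)

z = np.linspace(0.01, 0.99, 99)
for y in (0.0, 0.7):
    # prints 0.01 (the grid edge nearest 0) and 0.70
    print(y, z[np.argmin(logistic_loss(z, y))])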
J(w) is convex! Solve by gradient descent.
[To do gradient descent, we’ll need to compute some derivatives.]
$s'(\gamma) = \dfrac{d}{d\gamma} \dfrac{1}{1 + e^{-\gamma}} = \dfrac{e^{-\gamma}}{(1 + e^{-\gamma})^2} = s(\gamma)\,(1 - s(\gamma))$
[Plots of the logistic function $s(x)$ (left) and its derivative $s'(x)$ (right), for $x \in [-4, 4]$.]
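[A quick finite-difference check of the identity $s' = s(1 - s)$; the step size $h$ is an arbitrary choice.]

import numpy as np

def s(x):
    # the logistic function s(x) = 1 / (1 + e^{-x})
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-4, 4, 9)
h = 1e-5
numeric = (s(x + h) - s(x - h)) / (2 * h)    # centered difference
print(np.allclose(numeric, s(x) * (1 - s(x))))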
Let $s_i = s(X_i \cdot w)$

$\nabla_w J = -\sum_i \left( \dfrac{y_i}{s_i} \nabla s_i - \dfrac{1 - y_i}{1 - s_i} \nabla s_i \right)$
$\quad\;\; = -\sum_i \left( \dfrac{y_i}{s_i} - \dfrac{1 - y_i}{1 - s_i} \right) s_i (1 - s_i)\, X_i$
$\quad\;\; = -\sum_i (y_i - s_i)\, X_i$
$\quad\;\; = -X^\top (y - s(Xw))$, where $s(Xw) = \begin{bmatrix} s_1 \\ s_2 \\ \vdots \\ s_n \end{bmatrix}$ [applies $s$ component-wise to $Xw$]
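[Putting it together: batch gradient descent repeats $w \leftarrow w + \epsilon\, X^\top (y - s(Xw))$. A minimal self-contained sketch, with step size, iteration count, and data all illustrative, including a finite-difference check that the gradient formula above is right.]

import numpy as np

def s(x):
    return 1.0 / (1.0 + np.exp(-x))

def J(w, X, y):
    # the logistic cost J(w) = sum_i L(s_i, y_i)
    z = s(X @ w)
    return np.sum(-y * np.log(z) - (1 - y) * np.log(1 - z))

def grad_J(w, X, y):
    # the gradient derived above: -X^T (y - s(Xw))
    return -X.T @ (y - s(X @ w))

def fit(X, y, eps=0.05, iters=1000):
    # batch gradient descent: w <- w - eps * grad_J(w)
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        w -= eps * grad_J(w, X, y)
    return w

# Illustrative data with a column of all-1's appended, as before.
rng = np.random.default_rng(1)
X = np.hstack([rng.normal(size=(20, 2)), np.ones((20, 1))])
y = (rng.random(20) < 0.5).astype(float)

# Finite-difference check of one gradient component.
w0 = rng.normal(size=3)
h = 1e-6
e0 = np.array([h, 0.0, 0.0])
fd = (J(w0 + e0, X, y) - J(w0 - e0, X, y)) / (2 * h)
print(np.isclose(fd, grad_J(w0, X, y)[0]))

w = fit(X, y)
print(J(w, X, y) < J(np.zeros(3), X, y))   # descent reduced the cost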