Class 05: Logistic Regression and SVM
Lorenzo Rosasco
Last class

- Features:

  f(x) = w^\top x \;\mapsto\; f(x) = w^\top \Phi(x).

- Kernels:

  f(x) = w^\top x \;\mapsto\; f(x) = \sum_{i=1}^{N} k(x, x_i)\, c_i,

  where f(x) = w^\top \Phi(x) and k(x, \bar{x}) := \Phi(x)^\top \Phi(\bar{x}) is positive definite.
- A positive definite kernel k defines a space H_k of functions

  f(x) = \sum_{i=1}^{N} k(x, x_i)\, c_i.
Today

- Logistic loss.
- Hinge loss.

\min_{w \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, w^\top x_i) + \lambda \|w\|^2, \qquad \lambda \ge 0.
Problem: solve

\min_{w \in \mathbb{R}^d} \hat{L}(w) + \lambda \|w\|^2

where

\hat{L}(w) = \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, w^\top x_i).

Write \hat{L}_\lambda(w) = \hat{L}(w) + \lambda \|w\|^2. For \lambda > 0 the objective is coercive:

\lim_{\|w\| \to \infty} \hat{L}_\lambda(w) = \infty.

L.Rosasco, 9.520/6.860 2018
Logistic regression

\hat{L}_\lambda(w) = \frac{1}{n} \sum_{i=1}^{n} \log(1 + e^{-y_i w^\top x_i}) + \lambda \|w\|^2.

- \hat{L}_\lambda is smooth, with gradient

  \nabla \hat{L}_\lambda(w) = -\frac{1}{n} \sum_{i=1}^{n} \frac{x_i y_i}{1 + e^{y_i w^\top x_i}} + 2\lambda w.

- The minimizer satisfies \nabla \hat{L}_\lambda(w) = 0, and the gradient is Lipschitz:

  \|\nabla F(w) - \nabla F(w')\| \le B \|w - w'\|.
\min_{w \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^{n} \log(1 + e^{-y_i w^\top x_i}) + \lambda \|w\|^2

Consider

w_{t+1} = w_t - \frac{1}{B} \left( -\frac{1}{n} \sum_{i=1}^{n} \frac{x_i y_i}{1 + e^{y_i w_t^\top x_i}} + 2\lambda w_t \right).

Complexity
Time: O(ndT) for n examples, d dimensions, T steps.
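The update above is easy to sketch in NumPy. This is a minimal illustration, not code from the course; the function name `logistic_gd` is made up here, and the Lipschitz bound B = ||X||^2/(4n) + 2λ (from the 1/4 bound on the logistic loss curvature) is an assumption about how B might be chosen.

```python
import numpy as np

def logistic_gd(X, y, lam=0.1, T=200):
    """Gradient descent on (1/n) sum_i log(1 + exp(-y_i w.x_i)) + lam ||w||^2.

    X: (n, d) data matrix; y: (n,) labels in {-1, +1}.
    Step size 1/B, with B an (assumed) Lipschitz bound on the gradient:
    the logistic loss has curvature at most 1/4, giving B = ||X||^2/(4n) + 2*lam.
    """
    n, d = X.shape
    B = np.linalg.norm(X, 2) ** 2 / (4 * n) + 2 * lam  # spectral norm squared
    w = np.zeros(d)
    for _ in range(T):
        m = y * (X @ w)                                   # margins y_i w.x_i
        grad = -(X.T @ (y / (1 + np.exp(m)))) / n + 2 * lam * w
        w = w - grad / B
    return w
```

Each iteration costs one pass over the data, O(nd), matching the O(ndT) total above.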
\min_{w \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^{n} \log(1 + e^{-y_i w^\top \Phi(x_i)}) + \lambda \|w\|^2

Consider

w_{t+1} = w_t - \frac{1}{B} \left( -\frac{1}{n} \sum_{i=1}^{n} \frac{\Phi(x_i) y_i}{1 + e^{y_i w_t^\top \Phi(x_i)}} + 2\lambda w_t \right).

Complexity
Time: O(npT) for n examples, p features, T steps.
Kernel logistic regression

f(x) = w^\top x = \sum_{i=1}^{n} x_i^\top x \, c_i, \qquad c_i \in \mathbb{R}.

By induction,

c_{t+1} = c_t - \frac{1}{B} \left( -\frac{1}{n} \sum_{i=1}^{n} \frac{e_i y_i}{1 + e^{y_i f_t(x_i)}} + 2\lambda c_t \right),

with e_i the i-th canonical basis vector and f_t(x_i) = \sum_{j=1}^{n} x_j^\top x_i \, (c_t)_j.

Assume

w_t = \sum_{i=1}^{n} x_i (c_t)_i.

Then

w_{t+1} = w_t - \frac{1}{B} \left( -\frac{1}{n} \sum_{i=1}^{n} \frac{x_i y_i}{1 + e^{y_i w_t^\top x_i}} + 2\lambda w_t \right)

        = \sum_{i=1}^{n} x_i (c_t)_i - \frac{1}{B} \left( -\frac{1}{n} \sum_{i=1}^{n} x_i \frac{y_i}{1 + e^{y_i (\sum_{j=1}^{n} x_j (c_t)_j)^\top x_i}} + 2\lambda \sum_{i=1}^{n} x_i (c_t)_i \right)

        = \sum_{i=1}^{n} x_i \left[ (c_t)_i - \frac{1}{B} \left( -\frac{1}{n} \frac{y_i}{1 + e^{y_i (\sum_{j=1}^{n} x_j (c_t)_j)^\top x_i}} + 2\lambda (c_t)_i \right) \right].

Then

w_{t+1} = \sum_{i=1}^{n} x_i (c_{t+1})_i.

Complexity
Time: O(n^2 (C_k + T)) for n examples, C_k the cost of a kernel evaluation, T steps.
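The coefficient update works for any kernel once the Gram matrix K, K_{ij} = k(x_i, x_j), is precomputed, since f_t(x_i) = (K c_t)_i. A minimal sketch (the name `kernel_logistic_gd` and the default step constant B = ||K||/(4n) + 2λ are assumptions, chosen so the c-update mirrors gradient descent in the feature space):

```python
import numpy as np

def kernel_logistic_gd(K, y, lam=0.1, B=None, T=200):
    """Iterate (c_{t+1})_i = (c_t)_i - (1/B) [ -(1/n) y_i / (1 + e^{y_i (K c_t)_i})
    + 2*lam*(c_t)_i ], with f_t(x_i) = (K c_t)_i.

    K: (n, n) kernel (Gram) matrix; y: (n,) labels in {-1, +1}.
    B: step-size constant; the default (an assumption) is ||K||/(4n) + 2*lam.
    """
    n = K.shape[0]
    if B is None:
        B = np.linalg.norm(K, 2) / (4 * n) + 2 * lam
    c = np.zeros(n)
    for _ in range(T):
        f = K @ c                                   # f_t(x_i) for all i
        c = c - (-(y / (1 + np.exp(y * f))) / n + 2 * lam * c) / B
    return c
```

Building K costs O(n^2 C_k) and each iteration O(n^2), matching the total above.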
Hinge loss (SVM)

\hat{L}_\lambda(w) = \underbrace{\frac{1}{n} \sum_{i=1}^{n} |1 - y_i w^\top x_i|_+ + \lambda \|w\|^2}_{\text{non-smooth \& strongly convex}}

S_i(w) = \begin{cases} -y_i x_i & \text{if } y_i w^\top x_i \le 1, \\ 0 & \text{otherwise}, \end{cases}
\qquad B = \sup_{x \in X} \|x\| + 2\lambda.

B bounds the norm of the subgradients; steps of size 1/(B\sqrt{t}) are used below.
Subgradient

Let F: \mathbb{R}^p \to \mathbb{R} be convex. The subdifferential \partial F(w_0) is the set of vectors v \in \mathbb{R}^p such that, for every w \in \mathbb{R}^p,

F(w) - F(w_0) \ge (w - w_0)^\top v.

In one dimension, \partial F(w_0) = [F'_-(w_0), F'_+(w_0)].
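As a one-dimensional instance relevant to the hinge loss, take F(w) = |1 - w|_+ = max(0, 1 - w):

```latex
\partial F(w) =
\begin{cases}
\{-1\} & \text{if } w < 1,\\
[-1,\, 0] & \text{if } w = 1,\\
\{0\} & \text{if } w > 1,
\end{cases}
```

which matches \partial F(w_0) = [F'_-(w_0), F'_+(w_0)] at the kink w_0 = 1.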
\min_{w \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^{n} |1 - y_i w^\top x_i|_+ + \lambda \|w\|^2

Consider

w_{t+1} = w_t - \frac{1}{B\sqrt{t}} \left( \frac{1}{n} \sum_{i=1}^{n} S_i(w_t) + 2\lambda w_t \right),
\qquad
S_i(w_t) = \begin{cases} -y_i x_i & \text{if } y_i w_t^\top x_i \le 1, \\ 0 & \text{otherwise}. \end{cases}

Complexity
Time: O(ndT) for n examples, d dimensions, T steps.
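The subgradient update above can be sketched directly. A minimal illustration, not from the slides; the name `svm_subgrad` is hypothetical, and B = max_i ||x_i|| + 2λ instantiates the slide's sup over the (finite) training set.

```python
import numpy as np

def svm_subgrad(X, y, lam=0.1, T=500):
    """Subgradient method for (1/n) sum_i |1 - y_i w.x_i|_+ + lam ||w||^2,
    with step size 1/(B sqrt(t)) and B = max_i ||x_i|| + 2*lam.

    X: (n, d) data matrix; y: (n,) labels in {-1, +1}.
    """
    n, d = X.shape
    B = np.max(np.linalg.norm(X, axis=1)) + 2 * lam
    w = np.zeros(d)
    for t in range(1, T + 1):
        active = y * (X @ w) <= 1                # points at or inside the margin
        sub = -(X[active].T @ y[active]) / n + 2 * lam * w
        w = w - sub / (B * np.sqrt(t))
    return w
```

Unlike gradient descent on the logistic loss, the step size must decay (here as 1/sqrt(t)) because the objective is non-smooth.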
- Set \lambda = 0. Reasoning as above, we can solve ERM by

w_{t+1} = w_t - \frac{1}{B\sqrt{t}} \frac{1}{n} \sum_{i=1}^{n} S_i(w_t),
\qquad
S_i(w_t) = \begin{cases} -y_i x_i & \text{if } y_i w_t^\top x_i \le 1, \\ 0 & \text{otherwise}. \end{cases}
\min_{w \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^{n} |1 - y_i w^\top \Phi(x_i)|_+ + \lambda \|w\|^2

Consider

w_{t+1} = w_t - \frac{1}{B\sqrt{t}} \left( \frac{1}{n} \sum_{i=1}^{n} S_i(w_t) + 2\lambda w_t \right),
\qquad
S_i(w_t) = \begin{cases} -y_i \Phi(x_i) & \text{if } y_i w_t^\top \Phi(x_i) \le 1, \\ 0 & \text{otherwise}. \end{cases}

Complexity
Time: O(npT) for n examples, p features, T steps.
By induction,

c_{t+1} = c_t - \frac{1}{B\sqrt{t}} \left( \frac{1}{n} \sum_{i=1}^{n} S_i(c_t) + 2\lambda c_t \right),
\qquad
S_i(c_t) = \begin{cases} -y_i e_i & \text{if } y_i f_t(x_i) < 1, \\ 0 & \text{otherwise}. \end{cases}
Complexity
Time: O(n^2 (C_k + T)) for n examples, C_k the cost of a kernel evaluation, T steps.
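The kernel SVM iteration works on the coefficients via the Gram matrix, since f_t(x_i) = (K c_t)_i. A minimal sketch (the name `kernel_svm_subgrad` is hypothetical; the default B mirrors sup ||x|| + 2λ by using max_i sqrt(k(x_i, x_i)), an assumption about how B might be set in the kernel case):

```python
import numpy as np

def kernel_svm_subgrad(K, y, lam=0.1, B=None, T=500):
    """Coefficient-space subgradient method for the kernel SVM:
    c_{t+1} = c_t - (1/(B sqrt(t))) [ (1/n) sum_i S_i(c_t) + 2*lam*c_t ],
    S_i(c_t) = -y_i e_i if y_i (K c_t)_i < 1, else 0.

    K: (n, n) kernel matrix; y: (n,) labels in {-1, +1}.
    """
    n = K.shape[0]
    if B is None:
        B = np.sqrt(np.max(np.diag(K))) + 2 * lam  # assumed analogue of sup||x||+2*lam
    c = np.zeros(n)
    for t in range(1, T + 1):
        f = K @ c                                  # f_t(x_i) for all i
        sub = np.where(y * f < 1, -y, 0.0) / n + 2 * lam * c
        c = c - sub / (B * np.sqrt(t))
    return c
```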
What else

The condition \nabla F(w_*) = 0 becomes 0 \in \partial F(w_*),

where

c_i = c_i(w) \in [\ell'_-(-y_i w^\top x_i), \ell'_+(-y_i w^\top x_i)]

with \ell'_-, \ell'_+ the left and right derivatives of |\cdot|_+. The conditions

c_i = 1 \quad \text{if } y_i f(x_i) < 1,
0 \le c_i \le 1 \quad \text{if } y_i f(x_i) = 1,
c_i = 0 \quad \text{if } y_i f(x_i) > 1,

show that the SVM solution is sparse with respect to the training points.
\min_{w \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^{n} |1 - y_i w^\top x_i|_+ + \lambda \|w\|^2.

For C = \frac{1}{2n\lambda}, consider the following equivalent formulation:

\min_{w \in \mathbb{R}^d} C \sum_{i=1}^{n} \xi_i + \frac{1}{2} \|w\|^2,
\qquad \xi_i \ge 0, \quad y_i w^\top x_i \ge 1 - \xi_i.

Hard margin SVM: find the max margin solution separating the data.

- Margin.
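The equivalence is just a positive rescaling: dividing the \lambda-regularized objective by 2\lambda and setting C = 1/(2n\lambda) gives exactly the C-formulation (with the slacks taken tight, \xi_i = |1 - y_i w^\top x_i|_+), so the two problems share minimizers. A quick numeric check on hypothetical data (the function names are illustrative):

```python
import numpy as np

def obj_lambda(w, X, y, lam):
    """(1/n) sum_i |1 - y_i w.x_i|_+ + lam ||w||^2."""
    xi = np.maximum(0.0, 1.0 - y * (X @ w))   # tight slacks
    return xi.mean() + lam * (w @ w)

def obj_C(w, X, y, C):
    """C sum_i xi_i + (1/2)||w||^2 with xi_i = |1 - y_i w.x_i|_+."""
    xi = np.maximum(0.0, 1.0 - y * (X @ w))
    return C * xi.sum() + 0.5 * (w @ w)

# With C = 1/(2 n lam):  obj_C(w) == obj_lambda(w) / (2 lam)  for every w,
# so both objectives have the same minimizers.
```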