
MIT 9.520/6.860, Fall 2018
Statistical Learning Theory and Applications

Class 05: Logistic Regression and Support Vector Machines

Lorenzo Rosasco

Last class

Non-linear functions using

- features:

  f(x) = w^\top x \;\mapsto\; f(x) = w^\top \Phi(x),

- kernels:

  f(x) = w^\top x \;\mapsto\; f(x) = \sum_{i=1}^{N} k(x, x_i)\, c_i.


More precisely

- A feature map \Phi defines the space H_\Phi of functions

  f(x) = w^\top \Phi(x),

  and k(x, \bar{x}) := \Phi(x)^\top \Phi(\bar{x}) is positive definite.

- A positive definite kernel k defines the space H_k of functions

  f(x) = \sum_{i=1}^{N} k(x, x_i)\, c_i,

  with the reproducing property

  f(x) = \langle f, k(x, \cdot) \rangle_{H_k}.

- For every k there is a feature map \Phi (indeed, infinitely many) such that

  k(x, \bar{x}) = \Phi(x)^\top \Phi(\bar{x}),

  and

  H_k \simeq H_\Phi.
Today

Beyond least squares:

  (y - f(x))^2 \;\mapsto\; \ell(y, f(x)).

Today:
- Logistic loss.
- Hinge loss.


ERM and penalization

  \min_{w \in \mathbb{R}^d} \; \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, w^\top x_i) + \lambda \|w\|^2, \qquad \lambda \geq 0.

- Logistic loss \mapsto logistic regression.
- Hinge loss \mapsto SVM.

Non-linear extensions via features/kernels.


From regularization to optimization

Problem: solve

  \min_{w \in \mathbb{R}^d} \; \hat{L}(w) + \lambda \|w\|^2,

where

  \hat{L}(w) = \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, w^\top x_i).


Minimization

Assume \ell convex and continuous, and let

  \hat{L}_\lambda(w) = \hat{L}(w) + \lambda \|w\|^2.

- \hat{L}_\lambda is a coercive (i.e. \lim_{\|w\| \to \infty} \hat{L}_\lambda(w) = \infty), strongly convex functional
  \Rightarrow a minimizer exists and is unique.

- The computations depend on the considered loss.
Logistic regression

  \hat{L}_\lambda(w) = \frac{1}{n} \sum_{i=1}^{n} \log(1 + e^{-y_i w^\top x_i}) + \lambda \|w\|^2.

- \hat{L}_\lambda is smooth, with gradient

  \nabla \hat{L}_\lambda(w) = -\frac{1}{n} \sum_{i=1}^{n} \frac{x_i y_i}{1 + e^{y_i w^\top x_i}} + 2\lambda w.

- The optimality condition gives a nonlinear equation,

  \nabla \hat{L}_\lambda(w) = 0,

  so we use gradient methods.


Gradient descent

Let F : \mathbb{R}^d \to \mathbb{R} be differentiable, (strictly) convex, and such that

  \|\nabla F(w) - \nabla F(w')\| \leq B \|w - w'\|

(e.g. \sup_w \|H(w)\| \leq B, with H(w) the Hessian of F).

Then

  w_0 = 0, \qquad w_{t+1} = w_t - \frac{1}{B} \nabla F(w_t)

converges to the minimizer of F.
Gradient descent for logistic regression

  \min_{w \in \mathbb{R}^d} \; \frac{1}{n} \sum_{i=1}^{n} \log(1 + e^{-y_i w^\top x_i}) + \lambda \|w\|^2.

Consider

  w_{t+1} = w_t - \frac{1}{B} \left( -\frac{1}{n} \sum_{i=1}^{n} \frac{x_i y_i}{1 + e^{y_i w_t^\top x_i}} + 2\lambda w_t \right).

Complexity
Time: O(ndT) for n examples, d dimensions, T steps.
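
As a concrete illustration (not from the slides), here is a minimal NumPy sketch of this gradient descent for l2-regularized logistic regression; the function name logistic_gd and the fixed step size `step` (playing the role of 1/B) are illustrative assumptions.

import numpy as np

def logistic_gd(X, y, lam, step, T):
    """Gradient descent for l2-regularized logistic regression.
    X: (n, d) data matrix; y: labels in {-1, +1}; step ~ 1/B."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(T):
        margins = y * (X @ w)                    # y_i w^T x_i
        sigmas = 1.0 / (1.0 + np.exp(margins))   # 1 / (1 + e^{y_i w^T x_i})
        grad = -(X * (y * sigmas)[:, None]).mean(axis=0) + 2 * lam * w
        w = w - step * grad
    return w

# Example use (hypothetical data): w = logistic_gd(X, y, lam=0.1, step=0.01, T=1000)

Each iteration costs O(nd), matching the O(ndT) total above.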


Non-linear features

  f(x) = w^\top x \;\mapsto\; f(x) = w^\top \Phi(x),

  \Phi(x) = (\varphi_1(x), \dots, \varphi_p(x)).
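
For instance (an illustrative choice, not from the slides), an explicit degree-two polynomial feature map can be computed up front, after which the linear routines above are reused on Phi(X); the helper name poly2_features is hypothetical.

import numpy as np

def poly2_features(X):
    """One possible feature map: Phi(x) = (x_1, ..., x_d, all products x_i x_j)."""
    n, d = X.shape
    cross = np.einsum('ni,nj->nij', X, X).reshape(n, d * d)  # pairwise products
    return np.hstack([X, cross])                             # shape (n, d + d^2)

Here p = d + d^2, so the cost of explicit features grows quickly; kernels (next slides) avoid forming Phi explicitly.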


Gradient descent for non-linear logistic regression

  \min_{w \in \mathbb{R}^p} \; \frac{1}{n} \sum_{i=1}^{n} \log(1 + e^{-y_i w^\top \Phi(x_i)}) + \lambda \|w\|^2.

Consider

  w_{t+1} = w_t - \frac{1}{B} \left( -\frac{1}{n} \sum_{i=1}^{n} \frac{\Phi(x_i) y_i}{1 + e^{y_i w_t^\top \Phi(x_i)}} + 2\lambda w_t \right).

Complexity
Time: O(npT) for n examples, p features, T steps.

What about kernels?


Representer theorem for logistic regression?

As for least squares, show that w = \sum_{i=1}^{n} x_i c_i, i.e.

  f(x) = w^\top x = \sum_{i=1}^{n} x_i^\top x \, c_i, \qquad c_i \in \mathbb{R}.

Compute c = (c_1, \dots, c_n) \in \mathbb{R}^n rather than w \in \mathbb{R}^d.


Representer theorem for GD & logistic regression

By induction,

  c_{t+1} = c_t - \frac{1}{B} \left( -\frac{1}{n} \sum_{i=1}^{n} \frac{e_i y_i}{1 + e^{y_i f_t(x_i)}} + 2\lambda c_t \right),

with e_i the i-th element of the canonical basis and

  f_t(x) = \sum_{i=1}^{n} x^\top x_i \, (c_t)_i.


Proof of the representer theorem for GD & logistic regression

Assume

  w_t = \sum_{i=1}^{n} x_i (c_t)_i.

Then

  w_{t+1} = w_t - \frac{1}{B} \left( -\frac{1}{n} \sum_{i=1}^{n} \frac{x_i y_i}{1 + e^{y_i w_t^\top x_i}} + 2\lambda w_t \right)

          = \sum_{i=1}^{n} x_i (c_t)_i - \frac{1}{B} \left( -\frac{1}{n} \sum_{i=1}^{n} x_i \frac{y_i}{1 + e^{y_i (\sum_{j=1}^{n} x_j (c_t)_j)^\top x_i}} + 2\lambda \sum_{i=1}^{n} x_i (c_t)_i \right)

          = \sum_{i=1}^{n} x_i \left[ (c_t)_i - \frac{1}{B} \left( -\frac{1}{n} \frac{y_i}{1 + e^{y_i (\sum_{j=1}^{n} x_j (c_t)_j)^\top x_i}} + 2\lambda (c_t)_i \right) \right].

Hence

  w_{t+1} = \sum_{i=1}^{n} x_i (c_{t+1})_i.
Kernel logistic regression

Given a positive definite kernel k, consider

  c_{t+1} = c_t - \frac{1}{B} \left( -\frac{1}{n} \sum_{i=1}^{n} \frac{e_i y_i}{1 + e^{y_i f_t(x_i)}} + 2\lambda c_t \right),

with e_i the i-th element of the canonical basis and

  f_t(x) = \sum_{i=1}^{n} k(x, x_i) (c_t)_i.

Complexity
Time: O(n^2 (C_k + T)) for n examples, C_k kernel evaluation cost, T steps.
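
A minimal sketch of this iteration, assuming a Gaussian kernel and labels in {-1, +1}; the names gaussian_kernel and kernel_logistic_gd and the fixed step size are illustrative, not from the slides.

import numpy as np

def gaussian_kernel(X, Z, sigma=1.0):
    """k(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def kernel_logistic_gd(K, y, lam, step, T):
    """Gradient descent in the coefficients c; K is the (n, n) training kernel matrix."""
    n = K.shape[0]
    c = np.zeros(n)
    for _ in range(T):
        f = K @ c                                  # f_t(x_i) = sum_j k(x_i, x_j) (c_t)_j
        grad = -(y / (1.0 + np.exp(y * f))) / n + 2 * lam * c
        c = c - step * grad
    return c

# Prediction at new points X_new: f = gaussian_kernel(X_new, X_train) @ c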


Hinge loss and support vector machines

  \hat{L}_\lambda(w) = \underbrace{\frac{1}{n} \sum_{i=1}^{n} |1 - y_i w^\top x_i|_+ + \lambda \|w\|^2}_{\text{non-smooth \& strongly convex}}

Consider the "left" derivative

  w_{t+1} = w_t - \frac{1}{B\sqrt{t}} \left( \frac{1}{n} \sum_{i=1}^{n} S_i(w_t) + 2\lambda w_t \right),

  S_i(w) = \begin{cases} -y_i x_i & \text{if } y_i w^\top x_i \leq 1 \\ 0 & \text{otherwise,} \end{cases} \qquad B = \sup_{x \in X} \|x\| + 2\lambda.

Here B is a bound on the subgradients.
Subgradient
Let F : \mathbb{R}^p \to \mathbb{R} be convex.

Subgradient: \partial F(w_0) is the set of vectors v \in \mathbb{R}^p such that, for every w \in \mathbb{R}^p,

  F(w) - F(w_0) \geq (w - w_0)^\top v.

In one dimension, \partial F(w_0) = [F'_-(w_0), F'_+(w_0)].
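
As a concrete one-dimensional example (not on the slide), the subdifferential of the hinge-type function t \mapsto |1 - t|_+ is

  \partial_t \, |1 - t|_+ =
  \begin{cases}
  \{-1\} & \text{if } t < 1,\\
  [-1, 0] & \text{if } t = 1,\\
  \{0\} & \text{if } t > 1,
  \end{cases}

which matches, up to sign, the characterization of the SVM coefficients c_i given later in this lecture.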


Subgradient method

Let F : \mathbb{R}^p \to \mathbb{R} be convex, with subdifferential bounded by B, and let \gamma_t = \frac{1}{B\sqrt{t}}. Then

  w_{t+1} = w_t - \gamma_t v_t, \qquad v_t \in \partial F(w_t),

converges to the minimizer of F.

Note: it is not a descent method.


Subgradient method for SVM

  \min_{w \in \mathbb{R}^d} \; \frac{1}{n} \sum_{i=1}^{n} |1 - y_i w^\top x_i|_+ + \lambda \|w\|^2.

Consider

  w_{t+1} = w_t - \frac{1}{B\sqrt{t}} \left( \frac{1}{n} \sum_{i=1}^{n} S_i(w_t) + 2\lambda w_t \right),

  S_i(w_t) = \begin{cases} -y_i x_i & \text{if } y_i w_t^\top x_i \leq 1 \\ 0 & \text{otherwise.} \end{cases}

Complexity
Time: O(ndT) for n examples, d dimensions, T steps.
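
A minimal NumPy sketch of this subgradient iteration (the function name svm_subgradient is illustrative; labels are assumed in {-1, +1}, and B is set as on the earlier slide).

import numpy as np

def svm_subgradient(X, y, lam, T):
    """Subgradient method for the hinge loss + l2 objective."""
    n, d = X.shape
    w = np.zeros(d)
    B = np.linalg.norm(X, axis=1).max() + 2 * lam        # sup ||x|| + 2*lambda
    for t in range(1, T + 1):
        active = y * (X @ w) <= 1                        # points with y_i w^T x_i <= 1
        subgrad = -(X[active] * y[active][:, None]).sum(axis=0) / n + 2 * lam * w
        w = w - subgrad / (B * np.sqrt(t))               # step size 1/(B sqrt(t))
    return w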


Connection to the perceptron

- Replace the hinge loss with

  \ell(y, f(x)) = |-y f(x)|_+.

- Set \lambda = 0.

Reasoning as above, we can solve ERM by

  w_{t+1} = w_t - \frac{1}{B\sqrt{t}} \left( \frac{1}{n} \sum_{i=1}^{n} S_i(w_t) + 2\lambda w_t \right), \qquad S_i(w_t) = \begin{cases} -y_i x_i & \text{if } y_i w_t^\top x_i \leq 0 \\ 0 & \text{otherwise.} \end{cases}

This is a "batch" version of the perceptron algorithm,

  w_{t+1} = w_t - \gamma\, S_t(w_t), \qquad S_t(w_t) = \begin{cases} -y_t x_t & \text{if } y_t w_t^\top x_t \leq 0 \\ 0 & \text{otherwise.} \end{cases}
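
For comparison, a minimal sketch of the classical online perceptron update above; the epoch count and the step size gamma are arbitrary illustrative choices.

import numpy as np

def perceptron(X, y, gamma=1.0, epochs=10):
    """Online perceptron: update only on mistakes (y_i w^T x_i <= 0)."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in range(n):
            if y[i] * (X[i] @ w) <= 0:   # mistake: move w toward y_i x_i
                w = w + gamma * y[i] * X[i]
    return w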


Nonlinear SVM using features and subgradients

  \min_{w \in \mathbb{R}^p} \; \frac{1}{n} \sum_{i=1}^{n} |1 - y_i w^\top \Phi(x_i)|_+ + \lambda \|w\|^2.

Consider

  w_{t+1} = w_t - \frac{1}{B\sqrt{t}} \left( \frac{1}{n} \sum_{i=1}^{n} S_i(w_t) + 2\lambda w_t \right),

  S_i(w_t) = \begin{cases} -y_i \Phi(x_i) & \text{if } y_i w_t^\top \Phi(x_i) \leq 1 \\ 0 & \text{otherwise.} \end{cases}

Complexity
Time: O(npT) for n examples, p features, T steps.

What about kernels?
Representer theorem of SVM

By induction,

  c_{t+1} = c_t - \frac{1}{B\sqrt{t}} \left( \frac{1}{n} \sum_{i=1}^{n} S_i(c_t) + 2\lambda c_t \right),

with e_i the i-th element of the canonical basis,

  f_t(x) = \sum_{i=1}^{n} x^\top x_i \, (c_t)_i,

and

  S_i(c_t) = \begin{cases} -y_i e_i & \text{if } y_i f_t(x_i) < 1 \\ 0 & \text{otherwise.} \end{cases}


Kernel SVM using subgradient

By induction,

  c_{t+1} = c_t - \frac{1}{B\sqrt{t}} \left( \frac{1}{n} \sum_{i=1}^{n} S_i(c_t) + 2\lambda c_t \right),

with e_i the i-th element of the canonical basis,

  f_t(x) = \sum_{i=1}^{n} k(x, x_i) (c_t)_i,

and

  S_i(c_t) = \begin{cases} -y_i e_i & \text{if } y_i f_t(x_i) < 1 \\ 0 & \text{otherwise.} \end{cases}

Complexity
Time: O(n^2 (C_k + T)) for n examples, C_k kernel evaluation cost, T steps.
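
A minimal sketch of this kernel SVM iteration in the coefficients c (labels in {-1, +1}; the bound B on the subgradients is left as an input, an assumption of this sketch).

import numpy as np

def kernel_svm_subgradient(K, y, lam, B, T):
    """Subgradient iteration in c; K is the (n, n) training kernel matrix."""
    n = K.shape[0]
    c = np.zeros(n)
    for t in range(1, T + 1):
        f = K @ c                                  # f_t(x_i)
        active = y * f < 1                         # y_i f_t(x_i) < 1
        subgrad = np.where(active, -y, 0.0) / n + 2 * lam * c
        c = c - subgrad / (B * np.sqrt(t))
    return c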
What else

- Why are they called support vector machines?
- And what about the margin and all that?


Optimality condition for SVM

Smooth convex: \nabla F(w_*) = 0.    Non-smooth convex: 0 \in \partial F(w_*).

For the SVM objective,

  0 \in \partial \hat{L}_\lambda(w_*) \;\Leftrightarrow\; 0 \in \partial \left( \frac{1}{n} \sum_{i=1}^{n} |1 - y_i w_*^\top x_i|_+ \right) + 2\lambda w_*
  \;\Leftrightarrow\; w_* \in -\frac{1}{2\lambda} \, \partial \left( \frac{1}{n} \sum_{i=1}^{n} |1 - y_i w_*^\top x_i|_+ \right).


Optimality condition for SVM (cont.)

The optimality condition can be rewritten as

  0 = \frac{1}{n} \sum_{i=1}^{n} (-y_i x_i c_i) + 2\lambda w \quad \Rightarrow \quad w = \sum_{i=1}^{n} x_i \left( \frac{y_i c_i}{2\lambda n} \right),

where

  c_i = c_i(w) \in [\ell_-(-y_i w^\top x_i),\; \ell_+(-y_i w^\top x_i)],

with \ell_-, \ell_+ the left and right derivatives of |\cdot|_+.

A direct computation gives

  c_i = 1            if  y_i f(x_i) < 1
  0 \leq c_i \leq 1  if  y_i f(x_i) = 1
  c_i = 0            if  y_i f(x_i) > 1


Support vectors

  c_i = 1            if  y_i f(x_i) < 1
  0 \leq c_i \leq 1  if  y_i f(x_i) = 1
  c_i = 0            if  y_i f(x_i) > 1

The training points x_i with c_i \neq 0 are the support vectors.


Sparsity and SVM solvers

The conditions

  c_i = 1            if  y_i f(x_i) < 1
  0 \leq c_i \leq 1  if  y_i f(x_i) = 1
  c_i = 0            if  y_i f(x_i) > 1

show that the SVM solution is sparse with respect to the training points.

- Classical quadratic programming solvers for SVM exploit this sparsity.
- Subgradient methods require only matrix-vector multiplications, hence are preferable for large-scale problems.


And now the margin

  \min_{w \in \mathbb{R}^p} \; \frac{1}{n} \sum_{i=1}^{n} |1 - y_i w^\top x_i|_+ + \lambda \|w\|^2.

For C = \frac{1}{2n\lambda}, consider the following equivalent formulation:

  \min_{w \in \mathbb{R}^p} \; C \sum_{i=1}^{n} \xi_i + \frac{1}{2} \|w\|^2,

subj. to, for all i = 1, \dots, n,

  \xi_i \geq 0, \qquad y_i w^\top x_i \geq 1 - \xi_i.

The slack variables \xi_i quantify how much the constraints are violated.
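
For small problems, this constrained formulation can be handed directly to a generic quadratic programming solver. The sketch below uses cvxpy, which is not mentioned in the slides and is assumed to be available; it simply transcribes the objective and constraints above.

import numpy as np
import cvxpy as cp

def soft_margin_svm(X, y, C):
    """Soft margin SVM as a QP; C corresponds to 1/(2 n lambda) in the penalized form."""
    n, d = X.shape
    w = cp.Variable(d)
    xi = cp.Variable(n)
    objective = cp.Minimize(C * cp.sum(xi) + 0.5 * cp.sum_squares(w))
    constraints = [xi >= 0, cp.multiply(y, X @ w) >= 1 - xi]
    cp.Problem(objective, constraints).solve()
    return w.value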


Soft and hard margin SVM

This is the classical soft margin SVM formulation:

  \min_{w \in \mathbb{R}^p} \; C \sum_{i=1}^{n} \xi_i + \frac{1}{2} \|w\|^2, \quad \text{subj. to } \xi_i \geq 0, \; y_i w^\top x_i \geq 1 - \xi_i, \; \forall\, i = 1, \dots, n.

The name comes from considering the limit case C \to \infty,

  \min_{w \in \mathbb{R}^p} \; \frac{1}{2} \|w\|^2, \quad \text{subj. to } y_i w^\top x_i \geq 1, \; \forall\, i = 1, \dots, n,

called hard margin SVM.


Max margin

  \min_{w \in \mathbb{R}^p} \; \|w\|^2, \quad \text{subj. to } y_i w^\top x_i \geq 1, \; \forall\, i = 1, \dots, n.

The above problem has a geometric interpretation. For linearly separable data:

- 2/\|w\| is the margin, i.e. twice the smallest distance of each class to the hyperplane w^\top x = 0.
- The constraints select functions linearly separating the data.

Hard margin SVM: find the max margin solution separating the data.


Summary

- Logistic regression and SVM are instances of penalized ERM.
- Optimization by gradient descent / subgradient methods.
- Nonlinear extensions using features/kernels.
- Optimality conditions and support vectors.
- Margin.
