
MIT 9.520/6.860, Fall 2018
Statistical Learning Theory and Applications

Class 05: Logistic Regression and Support Vector Machines

Lorenzo Rosasco

Last class

Non-linear functions using

- features:

  f(x) = w^\top x \;\mapsto\; f(x) = w^\top \Phi(x),

- kernels:

  f(x) = w^\top x \;\mapsto\; f(x) = \sum_{i=1}^{N} k(x, x_i)\, c_i.


More precisely

- A feature map \Phi defines the space H_\Phi of functions

  f(x) = w^\top \Phi(x),

  and k(x, \bar{x}) := \Phi(x)^\top \Phi(\bar{x}) is positive definite.

- A positive definite kernel k defines the space H_k of functions

  f(x) = \sum_{i=1}^{N} k(x, x_i)\, c_i,

  with the reproducing property

  f(x) = \langle f, k(x, \cdot) \rangle_{H_k}.

- For every k there is a feature map \Phi (indeed, infinitely many) such that

  k(x, \bar{x}) = \Phi(x)^\top \Phi(\bar{x}),

  and

  H_k \simeq H_\Phi.
Today

Beyond least squares:

  (y - f(x))^2 \;\mapsto\; \ell(y, f(x)).

Today:
- Logistic loss.
- Hinge loss.


ERM and penalization

  \min_{w \in \mathbb{R}^d} \; \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, w^\top x_i) + \lambda \|w\|^2, \qquad \lambda \geq 0.

- Logistic loss \mapsto logistic regression.
- Hinge loss \mapsto SVM.

Non-linear extensions via features/kernels.


From regularization to optimization

Problem: solve

  \min_{w \in \mathbb{R}^d} \; \hat{L}(w) + \lambda \|w\|^2,

where

  \hat{L}(w) = \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, w^\top x_i).


Minimization

Assume \ell convex and continuous, and let

  \hat{L}_\lambda(w) = \hat{L}(w) + \lambda \|w\|^2.

- \hat{L}_\lambda is a coercive (i.e. \lim_{\|w\| \to \infty} \hat{L}_\lambda(w) = \infty), strongly convex functional
  \Rightarrow a minimizer exists and is unique.

- The computations depend on the considered loss.
Logistic regression

  \hat{L}_\lambda(w) = \frac{1}{n} \sum_{i=1}^{n} \log(1 + e^{-y_i w^\top x_i}) + \lambda \|w\|^2.

- \hat{L}_\lambda is smooth, with gradient

  \nabla \hat{L}_\lambda(w) = -\frac{1}{n} \sum_{i=1}^{n} \frac{x_i y_i}{1 + e^{y_i w^\top x_i}} + 2\lambda w.

- The optimality condition gives a nonlinear equation,

  \nabla \hat{L}_\lambda(w) = 0,

  so we use gradient methods.


Gradient descent

Let F : \mathbb{R}^d \to \mathbb{R} be differentiable, (strictly) convex, and such that

  \|\nabla F(w) - \nabla F(w')\| \leq B \|w - w'\|

(e.g. \sup_w \|H(w)\| \leq B, with H(w) the Hessian of F).

Then

  w_0 = 0, \qquad w_{t+1} = w_t - \frac{1}{B} \nabla F(w_t)

converges to the minimizer of F.
Gradient descent for logistic regression

  \min_{w \in \mathbb{R}^d} \; \frac{1}{n} \sum_{i=1}^{n} \log(1 + e^{-y_i w^\top x_i}) + \lambda \|w\|^2.

Consider

  w_{t+1} = w_t - \frac{1}{B} \left( -\frac{1}{n} \sum_{i=1}^{n} \frac{x_i y_i}{1 + e^{y_i w_t^\top x_i}} + 2\lambda w_t \right).

Complexity
Time: O(ndT) for n examples, d dimensions, T steps.
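
As a concrete illustration (not from the slides), here is a minimal NumPy sketch of this gradient descent for l2-regularized logistic regression; the function name logistic_gd and the fixed step size `step` (playing the role of 1/B) are illustrative assumptions.

import numpy as np

def logistic_gd(X, y, lam, step, T):
    """Gradient descent for l2-regularized logistic regression.
    X: (n, d) data matrix; y: labels in {-1, +1}; step ~ 1/B."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(T):
        margins = y * (X @ w)                    # y_i w^T x_i
        sigmas = 1.0 / (1.0 + np.exp(margins))   # 1 / (1 + e^{y_i w^T x_i})
        grad = -(X * (y * sigmas)[:, None]).mean(axis=0) + 2 * lam * w
        w = w - step * grad
    return w

# Example use (hypothetical data): w = logistic_gd(X, y, lam=0.1, step=0.01, T=1000)

Each iteration costs O(nd), matching the O(ndT) total above.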


Non-linear features

  f(x) = w^\top x \;\mapsto\; f(x) = w^\top \Phi(x),

  \Phi(x) = (\varphi_1(x), \dots, \varphi_p(x)).
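
For instance (an illustrative choice, not from the slides), an explicit degree-two polynomial feature map can be computed up front, after which the linear routines above are reused on Phi(X); the helper name poly2_features is hypothetical.

import numpy as np

def poly2_features(X):
    """One possible feature map: Phi(x) = (x_1, ..., x_d, all products x_i x_j)."""
    n, d = X.shape
    cross = np.einsum('ni,nj->nij', X, X).reshape(n, d * d)  # pairwise products
    return np.hstack([X, cross])                             # shape (n, d + d^2)

Here p = d + d^2, so the cost of explicit features grows quickly; kernels (next slides) avoid forming Phi explicitly.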


Gradient descent for non-linear logistic regression

  \min_{w \in \mathbb{R}^p} \; \frac{1}{n} \sum_{i=1}^{n} \log(1 + e^{-y_i w^\top \Phi(x_i)}) + \lambda \|w\|^2.

Consider

  w_{t+1} = w_t - \frac{1}{B} \left( -\frac{1}{n} \sum_{i=1}^{n} \frac{\Phi(x_i) y_i}{1 + e^{y_i w_t^\top \Phi(x_i)}} + 2\lambda w_t \right).

Complexity
Time: O(npT) for n examples, p features, T steps.

What about kernels?


Representer theorem for logistic regression?

As for least squares, show that w = \sum_{i=1}^{n} x_i c_i, i.e.

  f(x) = w^\top x = \sum_{i=1}^{n} x_i^\top x \, c_i, \qquad c_i \in \mathbb{R}.

Compute c = (c_1, \dots, c_n) \in \mathbb{R}^n rather than w \in \mathbb{R}^d.


Representer theorem for GD & logistic regression

By induction,

  c_{t+1} = c_t - \frac{1}{B} \left( -\frac{1}{n} \sum_{i=1}^{n} \frac{e_i y_i}{1 + e^{y_i f_t(x_i)}} + 2\lambda c_t \right),

with e_i the i-th element of the canonical basis and

  f_t(x) = \sum_{i=1}^{n} x^\top x_i \, (c_t)_i.


Proof of the representer theorem for GD & logistic regression

Assume

  w_t = \sum_{i=1}^{n} x_i (c_t)_i.

Then

  w_{t+1} = w_t - \frac{1}{B} \left( -\frac{1}{n} \sum_{i=1}^{n} \frac{x_i y_i}{1 + e^{y_i w_t^\top x_i}} + 2\lambda w_t \right)

          = \sum_{i=1}^{n} x_i (c_t)_i - \frac{1}{B} \left( -\frac{1}{n} \sum_{i=1}^{n} x_i \frac{y_i}{1 + e^{y_i (\sum_{j=1}^{n} x_j (c_t)_j)^\top x_i}} + 2\lambda \sum_{i=1}^{n} x_i (c_t)_i \right)

          = \sum_{i=1}^{n} x_i \left[ (c_t)_i - \frac{1}{B} \left( -\frac{1}{n} \frac{y_i}{1 + e^{y_i (\sum_{j=1}^{n} x_j (c_t)_j)^\top x_i}} + 2\lambda (c_t)_i \right) \right].

Hence

  w_{t+1} = \sum_{i=1}^{n} x_i (c_{t+1})_i.
Kernel logistic regression

Given a positive definite kernel k, consider

  c_{t+1} = c_t - \frac{1}{B} \left( -\frac{1}{n} \sum_{i=1}^{n} \frac{e_i y_i}{1 + e^{y_i f_t(x_i)}} + 2\lambda c_t \right),

with e_i the i-th element of the canonical basis and

  f_t(x) = \sum_{i=1}^{n} k(x, x_i) (c_t)_i.

Complexity
Time: O(n^2 (C_k + T)) for n examples, C_k kernel evaluation cost, T steps.
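
A minimal sketch of this iteration, assuming a Gaussian kernel and labels in {-1, +1}; the names gaussian_kernel and kernel_logistic_gd and the fixed step size are illustrative, not from the slides.

import numpy as np

def gaussian_kernel(X, Z, sigma=1.0):
    """k(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def kernel_logistic_gd(K, y, lam, step, T):
    """Gradient descent in the coefficients c; K is the (n, n) training kernel matrix."""
    n = K.shape[0]
    c = np.zeros(n)
    for _ in range(T):
        f = K @ c                                  # f_t(x_i) = sum_j k(x_i, x_j) (c_t)_j
        grad = -(y / (1.0 + np.exp(y * f))) / n + 2 * lam * c
        c = c - step * grad
    return c

# Prediction at new points X_new: f = gaussian_kernel(X_new, X_train) @ c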


Hinge loss and support vector machines

  \hat{L}_\lambda(w) = \underbrace{\frac{1}{n} \sum_{i=1}^{n} |1 - y_i w^\top x_i|_+ + \lambda \|w\|^2}_{\text{non-smooth \& strongly convex}}

Consider the "left" derivative

  w_{t+1} = w_t - \frac{1}{B\sqrt{t}} \left( \frac{1}{n} \sum_{i=1}^{n} S_i(w_t) + 2\lambda w_t \right),

  S_i(w) = \begin{cases} -y_i x_i & \text{if } y_i w^\top x_i \leq 1 \\ 0 & \text{otherwise,} \end{cases} \qquad B = \sup_{x \in X} \|x\| + 2\lambda.

Here B is a bound on the subgradients.
Subgradient
Let F : \mathbb{R}^p \to \mathbb{R} be convex.

Subgradient: \partial F(w_0) is the set of vectors v \in \mathbb{R}^p such that, for every w \in \mathbb{R}^p,

  F(w) - F(w_0) \geq (w - w_0)^\top v.

In one dimension, \partial F(w_0) = [F'_-(w_0), F'_+(w_0)].
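
As a concrete one-dimensional example (not on the slide), the subdifferential of the hinge-type function t \mapsto |1 - t|_+ is

  \partial_t \, |1 - t|_+ =
  \begin{cases}
  \{-1\} & \text{if } t < 1,\\
  [-1, 0] & \text{if } t = 1,\\
  \{0\} & \text{if } t > 1,
  \end{cases}

which matches, up to sign, the characterization of the SVM coefficients c_i given later in this lecture.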


Subgradient method

Let F : \mathbb{R}^p \to \mathbb{R} be convex, with subdifferential bounded by B, and let \gamma_t = \frac{1}{B\sqrt{t}}. Then

  w_{t+1} = w_t - \gamma_t v_t, \qquad v_t \in \partial F(w_t),

converges to the minimizer of F.

Note: it is not a descent method.


Subgradient method for SVM

  \min_{w \in \mathbb{R}^d} \; \frac{1}{n} \sum_{i=1}^{n} |1 - y_i w^\top x_i|_+ + \lambda \|w\|^2.

Consider

  w_{t+1} = w_t - \frac{1}{B\sqrt{t}} \left( \frac{1}{n} \sum_{i=1}^{n} S_i(w_t) + 2\lambda w_t \right),

  S_i(w_t) = \begin{cases} -y_i x_i & \text{if } y_i w_t^\top x_i \leq 1 \\ 0 & \text{otherwise.} \end{cases}

Complexity
Time: O(ndT) for n examples, d dimensions, T steps.
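
A minimal NumPy sketch of this subgradient iteration (the function name svm_subgradient is illustrative; labels are assumed in {-1, +1}, and B is set as on the earlier slide).

import numpy as np

def svm_subgradient(X, y, lam, T):
    """Subgradient method for the hinge loss + l2 objective."""
    n, d = X.shape
    w = np.zeros(d)
    B = np.linalg.norm(X, axis=1).max() + 2 * lam        # sup ||x|| + 2*lambda
    for t in range(1, T + 1):
        active = y * (X @ w) <= 1                        # points with y_i w^T x_i <= 1
        subgrad = -(X[active] * y[active][:, None]).sum(axis=0) / n + 2 * lam * w
        w = w - subgrad / (B * np.sqrt(t))               # step size 1/(B sqrt(t))
    return w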


Connection to the perceptron

- Replace the hinge loss with

  \ell(y, f(x)) = |-y f(x)|_+.

- Set \lambda = 0.

Reasoning as above, we can solve ERM by

  w_{t+1} = w_t - \frac{1}{B\sqrt{t}} \left( \frac{1}{n} \sum_{i=1}^{n} S_i(w_t) + 2\lambda w_t \right), \qquad S_i(w_t) = \begin{cases} -y_i x_i & \text{if } y_i w_t^\top x_i \leq 0 \\ 0 & \text{otherwise.} \end{cases}

This is a "batch" version of the perceptron algorithm,

  w_{t+1} = w_t - \gamma\, S_t(w_t), \qquad S_t(w_t) = \begin{cases} -y_t x_t & \text{if } y_t w_t^\top x_t \leq 0 \\ 0 & \text{otherwise.} \end{cases}
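
For comparison, a minimal sketch of the classical online perceptron update above; the epoch count and the step size gamma are arbitrary illustrative choices.

import numpy as np

def perceptron(X, y, gamma=1.0, epochs=10):
    """Online perceptron: update only on mistakes (y_i w^T x_i <= 0)."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in range(n):
            if y[i] * (X[i] @ w) <= 0:   # mistake: move w toward y_i x_i
                w = w + gamma * y[i] * X[i]
    return w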


Nonlinear SVM using features and subgradients

  \min_{w \in \mathbb{R}^p} \; \frac{1}{n} \sum_{i=1}^{n} |1 - y_i w^\top \Phi(x_i)|_+ + \lambda \|w\|^2.

Consider

  w_{t+1} = w_t - \frac{1}{B\sqrt{t}} \left( \frac{1}{n} \sum_{i=1}^{n} S_i(w_t) + 2\lambda w_t \right),

  S_i(w_t) = \begin{cases} -y_i \Phi(x_i) & \text{if } y_i w_t^\top \Phi(x_i) \leq 1 \\ 0 & \text{otherwise.} \end{cases}

Complexity
Time: O(npT) for n examples, p features, T steps.

What about kernels?
Representer theorem of SVM

By induction,

  c_{t+1} = c_t - \frac{1}{B\sqrt{t}} \left( \frac{1}{n} \sum_{i=1}^{n} S_i(c_t) + 2\lambda c_t \right),

with e_i the i-th element of the canonical basis,

  f_t(x) = \sum_{i=1}^{n} x^\top x_i \, (c_t)_i,

and

  S_i(c_t) = \begin{cases} -y_i e_i & \text{if } y_i f_t(x_i) < 1 \\ 0 & \text{otherwise.} \end{cases}


Kernel SVM using subgradient

By induction,

  c_{t+1} = c_t - \frac{1}{B\sqrt{t}} \left( \frac{1}{n} \sum_{i=1}^{n} S_i(c_t) + 2\lambda c_t \right),

with e_i the i-th element of the canonical basis,

  f_t(x) = \sum_{i=1}^{n} k(x, x_i) (c_t)_i,

and

  S_i(c_t) = \begin{cases} -y_i e_i & \text{if } y_i f_t(x_i) < 1 \\ 0 & \text{otherwise.} \end{cases}

Complexity
Time: O(n^2 (C_k + T)) for n examples, C_k kernel evaluation cost, T steps.
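
A minimal sketch of this kernel SVM iteration in the coefficients c (labels in {-1, +1}; the bound B on the subgradients is left as an input, an assumption of this sketch).

import numpy as np

def kernel_svm_subgradient(K, y, lam, B, T):
    """Subgradient iteration in c; K is the (n, n) training kernel matrix."""
    n = K.shape[0]
    c = np.zeros(n)
    for t in range(1, T + 1):
        f = K @ c                                  # f_t(x_i)
        active = y * f < 1                         # y_i f_t(x_i) < 1
        subgrad = np.where(active, -y, 0.0) / n + 2 * lam * c
        c = c - subgrad / (B * np.sqrt(t))
    return c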
What else

- Why are they called support vector machines?
- And what about the margin and all that?


Optimality condition for SVM

Smooth convex: \nabla F(w_*) = 0.    Non-smooth convex: 0 \in \partial F(w_*).

For the SVM objective,

  0 \in \partial \hat{L}_\lambda(w_*) \;\Leftrightarrow\; 0 \in \partial \left( \frac{1}{n} \sum_{i=1}^{n} |1 - y_i w_*^\top x_i|_+ \right) + 2\lambda w_*
  \;\Leftrightarrow\; w_* \in -\frac{1}{2\lambda} \, \partial \left( \frac{1}{n} \sum_{i=1}^{n} |1 - y_i w_*^\top x_i|_+ \right).


Optimality condition for SVM (cont.)

The optimality condition can be rewritten as

  0 = \frac{1}{n} \sum_{i=1}^{n} (-y_i x_i c_i) + 2\lambda w \quad \Rightarrow \quad w = \sum_{i=1}^{n} x_i \left( \frac{y_i c_i}{2\lambda n} \right),

where

  c_i = c_i(w) \in [\ell_-(-y_i w^\top x_i),\; \ell_+(-y_i w^\top x_i)],

with \ell_-, \ell_+ the left and right derivatives of |\cdot|_+.

A direct computation gives

  c_i = 1            if  y_i f(x_i) < 1
  0 \leq c_i \leq 1  if  y_i f(x_i) = 1
  c_i = 0            if  y_i f(x_i) > 1


Support vectors

  c_i = 1            if  y_i f(x_i) < 1
  0 \leq c_i \leq 1  if  y_i f(x_i) = 1
  c_i = 0            if  y_i f(x_i) > 1

The training points x_i with c_i \neq 0 are the support vectors.


Sparsity and SVM solvers

The conditions

  c_i = 1            if  y_i f(x_i) < 1
  0 \leq c_i \leq 1  if  y_i f(x_i) = 1
  c_i = 0            if  y_i f(x_i) > 1

show that the SVM solution is sparse with respect to the training points.

- Classical quadratic programming solvers for SVM exploit this sparsity.
- Subgradient methods require only matrix-vector multiplications, hence are preferable for large-scale problems.


And now the margin

  \min_{w \in \mathbb{R}^p} \; \frac{1}{n} \sum_{i=1}^{n} |1 - y_i w^\top x_i|_+ + \lambda \|w\|^2.

For C = \frac{1}{2n\lambda}, consider the following equivalent formulation:

  \min_{w \in \mathbb{R}^p} \; C \sum_{i=1}^{n} \xi_i + \frac{1}{2} \|w\|^2,

subj. to, for all i = 1, \dots, n,

  \xi_i \geq 0, \qquad y_i w^\top x_i \geq 1 - \xi_i.

The slack variables \xi_i quantify how much the constraints are violated.
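
For small problems, this constrained formulation can be handed directly to a generic quadratic programming solver. The sketch below uses cvxpy, which is not mentioned in the slides and is assumed to be available; it simply transcribes the objective and constraints above.

import numpy as np
import cvxpy as cp

def soft_margin_svm(X, y, C):
    """Soft margin SVM as a QP; C corresponds to 1/(2 n lambda) in the penalized form."""
    n, d = X.shape
    w = cp.Variable(d)
    xi = cp.Variable(n)
    objective = cp.Minimize(C * cp.sum(xi) + 0.5 * cp.sum_squares(w))
    constraints = [xi >= 0, cp.multiply(y, X @ w) >= 1 - xi]
    cp.Problem(objective, constraints).solve()
    return w.value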


Soft and hard margin SVM

This is the classical soft margin SVM formulation:

  \min_{w \in \mathbb{R}^p} \; C \sum_{i=1}^{n} \xi_i + \frac{1}{2} \|w\|^2, \quad \text{subj. to } \xi_i \geq 0, \; y_i w^\top x_i \geq 1 - \xi_i, \; \forall\, i = 1, \dots, n.

The name comes from considering the limit case C \to \infty,

  \min_{w \in \mathbb{R}^p} \; \frac{1}{2} \|w\|^2, \quad \text{subj. to } y_i w^\top x_i \geq 1, \; \forall\, i = 1, \dots, n,

called hard margin SVM.


Max margin

  \min_{w \in \mathbb{R}^p} \; \|w\|^2, \quad \text{subj. to } y_i w^\top x_i \geq 1, \; \forall\, i = 1, \dots, n.

The above problem has a geometric interpretation. For linearly separable data:

- 2/\|w\| is the margin, i.e. twice the smallest distance of each class to the hyperplane w^\top x = 0.
- The constraints select functions linearly separating the data.

Hard margin SVM: find the max margin solution separating the data.


Summary

- Logistic regression and SVM are instances of penalized ERM.
- Optimization by gradient descent / subgradient methods.
- Nonlinear extensions using features/kernels.
- Optimality conditions and support vectors.
- Margin.
