
Machine learning: lecture 7

Tommi S. Jaakkola
MIT CSAIL
tommi@csail.mit.edu

Topics
• Support vector machines
  – separable case, formulation, margin
  – non-separable case, penalties, and logistic regression
  – dual solution, kernels
  – examples, properties


Support vector machine (SVM)

• When the training examples are linearly separable we can maximize a
  geometric notion of margin (the distance to the boundary) by minimizing
  the regularization penalty

      ‖w_1‖²/2 = Σ_{i=1}^d w_i²/2

  subject to the classification constraints

      y_i [w_0 + x_i^T w_1] − 1 ≥ 0

  for i = 1, ..., n.
• The solution is defined only on the basis of a subset of examples, the
  “support vectors”.

SVM: separable case

• We minimize ‖w_1‖²/2 = Σ_{i=1}^d w_i²/2 subject to

      y_i [w_0 + x_i^T w_1] − 1 ≥ 0,   i = 1, ..., n

  (Figure: a separable data set with the decision boundary
  f(x; ŵ) = ŵ_0 + x^T ŵ_1, the normal direction ŵ_1, the margin 1/‖ŵ_1‖,
  and the gap |x^+ − x^−|/2 between the closest examples of the two classes.)
• The resulting margin and the “slope” ‖ŵ_1‖ are inversely related.
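(Added illustration, not part of the original slides.) The sketch below fits a linear SVM on a made-up separable data set and checks that the geometric margin equals 1/‖ŵ_1‖. It uses scikit-learn's SVC with a very large C to approximate the hard-margin problem; the data, random seed, and C value are illustrative assumptions.

```python
# Minimal sketch: approximating the hard-margin SVM with a linear kernel and
# a very large C, then checking that the geometric margin equals 1/||w_1||.
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two linearly separable clusters, labels y in {-1, +1}
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([-2, -2], 0.5, (20, 2)),
               rng.normal([+2, +2], 0.5, (20, 2))])
y = np.array([-1] * 20 + [+1] * 20)

clf = SVC(kernel="linear", C=1e6)    # large C ~ hard margin
clf.fit(X, y)

w1 = clf.coef_.ravel()               # estimated w_1
w0 = clf.intercept_[0]               # estimated w_0
margin = 1.0 / np.linalg.norm(w1)    # geometric margin = 1/||w_1||
print("margin:", margin)

# Support vectors satisfy y_i [w_0 + x_i^T w_1] ~= 1 (active constraints)
sv = X[clf.support_]
print(y[clf.support_] * (sv @ w1 + w0))
```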

SVM: non-separable case

• When the examples are not linearly separable we can modify the
  optimization problem slightly to add a penalty for violating the
  classification constraints: we minimize

      ‖w_1‖²/2 + C Σ_{i=1}^n ξ_i

  subject to the relaxed classification constraints

      y_i [w_0 + x_i^T w_1] − 1 + ξ_i ≥ 0,   i = 1, ..., n.

  Here ξ_i ≥ 0 are called “slack” variables.

  (Figure: a non-separable data set with direction ŵ_1; examples that
  violate their margin constraint receive positive slack ξ_i.)

SVM: non-separable case cont’d

• We can also write the SVM optimization problem more compactly as

      C Σ_{i=1}^n ( 1 − y_i [w_0 + x_i^T w_1] )^+ + ‖w_1‖²/2

  where (z)^+ = z if z ≥ 0 and zero otherwise (i.e., it returns the
  positive part). Each term ( 1 − y_i [w_0 + x_i^T w_1] )^+ plays the role
  of the corresponding slack ξ_i.
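(Added illustration, not lecture code.) To make the equivalence concrete: for any fixed (w_0, w_1) the smallest feasible slack is ξ_i = (1 − y_i [w_0 + x_i^T w_1])^+, so the slack formulation and the compact positive-part objective take the same value. The data and parameter values below are made up.

```python
# For a fixed (w_0, w_1), the optimal slack is the positive part of the
# margin violation, so both objectives coincide.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))                  # made-up examples
y = rng.choice([-1, 1], size=10)              # made-up labels
w0, w1 = 0.1, np.array([1.0, -0.5])           # an arbitrary candidate classifier
C = 10.0

margins = y * (w0 + X @ w1)                   # y_i [w_0 + x_i^T w_1]
xi = np.maximum(0.0, 1.0 - margins)           # optimal slack = positive part

slack_objective   = 0.5 * w1 @ w1 + C * xi.sum()
compact_objective = C * np.maximum(0.0, 1.0 - margins).sum() + 0.5 * w1 @ w1
print(slack_objective, compact_objective)     # identical by construction
```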


SVM: non-separable case cont’d

• The compact form above is equivalent to regularized empirical loss
  minimization

      (1/n) Σ_{i=1}^n ( 1 − y_i [w_0 + x_i^T w_1] )^+ + λ ‖w_1‖²/2

  where λ = 1/(nC) is the regularization parameter (a numerical check of
  this rescaling follows below).

SVM vs logistic regression

• When viewed from the point of view of regularized empirical loss
  minimization, SVM and logistic regression appear quite similar:

      SVM:       (1/n) Σ_{i=1}^n ( 1 − y_i [w_0 + x_i^T w_1] )^+ + λ ‖w_1‖²/2

      Logistic:  (1/n) Σ_{i=1}^n − log g( y_i [w_0 + x_i^T w_1] ) + λ ‖w_1‖²/2

  where g(z) = (1 + exp(−z))^{−1} is the logistic function, so that
  − log g( y_i [w_0 + x_i^T w_1] ) = − log P(y_i | x_i, w).
  (Note that we have transformed the problem of maximizing the penalized
  log-likelihood into minimizing the negative penalized log-likelihood.)
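(Added check, not from the lecture.) The equivalence with λ = 1/(nC) is just a rescaling of the objective by the constant factor nC, so the two forms share the same minimizers. The sketch below verifies this numerically with made-up data and an arbitrary parameter setting.

```python
# The compact objective  C * sum_i (1 - y_i f_i)^+ + ||w_1||^2/2  and the
# regularized empirical loss  (1/n) sum_i (1 - y_i f_i)^+ + lambda ||w_1||^2/2
# with lambda = 1/(nC) differ only by the constant factor nC.
import numpy as np

rng = np.random.default_rng(0)
n = 50
X, y = rng.normal(size=(n, 2)), rng.choice([-1, 1], size=n)   # made-up data
w0, w1 = 0.2, np.array([0.8, -1.1])                           # arbitrary parameters
C = 5.0
lam = 1.0 / (n * C)

hinge = np.maximum(0.0, 1.0 - y * (w0 + X @ w1))              # (1 - y_i f_i)^+

compact     = C * hinge.sum() + 0.5 * w1 @ w1
regularized = hinge.mean() + lam * 0.5 * w1 @ w1
print(np.isclose(compact, n * C * regularized))               # True
```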

SVM vs logistic regression cont’d

• The difference comes from how we penalize “errors”. Both methods minimize

      (1/n) Σ_{i=1}^n Loss( y_i [w_0 + x_i^T w_1] ) + λ ‖w_1‖²/2

  but with different loss functions:
• SVM: Loss(z) = (1 − z)^+
• Regularized logistic regression: Loss(z) = log(1 + exp(−z))

  (Figure: the SVM loss and the LR loss plotted as functions of z.)
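(Added illustration mirroring the figure; my own code.) The sketch below tabulates the two losses on a grid of z values: the hinge loss is exactly zero for z ≥ 1, while the logistic loss decays smoothly but never reaches zero.

```python
# The SVM hinge loss (1 - z)^+ and the logistic loss log(1 + exp(-z))
# as functions of z = y [w_0 + x^T w_1].
import numpy as np

z = np.linspace(-4, 4, 9)
svm_loss = np.maximum(0.0, 1.0 - z)          # (1 - z)^+
logistic_loss = np.log1p(np.exp(-z))         # log(1 + exp(-z))

for zi, s, l in zip(z, svm_loss, logistic_loss):
    print(f"z = {zi:+.1f}   hinge = {s:.3f}   logistic = {l:.3f}")
```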

SVM: solution, Lagrange multipliers

• Back to the separable case: how do we solve

      min ‖w_1‖²/2   subject to
      y_i [w_0 + x_i^T w_1] − 1 ≥ 0,   i = 1, ..., n ?

• Let’s start by representing the constraints as losses

      max_{α ≥ 0} α ( 1 − y_i [w_0 + x_i^T w_1] ) = 0 if y_i [w_0 + x_i^T w_1] − 1 ≥ 0,
                                                    ∞ otherwise

  (a tiny numerical illustration follows after this slide) and rewrite the
  minimization problem in terms of these:

      min_w { ‖w_1‖²/2 + Σ_{i=1}^n max_{α_i ≥ 0} α_i ( 1 − y_i [w_0 + x_i^T w_1] ) }

      = min_w max_{α_i ≥ 0} { ‖w_1‖²/2 + Σ_{i=1}^n α_i ( 1 − y_i [w_0 + x_i^T w_1] ) }

SVM solution cont’d

• We can then swap ’max’ and ’min’:

      min_w max_{α_i ≥ 0} { ‖w_1‖²/2 + Σ_{i=1}^n α_i ( 1 − y_i [w_0 + x_i^T w_1] ) }

      =? max_{α_i ≥ 0} min_w J(w; α),   where

      J(w; α) = ‖w_1‖²/2 + Σ_{i=1}^n α_i ( 1 − y_i [w_0 + x_i^T w_1] )

  As a result we have to be able to minimize J(w; α) with respect to the
  parameters w for any fixed setting of the Lagrange multipliers α_i ≥ 0.
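(Added illustration of the construction above; my own code.) Maximizing α ( 1 − y_i [w_0 + x_i^T w_1] ) over α ≥ 0 gives zero when the constraint is satisfied and is unbounded otherwise, so adding these terms to the objective enforces the constraints exactly.

```python
# Each constraint becomes an "all or nothing" penalty: zero when satisfied,
# infinite when violated.
import numpy as np

def constraint_penalty(margin_value):
    """max over alpha >= 0 of alpha * (1 - margin_value),
    where margin_value = y_i [w_0 + x_i^T w_1]."""
    return 0.0 if margin_value >= 1.0 else np.inf

print(constraint_penalty(1.3))   # constraint satisfied -> 0.0
print(constraint_penalty(0.7))   # constraint violated  -> inf
```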

SVM solution cont’d

• We can find the optimal ŵ as a function of {α_i} by setting the
  derivatives of J(w; α) to zero:

      ∂J(w; α)/∂w_1 = w_1 − Σ_{i=1}^n α_i y_i x_i = 0

      ∂J(w; α)/∂w_0 = − Σ_{i=1}^n α_i y_i = 0

• We can then substitute this solution back into the objective and get
  (after some algebra)

      max { ‖ŵ_1‖²/2 + Σ_{i=1}^n α_i ( 1 − y_i [ŵ_0 + x_i^T ŵ_1] ) }

      = max { Σ_{i=1}^n α_i − (1/2) Σ_{i,j=1}^n y_i y_j α_i α_j (x_i^T x_j) }

  where both maximizations are over α_i ≥ 0 with Σ_i α_i y_i = 0.
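(Added sanity check, not from the lecture; assumes scipy is available.) The sketch below solves this dual problem numerically on a tiny, made-up separable data set with scipy's SLSQP solver and then recovers ŵ_1 = Σ_i α̂_i y_i x_i; only a few of the α_i come out non-zero, corresponding to the support vectors.

```python
# Solve  max_alpha  sum_i alpha_i - 1/2 sum_{i,j} y_i y_j alpha_i alpha_j (x_i^T x_j)
#        s.t.       alpha_i >= 0,  sum_i alpha_i y_i = 0
# on a toy separable data set, then recover w_1 = sum_i alpha_i y_i x_i.
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [2.0, 3.0],
              [-2.0, -2.0], [-3.0, -1.0], [-1.0, -3.0]])
y = np.array([+1, +1, +1, -1, -1, -1], dtype=float)
n = len(y)
K = X @ X.T                        # Gram matrix of inner products x_i^T x_j

def neg_dual(alpha):
    # negative of the dual objective (SLSQP minimizes)
    return -(alpha.sum() - 0.5 * (alpha * y) @ K @ (alpha * y))

res = minimize(neg_dual, x0=np.zeros(n), method="SLSQP",
               bounds=[(0.0, None)] * n,
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])

alpha_hat = res.x
w1_hat = (alpha_hat * y) @ X       # w_1 = sum_i alpha_i y_i x_i
support = np.where(alpha_hat > 1e-6)[0]
print("alpha:", np.round(alpha_hat, 4))
print("support vectors (indices):", support)
print("w_1:", w1_hat)
```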

SVM solution: summary

• We can find the optimal setting of the Lagrange multipliers α_i by
  maximizing

      Σ_{i=1}^n α_i − (1/2) Σ_{i,j=1}^n y_i y_j α_i α_j (x_i^T x_j)

  subject to α_i ≥ 0 and Σ_i α_i y_i = 0. Only the α_i’s corresponding to
  “support vectors” will be non-zero.
• We can make predictions on any new example x according to the sign of
  the discriminant function

      ŵ_0 + x^T ŵ_1 = ŵ_0 + x^T ( Σ_{i=1}^n α̂_i y_i x_i ) = ŵ_0 + Σ_{i∈SV} α̂_i y_i (x^T x_i)
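(Added illustration, assuming scikit-learn.) The sketch below evaluates this discriminant function by hand from a fitted SVC: sklearn stores the products α̂_i y_i for the support vectors in dual_coef_, so ŵ_0 + Σ_{i∈SV} α̂_i y_i (x^T x_i) can be computed directly and compared against decision_function. The data set and test point are made up.

```python
# Evaluating  w_0 + sum_{i in SV} alpha_i y_i (x^T x_i)  by hand.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal([-2, 0], 0.6, (25, 2)),
               rng.normal([+2, 0], 0.6, (25, 2))])
y = np.array([-1] * 25 + [+1] * 25)

clf = SVC(kernel="linear", C=10.0).fit(X, y)

x_new = np.array([[0.5, -0.3]])                       # hypothetical test point
# dual_coef_ holds alpha_i * y_i for the support vectors
manual = (clf.dual_coef_ @ (clf.support_vectors_ @ x_new.T) + clf.intercept_).ravel()
print("manual discriminant:        ", manual)
print("sklearn decision_function:  ", clf.decision_function(x_new))
print("predicted label:            ", np.sign(manual))
```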

Non-linear classifier

• So far our classifier can make only linear separations.
• As with linear regression and logistic regression models, we can easily
  obtain a non-linear classifier by first mapping our examples x = [x_1 x_2]
  into longer feature vectors

      φ(x) = [ x_1²  x_2²  √2 x_1 x_2  √2 x_1  √2 x_2  1 ]

  and then applying the linear classifier to the new feature vectors φ(x).

  (Figures: a linear separator in the feature φ-space corresponds to a
  non-linear separator in the original x-space.)
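(Added illustration, not lecture code; assumes scikit-learn.) A brief sketch of this recipe: map two-dimensional inputs through φ and train an ordinary linear SVM on the mapped features. The separator is linear in φ-space but non-linear in the original x-space. The data set, with a roughly circular class boundary, is made up.

```python
# Train a linear SVM on the explicit feature map phi(x).
import numpy as np
from sklearn.svm import SVC

def phi(X):
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1**2, x2**2,
                            np.sqrt(2) * x1 * x2,
                            np.sqrt(2) * x1,
                            np.sqrt(2) * x2,
                            np.ones_like(x1)])

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))
y = np.where(np.sum(X**2, axis=1) < 1.5, +1, -1)      # non-linearly separable in x-space

linear_in_phi = SVC(kernel="linear").fit(phi(X), y)   # linear separator in phi-space
print("training accuracy:", linear_in_phi.score(phi(X), y))
```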

Feature mapping and kernels

• Let’s look at the previous example in a bit more detail:

      x → φ(x) = [ x_1²  x_2²  √2 x_1 x_2  √2 x_1  √2 x_2  1 ]

• The SVM classifier deals only with inner products of examples (or feature
  vectors). In this example,

      φ(x)^T φ(x′) = x_1² x_1′² + x_2² x_2′² + 2 x_1 x_2 x_1′ x_2′ + 2 x_1 x_1′ + 2 x_2 x_2′ + 1
                   = ( 1 + x_1 x_1′ + x_2 x_2′ )²
                   = ( 1 + x^T x′ )²

  so the inner products can be evaluated without ever explicitly
  constructing the feature vectors φ(x)!
• K(x, x′) = ( 1 + x^T x′ )² is a kernel function (an inner product in the
  feature space).

Examples of kernel functions

• Linear kernel

      K(x, x′) = x^T x′

• Polynomial kernel

      K(x, x′) = ( 1 + x^T x′ )^p

  where p = 2, 3, .... To get the feature vectors we concatenate all
  polynomial terms of the components of x up to order p (weighted
  appropriately).
• Radial basis kernel

      K(x, x′) = exp( − ½ ‖x − x′‖² )

  In this case the feature space is an infinite-dimensional function space
  (use of the kernel results in a non-parametric classifier).
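(Added check, my own code.) The sketch below implements these three kernels and verifies numerically that, for the degree-2 feature map above, φ(x)^T φ(x′) equals (1 + x^T x′)² without ever constructing φ explicitly.

```python
# The kernels from the slide, plus a check of the feature-map identity.
import numpy as np

def linear_kernel(x, xp):
    return x @ xp

def polynomial_kernel(x, xp, p=2):
    return (1 + x @ xp) ** p

def rbf_kernel(x, xp):
    return np.exp(-0.5 * np.sum((x - xp) ** 2))

def phi(x):
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2)*x1*x2, np.sqrt(2)*x1, np.sqrt(2)*x2, 1.0])

x, xp = np.array([0.7, -1.2]), np.array([2.0, 0.5])
print(phi(x) @ phi(xp))             # explicit inner product in feature space
print(polynomial_kernel(x, xp, 2))  # same value, without constructing phi
```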


SVM examples

  (Figure: decision boundaries on the same two-dimensional data set for SVMs
  with a linear kernel, a 2nd order polynomial kernel, a 4th order polynomial
  kernel, and an 8th order polynomial kernel.)
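(Added sketch; the lecture's actual data set is not available, so the data below are made up.) The following code fits the same four kernels with scikit-learn; setting gamma=1 and coef0=1 makes sklearn's polynomial kernel match the (1 + x^T x′)^p form used in the lecture.

```python
# Fit SVMs with the four kernels shown in the figure on a made-up 2-D data set.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.uniform(-1.5, 2.0, size=(100, 2))
y = np.where(X[:, 1] > np.sin(2 * X[:, 0]), +1, -1)   # hypothetical non-linear boundary

for name, clf in [("linear", SVC(kernel="linear")),
                  ("2nd order polynomial", SVC(kernel="poly", degree=2, gamma=1.0, coef0=1)),
                  ("4th order polynomial", SVC(kernel="poly", degree=4, gamma=1.0, coef0=1)),
                  ("8th order polynomial", SVC(kernel="poly", degree=8, gamma=1.0, coef0=1))]:
    clf.fit(X, y)
    print(f"{name}: training accuracy {clf.score(X, y):.2f}, "
          f"{len(clf.support_)} support vectors")
```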
