
Machine learning: lecture 7

Tommi S. Jaakkola
MIT CSAIL
tommi@csail.mit.edu

Topics
• Support vector machines
  – separable case, formulation, margin
  – non-separable case, penalties, and logistic regression
  – dual solution, kernels
  – examples, properties


Support vector machine (SVM)

• When the training examples are linearly separable we can maximize a
  geometric notion of margin (the distance to the boundary) by minimizing
  the regularization penalty

      ‖w_1‖²/2 = Σ_{i=1}^d w_i²/2

  subject to the classification constraints

      y_i [w_0 + x_i^T w_1] − 1 ≥ 0

  for i = 1, ..., n.
• The solution is defined only on the basis of a subset of examples, the
  “support vectors”.

SVM: separable case

• We minimize ‖w_1‖²/2 = Σ_{i=1}^d w_i²/2 subject to

      y_i [w_0 + x_i^T w_1] − 1 ≥ 0,   i = 1, ..., n

  (Figure: a separable data set with the decision boundary
  f(x; ŵ) = ŵ_0 + x^T ŵ_1, the normal direction ŵ_1, the margin 1/‖ŵ_1‖,
  and the gap |x^+ − x^−|/2 between the closest examples of the two classes.)
• The resulting margin and the “slope” ‖ŵ_1‖ are inversely related.
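(Added illustration, not part of the original slides.) The sketch below fits a linear SVM on a made-up separable data set and checks that the geometric margin equals 1/‖ŵ_1‖. It uses scikit-learn's SVC with a very large C to approximate the hard-margin problem; the data, random seed, and C value are illustrative assumptions.

```python
# Minimal sketch: approximating the hard-margin SVM with a linear kernel and
# a very large C, then checking that the geometric margin equals 1/||w_1||.
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two linearly separable clusters, labels y in {-1, +1}
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([-2, -2], 0.5, (20, 2)),
               rng.normal([+2, +2], 0.5, (20, 2))])
y = np.array([-1] * 20 + [+1] * 20)

clf = SVC(kernel="linear", C=1e6)    # large C ~ hard margin
clf.fit(X, y)

w1 = clf.coef_.ravel()               # estimated w_1
w0 = clf.intercept_[0]               # estimated w_0
margin = 1.0 / np.linalg.norm(w1)    # geometric margin = 1/||w_1||
print("margin:", margin)

# Support vectors satisfy y_i [w_0 + x_i^T w_1] ~= 1 (active constraints)
sv = X[clf.support_]
print(y[clf.support_] * (sv @ w1 + w0))
```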

SVM: non-separable case

• When the examples are not linearly separable we can modify the
  optimization problem slightly to add a penalty for violating the
  classification constraints: we minimize

      ‖w_1‖²/2 + C Σ_{i=1}^n ξ_i

  subject to the relaxed classification constraints

      y_i [w_0 + x_i^T w_1] − 1 + ξ_i ≥ 0,   i = 1, ..., n.

  Here ξ_i ≥ 0 are called “slack” variables.

  (Figure: a non-separable data set with direction ŵ_1; examples that
  violate their margin constraint receive positive slack ξ_i.)

SVM: non-separable case cont’d

• We can also write the SVM optimization problem more compactly as

      C Σ_{i=1}^n ( 1 − y_i [w_0 + x_i^T w_1] )^+ + ‖w_1‖²/2

  where (z)^+ = z if z ≥ 0 and zero otherwise (i.e., it returns the
  positive part). Each term ( 1 − y_i [w_0 + x_i^T w_1] )^+ plays the role
  of the corresponding slack ξ_i.
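(Added illustration, not lecture code.) To make the equivalence concrete: for any fixed (w_0, w_1) the smallest feasible slack is ξ_i = (1 − y_i [w_0 + x_i^T w_1])^+, so the slack formulation and the compact positive-part objective take the same value. The data and parameter values below are made up.

```python
# For a fixed (w_0, w_1), the optimal slack is the positive part of the
# margin violation, so both objectives coincide.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))                  # made-up examples
y = rng.choice([-1, 1], size=10)              # made-up labels
w0, w1 = 0.1, np.array([1.0, -0.5])           # an arbitrary candidate classifier
C = 10.0

margins = y * (w0 + X @ w1)                   # y_i [w_0 + x_i^T w_1]
xi = np.maximum(0.0, 1.0 - margins)           # optimal slack = positive part

slack_objective   = 0.5 * w1 @ w1 + C * xi.sum()
compact_objective = C * np.maximum(0.0, 1.0 - margins).sum() + 0.5 * w1 @ w1
print(slack_objective, compact_objective)     # identical by construction
```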


SVM: non-separable case cont’d

• The compact form above is equivalent to regularized empirical loss
  minimization

      (1/n) Σ_{i=1}^n ( 1 − y_i [w_0 + x_i^T w_1] )^+ + λ ‖w_1‖²/2

  where λ = 1/(nC) is the regularization parameter (a numerical check of
  this rescaling follows below).

SVM vs logistic regression

• When viewed from the point of view of regularized empirical loss
  minimization, SVM and logistic regression appear quite similar:

      SVM:       (1/n) Σ_{i=1}^n ( 1 − y_i [w_0 + x_i^T w_1] )^+ + λ ‖w_1‖²/2

      Logistic:  (1/n) Σ_{i=1}^n − log g( y_i [w_0 + x_i^T w_1] ) + λ ‖w_1‖²/2

  where g(z) = (1 + exp(−z))^{−1} is the logistic function, so that
  − log g( y_i [w_0 + x_i^T w_1] ) = − log P(y_i | x_i, w).
  (Note that we have transformed the problem of maximizing the penalized
  log-likelihood into minimizing the negative penalized log-likelihood.)
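(Added check, not from the lecture.) The equivalence with λ = 1/(nC) is just a rescaling of the objective by the constant factor nC, so the two forms share the same minimizers. The sketch below verifies this numerically with made-up data and an arbitrary parameter setting.

```python
# The compact objective  C * sum_i (1 - y_i f_i)^+ + ||w_1||^2/2  and the
# regularized empirical loss  (1/n) sum_i (1 - y_i f_i)^+ + lambda ||w_1||^2/2
# with lambda = 1/(nC) differ only by the constant factor nC.
import numpy as np

rng = np.random.default_rng(0)
n = 50
X, y = rng.normal(size=(n, 2)), rng.choice([-1, 1], size=n)   # made-up data
w0, w1 = 0.2, np.array([0.8, -1.1])                           # arbitrary parameters
C = 5.0
lam = 1.0 / (n * C)

hinge = np.maximum(0.0, 1.0 - y * (w0 + X @ w1))              # (1 - y_i f_i)^+

compact     = C * hinge.sum() + 0.5 * w1 @ w1
regularized = hinge.mean() + lam * 0.5 * w1 @ w1
print(np.isclose(compact, n * C * regularized))               # True
```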

SVM vs logistic regression cont’d

• The difference comes from how we penalize “errors”. Both methods minimize

      (1/n) Σ_{i=1}^n Loss( y_i [w_0 + x_i^T w_1] ) + λ ‖w_1‖²/2

  but with different loss functions:
• SVM: Loss(z) = (1 − z)^+
• Regularized logistic regression: Loss(z) = log(1 + exp(−z))

  (Figure: the SVM loss and the LR loss plotted as functions of z.)
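(Added illustration mirroring the figure; my own code.) The sketch below tabulates the two losses on a grid of z values: the hinge loss is exactly zero for z ≥ 1, while the logistic loss decays smoothly but never reaches zero.

```python
# The SVM hinge loss (1 - z)^+ and the logistic loss log(1 + exp(-z))
# as functions of z = y [w_0 + x^T w_1].
import numpy as np

z = np.linspace(-4, 4, 9)
svm_loss = np.maximum(0.0, 1.0 - z)          # (1 - z)^+
logistic_loss = np.log1p(np.exp(-z))         # log(1 + exp(-z))

for zi, s, l in zip(z, svm_loss, logistic_loss):
    print(f"z = {zi:+.1f}   hinge = {s:.3f}   logistic = {l:.3f}")
```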

SVM: solution, Lagrange multipliers

• Back to the separable case: how do we solve

      min ‖w_1‖²/2   subject to
      y_i [w_0 + x_i^T w_1] − 1 ≥ 0,   i = 1, ..., n ?

• Let’s start by representing the constraints as losses

      max_{α ≥ 0} α ( 1 − y_i [w_0 + x_i^T w_1] ) = 0 if y_i [w_0 + x_i^T w_1] − 1 ≥ 0,
                                                    ∞ otherwise

  (a tiny numerical illustration follows after this slide) and rewrite the
  minimization problem in terms of these:

      min_w { ‖w_1‖²/2 + Σ_{i=1}^n max_{α_i ≥ 0} α_i ( 1 − y_i [w_0 + x_i^T w_1] ) }

      = min_w max_{α_i ≥ 0} { ‖w_1‖²/2 + Σ_{i=1}^n α_i ( 1 − y_i [w_0 + x_i^T w_1] ) }

SVM solution cont’d

• We can then swap ’max’ and ’min’:

      min_w max_{α_i ≥ 0} { ‖w_1‖²/2 + Σ_{i=1}^n α_i ( 1 − y_i [w_0 + x_i^T w_1] ) }

      =? max_{α_i ≥ 0} min_w J(w; α),   where

      J(w; α) = ‖w_1‖²/2 + Σ_{i=1}^n α_i ( 1 − y_i [w_0 + x_i^T w_1] )

  As a result we have to be able to minimize J(w; α) with respect to the
  parameters w for any fixed setting of the Lagrange multipliers α_i ≥ 0.
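(Added illustration of the construction above; my own code.) Maximizing α ( 1 − y_i [w_0 + x_i^T w_1] ) over α ≥ 0 gives zero when the constraint is satisfied and is unbounded otherwise, so adding these terms to the objective enforces the constraints exactly.

```python
# Each constraint becomes an "all or nothing" penalty: zero when satisfied,
# infinite when violated.
import numpy as np

def constraint_penalty(margin_value):
    """max over alpha >= 0 of alpha * (1 - margin_value),
    where margin_value = y_i [w_0 + x_i^T w_1]."""
    return 0.0 if margin_value >= 1.0 else np.inf

print(constraint_penalty(1.3))   # constraint satisfied -> 0.0
print(constraint_penalty(0.7))   # constraint violated  -> inf
```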

SVM solution cont’d

• We can find the optimal ŵ as a function of {α_i} by setting the
  derivatives of J(w; α) to zero:

      ∂J(w; α)/∂w_1 = w_1 − Σ_{i=1}^n α_i y_i x_i = 0

      ∂J(w; α)/∂w_0 = − Σ_{i=1}^n α_i y_i = 0

• We can then substitute this solution back into the objective and get
  (after some algebra)

      max { ‖ŵ_1‖²/2 + Σ_{i=1}^n α_i ( 1 − y_i [ŵ_0 + x_i^T ŵ_1] ) }

      = max { Σ_{i=1}^n α_i − (1/2) Σ_{i,j=1}^n y_i y_j α_i α_j (x_i^T x_j) }

  where both maximizations are over α_i ≥ 0 with Σ_i α_i y_i = 0.
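(Added sanity check, not from the lecture; assumes scipy is available.) The sketch below solves this dual problem numerically on a tiny, made-up separable data set with scipy's SLSQP solver and then recovers ŵ_1 = Σ_i α̂_i y_i x_i; only a few of the α_i come out non-zero, corresponding to the support vectors.

```python
# Solve  max_alpha  sum_i alpha_i - 1/2 sum_{i,j} y_i y_j alpha_i alpha_j (x_i^T x_j)
#        s.t.       alpha_i >= 0,  sum_i alpha_i y_i = 0
# on a toy separable data set, then recover w_1 = sum_i alpha_i y_i x_i.
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [2.0, 3.0],
              [-2.0, -2.0], [-3.0, -1.0], [-1.0, -3.0]])
y = np.array([+1, +1, +1, -1, -1, -1], dtype=float)
n = len(y)
K = X @ X.T                        # Gram matrix of inner products x_i^T x_j

def neg_dual(alpha):
    # negative of the dual objective (SLSQP minimizes)
    return -(alpha.sum() - 0.5 * (alpha * y) @ K @ (alpha * y))

res = minimize(neg_dual, x0=np.zeros(n), method="SLSQP",
               bounds=[(0.0, None)] * n,
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])

alpha_hat = res.x
w1_hat = (alpha_hat * y) @ X       # w_1 = sum_i alpha_i y_i x_i
support = np.where(alpha_hat > 1e-6)[0]
print("alpha:", np.round(alpha_hat, 4))
print("support vectors (indices):", support)
print("w_1:", w1_hat)
```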

SVM solution: summary

• We can find the optimal setting of the Lagrange multipliers α_i by
  maximizing

      Σ_{i=1}^n α_i − (1/2) Σ_{i,j=1}^n y_i y_j α_i α_j (x_i^T x_j)

  subject to α_i ≥ 0 and Σ_i α_i y_i = 0. Only the α_i’s corresponding to
  “support vectors” will be non-zero.
• We can make predictions on any new example x according to the sign of
  the discriminant function

      ŵ_0 + x^T ŵ_1 = ŵ_0 + x^T ( Σ_{i=1}^n α̂_i y_i x_i ) = ŵ_0 + Σ_{i∈SV} α̂_i y_i (x^T x_i)
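(Added illustration, assuming scikit-learn.) The sketch below evaluates this discriminant function by hand from a fitted SVC: sklearn stores the products α̂_i y_i for the support vectors in dual_coef_, so ŵ_0 + Σ_{i∈SV} α̂_i y_i (x^T x_i) can be computed directly and compared against decision_function. The data set and test point are made up.

```python
# Evaluating  w_0 + sum_{i in SV} alpha_i y_i (x^T x_i)  by hand.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal([-2, 0], 0.6, (25, 2)),
               rng.normal([+2, 0], 0.6, (25, 2))])
y = np.array([-1] * 25 + [+1] * 25)

clf = SVC(kernel="linear", C=10.0).fit(X, y)

x_new = np.array([[0.5, -0.3]])                       # hypothetical test point
# dual_coef_ holds alpha_i * y_i for the support vectors
manual = (clf.dual_coef_ @ (clf.support_vectors_ @ x_new.T) + clf.intercept_).ravel()
print("manual discriminant:        ", manual)
print("sklearn decision_function:  ", clf.decision_function(x_new))
print("predicted label:            ", np.sign(manual))
```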

Non-linear classifier

• So far our classifier can make only linear separations.
• As with linear regression and logistic regression models, we can easily
  obtain a non-linear classifier by first mapping our examples x = [x_1 x_2]
  into longer feature vectors

      φ(x) = [ x_1²  x_2²  √2 x_1 x_2  √2 x_1  √2 x_2  1 ]

  and then applying the linear classifier to the new feature vectors φ(x).

  (Figures: a linear separator in the feature φ-space corresponds to a
  non-linear separator in the original x-space.)
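(Added illustration, not lecture code; assumes scikit-learn.) A brief sketch of this recipe: map two-dimensional inputs through φ and train an ordinary linear SVM on the mapped features. The separator is linear in φ-space but non-linear in the original x-space. The data set, with a roughly circular class boundary, is made up.

```python
# Train a linear SVM on the explicit feature map phi(x).
import numpy as np
from sklearn.svm import SVC

def phi(X):
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1**2, x2**2,
                            np.sqrt(2) * x1 * x2,
                            np.sqrt(2) * x1,
                            np.sqrt(2) * x2,
                            np.ones_like(x1)])

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))
y = np.where(np.sum(X**2, axis=1) < 1.5, +1, -1)      # non-linearly separable in x-space

linear_in_phi = SVC(kernel="linear").fit(phi(X), y)   # linear separator in phi-space
print("training accuracy:", linear_in_phi.score(phi(X), y))
```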

Feature mapping and kernels

• Let’s look at the previous example in a bit more detail:

      x → φ(x) = [ x_1²  x_2²  √2 x_1 x_2  √2 x_1  √2 x_2  1 ]

• The SVM classifier deals only with inner products of examples (or feature
  vectors). In this example,

      φ(x)^T φ(x′) = x_1² x_1′² + x_2² x_2′² + 2 x_1 x_2 x_1′ x_2′ + 2 x_1 x_1′ + 2 x_2 x_2′ + 1
                   = ( 1 + x_1 x_1′ + x_2 x_2′ )²
                   = ( 1 + x^T x′ )²

  so the inner products can be evaluated without ever explicitly
  constructing the feature vectors φ(x)!
• K(x, x′) = ( 1 + x^T x′ )² is a kernel function (an inner product in the
  feature space).

Examples of kernel functions

• Linear kernel

      K(x, x′) = x^T x′

• Polynomial kernel

      K(x, x′) = ( 1 + x^T x′ )^p

  where p = 2, 3, .... To get the feature vectors we concatenate all
  polynomial terms of the components of x up to order p (weighted
  appropriately).
• Radial basis kernel

      K(x, x′) = exp( − ½ ‖x − x′‖² )

  In this case the feature space is an infinite-dimensional function space
  (use of the kernel results in a non-parametric classifier).
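(Added check, my own code.) The sketch below implements these three kernels and verifies numerically that, for the degree-2 feature map above, φ(x)^T φ(x′) equals (1 + x^T x′)² without ever constructing φ explicitly.

```python
# The kernels from the slide, plus a check of the feature-map identity.
import numpy as np

def linear_kernel(x, xp):
    return x @ xp

def polynomial_kernel(x, xp, p=2):
    return (1 + x @ xp) ** p

def rbf_kernel(x, xp):
    return np.exp(-0.5 * np.sum((x - xp) ** 2))

def phi(x):
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2)*x1*x2, np.sqrt(2)*x1, np.sqrt(2)*x2, 1.0])

x, xp = np.array([0.7, -1.2]), np.array([2.0, 0.5])
print(phi(x) @ phi(xp))             # explicit inner product in feature space
print(polynomial_kernel(x, xp, 2))  # same value, without constructing phi
```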


SVM examples

  (Figure: decision boundaries on the same two-dimensional data set for SVMs
  with a linear kernel, a 2nd order polynomial kernel, a 4th order polynomial
  kernel, and an 8th order polynomial kernel.)
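(Added sketch; the lecture's actual data set is not available, so the data below are made up.) The following code fits the same four kernels with scikit-learn; setting gamma=1 and coef0=1 makes sklearn's polynomial kernel match the (1 + x^T x′)^p form used in the lecture.

```python
# Fit SVMs with the four kernels shown in the figure on a made-up 2-D data set.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.uniform(-1.5, 2.0, size=(100, 2))
y = np.where(X[:, 1] > np.sin(2 * X[:, 0]), +1, -1)   # hypothetical non-linear boundary

for name, clf in [("linear", SVC(kernel="linear")),
                  ("2nd order polynomial", SVC(kernel="poly", degree=2, gamma=1.0, coef0=1)),
                  ("4th order polynomial", SVC(kernel="poly", degree=4, gamma=1.0, coef0=1)),
                  ("8th order polynomial", SVC(kernel="poly", degree=8, gamma=1.0, coef0=1))]:
    clf.fit(X, y)
    print(f"{name}: training accuracy {clf.score(X, y):.2f}, "
          f"{len(clf.support_)} support vectors")
```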
