
MIMA Group


Chapter 9
Support Vector Machines

Xin-Shun Xu @ SDU School of Computer Science and Technology, Shandong University


Learning Machines

A machine to learn the mapping x_i ↦ y_i is defined as a family of functions

x ↦ f(x, α)

Learning is done by adjusting the parameter α.


Generalization vs. Learning

- How does a machine learn?
  - By adjusting its parameters so as to partition the pattern (feature) space for classification.
  - How to adjust? Traditional approaches minimize the empirical risk.
- What does the machine learn?
  - Does it memorize the patterns it sees, or the rules it finds for the different classes?
  - What does the machine actually learn if it minimizes the empirical risk only?


Risks

Expected risk (test error):

R(α) = ∫ ½ |y − f(x, α)| dP(x, y)

Empirical risk (training error):

R_emp(α) = (1 / 2l) Σ_{i=1}^{l} |y_i − f(x_i, α)|

Is R(α) ≈ R_emp(α)?
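
As a quick illustration (not part of the original slides), the empirical risk of a ±1-valued classifier reduces to the training error rate. A minimal numpy sketch with made-up labels and predictions:

```python
import numpy as np

# Hypothetical labels and predictions in {-1, +1}
y_true = np.array([+1, -1, +1, +1, -1])
y_pred = np.array([+1, -1, -1, +1, +1])

# Empirical risk: R_emp = (1 / (2l)) * sum(|y_i - f(x_i)|)
# Each mistake contributes |y - f| = 2, so this equals the error rate.
l = len(y_true)
R_emp = np.sum(np.abs(y_true - y_pred)) / (2 * l)
print(R_emp)  # 0.4 -- two mistakes out of five samples
```
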
More on Empirical Risk

- How can we make the empirical risk arbitrarily small?
  - Let the machine have a very large memorization capacity.
- Does a machine with small empirical risk also get small expected risk?
- How do we keep the machine from merely memorizing the training patterns instead of generalizing?
- How should we control the memorization capacity of a machine?
- What should the new criterion be?


Structural Risk Minimization

Goal: learn both the right 'structure' and the right 'rules' for classification.

Right structure: e.g., the right amount and the right forms of components or parameters participating in the learning machine.

Right rules: the empirical risk will also be reduced if the right rules are learned.


New Criterion

Total risk = empirical risk + risk due to the structure of the learning machine


The VC Dimension

- Consider a set of functions f(x, α) ∈ {−1, 1}.
- A given set of l points can be labeled in 2^l ways.
- If a member of the set {f(α)} can be found which correctly assigns the labels for every such labeling, then the set of points is shattered by that set of functions.
- The VC dimension of {f(α)} is the maximum number of training points that can be shattered by {f(α)}.

VC: Vapnik–Chervonenkis
The VC Dimension for Oriented Lines in R²

VC dimension = 3: three points in general position can be shattered by oriented lines, but no set of four points can.


More on VC Dimension

- In general, the VC dimension of the set of oriented hyperplanes in Rⁿ is n + 1.
- The VC dimension is a measure of memorization capability.
- The VC dimension is not directly related to the number of parameters: Vapnik (1995) gives an example with one parameter and infinite VC dimension.


Bound on Expected Risk

Expected risk:  R(α) = ∫ ½ |y − f(x, α)| dP(x, y)
Empirical risk: R_emp(α) = (1 / 2l) Σ_{i=1}^{l} |y_i − f(x_i, α)|

With probability at least 1 − η,

R(α) ≤ R_emp(α) + √( ( h (log(2l/h) + 1) − log(η/4) ) / l ),

where h is the VC dimension and l is the number of samples. The square-root term is called the VC confidence.


Bound on Expected Risk

Consider a small η (e.g., η = 0.05). Then with probability at least 1 − η,

R(α) ≤ R_emp(α) + √( ( h (log(2l/h) + 1) − log(η/4) ) / l ).

Traditional approaches minimize the empirical risk only; structural risk minimization wants to minimize the whole bound (empirical risk + VC confidence).


VC Confidence

R(α) ≤ R_emp(α) + √( ( h (log(2l/h) + 1) − log(η/4) ) / l )

Amongst machines with zero empirical risk, choose the one with the smallest VC dimension. But how do we evaluate the VC dimension?

(Figure: the VC confidence for η = 0.05 and l = 10,000.)
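
The VC confidence term can be evaluated directly. A small sketch (assuming the natural logarithm in the bound, which the slide leaves implicit) for η = 0.05 and l = 10,000:

```python
import numpy as np

def vc_confidence(h, l, eta=0.05):
    """Square-root term of the VC bound; natural log assumed."""
    return np.sqrt((h * (np.log(2 * l / h) + 1) - np.log(eta / 4)) / l)

l = 10_000
for h in [10, 100, 1_000, 5_000]:
    print(h, round(vc_confidence(h, l), 3))
# The confidence term grows with the VC dimension h, so a machine with
# small h gives a tighter bound on the expected risk.
```
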
Structural Risk Minimization

Consider a nested subset of function classes with VC dimensions h1 ≤ h2 ≤ h3 ≤ h4.




Linear SVM

The linear separability. (Figures: a linearly separable dataset and a dataset that is not linearly separable.)


Linear SVM

The linear separability: given a linearly separable dataset, how would you classify these points using a linear discriminant function in order to minimize the error rate? (Figure: a linearly separable dataset.)


Maximum Margin Classifier

y_i(w·x_i + b) − 1 ≥ 0 ∀i

The linear discriminant function (classifier) with the maximum margin is the best. The margin is defined as the width by which the boundary could be increased before hitting a data point; the points it hits are the supporters (support vectors).

Why is it the best? Intuitively, it is robust to outliers and thus has strong generalization ability.
Relation Between VC Dimension and Margin

What is the relation between the margin width and the VC dimension?

Let x belong to a sphere of radius R. The set of Δ-margin separating hyperplanes has VC dimension h bounded by

h ≤ min(R²/Δ², d) + 1,

where d is the dimension of x.

What does this mean? A larger margin Δ means a smaller VC dimension, and hence a smaller VC confidence term in the bound on the expected risk.
Linear SVM

The linear separability: the data are linearly separable if there exist w, b such that

w·x_i + b ≥ +1 for y_i = +1
w·x_i + b ≤ −1 for y_i = −1

i.e., y_i(w·x_i + b) − 1 ≥ 0 ∀i.

The hyperplane w·x + b = 0 is the decision boundary; w·x + b = +1 and w·x + b = −1 are the margin hyperplanes.


Margin Width

y_i(w·x_i + b) − 1 ≥ 0 ∀i

The margin hyperplanes w·x + b = +1 and w·x + b = −1 lie at distances |1 − b| / ||w|| and |1 + b| / ||w|| from the origin, so the margin width is

d = 2 / ||w||

How about maximizing the margin?
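
To make the margin formula concrete, here is a minimal sketch assuming scikit-learn (not mentioned in the slides): fit a linear SVM with a very large C (approximating the hard margin) on made-up separable data and report 2 / ||w||:

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data (made up for illustration)
X = np.array([[1.0, 1.0], [2.0, 2.5], [1.5, 2.0],
              [4.0, 4.5], [5.0, 5.0], [4.5, 4.0]])
y = np.array([-1, -1, -1, +1, +1, +1])

# Large C approximates the hard-margin SVM
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]
margin = 2.0 / np.linalg.norm(w)   # margin width = 2 / ||w||
print(w, clf.intercept_[0], margin)
```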


Building SVM

Minimize   ½ ||w||²
Subject to y_i(wᵀx_i + b) − 1 ≥ 0 ∀i

(Maximizing the margin 2 / ||w|| is equivalent to minimizing ½ ||w||².) This requires knowledge of the Lagrange multiplier method.


The Method of Lagrange

Minimize   ½ ||w||²
Subject to y_i(wᵀx_i + b) − 1 ≥ 0 ∀i

The Lagrangian:

L(w, b; α) = ½ ||w||² − Σ_{i=1}^{l} α_i [ y_i(wᵀx_i + b) − 1 ],   α_i ≥ 0

Minimize it w.r.t. w and b, while maximizing it w.r.t. α.


The Method of Lagrange

Why Lagrange?
- The constraints will be replaced by constraints on the Lagrange multipliers, which are much easier to handle.
- In this reformulation of the problem, the training data appear only in the form of dot products between vectors.


The Method of Lagrange

Minimize   ½ ||w||²
Subject to y_i(wᵀx_i + b) − 1 ≥ 0 ∀i

The Lagrangian:

L(w, b; α) = ½ ||w||² − Σ_{i=1}^{l} α_i [ y_i(wᵀx_i + b) − 1 ],   α_i ≥ 0

What happens if the constraint term y_i(wᵀx_i + b) − 1 is zero? And what value should α_i take when the constraint is satisfied with strict inequality? Minimize L w.r.t. w and b, while maximizing it w.r.t. α.


The Method of Lagrange

Minimize   ½ ||w||²
Subject to y_i(wᵀx_i + b) − 1 ≥ 0 ∀i

The Lagrangian:

L(w, b; α) = ½ ||w||² − Σ_{i=1}^{l} α_i [ y_i(wᵀx_i + b) − 1 ],   α_i ≥ 0
           = ½ ||w||² − Σ_{i=1}^{l} α_i y_i(wᵀx_i + b) + Σ_{i=1}^{l} α_i


Duality

L(w, b; α) = ½ ||w||² − Σ_i α_i y_i(wᵀx_i + b) + Σ_i α_i

Primal: Minimize ½ ||w||² subject to y_i(wᵀx_i + b) − 1 ≥ 0 ∀i.

Dual:
Maximize   L(w*, b*; α)
Subject to ∇_{w,b} L(w, b; α) = 0
           α_i ≥ 0, i = 1, …, l


Duality

L(w, b; α) = ½ ||w||² − Σ_i α_i y_i(wᵀx_i + b) + Σ_i α_i

∇_w L(w, b; α) = w − Σ_{i=1}^{l} α_i y_i x_i = 0   ⇒   w* = Σ_{i=1}^{l} α_i y_i x_i
∂L/∂b = −Σ_{i=1}^{l} α_i y_i = 0   ⇒   Σ_{i=1}^{l} α_i y_i = 0

Maximize   L(w*, b*; α)
Subject to ∇_{w,b} L(w, b; α) = 0
           α_i ≥ 0, i = 1, …, l


Duality

Substituting w* = Σ_i α_i y_i x_i and Σ_i α_i y_i = 0 back into the Lagrangian:

L(w*, b*; α) = ½ (Σ_i α_i y_i x_i)ᵀ(Σ_j α_j y_j x_j) − (Σ_i α_i y_i x_i)ᵀ(Σ_j α_j y_j x_j) − b Σ_i α_i y_i + Σ_i α_i
             = Σ_{i=1}^{l} α_i − ½ Σ_{i=1}^{l} Σ_{j=1}^{l} α_i α_j y_i y_j ⟨x_i, x_j⟩
             = αᵀ1 − ½ αᵀDα  =: F(α),   where D_ij = y_i y_j ⟨x_i, x_j⟩

Maximize F(α) subject to αᵀy = 0 and α ≥ 0.


Duality

The Primal:
Minimize   ½ ||w||²
Subject to y_i(wᵀx_i + b) − 1 ≥ 0 ∀i

The Dual:
Maximize   F(α) = αᵀ1 − ½ αᵀDα
Subject to αᵀy = 0
           α ≥ 0
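
For illustration only (the slides do not prescribe a solver), the hard-margin dual can be solved with a general-purpose optimizer; the data below are made up, and a dedicated QP package would normally be used:

```python
import numpy as np
from scipy.optimize import minimize

# Made-up separable data
X = np.array([[1.0, 1.0], [2.0, 2.0], [4.0, 4.0], [5.0, 5.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
l = len(y)

# D_ij = y_i y_j <x_i, x_j>
D = (y[:, None] * X) @ (y[:, None] * X).T

# Dual: maximize 1'a - 0.5 a'Da  <=>  minimize 0.5 a'Da - 1'a
obj = lambda a: 0.5 * a @ D @ a - a.sum()
cons = [{"type": "eq", "fun": lambda a: a @ y}]   # a'y = 0
bnds = [(0, None)] * l                            # a >= 0

res = minimize(obj, np.zeros(l), bounds=bnds, constraints=cons)
alpha = res.x
w = (alpha * y) @ X                 # w* = sum_i alpha_i y_i x_i
sv = np.argmax(alpha)               # any index with alpha_i > 0
b = y[sv] - w @ X[sv]               # b* = y_i - w'x_i
print(alpha.round(3), w, b)         # for this toy set, w* ~ [0.5, 0.5], b* ~ -3
```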


The Solution

Find α* by quadratic programming on the dual:
Maximize   F(α) = αᵀ1 − ½ αᵀDα
Subject to αᵀy = 0, α ≥ 0

Then  w* = Σ_{i=1}^{l} α_i* y_i x_i.   And b* = ?


The Solution

Find α* by quadratic programming. Call x_i a support vector if α_i* > 0.

w* = Σ_{i=1}^{l} α_i* y_i x_i
b* = y_i − w*ᵀx_i   for any i with α_i* > 0

This follows from the Karush-Kuhn-Tucker conditions applied to the Lagrangian

L(w, b; α) = ½ ||w||² − Σ_{i=1}^{l} α_i [ y_i(wᵀx_i + b) − 1 ]


The Karush-Kuhn-Tucker Conditions

L(w, b; α) = ½ ||w||² − Σ_{i=1}^{l} α_i [ y_i(wᵀx_i + b) − 1 ]

∇_w L(w, b; α) = w − Σ_{i=1}^{l} α_i y_i x_i = 0
∂L/∂b = −Σ_{i=1}^{l} α_i y_i = 0
y_i(wᵀx_i + b) − 1 ≥ 0,   i = 1, …, l
α_i ≥ 0,   i = 1, …, l
α_i [ y_i(wᵀx_i + b) − 1 ] = 0,   i = 1, …, l
Classification

w* = Σ_{i=1}^{l} α_i* y_i x_i

f(x) = sgn( w*ᵀx + b* )
     = sgn( Σ_{i=1}^{l} α_i* y_i ⟨x_i, x⟩ + b* )
     = sgn( Σ_{α_i* > 0} α_i* y_i ⟨x_i, x⟩ + b* )


Classification Using Supporters

f(x) = sgn( Σ_{α_i* > 0} α_i* y_i ⟨x_i, x⟩ + b* )

Here α_i* y_i is the weight of the i-th support vector, ⟨x_i, x⟩ is the similarity measure between the input and the i-th support vector, and b* is the bias.
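
A small sketch, assuming scikit-learn, showing that the decision value really is built from the supporters alone: `dual_coef_` stores α_i* y_i for each support vector:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 2.0], [4.0, 4.0], [5.0, 5.0]])
y = np.array([-1, -1, 1, 1])
clf = SVC(kernel="linear", C=1e6).fit(X, y)

x_new = np.array([3.5, 3.0])
# dual_coef_[0, i] = alpha_i * y_i for the i-th support vector
score = clf.dual_coef_[0] @ (clf.support_vectors_ @ x_new) + clf.intercept_[0]
print(np.sign(score), clf.predict([x_new])[0])   # the two agree
```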


Linear SVM

The non-separable case: introduce slack variables ξ_i and require

w·x_i + b ≥ +1 − ξ_i   for y_i = +1
w·x_i + b ≤ −1 + ξ_i   for y_i = −1

i.e., y_i(w·x_i + b) − 1 + ξ_i ≥ 0 and ξ_i ≥ 0 ∀i.


Mathematical Formulation

Minimize   ½ ||w||² + C Σ_i ξ_i^k
Subject to y_i(wᵀx_i + b) − 1 + ξ_i ≥ 0 ∀i
           ξ_i ≥ 0 ∀i

For simplicity, we consider k = 1.




The Lagrangian

Minimize   ½ ||w||² + C Σ_i ξ_i
Subject to y_i(wᵀx_i + b) − 1 + ξ_i ≥ 0 ∀i,   ξ_i ≥ 0 ∀i

L(w, b, ξ; α, μ) = ½ ||w||² + C Σ_i ξ_i − Σ_i α_i [ y_i(wᵀx_i + b) − 1 + ξ_i ] − Σ_i μ_i ξ_i,
α_i ≥ 0, μ_i ≥ 0


Duality

L(w, b, ξ; α, μ) = ½ ||w||² + C Σ_i ξ_i − Σ_i α_i [ y_i(wᵀx_i + b) − 1 + ξ_i ] − Σ_i μ_i ξ_i,   α_i ≥ 0, μ_i ≥ 0

Primal: Minimize ½ ||w||² + C Σ_i ξ_i subject to y_i(wᵀx_i + b) − 1 + ξ_i ≥ 0 and ξ_i ≥ 0 ∀i.

Dual:
Maximize   L(w*, b*, ξ*; α, μ)
Subject to ∇_{w,b,ξ} L(w, b, ξ; α, μ) = 0
           α ≥ 0, μ ≥ 0
Duality

∇_w L = w − Σ_i α_i y_i x_i = 0   ⇒   w* = Σ_i α_i y_i x_i
∂L/∂b = −Σ_i α_i y_i = 0   ⇒   Σ_i α_i y_i = 0
∂L/∂ξ_i = C − α_i − μ_i = 0   ⇒   μ_i = C − α_i   ⇒   0 ≤ α_i ≤ C   (since α_i ≥ 0 and μ_i ≥ 0)

Maximize   L(w*, b*, ξ*; α, μ)
Subject to ∇_{w,b,ξ} L(w, b, ξ; α, μ) = 0,   α ≥ 0, μ ≥ 0
Duality

Substituting the stationarity conditions back into the Lagrangian, the slack and μ terms cancel:

F(α) = L(w*, b*, ξ*; α, μ) = Σ_i α_i − ½ Σ_i Σ_j α_i α_j y_i y_j ⟨x_i, x_j⟩ = αᵀ1 − ½ αᵀDα

Maximize F(α) subject to αᵀy = 0 and 0 ≤ α ≤ C.
Duality

The Primal:
Minimize   ½ ||w||² + C Σ_i ξ_i^k
Subject to y_i(wᵀx_i + b) − 1 + ξ_i ≥ 0 ∀i
           ξ_i ≥ 0 ∀i

The Dual:
Maximize   F(α) = αᵀ1 − ½ αᵀDα
Subject to αᵀy = 0
           0 ≤ α ≤ C·1


The Karush-Kuhn-Tucker Conditions

L(w, b, ξ; α, μ) = ½ ||w||² + C Σ_i ξ_i − Σ_i α_i [ y_i(wᵀx_i + b) − 1 + ξ_i ] − Σ_i μ_i ξ_i

∇_w L = w − Σ_i α_i y_i x_i = 0
∂L/∂b = −Σ_i α_i y_i = 0
∂L/∂ξ_i = C − α_i − μ_i = 0
y_i(wᵀx_i + b) − 1 + ξ_i ≥ 0
ξ_i ≥ 0
α_i ≥ 0
μ_i ≥ 0
α_i [ y_i(wᵀx_i + b) − 1 + ξ_i ] = 0
μ_i ξ_i = 0
The Solution

Find α* by quadratic programming on the dual:
Maximize   F(α) = αᵀ1 − ½ αᵀDα
Subject to αᵀy = 0, 0 ≤ α ≤ C·1

Then  w* = Σ_{i=1}^{l} α_i* y_i x_i.   And b* = ?   ξ = ?


The Solution

Find α* by quadratic programming. Call x_i a support vector if 0 < α_i* < C.

w* = Σ_{i=1}^{l} α_i* y_i x_i
b* = y_i − w*ᵀx_i   for any i with 0 < α_i* < C

This follows from the KKT conditions on the Lagrangian

L(w, b, ξ; α, μ) = ½ ||w||² + C Σ_i ξ_i − Σ_i α_i [ y_i(wᵀx_i + b) − 1 + ξ_i ] − Σ_i μ_i ξ_i


The Solution

Find α* by quadratic programming. Call x_i a support vector if 0 < α_i* < C.

w* = Σ_{i=1}^{l} α_i* y_i x_i
b* = y_i − w*ᵀx_i   for any i with 0 < α_i* < C
ξ_i = max( 0, 1 − y_i(w*·x_i + b*) )

The i-th pattern is misclassified if ξ_i > 1.
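
A minimal sketch, assuming scikit-learn and made-up overlapping data, that recovers the slack values ξ_i = max(0, 1 − y_i(w*·x_i + b*)) from a fitted soft-margin SVM:

```python
import numpy as np
from sklearn.svm import SVC

# Overlapping classes so that some slacks are nonzero (made-up data)
X = np.array([[1.0, 1.0], [2.0, 1.5], [3.2, 3.0],   # class -1
              [2.8, 2.5], [4.0, 4.0], [5.0, 4.5]])  # class +1
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)
scores = clf.decision_function(X)          # w*.x_i + b*
xi = np.maximum(0.0, 1.0 - y * scores)     # slack variables
print(xi.round(3))
print("misclassified:", np.where(xi > 1)[0])  # xi > 1  ->  wrong side
```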


Classification

The classification function is the same as in the separable case:

w* = Σ_{i=1}^{l} α_i* y_i x_i

f(x) = sgn( w*ᵀx + b* )
     = sgn( Σ_{i=1}^{l} α_i* y_i ⟨x_i, x⟩ + b* )
     = sgn( Σ_{α_i* > 0} α_i* y_i ⟨x_i, x⟩ + b* )

As before, α_i* y_i is the weight of the i-th support vector, ⟨x_i, x⟩ measures the similarity between the input and that support vector, and b* is the bias.


Lagrange Multiplier

The Lagrange multiplier method is a powerful tool for constrained optimization. It was introduced by Joseph-Louis Lagrange.




Milkmaid Problem

The milkmaid stands at (x1, y1) and the cow at (x2, y2); the milkmaid must first walk to the river, whose bank is the curve g(x, y) = 0, touching it at some point (x, y) before going to the cow. The total walking distance is

f(x, y) = Σ_{i=1}^{2} √( (x − x_i)² + (y − y_i)² )

Goal:
Minimize   f(x, y)
Subject to g(x, y) = 0
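
A numeric sketch of the milkmaid problem with assumed positions and an assumed straight river y = 5 (none of these numbers come from the slides), using scipy's constrained optimizer:

```python
import numpy as np
from scipy.optimize import minimize

p1 = np.array([1.0, 1.0])   # milkmaid (assumed position)
p2 = np.array([6.0, 2.0])   # cow (assumed position)

# Total walking distance to a river point (x, y)
f = lambda p: np.linalg.norm(p - p1) + np.linalg.norm(p - p2)
# River modelled as the line y - 5 = 0 (an assumption for illustration)
g = {"type": "eq", "fun": lambda p: p[1] - 5.0}

res = minimize(f, x0=[3.0, 5.0], constraints=[g])
print(res.x, res.fun)
# The classical reflection argument gives the same point: reflect p2
# across y = 5 and intersect the segment p1 -> reflected p2 with the river.
```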


Observation

f(x, y) = Σ_{i=1}^{2} √( (x − x_i)² + (y − y_i)² )

Goal: Minimize f(x, y) subject to g(x, y) = 0.

At the extreme point (x*, y*), the gradients are parallel:

∇f(x*, y*) = λ ∇g(x*, y*)


Optimization with Equality Constraints

Goal: Min/Max f(x)
Subject to g(x) = 0

Lemma: At an extreme point x*, we have

∇f(x*) = λ ∇g(x*)   if ∇g(x*) ≠ 0


Proof

Claim: ∇f(x*) = λ ∇g(x*) if ∇g(x*) ≠ 0.

Let x* be an extreme point, and let r(t) be any differentiable path on the surface g(x) = 0 such that r(t0) = x*. Then r′(t0) is a vector tangent to the surface g(x) = 0 at x*.

Since x* is an extreme point of f along any such path,

0 = d/dt f(r(t)) |_{t=t0} = ∇f(r(t0)) · r′(t0) = ∇f(x*) · r′(t0)
Proof

∇f(x*) · r′(t0) = 0 holds for every path r through x* on the surface g(x) = 0. It follows that ∇f(x*) ⊥ Π, where Π is the tangent plane of the surface g(x) = 0 at x*. Since ∇g(x*) is also normal to Π, the two gradients are parallel:

∇f(x*) = λ ∇g(x*)   if ∇g(x*) ≠ 0
Optimization with Equality Constraints

Goal: Min/Max f(x)
Subject to g(x) = 0

Lemma: At an extreme point x*, we have

∇f(x*) = λ ∇g(x*)   if ∇g(x*) ≠ 0,

where λ is the Lagrange multiplier.


The Method of Lagrange

Let x have dimension n.

Goal: Min/Max f(x)
Subject to g(x) = 0

Find the extreme points by solving the following equations:

∇f(x) = λ ∇g(x)
g(x) = 0

This is a system of n + 1 equations in n + 1 variables (x and λ).
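
A tiny worked example (not from the slides): minimize f(x, y) = x² + y² subject to x + y − 1 = 0 by solving the n + 1 Lagrange equations symbolically with sympy:

```python
import sympy as sp

x, y, lam = sp.symbols("x y lambda", real=True)
f = x**2 + y**2
g = x + y - 1

# n + 1 equations (here 3) in n + 1 unknowns (x, y, lambda):
#   grad f = lambda * grad g,   g = 0
eqs = [sp.diff(f, x) - lam * sp.diff(g, x),
       sp.diff(f, y) - lam * sp.diff(g, y),
       g]
print(sp.solve(eqs, [x, y, lam]))   # [(1/2, 1/2, 1)]
```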


Lagrangian

Constrained optimization:
Goal: Min/Max f(x)
Subject to g(x) = 0

Define the Lagrangian  L(x; λ) = f(x) + λ g(x).

Then solve the unconstrained problem

∇_x L(x; λ) = 0
∂L/∂λ = 0
Optimization with Multiple Equality Constraints

λ = (λ1, …, λm)ᵀ

Min/Max f(x)
Subject to g_i(x) = 0, i = 1, …, m

Define the Lagrangian  L(x; λ) = f(x) + Σ_{i=1}^{m} λ_i g_i(x).

Solve ∇_{x,λ} L(x; λ) = 0.


Optimization with Inequality Constraints

Minimize   f(x)
Subject to g_i(x) = 0, i = 1, …, m
           h_j(x) ≤ 0, j = 1, …, n

You can always reformulate your problem into the above form.


Lagrange Multipliers

λ = (λ1, …, λm)ᵀ,   μ = (μ1, …, μn)ᵀ,   μ_j ≥ 0

Minimize   f(x)
Subject to g_i(x) = 0, i = 1, …, m   (equal to 0 for feasible solutions)
           h_j(x) ≤ 0, j = 1, …, n   (non-positive for feasible solutions)

Lagrangian:

L(x; λ, μ) = f(x) + Σ_{i=1}^{m} λ_i g_i(x) + Σ_{j=1}^{n} μ_j h_j(x)
Duality

L(x; λ, μ) = f(x) + Σ_{i=1}^{m} λ_i g_i(x) + Σ_{j=1}^{n} μ_j h_j(x)

Minimize the Lagrangian w.r.t. x, while maximizing it w.r.t. λ and μ.

Let x* be a local extreme, so that 0 = ∇_x L(x; λ, μ) |_{x=x*}, and define D(λ, μ) = L(x*; λ, μ). Then

D(λ, μ) = f(x*) + Σ_i λ_i g_i(x*) + Σ_j μ_j h_j(x*) ≤ f(x*),

and we maximize D w.r.t. λ and μ.
Saddle Point Determination

L(x; λ, μ) = f(x) + Σ_{i=1}^{m} λ_i g_i(x) + Σ_{j=1}^{n} μ_j h_j(x)

The primal:
Minimize   f(x)
Subject to g_i(x) = 0, i = 1, …, m
           h_j(x) ≤ 0, j = 1, …, n

The dual:
Maximize   L(x*; λ, μ)
Subject to ∇_x L(x; λ, μ) = 0
           μ ≥ 0
The Karush-Kuhn-Tucker Conditions

L(x; λ, μ) = f(x) + Σ_{i=1}^{m} λ_i g_i(x) + Σ_{j=1}^{n} μ_j h_j(x)

∇_x L(x; λ, μ) = ∇_x f(x) + Σ_{i=1}^{m} λ_i ∇_x g_i(x) + Σ_{j=1}^{n} μ_j ∇_x h_j(x) = 0
g_i(x) = 0,   i = 1, …, m
h_j(x) ≤ 0,   j = 1, …, n
μ_j ≥ 0,   j = 1, …, n
μ_j h_j(x) = 0,   j = 1, …, n


Non-linear SVMs

- Datasets that are linearly separable with some noise work out great. (Figure: points on a line, separable by a single threshold.)
- But what are we going to do if the dataset is just too hard? (Figure: points on a line that no single threshold separates.)
- How about mapping the data to a higher-dimensional space, e.g. x ↦ (x, x²)? (Figure: the same points become linearly separable in the (x, x²) plane.)


Non-linear SVMs: Feature Space

General idea: the original input space can be mapped to some higher-dimensional feature space where the training set is separable:

Φ: x → φ(x)


Nonlinear SVMs: The Kernel Trick

With this mapping, our discriminant function is now

g(x) = wᵀφ(x) + b = Σ_{x_i ∈ SV} α_i y_i φ(x_i)ᵀφ(x) + b

- There is no need to know this mapping explicitly, because we only use the dot product of feature vectors, in both training and testing.
- A kernel function is defined as a function that corresponds to a dot product of two feature vectors in some expanded feature space:

K(x_i, x_j) = φ(x_i)ᵀφ(x_j)


Nonlinear SVMs: The Kernel Trick

An example: 2-dimensional vectors x = [x1, x2]; let K(x_i, x_j) = (1 + x_iᵀx_j)². We need to show that K(x_i, x_j) = φ(x_i)ᵀφ(x_j):

K(x_i, x_j) = (1 + x_iᵀx_j)²
            = 1 + x_i1² x_j1² + 2 x_i1 x_j1 x_i2 x_j2 + x_i2² x_j2² + 2 x_i1 x_j1 + 2 x_i2 x_j2
            = [1, x_i1², √2 x_i1 x_i2, x_i2², √2 x_i1, √2 x_i2] [1, x_j1², √2 x_j1 x_j2, x_j2², √2 x_j1, √2 x_j2]ᵀ
            = φ(x_i)ᵀφ(x_j),   where φ(x) = [1, x1², √2 x1 x2, x2², √2 x1, √2 x2]ᵀ
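
A quick numeric check of this identity (the input values are chosen arbitrarily):

```python
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel (1 + x.y)^2."""
    x1, x2 = x
    return np.array([1.0, x1*x1, np.sqrt(2)*x1*x2, x2*x2,
                     np.sqrt(2)*x1, np.sqrt(2)*x2])

xi = np.array([1.0, 2.0])
xj = np.array([3.0, 0.5])

K = (1.0 + xi @ xj) ** 2
print(K, phi(xi) @ phi(xj))   # both print 25.0
```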


Nonlinear SVMs: The Kernel Trick

Examples of commonly used kernel functions (implemented in the sketch below):

- Linear kernel: K(x_i, x_j) = x_iᵀx_j
- Polynomial kernel: K(x_i, x_j) = (1 + x_iᵀx_j)^p
- Gaussian (Radial Basis Function, RBF) kernel: K(x_i, x_j) = exp( −||x_i − x_j||² / (2σ²) )
- Sigmoid kernel: K(x_i, x_j) = tanh( β₀ x_iᵀx_j + β₁ )

In general, functions that satisfy Mercer's condition can be kernel functions.
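
The four kernels written out as plain functions; the parameter values are illustrative defaults, not recommendations from the slides:

```python
import numpy as np

def linear(xi, xj):
    return xi @ xj

def polynomial(xi, xj, p=3):
    return (1.0 + xi @ xj) ** p

def rbf(xi, xj, sigma=1.0):
    return np.exp(-np.linalg.norm(xi - xj) ** 2 / (2.0 * sigma ** 2))

def sigmoid(xi, xj, beta0=1.0, beta1=-1.0):
    return np.tanh(beta0 * (xi @ xj) + beta1)

xi, xj = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(linear(xi, xj), polynomial(xi, xj), rbf(xi, xj), sigmoid(xi, xj))
```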


Nonlinear SVM: Optimization

Formulation (Lagrangian dual problem):

Maximize   Σ_i α_i − ½ Σ_i Σ_j α_i α_j y_i y_j K(x_i, x_j)
Such that  0 ≤ α_i ≤ C
           Σ_i α_i y_i = 0

The solution of the discriminant function is

g(x) = wᵀφ(x) + b = Σ_{x_i ∈ SV} α_i y_i K(x_i, x) + b

The optimization technique is the same as in the linear case.


Support Vector Machine: Algorithm

1. Choose a kernel function.
2. Choose a value for C.
3. Solve the quadratic programming problem (many software packages are available).
4. Construct the discriminant function from the support vectors (see the sketch below).
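
A minimal end-to-end sketch of these four steps, assuming scikit-learn and a synthetic dataset (neither is prescribed by the slides):

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Steps 1-3: choose a kernel and C, let the QP solver do the rest
clf = SVC(kernel="rbf", C=1.0, gamma=1.0).fit(X_tr, y_tr)

# Step 4: the discriminant function is built from the support vectors
print("support vectors:", clf.support_vectors_.shape[0])
print("test accuracy:", clf.score(X_te, y_te))
```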


Other Issues

- Choice of kernel:
  - A Gaussian or polynomial kernel is the default.
  - If these are ineffective, more elaborate kernels are needed.
  - Domain experts can give assistance in formulating appropriate similarity measures.
- Choice of kernel parameters:
  - e.g. σ in the Gaussian kernel.
  - σ is roughly the distance between the closest points with different classifications.
  - In the absence of reliable criteria, applications rely on a validation set or cross-validation to set such parameters (see the sketch below).
- Optimization criterion, hard margin vs. soft margin:
  - typically a lengthy series of experiments in which various parameter settings are tested.
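
A sketch of parameter selection by cross-validation, assuming scikit-learn; the grid values are illustrative only:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.25, random_state=0)

# Grid values are illustrative, not a recommendation from the slides
param_grid = {"C": [0.1, 1, 10, 100],
              "gamma": [0.01, 0.1, 1, 10]}   # gamma = 1 / (2 sigma^2)
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5).fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```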


Comparison with Neural Networks

Neural Networks:
- Hidden layers map to lower-dimensional spaces
- Search space has multiple local minima
- Training is expensive
- Classification is extremely efficient
- Requires the number of hidden units and layers to be chosen
- Very good accuracy in typical domains

SVMs:
- Kernel maps to a very high-dimensional space
- Search space has a unique minimum
- Training is extremely efficient
- Classification is extremely efficient
- Kernel and cost are the two parameters to select
- Very good accuracy in typical domains
- Extremely robust


- UCI datasets: http://archive.ics.uci.edu/ml/datasets.html
  - Reuters-21578 Text Categorization Collection
  - Wine
  - Credit Approval
- Requirements:
  - Use different kernels (>= 3); see the sketch below for a starting point on the Wine data.
  - Choose the best values for the parameters.
  - You can also use a dimension reduction method, e.g. PCA.
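
A possible starting point for the Wine task, assuming scikit-learn (which ships a copy of the UCI Wine data) and three different kernels:

```python
from sklearn.datasets import load_wine          # sklearn's copy of the UCI Wine data
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)

for kernel in ["linear", "poly", "rbf"]:        # three different kernels
    clf = make_pipeline(StandardScaler(), SVC(kernel=kernel, C=1.0))
    scores = cross_val_score(clf, X, y, cv=5)
    print(kernel, round(scores.mean(), 3))
```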


- For the UCI datasets (http://archive.ics.uci.edu/ml/datasets.html) Musk (version 2) and Wine, classify them with three SVMs and compute the accuracy; each SVM must use a different kernel function. In addition, apply an ensemble learning method to combine different models: the ensemble may consist of SVMs with different kernels, and may also include neural networks, KNN, linear models, polynomial models, etc. Compare the results of the ensemble model with the SVM models and the other models.
- Requirement: submit your code and report before 24:00 on June 17.


MIMA Group

Any Questions?
