Chapter 9
Support Vector Machines
A learning machine is defined as a family of mappings x → f(x, α), where α denotes the adjustable parameters.
Training data give sample pairs x_i → y_i.
Learning means adjusting the parameter α.
Question: does a small empirical risk R_emp(α) imply a small expected risk R(α)?
More on Empirical Risk
Right structure: e.g., the right amount and the right forms of components or parameters participate in the learning machine.
Right rules: the empirical risk is also reduced when the right rules are learned.
[Figure: three points in the plane shattered by oriented lines; the VC dimension of linear classifiers in the plane is 3.]
With probability at least 1 − η,
  R(α) ≤ R_emp(α) + sqrt( [ h(ln(2l/h) + 1) − ln(η/4) ] / l )
where h is the VC dimension and l is the number of samples.
The square-root term is called the VC confidence.
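To see the magnitude of the VC confidence, here is a worked example with illustrative values h = 10, l = 1000, η = 0.05 (these numbers are my own, not from the slides):

```latex
% VC confidence with h = 10, l = 1000, eta = 0.05 (illustrative)
\sqrt{\frac{h\bigl(\ln(2l/h)+1\bigr)-\ln(\eta/4)}{l}}
  = \sqrt{\frac{10(\ln 200 + 1) - \ln 0.0125}{1000}}
  \approx \sqrt{\frac{62.98 + 4.38}{1000}}
  \approx 0.26
```

So with probability 0.95, the expected risk exceeds the empirical risk by at most about 0.26.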
Traditional approaches minimize the empirical risk R_emp(α) only; the VC confidence term is ignored.
[Figure: nested hypothesis classes with VC dimensions h1 ≤ h2 ≤ h3 ≤ h4; the bound trades empirical risk against VC confidence (structural risk minimization).]
Linearly separable case
[Figure: separating hyperplane w^T x + b = 0 with margin hyperplanes w^T x + b = +1 and w^T x + b = −1; d is the dimension of x.]
The linear discriminant function (classifier) with the maximum margin is the best.
Scale w and b so that
  w^T x_i + b ≥ +1 for y_i = +1
  w^T x_i + b ≤ −1 for y_i = −1
or, compactly, y_i(w^T x_i + b) − 1 ≥ 0 ∀i.
The two margin hyperplanes lie at distances |1 − b|/||w|| and |−1 − b|/||w|| from the origin, so the margin between them is
  2 / ||w||
Maximizing the margin therefore means minimizing ||w||.
The maximum-margin problem:
  Minimize (1/2)||w||^2
  Subject to y_i(w^T x_i + b) − 1 ≥ 0 ∀i
The Lagrangian:
  L(w, b; α) = (1/2)||w||^2 − Σ_{i=1}^{l} α_i [ y_i(w^T x_i + b) − 1 ],  α_i ≥ 0
Why Lagrange?
The constraints will be replaced by constraints on the
Lagrange multipliers, which will be much easier to
handle.
In this reformulation of the problem, the training data
will only appear in the form of dot products between
vectors.
Look again at the constrained problem and its Lagrangian:
  Minimize (1/2)||w||^2  subject to  y_i(w^T x_i + b) − 1 ≥ 0 ∀i
  L(w, b; α) = (1/2)||w||^2 − Σ_{i=1}^{l} α_i [ y_i(w^T x_i + b) − 1 ],  α_i ≥ 0
How about if a constraint y_i(w^T x_i + b) − 1 is zero? Then α_i may be nonzero.
What value should α_i take if the constraint is feasible and nonzero (strictly positive)? Zero; this is complementary slackness.
Expanding the Lagrangian:
  L(w, b; α) = (1/2)||w||^2 − Σ_{i=1}^{l} α_i y_i (w^T x_i + b) + Σ_{i=1}^{l} α_i
Setting the gradients of the Lagrangian to zero:
  ∇_w L(w, b; α) = w − Σ_{i=1}^{l} α_i y_i x_i = 0  ⇒  w* = Σ_{i=1}^{l} α_i y_i x_i
  ∂L(w, b; α)/∂b = −Σ_{i=1}^{l} α_i y_i = 0  ⇒  Σ_{i=1}^{l} α_i y_i = 0
Substituting w* back into the Lagrangian:
  L(w*, b*; α) = Σ_{i=1}^{l} α_i − (1/2) Σ_{i=1}^{l} Σ_{j=1}^{l} α_i α_j y_i y_j ⟨x_i, x_j⟩
Maximize this over α.
In matrix form, with y = (y_1, …, y_l)^T and D_ij = y_i y_j ⟨x_i, x_j⟩, the dual objective is
  F(α) = 1^T α − (1/2) α^T D α
to be maximized subject to α^T y = 0 and α ≥ 0.
The primal:
  Minimize (1/2)||w||^2
  Subject to y_i(w^T x_i + b) − 1 ≥ 0 ∀i
The dual:
  Maximize F(α) = 1^T α − (1/2) α^T D α
  Subject to α^T y = 0, α ≥ 0
This is a quadratic programming (QP) problem. Find α* with a QP solver; then
  w* = Σ_{i=1}^{l} α_i* y_i x_i
  b* = y_i − w*^T x_i for any i with α_i* > 0
x_i is called a support vector if α_i* > 0.
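A minimal sketch of this QP step, assuming the cvxopt package; the toy data and all variable names are illustrative, not from the slides:

```python
import numpy as np
from cvxopt import matrix, solvers

# Toy linearly separable data in R^2 (illustrative).
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
l = len(y)

# Dual: maximize 1'a - (1/2) a'Da with D_ij = y_i y_j <x_i, x_j>,
# i.e. minimize (1/2) a'Da - 1'a  subject to  a >= 0 and y'a = 0.
D = np.outer(y, y) * (X @ X.T)
sol = solvers.qp(P=matrix(D), q=matrix(-np.ones(l)),
                 G=matrix(-np.eye(l)), h=matrix(np.zeros(l)),  # -a <= 0
                 A=matrix(y.reshape(1, -1)), b=matrix(0.0))    # y'a = 0
alpha = np.ravel(sol['x'])

# w* = sum_i alpha_i y_i x_i;  b* = y_k - w*'x_k for a support vector k.
w_star = (alpha * y) @ X
k = int(np.argmax(alpha))          # some index with alpha_k > 0
b_star = y[k] - w_star @ X[k]
print(w_star, b_star)
```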
The Karush-Kuhn-Tucker Conditions
The Lagrangian:
  L(w, b; α) = (1/2)||w||^2 − Σ_{i=1}^{l} α_i [ y_i(w^T x_i + b) − 1 ]
The KKT conditions:
  ∇_w L(w, b; α) = w − Σ_{i=1}^{l} α_i y_i x_i = 0
  ∂L(w, b; α)/∂b = −Σ_{i=1}^{l} α_i y_i = 0
  y_i(w^T x_i + b) − 1 ≥ 0, i = 1, …, l
  α_i ≥ 0, i = 1, …, l
  α_i [ y_i(w^T x_i + b) − 1 ] = 0, i = 1, …, l (complementary slackness)
The classifier:
  f(x) = sgn( w*^T x + b* )
       = sgn( Σ_{i=1}^{l} α_i* y_i ⟨x_i, x⟩ + b* )
       = sgn( Σ_{α_i* > 0} α_i* y_i ⟨x_i, x⟩ + b* )
⟨x_i, x⟩ is a similarity measure between the input and the i-th support vector.
The non-separable case: introduce slack variables ξ_i.
The primal (soft margin):
  Minimize (1/2)||w||^2 + C Σ_i ξ_i^k
  Subject to y_i(w^T x_i + b) − 1 + ξ_i ≥ 0 ∀i
             ξ_i ≥ 0 ∀i
We take k = 1 in what follows.
The Lagrangian, with multipliers α_i ≥ 0 and μ_i ≥ 0:
  L(w, b, ξ; α, μ) = (1/2)||w||^2 + C Σ_i ξ_i − Σ_i α_i [ y_i(w^T x_i + b) − 1 + ξ_i ] − Σ_i μ_i ξ_i
The dual:
  Maximize L(w*, b*, ξ*; α, μ)
  Subject to ∇_{w,b,ξ} L(w, b, ξ; α, μ) = 0, α ≥ 0, μ ≥ 0
Duality
Setting the gradients to zero:
  ∇_w L = w − Σ_i α_i y_i x_i = 0  ⇒  w* = Σ_i α_i y_i x_i
  ∂L/∂b = −Σ_i α_i y_i = 0  ⇒  Σ_i α_i y_i = 0, i.e., α^T y = 0
  ∂L/∂ξ_i = C − α_i − μ_i = 0  ⇒  μ_i = C − α_i  ⇒  0 ≤ α_i ≤ C
Substituting back:
  F(α, μ) = L(w*, b*, ξ*; α, μ) = Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j y_i y_j ⟨x_i, x_j⟩
          = 1^T α − (1/2) α^T D α
Maximize this; note that μ has dropped out, leaving only the box constraint 0 ≤ α ≤ C.
The primal:
  Minimize (1/2)||w||^2 + C Σ_i ξ_i
  Subject to y_i(w^T x_i + b) − 1 + ξ_i ≥ 0 ∀i
             ξ_i ≥ 0 ∀i
The dual:
  Maximize F(α) = 1^T α − (1/2) α^T D α
  Subject to α^T y = 0, 0 ≤ α ≤ C1
This is again quadratic programming. Find α* with a QP solver; then
  w* = Σ_{i=1}^{l} α_i* y_i x_i
  b* = y_i − w*^T x_i for any i with 0 < α_i* < C
How do we recover the slacks ξ_i*? See below.
From the Lagrangian and the KKT conditions,
  b* = y_i − w*^T x_i for any i with 0 < α_i* < C
  ξ_i = max( 0, 1 − y_i(w*^T x_i + b*) )
A pattern is misclassified if ξ_i > 1.
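A hedged sketch of the role of C, using scikit-learn's soft-margin SVC on synthetic data (the dataset and values are illustrative): small C tolerates larger slacks and gives a wider margin; large C penalizes slack heavily and approaches the hard-margin solution.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (50, 2)),
               rng.normal(1.0, 1.0, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    w = clf.coef_[0]
    # Margin width is 2/||w||; smaller C -> wider margin, more support vectors.
    print(f"C={C}: margin={2 / np.linalg.norm(w):.3f}, #SV={clf.n_support_.sum()}")
```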
The classifier takes the same form as in the separable case:
  f(x) = sgn( Σ_{α_i* > 0} α_i* y_i ⟨x_i, x⟩ + b* )
with ⟨x_i, x⟩ acting as a similarity measure between the input and the i-th support vector.
Lagrange Multipliers
The method is due to Joseph-Louis Lagrange.
Goal: optimize f(x, y) subject to g(x, y) = 0.
[Figure: level curves of f and the constraint curve g(x, y) = 0, touching at the extreme points (x1, y1) and (x2, y2).]
At an extreme point, say (x*, y*), the gradients are parallel:
  ∇f(x*, y*) = λ ∇g(x*, y*)
Goal: Min/Max f(x) subject to g(x) = 0.
Lemma: at an extreme point, say x*, we have
  ∇f(x*) = λ ∇g(x*), provided ∇g(x*) ≠ 0.
Lagrange Multiplier
Let x have dimension n.
Goal: Min/Max f(x) subject to g(x) = 0.
Find the extreme points by solving the following equations:
  ∇f(x) = λ ∇g(x)
  g(x) = 0
That is n + 1 equations in n + 1 variables (x and λ).
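A small worked example of the method (my own illustration, not from the slides): maximize f(x, y) = x + y subject to g(x, y) = x² + y² − 1 = 0.

```latex
\nabla f = \lambda \nabla g \;\Rightarrow\; (1,\,1) = \lambda\,(2x,\,2y)
  \;\Rightarrow\; x = y = \tfrac{1}{2\lambda}
% Substitute into the constraint x^2 + y^2 = 1:
2x^2 = 1 \;\Rightarrow\; x = y = \pm\tfrac{1}{\sqrt{2}},
\quad f_{\max} = \sqrt{2} \text{ at } \bigl(\tfrac{1}{\sqrt{2}},\,\tfrac{1}{\sqrt{2}}\bigr)
```

Here n = 2, so the two gradient equations plus the constraint give n + 1 = 3 equations in the three unknowns x, y, λ.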
Optimization subject to g(x) = 0 thus becomes unconstrained optimization of the Lagrangian
  L(x; λ) = f(x) + λ g(x)
Solve ∇_{x,λ} L(x; λ) = 0.
Optimization with Multiple Equality Constraints
Let λ = (λ_1, …, λ_m)^T.
  Min/Max f(x)
  Subject to g_i(x) = 0, i = 1, …, m
Define the Lagrangian
  L(x; λ) = f(x) + Σ_{i=1}^{m} λ_i g_i(x)
Solve ∇_{x,λ} L(x; λ) = 0.
The general problem:
  Minimize f(x)
  Subject to g_i(x) = 0, i = 1, …, m
             h_j(x) ≤ 0, j = 1, …, n
You can always reformulate your problem into the above form.
Lagrange Multipliers
With λ = (λ_1, …, λ_m)^T and μ = (μ_1, …, μ_n)^T, μ_j ≥ 0, the Lagrangian is
  L(x; λ, μ) = f(x) + Σ_{i=1}^{m} λ_i g_i(x) + Σ_{j=1}^{n} μ_j h_j(x)
Duality
  L(x; λ, μ) = f(x) + Σ_{i=1}^{m} λ_i g_i(x) + Σ_{j=1}^{n} μ_j h_j(x)
Let x* be a point with 0 = ∇_x L(x; λ, μ)|_{x=x*}, and define the dual function
  D(λ, μ) = L(x*; λ, μ)
Then, when x* is feasible,
  D(λ, μ) = f(x*) + Σ_{i=1}^{m} λ_i g_i(x*) + Σ_{j=1}^{n} μ_j h_j(x*) ≤ f(x*)
since g_i(x*) = 0 and μ_j h_j(x*) ≤ 0.
The primal:
  Minimize f(x)
  Subject to g_i(x) = 0, i = 1, …, m
             h_j(x) ≤ 0, j = 1, …, n
The dual:
  Maximize L(x*; λ, μ)
  Subject to ∇_x L(x; λ, μ) = 0, μ ≥ 0
The Karush-Kuhn-Tucker Conditions
  L(x; λ, μ) = f(x) + Σ_{i=1}^{m} λ_i g_i(x) + Σ_{j=1}^{n} μ_j h_j(x)
  ∇_x L(x; λ, μ) = ∇_x f(x) + Σ_{i=1}^{m} λ_i ∇_x g_i(x) + Σ_{j=1}^{n} μ_j ∇_x h_j(x) = 0
  g_i(x) = 0, i = 1, …, m
  h_j(x) ≤ 0, j = 1, …, n
  μ_j ≥ 0, j = 1, …, n
  μ_j h_j(x) = 0, j = 1, …, n (complementary slackness)
Non-linear SVMs: map the input into a feature space by Φ: x → φ(x) and separate linearly there.
  g(x) = w^T φ(x) + b = Σ_{x_i ∈ SV} α_i y_i ⟨φ(x_i), φ(x)⟩ + b
A kernel function computes this inner product implicitly:
  K(x_i, x_j) = φ(x_i)^T φ(x_j)
Example: the degree-2 polynomial kernel on 2-D inputs,
  K(x_i, x_j) = (1 + x_i^T x_j)^2
              = 1 + x_{i1}^2 x_{j1}^2 + 2 x_{i1} x_{j1} x_{i2} x_{j2} + x_{i2}^2 x_{j2}^2 + 2 x_{i1} x_{j1} + 2 x_{i2} x_{j2}
              = [1, x_{i1}^2, √2 x_{i1}x_{i2}, x_{i2}^2, √2 x_{i1}, √2 x_{i2}] [1, x_{j1}^2, √2 x_{j1}x_{j2}, x_{j2}^2, √2 x_{j1}, √2 x_{j2}]^T
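A quick numeric check of this identity (the input vectors are arbitrary illustrative values):

```python
import numpy as np

def phi(u):
    # Feature map for 2-D input: [1, u1^2, sqrt(2) u1 u2, u2^2, sqrt(2) u1, sqrt(2) u2]
    u1, u2 = u
    s = np.sqrt(2.0)
    return np.array([1.0, u1**2, s*u1*u2, u2**2, s*u1, s*u2])

u, v = np.array([1.0, 2.0]), np.array([3.0, -1.0])
K = (1.0 + u @ v) ** 2
print(K, phi(u) @ phi(v))   # both print 4.0
```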
Sigmoid kernel:
  K(x_i, x_j) = tanh( β_0 x_i^T x_j + β_1 )
In general, functions that satisfy Mercer’s condition can be
kernel functions.
With a kernel, the discriminant function becomes
  g(x) = w^T φ(x) + b = Σ_{x_i ∈ SV} α_i y_i K(x_i, x) + b
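An illustrative sketch that reconstructs g(x) from a fitted scikit-learn SVC and compares it with the library's decision_function; the dataset and γ value are arbitrary. Note that sklearn's dual_coef_ already stores the products α_i y_i for the support vectors.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=60, centers=2, random_state=1)
clf = SVC(kernel='rbf', gamma=0.5).fit(X, y)

def rbf(u, v, gamma=0.5):
    # Gaussian (RBF) kernel K(u, v) = exp(-gamma ||u - v||^2)
    return np.exp(-gamma * np.sum((u - v) ** 2))

x = X[0]
# g(x) = sum_i alpha_i y_i K(x_i, x) + b over the support vectors.
g = sum(c * rbf(sv, x) for c, sv in zip(clf.dual_coef_[0], clf.support_vectors_))
g += clf.intercept_[0]
print(g, clf.decision_function([x])[0])   # the two values should agree
```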
Choice of kernel
- the Gaussian or polynomial kernel is the usual default
- if these are ineffective, more elaborate kernels are needed
- domain experts can assist in formulating appropriate similarity measures
Choice of kernel parameters
- e.g., σ in the Gaussian kernel
- a common heuristic: set σ to the distance between the closest points with different classifications
- in the absence of reliable criteria, applications rely on a validation set or cross-validation to set such parameters (see the sketch below)
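A possible cross-validation sketch with scikit-learn (the dataset and parameter grids are illustrative):

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)

# 5-fold cross-validation over C and the Gaussian-kernel width gamma.
pipe = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
grid = GridSearchCV(pipe,
                    param_grid={'svc__C': [0.1, 1, 10, 100],
                                'svc__gamma': [1e-3, 1e-2, 1e-1, 1]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```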
Optimization criterion: hard margin vs. soft margin
- typically settled by a lengthy series of experiments in which various parameters are tested
UCI datasets:
http://archive.ics.uci.edu/ml/datasets.html
Reuters-21578 Text Categorization Collection
Wine
Credit Approval
Requirements
- Use at least three different kernels
- Choose the best values for the parameters
- You may also use a dimension reduction method, e.g., PCA
For the UCI datasets (http://archive.ics.uci.edu/ml/datasets.html) Musk (version 2) and Wine, train three SVMs to classify them and report the accuracy; each SVM must use a different kernel function. In addition, apply an ensemble learning method to combine different models: the ensemble may consist of SVMs with different kernels, and may also include neural networks, KNN, linear models, polynomial models, etc. Compare the results of the ensemble model with those of the individual SVMs and the other models.
Requirement: submit the code and report before 24:00 on June 17.
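A possible starting point for the assignment, not a prescribed solution: three SVMs with different kernels on the Wine data (bundled with scikit-learn), plus a simple hard-voting ensemble that also includes a KNN model. Musk (version 2) would need to be downloaded from the UCI repository separately.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import VotingClassifier

X, y = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Three SVMs, each with a different kernel (parameters are illustrative).
svms = {
    'linear': make_pipeline(StandardScaler(), SVC(kernel='linear', C=1.0)),
    'poly':   make_pipeline(StandardScaler(), SVC(kernel='poly', degree=3, C=1.0)),
    'rbf':    make_pipeline(StandardScaler(), SVC(kernel='rbf', gamma='scale', C=1.0)),
}
for name, clf in svms.items():
    clf.fit(X_tr, y_tr)
    print(name, 'accuracy:', clf.score(X_te, y_te))

# Ensemble: the three SVMs plus KNN, combined by majority vote.
ensemble = VotingClassifier(
    estimators=[(n, c) for n, c in svms.items()] +
               [('knn', make_pipeline(StandardScaler(), KNeighborsClassifier()))],
    voting='hard')
ensemble.fit(X_tr, y_tr)
print('ensemble accuracy:', ensemble.score(X_te, y_te))
```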
Any Questions?