Chapter 9
Support Vector Machines
A learning machine is defined as a family of mappings x → f(x, α), where α denotes the adjustable parameters.
Training data give sample pairs x_i → y_i.
Learning means adjusting the parameter α.
Question: does a small empirical risk R_emp(α) imply a small expected risk R(α)?
More on Empirical Risk
Right structure: e.g., the right amount and the right forms of components or parameters participate in the learning machine.
Right rules: the empirical risk is also reduced when the right rules are learned.
[Figure: three points in the plane shattered by oriented lines; the VC dimension of linear classifiers in the plane is 3.]
With probability at least 1 − η,
  R(α) ≤ R_emp(α) + sqrt( [ h(ln(2l/h) + 1) − ln(η/4) ] / l )
where h is the VC dimension and l is the number of samples.
The square-root term is called the VC confidence.
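To see the magnitude of the VC confidence, here is a worked example with illustrative values h = 10, l = 1000, η = 0.05 (these numbers are my own, not from the slides):

```latex
% VC confidence with h = 10, l = 1000, eta = 0.05 (illustrative)
\sqrt{\frac{h\bigl(\ln(2l/h)+1\bigr)-\ln(\eta/4)}{l}}
  = \sqrt{\frac{10(\ln 200 + 1) - \ln 0.0125}{1000}}
  \approx \sqrt{\frac{62.98 + 4.38}{1000}}
  \approx 0.26
```

So with probability 0.95, the expected risk exceeds the empirical risk by at most about 0.26.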
Traditional approaches minimize the empirical risk R_emp(α) only; the VC confidence term is ignored.
[Figure: nested hypothesis classes with VC dimensions h1 ≤ h2 ≤ h3 ≤ h4; the bound trades empirical risk against VC confidence (structural risk minimization).]
Linearly separable case
[Figure: separating hyperplane w^T x + b = 0 with margin hyperplanes w^T x + b = +1 and w^T x + b = −1; d is the dimension of x.]
The linear discriminant function (classifier) with the maximum margin is the best.
Scale w and b so that
  w^T x_i + b ≥ +1 for y_i = +1
  w^T x_i + b ≤ −1 for y_i = −1
or, compactly, y_i(w^T x_i + b) − 1 ≥ 0 ∀i.
The two margin hyperplanes lie at distances |1 − b|/||w|| and |−1 − b|/||w|| from the origin, so the margin between them is
  2 / ||w||
Maximizing the margin therefore means minimizing ||w||.
The maximum-margin problem:
  Minimize (1/2)||w||^2
  Subject to y_i(w^T x_i + b) − 1 ≥ 0 ∀i
The Lagrangian:
  L(w, b; α) = (1/2)||w||^2 − Σ_{i=1}^{l} α_i [ y_i(w^T x_i + b) − 1 ],  α_i ≥ 0
Why Lagrange?
The constraints will be replaced by constraints on the
Lagrange multipliers, which will be much easier to
handle.
In this reformulation of the problem, the training data
will only appear in the form of dot products between
vectors.
Look again at the constrained problem and its Lagrangian:
  Minimize (1/2)||w||^2  subject to  y_i(w^T x_i + b) − 1 ≥ 0 ∀i
  L(w, b; α) = (1/2)||w||^2 − Σ_{i=1}^{l} α_i [ y_i(w^T x_i + b) − 1 ],  α_i ≥ 0
How about if a constraint y_i(w^T x_i + b) − 1 is zero? Then α_i may be nonzero.
What value should α_i take if the constraint is feasible and nonzero (strictly positive)? Zero; this is complementary slackness.
Expanding the Lagrangian:
  L(w, b; α) = (1/2)||w||^2 − Σ_{i=1}^{l} α_i y_i (w^T x_i + b) + Σ_{i=1}^{l} α_i
Setting the gradients of the Lagrangian to zero:
  ∇_w L(w, b; α) = w − Σ_{i=1}^{l} α_i y_i x_i = 0  ⇒  w* = Σ_{i=1}^{l} α_i y_i x_i
  ∂L(w, b; α)/∂b = −Σ_{i=1}^{l} α_i y_i = 0  ⇒  Σ_{i=1}^{l} α_i y_i = 0
Substituting w* back into the Lagrangian:
  L(w*, b*; α) = Σ_{i=1}^{l} α_i − (1/2) Σ_{i=1}^{l} Σ_{j=1}^{l} α_i α_j y_i y_j ⟨x_i, x_j⟩
Maximize this over α.
In matrix form, with y = (y_1, …, y_l)^T and D_ij = y_i y_j ⟨x_i, x_j⟩, the dual objective is
  F(α) = 1^T α − (1/2) α^T D α
to be maximized subject to α^T y = 0 and α ≥ 0.
The primal:
  Minimize (1/2)||w||^2
  Subject to y_i(w^T x_i + b) − 1 ≥ 0 ∀i
The dual:
  Maximize F(α) = 1^T α − (1/2) α^T D α
  Subject to α^T y = 0, α ≥ 0
This is a quadratic programming (QP) problem. Find α* with a QP solver; then
  w* = Σ_{i=1}^{l} α_i* y_i x_i
  b* = y_i − w*^T x_i for any i with α_i* > 0
x_i is called a support vector if α_i* > 0.
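A minimal sketch of this QP step, assuming the cvxopt package; the toy data and all variable names are illustrative, not from the slides:

```python
import numpy as np
from cvxopt import matrix, solvers

# Toy linearly separable data in R^2 (illustrative).
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
l = len(y)

# Dual: maximize 1'a - (1/2) a'Da with D_ij = y_i y_j <x_i, x_j>,
# i.e. minimize (1/2) a'Da - 1'a  subject to  a >= 0 and y'a = 0.
D = np.outer(y, y) * (X @ X.T)
sol = solvers.qp(P=matrix(D), q=matrix(-np.ones(l)),
                 G=matrix(-np.eye(l)), h=matrix(np.zeros(l)),  # -a <= 0
                 A=matrix(y.reshape(1, -1)), b=matrix(0.0))    # y'a = 0
alpha = np.ravel(sol['x'])

# w* = sum_i alpha_i y_i x_i;  b* = y_k - w*'x_k for a support vector k.
w_star = (alpha * y) @ X
k = int(np.argmax(alpha))          # some index with alpha_k > 0
b_star = y[k] - w_star @ X[k]
print(w_star, b_star)
```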
The Karush-Kuhn-Tucker Conditions
The Lagrangian:
  L(w, b; α) = (1/2)||w||^2 − Σ_{i=1}^{l} α_i [ y_i(w^T x_i + b) − 1 ]
The KKT conditions:
  ∇_w L(w, b; α) = w − Σ_{i=1}^{l} α_i y_i x_i = 0
  ∂L(w, b; α)/∂b = −Σ_{i=1}^{l} α_i y_i = 0
  y_i(w^T x_i + b) − 1 ≥ 0, i = 1, …, l
  α_i ≥ 0, i = 1, …, l
  α_i [ y_i(w^T x_i + b) − 1 ] = 0, i = 1, …, l (complementary slackness)
The classifier:
  f(x) = sgn( w*^T x + b* )
       = sgn( Σ_{i=1}^{l} α_i* y_i ⟨x_i, x⟩ + b* )
       = sgn( Σ_{α_i* > 0} α_i* y_i ⟨x_i, x⟩ + b* )
⟨x_i, x⟩ is a similarity measure between the input and the i-th support vector.
The non-separable case: introduce slack variables ξ_i.
The primal (soft margin):
  Minimize (1/2)||w||^2 + C Σ_i ξ_i^k
  Subject to y_i(w^T x_i + b) − 1 + ξ_i ≥ 0 ∀i
             ξ_i ≥ 0 ∀i
We take k = 1 in what follows.
The Lagrangian, with multipliers α_i ≥ 0 and μ_i ≥ 0:
  L(w, b, ξ; α, μ) = (1/2)||w||^2 + C Σ_i ξ_i − Σ_i α_i [ y_i(w^T x_i + b) − 1 + ξ_i ] − Σ_i μ_i ξ_i
The dual:
  Maximize L(w*, b*, ξ*; α, μ)
  Subject to ∇_{w,b,ξ} L(w, b, ξ; α, μ) = 0, α ≥ 0, μ ≥ 0
Duality
Setting the gradients to zero:
  ∇_w L = w − Σ_i α_i y_i x_i = 0  ⇒  w* = Σ_i α_i y_i x_i
  ∂L/∂b = −Σ_i α_i y_i = 0  ⇒  Σ_i α_i y_i = 0, i.e., α^T y = 0
  ∂L/∂ξ_i = C − α_i − μ_i = 0  ⇒  μ_i = C − α_i  ⇒  0 ≤ α_i ≤ C
Substituting back:
  F(α, μ) = L(w*, b*, ξ*; α, μ) = Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j y_i y_j ⟨x_i, x_j⟩
          = 1^T α − (1/2) α^T D α
Maximize this; note that μ has dropped out, leaving only the box constraint 0 ≤ α ≤ C.
The primal:
  Minimize (1/2)||w||^2 + C Σ_i ξ_i
  Subject to y_i(w^T x_i + b) − 1 + ξ_i ≥ 0 ∀i
             ξ_i ≥ 0 ∀i
The dual:
  Maximize F(α) = 1^T α − (1/2) α^T D α
  Subject to α^T y = 0, 0 ≤ α ≤ C1
This is again quadratic programming. Find α* with a QP solver; then
  w* = Σ_{i=1}^{l} α_i* y_i x_i
  b* = y_i − w*^T x_i for any i with 0 < α_i* < C
How do we recover the slacks ξ_i*? See below.
From the Lagrangian and the KKT conditions,
  b* = y_i − w*^T x_i for any i with 0 < α_i* < C
  ξ_i = max( 0, 1 − y_i(w*^T x_i + b*) )
A pattern is misclassified if ξ_i > 1.
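A hedged sketch of the role of C, using scikit-learn's soft-margin SVC on synthetic data (the dataset and values are illustrative): small C tolerates larger slacks and gives a wider margin; large C penalizes slack heavily and approaches the hard-margin solution.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (50, 2)),
               rng.normal(1.0, 1.0, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    w = clf.coef_[0]
    # Margin width is 2/||w||; smaller C -> wider margin, more support vectors.
    print(f"C={C}: margin={2 / np.linalg.norm(w):.3f}, #SV={clf.n_support_.sum()}")
```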
The classifier takes the same form as in the separable case:
  f(x) = sgn( Σ_{α_i* > 0} α_i* y_i ⟨x_i, x⟩ + b* )
with ⟨x_i, x⟩ acting as a similarity measure between the input and the i-th support vector.
Lagrange Multipliers
The method is due to Joseph-Louis Lagrange.
Goal: optimize f(x, y) subject to g(x, y) = 0.
[Figure: level curves of f and the constraint curve g(x, y) = 0, touching at the extreme points (x1, y1) and (x2, y2).]
At an extreme point, say (x*, y*), the gradients are parallel:
  ∇f(x*, y*) = λ ∇g(x*, y*)
Goal: Min/Max f(x) subject to g(x) = 0.
Lemma: at an extreme point, say x*, we have
  ∇f(x*) = λ ∇g(x*), provided ∇g(x*) ≠ 0.
Lagrange Multiplier
Let x have dimension n.
Goal: Min/Max f(x) subject to g(x) = 0.
Find the extreme points by solving the following equations:
  ∇f(x) = λ ∇g(x)
  g(x) = 0
That is n + 1 equations in n + 1 variables (x and λ).
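A small worked example of the method (my own illustration, not from the slides): maximize f(x, y) = x + y subject to g(x, y) = x² + y² − 1 = 0.

```latex
\nabla f = \lambda \nabla g \;\Rightarrow\; (1,\,1) = \lambda\,(2x,\,2y)
  \;\Rightarrow\; x = y = \tfrac{1}{2\lambda}
% Substitute into the constraint x^2 + y^2 = 1:
2x^2 = 1 \;\Rightarrow\; x = y = \pm\tfrac{1}{\sqrt{2}},
\quad f_{\max} = \sqrt{2} \text{ at } \bigl(\tfrac{1}{\sqrt{2}},\,\tfrac{1}{\sqrt{2}}\bigr)
```

Here n = 2, so the two gradient equations plus the constraint give n + 1 = 3 equations in the three unknowns x, y, λ.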
Optimization subject to g(x) = 0 thus becomes unconstrained optimization of the Lagrangian
  L(x; λ) = f(x) + λ g(x)
Solve ∇_{x,λ} L(x; λ) = 0.
Optimization with Multiple Equality Constraints
Let λ = (λ_1, …, λ_m)^T.
  Min/Max f(x)
  Subject to g_i(x) = 0, i = 1, …, m
Define the Lagrangian
  L(x; λ) = f(x) + Σ_{i=1}^{m} λ_i g_i(x)
Solve ∇_{x,λ} L(x; λ) = 0.
The general problem:
  Minimize f(x)
  Subject to g_i(x) = 0, i = 1, …, m
             h_j(x) ≤ 0, j = 1, …, n
You can always reformulate your problem into the above form.
Lagrange Multipliers
With λ = (λ_1, …, λ_m)^T and μ = (μ_1, …, μ_n)^T, μ_j ≥ 0, the Lagrangian is
  L(x; λ, μ) = f(x) + Σ_{i=1}^{m} λ_i g_i(x) + Σ_{j=1}^{n} μ_j h_j(x)
Duality
  L(x; λ, μ) = f(x) + Σ_{i=1}^{m} λ_i g_i(x) + Σ_{j=1}^{n} μ_j h_j(x)
Let x* be a point with 0 = ∇_x L(x; λ, μ)|_{x=x*}, and define the dual function
  D(λ, μ) = L(x*; λ, μ)
Then, when x* is feasible,
  D(λ, μ) = f(x*) + Σ_{i=1}^{m} λ_i g_i(x*) + Σ_{j=1}^{n} μ_j h_j(x*) ≤ f(x*)
since g_i(x*) = 0 and μ_j h_j(x*) ≤ 0.
The primal:
  Minimize f(x)
  Subject to g_i(x) = 0, i = 1, …, m
             h_j(x) ≤ 0, j = 1, …, n
The dual:
  Maximize L(x*; λ, μ)
  Subject to ∇_x L(x; λ, μ) = 0, μ ≥ 0
The Karush-Kuhn-Tucker Conditions
  L(x; λ, μ) = f(x) + Σ_{i=1}^{m} λ_i g_i(x) + Σ_{j=1}^{n} μ_j h_j(x)
  ∇_x L(x; λ, μ) = ∇_x f(x) + Σ_{i=1}^{m} λ_i ∇_x g_i(x) + Σ_{j=1}^{n} μ_j ∇_x h_j(x) = 0
  g_i(x) = 0, i = 1, …, m
  h_j(x) ≤ 0, j = 1, …, n
  μ_j ≥ 0, j = 1, …, n
  μ_j h_j(x) = 0, j = 1, …, n (complementary slackness)
Non-linear SVMs: map the input into a feature space by Φ: x → φ(x) and separate linearly there.
  g(x) = w^T φ(x) + b = Σ_{x_i ∈ SV} α_i y_i ⟨φ(x_i), φ(x)⟩ + b
A kernel function computes this inner product implicitly:
  K(x_i, x_j) = φ(x_i)^T φ(x_j)
Example: the degree-2 polynomial kernel on 2-D inputs,
  K(x_i, x_j) = (1 + x_i^T x_j)^2
              = 1 + x_{i1}^2 x_{j1}^2 + 2 x_{i1} x_{j1} x_{i2} x_{j2} + x_{i2}^2 x_{j2}^2 + 2 x_{i1} x_{j1} + 2 x_{i2} x_{j2}
              = [1, x_{i1}^2, √2 x_{i1}x_{i2}, x_{i2}^2, √2 x_{i1}, √2 x_{i2}] [1, x_{j1}^2, √2 x_{j1}x_{j2}, x_{j2}^2, √2 x_{j1}, √2 x_{j2}]^T
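A quick numeric check of this identity (the input vectors are arbitrary illustrative values):

```python
import numpy as np

def phi(u):
    # Feature map for 2-D input: [1, u1^2, sqrt(2) u1 u2, u2^2, sqrt(2) u1, sqrt(2) u2]
    u1, u2 = u
    s = np.sqrt(2.0)
    return np.array([1.0, u1**2, s*u1*u2, u2**2, s*u1, s*u2])

u, v = np.array([1.0, 2.0]), np.array([3.0, -1.0])
K = (1.0 + u @ v) ** 2
print(K, phi(u) @ phi(v))   # both print 4.0
```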
Sigmoid kernel:
  K(x_i, x_j) = tanh( β_0 x_i^T x_j + β_1 )
In general, functions that satisfy Mercer’s condition can be
kernel functions.
With a kernel, the discriminant function becomes
  g(x) = w^T φ(x) + b = Σ_{x_i ∈ SV} α_i y_i K(x_i, x) + b
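An illustrative sketch that reconstructs g(x) from a fitted scikit-learn SVC and compares it with the library's decision_function; the dataset and γ value are arbitrary. Note that sklearn's dual_coef_ already stores the products α_i y_i for the support vectors.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=60, centers=2, random_state=1)
clf = SVC(kernel='rbf', gamma=0.5).fit(X, y)

def rbf(u, v, gamma=0.5):
    # Gaussian (RBF) kernel K(u, v) = exp(-gamma ||u - v||^2)
    return np.exp(-gamma * np.sum((u - v) ** 2))

x = X[0]
# g(x) = sum_i alpha_i y_i K(x_i, x) + b over the support vectors.
g = sum(c * rbf(sv, x) for c, sv in zip(clf.dual_coef_[0], clf.support_vectors_))
g += clf.intercept_[0]
print(g, clf.decision_function([x])[0])   # the two values should agree
```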
Choice of kernel
- the Gaussian or polynomial kernel is the usual default
- if these are ineffective, more elaborate kernels are needed
- domain experts can assist in formulating appropriate similarity measures
Choice of kernel parameters
- e.g., σ in the Gaussian kernel
- a common heuristic: set σ to the distance between the closest points with different classifications
- in the absence of reliable criteria, applications rely on a validation set or cross-validation to set such parameters (see the sketch below)
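A possible cross-validation sketch with scikit-learn (the dataset and parameter grids are illustrative):

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)

# 5-fold cross-validation over C and the Gaussian-kernel width gamma.
pipe = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
grid = GridSearchCV(pipe,
                    param_grid={'svc__C': [0.1, 1, 10, 100],
                                'svc__gamma': [1e-3, 1e-2, 1e-1, 1]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```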
Optimization criterion: hard margin vs. soft margin
- typically settled by a lengthy series of experiments in which various parameters are tested
UCI datasets:
http://archive.ics.uci.edu/ml/datasets.html
Reuters-21578 Text Categorization Collection
Wine
Credit Approval
Requirements
- Use at least three different kernels
- Choose the best values for the parameters
- You may also use a dimension reduction method, e.g., PCA
For the UCI datasets (http://archive.ics.uci.edu/ml/datasets.html) Musk (version 2) and Wine, train three SVMs to classify them and report the accuracy; each SVM must use a different kernel function. In addition, apply an ensemble learning method to combine different models: the ensemble may consist of SVMs with different kernels, and may also include neural networks, KNN, linear models, polynomial models, etc. Compare the results of the ensemble model with those of the individual SVMs and the other models.
Requirement: submit the code and report before 24:00 on June 17.
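A possible starting point for the assignment, not a prescribed solution: three SVMs with different kernels on the Wine data (bundled with scikit-learn), plus a simple hard-voting ensemble that also includes a KNN model. Musk (version 2) would need to be downloaded from the UCI repository separately.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import VotingClassifier

X, y = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Three SVMs, each with a different kernel (parameters are illustrative).
svms = {
    'linear': make_pipeline(StandardScaler(), SVC(kernel='linear', C=1.0)),
    'poly':   make_pipeline(StandardScaler(), SVC(kernel='poly', degree=3, C=1.0)),
    'rbf':    make_pipeline(StandardScaler(), SVC(kernel='rbf', gamma='scale', C=1.0)),
}
for name, clf in svms.items():
    clf.fit(X_tr, y_tr)
    print(name, 'accuracy:', clf.score(X_te, y_te))

# Ensemble: the three SVMs plus KNN, combined by majority vote.
ensemble = VotingClassifier(
    estimators=[(n, c) for n, c in svms.items()] +
               [('knn', make_pipeline(StandardScaler(), KNeighborsClassifier()))],
    voting='hard')
ensemble.fit(X_tr, y_tr)
print('ensemble accuracy:', ensemble.score(X_te, y_te))
```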
Any Questions?