
A (short) Introduction to Support Vector Machines and Kernel-based Learning


Johan Suykens
K.U. Leuven, ESAT-SCD-SISTA
Kasteelpark Arenberg 10
B-3001 Leuven (Heverlee), Belgium
Tel: 32/16/32 18 02 - Fax: 32/16/32 19 70
Email: johan.suykens@esat.kuleuven.ac.be
http://www.esat.kuleuven.ac.be/sista/members/suykens.html
ESANN 2003, Bruges April 2003

Overview
Disadvantages of classical neural nets
SVM properties and standard SVM classifier
Related kernel-based learning methods
Use of the kernel trick (Mercer Theorem)
LS-SVMs: extending the SVM framework
Towards a next generation of universally applicable models?
The problem of learning and generalization


Classical MLPs

[Figure: single neuron taking inputs x1, ..., xn with weights w1, ..., wn, bias b and activation h(.)]
Multilayer Perceptron (MLP) properties:

Universal approximation of continuous nonlinear functions


Learning from input-output patterns; either off-line or on-line learning
Parallel network architecture, multiple inputs and outputs
Use in feedforward and recurrent networks
Use in supervised and unsupervised learning applications
Problems: Existence of many local minima!
How many neurons needed for a given task?
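As a point of reference for the SVM models that follow, here is a minimal numpy sketch of the forward pass of such a one-hidden-layer network; all sizes and weights are illustrative, not taken from the slides. Training these weights by gradient descent is exactly where the local-minima problem mentioned above appears.

```python
import numpy as np

# Minimal sketch (illustrative only): forward pass of a one-hidden-layer MLP
# y = w2^T h(W1 x + b1) + b2, with tanh as the activation h(.).
rng = np.random.default_rng(0)

n_in, n_hidden = 3, 5                      # illustrative sizes
W1 = rng.normal(size=(n_hidden, n_in))     # input-to-hidden weights
b1 = rng.normal(size=n_hidden)             # hidden biases
w2 = rng.normal(size=n_hidden)             # hidden-to-output weights
b2 = 0.1                                   # output bias

def mlp_forward(x):
    hidden = np.tanh(W1 @ x + b1)          # nonlinear hidden layer h(.)
    return w2 @ hidden + b2                # linear output layer

print(mlp_forward(np.array([0.5, -1.0, 2.0])))
```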

Support Vector Machines (SVM)


[Figure: cost function versus weights for the MLP and for the SVM; the MLP cost surface has many local minima, while the SVM training problem is convex with a unique minimum]

Nonlinear classification and function estimation by convex optimization, with a unique solution and primal-dual interpretations.
Number of neurons automatically follows from a convex program.
Learning and generalization in huge dimensional input spaces (able to avoid the curse of dimensionality!).
Use of kernels (e.g. linear, polynomial, RBF, MLP, splines, ...).
Application-specific kernels possible (e.g. text mining, bioinformatics).

SVM: support vectors


[Figure: two example binary classification data sets in the (x1, x2) plane, with nonlinear decision boundaries determined by the encircled support vectors]

Decision boundary can be expressed in terms of a limited number of support vectors (a subset of the given training data): sparseness property.
Classifier follows from the solution to a convex QP problem.
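A hedged illustration of this sparseness, using scikit-learn rather than the solver behind the slides; the toy data set and hyperparameters are arbitrary choices for demonstration:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=-1.0, scale=0.6, size=(50, 2)),
               rng.normal(loc=+1.0, scale=0.6, size=(50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])

clf = SVC(kernel="rbf", C=1.0, gamma=1.0).fit(X, y)

# Only a subset of the 100 training points ends up as support vectors;
# the decision boundary depends on these points alone (sparseness).
print("number of support vectors per class:", clf.n_support_)
print("first support vectors:\n", clf.support_vectors_[:5])
```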

SVMs: living in two worlds ...


Primal space (→ large data sets):
Parametric: estimate $w \in \mathbb{R}^{n_h}$
$y(x) = \mathrm{sign}[w^T \varphi(x) + b]$

Dual space (→ high dimensional inputs):
Non-parametric: estimate $\alpha \in \mathbb{R}^{N}$
$y(x) = \mathrm{sign}\big[\sum_{i=1}^{\#sv} \alpha_i y_i K(x, x_i) + b\big]$

Kernel trick: $K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$

[Figure: network interpretation of the primal model (hidden layer $\varphi_1(x), \ldots, \varphi_{n_h}(x)$ with weights $w_1, \ldots, w_{n_h}$) and of the dual model (kernels $K(x, x_1), \ldots, K(x, x_{\#sv})$ with weights $\alpha_1, \ldots, \alpha_{\#sv}$); mapping $\varphi$ from input space to feature space]
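A small sketch of the kernel trick for a case where $\varphi(x)$ can actually be written down: the homogeneous degree-2 polynomial kernel $K(x,z) = (x^T z)^2$ has the explicit feature map of all pairwise products $x_i x_j$. The kernel choice is ours for illustration; the slides keep $\varphi$ generic.

```python
import numpy as np

def phi(x):
    # explicit feature map into R^(n^2): all products x_i * x_j
    return np.outer(x, x).ravel()

def K(x, z):
    # the same inner product, computed without ever forming phi
    return (x @ z) ** 2

x = np.array([1.0, 2.0, -0.5])
z = np.array([0.3, -1.0, 2.0])

print(phi(x) @ phi(z))   # primal-style inner product in feature space
print(K(x, z))           # dual-style evaluation via the kernel -> same value
```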

Standard SVM classifier (1)


Training set $\{x_i, y_i\}_{i=1}^{N}$: inputs $x_i \in \mathbb{R}^{n}$; class labels $y_i \in \{-1, +1\}$

Classifier:
$y(x) = \mathrm{sign}[w^T \varphi(x) + b]$
with $\varphi(\cdot): \mathbb{R}^{n} \rightarrow \mathbb{R}^{n_h}$ a mapping to a high dimensional feature space (which can be infinite dimensional!)

For separable data, assume
$w^T \varphi(x_i) + b \ge +1$ if $y_i = +1$ and $w^T \varphi(x_i) + b \le -1$ if $y_i = -1$,
i.e. $y_i[w^T \varphi(x_i) + b] \ge 1, \ \forall i$

Optimization problem (non-separable case):
$\min_{w,b,\xi} \ \mathcal{J}(w, \xi) = \frac{1}{2} w^T w + c \sum_{i=1}^{N} \xi_i \quad \text{s.t.} \quad y_i[w^T \varphi(x_i) + b] \ge 1 - \xi_i, \ \ \xi_i \ge 0, \ i = 1, \ldots, N$
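A minimal sketch of this primal soft-margin problem, restricted to the identity feature map $\varphi(x) = x$ (a linear SVM) so that $w$ can be optimized directly; it uses cvxpy as a generic convex solver, which is our choice for illustration, not the solver behind the slides. Data and $c$ are arbitrary.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(2)
N, n = 60, 2
X = np.vstack([rng.normal(-1.0, 0.7, size=(N // 2, n)),
               rng.normal(+1.0, 0.7, size=(N // 2, n))])
y = np.hstack([-np.ones(N // 2), np.ones(N // 2)])
c = 1.0

w = cp.Variable(n)
b = cp.Variable()
xi = cp.Variable(N)                       # slack variables xi_i

objective = cp.Minimize(0.5 * cp.sum_squares(w) + c * cp.sum(xi))
constraints = [cp.multiply(y, X @ w + b) >= 1 - xi,   # y_i (w^T x_i + b) >= 1 - xi_i
               xi >= 0]
cp.Problem(objective, constraints).solve()

print("w =", w.value, " b =", float(b.value))
```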

Standard SVM classifier (2)


Lagrangian:
$\mathcal{L}(w, b, \xi; \alpha, \nu) = \mathcal{J}(w, \xi) - \sum_{i=1}^{N} \alpha_i \{ y_i[w^T \varphi(x_i) + b] - 1 + \xi_i \} - \sum_{i=1}^{N} \nu_i \xi_i$

Find saddle point:
$\max_{\alpha, \nu} \ \min_{w, b, \xi} \ \mathcal{L}(w, b, \xi; \alpha, \nu)$

One obtains
$\dfrac{\partial \mathcal{L}}{\partial w} = 0 \ \Rightarrow \ w = \sum_{i=1}^{N} \alpha_i y_i \varphi(x_i)$
$\dfrac{\partial \mathcal{L}}{\partial b} = 0 \ \Rightarrow \ \sum_{i=1}^{N} \alpha_i y_i = 0$
$\dfrac{\partial \mathcal{L}}{\partial \xi_i} = 0 \ \Rightarrow \ 0 \le \alpha_i \le c, \ i = 1, \ldots, N$


Standard SVM classifier (3)


Dual problem: QP problem
$\max_{\alpha} \ Q(\alpha) = -\dfrac{1}{2} \sum_{i,j=1}^{N} y_i y_j K(x_i, x_j)\, \alpha_i \alpha_j + \sum_{j=1}^{N} \alpha_j \quad \text{s.t.} \quad \sum_{i=1}^{N} \alpha_i y_i = 0, \ \ 0 \le \alpha_i \le c, \ \forall i$

with kernel trick (Mercer Theorem): $K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$

Obtained classifier:
$y(x) = \mathrm{sign}\big[\sum_{i=1}^{N} \alpha_i y_i K(x, x_i) + b\big]$

Some possible kernels $K(\cdot, \cdot)$:
$K(x, x_i) = x_i^T x$ (linear SVM)
$K(x, x_i) = (x_i^T x + \tau)^d$ (polynomial SVM of degree $d$)
$K(x, x_i) = \exp\{-\|x - x_i\|_2^2 / \sigma^2\}$ (RBF kernel)
$K(x, x_i) = \tanh(\kappa\, x_i^T x + \theta)$ (MLP kernel)
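The kernels listed above and the dual-form classifier translate directly into code. In this sketch the multipliers $\alpha_i$ and bias $b$ are assumed to come from the dual QP (or any SVM solver); the values below are placeholders for illustration only.

```python
import numpy as np

def k_linear(x, xi):
    return xi @ x

def k_poly(x, xi, tau=1.0, d=3):
    return (xi @ x + tau) ** d

def k_rbf(x, xi, sigma=1.0):
    return np.exp(-np.sum((x - xi) ** 2) / sigma ** 2)

def k_mlp(x, xi, kappa=1.0, theta=0.0):
    return np.tanh(kappa * (xi @ x) + theta)

def svm_decision(x, X_sv, y_sv, alpha, b, kernel=k_rbf):
    # y(x) = sign( sum_i alpha_i y_i K(x, x_i) + b )
    s = sum(a * yi * kernel(x, xi) for a, yi, xi in zip(alpha, y_sv, X_sv))
    return np.sign(s + b)

# Hypothetical support vectors, labels and multipliers (not from a real fit):
X_sv = np.array([[0.0, 1.0], [1.0, -1.0], [-1.0, 0.5]])
y_sv = np.array([+1, -1, +1])
alpha = np.array([0.7, 0.9, 0.2])
b = 0.05

print(svm_decision(np.array([0.2, 0.3]), X_sv, y_sv, alpha, b))
```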

Kernel-based learning: many related methods and fields

[Figure: reproducing kernel Hilbert spaces (RKHS) as the common ground linking SVMs, regularization networks, LS-SVMs, Gaussian processes, kernel ridge regression and kriging]

Some early history on RKHS:
1910-1920: Moore
1940: Aronszajn
1951: Krige
1970: Parzen
1971: Kimeldorf & Wahba

SVMs are closely related to learning in Reproducing Kernel Hilbert Spaces.


Wider use of the kernel trick


Angle between vectors:
Input space: $\cos \theta_{x,z} = \dfrac{x^T z}{\|x\|_2 \|z\|_2}$
Feature space: $\cos \theta_{\varphi(x),\varphi(z)} = \dfrac{\varphi(x)^T \varphi(z)}{\|\varphi(x)\|_2 \|\varphi(z)\|_2} = \dfrac{K(x, z)}{\sqrt{K(x, x)}\,\sqrt{K(z, z)}}$

Distance between vectors:
Input space: $\|x - z\|_2^2 = (x - z)^T (x - z) = x^T x + z^T z - 2\, x^T z$
Feature space: $\|\varphi(x) - \varphi(z)\|_2^2 = K(x, x) + K(z, z) - 2\, K(x, z)$
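These two identities transcribe directly into code, here with an RBF kernel as an example (any Mercer kernel would do); $\varphi$ is never computed explicitly.

```python
import numpy as np

def K(x, z, sigma=1.0):
    # RBF kernel, the choice is ours for illustration
    return np.exp(-np.sum((x - z) ** 2) / sigma ** 2)

def feature_space_cosine(x, z):
    # cosine of the angle between phi(x) and phi(z)
    return K(x, z) / np.sqrt(K(x, x) * K(z, z))

def feature_space_distance_sq(x, z):
    # squared distance between phi(x) and phi(z)
    return K(x, x) + K(z, z) - 2.0 * K(x, z)

x = np.array([1.0, 0.5])
z = np.array([-0.2, 1.5])
print(feature_space_cosine(x, z), feature_space_distance_sq(x, z))
```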

LS-SVM models: extending the SVM framework


Linear and nonlinear classification and function estimation, applicable in high
dimensional input spaces; primal-dual optimization formulations.

Solving linear systems (a minimal sketch follows after this list); link with Gaussian processes, regularization networks and kernel
versions of Fisher discriminant analysis.

Sparse approximation and robust regression (robust statistics).


Bayesian inference (probabilistic interpretations, inference of hyperparameters, model
selection, automatic relevance determination for input selection).

Extensions to unsupervised learning: kernel PCA (and the related methods kernel PLS,
CCA), density estimation (→ clustering).

Fixed-size LS-SVMs: large scale problems; adaptive learning machines; transductive.


Extensions to recurrent networks and control.
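The "solving linear systems" item refers to the LS-SVM formulation, where equality constraints and a squared error term turn the dual problem into a single linear system. Below is a minimal numpy sketch of that system for the LS-SVM classifier; the data, RBF kernel and regularization constant gamma are arbitrary choices for illustration.

```python
import numpy as np

# LS-SVM classifier dual system:
#   [ 0        y^T          ] [ b     ]   [ 0 ]
#   [ y   Omega + I/gamma   ] [ alpha ] = [ 1 ],  Omega_kl = y_k y_l K(x_k, x_l)
rng = np.random.default_rng(5)
N = 40
X = np.vstack([rng.normal(-1.0, 0.7, size=(N // 2, 2)),
               rng.normal(+1.0, 0.7, size=(N // 2, 2))])
y = np.hstack([-np.ones(N // 2), np.ones(N // 2)])
gamma, sigma = 10.0, 1.0

d2 = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
K = np.exp(-d2 / sigma**2)                       # RBF kernel matrix
Omega = np.outer(y, y) * K

A = np.block([[np.zeros((1, 1)), y[None, :]],
              [y[:, None], Omega + np.eye(N) / gamma]])
rhs = np.concatenate([[0.0], np.ones(N)])
sol = np.linalg.solve(A, rhs)                    # one linear system, no QP
b, alpha = sol[0], sol[1:]

# LS-SVM classifier: y(x) = sign( sum_k alpha_k y_k K(x, x_k) + b )
x_new = np.array([0.5, 0.2])
k_new = np.exp(-np.sum((X - x_new) ** 2, axis=1) / sigma**2)
print(np.sign((alpha * y) @ k_new + b))
```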

Towards a next generation of universal models?

[Figure: overview diagram relating model levels (Linear, Robust Linear, Kernel, Robust Kernel, LS-SVM, SVM) to methods and tasks (FDA, PCA, PLS, CCA, classifiers, regression, clustering, recurrent models)]


Research issues:
Large scale methods
Adaptive processing
Robustness issues
Statistical aspects
Application-specific kernels


Fixed-size LS-SVM (1)


[Figure: primal and dual space representations, linking the Nyström method, kernel PCA, density estimation, entropy criteria, regression, eigenfunctions and SV selection]

Modelling in view of primal-dual representations
Link between the Nyström approximation (GP), kernel PCA and density estimation
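A rough numpy sketch of the Nyström-based feature approximation alluded to above: an approximate feature map is built from the eigendecomposition of the kernel matrix on a small subset of m prototype points, so that a large-scale model can then be estimated in the primal. Plain random subsampling stands in here for the entropy-based selection mentioned on the slide; all sizes are illustrative.

```python
import numpy as np

def rbf_kernel_matrix(A, B, sigma=1.0):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / sigma**2)

rng = np.random.default_rng(3)
X = rng.normal(size=(2000, 5))          # large data set (illustrative)
m = 50                                  # number of selected prototype points
idx = rng.choice(len(X), size=m, replace=False)

K_mm = rbf_kernel_matrix(X[idx], X[idx])
K_nm = rbf_kernel_matrix(X, X[idx])

lam, U = np.linalg.eigh(K_mm)            # eigendecomposition of K_mm
lam = np.clip(lam, 1e-12, None)          # guard against tiny negative values
Phi = K_nm @ U / np.sqrt(lam)            # approximate feature map, shape (N, m)

# Phi @ Phi.T approximates the full N x N kernel matrix without forming it;
# a (regularized) linear model on Phi then plays the role of the primal model.
print(Phi.shape)
```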


Fixed-size LS-SVM (2)


high dimensional inputs, large data sets, adaptive learning machines (using LS-SVMlab)

[Figure: sinc function estimation (20,000 data points, 10 support vectors)]

[Figure: Santa Fe laser data; measured y_k and predicted values versus discrete time k]

Fixed-size LS-SVM (3)


[Figure: two-dimensional example results of fixed-size LS-SVM (axes from -2.5 to 2.5)]


The problem of learning and generalization (1)


Different mathematical settings exist, e.g.
Vapnik et al.:
Predictive learning problem (inductive inference)
Estimating values of functions at given points (transductive inference)
Vapnik V. (1998) Statistical Learning Theory, John Wiley & Sons, New York.

Poggio et al., Smale:


Estimate the true function f with an analysis of the approximation error and the sample
error (e.g. in an RKHS or a Sobolev space)
Cucker F., Smale S. (2002) On the mathematical foundations of learning theory, Bulletin of the AMS,
39(1), 1-49.

Goal: deriving bounds on the generalization error (these can be used to determine
regularization parameters and other tuning constants). For practical applications it is
important to obtain sharp bounds.

The problem of learning and generalization (2)


(see Pontil, ESANN 2003)

Random variables $x \in X$, $y \in Y \subseteq \mathbb{R}$
Draw i.i.d. samples from an (unknown) probability distribution $\rho(x, y)$
Generalization error: $E[f] = \int_{X,Y} L(y, f(x))\, \rho(x, y)\, dx\, dy$
Loss function $L(y, f(x))$; empirical error $E_N[f] = \dfrac{1}{N} \sum_{i=1}^{N} L(y_i, f(x_i))$
$f^* := \arg\min_{f} E[f]$ (true function); $f_N := \arg\min_{f} E_N[f]$
If $L(y, f) = (f - y)^2$ then $f^*(x) = \int_Y y\, \rho(y|x)\, dy$ (regression function)
Consider a hypothesis space $\mathcal{H}$ with $f_{\mathcal{H}} := \arg\min_{f \in \mathcal{H}} E[f]$
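A toy numerical illustration of these definitions: for a fixed candidate f, the empirical error $E_N[f]$ on a small i.i.d. sample fluctuates around the generalization error $E[f]$, which is approximated here by a very large held-out sample. The data distribution and the candidate f are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

def sample(n):                                 # draws from rho(x, y): y = sin(x) + noise
    x = rng.uniform(-3, 3, size=n)
    y = np.sin(x) + 0.3 * rng.normal(size=n)
    return x, y

f = lambda x: 0.8 * x - 0.1 * x ** 3           # some fixed hypothesis f
loss = lambda y, fx: (fx - y) ** 2             # squared loss L(y, f(x))

x_big, y_big = sample(1_000_000)               # Monte Carlo estimate of E[f]
print("generalization error (approx.):", np.mean(loss(y_big, f(x_big))))

x_n, y_n = sample(50)                          # empirical error E_N[f], N = 50
print("empirical error on N = 50 samples:", np.mean(loss(y_n, f(x_n))))
```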


The problem of learning and generalization (3)


generalization error = sample error + approximation error:
$E[f_N] - E[f^*] = \big(E[f_N] - E[f_{\mathcal{H}}]\big) + \big(E[f_{\mathcal{H}}] - E[f^*]\big)$

The approximation error depends only on $\mathcal{H}$ (not on the sampled examples)

Sample error: $E[f_N] - E[f_{\mathcal{H}}] \le \varepsilon(1/N, h, 1/\delta)$ (w.p. $1 - \delta$)
$\varepsilon$ is a non-decreasing function
$h$ measures the size of the hypothesis space $\mathcal{H}$
Overfitting when $h$ is large and $N$ is small (large sample error)

Goal: obtain a good trade-off between sample error and approximation error


Interdisciplinary challenges

NATO-ASI on Learning Theory and Practice, Leuven, July 2002
http://www.esat.kuleuven.ac.be/sista/natoasi/ltp2002.html

[Figure: SVM & kernel methods at the crossroads of neural networks, data mining, pattern recognition, machine learning, statistics, signal processing, systems and control theory, optimization, linear algebra and mathematics]

J.A.K. Suykens, G. Horvath, S. Basu, C. Micchelli, J. Vandewalle (Eds.), Advances in Learning Theory:
Methods, Models and Applications, NATO-ASI Series Computer and Systems Sciences, IOS Press, 2003.

Books, software, papers ...


www.kernel-machines.org  &  www.esat.kuleuven.ac.be/sista/lssvmlab/

Introductory papers:
C.J.C. Burges (1998) A tutorial on support vector machines for pattern recognition, Data Mining and
Knowledge Discovery, 2(2), 121-167.
A.J. Smola, B. Schölkopf (1998) A tutorial on support vector regression, NeuroCOLT Technical Report
NC-TR-98-030, Royal Holloway College, University of London, UK.
T. Evgeniou, M. Pontil, T. Poggio (2000) Regularization networks and support vector machines, Advances
in Computational Mathematics, 13(1), 1-50.
K.-R. Müller, S. Mika, G. Rätsch, K. Tsuda, B. Schölkopf (2001) An introduction to kernel-based learning
algorithms, IEEE Transactions on Neural Networks, 12(2), 181-201.
