
A (short) Introduction to Support Vector Machines and Kernel-based Learning


Johan Suykens
K.U. Leuven, ESAT-SCD-SISTA
Kasteelpark Arenberg 10
B-3001 Leuven (Heverlee), Belgium
Tel: 32/16/32 18 02 - Fax: 32/16/32 19 70
Email: johan.suykens@esat.kuleuven.ac.be
http://www.esat.kuleuven.ac.be/sista/members/suykens.html
ESANN 2003, Bruges April 2003

Overview
Disadvantages of classical neural nets
SVM properties and standard SVM classifier
Related kernel-based learning methods
Use of the kernel trick (Mercer Theorem)
LS-SVMs: extending the SVM framework
Towards a next generation of universally applicable models?
The problem of learning and generalization


Classical MLPs

[Figure: single neuron taking inputs x1, ..., xn with weights w1, ..., wn, bias b and activation h(.)]
Multilayer Perceptron (MLP) properties:

Universal approximation of continuous nonlinear functions


Learning from input-output patterns; either off-line or on-line learning
Parallel network architecture, multiple inputs and outputs
Use in feedforward and recurrent networks
Use in supervised and unsupervised learning applications
Problems: Existence of many local minima!
How many neurons needed for a given task?
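As a point of reference for the SVM models that follow, here is a minimal numpy sketch of the forward pass of such a one-hidden-layer network; all sizes and weights are illustrative, not taken from the slides. Training these weights by gradient descent is exactly where the local-minima problem mentioned above appears.

```python
import numpy as np

# Minimal sketch (illustrative only): forward pass of a one-hidden-layer MLP
# y = w2^T h(W1 x + b1) + b2, with tanh as the activation h(.).
rng = np.random.default_rng(0)

n_in, n_hidden = 3, 5                      # illustrative sizes
W1 = rng.normal(size=(n_hidden, n_in))     # input-to-hidden weights
b1 = rng.normal(size=n_hidden)             # hidden biases
w2 = rng.normal(size=n_hidden)             # hidden-to-output weights
b2 = 0.1                                   # output bias

def mlp_forward(x):
    hidden = np.tanh(W1 @ x + b1)          # nonlinear hidden layer h(.)
    return w2 @ hidden + b2                # linear output layer

print(mlp_forward(np.array([0.5, -1.0, 2.0])))
```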

Support Vector Machines (SVM)


[Figure: cost function versus weights for the MLP and for the SVM; the MLP cost surface has many local minima, while the SVM training problem is convex with a unique minimum]

Nonlinear classification and function estimation by convex optimization, with a unique solution and primal-dual interpretations.
Number of neurons automatically follows from a convex program.
Learning and generalization in huge dimensional input spaces (able to avoid the curse of dimensionality!).
Use of kernels (e.g. linear, polynomial, RBF, MLP, splines, ...).
Application-specific kernels possible (e.g. text mining, bioinformatics).

SVM: support vectors


[Figure: two example binary classification data sets in the (x1, x2) plane, with nonlinear decision boundaries determined by the encircled support vectors]

Decision boundary can be expressed in terms of a limited number of support vectors (a subset of the given training data): sparseness property.
Classifier follows from the solution to a convex QP problem.
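A hedged illustration of this sparseness, using scikit-learn rather than the solver behind the slides; the toy data set and hyperparameters are arbitrary choices for demonstration:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=-1.0, scale=0.6, size=(50, 2)),
               rng.normal(loc=+1.0, scale=0.6, size=(50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])

clf = SVC(kernel="rbf", C=1.0, gamma=1.0).fit(X, y)

# Only a subset of the 100 training points ends up as support vectors;
# the decision boundary depends on these points alone (sparseness).
print("number of support vectors per class:", clf.n_support_)
print("first support vectors:\n", clf.support_vectors_[:5])
```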

SVMs: living in two worlds ...


Primal space (→ large data sets):
Parametric: estimate $w \in \mathbb{R}^{n_h}$
$y(x) = \mathrm{sign}[w^T \varphi(x) + b]$

Dual space (→ high dimensional inputs):
Non-parametric: estimate $\alpha \in \mathbb{R}^{N}$
$y(x) = \mathrm{sign}\big[\sum_{i=1}^{\#sv} \alpha_i y_i K(x, x_i) + b\big]$

Kernel trick: $K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$

[Figure: network interpretation of the primal model (hidden layer $\varphi_1(x), \ldots, \varphi_{n_h}(x)$ with weights $w_1, \ldots, w_{n_h}$) and of the dual model (kernels $K(x, x_1), \ldots, K(x, x_{\#sv})$ with weights $\alpha_1, \ldots, \alpha_{\#sv}$); mapping $\varphi$ from input space to feature space]
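A small sketch of the kernel trick for a case where $\varphi(x)$ can actually be written down: the homogeneous degree-2 polynomial kernel $K(x,z) = (x^T z)^2$ has the explicit feature map of all pairwise products $x_i x_j$. The kernel choice is ours for illustration; the slides keep $\varphi$ generic.

```python
import numpy as np

def phi(x):
    # explicit feature map into R^(n^2): all products x_i * x_j
    return np.outer(x, x).ravel()

def K(x, z):
    # the same inner product, computed without ever forming phi
    return (x @ z) ** 2

x = np.array([1.0, 2.0, -0.5])
z = np.array([0.3, -1.0, 2.0])

print(phi(x) @ phi(z))   # primal-style inner product in feature space
print(K(x, z))           # dual-style evaluation via the kernel -> same value
```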

Standard SVM classifier (1)


Training set $\{x_i, y_i\}_{i=1}^{N}$: inputs $x_i \in \mathbb{R}^{n}$; class labels $y_i \in \{-1, +1\}$

Classifier:
$y(x) = \mathrm{sign}[w^T \varphi(x) + b]$
with $\varphi(\cdot): \mathbb{R}^{n} \rightarrow \mathbb{R}^{n_h}$ a mapping to a high dimensional feature space (which can be infinite dimensional!)

For separable data, assume
$w^T \varphi(x_i) + b \ge +1$ if $y_i = +1$ and $w^T \varphi(x_i) + b \le -1$ if $y_i = -1$,
i.e. $y_i[w^T \varphi(x_i) + b] \ge 1, \ \forall i$

Optimization problem (non-separable case):
$\min_{w,b,\xi} \ \mathcal{J}(w, \xi) = \frac{1}{2} w^T w + c \sum_{i=1}^{N} \xi_i \quad \text{s.t.} \quad y_i[w^T \varphi(x_i) + b] \ge 1 - \xi_i, \ \ \xi_i \ge 0, \ i = 1, \ldots, N$
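A minimal sketch of this primal soft-margin problem, restricted to the identity feature map $\varphi(x) = x$ (a linear SVM) so that $w$ can be optimized directly; it uses cvxpy as a generic convex solver, which is our choice for illustration, not the solver behind the slides. Data and $c$ are arbitrary.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(2)
N, n = 60, 2
X = np.vstack([rng.normal(-1.0, 0.7, size=(N // 2, n)),
               rng.normal(+1.0, 0.7, size=(N // 2, n))])
y = np.hstack([-np.ones(N // 2), np.ones(N // 2)])
c = 1.0

w = cp.Variable(n)
b = cp.Variable()
xi = cp.Variable(N)                       # slack variables xi_i

objective = cp.Minimize(0.5 * cp.sum_squares(w) + c * cp.sum(xi))
constraints = [cp.multiply(y, X @ w + b) >= 1 - xi,   # y_i (w^T x_i + b) >= 1 - xi_i
               xi >= 0]
cp.Problem(objective, constraints).solve()

print("w =", w.value, " b =", float(b.value))
```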

Standard SVM classifier (2)


Lagrangian:
$\mathcal{L}(w, b, \xi; \alpha, \nu) = \mathcal{J}(w, \xi) - \sum_{i=1}^{N} \alpha_i \{ y_i[w^T \varphi(x_i) + b] - 1 + \xi_i \} - \sum_{i=1}^{N} \nu_i \xi_i$

Find saddle point:
$\max_{\alpha, \nu} \ \min_{w, b, \xi} \ \mathcal{L}(w, b, \xi; \alpha, \nu)$

One obtains
$\dfrac{\partial \mathcal{L}}{\partial w} = 0 \ \Rightarrow \ w = \sum_{i=1}^{N} \alpha_i y_i \varphi(x_i)$
$\dfrac{\partial \mathcal{L}}{\partial b} = 0 \ \Rightarrow \ \sum_{i=1}^{N} \alpha_i y_i = 0$
$\dfrac{\partial \mathcal{L}}{\partial \xi_i} = 0 \ \Rightarrow \ 0 \le \alpha_i \le c, \ i = 1, \ldots, N$


Standard SVM classifier (3)


Dual problem: QP problem
$\max_{\alpha} \ Q(\alpha) = -\dfrac{1}{2} \sum_{i,j=1}^{N} y_i y_j K(x_i, x_j)\, \alpha_i \alpha_j + \sum_{j=1}^{N} \alpha_j \quad \text{s.t.} \quad \sum_{i=1}^{N} \alpha_i y_i = 0, \ \ 0 \le \alpha_i \le c, \ \forall i$

with kernel trick (Mercer Theorem): $K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$

Obtained classifier:
$y(x) = \mathrm{sign}\big[\sum_{i=1}^{N} \alpha_i y_i K(x, x_i) + b\big]$

Some possible kernels $K(\cdot, \cdot)$:
$K(x, x_i) = x_i^T x$ (linear SVM)
$K(x, x_i) = (x_i^T x + \tau)^d$ (polynomial SVM of degree $d$)
$K(x, x_i) = \exp\{-\|x - x_i\|_2^2 / \sigma^2\}$ (RBF kernel)
$K(x, x_i) = \tanh(\kappa\, x_i^T x + \theta)$ (MLP kernel)
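The kernels listed above and the dual-form classifier translate directly into code. In this sketch the multipliers $\alpha_i$ and bias $b$ are assumed to come from the dual QP (or any SVM solver); the values below are placeholders for illustration only.

```python
import numpy as np

def k_linear(x, xi):
    return xi @ x

def k_poly(x, xi, tau=1.0, d=3):
    return (xi @ x + tau) ** d

def k_rbf(x, xi, sigma=1.0):
    return np.exp(-np.sum((x - xi) ** 2) / sigma ** 2)

def k_mlp(x, xi, kappa=1.0, theta=0.0):
    return np.tanh(kappa * (xi @ x) + theta)

def svm_decision(x, X_sv, y_sv, alpha, b, kernel=k_rbf):
    # y(x) = sign( sum_i alpha_i y_i K(x, x_i) + b )
    s = sum(a * yi * kernel(x, xi) for a, yi, xi in zip(alpha, y_sv, X_sv))
    return np.sign(s + b)

# Hypothetical support vectors, labels and multipliers (not from a real fit):
X_sv = np.array([[0.0, 1.0], [1.0, -1.0], [-1.0, 0.5]])
y_sv = np.array([+1, -1, +1])
alpha = np.array([0.7, 0.9, 0.2])
b = 0.05

print(svm_decision(np.array([0.2, 0.3]), X_sv, y_sv, alpha, b))
```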

Kernel-based learning: many related methods and fields

[Figure: reproducing kernel Hilbert spaces (RKHS) as the common ground linking SVMs, regularization networks, LS-SVMs, Gaussian processes, kernel ridge regression and kriging]

Some early history on RKHS:
1910-1920: Moore
1940: Aronszajn
1951: Krige
1970: Parzen
1971: Kimeldorf & Wahba

SVMs are closely related to learning in Reproducing Kernel Hilbert Spaces.


Wider use of the kernel trick


Angle between vectors:
Input space: $\cos \theta_{x,z} = \dfrac{x^T z}{\|x\|_2 \|z\|_2}$
Feature space: $\cos \theta_{\varphi(x),\varphi(z)} = \dfrac{\varphi(x)^T \varphi(z)}{\|\varphi(x)\|_2 \|\varphi(z)\|_2} = \dfrac{K(x, z)}{\sqrt{K(x, x)}\,\sqrt{K(z, z)}}$

Distance between vectors:
Input space: $\|x - z\|_2^2 = (x - z)^T (x - z) = x^T x + z^T z - 2\, x^T z$
Feature space: $\|\varphi(x) - \varphi(z)\|_2^2 = K(x, x) + K(z, z) - 2\, K(x, z)$
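These two identities transcribe directly into code, here with an RBF kernel as an example (any Mercer kernel would do); $\varphi$ is never computed explicitly.

```python
import numpy as np

def K(x, z, sigma=1.0):
    # RBF kernel, the choice is ours for illustration
    return np.exp(-np.sum((x - z) ** 2) / sigma ** 2)

def feature_space_cosine(x, z):
    # cosine of the angle between phi(x) and phi(z)
    return K(x, z) / np.sqrt(K(x, x) * K(z, z))

def feature_space_distance_sq(x, z):
    # squared distance between phi(x) and phi(z)
    return K(x, x) + K(z, z) - 2.0 * K(x, z)

x = np.array([1.0, 0.5])
z = np.array([-0.2, 1.5])
print(feature_space_cosine(x, z), feature_space_distance_sq(x, z))
```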

LS-SVM models: extending the SVM framework


Linear and nonlinear classification and function estimation, applicable in high
dimensional input spaces; primal-dual optimization formulations.

Solving linear systems (a minimal sketch follows after this list); link with Gaussian processes, regularization networks and kernel
versions of Fisher discriminant analysis.

Sparse approximation and robust regression (robust statistics).


Bayesian inference (probabilistic interpretations, inference of hyperparameters, model
selection, automatic relevance determination for input selection).

Extensions to unsupervised learning: kernel PCA (and the related methods kernel PLS,
CCA), density estimation (→ clustering).

Fixed-size LS-SVMs: large scale problems; adaptive learning machines; transductive.


Extensions to recurrent networks and control.
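The "solving linear systems" item refers to the LS-SVM formulation, where equality constraints and a squared error term turn the dual problem into a single linear system. Below is a minimal numpy sketch of that system for the LS-SVM classifier; the data, RBF kernel and regularization constant gamma are arbitrary choices for illustration.

```python
import numpy as np

# LS-SVM classifier dual system:
#   [ 0        y^T          ] [ b     ]   [ 0 ]
#   [ y   Omega + I/gamma   ] [ alpha ] = [ 1 ],  Omega_kl = y_k y_l K(x_k, x_l)
rng = np.random.default_rng(5)
N = 40
X = np.vstack([rng.normal(-1.0, 0.7, size=(N // 2, 2)),
               rng.normal(+1.0, 0.7, size=(N // 2, 2))])
y = np.hstack([-np.ones(N // 2), np.ones(N // 2)])
gamma, sigma = 10.0, 1.0

d2 = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
K = np.exp(-d2 / sigma**2)                       # RBF kernel matrix
Omega = np.outer(y, y) * K

A = np.block([[np.zeros((1, 1)), y[None, :]],
              [y[:, None], Omega + np.eye(N) / gamma]])
rhs = np.concatenate([[0.0], np.ones(N)])
sol = np.linalg.solve(A, rhs)                    # one linear system, no QP
b, alpha = sol[0], sol[1:]

# LS-SVM classifier: y(x) = sign( sum_k alpha_k y_k K(x, x_k) + b )
x_new = np.array([0.5, 0.2])
k_new = np.exp(-np.sum((X - x_new) ** 2, axis=1) / sigma**2)
print(np.sign((alpha * y) @ k_new + b))
```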

Towards a next generation of universal models?

[Figure: overview diagram relating model levels (Linear, Robust Linear, Kernel, Robust Kernel, LS-SVM, SVM) to methods and tasks (FDA, PCA, PLS, CCA, classifiers, regression, clustering, recurrent models)]


Research issues:
Large scale methods
Adaptive processing
Robustness issues
Statistical aspects
Application-specific kernels


Fixed-size LS-SVM (1)


[Figure: primal and dual space representations, linking the Nyström method, kernel PCA, density estimation, entropy criteria, regression, eigenfunctions and SV selection]

Modelling in view of primal-dual representations
Link between the Nyström approximation (GP), kernel PCA and density estimation
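A rough numpy sketch of the Nyström-based feature approximation alluded to above: an approximate feature map is built from the eigendecomposition of the kernel matrix on a small subset of m prototype points, so that a large-scale model can then be estimated in the primal. Plain random subsampling stands in here for the entropy-based selection mentioned on the slide; all sizes are illustrative.

```python
import numpy as np

def rbf_kernel_matrix(A, B, sigma=1.0):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / sigma**2)

rng = np.random.default_rng(3)
X = rng.normal(size=(2000, 5))          # large data set (illustrative)
m = 50                                  # number of selected prototype points
idx = rng.choice(len(X), size=m, replace=False)

K_mm = rbf_kernel_matrix(X[idx], X[idx])
K_nm = rbf_kernel_matrix(X, X[idx])

lam, U = np.linalg.eigh(K_mm)            # eigendecomposition of K_mm
lam = np.clip(lam, 1e-12, None)          # guard against tiny negative values
Phi = K_nm @ U / np.sqrt(lam)            # approximate feature map, shape (N, m)

# Phi @ Phi.T approximates the full N x N kernel matrix without forming it;
# a (regularized) linear model on Phi then plays the role of the primal model.
print(Phi.shape)
```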


Fixed-size LS-SVM (2)


high dimensional inputs, large data sets, adaptive learning machines (using LS-SVMlab)

[Figure: sinc function estimation (20,000 data points, 10 support vectors)]

[Figure: Santa Fe laser data; measured y_k and predicted values versus discrete time k]

Fixed-size LS-SVM (3)


[Figure: two-dimensional example results of fixed-size LS-SVM (axes from -2.5 to 2.5)]


The problem of learning and generalization (1)


Different mathematical settings exist, e.g.
Vapnik et al.:
Predictive learning problem (inductive inference)
Estimating values of functions at given points (transductive inference)
Vapnik V. (1998) Statistical Learning Theory, John Wiley & Sons, New York.

Poggio et al., Smale:


Estimate the true function f with an analysis of the approximation error and the sample
error (e.g. in an RKHS or a Sobolev space)
Cucker F., Smale S. (2002) On the mathematical foundations of learning theory, Bulletin of the AMS,
39(1), 1-49.

Goal: deriving bounds on the generalization error (these can be used to determine
regularization parameters and other tuning constants). For practical applications it is
important to obtain sharp bounds.

The problem of learning and generalization (2)


(see Pontil, ESANN 2003)

Random variables $x \in X$, $y \in Y \subseteq \mathbb{R}$
Draw i.i.d. samples from an (unknown) probability distribution $\rho(x, y)$
Generalization error: $E[f] = \int_{X,Y} L(y, f(x))\, \rho(x, y)\, dx\, dy$
Loss function $L(y, f(x))$; empirical error $E_N[f] = \dfrac{1}{N} \sum_{i=1}^{N} L(y_i, f(x_i))$
$f^* := \arg\min_{f} E[f]$ (true function); $f_N := \arg\min_{f} E_N[f]$
If $L(y, f) = (f - y)^2$ then $f^*(x) = \int_Y y\, \rho(y|x)\, dy$ (regression function)
Consider a hypothesis space $\mathcal{H}$ with $f_{\mathcal{H}} := \arg\min_{f \in \mathcal{H}} E[f]$
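A toy numerical illustration of these definitions: for a fixed candidate f, the empirical error $E_N[f]$ on a small i.i.d. sample fluctuates around the generalization error $E[f]$, which is approximated here by a very large held-out sample. The data distribution and the candidate f are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

def sample(n):                                 # draws from rho(x, y): y = sin(x) + noise
    x = rng.uniform(-3, 3, size=n)
    y = np.sin(x) + 0.3 * rng.normal(size=n)
    return x, y

f = lambda x: 0.8 * x - 0.1 * x ** 3           # some fixed hypothesis f
loss = lambda y, fx: (fx - y) ** 2             # squared loss L(y, f(x))

x_big, y_big = sample(1_000_000)               # Monte Carlo estimate of E[f]
print("generalization error (approx.):", np.mean(loss(y_big, f(x_big))))

x_n, y_n = sample(50)                          # empirical error E_N[f], N = 50
print("empirical error on N = 50 samples:", np.mean(loss(y_n, f(x_n))))
```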


The problem of learning and generalization (3)


generalization error = sample error + approximation error:
$E[f_N] - E[f^*] = \big(E[f_N] - E[f_{\mathcal{H}}]\big) + \big(E[f_{\mathcal{H}}] - E[f^*]\big)$

The approximation error depends only on $\mathcal{H}$ (not on the sampled examples)

Sample error: $E[f_N] - E[f_{\mathcal{H}}] \le \varepsilon(1/N, h, 1/\delta)$ (w.p. $1 - \delta$)
$\varepsilon$ is a non-decreasing function
$h$ measures the size of the hypothesis space $\mathcal{H}$
Overfitting when $h$ is large and $N$ is small (large sample error)

Goal: obtain a good trade-off between sample error and approximation error


Interdisciplinary challenges

NATO-ASI on Learning Theory and Practice, Leuven, July 2002
http://www.esat.kuleuven.ac.be/sista/natoasi/ltp2002.html

[Figure: SVM & kernel methods at the crossroads of neural networks, data mining, pattern recognition, machine learning, statistics, signal processing, systems and control theory, optimization, linear algebra and mathematics]

J.A.K. Suykens, G. Horvath, S. Basu, C. Micchelli, J. Vandewalle (Eds.), Advances in Learning Theory:
Methods, Models and Applications, NATO-ASI Series Computer and Systems Sciences, IOS Press, 2003.

Books, software, papers ...


www.kernel-machines.org  &  www.esat.kuleuven.ac.be/sista/lssvmlab/

Introductory papers:
C.J.C. Burges (1998) A tutorial on support vector machines for pattern recognition, Data Mining and
Knowledge Discovery, 2(2), 121-167.
A.J. Smola, B. Schölkopf (1998) A tutorial on support vector regression, NeuroCOLT Technical Report
NC-TR-98-030, Royal Holloway College, University of London, UK.
T. Evgeniou, M. Pontil, T. Poggio (2000) Regularization networks and support vector machines, Advances
in Computational Mathematics, 13(1), 1-50.
K.-R. Müller, S. Mika, G. Rätsch, K. Tsuda, B. Schölkopf (2001) An introduction to kernel-based learning
algorithms, IEEE Transactions on Neural Networks, 12(2), 181-201.
