Learning From Data: 8: Support Vector Machines (SVM)


Learning From Data

8: Support Vector Machines (SVM)


Jörg Schäfer
Frankfurt University of Applied Sciences
Department of Computer Sciences
Nibelungenplatz 1
D-60318 Frankfurt am Main

Wissen durch Praxis stärkt


Content
Recap
Support Vector Machines
Duality
Non-Separability
Kernel Trick
VC Dimension
Further Ideas
Evaluation
Demo
Bibliography

Until recently, philosophy was based on the very simple idea that
the world is simple. In machine learning, for the first time, we
have examples where the world is not simple.
– Vladimir Vapnik

Recap: Perceptron and Linear Model
Remember: We separate using a separating hyperplane.

[Scatter plot: two classes, “o” and “+”, in the (x1, x2) plane, separated by a line.]

But which hyperplane is “best”? (There are infinitely many.)

Maximum Margin
Idea: Use the one with maximum margin:

[Scatter plot: the same two classes, “o” and “+”, now separated by the hyperplane with maximum margin.]

Maximum Margin (cont.)

As it turns out, this is a powerful idea, because

- it gives a unique solution,
- one can prove that not all data vectors contribute but only so-called support vectors,
- this results in a lower VC dimension, which supports generalization,
- it can be extended to non-linear boundaries using the kernel trick,
- it has a powerful mathematical foundation,
- the computation is a convex optimization not suffering from local optima,
- it has been proven to be very successful.

SVMs: A New Kid on the Block in the 80s/90s

Before SVMs:
- Almost all learning methods used linear decision surfaces.
- Linear models have nice theoretical properties, but are limited.
- Neural Networks (NN) allow for efficient learning of non-linear decision surfaces, however
  - NNs suffer from local minima,
  - there is still little theoretical basis for NNs.

After SVMs [CV95]:
- We have efficient learning algorithms for non-linear functions.
- We have a learning algorithm thoroughly based on the computational learning theory developed by Vapnik et al. [CV95].

Key Ideas

- Separate non-linear regions using “kernel functions”.
- This is possible as the computations only depend on so-called “dot products” (inner products – a measure of similarity).
- The underlying mathematics is a quadratic optimization problem, which avoids the local-minimum issues of, e.g., neural nets.
- The resulting learning algorithm is an optimization algorithm rather than a greedy search.
- SVMs can be used for classification and regression.

Support Vectors

Support Vectors
- are the data points closest to the separating hyperplane,
- correspond to the most difficult points to classify,
- define the separating hyperplane,
- are usually a much smaller subset of all data points – this reduces the effective VC dimension (which is good!) and explains why SVMs perform well.

Intuition: The boundaries of the classes are more “informative” than the overall distribution of the classes.

Linear Separability

Assume linear separability for now (general case later).

- In two dimensions, we separate by a line; in higher dimensions, by a hyperplane.
- Input: training pairs $(x_i, y_i)$ of features as usual. We assume the $y_i$ to be classifier labels for now.
- Output: a set of weights $w_i$, one for each feature, whose linear combination predicts the value of $y$.
- We will maximize the margin.
- As it turns out, only a few of the weights will be nonzero – they correspond to the points that “matter” in deciding the separating line (hyperplane): the support vectors.

The Math

[Scatter plot: classes “o” and “+” in the (x1, x2) plane.]

Find $a$, $b$, and $c$ such that

$a x_1 + b x_2 \ge c$ for ◦ ($+1$) and
$a x_1 + b x_2 < c$ for + ($-1$).

This is messy, so we will use normal forms instead.
Normal Forms
Any hyperplane is fully described by its normal vector¹ $w \in \mathbb{R}^n$ and its so-called bias $b \in \mathbb{R}$.
Then we decide whether a point belongs to class $\pm 1$ by

$h_w(x) = \mathrm{sign}(\langle w, x \rangle)$

or

$h_w(x) = \mathrm{sign}(\langle w, x \rangle + b).$

Definition 1
The geometric margin of a hyperplane $w$ with respect to a dataset $D$ is the shortest distance from a training point $x_i$ to the hyperplane defined by $w$.

¹ i.e. $w$ perpendicular to the plane and $\|w\| = 1$
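As a small illustration (not part of the original slides, in base R as used by the course demo), the decision function and the geometric margin of Definition 1 can be written down directly:

# Decision function h_w(x) = sign(<w, x> + b) for a hypothetical weight vector w and bias b
h <- function(x, w, b) sign(sum(w * x) + b)

# Geometric margin of a data matrix X (one point per row) with respect to (w, b),
# i.e. the shortest distance of any point to the hyperplane <w, x> + b = 0
geom_margin <- function(X, w, b) min(abs(X %*% w + b)) / sqrt(sum(w^2))

w <- c(1, -2); b <- 0.5                      # hypothetical values
h(c(2, 1), w, b)                             # sign(1*2 - 2*1 + 0.5) = 1
geom_margin(rbind(c(2, 1), c(0, 1)), w, b)   # distance of the closer of the two points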
Primal Form

Thus we try to maximize the geometric margin, i.e.

$\max_{w,b} \min_{x_i} \left| \left\langle x_i, \tfrac{w}{\|w\|} \right\rangle + b \right|$

subject to $\mathrm{sign}(\langle w, x_i \rangle + b) = \mathrm{sign}(y_i)$.

However, in this form this is horrible – so let us simplify this.

Simplification 1: Linearizing Classes

Note that $\mathrm{sign}(\langle w, x_i \rangle + b) = \mathrm{sign}(y_i)$ is equivalent to

$(\langle w, x_i \rangle + b)\, y_i \ge 0.$

This is still linear.

Simplification 2: Symmetry

- The maximum-margin hyperplane lies exactly halfway between the positive and negative classes.
- Indeed, if this were not the case, denote by $x_+$ and $x_-$ the closest points in the positive and negative classes. Then $\langle x_+, w \rangle + b = -(\langle x_-, w \rangle + b)$, because otherwise we could move the decision boundary along $w$ until the two distances are exactly equal, increasing the margin.

Simplification 3: Scale Invariance
- We can increase the length of $w$ arbitrarily without changing the constraints.
- The true distance of a point $x$ is always $\langle x, \tfrac{w}{\|w\|} \rangle$, measured in units of $\tfrac{1}{\|w\|}$.
- Let $x$ be the closest point to the hyperplane. Then we can scale $w$ such that $\langle x, w \rangle = 1$.
- The constraints have to be replaced by $\langle x_i, w \rangle y_i \ge 1$ (why?).
- Furthermore, we no longer need to ask which point is closest to the candidate hyperplane, because after all, we never cared which point it was!
- All we cared about was how far away that closest point was. And we now know that it is exactly $\tfrac{1}{\|w\|}$ away.
- As a consequence, we get rid of the max and min – we just have to minimize $\|w\|$, or equivalently $\tfrac{1}{2}\|w\|^2$.
Primal Form

Definition 2 (Primal Form)
The Primal Form of the SVM is the following optimization problem:

$\arg\min_{w,b} \tfrac{1}{2}\|w\|^2$

subject to $(\langle w, x_i \rangle + b)\, y_i \ge 1$ for all $i \in \{1, \dots, m\}$.

Note: This is a quadratic programming problem with a unique solution.

By the way: At https://jeremykun.com/2017/06/05/formulating-the-support-vector-machine-optimization-problem/ you find a nice demo visualization.
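A minimal sketch of fitting this primal on toy data, assuming the e1071 package (the lecture's own code lives in SVM.R); a very large cost approximates the hard-margin problem of Definition 2:

library(e1071)   # provides svm(), an interface to libsvm

set.seed(1)
X <- rbind(matrix(rnorm(20, mean =  2), ncol = 2),   # class +1
           matrix(rnorm(20, mean = -2), ncol = 2))   # class -1
y <- factor(c(rep(+1, 10), rep(-1, 10)))

# kernel = "linear" and a large cost approximate the hard-margin primal
fit <- svm(X, y, kernel = "linear", cost = 1e3, scale = FALSE)

fit$index                       # indices of the support vectors
w <- t(fit$coefs) %*% fit$SV    # w = sum_j alpha_j y_j x_j (coefs store alpha_j * y_j)
b <- -fit$rho                   # bias term of the fitted hyperplane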

Optimization Problems

- In general, optimization problems are hard to solve.
- However, sometimes we are lucky.
- Remember linear problems? They can be solved with linear programming.
- Likewise, linear problems with linear constraints can be solved using Lagrangians or Lagrange multipliers.
- For SVMs we have a so-called “convex quadratic” optimization problem (the objective is a quadratic function and the constraints form a convex set – linear inequalities and equalities).
- Fortunately, there exists a result due to Karush, Kuhn, and Tucker, the KKT theorem [KT51], that covers this case.

Karush, Kuhn, and Tucker (KKT) Theorem
Theorem 3
Suppose you have an optimization problem in $\mathbb{R}^n$ of the following form:

$\arg\min_{x} f(x), \quad \text{subject to } g_i(x) \le 0,\ i = 1, \dots, m,$

where $f$ is a differentiable function and the $g_i$ are affine polynomials. Suppose $z$ is a local minimum of $f$. Then there exist constants (called KKT or Lagrange multipliers) $\alpha_1, \dots, \alpha_m$ such that the following are true:

$-\nabla f(z) = \sum_{i=1}^{m} \alpha_i \nabla g_i(z)$   (gradient of Lagrangian is zero)   (1)
$g_i(z) \le 0 \quad \forall i = 1, \dots, m$   (primal constraints are satisfied)   (2)
$\alpha_i \ge 0 \quad \forall i = 1, \dots, m$   (dual constraints are satisfied)   (3)
$\alpha_i g_i(z) = 0 \quad \forall i = 1, \dots, m$   (complementary slackness conditions)   (4)

For the special case of convex functions and convex constraints the reverse is true as well (equivalence).
Interpreting Karush, Kuhn, and Tucker (KKT) Theorem
1. The first equation requires that the generalized Lagrangian has gradient zero. This implies that the primal objective is at a local minimum.
2. The second equation implies that the original constraints (of the primal) are satisfied.
3. The third equation implies that the constraints of the dual are satisfied.
4. The fourth implies that the primal and dual are interrelated:

$\alpha_i g_i(z) = 0 \quad \forall i = 1, \dots, m.$

This is also called the “slackness” condition:
4.1 Either the dual constraint is tight, i.e. $\alpha_i = 0$, or the primal constraint $g_i$ has to be tight, i.e. $g_i(z) = 0$.
4.2 Thus, at least one of the two must be exactly zero.
4.3 For the SVM problem, slackness implies that, for the optimal separating hyperplane $w$, if a data point does not have functional margin exactly 1, then it is not a support vector.
Interpreting the KKT Theorem (cont.)

The function

$L(x, \alpha) := f(x) + \sum_{i=1}^{m} \alpha_i g_i(x)$

is called the Lagrangian of the problem.

Condition (1) above is equivalent to

$\nabla L \overset{!}{=} 0.$

Interpreting the KKT Theorem (cont.)
Why would one solve the dual instead of the primal?
- First, the primal ($\arg\min_{w} \tfrac{1}{2}\|w\|^2$) requires a minimization over all $w$ subject to a condition ($(\langle w, x_i \rangle + b)\, y_i \ge 1$) on all data vectors. This might be expensive.
- The dual instead – because of the slackness – only involves a smaller number of support vectors (sparseness). This is computationally more attractive.
- The dual only depends on the inner product. We will take advantage of that when we introduce the kernel trick (see below).

For more information, see
http://www.csc.kth.se/utbildning/kth/kurser/DD3364/Lectures/KKT.pdf,
https://cs.stanford.edu/people/davidknowles/lagrangian_duality.pdf or
https://www.quora.com/What-is-the-meaning-of-the-value-of-a-Lagrange-multiplier-when-doi
answer/Balaji-Pitchai-Kannu
Interpreting the KKT Theorem (cont.)

https://web.archive.org/web/20210506170321/http://www.onmyphd.com/?p=kkt.karush.kuhn.tucker

Interpreting the KKT Theorem (cont.)
Let us compute the Lagrangian:

$L(w, b, \alpha) = \tfrac{1}{2}\|w\|^2 + \sum_{j=1}^{m} \alpha_j \left(1 - y_j(\langle w, x_j \rangle + b)\right)$
$\qquad\qquad\;\, = \tfrac{1}{2}\|w\|^2 + \sum_{j=1}^{m} \alpha_j - \sum_{j=1}^{m} \alpha_j y_j \langle w, x_j \rangle - \sum_{j=1}^{m} \alpha_j y_j b$

We have

$\frac{\partial L}{\partial b} = -\sum_{j=1}^{m} y_j \alpha_j.$

- Condition (1) of the KKT theorem implies that $\sum_{j=1}^{m} y_j \alpha_j = 0$.
- Hence, adding or subtracting this term to the solution does not change it.
- Thus, we can get rid of the term $b \sum_{j=1}^{m} y_j \alpha_j$ altogether.

Interpreting the KKT Theorem (cont.)

Furthermore,

$\frac{\partial L}{\partial w_i} = w_i - \sum_{j=1}^{m} \alpha_j y_j x_{j,i} \overset{!}{=} 0,$

implies

$w = \sum_{j=1}^{m} \alpha_j y_j x_j.$

This means that we can find the weights as a linear combination of the data points. Also, because of slackness many contributions will be zero, i.e. the optimal solution can be written as a sparse sum of training examples!

Interpreting the KKT Theorem (cont.)

Substituting everything back into the Lagrangian, we get:

$L(\alpha) = \sum_{j=1}^{m} \alpha_j - \frac{1}{2}\sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle,$

and indeed, this depends only on the inner products of the training vectors!
Hence we can summarize everything in the following lemma:

Interpreting the KKT Theorem (cont.)
Lemma 4
Solving the SVM primal is equivalent to solving the Dual Problem, i.e. computing the max of the Lagrangian

$\max_{\alpha} L(\alpha, x)$

where the Lagrangian is defined as

$L(\alpha, x) = \sum_{j=1}^{m} \alpha_j - \frac{1}{2}\sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle,$

subject to

$\alpha_j \ge 0$
$\alpha_j \left(1 - y_j(\langle w, x_j \rangle + b)\right) = 0$
$\sum_{j=1}^{m} y_j \alpha_j = 0.$

Solving the Dual

How can one solve the dual problem?

Answers include:
1. Standard quadratic optimization procedures [BV04] (online:
https://web.stanford.edu/~boyd/cvxbook/)
2. Sequential minimal optimization [Pla98], [Pla99]
3. Modified gradient projection [CV95]
4. Sub-gradient descent and coordinate descent [SKR85]

For a summary, see


https://www.csie.ntu.edu.tw/~cjlin/papers/bottou_lin.pdf
or listen to the seminar talk!
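As a small illustration of option 1 above – only a sketch, assuming the quadprog package and numeric ±1 labels – the (soft-margin) dual can be handed directly to a standard QP solver:

library(quadprog)  # solve.QP minimizes 1/2 a'Da - d'a subject to A'a >= b0 (first meq rows as equalities)

svm_dual <- function(X, y, C = 10) {
  m  <- nrow(X)
  Q  <- (y %*% t(y)) * (X %*% t(X))          # Q_ij = y_i y_j <x_i, x_j>
  D  <- Q + diag(1e-8, m)                    # tiny ridge keeps D numerically positive definite
  d  <- rep(1, m)                            # maximize sum(alpha) - 1/2 alpha' Q alpha
  A  <- cbind(y, diag(m), -diag(m))          # sum_j y_j alpha_j = 0, alpha_j >= 0, alpha_j <= C
  b0 <- c(0, rep(0, m), rep(-C, m))
  alpha <- solve.QP(D, d, A, b0, meq = 1)$solution

  w  <- colSums(alpha * y * X)               # w = sum_j alpha_j y_j x_j
  sv <- which(alpha > 1e-5)                  # support vectors: nonzero multipliers
  on_margin <- which(alpha > 1e-5 & alpha < C - 1e-5)
  b  <- mean(y[on_margin] - X[on_margin, , drop = FALSE] %*% w)
  list(alpha = alpha, w = w, b = b, support = sv)
}
# Usage: svm_dual(X, y) with X an m x n matrix and y a numeric vector of +1/-1 labels.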

What about Non-Separability?
Reasons for non-separability:
- The data is genuinely non-linear. In this case, we will apply the kernel trick.
- The data is noisy. In this case, we should allow for some data entries that are mis-classified due to noise, to avoid overfitting.

So far, the primal $\tfrac{1}{2}\|w\|^2$ and its corresponding dual $L(\alpha) = \sum_{j=1}^{m} \alpha_j - \tfrac{1}{2}\sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle$ follow a so-called hard-margin approach, as we do not allow for mis-classified entries at all.

In the so-called soft-margin approach we allow for this in a controlled manner:
- Introduce slack variables $\xi_i$ on the margin and allow for points violating the margin by $\xi_i$.
- Introduce a cost variable $C$ controlling how much we weight this in the optimization procedure (a form of regularization), to balance data-fitting and over-fitting.
Soft Margin
First the primal form:

Soft Margin SVM – Primal
Minimize

$\frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i$

subject to $y_i(\langle w, x_i \rangle + b) \ge 1 - \xi_i$ (and $\xi_i \ge 0$).

A large value of $C$ implies a high penalty for violating the margin – thus we get closer to a hard-margin approach. A smaller value allows for fitting the training data better (beware of overfitting).

As it turns out, the dual is surprisingly simple:

Soft Margin SVM – Dual
We maximize the same(!) Lagrangian, but subject to $0 \le \alpha_i \le C$ (instead of $0 \le \alpha_i$), i.e. the slack variables do not even appear in the formulation (and hence not in the solution either).

The role of $C$ becomes even more obvious in the dual approach.
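A small sketch (again assuming e1071, whose cost argument corresponds to $C$ above) of how $C$ controls the softness of the margin – a smaller $C$ tolerates more margin violations and typically yields more support vectors:

library(e1071)
set.seed(2)
X <- matrix(rnorm(200), ncol = 2)
y <- factor(ifelse(X[, 1] + X[, 2] + 0.5 * rnorm(100) > 0, +1, -1))  # noisy, roughly linear labels

for (C in c(0.01, 1, 100)) {
  fit <- svm(X, y, kernel = "linear", cost = C, scale = FALSE)
  cat("C =", C, "-> support vectors:", nrow(fit$SV), "\n")
}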


Summary

- Typically, the support vectors will be a small proportion of the training data.
- However, if the problem is non-separable or has a small margin, then any data point that is misclassified or lies within the margin has a non-zero $\alpha_i$. This will slow down any optimization process.
- On the other hand, if the number of support vectors becomes large and approaches the number of all data points, this is an indicator of bad generalization capability of this SVM.
- We should regard $C$ as another parameter of the training process (we can use cross-validation to optimize it; see the sketch below).
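For instance, a cross-validation over $C$ might look like this minimal sketch (assuming e1071's tune.svm; any hand-written CV loop would do as well):

library(e1071)
set.seed(2)
X <- matrix(rnorm(200), ncol = 2)
y <- factor(ifelse(X[, 1] + X[, 2] + 0.5 * rnorm(100) > 0, +1, -1))

cv <- tune.svm(X, y, kernel = "linear", cost = 10^(-2:3))  # k-fold cross-validation by default
cv$best.parameters    # the cost value with the lowest cross-validated error
cv$best.performance   # the corresponding error estimate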

Non-Linear
Remember:
[Figure: a two-class data set that is not linearly separable in the (x1, x2) plane is mapped by a feature map φ into new coordinates (x1, r), where it becomes linearly separable.]

We can do more powerful things with SVMs, because only the inner product (aka “dot” product) enters the equations of the dual!

Intuition

- Inner products provide some measure of (geometric) “similarity”.
- For example, the inner product in 2D between two vectors of unit length returns the cosine of the angle between them – i.e. if they are parallel their inner product is 1 (complete similarity); if they are perpendicular (complete independence/difference) their inner product is 0.
- Inner products can be used in geometry to define norms and distances.
- Thus, we can try to use them to define similarity in abstract feature spaces, too.

Kernel Trick

Key Ideas:
1. Suppose we can find a mapping $\Phi : X = \mathbb{R}^n \to Z$ of the feature space into another (higher-dimensional, possibly even infinite-dimensional) space $Z$ so that the problem becomes linearly separable.
2. As the mapping (and the inner product) may become difficult or impossible to compute in $Z$, we further assume that we have a so-called kernel function² $K : X \times X \to \mathbb{R}$ such that

$\langle \Phi(x_i), \Phi(x_j) \rangle = K(x_i, x_j).$

Then we can solve the optimization problem without ever computing anything in $Z$.

² satisfying some properties
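A tiny numeric check of this identity (not from the slides): for the polynomial kernel $K_2(x, y) = (\langle x, y \rangle + 1)^2$ on $\mathbb{R}^2$, one explicit feature map is $\Phi(x) = (1, \sqrt{2}x_1, \sqrt{2}x_2, x_1^2, x_2^2, \sqrt{2}x_1 x_2)$, and the kernel value equals the inner product after mapping:

Phi <- function(x) c(1, sqrt(2) * x[1], sqrt(2) * x[2], x[1]^2, x[2]^2, sqrt(2) * x[1] * x[2])
K2  <- function(x, y) (sum(x * y) + 1)^2

x <- c(0.3, -1.2); y <- c(2.0, 0.7)
sum(Phi(x) * Phi(y))   # inner product computed in the mapped space Z
K2(x, y)               # same value (0.5776), computed without ever mapping into Z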
SVM Kernel Trick
Lemma 5
Let $K : X \times X \to \mathbb{R}$ be a kernel function, i.e. let $K$ be a (continuous) symmetric non-negative definite function. Then the following problem is well defined.
Solve the max-min of the Lagrangian, defined as

$L(\alpha) = \sum_{j=1}^{m} \alpha_j - \frac{1}{2}\sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j K(x_i, x_j),$

subject to

$C \ge \alpha_j \ge 0$
$\alpha_j \left(1 - y_j(\langle w, x_j \rangle + b)\right) = 0$
$\sum_{j=1}^{m} y_j \alpha_j = 0.$

Mercer’s Theorem

Kernels can be characterized by Mercer's theorem:

Theorem 6 (Mercer's Theorem)
Let $K$ be a (continuous) symmetric non-negative definite function. Then $K$ has a representation using eigenfunctions $e_i(x)$:

$K(x, y) = \sum_{i=1}^{\infty} \lambda_i e_i(x) e_i(y)$

In practice, however, common kernels are usually constructed explicitly.

Examples of Kernels

Examples of Kernels:

$K(x, y) = \langle x, y \rangle$ – standard, linear kernel
$K_p(x, y) = (\langle x, y \rangle + 1)^p$ – polynomial kernel
$K_r(x, y) = \exp\left(-\frac{\|x - y\|^2}{2\sigma^2}\right)$ – Gaussian or radial kernel (RBF)
$K_s(x, y) = \tanh(\kappa \langle x, y \rangle - \delta)$ – sigmoid kernel

Note that one can construct kernels from simpler ones by sums and/or products.
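The same kernels as plain R functions, plus an empirical check of non-negative definiteness via the eigenvalues of a Gram matrix (a sketch; the values of σ, κ, δ, p are chosen here purely for illustration):

K_lin  <- function(x, y) sum(x * y)
K_poly <- function(x, y, p = 2) (sum(x * y) + 1)^p
K_rbf  <- function(x, y, sigma = 1) exp(-sum((x - y)^2) / (2 * sigma^2))
K_sig  <- function(x, y, kappa = 1, delta = 0) tanh(kappa * sum(x * y) - delta)

set.seed(3)
X <- matrix(rnorm(20), ncol = 2)   # 10 sample points in R^2
G <- outer(1:10, 1:10, Vectorize(function(i, j) K_rbf(X[i, ], X[j, ])))
min(eigen(G, symmetric = TRUE)$values)   # non-negative (up to rounding) for the RBF kernel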

Examples of Radial Kernel

[Figure: “SVM classification plot” of a radial (RBF) kernel decision boundary on two-dimensional data (axes x1 and x2); support vectors are marked with “x”, the remaining points with “o”.]

VC Dimension of SVMs

- It can be shown that the VC dimension of a linear SVM in $d$ dimensions is $d + 1$ [CV95].
- The VC dimension of the RBF kernel is infinite (despite this fact, it works well in practice, though mostly due to regularization – stay tuned for the next lecture!).

Multiclass

So far, we considered only binary classification.
For multi-class classification one can
- build binary classifiers which distinguish between one of the labels and the rest (one-versus-all) or between every pair of classes (one-versus-one),
- classify new instances in the one-versus-all case by a winner-takes-all strategy, i.e. the classifier with the highest output function assigns the class (see the sketch below),
- use, for the one-versus-one approach, a max-wins voting strategy, i.e. the class with the most votes determines the instance classification,
- take a directed acyclic graph SVM, or
- use error-correcting output codes.
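A one-versus-all sketch on the iris data, assuming e1071 (note that e1071's svm() already handles multi-class problems internally via one-versus-one):

library(e1071)
data(iris)
X <- as.matrix(iris[, 1:4]); y <- iris$Species

# One binary "class k vs. rest" model per class; each reports P(class k) via Platt scaling
scores <- sapply(levels(y), function(k) {
  yk  <- factor(ifelse(y == k, "pos", "rest"), levels = c("pos", "rest"))
  fit <- svm(X, yk, kernel = "linear", probability = TRUE)
  attr(predict(fit, X, probability = TRUE), "probabilities")[, "pos"]
})

pred <- levels(y)[max.col(scores)]   # winner-takes-all over the per-class scores
mean(pred == y)                      # (training) accuracy of the one-versus-all scheme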

Support Vector Regression (SVR)
SVMs can also be used for regression, i.e. when $y_i \in \mathbb{R}$ instead of $y_i \in \{\pm 1\}$.
Key idea, introduced in [DBK+96] and called SVR:
- It follows the same line of reasoning as the SVM.
- Intuitively, it tries to fit a line to the data by minimising a cost function.
- With the kernel trick it provides a non-linear regression, i.e. fitting a curve rather than a straight line.
- Technically, one introduces a margin error term $\varepsilon$ and solves the following minimization problem (primal)

$\arg\min_{w} \tfrac{1}{2}\|w\|^2$

subject to $|y_i - (\langle w, x_i \rangle - b)| \le \varepsilon$.
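A minimal SVR sketch, assuming e1071 (its type "eps-regression" implements the ε-insensitive loss above, here combined with an RBF kernel for a non-linear fit):

library(e1071)
set.seed(4)
x <- seq(-3, 3, length.out = 100)
y <- sin(x) + 0.1 * rnorm(100)                      # noisy non-linear target

fit  <- svm(x, y, type = "eps-regression", kernel = "radial", epsilon = 0.1, cost = 10)
yhat <- predict(fit, x)

plot(x, y); lines(x, yhat, col = 2)                 # fitted curve vs. noisy data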

Relevance Vector Machine (RVM)

- SVMs do not have a probabilistic interpretation (no Bayesian interpretation) – for instance, they do not give a probability for a (binary) class assignment.
- To remedy this, one could either follow an approach suggested by Wenzel et al. [WGDK17] to enhance SVMs such that Bayesian thinking can be applied, or
- use a Relevance Vector Machine (RVM) [Tip01], which uses Bayesian inference to obtain solutions for regression and probabilistic classification.
- The RVM has an identical functional form to the support vector machine, but provides probabilistic classification.

Pros and Cons of SVMs
Advantages of SVMs
- Mathematical foundation
- Kernel trick
- Regularization parameter
- Global minimum (does not get stuck in local minima)
- Fast, if there is not too much data
- Excellent classification performance for many tasks

Disadvantages of SVMs
- Regularization and kernel parameters $C, \sigma, \kappa, \dots$ have to be estimated – there is no specific theory, thus
- model selection is still done empirically by validation
- Recently outperformed in (visual) pattern recognition by deep convolutional networks
- No probabilistic interpretation
- Black box
- SVMs are memory-intensive, as one needs to store the kernel matrix in memory.
Further Reading

A good online reference is

https://www.svm-tutorial.com/

Demo

Let’s play!

Use the SVM.R file in the code examples (based on http://www.statslab.cam.ac.uk/~tw389/teaching/SLP18/Practical07sol.pdf).
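SVM.R itself is not reproduced here; a minimal stand-in for playing along (assumed, based on e1071, not the actual file) could be:

library(e1071)
set.seed(5)
dat   <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
dat$y <- factor(ifelse(dat$x1^2 + dat$x2^2 > 1.5, "o", "x"))   # radially separated classes

fit <- svm(y ~ ., data = dat, kernel = "radial", cost = 1)
plot(fit, dat)                   # produces an "SVM classification plot" like the one shown earlier
table(predicted = predict(fit, dat), truth = dat$y)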

References I
[BV04] S. Boyd and L. Vandenberghe, Convex Optimization. New York, NY, USA: Cambridge University Press, 2004.

[CV95] C. Cortes and V. Vapnik, “Support-vector networks,” in Machine Learning, 1995, pp. 273–297.

[DBK+96] H. Drucker, C. J. C. Burges, L. Kaufman, A. Smola, and V. Vapnik, “Support vector regression machines,” in Proceedings of the 9th International Conference on Neural Information Processing Systems, ser. NIPS’96. Cambridge, MA, USA: MIT Press, 1996, pp. 155–161. [Online]. Available: http://dl.acm.org/citation.cfm?id=2998981.2999003

[KT51] H. Kuhn and A. Tucker, “Nonlinear programming,” in Proceedings of the 2nd Berkeley Symposium on Mathematics, Statistics and Probability, Berkeley. University of California Press, 1951, pp. 481–492.

References II
[Pla98] J. C. Platt, “Sequential minimal optimization: A fast algorithm for training support vector machines,” no. 208, pp. 1–21, 1998.

[Pla99] ——, “Advances in kernel methods,” B. Schölkopf, C. J. C. Burges, and A. J. Smola, Eds. Cambridge, MA, USA: MIT Press, 1999, ch. Fast Training of Support Vector Machines Using Sequential Minimal Optimization, pp. 185–208. [Online]. Available: http://dl.acm.org/citation.cfm?id=299094.299105

[SKR85] N. Z. Shor, K. C. Kiwiel, and A. Ruszczyński, Minimization Methods for Non-differentiable Functions. Berlin, Heidelberg: Springer-Verlag, 1985.

[Tip01] M. E. Tipping, “Sparse Bayesian learning and the relevance vector machine,” J. Mach. Learn. Res., vol. 1, pp. 211–244, Sep. 2001. [Online]. Available: https://doi.org/10.1162/15324430152748236

References III

[Vap95] V. N. Vapnik, The Nature of Statistical Learning Theory. New York, NY, USA: Springer-Verlag New York, Inc., 1995.

[WGDK17] F. Wenzel, T. Galy-Fajou, M. Deutsch, and M. Kloft, “Bayesian nonlinear support vector machines for big data,” in ECML/PKDD (1), ser. Lecture Notes in Computer Science, vol. 10534. Springer, 2017, pp. 307–322.

