Learning From Data: 8: Support Vector Machines (SVM)
Until recently, philosophy was based on the very simple idea that
the world is simple. In machine learning, for the first time, we
have examples where the world is not simple.
– Vladimir Vapnik
Recap Perceptron and Linear Model
Remember: We separate using a separating hyperplane.
[Figure: two classes ('o' and '+') in the (x1, x2) plane separated by a hyperplane.]
Maximum Margin
Idea: Use the one with maximum margin:
[Figure: the same two classes, now separated by the maximum-margin hyperplane.]
Maximum Margin (cont.)
SVMs: A New Kid on the Block in the 80s/90s
Before SVMs:
Almost all learning methods used linear decision surfaces.
Linear models have nice theoretical properties, but are limited.
Neural networks (NN) allow for efficient learning of non-linear decision surfaces; however, NNs suffer from local minima, and there is still little theoretical basis for NNs.
After SVMs [CV95]:
We have efficient learning algorithms for non-linear functions.
We have a learning algorithm thoroughly based on the computational learning theory developed by Vapnik et al. [CV95].
Key Ideas
Support Vectors
Support Vectors
are the data points closest to the separating hyperplane,
correspond to the most difficult-to-classify points,
define the separating hyperplane,
are usually a much smaller subset of all data points – this reduces the effective VC dimension (which is good!) and explains why SVMs perform well.
Intuition: the boundaries of the classes are more “informative” than the overall distribution of the classes, as the sketch below illustrates.
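A minimal sketch of this (assuming scikit-learn is available; the dataset and the large C are illustrative choices, not from the slides):

```python
import numpy as np
from sklearn.svm import SVC

# Toy dataset: two linearly separable point clouds in the plane.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (50, 2)),   # class -1 around (0, 0)
               rng.normal(2.0, 0.3, (50, 2))])  # class +1 around (2, 2)
y = np.array([-1] * 50 + [+1] * 50)

# A large C approximates a hard margin: no margin violations tolerated.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

# Only the points closest to the separating hyperplane become support
# vectors -- typically a small subset of the 100 training points.
print("support vectors:", len(clf.support_vectors_))
print("their indices:  ", clf.support_)
```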
Linear Separability
The Math
[Figure: two classes ('o' and '+') in the (x1, x2) plane with a separating hyperplane.]
The linear classifier is
$h_w(x) = \operatorname{sign}(\langle w, x \rangle + b).$

Definition 1
The geometric margin of a hyperplane w with respect to a dataset D is the shortest distance¹ from a training point $x_i$ to the hyperplane defined by w.

¹ i.e. with w perpendicular to the plane and $\|w\| = 1$
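A small numerical illustration of Definition 1 (the points and the hyperplane are made-up values, purely for demonstration):

```python
import numpy as np

def geometric_margin(w, b, X):
    """Shortest distance from any row of X to the hyperplane <w, x> + b = 0."""
    # |<w, x> + b| / ||w|| is the unsigned distance of a point x.
    return np.min(np.abs(X @ w + b) / np.linalg.norm(w))

X = np.array([[1.0, 2.0], [2.0, 0.5], [0.0, 1.0]])  # made-up points
w = np.array([1.0, 1.0])                            # made-up hyperplane
b = -2.0
print(geometric_margin(w, b, X))  # distance of the closest point
```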
Primal Form
Simplification 1: Linearizing Classes
Simplification 2: Symmetry
Simplification 3: Scale Invariance
We can increase the length of w arbitrarily without changing the constraints.
The true distance of a point x to the hyperplane is $\langle x, \frac{w}{\|w\|} \rangle$, i.e. $\langle x, w \rangle$ measured in units of $\frac{1}{\|w\|}$.
Let x be the closest point to the hyperplane. Then we can scale w such that $\langle x, w \rangle = 1$.
The constraints then have to be replaced by $\langle x_i, w \rangle y_i \ge 1$ (why?).
Furthermore, we no longer need to ask which point is closest to the candidate hyperplane, because after all, we never cared which point it was!
All we cared about was how far away that closest point was. And we now know that it is exactly $\frac{1}{\|w\|}$ away.
As a consequence, we get rid of the max and min – we just have to minimize $\|w\|$, or equivalently $\frac{1}{2}\|w\|^2$.
Primal Form
Putting the simplifications together, we obtain the primal form of the SVM:
$\arg\min_w \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad (\langle w, x_i \rangle + b)\, y_i \ge 1 \quad \forall i.$
Optimization Problems
Karush, Kuhn, and Tucker (KKT) Theorem
Theorem 3
Suppose you have an optimization problem in $\mathbb{R}^n$ of the following form:
$\min f(z) \quad \text{subject to} \quad g_i(z) \le 0, \quad i = 1, \ldots, m.$
If z is a (local) optimum, then, under mild regularity conditions, there exist multipliers $\alpha_1, \ldots, \alpha_m$ such that
1. $\nabla f(z) + \sum_{i=1}^m \alpha_i \nabla g_i(z) = 0$
2. $g_i(z) \le 0 \quad \forall i = 1, \ldots, m$
3. $\alpha_i \ge 0 \quad \forall i = 1, \ldots, m$
4. $\alpha_i g_i(z) = 0 \quad \forall i = 1, \ldots, m$.
For the special case of convex functions and convex constraints the reverse is true as well (equivalence).
Interpreting Karush, Kuhn, and Tucker (KKT) Theorem
1. The first equation requires that the generalized Lagrangian has gradient zero. This implies that the primal objective is at a local minimum.
2. The second equation implies that the original constraints (of the primal) are satisfied.
3. The third equation implies that the constraints of the dual are satisfied.
4. The fourth implies that the primal and dual are interrelated via complementary slackness:
$\alpha_i g_i(z) = 0 \quad \forall i = 1, \ldots, m.$
The function
$L(z, \alpha) := f(z) + \sum_{i=1}^m \alpha_i g_i(z)$
is called the generalized Lagrangian.
Interpreting the KKT Theorem (cont.)
Why would one solve the dual instead of the primal?
First, the primal ($\arg\min_w \frac{1}{2}\|w\|^2$) requires a minimization over all w subject to a condition ($(\langle w, x_i \rangle + b) y_i \ge 1$) on all data vectors. This might be expensive.
The dual instead – because of the slackness – only involves a smaller number of support vectors (sparseness). This is computationally more attractive.
The dual only depends on the inner product. We will take advantage of that when we introduce the kernel trick (see below).
For more information, see
http://www.csc.kth.se/utbildning/kth/kurser/DD3364/Lectures/KKT.pdf,
https://cs.stanford.edu/people/davidknowles/lagrangian_duality.pdf or
https://www.quora.com/What-is-the-meaning-of-the-value-of-a-Lagrange-multiplier-when-doi/answer/Balaji-Pitchai-Kannu
Interpreting the KKT Theorem (cont.)
https://web.archive.org/web/20210506170321/http://www.onmyphd.com/?p=kkt.karush.kuhn.tucker
Interpreting the KKT Theorem (cont.)
Let us compute the Lagrangian:
$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 + \sum_{j=1}^m \alpha_j \left(1 - y_j(\langle w, x_j \rangle + b)\right)$
$= \frac{1}{2}\|w\|^2 + \sum_{j=1}^m \alpha_j - \sum_{j=1}^m \alpha_j y_j \langle w, x_j \rangle - \sum_{j=1}^m \alpha_j y_j b$
We have
$\frac{\partial L}{\partial b} = - \sum_{j=1}^m y_j \alpha_j.$
Interpreting the KKT Theorem (cont.)
Furthermore,
$\frac{\partial L}{\partial w_i} = w_i - \sum_{j=1}^m \alpha_j y_j x_{j,i} \overset{!}{=} 0$
implies
$w = \sum_{j=1}^m \alpha_j y_j x_j.$
This means that we can find the weight vector as a linear combination of the data points. Also, because of slackness many contributions will be zero, i.e. the optimal solution can be written as a sparse sum of training examples! (The sketch below checks this numerically.)
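A sketch of this check (assuming scikit-learn; its SVC stores $\alpha_j y_j$ for the support vectors in dual_coef_):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)
clf = SVC(kernel="linear", C=10.0).fit(X, y)

# dual_coef_ holds alpha_j * y_j for the support vectors only, so the
# sparse sum w = sum_j alpha_j y_j x_j runs over the support vectors.
w = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(w, clf.coef_))  # True: recovers the weight vector
```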
Interpreting the KKT Theorem (cont.)
Plugging $w = \sum_{j=1}^m \alpha_j y_j x_j$ back into the Lagrangian eliminates w – and indeed, the result depends only on the inner products of the training vectors! Hence we can summarize everything in the following lemma:
Interpreting the KKT Theorem (cont.)
Lemma 4
Solving the SVM primal is equivalent to solving the dual problem, i.e. computing the max of the Lagrangian
$\max_\alpha L(\alpha)$
subject to
$\alpha_j \ge 0,$
$\alpha_j \left(1 - y_j(\langle w, x_j \rangle + b)\right) = 0,$
$\sum_{j=1}^m y_j \alpha_j = 0.$
Solving the Dual
How can we solve the dual in practice? Answers include:
1. Standard quadratic optimization procedures [BV04] (online:
https://web.stanford.edu/~boyd/cvxbook/)
2. Sequential minimal optimization [Pla98], [Pla99]
3. Modified gradient projection [CV95]
4. Sub-gradient descent and coordinate descent [SKR85]
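As an illustration of option 1, here is a minimal, unoptimized sketch (assuming NumPy/SciPy; a general-purpose constrained optimizer stands in for a real QP solver, and the toy data is illustrative):

```python
import numpy as np
from scipy.optimize import minimize

def solve_svm_dual(X, y, C=1e6):
    """Maximize L(a) = sum(a) - 1/2 sum_ij a_i a_j y_i y_j <x_i, x_j>."""
    m = len(y)
    Yx = y[:, None] * X
    G = Yx @ Yx.T  # G_ij = y_i y_j <x_i, x_j>

    res = minimize(lambda a: 0.5 * a @ G @ a - a.sum(),  # minimize -L(a)
                   np.zeros(m), method="SLSQP",
                   bounds=[(0.0, C)] * m,
                   constraints={"type": "eq", "fun": lambda a: a @ y})
    alpha = res.x
    w = (alpha * y) @ X             # w = sum_j alpha_j y_j x_j
    sv = alpha > 1e-6               # slackness: most alpha_j are zero
    b = np.mean(y[sv] - X[sv] @ w)  # any support vector determines b
    return w, b, alpha

X = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 2.0], [3.0, 3.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
w, b, alpha = solve_svm_dual(X, y)
print(w, b, alpha.round(3))
```

Dedicated methods such as SMO [Pla98] exploit the structure of the problem and scale far better than this generic approach.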
What about Non-Separability?
Reasons for non-separability:
Data is genuinely non-linear. In this case, we will apply the kernel trick.
Data is noisy. In this case, we should allow for some data points that are mis-classified due to noise, to avoid overfitting.
So far, the primal $\frac{1}{2}\|w\|^2$ and its corresponding dual $L(\alpha) = \sum_{j=1}^m \alpha_j - \frac{1}{2}\sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle$ follow a so-called hard margin approach.
To tolerate noise, one introduces slack variables $\xi_i \ge 0$ and a penalty parameter C, and minimizes $\frac{1}{2}\|w\|^2 + C \sum_{i=1}^m \xi_i$ subject to $y_i(\langle w, x_i \rangle + b) \ge 1 - \xi_i$.
A large value of C implies a high penalty for violating the margin – thus we get closer to a hard margin approach. A smaller C allows for fitting the training data better (beware of overfitting); see the sketch below.
As it turns out, the dual is surprisingly simple: it is unchanged except that the constraints become the box constraints $0 \le \alpha_j \le C$.
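The effect of C can be observed directly. A sketch (assuming scikit-learn, whose SVC exposes exactly this soft-margin parameter; the overlapping blobs are illustrative):

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Overlapping classes: not separable, so the margin must be soft.
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=1)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Small C: wide margin, many violations, many support vectors.
    # Large C: closer to a hard margin, fewer support vectors.
    print(f"C = {C:>6}: {len(clf.support_vectors_)} support vectors")
```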
Non-Linear
Remember:
[Figure: a non-linearly separable point cloud in the (x1, x2) plane and, after applying a feature map φ, its image in (x1, r) coordinates, where the two classes become linearly separable.]
Intuition
Kernel Trick
Key Ideas:
1. Suppose we can find a mapping $\Phi : X = \mathbb{R}^n \to Z$ of the feature space into another, higher-dimensional (even infinite-dimensional) space Z, so that the problem becomes linearly separable.
2. As the mapping (and the inner product) may become difficult or even impossible to compute in Z, we further assume that we have a so-called kernel function² $K : X \times X \to \mathbb{R}$ such that $K(x, x') = \langle \Phi(x), \Phi(x') \rangle_Z$.

² satisfying some properties
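For instance, the radial kernel below equals such an inner product in an infinite-dimensional space Z, yet evaluating it never requires Φ. A sketch (plain NumPy; the value of γ and the points are illustrative):

```python
import numpy as np

def radial_kernel(x, xp, gamma=1.0):
    # K(x, x') = exp(-gamma * ||x - x'||^2) equals <Phi(x), Phi(x')> in an
    # infinite-dimensional space Z, yet costs only O(n) to evaluate.
    return np.exp(-gamma * np.sum((x - xp) ** 2))

x, xp = np.array([1.0, 2.0]), np.array([2.0, 0.0])
print(radial_kernel(x, xp))  # kernel value, Phi never constructed
```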
SVM Kernel Trick
Lemma 5
Let $K : X \times X \to \mathbb{R}$ be a kernel function, i.e. let K be a (continuous) symmetric non-negative definite function. Then the following problem is well defined. Solve the max-min of the Lagrangian, defined as
$L(\alpha) = \sum_{j=1}^m \alpha_j - \frac{1}{2} \sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j K(x_i, x_j),$
subject to
$C \ge \alpha_j \ge 0,$
$\alpha_j \left(1 - y_j(\langle w, x_j \rangle + b)\right) = 0,$
$\sum_{j=1}^m y_j \alpha_j = 0.$
Mercer’s Theorem
Examples of Kernels
Typical examples include the linear kernel $K(x, x') = \langle x, x' \rangle$, the polynomial kernel $K(x, x') = (\langle x, x' \rangle + c)^d$, and the radial kernel $K(x, x') = \exp(-\gamma \|x - x'\|^2)$.
Note that one can construct kernels from simpler ones by sums and/or products, as the sketch below illustrates.
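This closure property can be checked empirically on Gram matrices: sums and elementwise products of valid Gram matrices remain symmetric non-negative definite. A small numerical sketch (plain NumPy; the eigenvalue test uses a floating-point tolerance):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))

# Gram matrices of a linear and a radial kernel on the same 20 points.
K_lin = X @ X.T
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K_rbf = np.exp(-0.5 * sq_dists)

def is_psd(K, tol=1e-9):
    # Symmetric non-negative definite <=> all eigenvalues >= 0.
    return bool(np.all(np.linalg.eigvalsh(K) >= -tol))

# Sum and elementwise (Schur) product of kernels are again kernels.
print(is_psd(K_lin), is_psd(K_rbf), is_psd(K_lin + K_rbf), is_psd(K_lin * K_rbf))
```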
Examples of Radial Kernel
[Figure: two classes ('o' and 'x') in the (x1, x2) plane; a radial-kernel SVM separates them with a non-linear, closed decision boundary.]
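A sketch reproducing this behaviour (assuming scikit-learn; make_circles generates data similar to the figure, where no linear boundary works, and γ = 1 is an illustrative choice):

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# One class inside the other: no separating line exists in (x1, x2).
X, y = make_circles(n_samples=200, noise=0.1, factor=0.4, random_state=0)

linear = SVC(kernel="linear")
radial = SVC(kernel="rbf", gamma=1.0)

print("linear:", cross_val_score(linear, X, y, cv=5).mean())  # near chance
print("radial:", cross_val_score(radial, X, y, cv=5).mean())  # near perfect
```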
VC Dimension of SVMs
Multiclass
Support Vector Regression (SVR)
SVMs can also be used for regression, i.e. when $y_i \in \mathbb{R}$ instead of $y_i \in \{\pm 1\}$.
Key idea, introduced in [DBK+96] and called SVR:
It follows the same line of reasoning as SVM.
Intuitively, it tries to fit a line to the data by minimizing a cost function.
With the kernel trick it provides a non-linear regression, i.e. it fits a curve rather than a straight line.
Technically, one introduces a margin error term ε and solves the following minimization problem (primal):
$\arg\min_w \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad |y_i - (\langle w, x_i \rangle - b)| \le \epsilon.$
A usage sketch follows below.
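A minimal usage sketch (assuming scikit-learn's SVR, whose epsilon parameter is the ε of the margin error term above; the noisy sine data is illustrative):

```python
import numpy as np
from sklearn.svm import SVR

# Noisy sine curve: a non-linear regression target.
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0.0, 2 * np.pi, (100, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(0.0, 0.1, 100)

# Deviations smaller than epsilon cost nothing, keeping the fit sparse;
# the radial kernel turns the linear SVR into a curve fit.
svr = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)
print("support vectors:", len(svr.support_))
print("prediction at pi:", svr.predict([[np.pi]]))
```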
Relevance Vector Machine (RVM)
Pros and Cons of SVMs
Advantages of SVMs
Mathematical foundation
Kernel trick
Regularization parameter
Global minimum (does not get stuck in local minima)
Fast, if there is not too much data
Excellent classification performance for many tasks
Disadvantages of SVMs
Regularization and kernel parameters $C, \sigma, \kappa, \ldots$ have to be estimated – there is no specific theory, thus model selection is still done empirically by validation
Recently outperformed in (visual) pattern recognition by deep convolutional networks
No probabilistic interpretation
Black box
SVMs are memory-intensive, as one needs to store the kernel matrix in memory.
Further Reading
https://www.svm-tutorial.com/
Demo
Let’s play!
References
[BV04] S. Boyd and L. Vandenberghe, Convex Optimization. New York, NY, USA: Cambridge University Press, 2004.
[Pla98] J. C. Platt, “Sequential minimal optimization: A fast algorithm for training support vector machines,” no. 208, pp. 1–21, 1998.