Learning From Data: 8: Support Vector Machines (SVM)
Until recently, philosophy was based on the very simple idea that
the world is simple. In machine learning, for the first time, we
have examples where the world is not simple.
– Vladimir Vapnik
Recap Perceptron and Linear Model
Remember: We separate using a separating hyperplane.
[Figure: two classes ('o' and '+') in the (x1, x2) plane separated by a hyperplane.]
Maximum Margin
Idea: Use the one with maximum margin:
[Figure: the same two classes, now separated by the maximum-margin hyperplane.]
Maximum Margin (cont.)
SVMs: A New Kid on the Block in the 80s/90s
Before SVMs:
Almost all learning methods used linear decision surfaces.
Linear models have nice theoretical properties, but are limited.
Neural networks (NN) allow for efficient learning of non-linear decision surfaces; however, NNs suffer from local minima, and there is still little theoretical basis for NNs.
After SVMs [CV95]:
We have efficient learning algorithms for non-linear functions.
We have a learning algorithm thoroughly based on the computational learning theory developed by Vapnik et al. [CV95].
Key Ideas
Support Vectors
Support Vectors
are the data points closest to the separating hyperplane,
correspond to the most difficult-to-classify points,
define the separating hyperplane,
are usually a much smaller subset of all data points – this reduces the effective VC dimension (which is good!) and explains why SVMs perform well.
Intuition: the boundaries of the classes are more “informative” than the overall distribution of the classes, as the sketch below illustrates.
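A minimal sketch of this (assuming scikit-learn is available; the dataset and the large C are illustrative choices, not from the slides):

```python
import numpy as np
from sklearn.svm import SVC

# Toy dataset: two linearly separable point clouds in the plane.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (50, 2)),   # class -1 around (0, 0)
               rng.normal(2.0, 0.3, (50, 2))])  # class +1 around (2, 2)
y = np.array([-1] * 50 + [+1] * 50)

# A large C approximates a hard margin: no margin violations tolerated.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

# Only the points closest to the separating hyperplane become support
# vectors -- typically a small subset of the 100 training points.
print("support vectors:", len(clf.support_vectors_))
print("their indices:  ", clf.support_)
```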
Linear Separability
The Math
[Figure: two classes ('o' and '+') in the (x1, x2) plane with a separating hyperplane.]
The linear classifier is
$h_w(x) = \operatorname{sign}(\langle w, x \rangle + b).$

Definition 1
The geometric margin of a hyperplane w with respect to a dataset D is the shortest distance¹ from a training point $x_i$ to the hyperplane defined by w.

¹ i.e. with w perpendicular to the plane and $\|w\| = 1$
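A small numerical illustration of Definition 1 (the points and the hyperplane are made-up values, purely for demonstration):

```python
import numpy as np

def geometric_margin(w, b, X):
    """Shortest distance from any row of X to the hyperplane <w, x> + b = 0."""
    # |<w, x> + b| / ||w|| is the unsigned distance of a point x.
    return np.min(np.abs(X @ w + b) / np.linalg.norm(w))

X = np.array([[1.0, 2.0], [2.0, 0.5], [0.0, 1.0]])  # made-up points
w = np.array([1.0, 1.0])                            # made-up hyperplane
b = -2.0
print(geometric_margin(w, b, X))  # distance of the closest point
```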
Primal Form
Simplification 1: Linearizing Classes
Simplification 2: Symmetry
Simplification 3: Scale Invariance
We can increase the length of w arbitrarily without changing the constraints.
The true distance of a point x to the hyperplane is $\langle x, \frac{w}{\|w\|} \rangle$, i.e. $\langle x, w \rangle$ measured in units of $\frac{1}{\|w\|}$.
Let x be the closest point to the hyperplane. Then we can scale w such that $\langle x, w \rangle = 1$.
The constraints then have to be replaced by $\langle x_i, w \rangle y_i \ge 1$ (why?).
Furthermore, we no longer need to ask which point is closest to the candidate hyperplane, because after all, we never cared which point it was!
All we cared about was how far away that closest point was. And we now know that it is exactly $\frac{1}{\|w\|}$ away.
As a consequence, we get rid of the max and min – we just have to minimize $\|w\|$, or equivalently $\frac{1}{2}\|w\|^2$.
Primal Form
Putting the simplifications together, we obtain the primal form of the SVM:
$\arg\min_w \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad (\langle w, x_i \rangle + b)\, y_i \ge 1 \quad \forall i.$
Optimization Problems
Karush, Kuhn, and Tucker (KKT) Theorem
Theorem 3
Suppose you have an optimization problem in $\mathbb{R}^n$ of the following form:
$\min f(z) \quad \text{subject to} \quad g_i(z) \le 0, \quad i = 1, \ldots, m.$
If z is a (local) optimum, then, under mild regularity conditions, there exist multipliers $\alpha_1, \ldots, \alpha_m$ such that
1. $\nabla f(z) + \sum_{i=1}^m \alpha_i \nabla g_i(z) = 0$
2. $g_i(z) \le 0 \quad \forall i = 1, \ldots, m$
3. $\alpha_i \ge 0 \quad \forall i = 1, \ldots, m$
4. $\alpha_i g_i(z) = 0 \quad \forall i = 1, \ldots, m$.
For the special case of convex functions and convex constraints the reverse is true as well (equivalence).
Interpreting Karush, Kuhn, and Tucker (KKT) Theorem
1. The first equation requires that the generalized Lagrangian has gradient zero. This implies that the primal objective is at a local minimum.
2. The second equation implies that the original constraints (of the primal) are satisfied.
3. The third equation implies that the constraints of the dual are satisfied.
4. The fourth implies that the primal and dual are interrelated via complementary slackness:
$\alpha_i g_i(z) = 0 \quad \forall i = 1, \ldots, m.$
The function
$L(z, \alpha) := f(z) + \sum_{i=1}^m \alpha_i g_i(z)$
is called the generalized Lagrangian.
Interpreting the KKT Theorem (cont.)
Why would one solve the dual instead of the primal?
First, the primal ($\arg\min_w \frac{1}{2}\|w\|^2$) requires a minimization over all w subject to a condition ($(\langle w, x_i \rangle + b) y_i \ge 1$) on all data vectors. This might be expensive.
The dual instead – because of the slackness – only involves a smaller number of support vectors (sparseness). This is computationally more attractive.
The dual only depends on the inner product. We will take advantage of that when we introduce the kernel trick (see below).
For more information, see
http://www.csc.kth.se/utbildning/kth/kurser/DD3364/Lectures/KKT.pdf,
https://cs.stanford.edu/people/davidknowles/lagrangian_duality.pdf or
https://www.quora.com/What-is-the-meaning-of-the-value-of-a-Lagrange-multiplier-when-doi/answer/Balaji-Pitchai-Kannu
Interpreting the KKT Theorem (cont.)
https://web.archive.org/web/20210506170321/http://www.onmyphd.com/?p=kkt.karush.kuhn.tucker
Interpreting the KKT Theorem (cont.)
Let us compute the Lagrangian:
$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 + \sum_{j=1}^m \alpha_j \left(1 - y_j(\langle w, x_j \rangle + b)\right)$
$= \frac{1}{2}\|w\|^2 + \sum_{j=1}^m \alpha_j - \sum_{j=1}^m \alpha_j y_j \langle w, x_j \rangle - \sum_{j=1}^m \alpha_j y_j b$
We have
$\frac{\partial L}{\partial b} = - \sum_{j=1}^m y_j \alpha_j.$
Interpreting the KKT Theorem (cont.)
Furthermore,
$\frac{\partial L}{\partial w_i} = w_i - \sum_{j=1}^m \alpha_j y_j x_{j,i} \overset{!}{=} 0$
implies
$w = \sum_{j=1}^m \alpha_j y_j x_j.$
This means that we can find the weight vector as a linear combination of the data points. Also, because of slackness many contributions will be zero, i.e. the optimal solution can be written as a sparse sum of training examples! (The sketch below checks this numerically.)
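A sketch of this check (assuming scikit-learn; its SVC stores $\alpha_j y_j$ for the support vectors in dual_coef_):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)
clf = SVC(kernel="linear", C=10.0).fit(X, y)

# dual_coef_ holds alpha_j * y_j for the support vectors only, so the
# sparse sum w = sum_j alpha_j y_j x_j runs over the support vectors.
w = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(w, clf.coef_))  # True: recovers the weight vector
```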
Interpreting the KKT Theorem (cont.)
Plugging $w = \sum_{j=1}^m \alpha_j y_j x_j$ back into the Lagrangian eliminates w – and indeed, the result depends only on the inner products of the training vectors! Hence we can summarize everything in the following lemma:
Interpreting the KKT Theorem (cont.)
Lemma 4
Solving the SVM primal is equivalent to solving the dual problem, i.e. computing the max of the Lagrangian
$\max_\alpha L(\alpha)$
subject to
$\alpha_j \ge 0,$
$\alpha_j \left(1 - y_j(\langle w, x_j \rangle + b)\right) = 0,$
$\sum_{j=1}^m y_j \alpha_j = 0.$
Solving the Dual
How can we solve the dual in practice? Answers include:
1. Standard quadratic optimization procedures [BV04] (online:
https://web.stanford.edu/~boyd/cvxbook/)
2. Sequential minimal optimization [Pla98], [Pla99]
3. Modified gradient projection [CV95]
4. Sub-gradient descent and coordinate descent [SKR85]
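As an illustration of option 1, here is a minimal, unoptimized sketch (assuming NumPy/SciPy; a general-purpose constrained optimizer stands in for a real QP solver, and the toy data is illustrative):

```python
import numpy as np
from scipy.optimize import minimize

def solve_svm_dual(X, y, C=1e6):
    """Maximize L(a) = sum(a) - 1/2 sum_ij a_i a_j y_i y_j <x_i, x_j>."""
    m = len(y)
    Yx = y[:, None] * X
    G = Yx @ Yx.T  # G_ij = y_i y_j <x_i, x_j>

    res = minimize(lambda a: 0.5 * a @ G @ a - a.sum(),  # minimize -L(a)
                   np.zeros(m), method="SLSQP",
                   bounds=[(0.0, C)] * m,
                   constraints={"type": "eq", "fun": lambda a: a @ y})
    alpha = res.x
    w = (alpha * y) @ X             # w = sum_j alpha_j y_j x_j
    sv = alpha > 1e-6               # slackness: most alpha_j are zero
    b = np.mean(y[sv] - X[sv] @ w)  # any support vector determines b
    return w, b, alpha

X = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 2.0], [3.0, 3.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
w, b, alpha = solve_svm_dual(X, y)
print(w, b, alpha.round(3))
```

Dedicated methods such as SMO [Pla98] exploit the structure of the problem and scale far better than this generic approach.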
What about Non-Separability?
Reasons for non-separability:
Data is genuinely non-linear. In this case, we will apply the kernel trick.
Data is noisy. In this case, we should allow for some data points that are mis-classified due to noise, to avoid overfitting.
So far, the primal $\frac{1}{2}\|w\|^2$ and its corresponding dual $L(\alpha) = \sum_{j=1}^m \alpha_j - \frac{1}{2}\sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle$ follow a so-called hard margin approach.
To tolerate noise, one introduces slack variables $\xi_i \ge 0$ and a penalty parameter C, and minimizes $\frac{1}{2}\|w\|^2 + C \sum_{i=1}^m \xi_i$ subject to $y_i(\langle w, x_i \rangle + b) \ge 1 - \xi_i$.
A large value of C implies a high penalty for violating the margin – thus we get closer to a hard margin approach. A smaller C allows for fitting the training data better (beware of overfitting); see the sketch below.
As it turns out, the dual is surprisingly simple: it is unchanged except that the constraints become the box constraints $0 \le \alpha_j \le C$.
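The effect of C can be observed directly. A sketch (assuming scikit-learn, whose SVC exposes exactly this soft-margin parameter; the overlapping blobs are illustrative):

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Overlapping classes: not separable, so the margin must be soft.
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=1)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Small C: wide margin, many violations, many support vectors.
    # Large C: closer to a hard margin, fewer support vectors.
    print(f"C = {C:>6}: {len(clf.support_vectors_)} support vectors")
```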
Non-Linear
Remember:
[Figure: a non-linearly separable point cloud in the (x1, x2) plane and, after applying a feature map φ, its image in (x1, r) coordinates, where the two classes become linearly separable.]
Intuition
Kernel Trick
Key Ideas:
1. Suppose we can find a mapping $\Phi : X = \mathbb{R}^n \to Z$ of the feature space into another, higher-dimensional (even infinite-dimensional) space Z, so that the problem becomes linearly separable.
2. As the mapping (and the inner product) may become difficult or even impossible to compute in Z, we further assume that we have a so-called kernel function² $K : X \times X \to \mathbb{R}$ such that $K(x, x') = \langle \Phi(x), \Phi(x') \rangle_Z$.

² satisfying some properties
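For instance, the radial kernel below equals such an inner product in an infinite-dimensional space Z, yet evaluating it never requires Φ. A sketch (plain NumPy; the value of γ and the points are illustrative):

```python
import numpy as np

def radial_kernel(x, xp, gamma=1.0):
    # K(x, x') = exp(-gamma * ||x - x'||^2) equals <Phi(x), Phi(x')> in an
    # infinite-dimensional space Z, yet costs only O(n) to evaluate.
    return np.exp(-gamma * np.sum((x - xp) ** 2))

x, xp = np.array([1.0, 2.0]), np.array([2.0, 0.0])
print(radial_kernel(x, xp))  # kernel value, Phi never constructed
```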
SVM Kernel Trick
Lemma 5
Let $K : X \times X \to \mathbb{R}$ be a kernel function, i.e. let K be a (continuous) symmetric non-negative definite function. Then the following problem is well defined. Solve the max-min of the Lagrangian, defined as
$L(\alpha) = \sum_{j=1}^m \alpha_j - \frac{1}{2} \sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j K(x_i, x_j),$
subject to
$C \ge \alpha_j \ge 0,$
$\alpha_j \left(1 - y_j(\langle w, x_j \rangle + b)\right) = 0,$
$\sum_{j=1}^m y_j \alpha_j = 0.$
Mercer’s Theorem
Examples of Kernels
Typical examples include the linear kernel $K(x, x') = \langle x, x' \rangle$, the polynomial kernel $K(x, x') = (\langle x, x' \rangle + c)^d$, and the radial kernel $K(x, x') = \exp(-\gamma \|x - x'\|^2)$.
Note that one can construct kernels from simpler ones by sums and/or products, as the sketch below illustrates.
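This closure property can be checked empirically on Gram matrices: sums and elementwise products of valid Gram matrices remain symmetric non-negative definite. A small numerical sketch (plain NumPy; the eigenvalue test uses a floating-point tolerance):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))

# Gram matrices of a linear and a radial kernel on the same 20 points.
K_lin = X @ X.T
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K_rbf = np.exp(-0.5 * sq_dists)

def is_psd(K, tol=1e-9):
    # Symmetric non-negative definite <=> all eigenvalues >= 0.
    return bool(np.all(np.linalg.eigvalsh(K) >= -tol))

# Sum and elementwise (Schur) product of kernels are again kernels.
print(is_psd(K_lin), is_psd(K_rbf), is_psd(K_lin + K_rbf), is_psd(K_lin * K_rbf))
```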
Examples of Radial Kernel
[Figure: two classes ('o' and 'x') in the (x1, x2) plane; a radial-kernel SVM separates them with a non-linear, closed decision boundary.]
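A sketch reproducing this behaviour (assuming scikit-learn; make_circles generates data similar to the figure, where no linear boundary works, and γ = 1 is an illustrative choice):

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# One class inside the other: no separating line exists in (x1, x2).
X, y = make_circles(n_samples=200, noise=0.1, factor=0.4, random_state=0)

linear = SVC(kernel="linear")
radial = SVC(kernel="rbf", gamma=1.0)

print("linear:", cross_val_score(linear, X, y, cv=5).mean())  # near chance
print("radial:", cross_val_score(radial, X, y, cv=5).mean())  # near perfect
```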
VC Dimension of SVMs
Multiclass
Support Vector Regression (SVR)
SVMs can also be used for regression, i.e. when $y_i \in \mathbb{R}$ instead of $y_i \in \{\pm 1\}$.
Key idea, introduced in [DBK+96] and called SVR:
It follows the same line of reasoning as SVM.
Intuitively, it tries to fit a line to the data by minimizing a cost function.
With the kernel trick it provides a non-linear regression, i.e. it fits a curve rather than a straight line.
Technically, one introduces a margin error term ε and solves the following minimization problem (primal):
$\arg\min_w \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad |y_i - (\langle w, x_i \rangle - b)| \le \epsilon.$
A usage sketch follows below.
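A minimal usage sketch (assuming scikit-learn's SVR, whose epsilon parameter is the ε of the margin error term above; the noisy sine data is illustrative):

```python
import numpy as np
from sklearn.svm import SVR

# Noisy sine curve: a non-linear regression target.
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0.0, 2 * np.pi, (100, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(0.0, 0.1, 100)

# Deviations smaller than epsilon cost nothing, keeping the fit sparse;
# the radial kernel turns the linear SVR into a curve fit.
svr = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)
print("support vectors:", len(svr.support_))
print("prediction at pi:", svr.predict([[np.pi]]))
```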
Relevance Vector Machine (RVM)
Pros and Cons of SVMs
Advantages of SVMs
Mathematical foundation
Kernel trick
Regularization parameter
Global minimum (does not get stuck in local minima)
Fast, if there is not too much data
Excellent classification performance for many tasks
Disadvantages of SVMs
Regularization and kernel parameters $C, \sigma, \kappa, \ldots$ have to be estimated – there is no specific theory, thus model selection is still done empirically by validation
Recently outperformed in (visual) pattern recognition by deep convolutional networks
No probabilistic interpretation
Black box
SVMs are memory-intensive, as one needs to store the kernel matrix in memory.
Further Reading
https://www.svm-tutorial.com/
Demo
Let’s play!
References
[BV04] S. Boyd and L. Vandenberghe, Convex Optimization. New York, NY, USA: Cambridge University Press, 2004.
[Pla98] J. C. Platt, “Sequential minimal optimization: A fast algorithm for training support vector machines,” no. 208, pp. 1–21, 1998.