
Machine Learning

Support Vector Machine, Kernels


Ramon Baldrich
Universitat Autònoma de Barcelona

Credit to: Dimos Karatzas, A. Ng, A. Karpathy, T. Mitchell


Another ML introduction:
A pictorial introduction

SUPPORT VECTOR MACHINES


Classification

Generative models: try to model each class, e.g. estimate the PDF that gave rise to these points (x: data, y: label).
We learn: p(x | y)

Discriminative models: use the data to find the best way to separate the classes.
We learn: p(y | x)

(Figure: two classes plotted in the x1–x2 feature space.)
Linearly Separable Classes

Decision boundary: θᵀx = 0
θᵀx > 0 on one side of the boundary, θᵀx < 0 on the other.

θ = [θ0, θ1, ..., θn]ᵀ
x = [x0 = 1, x1, ..., xn]ᵀ
Linearly Separable Classes

There are multiple valid solutions (hyperplanes that separate the two classes).
Which one shall we choose?
Support Vector Machines

Linear classifiers: the decision boundary is a hyperplane.

Maximal margin: the solution maximises the margin (empty space) between the classes.
We look for the hyperplane with the largest possible separation between the two classes.

Support vectors: only a few data points are needed to define this margin. They are called support vectors.
Support Vector Machines

Every other decision boundary has a smaller margin.

If the support vectors move, the decision boundary changes.

If the rest of the points move, the decision boundary remains stable.
Outliers - Soft Margin

The SVM is relatively robust to outliers, but outliers (if they happen to be support vectors) might have a big effect on the solution.

In this case, we can relax the maximum margin condition (soft margin).

This implies adding a tolerance to errors, controlled by what are called slack variables: we soften the constraint by adding a tolerance.
Non-Linearly Separable Classes – The Kernel Trick

What if my feature space is not linear? We transform the space into another one where the classes can be separated.

If classes are not linearly separable, SVM will not work.

The solution is to transform the feature space into another one (of higher dimension), in which our classes are linearly separable.
(Figure: a 1D feature space mapped into a 2D space where the classes become separable.)

It is not necessary to define this transformation explicitly. It is only necessary to define a similarity function that calculates the dot product in this new space. This is called a kernel.
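As a minimal illustration of this idea (not part of the original slides), the sketch below maps a 1D feature x to (x, x²); a class that sits between the two halves of the other class in 1D becomes linearly separable in the 2D space. The data and the threshold are invented for the example.

```python
import numpy as np

# Hypothetical 1D data: class -1 sits in the middle, class +1 on both sides,
# so no single threshold on x can separate them.
x = np.array([-3.0, -2.5, -0.5, 0.0, 0.4, 2.2, 3.1])
y = np.array([ 1,    1,   -1,  -1,  -1,   1,   1 ])

# Explicit mapping phi(x) = (x, x^2): in this 2D space the classes can be
# separated by the horizontal line x^2 = 1.5, i.e. a hyperplane.
phi = np.column_stack([x, x ** 2])

separable = np.all((phi[:, 1] > 1.5) == (y == 1))
print(separable)  # True: a linear boundary exists in the transformed space
```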
INTUITION AND OPTIMISATION OBJECTIVE
A Line – Just in Case

A hyperplane is defined by a set of parameters θ, θ0:

    θᵀx + θ0 = 0

In the rest of the discussion we will assume that θ0 is equal to zero for simplicity; this means that the hyperplane crosses the origin.
Intuition

We want to find the hyperplane that maximises the distance between the closest vectors (of both classes) to the hyperplane and the hyperplane itself.

The hyperplane is defined by the vector θ.
d+ : distance from the hyperplane to the support vectors of class (+)
d- : distance from the hyperplane to the support vectors of class (-)

    d+ = d- = D
A bit of Geometry

Given a hyperplane θ, the distance of a point x to the hyperplane is given by:

    D = |x| cos(α)

The definition of the dot product between two vectors x and θ is:

    θ · x = θᵀx = |θ| |x| cos(α)

So the distance can be written:

    D = θᵀx / |θ| = |x| cos(α)
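A quick numerical check of the distance formula (a sketch with made-up numbers, assuming θ0 = 0 as in the slides):

```python
import numpy as np

theta = np.array([3.0, 4.0])   # hyperplane through the origin: theta^T x = 0
x = np.array([2.0, 1.0])       # an arbitrary point

# D = theta^T x / |theta|  (signed distance; the sign tells us the side)
D = theta @ x / np.linalg.norm(theta)
print(D)  # 2.0 = |x| * cos(alpha)
```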
A bit of Geometry

We can define the same hyperplane in many ways: a whole family of vectors (same direction, different lengths) defines it, and the distance calculation would be the same:

    D = θᵀx / |θ| = (2θ)ᵀx / |2θ| = (400θ)ᵀx / |400θ|
A bit of Geometry

When searching for the best hyperplane, we want to pick a single representative among all equivalent representations of a given hyperplane (the family of vectors defining the same hyperplane).

The convention for SVM is to pick the vector θ that makes (we introduce θ0 back):

    θᵀx + θ0 = +1 for all support vectors of class (+)
    θᵀx + θ0 = −1 for all support vectors of class (−)

Therefore the distance we are trying to maximise is given by:

    D = |θᵀx + θ0| / |θ| = 1 / |θ|

Maximising D = minimising |θ|.
Optimisation Function

Maximise the margin ⟺ minimise |θ|:

    min |θ|  ⟺  min ½ |θ|²

Subject to (classify correctly):

    yᵢ(θᵀxᵢ + θ0) ≥ 1

This is a quadratic optimisation problem.
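A small sketch of this optimisation in practice, using scikit-learn's SVC with a linear kernel and a very large C to approximate the hard-margin problem (the toy data is invented for illustration):

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable toy classes (invented data)
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.5],
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.5]])
y = np.array([-1, -1, -1, 1, 1, 1])

# Very large C ~ hard margin: (almost) no constraint violations are tolerated
clf = SVC(kernel="linear", C=1e6).fit(X, y)

print(clf.coef_, clf.intercept_)   # theta and theta_0 of the separating hyperplane
print(clf.support_vectors_)        # only a few points define the margin
```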
Cost Functions

This means setting the costs so that we do not penalise (cost = zero) anything classified correctly, but we do penalise if things fall on the wrong side:

    Cost for y = 1: zero when θᵀx + θ0 ≥ 1
    Cost for y = −1: zero when θᵀx + θ0 ≤ −1
COMPARISON TO LOGISTIC REGRESSION
Logistic Regression Reminder

    hθ(x) = g(z) = 1 / (1 + e^(−θᵀx)),   z = θᵀx

If y = 1, we want hθ ≈ 1, i.e. θᵀx >> 0
If y = 0, we want hθ ≈ 0, i.e. θᵀx << 0
Simplified cost function

    J(θ) = (1/m) Σᵢ Cost(hθ(x⁽ⁱ⁾), y⁽ⁱ⁾)

    Cost(hθ(x), y) = −log(hθ(x))       if y = 1
                     −log(1 − hθ(x))   if y = 0

(Note that y is always either 0 or 1.)

    Cost(hθ(x), y) = −y log(hθ(x)) − (1 − y) log(1 − hθ(x))

    If y = 1: Cost(hθ(x), y) = −log(hθ(x))
    If y = 0: Cost(hθ(x), y) = −log(1 − hθ(x))
The Cost of a Sample

    Cost of sample = −y log(hθ(x)) − (1 − y) log(1 − hθ(x))
                   = −y log(1 / (1 + e^(−θᵀx))) − (1 − y) log(1 − 1 / (1 + e^(−θᵀx)))

If y = 1, we want θᵀx >> 0: only the term −log(1 / (1 + e^(−θᵀx))) remains, and it decreases towards zero as z = θᵀx grows.
If y = 0, we want θᵀx << 0: only the term −log(1 − 1 / (1 + e^(−θᵀx))) remains, and it decreases towards zero as z = θᵀx becomes more negative.
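A direct transcription of the per-sample logistic cost into code (a sketch; the parameter and sample values are arbitrary):

```python
import numpy as np

def logistic_cost(theta, x, y):
    """Per-sample cost: -y*log(h) - (1-y)*log(1-h), with h = sigmoid(theta^T x)."""
    h = 1.0 / (1.0 + np.exp(-theta @ x))
    return -y * np.log(h) - (1 - y) * np.log(1 - h)

theta = np.array([2.0, -1.0])
print(logistic_cost(theta, np.array([3.0, 1.0]), 1))  # theta^T x = 5  -> cost close to 0
print(logistic_cost(theta, np.array([3.0, 1.0]), 0))  # same z, wrong label -> large cost
```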
Towards SVM

    Cost of sample = y · cost₁(θᵀx) + (1 − y) · cost₀(θᵀx)

cost₁ and cost₀ are piecewise-linear approximations of the logistic costs −log(1 / (1 + e^(−θᵀx))) and −log(1 − 1 / (1 + e^(−θᵀx))):

If y = 1, we want θᵀx ≥ 1: cost₁(θᵀx) is zero there and grows linearly for θᵀx < 1.
If y = 0, we want θᵀx ≤ −1: cost₀(θᵀx) is zero there and grows linearly for θᵀx > −1.
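The piecewise-linear surrogates cost₁ and cost₀ can be written directly as hinge functions (a sketch; the exact slope of the surrogate is not important as long as the flat regions match the plots above):

```python
import numpy as np

def cost1(z):
    """Cost used when y = 1: zero for z >= 1, linear penalty below (hinge)."""
    return np.maximum(0.0, 1.0 - z)

def cost0(z):
    """Cost used when y = 0: zero for z <= -1, linear penalty above (hinge)."""
    return np.maximum(0.0, 1.0 + z)

z = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
print(cost1(z))  # [3. 2. 1. 0. 0.]
print(cost0(z))  # [0. 0. 1. 2. 3.]
```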
Optimization Objective

Logistic Regression:

    min over θ:  (1/m) Σᵢ [ y⁽ⁱ⁾ (−log hθ(x⁽ⁱ⁾)) + (1 − y⁽ⁱ⁾) (−log(1 − hθ(x⁽ⁱ⁾))) ] + (λ/2m) Σⱼ θⱼ²

SVM:

    min over θ:  (1/m) Σᵢ [ y⁽ⁱ⁾ cost₁(θᵀx⁽ⁱ⁾) + (1 − y⁽ⁱ⁾) cost₀(θᵀx⁽ⁱ⁾) ] + (λ/2m) Σⱼ θⱼ²

By convention: we do not divide by m, and we use a weight C (equivalent to 1/λ) to control the relative weight of the regularization factor:

    min over θ:  C Σᵢ [ y⁽ⁱ⁾ cost₁(θᵀx⁽ⁱ⁾) + (1 − y⁽ⁱ⁾) cost₀(θᵀx⁽ⁱ⁾) ] + ½ Σⱼ θⱼ²
SVM Objective Function

    min over θ:  C Σᵢ [ y⁽ⁱ⁾ cost₁(θᵀx⁽ⁱ⁾) + (1 − y⁽ⁱ⁾) cost₀(θᵀx⁽ⁱ⁾) ] + ½ Σⱼ θⱼ²

Unlike logistic regression, the SVM output cannot be treated as a probability:

    hθ(x) = sign(θᵀx)
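Putting the pieces together, the full objective can be evaluated in a few lines (a sketch assuming the hinge form of cost₁/cost₀ shown earlier; X, y and theta are invented values):

```python
import numpy as np

def svm_objective(theta, X, y, C):
    """C * sum_i [ y_i*cost1(theta^T x_i) + (1-y_i)*cost0(theta^T x_i) ] + 0.5*sum_j theta_j^2,
    with y in {0, 1} and hinge-shaped cost1/cost0."""
    z = X @ theta
    cost1 = np.maximum(0.0, 1.0 - z)
    cost0 = np.maximum(0.0, 1.0 + z)
    return C * np.sum(y * cost1 + (1 - y) * cost0) + 0.5 * np.sum(theta ** 2)

def predict(theta, X):
    return np.sign(X @ theta)          # the output is a label, not a probability

X = np.array([[1.0, 2.0], [2.0, -1.0], [-1.5, 0.5]])
y = np.array([1, 1, 0])
theta = np.array([0.5, 0.5])
print(svm_objective(theta, X, y, C=1.0), predict(theta, X))
```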
Large Margin Classifier

    min over θ:  C Σᵢ [ y⁽ⁱ⁾ cost₁(θᵀx⁽ⁱ⁾) + (1 − y⁽ⁱ⁾) cost₀(θᵀx⁽ⁱ⁾) ] + ½ Σⱼ θⱼ²

If y = 1, we want θᵀx ≥ 1 (not just θᵀx ≥ 0)
If y = 0, we want θᵀx ≤ −1 (not just θᵀx < 0)
Large Margin Classifier

    min over θ:  C Σᵢ [ y⁽ⁱ⁾ cost₁(θᵀx⁽ⁱ⁾) + (1 − y⁽ⁱ⁾) cost₀(θᵀx⁽ⁱ⁾) ] + ½ Σⱼ θⱼ²

Suppose that we are exploring the set of (many possible) decision boundaries that correctly classify our data (C is set to a very large value):

    If y = 1, θᵀx ≥ 1
    If y = 0, θᵀx ≤ −1

The first term of the objective function is then zero, so minimizing the second term has the effect of selecting the solution with the maximum margin!
DUAL REPRESENTATION AND KERNELS
Optimisation Function

Maximise the margin ⟺ minimise:

    ½ |θ|² = ½ θᵀθ

Subject to (classify correctly):

    yᵢ(θᵀxᵢ + θ0) ≥ 1

This is a quadratic optimisation problem.
Dual Representation

OBJECTIVE: obtain the θ, θ0 defining the solution hyperplane θᵀxᵢ + θ0 = 0.
This is solved as a problem of quadratic optimisation (SVM PRIMAL):

    min over θ, θ0:  Φ(θ) = ½ θᵀθ
    subject to:      yᵢ(θᵀxᵢ + θ0) ≥ 1

    ⟹  max over α of ( min over θ, θ0 of L(θ, θ0, α) )

Construction of the Lagrangian L: for a problem "minimise f(x) subject to constraints gᵢ(x) ≥ 0" we build

    L(x, α) = f(x) − Σᵢ αᵢ gᵢ(x),   αᵢ ≥ 0   (Lagrange multipliers)

with
    f(x)  → Φ(θ)
    gᵢ(x) → yᵢ(θᵀxᵢ + θ0) − 1 ≥ 0

The Lagrangian L of the SVM problem is therefore:

    L(θ, θ0, α) = ½ θᵀθ − Σᵢ αᵢ [ yᵢ(θᵀxᵢ + θ0) − 1 ]

The minimisation of L with respect to θ, θ0 implies:

    ∂L/∂θ  = θ − Σᵢ αᵢ yᵢ xᵢ = 0   ⟹  θ = Σᵢ αᵢ yᵢ xᵢ
    ∂L/∂θ0 = −Σᵢ αᵢ yᵢ = 0        ⟹  Σᵢ αᵢ yᵢ = 0,   αᵢ ≥ 0

Substituting back into L:

    L(θ, θ0, α) = ½ (Σᵢ αᵢ yᵢ xᵢ)ᵀ (Σⱼ αⱼ yⱼ xⱼ) − Σᵢ αᵢ yᵢ Σⱼ αⱼ yⱼ xⱼᵀxᵢ + Σᵢ αᵢ
                = −½ Σᵢ Σⱼ αᵢ αⱼ yᵢ yⱼ xᵢᵀxⱼ + Σᵢ αᵢ

This gives the SVM DUAL problem:

    max over α:  Φ(α) = −½ Σᵢ Σⱼ αᵢ αⱼ yᵢ yⱼ xᵢᵀxⱼ + Σᵢ αᵢ
    subject to:  Σᵢ αᵢ yᵢ = 0,   αᵢ ≥ 0

From the dual solution we recover the primal parameters and the SVM classifier:

    θ  = Σᵢ αᵢ yᵢ xᵢ
    θ0 = yᵢ − θᵀxᵢ   (for any support vector xᵢ)

    ⟹  f(x) = sgn( Σᵢ αᵢ yᵢ xᵢᵀx + θ0 )       SVM Classifier
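The relations above can be checked numerically: scikit-learn's SVC exposes the dual solution, and for a linear kernel θ = Σᵢ αᵢyᵢxᵢ can be reconstructed from it (a sketch with invented data; dual_coef_ stores the products αᵢyᵢ for the support vectors):

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.5],
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.5]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)

# theta = sum_i alpha_i * y_i * x_i, using only the support vectors
theta = clf.dual_coef_ @ clf.support_vectors_   # shape (1, n_features)
print(np.allclose(theta, clf.coef_))            # True: matches the primal weights

# f(x) = sgn(theta^T x + theta_0)
x_new = np.array([[3.0, 3.0]])
print(np.sign(x_new @ theta.ravel() + clf.intercept_))
```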
Non-Linearly Separable Classes – The Kernel Trick

1. Transform the feature space into another (higher-dimensional) one.
2. Define a function that computes the dot product in this new space.
3. Our classes are now linearly separable (in the new space).

Mapping function:

    x → φ(x)

Kernel function:

    K(x1, x2) = φ(x1)ᵀ φ(x2)

(Figure: a 1D feature x mapped by φ into a 2D space.)
The Kernel Trick

Dual Form
Maximise:

    L̃(α) = −½ Σᵢ Σⱼ αᵢ αⱼ yᵢ yⱼ xᵢᵀxⱼ + Σᵢ αᵢ

Subject to:   αᵢ ≥ 0,   Σᵢ αᵢ yᵢ = 0

Kernel SVM
Maximise:

    L̃(α) = −½ Σᵢ Σⱼ αᵢ αⱼ yᵢ yⱼ K(xᵢ, xⱼ) + Σᵢ αᵢ

Subject to:   αᵢ ≥ 0,   Σᵢ αᵢ yᵢ = 0

    θ0 = yᵢ − θᵀφ(xᵢ) = yᵢ − Σⱼ yⱼ αⱼ K(xⱼ, xᵢ)

    hθ(x) = sign( Σᵢ yᵢ αᵢ K(x, xᵢ) + θ0 )
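The kernelised classifier h(x) = sign(Σᵢ yᵢαᵢ K(x, xᵢ) + θ0) can be reproduced from a fitted RBF SVM (a sketch; the data is invented, and scikit-learn's gamma plays the role of 1/2σ²):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = np.where(np.linalg.norm(X, axis=1) > 1.0, 1, -1)   # ring-shaped, not linearly separable

clf = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, y)

def rbf(a, b, gamma=0.5):
    # K(a, b) = exp(-gamma * ||a - b||^2), evaluated row-wise
    return np.exp(-gamma * np.sum((a - b) ** 2, axis=-1))

x_new = np.array([0.2, 0.1])
# h(x) = sign( sum_i (alpha_i * y_i) * K(x, x_i) + theta_0 )
decision = clf.dual_coef_ @ rbf(clf.support_vectors_, x_new) + clf.intercept_
print(np.sign(decision), clf.predict(x_new[None, :]))   # the two should agree
```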
Example Kernel Functions

Kernel functions introduce their own parameters, the values of which are generally fixed through cross-validation.

    Polynomial:             K(x, z) = (xᵀz)^d

    Radial Basis Function:  K(x, z) = exp(−‖x − z‖² / 2σ²)

    Sigmoid:                K(x, z) = tanh(κ⟨x, z⟩ − δ)

    Inverse Multiquadric:   K(x, z) = (‖x − z‖² + c²)^(−1/2)

    Intersection Kernel:    K(x, z) = Σᵢ₌₁ⁿ min(x(i), z(i))

Kernel functions must be continuous and symmetric, and should preferably have a positive (semi-)definite Gram matrix. Kernels that satisfy Mercer's theorem are positive semi-definite, meaning their kernel matrices have only non-negative eigenvalues. The use of a positive definite kernel ensures that the optimization problem is convex and its solution is unique.

http://crsouza.com/2010/03/17/kernel-functions-for-machine-learning-applications/
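A few of these kernels written out directly (a sketch; the parameter names d, sigma, kappa, delta follow the formulas above):

```python
import numpy as np

def polynomial_kernel(x, z, d=2):
    return (x @ z) ** d

def rbf_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

def sigmoid_kernel(x, z, kappa=1.0, delta=0.0):
    return np.tanh(kappa * (x @ z) - delta)

def intersection_kernel(x, z):
    return np.sum(np.minimum(x, z))

x, z = np.array([1.0, 2.0]), np.array([2.0, 0.5])
print(polynomial_kernel(x, z), rbf_kernel(x, z), intersection_kernel(x, z))
```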
SOFT MARGIN, SLACK VARIABLES
Outliers - Soft Margin

Outliers (if they happen to be support vectors) might have a big effect on the solution.

In this case, we can relax the maximum margin condition (soft margin).

This implies adding a tolerance to errors, controlled by what are called slack variables.
Outliers - Soft Margin

    min over θ:  C Σᵢ [ y⁽ⁱ⁾ cost₁(θᵀx⁽ⁱ⁾) + (1 − y⁽ⁱ⁾) cost₀(θᵀx⁽ⁱ⁾) ] + ½ Σⱼ θⱼ²

    C = 1/λ

With C very large, the objective behaves like the hard margin and the boundary is pulled by the outlier; with C not too large, the outlier is tolerated and the margin stays wide.
Outliers - Soft Margin
Primal Form

Minimise:

    ½ θᵀθ + C Σᵢ ξᵢ

Subject to:

    yᵢ(θᵀxᵢ + θ0) ≥ 1 − ξᵢ,   ξᵢ ≥ 0

The impact of the slack variables ξᵢ is modulated in the optimization problem by a parameter C, which can be interpreted as a regularization parameter (maximization of the margin vs. minimization of the accepted error).
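The role of C can be seen directly with scikit-learn by fitting the same data, containing one outlier, with a small and a large C (invented data; a sketch, not part of the slides):

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.5],
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.5],
              [3.8, 3.8]])                 # an outlier of class -1 inside the +1 region
y = np.array([-1, -1, -1, 1, 1, 1, -1])

for C in (0.1, 1000.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Small C: wide margin, the outlier's slack is accepted (it may be misclassified).
    # Large C: slack is expensive, so the boundary is pulled toward the outlier
    # and the margin shrinks.
    print(C, clf.coef_, clf.n_support_)
```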
Outliers - Soft Margin
Dual Form

Maximise:

    L̃(α) = −½ Σᵢ Σⱼ αᵢ αⱼ yᵢ yⱼ xᵢᵀxⱼ + Σᵢ αᵢ

Subject to:

    Σᵢ αᵢ yᵢ = 0,   C ≥ αᵢ ≥ 0

In the dual representation, slack variables imply a new inequality (the upper bound C on αᵢ) that regulates the contribution of the vectors.
Support Vector Machines (SVM)
Key concepts:

• SVM as a quadratic optimisation problem.

• The dual problem allows the solution to be expressed explicitly in terms of scalar products.

• The solution expressed in terms of scalar products allows the introduction of kernels.

• Kernel functions make it possible to classify non-linearly separable datasets.

• The slack variables introduce a regularisation factor that allows tolerance to errors by relaxing the condition of maximal margin.

• Both the kernel parameters and the regularisation factor must be adjusted during the training stage (a cross-validation sketch follows below).

Note: more penalty → underfitting; less penalty → overfitting. Since C = 1/regularisation, a larger C → overfitting and a smaller C → underfitting.
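A hedged sketch of how this adjustment is typically done in practice with scikit-learn (the dataset and parameter grid are invented for illustration):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(np.linalg.norm(X, axis=1) > 1.2, 1, -1)   # non-linearly separable toy data

# Cross-validate both the regularisation factor C and the kernel parameter gamma
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1, 10]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5).fit(X, y)

print(search.best_params_, search.best_score_)
```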
