Ac-Ed L05 - SVMs, Kernels
Classification

Generative models: try to model each class, e.g. estimate the PDF that gave rise to these points.
We learn: p(x | y)

Discriminative models instead model the label given the data.
We learn: p(y | x)

[Figure: labelled training data (data, label) in the (x1, x2) plane]
Linearly Separable Classes

Decision boundary: θᵀx = 0
θᵀx > 0 on one side, θᵀx < 0 on the other.
θ = [θ0, θ1, ..., θn]ᵀ,  x = [x0 = 1, x1, ..., xn]ᵀ

[Figure: two classes in the (x1, x2) plane separated by the line θᵀx = 0]
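A minimal sketch of this decision rule in Python (the θ values below are made-up examples): with the bias folded into the feature vector as x0 = 1, classification is just the sign of θᵀx.

```python
import numpy as np

# Linear decision rule theta^T x with the bias folded in as x0 = 1,
# as on the slide. The theta values are arbitrary examples.
theta = np.array([-0.5, 1.0, 2.0])       # [theta_0, theta_1, theta_2]

def classify(x1, x2):
    x = np.array([1.0, x1, x2])          # x = [x0 = 1, x1, x2]^T
    return 1 if theta @ x > 0 else -1    # sign of theta^T x picks the side

print(classify(0.3, 0.4))   # which side of theta^T x = 0 this point falls on
```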
Linearly Separable Classes

There are multiple valid solutions (hyperplanes that separate the two classes). Which one shall we choose?

[Figure: several separating lines drawn through the same two classes]
Support Vector Machines

Linear classifiers: the decision boundary is a hyperplane.

[Figure: the two classes with a separating hyperplane and its margin]
Support Vector Machines

Every other decision boundary has a smaller margin.

[Figure: alternative boundaries with visibly smaller margins]
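A hedged illustration with scikit-learn (the toy data and parameter values are assumptions, not from the slides): a linear SVC with a large C behaves like the hard-margin classifier sketched here, and exposes the few support vectors that alone define the boundary.

```python
import numpy as np
from sklearn.svm import SVC

# Assumed toy data: two linearly separable clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(3, 0.5, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)

# A very large C approximates the hard-margin classifier described above.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

# Only a few points end up as support vectors; they alone fix the boundary.
print(clf.support_vectors_)          # the support vectors
print(clf.coef_, clf.intercept_)     # theta and theta_0 of the hyperplane
```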
Support Vector Machines

If the support vectors move, the decision boundary changes.

[Figure: moving a support vector tilts the boundary]
x1
Support Vector Machines
x2 If the rest of the points move,
the decision boundary
remains stable
x1
Support Vector Machines
x2 If the rest of the points move,
the decision boundary
remains stable
x1
Outliers - Soft Margin
x2 Outliers (if they happen to be
support vectors) might have a
big effect to the solution
x1
Outliers - Soft Margin
x2 Outliers (if they happen to be
support vectors) might have a
Pero aun así, pueden llegar a ser un problema
big effect to the solution
x1
Lo que haremos es suavizar
añadiendo una tolerancia
(Margin note: and what if my space is not linearly separable? See the Kernel Trick section below.)

The separating hyperplane θᵀx + θ0 = 0 is defined by the parameter vector θ.
d+ : distance from the hyperplane to the support vectors of class (+)
d− : distance from the hyperplane to the support vectors of class (−)
d+ = d− = D
A bit of Geometry

Given a hyperplane defined by θ, the distance of a point x to the hyperplane is given by:

D = ‖x‖ cos(α)

The definition of the dot product between two vectors x and θ is:

θᵀx = θ · x = ‖θ‖ ‖x‖ cos(α)

Combining the two: D = θᵀx / ‖θ‖.
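A quick numeric check of this formula, extended with the offset θ0 used in the following slides (hyperplane and point are arbitrary example values):

```python
import numpy as np

# Distance of x to the hyperplane theta^T x + theta_0 = 0
# is (theta^T x + theta_0) / ||theta||.
theta, theta_0 = np.array([3.0, 4.0]), -5.0   # example hyperplane
x = np.array([2.0, 1.0])

D = (theta @ x + theta_0) / np.linalg.norm(theta)
print(D)   # (3*2 + 4*1 - 5) / 5 = 1.0
```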
Classify correctly:
θᵀx + θ0 ≥ 1   if y = 1
θᵀx + θ0 ≤ −1  if y = −1

This is a quadratic optimisation problem.
Cost Functions

Margin hyperplanes: θᵀx + θ0 = 0, θᵀx + θ0 = 1, θᵀx + θ0 = −1.

This means setting the costs so that we do not penalise (cost = zero) anything classified correctly (θᵀx + θ0 ≥ 1 for y = 1, θᵀx + θ0 ≤ −1 for y = −1), but we do penalise if things fall on the wrong side.

[Figure: Cost_{y=1} and Cost_{y=−1} as functions of θᵀx + θ0, with hinge points at +1 and −1]
COMPARISON TO LOGISTIC REGRESSION
Logistic Regression Reminder

hθ(x) = 1 / (1 + e^(−θᵀx)),   i.e. hθ(x) = g(z) with z = θᵀx

If y = 1, we want hθ(x) ≈ 1, i.e. θᵀx ≫ 0
If y = 0, we want hθ(x) ≈ 0, i.e. θᵀx ≪ 0

[Figure: the sigmoid g(z), approaching 1 for y = 1 and 0 for y = 0]
Simplified cost function

J(θ) = (1/m) Σᵢ₌₁ᵐ Cost(hθ(x⁽ⁱ⁾), y⁽ⁱ⁾)

Cost(hθ(x), y) = −log(hθ(x))        if y = 1
                 −log(1 − hθ(x))    if y = 0

(Note that y is always either 0 or 1.)

[Figure: both cost branches plotted against z = θᵀx]
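The slide's cost, written out directly in Python (θ and x are arbitrary example values):

```python
import numpy as np

def h(theta, x):
    return 1.0 / (1.0 + np.exp(-(theta @ x)))    # h_theta(x) = sigmoid(theta^T x)

def cost(theta, x, y):
    # -log(h(x)) if y = 1,  -log(1 - h(x)) if y = 0, as on the slide
    return -np.log(h(theta, x)) if y == 1 else -np.log(1.0 - h(theta, x))

theta = np.array([1.0, 2.0])
x = np.array([2.0, 1.0])                          # theta^T x = 4 >> 0
print(cost(theta, x, 1))   # ~0.018: tiny cost, confident and correct
print(cost(theta, x, 0))   # ~4.02: large cost if the true label were 0
```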
The Cost of a Sample

Cost of sample = −y log(hθ(x)) − (1 − y) log(1 − hθ(x))
               = −y log( 1 / (1 + e^(−θᵀx)) ) − (1 − y) log( 1 − 1 / (1 + e^(−θᵀx)) )

If y = 1, we want θᵀx ≫ 0.   If y = 0, we want θᵀx ≪ 0.

[Figure: −log(1/(1+e^(−z))) and −log(1 − 1/(1+e^(−z))) plotted against z = θᵀx]
Towards SVM

Cost of sample = y · cost₁(θᵀx) + (1 − y) · cost₀(θᵀx)

where cost₁ replaces −log(hθ(x)) and cost₀ replaces −log(1 − hθ(x)) with piecewise-linear approximations.

[Figure: cost₁(z) and cost₀(z) plotted against z = θᵀx]
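One common choice for cost₁ and cost₀ is the hinge form below; a sketch (the exact slopes drawn on the slide may differ):

```python
import numpy as np

# Piecewise-linear (hinge) surrogates for the logistic costs,
# zero once the sample is past the margin.
def cost1(z):                        # for y = 1: zero once z >= 1
    return np.maximum(0.0, 1.0 - z)

def cost0(z):                        # for y = 0: zero once z <= -1
    return np.maximum(0.0, 1.0 + z)

z = np.array([-2.0, 0.0, 2.0])
print(cost1(z))   # [3. 1. 0.]
print(cost0(z))   # [0. 1. 3.]
```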
Optimization Objective

Logistic Regression:

min_θ (1/m) Σᵢ₌₁ᵐ [ y⁽ⁱ⁾ (−log hθ(x⁽ⁱ⁾)) + (1 − y⁽ⁱ⁾)(−log(1 − hθ(x⁽ⁱ⁾))) ] + (λ/2m) Σⱼ₌₁ⁿ θⱼ²

SVM:

min_θ (1/m) Σᵢ₌₁ᵐ [ y⁽ⁱ⁾ cost₁(θᵀx⁽ⁱ⁾) + (1 − y⁽ⁱ⁾) cost₀(θᵀx⁽ⁱ⁾) ] + (λ/2m) Σⱼ₌₁ⁿ θⱼ²

By convention: we do not divide by m, and we use a weight C (equivalent to 1/λ) to control the relative weight of the regularization factor:

min_θ C Σᵢ₌₁ᵐ [ y⁽ⁱ⁾ cost₁(θᵀx⁽ⁱ⁾) + (1 − y⁽ⁱ⁾) cost₀(θᵀx⁽ⁱ⁾) ] + ½ Σⱼ₌₁ⁿ θⱼ²
SVM Objective Function

min_θ C Σᵢ₌₁ᵐ [ y⁽ⁱ⁾ cost₁(θᵀx⁽ⁱ⁾) + (1 − y⁽ⁱ⁾) cost₀(θᵀx⁽ⁱ⁾) ] + ½ Σⱼ₌₁ⁿ θⱼ²

hθ(x) = sign(θᵀx)
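A direct transcription of this objective as a Python function, using the hinge forms of cost₁/cost₀ from above (the data below is an arbitrary example):

```python
import numpy as np

# J(theta) = C * sum of per-sample costs + (1/2) * sum_j theta_j^2,
# with labels y in {0, 1} as in the formula; theta_0 is kept out of the penalty.
def svm_objective(theta, theta_0, X, y, C):
    z = X @ theta + theta_0
    cost1 = np.maximum(0.0, 1.0 - z)     # active where y = 1
    cost0 = np.maximum(0.0, 1.0 + z)     # active where y = 0
    data_term = C * np.sum(y * cost1 + (1 - y) * cost0)
    return data_term + 0.5 * np.sum(theta ** 2)

X = np.array([[2.0, 1.0], [-1.0, -2.0]])     # arbitrary example points
y = np.array([1, 0])
print(svm_objective(np.array([1.0, 0.0]), 0.0, X, y, C=1.0))  # 0.5: only the regulariser
```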
Large Margin Classifier

min_θ C Σᵢ₌₁ᵐ [ y⁽ⁱ⁾ cost₁(θᵀx⁽ⁱ⁾) + (1 − y⁽ⁱ⁾) cost₀(θᵀx⁽ⁱ⁾) ] + ½ Σⱼ₌₁ⁿ θⱼ²

[Figure: cost₁ and cost₀ against z = θᵀx, with hinges at z = 1 and z = −1]

Classify correctly:
θᵀx + θ0 ≥ 1   if y = 1
θᵀx + θ0 ≤ −1  if y = −1

This is a quadratic optimisation problem.
Dual Representation

OBJECTIVE: Obtain θ, θ0 belonging to the solution hyperplane: θᵀxᵢ + θ0 = 0

This is solved as a problem of quadratic optimisation (SVM PRIMAL):

min_{θ,θ0} Φ(θ) = ½ θᵀθ
Subject to: yᵢ(θᵀxᵢ + θ0) ≥ 1

With Lagrange multipliers α, the function to optimise and the constraints map as
f(x) → Φ(θ),  gᵢ(x) → yᵢ(θᵀxᵢ + θ0) − 1 ≥ 0,
so the Lagrangian L of the SVM problem is:

L(θ, θ0, α) = ½ θᵀθ − Σᵢ αᵢ [ yᵢ(θᵀxᵢ + θ0) − 1 ]

⟹ max_α ( min_{θ,θ0} L(θ, θ0, α) )
Minimising L over θ and θ0 gives θ = Σᵢ αᵢyᵢxᵢ; substituting back:

L(θ, θ0, α) = ½ (Σᵢ αᵢyᵢxᵢ)ᵀ (Σⱼ αⱼyⱼxⱼ) − Σᵢ αᵢyᵢ Σⱼ αⱼyⱼ xⱼᵀxᵢ + Σᵢ αᵢ
            = −½ Σᵢ Σⱼ αᵢαⱼ yᵢyⱼ xᵢᵀxⱼ + Σᵢ αᵢ

SVM DUAL:

max_α Φ(α) = −½ Σᵢ Σⱼ αᵢαⱼ yᵢyⱼ xᵢᵀxⱼ + Σᵢ αᵢ
Subject to: Σᵢ αᵢyᵢ = 0,  αᵢ ≥ 0

SVM Classifier:

θ = Σᵢ αᵢyᵢxᵢ
θ0 = yᵢ − θᵀxᵢ   (for any support vector xᵢ)
⟹ f(x) = sgn( Σᵢ αᵢyᵢ xᵢᵀx + θ0 )
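The boxed relation θ = Σᵢ αᵢyᵢxᵢ can be checked with scikit-learn, whose SVC exposes dual_coef_ = αᵢyᵢ for the support vectors (the toy data is an assumption):

```python
import numpy as np
from sklearn.svm import SVC

# Assumed toy data: two separable clusters.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.6, (25, 2)), rng.normal(3, 0.6, (25, 2))])
y = np.array([-1] * 25 + [1] * 25)

clf = SVC(kernel="linear", C=10.0).fit(X, y)

# dual_coef_ holds alpha_i * y_i for each support vector, so
# theta = sum_i alpha_i y_i x_i is one matrix product away:
theta_from_dual = clf.dual_coef_ @ clf.support_vectors_
print(theta_from_dual)   # matches the primal weights...
print(clf.coef_)         # ...reported for the linear kernel
```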
Non-Linearly Separable Classes – The Kernel Trick

1. Transform the features to another (higher-dimensional) space.
2. Define a function that computes the dot product in this new space.
3. Now our features are linearly separable (in the new space).

If classes are not linearly separable, SVM will not work. The solution is to transform the feature space into another one (of higher dimension), in which our classes are linearly separable.

[Figure: 1D data mapped by φ(x) into 2D, where it becomes linearly separable]

Kernel function: K(x₁, x₂) = φ(x₁)ᵀ φ(x₂)
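A minimal sketch of the trick with an assumed feature map φ(x) = (x, x²), matching the 1D → 2D picture above: the kernel K(x₁, x₂) = x₁x₂ + (x₁x₂)² reproduces φ(x₁)ᵀφ(x₂) without ever constructing φ.

```python
import numpy as np

# 1D points: inner vs outer classes are not separable on the line,
# but become linearly separable in the (x, x^2) plane.
x = np.array([-2.0, -1.5, -0.5, 0.0, 0.5, 1.5, 2.0])
y = np.where(np.abs(x) < 1.0, 1, -1)

phi = np.column_stack([x, x ** 2])      # explicit map: separable by x^2 = 1

# The kernel gives the same dot products without ever building phi:
# K(x1, x2) = phi(x1)^T phi(x2) = x1*x2 + (x1*x2)^2
K = np.outer(x, x) + np.outer(x, x) ** 2
print(np.allclose(K, phi @ phi.T))      # True
```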
The Kernel Trick

Dual Form

Maximise:   L̃(α) = −½ Σᵢ Σⱼ αᵢαⱼ yᵢyⱼ xᵢᵀxⱼ + Σᵢ αᵢ
Subject to: αᵢ ≥ 0,  Σᵢ₌₁ᴺ αᵢyᵢ = 0
Kernel SVM

Maximise:   L̃(α) = −½ Σᵢ Σⱼ αᵢαⱼ yᵢyⱼ K(xᵢ, xⱼ) + Σᵢ αᵢ
Subject to: αᵢ ≥ 0,  Σᵢ₌₁ᴺ αᵢyᵢ = 0

θ0 = yᵢ − θᵀφ(xᵢ) = yᵢ − Σⱼ yⱼαⱼ K(xⱼ, xᵢ)
Polynomial:              K(x, z) = (xᵀz)ᵈ
Radial Basis Function:   K(x, z) = e^(−‖x − z‖² / 2σ²)
Inverse Multiquadric:    K(x, z) = (‖x − z‖² + c²)^(−1/2)
Intersection Kernel:     K(x, z) = Σᵢ₌₁ⁿ min(x(i), z(i))
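Direct transcriptions of these four kernels for a pair of vectors (the parameter values d, σ, c are arbitrary defaults):

```python
import numpy as np

def polynomial(x, z, d=2):
    return (x @ z) ** d                                  # (x^T z)^d

def rbf(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

def inverse_multiquadric(x, z, c=1.0):
    return (np.sum((x - z) ** 2) + c ** 2) ** -0.5       # (||x-z||^2 + c^2)^(-1/2)

def intersection(x, z):
    return np.sum(np.minimum(x, z))                      # sum_i min(x(i), z(i))

x, z = np.array([1.0, 2.0]), np.array([2.0, 0.5])
print(polynomial(x, z), rbf(x, z), inverse_multiquadric(x, z), intersection(x, z))
```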
Kernel functions must be continuous, symmetric, and preferably should have a positive (semi-)definite Gram matrix. Kernels which satisfy Mercer's theorem are positive semi-definite, meaning their kernel matrices have only non-negative eigenvalues. The use of a positive definite kernel ensures that the optimization problem is convex and the solution unique.
http://crsouza.com/2010/03/17/kernel-functions-for-machine-learning-applications/
SOFT MARGIN, SLACK VARIABLES
Outliers - Soft Margin

Outliers (if they happen to be support vectors) might have a big effect on the solution.

[Figure: an outlier near the boundary acting as a support vector]
Outliers - Soft Margin

min_θ C Σᵢ₌₁ᵐ [ y⁽ⁱ⁾ cost₁(θᵀx⁽ⁱ⁾) + (1 − y⁽ⁱ⁾) cost₀(θᵀx⁽ⁱ⁾) ] + ½ Σⱼ₌₁ⁿ θⱼ²

C = 1/λ. With C very large, the misclassification costs dominate and the boundary is forced to accommodate the outlier.

[Figure: decision boundaries for very large C vs. moderate C]
Outliers - Soft Margin

Primal Form

Minimise:   ½ θᵀθ + C Σᵢ ξᵢ
Subject to: yᵢ(θᵀxᵢ + θ0) ≥ 1 − ξᵢ,  ξᵢ ≥ 0

Dual Form

Subject to: Σᵢ₌₁ᴺ αᵢyᵢ = 0,  0 ≤ αᵢ ≤ C

In the dual representation, the slack variables imply a new inequality that regulates the contribution of the vectors.
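A sketch of the trade-off C controls, using scikit-learn on assumed toy data with a single mislabelled outlier (recall C plays the role of 1/λ): large C bends the boundary toward the outlier, while small C lets the slack absorb it.

```python
import numpy as np
from sklearn.svm import SVC

# Assumed toy data: two clusters plus one mislabelled outlier.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.5, (20, 2)),
               rng.normal(3, 0.5, (20, 2)),
               [[0.2, 0.2]]])                 # sits inside the (-) cluster...
y = np.array([-1] * 20 + [1] * 20 + [1])      # ...but is labelled (+)

for C in (1e4, 0.1):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Large C: misclassification is expensive, the boundary chases the outlier.
    # Small C: the slack variable absorbs the outlier, the margin stays wide.
    print(C, clf.coef_, clf.intercept_, clf.n_support_)
```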
Support Vector Machines (SVM)

Key concepts:
• The dual problem expresses the solution entirely through scalar products, which is what enables the kernel trick.