Machine Learning Approaches - SVM



Machine Learning Approaches - SVM
B. Solaiman
Basel.solaiman@imt-atlantique.fr
Départ. Image & Traitement de l’Information
Brest, France

Introduction


Introduction 3

A pattern (sample, instance, example, ...) is represented by a set of K features, or
attributes, viewed as a K-dimensional feature vector X ∈ R^K.

The pattern “may” be associated with a target variable y = f(X) (in the case of
supervised learning).

Stored data (training set): (Xm, ym), with Xm ∈ R^K and ym = f(Xm),
where f is the unknown function associating X and y.

Machine learning: given the training set, estimate the prediction function f̃ by
minimizing “a” prediction error:  f̃ : Xm ∈ R^K  →  ỹm = f̃(Xm)


Introduction 4

No target variable y  →  unlabeled training data

y ∈ L = {C1, C2, …, Cn}  →  labeled training data (L: set of discrete labels)
→  classification problem

y ∈ R  →  labeled training data
→  regression problem: instead of predicting the class of an input pattern, we want to
predict continuous values


Introduction 5

Nature of a feature

Numerical, or quantitative, feature

Symbolic, or qualitative, feature
  Nominal feature: feature values are symbols (= and ≠ are the only operations we can
  conduct)
  Ordinal feature: feature values are “ordered” symbols (=, ≠, <, >, ≤ and ≥ are the
  only operations we can conduct)


Introduction 6

Distance between patterns: numerical features

X = [x1, x2, …, xn]  and  Y = [y1, y2, …, yn],  xi, yi ∈ R

Minkowski distance:  d_p(X, Y) = ( Σ_{i=1..n} |xi − yi|^p )^(1/p)

Manhattan distance:  d_1(X, Y) = Σ_{i=1..n} |xi − yi|             (p = 1)

Euclidean distance:  d_2(X, Y) = ( Σ_{i=1..n} (xi − yi)² )^(1/2)  (p = 2)

Max distance:        d_∞(X, Y) = max_{i=1..n} |xi − yi|           (p = ∞)


Introduction 7

Distance between patterns: nominal features

X = [x1, x2, …, xn]  and  Y = [y1, y2, …, yn],  with X, Y ∈ Ω = Ω1 × Ω2 × … × Ωn

dist(X, Y) = (n − Q) / n

Q: number of features having the same modality in X and Y
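A minimal Python sketch of this distance (the function name and example values are
mine):

```python
def nominal_distance(x, y):
    """dist(X, Y) = (n - Q) / n for two nominal feature vectors of equal length."""
    q = sum(1 for xi, yi in zip(x, y) if xi == yi)   # Q: features with the same modality
    return (len(x) - q) / len(x)

print(nominal_distance(["red", "round", "small"],
                       ["red", "square", "small"]))   # 1/3 ≈ 0.333
```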


Introduction 8

Distance between patterns: nominal binary features

X = [x1, x2, …, xn]  and  Y = [y1, y2, …, yn],  xi, yi ∈ Ω = {0, 1}

Confusion matrix (counts over the n features):
           Y = 1   Y = 0
  X = 1      a       b
  X = 0      c       d

a: number of features equal to 1 in X and 1 in Y
b: number of features equal to 1 in X and 0 in Y
c: number of features equal to 0 in X and 1 in Y
d: number of features equal to 0 in X and 0 in Y


Introduction 9

Distance between patterns: nominal binary features

Case 1: binary symmetric features (0 and 1 have the same importance)

Simple matching coefficient:
dist(X, Y) = (b + c) / (a + b + c + d)

Example (a = 2, b = 2, c = 1, d = 2):
dist(X, Y) = (2 + 1) / (2 + 2 + 1 + 2) = 3 / 7 = 0.429

Introduction 10

Distance between patterns: nominal binary features

Case 2: binary non-symmetric features (1 is more important than 0)

Jaccard coefficient:
dist(X, Y) = (b + c) / (a + b + c)

Example (a = 2, b = 2, c = 1):
dist(X, Y) = (2 + 1) / (2 + 2 + 1) = 3 / 5 = 0.6
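A minimal Python sketch covering both binary cases; the example vectors are my own
choice, picked so that a = 2, b = 2, c = 1, d = 2 as in the slides:

```python
def binary_counts(x, y):
    """Counts a, b, c, d of the confusion matrix between two binary vectors."""
    a = sum(xi == 1 and yi == 1 for xi, yi in zip(x, y))
    b = sum(xi == 1 and yi == 0 for xi, yi in zip(x, y))
    c = sum(xi == 0 and yi == 1 for xi, yi in zip(x, y))
    d = sum(xi == 0 and yi == 0 for xi, yi in zip(x, y))
    return a, b, c, d

def simple_matching_distance(x, y):        # Case 1: symmetric binary features
    a, b, c, d = binary_counts(x, y)
    return (b + c) / (a + b + c + d)

def jaccard_distance(x, y):                # Case 2: non-symmetric binary features
    a, b, c, _ = binary_counts(x, y)
    return (b + c) / (a + b + c)

X = [1, 1, 1, 1, 0, 0, 0]
Y = [1, 1, 0, 0, 1, 0, 0]                  # a = 2, b = 2, c = 1, d = 2
print(simple_matching_distance(X, Y))      # 3/7 ≈ 0.429
print(jaccard_distance(X, Y))              # 3/5 = 0.6
```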

Support Vector Machines (SVM)


Support Vector Machines (SVM)


Linear Models for Classification
Perceptron
A Bit of Geometry
Hard (Linear) Support Vector Machines (H-SVM)

Soft Margin Support Vector Machines (SM-SVM)


Kernel-based Support Vector Machines


Support Vector Machines (SVM) 13

Linear Models for Classification

[Figure: training set with two classes C1 and C2 in the (x1, x2) plane, separated by a
hyperplane H]

Model for classification: the model assigns (x1, x2) to C1 or (x1, x2) to C2.

Training set: binary classification (two classes).
Linear model: the separation surface H is a hyperplane.

Support Vector Machines (SVM) 14

How does it work?

[Figure: hyperplane H separating C1 and C2, with half-spaces H+ and H−]

Hyperplane equation:  w · X + b = 0
X = (x1, x2, …, xn)
w = (w1, w2, …, wn)
w · X = x1·w1 + x2·w2 + … + xn·wn

If X is “on” H:  w · X + b = 0
If X ∈ H+:       w · X + b > 0
If X ∈ H−:       w · X + b < 0

Support Vector Machines (SVM) 15

[Figure: hyperplane H separating C1 and C2 in the (x1, x2) plane]

Linear model, classification decision rule: given X, compute w · X + b.

If w · X + b > 0, then decide C1
If w · X + b < 0, then decide C2
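A minimal Python sketch of this decision rule (the weights and points below are
hypothetical):

```python
import numpy as np

def linear_decision(w, b, X):
    """Decide C1 if w . X + b > 0, else C2."""
    return "C1" if np.dot(w, X) + b > 0 else "C2"

w, b = np.array([1.0, -2.0]), 0.5          # hypothetical hyperplane
print(linear_decision(w, b, [3.0, 1.0]))   # 3 - 2 + 0.5 = 1.5 > 0  -> C1
print(linear_decision(w, b, [0.0, 2.0]))   # 0 - 4 + 0.5 < 0        -> C2
```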


Support Vector Machines (SVM) 16

Linear Model for Binary Classification

Input: training dataset of pairs (X, y), with X = (x1, x2, …, xn) and y ∈ {+1, −1}

Find w = (w1, w2, …, wn) and b such that:
w · X + b > 0 if y = +1
w · X + b < 0 if y = −1
(separating hyperplane: w · X + b = 0)


Support Vector Machines (SVM) 17

Perceptron, Rosenblatt - 1958

[Figure: perceptron unit with inputs x1, x2, …, xn, weights w1, w2, …, wn and bias b]

f(X) = w · X + b = x1·w1 + x2·w2 + … + xn·wn + b  ≥ 0 ?

With b = w0 and x0 = 1:  f(X) = w · X  (dot product)

Decision rule: if f(X) = w · X > 0, then +1
               if f(X) = w · X < 0, then −1


Support Vector Machines (SVM) 18

A Bit of Geometry

Dot product:  <w, X> = w · X = ||w|| ||X|| cos θ = x1·w1 + x2·w2 + … + xn·wn

[Figure: when the angle θ between w and X is < 90°, w · X > 0; when θ > 90°, w · X < 0]

Support Vector Machines (SVM) 19

Perceptron, Rosenblatt - 1958

Decision rule: if f(X) = w · X > 0, then +1
               if f(X) = w · X < 0, then −1

Types of error:
- f(X) = w · X > 0 whereas y = −1
- f(X) = w · X < 0 whereas y = +1

[Figure: in both cases the weight vector is corrected from wt to wt+1 using X]

Weight update:  Δw ∝ y · X


Support Vector Machines (SVM) 21

Perceptron, Rosenblatt - 1958

Perceptron algorithm:

Random initialization of (w1, w2, …, wn), b
Pick the training samples X one by one
Predict the class of X using the current (w1, w2, …, wn), b:  y’ = sign(w · X)
If y’ is correct (i.e., y = y’): no change, wt+1 = wt
If y’ is wrong: adjust wt:  wt+1 = wt + η · yt · Xt
► η is the learning rate parameter
► Xt is the t-th training example
► yt is the true class label of the t-th example ({+1, −1})
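A minimal Python sketch of this algorithm (the toy dataset is my own; the bias b is
absorbed as w0 with a constant input x0 = 1, as on slide 17):

```python
import numpy as np

def perceptron_train(X, y, eta=1.0, epochs=50):
    """Rosenblatt perceptron with the bias absorbed as w0 (constant input x0 = 1)."""
    X = np.hstack([np.ones((len(X), 1)), np.asarray(X, dtype=float)])   # prepend x0 = 1
    w = 0.01 * np.random.randn(X.shape[1])                              # random initialization
    for _ in range(epochs):
        for Xt, yt in zip(X, y):
            y_pred = 1 if np.dot(w, Xt) > 0 else -1
            if y_pred != yt:                  # wrong prediction: adjust the weights
                w = w + eta * yt * Xt
    return w

# Hypothetical linearly separable data: class +1 roughly where x1 + x2 is large
X = [[0, 0], [1, 0], [0, 1], [2, 2], [1.5, 1.0], [0.2, 0.3]]
y = [-1, -1, -1, +1, +1, -1]
w = perceptron_train(X, y)
preds = [1 if np.dot(w, np.hstack([[1.0], x])) > 0 else -1 for x in np.asarray(X, float)]
print(preds)   # matches y once the data have been separated (guaranteed for separable data)
```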


Support Vector Machines (SVM) 22

Perceptron, Rosenblatt - 1958


Support Vector Machines (SVM) 23

A Bit of Geometry

Projection of X onto w  =  ||X|| cos θ  =  w · X / ||w||

[Figure: vectors w and X with angle θ between them]


Support Vector Machines (SVM) 24

A Bit of Geometry: Parallel Hyperplanes

All “parallel hyperplanes” have:
- the same form of equation, w · X + b = 0;
- the same coefficients vector w;
- different origin translation values b.

Is the coefficients vector w perpendicular to H?
For any X1, X2 on H:  w · (X1X2) = w · (X2 − X1) = w · X2 − w · X1 = −b − (−b) = 0

[Figure: hyperplane H in the (x1, x2) plane, two points X1 and X2 on H, and the normal
vector w]


Support Vector Machines (SVM) 25

A Bit of Geometry: Distance between an instance X and H

D = dist(X, H) = ?

Take any point X0 on H. Then:
D = ||X0X|| · |cos θ|
  = |w · X0X| / ||w||
  = |w · X − w · X0| / ||w||
  = |w · X + b| / ||w||        (since w · X0 = −b)

[Figure: point X at distance D from the hyperplane H, with X0 on H and normal vector w]
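A minimal Python sketch of this distance (the hyperplane x1 + x2 − 1 = 0 used below is
just an example of mine):

```python
import numpy as np

def distance_to_hyperplane(w, b, X):
    """D = |w . X + b| / ||w||"""
    w = np.asarray(w, dtype=float)
    return abs(np.dot(w, X) + b) / np.linalg.norm(w)

print(distance_to_hyperplane([1.0, 1.0], -1.0, [2.0, 2.0]))   # |2 + 2 - 1| / sqrt(2) ≈ 2.121
```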

Support Vector Machines (SVM) 26

Hard (Linear) Support Vector Machines (H-SVM)


Which of the linear separators is optimal?


Support Vector Machines (SVM) 27

Hard (Linear) Support Vector Machines (H-SVM)

Largest margin concept:
The distance from the separating hyperplane H corresponds to the “confidence” of the
decision.

Example: we are more sure and confident about the class of A and B than about the
class of C.
[Figure: points A and B far from H, point C close to H]


Support Vector Machines (SVM) 28

Hard (Linear) Support Vector Machines (H-SVM)


Largest Margin Concept:

The margin of a linear classifier is the width by which the boundary could be
increased before hitting a data instance.


Support Vector Machines (SVM) 29

Hard (Linear) Support Vector Machines (H-SVM)

Largest margin concept (margin: D):
The decision boundary should be as far away from the data of both classes as possible.
“H-SVM”: the maximum margin linear classifier is the linear classifier with the
maximum margin.

Support vectors: the data points that the margin pushes up against.
Separating hyperplane:  w · X + b = 0
[Figure: classes C1 and C2 separated by H, with margin D and support vectors on the
margin boundaries]

Support Vector Machines (SVM) 30

Hard (Linear) Support Vector Machines (H-SVM)

Largest margin concept: margin width D = ?

The margin boundaries are the hyperplanes w · X + b = +1 (class C1 side) and
w · X + b = −1 (class C2 side). For a support vector X on either boundary,
|w · X + b| = 1, so:

D = 2 · |w · X + b| / ||w|| = 2 / ||w||

Support Vector Machines (SVM) 31

Hard (Linear) Support Vector Machines (H-SVM)

Vapnik, 1965; Vapnik, 1995:
Training set {(Xm, ym), m = 1, …, M},  Xm ∈ Rn,  ym ∈ {+1, −1}

H-SVM formulation: find w and b such that
Maximize D = 2 / ||w||,  i.e. minimize ||w||² / 2
Subject to the support vector (margin) constraints:
ym · (w · Xm + b) ≥ +1  for all m

This is a quadratic optimization problem, subject to linear constraints.
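A minimal sketch of this formulation using scikit-learn (assumed available). SVC solves
the soft-margin problem, so the hard margin is only approximated here by a very large
C; the toy points are my own:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 2.5], [0.0, 2.0],        # class +1 (hypothetical points)
              [-1.0, -1.0], [-2.0, -0.5], [0.0, -2.0]])  # class -1
y = np.array([+1, +1, +1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)               # near-hard-margin linear SVM
w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, ", b =", b)
print("margin width D = 2 / ||w|| =", 2.0 / np.linalg.norm(w))
print("support vectors:\n", clf.support_vectors_)
```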


Support Vector Machines (SVM) 32

1D Example:  a·x + b = 0

Dataset: {(-3, -1), (-1, -1), (+2, +1)}
[Figure: points at x = -3 and x = -1 (class -1) and x = +2 (class +1) on the number
line]

H: all lines whose crossing point lies between -1 and +2

Margin constraints ym · (w · Xm + b) ≥ +1, m = 1, …, M:
a · (-3) + b ≤ -1  →  b ≤ 3a - 1
a · (-1) + b ≤ -1  →  b ≤ a - 1      (support vector)
a · (+2) + b ≥ +1  →  b ≥ -2a + 1    (support vector)

Minimize ||w||² = a²  →  a = 2/3 ≈ 0.66,  b = -1/3 ≈ -0.33
(the decision boundary a·x + b = 0 sits at x = 0.5, the midpoint between the support
vectors -1 and +2)
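The same approach checks the 1D example numerically (scikit-learn assumed, with a large
C approximating the hard margin):

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[-3.0], [-1.0], [+2.0]])
y = np.array([-1, -1, +1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)
print(clf.coef_[0, 0], clf.intercept_[0])   # expected a ≈ 0.66, b ≈ -0.33
print(clf.support_vectors_)                 # expected [[-1.], [ 2.]]
```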

Support Vector Machines (SVM) 33

Hard (Linear) Support Vector Machines (H-SVM)

Reminder: the quadratic programming problem

Quadratic objective function:
F(w, b) = b + Σ_i ai·wi + Σ_i Σ_j q_{i,j}·wi·wj,   with w = (w1, w2, …, wn) and b

M linear constraints:
Σ_i c_{ij}·wi ≤ dj,   j = 1, 2, …, M

Quadratic programming (numerical analysis) finds the optimal w = (w1, w2, …, wn) and b.


Support Vector Machines (SVM) 34

Hard (Linear) Support Vector Machines (H-SVM)

Optimization using Lagrange multipliers

Lagrangian:  L = ||w||² / 2 + Σ_{m=1..M} λm · { 1 − [ym · (w · Xm + b)] }

The constraints are added into the optimization by adding extra variables λm (called
Lagrange multipliers).

||w||² = (w1)² + (w2)² + … + (wn)²


Support Vector Machines (SVM) 35

Hard (Linear) Support Vector Machines (H-SVM)

Optimization using Lagrange multipliers

Objective function:  L = ||w||² / 2 + Σ_{m=1..M} λm · { 1 − [ym · (w · Xm + b)] }

Setting the gradient of L w.r.t. w and b to zero, we have:

∂L/∂w = w − Σ_{m=1..M} λm·ym·Xm = 0   →   w* = Σ_{m=1..M} λm·ym·Xm
                                          (a linear sum of the samples)
∂L/∂b = − Σ_{m=1..M} λm·ym = 0        →   Σ_{m=1..M} λm·ym = 0


Support Vector Machines (SVM) 36

Substituting w* = Σ_{m=1..M} λm·ym·Xm into
L = ||w||² / 2 + Σ_{m=1..M} λm · { 1 − [ym · (w · Xm + b)] } :

L = (1/2)·[Σ_m λm·ym·Xm]·[Σ_m λm·ym·Xm] − Σ_m λm·ym·Xm·{Σ_k λk·yk·Xk} − Σ_m λm·ym·b + Σ_m λm

The optimization problem becomes (the dual problem): find the λm such that

L = −(1/2)·Σ_{m=1..M} Σ_{k=1..M} λm·λk·ym·yk·[Xm · Xk] + Σ_{m=1..M} λm   is maximized,

subject to  Σ_{m=1..M} λm·ym = 0  and  λm ≥ 0,  m = 1, …, M

Support Vector Machines (SVM) 37

L = −(1/2)·Σ_{m=1..M} Σ_{k=1..M} λm·λk·ym·yk·[Xm · Xk] + Σ_{m=1..M} λm

- Quadratic optimization problems are well-known mathematical programming problems for
  which several algorithms exist;

- Once solved, most of the λm are 0; only a small number have λm > 0, and the
  corresponding Xm are the support vectors.


Support Vector Machines (SVM) 38

Given a solution λ1, …, λM to the dual problem, the solution to the primal problem
(i.e., the coefficients vector w* and the origin shift factor b*) is given by:

Learned weight (a linear combination of the support vectors):
w* = Σ_{m ∈ Support Vectors} λm·ym·Xm

b*(k) = yk − Σ_{m=1..M} λm·ym·(Xm · Xk),   for any k such that λk > 0

b* = (1/K) · Σ_k b*(k)    (K denotes the number of support vectors)

Support Vector Machines (SVM) 39

Geometrical Interpretation

[Figure: classes C1 and C2 separated by w · X + b = 0; most points have λ = 0
(λ2 = λ3 = λ4 = λ5 = λ7 = λ9 = λ10 = 0), while the support vectors have λ ≠ 0
(λ1 = 0.8, λ6 = 1.4, λ8 = 0.6)]

Support vectors: the instances with λ values different from zero (they hold up the
separating plane)!

Support Vector Machines (SVM) 40

Decision rule: given a new data instance X, the classifying function is
(note that we don’t need w* explicitly):

f(X) = Σ_{m ∈ Support Vectors} λm·ym·(Xm · X) + b*

If f(X) > 0, then decide C1; else decide C2.
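A minimal sketch of this decision rule on top of a fitted scikit-learn linear SVC
(refitted here on toy points of my own); in scikit-learn, dual_coef_ already stores the
products λm·ym for the support vectors:

```python
import numpy as np
from sklearn.svm import SVC

def decision_from_support_vectors(clf, X_new):
    """f(X) = sum over support vectors of (lambda_m * y_m) * (X_m . X) + b*"""
    return np.dot(clf.dual_coef_[0], clf.support_vectors_ @ np.asarray(X_new)) + clf.intercept_[0]

X = np.array([[1.0, 1.0], [2.0, 2.5], [-1.0, -1.0], [-2.0, -0.5]])
y = np.array([+1, +1, -1, -1])
clf = SVC(kernel="linear", C=1e6).fit(X, y)

x_new = np.array([0.5, 1.5])
print(decision_from_support_vectors(clf, x_new))
print(clf.decision_function([x_new])[0])     # same value: decide C1 if > 0, else C2
```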


Support Vector Machines (SVM) 41

Notice that f(X) relies on the dot product between the unknown instance X and the
support vectors Xm.

Also keep in mind that solving the optimization problem involved computing the dot
products Xm · Xk between all training data instances.


Support Vector Machines (SVM) 42

Soft Margin Support Vector Machines (SM-SVM)

Non-linear separability:
- Slightly non-separable data  →  Soft Margin Support Vector Machines
- Highly non-separable data    →  Kernel-based Support Vector Machines


Support Vector Machines (SVM) 43

Soft Margin Support Vector Machines (SM-SVM)

To allow errors in the data, we relax the margin constraints by introducing slack
variables ξm (≥ 0):
w · Xm + b ≥ +1 − ξm,  if ym = +1
w · Xm + b ≤ −1 + ξm,  if ym = −1

The new constraints:
ym · (w · Xm + b) ≥ 1 − ξm,   m = 1, …, M
ξm ≥ 0,   m = 1, …, M

ξm: measure of the margin violation by an erroneous instance Xm


Support Vector Machines (SVM) 44

Assigning an “extra cost” for errors in the objective function that we are going to
optimize:

Minimize:    L = ||w||² / 2 + C · Σ_{m=1..M} ξm

Subject to:  ym · (w · Xm + b) ≥ 1 − ξm,   m = 1, …, M
             ξm ≥ 0,   m = 1, …, M

Σ_{m=1..M} ξm : total margin violation
C: a large C means a higher penalty for errors
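A minimal sketch (scikit-learn assumed, synthetic overlapping data of my own) showing
the role of C: a small C tolerates margin violations and keeps a wide margin, a large C
penalizes them heavily:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([+1.5, +1.5], 1.0, size=(50, 2)),    # class +1
               rng.normal([-1.5, -1.5], 1.0, size=(50, 2))])   # class -1 (classes overlap)
y = np.array([+1] * 50 + [-1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C:<7} margin width={2.0 / np.linalg.norm(clf.coef_[0]):.3f}  "
          f"support vectors={len(clf.support_vectors_)}")
```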



Support Vector Machines (SVM) 45

Optimization using Lagrange multipliers

L = ||w||² / 2 + C·Σ_{m=1..M} ξm − Σ_{m=1..M} λm·{ym·(w · Xm + b) − 1 + ξm} − Σ_{m=1..M} μm·ξm

- ||w||² / 2: minimizing it ↔ maximizing the margin
- Σ ξm: total margin violation error
- C: tradeoff (i.e., relative importance) between error and margin
- λm, μm ≥ 0: Lagrange multipliers
- L: objective function, to be minimized w.r.t. w, b and ξm

Support Vector Machines (SVM) 46

L = ||w||² / 2 + C·Σ_{m=1..M} ξm − Σ_{m=1..M} λm·{ym·(w · Xm + b) − 1 + ξm} − Σ_{m=1..M} μm·ξm

Setting the gradient of L w.r.t. w, b and ξm to zero, we have:

∂L/∂w = w − Σ_{m=1..M} λm·ym·Xm = 0
∂L/∂b = − Σ_{m=1..M} λm·ym = 0            (same solution as for the H-SVM)

∂L/∂ξm = C − λm − μm = 0  →  with μm ≥ 0 and λm ≥ 0, this gives λm ≤ C
(otherwise, C − λm < 0 and no μm ≥ 0 would verify C − λm − μm = 0)

Support Vector Machines (SVM) 47

w* = Σ_{m=1..M} λm·ym·Xm
Σ_{m=1..M} λm·ym = 0
with the only difference w.r.t. the H-SVM:   0 ≤ λm ≤ C

Decision rule: given a new data instance X, the classifying function is

f(X) = Σ_{m ∈ Support Vectors} λm·ym·(Xm · X) + b*

If f(X) > 0, then decide C1; else decide C2.

Support Vector Machines (SVM) 48

Kernel-based Support Vector Machines

General idea: the original feature space is mapped to some higher-dimensional feature
space where the training set is linearly separable.

Φ: Rn → RN
X → Z = Φ(X)
X ∈ Rn,  Z ∈ RN,  N > n

Support Vector Machines (SVM) 49

Kernel-based Support Vector Machines

[Figure: a dataset in the (x1, x2) plane that is not linearly separable]


Support Vector Machines (SVM) 50

Kernel-based Support Vector Machines

Φ: Rn → RN
X → Z = Φ(X)

Nonlinearly separable dataset in Rn:  {(Xm, ym)},  Xm ∈ Rn,  ym ∈ {+1, −1}
Linearly separable dataset in RN:     {(Zm = Φ(Xm), ym)},  Zm ∈ RN,  ym ∈ {+1, −1}

Do we need to know the explicit expression of Z = Φ(X) to apply the SVM machinery?


Support Vector Machines (SVM) 51

The Kernel trick

Lagrangian to maximize:
L = −(1/2)·Σ_{m=1..M} Σ_{k=1..M} λm·λk·ym·yk·[Zm · Zk] + Σ_{m=1..M} λm

Constraints:   Σ_{m=1..M} λm·ym = 0,   λm ≥ 0,  m = 1, …, M

Decision rule:  f(X) = Σ_{m ∈ Support Vectors} λm·ym·(Zm · Z) + b*

Bias term:  b* = yk − Σ_{m=1..M} λm·ym·(Zm · Zk)

We only deal with Z as far as the Z-dot product is concerned.



Support Vector Machines (SVM) 52

The Kernel trick

The SVM machinery (quadratic programming, decision rule) only needs the Z-dot product,
not the Z transformation itself:

Xm → Zm = Φ(Xm),   Xm’ → Zm’ = Φ(Xm’)

K(Xm, Xm’) = Zm · Zm’   (the Z-dot product, obtained without the explicit computation
of Φ(Xm) and Φ(Xm’))

K(·,·) is called the transformation kernel.


Support Vector Machines (SVM) 53

Example 1

X = (x1, x2),  X ∈ R²

Φ(X) = (1, x1, x2, (x1)², (x2)², x1·x2):  full 2nd-order polynomial transformation

K(X, X’) = Z · Z’
         = 1 + x1·x’1 + x2·x’2 + (x1)²·(x’1)² + (x2)²·(x’2)² + x1·x’1·x2·x’2


Support Vector Machines (SVM) 54

Example 2

X = (x1, x2),  X ∈ R²

Computing the kernel without transforming X and X’:

K(X, X’) = (1 + X · X’)²
         = (1 + x1·x’1 + x2·x’2)²
         = 1 + (x1)²·(x’1)² + (x2)²·(x’2)² + 2·x1·x’1 + 2·x2·x’2 + 2·x1·x’1·x2·x’2

This is the inner product of Z = Φ(X) and Z’ = Φ(X’) where
Φ(X)  = [1, (x1)², (x2)², √2·x1, √2·x2, √2·x1·x2]
Φ(X’) = [1, (x’1)², (x’2)², √2·x’1, √2·x’2, √2·x’1·x’2]
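A minimal numerical check of this identity (the function names and test points below
are my own):

```python
import numpy as np

def poly_kernel(x, xp):
    return (1.0 + np.dot(x, xp)) ** 2

def phi(x):
    x1, x2 = x
    return np.array([1.0, x1**2, x2**2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2, np.sqrt(2) * x1 * x2])

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(poly_kernel(x, xp))            # (1 + 3 - 2)^2 = 4.0
print(np.dot(phi(x), phi(xp)))       # same value, computed in the 6-dimensional Z-space
```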

Support Vector Machines (SVM) 55

Example 2

Think about the computational complexity if:
- X = (x1, x2, …, xd),  X ∈ Rd,  d >> 2, and
- K(X, X’) = (1 + X · X’)^Q,  Q >> 2


Support Vector Machines (SVM) 56

Example 3 (Radial Basis Function Kernel)

Show that K(X, X’) = e^(−a·||X − X’||²) corresponds to an “inner product” in an
infinite-dimensional Z-space (i.e., to a transformation of the X instances).

Hint, for d = 1 (X = x, a scalar) and a = 1:
- Expand e^(−||x − x’||²) = e^(−x²) · e^(−x’²) · e^(2·x·x’)
- Expand the cross exponential term e^(2·x·x’) using its Taylor series
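A minimal numerical check of the hint for d = 1 and a = 1 (the truncation level K and
the helper names are my own): truncating the Taylor series of the cross exponential
term yields a finite-dimensional approximation of the infinite-dimensional feature map.

```python
import numpy as np
from math import factorial

def rbf_kernel_1d(x, xp):
    return np.exp(-(x - xp) ** 2)

def phi_truncated(x, K=20):
    # k-th feature: exp(-x^2) * sqrt(2^k / k!) * x^k, for k = 0 .. K-1
    return np.array([np.exp(-x ** 2) * np.sqrt(2.0 ** k / factorial(k)) * x ** k
                     for k in range(K)])

x, xp = 0.7, -0.3
print(rbf_kernel_1d(x, xp))                          # exact kernel value
print(np.dot(phi_truncated(x), phi_truncated(xp)))   # converges to it as K grows
```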


Support Vector Machines (SVM) 57

Example 4: XOR

Index m    Xm         ym
1          (1, 1)     +1
2          (1, -1)    -1
3          (-1, -1)   +1
4          (-1, 1)    -1

K(X, X’) = (1 + X · X’)²

Support Vector Machines (SVM) 58

Example 4 XOR
Optimal Lagrange multipliers: λ1 = λ2 = λ3 = λ4 = 1/8
All 4 data instances are support vectors.
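A minimal check of this result with scikit-learn (assumed available): its polynomial
kernel with degree 2, gamma = 1 and coef0 = 1 matches K(X, X’) = (1 + X·X’)², and a very
large C approximates the hard margin.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [1, -1], [-1, -1], [-1, 1]], dtype=float)
y = np.array([+1, -1, +1, -1])

clf = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0, C=1e6).fit(X, y)
print(clf.n_support_)    # expected [2 2]: all 4 instances are support vectors
print(clf.dual_coef_)    # expected lambda_m * y_m = +/- 1/8 = +/- 0.125
print(clf.predict(X))    # expected [ 1 -1  1 -1]
```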


Support Vector Machines (SVM) 59

Example 5 Face Detection application


Training images database:
266 pictures: 150 faces + 116 non-faces

...
Preprocessing
- Gray scale transformation
- Histogram equalization
- Adjust resolution to 30×40 pixels
Training the SVM:
based on the 266 training instances, with a polynomial kernel


Support Vector Machines (SVM) 60

Example 5 Face Detection application


Test phase, given an input image:
- Move a 30×40 pixel sub-window over the input image
- Histogram equalization of each sub-window
- Classification by the SVM
- Removal of intersections (overlapping detections)


Support Vector Machines (SVM) 61

How many faces?

[Figure: test image with detected face windows numbered 1 to 11]


Support Vector Machines (SVM) 62

Example 6 Learning gender with SVMs

Processed faces


Support Vector Machines (SVM) 63

Example 6 Learning gender with SVMs


Training dataset: 1044 males, 713 females
Experiment with various kernels, select Gaussian RBF


Support Vector Machines (SVM) 64

Example 6 Learning gender with SVMs


Support Vector Machines (SVM) 65

Example 7: Two spirals benchmark problem

Carnegie Mellon AI Repository

Data generation, for i = 1, 2, …, n (n: number of patterns, ρ: density, R: radius):
x₊ = R·cos(αᵢ),   y₊ = R·sin(αᵢ)
x₋ = −x₊,         y₋ = −y₊      (the second spiral is the point reflection of the first)

(x, y) generated with R = 3.5,  ρ = 1.0,  n₊ = 100,  n₋ = 100
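A minimal sketch (scikit-learn assumed) that generates a two-spirals dataset and
separates it with an RBF-kernel SVM. The slide's exact angle parametrization is not
reproduced here: the spiral below (three turns, radius growing to R = 3.5, 100 points
per class, second spiral by point reflection) is my own choice in the same spirit, and
gamma and C are illustrative values.

```python
import numpy as np
from sklearn.svm import SVC

def two_spirals(n=100, R=3.5, turns=3.0):
    t = np.linspace(0.05, 1.0, n)
    alpha = turns * 2.0 * np.pi * t                # angle grows along the spiral
    r = R * t                                      # radius grows up to R
    pos = np.column_stack([r * np.cos(alpha), r * np.sin(alpha)])   # class +1
    neg = -pos                                     # class -1: point reflection of class +1
    return np.vstack([pos, neg]), np.array([+1] * n + [-1] * n)

X, y = two_spirals()
clf = SVC(kernel="rbf", gamma=10.0, C=10.0).fit(X, y)
print("training accuracy:", clf.score(X, y))       # close to 1.0 with a suitable gamma
```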


Support Vector Machines (SVM) 66

Example 7 Two spirals benchmark problem

[Figure: the two spirals dataset: class –1 and class +1]


Support Vector Machines (SVM) 67

Example 7 Two spirals benchmark problem

[Figure: the two spirals: class +1 and class –1]
