05 Lecture ML Supervised Learning - SVM
• Supervised Learning
– Classification
• Classifier
– Naïve Bayes
– k-Nearest Neighbour
– Logistic Regression
– Support Vector Machine
• Classifier Fusion
Machine Learning
Intuitive Understanding
Idea
[Figure: k-NN classification of a query sample with k=3 and k=5 neighbourhoods]
Given
• a query sample x and a set of labeled training samples
Classification Algorithm
• find the k training samples nearest to x (w.r.t. a distance measure)
• assign x the majority label among these k neighbours
“Hamming” Distance
• Used in the context of categorical variables
• E.g. distance between names, document types (see the sketch below)
[Figure: k-NN decision for k=1 vs. k=3]
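A minimal sketch of the Hamming distance on categorical records; the function name and the example values are illustrative, not from the slides:

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length sequences differ."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))

# Two hypothetical document descriptions: (type, language, source)
doc1 = ("invoice", "en", "scanner")
doc2 = ("invoice", "de", "email")
print(hamming_distance(doc1, doc2))  # -> 2 (they differ in language and source)
```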
Use Cases
• Spam filtering
• Recommender systems
• Text classification
• Document similarity

P(y|x) = (1/k) · Σᵢ₌₁ᵏ δ(ŷᵢ, y), with δ(ŷᵢ, y) = 1 if ŷᵢ = y, and 0 if ŷᵢ ≠ y
where ŷᵢ is the label of the i-th nearest neighbour of x (see the sketch below)
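A sketch of this posterior estimate in plain numpy; the variable names and the toy data are my own:

```python
import numpy as np

def knn_posterior(x, X_train, y_train, k=3):
    """Return {label: estimated P(label | x)} from the k nearest training samples."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distances to x
    nearest = y_train[np.argsort(dists)[:k]]      # labels y_(1), .., y_(k)
    labels, counts = np.unique(nearest, return_counts=True)
    return dict(zip(labels, counts / k))          # fraction of votes per label

X_train = np.array([[0.0], [0.2], [0.9], [1.0], [1.1]])
y_train = np.array([0, 0, 1, 1, 1])
print(knn_posterior(np.array([0.95]), X_train, y_train, k=3))  # e.g. {1: 1.0}
```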
• Alternative:
– omit P(x|c) and P(c), and directly estimate P(c|x)
Intuitive Understanding
Linear Regression
P(c|x) = fΘ(x1) = m·x1 + b, with Θ = (m, b)
[Figure: linear regression line fitted to binary labels; y-axis: Sea Bass? (No = 0, Yes = 1), x-axis: Lightness x1; decision threshold at 0.5]
Intuitive Understanding
Linear Regression Challenge: Outlier
Good hypothesis? P(c|x) = fΘ(x1) = m·x1 + b, with Θ = (m, b)
[Figure: linear regression fit in the presence of an outlier; y-axis: Sea Bass? (No = 0, Yes = 1), x-axis: Lightness x1]
fΘ(x) = sigmoid(wᵀx) = 1 / (1 + e^(−wᵀx))
• where sigmoid(z) = 1 / (1 + e^(−z)) squashes the linear score wᵀx into (0, 1)
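A minimal sketch of this model, assuming an augmented bias term; the weights are arbitrary illustrative values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))    # squashes the linear score into (0, 1)

def f_theta(x, w):
    return sigmoid(np.dot(w, x))       # linear part w^T x, then the sigmoid

w = np.array([4.0, -2.0])              # [weight for lightness x1, bias]
for lightness in (0.1, 0.5, 0.9):
    x = np.array([lightness, 1.0])     # augmented vector: append constant 1
    print(lightness, f_theta(x, w))    # P(sea bass | lightness) under this model
```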
Given
• A set of labeled training samples {xi, ci}
– xi - feature representation of examples
– ci - class labels (e.g. document type, rating on YouTube, etc.)
• For each weight configuration w we can compute the classification loss 𝓛 (“error”), as sketched below
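A sketch of such a loss computation; the slides do not name the loss, so the standard cross-entropy loss is assumed here, with illustrative data and weights:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy_loss(w, X, c):
    """Average negative log-likelihood of labels c under the logistic model."""
    p = sigmoid(X @ w)                               # predicted P(c=1 | x) per sample
    return -np.mean(c * np.log(p) + (1 - c) * np.log(1 - p))

X = np.array([[0.1, 1.0], [0.4, 1.0], [0.8, 1.0]])  # augmented feature vectors xi
c = np.array([0, 0, 1])                             # class labels ci
print(cross_entropy_loss(np.array([4.0, -2.0]), X, c))  # loss L for this w
```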
Use Cases
• Predictive maintenance
• Medical treatment response
• Customer churn prediction
• Loan default prediction
[Figure: comparing feature vectors such as [2, 0, 2, 0] via the angle between them]
• Given:
– training samples x1,..,xn ∈ ℝᵈ with labels y1,..,yn ∈ {−1, +1}
• Geometric approach:
– find a hyperplane w that separates the classes
– f(x) = <w,x> + b
– use: “augmented vectors” (see the sketch below)
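A minimal sketch of the decision function and the augmented-vector trick, with illustrative numbers:

```python
import numpy as np

w = np.array([2.0, -1.0])
b = 0.5
x = np.array([1.0, 3.0])

f_plain = np.dot(w, x) + b               # f(x) = <w, x> + b

# Augmented vectors: append b to w and a constant 1 to x, so f(x) = <w', x'>.
w_aug = np.append(w, b)
x_aug = np.append(x, 1.0)
f_aug = np.dot(w_aug, x_aug)

print(f_plain, f_aug, np.sign(f_plain))  # both scores agree; sign gives the class
```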
w* = argmin_w ½·‖w‖² subject to yi·(<w,xi> + b) ≥ 1 for all i (the maximum-margin hyperplane)
• Finding “good” data transformations for the classification problem can be difficult
• Instead, we omit the transformation f(x) and use a similarity function k(xi,xj) that compares two samples xi, xj → this approach is called the kernel trick
• Kernel Trick
– We can omit the computation of f, and simply compute the kernel function k(.,.), as the sketch below illustrates
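A sketch of the trick for the degree-2 polynomial kernel, where k(x,z) = <x,z>² equals the inner product of an explicit quadratic feature map φ; φ is written out only for comparison and never needs to be computed in practice:

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for 2-d input (for comparison only)."""
    x1, x2 = x
    return np.array([x1 * x1, x2 * x2, np.sqrt(2) * x1 * x2])

def k_poly2(x, z):
    return np.dot(x, z) ** 2           # the kernel: one dot product, squared

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])
print(np.dot(phi(x), phi(z)))          # inner product in the feature space
print(k_poly2(x, z))                   # same value, without computing phi
```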
• Given
– training set with samples x1,..,xn and its labels y1,..,yn
• Algorithm
1. choose a kernel function k(.,.)
2. estimate α1,..,αn by optimizing the SVM equation (αi ≠ 0 → xi is a “support vector”)
3. these α1,..,αn values define a maximum-margin decision boundary in a high-dimensional space defined by the kernel function
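A sketch of this procedure using scikit-learn's SVC on toy data; after fitting, the support vectors and their dual coefficients αi·yi are exposed on the model:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel="rbf", C=1.0, gamma=1.0)  # step 1: choose a kernel
clf.fit(X, y)                              # step 2: estimate the alphas

print(clf.support_vectors_)                # the x_i with alpha_i != 0
print(clf.dual_coef_)                      # alpha_i * y_i for those x_i
```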
• Given
– a test sample x
• Unknown
– its label y ∈ {−1, +1}
• Classification
1. compute k(x, xi) for all support vectors xi
2. compute the classification score: f(x) = Σi αi·yi·k(x, xi) + b
3. class decision is: sign(f(x))
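A sketch of steps 1-3 computed by hand with an RBF kernel; the support vectors, αi·yi values and b are illustrative stand-ins for trained values:

```python
import numpy as np

def k_rbf(x, z, beta=1.0):
    return np.exp(-beta * np.sum((x - z) ** 2))

support_vectors = np.array([[0.2, 0.1], [1.0, 1.0]])
alpha_times_y = np.array([-0.8, 0.8])      # alpha_i * y_i per support vector
b = 0.0

x = np.array([0.9, 0.9])                   # test sample
scores = [ay * k_rbf(x, sv) for ay, sv in zip(alpha_times_y, support_vectors)]
f_x = sum(scores) + b                      # f(x) = sum_i alpha_i*y_i*k(x,x_i) + b
print(f_x, np.sign(f_x))                   # class decision: sign(f(x))
```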
Some Practical Kernel Functions
• Linear: k(x, x′) = <x, x′>
• Polynomial: k(x, x′) = (<x, x′> + c)^d
• Gaussian (RBF): k(x, x′) = exp(−β·‖x − x′‖²)
• Histogram intersection: k(x, x′) = Σi min(xi, x′i)
• Chi-square: k(x, x′) = exp(−β·Σi (xi − x′i)² / (xi + x′i))
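Sketches of these kernels as plain numpy functions; the epsilon guard in the chi-square kernel is my addition to avoid division by zero:

```python
import numpy as np

def k_linear(x, z):
    return np.dot(x, z)

def k_poly(x, z, d=2, c=1.0):
    return (np.dot(x, z) + c) ** d

def k_rbf(x, z, beta=1.0):
    return np.exp(-beta * np.sum((x - z) ** 2))

def k_hist_intersection(x, z):
    return np.sum(np.minimum(x, z))

def k_chi_square(x, z, beta=1.0, eps=1e-10):
    return np.exp(-beta * np.sum((x - z) ** 2 / (x + z + eps)))

x, z = np.array([0.2, 0.5, 0.3]), np.array([0.3, 0.3, 0.4])  # e.g. histograms
for k in (k_linear, k_poly, k_rbf, k_hist_intersection, k_chi_square):
    print(k.__name__, k(x, z))
```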
SVM: Kernel Best Practice
• β very large → very narrow Gaussian kernels: highly flexible boundary, risk of overfitting
• β very small → very wide Gaussian kernels: nearly linear boundary, risk of underfitting
SVM: Hyper-Parameter Optimization
• Frequently used approach: Grid Search
– test different values of C and β on a regular grid (alternatively, a log grid)
– for each pair, measure classification accuracy on a held-out validation set
[Figure: grid of candidate (C, β) pairs; axes C and β]
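A sketch of this with scikit-learn's GridSearchCV, where gamma plays the role of β; cross-validation stands in for the held-out validation set, and the data and grid values are illustrative:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)          # toy binary labels

param_grid = {"C": [0.1, 1, 10, 100],            # log grid for C
              "gamma": [0.01, 0.1, 1, 10]}       # log grid for beta/gamma
search = GridSearchCV(SVC(kernel="rbf"), param_grid, scoring="accuracy", cv=3)
search.fit(X, y)                                 # measures accuracy per (C, gamma)
print(search.best_params_, search.best_score_)
```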
SVMs – Summary
• Disadvantages
– often: ad hoc choice of kernel functions
– scalability problems with large training sets
– limited learning capacity with a large number of positive samples
Classifier Fusion
• Early fusion: concatenate the feature vectors x1, x2, .., xM into one vector [x1, x2, .., xM] → train a single classifier → decision
• Late fusion: train one classifier per feature type → per-feature scores P(c|x1), .., P(c|xM) → fuse the scores → decision
→ no-free-lunch theorem: neither fusion strategy is best for every problem (see the sketch below)
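A sketch contrasting the two fusion schemes for M = 2 feature types; logistic regression as the base classifier and score averaging as the fusion rule are illustrative assumptions, not prescribed by the slides:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X1 = rng.normal(size=(100, 3))                    # feature type x1
X2 = rng.normal(size=(100, 5))                    # feature type x2
y = (X1[:, 0] + X2[:, 0] > 0).astype(int)         # toy labels

# Early fusion: concatenate [x1, x2, .., xM] and train one classifier.
early = LogisticRegression().fit(np.hstack([X1, X2]), y)
p_early = early.predict_proba(np.hstack([X1, X2]))[:, 1]

# Late fusion: one classifier per feature type, then fuse the scores P(c|xm).
p1 = LogisticRegression().fit(X1, y).predict_proba(X1)[:, 1]
p2 = LogisticRegression().fit(X2, y).predict_proba(X2)[:, 1]
p_late = (p1 + p2) / 2                            # simple average as fusion rule

print((p_early > 0.5).mean(), (p_late > 0.5).mean())  # fraction predicted class 1
```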
Questions?