
Introduction to Machine Learning & Deep Learning (Fall 2023)

Lecture 6: Supervised Learning – Classification - SVM


Prof. Damian Borth

Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 1


Last Lecture

• Supervised Learning
– Classification
• Classifier
– Naïve Bayes
– k-Nearest Neighbour
– Logistic Regression
– Support Vector Machine
• Classifier Fusion

Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 2


This Lecture

• Supervised Learning
– Classification
• Classifier
– Naïve Bayes
– k-Nearest Neighbour
– Logistic Regression
– Support Vector Machine
• Classifier Fusion

Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 3


Supervised Learning

Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 4


Types of Machine Learning

Machine Learning

Supervised Learning
• Objective: Learn the relationship between data and a desired output
• Data x contains labels c whose relationship is to be learned
  → labels are known
• “Learning known patterns”
• e.g. decision trees, neural nets, support vector machines etc.
• Tasks: Classification, Regression

Semi-supervised Learning
• Objective: Learn structure, using only a few labels to label patterns
• Data:
  → few labels are known
  → many labels are unknown

Self-supervised Learning
• Objective: Learn a representation of the data via controlled pseudo-labels for downstream tasks
• Data:
  → labels are unknown
  → representation & linear classifier

Unsupervised Learning
• Objective: Identification of unknown distributions, patterns and dependencies
• Data x contains dependencies or patterns to be observed
  → labels are unknown
• “Learning unknown patterns”
• e.g. clustering algorithms, principal component analysis, self-organizing maps etc.
• Tasks: Clustering, Dim. Reduction

Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 5




Overview
Agenda
1. What is supervised learning (Classification)?
2. How to build an “optimal” classifier ?
3. What kind of classifiers are there ?
a) “Naive” Bayes
b) Nearest Neighbors
c) Logistic Regression
d) Support Vector Machine (SVM)

4. How to combine (“fuse”) distinct classifiers?


5. Summary and conclusion
Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 7
Parametric vs Non-parametric

• So far, we assumed P(x|c) to be Gaussian.


• What about these distributions?

• Often, we don't know the parametric form of P(x|c)


• Possible approaches:
• mixtures of Gaussians
• non-parametric methods (no parameters, no training)
Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 8
k-Nearest Neighbor - Idea

Intuitive Understanding

Idea

“Assign each unknown example x to the majority class y of its k closest neighbors, where k is a parameter.”

[Figure: an unknown example x is classified against its k = 1, 3, 5 nearest neighbors from classes y = 0 and y = 1.]

Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 9


k-Nearest Neighbor - Approach

Given

• A set of labeled training samples {xi, yi}


• xi - feature representation of examples
• yi - class labels (e.g. document type, rating on YouTube etc.)
• An unknown sample x whose target we aim to predict

Classification Algorithm

• Compute the distance D(x, xi) of x to every training sample xi


• Select the k closest instances xi1 … xik and their class labels yi1 … yik
• Classify x according to the majority class of its k neighbors
• Calculating the majority class (a minimal code sketch follows below):

$$P(y \mid x) = \frac{1}{k} \sum_{j=1}^{k} \delta(y_{i_j}, y), \qquad \delta(y_{i_j}, y) = \begin{cases} 1 & \text{if } y_{i_j} = y \\ 0 & \text{if } y_{i_j} \neq y \end{cases}$$

Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 10
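A minimal NumPy sketch of the classification algorithm described above; the function and variable names are illustrative, not from the lecture.

```python
import numpy as np

def knn_predict(x, X_train, y_train, k=3):
    """Classify a single sample x by the majority class of its k nearest neighbors."""
    # 1. compute the distance D(x, xi) of x to every training sample xi (Euclidean / L2)
    distances = np.linalg.norm(X_train - x, axis=1)
    # 2. select the k closest instances and their class labels
    nearest = np.argsort(distances)[:k]
    neighbor_labels = y_train[nearest]
    # 3. classify x according to the majority class of its k neighbors
    #    P(y|x) = (1/k) * sum_j delta(y_ij, y)
    classes, counts = np.unique(neighbor_labels, return_counts=True)
    return classes[np.argmax(counts)]

# toy usage with two classes (y = 0 and y = 1)
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(np.array([0.95, 0.9]), X_train, y_train, k=3))  # -> 1
```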


k-Nearest Neighbor – Distance Measures

“Euclidean” Distance (“L2 norm”)

• Used in the context of continuous variables
• Not very robust, single solution

“Manhattan” Distance (“L1 norm”)

• Used in the context of binary or encoded variables
• Robust, possibly multiple solutions

“Hamming” Distance
• Used in the context of categorical variables
• E.g. distance between names, document types

Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 11
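The three distance measures as a short sketch (plain NumPy; names are illustrative):

```python
import numpy as np

def euclidean(a, b):
    """L2 norm: sqrt(sum((a_i - b_i)^2)), for continuous variables."""
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    """L1 norm: sum(|a_i - b_i|), for binary or encoded variables."""
    return np.sum(np.abs(a - b))

def hamming(a, b):
    """Number of positions where two categorical sequences differ."""
    return sum(x != y for x, y in zip(a, b))

a, b = np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 3.0])
print(euclidean(a, b), manhattan(a, b), hamming("karolin", "kathrin"))  # ~2.236, 3.0, 3
```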


k-Nearest Neighbor – Different “k” Example

[Figure: k-NN decision boundaries for k = 1, 3, 10, 50, and 200.]

Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 12


k-Nearest Neighbor

Summary and Discussion


Pro’s
• “Non-parametric” approach: ”no” assumptions about the data distribution
• Simple to implement
• Flexible in the choice of features / distance measures

Con’s
• Computationally expensive: time (computes all distances), space (stores all examples)
• Sensitive to outliers / irrelevant features

Use Cases
• Spam filtering
• Recommender systems
• Text classification
• Document similarity

$$P(y \mid x) = \frac{1}{k} \sum_{j=1}^{k} \delta(y_{i_j}, y), \qquad \delta(y_{i_j}, y) = \begin{cases} 1 & \text{if } y_{i_j} = y \\ 0 & \text{if } y_{i_j} \neq y \end{cases}$$

Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 13


k-Nearest Neighbor

• When is nearest neighbor (NN) successful?


• we need many samples in small regions!

• Is nearest neighbor better than Gaussians?


• not necessarily – if the underlying class-conditional densities
are truly Gaussian and we can determine parameters reliably,
Gaussians are the optimal model!

• Are there really no parameters?


• there is k as a hyper-parameter to choose
• low k = high variance
• high k = oversmoothing
• a good compromise in practice: k = √n

Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 14


Overview
Agenda
1. What is supervised learning (Classification)?
2. How to build an “optimal” classifier ?
3. What kind of classifiers are there ?
a) “Naive” Bayes
b) Nearest Neighbors
c) Logistic Regression
d) Support Vector Machine (SVM)

4. How to combine (“fuse”) distinct classifiers?


5. Summary and conclusion
Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 15
Discriminative Models

• We saw a generative model: „Gaussians“


• we know P(x|c) and P(c), i.e. we know P(c|x) (via Bayes’ rule)
• we can „generate“ samples (x', c'):
• draw c' from P(c)
• draw x' from P(x|c)

• Alternative:
• omit P(x|c) and P(c), and directly estimate P(c|x) !

→ discriminative models: P(c|x) = fΘ(x)

Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 16


Logistic Regression - Introduction

Intuitive Understanding
Linear Regression
• P(c|x) = fΘ(x1) = m·x1 + b, with Θ = (m, b)

[Figure: “Sea Bass?” (Yes = 1, No = 0) plotted against lightness x1, with a fitted line and the 0.5 threshold.]

Classification Hypothesis

• Threshold the classifier output fΘ(x1) at 0.5:
• If fΘ(x1) ≥ 0.5, predict c = 1 “Sea Bass”
• If fΘ(x1) < 0.5, predict c = 0 “Salmon”

Challenge
• “How to handle anomalies or different modalities in the data?”

Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 17


Logistic Regression - Introduction

Intuitive Understanding
Linear Regression – Challenge: Outlier
• P(c|x) = fΘ(x1) = m·x1 + b, with Θ = (m, b)
• A single outlier pulls the fitted line: is this still a good hypothesis?

[Figure: the same “Sea Bass?” plot against lightness x1, now with an outlier shifting the regression line and the 0.5 threshold.]

Classification Hypothesis

• Threshold the classifier output fΘ(x1) at 0.5:
• If fΘ(x1) ≥ 0.5, predict c = 1 “Sea Bass”
• If fΘ(x1) < 0.5, predict c = 0 “Salmon”

Idea
• Improve “Linear Regression” by:
  (1) a non-linear hypothesis fΘ
  (2) learnable parameters Θ
→ “Logistic Regression”

Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 18


Logistic Regression - Idea (one dimension)

• Remember: in the Gaussian case, P(c|x) was a sigmoid function

• Idea: model fΘ(x) as a sigmoid applied to a linear function of x (see the sketch below)

Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 19
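The slide’s formula is shown as an image; the standard one-dimensional logistic regression model it refers to can be written as:

$$f_\Theta(x) = \sigma(m x + b) = \frac{1}{1 + e^{-(m x + b)}}, \qquad \Theta = (m, b)$$

where σ is the sigmoid (logistic) function and m x + b the linear part.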




Logistic Regression - Idea (more dimensions)

• In more dimensions, we have a weight vector w

• The decision boundary becomes a (linear) hyperplane

• We can omit b by using augmented vectors (sketched below):

Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 21
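The slide’s equations are images; a standard reconstruction of the multi-dimensional model and the augmented-vector trick:

$$P(c=1 \mid x) = f_{w,b}(x) = \sigma(w^\top x + b), \qquad \tilde{x} = \begin{bmatrix} x \\ 1 \end{bmatrix}, \; \tilde{w} = \begin{bmatrix} w \\ b \end{bmatrix} \;\Rightarrow\; f_{\tilde{w}}(\tilde{x}) = \sigma(\tilde{w}^\top \tilde{x})$$

The decision boundary w^T x + b = 0 is then a (linear) hyperplane.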


Logistic Regression - Approach

Given
• A set of labeled training samples {xi, ci}
• xi - feature representation of examples
• ci - class labels (e.g. document type, rating on YouTube etc.)
• For each weight configuration w we can compute the classification loss 𝓛 “Error”

Training Algorithm (see Bishop p. 205f.)


• Initialize the weight configuration w0 “Gradient Descent Learning”
• Until convergence of loss 𝓛 do:
• Update the weight configuration according
to Gradient Descent Learning
• Increase k = k+1

Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 22
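A minimal NumPy sketch of the training loop described above (gradient descent on the cross-entropy loss; variable names, learning rate, and the convergence check are illustrative, not from Bishop):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, c, lr=0.1, tol=1e-6, max_iter=10000):
    """Gradient descent on the cross-entropy loss; X is augmented with a bias column."""
    X_aug = np.hstack([X, np.ones((X.shape[0], 1))])   # augmented vectors: x -> [x, 1]
    w = np.zeros(X_aug.shape[1])                        # initialize weight configuration w0
    prev_loss = np.inf
    for k in range(max_iter):
        p = sigmoid(X_aug @ w)                          # P(c=1|x) for all samples
        loss = -np.mean(c * np.log(p + 1e-12) + (1 - c) * np.log(1 - p + 1e-12))
        grad = X_aug.T @ (p - c) / len(c)               # gradient of the loss w.r.t. w
        w -= lr * grad                                  # gradient descent update
        if abs(prev_loss - loss) < tol:                 # until convergence of the loss
            break
        prev_loss = loss
    return w

# toy usage: 1-D "lightness" feature, two classes
X = np.array([[0.2], [0.4], [0.6], [0.8]])
c = np.array([0, 0, 1, 1])
w = train_logistic_regression(X, c)
print(sigmoid(np.hstack([X, np.ones((4, 1))]) @ w))  # predicted P(c=1|x)
```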


Logistic Regression

Summary and Discussion


Pro’s
• “Discriminative” approach: learn only what is needed
• Results are easy to interpret
• Can be trained fast

Con’s
• Non-deterministic results, may end up in a local minimum
• Learns only linear decision boundaries
• Vulnerable to overfitting

Use Cases
• Predictive maintenance
• Medical treatment response
• Customer churn prediction
• Loan default prediction

Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 23


Overview
Agenda
1. What is supervised learning (Classification)?
2. How to build an “optimal” classifier ?
3. What kind of classifiers are there ?
a) “Naive” Bayes
b) Nearest Neighbors
c) Logistic Regression
d) Support Vector Machine (SVM)

4. How to combine (“fuse”) distinct classifiers?


5. Summary and conclusion
Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 24
Support Vector Machines (SVMs)

• Support Vector Machines were leading the state of the art in many machine learning tasks (including image recognition)
• A classifier benchmarking experiment:
– more than 100 datasets from the public UCI machine learning repository
– 7 classifiers, with parameters (for example, k in k-NN) optimized by cross-validation grid search
– this illustration counts the datasets on which each classifier works best

Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 25


Support Vector Machines (SVMs)

• SVMs were particularly successful in image recognition


• visual words + SVMs = „standard pipeline“

Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 26


Support Vector Machines (SVMs)

• SVMs were particularly successful in image recognition


• visual words + SVMs = „standard pipeline“

Visual Word Feature Extraction → [ 2, 0, 2, 0 ] → SVM Classification

Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 36


Approach

• maximum margin classification

• non-linearity by kernel functions


[Figure: a ring-shaped dataset in (x, y) becomes linearly separable after mapping each sample to (angle, distance from origin).]

Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 37


SVM: Notation

• Given:
– training samples x1, .., xn ∈ Rd with labels y1, .., yn ∈ {-1, 1}

• Geometric approach:
– find a hyperplane w that separates the classes
• f(x) = <w,x> + b
– use “augmented vectors” (x → [x,1]):
• f(x) = <w,x>


– classification: class is present ↔ f(x) > 0

Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 38


SVM: The maximum-margin Principle

• Which hyperplane is the best?


– multiple hyperplanes are possible

• Guiding Principles / Approaches


– generative models
(e.g., Gaussians with identical covariances)
– logistic regression
(likelihood maximization)
– perceptron
(error minimization)
– maximum-margin principle
Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 39
SVM: Margin Maximization

• To find the hyperplane w that maximizes the margin, let us first require that for all samples xi the following holds: yi·<w,xi> ≥ 1

Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 40


SVM: Margin Maximization

We have two kinds of samples:

– „safe“ samples xi which are „far away“ from the decision boundary: yi·<w,xi> > 1

– „support vectors“: samples xi that lie exactly on the margin: yi·<w,xi> = 1

Relationship between the margin γ and w:

• the size of the margin is γ = 1/||w||₂
• maximizing the margin is equivalent to minimizing ||w||₂ (or, equivalently, ||w||₂²)

Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 41


SVM: Margin Maximization

• Altogether, a decision boundary w* that maximizes the margin can be computed by solving the optimization problem sketched below

• This is a “simple” optimization problem


– the objective function is quadratic, i.e., differentiable and convex
→ quadratic programming
– the constraints are all linear
– a globally optimal solution can be computed in O(n³)
– in practice, the computational effort of an SVM is ≈ O(c·n^1.8)
Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 42
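The slide’s optimization problem is shown as an image; the standard hard-margin formulation it corresponds to is:

$$w^{*} = \arg\min_{w} \; \tfrac{1}{2}\lVert w \rVert^{2} \quad \text{s.t.} \quad y_i \langle w, x_i \rangle \ge 1 \;\; \text{for all } i = 1, \dots, n$$

The quadratic objective and linear constraints make this a quadratic program with a globally optimal solution.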
Classification Problem: Non-Separability

• Problem: in practice, datasets are often not linearly separable!

Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 43


Classification Problem: Non-Separability

• Problem: in practice, datasets are often not linearly separable!

• We can solve this problem by two extensions:


Slack Variables: allowing for errors during training in favor of a max-margin hyperplane w

Kernel Function: mapping samples to a (proper) higher-dimensional vector space and solving the problem there in a linear way

Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 44




SVM: Slack Variables

• What is the better hyperplane for this dataset?


→ allow some training error: introduce slack variables

[Figure: two hyperplanes on the same dataset. Left: no training errors, but a small margin (= likely test errors). Right: one training error, but a larger margin (= likely fewer test errors).]

Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 46


SVM: Max Margin & Slack Variables

• Solution: Introduce slack variables ξ1, .., ξn (soft-margin formulation sketched below)

• We can satisfy all constraints by making the ξi large enough


• The hyper-parameter C balances margin size against training errors:
→ C = ∞, i.e. „hard“ margin: all ξi are 0, no training error allowed
→ the smaller C, the larger the margin (at the cost of incorrectly classified training samples)
• The objective function is still convex („simple“ optimization)
Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 47
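The slide’s soft-margin objective is shown as an image; the standard formulation with slack variables reads:

$$w^{*} = \arg\min_{w,\,\xi} \; \tfrac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad y_i \langle w, x_i \rangle \ge 1 - \xi_i, \;\; \xi_i \ge 0$$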
Classification Problem: Non-Separability

• Problem: in practice, datasets are often not linearly separable!

• We can solve this problem by two extensions:


Slack Variables: allowing for errors during training in favor of a max-margin hyperplane w

Kernel Function: mapping samples to a (proper) higher-dimensional vector space and solving the problem there in a linear way

Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 48




SVM: Non-(Linear)-Separability

• Slack variables are not enough!


– What is the best decision boundary on this dataset?

• We need non-linear decision boundaries


• Solutions:
– higher-order decision functions
– classifier stacking
– neural networks (will be covered later)
– data transformation (kernel functions)

Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 50


SVM: Data Transformation φ

• In the example, we can find a transformation φ for the samples xi such that they become linearly separable

→ transform each xi to polar coordinates (a small sketch follows below):


[Figure: the ring-shaped dataset in (x, y) maps under φ to a linearly separable set in (angle, distance from origin).]

Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 51
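A minimal sketch of such a transformation φ to polar coordinates (plain NumPy; names are illustrative):

```python
import numpy as np

def phi_polar(X):
    """Map 2-D samples (x, y) to polar coordinates (angle, distance from origin)."""
    angle = np.arctan2(X[:, 1], X[:, 0])     # angle in radians
    radius = np.linalg.norm(X, axis=1)        # distance from origin
    return np.stack([angle, radius], axis=1)

# a ring-shaped class (radius ~2) and an inner class (radius ~0.5) become
# separable by a horizontal line in the (angle, radius) space
theta = np.linspace(0, 2 * np.pi, 8, endpoint=False)
ring = np.stack([2 * np.cos(theta), 2 * np.sin(theta)], axis=1)
inner = np.stack([0.5 * np.cos(theta), 0.5 * np.sin(theta)], axis=1)
print(phi_polar(ring)[:, 1], phi_polar(inner)[:, 1])  # radii ~2.0 vs ~0.5
```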


SVM: Data Transformation φ

• Linear Classification with Data Transformation

→ define a feature transformation φ: Rd → Rm


→ perform classification on φ(xi) instead of xi

Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 52


SVM: Kernel Trick & Representer Theorem

• Finding „good“ data transformations for the classification problem can be difficult

• Instead, we will omit the transformation φ(x) and use a similarity function k(xi,xj) that compares two samples xi, xj → this approach is called the kernel trick

• The similarity functions k(xi,xj) are called kernel functions

• The Representer Theorem is the basis of the kernel trick

• It tells us that the maximum-margin solution lies in the subspace spanned by the training samples, i.e. we can rewrite the maximum-margin solution w as shown below:
Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 53
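The slide’s expansion is shown as an image; in one common convention (labels yi written explicitly), the Representer Theorem statement used here reads:

$$w = \sum_{i=1}^{n} \alpha_i \, y_i \, \varphi(x_i)$$

i.e. the solution is a weighted combination of the (transformed) training samples, with one coefficient αi per sample.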
SVM: Kernel Trick & Representer Theorem

Using the Representer Theorem,


we can rewrite the max-margin problem so that the samples appear only through inner products,

and replace these inner products by the kernel function: <φ(xi), φ(xj)> = k(xi,xj)

→ this yields the (dual) SVM equation, sketched below

Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 54
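The “SVM equation” itself is shown as an image on the slide; under the usual convention, the dual problem and the kernelized decision function it refers to are:

$$\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j \, y_i y_j \, k(x_i, x_j) \quad \text{s.t.} \quad 0 \le \alpha_i \le C, \;\; \sum_i \alpha_i y_i = 0$$

$$f(x) = \langle w, \varphi(x) \rangle = \sum_{i=1}^{n} \alpha_i \, y_i \, k(x_i, x)$$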


Kernel Trick & Representer Theorem - Consequence

• Kernel Trick
– We can omit the computation of φ, and simply compute the kernel function k(.,.)

• Kernel Function k(.,.)
– The kernel function k(xi,xj) defines a similarity measure between xi and xj
– there are several kernel functions to choose from

• We do not even have to know φ


– this is actually pretty awesome!

Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 55


SVM: Training

• Given
– a training set with samples x1, .., xn and their labels y1, .., yn

• Algorithm (a minimal sketch follows below)
1. choose a kernel function k(.,.)
2. estimate α1, .., αn by optimizing the SVM equation (αi ≠ 0 → xi is a „support vector“)
3. These α1, .., αn values define a maximum-margin decision boundary in a high-dimensional space defined by the kernel function.

Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 56
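A minimal training sketch using scikit-learn’s SVC (an assumption, since the lecture does not prescribe a library; the kernel, C, and gamma values are illustrative):

```python
import numpy as np
from sklearn.svm import SVC

# toy training set: two classes with labels in {-1, +1}
X_train = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.2]])
y_train = np.array([-1, -1, 1, 1])

# 1. choose a kernel function; 2.+3. estimate the alpha_i by solving the SVM problem
clf = SVC(kernel="rbf", C=1.0, gamma=1.0)
clf.fit(X_train, y_train)

print(clf.support_)    # indices of the support vectors (alpha_i != 0)
print(clf.dual_coef_)  # the products alpha_i * y_i for the support vectors
```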




SVM: Classification

• Given
– a test sample x and the trained support vectors xi with coefficients αi
• Unknown
– the class label y of x

• Classification
1. compute k(x, xi) for all support vectors xi
2. compute the classification score f(x) = Σi αi yi k(xi, x)
3. class decision: sign(f(x))

Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 58


SVM: Kernel Best Practice

• How do we choose kernels k(.,.) in practice?


• They can be constructed from distance functions
– if d(.,.) is a distance function, then e^{-d(.,.)}, i.e. exp{-d(.,.)}, can be used as a kernel function

• Some practical kernel functions (common definitions are sketched below):
– Linear
– Polynomial
– Gaussian (RBF)
– Histogram intersection
– Chi-square
Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 59
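Common definitions for these kernels as a short sketch; the slide’s own formulas are images, so the variants below are the usual textbook forms, and β, the degree d, offset c, and the small ε are illustrative choices:

```python
import numpy as np

def linear_kernel(x, z):
    return np.dot(x, z)

def polynomial_kernel(x, z, d=2, c=1.0):
    return (np.dot(x, z) + c) ** d

def rbf_kernel(x, z, beta=1.0):
    # Gaussian kernel: exp(-beta * ||x - z||^2)
    return np.exp(-beta * np.sum((x - z) ** 2))

def histogram_intersection_kernel(x, z):
    # for histogram features (non-negative bins)
    return np.sum(np.minimum(x, z))

def chi_square_kernel(x, z, eps=1e-12):
    # exp(-sum((x_i - z_i)^2 / (x_i + z_i))), built from the chi-square distance
    d = np.sum((x - z) ** 2 / (x + z + eps))
    return np.exp(-d)

x = np.array([2.0, 0.0, 2.0, 0.0])  # the visual-word histogram from the slides
z = np.array([1.0, 1.0, 2.0, 0.0])
print(rbf_kernel(x, z, beta=0.5), histogram_intersection_kernel(x, z))
```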
SVM: Kernel Best Practice

• Kernels should show a class-wise block structure

• Example: β in the Gaussian kernel

[Figure: kernel matrices for β very large vs. β very small (picture: Christoph Lampert).]
Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 60
SVM: Hyper-Parameter Optimization

• Parameter Optimization in SVMs:


– cost of misclassified training samples: C
– kernel parameter β

• Frequently used approach: Grid Search (a sketch follows below)
– test different values of C and β on a regular grid (alternatively, a log grid)
– for each pair, measure classification accuracy on a held-out validation set

[Figure: grid over C and β; the region of good parameter choices is highlighted.]
Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 61
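A minimal grid-search sketch with scikit-learn (an assumption, since the lecture does not prescribe a library; GridSearchCV uses cross-validation rather than a single held-out set, and the log grids below are illustrative, with gamma playing the role of β):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# toy data with labels in {-1, +1}
rng = np.random.RandomState(0)
X = rng.randn(40, 2)
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

# log grids for C and the RBF bandwidth
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1, 10]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print(search.best_params_, search.best_score_)
```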
SVMs – Summary

• Support Vector Machines were state-of-the-art classifiers, particularly successful in image recognition
• Advantages
– the maximum-margin problem can be solved globally optimally!
– the number of parameters is „independent“ of the feature dimensionality.
This makes SVMs very suitable classifiers for small, high-dimensional training sets!
– flexibility: we can incorporate application-specific kernels
– very good empirical results

• Disadvantages
– often: ad hoc choice of kernel functions
– scalability problems to large training sets
– limited learning capacity with a large number of positive samples

Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 62


Overview
Agenda
1. What is supervised learning (Classification)?
2. How to build an “optimal” classifier ?
3. What kind of classifiers are there ?
a) “Naive” Bayes
b) Nearest Neighbors
c) Logistic Regression
d) Support Vector Machine (SVM)

4. How to combine (“fuse”) distinct classifiers?


5. Summary and conclusion
Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 63
Early vs. Late Fusion

• Multiple classifiers can combine


different pieces of evidence
• multiple features
• multiple modalities
• multiple classifiers
• multiple training sets

Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 64


Early vs. Late Fusion

• Different combination strategies


• early fusion = concatenate features
• late fusion = combine classification results

Early fusion: features x1, x2, …, xM → concatenate [x1, x2, .., xM] → classifier → decision

Late fusion: each feature xm → its own classifier → P(c|xm); combine P(c|x1), …, P(c|xM) → decision (a small code sketch follows below)

Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 65
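A small sketch contrasting the two strategies (scikit-learn is an assumption; late fusion here averages the classifiers’ probability estimates, which is one common combination rule):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
x1, x2 = rng.randn(100, 3), rng.randn(100, 2)   # two feature sets (e.g. two modalities)
y = (x1[:, 0] + x2[:, 0] > 0).astype(int)

# early fusion: concatenate the features, train a single classifier
early = LogisticRegression().fit(np.hstack([x1, x2]), y)

# late fusion: one classifier per feature set, then combine P(c|x_m), e.g. by averaging
clf1 = LogisticRegression().fit(x1, y)
clf2 = LogisticRegression().fit(x2, y)
p_late = (clf1.predict_proba(x1) + clf2.predict_proba(x2)) / 2
decision_late = p_late.argmax(axis=1)

print(early.predict(np.hstack([x1, x2]))[:5], decision_late[:5])
```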


Overview
Agenda
1. What is supervised learning (Classification)?
2. How to build an “optimal” classifier ?
3. What kind of classifiers are there ?
a) “Naive” Bayes
b) Nearest Neighbors
c) Logistic Regression
d) Support Vector Machine (SVM)

4. How to combine (“fuse”) distinct classifiers?


5. Summary and conclusion
Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 66
Discussion

• This lecture – four sample classifiers


• Naive Bayes (with Gaussian class-conditional densities, CCDs)
• K-nearest neighbor
• Logistic regression
• Support Vector Machine (SVM)

• The Big Answer to “Which one is the best?”


• the right classifier depends on the distribution of the target data...
• … on the preprocessing ...
• … on the features...
• … on the amount of training data

→ no-free-lunch theorem
Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 68
Questions?
Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 69
