
Introduction to Machine Learning & Deep Learning (Fall 2023)

Lecture 6: Supervised Learning – Classification - SVM


Prof. Damian Borth

Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 1


Last Lecture

• Supervised Learning
– Classification
• Classifier
– Naïve Bayes
– k-Nearest Neighbour
– Logistic Regression
– Support Vector Machine
• Classifier Fusion

Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 2


This Lecture

• Supervised Learning
– Classification
• Classifier
– Naïve Bayes
– k-Nearest Neighbour
– Logistic Regression
– Support Vector Machine
• Classifier Fusion

Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 3


Supervised Learning

Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 4


Types of Machine Learning

Machine Learning

Supervised Learning
• Objective: Learn the relationship between data and a desired output
• Data x contains labels c whose relationship is to be learned
  → labels are known
• “Learning known patterns”
• e.g. decision trees, neural nets, support vector machines etc.
• Tasks: Classification, Regression

Semi-supervised Learning
• Objective: Learn structure, using only a few labels to label patterns
• Data:
  → few labels are known
  → many labels are unknown

Self-supervised Learning
• Objective: Learn a representation of the data via controlled pseudo-labels for downstream tasks
• Data:
  → labels are unknown
  → representation & linear classifier

Unsupervised Learning
• Objective: Identification of unknown distributions, patterns and dependencies
• Data x contains dependencies or patterns to be observed
  → labels are unknown
• “Learning unknown patterns”
• e.g. clustering algorithms, principal component analysis, self-organizing maps etc.
• Tasks: Clustering, Dim. Reduction

Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 5




Overview
Agenda
1. What is supervised learning (Classification)?
2. How to build an “optimal” classifier ?
3. What kind of classifiers are there ?
a) “Naive” Bayes
b) Nearest Neighbors
c) Logistic Regression
d) Support Vector Machine (SVM)

4. How to combine (“fuse”) distinct classifiers?


5. Summary and conclusion
Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 7
Parametric vs Non-parametric

• So far, we assumed P(x|c) to be Gaussian.


• What about these distributions?

• Often, we don't know the parametric form of P(x|c)


• Possible approaches:
• mixtures of Gaussians
• non-parametric methods (no parameters, no training)
Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 8
k-Nearest Neighbor - Idea

Intuitive Understanding

Idea

“Assign each unknown example x to the majority class y of its k closest neighbors, where k is a parameter.”

[Figure: an unknown example x is classified against its k = 1, 3, 5 nearest neighbors from classes y = 0 and y = 1.]

Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 9


k-Nearest Neighbor - Approach

Given

• A set of labeled training samples {xi, yi}


• xi - feature representation of examples
• yi - class labels (e.g. document type, rating on YouTube etc.)
• An unknown sample x whose target we aim to predict

Classification Algorithm

• Compute the distance D(x, xi) of x to every training sample xi


• Select the k closest instances xi1 … xik and their class labels yi1 … yik
• Classify x according to the majority class of its k neighbors
• Calculating the majority class (a minimal code sketch follows below):

$$P(y \mid x) = \frac{1}{k} \sum_{j=1}^{k} \delta(y_{i_j}, y), \qquad \delta(y_{i_j}, y) = \begin{cases} 1 & \text{if } y_{i_j} = y \\ 0 & \text{if } y_{i_j} \neq y \end{cases}$$

Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 10
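A minimal NumPy sketch of the classification algorithm described above; the function and variable names are illustrative, not from the lecture.

```python
import numpy as np

def knn_predict(x, X_train, y_train, k=3):
    """Classify a single sample x by the majority class of its k nearest neighbors."""
    # 1. compute the distance D(x, xi) of x to every training sample xi (Euclidean / L2)
    distances = np.linalg.norm(X_train - x, axis=1)
    # 2. select the k closest instances and their class labels
    nearest = np.argsort(distances)[:k]
    neighbor_labels = y_train[nearest]
    # 3. classify x according to the majority class of its k neighbors
    #    P(y|x) = (1/k) * sum_j delta(y_ij, y)
    classes, counts = np.unique(neighbor_labels, return_counts=True)
    return classes[np.argmax(counts)]

# toy usage with two classes (y = 0 and y = 1)
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(np.array([0.95, 0.9]), X_train, y_train, k=3))  # -> 1
```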


k-Nearest Neighbor – Distance Measures

“Euclidean” Distance (“L2 norm”)

• Used in the context of continuous variables
• Not very robust, single solution

“Manhattan” Distance (“L1 norm”)

• Used in the context of binary or encoded variables
• Robust, possibly multiple solutions

“Hamming” Distance
• Used in the context of categorical variables
• E.g. distance between names, document types

Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 11
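The three distance measures as a short sketch (plain NumPy; names are illustrative):

```python
import numpy as np

def euclidean(a, b):
    """L2 norm: sqrt(sum((a_i - b_i)^2)), for continuous variables."""
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    """L1 norm: sum(|a_i - b_i|), for binary or encoded variables."""
    return np.sum(np.abs(a - b))

def hamming(a, b):
    """Number of positions where two categorical sequences differ."""
    return sum(x != y for x, y in zip(a, b))

a, b = np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 3.0])
print(euclidean(a, b), manhattan(a, b), hamming("karolin", "kathrin"))  # ~2.236, 3.0, 3
```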


k-Nearest Neighbor – Different “k” Example

[Figure: k-NN decision boundaries for k = 1, 3, 10, 50, and 200.]

Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 12


k-Nearest Neighbor

Summary and Discussion


Pro’s
• “Non-parametric” approach: ”no” assumptions about the data distribution
• Simple to implement
• Flexible in the choice of features / distance measures

Con’s
• Computationally expensive: time (computes all distances), space (stores all examples)
• Sensitive to outliers / irrelevant features

Use Cases
• Spam filtering
• Recommender systems
• Text classification
• Document similarity

$$P(y \mid x) = \frac{1}{k} \sum_{j=1}^{k} \delta(y_{i_j}, y), \qquad \delta(y_{i_j}, y) = \begin{cases} 1 & \text{if } y_{i_j} = y \\ 0 & \text{if } y_{i_j} \neq y \end{cases}$$

Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 13


k-Nearest Neighbor

• When is nearest neighbor (NN) successful?


• we need many samples in small regions!

• Is nearest neighbor better than Gaussians?


• not necessarily – if the underlying class-conditional densities
are truly Gaussian and we can determine parameters reliably,
Gaussians are the optimal model!

• Are there really no parameters?


• there is k as a hyper-parameter to choose
• low k = high variance
• high k = oversmoothing
• a good compromise in practice: k = √n

Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 14


Overview
Agenda
1. What is supervised learning (Classification)?
2. How to build an “optimal” classifier ?
3. What kind of classifiers are there ?
a) “Naive” Bayes
b) Nearest Neighbors
c) Logistic Regression
d) Support Vector Machine (SVM)

4. How to combine (“fuse”) distinct classifiers?


5. Summary and conclusion
Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 15
Discriminative Models

• We saw a generative model: „Gaussians“


• we know P(x|c) and P(c), i.e. we know P(c|x) (via Bayes’ rule)
• we can „generate“ samples (x', c'):
• draw c' from P(c)
• draw x' from P(x|c)

• Alternative:
• omit P(x|c) and P(c), and directly estimate P(c|x) !

→ discriminative models: P(c|x) = fΘ(x)

Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 16


Logistic Regression - Introduction

Intuitive Understanding
Linear Regression
• P(c|x) = fΘ(x1) = m·x1 + b, with Θ = (m, b)

[Figure: “Sea Bass?” (Yes = 1, No = 0) plotted against lightness x1, with a fitted line and the 0.5 threshold.]

Classification Hypothesis

• Threshold the classifier output fΘ(x1) at 0.5:
• If fΘ(x1) ≥ 0.5, predict c = 1 “Sea Bass”
• If fΘ(x1) < 0.5, predict c = 0 “Salmon”

Challenge
• “How to handle anomalies or different modalities in the data?”

Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 17


Logistic Regression - Introduction

Intuitive Understanding
Linear Regression – Challenge: Outlier
• P(c|x) = fΘ(x1) = m·x1 + b, with Θ = (m, b)
• A single outlier pulls the fitted line: is this still a good hypothesis?

[Figure: the same “Sea Bass?” plot against lightness x1, now with an outlier shifting the regression line and the 0.5 threshold.]

Classification Hypothesis

• Threshold the classifier output fΘ(x1) at 0.5:
• If fΘ(x1) ≥ 0.5, predict c = 1 “Sea Bass”
• If fΘ(x1) < 0.5, predict c = 0 “Salmon”

Idea
• Improve “Linear Regression” by:
  (1) a non-linear hypothesis fΘ
  (2) learnable parameters Θ
→ “Logistic Regression”

Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 18


Logistic Regression - Idea (one dimension)

• Remember: in the Gaussian case, P(c|x) was a sigmoid function

• Idea: model fΘ(x) as a sigmoid applied to a linear function of x (see the sketch below)

Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 19
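The slide’s formula is shown as an image; the standard one-dimensional logistic regression model it refers to can be written as:

$$f_\Theta(x) = \sigma(m x + b) = \frac{1}{1 + e^{-(m x + b)}}, \qquad \Theta = (m, b)$$

where σ is the sigmoid (logistic) function and m x + b the linear part.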




Logistic Regression - Idea (more dimensions)

• In more dimensions, we have a weight vector w

• The decision boundary becomes a (linear) hyperplane

• We can omit b by using augmented vectors (sketched below):

Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 21
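The slide’s equations are images; a standard reconstruction of the multi-dimensional model and the augmented-vector trick:

$$P(c=1 \mid x) = f_{w,b}(x) = \sigma(w^\top x + b), \qquad \tilde{x} = \begin{bmatrix} x \\ 1 \end{bmatrix}, \; \tilde{w} = \begin{bmatrix} w \\ b \end{bmatrix} \;\Rightarrow\; f_{\tilde{w}}(\tilde{x}) = \sigma(\tilde{w}^\top \tilde{x})$$

The decision boundary w^T x + b = 0 is then a (linear) hyperplane.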


Logistic Regression - Approach

Given
• A set of labeled training samples {xi, ci}
• xi - feature representation of examples
• ci - class labels (e.g. document type, rating on YouTube etc.)
• For each weight configuration w we can compute the classification loss 𝓛 “Error”

Training Algorithm (see Bishop p. 205f.)


• Initialize the weight configuration w0 “Gradient Descent Learning”
• Until convergence of loss 𝓛 do:
• Update the weight configuration according
to Gradient Descent Learning
• Increase k = k+1

Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 22
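A minimal NumPy sketch of the training loop described above (gradient descent on the cross-entropy loss; variable names, learning rate, and the convergence check are illustrative, not from Bishop):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, c, lr=0.1, tol=1e-6, max_iter=10000):
    """Gradient descent on the cross-entropy loss; X is augmented with a bias column."""
    X_aug = np.hstack([X, np.ones((X.shape[0], 1))])   # augmented vectors: x -> [x, 1]
    w = np.zeros(X_aug.shape[1])                        # initialize weight configuration w0
    prev_loss = np.inf
    for k in range(max_iter):
        p = sigmoid(X_aug @ w)                          # P(c=1|x) for all samples
        loss = -np.mean(c * np.log(p + 1e-12) + (1 - c) * np.log(1 - p + 1e-12))
        grad = X_aug.T @ (p - c) / len(c)               # gradient of the loss w.r.t. w
        w -= lr * grad                                  # gradient descent update
        if abs(prev_loss - loss) < tol:                 # until convergence of the loss
            break
        prev_loss = loss
    return w

# toy usage: 1-D "lightness" feature, two classes
X = np.array([[0.2], [0.4], [0.6], [0.8]])
c = np.array([0, 0, 1, 1])
w = train_logistic_regression(X, c)
print(sigmoid(np.hstack([X, np.ones((4, 1))]) @ w))  # predicted P(c=1|x)
```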


Logistic Regression

Summary and Discussion


Pro’s
• “Discriminative” approach: learn only what is needed
• Results are easy to interpret
• Can be trained fast

Con’s
• Non-deterministic results, may end up in a local minimum
• Learns only linear decision boundaries
• Vulnerable to overfitting

Use Cases
• Predictive maintenance
• Medical treatment response
• Customer churn prediction
• Loan default prediction

Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 23


Overview
Agenda
1. What is supervised learning (Classification)?
2. How to build an “optimal” classifier ?
3. What kind of classifiers are there ?
a) “Naive” Bayes
b) Nearest Neighbors
c) Logistic Regression
d) Support Vector Machine (SVM)

4. How to combine (“fuse”) distinct classifiers?


5. Summary and conclusion
Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 24
Support Vector Machines (SVMs)

• Support Vector Machines were leading the state of the art in many machine learning tasks (including image recognition)
• A classifier benchmarking experiment:
– more than 100 datasets from the public UCI machine learning repository
– 7 classifiers, with parameters (for example, k in k-NN) optimized by cross-validation grid search
– this illustration counts the datasets on which each classifier works best

Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 25


Support Vector Machines (SVMs)

• SVMs were particularly successful in image recognition


• visual words + SVMs = „standard pipeline“

Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 26


Support Vector Machines (SVMs)

• SVMs were particularly successful in image recognition


• visual words + SVMs = „standard pipeline“

Visual Word Feature Extraction → [ 2, 0, 2, 0 ] → SVM Classification

Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 36


Approach

• maximum margin classification

• non-linearity by kernel functions


[Figure: a ring-shaped dataset in (x, y) becomes linearly separable after mapping each sample to (angle, distance from origin).]

Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 37


SVM: Notation

• Given:
– training samples x1, .., xn ∈ Rd with labels y1, .., yn ∈ {-1, 1}

• Geometric approach:
– find a hyperplane w that separates the classes
• f(x) = <w,x> + b
– use “augmented vectors” (x → [x,1]):
• f(x) = <w,x>


– classification: class is present ↔ f(x) > 0

Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 38


SVM: The maximum-margin Principle

• Which hyperplane is the best?


– multiple hyperplanes are possible

• Guiding Principles / Approaches


– generative models
(e.g., Gaussians with identical covariances)
– logistic regression
(likelihood maximization)
– perceptron
(error minimization)
– maximum-margin principle
Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 39
SVM: Margin Maximization

• To find the hyperplane w that maximizes the margin, let us first require that for all samples xi the following holds: yi·<w,xi> ≥ 1

Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 40


SVM: Margin Maximization

We have two kinds of samples:

– „safe“ samples xi which are „far away“ from the decision boundary: yi·<w,xi> > 1

– „support vectors“: samples xi that lie exactly on the margin: yi·<w,xi> = 1

Relationship between the margin γ and w:

• the size of the margin is γ = 1/||w||₂
• maximizing the margin is equivalent to minimizing ||w||₂ (or, equivalently, ||w||₂²)

Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 41


SVM: Margin Maximization

• Altogether, a decision boundary w* that maximizes the margin can be computed by solving the optimization problem sketched below

• This is a “simple” optimization problem


– the objective function is quadratic, i.e., differentiable and convex
→ quadratic programming
– the constraints are all linear
– a globally optimal solution can be computed in O(n³)
– in practice, the computational effort of an SVM is ≈ O(c·n^1.8)
Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 42
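The slide’s optimization problem is shown as an image; the standard hard-margin formulation it corresponds to is:

$$w^{*} = \arg\min_{w} \; \tfrac{1}{2}\lVert w \rVert^{2} \quad \text{s.t.} \quad y_i \langle w, x_i \rangle \ge 1 \;\; \text{for all } i = 1, \dots, n$$

The quadratic objective and linear constraints make this a quadratic program with a globally optimal solution.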
Classification Problem: Non-Separability

• Problem: in practice, datasets are often not linearly separable!

Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 43


Classification Problem: Non-Separability

• Problem: in practice, datasets are often not linearly separable!

• We can solve this problem by two extensions:


Slack Variables: allowing for errors during training in favor of a max-margin hyperplane w

Kernel Function: mapping samples to a (proper) higher-dimensional vector space and solving the problem there in a linear way

Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 44




SVM: Slack Variables

• What is the better hyperplane for this dataset?


→ allow some training error: introduce slack variables

[Figure: two hyperplanes on the same dataset. Left: no training errors, but a small margin (= likely test errors). Right: one training error, but a larger margin (= likely fewer test errors).]

Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 46


SVM: Max Margin & Slack Variables

• Solution: Introduce slack variables ξ1, .., ξn (soft-margin formulation sketched below)

• We can satisfy all constraints by making the ξi large enough


• The hyper-parameter C balances margin size against training errors:
→ C = ∞, i.e. „hard“ margin: all ξi are 0, no training error allowed
→ the smaller C, the larger the margin (at the cost of incorrectly classified training samples)
• The objective function is still convex („simple“ optimization)
Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 47
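The slide’s soft-margin objective is shown as an image; the standard formulation with slack variables reads:

$$w^{*} = \arg\min_{w,\,\xi} \; \tfrac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad y_i \langle w, x_i \rangle \ge 1 - \xi_i, \;\; \xi_i \ge 0$$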
Classification Problem: Non-Separability

• Problem: in practice, datasets are often not linearly separable!

• We can solve this problem by two extensions:


Slack Variables: allowing for errors during training in favor of a max-margin hyperplane w

Kernel Function: mapping samples to a (proper) higher-dimensional vector space and solving the problem there in a linear way

Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 48




SVM: Non-(Linear)-Separability

• Slack variables are not enough!


– What is the best decision boundary on this dataset?

• We need non-linear decision boundaries


• Solutions:
– higher-order decision functions
– classifier stacking
– neural networks (will be covered later)
– data transformation (kernel functions)

Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 50


SVM: Data Transformation φ

• In the example, we can find a transformation φ for the samples xi such that they become linearly separable

→ transform each xi to polar coordinates (a small sketch follows below):


[Figure: the ring-shaped dataset in (x, y) maps under φ to a linearly separable set in (angle, distance from origin).]

Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 51
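A minimal sketch of such a transformation φ to polar coordinates (plain NumPy; names are illustrative):

```python
import numpy as np

def phi_polar(X):
    """Map 2-D samples (x, y) to polar coordinates (angle, distance from origin)."""
    angle = np.arctan2(X[:, 1], X[:, 0])     # angle in radians
    radius = np.linalg.norm(X, axis=1)        # distance from origin
    return np.stack([angle, radius], axis=1)

# a ring-shaped class (radius ~2) and an inner class (radius ~0.5) become
# separable by a horizontal line in the (angle, radius) space
theta = np.linspace(0, 2 * np.pi, 8, endpoint=False)
ring = np.stack([2 * np.cos(theta), 2 * np.sin(theta)], axis=1)
inner = np.stack([0.5 * np.cos(theta), 0.5 * np.sin(theta)], axis=1)
print(phi_polar(ring)[:, 1], phi_polar(inner)[:, 1])  # radii ~2.0 vs ~0.5
```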


SVM: Data Transformation φ

• Linear Classification with Data Transformation

→ define a feature transformation φ: Rd → Rm


→ perform classification on φ(xi) instead of xi

Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 52


SVM: Kernel Trick & Representer Theorem

• Finding „good“ data transformations for the classification problem can be difficult

• Instead, we will omit the transformation φ(x) and use a similarity function k(xi,xj) that compares two samples xi, xj → this approach is called the kernel trick

• The similarity functions k(xi,xj) are called kernel functions

• The Representer Theorem is the basis of the kernel trick

• It tells us that the maximum-margin solution lies in the subspace spanned by the training samples, i.e. we can rewrite the maximum-margin solution w as shown below:
Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 53
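The slide’s expansion is shown as an image; in one common convention (labels yi written explicitly), the Representer Theorem statement used here reads:

$$w = \sum_{i=1}^{n} \alpha_i \, y_i \, \varphi(x_i)$$

i.e. the solution is a weighted combination of the (transformed) training samples, with one coefficient αi per sample.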
SVM: Kernel Trick & Representer Theorem

Using the Representer Theorem,


we can rewrite the max-margin problem so that the samples appear only through inner products,

and replace these inner products by the kernel function: <φ(xi), φ(xj)> = k(xi,xj)

→ this yields the (dual) SVM equation, sketched below

Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 54
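The “SVM equation” itself is shown as an image on the slide; under the usual convention, the dual problem and the kernelized decision function it refers to are:

$$\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j \, y_i y_j \, k(x_i, x_j) \quad \text{s.t.} \quad 0 \le \alpha_i \le C, \;\; \sum_i \alpha_i y_i = 0$$

$$f(x) = \langle w, \varphi(x) \rangle = \sum_{i=1}^{n} \alpha_i \, y_i \, k(x_i, x)$$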


Kernel Trick & Representer Theorem - Consequence

• Kernel Trick
– We can omit the computation of φ, and simply compute the kernel function k(.,.)

• Kernel Function k(.,.)
– The kernel function k(xi,xj) defines a similarity measure between xi and xj
– there are several kernel functions to choose from

• We do not even have to know φ


– this is actually pretty awesome!

Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 55


SVM: Training

• Given
– a training set with samples x1, .., xn and their labels y1, .., yn

• Algorithm (a minimal sketch follows below)
1. choose a kernel function k(.,.)
2. estimate α1, .., αn by optimizing the SVM equation (αi ≠ 0 → xi is a „support vector“)
3. These α1, .., αn values define a maximum-margin decision boundary in a high-dimensional space defined by the kernel function.

Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 56
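A minimal training sketch using scikit-learn’s SVC (an assumption, since the lecture does not prescribe a library; the kernel, C, and gamma values are illustrative):

```python
import numpy as np
from sklearn.svm import SVC

# toy training set: two classes with labels in {-1, +1}
X_train = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.2]])
y_train = np.array([-1, -1, 1, 1])

# 1. choose a kernel function; 2.+3. estimate the alpha_i by solving the SVM problem
clf = SVC(kernel="rbf", C=1.0, gamma=1.0)
clf.fit(X_train, y_train)

print(clf.support_)    # indices of the support vectors (alpha_i != 0)
print(clf.dual_coef_)  # the products alpha_i * y_i for the support vectors
```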




SVM: Classification

• Given
– a test sample x and the trained support vectors xi with coefficients αi
• Unknown
– the class label y of x

• Classification
1. compute k(x, xi) for all support vectors xi
2. compute the classification score f(x) = Σi αi yi k(xi, x)
3. class decision: sign(f(x))

Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 58


SVM: Kernel Best Practice

• How do we choose kernels k(.,.) in practice?


• They can be constructed from distance functions
– if d(.,.) is a distance function, then e^{-d(.,.)}, i.e. exp{-d(.,.)}, can be used as a kernel function

• Some practical kernel functions (common definitions are sketched below):
– Linear
– Polynomial
– Gaussian (RBF)
– Histogram intersection
– Chi-square
Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 59
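Common definitions for these kernels as a short sketch; the slide’s own formulas are images, so the variants below are the usual textbook forms, and β, the degree d, offset c, and the small ε are illustrative choices:

```python
import numpy as np

def linear_kernel(x, z):
    return np.dot(x, z)

def polynomial_kernel(x, z, d=2, c=1.0):
    return (np.dot(x, z) + c) ** d

def rbf_kernel(x, z, beta=1.0):
    # Gaussian kernel: exp(-beta * ||x - z||^2)
    return np.exp(-beta * np.sum((x - z) ** 2))

def histogram_intersection_kernel(x, z):
    # for histogram features (non-negative bins)
    return np.sum(np.minimum(x, z))

def chi_square_kernel(x, z, eps=1e-12):
    # exp(-sum((x_i - z_i)^2 / (x_i + z_i))), built from the chi-square distance
    d = np.sum((x - z) ** 2 / (x + z + eps))
    return np.exp(-d)

x = np.array([2.0, 0.0, 2.0, 0.0])  # the visual-word histogram from the slides
z = np.array([1.0, 1.0, 2.0, 0.0])
print(rbf_kernel(x, z, beta=0.5), histogram_intersection_kernel(x, z))
```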
SVM: Kernel Best Practice

• Kernels should show a class-wise block structure

• Example: β in the Gaussian kernel

[Figure: kernel matrices for β very large vs. β very small (picture: Christoph Lampert).]
Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 60
SVM: Hyper-Parameter Optimization

• Parameter Optimization in SVMs:


– cost of misclassified training samples: C
– kernel parameter β

• Frequently used approach: Grid Search (a sketch follows below)
– test different values of C and β on a regular grid (alternatively, a log grid)
– for each pair, measure classification accuracy on a held-out validation set

[Figure: grid over C and β; the region of good parameter choices is highlighted.]
Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 61
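A minimal grid-search sketch with scikit-learn (an assumption, since the lecture does not prescribe a library; GridSearchCV uses cross-validation rather than a single held-out set, and the log grids below are illustrative, with gamma playing the role of β):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# toy data with labels in {-1, +1}
rng = np.random.RandomState(0)
X = rng.randn(40, 2)
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

# log grids for C and the RBF bandwidth
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1, 10]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print(search.best_params_, search.best_score_)
```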
SVMs – Summary

• Support Vector Machines were state-of-the-art classifiers, particularly successful in image recognition
• Advantages
– the maximum-margin problem can be solved globally optimally!
– the number of parameters is „independent“ of the feature dimensionality.
This makes SVMs very suitable classifiers for small, high-dimensional training sets!
– flexibility: we can incorporate application-specific kernels
– very good empirical results

• Disadvantages
– often: ad hoc choice of kernel functions
– scalability problems to large training sets
– limited learning capacity with a large number of positive samples

Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 62


Overview
Agenda
1. What is supervised learning (Classification)?
2. How to build an “optimal” classifier ?
3. What kind of classifiers are there ?
a) “Naive” Bayes
b) Nearest Neighbors
c) Logistic Regression
d) Support Vector Machine (SVM)

4. How to combine (“fuse”) distinct classifiers?


5. Summary and conclusion
Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 63
Early vs. Late Fusion

• Multiple classifiers can combine


different pieces of evidence
• multiple features
• multiple modalities
• multiple classifiers
• multiple training sets

Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 64


Early vs. Late Fusion

• Different combination strategies


• early fusion = concatenate features
• late fusion = combine classification results

Early fusion: features x1, x2, …, xM → concatenate [x1, x2, .., xM] → classifier → decision

Late fusion: each feature xm → its own classifier → P(c|xm); combine P(c|x1), …, P(c|xM) → decision (a small code sketch follows below)

Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 65
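A small sketch contrasting the two strategies (scikit-learn is an assumption; late fusion here averages the classifiers’ probability estimates, which is one common combination rule):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
x1, x2 = rng.randn(100, 3), rng.randn(100, 2)   # two feature sets (e.g. two modalities)
y = (x1[:, 0] + x2[:, 0] > 0).astype(int)

# early fusion: concatenate the features, train a single classifier
early = LogisticRegression().fit(np.hstack([x1, x2]), y)

# late fusion: one classifier per feature set, then combine P(c|x_m), e.g. by averaging
clf1 = LogisticRegression().fit(x1, y)
clf2 = LogisticRegression().fit(x2, y)
p_late = (clf1.predict_proba(x1) + clf2.predict_proba(x2)) / 2
decision_late = p_late.argmax(axis=1)

print(early.predict(np.hstack([x1, x2]))[:5], decision_late[:5])
```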


Overview
Agenda
1. What is supervised learning (Classification)?
2. How to build an “optimal” classifier ?
3. What kind of classifiers are there ?
a) “Naive” Bayes
b) Nearest Neighbors
c) Logistic Regression
d) Support Vector Machine (SVM)

4. How to combine (“fuse”) distinct classifiers?


5. Summary and conclusion
Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 66
Discussion

• This lecture – four sample classifiers


• Naive Bayes (with Gaussian class-conditional densities, CCDs)
• K-nearest neighbor
• Logistic regression
• Support Vector Machine (SVM)

• The Big Answer to “Which one is the best?”


• the right classifier depends on the distribution of the target data...
• … on the preprocessing ...
• … on the features...
• … on the amount of training data

→ no-free-lunch theorem
Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 68
Questions?
Prof. Damian Borth - Artificial Intelligence & Machine Learning [AI:ML] 69
