Chapter 2: Classification
Bernd Bischl, Julia Moosbauer, Andreas Groll
The goal in classification is to predict a discrete output
$$y \in \mathcal{Y} = \{C_1, \dots, C_g\}$$
with $2 \le g < \infty$, given data $\mathcal{D}$.
Classifiers
CLASSIFICATION MODELS
We defined models $f: \mathcal{X} \to \mathbb{R}^g$ as functions that output (continuous) scores / probabilities, not (discrete) classes. Why?
- From an optimization perspective, it is much (!) easier to optimize costs for continuous-valued functions.
- Scores / probabilities (for classes) contain more information than the class labels alone.
- As we will see later, scores can easily be transformed into class labels; but class labels cannot be transformed into scores.
We distinguish scoring and probabilistic classifiers.
Scoring Classifiers:
- Construct $g$ discriminant / scoring functions $f_1, \dots, f_g: \mathcal{X} \to \mathbb{R}$.
- Scores $f_1(x), \dots, f_g(x)$ are transformed into classes by choosing the class with the maximum score.
In the binary case with $\mathcal{Y} = \{-1, +1\}$, a single discriminant function $f(x) := f_1(x) - f_{-1}(x)$ suffices:
$$h(x) = 1 \iff f_1(x) > f_{-1}(x) \iff f_1(x) - f_{-1}(x) > 0 \iff f(x) > 0.$$
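A minimal sketch of this binary decision rule, assuming a hypothetical linear scoring function (the weights and intercept below are made up for illustration):

```python
import numpy as np

# Hypothetical linear scoring function f(x) = w^T x + b for the binary case;
# the weights and intercept are made up for illustration.
w = np.array([0.8, -0.5])
b = 0.1

def f(x):
    """Score of observation x; the sign decides the class, |f(x)| is the confidence."""
    return w @ x + b

def h(x):
    """Binary scoring classifier h(x) = sgn(f(x)) with labels in {-1, +1}."""
    return 1 if f(x) > 0 else -1

x = np.array([1.0, 2.0])
print(f(x), h(x))  # score -0.1 -> predicted class -1
```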
- Class labels are constructed by $h(x) = \operatorname{sgn}(f(x))$.
- $|f(x)|$ is called the "confidence".
Probabilistic Classifiers:
- Construct $g$ probability functions $\pi_1, \dots, \pi_g: \mathcal{X} \to [0, 1]$ with $\sum_i \pi_i = 1$.
- Probabilities $\pi_1(x), \dots, \pi_g(x)$ are transformed into labels by predicting the class with the maximum probability.
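The multiclass rule is then a plain arg max over the probability vector; a small sketch with a hypothetical probability vector:

```python
import numpy as np

# Hypothetical probability vector pi_1(x), ..., pi_g(x) for g = 3 classes.
pi_x = np.array([0.2, 0.5, 0.3])
assert np.isclose(pi_x.sum(), 1.0)  # entries of a probabilistic classifier sum to 1

# Predict the class with the maximum probability (classes indexed 1, ..., g).
k_hat = int(np.argmax(pi_x)) + 1
print(k_hat)  # -> 2
```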
Remark: If we want to emphasize that our model outputs probabilities, we denote the model as $\pi: \mathcal{X} \to [0, 1]^g$; if we are talking about models in a general sense, we write $f$, comprising both probabilistic and scoring classifiers (context will make this clear!).
- Both scoring and probabilistic classifiers can be turned into discrete classes via thresholding (binary case) or by predicting the class with the maximum score (multiclass).
- Discrete classes, which are often intrinsically produced by scores, cannot be transformed back into scores.
DECISION BOUNDARIES
A classifier partitions the input space into decision regions
$$\mathcal{X}_k = \{x \in \mathcal{X} : h(x) = k\}.$$
In the binary case, the decision boundary consists of the points with
$$f(x) = c,$$
if $c \in \mathbb{R}$ was used as threshold (usually $c = 0$ for scoring classifiers and $c = 0.5$ for probabilistic classifiers).
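To make the definition concrete, a sketch that recovers the decision boundary of the hypothetical linear classifier from the sketch above by evaluating $f$ on a grid (grid size, ranges, and tolerance are arbitrary choices):

```python
import numpy as np

# Reuse the hypothetical linear score f(x) = w^T x + b from the sketch above;
# c = 0 is the usual threshold for scoring classifiers.
w, b, c = np.array([0.8, -0.5]), 0.1, 0.0

# Evaluate the score on a grid over a 2D input space (ranges chosen arbitrarily).
x1, x2 = np.meshgrid(np.linspace(-3, 3, 200), np.linspace(-3, 3, 200))
scores = w[0] * x1 + w[1] * x2 + b

# Decision regions X_k = {x : h(x) = k}; the boundary is where f(x) is (close to) c.
labels = np.where(scores > c, 1, -1)
boundary = np.isclose(scores, c, atol=0.02)
print(labels.shape, int(boundary.sum()))
```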
[Figure: decision boundaries on the iris data (Sepal.Length vs. Sepal.Width, classes setosa / versicolor / virginica) for four classifiers with differently shaped boundaries: QDA, decision tree, nonlinear SVM, Naive Bayes.]
LINEAR CLASSIFIERS
If the discriminant functions $f_k(x)$ can be specified as linear functions (possibly through a rank-preserving, monotone transformation $g: \mathbb{R} \to \mathbb{R}$), we call the classifier a linear classifier. The boundary between classes $i$ and $j$ is then given by
$$\begin{aligned}
f_i(x) = f_j(x) &\iff g(f_i(x)) = g(f_j(x)) \\
&\iff w_i^\top x + b_i = w_j^\top x + b_j \\
&\iff (w_i - w_j)^\top x + (b_i - b_j) = 0 \\
&\iff w_{ij}^\top x + b_{ij} = 0,
\end{aligned}$$
with $w_{ij} := w_i - w_j$ and $b_{ij} := b_i - b_j$. This is a hyperplane separating the two classes.
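A small numeric sketch of the pairwise boundary, with made-up weights and intercepts for two hypothetical linear discriminants:

```python
import numpy as np

# Hypothetical linear discriminants f_i(x) = w_i^T x + b_i for two classes i, j.
w_i, b_i = np.array([1.0, 2.0]), 0.5
w_j, b_j = np.array([-1.0, 1.0]), 1.5

# The boundary between classes i and j is the hyperplane w_ij^T x + b_ij = 0.
w_ij = w_i - w_j   # (2.0, 1.0)
b_ij = b_i - b_j   # -1.0

def on_boundary(x, tol=1e-9):
    """True if x lies on the separating hyperplane between classes i and j."""
    return abs(w_ij @ x + b_ij) < tol

print(on_boundary(np.array([0.5, 0.0])))  # 2*0.5 + 1*0.0 - 1.0 = 0 -> True
```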
SCALING: SIGMOIDS (BINARY)
Probabilistic classifiers are often preferred because probabilities yield a
more natural interpretation. Any score-generating model can be turned
into a probability estimator.
Examples of sigmoid functions:
- Arctan function: $s(t) = \arctan(t)$
- Hyperbolic tangent: $s(t) = \tanh(t) = \frac{e^t - e^{-t}}{e^t + e^{-t}}$
- Logistic function: $s(t) = \frac{1}{1 + e^{-t}}$
- Any cumulative distribution function (cdf), e.g., that of the normal distribution (then also called the probit function)
(Note that arctan and tanh must be affinely rescaled to map onto $[0, 1]$.)
[Figure: the four sigmoid functions (arctan, tanh, logistic, probit), scaled to $[0, 1]$, plotted for $x \in [-5, 5]$.]
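A sketch of the four transformations; affinely rescaling arctan and tanh onto $[0, 1]$ is our assumption so that all four are comparable as probability-type outputs (the slides only list the raw functions):

```python
import numpy as np
from scipy.stats import norm

def arctan_scaled(t):
    # arctan maps R onto (-pi/2, pi/2); rescale onto (0, 1)
    return np.arctan(t) / np.pi + 0.5

def tanh_scaled(t):
    # tanh maps R onto (-1, 1); rescale onto (0, 1)
    return (np.tanh(t) + 1) / 2

def logistic(t):
    return 1 / (1 + np.exp(-t))

def probit(t):
    # cdf of the standard normal distribution
    return norm.cdf(t)

t = np.linspace(-5, 5, 5)
for s in (arctan_scaled, tanh_scaled, logistic, probit):
    print(s.__name__, np.round(s(t), 3))
```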
The logistic function
$$s(t) = \frac{1}{1 + e^{-t}}$$
is a popular choice for transforming scores into probabilities (used, for example, in logistic regression). Properties of the logistic function $s(t)$:
- $\lim_{t \to -\infty} s(t) = 0$ and $\lim_{t \to \infty} s(t) = 1$
- $\frac{\partial s(t)}{\partial t} = \frac{\exp(t)}{(1 + \exp(t))^2} = s(t)\,(1 - s(t))$
- $s(t)$ is point-symmetric about $(0, \frac{1}{2})$
[Figure: graph of the logistic function $s(t)$ around $t = 0$.]
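A quick numerical check of these properties (the test points and step size are arbitrary):

```python
import numpy as np

def s(t):
    return 1 / (1 + np.exp(-t))

t = np.array([-2.0, -0.5, 0.0, 1.3])
eps = 1e-6

# derivative: central differences should match s(t) * (1 - s(t))
num_grad = (s(t + eps) - s(t - eps)) / (2 * eps)
assert np.allclose(num_grad, s(t) * (1 - s(t)), atol=1e-6)

# point symmetry about (0, 1/2) is equivalent to s(-t) = 1 - s(t)
assert np.allclose(s(-t), 1 - s(t))
print("properties verified")
```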
SCALING: SOFTMAX (MULTICLASS)
Any multiclass scoring classifier can be transformed into probabilities using the softmax transformation, which maps a score vector $z$ to a vector of probabilities:
$$s(z)_k = \frac{\exp(z_k)}{\sum_j \exp(z_j)}$$
It is a generalization of the logistic function (check for $g = 2$). It "squashes" a $g$-dimensional real-valued vector $z$ to a vector of the same dimension, with every entry in the range $[0, 1]$ and all entries adding up to $1$. Applied to the scores of a classifier:
$$\pi_k(x) = \frac{\exp(f_k(x))}{\sum_{j=1}^g \exp(f_j(x))}.$$
Compared to the arg max operator, the softmax keeps information about the other, non-maximal elements in a reversible way; hence the name "soft" max.
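A minimal softmax sketch; subtracting the maximum before exponentiating is a standard numerical-stability trick, not something the slides require:

```python
import numpy as np

def softmax(z):
    """Map a g-dimensional score vector to a probability vector."""
    z = z - np.max(z)      # numerical stability; softmax is invariant to shifts
    e = np.exp(z)
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])      # hypothetical scores f_1(x), ..., f_g(x)
probs = softmax(scores)
print(np.round(probs, 3), probs.sum())  # entries in [0, 1], summing to 1

# for g = 2, softmax reduces to the logistic function of the score difference
z = np.array([1.5, 0.0])
assert np.isclose(softmax(z)[0], 1 / (1 + np.exp(-(z[0] - z[1]))))
```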
Generative vs. Discriminative Approaches
GENERATIVE APPROACHES
Two fundamental approaches exist to construct classifiers: the generative approach and the discriminative approach. Generative approaches model the class-conditional distribution $P(x \,|\, y = k)$ and obtain the posterior class probabilities via Bayes' theorem:
$$\pi_k(x) = P(y = k \,|\, x) = \frac{P(x \,|\, y = k)\, P(y = k)}{P(x)} \propto P(x \,|\, y = k)\, \pi_k$$
(here $\pi_k = P(y = k)$ denotes the prior class probability).
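A tiny numeric illustration of the Bayes-rule step, with made-up likelihood values and priors:

```python
import numpy as np

# Hypothetical class-conditional densities P(x | y = k) at a fixed x, and priors.
likelihoods = np.array([0.30, 0.05, 0.10])  # P(x | y = k), made-up values
priors = np.array([0.5, 0.3, 0.2])          # P(y = k)

# Posterior is proportional to likelihood * prior; normalizing divides by P(x).
unnormalized = likelihoods * priors
posterior = unnormalized / unnormalized.sum()
print(np.round(posterior, 3))  # pi_k(x) = P(y = k | x)
```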
Examples:
- Linear discriminant analysis (LDA): generative, linear. Each class's density is a multivariate Gaussian with a covariance matrix shared across classes,
$$P(x \,|\, y = k) \sim N(\mu_k, \Sigma).$$
- Quadratic discriminant analysis (QDA): generative, nonlinear. Like LDA, but with class-specific covariance matrices,
$$P(x \,|\, y = k) \sim N(\mu_k, \Sigma_k).$$
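A sketch of the LDA-style generative computation, with hypothetical class means, one shared covariance matrix, and priors:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical LDA parameters: class means, one shared covariance, priors.
means = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
cov = np.array([[1.0, 0.2], [0.2, 1.0]])
priors = np.array([0.6, 0.4])

def posterior(x):
    """pi_k(x) via Bayes' rule from Gaussian class conditionals."""
    lik = np.array([multivariate_normal.pdf(x, mean=m, cov=cov) for m in means])
    unnorm = lik * priors
    return unnorm / unnorm.sum()

print(np.round(posterior(np.array([1.0, 0.5])), 3))
```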
- Naive Bayes: generative, nonlinear. "Naive" conditional independence assumption: given the category $y$, the features are conditionally independent of each other,
$$P(x \,|\, y = k) = P((x_1, \dots, x_p) \,|\, y = k) = \prod_{j=1}^p P(x_j \,|\, y = k).$$
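A sketch of the naive factorization for continuous features, assuming (purely for illustration) univariate Gaussian per-feature densities:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical per-feature Gaussian parameters for a single class k.
feat_means = np.array([0.0, 3.0, -1.0])
feat_sds = np.array([1.0, 0.5, 2.0])

def naive_bayes_likelihood(x):
    """P(x | y = k) as the product of independent per-feature densities."""
    return float(np.prod(norm.pdf(x, loc=feat_means, scale=feat_sds)))

print(naive_bayes_likelihood(np.array([0.1, 2.8, -0.5])))
```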