Advanced Statistical Learning

Chapter 2: Classification
Bernd Bischl, Julia Moosbauer, Andreas Groll

Department of Statistics – TU Dortmund


Winter term 2020/21
CLASSIFICATION TASKS
In classification, we aim at predicting a discrete output

y ∈ Y = {C_1, ..., C_g} with 2 ≤ g < ∞, given data D.

In this course, we assume the classes to be encoded as


Y = {0, 1} or Y = {−1, 1} (in the binary case g = 2)
Y = {1, 2, ..., g } (in the multiclass case g ≥ 3)

Classifiers

CLASSIFICATION MODELS
We defined models f : X → R^g as functions that output (continuous)
scores / probabilities and not (discrete) classes. Why?
From an optimization perspective, it is much (!) easier to optimize
costs for continuous-valued functions
Scores / probabilities (for classes) contain more information than
the class labels alone
As we will see later, scores can easily be transformed into class
labels; but class labels cannot be transformed into scores
We distinguish scoring and probabilistic classifiers.

CLASSIFICATION MODELS
Scoring Classifiers:
Construct g discriminant / scoring functions f_1, ..., f_g : X → R
Scores f_1(x), ..., f_g(x) are transformed into classes by choosing the class with the maximum score

h(x) = arg max_{k ∈ {1,2,...,g}} f_k(x).

For g = 2, a single discriminant function f(x) = f_1(x) − f_{−1}(x) is sufficient (note that it would be more natural here to label the classes with {+1, −1}):

h(x) = 1
⇐⇒ f_1(x) > f_{−1}(x)
⇐⇒ f_1(x) − f_{−1}(x) > 0
⇐⇒ f(x) > 0
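A minimal sketch of these two rules (assuming NumPy; the score values below are made-up placeholders for fitted discriminant functions):

```python
# Minimal sketch (assuming NumPy); the scores are made-up placeholders for
# fitted discriminant functions f_1, ..., f_g.
import numpy as np

def predict_multiclass(scores):
    """scores: array of shape (n, g), column k holding f_k(x); returns 1-based labels."""
    return np.argmax(scores, axis=1) + 1

def predict_binary(f_x):
    """f(x) = f_1(x) - f_{-1}(x); predict +1 if f(x) > 0, else -1."""
    return np.where(np.asarray(f_x) > 0, 1, -1)

scores = np.array([[0.2, 1.3, -0.4],
                   [2.0, 0.1,  0.5]])
print(predict_multiclass(scores))              # [2 1]
print(predict_binary([1.7, -0.3]))             # [ 1 -1]
```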
CLASSIFICATION MODELS
Class labels are constructed by h(x) = sgn(f(x))
|f(x)| is called “confidence”

CLASSIFICATION MODELS
Probabilistic Classifiers:
Construct g probability functions π_1, ..., π_g : X → [0, 1] with Σ_i π_i = 1
Probabilities π_1(x), ..., π_g(x) are transformed into labels by predicting the class with the maximum probability

h(x) = arg max_{k ∈ {1,2,...,g}} π_k(x)

If g = 2, most models output a single probability function π(x) (note that it would be more natural here to label the classes with {0, 1})
If g = 2, transformation into discrete classes is done by thresholding: h(x) := 1(π(x) ≥ c) for some threshold c ∈ [0, 1] (for binary classification we usually set c = 0.5)
Probabilistic classifiers can also be seen as scoring classifiers
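A minimal sketch of both transformation rules (assuming NumPy; the probability values are made up for illustration):

```python
# Minimal sketch (assuming NumPy); the probabilities are made up for illustration.
import numpy as np

def predict_binary_from_probability(pi_x, c=0.5):
    """Binary case: h(x) = 1(pi(x) >= c)."""
    return (np.asarray(pi_x) >= c).astype(int)

def predict_multiclass_from_probabilities(pi_matrix):
    """Multiclass case: predict the class with maximum probability (1-based labels)."""
    return np.argmax(pi_matrix, axis=1) + 1

print(predict_binary_from_probability([0.8, 0.4, 0.5]))          # [1 0 1]
print(predict_multiclass_from_probabilities(
    np.array([[0.1, 0.7, 0.2],
              [0.6, 0.3, 0.1]])))                                # [2 1]
```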

CLASSIFICATION MODELS
Remark: If we want to emphasize that our model outputs probabilities, we denote the model as π : X → [0, 1]^g; if we are talking about models in a general sense, we write f, comprising both probabilistic and scoring classifiers (context will make this clear!)

CLASSIFICATION MODELS
Both scoring and probabilistic classifiers can be turned into discrete classes via thresholding (binary case) or by predicting the class with the maximum score (multiclass)
Discrete classes, which are often intrinsically produced from scores, cannot be transformed back into scores (attention: the arrow in the figure might be misleading!)

DECISION BOUNDARIES

A decision boundary is defined as a hypersurface that partitions the input space X into g (the number of classes) decision regions

X_k = {x ∈ X : h(x) = k}

Ties between those regions are called decision boundaries

{x ∈ X : ∃ i ≠ j s.t. f_i(x) = f_j(x) and f_i(x), f_j(x) ≥ f_k(x) ∀ k ≠ i, j}

(in the general multiclass case) or

f(x) = c,

if c ∈ R was used as threshold (usually c = 0 for scoring classifiers and c = 0.5 for probabilistic classifiers) in the binary case.
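A rough numerical sketch of how decision regions and the decision boundary can be located (my own illustration with a made-up linear score f(x) = w^⊤x + b and threshold c = 0):

```python
# Rough numerical sketch (made-up linear score, assuming NumPy): decision regions are
# the areas of constant predicted label; grid points with f(x) close to the threshold
# c = 0 approximate the decision boundary {x : f(x) = 0}.
import numpy as np

w, b = np.array([1.0, -2.0]), 0.5                       # hypothetical score f(x) = w^T x + b
xx, yy = np.meshgrid(np.linspace(-3, 3, 200), np.linspace(-3, 3, 200))
grid = np.column_stack([xx.ravel(), yy.ravel()])

f = grid @ w + b                                        # scores on the grid
labels = np.where(f > 0, 1, -1)                         # decision regions X_1 and X_{-1}

boundary = grid[np.abs(f) < 0.02]                       # approximate boundary points
print(labels.reshape(xx.shape).shape, boundary.shape)
```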

DECISION BOUNDARIES
[Figure: four panels of the iris data (Sepal.Length vs. Sepal.Width; species setosa, versicolor, virginica) showing different shapes of decision boundaries for four classifiers: QDA, decision tree, nonlinear SVM, Naive Bayes.]

LINEAR CLASSIFIERS
If the discriminant functions f_k(x) can be specified as linear functions (possibly through a rank-preserving, monotone transformation g : R → R), i.e.

g(f_k(x)) = w_k^⊤ x + b_k,

we will call the classifier a linear classifier.

We can then write a tie between scores as

f_i(x) = f_j(x)
⇐⇒ g(f_i(x)) = g(f_j(x))
⇐⇒ w_i^⊤ x + b_i = w_j^⊤ x + b_j
⇐⇒ (w_i − w_j)^⊤ x + (b_i − b_j) = 0
⇐⇒ w_ij^⊤ x + b_ij = 0,

LINEAR CLASSIFIERS
with w_ij := w_i − w_j and b_ij := b_i − b_j. This is a hyperplane separating two classes.

Note that linear classifiers can represent non-linear decision boundaries in the original input space if we use derived features like higher-order interactions, polynomial features, basis function expansions, etc.
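A small sketch of this point (made-up weights and my own feature map, not from the slides): a classifier that is linear in the derived features x_1, x_2, x_1^2, x_2^2, x_1 x_2 has a quadratic decision boundary in the original two-dimensional input space.

```python
# Sketch with made-up weights (my own feature map, not from the slides): a classifier
# that is linear in the derived features has a non-linear boundary in the original inputs.
import numpy as np

def phi(x):
    """Map (x1, x2) to the derived features (x1, x2, x1^2, x2^2, x1*x2)."""
    x1, x2 = x[..., 0], x[..., 1]
    return np.stack([x1, x2, x1**2, x2**2, x1 * x2], axis=-1)

w = np.array([0.0, 0.0, 1.0, 1.0, 0.0])   # hypothetical weights
b = -1.0

def f(x):
    return phi(x) @ w + b                 # linear in phi(x); boundary is x1^2 + x2^2 = 1

x = np.array([[0.0, 0.0], [2.0, 0.0]])
print(np.sign(f(x)))                      # [-1.  1.]: inside vs. outside the unit circle
```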

SCALING: SIGMOIDS (BINARY)
Probabilistic classifiers are often preferred because probabilities yield a
more natural interpretation. Any score-generating model can be turned
into a probability estimator.

For g = 2, we can use a transformation function s : R → [0, 1],

π(x) := s(f(x)) ∈ [0, 1],

to map scores to probabilities.

A commonly used type of transformation function is the sigmoid function: a sigmoid is a bounded, differentiable, real-valued function s : R → [0, 1] that has a non-negative derivative at each point.

In deep learning, sigmoids are used as activation functions.

SCALING: SIGMOIDS (BINARY)
Examples of sigmoid functions (rescaled to the range [0, 1] where necessary):
Arctan function: s(t) = arctan(t)
Hyperbolic tangent: s(t) = tanh(t) = (e^t − e^{−t}) / (e^t + e^{−t})
Logistic function: s(t) = 1 / (1 + e^{−t})
Any cumulative distribution function (cdf), e.g. the cdf of the standard normal distribution (as used in the probit model)

[Figure: the arctan, tanh, logistic and probit sigmoid functions plotted over x ∈ [−5, 5].]
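Since arctan and tanh do not map into [0, 1] directly, a rescaling is needed before they can serve as probability transformations. A minimal sketch (my own rescalings and toy values, not from the slides):

```python
# Sketch (my own rescalings and toy values, not from the slides): each sigmoid mapped
# onto [0, 1] so it can serve as a probability transformation.
import math

def arctan_sigmoid(t):    # arctan has range (-pi/2, pi/2); rescale to (0, 1)
    return 0.5 + math.atan(t) / math.pi

def tanh_sigmoid(t):      # tanh has range (-1, 1); rescale to (0, 1)
    return 0.5 * (math.tanh(t) + 1.0)

def logistic(t):
    return 1.0 / (1.0 + math.exp(-t))

def probit_sigmoid(t):    # standard normal cdf, written via the error function
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

for t in (-2.0, 0.0, 2.0):
    print(t, arctan_sigmoid(t), tanh_sigmoid(t), logistic(t), probit_sigmoid(t))
```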

SCALING: SIGMOIDS (BINARY)
The logistic function

s(t) = 1 / (1 + e^{−t})

is a popular choice for transforming scores to probabilities (used, for example, in logistic regression). Properties of the logistic function s(t):
lim_{t→−∞} s(t) = 0 and lim_{t→∞} s(t) = 1
∂s(t)/∂t = exp(t) / (1 + exp(t))^2 = s(t)(1 − s(t))
s(t) is symmetric about the point (0, 1/2)
[Figure: the logistic function s(t) for t ∈ [−4, 4].]
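These properties are easy to verify numerically. A small sketch (assuming NumPy; purely illustrative):

```python
# Sketch (assuming NumPy; purely illustrative): numerical check of the derivative
# identity s'(t) = s(t)(1 - s(t)) and of the symmetry s(0) = 1/2.
import numpy as np

def logistic(t):
    return 1.0 / (1.0 + np.exp(-t))

t = np.linspace(-4, 4, 9)
s = logistic(t)

eps = 1e-6
numeric_derivative = (logistic(t + eps) - logistic(t - eps)) / (2 * eps)
print(np.allclose(numeric_derivative, s * (1 - s), atol=1e-6))   # True
print(logistic(0.0))                                             # 0.5
```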

SCALING: SOFTMAX (MULTICLASS)
Any multiclass scoring classifier can be transformed into probabilities
using a transformation that maps the scores to a vector of probabilities

(f_1(x), ..., f_g(x)) ↦ (π_1(x), ..., π_g(x)),

fulfilling
π_k(x) ∈ [0, 1] for all k = 1, ..., g
Σ_{k=1}^g π_k(x) = 1.

A commonly used function is the softmax function, which is defined on a numerical vector z (which could be our scores):

s(z)_k = exp(z_k) / Σ_j exp(z_j)

SCALING: SOFTMAX (MULTICLASS)
It is a generalization of the logistic function (check for g = 2). It
“squashes” a g-dimensional real-valued vector z to a vector of the same
dimension, with every entry in the range [0, 1] and all entries adding up
to 1.

For a categorical response variable y ∈ {1, . . . , g} with g > 2, the model extends to

π_k(x) = exp(f_k(x)) / Σ_{j=1}^g exp(f_j(x)).

Compared to the arg max operator, the softmax keeps information about the other, non-maximal elements in a reversible way; hence the name “soft” max.
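A small sketch (assuming NumPy; my own toy scores): a numerically stable softmax, together with a check that for g = 2 it reduces to the logistic function applied to the score difference f_1(x) − f_2(x).

```python
# Sketch (assuming NumPy; toy scores): a numerically stable softmax, plus a check that
# for g = 2 it reduces to the logistic function of the score difference f_1(x) - f_2(x).
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    z = z - z.max()              # stabilization; does not change the result
    e = np.exp(z)
    return e / e.sum()

def logistic(t):
    return 1.0 / (1.0 + np.exp(-t))

p = softmax([2.0, -1.0, 0.5])
print(p, p.sum())                                   # probabilities summing to 1

f1, f2 = 1.3, -0.4
print(softmax([f1, f2])[0], logistic(f1 - f2))      # identical values
```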

Generative vs. Discriminative Approaches

GENERATIVE APPROACHES
Two fundamental approaches exist for constructing classifiers: the generative approach and the discriminant approach.

The generative approach employs Bayes' theorem,

π_k(x) = P(y = k | x) = P(x | y = k) P(y = k) / P(x) ∝ P(x | y = k) π_k,

and models P(x | y = k) (usually by assuming something about the structure of this distribution) to allow the computation of π_k(x). The discriminant functions in this approach are

π_k(x) or log P(x | y = k) + log π_k

Discriminant approaches try to model the discriminant functions directly, often by loss minimization.

GENERATIVE APPROACHES
Examples:
Linear discriminant analysis: generative, linear. Each class’s density is a multivariate Gaussian,

P(x | y = k) ∼ N(μ_k, Σ),

with equal covariances Σ (a sketch of why the resulting discriminant is linear follows below).


Quadratic discriminant analysis: generative, non-linear. Each class’s density is a multivariate Gaussian,

P(x | y = k) ∼ N(μ_k, Σ_k),

with unequal covariances Σ_k.
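Taking the logarithm of P(x | y = k) π_k for a shared covariance Σ and dropping terms that do not depend on k yields the discriminant δ_k(x) = x^⊤ Σ^{−1} μ_k − ½ μ_k^⊤ Σ^{−1} μ_k + log π_k, which is linear in x. A minimal numerical sketch with made-up parameters (in practice μ_k, Σ and the priors π_k are estimated from training data):

```python
# Sketch with made-up parameters (in practice mu_k, Sigma and the priors are estimated
# from training data): the LDA discriminant delta_k(x) is linear in x.
import numpy as np

mu = np.array([[0.0, 0.0], [2.0, 1.0]])        # class means mu_k
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])     # shared covariance Sigma
prior = np.array([0.6, 0.4])                   # class priors pi_k
Sigma_inv = np.linalg.inv(Sigma)

def lda_discriminants(x):
    """delta_k(x) = x^T Sigma^{-1} mu_k - 0.5 mu_k^T Sigma^{-1} mu_k + log pi_k."""
    return (x @ Sigma_inv @ mu.T
            - 0.5 * np.einsum('ki,ij,kj->k', mu, Sigma_inv, mu)
            + np.log(prior))

x = np.array([1.5, 0.5])
delta = lda_discriminants(x)
print(delta, delta.argmax())                   # discriminant values and predicted class index
```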

GENERATIVE APPROACHES
Naive Bayes: generative, non-linear. “Naive” conditional independence assumption, i.e. features given the category y are conditionally independent of each other:

P(x | y = k) = P((x_1, ..., x_p) | y = k) = ∏_{j=1}^p P(x_j | y = k)

(a numerical sketch follows at the end of this slide).

Logistic regression: discriminant, (usually) linear
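A small numerical sketch of the naive Bayes construction above (my own toy data with Gaussian class-conditional densities, not an example from the slides): the class-conditional density factorizes over the features, and prediction uses the generative discriminant log P(x | y = k) + log π_k.

```python
# Sketch of the naive Bayes construction (my own toy data, Gaussian class-conditional
# densities): the class-conditional density factorizes over the features, and we
# classify with the generative discriminant log P(x | y = k) + log pi_k.
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0.0, 0.0], 1.0, size=(50, 2)),     # class 0
               rng.normal([3.0, 2.0], 1.0, size=(50, 2))])    # class 1
y = np.array([0] * 50 + [1] * 50)

classes = np.unique(y)
prior = np.array([np.mean(y == k) for k in classes])
means = np.array([X[y == k].mean(axis=0) for k in classes])
variances = np.array([X[y == k].var(axis=0) for k in classes])

def gauss_logpdf(x, mean, var):
    return -0.5 * np.log(2 * np.pi * var) - (x - mean) ** 2 / (2 * var)

def predict(x):
    # log P(x | y = k) = sum_j log P(x_j | y = k) under the naive independence assumption
    log_lik = np.array([gauss_logpdf(x, means[k], variances[k]).sum() for k in classes])
    return classes[np.argmax(log_lik + np.log(prior))]

print(predict(np.array([2.5, 1.5])))    # expected to be class 1
```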

