
Advanced Statistical Learning

Chapter 4: Risk Minimization II


Bernd Bischl, Julia Moosbauer, Andreas Groll

Department of Statistics – TU Dortmund


Winter term 2020/21
(THEORETICAL) RISK FOR CLASSIFICATION
Let’s have a look at the theoretical risk first. Let y be categorical with g
classes, i.e. Y = {1, ..., g }, and Pxy be the joint distribution on (x, y ).

Goal: Find a model f that minimizes the expected loss over random variables (x, y) ∼ Pxy:

$$\min_{f \in \mathcal{H}} \mathcal{R}(f) = \min_{f \in \mathcal{H}} \mathbb{E}_{xy}\left[L(y, f(\mathbf{x}))\right] = \min_{f \in \mathcal{H}} \int_{\mathcal{X} \times \mathcal{Y}} L(y, f(\mathbf{x})) \, d\mathbb{P}_{xy}.$$

By applying Bayes' theorem,

$$p(\mathbf{x}, y) = p(y \mid \mathbf{x}) \cdot p(\mathbf{x}),$$

the risk can be decomposed as follows.

(THEORETICAL) RISK FOR CLASSIFICATION
In general, we can rewrite the risk as

$$\mathcal{R}(f) = \mathbb{E}_{xy}\left[L(y, f(\mathbf{x}))\right] = \mathbb{E}_{x}\left[\mathbb{E}_{y|x}\left[L(y, f(\mathbf{x}))\right]\right] = \mathbb{E}_{x}\left[\sum_{k \in \mathcal{Y}} L(k, f(\mathbf{x})) \, \mathbb{P}(y = k \mid \mathbf{x} = \mathbf{x})\right],$$

with P(y = k | x = x) being the posterior probability for class k w.r.t. Pxy.

The optimal model for a general loss L(y, f(x)) is

$$\hat{f}(\mathbf{x}) = \arg\min_{f \in \mathcal{H}} \sum_{k \in \mathcal{Y}} L(k, f(\mathbf{x})) \, \mathbb{P}(y = k \mid \mathbf{x} = \mathbf{x}).$$

If we know Pxy perfectly (and hence the posterior probabilities), this is


the loss-optimal classifier.
(THEORETICAL) RISK: DISCRETE CLASSES
Let’s consider for now a classifier h(x) that outputs discrete classes
directly.

Note that in this case, we can compactly write the loss function in terms
of a cost matrix
 
$$C = \begin{pmatrix} c_{11} & c_{12} & \cdots & c_{1g} \\ c_{21} & c_{22} & \cdots & c_{2g} \\ c_{31} & c_{32} & \cdots & c_{3g} \\ \vdots & \vdots & \ddots & \vdots \\ c_{g1} & c_{g2} & \cdots & c_{gg} \end{pmatrix},$$

where c_ij expresses how we weight the error if we classify h(x) = j but y = i.

(THEORETICAL) RISK: DISCRETE CLASSES
The most natural choice for L (y , h(x)) is of course the 0-1-loss that
counts the number of misclassifications

$$L(y, h(\mathbf{x})) = \mathbb{1}_{\{y \neq h(\mathbf{x})\}} = \begin{cases} 1 & \text{if } y \neq h(\mathbf{x}), \\ 0 & \text{if } y = h(\mathbf{x}), \end{cases}$$

which corresponds to a cost matrix that has ones everywhere except for the diagonal:

$$C = \mathbf{1}\mathbf{1}^\top - \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{pmatrix},$$
where 1 is a vector of ones of length g.

(THEORETICAL) RISK: DISCRETE CLASSES
The solution of the above minimization problem is then
$$\begin{aligned} \hat{h}(\mathbf{x}) &= \arg\min_{l \in \mathcal{Y}} \sum_{k \in \mathcal{Y}} L(k, l) \cdot \mathbb{P}(y = k \mid \mathbf{x} = \mathbf{x}) \\ &= \arg\min_{l \in \mathcal{Y}} \sum_{k \neq l} \mathbb{P}(y = k \mid \mathbf{x} = \mathbf{x}) = \arg\min_{l \in \mathcal{Y}} \left(1 - \mathbb{P}(y = l \mid \mathbf{x} = \mathbf{x})\right) \\ &= \arg\max_{l \in \mathcal{Y}} \mathbb{P}(y = l \mid \mathbf{x} = \mathbf{x}), \end{aligned}$$

which corresponds to predicting the most probable class.

We call it the Bayes classifier and its expected loss the Bayes loss or
Bayes error rate for 0-1-loss.

More general structures of cost matrices will be discussed later in a


separate chapter about cost-sensitive classification.
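As a small numerical sketch (the posterior probabilities and cost matrix below are hypothetical), the Bayes decision can be computed directly from the posteriors; for the 0-1 cost matrix it reduces to the most probable class:

```python
import numpy as np

# Hypothetical posterior probabilities P(y = k | x) for g = 3 classes at one point x.
posterior = np.array([0.2, 0.5, 0.3])

# 0-1 loss as a cost matrix: ones everywhere except the diagonal (C = 11^T - I).
g = len(posterior)
C_01 = np.ones((g, g)) - np.eye(g)

# Expected loss of predicting class l: sum_k C[k, l] * P(y = k | x).
expected_loss = posterior @ C_01          # shape (g,), one entry per candidate class l
bayes_pred = np.argmin(expected_loss)     # Bayes decision under the cost matrix

# For the 0-1 cost matrix this coincides with the most probable class.
assert bayes_pred == np.argmax(posterior)
print(bayes_pred)                         # 1 (0-indexed), the class with posterior 0.5
```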

EMPIRICAL RISK MINIMIZATION
As usual, Pxy is not known, so we minimize the empirical risk as an approximation to the theoretical risk:

$$\hat{f} = \arg\min_{f \in \mathcal{H}} \mathcal{R}_{\text{emp}}(f) = \arg\min_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} L\left(y^{(i)}, f\left(\mathbf{x}^{(i)}\right)\right).$$
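A minimal sketch (hypothetical labels and predictions) of the empirical risk as the average point-wise loss, here with the 0-1 loss:

```python
import numpy as np

y = np.array([1, 0, 1, 1, 0])        # hypothetical observed labels
y_hat = np.array([1, 1, 1, 0, 0])    # hypothetical predictions h(x^(i))

# Empirical risk: average of the point-wise losses L(y^(i), h(x^(i))).
emp_risk_01 = np.mean(y != y_hat)    # 0-1 loss -> misclassification rate
print(emp_risk_01)                   # 0.4
```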

Classification Losses

CLASSIFICATION LOSSES

As for regression before, losses measure prediction errors point-wise.
In classification, however, we need to distinguish the different types of classifiers in the notion of the loss function.
Thus, losses can be defined either on hard labels h(x), on (class) scores f(x), or on (class) probabilities π(x).
For multiclass classification, loss functions will be defined on vectors of scores (f_1(x), ..., f_g(x)) or on vectors of probabilities (π_1(x), ..., π_g(x)).

MARGINS AND LOSS PLOTS
Note that for a binary scoring classifier f (x),

h(x) = sign(f (x))


will be the corresponding label. Loss functions are usually defined on
the so-called margin
$$y \cdot f(\mathbf{x}) \; \begin{cases} > 0 & \text{if } y = \operatorname{sign}(f(\mathbf{x})) \text{ (correct classification)}, \\ < 0 & \text{if } y \neq \operatorname{sign}(f(\mathbf{x})) \text{ (misclassification)}. \end{cases}$$

|f(x)| is called the confidence.
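A short sketch (hypothetical scores and labels in {−1, +1}) illustrating margins, hard labels, and confidence:

```python
import numpy as np

y = np.array([+1, -1, +1, -1])             # hypothetical labels in {-1, +1}
scores = np.array([2.3, -0.7, -0.1, 1.5])  # hypothetical scores f(x^(i))

margins = y * scores                        # y * f(x): > 0 iff classified correctly
labels_hat = np.sign(scores)                # hard labels h(x) = sign(f(x))
confidence = np.abs(scores)                 # |f(x)|

print(margins)                              # [ 2.3  0.7 -0.1 -1.5]
print(margins > 0)                          # [ True  True False False]
print(labels_hat == y)                      # same pattern: margin > 0 <=> correct
```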

MARGINS AND LOSS PLOTS
We define the loss plot for binary scoring classifiers as the plot that shows the point-wise loss L(y, f(x)) vs. the margin y · f(x). Large positive values of y · f(x) are good and penalized less.

Example: 0-1 loss

[Loss plot: 0-1 loss L(y, f(x)) vs. margin y · f(x)]

For probabilistic classifiers, loss plots show the loss L(y, π(x)) versus the class probability π(x).

CLASSIFICATION LOSSES: BINARY 0-1-LOSS
$$L(y, f(\mathbf{x})) = \mathbb{1}_{\{y \neq h(\mathbf{x})\}} = \mathbb{1}_{\{y \cdot f(\mathbf{x}) < 0\}}$$

Intuitive, often what we are interested in.
Analytic properties: not even continuous; even for linear f(·), the optimization problem is NP-hard and close to intractable.
[Loss plot: 0-1 loss L(y f(x)) vs. margin y · f(x)]

CLASSIFICATION LOSSES: MULTICLASS 0-1-LOSS
In the multiclass case, the 0-1-loss is just

$$L(y, f(\mathbf{x})) = \mathbb{1}_{\{y \neq h(\mathbf{x})\}}.$$

The optimal constant model w.r.t. the (general multiclass) 0-1-loss is


predicting the most frequent class k ∈ {1, 2, ..., g }
$$h(\mathbf{x}) = \operatorname{mode}\left\{y^{(1)}, \ldots, y^{(n)}\right\}.$$

Proof: Trivial; left as an exercise to the reader.
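A quick numerical check (hypothetical labels) that the mode minimizes the empirical 0-1 risk among constant predictions:

```python
import numpy as np

y = np.array([2, 0, 2, 1, 2, 0, 2])    # hypothetical labels, g = 3 classes

# Empirical 0-1 risk of the constant prediction h(x) = l, for each class l.
risks = [np.mean(y != l) for l in range(3)]
print(np.round(risks, 3))               # [0.714 0.857 0.429]

# The minimizer is the mode, i.e. the most frequent class.
print(np.argmin(risks))                 # 2
print(np.bincount(y).argmax())          # 2
```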

For general hypothesis spaces, however, optimization becomes very


hard or even intractable.

CLASSIFICATION LOSSES: BINARY BRIER SCORE
Let us use the L2-Loss on probabilities to measure the goodness of a
prediction. For the binary case this is:

$$L(y, \pi(\mathbf{x})) = (\pi(\mathbf{x}) - y)^2$$

The Brier score measures the squared difference between the predicted probability π(x) and the actual outcome y.
The Brier score is a proper scoring rule, which means the true underlying predictive distribution is optimal w.r.t. the expected loss → an incentive to predict the “true” predictive distribution.

CLASSIFICATION LOSSES: BINARY BRIER SCORE

[Loss plot: Brier score L(y, π(x)) vs. predicted probability π(x), one curve for y = 0 and one for y = 1]

CLASSIFICATION LOSSES: BINARY BRIER SCORE
The optimal constant model π(x) = θ w.r.t. the Brier score for data D (with y ∈ {0, 1}) is found by

$$\min_{\theta} \mathcal{R}_{\text{emp}}(\theta) = \min_{\theta} \sum_{i=1}^{n} \left(y^{(i)} - \theta\right)^2$$
$$\Leftrightarrow \frac{\partial \mathcal{R}_{\text{emp}}(\theta)}{\partial \theta} = -2 \cdot \sum_{i=1}^{n} \left(y^{(i)} - \theta\right) = 0$$
$$\Rightarrow \hat{\theta} = \frac{1}{n} \sum_{i=1}^{n} y^{(i)}.$$

This is the fraction of class-1 observations in the observed data.
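A small numerical check (hypothetical binary labels) that the class-1 fraction minimizes the empirical Brier risk among constant probability predictions:

```python
import numpy as np

y = np.array([1, 0, 1, 1, 0, 1])                  # hypothetical binary labels
theta_hat = y.mean()                               # fraction of class-1 observations

# Empirical Brier risk of a constant prediction pi(x) = theta.
def emp_brier(theta):
    return np.mean((theta - y) ** 2)

# theta_hat should beat any other constant on a fine grid.
grid = np.linspace(0, 1, 1001)
best_on_grid = grid[np.argmin([emp_brier(t) for t in grid])]
print(theta_hat, best_on_grid)                     # both ~ 0.667
```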

CLASSIFICATION LOSSES: MULTICLASS BRIER SCORE
The (binary) Brier score easily generalizes to the multiclass Brier score, which is defined on a vector of class probabilities (π_1(x), ..., π_g(x)):

$$L(y, f(\mathbf{x})) = \sum_{k=1}^{g} \left(\mathbb{1}_{\{y = k\}} - \pi_k(\mathbf{x})\right)^2.$$

The optimal constant model π(x) = (θ_1, ..., θ_g) (outputting a vector of constant class probabilities) is

$$\hat{\theta}_j = \arg\min_{\theta_j} \mathcal{R}_{\text{emp}}(\theta) = \arg\min_{\theta_j} \sum_{i=1}^{n} \sum_{k=1}^{g} \left(\mathbb{1}_{\{y^{(i)} = k\}} - \theta_k\right)^2$$
$$\Leftrightarrow \frac{\partial \mathcal{R}_{\text{emp}}(\theta)}{\partial \theta_j} = -2 \cdot \sum_{i=1}^{n} \left(\mathbb{1}_{\{y^{(i)} = j\}} - \theta_j\right) = 0$$
$$\Rightarrow \hat{\theta}_j = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}_{\{y^{(i)} = j\}},$$

the fraction of class j observations.
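A sketch (hypothetical labels) of the multiclass Brier score of a constant probability vector, comparing the observed class frequencies against uniform probabilities:

```python
import numpy as np

y = np.array([0, 2, 1, 2, 2, 0])                      # hypothetical labels, g = 3
g = 3
Y_onehot = np.eye(g)[y]                                # rows of indicators 1_{y^(i) = k}

def multiclass_brier(probs):
    """Average multiclass Brier score of a constant probability vector."""
    return np.mean(np.sum((Y_onehot - probs) ** 2, axis=1))

theta_hat = Y_onehot.mean(axis=0)                      # observed class frequencies
print(np.round(theta_hat, 3))                          # [0.333 0.167 0.5]
print(multiclass_brier(theta_hat))                     # ~ 0.611
print(multiclass_brier(np.full(g, 1 / 3)))             # ~ 0.667, i.e. worse
```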
CLASSIFICATION LOSSES: EXPONENTIAL LOSS
Another possible choice for a (binary) loss function, a smooth approximation to the 0-1-loss:
L(y, f(x)) = exp(−y · f(x)), used in AdaBoost.
Convex, differentiable (thus easier to optimize than the 0-1-loss).
The loss increases exponentially for wrong predictions made with high confidence; if the prediction is correct but only with small confidence, the loss is still positive.
No closed-form analytic solution to empirical risk minimization.

[Loss plot: exponential loss L(y f(x)) = exp(−y · f(x)) vs. margin y · f(x)]
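A tiny sketch evaluating the exponential loss on a grid of margins; it upper-bounds the 0-1 loss everywhere:

```python
import numpy as np

margins = np.linspace(-2, 2, 9)           # grid of margin values y * f(x)

exp_loss = np.exp(-margins)               # exponential loss exp(-y * f(x))
zero_one = (margins < 0).astype(float)    # 0-1 loss written in terms of the margin

# The exponential loss upper-bounds the 0-1 loss everywhere and grows
# exponentially for confidently wrong predictions (large negative margins).
assert np.all(exp_loss >= zero_one)
print(np.round(exp_loss, 3))
```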

Maximum Likelihood Estimation vs.
Empirical Risk Minimization

MAXIMUM LIKELIHOOD IN CLASSIFICATION
Let us assume the outputs y^(i) to be Bernoulli distributed, i.e.

$$y^{(i)} \sim \operatorname{Ber}\left(\pi\left(\mathbf{x}^{(i)}\right)\right),$$

and let π(x^(i)) be a model that directly models the probabilities π(x^(i)).

The minimization of the negative log-likelihood is based on

$$-\sum_{i=1}^{n} \log p\left(y^{(i)} \mid \mathbf{x}^{(i)}, \theta\right) = \sum_{i=1}^{n} \left(-y^{(i)} \log\left[\pi\left(\mathbf{x}^{(i)}\right)\right] - \left(1 - y^{(i)}\right) \log\left[1 - \pi\left(\mathbf{x}^{(i)}\right)\right]\right).$$

This gives rise to the following loss function

$$L(y, \pi(\mathbf{x})) = -y \ln(\pi(\mathbf{x})) - (1 - y) \ln(1 - \pi(\mathbf{x})),$$

which we call the Bernoulli loss.
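A short sketch (hypothetical labels and predicted probabilities) evaluating the Bernoulli loss point-wise and as an empirical risk:

```python
import numpy as np

y = np.array([1, 0, 1, 0])                     # hypothetical labels in {0, 1}
pi = np.array([0.9, 0.2, 0.6, 0.7])            # hypothetical predicted probabilities

# Point-wise Bernoulli loss: -y * log(pi) - (1 - y) * log(1 - pi)
bernoulli = -y * np.log(pi) - (1 - y) * np.log(1 - pi)
print(np.round(bernoulli, 3))                  # small for confident correct predictions,
                                               # largest for pi = 0.7 with y = 0
print(bernoulli.mean())                        # empirical risk = average negative log-likelihood
```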

CLASSIFICATION LOSSES: BERNOULLI LOSS

$$L(y, \pi(\mathbf{x})) = -y \ln(\pi(\mathbf{x})) - (1 - y) \ln(1 - \pi(\mathbf{x}))$$


Convex, differentiable (gradient methods can be used), not robust
Also called logarithmic loss or cross-entropy loss (which will be
motivated later)

[Loss plot: Bernoulli loss L(y, π(x)) vs. predicted probability π(x), one curve for y = 0 and one for y = 1]

CLASSIFICATION LOSSES: BERNOULLI LOSS
The constant model π(x) = θ that is optimal w.r.t. the empirical risk is
the fraction of class 1 observations
$$\hat{\pi}(\mathbf{x}) = \frac{1}{n} \sum_{i=1}^{n} y^{(i)}.$$

Proof: Exercise.

CLASSIFICATION LOSSES: LOGARITHMIC LOSS
The Bernoulli loss (the logarithmic loss for two classes) generalizes to the (multiclass) logarithmic loss:

$$L(y, f(\mathbf{x})) = -\sum_{k=1}^{g} \mathbb{1}_{\{y = k\}} \log\left(\pi_k(\mathbf{x})\right),$$

with πk (x) denoting the predicted probability for class k .


The optimal constant model π(x) = (θ_1, ..., θ_g) (outputting a vector of constant class probabilities) is

$$\hat{\theta}_k = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}_{\{y^{(i)} = k\}},$$

the fraction of class k observations.

Proof: Exercise.
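A sketch (hypothetical labels and probability vectors) of the point-wise multiclass log loss and the optimal constant probabilities:

```python
import numpy as np

y = np.array([2, 0, 1, 2])                           # hypothetical labels, g = 3
probs = np.array([[0.1, 0.2, 0.7],                   # hypothetical predicted probability
                  [0.6, 0.3, 0.1],                   # vectors (pi_1(x), ..., pi_g(x)),
                  [0.2, 0.5, 0.3],                   # one row per observation
                  [0.3, 0.3, 0.4]])

# Point-wise multiclass log loss: -log of the probability assigned to the true class.
log_loss = -np.log(probs[np.arange(len(y)), y])
print(np.round(log_loss, 3))                          # [0.357 0.511 0.693 0.916]

# Optimal constant model: the observed class frequencies.
theta_hat = np.bincount(y, minlength=3) / len(y)
print(theta_hat)                                      # [0.25 0.25 0.5]
```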

CLASSIFICATION LOSSES: LOGARITHMIC LOSS
We will see later on how this corresponds to the (multinomial) softmax
regression (see chapter Multiclass classification) and to the
cross-entropy (see chapter Information Theory).

LOGISTIC LOSS
Let Y = {0, 1}
Let us consider a binary score-based model where scores f(x) are transformed into probabilities by the logistic sigmoid function

$$\pi(\mathbf{x}) = s(f(\mathbf{x})) = \frac{1}{1 + \exp(-f(\mathbf{x}))} \in [0, 1].$$
If we plug it into the Bernoulli loss, the negative log-likelihood becomes

$$-\log p(y \mid \mathbf{x}, \theta) = -y \log[\pi(\mathbf{x})] - (1 - y) \log[1 - \pi(\mathbf{x})] = y \log[1 + \exp(-f(\mathbf{x}))] + (1 - y) \log[1 + \exp(f(\mathbf{x}))].$$

For y = 0 and y = 1 this is:

$$y = 0: \quad \log[1 + \exp(f(\mathbf{x}))]$$
$$y = 1: \quad \log[1 + \exp(-f(\mathbf{x}))]$$

LOGISTIC LOSS
If we instead encode the labels as Y = {−1, +1}, we can unify both cases:

$$L(y, f(\mathbf{x})) = \log[1 + \exp(-y \cdot f(\mathbf{x}))].$$

This loss function is called the logistic loss. If we set f(x) = θ_0 + θ^⊤ x, we end up with logistic regression.
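A numerical check (hypothetical scores) that the Bernoulli loss on sigmoid-transformed scores with y ∈ {0, 1} coincides point-wise with the logistic loss in the −1/+1 encoding:

```python
import numpy as np

f = np.array([2.0, -0.5, 0.3, -1.7])      # hypothetical scores f(x)
y01 = np.array([1, 0, 1, 1])              # labels in {0, 1}
ypm = 2 * y01 - 1                         # the same labels in {-1, +1}

pi = 1 / (1 + np.exp(-f))                 # logistic sigmoid: pi(x) = s(f(x))

bernoulli = -y01 * np.log(pi) - (1 - y01) * np.log(1 - pi)
logistic = np.log(1 + np.exp(-ypm * f))

print(np.allclose(bernoulli, logistic))   # True: both losses agree point-wise
```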

CLASSIFICATION LOSSES: LOGISTIC LOSS

$$L(y, f(\mathbf{x})) = \ln(1 + \exp(-y \cdot f(\mathbf{x}))),$$ used in logistic regression.


Convex, differentiable (gradient methods can be used), not robust
Is equivalent to cross entropy loss if scores are transformed to
probabilities by the logistic function

[Loss plot: logistic loss L(y f(x)) vs. margin y · f(x)]

CLASSIFICATION LOSSES: LOGISTIC LOSS
The minimizer of the (theoretical) risk R(f) for the logistic loss function is

$$\hat{f}(\mathbf{x}) = \ln\left(\frac{\mathbb{P}(y = 1 \mid \mathbf{x} = \mathbf{x})}{1 - \mathbb{P}(y = 1 \mid \mathbf{x} = \mathbf{x})}\right).$$

Proof: Exercise.

The minimizer is undefined when P(y = 1 | x = x) = 1 or P(y = 1 | x = x) = 0, but otherwise it is a smooth function that grows as P(y = 1 | x = x) increases and equals 0 when P(y = 1 | x = x) = 0.5.
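A quick numerical check of this result for an arbitrarily chosen posterior probability p = P(y = 1 | x = x): minimizing the expected logistic loss over a grid of scores recovers the log-odds:

```python
import numpy as np

p = 0.8                                         # hypothetical P(y = 1 | x = x)
f_grid = np.linspace(-5, 5, 100001)             # candidate scores f(x)

# Expected logistic loss at x: p * L(+1, f) + (1 - p) * L(-1, f)
risk = p * np.log(1 + np.exp(-f_grid)) + (1 - p) * np.log(1 + np.exp(f_grid))

f_best = f_grid[np.argmin(risk)]
print(f_best, np.log(p / (1 - p)))              # both ~ 1.386, the log-odds of p
```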

BERNOULLI VS. LOGISTIC LOSS

The terms “Bernoulli loss” and “logistic loss” are often used synonymously, which leads to confusion.
In this lecture, we distinguish between the Bernoulli / logarithmic loss, which is motivated from a maximum likelihood perspective, acts on probabilities π(x), and usually requires 0-1 encoding, and the logistic loss, which acts on scores f(x) and usually requires −1/+1 encoding.
They are equivalent iff scores are transformed into probabilities by the logistic function.
The cross-entropy loss (motivated from an information-theoretic perspective) is also equivalent to the logistic loss; see chapter Information Theory.

OUTLOOK
When introducing different learning algorithms, we will come back to
the loss functions introduced in this chapter or even introduce new
ones. For example:
Ordinary Linear Regression: L2-loss
Logistic Regression: Logistic loss
Support Vector Machine Classification: Hinge loss (see lecture Introduction to Statistical Learning)
Support Vector Machine Regression: ε-insensitive loss (see lecture Introduction to Statistical Learning)
AdaBoost: Exponential loss (see Boosting chapter)
Once we know the theory of risk minimization and the properties of loss functions, we can combine model classes and loss functions as needed, or even tailor loss functions to our needs.

