
Advanced Statistical Learning

Chapter 4: Risk Minimization II


Bernd Bischl, Julia Moosbauer, Andreas Groll

Department of Statistics – TU Dortmund


Winter term 2020/21
(THEORETICAL) RISK FOR CLASSIFICATION
Let’s have a look at the theoretical risk first. Let y be categorical with g
classes, i.e. Y = {1, ..., g }, and Pxy be the joint distribution on (x, y ).

Goal: Find a model f that minimizes the expected loss over random variables (x, y) ∼ Pxy:

$$\min_{f \in \mathcal{H}} \mathcal{R}(f) = \min_{f \in \mathcal{H}} \mathbb{E}_{xy}\left[L(y, f(\mathbf{x}))\right] = \min_{f \in \mathcal{H}} \int_{\mathcal{X} \times \mathcal{Y}} L(y, f(\mathbf{x})) \, d\mathbb{P}_{xy}.$$

By applying Bayes' theorem,

$$p(\mathbf{x}, y) = p(y \mid \mathbf{x}) \cdot p(\mathbf{x}),$$

the risk can be decomposed as follows.

(THEORETICAL) RISK FOR CLASSIFICATION
In general, we can rewrite the risk as

$$\mathcal{R}(f) = \mathbb{E}_{xy}\left[L(y, f(\mathbf{x}))\right] = \mathbb{E}_{x}\left[\mathbb{E}_{y|x}\left[L(y, f(\mathbf{x}))\right]\right] = \mathbb{E}_{x}\left[\sum_{k \in \mathcal{Y}} L(k, f(\mathbf{x})) \, \mathbb{P}(y = k \mid \mathbf{x} = \mathbf{x})\right],$$

with P(y = k | x = x) being the posterior probability for class k w.r.t. Pxy.

The optimal model for a general loss L(y, f(x)) is

$$\hat{f}(\mathbf{x}) = \arg\min_{f \in \mathcal{H}} \sum_{k \in \mathcal{Y}} L(k, f(\mathbf{x})) \, \mathbb{P}(y = k \mid \mathbf{x} = \mathbf{x}).$$

If we know Pxy perfectly (and hence the posterior probabilities), this is


the loss-optimal classifier.
(THEORETICAL) RISK: DISCRETE CLASSES
Let’s consider for now a classifier h(x) that outputs discrete classes
directly.

Note that in this case, we can compactly write the loss function in terms
of a cost matrix
 
$$C = \begin{pmatrix} c_{11} & c_{12} & \cdots & c_{1g} \\ c_{21} & c_{22} & \cdots & c_{2g} \\ c_{31} & c_{32} & \cdots & c_{3g} \\ \vdots & \vdots & \ddots & \vdots \\ c_{g1} & c_{g2} & \cdots & c_{gg} \end{pmatrix},$$

where c_ij expresses how we weight the error if we classify h(x) = j but y = i.

(THEORETICAL) RISK: DISCRETE CLASSES
The most natural choice for L (y , h(x)) is of course the 0-1-loss that
counts the number of misclassifications

$$L(y, h(\mathbf{x})) = \mathbb{1}_{\{y \neq h(\mathbf{x})\}} = \begin{cases} 1 & \text{if } y \neq h(\mathbf{x}), \\ 0 & \text{if } y = h(\mathbf{x}), \end{cases}$$

which corresponds to a cost matrix that has ones everywhere except for the diagonal:

$$C = \mathbf{1}\mathbf{1}^\top - \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{pmatrix},$$
where 1 is a vector of ones of length g.

(THEORETICAL) RISK: DISCRETE CLASSES
The solution of the above minimization problem is then
$$\begin{aligned} \hat{h}(\mathbf{x}) &= \arg\min_{l \in \mathcal{Y}} \sum_{k \in \mathcal{Y}} L(k, l) \cdot \mathbb{P}(y = k \mid \mathbf{x} = \mathbf{x}) \\ &= \arg\min_{l \in \mathcal{Y}} \sum_{k \neq l} \mathbb{P}(y = k \mid \mathbf{x} = \mathbf{x}) = \arg\min_{l \in \mathcal{Y}} \left(1 - \mathbb{P}(y = l \mid \mathbf{x} = \mathbf{x})\right) \\ &= \arg\max_{l \in \mathcal{Y}} \mathbb{P}(y = l \mid \mathbf{x} = \mathbf{x}), \end{aligned}$$

which corresponds to predicting the most probable class.

We call it the Bayes classifier and its expected loss the Bayes loss or
Bayes error rate for 0-1-loss.

More general structures of cost matrices will be discussed later in a


separate chapter about cost-sensitive classification.
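As a small numerical sketch (the posterior probabilities and cost matrix below are hypothetical), the Bayes decision can be computed directly from the posteriors; for the 0-1 cost matrix it reduces to the most probable class:

```python
import numpy as np

# Hypothetical posterior probabilities P(y = k | x) for g = 3 classes at one point x.
posterior = np.array([0.2, 0.5, 0.3])

# 0-1 loss as a cost matrix: ones everywhere except the diagonal (C = 11^T - I).
g = len(posterior)
C_01 = np.ones((g, g)) - np.eye(g)

# Expected loss of predicting class l: sum_k C[k, l] * P(y = k | x).
expected_loss = posterior @ C_01          # shape (g,), one entry per candidate class l
bayes_pred = np.argmin(expected_loss)     # Bayes decision under the cost matrix

# For the 0-1 cost matrix this coincides with the most probable class.
assert bayes_pred == np.argmax(posterior)
print(bayes_pred)                         # 1 (0-indexed), the class with posterior 0.5
```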

EMPIRICAL RISK MINIMIZATION
As usual, Pxy is not known, so we minimize the empirical risk as an approximation to the theoretical risk:

$$\hat{f} = \arg\min_{f \in \mathcal{H}} \mathcal{R}_{\text{emp}}(f) = \arg\min_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} L\left(y^{(i)}, f\left(\mathbf{x}^{(i)}\right)\right).$$
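A minimal sketch (hypothetical labels and predictions) of the empirical risk as the average point-wise loss, here with the 0-1 loss:

```python
import numpy as np

y = np.array([1, 0, 1, 1, 0])        # hypothetical observed labels
y_hat = np.array([1, 1, 1, 0, 0])    # hypothetical predictions h(x^(i))

# Empirical risk: average of the point-wise losses L(y^(i), h(x^(i))).
emp_risk_01 = np.mean(y != y_hat)    # 0-1 loss -> misclassification rate
print(emp_risk_01)                   # 0.4
```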

Classification Losses

CLASSIFICATION LOSSES

As for regression before, losses measure prediction errors point-wise.
In classification, however, we need to distinguish the different types of classifiers in the notion of the loss function.
Thus, losses can be defined either on hard labels h(x), on (class) scores f(x), or on (class) probabilities π(x).
For multiclass classification, loss functions will be defined on vectors of scores (f_1(x), ..., f_g(x)) or on vectors of probabilities (π_1(x), ..., π_g(x)).

MARGINS AND LOSS PLOTS
Note that for a binary scoring classifier f (x),

h(x) = sign(f (x))


will be the corresponding label. Loss functions are usually defined on
the so-called margin
$$y \cdot f(\mathbf{x}) \; \begin{cases} > 0 & \text{if } y = \operatorname{sign}(f(\mathbf{x})) \text{ (correct classification)}, \\ < 0 & \text{if } y \neq \operatorname{sign}(f(\mathbf{x})) \text{ (misclassification)}. \end{cases}$$

|f(x)| is called the confidence.
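A short sketch (hypothetical scores and labels in {−1, +1}) illustrating margins, hard labels, and confidence:

```python
import numpy as np

y = np.array([+1, -1, +1, -1])             # hypothetical labels in {-1, +1}
scores = np.array([2.3, -0.7, -0.1, 1.5])  # hypothetical scores f(x^(i))

margins = y * scores                        # y * f(x): > 0 iff classified correctly
labels_hat = np.sign(scores)                # hard labels h(x) = sign(f(x))
confidence = np.abs(scores)                 # |f(x)|

print(margins)                              # [ 2.3  0.7 -0.1 -1.5]
print(margins > 0)                          # [ True  True False False]
print(labels_hat == y)                      # same pattern: margin > 0 <=> correct
```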

MARGINS AND LOSS PLOTS
We define the loss plot for binary scoring classifiers as the plot that shows the point-wise loss L(y, f(x)) vs. the margin y · f(x). Large positive values of y · f(x) are good and penalized less.

Example: 0-1 loss

[Loss plot: 0-1 loss L(y, f(x)) vs. margin y · f(x)]

For probabilistic classifiers, loss plots show the loss L(y, π(x)) versus the class probability π(x).

CLASSIFICATION LOSSES: BINARY 0-1-LOSS
$$L(y, f(\mathbf{x})) = \mathbb{1}_{\{y \neq h(\mathbf{x})\}} = \mathbb{1}_{\{y \cdot f(\mathbf{x}) < 0\}}$$

Intuitive, often what we are interested in.
Analytic properties: not even continuous; even for linear f(·), the optimization problem is NP-hard and close to intractable.
[Loss plot: 0-1 loss L(y f(x)) vs. margin y · f(x)]

CLASSIFICATION LOSSES: MULTICLASS 0-1-LOSS
In the multiclass case, the 0-1-loss is just

$$L(y, f(\mathbf{x})) = \mathbb{1}_{\{y \neq h(\mathbf{x})\}}.$$

The optimal constant model w.r.t. the (general multiclass) 0-1-loss is


predicting the most frequent class k ∈ {1, 2, ..., g }
$$h(\mathbf{x}) = \operatorname{mode}\left\{y^{(1)}, \ldots, y^{(n)}\right\}.$$

Proof: Trivial; left as an exercise to the reader.
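A quick numerical check (hypothetical labels) that the mode minimizes the empirical 0-1 risk among constant predictions:

```python
import numpy as np

y = np.array([2, 0, 2, 1, 2, 0, 2])    # hypothetical labels, g = 3 classes

# Empirical 0-1 risk of the constant prediction h(x) = l, for each class l.
risks = [np.mean(y != l) for l in range(3)]
print(np.round(risks, 3))               # [0.714 0.857 0.429]

# The minimizer is the mode, i.e. the most frequent class.
print(np.argmin(risks))                 # 2
print(np.bincount(y).argmax())          # 2
```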

For general hypothesis spaces, however, optimization becomes very


hard or even intractable.

CLASSIFICATION LOSSES: BINARY BRIER SCORE
Let us use the L2-Loss on probabilities to measure the goodness of a
prediction. For the binary case this is:

$$L(y, \pi(\mathbf{x})) = (\pi(\mathbf{x}) - y)^2$$

The Brier score measures the squared difference between the predicted probability π(x) and the actual outcome y.
The Brier score is a proper scoring rule, which means the true underlying predictive distribution is optimal w.r.t. the expected loss → an incentive to predict the “true” predictive distribution.

CLASSIFICATION LOSSES: BINARY BRIER SCORE

[Loss plot: Brier score L(y, π(x)) vs. predicted probability π(x), one curve for y = 0 and one for y = 1]

CLASSIFICATION LOSSES: BINARY BRIER SCORE
The optimal constant model π(x) = θ w.r.t. the Brier score for data D (with y ∈ {0, 1}) is found by

$$\min_{\theta} \mathcal{R}_{\text{emp}}(\theta) = \min_{\theta} \sum_{i=1}^{n} \left(y^{(i)} - \theta\right)^2$$
$$\Leftrightarrow \frac{\partial \mathcal{R}_{\text{emp}}(\theta)}{\partial \theta} = -2 \cdot \sum_{i=1}^{n} \left(y^{(i)} - \theta\right) = 0$$
$$\Rightarrow \hat{\theta} = \frac{1}{n} \sum_{i=1}^{n} y^{(i)}.$$

This is the fraction of class-1 observations in the observed data.
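A small numerical check (hypothetical binary labels) that the class-1 fraction minimizes the empirical Brier risk among constant probability predictions:

```python
import numpy as np

y = np.array([1, 0, 1, 1, 0, 1])                  # hypothetical binary labels
theta_hat = y.mean()                               # fraction of class-1 observations

# Empirical Brier risk of a constant prediction pi(x) = theta.
def emp_brier(theta):
    return np.mean((theta - y) ** 2)

# theta_hat should beat any other constant on a fine grid.
grid = np.linspace(0, 1, 1001)
best_on_grid = grid[np.argmin([emp_brier(t) for t in grid])]
print(theta_hat, best_on_grid)                     # both ~ 0.667
```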

CLASSIFICATION LOSSES: MULTICLASS BRIER SCORE
The (binary) Brier score easily generalizes to the multiclass Brier score, which is defined on a vector of class probabilities (π_1(x), ..., π_g(x)):

$$L(y, f(\mathbf{x})) = \sum_{k=1}^{g} \left(\mathbb{1}_{\{y = k\}} - \pi_k(\mathbf{x})\right)^2.$$

The optimal constant model π(x) = (θ_1, ..., θ_g) (outputting a vector of constant class probabilities) is

$$\hat{\theta}_j = \arg\min_{\theta_j} \mathcal{R}_{\text{emp}}(\theta) = \arg\min_{\theta_j} \sum_{i=1}^{n} \sum_{k=1}^{g} \left(\mathbb{1}_{\{y^{(i)} = k\}} - \theta_k\right)^2$$
$$\Leftrightarrow \frac{\partial \mathcal{R}_{\text{emp}}(\theta)}{\partial \theta_j} = -2 \cdot \sum_{i=1}^{n} \left(\mathbb{1}_{\{y^{(i)} = j\}} - \theta_j\right) = 0$$
$$\Rightarrow \hat{\theta}_j = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}_{\{y^{(i)} = j\}},$$

the fraction of class j observations.
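A sketch (hypothetical labels) of the multiclass Brier score of a constant probability vector, comparing the observed class frequencies against uniform probabilities:

```python
import numpy as np

y = np.array([0, 2, 1, 2, 2, 0])                      # hypothetical labels, g = 3
g = 3
Y_onehot = np.eye(g)[y]                                # rows of indicators 1_{y^(i) = k}

def multiclass_brier(probs):
    """Average multiclass Brier score of a constant probability vector."""
    return np.mean(np.sum((Y_onehot - probs) ** 2, axis=1))

theta_hat = Y_onehot.mean(axis=0)                      # observed class frequencies
print(np.round(theta_hat, 3))                          # [0.333 0.167 0.5]
print(multiclass_brier(theta_hat))                     # ~ 0.611
print(multiclass_brier(np.full(g, 1 / 3)))             # ~ 0.667, i.e. worse
```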
CLASSIFICATION LOSSES: EXPONENTIAL LOSS
Another possible choice for a (binary) loss function, a smooth approximation to the 0-1-loss:
L(y, f(x)) = exp(−y · f(x)), used in AdaBoost.
Convex, differentiable (thus easier to optimize than the 0-1-loss).
The loss increases exponentially for wrong predictions made with high confidence; if the prediction is correct but only with small confidence, the loss is still positive.
No closed-form analytic solution to empirical risk minimization.

[Loss plot: exponential loss L(y f(x)) = exp(−y · f(x)) vs. margin y · f(x)]
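A tiny sketch evaluating the exponential loss on a grid of margins; it upper-bounds the 0-1 loss everywhere:

```python
import numpy as np

margins = np.linspace(-2, 2, 9)           # grid of margin values y * f(x)

exp_loss = np.exp(-margins)               # exponential loss exp(-y * f(x))
zero_one = (margins < 0).astype(float)    # 0-1 loss written in terms of the margin

# The exponential loss upper-bounds the 0-1 loss everywhere and grows
# exponentially for confidently wrong predictions (large negative margins).
assert np.all(exp_loss >= zero_one)
print(np.round(exp_loss, 3))
```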

Maximum Likelihood Estimation vs.
Empirical Risk Minimization

MAXIMUM LIKELIHOOD IN CLASSIFICATION
Let us assume the outputs y^(i) to be Bernoulli distributed, i.e.

$$y^{(i)} \sim \operatorname{Ber}\left(\pi\left(\mathbf{x}^{(i)}\right)\right),$$

and let π(x^(i)) be a model that directly models the probabilities π(x^(i)).

The minimization of the negative log-likelihood is based on

$$-\sum_{i=1}^{n} \log p\left(y^{(i)} \mid \mathbf{x}^{(i)}, \theta\right) = \sum_{i=1}^{n} \left(-y^{(i)} \log\left[\pi\left(\mathbf{x}^{(i)}\right)\right] - \left(1 - y^{(i)}\right) \log\left[1 - \pi\left(\mathbf{x}^{(i)}\right)\right]\right).$$

This gives rise to the following loss function

$$L(y, \pi(\mathbf{x})) = -y \ln(\pi(\mathbf{x})) - (1 - y) \ln(1 - \pi(\mathbf{x})),$$

which we call the Bernoulli loss.
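A short sketch (hypothetical labels and predicted probabilities) evaluating the Bernoulli loss point-wise and as an empirical risk:

```python
import numpy as np

y = np.array([1, 0, 1, 0])                     # hypothetical labels in {0, 1}
pi = np.array([0.9, 0.2, 0.6, 0.7])            # hypothetical predicted probabilities

# Point-wise Bernoulli loss: -y * log(pi) - (1 - y) * log(1 - pi)
bernoulli = -y * np.log(pi) - (1 - y) * np.log(1 - pi)
print(np.round(bernoulli, 3))                  # small for confident correct predictions,
                                               # largest for pi = 0.7 with y = 0
print(bernoulli.mean())                        # empirical risk = average negative log-likelihood
```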

CLASSIFICATION LOSSES: BERNOULLI LOSS

$$L(y, \pi(\mathbf{x})) = -y \ln(\pi(\mathbf{x})) - (1 - y) \ln(1 - \pi(\mathbf{x}))$$


Convex, differentiable (gradient methods can be used), not robust
Also called logarithmic loss or cross-entropy loss (which will be
motivated later)

[Loss plot: Bernoulli loss L(y, π(x)) vs. predicted probability π(x), one curve for y = 0 and one for y = 1]

CLASSIFICATION LOSSES: BERNOULLI LOSS
The constant model π(x) = θ that is optimal w.r.t. the empirical risk is
the fraction of class 1 observations
$$\hat{\pi}(\mathbf{x}) = \frac{1}{n} \sum_{i=1}^{n} y^{(i)}.$$

Proof: Exercise.

CLASSIFICATION LOSSES: LOGARITHMIC LOSS
The Bernoulli loss (the logarithmic loss for two classes) generalizes to the (multiclass) logarithmic loss:

$$L(y, f(\mathbf{x})) = -\sum_{k=1}^{g} \mathbb{1}_{\{y = k\}} \log\left(\pi_k(\mathbf{x})\right),$$

with πk (x) denoting the predicted probability for class k .


The optimal constant model π(x) = (θ_1, ..., θ_g) (outputting a vector of constant class probabilities) is

$$\hat{\theta}_k = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}_{\{y^{(i)} = k\}},$$

the fraction of class k observations.

Proof: Exercise.
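A sketch (hypothetical labels and probability vectors) of the point-wise multiclass log loss and the optimal constant probabilities:

```python
import numpy as np

y = np.array([2, 0, 1, 2])                           # hypothetical labels, g = 3
probs = np.array([[0.1, 0.2, 0.7],                   # hypothetical predicted probability
                  [0.6, 0.3, 0.1],                   # vectors (pi_1(x), ..., pi_g(x)),
                  [0.2, 0.5, 0.3],                   # one row per observation
                  [0.3, 0.3, 0.4]])

# Point-wise multiclass log loss: -log of the probability assigned to the true class.
log_loss = -np.log(probs[np.arange(len(y)), y])
print(np.round(log_loss, 3))                          # [0.357 0.511 0.693 0.916]

# Optimal constant model: the observed class frequencies.
theta_hat = np.bincount(y, minlength=3) / len(y)
print(theta_hat)                                      # [0.25 0.25 0.5]
```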

CLASSIFICATION LOSSES: LOGARITHMIC LOSS
We will see later on how this corresponds to the (multinomial) softmax
regression (see chapter Multiclass classification) and to the
cross-entropy (see chapter Information Theory).

LOGISTIC LOSS
Let Y = {0, 1}
Let us consider a binary score-based model where scores f(x) are transformed into probabilities by the logistic sigmoid function

$$\pi(\mathbf{x}) = s(f(\mathbf{x})) = \frac{1}{1 + \exp(-f(\mathbf{x}))} \in [0, 1].$$
If we plug it into the Bernoulli loss, the negative log-likelihood becomes

$$-\log p(y \mid \mathbf{x}, \theta) = -y \log[\pi(\mathbf{x})] - (1 - y) \log[1 - \pi(\mathbf{x})] = y \log[1 + \exp(-f(\mathbf{x}))] + (1 - y) \log[1 + \exp(f(\mathbf{x}))].$$

For y = 0 and y = 1 this is:

$$y = 0: \quad \log[1 + \exp(f(\mathbf{x}))]$$
$$y = 1: \quad \log[1 + \exp(-f(\mathbf{x}))]$$

LOGISTIC LOSS
If we instead encode the labels as Y = {−1, +1}, we can unify both cases:

$$L(y, f(\mathbf{x})) = \log[1 + \exp(-y \cdot f(\mathbf{x}))].$$

This loss function is called the logistic loss. If we set f(x) = θ_0 + θ^⊤ x, we end up with logistic regression.
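A numerical check (hypothetical scores) that the Bernoulli loss on sigmoid-transformed scores with y ∈ {0, 1} coincides point-wise with the logistic loss in the −1/+1 encoding:

```python
import numpy as np

f = np.array([2.0, -0.5, 0.3, -1.7])      # hypothetical scores f(x)
y01 = np.array([1, 0, 1, 1])              # labels in {0, 1}
ypm = 2 * y01 - 1                         # the same labels in {-1, +1}

pi = 1 / (1 + np.exp(-f))                 # logistic sigmoid: pi(x) = s(f(x))

bernoulli = -y01 * np.log(pi) - (1 - y01) * np.log(1 - pi)
logistic = np.log(1 + np.exp(-ypm * f))

print(np.allclose(bernoulli, logistic))   # True: both losses agree point-wise
```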

CLASSIFICATION LOSSES: LOGISTIC LOSS

$$L(y, f(\mathbf{x})) = \ln(1 + \exp(-y \cdot f(\mathbf{x}))),$$ used in logistic regression.


Convex, differentiable (gradient methods can be used), not robust
Is equivalent to cross entropy loss if scores are transformed to
probabilities by the logistic function

[Loss plot: logistic loss L(y f(x)) vs. margin y · f(x)]

CLASSIFICATION LOSSES: LOGISTIC LOSS
The minimizer of the (theoretical) risk R(f) for the logistic loss function is

$$\hat{f}(\mathbf{x}) = \ln\left(\frac{\mathbb{P}(y = 1 \mid \mathbf{x} = \mathbf{x})}{1 - \mathbb{P}(y = 1 \mid \mathbf{x} = \mathbf{x})}\right).$$

Proof: Exercise.

The minimizer is undefined when P(y = 1 | x = x) = 1 or P(y = 1 | x = x) = 0, but otherwise it is a smooth function that grows as P(y = 1 | x = x) increases and equals 0 when P(y = 1 | x = x) = 0.5.
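A quick numerical check of this result for an arbitrarily chosen posterior probability p = P(y = 1 | x = x): minimizing the expected logistic loss over a grid of scores recovers the log-odds:

```python
import numpy as np

p = 0.8                                         # hypothetical P(y = 1 | x = x)
f_grid = np.linspace(-5, 5, 100001)             # candidate scores f(x)

# Expected logistic loss at x: p * L(+1, f) + (1 - p) * L(-1, f)
risk = p * np.log(1 + np.exp(-f_grid)) + (1 - p) * np.log(1 + np.exp(f_grid))

f_best = f_grid[np.argmin(risk)]
print(f_best, np.log(p / (1 - p)))              # both ~ 1.386, the log-odds of p
```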

BERNOULLI VS. LOGISTIC LOSS

The terms “Bernoulli loss” and “logistic loss” are often used synonymously, which leads to confusion.
In this lecture, we distinguish between the Bernoulli / logarithmic loss, which is motivated from a maximum likelihood perspective, acts on probabilities π(x), and usually requires 0-1 encoding, and the logistic loss, which acts on scores f(x) and usually requires −1/+1 encoding.
They are equivalent iff scores are transformed into probabilities by the logistic function.
The cross-entropy loss (motivated from an information-theoretic perspective) is also equivalent to the logistic loss; see chapter Information Theory.

OUTLOOK
When introducing different learning algorithms, we will come back to
the loss functions introduced in this chapter or even introduce new
ones. For example:
Ordinary Linear Regression: L2-loss
Logistic Regression: Logistic loss
Support Vector Machine Classification: Hinge loss (see lecture Introduction to Statistical Learning)
Support Vector Machine Regression: ε-insensitive loss (see lecture Introduction to Statistical Learning)
AdaBoost: Exponential loss (see Boosting chapter)
Once we know the theory of risk minimization and the properties of loss functions, we can combine model classes and loss functions as needed, or even tailor loss functions to our needs.

