Chapter 04b: Risk Minimization for Classification
Goal: Find a model $f$ that minimizes the expected loss over random variables $(\mathbf{x}, y) \sim \mathbb{P}_{xy}$:

$$\min_{f \in \mathcal{H}} \mathcal{R}(f) = \min_{f \in \mathcal{H}} \mathbb{E}_{xy}\left[L(y, f(\mathbf{x}))\right] = \min_{f \in \mathcal{H}} \int_{\mathcal{X} \times \mathcal{Y}} L(y, f(\mathbf{x})) \, d\mathbb{P}_{xy}.$$
(THEORETICAL) RISK FOR CLASSIFICATION

In general, we can rewrite the risk as

$$\mathcal{R}(f) = \mathbb{E}_{xy}\left[L(y, f(\mathbf{x}))\right] = \mathbb{E}_x\left[\mathbb{E}_{y|x}\left[L(y, f(\mathbf{x}))\right]\right] = \mathbb{E}_x\left[\sum_{k \in \mathcal{Y}} L(k, f(\mathbf{x})) \, \mathbb{P}(y = k \mid \mathbf{x} = \mathbf{x})\right],$$

so the risk can be minimized point-wise for every $\mathbf{x}$:

$$\hat{f}(\mathbf{x}) = \arg\min_{f \in \mathcal{H}} \sum_{k \in \mathcal{Y}} L(k, f(\mathbf{x})) \, \mathbb{P}(y = k \mid \mathbf{x} = \mathbf{x}).$$
Note that in this case, we can compactly write the loss function in terms of a cost matrix

$$C = \begin{pmatrix}
c_{11} & c_{12} & \dots & c_{1g} \\
c_{21} & c_{22} & \dots & c_{2g} \\
c_{31} & c_{32} & \dots & c_{3g} \\
\vdots & \vdots & \ddots & \vdots \\
c_{g1} & c_{g2} & \dots & c_{gg}
\end{pmatrix},$$

where $c_{ij}$ expresses how we weight the error if we classify $h(\mathbf{x}) = j$ but $y = i$.
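To make the point-wise minimization with a cost matrix concrete, here is a minimal sketch (assuming NumPy is available; the cost values and posterior probabilities are invented for illustration): given the posteriors $\mathbb{P}(y = k \mid \mathbf{x})$ and a cost matrix $C$, the optimal prediction is the class $j$ minimizing the expected cost $\sum_k c_{kj} \, \mathbb{P}(y = k \mid \mathbf{x})$.

```python
import numpy as np

# Hypothetical cost matrix for g = 3 classes: C[i, j] is the cost of
# predicting class j when the true class is i (zero on the diagonal).
C = np.array([[0.0, 1.0, 4.0],
              [1.0, 0.0, 2.0],
              [8.0, 2.0, 0.0]])

# Invented posterior probabilities P(y = k | x) for a single observation x.
posterior = np.array([0.5, 0.3, 0.2])

# Expected cost of predicting class j: sum_k C[k, j] * P(y = k | x).
expected_cost = posterior @ C

j_hat = np.argmin(expected_cost)
print(expected_cost)  # [1.9 0.9 2.6]
print(j_hat)          # 1
```

Note that the most probable class here is class 0, but the asymmetric costs shift the expected-cost-minimizing decision to class 1.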
(THEORETICAL) RISK: DISCRETE CLASSES

The most natural choice for $L(y, h(\mathbf{x}))$ is of course the 0-1-loss that counts the number of misclassifications:

$$L(y, h(\mathbf{x})) = \mathbb{1}_{\{y \neq h(\mathbf{x})\}} = \begin{cases} 1 & \text{if } y \neq h(\mathbf{x}) \\ 0 & \text{if } y = h(\mathbf{x}) \end{cases},$$

which corresponds to a cost matrix that has ones everywhere except on the diagonal:

$$C = \mathbf{1}\mathbf{1}^T - \begin{pmatrix}
1 & 0 & \dots & 0 \\
0 & 1 & \dots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \dots & 1
\end{pmatrix},$$

where $\mathbf{1}$ is a vector of ones of length $g$.
(THEORETICAL) RISK: DISCRETE CLASSES

The solution of the above minimization problem is then

$$\begin{aligned}
\hat{h}(\mathbf{x}) &= \arg\min_{l \in \mathcal{Y}} \sum_{k \in \mathcal{Y}} L(k, l) \cdot \mathbb{P}(y = k \mid \mathbf{x} = \mathbf{x}) \\
&= \arg\min_{l \in \mathcal{Y}} \sum_{k \neq l} \mathbb{P}(y = k \mid \mathbf{x} = \mathbf{x}) = \arg\min_{l \in \mathcal{Y}} \left(1 - \mathbb{P}(y = l \mid \mathbf{x} = \mathbf{x})\right) \\
&= \arg\max_{l \in \mathcal{Y}} \mathbb{P}(y = l \mid \mathbf{x} = \mathbf{x}).
\end{aligned}$$

We call it the Bayes classifier, and its expected loss the Bayes loss or Bayes error rate for the 0-1-loss.
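As a small numerical illustration of the Bayes classifier under the 0-1-loss (a minimal sketch, assuming NumPy; the posterior values are invented), the point-wise prediction is the class with maximal posterior probability, and the point-wise Bayes error is one minus that maximum:

```python
import numpy as np

# Invented posterior probabilities P(y = k | x) for three observations x
# (rows sum to one, g = 3 classes).
posteriors = np.array([[0.7, 0.2, 0.1],
                       [0.4, 0.4, 0.2],
                       [0.1, 0.3, 0.6]])

# Bayes classifier under the 0-1-loss: predict the class with maximal posterior.
h_bayes = np.argmax(posteriors, axis=1)

# Point-wise probability of misclassification; the Bayes error rate is its
# expectation over the distribution of x.
pointwise_error = 1.0 - posteriors.max(axis=1)

print(h_bayes)          # [0 0 2]
print(pointwise_error)  # [0.3 0.6 0.4]
```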
EMPIRICAL RISK MINIMIZATION

As $\mathbb{P}_{xy}$ is usually not known, we minimize the empirical risk as an approximation to the theoretical risk:

$$\hat{f} = \arg\min_{f \in \mathcal{H}} \mathcal{R}_{\text{emp}}(f) = \arg\min_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} L\left(y^{(i)}, f\left(\mathbf{x}^{(i)}\right)\right).$$
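A minimal sketch of the empirical risk (assuming NumPy; the sample and the predictions of some fixed classifier are invented): the empirical risk is simply the average loss over the observed sample, here with the 0-1-loss.

```python
import numpy as np

# Invented sample of n = 8 labels and the corresponding predictions of
# some fixed classifier f evaluated at x^(1), ..., x^(n).
y      = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Empirical risk under the 0-1-loss: average number of misclassifications.
R_emp = np.mean(y != y_pred)
print(R_emp)  # 0.25, i.e. 2 of 8 observations are misclassified
```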
Classification Losses
MARGINS AND LOSS PLOTS

Note that for a binary scoring classifier $f(\mathbf{x})$ with labels encoded as $y \in \{-1, +1\}$, the quantity $y \cdot f(\mathbf{x})$ is called the margin: it is positive if and only if $\mathbf{x}$ is classified correctly, i.e., if $\text{sign}(f(\mathbf{x})) = y$.
MARGINS AND LOSS PLOTS

We define the loss plot for binary scoring classifiers as the plot that shows the point-wise loss $L(y, f(\mathbf{x}))$ vs. the margin $y \cdot f(\mathbf{x})$. Large positive values of $y \cdot f(\mathbf{x})$ are good and penalized less.

Example: 0-1-loss.

[Figure: 0-1-loss $L(y, f(\mathbf{x}))$ plotted against the margin $y \cdot f(\mathbf{x})$.]

For probabilistic classifiers, loss plots show the loss $L(y, \pi(\mathbf{x}))$ versus the class probability $\pi(\mathbf{x})$.
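The loss plots shown on the following slides can be reproduced with a few lines of code; a minimal sketch, assuming NumPy and Matplotlib are available, here for the 0-1-loss as a function of the margin:

```python
import numpy as np
import matplotlib.pyplot as plt

# Grid of margin values y * f(x).
margin = np.linspace(-2, 2, 401)

# 0-1-loss as a function of the margin: 1 if the margin is negative, else 0.
loss_01 = (margin < 0).astype(float)

plt.step(margin, loss_01, where="post")
plt.xlabel("y * f(x)")
plt.ylabel("L(y, f(x))")
plt.title("0-1-loss vs. margin")
plt.show()
```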
CLASSIFICATION LOSSES: BINARY 0-1-LOSS

$$L(y, f(\mathbf{x})) = \mathbb{1}_{\{y \neq h(\mathbf{x})\}} = \mathbb{1}_{\{y \cdot f(\mathbf{x}) < 0\}}$$

[Figure: binary 0-1-loss plotted against the margin $y \cdot f(\mathbf{x})$.]
CLASSIFICATION LOSSES: MULTICLASS 0-1-LOSS

In the multiclass case, the 0-1-loss is just

$$L(y, h(\mathbf{x})) = \mathbb{1}_{\{y \neq h(\mathbf{x})\}}.$$
CLASSIFICATION LOSSES: BINARY BRIER SCORE

Let us use the L2-loss on probabilities to measure the goodness of a prediction. For the binary case with $y \in \{0, 1\}$ this is:

$$L(y, \pi(\mathbf{x})) = (\pi(\mathbf{x}) - y)^2$$
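A short sketch of the binary Brier score (assuming NumPy; the predicted probabilities and labels are invented):

```python
import numpy as np

# Invented predicted probabilities pi(x) and true labels y in {0, 1}.
pi = np.array([0.9, 0.2, 0.6, 0.4])
y  = np.array([1,   0,   0,   1  ])

# Brier score per observation: squared distance between probability and label.
brier = (pi - y) ** 2
print(brier)         # [0.01 0.04 0.36 0.36]
print(brier.mean())  # 0.1925
```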
CLASSIFICATION LOSSES: BINARY BRIER SCORE

[Figure: Brier score $L(y, \pi(\mathbf{x}))$ plotted against the predicted probability $\pi(\mathbf{x})$, separately for $y = 0$ and $y = 1$.]
CLASSIFICATION LOSSES: BINARY BRIER SCORE

The optimal constant model $\pi(\mathbf{x}) = \theta$ w.r.t. the Brier score for data $\mathcal{D}$ with $y \in \{0, 1\}$ is

$$\min_{\theta} \mathcal{R}_{\text{emp}}(\theta) = \min_{\theta} \sum_{i=1}^{n} \left(y^{(i)} - \theta\right)^2$$

$$\Leftrightarrow \quad \frac{\partial \mathcal{R}_{\text{emp}}(\theta)}{\partial \theta} = -2 \cdot \sum_{i=1}^{n} \left(y^{(i)} - \theta\right) = 0$$

$$\Rightarrow \quad \hat{\theta} = \frac{1}{n} \sum_{i=1}^{n} y^{(i)}.$$
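The result can be checked numerically; a minimal sketch (assuming NumPy; the labels are invented) that compares the empirical Brier risk of the mean of $y$ with that of other constants:

```python
import numpy as np

# Invented binary labels; the fraction of ones is 0.625.
y = np.array([1, 0, 1, 1, 0, 1, 0, 1])

def emp_risk(theta, y):
    """Empirical Brier risk of the constant model pi(x) = theta."""
    return np.sum((y - theta) ** 2)

theta_hat = y.mean()
thetas = np.linspace(0, 1, 201)
risks = np.array([emp_risk(t, y) for t in thetas])

print(theta_hat)                 # 0.625
print(thetas[np.argmin(risks)])  # 0.625, the grid search agrees
```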
CLASSIFICATION LOSSES: MULTICLASS BRIER SCORE

The (binary) Brier score easily generalizes to the multiclass Brier score that is defined on a vector of class probabilities $(\pi_1(\mathbf{x}), \dots, \pi_g(\mathbf{x}))$:

$$L(y, f(\mathbf{x})) = \sum_{k=1}^{g} \left(\mathbb{1}_{\{y = k\}} - \pi_k(\mathbf{x})\right)^2.$$

The optimal constant model is again given component-wise by $\hat{\theta}_j = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}_{\{y^{(i)} = j\}}$, being the fraction of class $j$ observations.
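A sketch of the multiclass Brier score for a single observation (assuming NumPy; the probability vector and label are invented):

```python
import numpy as np

# Invented probability vector (pi_1(x), ..., pi_g(x)) for g = 3 classes
# and a true label y in {1, ..., g}.
pi = np.array([0.7, 0.2, 0.1])
y = 2

# One-hot encode the label: 1{y = k} for k = 1, ..., g.
one_hot = (np.arange(1, 4) == y).astype(float)   # [0., 1., 0.]

# Multiclass Brier score: squared distance between one-hot label and pi.
loss = np.sum((one_hot - pi) ** 2)
print(loss)  # 0.49 + 0.64 + 0.01 = 1.14
```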
CLASSIFICATION LOSSES: EXPONENTIAL LOSS

Another possible choice for a (binary) loss function that is a smooth approximation to the 0-1-loss:

$$L(y, f(\mathbf{x})) = \exp(-y \cdot f(\mathbf{x})), \quad \text{used in AdaBoost.}$$

It is convex and differentiable (thus easier to optimize than the 0-1-loss). The loss increases exponentially for wrong predictions made with high confidence; if the prediction is right but with small confidence only, the loss is still positive. There is no closed-form analytic solution to empirical risk minimization.

[Figure: exponential loss plotted against the margin $y \cdot f(\mathbf{x})$.]
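A short sketch comparing the exponential loss with the 0-1-loss on a grid of margins (assuming NumPy; the grid is arbitrary):

```python
import numpy as np

# Grid of margin values y * f(x) for y in {-1, +1}.
margin = np.linspace(-2, 2, 5)

loss_exp = np.exp(-margin)              # exponential loss
loss_01  = (margin < 0).astype(float)   # 0-1-loss for comparison

print(margin)    # [-2. -1.  0.  1.  2.]
print(loss_exp)  # [7.39 2.72 1.   0.37 0.14] (rounded)
print(loss_01)   # [1. 1. 0. 0. 0.]
```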
Maximum Likelihood Estimation vs.
Empirical Risk Minimization
MAXIMUM LIKELIHOOD IN CLASSIFICATION

Let us assume the outputs $y^{(i)}$ to be Bernoulli distributed, i.e.

$$y^{(i)} \sim \text{Ber}(\pi(\mathbf{x}^{(i)})),$$

and let $\pi(\mathbf{x}^{(i)})$ be a model that directly outputs these probabilities. The negative log-likelihood is then

$$-\sum_{i=1}^{n} \log p\left(y^{(i)} \mid \mathbf{x}^{(i)}, \theta\right) = \sum_{i=1}^{n} \left[-y^{(i)} \log \pi\left(\mathbf{x}^{(i)}\right) - \left(1 - y^{(i)}\right) \log\left(1 - \pi\left(\mathbf{x}^{(i)}\right)\right)\right].$$
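A sketch computing this negative log-likelihood for a small invented sample (assuming NumPy):

```python
import numpy as np

# Invented labels y^(i) in {0, 1} and modelled probabilities pi(x^(i)).
y  = np.array([1,    0,    1,    1   ])
pi = np.array([0.80, 0.30, 0.60, 0.95])

# Negative log-likelihood under the Bernoulli assumption.
nll = np.sum(-y * np.log(pi) - (1 - y) * np.log(1 - pi))
print(nll)  # approx. 1.14
```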
CLASSIFICATION LOSSES: BERNOULLI LOSS

The summands of this negative log-likelihood define the Bernoulli loss

$$L(y, \pi(\mathbf{x})) = -y \log(\pi(\mathbf{x})) - (1 - y) \log(1 - \pi(\mathbf{x})).$$

[Figure: Bernoulli loss $L(y, \pi(\mathbf{x}))$ plotted against the predicted probability $\pi(\mathbf{x})$, separately for $y = 0$ and $y = 1$.]
CLASSIFICATION LOSSES: BERNOULLI LOSS

The constant model $\pi(\mathbf{x}) = \theta$ that is optimal w.r.t. the empirical risk is the fraction of class 1 observations:

$$\hat{\pi}(\mathbf{x}) = \frac{1}{n} \sum_{i=1}^{n} y^{(i)}.$$

Proof: Exercise.
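The claimed optimum can also be checked numerically; a minimal sketch (assuming NumPy; the labels are invented) that grid-searches the constant $\theta$ minimizing the empirical Bernoulli risk:

```python
import numpy as np

# Invented binary labels; the fraction of ones is 0.6.
y = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 0])

def emp_risk(theta, y):
    """Empirical risk of the constant model pi(x) = theta under Bernoulli loss."""
    return np.sum(-y * np.log(theta) - (1 - y) * np.log(1 - theta))

thetas = np.linspace(0.01, 0.99, 99)
risks = np.array([emp_risk(t, y) for t in thetas])

print(y.mean())                  # 0.6
print(thetas[np.argmin(risks)])  # ~0.6, the empirical class-1 fraction
```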
CLASSIFICATION LOSSES: LOGARITHMIC LOSS

The Bernoulli loss, i.e., the logarithmic loss for two classes, generalizes to the (multiclass) logarithmic loss:

$$L(y, f(\mathbf{x})) = -\sum_{k=1}^{g} \mathbb{1}_{\{y = k\}} \log(\pi_k(\mathbf{x})).$$

Proof: Exercise.
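A sketch of the multiclass logarithmic loss for a single observation (assuming NumPy; the probabilities are invented):

```python
import numpy as np

# Invented probability vector (pi_1(x), ..., pi_g(x)) for g = 3 classes
# and true label y = 3.
pi = np.array([0.1, 0.2, 0.7])
y = 3

# Multiclass logarithmic loss: -log of the probability of the true class.
one_hot = (np.arange(1, 4) == y).astype(float)
loss = -np.sum(one_hot * np.log(pi))
print(loss)  # -log(0.7), approx. 0.357
```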
CLASSIFICATION LOSSES: LOGARITHMIC LOSS
We will see later on how this corresponds to the (multinomial) softmax
regression (see chapter Multiclass classification) and to the
cross-entropy (see chapter Information Theory).
LOGISTIC LOSS

Let $\mathcal{Y} = \{0, 1\}$ and consider a binary score-based model where scores $f(\mathbf{x})$ are transformed into probabilities by the logistic sigmoid function

$$\pi(\mathbf{x}) = s(f(\mathbf{x})) = \frac{1}{1 + \exp(-f(\mathbf{x}))} \in [0, 1].$$

If we plug this into the Bernoulli loss, the negative log-likelihood becomes

$$L(y, f(\mathbf{x})) = -y \cdot f(\mathbf{x}) + \log\left(1 + \exp(f(\mathbf{x}))\right).$$
LOGISTIC LOSS

If we instead encode the labels as $\mathcal{Y} = \{-1, +1\}$, both cases can be written in the unified form

$$L(y, f(\mathbf{x})) = \log\left[1 + \exp(-y \cdot f(\mathbf{x}))\right].$$

This loss function is called the logistic loss. If we set $f(\mathbf{x}) = \theta_0 + \theta^\top \mathbf{x}$, we end up with logistic regression.
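A minimal sketch (assuming NumPy; scores and labels invented) that checks this equivalence numerically: plugging the sigmoid into the Bernoulli loss with $y \in \{0, 1\}$ gives the same values as the logistic loss with labels recoded to $y \in \{-1, +1\}$.

```python
import numpy as np

def sigmoid(f):
    return 1.0 / (1.0 + np.exp(-f))

# Invented scores f(x) and labels in the {0, 1} encoding.
f = np.array([-2.0, -0.5, 0.3, 1.5])
y01 = np.array([0, 1, 1, 0])

# Bernoulli loss on probabilities pi(x) = s(f(x)).
pi = sigmoid(f)
bernoulli = -y01 * np.log(pi) - (1 - y01) * np.log(1 - pi)

# Logistic loss on scores with labels recoded to {-1, +1}.
ypm = 2 * y01 - 1
logistic = np.log(1 + np.exp(-ypm * f))

print(np.allclose(bernoulli, logistic))  # True
```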
CLASSIFICATION LOSSES: LOGISTIC LOSS

[Figure: logistic loss plotted against the margin $y \cdot f(\mathbf{x})$.]
CLASSIFICATION LOSSES: LOGISTIC LOSS

The minimizer of the (theoretical) risk $\mathcal{R}(f)$ for the logistic loss function is

$$\hat{f}(\mathbf{x}) = \ln \frac{\mathbb{P}(y = 1 \mid \mathbf{x} = \mathbf{x})}{1 - \mathbb{P}(y = 1 \mid \mathbf{x} = \mathbf{x})}.$$

Proof: Exercise.
BERNOULLI VS. LOGISTIC LOSS

The Bernoulli loss on probabilities $\pi(\mathbf{x})$ and the logistic loss on scores $f(\mathbf{x})$ are thus the same loss under two different parametrizations, connected through $\pi(\mathbf{x}) = s(f(\mathbf{x}))$.
OUTLOOK

When introducing different learning algorithms, we will come back to the loss functions introduced in this chapter or even introduce new ones. For example:

Ordinary Linear Regression: L2-loss
Logistic Regression: logistic loss
Support Vector Machine Classification: hinge loss (see lecture Introduction to Statistical Learning)
Support Vector Machine Regression: $\epsilon$-insensitive loss (see lecture Introduction to Statistical Learning)
AdaBoost: exponential loss (see Boosting chapter)

Once we know the theory of risk minimization and the properties of loss functions, we can combine model classes and loss functions as needed or even tailor loss functions to our needs.