
Advanced Statistical Learning

Chapter 4: Risk Minimization I


Bernd Bischl, Julia Moosbauer, Andreas Groll

Department of Statistics – TU Dortmund


Winter term 2020/21
LEARNING = REPRESENTATION + COST + OPTIMIZATION

Assume we decided on a certain hypothesis space H (e.g. the space of linear models) and we are given some data D we can use for training.
How do we get the "best" model f̂?
What does "best" mean? How do we distinguish good models from bad ones? (Cost function)
And how do we get there? (Optimization)

Risk Minimization for Regression

SCENARIO
Given the hypothesis space of linear models, which model will be
returned by a learning algorithm (under “perfect” optimization)?

(Figure: scatter plot of the data points y over x.)

Answer: It depends on how the cost function is specified.

LOSSES: MEASURING ERRORS POINT-WISE

As stated in the introduction chapter, we search for a function f ∈ H that takes x ∈ X as an input and predicts outputs f(x) that are very close to the real targets y ∈ Y:

y ≈ f(x)

We will first measure the “goodness” of a prediction point-wise.


We introduce a function called the loss function

L : Y × R^g → R≥0

to compare prediction and real target, L(y, f(x)).
We require:
L(y, ỹ) ≥ 0 for all y, ỹ ∈ Y (non-negativity)
L(y, ỹ) = 0 ⇔ y = ỹ (optimality)
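As a concrete illustration (a minimal sketch in Python; the function name and the example values are ours, not from the slides), the point-wise squared error satisfies both requirements:

```python
def squared_loss(y, y_pred):
    # L(y, ỹ) = (y − ỹ)²: non-negative, and zero iff the prediction equals the target.
    return (y - y_pred) ** 2

print(squared_loss(3.0, 1.0))  # 4.0 (non-negativity)
print(squared_loss(3.0, 3.0))  # 0.0 (optimality)
```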

LOSSES: MEASURING ERRORS POINT-WISE
Example: L(y, f(x)) = (y − f(x))² (point-wise squared errors)

(Figure: left panel shows the data and fitted values ŷ over x; right panel shows the loss L(y, f(x)) against the residual y − f(x).)
LOSSES, RESIDUALS AND PSEUDO-RESIDUALS

Regression losses usually only depend on the residuals

r := y − f(x)
r^(i) := y^(i) − f(x^(i)).

A loss is called distance-based if
it can be written in terms of the residual, L(y, f(x)) = ψ(r) for some ψ : R → R, and
it is zero iff r = 0: ψ(0) = 0.


A loss is translation-invariant if

L(y + a, f(x) + a) = L(y, f(x)).

LOSSES, RESIDUALS AND PSEUDO-RESIDUALS
A loss is translation-invariant iff it is distance-based:

⇒: If a loss is translation-invariant, then it is also distance-based:
L(y, f(x)) = L(y − y, f(x) − y) = L(0, −r) =: ψ(r).

⇐: The residual r = y − f(x) is translation-invariant, thus any loss that depends on the residual only is also translation-invariant.
We also introduce the notion of pseudo-residuals, which are defined as the negative first derivatives of the loss function w.r.t. f:

r̃ := − ∂L(y, f(x)) / ∂f        r̃^(i) := − ∂L(y^(i), f(x^(i))) / ∂f

(Note that pseudo-residuals are functions of y and f(x).)

We will see later that in the case of the L2-loss, pseudo-residuals correspond to the residuals, hence the name.
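A quick numerical sketch of this claim for the 0.5-scaled squared loss (the helper names, the finite-difference approximation and the example values are our own illustration):

```python
def l2_loss(y, f):
    # 0.5-scaled squared loss; the scaling makes the pseudo-residual equal to the residual.
    return 0.5 * (y - f) ** 2

def pseudo_residual(loss, y, f, eps=1e-6):
    # r̃ = −∂L(y, f)/∂f, approximated by a central finite difference.
    return -(loss(y, f + eps) - loss(y, f - eps)) / (2 * eps)

y, f = 3.0, 1.5
print(pseudo_residual(l2_loss, y, f))  # ≈ 1.5
print(y - f)                           # residual r = 1.5
```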

LOSS PLOTS
We call the plot that shows the point-wise error, i.e. the loss L(y, f(x)) vs. the residual r := y − f(x) (for regression), the loss plot. The pseudo-residual corresponds to the slope of the tangent at the point (y − f(x), L(y, f(x))).
(Figure: loss plot of the L2 loss: L(y, f(x)) plotted against the residual y − f(x).)

We will define a similar plot for classification later on in this chapter.

(THEORETICAL) RISK MINIMIZATION
Let Pxy be the joint distribution on (x, y ) that defines all aspects of the
generating process our data come from.

Goal: Find f ∈ H that minimizes the expected loss over the random variables (x, y) ∼ Pxy:

min_{f∈H} R(f) = min_{f∈H} E[L(y, f(x))] = min_{f∈H} ∫_{X×Y} L(y, f(x)) dPxy.

Example:
Y = R (regression)
L(y, f(x)) = (y − f(x))² (squared errors / L2-loss)
The corresponding risk is

R(f) = E_xy[(y − f(x))²].
(THEORETICAL) RISK MINIMIZATION: L2-LOSS
We calculate the (theoretical) risk for Y = R and the L2-loss. Let us
assume our hypothesis space is not restricted at all and contains all
possible functions H = {f : X → Y}.
By the law of total expectation,

R(f) = E_xy[L(y, f(x))] = E_x[ E_{y|x}[ (y − f(x))² | x = x ] ].

As our hypothesis space is not restricted at all, we can proceed quite "arbitrarily" when constructing our model f̂.
For every input x, we find the "optimal" prediction f̂(x) by minimizing the risk point-wise:

f̂(x) = argmin_c E_{y|x}[(y − c)² | x = x] = E_{y|x}[y | x = x].
(THEORETICAL) RISK MINIMIZATION: L2-LOSS
The last step follows from:

E[(y − c)²] = E[(y − c)²] − (E[y] − c)² + (E[y] − c)²
            = Var[y − c] + (E[y] − c)²
            = Var[y] + (E[y] − c)²

(here E[(y − c)²] − (E[y] − c)² = Var[y − c], since E[y − c] = E[y] − c, and Var[y − c] = Var[y] because shifting by the constant c does not change the variance).

This is obviously minimal for c = E[y].


So for squared loss the best prediction is the conditional
expectation of y given x.
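A small Monte Carlo sketch of this result (the data-generating process y | x = x0 ~ sin(x0) + Gaussian noise and all variable names are illustrative assumptions): the constant c minimizing the approximated point-wise risk is essentially the sample mean of the simulated y values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate y | x = x0 for one fixed input x0 (illustrative data-generating process).
x0 = 1.0
y = np.sin(x0) + rng.normal(scale=0.5, size=10_000)

# Approximate the point-wise risk E[(y − c)² | x = x0] for a grid of constant predictions c.
cs = np.linspace(-1.0, 3.0, 401)
risk = np.array([np.mean((y - c) ** 2) for c in cs])

print(cs[risk.argmin()])  # ≈ the conditional mean E[y | x = x0] = sin(1) ≈ 0.84
print(y.mean())
```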

(THEORETICAL) RISK MINIMIZATION: LIMITATION
Problem: Minimizing R(f) over f is generally not feasible or practical:
Pxy is unknown (if it were known, we could use it directly to construct optimal predictions).
We could estimate Pxy in non-parametric fashion from the data D (i.i.d. drawn from Pxy), e.g. by kernel density estimation, but this really does not scale to higher dimensions (see curse of dimensionality).
We can efficiently estimate Pxy if we place rigorous assumptions on its distributional form, and methods like discriminant analysis work exactly this way. ML usually studies more flexible models.
EMPIRICAL RISK MINIMIZATION
Let's assume that we have a dataset D drawn i.i.d. from Pxy.

An alternative (without directly assuming anything about Pxy) is to approximate R(f) based on D, by means of the empirical risk

R_emp(f) = Σ_{i=1}^{n} L(y^(i), f(x^(i)))

R̄_emp(f) = (1/n) Σ_{i=1}^{n} L(y^(i), f(x^(i)))

(As the factor 1/n does not make a difference in optimization, we will consider R_emp(f) most of the time.)
Note: Remp (f ) is a good approximation for R(f ) only if D is an
unbiased, independent and large enough sample from Pxy .
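A minimal sketch of computing the (unscaled) empirical risk on a toy dataset (the data values, the constant model and the helper name emp_risk are made up for illustration):

```python
import numpy as np

def emp_risk(y, f_x, loss):
    # Unscaled empirical risk R_emp(f): sum of point-wise losses over the data set D.
    return np.sum(loss(y, f_x))

y   = np.array([1.0, 2.5, 3.0, 0.5])   # toy targets
f_x = np.full_like(y, 2.0)             # predictions of a constant model f(x) = 2

squared = lambda y, f: (y - f) ** 2
print(emp_risk(y, f_x, squared))       # 1.0 + 0.25 + 1.0 + 2.25 = 4.5
```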

EMPIRICAL RISK MINIMIZATION
Learning then amounts to empirical risk minimization

f̂ = argmin_{f∈H} R_emp(f).

If f is parameterized by θ ∈ Θ, this becomes:

R_emp(θ) = Σ_{i=1}^{n} L(y^(i), f(x^(i) | θ))
θ̂ = argmin_{θ∈Θ} R_emp(θ)

Thus learning (often) means solving the above optimization problem.


This implies a very tight connection between ML and optimization.
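To make the "learning = optimization" connection concrete, here is a sketch that obtains θ̂ by numerically minimizing R_emp(θ) for a linear model under the L2 loss (the toy data with true coefficients 1 and 2, and the use of a generic optimizer, are assumptions of this illustration):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, size=50)
y = 2 * x + 1 + rng.normal(scale=0.3, size=50)   # toy data: y = 2x + 1 + noise

def remp(theta):
    # Empirical risk of the linear model f(x | θ) = θ0 + θ1·x under the L2 loss.
    return np.sum((y - (theta[0] + theta[1] * x)) ** 2)

theta_hat = minimize(remp, x0=np.zeros(2)).x
print(theta_hat)  # ≈ (1, 2)
```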

EMPIRICAL RISK MINIMIZATION
Why should we care about how to choose the loss function?
Statistical properties of f: the choice of loss implies statistical properties of f like robustness or the implicit error distribution (see later).

Computational / optimization complexity: the complexity of the optimization problem

min_{θ∈Θ} R_emp(θ)

is mainly influenced by the choice of the loss function, e.g.

Smoothness of the objective: some optimization methods require smoothness (e.g. gradient methods).
Uni- or multimodality of the problem: if L(y, f(x)) is convex in its second argument and f(x | θ) is linear in θ, then R_emp(θ) is convex and every local minimum of R_emp(θ) is a global one. If L(y, f(x)) is not convex, R_emp(θ) might have multiple local minima (bad!).
Regression Losses

REGRESSION LOSSES: L2-LOSS
L(y, f(x)) = (y − f(x))²  or  L(y, f(x)) = 0.5 (y − f(x))²

Tries to reduce large residuals (if the residual is twice as large, the loss is 4 times as large), hence outliers in y can become problematic.
Analytic properties: convex, differentiable (the gradient poses no problem in loss minimization).
Residuals and pseudo-residuals coincide: −∂(0.5 (y − f(x))²)/∂f = y − f(x) = r.

(Figure: left panel shows the data and fitted values ŷ over x; right panel shows the L2 loss against the residual y − f(x).)
REGRESSION LOSSES: L2-LOSS
What is the optimal constant model, i.e. f ∈ H = {f(x) = θ | θ ∈ R}, w.r.t. the L2 loss?

f̂ = argmin_{f∈H} R_emp(f)
⇔ θ̂ = argmin_{θ∈R} Σ_{i=1}^{n} (y^(i) − θ)²
REGRESSION LOSSES: L2-LOSS
We find θ̂ analytically by setting the first derivative of R_emp w.r.t. θ to 0:

∂R_emp(θ)/∂θ = −2 Σ_{i=1}^{n} (y^(i) − θ) = 0
⇔ Σ_{i=1}^{n} y^(i) − nθ = 0
⇔ θ̂ = (1/n) Σ_{i=1}^{n} y^(i) =: ȳ.

So the optimal constant model with respect to the L2-loss predicts the
average of observed outcomes f̂ (x ) = ȳ .
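A quick numerical check of this result (toy Gaussian data and a generic scalar optimizer, both only for illustration):

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)
y = rng.normal(loc=3.0, scale=1.0, size=200)   # illustrative sample

# Numerically minimize the empirical L2 risk of the constant model f(x) = θ.
theta_hat = minimize_scalar(lambda theta: np.sum((y - theta) ** 2)).x
print(theta_hat, y.mean())  # both ≈ ȳ
```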

REGRESSION LOSSES: L2-LOSS

(Figure: data with the optimal constant model under the L2 loss.)
REGRESSION LOSSES: L1-LOSS

L(y, f(x)) = |y − f(x)|

More robust than L2, outliers in y are less problematic.
Analytical properties: convex, not differentiable for y = f(x) (optimization becomes harder).

(Figure: left panel shows the data and fitted values ŷ over x; right panel shows the L1 loss against the residual y − f(x).)
REGRESSION LOSSES: L1-LOSS
What is the optimal constant model f(x) = θ w.r.t. the L1 loss?

f̂ = argmin_{f∈H} R_emp(f)
⇔ θ̂ = argmin_{θ∈R} Σ_{i=1}^{n} |y^(i) − θ|
⇔ θ̂ = median({y^(i)})

Proof: Exercise.
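Although the proof is left as an exercise, the result is easy to check numerically (heavy-tailed toy data and a generic scalar optimizer, both only for illustration):

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(3)
y = rng.standard_cauchy(size=201)   # heavy-tailed sample with "outliers"

# Numerically minimize the empirical L1 risk of the constant model f(x) = θ.
theta_hat = minimize_scalar(lambda theta: np.sum(np.abs(y - theta))).x
print(theta_hat, np.median(y))      # both ≈ the sample median
```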

REGRESSION LOSSES: L1-LOSS

(Figure: data with the optimal constant models under the L2 and L1 loss.)

We see that the model fitted w.r.t. L1-loss is less affected by the outlier
(−1.8, 2).

REGRESSION LOSSES: QUANTILE LOSS
L(y, f(x)) = (1 − α)(f(x) − y)   if y < f(x)
L(y, f(x)) = α(y − f(x))          if y ≥ f(x)

Quantile α ∈ (0, 1)
Extension of the L1 loss (α = 0.5 recovers the L1 loss up to the factor 0.5)
Weights positive and negative residuals asymmetrically: α < 0.5 (α > 0.5) puts a higher penalty on over-estimation (under-estimation)
Also known as pinball loss

(Figure: quantile loss as a function of the residual y − f(x) for α = 0.25, 0.50, 0.75.)
REGRESSION LOSSES: QUANTILE LOSS
The pinball loss for α = 0.75 penalizes under-estimation, i.e. positive residuals (y > f(x)), more than over-estimation.

(Figure: left panel shows the data and fitted values ŷ over x; right panel shows the quantile loss (α = 0.75) against the residual y − f(x).)
REGRESSION LOSSES: QUANTILE LOSS
What is the optimal constant model f(x) = θ w.r.t. the quantile loss?

f̂ = argmin_{f∈H} R_emp(f)
⇔ θ̂ = argmin_{θ∈R} { (1 − α) Σ_{y^(i) < θ} |y^(i) − θ| + α Σ_{y^(i) ≥ θ} |y^(i) − θ| }
⇔ θ̂ = Q_α({y^(i)})

where Q_α(·) computes the empirical α-quantile of {y^(i)}, i = 1, ..., n.

Proof: Exercise.
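Again, the result can be checked numerically (the vectorized quantile_loss helper, the skewed toy sample and α = 0.75 are illustrative choices):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def quantile_loss(y, f, alpha):
    r = y - f
    return np.where(r >= 0, alpha * r, (1 - alpha) * (-r))

rng = np.random.default_rng(4)
y = rng.exponential(scale=2.0, size=500)   # skewed toy sample
alpha = 0.75

theta_hat = minimize_scalar(lambda theta: np.sum(quantile_loss(y, theta, alpha))).x
print(theta_hat, np.quantile(y, alpha))    # both ≈ the empirical 0.75-quantile
```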

REGRESSION LOSSES: QUANTILE LOSS
(Figure: data with the optimal constant models under the L2, L1, and quantile (α = 0.25, α = 0.75) loss.)
REGRESSION LOSSES: HUBER LOSS
L(y, f(x)) = ½ (y − f(x))²           if |y − f(x)| ≤ δ
L(y, f(x)) = δ |y − f(x)| − ½ δ²     otherwise

Piecewise combination of the L1 and L2 loss.
Analytic properties: convex, differentiable, robust.
Combines advantages of the L1 and L2 loss: differentiable + robust.

(Figure: left panel shows the data and fitted values ŷ over x; right panel shows the Huber loss against the residual y − f(x).)
REGRESSION LOSSES: HUBER LOSS
The following plot shows the Huber loss for different values of δ .

(Figure: Huber loss as a function of the residual y − f(x) for δ = 0.5, 1, 2.)
REGRESSION LOSSES: HUBER LOSS
What is the optimal constant model f(x) = θ w.r.t. the Huber loss?
There is no closed-form solution.
Numerical optimization methods are necessary.
→ the "optimal" solution can only be approached to a certain degree of accuracy via iterative optimization
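A sketch of this numerical approach for the constant model (the toy sample with outliers, δ = 1 and the use of a generic scalar optimizer are illustrative assumptions):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def huber(r, delta=1.0):
    # Huber loss as a function of the residual r = y − f(x).
    return np.where(np.abs(r) <= delta, 0.5 * r ** 2, delta * np.abs(r) - 0.5 * delta ** 2)

rng = np.random.default_rng(5)
y = np.append(rng.normal(size=100), [8.0, 9.0])   # sample with two outliers

theta_hat = minimize_scalar(lambda theta: np.sum(huber(y - theta))).x
print(theta_hat, y.mean(), np.median(y))          # the Huber fit lies between mean and median
```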

EXCURSUS: NUMERICAL OPTIMIZATION
We are searching for the parameter θ ∈ Θ which minimizes the empirical risk, i.e.

min_{θ∈Θ} R̄_emp(θ) = min_{θ∈Θ} (1/n) Σ_{i=1}^{n} L(y^(i), f(x^(i) | θ))

What if there is no closed-form solution to the expression above (analytical optimization)?

−→ Numerical optimization
NUMERICAL OPTIMIZATION: GRADIENT DESCENT

To find local minima, gradient descent takes steps in the direction of the negative gradient of R_emp(θ) at the current point θ^[j].
The iterative update rule is

θ^[j+1] = θ^[j] − α^[j] · ∇θ R_emp(θ) |_{θ=θ^[j]},

where the step size α^[j] needs to be chosen (fixed, line search, ...).
The negative gradient is the direction of steepest descent.
As it uses the gradient, gradient descent is a first-order iterative optimization algorithm.
NUMERICAL OPTIMIZATION: GRADIENT DESCENT
By using the chain rule, we see that

∇θ R̄_emp(θ) = (1/n) Σ_{i=1}^{n} [ ∂L(y^(i), f(x^(i) | θ)) / ∂f ] · ∇θ f(x^(i) | θ)
            = −(1/n) Σ_{i=1}^{n} r̃^(i) · ∇θ f(x^(i) | θ),

since ∂L(y^(i), f(x^(i) | θ)) / ∂f = −r̃^(i) by definition of the pseudo-residuals.

For risk minimization, the update rule for the parameter θ is

θ^[j+1] ← θ^[j] − α^[j] · (1/n) Σ_{i=1}^{n} ∇θ L(y^(i), f(x^(i) | θ)) |_{θ=θ^[j]}
θ^[j+1] ← θ^[j] + α^[j] · (1/n) Σ_{i=1}^{n} r̃^(i) · ∇θ f(x^(i) | θ) |_{θ=θ^[j]},

where ∇θ is the gradient and α^[j] ∈ [0, 1] is called the "learning rate" in this context.
NUMERICAL OPTIMIZATION: GRADIENT DESCENT
Example: Huber Loss

We can calculate the pseudo-residuals for the Huber loss:

r̃^(i) = −∂/∂f [ ½ (y^(i) − f(x^(i)))² ]        = y^(i) − f(x^(i))              if |y^(i) − f(x^(i))| ≤ δ
r̃^(i) = −∂/∂f [ δ |y^(i) − f(x^(i))| − ½ δ² ]  = δ · sgn(y^(i) − f(x^(i)))     otherwise.

For the constant model f(x) = θ this results in the following update rule:

θ^[j+1] ← θ^[j] + (α^[j] / n) Σ_{i=1}^{n} r̃^(i)

The steps of gradient descent are shown in the following plots. The
green line shows the “optimal” constant model w.r.t. Huber loss, the
black line shows the iterations of gradient descent.
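Below is a minimal Python sketch of exactly this update rule (the toy data, the starting value θ = 0, the fixed learning rate α = 0.5 and δ = 1 are illustrative assumptions, not taken from the slides):

```python
import numpy as np

def huber_pseudo_residuals(y, theta, delta=1.0):
    # r̃^(i) = y^(i) − θ in the quadratic region, δ·sgn(y^(i) − θ) otherwise.
    r = y - theta
    return np.where(np.abs(r) <= delta, r, delta * np.sign(r))

rng = np.random.default_rng(6)
y = np.append(rng.normal(size=50), [6.0])   # toy data with one outlier

theta, alpha = 0.0, 0.5                     # starting value and fixed learning rate
for j in range(100):
    # θ^[j+1] ← θ^[j] + α/n · Σ r̃^(i)
    theta = theta + alpha * huber_pseudo_residuals(y, theta).mean()

print(theta)   # ≈ optimal constant model w.r.t. the Huber loss
```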

NUMERICAL OPTIMIZATION: GRADIENT DESCENT

(Figures: iterations 1, 2 and 3 of gradient descent for the constant model, plotted over the data.)
REGRESSION LOSSES: HUBER LOSS
(Figure: data with the optimal constant models under the L2, L1, quantile (α = 0.25, α = 0.75), and Huber (δ = 1.0) loss.)

We see that the constant model fitted w.r.t. the Huber loss in fact lies between those fitted w.r.t. the L1 and L2 loss.

NUMERICAL OPTIMIZATION: STOCHASTIC GRADIENT DESCENT

Stochastic gradient descent (SGD) is a stochastic approximation of gradient descent.
SGD is applied if Σ_{i=1}^{n} ∇θ L(y^(i), f(x^(i) | θ)) is expensive in terms of evaluations (every summand needs to be evaluated).
SGD approximates the gradient using just a single, randomly drawn observation i, leading to a simplified update rule:

θ^[j+1] ← θ^[j] − α^[j] ∇θ L(y^(i), f(x^(i) | θ)) |_{θ=θ^[j]}

The sequence of parameters {θ^[1], θ^[2], ...} is stochastic since it depends on the randomly drawn observation in every step.
NUMERICAL OPTIMIZATION: MINI-BATCH GRADIENT DESCENT

Stochastic gradient descent is computationally cheap compared to standard gradient descent, but might be very noisy.
A trade-off between standard gradient descent (which uses all observations to compute the gradient) and stochastic gradient descent (which uses one observation to approximate the gradient) is mini-batch gradient descent.
Mini-batch gradient descent uses a set of randomly drawn observations I ⊂ {1, 2, ..., n} to approximate the gradient:

θ^[j+1] ← θ^[j] − α^[j] Σ_{i∈I} ∇θ L(y^(i), f(x^(i) | θ)) |_{θ=θ^[j]}

For further details on multivariate optimization, see the lecture Computational Statistics.
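A minimal mini-batch gradient descent sketch for a linear model under the L2 loss (the toy data, the batch size of 32, the fixed learning rate and the division of the summed gradient by the batch size are all illustrative choices; batch size 1 recovers plain SGD):

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.uniform(-2, 2, size=1000)
y = 2 * x + 1 + rng.normal(scale=0.3, size=1000)   # toy data: y = 2x + 1 + noise

theta = np.zeros(2)            # (intercept, slope) of the linear model
alpha, batch_size = 0.05, 32   # fixed learning rate and batch size

for j in range(2000):
    I = rng.choice(len(y), size=batch_size, replace=False)     # random mini-batch I
    resid = y[I] - (theta[0] + theta[1] * x[I])
    grad = -2 * np.array([resid.sum(), (resid * x[I]).sum()])  # ∇θ of the summed L2 loss over I
    theta = theta - alpha * grad / batch_size                  # dividing by |I| only rescales α

print(theta)   # ≈ (1, 2); batch_size = 1 recovers plain SGD
```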

REGRESSION LOSSES: EPSILON-INSENSITIVE LOSS

L(y, f(x)) = 0                  if |y − f(x)| ≤ ε
L(y, f(x)) = |y − f(x)| − ε     otherwise,   with ε ∈ R+

Modification of the L1 loss: errors below ε are accepted without penalty.
Properties: convex and not differentiable for y − f(x) ∈ {−ε, ε}.
No closed-form solution.
(Figure: left panel shows the data and fitted values ŷ over x; right panel shows the ε-insensitive loss against the residual y − f(x).)
REGRESSION LOSSES: EPSILON-INSENSITIVE LOSS

(Figure: data with the optimal constant models under the L2, L1, quantile (α = 0.25, α = 0.75), Huber (δ = 1.0), and ε-insensitive (ε = 1, ε = 10) loss.)
REGRESSION LOSSES: LOG-BARRIER LOSS
L(y, f(x)) = −a² · log(1 − (|y − f(x)| / a)²)    if |y − f(x)| ≤ a
L(y, f(x)) = ∞                                    if |y − f(x)| > a

Behaves like the L2 loss for small residuals.
We use this if we do not want residuals larger than a at all.
No guarantee that the risk minimization problem has a solution.
The plot below shows the log-barrier loss for a = 2.
(Figure: log-barrier loss as a function of the residual y − f(x) for a = 2.)
COMPARISON OF LOSS FUNCTIONS

(Figure: comparison of the L2, L1, quantile (α = 0.25, α = 0.75), Huber (δ = 2.0), and ε-insensitive (ε = 2.0) losses as functions of the residual y − f(x).)
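The losses compared above are easy to implement point-wise; the following sketch evaluates them on a few residual values (the vectorized helper names and the parameter values, chosen to mirror the figure, are ours):

```python
import numpy as np

def l2(r):                    return r ** 2
def l1(r):                    return np.abs(r)
def quantile(r, alpha):       return np.where(r >= 0, alpha * r, (1 - alpha) * (-r))
def huber(r, delta):          return np.where(np.abs(r) <= delta, 0.5 * r ** 2, delta * np.abs(r) - 0.5 * delta ** 2)
def eps_insensitive(r, eps):  return np.maximum(np.abs(r) - eps, 0.0)

r = np.linspace(-10, 10, 5)   # a few residuals y − f(x)
for values in (l2(r), l1(r), quantile(r, 0.75), huber(r, 2.0), eps_insensitive(r, 2.0)):
    print(values)
```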
Maximum Likelihood Estimation vs.
Empirical Risk Minimization

MAXIMUM LIKELIHOOD
Let us approach regression from the maximum likelihood perspective. Assume that our data is generated as follows:

y^(i) = f_true(x^(i)) + ε^(i),   i = 1, 2, ..., n,

where ε ∼ p is a random variable independent of x and with zero mean E[ε] = 0.
We see: y = f_true(x) + ε follows a distribution with mean f_true(x). We will denote the corresponding density by p(y | x, θ).
MAXIMUM LIKELIHOOD
Given data D drawn i.i.d. from p(y | x, θ), the maximum-likelihood principle is to maximize

L(θ) = Π_{i=1}^{n} p(y^(i) | x^(i), θ)

or to minimize the negative log-likelihood:

−ℓ(θ) = − Σ_{i=1}^{n} log p(y^(i) | x^(i), θ)

Now let us define a new loss as:

L(y, f(x | θ)) := − log p(y | x, θ)
MAXIMUM LIKELIHOOD
The empirical risk is then defined as

R_emp(θ) = Σ_{i=1}^{n} L(y^(i), f(x^(i) | θ))

Then the maximum-likelihood estimator θ̂, which we obtain by maximizing L(θ), is identical to the loss-minimal θ̂ we obtain by minimizing R_emp(θ).
This means that for every error distribution we can derive an equivalent loss function which leads to the same point estimator for the parameter vector θ.
We can even disregard multiplicative or additive constants in the loss, as they do not change the minimizer.
Attention: The other way around does not always work: we cannot derive a pdf or error distribution corresponding to every loss function – the hinge loss is a prominent example.
GAUSSIAN ERRORS - L2-LOSS
Let's assume that the errors are Gaussian, i.e. ε^(i) ∼ N(0, σ²). Then

y = f_true(x) + ε ∼ N(f_true(x), σ²).

The likelihood is then

L(θ) = Π_{i=1}^{n} p(y^(i) | f(x^(i) | θ), σ²) ∝ exp( − Σ_{i=1}^{n} (y^(i) − f(x^(i) | θ))² / (2σ²) ).

It's easy to see that minimizing the negative log-likelihood is equivalent to the L2-loss minimization approach, since

−ℓ(θ) ∝ Σ_{i=1}^{n} (y^(i) − f(x^(i) | θ))².
Note: We use ∝ as “proportional to ... up to multiplicative and additive constants”.
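A short numerical illustration of this equivalence (the toy data, the fixed σ = 0.5 and the linear model f(x | θ) = θ0 + θ1·x are assumptions of the sketch): minimizing the Gaussian negative log-likelihood and minimizing the L2 empirical risk return the same θ̂.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(8)
x = rng.uniform(-2, 2, size=100)
y = 2 * x + 1 + rng.normal(scale=0.5, size=100)   # Gaussian errors

def neg_log_lik(theta, sigma=0.5):
    # Negative Gaussian log-likelihood of the linear model f(x | θ) = θ0 + θ1·x.
    resid = y - (theta[0] + theta[1] * x)
    return 0.5 * np.sum(resid ** 2) / sigma ** 2 + len(y) * np.log(sigma * np.sqrt(2 * np.pi))

def l2_risk(theta):
    return np.sum((y - (theta[0] + theta[1] * x)) ** 2)

print(minimize(neg_log_lik, x0=np.zeros(2)).x)   # same minimizer ...
print(minimize(l2_risk, x0=np.zeros(2)).x)       # ... up to numerical tolerance
```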

LAPLACE ERRORS - L1-LOSS
If we assume ε^(i) to follow a Laplace distribution with density

(1 / (2b)) · exp(−|x − µ| / b),

the same calculation yields

L(θ) ∝ exp( − Σ_{i=1}^{n} |y^(i) − f(x^(i) | θ)| / b )

or minimization of

−ℓ(θ) ∝ Σ_{i=1}^{n} |y^(i) − f(x^(i) | θ)|.
OTHER ERROR DISTRIBUTIONS

We can plot the "empirical" error distribution, i.e. the distribution of the residuals, after fitting a model w.r.t. the L1 / L2 loss.
(Figures: histograms of the residuals after fitting w.r.t. the L1 loss and w.r.t. the L2 loss.)
OTHER ERROR DISTRIBUTIONS
There are losses that do not correspond to “real” error densities.
However, intuitively, we see that a certain type of loss function
defines how residuals will be distributed.
(Figures: histograms of the residuals after fitting w.r.t. the Huber loss and w.r.t. the ε-insensitive loss.)
OTHER ERROR DISTRIBUTIONS
(Figure: histogram of the residuals after fitting w.r.t. the quantile loss (α = 0.75).)
