Risk Minimization for Regression
Bernd Bischl, Julia Moosbauer, Andreas Groll. Winter term 2020/21, Advanced Statistical Learning.
SCENARIO
Given the hypothesis space of linear models, which model will be
returned by a learning algorithm (under “perfect” optimization)?
[Figure: scatter plot of the example data (x, y)]
LOSSES: MEASURING ERRORS POINT-WISE
In regression, we predict $y \approx f(x)$. To measure the quality of a prediction point-wise, we use a loss function
$$L : \mathcal{Y} \times \mathbb{R}^g \to \mathbb{R}_{\geq 0}$$
that compares prediction and true target via $L(y, f(x))$. We require:
- $L(y, \tilde{y}) \geq 0$ for all $y, \tilde{y} \in \mathcal{Y}$ (non-negativity)
- $L(y, \tilde{y}) = 0 \Leftrightarrow y = \tilde{y}$ (optimality)
LOSSES: MEASURING ERRORS POINT-WISE
Example: $L(y, f(x)) = (y - f(x))^2$ (point-wise squared error)
[Figure: left, example data $y$ and prediction $\hat{y}$ over $x$; right, the squared-error loss $L(y, f(x))$ plotted against the residual $y - f(x)$]
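For illustration (not part of the original slides), the point-wise squared errors of a few made-up predictions can be computed directly:

```python
import numpy as np

# Made-up toy values, purely for illustration.
y   = np.array([3.1, 4.0, 5.2, 6.8])   # observed targets
f_x = np.array([3.0, 4.5, 5.0, 6.0])   # model predictions f(x)

# Point-wise squared errors L(y, f(x)) = (y - f(x))^2
losses = (y - f_x) ** 2
print(losses)  # one non-negative value per observation
```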
LOSSES, RESIDUALS AND PSEUDO-RESIDUALS
The residual of an observation is the difference between true and predicted outcome:
$$r := y - f(x), \qquad r^{(i)} := y^{(i)} - f\left(x^{(i)}\right).$$
The pseudo-residual is the negative derivative of the loss w.r.t. the prediction, $\tilde{r} := -\frac{\partial L(y, f(x))}{\partial f(x)}$.
LOSSES, RESIDUALS AND PSEUDO-RESIDUALS
A loss is translation-invariant iff it is distance-based, i.e. it can be written as a function of the residual only.
$\Rightarrow$: If a loss is translation-invariant, i.e. $L(y + a, f(x) + a) = L(y, f(x))$ for all $a$, then choosing $a = -y$ shows that it is also distance-based:
$$L(y, f(x)) = L(y - y, f(x) - y) = L(0, -r) =: \psi(r).$$
LOSS PLOTS
We call the plot of the point-wise loss $L(y, f(x))$ against the residual $r := y - f(x)$ (for regression) a loss plot. The pseudo-residual corresponds to the slope of the tangent at the point $(y - f(x), L(y, f(x)))$.
[Figure: loss plot of the squared-error loss, a parabola in $y - f(x)$]
(THEORETICAL) RISK MINIMIZATION
Let $\mathbb{P}_{xy}$ be the joint distribution of $(x, y)$ that defines all aspects of the data-generating process. The (theoretical) risk of a hypothesis $f$ is the expected loss
$$\mathcal{R}(f) = \mathbb{E}_{xy}\left[L(y, f(x))\right].$$
Example:
- $\mathcal{Y} = \mathbb{R}$ (regression)
- $L(y, f(x)) = (y - f(x))^2$ (squared error / L2 loss)
The corresponding risk is
$$\mathcal{R}(f) = \mathbb{E}_{xy}\left[(y - f(x))^2\right].$$
(THEORETICAL) RISK MINIMIZATION: L2-LOSS
We calculate the (theoretical) risk for $\mathcal{Y} = \mathbb{R}$ and the L2 loss. Let us assume our hypothesis space is not restricted at all and contains all possible functions, $\mathcal{H} = \{f : \mathcal{X} \to \mathcal{Y}\}$.
By the law of total expectation, $\mathcal{R}(f) = \mathbb{E}_x\left[\mathbb{E}_{y|x}\left[(y - f(x))^2 \mid x\right]\right]$, so we can minimize point-wise over constant predictions $c = f(x)$:
$$\hat{f}(x) = \arg\min_c \mathbb{E}_{y|x}\left[(y - c)^2 \mid x = x\right] = \mathbb{E}_{y|x}\left[y \mid x = x\right].$$
The risk minimizer for the L2 loss is thus the conditional mean.
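A small simulation sketch (the conditional distribution of $y \mid x$ is assumed here purely for illustration) confirms this: the average squared error over samples of $y \mid x$ is minimized near the conditional mean.

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulate y | x = x0 (here assumed to be N(2, 1) for illustration).
y = rng.normal(loc=2.0, scale=1.0, size=100_000)

# Approximate E[(y - c)^2 | x = x0] for a grid of constant predictions c.
cs = np.linspace(0.0, 4.0, 401)
risk = [np.mean((y - c) ** 2) for c in cs]

print(cs[np.argmin(risk)], y.mean())  # both close to E[y | x = x0] = 2
```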
(THEORETICAL) RISK MINIMIZATION: L2-LOSS
The last step follows from the decomposition
$$\mathbb{E}\left[(y - c)^2\right] = \underbrace{\mathbb{E}\left[(y - c)^2\right] - (\mathbb{E}[y] - c)^2}_{\operatorname{Var}[y - c] \,=\, \operatorname{Var}[y]} + (\mathbb{E}[y] - c)^2,$$
which is minimized by $c = \mathbb{E}[y]$, since the variance term does not depend on $c$.
(THEORETICAL) RISK MINIMIZATION: LIMITATION
Problem: Minimizing $\mathcal{R}(f)$ over $f$ is generally neither feasible nor practical:
- $\mathbb{P}_{xy}$ is unknown (if it were known, we could use it directly to construct optimal predictions).
- We could estimate $\mathbb{P}_{xy}$ in a non-parametric fashion from the data $\mathcal{D}$ (drawn i.i.d. from $\mathbb{P}_{xy}$), e.g. by kernel density estimation, but this does not scale to higher dimensions (see curse of dimensionality).
- We can efficiently estimate $\mathbb{P}_{xy}$ if we place strong assumptions on its distributional form, and methods like discriminant analysis work exactly this way. ML usually studies more flexible models.
EMPIRICAL RISK MINIMIZATION
Let us assume that our dataset $\mathcal{D}$ is drawn i.i.d. from $\mathbb{P}_{xy}$. We can then approximate the theoretical risk by the empirical risk
$$\mathcal{R}_{\text{emp}}(f) = \sum_{i=1}^n L\left(y^{(i)}, f\left(x^{(i)}\right)\right)$$
or its averaged version
$$\bar{\mathcal{R}}_{\text{emp}}(f) = \frac{1}{n} \sum_{i=1}^n L\left(y^{(i)}, f\left(x^{(i)}\right)\right).$$
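In code, the empirical risk is simply a sum or mean of point-wise losses. A minimal sketch, with made-up data:

```python
import numpy as np

def emp_risk(y, f_x, loss, average=False):
    """Empirical risk: sum (or mean) of the point-wise losses L(y_i, f(x_i))."""
    losses = loss(y, f_x)
    return losses.mean() if average else losses.sum()

l2 = lambda y, f_x: (y - f_x) ** 2  # squared-error loss

y   = np.array([1.0, 2.0, 3.0])
f_x = np.array([1.5, 2.0, 2.5])
print(emp_risk(y, f_x, l2))                # R_emp
print(emp_risk(y, f_x, l2, average=True))  # averaged version
```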
EMPIRICAL RISK MINIMIZATION
Learning then amounts to empirical risk minimization:
$$\mathcal{R}_{\text{emp}}(\theta) = \sum_{i=1}^n L\left(y^{(i)}, f\left(x^{(i)} \mid \theta\right)\right)$$
$$\hat{\theta} = \arg\min_{\theta \in \Theta} \mathcal{R}_{\text{emp}}(\theta)$$
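A hedged sketch of ERM in practice, assuming a simple linear model f(x | theta) = theta_0 + theta_1 * x (an example model, not from the slides) and scipy's generic optimizer:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, size=50)
y = 0.5 + 1.5 * x + rng.normal(scale=0.3, size=50)

def remp(theta):
    # Empirical risk of the linear model f(x | theta) = theta[0] + theta[1] * x
    # under the L2 loss.
    f_x = theta[0] + theta[1] * x
    return np.sum((y - f_x) ** 2)

theta_hat = minimize(remp, x0=np.zeros(2)).x
print(theta_hat)  # close to the true parameters (0.5, 1.5)
```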
EMPIRICAL RISK MINIMIZATION
Why should we care about how to choose the loss function?
- Statistical properties of $f$: the choice of loss determines statistical properties of the fitted $f$, such as robustness to outliers and the implicit error distribution (see later).
- Computational properties: the choice of loss determines the difficulty of the resulting optimization problem.
Regression Losses
REGRESSION LOSSES: L2-LOSS
$$L(y, f(x)) = (y - f(x))^2 \quad \text{or} \quad L(y, f(x)) = 0.5\,(y - f(x))^2$$
[Figure: left, example data $y$ and prediction $\hat{y}$; right, the L2 loss against the residual $y - f(x)$]
REGRESSION LOSSES: L2-LOSS
What is the optimal constant model, i.e.
$$f \in \mathcal{H} = \{f(x) = \theta \mid \theta \in \mathbb{R}\},$$
w.r.t. the L2 loss?
REGRESSION LOSSES: L2-LOSS
We calculate $\hat{\theta}$ analytically by setting the first derivative of $\mathcal{R}_{\text{emp}}$ w.r.t. $\theta$ to 0:
$$\frac{\partial \mathcal{R}_{\text{emp}}(\theta)}{\partial \theta} = -2 \sum_{i=1}^n \left(y^{(i)} - \theta\right) \overset{!}{=} 0$$
$$\sum_{i=1}^n y^{(i)} - n\theta = 0$$
$$\hat{\theta} = \frac{1}{n} \sum_{i=1}^n y^{(i)} =: \bar{y}.$$
So the optimal constant model w.r.t. the L2 loss predicts the average of the observed outcomes, $\hat{f}(x) = \bar{y}$.
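A quick numerical sanity check (data made up): minimizing the L2 empirical risk over constants recovers the sample mean.

```python
import numpy as np
from scipy.optimize import minimize_scalar

y = np.array([2.0, 3.5, 4.0, 10.0])  # made-up observations
res = minimize_scalar(lambda theta: np.sum((y - theta) ** 2))
print(res.x, y.mean())  # both give the mean 4.875
```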
REGRESSION LOSSES: L2-LOSS
[Figure: constant model $f(x)$ fitted w.r.t. L2 loss on the example data]
REGRESSION LOSSES: L1-LOSS
$$L(y, f(x)) = |y - f(x)|$$
[Figure: left, example data and prediction $\hat{y}$; right, the L1 loss against the residual $y - f(x)$]
REGRESSION LOSSES: L1-LOSS
What is the optimal constant model $f(x) = \theta$ w.r.t. the L1 loss?
$$\hat{\theta} = \arg\min_{\theta \in \mathbb{R}} \sum_{i=1}^n \left|y^{(i)} - \theta\right| \;\Leftrightarrow\; \hat{\theta} = \operatorname{median}\left(y^{(i)}\right)$$
Proof: Exercise.
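Although the proof is left as an exercise, the claim is easy to check numerically (sketch with made-up data; a grid search avoids issues with the non-differentiable objective):

```python
import numpy as np

y = np.array([1.0, 2.0, 3.5, 4.0, 10.0])  # made-up observations, odd n
grid = np.linspace(0.0, 11.0, 1101)       # candidate constants theta
risks = [np.sum(np.abs(y - t)) for t in grid]
print(grid[np.argmin(risks)], np.median(y))  # both 3.5
```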
REGRESSION LOSSES: L1-LOSS
[Figure: constant models fitted w.r.t. L2 and L1 loss on the example data]
We see that the model fitted w.r.t. L1-loss is less affected by the outlier
(−1.8, 2).
REGRESSION LOSSES: QUANTILE LOSS
$$L(y, f(x)) = \begin{cases} (1 - \alpha)\,(f(x) - y), & \text{if } y < f(x) \\ \alpha\,(y - f(x)), & \text{if } y \geq f(x) \end{cases}$$
- Quantile $\alpha \in (0, 1)$
- Extends the L1 loss: for $\alpha = 0.5$ it is proportional to the L1 loss
- Weights positive and negative residuals asymmetrically
- $\alpha < 0.5$ ($\alpha > 0.5$) puts a higher penalty on over-estimation (under-estimation)
- Also known as pinball loss
[Figure: pinball loss for $\alpha = 0.25, 0.50, 0.75$ against the residual $y - f(x)$]
REGRESSION LOSSES: QUANTILE LOSS
The pinball loss for $\alpha = 0.75$ penalizes under-estimation, i.e. positive residuals ($y > f(x)$), more than over-estimation.
[Figure: left, example data and prediction $\hat{y}$; right, the pinball loss ($\alpha = 0.75$) against the residual $y - f(x)$]
REGRESSION LOSSES: QUANTILE LOSS
What is the optimal constant model $f(x) = \theta$ w.r.t. the quantile loss?
$$\hat{f} = \arg\min_{f \in \mathcal{H}} \mathcal{R}_{\text{emp}}(f)$$
$$\Leftrightarrow \hat{\theta} = \arg\min_{\theta \in \mathbb{R}} \left[ (1 - \alpha) \sum_{i:\, y^{(i)} < \theta} \left|y^{(i)} - \theta\right| + \alpha \sum_{i:\, y^{(i)} \geq \theta} \left|y^{(i)} - \theta\right| \right]$$
$$\Leftrightarrow \hat{\theta} = Q_\alpha\left(\left\{y^{(i)}\right\}\right), \text{ the } \alpha\text{-quantile of the observed outcomes.}$$
Proof: Exercise.
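Again, a numerical sanity check (illustrative sketch; data simulated) that the minimizer coincides with the empirical $\alpha$-quantile:

```python
import numpy as np

def pinball(y, f_x, alpha):
    # (1 - alpha)(f - y) for y < f(x), alpha (y - f) for y >= f(x)
    r = y - f_x
    return np.where(r < 0, (1 - alpha) * (-r), alpha * r)

rng = np.random.default_rng(2)
y = rng.normal(size=10_000)

alpha = 0.75
grid = np.linspace(-3, 3, 601)
risks = [pinball(y, t, alpha).sum() for t in grid]
print(grid[np.argmin(risks)], np.quantile(y, alpha))  # both near the 0.75-quantile
```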
REGRESSION LOSSES: QUANTILE LOSS
[Figure: constant models fitted w.r.t. L2, L1, and quantile loss ($\alpha = 0.25, 0.75$) on the example data]
REGRESSION LOSSES: HUBER LOSS
$$L(y, f(x)) = \begin{cases} \frac{1}{2}\,(y - f(x))^2, & \text{if } |y - f(x)| \leq \delta \\ \delta\,|y - f(x)| - \frac{1}{2}\delta^2, & \text{otherwise} \end{cases}$$
[Figure: left, example data and prediction $\hat{y}$; right, the Huber loss against the residual $y - f(x)$]
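A direct implementation of the piecewise definition (a minimal sketch):

```python
import numpy as np

def huber(y, f_x, delta=1.0):
    """Huber loss: quadratic for small residuals, linear for large ones."""
    r = np.abs(y - f_x)
    return np.where(r <= delta, 0.5 * r ** 2, delta * r - 0.5 * delta ** 2)

print(huber(np.array([0.5, 3.0]), 0.0))  # [0.125, 2.5]
```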
REGRESSION LOSSES: HUBER LOSS
The following plot shows the Huber loss for different values of $\delta$.
[Figure: Huber loss for $\delta = 2.0, 1, 0.5$ against the residual $y - f(x)$]
REGRESSION LOSSES: HUBER LOSS
What is the optimal constant model $f(x) = \theta$ w.r.t. the Huber loss?
- There is no closed-form solution.
- Numerical optimization methods are necessary.
- $\rightarrow$ the "optimal" solution can only be approximated to a certain degree of accuracy via iterative optimization.
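For example, the optimal constant under the Huber loss can be approximated with a generic scalar optimizer (sketch; data and $\delta$ chosen for illustration):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def huber(y, f_x, delta):
    r = np.abs(y - f_x)
    return np.where(r <= delta, 0.5 * r ** 2, delta * r - 0.5 * delta ** 2)

y = np.array([2.0, 3.5, 4.0, 10.0])  # made-up data with an outlier
res = minimize_scalar(lambda theta: np.sum(huber(y, theta, delta=2.0)))
print(res.x)  # about 3.83: between the median (3.75) and the mean (4.875)
```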
EXCURSUS: NUMERICAL OPTIMIZATION
We are searching for the parameter $\theta \in \Theta$ which minimizes the empirical risk, i.e.
$$\min_{\theta \in \Theta} \bar{\mathcal{R}}_{\text{emp}}(\theta) = \min_{\theta \in \Theta} \frac{1}{n} \sum_{i=1}^n L\left(y^{(i)}, f\left(x^{(i)} \mid \theta\right)\right)$$
$\longrightarrow$ Numerical optimization
NUMERICAL OPTIMIZATION: GRADIENT DESCENT
Gradient descent iteratively moves in the direction of the negative gradient of the (averaged) empirical risk:
$$\theta^{[j+1]} \leftarrow \theta^{[j]} - \alpha^{[j]} \nabla_\theta \bar{\mathcal{R}}_{\text{emp}}(\theta) \Big|_{\theta = \theta^{[j]}},$$
where $\alpha^{[j]}$ denotes the step size (learning rate) in iteration $j$.
NUMERICAL OPTIMIZATION: GRADIENT DESCENT
By using the chain rule we see that
$$\nabla_\theta \bar{\mathcal{R}}_{\text{emp}}(\theta) = \frac{1}{n} \sum_{i=1}^n \underbrace{\frac{\partial L\left(y^{(i)}, f\left(x^{(i)} \mid \theta\right)\right)}{\partial f}}_{=\, -\tilde{r}^{(i)}} \nabla_\theta f\left(x^{(i)} \mid \theta\right)$$
$$= -\frac{1}{n} \sum_{i=1}^n \tilde{r}^{(i)} \cdot \nabla_\theta f\left(x^{(i)} \mid \theta\right)$$
NUMERICAL OPTIMIZATION: GRADIENT DESCENT
Example: Huber loss
For the constant model $f(x) = \theta$ we have $\nabla_\theta f \equiv 1$, which results in the following update rule:
$$\theta^{[j+1]} \leftarrow \theta^{[j]} + \frac{\alpha^{[j]}}{n} \sum_{i=1}^n \tilde{r}^{(i)}$$
The steps of gradient descent are shown in the following plots. The green line shows the "optimal" constant model w.r.t. Huber loss; the black line shows the iterations of gradient descent.
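A minimal gradient descent sketch for this example (data, $\delta$, and step size are assumed for illustration; for the Huber loss the pseudo-residual is the residual clipped at $\pm\delta$):

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.normal(loc=2.0, size=100)
y[:5] += 15.0  # inject a few outliers

delta, alpha = 1.0, 0.5  # Huber parameter and (constant) step size
theta = 0.0              # constant model f(x) = theta
for _ in range(100):
    r = y - theta
    # Huber pseudo-residuals: r for |r| <= delta, delta * sign(r) otherwise.
    r_tilde = np.clip(r, -delta, delta)
    theta += alpha * r_tilde.mean()  # theta <- theta + (alpha / n) * sum(r_tilde)
print(theta)  # robust location estimate, close to 2 despite the outliers
```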
NUMERICAL OPTIMIZATION: GRADIENT DESCENT
[Figures: gradient descent iterations 1-3; the black constant model moves toward the green optimum step by step]
REGRESSION LOSSES: HUBER LOSS
[Figure: constant models fitted w.r.t. L2, L1, quantile ($\alpha = 0.25, 0.75$), and Huber ($\delta = 1.0$) loss]
We see that the constant model fitted w.r.t. Huber loss in fact lies between the models fitted w.r.t. L1 and L2 loss.
NUMERICAL OPTIMIZATION: STOCHASTIC GRADIENT DESCENT
Stochastic gradient descent (SGD) approximates the gradient of the empirical risk by the gradient of the loss at a single, randomly chosen observation $i$:
$$\theta^{[j+1]} \leftarrow \theta^{[j]} - \alpha^{[j]} \nabla_\theta L\left(y^{(i)}, f\left(x^{(i)} \mid \theta\right)\right) \Big|_{\theta = \theta^{[j]}}$$
NUMERICAL OPTIMIZATION: MINI-BATCH GRADIENT DESCENT
Mini-batch gradient descent sums the per-observation gradients over a randomly drawn index set $I \subset \{1, \dots, n\}$ (the mini-batch):
$$\theta^{[j+1]} \leftarrow \theta^{[j]} - \alpha^{[j]} \sum_{i \in I} \nabla_\theta L\left(y^{(i)}, f\left(x^{(i)} \mid \theta\right)\right) \Big|_{\theta = \theta^{[j]}}$$
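A mini-batch SGD sketch for the constant model under the (summed) L2 loss, whose point-wise gradient w.r.t. $\theta$ is $-2(y - \theta)$; all settings are assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
y = rng.normal(loc=3.0, size=1000)

theta, alpha, batch_size = 0.0, 0.05, 32
for _ in range(500):
    batch_idx = rng.choice(len(y), size=batch_size, replace=False)  # index set I
    grad = -2.0 * np.sum(y[batch_idx] - theta)  # gradient of the L2 loss over I
    theta -= alpha / batch_size * grad
print(theta)  # fluctuates around mean(y), roughly 3
```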
REGRESSION LOSSES: EPSILON-INSENSITIVE LOSS
$$L(y, f(x)) = \begin{cases} 0, & \text{if } |y - f(x)| \leq \epsilon \\ |y - f(x)| - \epsilon, & \text{otherwise} \end{cases} \qquad \epsilon \in \mathbb{R}^+$$
- Modification of the L1 loss: errors below $\epsilon$ are accepted without penalty
- Properties: convex; not differentiable for $y - f(x) \in \{-\epsilon, \epsilon\}$
- No closed-form solution for the optimal constant model
[Figure: left, example data and prediction $\hat{y}$; right, the $\epsilon$-insensitive loss against the residual $y - f(x)$]
REGRESSION LOSSES: EPSILON-INSENSITIVE LOSS
[Figure: constant models fitted w.r.t. L2, L1, quantile ($\alpha = 0.25, 0.75$), Huber ($\delta = 1.0$), and $\epsilon$-insensitive ($\epsilon = 1, 10$) loss]
REGRESSION LOSSES: LOG-BARRIER LOSS
$$L(y, f(x)) = \begin{cases} -a^2 \cdot \log\left(1 - \left(\frac{|y - f(x)|}{a}\right)^2\right), & \text{if } |y - f(x)| \leq a \\ \infty, & \text{if } |y - f(x)| > a \end{cases}$$
[Figure: log-barrier loss against the residual $y - f(x)$]
COMPARISON OF LOSS FUNCTIONS
[Figure: L2, L1, quantile ($\alpha = 0.25, 0.75$), Huber ($\delta = 2.0$), and $\epsilon$-insensitive ($\epsilon = 2.0$) loss against the residual $y - f(x)$]
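A figure like this can be reproduced with a short script (sketch; the parameter values follow the legend above):

```python
import numpy as np
import matplotlib.pyplot as plt

r = np.linspace(-10, 10, 500)  # residuals y - f(x)

def pinball(r, alpha):
    return np.where(r < 0, (1 - alpha) * (-r), alpha * r)

def huber(r, delta):
    a = np.abs(r)
    return np.where(a <= delta, 0.5 * a ** 2, delta * a - 0.5 * delta ** 2)

def eps_insensitive(r, eps):
    return np.maximum(np.abs(r) - eps, 0.0)

plt.plot(r, r ** 2, label="L2 loss")
plt.plot(r, np.abs(r), label="L1 loss")
plt.plot(r, pinball(r, 0.25), label="q loss (alpha = 0.25)")
plt.plot(r, pinball(r, 0.75), label="q loss (alpha = 0.75)")
plt.plot(r, huber(r, 2.0), label="h loss (delta = 2.0)")
plt.plot(r, eps_insensitive(r, 2.0), label="eps loss (eps = 2.0)")
plt.ylim(0, 12)
plt.xlabel("y - f(x)")
plt.ylabel("L(y - f(x))")
plt.legend()
plt.show()
```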
Maximum Likelihood Estimation vs. Empirical Risk Minimization
MAXIMUM LIKELIHOOD
Let us approach regression from a maximum likelihood perspective. Assume that our data is generated as
$$y = f(x \mid \theta) + \epsilon,$$
with i.i.d. errors $\epsilon^{(i)}$ that are independent of $x$.
MAXIMUM LIKELIHOOD
Given data $\mathcal{D}$ drawn i.i.d. from $p(y \mid x, \theta)$, the maximum likelihood principle is to maximize
$$\mathcal{L}(\theta) = \prod_{i=1}^n p\left(y^{(i)} \mid x^{(i)}, \theta\right).$$
MAXIMUM LIKELIHOOD
Equivalently, we can minimize the negative log-likelihood $-\ell(\theta) = -\sum_{i=1}^n \log p\left(y^{(i)} \mid x^{(i)}, \theta\right)$. Defining the loss as $L(y, f(x \mid \theta)) := -\log p(y \mid x, \theta)$, this is exactly empirical risk minimization:
$$\mathcal{R}_{\text{emp}}(\theta) = \sum_{i=1}^n L\left(y^{(i)}, f\left(x^{(i)} \mid \theta\right)\right)$$
GAUSSIAN ERRORS - L2-LOSS
Let us assume that the errors are Gaussian, i.e. $\epsilon^{(i)} \sim \mathcal{N}(0, \sigma^2)$. Then
$$\mathcal{L}(\theta) = \prod_{i=1}^n p\left(y^{(i)} \mid f\left(x^{(i)} \mid \theta\right), \sigma^2\right) \propto \exp\left(-\frac{\sum_{i=1}^n \left(y^{(i)} - f\left(x^{(i)} \mid \theta\right)\right)^2}{2\sigma^2}\right).$$
Maximizing this likelihood is therefore equivalent to minimizing the sum of squared errors, i.e. empirical risk minimization with the L2 loss.
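This equivalence is easy to verify numerically for the constant model (sketch; $\sigma$ assumed known):

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(5)
y = 2.0 + rng.normal(scale=0.5, size=200)  # y = theta + Gaussian error

# Maximize the Gaussian log-likelihood (minimize its negative) ...
mle = minimize_scalar(lambda t: -np.sum(norm.logpdf(y, loc=t, scale=0.5))).x
# ... and minimize the L2 empirical risk.
erm = minimize_scalar(lambda t: np.sum((y - t) ** 2)).x
print(mle, erm)  # agree up to numerical precision; both equal mean(y)
```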
LAPLACE ERRORS - L1-LOSS
If we assume the $\epsilon^{(i)}$ to follow a Laplace distribution with density
$$\frac{1}{2b} \exp\left(-\frac{|x - \mu|}{b}\right),$$
the same calculation yields
$$\mathcal{L}(\theta) \propto \exp\left(-\frac{\sum_{i=1}^n \left|y^{(i)} - f\left(x^{(i)} \mid \theta\right)\right|}{b}\right),$$
i.e. maximization of the log-likelihood
$$\ell(\theta) \propto -\sum_{i=1}^n \left|y^{(i)} - f\left(x^{(i)} \mid \theta\right)\right|,$$
which corresponds to empirical risk minimization with the L1 loss.
OTHER ERROR DISTRIBUTIONS
[Figure: histograms of simulated errors and the corresponding residuals]
OTHER ERROR DISTRIBUTIONS
There are losses that do not correspond to "real" error densities. However, intuitively, the type of loss function we minimize determines how the residuals will be distributed.
[Figure: histogram of residuals for the Huber loss]
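This can be illustrated by fitting constant models under different losses to simulated data and inspecting the residual histograms (sketch):

```python
import numpy as np
from scipy.optimize import minimize_scalar
import matplotlib.pyplot as plt

def pinball(r, alpha):
    return np.where(r < 0, (1 - alpha) * (-r), alpha * r)

rng = np.random.default_rng(6)
y = rng.normal(size=500)

# Constant models fitted under L2 loss and quantile loss (alpha = 0.75).
t_l2 = y.mean()
t_75 = minimize_scalar(lambda t: np.sum(pinball(y - t, 0.75))).x

fig, axes = plt.subplots(2, 1)
axes[0].hist(y - t_l2, bins=30)
axes[0].set_title("Residuals under L2 loss (centered at the mean)")
axes[1].hist(y - t_75, bins=30)
axes[1].set_title("Residuals under quantile loss, alpha = 0.75 (~75% negative)")
plt.tight_layout()
plt.show()
```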
OTHER ERROR DISTRIBUTIONS
[Figure: histogram of residuals for the quantile loss ($\alpha = 0.75$)]