Risk Minimization for Regression
Bernd Bischl, Julia Moosbauer, Andreas Groll. Winter term 2020/21, Advanced Statistical Learning.
SCENARIO
Given the hypothesis space of linear models, which model will be
returned by a learning algorithm (under “perfect” optimization)?
[Figure: scatter plot of the example data (x, y)]
LOSSES: MEASURING ERRORS POINT-WISE
In regression, we predict $y \approx f(x)$. To measure the quality of a prediction point-wise, we use a loss function
$$L : \mathcal{Y} \times \mathbb{R}^g \to \mathbb{R}_{\geq 0}$$
that compares prediction and true target via $L(y, f(x))$. We require:
- $L(y, \tilde{y}) \geq 0$ for all $y, \tilde{y} \in \mathcal{Y}$ (non-negativity)
- $L(y, \tilde{y}) = 0 \Leftrightarrow y = \tilde{y}$ (optimality)
LOSSES: MEASURING ERRORS POINT-WISE
Example: $L(y, f(x)) = (y - f(x))^2$ (point-wise squared error)
[Figure: left, example data $y$ and prediction $\hat{y}$ over $x$; right, the squared-error loss $L(y, f(x))$ plotted against the residual $y - f(x)$]
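For illustration (not part of the original slides), the point-wise squared errors of a few made-up predictions can be computed directly:

```python
import numpy as np

# Made-up toy values, purely for illustration.
y   = np.array([3.1, 4.0, 5.2, 6.8])   # observed targets
f_x = np.array([3.0, 4.5, 5.0, 6.0])   # model predictions f(x)

# Point-wise squared errors L(y, f(x)) = (y - f(x))^2
losses = (y - f_x) ** 2
print(losses)  # one non-negative value per observation
```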
LOSSES, RESIDUALS AND PSEUDO-RESIDUALS
The residual of an observation is the difference between true and predicted outcome:
$$r := y - f(x), \qquad r^{(i)} := y^{(i)} - f\left(x^{(i)}\right).$$
The pseudo-residual is the negative derivative of the loss w.r.t. the prediction, $\tilde{r} := -\frac{\partial L(y, f(x))}{\partial f(x)}$.
LOSSES, RESIDUALS AND PSEUDO-RESIDUALS
A loss is translation-invariant iff it is distance-based, i.e. it can be written as a function of the residual only.
$\Rightarrow$: If a loss is translation-invariant, i.e. $L(y + a, f(x) + a) = L(y, f(x))$ for all $a$, then choosing $a = -y$ shows that it is also distance-based:
$$L(y, f(x)) = L(y - y, f(x) - y) = L(0, -r) =: \psi(r).$$
LOSS PLOTS
We call the plot of the point-wise loss $L(y, f(x))$ against the residual $r := y - f(x)$ (for regression) a loss plot. The pseudo-residual corresponds to the slope of the tangent at the point $(y - f(x), L(y, f(x)))$.
[Figure: loss plot of the squared-error loss, a parabola in $y - f(x)$]
(THEORETICAL) RISK MINIMIZATION
Let $\mathbb{P}_{xy}$ be the joint distribution of $(x, y)$ that defines all aspects of the data-generating process. The (theoretical) risk of a hypothesis $f$ is the expected loss
$$\mathcal{R}(f) = \mathbb{E}_{xy}\left[L(y, f(x))\right].$$
Example:
- $\mathcal{Y} = \mathbb{R}$ (regression)
- $L(y, f(x)) = (y - f(x))^2$ (squared error / L2 loss)
The corresponding risk is
$$\mathcal{R}(f) = \mathbb{E}_{xy}\left[(y - f(x))^2\right].$$
(THEORETICAL) RISK MINIMIZATION: L2-LOSS
We calculate the (theoretical) risk for $\mathcal{Y} = \mathbb{R}$ and the L2 loss. Let us assume our hypothesis space is not restricted at all and contains all possible functions, $\mathcal{H} = \{f : \mathcal{X} \to \mathcal{Y}\}$.
By the law of total expectation, $\mathcal{R}(f) = \mathbb{E}_x\left[\mathbb{E}_{y|x}\left[(y - f(x))^2 \mid x\right]\right]$, so we can minimize point-wise over constant predictions $c = f(x)$:
$$\hat{f}(x) = \arg\min_c \mathbb{E}_{y|x}\left[(y - c)^2 \mid x = x\right] = \mathbb{E}_{y|x}\left[y \mid x = x\right].$$
The risk minimizer for the L2 loss is thus the conditional mean.
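A small simulation sketch (the conditional distribution of $y \mid x$ is assumed here purely for illustration) confirms this: the average squared error over samples of $y \mid x$ is minimized near the conditional mean.

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulate y | x = x0 (here assumed to be N(2, 1) for illustration).
y = rng.normal(loc=2.0, scale=1.0, size=100_000)

# Approximate E[(y - c)^2 | x = x0] for a grid of constant predictions c.
cs = np.linspace(0.0, 4.0, 401)
risk = [np.mean((y - c) ** 2) for c in cs]

print(cs[np.argmin(risk)], y.mean())  # both close to E[y | x = x0] = 2
```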
(THEORETICAL) RISK MINIMIZATION: L2-LOSS
The last step follows from the decomposition
$$\mathbb{E}\left[(y - c)^2\right] = \underbrace{\mathbb{E}\left[(y - c)^2\right] - (\mathbb{E}[y] - c)^2}_{\operatorname{Var}[y - c] \,=\, \operatorname{Var}[y]} + (\mathbb{E}[y] - c)^2,$$
which is minimized by $c = \mathbb{E}[y]$, since the variance term does not depend on $c$.
(THEORETICAL) RISK MINIMIZATION: LIMITATION
Problem: Minimizing $\mathcal{R}(f)$ over $f$ is generally neither feasible nor practical:
- $\mathbb{P}_{xy}$ is unknown (if it were known, we could use it directly to construct optimal predictions).
- We could estimate $\mathbb{P}_{xy}$ in a non-parametric fashion from the data $\mathcal{D}$ (drawn i.i.d. from $\mathbb{P}_{xy}$), e.g. by kernel density estimation, but this does not scale to higher dimensions (see curse of dimensionality).
- We can efficiently estimate $\mathbb{P}_{xy}$ if we place strong assumptions on its distributional form, and methods like discriminant analysis work exactly this way. ML usually studies more flexible models.
EMPIRICAL RISK MINIMIZATION
Let us assume that our dataset $\mathcal{D}$ is drawn i.i.d. from $\mathbb{P}_{xy}$. We can then approximate the theoretical risk by the empirical risk
$$\mathcal{R}_{\text{emp}}(f) = \sum_{i=1}^n L\left(y^{(i)}, f\left(x^{(i)}\right)\right)$$
or its averaged version
$$\bar{\mathcal{R}}_{\text{emp}}(f) = \frac{1}{n} \sum_{i=1}^n L\left(y^{(i)}, f\left(x^{(i)}\right)\right).$$
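In code, the empirical risk is simply a sum or mean of point-wise losses. A minimal sketch, with made-up data:

```python
import numpy as np

def emp_risk(y, f_x, loss, average=False):
    """Empirical risk: sum (or mean) of the point-wise losses L(y_i, f(x_i))."""
    losses = loss(y, f_x)
    return losses.mean() if average else losses.sum()

l2 = lambda y, f_x: (y - f_x) ** 2  # squared-error loss

y   = np.array([1.0, 2.0, 3.0])
f_x = np.array([1.5, 2.0, 2.5])
print(emp_risk(y, f_x, l2))                # R_emp
print(emp_risk(y, f_x, l2, average=True))  # averaged version
```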
EMPIRICAL RISK MINIMIZATION
Learning then amounts to empirical risk minimization:
$$\mathcal{R}_{\text{emp}}(\theta) = \sum_{i=1}^n L\left(y^{(i)}, f\left(x^{(i)} \mid \theta\right)\right)$$
$$\hat{\theta} = \arg\min_{\theta \in \Theta} \mathcal{R}_{\text{emp}}(\theta)$$
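A hedged sketch of ERM in practice, assuming a simple linear model f(x | theta) = theta_0 + theta_1 * x (an example model, not from the slides) and scipy's generic optimizer:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, size=50)
y = 0.5 + 1.5 * x + rng.normal(scale=0.3, size=50)

def remp(theta):
    # Empirical risk of the linear model f(x | theta) = theta[0] + theta[1] * x
    # under the L2 loss.
    f_x = theta[0] + theta[1] * x
    return np.sum((y - f_x) ** 2)

theta_hat = minimize(remp, x0=np.zeros(2)).x
print(theta_hat)  # close to the true parameters (0.5, 1.5)
```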
EMPIRICAL RISK MINIMIZATION
Why should we care about how to choose the loss function?
- Statistical properties of $f$: the choice of loss determines statistical properties of the fitted $f$, such as robustness to outliers and the implicit error distribution (see later).
- Computational properties: the choice of loss determines the difficulty of the resulting optimization problem.
Regression Losses
REGRESSION LOSSES: L2-LOSS
$$L(y, f(x)) = (y - f(x))^2 \quad \text{or} \quad L(y, f(x)) = 0.5\,(y - f(x))^2$$
[Figure: left, example data $y$ and prediction $\hat{y}$; right, the L2 loss against the residual $y - f(x)$]
REGRESSION LOSSES: L2-LOSS
What is the optimal constant model, i.e.
$$f \in \mathcal{H} = \{f(x) = \theta \mid \theta \in \mathbb{R}\},$$
w.r.t. the L2 loss?
REGRESSION LOSSES: L2-LOSS
We calculate $\hat{\theta}$ analytically by setting the first derivative of $\mathcal{R}_{\text{emp}}$ w.r.t. $\theta$ to 0:
$$\frac{\partial \mathcal{R}_{\text{emp}}(\theta)}{\partial \theta} = -2 \sum_{i=1}^n \left(y^{(i)} - \theta\right) \overset{!}{=} 0$$
$$\sum_{i=1}^n y^{(i)} - n\theta = 0$$
$$\hat{\theta} = \frac{1}{n} \sum_{i=1}^n y^{(i)} =: \bar{y}.$$
So the optimal constant model w.r.t. the L2 loss predicts the average of the observed outcomes, $\hat{f}(x) = \bar{y}$.
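A quick numerical sanity check (data made up): minimizing the L2 empirical risk over constants recovers the sample mean.

```python
import numpy as np
from scipy.optimize import minimize_scalar

y = np.array([2.0, 3.5, 4.0, 10.0])  # made-up observations
res = minimize_scalar(lambda theta: np.sum((y - theta) ** 2))
print(res.x, y.mean())  # both give the mean 4.875
```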
REGRESSION LOSSES: L2-LOSS
[Figure: constant model $f(x)$ fitted w.r.t. L2 loss on the example data]
REGRESSION LOSSES: L1-LOSS
$$L(y, f(x)) = |y - f(x)|$$
[Figure: left, example data and prediction $\hat{y}$; right, the L1 loss against the residual $y - f(x)$]
REGRESSION LOSSES: L1-LOSS
What is the optimal constant model $f(x) = \theta$ w.r.t. the L1 loss?
$$\hat{\theta} = \arg\min_{\theta \in \mathbb{R}} \sum_{i=1}^n \left|y^{(i)} - \theta\right| \;\Leftrightarrow\; \hat{\theta} = \operatorname{median}\left(y^{(i)}\right)$$
Proof: Exercise.
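Although the proof is left as an exercise, the claim is easy to check numerically (sketch with made-up data; a grid search avoids issues with the non-differentiable objective):

```python
import numpy as np

y = np.array([1.0, 2.0, 3.5, 4.0, 10.0])  # made-up observations, odd n
grid = np.linspace(0.0, 11.0, 1101)       # candidate constants theta
risks = [np.sum(np.abs(y - t)) for t in grid]
print(grid[np.argmin(risks)], np.median(y))  # both 3.5
```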
REGRESSION LOSSES: L1-LOSS
[Figure: constant models fitted w.r.t. L2 and L1 loss on the example data]
We see that the model fitted w.r.t. L1-loss is less affected by the outlier
(−1.8, 2).
REGRESSION LOSSES: QUANTILE LOSS
$$L(y, f(x)) = \begin{cases} (1 - \alpha)\,(f(x) - y), & \text{if } y < f(x) \\ \alpha\,(y - f(x)), & \text{if } y \geq f(x) \end{cases}$$
- Quantile $\alpha \in (0, 1)$
- Extends the L1 loss: for $\alpha = 0.5$ it is proportional to the L1 loss
- Weights positive and negative residuals asymmetrically
- $\alpha < 0.5$ ($\alpha > 0.5$) puts a higher penalty on over-estimation (under-estimation)
- Also known as pinball loss
[Figure: pinball loss for $\alpha = 0.25, 0.50, 0.75$ against the residual $y - f(x)$]
REGRESSION LOSSES: QUANTILE LOSS
The pinball loss for $\alpha = 0.75$ penalizes under-estimation, i.e. positive residuals ($y > f(x)$), more than over-estimation.
[Figure: left, example data and prediction $\hat{y}$; right, the pinball loss ($\alpha = 0.75$) against the residual $y - f(x)$]
REGRESSION LOSSES: QUANTILE LOSS
What is the optimal constant model $f(x) = \theta$ w.r.t. the quantile loss?
$$\hat{f} = \arg\min_{f \in \mathcal{H}} \mathcal{R}_{\text{emp}}(f)$$
$$\Leftrightarrow \hat{\theta} = \arg\min_{\theta \in \mathbb{R}} \left[ (1 - \alpha) \sum_{i:\, y^{(i)} < \theta} \left|y^{(i)} - \theta\right| + \alpha \sum_{i:\, y^{(i)} \geq \theta} \left|y^{(i)} - \theta\right| \right]$$
$$\Leftrightarrow \hat{\theta} = Q_\alpha\left(\left\{y^{(i)}\right\}\right), \text{ the } \alpha\text{-quantile of the observed outcomes.}$$
Proof: Exercise.
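Again, a numerical sanity check (illustrative sketch; data simulated) that the minimizer coincides with the empirical $\alpha$-quantile:

```python
import numpy as np

def pinball(y, f_x, alpha):
    # (1 - alpha)(f - y) for y < f(x), alpha (y - f) for y >= f(x)
    r = y - f_x
    return np.where(r < 0, (1 - alpha) * (-r), alpha * r)

rng = np.random.default_rng(2)
y = rng.normal(size=10_000)

alpha = 0.75
grid = np.linspace(-3, 3, 601)
risks = [pinball(y, t, alpha).sum() for t in grid]
print(grid[np.argmin(risks)], np.quantile(y, alpha))  # both near the 0.75-quantile
```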
REGRESSION LOSSES: QUANTILE LOSS
[Figure: constant models fitted w.r.t. L2, L1, and quantile loss ($\alpha = 0.25, 0.75$) on the example data]
REGRESSION LOSSES: HUBER LOSS
$$L(y, f(x)) = \begin{cases} \frac{1}{2}\,(y - f(x))^2, & \text{if } |y - f(x)| \leq \delta \\ \delta\,|y - f(x)| - \frac{1}{2}\delta^2, & \text{otherwise} \end{cases}$$
[Figure: left, example data and prediction $\hat{y}$; right, the Huber loss against the residual $y - f(x)$]
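A direct implementation of the piecewise definition (a minimal sketch):

```python
import numpy as np

def huber(y, f_x, delta=1.0):
    """Huber loss: quadratic for small residuals, linear for large ones."""
    r = np.abs(y - f_x)
    return np.where(r <= delta, 0.5 * r ** 2, delta * r - 0.5 * delta ** 2)

print(huber(np.array([0.5, 3.0]), 0.0))  # [0.125, 2.5]
```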
REGRESSION LOSSES: HUBER LOSS
The following plot shows the Huber loss for different values of $\delta$.
[Figure: Huber loss for $\delta = 2.0, 1, 0.5$ against the residual $y - f(x)$]
REGRESSION LOSSES: HUBER LOSS
What is the optimal constant model $f(x) = \theta$ w.r.t. the Huber loss?
- There is no closed-form solution.
- Numerical optimization methods are necessary.
- $\rightarrow$ the "optimal" solution can only be approximated to a certain degree of accuracy via iterative optimization.
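For example, the optimal constant under the Huber loss can be approximated with a generic scalar optimizer (sketch; data and $\delta$ chosen for illustration):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def huber(y, f_x, delta):
    r = np.abs(y - f_x)
    return np.where(r <= delta, 0.5 * r ** 2, delta * r - 0.5 * delta ** 2)

y = np.array([2.0, 3.5, 4.0, 10.0])  # made-up data with an outlier
res = minimize_scalar(lambda theta: np.sum(huber(y, theta, delta=2.0)))
print(res.x)  # about 3.83: between the median (3.75) and the mean (4.875)
```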
EXCURSUS: NUMERICAL OPTIMIZATION
We are searching for the parameter $\theta \in \Theta$ which minimizes the empirical risk, i.e.
$$\min_{\theta \in \Theta} \bar{\mathcal{R}}_{\text{emp}}(\theta) = \min_{\theta \in \Theta} \frac{1}{n} \sum_{i=1}^n L\left(y^{(i)}, f\left(x^{(i)} \mid \theta\right)\right)$$
$\longrightarrow$ Numerical optimization
NUMERICAL OPTIMIZATION: GRADIENT DESCENT
Gradient descent iteratively moves in the direction of the negative gradient of the (averaged) empirical risk:
$$\theta^{[j+1]} \leftarrow \theta^{[j]} - \alpha^{[j]} \nabla_\theta \bar{\mathcal{R}}_{\text{emp}}(\theta) \Big|_{\theta = \theta^{[j]}},$$
where $\alpha^{[j]}$ denotes the step size (learning rate) in iteration $j$.
NUMERICAL OPTIMIZATION: GRADIENT DESCENT
By using the chain rule we see that
$$\nabla_\theta \bar{\mathcal{R}}_{\text{emp}}(\theta) = \frac{1}{n} \sum_{i=1}^n \underbrace{\frac{\partial L\left(y^{(i)}, f\left(x^{(i)} \mid \theta\right)\right)}{\partial f}}_{=\, -\tilde{r}^{(i)}} \nabla_\theta f\left(x^{(i)} \mid \theta\right)$$
$$= -\frac{1}{n} \sum_{i=1}^n \tilde{r}^{(i)} \cdot \nabla_\theta f\left(x^{(i)} \mid \theta\right)$$
NUMERICAL OPTIMIZATION: GRADIENT DESCENT
Example: Huber loss
For the constant model $f(x) = \theta$ we have $\nabla_\theta f \equiv 1$, which results in the following update rule:
$$\theta^{[j+1]} \leftarrow \theta^{[j]} + \frac{\alpha^{[j]}}{n} \sum_{i=1}^n \tilde{r}^{(i)}$$
The steps of gradient descent are shown in the following plots. The green line shows the "optimal" constant model w.r.t. Huber loss; the black line shows the iterations of gradient descent.
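A minimal gradient descent sketch for this example (data, $\delta$, and step size are assumed for illustration; for the Huber loss the pseudo-residual is the residual clipped at $\pm\delta$):

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.normal(loc=2.0, size=100)
y[:5] += 15.0  # inject a few outliers

delta, alpha = 1.0, 0.5  # Huber parameter and (constant) step size
theta = 0.0              # constant model f(x) = theta
for _ in range(100):
    r = y - theta
    # Huber pseudo-residuals: r for |r| <= delta, delta * sign(r) otherwise.
    r_tilde = np.clip(r, -delta, delta)
    theta += alpha * r_tilde.mean()  # theta <- theta + (alpha / n) * sum(r_tilde)
print(theta)  # robust location estimate, close to 2 despite the outliers
```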
NUMERICAL OPTIMIZATION: GRADIENT DESCENT
[Figures: gradient descent iterations 1-3; the black constant model moves toward the green optimum step by step]
REGRESSION LOSSES: HUBER LOSS
[Figure: constant models fitted w.r.t. L2, L1, quantile ($\alpha = 0.25, 0.75$), and Huber ($\delta = 1.0$) loss]
We see that the constant model fitted w.r.t. Huber loss in fact lies between the models fitted w.r.t. L1 and L2 loss.
NUMERICAL OPTIMIZATION: STOCHASTIC GRADIENT DESCENT
Stochastic gradient descent (SGD) approximates the gradient of the empirical risk by the gradient of the loss at a single, randomly chosen observation $i$:
$$\theta^{[j+1]} \leftarrow \theta^{[j]} - \alpha^{[j]} \nabla_\theta L\left(y^{(i)}, f\left(x^{(i)} \mid \theta\right)\right) \Big|_{\theta = \theta^{[j]}}$$
NUMERICAL OPTIMIZATION: MINI-BATCH GRADIENT DESCENT
Mini-batch gradient descent sums the per-observation gradients over a randomly drawn index set $I \subset \{1, \dots, n\}$ (the mini-batch):
$$\theta^{[j+1]} \leftarrow \theta^{[j]} - \alpha^{[j]} \sum_{i \in I} \nabla_\theta L\left(y^{(i)}, f\left(x^{(i)} \mid \theta\right)\right) \Big|_{\theta = \theta^{[j]}}$$
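A mini-batch SGD sketch for the constant model under the (summed) L2 loss, whose point-wise gradient w.r.t. $\theta$ is $-2(y - \theta)$; all settings are assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
y = rng.normal(loc=3.0, size=1000)

theta, alpha, batch_size = 0.0, 0.05, 32
for _ in range(500):
    batch_idx = rng.choice(len(y), size=batch_size, replace=False)  # index set I
    grad = -2.0 * np.sum(y[batch_idx] - theta)  # gradient of the L2 loss over I
    theta -= alpha / batch_size * grad
print(theta)  # fluctuates around mean(y), roughly 3
```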
REGRESSION LOSSES: EPSILON-INSENSITIVE LOSS
$$L(y, f(x)) = \begin{cases} 0, & \text{if } |y - f(x)| \leq \epsilon \\ |y - f(x)| - \epsilon, & \text{otherwise} \end{cases} \qquad \epsilon \in \mathbb{R}^+$$
- Modification of the L1 loss: errors below $\epsilon$ are accepted without penalty
- Properties: convex; not differentiable for $y - f(x) \in \{-\epsilon, \epsilon\}$
- No closed-form solution for the optimal constant model
[Figure: left, example data and prediction $\hat{y}$; right, the $\epsilon$-insensitive loss against the residual $y - f(x)$]
REGRESSION LOSSES: EPSILON-INSENSITIVE LOSS
[Figure: constant models fitted w.r.t. L2, L1, quantile ($\alpha = 0.25, 0.75$), Huber ($\delta = 1.0$), and $\epsilon$-insensitive ($\epsilon = 1, 10$) loss]
REGRESSION LOSSES: LOG-BARRIER LOSS
$$L(y, f(x)) = \begin{cases} -a^2 \cdot \log\left(1 - \left(\frac{|y - f(x)|}{a}\right)^2\right), & \text{if } |y - f(x)| \leq a \\ \infty, & \text{if } |y - f(x)| > a \end{cases}$$
[Figure: log-barrier loss against the residual $y - f(x)$]
COMPARISON OF LOSS FUNCTIONS
[Figure: L2, L1, quantile ($\alpha = 0.25, 0.75$), Huber ($\delta = 2.0$), and $\epsilon$-insensitive ($\epsilon = 2.0$) loss against the residual $y - f(x)$]
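A figure like this can be reproduced with a short script (sketch; the parameter values follow the legend above):

```python
import numpy as np
import matplotlib.pyplot as plt

r = np.linspace(-10, 10, 500)  # residuals y - f(x)

def pinball(r, alpha):
    return np.where(r < 0, (1 - alpha) * (-r), alpha * r)

def huber(r, delta):
    a = np.abs(r)
    return np.where(a <= delta, 0.5 * a ** 2, delta * a - 0.5 * delta ** 2)

def eps_insensitive(r, eps):
    return np.maximum(np.abs(r) - eps, 0.0)

plt.plot(r, r ** 2, label="L2 loss")
plt.plot(r, np.abs(r), label="L1 loss")
plt.plot(r, pinball(r, 0.25), label="q loss (alpha = 0.25)")
plt.plot(r, pinball(r, 0.75), label="q loss (alpha = 0.75)")
plt.plot(r, huber(r, 2.0), label="h loss (delta = 2.0)")
plt.plot(r, eps_insensitive(r, 2.0), label="eps loss (eps = 2.0)")
plt.ylim(0, 12)
plt.xlabel("y - f(x)")
plt.ylabel("L(y - f(x))")
plt.legend()
plt.show()
```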
Maximum Likelihood Estimation vs. Empirical Risk Minimization
MAXIMUM LIKELIHOOD
Let us approach regression from a maximum likelihood perspective. Assume that our data is generated as
$$y = f(x \mid \theta) + \epsilon,$$
with i.i.d. errors $\epsilon^{(i)}$ that are independent of $x$.
MAXIMUM LIKELIHOOD
Given data $\mathcal{D}$ drawn i.i.d. from $p(y \mid x, \theta)$, the maximum likelihood principle is to maximize
$$\mathcal{L}(\theta) = \prod_{i=1}^n p\left(y^{(i)} \mid x^{(i)}, \theta\right).$$
MAXIMUM LIKELIHOOD
Equivalently, we can minimize the negative log-likelihood $-\ell(\theta) = -\sum_{i=1}^n \log p\left(y^{(i)} \mid x^{(i)}, \theta\right)$. Defining the loss as $L(y, f(x \mid \theta)) := -\log p(y \mid x, \theta)$, this is exactly empirical risk minimization:
$$\mathcal{R}_{\text{emp}}(\theta) = \sum_{i=1}^n L\left(y^{(i)}, f\left(x^{(i)} \mid \theta\right)\right)$$
GAUSSIAN ERRORS - L2-LOSS
Let us assume that the errors are Gaussian, i.e. $\epsilon^{(i)} \sim \mathcal{N}(0, \sigma^2)$. Then
$$\mathcal{L}(\theta) = \prod_{i=1}^n p\left(y^{(i)} \mid f\left(x^{(i)} \mid \theta\right), \sigma^2\right) \propto \exp\left(-\frac{\sum_{i=1}^n \left(y^{(i)} - f\left(x^{(i)} \mid \theta\right)\right)^2}{2\sigma^2}\right).$$
Maximizing this likelihood is therefore equivalent to minimizing the sum of squared errors, i.e. empirical risk minimization with the L2 loss.
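This equivalence is easy to verify numerically for the constant model (sketch; $\sigma$ assumed known):

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(5)
y = 2.0 + rng.normal(scale=0.5, size=200)  # y = theta + Gaussian error

# Maximize the Gaussian log-likelihood (minimize its negative) ...
mle = minimize_scalar(lambda t: -np.sum(norm.logpdf(y, loc=t, scale=0.5))).x
# ... and minimize the L2 empirical risk.
erm = minimize_scalar(lambda t: np.sum((y - t) ** 2)).x
print(mle, erm)  # agree up to numerical precision; both equal mean(y)
```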
LAPLACE ERRORS - L1-LOSS
If we assume the $\epsilon^{(i)}$ to follow a Laplace distribution with density
$$\frac{1}{2b} \exp\left(-\frac{|x - \mu|}{b}\right),$$
the same calculation yields
$$\mathcal{L}(\theta) \propto \exp\left(-\frac{\sum_{i=1}^n \left|y^{(i)} - f\left(x^{(i)} \mid \theta\right)\right|}{b}\right),$$
i.e. maximization of the log-likelihood
$$\ell(\theta) \propto -\sum_{i=1}^n \left|y^{(i)} - f\left(x^{(i)} \mid \theta\right)\right|,$$
which corresponds to empirical risk minimization with the L1 loss.
OTHER ERROR DISTRIBUTIONS
[Figure: histograms of simulated errors and the corresponding residuals]
OTHER ERROR DISTRIBUTIONS
There are losses that do not correspond to "real" error densities. However, intuitively, the type of loss function we minimize determines how the residuals will be distributed.
[Figure: histogram of residuals for the Huber loss]
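This can be illustrated by fitting constant models under different losses to simulated data and inspecting the residual histograms (sketch):

```python
import numpy as np
from scipy.optimize import minimize_scalar
import matplotlib.pyplot as plt

def pinball(r, alpha):
    return np.where(r < 0, (1 - alpha) * (-r), alpha * r)

rng = np.random.default_rng(6)
y = rng.normal(size=500)

# Constant models fitted under L2 loss and quantile loss (alpha = 0.75).
t_l2 = y.mean()
t_75 = minimize_scalar(lambda t: np.sum(pinball(y - t, 0.75))).x

fig, axes = plt.subplots(2, 1)
axes[0].hist(y - t_l2, bins=30)
axes[0].set_title("Residuals under L2 loss (centered at the mean)")
axes[1].hist(y - t_75, bins=30)
axes[1].set_title("Residuals under quantile loss, alpha = 0.75 (~75% negative)")
plt.tight_layout()
plt.show()
```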
OTHER ERROR DISTRIBUTIONS
[Figure: histogram of residuals for the quantile loss ($\alpha = 0.75$)]