ESGB Evaluation Methods


Lecture: Bias/variance trade-off, model assessment and model selection

Pr. Fatima Ezzahraa BEN BOUAZZA


fbenbouazza@um6ss.ma
Supervised Learning (i)
Let us assume that one is given:

- A learning sample of $N$ input-output pairs
  $LS = \{(x_i, y_i) \mid i = 1, \dots, N\}$, $x_i \in \mathcal{X}$, $y_i \in \mathcal{Y}$,
  drawn independently and identically (i.i.d.) from an unknown distribution $p(x, y)$.
- A loss function
  $L : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$
  measuring the discrepancy between its arguments.

One wants to find a function $f : \mathcal{X} \to \mathcal{Y}$ that minimizes the following expectation (generalization error):

$E_{x,y}\{L(y, f(x))\}$
Supervised Learning (ii)

There are two types of problems:

- Classification:
  $L(y, y') = 1(y \neq y')$ (error rate)

- Regression:
  $L(y, y') = (y - y')^2$ (square error)
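As a small illustration of these two loss functions, here is a minimal sketch assuming NumPy is available; the label and prediction vectors are made up for the example.

```python
import numpy as np

# Classification: 0/1 loss; averaged over a sample it gives the error rate
y_true_cls = np.array([1, 0, 1, 1, 0])
y_pred_cls = np.array([1, 1, 1, 0, 0])
error_rate = np.mean(y_true_cls != y_pred_cls)       # 2 mistakes out of 5 -> 0.4

# Regression: square error; averaged it gives the mean squared error
y_true_reg = np.array([1.5, 2.0, 0.5])
y_pred_reg = np.array([1.0, 2.5, 0.5])
mse = np.mean((y_true_reg - y_pred_reg) ** 2)        # (0.25 + 0.25 + 0) / 3

print(error_rate, mse)
```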
Learning set randomness

Let us denote by $\hat{f}_{LS}$ the function learned from a learning sample $LS$ by a given learning algorithm.

The function $\hat{f}_{LS}$ (its prediction at some point $x$) is a random variable.

Figure: error on the learning sample and on the test sample for 100 different learning samples. Source: [1]
Two quantities of interest

Given a model $\hat{f}_{LS}$ built from some learning sample, its generalization error is given by:

$Err_{LS} = E_{x,y}\{L(y, \hat{f}_{LS}(x))\}$

⇒ useful for model assessment and model selection.

Given a learning algorithm, its expected generalization error over random learning samples of size $N$ is given by:

$E_{LS}\{Err_{LS}\} = E_{LS}\{E_{x,y}\{L(y, \hat{f}_{LS}(x))\}\}$

⇒ useful to characterize a learning algorithm.
Outline

- Bias/variance trade-off: decomposition of the expected error $E_{LS}\{Err_{LS}\}$ that helps to understand overfitting.

- Performance evaluation: procedures to estimate $Err_{LS}$ or $E_{LS}\{Err_{LS}\}$.

- Performance measures: common loss functions $L$ for classification and regression.
Outline

1. Bias/variance trade-off
2. Performance measures
3. Further reading
Outline

1. Bias/variance trade-off
   - Bias and variance definitions
   - Parameters that influence bias and variance
   - Bias and variance reduction techniques
2. Performance measures
3. Further reading
Bias and variance definitions - Outline

- Bias/variance decomposition for a simple regression problem with no input
- Extension to regression problems with inputs
- Bias/variance trade-off in classification problems
A simple regression problem with no input (i)

Goal: predict as well as possible the height of a Belgian male adult.


More formally:
- Choose an error measure (e.g. the square error).
- Find an estimation $\hat{y}$ such that
  $E_y\{(y - \hat{y})^2\}$
  is minimized over the whole population of Belgian male adults.
A simple regression problem with no input (ii)
The estimation that minimizes the error can be computed by cancelling its derivative:

$\frac{\partial}{\partial y'} E_y\{(y - y')^2\} = 0$
$\Leftrightarrow E_y\{-2 (y - y')\} = 0$
$\Leftrightarrow E_y\{y\} - E_y\{y'\} = 0$
$\Leftrightarrow y' = E_y\{y\}$

Hence, the estimation that minimizes the error is $E_y\{y\}$, which is called the Bayes model.

However, in practice, it is impossible to compute the exact value of $E_y\{y\}$, as this would require measuring the height of every Belgian male adult.
Learning algorithms (i)

As $p(y)$ is unknown, one needs to find an estimation $\hat{y}$ from a sample of individuals drawn i.i.d. from the Belgian male adult population:

$LS = \{y_1, \dots, y_N\}$

Examples:

- $\hat{y}_1 = \frac{1}{N} \sum_{i=1}^{N} y_i$

- $\hat{y}_2 = \frac{\lambda \cdot 180 + \sum_{i=1}^{N} y_i}{\lambda + N}$, with $\lambda \in [0, +\infty)$ (if it is known that the height is close to 180)
Learning algorithms (ii)

As learning samples $LS$ are randomly drawn, the prediction $\hat{y}$ will also be a random variable.

A good learning algorithm should not be good only on a single learning sample but on average over all learning samples of size $N$. One thus seeks to minimize

$E = E_{LS}\{E_y\{(y - \hat{y})^2\}\}$
Bias/ variance decomposition (i)
Let us analyse this error in more detail:

$E_{LS}\{E_y\{(y - \hat{y})^2\}\}$
$= E_{LS}\{E_y\{((y - E_y\{y\}) + (E_y\{y\} - \hat{y}))^2\}\}$
$= E_{LS}\{E_y\{(y - E_y\{y\})^2\}\} + E_{LS}\{E_y\{(E_y\{y\} - \hat{y})^2\}\} + E_{LS}\{E_y\{2 (y - E_y\{y\})(E_y\{y\} - \hat{y})\}\}$
$= E_y\{(y - E_y\{y\})^2\} + E_{LS}\{(E_y\{y\} - \hat{y})^2\} + E_{LS}\{2 (E_y\{y\} - E_y\{y\})(E_y\{y\} - \hat{y})\}$
$= E_y\{(y - E_y\{y\})^2\} + E_{LS}\{(E_y\{y\} - \hat{y})^2\}$
Bias/ variance decomposition (ii)

$E = E_y\{(y - E_y\{y\})^2\} + E_{LS}\{(E_y\{y\} - \hat{y})^2\}$

where the first term is the residual error $= \mathrm{var}_y\{y\}$.

The residual error is the minimal attainable error.
Bias/ variance decomposition (iii)

$E_{LS}\{(E_y\{y\} - \hat{y})^2\}$
$= E_{LS}\{((E_y\{y\} - E_{LS}\{\hat{y}\}) + (E_{LS}\{\hat{y}\} - \hat{y}))^2\}$
$= E_{LS}\{(E_y\{y\} - E_{LS}\{\hat{y}\})^2\} + E_{LS}\{(E_{LS}\{\hat{y}\} - \hat{y})^2\} + E_{LS}\{2 (E_y\{y\} - E_{LS}\{\hat{y}\})(E_{LS}\{\hat{y}\} - \hat{y})\}$
$= (E_y\{y\} - E_{LS}\{\hat{y}\})^2 + E_{LS}\{(\hat{y} - E_{LS}\{\hat{y}\})^2\} + 2 (E_y\{y\} - E_{LS}\{\hat{y}\})(E_{LS}\{\hat{y}\} - E_{LS}\{\hat{y}\})$
$= (E_y\{y\} - E_{LS}\{\hat{y}\})^2 + E_{LS}\{(\hat{y} - E_{LS}\{\hat{y}\})^2\}$
Bias/ variance decomposition (iv)

$E = \mathrm{var}_y\{y\} + (E_y\{y\} - E_{LS}\{\hat{y}\})^2 + \dots$

where $E_{LS}\{\hat{y}\}$ is the average model and $(E_y\{y\} - E_{LS}\{\hat{y}\})^2 = \mathrm{bias}^2$: the bias is the difference between the average model and the Bayes model.
Bias/ variance decomposition (v)

$E = \mathrm{var}_y\{y\} + \mathrm{bias}^2 + E_{LS}\{(\hat{y} - E_{LS}\{\hat{y}\})^2\}$

where the last term is the estimation variance $\mathrm{var}_{LS}\{\hat{y}\}$, a consequence of overfitting.
Bias/ variance decomposition (vi)

The expected generalization error can thus be rewritten as:

$E = \mathrm{var}_y\{y\} + \mathrm{bias}^2 + \mathrm{var}_{LS}\{\hat{y}\}$
Application to a simple example (i)

Let us first consider a model $\hat{y}_1$ computing the average height over the learning sample:

$\hat{y}_1 = \frac{1}{N} \sum_{i=1}^{N} y_i$

One can easily compute:

$\mathrm{bias}^2 = (E_y\{y\} - E_{LS}\{\hat{y}_1\})^2 = 0$
$\mathrm{var}_{LS}\{\hat{y}_1\} = \frac{1}{N} \mathrm{var}_y\{y\}$

From statistics, $\hat{y}_1$ is the best estimate with zero bias.
Application to a simple example (ii)
Let us now consider a model where it is known that the height is close to 180:

$\hat{y}_2 = \frac{\lambda \cdot 180 + \sum_{i=1}^{N} y_i}{\lambda + N}$

One can compute:

$\mathrm{bias}^2 = \left(\frac{\lambda}{\lambda + N}\right)^2 (E_y\{y\} - 180)^2$
$\mathrm{var}_{LS}\{\hat{y}_2\} = \frac{N}{(\lambda + N)^2} \mathrm{var}_y\{y\}$

From these expressions, it can be observed that $\hat{y}_1$ may not be the best estimator because of its variance. There is a bias/variance trade-off with respect to $\lambda$.
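This trade-off can be checked numerically by simulating many learning samples from an assumed population. A minimal sketch assuming NumPy; the true mean (175 cm), the standard deviation and the sample size are illustrative values, not taken from the slides. $\lambda = 0$ corresponds to $\hat{y}_1$.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, N = 175.0, 7.0, 10          # assumed population mean/std and learning sample size
n_ls = 100_000                         # number of simulated learning samples

samples = rng.normal(mu, sigma, size=(n_ls, N))
for lam in (0.0, 1.0, 5.0, 20.0):
    y2 = (lam * 180.0 + samples.sum(axis=1)) / (lam + N)   # shrinkage towards 180
    bias2 = (mu - y2.mean()) ** 2
    var = y2.var()
    print(f"lambda={lam:>5}: bias^2={bias2:.3f}  variance={var:.3f}  total={bias2 + var:.3f}")
```

For a population mean below 180, small positive values of $\lambda$ typically lower the total error below that of the unbiased sample mean, illustrating the trade-off.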
Bayesian approach (i)
Let us assume that:

- The average height $\bar{y}$ is close to 180 cm:
  $P(\bar{y}) = A \exp\left(-\frac{(\bar{y} - 180)^2}{2\sigma_{\bar{y}}^2}\right)$

- The height of one individual is Gaussian around the mean:
  $P(y_i \mid \bar{y}) = B \exp\left(-\frac{(y_i - \bar{y})^2}{2\sigma^2}\right)$

What is the most probable value of $\bar{y}$ after having seen the learning sample?

$\hat{y} = \arg\max_{\bar{y}} P(\bar{y} \mid LS)$
Bayesian approach (ii)

$\hat{y} = \arg\max_{\bar{y}} P(\bar{y} \mid LS)$
$= \arg\max_{\bar{y}} P(LS \mid \bar{y}) P(\bar{y})$ (Bayes' theorem, and $P(LS)$ is constant)
$= \arg\max_{\bar{y}} P(y_1, \dots, y_N \mid \bar{y}) P(\bar{y})$
$= \arg\max_{\bar{y}} \prod_{i=1}^{N} P(y_i \mid \bar{y}) \, P(\bar{y})$ (independence of learning cases)
$= \arg\min_{\bar{y}} -\sum_{i=1}^{N} \log P(y_i \mid \bar{y}) - \log P(\bar{y})$
Bayesian approach (iii)

$\hat{y} = \arg\min_{\bar{y}} \sum_{i=1}^{N} \frac{(y_i - \bar{y})^2}{2\sigma^2} + \frac{(\bar{y} - 180)^2}{2\sigma_{\bar{y}}^2}$
$= \dots$
$= \frac{\lambda \cdot 180 + \sum_{i=1}^{N} y_i}{\lambda + N}$, with $\lambda = \frac{\sigma^2}{\sigma_{\bar{y}}^2}$
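As a sanity check, this closed-form MAP estimate can be compared with a brute-force maximization of the log-posterior on a grid. A minimal sketch assuming NumPy; the numerical values (population mean 175, $\sigma = 10$, $\sigma_{\bar{y}} = 5$, $N = 20$) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, sigma_bar = 10.0, 5.0             # likelihood and prior standard deviations (illustrative)
N = 20
ys = rng.normal(175.0, sigma, size=N)    # a learning sample of N heights

# Closed-form MAP estimate: (lambda*180 + sum(y)) / (lambda + N), lambda = sigma^2 / sigma_bar^2
lam = sigma**2 / sigma_bar**2
y_map_closed = (lam * 180.0 + ys.sum()) / (lam + N)

# Brute-force maximization of the log-posterior on a fine grid
grid = np.linspace(150.0, 200.0, 100001)
log_post = (-((ys[:, None] - grid[None, :]) ** 2).sum(axis=0) / (2 * sigma**2)
            - (grid - 180.0) ** 2 / (2 * sigma_bar**2))
y_map_grid = grid[np.argmax(log_post)]

print(y_map_closed, y_map_grid)          # the two values agree up to the grid resolution
```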
Extension to a problem with inputs (i)

In this case, $\hat{y}(x)$ is a function of several inputs ⇒ average over the whole input space.

The error becomes:
$E_{x,y}\{(y - \hat{y}(x))^2\}$

When averaging over all learning sets:

$E = E_{LS}\{E_{x,y}\{(y - \hat{y}(x))^2\}\}$
$= E_x\{E_{LS}\{E_{y|x}\{(y - \hat{y}(x))^2\}\}\}$
$= E_x\{\mathrm{var}_{y|x}\{y\}\} + E_x\{\mathrm{bias}^2(x)\} + E_x\{\mathrm{var}_{LS}\{\hat{y}(x)\}\}$
Extension to a problem with inputs (ii)
$E_{LS}\{E_{y|x}\{(y - \hat{y}(x))^2\}\} = \mathrm{noise}(x) + \mathrm{bias}^2(x) + \mathrm{variance}(x)$

- $\mathrm{noise}(x) = E_{y|x}\{(y - h_B(x))^2\}$:
  quantifies how much $y$ varies from $h_B(x) = E_{y|x}\{y\}$ (the Bayes model).

- $\mathrm{bias}^2(x) = (h_B(x) - E_{LS}\{\hat{y}(x)\})^2$:
  measures the error between the Bayes model and the average model.

- $\mathrm{variance}(x) = E_{LS}\{(\hat{y}(x) - E_{LS}\{\hat{y}(x)\})^2\}$:
  quantifies how much $\hat{y}(x)$ varies from one learning sample to another.
Illustration (i)
Let us consider the following problem:
- A single input $x$, a uniform random variable drawn in $[0, 1]$.
- $y = h(x) + \varepsilon$ where $\varepsilon \sim \mathcal{N}(0, 1)$.
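The figures of the illustration slides are not reproduced here, but the pointwise decomposition of the previous slide can be estimated numerically for this problem. A minimal sketch assuming NumPy and scikit-learn; the true function h and the learner (a depth-3 regression tree) are illustrative assumptions, not the models used in the original figures.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
h = lambda x: np.sin(2 * np.pi * x)              # assumed true function (not specified on the slide)
x0 = 0.3                                         # point at which the decomposition is estimated

def draw_ls(n=50):
    x = rng.uniform(0, 1, n)
    return x.reshape(-1, 1), h(x) + rng.normal(0, 1, n)   # y = h(x) + eps, eps ~ N(0, 1)

# Predictions at x0 of models learned from many independent learning samples
preds = np.array([DecisionTreeRegressor(max_depth=3).fit(*draw_ls()).predict([[x0]])[0]
                  for _ in range(300)])

noise = 1.0                                      # var(eps): spread of y around the Bayes model h_B(x0)
bias2 = (h(x0) - preds.mean()) ** 2              # squared gap between average model and Bayes model
variance = preds.var()                           # spread of predictions across learning samples
print(f"noise={noise:.2f}  bias^2={bias2:.2f}  variance={variance:.2f}")
```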
Illustration (ii)
A low variance and high bias method leads to underfitting.
Illustration (iii)
A low bias and high variance method leads to overfitting.
Illustration (iv)
No noise does not imply no variance, only less variance.
Parameters that influence bias and variance - Outline

- Complexity of the model
- Complexity of the Bayes model
- Noise
- Learning sample size
- Learning algorithm
Illustrative problem

Let us consider the following artificial problem:

- 10 inputs, all uniform random variables drawn in $[0, 1]$.
- The true function depends only on 5 inputs:
  $y(x) = 10 \sin(\pi x_1 x_2) + 20 (x_3 - 0.5)^2 + 10 x_4 + 5 x_5 + \varepsilon$
  where $\varepsilon$ is a random variable drawn from $\mathcal{N}(0, 1)$.

The following experiments are conducted:

1. $E_{LS}$ → average over 50 learning sets of size 500.
2. $E_{x,y}$ → average over 2 000 cases.

⇒ Estimate bias and variance, as well as the residual error.
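This experimental protocol can be reproduced with a short simulation. A minimal sketch assuming NumPy and scikit-learn; the learner (an unpruned regression tree) and the random seed are illustrative choices, not those of the original experiments.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def friedman(n):
    """Draw n cases from the illustrative problem (10 uniform inputs, 5 relevant)."""
    X = rng.uniform(0, 1, (n, 10))
    h = (10 * np.sin(np.pi * X[:, 0] * X[:, 1]) + 20 * (X[:, 2] - 0.5) ** 2
         + 10 * X[:, 3] + 5 * X[:, 4])
    return X, h + rng.normal(0, 1, n), h

# 2000 test cases; h_B(x) equals the noise-free function since eps has zero mean
X_test, _, h_test = friedman(2000)

# Predictions of models learned from 50 learning sets of size 500
preds = np.empty((50, 2000))
for k in range(50):
    X_ls, y_ls, _ = friedman(500)
    preds[k] = DecisionTreeRegressor().fit(X_ls, y_ls).predict(X_test)

avg_model = preds.mean(axis=0)                     # E_LS{y_hat(x)} at each test point
bias2 = np.mean((h_test - avg_model) ** 2)         # E_x{bias^2(x)}
variance = np.mean(preds.var(axis=0))              # E_x{var_LS{y_hat(x)}}
noise = 1.0                                        # var(eps), known by construction
print(f"noise={noise}  bias^2={bias2:.2f}  variance={variance:.2f}  total={noise + bias2 + variance:.2f}")
```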
Complexity of the model

Usually, the bias is a decreasing function of the complexity, while


the variance is an increasing function of the complexity.
Note: the residual error is included in the bias term.

Complexity of the model – Neural networks

Error, bias and variance w.r.t. the number of neurons in the hidden layer.

Complexity of the model – Regression trees

Error, bias and variance w.r.t. the number of test nodes.


Complexity of the model – kNN

Error, bias and variance w.r.t. the number of neighbors.

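In the same spirit as this figure, the protocol above can be rerun with k-nearest neighbours for several values of k. A hedged sketch assuming scikit-learn, on the same synthetic problem; the chosen values of k are arbitrary.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)

def friedman(n):
    X = rng.uniform(0, 1, (n, 10))
    h = (10 * np.sin(np.pi * X[:, 0] * X[:, 1]) + 20 * (X[:, 2] - 0.5) ** 2
         + 10 * X[:, 3] + 5 * X[:, 4])
    return X, h + rng.normal(0, 1, n), h

X_test, _, h_test = friedman(2000)
for k in (1, 5, 25, 100):
    preds = np.stack([KNeighborsRegressor(n_neighbors=k).fit(*friedman(500)[:2]).predict(X_test)
                      for _ in range(50)])
    bias2 = np.mean((h_test - preds.mean(axis=0)) ** 2)
    var = np.mean(preds.var(axis=0))
    print(f"k={k:>3}  bias^2={bias2:.2f}  variance={var:.2f}")
```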
Complexity of the Bayes model

At fixed model complexity, the bias increases with the complexity


of the Bayes model. Nevertheless, the effect on variance is difficult
to predict.

Noise

Variance increases with noise while bias remains mainly unaffected.

Influence of noise on full regression trees.

Learning sample size (i)

At fixed model complexity, bias remains constant and variance decreases


with the learning sample size.

Influence of the learning sample size on linear regression.

Learning sample size (ii)

When the complexity of the model depends on the learning sample size, both bias and variance decrease with the learning sample size.

Influence of the learning sample size on regression trees.
Learning algorithms - Linear regression

Linear regression has few parameters. Therefore, it has a small variance.


However, since the true function is non-linear, it has a high bias.

Learning algorithms - kNN

For small values of 𝑘, there is a high variance and a moderate bias. For
high values of 𝑘, the variance decreases but bias increases.

Learning algorithms - MLP

In both cases, the bias is low. However, variance increases with the
model complexity.

Learning algorithms – Regression tree

Regression trees have a small bias. Indeed, a complex enough tree can approximate any non-linear function. However, they have a high variance.
Bias and variance reduction techniques - Outline

- Introduction
- Dealing with the bias/variance trade-off of one algorithm
- Ensemble methods
Bias and variance reduction techniques

In the context of a given method, adapting the learning algorithm to


find the best trade-off between bias and variance is a solution, but not
a panacea.
Examples: pruning, weight decay.

Ensemble methods change the bias/variance trade-off, but destroy


some features of the original method.
Examples: bagging, boosting.

Variance reduction for a given model (i)

Idea: reduce the ability of the learning algorithm to fit the learning
sample.

- Pruning: reduces the model complexity explicitly.
- Early stopping: reduces the amount of search.
- Regularization: reduces the size of the hypothesis space.
  Example: weight decay in neural networks penalizes high weight values.
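As an illustration of the regularization idea, the sketch below (assuming NumPy and scikit-learn) compares the prediction variance of an unregularized linear model and of a ridge-penalized one fitted on many small learning samples; Ridge's alpha plays the role of the weight-decay strength, and the polynomial features, noise level and sample sizes are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
x_test = np.linspace(0, 1, 200).reshape(-1, 1)

def predictions(make_model, n_ls=200, n=30):
    """Predictions at x_test of models fitted on many small learning samples."""
    preds = []
    for _ in range(n_ls):
        x = rng.uniform(0, 1, (n, 1))
        y = np.sin(2 * np.pi * x[:, 0]) + rng.normal(0, 0.3, n)
        X = np.hstack([x ** d for d in range(1, 10)])      # degree-9 polynomial features
        Xt = np.hstack([x_test ** d for d in range(1, 10)])
        preds.append(make_model().fit(X, y).predict(Xt))
    return np.array(preds)

for name, make in [("no penalty", LinearRegression), ("weight decay (alpha=1)", lambda: Ridge(alpha=1.0))]:
    p = predictions(make)
    print(f"{name:>24}: average variance over x = {np.mean(p.var(axis=0)):.3f}")
```

The penalized model typically shows a much smaller prediction variance, at the price of some additional bias.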
Variance reduction for a given model (ii)

The selection of the optimal level of fitting can be done:

- a priori (not optimal)
- by cross-validation (less efficient):
  → $\mathrm{bias}^2$ ≈ error on the learning set.
  → $E$ ≈ error on an independent test set.
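For instance, the level of fitting of a regression tree (its maximum depth) can be selected by cross-validation. A minimal sketch assuming scikit-learn, with an arbitrary synthetic dataset.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, (500, 10))
y = 10 * np.sin(np.pi * X[:, 0] * X[:, 1]) + rng.normal(0, 1, 500)

# 5-fold cross-validation over candidate depths; the grid values are illustrative
search = GridSearchCV(DecisionTreeRegressor(random_state=0),
                      param_grid={"max_depth": [2, 4, 6, 8, None]},
                      cv=5, scoring="neg_mean_squared_error")
search.fit(X, y)
print("selected depth:", search.best_params_["max_depth"])
```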
Variance reduction for a given model (iv)

Examples:
- Post-pruning of regression trees.
- Early stopping of MLP by cross-validation.

As expected, variance decreases but bias increases.

Ensemble methods (i)

Ensemble methods combine the predictions of several models built with a learning algorithm in order to improve over the use of a single model.

There are two main families:

1. Averaging techniques: they grow several models independently and average their predictions. They mainly decrease variance.
   Examples: bagging, random forests.

2. Boosting-type algorithms: they grow several models sequentially. They mainly decrease bias.
   Examples: AdaBoost, MART.
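A minimal sketch of the averaging idea (assuming NumPy and scikit-learn, on an arbitrary synthetic dataset): bagging fits the same tree learner on bootstrap resamples of the learning sample and averages the predictions, which mainly reduces variance.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, (500, 10))
y = 10 * np.sin(np.pi * X[:, 0] * X[:, 1]) + rng.normal(0, 1, 500)
X_new = rng.uniform(0, 1, (5, 10))

def bagging_predict(X, y, X_new, n_models=100):
    """Average the predictions of trees fitted on bootstrap resamples of (X, y)."""
    preds = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X), len(X))            # bootstrap sample (with replacement)
        tree = DecisionTreeRegressor().fit(X[idx], y[idx])
        preds.append(tree.predict(X_new))
    return np.mean(preds, axis=0)

print(bagging_predict(X, y, X_new))
```

scikit-learn also ships ready-made averaging ensembles (e.g. BaggingRegressor, RandomForestRegressor) that implement the same idea.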
Ensemble methods (ii)

Examples: bagging, boosting, random forests.

Discussion

The notions of bias and variance are very useful to predict how changing the (learning and problem) parameters will affect accuracy. This explains why very simple methods can work much better than more complex ones on very difficult tasks.

Variance reduction is a very important topic: reducing bias is easy, while keeping variance low is not as easy. It is especially important when machine learning is applied in domains that require complex models: time series analysis, computer vision, natural language processing, etc.

Not all learning algorithms are equal in terms of variance. Trees are among the worst methods according to this criterion.
Outline

1. Bias/variance trade-off
2. Performance measures
   - Classification
   - Regression
   - Loss functions for learning
3. Further reading
Performance criteria

Which of these two models is the best? The choice of an error or quality measure is highly application-dependent.
Binary classification (i)

Results can be summarized in a contingency table, also called a confusion matrix.

→ Error rate = (FP + FN) / (P + N)

→ Accuracy = (TP + TN) / (P + N) = 1 − Error rate
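A minimal sketch of these quantities computed from predictions, assuming NumPy; the label vectors are made up for illustration.

```python
import numpy as np

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1, 0, 0])   # 1 = positive, 0 = negative (illustrative)
y_pred = np.array([1, 0, 1, 0, 0, 1, 0, 1, 0, 0])

TP = np.sum((y_true == 1) & (y_pred == 1))
TN = np.sum((y_true == 0) & (y_pred == 0))
FP = np.sum((y_true == 0) & (y_pred == 1))
FN = np.sum((y_true == 1) & (y_pred == 0))

P, N = TP + FN, TN + FP                              # actual positives and negatives
error_rate = (FP + FN) / (P + N)
accuracy = (TP + TN) / (P + N)
print(f"error rate = {error_rate:.2f}, accuracy = {accuracy:.2f}")
```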
Binary classification (ii)

The simplest criterion to compare model performances is the error rate or the accuracy.

Example (two models on the same test sample):

→ Model 1: Error rate = 27%, Accuracy = 73%
→ Model 2: Error rate = 33%, Accuracy = 66%
Limitations of the error rate

This criterion does not convey any information about the error distribution
across classes: the first two models have the same error rates but different error
distributions. It is also sensitive to changes in the class distribution in the test
sample.
Sensitivity and specificity (i)

For medical diagnosis, more appropriate measures are:

- Sensitivity (or recall) = TP / P
- Specificity = TN / (TN + FP) = 1 − FP / N
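With hypothetical confusion-matrix counts, both measures follow directly; a minimal sketch (plain Python, no dependencies):

```python
TP, FN, TN, FP = 40, 10, 45, 5        # hypothetical confusion-matrix counts
sensitivity = TP / (TP + FN)          # = TP / P, also called recall
specificity = TN / (TN + FP)          # = 1 - FP / N
print(sensitivity, specificity)       # 0.8 and 0.9
```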
Sensitivity and specificity (ii)

Sensitivity and specificity (iii)

Determining which model is the best one depends on the application.

Receiver Operating Characteristic (ROC) Curve (i)

Where are:
- The best classifier?
- A classifier that always says positive?
- A classifier that always says negative?
- A classifier that randomly guesses the class?
(ROC) Curve (ii)

Often, the output of a classification algorithm is a number, e.g. a


class probability. In this case, a threshold may be chosen in order to
balance sensitivity and specificity.

A ROC curve plots sensitivity versus 1 − specificity values for


different thresholds.

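A minimal sketch of how such a curve can be traced from scores, assuming NumPy; the scores and labels are made up for illustration.

```python
import numpy as np

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])      # illustrative labels
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9, 0.6, 0.3])  # classifier outputs

points = []
for t in np.sort(np.unique(scores))[::-1]:             # sweep the decision threshold
    y_pred = (scores >= t).astype(int)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    sens = tp / np.sum(y_true == 1)                    # sensitivity (true positive rate)
    fpr = fp / np.sum(y_true == 0)                     # 1 - specificity (false positive rate)
    points.append((fpr, sens))
print(points)                                          # the ROC curve plots sens vs 1 - specificity
```

scikit-learn's roc_curve (in sklearn.metrics) performs the same threshold sweep.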
(ROC) Curve (iii)

(ROC) Curve (iv)

Area under the ROC curve

The area under the ROC curve summarizes a ROC curve by a single number. It can be interpreted as the probability that a randomly drawn positive and a randomly drawn negative are well ordered by the model, i.e. the positive receives a higher score than the negative.

However, it does not tell the whole story.
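A sketch of this pairwise interpretation, assuming NumPy and scikit-learn and reusing the made-up scores above: the fraction of well-ordered positive/negative pairs matches roc_auc_score.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9, 0.6, 0.3])

pos, neg = scores[y_true == 1], scores[y_true == 0]
pairs = pos[:, None] - neg[None, :]                            # all positive/negative score differences
auc_pairs = np.mean(pairs > 0) + 0.5 * np.mean(pairs == 0)     # ties count for one half
print(auc_pairs, roc_auc_score(y_true, scores))                # both give the same value
```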
Precision and recall (i)

Other frequently used measures are:

- Precision = TP / (TP + FP) = proportion of correct predictions among all the positive predictions
- Recall = TP / (TP + FN) = proportion of positives that are detected
- F-measure = 2 × (Precision × Recall) / (Precision + Recall)
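A minimal sketch with hypothetical counts (plain Python, no dependencies):

```python
TP, FP, FN = 40, 15, 10                          # hypothetical confusion-matrix counts
precision = TP / (TP + FP)                       # fraction of positive predictions that are correct
recall = TP / (TP + FN)                          # fraction of actual positives that are detected
f_measure = 2 * precision * recall / (precision + recall)
print(precision, recall, f_measure)              # ~0.727, 0.8, ~0.762
```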
Precision and recall (ii)

Precision/recall vs ROC curve

Regression performance (i)

Regression performance (ii)

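The figures of these two slides are not reproduced. As a reminder of typical regression performance measures (the MAE is referred to on the next slide), a minimal sketch assuming NumPy; the target and prediction vectors are made up.

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0, 4.5])     # illustrative targets
y_pred = np.array([2.8, 5.4, 2.0, 6.5, 5.0])     # illustrative predictions

mae = np.mean(np.abs(y_true - y_pred))                                # mean absolute error
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))                       # root mean squared error
r2 = 1 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)
print(f"MAE = {mae:.3f}, RMSE = {rmse:.3f}, R^2 = {r2:.3f}")
```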
Performance measures for training

Performance measures for training can be different from performance measures for testing. There are several reasons for that:

- Algorithmic:
  - A differentiable measure is amenable to gradient-based optimization.
  - A decomposable measure is amenable to online training.
  Examples: the error rate and the MAE are not differentiable; the AUC is not decomposable.

- Overfitting:
  - For training, the loss function often incorporates a penalty term for model complexity, which is irrelevant at test time.
  - Some measures are less prone to overfitting (e.g. the margin).
Outline

1. Bias/variance trade-off
2. Performance measures
3. Further reading
Further reading

- Hastie et al., chapters 2 and 7 [1]:
  - Bias/variance trade-off (2.5, 2.9, 7.2, 7.3)
  - Model assessment and selection (7.1, 7.2, 7.3, 7.10, 7.11)
References

[1] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Science & Business Media, 2009.

[2] Cross-validation for detecting and preventing overfitting. https://www.autonlab.org/resources/tutorials. Accessed: 2020-10-21.
