ESGB Evaluation Methods
- Classification: $L(y, \hat{y}) = \mathbf{1}(y \neq \hat{y})$ (error rate)
- Regression: $L(y, \hat{y}) = (y - \hat{y})^2$ (square error)
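As a quick sketch, these two losses can be written as plain Python functions (the function names are illustrative, not from the course):

```python
def zero_one_loss(y, y_hat):
    """Classification loss: 1 if the prediction differs from the truth, else 0."""
    return 1 if y != y_hat else 0

def squared_loss(y, y_hat):
    """Regression loss: squared difference between truth and prediction."""
    return (y - y_hat) ** 2
```

Averaging `zero_one_loss` over a test sample gives the error rate; averaging `squared_loss` gives the mean squared error.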
Learning set randomness
Error on the learning sample and test sample for 100 different samples.
Source: [1]
Two quantities of interest
Given a model $\hat{f}_{LS}$ built from some learning sample $LS$, its generalization error is given by:
$E_{x,y}\{L(y, \hat{f}_{LS}(x))\}$
and its expectation over all learning samples is $E = E_{LS}\{E_{x,y}\{L(y, \hat{f}_{LS}(x))\}\}$.
Outline
1. Bias and variance
2. Performance measures
3. Further reading
Bias and variance definitions - Outline
Ø Bias/variance decomposition for a simple regression problem with no input.
Ø Extension to regression problems with inputs.
A simple regression problem with no input (i)
Find an estimation $\hat{y}$ such that the error
$E_y\{(y - \hat{y})^2\}$
over the whole population of Belgian male adults is minimized.
A simple regression problem with no input (ii)
The estimation that minimizes the error can be computed by cancelling its derivative:
$\frac{\partial}{\partial \hat{y}} E_y\{(y - \hat{y})^2\} = 0$
$\Leftrightarrow E_y\{-2(y - \hat{y})\} = 0$
$\Leftrightarrow E_y\{y\} - E_y\{\hat{y}\} = 0$
$\Leftrightarrow \hat{y} = E_y\{y\}$
Hence, the estimation that minimizes the error is $E_y\{y\}$, which is called the Bayes model.
However, in practice, it is impossible to compute the exact value of $E_y\{y\}$, as this would imply measuring the height of every Belgian male adult.
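As a sanity check of this derivation, a small numerical sketch (with hypothetical heights) shows that the value minimizing the mean squared error over a finite population is its mean:

```python
# Hypothetical heights of a tiny "population" (illustrative values only).
heights = [172.0, 178.0, 181.0, 169.0, 185.0]

def mean_squared_error(y_hat):
    """Average squared error of the constant prediction y_hat."""
    return sum((y - y_hat) ** 2 for y in heights) / len(heights)

# Search a fine grid of candidate predictions between 160 and 200 cm.
candidates = [c / 10.0 for c in range(1600, 2001)]
best = min(candidates, key=mean_squared_error)
population_mean = sum(heights) / len(heights)  # 177.0
```

Here `best` coincides with `population_mean`, matching $\hat{y} = E_y\{y\}$.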
Learning algorithms (i)
$LS = \{y_1, \cdots, y_N\}$
Examples:
• $\hat{y}_1 = \frac{1}{N} \sum_{i=1}^{N} y_i$
• $\hat{y}_2 = \frac{\lambda \cdot 180 + \sum_{i=1}^{N} y_i}{\lambda + N}$, $\lambda \in [0, +\infty)$ (if it is known that the height is close to 180)
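A minimal sketch of these two estimators in Python (names are illustrative):

```python
def y_hat_1(ys):
    """Sample mean of the observed heights."""
    return sum(ys) / len(ys)

def y_hat_2(ys, lam):
    """Shrinkage of the sample mean toward 180, controlled by lam >= 0."""
    return (lam * 180.0 + sum(ys)) / (lam + len(ys))
```

With `lam = 0`, `y_hat_2` reduces to the sample mean; as `lam` grows, the estimate is pulled toward 180.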
Learning algorithms (ii)
Bias/variance decomposition (i)
Let us analyse this error in more detail:
$E = E_{LS}\{E_y\{(y - \hat{y})^2\}\}$
$= E_{LS}\{E_y\{((y - E_y\{y\}) + (E_y\{y\} - \hat{y}))^2\}\}$
$= E_{LS}\{E_y\{(y - E_y\{y\})^2\}\} + E_{LS}\{(E_y\{y\} - \hat{y})^2\} + E_{LS}\{2\,E_y\{y - E_y\{y\}\}(E_y\{y\} - \hat{y})\}$
$= E_y\{(y - E_y\{y\})^2\} + E_{LS}\{(E_y\{y\} - \hat{y})^2\} + E_{LS}\{2(E_y\{y\} - E_y\{y\})(E_y\{y\} - \hat{y})\}$
$= E_y\{(y - E_y\{y\})^2\} + E_{LS}\{(E_y\{y\} - \hat{y})^2\}$
since the cross term vanishes ($E_y\{y - E_y\{y\}\} = 0$).
Bias/variance decomposition (ii)
$E = E_y\{(y - E_y\{y\})^2\} + E_{LS}\{(E_y\{y\} - \hat{y})^2\}$
where the first term is $var_y\{y\}$, the irreducible error of the Bayes model, and the second term is the estimation error.
Bias/variance decomposition (iii)
$E_{LS}\{(E_y\{y\} - \hat{y})^2\}$
$= E_{LS}\{((E_y\{y\} - E_{LS}\{\hat{y}\}) + (E_{LS}\{\hat{y}\} - \hat{y}))^2\}$
$= (E_y\{y\} - E_{LS}\{\hat{y}\})^2 + E_{LS}\{(E_{LS}\{\hat{y}\} - \hat{y})^2\} + 2(E_y\{y\} - E_{LS}\{\hat{y}\})\,E_{LS}\{E_{LS}\{\hat{y}\} - \hat{y}\}$
$= (E_y\{y\} - E_{LS}\{\hat{y}\})^2 + E_{LS}\{(\hat{y} - E_{LS}\{\hat{y}\})^2\}$
since $E_{LS}\{E_{LS}\{\hat{y}\} - \hat{y}\} = 0$.
Bias/variance decomposition (iv)
$E = var_y\{y\} + (E_y\{y\} - E_{LS}\{\hat{y}\})^2 + E_{LS}\{(\hat{y} - E_{LS}\{\hat{y}\})^2\}$
= noise + bias$^2$ + variance
where $E_{LS}\{\hat{y}\}$ is the average model, and the bias is the difference between the average model and the Bayes model.
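The decomposition can be checked numerically. The sketch below simulates the shrinkage estimator $\hat{y}_2$ over many learning samples drawn from a hypothetical Normal(175, 10) height distribution (all constants are illustrative) and compares the observed average model and variance with their closed forms:

```python
import random

random.seed(0)
mu, sigma, N, lam = 175.0, 10.0, 20, 5.0  # hypothetical setting

def draw_estimate():
    """Draw one learning sample and return the shrinkage estimate."""
    ys = [random.gauss(mu, sigma) for _ in range(N)]
    return (lam * 180.0 + sum(ys)) / (lam + N)

estimates = [draw_estimate() for _ in range(20000)]
average_model = sum(estimates) / len(estimates)
variance = sum((e - average_model) ** 2 for e in estimates) / len(estimates)

# Closed forms for this estimator:
#   E_LS{y_hat} = (lam*180 + N*mu) / (lam + N)   -> 176.0
#   var_LS{y_hat} = N * sigma**2 / (lam + N)**2  -> 3.2
# so bias = 176.0 - 175.0 = 1.0 and bias^2 = 1.0.
```

The simulated `average_model` and `variance` land close to 176.0 and 3.2, as the closed forms predict.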
Bias/variance decomposition (v)
where $var_{LS}\{\hat{y}\} = E_{LS}\{(\hat{y} - E_{LS}\{\hat{y}\})^2\}$ measures how much the estimate varies from one learning sample to another; a high variance is a consequence of overfitting.
Bias/variance decomposition (vi)
Application to a simple example (i)
Let us first consider a model $\hat{y}_1$ computing the average height over the learning sample:
$\hat{y}_1 = \frac{1}{N} \sum_{i=1}^{N} y_i$
Application to a simple example (ii)
Let us now consider a model where it is known that the height is close to 180:
$\hat{y}_2 = \frac{\lambda \cdot 180 + \sum_{i=1}^{N} y_i}{\lambda + N}$
From these, it can be observed that $\hat{y}_1$ may not be the best estimator because of its variance. There is a bias/variance trade-off with respect to $\lambda$.
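The trade-off can be made concrete with the closed-form estimation error of $\hat{y}_2$, bias$^2(\lambda)$ + variance$(\lambda)$, in a hypothetical setting where the true mean is 178 (so 180 is a good prior guess):

```python
mu, sigma, N = 178.0, 10.0, 10  # hypothetical setting

def estimation_error(lam):
    """bias^2 + variance of the shrinkage estimator, as a function of lam."""
    bias = lam * (180.0 - mu) / (lam + N)
    variance = N * sigma ** 2 / (lam + N) ** 2
    return bias ** 2 + variance

# lam = 0 is the plain sample mean (zero bias, full variance);
# a moderate lam accepts a small bias for a large variance reduction.
```

Here `estimation_error(5.0)` is well below `estimation_error(0.0)`, so the biased estimator wins; with a prior far from the truth the conclusion can reverse.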
Bayesian approach (i)
Let us assume that:
What is the most probable value of $\bar{y}$ after having seen the learning sample?
$\hat{y} = \arg\max_{\bar{y}} P(\bar{y}\,|\,LS)$
Bayesian approach (ii)
Bayesian approach (iii)
$= \arg\min_{\bar{y}} \sum_{i=1}^{N} \frac{(y_i - \bar{y})^2}{2\sigma^2} + \frac{(\bar{y} - 180)^2}{2\sigma_0^2}$
$= \cdots$
$= \frac{\lambda \cdot 180 + \sum_i y_i}{\lambda + N}$
with $\lambda = \frac{\sigma^2}{\sigma_0^2}$
Extension to a problem with inputs (i)
$E = E_{LS}\{E_{x,y}\{(y - \hat{y}(x))^2\}\}$
$= E_x\{E_{LS}\{E_{y|x}\{(y - \hat{y}(x))^2\}\}\}$
Extension to a problem with inputs (ii)
$E_{LS}\{E_{y|x}\{(y - \hat{y}(x))^2\}\} = noise(x) + bias^2(x) + variance(x)$
Ø $noise(x) = E_{y|x}\{(y - h_B(x))^2\}$: the residual error around the Bayes model $h_B(x) = E_{y|x}\{y\}$.
Illustration (i)
Let us consider the following problem:
- A single input $x$, which is a uniform random variable drawn in $[0, 1]$.
- $y = h(x) + \varepsilon$ where $\varepsilon \sim \mathcal{N}(0, 1)$.
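The problem above can be sketched as a small data generator. The true function $h$ is not specified here, so a smooth placeholder stands in for it:

```python
import math
import random

def h(x):
    """Placeholder for the (unspecified) true function."""
    return math.sin(2 * math.pi * x)

def draw_sample(n, seed=0):
    """Draw n pairs (x, y) with x ~ Uniform[0, 1] and y = h(x) + N(0, 1) noise."""
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(n)]
    ys = [h(x) + rng.gauss(0.0, 1.0) for x in xs]
    return xs, ys
```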
Illustration (ii)
A low variance and high bias method leads to underfitting.
Illustration (iii)
A low bias and high variance method leads to overfitting.
Illustration (iv)
Zero noise does not imply zero variance, only a smaller variance.
Parameters that influence bias and variance - Outline
Illustrative problem
Complexity of the model
Complexity of the model – Neural networks
Error, bias and variance w.r.t. the number of neurons in the hidden layer.
Complexity of the model – Regression trees
Complexity of the Bayes model
Noise
Learning sample size (i)
Learning sample size (ii)
Learning algorithms - Linear regression
Learning algorithms - kNN
For small values of $k$, there is a high variance and a moderate bias. For large values of $k$, the variance decreases but the bias increases.
Learning algorithms - MLP
In both cases, the bias is low. However, variance increases with the
model complexity.
Learning algorithms – Regression tree
Regression trees have a small bias: a complex enough tree can approximate any non-linear function. However, they have a high variance.
Bias and variance reduction techniques - Outline
Ø Introduction
Ø Dealing with the bias/variance trade-off of one algorithm
Ø Ensemble methods
Bias and variance reduction techniques
Variance reduction for a given model (i)
Idea: reduce the ability of the learning algorithm to fit the learning
sample.
Variance reduction for a given model (ii)
Variance reduction for a given model (iv)
Examples:
- Post-pruning of regression trees.
- Early stopping of MLP by cross-validation.
Ensemble methods (i)
Ensemble methods (ii)
Discussion
The notions of bias and variance are very useful to predict how changing the learning and problem parameters will affect accuracy. This explains why very simple methods can work much better than more complex ones on very difficult tasks.
Not all learning algorithms are equal in terms of variance. Trees are among the worst methods according to this criterion.
Outline
1. Bias and variance
2. Performance measures
   - Classification
   - Regression
   - Loss functions for learning
3. Further reading
Performance criteria
Binary classification (i)
→ Error rate = $\frac{FP + FN}{P + N}$
→ Accuracy = $\frac{TP + TN}{P + N}$ = 1 − Error rate
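In code, with the usual confusion-matrix counts (TP, TN, FP, FN):

```python
def error_rate(tp, tn, fp, fn):
    """Fraction of misclassified objects."""
    return (fp + fn) / (tp + tn + fp + fn)

def accuracy(tp, tn, fp, fn):
    """Fraction of correctly classified objects; equals 1 - error rate."""
    return (tp + tn) / (tp + tn + fp + fn)
```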
Binary classification (ii)
→ First model: Error rate = 27%, Accuracy = 73%.
→ Second model: Error rate = 33%, Accuracy = 66%.
Limitations of the error rate
This criterion does not convey any information about the error distribution
across classes: the first two models have the same error rates but different error
distributions. It is also sensitive to changes in the class distribution in the test
sample.
Sensitivity and specificity (i)
Sensitivity and specificity (ii)
Sensitivity and specificity (iii)
Receiver Operating Characteristic (ROC) Curve (i)
Where in the ROC space are:
Ø The best classifier?
Ø A classifier that always says
positive?
Ø A classifier that always says
negative?
Ø A classifier that randomly
guesses the class?
(ROC) Curve (ii)
(ROC) Curve (iii)
(ROC) Curve (iv)
Area under the ROC curve
The area under the ROC curve (AUC) summarizes a ROC curve by a single number. It can be interpreted as the probability that a randomly drawn positive object and a randomly drawn negative object are well ordered by the model, i.e. the positive receives a higher score than the negative.
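This pairwise interpretation translates directly into code (ties are counted as half, a common convention):

```python
def auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ordered correctly by score."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(pos_scores) * len(neg_scores))
```

A perfect ranking gives 1.0; identical scores for all objects give 0.5, the value of a random guesser.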
Precision and recall (i)
Ø F-measure = $2 \times \frac{Precision \times Recall}{Precision + Recall}$
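A minimal sketch of precision, recall and the F-measure from confusion-matrix counts:

```python
def precision(tp, fp):
    """Fraction of predicted positives that are truly positive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of true positives that are retrieved."""
    return tp / (tp + fn)

def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)
```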
Precision and recall (ii)
Precision/recall vs ROC curve
Regression performance (i)
Regression performance (ii)
Performance measures for training
Ø Overfitting:
- For training, the loss function often incorporates a penalty term for model complexity, which is irrelevant at test time.
- Some measures are less prone to overfitting (e.g. the margin).
Outline
1. Bias and variance
2. Performance measures
3. Further reading
Further reading
References