08 Eval-Intro Slides


Lecture 08

Model Evaluation 1:
Introduction to Overfitting and Underfitting

STAT 479: Machine Learning, Fall 2018


Sebastian Raschka
http://stat.wisc.edu/~sraschka/teaching/stat479-fs2018/



Overview

Model Eval Lectures:
• Basics: bias and variance, overfitting and underfitting, holdout method
• Confidence intervals: repeated holdout, resampling methods, empirical confidence intervals
• Cross-validation: hyperparameter tuning, model selection
• Statistical tests: algorithm selection
• Evaluation metrics



Overfitting and Underfitting



Overfitting and Underfitting

"Generalization Performance"

• Want a model to "generalize" well to unseen data
  ("high generalization accuracy" or "low generalization error")



Overfitting and Underfitting

Assumptions

• i.i.d. assumption: inputs are independent, and training and test examples are
identically distributed (drawn from the same probability distribution)

• For some random model that has not been fitted to the training set, we
expect both the training and test error to be equal

• The training error or accuracy provides an (optimistically) biased estimate of the generalization performance



Overfitting and Underfitting

Model Capacity

• Underfitting: both training and test error are large

• Overfitting: gap between training and test error (where test error is higher)

• A large hypothesis space being searched by a learning algorithm -> high tendency to overfit (see the sketch below)
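
To make this concrete, here is a minimal sketch (not from the slides) using scikit-learn; the synthetic dataset and the depth values are illustrative assumptions. An unpruned decision tree (unrestricted depth, i.e., a very large hypothesis space) typically fits the training set almost perfectly but shows a larger gap between training and test accuracy than a depth-limited tree.

# Minimal sketch: train/test gap for an unpruned vs. a depth-limited decision tree
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

for max_depth in [None, 3]:  # None = unpruned (large hypothesis space)
    tree = DecisionTreeClassifier(max_depth=max_depth, random_state=1)
    tree.fit(X_train, y_train)
    print(f"max_depth={max_depth}: "
          f"train acc={tree.score(X_train, y_train):.2f}, "
          f"test acc={tree.score(X_test, y_test):.2f}")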





"[...] model has high bias/variance" -- What does that mean?



"[...] model has high bias/variance" -- What does that mean?

Originally formulated for regression



Bias-Variance Decomposition and Trade-off



Bias-Variance Decomposition

• Decomposing the loss into bias and variance helps us understand learning algorithms; these concepts are closely related to underfitting and overfitting

• Helps explain why ensemble methods (last lecture) might perform better
than single models



Bias-Variance Intuition
(Dartboard analogy as a 2x2 grid: low variance = precise, high variance = not precise; low bias = accurate, high bias = not accurate)





Bias and Variance Intuition

High bias

(There are two points where the bias is zero)



Bias and Variance Intuition

High Variance
Why?

(here, I fit an unpruned decision tree)



Bias and Variance Example

y = f(x), where f(x) is some true (target) function

Suppose we have multiple training sets



Bias and Variance Example

High variance



Bias and Variance Example

High variance

What happens if we take the average?


Does this remind you of something?



Terminology

Point estimator θ̂ of some parameter θ

(could also be a function, e.g., the hypothesis is an estimator of some target function)



Terminology

Point estimator θ̂ of some parameter θ

(could also be a function, e.g., the hypothesis is an estimator of some target function)

Bias = E[θ̂] − θ



Bias-Variance Decomposition

General Definition:

Bias(θ̂) = E[θ̂] − θ

Var(θ̂) = E[θ̂²] − (E[θ̂])²
Var(θ̂) = E[(E[θ̂] − θ̂)²]



Bias-Variance Decomposition

General Definition (we ignore noise in this lecture for simplicity):

Bias(θ̂) = E[θ̂] − θ

Var(θ̂) = E[θ̂²] − (E[θ̂])²
Var(θ̂) = E[(E[θ̂] − θ̂)²]



Bias-Variance Decomposition of Squared Error

General Definition and Intuition:

Bias(θ̂) = E[θ̂] − θ
-> The bias is the difference between the average estimator (from different training samples) and the true value. (The expectation is over the training sets.)

Var(θ̂) = E[θ̂²] − (E[θ̂])²
Var(θ̂) = E[(E[θ̂] − θ̂)²]
-> The variance provides an estimate of how much the estimate varies as we vary the training data (e.g., by resampling).
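
As a concrete illustration, here is a minimal simulation sketch (not part of the original slides) that estimates the bias and variance of a point estimator directly from these definitions; the distribution, sample size, and number of repetitions are arbitrary choices. The estimator is the "plug-in" variance (dividing by n), which is a biased estimator of the true variance.

import numpy as np

rng = np.random.default_rng(1)
true_var = 4.0               # theta: variance of the data-generating distribution
n, n_repetitions = 20, 10_000

estimates = np.empty(n_repetitions)
for i in range(n_repetitions):
    sample = rng.normal(loc=0.0, scale=np.sqrt(true_var), size=n)
    estimates[i] = sample.var(ddof=0)        # theta-hat computed from this sample

bias = estimates.mean() - true_var           # Bias(theta-hat) = E[theta-hat] - theta
variance = estimates.var()                   # Var(theta-hat) = E[(E[theta-hat] - theta-hat)^2]
print(f"bias ~ {bias:.3f} (theory: -sigma^2/n = {-true_var / n:.3f}), variance ~ {variance:.3f}")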



Bias-Variance Decomposition

Loss = Bias + Variance + Noise



Bias-Variance Decomposition of Squared Error

General Definition: "ML notation" for the Squared Error Loss:

Bias(θ)̂ = E[θ]̂ − θ
y = f(x) (target, target function)

̂ = h(x)
ŷ = f(x)
( )
2
Var(θ)̂ = E[θ2̂ ] − E[θ ]̂
S = (y − y)̂ 2 (For the sake of
Var(θ)̂ = E[(E[θ]̂ − θ)̂ 2] simplicity, we ignore the
noise term in this lecture)

(Next slides: the expectation is over the training data, i.e, the average
estimator from different training samples)



Bias-Variance Decomposition of Squared Error

"ML notation" for the Squared Error Loss:

y = f(x) (target, target function)

̂ = h(x)
ŷ = f(x)
(x is a particular data point e.g,. in the test set;
S = (y − y)̂ 2 the expectation is over training sets)

S = (y − y)̂ 2
(y − y)̂ 2 = (y − E[y]̂ + E[y]̂ − y)̂ 2
̂ 2 + (E[y]̂ − y)2 + 2(y − E[y])(E[
= (y − E[y]) ̂ y]̂ − y)̂



Bias-Variance Decomposition of Squared Error

S = (y − ŷ)²
(y − ŷ)² = (y − E[ŷ] + E[ŷ] − ŷ)²
         = (y − E[ŷ])² + (E[ŷ] − ŷ)² + 2(y − E[ŷ])(E[ŷ] − ŷ)

E[S] = E[(y − ŷ)²]

E[(y − ŷ)²] = (y − E[ŷ])² + E[(E[ŷ] − ŷ)²]
            = [Bias of the fit]² + Variance of the fit

(The expectation is over the training data, i.e., the average estimator from different training samples)
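
This decomposition can be checked numerically. Below is a minimal sketch (not from the slides) that draws many training sets from a noise-free target function, fits a regression model to each, and compares E[(y − ŷ)²] at a single test point with bias² + variance; the target function, model, and sample sizes are illustrative assumptions.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
f = lambda x: np.sin(x)              # true (target) function; noise is ignored as in the slides
x_test = np.array([[2.0]])
y_test = f(x_test).ravel()[0]

preds = []
for _ in range(500):                 # repeat over many random training sets
    X_train = rng.uniform(0, 2 * np.pi, size=(30, 1))
    y_train = f(X_train).ravel()
    model = DecisionTreeRegressor(max_depth=2).fit(X_train, y_train)
    preds.append(model.predict(x_test)[0])

preds = np.array(preds)
expected_loss = np.mean((y_test - preds) ** 2)   # E[(y - y_hat)^2]
bias_sq = (y_test - preds.mean()) ** 2           # (y - E[y_hat])^2
variance = preds.var()                           # E[(E[y_hat] - y_hat)^2]
print(f"E[loss]={expected_loss:.4f}  bias^2={bias_sq:.4f}  variance={variance:.4f}  "
      f"bias^2+variance={bias_sq + variance:.4f}")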



Bias-Variance Decomposition of Squared Error

S = (y − ŷ)²
(y − ŷ)² = (y − E[ŷ] + E[ŷ] − ŷ)²
         = (y − E[ŷ])² + (E[ŷ] − ŷ)² + 2(y − E[ŷ])(E[ŷ] − ŷ)     ??? (what happens to the cross-term?)

E[S] = E[(y − ŷ)²]

E[(y − ŷ)²] = (y − E[ŷ])² + E[(E[ŷ] − ŷ)²]
            = [Bias]² + Variance



Bias-Variance Decomposition of Squared Error

S = (y − ŷ)²
(y − ŷ)² = (y − E[ŷ] + E[ŷ] − ŷ)²
         = (y − E[ŷ])² + (E[ŷ] − ŷ)² + 2(y − E[ŷ])(E[ŷ] − ŷ)

The cross-term vanishes in expectation (note that y and E[ŷ] are constants with respect to the expectation over training sets):

E[2(y − E[ŷ])(E[ŷ] − ŷ)] = 2E[(y − E[ŷ])(E[ŷ] − ŷ)]
                          = 2(y − E[ŷ]) E[E[ŷ] − ŷ]
                          = 2(y − E[ŷ]) (E[E[ŷ]] − E[ŷ])
                          = 2(y − E[ŷ]) (E[ŷ] − E[ŷ])
                          = 0



Domingos, P. (2000). A unified bias-variance decomposition.
In Proceedings of 17th International Conference on Machine Learning
(pp. 231-238).

"several authors have proposed bias-variance decompositions related to zero-


one loss (Kong & Dietterich, 1995; Breiman, 1996b; Kohavi & Wolpert, 1996;
Tibshirani, 1996; Friedman, 1997). However, each of these decompositions has
significant shortcomings."



Bias-Variance Decomposition of 0-1 Loss
Dietterich, T. G., & Kong, E. B. (1995). Machine learning bias, statistical bias, and statistical variance
of decision tree algorithms. Technical report, Department of Computer Science, Oregon State
University.

Domingos, P. (2000). A unified bias-variance decomposition. In Proceedings of the 17th International Conference on Machine Learning (pp. 231-238).

Squared loss:       (y − ŷ)²      expected: E[(y − ŷ)²]
Generalized loss:   L(y, ŷ)       expected: E[L(y, ŷ)]



Bias-Variance Decomposition of 0-1 Loss
Dietterich, T. G., & Kong, E. B. (1995). Machine learning bias, statistical bias, and statistical variance
of decision tree algorithms. Technical report, Department of Computer Science, Oregon State
University.

Domingos, P. (2000). A unified bias-variance decomposition. In Proceedings of the 17th International Conference on Machine Learning (pp. 231-238).

Squared loss:       (y − ŷ)²      expected: E[(y − ŷ)²]
Generalized loss:   L(y, ŷ)       expected: E[L(y, ŷ)]

E[(y − ŷ)²] = (y − E[ŷ])² + E[(E[ŷ] − ŷ)²]
            = Bias² + Variance



Bias-Variance Decomposition of 0-1 Loss
Dietterich, T. G., & Kong, E. B. (1995). Machine learning bias, statistical bias, and statistical variance
of decision tree algorithms. Technical report, Department of Computer Science, Oregon State
University.

Domingos, P. (2000). A unified bias-variance decomposition. In Proceedings of the 17th International Conference on Machine Learning (pp. 231-238).

Squared loss:       (y − ŷ)²      expected: E[(y − ŷ)²]
Generalized loss:   L(y, ŷ)       expected: E[L(y, ŷ)]

E[(y − ŷ)²] = (y − E[ŷ])² + E[(E[ŷ] − ŷ)²]
            = Bias² + Variance

Bias²:      (y − E[ŷ])²           generalized: L(y, E[ŷ])
Variance:   E[(E[ŷ] − ŷ)²]        generalized: E[L(ŷ, E[ŷ])]



Define "Main Prediction"
Dietterich, T. G., & Kong, E. B. (1995). Machine learning bias, statistical bias, and statistical variance
of decision tree algorithms. Technical report, Department of Computer Science, Oregon State
University.

Domingos, P. (2000). A unified bias-variance decomposition. In Proceedings of the 17th International Conference on Machine Learning (pp. 231-238).

The main prediction is the prediction that minimizes the average loss:

ȳ̂ = argmin_ŷ′ E[L(ŷ, ŷ′)]

For squared loss -> Mean
For 0-1 loss -> Mode
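
A small sketch (not from the slides; the prediction values are made up) of the main prediction for one test point, given predictions from models trained on different training sets:

import numpy as np

preds_regression = np.array([2.1, 2.4, 1.9, 2.2, 2.0])    # predictions from 5 regression models
preds_classification = np.array([0, 1, 1, 1, 0])          # predictions from 5 classifiers

main_pred_squared_loss = preds_regression.mean()                 # mean minimizes the average squared loss
main_pred_01_loss = np.bincount(preds_classification).argmax()   # mode minimizes the average 0-1 loss
print(main_pred_squared_loss, main_pred_01_loss)                 # -> 2.12 and 1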



Bias-Variance Decomposition of 0-1 Loss
Dietterich, T. G., & Kong, E. B. (1995). Machine learning bias, statistical bias, and statistical variance
of decision tree algorithms. Technical report, Department of Computer Science, Oregon State
University.

Domingos, P. (2000). A unified bias-variance decomposition. In Proceedings of the 17th International Conference on Machine Learning (pp. 231-238).

Squared loss:       (y − ŷ)²      expected: E[(y − ŷ)²]
0-1 loss:           L(y, ŷ)       expected: E[L(y, ŷ)]

E[(y − ŷ)²] = (y − E[ŷ])² + E[(E[ŷ] − ŷ)²]
            = Bias² + Variance

Main prediction:    Mean (squared loss)        Mode (0-1 loss)

Bias²:      (y − E[ŷ])²           0-1 loss: L(y, E[ŷ])
Variance:   E[(E[ŷ] − ŷ)²]        0-1 loss: E[L(ŷ, E[ŷ])]



Bias-Variance Decomposition of 0-1 Loss
Dietterich, T. G., & Kong, E. B. (1995). Machine learning bias, statistical bias, and statistical variance
of decision tree algorithms. Technical report, Department of Computer Science, Oregon State
University.

Domingos, P. (2000). A unified bias-variance decomposition. In Proceedings of the 17th International Conference on Machine Learning (pp. 231-238).

Squared loss:
  E[(y − ŷ)²]
  Main prediction -> Mean
  Bias²:      (y − E[ŷ])²
  Variance:   E[(E[ŷ] − ŷ)²]

0-1 loss:
  E[L(y, ŷ)] = P(y ≠ ŷ)
  Main prediction -> Mode (= ȳ̂)
  Bias = L(y, ȳ̂) = 1 if y ≠ ȳ̂, 0 otherwise
  Variance = E[L(ŷ, ȳ̂)] = P(ŷ ≠ ȳ̂)
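
These definitions are easy to simulate. A minimal sketch (not from the slides; the prediction vector is made up) that computes 0-1 loss, bias, and variance at a single test point of a two-class problem:

import numpy as np

y_true = 1
preds = np.array([1, 1, 0, 1, 1, 0, 1, 1, 1, 0])    # predictions from 10 models (different training sets)

main_pred = np.bincount(preds).argmax()              # mode = main prediction
loss = np.mean(preds != y_true)                      # P(y_hat != y)
bias = int(main_pred != y_true)                      # 1 if y != main prediction, 0 otherwise
variance = np.mean(preds != main_pred)               # P(y_hat != main prediction)

# In the noise-free two-class case: loss = bias + variance if bias == 0,
# and loss = bias - variance if bias == 1 (see the following slides).
print(loss, bias, variance)                          # -> 0.3 0 0.3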



Bias-Variance Decomposition of 0-1 Loss
Dietterich, T. G., & Kong, E. B. (1995). Machine learning bias, statistical bias, and statistical variance
of decision tree algorithms. Technical report, Department of Computer Science, Oregon State
University.

Domingos, P. (2000). A unified bias-variance decomposition. In Proceedings of the 17th International Conference on Machine Learning (pp. 231-238).

0-1 loss:   Loss = Bias + Variance = P(ŷ ≠ y)

Bias = 1 if y ≠ ȳ̂, 0 otherwise
Variance = P(ŷ ≠ ȳ̂)

If the bias is zero (ȳ̂ = y):   Loss = Variance = P(ŷ ≠ y)



Bias-Variance Decomposition of 0-1 Loss
Dietterich, T. G., & Kong, E. B. (1995). Machine learning bias, statistical bias, and statistical variance
of decision tree algorithms. Technical report, Department of Computer Science, Oregon State
University.

Domingos, P. (2000). A unified bias-variance decomposition. In Proceedings of the 17th International Conference on Machine Learning (pp. 231-238).

0-1 loss:   Loss = P(ŷ ≠ y)

Bias = 1 if y ≠ ȳ̂, 0 otherwise

Loss = P(ŷ ≠ y) = 1 − P(ŷ = y)



Bias-Variance Decomposition of 0-1 Loss
Dietterich, T. G., & Kong, E. B. (1995). Machine learning bias, statistical bias, and statistical variance
of decision tree algorithms. Technical report, Department of Computer Science, Oregon State
University.

Domingos, P. (2000). A unified bias-variance decomposition. In Proceedings of the 17th International Conference on Machine Learning (pp. 231-238).

0-1 loss:   Loss = P(ŷ ≠ y)

Bias = 1 if y ≠ ȳ̂, 0 otherwise

If the bias is one (ȳ̂ ≠ y, two-class case):
Loss = P(ŷ ≠ y) = 1 − P(ŷ = y) = 1 − P(ŷ ≠ ȳ̂)



Bias-Variance Decomposition of 0-1 Loss
Dietterich, T. G., & Kong, E. B. (1995). Machine learning bias, statistical bias, and statistical variance
of decision tree algorithms. Technical report, Department of Computer Science, Oregon State
University.

Domingos, P. (2000). A unified bias-variance decomposition. In Proceedings of the 17th International Conference on Machine Learning (pp. 231-238).

0-1 loss:   Loss = P(ŷ ≠ y)

Bias = 1 if y ≠ ȳ̂, 0 otherwise

If the bias is one (ȳ̂ ≠ y, two-class case):
Loss = P(ŷ ≠ y) = 1 − P(ŷ = y) = 1 − P(ŷ ≠ ȳ̂)
Loss = Bias − Variance



Bias-Variance Decomposition of 0-1 Loss
Dietterich, T. G., & Kong, E. B. (1995). Machine learning bias, statistical bias, and statistical variance
of decision tree algorithms. Technical report, Department of Computer Science, Oregon State
University.

Domingos, P. (2000). A unified bias-variance decomposition. In Proceedings of the 17th International Conference on Machine Learning (pp. 231-238).

0-1 loss:   Loss = P(ŷ ≠ y)

Bias = 1 if y ≠ ȳ̂, 0 otherwise

Variance can improve the loss! Why is that so?

If the bias is one (ȳ̂ ≠ y, two-class case):
Loss = P(ŷ ≠ y) = 1 − P(ŷ = y) = 1 − P(ŷ ≠ ȳ̂)
Loss = Bias − Variance
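
Illustrative numbers (not from the slides): suppose the main prediction at some point x is wrong, so Bias = 1, and the individual models disagree with the main prediction 30% of the time, so Variance = P(ŷ ≠ ȳ̂) = 0.3. Then Loss = Bias − Variance = 1 − 0.3 = 0.7. With zero variance, every model would repeat the wrong main prediction and the loss would be 1, so here the variance actually lowers the expected 0-1 loss.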



Statistical Bias vs "Machine Learning Bias"

• "Machine learning bias" sometimes also called "inductive bias"

• e.g., decision tree algorithms consider small trees before they consider large trees (if the training data can be classified by a small tree, larger trees are not considered)



Hypothesis Space (From Lecture 1)

• Entire hypothesis space (e.g., decision tree + KNN)
  • Hypothesis space a particular learning algorithm category has access to (e.g., decision tree)
    • Hypothesis space a particular learning algorithm can sample (e.g., ID3)
      • Particular hypothesis (i.e., a model/classifier)



Dietterich, T. G., & Kong, E. B. (1995). Machine learning bias, statistical bias, and statistical variance of decision tree
algorithms. Technical report, Department of Computer Science, Oregon State University.



Bias-Variance Simulation of C 4.5
Dietterich, T. G., & Kong, E. B. (1995). Machine learning bias, statistical bias, and statistical variance of decision tree
algorithms. Technical report, Department of Computer Science, Oregon State University.

• simulation on 200 training sets with 200 examples each (0-1 labels)
• 200 hypotheses
• test set: 22,801 examples (1 data point for each grid point)

• mean error rate is 536 errors (out of the 22,801 test examples)
• 297 as a result of bias
• 239 as a result of variance
(Figure: two-class problem with classes c0 and c1 separated by a diagonal boundary, plotted over x1 and x2 on a 0-14 grid; remember that trees use a "staircase" to approximate diagonal boundaries)



Bias-Variance Simulation of C 4.5
Dietterich, T. G., & Kong, E. B. (1995). Machine learning bias, statistical bias, and statistical variance of decision tree
algorithms. Technical report, Department of Computer Science, Oregon State University.

errors due to bias: 1788
errors due to variance: 1046



Bias-Variance Simulation of C 4.5
Dietterich, T. G., & Kong, E. B. (1995). Machine learning bias, statistical bias, and statistical variance of decision tree
algorithms. Technical report, Department of Computer Science, Oregon State University.

errors due to bias: 1788
errors due to variance: 1046

why?



Bias-Variance Simulation of C 4.5
Dietterich, T. G., & Kong, E. B. (1995). Machine learning bias, statistical bias, and statistical variance of decision tree
algorithms. Technical report, Department of Computer Science, Oregon State University.

errors due to bias: 0
errors due to variance: 17



Bias-Variance Simulation of k-Nearest Neighbors

Domingos, P. (2000). A unified bias-variance decomposition. In Proceedings of 17th International Conference on Machine Learning (pp. 231-238).

(Figure 4 from Domingos (2000): loss (L), bias (B), and net variance (V), plus unbiased (Vu) and biased (Vb) variance, in %, as a function of k in k-nearest neighbors; audiology dataset (top) and chess dataset (bottom))



Recommended Reading Resources
for the Bias-Variance Decomposition

Dietterich, T. G., & Kong, E. B. (1995). Machine learning bias, statistical bias, and statistical variance
of decision tree algorithms. Technical report, Department of Computer Science, Oregon State
University.
-> 0-1 loss

Domingos, P. (2000). A unified bias-variance decomposition. In Proceedings of the 17th International Conference on Machine Learning (pp. 231-238).

-> includes noise and is more general: Loss = Bias + c Variance,
   or more precisely, Loss = c1 N(x) + B(x) + c2 V(x),
   where, e.g., c1 = c2 = 1 for the squared loss
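
(Connecting this back to the earlier 0-1 loss slides: in the noise-free two-class case, c2 = 1 where the bias is zero and c2 = −1 where the bias is one, which recovers Loss = Bias + Variance and Loss = Bias − Variance from above.)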



Next Lecture: holdout method and confidence intervals

Model Eval Lectures:
• Basics: bias and variance, overfitting and underfitting, holdout method
• Confidence intervals: repeated holdout, resampling methods, empirical confidence intervals
• Cross-validation: hyperparameter tuning, model selection
• Statistical tests: algorithm selection
• Evaluation metrics
