08 Eval-Intro Slides
Model Evaluation 1:
Introduction to Overfitting and Underfitting
"Generalization Performance"
Assumptions
• i.i.d. assumption: training and test examples are independent and identically
distributed (drawn from the same probability distribution)
• For some random model that has not been fitted to the training set, we
expect both the training and test error to be equal
Model Capacity
• Overfitting: gap between training and test error (where test error is higher)
• Helps explain why ensemble methods (last lecture) might perform better
than single models
[Figure: dartboard illustration of bias and variance — low bias (accurate) vs. high bias (not accurate), combined with low vs. high variance (how scattered the predictions are)]
General definition:

Bias(θ̂) = E[θ̂] − θ

Var(θ̂) = E[θ̂²] − (E[θ̂])²

The variance provides an estimate of how much the estimate
varies as we vary the training data (e.g., by resampling).
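These definitions can be checked numerically. The sketch below is an illustrative setup (not from the slides): it resamples many training sets, applies the maximum-likelihood variance estimator to each, and compares the empirical bias and variance against E[θ̂] − θ and E[θ̂²] − (E[θ̂])².

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 4.0            # true parameter: the variance of the data distribution
n, repeats = 20, 100_000

# Draw many independent "training sets" of size n and compute the
# maximum-likelihood variance estimator (divides by n, hence biased) on each.
data = rng.normal(0.0, np.sqrt(theta), size=(repeats, n))
estimates = data.var(axis=1)

bias = estimates.mean() - theta                             # Bias(θ̂) = E[θ̂] − θ
variance = (estimates ** 2).mean() - estimates.mean() ** 2  # Var(θ̂)

print(f"empirical bias     ~ {bias:.3f}  (theory: {-theta / n:.3f})")
print(f"empirical variance ~ {variance:.3f}")
```

The empirical bias matches the known −σ²/n of this estimator; resampling with a larger n shrinks both the bias and the variance.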
y = f(x)   (target; f is the target function)
ŷ = ĥ(x)   (the model's prediction at x)

S = (y − ŷ)²   (squared-error loss; for the sake of
simplicity, we ignore the noise term in this lecture)

Equivalent form of the variance: Var(θ̂) = E[(θ̂ − E[θ̂])²]

(Next slides: the expectation is over the training data, i.e., the average
estimator from different training samples; x is a particular data point,
e.g., in the test set.)
Adding and subtracting E[ŷ] inside the square:

(y − ŷ)² = (y − E[ŷ] + E[ŷ] − ŷ)²
         = (y − E[ŷ])² + (E[ŷ] − ŷ)² + 2(y − E[ŷ])(E[ŷ] − ŷ)

The cross term vanishes in expectation, because y and E[ŷ] are constants
with respect to the expectation over training sets:

E[2(y − E[ŷ])(E[ŷ] − ŷ)] = 2(y − E[ŷ]) E[E[ŷ] − ŷ]
                         = 2(y − E[ŷ]) (E[ŷ] − E[ŷ])
                         = 0
Taking the expectation of the squared loss:

E[(y − ŷ)²] = (y − E[ŷ])² + E[(E[ŷ] − ŷ)²]
            = Bias² + Variance

For a general loss, we consider E[L(y, ŷ)]:
• Bias term: L(y, ȳ̂), where ȳ̂ is the main prediction — for squared loss, (y − E[ŷ])²
• The main prediction is the prediction that minimizes the average loss
• Squared loss → main prediction is the mean; 0-1 loss → main prediction is the mode
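The decomposition can be verified numerically at a single test point. The sketch below is illustrative (the target function, noise level, and model are arbitrary choices): it fits a straight line to many resampled training sets and checks that the average squared error at a fixed x equals bias² plus variance.

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):                        # target function (an arbitrary choice here)
    return np.sin(x)

x_test = 2.0
y_test = f(x_test)               # noise-free target at the test point
n, repeats = 30, 2000

preds = np.empty(repeats)
for r in range(repeats):
    # A new training set -> a new fitted line -> a new prediction at x_test.
    x_tr = rng.uniform(0.0, np.pi, n)
    y_tr = f(x_tr) + rng.normal(0.0, 0.1, n)
    slope, intercept = np.polyfit(x_tr, y_tr, deg=1)
    preds[r] = slope * x_test + intercept

expected_loss = np.mean((y_test - preds) ** 2)   # E[(y − ŷ)²]
bias_sq = (y_test - preds.mean()) ** 2           # (y − E[ŷ])²
variance = np.mean((preds - preds.mean()) ** 2)  # E[(ŷ − E[ŷ])²]

print(expected_loss, bias_sq + variance)         # the two sides agree
```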
For the 0-1 loss, let ȳ̂ be the main prediction (the mode of the predictions
over training sets):

Bias = 1 if y ≠ ȳ̂, 0 otherwise
Variance = P(ŷ ≠ ȳ̂)
Loss = P(ŷ ≠ y) = 1 − P(ŷ = y)
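A quick numerical illustration of these definitions (a made-up setting: the "classifiers" are simulated as coin flips rather than actually trained). We collect the predictions a model would make at one test point across many training sets, take the mode as the main prediction, and read off bias, variance, and loss.

```python
import numpy as np

rng = np.random.default_rng(2)

y = 1                                   # true label at the test point
# Simulated predictions over 1,000 training sets: predict 1 w.p. 0.7.
preds = rng.binomial(1, 0.7, size=1000)

# Main prediction = mode of the predictions (0-1 labels -> majority vote).
main_pred = int(preds.mean() > 0.5)

bias = int(y != main_pred)              # 1 if y != main prediction, else 0
variance = np.mean(preds != main_pred)  # P(ŷ != ȳ̂)
loss = np.mean(preds != y)              # P(ŷ != y) = 1 − P(ŷ = y)

print(bias, variance, loss)
```

Here the main prediction is correct (bias = 0), so the loss equals the variance; when the main prediction is wrong the relationship flips, which is why the general decomposition needs a coefficient in front of the variance.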
• e.g., decision tree algorithms consider small trees before they consider large trees
(if the training data can be classified by a small tree, large trees are not considered)

• Hypothesis space a particular learning algorithm category has access to
(e.g., decision trees)
• Hypothesis space a particular learning algorithm can sample (e.g., ID3)
• A particular hypothesis (i.e., a model/classifier)
• simulation on 200 training sets with 200 examples each (0-1 labels)
• 200 hypotheses
• test set: 22,801 examples (1 data point for each grid point)
• mean error count: 536 errors (out of the 22,801 test examples)
• 297 as a result of bias
• 239 as a result of variance
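A scaled-down sketch of this kind of experiment (not the exact setup from the slides: a synthetic diagonal boundary, 50 training sets, and scikit-learn's DecisionTreeClassifier stand in for the original data and learner):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(4)

def make_data(n):
    # Diagonal class boundary, which trees must approximate with
    # axis-parallel "staircase" splits.
    X = rng.uniform(0, 14, size=(n, 2))
    y = (X[:, 0] > X[:, 1]).astype(int)
    return X, y

X_test, y_test = make_data(2000)
all_preds = []

for _ in range(50):                       # 50 training sets of 200 examples
    X_tr, y_tr = make_data(200)
    tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    all_preds.append(tree.predict(X_test))

preds = np.array(all_preds)                            # (n_sets, n_test)
main_pred = (preds.mean(axis=0) > 0.5).astype(int)     # per-point mode
bias = (main_pred != y_test).mean()                    # fraction of biased points
variance = (preds != main_pred).mean()                 # avg P(ŷ != mode)
loss = (preds != y_test).mean()                        # avg 0-1 loss

print(f"loss={loss:.3f}  bias={bias:.3f}  variance={variance:.3f}")
```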
[Figure: learned decision boundaries for classes c0 and c1 over the (x1, x2) grid — remember that trees use a "staircase" to approximate diagonal boundaries. Why?]
Domingos, P. (2000). A unified bias-variance decomposition. In Proceedings of 17th International Conference on Machine Learning (pp. 231-238).
[Figure 4: Effect of varying k in k-nearest neighbor: audiology (top) and chess (bottom). Loss (%) vs. k, with curves L = Loss, B = Bias, V = (net) Variance, and the variance components Vu and Vb]
Dietterich, T. G., & Kong, E. B. (1995). Machine learning bias, statistical bias, and statistical variance
of decision tree algorithms. Technical report, Department of Computer Science, Oregon State
University.
• For the 0-1 loss, the full decomposition also includes a noise term (ignored here)
• More generally: Loss = Bias + c · Variance, where the coefficient c depends on the
loss (and, for the 0-1 loss, on whether the prediction is biased)
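To see why c can differ from 1 for the 0-1 loss, consider a test point where the classifier is biased (its main prediction is wrong). In the toy simulation below (purely illustrative), raising the probability of deviating from the wrong mode, i.e., the variance, lowers the expected loss:

```python
import numpy as np

rng = np.random.default_rng(3)
y = 1                                    # true label at the test point

# The main prediction (mode) is 0, which is wrong -> bias = 1.
# Deviating from the mode means predicting 1, i.e., the correct label.
losses = []
for p_deviate in (0.0, 0.2, 0.4):        # P(ŷ != mode) = variance
    preds = rng.binomial(1, p_deviate, size=100_000)
    loss = np.mean(preds != y)           # 0-1 loss: P(ŷ != y)
    losses.append(loss)
    print(f"variance={p_deviate:.1f}  loss={loss:.3f}")
```

On biased points, variance partially cancels the bias, so its effective coefficient in the loss is negative.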