Lecture 4 - Bias-Variance Trade-Off and Model Selection
Marcello Restelli
2 Bias-Variance Tradeoff
3 Model Selection
    Feature Selection
    Regularization
    Dimension Reduction
4 Model Ensembles
    Bagging
    Boosting
Proof Sketch.
Given any training set D:
For every concept $f$ where $\text{Acc}_G(L) = 0.5 + \delta$,
there is a concept $f'$ where $\text{Acc}_G(L) = 0.5 - \delta$:
$\forall x \in D,\ f'(x) = f(x) = y$; $\forall x \notin D,\ f'(x) \neq f(x)$
“No Free Lunch” Theorems
Theorem
For any learner L, averaging over all concepts $f \in F$:
$$\frac{1}{|F|} \sum_{f \in F} \text{Acc}_G(L) = \frac{1}{2}$$
(given any distribution P over x and training set size N)
Corollary
For any two learners $L_1$, $L_2$:
if $\exists$ a learning problem s.t. $\text{Acc}_G(L_1) > \text{Acc}_G(L_2)$,
then $\exists$ a learning problem s.t. $\text{Acc}_G(L_2) > \text{Acc}_G(L_1)$
Bias-Variance Decomposition
Assume that we have a data set D with N samples generated by a function plus noise:
$$t_i = f(x_i) + \varepsilon, \qquad E[\varepsilon] = 0,\quad Var[\varepsilon] = \sigma^2$$
We want to find a model y(x) that approximates f as well as possible
Let's consider the expected squared error on an unseen sample x:
$$\begin{aligned}
E[(t - y(x))^2] &= E[t^2 + y(x)^2 - 2\,t\,y(x)] \\
&= E[t^2] + E[y(x)^2] - E[2\,t\,y(x)] \\
&= E[t^2] \pm E[t]^2 + E[y(x)^2] \pm E[y(x)]^2 - 2\,f(x)\,E[y(x)] \\
&= Var[t] + E[t]^2 + Var[y(x)] + E[y(x)]^2 - 2\,f(x)\,E[y(x)] \\
&= \underbrace{Var[t]}_{\sigma^2} + \underbrace{Var[y(x)]}_{\text{Variance}} + \underbrace{(f(x) - E[y(x)])^2}_{\text{Bias}^2}
\end{aligned}$$
(using $E[t] = f(x)$, the independence of the noise on $t$ from the training set used to learn $y(x)$, and adding and subtracting $E[t]^2$ and $E[y(x)]^2$)
Expectation is performed over different realizations of the training set D
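This decomposition can be checked numerically: resample the training set many times, refit the model, and measure the offset and spread of its predictions at a fixed point. Below is a minimal Python sketch of that experiment; the true function $f(x) = \sin(x)$, the noise level, and the polynomial models are illustrative assumptions, not part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.3        # noise standard deviation, so Var[t] = sigma^2 = 0.09
f = np.sin         # assumed true function f(x)
x0 = 1.0           # fixed unseen point x

def fit_and_predict(degree):
    """Sample a fresh training set D, fit a polynomial, and predict y(x0)."""
    x = rng.uniform(0, 2 * np.pi, size=25)
    t = f(x) + rng.normal(0, sigma, size=25)
    return np.polyval(np.polyfit(x, t, degree), x0)

for degree in (1, 3, 9):
    # Expectation taken over many realizations of the training set D
    preds = np.array([fit_and_predict(degree) for _ in range(2000)])
    bias2 = (f(x0) - preds.mean()) ** 2   # (f(x) - E[y(x)])^2
    var = preds.var()                     # Var[y(x)]
    print(f"degree {degree}: bias^2={bias2:.4f} var={var:.4f} "
          f"total={sigma**2 + bias2 + var:.4f}")
```

As the degree grows, the bias squared term shrinks while the variance grows, which is the tradeoff described next.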
Bias-Variance Decomposition: Graphical Definition
Bias
If you sample a data set D multiple times, you expect to learn a different y(x).
The expected hypothesis is E[y(x)].
Bias: the difference between the truth and what you expect to learn:
$$\text{bias}^2 = \int \left(f(x) - E[y(x)]\right)^2 p(x)\, dx$$
It decreases with more complex models.
Variance
Variance: the difference between what you learn from a particular data set and what you expect to learn:
$$\text{variance} = \int E\left[\left(y(x) - \bar{y}(x)\right)^2\right] p(x)\, dx, \qquad \bar{y}(x) = E[y(x)]$$
It decreases with simpler models and with more samples.
Bias-Variance Decomposition: Example
For a $K$-nearest-neighbor regression model, where $x_1, \ldots, x_K$ denote the $K$ nearest neighbors of $x$:
$$E[(t - y(x))^2] = \sigma^2 + \frac{\sigma^2}{K} + \left(f(x) - \frac{1}{K}\sum_{i=1}^{K} f(x_i)\right)^2$$
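As a quick check of the $\sigma^2/K$ variance term, this hypothetical sketch builds the $K$-nearest-neighbor prediction by hand (an average of $K$ noisy targets at fixed inputs) and compares its empirical variance to $\sigma^2/K$; the data setup is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(1)
x_train = np.linspace(0.0, 1.0, 200)   # fixed inputs; only the noise resamples
f = lambda x: x ** 2                   # assumed true function
sigma = 0.5
x0 = 0.5

for K in (1, 5, 25):
    idx = np.argsort(np.abs(x_train - x0))[:K]   # the K nearest neighbors of x0
    preds = []
    for _ in range(5000):
        t = f(x_train) + rng.normal(0, sigma, x_train.size)
        preds.append(t[idx].mean())              # kNN prediction: mean of K targets
    print(f"K={K}: empirical variance={np.var(preds):.4f}, sigma^2/K={sigma**2/K:.4f}")
```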
Training Error
Classification:
$$L_{train} = \frac{1}{N} \sum_{n=1}^{N} I\left(t_n \neq y(x_n)\right)$$
Prediction Error
Classification:
$$L_{true} = \int I\left(f(x) \neq y(x)\right) p(x)\, dx$$
Train-Test
In practice:
- Randomly divide the data set into test and train sets
- Use the training data to optimize parameters
- Use the test data to evaluate the prediction error:
$$L_{test} = \frac{1}{N_{test}} \sum_{n=1}^{N_{test}} \left(t_n - y(x_n)\right)^2$$
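A minimal NumPy sketch of this procedure; the data-generating process (a linear model with Gaussian noise) and the 70/30 split are made-up illustrations:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 3))
t = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.1, size=100)

# Randomly divide the data set into train and test (70/30)
perm = rng.permutation(len(t))
train_idx, test_idx = perm[:70], perm[70:]

# "Optimize parameters": here, ordinary least squares on the training data
w, *_ = np.linalg.lstsq(X[train_idx], t[train_idx], rcond=None)

# Evaluate L_test on the held-out test data
L_test = np.mean((t[test_idx] - X[test_idx] @ w) ** 2)
print(f"test MSE: {L_test:.4f}")
```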
[Figures: train-test error curves illustrating the bias-variance tradeoff as model complexity varies]
Curse of Dimensionality
- Feature Selection: identifies a subset of the input features that are most related to the output
- Regularization: all the input features are used, but the estimated coefficients are shrunk towards zero, thus reducing the variance
- Dimension Reduction: the input variables are projected into a lower-dimensional subspace
1. Let $M_0$ denote the null model, which contains no input features (it simply predicts the sample mean for each observation)
2. For $k = 1, 2, \ldots, M$:
   1. Fit all $\binom{M}{k}$ models that contain exactly $k$ features
   2. Pick the best among these $\binom{M}{k}$ models and call it $M_k$. Here best is defined as having the smallest RSS, or equivalently the largest $R^2$
3. Select a single best model from among $M_0, \ldots, M_M$ using some criterion (cross-validation, AIC, BIC, ...); see the sketch below
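A Python sketch of steps 1-2 above, assuming linear regression fits and scikit-learn; the helper name best_subset is mine, not a standard API:

```python
from itertools import combinations

import numpy as np
from sklearn.linear_model import LinearRegression

def best_subset(X, t):
    """For each subset size k, return the model M_k with the smallest RSS."""
    M = X.shape[1]
    best = {0: ((), float(np.sum((t - t.mean()) ** 2)))}  # M_0 predicts the sample mean
    for k in range(1, M + 1):
        scored = []
        for subset in combinations(range(M), k):          # all (M choose k) subsets
            cols = list(subset)
            model = LinearRegression().fit(X[:, cols], t)
            rss = float(np.sum((t - model.predict(X[:, cols])) ** 2))
            scored.append((subset, rss))
        best[k] = min(scored, key=lambda s: s[1])         # smallest RSS = largest R^2
    return best

# Step 3 -- choosing among M_0, ..., M_M -- must use cross-validation,
# AIC, BIC, etc., since training RSS always favors the largest model.
```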
- The model containing all the features will always have the smallest training error
- We wish to choose a model with a low test error, not a low training error
- Therefore, RSS and $R^2$ are not suitable for selecting the best model among a collection of models with different numbers of features
- There are two approaches to estimating the test error:
  - Direct estimation using a validation approach
  - Making an adjustment to the training error to account for model complexity
Validation Set
So far, we have considered randomly splitting the data set into two parts:
- Training data
- Test data
But the test data must always remain independent! Never ever learn on the test data, including for model selection.
Given a data set, randomly split it into three parts:
- Training data
- Validation data
- Test data
Use the validation data for tuning the learning algorithm (e.g., model selection), and save the test data for the very final evaluation (see the sketch below).
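A sketch of the three-way split with scikit-learn's train_test_split; the 60/20/20 proportions, the Ridge models, and the candidate alpha values are illustrative choices, not prescribed by the lecture:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
t = X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.5, size=300)

# Split off the test set first, then split the rest: 60/20/20 overall
X_tmp, X_test, t_tmp, t_test = train_test_split(X, t, test_size=0.2, random_state=0)
X_train, X_val, t_train, t_val = train_test_split(X_tmp, t_tmp, test_size=0.25, random_state=0)

# Model selection uses the validation set only
best_alpha, best_mse = None, np.inf
for alpha in (0.01, 0.1, 1.0, 10.0):
    model = Ridge(alpha=alpha).fit(X_train, t_train)
    mse = np.mean((t_val - model.predict(X_val)) ** 2)
    if mse < best_mse:
        best_alpha, best_mse = alpha, mse

# The test set is touched exactly once, for the very final evaluation
final = Ridge(alpha=best_alpha).fit(X_train, t_train)
print("test MSE:", np.mean((t_test - final.predict(X_test)) ** 2))
```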
Adjustment Techniques
- $C_p$: $C_p = \frac{1}{N}\left(RSS + 2d\tilde{\sigma}^2\right)$, where $d$ is the total number of parameters and $\tilde{\sigma}^2$ is an estimate of the variance of the noise
- AIC: $AIC = -2 \log L + 2d$, where $L$ is the maximized value of the likelihood function for the estimated model
- BIC: $BIC = \frac{1}{N}\left(RSS + \log(N)\, d\, \tilde{\sigma}^2\right)$; BIC replaces the $2d\tilde{\sigma}^2$ of $C_p$ with a $\log(N)\, d\, \tilde{\sigma}^2$ term. Since $\log N > 2$ for any $N > 7$, BIC selects smaller models
- Adjusted $R^2$: $\text{Adjusted } R^2 = 1 - \frac{RSS/(N - d - 1)}{TSS/(N - 1)}$, where TSS is the total sum of squares. Differently from the other criteria, here a large value indicates a model with a small test error
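These criteria are straightforward to compute once RSS, TSS, $d$, and a noise-variance estimate are available. In this sketch the AIC line additionally assumes Gaussian noise (so that $-2\log L = N\log(RSS/N)$ up to an additive constant), and the RSS values are invented for illustration:

```python
import numpy as np

def adjustment_criteria(rss, tss, N, d, sigma2_hat):
    """Cp, AIC, BIC, and adjusted R^2 as defined above."""
    cp = (rss + 2 * d * sigma2_hat) / N
    bic = (rss + np.log(N) * d * sigma2_hat) / N
    # Gaussian-noise assumption: -2 log L = N log(RSS/N) + constant
    aic = N * np.log(rss / N) + 2 * d
    adj_r2 = 1 - (rss / (N - d - 1)) / (tss / (N - 1))
    return cp, aic, bic, adj_r2

# Compare a 3-parameter and an 8-parameter fit on N = 100 points
for d, rss in ((3, 12.0), (8, 11.5)):
    print(d, adjustment_criteria(rss, tss=40.0, N=100, d=d, sigma2_hat=0.12))
```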
Shrinkage Methods
[Figure: Lasso (left) and comparison with Ridge (right). Black: squared bias; green: variance; purple: MSE. Solid lines: Lasso; dashed lines: Ridge. Purple crosses mark the Lasso model with minimum MSE.]
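As an illustration of shrinkage (not a reproduction of the figure), this sketch fits Ridge and Lasso with scikit-learn across regularization strengths; the sparse ground-truth coefficients are an assumption:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 10))
true_w = np.array([3.0, -2.0] + [0.0] * 8)   # sparse truth (assumed for the demo)
t = X @ true_w + rng.normal(0, 1.0, size=80)

for alpha in (0.01, 0.1, 1.0, 10.0):
    ridge = Ridge(alpha=alpha).fit(X, t)
    lasso = Lasso(alpha=alpha).fit(X, t)
    # Ridge shrinks all coefficients; Lasso also sets some exactly to zero
    print(f"alpha={alpha:5}: ||w_ridge||_2={np.linalg.norm(ridge.coef_):.2f}, "
          f"lasso non-zero coefs={np.sum(lasso.coef_ != 0)}")
```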
Dimension Reduction
Conceptual Algorithm
1. Find a line such that, when the data is projected onto that line, the projection has maximum variance
2. Find a new line, orthogonal to the first one, that has maximum projected variance
3. Repeat until m lines have been identified, and project the points in the data set onto these lines
Steps in PCA
Applying PCA
- The full set of PCs comprises a new orthogonal basis for the feature space, whose axes are aligned with the directions of maximum variance of the original data
- Projecting the original data onto the first k PCs, $E_k = (e_1, \ldots, e_k)$, gives a reduced-dimensionality representation of the data:
$$X' = X E_k$$
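A from-scratch sketch of the procedure: center the data, take the eigenvectors of the sample covariance with the largest eigenvalues as the PCs, and project. The synthetic correlated data is illustrative:

```python
import numpy as np

def pca_project(X, k):
    """Project X onto its first k principal components."""
    Xc = X - X.mean(axis=0)                 # center the data
    cov = Xc.T @ Xc / (len(X) - 1)          # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]       # sort by decreasing variance
    E_k = eigvecs[:, order[:k]]             # first k PCs: E_k = (e_1, ..., e_k)
    return Xc @ E_k                         # X' = X E_k

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))  # correlated features
X_reduced = pca_project(X, k=2)
print(X_reduced.shape)   # (100, 2)
```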
[Figures: PCA examples]
Model Ensembles
The methods seen so far can reduce bias by increasing variance, or vice versa.
- Is it possible to reduce variance without increasing bias? Yes: Bagging!
- Is it possible to also reduce the bias? Yes: Boosting!
Bagging and Boosting are meta-algorithms (there are many other methods). The basic idea: instead of learning one model, learn several and combine them. This typically improves accuracy, often by a lot.
The average of N i.i.d. random variables has reduced variance:
$$Var(\bar{X}) = \frac{Var(X)}{N}$$
Idea: average models to reduce variance. One problem: we have only one training set, so where do multiple models come from? (See the sketch below.)
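Bagging (bootstrap aggregating) answers this by resampling: draw N points with replacement from the single training set, learn a model on each bootstrap sample, and average the predictions. A minimal sketch with scikit-learn decision trees; the data and the choice of base model are assumptions:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 2 * np.pi, size=(200, 1))
t = np.sin(X[:, 0]) + rng.normal(0, 0.3, size=200)

# Bagging: learn B models on bootstrap resamples of the single training set
B = 50
models = []
for _ in range(B):
    idx = rng.integers(0, len(t), size=len(t))   # sample N points with replacement
    models.append(DecisionTreeRegressor().fit(X[idx], t[idx]))

X_new = np.array([[1.0], [2.0], [3.0]])
y_bagged = np.mean([m.predict(X_new) for m in models], axis=0)  # average the models
print(y_bagged)
```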
Analysis of Bagging
Boosting
Boosting Algorithm
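The details of the boosting slides are not reproduced here; as a hedged illustration of one classic boosting algorithm (AdaBoost, not necessarily the variant covered in the lecture), this sketch reweights the training points after each round so that the next weak learner focuses on the previous mistakes:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, t, rounds=20):
    """AdaBoost with decision stumps; t must have labels in {-1, +1}."""
    N = len(t)
    w = np.full(N, 1.0 / N)            # start with uniform weights on the points
    stumps, alphas = [], []
    for _ in range(rounds):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, t, sample_weight=w)
        pred = stump.predict(X)
        err = np.sum(w * (pred != t)) / np.sum(w)        # weighted training error
        alpha = 0.5 * np.log((1 - err) / (err + 1e-12))  # weight of this weak learner
        w *= np.exp(-alpha * t * pred)                   # upweight the mistakes
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def predict(stumps, alphas, X):
    # Final model: sign of the alpha-weighted vote of the weak learners
    return np.sign(sum(a * s.predict(X) for s, a in zip(stumps, alphas)))
```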
Bagging vs Boosting
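In short: bagging trains models independently on bootstrap resamples and averages them, mainly reducing variance; boosting trains models sequentially, each focusing on the errors of the previous ones, and can also reduce bias.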