Regularization
Need for Regularization …
• Model Variance
– If the number of observations far exceeds the number of predictors
• The variance of OLS models is low
– If the number of observations exceeds the number of predictors, but not by much
• The variance of OLS models may be large
– If the number of observations is less than the number of predictors
• No unique OLS model exists
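The variance behavior above can be checked empirically. A minimal sketch (the simulated design with p = 6 standard-normal predictors is an assumption, not from the slides; the slides use R, this sketch uses Python):

```python
# Simulate how the variance of OLS coefficient estimates grows as the
# number of observations approaches the number of predictors.
import numpy as np

rng = np.random.default_rng(0)
p = 6  # number of predictors (assumed for illustration)

def coef_variance(n, reps=200):
    """Empirical variance of the first OLS coefficient over many draws."""
    estimates = []
    for _ in range(reps):
        X = rng.normal(size=(n, p))
        y = X @ np.ones(p) + rng.normal(size=n)
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        estimates.append(beta[0])
    return float(np.var(estimates))

var_large = coef_variance(n=300)   # n >> p: variance is low
var_small = coef_variance(n=10)    # n barely exceeds p: variance is large
print(var_small > var_large)
```

With n below p the least-squares system is underdetermined, matching the slide's last case: no unique OLS model exists.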
Model Variance
Example
• The following example illustrates:
– Effect of sample size on model accuracy
– Use of cross validation (CV)
The Problem
The data:
• x = 1:360
• Create six predictors and scale each one:
  x1 = x                  x1n = scale(x1)
  x2 = x1 * x1            x2n = scale(x2)
  x3 = x1 * x1 * x1       x3n = scale(x3)
  x4 = sin(x1 * pi/180)   x4n = scale(x4)
  x5 = cos(x1 * pi/180)   x5n = scale(x5)
  x6 = tanh(x1 * pi/180)  x6n = scale(x6)
• The response:
  yp = x1n + x2n + x3n + x4n + x5n + x6n + rnorm(360, 0, 1)
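The slide builds the data in R. A Python sketch of the same generation, assuming R's scale() means (x − mean) / sd:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.arange(1, 361, dtype=float)            # x = 1:360

raw = {
    "x1": x,
    "x2": x * x,
    "x3": x * x * x,
    "x4": np.sin(x * np.pi / 180),
    "x5": np.cos(x * np.pi / 180),
    "x6": np.tanh(x * np.pi / 180),
}

def scale(v):
    """Standardize like R's scale(): zero mean, unit sample sd."""
    return (v - v.mean()) / v.std(ddof=1)

scaled = {k + "n": scale(v) for k, v in raw.items()}
yp = sum(scaled.values()) + rng.normal(0, 1, size=360)
print(yp.shape)
```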
Data Visualization
Example
• The following example illustrates:
– Effect of sample size on model accuracy
– Use of Cross Validation (CV)
• Explanation:
– Models are created using the following sample sizes
• 50, 100, 200, 360
– The 10-fold cross-validation method is used to create multiple models from every sample size, and the errors are evaluated
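The experiment described above can be sketched as follows (the slides use R/caret; this uses Python with scikit-learn, and the data recipe follows the earlier slide):

```python
# Fit a linear model on samples of increasing size and measure RMSE
# with 10-fold cross-validation.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(2)
x = np.arange(1, 361, dtype=float)
cols = [x, x**2, x**3,
        np.sin(x * np.pi / 180), np.cos(x * np.pi / 180), np.tanh(x * np.pi / 180)]
X = np.column_stack([(c - c.mean()) / c.std(ddof=1) for c in cols])
y = X.sum(axis=1) + rng.normal(0, 1, size=360)

results = {}
for n in (50, 100, 200, 360):
    idx = rng.choice(360, size=n, replace=False)
    scores = cross_val_score(LinearRegression(), X[idx], y[idx],
                             scoring="neg_root_mean_squared_error",
                             cv=KFold(n_splits=10, shuffle=True, random_state=0))
    results[n] = float(-scores.mean())   # mean RMSE over the 10 folds
    print(n, round(results[n], 3))
```

The spread of fold errors tends to shrink as the sample size grows, which is the point the following slides illustrate.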
Effect of Sample Size on Model Accuracy
• Sample size: 50; Number of folds: 10
Effect of Sample Size on Model Accuracy
Another Important Aspect: Model Selection
• This aspect addresses the question:
– How many independent variables (predictors) are really required to create a good model?
Correlation Analysis
Correlation Analysis
Model Selection
Will the Linear Model help us understand reality?
• Are all the predictors required?
• Are there any methods to weed out irrelevant
predictors?
– Can the Beta values of irrelevant variables
automatically be set to zero?
– (This is known as feature selection)
• Methods
– Subset selection
– Shrinkage
– Dimension reduction
– (These are also applicable to classification)
Subset Selection
• Best subset selection
– Involves exhaustive model building
– From zero predictors to all predictors
– Results in the creation of 2^p models
– Computationally expensive
– Infeasible for a large number of predictors
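A minimal sketch of best subset selection (synthetic data assumed; BIC is used here as the scoring criterion, anticipating the adjustment methods discussed later; Python rather than the slides' R):

```python
# Enumerate all 2^p subsets of p = 6 predictors and pick the one with
# the lowest BIC.
import itertools
import numpy as np

rng = np.random.default_rng(3)
n, p = 200, 6
X = rng.normal(size=(n, p))
y = X[:, 0] + 2 * X[:, 2] + rng.normal(size=n)   # only x1 and x3 truly matter

def rss(cols):
    """Residual sum of squares for the OLS fit on the given columns."""
    if not cols:
        return float(((y - y.mean()) ** 2).sum())
    Xs = X[:, list(cols)]
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    return float(((y - Xs @ beta) ** 2).sum())

def bic(cols):
    d = len(cols)
    return n * np.log(rss(cols) / n) + d * np.log(n)

subsets = itertools.chain.from_iterable(
    itertools.combinations(range(p), k) for k in range(p + 1))
best = min(subsets, key=bic)
print(best)   # the truly relevant predictors (indices 0 and 2) should appear
```

The nested loops make the 2^p cost explicit: at p = 20 this is already over a million model fits.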
Forward Stepwise Selection
• Starts with zero predictors (i.e., the mean)
• Adds predictors to the model – one at a time
– Searches for the variable that yields the best additional improvement to the model
• Results in the creation of 1 + p(p+1)/2 models
– Where 'p' = number of predictors
• Advantage:
– Can be used even when observations < predictors
• Limitation:
– Not guaranteed to identify the best possible model
– (When compared with the exhaustive method)
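Forward stepwise selection can be sketched with scikit-learn's SequentialFeatureSelector (the synthetic data and the choice of 2 features to select are assumptions for illustration; direction="backward" gives backward stepwise instead):

```python
# Greedy forward selection: add the predictor that most improves CV score,
# one at a time.
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
X = rng.normal(size=(120, 6))
y = 3 * X[:, 1] - 2 * X[:, 4] + rng.normal(size=120)

sfs = SequentialFeatureSelector(LinearRegression(),
                                n_features_to_select=2,
                                direction="forward", cv=5)
sfs.fit(X, y)
print(sfs.get_support(indices=True))   # indices of the selected predictors
```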
Backward Stepwise Selection
• Begins with full OLS model
• Exhaustively searches and removes the least
useful predictor
– One at a time
• Results in the creation of 1 + p(p+1)/2 models
• Disadvantage
– Cannot be used if observations < predictors
Optimal Model Selection
How to choose the optimal model?
• The following calculated values can be used
– Residual Sum of Squares (RSS)
– R-squared
• But RSS drops as predictors are added, and R-squared increases!
– So we will anyway end up with all the predictors!!
• Remedy?
– Adjust the training error to account for over-fitting
– Use validation-set or cross-validation approaches
Adjustment of Training Error
The following methods are used:
• Mallows Cp
• AIC – Akaike Information Criterion
• BIC – Bayesian Information Criterion
• Adjusted R²
Mallows Cp
Cp = (RSS + 2·d·σ̂²) / n
• n = Number of observations
• RSS = Residual Sum of Squares (Training)
• d = Number of predictors
• σ̂² = Estimate of the variance of the error ε associated with each response measurement
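The Cp formula can be sketched numerically for a sequence of nested models (synthetic data assumed; σ̂² is estimated from the full model's residuals; Python rather than R):

```python
# Cp = (RSS + 2·d·sigma2) / n for nested models using the first d columns.
import numpy as np

rng = np.random.default_rng(5)
n, p = 200, 6
X = rng.normal(size=(n, p))
y = X[:, 0] + X[:, 1] + rng.normal(size=n)   # only the first 2 predictors matter

def rss(d):
    Xd = X[:, :d]
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return float(((y - Xd @ beta) ** 2).sum())

sigma2 = rss(p) / (n - p)                     # estimate of Var(eps) from the full model
cp = {d: (rss(d) + 2 * d * sigma2) / n for d in range(1, p + 1)}
best_d = min(cp, key=cp.get)
print(best_d)   # the 2·d·sigma2 term penalizes unnecessary predictors
```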
The Metric v/s Predictors Plot
• The metrics v/s predictors plot will be as follows:
One-Standard-Error Rule
• A typical metrics v/s model plot
• Should we always select the model that gives us the lowest value of the metric?
• Can we select the model with 4 or 3 predictors, instead of 6?
• The rule: choose the simplest model whose CV error is within one standard error of the lowest error
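The one-standard-error rule can be sketched numerically (the CV errors below are illustrative numbers shaped like a typical plot, not from the slides):

```python
# Among models whose CV error is within one SE of the minimum, pick the
# one with the fewest predictors.
import numpy as np

# CV error and its standard error for models with 1..6 predictors (assumed)
cv_mean = np.array([2.10, 1.40, 1.02, 0.99, 0.98, 0.97])
cv_se   = np.array([0.10, 0.09, 0.07, 0.07, 0.07, 0.07])

best = int(np.argmin(cv_mean))              # model with the lowest CV error
threshold = cv_mean[best] + cv_se[best]     # lowest error + one standard error
candidates = np.flatnonzero(cv_mean <= threshold)
simplest = int(candidates[0]) + 1           # fewest predictors within 1 SE
print(simplest)  # 3: the 3-predictor model is "good enough"
```

This answers the slide's question: the 3-predictor model is statistically indistinguishable from the 6-predictor one, so we prefer it.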
Ridge Regression
• Ridge regression minimizes RSS + λ Σj βj² – the second term is the shrinkage penalty
Effect of increasing λ on the βj values
Net effect:
1. Drives certain βj towards zero (… but never exactly zero)
2. Effectively reduces variance by increasing bias (shrinkage)
As λ increases, the flexibility of the fit decreases (bias increases) but variance decreases
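The shrinkage effect can be sketched with scikit-learn's Ridge, where the parameter alpha plays the role of λ (synthetic data assumed; Python rather than R):

```python
# As alpha (λ) grows, ridge coefficients are driven toward, but never
# exactly to, zero.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, -2.0, 1.0, 0.5, 0.0]) + rng.normal(size=100)

norms = []
for alpha in (0.01, 1.0, 100.0):
    coef = Ridge(alpha=alpha).fit(X, y).coef_
    norms.append(float(np.abs(coef).sum()))   # total coefficient magnitude
    print(alpha, round(norms[-1], 3))
```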
Exercise
• In general, see how the variance increases as the number of observations decreases
• See how Ridge regression can reduce such variance
• See how Ridge regression can help when the number of observations is not much greater than (or is even smaller than …) the number of predictors
LASSO (Least Absolute Shrinkage and Selection Operator)
The LASSO minimizes the following cost function:
RSS + λ Σj |βj|
LASSO: Terminology
• L1 norm: ‖β‖₁ = Σj |βj|
• LASSO Penalty: λ ‖β‖₁
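Unlike the ridge (L2) penalty, the L1 penalty can zero out coefficients exactly. A sketch contrasting the two (synthetic data and the alpha value are assumptions; Python rather than R):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 6))
y = 4 * X[:, 0] + 2 * X[:, 1] + rng.normal(size=100)   # x3..x6 are irrelevant

lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=0.5).fit(X, y)
l1_norm = float(np.abs(lasso.coef_).sum())             # ||beta||_1 of the fit
lasso_zeros = int((lasso.coef_ == 0).sum())            # exact zeros (sparsity)
ridge_zeros = int((ridge.coef_ == 0).sum())            # typically none
print(round(l1_norm, 3), lasso_zeros, ridge_zeros)
```

This sparsity is why the LASSO doubles as a feature-selection method.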
LASSO: Effect of λ on Predictors
Tuning Parameter Selection (λ)
• Choose a grid of λ values
• Compute the cross-validation error for each value of λ
• Select the value of λ for which the cross-validation error is smallest
• Re-fit the model using the selected value of λ and all the available observations
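The four steps above can be sketched in one call with scikit-learn's LassoCV, which runs the λ grid, the cross-validation, and the final refit on all data (synthetic data and the grid bounds are assumptions; Python rather than R):

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(8)
X = rng.normal(size=(200, 6))
y = 3 * X[:, 2] + rng.normal(size=200)

grid = np.logspace(-3, 1, 30)                   # the grid of λ (alpha) values
model = LassoCV(alphas=grid, cv=10).fit(X, y)   # CV error per λ, then refit
print(model.alpha_)                             # λ with the smallest CV error
print(model.coef_.round(2))
```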
Tuning Parameter Selection Using CV
RIDGE Regression
• λ from 10 to 100
Tuning Parameter Selection Using CV
Variable seq: 3,2,1,6,5,4,7
RIDGE Regression
• λ from 10 to 100
Tuning Parameter Selection Using CV
RIDGE Regression
• λ from 0 to 10
Tuning Parameter Selection Using CV
Variable seq: 3,6,5,4,1,2
RIDGE Regression
• λ from 0 to 10
Tuning Parameter Selection Using CV
RIDGE Regression
• λ from 0 to 1
Tuning Parameter Selection Using CV
Variable seq: 3,6,5,4,1,2
RIDGE Regression
• λ from 0 to 1
Tuning Parameter Selection Using CV
RIDGE Regression
• λ from 0 to 0.1
Tuning Parameter Selection Using CV
Variable seq: 3,2,6,-2,5,4,2,1
RIDGE Regression
• λ from 0 to 0.1
Tuning Parameter Selection Using CV
Best Values Calculation
Best Values Calculation
mean(RMSE) = 0.8160154
mean(Rsquared) = 0.9220527
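Mean RMSE and R² values like those above come from averaging per-fold metrics; a sketch of that computation (synthetic data assumed; the slides use R/caret, this uses Python with scikit-learn):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_validate

rng = np.random.default_rng(9)
X = rng.normal(size=(360, 6))
y = X.sum(axis=1) + rng.normal(size=360)

res = cross_validate(Ridge(alpha=1.0), X, y,
                     scoring=("neg_root_mean_squared_error", "r2"),
                     cv=KFold(n_splits=10, shuffle=True, random_state=0))
mean_rmse = float(-res["test_neg_root_mean_squared_error"].mean())
mean_r2 = float(res["test_r2"].mean())
print(round(mean_rmse, 3), round(mean_r2, 3))
```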
Parameter Tuning Using Cross Validation
Method  Samples  Min RMSE  Max RMSE  Sd
MLR     50       0.455     1.439     0.330
MLR     100      0.570     1.380     0.278
MLR     200      0.767     1.240     0.155
MLR     360      0.745     1.190     0.133
RIDGE   50       0.537     1.465     0.298
RIDGE   100      0.634     1.183     0.192
RIDGE   200      0.695     1.113     0.154
RIDGE   360      0.817     1.128     0.082
LASSO   50       0.580     1.450     0.322
LASSO   100      0.680     1.310     0.202
LASSO   200      0.790     1.130     0.116
LASSO   360      0.794     1.132     0.128
ENET    50       0.425     1.691     0.419
ENET    100      0.558     1.630     0.312
ENET    200      0.615     1.143     0.171
ENET    360      0.775     1.121     0.129
Parameter Tuning Using Cross Validation
Samples  Method  Min RMSE  Max RMSE  Sd
50       MLR     0.455     1.439     0.330
50       RIDGE   0.537     1.465     0.298
50       LASSO   0.580     1.450     0.322
50       ENET    0.425     1.691     0.419
100      MLR     0.570     1.380     0.278
100      RIDGE   0.634     1.183     0.192
100      LASSO   0.680     1.310     0.202
100      ENET    0.558     1.630     0.312
200      MLR     0.767     1.240     0.155
200      RIDGE   0.695     1.113     0.154
200      LASSO   0.790     1.130     0.116
200      ENET    0.615     1.143     0.171
360      MLR     0.745     1.190     0.133
360      RIDGE   0.817     1.128     0.082
360      LASSO   0.794     1.132     0.128
360      ENET    0.775     1.121     0.129
Model Selection
Will the Linear Model help us understand reality?
• Are all the predictors required?
• Are there any methods to weed out irrelevant
predictors?
– Can the Beta values of irrelevant variables
automatically be set to zero?
– (This is known as feature selection)
• Methods
– Subset selection
– Shrinkage and Model Selection
– Dimension reduction
– (These are also applicable to classification)
Model Selection Using LASSO
LASSO: Results
Steps
Steps
LASSO Results: Comparison with correlation
LASSO Results: Comparison with correlation
The Problem
The data:
• x = 1:360
• Create six predictors and scale each one:
  x1 = x                  x1n = scale(x1)
  x2 = x1 * x1            x2n = scale(x2)
  x3 = x1 * x1 * x1       x3n = scale(x3)
  x4 = sin(x1 * pi/180)   x4n = scale(x4)
  x5 = cos(x1 * pi/180)   x5n = scale(x5)
  x6 = tanh(x1 * pi/180)  x6n = scale(x6)
• The response:
  yp = x1n + x2n + x3n + x4n + x5n + x6n + rnorm(360, 0, 1)
Data Visualization
Principal Component Analysis
• What is the optimal number of independent variables required to capture the variance in the data?
– This question is answered by PCA
• PCA results for the generated data set
Principal Component Analysis
• What is the optimal number of independent variables required to capture the variance in the data?
– This question is answered by PCA
• PCA results for the generated data set
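The PCA question can be sketched directly on the six derived predictors from the data slide (Python with scikit-learn rather than the slides' R):

```python
# How many principal components are needed to capture most of the variance?
import numpy as np
from sklearn.decomposition import PCA

x = np.arange(1, 361, dtype=float)
cols = [x, x**2, x**3,
        np.sin(x * np.pi / 180), np.cos(x * np.pi / 180), np.tanh(x * np.pi / 180)]
X = np.column_stack([(c - c.mean()) / c.std(ddof=1) for c in cols])

pca = PCA().fit(X)
cum = np.cumsum(pca.explained_variance_ratio_)
print(cum.round(3))   # cumulative variance explained per component
```

Because several of the predictors are strongly correlated with x, far fewer than six components suffice.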
Exercise
• For the Challenge problem, use LASSO to select
the correct operators
The “Challenge” problem: Recap
• The problem
– Given the data set {Y, x1, x2} … as shown below
– Find out the nature of the function f, where
• y = f(x1, x2)
The Challenge Problem: Recap
• Creation of additional predictors using FE:
– Assume y depends on additional variables derived
from x1 and x2
– For example: assume y depends on the following
original and assumed (derived) variables
• x1, x2 (the original variables)
• x3 = x1², x4 = x2²
• x5 = sin(x1), x6 = cos(x1)
• x7 = tan(x1), x8 = tanh(x1)
• x9 = sin(x2), x10 = cos(x2)
• x11 = tan(x2), x12 = tanh(x2)
• x13 = log(x1 + 10), x14 = log(x2 + 10)
• x15 = exp(x1), x16 = exp(x2)
The Challenge Problem: Solutions
Earlier solution:
• Was implemented using ‘backward selection’
The Challenge Problem: Solutions
Earlier solution:
• Was implemented using 'backward selection'
• LASSO
– Given a value of the 'fraction' (the ratio ‖β‖₁ / max ‖β‖₁)
– Calculates all βj to minimize the LASSO cost function
– Multiple values of the fraction are tried out … and the βj calculated
– The results are then plotted
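The coefficient-path computation can be sketched with scikit-learn's lasso_path. The data here is an assumption shaped like the challenge problem (true model y = x1² + x2² plus noise, a subset of the engineered features, and tan/log/exp omitted for numerical stability; Python rather than R):

```python
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(10)
x1 = rng.uniform(-2, 2, 300)
x2 = rng.uniform(-2, 2, 300)
feats = [x1, x2, x1**2, x2**2, np.sin(x1), np.cos(x1), np.tanh(x1),
         np.sin(x2), np.cos(x2), np.tanh(x2)]
X = np.column_stack(feats)
y = x1**2 + x2**2 + 0.1 * rng.normal(size=300)   # true model uses x1², x2²

# coefs has shape (n_features, n_alphas): one column per penalty value
alphas, coefs, _ = lasso_path(X, y, n_alphas=50)

# At the weakest penalty the two quadratic features dominate:
top2 = set(np.argsort(np.abs(coefs[:, -1]))[-2:])
print(top2)
```

Plotting coefs against the fraction reproduces the path plot on the next slide: the quadratic features stay active throughout while the rest are shrunk to zero.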
LASSO Based Solution …
• LASSO: coefficients v/s fraction plot
• Observation:
– Only coefficients of x3 and x4 are active throughout
– LASSO has flagged only x3 and x4 to be relevant
• See next slide
LASSO Based Solution … Steps & Coefficients
LASSO Based Solution …
• LASSO: coefficients v/s active predictors (dof) plot
• Observation:
– Full solution is reached with 2 predictors
– LASSO has flagged that 2 predictors (x3 and x4) are
important
The Challenge Problem: PCA
The Challenge Problem: PCA