
Regularization and Model Selection

Modifications To OLS Model … Required:
• To improve prediction accuracy
– Regularization
• To improve model interpretability
– Model Selection

2
Need for Regularization …
• Model Variance
– If the number of observations far exceeds the number of
predictors
• The variance of OLS models is low
– If the number of observations exceeds the number of
predictors, but not by much
• The variance of OLS models may be large
– If the number of observations is less than the number of predictors
• There is no unique OLS solution

3
Model Variance

[Slides 4–9: plots of the same OLS fit on six different sets of training points; the model changes drastically with different sets of training points]

9
Remedy: if there aren’t enough observations
• Constrain the coefficients
– This increases bias
– But improves the prediction accuracy of OLS models
• By reducing variability

• Coefficients can be constrained using the following methods:
– RIDGE Regression (L2 penalty)
– LASSO (L1 penalty)
– Elastic Net Regression (L1 and L2 penalties)

10
Example
• The following example illustrates:
– Effect of sample size on model accuracy
– Use of cross validation (CV)

11
The Problem
The data
• x = 1:360
• Create six predictors, then scale them:
  x1 = x                  x1n = scale(x1)
  x2 = x1 * x1            x2n = scale(x2)
  x3 = x1 * x1 * x1       x3n = scale(x3)
  x4 = sin(x1 * pi/180)   x4n = scale(x4)
  x5 = cos(x1 * pi/180)   x5n = scale(x5)
  x6 = tanh(x1 * pi/180)  x6n = scale(x6)

yp = x1n + x2n + x3n + x4n + x5n + x6n + rnorm(360, 0, 1)
12
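A minimal, runnable R sketch of this data-generation step (the seed and the data-frame name dat are assumptions, not from the slides):

set.seed(1)                       # assumed seed, for reproducibility
x  <- 1:360
x1 <- x;  x2 <- x1 * x1;  x3 <- x1 * x1 * x1
x4 <- sin(x1 * pi / 180);  x5 <- cos(x1 * pi / 180);  x6 <- tanh(x1 * pi / 180)

# Centre and standardize each predictor
x1n <- scale(x1); x2n <- scale(x2); x3n <- scale(x3)
x4n <- scale(x4); x5n <- scale(x5); x6n <- scale(x6)

# Response: sum of the scaled predictors plus Gaussian noise
yp  <- x1n + x2n + x3n + x4n + x5n + x6n + rnorm(360, 0, 1)
dat <- data.frame(yp = as.numeric(yp),
                  x1n = as.numeric(x1n), x2n = as.numeric(x2n), x3n = as.numeric(x3n),
                  x4n = as.numeric(x4n), x5n = as.numeric(x5n), x6n = as.numeric(x6n))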
Data Visualization

13
Example
• The following example illustrates:
– Effect of sample size on model accuracy
– Use of Cross Validation (CV)
• Explanation:
– Models are created using the following sample sizes
• 50, 100, 200, 360
– 10-fold cross-validation is used to create multiple
models from every sample size, and the errors are
evaluated (see the sketch below)

14
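A sketch of how such an experiment could be run, assuming the caret package and the data frame dat built above (taking the first n rows as the sample is an assumption; the slides do not say how the samples were drawn):

library(caret)

cv_rmse <- function(n) {
  ctrl <- trainControl(method = "cv", number = 10)          # 10-fold CV
  fit  <- train(yp ~ ., data = dat[1:n, ], method = "lm",   # plain MLR on the first n rows
                trControl = ctrl)
  fit$resample$RMSE                                         # per-fold RMSE values
}

sapply(c(50, 100, 200, 360), function(n) {
  r <- cv_rmse(n)
  c(samples = n, min = min(r), max = max(r), sd = sd(r))
})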
Effect of Sample Size on Model Accuracy
• Sample size: 50; Number of folds: 10

• RMSE Range : [0.455, 1.439]


• RMSE SD : 0.330
15
Effect of Sample Size on Model Accuracy
• Sample Size: 100; Number of folds: 10

• RMSE Range : [0.57, 1.38]


• RMSE SD : 0.278
16
Effect of Sample Size on Model Accuracy
• Sample Size: 200; Number of folds: 10

• RMSE Range : [0.767, 1.24]


• RMSE SD : 0.155
17
Effect of Sample Size on Model Accuracy
• Sample Size: 360 (all observations); Number of folds: 10

• RMSE Range : [0.745, 1.19]


• RMSE SD : 0.133
18
Effect of Sample Size on Model Accuracy

Method Samples Min RMSE Max RMSE Sd


MLR 50 0.455 1.439 0.330
100 0.570 1.380 0.278
200 0.767 1.240 0.155
360 0.745 1.190 0.133

19
Effect of Sample Size on Model Accuracy

Method Samples Min RMSE Max RMSE Sd


MLR 50 0.455 1.439 0.330
100 0.570 1.380 0.278
200 0.767 1.240 0.155
360 0.745 1.190 0.133

Is it possible to reduce model variability / variance?

Regularization is the answer


(it increases Bias to reduce Variance)

20
Another Important Aspect: Model Selection
• This aspect addresses the question:
– How many independent variables (predictors) are
really required to create an adequately good model?

21
Correlation Analysis

22
Correlation Analysis

23
Model Selection
Will the Linear Model help us understand reality?
• Are all the predictors required?
• Are there any methods to weed out irrelevant
predictors?
– Can the Beta values of irrelevant variables
automatically be set to zero?
– (This is known as feature selection)
• Methods
– Subset selection
– Shrinkage
– Dimension reduction
– (These methods are applicable to Classification as well)
24
Model Selection
Will the Linear Model help us understand reality?
• Are all the predictors required?
• Are there any methods to weed out irrelevant
predictors?
– Can the Beta values of irrelevant variables
automatically be set to zero?
– (This is known as feature selection)
• Methods
– Subset selection
– Shrinkage
– Dimension reduction
– (These methods are applicable to Classification as well)
25
Subset Selection
• Best subset selection
– Involves exhaustive model building
– From zero predictors to all predictors
– Results in the creation of 2^p models
– Computationally very expensive
– Infeasible for a large number of predictors

26
Forward Stepwise Selection
• Starts with zero predictors (i.e., the model that predicts the mean)
• Adds predictors to the model – one at a time
– Searches for the variable that gives the best additional
improvement to the model
• Results in the creation of 1 + p(p+1)/2 models
– Where ‘p’ = number of predictors
• Advantage:
– Can be used even when observations < predictors
• Limitation:
– Not guaranteed to identify the best possible model
– (When compared with the exhaustive method)
27
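A minimal sketch of forward stepwise selection in R, assuming the leaps package and the data frame dat from the earlier example:

library(leaps)

# Forward stepwise search over the six scaled predictors
fwd <- regsubsets(yp ~ ., data = dat, nvmax = 6, method = "forward")
summary(fwd)$outmat   # shows which predictor enters the model at each step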
Backward Stepwise Selection
• Begins with full OLS model
• Exhaustively searches and removes the least
useful predictor
– One at a time
• Results in the creation of 1 + p(p+1)/2 models
• Disadvantage
– Cannot be used if observations < predictors

Hybrid forward / backward approaches can be used

28
Optimal Model Selection
How to choose the optimal model?
• The following calculated values can be used
– Residual Sum of Squares of the errors (RSS)
– R-squared
• But RSS always drops when additional predictors are used,
and R-squared always increases!
– So we would always end up with all the predictors!
• Remedy?
– Adjust the training error to account for over-fitting
– Use validation-set or cross-validation approaches
29
Adjustment of Training Error
The following methods are used:
• Mallows Cp
• AIC – Akaike Information Criterion
• BIC – Bayesian Information Criterion
• Adjusted R2

30
Mallows Cp

Cp = (RSS + 2 d σ̂²) / n

• n = Number of observations
• RSS = Residual Sum of Squares of the errors (Training)
• d = Number of predictors
• σ̂² = Estimate of the variance of the error ε associated with each
response measurement

Cp takes on small values for models with low test error


Choose model with lowest Cp value
31
Akaike Information Criterion (AIC)

AIC = (RSS + 2 d σ̂²) / (n σ̂²)

• AIC is defined for the class of problems fit by
Maximum Likelihood Estimation (MLE)
– For Gaussian errors, MLE and Least Squares are the same
• As can be seen, AIC and Cp are proportional

The lower the AIC, the lower the expected test error
Select the model with the lowest AIC
32
Bayesian Information Criterion (BIC)

BIC = (RSS + log(n) d σ̂²) / n

• When compared with Cp, BIC replaces the ‘2’ with log(n)
– For n > 7, log(n) > 2
– BIC thus imposes a larger penalty on models with
many variables
– It selects smaller models than Cp

Low BIC → Lower expected test error


33
Adjusted R2

Adjusted R2 = 1 – [RSS / (n – d – 1)] / [TSS / (n – 1)]

• Adjusted R2 : a price is paid for including
unnecessary variables in the model

The higher the Adjusted R2, the better the model

34
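A short R sketch of how these metrics can be read off a best-subset search (assuming the leaps package and the data frame dat from the earlier example; AIC is proportional to Cp here, so it is not listed separately):

library(leaps)

best <- regsubsets(yp ~ ., data = dat, nvmax = 6)   # best subset of each size
s    <- summary(best)
data.frame(size = 1:6, Cp = s$cp, BIC = s$bic, AdjR2 = s$adjr2)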
The Metric v/s Predictors Plot
• The metrics v/s predictors plot will be as follows:

35
One-Standard-Error Rule
• A typical metric v/s model-size plot
– Should we always select the model which gives us the
lowest value of the metric?
– Can we select the model with 4 or 3 predictors, instead of 6?

• The one-standard-error rule
– If there is another model with fewer predictors
– And its metric is within one standard error of the minimum
– Select that model …
36
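For the shrinkage methods discussed later, this rule is available directly; a sketch assuming the glmnet package and the data frame dat from the earlier example:

library(glmnet)

X   <- as.matrix(dat[, -1])                           # predictors (yp is column 1)
cvf <- cv.glmnet(X, dat$yp, alpha = 1, nfolds = 10)   # 10-fold CV over the LASSO path
cvf$lambda.min   # lambda giving the lowest CV error
cvf$lambda.1se   # largest lambda within one SE of that minimum (a simpler model)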
Validation and Cross Validation
• Validation set and Cross Validation: directly
operate on the actual data
– Hence these are better methods when compared with
Cp, AIC, BIC and Adjusted R2
– But computationally expensive

• Previously, computation was really expensive
– Hence methods like Cp, AIC, BIC and Adjusted R2 were used

• Now, the validation-set and cross-validation approaches are preferred


37
Model Selection
Will the Linear Model help us understand reality?
• Are all the predictors required?
• Are there any methods to weed out irrelevant
predictors?
– Can the Beta values of irrelevant variables
automatically be set to zero?
– (This is known as feature selection)
• Methods
– Subset selection
– Shrinkage
– Dimension reduction
– (These methods are applicable to Classification as well)
38
Shrinkage Methods
The following steps are involved in shrinkage:
• Start with all the predictors
• Instead of plain OLS, use modifications that:
– Force the Beta values to shrink, thereby also reducing
variance

• Effective and popular shrinkage techniques:
– Ridge Regression
– LASSO (Least Absolute Shrinkage and Selection Operator)

Shrinking the coefficients also reduces their variance
39
Ridge Regression (controls variance)
Coefficients are estimated by minimizing:

RSS + λ Σj βj²  =  Σi ( yi – β0 – Σj βj xij )² + λ Σj βj²

• The second term is the shrinkage penalty
– λ ≥ 0 is a tuning parameter: multiple values have to
be tried in order to select the best
– Note: shrinkage is NOT applied to β0
• Effect of the shrinkage penalty
– It essentially shrinks the βj towards zero (… but never exactly zero)
Caution
Normalize (standardize) the predictors before Ridge Regression
40
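A minimal sketch of ridge regression, assuming the glmnet package and the data frame dat from the earlier example (alpha = 0 selects the pure L2 penalty):

library(glmnet)

X     <- as.matrix(dat[, -1])          # predictors (yp is column 1)
ridge <- glmnet(X, dat$yp, alpha = 0)  # standardize = TRUE by default
plot(ridge, xvar = "lambda")           # coefficient paths: all betas shrink towards zero
coef(ridge, s = 1)                     # coefficients at lambda = 1 (illustrative value)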
RIDGE Regression: Terminology
• L2 norm: ||β||2 = sqrt( Σj βj² )

• RIDGE Regression coefficient estimate: β̂R_λ = argmin over β of { RSS + λ Σj βj² }

• Shrinkage penalty: λ Σj βj² = λ ||β||2²
41
Ridge Regression
Effect of increasing λ on the βj values
Net effect:
1. Drives certain βj towards Zero (… but never Zero)
2. Effectively, reduces Variance by increasing Bias (Shrinkage)

As λ increases, the flexibility of the fit decreases (Bias increases) but Variance decreases

42
Exercise
Exercise:
• In general, see how the variance increases as the
number of observations becomes smaller
• See how Ridge Regression can reduce such
variance
• See how Ridge Regression can help when the
number of observations is not much larger than (or
is even smaller than) the number of predictors

43
LASSO (Least Absolute Shrinkage and Selection Operator)
The LASSO minimizes the following cost function:

RSS + λ Σj |βj|  =  Σi ( yi – β0 – Σj βj xij )² + λ Σj |βj|

• This results in some of the βj becoming exactly ‘0’
– Why?
• Therefore, in addition to shrinkage, the LASSO also
performs feature selection
– It yields sparse models
– The value of λ governs the de-selection of predictors

44
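A minimal sketch of the LASSO, again assuming glmnet and the data frame dat (alpha = 1 selects the pure L1 penalty):

library(glmnet)

X     <- as.matrix(dat[, -1])
lasso <- glmnet(X, dat$yp, alpha = 1)
plot(lasso, xvar = "lambda")      # some coefficients hit exactly zero as lambda grows
coef(lasso, s = 0.5)              # sparse coefficient vector at lambda = 0.5 (illustrative value)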
LASSO: Terminology
• L1 norm: ||β||1 = Σj |βj|

• LASSO coefficient estimate: β̂L_λ = argmin over β of { RSS + λ Σj |βj| }

• LASSO penalty: λ Σj |βj| = λ ||β||1

45
LASSO: Effect of λ on Predictors

• Some of the βj become exactly Zero as λ increases


• This has the effect of predictor / feature selection
– Models become simpler and more interpretable
– Useful if there are a large number of predictors to
begin with
46
Variable Selection: LASSO v/s RIDGE Regression

• LHS = constraint region of the LASSO (diamond); RHS = constraint region of RIDGE (circle)
• Red lines = contours of RSS
• Observations:
– In the case of RIDGE, the red lines cannot intersect the blue region at points where
one of the βj is Zero
– In the case of LASSO, the red lines can intersect the blue region at points where
one of the βj is Zero
47
LASSO v/s RIDGE Regression
• RIDGE Regression performs better when
– The response variable actually depends on a large number of
predictors, and
– The coefficients are all of similar size

• The LASSO performs better when
– The response variable actually depends on a small fraction of
the total number of predictors
– Most coefficients are either small or Zero

• For a given data set, cross-validation is used
– To determine which of the two methods is better
– There is no method available to decide a priori
48
Tuning Parameter Selection (λ)

49
Tuning Parameter Selection (λ)
• Choose a grid of λ values
• Compute the cross-validation error for each value of λ
• Select the value of λ for which the cross-validation
error is the smallest
• Re-fit the model using the selected value of λ and
all the available observations (see the sketch below)

This process is known as Hyperparameter Tuning

50
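A sketch of this hyperparameter-tuning loop, assuming the caret package (with its "ridge" method, which requires the elasticnet package) and the data frame dat from the earlier example; the grid values are illustrative:

library(caret)

ctrl <- trainControl(method = "cv", number = 10)
grid <- expand.grid(lambda = seq(0, 0.1, by = 0.01))   # grid of candidate lambda values

fit <- train(yp ~ ., data = dat, method = "ridge",
             trControl = ctrl, tuneGrid = grid)

fit$bestTune     # lambda with the smallest CV error
fit$results      # mean RMSE / R-squared for every lambda in the grid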
Tuning Parameter Selection Using CV
RIDGE Regression
• λ from 10 to 100

51
Tuning Parameter Selection Using CV
Variable seq: 3,2,1,6,5,4,7
RIDGE Regression
• λ from 10 to 100

52
Tuning Parameter Selection Using CV
RIDGE Regression
• λ from 0 to 10

53
Tuning Parameter Selection Using CV
Variable seq: 3,6,5,4,1,2
RIDGE Regression
• λ from 0 to 10

54
Tuning Parameter Selection Using CV
RIDGE Regression
• λ from 0 to 1

55
Tuning Parameter Selection Using CV
Variable seq: 3,6,5,4,1,2
RIDGE Regression
• λ from 0 to 1

56
Tuning Parameter Selection Using CV
RIDGE Regression
• λ from 0 to 0.1

57
Tuning Parameter Selection Using CV
Variable seq: 3,2,6,-2,5,4,2,1
RIDGE Regression
• λ from 0 to 0.1

58
Tuning Parameter Selection Using CV

59
Tuning Parameter Selection Using CV

Variable seq: 3,6,5,4,2,1

60
Best Values Calculation

Refer to the next slide to see how this
is arrived at from the CV “folds”

61
Best Values Calculation

mean(RMSE) = 0.8160154
mean(Rsquared) = 0.9220527

62
Parameter Tuning Using Cross Validation
Method Samples Min RMSE Max RMSE Sd
MLR 50 0.455 1.439 0.330
100 0.570 1.380 0.278
200 0.767 1.240 0.155
360 0.745 1.190 0.133
RIDGE 50 0.537 1.465 0.298
100 0.634 1.183 0.192
200 0.695 1.113 0.154
360 0.817 1.128 0.082
LASSO 50 0.580 1.450 0.322
100 0.680 1.310 0.202
200 0.790 1.130 0.116
360 0.794 1.132 0.128
ENET 50 0.425 1.691 0.419
100 0.558 1.630 0.312
200 0.615 1.143 0.171
360 0.775 1.121 0.129
63
Parameter Tuning Using Cross Validation
Samples Method Min RMSE Max RMSE Sd
50 MLR 0.455 1.439 0.330
RIDGE 0.537 1.465 0.298
LASSO 0.580 1.450 0.322
ENET 0.425 1.691 0.419
100 MLR 0.570 1.380 0.278
RIDGE 0.634 1.183 0.192
LASSO 0.680 1.310 0.202
ENET 0.558 1.630 0.312
200 MLR 0.767 1.240 0.155
RIDGE 0.695 1.113 0.154
LASSO 0.790 1.130 0.116
ENET 0.615 1.143 0.171
360 MLR 0.745 1.190 0.133
RIDGE 0.817 1.128 0.082
LASSO 0.794 1.132 0.128
ENET 0.775 1.121 0.129
64
Model Selection
Will the Linear Model help us understand reality?
• Are all the predictors required?
• Are there any methods to weed out irrelevant
predictors?
– Can the Beta values of irrelevant variables
automatically be set to zero?
– (This is known as feature selection)
• Methods
– Subset selection
– Shrinkage and Model Selection
– Dimension reduction
– (These methods are applicable to Classification as well)
65
Model Selection Using LASSO

66
LASSO: Results

Steps

67
Steps

68
69
LASSO Results: Comparison with correlation

70
LASSO Results: Comparison with correlation

AUTOMATIC FEATURE SELECTION


71
Final Linear Model (After LASSO)

y = 2.421 * x3n + 0.157 * x5n + 0.540 * x6n

AUTOMATIC FEATURE SELECTION


72
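A sketch of how such a final model can be read off a cross-validated LASSO fit (assuming glmnet and the data frame dat; the slides' own run used a different tool, so the selected variables and coefficient values above are from the slides, not from this sketch):

library(glmnet)

X   <- as.matrix(dat[, -1])
cvf <- cv.glmnet(X, dat$yp, alpha = 1, nfolds = 10)
b   <- coef(cvf, s = "lambda.1se")        # coefficients at the one-SE lambda
b[b[, 1] != 0, , drop = FALSE]            # the non-zero terms define the final linear model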
Principal Component Analysis
• What is the optimal number of independent
variables required to capture the variance in
data?
– This question is answered by PCA

73
The Problem
The data
• x = 1:360
• Create six predictors, then scale them:
  x1 = x                  x1n = scale(x1)
  x2 = x1 * x1            x2n = scale(x2)
  x3 = x1 * x1 * x1       x3n = scale(x3)
  x4 = sin(x1 * pi/180)   x4n = scale(x4)
  x5 = cos(x1 * pi/180)   x5n = scale(x5)
  x6 = tanh(x1 * pi/180)  x6n = scale(x6)

yp = x1n + x2n + x3n + x4n + x5n + x6n + rnorm(360, 0, 1)
74
Data Visualization

75
Principal Component Analysis
• What is the optimal number of independent variables required to
capture the variance in data?
– This question is answered by PCA
• PCA results for the generated data set

76
Principal Component Analysis
• What is the optimal number of independent variables required to
capture the variance in data?
– This question is answered by PCA
• PCA results for the generated data set

77
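A minimal sketch of how such PCA results can be produced, assuming base R's prcomp and the data frame dat from the earlier example:

pca <- prcomp(dat[, -1], center = TRUE, scale. = TRUE)   # PCA on the six scaled predictors
summary(pca)                     # proportion of variance explained by each principal component
screeplot(pca, type = "lines")   # scree plot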
Exercise
• For the Challenge problem, use LASSO to select
the correct operators

78
The “Challenge” problem: Recap
• The problem
– Given the data set {Y, x1, x2} … as shown below
– Find out the nature of the function f, where
• y = f(x1, x2)

79
The Challenge Problem: Recap
• Creation of additional predictors using FE:
– Assume y depends on additional variables derived
from x1 and x2
– For example: assume y depends on the following
original and assumed (derived) variables
• x1 x2 (the original variables)
• x3 = x1², x4 = x2²
• x5 = sin(x1) x6 = cos(x1)
• x7 = tan(x1) x8 = tanh(x1)
• x9 = sin(x2) x10 = cos(x2)
• x11 = tan(x2) x12 = tanh(x2)
• x13 = log(x1 + 10) x14 = log(x2 + 10)
• x15 = exp(x1) x16 = exp(x2)
80
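A sketch of this feature-engineering step in R; the data-frame name cp_dat and its columns y, x1, x2 are assumptions for illustration:

# Derived predictors for the challenge problem (cp_dat is a hypothetical data frame with x1, x2)
cp_dat <- within(cp_dat, {
  x3  <- x1^2;          x4  <- x2^2
  x5  <- sin(x1);       x6  <- cos(x1)
  x7  <- tan(x1);       x8  <- tanh(x1)
  x9  <- sin(x2);       x10 <- cos(x2)
  x11 <- tan(x2);       x12 <- tanh(x2)
  x13 <- log(x1 + 10);  x14 <- log(x2 + 10)
  x15 <- exp(x1);       x16 <- exp(x2)
})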
The Challenge Problem: Solutions
Earlier solution:
• Was implemented using ‘backward selection’

81
The Challenge Problem: Solutions
Earlier solution:
• Was implemented using ‘backward selection’

Can we have the LASSO do this automatically?


82
LASSO Based Solution
• Cost function to be minimized:
RSS = Σi ( yi – β0 – Σj βj xij )², subject to Σj |βj| ≤ s · Σj |β̂j(OLS)|
(the constrained form of the LASSO; the ‘fraction’ is s)

• LASSO
– Given a value of the ‘fraction’
– Calculates all the βj that minimize the above cost function
– Multiple values of the fraction are tried out … and the βj
calculated
– The results are then plotted
83
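A sketch of this fraction-parameterized LASSO fit, assuming the lars package and the hypothetical cp_dat built above:

library(lars)

Xc  <- as.matrix(cp_dat[, paste0("x", 1:16)])
fit <- lars(Xc, cp_dat$y, type = "lasso")
plot(fit)    # coefficient paths v/s the fraction |beta|_1 / max |beta|_1

# Coefficients at an illustrative fraction of 0.5
predict(fit, s = 0.5, mode = "fraction", type = "coefficients")$coefficients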
LASSO Based Solution …
• LASSO: coefficients v/s fraction plot

• Observation:
– Only coefficients of x3 and x4 are active throughout
– LASSO has flagged only x3 and x4 to be relevant
• See next slide
84
LASSO Based Solution … Steps & Coefficients

85
LASSO Based Solution …
• LASSO: coefficients v/s active predictors (dof) plot

• Observation:
– Full solution is reached with 2 predictors
– LASSO has flagged that 2 predictors (x3 and x4) are
important
86
The Challenge Problem: PCA

87
The Challenge Problem: PCA

PCA too indicates that only two principal
axes practically capture all of the variance
in the predictors

88
