
Regularization and Model Selection

Modifications To OLS Model … Required:
• To improve prediction accuracy
– Regularization
• To improve model interpretability
– Model Selection

2
Need for Regularization …
• Model Variance
– If the number of observations far exceeds the number of
predictors
• The variance of OLS models is low
– If the number of observations exceeds the number of
predictors, but not by much
• The variance of OLS models may be large
– If the number of observations is less than the number of predictors
• There is no unique OLS solution

3
Model Variance

[Slides 4–9: plots of the same OLS fit on six different sets of training points; the model changes drastically with different sets of training points]

9
Remedy: if there aren’t enough observations
• Constrain the coefficients
– This increases bias
– But improves the prediction accuracy of OLS models
• By reducing variability

• Coefficients can be constrained using the following methods:
– RIDGE Regression (L2 penalty)
– LASSO (L1 penalty)
– Elastic Net Regression (L1 and L2 penalties)

10
Example
• The following example illustrates:
– Effect of sample size on model accuracy
– Use of cross validation (CV)

11
The Problem
The data
• x = 1:360
• Create six predictors, then scale them:
  x1 = x                  x1n = scale(x1)
  x2 = x1 * x1            x2n = scale(x2)
  x3 = x1 * x1 * x1       x3n = scale(x3)
  x4 = sin(x1 * pi/180)   x4n = scale(x4)
  x5 = cos(x1 * pi/180)   x5n = scale(x5)
  x6 = tanh(x1 * pi/180)  x6n = scale(x6)

yp = x1n + x2n + x3n + x4n + x5n + x6n + rnorm(360, 0, 1)
12
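A minimal, runnable R sketch of this data-generation step (the seed and the data-frame name dat are assumptions, not from the slides):

set.seed(1)                       # assumed seed, for reproducibility
x  <- 1:360
x1 <- x;  x2 <- x1 * x1;  x3 <- x1 * x1 * x1
x4 <- sin(x1 * pi / 180);  x5 <- cos(x1 * pi / 180);  x6 <- tanh(x1 * pi / 180)

# Centre and standardize each predictor
x1n <- scale(x1); x2n <- scale(x2); x3n <- scale(x3)
x4n <- scale(x4); x5n <- scale(x5); x6n <- scale(x6)

# Response: sum of the scaled predictors plus Gaussian noise
yp  <- x1n + x2n + x3n + x4n + x5n + x6n + rnorm(360, 0, 1)
dat <- data.frame(yp = as.numeric(yp),
                  x1n = as.numeric(x1n), x2n = as.numeric(x2n), x3n = as.numeric(x3n),
                  x4n = as.numeric(x4n), x5n = as.numeric(x5n), x6n = as.numeric(x6n))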
Data Visualization

13
Example
• The following example illustrates:
– Effect of sample size on model accuracy
– Use of Cross Validation (CV)
• Explanation:
– Models are created using the following sample sizes
• 50, 100, 200, 360
– 10-fold cross-validation is used to create multiple
models from every sample size, and the errors are
evaluated (see the sketch below)

14
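A sketch of how such an experiment could be run, assuming the caret package and the data frame dat built above (taking the first n rows as the sample is an assumption; the slides do not say how the samples were drawn):

library(caret)

cv_rmse <- function(n) {
  ctrl <- trainControl(method = "cv", number = 10)          # 10-fold CV
  fit  <- train(yp ~ ., data = dat[1:n, ], method = "lm",   # plain MLR on the first n rows
                trControl = ctrl)
  fit$resample$RMSE                                         # per-fold RMSE values
}

sapply(c(50, 100, 200, 360), function(n) {
  r <- cv_rmse(n)
  c(samples = n, min = min(r), max = max(r), sd = sd(r))
})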
Effect of Sample Size on Model Accuracy
• Sample size: 50; Number of folds: 10

• RMSE Range : [0.455, 1.439]


• RMSE SD : 0.330
15
Effect of Sample Size on Model Accuracy
• Sample Size: 100; Number of folds: 10

• RMSE Range : [0.57, 1.38]


• RMSE SD : 0.278
16
Effect of Sample Size on Model Accuracy
• Sample Size: 200; Number of folds: 10

• RMSE Range : [0.767, 1.24]


• RMSE SD : 0.155
17
Effect of Sample Size on Model Accuracy
• Sample Size: 360 (all observations); Number of folds: 10

• RMSE Range : [0.745, 1.19]


• RMSE SD : 0.133
18
Effect of Sample Size on Model Accuracy

Method Samples Min RMSE Max RMSE Sd


MLR 50 0.455 1.439 0.330
100 0.570 1.380 0.278
200 0.767 1.240 0.155
360 0.745 1.190 0.133

19
Effect of Sample Size on Model Accuracy

Method Samples Min RMSE Max RMSE Sd


MLR 50 0.455 1.439 0.330
100 0.570 1.380 0.278
200 0.767 1.240 0.155
360 0.745 1.190 0.133

Is it possible to reduce model variability / variance?

Regularization is the answer


(it increases Bias to reduce Variance)

20
Another Important Aspect: Model Selection
• This aspect addresses the question:
– How many independent variables (predictors) are
really required to create an adequately good model?

21
Correlation Analysis

22
Correlation Analysis

23
Model Selection
Will the Linear Model help us understand reality?
• Are all the predictors required?
• Are there any methods to weed out irrelevant
predictors?
– Can the Beta values of irrelevant variables
automatically be set to zero?
– (This is known as feature selection)
• Methods
– Subset selection
– Shrinkage
– Dimension reduction
– (These methods are applicable to Classification as well)
24
Model Selection
Will the Linear Model help us understand reality?
• Are all the predictors required?
• Are there any methods to weed out irrelevant
predictors?
– Can the Beta values of irrelevant variables
automatically be set to zero?
– (This is known as feature selection)
• Methods
– Subset selection
– Shrinkage
– Dimension reduction
– (These methods are applicable to Classification as well)
25
Subset Selection
• Best subset selection
– Involves exhaustive model building
– From zero predictors to all predictors
– Results in the creation of 2^p models
– Computationally very expensive
– Infeasible for a large number of predictors

26
Forward Stepwise Selection
• Starts with zero predictors (i.e., the model that predicts the mean)
• Adds predictors to the model – one at a time
– Searches for the variable that gives the best additional
improvement to the model
• Results in the creation of 1 + p(p+1)/2 models
– Where ‘p’ = number of predictors
• Advantage:
– Can be used even when observations < predictors
• Limitation:
– Not guaranteed to identify the best possible model
– (When compared with the exhaustive method)
27
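A minimal sketch of forward stepwise selection in R, assuming the leaps package and the data frame dat from the earlier example:

library(leaps)

# Forward stepwise search over the six scaled predictors
fwd <- regsubsets(yp ~ ., data = dat, nvmax = 6, method = "forward")
summary(fwd)$outmat   # shows which predictor enters the model at each step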
Backward Stepwise Selection
• Begins with full OLS model
• Exhaustively searches and removes the least
useful predictor
– One at a time
• Results in the creation of 1 + p(p+1)/2 models
• Disadvantage
– Cannot be used if observations < predictors

Hybrid forward / backward approaches can be used

28
Optimal Model Selection
How to choose the optimal model?
• The following calculated values can be used
– Residual Sum of Squares of the errors (RSS)
– R-squared
• But RSS always drops when additional predictors are used,
and R-squared always increases!
– So we would always end up with all the predictors!
• Remedy?
– Adjust the training error to account for over-fitting
– Use validation-set or cross-validation approaches
29
Adjustment of Training Error
The following methods are used:
• Mallows Cp
• AIC – Akaike Information Criterion
• BIC – Bayesian Information Criterion
• Adjusted R2

30
Mallows Cp

Cp = (RSS + 2 d σ̂²) / n

• n = Number of observations
• RSS = Residual Sum of Squares of the errors (Training)
• d = Number of predictors
• σ̂² = Estimate of the variance of the error ε associated with each
response measurement

Cp takes on small values for models with low test error


Choose model with lowest Cp value
31
Akaike Information Criterion (AIC)

AIC = (RSS + 2 d σ̂²) / (n σ̂²)

• AIC is defined for the class of problems fit by
Maximum Likelihood Estimation (MLE)
– For Gaussian errors, MLE and Least Squares are the same
• As can be seen, AIC and Cp are proportional

The lower the AIC, the lower the expected test error
Select the model with the lowest AIC
32
Bayesian Information Criterion (BIC)

BIC = (RSS + log(n) d σ̂²) / n

• When compared with Cp, BIC replaces the ‘2’ with log(n)
– For n > 7, log(n) > 2
– BIC thus imposes a larger penalty on models with
many variables
– It selects smaller models than Cp

Low BIC → Lower expected test error


33
Adjusted R2

Adjusted R2 = 1 – [RSS / (n – d – 1)] / [TSS / (n – 1)]

• Adjusted R2 : a price is paid for including
unnecessary variables in the model

The higher the Adjusted R2, the better the model

34
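A short R sketch of how these metrics can be read off a best-subset search (assuming the leaps package and the data frame dat from the earlier example; AIC is proportional to Cp here, so it is not listed separately):

library(leaps)

best <- regsubsets(yp ~ ., data = dat, nvmax = 6)   # best subset of each size
s    <- summary(best)
data.frame(size = 1:6, Cp = s$cp, BIC = s$bic, AdjR2 = s$adjr2)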
The Metric v/s Predictors Plot
• The metrics v/s predictors plot will be as follows:

35
One-Standard-Error Rule
• A typical metric v/s model-size plot
– Should we always select the model which gives us the
lowest value of the metric?
– Can we select the model with 4 or 3 predictors, instead of 6?

• The one-standard-error rule
– If there is another model with fewer predictors
– And its metric is within one standard error of the minimum
– Select that model …
36
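For the shrinkage methods discussed later, this rule is available directly; a sketch assuming the glmnet package and the data frame dat from the earlier example:

library(glmnet)

X   <- as.matrix(dat[, -1])                           # predictors (yp is column 1)
cvf <- cv.glmnet(X, dat$yp, alpha = 1, nfolds = 10)   # 10-fold CV over the LASSO path
cvf$lambda.min   # lambda giving the lowest CV error
cvf$lambda.1se   # largest lambda within one SE of that minimum (a simpler model)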
Validation and Cross Validation
• Validation set and Cross Validation: directly
operate on the actual data
– Hence these are better methods when compared with
Cp, AIC, BIC and Adjusted R2
– But computationally expensive

• Previously, computation was really expensive
– Hence methods like Cp, AIC, BIC and Adjusted R2 were used

• Now, the validation-set and cross-validation approaches are preferred


37
Model Selection
Will the Linear Model help us understand reality?
• Are all the predictors required?
• Are there any methods to weed out irrelevant
predictors?
– Can the Beta values of irrelevant variables
automatically be set to zero?
– (This is known as feature selection)
• Methods
– Subset selection
– Shrinkage
– Dimension reduction
– (These methods are applicable to Classification as well)
38
Shrinkage Methods
The following steps are involved in shrinkage:
• Start with all the predictors
• Instead of plain OLS, use modifications that:
– Force the Beta values to shrink, thereby also reducing
variance

• Effective and popular shrinkage techniques:
– Ridge Regression
– LASSO (Least Absolute Shrinkage and Selection Operator)

Shrinking the coefficients also reduces their variance
39
Ridge Regression (controls variance)
Coefficients are estimated by minimizing:

RSS + λ Σj βj²  =  Σi ( yi – β0 – Σj βj xij )² + λ Σj βj²

• The second term is the shrinkage penalty
– λ ≥ 0 is a tuning parameter: multiple values have to
be tried in order to select the best
– Note: shrinkage is NOT applied to β0
• Effect of the shrinkage penalty
– It essentially shrinks the βj towards zero (… but never exactly zero)
Caution
Normalize (standardize) the predictors before Ridge Regression
40
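A minimal sketch of ridge regression, assuming the glmnet package and the data frame dat from the earlier example (alpha = 0 selects the pure L2 penalty):

library(glmnet)

X     <- as.matrix(dat[, -1])          # predictors (yp is column 1)
ridge <- glmnet(X, dat$yp, alpha = 0)  # standardize = TRUE by default
plot(ridge, xvar = "lambda")           # coefficient paths: all betas shrink towards zero
coef(ridge, s = 1)                     # coefficients at lambda = 1 (illustrative value)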
RIDGE Regression: Terminology
• L2 norm: ||β||2 = sqrt( Σj βj² )

• RIDGE Regression coefficient estimate: β̂R_λ = argmin over β of { RSS + λ Σj βj² }

• Shrinkage penalty: λ Σj βj² = λ ||β||2²
41
Ridge Regression
Effect of increasing λ on the βj values
Net effect:
1. Drives certain βj towards Zero (… but never Zero)
2. Effectively, reduces Variance by increasing Bias (Shrinkage)

As λ increases, the flexibility of the fit decreases (Bias increases) but Variance decreases

42
Exercise
Exercise:
• In general, see how the variance increases as the
number of observations becomes smaller
• See how Ridge Regression can reduce such
variance
• See how Ridge Regression can help when the
number of observations is not much larger than (or
is even smaller than) the number of predictors

43
LASSO (Least Absolute Shrinkage and Selection Operator)
The LASSO minimizes the following cost function:

RSS + λ Σj |βj|  =  Σi ( yi – β0 – Σj βj xij )² + λ Σj |βj|

• This results in some of the βj becoming exactly ‘0’
– Why?
• Therefore, in addition to shrinkage, the LASSO also
performs feature selection
– It yields sparse models
– The value of λ governs the de-selection of predictors

44
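A minimal sketch of the LASSO, again assuming glmnet and the data frame dat (alpha = 1 selects the pure L1 penalty):

library(glmnet)

X     <- as.matrix(dat[, -1])
lasso <- glmnet(X, dat$yp, alpha = 1)
plot(lasso, xvar = "lambda")      # some coefficients hit exactly zero as lambda grows
coef(lasso, s = 0.5)              # sparse coefficient vector at lambda = 0.5 (illustrative value)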
LASSO: Terminology
• L1 norm: ||β||1 = Σj |βj|

• LASSO coefficient estimate: β̂L_λ = argmin over β of { RSS + λ Σj |βj| }

• LASSO penalty: λ Σj |βj| = λ ||β||1

45
LASSO: Effect of λ on Predictors

• Some of the βj become exactly Zero as λ increases


• This has the effect of predictor / feature selection
– Models become simpler and more interpretable
– Useful if there are a large number of predictors to
begin with
46
Variable Selection: LASSO v/s RIDGE Regression

• LHS = constraint region of the LASSO (diamond); RHS = constraint region of RIDGE (circle)
• Red lines = contours of RSS
• Observations:
– In the case of RIDGE, the red lines cannot intersect the blue region at points where
one of the βj is Zero
– In the case of LASSO, the red lines can intersect the blue region at points where
one of the βj is Zero
47
LASSO v/s RIDGE Regression
• RIDGE Regression performs better when
– The response variable actually depends on a large number of
predictors, and
– The coefficients are all of similar size

• The LASSO performs better when
– The response variable actually depends on a small fraction of
the total number of predictors
– Most coefficients are either small or Zero

• For a given data set, cross-validation is used
– To determine which of the two methods is better
– There is no method available to decide a priori
48
Tuning Parameter Selection (λ)

49
Tuning Parameter Selection (λ)
• Choose a grid of λ values
• Compute the cross-validation error for each value of λ
• Select the value of λ for which the cross-validation
error is the smallest
• Re-fit the model using the selected value of λ and
all the available observations (see the sketch below)

This process is known as Hyperparameter Tuning

50
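A sketch of this hyperparameter-tuning loop, assuming the caret package (with its "ridge" method, which requires the elasticnet package) and the data frame dat from the earlier example; the grid values are illustrative:

library(caret)

ctrl <- trainControl(method = "cv", number = 10)
grid <- expand.grid(lambda = seq(0, 0.1, by = 0.01))   # grid of candidate lambda values

fit <- train(yp ~ ., data = dat, method = "ridge",
             trControl = ctrl, tuneGrid = grid)

fit$bestTune     # lambda with the smallest CV error
fit$results      # mean RMSE / R-squared for every lambda in the grid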
Tuning Parameter Selection Using CV
RIDGE Regression
• λ from 10 to 100

51
Tuning Parameter Selection Using CV
Variable seq: 3,2,1,6,5,4,7
RIDGE Regression
• λ from 10 to 100

52
Tuning Parameter Selection Using CV
RIDGE Regression
• λ from 0 to 10

53
Tuning Parameter Selection Using CV
Variable seq: 3,6,5,4,1,2
RIDGE Regression
• λ from 0 to 10

54
Tuning Parameter Selection Using CV
RIDGE Regression
• λ from 0 to 1

55
Tuning Parameter Selection Using CV
Variable seq: 3,6,5,4,1,2
RIDGE Regression
• λ from 0 to 1

56
Tuning Parameter Selection Using CV
RIDGE Regression
• λ from 0 to 0.1

57
Tuning Parameter Selection Using CV
Variable seq: 3,2,6,-2,5,4,2,1
RIDGE Regression
• λ from 0 to 0.1

58
Tuning Parameter Selection Using CV

59
Tuning Parameter Selection Using CV

Variable seq: 3,6,5,4,2,1

60
Best Values Calculation

Refer to the next slide to see how this
is arrived at from the CV “folds”

61
Best Values Calculation

mean(RMSE) = 0.8160154
mean(Rsquared) = 0.9220527

62
Parameter Tuning Using Cross Validation
Method Samples Min RMSE Max RMSE Sd
MLR 50 0.455 1.439 0.330
100 0.570 1.380 0.278
200 0.767 1.240 0.155
360 0.745 1.190 0.133
RIDGE 50 0.537 1.465 0.298
100 0.634 1.183 0.192
200 0.695 1.113 0.154
360 0.817 1.128 0.082
LASSO 50 0.580 1.450 0.322
100 0.680 1.310 0.202
200 0.790 1.130 0.116
360 0.794 1.132 0.128
ENET 50 0.425 1.691 0.419
100 0.558 1.630 0.312
200 0.615 1.143 0.171
360 0.775 1.121 0.129
63
Parameter Tuning Using Cross Validation
Samples Method Min RMSE Max RMSE Sd
50 MLR 0.455 1.439 0.330
RIDGE 0.537 1.465 0.298
LASSO 0.580 1.450 0.322
ENET 0.425 1.691 0.419
100 MLR 0.570 1.380 0.278
RIDGE 0.634 1.183 0.192
LASSO 0.680 1.310 0.202
ENET 0.558 1.630 0.312
200 MLR 0.767 1.240 0.155
RIDGE 0.695 1.113 0.154
LASSO 0.790 1.130 0.116
ENET 0.615 1.143 0.171
360 MLR 0.745 1.190 0.133
RIDGE 0.817 1.128 0.082
LASSO 0.794 1.132 0.128
ENET 0.775 1.121 0.129
64
Model Selection
Will the Linear Model help us understand reality?
• Are all the predictors required?
• Are there any methods to weed out irrelevant
predictors?
– Can the Beta values of irrelevant variables
automatically be set to zero?
– (This is known as feature selection)
• Methods
– Subset selection
– Shrinkage and Model Selection
– Dimension reduction
– (These methods are applicable to Classification as well)
65
Model Selection Using LASSO

66
LASSO: Results

Steps

67
Steps

68
69
LASSO Results: Comparison with correlation

70
LASSO Results: Comparison with correlation

AUTOMATIC FEATURE SELECTION


71
Final Linear Model (After LASSO)

y = 2.421 * x3n + 0.157 * x5n + 0.540 * x6n

AUTOMATIC FEATURE SELECTION


72
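A sketch of how such a final model can be read off a cross-validated LASSO fit (assuming glmnet and the data frame dat; the slides' own run used a different tool, so the selected variables and coefficient values above are from the slides, not from this sketch):

library(glmnet)

X   <- as.matrix(dat[, -1])
cvf <- cv.glmnet(X, dat$yp, alpha = 1, nfolds = 10)
b   <- coef(cvf, s = "lambda.1se")        # coefficients at the one-SE lambda
b[b[, 1] != 0, , drop = FALSE]            # the non-zero terms define the final linear model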
Principal Component Analysis
• What is the optimal number of independent
variables required to capture the variance in
data?
– This question is answered by PCA

73
The Problem
The data
• x = 1:360
• Create six predictors, then scale them:
  x1 = x                  x1n = scale(x1)
  x2 = x1 * x1            x2n = scale(x2)
  x3 = x1 * x1 * x1       x3n = scale(x3)
  x4 = sin(x1 * pi/180)   x4n = scale(x4)
  x5 = cos(x1 * pi/180)   x5n = scale(x5)
  x6 = tanh(x1 * pi/180)  x6n = scale(x6)

yp = x1n + x2n + x3n + x4n + x5n + x6n + rnorm(360, 0, 1)
74
Data Visualization

75
Principal Component Analysis
• What is the optimal number of independent variables required to
capture the variance in data?
– This question is answered by PCA
• PCA results for the generated data set

76
Principal Component Analysis
• What is the optimal number of independent variables required to
capture the variance in data?
– This question is answered by PCA
• PCA results for the generated data set

77
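A minimal sketch of how such PCA results can be produced, assuming base R's prcomp and the data frame dat from the earlier example:

pca <- prcomp(dat[, -1], center = TRUE, scale. = TRUE)   # PCA on the six scaled predictors
summary(pca)                     # proportion of variance explained by each principal component
screeplot(pca, type = "lines")   # scree plot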
Exercise
• For the Challenge problem, use LASSO to select
the correct operators

78
The “Challenge” problem: Recap
• The problem
– Given the data set {Y, x1, x2} … as shown below
– Find out the nature of the function f, where
• y = f(x1, x2)

79
The Challenge Problem: Recap
• Creation of additional predictors using FE:
– Assume y depends on additional variables derived
from x1 and x2
– For example: assume y depends on the following
original and assumed (derived) variables
• x1 x2 (the original variables)
• x3 = x1², x4 = x2²
• x5 = sin(x1) x6 = cos(x1)
• x7 = tan(x1) x8 = tanh(x1)
• x9 = sin(x2) x10 = cos(x2)
• x11 = tan(x2) x12 = tanh(x2)
• x13 = log(x1 + 10) x14 = log(x2 + 10)
• x15 = exp(x1) x16 = exp(x2)
80
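A sketch of this feature-engineering step in R; the data-frame name cp_dat and its columns y, x1, x2 are assumptions for illustration:

# Derived predictors for the challenge problem (cp_dat is a hypothetical data frame with x1, x2)
cp_dat <- within(cp_dat, {
  x3  <- x1^2;          x4  <- x2^2
  x5  <- sin(x1);       x6  <- cos(x1)
  x7  <- tan(x1);       x8  <- tanh(x1)
  x9  <- sin(x2);       x10 <- cos(x2)
  x11 <- tan(x2);       x12 <- tanh(x2)
  x13 <- log(x1 + 10);  x14 <- log(x2 + 10)
  x15 <- exp(x1);       x16 <- exp(x2)
})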
The Challenge Problem: Solutions
Earlier solution:
• Was implemented using ‘backward selection’

81
The Challenge Problem: Solutions
Earlier solution:
• Was implemented using ‘backward selection’

Can we have the LASSO do this automatically?


82
LASSO Based Solution
• Cost function to be minimized:
RSS = Σi ( yi – β0 – Σj βj xij )², subject to Σj |βj| ≤ s · Σj |β̂j(OLS)|
(the constrained form of the LASSO; the ‘fraction’ is s)

• LASSO
– Given a value of the ‘fraction’
– Calculates all the βj that minimize the above cost function
– Multiple values of the fraction are tried out … and the βj
calculated
– The results are then plotted
83
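A sketch of this fraction-parameterized LASSO fit, assuming the lars package and the hypothetical cp_dat built above:

library(lars)

Xc  <- as.matrix(cp_dat[, paste0("x", 1:16)])
fit <- lars(Xc, cp_dat$y, type = "lasso")
plot(fit)    # coefficient paths v/s the fraction |beta|_1 / max |beta|_1

# Coefficients at an illustrative fraction of 0.5
predict(fit, s = 0.5, mode = "fraction", type = "coefficients")$coefficients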
LASSO Based Solution …
• LASSO: coefficients v/s fraction plot

• Observation:
– Only coefficients of x3 and x4 are active throughout
– LASSO has flagged only x3 and x4 to be relevant
• See next slide
84
LASSO Based Solution … Steps & Coefficients

85
LASSO Based Solution …
• LASSO: coefficients v/s active predictors (dof) plot

• Observation:
– Full solution is reached with 2 predictors
– LASSO has flagged that 2 predictors (x3 and x4) are
important
86
The Challenge Problem: PCA

87
The Challenge Problem: PCA

PCA too indicates that only two principal
axes practically capture all of the variance
in the predictors

88
