
MH4510 - Statistical Learning and Data Mining - AY1819 S1 Lab 07

MH4510 - Regularization Method


Matthew Zakharia Hadimaja

28th September 2018 (Fri) - Regularization Method


Course instructor : PUN Chi Seng
Lab instructor : Matthew Zakharia Hadimaja

References
Chapter 6.6, [ISLR] An Introduction to Statistical Learning (with Applications in R). Free access to download
the book: http://www-bcf.usc.edu/~gareth/ISL/
To see the help file of a function funcname, type ?funcname.

1. Preparation

Load dataset
library(ISLR)
data(Hitters)
Hitters <- na.omit(Hitters)

glmnet does not use the formula interface: it expects a numeric predictor matrix and a response vector. Therefore, we have to create them first.
# x, the predictor, has to be a numerical matrix
# model.matrix converts factors to a set of dummy variables
x <- model.matrix(Salary ~ ., Hitters)[, -1]
head(x)
# y, the output, has to be a vector
y <- Hitters$Salary
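As a quick sanity check (a short sketch, not part of the original lab), we can confirm that the conversion produced what glmnet expects:

dim(x)          # number of observations and dummy-encoded predictor columns
length(y)       # should equal nrow(x)
is.numeric(x)   # TRUE: glmnet needs a numeric matrix
is.numeric(y)   # TRUE: and a numeric response vector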

2. Ridge Regression

The penalty is defined as $\frac{1-\alpha}{2}\lVert\beta\rVert_2^2 + \alpha\lVert\beta\rVert_1$. Therefore, for ridge regression, alpha is set to 0.
library(glmnet)
grid <- 10 ^ seq(10, -2, length = 100) # lambda from 10^10 to 10^-2, logarithmically scaled
ridge.mod <- glmnet(x, y, alpha = 0, lambda = grid)
names(ridge.mod) # read ?glmnet for details
dim(coef(ridge.mod))
par(mfrow = c(1,2))
plot(ridge.mod, xvar = 'norm')
plot(ridge.mod, xvar = 'lambda')


Large vs small lambda

Large lambda
ridge.mod$lambda[50]
coef(ridge.mod)[, 50]
sqrt(sum(coef(ridge.mod)[-1, 50] ^ 2)) # l2-norm of the coefficients

Small lambda
ridge.mod$lambda[60]
coef(ridge.mod)[, 60]
sqrt(sum(coef(ridge.mod)[-1, 60] ^ 2))

Predict the coefficients at a new lambda value, supplied through the argument s.


predict(ridge.mod, s = 50, type = "coefficients")[1:20, ]

For lambda = 0 or lambda = Inf, what model does the algorithm produce?
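One way to explore this question numerically (a hedged sketch, not part of the handout; in recent glmnet versions, predict with exact = TRUE refits at the requested lambda and needs x and y supplied again):

coef.ols <- coef(lm(y ~ x))                                        # unpenalised least squares, for reference
coef.l0  <- predict(ridge.mod, s = 0, type = "coefficients",
                    exact = TRUE, x = x, y = y)                    # lambda = 0: no shrinkage
coef.inf <- predict(ridge.mod, s = 1e10, type = "coefficients")    # very large lambda: maximal shrinkage
cbind(ols = coef.ols, lambda0 = as.vector(coef.l0), lambdaInf = as.vector(coef.inf))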

3. LASSO

Same as ridge, but alpha = 1 now. Notice that the coefficients can be exactly zero.
lasso.mod <- glmnet(x, y, alpha = 1, lambda = grid)
par(mfrow = c(1,2))
plot(lasso.mod, xvar = 'norm')
plot(lasso.mod, xvar = 'lambda')
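To see the sparsity directly, we can count the nonzero coefficients at each lambda (a small sketch; the df field of the fitted glmnet object stores this count):

lasso.mod$df                          # number of nonzero coefficients at each lambda
colSums(coef(lasso.mod)[-1, ] != 0)   # the same count by hand, excluding the intercept row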

Large vs small lambda

Large lambda
lasso.mod$lambda[50]
coef(lasso.mod)[, 50]
sqrt(sum(coef(lasso.mod)[-1, 50] ^ 2))

Small lambda
lasso.mod$lambda[80]
coef(lasso.mod)[, 80]
sqrt(sum(coef(lasso.mod)[-1, 80] ^ 2))

Predict the coefficients at a new lambda value, supplied through the argument s.


predict(lasso.mod, s = 50, type = "coefficients")[1:20, ]

Cross validation

Use cross-validation to choose the best lambda.
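The code below indexes the data with a vector train, which is not defined in this handout. A minimal sketch of the usual ISLR-style split (assuming a random 50/50 training/test split) is:

set.seed(1)                               # for reproducibility
train <- sample(1:nrow(x), nrow(x) / 2)   # row indices of the training set
# the test set is then x[-train, ] and y[-train]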


lasso.cv <- cv.glmnet(x[train, ], y[train], alpha = 1)
plot(lasso.cv)
(lasso.bestlam <- lasso.cv$lambda.min)
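As an aside, cv.glmnet also stores lambda.1se, the largest lambda whose CV error is within one standard error of the minimum; it is sometimes preferred as a more conservative (sparser) choice:

lasso.cv$lambda.1se   # a sparser alternative to lambda.min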

Refit using the whole training set


lasso.cvmod <- glmnet(x[train, ], y[train], alpha = 1, lambda = lasso.bestlam)


lasso.coef <- coef(lasso.cvmod)
lasso.coef
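To list only the variables LASSO keeps (a small sketch; lasso.coef is a sparse one-column matrix, so we convert it first):

beta <- as.matrix(lasso.coef)          # plain matrix with variable names as row names
beta[beta[, 1] != 0, , drop = FALSE]   # the selected (nonzero) coefficients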

Predict on the test set


lasso.pred <- predict(lasso.cvmod, newx = x[-train, ])
mean((lasso.pred - y[-train]) ^ 2)
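For context, this test MSE can be compared with that of a naive baseline which always predicts the training-set mean salary (a sketch using the same split):

mean((mean(y[train]) - y[-train]) ^ 2)   # test MSE of the constant (intercept-only) prediction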

4. Tutorial

Explain how K-fold cross-validation is implemented for ridge regression / LASSO with scaling. Please specify how to compute the cross-validation function and how the scaling is implemented.
Below is pseudo-code for CV without scaling; note that it does not represent the cv.glmnet function. Modify the pseudo-code to answer the question above. An R sketch of the unmodified pseudo-code is given after the list.
Suppose L is a vector containing the lambda values to try, and X is our data.

1. set the CV error for each lambda: CV[lambda] = 0 for every lambda in L
2. split X into a training set Tr and a test set Te
3. split Tr into K random parts of equal size, Tr[k], k = 1, 2, ..., K
4. for k in 1:K
   1. set Tr[-k] as the k-th pseudo training set, pTr[k]
   2. set Tr[k] as the k-th pseudo test set, pTe[k]
   3. for lambda in L
      1. perform ridge regression / LASSO on pTr[k] with lambda
      2. evaluate the test error on pTe[k]
      3. CV[lambda] = CV[lambda] + test error
5. choose the lambda that minimises CV[lambda], call it lambda*
6. refit the whole model on Tr with lambda*
7. check the performance on Te
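Below is a hedged R sketch of the pseudo-code above (without scaling), reusing x, y, grid and train from earlier and assuming K = 10 folds with ridge regression; it is for illustration only and does not reproduce cv.glmnet.

K <- 10
set.seed(2)                                                       # fold assignment is random
folds <- sample(rep(1:K, length.out = length(train)))             # step 3: split Tr into K parts
cv.err <- rep(0, length(grid))                                    # step 1: CV[lambda] = 0
for (k in 1:K) {
  pTr <- train[folds != k]                                        # step 4.1: k-th pseudo training set
  pTe <- train[folds == k]                                        # step 4.2: k-th pseudo test set
  fit <- glmnet(x[pTr, ], y[pTr], alpha = 0, lambda = grid)       # step 4.3.1 (alpha = 1 for LASSO)
  pred <- predict(fit, newx = x[pTe, ])                           # one column of predictions per lambda
  cv.err <- cv.err + colMeans((pred - y[pTe]) ^ 2)                # steps 4.3.2-4.3.3: accumulate test error
}
best.lam <- grid[which.min(cv.err)]                               # step 5: lambda*
final <- glmnet(x[train, ], y[train], alpha = 0, lambda = best.lam)   # step 6: refit on Tr
mean((predict(final, newx = x[-train, ]) - y[-train]) ^ 2)            # step 7: performance on Te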
