Cross Validation


VALIDATION SET APPROACH

PROBLEM:
Write an R program to train a model for the Swiss data set using the validation set approach to
cross validation, and predict the variable Fertility.
AIM:
To perform the validation set approach on a given data set.

CODE IN R LANGUAGE:
library(caret)
library(tidyverse)
head(swiss) #load the data
str(swiss) #to get the structure of the data set
set.seed(123)
#to divide the data set into test-train sets
train_set=swiss$Fertility %>% createDataPartition(p=0.8,list = FALSE)
train_data=swiss[train_set,]
test_data=swiss[-train_set,]
model=lm(Fertility~.,data=train_data) #build the model
#make predictions
predictions=model%>%predict(test_data)
predictions
data.frame(R2=R2(predictions,test_data$Fertility),RMSE=RMSE(predictions,test_data$Fertility),MAE=MAE(predictions,test_data$Fertility))

OUTPUT:
> library(caret)
> library(tidyverse)
> head(swiss) #load the data
Fertility Agriculture Examination Education Catholic
Courtelary 80.2 17.0 15 12 9.96
Delemont 83.1 45.1 6 9 84.84
Franches-Mnt 92.5 39.7 5 5 93.40
Moutier 85.8 36.5 12 7 33.77
Neuveville 76.9 43.5 17 15 5.16
Porrentruy 76.1 35.3 9 7 90.57
Infant.Mortality
Courtelary 22.2
Delemont 22.2
Franches-Mnt 20.2
Moutier 20.3
Neuveville 20.6
Porrentruy 26.6
> str(swiss) #to get the structure of the data set
'data.frame': 47 obs. of 6 variables:
$ Fertility : num 80.2 83.1 92.5 85.8 76.9 76.1 83.8 92.4 82.4 82.9 ...
$ Agriculture : num 17 45.1 39.7 36.5 43.5 35.3 70.2 67.8 53.3 45.2 ...
$ Examination : int 15 6 5 12 17 9 16 14 12 16 ...
$ Education : int 12 9 5 7 15 7 7 8 7 13 ...
$ Catholic : num 9.96 84.84 93.4 33.77 5.16 ...
$ Infant.Mortality: num 22.2 22.2 20.2 20.3 20.6 26.6 23.6 24.9 21 24.4 ...
> set.seed(123)
> #to divide the data set into test-train sets
> train_set=swiss$Fertility %>% createDataPartition(p=0.8,list = FALSE)
> train_data=swiss[train_set,]
> test_data=swiss[-train_set,]
> model=lm(Fertility~.,data=train_data) #build the model
> #make predictions
> predictions=model%>%predict(test_data)
> predictions
Glane Sarine Aigle Avenches Payerne Rolle Entremont
79.87755 78.07547 59.22761 64.51491 71.03743 62.55682 76.96442
Martigwy
76.22415
> data.frame(R2=R2(predictions,test_data$Fertility),RMSE=RMSE(predictions,test_data$Fertility),MAE=MAE(predictions,test_data$Fertility))
R2 RMSE MAE
0.5946201 6.410914 5.651552

RESULT:
The validation set approach is used to train a linear regression model on the Swiss data set. The
model has an R-squared value of 0.594 and an RMSE of 6.41.
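
The caret helpers R2, RMSE and MAE used above reduce to one-line formulas. A minimal base-R sketch on toy vectors (not the swiss predictions) shows what each metric computes; note that caret's R2 is the squared correlation between predictions and observations:

```r
pred <- c(2, 4, 6)   # toy predictions, for illustration only
obs  <- c(1, 5, 7)   # toy observed values

rmse <- sqrt(mean((pred - obs)^2))   # root mean squared error
mae  <- mean(abs(pred - obs))        # mean absolute error
r2   <- cor(pred, obs)^2             # squared correlation, as caret computes R2

c(RMSE = rmse, MAE = mae, R2 = r2)   # RMSE and MAE are both 1 here
```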
LEAVE-ONE-OUT CROSS VALIDATION

PROBLEM:
Write an R program to train a model for the Swiss data set by leave-one-out cross validation and
predict the variable Fertility for unseen data.

AIM:
To study the leave-one-out cross validation method to train a model for a given data set.
CODE IN R LANGUAGE:
library(caret)
library(tidyverse)
head(swiss) #load the data
str(swiss) #to get the structure of the data set
set.seed(123)
method=trainControl(method = "LOOCV") #leave-one-out method
model1=train(Fertility~.,data=swiss,method="lm",trControl=method) #train the model
print(model1) #summarise the model
OUTPUT:
> library(caret)
> library(tidyverse)
> head(swiss) #load the data
Fertility Agriculture Examination Education Catholic
Courtelary 80.2 17.0 15 12 9.96
Delemont 83.1 45.1 6 9 84.84
Franches-Mnt 92.5 39.7 5 5 93.40
Moutier 85.8 36.5 12 7 33.77
Neuveville 76.9 43.5 17 15 5.16
Porrentruy 76.1 35.3 9 7 90.57
Infant.Mortality
Courtelary 22.2
Delemont 22.2
Franches-Mnt 20.2
Moutier 20.3
Neuveville 20.6
Porrentruy 26.6
> str(swiss) #to get the structure of the data set
'data.frame': 47 obs. of 6 variables:
$ Fertility : num 80.2 83.1 92.5 85.8 76.9 76.1 83.8 92.4 82.4 82.9 ...
$ Agriculture : num 17 45.1 39.7 36.5 43.5 35.3 70.2 67.8 53.3 45.2 ...
$ Examination : int 15 6 5 12 17 9 16 14 12 16 ...
$ Education : int 12 9 5 7 15 7 7 8 7 13 ...
$ Catholic : num 9.96 84.84 93.4 33.77 5.16 ...
$ Infant.Mortality: num 22.2 22.2 20.2 20.3 20.6 26.6 23.6 24.9 21 24.4 ...
> set.seed(123)
> method=trainControl(method = "LOOCV")
> #TRAIN MODEL
> model1=train(Fertility~.,data=swiss,method="lm",trControl=method)
> #summarize
> print(model1)
Linear Regression

47 samples
5 predictor
No pre-processing
Resampling: Leave-One-Out Cross-Validation
Summary of sample sizes: 46, 46, 46, 46, 46, 46, ...
Resampling results:

RMSE     Rsquared  MAE
7.738618 0.6128307 6.116021

Tuning parameter 'intercept' was held constant at a value of TRUE

RESULT:
Leave-one-out cross validation is used to train a linear regression model on the Swiss data set. The
model has an R-squared value of 0.612 and an RMSE of 7.738.
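
The LOOCV that caret runs can also be written as an explicit loop, which makes the method transparent: each of the 47 rows is held out once, the model is refit on the remaining 46, and the pooled prediction errors give the RMSE. A base-R sketch (the swiss data frame ships with R, so no packages are needed):

```r
n <- nrow(swiss)
errs <- numeric(n)
for (i in 1:n) {
  fit <- lm(Fertility ~ ., data = swiss[-i, ])                  # fit on n - 1 rows
  errs[i] <- swiss$Fertility[i] - predict(fit, swiss[i, , drop = FALSE])
}
sqrt(mean(errs^2))   # pooled LOOCV RMSE, matching caret's value of about 7.74
```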
K-FOLD CROSS VALIDATION

PROBLEM:
Write an R program to train a model for the Swiss data set by repeated k-fold cross validation and
predict the dependent variable.

AIM:
To study the k-fold cross validation method to train a model for a given data set.
CODE IN R LANGUAGE:
library(caret)
library(tidyverse)
head(swiss) #load the data
str(swiss) #to get the structure of the data set
set.seed(123)
method2=trainControl(method = "repeatedcv",number = 10,repeats = 3) #k-fold method
model2=train(Fertility~.,data=swiss,method="lm",trControl=method2) #train the model
print(model2) #Summary of the model

OUTPUT:
> library(caret)
> library(tidyverse)
> head(swiss) #load the data
Fertility Agriculture Examination Education Catholic
Courtelary 80.2 17.0 15 12 9.96
Delemont 83.1 45.1 6 9 84.84
Franches-Mnt 92.5 39.7 5 5 93.40
Moutier 85.8 36.5 12 7 33.77
Neuveville 76.9 43.5 17 15 5.16
Porrentruy 76.1 35.3 9 7 90.57
Infant.Mortality
Courtelary 22.2
Delemont 22.2
Franches-Mnt 20.2
Moutier 20.3
Neuveville 20.6
Porrentruy 26.6
> str(swiss) #to get the structure of the data set
'data.frame': 47 obs. of 6 variables:
$ Fertility : num 80.2 83.1 92.5 85.8 76.9 76.1 83.8 92.4 82.4 82.9 ...
$ Agriculture : num 17 45.1 39.7 36.5 43.5 35.3 70.2 67.8 53.3 45.2 ...
$ Examination : int 15 6 5 12 17 9 16 14 12 16 ...
$ Education : int 12 9 5 7 15 7 7 8 7 13 ...
$ Catholic : num 9.96 84.84 93.4 33.77 5.16 ...
$ Infant.Mortality: num 22.2 22.2 20.2 20.3 20.6 26.6 23.6 24.9 21 24.4 ...
> set.seed(123)
> method2=trainControl(method = "repeatedcv",number = 10,repeats = 3) #k-fold method
> model2=train(Fertility~.,data=swiss,method="lm",trControl=method2) #train the model
> print(model2) #Summary of the model
Linear Regression

47 samples
5 predictor
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 42, 42, 42, 42, 42, 44, ...
Resampling results:

RMSE     Rsquared  MAE
7.357186 0.6992415 6.15871

Tuning parameter 'intercept' was held constant at a value of TRUE

RESULT:

Repeated 10-fold cross validation is performed to train a linear regression model on the Swiss data set. The
model has an R-squared value of 0.699 and an RMSE of 7.357.
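
A single pass of the 10-fold procedure can be hand-rolled in base R to show what trainControl does on each repeat: rows are randomly assigned to 10 folds, each fold is held out in turn, and the held-out RMSE values are averaged. (caret additionally stratifies the folds and repeats the whole split three times, so its numbers differ slightly from this sketch.)

```r
set.seed(123)
k <- 10
folds <- sample(rep(1:k, length.out = nrow(swiss)))   # random fold labels 1..10
rmse_per_fold <- sapply(1:k, function(f) {
  fit <- lm(Fertility ~ ., data = swiss[folds != f, ])            # train on k-1 folds
  err <- swiss$Fertility[folds == f] - predict(fit, swiss[folds == f, ])
  sqrt(mean(err^2))                                               # held-out RMSE
})
mean(rmse_per_fold)   # cross-validated RMSE for this single 10-fold split
```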
BOOTSTRAPPING
PROBLEM:
Write an R program to perform bootstrapping on the Auto data set and evaluate the model.
AIM:
To perform bootstrapping on the given data set.
CODE IN R LANGUAGE:
library(ISLR) #the Auto data set comes from ISLR
library(boot) #for the boot() function
attach(Auto)
statistic <- function(data, index) {
lm.fit <- lm(mpg ~ horsepower, data = data, subset = index)
coef(lm.fit)
}

statistic(Auto, 1:392)
summary(lm(mpg ~ horsepower, data = Auto))
set.seed(123)
boot(Auto, statistic, 1000)
quad.statistic <- function(data, index) {
lm.fit <- lm(mpg ~ poly(horsepower, 2), data = data, subset = index)
coef(lm.fit)
}

set.seed(1)
boot(Auto, quad.statistic, 1000)
summary(lm(mpg ~ poly(horsepower, 2), data = Auto))
OUTPUT:
> library(ISLR)
> library(boot)
> attach(Auto)
> statistic <- function(data, index) {
+ lm.fit <- lm(mpg ~ horsepower, data = data, subset = index)
+ coef(lm.fit)
+}
> statistic(Auto, 1:392)
(Intercept) horsepower
39.9358610 -0.1578447
> summary(lm(mpg ~ horsepower, data = Auto))

Call:
lm(formula = mpg ~ horsepower, data = Auto)

Residuals:
Min 1Q Median 3Q Max
-13.5710 -3.2592 -0.3435 2.7630 16.9240

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 39.935861 0.717499 55.66 <2e-16 ***
horsepower -0.157845 0.006446 -24.49 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.906 on 390 degrees of freedom


Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16

> set.seed(123)
> boot(Auto, statistic, 1000)

ORDINARY NONPARAMETRIC BOOTSTRAP

Call:
boot(data = Auto, statistic = statistic, R = 1000)

Bootstrap Statistics :
original bias std. error
t1* 39.9358610 0.0156469811 0.845583773
t2* -0.1578447 -0.0001803022 0.007393556
> quad.statistic <- function(data, index) {
+ lm.fit <- lm(mpg ~ poly(horsepower, 2), data = data, subset = index)
+ coef(lm.fit)
+}
> set.seed(1)
> boot(Auto, quad.statistic, 1000)

ORDINARY NONPARAMETRIC BOOTSTRAP

Call:
boot(data = Auto, statistic = quad.statistic, R = 1000)

Bootstrap Statistics :
original bias std. error
t1* 23.44592 -0.003660358 0.2195369
t2* -120.13774 0.002769239 3.6138046
t3* 44.08953 0.101767465 4.1998076
> summary(lm(mpg ~ poly(horsepower, 2), data = Auto))

Call:
lm(formula = mpg ~ poly(horsepower, 2), data = Auto)

Residuals:
Min 1Q Median 3Q Max
-14.7135 -2.5943 -0.0859 2.2868 15.8961

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 23.4459 0.2209 106.13 <2e-16 ***
poly(horsepower, 2)1 -120.1377 4.3739 -27.47 <2e-16 ***
poly(horsepower, 2)2 44.0895 4.3739 10.08 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.374 on 389 degrees of freedom


Multiple R-squared: 0.6876, Adjusted R-squared: 0.686
F-statistic: 428 on 2 and 389 DF, p-value: < 2.2e-16

RESULT:
A regression model predicting the variable ‘mpg’ is fitted on the built-in data set Auto, and the
bootstrap is used to estimate the standard errors of its coefficients.
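
The boot() call above hides a simple loop: draw row indices with replacement, refit, and take the standard deviation of the refitted coefficients. A package-free sketch of the same idea, run here on the built-in mtcars data so it needs no extra packages (the Auto data frame comes from ISLR):

```r
set.seed(123)
B <- 1000
slopes <- replicate(B, {
  idx <- sample(nrow(mtcars), replace = TRUE)        # bootstrap sample of row indices
  coef(lm(mpg ~ hp, data = mtcars[idx, ]))["hp"]     # slope refit on the resample
})
sd(slopes)   # bootstrap estimate of the slope's standard error
```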
BEST SUBSET SELECTION

PROBLEM:
Write an R program to fit a regression model for the Hitters data set using the best subset selection
method to predict the variable Salary.
AIM:
To use the best subset selection method to fit the best regression model for the given dataset.
CODE IN R LANGUAGE:
library(ISLR)
names(Hitters) # to see variables names
dim(Hitters )
sum(is.na(Hitters$Salary)) #to calculate the no. of missing observations
Hitters =na.omit(Hitters) # to remove all rows that have missing value
dim(Hitters )
library(leaps)
#to perform best subset selection on the data set
regfit.full=regsubsets(Salary~.,Hitters)
summary(regfit.full) #summary of the model
regfit.full=regsubsets (Salary~.,data=Hitters ,nvmax=19)
reg.summary =summary (regfit.full)
names(reg.summary)
reg.summary$rsq
reg.summary$rss
reg.summary$adjr2
OUTPUT:
> library(ISLR)
> names(Hitters) # to see variables names
[1] "AtBat" "Hits" "HmRun" "Runs" "RBI" "Walks"
[7] "Years" "CAtBat" "CHits" "CHmRun" "CRuns" "CRBI"
[13] "CWalks" "League" "Division" "PutOuts" "Assists" "Errors"
[19] "Salary" "NewLeague"
> dim(Hitters )
[1] 263 20
> sum(is.na(Hitters$Salary)) #to calculate the no. of missing observations
[1] 0
> Hitters =na.omit(Hitters) # to remove all rows that have missing value
> dim(Hitters )
[1] 263 20
> library(leaps)
> #to perform best subset selection on the data set
> regfit.full=regsubsets(Salary~.,Hitters)
> summary(regfit.full) #summary of the model
Subset selection object
Call: regsubsets.formula(Salary ~ ., Hitters)
19 Variables (and intercept)
Forced in Forced out
AtBat FALSE FALSE
Hits FALSE FALSE
HmRun FALSE FALSE
Runs FALSE FALSE
RBI FALSE FALSE
Walks FALSE FALSE
Years FALSE FALSE
CAtBat FALSE FALSE
CHits FALSE FALSE
CHmRun FALSE FALSE
CRuns FALSE FALSE
CRBI FALSE FALSE
CWalks FALSE FALSE
LeagueN FALSE FALSE
DivisionW FALSE FALSE
PutOuts FALSE FALSE
Assists FALSE FALSE
Errors FALSE FALSE
NewLeagueN FALSE FALSE
1 subsets of each size up to 8
Selection Algorithm: exhaustive
AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI
1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " "*"
2 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
3 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
4 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
5 ( 1 ) "*" "*" " " " " " " " " " " " " " " " " " " "*"
6 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " " " "*"
7 ( 1 ) " " "*" " " " " " " "*" " " "*" "*" "*" " " " "
8 ( 1 ) "*" "*" " " " " " " "*" " " " " " " "*" "*" " "
CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
1 ( 1 ) " " " " " " " " " " " " " "
2 ( 1 ) " " " " " " " " " " " " " "
3 ( 1 ) " " " " " " "*" " " " " " "
4 ( 1 ) " " " " "*" "*" " " " " " "
5 ( 1 ) " " " " "*" "*" " " " " " "
6 ( 1 ) " " " " "*" "*" " " " " " "
7 ( 1 ) " " " " "*" "*" " " " " " "
8 ( 1 ) "*" " " "*" "*" " " " " " "
> regfit.full=regsubsets (Salary~.,data=Hitters ,nvmax=19)
> reg.summary =summary (regfit.full)
> names(reg.summary)
[1] "which" "rsq" "rss" "adjr2" "cp" "bic" "outmat" "obj"
> reg.summary$rsq
[1] 0.3214501 0.4252237 0.4514294 0.4754067 0.4908036 0.5087146 0.5141227
[8] 0.5285569 0.5346124 0.5404950 0.5426153 0.5436302 0.5444570 0.5452164
[15] 0.5454692 0.5457656 0.5459518 0.5460945 0.5461159
> reg.summary$rss
[1] 36179679 30646560 29249297 27970852 27149899 26194904 25906548 25136930
[9] 24814051 24500402 24387345 24333232 24289148 24248660 24235177 24219377
[17] 24209447 24201837 24200700
> reg.summary$adjr2
[1] 0.3188503 0.4208024 0.4450753 0.4672734 0.4808971 0.4972001 0.5007849
[8] 0.5137083 0.5180572 0.5222606 0.5225706 0.5217245 0.5206736 0.5195431
[15] 0.5178661 0.5162219 0.5144464 0.5126097 0.5106270

RESULT:
The best-fit linear regression model is selected for the given data set, with a maximum R² value of
0.546.
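
Because RSS always falls and R² always rises as predictors are added, model size is chosen with a penalized criterion. Adjusted R² is computed as 1 - (1 - R²)(n - 1)/(n - p - 1); recomputing it for the 11-variable model (n = 263 rows, R² = 0.5426153) reproduces the peak value in reg.summary$adjr2:

```r
# Adjusted R-squared penalizes model size, unlike plain R-squared.
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)
adj_r2(0.5426153, n = 263, p = 11)   # ~0.5225706, the maximum of reg.summary$adjr2
```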
FORWARD STEPWISE SELECTION

PROBLEM:
Write an R program to fit a regression model for the Hitters data set using the forward stepwise
selection method.
AIM:
To use the forward stepwise selection method to fit the best regression model for the given dataset.
CODE IN R LANGUAGE:
library(ISLR)
names(Hitters) # to see variables names
dim(Hitters )
sum(is.na(Hitters$Salary)) #to calculate the no. of missing observations
Hitters =na.omit(Hitters) # to remove all rows that have missing value
dim(Hitters )
library(leaps)
regfit.fwd = regsubsets(Salary~., data = Hitters, nvmax = 19, method = "forward")
summary(regfit.fwd)
coef(regfit.fwd,which.min(summary(regfit.fwd)$cp))
OUTPUT:
> library(ISLR)
> names(Hitters) # to see variables names
[1] "AtBat" "Hits" "HmRun" "Runs" "RBI" "Walks"
[7] "Years" "CAtBat" "CHits" "CHmRun" "CRuns" "CRBI"
[13] "CWalks" "League" "Division" "PutOuts" "Assists" "Errors"
[19] "Salary" "NewLeague"
> dim(Hitters )
[1] 263 20
> sum(is.na(Hitters$Salary)) #to calculate the no. of missing observations
[1] 0
> Hitters =na.omit(Hitters) # to remove all rows that have missing value
> dim(Hitters )
[1] 263 20
> library(leaps)
> regfit.fwd = regsubsets(Salary~., data = Hitters, nvmax = 19, method = "forward")
> summary(regfit.fwd)
Subset selection object
Call: regsubsets.formula(Salary ~ ., data = Hitters, nvmax = 19, method = "forward")
19 Variables (and intercept)
Forced in Forced out
AtBat FALSE FALSE
Hits FALSE FALSE
HmRun FALSE FALSE
Runs FALSE FALSE
RBI FALSE FALSE
Walks FALSE FALSE
Years FALSE FALSE
CAtBat FALSE FALSE
CHits FALSE FALSE
CHmRun FALSE FALSE
CRuns FALSE FALSE
CRBI FALSE FALSE
CWalks FALSE FALSE
LeagueN FALSE FALSE
DivisionW FALSE FALSE
PutOuts FALSE FALSE
Assists FALSE FALSE
Errors FALSE FALSE
NewLeagueN FALSE FALSE
1 subsets of each size up to 19
Selection Algorithm: forward
AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns
1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " "
2 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " "
3 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " "
4 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " "
5 ( 1 ) "*" "*" " " " " " " " " " " " " " " " " " "
6 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " " "
7 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " " "
8 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " "*"
9 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*"
10 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*"
11 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*"
12 ( 1 ) "*" "*" " " "*" " " "*" " " "*" " " " " "*"
13 ( 1 ) "*" "*" " " "*" " " "*" " " "*" " " " " "*"
14 ( 1 ) "*" "*" "*" "*" " " "*" " " "*" " " " " "*"
15 ( 1 ) "*" "*" "*" "*" " " "*" " " "*" "*" " " "*"
16 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" " " "*"
17 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" " " "*"
18 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" " " "*"
19 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
CRBI CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
1 ( 1 ) "*" " " " " " " " " " " " " " "
2 ( 1 ) "*" " " " " " " " " " " " " " "
3 ( 1 ) "*" " " " " " " "*" " " " " " "
4 ( 1 ) "*" " " " " "*" "*" " " " " " "
5 ( 1 ) "*" " " " " "*" "*" " " " " " "
6 ( 1 ) "*" " " " " "*" "*" " " " " " "
7 ( 1 ) "*" "*" " " "*" "*" " " " " " "
8 ( 1 ) "*" "*" " " "*" "*" " " " " " "
9 ( 1 ) "*" "*" " " "*" "*" " " " " " "
10 ( 1 ) "*" "*" " " "*" "*" "*" " " " "
11 ( 1 ) "*" "*" "*" "*" "*" "*" " " " "
12 ( 1 ) "*" "*" "*" "*" "*" "*" " " " "
13 ( 1 ) "*" "*" "*" "*" "*" "*" "*" " "
14 ( 1 ) "*" "*" "*" "*" "*" "*" "*" " "
15 ( 1 ) "*" "*" "*" "*" "*" "*" "*" " "
16 ( 1 ) "*" "*" "*" "*" "*" "*" "*" " "
17 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*"
18 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*"
19 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*"
> coef(regfit.fwd,which.min(summary(regfit.fwd)$cp))
(Intercept) AtBat Hits Walks CAtBat CRuns
162.5354420 -2.1686501 6.9180175 5.7732246 -0.1300798 1.4082490
CRBI CWalks DivisionW PutOuts Assists
0.7743122 -0.8308264 -112.3800575 0.2973726 0.2831680

RESULT:
The best-fit regression model is built on the data set using forward stepwise selection to predict the variable Salary.
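
Base R's step() offers an AIC-based variant of the same greedy idea without the leaps package; as a small self-contained illustration, forward selection on the built-in swiss data grows the model one term at a time from the intercept-only fit:

```r
null_fit <- lm(Fertility ~ 1, data = swiss)   # start from intercept only
full_fit <- lm(Fertility ~ ., data = swiss)   # largest model considered
fwd <- step(null_fit, scope = formula(full_fit), direction = "forward", trace = 0)
names(coef(fwd))   # predictors added one at a time, in order of selection
```

Note that step() ranks candidate terms by AIC at each stage, whereas regsubsets() with method = "forward" ranks them by RSS, so the two can disagree.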
BACKWARD STEPWISE SELECTION

PROBLEM:
Write an R program to fit a regression model for the Hitters data set using the backward stepwise
selection method.
AIM:
To use the backward stepwise selection method to fit the best regression model for the given dataset.
CODE IN R LANGUAGE:
library(ISLR)
names(Hitters) # to see variables names
dim(Hitters )
sum(is.na(Hitters$Salary)) #to calculate the no. of missing observations
Hitters =na.omit(Hitters) # to remove all rows that have missing value
dim(Hitters )
library(leaps)
#to fit a regression model
regfit.bwd = regsubsets(Salary~., data = Hitters, nvmax = 19, method = "backward")
summary(regfit.bwd)
coef(regfit.bwd,which.min(summary(regfit.bwd)$cp))
OUTPUT:
> library(ISLR)
> names(Hitters) # to see variables names
[1] "AtBat" "Hits" "HmRun" "Runs" "RBI" "Walks"
[7] "Years" "CAtBat" "CHits" "CHmRun" "CRuns" "CRBI"
[13] "CWalks" "League" "Division" "PutOuts" "Assists" "Errors"
[19] "Salary" "NewLeague"
> dim(Hitters )
[1] 263 20
> sum(is.na(Hitters$Salary)) #to calculate the no. of missing observations
[1] 0
> Hitters =na.omit(Hitters) # to remove all rows that have missing value
> dim(Hitters )
[1] 263 20
> library(leaps)
> regfit.bwd = regsubsets(Salary~., data = Hitters, nvmax = 19, method = "backward")
> summary(regfit.bwd)
Subset selection object
Call: regsubsets.formula(Salary ~ ., data = Hitters, nvmax = 19, method = "backward")
19 Variables (and intercept)
Forced in Forced out
AtBat FALSE FALSE
Hits FALSE FALSE
HmRun FALSE FALSE
Runs FALSE FALSE
RBI FALSE FALSE
Walks FALSE FALSE
Years FALSE FALSE
CAtBat FALSE FALSE
CHits FALSE FALSE
CHmRun FALSE FALSE
CRuns FALSE FALSE
CRBI FALSE FALSE
CWalks FALSE FALSE
LeagueN FALSE FALSE
DivisionW FALSE FALSE
PutOuts FALSE FALSE
Assists FALSE FALSE
Errors FALSE FALSE
NewLeagueN FALSE FALSE
1 subsets of each size up to 19
Selection Algorithm: backward
AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns
1 ( 1 ) " " " " " " " " " " " " " " " " " " " " "*"
2 ( 1 ) " " "*" " " " " " " " " " " " " " " " " "*"
3 ( 1 ) " " "*" " " " " " " " " " " " " " " " " "*"
4 ( 1 ) "*" "*" " " " " " " " " " " " " " " " " "*"
5 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " "*"
6 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " "*"
7 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " "*"
8 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " "*"
9 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*"
10 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*"
11 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*"
12 ( 1 ) "*" "*" " " "*" " " "*" " " "*" " " " " "*"
13 ( 1 ) "*" "*" " " "*" " " "*" " " "*" " " " " "*"
14 ( 1 ) "*" "*" "*" "*" " " "*" " " "*" " " " " "*"
15 ( 1 ) "*" "*" "*" "*" " " "*" " " "*" "*" " " "*"
16 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" " " "*"
17 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" " " "*"
18 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" " " "*"
19 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
CRBI CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
1 ( 1 ) " " " " " " " " " " " " " " " "
2 ( 1 ) " " " " " " " " " " " " " " " "
3 ( 1 ) " " " " " " " " "*" " " " " " "
4 ( 1 ) " " " " " " " " "*" " " " " " "
5 ( 1 ) " " " " " " " " "*" " " " " " "
6 ( 1 ) " " " " " " "*" "*" " " " " " "
7 ( 1 ) " " "*" " " "*" "*" " " " " " "
8 ( 1 ) "*" "*" " " "*" "*" " " " " " "
9 ( 1 ) "*" "*" " " "*" "*" " " " " " "
10 ( 1 ) "*" "*" " " "*" "*" "*" " " " "
11 ( 1 ) "*" "*" "*" "*" "*" "*" " " " "
12 ( 1 ) "*" "*" "*" "*" "*" "*" " " " "
13 ( 1 ) "*" "*" "*" "*" "*" "*" "*" " "
14 ( 1 ) "*" "*" "*" "*" "*" "*" "*" " "
15 ( 1 ) "*" "*" "*" "*" "*" "*" "*" " "
16 ( 1 ) "*" "*" "*" "*" "*" "*" "*" " "
17 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*"
18 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*"
19 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*"
> coef(regfit.bwd,which.min(summary(regfit.bwd)$cp))
(Intercept) AtBat Hits Walks CAtBat CRuns
162.5354420 -2.1686501 6.9180175 5.7732246 -0.1300798 1.4082490
CRBI CWalks DivisionW PutOuts Assists
0.7743122 -0.8308264 -112.3800575 0.2973726 0.2831680

RESULT:
The best-fit regression model is built on the data set using backward stepwise selection to predict the
variable Salary.
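
For comparison, base R's step() runs an AIC-based backward elimination without the leaps package: start from the full model and drop the term whose removal lowers AIC most, stopping when no removal helps. On the built-in swiss data this gives a compact illustration of the same idea:

```r
full_fit <- lm(Fertility ~ ., data = swiss)
bwd <- step(full_fit, direction = "backward", trace = 0)   # drop terms while AIC improves
formula(bwd)   # Examination is the term eliminated; the other four predictors remain
```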
