ML Lab 10 - Ensemble Learning


Lab Sheet - 10

Ensemble Learning
Machine Learning
BITS F464
I Semester 2023-24

Ensemble Learning is a supervised learning technique in which the basic idea is to train multiple models on a training dataset and then combine their outputs (for example, by averaging or majority voting) to produce a strong model that outperforms each of the individual base classifiers.
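
As a toy illustration of the idea (hypothetical predictions, not part of the lab code), a majority vote over three base classifiers can be computed in R like this:

#three hypothetical base classifiers each predict "Yes"/"No" for five observations
p1 <- c("Yes", "No",  "Yes", "No", "Yes")
p2 <- c("Yes", "Yes", "No",  "No", "Yes")
p3 <- c("No",  "Yes", "Yes", "No", "Yes")
#majority vote: the class predicted by most base classifiers wins
votes <- data.frame(p1, p2, p3, stringsAsFactors = FALSE)
ensemble <- apply(votes, 1, function(row) names(which.max(table(row))))
ensemble
## [1] "Yes" "Yes" "Yes" "No"  "Yes"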

Load the libraries


library(psych) #for general functions
library(ggplot2) #for data visualization
# library(devtools)
library(caret) #for training and cross-validation (also loads other model libraries)
library(rpart) #for trees
library(rpart.plot) # Enhanced tree plots
library(RColorBrewer) # Color selection for fancy tree plot
library(party) # Alternative decision tree algorithm
library(partykit) # Convert rpart object to BinaryTree
library(pROC) #for ROC curves
library(ISLR) #for the Carseat Data

Introduction to Data
Reading in the Carseats data set from the ISLR package. This is a simulated data set containing sales of child car seats at 400 different stores. Sales can be predicted from 10 other variables.

#loading the data


data("Carseats")

Our outcome of interest will be a binary version of Sales: Unit sales (in thousands) at each
location.
(Note again that there is no id variable. This is convenient for some tasks.)
Descriptives

#sample descriptives
describe(Carseats)
#histogram of outcome
ggplot(data=Carseats, aes(x=Sales)) +
geom_histogram(binwidth=1, boundary=.5, fill="white", color="black") +
geom_vline(xintercept = 8, color="red", size=2) +
labs(x = "Sales")

For convenience of didactic illustration we create a new variable HighSales that is binary, “No”
if Sales <= 8, and “Yes” otherwise.

#creating new binary variable


Carseats$HighSales=ifelse(Carseats$Sales<=8,"No","Yes")

Some Data cleanup

#remove old variable


Carseats$Sales <- NULL
#convert a factor variable into a numeric variable
Carseats$ShelveLoc <- as.numeric(Carseats$ShelveLoc)
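
One small addition that is not in the original code but avoids problems later: confusionMatrix() expects factors, so it is convenient to store HighSales as a factor up front (the train() calls below coerce it with as.factor() in any case).

#store the outcome as a factor (assumed extra step; confusionMatrix() expects factors)
Carseats$HighSales <- as.factor(Carseats$HighSales)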

Splitting the data into training and test sets


We split the data - half for Training, half for Testing

#random sample half the rows


halfsample = sample(dim(Carseats)[1], dim(Carseats)[1]/2) # half of sample
#create training and test data sets
Carseats.train = Carseats[halfsample, ]
Carseats.test = Carseats[-halfsample, ]

We will use these to evaluate a variety of different classification approaches: a single classification tree, bagged trees, and random forests.
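
The train() calls below all reference a cvcontrol object that is not defined in this sheet. A minimal sketch of what it could look like, assuming 10-fold cross-validation (the exact resampling settings used originally are not shown):

#resampling setup for caret::train (assumed settings: 10-fold cross-validation)
set.seed(1234)  #makes the cross-validation folds reproducible
cvcontrol <- trainControl(method = "cv",
                          number = 10,
                          classProbs = TRUE,  #needed for predicted probabilities / ROC curves
                          savePredictions = TRUE)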

Model 0: A Single Classification Tree


We first optimize the fit of a single classification tree. Our objective with the cross-validation is to optimize the size of the tree by tuning its complexity (for method="ctree", caret tunes the mincriterion splitting threshold).

train.tree <- train(as.factor(HighSales) ~ .,
                    data=Carseats.train,
                    method="ctree",
                    trControl=cvcontrol,
                    tuneLength = 10)
train.tree

We see that accuracy is maximized at a relatively less complex tree.


Look at the final tree

# plot tree
plot(train.tree$finalModel,
     main="Classification Tree for Carseat High Sales")

To evaluate the accuracy of the tree we can look at the confusion matrix for the Training data.

#obtaining class predictions


tree.classTrain <- predict(train.tree, type="raw")
head(tree.classTrain)
#computing confusion matrix
confusionMatrix(Carseats.train$HighSales,tree.classTrain)

There are some errors, but the model has clearly been learned.


More interesting is the confusion matrix when applied to the Test data.

#obtaining class predictions

tree.classTest <- predict(train.tree,
                          newdata = Carseats.test,
                          type="raw")
head(tree.classTest)
#computing confusion matrix
confusionMatrix(Carseats.test$HighSales,tree.classTest)

Accuracy of 0.71
When evaluating classification models, a few other functions may be useful. For example, caret's confusionMatrix() reports the associated measures of sensitivity and specificity, and the pROC package provides convenient functions for obtaining and plotting ROC curves and computing the area under the curve. We can also look at the ROC curve by extracting the predicted probabilities of “Yes”.

#Obtaining predicted probabilities for Test data


tree.probs = predict(train.tree,
                     newdata=Carseats.test,
                     type="prob")
head(tree.probs)
#Calculate ROC curve
rocCurve.tree <- roc(Carseats.test$HighSales, tree.probs[,"Yes"])
#plot the ROC curve
plot(rocCurve.tree, col=c(4))
#calculate the area under curve (bigger is better)
auc(rocCurve.tree)

Model 1: Bagging of classification trees


Training the model using treebag
Bagging fits many classification trees to bootstrap resamples of the training data and combines their predictions by voting. The treebag method has no tuning parameter, but we still use cross-validation to estimate accuracy.

#Using treebag

train.bagg <- train(as.factor(HighSales) ~ .,
                    data=Carseats.train,
                    method="treebag",
                    trControl=cvcontrol,
                    importance=TRUE)
train.bagg
plot(varImp(train.bagg))

It is not straightforward to extract model details from the treebag output in order to inspect the collection of individual trees.

To evaluate the accuracy of the bagged trees we can look at the confusion matrix for the Training data.

#obtaining class predictions


bagg.classTrain <- predict(train.bagg, type="raw")
head(bagg.classTrain)
#computing confusion matrix
confusionMatrix(Carseats.train$HighSales,bagg.classTrain)

The training accuracy is perfect. But this is the data the model was fit to, so it says little about generalization.


More interesting is the confusion matrix when applied to the Test data.

#obtaining class predictions


bagg.classTest <- predict(train.bagg,
                          newdata = Carseats.test,
                          type="raw")
head(bagg.classTest)
#computing confusion matrix
confusionMatrix(Carseats.test$HighSales,bagg.classTest)

Accuracy of 0.76.
We can also look at the ROC curve by extracting the predicted probabilities of “Yes”.

#Obtaining predicted probabilities for Test data


bagg.probs = predict(train.bagg,
                     newdata=Carseats.test,
                     type="prob")

head(bagg.probs)
#Calculate ROC curve
rocCurve.bagg <- roc(Carseats.test$HighSales,bagg.probs[,"Yes"])
#plot the ROC curve
plot(rocCurve.bagg,col=c(6))

#calculate the area under curve (bigger is better)


auc(rocCurve.bagg)
## Area under the curve: 0.8904

Model 2: Random Forest for classification trees


Training the model using random forest

train.rf <- train(as.factor(HighSales) ~ .,
                  data=Carseats.train,
                  method="rf",
                  trControl=cvcontrol,
                  #tuneLength = 3,
                  importance=TRUE)
train.rf
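
By default caret tries only a few values of mtry (the number of predictors sampled at each split). If you want to tune it explicitly, one possible sketch (the grid values here are illustrative and not from the original lab):

#explicitly tune mtry over an assumed grid of values
rf.grid <- expand.grid(mtry = c(2, 4, 6, 8, 10))
train.rf.tuned <- train(as.factor(HighSales) ~ .,
                        data=Carseats.train,
                        method="rf",
                        trControl=cvcontrol,
                        tuneGrid=rf.grid,
                        importance=TRUE)
train.rf.tuned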

We can look at the confusion matrix for the Training data.

#obtaining class predictions


rf.classTrain <- predict(train.rf, type="raw")
head(rf.classTrain)
#computing confusion matrix
confusionMatrix(Carseats.train$HighSales,rf.classTrain)

No errors on the Training data. As with bagging, a random forest typically fits the training data almost perfectly, so the Test data is the more meaningful check.


More interesting is the confusion matrix when applied to the Test data.

#obtaining class predictions


rf.classTest <- predict(train.rf, newdata = Carseats.test,
type="raw")
head(rf.classTest)
#computing confusion matrix
confusionMatrix(Carseats.test$HighSales,rf.classTest)

Accuracy of 0.78, an improvement over bagging.


We can also look at the ROC curve by extracting the predicted probabilities of “Yes”.

#Obtaining predicted probabilities for Test data


rf.probs=predict(train.rf,newdata=Carseats.test,type="prob")
head(rf.probs)
#Calculate ROC curve
rocCurve.rf <- roc(Carseats.test$HighSales,rf.probs[,"Yes"])
#plot the ROC curve
plot(rocCurve.rf,col=c(1))
#calculate the area under curve (bigger is better)
auc(rocCurve.rf)
## Area under the curve: 0.9021

Model Comparisons

We can examine how the models do by looking at the ROC curves.

plot(rocCurve.tree,col=c(4))
plot(rocCurve.bagg,add=TRUE,col=c(6)) # color magenta is bagg
plot(rocCurve.rf,add=TRUE,col=c(1)) # color black is rf
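
To make the comparison easier to read, a legend can be added to the same plot (a small addition not in the original lab code; the colors match the plot calls above, where 4 is blue for the single tree, 6 is magenta for bagging, and 1 is black for the random forest):

#add a legend mapping colors to models
legend("bottomright",
       legend = c("Single tree", "Bagged trees", "Random forest"),
       col = c(4, 6, 1), lwd = 2)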

Good demonstrations of decision trees, random forests, and the boosting and bagging algorithms in R can be found here:
https://machinelearningmastery.com/machine-learning-ensembles-with-r/
https://www.analyticsvidhya.com/blog/2017/02/introduction-to-ensembling-along-with-implementation-in-r/

Exercise
1. Apply a boosting model on the same dataset and compare it with the bagging ensemble. Also
try the AdaBoost function and analyze the results. (A starting sketch is given below.)
2. How do you determine the number of base classifiers to be used in an ensemble?
3. How do you find the error of an ensemble when the error rates of the base classifiers
differ?
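
As a starting point for Exercise 1, a gradient-boosted tree model can be trained through the same caret interface (a minimal sketch, assuming the gbm package is installed; tuning is left to caret's defaults):

#boosted classification trees via caret (assumes the gbm package is installed)
train.boost <- train(as.factor(HighSales) ~ .,
                     data=Carseats.train,
                     method="gbm",
                     trControl=cvcontrol,
                     verbose=FALSE)
train.boost
#evaluate on the Test data, as for the other models
boost.classTest <- predict(train.boost, newdata=Carseats.test, type="raw")
confusionMatrix(Carseats.test$HighSales, boost.classTest)

For AdaBoost, caret also offers methods such as method="adaboost" (fastAdaboost package) or method="AdaBoost.M1" (adabag package); which is most convenient depends on what is installed.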
