ML Lab 10 - Ensemble Learning


Lab Sheet - 10

Ensemble Learning
Machine Learning
BITS F464
I Semester 2023-24

Ensemble Learning is a supervised learning technique in which the basic idea is to train multiple models on a training dataset and then combine their outputs (for example, by averaging or majority voting) to produce a strong model that outperforms each of the individual base classifiers.
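
As a toy illustration of the idea (hypothetical predictions, not part of the lab code), a majority vote over three base classifiers can be computed in R like this:

#three hypothetical base classifiers each predict "Yes"/"No" for five observations
p1 <- c("Yes", "No",  "Yes", "No", "Yes")
p2 <- c("Yes", "Yes", "No",  "No", "Yes")
p3 <- c("No",  "Yes", "Yes", "No", "Yes")
#majority vote: the class predicted by most base classifiers wins
votes <- data.frame(p1, p2, p3, stringsAsFactors = FALSE)
ensemble <- apply(votes, 1, function(row) names(which.max(table(row))))
ensemble
## [1] "Yes" "Yes" "Yes" "No"  "Yes"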

Load the libraries


library(psych) #for general functions
library(ggplot2) #for data visualization
# library(devtools)
library(caret) #for training and cross-validation (also loads other model libraries)
library(rpart) #for trees
library(rpart.plot) # Enhanced tree plots
library(RColorBrewer) # Color selection for fancy tree plot
library(party) # Alternative decision tree algorithm
library(partykit) # Convert rpart object to BinaryTree
library(pROC) #for ROC curves
library(ISLR) #for the Carseat Data

Introduction to Data
Reading in the Carseats data set from the ISLR package. This is a simulated data set containing sales of child car seats at 400 different stores. Sales can be predicted from 10 other variables.

#loading the data


data("Carseats")

Our outcome of interest will be a binary version of Sales: Unit sales (in thousands) at each
location.
(Note again that there is no id variable. This is convenient for some tasks.)
Descriptives

#sample descriptives
describe(Carseats)
#histogram of outcome
ggplot(data=Carseats, aes(x=Sales)) +
geom_histogram(binwidth=1, boundary=.5, fill="white", color="black") +
geom_vline(xintercept = 8, color="red", size=2) +
labs(x = "Sales")

For convenience of didactic illustration we create a new variable HighSales that is binary, “No”
if Sales <= 8, and “Yes” otherwise.

#creating new binary variable


Carseats$HighSales=ifelse(Carseats$Sales<=8,"No","Yes")

Some Data cleanup

#remove old variable


Carseats$Sales <- NULL
#convert a factor variable into a numeric variable
Carseats$ShelveLoc <- as.numeric(Carseats$ShelveLoc)
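
One small addition that is not in the original code but avoids problems later: confusionMatrix() expects factors, so it is convenient to store HighSales as a factor up front (the train() calls below coerce it with as.factor() in any case).

#store the outcome as a factor (assumed extra step; confusionMatrix() expects factors)
Carseats$HighSales <- as.factor(Carseats$HighSales)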

Splitting the data into training and test sets


We split the data - half for Training, half for Testing

#random sample half the rows


halfsample = sample(dim(Carseats)[1], dim(Carseats)[1]/2) # half of sample
#create training and test data sets
Carseats.train = Carseats[halfsample, ]
Carseats.test = Carseats[-halfsample, ]

We will use these to evaluate a variety of different classification approaches: a single classification tree, bagged trees, and random forests.
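
The train() calls below all reference a cvcontrol object that is not defined in this sheet. A minimal sketch of what it could look like, assuming 10-fold cross-validation (the exact resampling settings used originally are not shown):

#resampling setup for caret::train (assumed settings: 10-fold cross-validation)
set.seed(1234)  #makes the cross-validation folds reproducible
cvcontrol <- trainControl(method = "cv",
                          number = 10,
                          classProbs = TRUE,  #needed for predicted probabilities / ROC curves
                          savePredictions = TRUE)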

Model 0: A Single Classification Tree


We first optimize the fit of a single classification tree. Our objective with the cross-validation is to optimize the size of the tree by tuning its complexity (for method="ctree", caret tunes the mincriterion splitting threshold).

train.tree <- train(as.factor(HighSales) ~ .,
                    data=Carseats.train,
                    method="ctree",
                    trControl=cvcontrol,
                    tuneLength = 10)
train.tree

We see that accuracy is maximized at a relatively less complex tree.


Look at the final tree

# plot tree
plot(train.tree$finalModel,
     main="Classification Tree for Carseat High Sales")

To evaluate the accuracy of the tree we can look at the confusion matrix for the Training data.

#obtaining class predictions


tree.classTrain <- predict(train.tree, type="raw")
head(tree.classTrain)
#computing confusion matrix
confusionMatrix(Carseats.train$HighSales,tree.classTrain)

There are some errors, but the model has clearly been learned.


More interesting is the confusion matrix when applied to the Test data.

#obtaining class predictions

tree.classTest <- predict(train.tree,
                          newdata = Carseats.test,
                          type="raw")
head(tree.classTest)
#computing confusion matrix
confusionMatrix(Carseats.test$HighSales,tree.classTest)

Accuracy of 0.71
When evaluating classification models, a few other functions may be useful. For example, caret's confusionMatrix() reports the associated measures of sensitivity and specificity, and the pROC package provides convenient functions for obtaining and plotting ROC curves and computing the area under the curve. We can also look at the ROC curve by extracting the predicted probabilities of “Yes”.

#Obtaining predicted probabilities for Test data


tree.probs = predict(train.tree,
                     newdata=Carseats.test,
                     type="prob")
head(tree.probs)
#Calculate ROC curve
rocCurve.tree <- roc(Carseats.test$HighSales, tree.probs[,"Yes"])
#plot the ROC curve
plot(rocCurve.tree, col=c(4))
#calculate the area under curve (bigger is better)
auc(rocCurve.tree)

Model 1: Bagging of classification trees


Training the model using treebag
Bagging fits many classification trees to bootstrap resamples of the training data and combines their predictions by voting. The treebag method has no tuning parameter, but we still use cross-validation to estimate accuracy.

#Using treebag

train.bagg <- train(as.factor(HighSales) ~ .,
                    data=Carseats.train,
                    method="treebag",
                    trControl=cvcontrol,
                    importance=TRUE)
train.bagg
plot(varImp(train.bagg))

It is not straightforward to extract model details from the treebag output in order to inspect the collection of individual trees.

To evaluate the accuracy of the bagged trees we can look at the confusion matrix for the Training data.

#obtaining class predictions


bagg.classTrain <- predict(train.bagg, type="raw")
head(bagg.classTrain)
#computing confusion matrix
confusionMatrix(Carseats.train$HighSales,bagg.classTrain)

The training accuracy is perfect. But this is the data the model was fit to, so it says little about generalization.


More interesting is the confusion matrix when applied to the Test data.

#obtaining class predictions


bagg.classTest <- predict(train.bagg,
                          newdata = Carseats.test,
                          type="raw")
head(bagg.classTest)
#computing confusion matrix
confusionMatrix(Carseats.test$HighSales,bagg.classTest)

Accuracy of 0.76.
We can also look at the ROC curve by extracting the predicted probabilities of “Yes”.

#Obtaining predicted probabilities for Test data


bagg.probs = predict(train.bagg,
                     newdata=Carseats.test,
                     type="prob")

head(bagg.probs)
#Calculate ROC curve
rocCurve.bagg <- roc(Carseats.test$HighSales,bagg.probs[,"Yes"])
#plot the ROC curve
plot(rocCurve.bagg,col=c(6))

#calculate the area under curve (bigger is better)


auc(rocCurve.bagg)
## Area under the curve: 0.8904

Model 2: Random Forest for classification trees


Training the model using random forest

train.rf <- train(as.factor(HighSales) ~ .,
                  data=Carseats.train,
                  method="rf",
                  trControl=cvcontrol,
                  #tuneLength = 3,
                  importance=TRUE)
train.rf
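
By default caret tries only a few values of mtry (the number of predictors sampled at each split). If you want to tune it explicitly, one possible sketch (the grid values here are illustrative and not from the original lab):

#explicitly tune mtry over an assumed grid of values
rf.grid <- expand.grid(mtry = c(2, 4, 6, 8, 10))
train.rf.tuned <- train(as.factor(HighSales) ~ .,
                        data=Carseats.train,
                        method="rf",
                        trControl=cvcontrol,
                        tuneGrid=rf.grid,
                        importance=TRUE)
train.rf.tuned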

We can look at the confusion matrix for the Training data.

#obtaining class predictions


rf.classTrain <- predict(train.rf, type="raw")
head(rf.classTrain)
#computing confusion matrix
confusionMatrix(Carseats.train$HighSales,rf.classTrain)

No errors on the Training data. As with bagging, a random forest typically fits the training data almost perfectly, so the Test data is the more meaningful check.


More interesting is the confusion matrix when applied to the Test data.

#obtaining class predictions


rf.classTest <- predict(train.rf, newdata = Carseats.test,
type="raw")
head(rf.classTest)
#computing confusion matrix
confusionMatrix(Carseats.test$HighSales,rf.classTest)

Accuracy of 0.78, an improvement over bagging.


We can also look at the ROC curve by extracting the predicted probabilities of “Yes”.

#Obtaining predicted probabilities for Test data


rf.probs=predict(train.rf,newdata=Carseats.test,type="prob")
head(rf.probs)
#Calculate ROC curve
rocCurve.rf <- roc(Carseats.test$HighSales,rf.probs[,"Yes"])
#plot the ROC curve
plot(rocCurve.rf,col=c(1))
#calculate the area under curve (bigger is better)
auc(rocCurve.rf)
## Area under the curve: 0.9021

Model Comparisons

We can examine how the models do by looking at the ROC curves.

plot(rocCurve.tree,col=c(4))
plot(rocCurve.bagg,add=TRUE,col=c(6)) # color magenta is bagg
plot(rocCurve.rf,add=TRUE,col=c(1)) # color black is rf
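
To make the comparison easier to read, a legend can be added to the same plot (a small addition not in the original lab code; the colors match the plot calls above, where 4 is blue for the single tree, 6 is magenta for bagging, and 1 is black for the random forest):

#add a legend mapping colors to models
legend("bottomright",
       legend = c("Single tree", "Bagged trees", "Random forest"),
       col = c(4, 6, 1), lwd = 2)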

Good demonstrations of decision trees, random forests, and the boosting and bagging algorithms in R can be found here:
https://machinelearningmastery.com/machine-learning-ensembles-with-r/
https://www.analyticsvidhya.com/blog/2017/02/introduction-to-ensembling-along-with-implementation-in-r/

Exercise
1. Apply a boosting model on the same dataset and compare it with the bagging ensemble. Also
try the AdaBoost function and analyze the results. (A starting sketch is given below.)
2. How do you determine the number of base classifiers to be used in an ensemble?
3. How do you find the error of an ensemble when the error rates of the base classifiers
differ?
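
As a starting point for Exercise 1, a gradient-boosted tree model can be trained through the same caret interface (a minimal sketch, assuming the gbm package is installed; tuning is left to caret's defaults):

#boosted classification trees via caret (assumes the gbm package is installed)
train.boost <- train(as.factor(HighSales) ~ .,
                     data=Carseats.train,
                     method="gbm",
                     trControl=cvcontrol,
                     verbose=FALSE)
train.boost
#evaluate on the Test data, as for the other models
boost.classTest <- predict(train.boost, newdata=Carseats.test, type="raw")
confusionMatrix(Carseats.test$HighSales, boost.classTest)

For AdaBoost, caret also offers methods such as method="adaboost" (fastAdaboost package) or method="AdaBoost.M1" (adabag package); which is most convenient depends on what is installed.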
