
Ensemble Methods

1
Overview
• What are ensemble methods?
• The main ensemble methods
–Boosting
–Bagging
–Random Forest

2
Ensemble Methods
• Ensemble methods combine the results from multiple
models with the goal of improving prediction accuracy
• Example: bagging
[Diagram: the bagging workflow]
– Step 1: Create multiple data sets from the original training data (by resampling)
– Step 2: Build multiple models, one for each data set
– Step 3: Combine the outcomes into a single prediction

Understanding Ensemble Methods

4
Motivating Example

• Suppose I have 5 classifiers which each classify a point correctly 70% of the time.
• If I use one of the classifiers to classify a new record, what is the probability that I get it right?
– 70%
5
Motivating Example
• If these 5 classifiers are completely independent and I take the
majority vote, how often is the majority vote correct for a new
record?
• P(getting it right) = P(all 5 get it right) + P(exactly 4 get it right) + P(exactly 3 get it right)
• P(getting it right) = C(5,5)·0.7^5 + C(5,4)·0.7^4·(1 − 0.7)^1 + C(5,3)·0.7^3·(1 − 0.7)^2
= 1·0.7^5 + 5·0.7^4·0.3 + 10·0.7^3·0.3^2 ≈ 0.83692, i.e. about 83.7%
• Here C(n, k) is “n choose k”: in how many different ways can you select k items from n overall items (in Excel you can use the function COMBIN to calculate these values)

6
Motivating Example
Suppose I have 101 classifiers which each classify a point correctly
70% of the time. If these 101 classifiers are completely independent
and I take the majority vote, how often is the majority vote correct
for a new record?
We can view the number of correct classifiers as a binomial random variable with 101 trials and success probability 0.7; we need at least 51 of them to be correct in order for the overall prediction to be correct.
P(getting it right) = P(at least 51 get it right)
= 1 - P(at most 50 get it right)
P(getting it right) = 1 - BINOM.DIST(50, 101, .7, 1)
= .9999

≈ 100%
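As a quick check, both probabilities can be computed directly in R with the base binomial functions dbinom and pbinom (a minimal sketch for illustration):

# Majority vote of 5 independent classifiers, each correct with probability 0.7:
# the vote is correct when at least 3 of the 5 are correct
sum(dbinom(3:5, size = 5, prob = 0.7))       # 0.83692
# Majority vote of 101 independent classifiers: need at least 51 correct
1 - pbinom(50, size = 101, prob = 0.7)       # approximately 0.9999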

7
Types of Ensemble Algorithms
• Ensemble algorithms/methods include
–Bagging
• builds different classifiers by training on repeated samples (with replacement) from the
data

–Boosting
• combines simple base classifiers by up-weighting data points which are classified
incorrectly

–Random Forests
• average many trees, each of which is constructed with some amount of randomness

8
Bagging
• The basics:
– Step 1: Create B datasets, using sampling with replacement
– Step 2: Create one classifier for each dataset
– Step 3: Combine the classifiers by averaging over the predictions in the case of a continuous outcome, or by simple majority vote in the case of a categorical outcome (a short R sketch of these steps follows below)
• Bagging is simple to implement
• Bagging using very weak classifiers may not result in an
improvement; bagging good prediction models will in most cases
help improve the prediction accuracy
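A minimal sketch of these three steps for a continuous outcome, assuming the rpart package for the individual trees (the function name bagged_predict and the choice B = 100 are illustrative):

# Bagging regression trees "by hand" (illustrative sketch)
library(rpart)

bagged_predict <- function(formula, data, newdata, B = 100) {
  preds <- sapply(1:B, function(b) {
    boot <- data[sample(nrow(data), replace = TRUE), ]   # Step 1: bootstrap sample
    fit  <- rpart(formula, data = boot)                   # Step 2: fit one tree on it
    predict(fit, newdata = newdata)                       # this tree's predictions
  })
  rowMeans(preds)                                          # Step 3: average over the B trees
}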

9
Bagging
• Bootstrap aggregating
• Multiple training datasets are created by resampling the observed dataset (each of equal size to the observed dataset)
• They are obtained by random sampling with replacement from the original dataset
• Train the statistical learning method on each of the training datasets, and obtain a prediction from each
• For prediction:
• Regression: average all predictions from all trees
• Classification: majority vote among all trees

10
How Bagging Works


11
Bagging Output

• Regression trees – average output of each tree


• Classification trees
• Majority vote of class predictions, or
• Average class probabilities and then predict class
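As a toy illustration of these two ways of combining classification trees (the votes and probabilities below are made-up values for a single record):

votes <- c("yes", "no", "yes", "yes", "no")      # class predicted by each of 5 trees
names(which.max(table(votes)))                    # majority vote -> "yes"

probs <- c(0.90, 0.40, 0.60, 0.70, 0.30)          # P(class = "yes") from each tree
ifelse(mean(probs) > 0.5, "yes", "no")            # average the probabilities, then classify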

12
Random Forests
• Random Forest (RF) borrows ideas from Bagging
• RF averages many classification trees, where each is constructed
using a random subset of the variables for each split in the tree
• The key parameters that need to be determined are
a) the number of trees, and b) the variable subset size
– The values of these parameters will vary from one application to the next
• RF can be used with both regression trees and classification trees
• RF has good predictive performance, even when the data is very
noisy, and generally does not overfit

13
How Random Forests Work
• It is a very efficient statistical learning method
• It builds on the idea of bagging, but it provides an improvement
because it de-correlates the trees
• How does it work?
• Build a number of decision trees on bootstrapped training samples, but when building these trees, each time a split in a tree is considered, a random sample of m predictors is chosen as split candidates from the full set of p predictors (usually m ≈ √p)
• RF with m = p is just bagging
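In the randomForest package the subset size is set with the mtry argument; the sketch below (using the Boston data that appears on the later slides) contrasts a random forest with m ≈ √p against bagging with m = p:

library(randomForest)
library(MASS)                                    # Boston housing data
p <- ncol(Boston) - 1                            # p = 13 predictors for the response medv
rf.sub <- randomForest(medv ~ ., data = Boston, mtry = floor(sqrt(p)))  # m ≈ sqrt(p)
rf.all <- randomForest(medv ~ ., data = Boston, mtry = p)               # m = p: plain bagging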

14
Why consider a random sample of m predictors
instead of all p predictors for splitting?

• Suppose that we have a very strong predictor in the data set along with a number of other moderately strong predictors; then, in the collection of bagged trees, most or all of them will use the very strong predictor for the first split
• All bagged trees will therefore look similar, and hence all the predictions from the bagged trees will be highly correlated
• Averaging many highly correlated quantities does not lead to a large variance reduction; random forests “de-correlate” the bagged trees, leading to a greater reduction in variance

15
Bagging in R
# Bagging and Random Forests

library(randomForest)

# We first do bagging (which is just RF with m = p)


set.seed(1)
bag.boston=randomForest(medv~.,data=Boston,subset=train,mtry=13,importance=TRUE)
bag.boston

##
## Call:
## randomForest(formula = medv ~ ., data = Boston, mtry = 13, importance = TRUE,
subset = train)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 13
##
## Mean of squared residuals: 11.02509
## % Var explained: 86.65
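This code assumes that train (a training-set index) and boston.test (the held-out response) were defined beforehand; a typical setup along the lines of the standard ISLR-style lab (an assumed reconstruction) is:

library(MASS)                                    # Boston housing data
set.seed(1)
train <- sample(1:nrow(Boston), nrow(Boston)/2)  # half the rows for training
boston.test <- Boston[-train, "medv"]            # test-set response values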

16
Bagging in R
yhat.bag = predict(bag.boston,newdata=Boston[-train,])
plot(yhat.bag, boston.test)
abline(0,1)

mean((yhat.bag-boston.test)^2)

## [1] 13.47349

17
Predictor Importance
varImpPlot(bag.boston)

importance(bag.boston)

## %IncMSE IncNodePurity
## crim 15.396510 950.03191
## zn 1.100738 21.42389
## indus 12.225351 183.14933
## chas 2.726681 13.25062
## nox 10.606485 302.78478
## rm 45.090272 7325.33947
## age 10.400796 309.19654
## dis 17.315918 892.19354
## rad 3.208664 64.56585
## tax 9.296886 296.22083
## ptratio 15.325244 279.25118
## black 5.944955 243.04952
## lstat 39.324555 9837.83280

lstat and rm are most important


18
Random Forests in R
set.seed(1)
rf.boston=randomForest(medv~.,data=Boston,subset=train,mtry=6,importance=TRUE)
yhat.rf = predict(rf.boston,newdata=Boston[-train,])
mean((yhat.rf-boston.test)^2)

## [1] 11.48022

19
