
Lecture 22: Bagging and Random Forest

Wenbin Lu

Department of Statistics
North Carolina State University

Fall 2019



Outline

Bagging Methods
Bagging Trees
Random Forest
Applications in Causal Inference/Optimal Treatment Decision



Bagging Trees

Ensemble Methods (Model Averaging)

A machine learning ensemble meta-algorithm designed to improve the stability and accuracy of machine learning algorithms.
Widely used in statistical classification and regression.
Can reduce variance and help to avoid overfitting.
Usually applied to decision tree methods, but can be used with any type of method.


Bootstrap Aggregation (Bagging)

Bagging is a special case of the model averaging approach.
Bootstrap aggregation = bagging.
Bagging leads to "improvements for unstable procedures" (Breiman, 1996), e.g. neural nets, classification and regression trees, and subset selection in linear regression (Breiman, 1994).
On the other hand, it can mildly degrade the performance of stable methods such as K-nearest neighbors (Breiman, 1996).


Basic Idea

Given a standard training set D of size n, bagging
generates B new training sets D_i, each of size n', by sampling from D uniformly and with replacement. This kind of sample is known as a bootstrap sample; by sampling with replacement, some observations may be repeated in each D_i.
The B models are fitted using the above B bootstrap samples and combined by averaging the output (for regression) or voting (for classification).
If n' = n, then for large n the set D_i is expected to contain the fraction (1 - 1/e) ≈ 63.2% of the unique examples of D, the rest being duplicates (a quick numerical check follows).
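A quick numerical check of the 63.2% figure (a minimal R sketch; object names are illustrative):

set.seed(1)
n = 10000
## fraction of unique original observations appearing in a bootstrap sample of size n
frac.unique = replicate(200, length(unique(sample(1:n, n, replace = TRUE))) / n)
mean(frac.unique)     # close to 1 - 1/e
1 - 1/exp(1)          # 0.6321206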


Bagging Procedures

Bagging uses the bootstrap to improve the estimate or prediction of a fit.

Given data Z = {(x_1, y_1), ..., (x_n, y_n)}, we generate B bootstrap samples Z^{*b}.
Empirical distribution P̂: put equal probability 1/n on each (x_i, y_i) (discrete).
Generate Z^{*b} = {(x_1^*, y_1^*), ..., (x_n^*, y_n^*)} ~ P̂, b = 1, ..., B.
Obtain f̂^{*b}(x), b = 1, ..., B.
The Monte Carlo estimate of the bagging estimate:

    f̂_bag(x) = (1/B) Σ_{b=1}^{B} f̂^{*b}(x).
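A minimal sketch of this procedure for a regression tree (the data-generating step and names are illustrative; rpart stands in for any base learner):

library(rpart)

set.seed(2)
n = 200
x = runif(n)
y = sin(2 * pi * x) + rnorm(n, sd = 0.3)
dat = data.frame(x = x, y = y)

B = 100
x.grid = data.frame(x = seq(0, 1, length.out = 101))
pred = matrix(0, nrow(x.grid), B)
for (b in 1:B) {
  idx = sample(1:n, n, replace = TRUE)        # bootstrap sample Z^{*b} from P-hat
  fit.b = rpart(y ~ x, data = dat[idx, ])     # f-hat^{*b}
  pred[, b] = predict(fit.b, x.grid)
}
f.bag = rowMeans(pred)                        # Monte Carlo bagging estimate f-hat_bag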


Properties of Bagging Estimates

Advantages:
Note f̂_bag(x) → E_P̂ f̂^*(x) as B → ∞.
f̂_bag(x) typically has smaller variance than f̂(x).
f̂_bag(x) differs from f̂(x) only when the latter is a nonlinear or adaptive function of the data.


Bagging Classification Trees


In classification problems, there are two scenarios:
(1) f̂^{*b}(x) is an indicator vector, with one 1 and K − 1 0's (hard classification).
(2) f̂^{*b}(x) = (p_1, ..., p_K), the estimates of the class probabilities P̂^b(Y = k | X = x), k = 1, ..., K (soft classification). The bagged estimates are the average prediction at x from the B trees:

    f̂_k^bag(x) = B^{-1} Σ_{b=1}^{B} f̂_k^{*b}(x),   k = 1, ..., K.


Bagging Classification Trees (cont.)


There are two types of averaging:
(1) Use the majority vote: arg max_k Σ_{b=1}^{B} I{f̂^{*b}(x) = k}.
(2) Use the averaged probabilities. The bagged classifier is

    Ĝ_bag(x) = arg max_k f̂_k^bag(x).

Note: the second method tends to produce estimates with lower variance than the first, especially for small B (a short sketch of both rules follows).
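A short sketch comparing the two aggregation rules for bagged classification trees (an illustrative function; it assumes a data frame yx with a factor response named y and rpart as the base tree learner):

library(rpart)

bag.classify = function(yx, newdata, B = 200) {
  n = nrow(yx)
  K = nlevels(yx$y)
  votes = matrix(0, nrow(newdata), K)
  probs = matrix(0, nrow(newdata), K)
  for (b in 1:B) {
    fit.b = rpart(y ~ ., data = yx[sample(1:n, n, replace = TRUE), ])
    p.b = predict(fit.b, newdata, type = "prob")     # class probability estimates
    probs = probs + p.b / B                          # rule (2): average the probabilities
    votes = votes + outer(max.col(p.b), 1:K, "==")   # rule (1): one vote per tree
  }
  list(vote = levels(yx$y)[max.col(votes)],          # majority-vote classifier
       prob = levels(yx$y)[max.col(probs)])          # probability-averaging classifier
}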


Example

Sample size n = 30, two classes.
p = 5 features, each having a standard Gaussian distribution with pairwise correlation Corr(X_j, X_k) = 0.95.
The response Y was generated according to

    Pr(Y = 1 | x_1 ≤ 0.5) = 0.2,   Pr(Y = 1 | x_1 > 0.5) = 0.8.

The Bayes error is 0.2.
A test sample of size 2,000 was generated from the same population.
We fit classification trees to the training sample and to each of 200 bootstrap samples (a sketch of the simulation setting follows).
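A sketch of this simulation setting (assuming MASS::mvrnorm for the equicorrelated Gaussian features; names are illustrative):

library(MASS)
library(rpart)

sim.data = function(n) {
  Sigma = matrix(0.95, 5, 5); diag(Sigma) = 1        # pairwise correlation 0.95
  X = mvrnorm(n, mu = rep(0, 5), Sigma = Sigma)
  p = ifelse(X[, 1] <= 0.5, 0.2, 0.8)                # Pr(Y = 1 | x1)
  data.frame(Y = factor(rbinom(n, 1, p)), X)
}

set.seed(3)
train = sim.data(30)
test  = sim.data(2000)
fit0  = rpart(Y ~ ., data = train)                   # tree on the original training sample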


[Figure 8.9 (The Elements of Statistical Learning, Hastie, Tibshirani & Friedman 2001, Chapter 8): Bagging trees on the simulated dataset. The top left panel shows the original tree; five trees grown on bootstrap samples are shown. The bootstrap trees split on different variables (x.1, x.2, x.3, x.4) and at different cutpoints.]


About Bagging Trees

The original tree and five bootstrap trees are all different:
with different splitting features
with different splitting cutpoints
The trees have high variance due to the correlation in the predictors.
Under squared-error loss, averaging reduces variance and leaves bias unchanged (see the calculation below).
Therefore, bagging will often decrease MSE.
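To make the variance claim concrete (a standard calculation, not from the slides): if the B predictions f̂^{*b}(x) at a fixed x are identically distributed with variance σ² and pairwise correlation ρ, then

    E{f̂_bag(x)} = E{f̂^{*b}(x)}                 (bias unchanged),
    Var{f̂_bag(x)} = ρσ² + (1 − ρ)σ²/B,

which decreases toward ρσ² as B grows.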


[Figure 8.10 (The Elements of Statistical Learning, Hastie, Tibshirani & Friedman 2001, Chapter 8): Error curves for the bagging example of Figure 8.9. Shown is the test error of the original tree and of the bagged trees as a function of the number of bootstrap samples (0 to 200); the Bayes error (0.20) is marked for reference. The green points correspond to majority vote, while the purple points average the probabilities.]


About Bagging

Bagging can dramatically reduce the variance of unstable procedures like trees, leading to improved prediction.
Bagging smooths out this variance and hence reduces the test error.
Bagging can stabilize unstable procedures.
The simple structure in the model can be lost due to bagging:
a bagged tree is no longer a tree;
the bagged estimate is not easy to interpret.
Under 0-1 loss for classification, bagging may not help, due to the nonadditivity of bias and variance.



Random Forest


Random Forests (Breiman, 2001):

A random forest is an ensemble classifier that consists of many decision trees and outputs the class that is the mode of the classes output by the individual trees.
Bagging: random subsampling of the training set.
Random subspace method (Ho, 1995, 1998): random subsampling of the feature space (called "feature bagging"). It reduces the correlation of the trees grown on ordinary bootstrap samples: if one or a few features are very strong predictors for the response variable (target output), these features will be selected in many of the trees, causing the trees to become correlated.
Random forest: combines bagging and the random subspace method.


Learning Algorithm for Building A Tree

Denote the training set size by n and the number of variables by p. Assume m < p is the number of input variables used to determine the decision at a node of the tree.

Randomly choose n samples with replacement (i.e. take a bootstrap sample).
Use the rest of the samples (the out-of-bag samples) to estimate the error of the tree, by predicting their classes.
For each node of the tree, randomly choose m variables on which to base the decision at that node. Calculate the best split based on these m variables in the training set.
Each tree is fully grown and not pruned.

(A minimal call matching this algorithm is sketched below.)
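For reference, a minimal randomForest call that exposes the two tuning parameters of this algorithm (ntree bootstrap trees, m = mtry candidate variables per split); the iris data set is used only as a placeholder:

library(randomForest)

fit.rf = randomForest(Species ~ ., data = iris,
                      ntree = 500,          # number of bootstrap trees
                      mtry = 2,             # m variables considered at each split
                      importance = TRUE)
print(fit.rf)                               # includes the out-of-bag (OOB) error estimate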


Properties of Random Forest


Advantages:
highly accurate in many real examples; fast; handles a very large number of input variables
estimates the importance of variables for classification
generates an internal unbiased estimate of the generalization error as the forest building progresses
imputes missing data and maintains accuracy when a large proportion of the data are missing
provides an experimental way to detect variable interactions
can balance error in unbalanced data sets
computes proximities between cases, useful for clustering, detecting outliers, and (by scaling) visualizing the data
can be extended to unlabeled data, leading to unsupervised clustering, outlier detection and data views
(Several of these are illustrated in the sketch below.)
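Several of these features can be pulled directly from a randomForest fit; continuing the placeholder iris sketch above (function names are from the randomForest package):

importance(fit.rf)                   # variable importance measures
varImpPlot(fit.rf)

fit.prox = randomForest(Species ~ ., data = iris, proximity = TRUE)
MDSplot(fit.prox, iris$Species)      # visualize cases by scaling the proximity matrix
tail(fit.prox$err.rate[, "OOB"], 1)  # internal (OOB) estimate of the generalization error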

Properties of Random Forest

Disadvantages:
Random forests are prone to overfitting for some data sets; this is even more pronounced in noisy classification/regression tasks.
Random forests do not handle large numbers of irrelevant features well.

Implementation:
R packages: randomForest; randomForestSRC (for survival data); grf (generalized random forest, for causal effect estimation).
Asymptotics of the generalized random forest: Athey et al. (2019).


Applications in Causal Inference/Optimal Treatment Decision

Potential outcomes: Y*(a), the outcome that would result if a patient were given treatment a ∈ A.
Observed data: response Y (a larger value of Y indicates a better outcome); p-dimensional covariates X ∈ X; received treatment A ∈ A.
Consider the binary treatment case: A = {0, 1}.
Average treatment effect (ATE): Δ = E{Y*(1) − Y*(0)}.
Conditional treatment effect (CTE): τ(x) = E{Y*(1) | X = x} − E{Y*(0) | X = x}.
Average treatment effect on the treated (ATT): Δ_1 = E{Y*(1) | A = 1} − E{Y*(0) | A = 1}.


Assumptions for Causal Inference

A1. Consistency assumption:

    Y = Y*(1)A + Y*(0)(1 − A)

A2. No unmeasured confounders assumption (strong ignorability):

    {Y*(1), Y*(0)} ⊥ A | X

A3. Positivity assumption: 0 < P(A = 1 | X) < 1 for any X.


Estimation of Causal Effects

Note that Δ ≠ E(Y | A = 1) − E(Y | A = 0). (Why?)
In fact,

    E{ I(A = 1) Y / π(X) } = E{Y*(1)},

where π(X) = P(A = 1 | X) (a short derivation follows below).
Under assumptions A1-A2, we have τ(x) = E(Y | A = 1, X = x) − E(Y | A = 0, X = x).
Proof:

    E{Y*(1) | X = x} = E{Y*(1) | A = 1, X = x} = E{Y | A = 1, X = x}.

How about Δ_1?

    E{Y*(0) | A = 1} = E{Y*(0)A}/P(A = 1) = E{Y*(0)π(X)}/P(A = 1).
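The first identity follows from A1-A3 by iterated expectations (a standard argument, filling in the step behind the display above):

    E{I(A = 1)Y/π(X)} = E{A Y*(1)/π(X)}                    (by A1)
                      = E[ E{A | X, Y*(1)} Y*(1)/π(X) ]    (iterated expectations)
                      = E{π(X) Y*(1)/π(X)} = E{Y*(1)}      (by A2; A3 ensures π(X) > 0).

The same argument with 1 − A and 1 − π(X) identifies E{Y*(0)}.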


Estimation of CTE

Two ways of using random forests:

Method I: Fit Y ~ RF(X, A). Denote the resulting estimator by μ̂(X, A). Then τ̂(x) = μ̂(x, 1) − μ̂(x, 0).
Method II: Fit Y ~ RF(X) separately for the A = 1 and A = 0 groups. Denote the resulting estimators by μ̂_1(X) and μ̂_0(X), respectively. Then τ̂(x) = μ̂_1(x) − μ̂_0(x).
Method II is usually better (more accurate) than Method I.
Under assumptions A1-A2, it can be shown that the optimal treatment decision rule is estimated by d̂_opt(x) = I{τ̂(x) > 0}.
The rule based on τ̂(x) obtained by RF may be difficult to interpret clinically. In practice, a simple and interpretable decision rule may be preferred. We may consider a tree classification based on the labels Z_i ≡ I{τ̂(X_i) > 0} and features X_i.
More accurately, we should consider a weighted classification of Z_i on X_i with weights w_i = |τ̂(X_i)|.

Estimation of ATE

Inverse probability weighted estimation (IPW):

    Δ̂ = (1/n) Σ_{i=1}^{n} A_i Y_i / π̂(X_i) − (1/n) Σ_{i=1}^{n} (1 − A_i) Y_i / {1 − π̂(X_i)},

where π̂(X) is the estimated propensity score, e.g., from logistic regression or a random forest.
The IPW estimator lacks efficiency (why?).
Augmented inverse probability weighted estimation (AIPW):

    Δ̂ = (1/n) Σ_{i=1}^{n} [ A_i Y_i / π̂(X_i) − {A_i / π̂(X_i) − 1} μ̂_1(X_i) ]
         − (1/n) Σ_{i=1}^{n} [ (1 − A_i) Y_i / {1 − π̂(X_i)} − {(1 − A_i) / (1 − π̂(X_i)) − 1} μ̂_0(X_i) ].

(A sketch of both estimators follows.)
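A minimal sketch of both estimators (illustrative; it assumes a numeric response Y, a binary treatment A, and a covariate data frame X are already defined, with the propensity score from logistic regression and the outcome regressions μ̂_1, μ̂_0 fitted by random forests as in Method II):

library(randomForest)

dat = data.frame(Y = Y, A = A, X)

## propensity score by logistic regression
ps.fit = glm(A ~ . - Y, data = dat, family = binomial)
pihat = fitted(ps.fit)

## outcome regressions fitted separately in each arm
mu1.fit = randomForest(Y ~ . - A, data = dat[dat$A == 1, ])
mu0.fit = randomForest(Y ~ . - A, data = dat[dat$A == 0, ])
mu1 = predict(mu1.fit, dat)
mu0 = predict(mu0.fit, dat)

## IPW and AIPW estimates of the ATE
ate.ipw  = mean(A * Y / pihat) - mean((1 - A) * Y / (1 - pihat))
ate.aipw = mean(A * Y / pihat - (A / pihat - 1) * mu1) -
           mean((1 - A) * Y / (1 - pihat) - ((1 - A) / (1 - pihat) - 1) * mu0)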


Estimation of ATE/ATT
The AIPW estimator is doubly robust: Δ̂ is consistent when either π̂(x) is consistent or both μ̂_0(x) and μ̂_1(x) are consistent.
Alternative estimators of the ATE (regression estimators):

    Δ̂ = (1/n) Σ_{i=1}^{n} {μ̂_1(X_i) − μ̂_0(X_i)},

    Δ̂ = (1/n) Σ_{i=1}^{n} [ A_i{Y_i − μ̂_0(X_i)} + (1 − A_i){μ̂_1(X_i) − Y_i} ].

Estimators of the ATT:

    Δ̂_1 = Σ_{i=1}^{n} A_i Y_i / Σ_{i=1}^{n} A_i − Σ_{i=1}^{n} {(1 − A_i) π̂(X_i) / (1 − π̂(X_i))} Y_i / Σ_{i=1}^{n} A_i,

    Δ̂_1 = Σ_{i=1}^{n} A_i {Y_i − μ̂_0(X_i)} / Σ_{i=1}^{n} A_i.

AIDS Data Example

AIDS Clinical Trials Group Study 175 (ACTG175):

HIV-infected patients with CD4 counts of 200-500/mm^3.
Randomized to four treatment groups:
  zidovudine alone (ZDV)
  zidovudine plus didanosine (ZDV+ddI)
  zidovudine plus zalcitabine (ZDV+zal)
  didanosine alone (ddI)
12 baseline clinical covariates, such as age (years), weight (kg), Karnofsky score (scale of 0-100), and CD4 count (cells/mm^3) at baseline.
Primary endpoints of interest: (i) CD4 count at 20 ± 5 weeks post-baseline; (ii) the first time that a patient had a decline in their CD4 cell count of at least 50%, or an event indicating progression to AIDS, or death.


Survival Curves
[Figure: Kaplan-Meier survival curves for the four treatment groups (black: ZDV only; red: ZDV+ddI; blue: ZDV+zal; green: ddI only). The x-axis is follow-up time in days (0-1200); the y-axis is survival probability (0.6-1.0).]


A Closer Look at the Top 2 Treatments

A=1: ZDV + ddI
A=0: ZDV + zal

          Age ≤ 34   Age > 34   Total
A=1          266        256       522
A=0          263        261       524
Total        529        517      1046

Note: Age 34 is the median age for the patients.


Heterogeneous Treatment Effects


[Figure: Kaplan-Meier survival curves for the two treatments (trt 1: ZDV+ddI; trt 0: ZDV+zal) by age group. Left panel: Age ≤ 34; right panel: Age > 34. The x-axis is days (0-1000); the y-axis is survival probability (0.75-1.00).]


Random Forest Estimation of CTE

library(survival)

a = read.table("C:\\Users\\wlu4\\all.dat", sep=" ")
b = read.table("C:\\Users\\wlu4\\events.dat")
ix = match(as.numeric(b[,1]), as.numeric(a[,1]))   # match patients across the two files
trt = as.numeric(b[!is.na(ix),4])                  # treatment arm

### Kaplan-Meier curves by treatment arm
t.time = as.numeric(b[!is.na(ix),3])               # event/censoring time
delta = as.numeric(b[!is.na(ix),2])                # event indicator

km.fit = survfit(Surv(t.time, delta) ~ trt)
plot(km.fit, ylim=c(0.55, 1), col=c("black","red","blue","green"))
legend(10, .7, c("black: ZDV only", "red: ZDV+ddI",
                 "blue: ZDV+zal", "green: ddI only"))

Random Forest Estimation of CTE

#### consider the ZDV+ddI (A=1) and ZDV+zal (A=0) groups
ix1 = (trt == 1) | (trt == 2)
Y = as.numeric(a[ix1, 20])                              # response Y (column 20)
A = as.numeric(trt[ix1] == 1)                           # treatment indicator
X = as.matrix(a[ix1, c(2,3,7,19,23,4,5,6,12,13,14,16)]) # 12 baseline covariates

### random forest fits within each treatment arm (Method II)
library(randomForestSRC)
Y1 = Y[A==1]
X1 = X[A==1,]
data1 = data.frame(cbind(Y1, X1))
fit1 = rfsrc(Y1 ~ ., data = data1)

Y0 = Y[A==0]
X0 = X[A==0,]
data0 = data.frame(cbind(Y0, X0))
fit0 = rfsrc(Y0 ~ ., data = data0)

Random Forest Estimation of CTE

mu11 = predict(fit1)          # A=1 model on A=1 data (out-of-bag)
mu10 = predict(fit1, data0)   # A=1 model on A=0 data
mu01 = predict(fit0, data1)   # A=0 model on A=1 data
mu00 = predict(fit0)          # A=0 model on A=0 data (out-of-bag)

## $predicted.oob: out-of-bag prediction
tau1 = mu11$predicted.oob - mu01$predicted   # tau-hat for the A=1 group
tau0 = mu10$predicted - mu00$predicted.oob   # tau-hat for the A=0 group

tau = numeric(length(Y))
tau[A==1] = tau1
tau[A==0] = tau0
cr = sum((tau > 0)*(Y > 0))/length(Y)        # correct decision rate


OTR Estimation Using Tree

## Tree classification for the optimal treatment rule
library(tree)
Z = as.numeric(tau > 0)
data2 = data.frame(cbind(Z, X))

## fit the original tree with split="deviance"
tree_otr = tree(Z ~ ., data = data2, split = "deviance")

## tree summary and plot
summary(tree_otr)
plot(tree_otr)
text(tree_otr)
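The weighted classification mentioned on the CTE slide (labels Z_i with weights w_i = |τ̂(X_i)|) can be sketched with rpart, which accepts case weights; this is an illustrative alternative, not code from the lecture:

library(rpart)

w = abs(tau)                                   # weights |tau-hat(X_i)|
dataw = data.frame(Z = factor(Z), X)
tree_otr_w = rpart(Z ~ ., data = dataw, weights = w, method = "class")
plot(tree_otr_w)
text(tree_otr_w)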


OTR Estimation Using Tree

## 10-fold CV for parameter tuning
set.seed(11)
tree_cv = cv.tree(tree_otr, , prune.tree, K = 10)
min_idx = which.min(tree_cv$dev)      # index of the smallest CV deviance
min_idx
tree_cv$size[min_idx]                 # tree size selected by CV

## prune the tree (best = 12, chosen based on the CV results above)
tree_otr_prune = prune.tree(tree_otr, best = 12)

plot(tree_otr_prune)
text(tree_otr_prune)


Original Tree Using Deviance

[Figure: the unpruned tree for Z fitted with split="deviance". The root split is V2 < 35.5, with further splits on V2, V3, V19, V23, and V14; the terminal nodes show the fitted proportions of Z = 1 (i.e., of tau-hat > 0).]


After Parameter Tuning Based on 10-fold CV

[Figure: the pruned tree (12 terminal nodes). The root split is V2 < 35.5, with further splits on V2, V3, V19, V23, and V14; the terminal nodes show the fitted proportions of Z = 1.]
