
Lecture 22: Bagging and Random Forest

Wenbin Lu

Department of Statistics
North Carolina State University

Fall 2019



Outline

Bagging Methods
Bagging Trees
Random Forest
Applications in Causal Inference/Optimal Treatment Decision



Bagging Trees

Ensemble Methods (Model Averaging)

A machine learning ensemble meta-algorithm designed to improve the stability and accuracy of machine learning algorithms.
Widely used in statistical classification and regression.
Can reduce variance and help to avoid overfitting.
Usually applied to decision tree methods, but can be used with any type of method.


Bootstrap Aggregation (Bagging)

Bagging is a special case of the model averaging approach.
Bootstrap aggregation = bagging.
Bagging leads to "improvements for unstable procedures" (Breiman, 1996), e.g. neural nets, classification and regression trees, and subset selection in linear regression (Breiman, 1994).
On the other hand, it can mildly degrade the performance of stable methods such as K-nearest neighbors (Breiman, 1996).


Basic Idea

Given a standard training set D of size n, bagging
generates B new training sets D_i, each of size n', by sampling from D uniformly and with replacement. This kind of sample is known as a bootstrap sample; by sampling with replacement, some observations may be repeated in each D_i.
The B models are fitted using the above B bootstrap samples and combined by averaging the output (for regression) or voting (for classification).
If n' = n, then for large n the set D_i is expected to contain the fraction (1 - 1/e) ≈ 63.2% of the unique examples of D, the rest being duplicates (a quick numerical check follows).
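A quick numerical check of the 63.2% figure (a minimal R sketch; object names are illustrative):

set.seed(1)
n = 10000
## fraction of unique original observations appearing in a bootstrap sample of size n
frac.unique = replicate(200, length(unique(sample(1:n, n, replace = TRUE))) / n)
mean(frac.unique)     # close to 1 - 1/e
1 - 1/exp(1)          # 0.6321206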


Bagging Procedures

Bagging uses the bootstrap to improve the estimate or prediction of a fit.

Given data Z = {(x_1, y_1), ..., (x_n, y_n)}, we generate B bootstrap samples Z^{*b}.
Empirical distribution P̂: put equal probability 1/n on each (x_i, y_i) (discrete).
Generate Z^{*b} = {(x_1^*, y_1^*), ..., (x_n^*, y_n^*)} ~ P̂, b = 1, ..., B.
Obtain f̂^{*b}(x), b = 1, ..., B.
The Monte Carlo estimate of the bagging estimate:

    f̂_bag(x) = (1/B) Σ_{b=1}^{B} f̂^{*b}(x).
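A minimal sketch of this procedure for a regression tree (the data-generating step and names are illustrative; rpart stands in for any base learner):

library(rpart)

set.seed(2)
n = 200
x = runif(n)
y = sin(2 * pi * x) + rnorm(n, sd = 0.3)
dat = data.frame(x = x, y = y)

B = 100
x.grid = data.frame(x = seq(0, 1, length.out = 101))
pred = matrix(0, nrow(x.grid), B)
for (b in 1:B) {
  idx = sample(1:n, n, replace = TRUE)        # bootstrap sample Z^{*b} from P-hat
  fit.b = rpart(y ~ x, data = dat[idx, ])     # f-hat^{*b}
  pred[, b] = predict(fit.b, x.grid)
}
f.bag = rowMeans(pred)                        # Monte Carlo bagging estimate f-hat_bag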


Properties of Bagging Estimates

Advantages:
Note f̂_bag(x) → E_P̂ f̂^*(x) as B → ∞.
f̂_bag(x) typically has smaller variance than f̂(x).
f̂_bag(x) differs from f̂(x) only when the latter is a nonlinear or adaptive function of the data.


Bagging Classification Trees


In classification problems, there are two scenarios:
(1) f̂^{*b}(x) is an indicator vector, with one 1 and K − 1 0's (hard classification).
(2) f̂^{*b}(x) = (p_1, ..., p_K), the estimates of the class probabilities P̂^b(Y = k | X = x), k = 1, ..., K (soft classification). The bagged estimates are the average prediction at x from the B trees:

    f̂_k^bag(x) = B^{-1} Σ_{b=1}^{B} f̂_k^{*b}(x),   k = 1, ..., K.


Bagging Classification Trees (cont.)


There are two types of averaging:
(1) Use the majority vote: arg max_k Σ_{b=1}^{B} I{f̂^{*b}(x) = k}.
(2) Use the averaged probabilities. The bagged classifier is

    Ĝ_bag(x) = arg max_k f̂_k^bag(x).

Note: the second method tends to produce estimates with lower variance than the first, especially for small B (a short sketch of both rules follows).
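A short sketch comparing the two aggregation rules for bagged classification trees (an illustrative function; it assumes a data frame yx with a factor response named y and rpart as the base tree learner):

library(rpart)

bag.classify = function(yx, newdata, B = 200) {
  n = nrow(yx)
  K = nlevels(yx$y)
  votes = matrix(0, nrow(newdata), K)
  probs = matrix(0, nrow(newdata), K)
  for (b in 1:B) {
    fit.b = rpart(y ~ ., data = yx[sample(1:n, n, replace = TRUE), ])
    p.b = predict(fit.b, newdata, type = "prob")     # class probability estimates
    probs = probs + p.b / B                          # rule (2): average the probabilities
    votes = votes + outer(max.col(p.b), 1:K, "==")   # rule (1): one vote per tree
  }
  list(vote = levels(yx$y)[max.col(votes)],          # majority-vote classifier
       prob = levels(yx$y)[max.col(probs)])          # probability-averaging classifier
}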


Example

Sample size n = 30, two classes.
p = 5 features, each having a standard Gaussian distribution with pairwise correlation Corr(X_j, X_k) = 0.95.
The response Y was generated according to

    Pr(Y = 1 | x_1 ≤ 0.5) = 0.2,   Pr(Y = 1 | x_1 > 0.5) = 0.8.

The Bayes error is 0.2.
A test sample of size 2,000 was generated from the same population.
We fit classification trees to the training sample and to each of 200 bootstrap samples (a sketch of the simulation setting follows).
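A sketch of this simulation setting (assuming MASS::mvrnorm for the equicorrelated Gaussian features; names are illustrative):

library(MASS)
library(rpart)

sim.data = function(n) {
  Sigma = matrix(0.95, 5, 5); diag(Sigma) = 1        # pairwise correlation 0.95
  X = mvrnorm(n, mu = rep(0, 5), Sigma = Sigma)
  p = ifelse(X[, 1] <= 0.5, 0.2, 0.8)                # Pr(Y = 1 | x1)
  data.frame(Y = factor(rbinom(n, 1, p)), X)
}

set.seed(3)
train = sim.data(30)
test  = sim.data(2000)
fit0  = rpart(Y ~ ., data = train)                   # tree on the original training sample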


[Figure 8.9 (The Elements of Statistical Learning, Hastie, Tibshirani & Friedman 2001, Chapter 8): Bagging trees on the simulated dataset. The top left panel shows the original tree; five trees grown on bootstrap samples are shown. The bootstrap trees split on different variables (x.1, x.2, x.3, x.4) and at different cutpoints.]


About Bagging Trees

The original tree and five bootstrap trees are all different:
with different splitting features
with different splitting cutpoints
The trees have high variance due to the correlation in the predictors.
Under squared-error loss, averaging reduces variance and leaves bias unchanged (see the calculation below).
Therefore, bagging will often decrease MSE.
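To make the variance claim concrete (a standard calculation, not from the slides): if the B predictions f̂^{*b}(x) at a fixed x are identically distributed with variance σ² and pairwise correlation ρ, then

    E{f̂_bag(x)} = E{f̂^{*b}(x)}                 (bias unchanged),
    Var{f̂_bag(x)} = ρσ² + (1 − ρ)σ²/B,

which decreases toward ρσ² as B grows.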


[Figure 8.10 (The Elements of Statistical Learning, Hastie, Tibshirani & Friedman 2001, Chapter 8): Error curves for the bagging example of Figure 8.9. Shown is the test error of the original tree and of the bagged trees as a function of the number of bootstrap samples (0 to 200); the Bayes error (0.20) is marked for reference. The green points correspond to majority vote, while the purple points average the probabilities.]


About Bagging

Bagging can dramatically reduce the variance of unstable procedures like trees, leading to improved prediction.
Bagging smooths out this variance and hence reduces the test error.
Bagging can stabilize unstable procedures.
The simple structure in the model can be lost due to bagging:
a bagged tree is no longer a tree;
the bagged estimate is not easy to interpret.
Under 0-1 loss for classification, bagging may not help, due to the nonadditivity of bias and variance.



Random Forest


Random Forests (Breiman, 2001):

A random forest is an ensemble classifier that consists of many decision trees and outputs the class that is the mode of the classes output by the individual trees.
Bagging: random subsampling of the training set.
Random subspace method (Ho, 1995, 1998): random subsampling of the feature space (called "feature bagging"). It reduces the correlation of the trees grown on ordinary bootstrap samples: if one or a few features are very strong predictors for the response variable (target output), these features will be selected in many of the trees, causing the trees to become correlated.
Random forest: combines bagging and the random subspace method.


Learning Algorithm for Building A Tree

Denote the training set size by n and the number of variables by p. Assume m < p is the number of input variables used to determine the decision at a node of the tree.

Randomly choose n samples with replacement (i.e. take a bootstrap sample).
Use the rest of the samples (the out-of-bag samples) to estimate the error of the tree, by predicting their classes.
For each node of the tree, randomly choose m variables on which to base the decision at that node. Calculate the best split based on these m variables in the training set.
Each tree is fully grown and not pruned.

(A minimal call matching this algorithm is sketched below.)
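For reference, a minimal randomForest call that exposes the two tuning parameters of this algorithm (ntree bootstrap trees, m = mtry candidate variables per split); the iris data set is used only as a placeholder:

library(randomForest)

fit.rf = randomForest(Species ~ ., data = iris,
                      ntree = 500,          # number of bootstrap trees
                      mtry = 2,             # m variables considered at each split
                      importance = TRUE)
print(fit.rf)                               # includes the out-of-bag (OOB) error estimate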


Properties of Random Forest


Advantages:
highly accurate in many real examples; fast; handles a very large number of input variables
estimates the importance of variables for classification
generates an internal unbiased estimate of the generalization error as the forest building progresses
imputes missing data and maintains accuracy when a large proportion of the data are missing
provides an experimental way to detect variable interactions
can balance error in unbalanced data sets
computes proximities between cases, useful for clustering, detecting outliers, and (by scaling) visualizing the data
can be extended to unlabeled data, leading to unsupervised clustering, outlier detection and data views
(Several of these are illustrated in the sketch below.)
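Several of these features can be pulled directly from a randomForest fit; continuing the placeholder iris sketch above (function names are from the randomForest package):

importance(fit.rf)                   # variable importance measures
varImpPlot(fit.rf)

fit.prox = randomForest(Species ~ ., data = iris, proximity = TRUE)
MDSplot(fit.prox, iris$Species)      # visualize cases by scaling the proximity matrix
tail(fit.prox$err.rate[, "OOB"], 1)  # internal (OOB) estimate of the generalization error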

Properties of Random Forest

Disadvantages:
Random forests are prone to overfitting for some data sets; this is even more pronounced in noisy classification/regression tasks.
Random forests do not handle large numbers of irrelevant features well.

Implementation:
R packages: randomForest; randomForestSRC (for survival data); grf (generalized random forest, for causal effect estimation).
Asymptotics of the generalized random forest: Athey et al. (2019).


Applications in Causal Inference/Optimal Treatment Decision

Potential outcomes: Y*(a), the outcome that would result if a patient were given treatment a ∈ A.
Observed data: response Y (a larger value of Y indicates a better outcome); p-dimensional covariates X ∈ X; received treatment A ∈ A.
Consider the binary treatment case: A = {0, 1}.
Average treatment effect (ATE): Δ = E{Y*(1) − Y*(0)}.
Conditional treatment effect (CTE): τ(x) = E{Y*(1) | X = x} − E{Y*(0) | X = x}.
Average treatment effect on the treated (ATT): Δ_1 = E{Y*(1) | A = 1} − E{Y*(0) | A = 1}.


Assumptions for Causal Inference

A1. Consistency assumption:

    Y = Y*(1)A + Y*(0)(1 − A)

A2. No unmeasured confounders assumption (strong ignorability):

    {Y*(1), Y*(0)} ⊥ A | X

A3. Positivity assumption: 0 < P(A = 1 | X) < 1 for any X.


Estimation of Causal Effects

Note that Δ ≠ E(Y | A = 1) − E(Y | A = 0). (Why?)
In fact,

    E{ I(A = 1) Y / π(X) } = E{Y*(1)},

where π(X) = P(A = 1 | X) (a short derivation follows below).
Under assumptions A1-A2, we have τ(x) = E(Y | A = 1, X = x) − E(Y | A = 0, X = x).
Proof:

    E{Y*(1) | X = x} = E{Y*(1) | A = 1, X = x} = E{Y | A = 1, X = x}.

How about Δ_1?

    E{Y*(0) | A = 1} = E{Y*(0)A}/P(A = 1) = E{Y*(0)π(X)}/P(A = 1).
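The first identity follows from A1-A3 by iterated expectations (a standard argument, filling in the step behind the display above):

    E{I(A = 1)Y/π(X)} = E{A Y*(1)/π(X)}                    (by A1)
                      = E[ E{A | X, Y*(1)} Y*(1)/π(X) ]    (iterated expectations)
                      = E{π(X) Y*(1)/π(X)} = E{Y*(1)}      (by A2; A3 ensures π(X) > 0).

The same argument with 1 − A and 1 − π(X) identifies E{Y*(0)}.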


Estimation of CTE

Two ways of using random forests:

Method I: Fit Y ~ RF(X, A). Denote the resulting estimator by μ̂(X, A). Then τ̂(x) = μ̂(x, 1) − μ̂(x, 0).
Method II: Fit Y ~ RF(X) separately for the A = 1 and A = 0 groups. Denote the resulting estimators by μ̂_1(X) and μ̂_0(X), respectively. Then τ̂(x) = μ̂_1(x) − μ̂_0(x).
Method II is usually better (more accurate) than Method I.
Under assumptions A1-A2, it can be shown that the optimal treatment decision rule is estimated by d̂_opt(x) = I{τ̂(x) > 0}.
The rule based on τ̂(x) obtained by RF may be difficult to interpret clinically. In practice, a simple and interpretable decision rule may be preferred. We may consider a tree classification based on the labels Z_i ≡ I{τ̂(X_i) > 0} and features X_i.
More accurately, we should consider a weighted classification of Z_i on X_i with weights w_i = |τ̂(X_i)|.

Estimation of ATE

Inverse probability weighted estimation (IPW):

    Δ̂ = (1/n) Σ_{i=1}^{n} A_i Y_i / π̂(X_i) − (1/n) Σ_{i=1}^{n} (1 − A_i) Y_i / {1 − π̂(X_i)},

where π̂(X) is the estimated propensity score, e.g., from logistic regression or a random forest.
The IPW estimator lacks efficiency (why?).
Augmented inverse probability weighted estimation (AIPW):

    Δ̂ = (1/n) Σ_{i=1}^{n} [ A_i Y_i / π̂(X_i) − {A_i / π̂(X_i) − 1} μ̂_1(X_i) ]
         − (1/n) Σ_{i=1}^{n} [ (1 − A_i) Y_i / {1 − π̂(X_i)} − {(1 − A_i) / (1 − π̂(X_i)) − 1} μ̂_0(X_i) ].

(A sketch of both estimators follows.)
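A minimal sketch of both estimators (illustrative; it assumes a numeric response Y, a binary treatment A, and a covariate data frame X are already defined, with the propensity score from logistic regression and the outcome regressions μ̂_1, μ̂_0 fitted by random forests as in Method II):

library(randomForest)

dat = data.frame(Y = Y, A = A, X)

## propensity score by logistic regression
ps.fit = glm(A ~ . - Y, data = dat, family = binomial)
pihat = fitted(ps.fit)

## outcome regressions fitted separately in each arm
mu1.fit = randomForest(Y ~ . - A, data = dat[dat$A == 1, ])
mu0.fit = randomForest(Y ~ . - A, data = dat[dat$A == 0, ])
mu1 = predict(mu1.fit, dat)
mu0 = predict(mu0.fit, dat)

## IPW and AIPW estimates of the ATE
ate.ipw  = mean(A * Y / pihat) - mean((1 - A) * Y / (1 - pihat))
ate.aipw = mean(A * Y / pihat - (A / pihat - 1) * mu1) -
           mean((1 - A) * Y / (1 - pihat) - ((1 - A) / (1 - pihat) - 1) * mu0)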


Estimation of ATE/ATT
The AIPW estimator is doubly robust: Δ̂ is consistent when either π̂(x) is consistent or both μ̂_0(x) and μ̂_1(x) are consistent.
Alternative estimators of the ATE (regression estimators):

    Δ̂ = (1/n) Σ_{i=1}^{n} {μ̂_1(X_i) − μ̂_0(X_i)},

    Δ̂ = (1/n) Σ_{i=1}^{n} [ A_i{Y_i − μ̂_0(X_i)} + (1 − A_i){μ̂_1(X_i) − Y_i} ].

Estimators of the ATT:

    Δ̂_1 = Σ_{i=1}^{n} A_i Y_i / Σ_{i=1}^{n} A_i − Σ_{i=1}^{n} {(1 − A_i) π̂(X_i) / (1 − π̂(X_i))} Y_i / Σ_{i=1}^{n} A_i,

    Δ̂_1 = Σ_{i=1}^{n} A_i {Y_i − μ̂_0(X_i)} / Σ_{i=1}^{n} A_i.

AIDS Data Example

AIDS Clinical Trials Group Study 175 (ACTG175):

HIV-infected patients with CD4 counts of 200-500/mm^3.
Randomized to four treatment groups:
  zidovudine alone (ZDV)
  zidovudine plus didanosine (ZDV+ddI)
  zidovudine plus zalcitabine (ZDV+zal)
  didanosine alone (ddI)
12 baseline clinical covariates, such as age (years), weight (kg), Karnofsky score (scale of 0-100), and CD4 count (cells/mm^3) at baseline.
Primary endpoints of interest: (i) CD4 count at 20 ± 5 weeks post-baseline; (ii) the first time that a patient had a decline in their CD4 cell count of at least 50%, or an event indicating progression to AIDS, or death.


Survival Curves
[Figure: Kaplan-Meier survival curves for the four treatment groups (black: ZDV only; red: ZDV+ddI; blue: ZDV+zal; green: ddI only). The x-axis is follow-up time in days (0-1200); the y-axis is survival probability (0.6-1.0).]


A Closer Look at the Top 2 Treatments

A=1: ZDV + ddI
A=0: ZDV + zal

          Age ≤ 34   Age > 34   Total
A=1          266        256       522
A=0          263        261       524
Total        529        517      1046

Note: Age 34 is the median age for the patients.


Heterogeneous Treatment Effects


[Figure: Kaplan-Meier survival curves for the two treatments (trt 1: ZDV+ddI; trt 0: ZDV+zal) by age group. Left panel: Age ≤ 34; right panel: Age > 34. The x-axis is days (0-1000); the y-axis is survival probability (0.75-1.00).]


Random Forest Estimation of CTE

library(survival)

a = read.table("C:\\Users\\wlu4\\all.dat", sep=" ")
b = read.table("C:\\Users\\wlu4\\events.dat")
ix = match(as.numeric(b[,1]), as.numeric(a[,1]))   # match patients across the two files
trt = as.numeric(b[!is.na(ix),4])                  # treatment arm

### Kaplan-Meier curves by treatment arm
t.time = as.numeric(b[!is.na(ix),3])               # event/censoring time
delta = as.numeric(b[!is.na(ix),2])                # event indicator

km.fit = survfit(Surv(t.time, delta) ~ trt)
plot(km.fit, ylim=c(0.55, 1), col=c("black","red","blue","green"))
legend(10, .7, c("black: ZDV only", "red: ZDV+ddI",
                 "blue: ZDV+zal", "green: ddI only"))

Random Forest Estimation of CTE

#### consider the ZDV+ddI (A=1) and ZDV+zal (A=0) groups
ix1 = (trt == 1) | (trt == 2)
Y = as.numeric(a[ix1, 20])                              # response Y (column 20)
A = as.numeric(trt[ix1] == 1)                           # treatment indicator
X = as.matrix(a[ix1, c(2,3,7,19,23,4,5,6,12,13,14,16)]) # 12 baseline covariates

### random forest fits within each treatment arm (Method II)
library(randomForestSRC)
Y1 = Y[A==1]
X1 = X[A==1,]
data1 = data.frame(cbind(Y1, X1))
fit1 = rfsrc(Y1 ~ ., data = data1)

Y0 = Y[A==0]
X0 = X[A==0,]
data0 = data.frame(cbind(Y0, X0))
fit0 = rfsrc(Y0 ~ ., data = data0)

Random Forest Estimation of CTE

mu11 = predict(fit1)          # A=1 model on A=1 data (out-of-bag)
mu10 = predict(fit1, data0)   # A=1 model on A=0 data
mu01 = predict(fit0, data1)   # A=0 model on A=1 data
mu00 = predict(fit0)          # A=0 model on A=0 data (out-of-bag)

## $predicted.oob: out-of-bag prediction
tau1 = mu11$predicted.oob - mu01$predicted   # tau-hat for the A=1 group
tau0 = mu10$predicted - mu00$predicted.oob   # tau-hat for the A=0 group

tau = numeric(length(Y))
tau[A==1] = tau1
tau[A==0] = tau0
cr = sum((tau > 0)*(Y > 0))/length(Y)        # correct decision rate


OTR Estimation Using Tree

## Tree classification for the optimal treatment rule
library(tree)
Z = as.numeric(tau > 0)
data2 = data.frame(cbind(Z, X))

## fit the original tree with split="deviance"
tree_otr = tree(Z ~ ., data = data2, split = "deviance")

## tree summary and plot
summary(tree_otr)
plot(tree_otr)
text(tree_otr)
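The weighted classification mentioned on the CTE slide (labels Z_i with weights w_i = |τ̂(X_i)|) can be sketched with rpart, which accepts case weights; this is an illustrative alternative, not code from the lecture:

library(rpart)

w = abs(tau)                                   # weights |tau-hat(X_i)|
dataw = data.frame(Z = factor(Z), X)
tree_otr_w = rpart(Z ~ ., data = dataw, weights = w, method = "class")
plot(tree_otr_w)
text(tree_otr_w)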


OTR Estimation Using Tree

## 10-fold CV for parameter tuning
set.seed(11)
tree_cv = cv.tree(tree_otr, , prune.tree, K = 10)
min_idx = which.min(tree_cv$dev)      # index of the smallest CV deviance
min_idx
tree_cv$size[min_idx]                 # tree size selected by CV

## prune the tree (best = 12, chosen based on the CV results above)
tree_otr_prune = prune.tree(tree_otr, best = 12)

plot(tree_otr_prune)
text(tree_otr_prune)


Original Tree Using Deviance

[Figure: the unpruned tree for Z fitted with split="deviance". The root split is V2 < 35.5, with further splits on V2, V3, V19, V23, and V14; the terminal nodes show the fitted proportions of Z = 1 (i.e., of tau-hat > 0).]


After Parameter Tuning Based on 10-fold CV

[Figure: the pruned tree (12 terminal nodes). The root split is V2 < 35.5, with further splits on V2, V3, V19, V23, and V14; the terminal nodes show the fitted proportions of Z = 1.]
