Lecture 22: Bagging and Random Forest
Wenbin Lu
Department of Statistics
North Carolina State University
Fall 2019
Bagging Methods
Bagging Trees
Random Forest
Applications in Causal Inference/Optimal Treatment Decision
Basic Idea
Bagging Procedures
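Given training data, draw $B$ bootstrap samples, fit the base learner on each to obtain $\hat f^{*b}(x)$ for $b = 1, \dots, B$, and average:
$$\hat f_{bag}(x) = \frac{1}{B}\sum_{b=1}^{B} \hat f^{*b}(x).$$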
Advantages:
Note that $\hat f_{bag}(x) \longrightarrow E_{\hat P}\,\hat f^{*}(x)$ as $B \to \infty$;
$\hat f_{bag}(x)$ typically has smaller variance than $\hat f(x)$;
$\hat f_{bag}(x)$ differs from $\hat f(x)$ only when the latter is a nonlinear or adaptive function of the data.
For classification, let $\hat f_k^{bag}(x)$ be either the proportion of bootstrap trees predicting class $k$ or the average of their class-$k$ probabilities, and classify by
$$\hat G_{bag}(x) = \arg\max_{k} \hat f_k^{bag}(x).$$
Note: the second method (averaging class probabilities) tends to produce estimates with lower variances than the first (voting), especially for small $B$.
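A minimal sketch of bagging classification trees in R (the data here are synthetic and illustrative; rpart stands in for the lecture's tree learner):

library(rpart)

## synthetic illustration: binary response y, predictors x.1-x.4
set.seed(1)
n = 200
X = data.frame(matrix(rnorm(n*4), n, 4))
names(X) = paste0("x.", 1:4)
y = factor(as.numeric(X$x.2 + rnorm(n) > 0))
dat = data.frame(y, X)

B = 100
prob = matrix(0, n, B)
for (b in 1:B) {
  idx = sample(n, replace = TRUE)                 # bootstrap sample
  fit = rpart(y ~ ., data = dat[idx, ], method = "class")
  prob[, b] = predict(fit, newdata = dat)[, "1"]  # class-1 probability
}
f.bag = rowMeans(prob)       # bagged estimate (averaged probabilities)
G.bag = as.numeric(f.bag > 0.5)                   # bagged classifier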
Example
[Figure: the original tree and five bootstrap trees grown on bootstrap samples of the same data. The trees split on different features (x.1, x.2, x.3, x.4) and at different cutpoints (e.g., x.2 < 0.36, x.3 < -1.575, x.1 < -0.965).]
The original tree and five bootstrap trees are all different:
with different splitting features
with different splitting cutpoints
The trees have high variance due to the correlation in the predictors.
Under squared-error loss, averaging reduces variance and leaves bias unchanged.
Therefore, bagging will often decrease MSE.
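A standard calculation (not on the slides) makes the variance reduction explicit: if the $B$ bootstrap predictions each have variance $\sigma^2$ and pairwise correlation $\rho$, then
$$\mathrm{Var}\left\{\frac{1}{B}\sum_{b=1}^{B} \hat f^{*b}(x)\right\} = \rho\sigma^2 + \frac{1-\rho}{B}\,\sigma^2,$$
which decreases to $\rho\sigma^2$ as $B \to \infty$, while the bias of the average equals the bias of a single tree.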
[Figure: test error of the original tree and of bagged trees as the number of bootstrap trees grows, with the Bayes error shown for reference; the test-error axis runs from about 0.20 to 0.35, and the bagged trees achieve lower test error than the original tree.]
About Bagging
Random Forest
Disadvantages:
Random forests are prone to overfitting for some data sets; this is even more pronounced in noisy classification/regression tasks.
Random forests do not handle large numbers of irrelevant features well.
Implementation:
R packages: randomForest; randomForestSRC (for survival data); grf
(generalized random forest for causal effect estimation).
Asymptotics of generalized random forest: Athey et al. (2019)
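A minimal sketch of the randomForest package on synthetic regression data (the data and tuning values are illustrative, not from the lecture):

library(randomForest)

## synthetic regression example
set.seed(1)
n = 500; p = 10
X = data.frame(matrix(rnorm(n*p), n, p))
y = X[,1] + 2*X[,2]^2 + rnorm(n)

## ntree = number of trees; mtry = features tried at each split
fit = randomForest(x = X, y = y, ntree = 500, mtry = floor(p/3))
print(fit)               # out-of-bag error estimate
varImpPlot(fit)          # variable importance
yhat = predict(fit, newdata = X)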
Observed outcome: $Y = Y^{*}(1)A + Y^{*}(0)(1-A)$.
No-unmeasured-confounders assumption: $\{Y^{*}(1), Y^{*}(0)\} \perp\!\!\!\perp A \mid X$.
How about $\Delta_1$, the average treatment effect among the treated (ATT)?
Estimation of ATE/ATT
The AIPW estimator is doubly robust: $\hat\Delta$ is consistent when either $\hat\pi(x)$ is consistent or both $\hat\mu_0(x)$ and $\hat\mu_1(x)$ are consistent.
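For reference (the slide states the property without the formula), the standard AIPW estimator of the ATE is
$$\hat\Delta = \frac{1}{n}\sum_{i=1}^{n}\left[\frac{A_i Y_i - \{A_i - \hat\pi(X_i)\}\hat\mu_1(X_i)}{\hat\pi(X_i)} - \frac{(1-A_i)Y_i + \{A_i - \hat\pi(X_i)\}\hat\mu_0(X_i)}{1-\hat\pi(X_i)}\right],$$
where $\hat\pi(x)$ estimates the propensity score $P(A=1 \mid X=x)$ and $\hat\mu_a(x)$ estimates $E(Y \mid A=a, X=x)$.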
Alternative estimators of the ATE (regression estimators):
$$\hat\Delta = \frac{1}{n}\sum_{i=1}^{n}\{\hat\mu_1(X_i) - \hat\mu_0(X_i)\},$$
$$\hat\Delta = \frac{1}{n}\sum_{i=1}^{n}\Big[A_i\{Y_i - \hat\mu_0(X_i)\} + (1-A_i)\{\hat\mu_1(X_i) - Y_i\}\Big].$$
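A minimal R sketch of these two regression estimators, assuming vectors Y and A together with fitted values mu1 and mu0 for $\hat\mu_1(X_i)$ and $\hat\mu_0(X_i)$ (hypothetical names, e.g., obtained from arm-specific random forests):

## plug-in estimator: average the fitted contrasts
Delta.hat1 = mean(mu1 - mu0)
## residual-based estimator: impute only the counterfactual arm
Delta.hat2 = mean(A*(Y - mu0) + (1 - A)*(mu1 - Y))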
Estimators of the ATT:
$$\hat\Delta_1 = \frac{\sum_{i=1}^{n} A_i Y_i}{\sum_{i=1}^{n} A_i} - \frac{\sum_{i=1}^{n} \frac{(1-A_i)\hat\pi(X_i)}{1-\hat\pi(X_i)}\, Y_i}{\sum_{i=1}^{n} A_i},$$
$$\hat\Delta_1 = \sum_{i=1}^{n} A_i\{Y_i - \hat\mu_0(X_i)\} \Big/ \sum_{i=1}^{n} A_i.$$
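A corresponding sketch for the ATT estimators, assuming estimated propensity scores pi.x (a hypothetical name) along with Y, A, and mu0 from above:

## IPW-type ATT estimator: reweight controls by pi(X)/(1 - pi(X))
Delta1.ipw = sum(A*Y)/sum(A) - sum((1-A)*pi.x/(1-pi.x)*Y)/sum(A)
## regression-type ATT estimator: treated residuals from the control fit
Delta1.reg = sum(A*(Y - mu0))/sum(A)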
Survival Curves

[Figure: Kaplan-Meier survival curves by treatment (red: ZDV+ddI; blue: ZDV+zal), and survival probability versus days plotted separately for patients with Age <= 34 and Age > 34, comparing trt 1 against trt 0 (survival probability axis from about 0.75 to 1.00).]
library(survival)

## read the two data files and keep subjects appearing in both
a = read.table("C:\\Users\\wlu4\\all.dat", sep=" ")
b = read.table("C:\\Users\\wlu4\\events.dat")
ix = match(as.numeric(b[,1]), as.numeric(a[,1]))
trt = as.numeric(b[!is.na(ix),4])     # treatment arm

### Kaplan-Meier curves by treatment
t.time = as.numeric(b[!is.na(ix),3])  # observed event/censoring time
delta = as.numeric(b[!is.na(ix),2])   # event indicator
km.fit = survfit(Surv(t.time, delta) ~ trt)
plot(km.fit, ylim=c(0.55,1), col=c("black","red","blue","green"))
legend(10, .7, c("black: ZDV only", "red: ZDV+ddI",
                 "blue: ZDV+zal", "green: ddI only"))
library(randomForestSRC)
## arm-specific random forest for the control group (A == 0); the
## treated-arm objects (Y1, X1, data1, fit1) are built analogously
Y0 = Y[A==0]
X0 = X[A==0,]
data0 = data.frame(cbind(Y0,X0))
fit0 = rfsrc(Y0~., data=data0)
## predicted outcomes from the two arm-specific forests; for rfsrc
## fits the predictions are stored in the $predicted component
mu11 = predict(fit1)$predicted         # treated subjects, treated model
mu10 = predict(fit1, data0)$predicted  # controls, treated model
mu01 = predict(fit0, data1)$predicted  # treated subjects, control model
mu00 = predict(fit0)$predicted         # controls, control model
## contrasts (treated-model fit minus control-model fit) per subject
tau1 = mu11 - mu01                     # for treated subjects
tau0 = mu10 - mu00                     # for controls
tau = numeric(length(Y))
tau[A==1] = tau1
tau[A==0] = tau0
cr = sum((tau > 0)*(Y > 0))/length(Y)  # correct decision rate
[Figure: two fitted trees, each with root split V2 < 35.5 and secondary splits on V3 (cutpoints 75.1488 and 83.0544); terminal nodes carry numeric predictions such as 0.3333, 0.8898, 0.2453, 0.9423, 0.2000, 0.4118, and 0.8333.]