
Bootstrapping and Random Forests

The City College of New York


CSc 59929 – Introduction to Machine Learning
Spring 2020 – Erik K. Grimmelmann, Ph.D.
Bootstrapping

• Bootstrapping is any test or metric that relies on random sampling with replacement.
• Bootstrapping allows assigning measures of accuracy (defined in terms of bias, variance, confidence intervals, prediction error, or some other such measure) to sample estimates.

Bootstrapping

• Bootstrapping can be used to generate multiple training datasets from a single sample dataset.
• If a set of observations can be assumed to come from an I.I.D. population, an approximating distribution of the population can be built by constructing a number of resamples (with replacement) of the observed dataset, each of equal size to the observed dataset.
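
As a small illustration (not from the slides; the data, sample size, and number of resamples are arbitrary assumptions), the following sketch draws bootstrap resamples with NumPy and uses their spread to estimate the standard error of the sample mean.

```python
# A minimal bootstrap sketch; the dataset and sizes are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=100)   # observed sample of size N

K = 1000                                          # number of bootstrap resamples
N = data.shape[0]
boot_means = np.empty(K)
for k in range(K):
    # Resample with replacement, keeping the same size as the observed dataset.
    resample = rng.choice(data, size=N, replace=True)
    boot_means[k] = resample.mean()

# The spread of the bootstrap means approximates the sampling variability
# of the sample mean.
print("sample mean:", data.mean())
print("bootstrap estimate of the standard error:", boot_means.std(ddof=1))
```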

Expected value of a function of a random variable

The expected value of a function F of a random variable Y is denoted by E(F(Y)), where

$E\big(F(Y)\big) = \sum_{\text{all } i} F(y_i)\,\Pr(Y = y_i)$

or

$E\big(F(Y)\big) = \int_{-\infty}^{+\infty} F(y)\,p(y)\,dy$

Variance of a random variable

The variance of a random variable Y is denoted by Var(Y), where

$\mathrm{Var}(Y) = E\big[(Y - \mu_Y)^2\big]$

The variance is a measure of the width or dispersion of the distribution about its mean.

Variance of a random variable

Note that

$\begin{aligned}
\mathrm{Var}(Y) &= E\big[(Y - \mu_Y)^2\big] \\
&= E\big[Y^2 - 2\mu_Y Y + \mu_Y^2\big] \\
&= E\big[Y^2\big] - 2E[\mu_Y Y] + E\big[\mu_Y^2\big] \\
&= E\big[Y^2\big] - 2\mu_Y E[Y] + \mu_Y^2\,E[1] \\
&= E\big[Y^2\big] - 2\mu_Y^2 + \mu_Y^2 \\
&= E\big[Y^2\big] - \mu_Y^2 \\
&= E\big[Y^2\big] - \big(E[Y]\big)^2
\end{aligned}$

Standard deviation of a random variable

The standard deviation of a random variable Y is denoted by $\sigma_Y$, where

$\sigma_Y = \sqrt{\mathrm{Var}(Y)}$

The standard deviation is a measure of the width or dispersion of the distribution about its mean.

Covariance of two random variables

The covariance of two random variables, X and Y, is denoted by Cov(X, Y), where

$\mathrm{Cov}(X, Y) = E\big[(X - \mu_X)(Y - \mu_Y)\big]$

Covariance is a measure of the joint variability of two random variables.

Covariance of two random variables

Remembering that $\mu_X = E[X]$,

$\begin{aligned}
\mathrm{Cov}(X, Y) &= E\big[(X - \mu_X)(Y - \mu_Y)\big] \\
&= E\big[XY - \mu_X Y - \mu_Y X + \mu_X \mu_Y\big] \\
&= E[XY] - E[\mu_X Y] - E[\mu_Y X] + E[\mu_X \mu_Y] \\
&= E[XY] - \mu_X E[Y] - \mu_Y E[X] + \mu_X \mu_Y \\
&= E[XY] - \mu_X \mu_Y - \mu_Y \mu_X + \mu_X \mu_Y \\
&= E[XY] - \mu_X \mu_Y
\end{aligned}$

Correlation coefficient between two random variables

The correlation coefficient between two random variables, X and Y, is denoted by $\rho_{X,Y}$, where

$\rho_{X,Y} = \mathrm{Corr}(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{E[XY] - \mu_X \mu_Y}{\sigma_X \sigma_Y}$
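
A quick numerical check of the identities above (illustrative only; the synthetic data and its construction are assumptions, not from the slides):

```python
# Verify the variance, covariance, and correlation identities on synthetic data.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=10_000)
y = 0.6 * x + 0.8 * rng.normal(size=10_000)   # correlated with x by construction

mu_x, mu_y = x.mean(), y.mean()
var_y = np.mean((y - mu_y) ** 2)              # Var(Y) = E[(Y - mu_Y)^2]
var_y_alt = np.mean(y ** 2) - mu_y ** 2       # equals E[Y^2] - (E[Y])^2
cov_xy = np.mean((x - mu_x) * (y - mu_y))     # Cov(X, Y)
cov_xy_alt = np.mean(x * y) - mu_x * mu_y     # equals E[XY] - mu_X * mu_Y
rho = cov_xy / (x.std() * y.std())            # correlation coefficient

print(var_y, var_y_alt)                       # the two variance forms agree
print(cov_xy, cov_xy_alt)                     # the two covariance forms agree
print(rho, np.corrcoef(x, y)[0, 1])           # matches NumPy's own estimate
```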

Why bootstrapping works

Let $S = \{s_1, s_2, \ldots, s_N\}$ be a sample that is assumed to be I.I.D.

Let $\hat S = \{\hat S_1, \hat S_2, \ldots, \hat S_K\}$, where each $\hat S_k$ is a bootstrap sample that contains N elements drawn from S with replacement.

The elements of each $\hat S_k$ are denoted by $\{\hat s_{k1}, \hat s_{k2}, \ldots, \hat s_{kN}\}$.

Why bootstrapping works

Let $F(s_i)$ be a function defined on $S$.

Let $\hat F(\hat S_k) = \left\{ F(\hat s_{k1}), F(\hat s_{k2}), \ldots, F(\hat s_{kN}) \right\}$.

Let $\hat\mu_F = \frac{1}{K}\sum_{k=1}^{K} F(\hat S_k)$.

Let $\hat\sigma_F^2 = \frac{1}{K}\sum_{k=1}^{K} \mathrm{Var}\!\left( F(\hat S_k) \right)$.

Let $\hat\rho_F = \dfrac{\dfrac{1}{K^2}\displaystyle\sum_{j=1}^{K}\sum_{k=1}^{K} \mathrm{Cov}\!\left( F(\hat S_j), F(\hat S_k) \right)}{\dfrac{1}{K}\displaystyle\sum_{k=1}^{K} \mathrm{Var}\!\left( F(\hat S_k) \right)} = \dfrac{\dfrac{1}{K^2}\displaystyle\sum_{j=1}^{K}\sum_{k=1}^{K} \mathrm{Cov}\!\left( F(\hat S_j), F(\hat S_k) \right)}{\hat\sigma_F^2}$.

Variance of $F(\hat S)$

$\begin{aligned}
\mathrm{Var}\!\left[ \frac{1}{K}\sum_{i=1}^{K} F(\hat S_i) \right]
&= \frac{1}{K^2}\sum_{i=1}^{K}\sum_{j=1}^{K} \mathrm{Cov}\!\left( F(\hat S_i), F(\hat S_j) \right) \\
&= \frac{1}{K^2}\sum_{i=1}^{K}\left[ \mathrm{Cov}\!\left( F(\hat S_i), F(\hat S_i) \right) + \sum_{j\neq i} \mathrm{Cov}\!\left( F(\hat S_i), F(\hat S_j) \right) \right] \\
&= \frac{1}{K^2}\sum_{i=1}^{K}\left[ \mathrm{Var}\!\left( F(\hat S_i) \right) + \sum_{j\neq i} \mathrm{Cov}\!\left( F(\hat S_i), F(\hat S_j) \right) \right] \\
&= \frac{1}{K^2}\sum_{i=1}^{K}\left[ \hat\sigma^2 + (K-1)\,\hat\sigma^2\hat\rho \right] \\
&= \frac{1}{K^2}\left[ K\hat\sigma^2 + K(K-1)\,\hat\sigma^2\hat\rho \right] \\
&= \frac{\hat\sigma^2}{K} + \frac{K-1}{K}\,\hat\sigma^2\hat\rho
\end{aligned}$
Variance of $F(\hat S)$

$\begin{aligned}
\mathrm{Var}\!\left[ \frac{1}{K}\sum_{i=1}^{K} F(\hat S_i) \right]
&= \frac{\hat\sigma^2}{K} + \frac{K-1}{K}\,\hat\sigma^2\hat\rho \\
&= \frac{\hat\sigma^2}{K} + \hat\sigma^2\hat\rho - \frac{1}{K}\,\hat\sigma^2\hat\rho \\
&= \hat\sigma^2\hat\rho + \frac{1}{K}\,\hat\sigma^2\left( 1 - \hat\rho \right)
\end{aligned}$

The first term decreases if $\hat\rho$ decreases; the second term decreases as $K$ increases, regardless of $\hat\rho$.
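
A rough simulation of this variance formula (a sketch under the assumption that the K estimates are equally correlated Gaussian variables; nothing here comes from the slides):

```python
# Average K equally correlated estimates and compare the empirical variance of
# the average with rho*sigma^2 + (1/K)*sigma^2*(1 - rho).
import numpy as np

rng = np.random.default_rng(2)
K, sigma2, rho, trials = 20, 1.0, 0.3, 50_000

# Build K correlated variables via a shared component plus independent noise,
# so that each has variance sigma2 and each pair has correlation rho.
shared = rng.normal(scale=np.sqrt(rho * sigma2), size=(trials, 1))
noise = rng.normal(scale=np.sqrt((1 - rho) * sigma2), size=(trials, K))
estimates = shared + noise

avg = estimates.mean(axis=1)
print("simulated Var of the average:", avg.var())
print("formula rho*s2 + s2*(1-rho)/K:", rho * sigma2 + sigma2 * (1 - rho) / K)
```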

Tree Bagging algorithm

1) Randomly choose n samples from the training set with replacement to create a bootstrap sample.
2) Grow a decision tree from the bootstrap sample. At each node:
   1) Split the node using the feature that gives the best split according to the objective function (e.g., maximizing information gain).
3) Repeat steps 1) and 2) k times.
4) Aggregate the predictions of the trees to assign a classification by majority vote.
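
A from-scratch sketch of the four steps above, using scikit-learn decision trees as the base learners (the dataset, ensemble size, and variable names are illustrative assumptions):

```python
# Manual bagging of decision trees; assumes binary 0/1 labels.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
rng = np.random.default_rng(0)
k, n = 25, X.shape[0]

trees = []
for _ in range(k):
    idx = rng.integers(0, n, size=n)        # step 1: bootstrap sample (with replacement)
    tree = DecisionTreeClassifier()         # step 2: grow a tree on it
    tree.fit(X[idx], y[idx])
    trees.append(tree)                      # step 3: repeat k times

# Step 4: aggregate by majority vote (binary labels assumed).
votes = np.stack([t.predict(X) for t in trees])     # shape (k, n)
majority = (votes.mean(axis=0) >= 0.5).astype(int)
print("training accuracy of the ensemble:", (majority == y).mean())
```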

Random Forest algorithm

1) Randomly choose n samples from the training set with replacement to create a bootstrap sample.
2) Grow a decision tree from the bootstrap sample. At each node:
   1) Randomly select d features without replacement.
   2) Split the node using the feature that gives the best split according to the objective function (e.g., maximizing information gain).
3) Repeat steps 1) and 2) k times.
4) Aggregate the predictions of the trees to assign a classification by majority vote.
What distinguishes the Random Forest algorithm from Bagging

1) Randomly select d features without replacement.

This added step is sometimes called feature bagging.
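
A minimal sketch of this extra step (the feature count d and the variable names are illustrative assumptions):

```python
# At each split, only a random subset of d features (drawn without replacement)
# is evaluated; d is often ~ sqrt(n_features) for classification.
import numpy as np

rng = np.random.default_rng(0)
n_features, d = 16, 4
candidate_features = rng.choice(n_features, size=d, replace=False)
print(candidate_features)   # only these columns are considered for this split
```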

Out-of-bag samples

When choosing a bootstrap sample $\hat S_k$ of size $N$ from $S$, we select sample points with replacement; some points can (and likely will) be repeated in the bootstrap sample, and other points can (and likely will) be omitted.

The probability that a particular sample point is omitted from $\hat S_k$ is

$\Pr(N) = \left( 1 - \frac{1}{N} \right)^{N}.$

$\lim_{N\to\infty} \Pr(N) = \lim_{N\to\infty} \left( 1 - \frac{1}{N} \right)^{N} = e^{-1} \approx 0.3678.$

$N$ does not have to be very large for $\Pr(N)$ to be close to its asymptotic value. For example, $\Pr(5) \approx 0.328$ and $\Pr(50) \approx 0.364$.
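
A quick check of these values (the particular values of N in the loop are arbitrary):

```python
# Pr(N) = (1 - 1/N)**N approaches 1/e quickly as N grows.
import math

for n in (5, 10, 50, 100, 1000):
    print(n, (1 - 1 / n) ** n)
print("1/e =", math.exp(-1))
```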
Out-of-bag samples

$\lim_{N\to\infty} \Pr(N) = \lim_{N\to\infty} \left( 1 - \frac{1}{N} \right)^{N} = e^{-1} \approx 0.3678.$

[Plot: $\Pr(N)$ versus $N$ for $N$ from 2 to about 100, rising from roughly 0.25 toward the horizontal asymptote $1/e \approx 0.368$.]
Out-of-bag sample advantages

 The fact that any given data point $(x_i, y_i)$ is missing from a significant fraction (~1/3) of the bootstrap samples $\hat S_k$ drawn from even a small sample set $S$ creates an interesting opportunity.

 We can train our model on all of the bootstrap samples $\{\hat S_1, \hat S_2, \ldots, \hat S_K\}$ and then test how well the model predicts $y_i$ given $x_i$ using only those bootstrap samples that do not contain $x_i$.

 A data point is said to be out of bag for the bootstrap samples that do not contain it.

 Using out-of-bag samples for testing allows us to use all of our data for training and still have data for testing that has not been used in training.
Training a majority voting classifier using out-of-bag samples

Let $F(x_i) = \mathrm{MajorityVoting}\!\left( F_1(x_i), F_2(x_i), \ldots, F_K(x_i) \right)$.

The training error is given by

$E\!\left[ F(x) \right] = \frac{1}{N}\sum_{i=1}^{N} 1_{F(x_i) \neq y_i}.$

Testing a majority voting classifier using out-of-bag samples

Let $\hat F_{oob}(x_i) = \mathrm{MajorityVoting}\!\left( \hat F_{oob,1}(x_i), \hat F_{oob,2}(x_i), \ldots, \hat F_{oob,K}(x_i) \right)$,

where $\hat F_{oob,k}(x_i) = \begin{cases} F_k(x_i) & \text{if } x_i \notin \hat S_k \\ \{\,\} & \text{if } x_i \in \hat S_k \end{cases}$

The out-of-bag error is given by

$E\!\left[ \hat F_{oob}(x) \right] = \frac{1}{N}\sum_{i=1}^{N} 1_{\hat F_{oob}(x_i) \neq y_i}.$
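
An illustrative sketch of out-of-bag evaluation under these definitions (assuming binary 0/1 labels and a synthetic dataset; this is not the slides' own code):

```python
# Each point is classified only by trees whose bootstrap sample did not contain it.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
rng = np.random.default_rng(0)
K, N = 50, X.shape[0]

votes = np.zeros(N)     # sum of out-of-bag votes for class 1
counts = np.zeros(N)    # number of trees for which each point is out of bag
for _ in range(K):
    idx = rng.integers(0, N, size=N)           # bootstrap sample S_k
    oob = np.setdiff1d(np.arange(N), idx)      # points not in S_k
    tree = DecisionTreeClassifier().fit(X[idx], y[idx])
    votes[oob] += tree.predict(X[oob])
    counts[oob] += 1

has_oob = counts > 0                           # points that were out of bag at least once
oob_pred = (votes[has_oob] / counts[has_oob]) >= 0.5
oob_error = np.mean(oob_pred != y[has_oob])
print("out-of-bag error estimate:", oob_error)
```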

original training data

Sample ID 1 2 3 4 … N
Feature 1 F1(1) F1(2) F1(3) F1(4) … F1(N)
Feature 2 F2(1) F2(2) F2(3) F2(4) … F2(N)
⁞ ⁞ ⁞ ⁞ ⁞ ⁞
Feature M FM(1) FM(2) FM(3) FM(4) … FM(N)
Class ⓪ ❶ ❶ ⓪ … ❶

1 2 3 4 … N
⓪❶❶⓪ … ❶

bootstrap samples drawn from the original training data

[Diagram: the original training data (sample IDs 1 … N with class labels ⓪/❶) is resampled with replacement to form bootstrap samples S1, S2, S3, S4, …, SK. Each Sk lists N sample IDs, some of them repeated; the sample IDs that were not drawn into Sk appear to its right as that sample's omitted (out-of-bag) data.]

training data and resulting trees

[Diagram: each bootstrap sample S1, S2, S3, S4, …, SK is used as training data to grow its own decision tree T1, T2, T3, T4, …, TK.]

classifying a new instance by majority vote

[Diagram: a new instance with Sample ID j, features F1(j), F2(j), …, FM(j), and unknown class is passed to every tree. Each tree Tk returns its own classification Ck (e.g., C1 = ❶, C2 = ⓪, C3 = ❶, C4 = ❶, …, CK = ⓪), and the ensemble assigns the majority-vote class C = ❶.]

random forest majority-vote predictions on the training data

[Diagram: the original training data (Sample ID 1 … N, Features 1 … M, Class) is fed to every tree T1 … TK; the ensemble's majority-vote Predicted Class is compared with the true Class for each sample, giving a per-sample Error of 0 or 1.]

$\mathrm{Error}_{training} = \frac{1}{N}\sum_{i=1}^{N} \mathrm{Error}(i)$

out-of-bag trees and the out-of-bag error

[Diagram: for each bootstrap sample $S_k$, the trees for which a given point is out of bag are listed (e.g., $S_1 \to T_3, \ldots$; $S_K \to T_2, T_4, \ldots$). Each $x_i$ is classified using only the trees whose bootstrap samples do not contain $x_i$, giving $\hat F_{oob,k}(x_i)$.]

$\mathrm{Error}_{oob} = \frac{1}{N}\sum_{i=1}^{N} 1_{\hat F_{oob}(x_i) \neq y_i}$

Scikit-learn Random Forest classifier class
class sklearn.ensemble.RandomForestClassifier (…)

Parameter Default Parameter Default

n_estimators 10 min_impurity_decrease 0.
criterion ‘gini’ bootstrap True
max_features ‘auto’ oob_score False
max_depth None n_jobs 1
min_samples_split 2 random_state None
min_samples_leaf 1 verbose 0
min_weight_fraction_leaf 0. warm_start False
max_leaf_nodes None class_weight None
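
Typical usage of this class with out-of-bag scoring enabled (a hedged example; the Iris dataset and parameter values are assumptions chosen for illustration):

```python
# Fit a random forest and read its out-of-bag score and feature importances.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100, oob_score=True,
                                bootstrap=True, random_state=0)
forest.fit(X, y)
print("OOB score:", forest.oob_score_)
print("feature importances:", forest.feature_importances_)
```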

Scikit-learn Random Forest classifier attributes

Attribute Description
estimators_ The collection of fitted sub-estimators
classes_ The class labels
n_classes_ The number of classes
n_features_ The number of features when fit is performed
n_outputs_ The number of outputs when fit is performed
feature_importances_ The feature importances
oob_score_ Score of the training dataset obtained using an out-of-bag estimate
oob_decision_function_ Decision function computed with out-of-bag estimate on the training set

Scikit-learn Random Forest classifier methods

Method Description
apply(X) Apply trees in the forest to X, return leaf indices.
decision_path(X) Return the decision path in the forest.
fit(X, y[, sample_weight]) Build a forest of trees from the training set (X, y).
get_params([deep]) Get parameters for this estimator.
predict(X) Predict class for X.
predict_log_proba(X) Predict class log-probabilities for X.
predict_proba(X) Predict class probabilities for X.
score(X, y[, sample_weight]) Returns the mean accuracy on the given test data and labels.
set_params(**params) Set the parameters of this estimator.

Scikit-learn Bagging Classifier class
class sklearn.ensemble.BaggingClassifier (…)

Parameter Default Parameter Default

base_estimator None oob_score False
n_estimators 10 warm_start False
max_samples 1.0 n_jobs 1
max_features 1.0 random_state None
bootstrap True verbose 0
bootstrap_features False
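
A comparable sketch for BaggingClassifier (the base estimator, dataset, and parameter values are illustrative assumptions):

```python
# Bag decision trees and read the out-of-bag score.
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                            bootstrap=True, oob_score=True, random_state=0)
bagging.fit(X, y)
print("OOB score:", bagging.oob_score_)
```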

Scikit-learn Bagging Classifier attributes

Attribute Description
base_estimator_ The base estimator from which the ensemble is grown.
estimators_ List of estimators
estimators_samples_ The subset of drawn samples (i.e., the in-bag samples) for each base estimator.
estimators_features_ The subset of drawn features for each base estimator.
classes_ The class labels
n_classes_ The number of classes
oob_score_ Score of the training dataset obtained using an out-of-bag estimate
oob_decision_function_ Decision function computed with out-of-bag estimate on the training set.

Scikit-learn Bagging Classifier methods

Method Description
decision_function(X) Average of the decision functions of the base classifiers.
fit(X, y[, sample_weight]) Build a Bagging ensemble of estimators from the training set (X, y).
get_params([deep]) Get parameters for this estimator.
predict(X) Predict class for X.
predict_log_proba(X) Predict class log-probabilities for X.
predict_proba(X) Predict class probabilities for X.
score(X, y[, sample_weight]) Returns the mean accuracy on the given test data and labels.
set_params(**params) Set the parameters of this estimator.

