
Bootstrapping and Random Forests

The City College of New York


CSc 59929 – Introduction to Machine Learning
Spring 2020 – Erik K. Grimmelmann, Ph.D.
Bootstrapping

• Bootstrapping is any test or metric that relies on random sampling with replacement.
• Bootstrapping allows assigning measures of accuracy (defined in terms of bias, variance, confidence intervals, prediction error, or some other such measure) to sample estimates.

Bootstrapping

• Bootstrapping can be used to generate multiple training datasets from a single sample dataset.
• If a set of observations can be assumed to come from an I.I.D. population, an approximating distribution of the population can be built by constructing a number of resamples (with replacement) of the observed dataset, each of equal size to the observed dataset.
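
As a small illustration (not from the slides; the data, sample size, and number of resamples are arbitrary assumptions), the following sketch draws bootstrap resamples with NumPy and uses their spread to estimate the standard error of the sample mean.

```python
# A minimal bootstrap sketch; the dataset and sizes are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=100)   # observed sample of size N

K = 1000                                          # number of bootstrap resamples
N = data.shape[0]
boot_means = np.empty(K)
for k in range(K):
    # Resample with replacement, keeping the same size as the observed dataset.
    resample = rng.choice(data, size=N, replace=True)
    boot_means[k] = resample.mean()

# The spread of the bootstrap means approximates the sampling variability
# of the sample mean.
print("sample mean:", data.mean())
print("bootstrap estimate of the standard error:", boot_means.std(ddof=1))
```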

Expected value of a function of a random variable

The expected value of a function F of a random variable Y is denoted by E(F(Y)), where

$E\big(F(Y)\big) = \sum_{\text{all } i} F(y_i)\,\Pr(Y = y_i)$

or

$E\big(F(Y)\big) = \int_{-\infty}^{+\infty} F(y)\,p(y)\,dy$

Variance of a random variable

The variance of a random variable Y is denoted by Var(Y), where

$\mathrm{Var}(Y) = E\big[(Y - \mu_Y)^2\big]$

The variance is a measure of the width or dispersion of the distribution about its mean.

Variance of a random variable

Note that

$\begin{aligned}
\mathrm{Var}(Y) &= E\big[(Y - \mu_Y)^2\big] \\
&= E\big[Y^2 - 2\mu_Y Y + \mu_Y^2\big] \\
&= E\big[Y^2\big] - 2E[\mu_Y Y] + E\big[\mu_Y^2\big] \\
&= E\big[Y^2\big] - 2\mu_Y E[Y] + \mu_Y^2\,E[1] \\
&= E\big[Y^2\big] - 2\mu_Y^2 + \mu_Y^2 \\
&= E\big[Y^2\big] - \mu_Y^2 \\
&= E\big[Y^2\big] - \big(E[Y]\big)^2
\end{aligned}$

Standard deviation of a random variable

The standard deviation of a random variable Y is denoted by $\sigma_Y$, where

$\sigma_Y = \sqrt{\mathrm{Var}(Y)}$

The standard deviation is a measure of the width or dispersion of the distribution about its mean.

Covariance of two random variables

The covariance of two random variables, X and Y, is denoted by Cov(X, Y), where

$\mathrm{Cov}(X, Y) = E\big[(X - \mu_X)(Y - \mu_Y)\big]$

Covariance is a measure of the joint variability of two random variables.

Covariance of two random variables

Remembering that $\mu_X = E[X]$,

$\begin{aligned}
\mathrm{Cov}(X, Y) &= E\big[(X - \mu_X)(Y - \mu_Y)\big] \\
&= E\big[XY - \mu_X Y - \mu_Y X + \mu_X \mu_Y\big] \\
&= E[XY] - E[\mu_X Y] - E[\mu_Y X] + E[\mu_X \mu_Y] \\
&= E[XY] - \mu_X E[Y] - \mu_Y E[X] + \mu_X \mu_Y \\
&= E[XY] - \mu_X \mu_Y - \mu_Y \mu_X + \mu_X \mu_Y \\
&= E[XY] - \mu_X \mu_Y
\end{aligned}$

Correlation coefficient between two random variables

The correlation coefficient between two random variables, X and Y, is denoted by $\rho_{X,Y}$, where

$\rho_{X,Y} = \mathrm{Corr}(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{E[XY] - \mu_X \mu_Y}{\sigma_X \sigma_Y}$
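
A quick numerical check of the identities above (illustrative only; the synthetic data and its construction are assumptions, not from the slides):

```python
# Verify the variance, covariance, and correlation identities on synthetic data.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=10_000)
y = 0.6 * x + 0.8 * rng.normal(size=10_000)   # correlated with x by construction

mu_x, mu_y = x.mean(), y.mean()
var_y = np.mean((y - mu_y) ** 2)              # Var(Y) = E[(Y - mu_Y)^2]
var_y_alt = np.mean(y ** 2) - mu_y ** 2       # equals E[Y^2] - (E[Y])^2
cov_xy = np.mean((x - mu_x) * (y - mu_y))     # Cov(X, Y)
cov_xy_alt = np.mean(x * y) - mu_x * mu_y     # equals E[XY] - mu_X * mu_Y
rho = cov_xy / (x.std() * y.std())            # correlation coefficient

print(var_y, var_y_alt)                       # the two variance forms agree
print(cov_xy, cov_xy_alt)                     # the two covariance forms agree
print(rho, np.corrcoef(x, y)[0, 1])           # matches NumPy's own estimate
```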

Why bootstrapping works

Let $S = \{s_1, s_2, \ldots, s_N\}$ be a sample that is assumed to be I.I.D.

Let $\hat S = \{\hat S_1, \hat S_2, \ldots, \hat S_K\}$, where each $\hat S_k$ is a bootstrap sample that contains N elements drawn from S with replacement.

The elements of each $\hat S_k$ are denoted by $\{\hat s_{k1}, \hat s_{k2}, \ldots, \hat s_{kN}\}$.

Why bootstrapping works

Let $F(s_i)$ be a function defined on $S$.

Let $\hat F(\hat S_k) = \left\{ F(\hat s_{k1}), F(\hat s_{k2}), \ldots, F(\hat s_{kN}) \right\}$.

Let $\hat\mu_F = \frac{1}{K}\sum_{k=1}^{K} F(\hat S_k)$.

Let $\hat\sigma_F^2 = \frac{1}{K}\sum_{k=1}^{K} \mathrm{Var}\!\left( F(\hat S_k) \right)$.

Let $\hat\rho_F = \dfrac{\dfrac{1}{K^2}\displaystyle\sum_{j=1}^{K}\sum_{k=1}^{K} \mathrm{Cov}\!\left( F(\hat S_j), F(\hat S_k) \right)}{\dfrac{1}{K}\displaystyle\sum_{k=1}^{K} \mathrm{Var}\!\left( F(\hat S_k) \right)} = \dfrac{\dfrac{1}{K^2}\displaystyle\sum_{j=1}^{K}\sum_{k=1}^{K} \mathrm{Cov}\!\left( F(\hat S_j), F(\hat S_k) \right)}{\hat\sigma_F^2}$.

Variance of $F(\hat S)$

$\begin{aligned}
\mathrm{Var}\!\left[ \frac{1}{K}\sum_{i=1}^{K} F(\hat S_i) \right]
&= \frac{1}{K^2}\sum_{i=1}^{K}\sum_{j=1}^{K} \mathrm{Cov}\!\left( F(\hat S_i), F(\hat S_j) \right) \\
&= \frac{1}{K^2}\sum_{i=1}^{K}\left[ \mathrm{Cov}\!\left( F(\hat S_i), F(\hat S_i) \right) + \sum_{j\neq i} \mathrm{Cov}\!\left( F(\hat S_i), F(\hat S_j) \right) \right] \\
&= \frac{1}{K^2}\sum_{i=1}^{K}\left[ \mathrm{Var}\!\left( F(\hat S_i) \right) + \sum_{j\neq i} \mathrm{Cov}\!\left( F(\hat S_i), F(\hat S_j) \right) \right] \\
&= \frac{1}{K^2}\sum_{i=1}^{K}\left[ \hat\sigma^2 + (K-1)\,\hat\sigma^2\hat\rho \right] \\
&= \frac{1}{K^2}\left[ K\hat\sigma^2 + K(K-1)\,\hat\sigma^2\hat\rho \right] \\
&= \frac{\hat\sigma^2}{K} + \frac{K-1}{K}\,\hat\sigma^2\hat\rho
\end{aligned}$
Variance of $F(\hat S)$

$\begin{aligned}
\mathrm{Var}\!\left[ \frac{1}{K}\sum_{i=1}^{K} F(\hat S_i) \right]
&= \frac{\hat\sigma^2}{K} + \frac{K-1}{K}\,\hat\sigma^2\hat\rho \\
&= \frac{\hat\sigma^2}{K} + \hat\sigma^2\hat\rho - \frac{1}{K}\,\hat\sigma^2\hat\rho \\
&= \hat\sigma^2\hat\rho + \frac{1}{K}\,\hat\sigma^2\left( 1 - \hat\rho \right)
\end{aligned}$

The first term decreases if $\hat\rho$ decreases; the second term decreases as $K$ increases, regardless of $\hat\rho$.
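
A rough simulation of this variance formula (a sketch under the assumption that the K estimates are equally correlated Gaussian variables; nothing here comes from the slides):

```python
# Average K equally correlated estimates and compare the empirical variance of
# the average with rho*sigma^2 + (1/K)*sigma^2*(1 - rho).
import numpy as np

rng = np.random.default_rng(2)
K, sigma2, rho, trials = 20, 1.0, 0.3, 50_000

# Build K correlated variables via a shared component plus independent noise,
# so that each has variance sigma2 and each pair has correlation rho.
shared = rng.normal(scale=np.sqrt(rho * sigma2), size=(trials, 1))
noise = rng.normal(scale=np.sqrt((1 - rho) * sigma2), size=(trials, K))
estimates = shared + noise

avg = estimates.mean(axis=1)
print("simulated Var of the average:", avg.var())
print("formula rho*s2 + s2*(1-rho)/K:", rho * sigma2 + sigma2 * (1 - rho) / K)
```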

Tree Bagging algorithm

1) Randomly choose n samples from the training set with replacement to create a bootstrap sample.
2) Grow a decision tree from the bootstrap sample. At each node:
   1) Split the node using the feature that gives the best split according to the objective function (e.g., maximizing information gain).
3) Repeat steps 1) and 2) k times.
4) Aggregate the predictions of the trees to assign a classification by majority vote.
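
A from-scratch sketch of the four steps above, using scikit-learn decision trees as the base learners (the dataset, ensemble size, and variable names are illustrative assumptions):

```python
# Manual bagging of decision trees; assumes binary 0/1 labels.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
rng = np.random.default_rng(0)
k, n = 25, X.shape[0]

trees = []
for _ in range(k):
    idx = rng.integers(0, n, size=n)        # step 1: bootstrap sample (with replacement)
    tree = DecisionTreeClassifier()         # step 2: grow a tree on it
    tree.fit(X[idx], y[idx])
    trees.append(tree)                      # step 3: repeat k times

# Step 4: aggregate by majority vote (binary labels assumed).
votes = np.stack([t.predict(X) for t in trees])     # shape (k, n)
majority = (votes.mean(axis=0) >= 0.5).astype(int)
print("training accuracy of the ensemble:", (majority == y).mean())
```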

Random Forest algorithm

1) Randomly choose n samples from the training set with replacement to create a bootstrap sample.
2) Grow a decision tree from the bootstrap sample. At each node:
   1) Randomly select d features without replacement.
   2) Split the node using the feature that gives the best split according to the objective function (e.g., maximizing information gain).
3) Repeat steps 1) and 2) k times.
4) Aggregate the predictions of the trees to assign a classification by majority vote.
What distinguishes the Random Forest algorithm from Bagging

1) Randomly select d features without replacement.

This added step is sometimes called feature bagging.
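
A minimal sketch of this extra step (the feature count d and the variable names are illustrative assumptions):

```python
# At each split, only a random subset of d features (drawn without replacement)
# is evaluated; d is often ~ sqrt(n_features) for classification.
import numpy as np

rng = np.random.default_rng(0)
n_features, d = 16, 4
candidate_features = rng.choice(n_features, size=d, replace=False)
print(candidate_features)   # only these columns are considered for this split
```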

Out-of-bag samples

When choosing a bootstrap sample $\hat S_k$ of size $N$ from $S$, we select sample points with replacement; some points can (and likely will) be repeated in the bootstrap sample, and other points can (and likely will) be omitted.

The probability that a particular sample point is omitted from $\hat S_k$ is

$\Pr(N) = \left( 1 - \frac{1}{N} \right)^{N}.$

$\lim_{N\to\infty} \Pr(N) = \lim_{N\to\infty} \left( 1 - \frac{1}{N} \right)^{N} = e^{-1} \approx 0.3678.$

$N$ does not have to be very large for $\Pr(N)$ to be close to its asymptotic value. For example, $\Pr(5) \approx 0.328$ and $\Pr(50) \approx 0.364$.
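
A quick check of these values (the particular values of N in the loop are arbitrary):

```python
# Pr(N) = (1 - 1/N)**N approaches 1/e quickly as N grows.
import math

for n in (5, 10, 50, 100, 1000):
    print(n, (1 - 1 / n) ** n)
print("1/e =", math.exp(-1))
```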
Out-of-bag samples

$\lim_{N\to\infty} \Pr(N) = \lim_{N\to\infty} \left( 1 - \frac{1}{N} \right)^{N} = e^{-1} \approx 0.3678.$

[Plot: $\Pr(N)$ versus $N$ for $N$ from 2 to about 100, rising from roughly 0.25 toward the horizontal asymptote $1/e \approx 0.368$.]
Out-of-bag sample advantages

 The fact that any given data point $(x_i, y_i)$ is missing from a significant fraction (~1/3) of the bootstrap samples $\hat S_k$ drawn from even a small sample set $S$ creates an interesting opportunity.

 We can train our model on all of the bootstrap samples $\{\hat S_1, \hat S_2, \ldots, \hat S_K\}$ and then test how well the model predicts $y_i$ given $x_i$ using only those bootstrap samples that do not contain $x_i$.

 A data point is said to be out of bag for the bootstrap samples that do not contain it.

 Using out-of-bag samples for testing allows us to use all of our data for training and still have data for testing that has not been used in training.
Training a majority voting classifier using out-of-bag samples

Let $F(x_i) = \mathrm{MajorityVoting}\!\left( F_1(x_i), F_2(x_i), \ldots, F_K(x_i) \right)$.

The training error is given by

$E\!\left[ F(x) \right] = \frac{1}{N}\sum_{i=1}^{N} 1_{F(x_i) \neq y_i}.$

Testing a majority voting classifier using out-of-bag samples

Let $\hat F_{oob}(x_i) = \mathrm{MajorityVoting}\!\left( \hat F_{oob,1}(x_i), \hat F_{oob,2}(x_i), \ldots, \hat F_{oob,K}(x_i) \right)$,

where $\hat F_{oob,k}(x_i) = \begin{cases} F_k(x_i) & \text{if } x_i \notin \hat S_k \\ \{\,\} & \text{if } x_i \in \hat S_k \end{cases}$

The out-of-bag error is given by

$E\!\left[ \hat F_{oob}(x) \right] = \frac{1}{N}\sum_{i=1}^{N} 1_{\hat F_{oob}(x_i) \neq y_i}.$
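
An illustrative sketch of out-of-bag evaluation under these definitions (assuming binary 0/1 labels and a synthetic dataset; this is not the slides' own code):

```python
# Each point is classified only by trees whose bootstrap sample did not contain it.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
rng = np.random.default_rng(0)
K, N = 50, X.shape[0]

votes = np.zeros(N)     # sum of out-of-bag votes for class 1
counts = np.zeros(N)    # number of trees for which each point is out of bag
for _ in range(K):
    idx = rng.integers(0, N, size=N)           # bootstrap sample S_k
    oob = np.setdiff1d(np.arange(N), idx)      # points not in S_k
    tree = DecisionTreeClassifier().fit(X[idx], y[idx])
    votes[oob] += tree.predict(X[oob])
    counts[oob] += 1

has_oob = counts > 0                           # points that were out of bag at least once
oob_pred = (votes[has_oob] / counts[has_oob]) >= 0.5
oob_error = np.mean(oob_pred != y[has_oob])
print("out-of-bag error estimate:", oob_error)
```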

original training data

Sample ID 1 2 3 4 … N
Feature 1 F1(1) F1(2) F1(3) F1(4) … F1(N)
Feature 2 F2(1) F2(2) F2(3) F2(4) … F2(N)
⁞ ⁞ ⁞ ⁞ ⁞ ⁞
Feature M FM(1) FM(2) FM(3) FM(4) … FM(N)
Class ⓪ ❶ ❶ ⓪ … ❶

1 2 3 4 … N
⓪❶❶⓪ … ❶

bootstrap samples drawn from the original training data

[Diagram: the original training data (sample IDs 1 … N with class labels ⓪/❶) is resampled with replacement to form bootstrap samples S1, S2, S3, S4, …, SK. Each Sk lists N sample IDs, some of them repeated; the sample IDs that were not drawn into Sk appear to its right as that sample's omitted (out-of-bag) data.]

training data and resulting trees

[Diagram: each bootstrap sample S1, S2, S3, S4, …, SK is used as training data to grow its own decision tree T1, T2, T3, T4, …, TK.]

classifying a new instance by majority vote

[Diagram: a new instance with Sample ID j, features F1(j), F2(j), …, FM(j), and unknown class is passed to every tree. Each tree Tk returns its own classification Ck (e.g., C1 = ❶, C2 = ⓪, C3 = ❶, C4 = ❶, …, CK = ⓪), and the ensemble assigns the majority-vote class C = ❶.]

random forest majority-vote predictions on the training data

[Diagram: the original training data (Sample ID 1 … N, Features 1 … M, Class) is fed to every tree T1 … TK; the ensemble's majority-vote Predicted Class is compared with the true Class for each sample, giving a per-sample Error of 0 or 1.]

$\mathrm{Error}_{training} = \frac{1}{N}\sum_{i=1}^{N} \mathrm{Error}(i)$

out-of-bag trees and the out-of-bag error

[Diagram: for each bootstrap sample $S_k$, the trees for which a given point is out of bag are listed (e.g., $S_1 \to T_3, \ldots$; $S_K \to T_2, T_4, \ldots$). Each $x_i$ is classified using only the trees whose bootstrap samples do not contain $x_i$, giving $\hat F_{oob,k}(x_i)$.]

$\mathrm{Error}_{oob} = \frac{1}{N}\sum_{i=1}^{N} 1_{\hat F_{oob}(x_i) \neq y_i}$

Scikit-learn Random Forest classifier class
class sklearn.ensemble.RandomForestClassifier (…)

Parameter Default Parameter Default

n_estimators 10 min_impurity_decrease 0.
criterion ‘gini’ bootstrap True
max_features ‘auto’ oob_score False
max_depth None n_jobs 1
min_samples_split 2 random_state None
min_samples_leaf 1 verbose 0
min_weight_fraction_leaf 0. warm_start False
max_leaf_nodes None class_weight None
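
Typical usage of this class with out-of-bag scoring enabled (a hedged example; the Iris dataset and parameter values are assumptions chosen for illustration):

```python
# Fit a random forest and read its out-of-bag score and feature importances.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100, oob_score=True,
                                bootstrap=True, random_state=0)
forest.fit(X, y)
print("OOB score:", forest.oob_score_)
print("feature importances:", forest.feature_importances_)
```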

Scikit-learn Random Forest classifier attributes

Attribute Description
estimators_ The collection of fitted sub-estimators
classes_ The class labels
n_classes_ The number of classes
n_features_ The number of features when fit is performed
n_outputs_ The number of outputs when fit is performed
feature_importances_ The feature importances
oob_score_ Score of the training dataset obtained using an out-of-bag estimate
oob_decision_function_ Decision function computed with out-of-bag estimate on the training set

Scikit-learn Random Forest classifier methods

Method Description
apply(X) Apply trees in the forest to X, return leaf indices.
decision_path(X) Return the decision path in the forest.
fit(X, y[, sample_weight]) Build a forest of trees from the training set (X, y).
get_params([deep]) Get parameters for this estimator.
predict(X) Predict class for X.
predict_log_proba(X) Predict class log-probabilities for X.
predict_proba(X) Predict class probabilities for X.
score(X, y[, sample_weight]) Returns the mean accuracy on the given test data and labels.
set_params(**params) Set the parameters of this estimator.

Scikit-learn Bagging Classifier class
class sklearn.ensemble.BaggingClassifier (…)

Parameter Default Parameter Default

base_estimator None oob_score False
n_estimators 10 warm_start False
max_samples 1.0 n_jobs 1
max_features 1.0 random_state None
bootstrap True verbose 0
bootstrap_features False
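
A comparable sketch for BaggingClassifier (the base estimator, dataset, and parameter values are illustrative assumptions):

```python
# Bag decision trees and read the out-of-bag score.
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                            bootstrap=True, oob_score=True, random_state=0)
bagging.fit(X, y)
print("OOB score:", bagging.oob_score_)
```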

Scikit-learn Bagging Classifier attributes

Attribute Description
base_estimator_ The base estimator from which the ensemble is grown.
estimators_ List of estimators
estimators_samples_ The subset of drawn samples (i.e., the in-bag samples) for each base estimator.
estimators_features_ The subset of drawn features for each base estimator.
classes_ The class labels
n_classes_ The number of classes
oob_score_ Score of the training dataset obtained using an out-of-bag estimate
oob_decision_function_ Decision function computed with out-of-bag estimate on the training set.

Scikit-learn Bagging Classifier methods

Method Description
decision_function(X) Average of the decision functions of the base classifiers.
fit(X, y[, sample_weight]) Build a Bagging ensemble of estimators from the training set (X, y).
get_params([deep]) Get parameters for this estimator.
predict(X) Predict class for X.
predict_log_proba(X) Predict class log-probabilities for X.
predict_proba(X) Predict class probabilities for X.
score(X, y[, sample_weight]) Returns the mean accuracy on the given test data and labels.
set_params(**params) Set the parameters of this estimator.

