
Random Forest

Basic steps – Classification algorithms

• Profiling
• Differentiation
• Classification

Should I invest in a company – ask the experts

• Employee of XYZ – knows the company's internal functionality, but lacks a broader perspective on competitors.
• Financial Advisor of XYZ – has a perspective on companies versus the competition, but lacks sight of the company's internal workings.
• Stock Market Trader – has observed the company's stock price over the past 3 years and knows the seasonality trends, but lacks insider information.
• Employee of a competitor – knows the internal functionality of the competitor firms, but is unaware of the changes XYZ will bring.
• Market Research team – analyzes customer preference for XYZ's products, but lacks a view on internal policies and market performance.
• Social Media Expert – understands product positioning and changes in customer sentiment over time, but is unaware of details beyond digital marketing.

Individually, each of them has been right only about 60–75% of the time.

Scenario 1 – Combine all the info for an informed decision

• All the 6 experts/teams verify that it is a good decision.
• Assumption – the decisions are not correlated; all the predictions are independent of each other.
• Each person thinks from a different perspective to take the decision and is not influenced by the others.
• The combined accuracy rate improves when we take a vote (the complementing principle), as the sketch below illustrates.
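To make the voting argument concrete, here is a minimal sketch (plain Python, no external libraries) that computes the probability that a majority of independent experts is correct, using binomial probabilities. The 70% per-expert accuracy is an illustrative figure within the 60–75% range above, not a number from the slide.

```python
from math import comb

def majority_vote_accuracy(n_experts, p_correct):
    """Probability that a majority of independent experts is correct.
    Ties (possible when n_experts is even) are broken by a fair coin flip."""
    total = 0.0
    for k in range(n_experts + 1):
        p_k = comb(n_experts, k) * p_correct**k * (1 - p_correct)**(n_experts - k)
        if 2 * k > n_experts:        # clear majority is correct
            total += p_k
        elif 2 * k == n_experts:     # tie: counted as correct half the time
            total += 0.5 * p_k
    return total

# Six independent experts, each right ~70% of the time
print(majority_vote_accuracy(6, 0.70))   # ~0.84, better than any single expert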

Scenario 2 – info from similar sources
• 6 experts, all of them employees of XYZ working in the same division.
• Everyone has a propensity of 70% to advocate correctly.
• What happens if we combine their advice into a single prediction based on voting?
• All the predictions are based on a very similar set of information.
• Because the predictions are so similar, voting no longer improves accuracy – it stays close to the individual 70% (and can even go down). The small simulation below illustrates this.
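A small Monte Carlo sketch of this scenario, assuming NumPy is available. The 70% base accuracy and the 95% agreement rate are illustrative choices used to model six near-identical opinions, not figures from the slide.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials = 100_000

# One shared source of information: it points to the right answer 70% of the time
shared_call_correct = rng.random(n_trials) < 0.70

# Six "experts" who echo the shared call 95% of the time, so their votes
# are highly correlated; True means the expert's answer is correct
votes = np.array([
    np.where(rng.random(n_trials) < 0.95, shared_call_correct, ~shared_call_correct)
    for _ in range(6)
])

majority_correct = votes.sum(axis=0) >= 4   # at least 4 of the 6 votes are correct
print(majority_correct.mean())              # stays close to 0.70 – voting barely helps
```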

Ensemble learning
• Machine learning technique that combines several base models in order to produce one optimal predictive model.
• The base models are typically weak classifiers.
• Each classifier can use a different set of variables.
• Their outputs are combined into a single prediction (see the sketch below).
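As one concrete, hedged illustration of combining base models into a single prediction, a voting ensemble can be built with scikit-learn's VotingClassifier. The choice of base models and the synthetic dataset below are arbitrary, not something prescribed by the slide.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Three different base models, combined into a single prediction by majority vote
ensemble = VotingClassifier(
    estimators=[
        ("logreg", LogisticRegression(max_iter=1000)),
        ("tree", DecisionTreeClassifier(max_depth=3, random_state=42)),
        ("nb", GaussianNB()),
    ],
    voting="hard",
)
ensemble.fit(X_train, y_train)
print(ensemble.score(X_test, y_test))
```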

What is a bootstrapped dataset?
Original dataset:

Sno  X1   X2  Y
1    432  29  Yes
2    529  34  Yes
3    125  67  No
4    144  29  No

Randomly sampling rows with replacement gives bootstrapped datasets such as:

Bootstrapped dataset 1:
Sno  X1   X2  Y
4    144  29  No
2    529  34  Yes
3    125  67  No

Bootstrapped dataset 2:
Sno  X1   X2  Y
3    125  67  No
4    144  29  No
4    144  29  No

Bootstrapped dataset 3:
Sno  X1   X2  Y
3    125  67  No
2    529  34  Yes
3    125  67  No
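A minimal sketch of the same idea with pandas (assumed available): build the toy table above and draw a bootstrapped sample with replacement. Note that the standard bootstrap draws as many rows as the original table, whereas the slide shows three-row samples for brevity.

```python
import pandas as pd

# The toy dataset from the slide
data = pd.DataFrame({
    "Sno": [1, 2, 3, 4],
    "X1":  [432, 529, 125, 144],
    "X2":  [29, 34, 67, 29],
    "Y":   ["Yes", "Yes", "No", "No"],
})

# A bootstrapped dataset: sample rows with replacement, so some rows
# can appear more than once and others not at all
bootstrap = data.sample(n=len(data), replace=True, random_state=1)
print(bootstrap)
```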

Using a random set of variables every time
Original dataset:

Sno  X1   X2  X3   X4  Y
1    432  29  313  6   Yes
2    529  34  379  2   Yes
3    125  67  317  4   No
4    144  29  103  8   No

Randomly sampling rows with replacement and taking a random subset of the X variables each time gives, for example:

Sample 1 (X1, X2):
Sno  X1   X2  Y
4    144  29  No
2    529  34  Yes
3    125  67  No

Sample 2 (X3, X4):
Sno  X3   X4  Y
3    317  4   No
4    103  8   No
4    103  8   No

Sample 3 (X1, X3):
Sno  X1   X3   Y
3    125  317  No
2    529  379  Yes
3    125  317  No
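A hedged sketch of the combined idea with pandas/NumPy: bootstrap the rows and keep only a random subset of the X columns (here 2 of the 4), mirroring the X1–X4 example above. The column count of 2 is an illustrative choice.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

data = pd.DataFrame({
    "Sno": [1, 2, 3, 4],
    "X1":  [432, 529, 125, 144],
    "X2":  [29, 34, 67, 29],
    "X3":  [313, 379, 317, 103],
    "X4":  [6, 2, 4, 8],
    "Y":   ["Yes", "Yes", "No", "No"],
})

# Bootstrap the rows (sample with replacement)
sample = data.sample(n=len(data), replace=True, random_state=7)

# Keep only a random subset of the predictors (here 2 of the 4 X columns)
x_cols = ["X1", "X2", "X3", "X4"]
chosen = list(rng.choice(x_cols, size=2, replace=False))
sample = sample[["Sno"] + chosen + ["Y"]]
print(sample)
```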

Basic idea of random forest

• Draw multiple random samples, with replacement, from the data (this sampling approach is called the bootstrap).
• Using a random subset of predictors at each stage, fit a classification (or regression) tree to each sample (and thus obtain a “forest”).
• Combine the predictions/classifications from the individual trees to obtain improved predictions.
• Use voting for classification and averaging for regression.
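For reference, scikit-learn's RandomForestClassifier packages this recipe (bootstrap samples and random predictor subsets per split are its defaults; it combines trees by averaging their predicted probabilities rather than a hard vote). A minimal usage sketch on a synthetic dataset, which is an arbitrary choice for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Bootstrap samples + random predictor subsets per split are the defaults;
# the forest combines its trees to produce one classification per row
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))
```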

Steps in random forest algorithm

• Step 1 – Create a bootstrapped dataset.
• Step 2 – Create a decision tree using the bootstrapped dataset, but only consider a random subset of variables at each step.
• Step 3 – Repeat the same and create multiple trees.
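A from-scratch sketch of these three steps, assuming NumPy and scikit-learn's DecisionTreeClassifier are available; the tree count, dataset, and variable names are my own illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
rng = np.random.default_rng(0)

trees = []
for _ in range(100):                                    # Step 3: repeat to grow many trees
    idx = rng.integers(0, len(X), size=len(X))          # Step 1: bootstrapped dataset
    tree = DecisionTreeClassifier(max_features="sqrt",  # Step 2: random subset of variables
                                  random_state=0)       #         considered at each split
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Forest prediction = majority vote over the individual trees
# (evaluated on the training data here, purely to illustrate the voting step)
votes = np.stack([t.predict(X) for t in trees])
forest_pred = (votes.mean(axis=0) >= 0.5).astype(int)
print((forest_pred == y).mean())
```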

Out of bag data points
• When we create a bootstrapped dataset, ~1/3 of the original data does not end up in the bootstrapped dataset.
• This is called the out-of-bag dataset.

(The slide repeats the example tables from the previous slides: rows that never appear in a given bootstrapped sample – such as Sno 1 above – are out of bag for the tree built on that sample.)
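A quick sketch (NumPy assumed) of how the out-of-bag rows for one bootstrap draw can be identified; for a large dataset, roughly 1/3 of the rows (about 36.8% in expectation) are left out.

```python
import numpy as np

rng = np.random.default_rng(42)
n_rows = 1000

# Draw one bootstrap sample of row indices (with replacement)
bootstrap_idx = rng.integers(0, n_rows, size=n_rows)

# Rows never picked are "out of bag" for the tree built on this sample
oob_rows = np.setdiff1d(np.arange(n_rows), bootstrap_idx)
print(len(oob_rows) / n_rows)   # roughly 1/3 (about 0.368 in expectation)
```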

How to calculate accuracy
• The OOB samples are used to measure how accurate our random forest is
• Accuracy is the ratio of out-of-bag samples correctly classified by the random forest model
• The proportion of OOB samples incorrectly classified is the out-of-bag error
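In scikit-learn this is exposed via oob_score=True: the fitted attribute oob_score_ is the OOB accuracy, so 1 - oob_score_ is the out-of-bag error. A hedged sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=800, n_features=12, random_state=1)

forest = RandomForestClassifier(n_estimators=300, oob_score=True, random_state=1)
forest.fit(X, y)

print("OOB accuracy:", forest.oob_score_)       # share of OOB samples classified correctly
print("OOB error   :", 1 - forest.oob_score_)   # proportion misclassified
```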

How to decide on how many variables to use per step?

• Compare the OOB error when using 2 variables per step, 3 variables, and so on
• Choose the number of variables that gives the most accurate forest (lowest OOB error)
• Typically we start with the square root of the number of variables
• Then try a few settings above and below that value (see the sketch below)
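A sketch of this tuning loop with scikit-learn, where max_features plays the role of "variables per step" (mtry in R's randomForest). The synthetic dataset is an illustrative assumption with 16 predictors, so the square root is 4 and we try a few values around it.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=800, n_features=16, random_state=2)

# sqrt(16) = 4, so try a few settings around it
for m in [2, 3, 4, 5, 6]:
    forest = RandomForestClassifier(n_estimators=300, max_features=m,
                                    oob_score=True, random_state=2)
    forest.fit(X, y)
    print(f"max_features={m}: OOB error = {1 - forest.oob_score_:.3f}")
```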

Summary of Random forest

• A random forest consists of a large number of individual decision trees that operate as an ensemble.
• Each tree in the random forest spits out a class prediction; the class with the most votes becomes the model's prediction.
• The fundamental concept is the wisdom of crowds: a large number of relatively uncorrelated models (trees) operating as a committee will outperform any of the individual models.

Overall flow of the RF classification process
1. Read the csv file
2. Feature engineering – convert relevant variables to categorical
3. Find the baseline Y class % to check class imbalance
4. EDA – univariate: boxplot for numeric variables, barplot for categorical variables
5. EDA – bivariate: boxplot for numeric X vs categorical Y, stacked bar for categorical X vs Y
6. Split into training and test sets
7. Build a random forest model
8. Tune ntree & mtry
9. Predict for train & test
10. Model performance – accuracy, sensitivity, specificity, AUC
11. Variable importance
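Below is one possible end-to-end sketch of this flow in Python/scikit-learn. The file name data.csv, the target column Y, and "Yes" as the positive class are placeholders; ntree and mtry are parameter names from R's randomForest package, corresponding here to n_estimators and max_features; the EDA plots are omitted.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# 1. Read the data (file name, target column, and positive label are placeholders)
df = pd.read_csv("data.csv")
y = (df["Y"] == "Yes").astype(int)
X = pd.get_dummies(df.drop(columns=["Y"]))        # feature engineering: encode categoricals

print(y.value_counts(normalize=True))             # baseline class % / class-imbalance check

# 2. Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# 3. Build the forest and tune ntree/mtry (n_estimators / max_features here)
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [200, 500], "max_features": ["sqrt", "log2", None]},
    cv=5)
grid.fit(X_train, y_train)
forest = grid.best_estimator_

# 4. Predict and evaluate: accuracy, sensitivity, specificity, AUC
pred = forest.predict(X_test)
tn, fp, fn, tp = confusion_matrix(y_test, pred).ravel()
print("accuracy   :", (tp + tn) / (tp + tn + fp + fn))
print("sensitivity:", tp / (tp + fn))
print("specificity:", tn / (tn + fp))
print("AUC        :", roc_auc_score(y_test, forest.predict_proba(X_test)[:, 1]))

# 5. Variable importance
print(pd.Series(forest.feature_importances_, index=X.columns).sort_values(ascending=False))
```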

