Lecture-4-Decision Tree and Random Forest


Machine Learning
Lecture – 4
Decision Tree

Machine Learning
by Tom M. Mitchell

Muhammad Affan Alim

Outline
• Introduction
• Example
• ID3 Heuristic algorithm
• Principles
• Entropy
• Information gain
• Evaluations


Decision Tree- Introduction


• A decision tree is a tree-structured classifier and regressor

• It has two types of nodes


• decision nodes
• leaf nodes (prediction nodes)

Decision Tree- Introduction


• A decision tree is a non-linear function or model
• The tree consists of nodes and edges, and some of the nodes are leaves
• A decision tree is a classifier in the form of a tree which has
two types of nodes
- Decision nodes
- Leaf nodes
• It can be used for both classification and regression but
classification is more popular


Decision Tree- Introduction diagram

Decision Tree- Introduction


• Prefer the smaller tree: low depth and a small number of nodes


Decision Tree- Introduction


• The popular attribute selection measures are:
- Information gain
- Gini index
• When information gain is used as the criterion, the attributes are assumed to be categorical; for the Gini index, the attributes are assumed to be continuous
• With information gain as the criterion, we try to estimate the information contained in each attribute
• Entropy is defined below for a binary classification problem with only two classes, positive and negative

Decision Tree - Model


• Given a set of training cases/objects and their attribute
values, try to determine the target attribute value of new
examples.

- Classification
- regression


Decision Tree- Example


Decision Tree
• The test at each node is usually on the value of a feature (attribute) of the instance
• Each test is on a single attribute, and there is a branch for each outcome
• There may be two outcomes or, in some cases, more than two
• Nodes are created until a leaf node is reached


Decision Tree - construction


• One way to construct the tree is to keep splitting until a leaf node is reached
• It is a recursive process

Decision Tree- construction


• The tree keeps growing until each node contains only yes or only no examples


Decision Tree - construction


• Given an example, you start at the root of the tree and, based on the value of the test, move to the corresponding branch

• This process continues until a leaf node is reached

Decision tree - construction


• The tree grows further with the Wind attribute


Decision Tree - construction


• The number of yes and no examples is shown at each node

Decision Tree - Which attribute to split on?

• We want to measure the "purity" of the split
- more certain about Yes/No after the split
- pure set (4 yes / 0 no) => completely certain (100%)
- impure set (3 yes / 3 no) => completely uncertain (50%)
• We can't simply use P("yes" | set):
- the measure must be symmetric: 4 yes / 0 no is as pure as 0 yes / 4 no

Decision Tree - Entropy


• Entropy measures the uncertainty in the data
• For a binary classification problem its value lies between 0 and 1
• 1 means maximum uncertainty
• 0 means complete certainty

Decision Tree - Entropy

$H(S) = -p_{+} \log_2 p_{+} - p_{-} \log_2 p_{-}$

• S ... subset of training examples
• p(+) or p(-) ... % of positive or negative examples in S
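
• A minimal Python sketch of this entropy formula (illustrative only, not part of the lecture code):

import math

def entropy(p_pos, p_neg):
    # H(S) = -p+ log2(p+) - p- log2(p-), with 0 * log2(0) treated as 0
    h = 0.0
    for p in (p_pos, p_neg):
        if p > 0:
            h -= p * math.log2(p)
    return h

print(entropy(3/6, 3/6))    # impure set (3 yes / 3 no) -> 1.0
print(entropy(4/4, 0/4))    # pure set   (4 yes / 0 no) -> 0.0
print(entropy(9/14, 5/14))  # 9 yes / 5 no              -> ~0.94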


Decision Tree - Entropy


• Interpretation: assume an item X belongs to S
- how many bits are needed to tell whether X is positive or negative?
• Impure set (3 yes / 3 no): H = 1
• Pure set (4 yes / 0 no): H = 0

Decision Tree – Information Gain


• How many items end up in pure sets?
• Information gain is the expected drop in entropy after the split:

$Gain(S, A) = H(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} H(S_v)$

• V : possible values of attribute A
• S : set of examples {x}
• S_v : subset of S where attribute A takes value v
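
• A minimal Python sketch of this gain formula (illustrative only, not the lecture's code); the last call anticipates the Outlook split worked out on the following slides:

import math

def entropy_from_counts(pos, neg):
    # Entropy of a set given its positive/negative counts
    total = pos + neg
    h = 0.0
    for c in (pos, neg):
        if c > 0:
            p = c / total
            h -= p * math.log2(p)
    return h

def information_gain(parent_counts, child_counts):
    # Gain(S, A) = H(S) - sum_v |S_v|/|S| * H(S_v)
    # parent_counts: (pos, neg) of S; child_counts: one (pos, neg) pair per value v of A
    n = sum(parent_counts)
    gain = entropy_from_counts(*parent_counts)
    for pos, neg in child_counts:
        gain -= (pos + neg) / n * entropy_from_counts(pos, neg)
    return gain

print(information_gain((4, 4), [(4, 0), (0, 4)]))          # a perfect split -> 1.0
print(information_gain((9, 5), [(2, 3), (4, 0), (3, 2)]))  # Outlook split   -> ~0.25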


Decision Tree – Information Gain


• Calculation of the information gain for a specific node
• A higher gain means more certainty at the node

Decision Tree – Information Gain


• Calculation of the information gain for a specific node
• A higher gain means more certainty at the node
• Worked example for the Outlook attribute (S contains 9 yes / 5 no examples):

$Gain(S, Outlook) = H(S) - \frac{5}{14} H(S_{sunny}) - \frac{4}{14} H(S_{overcast}) - \frac{5}{14} H(S_{rain})$

$H(S) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} \approx 0.94$

$H(S_{sunny}) = -\frac{2}{5}\log_2\frac{2}{5} - \frac{3}{5}\log_2\frac{3}{5} \approx 0.97$

$H(S_{overcast}) = -\frac{4}{4}\log_2\frac{4}{4} = 0$

$H(S_{rain}) = -\frac{3}{5}\log_2\frac{3}{5} - \frac{2}{5}\log_2\frac{2}{5} \approx 0.97$

$Gain(S, Outlook) \approx 0.94 - \frac{5}{14}(0.97) - \frac{4}{14}(0) - \frac{5}{14}(0.97) \approx 0.25$


Decision Tree – Important note


• One point worth clarifying: the same attribute can appear in a decision tree multiple times, as long as the occurrences are on different branches.

• For obvious reasons, it does not make sense to repeat the same test within the same branch.

• On different branches, this restriction does not apply.

Decision Tree – Important Terminology


Root Node: It represents the entire population or sample; it further gets divided into two or more homogeneous sets.
Splitting: The process of dividing a node into two or more sub-nodes.
Decision Node: When a sub-node splits into further sub-nodes, it is called a decision node.
Leaf / Terminal Node: Nodes that do not split are called leaf or terminal nodes.
Pruning: When we remove sub-nodes of a decision node, the process is called pruning. It is the opposite of splitting.
Branch / Sub-Tree: A subsection of the entire tree is called a branch or sub-tree.
Parent and Child Node: A node that is divided into sub-nodes is called the parent node of those sub-nodes; the sub-nodes are the children of the parent node.



Decision Tree – Pruning



Decision Tree – How it works


• Decision trees use multiple algorithms to decide to split a node into
two or more sub-nodes
• The creation of sub-nodes increases the homogeneity of resultant
sub-nodes
• The algorithm selection is also based on the type of target
variables. Let us look at some algorithms used in Decision Trees:


Decision Tree – How it works cont…


1. ID3 → (Iterative Dichotomiser 3)
2. C4.5 → (successor of ID3)
3. CART → (Classification And Regression Tree)
4. CHAID → (Chi-square Automatic Interaction Detection; performs multi-level splits when computing classification trees)
5. MARS → (Multivariate Adaptive Regression Splines)

• The ID3 algorithm builds decision trees using a top-down greedy search through the space of possible branches, with no backtracking

14
4/27/2021

Decision Tree – ID3 Algorithm


(0) split(node, {examples})
(1) A ← the best attribute for splitting the {examples}
(2) assign A as the decision attribute for node
(3) for each value of A, create a new child node
(4) split the training {examples} among the child nodes
(5) if the examples are perfectly classified: STOP
    else: recurse over the new child nodes:
          split(child-node, {subset of examples})
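
• A compact, illustrative Python sketch of this recursive procedure (not the lecture's code); it assumes each example is a dict of attribute values plus a target key:

import math
from collections import Counter

def _entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def _gain(examples, attr, target):
    labels = [ex[target] for ex in examples]
    remainder = 0.0
    for value in {ex[attr] for ex in examples}:
        subset = [ex[target] for ex in examples if ex[attr] == value]
        remainder += len(subset) / len(labels) * _entropy(subset)
    return _entropy(labels) - remainder

def id3(examples, attributes, target="PlayTennis"):
    labels = [ex[target] for ex in examples]
    if len(set(labels)) == 1:                      # perfectly classified: STOP
        return labels[0]
    if not attributes:                             # nothing left to split on: majority leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: _gain(examples, a, target))  # step (1)
    node = {best: {}}                              # step (2): best attribute is the decision node
    for value in {ex[best] for ex in examples}:    # steps (3)-(4): one child per value
        subset = [ex for ex in examples if ex[best] == value]
        node[best][value] = id3(subset, [a for a in attributes if a != best], target)
    return node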


Decision Tree – Attribute selection method


• If the dataset consists of N attributes then deciding which attribute to
place at the root or at different levels of the tree as internal nodes is a
complicated step
• For solving this attribute selection problem, researchers worked and
devised some solutions. They suggested using some criteria like :

1. Entropy,
2. Information gain,
3. Gini index,
4. Gain Ratio,
5. Reduction in Variance
6. Chi-Square


15
4/27/2021

Decision Tree – Data Set


Day Outlook Temperature Humidity Wind Play Tennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No


Decision Tree - Example


• ID3 Algorithm
• According to the ID3 algorithm, find the best attribute (the one with the highest information gain)
• Then split the node using its values


Example cont…


Decision Tree – Example 2


Sno   A    B    C    D    E
1    4.8  3.4  1.9  0.2   T
2    5.0  3.0  1.6  0.2   T
3    5.0  3.4  1.6  0.4   T
4    5.2  3.5  1.5  0.2   T
5    5.2  3.4  1.4  0.2   T
6    4.7  3.2  1.6  0.2   T
7    4.8  3.1  1.6  0.2   T
8    5.4  3.4  1.5  0.4   T
9    7.0  3.2  4.7  1.4   F
10   6.4  3.2  4.5  1.5   F
11   6.9  3.1  4.9  1.5   F
12   5.5  2.3  4.0  1.5   F
13   6.5  2.8  4.6  1.5   F
14   5.7  2.8  4.5  1.3   F
15   6.3  3.3  4.7  1.6   F
16   4.9  2.4  3.5  1.0   F

Candidate splits and their class counts (Yes = T, No = F):

A >= 5:   5 Yes / 7 No     A < 5:    3 Yes / 1 No
B >= 3.0: 8 Yes / 4 No     B < 3.0:  0 Yes / 4 No
C >= 4.2: 0 Yes / 6 No     C < 4.2:  8 Yes / 2 No
D >= 1.4: 0 Yes / 7 No     D < 1.4:  8 Yes / 1 No


• H(S) = 1 (8 Yes / 8 No)

Information gain of A:
For A >= 5 (12 examples):
H(V_A>=5) = -[5/12 log2(5/12) + 7/12 log2(7/12)] = 0.9799

For A < 5 (4 examples):
H(V_A<5) = -[3/4 log2(3/4) + 1/4 log2(1/4)] = 0.81128

IG(A) = 1 - (12/16)(0.9799) - (4/16)(0.81128)
IG(A) = 0.0622


• H(S) = 1

Information gain of B:
For B >= 3 (12 examples):
H(V_B>=3) = -[8/12 log2(8/12) + 4/12 log2(4/12)] = 0.9183

For B < 3 (4 examples):
H(V_B<3) = -[4/4 log2(4/4)] = 0

IG(B) = 1 - (12/16)(0.9183) - (4/16)(0)
IG(B) = 0.3113

• Similarly
IG(C) = 0.54880
IG(D) = 0.41189


20
4/27/2021

Decision Tree – Example 2

$\text{Information gain} = H(S) - \sum_{v} \frac{|S_v|}{|S|} H(S_v)$

Decision Tree – Overfitting

model hH overfits if there is a h’H

errortrain(h) < errortrain(h’)


and
errorX(h) > errorX(h’)

- Low bias
- High Varaiance
42



Decision Tree
Python Code


Decision Tree – Python Code

• DecisionTreeClassifier is a class capable of performing multi-class classification on a dataset.

• As with other classifiers, DecisionTreeClassifier takes as input two arrays:
• an array X, sparse or dense, of shape (n_samples, n_features), holding the training samples,
• and an array Y of integer values, of shape (n_samples,), holding the class labels for the training samples:


Decision Tree – Python Code


class sklearn.tree.DecisionTreeClassifier(*, criterion='gini',
splitter='best', max_depth=None, min_samples_split=2,
min_samples_leaf=1, min_weight_fraction_leaf=0.0,
max_features=None, random_state=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
class_weight=None, ccp_alpha=0.0)


Decision Tree – Python Code


import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.metrics import accuracy_score
Data = pd.read_csv('diabetes.csv')
print(Data.shape)
print(Data.head())
InputTrain = Data.drop('Outcome',axis='columns')
InputTarget = Data['Outcome']

Decision Tree – Python Code


train_x, test_x, train_y, test_y =
train_test_split(InputTrain,InputTarget,test_size = 0.3)
model = tree.DecisionTreeClassifier()
model.fit(train_x,train_y)
DT_pred = model.predict(test_x)
acc = accuracy_score(test_y, DT_pred)
print(acc)



Decision Tree – Python Code - 2

>>> from sklearn import tree


>>> X = [[0, 0], [1, 1]]
>>> Y = [0, 1]
>>> clf = tree.DecisionTreeClassifier()
>>> clf = clf.fit(X, Y)
>>> clf.predict([[2., 2.]]) # test data
array([1])


Decision Tree – Python Code


• DecisionTreeClassifier is capable of both binary (where the labels are
[-1, 1]) classification and multiclass (where the labels are [0, …, K-1])
classification.

• Using the Iris dataset, we can construct a tree as follows:


Decision Tree – Python Code-tree visualization

>>> from sklearn.datasets import load_iris
>>> from sklearn import tree
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> clf = tree.DecisionTreeClassifier()
>>> clf = clf.fit(X, y)


Decision Tree – Python Code-tree visualization

• Once trained, you can plot the tree with the plot_tree function:
>>> tree.plot_tree(clf)


Decision Tree – Python Code-tree visualization

• We can also export the tree in Graphviz format using the export_graphviz exporter. If you use the conda package manager, the graphviz binaries and the Python package can be installed with conda install python-graphviz.


Decision Tree – Python Code-tree visualization

>>> import graphviz


>>> dot_data = tree.export_graphviz(clf, out_file=None)
>>> graph = graphviz.Source(dot_data)
>>> graph.render("iris")


Decision Tree – Python Code


• The export_graphviz exporter also supports a variety of aesthetic
options, including coloring nodes by their class (or value for
regression) and using explicit variable and class names if desired.
Jupyter notebooks also render these plots inline automatically:


Decision Tree – Python Code


>>> dot_data = tree.export_graphviz(clf, out_file=None,
... feature_names=iris.feature_names,
... class_names=iris.target_names,
... filled=True, rounded=True,
... special_characters=True)
>>> graph = graphviz.Source(dot_data)
>>> graph


Decision Tree – Python Code-GraphViz output



Decision Tree – Python Code- Textual format

• Alternatively, the tree can also be exported in textual format with the function export_text.

• This method doesn't require the installation of external libraries and is more compact:


Decision Tree – Python Code-Textual format


>>> from sklearn.datasets import load_iris
>>> from sklearn.tree import DecisionTreeClassifier
>>> from sklearn.tree import export_text
>>> iris = load_iris()
>>> decision_tree = DecisionTreeClassifier(random_state=0, max_depth=2)
>>> decision_tree = decision_tree.fit(iris.data, iris.target)
>>> r = export_text(decision_tree, feature_names=iris['feature_names'])


Decision Tree – Python Code-Textual format


>>> print(r)
|--- petal width (cm) <= 0.80
| |--- class: 0
|--- petal width (cm) > 0.80
| |--- petal width (cm) <= 1.75
| | |--- class: 1
| |--- petal width (cm) > 1.75
| | |--- class: 2


Decision Tree – Python Code-Regression


• Decision trees can also be applied to regression problems, using the
DecisionTreeRegressor class.

• As in the classification setting, the fit method takes arrays X and y as arguments, only that in this case y is expected to have floating point values instead of integer values:


Decision Tree – Python Code


>>> from sklearn import tree
>>> X = [[0, 0], [2, 2]]
>>> y = [0.5, 2.5]
>>> clf = tree.DecisionTreeRegressor()
>>> clf = clf.fit(X, y)
>>> clf.predict([[1, 1]])
array([0.5])


Decision Tree – Python Code-parameter detail

criterion : {"gini", "entropy"}, default="gini"
The function to measure the quality of a split. Supported criteria are "gini" for the Gini impurity and "entropy" for the information gain.

splitter : {"best", "random"}, default="best"
The strategy used to choose the split at each node. Supported strategies are "best" to choose the best split and "random" to choose the best random split.


Decision Tree – Python Code-parameter detail


max_depth : int, default=None
The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.

min_samples_split : int or float, default=2
The minimum number of samples required to split an internal node:
• If int, then consider min_samples_split as the minimum number.
• If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) is the minimum number of samples for each split.
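
• A small illustrative example of these two parameters on the Iris data (not from the lecture):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Constrain the tree: depth capped at 3, and a node needs at least
# 10 samples before it may be split.
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3, min_samples_split=10)
clf = clf.fit(X, y)
print(clf.get_depth(), clf.get_n_leaves())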


Random Forest

Building and Evaluating the random forest


Ensemble Technique
• Ensemble methods use two main techniques:
1. Bagging (Bootstrap aggregation)
2. Boosting

1. Bagging
• Random Forest
2. Boosting
• AdaBoost
• Gradient Boosting
• Extreme Gradient Boosting (XGBoost)

Random Forest-Bagging
• A technique known as bagging is used to create an ensemble of trees, where multiple training sets are generated by sampling with replacement.

• In the bagging technique, a data set is divided into N samples using randomized sampling with replacement. Then, using a single learning algorithm, a model is built on each sample. Finally, the resulting predictions are combined using voting or averaging; the models can be trained in parallel.
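
• A minimal bagging sketch with scikit-learn (illustrative only; it uses a built-in dataset so it is self-contained):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
train_x, test_x, train_y, test_y = train_test_split(X, y, test_size=0.3, random_state=0)

# Bagging: many trees, each fit on a bootstrap (with-replacement) sample of the rows,
# with their predictions combined by majority vote.
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, bootstrap=True, random_state=0)
bag.fit(train_x, train_y)
print(bag.score(test_x, test_y))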


Ensemble Technique – Bagging cont…

Training Data (TD)
→ row samples with replacement: RSD1, RSD2, RSD3, ..., RSDn
→ learning models: M1, M2, M3, ..., Mn
→ model predictions: R1, R2, R3, ..., Rn
→ majority voting
→ final prediction


Random Forest
• Random Forest is an example of ensemble learning, in which we
combine multiple machine learning algorithms to obtain better
predictive performance.

Why the name “Random”?


Two key concepts that give it the name random:
• A random sampling of training data set when building trees.
• Random subsets of features considered when splitting nodes.
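
• A small NumPy sketch (illustrative only) of these two sources of randomness:

import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 100, 8

# 1) Random sampling of training rows, with replacement (the bootstrap sample)
row_idx = rng.integers(0, n_samples, size=n_samples)

# 2) A random subset of the features considered at a split
#    (sqrt(n_features) is a common choice for classification)
feature_idx = rng.choice(n_features, size=int(np.sqrt(n_features)), replace=False)

print(row_idx[:10], feature_idx)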


Random Forest cont…


• Decision trees are easy to build, easy to use and easy to interpret

• But in practice they are not that awesome

• They work great with the data used to create them, but
they are not flexible when it comes to classifying new
samples


Random Forest
• The Random Forest combines the simplicity of decision trees with flexibility, resulting in a vast improvement in accuracy


Ensemble Technique – Random Forest


RSD : row selection data
FSD : feature selection data

Training Data (TD)
→ row and feature samples with replacement: RSD1+FSD1, RSD2+FSD2, ..., RSDn+FSDn
→ decision-tree models: M1, M2, M3, ..., Mn
→ model predictions: R1, R2, R3, ..., Rn
→ majority voting
→ final prediction (for a regression problem, take the mean or median of all predictions)


Random Forest
• Let's make a random forest
• Step 1: Create a "bootstrapped" dataset



Random Forest
• We have now created the bootstrapped dataset


Random Forest
• Step 2: Create a decision tree using the bootstrapped
dataset, but only use a random subset of variables (or
columns) at each step


Random Forest
• In this example, we will only consider 2 variables (columns) at each step
• Note: We will talk more about how to determine the
optimal number of variables to consider later



Random Forest
• We built a tree...

(1) using a bootstrapped dataset

(2) only considering a random subset of variables at each step


Random Forest
• Ideally, do this hundreds of times

• Using a bootstrapped sample and considering only a subset of the variables at each step results in a wide variety of trees

• The variety is what makes random forests more effective than individual decision trees


Random Forest
• How do we use it?

• We have all the measurements for a new sample

• and now want to know whether the patient has heart disease or not


Random Forest

• The first tree says yes



Random Forest
• Repeat this for all trees


Random Forest
• After running the data down all of the trees in the random
forest, we see which option received more votes



Random Forest
• Terminology Alert!

• Bootstrapping the data plus using the aggregate to make a decision is called "Bagging"

• Data that is not used when building a given tree is called its "Out-of-Bag" dataset
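
• In scikit-learn, this out-of-bag data can serve as a built-in validation set via oob_score (illustrative sketch, using a built-in dataset):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# oob_score=True scores each tree on the rows left out of its bootstrap sample
model = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
model.fit(X, y)
print(model.oob_score_)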


Random Forest-code
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
Data = pd.read_csv('diabetes.csv')
print(Data.shape)
print(Data.head())
InputTrain = Data.drop('Outcome',axis='columns')
InputTarget = Data['Outcome']

Random Forest-code
train_x, test_x, train_y, test_y = train_test_split(InputTrain, InputTarget, test_size=0.3)
model = RandomForestClassifier(n_estimators=200, max_depth=5, random_state=0, criterion='entropy')
print(model.fit(train_x, train_y))
RF_pred = model.predict(test_x)
acc = accuracy_score(test_y, RF_pred)
print(acc)


Random Forest-code
class sklearn.ensemble.RandomForestClassifier(n_estimators=100, *,
criterion='gini', max_depth=None, min_samples_split=2,
min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto',
max_leaf_nodes=None, min_impurity_decrease=0.0,
min_impurity_split=None, bootstrap=True, oob_score=False,
n_jobs=None, random_state=None, verbose=0, warm_start=False,
class_weight=None, ccp_alpha=0.0, max_samples=None)

# Parameters of Random Forest


# https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html


Random Forest Hyperparameters


1. max_depth
2. min_samples_split
3. max_leaf_nodes
4. min_samples_leaf
5. n_estimators
6. max_samples (bootstrap sample size)
7. max_features
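
• The effect of any of these hyperparameters can be checked with a simple sweep that compares train and test accuracy, as in the plots discussed below (illustrative sketch on a built-in dataset):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
train_x, test_x, train_y, test_y = train_test_split(X, y, test_size=0.3, random_state=0)

# Vary one hyperparameter while keeping the others fixed
for max_depth in (2, 4, 8, 16, None):
    model = RandomForestClassifier(n_estimators=100, max_depth=max_depth, random_state=0)
    model.fit(train_x, train_y)
    print(max_depth, model.score(train_x, train_y), model.score(test_x, test_y))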


Random Forest Hyperparameter #1: max_depth


1. Let’s discuss the critical max_depth hyperparameter first. The
max_depth of a tree in Random Forest is defined as the longest
path between the root node and the leaf node:


Random Forest Hyperparameter #1: max_depth


• Using the max_depth parameter, we can limit the depth to which every tree in the random forest is allowed to grow.


Random Forest Hyperparameter #1: max_depth


• In this graph, we can clearly see that as the max depth of the decision
tree increases, the performance of the model over the training set
increases continuously.
• On the other hand as the max_depth value increases, the performance
over the test set increases initially but after a certain point, it starts to
decrease rapidly.

• Can you think of a reason for this? The tree starts to overfit the training
set and therefore is not able to generalize over the unseen points in the
test set.


Random Forest Hyperparameter #2: min_sample_split


• A parameter that tells each decision tree in a random forest the minimum number of observations required in a node in order to split it.

• The default value of min_samples_split is 2. This means that if any terminal node has at least two observations and is not a pure node, we can split it further into sub-nodes.


Random Forest Hyperparameter #2: min_sample_split


• Having a default value of 2 poses the issue that a tree often keeps on splitting until the nodes are completely pure. As a result, the tree grows in size and therefore overfits the data.


Random Forest Hyperparameter #2: min_sample_split


• By increasing the value of min_samples_split, we can reduce the number of splits that happen in the decision tree and therefore prevent the model from overfitting.
• In the above example, if we increase the min_samples_split value from 2 to 6, the tree on the left would then look like the tree on the right.


Random Forest Hyperparameter #2: min_sample_split


• Now, let’s look at the effect of
min_samples_split on the
performance of the model.
• The graph below is plotted
considering that all the other
parameters remain the same and
only the value of min_samples_split
is changed:


Random Forest Hyperparameter #2: min_sample_split


• On increasing the value of the min_samples_split hyperparameter, we can clearly see that for small parameter values there is a significant difference between the training score and the test score
• But as the value of the parameter increases, the difference between the train score and the test score decreases.


Random Forest Hyperparameter #2: min_sample_split


• But there’s one thing you should keep in mind
• When the parameter value increases too much, there is an overall
dip in both the training score and test scores

• This is due to the fact that the minimum requirement of splitting a


node is so high that there are no significant splits observed

• As a result, the random forest starts to underfit.


Random Forest Hyperparameter #3: max_terminal_nodes


• Next, let’s move on to another Random Forest hyperparameter
called max_leaf_nodes.

• This hyperparameter sets a condition on the splitting of the nodes


in the tree and hence restricts the growth of the tree.

• If after splitting we have more terminal nodes than the specified


number of terminal nodes, it will stop the splitting and the tree
will not grow further.


Random Forest Hyperparameter #3: max_terminal_nodes


• Let’s say we set the maximum terminal nodes as 2 in this case. As
there is only one node, it will allow the tree to grow further:


Random Forest Hyperparameter #3: max_terminal_nodes


• Now, after the first split, you can see that there are 2 nodes here
and we have set the maximum terminal nodes as 2.

• Hence, the tree will terminate here and will not grow further. This
is how setting the maximum terminal nodes or max_leaf_nodes
can help us in preventing overfitting.


Random Forest Hyperparameter #3: max_terminal_nodes


• Note that if the value of the
max_leaf_nodes is very small, the
random forest is likely to underfit.
Let’s see how this parameter
affects the random forest model’s
performance:


Random Forest Hyperparameter #3: max_terminal_nodes


• We can see that when the parameter value is very small, the tree
is underfitting and as the parameter value increases, the
performance of the tree over both test and train increases.

• According to this plot, the tree starts to overfit as the parameter


value goes beyond 25.


Random Forest Hyperparameter #4: min_samples_leaf


• Time to shift our focus to min_sample_leaf. This Random Forest
hyperparameter specifies the minimum number of samples that
should be present in the leaf node after splitting a node.


Random Forest Hyperparameter #4: min_samples_leaf


• Let’s understand min_sample_leaf using an example. Let’s say we
have set the minimum samples for a terminal node as 5:


Random Forest Hyperparameter #4: min_samples_leaf


• The tree on the left represents an unconstrained tree. Here, the
nodes marked with green color satisfy the condition as they have a
minimum of 5 samples. Hence, they will be treated as the leaf or
terminal nodes.

• However, the red node has only 3 samples and hence it will not be
considered as the leaf node. Its parent node will become the leaf
node. That’s why the tree on the right represents the results when
we set the minimum samples for the terminal node as 5.
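
• In scikit-learn this constraint is set per tree of the forest (illustrative sketch):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# Every leaf of every tree must keep at least 5 training samples
model = RandomForestClassifier(n_estimators=100, min_samples_leaf=5, random_state=0).fit(X, y)
print(model.estimators_[0].get_n_leaves())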


Random Forest Hyperparameter #4: min_samples_leaf


• If we plot the performance/parameter
value plot as before:

• We can clearly see that the Random


Forest model is overfitting when the
parameter value is very low (when
parameter value < 100), but the
model performance quickly rises up
and rectifies the issue of overfitting
(100 < parameter value < 400). But
when we keep on increasing the value
of the parameter (> 500), the model
slowly drifts towards the realm of
underfitting.

Random Forest Hyperparameter #5: n_estimators


• We know that a Random Forest algorithm is nothing but a grouping of trees. But how many trees should we consider? That's a common question new data scientists ask. And it's a valid one!

• We might say that more trees should be able to produce a more generalized result, right? But by choosing more trees, the time complexity of the Random Forest model also increases.


Random Forest Hyperparameter #5: n_estimators


• In this graph, we can clearly see that the performance of the
model sharply increases and then stagnates at a certain level:


Random Forest Hyperparameter #5: n_estimators


• This means that choosing a large number of estimators in a
random forest model is not the best idea.

• Although adding more trees will not degrade the model, stopping early saves computational cost and prevents the use of a fire extinguisher on your CPU!


Random Forest Hyperparameter #6: max_samples


• The max_samples
hyperparameter
determines what fraction
of the original dataset is
given to any individual
tree.
• You might be thinking that
more data is always better.
Let’s try to see if that
makes sense.

Random Forest Hyperparameter #6: max_samples


• We can see that the performance of the model rises sharply and
then saturates fairly quickly. Can you figure out what the key
takeaway from this visualization is?

• It is not necessary to give each decision tree of the Random Forest


the full data. If you would notice, the model performance reaches
its max when the data provided is less than 0.2 fraction of the
original dataset. That’s quite astonishing!


Random Forest Hyperparameter #6: max_samples


• Although this fraction will differ from dataset to dataset, we can
allocate a lesser fraction of bootstrapped data to each decision
tree. As a result, the training time of the Random Forest model is
reduced drastically.
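
• A sketch of the corresponding scikit-learn setting (illustrative only):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# Each tree gets a bootstrap sample of only ~20% of the rows
model = RandomForestClassifier(n_estimators=200, max_samples=0.2, random_state=0)
model.fit(X, y)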


Random Forest Hyperparameter #7: max_features


• Finally, we will observe the effect of the max_features hyperparameter. This is the maximum number of features provided to each tree in a random forest when looking for the best split.

• We know that random forest chooses some random samples from


the features to find the best split. Let’s see how varying this
parameter can affect our random forest model’s performance.


Random Forest Hyperparameter #7: max_features


• We can see that the performance
of the model initially increases as
the number of max_feature
increases.
• But, after a certain point, the
train_score keeps on increasing.
But the test_score saturates and
even starts decreasing towards the
end, which clearly means that the
model starts to overfit.


Random Forest Hyperparameter #7: max_features


• Ideally, the overall performance of the model is highest when max_features is close to 6.
• It is a good convention to start from the default value of this parameter, which is the square root of the number of features present in the dataset.
• The ideal value of max_features generally tends to lie close to this default.
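
• That default corresponds to the following scikit-learn setting (illustrative sketch):

from sklearn.ensemble import RandomForestClassifier

# Consider only sqrt(n_features) randomly chosen features at each split
# (the usual default for classification forests)
model = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)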
