Lecture-4-Decision Tree and Random Forest


Machine Learning
Lecture – 4
Decision Tree

Machine Learning
by Tom M. Mitchell

Muhammad Affan Alim

Outline
• Introduction
• Example
• ID3 Heuristic algorithm
• Principles
• Entropy
• Information gain
• Evaluations


Decision Tree- Introduction


• A decision tree is a tree-structured classifier and regressor

• It has two types of nodes


• decision nodes
• leaf nodes (prediction nodes)

Decision Tree- Introduction


• A decision tree is a non-linear function or model
• The tree consists of nodes and edges, and some of the nodes are leaves
• A decision tree is a classifier in the form of a tree which has
two types of nodes
- Decision nodes
- Leaf nodes
• It can be used for both classification and regression but
classification is more popular


Decision Tree- Introduction diagram

Decision Tree- Introduction


• Prefer the smaller tree: low depth and a small number of nodes


Decision Tree- Introduction


• The popular attribute selection measures are:
- Information gain
- Gini index
• When information gain is used as the criterion, the attributes are assumed to be categorical; for the Gini index, the attributes are assumed to be continuous
• With information gain as the criterion, we try to estimate the information contained in each attribute
• Entropy is defined below for a binary classification problem with only two classes, positive and negative

Decision Tree - Model


• Given a set of training cases/objects and their attribute
values, try to determine the target attribute value of new
examples.

- Classification
- regression


Decision Tree- Example


Decision Tree
• The test at each node is usually on the value of a feature (attribute) of the instance
• Each test is on a single attribute, and there is a branch for each outcome
• There may be two outcomes or, in some cases, more than two
• Nodes are created until a leaf node is reached


Decision Tree - construction


• One way to construct the tree is to keep splitting until a leaf node is reached
• It is a recursive process

Decision Tree- construction


• The tree keeps growing until each node contains only yes or only no examples


Decision Tree - construction


• Given an example, you start at the root of the tree and, based on the value of the test, move to the corresponding branch

• This process continues until a leaf node is reached

Decision tree - construction


• The tree grows further with the Wind attribute


Decision Tree - construction


• The number of yes and no examples is shown at each node

Decision Tree - Which attribute to split on?

• We want to measure the "purity" of the split
- more certain about Yes/No after the split
- pure set (4 yes / 0 no) => completely certain (100%)
- impure set (3 yes / 3 no) => completely uncertain (50%)
• We can't simply use P("yes" | set):
- the measure must be symmetric: 4 yes / 0 no is as pure as 0 yes / 4 no

Decision Tree - Entropy


• Entropy measures the uncertainty in the data
• For a binary classification problem its value lies between 0 and 1
• 1 means maximum uncertainty
• 0 means complete certainty

Decision Tree - Entropy

$H(S) = -p_{+} \log_2 p_{+} - p_{-} \log_2 p_{-}$

• S ... subset of training examples
• p(+) or p(-) ... % of positive or negative examples in S
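
• A minimal Python sketch of this entropy formula (illustrative only, not part of the lecture code):

import math

def entropy(p_pos, p_neg):
    # H(S) = -p+ log2(p+) - p- log2(p-), with 0 * log2(0) treated as 0
    h = 0.0
    for p in (p_pos, p_neg):
        if p > 0:
            h -= p * math.log2(p)
    return h

print(entropy(3/6, 3/6))    # impure set (3 yes / 3 no) -> 1.0
print(entropy(4/4, 0/4))    # pure set   (4 yes / 0 no) -> 0.0
print(entropy(9/14, 5/14))  # 9 yes / 5 no              -> ~0.94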


Decision Tree - Entropy


• Interpretation: assume an item X belongs to S
- how many bits are needed to tell whether X is positive or negative?
• Impure set (3 yes / 3 no): H = 1
• Pure set (4 yes / 0 no): H = 0

Decision Tree – Information Gain


• How many items end up in pure sets?
• Information gain is the expected drop in entropy after the split:

$Gain(S, A) = H(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} H(S_v)$

• V : possible values of attribute A
• S : set of examples {x}
• S_v : subset of S where attribute A takes value v
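
• A minimal Python sketch of this gain formula (illustrative only, not the lecture's code); the last call anticipates the Outlook split worked out on the following slides:

import math

def entropy_from_counts(pos, neg):
    # Entropy of a set given its positive/negative counts
    total = pos + neg
    h = 0.0
    for c in (pos, neg):
        if c > 0:
            p = c / total
            h -= p * math.log2(p)
    return h

def information_gain(parent_counts, child_counts):
    # Gain(S, A) = H(S) - sum_v |S_v|/|S| * H(S_v)
    # parent_counts: (pos, neg) of S; child_counts: one (pos, neg) pair per value v of A
    n = sum(parent_counts)
    gain = entropy_from_counts(*parent_counts)
    for pos, neg in child_counts:
        gain -= (pos + neg) / n * entropy_from_counts(pos, neg)
    return gain

print(information_gain((4, 4), [(4, 0), (0, 4)]))          # a perfect split -> 1.0
print(information_gain((9, 5), [(2, 3), (4, 0), (3, 2)]))  # Outlook split   -> ~0.25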


Decision Tree – Information Gain


• Calculation of the information gain for a specific node
• A higher gain means more certainty at the node

Decision Tree – Information Gain


• Calculation of the information gain for a specific node
• A higher gain means more certainty at the node
• Worked example for the Outlook attribute (S contains 9 yes / 5 no examples):

$Gain(S, Outlook) = H(S) - \frac{5}{14} H(S_{sunny}) - \frac{4}{14} H(S_{overcast}) - \frac{5}{14} H(S_{rain})$

$H(S) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} \approx 0.94$

$H(S_{sunny}) = -\frac{2}{5}\log_2\frac{2}{5} - \frac{3}{5}\log_2\frac{3}{5} \approx 0.97$

$H(S_{overcast}) = -\frac{4}{4}\log_2\frac{4}{4} = 0$

$H(S_{rain}) = -\frac{3}{5}\log_2\frac{3}{5} - \frac{2}{5}\log_2\frac{2}{5} \approx 0.97$

$Gain(S, Outlook) \approx 0.94 - \frac{5}{14}(0.97) - \frac{4}{14}(0) - \frac{5}{14}(0.97) \approx 0.25$


Decision Tree – Important note


• One point worth clarifying: the same attribute can appear in a decision tree multiple times, as long as the occurrences are on different branches.

• For obvious reasons, it does not make sense to repeat the same test within the same branch.

• On different branches, this restriction does not apply.

Decision Tree – Important Terminology


Root Node: It represents the entire population or sample; it further gets divided into two or more homogeneous sets.
Splitting: The process of dividing a node into two or more sub-nodes.
Decision Node: When a sub-node splits into further sub-nodes, it is called a decision node.
Leaf / Terminal Node: Nodes that do not split are called leaf or terminal nodes.
Pruning: When we remove sub-nodes of a decision node, the process is called pruning. It is the opposite of splitting.
Branch / Sub-Tree: A subsection of the entire tree is called a branch or sub-tree.
Parent and Child Node: A node that is divided into sub-nodes is called the parent node of those sub-nodes; the sub-nodes are the children of the parent node.



Decision Tree – Pruning



Decision Tree – How it works


• Decision trees use multiple algorithms to decide to split a node into
two or more sub-nodes
• The creation of sub-nodes increases the homogeneity of resultant
sub-nodes
• The algorithm selection is also based on the type of target
variables. Let us look at some algorithms used in Decision Trees:


Decision Tree – How it works cont…


1. ID3 → (Iterative Dichotomiser 3)
2. C4.5 → (successor of ID3)
3. CART → (Classification And Regression Tree)
4. CHAID → (Chi-square Automatic Interaction Detection; performs multi-level splits when computing classification trees)
5. MARS → (Multivariate Adaptive Regression Splines)

• The ID3 algorithm builds decision trees using a top-down greedy search through the space of possible branches, with no backtracking

14
4/27/2021

Decision Tree – ID3 Algorithm


(0) split(node, {examples})
(1) A ← the best attribute for splitting the {examples}
(2) assign A as the decision attribute for node
(3) for each value of A, create a new child node
(4) split the training {examples} among the child nodes
(5) if the examples are perfectly classified: STOP
    else: recurse over the new child nodes:
          split(child-node, {subset of examples})
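
• A compact, illustrative Python sketch of this recursive procedure (not the lecture's code); it assumes each example is a dict of attribute values plus a target key:

import math
from collections import Counter

def _entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def _gain(examples, attr, target):
    labels = [ex[target] for ex in examples]
    remainder = 0.0
    for value in {ex[attr] for ex in examples}:
        subset = [ex[target] for ex in examples if ex[attr] == value]
        remainder += len(subset) / len(labels) * _entropy(subset)
    return _entropy(labels) - remainder

def id3(examples, attributes, target="PlayTennis"):
    labels = [ex[target] for ex in examples]
    if len(set(labels)) == 1:                      # perfectly classified: STOP
        return labels[0]
    if not attributes:                             # nothing left to split on: majority leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: _gain(examples, a, target))  # step (1)
    node = {best: {}}                              # step (2): best attribute is the decision node
    for value in {ex[best] for ex in examples}:    # steps (3)-(4): one child per value
        subset = [ex for ex in examples if ex[best] == value]
        node[best][value] = id3(subset, [a for a in attributes if a != best], target)
    return node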


Decision Tree – Attribute selection method


• If the dataset consists of N attributes then deciding which attribute to
place at the root or at different levels of the tree as internal nodes is a
complicated step
• For solving this attribute selection problem, researchers worked and
devised some solutions. They suggested using some criteria like :

1. Entropy,
2. Information gain,
3. Gini index,
4. Gain Ratio,
5. Reduction in Variance
6. Chi-Square


15
4/27/2021

Decision Tree – Data Set


Day Outlook Temperature Humidity Wind Play Tennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No


Decision Tree - Example


• ID3 Algorithm
• According to the ID3 algorithm, find the best attribute (the one with the highest information gain)
• Then split the node using its values


Example cont…


Decision Tree – Example 2


Sno   A    B    C    D    E
1    4.8  3.4  1.9  0.2   T
2    5.0  3.0  1.6  0.2   T
3    5.0  3.4  1.6  0.4   T
4    5.2  3.5  1.5  0.2   T
5    5.2  3.4  1.4  0.2   T
6    4.7  3.2  1.6  0.2   T
7    4.8  3.1  1.6  0.2   T
8    5.4  3.4  1.5  0.4   T
9    7.0  3.2  4.7  1.4   F
10   6.4  3.2  4.5  1.5   F
11   6.9  3.1  4.9  1.5   F
12   5.5  2.3  4.0  1.5   F
13   6.5  2.8  4.6  1.5   F
14   5.7  2.8  4.5  1.3   F
15   6.3  3.3  4.7  1.6   F
16   4.9  2.4  3.5  1.0   F

Candidate splits and their class counts (Yes = T, No = F):

A >= 5:   5 Yes / 7 No     A < 5:    3 Yes / 1 No
B >= 3.0: 8 Yes / 4 No     B < 3.0:  0 Yes / 4 No
C >= 4.2: 0 Yes / 6 No     C < 4.2:  8 Yes / 2 No
D >= 1.4: 0 Yes / 7 No     D < 1.4:  8 Yes / 1 No


• H(S) = 1 (8 Yes / 8 No)

Information gain of A:
For A >= 5 (12 examples):
H(V_A>=5) = -[5/12 log2(5/12) + 7/12 log2(7/12)] = 0.9799

For A < 5 (4 examples):
H(V_A<5) = -[3/4 log2(3/4) + 1/4 log2(1/4)] = 0.81128

IG(A) = 1 - (12/16)(0.9799) - (4/16)(0.81128)
IG(A) = 0.0622


• H(S) = 1

Information gain of B:
For B >= 3 (12 examples):
H(V_B>=3) = -[8/12 log2(8/12) + 4/12 log2(4/12)] = 0.9183

For B < 3 (4 examples):
H(V_B<3) = -[4/4 log2(4/4)] = 0

IG(B) = 1 - (12/16)(0.9183) - (4/16)(0)
IG(B) = 0.3113

• Similarly
IG(C) = 0.54880
IG(D) = 0.41189


20
4/27/2021

Decision Tree – Example 2

$\text{Information gain} = H(S) - \sum_{v} \frac{|S_v|}{|S|} H(S_v)$

Decision Tree – Overfitting

model hH overfits if there is a h’H

errortrain(h) < errortrain(h’)


and
errorX(h) > errorX(h’)

- Low bias
- High Varaiance
42



Decision Tree
Python Code


Decision Tree – Python Code

• DecisionTreeClassifier is a class capable of performing multi-class classification on a dataset.

• As with other classifiers, DecisionTreeClassifier takes as input two arrays:
• an array X, sparse or dense, of shape (n_samples, n_features), holding the training samples,
• and an array Y of integer values, of shape (n_samples,), holding the class labels for the training samples:


Decision Tree – Python Code


class sklearn.tree.DecisionTreeClassifier(*, criterion='gini',
splitter='best', max_depth=None, min_samples_split=2,
min_samples_leaf=1, min_weight_fraction_leaf=0.0,
max_features=None, random_state=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
class_weight=None, ccp_alpha=0.0)


Decision Tree – Python Code


import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.metrics import accuracy_score
Data = pd.read_csv('diabetes.csv')
print(Data.shape)
print(Data.head())
InputTrain = Data.drop('Outcome',axis='columns')
InputTarget = Data['Outcome']

Decision Tree – Python Code


train_x, test_x, train_y, test_y =
train_test_split(InputTrain,InputTarget,test_size = 0.3)
model = tree.DecisionTreeClassifier()
model.fit(train_x,train_y)
DT_pred = model.predict(test_x)
acc = accuracy_score(test_y, DT_pred)
print(acc)



Decision Tree – Python Code - 2

>>> from sklearn import tree


>>> X = [[0, 0], [1, 1]]
>>> Y = [0, 1]
>>> clf = tree.DecisionTreeClassifier()
>>> clf = clf.fit(X, Y)
>>> clf.predict([[2., 2.]]) # test data
array([1])


Decision Tree – Python Code


• DecisionTreeClassifier is capable of both binary (where the labels are
[-1, 1]) classification and multiclass (where the labels are [0, …, K-1])
classification.

• Using the Iris dataset, we can construct a tree as follows:


Decision Tree – Python Code-tree visualization

>>> from sklearn.datasets import load_iris
>>> from sklearn import tree
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> clf = tree.DecisionTreeClassifier()
>>> clf = clf.fit(X, y)


Decision Tree – Python Code-tree visualization

• Once trained, you can plot the tree with the plot_tree function:
>>> tree.plot_tree(clf)


Decision Tree – Python Code-tree visualization

• We can also export the tree in Graphviz format using the export_graphviz exporter. If you use the conda package manager, the graphviz binaries and the Python package can be installed with conda install python-graphviz.


Decision Tree – Python Code-tree visualization

>>> import graphviz


>>> dot_data = tree.export_graphviz(clf, out_file=None)
>>> graph = graphviz.Source(dot_data)
>>> graph.render("iris")


Decision Tree – Python Code


• The export_graphviz exporter also supports a variety of aesthetic
options, including coloring nodes by their class (or value for
regression) and using explicit variable and class names if desired.
Jupyter notebooks also render these plots inline automatically:


Decision Tree – Python Code


>>> dot_data = tree.export_graphviz(clf, out_file=None,
... feature_names=iris.feature_names,
... class_names=iris.target_names,
... filled=True, rounded=True,
... special_characters=True)
>>> graph = graphviz.Source(dot_data)
>>> graph


Decision Tree – Python Code-GraphViz output



Decision Tree – Python Code- Textual format

• Alternatively, the tree can also be exported in textual format with the function export_text.

• This method doesn't require the installation of external libraries and is more compact:


Decision Tree – Python Code-Textual format


>>> from sklearn.datasets import load_iris
>>> from sklearn.tree import DecisionTreeClassifier
>>> from sklearn.tree import export_text
>>> iris = load_iris()
>>> decision_tree = DecisionTreeClassifier(random_state=0, max_depth=2)
>>> decision_tree = decision_tree.fit(iris.data, iris.target)
>>> r = export_text(decision_tree, feature_names=iris['feature_names'])


Decision Tree – Python Code-Textual format


>>> print(r)
|--- petal width (cm) <= 0.80
| |--- class: 0
|--- petal width (cm) > 0.80
| |--- petal width (cm) <= 1.75
| | |--- class: 1
| |--- petal width (cm) > 1.75
| | |--- class: 2


Decision Tree – Python Code-Regression


• Decision trees can also be applied to regression problems, using the
DecisionTreeRegressor class.

• As in the classification setting, the fit method takes arrays X and y as arguments, only that in this case y is expected to have floating point values instead of integer values:


Decision Tree – Python Code


>>> from sklearn import tree
>>> X = [[0, 0], [2, 2]]
>>> y = [0.5, 2.5]
>>> clf = tree.DecisionTreeRegressor()
>>> clf = clf.fit(X, y)
>>> clf.predict([[1, 1]])
array([0.5])


Decision Tree – Python Code-parameter detail

criterion : {"gini", "entropy"}, default="gini"
The function to measure the quality of a split. Supported criteria are "gini" for the Gini impurity and "entropy" for the information gain.

splitter : {"best", "random"}, default="best"
The strategy used to choose the split at each node. Supported strategies are "best" to choose the best split and "random" to choose the best random split.


Decision Tree – Python Code-parameter detail


max_depth : int, default=None
The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.

min_samples_split : int or float, default=2
The minimum number of samples required to split an internal node:
• If int, then consider min_samples_split as the minimum number.
• If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) is the minimum number of samples for each split.
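
• A small illustrative example of these two parameters on the Iris data (not from the lecture):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Constrain the tree: depth capped at 3, and a node needs at least
# 10 samples before it may be split.
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3, min_samples_split=10)
clf = clf.fit(X, y)
print(clf.get_depth(), clf.get_n_leaves())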


Random Forest

Building and Evaluating the random forest


Ensemble Technique
• Ensemble methods use two main techniques:
1. Bagging (Bootstrap aggregation)
2. Boosting

1. Bagging
• Random Forest
2. Boosting
• AdaBoost
• Gradient Boosting
• Extreme Gradient Boosting (XGBoost)

Random Forest-Bagging
• A technique known as bagging is used to create an ensemble of trees, where multiple training sets are generated by sampling with replacement.

• In the bagging technique, a data set is divided into N samples using randomized sampling with replacement. Then, using a single learning algorithm, a model is built on each sample. Finally, the resulting predictions are combined using voting or averaging; the models can be trained in parallel.
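
• A minimal bagging sketch with scikit-learn (illustrative only; it uses a built-in dataset so it is self-contained):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
train_x, test_x, train_y, test_y = train_test_split(X, y, test_size=0.3, random_state=0)

# Bagging: many trees, each fit on a bootstrap (with-replacement) sample of the rows,
# with their predictions combined by majority vote.
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, bootstrap=True, random_state=0)
bag.fit(train_x, train_y)
print(bag.score(test_x, test_y))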


Ensemble Technique – Bagging cont…

Training Data (TD)
→ row samples with replacement: RSD1, RSD2, RSD3, ..., RSDn
→ learning models: M1, M2, M3, ..., Mn
→ model predictions: R1, R2, R3, ..., Rn
→ majority voting
→ final prediction


Random Forest
• Random Forest is an example of ensemble learning, in which we
combine multiple machine learning algorithms to obtain better
predictive performance.

Why the name “Random”?


Two key concepts that give it the name random:
• A random sampling of training data set when building trees.
• Random subsets of features considered when splitting nodes.
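
• A small NumPy sketch (illustrative only) of these two sources of randomness:

import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 100, 8

# 1) Random sampling of training rows, with replacement (the bootstrap sample)
row_idx = rng.integers(0, n_samples, size=n_samples)

# 2) A random subset of the features considered at a split
#    (sqrt(n_features) is a common choice for classification)
feature_idx = rng.choice(n_features, size=int(np.sqrt(n_features)), replace=False)

print(row_idx[:10], feature_idx)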


Random Forest cont…


• Decision trees are easy to build, easy to use and easy to interpret

• But in practice they are not that awesome

• They work great with the data used to create them, but
they are not flexible when it comes to classifying new
samples


Random Forest
• The Random Forest combines the simplicity of decision trees with flexibility, resulting in a vast improvement in accuracy


Ensemble Technique – Random Forest


RSD : row selection data
FSD : feature selection data

Training Data (TD)
→ row and feature samples with replacement: RSD1+FSD1, RSD2+FSD2, ..., RSDn+FSDn
→ decision-tree models: M1, M2, M3, ..., Mn
→ model predictions: R1, R2, R3, ..., Rn
→ majority voting
→ final prediction (for a regression problem, take the mean or median of all predictions)


Random Forest
• Let's make a random forest
• Step 1: Create a "bootstrapped" dataset



Random Forest
• We have now created the bootstrapped dataset


Random Forest
• Step 2: Create a decision tree using the bootstrapped
dataset, but only use a random subset of variables (or
columns) at each step


Random Forest
• In this example, we will only consider 2 variables (columns) at each step
• Note: We will talk more about how to determine the
optimal number of variables to consider later



Random Forest
• We built a tree...

(1) using a bootstrapped dataset

(2) only considering a random subset of variables at each step


Random Forest
• Ideally, do this hundreds of times

• Using a bootstrapped sample and considering only a subset of the variables at each step results in a wide variety of trees

• The variety is what makes random forests more effective than individual decision trees


Random Forest
• How do we use it?

• We have all the measurements for a new sample

• and now want to know whether the patient has heart disease or not


Random Forest

• The first tree says yes



Random Forest
• Repeat this for all trees


Random Forest
• After running the data down all of the trees in the random
forest, we see which option received more votes



Random Forest
• Terminology Alert!

• Bootstrapping the data plus using the aggregate to make a decision is called "Bagging"

• Data that is not used when building a given tree is called its "Out-of-Bag" dataset
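
• In scikit-learn, this out-of-bag data can serve as a built-in validation set via oob_score (illustrative sketch, using a built-in dataset):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# oob_score=True scores each tree on the rows left out of its bootstrap sample
model = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
model.fit(X, y)
print(model.oob_score_)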


Random Forest-code
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
Data = pd.read_csv('diabetes.csv')
print(Data.shape)
print(Data.head())
InputTrain = Data.drop('Outcome',axis='columns')
InputTarget = Data['Outcome']

Random Forest-code
train_x, test_x, train_y, test_y = train_test_split(InputTrain, InputTarget, test_size=0.3)
model = RandomForestClassifier(n_estimators=200, max_depth=5, random_state=0, criterion='entropy')
print(model.fit(train_x, train_y))
RF_pred = model.predict(test_x)
acc = accuracy_score(test_y, RF_pred)
print(acc)


Random Forest-code
class sklearn.ensemble.RandomForestClassifier(n_estimators=100, *,
criterion='gini', max_depth=None, min_samples_split=2,
min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto',
max_leaf_nodes=None, min_impurity_decrease=0.0,
min_impurity_split=None, bootstrap=True, oob_score=False,
n_jobs=None, random_state=None, verbose=0, warm_start=False,
class_weight=None, ccp_alpha=0.0, max_samples=None)

# Parameters of Random Forest


# https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html


Random Forest Hyperparameters


1. max_depth
2. min_samples_split
3. max_leaf_nodes
4. min_samples_leaf
5. n_estimators
6. max_samples (bootstrap sample size)
7. max_features
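
• The effect of any of these hyperparameters can be checked with a simple sweep that compares train and test accuracy, as in the plots discussed below (illustrative sketch on a built-in dataset):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
train_x, test_x, train_y, test_y = train_test_split(X, y, test_size=0.3, random_state=0)

# Vary one hyperparameter while keeping the others fixed
for max_depth in (2, 4, 8, 16, None):
    model = RandomForestClassifier(n_estimators=100, max_depth=max_depth, random_state=0)
    model.fit(train_x, train_y)
    print(max_depth, model.score(train_x, train_y), model.score(test_x, test_y))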


Random Forest Hyperparameter #1: max_depth


1. Let’s discuss the critical max_depth hyperparameter first. The
max_depth of a tree in Random Forest is defined as the longest
path between the root node and the leaf node:


Random Forest Hyperparameter #1: max_depth


• Using the max_depth parameter, we can limit the depth to which every tree in the random forest is allowed to grow.


Random Forest Hyperparameter #1: max_depth


• In this graph, we can clearly see that as the max depth of the decision
tree increases, the performance of the model over the training set
increases continuously.
• On the other hand as the max_depth value increases, the performance
over the test set increases initially but after a certain point, it starts to
decrease rapidly.

• Can you think of a reason for this? The tree starts to overfit the training
set and therefore is not able to generalize over the unseen points in the
test set.


Random Forest Hyperparameter #2: min_sample_split


• A parameter that tells each decision tree in a random forest the minimum number of observations required in a node in order to split it.

• The default value of min_samples_split is 2. This means that if any terminal node has at least two observations and is not a pure node, we can split it further into sub-nodes.


Random Forest Hyperparameter #2: min_sample_split


• Having a default value of 2 poses the issue that a tree often keeps on splitting until the nodes are completely pure. As a result, the tree grows in size and therefore overfits the data.


Random Forest Hyperparameter #2: min_sample_split


• By increasing the value of min_samples_split, we can reduce the number of splits that happen in the decision tree and therefore prevent the model from overfitting.
• In the above example, if we increase the min_samples_split value from 2 to 6, the tree on the left would then look like the tree on the right.


Random Forest Hyperparameter #2: min_sample_split


• Now, let’s look at the effect of
min_samples_split on the
performance of the model.
• The graph below is plotted
considering that all the other
parameters remain the same and
only the value of min_samples_split
is changed:


Random Forest Hyperparameter #2: min_sample_split


• On increasing the value of the min_samples_split hyperparameter, we can clearly see that for small parameter values there is a significant difference between the training score and the test score
• But as the value of the parameter increases, the difference between the train score and the test score decreases.


Random Forest Hyperparameter #2: min_sample_split


• But there’s one thing you should keep in mind
• When the parameter value increases too much, there is an overall
dip in both the training score and test scores

• This is due to the fact that the minimum requirement of splitting a


node is so high that there are no significant splits observed

• As a result, the random forest starts to underfit.


Random Forest Hyperparameter #3: max_terminal_nodes


• Next, let’s move on to another Random Forest hyperparameter
called max_leaf_nodes.

• This hyperparameter sets a condition on the splitting of the nodes


in the tree and hence restricts the growth of the tree.

• If after splitting we have more terminal nodes than the specified


number of terminal nodes, it will stop the splitting and the tree
will not grow further.


Random Forest Hyperparameter #3: max_terminal_nodes


• Let’s say we set the maximum terminal nodes as 2 in this case. As
there is only one node, it will allow the tree to grow further:


Random Forest Hyperparameter #3: max_terminal_nodes


• Now, after the first split, you can see that there are 2 nodes here
and we have set the maximum terminal nodes as 2.

• Hence, the tree will terminate here and will not grow further. This
is how setting the maximum terminal nodes or max_leaf_nodes
can help us in preventing overfitting.


Random Forest Hyperparameter #3: max_terminal_nodes


• Note that if the value of the
max_leaf_nodes is very small, the
random forest is likely to underfit.
Let’s see how this parameter
affects the random forest model’s
performance:


Random Forest Hyperparameter #3: max_terminal_nodes


• We can see that when the parameter value is very small, the tree
is underfitting and as the parameter value increases, the
performance of the tree over both test and train increases.

• According to this plot, the tree starts to overfit as the parameter


value goes beyond 25.


Random Forest Hyperparameter #4: min_samples_leaf


• Time to shift our focus to min_sample_leaf. This Random Forest
hyperparameter specifies the minimum number of samples that
should be present in the leaf node after splitting a node.


Random Forest Hyperparameter #4: min_samples_leaf


• Let’s understand min_sample_leaf using an example. Let’s say we
have set the minimum samples for a terminal node as 5:


Random Forest Hyperparameter #4: min_samples_leaf


• The tree on the left represents an unconstrained tree. Here, the
nodes marked with green color satisfy the condition as they have a
minimum of 5 samples. Hence, they will be treated as the leaf or
terminal nodes.

• However, the red node has only 3 samples and hence it will not be
considered as the leaf node. Its parent node will become the leaf
node. That’s why the tree on the right represents the results when
we set the minimum samples for the terminal node as 5.
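
• In scikit-learn this constraint is set per tree of the forest (illustrative sketch):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# Every leaf of every tree must keep at least 5 training samples
model = RandomForestClassifier(n_estimators=100, min_samples_leaf=5, random_state=0).fit(X, y)
print(model.estimators_[0].get_n_leaves())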


Random Forest Hyperparameter #4: min_samples_leaf


• If we plot the performance/parameter
value plot as before:

• We can clearly see that the Random


Forest model is overfitting when the
parameter value is very low (when
parameter value < 100), but the
model performance quickly rises up
and rectifies the issue of overfitting
(100 < parameter value < 400). But
when we keep on increasing the value
of the parameter (> 500), the model
slowly drifts towards the realm of
underfitting.

Random Forest Hyperparameter #5: n_estimators


• We know that a Random Forest algorithm is nothing but a grouping of trees. But how many trees should we consider? That's a common question new data scientists ask. And it's a valid one!

• We might say that more trees should be able to produce a more generalized result, right? But by choosing more trees, the time complexity of the Random Forest model also increases.


Random Forest Hyperparameter #5: n_estimators


• In this graph, we can clearly see that the performance of the
model sharply increases and then stagnates at a certain level:


Random Forest Hyperparameter #5: n_estimators


• This means that choosing a large number of estimators in a
random forest model is not the best idea.

• Although adding more trees will not degrade the model, stopping early saves computational cost and prevents the use of a fire extinguisher on your CPU!


Random Forest Hyperparameter #6: max_samples


• The max_samples
hyperparameter
determines what fraction
of the original dataset is
given to any individual
tree.
• You might be thinking that
more data is always better.
Let’s try to see if that
makes sense.

Random Forest Hyperparameter #6: max_samples


• We can see that the performance of the model rises sharply and
then saturates fairly quickly. Can you figure out what the key
takeaway from this visualization is?

• It is not necessary to give each decision tree of the Random Forest


the full data. If you would notice, the model performance reaches
its max when the data provided is less than 0.2 fraction of the
original dataset. That’s quite astonishing!


Random Forest Hyperparameter #6: max_samples


• Although this fraction will differ from dataset to dataset, we can
allocate a lesser fraction of bootstrapped data to each decision
tree. As a result, the training time of the Random Forest model is
reduced drastically.
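
• A sketch of the corresponding scikit-learn setting (illustrative only):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# Each tree gets a bootstrap sample of only ~20% of the rows
model = RandomForestClassifier(n_estimators=200, max_samples=0.2, random_state=0)
model.fit(X, y)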


Random Forest Hyperparameter #7: max_features


• Finally, we will observe the effect of the max_features hyperparameter. This is the maximum number of features provided to each tree in a random forest when looking for the best split.

• We know that random forest chooses some random samples from


the features to find the best split. Let’s see how varying this
parameter can affect our random forest model’s performance.


Random Forest Hyperparameter #7: max_features


• We can see that the performance
of the model initially increases as
the number of max_feature
increases.
• But, after a certain point, the
train_score keeps on increasing.
But the test_score saturates and
even starts decreasing towards the
end, which clearly means that the
model starts to overfit.


Random Forest Hyperparameter #7: max_features


• Ideally, the overall performance of the model is highest when max_features is close to 6.
• It is a good convention to start from the default value of this parameter, which is the square root of the number of features present in the dataset.
• The ideal value of max_features generally tends to lie close to this default.
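
• That default corresponds to the following scikit-learn setting (illustrative sketch):

from sklearn.ensemble import RandomForestClassifier

# Consider only sqrt(n_features) randomly chosen features at each split
# (the usual default for classification forests)
model = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)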
