Machine Learning
Lecture – 4: Decision Tree
Based on Machine Learning by Tom M. Mitchell
Outline
• Introduction
• Example
• ID3 Heuristic algorithm
• Principles
• Entropy
• Information gain
• Evaluations
- Classification
- Regression
Decision Tree
• Each decision is usually made on the value of a feature or attribute of the instance
• Each test is on a single attribute, and there is a branch for each possible outcome
• V: the possible values of attribute A
• S: the set of examples {x}
• Sv: the subset of S where xA = v
• For obvious reasons, it does not make sense to test the same attribute twice along the same branch.
1. Entropy
2. Information gain
3. Gini index
4. Gain ratio
5. Reduction in variance
6. Chi-square
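The first three measures are simple to compute directly. A minimal sketch for NumPy label arrays (the helper names are mine, not from the slides):

import numpy as np

def entropy(y):
    # Shannon entropy of a label array, in bits
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(y):
    # Gini index of a label array
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(y, mask):
    # entropy reduction from splitting y by a boolean mask
    w = mask.mean()
    return entropy(y) - w * entropy(y[mask]) - (1 - w) * entropy(y[~mask])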
Example cont…
• H(S) = 1
Information gain of A:
For A ≥ 5 (12 of the 16 examples):
H(SA≥5) = -[5/12 log2(5/12) + 7/12 log2(7/12)] = 0.9799
For A < 5 (4 of the 16 examples):
H(SA<5) = -[3/4 log2(3/4) + 1/4 log2(1/4)] = 0.8113
IG(A) = 1 - (12/16)(0.9799) - (4/16)(0.8113) ≈ 0.0622
• H(S) = 1
Information gain of B:
For B ≥ 3 (12 of the 16 examples):
H(SB≥3) = -[8/12 log2(8/12) + 4/12 log2(4/12)] = 0.9183
For B < 3 (4 of the 16 examples):
H(SB<3) = -[0/4 log2(0/4) + 4/4 log2(4/4)] = 0 (a pure subset)
IG(B) = 1 - (12/16)(0.9183) - (4/16)(0) ≈ 0.3113
• Similarly:
IG(C) = 0.54880
IG(D) = 0.41189
• C has the highest information gain of the four attributes, so it is chosen for the split.
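The arithmetic above is easy to check numerically; a small sketch using NumPy, with the class counts per branch read off the slide's fractions:

import numpy as np
p = np.array([5/12, 7/12]); print(-np.sum(p * np.log2(p)))   # 0.9799
p = np.array([3/4, 1/4]);   print(-np.sum(p * np.log2(p)))   # 0.8113
p = np.array([8/12, 4/12]); print(-np.sum(p * np.log2(p)))   # 0.9183
print(1 - (12/16)*0.9799 - (4/16)*0.8113)   # IG(A) ≈ 0.0622
print(1 - (12/16)*0.9183 - (4/16)*0.0)      # IG(B) ≈ 0.3113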
Decision Tree-example 2
Information gain = H(S) − Σv∈V (|Sv| / |S|) · H(Sv)
- Low bias
- High variance
Decision Tree
Python Code
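The code slides here are images in the original deck. As a stand-in, here is a minimal sketch of the usual scikit-learn workflow; the iris dataset is an assumption of mine, chosen so that the plot_tree and export_text calls below have a fitted clf to work with:

>>> from sklearn.datasets import load_iris
>>> from sklearn import tree
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> clf = tree.DecisionTreeClassifier(criterion='entropy')
>>> clf = clf.fit(X, y)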
• Once trained, you can plot the tree with the plot_tree function:
>>> tree.plot_tree(clf)
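A slightly fuller usage sketch, assuming matplotlib is available:

>>> import matplotlib.pyplot as plt
>>> tree.plot_tree(clf, feature_names=iris.feature_names, filled=True)
>>> plt.show()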
• Alternatively, the tree can also be exported in textual format with the
function export_text.
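For example, continuing with the clf fitted above:

>>> from sklearn.tree import export_text
>>> print(export_text(clf, feature_names=iris.feature_names))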
Random Forest
Ensemble Technique
• Ensemble methods use two main techniques:
1. Bagging (Bootstrap aggregation)
2. Boosting
1. Bagging
• Random Forest
2. Boosting
• AdaBoost
• Gradient Boosting
• Extreme Gradient Boosting (XGBoost)
Random Forest-Bagging
• Bagging is a technique used to create an ensemble of trees: multiple training sets are generated by sampling from the original data with replacement.
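scikit-learn exposes this directly through BaggingClassifier; a minimal sketch, assuming the train/test split used in the code later in this deck:

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# 10 trees, each fit on a bootstrap sample drawn with replacement
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=10, bootstrap=True)
bag.fit(train_x, train_y)
print(bag.score(test_x, test_y))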
[Diagram: learning models M1, M2, …, Mn each produce a prediction R1, R2, …, Rn; majority voting over these predictions gives the final prediction]
Random Forest
• Random Forest is an example of ensemble learning, in which we
combine multiple machine learning algorithms to obtain better
predictive performance.
• Decision trees work great with the data used to create them, but they are not flexible when it comes to classifying new samples
Random Forest
• Random Forests combine the simplicity of decision trees with flexibility, resulting in a vast improvement in accuracy
[Diagram: models M1, M2, …, Mn produce predictions R1, R2, …, Rn; majority voting gives the final prediction]
• In the case of a regression problem, take the mean or median of all the predictions instead.
Random Forest
• Let's make a random forest
• Step 1: Create a "bootstrapped" dataset (see the sketch below)
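A minimal NumPy sketch of the bootstrapping step (the data array is a placeholder):

import numpy as np

data = np.arange(12)               # placeholder: 12 rows of a dataset
rng = np.random.default_rng(0)
idx = rng.choice(len(data), size=len(data), replace=True)
bootstrap_sample = data[idx]       # some rows repeat, others are left out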
Random Forest
• We have now created the bootstrapped dataset
Random Forest
• Step 2: Create a decision tree using the bootstrapped
dataset, but only use a random subset of variables (or
columns) at each step
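In scikit-learn, this per-split subsampling of columns is what the max_features parameter controls; a short sketch:

from sklearn.ensemble import RandomForestClassifier

# consider only 2 randomly chosen columns at each split, as in this example
rf = RandomForestClassifier(n_estimators=100, max_features=2)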
Random Forest
• In this example, we will only consider 2 variables (columns) at each step
• Note: We will talk more about how to determine the
optimal number of variables to consider later
Random Forest
• We built a tree...
Random Forest
• Ideally, repeat this hundreds of times: each iteration builds a new bootstrapped dataset and a new tree (see the sketch below)
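Putting steps 1 and 2 in a loop gives a from-scratch sketch of the whole procedure (all names are illustrative; scikit-learn trees are assumed as the base learner):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def build_forest(X, y, n_trees=100, n_feats=2, seed=0):
    rng = np.random.default_rng(seed)
    forest = []
    for _ in range(n_trees):
        idx = rng.choice(len(X), size=len(X), replace=True)   # step 1: bootstrap the rows
        t = DecisionTreeClassifier(max_features=n_feats)      # step 2: random columns per split
        t.fit(X[idx], y[idx])
        forest.append(t)
    return forest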
Random Forest
• How do we use it? Run a new sample down all of the trees
Random Forest
• Keep repeating this for all of the trees in the forest
Random Forest
• After running the data down all of the trees in the random
forest, we see which option received more votes
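A small sketch of the vote count, continuing the hypothetical build_forest example above:

import numpy as np

def forest_predict(forest, x):
    # run the sample down every tree and collect one vote per tree
    votes = np.array([t.predict(x.reshape(1, -1))[0] for t in forest])
    values, counts = np.unique(votes, return_counts=True)
    return values[np.argmax(counts)]   # the option with the most votes wins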
Random Forest
• Terminology alert! Bootstrapping the data and aggregating the results to make a decision is called "Bagging" (bootstrap aggregation, as noted earlier)
Random Forest-code
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

Data = pd.read_csv('diabetes.csv')
print(Data.shape)
print(Data.head())
InputTrain = Data.drop('Outcome', axis='columns')
InputTarget = Data['Outcome']
Random Forest-code
train_x, test_x, train_y, test_y = train_test_split(InputTrain, InputTarget, test_size=0.3)
model = RandomForestClassifier(n_estimators=200, max_depth=5, random_state=0, criterion='entropy')
print(model.fit(train_x, train_y))
RF_pred = model.predict(test_x)
acc = accuracy_score(test_y, RF_pred)
print(acc)
Random Forest-code
class sklearn.ensemble.RandomForestClassifier(n_estimators=100, *, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, class_weight=None, ccp_alpha=0.0, max_samples=None)
• Can you think of a reason for this? The tree starts to overfit the training set and is therefore unable to generalize to the unseen points in the test set.
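One way to see this effect, sketched against the train/test split from the random-forest code above:

from sklearn.tree import DecisionTreeClassifier

for depth in range(1, 15):
    dt = DecisionTreeClassifier(max_depth=depth, random_state=0)
    dt.fit(train_x, train_y)
    # train accuracy keeps climbing while test accuracy eventually drops
    print(depth, dt.score(train_x, train_y), dt.score(test_x, test_y))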
• Hence, the tree will terminate here and will not grow further. This is how setting the maximum number of terminal nodes, max_leaf_nodes, can help us prevent overfitting.
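A short sketch of the parameter in use, reusing the same split:

from sklearn.tree import DecisionTreeClassifier

# growth stops once the tree has 8 leaves, no matter how deep it is
dt = DecisionTreeClassifier(max_leaf_nodes=8, random_state=0)
dt.fit(train_x, train_y)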
• However, the red node has only 3 samples, so it will not be considered a leaf node; its parent node becomes the leaf node instead. That's why the tree on the right represents the result when we set the minimum samples for a terminal node to 5.
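The corresponding scikit-learn parameter is min_samples_leaf; a short sketch:

from sklearn.tree import DecisionTreeClassifier

# a split is only kept if each resulting leaf has at least 5 samples
dt = DecisionTreeClassifier(min_samples_leaf=5, random_state=0)
dt.fit(train_x, train_y)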
• Although it will not degrade the model, it can save you computational complexity and prevent the use of a fire extinguisher on your CPU!