DT 03: Algorithm Behind Decision Tree
Section 3

Technique behind a decision tree

How does a decision tree develop? First, for a categorical dependent variable:
• GINI method
• Steps taken by software programs to learn the classification (develop the tree)
• Steps taken by software programs to apply the learning to unseen data

Learning more from a practical point of view:
• Focus on basic and practical material
• Brief discussion of advanced approaches
• Decision tree (CART) for a numeric outcome
• Chi-square method, etc.
Split based on a numeric independent variable
Gini Index – what is it?

For a node with two classes, GINI = 1 − p1² − p2², where p_i is the proportion of records in class i.

C1 = 6, C2 = 0: GINI = 1 − 1² − 0² = 0
Verdict – no impurity

C1 = 4, C2 = 2 (or C1 = 2, C2 = 4): GINI = 1 − (4/6)² − (2/6)² ≈ 0.44
Verdict – some impurity
[Figure: Gini index plotted against the proportion of class 1 (x axis 0 to 1, y axis 0 to 0.5); impurity peaks at 0.5 when the two classes are equally mixed and falls to 0 at a pure node.]
Decision Tree – created by Gopal Prasad Malakar
Gini Index details

• Maximum value (1 − 1/n_c, where n_c is the number of classes) when records are equally distributed among all classes, implying the least interesting information.
• Minimum value (0.0) when all records belong to one class, implying the most interesting information.
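As a sketch, the node-level Gini impurity described above can be computed from raw class counts; the `gini` helper below is illustrative, not taken from any particular software package.

```python
# Gini impurity of a node from its class counts: GINI = 1 - sum(p_i^2).

def gini(counts):
    """counts: list of record counts per class at one node."""
    n = sum(counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts)

print(gini([6, 0]))  # pure node -> 0.0 (most interesting)
print(gini([3, 3]))  # evenly mixed -> 0.5 (least interesting, two classes)
```

Note that 0.5 equals 1 − 1/n_c for n_c = 2, matching the maximum stated above.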
Let’s calculate. Parent node: C1 = 3, C2 = 3, split into a left child (C1 = 1, C2 = 3) and a right child (C1 = 2, C2 = 0):

GINI (left) = 1 − (1/4)² − (3/4)² = 0.375
GINI (right) = 1 − (2/2)² − (0/2)² = 0
GINI (split) = (4/6) × 0.375 + (2/6) × 0 = 0.25

Let’s calculate again. Parent node: C1 = 3, C2 = 3, split into two pure children (C1 = 3, C2 = 0 and C1 = 0, C2 = 3):

GINI (split) = (3/6) × Gini left (0) + (3/6) × Gini right (0) = 0 (best split)
Gini Index of a split

When a node p is split into k partitions (children), the quality of the split is computed as

GINI(split) = Σ_{i=1..k} (n_i / n) × GINI(i)

• where n_i = number of records at child i,
• n = number of records at node p
Let’s calculate. Parent node: C1 = 4, C2 = 4.

Split into two children of C1 = 2, C2 = 2 each:
GINI (split) = (4/8) × Gini left (0.5) + (4/8) × Gini right (0.5) = 0.5 (worst possible split)

Split into children with Gini 0.4897 (7 records) and Gini 0 (1 record):
GINI (split) = (7/8) × Gini left (0.4897) + (1/8) × Gini right (0) ≈ 0.43 (not so good a split)
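The split formula and the worked examples above can be sketched in code; the `gini_split` helper below is a hypothetical illustration that takes one class-count list per child.

```python
# Weighted Gini of a split: each child's impurity weighted by its share of records.

def gini(counts):
    """Gini impurity of one node from its class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0

def gini_split(children):
    """children: list of class-count lists, e.g. [[2, 2], [2, 2]]."""
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * gini(c) for c in children)

print(round(gini_split([[2, 2], [2, 2]]), 2))  # 0.5  (worst possible split)
print(round(gini_split([[3, 4], [1, 0]]), 2))  # 0.43 (not so good a split)
print(round(gini_split([[3, 0], [0, 3]]), 2))  # 0.0  (best split)
```

The child counts [3, 4] and [1, 0] are an assumed example that reproduces the 0.4897 left-child Gini and the 7/8 and 1/8 weights from the slide.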
Final touch

• How do we select the variable and the cut?
• Understanding splits of categorical variables
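Selecting the variable and the cut can be sketched as an exhaustive search: for each numeric variable, try the midpoints between sorted distinct values and keep the (variable, cut) pair with the lowest weighted Gini. The `best_split` function and sample data below are illustrative assumptions, not the slides' software.

```python
# Exhaustive search for the best (variable, cut) pair by weighted Gini.

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels)) if n else 0.0

def best_split(rows, labels):
    """rows: list of feature dicts; labels: one class label per row."""
    best = (None, None, float("inf"))          # (variable, cut, weighted Gini)
    for var in rows[0]:
        values = sorted({r[var] for r in rows})
        for lo, hi in zip(values, values[1:]):
            cut = (lo + hi) / 2                # candidate midpoint
            left = [y for r, y in zip(rows, labels) if r[var] < cut]
            right = [y for r, y in zip(rows, labels) if r[var] >= cut]
            w = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if w < best[2]:
                best = (var, cut, w)
    return best

rows = [{"Duration": 10}, {"Duration": 30}, {"Duration": 35}, {"Duration": 12}]
labels = ["Yes", "No", "No", "Yes"]
print(best_split(rows, labels))  # ('Duration', 21.0, 0.0)
```

Here the cut 21.0 (midpoint of 12 and 30) separates the classes perfectly, so its weighted Gini is 0.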
Decision Tree Example

CHK_ACCT (70%)
├── < 1.5  → Duration
│            ├── >= 22.5 → SAV_ACCT
│            │             ├── < 2.5  → Node 4 (37%)
│            │             └── >= 2.5 → Node 5 (71%)
│            └── < 22.5  → Node 6 (65%)
└── >= 1.5 → Node 7 (87%)

(Percentages are the response rates at each node.)
The model is learned from the training set (records with known class, e.g. Tid 6: No, Medium, 60K, Class = No) and then applied to the test set, whose class is unknown:

Tid | Attrib1 | Attrib2 | Attrib3 | Class
11  | No      | Small   | 55K     | ?
15  | No      | Large   | 67K     | ?
Apply Model to Test Data

Start from the root of the tree, with the new data record:

CHK_ACCT | Duration | SAV_ACCT | Response Rate
1.25     | 23       | 4        | ?

• CHK_ACCT = 1.25 < 1.5, so take the left branch to the Duration node.
• Duration = 23 >= 22.5, so continue to the SAV_ACCT node.
• SAV_ACCT = 4 >= 2.5, so the record lands in Node 5 (71%).

Assume a 71% response rate from such an account set: the predicted response rate for this record is 71%.
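The traversal above can be sketched as code: the nested-dict tree below is a hypothetical encoding of the pictured tree (leaf values are the node response rates), and `predict` walks a record from the root to a leaf.

```python
# Apply a learned decision tree to a new record by walking root -> leaf.

tree = {"var": "CHK_ACCT", "cut": 1.5,
        "lt": {"var": "Duration", "cut": 22.5,
               "ge": {"var": "SAV_ACCT", "cut": 2.5,
                      "lt": 0.37,            # Node 4: 37% response rate
                      "ge": 0.71},           # Node 5: 71% response rate
               "lt": 0.65},                  # Node 6: 65% response rate
        "ge": 0.87}                          # Node 7: 87% response rate

def predict(node, record):
    """Follow the split at each internal node until a leaf value is reached."""
    while isinstance(node, dict):
        branch = "lt" if record[node["var"]] < node["cut"] else "ge"
        node = node[branch]
    return node

new_data = {"CHK_ACCT": 1.25, "Duration": 23, "SAV_ACCT": 4}
print(predict(tree, new_data))  # 0.71
```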
• Stopping criteria (advanced)
K-Fold Cross Validation

The data is divided into k equal folds. The model is built k times; in each round one fold (fold 2 in one round, fold 4 in another, and so on) is held out to validate the model while the remaining k − 1 folds are used for training.
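The fold rotation described above can be sketched as follows; `k_folds` is a hypothetical index-splitting helper (real libraries typically shuffle the records first).

```python
# K-fold cross validation: each fold is held out once for validation
# while the remaining k-1 folds form the training set.

def k_folds(n, k):
    """Yield (train_indices, validation_indices) pairs for n records."""
    indices = list(range(n))
    size = n // k
    for i in range(k):
        # Last fold absorbs any remainder when n is not divisible by k.
        val = indices[i * size:(i + 1) * size] if i < k - 1 else indices[i * size:]
        train = [j for j in indices if j not in val]
        yield train, val

for train, val in k_folds(10, 5):
    print(len(train), len(val))  # 8 2, five times
```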
(Contingency-table column totals: 78, 87, 65; grand total 230.)

χ² = Σ (f_o − f_e)² / f_e

where f_o is the observed cell frequency and f_e = (row total × column total) / grand total is the expected one; for example, a cell in a row with total 90 and a column with total 78 has f_e = (90 × 78) / 230 ≈ 30.5.
Chi-square measure:

• The numerator gives more weight to large deviations from expectation than to small ones.
• The denominator ensures that chi-square is a relative measure of deviation from expectation.
• The greater the value of the chi-square statistic, the stronger the relationship between the independent and dependent variables.
• Degrees of freedom = (r − 1) × (c − 1), for r rows and c columns.
• The lower the p-value, the stronger the relationship between the dependent and independent variables.
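The chi-square computation above can be sketched directly from a table of observed counts; the `chi_square` function and the 2×2 example table below are illustrative assumptions.

```python
# Chi-square statistic of a contingency table:
# chi2 = sum over cells of (f_o - f_e)^2 / f_e,
# where f_e = (row total * column total) / grand total.

def chi_square(table):
    """table: list of rows of observed cell counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, f_o in enumerate(row):
            f_e = row_totals[i] * col_totals[j] / total   # expected count
            chi2 += (f_o - f_e) ** 2 / f_e
    return chi2

table = [[50, 30],
         [20, 40]]
print(round(chi_square(table), 2))  # 11.67
# Degrees of freedom for this 2x2 table: (2 - 1) * (2 - 1) = 1
```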
Another decision-tree technique

CHAID stands for Chi-squared Automatic Interaction Detector:
• Understand chi-square.
• Use the chi-square statistic for tree development.

For a deeper understanding of chi-square, see the videos from the "Statistics" course attached in the appendix; a PDF of the same material will also be available.