
Algorithm Behind Decision Tree

- How does a tree develop?

1
Section 3
 Technique behind the decision tree – how does a decision tree develop? First for a categorical dependent variable
• GINI method
• Steps taken by software programs to learn the classification (develop the tree)
• Steps taken by software programs to apply the learning to unseen data
 Learning more from a practical point of view
• Focus on basic and practical material
• Brief discussion of advanced approaches
 Decision Tree (CART) for a numeric outcome
 Chi-square method etc.

2
Numeric independent variable based split

Decision Tree– created by Gopal Prasad Malakar 3


Split for continuous variable
 How do we use a continuous variable for the split (where do I define the split)?
 Q: Where do splits need to be evaluated?

[Figure: the same records shown twice, once sorted by Age and once sorted by Blood Pressure, to illustrate the candidate cut points]

Two things have to be decided:
1. which variable to split on, and
2. where to cut that variable for the best split
4
Split for continuous variable
 How do we use a continuous variable for the split (where do I define the split)?
 For the sake of simplicity, splits of a numeric variable can be taken at its 1st percentile, 2nd percentile, 3rd percentile, and so on
 How do you know which variable and which split to take?
 In fact, if you have 20 numeric independent variables, you might end up evaluating 20 × 99 = 1,980 candidate splits
 And then take the best split (see the sketch below)
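A quick R sketch (my own illustration, not from the original slides) of generating the 99 percentile-based candidate cut points for one numeric variable; age here is just a made-up vector:

# Candidate cut points for a hypothetical numeric independent variable 'age'
age <- round(runif(500, 18, 75))                                  # toy data
cut_points <- quantile(age, probs = seq(0.01, 0.99, by = 0.01), names = FALSE)
head(cut_points)      # candidate split values to evaluate
length(cut_points)    # 99 candidates per numeric variable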

 Let’s understand by formula

5
Gini Index – what it is?

Decision Tree– created by Gopal Prasad Malakar 6


Gini Index Method
 One of the most popular methods
 It is a measure of impurity
 Please note – we want purity, i.e. definitive classes
• i.e. nodes that contain only one class of the dependent variable
 Formula –
GINI(t) = 1 – Σj [p(j | t)]²

• (NOTE: p( j | t) is the relative frequency of class j at node t).


 Let’s calculate

GINI = 1 – 1² – 0² = 0

Verdict – no impurity

C1 = 6, C2=0
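A small R helper (my own addition, not the author's code) that computes the Gini index of a node from its class counts and reproduces the value above:

# Gini impurity of a node, given a vector of class counts
gini_node <- function(counts) {
  p <- counts / sum(counts)        # relative frequency of each class
  1 - sum(p^2)
}

gini_node(c(6, 0))                 # 0 -> no impurity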

Decision Tree– created by Gopal Prasad Malakar 7


Gini Index Method
 Formula –
GINI(t) = 1 – Σj [p(j | t)]²

• (NOTE: p( j | t) is the relative frequency of class j at node t).

C1 = 5, C2 = 1:  GINI = 1 – (5/6)² – (1/6)² = 0.278   Verdict – some impurity
C1 = 3, C2 = 3:  GINI = 1 – (3/6)² – (3/6)² = 0.500   Verdict – maximum impurity (?)
C1 = 4, C2 = 2:  GINI = 1 – (4/6)² – (2/6)² = 0.444   Verdict – some more impurity
C1 = 2, C2 = 4:  GINI = 1 – (2/6)² – (4/6)² = 0.444   Verdict – some more impurity

Decision Tree– created by Gopal Prasad Malakar 8


Range of GINI Index

[Chart: the Gini index (measure of impurity) of a two-class node plotted against the proportion of class 1; it is 0 at proportions 0 and 1 and peaks at 0.5 when the two classes are evenly mixed]
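A short R sketch (my addition) that reproduces the shape of this chart for a two-class node:

# Gini index of a two-class node as a function of the proportion p of class 1
p    <- seq(0, 1, by = 0.01)
gini <- 1 - p^2 - (1 - p)^2
plot(p, gini, type = "l",
     xlab = "Proportion of class 1 in the node",
     ylab = "Gini index (measure of impurity)")
max(gini)    # 0.5, reached at p = 0.5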
Decision Tree– created by Gopal Prasad Malakar 9
Gini Index details
 Maximum value is (1 – 1/nc), where nc is the number of classes; it occurs when records are equally distributed among all classes, implying the least interesting information (for two classes the maximum is 0.5)
 Minimum value is 0.0, when all records belong to one class, implying the most interesting information

Decision Tree– created by Gopal Prasad Malakar 10


Gini Index of a split– get a feel?

Decision Tree– created by Gopal Prasad Malakar 11


Gini Index of a split
 When a node p is split into k partitions (children), the quality of the split is computed as

GINI(split) = Σ i=1..k (ni / n) × GINI(i)

• where ni = number of records at child i,
• n = number of records at parent node p

 Let’s calculate

Parent node: C1 = 3, C2 = 3

Left child:  C1 = 1, C2 = 3    GINI(left)  = 1 – (1/4)² – (3/4)² = 0.375
Right child: C1 = 2, C2 = 0    GINI(right) = 1 – (2/2)² – (0/2)² = 0

GINI(split) = (4/6) × 0.375 + (2/6) × 0 = 0.25
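A short R sketch (my own, reusing the gini_node() helper introduced above) that computes the Gini index of a split and reproduces the 0.25 above, plus the best and worst splits shown on the next slides:

# Gini index of a split: weighted average of the children's Gini values
gini_split <- function(child_counts) {          # list of class-count vectors, one per child
  n <- sum(sapply(child_counts, sum))           # records at the parent node
  sum(sapply(child_counts, function(cc) (sum(cc) / n) * gini_node(cc)))
}

gini_split(list(c(1, 3), c(2, 0)))              # 0.25 (the example above)
gini_split(list(c(0, 3), c(3, 0)))              # 0    (best possible split)
gini_split(list(c(2, 2), c(2, 2)))              # 0.5  (worst possible split)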


Decision Tree– created by Gopal Prasad Malakar 12
Gini Index of a split
 When a node p is split into k partitions (children), the quality of the split is computed as

GINI(split) = Σ i=1..k (ni / n) × GINI(i)

• where ni = number of records at child i,
• n = number of records at parent node p

 Let’s calculate

Parent node: C1 = 3, C2 = 3

Left child:  C1 = 0, C2 = 3    GINI(left)  = 1 – (0/3)² – (3/3)² = 0
Right child: C1 = 3, C2 = 0    GINI(right) = 1 – (3/3)² – (0/3)² = 0

GINI(split) = (3/6) × 0 + (3/6) × 0 = 0 (best possible split)
Decision Tree– created by Gopal Prasad Malakar 13
Gini Index of a split
 When a node p is split into k partitions (children), the quality of the split is computed as

GINI(split) = Σ i=1..k (ni / n) × GINI(i)

• where ni = number of records at child i,
• n = number of records at parent node p

 Let’s calculate

Parent node: C1 = 4, C2 = 4

Left child:  C1 = 2, C2 = 2    GINI(left)  = 1 – (2/4)² – (2/4)² = 0.5
Right child: C1 = 2, C2 = 2    GINI(right) = 1 – (2/4)² – (2/4)² = 0.5

GINI(split) = (4/8) × 0.5 + (4/8) × 0.5 = 0.5 (worst possible split)

Decision Tree– created by Gopal Prasad Malakar 14


Gini Index of a split
 When a node p is split into k partitions (children), the quality of the split is computed as

GINI(split) = Σ i=1..k (ni / n) × GINI(i)

• where ni = number of records at child i,
• n = number of records at parent node p

 Let’s calculate

Parent node: C1 = 4, C2 = 4

Left child:  C1 = 4, C2 = 3    GINI(left)  = 1 – (4/7)² – (3/7)² = 0.4898
Right child: C1 = 0, C2 = 1    GINI(right) = 1 – (0/1)² – (1/1)² = 0

GINI(split) = (7/8) × 0.4898 + (1/8) × 0 ≈ 0.43 (not so good a split)
Decision Tree– created by Gopal Prasad Malakar 15
Final touch –
• How to select the variable and the cut?
• Understanding splits of categorical variables

Decision Tree– created by Gopal Prasad Malakar 16


Gini Index of a split
 Effect of weighting partitions:
• larger and purer partitions are sought
 The lower the Gini index of a split, the better the split
 Question – how to select the best variable and which value to take for the split?
 Create a table with each variable, its various cut points and the associated Gini split value:

Variable                        Split Value      Gini Split
Months spent in current home    25               0.002
Spend on fuel                   350              0.03
Months spent in current home    35               0.05
State                           A + B vs. Rest   0.1

 Sort it in ascending order of the GINI split score
 Select the variable and the cut value of the top-most row
 This ensures the maximum reduction in impurity
 Now repeat this process for each child node (a sketch of this search follows below)
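A brute-force R sketch (my own illustration, not the slides' code) of this search over percentile cut points for numeric variables; it reuses gini_node() from above, and the column names in the usage comment are hypothetical:

# Evaluate every percentile cut of every numeric predictor and rank by Gini split
best_split_table <- function(data, target, predictors) {
  rows <- list()
  for (v in predictors) {
    cuts <- unique(quantile(data[[v]], probs = seq(0.01, 0.99, 0.01), names = FALSE))
    for (cut in cuts) {
      left  <- data[[target]][data[[v]] <= cut]
      right <- data[[target]][data[[v]] >  cut]
      if (length(left) == 0 || length(right) == 0) next
      g <- (length(left)  / nrow(data)) * gini_node(table(left)) +
           (length(right) / nrow(data)) * gini_node(table(right))
      rows[[length(rows) + 1]] <- data.frame(Variable = v, Split = cut, GiniSplit = g)
    }
  }
  out <- do.call(rbind, rows)
  out[order(out$GiniSplit), ]      # ascending: the top row is the best split
}
# e.g. head(best_split_table(df, "Target", c("months_in_home", "fuel_spend")))  # hypothetical columns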
Decision Tree– created by Gopal Prasad Malakar 17
Gini Index of a split for a categorical variable
 Let's say a particular categorical variable has three distinct categories A, B and C
 Then obtain the Gini index of a split of the parent node (C1 = 4, C2 = 4) for each candidate binary grouping:

{A} vs. {B, C}     {A, B} vs. {C}     {A, C} vs. {B}
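A tiny R sketch (my own) that just lists these three candidate binary groupings; each grouping would then be scored with the Gini index of the split exactly as for a numeric cut:

# Candidate binary groupings of a three-level categorical variable
cats   <- c("A", "B", "C")
splits <- lapply(seq_along(cats),
                 function(i) list(left = cats[i], right = cats[-i]))
# {A} vs {B, C}, {B} vs {A, C}, {C} vs {A, B}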

Decision Tree– created by Gopal Prasad Malakar 18


Practical approach to tree development and validation

Decision Tree– created by Gopal Prasad Malakar 19


Stopping criteria and model validation

 Practical approach for a data mining scenario
• stop by number of levels (maximum depth)
• stop by requiring a minimum of x observations in each leaf node
• each variable and its cut should not be counter-intuitive to common sense

library(rpart)
Dev_Model_1 <- rpart(Target ~ ., data = base_1,
                     control = rpart.control(
                       minsplit  = 60,  # minimum number of observations that must exist
                                        # in a node in order for a split to be attempted
                       minbucket = 30,  # minimum number of observations in any
                                        # terminal <leaf> node
                       maxdepth  = 4))  # maximum depth of any node in the final tree,
                                        # with the root node counted as depth 0
Decision Tree– created by Gopal Prasad Malakar 20
Stopping criteria and model validation

 Practical way of validation for a data mining scenario
• apply the tree to validation data (preferably out-of-time data) with known dependent-variable values
• calculate
• the response rate and
• the population % in each leaf node
• if these are similar to those of the development data, the model is done and is ready to be used on new datasets (a hedged sketch of this check follows the tables below)

Population stability:
Node   % of population in dev (a)   % of population in val (b)   % change = abs(a – b) / a
7      10%                          11%                           10%
8      12%                          10.5%                         12.5%

Response-rate stability:
Node   Bad rate in dev (a)   Bad rate in val (b)   % change = abs(a – b) / a
7      30%                   28.5%                 5%
8      25%                   27%                   8%

Decision Tree– created by Gopal Prasad Malakar 21
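A hedged R sketch (my own, not from the slides) of this stability check; it assumes the partykit package is available for recovering the leaf node of each record, and dev_data, val_data and a 0/1-coded Target are hypothetical objects:

library(rpart)
library(partykit)                       # assumed available; used only to get leaf-node ids

fit  <- rpart(Target ~ ., data = dev_data, method = "class",
              control = rpart.control(minsplit = 60, minbucket = 30, maxdepth = 4))
pfit <- as.party(fit)

node_stats <- function(model, data) {
  node <- predict(model, newdata = data, type = "node")        # terminal node id per record
  data.frame(node     = sort(unique(node)),
             pop_pct  = as.numeric(prop.table(table(node))),
             bad_rate = as.numeric(tapply(data$Target, node, mean)))  # Target assumed 0/1
}

dev_stats <- node_stats(pfit, dev_data)
val_stats <- node_stats(pfit, val_data)

comp <- merge(dev_stats, val_stats, by = "node", suffixes = c("_dev", "_val"))
comp$pop_change <- abs(comp$pop_pct_dev  - comp$pop_pct_val)  / comp$pop_pct_dev
comp$bad_change <- abs(comp$bad_rate_dev - comp$bad_rate_val) / comp$bad_rate_dev
comp    # % changes per leaf node, as in the tables above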


Applying decision tree learning to new data – using the model

Decision Tree– created by Gopal Prasad Malakar 22


How to classify unseen data
 How is the learning from the training data applied to unseen data?
 The model looks at the independent variables in the order in which the tree was developed
 Conditional move – the value of one variable at a time decides which variable to consider next
 The leaf node gives the final characteristics (see the sketch below)
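A minimal R sketch (my own) of scoring unseen data with a fitted rpart tree; fit and new_data are hypothetical objects:

# Class label of the leaf each unseen record lands in
pred_class <- predict(fit, newdata = new_data, type = "class")
# Per-class probabilities (the class mix of that leaf)
pred_prob  <- predict(fit, newdata = new_data, type = "prob")
head(pred_class); head(pred_prob)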

23
Decision Tree Example

CHK_ACCT (root; overall response rate 70%)
    CHK_ACCT < 1.5  -> Duration node
        Duration >= 22.5 -> SAV_ACCT node
            SAV_ACCT < 2.5  -> Node 4 (37%)
            SAV_ACCT >= 2.5 -> Node 5 (71%)
        Duration < 22.5  -> Node 6 (65%)
    CHK_ACCT >= 1.5 -> Node 7 (87%)

Decision Tree– created by Gopal Prasad Malakar 24


Decision Tree Classification Task

[Schematic: a tree-induction algorithm learns a decision-tree model from the Training Set (induction); the learned model is then applied to the Test Set to predict the unknown classes (deduction).]

Training Set
Tid   Attrib1   Attrib2   Attrib3   Class
1     Yes       Large     125K      No
2     No        Medium    100K      No
3     No        Small     70K       No
4     Yes       Medium    120K      No
5     No        Large     95K       Yes
6     No        Medium    60K       No
7     Yes       Large     220K      No
8     No        Small     85K       Yes
9     No        Medium    75K       No
10    No        Small     90K       Yes

Test Set
Tid   Attrib1   Attrib2   Attrib3   Class
11    No        Small     55K       ?
12    Yes       Medium    80K       ?
13    Yes       Large     110K      ?
14    No        Small     95K       ?
15    No        Large     67K       ?
Apply Model to Test Data
Start from the root of the tree.

New data record:  CHK_ACCT = 1.25,  Duration = 23,  SAV_ACCT = 4,  Response rate = ?

[Tree as on the previous slide: root CHK_ACCT (70%); CHK_ACCT < 1.5 leads to the Duration node and CHK_ACCT >= 1.5 to Node 7 (87%); Duration >= 22.5 leads to the SAV_ACCT node and Duration < 22.5 to Node 6 (65%); SAV_ACCT < 2.5 leads to Node 4 (37%) and SAV_ACCT >= 2.5 to Node 5 (71%)]
Apply Model to Test Data
Start from the root of the tree and follow the record down: CHK_ACCT = 1.25 < 1.5, so move to the Duration node; Duration = 23 >= 22.5, so move to the SAV_ACCT node; SAV_ACCT = 4 >= 2.5, so the record lands in Node 5.

New data record:  CHK_ACCT = 1.25,  Duration = 23,  SAV_ACCT = 4,  Predicted response rate = 71%

Assume a 71% response rate from such an account set (Node 5).
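A hand-coded R sketch (my own) of the conditional moves for this record, with thresholds and leaf rates read off the example tree above:

# Manual traversal of the example tree
score_record <- function(chk_acct, duration, sav_acct) {
  if (chk_acct >= 1.5)  return(c(node = 7, response = 0.87))
  if (duration < 22.5)  return(c(node = 6, response = 0.65))
  if (sav_acct < 2.5)   return(c(node = 4, response = 0.37))
  c(node = 5, response = 0.71)
}

score_record(chk_acct = 1.25, duration = 23, sav_acct = 4)   # Node 5, 71%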
• Stopping criteria – advanced

Decision Tree– created by Gopal Prasad Malakar 31


Stopping criteria – advanced
 First, understand pruning
 Current number of leaf nodes – 8

 After pruning, the number of leaves – 7

Decision Tree– created by Gopal Prasad Malakar 32


Automated decision tree
 Develop a very long tree
• until each node has only one class of the dependent variable
• the largest tree grown is called the "maximal" tree
• the maximal tree could have hundreds or thousands of nodes
• usually we instruct the software to grow only a moderately over-sized tree (x number of levels); a sketch follows after this list
 Understand the error rate
• actual vs. forecast class of the dependent variable
 Trim off the parts of the tree that do not work on validation data
• but which branch to cut first?
• prune away the "weakest link" – the nodes that add least to the overall accuracy of the tree
• if several nodes have the same contribution, they can all be pruned away simultaneously
• the pruning sequence is determined all the way back to the root node
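A hedged R sketch (my own) of growing a deliberately over-sized ("maximal") tree with rpart before pruning; base_1 and Target are the hypothetical objects used in the earlier code:

library(rpart)
# Grow a (near) maximal tree: no complexity penalty, tiny minimum node sizes
max_tree <- rpart(Target ~ ., data = base_1, method = "class",
                  control = rpart.control(cp = 0, minsplit = 2, minbucket = 1,
                                          xval = 10))   # 10-fold cross-validation
nrow(max_tree$frame)    # number of nodes in the grown tree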

Decision Tree– created by Gopal Prasad Malakar 33


Pruning sequence – visualization

Decision Tree– created by Gopal Prasad Malakar 34


Training Data vs. Test Data Error Rates

 Compare error rates measured on
• the development (training) data – R(T)
• the validation dataset – Rts(T)
 The development-data error R(T) always decreases as the tree grows (Q: Why?)
 The validation-data error Rts(T) first declines and then increases (Q: Why?)
 Overfitting is the result of relying too heavily on the training error R(T)
 It can lead to disasters when the model is applied to new data

No. Terminal Nodes   R(T)   Rts(T)
71                   .00    .42
63                   .00    .40
58                   .03    .39
40                   .10    .32
34                   .12    .32
19                   .20    .31
**10                 .29    .30
9                    .32    .34
7                    .41    .47
6                    .46    .54
5                    .53    .61
2                    .75    .82
1                    .86    .91

(** tree with the minimum validation error Rts(T))
CART Summary

 CART key features
• binary splits
   the parent gets two children
   each child produces two grandchildren
   four grandchildren produce eight great-grandchildren
• Gini index as the splitting criterion
• grow, then prune
• surrogates for missing values
• optimal tree – the minimum-error-rate tree on the validation data (once the full tree has been developed on the development data)

36
K Fold Cross Validation

Decision Tree– created by Gopal Prasad Malakar 37


Cross Validation (also known as k-fold validation)

[Figure: the data is randomly split into 5 parts; parts 1, 3, 4 and 5 are used to develop the model, and part 2 is held out to validate it]

Decision Tree– created by Gopal Prasad Malakar 38


Cross Validation (also known as k-fold validation)

[Figure: the same 5-way random split; this time parts 1, 2, 3 and 5 develop the model, and part 4 validates it]

Decision Tree– created by Gopal Prasad Malakar 39


K-fold cross validation

• Divide the data randomly into K parts

• For i = 1 to K:
• keep the i-th part aside and use the remaining K – 1 parts taken together to develop the model
• validate the model on the i-th part
(a minimal sketch follows below)
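A minimal R sketch (my own) of this loop for an rpart tree; base_1 and Target are the hypothetical objects used earlier. Note that rpart also performs this internally, controlled by the xval argument, and reports the result in its cp table:

library(rpart)
K     <- 5
fold  <- sample(rep(1:K, length.out = nrow(base_1)))    # random fold assignment
error <- numeric(K)

for (i in 1:K) {
  train <- base_1[fold != i, ]
  test  <- base_1[fold == i, ]
  fit   <- rpart(Target ~ ., data = train, method = "class")
  pred  <- predict(fit, newdata = test, type = "class")
  error[i] <- mean(pred != test$Target)                 # misclassification rate on fold i
}
mean(error)                                             # cross-validated error estimate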

Decision Tree– created by Gopal Prasad Malakar 40


Applying advanced pruning using R

Decision Tree– created by Gopal Prasad Malakar 41


Demo using R

• Learn how to develop the decision tree having the lowest cross-validation error (a hedged sketch follows below)
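A hedged sketch (my own, not the course demo itself) of pruning to the lowest cross-validation error, using the over-sized max_tree from the earlier sketch (any fitted rpart object would do):

printcp(max_tree)                         # cp table: CP, nsplit, rel error, xerror, xstd
plotcp(max_tree)                          # visualise cross-validation error vs. cp

best_cp <- max_tree$cptable[which.min(max_tree$cptable[, "xerror"]), "CP"]
pruned  <- prune(max_tree, cp = best_cp)  # tree with the lowest cross-validation error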

Decision Tree– created by Gopal Prasad Malakar 42


CART for a numeric dependent variable – what is the meaning of R² here?

Decision Tree– created by Gopal Prasad Malakar 43


Decision Tree in case of a numeric dependent variable
 Use R² (in lieu of the Gini index of a split) to find the best split
 Let's develop a regression tree and understand R² through a practical example (a brief sketch follows below)
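A minimal R sketch (my own) of a regression tree; method = "anova" makes rpart split on the reduction in sum of squared errors, and Sales and reg_data are hypothetical names:

library(rpart)
reg_tree <- rpart(Sales ~ ., data = reg_data, method = "anova")
printcp(reg_tree)        # 1 - rel error is the (approximate) R-squared of the tree
rsq.rpart(reg_tree)      # plots approximate R-squared vs. number of splits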

Decision Tree– created by Gopal Prasad Malakar 44


CHAID – get a feel

Decision Tree– created by Gopal Prasad Malakar 45


Chi Square – contingency table

Age band      High Sale    Moderate Sale   Low Sale    Total
Age 20–35     30 (30.5)    42 (34.1)       18 (25.4)   90
Age 35–50     14 (22.1)    20 (24.5)       31 (18.4)   65
Age >= 51     34 (25.4)    25 (28.4)       16 (21.2)   75
Total         78           87              65          230
(expected frequencies in parentheses; e.g. 30.5 = (90 × 78) / 230)

χ² = Σ (fi – fe)² / fe ≈ 21.1

 Chi-square measure –
 the squared numerator gives more weight to large deviations than to small ones
 the denominator makes chi-square a relative measure of deviation from expectation
 The greater the chi-square statistic, the stronger the relationship between the independent and the dependent variable
 Degrees of freedom = (r – 1) × (c – 1)
 The lower the p-value, the stronger the relationship between the dependent and the independent variable (a quick R check follows below)
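A quick R check (my own addition) of the contingency-table calculation above:

# Observed counts from the table above (rows: age bands; columns: sale levels)
obs <- matrix(c(30, 42, 18,
                14, 20, 31,
                34, 25, 16), nrow = 3, byrow = TRUE)
test <- chisq.test(obs)
test$expected     # matches the values in parentheses (30.5, 34.1, ...)
test$statistic    # roughly 21, with (3 - 1) * (3 - 1) = 4 degrees of freedom
test$p.value      # small p-value -> strong relationship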
Another decision tree technique
 CHAID – stands for Chi-Square Automatic Interaction Detector
 Understand chi-square
 Use the chi-square statistic for tree development

 To get a good understanding of chi-square, see the videos attached from the "Statistics" course in the appendix content
 A PDF will also be available for the same

Decision Tree– created by Gopal Prasad Malakar 47


Developing a tree using CHAID

 Select the variable-based split that has the smallest p-value

 This ensures the best separation between the population where the dependent variable is 1 and the population where it is 0
 Repeat the same process for each node (a simplified sketch follows below)
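A simplified R sketch (my own; it is not a full CHAID implementation, since it skips category merging and the Bonferroni adjustment) of choosing the next split variable by the smallest chi-square p-value; the column names in the usage comment are hypothetical:

# For each candidate categorical predictor, cross-tabulate it with the target
# and rank the variables by chi-square p-value
best_chaid_variable <- function(data, target, predictors) {
  p_values <- sapply(predictors, function(v)
    chisq.test(table(data[[v]], data[[target]]))$p.value)
  sort(p_values)        # smallest p-value first: its variable defines the next split
}
# e.g. best_chaid_variable(df, "Target", c("State", "Occupation"))  # hypothetical columns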
CHAID vs. CART

Decision Tree– created by Gopal Prasad Malakar 49


CHAID vs. CART

• CHAID uses a p-value from a significance test to measure the


desirability of a split, while CART uses the reduction of an impurity
measure.
• CHAID searches for multi-way splits, while CART performs only
binary splits.
• CHAID uses a forward stopping rule to grow a tree, while CART
deliberately overfits and uses validation data to prune back.
51
