
Algorithm Behind Decision Tree

- How does a tree develop?

1
Section 3
 Technique behind the decision tree – how does a decision tree develop? First for a categorical dependent variable
• GINI method
• Steps taken by software programs to learn the classification (develop the tree)
• Steps taken by software programs to apply the learning to unseen data
 Learning more from a practical point of view
• Focus on basic and practical material
• Brief discussion of advanced approaches
 Decision Tree (CART) for a numeric outcome
 Chi-square method etc.

2
Numeric independent variable based split

Decision Tree– created by Gopal Prasad Malakar 3


Split for continuous variable
 How do we use a continuous variable for the split (where do I define the split)?
 Q: Where do splits need to be evaluated?

[Figure: the same records shown twice, once sorted by Age and once sorted by Blood Pressure, to illustrate the candidate cut points]

Two things have to be decided:
1. which variable to split on, and
2. where to cut that variable for the best split
4
Split for continuous variable
 How do we use a continuous variable for the split (where do I define the split)?
 For the sake of simplicity, splits of a numeric variable can be taken at its 1st percentile, 2nd percentile, 3rd percentile, and so on
 How do you know which variable and which split to take?
 In fact, if you have 20 numeric independent variables, you might end up evaluating 20 × 99 = 1,980 candidate splits
 And then take the best split (see the sketch below)
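A quick R sketch (my own illustration, not from the original slides) of generating the 99 percentile-based candidate cut points for one numeric variable; age here is just a made-up vector:

# Candidate cut points for a hypothetical numeric independent variable 'age'
age <- round(runif(500, 18, 75))                                  # toy data
cut_points <- quantile(age, probs = seq(0.01, 0.99, by = 0.01), names = FALSE)
head(cut_points)      # candidate split values to evaluate
length(cut_points)    # 99 candidates per numeric variable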

 Let’s understand by formula

5
Gini Index – what it is?

Decision Tree– created by Gopal Prasad Malakar 6


Gini Index Method
 One of the most popular methods
 It is a measure of impurity
 Please note – we want purity, i.e. definitive classes
• i.e. nodes that contain only one class of the dependent variable
 Formula –
GINI(t) = 1 – Σj [p(j | t)]²

• (NOTE: p( j | t) is the relative frequency of class j at node t).


 Let’s calculate

GINI = 1 – 1² – 0² = 0

Verdict – no impurity

C1 = 6, C2=0
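A small R helper (my own addition, not the author's code) that computes the Gini index of a node from its class counts and reproduces the value above:

# Gini impurity of a node, given a vector of class counts
gini_node <- function(counts) {
  p <- counts / sum(counts)        # relative frequency of each class
  1 - sum(p^2)
}

gini_node(c(6, 0))                 # 0 -> no impurity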

Decision Tree– created by Gopal Prasad Malakar 7


Gini Index Method
 Formula –
GINI(t) = 1 – Σj [p(j | t)]²

• (NOTE: p( j | t) is the relative frequency of class j at node t).

C1 = 5, C2 = 1:  GINI = 1 – (5/6)² – (1/6)² = 0.278   Verdict – some impurity
C1 = 3, C2 = 3:  GINI = 1 – (3/6)² – (3/6)² = 0.500   Verdict – maximum impurity (?)
C1 = 4, C2 = 2:  GINI = 1 – (4/6)² – (2/6)² = 0.444   Verdict – some more impurity
C1 = 2, C2 = 4:  GINI = 1 – (2/6)² – (4/6)² = 0.444   Verdict – some more impurity

Decision Tree– created by Gopal Prasad Malakar 8


Range of GINI Index

[Chart: the Gini index (measure of impurity) of a two-class node plotted against the proportion of class 1; it is 0 at proportions 0 and 1 and peaks at 0.5 when the two classes are evenly mixed]
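A short R sketch (my addition) that reproduces the shape of this chart for a two-class node:

# Gini index of a two-class node as a function of the proportion p of class 1
p    <- seq(0, 1, by = 0.01)
gini <- 1 - p^2 - (1 - p)^2
plot(p, gini, type = "l",
     xlab = "Proportion of class 1 in the node",
     ylab = "Gini index (measure of impurity)")
max(gini)    # 0.5, reached at p = 0.5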
Decision Tree– created by Gopal Prasad Malakar 9
Gini Index details
 Maximum value is (1 – 1/nc), where nc is the number of classes; it occurs when records are equally distributed among all classes, implying the least interesting information (for two classes the maximum is 0.5)
 Minimum value is 0.0, when all records belong to one class, implying the most interesting information

Decision Tree– created by Gopal Prasad Malakar 10


Gini Index of a split– get a feel?

Decision Tree– created by Gopal Prasad Malakar 11


Gini Index of a split
 When a node p is split into k partitions (children), the quality of the split is computed as

GINI(split) = Σ i=1..k (ni / n) × GINI(i)

• where ni = number of records at child i,
• n = number of records at parent node p

 Let’s calculate

Parent node: C1 = 3, C2 = 3

Left child:  C1 = 1, C2 = 3    GINI(left)  = 1 – (1/4)² – (3/4)² = 0.375
Right child: C1 = 2, C2 = 0    GINI(right) = 1 – (2/2)² – (0/2)² = 0

GINI(split) = (4/6) × 0.375 + (2/6) × 0 = 0.25
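A short R sketch (my own, reusing the gini_node() helper introduced above) that computes the Gini index of a split and reproduces the 0.25 above, plus the best and worst splits shown on the next slides:

# Gini index of a split: weighted average of the children's Gini values
gini_split <- function(child_counts) {          # list of class-count vectors, one per child
  n <- sum(sapply(child_counts, sum))           # records at the parent node
  sum(sapply(child_counts, function(cc) (sum(cc) / n) * gini_node(cc)))
}

gini_split(list(c(1, 3), c(2, 0)))              # 0.25 (the example above)
gini_split(list(c(0, 3), c(3, 0)))              # 0    (best possible split)
gini_split(list(c(2, 2), c(2, 2)))              # 0.5  (worst possible split)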


Decision Tree– created by Gopal Prasad Malakar 12
Gini Index of a split
 When a node p is split into k partitions (children), the quality of the split is computed as

GINI(split) = Σ i=1..k (ni / n) × GINI(i)

• where ni = number of records at child i,
• n = number of records at parent node p

 Let’s calculate

Parent node: C1 = 3, C2 = 3

Left child:  C1 = 0, C2 = 3    GINI(left)  = 1 – (0/3)² – (3/3)² = 0
Right child: C1 = 3, C2 = 0    GINI(right) = 1 – (3/3)² – (0/3)² = 0

GINI(split) = (3/6) × 0 + (3/6) × 0 = 0 (best possible split)
Decision Tree– created by Gopal Prasad Malakar 13
Gini Index of a split
 When a node p is split into k partitions (children), the quality of the split is computed as

GINI(split) = Σ i=1..k (ni / n) × GINI(i)

• where ni = number of records at child i,
• n = number of records at parent node p

 Let’s calculate

Parent node: C1 = 4, C2 = 4

Left child:  C1 = 2, C2 = 2    GINI(left)  = 1 – (2/4)² – (2/4)² = 0.5
Right child: C1 = 2, C2 = 2    GINI(right) = 1 – (2/4)² – (2/4)² = 0.5

GINI(split) = (4/8) × 0.5 + (4/8) × 0.5 = 0.5 (worst possible split)

Decision Tree– created by Gopal Prasad Malakar 14


Gini Index of a split
 When a node p is split into k partitions (children), the quality of the split is computed as

GINI(split) = Σ i=1..k (ni / n) × GINI(i)

• where ni = number of records at child i,
• n = number of records at parent node p

 Let’s calculate

Parent node: C1 = 4, C2 = 4

Left child:  C1 = 4, C2 = 3    GINI(left)  = 1 – (4/7)² – (3/7)² = 0.4898
Right child: C1 = 0, C2 = 1    GINI(right) = 1 – (0/1)² – (1/1)² = 0

GINI(split) = (7/8) × 0.4898 + (1/8) × 0 ≈ 0.43 (not so good a split)
Decision Tree– created by Gopal Prasad Malakar 15
Final touch –
• How to select the variable and the cut?
• Understanding splits of categorical variables

Decision Tree– created by Gopal Prasad Malakar 16


Gini Index of a split
 Effect of weighting partitions:
• larger and purer partitions are sought
 The lower the Gini index of a split, the better the split
 Question – how to select the best variable and which value to take for the split?
 Create a table with each variable, its various cut points and the associated Gini split value:

Variable                        Split Value      Gini Split
Months spent in current home    25               0.002
Spend on fuel                   350              0.03
Months spent in current home    35               0.05
State                           A + B vs. Rest   0.1

 Sort it in ascending order of the GINI split score
 Select the variable and the cut value of the top-most row
 This ensures the maximum reduction in impurity
 Now repeat this process for each child node (a sketch of this search follows below)
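A brute-force R sketch (my own illustration, not the slides' code) of this search over percentile cut points for numeric variables; it reuses gini_node() from above, and the column names in the usage comment are hypothetical:

# Evaluate every percentile cut of every numeric predictor and rank by Gini split
best_split_table <- function(data, target, predictors) {
  rows <- list()
  for (v in predictors) {
    cuts <- unique(quantile(data[[v]], probs = seq(0.01, 0.99, 0.01), names = FALSE))
    for (cut in cuts) {
      left  <- data[[target]][data[[v]] <= cut]
      right <- data[[target]][data[[v]] >  cut]
      if (length(left) == 0 || length(right) == 0) next
      g <- (length(left)  / nrow(data)) * gini_node(table(left)) +
           (length(right) / nrow(data)) * gini_node(table(right))
      rows[[length(rows) + 1]] <- data.frame(Variable = v, Split = cut, GiniSplit = g)
    }
  }
  out <- do.call(rbind, rows)
  out[order(out$GiniSplit), ]      # ascending: the top row is the best split
}
# e.g. head(best_split_table(df, "Target", c("months_in_home", "fuel_spend")))  # hypothetical columns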
Decision Tree– created by Gopal Prasad Malakar 17
Gini Index of a split for a categorical variable
 Let's say a particular categorical variable has three distinct categories A, B and C
 Then obtain the Gini index of a split of the parent node (C1 = 4, C2 = 4) for each candidate binary grouping:

{A} vs. {B, C}     {A, B} vs. {C}     {A, C} vs. {B}
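A tiny R sketch (my own) that just lists these three candidate binary groupings; each grouping would then be scored with the Gini index of the split exactly as for a numeric cut:

# Candidate binary groupings of a three-level categorical variable
cats   <- c("A", "B", "C")
splits <- lapply(seq_along(cats),
                 function(i) list(left = cats[i], right = cats[-i]))
# {A} vs {B, C}, {B} vs {A, C}, {C} vs {A, B}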

Decision Tree– created by Gopal Prasad Malakar 18


Practical approach to tree development and validation

Decision Tree– created by Gopal Prasad Malakar 19


Stopping criteria and model validation

 Practical approach for a data mining scenario
• stop by number of levels (maximum depth)
• stop by requiring a minimum of x observations in each leaf node
• each variable and its cut should not be counter-intuitive to common sense

library(rpart)
Dev_Model_1 <- rpart(Target ~ ., data = base_1,
                     control = rpart.control(
                       minsplit  = 60,  # minimum number of observations that must exist
                                        # in a node in order for a split to be attempted
                       minbucket = 30,  # minimum number of observations in any
                                        # terminal <leaf> node
                       maxdepth  = 4))  # maximum depth of any node in the final tree,
                                        # with the root node counted as depth 0
Decision Tree– created by Gopal Prasad Malakar 20
Stopping criteria and model validation

 Practical way of validation for a data mining scenario
• apply the tree to validation data (preferably out-of-time data) with known dependent-variable values
• calculate
• the response rate and
• the population % in each leaf node
• if these are similar to those of the development data, the model is done and is ready to be used on new datasets (a hedged sketch of this check follows the tables below)

Population stability:
Node   % of population in dev (a)   % of population in val (b)   % change = abs(a – b) / a
7      10%                          11%                           10%
8      12%                          10.5%                         12.5%

Response-rate stability:
Node   Bad rate in dev (a)   Bad rate in val (b)   % change = abs(a – b) / a
7      30%                   28.5%                 5%
8      25%                   27%                   8%

Decision Tree– created by Gopal Prasad Malakar 21
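A hedged R sketch (my own, not from the slides) of this stability check; it assumes the partykit package is available for recovering the leaf node of each record, and dev_data, val_data and a 0/1-coded Target are hypothetical objects:

library(rpart)
library(partykit)                       # assumed available; used only to get leaf-node ids

fit  <- rpart(Target ~ ., data = dev_data, method = "class",
              control = rpart.control(minsplit = 60, minbucket = 30, maxdepth = 4))
pfit <- as.party(fit)

node_stats <- function(model, data) {
  node <- predict(model, newdata = data, type = "node")        # terminal node id per record
  data.frame(node     = sort(unique(node)),
             pop_pct  = as.numeric(prop.table(table(node))),
             bad_rate = as.numeric(tapply(data$Target, node, mean)))  # Target assumed 0/1
}

dev_stats <- node_stats(pfit, dev_data)
val_stats <- node_stats(pfit, val_data)

comp <- merge(dev_stats, val_stats, by = "node", suffixes = c("_dev", "_val"))
comp$pop_change <- abs(comp$pop_pct_dev  - comp$pop_pct_val)  / comp$pop_pct_dev
comp$bad_change <- abs(comp$bad_rate_dev - comp$bad_rate_val) / comp$bad_rate_dev
comp    # % changes per leaf node, as in the tables above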


Applying decision tree learning to new data – using the model

Decision Tree– created by Gopal Prasad Malakar 22


How to classify unseen data
 How is the learning from the training data applied to unseen data?
 The model looks at the independent variables in the order in which the tree was developed
 Conditional move – the value of one variable at a time decides which variable to consider next
 The leaf node gives the final characteristics (see the sketch below)
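A minimal R sketch (my own) of scoring unseen data with a fitted rpart tree; fit and new_data are hypothetical objects:

# Class label of the leaf each unseen record lands in
pred_class <- predict(fit, newdata = new_data, type = "class")
# Per-class probabilities (the class mix of that leaf)
pred_prob  <- predict(fit, newdata = new_data, type = "prob")
head(pred_class); head(pred_prob)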

23
Decision Tree Example

CHK_ACCT (root; overall response rate 70%)
    CHK_ACCT < 1.5  -> Duration node
        Duration >= 22.5 -> SAV_ACCT node
            SAV_ACCT < 2.5  -> Node 4 (37%)
            SAV_ACCT >= 2.5 -> Node 5 (71%)
        Duration < 22.5  -> Node 6 (65%)
    CHK_ACCT >= 1.5 -> Node 7 (87%)

Decision Tree– created by Gopal Prasad Malakar 24


Decision Tree Classification Task

[Schematic: a tree-induction algorithm learns a decision-tree model from the Training Set (induction); the learned model is then applied to the Test Set to predict the unknown classes (deduction).]

Training Set
Tid   Attrib1   Attrib2   Attrib3   Class
1     Yes       Large     125K      No
2     No        Medium    100K      No
3     No        Small     70K       No
4     Yes       Medium    120K      No
5     No        Large     95K       Yes
6     No        Medium    60K       No
7     Yes       Large     220K      No
8     No        Small     85K       Yes
9     No        Medium    75K       No
10    No        Small     90K       Yes

Test Set
Tid   Attrib1   Attrib2   Attrib3   Class
11    No        Small     55K       ?
12    Yes       Medium    80K       ?
13    Yes       Large     110K      ?
14    No        Small     95K       ?
15    No        Large     67K       ?
Apply Model to Test Data
Start from the root of the tree.

New data record:  CHK_ACCT = 1.25,  Duration = 23,  SAV_ACCT = 4,  Response rate = ?

[Tree as on the previous slide: root CHK_ACCT (70%); CHK_ACCT < 1.5 leads to the Duration node and CHK_ACCT >= 1.5 to Node 7 (87%); Duration >= 22.5 leads to the SAV_ACCT node and Duration < 22.5 to Node 6 (65%); SAV_ACCT < 2.5 leads to Node 4 (37%) and SAV_ACCT >= 2.5 to Node 5 (71%)]
Apply Model to Test Data
Start from the root of the tree and follow the record down: CHK_ACCT = 1.25 < 1.5, so move to the Duration node; Duration = 23 >= 22.5, so move to the SAV_ACCT node; SAV_ACCT = 4 >= 2.5, so the record lands in Node 5.

New data record:  CHK_ACCT = 1.25,  Duration = 23,  SAV_ACCT = 4,  Predicted response rate = 71%

Assume a 71% response rate from such an account set (Node 5).
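A hand-coded R sketch (my own) of the conditional moves for this record, with thresholds and leaf rates read off the example tree above:

# Manual traversal of the example tree
score_record <- function(chk_acct, duration, sav_acct) {
  if (chk_acct >= 1.5)  return(c(node = 7, response = 0.87))
  if (duration < 22.5)  return(c(node = 6, response = 0.65))
  if (sav_acct < 2.5)   return(c(node = 4, response = 0.37))
  c(node = 5, response = 0.71)
}

score_record(chk_acct = 1.25, duration = 23, sav_acct = 4)   # Node 5, 71%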
• Stopping criteria – advanced

Decision Tree– created by Gopal Prasad Malakar 31


Stopping criteria – advanced
 First, understand pruning
 Current number of leaf nodes – 8

 After pruning, the number of leaves – 7

Decision Tree– created by Gopal Prasad Malakar 32


Automated decision tree
 Develop a very long tree
• until each node has only one class of the dependent variable
• the largest tree grown is called the "maximal" tree
• the maximal tree could have hundreds or thousands of nodes
• usually we instruct the software to grow only a moderately over-sized tree (x number of levels); a sketch follows after this list
 Understand the error rate
• actual vs. forecast class of the dependent variable
 Trim off the parts of the tree that do not work on validation data
• but which branch to cut first?
• prune away the "weakest link" – the nodes that add least to the overall accuracy of the tree
• if several nodes have the same contribution, they can all be pruned away simultaneously
• the pruning sequence is determined all the way back to the root node
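A hedged R sketch (my own) of growing a deliberately over-sized ("maximal") tree with rpart before pruning; base_1 and Target are the hypothetical objects used in the earlier code:

library(rpart)
# Grow a (near) maximal tree: no complexity penalty, tiny minimum node sizes
max_tree <- rpart(Target ~ ., data = base_1, method = "class",
                  control = rpart.control(cp = 0, minsplit = 2, minbucket = 1,
                                          xval = 10))   # 10-fold cross-validation
nrow(max_tree$frame)    # number of nodes in the grown tree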

Decision Tree– created by Gopal Prasad Malakar 33


Pruning sequence – visualization

Decision Tree– created by Gopal Prasad Malakar 34


Training Data vs. Test Data Error Rates

 Compare error rates measured on
• the development (training) data – R(T)
• the validation dataset – Rts(T)
 The development-data error R(T) always decreases as the tree grows (Q: Why?)
 The validation-data error Rts(T) first declines and then increases (Q: Why?)
 Overfitting is the result of relying too heavily on the training error R(T)
 It can lead to disasters when the model is applied to new data

No. Terminal Nodes   R(T)   Rts(T)
71                   .00    .42
63                   .00    .40
58                   .03    .39
40                   .10    .32
34                   .12    .32
19                   .20    .31
**10                 .29    .30
9                    .32    .34
7                    .41    .47
6                    .46    .54
5                    .53    .61
2                    .75    .82
1                    .86    .91

(** tree with the minimum validation error Rts(T))
CART Summary

 CART key features
• binary splits
   the parent gets two children
   each child produces two grandchildren
   four grandchildren produce eight great-grandchildren
• Gini index as the splitting criterion
• grow, then prune
• surrogates for missing values
• optimal tree – the minimum-error-rate tree on the validation data (once the full tree has been developed on the development data)

36
K Fold Cross Validation

Decision Tree– created by Gopal Prasad Malakar 37


Cross Validation (also known as k-fold validation)

[Figure: the data is randomly split into 5 parts; parts 1, 3, 4 and 5 are used to develop the model, and part 2 is held out to validate it]

Decision Tree– created by Gopal Prasad Malakar 38


Cross Validation (also known as k-fold validation)

[Figure: the same 5-way random split; this time parts 1, 2, 3 and 5 develop the model, and part 4 validates it]

Decision Tree– created by Gopal Prasad Malakar 39


K-fold cross validation

• Divide the data randomly into K parts

• For i = 1 to K:
• keep the i-th part aside and use the remaining K – 1 parts taken together to develop the model
• validate the model on the i-th part
(a minimal sketch follows below)
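A minimal R sketch (my own) of this loop for an rpart tree; base_1 and Target are the hypothetical objects used earlier. Note that rpart also performs this internally, controlled by the xval argument, and reports the result in its cp table:

library(rpart)
K     <- 5
fold  <- sample(rep(1:K, length.out = nrow(base_1)))    # random fold assignment
error <- numeric(K)

for (i in 1:K) {
  train <- base_1[fold != i, ]
  test  <- base_1[fold == i, ]
  fit   <- rpart(Target ~ ., data = train, method = "class")
  pred  <- predict(fit, newdata = test, type = "class")
  error[i] <- mean(pred != test$Target)                 # misclassification rate on fold i
}
mean(error)                                             # cross-validated error estimate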

Decision Tree– created by Gopal Prasad Malakar 40


Applying advanced pruning using R

Decision Tree– created by Gopal Prasad Malakar 41


Demo using R

• Learn how to develop the decision tree having the lowest cross-validation error (a hedged sketch follows below)
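A hedged sketch (my own, not the course demo itself) of pruning to the lowest cross-validation error, using the over-sized max_tree from the earlier sketch (any fitted rpart object would do):

printcp(max_tree)                         # cp table: CP, nsplit, rel error, xerror, xstd
plotcp(max_tree)                          # visualise cross-validation error vs. cp

best_cp <- max_tree$cptable[which.min(max_tree$cptable[, "xerror"]), "CP"]
pruned  <- prune(max_tree, cp = best_cp)  # tree with the lowest cross-validation error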

Decision Tree– created by Gopal Prasad Malakar 42


CART for a numeric dependent variable – what is the meaning of R² here?

Decision Tree– created by Gopal Prasad Malakar 43


Decision Tree in case of a numeric dependent variable
 Use R² (in lieu of the Gini index of a split) to find the best split
 Let's develop a regression tree and understand R² through a practical example (a brief sketch follows below)
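A minimal R sketch (my own) of a regression tree; method = "anova" makes rpart split on the reduction in sum of squared errors, and Sales and reg_data are hypothetical names:

library(rpart)
reg_tree <- rpart(Sales ~ ., data = reg_data, method = "anova")
printcp(reg_tree)        # 1 - rel error is the (approximate) R-squared of the tree
rsq.rpart(reg_tree)      # plots approximate R-squared vs. number of splits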

Decision Tree– created by Gopal Prasad Malakar 44


CHAID – get a feel

Decision Tree– created by Gopal Prasad Malakar 45


Chi Square – contingency table

Age band      High Sale    Moderate Sale   Low Sale    Total
Age 20–35     30 (30.5)    42 (34.1)       18 (25.4)   90
Age 35–50     14 (22.1)    20 (24.5)       31 (18.4)   65
Age >= 51     34 (25.4)    25 (28.4)       16 (21.2)   75
Total         78           87              65          230
(expected frequencies in parentheses; e.g. 30.5 = (90 × 78) / 230)

χ² = Σ (fi – fe)² / fe ≈ 21.1

 Chi-square measure –
 the squared numerator gives more weight to large deviations than to small ones
 the denominator makes chi-square a relative measure of deviation from expectation
 The greater the chi-square statistic, the stronger the relationship between the independent and the dependent variable
 Degrees of freedom = (r – 1) × (c – 1)
 The lower the p-value, the stronger the relationship between the dependent and the independent variable (a quick R check follows below)
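A quick R check (my own addition) of the contingency-table calculation above:

# Observed counts from the table above (rows: age bands; columns: sale levels)
obs <- matrix(c(30, 42, 18,
                14, 20, 31,
                34, 25, 16), nrow = 3, byrow = TRUE)
test <- chisq.test(obs)
test$expected     # matches the values in parentheses (30.5, 34.1, ...)
test$statistic    # roughly 21, with (3 - 1) * (3 - 1) = 4 degrees of freedom
test$p.value      # small p-value -> strong relationship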
Another decision tree technique
 CHAID – stands for Chi-Square Automatic Interaction Detector
 Understand chi-square
 Use the chi-square statistic for tree development

 To get a good understanding of chi-square, see the videos attached from the "Statistics" course in the appendix content
 A PDF will also be available for the same

Decision Tree– created by Gopal Prasad Malakar 47


Developing a tree using CHAID

 Select the variable-based split that has the smallest p-value

 This ensures the best separation between the population where the dependent variable is 1 and the population where it is 0
 Repeat the same process for each node (a simplified sketch follows below)
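A simplified R sketch (my own; it is not a full CHAID implementation, since it skips category merging and the Bonferroni adjustment) of choosing the next split variable by the smallest chi-square p-value; the column names in the usage comment are hypothetical:

# For each candidate categorical predictor, cross-tabulate it with the target
# and rank the variables by chi-square p-value
best_chaid_variable <- function(data, target, predictors) {
  p_values <- sapply(predictors, function(v)
    chisq.test(table(data[[v]], data[[target]]))$p.value)
  sort(p_values)        # smallest p-value first: its variable defines the next split
}
# e.g. best_chaid_variable(df, "Target", c("State", "Occupation"))  # hypothetical columns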
CHAID vs. CART

Decision Tree– created by Gopal Prasad Malakar 49


CHAID vs. CART

• CHAID uses a p-value from a significance test to measure the


desirability of a split, while CART uses the reduction of an impurity
measure.
• CHAID searches for multi-way splits, while CART performs only
binary splits.
• CHAID uses a forward stopping rule to grow a tree, while CART
deliberately overfits and uses validation data to prune back.
51
