

Classification Trees
Trees and Rules
Goal: Classify or predict an outcome based on a set of predictors
 The output is a set of rules
 Rules are represented by tree diagrams
 Also called CART (Classification And Regression Trees), Decision Trees, or just Trees
Key Ideas

 Recursive partitioning: Repeatedly split the records into two parts so as to achieve maximum homogeneity within the new parts
 Pruning the tree: Simplify the tree by pruning peripheral branches to avoid overfitting

The Idea of Recursive Partitioning
 Recursive partitioning of the predictors to achieve homogeneous groups
 At each step a single predictor is used for splitting the data
 Use rectangles of different sizes to split the data

[Figure: the predictor space (X1, X2) divided into rectangles]
Recursive Partitioning Steps
 Pick one of the predictor variables, xi
 Pick a value of xi, say si, that divides the training data into two (not necessarily equal) portions
 Measure how “pure” or homogeneous each of the resulting portions is
 “Pure” = containing records of mostly one class
 The idea is to pick xi and si to maximize purity


Example: Beer Preference
 Using the predictors Age, Gender, Married, and Income, we want to be able to classify the preference of new beer drinkers (Regular/Light beer)
 Two classes, 4 predictors
 100 records, partitioned into training/validation

Splitting the space in trees

[Scatterplot: Income (x-axis, $0 to $80,000) vs. Age (y-axis, 0 to 100), with points marked Regular beer / Light beer]
A Classification Tree for the Beer Preference Example (training)

[Tree diagram, written out as text; circles hold the splitting value, the numbers on the branches are # records, and rectangles are leaf/terminal nodes]

Age < 42.5?
  Yes (29 records): Income < 34,375?
    Yes (6): Regular
    No (23): Income < 39,180?
      Yes (7): Light
      No (16): Light
  No (31 records): Income < 41,173?
    Yes (18): Regular
    No (13): Age < 51.5?
      Yes (7): Light
      No (6): Regular

Method Settings
1. Determining splits/partitions
2. Terminating tree growth
3. Finding a rule to predict the class for each record
4. Pruning the tree (cutting branches)

Three famous algorithms:
CART (in XLMiner, SAS, CART), CHAID (in SAS), C4.5 (in Clementine by SPSS)

Determining the best split
 The best split is the one that best discriminates between records in the different classes
 The goal is to have a single class predominate in each resulting node
 The split that maximizes the reduction in node impurity is chosen
 The CART algorithm (Classification And Regression Trees) evaluates all possible binary splits
 The C4.5 algorithm splits a categorical predictor with K categories into K children (creates bushiness)
 Two impurity measures are entropy and the Gini index
 Impurity of a split = weighted average of the impurities of its children (see the sketch below)
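As a rough illustration of the weighted-average idea (a Python sketch on assumed toy data, not XLMiner's or CART's actual implementation), the snippet below scores one candidate split of a numeric predictor using the Gini index:

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_impurity(x_values, y_labels, split_value):
    """Weighted average impurity of the two children produced by the split x < split_value."""
    left = [y for x, y in zip(x_values, y_labels) if x < split_value]
    right = [y for x, y in zip(x_values, y_labels) if x >= split_value]
    n = len(y_labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Hypothetical toy data: income (in $) and beer preference
income = [20000, 25000, 30000, 45000, 60000, 80000]
pref = ["Regular", "Regular", "Regular", "Light", "Light", "Light"]
print(split_impurity(income, pref, 40000))  # 0.0  -- a perfectly pure split
print(split_impurity(income, pref, 28000))  # 0.25 -- a worse (more impure) split
```

The split with the lowest weighted impurity, i.e. the largest reduction in impurity, would be the one chosen.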

Entropy Measure

Entropy of a node = − Σ_{i=1..K} p_i log2(p_i)

K = number of classes
p_i = the percentage of records in class i

Entropy

0 ≤ Entropy ≤ log2(K)

Entropy = 0: good; all records are in the same class (as pure as possible).
Entropy = log2(K): bad; the K classes are equally likely (as impure as possible).

Entropy For 2 classes

[Plot: entropy (0 to 1) as a function of p_i (0 to 1); the maximum of 1 is reached at p_i = 0.5]

Entropy: Example
Assume we have K = 2 classes (buy, not buy)
Then, maximum Entropy is log2(2) = 1

Ex. 1, impure node: p1 = 0.5, p2 = 0.5
Entropy = – [0.5×log2(0.5) + 0.5×log2(0.5)] = 1

Ex. 2, pure node: p1 = 1, p2 = 0
Entropy = – [1×log2(1) + 0×log2(0)] = 0 (by convention, 0×log2(0) is taken as 0)
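These two cases can be checked with a few lines of Python (a sketch, not part of the original slides); the zero-proportion convention is handled by skipping zero terms:

```python
import math

def entropy(proportions):
    """Entropy = -sum(p_i * log2(p_i)), with 0 * log2(0) taken as 0."""
    return sum(-p * math.log2(p) for p in proportions if p > 0)

print(entropy([0.5, 0.5]))  # 1.0 -- impure node (Ex. 1)
print(entropy([1.0, 0.0]))  # 0.0 -- pure node (Ex. 2)
```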
Splitting the 100 beer drinkers (50 prefer light, 50 regular) by gender

43 females: 25 Regular, 18 Light
57 males: 25 Regular, 32 Light

Before split (p1 = p2 = 0.5): entropy = 1
After split:
FEMALES: – [ 18/43 log2(18/43) + 25/43 log2(25/43) ] = 0.98
MALES: – [ 32/57 log2(32/57) + 25/57 log2(25/57) ] = 0.989
Combined: (0.43)(0.98) + (0.57)(0.989) = 0.985, not much of an improvement

The Gini Impurity Index

The impurity of a node is

GI = 1 − Σ_{i=1..K} p_i²

K = number of classes
p_i = the percentage of records in class i

The Gini Index

0 ≤ GI ≤ (K−1)/K

GI = 0: good; all records are in the same class (as pure as possible).
GI = (K−1)/K: bad; the K classes are equally likely (as impure as possible).
Splitting the 100 beer drinkers (50 prefer light, 50 regular) by gender

43 females: 25 Regular, 18 Light
57 males: 25 Regular, 32 Light

Before split: 1 − [0.25 + 0.25] = 0.5
LEFT (females): 1 − [(25/43)^2 + (18/43)^2] = 0.487
RIGHT (males): 1 − [(25/57)^2 + (32/57)^2] = 0.492
Combined: 0.43 × 0.487 + 0.57 × 0.492 = 0.49
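The same arithmetic can be reproduced in a small Python sketch (counts taken from the slide; this is just a check, not a tree implementation):

```python
def gini(counts):
    """Gini index of a node, given the class counts in that node."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

before = gini([50, 50])                      # 0.5
females = gini([25, 18])                     # ~0.487 (43 records)
males = gini([25, 32])                       # ~0.492 (57 records)
after = (43 / 100) * females + (57 / 100) * males
print(round(before, 3), round(after, 2))     # 0.5 0.49 -- hardly any improvement
```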
Recursive Partitioning
 Obtain an overall impurity measure (weighted avg. of the individual rectangles)
 At each successive stage, compare this measure across all possible splits in all variables
 Choose the split that reduces impurity the most
 Chosen split points become nodes on the tree (a minimal recursive sketch follows)
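To make the recursion concrete, here is a minimal, self-contained Python sketch of recursive partitioning with the Gini index. It is only an illustration of the idea (simple stopping rule, no pruning, invented function names), not the CART algorithm as implemented in any particular package:

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a node."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Scan every predictor and every observed value; return (impurity, predictor, value) for the best split."""
    best, n = None, len(labels)
    for j in range(len(rows[0])):                     # each predictor x_j
        for s in sorted({r[j] for r in rows}):        # each candidate split value s_j
            left = [y for r, y in zip(rows, labels) if r[j] < s]
            right = [y for r, y in zip(rows, labels) if r[j] >= s]
            if not left or not right:
                continue
            imp = len(left) / n * gini(left) + len(right) / n * gini(right)
            if best is None or imp < best[0]:
                best = (imp, j, s)
    return best

def grow_tree(rows, labels, min_records=5):
    """Recursively partition until nodes are pure or too small; leaves are labeled by majority vote."""
    split = best_split(rows, labels)
    if gini(labels) == 0 or len(labels) < min_records or split is None:
        return Counter(labels).most_common(1)[0][0]   # leaf node
    _, j, s = split
    left = [(r, y) for r, y in zip(rows, labels) if r[j] < s]
    right = [(r, y) for r, y in zip(rows, labels) if r[j] >= s]
    return {"predictor": j, "split": s,
            "left": grow_tree([r for r, _ in left], [y for _, y in left], min_records),
            "right": grow_tree([r for r, _ in right], [y for _, y in right], min_records)}
```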

Learning about the “best predictors”
 Predictors close to the top of the tree are the most informative
 Trees are sometimes used for variable selection
 Here: Age, Income

[Tree diagram: the beer preference tree shown earlier, with Age at the root and the Income splits below it]
Tree Structure
 Split points become nodes on the tree (circles with the split value in the center)
 Rectangles represent “leaves” (terminal points, no further splits, classification value noted)
 Numbers on the lines between nodes indicate # cases
 Read down the tree to derive a rule, e.g.
 If lot size < 19, and if income > 84.75, then class = “owner”
Determining Leaf Node Label
 Each leaf node label is determined by “voting” of the records within it, and by the cutoff value
 Records within each leaf node are from the training data
 Default cutoff = 0.5 means that the leaf node’s label is the majority class
 Cutoff = 0.75: requires a majority of 75% or more “1” records in the leaf to label it “1”

Converting a Tree into Rules
 Convert the tree into IF-THEN rules (conditions joined with AND)
 IF age > 42.5 AND income < $41,173 THEN prefers regular beer
 IF age < 42.5 AND income < $34,375 THEN prefers regular beer
 Some rules can be condensed:
 If income < $34,375 then regular beer

[Tree diagram: the beer preference tree shown earlier]
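If scikit-learn is available, a fitted tree can be printed directly as IF-THEN style rules with export_text; the toy data below is made up for illustration and is not the beer dataset from the slides:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy records: [age, income]
X = [[25, 30000], [35, 38000], [45, 39000], [50, 60000], [55, 45000], [30, 52000]]
y = ["Regular", "Light", "Regular", "Light", "Regular", "Light"]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["age", "income"]))
```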
Labeling Leaf Nodes
 Default: majority vote
 In the 2-class case, majority vote = cutoff of 0.5
 Changing the cutoff will change the labeling
 Example (see the sketch below):
 Success class = buyers
 Cutoff = 0.75 means that to label the node as “buyer”, at least 75% of the training set observations with that predictor combination should be buyers
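A one-function sketch of the cutoff rule for a 2-class leaf (the counts are hypothetical):

```python
def leaf_label(n_buyers, n_total, cutoff=0.5):
    """Label the leaf 'buyer' only if the proportion of buyer records reaches the cutoff."""
    return "buyer" if n_buyers / n_total >= cutoff else "non-buyer"

print(leaf_label(13, 20, cutoff=0.5))    # buyer     (65% >= 50%)
print(leaf_label(13, 20, cutoff=0.75))   # non-buyer (65% < 75%)
```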

Classifying existing and new records
 “Drop” the record into the tree (= use the rules)
 Classify the record as the majority class in its leaf node

[Tree diagram: the beer preference tree shown earlier]
Predict the preference of a 50-year-old female with $50K income

[Tree diagram: drop the record down the beer preference tree shown earlier; a rule-based sketch follows]
Stopping Tree Growth
 The natural end of the process is 100% purity in each leaf
 This overfits the data: the tree ends up fitting noise
 Overfitting leads to low predictive accuracy on new data
 Past a certain point, the error rate for the validation data starts to increase

Avoiding Over-fitting
 Partitioning is done based on the training set
 As we go down the tree, splitting is based on less data
 Larger trees lead to higher prediction variance

Avoiding Over-fitting – cont.
 How will the tree perform on new data?
 The error rate of a tree = proportion of misclassifications

[Plot: error rate vs. # splits; training error keeps falling while the error on unseen data eventually rises]

Solution 1: Stopping Tree Growth
 Rules/criteria used to stop tree growth:
 Tree depth
 Minimum # of records (cases) in a node
 Minimum reduction in the impurity measure
 Problem: hard to know when to stop… (a configuration sketch follows)
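In scikit-learn (one possible implementation of these criteria, not XLMiner), the same stopping rules appear as hyperparameters:

```python
from sklearn.tree import DecisionTreeClassifier

# Each argument corresponds to one of the stopping rules listed above.
tree = DecisionTreeClassifier(
    max_depth=5,                 # limit tree depth
    min_samples_split=20,        # minimum # of records in a node before it may be split
    min_impurity_decrease=0.01,  # minimum reduction in impurity required for a split
    random_state=0,
)
# tree.fit(X_train, y_train) would then grow the tree subject to these limits.
```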
CHAID (Chi-square Automatic Interaction Detector): stopping in time
 Popular in marketing
 Instead of using an impurity measure, CHAID uses the chi-square test of independence to determine splits
 At each node, we split on the predictor that has the strongest association with the Y variable
 Strength of association is measured by the p-value of the chi-square test of independence (smaller p-value = stronger evidence of association)
 Splitting terminates when no more association is found between the predictors and Y
 Requires categorical predictors (or bin interval predictors into categories)
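The chi-square test that CHAID relies on can be illustrated with SciPy (just the test itself, not a CHAID implementation); the contingency table reuses the gender split counts from the beer example:

```python
from scipy.stats import chi2_contingency

# Rows: females, males; columns: Regular, Light (counts from the beer example)
table = [[25, 18],
         [25, 32]]
chi2, p_value, dof, expected = chi2_contingency(table)
print(p_value)   # CHAID splits on the predictor with the smallest p-value
```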

Solution 2: Pruning (CART, C4.5)
 We would like to maximize prediction accuracy for unseen data
 Use the validation set to prune the tree

[Plot: error rate vs. # splits for training and validation data; prune at the point where the validation error starts to rise]
Pruning
 CART lets the tree grow to full extent, then prunes it back
 The idea is to find the point at which the validation error begins to rise
 Generate successively smaller trees by pruning leaves
 At each pruning stage, multiple trees are possible
 Use cost complexity to choose the best tree at that stage
Cost Complexity

CC(T) = Err(T) + α L(T)

CC(T) = cost complexity of a tree
Err(T) = proportion of misclassified records
L(T) = number of leaves (terminal nodes) of the tree
α = penalty factor attached to tree size (set by user)

 Among trees of a given size, choose the one with the lowest CC
 Do this for each size of tree (a small numeric sketch follows)
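A tiny numeric sketch of this comparison (made-up error rates and leaf counts), just to show how α trades error off against tree size:

```python
def cost_complexity(err, n_leaves, alpha):
    """CC(T) = Err(T) + alpha * L(T)."""
    return err + alpha * n_leaves

# Hypothetical candidate pruned trees: (error rate, # leaves)
candidates = [(0.12, 8), (0.10, 9), (0.15, 6)]
alpha = 0.01
best = min(candidates, key=lambda t: cost_complexity(t[0], t[1], alpha))
print(best)   # the candidate with the lowest cost complexity at this alpha
```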
Pruning Results
 This process yields a set of trees of different sizes and associated error rates
 Two trees of interest:
 Minimum error tree: has the lowest error rate on the validation data
 Best pruned tree: the smallest tree within one std. error of the minimum error

Beer Preference Example

[Tree diagrams: the full tree (training) next to the pruned tree (validation). Both split on Age at 42.5 and then on Income at 34,375 and 41,173; the pruned tree drops the lower-level Income (39,180) split]
Extension to numerical Y: Regression Trees
 All is similar to classification trees, except that the labels of leaf nodes are averages of the observations in the node
 Non-parametric, no assumptions, can find global relationships between Y and the predictors
 A nice alternative to linear regression for large datasets
 XLMiner: Prediction > Regression
Differences from CT
 Prediction is computed as the average of the numerical target variable in the rectangle (in CT it is the majority vote)
 Impurity is measured by the sum of squared deviations from the leaf mean
 Performance is measured by RMSE (root mean squared error)
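A minimal regression-tree sketch with scikit-learn (assumed available; the data is a made-up toy example), using RMSE as the performance measure:

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data: one predictor x, numerical target y
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1, 6.2, 6.8, 8.1])

tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)
pred = tree.predict(X)                          # each prediction is a leaf average
rmse = np.sqrt(mean_squared_error(y, pred))
print(rmse)
```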
Advantages of trees
 Easy to use and understand
 Produce rules that are easy to interpret & implement
 Variable selection & reduction is automatic
 Do not require the assumptions of statistical models
 Can work without extensive handling of missing data
Disadvantages
 May not perform well where there is structure in the data that is not well captured by horizontal or vertical splits
 Since the process deals with one variable at a time, there is no way to capture interactions between variables
Big Example: Personal Loan Offer
 As part of customer acquisition efforts, Universal Bank wants to run a campaign for current customers to purchase a loan
 In order to improve target marketing, they want to find the customers that are most likely to accept the personal loan offer
 They use data from a previous campaign on 5000 customers, 480 of whom accepted the offer
Personal Loan Data Description

ID: Customer ID
Age: Customer's age in completed years
Experience: # years of professional experience
Income: Annual income of the customer ($000)
ZIPCode: Home address ZIP code
Family: Family size of the customer
CCAvg: Avg. spending on credit cards per month ($000)
Education: Education level. 1: Undergrad; 2: Graduate; 3: Advanced/Professional
Mortgage: Value of house mortgage, if any ($000)
Personal Loan: Did this customer accept the personal loan offered in the last campaign?
Securities Account: Does the customer have a securities account with the bank?
CD Account: Does the customer have a certificate of deposit (CD) account with the bank?
Online: Does the customer use internet banking facilities?
CreditCard: Does the customer use a credit card issued by UniversalBank?

File: “UniversalBankTrees.xls”
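A hedged sketch of how this dataset might be loaded and a tree grown in Python (scikit-learn is used here instead of XLMiner; the column names are taken from the table above, but the exact sheet layout of “UniversalBankTrees.xls” is an assumption, so names may need adjusting):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumes the customer records sit in the first sheet, one row per customer
df = pd.read_excel("UniversalBankTrees.xls")

predictors = ["Age", "Experience", "Income", "Family", "CCAvg", "Education", "Mortgage",
              "Securities Account", "CD Account", "Online", "CreditCard"]
X = df[predictors]
y = df["Personal Loan"]

# Training/validation partition (the slides also hold out a separate test set)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3, random_state=1)

tree = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)
print(tree.score(X_valid, y_valid))   # validation accuracy = 1 - overall error rate
```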
Full tree (CT_output4)

[Tree diagram: the full tree grown on the training data. The root splits on Income at 92.5 (1830 vs. 670 records); the next level splits on CCAvg (2.95) and Education (1.5), followed by CD Account, Family, and Income splits, with further sub-trees beneath]
Training Data (Using Full Tree)
Classification Confusion Matrix (rows = actual class, columns = predicted class)
Actual 1: 235 predicted 1, 0 predicted 0
Actual 0: 0 predicted 1, 2265 predicted 0
Error Report: class 1: 235 cases, 0 errors (0.00%); class 0: 2265 cases, 0 errors (0.00%); overall: 2500 cases, 0 errors (0.00%)

Validation Data (Using Full Tree)
Actual 1: 128 predicted 1, 15 predicted 0
Actual 0: 17 predicted 1, 1340 predicted 0
Error Report: class 1: 143 cases, 15 errors (10.49%); class 0: 1357 cases, 17 errors (1.25%); overall: 1500 cases, 32 errors (2.13%)

Test Data (Using Full Tree)
Actual 1: 88 predicted 1, 14 predicted 0
Actual 0: 8 predicted 1, 890 predicted 0
Error Report: class 1: 102 cases, 14 errors (13.73%); class 0: 898 cases, 8 errors (0.89%); overall: 1000 cases, 22 errors (2.20%)
Pruned tree (CT_Output3)

Pruning log: candidate trees from 41 decision nodes down to 0, each with its training and validation error rate. The minimum error tree has 11 decision nodes (training error 1.20%, validation error 1.47%; std. err. 0.0031); the best pruned tree has 6 decision nodes (training error 2.24%, validation error 1.60%).

[Tree diagram: the best pruned tree. The root splits on Income at 92.5, then Education (1.5), Family (2.5) and Income (114.5), with lower-level splits on Income (116) and CCAvg (2.95)]

Validation Data scoring (Using Best Pruned Tree)
Actual 1: 127 predicted 1, 16 predicted 0
Actual 0: 8 predicted 1, 1349 predicted 0
Error Report: class 1: 143 cases, 16 errors (11.19%); class 0: 1357 cases, 8 errors (0.59%); overall: 1500 cases, 24 errors (1.60%)

Test Data scoring - Summary Report (Using Best Pruned Tree)
Actual 1: 88 predicted 1, 14 predicted 0
Actual 0: 3 predicted 1, 895 predicted 0
Error Report: class 1: 102 cases, 14 errors (13.73%); class 0: 898 cases, 3 errors (0.33%); overall: 1000 cases, 17 errors (1.70%)
