
Decision Trees and Modeling
Data Mining Methods According to the Nature of the Data

Continuous predictors:
• Continuous response: linear regression, neural nets, k-nearest neighbors
• Categorical response: logistic regression, neural nets, discriminant analysis, k-nearest neighbors
• No response: principal components, cluster analysis

Categorical predictors:
• Continuous response: linear regression, neural nets, regression trees
• Categorical response: neural nets, classification trees, logistic regression, naïve Bayes
• No response: association rules
Types of Data Mining Methods

Supervised
• Algorithms used in classification and prediction
  • Classification: models discrete-valued functions
  • Prediction: models continuous-valued functions
• Require data in which the value of the outcome of interest (e.g., purchase or no purchase) is known
• Use training data, validation data, and test data

Unsupervised
• Algorithms where there is no outcome variable to predict or classify
• Include association rules, data reduction methods, and clustering techniques
Supervised Learning Process: Two Steps

• Learning (training): learn a model using the training data (induction)
• Testing: apply the model to unseen test data to assess model accuracy (deduction)

Source: Bing Liu, Web Data Mining, 2nd ed., Fig. 3.1, p. 66, Springer, 2011.
Supervised Learning: Classification

• The task of learning a target function f that maps each attribute set x to one of the predefined class labels y

  Input: attribute set (x) → Classification model → Output: class label (y)

• The target function is also known informally as a classification model, which is useful for predicting or describing data with binary or nominal categories
Classification: Explanatory vs. Predictive Modeling

Descriptive models
• Serve as an explanatory tool to distinguish between objects of different classes
• Fit the data closely
• Use the entire dataset to estimate the best-fit model, maximizing information about the hypothesized relationship in the population
• Performance measures assess how well the model approximates the data

Predictive models
• Predict the class label of new cases accurately
• Data is split into a training set and a validation (holdout) set
  • Training set: estimate the model
  • Holdout set: test the model's performance on new, unobserved data
• Performance measures assess predictive accuracy
Introducing Decision Trees

• Decision trees are the workhorses of many practical applications of machine learning because they are fast and produce intelligible output that is often surprisingly accurate.
• Aim: partition the data into smaller, more homogeneous groups (i.e., the nodes of each split are more pure)
• Fast training performance
• High degree of accuracy
• Relatively simple computing methodology
• Easy to build
• Easy to interpret
• Do not require any assumptions about the distribution of the measurements in each group, which can be categorical, discrete numeric, or continuous
A Simple Decision Tree for Buying a Car

• Each node specifies a test (question) of an attribute.
• Each branch from a node is a possible value of this attribute.

Source: IBM SPSS Modeler 14.2 Modeling Nodes, 2011, Figure 6-2, p. 115.
Decision Tree Goals

1. Accurate classification: minimize error
2. Efficient decision making: fewest number of decisions/tests
How Decision Trees Work

Generate a tree structure of training data subsets by splitting a dataset on selected input attributes in a top-down, recursive manner (a minimal sketch of this procedure follows the steps below):

1. Begin with a single node (the root) composed of all observations in the training set.
2. Choose the best input attribute, i.e., the one whose values within the node can be used to split the node into subsets that are more homogeneous with respect to the classification attribute than the unsplit node. (Attributes are selected on the basis of an impurity function.)
3. For each subset of the split, determine whether that subset can or should be split further. If so, return to step 2 for that subset; otherwise stop when the stopping criteria are reached.
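To make the three steps concrete, here is a minimal, hedged sketch of top-down recursive partitioning in Python. It is not the algorithm used by any particular SPSS Modeler node; the function names (impurity, best_split, build_tree) and the dictionary-based record layout are illustrative choices, and Gini impurity stands in for the generic impurity function mentioned in step 2.

```python
from collections import Counter

def impurity(labels):
    """Gini impurity of a list of class labels (0 = perfectly pure)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels, attributes):
    """Step 2: pick the attribute whose split gives the lowest weighted child impurity."""
    best_attr, best_score = None, impurity(labels)
    for attr in attributes:
        score = 0.0
        for v in {row[attr] for row in rows}:
            child = [lab for row, lab in zip(rows, labels) if row[attr] == v]
            score += len(child) / len(labels) * impurity(child)
        if score < best_score:
            best_attr, best_score = attr, score
    return best_attr

def build_tree(rows, labels, attributes):
    """Steps 1-3: rows is a list of dicts keyed by attribute name, labels a parallel list."""
    if impurity(labels) == 0.0 or not attributes:       # stopping criteria
        return Counter(labels).most_common(1)[0][0]     # leaf: majority class
    attr = best_split(rows, labels, attributes)
    if attr is None:                                    # no split reduces impurity
        return Counter(labels).most_common(1)[0][0]
    tree = {attr: {}}
    for v in {row[attr] for row in rows}:
        subset = [(r, l) for r, l in zip(rows, labels) if r[attr] == v]
        sub_rows = [r for r, _ in subset]
        sub_labels = [l for _, l in subset]
        tree[attr][v] = build_tree(sub_rows, sub_labels,
                                   [a for a in attributes if a != attr])
    return tree
```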
How A Decision Tree Works: Three Types of Nodes

• A root node
  • Has no incoming edges and zero or more outgoing edges
• Internal nodes
  • Each has exactly one incoming edge and two or more outgoing edges
• Leaf or terminal nodes
  • Each has exactly one incoming edge and no outgoing edges
  • Each has a class label
The Play Golf Training Data

Day Outlook Temperature Humidity Windy Play Golf
D1 sunny 85 85 false no
D2 sunny 80 90 true no
D3 overcast 83 86 false yes
D4 rainy 70 96 false yes
D5 rainy 68 80 false yes
D6 rainy 65 70 true no
D7 overcast 64 65 true yes
D8 sunny 72 95 false no
D9 sunny 69 70 false yes
D10 rainy 75 80 false yes
D11 sunny 75 70 true yes
D12 overcast 72 90 true yes
D13 overcast 81 75 false yes
D14 rainy 71 91 true no
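
As a hedged illustration (not part of the original slides), the same 14 records can be fed to an off-the-shelf tree learner. The sketch below uses scikit-learn rather than the SPSS Modeler nodes discussed later; it one-hot encodes the categorical Outlook field and prints the learned tree as text.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

golf = pd.DataFrame({
    "Outlook":     ["sunny", "sunny", "overcast", "rainy", "rainy", "rainy", "overcast",
                    "sunny", "sunny", "rainy", "sunny", "overcast", "overcast", "rainy"],
    "Temperature": [85, 80, 83, 70, 68, 65, 64, 72, 69, 75, 75, 72, 81, 71],
    "Humidity":    [85, 90, 86, 96, 80, 70, 65, 95, 70, 80, 70, 90, 75, 91],
    "Windy":       [False, True, False, False, False, True, True, False, False, False,
                    True, True, False, True],
    "PlayGolf":    ["no", "no", "yes", "yes", "yes", "no", "yes", "no", "yes", "yes",
                    "yes", "yes", "yes", "no"],
})

X = pd.get_dummies(golf.drop(columns="PlayGolf"))   # one-hot encode Outlook
y = golf["PlayGolf"]

tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))
```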
Decision Tree Example: Play Golf

(Figure: a tree built on the play-golf data, with the root node, internal nodes, and leaf nodes labeled.)

• Each node specifies a test of an attribute.
• Each branch from a node is a possible value of this attribute.

Decision trees seek to create a set of leaf nodes that are as "pure" as possible!

Source: http://gautam.lis.illinois.edu/monkmiddleware/public/analytics/decisiontree.html
What do we mean by "pure"?

(Figure: three groups of points, ranging from a very impure group, to a less impure group, to a very pure group.)
Three requirements that must be met before DT algorithms may be applied:

1. DTs represent supervised learning and thus require pre-classified target variables. A training data set must be supplied which provides the algorithm with the values of the target variable.
Three requirements that must be met before DT algorithms may be applied:

2. The training set should be rich and varied, providing the algorithm with a healthy cross section of the types of records for which classification may be needed in the future.
   • Training set too large: the created tree may overfit.
   • DTs learn by example: if examples are systematically lacking for a definable subset of records, classification and prediction for this subset will be problematic or impossible.
   • Training data set too small: the tree might not be specific enough to work with more general data (underfitting: both training and test errors are large).
Three requirements that must be met before DT algorithms may be applied:

3. The target attribute classes must be discrete; DT algorithms cannot be applied to a continuous target variable.
   • The target variable must take on values that are clearly demarcated as either belonging to a particular class or not belonging, for example:
     • Play golf (yes, no)
     • Credit risk (high, low)
     • Loan applications (accept, reject)
     • Credit card transaction classification (fraudulent, legitimate)
     • Medical diagnosis (cancer, no cancer)
     • Spam filters (spam, no spam)
     • Security screening (threat, no threat)
Decision Tree Algorithms

• DT algorithms mainly differ on
  • Splitting criteria
    • Which variable to split first?
    • What values to use to split?
    • How many splits to form for each node?
  • Stopping criteria
    • When to stop building the tree
  • Pruning (generalization method)
    • Pre-pruning versus post-pruning
• Most popular DT algorithms
  • ID3, C4.5, C5
  • CART (Classification And Regression Tree; binary decision tree)
  • CHAID, M5
SPSS Modeler Decision Tree Nodes

• C&RT (binary splits): target field measurement type continuous, categorical, flag, nominal, or ordinal; splits chosen with a dispersion measure (Gini) or least squared deviation (continuous target) to minimize impurity at each step
• CHAID (multiple splits): target field measurement type continuous, categorical, flag, nominal, or ordinal; splits chosen with chi-square statistics to identify optimal splits
• QUEST (binary splits): target field measurement type categorical, flag, or nominal; splits chosen with a chi-square test (categorical predictors) or ANOVA (continuous inputs)
• C5.0 (multiple splits): target field measurement type flag, nominal, or ordinal; splits chosen with an information theory measure (information gain ratio)
Alternative Measures for Selecting the Best Split

• Gini index
  • Determines the purity of a specific class as a result of a decision to branch along a particular attribute (used in CART)
  • Chooses the attribute that maximizes the reduction in impurity as the splitting attribute at a particular node
  • See the video: https://www.youtube.com/watch?v=Zze7SKuz9QQ
• Information gain
  • Most common splitting criterion; based on entropy, a measure of the extent of uncertainty or randomness of a particular attribute/value split (used in ID3, C4.5, C5)
  • Measures the change in entropy due to any amount of new information being added: how much would be gained by branching on an attribute
  • Chooses the attribute with the highest information gain as the splitting attribute at a particular node
• Chi-square statistics (used in CHAID)

(A short code sketch of the first two measures follows.)
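A minimal sketch of the Gini and information gain measures in Python (the function names are illustrative); the example split at the bottom mirrors the Own_house split in the loan data introduced on the next slide, where one child node is pure (6 Yes) and the other holds 3 Yes and 6 No.

```python
import math
from collections import Counter

def gini(labels):
    """Gini index: 1 - sum(p_i^2); 0 for a pure node."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Entropy: -sum(p_i * log2 p_i); 0 for a pure node."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent_labels, child_label_groups):
    """Reduction in entropy achieved by splitting the parent into the children."""
    n = len(parent_labels)
    weighted_children = sum(len(g) / n * entropy(g) for g in child_label_groups)
    return entropy(parent_labels) - weighted_children

# 15 loan records (9 Yes, 6 No) split into one pure child and one mixed child
parent = ["yes"] * 9 + ["no"] * 6
children = [["yes"] * 6, ["yes"] * 3 + ["no"] * 6]
print(round(information_gain(parent, children), 3))
```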
A Loan Application Training Data Set

(Figure: table of 15 past loan applications. The class attribute indicates whether the loan application was approved (Yes) or not (No) in the past.)

Source: Bing Liu, Web Data Mining, 2nd ed., Table 3.1, p. 64, Springer, 2011. Chapter 3 can be downloaded from:
http://www.springer.com/cda/content/document/cda_downloaddocument/9783642194597-c3.pdf?SGWID=0-0-45-1229046-p174122057
Loan Application Training Data Set

• Which attributes should be used for splitting?
• In which order?
• What splits are needed; what is the test condition?

(Figure: the training data table, with the class attribute "Application Approved".)
Selecting The Best Split

• The objective is to reduce the impurity or uncertainty in the data as much as possible.
• A subset of data is pure if all instances belong to the same class.
• An attribute is good when
  • the resulting node contains records of mostly one class, i.e., it is "pure" or homogeneous.
• An attribute is poor when
  • it provides no discrimination (the attribute is immaterial to the decision);
  • for each value we have the same number of positive and negative instances.
Choosing Splitting Attributes: Which one discriminates best vis-à-vis the class attribute "Approved"?

(Figure: two candidate splits of the loan data. A multi-way split yields child nodes with No: 60% / Yes: 40%, No: 40% / Yes: 60%, and No: 20% / Yes: 80%; a two-way split yields child nodes with No: 0% / Yes: 100% and No: 66.67% / Yes: 33.33%.)

Source: Bing Liu, Web Data Mining, 2nd ed., Fig. 3.5, p. 71, Springer, 2011.
One Possible Tree

• The tree generalizes the data, as it captures the key regularities in the data.

(Figure: a tree for the dependent variable Approved; the full data set contains 9 Yes and 6 No records.)

Source: Bing Liu, Web Data Mining, 2nd ed., Fig. 3.2, p. 67, Springer, 2011.
Using the tree to predict the class of a new loan applicant

Age      Has_job   Own_house   Credit-rating   Class
young    false     false       good            ?

Decision rule:
IF Age = young AND Has_job = false THEN Class = No
Support = 3/15 = 20%
Confidence = 3/3 = 100%

• Support: the proportion of all records in the data set that rest in that particular terminal leaf node.
• Confidence of the rule: the proportion of records in the leaf node for which the decision rule is true (here the leaf nodes are pure).

(Figure: the tree for the dependent variable Approved; 9 Yes, 6 No.)
Is the first tree unique?

• No, there are many possible trees that can be learned from the data.
• Here is a simpler, smaller tree (dependent variable: Approved; 9 Yes, 6 No).
• In practice, one wants a small and accurate tree:
  • It is more general.
  • It is easier for human users to understand.

Source: Bing Liu, Web Data Mining, 2nd ed., Fig. 3.3, p. 68, Springer, 2011.
Using the smaller tree to predict the class of a new loan applicant

Age      Has_job   Own_house   Credit-rating   Class
young    false     false       good            ?

• It is possible to read a set of rules directly off a decision tree.
• Preconditions are logically ANDed together, and all the tests must succeed if the rule is to fire.

IF Own_house = false AND Has_job = false THEN Class = No
Support = 6/15 = 40%
Confidence = 6/6 = 100%

(Figure: the smaller tree for the dependent variable Approved; 9 Yes, 6 No.)
C5.0 Tree Built with SPSS Modeler 18

• One of the most attractive aspects of decision trees lies in their interpretability, especially with respect to the construction of decision rules.
From Trees to Rules

• Each path from the root to a leaf is a rule.
• The tree paths (rules) are mutually exclusive and exhaustive.
• Every data instance is covered by a single rule (a tree path).

Rule 1: IF Own_house = true THEN Yes (loan approved); support 6/15, confidence 6/6 = 100%
Rule 2: IF Own_house = false AND Has_job = true THEN Yes (loan approved); support 3/15, confidence 3/3 = 100%
Rule 3: IF Own_house = false AND Has_job = false THEN No (loan not approved); support 6/15, confidence 6/6 = 100%
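As a hedged illustration, the support and confidence of any such rule can be recomputed directly from the raw records; the rule_stats helper and the dictionary record layout below are hypothetical, not taken from the source material.

```python
def rule_stats(records, conditions, predicted_class, target="Approved"):
    """Support and confidence of: IF <conditions> THEN <target = predicted_class>."""
    covered = [r for r in records
               if all(r[attr] == value for attr, value in conditions.items())]
    support = len(covered) / len(records)
    confidence = sum(r[target] == predicted_class for r in covered) / len(covered)
    return support, confidence

# Hypothetical usage for Rule 3 above, given loan_records as a list of dicts:
# support, confidence = rule_stats(loan_records,
#                                  {"Own_house": "false", "Has_job": "false"}, "No")
```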
Classifier Evaluation

• Predictive accuracy
  • A simple measure of classification error
• Speed/efficiency
  • Time to construct the model
  • Time to use the model
• Robustness
  • Ability to make reasonably accurate predictions, given noisy data or data with missing and erroneous values
• Scalability
  • Ability to construct a prediction model given a rather large amount of data
• Interpretability
  • Level of understanding and insight provided by the model
• Compactness of the model
  • Size of the tree, or the number of rules
Judging Classification Performance: Evaluation Methods

• The available data set is divided into two disjoint subsets:
  • the training set (for learning a model)
  • the test set (for testing the model)
• The training set should not be used in testing, and the test set should not be used in learning.
  • An unseen test set provides an unbiased estimate of accuracy.
• The test set is also called the holdout set.
  • Mainly used when the data set D is large.
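
A hedged sketch of this holdout procedure using scikit-learn; the bundled demo dataset, the 30% test fraction, and the variable names are illustrative choices, not the course data.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Disjoint subsets: learn on the training set, evaluate on the unseen test (holdout) set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)

model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
print("Training accuracy:", model.score(X_train, y_train))
print("Test (holdout) accuracy:", model.score(X_test, y_test))
```

Comparing the two printed accuracies also previews the distinction drawn on the next slide between training error and generalization error.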
Judging Classification Performance

Errors committed by a classification model:
• Training errors: the number of misclassification errors committed on training records
• Generalization errors: the expected error of the model on previously unseen records (testing records)

The goal is a good classification model with:
• Low training error: fits the training data well
• AND low generalization error: accurately classifies records it has never seen before
Classifier Evaluation: Confusion Matrix

• A matrix showing the predicted and actual classifications. A confusion matrix is of size L x L, where L is the number of different label values. The confusion matrix shown here is for L = 2.

(Figure: accuracy measures; the following terms are defined for a 2 x 2 confusion matrix.)

Source: Kohavi & Provost, Glossary of Terms, Machine Learning, 30, 271-274 (1998), http://robotics.stanford.edu/~ronnyk/glossary.html
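
A minimal sketch, assuming scikit-learn is available, of building a 2 x 2 confusion matrix and the accuracy measure; the actual/predicted label vectors are made up for illustration.

```python
from sklearn.metrics import confusion_matrix, accuracy_score

# Toy example with L = 2 label values (1 = positive, 0 = negative)
y_actual    = [1, 1, 1, 0, 0, 0, 0, 1, 0, 1]
y_predicted = [1, 0, 1, 0, 0, 1, 0, 1, 0, 1]

# Rows correspond to the actual class, columns to the predicted class
print(confusion_matrix(y_actual, y_predicted, labels=[1, 0]))
print("Accuracy:", accuracy_score(y_actual, y_predicted))
```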
Evaluating Performance When One Class is More Important

• Data sets with imbalanced class distributions are quite common in many real applications.
• In many such cases it is more important to identify members of one class:
  • Tax fraud
  • Credit card fraud
  • Credit default
  • Response to a promotional offer
  • Detecting electronic network intrusion
  • Predicting delayed flights
• In such cases, we are willing to tolerate greater overall error in return for better identifying the important class for further attention.
Judging Classification Performance: Different Kinds of Wrong Predictions

• False positive: a record that is classified as positive but is actually negative
• False negative: a record that is classified as negative but is actually positive

Example: an insurance company wants to predict whether an uninvestigated claim is fraudulent.
• False positive: a claim predicted as fraudulent that is actually legitimate. Cost: additional money to investigate a claim that should be paid.
• False negative: a claim predicted as legitimate that is actually fraudulent. Cost: payoff of a claim that should not be paid.
Judging Classification Performance: Which Model Performs Better?

Good results correspond to large numbers down the main diagonal (from top left to bottom right) and small, ideally zero, off-diagonal elements.

Model 1 (N = 500)
                 Predicted Yes   Predicted No
Actual Yes       150             40
Actual No        60              250

Model 2 (N = 500)
                 Predicted Yes   Predicted No
Actual Yes       250             45
Actual No        5               200
Judging Classification Performance: A Numeric Example with 3 Classes

How accurate is the model that produced this classification matrix?

N = 200          Predicted a   Predicted b   Predicted c   Total
Actual a         88            10            2             100
Actual b         14            40            6             60
Actual c         18            10            12            40
Total            120           60            20            200
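
(For reference, the correctly classified records lie on the main diagonal, so overall accuracy works out to (88 + 40 + 12) / 200 = 140/200 = 70%.)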
The Cost of Classification: A Targeted Marketing Example

• False positive: a consumer classified as a likely responder who does not respond. Cost: preparing and mailing marketing materials.
• False negative: a consumer predicted not to be a likely responder who would have bought if offered. No money was spent, but nothing was gained.
• True positive: a consumer is offered the product and buys it. Profit based on revenue minus costs.
• True negative: a consumer who was not offered a deal and would not have bought it even if it had been offered. No cost incurred, no profit.
If costs are known, they can be incorporated into a financial analysis of the decision-making process.

Cost matrix
                 Predicted Yes   Predicted No
Actual Yes       -1              100
Actual No        1               0

• A standard accuracy measure would have preferred model M2 over M1.
• Incorporating costs, model M2 is inferior, since the improvement comes at the expense of increasing the more costly false negative errors.

Model 1 (N = 500)
                 Predicted Yes   Predicted No
Actual Yes       150             40
Actual No        60              250
Overall accuracy: 400/500 = .80
FP rate: 60/310 = .1935
FN rate: 40/190 = .2105

Model 2 (N = 500)
                 Predicted Yes   Predicted No
Actual Yes       250             45
Actual No        5               200
Overall accuracy: 450/500 = .90
FP rate: 5/205 = .0243
FN rate: 45/295 = .1525

Cost M1 = 150 * (-1) + 40 * 100 + 60 * 1 + 250 * 0 = 3910
Cost M2 = 250 * (-1) + 45 * 100 + 5 * 1 + 200 * 0 = 4255
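
The same bookkeeping can be sketched in a few NumPy lines; this simply re-multiplies the slide's confusion matrices by its cost matrix, so the printed totals should match the 3910 and 4255 above.

```python
import numpy as np

# Cost matrix and confusion matrices from the slide
# (rows = actual class Yes/No, columns = predicted class Yes/No)
cost = np.array([[-1, 100],
                 [ 1,   0]])
m1 = np.array([[150,  40],
               [ 60, 250]])
m2 = np.array([[250,  45],
               [  5, 200]])

# Total cost = element-wise product of counts and per-cell costs, summed
print("Cost of model 1:", (m1 * cost).sum())   # 3910
print("Cost of model 2:", (m2 * cost).sum())   # 4255
```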
Benchmarking the Model's Performance to Morrison's Proportional Chance Criterion

• Morrison's likelihood analysis (1969) provides a criterion that may be used to compare the proportion of correctly classified observations with the proportion expected by chance.
• The overall percentage of objects likely to be classified correctly by chance alone:
  Cpro = alpha² + (1 - alpha)²
  • alpha: proportion of cases in class Yes
  • 1 - alpha: proportion of cases in class No
• The predictive accuracy of the model should be higher than the proportional chance criterion!
• Example:
  • Class = Yes: 73% of the cases; Class = No: 27% of the cases
  • Cpro = 0.73² + 0.27² ≈ 0.61
  • Use this value as a benchmark cutoff for the model's estimated accuracy.
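
A tiny sketch of the criterion as a Python function (the helper name is illustrative):

```python
def proportional_chance_criterion(alpha):
    """Morrison's Cpro = alpha^2 + (1 - alpha)^2 for a two-class problem."""
    return alpha ** 2 + (1 - alpha) ** 2

# Example from the slide: 73% of cases in one class, 27% in the other
print(round(proportional_chance_criterion(0.73), 2))   # ~0.61
```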
Cutoff for Classification

• Most DM algorithms classify via a two-step process. For each record:
  1. Compute the probability of belonging to class "1"
  2. Compare it to the cutoff value, and classify accordingly
• The default cutoff value is 0.50:
  • If the probability is >= 0.50, classify as "1"
  • If the probability is < 0.50, classify as "0"
• Different cutoff values can be used (a short sketch follows).
• Typically, the error rate is lowest for a cutoff of 0.50.
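
A hedged sketch of the two-step cutoff rule; the probability values below are made up for illustration and are not the records from the cutoff table on the next slide.

```python
import numpy as np

def classify_with_cutoff(prob_of_class_1, cutoff=0.50):
    """Step 2: compare each estimated probability to the cutoff value."""
    return (np.asarray(prob_of_class_1) >= cutoff).astype(int)

probs = [0.96, 0.81, 0.75, 0.60, 0.52, 0.38, 0.21, 0.04]
print(classify_with_cutoff(probs))          # default cutoff 0.50
print(classify_with_cutoff(probs, 0.75))    # stricter cutoff: fewer records classified as "1"
print(classify_with_cutoff(probs, 0.25))    # looser cutoff: more records classified as "1"
```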
Cutoff Table Example

• Cutoff value = 0.75: 8 records are classified as "1"
• Cutoff value = 0.50: 13 records are classified as "1"
• Cutoff value = 0.25: 15 records are classified as "1"

Source: Figure 5.7 in Shmueli et al. (2010), Data Mining for Business Intelligence, p. 104.
Classification Confusion Matrices for Various Cutoff Values

Cutoff value = 0.75
              Predicted 0   Predicted 1
Actual 0      11            1
Actual 1      5             7
True positive rate: 7/12 = .5833
False positive rate: 1/12 = .0833

Cutoff value = 0.50
              Predicted 0   Predicted 1
Actual 0      10            2
Actual 1      1             11
True positive rate: 11/12 = .9166
False positive rate: 2/12 = .1667

Cutoff value = 0.25
              Predicted 0   Predicted 1
Actual 0      8             4
Actual 1      1             11
True positive rate: 11/12 = .9166
False positive rate: 4/12 = .3333
The Class Imbalance Problem

• An imbalance occurs when one or more classes have very low proportions in the training data as compared to the other classes.
  • Example: 1% of credit card transactions are fraudulent; fraudulent = yes is a "rare" event/class.
• Detecting instances of the rare class is akin to finding a needle in a haystack.
• The likely prediction result is that every record is predicted as non-fraudulent (predicted target = 0), making the result insufficiently informative, i.e., the model does not detect any class 1 (fraudulent) cases.
Accuracy Limitations with Imbalanced Data Sets

• The accuracy measure may be misleading and not well suited for evaluating models derived from imbalanced data sets.
Remedies for Severe Class Imbalance: Sampling Methods

• Up-sampling (oversampling)
  • Simulates or imputes additional data points to improve balance across the classes
  • Replicates the positive cases (the "rare" event cases)
• Down-sampling (undersampling)
  • Reduces the number of samples to improve balance across the classes
  • Eliminates negative examples, i.e., majority event cases

(A minimal resampling sketch follows this list.)
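Outside SPSS Modeler, the same idea can be sketched with pandas and scikit-learn's resample utility; the rebalance helper below is an illustrative, minimal implementation of random up- and down-sampling, not the Balance node shown on the following slides.

```python
import pandas as pd
from sklearn.utils import resample

def rebalance(df, target, method="down", random_state=42):
    """Down-sample the majority class, or up-sample (with replacement) the minority class."""
    counts = df[target].value_counts()
    minority, majority = counts.idxmin(), counts.idxmax()
    df_min = df[df[target] == minority]
    df_maj = df[df[target] == majority]
    if method == "down":
        # Undersample: keep only as many majority records as there are minority records
        df_maj = resample(df_maj, replace=False, n_samples=len(df_min),
                          random_state=random_state)
    else:
        # Oversample: replicate the rare-event cases until the classes are balanced
        df_min = resample(df_min, replace=True, n_samples=len(df_maj),
                          random_state=random_state)
    return pd.concat([df_maj, df_min])
```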
Remedying the Imbalance Problem with SPSS Modeler's Balance Node (Record Ops)

• Performs random undersampling by assigning a factor smaller than 1 to the majority class

Remedying the Imbalance Problem with SPSS Modeler's Balance Node (Record Ops)

• Carries out simple random oversampling by assigning a factor larger than 1 to the minority class
Avoid Overfitting in Classification

• Model overfitting
  • The model fits the training data too well
  • Good accuracy on training data but poor accuracy on test data
  • Symptoms: the tree is too large (too deep, with too many branches); some branches may reflect anomalies in the training data due to noise or outliers
• Trees are pruned to avoid overfitting
  • Pruning removes the weakest, least reliable branches, so the tree captures the patterns but not the noise in the training data
  • The resulting tree is smaller and less complex, and thus easier to understand
  • It is faster and better at correctly classifying independent test data (i.e., previously unseen tuples) than an unpruned tree
How Does Tree Pruning Work?

• Prepruning approach
  • Tree construction is halted early, e.g., no further splits or partitions at a given node
  • Based on measures of statistical significance, information gain, Gini index, etc.
  • Upon halting, the node becomes a leaf
• Postpruning
  • Successively selects a decision node and redesignates it as a leaf node
  • Removes subtrees from a fully grown tree, i.e., lops off the branches extending beyond a particular decision node

(A sketch contrasting the two approaches follows.)
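A hedged sketch of both ideas with scikit-learn: max_depth and min_samples_leaf act as pre-pruning (halting construction early), while ccp_alpha applies cost-complexity post-pruning to a fully grown tree. The bundled dataset and the parameter values are illustrative choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pre-pruning: halt tree construction early via depth and leaf-size limits
pre = DecisionTreeClassifier(max_depth=4, min_samples_leaf=10,
                             random_state=0).fit(X_train, y_train)

# Post-pruning: grow the full tree, then prune it with cost-complexity pruning
post = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_train, y_train)

for name, model in [("pre-pruned", pre), ("post-pruned", post)]:
    print(name, "leaves:", model.get_n_leaves(),
          "test accuracy:", round(model.score(X_test, y_test), 3))
```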
Pruning A Decision Tree: When To Stop Splitting Data

• All examples for a given node belong to the same class
• There are no examples left
• There are no remaining attributes for further partitioning (the majority class becomes the leaf)
• No attribute satisfies a minimum information gain threshold
• A maximal depth is reached
• There are fewer than a certain number of examples in the current subtree (again, a mechanism to prevent overfitting)
IBM SPSS Modeler 18: Building Decision Trees, Hands-On Exercise

• Drug Reaction (C&RT, C5.0)
• Survival of Passengers on the Titanic (CHAID)
CHAID Tree without Partition
Stream 1: CHAID Tree, Table Node

• $R-Survived: the predicted score
• $RC-Survived: the confidence value (between 0.0 and 1.0), the model's own estimate of how accurate each predicted value is

CHAID Tree, Partition Node
CHAID Tree, Analysis Node