
Decision Trees

▪ The decision tree is one of the most widely used and practical methods of inductive inference
▪ Features
▪ Method for approximating discrete-valued functions (including Boolean)
▪ Learned functions are represented as decision trees (or if-then-else rules)
▪ Expressive hypothesis space, including disjunction
▪ Robust to noisy data

In decision analysis, a decision tree can be used to visually and explicitly
represent decisions and decision making. As the name suggests, it uses a
tree-like model of decisions.

When to use Decision Trees
▪ Problem characteristics:
▪ Instances can be described by attribute-value pairs
▪ Target function is discrete-valued
▪ Disjunctive hypothesis may be required
▪ Training data may be noisy (contain errors)
▪ Training data may contain missing attribute values
▪ Typical classification problems:
▪ Equipment or medical diagnosis
▪ Credit risk analysis
▪ Several tasks in natural language processing

Top-down induction of Decision Trees
▪ ID3 (Quinlan, 1986) is a basic algorithm for learning DTs
▪ Given a training set of examples, the algorithm for building a DT
performs a search in the space of decision trees
▪ The construction of the tree is top-down, and the algorithm is greedy
▪ The fundamental question is “which attribute should be tested next?
Which question gives us the most information?”
▪ Select the best attribute
▪ A descendant node is then created for each possible value of this
attribute, and the examples are partitioned according to this value
▪ The process is repeated for each successor node until all the
examples are classified correctly or there are no attributes left
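A minimal Python sketch of this top-down, greedy construction is shown below; the data layout (each example as a dict of attribute → value plus a 'label' key) and the helper names are illustrative assumptions, not part of the original slides.

```python
# ID3-style sketch (illustrative). Each example is a dict of
# attribute -> value plus a 'label' entry holding the class.
from collections import Counter
from math import log2

def entropy(examples):
    counts = Counter(e['label'] for e in examples)
    total = len(examples)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def information_gain(examples, attribute):
    total = len(examples)
    remainder = 0.0
    for value in set(e[attribute] for e in examples):
        subset = [e for e in examples if e[attribute] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(examples) - remainder

def id3(examples, attributes):
    labels = [e['label'] for e in examples]
    # Stop if all examples share one class, or no attributes remain.
    if len(set(labels)) == 1:
        return labels[0]
    if not attributes:
        return Counter(labels).most_common(1)[0][0]  # majority class
    # Greedy step: pick the attribute with the highest information gain.
    best = max(attributes, key=lambda a: information_gain(examples, a))
    remaining = [a for a in attributes if a != best]
    tree = {best: {}}
    # One subtree per value of the chosen attribute.
    for value in set(e[best] for e in examples):
        subset = [e for e in examples if e[best] == value]
        tree[best][value] = id3(subset, remaining)
    return tree
```

Calling id3(examples, attributes) returns either a class label (a leaf) or a nested dict keyed first by the chosen attribute and then by its values.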

Which attribute is the best classifier?

▪ A statistical property called information gain measures how well a given
attribute separates the training examples
▪ Information gain uses the notion of entropy, commonly used in information theory
▪ Information gain = expected reduction of entropy

Example: expected information gain
▪ Let
▪ Values(Wind) = {Weak, Strong}
▪ S = [9+, 5−]
▪ SWeak = [6+, 2−]
▪ SStrong = [3+, 3−]
▪ Information gain due to knowing Wind:
Gain(S, Wind) = Entropy(S) − 8/14 Entropy(SWeak) − 6/14 Entropy(SStrong)
= 0.94 − 8/14 × 0.811 − 6/14 × 1.00
= 0.048
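These numbers can be reproduced with a few lines of Python; the class counts are taken from the slide above, the rest is plain arithmetic.

```python
from math import log2

def entropy(pos, neg):
    total = pos + neg
    terms = [p / total for p in (pos, neg) if p > 0]
    return -sum(p * log2(p) for p in terms)

e_s      = entropy(9, 5)   # Entropy(S)       ≈ 0.940
e_weak   = entropy(6, 2)   # Entropy(SWeak)   ≈ 0.811
e_strong = entropy(3, 3)   # Entropy(SStrong) = 1.000

gain_wind = e_s - 8/14 * e_weak - 6/14 * e_strong
print(round(gain_wind, 3))  # 0.048
```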

[Entropy of the two partitions: −(2/5)log₂(2/5) − (3/5)log₂(3/5) = 0.971 and −(3/5)log₂(3/5) − (2/5)log₂(2/5) = 0.971]

First step: which attribute to test at the root?

▪ Which attribute should be tested at the root?
▪ Gain(S, Outlook) = 0.246
▪ Gain(S, Humidity) = 0.151
▪ Gain(S, Wind) = 0.048
▪ Gain(S, Temperature) = 0.029
▪ Outlook provides the best prediction for the target
▪ Let's grow the tree:
▪ add to the tree a successor for each possible value of Outlook
▪ partition the training samples according to the value of Outlook
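In code, the choice of the root is simply an argmax over the candidate gains; the dictionary below just restates the values listed above.

```python
gains = {
    'Outlook': 0.246,
    'Humidity': 0.151,
    'Wind': 0.048,
    'Temperature': 0.029,
}
root = max(gains, key=gains.get)
print(root)  # Outlook
```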

After first step

Second step
▪ Working on the Outlook=Sunny node:
Gain(SSunny, Humidity) = 0.970 − 3/5 × 0.0 − 2/5 × 0.0 = 0.970
Gain(SSunny, Wind) = 0.970 − 2/5 × 1.0 − 3/5 × 0.918 = 0.019
Gain(SSunny, Temp.) = 0.970 − 2/5 × 0.0 − 2/5 × 1.0 − 1/5 × 0.0 = 0.570
▪ Humidity provides the best prediction for the target
▪ Let's grow the tree:
▪ add to the tree a successor for each possible value of Humidity
▪ partition the training samples according to the value of Humidity

Second and third steps

[Tree figure: leaves {D1, D2, D8} → No, {D9, D11} → Yes, {D4, D5, D10} → Yes, {D6, D14} → No]

Other Splitting Criterion: GINI Index
The Gini index (Gini impurity) measures the probability that a randomly
chosen element of a node is classified incorrectly when it is labeled
according to the class distribution at that node. If all the elements belong
to a single class, the node is called pure.

GINI index for a given node t:
GINI(t) = 1 − Σj [ p(j | t) ]²

GINI index of a split of node t into k children:
GINIsplit = Σi=1..k (ni / n) GINI(i)
where ni = number of records at child i and n = number of records at node t

Note: the “information gain” in this slide is the weighted GINI index

The Gini index is a metric that measures how often a randomly chosen element
would be incorrectly identified.
It means an attribute with a lower Gini index should be preferred.
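A small Python sketch of both formulas, assuming each node is described by its per-class record counts (the function names are illustrative, not from the slides):

```python
def gini(class_counts):
    """GINI(t) = 1 - sum_j p(j|t)^2 for a single node."""
    total = sum(class_counts)
    return 1.0 - sum((c / total) ** 2 for c in class_counts)

def gini_split(children):
    """Weighted GINI of a split; children is a list of per-class count lists."""
    n = sum(sum(counts) for counts in children)
    return sum(sum(counts) / n * gini(counts) for counts in children)

# Example: a pure node vs. a maximally mixed two-class node.
print(gini([4, 0]))                  # 0.0 (pure)
print(gini([2, 2]))                  # 0.5 (maximally impure for 2 classes)
print(gini_split([[3, 1], [0, 4]]))  # weighted GINI of a candidate split
```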
How to Specify Test Condition?
 Depends on attribute types
– Nominal
– Ordinal
– Continuous

 Depends on number of ways to split
– Binary split
– Multi-way split
Splitting Based on Nominal Attributes

 Multi-way split: use as many partitions as there are distinct values
Example: CarType → {Family}, {Sports}, {Luxury}

 Binary split: divide the values into two subsets
Example: CarType → {Sports, Luxury} vs. {Family}, or CarType → {Family, Luxury} vs. {Sports}

Need to find the optimal partitioning!
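One way to search for the optimal binary partition is to enumerate every split of the value set into two non-empty subsets and score each with GAIN or GINI; the sketch below only enumerates the candidates, using the CarType values as a toy example.

```python
from itertools import combinations

def binary_partitions(values):
    """All ways to divide a set of nominal values into two non-empty subsets."""
    values = sorted(values)
    parts = []
    # Fix the first value on the left side to avoid mirrored duplicates.
    rest = values[1:]
    for r in range(len(rest) + 1):
        for combo in combinations(rest, r):
            left = {values[0], *combo}
            right = set(values) - left
            if right:
                parts.append((left, right))
    return parts

for left, right in binary_partitions({'Family', 'Sports', 'Luxury'}):
    print(left, 'vs.', right)
# {'Family'} vs. {'Luxury', 'Sports'}, {'Family', 'Luxury'} vs. {'Sports'}, ...
```

Each candidate partition would then be scored with GAIN or GINI and the best one kept.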


Splitting Based on Continuous Attributes

• Different ways of handling
– Multi-way split: form an ordinal categorical attribute
• Static – discretize once at the beginning
• Dynamic – repeat the discretization on each new partition

– Binary split: (A < v) or (A ≥ v)
• How to choose v?

Need to find the optimal partitioning!
Can use GAIN or GINI!
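A common heuristic for choosing v is to sort the distinct values and evaluate a candidate threshold midway between each adjacent pair, keeping the one with the lowest weighted GINI; the sketch below follows that assumption on made-up toy data.

```python
def gini(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_threshold(values, labels):
    """Scan candidate thresholds between adjacent distinct attribute values."""
    pairs = sorted(zip(values, labels))
    distinct = sorted(set(values))
    best_v, best_score = None, float('inf')
    for lo, hi in zip(distinct, distinct[1:]):
        v = (lo + hi) / 2
        left  = [l for x, l in pairs if x < v]
        right = [l for x, l in pairs if x >= v]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best_score:
            best_v, best_score = v, score
    return best_v, best_score

# Toy example: a continuous attribute vs. a binary class.
income = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]
cls    = ['N', 'N', 'N', 'Y', 'Y', 'Y', 'N', 'N', 'N', 'N']
print(best_threshold(income, cls))
```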


Overfitting and Underfitting

• Overfitting:
– Given a model space H, a specific model h ∈ H is said to overfit the
training data if there exists some alternative model h' ∈ H such that h has
smaller error than h' over the training examples, but h' has smaller error
than h over the entire distribution of instances
• Underfitting:
– The model is too simple, so both the training and test errors are large
Detecting Overfitting
[Figure: underfitting and overfitting regions]
Overfitting in Decision Tree Learning
 Overfitting results in decision trees that are more complex than necessary
– Tree growth went too far
– The number of instances gets smaller as we build the tree (e.g., several
leaves match a single example)

 Training error no longer provides a good estimate of how well the tree
will perform on previously unseen records
Avoiding Tree Overfitting – Solution 1
 Pre-Pruning (Early Stopping Rule)
– Stop the algorithm before it becomes a fully grown tree
– Typical stopping conditions for a node:
 Stop if all instances belong to the same class
 Stop if all the attribute values are the same
– More restrictive conditions:
 Stop if the number of instances is less than some user-specified threshold
 Stop if the class distribution of the instances is independent of the
available features
 Stop if expanding the current node does not improve the impurity
measure (e.g., GINI or GAIN)
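A rough sketch of how these stopping conditions might be checked before expanding a node; the threshold values and the parameter names are illustrative assumptions.

```python
def should_stop(examples, attributes, labels,
                min_instances=5, min_gain=1e-3, best_gain=None):
    # Stop if all instances belong to the same class.
    if len(set(labels)) == 1:
        return True
    # Stop if no attributes remain, or all attribute values are the same.
    if not attributes or all(
            len(set(e[a] for e in examples)) == 1 for a in attributes):
        return True
    # More restrictive: too few instances at this node.
    if len(examples) < min_instances:
        return True
    # More restrictive: expanding would not improve the impurity measure.
    if best_gain is not None and best_gain < min_gain:
        return True
    return False
```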
Avoiding Tree Overfitting – Solution 2
 Post-pruning
– Split the dataset into training and validation sets
– Grow the full decision tree on the training set
– While the accuracy on the validation set increases:
 Evaluate the impact of pruning each subtree, replacing its root by a
leaf labeled with the majority class of that subtree
 Replace the subtree whose pruning most increases validation set
accuracy (greedy approach)
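A rough Python sketch of this reduced-error pruning loop; accuracy, subtrees, majority_class, and replace_with_leaf are hypothetical helpers standing in for whatever tree representation is used.

```python
def reduced_error_pruning(tree, validation_set,
                          accuracy, subtrees, majority_class, replace_with_leaf):
    """Greedily prune while validation accuracy keeps improving (sketch)."""
    best_acc = accuracy(tree, validation_set)
    improved = True
    while improved:
        improved = False
        best_candidate = None
        # Try replacing each subtree by a leaf labeled with its majority class.
        for node in subtrees(tree):
            pruned = replace_with_leaf(tree, node, majority_class(node))
            acc = accuracy(pruned, validation_set)
            if acc > best_acc:
                best_acc, best_candidate = acc, pruned
        if best_candidate is not None:
            tree, improved = best_candidate, True
    return tree
```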
Decision Tree Based Classification
 Advantages:
– Inexpensive to construct
– Extremely fast at classifying unknown records
– Easy to interpret for small-sized trees
– Good accuracy
 Disadvantages:
– Axis-parallel decision boundaries
– Redundancy
– Need data to fit in memory
– Need to retrain with new data
Assignment
