Decision Tree: Courtesy: Prof. Pabitra Mitra, CSE, IIT Kharagpur
Jayanta Mukhopadhyay
Dept. of Computer Science and Engg.
Applications of Decision Trees
• Instances describable by a fixed set of attributes and their values
• Target function is discrete valued
  – 2-valued
  – N-valued
  – But can approximate continuous functions
• Disjunctive hypothesis space
• Possibly noisy training data
  – Errors, missing values, …
• Examples:
  – Equipment or medical diagnosis
  – Credit risk analysis
  – Calendar scheduling preferences
Decision Trees
Given a distribution of training instances, draw axis-parallel lines to separate the instances of each class.
[Figure: + and - training instances plotted against Attribute 1 and Attribute 2.]
Decision Tree Structure
Draw axis-parallel lines to separate the instances of each class.
[Figure: the same + / - distribution over Attribute 1 and Attribute 2, partitioned by axis-parallel lines.]
Decision Tree Structure
• Decision node: a condition (split) on an attribute
• Decision leaf: a box, i.e. the collection of examples satisfying the conditions along its path
• Alternate splits are possible
[Figure: the + / - distribution with splits at Attribute 1 = 20 and 40 and at Attribute 2 = 30.]
Decision Tree Construction
• Given a training data set, find the best tree structure.
Top-Down Construction
• Start with an empty tree
• Main loop:
  1. Select the "best" decision attribute (A) for the next node
  2. Assign A as the decision attribute for the node
  3. For each value of A, create a new descendant of the node
  4. Sort the training examples to the leaf nodes
  5. If the training examples are perfectly classified, STOP; else iterate over the new leaf nodes
• Grow the tree just deep enough for perfect classification
  – If possible (or approximate at the chosen depth)
• Which attribute is the best?
Best attribute to split?
[Figure: the + / - training distribution over Attribute 1 and Attribute 2, before any split.]
Best attribute to split?
• Candidate split: Attribute 1 > 40?
[Figure: the + / - distribution shown before and after splitting on Attribute 1 > 40.]
Which split to make next?
• One region is already a pure leaf: no further need to split it.
[Figure: after the first split, one region is already pure while the other still mixes + and - instances.]
Principle of Decision Tree Construction
• Try to form a tree with pure leaves
  – Correct classification (of the training set)
• A greedy approach:
  1. Initially treat the entire data set as a single box.
  2. For each box, choose the split on an attribute that reduces its impurity (in terms of class labels) by the maximum amount.
  3. Split the box having the highest reduction in impurity.
  4. Continue at Step 2.
  5. Stop when all boxes are pure.
Choosing the Best Attribute?
• Consider 64 examples: 29+ and 35-
• Which split is better?
  – A1: [29+, 35-] → t: [25+, 5-], f: [4+, 30-]
  – A2: [29+, 35-] → t: [15+, 19-], f: [14+, 16-]
Entropy
• A measure of uncertainty, purity, and information content
• An optimal-length code assigns (-log2 p) bits to a message having probability p
• Let S be a sample of training examples
  – p+ : the proportion of positive examples in S
  – p- : the proportion of negative examples in S
• Entropy of S: the average optimal number of bits to encode the certainty/uncertainty about S
  Entropy(S) = p+(-log2 p+) + p-(-log2 p-) = -p+ log2 p+ - p- log2 p-
• Can be generalized to more than two values
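A minimal Python sketch of this formula (the function name is mine, not from the slides); it also reproduces the boundary values discussed on the next slide:

```python
import math

def entropy(p_pos):
    """Entropy of a Boolean sample with a proportion p_pos of positive examples."""
    if p_pos in (0.0, 1.0):      # by convention 0 * log2(0) = 0
        return 0.0
    p_neg = 1.0 - p_pos
    return -p_pos * math.log2(p_pos) - p_neg * math.log2(p_neg)

print(entropy(1.0), entropy(0.0), entropy(0.5))   # 0.0 0.0 1.0
```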
Entropy
• Entropy can also be viewed as measuring
  – purity of S,
  – uncertainty in S,
  – information in S, …
• E.g., entropy is 0 for p+ = 1, 0 for p+ = 0, and 1 for p+ = .5
• Easy generalization to more than binary values
  – Entropy(S) = Σi pi (-log2 pi), summed over i = 1, …, n
  – vi is + or - in the binary case; vi ranges over 1, …, n classes in the general case
Choosing Best Attribute?
• Consider 64 examples (29+, 35-), compute the entropies, and ask which split is better:
  – A1: E(S) = 0.993; branch t: 25+, 5- (E = 0.650); branch f: 4+, 30- (E = 0.522)
  – A2: E(S) = 0.993; branch t: 15+, 19- (E = 0.989); branch f: 14+, 16- (E = 0.997)
• Information gain of an attribute A:
  Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} (|Sv| / |S|) · Entropy(Sv)
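As a rough check (helper names are mine, not from the slides), the gain formula applied to the A1 / A2 comparison above confirms that A1 is the more informative split:

```python
import math

def entropy(pos, neg):
    total = pos + neg
    e = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            e -= p * math.log2(p)
    return e

def gain(parent, children):
    """parent and each child are (positive, negative) count pairs."""
    total = sum(parent)
    return entropy(*parent) - sum((p + n) / total * entropy(p, n) for p, n in children)

print(gain((29, 35), [(25, 5), (4, 30)]))    # A1: about 0.41
print(gain((29, 35), [(15, 19), (14, 16)]))  # A2: about 0.00
```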
• S = {D1, …, D14}: 9+, 5-; split on Outlook:
  – Sunny: {D1, D2, D8, D9, D11}, 2+, 3- → ?
  – Overcast: {D3, D7, D12, D13}, 4+, 0- → Yes
  – Rain: {D4, D5, D6, D10, D14}, 3+, 2- → ?
• For Ssunny = {D1, D2, D8, D9, D11}:
  – Gain(Ssunny, Humidity) = .970
  – Gain(Ssunny, Temp) = .570
  – Gain(Ssunny, Wind) = .019
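These gains can be checked with a short script. The five Sunny rows below are an assumption: they are taken from the standard PlayTennis table (Mitchell), which the slide's day numbers and class counts appear to follow.

```python
import math

# Assumed Ssunny rows (D1, D2, D8, D9, D11) from the standard PlayTennis data set.
ssunny = [
    {"Temp": "Hot",  "Humidity": "High",   "Wind": "Weak",   "Tennis": "No"},   # D1
    {"Temp": "Hot",  "Humidity": "High",   "Wind": "Strong", "Tennis": "No"},   # D2
    {"Temp": "Mild", "Humidity": "High",   "Wind": "Weak",   "Tennis": "No"},   # D8
    {"Temp": "Cool", "Humidity": "Normal", "Wind": "Weak",   "Tennis": "Yes"},  # D9
    {"Temp": "Mild", "Humidity": "Normal", "Wind": "Strong", "Tennis": "Yes"},  # D11
]

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n) for c in set(labels))

def gain(rows, attr):
    labels = [r["Tennis"] for r in rows]
    remainder = 0.0
    for v in set(r[attr] for r in rows):
        subset = [r["Tennis"] for r in rows if r[attr] == v]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder

for attr in ("Humidity", "Temp", "Wind"):
    print(attr, round(gain(ssunny, attr), 3))
# Prints roughly 0.971, 0.571 and 0.02, i.e. the slide's .970 / .570 / .019 up to rounding.
```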
Final Decision Tree for the Example
Outlook
  – Sunny → Humidity
      High → No
      Normal → Yes
  – Overcast → Yes
  – Rain → Wind
      Strong → No
      Weak → Yes
Hypothesis Space Search (ID3)
• The hypothesis space (all possible trees) is complete!
• The target function is therefore included in it.
ID3
• Input: training examples (TEs), Attributes, Target_attribute (+ve / -ve)
• Create a root node.
• If all TEs are +ve / -ve, return a single node with label +ve / -ve.
• If Attributes is empty, return a single node with the most frequent label.
• Otherwise, select the best attribute A and, for each possible value vi of A:
  – Form the subset S of TEs whose value of A is vi.
  – If S is empty, add a leaf node below with the most frequent label.
  – Else, recursively call ID3 on S to form the subtree.
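A compact Python sketch of this procedure (the dict-based tree representation and helper names are my own assumptions, not from the slides):

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def info_gain(examples, labels, attr):
    total = len(labels)
    remainder = 0.0
    for v in set(ex[attr] for ex in examples):
        subset = [y for ex, y in zip(examples, labels) if ex[attr] == v]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

def id3(examples, labels, attributes):
    if len(set(labels)) == 1:                       # all examples share one label
        return labels[0]
    if not attributes:                              # no attributes left: most frequent label
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: info_gain(examples, labels, a))
    node = {"attr": best,
            "majority": Counter(labels).most_common(1)[0][0],
            "branches": {}}
    for v in set(ex[best] for ex in examples):      # one branch per observed value of best
        idx = [i for i, ex in enumerate(examples) if ex[best] == v]
        node["branches"][v] = id3([examples[i] for i in idx],
                                  [labels[i] for i in idx],
                                  [a for a in attributes if a != best])
    return node
```

Since this sketch only creates branches for values observed in the examples, the "S empty" case of the pseudocode shows up at prediction time instead: an unseen value can fall back to the stored majority label.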
Hypothesis Space Search in Decision Trees
• No backtracking
  – Locally optimal …
• Statistics-based search choices
  – Uses all TEs at each step
  – Robust to noisy data
Restriction bias vs. Preference bias
• Restriction bias (or language bias)
  – Incomplete hypothesis space
• Preference bias (or search bias)
  – Complete hypothesis space, but the search prefers some hypotheses over others (this is the kind of bias ID3 has)
Inductive Bias in ID3
• Preference for short trees, and for those with high-information-gain attributes near the root
• Principle of Occam's razor
  – Prefer the shortest hypothesis that fits the data
• Justification
  – Smaller likelihood of a short hypothesis fitting the data at random
• Problems
  – Other ways to reduce random fits to data
  – The size of a hypothesis depends on the data representation
• Minimum description length principle
Overfitting the Data
• The learned tree may not have the best generalization performance, because of
  – noise in the training data
  – very little data
• A hypothesis h overfits the training data if there exists another hypothesis h' that has greater error than h on the training data, but smaller error than h on the test data.
[Figure: accuracy vs. complexity of the tree, plotted for the training and the testing data.]
Overfitting
When to stop splitting further? A very deep tree is required to fit just one odd training example.
[Figure: the + / - distribution over Attribute 1 and Attribute 2 with a single odd training example inside the region of the opposite class; fitting it exactly requires many extra axis-parallel splits.]
Overfitting in Decision Trees
• Consider adding a noisy training example (it should be +):

  Day   Outlook   Temp   Humidity   Wind     Tennis?
  D15   Sunny     Hot    Normal     Strong   No

• The existing tree classifies Sunny days with Normal humidity as Yes, so D15 is misclassified and further splits would be grown just to fit it.
[Tree: Outlook → Sunny → Humidity (high → No, normal → Yes); Overcast → Yes; Rain → Wind (strong → No, weak → Yes)]
Overfitting - Example
• Noise or other coincidental regularities (such as the added D15) force additional splits.
• Outlook split:
  – Sunny: {1, 2, 8, 9, 11}, 2+, 3- → Humidity
  – Overcast: {3, 7, 12, 13}, 4+, 0- → Yes
  – Rain: {4, 5, 6, 10, 14}, 3+, 2- → Wind (Strong → No, Weak → Yes)
Avoiding Overfitting
• Methods of evaluation
  – Cross-validation: reserve a hold-out set to evaluate utility (more popular)
  – Statistical testing: test whether the observed regularity can be dismissed as likely to occur by chance
  – Minimum Description Length: is the additional complexity of the hypothesis smaller than remembering the exceptions? (regularization in other contexts)
General Strategy: Overfit and Simplify
Reduced-Error Pruning
• A post-pruning, cross-validation approach
  – Partition the training data into a "grow" set and a "validation" set.
  – Build a complete tree from the "grow" data.
  – Until accuracy on the validation set decreases, do:
      For each non-leaf node in the tree:
        temporarily prune the subtree below it, replacing it by a majority vote,
        and test the accuracy on the validation set.
      Permanently prune the node giving the greatest increase in accuracy on the validation set.
• Problem: uses less data to construct the tree
• Sometimes done at the rules level.
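A minimal sketch of this procedure, assuming the nested-dict tree representation used in the ID3 sketch earlier (a node holds "attr", "majority", and "branches"); for simplicity the root itself is never pruned:

```python
def predict(tree, x):
    if not isinstance(tree, dict):                  # a leaf is just a class label
        return tree
    subtree = tree["branches"].get(x[tree["attr"]], tree["majority"])
    return predict(subtree, x)

def accuracy(tree, examples):
    return sum(predict(tree, x) == y for x, y in examples) / len(examples)

def internal_nodes(tree):
    """Yield (branches_dict, key) addresses of every internal node below the root."""
    for value, sub in tree["branches"].items():
        if isinstance(sub, dict):
            yield tree["branches"], value
            yield from internal_nodes(sub)

def reduced_error_prune(tree, validation):
    if not isinstance(tree, dict):
        return tree
    best = accuracy(tree, validation)
    improved = True
    while improved:
        improved, best_site = False, None
        for branches, key in list(internal_nodes(tree)):
            saved = branches[key]
            branches[key] = saved["majority"]        # temporarily replace the subtree by majority vote
            acc = accuracy(tree, validation)
            branches[key] = saved                    # undo the temporary prune
            if acc >= best:                          # keep pruning while validation accuracy does not drop
                best, best_site, improved = acc, (branches, key), True
        if improved:
            branches, key = best_site
            branches[key] = branches[key]["majority"]   # prune the best site permanently
    return tree
```

It would be called as reduced_error_prune(tree, validation_examples), where validation_examples are (x, y) pairs held out from the grow set.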
Rule post-pruning
Example of rules
[Figure: the final Outlook tree expressed as IF-THEN rules, one rule per root-to-leaf path.]
Continuous Valued Attributes
• Create a discrete attribute from a continuous variable
  – E.g., define critical Temperature = 82.5

  Temp:    40  48  60  72  80  90
  Tennis?  N   N   Y   Y   Y   N

• Candidate thresholds: (48+60)/2 = 54 and (80+90)/2 = 85
  – chosen by the information gain function
  – check the gain for each possible threshold and choose the maximum
  – an attribute can have more than one threshold
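A small sketch (helper names mine) that reproduces the candidate thresholds above and picks the one with the larger information gain:

```python
import math

temps  = [40, 48, 60, 72, 80, 90]
tennis = ["N", "N", "Y", "Y", "Y", "N"]

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n) for c in set(labels))

def gain_for_threshold(t):
    left  = [y for x, y in zip(temps, tennis) if x <= t]
    right = [y for x, y in zip(temps, tennis) if x > t]
    n = len(tennis)
    return entropy(tennis) - len(left) / n * entropy(left) - len(right) / n * entropy(right)

# Candidate thresholds: midpoints between adjacent values where the class label changes.
pairs = list(zip(temps, tennis))
candidates = [(x1 + x2) / 2 for (x1, y1), (x2, y2) in zip(pairs, pairs[1:]) if y1 != y2]

print(candidates)                               # [54.0, 85.0]
print(max(candidates, key=gain_for_threshold))  # 54.0 has the larger gain for this data
```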
Candidate Thresholds
  Temp:    40  48  60  72  80  90
  Tennis?  N   N   Y   Y   Y   N
  Candidate thresholds: (48+60)/2 = 54 and (80+90)/2 = 85
Attributes with many values
[Figure: a split on Color? with one branch per value (green, yellow, …).]
Gini(color) = (5/14) × 2 × (2/5) × (3/5) + (5/14) × 2 × (3/5) × (2/5) + (4/14) × 2 × (4/4) × (0/4) = 0.342
Gain of Gini Index
Gini(S) = 2 × (9/14) × (5/14) = 0.46
Gini(color) = (5/14) × 2 × (2/5) × (3/5) + (5/14) × 2 × (3/5) × (2/5) + (4/14) × 2 × (4/4) × (0/4) = 0.342
GiniGain(color) = 0.46 - 0.342 = 0.116
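A short check of these numbers; the per-branch class counts (2, 3), (3, 2) and (4, 0) are read off the reconstructed formula above, so treat them as assumptions:

```python
def gini(pos, neg):
    total = pos + neg
    return 2 * (pos / total) * (neg / total)

parent = (9, 5)                            # 9 positive, 5 negative examples overall
color_branches = [(2, 3), (3, 2), (4, 0)]  # class counts in the three Color branches

n = sum(parent)
gini_parent = gini(*parent)
gini_color = sum((p + q) / n * gini(p, q) for p, q in color_branches)

print(round(gini_parent, 2))               # 0.46
print(round(gini_color, 3))                # 0.343 (the slide shows 0.342)
print(round(gini_parent - gini_color, 3))  # 0.116, matching GiniGain(Color) in the next table
```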
Three Impurity Measures
  A        Gain(A)   GainRatio(A)   GiniGain(A)
  Color    0.247     0.156          0.116
  Outline  0.152     0.152          0.092
  Dot      0.048     0.049          0.030
Regression Tree
• Similar to classification
• Use a set of attributes to predict a value (instead of a class label)
• Instead of computing information gain, compute the sum of squared errors
• Partition the attribute space into a set of rectangular subspaces, each with its own predictor
  – The simplest predictor is a constant value
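A minimal sketch of the sum-of-squared-errors split criterion with constant predictors (the data at the end is made up purely for illustration):

```python
def sse(values):
    """Sum of squared errors of the best constant predictor (the mean)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(xs, ys):
    """Return (threshold, total SSE) of the best binary split on one attribute."""
    best = (None, float("inf"))
    for t in sorted(set(xs))[:-1]:         # candidate thresholds at observed values
        left  = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = (t, total)
    return best

print(best_split([1, 2, 3, 4, 5, 6], [1.0, 1.1, 0.9, 5.0, 5.2, 4.9]))   # splits at x <= 3
```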
Rectilinear Division
• A regression tree is a piecewise-constant function of the input attributes.
[Figure: a tree of tests X1 ≤ t1, X2 ≤ t2, X1 ≤ t3, X2 ≤ t4 with constant leaves r1 … r5, and the corresponding rectangular partition of the (X1, X2) plane.]
Growing Regression Trees
[Figure: a partially grown tree with tests such as <Outlook=SUNNY>, <Outlook=RAINY>, <Outlook=CLOUDY>, <Wind=STRONG>, <Wind=WEAK>, <Temp=HIGH>, <Temp=LOW>; the '?' leaves are still to be fitted or split.]
Beyond decision trees …
Oblique Decision Tree
• Oblique Decision Tree (Heath, Kasif and Salzberg, IJCAI '93)
  – A combination of multiple attributes is checked at a node, not a single attribute as in an ordinary DT.
  – Each test is a hyperplane dividing the space into two halves; in an ordinary DT all such planes are axis-parallel.
  – Usually yields a smaller DT, but is highly dependent on the choice of hyperplanes.
• Central axis projection
  – Get two clusters and search for a hyperplane separating them.
  – How to select? Randomly!
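A toy sketch contrasting an ordinary axis-parallel node test with an oblique one (the weights and thresholds below are hypothetical, not from the paper):

```python
def axis_parallel_test(x):
    # Ordinary decision-tree node: a condition on a single attribute.
    return x[0] > 40

def oblique_test(x, w=(0.7, 0.3), b=35.0):
    # Oblique node: a hyperplane over a combination of attributes.
    return w[0] * x[0] + w[1] * x[1] > b

print(axis_parallel_test((50, 10)), oblique_test((50, 10)))   # True True
```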