Decision Trees CLS
Marek A. Perkowski
http://www.ece.pdx.edu/~mperkows/
Padraig Cunningham
http://www.cs.tcd.ie/Padraig.Cunningham/aic
3/22/04 2
Supplementary material
WWW:
http://dms.irb.hr/tutorial/tut_dtrees.php
http://www.cs.uregina.ca/~dbd/cs831/notes/ml/dtrees/4_dtrees1.html
Mitchell – Chapter 3
Decision Tree for PlayTennis
[Figure: PlayTennis decision tree rooted at Outlook]
Decision Tree for PlayTennis
[Figure: PlayTennis decision tree rooted at Outlook]
Decision Tree for Conjunction
Outlook=Sunny ∧ Wind=Weak
[Figure: root Outlook; Sunny → Wind (Strong: No, Weak: Yes); Overcast: No; Rain: No]
Decision Tree for Disjunction
Outlook=Sunny ∨ Wind=Weak
[Figure: decision tree for the disjunction, rooted at Outlook]
Decision Tree for XOR
Outlook=Sunny XOR Wind=Weak
[Figure: decision tree for the XOR concept, rooted at Outlook]
Decision trees represent disjunctions of conjunctions, e.g. for PlayTennis:
(Outlook=Sunny ∧ Humidity=Normal) ∨ (Outlook=Overcast) ∨ (Outlook=Rain ∧ Wind=Weak)
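The PlayTennis concept above can be read directly off the tree as nested tests; a minimal sketch in Python (function name and argument order are illustrative, not from the slides):

```python
def play_tennis(outlook, humidity, wind):
    """PlayTennis as the disjunction of conjunctions:
    (Outlook=Sunny AND Humidity=Normal) OR Outlook=Overcast
    OR (Outlook=Rain AND Wind=Weak)."""
    if outlook == "Sunny":
        return humidity == "Normal"   # Sunny branch tests Humidity
    if outlook == "Overcast":
        return True                   # Overcast is always Yes
    if outlook == "Rain":
        return wind == "Weak"         # Rain branch tests Wind
    raise ValueError(f"unexpected Outlook value: {outlook}")
```

Each `if` corresponds to one branch out of the root; each `return` is a leaf.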
When to Consider Decision Trees
Instances describable by attribute-value pairs
Target function is discrete valued
Disjunctive hypothesis may be required
Possibly noisy training data
Missing attribute values
Examples:
Medical diagnosis
History of Decision Trees
Classification, Regression and Clustering Trees
Classification trees predict a class; regression trees predict a number (a leaf stores e.g. a constant, a regression model, or …)
Clustering trees just group examples in leaves
Most (but not all) research in machine learning
focuses on classification trees
Top-Down Induction of Decision Trees (ID3)
Basic algorithm for TDIDT (a more formal version follows later):
start with the full data set at the root
Which Attribute is "best"?
Finding the Best Test (for classification trees)
For classification trees: find the test for which the children are as "pure" as possible
Purity measure borrowed from information theory: entropy
Entropy is a measure of "missing information"; more precisely, the average number of bits needed to encode the class of an example
Entropy
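For a set S with class proportions p_i, entropy is H(S) = −Σ p_i log₂ p_i. A minimal sketch of the computation (function name is illustrative):

```python
import math
from collections import Counter

def entropy(labels):
    """H(S) = -sum_i p_i * log2(p_i) over the class proportions of S."""
    n = len(labels)
    # Counter gives the count of each class; c / n is its proportion p_i
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())
```

For example, the PlayTennis training set [9+, 5−] has entropy of roughly 0.940 bits; a pure set has entropy 0.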
Information Gain
Gain(S,A): expected reduction in entropy due to
sorting S on attribute A
[Figure: S split on Outlook into branches Sunny, Overcast, Rain]
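The definition above, Gain(S,A) = Entropy(S) − Σ_v |S_v|/|S| · Entropy(S_v), can be sketched directly in Python (attribute values are looked up in per-example dicts; names are illustrative):

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, labels, attribute):
    """Gain(S,A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v),
    where S_v holds the examples with A = v."""
    by_value = defaultdict(list)
    for x, y in zip(examples, labels):
        by_value[x[attribute]].append(y)   # sort S on attribute A
    n = len(labels)
    remainder = sum(len(sub) / n * entropy(sub) for sub in by_value.values())
    return entropy(labels) - remainder
```

An attribute that splits the classes perfectly has gain equal to the parent's entropy; an uninformative one has gain 0.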
ID3 Algorithm
[Figure: ID3 run on [D1,D2,…,D14] = [9+,5−]; root attribute Outlook]
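The recursion behind ID3 can be sketched as follows, under the assumptions that examples are dicts of attribute values and trees are nested dicts with class labels at the leaves (representation and names are illustrative):

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def id3(examples, labels, attributes):
    """Grow a tree top-down: stop on a pure node or when no attributes
    remain; otherwise split on the highest-gain attribute."""
    if len(set(labels)) == 1:
        return labels[0]                              # pure leaf
    if not attributes:
        return Counter(labels).most_common(1)[0][0]   # majority leaf
    def remainder(a):
        # expected entropy after splitting on a; maximizing gain
        # is the same as minimizing this remainder
        by_value = defaultdict(list)
        for x, y in zip(examples, labels):
            by_value[x[a]].append(y)
        return sum(len(s) / len(labels) * entropy(s)
                   for s in by_value.values())
    best = min(attributes, key=remainder)
    tree = {best: {}}
    rest = [a for a in attributes if a != best]
    for v in sorted({x[best] for x in examples}):
        pairs = [(x, y) for x, y in zip(examples, labels) if x[best] == v]
        sub_x = [x for x, _ in pairs]
        sub_y = [y for _, y in pairs]
        tree[best][v] = id3(sub_x, sub_y, rest)       # recurse per branch
    return tree
```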
Occam’s Razor
Why prefer short hypotheses?
Argument in favor:
Fewer short hypotheses than long hypotheses
A short hypothesis that fits the data is unlikely to be a coincidence
A long hypothesis that fits the data might be a coincidence
Argument opposed:
There are many ways to define small sets of hypotheses
E.g. all trees with a prime number of nodes that use attributes beginning with "Z"
Avoiding Overfitting
Phenomenon of overfitting:
keep improving a model, making it fit the training data better and better, while its performance on unseen data gets worse
Overfitting: example
[Figure: scatter of + and − training examples; an overly complex decision boundary isolates noise points, creating an area with probably wrong predictions]
Overfitting in Decision Tree Learning
Avoid Overfitting
How can we avoid overfitting?
Stop growing when data split not statistically
significant
Grow full tree then post-prune
Minimize:
size(tree) + size(misclassifications(tree))
Stopping criteria
How do we know when overfitting starts?
a) use a validation set: data not considered for building the tree
Post-pruning trees
After learning the tree, start pruning branches away
For all nodes in the tree: estimate the effect of pruning the subtree at that node, e.g. on a validation set
Reduced-Error Pruning
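A hedged sketch of reduced-error pruning, assuming trees are nested dicts {attribute: {value: subtree}} with class labels at the leaves (the representation and names are illustrative; for simplicity the majority class at a node is taken over the validation examples reaching it rather than the training examples):

```python
from collections import Counter

def classify(tree, x):
    """Walk a nested-dict tree; leaves are class labels."""
    while isinstance(tree, dict):
        attr, branches = next(iter(tree.items()))
        tree = branches[x[attr]]
    return tree

def reduced_error_prune(tree, val_x, val_y):
    """Bottom-up: prune the children first, then replace this node by a
    majority-class leaf if that does not lower validation accuracy."""
    if not isinstance(tree, dict):
        return tree
    attr, branches = next(iter(tree.items()))
    pruned = {}
    for v, sub in branches.items():
        # validation examples that reach this branch
        reach = [(x, y) for x, y in zip(val_x, val_y) if x[attr] == v]
        sub_x = [x for x, _ in reach]
        sub_y = [y for _, y in reach]
        pruned[v] = reduced_error_prune(sub, sub_x, sub_y)
    tree = {attr: pruned}
    if not val_y:
        return tree          # no data to judge pruning here
    majority = Counter(val_y).most_common(1)[0][0]
    acc_tree = sum(classify(tree, x) == y
                   for x, y in zip(val_x, val_y)) / len(val_y)
    acc_leaf = sum(y == majority for y in val_y) / len(val_y)
    return majority if acc_leaf >= acc_tree else tree
```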
Effect of Reduced-Error Pruning
Rule Post-Pruning
Converting a Tree to Rules
[Figure: the PlayTennis tree rooted at Outlook; each root-to-leaf path becomes one rule]
Attributes with Costs
Replace Gain by:
Gain²(S,A) / Cost(A) [Tan, Schlimmer 1990]
(2^Gain(S,A) − 1) / (Cost(A) + 1)^w, w ∈ [0,1] [Nunez 1988]
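Both cost-sensitive selection measures are simple functions of a node's gain and the attribute's measurement cost; a small sketch (function names are illustrative):

```python
def tan_measure(gain, cost):
    """Gain(S,A)**2 / Cost(A)  [Tan, Schlimmer 1990]."""
    return gain ** 2 / cost

def nunez_measure(gain, cost, w):
    """(2**Gain(S,A) - 1) / (Cost(A) + 1)**w, with w in [0, 1]
    controlling how strongly cost is penalized [Nunez 1988]."""
    return (2 ** gain - 1) / (cost + 1) ** w
```

With w = 0, the Nunez measure ignores cost entirely; with w = 1, it penalizes cost fully.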
Attributes with many Values
Unknown Attribute Values