Decision Trees
Evgueni Smirnov
Topics
• Statistical Topics: Regression (Linear Regression, Logistic Regression), Bayes
• Instance- or Distance-Based Learning: K Nearest Neighbours, KD-Trees, Radial Basis Functions
• Unsupervised Clustering: K-means, Expectation Maximization
• Recommender Systems
• Reinforcement Learning: Q-Learning, SARSA
Overview
• Decision Trees for Classification
– Definition
– Classification Problems for Decision Trees
– Entropy and Information Gain
– Learning Decision Trees
– Overfitting and Pruning
– Handling Continuous-Valued Attributes
– Handling Missing Attribute Values
Decision Trees for Classification
• A decision tree is a tree where:
– Each interior node tests an attribute
– Each branch corresponds to an attribute value
– Each leaf node is labelled with a class (class node)
[Figure: an example decision tree: root A1 with branches a11 → test A2 (a21: c1, a22: c2), a12 → leaf c1, a13 → test A3 (a31: c2, a32: c1)]
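To make the definition concrete, here is a minimal Python sketch of the example tree above; the nested-dict encoding is our own assumption, not something the slides prescribe:

# Interior node: {"attribute": name, "children": {value: subtree}}
# Leaf (class node): just the class label.
example_tree = {
    "attribute": "A1",
    "children": {
        "a11": {"attribute": "A2",
                "children": {"a21": "c1", "a22": "c2"}},
        "a12": "c1",
        "a13": {"attribute": "A3",
                "children": {"a31": "c2", "a32": "c1"}},
    },
}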
A simple database: playtennis
Day  Outlook   Temperature  Humidity  Wind    PlayTennis
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cool         Normal    Weak    Yes
D10  Rain      Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No
Decision Tree For Playing Tennis
[Figure: the playtennis decision tree: Outlook at the root; Sunny → Humidity (High: no, Normal: yes); Overcast → yes; Rain → Wind (Strong: no, Weak: yes)]
Classification with Decision Trees
Classify(x: instance, node: a node of the decision tree)
• if node is a classification node then
– return the class of node;
• else
– determine the child of node that matches x;
– return Classify(x, child).
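The recursive procedure translates directly to Python; this sketch assumes the nested-dict trees from the earlier block (our own encoding, not the slides'):

def classify(x, node):
    # A classification node carries its class label directly.
    if not isinstance(node, dict):
        return node
    # Otherwise follow the branch whose attribute value matches x.
    child = node["children"][x[node["attribute"]]]
    return classify(x, child)

# Example: classify({"A1": "a11", "A2": "a22"}, example_tree) -> "c2"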
When to Consider Decision Trees
• Each instance is described by attributes with discrete values
(e.g. Outlook = Sunny, etc.)
• The classification is over discrete values (e.g. yes/no)
• It is okay to have disjunctive descriptions – each path in the tree
represents a conjunction of attribute tests, and the tree as a whole
represents a disjunction of such conjunctions. Any Boolean function
can be represented!
• It is okay for the training data to contain errors – decision
trees are robust to classification errors in the training data.
• It is okay for the training data to contain missing values –
decision trees can be used even if instances have missing
attributes.
Decision Tree Learning
Basic Algorithm:
1. Choose A, the "best" decision attribute for node N.
2. Assign A as decision attribute for the node N.
3. For each value of A, create new descendant of the node N.
4. Sort training examples to leaf nodes.
5. IF training examples perfectly classified, THEN STOP.
ELSE iterate over new leaf nodes
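The basic algorithm, sketched in Python. It assumes instances are dicts and uses information gain (defined later in this section) as the "best attribute" criterion; the helper names are our own:

from collections import Counter
from math import log2

def entropy(rows, target):
    n = len(rows)
    return -sum(c / n * log2(c / n)
                for c in Counter(r[target] for r in rows).values())

def gain(rows, attr, target):
    n = len(rows)
    remainder = sum(
        len(sub) / n * entropy(sub, target)
        for v in {r[attr] for r in rows}
        for sub in [[r for r in rows if r[attr] == v]])
    return entropy(rows, target) - remainder

def build_tree(rows, attrs, target):
    classes = {r[target] for r in rows}
    if len(classes) == 1:                      # perfectly classified: STOP
        return classes.pop()
    if not attrs:                              # no tests left: majority class
        return Counter(r[target] for r in rows).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(rows, a, target))    # step 1
    children = {}
    for v in {r[best] for r in rows}:          # step 3: one branch per value
        sub = [r for r in rows if r[best] == v]               # step 4
        children[v] = build_tree(sub, [a for a in attrs if a != best], target)
    return {"attribute": best, "children": children}          # steps 2, 5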
Decision Tree Learning
Split on Outlook:

Outlook = Sunny:
  Temp  Hum     Wind    Play
  Hot   High    Weak    No
  Hot   High    Strong  No
  Mild  High    Weak    No
  Cool  Normal  Weak    Yes
  Mild  Normal  Strong  Yes

Outlook = Overcast:
  Temp  Hum     Wind    Play
  Hot   High    Weak    Yes
  Cool  Normal  Strong  Yes
  Mild  High    Strong  Yes
  Hot   Normal  Weak    Yes

Outlook = Rain:
  Temp  Hum     Wind    Play
  Mild  High    Weak    Yes
  Cool  Normal  Weak    Yes
  Cool  Normal  Strong  No
  Mild  Normal  Weak    Yes
  Mild  High    Strong  No
Entropy
Let S be a sample of training examples,
p+ the proportion of positive examples in S, and
p- the proportion of negative examples in S.

Then entropy measures the impurity of S:

E(S) = - p+ log2 p+ - p- log2 p-
Entropy Example from the Dataset
In the Play Tennis dataset we have two target classes: yes and no.
Out of 14 instances, 9 are classified yes, the rest no.

- (9/14) log2 (9/14) ≈ 0.41
- (5/14) log2 (5/14) ≈ 0.53

E(S) = - (9/14) log2 (9/14) - (5/14) log2 (5/14) ≈ 0.94
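The numbers on this slide can be checked with a few lines of Python:

from math import log2

p_yes, p_no = 9 / 14, 5 / 14
print(round(-p_yes * log2(p_yes), 2))                      # 0.41
print(round(-p_no * log2(p_no), 2))                        # 0.53
print(round(-p_yes * log2(p_yes) - p_no * log2(p_no), 2))  # 0.94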
Information Gain
Information Gain is the expected reduction in entropy
caused by partitioning the instances from S according to a
given attribute.
Gain(S, A) = E(S) - Σ_{v ∈ Values(A)} (|S_v| / |S|) · E(S_v)

where S_v = { s ∈ S | A(s) = v }

[Figure: the sample S partitioned into subsets S_v1, S_v2, …]
Example
(The instances are partitioned on Outlook exactly as on the previous slide; consider the Outlook = Sunny subset.)
Gain(S_sunny, Temperature) = 0.970 - (2/5)·0.0 - (2/5)·1.0 - (1/5)·0.0 = 0.570
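This computation can be checked directly; the snippet below recomputes Gain(S_sunny, Temperature) from the five Sunny instances (the small entropy helper is our own):

from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(labels.count(c) / n * log2(labels.count(c) / n)
                for c in set(labels))

# Outlook = Sunny subset as (Temperature, Play) pairs
sunny = [("Hot", "No"), ("Hot", "No"), ("Mild", "No"),
         ("Cool", "Yes"), ("Mild", "Yes")]

g = entropy([p for _, p in sunny])              # E(S_sunny) ≈ 0.970
for v in {t for t, _ in sunny}:
    sub = [p for t, p in sunny if t == v]
    g -= len(sub) / len(sunny) * entropy(sub)
print(round(g, 2))                              # ≈ 0.57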
[Figure: a binary split on Temperature at cut-point 65.4 (Temperature ≤ 65.4: no; Temperature > 65.4: yes)]
• How do we find the cut-point?
Continuous Attributes
Sort the instances by Temperature, then evaluate candidate cut-points
midway between consecutive values:

Temp.: 64  65  68  69  70  71  72  72  75  75  80  81  83  85
Play:  Yes No  Yes Yes Yes No  No  Yes Yes Yes No  Yes Yes No

Candidate cut-points and the information gain (I) of each split:
Temp. < 64.5: I = 0.048
Temp. < 66.5: I = 0.010
Temp. < 70.5: I = 0.045
Temp. < 71.5: I = 0.001
Temp. < 77.5: I = 0.025
Temp. < 80.5: I = 0.000
Temp. < 84:   I = 0.113
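The search for the best cut-point is easy to automate; this sketch tries the midpoint between each pair of consecutive distinct values and recovers the slide's best split (the helper names are our own):

from math import log2

temps = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
plays = ["Yes", "No", "Yes", "Yes", "Yes", "No", "No",
         "Yes", "Yes", "Yes", "No", "Yes", "Yes", "No"]

def entropy(labels):
    n = len(labels)
    return -sum(labels.count(c) / n * log2(labels.count(c) / n)
                for c in set(labels))

def gain_at(cut):
    left = [p for t, p in zip(temps, plays) if t < cut]
    right = [p for t, p in zip(temps, plays) if t >= cut]
    n = len(plays)
    return (entropy(plays) - len(left) / n * entropy(left)
            - len(right) / n * entropy(right))

# Candidate cut-points lie midway between consecutive distinct values.
cuts = sorted({(a + b) / 2 for a, b in zip(temps, temps[1:]) if a != b})
best = max(cuts, key=gain_at)
print(best, round(gain_at(best), 3))   # 84.0, I = 0.113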
Continuous Attributes
[Figure: a tree of binary tests on continuous attributes (A2 < 0.33?, A1 < 0.91?, A1 < 0.23?, A2 < 0.91?, A2 < 0.75?, A2 < 0.49?) with leaves good/bad, partitioning the unit square spanned by A1 and A2 into axis-parallel regions]
[Figure: the concept x + y < 1 (Class = + vs Class = -): an oblique boundary that axis-parallel splits can only approximate]
[Figure: the playtennis tree extended with an additional Windy (false/true) test]
Implications of Overfitting
• A small number of instances is associated with the leaf nodes. In this
case it is possible for coincidental regularities to occur that are
unrelated to the actual target concept.
[Figure: + and - instances in attribute space; leaves fitted to a few
instances carve out an area with probably wrong predictions]
Approaches to Avoiding Overfitting
• Stop growing the tree earlier, before it perfectly classifies the
training data (pre-pruning).
• Grow the full tree, then prune it afterwards (post-pruning).
[Figure: two pruned variants of the playtennis tree, with the number of
training instances that reach each leaf]
Validation Set
• A validation set is a set of instances used to evaluate the utility of
nodes in decision trees. The validation set has to be chosen so that it
is unlikely to suffer from the same errors or fluctuations as the
training set.
• Usually, before pruning, the training data is split randomly into a
growing set and a validation set.
Reduced-Error Pruning (Sub-Tree Replacement)
• Split the data into a growing set and a validation set.
• Grow a tree on the growing set, then repeatedly replace a sub-tree by a
leaf whenever this does not increase the error on the validation set.
[Figure: the full playtennis tree (ErrorGS = 6%, ErrorVS = 8%) and two
candidate sub-tree replacements by leaves, with ErrorGS = 27%,
ErrorVS = 25% and ErrorGS = 33%, ErrorVS = 35%]
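A minimal sketch of sub-tree replacement, assuming the nested-dict trees and the classify() helper from the earlier blocks (all names our own). It prunes bottom-up and keeps a replacement leaf whenever the validation-set error does not increase:

from collections import Counter

def error(tree, data, target):
    # Assumes every instance follows an existing branch (no missing values).
    return sum(classify(r, tree) != r[target] for r in data) / len(data)

def prune(tree, grow, valid, target):
    if not isinstance(tree, dict) or not grow:
        return tree
    # First prune the children, bottom-up.
    for v, sub in tree["children"].items():
        sub_grow = [r for r in grow if r[tree["attribute"]] == v]
        tree["children"][v] = prune(sub, sub_grow, valid, target)
    # Candidate replacement: majority-class leaf over the growing set.
    leaf = Counter(r[target] for r in grow).most_common(1)[0][0]
    if error(leaf, valid, target) <= error(tree, valid, target):
        return leaf
    return tree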
Reduced-Error Pruning Example
Rule Post-Pruning
1. Convert tree to equivalent set of rules.
2. Prune each rule independently of others.
3. Sort final rules by their estimated accuracy, and consider them
in this sequence when classifying subsequent instances.
[Figure: the playtennis tree; e.g. the path Outlook = Sunny, Humidity =
High yields the rule IF Outlook = Sunny AND Humidity = High THEN
PlayTennis = No]
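Step 1 (one rule per root-to-leaf path) can be sketched as follows, again assuming the nested-dict trees used above:

def tree_to_rules(tree, conditions=()):
    if not isinstance(tree, dict):          # leaf: emit one rule
        return [(conditions, tree)]
    rules = []
    for value, sub in tree["children"].items():
        rules += tree_to_rules(sub,
                               conditions + ((tree["attribute"], value),))
    return rules

# For the playtennis tree this yields, among others:
# IF Outlook = Sunny AND Humidity = High THEN PlayTennis = No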
Attributes with Many Values
[Figure: a split on attribute Letter with one branch per value a, b, c, …, z]
• Problem:
– Such splits are not good: they fragment the data too quickly, leaving
insufficient data at the next level.
– Yet the reduction of impurity of such a test is often high (extreme
example: a split on the object id perfectly separates the training
data).
• Two solutions:
– Change the splitting criterion to penalize attributes with many
values
– Consider only binary splits
Attributes with Many Values
SplitInfo(S, A) = - Σ_{i=1}^{c} (|S_i| / |S|) · log2(|S_i| / |S|)

GainRatio(S, A) = Gain(S, A) / SplitInfo(S, A)

where S_1, …, S_c are the subsets of S induced by the c values of A.
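A short self-contained sketch of both quantities (helper names our own); a split on a unique id gets high Gain but also maximal SplitInfo, so the ratio penalizes it:

from collections import Counter
from math import log2

def _entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain_ratio(rows, attr, target):
    # Assumes A takes at least two values on S, so SplitInfo > 0.
    n = len(rows)
    parts = {}
    for r in rows:                      # S_i: one part per value of A
        parts.setdefault(r[attr], []).append(r[target])
    gain = _entropy([r[target] for r in rows]) - sum(
        len(p) / n * _entropy(p) for p in parts.values())
    split_info = -sum(len(p) / n * log2(len(p) / n)
                      for p in parts.values())
    return gain / split_info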