
Decision Trees

Evgueni Smirnov
Topics
[Overview diagram of machine-learning topics: supervised learning (statistical methods – regression, Bayes, linear and logistic regression; artificial neural networks; support vector machines; rule induction; structural rule-based learning; decision trees), instance- or distance-based learning (k-nearest neighbours, KD-trees, radial basis functions), unsupervised learning (clustering, k-means, expectation maximization, recommender systems), and reinforcement learning (Q-learning, SARSA).]
Overview
• Decision Trees for Classification
– Definition
– Classification Problems for Decision Trees
– Entropy and Information Gain
– Learning Decision Trees
– Overfitting and Pruning
– Handling Continuous-Valued Attributes
– Handling Missing Attribute Values
Decision Trees for Classification
• A decision tree is a tree where:
– Each interior node tests an attribute
– Each branch corresponds to an attribute value
– Each leaf node is labelled with a class (class node)

A1
  a11 -> A2
           a21 -> c1
           a22 -> c2
  a12 -> c1
  a13 -> A3
           a31 -> c2
           a32 -> c1
A simple database: playtennis
Day   Outlook   Temperature  Humidity  Wind    PlayTennis
D1    Sunny     Hot          High      Weak    No
D2    Sunny     Hot          High      Strong  No
D3    Overcast  Hot          High      Weak    Yes
D4    Rain      Mild         High      Weak    Yes
D5    Rain      Cool         Normal    Weak    Yes
D6    Rain      Cool         Normal    Strong  No
D7    Overcast  Cool         Normal    Strong  Yes
D8    Sunny     Mild         High      Weak    No
D9    Sunny     Cool         Normal    Weak    Yes
D10   Rain      Mild         Normal    Weak    Yes
D11   Sunny     Mild         Normal    Strong  Yes
D12   Overcast  Mild         High      Strong  Yes
D13   Overcast  Hot          Normal    Weak    Yes
D14   Rain      Mild         High      Strong  No
Decision Tree For Playing Tennis
Outlook
  sunny    -> Humidity
                high   -> no
                normal -> yes
  overcast -> yes
  rainy    -> Windy
                false -> yes
                true  -> no
Classification with Decision Trees
Classify(x: instance, node: variable containing a node of DT)
• if node is a classification node then
– return the class of node;
• else
– determine the child of node that matches x;
– return Classify(x, child).

A1
  a11 -> A2
           a21 -> c1
           a22 -> c2
  a12 -> c1
  a13 -> A3
           a31 -> c2
           a32 -> c1
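A minimal Python sketch of this recursive procedure (an illustration added here, not part of the original slides). The dict-based tree representation is an assumption: a leaf is {"class": c}, an interior node is {"attribute": A, "children": {value: subtree}}.

# Sketch of Classify, assuming the hypothetical dict tree format above.
def classify(x, node):
    """Recursively route instance x (a dict: attribute -> value) to a leaf."""
    if "class" in node:                     # classification (leaf) node
        return node["class"]
    value = x[node["attribute"]]            # test this node's attribute on x
    child = node["children"][value]         # follow the branch that matches x
    return classify(x, child)

# The generic tree from the figure above:
tree = {"attribute": "A1", "children": {
    "a11": {"attribute": "A2", "children": {"a21": {"class": "c1"},
                                            "a22": {"class": "c2"}}},
    "a12": {"class": "c1"},
    "a13": {"attribute": "A3", "children": {"a31": {"class": "c2"},
                                            "a32": {"class": "c1"}}},
}}
print(classify({"A1": "a11", "A2": "a22"}, tree))  # -> c2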
When to Consider Decision Trees
• Each instance is described by attributes with discrete values
(e.g. Outlook = Sunny, etc.)
• The classification is over discrete values (e.g. yes/no)
• It is okay to have disjunctive descriptions – the tree represents a
disjunction of conjunctions of attribute tests (one conjunction per path).
Any Boolean function can be represented!
• It is okay for the training data to contain errors – decision
trees are robust to classification errors in the training data.
• It is okay for the training data to contain missing values –
decision trees can be used even if instances have missing
attributes.
Decision Tree Learning

Basic Algorithm:
1. A ← the “best” decision attribute for a node N.
2. Assign A as decision attribute for the node N.
3. For each value of A, create new descendant of the node N.
4. Sort training examples to leaf nodes.
5. IF training examples perfectly classified, THEN STOP.
ELSE iterate over new leaf nodes
Decision Tree Learning
Split on Outlook:

Outlook = Sunny:                 Outlook = Overcast:              Outlook = Rain:
Temp  Hum     Wind    Play       Temp  Hum     Wind    Play       Temp  Hum     Wind    Play
Hot   High    Weak    No         Hot   High    Weak    Yes        Mild  High    Weak    Yes
Hot   High    Strong  No         Cool  Normal  Strong  Yes        Cool  Normal  Weak    Yes
Mild  High    Weak    No                                          Cool  Normal  Strong  No
Cool  Normal  Weak    Yes                                         Mild  Normal  Weak    Yes
Mild  Normal  Strong  Yes                                         Mild  High    Strong  No
Entropy
Let S be a sample of training examples, let p+ be the proportion of positive
examples in S, and let p− be the proportion of negative examples in S.
Then the entropy measures the impurity of S:

  E(S) = − p+ log2 p+ − p− log2 p−
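A small Python sketch of this formula (an added illustration, assuming the class labels are given as a plain list; it generalises to any number of classes):

import math

def entropy(labels):
    """E(S) = -sum over classes c of p_c * log2(p_c)."""
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

# The play-tennis target column has 9 "yes" and 5 "no":
play = ["yes"] * 9 + ["no"] * 5
print(round(entropy(play), 2))  # 0.94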
Entropy Example from the Dataset
In the Play Tennis dataset we have two target classes: yes and no.
Out of 14 instances, 9 are classified yes and the rest no.

  p_yes = −(9/14) · log2(9/14) ≈ 0.41
  p_no  = −(5/14) · log2(5/14) ≈ 0.53

  E(S) = p_yes + p_no ≈ 0.94

[The slide repeats the 14-instance play-tennis table alongside the computation, with Wind recorded as True/False.]
Information Gain
Information gain is the expected reduction in entropy caused by
partitioning the instances of S according to a given attribute A:

  Gain(S, A) = E(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) · E(S_v)

where S_v = { s ∈ S | A(s) = v }.

[Diagram: S is partitioned by the values of A into subsets S_v1, S_v2, …]
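The same formula as a sketch, reusing the entropy() function from the earlier block; representing each training example as a dict (attribute -> value) is an assumption for illustration:

def information_gain(rows, attribute, target):
    """Gain(S, A) = E(S) - sum over v of |S_v|/|S| * E(S_v)."""
    gain = entropy([r[target] for r in rows])
    for v in set(r[attribute] for r in rows):
        subset = [r[target] for r in rows if r[attribute] == v]
        gain -= len(subset) / len(rows) * entropy(subset)
    return gain

# With the 14 play-tennis rows as dicts, information_gain(rows, "Outlook", "Play")
# comes out at about 0.246 bits, the highest value of the four attributes.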
Example
The training data is split on Outlook as on the previous slide.
The Sunny subset contains 2 positive and 3 negative instances, so E(S_sunny) = 0.970.

Which attribute should be tested here?

  Gain(S_sunny, Humidity)    = 0.970 − (3/5)·0.0 − (2/5)·0.0             = 0.970
  Gain(S_sunny, Temperature) = 0.970 − (2/5)·0.0 − (2/5)·1.0 − (1/5)·0.0 = 0.570
  Gain(S_sunny, Wind)        = 0.970 − (2/5)·1.0 − (3/5)·0.918           = 0.019

Humidity has the highest gain, so it is chosen as the test for the Sunny branch.
ID3 Algorithm
Informally:
– Determine the attribute with the highest
information gain on the training set.
– Use this attribute as the root, create a branch for
each of the values the attribute can have.
– For each branch, repeat the process with the subset of the training
set that is sorted down that branch (see the sketch below).
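A compact sketch of this informal procedure, reusing entropy() and information_gain() from the earlier blocks (the dict row/tree representation remains an assumption):

def id3(rows, attributes, target):
    """Grow a tree in the same dict format used by the classify() sketch."""
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1:                 # all instances agree: leaf
        return {"class": labels[0]}
    if not attributes:                        # nothing left to test: majority leaf
        return {"class": max(set(labels), key=labels.count)}
    # pick the attribute with the highest information gain
    best = max(attributes, key=lambda a: information_gain(rows, a, target))
    # one branch per observed value, built recursively on the matching subset
    children = {}
    for v in set(r[best] for r in rows):
        subset = [r for r in rows if r[best] == v]
        children[v] = id3(subset, [a for a in attributes if a != best], target)
    return {"attribute": best, "children": children}

# id3(rows, ["Outlook", "Temperature", "Humidity", "Wind"], "Play") on the
# play-tennis rows reproduces the tree shown earlier (Outlook at the root).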
Hypothesis Space Search in ID3

• The hypothesis space is the set of all decision trees defined over the
given set of attributes.
• ID3’s hypothesis space is a complete space; i.e., the target description
is in it!
• ID3 performs a simple-to-complex, hill-climbing search through this space.
Hypothesis Space Search in ID3
• The evaluation function is
the information gain.
• ID3 maintains only a single
current decision tree.
• ID3 performs no
backtracking in its search.
• ID3 uses all training
instances at each step of the
search.
Inductive Bias in ID3
• Preference for short trees
• Preference for trees with high
information gain attributes near
the root.
• Bias is a preference for some hypotheses, not a restriction on the
hypothesis space.
Occam’s Razor
• Preference for simple models over complex
models is quite generally used in machine
learning
• Similar principle in science: Occam’s Razor
– roughly: do not make things more complicated
than necessary
• Reasoning, in the case of decision trees: more
complex trees have higher probability of
overfitting the data set
Continuous Attributes
• Example: temperature as a number instead of a discrete
value
• Two solutions:
– Pre-discretize: Cold if Temperature<70, Mild between 70 and
75, Hot if Temperature>75
– Discretize during tree growing:

[Example test: Temperature ≤ 65.4 vs. > 65.4, each branch ending in a class leaf (no / yes).]

• How do we find the cut-point?
Continuous Attributes
Unsorted:  Temp  80  85  83  75  68  65  64  72  75  70  69  72  81  71
           Play  No  No  Yes Yes Yes No  Yes No  Yes Yes Yes Yes Yes No

Sort by temperature and evaluate candidate cut-points:

  Temp  64   65   68   69   70   71   72   72   75   75   80   81   83   85
  Play  Yes  No   Yes  Yes  Yes  No   No   Yes  Yes  Yes  No   Yes  Yes  No

  Temp < 64.5   I = 0.048
  Temp < 66.5   I = 0.010
  Temp < 70.5   I = 0.045
  Temp < 71.5   I = 0.001
  Temp < 77.5   I = 0.025
  Temp < 80.5   I = 0.000
  Temp < 84     I = 0.113
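One way to pick the cut-point, as a sketch reusing entropy() from the earlier block: sort the values, place a candidate threshold midway between each pair of consecutive distinct values, and keep the threshold with the highest information gain.

def best_cut_point(values, labels):
    """Return (threshold, gain) for the binary split with the highest gain."""
    pairs = sorted(zip(values, labels))
    base = entropy([l for _, l in pairs])
    best_t, best_gain = None, -1.0
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                                  # no threshold between equal values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2       # candidate cut-point
        left = [l for v, l in pairs if v < t]
        right = [l for v, l in pairs if v >= t]
        gain = base - len(left) / len(pairs) * entropy(left) \
                    - len(right) / len(pairs) * entropy(right)
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

temps = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
plays = ["Yes", "No", "Yes", "Yes", "Yes", "No", "No", "Yes",
         "Yes", "Yes", "No", "Yes", "Yes", "No"]
print(best_cut_point(temps, plays))  # roughly (84.0, 0.113), as in the table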
Continuous attribute
[Figure: a 2-D dataset over attributes A1 and A2, with an axis-parallel decision tree (tests such as A2 < 0.33, A1 < 0.91, A2 < 0.91, A1 < 0.23, A2 < 0.75, A2 < 0.49, A2 < 0.65) carving the unit square into "good" / "bad" regions.]

Decision trees are non-linear classifiers!
Oblique Decision Trees

[Figure: a single oblique test x + y < 1 separates the region Class = + from the region Class = −.]

• Test condition may involve multiple attributes
• More expressive representation
• Finding the optimal test condition is computationally expensive
Posterior Class Probabilities
Outlook
  sunny:    2 pos and 3 neg  ->  P_pos = 0.4, P_neg = 0.6
  overcast: 2 pos and 0 neg  ->  P_pos = 1.0, P_neg = 0.0
  rainy:    Windy
              false: 0 pos and 2 neg  ->  P_pos = 0.0, P_neg = 1.0
              true:  3 pos and 0 neg  ->  P_pos = 1.0, P_neg = 0.0
Overfitting
Definition: Given a hypothesis space H, a hypothesis h ∈ H is said to overfit
the training data if there exists some hypothesis h' ∈ H such that h has a
smaller error than h' over the training instances, but h' has a smaller error
than h over the entire distribution of instances.
Implications of Overfitting
Outlook
  sunny    -> Humidity
                high   -> no
                normal -> yes
  overcast -> yes
  rainy    -> Windy
                false -> yes
                true  -> no

• Noisy training instances. Consider a noisy training example:

    Outlook = Sunny; Temp = Hot; Humidity = Normal; Wind = True; PlayTennis = No

  This noisy instance conflicts with the training instances:

    Outlook = Sunny; Temp = Cool; Humidity = Normal; Wind = False; PlayTennis = Yes
    Outlook = Sunny; Temp = Mild; Humidity = Normal; Wind = True; PlayTennis = Yes
Implications of Overfitting
Fitting the noisy example grows the tree further:

Outlook
  sunny    -> Humidity
                high   -> no
                normal -> Windy
                            false -> yes
                            true  -> Temp
                                       mild -> yes
                                       hot  -> no
                                       cool -> ?
  overcast -> yes
  rainy    -> Windy
                false -> yes
                true  -> no

Noisy example:   Outlook = Sunny; Temp = Hot; Humidity = Normal; Wind = True; PlayTennis = No
Clean examples:  Outlook = Sunny; Temp = Cool; Humidity = Normal; Wind = False; PlayTennis = Yes
                 Outlook = Sunny; Temp = Mild; Humidity = Normal; Wind = True; PlayTennis = Yes
Implications of Overfitting
• A small number of instances is associated with leaf nodes. In this case it
is possible for coincidental regularities to occur that are unrelated to the
actual target concept.

[Figure: a scatter of + and − instances; an isolated + among the − region induces a small leaf and an area with probably wrong predictions.]
Approaches to Avoiding Overfitting

• Pre-pruning: stop growing the tree earlier, before it reaches the point
where it perfectly classifies the training data.

• Post-pruning: allow the tree to overfit the data, and then post-prune
the tree.
Pre-pruning
• It is difficult to decide when to stop growing the tree.
• A possible strategy is to stop when a leaf node gets fewer than m training
instances. Here is an example for m = 5:

Before:                              After pre-pruning with m = 5:
Outlook                              Outlook
  Sunny    -> Humidity                 Sunny    -> no  (5 instances)
                High   -> no  (3)      Overcast -> ?   (2 instances)
                Normal -> yes (2)      Rainy    -> yes (5 instances)
  Overcast -> yes (2)
  Rainy    -> Windy
                False -> yes (3)
                True  -> no  (2)
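For comparison, the same idea is available off the shelf; a hedged illustration with scikit-learn (assuming it is installed), where min_samples_leaf plays the role of m:

from sklearn.tree import DecisionTreeClassifier

# Refuse to create leaves with fewer than m = 5 training instances.
clf = DecisionTreeClassifier(criterion="entropy", min_samples_leaf=5)
# clf.fit(X, y) on a numerically encoded play-tennis matrix would then stop
# splitting wherever a split would produce a leaf with fewer than 5 instances.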
Validation Set
• A validation set is a set of instances used to evaluate the utility of
nodes in decision trees. The validation set has to be chosen so that it is
unlikely to suffer from the same errors or fluctuations as the training set.
• Usually, before pruning, the training data is split randomly into a
growing set and a validation set.
Reduced-Error Pruning
(Sub-tree replacement)
Split the data into a growing set and a validation set.

Pruning a decision node d consists of:
1. removing the subtree rooted at d,
2. making d a leaf node,
3. assigning d the most common classification of the training instances
   associated with d.

Outlook
  sunny    -> Humidity
                high   -> no   (3 instances)
                normal -> yes  (2 instances)
  overcast -> yes
  rainy    -> Windy
                false -> yes
                true  -> no

Accuracy of the tree on the validation set is 90%.
Reduced-Error Pruning
(Sub-tree replacement)
Split the data into a growing set and a validation set.

Pruning a decision node d consists of:
1. removing the subtree rooted at d,
2. making d a leaf node,
3. assigning d the most common classification of the training instances
   associated with d.

Outlook
  sunny    -> no   (Humidity subtree replaced by a leaf)
  overcast -> yes
  rainy    -> Windy
                false -> yes
                true  -> no

Accuracy of the tree on the validation set is 92.4%.
Reduced-Error Pruning
(Sub-tree replacement)
Split the data into a growing set and a validation set.

Pruning a decision node d consists of:
1. removing the subtree rooted at d,
2. making d a leaf node,
3. assigning d the most common classification of the training instances
   associated with d.

Do until further pruning is harmful:
1. Evaluate the impact on the validation set of pruning each possible node
   (plus those below it).
2. Greedily remove the one that most improves validation-set accuracy.

Outlook
  sunny    -> no
  overcast -> yes
  rainy    -> Windy
                false -> yes
                true  -> no

Accuracy of the tree on the validation set is 92.4%.
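A simplified bottom-up sketch of sub-tree replacement, reusing classify() and the dict tree format from the earlier sketches. The per-node "majority" field (most common training class at that node) is a hypothetical addition, and every validation value is assumed to have a matching branch.

def accuracy(tree, rows, target):
    """Fraction of rows classified correctly, using classify() from earlier."""
    return sum(classify(r, tree) == r[target] for r in rows) / len(rows)

def reduced_error_prune(node, root, validation_rows, target):
    """Bottom-up: tentatively replace each subtree by its majority leaf and
    keep the replacement unless validation accuracy drops."""
    if "class" in node:
        return
    for child in node["children"].values():
        reduced_error_prune(child, root, validation_rows, target)
    before = accuracy(root, validation_rows, target)
    saved = dict(node)                      # remember the subtree
    node.clear()
    node["class"] = saved["majority"]       # prune: most common training class here
    if accuracy(root, validation_rows, target) < before:
        node.clear()
        node.update(saved)                  # pruning hurt: restore the subtree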
Reduced-Error Pruning
(Sub-tree replacement)
T1 (unpruned):
Outlook
  Sunny    -> Humidity
                High   -> no
                Normal -> Temp.
                            Mild       -> no
                            Cool, Hot  -> yes
  Overcast -> yes
  Rain     -> Wind
                Strong -> no
                Weak   -> yes
ErrorGS = 0%,  ErrorVS = 10%

T2 (Temp. subtree replaced):
Outlook
  Sunny    -> Humidity
                High   -> no
                Normal -> yes
  Overcast -> yes
  Rain     -> Wind
                Strong -> no
                Weak   -> yes
ErrorGS = 6%,  ErrorVS = 8%

T3 (Humidity subtree replaced):
Outlook
  Sunny    -> no
  Overcast -> yes
  Rain     -> Wind
                Strong -> no
                Weak   -> yes
ErrorGS = 13%, ErrorVS = 15%

T4 (Wind subtree replaced):
Outlook
  Sunny    -> no
  Overcast -> yes
  Rain     -> yes
ErrorGS = 27%, ErrorVS = 25%

T5 (root replaced):
yes
ErrorGS = 33%, ErrorVS = 35%
Reduced Error Pruning Example
Rule Post-Pruning
1. Convert tree to equivalent set of rules.
2. Prune each rule independently of others.
3. Sort final rules by their estimated accuracy, and consider them
in this sequence when classifying subsequent instances.
Outlook
  sunny    -> Humidity
                high   -> no
                normal -> yes
  overcast -> yes
  rainy    -> Windy
                false -> yes
                true  -> no

IF (Outlook = Sunny) & (Humidity = High)   THEN PlayTennis = No
IF (Outlook = Sunny) & (Humidity = Normal) THEN PlayTennis = Yes
……….
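A small sketch of step 1 (tree -> rules), again using the hypothetical dict tree format from the earlier sketches:

def tree_to_rules(node, conditions=()):
    """One (conditions, class) rule per root-to-leaf path."""
    if "class" in node:
        return [(conditions, node["class"])]
    rules = []
    for value, child in node["children"].items():
        rules += tree_to_rules(child, conditions + ((node["attribute"], value),))
    return rules

# For the play-tennis tree this yields, among others,
# ((("Outlook", "sunny"), ("Humidity", "high")), "no"),
# i.e. IF Outlook = Sunny & Humidity = High THEN PlayTennis = No.
# Each rule can then be pruned condition-by-condition on a validation set.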
Attributes with Many Values
[Example: a test on the attribute Letter creates one branch per value a, b, c, …, z.]

• Problem:
  – Such splits are not good: they fragment the data too quickly, leaving
    insufficient data at the next level.
  – The reduction of impurity of such a test is often high (example: a split
    on an object ID).
• Two solutions:
– Change the splitting criterion to penalize attributes with many
values
– Consider only binary splits
Attributes with Many Values
  SplitInfo(S, A) = − Σ_{i=1..c} (|S_i| / |S|) · log2(|S_i| / |S|)

  GainRatio(S, A) = Gain(S, A) / SplitInfo(S, A)

• Example: Outlook in the play-tennis data
  – InfoGain(Outlook) = 0.246
  – SplitInformation(Outlook) = 1.577
  – GainRatio(Outlook) = 0.246 / 1.577 = 0.156 < 0.246
• Problem: the gain ratio favours unbalanced tests.
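The gain-ratio computation itself, as a sketch reusing entropy() and information_gain() from the earlier blocks; SplitInfo is simply the entropy of the attribute-value distribution:

def gain_ratio(rows, attribute, target):
    """GainRatio(S, A) = Gain(S, A) / SplitInfo(S, A)."""
    split_info = entropy([r[attribute] for r in rows])   # SplitInfo(S, A)
    if split_info == 0:
        return 0.0            # attribute has a single value: no useful split
    return information_gain(rows, attribute, target) / split_info

# For Outlook on the play-tennis rows: 0.246 / 1.577 ≈ 0.156, as above.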
Missing Attribute Values
Strategies:

1. Assign the most common value of A among the other instances belonging to
   the same concept (class).
2. If node n tests attribute A, assign the most common value of A among the
   other instances sorted to node n.
3. If node n tests attribute A, assign a probability to each possible value
   of A, estimated from the observed frequencies of the values of A at node n.
   These probabilities are used (as fractional instance counts) in the
   information-gain measure:

     Gain(S, A) = E(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) · E(S_v)
Summary Points
1. Decision tree learning provides a practical method for
classification and regression.
2. DT algorithms search a complete hypothesis space.
3. The inductive bias of decision trees is preference bias.
4. Overfitting the training data is an important issue in
decision tree learning.
5. A large number of extensions of the DT learning algorithm
have been proposed for overfitting avoidance, handling
missing attributes, handling numerical attributes, etc.
References
• Mitchell, T. M. 1997. Machine Learning. New York: McGraw-Hill.
• Quinlan, J. R. 1986. Induction of Decision Trees. Machine Learning.
• Russell, S., and Norvig, P. 1995. Artificial Intelligence: A Modern
Approach. New Jersey: Prentice Hall.
Homework
1. Derive the complexity of the hypothesis space of decision trees
defined over n Boolean attributes. Indicate the complexity of the
algorithm that learns the minimal decision tree from this hypothesis
space.
2. Prove analytically that we don’t have to test an attribute twice along
a path in a decision tree.
3. Construct a data set that would cause ID3 to find a non-minimal
tree. Show this tree as well as the minimal tree that you can generate
by hand.
