
Decision trees

Machine Learning

Mario Martin
Decision trees: table of contents

Representational capacity of decision trees


Entropy and information gain
ID3: Building decision trees
Avoiding overfitting
Extensions:
How to deal with numerical data
How to avoid wide trees
Dealing with expensive attributes

Toy example: “PlayTennis” dataset

Day Outlook Temp. Humidity Wind PlayTennis


D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No

Decision tree for “PlayTennis”

Outlook?
  Sunny    -> Humidity?
                High   -> No
                Normal -> Yes
  Overcast -> Yes
  Rain     -> Wind?
                Strong -> No
                Weak   -> Yes

Decision tree for “PlayTennis”

(Same tree as on the previous slide.)

Every node asks something about one attribute
Every child is one possible answer
Leaves return the answer for the classification task

Decision tree for “PlayTennis”

Example instance:  Outlook=Sunny, Temperature=Hot, Humidity=High, Wind=Weak  ->  PlayTennis=No

Outlook?
  Sunny    -> Humidity?
                High   -> No    (the instance above follows this path)
                Normal -> Yes
  Overcast -> Yes
  Rain     -> Wind?
                Strong -> No
                Weak   -> Yes

DT can represent complex rules

Outlook?
  Sunny    -> Humidity?
                High   -> No
                Normal -> Yes
  Overcast -> Yes
  Rain     -> Wind?
                Strong -> No
                Weak   -> Yes

(Outlook=Sunny ∧ Humidity=Normal)
∨ (Outlook=Overcast)
∨ (Outlook=Rain ∧ Wind=Weak)
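
For illustration, the same tree written as nested conditionals in Python (a hypothetical, direct translation of the tree above; examples are assumed to be dicts of attribute values):

def play_tennis(x):
    # x is a dict such as {"Outlook": "Sunny", "Humidity": "Normal", "Wind": "Weak"}
    if x["Outlook"] == "Sunny":
        return "Yes" if x["Humidity"] == "Normal" else "No"
    if x["Outlook"] == "Overcast":
        return "Yes"
    # Outlook == "Rain"
    return "Yes" if x["Wind"] == "Weak" else "No"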
Learning Decision Trees

From a training set, many possible trees are consistent with the data

Do not learn just any tree, but the simplest one (simplicity helps generalization)
The simplest tree is the shortest one
One approach: build all possible trees and keep the shortest one
... but this approach requires too much computation time
ID3/C4.5 builds short decision trees more efficiently (although it does not guarantee the shortest one)

“Top-Down” induction of decision trees: ID3

1. Set node to the training examples and make it the root of the tree
2. A ← the "best" test attribute for the current node
3. For each value of attribute A, create a new descendant node in the decision tree
4. Split the examples in node among the descendants according to their value of A
5. If there remain leaf nodes in the tree with mixed positive and negative examples, set node to such a leaf and return to step 2
6. Otherwise, label each leaf of the tree with the class of the examples in the leaf, and return the tree
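
A minimal sketch of this procedure in Python (identifiers such as id3, best_attribute and entropy are illustrative, not from the slides; best_attribute uses the information-gain criterion introduced in the following slides):

from collections import Counter
from math import log2

def entropy(labels):
    """E(S) = -sum_i p_i log2 p_i over the class proportions in labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_attribute(examples, labels, attributes):
    """Step 2: pick the attribute with the highest information gain."""
    def gain(a):
        g = entropy(labels)
        for v in set(x[a] for x in examples):
            subset = [lab for x, lab in zip(examples, labels) if x[a] == v]
            g -= len(subset) / len(labels) * entropy(subset)
        return g
    return max(attributes, key=gain)

def id3(examples, labels, attributes):
    """examples: list of dicts attribute -> value; labels: list of classes."""
    if len(set(labels)) == 1:                      # pure node: stop
        return labels[0]
    if not attributes:                             # no tests left: majority class
        return Counter(labels).most_common(1)[0][0]
    a = best_attribute(examples, labels, attributes)
    branches = {}
    for v in set(x[a] for x in examples):          # steps 3-4: one child per value
        idx = [i for i, x in enumerate(examples) if x[a] == v]
        branches[v] = id3([examples[i] for i in idx],
                          [labels[i] for i in idx],
                          [b for b in attributes if b != a])
    return (a, branches)                           # node = (attribute, children)

Run on the PlayTennis table above, this sketch should reproduce the Outlook/Humidity/Wind tree shown earlier.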

ID3

Outlook?  [D1-, D2-, D3+, D4+, D5+, D6-, D7+, D8+, D9+, D10+, D11+, D12+, D13+, D14-]

ID3

Outlook?  [D1-, D2-, D3+, D4+, D5+, D6-, D7+, D8+, D9+, D10+, D11+, D12+, D13+, D14-]
  branches: Sunny, Overcast, Rain

ID3

Outlook?
  Sunny    -> [D1-, D2-, D8+, D9+, D11+]
  Overcast -> [D3+, D7+, D12+, D13+]
  Rain     -> [D4+, D5+, D6-, D10+, D14-]

ID3

Outlook?
  Sunny    -> Humidity?
                High   -> [D1-, D2-]
                Normal -> [D8+, D9+, D11+]
  Overcast -> [D3+, D7+, D12+, D13+]
  Rain     -> [D4+, D5+, D6-, D10+, D14-]

ID3

Outlook?
  Sunny    -> Humidity?
                High   -> [D1-, D2-]
                Normal -> [D8+, D9+, D11+]
  Overcast -> [D3+, D7+, D12+, D13+]
  Rain     -> Wind?
                Strong -> [D6-, D14-]
                Weak   -> [D4+, D5+, D10+]

ID3

Outlook?
  Sunny    -> Humidity?
                High   -> No   [D1-, D2-]
                Normal -> Yes  [D8+, D9+, D11+]
  Overcast -> Yes  [D3+, D7+, D12+, D13+]
  Rain     -> Wind?
                Strong -> No   [D6-, D14-]
                Weak   -> Yes  [D4+, D5+, D10+]
ID3

Outlook?
  Sunny    -> Humidity?
                High   -> No
                Normal -> Yes
  Overcast -> Yes
  Rain     -> Wind?
                Strong -> No
                Weak   -> Yes

How to choose the best attribute?

Goal: obtain the shortest tree
... which is equivalent to separating the + examples from the - examples as quickly as possible
Therefore, questions should be maximally discriminative
Information Gain...

What attribute is best?

A1=?  [29+, 35-]
  True  -> [29+, 0-]
  False -> [0+, 35-]

A2=?  [29+, 35-]
  True  -> [18+, 33-]
  False -> [11+, 2-]

What attribute is best?

A1=?  [29+, 35-]
  True  -> [21+, 5-]
  False -> [8+, 30-]

A2=?  [29+, 35-]
  True  -> [18+, 33-]
  False -> [11+, 2-]

Information theory

An answer gives us an amount of information proportional to the unpredictability of the answer
Example: tossing a coin (heads or tails)
Formally:

E(X) = - ∑i=1..n pi log(pi)

where pi is the probability of answer i, and n is the number of possible answers

The Information Gain obtained from an answer A is computed as the reduction of E after the answer (formula given below)

Binary answer case

E(A) = -p+ log2 p+ - p- log2 p-
     = -p+ log2 p+ - (1-p+) log2 (1-p+)

Binary answer case

For DTs, we assume:
p+ is the proportion of + examples in S
p- is the proportion of - examples in S
E(S) measures the "mixing" of examples in S:
E(S) = -p+ log2 p+ - p- log2 p-
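
A small sketch of this binary entropy in Python (the function name binary_entropy is illustrative):

from math import log2

def binary_entropy(p_pos):
    # E(S) = -p+ log2 p+ - p- log2 p-, with the convention 0 * log2(0) = 0
    if p_pos in (0.0, 1.0):
        return 0.0
    p_neg = 1.0 - p_pos
    return -p_pos * log2(p_pos) - p_neg * log2(p_neg)

# For example, a node with 9 positive and 5 negative examples:
# binary_entropy(9/14) ≈ 0.940, the value used in the PlayTennis slides below.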

Entropy

E(S), the entropy of S, measures the number of bits needed to encode the class (+ or -) of an example drawn at random from S, i.e. the information content of the current distribution of + and - examples in S. It is a measure of information content taken from the Information Theory field.

It can also be seen as a measure of disorder... that's why it is useful in ID3.

Information gain

Information Gain (IG) obtained from an answer is the difference between the information before and after the answer.

S = [29+, 35-]        current information: E(S)
A1=?
  True  -> S'  = [21+, 5-]
  False -> S'' = [8+, 30-]   information after answering A1: from E(S') and E(S'')

Information gain

E([29+,35-]) = -29/64 log2 29/64 - 35/64 log2 35/64 = 0.99

IG(S,A) = E(S) - ∑v∈values(A) |Sv|/|S| E(Sv)

A1=?  [29+, 35-]
  True  -> [21+, 5-]
  False -> [8+, 30-]

A2=?  [29+, 35-]
  True  -> [18+, 33-]
  False -> [11+, 2-]
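
Plugging these splits into the formula (worked out here for illustration, values rounded; the final numbers are not given on the slide):

IG(S, A1) = 0.99 - (26/64)*0.706 - (38/64)*0.742 ≈ 0.26
IG(S, A2) = 0.99 - (51/64)*0.937 - (13/64)*0.619 ≈ 0.12

so A1 would be preferred.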

Which is the best attribute to choose?

The one with highest information gain!!!


Play tennis example:

Choosing the attribute

S = [9+, 5-], E = 0.940

Humidity?
  High   -> [3+, 4-]  E = 0.985
  Normal -> [6+, 1-]  E = 0.592
G(S, Humidity) = 0.940 - (7/14)*0.985 - (7/14)*0.592 = 0.151

Wind?
  Weak   -> [6+, 2-]  E = 0.811
  Strong -> [3+, 3-]  E = 1.0
G(S, Wind) = 0.940 - (8/14)*0.811 - (6/14)*1.0 = 0.048
Choosing the attribute

S = [9+, 5-], E = 0.940

Outlook?
  Sunny    -> [2+, 3-]  E = 0.971
  Overcast -> [4+, 0-]  E = 0.0
  Rain     -> [3+, 2-]  E = 0.971
G(S, Outlook) = 0.940 - (5/14)*0.971 - (4/14)*0.0 - (5/14)*0.971 = 0.247

Choosing next attribute

Outlook?  [D1, D2, …, D14]  [9+, 5-]
  Sunny    -> Ssunny = [D1, D2, D8, D9, D11]  [2+, 3-]  ?
  Overcast -> [D3, D7, D12, D13]              [4+, 0-]  Yes
  Rain     -> [D4, D5, D6, D10, D14]          [3+, 2-]  ?

Gain(Ssunny, Humidity) = 0.970 - (3/5)*0.0 - (2/5)*0.0             = 0.970
Gain(Ssunny, Temp.)    = 0.970 - (2/5)*0.0 - (2/5)*1.0 - (1/5)*0.0 = 0.570
Gain(Ssunny, Wind)     = 0.970 - (2/5)*1.0 - (3/5)*0.918           = 0.019

ID3 Algorithm

Outlook?
  Sunny    -> Humidity?
                High   -> No   [D1, D2]
                Normal -> Yes  [D8, D9, D11]
  Overcast -> Yes  [D3, D7, D12, D13]
  Rain     -> Wind?
                Strong -> No   [D6, D14]
                Weak   -> Yes  [D4, D5, D10]

[Possible issues while building the tree]

What if there are no examples for some value of the chosen attribute when expanding a node? Do not generate that branch; at test time, follow the most popular branch instead.
When a node cannot be expanded further but still contains a mixture of + and - examples, label it with the majority class.
Overfitting

Overfitting in DTs

Decision trees tend to overfit the data, since we can keep partitioning the examples until every leaf classifies its training examples perfectly
As we descend the tree, decisions are taken with less and less data
We can regularize learning by controlling the parameters that define the complexity of the tree (see the sketch below):
  Minimum number of examples in a leaf
  Number of nodes/leaves in the tree
  Maximum depth of the branches of the tree
  Fixing a threshold for the heuristic
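
For illustration, these complexity controls map onto hyperparameters of scikit-learn's DecisionTreeClassifier (a hedged sketch; the numeric values are arbitrary examples, not recommendations):

from sklearn.tree import DecisionTreeClassifier

# Each argument corresponds to one of the regularization controls above.
clf = DecisionTreeClassifier(
    criterion="entropy",          # use information gain, as in ID3
    min_samples_leaf=5,           # minimum number of examples in a leaf
    max_leaf_nodes=20,            # bound on the number of leaves
    max_depth=4,                  # maximum depth of the branches
    min_impurity_decrease=0.01,   # threshold on the heuristic (impurity drop)
)
# clf.fit(X_train, y_train) would then learn a complexity-limited tree
# (X_train, y_train are placeholders for your training data).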

Overfitting in DTs

An alternative is to grow the full (overfitted) tree and then prune the result: estimate the generalization error of the leaves and delete the nodes whose expected error is larger than that of not having them (post-pruning)

Overfitting in DTs

[Figure: error versus number of nodes, plotted for the training set and the test set.]

Attributes

Undesired effect: bias towards attributes with a large number of different values in categorical variables
Attributes with numeric values?
Discretize... but how to define the thresholds?

Categorical Variables

Information Gain has a preference towards attributes with a larger number of values
This bias arises because having more possible answers sorts the data more (in fact, such an answer gives more IG, but more information is also needed to answer the question)

In practical implementations, categorical attributes are assumed to be binary
Attributes with more than two values are transformed using one-hot encoding, which makes decision trees binary trees in practice (see the sketch below)
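
A minimal illustration of this transformation with pandas (pandas.get_dummies; the column name reuses the PlayTennis attribute for concreteness):

import pandas as pd

# A three-valued categorical attribute...
df = pd.DataFrame({"Outlook": ["Sunny", "Overcast", "Rain", "Sunny"]})

# ...becomes three binary attributes, each testable with a yes/no split:
# Outlook_Overcast, Outlook_Rain, Outlook_Sunny
# (whether the dummies are booleans or 0/1 depends on the pandas version).
print(pd.get_dummies(df, columns=["Outlook"]))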

Attributes with numeric values

Given a continuous attribute A, consider all different splits [A[x] < vi] versus [A[x] >= vi]:
  Start from the sorted sequence of values of A: {v1, v2, ..., vn}
  Consider each value vi as a candidate threshold. For each different split, compute the information gain. (Note: there are at most n-1 possible splits.)
  Turn the split with the highest IG into the boolean test for the node. Data will be split into [A[x] < vi] and [A[x] >= vi].

Attribute A cannot be removed for future splits

Of course, this increases the computation time a lot (see the sketch below)
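
A sketch of this threshold search in Python (function names are illustrative; the example data at the end is hypothetical):

from math import log2

def entropy(labels):
    # E(S) over the class proportions in labels, with 0 * log2(0) = 0
    n = len(labels)
    probs = (labels.count(c) / n for c in set(labels))
    return -sum(p * log2(p) for p in probs if p > 0)

def best_numeric_split(values, labels):
    # Try each observed value as a threshold and keep the one with the highest IG.
    base = entropy(labels)
    pairs = sorted(zip(values, labels))
    best_gain, best_thr = -1.0, None
    for i in range(1, len(pairs)):               # at most n-1 candidate splits
        thr = pairs[i][0]
        if thr == pairs[i - 1][0]:
            continue                              # identical values give no new split
        left = [lab for v, lab in pairs if v < thr]
        right = [lab for v, lab in pairs if v >= thr]
        gain = base - (len(left) / len(pairs)) * entropy(left) \
                    - (len(right) / len(pairs)) * entropy(right)
        if gain > best_gain:
            best_gain, best_thr = gain, thr
    return best_thr, best_gain

# e.g. splitting a hypothetical Temperature attribute into Temp < thr vs Temp >= thr:
# best_numeric_split([64, 65, 68, 69, 70, 71], ["Yes", "No", "Yes", "Yes", "Yes", "No"])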

Decision Trees for regression
Similar to classification, but minimize the sum of squared errors relative to the average target value in each induced partition (see the sketch below).
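
A sketch of that split criterion in Python (illustrative names; a leaf predicts the mean target of its partition):

def sse(targets):
    # Sum of squared errors relative to the average target value of the partition
    mean = sum(targets) / len(targets)
    return sum((t - mean) ** 2 for t in targets)

def split_score(left_targets, right_targets):
    # A regression tree would pick the split minimizing this total SSE
    return sse(left_targets) + sse(right_targets)

# Example: splitting targets [1.0, 1.2, 5.0, 5.3] into the first two and the last two
# gives a much lower score than any split that mixes the two groups.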

Advantages/Problems with DTs

Advantages:
  Their computational complexity is low
  They perform automatic feature selection: only the attributes relevant for the task are used
  They can be interpreted more easily than other models
Drawbacks:
  They have large variance: slight changes in the sample can result in a completely different model
  They are sensitive to noise and mislabeled examples (regularization helps)
  They approximate concepts by partitions parallel to the attribute axes, so they can return complex models if the classification borders are oblique

