
Decision trees

Machine Learning

Mario Martin
Decision trees: table of contents

Representational capacity of decision trees


Entropy and information gain
ID3: Building decision trees
Avoiding overfitting
Extensions:
How to deal with numerical data
How to avoid wide trees
Dealing with expensive attributes

Toy example: “PlayTennis” dataset

Day Outlook Temp. Humidity Wind PlayTennis


D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No

Decision tree for “PlayTennis”

Outlook?
  Sunny    -> Humidity?
                High   -> No
                Normal -> Yes
  Overcast -> Yes
  Rain     -> Wind?
                Strong -> No
                Weak   -> Yes

Decision tree for “PlayTennis”

(Same tree as on the previous slide.)

Every node asks something about one attribute
Every child is one possible answer
Leaves return the answer for the classification task

Decision tree for “PlayTennis”

Example instance:  Outlook=Sunny, Temperature=Hot, Humidity=High, Wind=Weak  ->  PlayTennis=No

Outlook?
  Sunny    -> Humidity?
                High   -> No    (the instance above follows this path)
                Normal -> Yes
  Overcast -> Yes
  Rain     -> Wind?
                Strong -> No
                Weak   -> Yes

DT can represent complex rules

Outlook?
  Sunny    -> Humidity?
                High   -> No
                Normal -> Yes
  Overcast -> Yes
  Rain     -> Wind?
                Strong -> No
                Weak   -> Yes

(Outlook=Sunny ∧ Humidity=Normal)
∨ (Outlook=Overcast)
∨ (Outlook=Rain ∧ Wind=Weak)
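
For illustration, the same tree written as nested conditionals in Python (a hypothetical, direct translation of the tree above; examples are assumed to be dicts of attribute values):

def play_tennis(x):
    # x is a dict such as {"Outlook": "Sunny", "Humidity": "Normal", "Wind": "Weak"}
    if x["Outlook"] == "Sunny":
        return "Yes" if x["Humidity"] == "Normal" else "No"
    if x["Outlook"] == "Overcast":
        return "Yes"
    # Outlook == "Rain"
    return "Yes" if x["Wind"] == "Weak" else "No"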
Learning Decision Trees

From a training set, many possible trees are consistent with the data

Do not learn just any tree, but the simplest one (simplicity helps generalization)
The simplest tree is the shortest one
One approach: build all possible trees and keep the shortest one
... but this approach requires too much computation time
ID3/C4.5 builds short decision trees more efficiently (although it does not guarantee the shortest one)

“Top-Down” induction of decision trees: ID3

1. Set node to the training examples and make it the root of the tree
2. A ← the "best" test attribute for the current node
3. For each value of attribute A, create a new descendant node in the decision tree
4. Split the examples in node among the descendants according to their value of A
5. If there remain leaf nodes in the tree with mixed positive and negative examples, set node to such a leaf and return to step 2
6. Otherwise, label each leaf of the tree with the class of the examples in the leaf, and return the tree
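
A minimal sketch of this procedure in Python (identifiers such as id3, best_attribute and entropy are illustrative, not from the slides; best_attribute uses the information-gain criterion introduced in the following slides):

from collections import Counter
from math import log2

def entropy(labels):
    """E(S) = -sum_i p_i log2 p_i over the class proportions in labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_attribute(examples, labels, attributes):
    """Step 2: pick the attribute with the highest information gain."""
    def gain(a):
        g = entropy(labels)
        for v in set(x[a] for x in examples):
            subset = [lab for x, lab in zip(examples, labels) if x[a] == v]
            g -= len(subset) / len(labels) * entropy(subset)
        return g
    return max(attributes, key=gain)

def id3(examples, labels, attributes):
    """examples: list of dicts attribute -> value; labels: list of classes."""
    if len(set(labels)) == 1:                      # pure node: stop
        return labels[0]
    if not attributes:                             # no tests left: majority class
        return Counter(labels).most_common(1)[0][0]
    a = best_attribute(examples, labels, attributes)
    branches = {}
    for v in set(x[a] for x in examples):          # steps 3-4: one child per value
        idx = [i for i, x in enumerate(examples) if x[a] == v]
        branches[v] = id3([examples[i] for i in idx],
                          [labels[i] for i in idx],
                          [b for b in attributes if b != a])
    return (a, branches)                           # node = (attribute, children)

Run on the PlayTennis table above, this sketch should reproduce the Outlook/Humidity/Wind tree shown earlier.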

ID3

Outlook?  [D1-, D2-, D3+, D4+, D5+, D6-, D7+, D8+, D9+, D10+, D11+, D12+, D13+, D14-]

ID3

Outlook?  [D1-, D2-, D3+, D4+, D5+, D6-, D7+, D8+, D9+, D10+, D11+, D12+, D13+, D14-]
  branches: Sunny, Overcast, Rain

ID3

Outlook?
  Sunny    -> [D1-, D2-, D8+, D9+, D11+]
  Overcast -> [D3+, D7+, D12+, D13+]
  Rain     -> [D4+, D5+, D6-, D10+, D14-]

ID3

Outlook?
  Sunny    -> Humidity?
                High   -> [D1-, D2-]
                Normal -> [D8+, D9+, D11+]
  Overcast -> [D3+, D7+, D12+, D13+]
  Rain     -> [D4+, D5+, D6-, D10+, D14-]

ID3

Outlook?
  Sunny    -> Humidity?
                High   -> [D1-, D2-]
                Normal -> [D8+, D9+, D11+]
  Overcast -> [D3+, D7+, D12+, D13+]
  Rain     -> Wind?
                Strong -> [D6-, D14-]
                Weak   -> [D4+, D5+, D10+]

ID3

Outlook?
  Sunny    -> Humidity?
                High   -> No   [D1-, D2-]
                Normal -> Yes  [D8+, D9+, D11+]
  Overcast -> Yes  [D3+, D7+, D12+, D13+]
  Rain     -> Wind?
                Strong -> No   [D6-, D14-]
                Weak   -> Yes  [D4+, D5+, D10+]
ID3

Outlook?
  Sunny    -> Humidity?
                High   -> No
                Normal -> Yes
  Overcast -> Yes
  Rain     -> Wind?
                Strong -> No
                Weak   -> Yes

How to choose the best attribute?

Goal: obtain the shortest tree
... which is equivalent to separating the + examples from the - examples as quickly as possible
Therefore, questions should be maximally discriminative
Information Gain...

What attribute is best?

A1=?  [29+, 35-]
  True  -> [29+, 0-]
  False -> [0+, 35-]

A2=?  [29+, 35-]
  True  -> [18+, 33-]
  False -> [11+, 2-]

What attribute is best?

A1=?  [29+, 35-]
  True  -> [21+, 5-]
  False -> [8+, 30-]

A2=?  [29+, 35-]
  True  -> [18+, 33-]
  False -> [11+, 2-]

Information theory

An answer gives us an amount of information proportional to the unpredictability of the answer
Example: tossing a coin (heads or tails)
Formally:

E(X) = - ∑i=1..n pi log(pi)

where pi is the probability of answer i, and n is the number of possible answers

The Information Gain obtained from an answer A is computed as the reduction of E after the answer (formula given below)

Binary answer case

E(A) = -p+ log2 p+ - p- log2 p-
     = -p+ log2 p+ - (1-p+) log2 (1-p+)

Binary answer case

For DTs, we assume:
p+ is the proportion of + examples in S
p- is the proportion of - examples in S
E(S) measures the "mixing" of examples in S:
E(S) = -p+ log2 p+ - p- log2 p-
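
A small sketch of this binary entropy in Python (the function name binary_entropy is illustrative):

from math import log2

def binary_entropy(p_pos):
    # E(S) = -p+ log2 p+ - p- log2 p-, with the convention 0 * log2(0) = 0
    if p_pos in (0.0, 1.0):
        return 0.0
    p_neg = 1.0 - p_pos
    return -p_pos * log2(p_pos) - p_neg * log2(p_neg)

# For example, a node with 9 positive and 5 negative examples:
# binary_entropy(9/14) ≈ 0.940, the value used in the PlayTennis slides below.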

Entropy

E(S), the entropy of S, measures the number of bits needed to encode the class (+ or -) of an example drawn at random from S, i.e. the information content of the current distribution of + and - examples in S. It is a measure of information content taken from the Information Theory field.

It can also be seen as a measure of disorder... that's why it is useful in ID3.

Information gain

Information Gain (IG) obtained from an answer is the difference between the information before and after the answer.

S = [29+, 35-]        current information: E(S)
A1=?
  True  -> S'  = [21+, 5-]
  False -> S'' = [8+, 30-]   information after answering A1: from E(S') and E(S'')

Information gain

E([29+,35-]) = -29/64 log2 29/64 - 35/64 log2 35/64 = 0.99

IG(S,A) = E(S) - ∑v∈values(A) |Sv|/|S| E(Sv)

A1=?  [29+, 35-]
  True  -> [21+, 5-]
  False -> [8+, 30-]

A2=?  [29+, 35-]
  True  -> [18+, 33-]
  False -> [11+, 2-]
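
Plugging these splits into the formula (worked out here for illustration, values rounded; the final numbers are not given on the slide):

IG(S, A1) = 0.99 - (26/64)*0.706 - (38/64)*0.742 ≈ 0.26
IG(S, A2) = 0.99 - (51/64)*0.937 - (13/64)*0.619 ≈ 0.12

so A1 would be preferred.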

Which is the best attribute to choose?

The one with highest information gain!!!


Play tennis example:

Choosing the attribute

S = [9+, 5-], E = 0.940

Humidity?
  High   -> [3+, 4-]  E = 0.985
  Normal -> [6+, 1-]  E = 0.592
G(S, Humidity) = 0.940 - (7/14)*0.985 - (7/14)*0.592 = 0.151

Wind?
  Weak   -> [6+, 2-]  E = 0.811
  Strong -> [3+, 3-]  E = 1.0
G(S, Wind) = 0.940 - (8/14)*0.811 - (6/14)*1.0 = 0.048
Choosing the attribute

S = [9+, 5-], E = 0.940

Outlook?
  Sunny    -> [2+, 3-]  E = 0.971
  Overcast -> [4+, 0-]  E = 0.0
  Rain     -> [3+, 2-]  E = 0.971
G(S, Outlook) = 0.940 - (5/14)*0.971 - (4/14)*0.0 - (5/14)*0.971 = 0.247

Choosing next attribute

Outlook?  [D1, D2, …, D14]  [9+, 5-]
  Sunny    -> Ssunny = [D1, D2, D8, D9, D11]  [2+, 3-]  ?
  Overcast -> [D3, D7, D12, D13]              [4+, 0-]  Yes
  Rain     -> [D4, D5, D6, D10, D14]          [3+, 2-]  ?

Gain(Ssunny, Humidity) = 0.970 - (3/5)*0.0 - (2/5)*0.0             = 0.970
Gain(Ssunny, Temp.)    = 0.970 - (2/5)*0.0 - (2/5)*1.0 - (1/5)*0.0 = 0.570
Gain(Ssunny, Wind)     = 0.970 - (2/5)*1.0 - (3/5)*0.918           = 0.019

ID3 Algorithm

Outlook?
  Sunny    -> Humidity?
                High   -> No   [D1, D2]
                Normal -> Yes  [D8, D9, D11]
  Overcast -> Yes  [D3, D7, D12, D13]
  Rain     -> Wind?
                Strong -> No   [D6, D14]
                Weak   -> Yes  [D4, D5, D10]

[Possible issues while building the tree]

What if there are no examples for some value of the chosen attribute when expanding a node? Do not generate that branch; at test time, follow the most popular branch instead.
When a node cannot be expanded further but still contains a mixture of + and - examples, label it with the majority class.
Overfitting

Overfitting in DTs

Decision trees tend to overfit the data, since we can keep partitioning the examples until every leaf classifies its training examples perfectly
As we descend the tree, decisions are taken with less and less data
We can regularize learning by controlling the parameters that define the complexity of the tree (see the sketch below):
  Minimum number of examples in a leaf
  Number of nodes/leaves in the tree
  Maximum depth of the branches of the tree
  Fixing a threshold for the heuristic
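
For illustration, these complexity controls map onto hyperparameters of scikit-learn's DecisionTreeClassifier (a hedged sketch; the numeric values are arbitrary examples, not recommendations):

from sklearn.tree import DecisionTreeClassifier

# Each argument corresponds to one of the regularization controls above.
clf = DecisionTreeClassifier(
    criterion="entropy",          # use information gain, as in ID3
    min_samples_leaf=5,           # minimum number of examples in a leaf
    max_leaf_nodes=20,            # bound on the number of leaves
    max_depth=4,                  # maximum depth of the branches
    min_impurity_decrease=0.01,   # threshold on the heuristic (impurity drop)
)
# clf.fit(X_train, y_train) would then learn a complexity-limited tree
# (X_train, y_train are placeholders for your training data).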

Overfitting in DTs

An alternative is to grow the full (overfitted) tree and then prune the result: estimate the generalization error of the leaves and delete the nodes whose expected error is larger than that of not having them (post-pruning)

Overfitting in DTs

[Figure: error versus number of nodes, plotted for the training set and the test set.]

Attributes

Undesired effect: bias towards attributes with a large number of different values in categorical variables
Attributes with numeric values?
Discretize... but how to define the thresholds?

Categorical Variables

Information Gain has a preference towards attributes with a larger number of values
This bias arises because having more possible answers sorts the data more (in fact, such an answer gives more IG, but more information is also needed to answer the question)

In practical implementations, categorical attributes are assumed to be binary
Attributes with more than two values are transformed using one-hot encoding, which makes decision trees binary trees in practice (see the sketch below)
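
A minimal illustration of this transformation with pandas (pandas.get_dummies; the column name reuses the PlayTennis attribute for concreteness):

import pandas as pd

# A three-valued categorical attribute...
df = pd.DataFrame({"Outlook": ["Sunny", "Overcast", "Rain", "Sunny"]})

# ...becomes three binary attributes, each testable with a yes/no split:
# Outlook_Overcast, Outlook_Rain, Outlook_Sunny
# (whether the dummies are booleans or 0/1 depends on the pandas version).
print(pd.get_dummies(df, columns=["Outlook"]))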

Attributes with numeric values

Given a continuous attribute A, consider all different splits [A[x] < vi] versus [A[x] >= vi]:
  Start from the sorted sequence of values of A: {v1, v2, ..., vn}
  Consider each value vi as a candidate threshold. For each different split, compute the information gain. (Note: there are at most n-1 possible splits.)
  Turn the split with the highest IG into the boolean test for the node. Data will be split into [A[x] < vi] and [A[x] >= vi].

Attribute A cannot be removed for future splits

Of course, this increases the computation time a lot (see the sketch below)
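
A sketch of this threshold search in Python (function names are illustrative; the example data at the end is hypothetical):

from math import log2

def entropy(labels):
    # E(S) over the class proportions in labels, with 0 * log2(0) = 0
    n = len(labels)
    probs = (labels.count(c) / n for c in set(labels))
    return -sum(p * log2(p) for p in probs if p > 0)

def best_numeric_split(values, labels):
    # Try each observed value as a threshold and keep the one with the highest IG.
    base = entropy(labels)
    pairs = sorted(zip(values, labels))
    best_gain, best_thr = -1.0, None
    for i in range(1, len(pairs)):               # at most n-1 candidate splits
        thr = pairs[i][0]
        if thr == pairs[i - 1][0]:
            continue                              # identical values give no new split
        left = [lab for v, lab in pairs if v < thr]
        right = [lab for v, lab in pairs if v >= thr]
        gain = base - (len(left) / len(pairs)) * entropy(left) \
                    - (len(right) / len(pairs)) * entropy(right)
        if gain > best_gain:
            best_gain, best_thr = gain, thr
    return best_thr, best_gain

# e.g. splitting a hypothetical Temperature attribute into Temp < thr vs Temp >= thr:
# best_numeric_split([64, 65, 68, 69, 70, 71], ["Yes", "No", "Yes", "Yes", "Yes", "No"])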

Decision Trees for regression
Similar to classification, but minimize the sum of squared errors relative to the average target value in each induced partition (see the sketch below).
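
A sketch of that split criterion in Python (illustrative names; a leaf predicts the mean target of its partition):

def sse(targets):
    # Sum of squared errors relative to the average target value of the partition
    mean = sum(targets) / len(targets)
    return sum((t - mean) ** 2 for t in targets)

def split_score(left_targets, right_targets):
    # A regression tree would pick the split minimizing this total SSE
    return sse(left_targets) + sse(right_targets)

# Example: splitting targets [1.0, 1.2, 5.0, 5.3] into the first two and the last two
# gives a much lower score than any split that mixes the two groups.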

Advantages/Problems with DTs

Advantages:
  Their computational complexity is low
  They perform automatic feature selection: only the attributes relevant for the task are used
  They can be interpreted more easily than other models
Drawbacks:
  They have large variance: slight changes in the sample can result in a completely different model
  They are sensitive to noise and mislabeled examples (regularization helps)
  They approximate concepts by partitions parallel to the attribute axes, so they can return complex models if the classification borders are oblique

