Solution 10 Decision Trees
Sheet 10
Decision Trees
Learning Goals
• Decision Trees
• Impurity functions
• The ID3 algorithm
1. Draw a decision tree for each of the following three boolean functions. (All variables
take values in the set {FALSE, TRUE}; you may write {0, 1} instead.)
A. x1 ∧ ¬x2   B. x1 XOR x2   C. x1 ∨ (x2 ∧ x3)
Solution:
• A

          x1
       1 /  \ 0
        x2   f
     1 /  \ 0
      f    t

• B

           x1
        0 /  \ 1
        x2    x2
     1 /  \0  1/ \ 0
      t    f  f   t

• C

          x1
       1 /  \ 0
        t    x2
          0 /  \ 1
           f    x3
             0 /  \ 1
              f    t
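As a quick sanity check (not part of the original sheet), each tree can be written as a nested conditional in Python and compared against its boolean formula over all inputs:

from itertools import product

# Each function below evaluates one of the trees drawn above by following
# its branches; the asserts compare against the boolean formulas.

def tree_a(x1, x2):        # x1 AND (NOT x2)
    return (x2 == 0) if x1 == 1 else False

def tree_b(x1, x2):        # x1 XOR x2
    return x2 == 1 if x1 == 0 else x2 == 0

def tree_c(x1, x2, x3):    # x1 OR (x2 AND x3)
    if x1 == 1:
        return True
    return x3 == 1 if x2 == 1 else False

for x1, x2 in product((0, 1), repeat=2):
    assert tree_a(x1, x2) == bool(x1 and not x2)
    assert tree_b(x1, x2) == bool(x1 != x2)
for x1, x2, x3 in product((0, 1), repeat=3):
    assert tree_c(x1, x2, x3) == bool(x1 or (x2 and x3))
print("all trees match their formulas")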
2. Consider the following illustration of two possible splittings of an example set D with
four classes C = {c1, c2, c3, c4}.
(a) Compute the drop in impurity ∆ι for the left and the right split, respectively, using
the misclassification rate ι_misclass as well as the Gini impurity ι_Gini. (Note: this
yields four values, one for every combination of split and impurity measure.)
Solution: For the four classes c ∈ {c1, …, c4} and the two attribute values a ∈ {1, 2}
we have to determine which of the given splits we should choose.

Left split:
\[
\Delta\iota_{\text{mis}}(D, \{D_1, D_2\}) = \iota_{\text{mis}}(D) - \sum_a p(a)\,\iota_{\text{mis}}(D_a)
= \Bigl(1 - \max_c p(c)\Bigr) - \sum_a p(a)\Bigl(1 - \max_c p(c \mid a)\Bigr)
= \tfrac{3}{4} - \bigl(\tfrac{1}{4} + \tfrac{1}{4}\bigr) = \tfrac{1}{4}
\]
\[
\Delta\iota_{\text{Gini}}(D, \{D_1, D_2\}) = \iota_{\text{Gini}}(D) - \sum_a p(a)\,\iota_{\text{Gini}}(D_a)
= \Bigl(1 - \sum_c p(c)^2\Bigr) - \sum_a p(a)\Bigl(1 - \sum_c p(c \mid a)^2\Bigr)
= \tfrac{3}{4} - \bigl(\tfrac{1}{4} + \tfrac{1}{4}\bigr) = \tfrac{1}{4}
\]
Right split: the conditional class distribution p(c | a), with p(a = 1) = p(a = 2) = 1/2, is

a \ c |  c1   c2   c3   c4
------+---------------------
  1   | 1/4  1/4  1/2   0
  2   | 1/4  1/4   0   1/2

The overall class distribution follows as \( p(c) = \sum_a p(c \mid a) \cdot p(a) \):
  c   |  c1   c2   c3   c4
------+---------------------
 p(c) | 1/4  1/4  1/4  1/4
\[
\Delta\iota_{\text{mis}}(D, \{D_1, D_2\}) = \iota_{\text{mis}}(D) - \sum_a p(a)\,\iota_{\text{mis}}(D_a)
= \tfrac{3}{4} - \bigl(\tfrac{1}{4} + \tfrac{1}{4}\bigr) = \tfrac{1}{4}
\]
\[
\Delta\iota_{\text{Gini}}(D, \{D_1, D_2\}) = \iota_{\text{Gini}}(D) - \sum_a p(a)\,\iota_{\text{Gini}}(D_a)
= \tfrac{3}{4} - \bigl(\tfrac{5}{16} + \tfrac{5}{16}\bigr) = \tfrac{1}{8}
\]
(b) Which split should be chosen, and why?

Solution: The left split is preferable, as it has a higher chance of leading to pure
nodes in fewer splits. The misclassification rate only considers the most frequent
class (max_c p(c | a)), regardless of whether the remaining probability mass is
concentrated in a single class or scattered across several others. The Gini criterion
takes the probability mass of all classes into account and favors splits whose child
nodes concentrate the probability mass on a few dominating classes. According to
the Gini criterion we would therefore choose the left split over the right split;
according to the misclassification rate, neither split is preferable over the other.
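A short Python sketch that reproduces all four impurity drops. The helper names are my own, and the left split's conditional distributions are inferred from the computed values above, since the original illustration is not reproduced here:

# Each split is given as a list of (p(a), p(c | a)) pairs.

def impurity_mis(p):
    """Misclassification rate: 1 - max_c p(c)."""
    return 1 - max(p)

def impurity_gini(p):
    """Gini impurity: 1 - sum_c p(c)^2."""
    return 1 - sum(pc ** 2 for pc in p)

def impurity_drop(impurity, prior, split):
    """Drop = impurity(D) - sum_a p(a) * impurity(D_a)."""
    return impurity(prior) - sum(pa * impurity(cond) for pa, cond in split)

prior = [1/4, 1/4, 1/4, 1/4]                       # p(c) for c1..c4
# Left split inferred from the arithmetic above; right split from the table.
left  = [(1/2, [1/2, 1/2, 0, 0]), (1/2, [0, 0, 1/2, 1/2])]
right = [(1/2, [1/4, 1/4, 1/2, 0]), (1/2, [1/4, 1/4, 0, 1/2])]

for name, split in (("left", left), ("right", right)):
    print(name,
          "mis:",  impurity_drop(impurity_mis,  prior, split),
          "gini:", impurity_drop(impurity_gini, prior, split))
# left mis: 0.25 gini: 0.25
# right mis: 0.25 gini: 0.125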
3. Use the ID3 algorithm to determine a decision tree, where the attributes are to be chosen
with maximum average information gain iGain:
\[
iGain(D, A) \equiv H(D) - \sum_{a \in A} \frac{|D_a|}{|D|} \cdot H(D_a)
\quad\text{with}\quad
H(D) = -p_{\oplus} \log_2(p_{\oplus}) - p_{\ominus} \log_2(p_{\ominus})
\]
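A minimal Python sketch of these two formulas (the helper names entropy and info_gain are my own; examples are assumed to be dicts mapping attribute names to values, labels are booleans, and 0·log2(0) is taken as 0):

from math import log2

def entropy(labels):
    """H(D) = -p_plus*log2(p_plus) - p_minus*log2(p_minus)."""
    p = sum(labels) / len(labels)
    return -sum(q * log2(q) for q in (p, 1 - p) if q > 0)

def info_gain(examples, labels, attribute):
    """iGain(D, A) = H(D) - sum_a |D_a|/|D| * H(D_a)."""
    remainder = 0.0
    for value in set(ex[attribute] for ex in examples):
        sub = [lab for ex, lab in zip(examples, labels) if ex[attribute] == value]
        remainder += len(sub) / len(labels) * entropy(sub)
    return entropy(labels) - remainder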
• Attribute Color:
\[
\begin{aligned}
H(D \mid \text{Color}) &= -\tfrac{1}{7}\bigl(1 \log_2 1 + 0 \log_2 0\bigr) - \tfrac{3}{7}\bigl(\tfrac{1}{3} \log_2 \tfrac{1}{3} + \tfrac{2}{3} \log_2 \tfrac{2}{3}\bigr) \\
&\quad\; - \tfrac{2}{7}\bigl(\tfrac{1}{2} \log_2 \tfrac{1}{2} + \tfrac{1}{2} \log_2 \tfrac{1}{2}\bigr) - \tfrac{1}{7}\bigl(1 \log_2 1 + 0 \log_2 0\bigr) \\
&= -\tfrac{1}{7}\bigl(\log_2 \tfrac{1}{3} + 2 \log_2 \tfrac{2}{3}\bigr) - \tfrac{2}{7} \log_2 \tfrac{1}{2} \approx 0.679
\end{aligned}
\]
• Attribute Fur:
\[
\begin{aligned}
H(D \mid \text{Fur}) &= -\tfrac{3}{7}\bigl(\tfrac{2}{3} \log_2 \tfrac{2}{3} + \tfrac{1}{3} \log_2 \tfrac{1}{3}\bigr) - \tfrac{2}{7}\bigl(0 \log_2 0 + 1 \log_2 1\bigr) - \tfrac{2}{7}\bigl(1 \log_2 1 + 0 \log_2 0\bigr) \\
&= -\tfrac{1}{7}\bigl(2 \log_2 \tfrac{2}{3} + \log_2 \tfrac{1}{3}\bigr) \approx 0.394
\end{aligned}
\]
• Attribute Size:
\[
\begin{aligned}
H(D \mid \text{Size}) &= -\tfrac{4}{7}\bigl(\tfrac{3}{4} \log_2 \tfrac{3}{4} + \tfrac{1}{4} \log_2 \tfrac{1}{4}\bigr) - \tfrac{3}{7}\bigl(\tfrac{1}{3} \log_2 \tfrac{1}{3} + \tfrac{2}{3} \log_2 \tfrac{2}{3}\bigr) \\
&= -\tfrac{1}{7}\bigl(3 \log_2 \tfrac{3}{4} + \log_2 \tfrac{1}{4}\bigr) - \tfrac{1}{7}\bigl(\log_2 \tfrac{1}{3} + 2 \log_2 \tfrac{2}{3}\bigr) \approx 0.857
\end{aligned}
\]
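As a quick numerical check (my own helper, not part of the sheet), the three conditional entropies can be recomputed from the per-value class counts (n⊕, n⊖) read off the terms above:

from math import log2

def cond_entropy(counts):
    """H(D|A) = sum_a |D_a|/|D| * H(D_a) for counts = [(n_plus, n_minus), ...]."""
    n = sum(pos + neg for pos, neg in counts)
    h = 0.0
    for pos, neg in counts:
        n_a = pos + neg
        for k in (pos, neg):
            if k:
                h -= (n_a / n) * (k / n_a) * log2(k / n_a)
    return h

print(cond_entropy([(1, 0), (1, 2), (1, 1), (1, 0)]))  # Color: ~0.679
print(cond_entropy([(2, 1), (0, 2), (2, 0)]))          # Fur:   ~0.394
print(cond_entropy([(3, 1), (1, 2)]))                  # Size:  ~0.857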
Among the three attributes, Fur yields the smallest conditional entropy H(D | Fur) and
hence the maximum information gain, so it is chosen as the root attribute. The nodes
fur=smooth and fur=curly are already pure. The node fur=ragged is not pure yet, so we
have to evaluate the impurity reduction of all remaining attributes (except Fur) for this
node. Since both classes still occur there, the corresponding subtree is computed
recursively.
• Attribute Color:
\[
H(D \mid \text{Color}) = -\tfrac{1}{3}\bigl(1 \log_2 1 + 0 \log_2 0\bigr) - \tfrac{1}{3}\bigl(0 \log_2 0 + 1 \log_2 1\bigr) - \tfrac{1}{3}\bigl(1 \log_2 1 + 0 \log_2 0\bigr) = 0
\]
• Attribute Size:
\[
H(D \mid \text{Size}) = -\tfrac{1}{3}\bigl(1 \log_2 1 + 0 \log_2 0\bigr) - \tfrac{2}{3}\bigl(\tfrac{1}{2} \log_2 \tfrac{1}{2} + \tfrac{1}{2} \log_2 \tfrac{1}{2}\bigr) = -\tfrac{2}{3} \log_2 \tfrac{1}{2} = \tfrac{2}{3}
\]
Since H(D | Color) = 0 < 2/3 = H(D | Size), the attribute Color is chosen for the node
fur=ragged. All resulting child nodes are pure, so the tree is complete.
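Finally, a compact, self-contained sketch of the ID3 recursion itself (my own code; the toy dataset at the bottom is purely hypothetical and stands in for the sheet's example table, which is not reproduced above):

from math import log2

def entropy(labels):
    # H(D) for boolean labels, with 0*log2(0) taken as 0
    p = sum(labels) / len(labels)
    return -sum(q * log2(q) for q in (p, 1 - p) if q > 0)

def info_gain(examples, labels, attr):
    # iGain(D, A) = H(D) - sum_a |D_a|/|D| * H(D_a)
    gain = entropy(labels)
    for value in set(ex[attr] for ex in examples):
        sub = [lab for ex, lab in zip(examples, labels) if ex[attr] == value]
        gain -= len(sub) / len(labels) * entropy(sub)
    return gain

def id3(examples, labels, attributes):
    if len(set(labels)) == 1:            # pure node: return its class
        return labels[0]
    if not attributes:                   # no attributes left: majority class
        return max(set(labels), key=labels.count)
    best = max(attributes, key=lambda a: info_gain(examples, labels, a))
    tree = {}
    for value in set(ex[best] for ex in examples):
        idx = [i for i, ex in enumerate(examples) if ex[best] == value]
        tree[value] = id3([examples[i] for i in idx],
                          [labels[i] for i in idx],
                          [a for a in attributes if a != best])
    return {best: tree}

# Hypothetical toy data, only to exercise the recursion:
X = [{"fur": "ragged", "size": "big"},   {"fur": "smooth", "size": "big"},
     {"fur": "ragged", "size": "small"}, {"fur": "curly",  "size": "small"}]
y = [True, False, False, True]
print(id3(X, y, ["fur", "size"]))   # prints a nested dict with "fur" at the root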