Solution 10 Decision Trees


Data Science & Advanced Topics Chair of Data Science

WS 2022/23 Prof. Michael Granitzer


Julian Stier & Saber Zerhoudi ◦ julian.stier@uni-passau.de

Sheet 10
Decision Trees
Learning Goals
• Decision Trees
• Impurity functions
• The ID3 algorithm

1. Draw a decision tree for each of the following three boolean functions. (All variables are
drawn from the set {FALSE, TRUE}; you may use {0, 1} instead.)
A. x1 ∧ ¬x2    B. x1 XOR x2    C. x1 ∨ (x2 ∧ x3)

Solution:

• A (x1 ∧ ¬x2)

        x1
      1/  \0
      x2    f
    1/  \0
    f     t

• B (x1 XOR x2)

          x1
       0/    \1
       x2     x2
     1/  \0  1/  \0
     t    f  f    t

• C (x1 ∨ (x2 ∧ x3))

        x1
      1/  \0
      t    x2
         0/  \1
         f    x3
            0/  \1
            f     t
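These trees can be checked mechanically against the truth tables of the three functions. A minimal Python sketch (the tree_* functions below are simply the drawings above transcribed as nested conditionals):

from itertools import product

# Target boolean functions from the exercise
f_a = lambda x1, x2, x3: x1 and not x2        # A: x1 AND NOT x2
f_b = lambda x1, x2, x3: x1 != x2             # B: x1 XOR x2
f_c = lambda x1, x2, x3: x1 or (x2 and x3)    # C: x1 OR (x2 AND x3)

# The decision trees drawn above, written as nested conditionals
def tree_a(x1, x2, x3):
    return (not x2) if x1 else False

def tree_b(x1, x2, x3):
    return (not x2) if x1 else x2

def tree_c(x1, x2, x3):
    return True if x1 else (x3 if x2 else False)

# Compare each tree against its target function on all 2^3 assignments
for f, tree in [(f_a, tree_a), (f_b, tree_b), (f_c, tree_c)]:
    assert all(bool(f(*x)) == bool(tree(*x)) for x in product([False, True], repeat=3))
print("all three trees agree with their boolean functions")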

2. Consider the following illustration of two possible splittings of an example set D with
four classes C = {c1, c2, c3, c4}.

(a) Compute the drop in impurity ∆ι for the left and the right split, respectively, using the
misclassification rate ι_misclass as well as the Gini impurity ι_Gini. (Note: This will yield 4
values: one for every combination of “split” and “impurity measure”.)

Solution: We need to determine for four classes c ∈ {1, 2, 3, 4} and two possible
attribute values a ∈ {1, 2} which of the given splits we should choose.

We estimate p(a = 1) as |D1|/|D| = 0.5 and p(a = 2) as |D2|/|D| = 0.5. You could e.g.
imagine the dataset to contain 20,000 examples, with both D1 and D2 containing 10,000
examples each.

Left split, class distributions p(c|a):

 a \ c    1     2     3     4
   1     1/2   1/2    0     0
   2      0     0    1/2   1/2

Marginal class distribution p(c) = Σ_a p(c|a) · p(a):

   c      1     2     3     4
  p(c)   1/4   1/4   1/4   1/4

∆ι_mis(D, {D1, D2}) = ι_mis(D) − Σ_a p(a) · ι_mis(Da)
                    = (1 − max_c p(c)) − Σ_a p(a) · (1 − max_c p(c|a))
                    = 3/4 − (1/4 + 1/4) = 1/4

∆ι_Gini(D, {D1, D2}) = ι_Gini(D) − Σ_a p(a) · ι_Gini(Da)
                     = (1 − Σ_c p(c)²) − Σ_a p(a) · (1 − Σ_c p(c|a)²)
                     = 3/4 − (1/4 + 1/4) = 1/4

Right split, class distributions p(c|a):

 a \ c    1     2     3     4
   1     1/4   1/4   1/2    0
   2     1/4   1/4    0    1/2

Marginal class distribution p(c) = Σ_a p(c|a) · p(a):

   c      1     2     3     4
  p(c)   1/4   1/4   1/4   1/4

∆ι_mis(D, {D1, D2}) = ι_mis(D) − Σ_a p(a) · ι_mis(Da)
                    = (1 − max_c p(c)) − Σ_a p(a) · (1 − max_c p(c|a))
                    = 3/4 − (1/4 + 1/4) = 1/4

∆ι_Gini(D, {D1, D2}) = ι_Gini(D) − Σ_a p(a) · ι_Gini(Da)
                     = (1 − Σ_c p(c)²) − Σ_a p(a) · (1 − Σ_c p(c|a)²)
                     = 3/4 − (5/16 + 5/16) = 1/8
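The four values can be reproduced numerically. The sketch below assumes the class distributions p(c|a) as read off from the two tables above and p(a = 1) = p(a = 2) = 0.5:

def impurity_mis(p):
    """Misclassification rate: 1 - max_c p(c)."""
    return 1 - max(p)

def impurity_gini(p):
    """Gini impurity: 1 - sum_c p(c)^2."""
    return 1 - sum(q * q for q in p)

def impurity_drop(impurity, parent, children, weights):
    """Delta iota = iota(D) - sum_a p(a) * iota(D_a)."""
    return impurity(parent) - sum(w * impurity(c) for w, c in zip(weights, children))

parent  = [0.25, 0.25, 0.25, 0.25]              # p(c), identical for both splits
left    = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
right   = [[0.25, 0.25, 0.5, 0.0], [0.25, 0.25, 0.0, 0.5]]
weights = [0.5, 0.5]                            # p(a = 1) = p(a = 2) = 0.5

for name, split in [("left", left), ("right", right)]:
    for mname, measure in [("misclass", impurity_mis), ("gini", impurity_gini)]:
        print(name, mname, impurity_drop(measure, parent, split, weights))
# left  misclass 0.25, left  gini 0.25
# right misclass 0.25, right gini 0.125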

(b) Explain the result.

Solution: The left split is preferable because it has a higher chance of leading to pure
nodes in fewer subsequent splits. The misclassification rate only pays attention to the most
frequent class (the max_c term), regardless of whether the remaining probability mass is
concentrated in a single other class or scattered across several. The Gini criterion takes
the probability mass of all classes into account and prefers situations in which the mass
is concentrated on only a few dominating classes. According to the Gini criterion we would
therefore choose the left split over the right split; according to the misclassification rate
neither split is preferable over the other.
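The difference can be seen directly on the impurity of a single child node of each split (the child distributions are taken from the tables in part (a)):

# One child node of the left split vs. one child node of the right split
p_left_child  = [0.5, 0.5, 0.0, 0.0]
p_right_child = [0.25, 0.25, 0.5, 0.0]

misclass = lambda p: 1 - max(p)
gini     = lambda p: 1 - sum(q * q for q in p)

print(misclass(p_left_child), misclass(p_right_child))  # 0.5 0.5   -> no difference
print(gini(p_left_child), gini(p_right_child))          # 0.5 0.625 -> left child is purer under Gini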

3. Given the following training set of dog data:

Color Fur Size Class


brown ragged small well-behaved
black ragged big dangerous
black smooth big dangerous
black curly small well-behaved
white curly small well-behaved
white smooth small dangerous
red ragged big well-behaved

Use the ID3 algorithm to determine a decision tree, where the attributes are to be chosen
with the maximum average information gain iGain:

    iGain(D, A) ≡ H(D) − Σ_{a∈A} (|Da| / |D|) · H(Da)    with    H(D) = −p⊕ log2(p⊕) − p⊖ log2(p⊖)
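A direct translation of this definition into Python might look as follows (the names entropy and info_gain are illustrative choices; examples are assumed to be (attribute-dict, class) pairs, and 0 · log2(0) is treated as 0):

from collections import Counter
from math import log2

def entropy(labels):
    """H(D) = -sum_c p(c) * log2(p(c)); classes with p(c) = 0 simply contribute nothing."""
    n = len(labels)
    return -sum((k / n) * log2(k / n) for k in Counter(labels).values())

def info_gain(examples, attribute):
    """iGain(D, A) = H(D) - sum_{a in A} |D_a|/|D| * H(D_a)."""
    labels = [cls for _, cls in examples]
    n = len(examples)
    partitions = {}
    for features, cls in examples:
        partitions.setdefault(features[attribute], []).append(cls)
    remainder = sum(len(part) / n * entropy(part) for part in partitions.values())
    return entropy(labels) - remainder

Since H(D) is the same constant for every attribute A, maximizing iGain(D, A) is equivalent to minimizing the conditional entropy H(D|A) = Σ_{a∈A} |Da|/|D| · H(Da); the solution below therefore only compares the H(D|A) values.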

Solution: Determine the values of H(D| Attribute ):

• Attribute Color: m = 7 (Number of objects)

Color well-behaved dangerous Probability


brown 1 0 P (brown) = 1/7
black 1 2 P (black) = 3/7
white 1 1 P (white) = 2/7
red 1 0 P (red) = 1/7

H(D|Color) = −[ 1/7 · (1·log2(1) + 0·log2(0)) + 3/7 · (1/3·log2(1/3) + 2/3·log2(2/3))
              + 2/7 · (1/2·log2(1/2) + 1/2·log2(1/2)) + 1/7 · (1·log2(1) + 0·log2(0)) ]
           = −[ 0 + 1/7 · (log2(1/3) + 2·log2(2/3)) + 1/7 · (log2(1/2) + log2(1/2)) + 0 ]
           = −[ 1/7 · (log2(1/3) + 2·log2(2/3)) + 2/7 · log2(1/2) ] ≈ 0.679

• Attribute Fur:

Fur well-behaved dangerous Probability


ragged 2 1 P (ragged) = 3/7
smooth 0 2 P (smooth) = 2/7
curly 2 0 P (curly) = 2/7

H(D|Fur) = −[ 3/7 · (2/3·log2(2/3) + 1/3·log2(1/3)) + 2/7 · (0·log2(0) + 1·log2(1))
            + 2/7 · (1·log2(1) + 0·log2(0)) ]
         = −3/7 · (2/3·log2(2/3) + 1/3·log2(1/3)) = −1/7 · (2·log2(2/3) + log2(1/3))
         ≈ 0.394

• Attribute Size:

Size well-behaved dangerous Probability


small 3 1 P (small) = 4/7
big 1 2 P (big) = 3/7

H(D|Size) = −[ 4/7 · (3/4·log2(3/4) + 1/4·log2(1/4)) + 3/7 · (1/3·log2(1/3) + 2/3·log2(2/3)) ]
          = −[ 1/7 · (3·log2(3/4) + log2(1/4)) + 1/7 · (log2(1/3) + 2·log2(2/3)) ]
          ≈ 0.857

H(D|A) minimal for A = Fur ⇒ Choose Attribute Fur.
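The three conditional entropies can be double-checked with a short, self-contained script over the dog table above (indices 0, 1, 2 correspond to Color, Fur, Size):

from collections import Counter
from math import log2

dogs = [  # (Color, Fur, Size, Class), exactly as listed in the table
    ("brown", "ragged", "small", "well-behaved"),
    ("black", "ragged", "big",   "dangerous"),
    ("black", "smooth", "big",   "dangerous"),
    ("black", "curly",  "small", "well-behaved"),
    ("white", "curly",  "small", "well-behaved"),
    ("white", "smooth", "small", "dangerous"),
    ("red",   "ragged", "big",   "well-behaved"),
]

def entropy(labels):
    n = len(labels)
    return -sum((k / n) * log2(k / n) for k in Counter(labels).values())

def conditional_entropy(rows, attr_index):
    """H(D|A) = sum_a |D_a|/|D| * H(D_a)."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attr_index], []).append(row[-1])
    return sum(len(g) / len(rows) * entropy(g) for g in groups.values())

for name, idx in [("Color", 0), ("Fur", 1), ("Size", 2)]:
    print(name, round(conditional_entropy(dogs, idx), 3))
# Color 0.679, Fur 0.394, Size 0.857 -> Fur minimizes H(D|A)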

Nodes fur=smooth and fur=curly are already pure. Node fur=ragged is not pure yet, so we
have to evaluate the impurity reduction of all remaining attributes (all except Fur) on the
examples in node fur=ragged. Since both classes still occur there, the corresponding subtree
is computed recursively.

• Attribute Color: m = 3 (Number of objects)

Color well-behaved dangerous Probability


brown 1 0 P (brown) = 1/3
black 0 1 P (black) = 1/3
red 1 0 P (red) = 1/3

H(D|Color) = −[ 1/3 · (1·log2(1) + 0·log2(0)) + 1/3 · (0·log2(0) + 1·log2(1))
              + 1/3 · (1·log2(1) + 0·log2(0)) ]
           = 0

• Attribute Size:

Size well-behaved dangerous Probability


small 1 0 P (small) = 1/3
big 1 1 P (big) = 2/3

H(D|Size) = −[ 1/3 · (1·log2(1) + 0·log2(0)) + 2/3 · (1/2·log2(1/2) + 1/2·log2(1/2)) ]
          = −2/3 · log2(1/2) = 2/3
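Both values can be verified in a couple of lines (the distributions are read off the two tables for the fur=ragged node):

from math import log2

# fur=ragged node: 3 examples (brown/small/well-behaved, black/big/dangerous, red/big/well-behaved)
# Color splits them into three pure single-example nodes:
h_color = 3 * (1/3) * 0.0
# Size: {small} is pure, {big} contains one example of each class (entropy 1 bit):
h_size = (1/3) * 0.0 + (2/3) * -(0.5 * log2(0.5) + 0.5 * log2(0.5))
print(h_color, h_size)   # 0.0 and 0.666... = 2/3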

H(D|A) minimal for A = Color ⇒ choose attribute Color. Resulting tree:

    Fur
    ├─ ragged → Color
    │            ├─ brown → well-behaved
    │            ├─ black → dangerous
    │            └─ red   → well-behaved
    ├─ smooth → dangerous
    └─ curly  → well-behaved
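For completeness, a compact recursive ID3 sketch that reproduces this tree from the dog data (the function name id3 and the nested-dict tree representation are illustrative choices, not prescribed by the sheet):

from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((k / n) * log2(k / n) for k in Counter(labels).values())

def id3(rows, attributes):
    """rows: list of (feature_dict, label); returns a class label or a nested dict."""
    labels = [label for _, label in rows]
    if len(set(labels)) == 1:                       # node is pure
        return labels[0]
    if not attributes:                              # no attribute left: majority vote
        return Counter(labels).most_common(1)[0][0]
    def h_cond(attr):                               # H(D|A); minimizing it maximizes iGain
        groups = {}
        for features, label in rows:
            groups.setdefault(features[attr], []).append(label)
        return sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    best = min(attributes, key=h_cond)
    branches = {}
    for value in {features[best] for features, _ in rows}:
        subset = [(f, l) for f, l in rows if f[best] == value]
        branches[value] = id3(subset, [a for a in attributes if a != best])
    return {best: branches}

dogs = [({"Color": c, "Fur": f, "Size": s}, y) for c, f, s, y in [
    ("brown", "ragged", "small", "well-behaved"), ("black", "ragged", "big", "dangerous"),
    ("black", "smooth", "big", "dangerous"),      ("black", "curly", "small", "well-behaved"),
    ("white", "curly", "small", "well-behaved"),  ("white", "smooth", "small", "dangerous"),
    ("red", "ragged", "big", "well-behaved"),
]]
print(id3(dogs, ["Color", "Fur", "Size"]))
# {'Fur': {'ragged': {'Color': {'brown': 'well-behaved', 'black': 'dangerous', 'red': 'well-behaved'}},
#          'smooth': 'dangerous', 'curly': 'well-behaved'}}   (dict key order may vary)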
