
CentraleSupélec 2023

TD 1: Decision Trees (corrected version)

Ensemble learning from theory to practice

1 Exercises and reminders

Exercise 1: Reminders (to do on paper):

Figure 1: The spatial segmentation of a classification decision tree.

1. Thanks to the spatial segmentation, draw the visual tree.


Correction:

Figure 2: The tree corresponding to the spatial segmentation of Figure 1.



2. What is the input space? What is the output space?
Correction:
(a) The input space is X = R × R = R².
(b) The output space is Y = {1, 2, 3}.
3. Write the associated formula for f.
Correction: Let x_i denote an observation (a d-dimensional vector) from the input space and x_{i,j} the real value of that observation for the variable X_j. Then

f(x_i) = 1{x_i ∈ R_4} + 2 · 1{x_i ∈ R_1 ∪ R_2} + 3 · 1{x_i ∈ R_3 ∪ R_5} = Σ_{k=1}^{K} γ_k 1{x_i ∈ R_k}

where,
γ_1 = γ_2 = 2
γ_3 = γ_5 = 3
γ_4 = 1
R_1 = {x_i : x_{i,1} < 0.4} ∩ {x_i : x_{i,2} < 0.7}
R_2 = {x_i : 0.4 < x_{i,1} < 0.6} ∩ {x_i : x_{i,2} < 0.7}
R_3 = {x_i : x_{i,1} < 0.6} ∩ {x_i : x_{i,2} > 0.7}
R_4 = {x_i : x_{i,1} > 0.6} ∩ {x_i : x_{i,2} < 0.4}
R_5 = {x_i : x_{i,1} > 0.6} ∩ {x_i : x_{i,2} > 0.4}

4. How many splits are there?


Correction: 4 splits.
5. How many nodes?
Correction: 9 nodes including 5 terminal nodes.
6. How many leaves?
Correction: 5 leaves.
7. What will be the predicted value associated with the new data point symbolized by the question mark?
Correction: class 2 (or orange).
8. Is this tree accurate?
Correction: It looks accurate (with a small classification error, which is preferable to overfitting), but one may question the relevance of the last split.
9. Compute the MSE (Mean Squared Error) to evaluate the quality of the decision tree predictor:

MSE = (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)²

Correction: MSE = 11/21 ≈ 0.52.
10. What do you think about the result? What is disturbing in the previous question (clue: train data vs test data)?
Correction: First, we could expect a better result (closer to zero). Moreover, the error is computed here on the training set, whereas to conclude about the accuracy of a decision tree it is more appropriate to consider the error on a test set. Finally, this is a classification task, so we should use an adapted error measure (e.g. the misclassification error); the MSE is only suited to the regression task. The sketch below illustrates the piecewise predictor f and both error measures.
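To make the piecewise form of f (question 3) and the difference between MSE and misclassification error concrete, here is a minimal Python sketch. The exact coordinates of the points in Figure 1 are not given in the exercise, so the arrays X and y below are hypothetical placeholders; only the region boundaries and the classes γ_k come from the correction, and the ≤/< conventions at the thresholds are a choice we make.

import numpy as np

def f(x):
    # Piecewise-constant tree predictor from question 3; x = (x1, x2).
    x1, x2 = x
    if x1 < 0.4 and x2 < 0.7:          # R1 -> class 2
        return 2
    if 0.4 <= x1 < 0.6 and x2 < 0.7:   # R2 -> class 2
        return 2
    if x1 < 0.6 and x2 >= 0.7:         # R3 -> class 3
        return 3
    if x1 >= 0.6 and x2 < 0.4:         # R4 -> class 1
        return 1
    return 3                            # R5 -> class 3

# Hypothetical sample (NOT the points of Figure 1).
X = np.array([[0.2, 0.3], [0.5, 0.2], [0.3, 0.9], [0.8, 0.2], [0.7, 0.8]])
y = np.array([2, 2, 3, 1, 1])           # the last point is deliberately misclassified by f

y_hat = np.array([f(x) for x in X])
mse = np.mean((y - y_hat) ** 2)          # only meaningful for regression targets
misclassification = np.mean(y != y_hat)  # the adapted measure for classification
print(mse, misclassification)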



Exercise 2 (Questions on basic concepts of the course):

1. A student has trained a decision tree and notices that its performance is better on the train set
than the test set. Should he increase or decrease the depth of the tree?
Correction: He should decrease the depth of the tree, because reducing the depth gives a model that is less prone to overfitting (see the scikit-learn sketch after this exercise).
2. A student has a dataset with n instances and p features.
(a) What is the maximum number of leaves of a decision tree on this dataset?
Correction: Each leaf contains at least one observation, therefore the answer is n.
(b) What is the maximum depth of a decision tree on this dataset?
Correction: Each leaf contains at least one observation, therefore the maximum depth is n − 1: on the path between the root and the deepest leaf, each split can create one leaf containing a single observation. Note that the root has a depth equal to zero.
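As a complement to question 1, here is a minimal scikit-learn sketch (not part of the original exercise) showing how the depth of a decision tree can be limited to reduce overfitting. The dataset and the tested depths are arbitrary placeholders.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A shallower tree (smaller max_depth) is less prone to overfitting:
# compare the train and test accuracies as the depth decreases.
for depth in (None, 3, 2):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    print(depth, tree.score(X_train, y_train), tree.score(X_test, y_test))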

Exercise 3 (Understand the splitting process idea by practice with Gini impurity):
Consider the following dataset containing for 10 plants the length and width of their sepals. We want to
discriminate plants that belong to the species Iris virginica (+) from others (-).

Label        +    +    +    +    +    +    −    −    −    −
Length (cm)  6.7  6.7  6.3  6.5  6.2  5.9  6.1  6.4  6.6  6.8
Width (cm)   3.3  3.0  2.5  3.0  3.4  3.0  2.8  2.9  3.0  2.8
1. Calculate the Gini impurity for all possible separation points using the length of the sepals as the
separating variable.
Note that the Gini impurity of a region R is defined as:

Imp(R) := Σ_{c=1}^{C} p_c(R)(1 − p_c(R)) = 1 − Σ_{c=1}^{C} p_c(R)²   (1)

where p_c(R) := (1/|R|) Σ_{i: x_i ∈ R} 1{y_i = c}. Thus, if all the instances of a region belong to the same class, the impurity of this region is equal to 0; conversely, if a region contains as many instances of each of the C classes, the impurity equals 1 − 1/C, i.e. 1/2 in the case of a binary classification.
Correction: Here the parent node corresponds to the root. We denote R_0 the region associated with the parent node. We have 4 instances of class − and 6 instances of class +. Thus,

Gini(R_0) = 1 − (4/10)² − (6/10)² = 0.48   (2)
(a) We start with the lowest length value and consider as a candidate node length ≤ 5.9, which gives two child nodes denoted R_1 (left child) and R_2 (right child). For R_1 the number of instances of each class is summarized in the following table,

Label                 +   −
number of instances   1   0

which allows us to compute

Gini(R_1) = 1 − (1/1)² − (0/1)² = 0

Then, for R_2,

Label                 +   −
number of instances   5   4

which gives

Gini(R_2) = 1 − (5/9)² − (4/9)² ≈ 0.494

We deduce the Gini impurity for the first possible separation as follows,

∆Gini(length ≤ 5.9) = (|R_1|/|R|) Gini(R_1) + (|R_2|/|R|) Gini(R_2) ≈ (1/10) × 0 + (9/10) × 0.494 ≈ 0.444   (3)



(b) For the second possible separation, we consider length ≤ 6.1. We obtain for R_1,

Label                 +   −
number of instances   1   1

Gini(R_1) = 1 − (1/2)² − (1/2)² = 0.5

and, for R_2,

Label                 +   −
number of instances   5   3

Gini(R_2) = 1 − (5/8)² − (3/8)² ≈ 0.469

We deduce the Gini impurity for the second possible separation as follows,

∆Gini(length ≤ 6.1) = (|R_1|/|R|) Gini(R_1) + (|R_2|/|R|) Gini(R_2) ≈ (2/10) × 0.5 + (8/10) × 0.469 ≈ 0.475   (4)

By repeating this process, we obtain all the requested results, summarized in the following table,

Split val. | < 5.9 | ≤ 5.9         | ≤ 6.1         | ≤ 6.2 | ≤ 6.3 | ≤ 6.4 | ≤ 6.5 | ≤ 6.6 | ≤ 6.7 | ≤ 6.8
∆Gini      |   −   | 0.444 (Eq. 3) | 0.475 (Eq. 4) | 0.476 | 0.45  | 0.48  | 0.467 | 0.476 | 0.4   |   −
2. Calculate the Gini impurity for all possible separation points using the width of the sepals as the
separating variable.
Correction: Following the same process, we obtain for the Width variable,

Split val. | < 2.5 | ≤ 2.5 | ≤ 2.8 | ≤ 2.9 | ≤ 3 | ≤ 3.3 | ≤ 3.4
∆Gini      |   −   | 0.444 | 0.419 | 0.317 | 0.4 | 0.444 |   −
3. What is the first node of a decision tree trained on this dataset with the Gini impurity?
Correction: The lowest weighted Gini impurity is 0.317, obtained by splitting on the width at the threshold 2.9 cm, i.e. width ≤ 2.9. Therefore, the first node of the decision tree compares the width of the sepals to the value 2.9. On the left there are four plants, including one Iris virginica; on the right, six plants, including five Iris virginica. The sketch below reproduces these computations.
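As an illustration of the computations above, the following sketch (our own code, not part of the original correction) recomputes the weighted Gini impurity of Eq. (1) for every candidate threshold of each variable on the 10-plant dataset; it should reproduce the two tables up to rounding. Variable and function names are ours.

import numpy as np

# Dataset of Exercise 3: label (1 = Iris virginica, 0 = other), length, width.
labels = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0])
length = np.array([6.7, 6.7, 6.3, 6.5, 6.2, 5.9, 6.1, 6.4, 6.6, 6.8])
width  = np.array([3.3, 3.0, 2.5, 3.0, 3.4, 3.0, 2.8, 2.9, 3.0, 2.8])

def gini(y):
    # Gini impurity of a set of labels, Eq. (1).
    if len(y) == 0:
        return 0.0
    p = np.bincount(y, minlength=2) / len(y)
    return 1.0 - np.sum(p ** 2)

def weighted_gini(x, y, threshold):
    # Weighted impurity of the split x <= threshold, as in Eqs. (3)-(4).
    left, right = y[x <= threshold], y[x > threshold]
    n = len(y)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

for name, x in (("length", length), ("width", width)):
    for t in sorted(set(x))[:-1]:   # the largest value splits nothing off (the "-" columns)
        print(f"{name} <= {t}: {weighted_gini(x, labels, t):.3f}")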

Exercise 4: Understand the splitting process idea (to do on paper): We will show that minimizing the quadratic risk amounts to minimizing the variance in each hyper-rectangle of the input space partition.

1. For every region R_k, write the empirical quadratic risk minimization problem when the predictor is a decision tree.
Correction:

∀k ∈ {1, . . . , K},  min_{γ_k, R_k} (1/n) Σ_{i=1}^{n} ( y_i − Σ_{k=1}^{K} γ_k 1{x_i ∈ R_k} )²

⇔ ∀k ∈ {1, . . . , K}, for a fixed k,  min_{γ_k, R_k} (1/n) Σ_{i: x_i ∈ R_k} (y_i − γ_k)²

2. Write R_k as a function of the two subspaces R_L(j, s) and R_R(j, s) obtained after splitting a region R_k via a split (j, s).
Correction: R_k = R_L(j, s) ∪ R_R(j, s), where R_L(j, s) = {x_i ∈ R_k : x_{i,j} ≤ s} and R_R(j, s) = {x_i ∈ R_k : x_{i,j} > s}.
3. How are these two subspaces related to each other?
Correction: They are disjoint.



4. Write the new risk formula based on these two subspaces.
Correction:

min_{R_k} [ min_{γ_L} Σ_{x_i ∈ R_L} (y_i − γ_L)² + min_{γ_R} Σ_{x_i ∈ R_R} (y_i − γ_R)² ]   (5)

Remark: the following answer is also correct,

min_{j, s} [ min_{γ_L} Σ_{x_i ∈ R_L(j,s)} (y_i − γ_L)² + min_{γ_R} Σ_{x_i ∈ R_R(j,s)} (y_i − γ_R)² ]

The latter corresponds to the Lecture 1 notations.


5. Write the variance formula in a node k (corresponding to a region R_k), then in each child node of k.
Correction: The variance formula for a node k is the following:

V(R_k) := (1/Card(R_k)) Σ_{i: x_i ∈ R_k} ( y_i − average(y_i | x_i ∈ R_k) )²

The variance formula for the left child of the node k is the following:

V(R_{L(k)}) := (1/Card(R_{L(k)})) Σ_{i: x_i ∈ R_{L(k)}} ( y_i − average(y_i | x_i ∈ R_{L(k)}) )²

The variance formula for the right child of the node k is the following:

V(R_{R(k)}) := (1/Card(R_{R(k)})) Σ_{i: x_i ∈ R_{R(k)}} ( y_i − average(y_i | x_i ∈ R_{R(k)}) )²

6. Show that minimizing the empirical quadratic risk formula amounts to minimizing the variance in each hyper-rectangle of the input space partition.
Correction: We first solve min_{γ_L} Σ_{i: x_i ∈ R_L} (y_i − γ_L)².
That is, ∀k ∈ {1, . . . , K}, γ̂_L = argmin_{γ_L} Σ_{i: x_i ∈ R_L} (y_i − γ_L)², and

∂/∂γ_L ( Σ_{i: x_i ∈ R_L} (y_i − γ_L)² ) = 0
⇔ −2 Σ_{i: x_i ∈ R_L} (y_i − γ_L) = 0
⇔ Σ_{i: x_i ∈ R_L} γ_L = Σ_{i: x_i ∈ R_L} y_i
⇔ Card(R_L) γ_L = Σ_{i: x_i ∈ R_L} y_i
⇔ γ̂_L = (1/Card(R_L)) Σ_{i: x_i ∈ R_L} y_i
⇔ γ̂_L = average(y_i | x_i ∈ R_L)

(The objective is convex in γ_L, its second derivative being 2 Card(R_L) > 0, so this critical point is indeed the minimizer.)

Similarly, γ̂_R = average(y_i | x_i ∈ R_R).
Then we rewrite Eq. (5) as a minimization problem over (j, s) only, and no longer over the γ parameters, since the problem has already been optimized over them:

min_{j, s} [ Σ_{x_i ∈ R_L(j,s)} ( y_i − average(y_i | x_i ∈ R_L) )² + Σ_{x_i ∈ R_R(j,s)} ( y_i − average(y_i | x_i ∈ R_R) )² ]

⇔ min_{j, s} [ Card(R_L) V(R_L) + Card(R_R) V(R_R) ]
Remarks:



(a) We can multiply by 1/n, which is a constant, and then we recover the quadratic risk formula.
(b) Card(R_L) and Card(R_R) are constants, therefore we can multiply by them as we did in the last step.

To conclude, minimizing the empirical quadratic risk amounts to minimizing the variance in each child node. Recursively, through the construction process of a binary decision tree (CART) that minimizes the empirical quadratic risk, we minimize the variance in each hyper-rectangle of the input space partition. A short sketch of this split search is given below.
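As an illustration (not part of the original correction), here is a minimal sketch of the greedy split search derived above: for each candidate split (j, s) it evaluates Card(R_L) V(R_L) + Card(R_R) V(R_R) and keeps the minimizer. The toy data and the function name best_split are ours.

import numpy as np

def best_split(X, y):
    # Return the (feature j, threshold s) minimizing
    # Card(R_L) * V(R_L) + Card(R_R) * V(R_R), as in Exercise 4.
    best = (None, None, np.inf)
    n, p = X.shape
    for j in range(p):
        for s in np.unique(X[:, j])[:-1]:   # the largest value would leave R_R empty
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            # np.var is the mean squared deviation, i.e. V(R) above.
            cost = len(left) * np.var(left) + len(right) * np.var(right)
            if cost < best[2]:
                best = (j, s, cost)
    return best

# Toy regression data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
X = rng.uniform(size=(20, 2))
y = np.where(X[:, 0] <= 0.5, 1.0, 3.0) + 0.1 * rng.normal(size=20)
print(best_split(X, y))   # expected: feature 0, threshold near 0.5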

