TD1 ELTP 2023 Correction
Figure 2: The classification decision tree and its spatial segmentation of the input space.
where
\[
\gamma_1 = \gamma_2 = 2, \qquad \gamma_3 = \gamma_5 = 3, \qquad \gamma_4 = 1,
\]
and
\[
\begin{aligned}
R_1 &= \{x_i : x_{i,1} < 0.4\} \cap \{x_i : x_{i,2} < 0.7\} \\
R_2 &= \{x_i : 0.4 < x_{i,1} < 0.6\} \cap \{x_i : x_{i,2} < 0.7\} \\
R_3 &= \{x_i : x_{i,1} < 0.6\} \cap \{x_i : x_{i,2} > 0.7\} \\
R_4 &= \{x_i : x_{i,1} > 0.6\} \cap \{x_i : x_{i,2} < 0.4\} \\
R_5 &= \{x_i : x_{i,1} > 0.6\} \cap \{x_i : x_{i,2} > 0.4\}
\end{aligned}
\]
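Equivalently (anticipating the notation of Exercise 4), the tree predicts with the piecewise-constant function
\[
f(x) = \sum_{k=1}^{5} \gamma_k \,\mathbf{1}_{\{x \in R_k\}},
\]
so, for instance, any point with \(x_1 < 0.4\) and \(x_2 < 0.7\) falls in \(R_1\) and is predicted \(\gamma_1 = 2\).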
1. A student has trained a decision tree and notices that its performance is better on the training set
than on the test set. Should he increase or decrease the depth of the tree?
Correction: He should decrease the depth of the tree because reducing the depth will give a model
that is less prone to overfitting.
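As an illustration (not part of the original correction), the following scikit-learn sketch shows the typical pattern on the Iris dataset; the dataset and the depths tried are our own illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for depth in (1, 2, 3, 5, 10):
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    # Train accuracy keeps rising with depth, while test accuracy stops
    # improving once the tree starts fitting noise: the overfitting gap.
    print(f"depth={depth}: train={clf.score(X_tr, y_tr):.2f}, "
          f"test={clf.score(X_te, y_te):.2f}")
```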
2. A student has a dataset with n instances and p features.
(a) What is the maximum number of leaves of a decision tree on this dataset?
Correction: Each leaf contains at least one observation; therefore, the answer is n.
(b) What is the maximum depth of a decision tree on this dataset?
Correction: Each leaf contains at least one observation; therefore, the answer is n − 1. This bound
is attained when, at each split along the path from the root to the deepest leaf, a leaf containing
a single observation is created. Note that the root has a depth equal to zero; for example, with
n = 3 observations, such a chain-shaped tree has depth 2.
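Both bounds can be checked empirically (an illustrative sketch of ours, not from the original correction): fit an unrestricted tree on n points whose labels force every leaf to contain a single observation, then inspect the fitted tree.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

n = 5
X = np.arange(n, dtype=float).reshape(-1, 1)
y = np.array([0, 1, 0, 1, 0])  # alternating labels: no leaf can hold 2 points

clf = DecisionTreeClassifier().fit(X, y)
print(clf.get_n_leaves())  # n = 5 leaves, one observation each
print(clf.get_depth())     # at most n - 1 = 4 (the chain-shaped worst case)
```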
Exercise 3 (Understand the splitting process idea by practice with Gini impurity):
Consider the following dataset containing the sepal length and width of 10 plants. We want to
discriminate plants that belong to the species Iris virginica (+) from the others (-).
Label        +    +    +    +    +    +    -    -    -    -
Length (cm)  6.7  6.7  6.3  6.5  6.2  5.9  6.1  6.4  6.6  6.8
Width (cm)   3.3  3.0  2.5  3.0  3.4  3.0  2.8  2.9  3.0  2.8
1. Calculate the Gini impurity for all possible separation points using the length of the sepals as the
separating variable.
Note that the Gini impurity of a region R is defined as:
\[
\mathrm{Imp}(R) := \sum_{c=1}^{C} p_c(R)\,\bigl(1 - p_c(R)\bigr) = 1 - \sum_{c=1}^{C} p_c(R)^2 \tag{1}
\]
where \(p_c(R) := \frac{1}{|R|} \sum_{i : x_i \in R} \mathbf{1}_{\{y_i = c\}}\). Thus, if all the instances of a region belong to the same class,
the impurity of this region is equal to 0; conversely, if a region contains equally many instances of each
of the C classes, the right factor of each product is \(1 - p_c(R) = 1 - \frac{1}{C}\), i.e. \(\frac{1}{2}\) in the case of binary
classification.
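Eq. (1) translates directly into Python (a minimal sketch; the function name and interface are ours, not from the exercise):

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity of a region, Eq. (1): 1 - sum_c p_c(R)^2."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini_impurity(['+', '+', '+']))       # pure region -> 0.0
print(gini_impurity(['+', '-', '+', '-']))  # balanced binary region -> 0.5
```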
Correction: Here the parent node corresponds to the root. We denote by R0 the region associated with
the parent node. We have 4 instances of class - and 6 instances of class +. Thus,
\[
\mathrm{Gini}(R_0) = 1 - \left(\tfrac{4}{10}\right)^2 - \left(\tfrac{6}{10}\right)^2 = 0.48 \tag{2}
\]
(a) We start with the lowest length value and consider as a potential node the split length ≤ 5.9,
which gives two child nodes, denoted R1 (left child) and R2 (right child). For R1, the number
of instances from each class is summarized in the following table:

Label                +  -
number of instances  1  0

which allows us to compute
\[
\mathrm{Gini}(R_1) = 1 - \left(\tfrac{1}{1}\right)^2 - \left(\tfrac{0}{1}\right)^2 = 0
\]
Then, for R2:

Label                +  -
number of instances  5  4

which gives us
\[
\mathrm{Gini}(R_2) = 1 - \left(\tfrac{5}{9}\right)^2 - \left(\tfrac{4}{9}\right)^2 \approx 0.494
\]
We deduce the Gini impurity for the first possible separation as follows:
\[
\Delta\mathrm{Gini}(\text{length} \le 5.9) = \frac{|R_1|}{|R|}\,\mathrm{Gini}(R_1) + \frac{|R_2|}{|R|}\,\mathrm{Gini}(R_2) \approx \frac{1}{10} \times 0 + \frac{9}{10} \times 0.494 \approx 0.444 \tag{3}
\]
(b) We then consider the split length ≤ 6.1, for which R1 contains one instance of each class:

Label                +  -
number of instances  1  1

so that
\[
\mathrm{Gini}(R_1) = 1 - \left(\tfrac{1}{2}\right)^2 - \left(\tfrac{1}{2}\right)^2 = 0.5
\]
and, for R2:

Label                +  -
number of instances  5  3

\[
\mathrm{Gini}(R_2) = 1 - \left(\tfrac{5}{8}\right)^2 - \left(\tfrac{3}{8}\right)^2 \approx 0.469
\]
We deduce the Gini impurity for the second possible separation as follows:
\[
\Delta\mathrm{Gini}(\text{length} \le 6.1) = \frac{|R_1|}{|R|}\,\mathrm{Gini}(R_1) + \frac{|R_2|}{|R|}\,\mathrm{Gini}(R_2) = \frac{2}{10} \times 0.5 + \frac{8}{10} \times 0.469 \approx 0.475 \tag{4}
\]
By repeating this process, we obtain all the requested results, summarized in the following table:

Split val.  < 5.9  ≤ 5.9          ≤ 6.1          ≤ 6.2  ≤ 6.3  ≤ 6.4  ≤ 6.5  ≤ 6.6  ≤ 6.7  ≤ 6.8
∆Gini       -      0.444 (Eq. 3)  0.475 (Eq. 4)  0.476  0.450  0.480  0.467  0.476  0.400  -
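The whole table can be reproduced with a short script (an illustrative sketch reusing the gini_impurity function from the sketch above; the helper name weighted_gini is ours):

```python
length = [6.7, 6.7, 6.3, 6.5, 6.2, 5.9, 6.1, 6.4, 6.6, 6.8]
label  = ['+', '+', '+', '+', '+', '+', '-', '-', '-', '-']

def weighted_gini(feature, labels, threshold):
    # Split into left (feature <= threshold) and right child regions, then
    # weight each child's impurity by its share of the observations.
    left  = [l for x, l in zip(feature, labels) if x <= threshold]
    right = [l for x, l in zip(feature, labels) if x > threshold]
    n = len(labels)
    return (len(left) / n) * gini_impurity(left) \
         + (len(right) / n) * gini_impurity(right)

# The largest value yields no split (an empty right child), so we skip it,
# just as the "-" entries in the table do.
for s in sorted(set(length))[:-1]:
    print(f"length <= {s}: {weighted_gini(length, label, s):.3f}")
```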
2. Calculate the Gini impurity for all possible separation points using the width of the sepals as the
separating variable.
Correction: Following the same process as before, we obtain for the Width variable:

Split val.  < 2.5  ≤ 2.5  ≤ 2.8  ≤ 2.9  ≤ 3    ≤ 3.3  ≤ 3.4
∆Gini       -      0.444  0.419  0.317  0.400  0.444  -
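Reusing label, gini_impurity, and weighted_gini from the sketches above, this table can be reproduced the same way:

```python
width = [3.3, 3.0, 2.5, 3.0, 3.4, 3.0, 2.8, 2.9, 3.0, 2.8]

for s in sorted(set(width))[:-1]:
    print(f"width <= {s}: {weighted_gini(width, label, s):.3f}")
```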
3. What is the first node of a decision tree trained on this dataset with the Gini impurity?
Correction: The lowest Gini impurity value is 0.317, obtained by splitting on the width
at the threshold of 2.9 cm, i.e., width ≤ 2.9. Therefore, the first node of the decision tree
compares the sepal width to the value 2.9. The left child contains four plants, including one
Iris virginica; the right child contains six plants, five of which are Iris virginica.
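As a sanity check (our addition, assuming scikit-learn is available), fitting a depth-1 tree on the same data recovers this split. Note that scikit-learn places thresholds halfway between consecutive observed values, so the test prints as width <= 2.95, which is equivalent to width ≤ 2.9 on this data:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[6.7, 3.3], [6.7, 3.0], [6.3, 2.5], [6.5, 3.0], [6.2, 3.4],
     [5.9, 3.0], [6.1, 2.8], [6.4, 2.9], [6.6, 3.0], [6.8, 2.8]]
y = ['+', '+', '+', '+', '+', '+', '-', '-', '-', '-']

stump = DecisionTreeClassifier(criterion='gini', max_depth=1).fit(X, y)
print(export_text(stump, feature_names=['length', 'width']))
```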
Exercise 4: Understand the splitting process idea (to do on paper): We will show that minimizing
the quadratic risk amounts to minimizing the variance in each hyper-rectangle of the input space
partition.
1. For each region Rk, write the empirical quadratic risk minimization problem where the predictor is
a decision tree.
Correction:
\[
\min_{\{\gamma_k,\, R_k\}_{k=1}^{K}} \; \frac{1}{n} \sum_{i=1}^{n} \Bigl( y_i - \sum_{k=1}^{K} \gamma_k \,\mathbf{1}_{\{x_i \in R_k\}} \Bigr)^{2}
\]
\[
\iff \; \forall k \in \{1, \dots, K\}, \quad \min_{\gamma_k,\, R_k} \; \frac{1}{n} \sum_{i : x_i \in R_k} (y_i - \gamma_k)^{2}
\]
(since the regions Rk are disjoint, the global problem decouples into one such problem per region).
2. Write Rk as a function of the two subspaces RL (j, s) and RR (j, s) obtained after splitting the region Rk via
a split (j, s).
Correction: Rk = RL (j, s) ∪ RR (j, s), such that RL (j, s) = {xi ∈ Rk : xi,j ≤ s} and
RR (j, s) = {xi ∈ Rk : xi,j > s}.
3. How do these two subspaces relate to each other?
Correction: They are disjoint: RL (j, s) ∩ RR (j, s) = ∅.
The variance formula for the left child of node k is the following:
\[
V(R_{L(k)}) := \frac{1}{\mathrm{Card}(R_{L(k)})} \sum_{i : x_i \in R_{L(k)}} \bigl( y_i - \mathrm{average}(y_i \mid x_i \in R_{L(k)}) \bigr)^{2}
\]
The variance formula for the right child of node k is the following:
\[
V(R_{R(k)}) := \frac{1}{\mathrm{Card}(R_{R(k)})} \sum_{i : x_i \in R_{R(k)}} \bigl( y_i - \mathrm{average}(y_i \mid x_i \in R_{R(k)}) \bigr)^{2}
\]
6. Show that minimizing the empirical quadratic risk formula amounts to minimizing the variance in
each hyper-rectangle of the input space partition.
Correction: We first solve
\[
\min_{\gamma_L} \sum_{i : x_i \in R_L} (y_i - \gamma_L)^{2}
\]
That is, \(\hat{\gamma}_L = \operatorname*{argmin}_{\gamma_L} \sum_{i : x_i \in R_L} (y_i - \gamma_L)^{2}\), and setting the derivative to zero:
\[
\frac{\partial}{\partial \gamma_L} \Bigl( \sum_{i : x_i \in R_L} (y_i - \gamma_L)^{2} \Bigr) = 0
\]
\[
\iff -2 \sum_{i : x_i \in R_L} (y_i - \gamma_L) = 0
\]
\[
\iff \sum_{i : x_i \in R_L} \gamma_L = \sum_{i : x_i \in R_L} y_i
\]
\[
\iff \mathrm{Card}(R_L)\,\gamma_L = \sum_{i : x_i \in R_L} y_i
\]
\[
\iff \hat{\gamma}_L = \frac{1}{\mathrm{Card}(R_L)} \sum_{i : x_i \in R_L} y_i
\]
Remarks:
Substituting \(\hat{\gamma}_L\) back into the objective gives exactly \(\mathrm{Card}(R_L)\,V(R_L)\), and the same computation
holds for the right child. To conclude, minimizing the empirical quadratic risk amounts to minimizing
the variance in each child node. Recursively, by the construction process of a binary decision tree
(CART) that minimizes the empirical quadratic risk at each split, we minimize the variance in each
hyper-rectangle of the input space partition.
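A quick numerical confirmation of the last step (an illustrative sketch with synthetic data of our choosing): over a grid of candidate constants γ, the sum of squared errors is minimized at the sample mean.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.0, size=30)  # synthetic responses in one region

gammas = np.linspace(y.min(), y.max(), 2001)  # candidate constants gamma_L
risks = ((y[:, None] - gammas[None, :]) ** 2).sum(axis=0)

# The grid minimizer coincides (up to grid resolution) with the sample mean,
# as derived above: gamma_hat_L is the mean of the y_i over the region.
print(gammas[risks.argmin()], y.mean())
```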