
M2 STEPE | Introduction to Machine Learning | 09/11/2023

Trees and Forests

Emilie Chautru
Geosciences and Geoengineering Department
Geostatistics team
Back to the beginning

What have you seen so far?



I
Decision Trees
What is a decision tree?

Setting
  $d \in \mathbb{N}$ features $x = (x_1, \dots, x_d) \in \mathcal{X} \subset \mathbb{R}^d$ and a label $y \in \mathcal{Y} \subset \mathbb{R}$
  $n \in \mathbb{N}$ data $\mathcal{D}_n = (x^i, y^i)_{1 \le i \le n}$
  → features continuous and/or categorical



A decision tree is a hierarchical prediction model that can be represented by a tree structure: each internal node corresponds to a test on the input features and each leaf to a predicted output.


Example: classifying fruit

Colour?
├─ red → Size?
│   ├─ < 5cm → Shape?
│   │   ├─ round → Cherry
│   │   └─ pointy → Strawberry
│   └─ ≥ 5cm → Apple
└─ yellow → Shape?
    ├─ oval → Lemon
    ├─ round → Apple
    └─ curved → Banana



Binary tree: each internal node has exactly 2 children
→ every tree can be rewritten as a binary tree


The same example as a binary tree:

Red?
├─ yes → < 5cm?
│   ├─ yes → Round?
│   │   ├─ yes → Cherry
│   │   └─ no → Strawberry
│   └─ no → Apple
└─ no → Curved?
    ├─ yes → Banana
    └─ no → Round?
        ├─ yes → Apple
        └─ no → Lemon
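As a sanity check of the picture above, here is a minimal sketch of that binary tree as nested feature tests in Python; the feature names (colour, size_cm, shape) are illustrative and not taken from any library.

```python
# A hand-written version of the binary fruit tree: each test routes the input
# to one of two children until a leaf (a predicted fruit) is reached.
def classify_fruit(colour: str, size_cm: float, shape: str) -> str:
    if colour == "red":
        if size_cm < 5:
            return "Cherry" if shape == "round" else "Strawberry"
        return "Apple"
    if shape == "curved":
        return "Banana"
    return "Apple" if shape == "round" else "Lemon"

print(classify_fruit("red", 1.0, "round"))       # Cherry
print(classify_fruit("yellow", 20.0, "curved"))  # Banana
```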


Trees and partition
A tree 𝑔 ∶ 𝒳 → 𝒴 with 𝑅 ∈ ℕ leaves defines a partition {ℛ1 , … , ℛ𝑅 } of 𝒳

[Figure: a binary tree with splits of the form $x_1 < s_{11}$, $x_1 < s_{12}$, $x_2 < s_{21}$, $x_2 < s_{22}$, and the corresponding rectangular partition $\mathcal{R}_1, \dots, \mathcal{R}_5$ of $\mathcal{X}$]



$$\forall x \in \mathcal{X}, \quad g(x) = \sum_{r=1}^{R} \mathbf{1}_{x \in \mathcal{R}_r}\, y(\mathcal{R}_r)$$

Classification ($K \in \mathbb{N}$ classes, $\mathcal{Y} = \{1, \dots, K\}$) → majority vote

$$\forall \mathcal{R} \subset \mathcal{X}, \quad y(\mathcal{R}) = \operatorname*{argmax}_{1 \le k \le K} \sum_{i=1}^{n} \mathbf{1}_{x^i \in \mathcal{R}}\, \mathbf{1}_{y^i = k}$$

Regression ($\mathcal{Y} = \mathbb{R}$) → average label

$$\forall \mathcal{R} \subset \mathcal{X}, \quad n(\mathcal{R}) = \sum_{i=1}^{n} \mathbf{1}_{x^i \in \mathcal{R}}, \qquad y(\mathcal{R}) = \frac{1}{n(\mathcal{R})} \sum_{i=1}^{n} y^i\, \mathbf{1}_{x^i \in \mathcal{R}}$$
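A small sketch of the two leaf-value rules above, assuming a NumPy array of labels y and a boolean mask marking the training points that fall in a region R; the function names are illustrative.

```python
import numpy as np

def leaf_value_classification(y, in_region):
    # Majority vote: most frequent label among training points falling in R
    labels, counts = np.unique(y[in_region], return_counts=True)
    return labels[np.argmax(counts)]

def leaf_value_regression(y, in_region):
    # Average label over training points falling in R
    return y[in_region].mean()

# Toy usage with made-up data
mask = np.array([True, True, True, False, False])
print(leaf_value_classification(np.array([0, 1, 1, 2, 1]), mask))        # 1
print(leaf_value_regression(np.array([3.0, 5.0, 4.0, 10.0, 6.0]), mask)) # 4.0
```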




Growing a binary tree | CART

Problem: find a tree-like partition that minimizes the empirical risk (generally computationally infeasible)

Solution: a greedy, recursive, divisive algorithm

Classification And Regression Trees (Breiman et al., 1984)

A node is defined by a variable $x_j \in \mathcal{X}_j$ and a subset $\mathcal{S} \subset \mathcal{X}_j$ that split $\mathcal{X}$ into

$$\mathcal{L}_{j,\mathcal{S}} = \{x \in \mathcal{X} \mid x_j \in \mathcal{S}\} \quad \text{and} \quad \mathcal{R}_{j,\mathcal{S}} = \{x \in \mathcal{X} \mid x_j \notin \mathcal{S}\}$$

→ $(j, \mathcal{S})$ is chosen to minimize a splitting cost


Growing a binary tree | Splitting cost

Regression ($\mathcal{Y} = \mathbb{R}$) → Mean Squared Error

$$\forall \mathcal{R} \subset \mathcal{X}, \quad y(\mathcal{R}) = \frac{1}{n(\mathcal{R})} \sum_{i=1}^{n} y^i\, \mathbf{1}_{x^i \in \mathcal{R}}$$

$$C(j, \mathcal{S}) = \sum_{i=1}^{n} \left(y^i - y(\mathcal{L}_{j,\mathcal{S}})\right)^2 \mathbf{1}_{x^i \in \mathcal{L}_{j,\mathcal{S}}} + \sum_{i=1}^{n} \left(y^i - y(\mathcal{R}_{j,\mathcal{S}})\right)^2 \mathbf{1}_{x^i \in \mathcal{R}_{j,\mathcal{S}}}$$
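A sketch of this cost for a numeric feature, restricted to splits of the form $x_j < s$ (so $\mathcal{S}$ is a half-line); X and y are assumed to be NumPy arrays and the helper names are made up for illustration.

```python
import numpy as np

def mse_split_cost(X, y, j, s):
    # Sum of squared deviations from the leaf mean on each side of the split
    left = X[:, j] < s
    cost = 0.0
    for mask in (left, ~left):
        if mask.any():
            cost += ((y[mask] - y[mask].mean()) ** 2).sum()
    return cost

def best_numeric_split(X, y, j):
    # Scan the observed values of feature j and keep the cheapest threshold
    thresholds = np.unique(X[:, j])
    costs = [mse_split_cost(X, y, j, s) for s in thresholds]
    best = int(np.argmin(costs))
    return thresholds[best], costs[best]
```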


Growing a binary tree | Splitting cost

Classification ($K \in \mathbb{N}$ classes, $\mathcal{Y} = \{1, \dots, K\}$) → Impurity criterion

$$\forall \mathcal{R} \subset \mathcal{X},\ \forall k \in \mathcal{Y}, \quad p_k(\mathcal{R}) = \frac{1}{n(\mathcal{R})} \sum_{i=1}^{n} \mathbf{1}_{x^i \in \mathcal{R}}\, \mathbf{1}_{y^i = k}$$

classification error: $I(\mathcal{R}) = 1 - \max_{1 \le k \le K} p_k(\mathcal{R})$
cross entropy: $I(\mathcal{R}) = -\sum_{k=1}^{K} p_k(\mathcal{R}) \log_2 p_k(\mathcal{R})$
Gini: $I(\mathcal{R}) = \sum_{k=1}^{K} p_k(\mathcal{R}) \left(1 - p_k(\mathcal{R})\right)$

$$C(j, \mathcal{S}) = \frac{n(\mathcal{L}_{j,\mathcal{S}})}{n}\, I(\mathcal{L}_{j,\mathcal{S}}) + \frac{n(\mathcal{R}_{j,\mathcal{S}})}{n}\, I(\mathcal{R}_{j,\mathcal{S}})$$
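The three impurity criteria and the weighted cost translate directly into code; a sketch assuming NumPy arrays of labels for the left and right children (function names are illustrative).

```python
import numpy as np

def class_proportions(y, classes):
    # p_k(R) for every class k, given the labels of the points falling in R
    return np.array([np.mean(y == k) if len(y) else 0.0 for k in classes])

def misclassification(p):
    return 1.0 - p.max()

def cross_entropy(p):
    p = p[p > 0]                      # convention: 0 * log2(0) = 0
    return -(p * np.log2(p)).sum()

def gini(p):
    return (p * (1.0 - p)).sum()

def split_cost(y_left, y_right, classes, impurity=gini):
    # C(j, S) = n(L)/n * I(L) + n(R)/n * I(R)
    n = len(y_left) + len(y_right)
    return (len(y_left) / n * impurity(class_proportions(y_left, classes))
            + len(y_right) / n * impurity(class_proportions(y_right, classes)))
```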


Growing a binary tree | Example

[Figure: regression of penguin body mass (g) on flipper length (mm); top panel, MSE splitting cost for each candidate threshold on flipper length (mm); bottom panel, body mass (g) versus flipper length (mm)]


[Figure: the same example at the next step, with the MSE splitting cost recomputed for each candidate threshold (top) and body mass (g) versus flipper length (mm) (bottom)]


[Figure: classification example; bill length (mm) versus flipper length (mm), coloured by species (Adelie, Chinstrap, Gentoo)]




The fitted tree:

flipper_length_mm < 207?
├─ yes → bill_length_mm < 45?
│   ├─ yes → Adelie
│   └─ no → Chinstrap
└─ no → Gentoo
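A tree very similar to this one can be reproduced with scikit-learn; the sketch below assumes the palmerpenguins and scikit-learn packages are installed, and the exact thresholds may differ slightly from those on the slide.

```python
from palmerpenguins import load_penguins
from sklearn.tree import DecisionTreeClassifier, export_text

penguins = load_penguins().dropna(
    subset=["flipper_length_mm", "bill_length_mm", "species"]
)
X = penguins[["flipper_length_mm", "bill_length_mm"]]
y = penguins["species"]

# Depth-2 tree: first split on flipper length, then on bill length
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))
```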


Growing a binary tree | Pruning

When to stop growing the tree?

Arbitrary constraints: minimum number of training observations per leaf, maximum depth of the tree

Regularisation: add a complexity cost. For a tree $g$ defining $R \in \mathbb{N}$ regions $\mathcal{R}_1, \dots, \mathcal{R}_R \subset \mathcal{X}$ and a hyperparameter $\lambda \in \mathbb{R}_+^*$,

$$C_\lambda(g) = \sum_{r=1}^{R} n(\mathcal{R}_r)\, I(\mathcal{R}_r) + \lambda R$$
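These stopping rules map directly onto hyperparameters of scikit-learn's tree implementation; a sketch where ccp_alpha plays the role of the penalty $\lambda$ (cost-complexity pruning) and the numeric values are arbitrary.

```python
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    min_samples_leaf=5,   # arbitrary constraint: minimum training observations per leaf
    max_depth=4,          # arbitrary constraint: maximum depth of the tree
    ccp_alpha=0.01,       # complexity penalty per leaf, analogous to lambda above
    random_state=0,
)
# tree.fit(X_train, y_train)  # X_train, y_train assumed given
```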


Review on trees

Assets
allows any type (continuous/categorical) of features and label
adapted to multi-class problems
adapted to multi-modal data
easy to implement, visualise and interpret

Liabilities
weak learners (too simple)
not robust to small variations of the data

→ Ensemble learning


II
Ensemble learning
From trees to forests: ensemble learning

Wisdom of the crowd: a combination of weak learners can perform significantly better than any one of them.

Bagging: the errors of many independent weak learners can balance each other out.

Boosting: a weak learner can improve on another by focusing on its errors.


Bagging

Leo Breiman (1996)

Principle
  create $B \in \mathbb{N}$ bootstrap samples of the dataset
  apply the algorithm to each of the $B$ samples to get $B$ predictors
  combine all predictions to form an average predictor
  → parallel calculations possible

Bagging reduces the variance of the individual learners → more stable and better performance.

Caveat: only useful with weak learners such as trees.
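A minimal sketch of bagging for regression, assuming NumPy arrays X and y and scikit-learn decision trees as the weak learners; the function names are illustrative, and scikit-learn's BaggingRegressor packages the same idea off the shelf.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagging_fit(X, y, B=100, random_state=0):
    rng = np.random.default_rng(random_state)
    n = len(y)
    trees = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)   # bootstrap sample (drawn with replacement)
        trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    return trees

def bagging_predict(trees, X):
    # Average the B individual predictions
    return np.mean([t.predict(X) for t in trees], axis=0)
```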


Random Forests

Leo Breiman (2001)

Principle
  create $B$ bootstrap sub-samples of the dataset
  when growing each tree, consider only a random subset of the $d$ variables at each split
  apply CART to each of the $B$ sub-samples, forming a forest
  combine the $B$ trees to form an average predictor over the forest
  → typically take $\sqrt{d}$ variables in classification and $d/3$ in regression

Random Forests make the individual trees as diverse as possible → high performance, and quite robust to the choice of hyperparameters if $B$ is large enough.
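In scikit-learn the whole recipe is packaged as RandomForestClassifier / RandomForestRegressor; a sketch where n_estimators is $B$ and max_features controls the size of the random subset of variables tried at each split (the numeric values are arbitrary).

```python
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

clf = RandomForestClassifier(n_estimators=500, max_features="sqrt", random_state=0)
reg = RandomForestRegressor(n_estimators=500, max_features=1 / 3, random_state=0)
# clf.fit(X_train, y_train); clf.predict(X_test)  # X_train, y_train, X_test assumed given
```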


Boosting
Freund and Schapire (1997)

AdaBoost (binary classification, $\mathcal{Y} = \{-1, +1\}$): set a number $M \in \mathbb{N}$ of iterations

  give each data point $i$ an initial weight $\omega_i^1 = \frac{1}{n}$
  for $m \in \{1, \dots, M\}$:
    learn a decision tree $g_m$ on the dataset with weights $(\omega_i^m)_{1 \le i \le n}$
    compute the weighted error of the model: $\epsilon_m = \sum_{i=1}^{n} \omega_i^m\, \mathbf{1}_{g_m(x^i) \ne y^i}$
    compute the confidence level of the model: $\alpha_m = \frac{1}{2} \ln\!\left(\frac{1 - \epsilon_m}{\epsilon_m}\right)$
    update the weights so that they focus on where $g_m$ makes mistakes:
    $$\omega_i^{m+1} = \frac{\omega_i^m \exp\{-\alpha_m\, y^i\, g_m(x^i)\}}{\sum_{\ell=1}^{n} \omega_\ell^m \exp\{-\alpha_m\, y^\ell\, g_m(x^\ell)\}}$$
  return the final decision function: $g : x \in \mathcal{X} \mapsto \sum_{m=1}^{M} \alpha_m\, g_m(x)$
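A sketch of this loop for labels in $\{-1, +1\}$, using scikit-learn decision stumps as the weak learners $g_m$; a small clip on $\epsilon_m$ is added to avoid dividing by zero, which the formulas above leave implicit.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, M=50):
    n = len(y)
    w = np.full(n, 1.0 / n)                       # initial weights w_i^1 = 1/n
    stumps, alphas = [], []
    for _ in range(M):
        g = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = g.predict(X)
        eps = np.clip(w[pred != y].sum(), 1e-12, 1 - 1e-12)  # weighted error
        alpha = 0.5 * np.log((1 - eps) / eps)     # confidence level
        w = w * np.exp(-alpha * y * pred)         # focus the weights on mistakes
        w /= w.sum()                              # normalise (the denominator above)
        stumps.append(g)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_decision(stumps, alphas, X):
    # Weighted vote; take np.sign(...) to get the predicted class in {-1, +1}
    return sum(a * g.predict(X) for g, a in zip(stumps, alphas))
```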


References

Chloé-Agathe Azencott. Introduction au Machine Learning. Collection InfoSup, Dunod, 2022.

Trevor Hastie, Robert Tibshirani, Jerome Friedman. The Elements of Statistical Learning. Springer Series in Statistics, 2009.
