
M2 STEPE | Introduction to Machine Learning | 09/11/2023

Trees and Forests

Emilie Chautru
Geosciences and Geoengineering Department
Geostatistics team
Back to the beginning

What have you seen so far?



I
Decision Trees
What is a decision tree?

Setting
  $d \in \mathbb{N}$ features $x = (x_1, \dots, x_d) \in \mathcal{X} \subset \mathbb{R}^d$ and a label $y \in \mathcal{Y} \subset \mathbb{R}$
  $n \in \mathbb{N}$ data $\mathcal{D}_n = (x^i, y^i)_{1 \le i \le n}$
  → features continuous and/or categorical



A decision tree is a hierarchical prediction model that can be represented by a tree structure: each internal node corresponds to a test on the input features and each leaf to a predicted output.


Example: classifying fruit

Colour?
├─ red → Size?
│   ├─ < 5cm → Shape?
│   │   ├─ round → Cherry
│   │   └─ pointy → Strawberry
│   └─ ≥ 5cm → Apple
└─ yellow → Shape?
    ├─ oval → Lemon
    ├─ round → Apple
    └─ curved → Banana



Binary tree: each internal node has exactly 2 children
→ every tree can be rewritten as a binary tree


The same example as a binary tree:

Red?
├─ yes → < 5cm?
│   ├─ yes → Round?
│   │   ├─ yes → Cherry
│   │   └─ no → Strawberry
│   └─ no → Apple
└─ no → Curved?
    ├─ yes → Banana
    └─ no → Round?
        ├─ yes → Apple
        └─ no → Lemon
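As a sanity check of the picture above, here is a minimal sketch of that binary tree as nested feature tests in Python; the feature names (colour, size_cm, shape) are illustrative and not taken from any library.

```python
# A hand-written version of the binary fruit tree: each test routes the input
# to one of two children until a leaf (a predicted fruit) is reached.
def classify_fruit(colour: str, size_cm: float, shape: str) -> str:
    if colour == "red":
        if size_cm < 5:
            return "Cherry" if shape == "round" else "Strawberry"
        return "Apple"
    if shape == "curved":
        return "Banana"
    return "Apple" if shape == "round" else "Lemon"

print(classify_fruit("red", 1.0, "round"))       # Cherry
print(classify_fruit("yellow", 20.0, "curved"))  # Banana
```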


Trees and partition
A tree 𝑔 ∶ 𝒳 → 𝒴 with 𝑅 ∈ ℕ leaves defines a partition {ℛ1 , … , ℛ𝑅 } of 𝒳

[Figure: a binary tree with splits of the form $x_1 < s_{11}$, $x_1 < s_{12}$, $x_2 < s_{21}$, $x_2 < s_{22}$, and the corresponding rectangular partition $\mathcal{R}_1, \dots, \mathcal{R}_5$ of $\mathcal{X}$]



$$\forall x \in \mathcal{X}, \quad g(x) = \sum_{r=1}^{R} \mathbf{1}_{x \in \mathcal{R}_r}\, y(\mathcal{R}_r)$$

Classification ($K \in \mathbb{N}$ classes, $\mathcal{Y} = \{1, \dots, K\}$) → majority vote

$$\forall \mathcal{R} \subset \mathcal{X}, \quad y(\mathcal{R}) = \operatorname*{argmax}_{1 \le k \le K} \sum_{i=1}^{n} \mathbf{1}_{x^i \in \mathcal{R}}\, \mathbf{1}_{y^i = k}$$

Regression ($\mathcal{Y} = \mathbb{R}$) → average label

$$\forall \mathcal{R} \subset \mathcal{X}, \quad n(\mathcal{R}) = \sum_{i=1}^{n} \mathbf{1}_{x^i \in \mathcal{R}}, \qquad y(\mathcal{R}) = \frac{1}{n(\mathcal{R})} \sum_{i=1}^{n} y^i\, \mathbf{1}_{x^i \in \mathcal{R}}$$
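A small sketch of the two leaf-value rules above, assuming a NumPy array of labels y and a boolean mask marking the training points that fall in a region R; the function names are illustrative.

```python
import numpy as np

def leaf_value_classification(y, in_region):
    # Majority vote: most frequent label among training points falling in R
    labels, counts = np.unique(y[in_region], return_counts=True)
    return labels[np.argmax(counts)]

def leaf_value_regression(y, in_region):
    # Average label over training points falling in R
    return y[in_region].mean()

# Toy usage with made-up data
mask = np.array([True, True, True, False, False])
print(leaf_value_classification(np.array([0, 1, 1, 2, 1]), mask))        # 1
print(leaf_value_regression(np.array([3.0, 5.0, 4.0, 10.0, 6.0]), mask)) # 4.0
```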




Growing a binary tree | CART

Problem: find a tree-like partition that minimizes the empirical risk (generally computationally infeasible)

Solution: a greedy, recursive, divisive algorithm

Classification And Regression Trees (Breiman et al., 1984)

A node is defined by a variable $x_j \in \mathcal{X}_j$ and a subset $\mathcal{S} \subset \mathcal{X}_j$ that split $\mathcal{X}$ into

$$\mathcal{L}_{j,\mathcal{S}} = \{x \in \mathcal{X} \mid x_j \in \mathcal{S}\} \quad \text{and} \quad \mathcal{R}_{j,\mathcal{S}} = \{x \in \mathcal{X} \mid x_j \notin \mathcal{S}\}$$

→ $(j, \mathcal{S})$ is chosen to minimize a splitting cost


Growing a binary tree | Splitting cost

Regression ($\mathcal{Y} = \mathbb{R}$) → Mean Squared Error

$$\forall \mathcal{R} \subset \mathcal{X}, \quad y(\mathcal{R}) = \frac{1}{n(\mathcal{R})} \sum_{i=1}^{n} y^i\, \mathbf{1}_{x^i \in \mathcal{R}}$$

$$C(j, \mathcal{S}) = \sum_{i=1}^{n} \left(y^i - y(\mathcal{L}_{j,\mathcal{S}})\right)^2 \mathbf{1}_{x^i \in \mathcal{L}_{j,\mathcal{S}}} + \sum_{i=1}^{n} \left(y^i - y(\mathcal{R}_{j,\mathcal{S}})\right)^2 \mathbf{1}_{x^i \in \mathcal{R}_{j,\mathcal{S}}}$$
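A sketch of this cost for a numeric feature, restricted to splits of the form $x_j < s$ (so $\mathcal{S}$ is a half-line); X and y are assumed to be NumPy arrays and the helper names are made up for illustration.

```python
import numpy as np

def mse_split_cost(X, y, j, s):
    # Sum of squared deviations from the leaf mean on each side of the split
    left = X[:, j] < s
    cost = 0.0
    for mask in (left, ~left):
        if mask.any():
            cost += ((y[mask] - y[mask].mean()) ** 2).sum()
    return cost

def best_numeric_split(X, y, j):
    # Scan the observed values of feature j and keep the cheapest threshold
    thresholds = np.unique(X[:, j])
    costs = [mse_split_cost(X, y, j, s) for s in thresholds]
    best = int(np.argmin(costs))
    return thresholds[best], costs[best]
```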


Growing a binary tree | Splitting cost

Classification ($K \in \mathbb{N}$ classes, $\mathcal{Y} = \{1, \dots, K\}$) → Impurity criterion

$$\forall \mathcal{R} \subset \mathcal{X},\ \forall k \in \mathcal{Y}, \quad p_k(\mathcal{R}) = \frac{1}{n(\mathcal{R})} \sum_{i=1}^{n} \mathbf{1}_{x^i \in \mathcal{R}}\, \mathbf{1}_{y^i = k}$$

classification error: $I(\mathcal{R}) = 1 - \max_{1 \le k \le K} p_k(\mathcal{R})$
cross entropy: $I(\mathcal{R}) = -\sum_{k=1}^{K} p_k(\mathcal{R}) \log_2 p_k(\mathcal{R})$
Gini: $I(\mathcal{R}) = \sum_{k=1}^{K} p_k(\mathcal{R}) \left(1 - p_k(\mathcal{R})\right)$

$$C(j, \mathcal{S}) = \frac{n(\mathcal{L}_{j,\mathcal{S}})}{n}\, I(\mathcal{L}_{j,\mathcal{S}}) + \frac{n(\mathcal{R}_{j,\mathcal{S}})}{n}\, I(\mathcal{R}_{j,\mathcal{S}})$$
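The three impurity criteria and the weighted cost translate directly into code; a sketch assuming NumPy arrays of labels for the left and right children (function names are illustrative).

```python
import numpy as np

def class_proportions(y, classes):
    # p_k(R) for every class k, given the labels of the points falling in R
    return np.array([np.mean(y == k) if len(y) else 0.0 for k in classes])

def misclassification(p):
    return 1.0 - p.max()

def cross_entropy(p):
    p = p[p > 0]                      # convention: 0 * log2(0) = 0
    return -(p * np.log2(p)).sum()

def gini(p):
    return (p * (1.0 - p)).sum()

def split_cost(y_left, y_right, classes, impurity=gini):
    # C(j, S) = n(L)/n * I(L) + n(R)/n * I(R)
    n = len(y_left) + len(y_right)
    return (len(y_left) / n * impurity(class_proportions(y_left, classes))
            + len(y_right) / n * impurity(class_proportions(y_right, classes)))
```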


Growing a binary tree | Example

[Figure: regression of penguin body mass (g) on flipper length (mm); top panel, MSE splitting cost for each candidate threshold on flipper length (mm); bottom panel, body mass (g) versus flipper length (mm)]


[Figure: the same example at the next step, with the MSE splitting cost recomputed for each candidate threshold (top) and body mass (g) versus flipper length (mm) (bottom)]


[Figure: classification example; bill length (mm) versus flipper length (mm), coloured by species (Adelie, Chinstrap, Gentoo)]




The fitted tree:

flipper_length_mm < 207?
├─ yes → bill_length_mm < 45?
│   ├─ yes → Adelie
│   └─ no → Chinstrap
└─ no → Gentoo
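A tree very similar to this one can be reproduced with scikit-learn; the sketch below assumes the palmerpenguins and scikit-learn packages are installed, and the exact thresholds may differ slightly from those on the slide.

```python
from palmerpenguins import load_penguins
from sklearn.tree import DecisionTreeClassifier, export_text

penguins = load_penguins().dropna(
    subset=["flipper_length_mm", "bill_length_mm", "species"]
)
X = penguins[["flipper_length_mm", "bill_length_mm"]]
y = penguins["species"]

# Depth-2 tree: first split on flipper length, then on bill length
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))
```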


Growing a binary tree | Pruning

When to stop growing the tree?

Arbitrary constraints: minimum number of training observations per leaf, maximum depth of the tree

Regularisation: add a complexity cost. For a tree $g$ defining $R \in \mathbb{N}$ regions $\mathcal{R}_1, \dots, \mathcal{R}_R \subset \mathcal{X}$ and a hyperparameter $\lambda \in \mathbb{R}_+^*$,

$$C_\lambda(g) = \sum_{r=1}^{R} n(\mathcal{R}_r)\, I(\mathcal{R}_r) + \lambda R$$
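These stopping rules map directly onto hyperparameters of scikit-learn's tree implementation; a sketch where ccp_alpha plays the role of the penalty $\lambda$ (cost-complexity pruning) and the numeric values are arbitrary.

```python
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    min_samples_leaf=5,   # arbitrary constraint: minimum training observations per leaf
    max_depth=4,          # arbitrary constraint: maximum depth of the tree
    ccp_alpha=0.01,       # complexity penalty per leaf, analogous to lambda above
    random_state=0,
)
# tree.fit(X_train, y_train)  # X_train, y_train assumed given
```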


Review on trees

Assets
allows any type (continuous/categorical) of features and label
adapted to multi-class problems
adapted to multi-modal data
easy to implement, visualise and interpret

Liabilities
weak learners (too simple)
not robust to small variations of the data

→ Ensemble learning


II
Ensemble learning
From trees to forests: ensemble learning

Wisdom of the crowd: a combination of weak learners can perform significantly better than any one of them.

Bagging: the errors of many independent weak learners can balance each other out.

Boosting: a weak learner can improve on another by focusing on its errors.


Bagging

Leo Breiman (1996)

Principle
  create $B \in \mathbb{N}$ bootstrap samples of the dataset
  apply the algorithm to each of the $B$ samples to get $B$ predictors
  combine all predictions to form an average predictor
  → parallel calculations possible

Bagging reduces the variance of the individual learners → more stable and better performance.

Caveat: only useful with weak learners such as trees.
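A minimal sketch of bagging for regression, assuming NumPy arrays X and y and scikit-learn decision trees as the weak learners; the function names are illustrative, and scikit-learn's BaggingRegressor packages the same idea off the shelf.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagging_fit(X, y, B=100, random_state=0):
    rng = np.random.default_rng(random_state)
    n = len(y)
    trees = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)   # bootstrap sample (drawn with replacement)
        trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    return trees

def bagging_predict(trees, X):
    # Average the B individual predictions
    return np.mean([t.predict(X) for t in trees], axis=0)
```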


Random Forests

Leo Breiman (2001)

Principle
  create $B$ bootstrap sub-samples of the dataset
  when growing each tree, consider only a random subset of the $d$ variables at each split
  apply CART to each of the $B$ sub-samples, forming a forest
  combine the $B$ trees to form an average predictor over the forest
  → typically take $\sqrt{d}$ variables in classification and $d/3$ in regression

Random Forests make the individual trees as diverse as possible → high performance, and quite robust to the choice of hyperparameters if $B$ is large enough.
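In scikit-learn the whole recipe is packaged as RandomForestClassifier / RandomForestRegressor; a sketch where n_estimators is $B$ and max_features controls the size of the random subset of variables tried at each split (the numeric values are arbitrary).

```python
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

clf = RandomForestClassifier(n_estimators=500, max_features="sqrt", random_state=0)
reg = RandomForestRegressor(n_estimators=500, max_features=1 / 3, random_state=0)
# clf.fit(X_train, y_train); clf.predict(X_test)  # X_train, y_train, X_test assumed given
```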


Boosting
Freund and Schapire (1997)

AdaBoost (binary classification, $\mathcal{Y} = \{-1, +1\}$): set a number $M \in \mathbb{N}$ of iterations

  give each data point $i$ an initial weight $\omega_i^1 = \frac{1}{n}$
  for $m \in \{1, \dots, M\}$:
    learn a decision tree $g_m$ on the dataset with weights $(\omega_i^m)_{1 \le i \le n}$
    compute the weighted error of the model: $\epsilon_m = \sum_{i=1}^{n} \omega_i^m\, \mathbf{1}_{g_m(x^i) \ne y^i}$
    compute the confidence level of the model: $\alpha_m = \frac{1}{2} \ln\!\left(\frac{1 - \epsilon_m}{\epsilon_m}\right)$
    update the weights so that they focus on where $g_m$ makes mistakes:
    $$\omega_i^{m+1} = \frac{\omega_i^m \exp\{-\alpha_m\, y^i\, g_m(x^i)\}}{\sum_{\ell=1}^{n} \omega_\ell^m \exp\{-\alpha_m\, y^\ell\, g_m(x^\ell)\}}$$
  return the final decision function: $g : x \in \mathcal{X} \mapsto \sum_{m=1}^{M} \alpha_m\, g_m(x)$
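A sketch of this loop for labels in $\{-1, +1\}$, using scikit-learn decision stumps as the weak learners $g_m$; a small clip on $\epsilon_m$ is added to avoid dividing by zero, which the formulas above leave implicit.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, M=50):
    n = len(y)
    w = np.full(n, 1.0 / n)                       # initial weights w_i^1 = 1/n
    stumps, alphas = [], []
    for _ in range(M):
        g = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = g.predict(X)
        eps = np.clip(w[pred != y].sum(), 1e-12, 1 - 1e-12)  # weighted error
        alpha = 0.5 * np.log((1 - eps) / eps)     # confidence level
        w = w * np.exp(-alpha * y * pred)         # focus the weights on mistakes
        w /= w.sum()                              # normalise (the denominator above)
        stumps.append(g)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_decision(stumps, alphas, X):
    # Weighted vote; take np.sign(...) to get the predicted class in {-1, +1}
    return sum(a * g.predict(X) for g, a in zip(stumps, alphas))
```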


References

Chloé-Agathe Azencott. Introduction au Machine Learning. Collection InfoSup, Dunod, 2022.

Trevor Hastie, Robert Tibshirani, Jerome Friedman. The Elements of Statistical Learning. Springer Series in Statistics, 2009.
