
Insurance Analytics

Prof. Julien Trufin

Academic year 2020-2021

1
Tree-based methods : Bagging trees and random forests

Bagging trees and random forests

2
Tree-based methods : Bagging trees and random forests

Introduction
One issue with regression trees : their high variance.
→ High variability of the prediction $\hat{\mu}_D(x)$ over the trees trained from all possible training sets $D$.

Bagging trees and random forests aim to reduce the variance without altering the bias too much.

Principle of ensemble methods (based on randomization) : introducing random perturbations into the training procedure in order to get different models from a single training set $D$ and combining them to obtain the estimate of the ensemble.

Start with the average prediction $E_D[\hat{\mu}_D(x)]$.
→ Same bias as $\hat{\mu}_D(x)$ since
$$E_D\big[E_D[\hat{\mu}_D(x)]\big] = E_D[\hat{\mu}_D(x)].$$
→ Zero variance since
$$V_D\big[E_D[\hat{\mu}_D(x)]\big] = 0.$$

3
Tree-based methods : Bagging trees and random forests

Introduction
Assume we can draw as many training sets as we want, so that we have B training sets $D^1, D^2, \ldots, D^B$ available. An approximation of the average model is then given by
$$\widehat{E}_D[\hat{\mu}_D(x)] = \frac{1}{B}\sum_{b=1}^{B} \hat{\mu}_{D^b}(x).$$

The average of $\widehat{E}_D[\hat{\mu}_D(x)]$ with respect to $D^1, \ldots, D^B$ is the average prediction $E_D[\hat{\mu}_D(x)]$, that is,
$$E_{D^1,\ldots,D^B}\left[\frac{1}{B}\sum_{b=1}^{B}\hat{\mu}_{D^b}(x)\right] = \frac{1}{B}\sum_{b=1}^{B} E_{D^b}\left[\hat{\mu}_{D^b}(x)\right] = E_D[\hat{\mu}_D(x)],$$
while the variance of $\widehat{E}_D[\hat{\mu}_D(x)]$ with respect to $D^1, \ldots, D^B$ is given by (using the independence of $D^1, \ldots, D^B$)
$$V_{D^1,\ldots,D^B}\left[\frac{1}{B}\sum_{b=1}^{B}\hat{\mu}_{D^b}(x)\right] = \frac{1}{B^2}\, V_{D^1,\ldots,D^B}\left[\sum_{b=1}^{B}\hat{\mu}_{D^b}(x)\right] = \frac{1}{B^2}\sum_{b=1}^{B} V_{D^b}\left[\hat{\mu}_{D^b}(x)\right] = \frac{V_D[\hat{\mu}_D(x)]}{B}.$$

⇒ Compared to each individual estimate : bias unchanged and variance divided by B.
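As an illustration of this variance reduction, the following R sketch (a toy example with a hypothetical true frequency $\mu(x)$, not part of the course material) compares, by Monte Carlo, the variance of a single tree prediction at a point $x_0$ with the variance of an average of B trees fitted on independent training sets drawn from the same population.

library(rpart)

set.seed(1)
mu <- function(x) 0.1 + 0.05 * sin(2 * pi * x)      # hypothetical true claim frequency
simulate_D <- function(n = 500) {                    # draw one training set from the population
  x <- runif(n)
  data.frame(x = x, y = rpois(n, mu(x)))
}
fit_and_predict <- function(x0) {                    # fit a tree on a fresh training set, predict at x0
  predict(rpart(y ~ x, data = simulate_D()), newdata = data.frame(x = x0))
}

x0 <- 0.3   # point at which the prediction is studied
B  <- 25    # number of independent training sets per averaged prediction
R  <- 200   # Monte Carlo replications

single   <- replicate(R, fit_and_predict(x0))
averaged <- replicate(R, mean(replicate(B, fit_and_predict(x0))))

c(var_single = var(single), var_averaged = var(averaged), ratio = var(single) / var(averaged))
# The ratio should be close to B, while mean(single) and mean(averaged) should be comparable (same bias).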
4
Tree-based methods : Bagging trees and random forests

Introduction
In practice, the probability distribution from which the observations of the
training set are drawn is usually not known so that there is only one training
set available.

In this context, the bootstrap approach, used both in bagging trees and
random forests, appears to be particularly useful.

5
Tree-based methods : Bagging trees and random forests Bagging trees

Bagging trees
Bagging is one of the first ensemble methods proposed in the literature.

The probability distribution of the random vector $(Y, X)$ is approximated by its empirical version, which puts an equal probability $1/|\mathcal{I}|$ on each of the observations $\{(y_i, x_i);\ i \in \mathcal{I}\}$ of the training set $D$.

Instead of simulating B training sets $D^1, D^2, \ldots, D^B$ from $(Y, X)$, which is not possible in practice, the idea of bagging is to simulate B bootstrap samples $D^{*1}, D^{*2}, \ldots, D^{*B}$ of $D$ from its empirical counterpart.
→ A bootstrap sample of $D$ is obtained by simulating independently $|\mathcal{I}|$ observations from the empirical distribution of $(Y, X)$.
→ A bootstrap sample is thus a random sample of $D$ taken with replacement which has the same size as $D$.
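In R, a bootstrap sample can be drawn with a single call to sample(); a minimal sketch (assuming, as in the MTPL example later in this chapter, that the training data sit in a data frame called training.set):

n_obs    <- nrow(training.set)
boot_idx <- sample(seq_len(n_obs), size = n_obs, replace = TRUE)  # |I| draws with replacement
D_star   <- training.set[boot_idx, ]            # bootstrap sample D*b, same size as D
oob_idx  <- setdiff(seq_len(n_obs), boot_idx)   # observations left out of D*b (out-of-bag, see below)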

6
Tree-based methods : Bagging trees and random forests Bagging trees

Bagging trees
Let $D^{*1}, D^{*2}, \ldots, D^{*B}$ be B bootstrap samples of $D$.
→ For each $D^{*b}$, $b = 1, \ldots, B$, we fit our model, giving prediction $\hat{\mu}_{D,\Theta_b}(x) = \hat{\mu}_{D^{*b}}(x)$.

The bagging prediction is then defined by
$$\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x) = \frac{1}{B}\sum_{b=1}^{B} \hat{\mu}_{D,\Theta_b}(x),$$
where $\Theta = (\Theta_1, \ldots, \Theta_B)$.
→ The random vectors $\Theta_1, \ldots, \Theta_B$ fully capture the randomness of the training procedure.

7
Tree-based methods : Bagging trees and random forests Bagging trees

Algorithm

Algorithm : Bagging Trees.

For b = 1 to B do
  1. Generate a bootstrap sample $D^{*b}$ of $D$.
  2. Fit an unpruned tree on $D^{*b}$, which gives prediction $\hat{\mu}_{D,\Theta_b}(x)$.
End for
Output : $\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x) = \frac{1}{B}\sum_{b=1}^{B} \hat{\mu}_{D,\Theta_b}(x)$.
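A minimal R sketch of this algorithm for a Poisson claim-count response, using rpart trees on the MTPL features (the formula, the use of rpart's "poisson" method with exposure as the first response column, and the control settings are illustrative assumptions, not the course's reference implementation):

library(rpart)

bagging_trees <- function(data, B = 100) {
  lapply(seq_len(B), function(b) {
    boot <- data[sample(nrow(data), replace = TRUE), ]        # bootstrap sample D*b
    rpart(cbind(ExpoR, Nclaim) ~ AgePh + AgeCar + Fuel + Split +
            Cover + Gender + Use + PowerCat,
          data = boot, method = "poisson",                    # Poisson deviance splitting
          control = rpart.control(cp = 0, minbucket = 1000))  # grow without pruning (cp = 0)
  })
}

predict_bagging <- function(trees, newdata) {
  # bagging prediction: average of the B individual tree predictions (claim frequencies)
  Reduce(`+`, lapply(trees, predict, newdata = newdata)) / length(trees)
}

# Example use (assuming training.set as defined later in the chapter):
# bag_fit <- bagging_trees(training.set, B = 100)
# head(predict_bagging(bag_fit, training.set))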

8
Tree-based methods : Bagging trees and random forests Bagging trees

Bias
The bias is the same as the one of the individual sampled models. Indeed,
$$\begin{aligned}
\mathrm{Bias}(x) &= \mu(x) - E_{D,\Theta}\left[\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)\right] \\
&= \mu(x) - E_{D,\Theta_1,\ldots,\Theta_B}\left[\frac{1}{B}\sum_{b=1}^{B}\hat{\mu}_{D,\Theta_b}(x)\right] \\
&= \mu(x) - \frac{1}{B}\sum_{b=1}^{B} E_{D,\Theta_b}\left[\hat{\mu}_{D,\Theta_b}(x)\right] \\
&= \mu(x) - E_{D,\Theta_b}\left[\hat{\mu}_{D,\Theta_b}(x)\right]
\end{aligned}$$
since the predictions $\hat{\mu}_{D,\Theta_1}(x), \ldots, \hat{\mu}_{D,\Theta_B}(x)$ are identically distributed.

However, the bias of $\hat{\mu}_{D,\Theta_b}(x)$ is typically greater in absolute terms than the bias of $\hat{\mu}_D(x)$ fitted on $D$, since the reduced sample $D^{*b}$ imposes restrictions.
⇒ The improvements in the estimation obtained by bagging will be a consequence of variance reduction.

9
Tree-based methods : Bagging trees and random forests Bagging trees

Variance
The variance of $\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)$ can be written as
$$\begin{aligned}
V_{D,\Theta}\left[\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)\right]
&= V_{D,\Theta_1,\ldots,\Theta_B}\left[\frac{1}{B}\sum_{b=1}^{B}\hat{\mu}_{D,\Theta_b}(x)\right] \\
&= \frac{1}{B^2}\, V_{D,\Theta_1,\ldots,\Theta_B}\left[\sum_{b=1}^{B}\hat{\mu}_{D,\Theta_b}(x)\right] \\
&= \frac{1}{B^2}\left\{ V_D\left[E_{\Theta_1,\ldots,\Theta_B}\left[\left.\sum_{b=1}^{B}\hat{\mu}_{D,\Theta_b}(x)\,\right|\,D\right]\right] + E_D\left[V_{\Theta_1,\ldots,\Theta_B}\left[\left.\sum_{b=1}^{B}\hat{\mu}_{D,\Theta_b}(x)\,\right|\,D\right]\right]\right\} \\
&= V_D\left[E_{\Theta_b}\left[\hat{\mu}_{D,\Theta_b}(x)\,\middle|\,D\right]\right] + \frac{1}{B}\, E_D\left[V_{\Theta_b}\left[\hat{\mu}_{D,\Theta_b}(x)\,\middle|\,D\right]\right]
\end{aligned}$$
since, conditionally on $D$, $\hat{\mu}_{D,\Theta_1}(x), \ldots, \hat{\mu}_{D,\Theta_B}(x)$ are i.i.d.
→ The first term is the sampling variance of the bagging ensemble (a result of the sampling variability of $D$).
→ The second term is the within-$D$ variance (a result of the randomization due to the bootstrap sampling). As B increases, this second term disappears.
10
Tree-based methods : Bagging trees and random forests Bagging trees

Variance
Observation :
$$V_{D,\Theta}\left[\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)\right] \leq V_{D,\Theta_b}\left[\hat{\mu}_{D,\Theta_b}(x)\right].$$

Let $\rho(x)$ be the Pearson correlation coefficient between any pair of predictions used in the averaging, which are built on the same training set but fitted on two different bootstrap samples :
$$\rho(x) = \frac{\mathrm{Cov}_{D,\Theta_b,\Theta_{b'}}\left[\hat{\mu}_{D,\Theta_b}(x),\, \hat{\mu}_{D,\Theta_{b'}}(x)\right]}{\sqrt{V_{D,\Theta_b}\left[\hat{\mu}_{D,\Theta_b}(x)\right]}\,\sqrt{V_{D,\Theta_{b'}}\left[\hat{\mu}_{D,\Theta_{b'}}(x)\right]}} = \frac{\mathrm{Cov}_{D,\Theta_b,\Theta_{b'}}\left[\hat{\mu}_{D,\Theta_b}(x),\, \hat{\mu}_{D,\Theta_{b'}}(x)\right]}{V_{D,\Theta_b}\left[\hat{\mu}_{D,\Theta_b}(x)\right]}$$
as $\hat{\mu}_{D,\Theta_b}(x)$ and $\hat{\mu}_{D,\Theta_{b'}}(x)$ are identically distributed.

By the law of total covariance :
$$\begin{aligned}
\mathrm{Cov}_{D,\Theta_b,\Theta_{b'}}\left[\hat{\mu}_{D,\Theta_b}(x),\, \hat{\mu}_{D,\Theta_{b'}}(x)\right]
&= E_D\left[\mathrm{Cov}_{\Theta_b,\Theta_{b'}}\left[\hat{\mu}_{D,\Theta_b}(x),\, \hat{\mu}_{D,\Theta_{b'}}(x)\,\middle|\,D\right]\right] \\
&\quad + \mathrm{Cov}_D\left[E_{\Theta_b}\left[\hat{\mu}_{D,\Theta_b}(x)\,\middle|\,D\right],\, E_{\Theta_{b'}}\left[\hat{\mu}_{D,\Theta_{b'}}(x)\,\middle|\,D\right]\right] \\
&= V_D\left[E_{\Theta_b}\left[\hat{\mu}_{D,\Theta_b}(x)\,\middle|\,D\right]\right]
\end{aligned}$$
since, conditionally on $D$, $\hat{\mu}_{D,\Theta_b}(x)$ and $\hat{\mu}_{D,\Theta_{b'}}(x)$ are i.i.d.
11
Tree-based methods : Bagging trees and random forests Bagging trees

Variance
$\rho(x)$ becomes
$$\rho(x) = \frac{V_D\left[E_{\Theta_b}\left[\hat{\mu}_{D,\Theta_b}(x)\,\middle|\,D\right]\right]}{V_{D,\Theta_b}\left[\hat{\mu}_{D,\Theta_b}(x)\right]} = \frac{V_D\left[E_{\Theta_b}\left[\hat{\mu}_{D,\Theta_b}(x)\,\middle|\,D\right]\right]}{V_D\left[E_{\Theta_b}\left[\hat{\mu}_{D,\Theta_b}(x)\,\middle|\,D\right]\right] + E_D\left[V_{\Theta_b}\left[\hat{\mu}_{D,\Theta_b}(x)\,\middle|\,D\right]\right]}.$$

$\rho(x)$ measures the correlation between a pair of predictions in the ensemble induced by repeatedly making training sample draws $D$ from the population and then drawing a pair of bootstrap samples from $D$.
→ When $\rho(x)$ is close to 1, the predictions are highly correlated, suggesting that the randomization due to the bootstrap sampling has no significant effect on the predictions. The total variance is mostly driven by the training set.
→ When $\rho(x)$ is close to 0, the predictions are uncorrelated, suggesting that the randomization due to the bootstrap sampling has a strong impact on the predictions. The total variance is mostly due to the randomization induced by the bootstrap samples.

12
Tree-based methods : Bagging trees and random forests Bagging trees

Variance
Alternatively,
$$V_D\left[E_{\Theta_b}\left[\hat{\mu}_{D,\Theta_b}(x)\,\middle|\,D\right]\right] = \rho(x)\, V_{D,\Theta_b}\left[\hat{\mu}_{D,\Theta_b}(x)\right]$$
and
$$E_D\left[V_{\Theta_b}\left[\hat{\mu}_{D,\Theta_b}(x)\,\middle|\,D\right]\right] = \big(1 - \rho(x)\big)\, V_{D,\Theta_b}\left[\hat{\mu}_{D,\Theta_b}(x)\right],$$
so that
$$V_{D,\Theta}\left[\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)\right] = V_D\left[E_{\Theta_b}\left[\hat{\mu}_{D,\Theta_b}(x)\,\middle|\,D\right]\right] + \frac{1}{B}\, E_D\left[V_{\Theta_b}\left[\hat{\mu}_{D,\Theta_b}(x)\,\middle|\,D\right]\right] = \rho(x)\, V_{D,\Theta_b}\left[\hat{\mu}_{D,\Theta_b}(x)\right] + \frac{1-\rho(x)}{B}\, V_{D,\Theta_b}\left[\hat{\mu}_{D,\Theta_b}(x)\right].$$

→ As B increases, the second term disappears.
→ When $\rho(x) < 1$ (which means that the randomization due to the bootstrap sampling influences the individual predictions), one sees that the variance of the ensemble is strictly smaller than the variance of an individual model.
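A small numerical illustration of this decomposition (the numbers are illustrative, not taken from the course): with $\rho(x) = 0.5$, $V_{D,\Theta_b}[\hat{\mu}_{D,\Theta_b}(x)] = v$ and $B = 100$ trees,
$$V_{D,\Theta}\left[\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)\right] = 0.5\, v + \frac{1 - 0.5}{100}\, v = 0.505\, v,$$
so the ensemble variance is already essentially at its limit $\rho(x)\, v = 0.5\, v$ : adding more trees cannot push it below this floor, which is why random forests then try to reduce $\rho(x)$ itself.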

13
Tree-based methods : Bagging trees and random forests Bagging trees

Variance
Notice that
$$V_{D,\Theta_b}\left[\hat{\mu}_{D,\Theta_b}(x)\right] \geq V_D\left[\hat{\mu}_D(x)\right].$$
⇒ Bagging averages models with higher variances.

Nevertheless, $\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)$ generally has a smaller variance than $\hat{\mu}_D(x)$.
→ Typically, $\rho(x)$ compensates for the variance increase
$$V_{D,\Theta_b}\left[\hat{\mu}_{D,\Theta_b}(x)\right] - V_D\left[\hat{\mu}_D(x)\right],$$
so that the combined effect of $\rho(x) < 1$ and $V_{D,\Theta_b}\left[\hat{\mu}_{D,\Theta_b}(x)\right] \geq V_D\left[\hat{\mu}_D(x)\right]$ often leads to a variance reduction
$$V_D\left[\hat{\mu}_D(x)\right] - \rho(x)\, V_{D,\Theta_b}\left[\hat{\mu}_{D,\Theta_b}(x)\right]$$
that is positive.
→ Because of their high variance, regression trees are very likely to benefit from the averaging procedure.

14
Tree-based methods : Bagging trees and random forests Bagging trees

Expected generalization error : Squared error loss


We have
$$E_{D,\Theta}\left[\mathrm{Err}\left(\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)\right)\right] = \mathrm{Err}(\mu(x)) + \left(\mu(x) - E_{D,\Theta}\left[\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)\right]\right)^2 + V_{D,\Theta}\left[\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)\right].$$

The bias remains unchanged while the variance decreases compared to the individual prediction $\hat{\mu}_{D,\Theta_b}(x)$, so that we get
$$\begin{aligned}
E_{D,\Theta}\left[\mathrm{Err}\left(\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)\right)\right]
&= \mathrm{Err}(\mu(x)) + \left(\mu(x) - E_{D,\Theta_b}\left[\hat{\mu}_{D,\Theta_b}(x)\right]\right)^2 + V_{D,\Theta}\left[\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)\right] \\
&\leq \mathrm{Err}(\mu(x)) + \left(\mu(x) - E_{D,\Theta_b}\left[\hat{\mu}_{D,\Theta_b}(x)\right]\right)^2 + V_{D,\Theta_b}\left[\hat{\mu}_{D,\Theta_b}(x)\right] \\
&= E_{D,\Theta_b}\left[\mathrm{Err}\left(\hat{\mu}_{D,\Theta_b}(x)\right)\right].
\end{aligned}$$

For every value of $X$, the expected generalization error of the ensemble is smaller than the expected generalization error of an individual model. Taking the average over $X$ leads to
$$E_{D,\Theta}\left[\mathrm{Err}\left(\hat{\mu}^{\mathrm{bag}}_{D,\Theta}\right)\right] \leq E_{D,\Theta_b}\left[\mathrm{Err}\left(\hat{\mu}_{D,\Theta_b}\right)\right].$$

15
Tree-based methods : Bagging trees and random forests Bagging trees

Expected generalization error : Poisson deviance loss


We have
$$E_{D,\Theta}\left[\mathrm{Err}\left(\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)\right)\right] = \mathrm{Err}(\mu(x)) + E_{D,\Theta}\left[E^P\left(\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)\right)\right]$$
with
$$E_{D,\Theta}\left[E^P\left(\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)\right)\right] = 2\mu(x)\left(E_{D,\Theta}\left[\frac{\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)}{\mu(x)}\right] - 1 - E_{D,\Theta}\left[\ln\left(\frac{\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)}{\mu(x)}\right)\right]\right).$$

Now,
$$E_{D,\Theta}\left[\frac{\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)}{\mu(x)}\right] = E_{D,\Theta_b}\left[\frac{\hat{\mu}_{D,\Theta_b}(x)}{\mu(x)}\right],$$
so that
$$\begin{aligned}
E_{D,\Theta}\left[E^P\left(\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)\right)\right]
&= 2\mu(x)\left(E_{D,\Theta_b}\left[\frac{\hat{\mu}_{D,\Theta_b}(x)}{\mu(x)}\right] - 1 - E_{D,\Theta_b}\left[\ln\left(\frac{\hat{\mu}_{D,\Theta_b}(x)}{\mu(x)}\right)\right]\right) \\
&\quad - 2\mu(x)\left(E_{D,\Theta}\left[\ln\left(\frac{\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)}{\mu(x)}\right)\right] - E_{D,\Theta_b}\left[\ln\left(\frac{\hat{\mu}_{D,\Theta_b}(x)}{\mu(x)}\right)\right]\right) \\
&= E_{D,\Theta_b}\left[E^P\left(\hat{\mu}_{D,\Theta_b}(x)\right)\right] \\
&\quad - 2\mu(x)\left(E_{D,\Theta}\left[\ln\left(\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)\right)\right] - E_{D,\Theta_b}\left[\ln\left(\hat{\mu}_{D,\Theta_b}(x)\right)\right]\right).
\end{aligned}$$
16
Tree-based methods : Bagging trees and random forests Bagging trees

Expected generalization error : Poisson deviance loss


Jensen's inequality implies
$$\begin{aligned}
E_{D,\Theta}\left[\ln\left(\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)\right)\right] - E_{D,\Theta_b}\left[\ln\left(\hat{\mu}_{D,\Theta_b}(x)\right)\right]
&= E_{D,\Theta_1,\ldots,\Theta_B}\left[\ln\left(\frac{1}{B}\sum_{b=1}^{B}\hat{\mu}_{D,\Theta_b}(x)\right)\right] - E_{D,\Theta_b}\left[\ln\left(\hat{\mu}_{D,\Theta_b}(x)\right)\right] \\
&\geq E_{D,\Theta_1,\ldots,\Theta_B}\left[\frac{1}{B}\sum_{b=1}^{B}\ln\left(\hat{\mu}_{D,\Theta_b}(x)\right)\right] - E_{D,\Theta_b}\left[\ln\left(\hat{\mu}_{D,\Theta_b}(x)\right)\right] \\
&= 0,
\end{aligned}$$
so that
$$E_{D,\Theta}\left[E^P\left(\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)\right)\right] \leq E_{D,\Theta_b}\left[E^P\left(\hat{\mu}_{D,\Theta_b}(x)\right)\right]$$
and hence
$$E_{D,\Theta}\left[\mathrm{Err}\left(\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)\right)\right] \leq E_{D,\Theta_b}\left[\mathrm{Err}\left(\hat{\mu}_{D,\Theta_b}(x)\right)\right].$$

For every value of $X$, the expected generalization error of the ensemble is smaller than the expected generalization error of an individual model. Taking the average over $X$ leads to
$$E_{D,\Theta}\left[\mathrm{Err}\left(\hat{\mu}^{\mathrm{bag}}_{D,\Theta}\right)\right] \leq E_{D,\Theta_b}\left[\mathrm{Err}\left(\hat{\mu}_{D,\Theta_b}\right)\right].$$
17
Tree-based methods : Bagging trees and random forests Random forests

Random forests
The procedure called random forests is a modification of bagging trees.
→ It produces a collection of trees that are more de-correlated.

The variance of the bagging prediction can be expressed as
$$V_{D,\Theta}\left[\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)\right] = \rho(x)\, V_{D,\Theta_b}\left[\hat{\mu}_{D,\Theta_b}(x)\right] + \frac{1-\rho(x)}{B}\, V_{D,\Theta_b}\left[\hat{\mu}_{D,\Theta_b}(x)\right].$$
→ As B increases, the second term disappears.
→ $\rho(x)$ in the first term limits the effect of averaging.

The idea of random forests : reducing $\rho(x)$ without increasing $V_{D,\Theta_b}\left[\hat{\mu}_{D,\Theta_b}(x)\right]$ too much.
→ Reducing correlation among trees can be achieved by adding randomness to the training procedure.

Random forests are a combination of bagging with random feature selection at each node.
→ Specifically, m (≤ p) features are selected at random before each split and used as candidates for splitting.
18
Tree-based methods : Bagging trees and random forests Random forests

Random forests
The random forest prediction writes
$$\hat{\mu}^{\mathrm{rf}}_{D,\Theta}(x) = \frac{1}{B}\sum_{b=1}^{B} \hat{\mu}_{D,\Theta_b}(x),$$
where $\hat{\mu}_{D,\Theta_b}(x)$ denotes the prediction at point $x$ for the bth random forest tree.
→ $\Theta_1, \ldots, \Theta_B$ capture the randomness of the bootstrap sampling and the additional randomness due to the random selection of m features before each split.
→ m is a tuning parameter. Typical value of m is $\lfloor p/3 \rfloor$.

19
Tree-based methods : Bagging trees and random forests Random forests

Algorithm

Algorithm : Random Forests.

For b = 1 to B do
  1. Generate a bootstrap sample $D^{*b}$ of $D$.
  2. Fit a tree on $D^{*b}$ :
     For each node t do
       (2.1) Select m (≤ p) features at random from the p original features.
       (2.2) Pick the best feature among the m.
       (2.3) Split the node into two daughter nodes.
     End for
     This gives prediction $\hat{\mu}_{D,\Theta_b}(x)$ (use typical tree stopping criteria, but do not prune).
End for
Output : $\hat{\mu}^{\mathrm{rf}}_{D,\Theta}(x) = \frac{1}{B}\sum_{b=1}^{B} \hat{\mu}_{D,\Theta_b}(x)$.
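For a quick illustration, the same idea with random feature selection is available off the shelf in the randomForest package; a minimal sketch (it uses least-squares splitting on the crude rate Nclaim/ExpoR, which is only a rough stand-in for the Poisson-deviance forests of the rfCountData package used in the MTPL example below):

library(randomForest)

features <- c("AgePh", "AgeCar", "Fuel", "Split", "Cover", "Gender", "Use", "PowerCat")

rf_sketch <- randomForest(x = training.set[, features],
                          y = training.set$Nclaim / training.set$ExpoR,  # crude claim rate
                          ntree = 500,       # B
                          mtry = 3,          # m features drawn at random before each split
                          nodesize = 5000)   # minimum size of terminal nodes (no pruning)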

20
Tree-based methods : Bagging trees and random forests Random forests

Computational costs
Computational costs and memory requirements increase as the number of bootstrap samples increases.
→ However, this can be mitigated with parallel computing : each bootstrap sample and the corresponding tree are independent of every other sample and tree, so the trees can be fitted simultaneously on separate cores.

21
Tree-based methods : Bagging trees and random forests Out-of-bag estimate

Out-of-bag estimate
For each observation $(y_i, x_i)$ of the training set $D$, an out-of-bag prediction can be constructed by averaging only the trees corresponding to bootstrap samples $D^{*b}$ in which $(y_i, x_i)$ does not appear.

The out-of-bag prediction for observation $(y_i, x_i)$ is thus given by
$$\hat{\mu}^{\mathrm{oob}}_{D,\Theta}(x_i) = \frac{1}{\sum_{b=1}^{B} I\left[(y_i, x_i) \notin D^{*b}\right]}\sum_{b=1}^{B} \hat{\mu}_{D,\Theta_b}(x_i)\, I\left[(y_i, x_i) \notin D^{*b}\right].$$

The generalization error of $\hat{\mu}_{D,\Theta}$ can be estimated by
$$\widehat{\mathrm{Err}}^{\mathrm{oob}}(\hat{\mu}_{D,\Theta}) = \frac{1}{|\mathcal{I}|}\sum_{i \in \mathcal{I}} L\left(y_i,\, \hat{\mu}^{\mathrm{oob}}_{D,\Theta}(x_i)\right),$$
which is called the out-of-bag estimate of the generalization error.

$\widehat{\mathrm{Err}}^{\mathrm{oob}}(\hat{\mu}_{D,\Theta})$ is almost identical to the $|\mathcal{I}|$-fold cross-validation estimate.
→ However, it does not require fitting new trees, so that bagging trees and random forests can be fit in one sequence : we stop adding new trees when the out-of-bag estimate of the generalization error stabilizes.
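A sketch of the out-of-bag computation for a hand-rolled bagging ensemble in which each fitted tree is stored together with its bootstrap row indices (illustrative code in the spirit of the sketches above, not the rfCountData implementation):

library(rpart)

fit_bagging <- function(data, B = 100) {
  lapply(seq_len(B), function(b) {
    idx  <- sample(nrow(data), replace = TRUE)                  # bootstrap indices of D*b
    tree <- rpart(cbind(ExpoR, Nclaim) ~ AgePh + AgeCar + Fuel + Split +
                    Cover + Gender + Use + PowerCat,
                  data = data[idx, ], method = "poisson",
                  control = rpart.control(cp = 0, minbucket = 1000))
    list(tree = tree, in_bag = unique(idx))
  })
}

oob_predict <- function(fits, data) {
  num <- rep(0, nrow(data))   # running sum of predictions from trees for which i is out-of-bag
  den <- rep(0, nrow(data))   # number of such trees for each observation
  for (f in fits) {
    oob <- setdiff(seq_len(nrow(data)), f$in_bag)
    num[oob] <- num[oob] + predict(f$tree, newdata = data[oob, ])
    den[oob] <- den[oob] + 1
  }
  num / den   # NA only if an observation appears in every bootstrap sample (rare for large B)
}

# Example use:
# fits <- fit_bagging(training.set, B = 100)
# mu_oob <- training.set$ExpoR * oob_predict(fits, training.set)  # expected claim counts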
22
Tree-based methods : Bagging trees and random forests Interpretability

Interpretability : Relative importances


A bagged model is less interpretable than a model that is not bagged.
→ A bagged tree is no longer a tree.

However, a measure of variable importance can be constructed by combining measures of importance from the bootstrap trees.
→ For the bth tree in the ensemble, denoted $T_b$, the relative importance of feature $x_j$ is the sum of the deviance reductions $\Delta D_{\chi_t}$ over the non-terminal nodes $t \in \widetilde{T}(T_b)(x_j)$ (i.e. the non-terminal nodes $t$ of $T_b$ for which $x_j$ was selected as the splitting feature) :
$$I_b(x_j) = \sum_{t \in \widetilde{T}(T_b)(x_j)} \Delta D_{\chi_t}.$$
→ The relative importance of feature $x_j$ is naturally extended by averaging $I_b(x_j)$ over the collection of trees :
$$I(x_j) = \frac{1}{B}\sum_{b=1}^{B} I_b(x_j).$$
→ Normalization : the relative importances are normalized such that their sum equals 100.
⇒ Any individual number can then be interpreted as the percentage contribution to the overall model.
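A sketch of this averaging for a list of rpart trees, as produced by the bagging sketches above (it relies on rpart's $variable.importance, which also credits surrogate splits and is therefore only an approximation of the primary-split deviance reductions described here):

relative_importance <- function(trees, features) {
  imp <- sapply(trees, function(tree) {
    vi <- tree$variable.importance                # named vector of importances I_b(x_j)
    vi[setdiff(features, names(vi))] <- 0         # features never used by this tree get 0
    vi[features]
  })
  avg <- rowMeans(imp)                            # average over the B trees
  100 * avg / sum(avg)                            # normalize so that the importances sum to 100
}

# Example use with the hand-rolled ensemble above:
# relative_importance(lapply(fits, `[[`, "tree"),
#                     c("AgePh", "AgeCar", "Fuel", "Split", "Cover", "Gender", "Use", "PowerCat"))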

23
Tree-based methods : Bagging trees and random forests Interpretability

Interpretability : Relative importances (by permutation)


An alternative way to compute variable importances is the following.

Some observations $(y_i, x_i)$ of the training set $D$ do not appear in the bootstrap sample $D^{*b}$. They are called the out-of-bag observations for the bth tree.

These observations enable us to assess the predictive accuracy of $\hat{\mu}_{D,\Theta_b}$, that is,
$$\widehat{\mathrm{Err}}(\hat{\mu}_{D,\Theta_b}) = \frac{1}{|\mathcal{I}\setminus\mathcal{I}^{*b}|}\sum_{i \in \mathcal{I}\setminus\mathcal{I}^{*b}} L\left(y_i,\, \hat{\mu}_{D,\Theta_b}(x_i)\right),$$
where $\mathcal{I}^{*b}$ labels the observations in $D^{*b}$.

The categories of feature $x_j$ are then randomly permuted in the out-of-bag observations, so that we get perturbed observations $(y_i, x_i^{\mathrm{perm}(j)})$, $i \in \mathcal{I}\setminus\mathcal{I}^{*b}$, and the predictive accuracy of $\hat{\mu}_{D,\Theta_b}$ is again computed as
$$\widehat{\mathrm{Err}}^{\mathrm{perm}(j)}(\hat{\mu}_{D,\Theta_b}) = \frac{1}{|\mathcal{I}\setminus\mathcal{I}^{*b}|}\sum_{i \in \mathcal{I}\setminus\mathcal{I}^{*b}} L\left(y_i,\, \hat{\mu}_{D,\Theta_b}(x_i^{\mathrm{perm}(j)})\right).$$

24
Tree-based methods : Bagging trees and random forests Interpretability

Interpretability : Relative importances (by permutation)


The decrease in predictive accuracy due to this permuting is averaged over all trees and used as a measure of importance for feature $x_j$ in the ensemble, that is,
$$I(x_j) = \frac{1}{B}\sum_{b=1}^{B}\left(\widehat{\mathrm{Err}}^{\mathrm{perm}(j)}(\hat{\mu}_{D,\Theta_b}) - \widehat{\mathrm{Err}}(\hat{\mu}_{D,\Theta_b})\right).$$

These importances can be normalized to improve their readability.

A feature is important if randomly permuting its values decreases the model accuracy : in that case, the model relies on this feature for the prediction.
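A sketch of this permutation importance for one tree and one feature, built on the hand-rolled ensemble above and using the Poisson deviance as the loss L (the poisson_dev helper and the structure of each fit are assumptions of these sketches, not rfCountData internals):

poisson_dev <- function(y, mu) {
  # mean Poisson deviance between observed counts y and expected counts mu
  2 * mean(ifelse(y == 0, 0, y * log(y / mu)) - (y - mu))
}

perm_importance_one <- function(fit, data, feature) {
  oob <- data[setdiff(seq_len(nrow(data)), fit$in_bag), ]      # out-of-bag observations of tree b
  err <- poisson_dev(oob$Nclaim, oob$ExpoR * predict(fit$tree, newdata = oob))
  oob_perm <- oob
  oob_perm[[feature]] <- sample(oob_perm[[feature]])           # randomly permute feature x_j
  err_perm <- poisson_dev(oob_perm$Nclaim,
                          oob_perm$ExpoR * predict(fit$tree, newdata = oob_perm))
  err_perm - err      # increase in deviance = decrease in predictive accuracy
}

# Ensemble importance of, say, AgePh: average over the B trees
# I_AgePh <- mean(sapply(fits, perm_importance_one, data = training.set, feature = "AgePh"))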

25
Tree-based methods : Bagging trees and random forests Interpretability

Partial dependence plots


Consider the subvector $x_S$ of $l < p$ of the features $x = (x_1, x_2, \ldots, x_p)$, indexed by $S \subset \{1, 2, \ldots, p\}$. Let $x_{\bar{S}}$ be the complement subvector such that $x_S \cup x_{\bar{S}} = x$.

In principle, $\hat{\mu}(x)$ depends on features in $x_S$ and $x_{\bar{S}}$, so that we can write $\hat{\mu}(x) = \hat{\mu}(x_S, x_{\bar{S}})$.

If one conditions on specific values for the features in $x_{\bar{S}}$, then $\hat{\mu}(x)$ can be seen as a function of the features in $x_S$. The partial dependence of $\hat{\mu}(x)$ on $x_S$ is given by
$$\hat{\mu}_S(x_S) = E_{X_{\bar{S}}}\left[\hat{\mu}(x_S, X_{\bar{S}})\right].$$

The partial dependence function $\hat{\mu}_S(x_S)$ can be estimated from the training set by
$$\frac{1}{|\mathcal{I}|}\sum_{i \in \mathcal{I}} \hat{\mu}(x_S, x_{i\bar{S}}),$$
where $\{x_{i\bar{S}},\, i \in \mathcal{I}\}$ are the values of $X_{\bar{S}}$ in the training set.
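A sketch of this estimator for a single numeric feature, looping over a grid of values of $x_S$ and averaging the model prediction with all other features kept at their observed values (the model, data frame and feature names are placeholders):

partial_dependence <- function(model, data, feature, grid) {
  sapply(grid, function(v) {
    data_v <- data
    data_v[[feature]] <- v                    # set x_S = v for every observation
    mean(predict(model, newdata = data_v))    # average over the empirical distribution of the other features
  })
}

# Example use, e.g. for AgePh with a fitted claim-frequency model 'model':
# pd_age <- partial_dependence(model, training.set, "AgePh", grid = 18:90)
# plot(18:90, pd_age, type = "l", xlab = "AgePh", ylab = "Partial dependence")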


26
Tree-based methods : Bagging trees and random forests Interpretability

Example MTPL
Data set :
→ MTPL insurance portfolio of a Belgian insurance company observed during
one year.
→ Description of data set :
> str(data) # description of the dataset
’data.frame’: 160944 obs. of 10 variables:
$ AgePh : int 50 64 60 77 28 26 26 58 59 57 ...
$ AgeCar : int 12 3 10 15 7 12 8 14 6 10 ...
$ Fuel : Factor w/ 2 levels "Diesel","Gasoline": 2 2 1 2 2 2 2 2 2 2 ...
$ Split : Factor w/ 4 levels "Half-Yearly",..: 2 4 4 4 1 3 1 3 1 1 ...
$ Cover : Factor w/ 3 levels "Comprehensive",..: 3 2 3 3 3 3 1 3 2 2 ...
$ Gender : Factor w/ 2 levels "Female","Male": 1 2 1 1 2 1 1 2 2 2 ...
$ Use : Factor w/ 2 levels "Private","Professional": 1 1 1 1 1 1 1 1 1 1 ...
$ PowerCat: Factor w/ 5 levels "C1","C2","C3",..: 2 2 2 2 2 2 2 2 1 1 ...
$ ExpoR : num 1 1 1 1 0.0466 ...
$ Nclaim : int 1 0 0 0 1 0 1 0 0 0 ...

27
Tree-based methods : Bagging trees and random forests Interpretability

Example MTPL
Data set :
→ The data set comprises 160 944 insurance policies.
→ For each policy, we have 8 features :
- AgePh : policyholder’s age ;
- AgeCar : age of the car ;
- Fuel : fuel of the car, with two categories (gas or diesel) ;
- Split : splitting of the premium, with four categories (annually, semi-annually,
quarterly or monthly) ;
- Cover : extent of the coverage, with three categories (from compulsory
third-party liability cover to comprehensive) ;
- Gender : policyholder’s gender, with two categories (female or male) ;
- Use : use of the car, with two categories (private or professional) ;
- PowerCat : the engine’s power, with five categories.
→ For each policy, we have the number of claims (Nclaim), which is the response, and exposure information (the exposure-to-risk ExpoR, expressed in years).

28
Tree-based methods : Bagging trees and random forests Interpretability

Example MTPL
Data set :
> head(data,10) # 10 first observations
AgePh AgeCar Fuel Split Cover Gender Use PowerCat ExpoR Nclaim
1 50 12 Gasoline Monthly TPL.Only Female Private C2 1.00000000 1
2 64 3 Gasoline Yearly Limited.MD Male Private C2 1.00000000 0
3 60 10 Diesel Yearly TPL.Only Female Private C2 1.00000000 0
4 77 15 Gasoline Yearly TPL.Only Female Private C2 1.00000000 0
5 28 7 Gasoline Half-Yearly TPL.Only Male Private C2 0.04657534 1
6 26 12 Gasoline Quarterly TPL.Only Female Private C2 1.00000000 0
7 26 8 Gasoline Half-Yearly Comprehensive Female Private C2 1.00000000 1
8 58 14 Gasoline Quarterly TPL.Only Male Private C2 0.40273973 0
9 59 6 Gasoline Half-Yearly Limited.MD Male Private C1 1.00000000 0
10 57 10 Gasoline Half-Yearly Limited.MD Male Private C1 1.00000000 0

29
Tree-based methods : Bagging trees and random forests Interpretability

Example MTPL
Descriptive statistics of the data :

[Figure: number of policies by exposure-to-risk (in months), from 1 to 12.]

30
Tree-based methods : Bagging trees and random forests Interpretability

Example MTPL
Descriptive statistics of the data :
[Figure: total exposure (left) and observed claim frequency (right) by Gender.]

31
Tree-based methods : Bagging trees and random forests Interpretability

Example MTPL
Descriptive statistics of the data :
[Figure: total exposure (left) and observed claim frequency (right) by Fuel.]

32
Tree-based methods : Bagging trees and random forests Interpretability

Example MTPL
Descriptive statistics of the data :

[Figure: total exposure (left) and observed claim frequency (right) by Use.]

33
Tree-based methods : Bagging trees and random forests Interpretability

Example MTPL
Descriptive statistics of the data :
[Figure: total exposure (left) and observed claim frequency (right) by Cover.]

34
Tree-based methods : Bagging trees and random forests Interpretability

Example MTPL
Descriptive statistics of the data :
[Figure: total exposure (left) and observed claim frequency (right) by Split.]

35
Tree-based methods : Bagging trees and random forests Interpretability

Example MTPL
Descriptive statistics of the data :
[Figure: total exposure (left) and observed claim frequency (right) by PowerCat.]

36
Tree-based methods : Bagging trees and random forests Interpretability

Example MTPL
Descriptive statistics of the data :

[Figure: total exposure (left) and observed claim frequency (right) by AgeCar.]

37
Tree-based methods : Bagging trees and random forests Interpretability

Example MTPL
Descriptive statistics of the data :
[Figure: total exposure (left) and observed claim frequency (right) by AgePh.]

38
Tree-based methods : Bagging trees and random forests Interpretability

Example MTPL
Training set and validation set :
→ Training set : 80% of the data set.
→ Validation set : 20% of the data set.
> library(caret)
> inValidation = createDataPartition(data$Nclaim, p=0.2, list=FALSE)
> validation.set = data[inValidation,]
> training.set = data[-inValidation,]

39
Tree-based methods : Bagging trees and random forests Interpretability

Example MTPL
R packages :
→ ipred (for bagging).
→ randomForest.
→ rfCountData.

rfPoisson() (bagging() : for bagging only).


→ Description :
 Fit a RF model (Breiman’s algorithm) on count data, by using the Poisson
deviance.
→ Usage :
 rfPoisson(x, offset, y, xtest, offsettest, ytest,
ntree, mtry, nodesize, importance=TRUE, keep.forest = TRUE, ...)

40
Tree-based methods : Bagging trees and random forests Interpretability

Example MTPL
Tuning the model :
library(doParallel)    # also attaches foreach and parallel (makeCluster, clusterCall, %dopar%)
library(rfCountData)

set.seed(87)
folds = createFolds(training.set$Nclaim, k = 5, list = TRUE)

grid.param = expand.grid(fold = 1:5,
                         mtry. = seq(from = 8, to = 1, by = -1),
                         nodesize. = c(500, 1000, 5000, 10000))

cl <- makeCluster(12) # Number of nodes for parallel computing
registerDoParallel(cl)
clusterCall(cl, function() library(rfCountData)) # Export package to nodes
set.seed(64, kind = "L'Ecuyer-CMRG")
res = foreach(i = 1:nrow(grid.param), .packages = "rfCountData") %dopar% {
  X = folds[[grid.param[i,]$fold]] # Current fold (-> test set)
  rfPoisson(x = training.set[-X, !names(training.set) %in% c("Nclaim", "ExpoR")],
            offset = log(training.set[-X,]$ExpoR),
            y = training.set[-X,]$Nclaim,
            xtest = training.set[X, !names(training.set) %in% c("Nclaim", "ExpoR")],
            offsettest = log(training.set[X,]$ExpoR),
            ytest = training.set[X,]$Nclaim,
            ntree = 2000,
            mtry = grid.param[i,]$mtry., # Current mtry
            nodesize = grid.param[i,]$nodesize., # Current nodesize
            keep.forest = TRUE)
}
stopCluster(cl)
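The grid-search results can then be summarized, for example as follows (a sketch: it recomputes the Poisson deviance of each fitted forest on its held-out fold via predict(), using the same call signature as on the prediction slide below, and it assumes predict() returns the expected number of claims given the offset; poisson_dev is the same helper as in the permutation-importance sketch). It defines the mtry_star and nodesize_star used on the next slides.

poisson_dev <- function(y, mu) 2 * mean(ifelse(y == 0, 0, y * log(y / mu)) - (y - mu))

grid.param$cv.dev <- sapply(seq_len(nrow(grid.param)), function(i) {
  X  <- folds[[grid.param[i, ]$fold]]
  mu <- predict(res[[i]],
                newdata = training.set[X, !names(training.set) %in% c("Nclaim", "ExpoR")],
                offset  = log(training.set[X, ]$ExpoR))
  poisson_dev(training.set[X, ]$Nclaim, mu)
})

cv.results <- aggregate(cv.dev ~ mtry. + nodesize., data = grid.param, FUN = mean)
best <- cv.results[which.min(cv.results$cv.dev), ]
mtry_star     <- best$mtry.
nodesize_star <- best$nodesize.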

41
Tree-based methods : Bagging trees and random forests Interpretability

Example MTPL
Tuning the model :
[Figure: cross-validation results. Poisson deviance (roughly between 0.530 and 0.550) as a function of mtry (8 down to 1), one panel per nodesize (500, 1 000, 5 000, 10 000).]

42
Tree-based methods : Bagging trees and random forests Interpretability

Example MTPL
Optimal RF :
> # fit optimal model
> optimal_rf = rfPoisson(x = training.set[,!names(training.set)
%in% c("Nclaim", "ExpoR")],
offset = log(training.set$ExpoR),
y = training.set$Nclaim,
xtest = validation.set[,!names(validation.set)
%in% c("Nclaim", "ExpoR")],
offsettest = log(validation.set$ExpoR),
ytest = validation.set$Nclaim,
ntree = 2000,
mtry = mtry_star,
nodesize = nodesize_star,
keep.forest = TRUE,
do.trace = TRUE,
importance=TRUE)

> print(optimal_rf)
Call:
rfPoisson(x = ...)
Type of random forest: regression
Number of trees: 2000
No. of variables tried at each split: 3

43
Tree-based methods : Bagging trees and random forests Interpretability

Example MTPL
Optimal RF :
> # Optimal number of trees
> par(mfrow=c(1,2))
> plot(optimal_rf, xlim=c(0,2000), ylim=c(0.54,0.55))
> plot(optimal_rf, xlim=c(0,500), ylim=c(0.54,0.55), main="Zoom")

[Figure: Poisson deviance (about 0.540 to 0.550) as a function of the number of trees; left panel 0 to 2 000 trees, right panel zoom on 0 to 500 trees.]

44
Tree-based methods : Bagging trees and random forests Interpretability

Example MTPL
Optimal RF :
> # relative importances
> imp <- importance(optimal_rf, type = 1)
> impvar <- rownames(imp)[order(imp[, 1], decreasing = TRUE)]
> par(mfrow=c(1, 1))
> varImpPlot(optimal_rf, sort = TRUE, type = 1)

[Figure: permutation variable importances (%IncLossFunction), in decreasing order: AgePh, Split, Fuel, AgeCar, Cover, Gender, PowerCat, Use.]

45
Tree-based methods : Bagging trees and random forests Interpretability

Example MTPL
Optimal RF :
> # partial dependences
> op <- par(mfrow=c(2, 4)) # for all features (here: 8 features)
> for (i in seq_along(impvar)) {
+partialPlot(optimal_rf, training.set, x.var = impvar[i], offset =log(training.set$ExpoR),
+ xlab = impvar[i], main = paste("Partial Dependence on", impvar[i]))
+}
> par(op)

[Figure: partial dependence plots of the optimal random forest for the eight features: AgePh, Split, Fuel, Cover (top row) and AgeCar, Gender, PowerCat, Use (bottom row).]

46
Tree-based methods : Bagging trees and random forests Interpretability

Example MTPL
Predictions :
> data$pred = predict(optimal_rf, offset = log(data$ExpoR), newdata = data)
> head(data, 20)
AgePh AgeCar Fuel Split Cover Gender Use PowerCat Latitude Longitude ExpoR Nclaim pred
50 12 Gasoline Monthly TPL.Only Male Private C2 50.5 4.21 1.00000 1 0.16459
64 3 Gasoline Yearly Limited.MD Female Private C2 50.5 4.21 1.00000 0 0.10225
60 10 Diesel Yearly TPL.Only Male Private C2 50.5 4.21 1.00000 0 0.12587
77 15 Gasoline Yearly TPL.Only Male Private C2 50.5 4.21 1.00000 0 0.09129
28 7 Gasoline Half-Yearly TPL.Only Female Private C2 50.5 4.21 0.04658 1 0.01007
26 12 Gasoline Quarterly TPL.Only Male Private C2 50.5 4.21 1.00000 0 0.31131
26 8 Gasoline Half-Yearly Comprehensive Male Private C2 50.5 4.21 1.00000 1 0.21448
58 14 Gasoline Quarterly TPL.Only Female Private C2 50.5 4.21 0.40274 0 0.06025
59 6 Gasoline Half-Yearly Limited.MD Female Private C1 50.5 4.21 1.00000 0 0.09995
57 10 Gasoline Half-Yearly Limited.MD Female Private C1 50.5 4.21 1.00000 0 0.10124
62 5 Gasoline Yearly Limited.MD Male Private C1 50.5 4.21 1.00000 0 0.08188
57 15 Gasoline Yearly TPL.Only Male Private C2 50.5 4.21 1.00000 0 0.09640
30 10 Gasoline Monthly Limited.MD Male Private C2 50.5 4.21 1.00000 1 0.20159
47 14 Gasoline Monthly TPL.Only Female Private C1 50.5 4.21 1.00000 0 0.16334
67 8 Gasoline Yearly Limited.MD Male Private C2 50.5 4.21 1.00000 0 0.08522
62 7 Gasoline Quarterly Comprehensive Male Professional C2 50.5 4.21 1.00000 0 0.12341
82 10 Gasoline Yearly Limited.MD Male Private C2 50.5 4.21 0.73425 0 0.06909
33 15 Gasoline Half-Yearly TPL.Only Male Private C1 50.5 4.21 0.31507 0 0.04204
43 2 Diesel Half-Yearly Comprehensive Male Private C3 50.5 4.21 1.00000 0 0.12183
51 7 Gasoline Yearly Limited.MD Male Private C4 50.5 4.21 1.00000 0 0.10955

47
Tree-based methods : Bagging trees and random forests Interpretability

Example MTPL
Comparison with a single regression tree :
→ Generalization error :
$$\widehat{\mathrm{Err}}^{\mathrm{val}}\left(\hat{\mu}^{\mathrm{rf}}_{D,\Theta}\right) = 0.5440970.$$

Model   Generalization error
Tree    0.5452772
RF      0.5440970
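The validation-set figure in this table can be reproduced along the following lines (a sketch reusing the predict() call signature of the prediction slide; it assumes predict() returns the expected number of claims given the offset):

poisson_dev <- function(y, mu) 2 * mean(ifelse(y == 0, 0, y * log(y / mu)) - (y - mu))
mu_val <- predict(optimal_rf,
                  newdata = validation.set[, !names(validation.set) %in% c("Nclaim", "ExpoR")],
                  offset  = log(validation.set$ExpoR))
poisson_dev(validation.set$Nclaim, mu_val)   # should be close to the 0.5440970 reported above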

48
Tree-based methods : Bagging trees and random forests Interpretability

Example MTPL
Comparison with a single regression tree :
→ Model granularity (impact of the age) : regression tree :

[Figure: single regression tree for the claim frequency (rpart plot). The root split is on AgePh (>= 30 vs < 30), followed by splits on Split, AgePh again, Fuel, Cover, Gender and AgeCar; the predicted claim frequencies in the terminal nodes range from about 0.079 to 0.24.]

49
Tree-based methods : Bagging trees and random forests Interpretability

Example MTPL
Comparison with a single regression tree :
→ Model granularity (impact of the age) : RF :

[Figure: random forest partial dependence plots for the eight features (same panels as above); the partial dependence on AgePh is much more granular than the step function produced by the single regression tree.]
50
