
Insurance Analytics

Prof. Julien Trufin

Academic year 2020-2021

1
Tree-based methods : Bagging trees and random forests

Bagging trees and random forests

2
Tree-based methods : Bagging trees and random forests

Introduction
One issue with regression trees : their high variance.
→ High variability of the prediction $\hat{\mu}_D(x)$ over the trees trained from all possible training sets $D$.

Bagging trees and random forests aim to reduce the variance without altering the bias too much.

Principle of ensemble methods (based on randomization) : introducing random perturbations into the training procedure in order to get different models from a single training set $D$ and combining them to obtain the estimate of the ensemble.

Start with the average prediction $E_D[\hat{\mu}_D(x)]$.
→ Same bias as $\hat{\mu}_D(x)$ since
$$E_D\big[E_D[\hat{\mu}_D(x)]\big] = E_D[\hat{\mu}_D(x)].$$
→ Zero variance since
$$V_D\big[E_D[\hat{\mu}_D(x)]\big] = 0.$$

3
Tree-based methods : Bagging trees and random forests

Introduction
Assume we can draw as many training sets as we want, so that we have B training sets $D^1, D^2, \ldots, D^B$ available. An approximation of the average model is then given by
$$\widehat{E}_D[\hat{\mu}_D(x)] = \frac{1}{B}\sum_{b=1}^{B} \hat{\mu}_{D^b}(x).$$

The average of $\widehat{E}_D[\hat{\mu}_D(x)]$ with respect to $D^1, \ldots, D^B$ is the average prediction $E_D[\hat{\mu}_D(x)]$, that is,
$$E_{D^1,\ldots,D^B}\left[\frac{1}{B}\sum_{b=1}^{B}\hat{\mu}_{D^b}(x)\right] = \frac{1}{B}\sum_{b=1}^{B} E_{D^b}\left[\hat{\mu}_{D^b}(x)\right] = E_D[\hat{\mu}_D(x)],$$
while the variance of $\widehat{E}_D[\hat{\mu}_D(x)]$ with respect to $D^1, \ldots, D^B$ is given by (using the independence of $D^1, \ldots, D^B$)
$$V_{D^1,\ldots,D^B}\left[\frac{1}{B}\sum_{b=1}^{B}\hat{\mu}_{D^b}(x)\right] = \frac{1}{B^2}\, V_{D^1,\ldots,D^B}\left[\sum_{b=1}^{B}\hat{\mu}_{D^b}(x)\right] = \frac{1}{B^2}\sum_{b=1}^{B} V_{D^b}\left[\hat{\mu}_{D^b}(x)\right] = \frac{V_D[\hat{\mu}_D(x)]}{B}.$$

⇒ Compared to each individual estimate : bias unchanged and variance divided by B.
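As an illustration of this variance reduction, the following R sketch (a toy example with a hypothetical true frequency $\mu(x)$, not part of the course material) compares, by Monte Carlo, the variance of a single tree prediction at a point $x_0$ with the variance of an average of B trees fitted on independent training sets drawn from the same population.

library(rpart)

set.seed(1)
mu <- function(x) 0.1 + 0.05 * sin(2 * pi * x)      # hypothetical true claim frequency
simulate_D <- function(n = 500) {                    # draw one training set from the population
  x <- runif(n)
  data.frame(x = x, y = rpois(n, mu(x)))
}
fit_and_predict <- function(x0) {                    # fit a tree on a fresh training set, predict at x0
  predict(rpart(y ~ x, data = simulate_D()), newdata = data.frame(x = x0))
}

x0 <- 0.3   # point at which the prediction is studied
B  <- 25    # number of independent training sets per averaged prediction
R  <- 200   # Monte Carlo replications

single   <- replicate(R, fit_and_predict(x0))
averaged <- replicate(R, mean(replicate(B, fit_and_predict(x0))))

c(var_single = var(single), var_averaged = var(averaged), ratio = var(single) / var(averaged))
# The ratio should be close to B, while mean(single) and mean(averaged) should be comparable (same bias).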
4
Tree-based methods : Bagging trees and random forests

Introduction
In practice, the probability distribution from which the observations of the
training set are drawn is usually not known so that there is only one training
set available.

In this context, the bootstrap approach, used both in bagging trees and
random forests, appears to be particularly useful.

5
Tree-based methods : Bagging trees and random forests Bagging trees

Bagging trees
Bagging is one of the first ensemble methods proposed in the literature.

The probability distribution of the random vector $(Y, X)$ is approximated by its empirical version, which puts an equal probability $1/|\mathcal{I}|$ on each of the observations $\{(y_i, x_i);\ i \in \mathcal{I}\}$ of the training set $D$.

Instead of simulating B training sets $D^1, D^2, \ldots, D^B$ from $(Y, X)$, which is not possible in practice, the idea of bagging is to simulate B bootstrap samples $D^{*1}, D^{*2}, \ldots, D^{*B}$ of $D$ from its empirical counterpart.
→ A bootstrap sample of $D$ is obtained by simulating independently $|\mathcal{I}|$ observations from the empirical distribution of $(Y, X)$.
→ A bootstrap sample is thus a random sample of $D$ taken with replacement which has the same size as $D$.
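In R, a bootstrap sample can be drawn with a single call to sample(); a minimal sketch (assuming, as in the MTPL example later in this chapter, that the training data sit in a data frame called training.set):

n_obs    <- nrow(training.set)
boot_idx <- sample(seq_len(n_obs), size = n_obs, replace = TRUE)  # |I| draws with replacement
D_star   <- training.set[boot_idx, ]            # bootstrap sample D*b, same size as D
oob_idx  <- setdiff(seq_len(n_obs), boot_idx)   # observations left out of D*b (out-of-bag, see below)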

6
Tree-based methods : Bagging trees and random forests Bagging trees

Bagging trees
Let $D^{*1}, D^{*2}, \ldots, D^{*B}$ be B bootstrap samples of $D$.
→ For each $D^{*b}$, $b = 1, \ldots, B$, we fit our model, giving prediction $\hat{\mu}_{D,\Theta_b}(x) = \hat{\mu}_{D^{*b}}(x)$.

The bagging prediction is then defined by
$$\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x) = \frac{1}{B}\sum_{b=1}^{B} \hat{\mu}_{D,\Theta_b}(x),$$
where $\Theta = (\Theta_1, \ldots, \Theta_B)$.
→ The random vectors $\Theta_1, \ldots, \Theta_B$ fully capture the randomness of the training procedure.

7
Tree-based methods : Bagging trees and random forests Bagging trees

Algorithm

Algorithm : Bagging Trees.

For b = 1 to B do
  1. Generate a bootstrap sample $D^{*b}$ of $D$.
  2. Fit an unpruned tree on $D^{*b}$, which gives prediction $\hat{\mu}_{D,\Theta_b}(x)$.
End for
Output : $\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x) = \frac{1}{B}\sum_{b=1}^{B} \hat{\mu}_{D,\Theta_b}(x)$.
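A minimal R sketch of this algorithm for a Poisson claim-count response, using rpart trees on the MTPL features (the formula, the use of rpart's "poisson" method with exposure as the first response column, and the control settings are illustrative assumptions, not the course's reference implementation):

library(rpart)

bagging_trees <- function(data, B = 100) {
  lapply(seq_len(B), function(b) {
    boot <- data[sample(nrow(data), replace = TRUE), ]        # bootstrap sample D*b
    rpart(cbind(ExpoR, Nclaim) ~ AgePh + AgeCar + Fuel + Split +
            Cover + Gender + Use + PowerCat,
          data = boot, method = "poisson",                    # Poisson deviance splitting
          control = rpart.control(cp = 0, minbucket = 1000))  # grow without pruning (cp = 0)
  })
}

predict_bagging <- function(trees, newdata) {
  # bagging prediction: average of the B individual tree predictions (claim frequencies)
  Reduce(`+`, lapply(trees, predict, newdata = newdata)) / length(trees)
}

# Example use (assuming training.set as defined later in the chapter):
# bag_fit <- bagging_trees(training.set, B = 100)
# head(predict_bagging(bag_fit, training.set))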

8
Tree-based methods : Bagging trees and random forests Bagging trees

Bias
The bias is the same as the one of the individual sampled models. Indeed,
$$\begin{aligned}
\mathrm{Bias}(x) &= \mu(x) - E_{D,\Theta}\left[\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)\right] \\
&= \mu(x) - E_{D,\Theta_1,\ldots,\Theta_B}\left[\frac{1}{B}\sum_{b=1}^{B}\hat{\mu}_{D,\Theta_b}(x)\right] \\
&= \mu(x) - \frac{1}{B}\sum_{b=1}^{B} E_{D,\Theta_b}\left[\hat{\mu}_{D,\Theta_b}(x)\right] \\
&= \mu(x) - E_{D,\Theta_b}\left[\hat{\mu}_{D,\Theta_b}(x)\right]
\end{aligned}$$
since the predictions $\hat{\mu}_{D,\Theta_1}(x), \ldots, \hat{\mu}_{D,\Theta_B}(x)$ are identically distributed.

However, the bias of $\hat{\mu}_{D,\Theta_b}(x)$ is typically greater in absolute terms than the bias of $\hat{\mu}_D(x)$ fitted on $D$, since the reduced sample $D^{*b}$ imposes restrictions.
⇒ The improvements in the estimation obtained by bagging will be a consequence of variance reduction.

9
Tree-based methods : Bagging trees and random forests Bagging trees

Variance
The variance of $\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)$ can be written as
$$\begin{aligned}
V_{D,\Theta}\left[\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)\right]
&= V_{D,\Theta_1,\ldots,\Theta_B}\left[\frac{1}{B}\sum_{b=1}^{B}\hat{\mu}_{D,\Theta_b}(x)\right] \\
&= \frac{1}{B^2}\, V_{D,\Theta_1,\ldots,\Theta_B}\left[\sum_{b=1}^{B}\hat{\mu}_{D,\Theta_b}(x)\right] \\
&= \frac{1}{B^2}\left\{ V_D\left[E_{\Theta_1,\ldots,\Theta_B}\left[\left.\sum_{b=1}^{B}\hat{\mu}_{D,\Theta_b}(x)\,\right|\,D\right]\right] + E_D\left[V_{\Theta_1,\ldots,\Theta_B}\left[\left.\sum_{b=1}^{B}\hat{\mu}_{D,\Theta_b}(x)\,\right|\,D\right]\right]\right\} \\
&= V_D\left[E_{\Theta_b}\left[\hat{\mu}_{D,\Theta_b}(x)\,\middle|\,D\right]\right] + \frac{1}{B}\, E_D\left[V_{\Theta_b}\left[\hat{\mu}_{D,\Theta_b}(x)\,\middle|\,D\right]\right]
\end{aligned}$$
since, conditionally on $D$, $\hat{\mu}_{D,\Theta_1}(x), \ldots, \hat{\mu}_{D,\Theta_B}(x)$ are i.i.d.
→ The first term is the sampling variance of the bagging ensemble (a result of the sampling variability of $D$).
→ The second term is the within-$D$ variance (a result of the randomization due to the bootstrap sampling). As B increases, this second term disappears.
10
Tree-based methods : Bagging trees and random forests Bagging trees

Variance
Observation :
$$V_{D,\Theta}\left[\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)\right] \leq V_{D,\Theta_b}\left[\hat{\mu}_{D,\Theta_b}(x)\right].$$

Let $\rho(x)$ be the Pearson correlation coefficient between any pair of predictions used in the averaging, which are built on the same training set but fitted on two different bootstrap samples :
$$\rho(x) = \frac{\mathrm{Cov}_{D,\Theta_b,\Theta_{b'}}\left[\hat{\mu}_{D,\Theta_b}(x),\, \hat{\mu}_{D,\Theta_{b'}}(x)\right]}{\sqrt{V_{D,\Theta_b}\left[\hat{\mu}_{D,\Theta_b}(x)\right]}\,\sqrt{V_{D,\Theta_{b'}}\left[\hat{\mu}_{D,\Theta_{b'}}(x)\right]}} = \frac{\mathrm{Cov}_{D,\Theta_b,\Theta_{b'}}\left[\hat{\mu}_{D,\Theta_b}(x),\, \hat{\mu}_{D,\Theta_{b'}}(x)\right]}{V_{D,\Theta_b}\left[\hat{\mu}_{D,\Theta_b}(x)\right]}$$
as $\hat{\mu}_{D,\Theta_b}(x)$ and $\hat{\mu}_{D,\Theta_{b'}}(x)$ are identically distributed.

By the law of total covariance :
$$\begin{aligned}
\mathrm{Cov}_{D,\Theta_b,\Theta_{b'}}\left[\hat{\mu}_{D,\Theta_b}(x),\, \hat{\mu}_{D,\Theta_{b'}}(x)\right]
&= E_D\left[\mathrm{Cov}_{\Theta_b,\Theta_{b'}}\left[\hat{\mu}_{D,\Theta_b}(x),\, \hat{\mu}_{D,\Theta_{b'}}(x)\,\middle|\,D\right]\right] \\
&\quad + \mathrm{Cov}_D\left[E_{\Theta_b}\left[\hat{\mu}_{D,\Theta_b}(x)\,\middle|\,D\right],\, E_{\Theta_{b'}}\left[\hat{\mu}_{D,\Theta_{b'}}(x)\,\middle|\,D\right]\right] \\
&= V_D\left[E_{\Theta_b}\left[\hat{\mu}_{D,\Theta_b}(x)\,\middle|\,D\right]\right]
\end{aligned}$$
since, conditionally on $D$, $\hat{\mu}_{D,\Theta_b}(x)$ and $\hat{\mu}_{D,\Theta_{b'}}(x)$ are i.i.d.
11
Tree-based methods : Bagging trees and random forests Bagging trees

Variance
$\rho(x)$ becomes
$$\rho(x) = \frac{V_D\left[E_{\Theta_b}\left[\hat{\mu}_{D,\Theta_b}(x)\,\middle|\,D\right]\right]}{V_{D,\Theta_b}\left[\hat{\mu}_{D,\Theta_b}(x)\right]} = \frac{V_D\left[E_{\Theta_b}\left[\hat{\mu}_{D,\Theta_b}(x)\,\middle|\,D\right]\right]}{V_D\left[E_{\Theta_b}\left[\hat{\mu}_{D,\Theta_b}(x)\,\middle|\,D\right]\right] + E_D\left[V_{\Theta_b}\left[\hat{\mu}_{D,\Theta_b}(x)\,\middle|\,D\right]\right]}.$$

$\rho(x)$ measures the correlation between a pair of predictions in the ensemble induced by repeatedly making training sample draws $D$ from the population and then drawing a pair of bootstrap samples from $D$.
→ When $\rho(x)$ is close to 1, the predictions are highly correlated, suggesting that the randomization due to the bootstrap sampling has no significant effect on the predictions. The total variance is mostly driven by the training set.
→ When $\rho(x)$ is close to 0, the predictions are uncorrelated, suggesting that the randomization due to the bootstrap sampling has a strong impact on the predictions. The total variance is mostly due to the randomization induced by the bootstrap samples.

12
Tree-based methods : Bagging trees and random forests Bagging trees

Variance
Alternatively,
$$V_D\left[E_{\Theta_b}\left[\hat{\mu}_{D,\Theta_b}(x)\,\middle|\,D\right]\right] = \rho(x)\, V_{D,\Theta_b}\left[\hat{\mu}_{D,\Theta_b}(x)\right]$$
and
$$E_D\left[V_{\Theta_b}\left[\hat{\mu}_{D,\Theta_b}(x)\,\middle|\,D\right]\right] = \big(1 - \rho(x)\big)\, V_{D,\Theta_b}\left[\hat{\mu}_{D,\Theta_b}(x)\right],$$
so that
$$V_{D,\Theta}\left[\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)\right] = V_D\left[E_{\Theta_b}\left[\hat{\mu}_{D,\Theta_b}(x)\,\middle|\,D\right]\right] + \frac{1}{B}\, E_D\left[V_{\Theta_b}\left[\hat{\mu}_{D,\Theta_b}(x)\,\middle|\,D\right]\right] = \rho(x)\, V_{D,\Theta_b}\left[\hat{\mu}_{D,\Theta_b}(x)\right] + \frac{1-\rho(x)}{B}\, V_{D,\Theta_b}\left[\hat{\mu}_{D,\Theta_b}(x)\right].$$

→ As B increases, the second term disappears.
→ When $\rho(x) < 1$ (which means that the randomization due to the bootstrap sampling influences the individual predictions), one sees that the variance of the ensemble is strictly smaller than the variance of an individual model.
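A small numerical illustration of this decomposition (the numbers are illustrative, not taken from the course): with $\rho(x) = 0.5$, $V_{D,\Theta_b}[\hat{\mu}_{D,\Theta_b}(x)] = v$ and $B = 100$ trees,
$$V_{D,\Theta}\left[\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)\right] = 0.5\, v + \frac{1 - 0.5}{100}\, v = 0.505\, v,$$
so the ensemble variance is already essentially at its limit $\rho(x)\, v = 0.5\, v$ : adding more trees cannot push it below this floor, which is why random forests then try to reduce $\rho(x)$ itself.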

13
Tree-based methods : Bagging trees and random forests Bagging trees

Variance
Notice that
$$V_{D,\Theta_b}\left[\hat{\mu}_{D,\Theta_b}(x)\right] \geq V_D\left[\hat{\mu}_D(x)\right].$$
⇒ Bagging averages models with higher variances.

Nevertheless, $\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)$ generally has a smaller variance than $\hat{\mu}_D(x)$.
→ Typically, $\rho(x)$ compensates for the variance increase
$$V_{D,\Theta_b}\left[\hat{\mu}_{D,\Theta_b}(x)\right] - V_D\left[\hat{\mu}_D(x)\right],$$
so that the combined effect of $\rho(x) < 1$ and $V_{D,\Theta_b}\left[\hat{\mu}_{D,\Theta_b}(x)\right] \geq V_D\left[\hat{\mu}_D(x)\right]$ often leads to a variance reduction
$$V_D\left[\hat{\mu}_D(x)\right] - \rho(x)\, V_{D,\Theta_b}\left[\hat{\mu}_{D,\Theta_b}(x)\right]$$
that is positive.
→ Because of their high variance, regression trees are very likely to benefit from the averaging procedure.

14
Tree-based methods : Bagging trees and random forests Bagging trees

Expected generalization error : Squared error loss


We have
$$E_{D,\Theta}\left[\mathrm{Err}\left(\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)\right)\right] = \mathrm{Err}(\mu(x)) + \left(\mu(x) - E_{D,\Theta}\left[\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)\right]\right)^2 + V_{D,\Theta}\left[\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)\right].$$

The bias remains unchanged while the variance decreases compared to the individual prediction $\hat{\mu}_{D,\Theta_b}(x)$, so that we get
$$\begin{aligned}
E_{D,\Theta}\left[\mathrm{Err}\left(\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)\right)\right]
&= \mathrm{Err}(\mu(x)) + \left(\mu(x) - E_{D,\Theta_b}\left[\hat{\mu}_{D,\Theta_b}(x)\right]\right)^2 + V_{D,\Theta}\left[\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)\right] \\
&\leq \mathrm{Err}(\mu(x)) + \left(\mu(x) - E_{D,\Theta_b}\left[\hat{\mu}_{D,\Theta_b}(x)\right]\right)^2 + V_{D,\Theta_b}\left[\hat{\mu}_{D,\Theta_b}(x)\right] \\
&= E_{D,\Theta_b}\left[\mathrm{Err}\left(\hat{\mu}_{D,\Theta_b}(x)\right)\right].
\end{aligned}$$

For every value of $X$, the expected generalization error of the ensemble is smaller than the expected generalization error of an individual model. Taking the average over $X$ leads to
$$E_{D,\Theta}\left[\mathrm{Err}\left(\hat{\mu}^{\mathrm{bag}}_{D,\Theta}\right)\right] \leq E_{D,\Theta_b}\left[\mathrm{Err}\left(\hat{\mu}_{D,\Theta_b}\right)\right].$$

15
Tree-based methods : Bagging trees and random forests Bagging trees

Expected generalization error : Poisson deviance loss


We have
$$E_{D,\Theta}\left[\mathrm{Err}\left(\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)\right)\right] = \mathrm{Err}(\mu(x)) + E_{D,\Theta}\left[E^P\left(\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)\right)\right]$$
with
$$E_{D,\Theta}\left[E^P\left(\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)\right)\right] = 2\mu(x)\left(E_{D,\Theta}\left[\frac{\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)}{\mu(x)}\right] - 1 - E_{D,\Theta}\left[\ln\left(\frac{\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)}{\mu(x)}\right)\right]\right).$$

Now,
$$E_{D,\Theta}\left[\frac{\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)}{\mu(x)}\right] = E_{D,\Theta_b}\left[\frac{\hat{\mu}_{D,\Theta_b}(x)}{\mu(x)}\right],$$
so that
$$\begin{aligned}
E_{D,\Theta}\left[E^P\left(\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)\right)\right]
&= 2\mu(x)\left(E_{D,\Theta_b}\left[\frac{\hat{\mu}_{D,\Theta_b}(x)}{\mu(x)}\right] - 1 - E_{D,\Theta_b}\left[\ln\left(\frac{\hat{\mu}_{D,\Theta_b}(x)}{\mu(x)}\right)\right]\right) \\
&\quad - 2\mu(x)\left(E_{D,\Theta}\left[\ln\left(\frac{\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)}{\mu(x)}\right)\right] - E_{D,\Theta_b}\left[\ln\left(\frac{\hat{\mu}_{D,\Theta_b}(x)}{\mu(x)}\right)\right]\right) \\
&= E_{D,\Theta_b}\left[E^P\left(\hat{\mu}_{D,\Theta_b}(x)\right)\right] \\
&\quad - 2\mu(x)\left(E_{D,\Theta}\left[\ln\left(\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)\right)\right] - E_{D,\Theta_b}\left[\ln\left(\hat{\mu}_{D,\Theta_b}(x)\right)\right]\right).
\end{aligned}$$
16
Tree-based methods : Bagging trees and random forests Bagging trees

Expected generalization error : Poisson deviance loss


Jensen's inequality implies
$$\begin{aligned}
E_{D,\Theta}\left[\ln\left(\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)\right)\right] - E_{D,\Theta_b}\left[\ln\left(\hat{\mu}_{D,\Theta_b}(x)\right)\right]
&= E_{D,\Theta_1,\ldots,\Theta_B}\left[\ln\left(\frac{1}{B}\sum_{b=1}^{B}\hat{\mu}_{D,\Theta_b}(x)\right)\right] - E_{D,\Theta_b}\left[\ln\left(\hat{\mu}_{D,\Theta_b}(x)\right)\right] \\
&\geq E_{D,\Theta_1,\ldots,\Theta_B}\left[\frac{1}{B}\sum_{b=1}^{B}\ln\left(\hat{\mu}_{D,\Theta_b}(x)\right)\right] - E_{D,\Theta_b}\left[\ln\left(\hat{\mu}_{D,\Theta_b}(x)\right)\right] \\
&= 0,
\end{aligned}$$
so that
$$E_{D,\Theta}\left[E^P\left(\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)\right)\right] \leq E_{D,\Theta_b}\left[E^P\left(\hat{\mu}_{D,\Theta_b}(x)\right)\right]$$
and hence
$$E_{D,\Theta}\left[\mathrm{Err}\left(\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)\right)\right] \leq E_{D,\Theta_b}\left[\mathrm{Err}\left(\hat{\mu}_{D,\Theta_b}(x)\right)\right].$$

For every value of $X$, the expected generalization error of the ensemble is smaller than the expected generalization error of an individual model. Taking the average over $X$ leads to
$$E_{D,\Theta}\left[\mathrm{Err}\left(\hat{\mu}^{\mathrm{bag}}_{D,\Theta}\right)\right] \leq E_{D,\Theta_b}\left[\mathrm{Err}\left(\hat{\mu}_{D,\Theta_b}\right)\right].$$
17
Tree-based methods : Bagging trees and random forests Random forests

Random forests
The procedure called random forests is a modification of bagging trees.
→ It produces a collection of trees that are more de-correlated.

The variance of the bagging prediction can be expressed as
$$V_{D,\Theta}\left[\hat{\mu}^{\mathrm{bag}}_{D,\Theta}(x)\right] = \rho(x)\, V_{D,\Theta_b}\left[\hat{\mu}_{D,\Theta_b}(x)\right] + \frac{1-\rho(x)}{B}\, V_{D,\Theta_b}\left[\hat{\mu}_{D,\Theta_b}(x)\right].$$
→ As B increases, the second term disappears.
→ $\rho(x)$ in the first term limits the effect of averaging.

The idea of random forests : reducing $\rho(x)$ without increasing $V_{D,\Theta_b}\left[\hat{\mu}_{D,\Theta_b}(x)\right]$ too much.
→ Reducing correlation among trees can be achieved by adding randomness to the training procedure.

Random forests are a combination of bagging with random feature selection at each node.
→ Specifically, m (≤ p) features are selected at random before each split and used as candidates for splitting.
18
Tree-based methods : Bagging trees and random forests Random forests

Random forests
The random forest prediction writes
$$\hat{\mu}^{\mathrm{rf}}_{D,\Theta}(x) = \frac{1}{B}\sum_{b=1}^{B} \hat{\mu}_{D,\Theta_b}(x),$$
where $\hat{\mu}_{D,\Theta_b}(x)$ denotes the prediction at point $x$ for the bth random forest tree.
→ $\Theta_1, \ldots, \Theta_B$ capture the randomness of the bootstrap sampling and the additional randomness due to the random selection of m features before each split.
→ m is a tuning parameter. Typical value of m is $\lfloor p/3 \rfloor$.

19
Tree-based methods : Bagging trees and random forests Random forests

Algorithm

Algorithm : Random Forests.

For b = 1 to B do
  1. Generate a bootstrap sample $D^{*b}$ of $D$.
  2. Fit a tree on $D^{*b}$ :
     For each node t do
       (2.1) Select m (≤ p) features at random from the p original features.
       (2.2) Pick the best feature among the m.
       (2.3) Split the node into two daughter nodes.
     End for
     This gives prediction $\hat{\mu}_{D,\Theta_b}(x)$ (use typical tree stopping criteria, but do not prune).
End for
Output : $\hat{\mu}^{\mathrm{rf}}_{D,\Theta}(x) = \frac{1}{B}\sum_{b=1}^{B} \hat{\mu}_{D,\Theta_b}(x)$.
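For a quick illustration, the same idea with random feature selection is available off the shelf in the randomForest package; a minimal sketch (it uses least-squares splitting on the crude rate Nclaim/ExpoR, which is only a rough stand-in for the Poisson-deviance forests of the rfCountData package used in the MTPL example below):

library(randomForest)

features <- c("AgePh", "AgeCar", "Fuel", "Split", "Cover", "Gender", "Use", "PowerCat")

rf_sketch <- randomForest(x = training.set[, features],
                          y = training.set$Nclaim / training.set$ExpoR,  # crude claim rate
                          ntree = 500,       # B
                          mtry = 3,          # m features drawn at random before each split
                          nodesize = 5000)   # minimum size of terminal nodes (no pruning)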

20
Tree-based methods : Bagging trees and random forests Random forests

Computational costs
Computational costs and memory requirements increase as the number of bootstrap samples increases.
→ However, this can be mitigated with parallel computing : each bootstrap sample and the corresponding tree are independent of every other sample and tree, so the trees can be fitted simultaneously on separate cores.

21
Tree-based methods : Bagging trees and random forests Out-of-bag estimate

Out-of-bag estimate
For each observation $(y_i, x_i)$ of the training set $D$, an out-of-bag prediction can be constructed by averaging only the trees corresponding to bootstrap samples $D^{*b}$ in which $(y_i, x_i)$ does not appear.

The out-of-bag prediction for observation $(y_i, x_i)$ is thus given by
$$\hat{\mu}^{\mathrm{oob}}_{D,\Theta}(x_i) = \frac{1}{\sum_{b=1}^{B} I\left[(y_i, x_i) \notin D^{*b}\right]}\sum_{b=1}^{B} \hat{\mu}_{D,\Theta_b}(x_i)\, I\left[(y_i, x_i) \notin D^{*b}\right].$$

The generalization error of $\hat{\mu}_{D,\Theta}$ can be estimated by
$$\widehat{\mathrm{Err}}^{\mathrm{oob}}(\hat{\mu}_{D,\Theta}) = \frac{1}{|\mathcal{I}|}\sum_{i \in \mathcal{I}} L\left(y_i,\, \hat{\mu}^{\mathrm{oob}}_{D,\Theta}(x_i)\right),$$
which is called the out-of-bag estimate of the generalization error.

$\widehat{\mathrm{Err}}^{\mathrm{oob}}(\hat{\mu}_{D,\Theta})$ is almost identical to the $|\mathcal{I}|$-fold cross-validation estimate.
→ However, it does not require fitting new trees, so that bagging trees and random forests can be fit in one sequence : we stop adding new trees when the out-of-bag estimate of the generalization error stabilizes.
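A sketch of the out-of-bag computation for a hand-rolled bagging ensemble in which each fitted tree is stored together with its bootstrap row indices (illustrative code in the spirit of the sketches above, not the rfCountData implementation):

library(rpart)

fit_bagging <- function(data, B = 100) {
  lapply(seq_len(B), function(b) {
    idx  <- sample(nrow(data), replace = TRUE)                  # bootstrap indices of D*b
    tree <- rpart(cbind(ExpoR, Nclaim) ~ AgePh + AgeCar + Fuel + Split +
                    Cover + Gender + Use + PowerCat,
                  data = data[idx, ], method = "poisson",
                  control = rpart.control(cp = 0, minbucket = 1000))
    list(tree = tree, in_bag = unique(idx))
  })
}

oob_predict <- function(fits, data) {
  num <- rep(0, nrow(data))   # running sum of predictions from trees for which i is out-of-bag
  den <- rep(0, nrow(data))   # number of such trees for each observation
  for (f in fits) {
    oob <- setdiff(seq_len(nrow(data)), f$in_bag)
    num[oob] <- num[oob] + predict(f$tree, newdata = data[oob, ])
    den[oob] <- den[oob] + 1
  }
  num / den   # NA only if an observation appears in every bootstrap sample (rare for large B)
}

# Example use:
# fits <- fit_bagging(training.set, B = 100)
# mu_oob <- training.set$ExpoR * oob_predict(fits, training.set)  # expected claim counts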
22
Tree-based methods : Bagging trees and random forests Interpretability

Interpretability : Relative importances


A bagged model is less interpretable than a model that is not bagged.
→ A bagged tree is no longer a tree.

However, a measure of variable importance can be constructed by combining measures of importance from the bootstrap trees.
→ For the bth tree in the ensemble, denoted $T_b$, the relative importance of feature $x_j$ is the sum of the deviance reductions $\Delta D_{\chi_t}$ over the non-terminal nodes $t \in \widetilde{T}(T_b)(x_j)$ (i.e. the non-terminal nodes $t$ of $T_b$ for which $x_j$ was selected as the splitting feature) :
$$I_b(x_j) = \sum_{t \in \widetilde{T}(T_b)(x_j)} \Delta D_{\chi_t}.$$
→ The relative importance of feature $x_j$ is naturally extended by averaging $I_b(x_j)$ over the collection of trees :
$$I(x_j) = \frac{1}{B}\sum_{b=1}^{B} I_b(x_j).$$
→ Normalization : the relative importances are normalized such that their sum equals 100.
⇒ Any individual number can then be interpreted as the percentage contribution to the overall model.
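A sketch of this averaging for a list of rpart trees, as produced by the bagging sketches above (it relies on rpart's $variable.importance, which also credits surrogate splits and is therefore only an approximation of the primary-split deviance reductions described here):

relative_importance <- function(trees, features) {
  imp <- sapply(trees, function(tree) {
    vi <- tree$variable.importance                # named vector of importances I_b(x_j)
    vi[setdiff(features, names(vi))] <- 0         # features never used by this tree get 0
    vi[features]
  })
  avg <- rowMeans(imp)                            # average over the B trees
  100 * avg / sum(avg)                            # normalize so that the importances sum to 100
}

# Example use with the hand-rolled ensemble above:
# relative_importance(lapply(fits, `[[`, "tree"),
#                     c("AgePh", "AgeCar", "Fuel", "Split", "Cover", "Gender", "Use", "PowerCat"))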

23
Tree-based methods : Bagging trees and random forests Interpretability

Interpretability : Relative importances (by permutation)


An alternative way to compute variable importances is the following.

Some observations $(y_i, x_i)$ of the training set $D$ do not appear in the bootstrap sample $D^{*b}$. They are called the out-of-bag observations for the bth tree.

These observations enable us to assess the predictive accuracy of $\hat{\mu}_{D,\Theta_b}$, that is,
$$\widehat{\mathrm{Err}}(\hat{\mu}_{D,\Theta_b}) = \frac{1}{|\mathcal{I}\setminus\mathcal{I}^{*b}|}\sum_{i \in \mathcal{I}\setminus\mathcal{I}^{*b}} L\left(y_i,\, \hat{\mu}_{D,\Theta_b}(x_i)\right),$$
where $\mathcal{I}^{*b}$ labels the observations in $D^{*b}$.

The categories of feature $x_j$ are then randomly permuted in the out-of-bag observations, so that we get perturbed observations $(y_i, x_i^{\mathrm{perm}(j)})$, $i \in \mathcal{I}\setminus\mathcal{I}^{*b}$, and the predictive accuracy of $\hat{\mu}_{D,\Theta_b}$ is again computed as
$$\widehat{\mathrm{Err}}^{\mathrm{perm}(j)}(\hat{\mu}_{D,\Theta_b}) = \frac{1}{|\mathcal{I}\setminus\mathcal{I}^{*b}|}\sum_{i \in \mathcal{I}\setminus\mathcal{I}^{*b}} L\left(y_i,\, \hat{\mu}_{D,\Theta_b}(x_i^{\mathrm{perm}(j)})\right).$$

24
Tree-based methods : Bagging trees and random forests Interpretability

Interpretability : Relative importances (by permutation)


The decrease in predictive accuracy due to this permuting is averaged over all trees and used as a measure of importance for feature $x_j$ in the ensemble, that is,
$$I(x_j) = \frac{1}{B}\sum_{b=1}^{B}\left(\widehat{\mathrm{Err}}^{\mathrm{perm}(j)}(\hat{\mu}_{D,\Theta_b}) - \widehat{\mathrm{Err}}(\hat{\mu}_{D,\Theta_b})\right).$$

These importances can be normalized to improve their readability.

A feature is important if randomly permuting its values decreases the model accuracy : in that case, the model relies on this feature for the prediction.
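A sketch of this permutation importance for one tree and one feature, built on the hand-rolled ensemble above and using the Poisson deviance as the loss L (the poisson_dev helper and the structure of each fit are assumptions of these sketches, not rfCountData internals):

poisson_dev <- function(y, mu) {
  # mean Poisson deviance between observed counts y and expected counts mu
  2 * mean(ifelse(y == 0, 0, y * log(y / mu)) - (y - mu))
}

perm_importance_one <- function(fit, data, feature) {
  oob <- data[setdiff(seq_len(nrow(data)), fit$in_bag), ]      # out-of-bag observations of tree b
  err <- poisson_dev(oob$Nclaim, oob$ExpoR * predict(fit$tree, newdata = oob))
  oob_perm <- oob
  oob_perm[[feature]] <- sample(oob_perm[[feature]])           # randomly permute feature x_j
  err_perm <- poisson_dev(oob_perm$Nclaim,
                          oob_perm$ExpoR * predict(fit$tree, newdata = oob_perm))
  err_perm - err      # increase in deviance = decrease in predictive accuracy
}

# Ensemble importance of, say, AgePh: average over the B trees
# I_AgePh <- mean(sapply(fits, perm_importance_one, data = training.set, feature = "AgePh"))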

25
Tree-based methods : Bagging trees and random forests Interpretability

Partial dependence plots


Consider the subvector $x_S$ of $l < p$ of the features $x = (x_1, x_2, \ldots, x_p)$, indexed by $S \subset \{1, 2, \ldots, p\}$. Let $x_{\bar{S}}$ be the complement subvector such that $x_S \cup x_{\bar{S}} = x$.

In principle, $\hat{\mu}(x)$ depends on features in $x_S$ and $x_{\bar{S}}$, so that we can write $\hat{\mu}(x) = \hat{\mu}(x_S, x_{\bar{S}})$.

If one conditions on specific values for the features in $x_{\bar{S}}$, then $\hat{\mu}(x)$ can be seen as a function of the features in $x_S$. The partial dependence of $\hat{\mu}(x)$ on $x_S$ is given by
$$\hat{\mu}_S(x_S) = E_{X_{\bar{S}}}\left[\hat{\mu}(x_S, X_{\bar{S}})\right].$$

The partial dependence function $\hat{\mu}_S(x_S)$ can be estimated from the training set by
$$\frac{1}{|\mathcal{I}|}\sum_{i \in \mathcal{I}} \hat{\mu}(x_S, x_{i\bar{S}}),$$
where $\{x_{i\bar{S}},\, i \in \mathcal{I}\}$ are the values of $X_{\bar{S}}$ in the training set.
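A sketch of this estimator for a single numeric feature, looping over a grid of values of $x_S$ and averaging the model prediction with all other features kept at their observed values (the model, data frame and feature names are placeholders):

partial_dependence <- function(model, data, feature, grid) {
  sapply(grid, function(v) {
    data_v <- data
    data_v[[feature]] <- v                    # set x_S = v for every observation
    mean(predict(model, newdata = data_v))    # average over the empirical distribution of the other features
  })
}

# Example use, e.g. for AgePh with a fitted claim-frequency model 'model':
# pd_age <- partial_dependence(model, training.set, "AgePh", grid = 18:90)
# plot(18:90, pd_age, type = "l", xlab = "AgePh", ylab = "Partial dependence")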


26
Tree-based methods : Bagging trees and random forests Interpretability

Example MTPL
Data set :
→ MTPL insurance portfolio of a Belgian insurance company observed during
one year.
→ Description of data set :
> str(data) # description of the dataset
’data.frame’: 160944 obs. of 10 variables:
$ AgePh : int 50 64 60 77 28 26 26 58 59 57 ...
$ AgeCar : int 12 3 10 15 7 12 8 14 6 10 ...
$ Fuel : Factor w/ 2 levels "Diesel","Gasoline": 2 2 1 2 2 2 2 2 2 2 ...
$ Split : Factor w/ 4 levels "Half-Yearly",..: 2 4 4 4 1 3 1 3 1 1 ...
$ Cover : Factor w/ 3 levels "Comprehensive",..: 3 2 3 3 3 3 1 3 2 2 ...
$ Gender : Factor w/ 2 levels "Female","Male": 1 2 1 1 2 1 1 2 2 2 ...
$ Use : Factor w/ 2 levels "Private","Professional": 1 1 1 1 1 1 1 1 1 1 ...
$ PowerCat: Factor w/ 5 levels "C1","C2","C3",..: 2 2 2 2 2 2 2 2 1 1 ...
$ ExpoR : num 1 1 1 1 0.0466 ...
$ Nclaim : int 1 0 0 0 1 0 1 0 0 0 ...

27
Tree-based methods : Bagging trees and random forests Interpretability

Example MTPL
Data set :
→ The data set comprises 160 944 insurance policies.
→ For each policy, we have 8 features :
- AgePh : policyholder’s age ;
- AgeCar : age of the car ;
- Fuel : fuel of the car, with two categories (gas or diesel) ;
- Split : splitting of the premium, with four categories (annually, semi-annually,
quarterly or monthly) ;
- Cover : extent of the coverage, with three categories (from compulsory
third-party liability cover to comprehensive) ;
- Gender : policyholder’s gender, with two categories (female or male) ;
- Use : use of the car, with two categories (private or professional) ;
- PowerCat : the engine’s power, with five categories.
→ For each policy, we have the number of claims (Nclaim), which is the response, and exposure information (the exposure-to-risk ExpoR, expressed in years).

28
Tree-based methods : Bagging trees and random forests Interpretability

Example MTPL
Data set :
> head(data,10) # 10 first observations
AgePh AgeCar Fuel Split Cover Gender Use PowerCat ExpoR Nclaim
1 50 12 Gasoline Monthly TPL.Only Female Private C2 1.00000000 1
2 64 3 Gasoline Yearly Limited.MD Male Private C2 1.00000000 0
3 60 10 Diesel Yearly TPL.Only Female Private C2 1.00000000 0
4 77 15 Gasoline Yearly TPL.Only Female Private C2 1.00000000 0
5 28 7 Gasoline Half-Yearly TPL.Only Male Private C2 0.04657534 1
6 26 12 Gasoline Quarterly TPL.Only Female Private C2 1.00000000 0
7 26 8 Gasoline Half-Yearly Comprehensive Female Private C2 1.00000000 1
8 58 14 Gasoline Quarterly TPL.Only Male Private C2 0.40273973 0
9 59 6 Gasoline Half-Yearly Limited.MD Male Private C1 1.00000000 0
10 57 10 Gasoline Half-Yearly Limited.MD Male Private C1 1.00000000 0

29
Tree-based methods : Bagging trees and random forests Interpretability

Example MTPL
Descriptive statistics of the data :

[Figure: number of policies by exposure-to-risk (in months), from 1 to 12.]

30
Tree-based methods : Bagging trees and random forests Interpretability

Example MTPL
Descriptive statistics of the data :
[Figure: total exposure (left) and observed claim frequency (right) by Gender.]

31
Tree-based methods : Bagging trees and random forests Interpretability

Example MTPL
Descriptive statistics of the data :
[Figure: total exposure (left) and observed claim frequency (right) by Fuel.]

32
Tree-based methods : Bagging trees and random forests Interpretability

Example MTPL
Descriptive statistics of the data :

[Figure: total exposure (left) and observed claim frequency (right) by Use.]

33
Tree-based methods : Bagging trees and random forests Interpretability

Example MTPL
Descriptive statistics of the data :
[Figure: total exposure (left) and observed claim frequency (right) by Cover.]

34
Tree-based methods : Bagging trees and random forests Interpretability

Example MTPL
Descriptive statistics of the data :
[Figure: total exposure (left) and observed claim frequency (right) by Split.]

35
Tree-based methods : Bagging trees and random forests Interpretability

Example MTPL
Descriptive statistics of the data :
[Figure: total exposure (left) and observed claim frequency (right) by PowerCat.]

36
Tree-based methods : Bagging trees and random forests Interpretability

Example MTPL
Descriptive statistics of the data :

[Figure: total exposure (left) and observed claim frequency (right) by AgeCar.]

37
Tree-based methods : Bagging trees and random forests Interpretability

Example MTPL
Descriptive statistics of the data :
[Figure: total exposure (left) and observed claim frequency (right) by AgePh.]

38
Tree-based methods : Bagging trees and random forests Interpretability

Example MTPL
Training set and validation set :
→ Training set : 80% of the data set.
→ Validation set : 20% of the data set.
> library(caret)
> inValidation = createDataPartition(data$Nclaim, p=0.2, list=FALSE)
> validation.set = data[inValidation,]
> training.set = data[-inValidation,]

39
Tree-based methods : Bagging trees and random forests Interpretability

Example MTPL
R packages :
→ ipred (for bagging).
→ randomForest.
→ rfCountData.

rfPoisson() (bagging() : for bagging only).


→ Description :
 Fit a RF model (Breiman’s algorithm) on count data, by using the Poisson
deviance.
→ Usage :
 rfPoisson(x, offset, y, xtest, offsettest, ytest,
ntree, mtry, nodesize, importance=TRUE, keep.forest = TRUE, ...)

40
Tree-based methods : Bagging trees and random forests Interpretability

Example MTPL
Tuning the model :
library(doParallel)    # also attaches foreach and parallel (makeCluster, clusterCall, %dopar%)
library(rfCountData)

set.seed(87)
folds = createFolds(training.set$Nclaim, k = 5, list = TRUE)

grid.param = expand.grid(fold = 1:5,
                         mtry. = seq(from = 8, to = 1, by = -1),
                         nodesize. = c(500, 1000, 5000, 10000))

cl <- makeCluster(12) # Number of nodes for parallel computing
registerDoParallel(cl)
clusterCall(cl, function() library(rfCountData)) # Export package to nodes
set.seed(64, kind = "L'Ecuyer-CMRG")
res = foreach(i = 1:nrow(grid.param), .packages = "rfCountData") %dopar% {
  X = folds[[grid.param[i,]$fold]] # Current fold (-> test set)
  rfPoisson(x = training.set[-X, !names(training.set) %in% c("Nclaim", "ExpoR")],
            offset = log(training.set[-X,]$ExpoR),
            y = training.set[-X,]$Nclaim,
            xtest = training.set[X, !names(training.set) %in% c("Nclaim", "ExpoR")],
            offsettest = log(training.set[X,]$ExpoR),
            ytest = training.set[X,]$Nclaim,
            ntree = 2000,
            mtry = grid.param[i,]$mtry., # Current mtry
            nodesize = grid.param[i,]$nodesize., # Current nodesize
            keep.forest = TRUE)
}
stopCluster(cl)
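The grid-search results can then be summarized, for example as follows (a sketch: it recomputes the Poisson deviance of each fitted forest on its held-out fold via predict(), using the same call signature as on the prediction slide below, and it assumes predict() returns the expected number of claims given the offset; poisson_dev is the same helper as in the permutation-importance sketch). It defines the mtry_star and nodesize_star used on the next slides.

poisson_dev <- function(y, mu) 2 * mean(ifelse(y == 0, 0, y * log(y / mu)) - (y - mu))

grid.param$cv.dev <- sapply(seq_len(nrow(grid.param)), function(i) {
  X  <- folds[[grid.param[i, ]$fold]]
  mu <- predict(res[[i]],
                newdata = training.set[X, !names(training.set) %in% c("Nclaim", "ExpoR")],
                offset  = log(training.set[X, ]$ExpoR))
  poisson_dev(training.set[X, ]$Nclaim, mu)
})

cv.results <- aggregate(cv.dev ~ mtry. + nodesize., data = grid.param, FUN = mean)
best <- cv.results[which.min(cv.results$cv.dev), ]
mtry_star     <- best$mtry.
nodesize_star <- best$nodesize.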

41
Tree-based methods : Bagging trees and random forests Interpretability

Example MTPL
Tuning the model :
[Figure: cross-validation results. Poisson deviance (roughly between 0.530 and 0.550) as a function of mtry (8 down to 1), one panel per nodesize (500, 1 000, 5 000, 10 000).]

42
Tree-based methods : Bagging trees and random forests Interpretability

Example MTPL
Optimal RF :
> # fit optimal model
> optimal_rf = rfPoisson(x = training.set[,!names(training.set)
%in% c("Nclaim", "ExpoR")],
offset = log(training.set$ExpoR),
y = training.set$Nclaim,
xtest = validation.set[,!names(validation.set)
%in% c("Nclaim", "ExpoR")],
offsettest = log(validation.set$ExpoR),
ytest = validation.set$Nclaim,
ntree = 2000,
mtry = mtry_star,
nodesize = nodesize_star,
keep.forest = TRUE,
do.trace = TRUE,
importance=TRUE)

> print(optimal_rf)
Call:
rfPoisson(x = ...)
Type of random forest: regression
Number of trees: 2000
No. of variables tried at each split: 3

43
Tree-based methods : Bagging trees and random forests Interpretability

Example MTPL
Optimal RF :
> # Optimal number of trees
> par(mfrow=c(1,2))
> plot(optimal_rf, xlim=c(0,2000), ylim=c(0.54,0.55))
> plot(optimal_rf, xlim=c(0,500), ylim=c(0.54,0.55), main="Zoom")

[Figure: Poisson deviance (about 0.540 to 0.550) as a function of the number of trees; left panel 0 to 2 000 trees, right panel zoom on 0 to 500 trees.]

44
Tree-based methods : Bagging trees and random forests Interpretability

Example MTPL
Optimal RF :
> # relative importances
> imp <- importance(optimal_rf, type = 1)
> impvar <- rownames(imp)[order(imp[, 1], decreasing = TRUE)]
> par(mfrow=c(1, 1))
> varImpPlot(optimal_rf, sort = TRUE, type = 1)

[Figure: permutation variable importances (%IncLossFunction), in decreasing order: AgePh, Split, Fuel, AgeCar, Cover, Gender, PowerCat, Use.]

45
Tree-based methods : Bagging trees and random forests Interpretability

Example MTPL
Optimal RF :
> # partial dependences
> op <- par(mfrow=c(2, 4)) # for all features (here: 8 features)
> for (i in seq_along(impvar)) {
+partialPlot(optimal_rf, training.set, x.var = impvar[i], offset =log(training.set$ExpoR),
+ xlab = impvar[i], main = paste("Partial Dependence on", impvar[i]))
+}
> par(op)

[Figure: partial dependence plots of the optimal random forest for the eight features: AgePh, Split, Fuel, Cover (top row) and AgeCar, Gender, PowerCat, Use (bottom row).]

46
Tree-based methods : Bagging trees and random forests Interpretability

Example MTPL
Predictions :
> data$pred = predict(optimal_rf, offset = log(data$ExpoR), newdata = data)
> head(data, 20)
AgePh AgeCar Fuel Split Cover Gender Use PowerCat Latitude Longitude ExpoR Nclaim pred
50 12 Gasoline Monthly TPL.Only Male Private C2 50.5 4.21 1.00000 1 0.16459
64 3 Gasoline Yearly Limited.MD Female Private C2 50.5 4.21 1.00000 0 0.10225
60 10 Diesel Yearly TPL.Only Male Private C2 50.5 4.21 1.00000 0 0.12587
77 15 Gasoline Yearly TPL.Only Male Private C2 50.5 4.21 1.00000 0 0.09129
28 7 Gasoline Half-Yearly TPL.Only Female Private C2 50.5 4.21 0.04658 1 0.01007
26 12 Gasoline Quarterly TPL.Only Male Private C2 50.5 4.21 1.00000 0 0.31131
26 8 Gasoline Half-Yearly Comprehensive Male Private C2 50.5 4.21 1.00000 1 0.21448
58 14 Gasoline Quarterly TPL.Only Female Private C2 50.5 4.21 0.40274 0 0.06025
59 6 Gasoline Half-Yearly Limited.MD Female Private C1 50.5 4.21 1.00000 0 0.09995
57 10 Gasoline Half-Yearly Limited.MD Female Private C1 50.5 4.21 1.00000 0 0.10124
62 5 Gasoline Yearly Limited.MD Male Private C1 50.5 4.21 1.00000 0 0.08188
57 15 Gasoline Yearly TPL.Only Male Private C2 50.5 4.21 1.00000 0 0.09640
30 10 Gasoline Monthly Limited.MD Male Private C2 50.5 4.21 1.00000 1 0.20159
47 14 Gasoline Monthly TPL.Only Female Private C1 50.5 4.21 1.00000 0 0.16334
67 8 Gasoline Yearly Limited.MD Male Private C2 50.5 4.21 1.00000 0 0.08522
62 7 Gasoline Quarterly Comprehensive Male Professional C2 50.5 4.21 1.00000 0 0.12341
82 10 Gasoline Yearly Limited.MD Male Private C2 50.5 4.21 0.73425 0 0.06909
33 15 Gasoline Half-Yearly TPL.Only Male Private C1 50.5 4.21 0.31507 0 0.04204
43 2 Diesel Half-Yearly Comprehensive Male Private C3 50.5 4.21 1.00000 0 0.12183
51 7 Gasoline Yearly Limited.MD Male Private C4 50.5 4.21 1.00000 0 0.10955

47
Tree-based methods : Bagging trees and random forests Interpretability

Example MTPL
Comparison with a single regression tree :
→ Generalization error :
$$\widehat{\mathrm{Err}}^{\mathrm{val}}\left(\hat{\mu}^{\mathrm{rf}}_{D,\Theta}\right) = 0.5440970.$$

Model   Generalization error
Tree    0.5452772
RF      0.5440970
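The validation-set figure in this table can be reproduced along the following lines (a sketch reusing the predict() call signature of the prediction slide; it assumes predict() returns the expected number of claims given the offset):

poisson_dev <- function(y, mu) 2 * mean(ifelse(y == 0, 0, y * log(y / mu)) - (y - mu))
mu_val <- predict(optimal_rf,
                  newdata = validation.set[, !names(validation.set) %in% c("Nclaim", "ExpoR")],
                  offset  = log(validation.set$ExpoR))
poisson_dev(validation.set$Nclaim, mu_val)   # should be close to the 0.5440970 reported above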

48
Tree-based methods : Bagging trees and random forests Interpretability

Example MTPL
Comparison with a single regression tree :
→ Model granularity (impact of the age) : regression tree :

[Figure: single regression tree for the claim frequency (rpart plot). The root split is on AgePh (>= 30 vs < 30), followed by splits on Split, AgePh again, Fuel, Cover, Gender and AgeCar; the predicted claim frequencies in the terminal nodes range from about 0.079 to 0.24.]

49
Tree-based methods : Bagging trees and random forests Interpretability

Example MTPL
Comparison with a single regression tree :
→ Model granularity (impact of the age) : RF :

[Figure: random forest partial dependence plots for the eight features (same panels as above); the partial dependence on AgePh is much more granular than the step function produced by the single regression tree.]
50
