Subject CS2
Revision Notes
For the 2019 exams
Machine learning
Booklet 12
CONTENTS
Copyright agreement
Legal action will be taken if these terms are infringed. In addition, we may
seek to take disciplinary action through the profession or through your
employer.
These conditions remain in force after you have finished using the course.
These chapter numbers refer to the 2019 edition of the ActEd course notes.
The numbering of the syllabus items is the same as that used by the Institute
and Faculty of Actuaries.
OVERVIEW
There are two main types of machine learning algorithms that can be used.
With supervised learning we work with an existing dataset where we already
know the right answer, eg a set of insurance claims whose validity has
already been investigated. The machine learning algorithm is trained to give
the right answers for this dataset, after which it can be applied to new claims
to identify any suspicious ones.
With unsupervised learning we work with a dataset where we don’t know the
right answer. It is the job of the machine learning algorithm to look for
patterns in the data and to suggest a possible solution. For example, a
motor insurer with policies sold throughout the country may wish to divide
the country into 20 geographical areas (based on postcode) so that all the
policies sold in a particular area have similar claims experience. The
algorithm will suggest possible groupings of postcodes that might be suitable
to treat as homogeneous groups. The resulting 20 groups could then be
used for calculating premiums and for calculating the reserves the insurer
needs to hold.
The main algorithms we will look at here are the naïve Bayes method,
decision trees and the k-means algorithm.
As the name suggests, the naïve Bayes method is based on Bayes' theorem
from probability theory. It classifies items by comparing the likelihood of
the observed features under each possible classification. This method
requires very little data and can produce surprisingly good results.
Decision trees (which are also called CART methods – Classification and
Regression Trees) classify items by working sequentially through a series of
questions in the form of a flow chart. This leads to a final node in the tree,
which is the predicted classification for that item. We also look at a number
of measures, including the Gini index, which can be used to measure how
effective a particular decision tree is at making correct classifications.
The k-means algorithm is based on the idea that items that would be close
together when considered as points in multi-dimensional space can be
considered to be similar. The algorithm starts by allocating the data items
randomly to k groups (or clusters), then uses an iterative process to improve
the groupings until each group forms a cluster of items that are all close
together. New items can then easily be classified by finding which of the k
groups they fall within. We can vary the number of groups so that there are
enough to separate out the different types of item effectively but not so many
as to make the groupings artificial or unmanageable.
CORE READING
All of the Core Reading for the topics covered in this booklet is contained in
this section.
The text given in Arial Bold Italic font is additional Core Reading that is not
directly related to the topic being discussed.
____________
Introduction
Target function
    y = f(x1, x2, ...)
        ↓
Data
    (x11, x21, ..., y1)
    (x12, x22, ..., y2)
    ...
    (x1N, x2N, ..., yN)
        ↓
Hypotheses                 Learning algorithm      Hypothesis
    y = h1(x1, x2, ...)    →                   →   y = g(x1, x2, ...)
    y = h2(x1, x2, ...)
    ...
    y = hM(x1, x2, ...)

g(x1, x2, ..., xJ) ≈ f(x1, x2, ..., xJ)
We use the data to develop a hypothesis which relates the data to the
output. Let the hypothesis be:

y = g(x1, x2, ..., xJ)

The idea is that g(x1, x2, ...) should be close to the unknown function
f(x1, x2, ...).
So, for example, in classical linear modelling the hypothesis set might
be the set of linear relationships:

y = β0 + β1 x1 + β2 x2 + ... + βJ xJ
In the example studied by Chalk and McMurtrie, the task was to predict
the cause codes of aviation accidents from the words in brief
narratives of the accidents.
A key element of this scenario is that we are going to apply the results
of the exercise to data that were not used to develop the algorithm.
This means that we are interested in the performance of the algorithm
not just in the sample of N cases in our data, but ‘out of sample’.
This is not always the case in statistical modelling (where we are often
content with the model which ‘fits’ our data best).
____________
Model evaluation
But evaluating a predictive model involves more than this. Even the
‘best’ model may not be a very good predictive model. And even if it is
good, it might take a very long time to find the correct parameters, or it
might be very difficult to interpret (and explain to clients).
Precision = TP / (TP + FP)
____________
Recall = TP / (TP + FN)
____________
F1 score = (2 × Precision × Recall) / (Precision + Recall)
____________
False positive rate = FP / (TN + FP)
There is a trade-off between the recall (the true positive rate) and the
false positive rate (the percentage of cases which are not positives, but
which are classified as such).
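As an illustration (not part of the Core Reading), the four measures above can be computed directly from confusion-matrix counts. The function names and the counts below are our own invented examples:

```python
# Classification metrics from confusion-matrix counts, following the
# formulae above.  The counts used in the example are made-up values.

def precision(tp, fp):
    # Proportion of predicted positives that are true positives
    return tp / (tp + fp)

def recall(tp, fn):
    # Proportion of actual positives that are correctly identified
    return tp / (tp + fn)

def f1_score(tp, fp, fn):
    # Harmonic mean of precision and recall
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

def false_positive_rate(fp, tn):
    # Proportion of actual negatives that are classified as positive
    return fp / (tn + fp)

# Illustrative counts: 80 TP, 20 FP, 10 FN, 890 TN
tp, fp, fn, tn = 80, 20, 10, 890
print(precision(tp, fp))            # 0.8
print(recall(tp, fn))               # 80/90 = 0.888...
print(false_positive_rate(fp, tn))  # 20/910 = 0.0219...
```

Note that the F1 score simplifies to 2TP / (2TP + FP + FN), so here it equals 160/190 ≈ 0.842.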
____________
14 The trade-off between recall and the false positive rate can be
illustrated using a receiver operating characteristic (ROC) curve.
The area under the ROC curve provides another single-figure measure of the
efficacy of the model. The further the ROC curve lies from the diagonal,
the greater the area under the curve and the better the model is at
correctly classifying the cases.
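The area under the ROC curve can be estimated numerically. The sketch below is an illustration only (not Core Reading code): it assumes each case has a numerical score and a 0/1 label, that both classes are present, and it does not handle tied scores:

```python
def roc_points(scores, labels):
    """(FPR, TPR) pairs obtained by sweeping the threshold down the scores."""
    pos = sum(labels)
    neg = len(labels) - pos
    pts = [(0.0, 0.0)]
    tp = fp = 0
    # Sort cases by decreasing score; lowering the threshold admits one case at a time
    for s, y in sorted(zip(scores, labels), key=lambda t: -t[0]):
        if y == 1:
            tp += 1
        else:
            fp += 1
        pts.append((fp / neg, tp / pos))
    return pts

def auc(pts):
    """Area under the ROC curve by the trapezium rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area
```

A model that ranks every positive above every negative gives an AUC of 1; random guesswork gives an AUC of about 0.5 (the diagonal).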
____________
This figure compares the ROC curves for a logistic regression model
fitted to the cause codes for aircraft accidents with a naïve model
based on random guesswork.
____________
But how can we assess the likely predictive performance of the model?
Can we be sure that we can use machine learning to test numerous
hypotheses and eventually pick one which will generalise acceptably to
new data?
P[ |Ein(g) − Eout(g)| > ε ] ≤ 4 mH(2N) e^(−ε²N / 8)

where:
Ein(g) is the in-sample error of the chosen hypothesis g, Eout(g) is its
out-of-sample error, mH(·) is the growth function of the hypothesis set H,
and ε > 0 is the chosen tolerance.
This may be true in theory, but how do we test the performance of our
model out-of-sample?
____________
Train-validation-test
In practice, the ‘training’ data is often split into a part used to estimate
the parameters of the model, and a part used to validate the model.
a training data set: the sample of data used to fit the model
a validation data set: the sample used to compare candidate models and to tune any hyper-parameters
a test data set: the sample used to assess the predictive performance of the final model.
Parameters
Hyper-parameters

An example of a hyper-parameter is the rate at which the model should learn
from the data.
Over-fitting
21 So there is a trade-off here, between bias – the lack of fit of the model
to the training data – and variance – the tendency for the estimated
parameters to reflect the specific data we use for training the model.
____________
Validation
22 One way to assess how the predictive ability of the model changes as
the number of parameters / features increases is to withhold a portion
of the ‘training’ data and use it to validate models with different
numbers of parameters / features J . One approach is to divide the
training data into, say, s slices, and to ‘train’ the model s times, using
a different slice for validation each time. This is called s-fold
cross-validation.
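As an illustration (not part of the Core Reading), s-fold cross-validation can be sketched in Python. The model here is a deliberately trivial "predict the training mean" model, and all function names are our own:

```python
import random

def s_fold_cross_validate(data, s, train_fn, score_fn, seed=0):
    """Split `data` into s slices; train on s-1 slices and validate on the
    held-out slice, once per slice.  Returns the s validation scores.
    `train_fn` and `score_fn` stand in for whatever model is being used."""
    rows = list(data)
    random.Random(seed).shuffle(rows)
    folds = [rows[i::s] for i in range(s)]          # s roughly equal slices
    scores = []
    for i in range(s):
        validation = folds[i]
        training = [r for j, f in enumerate(folds) if j != i for r in f]
        model = train_fn(training)
        scores.append(score_fn(model, validation))
    return scores

# Toy model: predict the training mean; score by mean squared error
train_fn = lambda rows: sum(rows) / len(rows)
score_fn = lambda model, rows: sum((y - model) ** 2 for y in rows) / len(rows)
scores = s_fold_cross_validate(range(100), 5, train_fn, score_fn)
print(len(scores))  # 5
```

The average of the s validation scores is then used to compare models with different numbers of parameters or features.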
____________
How can we achieve a good balance between bias and variance? Put
another way, is there a method that can use all the features to choose
the final hypothesis g , but will prevent it becoming too complex so
that generalisation is poor? There is, and it is called regularisation or
penalisation.
____________
Regularisation
L*(w1, w2, ..., wJ) + λ Σ(j=1 to J) wj²
____________
26 As noted earlier, since minimising the loss function is, in some models,
equivalent to maximising the likelihood, minimising this expression is
equivalent to maximising a penalised likelihood.
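As an illustration (not part of the Core Reading), the penalised objective can be written out explicitly. The squared-error loss and the function name below are our own assumptions for the sketch:

```python
def penalised_loss(weights, features, targets, lam):
    """Squared-error loss plus the ridge-style penalty lam * sum of squared
    weights.  Each row of `features` is (x1, ..., xJ); the prediction is
    the weighted sum of the features."""
    loss = 0.0
    for x, y in zip(features, targets):
        pred = sum(w * xj for w, xj in zip(weights, x))
        loss += (y - pred) ** 2
    penalty = lam * sum(w * w for w in weights)
    return loss + penalty

# With a perfect fit, only the penalty term remains:
print(penalised_loss([1, 2], [[1, 1]], [3], 0.1))  # 0.5 = 0.1 * (1 + 4)
```

Increasing λ pushes the minimising weights towards zero, which is how regularisation keeps the hypothesis from becoming too complex.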
____________
supervised learning
unsupervised learning
semi-supervised learning
reinforcement learning.
28 The difference between these lies not (as one might think) in the level
of involvement of the human researcher in the development of the
algorithm, or in the supervision of the machine. Instead, it lies in the
extent to which the machine is given an instruction as to the end-point
(or target) of the analysis.
____________
Supervised learning
Regression vs classification
Generating classifications
P(y | x1, x2, ...) = P(x1, x2, ..., y) / P(x1, x2, ...)

P(x1, x2, ..., y) = P(y) × Π(j=1 to J) P(xj | y)
Unsupervised learning
Given a set of covariates, the idea is that the machine should try to find
groups of cases which are similar to one another but different from
cases in other groups. In the language we used in the exposed to risk
chapter, we try to divide the data into homogeneous classes. However,
we may not tell the machine in advance what the characteristics of
each of these classes should be, or even how many such classes there
should be. We allow the machine to determine these given a set of
rules which form part of the algorithm. Machine learning where the
output is not specified in advance is called unsupervised learning.
____________
Semi-supervised learning
Reinforcement learning
41 Example
Some values of Y are more desirable than others, and we want the
agent to take the actions which will lead to desirable outcomes of Y .
How do we achieve this? The agent does not know the future, and
cannot necessarily see how the actions taken at time t will enhance or
reduce the probabilities of Y .
Collecting data
45 During the last 20-30 years the size of datasets available for analysis by
actuaries and other researchers has increased enormously. Datasets,
such as those on purchasing behaviour collected by supermarkets,
relate to millions of transactions.
____________
Cleaning the data, replacing missing values, and checking the data
for obvious errors is an important stage of any analysis, including
machine learning.
Feature scaling
L*(w1, w2, ..., wJ) + λ Σ(j=1 to J) wj²

The penalty imposed for model complexity is λ Σ(j=1 to J) wj², which
clearly depends on the weights and hence on the scale at which the features
are measured.
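Because the penalty depends on the scale of the features, features are usually rescaled before fitting. The two common scalings sketched below are illustrations of our own, not prescribed by the Core Reading (both assume the feature is not constant):

```python
def standardise(values):
    """Scale a feature to mean 0 and (population) standard deviation 1."""
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / sd for v in values]

def min_max_scale(values):
    """Scale a feature onto the interval [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print(min_max_scale([10, 20, 30]))  # [0.0, 0.5, 1.0]
```

After either scaling, a weight of a given size carries a comparable penalty whichever feature it multiplies.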
____________
Splitting the data into the training, validation and test data sets
50 A typical split might be to use 60% of the data for training, 20% for
validation and 20% for testing. However, it depends on the problem and
not on the data. A guide might be to select enough data for the
validation data set and the testing data set so that the validation and
testing processes can function, and to allocate the rest of the data to
the training data set. In practice, this often leads to around a 60% /
20% / 20% split.
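A random 60% / 20% / 20% split can be sketched as follows (an illustration only; the function name and the handling of rounding are our own choices):

```python
import random

def train_validation_test_split(data, fractions=(0.6, 0.2, 0.2), seed=0):
    """Randomly split `data` into training, validation and test sets using
    the given fractions (any rounding remainder goes to the test set)."""
    rows = list(data)
    random.Random(seed).shuffle(rows)
    n = len(rows)
    n_train = int(fractions[0] * n)
    n_val = int(fractions[1] * n)
    return rows[:n_train], rows[n_train:n_train + n_val], rows[n_train + n_val:]

train, val, test = train_validation_test_split(range(100))
print(len(train), len(val), len(test))  # 60 20 20
```

Shuffling before splitting matters: if the data file is ordered (eg by date or by region), taking the first 60% as-is would give unrepresentative training data.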
____________
52 The model should then be validated using the 20% of the data set aside
for this purpose. This should indicate, for example, whether we are at
risk of over-fitting our data. The results of the validation exercise may
mean that further training is required.
____________
53 Once the model has been trained on a set of data, its performance
should be evaluated. How this is done may depend on the purpose of
the analysis. If the aim is prediction, then one obvious approach is to
test the model on a set of data different from the one used for
development. If the aim is to identify hidden patterns in the data, other
measures of performance may be needed.
____________
55 If the performance of the model is not sufficient for the task at hand, it
may be possible to improve its performance. Sometimes the
combination of several different algorithms applied to the same data
set will produce a performance which is substantially better than any
individual model. In other cases, the use of more data might provide a
boost to performance. However, except when considering very simple
combinations of models, care should be taken not to overfit the
evaluation set.
____________
For example, if there are many covariates, including all of them might
make the model unstable (the estimators of the parameters β1, ..., βJ
will have large variances) because of correlations among the x1, ..., xJ.
If we wish to use the model for prediction on new data, this is very
undesirable. We want to be able to trust that the estimated values of
the parameters linking the covariates to the outcome variable are
solidly grounded and unlikely to shift greatly when the model is applied
to a new data set. Another way of saying this is that we only want to
include features which really do have a general effect on output.
____________
AIC = deviance + 2J
P(Br | A) = P(A | Br) P(Br) / P(A)    for r = 1, 2, ..., R

where:

P(A) = Σ(i=1 to R) P(A | Bi) P(Bi)
____________
P(yi = 1 | x1i, ..., xJi) = P(x1i, ..., xJi | yi = 1) × P(yi = 1) / P(x1i, ..., xJi)

62 The naïve Bayes algorithm assumes that the values of the xji are
independent, conditional on the value of yi.

P(yi = 1 | x1i, ..., xJi)
= P(x1i | yi = 1) × P(x2i | yi = 1) × ... × P(xJi | yi = 1) × P(yi = 1) / P(x1i, ..., xJi)

so that:

P(yi = 1 | x1i, ..., xJi) ∝ P(yi = 1) × Π(j=1 to J) P(xji | yi = 1)
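As an illustration (not part of the Core Reading), this proportionality can be turned into a tiny classifier. The conditional probabilities are estimated by simple relative frequencies, with no smoothing of zero counts, and the data below are invented:

```python
from collections import defaultdict

def train_naive_bayes(rows):
    """Estimate P(y) and P(x_j | y) from (features, label) pairs, where the
    features are tuples of discrete values and the labels are 0/1."""
    class_counts = defaultdict(int)
    feature_counts = defaultdict(int)           # key: (j, value, label)
    for x, y in rows:
        class_counts[y] += 1
        for j, v in enumerate(x):
            feature_counts[(j, v, y)] += 1
    return class_counts, feature_counts, len(rows)

def posterior_score(x, label, model):
    """Unnormalised posterior: P(y) * product over j of P(x_j | y)."""
    class_counts, feature_counts, n = model
    score = class_counts[label] / n
    for j, v in enumerate(x):
        score *= feature_counts[(j, v, label)] / class_counts[label]
    return score

def classify(x, model):
    # Choose the label with the largest unnormalised posterior
    return max(model[0], key=lambda y: posterior_score(x, y, model))
```

A real implementation would smooth the frequency estimates (eg Laplace smoothing) so that a single unseen feature value does not force a probability of zero.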
____________
random forest
The leaf nodes of the tree contain an output variable y which is used
to make a prediction.
____________
66 Example
[Tree diagram, partly recoverable: the root node asks whether Weight > 80kg,
with YES and NO branches; the leaf nodes are labelled Male and Female.]
Given an input of [height = 60cm, weight = 65kg] the above tree would
be traversed as follows:
Given a new input, the tree is traversed by evaluating the specific input
at the root node of the tree.
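The traversal can be sketched in Python. Only the root test (Weight > 80kg) is recoverable from the diagram above; the second split on height and its 180cm threshold are assumed purely for illustration:

```python
# A hand-built binary tree in the spirit of the example above.
# The 180cm height threshold is an assumption, not from the original diagram.

def classify_person(height_cm, weight_kg):
    """Traverse the tree by answering one question per node, starting at the root."""
    if weight_kg > 80:        # root node: Weight > 80kg?
        return "Male"
    if height_cm > 180:       # assumed second split
        return "Male"
    return "Female"

print(classify_person(60, 65))  # Female: both tests answer NO
```

Each input simply follows the YES/NO answers down the tree until it reaches a leaf node, whose label is the prediction.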
____________
67 A learned binary tree is a partitioning of the input space. You can think
of each input variable as a dimension in a p-dimensional space. The
decision tree splits this space up into rectangles (when there are p = 2
input variables) or hyper-rectangles with more inputs.
New data is filtered through the tree and lands in one of the rectangles
and the output value for that rectangle is the prediction made by the
model. This gives an intuition for the type of decisions that a CART
model can make, eg boxy decision boundaries.
____________
Greedy splitting
This is a numerical procedure where all the values are lined up and
different split points are tried and tested using a cost function. The
split with the best cost (the lowest cost, since we minimise cost) is
selected.
All input variables and all possible split points are evaluated and
chosen in a greedy manner (ie the very best split point is chosen each
time).
____________
Σ(i=1 to N) (yi − ŷi)²

where yi is the output for the training sample and ŷi is the predicted
output for the rectangle.
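For illustration (not Core Reading code), a greedy search over split points for a single feature can be written directly from the squared-error cost above; each side of the split predicts its own mean:

```python
def rss_for_split(xs, ys, split):
    """Squared-error cost of splitting at `split`: each side predicts its mean."""
    left = [y for x, y in zip(xs, ys) if x <= split]
    right = [y for x, y in zip(xs, ys) if x > split]
    cost = 0.0
    for side in (left, right):
        if side:
            mean = sum(side) / len(side)
            cost += sum((y - mean) ** 2 for y in side)
    return cost

def best_split(xs, ys):
    """Greedy search: try every observed value as a split point, keep the cheapest."""
    return min(set(xs), key=lambda s: rss_for_split(xs, ys, s))

# Two flat segments: the best split separates them exactly
print(best_split([1, 2, 3, 10, 11, 12], [1, 1, 1, 9, 9, 9]))  # 3
```

A full CART implementation repeats this search over every input variable at every node, then recurses into the two resulting rectangles.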
____________
G = Σk pk (1 − pk)

In a binary classification problem this reduces to:

G = 2 p1 p2

or:

G = 1 − (p1² + p2²)
The Gini index calculation for each node is weighted by the total
number of instances in the parent node. The Gini score for a chosen
split point in a binary classification problem is therefore calculated as
follows:
G = [1 − (g1,1² + g1,2²)] × ng1/n + [1 − (g2,1² + g2,2²)] × ng2/n

Here:
ng1 and ng2 are the total number of instances in groups 1 and 2,
gi,k is the proportion of instances in group i that belong to class k, and
n = ng1 + ng2 is the total number of instances in the parent node.
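As an illustrative sketch (not Core Reading code), the weighted Gini score for a candidate split can be computed as follows, with the two groups given as lists of 0/1 class labels:

```python
def gini_for_split(groups):
    """Weighted Gini score for a candidate split in a binary classification
    problem, following the formula above.  `groups` is a list of the label
    lists produced by the split (eg the YES group and the NO group)."""
    n = sum(len(g) for g in groups)
    score = 0.0
    for g in groups:
        if not g:
            continue                       # an empty group contributes nothing
        p1 = g.count(0) / len(g)
        p2 = g.count(1) / len(g)
        score += (1 - (p1 ** 2 + p2 ** 2)) * len(g) / n
    return score

print(gini_for_split([[0, 0], [1, 1]]))    # 0.0: a perfectly pure split
print(gini_for_split([[0, 1], [0, 1]]))    # 0.5: a completely mixed split
```

The greedy splitting procedure simply evaluates this score for every candidate split point and keeps the one with the lowest value.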
Stopping criterion
Pruning
76 The fastest and simplest pruning method is to work through each leaf
node in the tree and evaluate the effect of removing it using a hold-out
test set. Leaf nodes are removed only if doing so results in a drop in the
overall cost function on the entire test set. You stop removing nodes
when no further improvements can be made.
____________
K-means clustering
dist(x, k) = Σ(j=1 to J) (xj − kj)²
5. Re-assign cases to the nearest cluster centre using the new cluster
centres.
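The full iterative procedure can be sketched in Python. This is an illustration only: the function names and the re-seeding of empty clusters are our own choices, and the squared distance follows the formula above:

```python
import random

def dist_sq(x, centre):
    """Squared distance between a case and a cluster centre (formula above)."""
    return sum((xj - kj) ** 2 for xj, kj in zip(x, centre))

def k_means(cases, k, iterations=10, seed=0):
    """Assign cases randomly to k clusters, then alternately recompute the
    cluster centres and re-assign each case to its nearest centre."""
    rng = random.Random(seed)
    assignment = [rng.randrange(k) for _ in cases]
    for _ in range(iterations):
        centres = []
        for c in range(k):
            members = [x for x, a in zip(cases, assignment) if a == c]
            if not members:
                members = [rng.choice(cases)]      # re-seed an empty cluster
            centres.append(tuple(sum(col) / len(members)
                                 for col in zip(*members)))
        assignment = [min(range(k), key=lambda c: dist_sq(x, centres[c]))
                      for x in cases]
    return assignment, centres
```

Because of the random initial assignment, the algorithm is usually run several times and the clustering with the lowest total within-cluster distance is kept.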
80 The table below shows the strengths and weaknesses of the K -means
algorithm.
Strengths:
- Uses simple principles for identifying clusters which can be explained in non-statistical terms
- Highly flexible and can be adapted to address nearly all its shortcomings with simple adjustments
- Fairly efficient and performs well

Weaknesses:
- Less sophisticated than more recent clustering algorithms
- Not guaranteed to find the optimal set of clusters because it incorporates a random element
- Requires a reasonable guess as to how many clusters naturally exist in the data
log( P(yi = 1) / P(yi = 0) ) = β0 + β1 x1i + ... + βJ xJi
FACTSHEET
This factsheet summarises the main methods, formulae and information
required for tackling questions on the topics in this booklet.
Machine learning
If a data set has a collection of features x1, x2, ..., xJ, each associated
with a corresponding output y, then the aim is to find a hypothesis
y = g(x1, x2, ..., xJ) that provides a good approximation to the underlying
function y = f(x1, x2, ..., xJ) and hence minimises the chosen loss function.
The goal is to find an algorithm that can predict the outcome y for
previously unseen cases.
The model accuracy is the proportion of predictions the model gets right.
Classification
                              Prediction
                     Positive              Negative
True state Positive  True Positive (TP)   False Negative (FN)
True state Negative  False Positive (FP)  True Negative (TN)

Precision = TP / (TP + FP)

Recall (sensitivity) = TP / (TP + FN)

F1 score = (2 × Precision × Recall) / (Precision + Recall)

False positive rate = FP / (TN + FP)
A receiver operating characteristic (ROC) curve plots the recall (ie true
positive rate) against the false positive rate. The greater the area under the
graph, the better the model is at classifying data.
Assessing predictiveness
In-sample data refers to the data used to fit and validate the model.
Out-of-sample data refers to data that are not used to fit the model, ie data
that the algorithm has not seen before.
Parameters
Parameters are estimated from the data and used to make predictions.
Overfitting occurs when too many parameters are used in the model. It can
lead to the model identifying patterns that aren’t really there.
Hence, a model should achieve a balance between bias (ie lack of fit to the
training data) and variance (ie sensitivity of the fitted parameters to the
particular training data used).
Regularisation
One possibility is to add a term λ Σ(j=1 to J) wj² to the loss function, so
that we now minimise:

L*(w1, w2, ..., wJ) + λ Σ(j=1 to J) wj²
Decision trees
For regression problems, the split points are chosen so as to minimise the
squared error cost function:
Σ(i=1 to N) (yi − ŷi)²

where yi is the output for the training sample and ŷi is the predicted
output for the rectangle.
For classification problems, the split points are chosen so as to minimise the
Gini index. The Gini index is a measure of ‘purity’.
In a binary classification problem, the Gini score for a leaf node (ie a final
node in the tree) is:
1 − (p1² + p2²)
where pk is the proportion of sample items of class k present at that node.
The Gini index for a chosen split point (or for the whole tree) is the weighted
average of the Gini score for each node involved, weighted by the number of
items at each node. For a binary classification problem, this is:
G = Σ(over nodes) (nnode / n) × [1 − (p1² + p2²)]
In this case, G must take a value between 0, which means that all the items
at each node are of the same type, and 0.5, which means that the items at
each node are of mixed types.
The Gini index makes no judgement about whether the prediction is right or
wrong. It only provides a measure of how effective the tree is at sorting the
data into homogeneous groups.
Discriminant analysis
Penalised regression
AIC = deviance + 2J

BIC = deviance + J × ln N
Unsupervised learning
K-means clustering
Semi-supervised learning
Reinforcement learning
In machine learning, the term ‘train a model’ is used, whereas other fields
refer to ‘fitting a model’ or ‘choosing parameters’.
In general, models are more useful if they are easy to explain and are
acceptable to regulators.
NOTES