Trick or Treat? Application of Neural Networks in Insurance
January 10th, 2019
This is not an academic paper, and the opinions therein are only our own. Our view is captured by the following three points.

1 Neural networks perform one trick, matching output to input, really well.
2 Lots of business processes can be seen as performing that trick.
3 The barrier to entry for applying neural networks is low.

1.2 About the authors

The authors work in consulting in the CFRO department of KPMG in Munich, in the areas of actuarial science and risk management.

1.3 What is a neural network?

A neural network picks the best output from a list of choices, given an input. If a business process can be expressed as picking an output from a list of choices, and you want to automate it, then a neural network is probably the way to go. Other methods to perform the same task have long existed – the key innovation, we think, is that a neural network is really good at this task.

A neural network is a machine learning technique. It falls under the label artificial intelligence (AI), because it can replace tasks where a human would be expected to make a decision; it is sometimes treated as synonymous with AI, but the field of AI is much broader. The idea of a neural network is simple, but it has many applications, and it is changing the way we work.

The mathematical basis of a neural network is not hard to understand, but it can be very difficult to understand how the parts combine. Importantly, with the available software libraries, the mathematics does not need to be understood to use neural networks. Nevertheless, training and debugging models is easier with the mathematical basis.

1.4 What type of people build neural networks?

Machine learning experts and data scientists typically train neural networks. As a general rule, a machine learning expert is expected to be able to optimize the algorithms, and a data scientist is expected to know more about testing hypotheses.

More recently, the software has become so practical that people from other fields are training their own neural networks. Increasingly, the tools are used by actuaries, risk managers, underwriters, and more. National actuarial organizations in many countries are actively launching data science initiatives.

There is no formal definition of a data scientist. A panel discussion on registering data scientist as a protected profession, held at the German Data Science Days conference in Munich in 2018, showed the idea to be very unpopular with practitioners.

In our experience in the actuarial and risk area, it is often easier to train people with mathematical qualifications and an affinity for coding and databases than to directly find data scientists.
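The "one trick" described in the opening points can be made concrete: a trained network assigns a score to each possible output, and its answer is the choice with the highest score. A minimal sketch in plain Python; the claim-handling choices and scores here are illustrative assumptions, not output from any real model:

```python
# A trained neural network typically ends in a scoring step: each
# possible output ("choice") gets a score, and the network's answer
# is the choice with the highest score (the arg-max).
def pick_output(scores):
    """Return the choice with the highest score."""
    return max(scores, key=scores.get)

# Hypothetical scores for one insurance claim, e.g. from a softmax layer.
claim_scores = {"pay automatically": 0.81,
                "manual review": 0.16,
                "request documents": 0.03}

print(pick_output(claim_scores))  # -> pay automatically
```

Everything before the arg-max is what the training process learns; the decision rule itself stays this simple.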
The computing part involves mathematics, and some engineering to set the computers up. But the bulk of the work looks more like chemistry: designing experiments, setting out what you expect to learn, trying to set up the experiment just right, waiting, and documenting the results. It is very easy to waste calendar time by not being systematic with the experimentation.

Validate the results

Validating neural networks is not much different from validating models in the actuarial, risk management, quant, or statistical world. Validation is still all about separating training and validation data, analysing residuals, and producing good graphics. However, explaining the parameters fitted by a neural network can be next to impossible.

Traditional statistical methods and machine learning usually use cross validation, where the training and testing are done multiple times. With lots of data, larger neural networks can take a lot of computing power to train. Training and validation data are therefore often defined once, and a test data set is kept aside.

Use it to predict

Once the neural network has been trained, using it to predict is an almost instantaneous process. The input is passed to the network, and the answer comes out.

Decisions about how and when to update the network are complicated. The only new consideration versus traditional modelling in insurance is that computing time might be expensive.

3 Future of risk & regulatory

3.1 Verification with a traditional model

A neural network is a black box. If it does something weird, it might be hard to explain to your stakeholders. It is a good idea to have a simple, interpretable model as a baseline for the neural network, where regulated and risky activities are concerned.

One of the main problems preventing the wide usage of neural networks in the insurance industry is the lack of interpretability of those models. Research has shown that there are some mathematical concepts to increase interpretability, but it is early days yet.

We see a ‘trust but verify’ method being used in practice. Say two models are fitted: a traditional model and a neural network. The neural network is assumed to have better predictive power; the traditional model can be explained easily. The rule goes: if both models give a similar answer, use the neural network. If two materially different answers are given, investigate.

3.2 Ethics of letting a machine make decisions

In some of the discussions that we have been part of, the consensus has been that a machine can act to benefit the customer, but it cannot take decisions that disadvantage the customer. This means automating the payment of a claim or the acceptance of an application, for example, but not automatically refusing a claim or application.

3.3 Going beyond best estimates

The main non-standard work we face using neural networks for insurance is fitting probability distributions, which we solve by estimating conditional density function parameters.

Standard neural networks provide point estimates, not estimates of distribution functions, so they are not able to compute percentiles or risk measures. Since Solvency II and other regulatory risk frameworks require metrics like value at risk to compute the capital requirements, it is necessary to estimate features of probability distributions. This extension requires some extra programming work.
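As a sketch of that extra step: suppose the network predicts, for each policy, the parameters of a conditional claim-severity distribution. The lognormal form and the numbers below are illustrative assumptions, not from the paper; the point is only that percentiles such as a 99.5% value-at-risk level come from the fitted distribution, not from the network directly:

```python
import math
from statistics import NormalDist

def lognormal_quantile(mu, sigma, p):
    """p-quantile of a lognormal with log-scale parameters mu, sigma."""
    # exp of the corresponding normal quantile gives the lognormal quantile.
    return math.exp(NormalDist(mu, sigma).inv_cdf(p))

# Hypothetical network output for one policy: mu = 7.2, sigma = 0.9.
# A Solvency II style 99.5% percentile of the claim severity:
var_99_5 = lognormal_quantile(7.2, 0.9, 0.995)
```

In practice the network's loss function is set up so that it outputs (mu, sigma) per risk; any risk measure of the resulting distribution is then available.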
Neural networks stack potentially hundreds or thousands of neurons and, hence, can model very complicated non-linear relationships. It has been proved that any continuous mathematical relationship can be approximated to arbitrary accuracy by a sufficiently complicated neural network.

Activation Function

There is no single activation function that works best in all cases. Recent research has shown that a mixture of activation functions can be better. The choice of activation functions for your problem affects both predictive power and the time it takes to train the network.

The rectified linear unit (ReLU) function, or the ‘rectifier’, f(x) = max(0, x), is currently favoured by many practitioners, because it works. The ReLU activation function is quick to calculate. It also allows neurons to be switched on and off for different data items, making it easy to represent multiple relationships in the same network.

Several modifications to ReLU are available. Exponential linear units (ELUs) have been shown to allow higher classification accuracy than ReLUs. Leaky ReLUs allow a small positive gradient when the ReLU is not active. Softplus approximates the ReLU with a smooth function.

In the early days of neural networks the sigmoid function, with its S-shaped curve, was a very popular choice, but it was hard to train. It compresses the values of the output to the range between zero and one. The tanh function became accepted as an alternative to speed up training, and it is still used in recurrent neural networks for predicting sequential data.

The list of activation functions in use is continually growing.

Layers

In a basic neural network, a single layer usually has the same activation function for each neuron. Extensions of the basic neural network architecture are used to solve particular problems.

For sequential data, like text or time series data, recurrent layers introduce a mechanism for learning time-dependent relationships. The neurons receive information not only from previous layers, but also from previous rounds in the same layer. This means that the order of the given information is important.

Gated recurrent layers and long short-term memory (LSTM) layers are used to represent deeper relationships in sequential data. The LSTM introduces a memory cell that is able to forget. LSTMs can learn very complex sequences and are used for time series analysis with seasonal effects. They are also widely used for handwriting and speech recognition.

In image classification, convolutional layers use a sliding window to move across the image and calculate a function of groups of pixels at a time. The results of this function are then used by the other layers. The neurons in a convolutional layer are only connected to local regions of the image.

Pooling layers are also sometimes used to compress the data in image classification, but their use is controversial. Quoting Geoffrey Hinton, a pioneer of deep learning: “The pooling operation used in convolutional neural networks is a big mistake and the fact that it works so well is a disaster.”

4.2 How do neural networks learn?

The parameters of a neural network are trained via gradient descent. Given a performance measure, the gradients can be computed with respect to the network parameters via back propagation. In back propagation, the chain rule for derivatives is used recursively, starting at the last layer and passing the gradients backwards through the neural network. The parameters are then updated by taking small steps in the opposite direction of the calculated gradients.
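The chain-rule computation can be sketched for the smallest possible network: a single sigmoid neuron with squared-error loss. This is a toy illustration of the principle, not how the libraries implement it:

```python
import math

def forward(w, b, x):
    """One neuron: a linear step followed by a sigmoid activation."""
    z = w * x + b
    return 1.0 / (1.0 + math.exp(-z))

def gradients(w, b, x, t):
    """Back propagation: apply the chain rule from the loss backwards."""
    y = forward(w, b, x)
    dL_dy = 2.0 * (y - t)        # derivative of the loss (y - t)^2
    dy_dz = y * (1.0 - y)        # derivative of the sigmoid
    dL_dz = dL_dy * dy_dz        # chain rule, passed backwards
    return dL_dz * x, dL_dz      # dL/dw and dL/db

# One gradient-descent step: move against the gradient.
w, b, x, t = 0.5, 0.1, 2.0, 1.0
dw, db = gradients(w, b, x, t)
w, b = w - 0.1 * dw, b - 0.1 * db
```

With many layers, the same pattern repeats: each layer multiplies the incoming gradient by its own local derivative and passes the result further back.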
The gradients are not usually computed on the whole training data set at once. A small, randomly chosen batch is used to calculate the gradients and update the model parameters at each training step. This process is called mini-batch gradient descent, and it is a variant of stochastic gradient descent. A training step is a single calculation of the gradients and parameter update, and an epoch is a full cycle through all the data.

Neural networks are prone to overfitting, but methods are available to mitigate this, called regularization. Dropout regularization randomly selects neurons that are left out during a training step. This forces the neural network not to rely heavily on specific nodes, and it is believed to improve generalisation. Another regularization technique adds the L2-norm of the neural network weights to the loss, which also smooths the loss surface and improves the learning process.

In theory, training a neural network is pretty easy. In practice, selecting the hyper-parameters and network structure is as much an art as a science. It takes a lot of trial and error to find a good network architecture, and the experimentation itself is very time consuming.

5 Tools & toys

Applying mathematical models has long been a matter of choosing the right tools. At a conference on Bayesian statistics in 2006 at Lago di Como, the late John Nelder described how he had released the GLIM software in the 1970s so that non-statisticians could apply generalized linear models. At a seminar at Google’s office in Munich in 2018, one of the lead engineers of TensorFlow described how the software is being developed to enable non-experts in machine learning to apply neural networks.
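The training loop of section 4.2, with mini-batches and epochs, can be sketched in plain Python for a one-parameter model y = w·x with squared-error loss. The data and learning rate are assumed for illustration; real networks use a library rather than a hand-written loop:

```python
import random

random.seed(0)

# Toy data: inputs 1..8 with targets following y = 3x exactly.
data = [(x, 3.0 * x) for x in range(1, 9)]

w = 0.0           # the single model parameter, starting at zero
lr = 0.01         # learning rate: the size of each small step
batch_size = 2

for epoch in range(100):              # an epoch is a full cycle through the data
    random.shuffle(data)              # mini-batches are chosen randomly
    for i in range(0, len(data), batch_size):
        batch = data[i:i + batch_size]
        # Gradient of the mean squared error on this batch w.r.t. w.
        grad = sum(2.0 * x * (w * x - y) for x, y in batch) / len(batch)
        w -= lr * grad                # one training step

print(round(w, 4))  # -> 3.0
```

Each pass through the inner loop is a training step; the outer loop counts epochs. An L2 penalty would simply add a term proportional to w to the gradient.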
and the projects’ main sponsors are different, but the functionality is increasingly the same.

All three libraries can be used on a laptop and are supported by cloud platforms. Operating system images with the software pre-installed are available on most cloud platforms.

In contrast to Keras, TensorFlow and PyTorch offer low-level as well as high-level application programmers’ interfaces (APIs). The low-level APIs come with support for machine learning generally and offer low-level functions, providing much more flexibility than Keras. PyTorch and TensorFlow have frontends for C++, which is intended to enable research in high-performance, low-latency C++ applications.

Since TensorFlow’s Estimator framework and PyTorch’s Ignite, the respective high-level APIs, were launched, it is much easier to get started with neural networks using PyTorch or TensorFlow.

TensorBoard

TensorBoard is a tool for visualization, specially developed for TensorFlow, but it can also be used with PyTorch. It visualizes the whole training process of your models.

5.2 Programming Language

If your team consists only of computer scientists, software engineers, or similar, then they might want Scala, C++, or JavaScript, and that’s fine. If, like us, you have a mixed team from various backgrounds, either Python or R is a better choice.

Python & R

Python and R are high-level languages that allow the application of algorithms without having to handle all the ugly details underneath. The languages frequently reference the same libraries to do the hard work.

Both languages are powerful tools for statistical modeling, and both have advantages and disadvantages. R is primarily used for exploratory and statistical work. In comparison to R, Python focuses on productivity and code readability. The choice depends on the task, knowledge of the language, and personal preference.

Packages and Modules

The module repository of Python, called PyPI, lags far behind R’s package repository CRAN for statistical methods, although it is catching up. Libraries like NumPy and scikit-learn for Python continue to grow and improve, but their focus is on basic functionality in comparison to the readily implemented statistical models in R.

Visualization

Python and R both support visualization, but Python’s support is not built-in. Python supports graphical visualization with basic libraries like Matplotlib or advanced ones like Plot.ly. It requires more effort to achieve basic results than in R, but you also have more possibilities, like interactive dashboards and better support for web applications.

6 Discussion

Neural networks have suddenly become easy to train and popular. In insurance, as in other industries, we are finding a gold rush of business problems we can apply this new technology to.

Applying neural networks needs a combination of business and technological acumen.

Our outlook is: if there is a repeated task that involves picking an output from a list of choices, it is probably going to be automated using neural networks in the near to medium-term future.