
Data Transformation Using the Window Functions in PySpark

Demonstrated with a use case

I work as an actuary in an insurance company. For various purposes we (securely) collect and store data
for our policyholders in a data warehouse. One of our tasks is to transform this data to obtain useful
information for downstream actuarial analyses.

To demonstrate, one of the popular products we sell provides claims payments in the form of an income
stream in the event that the policyholder is unable to work due to an injury or a sickness (“Income
Protection”). For three (synthetic) policyholders A, B and C, the claims payments under their Income
Protection claims may be stored in a tabular format as below:

Table 1: Claims payment, colour-coded by Policyholder ID. Table by author

An immediate observation of this dataframe is that there exists a one-to-one mapping for some fields, but
not for all fields. In particular, there is a one-to-one mapping between “Policyholder ID” and “Monthly
Benefit”, as well as between “Claim Number” and “Cause of Claim”. However, mappings between the
“Policyholder ID” field and fields such as “Paid From Date”, “Paid To Date” and “Amount” are one-to-
many as claim payments accumulate and get appended to the dataframe over time.

What if we would like to extract information over a particular policyholder Window? For example, the
date of the last payment, or the number of payments, for each policyholder. This may be difficult to
achieve (particularly with Excel, which is the primary data transformation tool for most life insurance
actuaries), as these fields depend on values spanning multiple rows, if not all rows for a particular
policyholder. To visualise, these fields have been added in the table below:

Table 2: Extract information over a “Window”, colour-coded by Policyholder ID. Table by author

Mechanically, this involves first filtering the “Policyholder ID” field to a particular policyholder
(which creates a Window for that policyholder), applying some operations over the rows in this window,
and iterating through all policyholders. Based on the dataframe in Table 1, this article demonstrates
how this can be easily achieved using the Window Functions in PySpark.
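Conceptually, a Window Function performs this filter-and-iterate loop as a single, set-based operation: every row receives an aggregate computed over all rows in its partition, without collapsing the dataframe the way a groupBy().agg() would. As a minimal sketch (assuming a Spark dataframe df with the Table 1 column names):

## Conceptual sketch only: count each policyholder's payments on every row
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("Policyholder ID")
df = df.withColumn("Number of Payments", F.count("*").over(w))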

An Insurance Use Case


To recap, Table 1 has the following features:

• Claims payments are captured in a tabular format. However, no fields can be used as a unique key
for each payment.
• The Monthly Benefits under the policies for A, B and C are 100, 200 and 500 respectively.
However, the Amount Paid may be less than the Monthly Benefit, as a claimant may be unable to
work for only part of a given month.
• It appears that for B, the claims payments ceased on 15-Feb-20 before resuming on 01-Mar-20.
This gap in payments is important for estimating durations on claim, and needs to be allowed for.

Let’s use Window Functions to derive two measures at the policyholder level, Duration on Claim and
Payout Ratio. These measures are defined below:

1. How long each policyholder has been on claim (“Duration on Claim”) at 30-Jun-20, allowing
for any gap in payments. For example, this is 2.5 months (or 77 days) for B.
2. How much of the Monthly Benefit under the policy was paid out to the policyholder, on average,
over the period on claim (“Payout Ratio”). For example, this is 100% for policyholders A and B,
and 50% for policyholder C.

For life insurance actuaries, these two measures are relevant for claims reserving, as Duration on Claim
impacts the expected number of future payments, whilst the Payout Ratio impacts the expected amount
paid for these future payments.

A step-by-step guide on how to derive these two measures using Window Functions is provided below.

Duration on Claim
Step 1 — Import libraries and initiate a Spark session

import numpy as np
import pandas as pd
import datetime as dt
import pyspark
from pyspark.sql.window import Window
from pyspark.sql import functions as F
from pyspark.sql import SparkSession

## Initiate Spark session
spark_1 = SparkSession.builder.appName('demo_1').getOrCreate()

## Convert the dataframe holding the Table 1 claims payment data to a Spark dataframe
df_1 = spark_1.createDataFrame(demo_date_adj)
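The dataframe demo_date_adj holding the Table 1 data is not reproduced in this article. For readers who want to run the code end-to-end, the sketch below builds a hypothetical set of rows consistent with the figures quoted in this article (Monthly Benefits of 100, 200 and 500, a 14-day payment gap for B, a 77-day Duration on Claim for B, and Payout Ratios of 100%, 100% and 50%); the specific dates and amounts are my own assumptions, not the original table. Run it before the createDataFrame call above.

## Hypothetical reconstruction of Table 1 (assumed values, not the original data)
demo_date_adj = pd.DataFrame({
    "Policyholder ID": ["A", "A", "A", "B", "B", "B", "C", "C"],
    "Claim Number": ["A1", "A1", "A1", "B1", "B1", "B1", "C1", "C1"],
    "Cause of Claim": ["Injury"] * 3 + ["Sickness"] * 3 + ["Injury"] * 2,
    "Monthly Benefit": [100, 100, 100, 200, 200, 200, 500, 500],
    "Paid From Date": [dt.date(2020, 4, 1), dt.date(2020, 5, 1), dt.date(2020, 6, 1),
                       dt.date(2020, 1, 31), dt.date(2020, 3, 1), dt.date(2020, 4, 1),
                       dt.date(2020, 5, 1), dt.date(2020, 6, 1)],
    "Paid To Date": [dt.date(2020, 4, 30), dt.date(2020, 5, 31), dt.date(2020, 6, 30),
                     dt.date(2020, 2, 15), dt.date(2020, 3, 31), dt.date(2020, 4, 30),
                     dt.date(2020, 5, 31), dt.date(2020, 6, 30)],
    "Amount Paid": [100, 100, 100, 105, 200, 197, 250, 250],
})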

Step 2 — Define Windows


As we are deriving information at a policyholder level, the primary window of interest is one that
localises the information for each policyholder. In the Python code below:

• Window_1 is a window over “Policyholder ID”, further sorted by “Paid From Date”.
• Window_2 is simply a window over “Policyholder ID”.

Although both Window_1 and Window_2 provide a view over the “Policyholder ID” field, Window_1
further sorts the claims payments for a particular policyholder by “Paid From Date” in ascending
order. This is important for deriving the Payment Gap using the “lag” Window Function, which is
discussed in Step 3.

## Customise Windows to apply the Window Functions to

# Window_1: each policyholder's payments, sorted chronologically
Window_1 = Window.partitionBy("Policyholder ID").orderBy("Paid From Date")

# Window_2: each policyholder's payments, with no meaningful ordering
Window_2 = Window.partitionBy("Policyholder ID").orderBy("Policyholder ID")
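One subtlety: when a window has an orderBy, Spark defaults its frame to run from the start of the partition to the current row and its ordering peers. Because Window_2 orders by the partition key itself, every row in a partition is a peer of every other row, so aggregates over Window_2 still span the whole partition. An arguably more explicit alternative (my own variant, not the article's code) is to declare the frame outright:

# Alternative definition (assumption): an explicit whole-of-partition frame,
# so that aggregates over the window are row-agnostic by construction
Window_2_alt = Window.partitionBy("Policyholder ID") \
    .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)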

Step 3 — Window Functions for Duration on Claim


“withColumn” is a PySpark method for creating a new column in a dataframe.

The following columns are created to derive the Duration on Claim for a particular policyholder, in this
order:

• Date of First Payment — this is the minimum “Paid From Date” for a particular policyholder,
over Window_1 (or indifferently Window_2, since the running minimum under an ascending sort equals
the overall minimum).
• Date of Last Payment — this is the maximum “Paid To Date” over Window_1. Note that, given
Window_1’s ordering, this is a running maximum, which only reaches the true last payment date on a
policyholder’s final row; the aggregation in the last bullet takes care of this.
• Duration on Claim per Payment — this is the Duration on Claim per record, calculated at each row
as Date of Last Payment minus Date of First Payment, plus one day.
• Duration on Claim per Policyholder — this takes the maximum of the “Duration on Claim per
Payment” column above over Window_2, and arrives at a row-agnostic value: the total span from first
to last payment. (A simple sum over Window_2 would count the span once per payment row, so a
maximum is the appropriate aggregate here.)

df_1_spark = df_1.withColumn("Date of First Payment", F.min("Paid From Date").over(Window_1)) \
.withColumn("Date of Last Payment", F.max("Paid To Date").over(Window_1)) \
.withColumn("Duration on Claim - per Payment", F.datediff(F.col("Date of Last Payment"), F.col("Date of First Payment")) + 1) \
.withColumn("Duration on Claim - per Policyholder", F.max("Duration on Claim - per Payment").over(Window_2)) \

As mentioned previously, for a policyholder there may exist Payment Gaps between claims payments. In
other words, within the pre-defined windows, the “Paid From Date” of a particular payment may not
immediately follow the “Paid To Date” of the previous payment. You can see in Table 1 that this is the
case for policyholder B.

For the purpose of actuarial analyses, Payment Gap for a policyholder needs to be identified and
subtracted from the Duration on Claim initially calculated as the difference between the dates of first and
last payments.

The Payment Gap can be derived using the Python code below. F.lag pulls the previous row’s “Paid To
Date” within the window; for a policyholder’s first payment (which has no previous row), the adjusted
date defaults to the current “Paid From Date”, so that its gap is zero:

.withColumn("Paid To Date Last Payment", F.lag("Paid To Date", 1).over(Window_1)) \
.withColumn("Paid To Date Last Payment adj",
            F.when(F.col("Paid To Date Last Payment").isNull(), F.col("Paid From Date"))
            .otherwise(F.date_add(F.col("Paid To Date Last Payment"), 1))) \
.withColumn("Payment Gap", F.datediff(F.col("Paid From Date"), F.col("Paid To Date Last Payment adj"))) \

It may be easier to explain the above steps using visuals. As shown in the table below, the Window
Function “F.lag” is called to return the “Paid To Date Last Payment” column, which, within a
policyholder window, is the “Paid To Date” of the previous row, as indicated by the blue arrows. This is
then compared against the “Paid From Date” of the current row to arrive at the Payment Gap. As
expected, we have a Payment Gap of 14 days for policyholder B.

For the purpose of calculating the Payment Gap, Window_1 is used, as the claims payments need to be in
chronological order for the “F.lag” function to return the desired output.

Table 3: Derive Payment Gap. Table by author

Adding the finishing touch below gives the final Duration on Claim, which is now one-to-one against the
Policyholder ID.

.withColumn("Payment Gap - Max", F.max("Payment Gap").over(Window_2)) \

.withColumn("Duration on Claim - Final", F.col("Duration on Claim - per Policyholder") - F.col("Payment Gap -


Max"))

The table below shows all the columns created with the Python code above.
Table 4: All columns created in PySpark. Table by author

Payout Ratio
The Payout Ratio is defined as the actual Amount Paid for a policyholder, divided by the total Monthly
Benefit payable over the duration on claim. This measures how much of the Monthly Benefit is paid out
for a particular policyholder.
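As a quick sanity check of this definition (using the hypothetical policyholder C rows sketched in Step 1, and the 30.5 days-per-month convention used in the code below):

# Policyholder C (hypothetical rows): 500 paid in total over a 61-day claim
amount_paid_total = 250 + 250               # Amount Paid Total
monthly_benefit_total = 500 * 61 / 30.5     # Monthly Benefit over ~2 months = 1000
print(round(amount_paid_total / monthly_benefit_total, 1))  # 0.5, i.e. 50%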

Leveraging the Duration on Claim derived previously, the Payout Ratio can be derived using the Python
code below. The division by 30.5 converts the Duration on Claim from days into (average-length) months,
so that it can be multiplied by the Monthly Benefit:

.withColumn("Amount Paid Total", F.sum("Amount Paid").over(Window_2)) \
.withColumn("Monthly Benefit Total", F.col("Monthly Benefit") * F.col("Duration on Claim - Final") / 30.5) \
.withColumn("Payout Ratio", F.round(F.col("Amount Paid Total") / F.col("Monthly Benefit Total"), 1))

The outputs are as expected, as shown in the table below. To show the outputs in a PySpark session,
simply add .show() at the end of the code.
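For example, a minimal sketch, assuming the full chain above has been assigned to df_1_spark: because the derived measures are one-to-one against the Policyholder ID, distinct() collapses the payment-level rows into one row per policyholder.

# One row per policyholder with the two derived measures
df_1_spark.select("Policyholder ID", "Duration on Claim - Final", "Payout Ratio") \
    .distinct() \
    .show()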

Table 5: Payout Ratio. Table by author

Examples of Other Window Functions


There are other useful Window Functions. This article provides a good summary.

For example, you can set a counter of payments for each policyholder using the Window Function
F.row_number() per below; you can then apply the Window Function F.max() over this counter to get
the total number of payments, as sketched after the snippet.

.withColumn("Number of Payments", F.row_number().over(Window_1)) \

Another Window Function which is more relevant for actuaries would be the dense_rank() function,
which if applied over the Window below, is able to capture distinct claims for the same policyholder
under different claims causes. One application of this is to identify at scale whether a claim is a relapse
from a previous cause or a new claim for a policyholder.

Window_3 = Window.partitionBy("Policyholder ID").orderBy("Cause of Claim")


...
.withColumn("Claim_Cause_Leg", F.dense_rank().over(Window_3))

Excel
As mentioned in a previous article of mine, Excel has been the go-to data transformation tool for most life
insurance actuaries in Australia. Similar to one of the use cases discussed in that article, the data
transformation required in this exercise would be difficult to achieve with Excel.

Given its scalability, it’s actually a no-brainer to use PySpark for commercial applications involving large
datasets. That said, an Excel solution does exist for this instance, involving the use of advanced array
formulas.

To briefly outline the steps for creating a Window in Excel:

1. Manually sort the dataframe per Table 1 by the “Policyholder ID” and “Paid From Date” fields.
Copy and paste the “Policyholder ID” field to a new sheet/location, and deduplicate.
2. Referencing the raw table (i.e. Table 1), apply the ROW formula with MIN/MAX respectively to
return the row references of the first and last claims payments for a particular policyholder (this is
an array formula, which can take some time to run). For example, as shown in the table below,
this is rows 4–6 for Policyholder A.
3. Based on the row reference above, use the ADDRESS formula to return the range reference of a
particular field. For example, this is $G$4:$G$6 for Policyholder A as shown in the table below.
4. Apply the INDIRECT formulas over the ranges in Step 3 to get the Date of First Payment and
Date of Last Payment.

Table 6: Excel demo. Table by author
