The 11th International Conference on Computer Science & Education (ICCSE 2016)


August 23-25, 2016. Nagoya University, Japan WdP4.5

Customer Profitability Analysis of Automobile Insurance Market Based on Data Mining

Jianbing Xiahou, Software School, Xiamen University, Xiamen, China, jbxiahou@xmu.edu.cn
Yanyu Xu, Software School, Xiamen University, Xiamen, China, 263946553@qq.com
Siyu Zhang, Software School, Xiamen University, Xiamen, China, zhangsiyu_2012@163.com
Wenxuan Liao*, Management School, Xiamen University, Xiamen, China, 550203688@qq.com

Abstract—Data mining is an interdisciplinary technology that draws on the theory and techniques of artificial intelligence, machine learning, statistics and other fields. It can extract implicit but useful information and knowledge from vast amounts of historical enterprise data and provide solid support for company decisions. In the context of the rate reform of the domestic automobile insurance industry, this paper discusses the application of data mining to customer profitability, finds the classification rules before and after the rate reform, and shows the process of customer profitability analysis using a decision tree.

Keywords—Data Mining; Automobile Insurance; Premium Rate Reform; Customer Profitability

I. Introduction

In 2011, the domestic automobile insurance industry implemented market-oriented premium rate policies in some pilot cities, such as Xiamen, to advance premium rate reform. After the reform, premium discounts are linked to customers’ claim records, which enlarges the floating range of premium rates and significantly widens the gap between the premiums paid by different customers. In this situation it is hard for insurance companies to judge customer profitability through traditional statistics or experience, so research on customer profitability analysis in the automobile insurance market is becoming increasingly important. In this paper, on the basis of an enterprise data warehouse, we apply data mining to the massive historical customer data in the databases of insurance companies. We classify customer profitability with a clustering method and use a decision tree to analyze the relation between customers’ attributes and their profit contribution, in order to find the classification rules before and after the rate reform and to set up an evaluation model of customer profitability.

II. Relevant Concepts

A. Basic Concepts of Data Mining

Data mining is the process of extracting implicit but meaningful rules or patterns from large-scale datasets. It is an interdisciplinary field based on statistics, machine learning, databases and other disciplines. It has developed rapidly in recent years and has been applied to many fields, such as finance, retail and telecommunications, so data mining has a bright future. Mainstream data mining systems are developed in strict accordance with CRISP-DM (Cross-Industry Standard Process for Data Mining), which divides the data mining process into six phases: business understanding, data understanding, data preparation, modeling, evaluation and deployment. In practice, data mining is an iterative process of stepwise refinement[1].

By function, data mining can be categorized into category prediction and description. Category prediction trains a model on classified data and uses the trained model to classify unclassified data; common methods include decision trees and Bayesian methods. Description summarizes or partitions the data by analyzing the intrinsic relationships within the given dataset; it mainly includes clustering, association rules, etc.[2].

B. Customer Profitability

Customer profitability analysis pricing[3] argues that, when pricing each transaction, an enterprise should consider its overall relationship with the customer, that is, fully consider the costs and benefits of all transactions between the customer and the enterprise. The benchmark for pricing is the overall yield, also called the customer profitability, which can be computed by

$$P_e = \sum_{i=1}^{n} (R_i - C_i)$$

where $P_e$ is the customer profitability, $R_i$ is the income from the $i$-th customer, that is, the premium paid by the customer, $C_i$ is the expenditure for the $i$-th customer, including the compensation paid to the customer, staff salaries and employee bonuses, and $R_i - C_i$ is the net profit from the $i$-th customer. For convenience of computation, this paper assumes that $C_i$ includes only the compensation paid to customers.

III. Analysis Process

A. Business Understanding

Data mining is not a purely technical process, but a process that combines business and technology, or a business process served by technology. The understanding of the business largely

978-1-5090-2218-2/16/$31.00 ©2016 IEEE 603



determines the ultimate success of a project. Before data mining starts, the first step is to understand the business goals, the second is to set the data mining goals according to the business goals, and the last is to collect and process the relevant data based on these goals.

The process of insuring and settling claims in an insurance company, together with the associated data entry, can be simplified as shown in Figure 1.

Fig. 1. The flow chart of vehicles’ insurance and claim process

The business goal is to divide all customers into multiple disjoint sets according to their customer profitability, to identify the features of each set, and to explain the changes brought by the reform, which will be the foundation for later marketing, service and other activities.

Home vehicle insurance accounts for a large proportion of the vehicle insurance business, with inconsistent customer behavior and a high degree of marketization, and the rate reform only concerns commercial motor vehicle insurance. Therefore, the goal of data mining is to find customers’ features under different profit contributions by using classification methods, and to compare them before and after the rate reform.

B. Data Understanding

After the business requirements are determined, the next step is to understand the data and confirm its state, to ensure that the subsequent analysis proceeds smoothly. This mainly covers three aspects. First, understanding the data model, i.e., confirming the actual meaning that the data represents and evaluating whether the data has become outdated or incorrect. Second, preliminary analysis of the data, mainly to understand the data distribution, the data quality, the data correlation, etc. Third, confirming the quality of the data through visual analysis, judging whether the data meets the needs in three respects: whether the critical data is available, whether there are many missing or invalid values, and whether there is sufficient historical data[1].

According to the actual conditions, this paper determines the evaluation system of customer profitability as shown in Table 1.

C. Data Preprocessing

High-quality decision-making necessarily relies upon high-quality data. Thus, data preprocessing is an important step in the knowledge discovery process, and also the most time-consuming and tedious step in data mining. In the real world, collected data are mostly incomplete, noisy and inconsistent. Therefore, data cleaning and data conversion are required to meet the requirements of data mining algorithms and to produce the most reliable and accurate results.

1) Data Preparation

Data preparation includes extracting and merging relevant data from the business system, aggregating and converting data, and building the data warehouse, while unifying units, formats and naming. After the conversion, its quality should be checked to avoid unnecessary information loss.

The data comes from the terminated insurance records of an insurance company since 2009. This paper selected 213270 samples, including 83190 samples after the rate reform.

2) Data Cleaning

Data cleaning is the process of handling missing values and noise points to make the data clean and tidy, so that they do not distort the conclusions drawn from the data.

A missing value means that no data value is stored for a variable in an observation, mainly because sample information was lacking during data gathering. Depending on the characteristics of the variable, missing values can be handled by direct deletion, statistical filling, prospective estimation, or the new value method. For example, the New-Renew-Transfer variable contains more than ten thousand missing values. Because this number is large, the empirical filling method and the statistical filling method are combined for data cleaning rather than deleting the records directly. Specifically, the vehicle is judged to be new if its preferential coefficient is 0.95, and the other missing values are filled with the mode, that is “Renew”; in this way the mode of the field is not affected.

Meanwhile, errors in manual collection or equipment problems inevitably introduce noise points or outliers into the data. For one-dimensional data, noise is detected with a simple statistic.

3) Variables Selection

Variables selection includes correlation analysis, redundancy processing and so on, to delete duplicate information and reduce constraints, so as to reduce the analysis dimensions while preserving analytical precision.

Correlation analysis is one of the common statistical methods for analyzing the correlation between variables. Linear correlation analysis focuses on the strength and direction of the linear relation between two variables, which is described by the statistic $r$, namely the correlation coefficient, as follows:

$$r_{A,B} = \frac{\sum (A - \bar{A})(B - \bar{B})}{(n-1)\sigma_A \sigma_B} = \frac{\sum (AB) - n\bar{A}\bar{B}}{(n-1)\sigma_A \sigma_B}$$

where $n$ denotes the number of tuples, $\bar{A}$ and $\bar{B}$ denote the means of $A$ and $B$ respectively, and $\sigma_A$ and $\sigma_B$ denote the standard deviations of $A$ and $B$ respectively.

In this paper, customer profitability is discretized as the decision variable, and the other variables are regarded as the condition variables. The condition variables are then simplified through the correlation test between condition variables and the information entropy between the decision variable and the condition variables.

In addition, customer profitability is calculated from the amount of premium and the amount of compensation, which creates redundancy, so these two variables are eliminated from the condition variables. However, the correlation coefficient between the amount of insurance and the amount of premium is 0.8861, which indicates a strong relationship.
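
As an illustration of the correlation coefficient used in Variables Selection, the following is a minimal Python sketch, not the authors' implementation. It uses the second form of the formula with sample standard deviations (division by n - 1). The function name `pearson_r` and the toy values are assumptions made for this example and are not data from the paper.

```python
import math


def pearson_r(a, b):
    """Sample correlation coefficient r_{A,B} with (n - 1) in the denominator."""
    n = len(a)
    if n != len(b) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    # Sample standard deviations (divide by n - 1), matching the formula.
    sd_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / (n - 1))
    sd_b = math.sqrt(sum((y - mean_b) ** 2 for y in b) / (n - 1))
    if sd_a == 0 or sd_b == 0:
        raise ValueError("correlation is undefined for a constant variable")
    # Numerator in its second form: sum(AB) - n * mean(A) * mean(B).
    cross = sum(x * y for x, y in zip(a, b)) - n * mean_a * mean_b
    return cross / ((n - 1) * sd_a * sd_b)


# Hypothetical toy values, not taken from the paper's dataset.
amount_of_insurance = [10.0, 20.0, 30.0, 40.0]
amount_of_premium = [1.0, 2.1, 2.9, 4.2]
r = pearson_r(amount_of_insurance, amount_of_premium)
print(f"r = {r:.4f}")
```

A value of r close to 1, as with the 0.8861 reported above, signals that the two variables carry largely overlapping information.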


TABLE I. Evaluation System of Customer Profitability

Category | Variable | Data Type | Source Field
Human Factors (2) | Age | num | ID Number
Human Factors (2) | Sex | char | ID Number
Vehicle Factors (4) | Service Life | num | Service Life
Vehicle Factors (4) | Purchase Price of New Vehicle | num | Purchase Price of New Vehicle
Vehicle Factors (4) | Brand Origin | char | Brand Name
Vehicle Factors (4) | Seating Capacity | num | Seating Capacity
Business Factors (7) | Channel | char | Clause
Business Factors (7) | New-Renew-Transfer | char | New-Renew-Transfer
Business Factors (7) | Days of Insurance | num | Days of Insurance
Business Factors (7) | Amount of Insurance | num | Amount of Insurance
Business Factors (7) | Preferential Coefficient | num | Preferential Coefficient
Business Factors (7) | Customer Profitability | num | Amount of Premium, Amount of Compensate

4) Variables Conversion

For discrete variables, the concept hierarchy promotion method replaces low-level original data with higher-level concepts, which is a sublimation from perceptual knowledge to rational knowledge. In addition, abstract concepts are more powerful and make judgments easier for the mining model. For example, a wide range of brand names is generalized to brand origin according to the origin of the core technology.

Numeralization of discrete variables meets the requirement of the linear regression algorithm, and also takes up less space and shortens processing time. In this paper, the variables sex, brand origin, channel and new-renew-transfer need numeralization.

Discretization of continuous variables divides a continuous variable into intervals, and any value within an interval is regarded as the same value. Common discretization methods can be divided into supervised and unsupervised methods. Unsupervised methods, which do not use class labels, include the equi-width method, the equi-frequency method, the business method, etc. Supervised methods include class-based information entropy, the 1R algorithm, etc. In this paper, some of the continuous variables in the gathered data require discretization, including preferential coefficient, age, service life, purchase price of new vehicle, days of insurance, amount of insurance, etc. The discretization methods actually used are shown in Table 2.

5) Results of Data Preprocessing

According to the above steps, this paper determines the final results as shown in Table 2.

TABLE II. Results of Data Preprocessing

Variable | Value Range
Age | 1. 11~29 years old; 2. 30~37 years old; 3. 38~45 years old; 4. 46~53 years old; 5. 54~89 years old
Sex | 11. Male; 12. Female
Brand Origin | 41. China; 42. Japan; 43. American; 44. Germany; 45. France; 46. Korea; 47. Other Europe countries
Purchase Price of New Vehicle | 0. <0.05 million; 1. 0.05 million~0.1 million; 2. 0.1 million~0.15 million; 3. 0.15 million~0.2 million; 4. 0.2 million~0.5 million; 5. >=0.5 million
Seating Capacity | 2~18 seats
Service Life | 0~20 years
Preferential Rank | 0. New vehicle; 1. Accident-free for more than 3 years; 2. Accident-free for more than 2 years; 3. Accident-free for more than 1 year; 4. Applied for insurance claims once or twice last year, and the amount of compensate is less than 75% of the amount of premium; 5. Between the 4th and 6th conditions; 6. Applied for insurance claims more than 3 times last year, and the amount of compensate is more than 75% of the amount of premium
New-Renew-Transfer | 31. New vehicle; 32. Renewal policy; 33. Transfer
Days of Insurance | 1. Less than one year; 2. One year
Channel | 1. Traditional business; 2. Telephone/Internet sales business
Amount of Insurance | 3. <0.2 million; 4. 0.2 million~0.4 million; 5. 0.4 million~0.6 million; 6. 0.6 million~0.8 million; 7. 0.8 million~1.5 million; 8. >=1.5 million
Customer Profitability | The next section discusses that.
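
To make the conversion in Table II concrete, the following Python sketch maps raw values to the listed codes. It is only an illustration under stated assumptions: the cut points come from Table II, but the table does not say which interval a boundary value belongs to, so the sketch assumes each interval includes its lower bound (consistent with the ">=0.5 million" and ">=1.5 million" rows). The names `discretize`, `AGE_CUTS`, `PRICE_CUTS`, `INSURED_CUTS` and `SEX_CODES` are hypothetical, and no range checks are performed.

```python
from bisect import bisect_right

# Cut points read from Table II; each value is the lower bound of the next interval.
AGE_CUTS = [30, 38, 46, 54]                  # age codes 1..5
PRICE_CUTS = [0.05, 0.1, 0.15, 0.2, 0.5]     # purchase price in millions, codes 0..5
INSURED_CUTS = [0.2, 0.4, 0.6, 0.8, 1.5]     # amount of insurance in millions, codes 3..8

# Numeralization of a discrete variable, using the codes from Table II.
SEX_CODES = {"Male": 11, "Female": 12}


def discretize(value, cuts, first_code):
    """Return the interval code of `value`; a boundary value goes to the upper interval."""
    return first_code + bisect_right(cuts, value)


# Hypothetical customer record, not taken from the paper's data.
record = {"age": 35, "sex": "Female", "price": 0.12, "insured": 0.45}
encoded = {
    "age": discretize(record["age"], AGE_CUTS, 1),
    "sex": SEX_CODES[record["sex"]],
    "price": discretize(record["price"], PRICE_CUTS, 0),
    "insured": discretize(record["insured"], INSURED_CUTS, 3),
}
print(encoded)
```

Keeping the cut points in plain lists makes it easy to compare them against the table and to adjust the boundary convention if the real data follow a different one.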


D. Building Customer Profitability Analysis Model

1) Dividing Classification of Customer Profitability by Clustering Method

In this paper, customer profitability is the target variable of the classification model, which requires a training set with clearly identifiable class labels. Therefore, the data set is first clustered by a clustering algorithm, so that the customers can then be classified by a decision tree.

Cluster analysis[4] mainly studies the “Like Attracts Like” problem in statistics. It establishes a classification method that automatically sorts a quantity of data according to the degree of similarity, without prior knowledge. A class is a set of similar individuals, and different classes differ significantly.

This paper employs the K-means clustering algorithm, a standard iterative algorithm that minimizes the sum of squared distances between the samples and the mean of their class. The data structure for cluster analysis contains age, sex, service life, purchase price of new vehicle, brand origin, seating capacity, channel, new-renew-transfer, days of insurance, amount of insurance, preferential coefficient and so on. According to actual needs, customer profitability was classified into four ranks: extremely low (-30000 and below), low (-30000,0), medium (0,5000) and high (5000 and above), which serve as the class labels of the training set in the classification method.

2) Extracting the Customers’ Features by Decision Tree

The decision tree algorithm is one of the classification algorithms in data mining. By analyzing the given training data, it generates a decision tree model for classification and prediction. Each internal node of the decision tree describes a test on an attribute of the samples. Each internal node has one or more subsequent branches, and each branch corresponds to one of the possible values of this attribute. Each leaf node denotes the class label of the samples that match the path from the root node to that leaf node.

The following is a top-down method to build a decision tree[5]:

Input: Node N, Training set D, Branching index SI
Output: The decision tree with the node N as the root node, based on the training set D and the branching index SI.
Method:
make_tree(N, D, SI) {
    Initialize the root node;
    In the dataset D, solve the branching scheme of node N, satisfying branching index SI;
    if (node N satisfies the branching condition) {
        Select the best branching scheme to divide dataset D into D1, D2;
        Create N’s children nodes N1, N2;
        make_tree(N1, D1, SI);
        make_tree(N2, D2, SI);
    }
}

By using the decision tree model, this paper determined the classification features of customers after the rate reform (see Table 3). For example, a customer whose preferential rank is 0 or 1 (the vehicle is new or the customer has been accident-free for more than 3 years), whose service life is 0 years (the vehicle is new), and whose purchase price of new vehicle is less than 0.1 million most probably belongs to the medium rank, and may also belong to the high rank. In other words, customers who purchase vehicle insurance for a new, cheap vehicle are probably valued clients. These clients have a relatively stable business situation and profitability, and ought to be well maintained at all times.
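
The top-down procedure above can be sketched in Python as follows. This is a minimal illustration rather than the method used in the paper: the paper does not name its branching index, so Gini impurity is assumed here, splits are binary thresholds on numeric codes, and `max_depth`, the dictionary layout of the tree and the toy records are all hypothetical. The function expects a non-empty training set.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a non-empty list of class labels (assumed branching index)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Return the (feature, threshold) that lowers weighted Gini the most, or None."""
    best, best_score = None, gini(labels)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if score < best_score:
                best, best_score = (f, t), score
    return best


def make_tree(rows, labels, max_depth=3):
    """Recursively split the training set into two subsets, as in make_tree above."""
    split = best_split(rows, labels) if max_depth > 0 else None
    if split is None:
        # Leaf: class shares, comparable to the result columns of the rule tables.
        return {"leaf": {k: v / len(labels) for k, v in Counter(labels).items()}}
    f, t = split
    left = [i for i, r in enumerate(rows) if r[f] <= t]
    right = [i for i, r in enumerate(rows) if r[f] > t]
    return {
        "feature": f,
        "threshold": t,
        "left": make_tree([rows[i] for i in left], [labels[i] for i in left], max_depth - 1),
        "right": make_tree([rows[i] for i in right], [labels[i] for i in right], max_depth - 1),
    }


def predict(tree, row):
    """Follow the path from the root to a leaf and return its most frequent class."""
    while "leaf" not in tree:
        tree = tree["left"] if row[tree["feature"]] <= tree["threshold"] else tree["right"]
    return max(tree["leaf"], key=tree["leaf"].get)


# Hypothetical toy records: (preferential rank, service life in years).
rows = [(0, 0), (1, 0), (0, 5), (1, 6), (5, 2), (6, 3)]
labels = ["medium", "medium", "low", "low", "high", "high"]
tree = make_tree(rows, labels)
print(predict(tree, (0, 0)))
```

Storing class shares in each leaf, instead of a single label, mirrors the way the rule tables report a first and a second result with percentages.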

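TABLE III. Classification Rules of Decision Tree for Customer Profitability Analysis after Rate Reform

Layer 1 | Layer 2 | Layer 3 | First Result | Second Result
Preferential Rank (0~1) | Service Life: 0~1 years | Purchase Price of New Vehicle: <=0.1 million | Medium (76%) | High (24%)
Preferential Rank (0~1) | Service Life: 0~1 years | Purchase Price of New Vehicle: 0.1 million~0.15 million | High (77%) | Medium (23%)
Preferential Rank (0~1) | Service Life: 0~1 years | Purchase Price of New Vehicle: >=0.15 million | High (94%) | Medium (6%)
Preferential Rank (0~1) | Service Life: 1~7 years | Amount of Insurance: <=0.2 million | Low (100%) | -
Preferential Rank (0~1) | Service Life: 1~7 years | Amount of Insurance: 0.2 million~1.5 million | Low (56%) | Extremely low (44%)
Preferential Rank (0~1) | Service Life: 1~7 years | Amount of Insurance: >=1.5 million | Extremely low (73%) | Low (27%)
Preferential Rank (0~1) | Service Life: 8~20 years | - | Extremely low (74%) | Low (26%)
Preferential Rank (2~4) | Preferential Rank (2) | Amount of Insurance: <=0.2 million | Medium (100%) | -
Preferential Rank (2~4) | Preferential Rank (2) | Amount of Insurance: 0.2 million~0.8 million | Medium (52%) | Low (48%)
Preferential Rank (2~4) | Preferential Rank (2) | Amount of Insurance: >=0.8 million | Low (63%) | Medium (36%)
Preferential Rank (2~4) | Preferential Rank (3~4) | - | Medium (100%) | -
Preferential Rank (5~6) | Amount of Insurance: <=0.8 million | Purchase Price of New Vehicle: <=0.1 million | Medium (97.8%) | High (2.2%)
Preferential Rank (5~6) | Amount of Insurance: <=0.8 million | Purchase Price of New Vehicle: 0.1 million~0.15 million | High (56%) | Medium (44%)
Preferential Rank (5~6) | Amount of Insurance: <=0.8 million | Purchase Price of New Vehicle: >=0.15 million | High (72%) | Medium (28%)
Preferential Rank (5~6) | Amount of Insurance: 0.8 million~1.5 million | Channel: Traditional business | High (94%) | Medium (6%)
Preferential Rank (5~6) | Amount of Insurance: 0.8 million~1.5 million | Channel: Telephone/Internet sales business | High (69%) | Medium (31%)
Preferential Rank (5~6) | Amount of Insurance: >=1.5 million | - | High (100%) | -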

For comparison, this paper used the same four intervals, extremely low (-30000 and below), low (-30000,0), medium (0,5000) and high (5000 and above), to extract the customers’ features before the rate reform. The classification of customers’ features before the rate reform is not ideal, but the experiment still produces some relatively clear classification rules (see Table 4). For example, a customer whose purchase price of new vehicle is between 0.15 million and 0.2 million and whose amount of insurance is more than 1.5 million most probably belongs to the extremely low rank, and may also belong to the high rank. This shows that the idiosyncratic risk of these customers is fairly unstable, and their profit contributions are liable to go to extremes. Therefore, insurance companies should pay attention to educating these customers in safety and enhance after-sales service for them. However, this might be a special case, because this leaf node contains only 37 samples and is therefore less reliable than nodes with a large amount of data.
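
The four profitability intervals reused here, together with the simplified profitability formula (premium minus compensation), can be written as a small labeling function. This is a sketch under assumptions: the paper gives the intervals but not how the boundary value 0 is assigned, so it is placed in the medium rank here, and the function names and sample amounts are hypothetical.

```python
def customer_profit(premium, compensation):
    """Net profit from one customer: premium income minus compensation paid."""
    return premium - compensation


def profitability_rank(profit):
    """Map a net profit to one of the four ranks used as class labels."""
    if profit <= -30000:
        return "extremely low"   # -30000 and below
    if profit < 0:
        return "low"             # (-30000, 0)
    if profit < 5000:
        return "medium"          # (0, 5000); 0 itself is assigned here by assumption
    return "high"                # 5000 and above


# Hypothetical (premium, compensation) pairs, not taken from the paper's data.
policies = [(4000, 1000), (3500, 9000), (6000, 0), (5000, 40000)]
ranks = [profitability_rank(customer_profit(p, c)) for p, c in policies]
total_profit = sum(customer_profit(p, c) for p, c in policies)
print(ranks, total_profit)
```

Applying the same function to records from both periods is what makes the rules in the two tables directly comparable.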

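TABLE IV. Classification Rules of Decision Tree for Customer Profitability Analysis before Rate Reform

Layer 1 | Layer 2 | Layer 3 | First Result | Second Result
Purchase Price of New Vehicle: <0.1 million | Preferential Rank (0) | Amount of Insurance: <=0.4 million | Medium (58%) | Low (40%)
Purchase Price of New Vehicle: <0.1 million | Preferential Rank (0) | Amount of Insurance: 0.4 million~0.6 million | Low (47%) | Low (26%)
Purchase Price of New Vehicle: <0.1 million | Preferential Rank (0) | Amount of Insurance: >=0.6 million | Medium (31%) | Low (28%)
Purchase Price of New Vehicle: <0.1 million | Preferential Rank (1~4) | Seating Capacity: 2~6 seats | Low (46%) | Low (35%)
Purchase Price of New Vehicle: <0.1 million | Preferential Rank (1~4) | Seating Capacity: 6~9 seats | Medium (64%) | Low (30%)
Purchase Price of New Vehicle: <0.1 million | Preferential Rank (1~4) | Seating Capacity: 9~14 seats | Extremely low (60%) | Medium (23%)
Purchase Price of New Vehicle: <0.1 million | Preferential Rank (5~6) | Amount of Insurance: <=0.2 million | Medium (62%) | Extremely low (22%)
Purchase Price of New Vehicle: <0.1 million | Preferential Rank (5~6) | Amount of Insurance: 0.2 million~0.6 million | Low (37%) | Low (35%)
Purchase Price of New Vehicle: <0.1 million | Preferential Rank (5~6) | Amount of Insurance: >=0.6 million | Low (38%) | Medium (23%)
Purchase Price of New Vehicle: 0.1 million~0.15 million | Preferential Rank (0) | Amount of Insurance: <=0.2 million | Medium (100%) | -
Purchase Price of New Vehicle: 0.1 million~0.15 million | Preferential Rank (0) | Amount of Insurance: 0.2 million~0.4 million | Medium (61%) | Low (34%)
Purchase Price of New Vehicle: 0.1 million~0.15 million | Preferential Rank (0) | Amount of Insurance: >=0.4 million | Medium (46%) | Medium (16%)
Purchase Price of New Vehicle: 0.1 million~0.15 million | Preferential Rank (1~4) | Amount of Insurance: <=0.2 million | Medium (86%) | Low (14%)
Purchase Price of New Vehicle: 0.1 million~0.15 million | Preferential Rank (1~4) | Amount of Insurance: 0.2 million~0.8 million | Low (41%) | Medium (31%)
Purchase Price of New Vehicle: 0.1 million~0.15 million | Preferential Rank (1~4) | Amount of Insurance: >=0.8 million | Medium (34%) | Low (26%)
Purchase Price of New Vehicle: 0.1 million~0.15 million | Preferential Rank (5~6) | Amount of Insurance: <=0.4 million | Medium (53%) | Low (22%)
Purchase Price of New Vehicle: 0.1 million~0.15 million | Preferential Rank (5~6) | Amount of Insurance: 0.4 million~0.8 million | Low (33%) | Medium (28%)
Purchase Price of New Vehicle: 0.1 million~0.15 million | Preferential Rank (5~6) | Amount of Insurance: >=0.8 million | High (36%) | Medium (22%)
Purchase Price of New Vehicle: >=0.15 million | Amount of Insurance: <=0.4 million | Service Life: 0~1 years | Medium (85%) | High (14%)
Purchase Price of New Vehicle: >=0.15 million | Amount of Insurance: <=0.4 million | Service Life: 1~5 years | Medium (92%) | Low (7%)
Purchase Price of New Vehicle: >=0.15 million | Amount of Insurance: <=0.4 million | Service Life: 5~20 years | Medium (64%) | Low (35%)
Purchase Price of New Vehicle: >=0.15 million | Amount of Insurance: 0.4 million~1.5 million | Preferential Rank (0) | High (60%) | Low (14%)
Purchase Price of New Vehicle: >=0.15 million | Amount of Insurance: 0.4 million~1.5 million | Preferential Rank (1~4) | Low (25%) | Extremely low (24%)
Purchase Price of New Vehicle: >=0.15 million | Amount of Insurance: 0.4 million~1.5 million | Preferential Rank (5~6) | High (40%) | Low (34%)
Purchase Price of New Vehicle: >=0.15 million | Amount of Insurance: >=1.5 million | Purchase Price of New Vehicle: 0.15 million~0.2 million | Extremely low (86%) | High (12%)
Purchase Price of New Vehicle: >=0.15 million | Amount of Insurance: >=1.5 million | Purchase Price of New Vehicle: 0.2 million~0.5 million | Low (30%) | Extremely low (27%)
Purchase Price of New Vehicle: >=0.15 million | Amount of Insurance: >=1.5 million | Purchase Price of New Vehicle: >=0.5 million | Extremely low (46%) | Low (40%)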

IV. Conclusions

Through the comparison of the decision tree classification rules before and after the rate reform, this paper draws the following conclusions:

• The importance of preferential rank is more and more obvious after the rate reform.

This is because, after the rate reform, preferential coefficient and preferential rank have a closer relationship with the compensation that customers receive, and influence customer profitability more directly.

• Before the rate reform, the importance of purchase price is higher than that of amount of insurance, which is the opposite after the rate reform.

This paper conjectures that this result conforms to the actual situation. Before the rate reform, the discounts that customers enjoyed did not differ much, so customers’ vehicle factors reflected customer profitability better than business factors. After the rate reform, accident-free customers receive larger discounts and may increasingly purchase vehicle insurance, so the amount of insurance they buy is likely to vary more. This instability leads to a classification according to customers’ psychological needs, and these different psychological needs are likely to reflect their risk awareness and their likelihood of having an automobile accident, thereby affecting customer profitability.

• Classification rules generated from the data before the rate reform are not clearly defined, for two reasons.

One is that the data before the rate reform were grouped with the same categories as the data after the reform. While this assumption is convenient, it differs somewhat from the actual situation. For example, both the purchase price of new vehicles and the amount of insurance before the rate reform are lower than those after the reform, so classification rules generated from data after the reform reduce the ability to group the earlier data and may even interfere with the generation of a normal decision tree.

The other is that the correspondence between customer profitability and customers’ features before the rate reform is not as obvious as that after the reform. Before the rate reform, customers would ask for compensation without hesitation after an accident. After the rate reform, however, when they arrange insurance and ask for compensation, they take into account the influence on their discounts now and in the future. For instance, to keep a good record and enjoy more discounts, some customers might choose not to ask for compensation after an accident if the compensation is small, which may result in a clearer classification of customers by their behavior.

References
[1] Keyun Hu, Fengzhan Tian, Houkuan Huang. Data Mining Theory and Applications [M]. Beijing: Tsinghua University Press, 2008.


[2] Weihui Huang, Guohua Geng, Li Chan. Data mining technology in the insurance business [J]. Computer Applications and Software, 2008 (3): 123-125.
[3] Mingqiang Bi. Based on contribution analysis and customer relations Commercial Bank Loan Pricing Method [J]. Finance Forum, 2004 (7): 44-50.
[4] Qinhua Wei. Customer profit contribution of electricity based on data mining analysis [J]. Demand Side Management, 2006, 2.
[5] Zhaole Tang. Application of Data Mining Technology in CRM in the auto insurance [D]. University of Electronic Science and Technology, 2007.
