Customer Profitability Analysis of Automobile
Abstract: Data mining is an interdisciplinary technology that draws on the theory and techniques of artificial intelligence, machine learning, statistics and other fields. It can extract implicit but useful information and knowledge from vast amounts of historical enterprise data and provide solid support for company decisions. In the context of the rate reform of the domestic automobile insurance industry, this paper discusses the application of data mining technology to customer profitability, finds the classification rules before and after the rate reform, and shows the process of customer profitability analysis using a decision tree.

Keywords: Data Mining; Automobile Insurance; Premium Rate Reform; Customer Profitability

I. Introduction
In 2011, the domestic automobile insurance industry implemented market-oriented premium rate policies in some pilot cities, such as Xiamen, to advance premium rate reform. After the reform, premium discounts are linked to customers' claim records, which enlarges the floating range of premium rates and significantly widens the gap between the premiums paid by different customers. In this situation it is hard for insurance companies to judge customer profitability through traditional statistics or experience, so research on customer profitability analysis in the automobile insurance market becomes increasingly important. In this paper, on the basis of an enterprise data warehouse, we apply data mining technology to analyze the massive historical customer data in the databases of insurance companies. We classify customer profitability through clustering methods and analyze the relationship between customers' attributes and their degree of profit contribution with a decision tree, in order to find the classification rules before and after the rate reform and to set up an evaluation model of customer profitability.

II. Relevant Concepts
A. Basic Concepts of Data Mining
Data mining is the process of extracting implicit but meaningful rules or patterns from large-scale datasets. It is an interdisciplinary field based on statistics, machine learning, databases and other disciplines. It has developed rapidly in recent years and has been applied to many fields, such as finance, retail and telecommunications, so data mining has a bright future. Mainstream data mining systems, developed in strict accordance with CRISP-DM (Cross-Industry Standard Process for Data Mining), divide the data mining process into six phases: business understanding, data understanding, data preparation, modeling, evaluation and deployment. In practice, data mining is an iterative process of stepwise refinement[1].
By function, data mining can be categorized into category prediction and description. Category prediction trains a model on classified data and uses the trained model to partition unclassified data; common methods include decision trees and Bayesian methods. Description summarizes or splits the data by analyzing the intrinsic relationships within the given dataset; it mainly includes clustering, association rules, etc[2].

B. Customer Profitability
The customer profitability analysis pricing method[3] argues that when pricing each transaction, an enterprise should consider its overall relationship with the customer, that is, fully consider the costs and benefits of the various transactions between the customer and the enterprise. The benchmark for pricing is the overall yield, also called the customer profitability, which can be computed by

$$P_e = \sum_{i=1}^{n} (I_i - C_i)$$

where $P_e$ is the customer profitability, $I_i$ is the income from the $i$-th customer, that is, the amount of premium paid by the customer, $C_i$ is the expenditure cost for the $i$-th customer, including the amount of compensation paid to the customer, staff salaries and employee bonuses, and $I_i - C_i$ is the net profit from the $i$-th customer. For convenience of computation, this paper assumes that $C_i$ includes only the amount of compensation paid to customers.

III. Analysis Process
A. Business Understanding
Data mining is not a purely technological process, but a process that combines business and technology, or a business process served by technology. The understanding of the business largely
determines the ultimate success of a project. Before data mining, the first step is to understand the business goals, the second is to set the data mining goals according to the business goals, and the last is to collect and process the relevant data based on that understanding.
The process of insuring and settling claims in an insurance company, together with the data entry, can be simplified as shown in Figure 1.

Fig. 1. The flow chart of vehicles’ insurance and claim process

The business goal is to divide all customers into multiple disjoint sets according to their customer profitability, to identify the features of each set, and to explain the changes brought by the reform, which will be the foundation for later marketing, service and other activities.
Household vehicle insurance accounts for a large proportion of the vehicle insurance business, with inconsistent customer behavior and a high degree of marketization, and the rate reform focuses only on commercial motor vehicle insurance. The goal of data mining is therefore to find the features of customers with different profit contributions by using classification methods, and to compare them before and after the rate reform.

B. Data Understanding
After the business requirements are determined, the next step is to understand the data and confirm its state, to ensure that the subsequent analysis proceeds smoothly. This involves three tasks. First, understanding the data model, i.e., confirming the actual meaning that the data represent and evaluating whether the data have become outdated or incorrect. Second, preliminary analysis of the data, mainly to understand the data distribution, data quality and data correlations. Third, confirming data quality by visual analysis, judging whether the data meet the needs in three respects: whether the critical data are available, whether there are many missing or invalid values, and whether there is sufficient historical data[1].
According to the actual conditions, this paper determines the evaluation system of customer profitability as shown in Table 1.

TABLE I. Evaluation System of Customer Profitability
Category | Variable | Data Type | Source Field

C. Data Preprocessing
High-quality decision-making necessarily relies upon high-quality data. Data preprocessing is therefore an important step in the knowledge discovery process, and also the most time-consuming and tedious part of data mining. In the real world, collected data are mostly incomplete, noisy and inconsistent. Data cleaning and data conversion are therefore required to meet the requirements of data mining algorithms and to produce the most reliable and accurate results.
1) Data Preparation
Data preparation includes extracting and merging relevant data from the business systems, aggregating and converting the data, and building the data warehouse, while unifying units, formats and naming. After the conversion, its quality should be checked in order to avoid unnecessary information loss.
The data come from the terminated insurance records of an insurance company since 2009. This paper selected 213270 samples, including 83190 samples after the rate reform.
2) Data Cleaning
Data cleaning is the process of handling missing values and noise points so that the data are clean and tidy and do not distort the conclusions drawn from them.
A missing value means that no data value is stored for a variable in an observation, mainly because of a lack of sample information during data gathering. Depending on the characteristics of the variable, missing values can be handled by direct deletion, statistical filling, prospective estimation, or the new value method. For example, the New-Renew-Transfer variable contains more than ten thousand missing values. Because there are so many, they are filled by combining empirical filling and statistical filling rather than deleted directly. Specifically, a vehicle is judged to be new if its preferential coefficient is 0.95, and the remaining missing values are filled with the mode, that is “Renew”, so the mode of the field is not affected.
Meanwhile, errors in manual collection or equipment problems inevitably introduce noise points or outliers into the data. For one-dimensional data, noise can be detected with a simple statistic.
3) Variables Selection
Variable selection includes correlation analysis, redundancy processing and so on. It removes duplicate information and reduces constraints, so that the number of analysis dimensions is reduced while analytical precision is preserved.
Correlation analysis is a common statistical method for analyzing relationships between variables. Linear correlation analysis measures the strength and direction of the linear relationship between two variables $A$ and $B$, described by the statistic $r_{A,B}$, namely the correlation coefficient, as follows:

$$r_{A,B} = \frac{\sum_{i=1}^{n}(a_i-\bar{A})(b_i-\bar{B})}{(n-1)\sigma_A\sigma_B} = \frac{\sum_{i=1}^{n}(a_i b_i) - n\bar{A}\bar{B}}{(n-1)\sigma_A\sigma_B}$$

where $n$ denotes the number of tuples, $a_i$ and $b_i$ are the values of $A$ and $B$ in the $i$-th tuple, $\bar{A}$ and $\bar{B}$ denote the means of $A$ and $B$ respectively, and $\sigma_A$ and $\sigma_B$ denote the standard deviations of $A$ and $B$ respectively.
In this paper, customer profitability is discretized as the decision variable, and the other variables are regarded as condition variables. The condition variables are then reduced using correlation tests between condition variables and the information entropy between the decision variable and the condition variables.
In addition, customer profitability is calculated from the amount of premium and the amount of compensation, which creates redundancy, so these two variables are eliminated from the condition variables. However, the correlation coefficient between the amount of insurance and the amount of premium is 0.8861, which indicates a strong relationship.
According to the above steps, this paper determines the final results as shown in Table 2.
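As an illustration of the cleaning and variable selection steps in this section, the following Python sketch fills missing New-Renew-Transfer values with the rule described in the text (preferential coefficient 0.95 means a new vehicle, otherwise the mode) and computes the linear correlation coefficient using sample standard deviations. The field names `new_renew_transfer` and `preferential_coefficient`, the function names and the sample records are hypothetical; they are not taken from the paper's data warehouse.

```python
import math
from collections import Counter


def fill_new_renew_transfer(records):
    """Fill missing New-Renew-Transfer values (None) with the rule from the text.

    Field names are illustrative assumptions, not the paper's schema.
    """
    known = [r["new_renew_transfer"] for r in records
             if r["new_renew_transfer"] is not None]
    mode = Counter(known).most_common(1)[0][0]  # most frequent known value
    filled = []
    for r in records:
        r = dict(r)  # copy so the input is left unchanged
        if r["new_renew_transfer"] is None:
            if math.isclose(r["preferential_coefficient"], 0.95):
                r["new_renew_transfer"] = "New"   # empirical filling
            else:
                r["new_renew_transfer"] = mode    # statistical filling
        filled.append(r)
    return filled


def pearson_r(a, b):
    """Correlation coefficient with the (n - 1) * sigma_A * sigma_B denominator."""
    n = len(a)
    if n != len(b) or n < 2:
        raise ValueError("need two equally long sequences with n >= 2")
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cross = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    sd_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / (n - 1))
    sd_b = math.sqrt(sum((y - mean_b) ** 2 for y in b) / (n - 1))
    return cross / ((n - 1) * sd_a * sd_b)  # undefined if either column is constant


# Hypothetical sample records, for illustration only.
sample = [
    {"new_renew_transfer": "Renew", "preferential_coefficient": 0.80},
    {"new_renew_transfer": "Renew", "preferential_coefficient": 0.70},
    {"new_renew_transfer": "Transfer", "preferential_coefficient": 0.90},
    {"new_renew_transfer": None, "preferential_coefficient": 0.95},
    {"new_renew_transfer": None, "preferential_coefficient": 0.85},
]
filled = fill_new_renew_transfer(sample)
```

A pair of condition variables whose coefficient is close to 1 or -1, such as the amount of insurance and the amount of premium in this paper, is a candidate for redundancy processing.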
D. Building Customer Profitability Analysis Model
1) Dividing Classification of Customer Profitability by Clustering Method
In this paper, customer profitability is the target variable in the classification model, which requires a training set with clearly identifiable class labels. Therefore, the data set is first clustered by a clustering algorithm so that the customers can then be classified by a decision tree.
Cluster analysis[4] mainly studies the “Like Attracts Like” problem in statistics. It establishes a classification method that automatically sorts a quantity of data according to their degree of similarity, without prior knowledge. A class is a set of similar individuals, and different classes differ significantly.
This paper employs the K-means clustering algorithm, a standard iterative algorithm that minimizes the within-class sum of squared distances to the class means. The data used for cluster analysis contain age, sex, service life, purchase price of new vehicle, brand origin, seating capacity, channel, new-renew-transfer, days of insurance, amount of insurance, preferential coefficient and so on. According to actual classification needs, customer profitability was divided into four ranks: extremely low (-30000 and below), low (-30000,0), medium (0,5000), high (5000 and above), which serve as the class labels of the training set in the classification method.
2) Extracting the Customers’ Features by Decision Tree
The decision tree algorithm is one of the classification algorithms of data mining. By analyzing the given training data, it generates a decision tree model for classification and prediction. Each internal node in the decision tree describes a test on an attribute of the samples. Each internal node has one or more subsequent branches, and each branch corresponds to one of the possible values of this attribute. Each leaf node denotes the class label of the samples that match the path from the root node to that leaf node.
The following is a top-down method to build a decision tree[5]:

Input: Node N, training set D, branching index F
Output: The decision tree with node N as the root node, based on the training set D and the branching index F.
Method:
make_tree(N, D, F) {
    Initialize the root node N;
    In the dataset D, solve for the branching schemes of node N that satisfy branching index F;
    if (node N satisfies the branching condition) {
        Select the best branching scheme to divide dataset D into D1, D2;
        Create N's child nodes N1, N2;
        make_tree(N1, D1, F);
        make_tree(N2, D2, F);
    }
}

By using the decision tree model, this paper determined the classification features of customers after the rate reform (see Table 3). For example, a customer whose preferential rank is 0 or 1 (the vehicle is new or the customer has been accident-free for more than 3 years), whose service life is 0 years (the vehicle is new), and whose new-vehicle purchase price is less than 0.1 million is most likely to be of medium rank, and may also belong to the high rank. In other words, customers who purchase vehicle insurance for new, inexpensive vehicles are probably valued clients. These clients have a more stable business situation and profitability, and ought to be well maintained.
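The top-down procedure above can be sketched in Python as a small recursive builder. This is only an illustrative sketch under stated assumptions: the paper does not specify its branching index here, so information gain on binary threshold splits is used as a stand-in, and the feature layout (service life in years, purchase price in millions), the `min_gain` parameter and the toy samples are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def best_split(rows, labels, min_gain):
    """Return the (feature, threshold) with the largest information gain, or None."""
    base = entropy(labels)
    best, best_gain = None, min_gain
    for f in range(len(rows[0])):
        # Every distinct value except the largest is a candidate threshold,
        # so both sides of the split are always non-empty.
        for t in sorted({r[f] for r in rows})[:-1]:
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            weighted = (len(left) * entropy(left)
                        + len(right) * entropy(right)) / len(labels)
            if base - weighted > best_gain:
                best, best_gain = (f, t), base - weighted
    return best


def make_tree(rows, labels, min_gain=1e-9):
    """Recursively split the data; a leaf stores the class proportions."""
    split = best_split(rows, labels, min_gain)
    if split is None:  # branching condition not met: create a leaf
        return {"leaf": {k: v / len(labels) for k, v in Counter(labels).items()}}
    f, t = split
    left = [i for i, r in enumerate(rows) if r[f] <= t]
    right = [i for i, r in enumerate(rows) if r[f] > t]
    return {
        "feature": f,
        "threshold": t,
        "left": make_tree([rows[i] for i in left], [labels[i] for i in left], min_gain),
        "right": make_tree([rows[i] for i in right], [labels[i] for i in right], min_gain),
    }


def predict(tree, row):
    """Follow the path from the root to a leaf and return its majority class."""
    while "leaf" not in tree:
        tree = tree["left"] if row[tree["feature"]] <= tree["threshold"] else tree["right"]
    return max(tree["leaf"], key=tree["leaf"].get)


# Hypothetical toy samples: (service life in years, purchase price in millions).
rows = [(0, 0.08), (0, 0.09), (0, 0.12), (0, 0.20), (3, 0.12), (9, 0.10)]
labels = ["medium", "medium", "high", "high", "low", "extremely low"]
tree = make_tree(rows, labels)
```

Storing the class proportions in each leaf mirrors the "First Result" and "Second Result" percentages reported for each rule in the classification tables.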
TABLE III. Classification Rules of Decision Tree for Customer Profitability Analysis after Rate Reform
Layer 1 | Layer 2 | Layer 3 | First Result | Second Result
Preferential Rank (0~1) | Service Life: 0~1 years | Purchase Price of New Vehicle: <=0.1 million | Medium (76%) | High (24%)
Preferential Rank (0~1) | Service Life: 0~1 years | Purchase Price of New Vehicle: 0.1 million~0.15 million | High (77%) | Medium (23%)
Preferential Rank (0~1) | Service Life: 0~1 years | Purchase Price of New Vehicle: >=0.15 million | High (94%) | Medium (6%)
Preferential Rank (0~1) | Service Life: 1~7 years | Amount of Insurance: <=0.2 million | Low (100%) |
Preferential Rank (0~1) | Service Life: 1~7 years | Amount of Insurance: 0.2 million~1.5 million | Low (56%) | Extremely low (44%)
Preferential Rank (0~1) | Service Life: 1~7 years | Amount of Insurance: >=1.5 million | Extremely low (73%) | Low (27%)
Preferential Rank (0~1) | Service Life: 8~20 years | | Extremely low (74%) | Low (26%)
Preferential Rank (2~4) | Preferential Rank (2) | Amount of Insurance: <=0.2 million | Medium (100%) |
Preferential Rank (2~4) | Preferential Rank (2) | Amount of Insurance: 0.2 million~0.8 million | Medium (52%) | Low (48%)
Preferential Rank (2~4) | Preferential Rank (2) | Amount of Insurance: >=0.8 million | Low (63%) | Medium (36%)
Preferential Rank (2~4) | Preferential Rank (3~4) | | Medium (100%) |
Preferential Rank (5~6) | Amount of Insurance: <=0.8 million | Purchase Price of New Vehicle: <=0.1 million | Medium (97.8%) | High (2.2%)
Preferential Rank (5~6) | Amount of Insurance: <=0.8 million | Purchase Price of New Vehicle: 0.1 million~0.15 million | High (56%) | Medium (44%)
Preferential Rank (5~6) | Amount of Insurance: <=0.8 million | Purchase Price of New Vehicle: >=0.15 million | High (72%) | Medium (28%)
Preferential Rank (5~6) | Amount of Insurance: 0.8 million~1.5 million | Channel: Traditional business | High (94%) | Medium (6%)
Preferential Rank (5~6) | Amount of Insurance: 0.8 million~1.5 million | Channel: Telephone/Internet sales business | High (69%) | Medium (31%)
Preferential Rank (5~6) | Amount of Insurance: >=1.5 million | | High (100%) |
For comparison, this paper used the same four intervals, namely extremely low (-30000 and below), low (-30000,0), medium (0,5000) and high (5000 and above), to extract the customers’ features before the rate reform. The classification of customers’ features before the rate reform is not ideal, but the experiment still produces relatively clear classification rules (see Table 4). For example, a customer whose new-vehicle purchase price is between 0.15 million and 0.2 million and whose amount of insurance is more than 1.5 million is most likely to be of extremely low rank, but may also belong to the high rank. This shows that the idiosyncratic risk of these customers is fairly unstable and that their profit contributions are liable to go to extremes. Insurance companies should therefore pay attention to educating these customers about safety and enhance after-sales service for them. However, this may be a special case: the leaf node contains only 37 samples, so it is less reliable than nodes with large amounts of data.
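The four profitability intervals used as class labels can be expressed as a small Python function. This is a minimal sketch: the net profit of a customer is taken as premium minus claim compensation, following the simplifying assumption stated in the paper, while the function names and the treatment of the boundary values are assumptions. The paper writes the middle intervals as open, so assigning a profit of exactly 0 to "medium" is a choice made here for illustration.

```python
def customer_profit(premium, compensation):
    """Net profit from one customer: premium income minus claim compensation."""
    return premium - compensation


def profit_rank(profit):
    """Map a net profit to one of the four ranks used as class labels.

    Boundary handling is an assumption: -30000 counts as "extremely low",
    0 counts as "medium", and 5000 counts as "high".
    """
    if profit <= -30000:
        return "extremely low"
    if profit < 0:
        return "low"
    if profit < 5000:
        return "medium"
    return "high"
```

For instance, a hypothetical customer who paid a premium of 3000 and received 40000 in compensation has a net profit of -37000 and falls in the extremely low rank.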
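TABLE IV. Classification Rules of Decision Tree for Customer Profitability Analysis before Rate Reform
Layer 1 | Layer 2 | Layer 3 | First Result | Second Result
Purchase Price of New Vehicle: <0.1 million | Preferential Rank (0) | Amount of Insurance: <=0.4 million | Medium (58%) | Low (40%)
Purchase Price of New Vehicle: <0.1 million | Preferential Rank (0) | Amount of Insurance: 0.4 million~0.6 million | Low (47%) | Low (26%)
Purchase Price of New Vehicle: <0.1 million | Preferential Rank (0) | Amount of Insurance: >=0.6 million | Medium (31%) | Low (28%)
Purchase Price of New Vehicle: <0.1 million | Preferential Rank (1~4) | Seating Capacity: 2~6 seats | Low (46%) | Low (35%)
Purchase Price of New Vehicle: <0.1 million | Preferential Rank (1~4) | Seating Capacity: 6~9 seats | Medium (64%) | Low (30%)
Purchase Price of New Vehicle: <0.1 million | Preferential Rank (1~4) | Seating Capacity: 9~14 seats | Extremely low (60%) | Medium (23%)
Purchase Price of New Vehicle: <0.1 million | Preferential Rank (5~6) | Amount of Insurance: <=0.2 million | Medium (62%) | Extremely low (22%)
Purchase Price of New Vehicle: <0.1 million | Preferential Rank (5~6) | Amount of Insurance: 0.2 million~0.6 million | Low (37%) | Low (35%)
Purchase Price of New Vehicle: <0.1 million | Preferential Rank (5~6) | Amount of Insurance: >=0.6 million | Low (38%) | Medium (23%)
Purchase Price of New Vehicle: 0.1 million~0.15 million | Preferential Rank (0) | Amount of Insurance: <=0.2 million | Medium (100%) |
Purchase Price of New Vehicle: 0.1 million~0.15 million | Preferential Rank (0) | Amount of Insurance: 0.2 million~0.4 million | Medium (61%) | Low (34%)
Purchase Price of New Vehicle: 0.1 million~0.15 million | Preferential Rank (0) | Amount of Insurance: >=0.4 million | Medium (46%) | Medium (16%)
Purchase Price of New Vehicle: 0.1 million~0.15 million | Preferential Rank (1~4) | Amount of Insurance: <=0.2 million | Medium (86%) | Low (14%)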
References
[2] Weihui Huang, Guohua Geng, Li Chan. Data mining technology in the insurance business [J]. Computer Applications and Software, 2008 (3): 123-125.
[3] Mingqiang Bi. Based on contribution analysis and customer relations Commercial Bank Loan Pricing Method [J]. Finance Forum, 2004 (7): 44-50.
[4] Qinhua Wei. Customer profit contribution of electricity based on data mining analysis [J]. Demand Side Management, 2006, 2.
[5] Zhaole Tang. Application of Data Mining Technology in CRM in the auto insurance [D]. University of Electronic Science and Technology, [2007].