Data Mining Project Ragunathan

BANKING & INSURANCE DOMAIN
DATA MINING AND ANALYSIS
10/23/2021
Contents
Executive Summary
Introduction
Clustering Analysis
Q 1.1
Fig.1 – Datatypes
Fig.2 – null values count in each column
Fig.4 – Duplicate count
Fig.5 – Mean, mode standard deviation analysis
Fig.6 – Outliers in each column
Fig.7 – Pairplot
Q 1.2
Q1.3
Fig.8 – Dendrogram
Q1.4
Fig.9 – Three customer segmentation
Fig.10 – Two customer segmentation
Q 1.5
CART-RF-ANN
Q 2.1
Fig.2.1 – Column info
Fig.2.2 – Column null count info
Fig.2.3 – Metadata of all columns
Fig.2.4 – Duplicate count
Fig.2.5 – Pairplot
Fig.2.6 – Distribution plot
Q 2.2
Q.2.3
Fig.2.7 – Performance of CART
Fig.2.8 – Performance of Random Forest -Training
Fig.2.9 – Performance of Random Forest -Test
Fig.2.10 – ROC of Random Forest -Test
Fig.2.10 – Performance of Neural Network -training
Fig.2.11 ROC of Neural Network -training
Fig.2.11 – Performance of Neural Network -Test data
Fig.2.12 ROC of Neural Network -test
Q.2.4
Fig.2.13 Performance comparison of Various Models
Fig.2.14 Training data - ROC_AUC Curve of various models
Fig.2.15 Test data - ROC_AUC Curve of various models
Q.2.5
Executive Summary
A leading bank wants to develop a customer segmentation so that it can give
promotional offers to its customers. This project identifies those segments from
the credit card usage data collected by the bank.
Introduction
The purpose of the whole exercise is to identify customer segments based
on factors such as spending, advance payments, probability of full payment,
current balance, credit limit, minimum payment amount, and maximum amount spent
in a single purchase.
Clustering Analysis
Q 1.1
Read the data, do the necessary initial steps, and exploratory data analysis
(Univariate, Bi-variate, and multivariate analysis)
Result of Analysis:
Descriptive statistics help us understand the nature of the data: the
data type of each feature/column, the number of rows, the presence of null
values, outliers, and duplicates, the mean, mode, and standard deviation, and the
distribution of data between features. The following results of the analysis
illustrate these properties.
Fig.1 – Datatypes
Fig.2 – null values count in each column

Fig.3 – not null count
Fig.4 – Duplicate count
Fig.5 – Mean, mode standard deviation analysis

Fig.6 – Outliers in each column
Fig.7 – Pairplot
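The checks summarised in the figures above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the code attached to the report; the sample rows, the `COLUMNS` tuple, and the helper names are hypothetical, and the 1.5 × IQR boxplot rule is an assumed definition of an outlier.

```python
import statistics

# Hypothetical sample (not the bank's data): (spending, advance_payments) per customer.
COLUMNS = ("spending", "advance_payments")
rows = [
    (12.0, 13.1), (13.0, 13.5), (14.0, None), (15.0, 14.2),
    (15.0, 14.2), (16.0, 14.9), (17.0, 15.3), (40.0, 16.0),
]

def null_counts(rows):
    """Number of missing (None) values in each column."""
    return {name: sum(r[i] is None for r in rows) for i, name in enumerate(COLUMNS)}

def duplicate_count(rows):
    """Number of rows that exactly repeat an earlier row."""
    return len(rows) - len(set(rows))

def iqr_outliers(values):
    """Values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR], the usual boxplot rule."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = 1.5 * (q3 - q1)
    return [v for v in values if v < q1 - spread or v > q3 + spread]

spending = [r[0] for r in rows]
print(null_counts(rows))          # missing values per column
print(duplicate_count(rows))      # repeated rows
print(statistics.fmean(spending), statistics.stdev(spending))  # mean and spread
print(iqr_outliers(spending))     # candidate outliers
```

On a real dataset the same checks would normally be run on every column before any clustering is attempted.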
Q 1.2
Do you think scaling is necessary for clustering in this case? Justify.
Result of Analysis:
The given columns are measured in different units (see the data dictionary
below). Because clustering relies on distance-based algorithms, a column with a
larger numeric range would dominate the distance calculation. Scaling is therefore
necessary so that no column outweighs the others.
Data Dictionary for Market Segmentation:
spending: Amount spent by the customer per month (in 1000s)
advance_payments: Amount paid by the customer in advance by cash (in 100s)
probability_of_full_payment: Probability of payment done in full by the customer to the bank
current_balance: Balance amount left in the account to make purchases (in 1000s)
credit_limit: Limit of the amount in credit card (in 10000s)
min_payment_amt: Minimum amount paid by the customer while making payments for purchases made monthly (in 100s)
max_spent_in_single_shopping: Maximum amount spent in one purchase (in 1000s)
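One common way to put such columns on a comparable footing is z-score standardisation, where each value has the column mean subtracted and is divided by the column standard deviation. The sketch below assumes that technique (the report does not say which scaler was used), and the sample values and the `zscore` helper are hypothetical.

```python
import statistics

def zscore(values):
    """Rescale a column to mean 0 and population standard deviation 1."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        raise ValueError("a constant column cannot be standardised")
    return [(v - mean) / std for v in values]

# Hypothetical values in the units of the data dictionary.
spending = [12.0, 15.0, 18.0, 21.0]           # in 1000s
advance_payments = [13.0, 14.0, 15.5, 17.0]   # in 100s

scaled = {
    "spending": zscore(spending),
    "advance_payments": zscore(advance_payments),
}
for name, column in scaled.items():
    print(name, [round(v, 3) for v in column])
```

After scaling, a difference of one unit means "one standard deviation" in every column, so no single feature dominates a Euclidean distance.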
Q1.3
Apply hierarchical clustering to scaled data. Identify the number of optimum
clusters using Dendrogram and briefly describe them
Hierarchical clustering was performed on the scaled data. The optimal number of
clusters for the given bank dataset is 2. The following figure shows the clusters.
Fig.8 – Dendrogram
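To illustrate how a dendrogram suggests the number of clusters, here is a minimal sketch of agglomerative clustering. It assumes average linkage and Euclidean distance, neither of which is stated in the report, and the sample points and function names are hypothetical. The merge heights it records are the values a dendrogram plots, and a large jump between consecutive heights marks a natural place to cut the tree.

```python
import math
from itertools import combinations

def average_linkage(a, b, points):
    """Mean Euclidean distance over all pairs with one point in each cluster."""
    total = sum(math.dist(points[i], points[j]) for i in a for j in b)
    return total / (len(a) * len(b))

def agglomerate(points, n_clusters):
    """Merge the two closest clusters repeatedly until n_clusters remain.

    Returns the clusters (lists of point indices) and the merge heights.
    """
    clusters = [[i] for i in range(len(points))]
    heights = []
    while len(clusters) > n_clusters:
        # Pick the pair of clusters with the smallest linkage distance.
        x, y = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: average_linkage(clusters[p[0]], clusters[p[1]], points),
        )
        heights.append(average_linkage(clusters[x], clusters[y], points))
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]  # safe because x < y
    return clusters, heights

# Hypothetical, already scaled points forming two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
two_clusters, _ = agglomerate(points, 2)
_, all_heights = agglomerate(points, 1)
print(two_clusters)
print([round(h, 2) for h in all_heights])  # the last merge is far higher than the rest
```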
Q1.4
Apply K-Means clustering on scaled data. Identify the number of optimum
clusters using the elbow curve and silhouette score.
K-Means clustering was applied to the scaled data. The elbow curve suggests that
the given data can be split into three segments, whereas the silhouette score suggests
two segments.
In the three-segment profile below, the means of the various features/columns do not
differ much between segments.
Fig.9 – Three customer segmentation
In the two-segment profile below, the means of the various features/columns are
clearly separated.
Fig.10 – Two customer segmentation
The silhouette score is 0.46577247686580914 with two clusters and
0.40072705527512986 with three clusters.
Both the elbow and silhouette methods are used to find the optimal number of clusters.
The elbow method can be ambiguous when picking the value of k. Silhouette analysis
measures the separation distance between the resulting clusters and can be considered
the better method of the two.
Since the silhouette score is higher with two clusters, we opt for a two-cluster
segmentation.
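For reference, the silhouette coefficient of a point is (b - a) / max(a, b), where a is its mean distance to the other members of its own cluster and b is its mean distance to the nearest other cluster. The score is the average over all points. The sketch below is a hypothetical pure-Python version on made-up points, assuming Euclidean distance and at least two clusters. It is not the code behind the scores reported above.

```python
import math

def mean_silhouette(points, labels):
    """Average silhouette coefficient; needs at least two distinct labels."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)  # common convention for a single-point cluster
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Hypothetical points with two obvious groups, labelled two different ways.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
score_two = mean_silhouette(points, [0, 0, 0, 1, 1, 1])
score_three = mean_silhouette(points, [0, 0, 2, 1, 1, 1])
print(round(score_two, 3), round(score_three, 3))
```

Splitting a compact group in two lowers the score, which is the same pattern that favours two clusters over three in this analysis.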
Q 1.5
Describe cluster profiles for the clusters defined. Recommend different
promotional strategies for different clusters.
The two customer segments differ significantly in spending and advance payments.
Offers or promotions targeted at cluster 0 (see Fig.10 under the previous question)
could attract more business from those customers. At the same time, the cluster 1
customers need to be retained.
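A cluster profile of the kind discussed here is simply the mean of each feature within each cluster. The following sketch shows the calculation on hypothetical rows and labels; the numbers are illustrative only and are not the values shown in Fig.10.

```python
from collections import defaultdict
from statistics import fmean

def cluster_profile(rows, labels):
    """Return {cluster label: {feature: mean value}} for rows given as dicts."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    return {
        label: {feature: fmean(r[feature] for r in members) for feature in members[0]}
        for label, members in grouped.items()
    }

# Hypothetical customers and cluster labels (not the report's data).
rows = [
    {"spending": 12.0, "advance_payments": 13.0},
    {"spending": 14.0, "advance_payments": 14.0},
    {"spending": 18.0, "advance_payments": 16.0},
    {"spending": 20.0, "advance_payments": 17.0},
]
labels = [0, 0, 1, 1]

profile = cluster_profile(rows, labels)
for label, means in sorted(profile.items()):
    print(label, means)
```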
CART-RF-ANN
Q 2.1
Read the data, do the necessary initial steps, and exploratory data analysis
(Univariate, Bi-variate, and multivariate analysis).
Result of Analysis:
Descriptive statistics help us understand the nature of the data: the
data type of each feature/column, the number of rows, the presence of null
values, outliers, and duplicates, the mean, mode, and standard deviation, and the
distribution of data between features. The following results of the analysis
illustrate these properties.
Fig.2.1 – Column info

Fig.2.2 – Column null count info
Fig.2.3 – Metadata of all columns

Fig.2.4 – Duplicate count
Fig.2.5 – Pairplot
Fig.2.6 – Distribution plot
Q 2.2
Data Split: Split the data into test and train, build classification model CART,
Random Forest, Artificial Neural Network
The data was split into training and test sets, and classification models
were built using CART, Random Forest, and an Artificial Neural Network.
Please refer to the attached source code for details.
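As a rough idea of what the split step involves, the sketch below shuffles the rows with a fixed seed and holds out a fraction for testing. The 70/30 ratio, the seed, and the function name `split_train_test` are assumptions for illustration; the report does not state the settings used in the attached source code.

```python
import random

def split_train_test(rows, test_fraction=0.3, seed=42):
    """Shuffle a copy of the rows and hold out test_fraction of them."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical record ids standing in for policy rows.
records = list(range(10))
train, test = split_train_test(records)
print(len(train), len(test))
```

The models are fitted on the training rows only, so that the test rows give an unbiased check of how well each model generalises.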
Q.2.3
Performance Metrics: Comment and Check the performance of Predictions on Train
and Test sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score,
classification reports for each model.
The performance of the CART, Random Forest, and Artificial Neural
Network models is as follows.
Performance of CART:
Fig.2.7 – Performance of CART
Performance of Random Forest:
Fig.2.8 – Performance of Random Forest -Training

Fig.2.9 – Performance of Random Forest -Test
Fig.2.10 – ROC of Random Forest -Test
Performance of Neural Network:
Fig.2.10 – Performance of Neural Network -training

Fig.2.11 – ROC of Neural Network -training
Fig.2.11 – Performance of Neural Network -Test data

Fig.2.12 – ROC of Neural Network -test
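The metrics shown in these figures can be computed from the true labels, the predicted labels, and the predicted scores. The sketch below is a hypothetical stdlib-only version for binary labels (1 = claimed, 0 = not claimed). It computes ROC AUC through its rank interpretation, that is, the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The labels and scores are made up for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels coded as 0 and 1."""
    pairs = list(zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    return tn, fp, fn, tp

def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def roc_auc(y_true, scores):
    """Fraction of positive/negative pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted probabilities of a claim.
y_true = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]
y_pred = [int(s >= 0.5) for s in scores]  # assumed 0.5 decision threshold

print(confusion_counts(y_true, y_pred))
print(accuracy(y_true, y_pred))
print(roc_auc(y_true, scores))
```

Comparing these numbers on the training and test sets side by side also shows whether a model is overfitting.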
Q.2.4
Final Model: Compare all the models and write an inference which model is
best/optimized.
The performance of all three models (CART, RF, and ANN) and their
ROC_AUC details are given below.
Based on the performance of the models, the best model for the given
dataset is Random Forest.
Fig.2.13 – Performance comparison of various models

Fig.2.14 – Training data - ROC_AUC curve of various models
Fig.2.15 – Test data - ROC_AUC curve of various models
Q.2.5
Inference: Based on the whole Analysis, what are the business insights and
recommendations
The Claimed column of the given dataset contains both Yes and No values.
The business expectation is to reduce claims in order to maximize profit,
so we need to concentrate on the cases where Claimed is Yes.
We need to find out which channels, products, agencies, and age groups
have the most claims and focus on actions to reduce claims in
those areas. In addition, we need to increase the number of policies in
the areas with no claims.
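One way to find the groups with the most claims is to compute the claim rate per category. The sketch below assumes each policy is a dict with a Claimed field holding Yes or No, as described above. The Channel field name, its values, and the sample rows are hypothetical.

```python
from collections import Counter

def claim_rate_by(rows, key):
    """Share of policies with Claimed == "Yes" for each value of `key`."""
    totals, claims = Counter(), Counter()
    for row in rows:
        totals[row[key]] += 1
        claims[row[key]] += row["Claimed"] == "Yes"
    return {value: claims[value] / totals[value] for value in totals}

# Hypothetical policies (not the report's data).
policies = [
    {"Channel": "Online", "Claimed": "Yes"},
    {"Channel": "Online", "Claimed": "No"},
    {"Channel": "Online", "Claimed": "No"},
    {"Channel": "Offline", "Claimed": "Yes"},
    {"Channel": "Offline", "Claimed": "No"},
]

rates = claim_rate_by(policies, "Channel")
for channel, rate in sorted(rates.items(), key=lambda item: item[1], reverse=True):
    print(channel, round(rate, 2))
```

The same function can be reused with a product, agency, or age-group field to rank those categories by claim rate.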
