
Available online at www.sciencedirect.com

Procedia Economics and Finance 00 (2012) 000–000

Credit Card Customer Segmentation by Clustering


Bennett Ng Teng Seng1

Abstract

A credit card is a payment card issued to cardholders that enables them to pay for goods and services using credit from
the issuer, accruing debt. This debt can be paid back within a specified time interval to avoid interest; otherwise,
additional interest accrues. These cards usually come with a credit limit, which is the maximum amount of money the
lender will allow a cardholder to spend. A cardholder is not limited to any one issuer and can approach
multiple issuers to obtain multiple credit cards.

Keywords: credit card; clustering; machine learning; kmeans

1. Main text

This dataset consists of information about customers who hold a credit card. To segment the customer
population for business needs such as identifying loyal customers or trends, the unsupervised K-Means
clustering model can be used. By segmenting the customers into different groups, the bank can then use
this data to support its various business decisions.

A The Dataset

The dataset is 16 KB in size and contains 7 columns and 660 rows of data.

Index  Column Name          Column Description                            Data Type  Count

1      Sl_No                Customer serial number, to index the values   int64      660
2      Customer Key         Customer Identification                       int64      660
3      Avg_Credit_Limit     Average Credit Card Limit For The Customer    int64      660
4      Total_Credit_Cards   Total Credit Cards Owned by the Customer      int64      660
5      Total_visits_bank    Total Number of Bank Visits by the Customer   int64      660
6      Total_visits_online  Total Visits Online by the Bank Customer      int64      660
7      Total_calls_made     Total Calls Made by the Customer to the Bank  int64      660

1.1. Preparation of Data

Index  Column Name          Unique Values

1      Sl_No                660
2      Customer Key         655
3      Avg_Credit_Limit     110
4      Total_Credit_Cards   10
5      Total_visits_bank    6
6      Total_visits_online  16
7      Total_calls_made     11

The table above shows the number of unique values in each column of the dataset. ‘Customer Key’ has fewer
unique values (655) than the serial number (660), which indexes the dataset, so some ‘Customer Key’ values
are repeated. This means that data referring to the same customer appears more than once in the dataset. To
remove it, I dropped the duplicate values with the Pandas function .drop_duplicates(). After dropping the
repeated data, the dataset has 655 rows instead of the previous 660.
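
The text does not state which arguments were passed to .drop_duplicates(). Because ‘Sl_No’ is unique for every row, a whole-row comparison would find nothing to drop, so the sketch below assumes the comparison was restricted to ‘Customer Key’ and that the first occurrence was kept. The small data frame is made up purely for illustration.

```python
import pandas as pd

# Toy stand-in for the real data; the values are invented for illustration.
df = pd.DataFrame({
    "Sl_No": [1, 2, 3, 4, 5],
    "Customer Key": [101, 102, 101, 103, 102],
    "Avg_Credit_Limit": [10000, 50000, 12000, 30000, 50000],
})

# Comparing whole rows removes nothing, because Sl_No differs on every row.
full_row = df.drop_duplicates()

# Assumption: duplicates are identified by 'Customer Key' alone,
# and the first row seen for each customer is the one that is kept.
deduped = df.drop_duplicates(subset="Customer Key", keep="first")

print(len(df), len(full_row), len(deduped))
```

Whether to keep the first or the last record for a repeated customer is a judgment call that depends on which record is more current; the source does not say which was chosen.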

I then removed the columns ‘Sl_No’ and ‘Customer Key’, as they only index the data and are not needed
for clustering. This makes the dataset smaller and easier to analyze. The remaining columns in the
dataset are all numerical, so no dummy encoding was necessary.
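
As a rough sketch of this step, the identifier columns can be dropped by name and the remaining data types checked to confirm that nothing needs encoding. The column names follow the table above; the values are invented.

```python
import pandas as pd

# Toy rows for illustration only; not taken from the real dataset.
df = pd.DataFrame({
    "Sl_No": [1, 2, 3],
    "Customer Key": [101, 102, 103],
    "Avg_Credit_Limit": [10000, 50000, 30000],
    "Total_Credit_Cards": [2, 6, 4],
})

# Drop the two identifier columns; they carry no information for clustering.
features = df.drop(columns=["Sl_No", "Customer Key"])

# List any column that is not numeric and would therefore need encoding.
non_numeric = features.select_dtypes(exclude="number").columns.tolist()

print(list(features.columns), non_numeric)
```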

1.2. Exploratory Data Analysis

For the exploratory data analysis of this dataset, I started by using the Pandas .describe() function to
obtain a numerical summary of each column in the dataset.
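
A minimal illustration of what .describe() returns, using invented values rather than the real data:

```python
import pandas as pd

# Toy values for illustration only; they are not taken from the real dataset.
features = pd.DataFrame({
    "Avg_Credit_Limit": [10000, 20000, 30000, 40000, 100000],
    "Total_Credit_Cards": [2, 3, 5, 6, 9],
})

# Rows of the summary: count, mean, std, min, 25%, 50%, 75%, max.
summary = features.describe()
print(summary)

# A mean well above the median (the "50%" row) hints at a long right tail.
mean_limit = summary.loc["mean", "Avg_Credit_Limit"]
median_limit = summary.loc["50%", "Avg_Credit_Limit"]
print(mean_limit, median_limit)
```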
I then plotted a histogram for each column to see the distribution of its data. For the Average Credit
Card Limit histogram, it is clear that there are many outliers: the data is right-skewed, with most
customers at the lower end and only a few occurrences of higher credit limits. Another interesting plot
was the Total Visits Online by Bank Customer histogram. It shows that most customers visited the bank
online between 1 and 4 times, but there are outliers who visited more than 6 times.
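
The shape described here can also be checked numerically. The sketch below bins an invented set of credit limits and computes the sample skewness; a positive value corresponds to a long tail toward higher limits. The numbers are assumptions chosen to mimic the shape described, not values from the dataset.

```python
import numpy as np
import pandas as pd

# Invented credit limits: many low values and a few very high ones.
limits = pd.Series(
    [5000, 8000, 10000, 12000, 15000, 18000, 20000, 100000, 150000, 200000],
    name="Avg_Credit_Limit",
)

# Histogram counts over 4 equal-width bins between the min and the max.
counts, edges = np.histogram(limits, bins=4)
print(counts.tolist())

# Sample skewness: positive means the tail extends toward high values.
skewness = limits.skew()
print(skewness > 0)
```

With matplotlib installed, `limits.hist()` would draw the same bins as a plot.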
Next, I decided to plot a heatmap to find out the correlation between the variables in the dataset.
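
A heatmap of this kind is usually drawn from a correlation matrix. The sketch below computes the Pearson correlation matrix with Pandas on invented values; the source does not say which plotting library or correlation method was used, so both are assumptions.

```python
import pandas as pd

# Invented values for illustration; column names follow the dataset table.
features = pd.DataFrame({
    "Avg_Credit_Limit": [10000, 12000, 30000, 35000, 140000, 150000],
    "Total_Credit_Cards": [2, 1, 6, 5, 9, 10],
    "Total_calls_made": [8, 9, 3, 2, 1, 0],
})

# Pairwise Pearson correlation (the Pandas default) between all columns.
corr = features.corr()
print(corr.round(2))
```

The resulting square matrix can then be passed to a plotting function such as seaborn's `heatmap` to color each cell by its correlation value.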

1.3. Scaling of data

I then scaled the data, as the variables in the dataset are measured in different units and on different
scales. This affects a model like K-Means Clustering because it relies on the Euclidean distance between
points. Scaling allows all features to contribute to the model equally, rather than favoring the features
with the largest ranges. I scaled the data with the StandardScaler() from sklearn.preprocessing.
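
To make the effect of the scaler concrete, the sketch below standardizes two invented columns and compares the result with the manual formula (x - mean) / std, using the population standard deviation. The values are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Invented values: one column in the tens of thousands, one in single digits.
features = pd.DataFrame({
    "Avg_Credit_Limit": [10000, 30000, 50000, 150000],
    "Total_Credit_Cards": [2, 4, 6, 9],
})

# fit_transform learns each column's mean and standard deviation,
# then returns the standardized values as a NumPy array.
scaler = StandardScaler()
scaled = scaler.fit_transform(features)

# The same computation by hand; StandardScaler uses the population
# standard deviation, hence ddof=0.
manual = (features - features.mean()) / features.std(ddof=0)

print(np.allclose(scaled, manual.to_numpy()))
```

After scaling, a difference of one standard deviation in credit limit counts the same as one standard deviation in number of cards when Euclidean distances are computed.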

1.4. Implementing K-Means Model

First, to find the appropriate number of clusters with which to instantiate the K-Means Clustering
model, I made use of the elbow plot. The elbow plot fits the KMeans model to the data using different
numbers of clusters and plots, for each, the sum of squared distances of samples to their closest cluster
center, called ‘inertia’ in sklearn's KMeans model.
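
A minimal sketch of how the values behind an elbow plot can be collected. The data here are three synthetic, well-separated blobs generated for illustration, and the range of cluster counts, `n_init` and `random_state` are assumptions rather than the settings used in the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: three well-separated blobs of 30 points each.
rng = np.random.default_rng(0)
centers = [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)]
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(30, 2)) for c in centers])

# Fit one model per candidate k and record its inertia, i.e. the sum of
# squared distances from each sample to its closest cluster center.
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 1))
```

Plotting these values against k gives the elbow plot. On data like this the inertia falls sharply up to k = 3 and only slightly afterwards, which is the bend the method looks for.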
From the elbow plot, I can tell that the optimal number of clusters is 3, which is where the bend of
the elbow is formed in the plot.
I then instantiated a K-Means model with 3 clusters and fitted it to the scaled data. I then saved the
labels that the model had assigned back to the original data, so that the labels and means would be
easier to interpret than on the scaled data. I then made boxplots to interpret what the clusters meant in
the context of each feature and came up with the characteristics of each cluster.
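
The fit-then-label step described above might look roughly like the following. The nine rows are invented to form three obvious groups, and `n_init` and `random_state` are assumptions; the cluster numbers that K-Means assigns are arbitrary, so only the group profiles are meaningful.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Invented customers forming three obvious groups (illustration only).
features = pd.DataFrame({
    "Avg_Credit_Limit": [10000, 12000, 14000, 28000, 30000, 32000,
                         140000, 145000, 150000],
    "Total_Credit_Cards": [1, 2, 3, 5, 6, 7, 8, 9, 10],
})

# Cluster on the scaled values so both columns carry equal weight.
scaled = StandardScaler().fit_transform(features)
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(scaled)

# Attach the labels to the unscaled data so the group means stay in
# their original units (dollars, number of cards).
labeled = features.copy()
labeled["Group"] = model.labels_
profile = labeled.groupby("Group").mean()

print(profile)
```

Boxplots of each feature split by the `Group` column show the same information as `profile`, together with the spread inside each group.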

From the boxplots we can deduce the following:

• Group 0
  o Have the lowest average credit limit on their credit cards (~$12,000)
  o Have the lowest average number of credit cards owned (~2 cards)
  o Tend to call the bank more than they visit it online or in person
• Group 1
  o Have the middle average credit limit on their credit cards (~$30,000)
  o Have the middle average number of credit cards owned (~6 cards)
  o Tend to prefer visiting the bank in person over visiting online or making calls
• Group 2
  o Have the highest average credit limit on their credit cards (~$145,000)
  o Have the highest average number of credit cards owned (~9 cards)
  o Tend to prefer visiting the bank online over visiting in person or making calls

1.5. Conclusion

With the bank's customer population segmented into 3 distinct groups, the bank can use this
information to support business decisions such as offering promotions or loyalty programs. With a
relatively simple model fitted to the data, businesses such as the bank can derive many insights that
can ultimately improve the business. Hence, I would recommend that more businesses make use of
Machine Learning to gain insights and improve their businesses in an easy and cost-effective manner.
References
En.wikipedia.org. 2021. Credit card - Wikipedia. [online] Available at: <https://en.wikipedia.org/wiki/Credit_card> [Accessed 13 August 2021].
Kagan, J. and Brock, T., 2021. What Is a Credit Limit?. [online] Investopedia. Available at: <https://www.investopedia.com/terms/c/credit_limit.asp> [Accessed 13 August 2021].
Scikit-learn.org. 2020. sklearn.cluster.KMeans — scikit-learn 0.24.2 documentation. [online] Available at: <https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html> [Accessed 13 August 2021].
Ali, A., 2021. Unsupervised Learning - Credit Card Customers. [online] Kaggle.com. Available at: <https://www.kaggle.com/anasmjali/unsupervised-learning-credit-card-customers> [Accessed 13 August 2021].
