Credit Card Customer Segmentation by Clustering: Bennett NG Teng Seng
Abstract
A credit card is a payment card issued to cardholders that lets them pay for goods and services on credit from
the issuer, accruing debt. If this debt is paid back within the specified time interval, no interest is charged;
otherwise, interest accrues on the outstanding balance. These cards often come with a credit limit, which caps the amount of money a
lender will allow a cardholder to spend. A cardholder is not limited to any one issuer for credit cards and can approach
multiple issuers to obtain multiple credit cards.
1. Main text
This dataset consists of information on customers who hold a credit card. To segment the customer
population for business needs such as identifying loyal customers or trends, the K-Means Clustering
unsupervised model can be used. By segmenting the customers into different groups, the
bank can then use this data to support its various business decisions.
A. The Dataset
The dataset is 16 KB in size and contains 7 columns and 660 rows of data.
The table above shows the number of unique values in each column of the dataset. The number of unique
‘Customer Key’ values is lower than the number of serial numbers, which index the dataset, so some
‘Customer Key’ values are repeated, meaning that data referring to the same customer appears more than
once. To remove the repeats, I dropped the duplicate values with the Pandas function .drop_duplicates().
After dropping the repeated data, the dataset has 655 rows instead of the original 660.
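The duplicate check and removal described above can be sketched as follows. This is a minimal illustration on made-up data, not the original notebook: only the column names ‘Sl_No’ and ‘Customer Key’ come from the text, while ‘Avg_Credit_Limit’ and the choice to deduplicate on the key column (subset="Customer Key") are assumptions.

```python
import pandas as pd

# Hypothetical toy data; the values are invented for illustration.
df = pd.DataFrame({
    "Sl_No": [1, 2, 3, 4, 5],
    "Customer Key": [101, 102, 101, 103, 102],
    "Avg_Credit_Limit": [50000, 30000, 52000, 10000, 30000],  # assumed column name
})

# Fewer unique keys than serial numbers signals repeated customers.
unique_counts = df.nunique()
print(unique_counts)

# Keep the first row seen for each customer key (assumption: dedupe on the key only).
deduped = df.drop_duplicates(subset="Customer Key", keep="first").reset_index(drop=True)
print(len(df), "rows before,", len(deduped), "rows after")
```

Calling .drop_duplicates() with no arguments would only remove rows that are identical in every column, so whether a subset is needed depends on how the repeats look in the actual data.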
I then removed the columns ‘Sl_No’ and ‘Customer Key’, as they are used mainly for indexing the data
and are not needed for clustering. This makes the dataset smaller and easier to
analyze. The remaining columns in the dataset are all numerical, so no dummy
encoding was necessary.
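A short sketch of this step, again on invented values. The feature column names other than ‘Sl_No’ and ‘Customer Key’ are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data; feature column names are assumed.
df = pd.DataFrame({
    "Sl_No": [1, 2, 3],
    "Customer Key": [101, 102, 103],
    "Avg_Credit_Limit": [50000, 30000, 10000],
    "Total_visits_online": [1, 3, 2],
})

# Drop the two identifier columns, which carry no information for clustering.
features = df.drop(columns=["Sl_No", "Customer Key"])

# Confirm every remaining column is numeric, so no dummy encoding is required.
all_numeric = all(pd.api.types.is_numeric_dtype(dtype) for dtype in features.dtypes)
print(list(features.columns), all_numeric)
```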
For exploratory data analysis of this dataset, I started by using the Pandas .describe() function to
obtain a numerical summary of each column in the dataset.
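As an illustration of what .describe() returns, the sketch below runs it on a small invented frame (the column names are assumptions). The result is itself a DataFrame indexed by statistic name, so individual values can be looked up.

```python
import pandas as pd

# Hypothetical values chosen so the summary is easy to verify by hand.
df = pd.DataFrame({
    "Avg_Credit_Limit": [10000, 20000, 30000, 100000],
    "Total_visits_online": [1, 2, 3, 10],
})

# count, mean, std, min, quartiles and max for every numeric column.
summary = df.describe()
print(summary)

# Individual statistics can be read with .loc[statistic, column].
mean_limit = summary.loc["mean", "Avg_Credit_Limit"]
max_visits = summary.loc["max", "Total_visits_online"]
```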
Author name / Procedia Economics and Finance 00 (2012) 000–000
I then plotted a histogram for each column to find out the distribution of its data. For
the Average Credit Card Limit histogram, it is clear that there are many outliers: the data is right-skewed,
with most values concentrated at the lower end and few occurrences of higher credit limits. Another
interesting plot was the Total Visits Online by Bank Customer histogram. In this histogram, I could see that
most of the customers visited the bank online between 1 and 4 times, but there are outliers who visited
more than 6 times.
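The visual impressions from the histograms can be cross-checked numerically. The sketch below is not part of the original analysis: it uses pandas' sample skewness and the common 1.5 × IQR rule as an assumed outlier criterion, on invented visit counts.

```python
import pandas as pd

# Hypothetical online-visit counts; most values are small, a few are large.
visits = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 10, 15], name="Total_visits_online")

# Positive skewness means a long tail toward high values (right-skewed).
skewness = visits.skew()

# 1.5 * IQR rule (an assumed criterion, not one stated in the text).
q1, q3 = visits.quantile(0.25), visits.quantile(0.75)
upper_fence = q3 + 1.5 * (q3 - q1)
outliers = visits[visits > upper_fence]

print("skewness:", skewness)
print("upper fence:", upper_fence)
print("outliers:", outliers.tolist())
```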
Next, I decided to plot a heatmap to find out the correlation between the variables in the dataset.
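A heatmap of this kind is a rendering of the pairwise correlation matrix. The sketch below computes that matrix with pandas on synthetic data; the column names and the relationships between them are assumptions built into the example and say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100

# Synthetic features with relationships planted on purpose for the demo.
limit = rng.normal(50000, 20000, n)
cards = limit / 10000 + rng.normal(0, 1, n)   # rises with the credit limit
calls = 10 - cards + rng.normal(0, 1, n)      # falls as card count rises

df = pd.DataFrame({
    "Avg_Credit_Limit": limit,
    "Total_Credit_Cards": cards,
    "Total_calls_made": calls,
})

# Pearson correlation between every pair of columns.
corr = df.corr()
print(corr.round(2))
```

Passing corr to a plotting function such as seaborn.heatmap would render it as a colored grid.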
I then scaled the data, as the variables in the dataset have different units and ranges. This affects a model
like K-Means Clustering, which relies on the Euclidean distance between points. Scaling lets all
features contribute to the model equally, rather than favoring the features with the largest magnitudes. I scaled the
data with StandardScaler() from sklearn.preprocessing.
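A minimal sketch of the scaling step on invented values. StandardScaler subtracts each column's mean and divides by its (population) standard deviation, so every feature ends up with mean 0 and unit variance.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical rows: [credit limit, number of online visits].
X = np.array([
    [10000.0, 2.0],
    [50000.0, 5.0],
    [100000.0, 8.0],
])

# Fit on the data and transform it in one step.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled)
print("column means:", X_scaled.mean(axis=0))
print("column stds:", X_scaled.std(axis=0))
```

Without this step, the credit limit column would dominate every Euclidean distance simply because its values are thousands of times larger.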
To find an appropriate number of clusters with which to instantiate the K-Means Clustering model,
I first made use of an elbow plot. The elbow plot fits the KMeans model to the data using different
numbers of clusters and plots, for each, the sum of squared distances of samples to their closest cluster center,
called ‘inertia’ in scikit-learn's K-Means model.
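The values behind an elbow plot can be computed as in the sketch below. It uses synthetic blobs with three planted groups rather than the credit card data, and the range of k (1 to 8), n_init and random_state are assumptions chosen for the demo.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three well-separated groups (not the real dataset).
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=1.0, random_state=0)

# Fit one model per candidate k and record its inertia.
ks = list(range(1, 9))
inertias = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(model.inertia_)

for k, inertia in zip(ks, inertias):
    print(k, round(inertia, 1))

# The elbow is where the drop in inertia flattens out.
drops = [inertias[i] - inertias[i + 1] for i in range(len(inertias) - 1)]
```

Plotting inertias against ks gives the elbow plot itself; on this synthetic data the drop from k=2 to k=3 is far larger than the drop from k=3 to k=4.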
From the elbow plot, I can tell that the optimal number of clusters is 3, which is where
the bend of the elbow forms in the plot.
I then instantiated a K-Means model with 3 clusters and fitted it to the scaled data. I saved the
labels that the model assigned back to the original, unscaled data, so that the cluster means would
be easier to interpret than on the scaled data. I then made boxplots to interpret what the clusters meant in
the context of each feature and derived the characteristics of each cluster.
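The fit-then-label workflow can be sketched as follows. The data is synthetic, with three planted groups of 20 customers each; the column names, n_init and random_state are assumptions, and the resulting profile describes only this toy data.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Three hypothetical customer groups of 20 rows each.
df = pd.DataFrame({
    "Avg_Credit_Limit": np.concatenate([
        rng.normal(12000, 1000, 20),
        rng.normal(35000, 3000, 20),
        rng.normal(140000, 10000, 20),
    ]),
    "Total_visits_online": np.concatenate([
        rng.normal(4, 0.5, 20),
        rng.normal(1, 0.3, 20),
        rng.normal(11, 1.0, 20),
    ]),
})

# Cluster on the scaled features.
scaled = StandardScaler().fit_transform(df)
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(scaled)

# Attach labels to the unscaled frame so group means stay in original units.
df["cluster"] = model.labels_
profile = df.groupby("cluster").mean()
print(profile.round(1))
```

From here, a call such as df.boxplot(column="Avg_Credit_Limit", by="cluster") would draw one box per cluster for a given feature.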
1.5. Conclusion
With the bank's customer population segmented into 3 distinct categories, the bank can use
this information to support business decisions such as offering promotions or loyalty
programs. With a relatively simple model fitted to the data, businesses like the bank can derive
insights that can ultimately improve their operations. Hence, I would recommend that more businesses make
use of Machine Learning to gain insights and improve in an easy and cost-effective manner.
References
En.wikipedia.org. 2021. Credit card - Wikipedia. [online] Available at: <https://en.wikipedia.org/wiki/Credit_card> [Accessed 13 August 2021].
Kagan, J. and Brock, T., 2021. What Is a Credit Limit?. [online] Investopedia. Available at: <https://www.investopedia.com/terms/c/credit_limit.asp> [Accessed 13 August 2021].
Scikit-learn.org. 2020. sklearn.cluster.KMeans — scikit-learn 0.24.2 documentation. [online] Available at: <https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html> [Accessed 13 August 2021].
Ali, A., 2021. Unsupervised Learning - Credit Card Customers. [online] Kaggle.com. Available at: <https://www.kaggle.com/anasmjali/unsupervised-learning-credit-card-customers> [Accessed 13 August 2021].