RFM Model For Customer Purchase Behaviour Using K-Means Algorithm

A Technical Seminar Report

16EC82
Submitted by,
Shubhankar 1RV16EC155
Siddhartha Bhaumik 1RV16EC156
Under the guidance of

Dr. Nagaraj Bhat
Assistant Professor
Dept. of ECE
RV College of Engineering
In partial fulfillment of the requirements for the degree of

Bachelor of Engineering in
Electronics and Communication Engineering
2020-2021
RV College of Engineering®, Bengaluru
(Autonomous institution affiliated to VTU, Belagavi)
Department of Electronics and Communication Engineering
CERTIFICATE
Certified that the Technical Seminar titled RFM model for Customer Purchase be-
haviour Using K-Means Algorithm is carried out by Shubhankar (1RV16EC155)
and Siddhartha Bhaumik (1RV16EC156) who are bonafide students of RV College
of Engineering, Bengaluru, in partial fulfillment of the requirements for the degree of
Bachelor of Engineering in Electronics and Communication Engineering of the
Visvesvaraya Technological University, Belagavi during the year 2020-2021. It is cer-
tified that all corrections/suggestions indicated for the Internal Assessment have been
incorporated in the Technical Seminar report deposited in the departmental library. The
Technical Seminar report has been approved as it satisfies the academic requirements in
respect of Technical Seminar work prescribed by the institution for the said degree.
Signature of Guide Signature of Head of the Department Signature of Principal
Dr. Nagaraj Bhat Dr. K S Geetha Dr. K. N. Subramanya
External Viva
Name of Examiners Signature with Date
1.
2.
DECLARATION
We, Shubhankar and Siddhartha Bhaumik students of eighth semester B.E., De-
partment of Electronics and Communication Engineering, RV College of Engineering,
Bengaluru, hereby declare that the Technical Seminar titled ‘RFM model for Cus-
tomer Purchase behaviour Using K-Means Algorithm’ has been carried out by us
and submitted in partial fulfilment for the award of degree of Bachelor of Engineering
in Electronics and Communication Engineering during the year 2020-2021.
Further we declare that the content of the dissertation has not been submitted previously
by anybody for the award of any degree or diploma to any other university.
We also declare that any Intellectual Property Rights generated out of this project carried
out at RVCE will be the property of RV College of Engineering, Bengaluru, and that we will
be among the authors of the same.
Place: Bengaluru
Date:
Name Signature
1. Shubhankar (1RV16EC155)
2. Siddhartha Bhaumik (1RV16EC156)
ACKNOWLEDGEMENT
We are indebted to our guide, Dr. Nagaraj Bhat, Assistant Professor, RV College
of Engineering, for the wholehearted support, suggestions and invaluable advice throughout
our Technical Seminar and for the help in the preparation of this thesis.
We also express our gratitude to our examiner Dr. Kiran V., Associate Professor,
Department of Electronics and Communication Engineering, for the valuable comments
and suggestions.
Our sincere thanks to Dr. K S Geetha, Professor and Head, Department of Elec-
tronics and Communication Engineering, RVCE for the support and encouragement.
We express sincere gratitude to our beloved Principal, Dr. K. N. Subramanya, for
the appreciation towards this technical seminar work.
We thank all the teaching staff and technical staff of Electronics and Communication
Engineering department, RVCE for their help.
Lastly, we take this opportunity to thank our family members and friends who pro-
vided all the backup support throughout the project work.
ABSTRACT
Clustering is the method of grouping a set of objects so that the objects within the
same group, called a cluster, are more similar to each other in a certain sense than to
the objects in other groups or clusters. The main tasks that accompany clustering in this
work are Univariate Analysis and Exploratory Data Analysis.
The project evaluates the performance of Customer Segmentation performed on a set of
data acquired from different places. The RFM analysis provides the different scenarios
needed for deriving insights and strategies in marketing planning. The K-Means algorithm
is the main algorithm employed in this project, with both PCA and Non-PCA processing used
to obtain the most suitable model for K-Means. The clusters group the data so that
different strategies can be planned for each group. Using Python packages (Yellowbrick,
scikit-learn and pandas) together with PCA, the dataset is visualised as distribution
plots and then converted into clusters.
The main objective of the project is to apply business intelligence to marketing models
in order to identify potential customers by providing relevant and timely data to
marketing entities in the retail industry. One dataset is considered for simulation and
is processed for the different analyses in the clustering stage. The two main algorithms
considered are the K-Means and Hierarchical algorithms, which are used to find the number
of clusters in the data and to provide the insight required.
Segmenting customers based on their buying pattern, though strategically important, is a
difficult task. Customer retention is another major concern for both online and physical
enterprises. In this work, the RFM model is implemented on synthetic and real data sets
to analyse different customer segmentation behaviour. Based on the Silhouette Score, the
Sales Recency, Sales Frequency and Sales Monetary values are analysed and an optimal
solution is found and used. The resulting clusters provide insights and strategies for
targeting marketing towards certain customers. They also allow a targeted strategy to be
devised for each customer group, which can then be refined for future scenarios.
CONTENTS
Abstract
List of Figures
List of Tables
Abbreviations
1 Introduction to RFM Analysis and Customer Segmentation
1.1 Introduction
1.2 Problem statement
1.3 Motivation
1.4 Objectives
1.5 Literature Survey
1.6 Brief Methodology
1.7 Assumptions made / Constraints of the project
1.8 Organization of the report
2 Fundamentals of RFM Analysis and Customer Segmentation
2.1 Fundamentals of Customer Segmentation
2.2 Fundamentals of RFM Analysis
2.2.1 Frequency
2.2.2 Recency
2.2.3 Monetary
2.2.4 Why is RFM better than Other Segmentation Methods?
2.3 Clustering Algorithm
2.3.1 K-Means Algorithm
2.3.2 Density Based Spatial Clustering Applications with Noise Algorithm (DBSCAN)
2.3.3 Principal Component Analysis (PCA)
2.3.4 Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) Algorithm
2.3.5 Affinity propagation algorithm
3 Implementation of RFM Analysis and Customer Segmentation
3.1 Design for Data Analysis
3.1.1 Selection of Clustering Algorithm
3.1.2 Proposed Methodology
3.1.2.1 Exploratory Analysis and Data Pre-processing
3.1.2.2 Execution of RFM Analysis
3.1.2.3 K-Means with Euclidean distance
3.1.3 Mathematical Model
3.1.4 Data set description used in the analysis
3.2 Steps involved in Customer Marketing Strategy
3.2.1 Data Pre-processing
3.2.2 Normalization of RFM model indices
3.2.3 Indicator Weight Analysis
3.2.4 Clustering the Customers by K-Means Algorithm
4 Results and Analysis
4.1 Univariate Analysis
4.2 Exploratory Data Analysis
4.3 Data Pre-processing
4.4 Clustering
4.4.1 Clustering (Non-PCA)
4.4.1.1 K-Means Clustering
4.4.1.2 Hierarchical Clustering
4.4.2 Clustering (PCA)
4.4.2.1 K-Means Clustering
4.4.2.2 Hierarchical Clustering
4.5 Model Deployment
4.6 Insights and Strategies
5 Conclusion and Future Scope
5.1 Conclusion
5.2 Future Scope
LIST OF FIGURES
1.1 RFM analysis framework [1]
2.1 Description of K-Means Clustering [11]
2.2 DBSCAN clustering [11]
2.3 PCA 1st and 2nd dimension clustering [2]
2.4 Affinity propagation algorithm framework [11]
3.1 Steps Involved in Methodology For Data Analysis [1]
3.2 Proposed method For Customer Marketing Strategy [2]
4.1 Interaction of Total Bank visits
4.2 Interaction of Total Online Visits
4.3 Interaction of Total Calls Made
4.4 Data Distribution of Recency Analysis
4.5 Data Distribution of Inverse Recency Analysis
4.6 Data Distribution of Frequency Analysis
4.7 Data Distribution of Monetary Analysis
4.8 Visualization of Scaled Values
4.9 Clustering Plot of Recency, Frequency and Monetary
4.10 Elbow Method to determine the Values of K
4.11 Silhouette Analysis for K=2 Clusters for Recency, Frequency and Monetary Plot
4.15 Hierarchical Clustering at K=3
4.16 Clustering Plot of PC1 vs PC2
4.17 Elbow Method to determine the Values of K
4.22 Hierarchical Clustering at K=3
4.23 Data Clusters Plotted Based on Non-PCA Method
4.24 Data Clusters Plotted Based on PCA Method
4.25 Final Data Clusters Plotted for our dataframe
4.26 KDE plot for each of the Clusters
LIST OF TABLES
4.1 Univariate Analysis on set of Data
4.2 Univariate Analysis for Different interaction
4.3 Data Correlation of the Features
4.4 Data frame of RFM
4.5 RFM average of the data
4.6 PCA after feature selection
4.7 Concatenating original data between PCA and Non-PCA
4.8 For fitting the PCA model for pipeline
ABBREVIATIONS
AHP Analytic Hierarchy Process
BIRCH Balanced Iterative Reducing and Clustering using Hierarchies
CRISP-DM Cross Industry Standard Process for Data Mining
CRM Customer Relationship Management
DBSCAN Density Based Spatial Clustering of Applications with Noise
EDA Exploratory Data Analysis
GRU Gated Recurrent Unit
KDE Kernel Density Estimation
PCA Principal Component Analysis
PCI Partition Coefficient Index
RFM Recency, Frequency, Monetary
SME Small and Medium-sized Enterprises
VIP Very Important Person
RV College of Engineering ®, Bengaluru - 560059
Chapter 1
Introduction to RFM Analysis and
Customer Segmentation
Department of Electronics and Communication Engineering, 2020-2021

1.1 Introduction
In data segmentation, customers are divided into sets of individuals with distinct
similarities. Attributes relevant to customer segmentation include gender, age, lifestyle,
location, income and purchase behaviour. Such attributes are mainly categorized based on
historical purchasing behaviour that can lead to a specific outcome, for example, an
increase in sales and profit for the company.
With ever-growing competition and the increasing complexity of the business environment,
segmentation and its systematic study improve customer loyalty and strengthen long-lasting
relationships at the enterprise level by widening the profitable customer database. The
two most prominent types of segmentation used with the K-Means Algorithm are based on
Qualitative and Quantitative insights. In the scope of the current study, Quantitative
insight is used for the purpose of segmentation clustering.
Well-defined customer segmentation helps in the effective allocation of marketing
resources, enables companies to target specific groups of customers and helps in building
healthy long-term relationships with customers. One of the major industries in which
customer segmentation and data mining can be applied is the Retail Industry, because it
involves a vast amount of data on sales, transportation, consumption ratio, redelivery
service and many others. Retail data mining also helps in identifying and effectively
mapping customer behaviour and related patterns during the entire life-cycle of business
transactions. This ultimately leads to improved customer service and more effective sales
and distribution strategies. This work mainly focuses on tracking the historical
purchasing behaviour of customers with the aim of finding the maximum amount of sales
possible in a specific area. Based on the statistical results and indicators, companies
in the retail industry can design various sales and marketing strategies, such as
promotional campaigns, seasonal discounts or sales-enabling coupons, to increase sales
and improve customer retention.

Figure 1.1: RFM analysis framework [1]
Figure 1.1 shows the RFM framework structure. The analysis is done accordingly.
To achieve the above objectives, customer clustering and segmentation are carried out
using the K-Means algorithm, based on RFM values for different regions. RFM is a customer
segmentation analysis that gives information not only on how frequently a customer
purchases, but also on how recent the last purchase was and on the profit it brought.
Initially, the clusters are evaluated using Silhouette Analysis of Recency vs. Monetary,
using K-Means for a varying number of clusters. This is followed by Silhouette Analysis
of Frequency vs. Monetary, again using K-Means for different numbers of clusters.
Silhouette Analysis is a method to evaluate or validate clusters. Validity can be measured
by cohesion, by separation, or by a combination of both. In the present work, the
Silhouette Coefficient combines both cohesion and separation.
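The way the Silhouette Coefficient combines cohesion and separation can be illustrated with a short sketch. For each point, a is the mean distance to the other members of its own cluster (cohesion), b is the smallest mean distance to the members of any other cluster (separation), and the coefficient is (b - a) / max(a, b). The function name and the use of plain Euclidean distance are assumptions made for illustration only; this is not the implementation used in the project.

```python
from math import dist
from statistics import mean


def silhouette_samples(points, labels):
    """Return the silhouette coefficient of every point (illustrative sketch)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette analysis needs at least two clusters")

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # common convention for a single-member cluster
            continue
        # Cohesion: mean distance to the other members of the same cluster
        # (the distance from p to itself is 0, so it does not affect the sum)
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # Separation: smallest mean distance to the members of another cluster
        b = min(mean(dist(p, q) for q in members)
                for other, members in clusters.items() if other != lab)
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return scores
```

Averaging the returned values gives a single score per clustering. Values close to 1 indicate compact, well-separated clusters, so the score can be compared across different numbers of clusters.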
1.2 Problem statement

In data segmentation, customers are divided into sets of individuals with distinct
similarities. The issue in data segmentation is that the amount of data present cannot
be processed through conventional procedures within a bounded time. The data segmentation
is therefore done through K-Means clustering, with PCA and Non-PCA methods, and
Silhouette Analysis.
1.3 Motivation
Several analytical methods are available when working with customer segmentation.
RFM analysis helps in the formation of customer segments. Customer segmentation is
the process of identifying a group of customers who share similar characteristics. By
creating customer segments, a store can provide customized product promotions to those
who are interested in them. RFM analysis plays a significant role in determining the
relation of a customer with the company. By effective analysis of RFM, a store can
attract customers by realizing their needs. It can increase customer engagement by giving
custom product suggestions in accordance with their interests. It can help to retain good
customers by applying appropriate marketing strategies for them as well as for those who
are not loyal customers yet.
1.4 Objectives
The objectives of the project are:
1. To perform Univariate Analysis on the dataset and visualize the data.
2. To perform K-Means clustering to find the level of Customer Segmentation for
different values of K.
3. To perform Silhouette Analysis at each level of clustering and formulate the
customer purchase behaviour.
1.5 Literature Survey

The objective of the study [1] is to apply business intelligence in identifying potential
customers by providing relevant and timely data to business entities in the Retail Industry.
The data furnished is based on systematic study and scientific applications in analysing
sales history and purchasing behaviour of the consumers. The curated and organized
data as an outcome of this scientific study not only enhances business sales and profit,
but also equips with intelligent insights in predicting consumer purchasing behaviour and
related patterns.
For efficient segmentation, the customers of an enterprise are categorized into groups of
similar behaviour based on their RFM (Recency, Frequency and Monetary) values. In the
journal [2], the transactional data of a company is analysed over a specific period.
Segmentation gives a good understanding of the needs of the customers and helps in
identifying the potential customers of the company. Dividing the customers into segments
also increases the revenue of the company.

Data Mining applied to the field of commercialization makes it possible, among other
aspects, to discover patterns of behaviour in clients, which companies can use to create
marketing strategies addressed to their different types of clients. The research [3]
focused on a database to which the CRISP-DM methodology was applied for the Data Mining
process. The database used corresponded to the SME sector and referred to customers and
sales, and the analysis was made based on the data pattern.
Identifying the patient types with different economic values can be useful for hospital
development. The work [4] uses the theory of customer relationship management (CRM)
to analyse the outpatients in the hospital for infectious diseases in Shanghai, China.
The RFM described in the journal [5] is designed for robustness, flexibility and
ease-of-use (particularly by the non-expert), and no claims are made for superior
accuracy, or indeed novelty, compared to other line-by-line codes. Its main limitations
at present are a lack of scattering and simplified modelling of surface reflectance and
line-mixing.
The paper [6] argues that identifying critical products and key customers to strengthen
company performance is vitally important in the digital transformation era. Critical
products are the item sets that are preferred by VIP customers and yet not popular among
ordinary customers. As a result, critical products should be kept on the shelf even though
their sales volumes are lower than those of other popular items.
The research [7] notes that recency, frequency, and monetary (RFM) models are widely
used to estimate customer value. However, they are based on the customer perspective
and do not take the product perspective into account. An RFM per product (RFM/P)
model is proposed to first estimate customer values per product and then aggregate them
to obtain the overall customer value. Empirical applications for a financial services
company and a supermarket demonstrate that RFM/P opens up the possibility to combine
customer and product perspectives.
The journal [8] focuses on the privacy problem of an outsourced k-means clustering
scheme for two parties. In particular, each party's data is encrypted only once and then
stored in the cloud. The proposed privacy-preserving k-means collaborative clustering
protocol is executed mainly at the cloud, with O(k(m + n)) rounds of interactions
among the two parties and the cloud, where m and n represent the total numbers of
records for the two parties, respectively.
Customers' portfolios can be analysed from customer transactions. With customer
transaction data, the company can find out which customers are potential customers and
which customers give less value to the company. The journal [9] analyses customer
transaction data by clustering customers with the C-means algorithm using an RFM
model. The number of clusters formed is validated by the PCI model, and ranking is done
by multiplying by the AHP weight to find the lifetime value of the customer, so that it
can be known which customer group gives high value.
A combined online-learning model with K-means clustering and gated recurrent unit
(GRU) neural networks for trajectory prediction is proposed in the journal [10]. In the
new model, the K-means clustering algorithm is used to adaptively cluster the trajectory
points, so that points with higher similarity are grouped into the same cluster.
1.6 Brief Methodology

In the project, exploratory data analysis (EDA) refers to the initial exploration of data
in order to extract or discover patterns with the help of statistics or graphical
representations. In this activity, EDA helps in identifying unique customers, the
percentage of orders by the top 10 or more, information about the data, mismatches in
description and stock code, and null values.
After the data is pre-processed, the recent transactions, the purchase frequency and the
amount spent by the customers are determined. To create the recency variable, a reference
date is chosen, namely one day prior to the last transaction. RFM analysis is a very
popular customer segmentation and identification technique in database marketing.
Clustering using the K-means algorithm is a method of unsupervised learning used for
data analysis. The algorithm identifies 'K' centroids from the dataset 'D' and assigns
each data point to the nearest of the K non-overlapping clusters. K-means aims for
clusters in which the intra-cluster distance is small compared to the inter-cluster
distance.
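The objective that K-means works towards can be written compactly. The symbols C_j (the set of points assigned to cluster j), mu_j (the centroid of cluster j) and x_i (a data point of D) are notation introduced here for illustration. K-means searches for the assignment that makes the within-cluster sum of squared distances J as small as possible:

```latex
J = \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^{2},
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```

A small J means that points lie close to their own centroid, which is the sense in which the intra-cluster distance is kept low.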
Insights and strategies are formulated with the help of the data clusters that are
generated. The data are distributed into 3 different clusters to obtain different
customer types.

1.7 Assumptions made / Constraints of the project
RFM analysis is one of the main segmentation techniques used in the analysis of customer
purchase behaviour. In the project, the aim is to devise a solution for the drawbacks
of segmentation techniques by creating clusters and predicting customer behaviour
through different insights and strategies for each of them.
1.8 Organization of the report

This report is organized as follows.
Chapter 1 contains the introduction to Customer Segmentation. It also includes the
motivation, problem statement, objectives, literature review and brief methodology
of the project.
Chapter 2 explains the theory and fundamentals of RFM analysis with respect to
Customer Segmentation. It also explains the fundamentals of Customer Segmentation.
Chapter 3 explains the implementation and methodology of the process at every step
of RFM analysis.
Chapter 4 presents the results obtained during the simulation of the data with respect
to Customer Analysis, including the clusters generated for Customer Segmentation.
Chapter 5 gives the conclusion, which discusses the inferences drawn from the project,
followed by the future scope of the work.

Chapter 2
Fundamentals of RFM Analysis and
Customer Segmentation

The RFM Model of customer value uses proven marketing principles to help businesses
differentiate between marketing to existing and new users, and helps them create relevant
and personalized messaging by understanding user behaviour. The model allows the
business to segment its users on three criteria derived from an existing customer's
transaction history. This chapter discusses these criteria together with the
fundamentals of customer segmentation.
2.1 Fundamentals of Customer Segmentation

Customer Segmentation is the process of dividing customers into groups based on
common characteristics so companies can market to each group effectively and
appropriately. In business-to-business marketing, a company might segment according to a
wide range of factors: Industry, Number of Employees, Products previously purchased,
Location, etc. Segmentation allows marketers to better tailor their marketing efforts to
various audience subsets. Those efforts can relate to both communications and product
development.
Customer segmentation requires a company to gather specific information – data –
about customers and analyse it to identify patterns that can be used to create segments.
Some of that can be gathered from purchasing information – job title, geography, products
purchased, for example. Some of it might be gleaned from how the customer entered your
system. An online marketer working from an opt-in email list might segment marketing
messages according to the opt-in offer that attracted the customer, for example.
Common characteristics in customer segments can guide how a company markets to
individual segments and what products or services it promotes to them. A small business
selling hand-made guitars, for example, might decide to promote lower-priced products
to younger guitarists and higher-priced premium guitars to older musicians based on
segment knowledge that tells them that younger musicians have less disposable income
than their older counterparts. Customer segmentation can be practiced by all businesses
regardless of size or industry and whether they sell online or in person. It begins with
gathering and analysing data and ends with acting on the information gathered in a way
that is appropriate and effective.

2.2 Fundamentals of RFM Analysis
RFM (Recency, Frequency, Monetary) analysis is a proven marketing model for
behaviour-based customer segmentation. It groups customers based on their transaction
history: how recently, how often and how much they bought. RFM helps divide customers into
various categories or clusters to identify customers who are more likely to respond to
promotions, and also supports future personalization services.
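As a minimal sketch of how the three values can be derived from a transaction history, the function below computes recency (days since the last purchase), frequency (number of purchases) and monetary value (total amount spent) per customer. The record layout, the customer identifiers and the reference date are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from datetime import date


def rfm_table(transactions, reference_date):
    """Map each customer id to (recency_days, frequency, monetary)."""
    last_purchase = {}
    frequency = defaultdict(int)
    monetary = defaultdict(float)
    for customer, day, amount in transactions:
        # Keep the most recent purchase date seen for this customer
        if customer not in last_purchase or day > last_purchase[customer]:
            last_purchase[customer] = day
        frequency[customer] += 1
        monetary[customer] += amount
    return {
        c: ((reference_date - last_purchase[c]).days, frequency[c], monetary[c])
        for c in last_purchase
    }


# Hypothetical transactions: (customer id, purchase date, amount spent)
transactions = [
    ("C1", date(2021, 1, 5), 120.0),
    ("C1", date(2021, 3, 1), 80.0),
    ("C2", date(2021, 2, 10), 300.0),
]
table = rfm_table(transactions, reference_date=date(2021, 3, 10))
```

The reference date is passed in explicitly so that the same history can be evaluated against whichever cut-off the analysis adopts.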
2.2.1 Frequency
A high frequency score means a customer buys your brand frequently, and is likely to
be a loyalist of your brand. To calculate frequency, businesses need to analyse the total
number of purchases completed by customers in a fixed time period. Frequency can be
scored by grading on custom-built filters such as bought thrice in a year/bought once a
month and so on, depending on the nature of the business.
2.2.2 Recency
A high recency score means a customer has positively considered your brand for a
purchase decision recently. Recency can be scored by grading on custom-built filters such
as bought in the last 7 days/1 month/3 months and so on, depending on the nature of
the business.
2.2.3 Monetary
A high monetary value score means a customer is one of the highest spending customers
of your brand. Monetary value can be scored by grading on custom-built filters, such as
spending above a chosen amount, depending on the nature of the business. All the above
criteria can be graded on a scale of 1 to 5, with 5 being the best score you could assign
a customer. It is also critical to specify an appropriate range for each grade, in order
to create groupings of customers with similar buying behaviour.
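The grading on a scale of 1 to 5 can be sketched as a lookup against four boundaries per criterion. All boundary values below are hypothetical and only stand in for the custom-built filters a business would choose. For recency a smaller number of days is better, so the scale is reversed.

```python
from bisect import bisect_right


def grade(value, boundaries, higher_is_better=True):
    """Grade a value from 1 to 5 using four ascending bin boundaries.

    A value equal to a boundary falls into the bin above that boundary.
    """
    if len(boundaries) != 4:
        raise ValueError("four boundaries are needed for five grades")
    bin_index = bisect_right(boundaries, value)  # 0 (lowest bin) .. 4 (highest bin)
    return bin_index + 1 if higher_is_better else 5 - bin_index


# Hypothetical ranges; a real business would tune these to its own data
RECENCY_DAYS = [7, 30, 90, 180]
FREQUENCY = [2, 4, 8, 12]
MONETARY = [100, 500, 1000, 5000]


def rfm_score(recency_days, frequency, monetary):
    """Return the (R, F, M) grades of one customer, each between 1 and 5."""
    return (
        grade(recency_days, RECENCY_DAYS, higher_is_better=False),
        grade(frequency, FREQUENCY),
        grade(monetary, MONETARY),
    )
```

Keeping the boundaries in plain lists makes the range of each grade explicit, which is the point stressed above: customers end up in the same group only if the ranges reflect similar buying behaviour.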
2.2.4 Why is RFM better than Other Segmentation Methods?

The RFM model is built on transactions between the user and the business, to create
a robust data-backed method based on hard numbers. This customer data is graded,
further analysed and then segmented in order to target customers as distinct groups.
This model helps businesses effectively analyse past buying behaviour of each customer,
to predict and shape future customer interactions.

Traditional methods of segmentation, used by market research companies before the
advent of data analytics, use variables like demographic and psychographic factors to
group customers. Researchers typically use sample audiences to predict population
behaviour, which reduces market researchers' ability to predict the behaviour of niche
consumer sets and specific customers.
A sample could be incorrect for many reasons, such as an insufficient number of people,
incorrect gender balance, varying psychographic factors, etc. These sampling problems do
not arise in RFM, as it is a fundamentally data-centric model which analyses the entire
population set, instead of a curated sample set. In addition, the variables of the
RFM model are taken directly from recorded transactions, whereas traditional research
involved factors like psychographics, which could be interpreted subjectively. Using the
RFM model helps a business define interactions with each specific customer, creating
opportunities to increase the relevance of messaging, eventually creating the potential
for increased customer lifetime value. RFM has the potential to create seamless
interactions with high customer satisfaction.
2.3 Clustering Algorithm

The choice of clustering algorithm depends on the kind of data being used: some
algorithms need the number of clusters in the data set to be guessed in advance, while
others need the minimum distance between the observations of the data set. There are many
different types of clustering algorithms; this section discusses a few of them in brief.
2.3.1 K-Means Algorithm

The k-means algorithm is one of the most popular algorithms in clustering. It is a
centroid-based algorithm and one of the simplest unsupervised learning algorithms. The
algorithm partitions the dataset by dividing the samples into clusters so that the
variance within each cluster is minimised. The number of clusters must be specified.
Each iteration passes over all of the data points, so the cost of one iteration grows
linearly, O(n), with the number of data points n for a fixed number of clusters and
dimensions. The k-means algorithm is therefore quickest on smaller data-sets and takes
more time to classify the data points as the data-set grows.

Figure 2.1: Description of K-Means Clustering [11]
2.3.2 Density Based Spatial Clustering of Applications with Noise (DBSCAN) Algorithm

DBSCAN is a density-based clustering algorithm, unlike k-means. It is a good algorithm
for finding outliers in a data-set. It finds arbitrarily shaped clusters based on the
density of data points in different regions and separates regions by areas of low density,
so that it can detect outliers between the high-density clusters.
DBSCAN uses two parameters to determine how the clusters are defined: minPts (the
minimum number of data points required to form a dense region) and eps (the distance
used to determine whether a data point is in the same neighbourhood as others). In short,
areas of high density are separated by areas of low density in this algorithm.
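The role of the two parameters can be illustrated with a minimal, dependency-free sketch of the DBSCAN idea. The function name `dbscan`, the sample points and the parameter values are assumptions made for illustration only; they are not taken from the report, and a library implementation would normally be used in practice.

```python
from math import dist

NOISE = -1  # label used for outliers


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)  # None means "not visited yet"

    def neighbours(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # not a core point; may become a border point later
            continue
        cluster += 1  # point i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reach = neighbours(j)
            if len(reach) >= min_pts:
                queue.extend(reach)  # j is also a core point: keep expanding
    return labels


# Illustrative data: two dense groups and one isolated point.
points = [
    (1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2),
    (8.0, 8.0), (8.1, 7.9), (7.9, 8.2), (8.2, 8.1),
    (4.5, 15.0),
]
labels = dbscan(points, eps=0.5, min_pts=3)
print(labels)
```

The isolated point has fewer than minPts neighbours within eps and is not reachable from any dense region, so it keeps the noise label, which is how DBSCAN exposes outliers.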
Figure 2.2: DBSCAN clustering [11]
Figure 2.2 shows an example of clustering with the DBSCAN algorithm.

2.3.3 Principal Component Analysis (PCA)
PCA creates a low-dimensional representation of the samples from a data set which
is optimal in the sense that it retains as much of the variance in the original data set
as possible. PCA also provides a variable representation that is directly connected to
the sample representation, which allows the user to visually find variables that are
characteristic of specific sample groups.
(Agglomerative) hierarchical clustering builds a tree-like structure (a dendrogram)
where the leaves are the individual objects (samples or variables) and the algorithm
successively pairs together the objects showing the highest degree of similarity. These
objects are then collapsed into a pseudo-object (a cluster) and treated as a single object
in all subsequent steps.
Figure 2.3: PCA 1st and 2nd dimension clustering [2]
Figure 2.3 shows the data clustered along the first (PC1) and second (PC2) principal
components.
2.3.4 Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) Algorithm

The BIRCH algorithm works well on large data-sets. It breaks the data-set into small
summaries, which are clustered instead of the original data points. The summaries retain
as much distribution information about the data points as possible.
The algorithm is mostly used together with other clustering algorithms, because other
clustering techniques can be applied to the summaries generated by BIRCH. The main
drawback of the BIRCH algorithm is that it only works on numeric data values; it cannot
be used for categorical values unless some data transformations are done.

2.3.5 Affinity propagation algorithm
In this algorithm each data point communicates with all of the other data points to
let each other know how similar they are, and the clusters in the data then start to
emerge. The number of clusters need not be specified. Messages are exchanged between
pairs of data points until they converge.
As the messages are sent between data points, some data points are identified as
exemplars, which represent the clusters. The time complexity is O(N^2 T), where N is the
number of data points and T is the number of iterations, which is the main drawback of
this algorithm.
Figure 2.4: Affinity propagation algorithm framework [11]
Figure 2.4 shows the framework of the affinity propagation algorithm.
This chapter has discussed the machine learning algorithms that provide the steps
involved in clustering the data, together with the customer segmentation insights used
for forming different strategies. These form the basis for the implementation of the
model.

Chapter 3
Implementation of RFM Analysis
and Customer Segmentation

RFM analysis and customer segmentation are simulated in Python using Jupyter
Notebook. The data is analysed and then separated into different clusters for Recency,
Frequency and Monetary analysis. This provides insights and strategies for customer
segmentation.
3.1 Design for Data Analysis

This section elaborates on the proposed objective, algorithm used and the experimen-
tal framework for the desired outcome of the study.
Figure 3.1: Steps Involved in Methodology For Data Analysis [1]
3.1.1 Selection of Clustering Algorithm

A cluster is a conceptually meaningful group of objects that share common
characteristics. Clustering is used for customer segmentation ahead of additional data
analysis. The K-Means clustering algorithm is a partition-based clustering technique
which takes a user-specified number of clusters, each represented by a centroid. K-means
is fast and does well on large datasets compared to other models. Another advantage of
K-means is that it requires only one input parameter, ‘K’, unlike many other algorithms.
Customer segmentation is one of the major applications of K-means, and the given work
also uses K-means.
3.1.2 Proposed Methodology

The proposed methodology can be broadly divided into separate steps as shown in
Figure 3.1. The corresponding details are explained as below:
3.1.2.1 Exploratory Analysis and Data Pre-processing

Exploratory data analysis (EDA) is the exploration of data to extract or discover
patterns with the help of statistics or graphical representations. In the given research,
EDA is used to identify unique customers, to show the percentage of orders accounted for
by the top 10 or more customers, to summarise the data, to find mismatches between
description and stock code, and to check for null values. The data is then pre-processed
to identify and remove records with a missing identification number and negative
transactions.
3.1.2.2 Execution of RFM Analysis

The data is pre-processed to obtain the most recent transaction, the transaction
frequency and the total amount spent by each customer. In order to create the recency
variable, we take as the reference date the date that is one day older than the most
recent one. RFM analysis is a very popular customer segmentation and identification
technique in marketing.
Recency: the number of days between a customer's latest transaction and the given
reference date. The lower the recency, the more recently the customer has visited.
Frequency: the number of purchases made by a customer in a given period. The higher
the frequency, the more often the customer visits to make a transaction.
Monetary: the amount of money spent by a customer during a specific period of time.
The higher the value, the more revenue is generated by the customer's transactions.
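The three definitions above can be sketched in a few lines of plain Python. The transaction records, the customer identifiers and the reference date below are hypothetical values invented for illustration (the reference date is simply chosen so that every recency is non-negative); the report's own notebook code is not shown in the text.

```python
from collections import defaultdict
from datetime import date

# Hypothetical transactions: (customer id, purchase date, amount spent).
transactions = [
    ("C1", date(2021, 1, 5), 120.0),
    ("C1", date(2021, 1, 20), 80.0),
    ("C2", date(2021, 1, 10), 300.0),
    ("C3", date(2021, 1, 2), 50.0),
    ("C3", date(2021, 1, 3), 25.0),
    ("C3", date(2021, 1, 25), 75.0),
]


def rfm_table(rows, reference_date):
    """Return {customer: (recency in days, frequency, monetary)}."""
    last_purchase = {}
    frequency = defaultdict(int)
    monetary = defaultdict(float)
    for customer, day, amount in rows:
        # Keep the latest purchase date seen for each customer.
        if customer not in last_purchase or day > last_purchase[customer]:
            last_purchase[customer] = day
        frequency[customer] += 1      # number of transactions
        monetary[customer] += amount  # total amount spent
    return {
        customer: (
            (reference_date - last_purchase[customer]).days,
            frequency[customer],
            monetary[customer],
        )
        for customer in last_purchase
    }


# Assumed reference date for this example only.
rfm = rfm_table(transactions, reference_date=date(2021, 1, 26))
for customer, (recency, freq, money) in sorted(rfm.items()):
    print(customer, recency, freq, money)
```

A single pass over the transactions is enough because each of the three values is a simple per-customer aggregate (maximum date, count and sum).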
3.1.2.3 K-Means with Euclidean distance

The K-Means algorithm is used with the Euclidean distance metric to partition the
given data on the RFM values. K-Means is used to analyse the amount obtained for recent
and frequent transactions, that is, to partition the customers based on the amount of
their recent transactions and to group the customers based on the amount generated by
frequent transactions.
The clusters obtained in the previous step are evaluated using the silhouette score,
which measures how well the resulting clusters are separated. It has a range of
[-1, +1]. If the value is near +1, the objects lie far away from neighbouring clusters;
if it is near -1, the objects might have been assigned to the wrong cluster, for example
because the pre-processing of the data was not correct.
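The silhouette score described above can be sketched as follows. This is a minimal illustration of the standard definition, in which a is the mean distance from a point to the other members of its own cluster and b is the smallest mean distance to the members of any other cluster; the function name and the sample points are assumptions for illustration, not the report's code.

```python
from math import dist


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (range -1 to +1)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("the silhouette score needs at least two clusters")

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for a singleton cluster
            continue
        # a: mean distance to the other members of the same cluster
        # (the distance from p to itself is 0, so dividing by len - 1 is correct).
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(dist(p, q) for q in other) / len(other)
            for key, other in clusters.items()
            if key != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Illustrative data: two tight, well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_score(points, [0, 0, 1, 1])  # sensible assignment
bad = silhouette_score(points, [0, 1, 0, 1])   # deliberately wrong assignment
print(good, bad)
```

With the sensible assignment the score is close to +1, while the deliberately wrong assignment gives a negative score, matching the interpretation of the two ends of the range.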
The clusters are evaluated through their silhouette values, which are compared for
K=3 and K=5 to select the optimal number of clusters. After the analysis is done, the
recency, frequency and amount of one cluster are compared with those of another. This
helps in identifying the group of customers having the highest sales recency, sales
frequency and sales amount.
3.1.3 Mathematical Model

Clustering using the K-means algorithm is a method of unsupervised learning used for
data analysis. The algorithm identifies ‘K’ centroids in the dataset ‘D’ and assigns each
data point to the nearest centroid, producing non-overlapping clusters. K-means seeks
clusters in which the intra-cluster distance is small compared to the inter-cluster
distance. Since it is an iterative approach, data points are moved between clusters as
the centroids are recalculated.
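Stated compactly, the standard K-means objective is the within-cluster sum of squared Euclidean distances. The notation below (C_k for the k-th cluster, \mu_k for its centroid, x_i for a data point in D) is assumed here for illustration and does not appear in the report.

```latex
J \;=\; \sum_{k=1}^{K} \; \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^{2},
\qquad
\mu_k \;=\; \frac{1}{\lvert C_k \rvert} \sum_{x_i \in C_k} x_i
```

Each iteration reassigns every x_i to the cluster with the nearest \mu_k and then recomputes the centroids; neither step can increase J, which is why the procedure converges.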
3.1.4 Data set description used in the analysis

Both a synthetic dataset and a real dataset are used. The real dataset is collected
from a logistics company and consists of fifteen days of customer transactions from a
retail store.

3.2 Steps involved in Customer Marketing Strategy
Figure 3.2: Proposed method For Customer Marketing Strategy [2]
This section explains the proposed process of customer value analysis. The process has
the following four steps, shown in Figure 3.2:
1. Data preparation and pre-processing.
2. Normalization of RFM model indices.
3. Index weight analysis.
4. Customer clustering through the K-means algorithm, where every dimension of customer
information is analysed using the RFM model and the K-means algorithm is used to
classify target customers.
Each step of the process is introduced below.
3.2.1 Data Pre-processing

At the start, the original dataset for the case study based on the RFM model is
selected. The original dataset is then cleaned to remove outliers and inaccurate values,
generating an initial dataset. Then, by eliminating redundant attributes, the data is
converted to a format that is easier and more efficient to process for customer value
analysis.

3.2.2 Normalization of RFM model indices
The three indicators of the RFM model, namely the time since last purchase, the
purchase frequency and the total purchase amount, differ greatly in their value ranges.
In order to eliminate the impact of these numerical ranges on the classification results,
the min-max normalization method is used to standardize the data and obtain the initial
standardized dataset.
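Min-max normalization maps each indicator onto the range [0, 1] by subtracting the column minimum and dividing by the column range. The sketch below applies it to the five sample rows listed in Table 4.4; the function and variable names are assumptions for illustration, and the results on five rows will differ from those obtained on the full dataset.

```python
def min_max_normalize(values):
    """Rescale a list of numbers linearly so the minimum is 0 and the maximum is 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: no spread to rescale
    return [(v - lo) / (hi - lo) for v in values]


# The five sample rows shown in Table 4.4.
rfm_columns = {
    "recency": [17, 13, 24, 36, 3],
    "frequency": [37, 34, 31, 41, 42],
    "monetary": [24644.65, 20759.52, 14212.62, 20186.79, 21718.24],
}

normalized = {name: min_max_normalize(col) for name, col in rfm_columns.items()}
for name, col in normalized.items():
    print(name, [round(v, 3) for v in col])
```

After this step all three indicators share the same range, so the monetary amounts, which are numerically far larger, no longer dominate the Euclidean distances used by K-means.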
3.2.3 Indicator Weight Analysis

The index weight is the relative importance of each inspection index of a measured
object. The research object is characterized by a large number of customers and massive
consumption data, so principal component analysis is used to assign the weights of the
RFM model. Principal component analysis is an analytical method that transforms multiple
indicators into a smaller number of comprehensive ones through dimensionality reduction.
The weight of each indicator is set equal to the variance contribution rate of the
corresponding principal component: the greater the variance contribution rate, the
higher the importance of the indicator.
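The variance contribution rate has a standard closed form. In the sketch below, \lambda_i denotes the i-th eigenvalue of the covariance matrix of the normalized indicators and m the number of indicators (m = 3 for R, F and M); this notation is assumed for illustration, since the report does not give the computation explicitly.

```latex
w_i \;=\; \frac{\lambda_i}{\sum_{j=1}^{m} \lambda_j},
\qquad i = 1, \dots, m,
\qquad \sum_{i=1}^{m} w_i = 1
```

Because the eigenvalues measure the variance captured by each principal component, the weights are non-negative and sum to one.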
3.2.4 Clustering the Customers by K-Means Algorithm

The K initial cluster centers are selected randomly from the dataset, and the
Euclidean distance between each remaining data object and every cluster center is
calculated. The cluster center closest to each data object is identified, and the data
object is allocated to the cluster corresponding to that center. The average of all data
objects in each cluster is then taken as the new cluster center to start the next
iteration. The process repeats itself until the cluster centers cease to change or the
maximum number of iterations is reached.
The selection of the value of k for the number of clusters has great implications for
the clustering results. The elbow method is generally used to determine the best value
of k. The curve relating the sum of squared errors (SSE) to k takes the shape of an
elbow, and the value of k at the elbow is taken as the number of clusters for the data.
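The iteration and the elbow method described above can be sketched without any external library. The function name `kmeans`, the sample points, the seeds and the number of restarts are all assumptions chosen for illustration; this is a plain implementation of the standard procedure, not the report's notebook code.

```python
import random
from math import dist


def kmeans(points, k, max_iter=100, seed=0):
    """Return (centers, sse) after running K-means with Euclidean distance."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # K initial centers picked at random
    for _ in range(max_iter):
        # Assignment step: allocate each point to its closest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: dist(p, centers[c]))
            clusters[nearest].append(p)
        # Update step: the mean of each cluster becomes its new center
        # (an empty cluster keeps its previous center).
        new_centers = [
            tuple(sum(coord) / len(members) for coord in zip(*members))
            if members else centers[c]
            for c, members in enumerate(clusters)
        ]
        if new_centers == centers:  # centers have stopped changing
            break
        centers = new_centers
    # SSE: squared distance from every point to its closest center.
    sse = sum(min(dist(p, c) ** 2 for c in centers) for p in points)
    return centers, sse


# Illustrative data: three loose groups of points.
points = [
    (1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
    (8.0, 8.0), (8.5, 9.0), (9.0, 8.0),
    (1.0, 9.0), (1.5, 8.5), (2.0, 9.5),
]

# Elbow method: record the best SSE over a few random restarts for each k.
sse_by_k = {}
for k in range(1, 6):
    sse_by_k[k] = min(kmeans(points, k, seed=s)[1] for s in range(10))

for k, sse in sse_by_k.items():
    print(k, round(sse, 3))
```

Plotting the printed SSE values against k gives the elbow curve; several restarts are used for each k because a single random initialisation can converge to a poor local optimum.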
The proposed methodology provides the steps required to construct a strategy for
generating the clusters needed for customer segmentation. The segmentation provides
different insights into customer behaviour and purchasing patterns.

Chapter 4
Results and Analysis

This chapter presents the simulation results, including the univariate analysis and
the exploratory data analysis, followed by the clusters obtained for Recency, Frequency
and Monetary values. The total visits with different transactions are noted with the
clustering algorithm. Clustering is performed both with and without PCA. The K-means
algorithm is used to find the best clusters for different types of customers, from which
the best marketing strategy is built. Inferences are drawn from each simulation result
and the approach offering the best result is identified.
4.1 Univariate Analysis

Univariate analysis is the simplest form of statistical analysis. It can be inferential
or descriptive. The key point is that only one variable is involved. Because it ignores
the relationships between variables, univariate analysis is less informative than the
multivariate analysis carried out as part of the Exploratory Data Analysis.
Table 4.1: Univariate Analysis on set of Data

Sl. No.  Customer Key  Average credit limit  Total Credit Cards  Total Visits  Total Visits Online  Total Calls Made
1        87073         100000                2                   1             1                    0
2        38414         50000                 3                   0             10                   9
3        17341         30000                 7                   1             3                    4
4        40496         50000                 5                   1             1                    4
5        47437         100000                6                   0             12                   3
Figure 4.1: Interaction of Total Bank visits

Figure 4.2: Interaction of Total Online Visits
Figure 4.3: Interaction of Total Calls Made
Figures 4.1, 4.2 and 4.3 show the different interactions made by the customers.

Table 4.2: Univariate Analysis for Different interaction
       Average credit limit  Total Credit Cards  Total Visits  Total Visits Online  Total Calls Made
Count  666.000000            666.000000          666.000000    666.000000           666.000000
Mean   34574.242424          4.706061            2.403030      2.606061             3.583333
Std    37625.487804          2.167835            1.631813      2.935724             2.865317
Min    3000.000000           1.000000            0.000000      0.000000             0.000000
25%    10000.000000          3.000000            1.000000      1.000000             1.000000
50%    30000.000000          5.000000            2.000000      2.000000             3.000000
75%    48000.000000          6.000000            4.000000      4.000000             5.000000
Max    200000.000000         10.000000           5.000000      15.000000            10.000000
Univariate analysis confirms that no missing or negative values remain, as these have
been removed. From Tables 4.1 and 4.2, a customer carries about 5 credit cards and makes
about 2 visits on average. There are certain customers who never visit the bank, never
use online banking or never make calls.
4.2 Exploratory Data Analysis
Table 4.3: Data Correlation of the Features

                      Average credit limit  Total Credit Cards  Total Visits  Total Visits Online  Total Calls Made
Average credit limit  1.000000              0.608860            -0.100312     0.551385             -0.414352
Total Credit Cards    0.608860              1.000000            0.315796      0.167758             -0.651251
Total Visits          -0.100312             0.315796            1.000000      -0.551861            -0.506016
Total Visits Online   0.551385              0.167758            -0.551861     1.000000             0.127299
Total Calls Made      -0.414352             -0.651251           -0.506016     0.127299             1.000000
Table 4.3 shows the correlation between the features obtained through Exploratory
Data Analysis.
4.3 Data Pre-processing

The data frame contains Recency (R), Frequency (F) and Monetary (M) values, defined
as follows.
Recency is how recently a specific customer made his/her latest purchase. The given
case sets a threshold date for this calculation.
Frequency is how many transactions a specific customer made in a date range.
Monetary is how much a specific customer has spent in a date range.

Table 4.4: Data frame of RFM
   Customer Name  Recency  Inverse Recency  Frequency  Monetary
1  Tushar Kota    17       1445             37         24644.65
2  Mitra Ghosh    13       1449             34         20759.52
3  Anusha Das     24       1438             31         14212.62
4  Nilan Balan    36       1426             41         20186.79
5  Kushal Behr    3        1459             42         21718.24
Table 4.5: RFM average of the data

       Recency     Inverse Recency  Frequency   Monetary
Count  795.000000  795.000000       795.000000  795.000000
Mean   24.368553   1437.631447      32.394969   15902.517836
Std    27.438069   27.438609        5.408033    5209.813520
Min    1.000000    1033.000000      15.000000   3892.230000
25%    7.000000    1428.000000      28.000000   12242.615000
50%    17.000000   1445.000000      32.000000   15257.550000
75%    34.000000   1455.000000      36.000000   18770.795000
Max    429.000000  1461.000000      47.000000   40488.080000
Tables 4.4 and 4.5 show the pre-processed data from the data frame, giving the
Recency, Inverse Recency, Frequency and Monetary values and their summary statistics.
Figure 4.4: Data Distribution of Recency Analysis

Figure 4.5: Data Distribution of Inverse Recency Analysis
Figure 4.6: Data Distribution of Frequency Analysis
Figure 4.7: Data Distribution of Monetary Analysis
Figures 4.4, 4.5, 4.6 and 4.7 show the data distribution plots for the RFM analysis
of our data frame.
Clustering requires the values of each feature to be rescaled. RFM analysis usually
divides the data into 5 equal parts and scores them from 1 to 5, but here the data is
standardized instead. Recency behaves differently from Frequency and Monetary: Frequency
and Monetary are better when their values are larger, whereas Recency is better when its
value is smaller. The Inverse Recency column is therefore used, so that larger values are
better for all three features.
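The rescaling step can be sketched as follows, using the five sample rows of Table 4.4. In those rows every Inverse Recency value equals 1462 minus the Recency, so the offset 1462 is inferred from the table; how it was chosen is not stated in the report. The use of the population standard deviation is also an assumption, and the scaled values computed from five rows will not match the scale columns reported for the full dataset.

```python
from statistics import mean, pstdev


def standardize(values):
    """Z-score scaling: subtract the mean and divide by the standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumed convention)
    return [(v - mu) / sigma for v in values]


# Recency values of the five sample rows in Table 4.4.
recency = [17, 13, 24, 36, 3]

# Offset inferred from Table 4.4, where Inverse Recency = 1462 - Recency.
OFFSET = 1462
inverse_recency = [OFFSET - r for r in recency]

recency_scale = standardize(inverse_recency)
for r, inv, z in zip(recency, inverse_recency, recency_scale):
    print(r, inv, round(z, 3))
```

After inverting and scaling, the customer with the smallest recency receives the largest scaled value, so "larger is better" holds for recency in the same way as for frequency and monetary value.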
Figure 4.8: Visualization of Scaled Values
In Figure 4.8, there is one data point that lies very far away from the other data in
the pair plot. That data point needs to be dropped before the data is analysed through
clustering.
4.4 Clustering
The simulation uses both Non-PCA and PCA methods to perform K-Means clustering, from
which the cluster plots are generated in order to infer different strategies and come up
with different plans.

4.4.1 Clustering (Non-PCA)
4.4.1.1 K-Means Clustering
Figure 4.9: Clustering Plot of Recency, Frequency and Monetary
Figure 4.9 depicts the clustering analysis of Recency, Monetary and Frequency through
a pair plot.
Figure 4.10: Elbow Method to determine the Values of K
Figure 4.10 depicts the inertia against the value of k; the value of k is chosen where
the decrease in inertia levels off.

Figure 4.11: Silhouette Analysis for K=2 Clusters for Recency, Frequency and Monetary Plot
Figures 4.12, 4.13 and 4.14: Silhouette Analysis for K=3, K=4 and K=5 Clusters for
Recency, Frequency and Monetary Plot
In Figures 4.11, 4.12, 4.13 and 4.14, silhouette analysis is applied to the clustering
for K = 2, 3, 4 and 5, giving the silhouette plot for the clusters at each value of K.
The RFM data is depicted through a visualization of the clustered data.

4.4.1.2 Hierarchical Clustering
Figure 4.15: Hierarchical Clustering at K=3
Figure 4.15 shows the dendrogram structure at k=3.

The elbow chart and Silhouette Index from the Non-PCA K-Means clustering method
indicate that the optimal number of clusters is between 2 and 3. Hierarchical clustering
indicates that the optimal number of clusters is K=3.
4.4.2 Clustering (PCA)

A t-test is used for feature selection to remove the irrelevant features from the
dataset, so that the data is less noisy and easier to handle. For example, there might be
a feature that is inconsistent with itself and shows no relationship with the others, and
preserving the information in that feature is difficult. After feature selection, PC1 and
PC2 (the first two principal components) capture more of the variance, as PCA can more
easily preserve the information of a dataset that is less noisy.
Table 4.6: PCA after feature selection

PC1 PC2
1 1.790245 0.279731
2 0.958071 -0.166236
3 -0.409873 -0.103127
4 1.480692 0.978402
5 2.228522 -0.330263
Table 4.6 gives the PCA values for the separate dimensions.

4.4.2.1 K-Means Clustering
Figure 4.16: Clustering Plot of PC1 vs PC2
Figure 4.16 shows the clustering plot for PC1 vs PC2 in K-Means Clustering.
Figure 4.17: Elbow Method to determine the Values of K
Figure 4.17 depicts the inertia against the value of k; the value of k is chosen where
the decrease in inertia levels off.

Figures 4.18, 4.19, 4.20 and 4.21: Silhouette Analysis for K=2, K=3, K=4 and K=5
Clusters Plot in PCA method
In Figures 4.18, 4.19, 4.20 and 4.21, silhouette analysis is applied to the clustering
for K = 2, 3, 4 and 5, giving the silhouette plot for the clusters at each value of K.
The RFM data is depicted through a visualization of the clustered data.

4.4.2.2 Hierarchical Clustering
Figure 4.22: Hierarchical Clustering at K=3
Figure 4.22 shows the dendrogram structure at k=3.

The elbow chart and Silhouette Index from the PCA K-Means clustering method indicate
that the optimal number of clusters is K=3. Hierarchical clustering also indicates that
the optimal number of clusters is K=3.
Table 4.7: Concatenating original data between PCA and Non-PCA

   Customer     Recency  Frequency  Monetary  Recency Scale  Frequency Scale  Monetary Scale  Cluster Non-PCA  Cluster PCA
1  Tushar Kota  17       37         24644.65  0.291554       0.849921         1.678322        2                2
2  Mitra Ghosh  13       34         20759.52  0.470798       0.292226         0.930059        2                2
3  Anusha Das   24       31         14212.62  -0.022123      -0.265468        -0.330850       1                1
4  Nilan Balan  36       41         20186.79  -0.559854      1.593514         0.819754        2                2
5  Kushal Behr  3        42         21718.24  0.918908       1.779412         1.114706        2                2
The cluster sizes obtained with the two clustering methods are:
Cluster Non-PCA: 365, 317, 110
Cluster PCA: 368, 317, 107
Figure 4.23: Data Clusters Plotted Based on Non-PCA Method

Figure 4.24: Data Clusters Plotted Based on PCA Method
Figures 4.23 and 4.24 show the data clusters plotted using the Non-PCA and PCA
methods respectively.
There was no significant difference in the results between the Non-PCA and PCA
methods. However, since PCA results in a higher silhouette index, PCA is used as the
clustering technique.
After analysing the PCA pair plot, we notice that the first cluster belongs to the best
customers, who have low recency, high frequency and high monetary value. The second
cluster belongs to the loyal customers, who have low recency, low frequency and low
monetary value. The third cluster belongs to potential customers, who have high recency.

4.5 Model Deployment
Table 4.8: For fitting the PCA model for pipeline

   Customer     Recency  Inverse Recency  Frequency  Monetary  Recency Scale  Frequency Scale  Monetary Scale  Cluster
1  Tushar Kota  17       1445             37         24644.65  0.291554       0.849921         1.678322        1
2  Mitra Ghosh  13       1449             34         20759.52  0.470798       0.292226         0.930059        1
3  Anusha Das   24       1438             31         14212.62  -0.022123      -0.265468        -0.330850       2
4  Nilan Balan  36       1426             41         20186.79  -0.559854      1.593514         0.819754        1
5  Kushal Behr  3        1459             42         21718.24  0.918908       1.779412         1.114706        1
Figure 4.25: Final Data Clusters Plotted for our dataframe

Figure 4.26: KDE plot for each of the Clusters

Figure 4.26 shows the kernel density estimate (KDE) plot of the scaled RFM values for
each of the clusters.
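A kernel density estimate smooths a set of observations into a continuous density curve by placing a small bell-shaped kernel on every observation and averaging them. The sketch below is a minimal Gaussian version evaluated on the Frequency Scale values of the five sample rows in Table 4.8; the function name `kde_gaussian` and the bandwidth of 0.5 are assumptions for illustration, and plotting libraries normally choose the bandwidth automatically.

```python
from math import exp, pi, sqrt


def kde_gaussian(samples, bandwidth):
    """Return a function x -> estimated density, using Gaussian kernels."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * sqrt(2 * pi))

    def density(x):
        # Average of one Gaussian bump centred on each sample.
        return norm * sum(exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return density


# Frequency Scale values of the five sample rows in Table 4.8.
frequency_scale = [0.849921, 0.292226, -0.265468, 1.593514, 1.779412]

density = kde_gaussian(frequency_scale, bandwidth=0.5)  # bandwidth is assumed
for x in (-1.0, 0.0, 1.0, 2.0, 3.0):
    print(x, round(density(x), 4))
```

Drawing one such curve per cluster and per RFM feature gives a plot of the kind shown in Figure 4.26, where clusters with different behaviour appear as curves peaking at different values.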
4.6 Insights and Strategies

After performing the customer segmentation using RFM marketing analysis, we get a
cohesive picture of the customer base. The clustering analysis yields 3 distinct groups,
segmented by customer behaviour.
Group 1 (Cluster 1): They are our long-standing customers. They come last in terms
of recency, frequency and monetary value: as we see, they made fewer transactions,
with a low monetary value, a long time ago.
Strategy: We can design more specifically targeted communication that helps convert
them into more loyal customers with higher RFM values.
Group 2 (Cluster 2): They are our loyal customers. They come first in terms of
frequency, with large-value transactions. However, they are only the second most
recent customers to have made purchases, so we cannot afford to lose them.
Strategy: We need more personalized offers, promoted through product recommendations
based on their past transactions, in order to increase engagement and achieve a
higher customer retention rate.
Group 3 (Cluster 3): They are our new customer base. They are the most recent
customers to have made a purchase, with a slightly higher monetary value than Group 1.
However, they are less frequent than Group 2, which makes sense as they are newly
introduced to the market.
Strategy: Triggered welcome emails can be used to ensure engagement, establish a
personal connection and encourage them to make more purchases with introductory
offers.

Chapter 5
Conclusion and Future Scope

5.1 Conclusion
Customer purchase behaviours are analysed based on the online transaction data of a
company by using the RFM model and the K-means clustering algorithm. Customers are
first classified into three groups based on their purchasing behaviour. Strategies are
proposed accordingly in order to achieve a high level of customer satisfaction. The
effectiveness of the analysis method proposed in the research is indicated by the
improvement in the key performance indices of the company.
Customer segmentation based on buying patterns, though strategically important, is a
difficult task. Customer retention is another major concern for both online and physical
enterprises. In the research work, the RFM model is implemented on synthetic and real
data sets to analyse different customer segmentation behaviours. The clusters are
evaluated using Silhouette Analysis for the K-Means clustering algorithm with different
numbers of clusters. Based on the Silhouette Score, the Sales Recency, Sales Frequency
and Sales Monetary values can be analysed and an optimal solution is found and used.
The resulting clusters provide insights and strategies for targeting marketing towards
particular customers. The clusters allow a targeted strategy to be created for every
single customer, and provide scenarios for improving the strategies that need to be
developed in future.
5.2 Future Scope

Future work lies in the study and analysis of specific categories of products, for
example mobiles and their accessories. Various parameters, such as customer preferences,
the most effective techniques at a specific event, or threshold parameters for different
regions, can be studied for designing effective business enhancements.
Advancements and deliberations in this area will help enterprises to improve their
business by providing promotions and innovative strategies, giving the method a cutting
edge against the competition.

BIBLIOGRAPHY
[1] P. Anitha and M. M. Patil, “RFM model for customer purchase behavior using
k-means algorithm,” Journal of King Saud University - Computer and Information
Sciences, Dec. 2019. doi: 10.1016/j.jksuci.2019.12.011.
[2] S. Monalisa, P. Nadya, and R. Novita, “Analysis for customer lifetime value
categorization with RFM model,” Procedia Computer Science, vol. 161, pp. 834–840,
2019. doi: 10.1016/j.procs.2019.11.190.
[3] J. Silva, N. Varela, L. A. B. López, and R. H. R. Millán, “Association rules
extraction for customer segmentation in the SMEs sector using the apriori algorithm,”
Procedia Computer Science, vol. 151, pp. 1207–1212, 2019.
doi: 10.1016/j.procs.2019.04.173.
[4] M. Li, Q. Wang, Y. Shen, and T. Zhu, “Customer relationship management analysis
of outpatients in a Chinese infectious disease hospital using drug-proportion
recency-frequency-monetary model,” International Journal of Medical Informatics,
vol. 147, p. 104373, Mar. 2021. doi: 10.1016/j.ijmedinf.2020.104373.
[5] P.-Y. Hsu and C.-W. Huang, “IECT: A methodology for identifying critical products
using purchase transactions,” Applied Soft Computing, vol. 94, p. 106420, Sep. 2020.
doi: 10.1016/j.asoc.2020.106420.
[6] A. Dudhia, “The reference forward model (RFM),” Journal of Quantitative
Spectroscopy and Radiative Transfer, vol. 186, pp. 243–253, Jan. 2017.
doi: 10.1016/j.jqsrt.2016.06.018.
[7] R. Heldt, C. S. Silveira, and F. B. Luce, “Predicting customer value per product:
From RFM to RFM/p,” Journal of Business Research, vol. 127, pp. 444–453, Apr. 2021.
doi: 10.1016/j.jbusres.2019.05.001.
[8] E. Zhang, M. Li, S.-M. Yiu, J. Du, J.-Z. Zhu, and G.-G. Jin, “Fair hierarchical
secret sharing scheme based on smart contract,” Information Sciences, vol. 546,
pp. 166–176, Feb. 2021. doi: 10.1016/j.ins.2020.07.032.
[9] A. J. Christy, A. Umamakeswari, L. Priyatharsini, and A. Neyaa, “RFM ranking –
an effective approach to customer segmentation,” Journal of King Saud University
- Computer and Information Sciences, Sep. 2018. doi: 10.1016/j.jksuci.2018.09.004.
[10] S.-C. Wang, Y.-T. Tsai, and Y.-S. Ciou, “A hybrid big data analytical approach for
analyzing customer patterns through an integrated supply chain network,” Journal
of Industrial Information Integration, vol. 20, p. 100177, Dec. 2020.
doi: 10.1016/j.jii.2020.100177.
[11] A. Doniec, S. Lecoeuche, R. Mandiau, and A. Sylvain, “Purchase intention-based
agent for customer behaviours,” Information Sciences, vol. 521, pp. 380–397, Jun.
2020. doi: 10.1016/j.ins.2020.02.054.
