

Impacts of Feature Selection Techniques in Machine Learning Algorithms for Cross Selling: A Comprehensive Study for Insurance Industry
Ali Galip Şekeroğlu
Bahçeşehir University Graduate School of Natural and Applied Science, Big Data Analytics and Management
İstanbul, Turkey
galipsekeroglu@gmail.com

Abstract — This study demonstrates the methods and results of cross-selling techniques applied to support data-driven decision making in the insurance sector, one of the major financial industries, using data whose volume, variety, velocity and importance have been increasing rapidly for all kinds of organizations. For the literature review, 14 articles on cross-selling in marketing, big data and engineering published in the last 5 years were examined. For the experiments, 3,000 samples were randomly selected from an imbalanced dataset consisting of 381,000 samples and 12 attributes, and data analysis, data transformation, scaling and oversampling of the output column were performed before applying machine learning algorithms to classify the customers suitable for cross-selling. The main research goal of the study is to reveal how different feature selection and dimensionality reduction methods affect modelling time and performance. After applying the Select K-Best, Hashing Encoder, Feature Hasher, Sequential Feature Selector (SFS) and Principal Component Analysis (PCA) methods, both higher and lower accuracy and F1 scores were observed compared to the baseline models. Reducing the number of attributes through feature selection also reduced the time spent training the models.

Keywords — Machine Learning, Cross-Selling, Feature Selection, Dimensionality Reduction, Insurance

I. INTRODUCTION AND PURPOSE

In recent years, the total volume of created data has increased significantly as electronic devices such as mobile phones and IoT devices have become part of our daily lives. The existence of enormous amounts of data allows financial industries to apply several data analytics techniques and reach meaningful insights about their customers for decision-making. Machine learning approaches have become one of the most popular ways to help companies determine, at an early stage, the clients who might stop using a provided service (customer churn), or to detect the customers most appropriate for cross-selling in order to enhance the company's profit.

The aim of this project is to predict the health insurance policyholders who would likely be interested in buying vehicle insurance, as an example of a cross-selling strategy. A publicly available dataset [1] containing around 381,000 samples and 12 attributes, including demographic and vehicle information, has been used for this purpose. For this article, 3,000 randomly selected samples from the dataset were chosen to reveal the impact of dimension reduction methods on the accuracy and F1 scores of different classification techniques applied to an imbalanced target dataset. This study presents a comprehensive literature review of cross-selling in different sectors, in particular the financial industries, together with machine learning modelling using various state-of-the-art algorithms for classifying the customers who would be appropriate candidates for buying vehicle insurance. In addition, the study compares the performance of several machine learning algorithms before and after applying dimension reduction methods.

This paper is organized as follows: the literature review presents a comprehensive survey of cross-selling and briefly discusses related work published in the last 5 years. The materials and methods section describes the techniques and methodologies used for cross-selling prediction. The results section reports the predictions of the algorithms in terms of two classification metrics, and finally the discussion and conclusion section concludes the study and explains ways to extend it.

II. LITERATURE REVIEW

In recent years, data has become one of the most important assets for companies, governments and other institutions, and these organizations aim to learn customers' purchase patterns and rule sets from the gathered data in order to maximize their profit by rearranging their marketing campaigns according to these findings. Cross-selling is a well-known technique whose goal is to increase the institution's profit by selling more related items in a single transaction; its fundamental idea is to offer additional services or products while a purchase transaction has not yet been completed. Many studies demonstrate real-life practices and the results of applying cross-selling and up-selling techniques to customers.
An example of up-selling would be offering a computer with better hardware during online shopping, or convincing a bank customer to take a credit card with a higher limit than the card originally chosen.

A study of cross-selling and up-selling in a bank [2] aimed to show the negative and positive impacts of cross-selling on customers by collecting the answers of a questionnaire in which 150 people took part. Cross-selling can cause customer loss when it is applied wrongly, even while financial companies such as banks or insurance companies expect it to maximize profit and increase customer loyalty. According to the survey's open-ended questions regarding satisfaction, age and education level, over 51% of the bank's customers were not satisfied with products bought through cross-selling because of misleading campaigns. When the offers differ little from existing products, or not enough information about the price is provided during cross-selling, the result is customer dissatisfaction.

Data mining is a term that describes the phases of a journey including data collection, pre-processing, experiments and success metrics for finding hidden patterns and relations, in order to extract information relevant to decisions that enhance profit or reduce costs for an institution. A well-planned and organized data mining process is vital for a successful cross-selling marketing strategy aimed at current customers, as explained in a study of telecommunication customers [3]. In that study, data on 878 telephone customers, containing missing values in attributes like name, zone and current customer type, were collected, and Naïve Bayes and the C4.5 algorithm, which prunes the constructed decision tree to reduce the error rate on test data, were applied to detect and classify potential customers for cross-selling. According to the accuracy metric used to evaluate the models created from the collected customer data, the C4.5 algorithm performed better, reaching 88.61% accuracy in detecting the potential customers for cross-selling strategies.

Recently, machine learning models have become popular for obtaining meaningful insights from data analysis, for increasing sales through several marketing techniques, and for reducing the rate of customers who intend to unsubscribe from the companies' services. The study [4] showed that, in order to reduce customer churn, understanding the customer relationship in a few steps, such as segmenting customers with similar purchase behaviours, is critical for increasing transactions with the help of techniques such as cross-selling and market basket analysis, and for minimizing churn by providing more personalized offers, services and benefits. The main purpose of the study [4] is to show the impact of machine learning algorithms such as KNN, ANN, Random Forest and Support Vector Machine (SVM) in anticipating the customers who are expected to churn, after following the phases of data collection, data transformation, data cleaning and feature selection. The results of the experiment indicated that models using ensemble-based techniques (AdaBoost and Random Forest) reached the highest accuracy, over 96%, in predicting the customers the company must target with cross-selling and other techniques to convince them not to stop using the provided services.

In our daily purchase routines, people tend to buy items together when two or more items are related to each other, for instance a mobile phone as the major item and a cover for the phone as the associated minor item; this can be described as the multi-item inventory management problem. Various solutions have been proposed in recent decades to increase sale rates through cross-selling by exploiting the demand for related items. In the article [5], it was assumed that the sale of a major item affects the demand for multiple minor items, which is the essence of cross-selling. The authors mainly focused on proving that one of the algorithms addressing the multi-item inventory management problem is not NP-hard, and proposed a polynomial algorithm that can be used for large-scale problems with improved efficiency.

Producing consistent recommendations to enhance profit by increasing the number of products sold through cross-selling strategies is critical for every organization, and various state-of-the-art techniques are used to create product-to-product recommendations. A case study of recommendation in the pharmaceutical industry [6] found that a graph-based convolutional neural network recommending pharmaceutical products can be considerably more effective than traditional recommendations produced personally by a pharmacist. In the study, PharmaSage, a graph-based convolutional network algorithm, was developed by transforming sales and pharmaceutical product datasets containing features such as indications and adverse effects of each drug into a graph, while splitting the data into training, validation and testing sets to measure the success of the recommendations. Furthermore, a method based on probability theory was proposed to counter the problem of popularity bias and enhance the quality of recommendations.

Continuously dealing with the challenges of cross-selling is the key to performing effective campaigns that increase the profit of any kind of organization. In addition to the existing knowledge about cross-selling strategies, the research in [7] aimed to reveal previously unidentified challenges, specifically in the corporate banking industry, by collecting data from customer interviews, workshops and case analysis. According to the findings of the study, there are diverse conclusions regarding customers; for instance, long-tenured employees resistant to change from an existing system that is insufficient for meeting customer needs, a lack of communication and information sharing about customers within the bank, and outdated CRM systems are among the major issues that banks have to cope with.
Companies are willing to give their customers consistent recommendations at the right time to increase customer loyalty and reduce possible customer churn in advance. Recommendation through cross-selling can be approached from many different angles, such as classifying the customers who might buy more products or predicting customer rankings. In the study [8], cross-selling recommendation is treated as a classification problem and a collaborative ensemble learning method called the multi-kernel support tensor machine (MK-STM) is developed for cross-selling recommendations using multi-type multi-way data, the term used to describe the input data of the model. To demonstrate the impact of the method, which also performs feature selection, various datasets containing demographic, related-product, similar-customer and historical promotion data were used, where the demographic data are defined by a matrix and the rest are represented by tensors. In conclusion, the proposed algorithm reached better results than other ensemble learning methods that have been used for cross-selling recommendations.

Online shopping has significantly changed the usual behaviour of people buying or selling products in physical shops, both practically and psychologically. Buyers are willing to buy more products to avoid paying delivery costs or other extra charges, and online retailers tend to sell more products by exploiting cross-selling opportunities when deciding the amount of discounts. The authors of the study [9] propose a mathematical model to answer research questions about the impact of price discounting strategies for online sellers and the traffic generators of cross-sold products. To test the hypotheses, three different datasets were collected, and linear regression, vector auto regression (VAR) and sample t-test methods were applied. The offered model is promising for maximizing the profit of online sellers and was validated through probability and cumulative distribution functions (PDF and CDF) after changing the model's parameters.

There are many different factors that influence the people targeted by cross-selling methods, such as the socio-demographics or psychographics of the buyer, and these affect the buyer's behaviour when buying additional products. In daily life, people intuitively tend to buy from places where they have already bought something before, due to familiarity. To examine these claims, building on work that establishes the relationship between customer lifetime value (CLV) and cross-buying, the authors of [10] study cross-selling impacts in terms of purchase rate, lifetime duration and spending, using purchase history data collected from an online shopping mall in Japan. To test the hypotheses, three different models whose parameters are estimated by hierarchical Bayes multivariate regression were proposed to explain the relationship between store loyalty and cross-buying. The results of the study demonstrated that cross-selling is highly correlated with the customer's lifetime duration, amount of spending and purchase rate, and that loyal customers are willing to keep buying products from the same store in the future.

The use of customers' data to understand customer demand and detect potential customers for successful cross-selling, or any kind of marketing practice aimed at multiple product sales, can be misconducted by financial companies such as insurers or banks. Cross-selling is, in principle, important and beneficial for customers as well, because buying more than one item in a transaction may reduce the cost of buying the items separately, unless companies use cross-selling as a form of monopoly power against their competitors. The article [11] explains the ways cross-selling is used in the finance industry and tries to find suitable solutions to prevent its misuse under law and regulations; two improper real-life uses of cross-selling are explained as well. As international examples of misconduct, Payment Protection Insurance (PPI) and Wells Fargo (WF), together with the responses and regulations adopted by the European and US authorities, are presented in order to demonstrate a different aspect of cross-selling in banks.

Although cross-selling is used mostly in financial organizations, these strategies can be applied in any kind of business, as explained in the article [12], which examines a real-life example from the US home video industry. The study tries to indicate the impact on sales of cross-selling approaches like bundling, drafting or quantity forcing with respect to movie demand, identifying the correlation, causality and statistical significance between the demand for old movies, due to a similar cast or director, and the sales of different movies through the distributor. The main aim is to show that the sales are linked to each other, using OLS methods and an assumption relying only on the credits of the movies rather than association identity, to prevent bias that could affect the regression task. In conclusion, the estimates obtained from the proposal were appropriate in terms of the statistical significance of cross-selling impacts on customers.

To build successful cross-selling strategies for increased sales and customer loyalty, many different approaches have been proposed, such as segmenting the customers who have similar buying patterns, market basket analysis, or revealing customer similarities with traditional RFM (recency, frequency, monetary) models to produce recommendations. However, all these strategies have issues with different problems; for instance, customer segmentation is not suitable for big data solutions due to its high computational cost, while market basket analysis may produce misleading, overly general rules that fail to cover important segments. The purpose of the related article [13] is to detect communities in the customer-product relationship, represented as a bipartite graph, through the Louvain algorithm, which maximizes the buying-similarity patterns of both products and customers within clusters; the algorithm was applied to a dataset that contains 773,999 sale transactions with 21 attributes. The results proved that the Louvain algorithm performed well in terms of reduced computational time and produced better solutions for detecting the communities to which cross-selling approaches can be applied, reaching higher response rates for products recommended to the customers located in the same clusters.
Revealing customers' purchase behaviour patterns from stored transactions can be very beneficial for managing cross-selling and up-selling in any kind of sales platform, whether online or physical, and there are several data mining techniques, such as association analysis, that can produce recommendations for marketing campaigns. The purpose of the article [14] is to demonstrate the use of association analysis, which finds relationships between frequent items based on confidence and support values, for deciding cross-selling campaigns in RapidMiner, through the FP-Growth algorithm, which is based on tree construction for finding frequent itemsets, and Association Rule Mining, which defines associative rules in terms of support and confidence requirements. In conclusion, the support and confidence levels of the association rules between items in the dataset, such as printer, keyboard and connector, were revealed based on the results of the applied algorithms in order to determine cross-selling items for marketing.

With the increased usage of data for detecting potential customers for cross-selling, competition among insurance companies to find the customers most likely to remain loyal purchasers has also increased considerably. The research [15], which aimed at understanding how to keep existing customers in efficient ways, demonstrates the impact of customer characteristics on cross-selling through the use case of one of the largest Swiss insurance companies. Fourteen different hypotheses and assumptions were formulated to reveal the impact of customers' features on cross-selling, using several methods such as descriptive statistics, analysis of variance (ANOVA), the χ² test and logistic regression to confirm or reject these hypotheses. According to the findings, most of the hypotheses were confirmed by the statistical tests and the models built on a dataset containing features like age, residence information and number of damages. From the analysis, some of the columns, such as age, the number of damages and the residence area, are remarkably informative about customers' cross-buying behaviour, as proven by observed p-values smaller than 0.1%.

III. MATERIALS AND METHODS

Exploratory data analysis has been performed to investigate the distribution of the attributes, to detect outliers and anomalies, and to provide visual evidence; the required data pre-processing phases were then applied in order to gain a better understanding of the dataset.

3.1 Exploratory Data Analysis

3.1.1 Dataset Features

The attributes of the dataset are shown in Table 3.1.

Table 3.1. Attributes, Descriptions and Data Types of the Dataset

The features shown in Table 3.1 can be grouped into two distinct parts and examined accordingly: continuous features (Age, Annual_Premium, Vintage) and discrete (categorical) features (Response, Vehicle_Age, Vehicle_Damage, Driving_License, Previously_Insured, Region_Code and Policy_Sales_Channel). There are no missing values in the dataset, as shown in Figure 3.1.

Figure 3.1. Analysis of Missing Values in the Dataset
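The paper does not include its analysis code. As a minimal sketch, the first steps above might look as follows in Python with pandas, assuming the column names of the public Kaggle dataset [1]; the file name and random seed are assumptions, since the paper states neither.

    import pandas as pd

    # Load the Health Insurance Cross Sell Prediction data [1].
    # "train.csv" is the usual Kaggle file name and an assumption here.
    df = pd.read_csv("train.csv")

    # Draw the 3,000-row random subset used throughout the study;
    # the paper does not state a seed, so random_state=42 is arbitrary.
    sample = df.sample(n=3000, random_state=42).reset_index(drop=True)

    print(sample.isnull().sum())   # no missing values, cf. Figure 3.1
    print(sample.describe())       # mean/std/percentiles, cf. Figure 3.2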
According to the descriptive statistics of the 3,000 randomly selected samples, including standard deviation, mean, median and percentiles, there is little difference between the mean and the median (50th percentile), so the distributions in the dataset can be assumed to be approximately normal. Furthermore, in order to reduce the effect of the outliers that can be recognized from the standard deviation of the Annual_Premium column, min-max scaling is applied to the Age, Annual_Premium and Vintage columns, as demonstrated in Figure 3.2.

Figure 3.2. Summary of Descriptive Statistics of the Dataset

3.1.2 Continuous Variable Analysis

The Age, Annual_Premium and Vintage attributes contain integer or float continuous variables; in order to visualize the distribution of and relationships among the numeric variables, a pair plot, a heatmap with Pearson correlations and histograms have been created. In these plots, both the distributions and the samples are coloured with respect to the cross-selling response of the customers (blue represents 0 and orange represents 1). Figure 3.3 shows the pair plot of the continuous variables.

Figure 3.3. Pair Plot of Continuous Variables

According to the pair plot, the age attribute of the policyholders demonstrates that younger customers are more likely to respond positively to buying the vehicle insurance provided by the company, so it can be said that the young customers of the insurance company are much more suitable targets for cross-selling proposals. The health insurance policyholders roughly older than 20 are not interested in buying vehicle insurance, especially the customers older than 40. The histogram of the age attribute in Figure 3.4 shows a right-skewed distribution in which young policyholders are significantly more frequent than older policyholders in the sample.

Figure 3.4. Histogram Diagram of Age Attribute

The amount a customer needs to pay as a yearly premium is defined in the Annual_Premium attribute; from the pair plot, the policyholders paying around 5,000 to 15,000 are willing to buy additional insurance services from the company. The pair plot also indicates two peak points, where Annual_Premium reaches around 100,000 and 200,000, and in addition most of the customers who have to pay larger yearly premiums are markedly uninterested in buying vehicle insurance. As shown in Figure 3.5, the histogram of the Annual_Premium attribute also has a right-skewed distribution. Most of the randomly selected customers regularly pay around 25,000 to 50,000 to the company in order to have a guarantee of compensation.

Figure 3.5. Histogram Diagram of Annual_Premium Attribute

The Vintage attribute represents the number of days the customer has been associated with the company. From the pair plot, it can be seen that the policyholders who recently started using the provided service (between 0 and 50 days) are willing to buy vehicle insurance in addition to their health insurance. The histogram in Figure 3.6 shows that the distribution of the Vintage attribute simply fluctuates, in contrast to the Age and Annual_Premium attributes, which both have right-skewed distributions.

Figure 3.6. Histogram Diagram of Vintage Attribute

In the last part of the exploratory analysis of the continuous variables, the Pearson correlation matrix in Figure 3.7, which displays the correlation coefficients among the continuous features of the dataset, has been created in order to reveal positive or negative relations. Accordingly, the correlations between the numeric features and the Response attribute are not remarkable, and all of the correlations among the continuous variables are close to 0.

Figure 3.7. Heatmap of Pearson Correlation Coefficient Matrix
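A sketch of how these plots could be reproduced with seaborn, continuing from the `sample` frame of the earlier sketch; the exact styling of the original figures is unknown.

    import matplotlib.pyplot as plt
    import seaborn as sns

    continuous = ["Age", "Annual_Premium", "Vintage"]

    # Pair plot coloured by the cross-selling response
    # (blue = 0, orange = 1), cf. Figure 3.3.
    sns.pairplot(sample, vars=continuous, hue="Response")
    plt.show()

    # Histograms of the continuous attributes, cf. Figures 3.4-3.6.
    sample[continuous].hist(bins=30)
    plt.show()

    # Pearson correlation heatmap, cf. Figure 3.7.
    sns.heatmap(sample[continuous + ["Response"]].corr(), annot=True)
    plt.show()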

3.1.3 Categorical Variable Analysis

The dataset consists of seven discrete attributes, named Gender, Vehicle_Age, Vehicle_Damage, Driving_License, Previously_Insured, Region_Code and Policy_Sales_Channel, and the distributions of these attributes are analyzed with respect to the Response feature.

For the Gender attribute, there are 1,623 male and 1,377 female samples, so its distribution is balanced in the dataset, which suggests it might not be good at separating the Response outcome. 13% of male and 11% of female customers responded positively, so it can be said that males are slightly more interested in buying vehicle insurance than females, as shown in Figure 3.8.

Figure 3.8. Distribution of Gender Attribute

The Vehicle_Age feature can be distinctive in revealing the preferences of the policyholders. 28% of the customers whose car is older than two years tend to get vehicle insurance. In contrast, only 0.04% of the policyholders with new vehicles responded positively, so most people are much less interested in insurance when their vehicle is new than the ones whose car is older than 2 years, as shown in Figure 3.9.

Figure 3.9. Distribution of Vehicle_Age Attribute

According to the balanced distribution of the Vehicle_Damage attribute, there are 1,537 customers whose vehicle has been damaged, while the rest of the customers have a vehicle that has not been damaged yet. 25% of the customers who own a damaged vehicle approved the cross-selling offer provided by the company. From that evidence, it is reasonable to say that the customers who already have health insurance and a damaged car are keen on buying vehicle insurance, as shown in Figure 3.10.

Figure 3.10. Distribution of Vehicle_Damage Attribute

Among the 3,000 randomly selected samples, 1,675 policyholders have never taken out vehicle insurance, while 1,325 of them have purchased vehicle insurance before. 22% of the customers who were not previously insured confirmed buying vehicle insurance, as can be seen in Figure 3.11; as expected, of the customers who already have vehicle insurance, only 0.002% accepted another vehicle insurance. After the analysis, it is clear that Vehicle_Damage is a good indicator for vehicle insurance cross-selling campaigns.

Figure 3.11. Distribution of Previously_Insured Attribute

For the Driving_License attribute, 2,992 out of 3,000 samples have a driving license, which is legally required to get the insurance, and 12% of these policyholders are interested in the cross-selling proposal. Figure 3.12 shows the distribution of Driving_License by the Response attribute; there are no histogram bars for the customers who do not have a driving license among the samples.

Figure 3.12. Distribution of Driving_License Attribute

The Policy_Sales_Channel attribute represents the channel through which the company reaches the customer, for instance agencies, phone or mail; 152, 124 and 26 are the anonymized channel codes that are the most efficient ways to communicate with the customers for cross-selling. Besides, the Region_Code feature stands for the unique code of the customer's region, and 844 out of 3,000 samples come from the same region, code 28, so that region is most likely a city providing a huge number of potential customers.
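The per-category response rates quoted above can be computed with a simple cross-tabulation; a sketch under the same assumptions as before:

    import pandas as pd

    categorical = ["Gender", "Vehicle_Age", "Vehicle_Damage",
                   "Driving_License", "Previously_Insured"]

    # Share of positive responses within each category level,
    # e.g. roughly 13% of males vs 11% of females (Figure 3.8).
    for col in categorical:
        print(pd.crosstab(sample[col], sample["Response"], normalize="index"), "\n")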

3.2. Data Pre-processing

Data pre-processing is vital for transforming raw data into a format appropriate for gathering meaningful insights from data analysis and for modelling future predictions. In this study, the data pre-processing phase includes two main steps: data transformation, where one-hot encoding, min-max scaling and oversampling methods are applied, and feature selection methods like Select K-Best or the Hashing Encoder, used to choose the most appropriate features of the dataset for modelling.

3.2.1. Data Transformation

3.2.1.1. One-Hot Encoding

The Gender, Vehicle_Damage, Vehicle_Age, Region_Code and Policy_Sales_Channel attributes are categorical variables, even though Region_Code and Policy_Sales_Channel are stored as floats. One-hot encoding is a technique in which the original column is removed and replaced by new columns containing binary integer values to represent each sample. In essence, one-hot encoding is one of the most well-known encoding techniques, transforming a single variable with n observations and d distinct values into d binary variables with n observations each [16]. After applying the one-hot encoding technique as part of the data transformation, the dataset grew from the original 12 attribute columns to 133 columns.
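A sketch of this step with pandas, under the assumption that `get_dummies` (rather than, say, scikit-learn's `OneHotEncoder`) matches what the study did:

    import pandas as pd

    categorical_cols = ["Gender", "Vehicle_Damage", "Vehicle_Age",
                        "Region_Code", "Policy_Sales_Channel"]

    # Region_Code and Policy_Sales_Channel are stored as floats but are
    # categorical codes, so they are encoded together with the rest.
    encoded = pd.get_dummies(sample, columns=categorical_cols, dtype=int)

    # The paper reports 133 resulting columns for its 3,000-row sample;
    # the exact count depends on the distinct codes the sample contains.
    print(encoded.shape)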
3.2.1.2. Min-Max Scaling

The Age, Annual_Premium and Vintage features are numeric features that all have different distributions, standard deviations, minimum and maximum values. In order to decrease the impact of possible outliers in the dataset, min-max scaling is applied to rescale all the numeric values into the range 0 to 1, where 0 stands for the minimum value and 1 for the maximum value of the related column. Min-max scaling is highly important to prevent the bias that might be caused by variables defined on different scales contributing differently to the models. Instead of other scaling methods like the standard or robust scaler, min-max scaling is chosen in order to be able to use the Naïve Bayes algorithms (BernoulliNB and ComplementNB), which require only positive values for modelling. Figure 3.13 shows the descriptive statistics of these three features after min-max scaling is applied.

Figure 3.13. Summary of Descriptive Statistics of the Scaled Features
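A minimal scikit-learn sketch of the scaling step, continuing from the `encoded` frame:

    from sklearn.preprocessing import MinMaxScaler

    numeric_cols = ["Age", "Annual_Premium", "Vintage"]

    # Rescale the numeric columns into [0, 1]; keeping all values
    # non-negative is what later allows ComplementNB and BernoulliNB.
    encoded[numeric_cols] = MinMaxScaler().fit_transform(encoded[numeric_cols])

    print(encoded[numeric_cols].describe())   # cf. Figure 3.13

Note that, as described, scaling happens before the train-test split; fitting the scaler on the training portion only would avoid a mild form of test-set leakage.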
3.2.1.3. Train-Test Split

After the one-hot encoding and scaling transformations, the dataframe is split into training and testing parts: 70% of the dataset is used to create the training set and 30% to create the testing set for the models.

3.2.1.4. Synthetic Minority Oversampling Technique (SMOTE)

In classification problems, a balanced output feature is critical for the trustworthiness of the predictions, and Figure 3.14 indicates that the class label distribution of the Response column is highly imbalanced: there are 2,619 samples whose answer is 0, whereas 381 samples responded 1.

Figure 3.14. Distribution of Response Attribute

To deal with an imbalanced class-labelled dataset, undersampling and oversampling techniques can be applied. Undersampling decreases the quantity of the majority class label, which is 0 in this dataset, whereas oversampling increases the proportion of the minority class, which is 1, by re-sampling. In essence, Nitesh Chawla et al. (2002) present a method of over-sampling the minority class by creating synthetic minority class examples [17]. The Synthetic Minority Oversampling Technique (SMOTE) works through the k-nearest neighbour algorithm to increase the proportion of the minority class by creating synthetic data. Right after the train-test split phase, before diving into the details of machine learning, the SMOTE technique is applied only to the training part of the dataset, making the counts of response labels 1 and 0 in the training set equal, at 1,842 samples each.
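A sketch of the split and oversampling steps with scikit-learn and imbalanced-learn; the seed is again an assumption:

    from imblearn.over_sampling import SMOTE
    from sklearn.model_selection import train_test_split

    X = encoded.drop(columns=["Response", "id"], errors="ignore")
    y = encoded["Response"]

    # 70/30 split as in Section 3.2.1.3.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42)

    # SMOTE is fitted on the training portion only, so no synthetic
    # samples leak into the test set; both classes end up equal in size.
    X_train_bal, y_train_bal = SMOTE(random_state=42).fit_resample(X_train, y_train)
    print(y_train_bal.value_counts())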
3.3. Feature Selection

The dataset consists of 9 attributes (8 input features and 1 output feature) before the one-hot encoding applied in the data transformation phase increases the number of attributes to 133 columns. The main purpose of performing feature selection is to determine the features that represent the dataset best, and there are a few reasons and benefits for applying feature selection methods before modelling. First of all, modelling with too many columns may reduce the training error; however, such models might not generalize well to new samples, which is called overfitting. In essence, overfitting refers to a model that memorizes the training data and performs poorly on the testing data, and one of the causes of overfitting is model complexity. When dealing with too many feature columns, a situation also known as the curse of dimensionality, reducing model complexity through feature selection before modelling helps keep the models as simple as possible and easy to explain in terms of their features.

In this study, the Sequential Feature Selector (SFS), Select K-Best, Feature Hasher and Hashing Encoder feature selection methods and the dimensionality reduction technique Principal Component Analysis (PCA) are applied in order to find the best features before modelling, and the impact of the selected features on the models is examined in terms of the accuracy and F1-score classification metrics. The main aim of the study is to compare the models where feature selection methods are applied against the models created after the required data transformation phases only.

3.3.1. Sequential Feature Selector (SFS)

The Sequential Feature Selector (SFS) is a method that chooses the most suitable subset of features in the dataset, reducing the computational time of modelling and the generalization error of the model by eliminating irrelevant attributes. After applying the method, the attributes named Driving_License, Previously_Insured, Vehicle_Damage_is_Yes, Vehicle_Age_is_> 2 Years and Policy_Sales_Channel_is_111.0 were selected as the best 5 features among the 130 input attributes.
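The paper does not say which SFS implementation it used (scikit-learn and mlxtend both provide one); a sketch with scikit-learn's, assuming the KNN estimator of Section 3.4:

    from sklearn.feature_selection import SequentialFeatureSelector
    from sklearn.neighbors import KNeighborsClassifier

    # Forward selection of 5 features, scored by 5-fold CV accuracy.
    sfs = SequentialFeatureSelector(KNeighborsClassifier(n_neighbors=3),
                                    n_features_to_select=5,
                                    direction="forward",
                                    scoring="accuracy", cv=5)
    sfs.fit(X_train_bal, y_train_bal)
    print(X_train_bal.columns[sfs.get_support()])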
3.3.2. Select K-Best

The Select K-Best method has been performed with the number of top features to keep, represented by k, chosen as 5, and chi2 as the score function, which computes the χ² statistic between each input feature and the output column. There are many different score functions for the method, such as mutual information, univariate linear regression tests or the ANOVA F-value computation. In other words, Select K-Best keeps the k input features with the highest scores under the chosen score function.
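A corresponding scikit-learn sketch; note that chi2 requires non-negative inputs, which the earlier min-max scaling guarantees:

    from sklearn.feature_selection import SelectKBest, chi2

    # Keep the 5 features with the highest chi2 score against Response.
    skb = SelectKBest(score_func=chi2, k=5).fit(X_train_bal, y_train_bal)
    print(X_train_bal.columns[skb.get_support()])
    X_train_k = skb.transform(X_train_bal)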
3.3.3. Feature Hasher

Hashing is a widely used technique that transforms a given input feature into a chosen, fixed number of components, and this encoder represents categorical attributes in new columns by fixing the number of dimensions. The Feature Hasher method is used to avoid the drawback of one-hot encoding, where a categorical attribute containing many categories increases the dimension of the dataset and causes more complex models. In this study, the method is applied to the Gender, Vehicle_Age and Vehicle_Damage attributes, using 2, 4 and 3 as the number of output features respectively. The one-hot encoding method is applied to the rest of the categorical attributes (Region_Code and Policy_Sales_Channel).
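A sketch of this alternative encoding with scikit-learn's FeatureHasher, applied to the raw categorical columns of the sample:

    from sklearn.feature_extraction import FeatureHasher

    # Hash each column into its own small fixed number of components
    # (2, 4 and 3 respectively, as stated in the text).
    hashed_parts = []
    for col, n in [("Gender", 2), ("Vehicle_Age", 4), ("Vehicle_Damage", 3)]:
        hasher = FeatureHasher(n_features=n, input_type="string")
        # Each row is passed as a one-element list of its string category.
        hashed_parts.append(
            hasher.transform(sample[col].astype(str).apply(lambda v: [v])).toarray())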
3.3.4. Hashing Encoder

The Hashing Encoder is a method that vectorizes features in a fast and extremely efficient way; basically, the corresponding feature is converted into a vector of features. The method is an improved version of a multivariate hashing implementation with configurable dimensionality, fixing some inappropriate implementations of the encoding process [18]. The Gender, Region_Code, Vehicle_Age, Vehicle_Damage and Policy_Sales_Channel attributes are hashed, where the number of components, i.e. the number of bits representing the related features, has been chosen as 12 (the default is 8, and it can be increased up to 32 bits). The Region_Code and Policy_Sales_Channel attributes in particular consist of many unique categorical values, so through the Hashing Encoder the complexity of the dataset for modelling is significantly reduced.
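A sketch using the category_encoders package that [18] documents:

    import category_encoders as ce

    # Hash the five categorical columns into 12 shared components [18].
    encoder = ce.HashingEncoder(cols=["Gender", "Region_Code", "Vehicle_Age",
                                      "Vehicle_Damage", "Policy_Sales_Channel"],
                                n_components=12)
    hashed_df = encoder.fit_transform(sample.drop(columns=["Response"]),
                                      sample["Response"])
    print(hashed_df.shape)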
3.3.5. Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is a statistical dimensionality reduction method performed to reduce the dimensionality of the data. The main aim of the method is to describe the features through components, creating new uncorrelated variables that maximize the total explainable variance while minimizing the information loss [19]. After all the data pre-processing phases, the dimension of the dataset had increased to 130 columns, in contrast to the 12 columns the dataset contained before the transformations. In this study, the dataset is described using three components in order to enhance interpretability, reduce dimensionality and extract the most relevant structure from the 130 attributes of the dataset.
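A minimal sketch of the projection:

    from sklearn.decomposition import PCA

    # Project the 130-column pre-processed matrix onto 3 principal
    # components and inspect the variance they retain (cf. Figure 4.5).
    pca = PCA(n_components=3)
    X_train_pca = pca.fit_transform(X_train_bal)
    print(pca.explained_variance_ratio_.sum())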
3.4. Modelling

After all the required data pre-processing phases were completed, the K-Nearest Neighbours (KNN), Complement Naïve Bayes, Bernoulli Naïve Bayes, Logistic Regression, Random Forest, Multilayer Perceptron and Support Vector Machine (SVM) classification algorithms were applied to classify the target customers who are suitable for cross-selling campaign offers. Both parametric and non-parametric machine learning algorithms have been used in order to provide trustworthy results and a wider perspective when interpreting the classification results. To evaluate and measure the classification performance of the models, the F1 score, computed per class and averaged with weights given by class support, and the 5-fold cross-validated accuracy have been used. At the beginning of the modelling phase, all classification algorithms were applied without performing any feature selection or dimensionality reduction method.

The first phase of the modelling part consists of the base models and the classification results gathered without applying any feature selection or dimensionality reduction technique. For the K-Nearest Neighbours model, 10 models were created to measure training and testing accuracies, and additionally 15 different models were created to detect the optimal number of neighbours, represented by K. To increase the trustworthiness of the accuracy metric, 3-fold cross-validated accuracy scores were calculated while deciding the optimal number of neighbours, and Figure 3.15 demonstrates the misclassification error versus the number of neighbours. Accordingly, as the number of neighbours K increases, the misclassification error increases as well, so from this finding K is chosen as 3, as shown in Figure 3.15.

Figure 3.15. Finding Optimal Number of K for Base KNN Model

Unlike the model created with the non-parametric K-Nearest Neighbours algorithm, the rest of the base models, built with the Random Forest, Complement Naïve Bayes, Bernoulli Naïve Bayes, Multilayer Perceptron, Support Vector Classifier and Logistic Regression algorithms, were used with their default parameters.
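A sketch of the K-selection loop behind Figure 3.15, under the same assumptions as the earlier sketches:

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    # Misclassification error (1 - accuracy) for K = 1..15 under
    # 3-fold CV, mirroring the elbow analysis of Figure 3.15.
    errors = []
    for k in range(1, 16):
        scores = cross_val_score(KNeighborsClassifier(n_neighbors=k),
                                 X_train_bal, y_train_bal,
                                 cv=3, scoring="accuracy")
        errors.append(1 - scores.mean())
    print(int(np.argmin(errors)) + 1, errors)   # the study settles on K = 3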

The Feature Hasher, Select K-Best and Hashing Encoder feature selection methods were applied after performing the data pre-processing steps already described above, and were then used to create machine learning models with the same parametric and non-parametric algorithms. The figures showing misclassification error versus number of neighbours for the K-Nearest Neighbours algorithm combined with the Feature Hasher and the Hashing Encoder show characteristics similar to Figure 3.15: as the number of neighbours K increases, the misclassification error increases. Unlike all the other applied feature selection methods, after applying the Select K-Best method the corresponding curve shows a significantly different shape, although the optimal number of neighbours can again clearly be seen to be 3, as shown in Figure 3.16.

Figure 3.16. Finding Optimal Number of K (Select K-Best Applied)

In addition, the Sequential Feature Selector (SFS) method was used to find the best 5 features for the K-Nearest Neighbours model with K chosen as 3. The Driving_License, Previously_Insured, Vehicle_Damage_is_Yes, Vehicle_Age_is_> 2 Years and Policy_Sales_Channel_is_111.0 features were selected by the method, with the 5-fold cross-validated accuracy as the performance metric, and K-Nearest Neighbours is the only algorithm applied to demonstrate the SFS results. Furthermore, the same machine learning algorithms that were applied for the base models, as explained above, are used after the Feature Hasher, Select K-Best and Hashing Encoder feature selection methods as well.
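A compact sketch of the evaluation loop implied by Section 3.4 and the results section; the `max_iter` values are assumptions made only so the solvers converge, since the paper states default parameters:

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import f1_score
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import BernoulliNB, ComplementNB
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.neural_network import MLPClassifier
    from sklearn.svm import SVC

    models = {
        "KNN (K=3)": KNeighborsClassifier(n_neighbors=3),
        "ComplementNB": ComplementNB(),
        "BernoulliNB": BernoulliNB(),
        "LogisticRegression": LogisticRegression(max_iter=1000),
        "RandomForest": RandomForestClassifier(random_state=42),
        "MLP": MLPClassifier(max_iter=500, random_state=42),
        "SVC": SVC(),
    }

    for name, model in models.items():
        acc = cross_val_score(model, X_train_bal, y_train_bal,
                              cv=5, scoring="accuracy").mean()
        pred = model.fit(X_train_bal, y_train_bal).predict(X_test)
        f1 = f1_score(y_test, pred, average="weighted")
        print(f"{name}: CV accuracy={acc:.3f}, weighted F1={f1:.3f}")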

IV. RESULTS

In order to obtain trustworthy and easily interpretable classification results for detecting the customers appropriate for cross-selling campaigns, seven different classification algorithms were first applied to the dataset without any feature selection. When interpreting the test dataset results, both for the base models without any feature selection and for the models created after applying feature selection methods, the 5-fold cross-validated accuracy and F1 score metrics have been used to measure the success of the models. Table 4.1 presents the results of the base machine learning models; accordingly, Random Forest and Multilayer Perceptron achieved the highest accuracy and F1 scores among the other five algorithms.

Table 4.1. Accuracy and F1 Performance Scores of Base Models

Figure 4.1 compares the AUC scores of the models; Logistic Regression has the highest AUC score, 0.835. The Support Vector Machine classifier is not shown in the AUC comparison figure because, in the configuration used, the algorithm does not provide prediction probabilities.

Figure 4.1. ROC Curve Analysis Results of Base Models
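A sketch of the AUC comparison, continuing from the `models` dictionary above; SVC is skipped automatically because scikit-learn only exposes `predict_proba` on it when `probability=True` is set, matching the paper's remark:

    from sklearn.metrics import roc_auc_score

    # AUC for every base model that exposes class probabilities
    # (cf. Figure 4.1).
    for name, model in models.items():
        if not hasattr(model, "predict_proba"):
            continue
        proba = model.fit(X_train_bal, y_train_bal).predict_proba(X_test)[:, 1]
        print(f"{name}: AUC={roc_auc_score(y_test, proba):.3f}")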
Table 4.2 shows the results of the machine learning models created after the Feature Hasher feature selection method was applied for the classification of customers. In contrast to the base machine learning models, there are significant improvements in the models using the Logistic Regression and Support Vector Classifier algorithms. As in the base models, the Random Forest and Multilayer Perceptron algorithms performed well in terms of the accuracy and F1 score metrics.

Table 4.2. Performance Metrics of the Models (Feature Hasher Applied)

Figure 4.2 compares the models with the Feature Hasher applied with respect to the F1 score. Even though the Support Vector Classifier achieved high accuracy scores, the F1 score of that model was the lowest.

Figure 4.2. F1 Score Comparison of the Models (Feature Hasher Applied)

The Select K-Best feature selection method was performed after the required data transformation phases, respectively one-hot encoding, scaling, the train-test split and SMOTE for dealing with the imbalanced class labels, and the top 5 features were selected with the chi2 score function. The Complement Naïve Bayes and Bernoulli Naïve Bayes algorithms had the lowest accuracy scores, while the rest of the algorithms had similar accuracy scores, as shown in Table 4.3.

Table 4.3. Performance Metrics of the Models (Select K-Best Applied)

Differently from the base models and the ones created after the Feature Hasher method was applied, the K-Nearest Neighbours algorithm achieved the highest F1 score after the Select K-Best method. From these results, it can be said that the Select K-Best feature selection method works better with non-parametric machine learning algorithms, as shown in Figure 4.3.

Figure 4.3. F1 Score Comparison of the Models (Select K-Best Applied)

The Hashing Encoder is the technique that uses bits to represent the chosen features in the dataset; the Gender, Region_Code, Vehicle_Age, Vehicle_Damage and Policy_Sales_Channel features are hashed into 12 bits by the encoder. There are few distinct categories in the Gender, Vehicle_Age and Vehicle_Damage columns; however, Policy_Sales_Channel and Region_Code can be seen as high-cardinality features, and the bit size can be expanded up to 32 bits. The Complement Naïve Bayes, Bernoulli Naïve Bayes and the other machine learning algorithms reached similar accuracy scores, as demonstrated in Table 4.4.

Table 4.4. Performance Metrics of the Models (Hashing Encoder Applied)


Similar to the base machine learning models, Random Forest achieved the highest score in terms of the F1 metric, as shown in Figure 4.4. Additionally, the rest of the algorithms reached similar F1 scores of about 0.75, as indicated in Figure 4.4.

Figure 4.4. F1 Score Comparison of the Models (Hashing Encoder Applied)

Principal Component Analysis (PCA) is used as a dimension reduction technique to represent the features of the 130-column dataset by principal components. The aim of applying PCA is to maximize the explained variance ratio by choosing the correct number of components. In this study, three principal components were created, and using these components 100% of the total variance is explained, as shown in Figure 4.5.

Figure 4.5. Principal Components of the Dataset in 3D

Lastly, the Sequential Feature Selector (SFS) is applied to see the impact of the chosen features on modelling in terms of accuracy and F1 score. Because the dataset contains 130 features after all the data pre-processing phases, only the 5 best features were selected by SFS, and the machine learning models were developed with the Driving_License, Previously_Insured, Vehicle_Damage_is_Yes, Vehicle_Age_is_> 2 Years and Policy_Sales_Channel_is_111.0 features. All of the models reached similar 5-fold cross-validated accuracy scores, so only Figure 4.6 is presented, showing the accuracy performance of the K-Nearest Neighbours model versus the number of features selected by SFS, with the standard error.

Figure 4.6. Accuracy Performance of KNN Model (SFS Applied)

V. DISCUSSION AND CONCLUSION

This study aims to present comparable results of machine learning algorithms for classifying samples in terms of their appropriateness for cross-selling marketing campaigns. Various pre-processing methods were performed before developing the models, such as encoding, scaling and oversampling for dealing with the imbalance of the feature predicted by the models. Essentially, the main concern of the study is to reveal the impact of feature selection methods on the accuracy and F1 score classification metrics of the models created before and after applying these methods. To investigate this concern, 7 different state-of-the-art machine learning algorithms were applied after completing the necessary data pre-processing, and the results of the base models were collected. Furthermore, 4 different feature selection methods and 1 dimension reduction technique were applied to see the difference in the classification metrics between the base models and the models to which a feature selection method had been applied. Accordingly, the algorithms performed differently from one another depending on the applied feature selection method. However, the training time of the models after selecting an appropriate number of features with the feature selection methods was reduced dramatically. This study can be extended by using hyperparameter optimization techniques to find the best parameters for each model and to use these when creating new models after feature selection. Lastly, 3,000 samples were randomly chosen from the dataset, and the number of samples can be increased as well.
VI. REFERENCES

[1] Health Insurance Cross Sell Prediction Dataset. https://www.kaggle.com/anmolkumar/health-insurance-cross-sell-prediction
[2] Kwiatkowska, J. (2018). Cross-selling and up-selling in a bank. Copernican Journal of Finance & Accounting, 7(4), 59-70. http://dx.doi.org/10.12775/CJFA.2018.02
[3] Purnamasari et al. (2020). The Determination Analysis of Telecommunications Customers Potential Cross-Selling with Classification Naive Bayes and C4.5. J. Phys.: Conf. Ser., 1641, 012010. https://iopscience.iop.org/article/10.1088/1742-6596/1641/1/012010
[4] Sahar F. Sabbeh (2018). "Machine-Learning Techniques for Customer Retention: A Comparative Study". International Journal of Advanced Computer Science and Applications (IJACSA), 9(2). http://dx.doi.org/10.14569/IJACSA.2018.090238
[5] Zhang, Ren-Qian; Yi, Meng; Wang, Qi-Qi & Xiang, Chen (2018). "Polynomial algorithm of inventory model with complete backordering and correlated demand caused by cross-selling". International Journal of Production Economics, 199(C), 193-198. https://www.sciencedirect.com/science/article/abs/pii/S0925527318301245
[6] Hell, Franz; Taha, Yasser; Hinz, Gereon; Heibei, Sabine; Müller, Harald & Knoll, Alois (2020). "Graph Convolutional Neural Network for a Pharmacy Cross-Selling Recommender System". Information, 11(11), 525. https://doi.org/10.3390/info11110525
[7] Salo, J., Cripps, H. & Wendelin, R. (2020). Developing cross-selling capability in key corporate bank relationships: the case of a Nordic Bank. Journal of Financial Services Marketing, 25(3-4), 45-52. https://doi.org/10.1057/s41264-020-00076-8
[8] Chen, Zhen-Yu; Fan, Zhi-Ping & Sun, Minghe (2016). "A multi-kernel support tensor machine for classification with multitype multiway data and an application to cross-selling recommendations". European Journal of Operational Research, 255(1), 110-120. https://www.sciencedirect.com/science/article/abs/pii/S037722171630340X
[9] Kocas, Cenk; Pauwels, Koen & Bohlmann, Jonathan D. (2018). "Pricing Best Sellers and Traffic Generators: The Role of Asymmetric Cross-selling". Journal of Interactive Marketing, 41(C), 28-43. https://www.sciencedirect.com/science/article/abs/pii/S109499681730052X
[10] Wirawan Dony Dahana, Makoto Morisada & Yukihiro Miwa (2019). Cross-selling across stores or within a store? Impacts of cross-buying behavior in online shopping malls. Journal of Marketing Channels. DOI: 10.1080/1046669X.2019.1646186
[11] Francesco De Pascalis (2018). "Sales Culture and Misconduct in the Financial Services Industry: An Analysis of Cross-Selling Practices". Business Law Review, 39(5), 150-159. https://bura.brunel.ac.uk/handle/2438/16250
[12] Luís Cabral & Gabriel Natividad (2016). "Cross-selling in the US home video industry". RAND Journal of Economics, 47(1), 29-47. https://doi.org/10.1111/1756-2171.12117
[13] Lili Zhang, Jennifer Priestley, Joseph DeMaio, Sherry Ni and Xiaoguang Tian (2021). "Measuring Customer Similarity and Identifying Cross-Selling Products by Community Detection". Big Data, April 2021, 132-143. http://doi.org/10.1089/big.2020.0044
[14] Andis Indrawan, I Wayan; Oka Saputra, Komang & Linawati, Linawati (2020). Implementation of Association Rules to Manage Cross-Selling and Up-Selling for IT Shop. International Journal of Engineering and Emerging Technology, 4(2), 60-63. ISSN 2579-5988. https://doi.org/10.24843/IJEET.2019.v04.i02.p11
[15] Staudt, Y. and Wagner, J. (2018). "What policyholder and contract features determine the evolution of non-life insurance customer relationships? A case study analysis". International Journal of Bank Marketing, 36(6), 1098-1124. https://doi.org/10.1108/IJBM-11-2016-0175
[16] Kedar Potdar, Taher S. Pardawala and Chinmay D. Pai (2017). A Comparative Study of Categorical Variable Encoding Techniques for Neural Network Classifiers. International Journal of Computer Applications, 175(4), 7-9.
[17] Nitesh Chawla et al. (2002). SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, 16, 321-357. DOI: 10.1613/jair.953
[18] Category Encoders (scikit-learn-contrib) Hashing Encoder Documentation. https://contrib.scikit-learn.org/category_encoders/hashing.html
[19] Jolliffe, Ian T. and Cadima, Jorge (2016). Principal component analysis: a review and recent developments. Phil. Trans. R. Soc. A, 374: 20150202. http://doi.org/10.1098/rsta.2015.0202
