Genetic Algo Application On Credit Card Fraud
Abstract - Credit card fraud (CCF) has been a serious challenge facing credit card holders and card-issuing companies over the past decades. CCF is performed at two levels: transaction-level fraud and application-level fraud. This paper focuses on application-level CCF detection using a Genetic Algorithm (GA) as a feature selection technique. The GA feature selection is carried out in two phases. In the first phase, designated the first priority features, eight (8) attributes were selected as the fittest attributes. In the second phase, referred to as the second priority features, another set of eight (8) attributes was considered and selected. The Naïve Bayes (NB), Random Forest (RF) and Support Vector Machine (SVM) supervised machine learning techniques were used to detect CCF on the German credit card dataset, which is an imbalanced dataset. The experimental findings of the proposed model revealed that the first priority features are the most important features. The results also showed that the RF algorithm outperformed NB and SVM in terms of accuracy, fraud detection rate and precision.
Keywords: Credit card, Imbalanced dataset, Genetic algorithm, Fraud detection, NB, SVM, Random forest

I. INTRODUCTION
Banking and finance are vital sectors in our day-to-day activities; hardly a day passes without dealing with a bank, either through an online platform or through a physical transaction [1]. The efficiency and viability of both the private and public sectors have risen considerably because of banking information systems. Credit cards are extensively employed as a medium of payment owing to the general acceptance of e-commerce, internet technology, online banking and advances in smart mobile devices, especially for online transactions carried out through web payment gateways such as Alipay, PayPal and others. As credit cards have become the dominant medium of payment for both online and offline transactions, the rate of credit card fraud (CCF) is also increasing at an alarming pace [2], [3].
Financial fraud poses a serious menace with far-reaching consequences for individuals, corporate organizations, government and the finance industry. Fraud is an unlawful or criminal deception aimed at attaining financial or personal gain [4]. CCF is concerned with the illegitimate use of credit card data for transactions. Credit card transactions can be performed either digitally or physically [5]. In a physical transaction, the card is used either through physical contact or by scanning it with a device, while in a digital transaction the cardholder usually supplies the card number, expiry date and card verification number by phone or online. Even with the several authorization techniques put in place, CCF has not been stalled efficiently. To prevent losses to fraudsters, two mechanisms are commonly employed: fraud detection and fraud prevention. Fraud detection is a technique of monitoring cardholders' transaction activities with the intent of detecting whether an incoming transaction comes from the cardholder or from a fraudster [6], while fraud prevention is a defensive technique in which a fraudulent transaction is halted from taking place in the first instance. Generally, fraud detection is of two types: anomaly detection and misuse detection. Misuse detection employs classification techniques to determine whether an incoming transaction is a fraud or not. Typically, this type of approach uses a model that learns the various existing fraud patterns. Anomaly detection, in contrast, develops a historical transaction model of the behavioral profile of a cardholder's normal transactions and decides that an incoming transaction is a potential fraud once it deviates from the usual transaction pattern. However, an anomaly detection technique requires sufficient successive training data to model the normal transaction pattern of a cardholder [7].
Detecting fraudulent transactions with traditional manual approaches is time-consuming and inefficient, and with the advent of big data manual procedures have become even more impracticable. Financial organizations have therefore recently turned to computational approaches for the control and prevention of CCF. Also, the increase in the number of users and online transactions has brought serious workloads to these systems [8].
Data mining is one of the outstanding approaches employed in preventing and detecting CCF [9]. According to [10], CCF detection is the technique of classifying transactions into two classes: genuine and fraudulent. Fraud detection on credit cards is modeled on the analysis of a card's spending pattern. Many techniques have been employed for credit card fraud detection, among them artificial neural networks [11], decision trees [12], frequent itemset mining [13], genetic algorithms (GA) [11], [12], migrating birds
978-1-7281-9677-0/20/$31.00 ©2020 IEEE
2020 International Conference on Decision Aid Sciences and Application (DASA)
optimization algorithm [16], Naïve Bayes (NB) [17] and support vector machine (SVM) [18]. A comparative analysis of logistic regression and NB was carried out by [19]. The performance of Bayesian and neural networks was evaluated on CCF data in [20], while [21] tested the capability of decision trees, neural networks and logistic regression in fraud detection. Ref. [22] appraised random forests and SVM combined with logistic regression in an effort to attain better detection of CCF. A number of difficulties are associated with credit card fraud detection, among them: the dynamism of fraudulent behavior patterns, that is, fraudulent transactions sometimes look like legitimate ones; the availability of credit card transaction datasets, which are highly imbalanced in nature; optimal feature selection for the models; and the choice of an appropriate performance evaluation metric for skewed CCF data [10]. The sampling approach, the selection of features and the detection technique(s) employed also have a great effect on CCF detection performance. The objective of this work is to examine the effect of GA feature selection priority on selected variables of highly skewed CCF data, using NB, Random Forest (RF) and SVM as classifiers to evaluate the approach.

II. RELATED STUDY
Credit card transaction classification is usually regarded as a binary classification problem: a transaction is categorized either as a genuine transaction (negative class) or as a fraudulent transaction (positive class).

A. Feature selection
Analysis of the cardholder's spending profile is the backbone of CCF detection. This spending behavior is analyzed by selecting optimal features that capture the uniqueness of a credit card profile. Transaction profiles (both legitimate and fraudulent) tend to change dynamically. Hence, an optimal feature selection approach that clearly differentiates between the two profiles is required for efficient classification of credit card transactions. The selected features that represent the card usage profile, together with the algorithms used, determine the performance of a CCF detection system. These features are derived from the transactions (both legitimate and fraudulent) and the transaction history of a credit card. They can be grouped into five major types: all-transactions statistics, merchant-type statistics, regional statistics, time-based amount statistics and time-based number-of-transactions statistics [10], [23].
Features that represent the general profile of card usage are referred to as the all-transactions statistics type. Features that take into account the spending profile of the card in relation to geographical regions are known as the regional statistics type. The merchant statistics type refers to features that describe the card's usage with different merchant types. Time-based statistics types are features that identify the card usage profile with respect to amounts against time ranges or usage frequencies against time ranges. Most of the literature focuses on the cardholder profile rather than the card profile. An individual can obviously use multiple credit cards for different purposes and can thus exhibit different spending patterns on those cards. In this study, we focus on the card profile rather than on the cardholder alone, since a single credit card exhibits one distinct spending profile while a cardholder can exhibit various behavior patterns on different cards. Ref. [24] used a total of 30 features in their study and 27 features were used in [23], while [13] used 16 relevant features out of 20 variables in their study.

B. Credit Card Fraud
Existing CCFs can be grouped into two types: external fraud and inner card fraud [7], [18]. A wider classification into three categories has been presented by [10], namely card-related frauds (application, account takeover, fake, counterfeit, identity theft and stolen cards), Internet frauds (credit card generators, site cloning and fake sites) and merchant-related frauds (merchant conspiracy and triangulation) [10]. It was reported in [25] that the amount lost to CCF every year is worth US$500 million.
Credit card transaction data have some unusual characteristics. Genuine and illicit transactions tend to share a similar profile. Fraudsters keep devising new tricks to mimic the spending pattern of the genuine (legitimate) cardholder. Hence, the profiles of legitimate and fraudulent activities change continuously. As a result, the number of true fraudulent cases identified in a pool of credit card transaction data is small, and the data are highly skewed toward the negative class (legitimate transactions). The credit card data investigated by [23] contain 0.025% positive cases, while in [16] the share is below 0.005%.
Ref. [10] examined the performance of NB, k-nearest neighbor (KNN) and logistic regression for fraud detection on highly skewed credit card data. A dataset of European cardholders containing 284,807 transactions was used for the experiments. They employed a hybrid technique of under-sampling and over-sampling to pre-process the skewed data. Their results showed that KNN performed better than the other approaches; the accuracies obtained for the NB, KNN and logistic regression classifiers were 97.92%, 97.69% and 54.86%, respectively.
Ref. [26] proposed an enhancement of Fraud Miner with the LINGO clustering algorithm. The improvement replaces the Apriori algorithm employed in Fraud Miner with the frequent pattern creation of the LINGO clustering algorithm, and summarizes a customer's profile from either his genuine or his fraudulent transactions. The results on their simulated test transactions showed that LINGO produced better summarized patterns than the Apriori algorithm.
Ref. [27] employed a hybrid of AdaBoost and majority voting approaches to evaluate the efficacy of the model. A publicly available credit card dataset and a real-world credit card dataset obtained from a financial institution were analyzed. The best MCC score of 0.823 was achieved with the majority voting approach.
Ref. [7] examined the performance of two types of random forest models on a real-life B2C dataset of credit card transactions. The models are random-tree-based random forest (I) and CART-based random forest (II). Their results showed that Random Forest I yielded 91.96% accuracy while Random Forest II yielded 96.77% accuracy.
Ref. [25] proposed a hypothetical framework for a novel data mining algorithm, known as the condensed nearest neighbor (CNN) algorithm, to detect CCF. CNN is a nonparametric classification technique that aims to form a condensed set retaining the samples that are vital for decision making. Its target is to minimize the number of attributes for comparison, thereby creating a condensed training set with improved query time and memory requirements.
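To make the idea of a condensed training set concrete, the following is a minimal sketch of the classic condensed nearest neighbor rule (usually attributed to Hart), in which a sample is kept only if a 1-nearest-neighbor classifier built on the samples kept so far would misclassify it. It is an illustration under stated assumptions, not the framework of [25]: the function names `nearest_label` and `condense`, the Euclidean distance and the toy data are all hypothetical choices.

```python
import math
import random
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], int]  # (feature vector, class label)


def nearest_label(x: Sequence[float], store: List[Sample]) -> int:
    """Label of the stored sample closest to x (Euclidean distance)."""
    return min(store, key=lambda s: math.dist(x, s[0]))[1]


def condense(samples: List[Sample], seed: int = 0) -> List[Sample]:
    """Keep only the samples that 1-NN on the current store gets wrong."""
    rng = random.Random(seed)
    order = list(range(len(samples)))
    rng.shuffle(order)

    kept = [order[0]]          # start the store with one arbitrary sample
    changed = True
    while changed:             # repeat passes until a full pass adds nothing
        changed = False
        for i in order:
            if i in kept:
                continue
            x, y = samples[i]
            store = [samples[k] for k in kept]
            if nearest_label(x, store) != y:
                kept.append(i)  # misclassified, so it is needed for decisions
                changed = True
    return [samples[k] for k in kept]


# Made-up two-class data (0 = genuine, 1 = fraud), for illustration only.
data = [((0.0, 0.0), 0), ((0.1, 0.2), 0), ((0.2, 0.1), 0), ((0.3, 0.3), 0),
        ((5.0, 5.0), 1), ((5.1, 4.9), 1), ((4.8, 5.2), 1)]
condensed = condense(data)
```

On well-separated data like this, the condensed set is much smaller than the original, yet a 1-NN classifier on it still labels every original sample correctly, which is where the reduced query and memory cost comes from.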
III. RESEARCH METHODOLOGY
This research work is presented in three phases. First, the dataset was collected from the UCI repository. Second, data pre-processing was carried out to remove redundancy in the dataset. The third phase applies three machine learning techniques to CCF detection. The goal of this work is to improve the ability of the machine learning model to distinguish CCF by using GA as a feature selection method. NB, RF and SVM are used to evaluate the performance of the GA on the German credit dataset. Figure 1 presents the framework of the proposed system and summarizes the approach used in this work.

[Flowchart: German credit card dataset -> Data preprocessing -> Perform feature selection with GA -> GA-fitted features selected into two priorities -> Training set: train NB, RF and SVM models; Test set: predict the class label of the set -> Performance evaluation]
Figure 1: Framework of the proposed credit card fraud detection model

A. German Dataset
The German credit dataset was used in this study to classify transactions as genuine or fraudulent. The available dataset from the UCI repository is very partial [28]. The dataset comprises 1000 instances (credit applicants) and 21 features (seven numerical and 14 categorical/nominal). Each instance describes the credit standing of a specific applicant, either good or bad; each entry in the dataset represents an individual who takes credit from a bank. Every individual is classified as a good or bad credit risk with respect to a set of features.

B. Feature Selection
Feature selection (FS) reduces the redundant features in the dataset, thereby improving learning performance [25]. FS minimizes the impact of irrelevant features on CCF detection and enhances the fraud detection rate.

C. Genetic Algorithm
GA is a very popular technique in evolutionary computation research. It imitates natural selection [26], [27] and is applied widely in business, engineering and many other domains. The aim of the approach is to obtain an optimal solution to a problem [28], [29]. GA consists of three essential operators: selection, crossover and mutation. Selection identifies the fittest individuals in the available population based on a fitness function [33]. Crossover merges the second half of the first (original) record with the first half of the second record. Mutation randomly flips bits, exchanging 0s for 1s and vice versa.
The steps involved in GA are as follows:
Step 1: Create a random population of n chromosomes, each representing a different candidate solution to the problem.
Step 2: Evaluate the fitness of each chromosome x.
Step 3: Generate a new population by repeating the following until the new population is complete.
a. Selection: Select two parent chromosomes according to their fitness values (the higher the fitness, the greater the chance of being selected).
b. Crossover: Cross over the parents to generate new offspring. The crossover can be one-point or multi-point.
c. Mutation: Flip a few bits of the new offspring at random.
d. Acceptance: Place the new offspring in the new population.
Step 4: Replacement: use the newly generated population for the next run of the algorithm.
Step 5: Test: if the termination condition is met, stop and return the best solution in the current population.
Step 6: Loop: go to Step 2.
The GA parameters were configured as follows.
Population size: 20
Number of generations: 20
Probability of crossover: 0.6
Probability of mutation: 0.033
Report frequency: 20
Random number seed: 1
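The steps and parameter settings above can be sketched as a small generational GA over bit-mask chromosomes, where bit i set to 1 means feature i is kept. This is only an illustrative sketch. The text does not state how a feature subset is scored, so the fitness function is left to the caller and `toy_fitness` below is hypothetical; roulette-wheel selection, one-point crossover and the function name `ga_feature_selection` are likewise assumptions made for the example.

```python
import random
from typing import Callable, List

Chromosome = List[int]  # one bit per feature: 1 = keep the feature, 0 = drop it


def ga_feature_selection(
    n_features: int,
    fitness: Callable[[Chromosome], float],
    pop_size: int = 20,
    generations: int = 20,
    p_crossover: float = 0.6,
    p_mutation: float = 0.033,
    seed: int = 1,
) -> Chromosome:
    """Return the fittest feature mask found by a simple generational GA."""
    rng = random.Random(seed)

    # Step 1: random initial population of bit strings.
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)

    for _ in range(generations):
        # Step 2: evaluate fitness; shift scores so selection weights are positive.
        scores = [fitness(c) for c in population]
        low = min(scores)
        weights = [s - low + 1e-9 for s in scores]

        # Step 3: build the new population.
        new_population: List[Chromosome] = []
        while len(new_population) < pop_size:
            # Selection: fitness-proportionate choice of two parents.
            mum, dad = rng.choices(population, weights=weights, k=2)
            child_a, child_b = mum[:], dad[:]
            # Crossover: one-point, applied with probability p_crossover.
            if n_features > 1 and rng.random() < p_crossover:
                cut = rng.randint(1, n_features - 1)
                child_a = mum[:cut] + dad[cut:]
                child_b = dad[:cut] + mum[cut:]
            # Mutation: flip each bit independently with probability p_mutation.
            for child in (child_a, child_b):
                for i in range(n_features):
                    if rng.random() < p_mutation:
                        child[i] = 1 - child[i]
                new_population.append(child)  # acceptance

        # Step 4: replacement; also remember the best chromosome seen so far.
        population = new_population[:pop_size]
        best = max(population + [best], key=fitness)

    # Step 5: terminate after a fixed number of generations.
    return best


# Hypothetical fitness: reward features 0, 2, 3 and 6, penalise all others.
INFORMATIVE = {0, 2, 3, 6}


def toy_fitness(mask: Chromosome) -> float:
    hits = sum(1 for i, bit in enumerate(mask) if bit and i in INFORMATIVE)
    noise = sum(1 for i, bit in enumerate(mask) if bit and i not in INFORMATIVE)
    return hits - 0.5 * noise


best_mask = ga_feature_selection(n_features=20, fitness=toy_fitness)
selected = [i for i, bit in enumerate(best_mask) if bit]
```

In practice the fitness would come from evaluating the candidate subset on the training data, for example the cross-validated accuracy of a classifier restricted to the selected columns. Keeping the best chromosome across generations is a design choice of this sketch, so that a good subset is not lost to later mutation.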
D. Naïve Bayes
Naïve Bayes (NB) classification is built on the posterior probability of Bayes' theorem. The NB model works excellently, especially when the predictors are independent. Nevertheless, it sometimes performs well even when predictors have
no distinctive independent class [34]. The NB technique classifies data in two stages. The first is the learning stage, in which the training dataset is fed to the model and the parameters of a probability distribution are computed, under the assumption that the predictors are conditionally independent. The second is the prediction stage, in which unseen data (the test dataset) are supplied to the NB classifier, which evaluates the posterior probability of each class. The test data are then classified according to the highest posterior probability.
NB is a good method for huge datasets because no complex iterative parameter estimation is involved [35]. It performs efficiently in both the training and classification stages [36]. It assumes that all features in the feature vector are independent and equally important [33].

... were considered and selected. Table 1 and Table 2 present information about the features selected.

The priority features are considered the most important features in the dataset.

Table 1: First Priority Features
Feature Number   Feature                                Type
1                Status of existing checking account    Nom
20               Foreign worker                         Nom
Where Nom means Nominal and Num means Numerical.

The GA was also used to select another set of eight (8) features in the second stage. The selected attributes are shown in Table 2.

Table 2: Second Priority Features
Feature Number   Feature                 Type
4                Purpose of the credit   Nom
5                Credit amount           Num

Precision (hit rate): gives the accuracy on the cases classified as positive.

\[ \text{Precision} = \frac{TP}{TP + FP} \tag{1} \]

Recall (sensitivity or fraud detection rate): gives the accuracy of classification of the positive (fraud) cases.

\[ \text{Recall (Sensitivity)} = \frac{TP}{TP + FN} \tag{2} \]

Here TP, FP and FN denote the numbers of true positives, false positives and false negatives, respectively.

Specificity: gives the accuracy of classification of the non-fraud cases.
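As a worked illustration of these measures, the sketch below computes precision, recall and specificity from confusion-matrix counts, with fraud as the positive class. The specificity formula TN / (TN + FP) is the standard definition and is assumed here because its equation is not reproduced in the text; the function name `fraud_metrics` and the labels are made up for the example.

```python
from typing import Dict, Sequence


def fraud_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Confusion-matrix metrics with 1 = fraud (positive), 0 = genuine (negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # fraud flagged as fraud
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # genuine passed as genuine
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # genuine flagged as fraud
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # fraud missed

    def ratio(num: int, den: int) -> float:
        # Return 0.0 instead of dividing by zero when a class is absent.
        return num / den if den else 0.0

    return {
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "accuracy": ratio(tp + tn, len(pairs)),
    }


# Made-up labels: 3 fraud cases among 10 transactions.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
m = fraud_metrics(y_true, y_pred)
```

With these labels there are 2 true positives, 1 false negative, 1 false positive and 6 true negatives, so precision and recall are both 2/3 while accuracy is 0.8. The gap shows why accuracy alone can look flattering on imbalanced fraud data.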
The number of fraud cases is usually small compared with the total number of transactions. Since the dataset is imbalanced, the performance evaluation of the experiments was carried out using ten-fold cross-validation.

B. Experimental Results Analysis on the GA fitted Second Priority Features.
We performed experimental analysis on the second priority features in Table 2 using the NB, RF and SVM classifiers. The classifiers learn from the training set and are then applied to the test set to classify the data into fraudulent and genuine transactions. The results are presented in Table 4.

[Bar chart: "Classification Accuracy for the two Priorities", comparing 1st Priority and 2nd Priority for NB, RF and SVM; accuracy axis from 0 to 120]

Figure 2 compares the two approaches based on accuracy and reveals that the first priority features performed better than the second priority features. Also, Figure 3 presents a comparison based on
specificity, sensitivity and precision. Therefore, the results revealed that the first priority features fitted by the GA are the essential features in the imbalanced credit card fraud data.

[7] S. Xuan, G. Liu, Z. Li, L. Zheng, S. Wang, and C. Jiang, “Random Forest for Credit Card Fraud Detection,” in IEEE 15th International Conference on Networking, Sensing and Control (ICNSC), 2018, pp. 1–6.
[8] S. Patil, V. Nemade, and P. Soni, “Predictive Modelling For Credit Card Fraud Detection Using Data Analytics,” Procedia Comput. Sci., vol. 132, pp. 385–395, 2018, doi: 10.1016/j.procs.2018.05.199.
[9] Y. Lucas et al., “Towards automated feature engineering for credit card fraud detection using multi-perspective HMMs,” Futur. Gener. Comput. Syst., vol. 102, pp. 393–402, 2020.
[20] S. Maes, K. Tuyls, B. Vanschoenwinkel, and B. Manderick, “Credit card fraud detection using Bayesian and neural networks,” in 1st International NAISO Congress on Neuro Fuzzy Technologies, 2002, pp. 261–270.
[21] A. Shen, R. Tong, and Y. Deng, “Application of classification models on credit card fraud detection,” in 2007 International Conference on Service Systems and Service Management, IEEE, 2007, pp. 1–4.
[22] S. Bhattacharyya, S. Jha, K. Tharakunnel, and J. C. Westland, “Data mining for credit card fraud: A comparative study,” Decis. Support Syst., vol. 50, no. 3, pp. 602–613, 2011.
[23] A. C. Bahnsen, A. Stojanovic, D. Aouada, and B. Ottersten, “Cost sensitive credit card fraud detection using Bayes minimum risk,” in 12th International Conference on Machine Learning and Applications (ICMLA), 2013, pp. 333–338.
[24] S. Stolfo, D. W. Fan, W. Lee, A. Prodromidis, and P. Chan, “Credit card fraud detection using meta-learning: Issues and initial results,” in AAAI-97 Workshop on Fraud Detection and Risk Management, 1997.
[25] P. R. Vardhani, Y. I. Priyadarshini, and Y. Narasimhulu, CNN Data Mining Algorithm for Detecting Credit Card Fraud. Singapore: Springer Singapore, 2019.
[34] M. A. Hambali, Y. K. Saheed, T. O. Oladele, and M. D. Gbolagade, “Adaboost Ensemble Algorithms for Breast Cancer,” J. Adv. Comput. Res., vol. 10, no. 2, pp. 31–52, 2019.
[35] K. Suresh and R. Dillibabu, “Designing a Machine Learning Based Software Risk Assessment Model Using Naïve Bayes Algorithm,” TAGA J., vol. 14, pp. 3141–3147, 2018.
[36] I. D. Dinov, Probabilistic Learning: Classification Using Naive Bayes. 2018.
[37] L. Li et al., “A robust hybrid between genetic algorithm and support vector machine for extracting an optimal feature gene subset,” Genomics, vol. 85, no. 1, pp. 16–23, 2005.