

Comparative Analysis of Machine Learning Algorithms through Credit Card Fraud Detection

Rishi Banerjee, East Brunswick High School, Rutgers School of Engineering, Piscataway, New Jersey, rishi.banerjee@comcast.net
Gabriela Bourla, Eastern Regional High School, Rutgers School of Engineering, Piscataway, New Jersey, gabbie.bourla@gmail.com
Steven Chen, Bridgewater-Raritan High School, Rutgers School of Engineering, Piscataway, New Jersey, stevenc1822@gmail.com
Mehal Kashyap, Edison High School, Rutgers School of Engineering, Piscataway, New Jersey, mehalkashyap@gmail.com
Sonia Purohit, The Academy for Health & Medical Sciences, Rutgers School of Engineering, Piscataway, New Jersey, sonia purohit@outlook.com

Abstract—With the increase of e-commerce and online transactions throughout the twenty-first century, credit card fraud is a serious and growing problem. Such malicious practices can affect millions of people across the world through identity theft and loss of money. Data science has emerged as a means of identifying fraudulent behavior. Contemporary methods rely on applying data mining techniques to skewed datasets with confidential attributes. This paper examines numerous classification models trained on a public dataset to analyze the correlation of certain attributes with fraudulence. This paper also proposes better metrics for determining false negatives and measures the effectiveness of random sampling in diminishing the imbalance of the dataset. Finally, this paper explains the best algorithms to utilize on datasets with high class imbalance. It was determined that the Support Vector Machine algorithm had the highest performance for detecting credit card fraud under realistic conditions.

I. INTRODUCTION

Credit-card fraud is a general term for the unauthorized use of funds in a transaction, typically by means of a credit or debit card [1]. Incidents of fraud have increased significantly in recent years with the rising popularity of online shopping and e-commerce. Currently, credit-card companies attempt to predict the legitimacy of a purchase by analyzing anomalies in various fields such as purchase location, transaction amount, and user purchase history. However, with the recent increases in cases of credit card fraud, it is crucial for credit card companies to optimize their algorithmic solutions [2] [3]. This paper compares various deep learning and regression algorithmic models to explore which algorithm and attributes provide the most accurate method of identifying credit-card fraud.

II. PROCEDURE: CREATING AND TRAINING CLASSIFICATION MODELS

The process for finding usable data in this paper follows the typical Data Analysis Pipeline, as seen in Figure 1.

Figure 1. Data Analysis Pipeline [3]

A. Algorithms

The algorithms chosen for analysis were K Nearest Neighbor (KNN), Support Vector Machine (SVM), Logistic Regression, Random Forest, Naive Bayes, and Multilayer Perceptron. These algorithms were chosen specifically because they act as binary classifiers, which makes them suitable for credit card fraud detection, as transactions in our experiment are classified as one of two states: fraudulent or non-fraudulent (normal).

B. Dataset Acquisition

The dataset utilized was part of a 2009 competition held in coordination with the University of California, San Diego and the Fair Isaac Corporation. It contains 94,682 data points with sixteen known fields, including the dollar amount of the purchase, the hour of the transaction, the location of the transaction based upon zip code, and thirteen other confidential fields with encrypted names such as field1, flag1, and indicator1.

C. Preprocessing

Due to the high normal-to-fraudulent ratio of the dataset, seen in Figure 2, predictions made from the initial training set, which had a normal-to-fraudulent ratio of 49:1, were greatly skewed. Many of the utilized algorithms classified the test data with ninety-eight percent accuracy by predicting every transaction as normal, producing only true negative and false negative cases.

Figure 2. Count of Fraudulent to Non-Fraudulent Datapoints

In order to resolve this issue, the data was processed with a lower normal-to-fraudulent ratio. This was done by first dividing the initial dataset into two separate data sets based on fraudulence. The 2,094 data values from the fraudulent dataset were split evenly between a training set and a testing set. Normal data was then randomly split into the training and testing sets. This action was an example of random undersampling, or the removal of negatives in order to increase the significance of fraud data [4]. This method of data processing provided greater flexibility in manipulating the ratio of normal-to-fraudulent data within each data set simply by increasing or decreasing the amount of normal data that was added to each dataset. Splitting the data into both a training and a testing set also helped address the issue of overfitting, a condition in which an algorithm can only properly function on a particular data set. Ultimately, eight unique pairs of testing and training datasets were formed with differing ratios. Although this processing method greatly diminished the number of training cases from the original 63,122 cases, the predictions made were less skewed, as the algorithms were able to make true positive predictions based on the increased percentage of fraudulent cases within the dataset.
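A minimal sketch of the random undersampling step described above, assuming Python with pandas and scikit-learn and the placeholder column name "fraud_label" from the earlier sketch; the exact ratios used for the eight dataset pairs are not listed in the paper, so the ratio argument is left to the caller.

import pandas as pd
from sklearn.model_selection import train_test_split

def build_undersampled_split(data, ratio, label_col="fraud_label", seed=0):
    # Split fraud cases evenly into train/test, then add randomly sampled
    # normal cases so each set has roughly `ratio` normals per fraud case.
    fraud = data[data[label_col] == 1]
    normal = data[data[label_col] == 0]

    fraud_train, fraud_test = train_test_split(fraud, test_size=0.5, random_state=seed)

    # Randomly undersample the normal class to the requested ratio.
    normal_train = normal.sample(n=len(fraud_train) * ratio, random_state=seed)
    normal_test = normal.drop(normal_train.index).sample(n=len(fraud_test) * ratio, random_state=seed)

    train = pd.concat([fraud_train, normal_train]).sample(frac=1, random_state=seed)
    test = pd.concat([fraud_test, normal_test]).sample(frac=1, random_state=seed)
    return train, test

Calling this helper once per chosen ratio would reproduce, under these assumptions, the eight training/testing pairs described above.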
D. Evaluation Metrics

Because of the large imbalance between normal and fraudulent data points, accuracy, or the percentage of correctly predicted data points out of the total dataset, was not a viable metric on which to base results. As a result, the metrics of precision, recall, and F-1 score were used. Precision is the ratio of the number of true positives to all data points predicted as positive. It can be seen as a measure of the quality of the data returned as positive. The equation for precision can be seen in Equation 5.

precision = \frac{\text{true positives}}{\text{true positives} + \text{false positives}} \quad (5)

Recall is the ratio of correct positive predictions to all actual positive entries. The recall assesses the completeness of the program, checking how many true positives were detected as positive. The equation for recall can be seen in Equation 6.

recall = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}} \quad (6)

The Fβ score is the weighted harmonic mean of precision and recall, reaching its optimal value at 1 and its worst value at 0. The beta parameter determines the weight of precision in the combined score: β < 1 lends more weight to precision, while β > 1 favors recall. Because both precision and recall are equally important to the accuracy of the model, the F-1 score, or the Fβ score with β set to 1, was used. The Fβ score can be represented by Equation 7 [5].

F_\beta = (1 + \beta^2) \cdot \frac{\text{precision} \cdot \text{recall}}{\beta^2 \cdot \text{precision} + \text{recall}} \quad (7)
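As an illustration of how these metrics, and the confusion matrix used in the next subsection, might be computed with scikit-learn (an assumption, since the paper names no library); y_test and y_pred are placeholder label arrays.

from sklearn.metrics import precision_score, recall_score, fbeta_score, confusion_matrix

# Placeholder labels: 1 = fraudulent, 0 = normal.
y_test = [0, 0, 1, 1, 0, 1, 0, 0]
y_pred = [0, 0, 1, 0, 0, 1, 1, 0]

precision = precision_score(y_test, y_pred)      # Equation 5
recall = recall_score(y_test, y_pred)            # Equation 6
f1 = fbeta_score(y_test, y_pred, beta=1.0)       # Equation 7 with beta = 1

# Confusion matrix counts of true negatives, false positives,
# false negatives, and true positives (scikit-learn's row/column order).
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()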
E. Model Training

The models were trained and tested with two CSV files containing identical attributes. For all algorithms except the Random Forest Classifier and the Multilayer Perceptron, the precision, recall, and Fβ score were calculated for each field. The counts of true positives, true negatives, false positives, and false negatives were represented by a confusion matrix. To assess the effect of undersampling, numerous datasets were created with different normal-to-fraudulent transaction ratios, which were then trained and tested on multiple fields, both at the same ratio as the training set and at the 49:1 ratio of the initial dataset.

III. RESULTS AND ANALYSIS

A. Finding Feature Importance

The data was first run through the Random Forest Classifier with both the 49:1 training and testing set. The importance metric created as a result is shown below in Table I. Based on the feature importance metric, the most significant fields were seen to be field1, field3, and hour1.

Table I
Feature Importance Measures for Random Forest

Attribute(s)   Importance
field3         0.34
hour1          0.16
field1         0.12
field4         0.10
amount         0.06
field5         0.06
flag5          0.04
field2         0.03
flag1          0.02
indicator1     0.02
flag2          0.02
flag3          0.02
flag4          0.01
indicator2     0.00
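A minimal sketch, under the same scikit-learn assumption, of how a feature importance ranking like Table I might be produced; `train` stands for a 49:1 training set such as one returned by the undersampling helper sketched earlier.

from sklearn.ensemble import RandomForestClassifier

# `train` is a 49:1 training DataFrame with the assumed "fraud_label" column.
X_train = train.drop(columns=["fraud_label"])
y_train = train["fraud_label"]

forest = RandomForestClassifier(random_state=0)
forest.fit(X_train, y_train)

# Pair each attribute with its importance and sort descending, mirroring Table I.
for name, score in sorted(zip(X_train.columns, forest.feature_importances_),
                          key=lambda item: item[1], reverse=True):
    print(f"{name}\t{score:.2f}")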

B. Bulk Implementation of Classifiers

The datasets created were trained on all attributes and on the combinations of attributes found important from the Random Forest selection. The F-1 score was then calculated from each result after testing on datasets that had the same ratio of normal-to-fraudulent data as the datasets on which the models were trained, as seen in Figure 3 below.

Figure 3. F-1 scores of Algorithms applied to testing datasets with controlled Normal-to-Fraudulent Transaction Ratios

Figure 3 demonstrates the effectiveness of the various algorithms, measured by F-1 score, as the normal-to-fraudulent ratio increases. When graphed, the performance of these algorithms decreases roughly exponentially as the ratio increases. The only exception is the Support Vector Machine.

Next, the algorithms trained on the datasets created by undersampling were tested on a dataset of 31,560 datapoints with a normal-to-fraudulent ratio of 49:1. The F-1 scores versus the normal-to-fraudulent ratio are shown in Figure 4 below. It can be seen that for datasets with a high normal-to-fraudulent ratio, the Support Vector Machine algorithm produced the highest F-1 scores, while for datasets with a low normal-to-fraudulent ratio, the Random Forest algorithm produced the highest F-1 score.

Figure 4. F-1 scores of Algorithms applied to testing datasets with uncontrolled Normal-to-Fraudulent Transaction Ratios

In systems where a large dataset is tested, many algorithms have an optimal normal-to-fraudulent ratio at which the predictions for the testing set will produce the highest F-1 score. This trend can be seen in the Random Forest, KNN, and Naive Bayes results, where a peak on the graph is visible. If a model is trained on a dataset with a normal-to-fraudulent ratio close to 1:1, its predictions will tend to maintain that same ratio of positives to negatives. Testing such a model on a highly imbalanced dataset therefore yields a high number of false positives, very low precision, and very high recall. As the imbalance of the training set increases, the precision increases while the recall decreases. As a result, the normal-to-fraudulent ratio reaches its optimal value where precision and recall approach each other.
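A minimal sketch of the kind of ratio sweep described above, reusing the placeholder helper and classifier dictionary from the earlier sketches; the ratio list is illustrative, since the paper does not enumerate all eight ratios.

from sklearn.metrics import f1_score

# Illustrative ratios only; the paper states eight ratio pairs were built.
ratios = [1, 5, 10, 20, 30, 49]

scores = {}
for ratio in ratios:
    train, test = build_undersampled_split(data, ratio)   # helper from the preprocessing sketch
    X_tr, y_tr = train.drop(columns=["fraud_label"]), train["fraud_label"]
    X_te, y_te = test.drop(columns=["fraud_label"]), test["fraud_label"]
    for name, clf in classifiers.items():                  # dictionary from the dataset sketch
        clf.fit(X_tr, y_tr)
        scores[(name, ratio)] = f1_score(y_te, clf.predict(X_te))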

C. Performance Evaluation of Most Successful Algorithms

1) Random Forest Classifier: For the data tested on sets of the same ratios, this classifier experienced the smallest decrease in F-1 score. While the precision of the Random Forest model experienced only a slight drop, the recall experienced large decreases as the normal-to-fraudulent ratio increased. When testing on the large dataset, the Random Forest algorithm reached its optimal normal-to-fraudulent ratio at around 30:1. As seen in Figure 5, the Random Forest algorithm was indicative of the pattern observed in Figure 4: as the F-1 score reaches its peak, the precision and recall lines intersect at the same ratio. As a result, for the Random Forest algorithm, it would be necessary to find the optimal ratio on which to train in order to use random subsampling to reduce bias. This pattern could be seen among the clustering and probabilistic models, as these models do not experience the dropoff seen in the regression-based models. The Random Forest is the most efficient algorithm for testing on biased datasets at the optimal ratio because, unlike the SVM, the Random Forest algorithm does not require much processing power and time to train efficiently.

Figure 5. F-1 score, Precision, and Recall in the Random Forest Classifier
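A sketch, under the same placeholder assumptions as above, of tracing precision and recall against the training ratio for the Random Forest, the quantities plotted in Figure 5; X_test_full and y_test_full stand for the large, unadjusted 49:1 testing set.

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score

for ratio in ratios:
    train, _ = build_undersampled_split(data, ratio)
    rf = RandomForestClassifier(random_state=0)
    rf.fit(train.drop(columns=["fraud_label"]), train["fraud_label"])
    y_hat = rf.predict(X_test_full)
    # Precision rises and recall falls as the training imbalance grows;
    # the crossing point corresponds to the peak F-1 score discussed above.
    print(ratio, precision_score(y_test_full, y_hat), recall_score(y_test_full, y_hat))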

2) Support Vector Machine: Results of the Support Vector Machine algorithm tested and trained with adjusted datasets of approximately equal normal-to-fraudulent ratios indicated an average F-1 score of 79.96%, regardless of the combination of factors. Unlike most models presented in this paper, the F-1 score did not decrease as the normal-to-fraudulent ratio decreased; it stayed constant at approximately 84%, as seen in Figure 4.

Results of the Support Vector Machine algorithm varied when it was trained with an adjusted dataset but tested with the unadjusted and highly skewed data set. F-1 scores averaged 94.07% for the SVM across all combinations of factors, which is significantly higher than the results for the five other algorithms. The average F-scores of the tests under different ratio combinations but the same factor combination were averaged to determine the ideal combination of factors for analysis. The average F-1 scores for the Support Vector Machine analyzing the "hour1", "field3", and "hour1 and field3" factor combinations were calculated to be 94.51%, 94.4%, and 93.27%, respectively.

In contrast to the other algorithms, the SVM took much more time and computing power to complete the fitting of the model. Compared to the Random Forest Classifier, the model takes much more time to execute. As a result, when processing real-time data such as credit card transactions on datasets much larger than the UCSD-FICO set, the SVM model would have to be made more efficient in order to process and classify fraud data in a reasonable amount of time after the transaction.
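A minimal sketch, again assuming scikit-learn and the placeholder data from the earlier sketches, of evaluating an SVM restricted to the hour1 and field3 attributes against the unadjusted 49:1 test set; X_test_49 and y_test_49 stand for that skewed testing set.

from sklearn.svm import SVC
from sklearn.metrics import f1_score

# Train on a balanced, undersampled set but test on the skewed 49:1 set,
# mirroring the "realistic conditions" evaluation described above.
train, _ = build_undersampled_split(data, ratio=1)
features = ["hour1", "field3"]        # factor combination highlighted in the paper

svm = SVC()
svm.fit(train[features], train["fraud_label"])

print("F-1:", f1_score(y_test_49, svm.predict(X_test_49[features])))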

IV. CONCLUSION

Overall, the main project goal involved determining the optimal algorithm for analysis as well as the best-performing combination of factors for detecting credit-card fraud. Based on the results of Figure 3, it can be concluded that the best algorithm for analysis of datasets close to a 1:1 normal-to-fraudulent ratio is the Random Forest Classifier, assuming the normal-to-fraudulent distribution of the testing and training sets is the same. However, the presence of a balanced training dataset, as well as testing and training datasets of the same distribution, is an unrealistic expectation. Therefore, the optimal machine learning algorithm for a credit card company depends on the F-1 scores of algorithms tested with highly skewed datasets. According to Figure 4, the Support Vector Machine was the most successful in the detection of credit card fraud when tested under these more realistic conditions. The F-scores of all algorithms under multiple combinations of factors were analyzed as described earlier in this paper, and it was determined that the ideal condition for analysis is the hour1 field. Based on this research, a credit-card company should consider implementing a Support Vector Machine algorithm that analyzes the purchase time in order to most accurately detect whether a credit-card transaction is fraudulent.

A. Future Work

This research on detecting credit card fraud has great potential for future implications. If a dataset with unencrypted fields were released to the public, the true factors that can be traced for credit card fraud detection could be known. Credit card companies could then be informed about the most important factors to analyze when predicting credit card fraud and improve the efficiency of their notification systems. Furthermore, the results of this project were limited by the small sample size of fraudulent cases provided by the data set. By using a larger dataset with a greater number of fraudulent cases, the algorithms could be trained to make predictions of greater precision. In order to pursue these goals, more computing power may be required. It may be important to consider using a Graphical Processing Unit such as the Nvidia Jetson II to improve the productivity of training and testing each algorithm with a larger, more complex data set. Other methods for bias prevention, such as other resampling techniques, cost-sensitive learning methods, and ensemble learning methods, could also be tested on future datasets to discover the best method of dealing with skewed data sets. Ultimately, the results of this research project can provide insight into the best algorithm to use in other cases of data analysis on skewed data sets, such as natural disaster prediction.

V. ACKNOWLEDGMENTS

The authors of this paper gratefully acknowledge the following: project mentor Jacob Battipaglia for his valuable knowledge of machine learning and data science; Residential Teaching Assistant Siddhi Shah, head counselor Nicholas Ferraro, and research coordinator Brian Lai for their invaluable guidance and support; Dean Ilene Rosen, the Director of GSET, and Dean Jean Patrick Antoine, the Associate Director of GSET, for their management and guidance; Rutgers University, Rutgers School of Engineering, and the State of New Jersey for the chance to advance knowledge, explore engineering, and open up new opportunities; Lockheed Martin, Silverline, Rubik's, and other corporate sponsors; and lastly, NJ GSET Alumni, for their continued participation and support.

REFERENCES

[1] What is credit card fraud? Definition and meaning. [Online].
[2] J. Steele and J. Gonzalez, "Credit card fraud and ID theft statistics," CreditCards.com. [Online].
[3] "Big Data in Science: Which Business Model is Suitable?" ADC Review, 11-Sep-2015. [Online].
[4] R. Alencar, "Resampling strategies for imbalanced datasets," Kaggle. [Online].
[5] C. Goutte and E. Gaussier, "A Probabilistic Interpretation of Precision, Recall and F-Score, with Implication for Evaluation," Lecture Notes in Computer Science: Advances in Information Retrieval, pp. 345-359, 2005.

