Download as pdf or txt
Download as pdf or txt
You are on page 1of 7

A HYBRID APPROACH FOR OPTIMIZED FRAUDULENT CREDIT CARD TRANSACTION

DETECTION USING MACHINE LEARNING

Sai Deep I1, Alapati Naga Praneeth1, Kowshik PJ1, Surya Kiran Bellamkonda1,Dr Narasimayya2

1 B.Tech CSE - AI, Department of computer science and technology Faculty of Engineering and Technology
JAIN UNIVERSITY, Karnataka
2 Associate Professor, Department of CSE-AI, Faculty of Engineering and Technology JAIN UNIVERSITY,
Karnataka

Email

ABSTRACT
The target of this study is to obtain a method of intruders detection system that offers an alternative support
for the protection of the most commonly used financial transaction model. There is a large range of credit card
transactions happening within the actual world. but there is additionally the 0.33-birthday celebration person
who monitors our activities and puts people in problem. This changed into going on not best now however
additionally approximately 20 years in the past, and with the help of the brand new technology algorithms,
this fraud hobby has been multiplied hastily in certain areas consisting of online shopping, advertising and
marketing, and so forth. So, to locate those styles of fraud pastime, our research work specializes in figuring
out the share of the fraudulent inside the given statistics set. in recent times, technology had been improvised
and frauds had been growing unexpectedly. inside the banking zone, fraudulent sports in credit score-card
were improved. In our research paintings, the main process is to make correct predictions while balanced
statistics is fed. With the use of gadget studying [ML], this research examine analyses and summarizes the
frauds in credit card transactions. two library documents inclusive of PyCaret and SMOTE had been used for
records balancing. five getting to know algorithms accomplished the detection of fraudulent hobby, among
them random wooded area algorithm which feeds on a balanced dataset of SMOTE offers a very good estimate
of the generalization mistakes and to be resistant to overfitting.

INTRODUCTION Machine learning, on the other hand, has been


used to develop several systems for detecting
Over the past decade, e-commerce has increased credit card fraud. However, credit card fraud
and credit card usage has increased significantly. detection remains a learning challenge due to class
The increased use of credit cards has led to a imbalance in the dataset. Class imbalance is not the
steady increase in fraudulent transactions. only problem that prevents credit card fraud
Fraudulent credit card transactions have had a detection, but it is the most significant one. Class
serious impact on the financial industry. According imbalance is a problem that arises in some real-
to a recent report, about $27.85 billion was lost to world ML applications where the dataset has an
credit card fraud in 2018, up 16.2% from $23.97 uneven class distribution. For example, samples of
billion in 2017, to $35 billion by 2023. estimated to one class (majority class) are higher than samples
reach. These losses can be reduced through of another class (minority class).
effective fraud monitoring and prevention.
Most credit card transaction records are
imbalanced, with the number of legitimate
transactions far outweighing the number of tasks. The J48 algorithm is used to create decision
fraudulent transactions. Most traditional ML trees and is used for classification problems. J48 is
algorithms perform well when trained on balanced an extension of ID3 (Iterative Dichotomous
data Skewed class distributions skew the Parser). J48 is one of the most widely used and
performance of traditional ML algorithms against heavily analyzed areas of machine learning. This
the majority class, because the algorithms are algorithm works primarily with fixed and
designed to consider error rates rather than class categorical variables. Adaboost is one of the most
distributions. Therefore, more examples of the popular machine learning algorithms and is
minority class are misclassified than examples of primarily designed for binary classification. This
the majority class. The motives leading to fraud are algorithm is mainly used to improve the
criminal purposes. This practice of pursuing fraud performance of decision trees. It is also primarily
is essentially aimed at siphoning money illegally, used for regression classification. The Adaboost
resulting in loss of financial or personal gain. By algorithm is a fraud case and can classify
definition, credit card fraud is the use of credit card transactions as fraudulent and non-fraudulent. In
information to make purchases without the user's their study, they concluded that both AdaBoost
knowledge. Credit card transactions are organized and logistic regression achieved the highest
physically or virtually. In physical transactions, accuracy. The time factor is considered the best
users exchange cards personally at the time of algorithm as the accuracy is the same. Considering
purchase. Virtual Transactions include online the time factor, we conclude that the Adaboost
operations via the World Wide Web. algorithm is effective in detecting credit card fraud.
In 2019, Sahayasakila V, D. Kavya Monisha,
RELATED WORK
Aishwarya, Sikhakolli Venkatavisalakshiswshai
There are many fraud detection techniques and Yasaswi published key algorithmic techniques,
lots of research methods for detecting Fraudulent Whale Optimization Techniques (WOA) and
Detection using neural networks, data mining, and Synthetic Minority Oversampling Techniques
distributed data mining and many more. During (SMOTE), Twain [8]. Explained. The main goal was
our literature survey, we concluded that the to improve the convergence rate and solve the
fraudulent detection can be done in many other data imbalance problem. The class imbalance
approaches in Machine Learning. problem is solved using SMOTE and WOA methods.
The SMOTE method separates all synthesized
We researched on fraudulent detection on both transactions, resamples them to ensure data
machine learning and deep learning algorithms. accuracy, and optimizes them using the WOA
We mainly focused our research on how to handle method. This algorithm also improves the
the imbalanced data and which techniques are convergence speed, stability and efficiency of the
being used some of them are classification, system.
sampling and resembling technique. The
algorithms and methods which are available for Heta Naik , Prashasti Kanikar: Credit card Fraud
fraudulent detection. Detection based on Machine Learning
Algorithms,International Journal of Computer
In 2019, Heta Naik and Prashasti Kanikar did Applications (0975 – 8887) Volume 182 – No. 44,
research on various algorithms [4] such as Naive March 2019.
Bayes, Logistic Regression, J48 and Adaboost.
Naive Bayes is one of the classification algorithms. In 2018, Nawanushu Khare and Saad Yunus Sait
This algorithm is based on Bayes' theorem. Bayes' described their work on decision trees, random
theorem states that given the probability of an forests, SVMs, and logistic regression
event occurring, . The logistic regression algorithm [5]. They were given severely distorted data
is similar to the linear regression algorithm. Linear sets and worked with them.
regression is used to predict or predict values. Performance evaluation is based on precision,
Logistic regression is mainly used for classification sensitivity, specificity, and precision. The results
show that the accuracy is 97.7% for logistic the patterns. There exist some random weights(w)
regression, 95.5% for decision trees, 98.6% for which are assigned to every neuron initially. The
random forest, and 97.5% for SVM classifier. weights are multiplied by the provided inputs(x)
They concluded that the random forest algorithm attained from the input layer. The weighted sum is
is the most accurate among other algorithms and formed by aggregating the multiplied values. After
is considered the best algorithm for fraud obtaining the aggregated sum activation function
detection. They also concluded that the SVM is employed on the weighted sum of the inputs.
algorithm has data imbalance problems and does The stimulating task converts it into a subsequent
not provide better results for credit card fraud output. Once the network receives the input
detection. variable quantity, the weight of the input is
allocated at random. The significance of every
Navanshu Khare ,Saad Yunus Sait: Credit Card
input data point is implied by its weight in terms of
Fraud Detection Using Machine Learning Models
estimating the outcome. The discrimination
and Collating Machine Learning Models,
parameter, on the other side, provides an option
International Journal of Pure and Applied
to update the activation curve. Once all the
Mathematics Volume 118 No. 20 2018, 825-838
products are aggregated, we obtain the weighted
ISSN: 1314-3395.
sum. In this scenario we feed the input perceptron
PROPOSED METHODOLOGY with the existing data of fraud transactions which
is undergone through Principal Component
Analysis. This input data is multiplied with random
weights(W) initially to form an aggregated sum.
This random weights are updated during the
process of backpropagation. The weights are
adjusted in a way such that the error is minimum
during the prediction. Each perceptron is triggered
with an activation function once the aggregated
sum is calculated. In the output layer the
probability is determined which classifies the
category of the given input.
4.Proposed work:

4.1 Artificial neural networks:

This algorithm makes use of an intricate


connection between the output data and the input
data to reveal the hidden correlated patterns. We
make use of this strategy for recognizing and
predicting the un-authorized transactions. The
model’s innovative capabilities can increase the
existing data analysis techniques. As displayed in
the figure the layers which are at the initial state 4.2 Pycaret:
are the input layers which receives the information
from the data set which can be any kind of data A simple, easy to implement open-source library
format like text, audio, video, image and other that automates all the required classification
types. The network entails of dense layers which algorithms. It is a low-code machine learning
are hidden layers made of units that transmutes library that aims to decreases the cycle time from
the given input data into a desired output upon hypothesis to intuitions. It enables to achieve
various mathematical computations by recognizing iterative end-to-end data science researches. It is a
wrapper around several machine learning libraries An optimized distributed boosting library which is
and frameworks such as scikit-learn, XGBoost. All designed to be highly efficient and flexible. It is an
of the procedures performed in typical experiment ensemble learning algorithm as it combines the
are implemented and necessarily stored in a results of multiple models. The decision trees are
custom pipeline. created in a sequential form where weights have a
major importance. The weights are initialized to
Pycaret is initialized and imported with all modules
the independent variables which are given as input
required. During the setup, the setup() function
to the decision tree. The weights which are
allows to configure simple data preparation ,such
predicted erroneous is modified and are the fed to
as scaling , power transforms. We will specify the
the second decision tree. These unique classifiers
data, target variable. It will expect if all the
ensemble to form a robust precise model. It can
variables are as per the dataset, as per
work for multiple purposes like regression,
requirements. Give input as ‘y’ and enter. Input for
classification, ranking and custom-defined
percentage of data that should be used for training
challenges.
will be asked for the training model. Here we have
the split data as 70/30. The scores for the prediction of every distinct
decision tree sum up to get the final score.
These models which are trained can be compared
using compare_models() function which reports a Mathematically the model can be represented in
table of results summarizing all of the models that the form:
were evaluated along with their performance.

After comparing all the models and it’s metrics we


choose the best-performing model and its
configuration. Where K is the count of trees, f is functional space
of F.
4.2.1 CatBoost Classifier:
The objective of the XGBoost model is given by the
CatBoost is a recently open-sourced machine
following equation:
learning algorithm. It incorporates with deep
learning frameworks and can work with various
data types to help solve a wide range of problems
that businesses face. CatBoost works well with
multiple categories of data such as text, image, Which is to minimize the magnitude of loss
audio. Boost comes from gradient boosting function by applying additive strategy.
machine learning algorithm. Gradient boosting is a
unique and efficient algorithm which is used widely
to multiple types of problems and challenges. It has
the ability to provide a result which is good
compared to deep learning models which requires
huge amount of data for better accuracy. CatBoost
can be used without any pre-processing techniques
as it converts the categorical class variables into
integers using it’s statistic combination of
categorical and numerical features. It has multiple
parameters that can be tuned like learning rate,
tree depth, fold size, bagging temperature.
5.Evaluation and result analysis:
4.2.2 XGBoost Classifier:
The dataset we use for the fraudulent transaction
is received from a European credit card company Recall = (56844) / (56844 + 20) = 0.9996 (99.96 %)
which was downloaded form Kaggle. It holds
So, recall of our test dataset is 99.96%.
transaction data of few years ago. It contains
284,807 transactions out of which only 492 Precision (Positive predicted value) = (56844) /
transactions are fraudulent transactions. All these (56844 + 18) = 0.9996 (99.96 %)
transactions were taken in a span of two
consecutive days. The proposition of fraud So, the precision of our test dataset is 99.96%.
transaction accounts to only 0.172% of the entire Therefor the f1-score for our model is defined as:-
number of transactions. The dataset’s input
variables are integer values that are obtained F1-score = 2*((Precision * Recall) / (Precision +
after PCA transformation. To maintain the Recall))
confidentiality of the data ,the actual features = 2 * (( 99.96 * 99.96) / ( 99.96 + 99.96))
were not disclosed and only the features that has
been undergone PCA are given. Features like = 2 *( ( 9992.0016) / (199.92))
‘Time’ and ‘Amount’ can’t be PCA transformed.
= 99.96 %
‘Time’ represents the difference in seconds
elapsed between the particular transaction and So, the F1-score of our test dataset is 99.96%
the first transaction. The class ‘Amount’ indicates
the amount of money transaction that had
occurred. The target feature ‘class’ represents
whether that particular transaction is fraudulent
or not. It shows 1 for a fraud transaction and 0 for
a valid transaction.

Our test dataset has a total of 56844 transactions.


We have used the test dataset to see how well our
model will do at real-time transactions. So,
according to our confusion matrix, the accuracy
can be calculated as: -
Neural networks have learned the previous
Accuracy = (56844 + 80) / 56962 behaviour and can detect real-time credit card
= 56924 / 56962 frauds. From the above result of Neural network
model we can see that we got highest accuracy and
= 0.9993(99.93 %) precision of 99.96% but, the biggest disadvantage
when applying it is the problem with an
So, the accuracy of test dataset is 99.93%.
imbalanced dataset because the number of
So, found the recall of our test dataset by:
fraudulent transactions in real life is much lesser
than non-fraudulent cases.

This problem is partially solved by the data


resampling technique. For future works could be
used improving classification algorithm to fit
better with imbalanced dataset.

0 1

56844 20
0

1 18 80

SMOTE:
You can see a list of different machine learning
We use SMOTE which is an oversampling models that have been run here using PyCaret and
technique where the synthetic samples are we can see that CatBoost Classifier gets the best
generated for the minority class. This model with an accuracy of 0.9996, an accuracy of
algorithm helps to overcome the overfitting 0.94, an F1 score of 0.86, and a kappa score of
problem posed by random oversampling. SMOTE 0.86.
works by utilizing a k-nearest neighbour algorithm.
70% of the data is used to train the model, and the
Using SMOTE we can tweak the model to reduce remaining 30% is used to test the model and
false negatives, at the cost of increasing false predict its outcome. Here predict_model() is used
positives. The result of using SMOTE is generally to query the result for both catboost and XG
an increase in recall, at the cost of lower precision. boost.

RESULT AND PERFORMANCE ANALYSIS

CONCLUSION

Communication and Electrical Technology


(ICCCET), 2011, pp. 152-156, doi:
REFERENCES
10.1109/ICCCET.2011.5762457.
[1]Fayyomi, Aisha & Eleyan, Derar & Eleyan, Amina.
[4]J. O. Awoyemi, A. O. Adetunmbi and S. A.
(2021). A Survey Paper On Credit Card Fraud
Oluwadare, "Credit card fraud detection using
Detection Techniques. International Journal of
machine learning techniques: A comparative
Scientific & Technology Research. 10. 72-79.
analysis," 2017 International Conference on
[2]Asha RB, Suresh Kumar KR. Credit card fraud Computing Networking and Informatics (ICCNI),
detection using artificial neural network, 2017, pp. 1-9, doi: 10.1109/ICCNI.2017.8123782.
https://doi.org/10.1016/j.gltp.2021.01.006.
[5]A. Srivastava, A. Kundu, S. Sural and A.
[3]S. Benson Edwin Raj and A. Annie Portia, Majumdar, "Credit Card Fraud Detection Using
"Analysis on credit card fraud detection methods," Hidden Markov Model," in IEEE Transactions on
2011 International Conference on Computer, Dependable and Secure Computing, vol. 5, no. 1,
pp. 37-48, Jan.-March 2008, doi: [8]S. Makki, Z. Assaghir, Y. Taher, R. Haque, M. -S.
10.1109/TDSC.2007.70228. Hacid and H. Zeineddine, "An Experimental Study
With Imbalanced Classification Approaches for
[6]D. Varmedja, M. Karanovic, S. Sladojevic, M.
Credit Card Fraud Detection," in IEEE Access, vol. 7,
Arsenovic and A. Anderla, "Credit Card Fraud
pp. 93010-93022, 2019, doi:
Detection - Machine Learning methods," 2019 18th
10.1109/ACCESS.2019.2927266.
International Symposium INFOTEH-JAHORINA
(INFOTEH), 2019, pp. 1-5, doi: [9]R. Sailusha, V. Gnaneswar, R. Ramesh and G. R.
10.1109/INFOTEH.2019.8717766. Rao, "Credit Card Fraud Detection Using Machine
Learning," 2020 4th International Conference on
[7]A. Shen, R. Tong and Y. Deng, "Application of
Intelligent Computing and Control Systems
Classification Models on Credit Card Fraud
(ICICCS), 2020, pp. 1264-1270, doi:
Detection," 2007 International Conference on
10.1109/ICICCS48265.2020.9121114.
Service Systems and Service Management, 2007,
pp. 1-4, doi: 10.1109/ICSSSM.2007.4280163.

You might also like