
Fraud Detection in Banking Data
by Machine Learning Techniques

1. INTRODUCTION
Fraud can be defined as wrongful or criminal deception intended to result in
financial or personal gain. Credit card fraud is the illegal use of credit card
information for purchases, whether physical or digital. In digital transactions, fraud
can happen over the phone or the web, since cardholders usually provide the card
number, expiration date, and card verification number by telephone or website. Two
mechanisms, fraud prevention and fraud detection, can be used to avoid fraud-related
losses. Fraud prevention is a proactive method that stops fraud from happening in the
first place. Fraud detection, on the other hand, is needed when a fraudster attempts a
fraudulent transaction. Fraud detection in banking is treated as a binary classification
problem in which each transaction is classified as legitimate or fraudulent. Because
banking datasets contain a very large number of transactions, manually reviewing them
and finding patterns of fraudulent transactions is either impossible or takes a long time.
Therefore, machine learning-based algorithms play a pivotal role in fraud detection and
prediction. Machine learning algorithms combined with high processing power make it
possible to handle large datasets and detect fraud more efficiently. Machine learning
and deep learning algorithms also provide fast and efficient solutions to real-time
problems.
Here we adopt Bayesian optimization for fraud detection. We propose an efficient
approach for detecting credit card fraud that has been evaluated on publicly available
datasets and uses the optimised algorithms LightGBM, XGBoost, CatBoost, and logistic
regression individually, as well as majority-voting combinations of them, deep learning,
and hyperparameter settings. An ideal fraud detection system should detect as many
fraudulent cases as possible, and its precision should also be high, i.e., the cases it
flags should in fact be fraudulent. This builds customers' trust in the bank and, at the
same time, spares the bank losses due to incorrect detection. The main contributions of
this project are summarized as follows:
1) We adopt Bayesian optimization for fraud detection and propose to use the
weight-tuning hyperparameter to solve the unbalanced data issue as a pre-processing
step. We also suggest using CatBoost and XGBoost alongside LightGBM to improve
performance. We use the XGBoost algorithm because of its high training speed on big
data and its regularization term, which counters overfitting by penalizing the complexity
of the trees, and because it does not require much time to set the hyperparameters. We
also use the CatBoost algorithm because there is no need to adjust hyperparameters for
overfitting control, and it obtains good results without changing hyperparameters
compared to other machine learning algorithms.
2) We propose a majority-voting ensemble learning approach to combine CatBoost,
XGBoost, and LightGBM and review the effect of the combined methods on the
performance of fraud detection on real, unbalanced data. We also propose to use deep
learning for adjusting and fine-tuning the hyperparameters.
To evaluate the performance of the proposed methods, we perform extensive experiments
on real-world data. To better account for the unbalanced datasets, we use precision-recall
in addition to the commonly used ROC-AUC. We also evaluate the performance using
the F1-score and the Matthews correlation coefficient (MCC).
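The majority-voting idea in contribution 2 can be illustrated with a minimal sketch. It assumes each model has already produced a 0/1 prediction per transaction (1 = fraudulent); the function name `majority_vote` and the three prediction lists are hypothetical and only stand in for the outputs of trained CatBoost, XGBoost and LightGBM models.

```python
from collections import Counter

def majority_vote(predictions_per_model):
    """Combine 0/1 predictions from several models by hard majority voting.

    predictions_per_model: list of equal-length lists, one list per model.
    Returns one label per transaction: the label most models agreed on.
    """
    if not predictions_per_model:
        raise ValueError("need at least one model")
    n_rows = len(predictions_per_model[0])
    if any(len(p) != n_rows for p in predictions_per_model):
        raise ValueError("all models must predict the same number of rows")
    combined = []
    for votes in zip(*predictions_per_model):
        # With an odd number of binary voters a tie cannot occur.
        label, _ = Counter(votes).most_common(1)[0]
        combined.append(label)
    return combined

# Made-up predictions for five transactions (illustration only).
catboost_pred = [0, 1, 0, 1, 0]
xgboost_pred = [0, 1, 1, 0, 0]
lightgbm_pred = [0, 0, 1, 1, 0]

ensemble_pred = majority_vote([catboost_pred, xgboost_pred, lightgbm_pred])
print(ensemble_pred)
```

A transaction is flagged only when at least two of the three models flag it, so a single model's false alarm is outvoted. Voting on predicted probabilities (soft voting) is an alternative that this sketch does not cover.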

1.1 PURPOSE
The purpose of our project is to identify fraudulent activities in banking data quickly
and accurately, thereby safeguarding customers' assets by using machine learning
techniques.

1.2 SCOPE
The purpose of our project is to identify fraudulent activities in banking data quickly
and accurately, thereby safeguarding customers' assets by using machine learning
techniques.

2. LITERATURE SURVEY
Comparison and analysis of logistic regression, Naive Bayes and KNN
machine learning algorithms for credit card fraud detection
Financial fraud is a threat that is growing at an increasing pace and has a very
bad impact on the economy, collaborative institutions and administration. Credit card
transactions are increasing faster because of advances in internet technology, which
lead to a high dependence on the internet. With the upgrading of technology and the
increase in credit card usage, fraud rates have become a challenge for the economy. As
new security features are added to credit card transactions, fraudsters also develop new
patterns or loopholes to exploit the transactions. As a result, the behaviour of fraudulent
and normal transactions changes constantly. Another problem is that credit card data is
highly skewed, which leads to inefficient prediction of fraudulent transactions. To
achieve better results, the imbalanced or skewed data is pre-processed with a
re-sampling (over-sampling or under-sampling) technique. Three different proportions
of the dataset were used in this study, and random under-sampling was applied to the
skewed dataset. The work uses three machine learning algorithms, namely logistic
regression, Naive Bayes and K-nearest neighbour, and records their performance in a
comparative analysis. The work is implemented in Python, and performance is
measured by accuracy, sensitivity, specificity, precision, F-measure and area under the
curve. On the basis of these measurements, the logistic regression based model was
found to predict fraudulent transactions better than the models developed from Naive
Bayes and K-nearest neighbour. Better results are also seen when under-sampling is
applied to the data before developing the prediction model.
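Random under-sampling, as used in the study above, can be sketched in a few lines. This is only an illustration under stated assumptions: labels are binary with 1 (fraud) as the minority class, and the `ratio` parameter, the function name and the toy data are hypothetical rather than taken from the cited work.

```python
import random

def random_under_sample(rows, labels, ratio=1.0, seed=0):
    """Randomly drop majority-class rows until the number kept equals
    ratio * (number of minority rows). Every minority row is kept.

    Assumes binary labels where 1 (fraud) is the minority class.
    """
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    n_keep = min(len(majority), int(round(ratio * len(minority))))
    kept = minority + rng.sample(majority, n_keep)
    rng.shuffle(kept)  # avoid all fraud rows appearing first
    return [rows[i] for i in kept], [labels[i] for i in kept]

# Toy data: 1,000 transactions, 10 of them fraudulent (made-up values).
labels = [1] * 10 + [0] * 990
rows = [[float(i)] for i in range(1000)]

X_bal, y_bal = random_under_sample(rows, labels, ratio=1.0)
print(len(y_bal), sum(y_bal))
```

Changing `ratio` gives differently proportioned training sets. The cost is that most legitimate transactions are discarded, which is the weakness the next surveyed paper sets out to address.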
An ensemble learning framework for credit card fraud detection based
on training set partitioning and clustering
The popularity of credit cards has greatly facilitated transactions between
merchants and cardholders. However, credit card fraud has emerged alongside it,
resulting in losses of billions of euros every year. In recent years, machine learning and
data mining technology have been widely used in fraud detection and have achieved
favourable performance. Most of these studies use under-sampling to deal with the high
imbalance of credit card data. However, under-sampling can discard relevant training
samples, which weakens the classifier. In this paper, we propose an ensemble learning
framework based on training set partitioning and clustering. It turns out that the
proposed framework not only preserves the integrity of the sample features but also
resolves the high imbalance of the dataset. A main feature of our framework is that
every base estimator can be trained in parallel, which improves the efficiency of the
framework. We show the effectiveness of our proposed ensemble framework by
experimental results on a real credit card transaction dataset.
A sequence mining-based novel architecture for detecting fraudulent
transactions in healthcare systems
With the exponential rise in government and private health-supported schemes,
the number of fraudulent billing cases is also increasing. Detection of fraudulent
transactions in healthcare systems is an exigent task due to intricate relationships among
dynamic elements, including doctors, patients, and services. Hence, to introduce
transparency in health support programs, there is a need to develop intelligent fraud
detection models for tracing the loopholes in existing procedures, so that the fraudulent
medical billing cases can be accurately identified. Moreover, there is also a need to
optimize both the cost burden for the service provider and medical benefits for the client.
This paper presents a novel process-based fraud detection methodology to detect
insurance claim-related frauds in the healthcare system using sequence mining concepts.
Recent literature focuses on the amount-based analysis or medication versus disease
sequential analysis rather than detecting frauds using sequence generation of services
within each specialty. The proposed methodology generates frequent sequences with
different pattern lengths. The confidence values and confidence level are computed for
each sequence. The sequence rule engine generates frequent sequences along with
confidence values for each hospital’s specialty and compares them with the actual
patient values. This identifies anomalies, since an anomalous patient sequence will not
comply with the rule engine's sequences. The process-based fraud detection
methodology is validated using the last five years of a local hospital's transactional
data, which includes many reported cases of fraudulent activities.

Credit card fraud detection in e-commerce: An outlier detection
approach
Often the challenge associated with tasks like fraud and spam detection is the
lack of all the likely patterns needed to train suitable supervised learning models. This
problem is accentuated when the fraudulent patterns are not only scarce but also change
over time. Fraudulent patterns change because fraudsters keep devising novel ways to
circumvent the measures put in place to prevent fraud. Limited data and continuously
changing patterns make learning significantly more difficult. We hypothesize that good
behaviour does not change with time and that data points representing good behaviour
have a consistent spatial signature under different groupings. Based on this hypothesis,
we propose an approach that detects outliers in large datasets by assigning a consistency
score to each data point using an ensemble of clustering methods. Our main contribution
is a novel method that can detect outliers in large datasets and is robust to changing
patterns. We also argue that area under the ROC curve, although a commonly used
metric for evaluating outlier detection methods, is not the right metric. Since outlier
detection problems have a skewed distribution of classes, precision-recall curves are
better suited because precision compares false positives to true positives (outliers)
rather than true negatives (inliers) and therefore is not affected by the problem of class
imbalance. We show empirically that area under the precision-recall curve is a better
evaluation metric than area under the ROC curve. The proposed approach is tested on
the modified version of the Landsat satellite dataset, the modified version of the
Ann-thyroid dataset and a large real-world credit card fraud detection dataset available
through Kaggle, where we show significant improvement over the baseline methods.
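The argument about ROC versus precision-recall can be made concrete with a small arithmetic sketch. The confusion counts below are invented for illustration and are not results from the cited paper: the same detector quality (80 percent recall, 1 percent false positive rate) is applied to a balanced and to a heavily skewed dataset.

```python
def rates(tp, fp, tn, fn):
    """Return (recall, false_positive_rate, precision) from confusion counts."""
    recall = tp / (tp + fn)       # y-axis of the ROC curve, x-axis of the PR curve
    fpr = fp / (fp + tn)          # x-axis of the ROC curve
    precision = tp / (tp + fp)    # y-axis of the PR curve
    return recall, fpr, precision

# Hypothetical counts: 1,000 outliers vs 1,000 inliers.
balanced = rates(tp=800, fp=10, tn=990, fn=200)
# Hypothetical counts: 100 outliers vs 100,000 inliers.
skewed = rates(tp=80, fp=1000, tn=99000, fn=20)

for name, (recall, fpr, precision) in [("balanced", balanced), ("skewed", skewed)]:
    print(f"{name}: recall={recall:.3f} fpr={fpr:.3f} precision={precision:.3f}")
```

Both cases sit on the same ROC point, because the false positive rate divides by the very large number of inliers. Precision compares the false positives with the true positives instead, so it exposes that most alerts on the skewed data are false alarms.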

3. SYSTEM ANALYSIS
3.1 EXISTING SYSTEM

The existing system for fraud detection in banking data uses algorithms such as
Naïve Bayes, linear regression, decision trees and support vector machines. This
framework faces challenges arising from the imbalanced nature of the data, among
other issues.

3.1.1 Disadvantages
• Lower accuracy
• Efficiency and performance are reduced by the imbalanced data
• Higher complexity

3.2 PROBLEM STATEMENT

The existing system cannot detect credit card fraud accurately because the
dataset it is trained on is imbalanced.

3.3 PROPOSED SYSTEM

The proposed system uses class-weight tuning hyperparameters to address the
issue of imbalanced data. We also use Bayesian optimization for improved feature
selection. Further, it makes use of algorithms such as logistic regression, XGBoost and
voting mechanisms to detect credit card fraud in banking data.
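The effect of class weights on the training loss can be sketched as follows. This is an illustration under assumptions, not the project's implementation: the "balanced" weighting formula is a common heuristic used here only as a starting point (the report tunes the weight as a hyperparameter), and the function names and toy labels are hypothetical. Gradient boosting libraries such as LightGBM and XGBoost expose a comparable setting (for example `scale_pos_weight`).

```python
import math
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count),
    so that the rare class counts for more in the loss."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * c) for cls, c in counts.items()}

def weighted_log_loss(labels, probs, weights, eps=1e-15):
    """Class-weighted binary cross-entropy; probs are P(fraud) per row."""
    total, weight_sum = 0.0, 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)  # keep log() finite
        w = weights[y]
        total += -w * (y * math.log(p) + (1 - y) * math.log(1 - p))
        weight_sum += w
    return total / weight_sum

# Toy labels: 998 legitimate and 2 fraudulent transactions (made up).
labels = [0] * 998 + [1] * 2
weights = balanced_class_weights(labels)

# A lazy model that gives every transaction a 1 percent fraud probability.
lazy_probs = [0.01] * len(labels)
plain = weighted_log_loss(labels, lazy_probs, {0: 1.0, 1: 1.0})
weighted = weighted_log_loss(labels, lazy_probs, weights)
print(weights)
print(f"unweighted loss={plain:.4f}  weighted loss={weighted:.4f}")
```

Without weights the lazy model looks nearly perfect because the two missed frauds barely move the average. With class weights the same misses dominate the loss, which pushes training toward detecting the minority class without discarding or duplicating any rows.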

3.3.1 Advantages
• Accuracy and performance are high because the data imbalance is handled
• Computational time is reduced
• Lower complexity

4. SYSTEM REQUIREMENTS SPECIFICATION
4.1 FUNCTIONAL REQUIREMENTS
• Collecting banking data
• Analyzing information
• Loading the trained model
• Predicting whether a transaction is fraudulent or legitimate

4.2 NON-FUNCTIONAL REQUIREMENTS

• The dataset should not contain any noisy data.
• The input data entered must be correct and accurate.
• The data should be analyzed properly in order to get correct output.

4.3 HARDWARE REQUIREMENTS

• Processor : Intel i3
• Hard Disk : 250 GB
• RAM : 4 GB

4.4 SOFTWARE REQUIREMENTS

• Operating system : Windows 7, 8 or 10 Ultimate
• Coding Language : Python
• Front-End : Python console, Jupyter Notebook

5. SYSTEM DESIGN
5.1 SYSTEM ARCHITECTURE

Figure 5.1 System Architecture


1. Data collection: The data collection step gathers transactional data from various
sources within the banking system. This data typically includes a wide range of
information related to financial transactions, account activities, and customer
interactions.
2. Data preprocessing: The data preprocessing step prepares raw transactional data for
analysis by machine learning algorithms. It involves several key tasks that clean,
format, and transform the data in a way that facilitates accurate and effective fraud
detection.
3. Splitting data into training and testing: Splitting the data into training and testing
sets is a critical step for evaluating the performance of the machine learning model. This
process ensures that the model is trained on a subset of the data and then tested on
unseen data to assess its generalization ability.
4. Bayesian optimization: Bayesian optimization is used to tune the hyperparameters of
the machine learning models efficiently, maximizing their performance.
5. Algorithms: We train models with several algorithms, namely logistic regression,
CatBoost, XGBoost, LightGBM, and deep learning, and compare their performance at
identifying patterns indicative of fraudulent activities.
6. 5-fold cross-validation: 5-fold cross-validation is a technique used in machine
learning to assess the performance of a predictive model. In this method, the dataset is
divided into five equal-sized subsets or folds. The model is trained and evaluated five
times, with each iteration using a different fold as the testing set and the remaining folds
as the training set. The performance metrics obtained from each iteration are then
aggregated to provide an overall assessment of the model's performance. This technique
reduces the dependence of the result on one particular split of the data and provides a
more reliable estimate of how well the model will generalize to unseen data.
7. Model evaluation: Model evaluation assesses how accurately the predictive model
identifies fraudulent transactions while minimizing false positives.
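The 5-fold procedure in step 6 can be sketched as an index generator. One assumption goes beyond the description above: the folds here are stratified, meaning each fold keeps roughly the same share of fraudulent rows, which is a common choice for heavily imbalanced data but is not stated in the steps. The function name and the toy labels are hypothetical.

```python
import random

def stratified_k_fold(labels, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation.

    Assumption: folds are stratified, so every fold has about the same
    fraud ratio as the full dataset.
    """
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal the rows of this class round-robin across the folds.
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    for f in range(k):
        test_idx = sorted(folds[f])
        train_idx = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train_idx, test_idx

# Toy labels: 990 legitimate and 10 fraudulent transactions (made up).
labels = [1] * 10 + [0] * 990

fold_sizes = []
for train_idx, test_idx in stratified_k_fold(labels, k=5):
    n_fraud_in_test = sum(labels[i] for i in test_idx)
    # A real run would fit the model on train_idx and score it on test_idx here.
    fold_sizes.append((len(train_idx), len(test_idx), n_fraud_in_test))
print(fold_sizes)
```

Each row is used for testing exactly once and for training four times. The per-fold scores would then be averaged to give the aggregated estimate described in step 6.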

5.2 DATA FLOW DIAGRAM

Figure 5.2 Data flow diagram

6. CONCLUSION AND FUTURE ENHANCEMENTS

6.1 CONCLUSION
Employing machine learning techniques for fraud detection in banking data
offers a promising solution to mitigate financial losses and safeguard the integrity of
financial institutions. Through the utilization of advanced algorithms, such as anomaly
detection, classification models, and clustering techniques, banks can analyze vast
amounts of transactional data to identify patterns indicative of fraudulent activities.
We proposed a machine learning approach to improve the performance of
fraud detection. We used a publicly available credit card dataset with 28 features in
which 0.17 percent of the transactions are fraudulent. We proposed two methods. In the
proposed LightGBM, we used class weight tuning to choose the proper
hyperparameters. We used the common evaluation metrics, including accuracy,
precision, recall, F1-score, and AUC. We further improved the performance of the
algorithm with the help of majority voting. We also improved the evaluation criteria by
using the deep learning method. The MCC results showed that, for unbalanced data,
MCC is a stronger evaluation criterion than the other criteria. Compared with sampling
methods, using hyperparameters to address the data imbalance not only reduces the
memory and time needed to evaluate the algorithms but also gives better results.
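The point about MCC on unbalanced data can be illustrated with a short worked example. The confusion counts are hypothetical and chosen only to match the 0.17 percent fraud rate mentioned above; they are not results from this project. Reporting MCC as 0 when its denominator is zero is a convention assumed here.

```python
import math

def confusion_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, F1 and MCC from confusion-matrix counts."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention (assumption): MCC is reported as 0 when the denominator is 0.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": (tp + tn) / total, "precision": precision,
            "recall": recall, "f1": f1, "mcc": mcc}

# 100,000 hypothetical transactions, 170 of them fraudulent (0.17 percent).
always_legit = confusion_metrics(tp=0, fp=0, tn=99_830, fn=170)
detector = confusion_metrics(tp=136, fp=34, tn=99_796, fn=34)

for name, m in [("always legitimate", always_legit), ("detector", detector)]:
    print(name, {k: round(v, 4) for k, v in m.items()})
```

A model that labels every transaction as legitimate still reaches 99.83 percent accuracy while its recall, F1-score and MCC are all zero. Accuracy barely separates it from the hypothetical detector, whereas MCC and F1 do.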
In essence, by harnessing the power of machine learning, banks can proactively
identify and prevent fraudulent activities, thereby bolstering trust among customers,
regulators, and stakeholders while upholding the integrity of the financial system as a
whole.

6.2 FUTURE ENHANCEMENTS
Future enhancements in fraud detection for banking data via machine learning
techniques are likely to focus on advanced algorithmic approaches and data integration
strategies. They are also expected to improve the transparency and interpretability of
machine learning models, facilitating trust among stakeholders and aiding regulatory
compliance. These advancements aim to refine fraud detection systems so that they
remain effective against evolving fraudulent tactics while also addressing concerns
related to model interpretability and regulatory requirements.

7. REFERENCES
• F. Itoo, M. Meenakshi, and S. Singh, "Comparison and analysis of logistic
regression, Naive Bayes and KNN machine learning algorithms for credit card
fraud detection".
• R. Almutairi, A. Godavarthi, A. R. Kotha, and E. Ceesay, "Analyzing credit card
fraud detection based on machine learning models".
• A. A. Taha and S. J. Malebary, "An intelligent approach to credit card fraud
detection using an optimized light gradient boosting machine".
• H. Cho, Y. Kim, E. Lee, D. Choi, Y. Lee, and W. Rhee, "Basic enhancement
strategies when using Bayesian optimization for hyperparameter tuning of deep
neural networks".
