
Table of Contents

Link to the Google Colab notebook:
https://colab.research.google.com/drive/1SeeVkyzLOgeRJBg1h8aKfa6eW6h-qW1n?usp=sharing

Introduction
Objective
Credit Card Fraud Detection
Machine Learning
Supervised Machine Learning
Benefits of Machine Learning in Financial Analytics
Exploring the Dataset
Visualising the Data
Correlation Matrix
Evaluate Model Performance
Logistic Regression
Support Vector Machine
Decision Tree
Observations
Limitations
Conclusion
References

Introduction
Fraudulent activities are reported across the world every day and cause huge losses to individuals
and to companies. The growth of online shopping has led to a surge in the use of net banking and
credit / debit cards as modes of payment. Online payment does not require the physical card to be
present, so anyone who has the card information can use it, which poses a major risk of fraudulent
transactions. Credit cards have also become a popular mode of payment for everyday purchases.
Together, these trends have given rise to a number of frauds, which pose a major risk to any
business since they affect both the customer and the company. In a real situation, not all online
transactions are fraudulent, and it is not always easy to identify a fraud.

Objective
This project uses Machine Learning to evaluate three classification models for fraud detection. We
compare the models on historic data to see which provides the best results in identifying and
distinguishing fraud, with a view to analysing, detecting, reducing and preventing credit card fraud
and so providing outstanding customer service. The aim of this analysis is to detect potential
fraud.

Credit Card Fraud Detection


Credit card fraud is an inclusive term for fraud committed using payment cards, such as a credit card
or debit card. The purpose may be to obtain goods or services, or to make a payment to another
account which is controlled by a criminal.

Machine learning
Machine learning is a subset of AI that enables a computer to make data-driven decisions in order to
carry out certain tasks. Its programs are algorithms designed to learn and improve over time when
exposed to new data.
Machine learning is the scientific study of algorithms and statistical models that a computer uses to
analyse and draw inferences from patterns in data. Machine learning algorithms build a mathematical
model based on sample data, known as training data, in order to make predictions and decisions. They
parse data, learn from that data and make informed decisions based on what they have learned.
Today, machine learning is used in many areas of financial analysis. Financial institutions largely
collate and utilize internally collected data such as bank transactions, purchase history patterns,
mode of communication and brand loyalty. Internet banking data and mobile phone usage for
banking and online transactions have further opened avenues for the flow of data into the
financial system, so a large volume of data is available and can be used for risk and fraud
management analysis. Many financial institutions use machine learning techniques in the areas of
customer service, fraud detection, forecasting, understanding consumer sentiment, customer
profiling and target marketing, among others. The introduction of machine learning into finance has
led to improved management of real-time information.

There are broadly four types of Machine Learning models:


1. Supervised Learning
2. Unsupervised Learning
3. Reinforcement Learning
4. Transfer Learning

Supervised Machine Learning


• A supervised algorithm learns to perform a task from known examples, the “Training data”.
• An important requirement of supervised learning is labelled data to train the model.
• Supervised learning algorithms try to model relationships and dependencies between the
target prediction output and the input features, so that the output values for new data can
be predicted from the relationships learned on previous data sets.
• Human experts act as the teacher: we feed the computer training data containing the
inputs/predictors and show it the correct answers (outputs), and from this data the computer
should be able to learn the patterns.
• In the finance sector, Supervised Learning is used to predict the creditworthiness of a credit
card holder. This can be done by building a Machine Learning model that checks for
faulty attributes after being given data on delinquent and non-delinquent customers.
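
The bullets above can be illustrated with a minimal sketch in plain Python. It is a toy nearest-centroid classifier, not one of the models used in this project; the function names `fit_centroids` and `predict` and all feature values are hypothetical and chosen only to show the "learn from labelled examples, then predict on new data" pattern.

```python
from statistics import mean

def fit_centroids(X, y):
    """Learn one centroid (mean feature vector) per class from labelled data."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    return centroids

def predict(centroids, x):
    """Assign x to the class whose centroid is closest (squared Euclidean distance)."""
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: dist2(centroids[label]))

# Hypothetical training data: [amount, hour of day]; label 0 = Non-Fraud, 1 = Fraud.
X_train = [[20.0, 10.0], [35.0, 14.0], [900.0, 3.0], [1200.0, 2.0]]
y_train = [0, 0, 1, 1]

model = fit_centroids(X_train, y_train)    # "training" on labelled examples
prediction = predict(model, [1000.0, 4.0]) # prediction for an unseen transaction
print(prediction)
```

The labels play the role of the "correct answers" supplied by the teacher; without them the centroids could not be computed per class.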

Benefits of Machine Learning in Financial Analytics
• Machine Learning is used to detect fraud in real time.
• Reduction in the number of fraudulent transactions.
• Users can safely use their Credit / Debit cards for online transactions.
• The use of Machine Learning techniques provides an added level of security.

Exploring the Dataset
Dataset Description
• The dataset for Credit Card Fraud Detection has been obtained from Kaggle.
• The dataset consists of 31 features: 28 are anonymized with labels V1 – V28, and the
remaining 3 give the time of the transaction, its amount and its class.
• Whether a transaction was fraudulent is recorded in the Class feature, where
0 = Non-Fraud and 1 = Fraud.
• The dataset does not contain any missing values.
• The size of the original data has been reduced for ease of analysis.
• For confidentiality reasons, the exact source and details of the data are unknown.
• The named features are Time, Amount and Class.

The dataset contains 9961 cases, of which 0.25% are fraud cases.

Breakdown of Fraud and Non-Fraud cases

Visualising the data
The graph below shows that there are hardly any fraudulent transactions compared with the
non-fraud transactions, which means that the data is highly imbalanced.

The Class feature represents: 0 = Non-Fraud, 1= Fraud
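
To make the imbalance concrete, the sketch below rebuilds a label column with the reported proportions (9961 rows, about 0.25% fraud) and computes what a trivial classifier that always answers Non-Fraud would score. The fraud count of 25 is an assumption obtained by rounding, since the report gives only the percentage; the labels are synthetic, not the project data.

```python
from collections import Counter

# Synthetic labels that mimic the reported proportions.
n_total = 9961
n_fraud = round(n_total * 0.0025)   # assumed count; only the percentage is reported
labels = [1] * n_fraud + [0] * (n_total - n_fraud)

counts = Counter(labels)
fraud_share = counts[1] / n_total

# A "model" that always predicts Non-Fraud is right on every non-fraud row
# and wrong on every fraud row.
baseline_accuracy = counts[0] / n_total
baseline_fraud_recall = 0 / counts[1]

print(f"non-fraud={counts[0]}, fraud={counts[1]}, fraud share={fraud_share:.2%}")
print(f"majority-class accuracy={baseline_accuracy:.2%}, "
      f"fraud recall={baseline_fraud_recall:.0%}")
```

With so few fraud rows, an accuracy close to 100% can be reached without detecting a single fraud, which is why accuracy alone says little on this dataset.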

Correlation Matrix using Heatmap
There seems to be very little correlation between the Class feature and the other features. This
could be due to the highly imbalanced data, which weakens the correlation measure.

There is no notable correlation among features V1 – V28. There also seems to be no distinct
correlation between Time and Amount.
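
For reference, each cell of such a heatmap is a Pearson correlation coefficient. The sketch below computes it in plain Python; the helper name `pearson` and the sample values are illustrative and are not taken from the project notebook.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# A perfectly linear relationship gives a coefficient of +1 or -1.
r_linear = pearson([1, 2, 3, 4], [2, 4, 6, 8])

# y = x**2 on a symmetric range: y depends entirely on x, yet the coefficient is 0.
r_quadratic = pearson([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])

print(r_linear, r_quadratic)
```

The second example is a reminder that a coefficient near zero only rules out a linear relationship, so a pale heatmap does not by itself show that the features carry no information about the class.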

Evaluate Model Performance
The data is analysed using three supervised Machine Learning models: Logistic Regression, Support
Vector Machine (SVM) and Decision Tree. These are among the most popular models for solving
classification problems. All three can be built using the algorithms provided by the scikit-learn
package.

1. Logistic Regression
Logistic Regression estimates a continuous quantity: the probability that an event occurs.
This probability is compared with a chosen threshold to decide how new data is classified. With a
threshold of 0.5, for example, a new observation is assigned to one class if its predicted
probability is at least 0.5 and to the other class otherwise.

Instead of fitting a straight line or hyperplane as in a linear regression model, the logistic
regression model uses the logistic function to squeeze the output of a linear equation between 0
and 1. With this function it is possible to map the real-valued predictions into probabilities.
Note: Logistic Regression uses the ‘sigmoid’ function for two-class logistic regression and the
‘softmax’ function for multiclass logistic regression.
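
The mapping from a linear score to a probability, followed by the threshold decision, can be sketched as follows. This is a minimal illustration rather than the fitted model from the project: the weights, bias and feature values are hypothetical, and `predict_class` is a helper name introduced here.

```python
from math import exp

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    e = exp(z)               # avoids overflow for large negative z
    return e / (1.0 + e)

def predict_class(features, weights, bias, threshold=0.5):
    """Return (label, probability) for one observation."""
    z = bias + sum(w * x for w, x in zip(weights, features))  # linear part
    p = sigmoid(z)                                            # probability of class 1
    return (1 if p >= threshold else 0), p

weights, bias = [0.8, -0.5], -1.0      # hypothetical coefficients
label, prob = predict_class([3.0, 1.0], weights, bias)
print(label, round(prob, 3))
```

Lowering `threshold` makes the classifier flag more transactions as class 1, which trades precision for recall.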

Logistic regression is a reformulation of linear regression for classification problems, because
linear regression is not well suited to separating classes. In a linear regression, we model the
relationship between the features and the target variable with the following equation:

$$\hat{y} = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p$$

To address classification problems, we convert this equation to obtain probabilities between 0
and 1. The right side of the above equation is wrapped into a logistic function, which forces the
output to be between 0 and 1:

$$P(y = 1) = \frac{1}{1 + \exp\left(-(\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p)\right)}$$

Importance
Logistic Regression is used in financial analytics by credit rating companies to obtain instant
information on customers. The risk status of the customer can be predicted to separate the delinquent
customers from the non-delinquent customers.

Analysis
Our dataset has a binary target. Logistic Regression is used to estimate discrete values
(binary values like 0/1, yes/no, true/false) based on a given set of independent variables. In
simple words, it predicts the probability of occurrence of an event by fitting data to a logit
function. Since it predicts a probability, its output values lie between 0 and 1.

In this case the output data has a categorical value:


0 = Non -Fraud
1 = Fraud

The sklearn package has been used to evaluate model performance through the confusion matrix,
accuracy score and classification report.

The recall scores show that the model performs well in identifying the non-fraudulent class (0)
but does not perform well in identifying the fraudulent class (1).
The model is not strong enough to identify doubtful transactions; however, once doubtful
transactions are identified, the model can classify the fraudulent transactions correctly.
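
The relationship between the confusion matrix, accuracy, precision and recall can be worked through by hand. The sketch below uses made-up predictions (not the project's results) in which accuracy is high while fraud recall is low; `confusion_counts` and `safe_div` are helper names introduced for this example.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, FN, TN) treating `positive` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn

def safe_div(a, b):
    """Division that returns 0.0 instead of failing when the denominator is 0."""
    return a / b if b else 0.0

# Made-up example: 995 legitimate and 5 fraudulent transactions.
y_true = [0] * 995 + [1] * 5
y_pred = [0] * 995 + [1, 0, 0, 0, 0]   # one fraud caught, four missed, no false alarms

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)     # share of all predictions that are correct
precision = safe_div(tp, tp + fp)      # of the flagged transactions, how many are fraud
recall = safe_div(tp, tp + fn)         # of the actual frauds, how many were flagged

print(f"accuracy={accuracy:.3f} precision={precision:.2f} recall={recall:.2f}")
```

In this example every flagged transaction is a real fraud, yet most frauds are never flagged, so fraud recall is the number to watch.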

Limitations
The model requires continuous fine-tuning and updating with more fraud scenarios so that it can
identify the transactions correctly. It also needs to be ascertained that the X variables (the
input features) are correctly defined; it may be possible to introduce new X variables.

2. Support Vector Machine (SVM)
Support Vector Machine is a Supervised Machine Learning algorithm that can be used for both
classification and regression problems. However, it is most used in classification problems. The goal
of the algorithm is to classify new unseen objects into two separate groups based on their properties
and a set of examples that are already classified.

A variant, One-class SVM, helps to identify outliers in the data. The idea behind One-class
SVM is to train only on a good quantity of legitimate transactions and then identify anomalies by
comparing each new data point with them.

SVM supports multiple kernels, and the hyperparameters can be tuned depending on the kernel that
is used; the level of performance may differ accordingly. As a supervised method, SVM requires the
target variable to be specified.

Support Vector Machine handles non-linear relations in the data by using a kernel
function, which maps the data into a higher-dimensional space where a linear hyperplane can be
used to separate the classes.
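
What "mapping into a higher-dimensional space" means can be checked numerically on a small case. The sketch below uses the homogeneous degree-2 polynomial kernel on 2-D inputs, chosen here only because its explicit feature map is short to write down; it is not claimed to be the kernel configuration used in the project, and the vectors are arbitrary.

```python
from math import sqrt

def poly2_kernel(x, z):
    """Homogeneous polynomial kernel of degree 2: K(x, z) = (x . z) ** 2."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2

def phi(x):
    """Explicit feature map for the same kernel: 2-D input -> 3-D feature space."""
    return (x[0] ** 2, sqrt(2) * x[0] * x[1], x[1] ** 2)

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

x, z = (1.0, 2.0), (3.0, -1.0)
k_direct = poly2_kernel(x, z)      # computed in the original 2-D space
k_mapped = dot(phi(x), phi(z))     # inner product in the 3-D feature space

print(k_direct, k_mapped)
```

Both numbers agree, so the kernel gives the inner product in the larger space without ever constructing the mapped vectors. scikit-learn's `poly` kernel adds scaling and offset parameters on top of this basic form.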

Importance
This algorithm is useful to deal with imbalanced data-related issues such as Fraud Detection. SVM
is also used to predict stock market trends.

Analysis
Since the correlation matrix showed hardly any relationship between the variables, this model
has been fitted with four kernels: linear, poly, rbf and sigmoid. The model’s performance in
terms of the confusion matrix, accuracy score and classification report is shown below:

On evaluating this model on our dataset, an accuracy of 99.69% is obtained using the ‘poly’,
‘rbf’ and ‘sigmoid’ kernels. Though the model is able to identify the Non-fraud class correctly,
it fails to identify the fraud class, as shown by the low recall and precision scores for that
class.

Limitations
SVM is not suitable to be used for large datasets.
A relatively small number of mislabelled examples can dramatically decrease the performance.

3. Decision Tree
The decision tree algorithm works for both categorical and continuous dependent variables. It is
a supervised learning algorithm which can handle classification and regression problems. For both
problems, the algorithm breaks down a dataset into smaller subsets by using if-then-else decision
rules on the features of the data.

The decision tree is a structure that contains a root node, branches and leaf nodes. Every
internal node indicates a test on an attribute, every branch indicates an outcome of the test and
each leaf node holds a class label. The uppermost node in the tree is the root node. A decision
tree classifies instances by sorting them down the tree from the root to some leaf node, which
delivers the classification of the instance. Each node in the tree specifies a test of some
attribute of the instance, and each branch descending from that node corresponds to one of the
possible values of this attribute.

In the process of building the tree, the algorithm selects the most important features in a
top-down approach, creating the decision nodes and branches of the tree and making predictions at
the points where the tree cannot be expanded any further. The goal when building a decision tree
is to reduce entropy at each split.
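
The entropy reduction achieved by a split (its information gain) can be computed directly. The sketch below is a plain-Python illustration on made-up labels; `entropy` and `information_gain` are helper names introduced here. Note that scikit-learn's decision tree uses Gini impurity unless `criterion="entropy"` is set, and the report does not state which criterion was used.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

parent = [0, 0, 0, 0, 1, 1, 1, 1]           # evenly mixed node: entropy of 1 bit

# A split that separates the classes perfectly removes all the entropy.
gain_perfect = information_gain(parent, [0, 0, 0, 0], [1, 1, 1, 1])

# A split whose children are as mixed as the parent removes none.
gain_useless = information_gain(parent, [0, 0, 1, 1], [0, 0, 1, 1])

print(gain_perfect, gain_useless)
```

At each node the tree-building algorithm evaluates candidate splits in this way and keeps the one with the largest gain.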

Decision trees are powerful and popular tools for classification and prediction.

Decision tree showing the analysed credit card data

Importance
In financial analytics this algorithm helps in making major decisions around investment analysis.
For example, it can help to determine interest rates, the value of bonds and other investment
tools by analysing the effects.

Analysis
This model showed a high recall score for the Non-fraud transactions but performed poorly in
identifying the fraud cases.

Limitations
Overfitting is one of the most common problems; the Random Forest algorithm is often used to
address it. At times, the predictive power on unseen data is reduced. Training the model can be
time consuming.

Observations
• Machine Learning models learn more general patterns by looking at lots of examples.
• When fraudsters make small tweaks, the model still recognizes them as suspicious since the
behaviour is unlike anything it has seen from legitimate customers / transactions.
• Machine Learning models are not just good at finding risky patterns; they are also much less
brittle than rules.
• Models can be compared on measures such as accuracy, true positives, false positives and the
training data used.
• Class imbalance affects the model’s performance.
• When we train our model on a balanced dataset, we usually end up with a balanced validation
dataset as well, and we normally pick the threshold that optimises performance on that
validation dataset. Ultimately, however, the model needs to be run in production, and
production data is not balanced.
• When choosing the class balance for training, it is not obvious that the class balance that
works best at validation will be the same one that works best against imbalanced production
data.

• Imbalanced training data does not do best on either the validation or the evaluation data,
but the class balance that is optimal on production data might not be the optimal one on
validation data. So, it is important to ensure that we are looking at both.
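
The gap between balanced validation data and imbalanced production data can be shown with a short calculation. The sketch assumes a classifier whose true positive rate and false positive rate stay fixed while the share of fraud in the data changes; the rates 0.90 and 0.05 are hypothetical, and `precision_at_prevalence` is a helper name introduced here.

```python
def precision_at_prevalence(tpr, fpr, prevalence):
    """Precision of a classifier with fixed TPR/FPR when `prevalence` of cases are fraud."""
    true_pos = tpr * prevalence            # frauds that get flagged
    false_pos = fpr * (1.0 - prevalence)   # legitimate transactions that get flagged
    return true_pos / (true_pos + false_pos)

tpr, fpr = 0.90, 0.05   # hypothetical rates measured at one fixed threshold

balanced = precision_at_prevalence(tpr, fpr, 0.5)       # balanced validation set
production = precision_at_prevalence(tpr, fpr, 0.0025)  # 0.25% fraud, as in this dataset

print(f"precision on balanced data:   {balanced:.3f}")
print(f"precision on production data: {production:.3f}")
```

The same threshold that looks excellent on balanced data flags mostly legitimate transactions once fraud is rare, which is why results should be checked on data with the production class balance as well.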

Limitations
• A large amount of training data is required so that the algorithm can automatically tell good
from bad.
• Inadequate training data is the biggest drawback: when the quality and availability of the
training data are limited, the model can only be as good as the data.
• If the training data does not capture an attack pattern, it becomes very difficult
for the algorithm to identify it.
• Supervised Machine Learning usually fails to capture new attack patterns, so the model
needs constant tuning and retraining to keep pace.
• Incorrect flagging is also a major concern. The dataset needs to be flagged / labelled
correctly so that the output of the analysis is correct.
• Unbalanced data means that one class is more frequent than the other. Most supervised
algorithms are sensitive to unbalanced data.
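
One common way to reduce this sensitivity is to weight each class inversely to its frequency during training. The sketch below computes such weights with the formula n_samples / (n_classes * class_count), which is the heuristic scikit-learn documents for `class_weight="balanced"`. This mitigation is offered as an illustration and is not described in the report; the label counts are the assumed ones for this dataset (25 fraud rows out of 9961).

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Assumed label column: 9936 Non-Fraud rows and 25 Fraud rows.
labels = [0] * 9936 + [1] * 25
weights = balanced_class_weights(labels)

print(weights)
```

With these weights each class contributes the same total weight to the training loss, so errors on the rare fraud rows are no longer drowned out by the majority class.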

Conclusion
A sound fraud detection system is of utmost importance. The Logistic Regression algorithm gave
the best result of the three algorithms, with an accuracy of almost 100 percent. Logistic
Regression can minimize the fraud rate and is easy to implement.

The accuracy of fraud prevention can be improved through Machine Learning based methods using
credit card behavioural data.

References:
Wikipedia
Machine Learning Group — ULB, Credit Card Fraud Detection (2018), Kaggle
Amazon Web services
Sculley, D., et al. Google Inc.
https://www.ijcsmc.com
https://towardsdatascience.com
https://www.financetrain.com
