
CAPSTONE PROJECT REPORT

(Project Term Jan-May 2022)

Real-time Credit Card Fraud Detection Using Machine Learning

NAME Reg no
K.J.S.K. Rohith 11801456
A. Gokul Sai Kumar Reddy 11802934
I. Sai Tharun 11801454
B. Akhilesh Gandhi 11801351

Project Group Number: KC091


Course Code: CSE445

Under the Guidance of

Harwant Singh Arri, 12975

School of Computer Science and Engineering.


TOPIC APPROVAL PERFORMA

School of Computer Science and Engineering (SCSE)

Program : P132::B.Tech. (Computer Science & Engineering)

COURSE CODE : CSE445 REGULAR/BACKLOG : Regular GROUP NUMBER : CSERGC0091

Supervisor Name : Harwant Singh Arri UID : 12975 Designation : Assistant Professor

Qualification : ________________________ Research Experience : ________________________

SR.NO. NAME OF STUDENT Prov. Regd. No. BATCH SECTION CONTACT NUMBER

1 Alla Gokul Sai Kumar Reddy 11802934 2018 K18BH 8374095180


2 Kandepu Jai Sai Krishna Rohith 11801456 2018 K18FG 8309073121
3 Immanneni Sai Tharun 11801454 2018 K18AZ 6303889735
4 Batchu Akhilesh Gandhi 11801351 2018 K18CJ 9440979776

SPECIALIZATION AREA : Networking and Security-I Supervisor Signature: ___________________

PROPOSED TOPIC : Real Time Credit Card Fraud Detection using machine learning model.

Qualitative Assessment of Proposed Topic by PAC

Sr.No. Parameter Rating (out of 10)

1 Project Novelty: Potential of the project to create new knowledge. 6.82

2 Project Feasibility: Project can be timely carried out in-house with low-cost and available resources in the University by the students. 6.82

3 Project Academic Inputs: Project topic is relevant and makes extensive use of academic inputs in UG program and serves as a culminating effort for core study area of the degree program. 6.45

4 Project Supervision: Project supervisor is technically competent to guide students, resolve any issues, and impart necessary skills. 7.91

5 Social Applicability: Project work intends to solve a practical problem. 6.45

6 Future Scope: Project has potential to become basis of future research work, publication or patent. 7.09

PAC Committee Members

PAC Member (HOD/Chairperson) Name: Harwant Singh Arri UID: 12975 Recommended (Y/N): Yes

PAC Member (Allied) Name: Dr.Max Bhatia UID: 16870 Recommended (Y/N): Yes

PAC Member 3 Name: Ravishanker UID: 12412 Recommended (Y/N): Yes

Final Topic Approved by PAC: Real Time Credit Card Fraud Detection using machine learning model.

Overall Remarks: Approved

PAC CHAIRPERSON Name: 13897::Dr. Deepak Prashar Approval Date: 22 Mar 2022

5/2/2022 3:27:33 PM
DECLARATION

We hereby declare that the project work titled Real-time Credit Card Fraud Detection Using Machine
Learning and Research Paper is an authentic record of our work completed as part of the Capstone Project
for the award of a B. Tech degree in Computer Science and Engineering from Lovely Professional
University, Phagwara, from January to May 2022, under the supervision of Mr. Harwant Singh Arri. All of
the information in this capstone project report is accurate and based on our extensive research. Project group
number KC091.

K.J.S.K. Rohith
11801456
A. Gokul Sai Kumar Reddy
11802934
I. Sai Tharun
11801454
B. Akhilesh Gandhi
11801351

CERTIFICATE

This is to certify that the declaration statement made by this group of students is correct to the best of my
knowledge and belief. They have completed this Capstone Project under my guidance and supervision. The
present work is the result of their original investigation, effort, and study. No part of the work has ever been
submitted for any other degree at any University. The Capstone Project is fit for the submission and partial
fulfillment of the conditions for the award of a B.Tech degree in Computer Science and engineering from
Lovely Professional University, Phagwara.

Signature and Name of Mentor

Mr. Harwant Singh Arri

Designation

School of computer science and Engineering,

Lovely Professional University,

Phagwara, Punjab.

Date:

Acknowledgement

We take this opportunity to express our gratitude to everyone who has supported us throughout the B.Tech
program. We thank them for their enthusiasm, constructive criticism, and advice during this work, and for
sharing their genuine and enlightening ideas on the many issues related to this project.

Most of all, we would like to thank Mr. Harwant Singh Arri (12975) of Lovely Professional University for his
support and guidance. His suggestions and instructions were key in helping us put this work together and were
a major contribution to the implementation of the project.

TABLE OF CONTENTS

1. Introduction
1.1. Credit Card Fraud
1.2. Different Types of Credit Card Fraud
1.3. Credit Card Fraud Detection
1.4. Aim of the project

2. Profile of Problem
2.1. Observation

3. Existing Systems
3.1. Existing Software
3.2. What’s new in the system to be developed.

4. Software Design and Architecture
4.1. System Process Representation

5. Methodology
5.1. Data Processing and EDA
5.2. Data Split & Feature Selection

6. Algorithms Used
6.1. Logistic Regression
6.2. XG Boost
6.3. Random Forest
6.4. K Nearest Neighbor (KNN)

7. API (NODEJS)
7.1. TECHNOLOGIES
7.2. Modules
7.3. External API’s Used

8. User Functionalities
8.1. MindMap
8.2 Architecture

9. PROJECT LEGACY
9.1. CURRENT STATUS OF THE PROJECT
9.2 TECHNICAL AND MANAGERIAL LESSONS LEARNT

10. Conclusion

1. Introduction

1.1 Credit Card Fraud

Because of the rapid expansion of the e-commerce business, there has been a significant increase in the
use of credit cards for online purchases, resulting in an increase in fraud. In recent years, banks have
found it extremely difficult to detect fraud in the credit card system. Machine learning is essential for
identifying credit card fraud in transactions. To improve prediction quality, banks use a variety of machine
learning methodologies together with historical data and new features. The sampling approach applied to
the data set, the variable selection, and the detection algorithms all have a substantial impact on credit
card fraud detection performance. The data for this project was obtained from Kaggle and contains a total of
284,807 credit card transactions from a European bank.

1.2 Different Types of Credit Card Fraud

Credit card fraud manifests itself in a variety of ways. It can happen online, over the phone, over text, or
in person. Fake emails may fool you, data breaches can expose your information, and your credit cards can
be taken from your mailbox. These are only a handful of the possibilities. The following are some of the
most prevalent kinds of credit card fraud:

1.2.1 Card-not-present (CNP) fraud

Scammers steal a cardholder's credit card and personal information, then use it to conduct online or
phone purchases. CNP fraud is difficult to detect since there is no physical card to inspect and the
merchant cannot verify the buyer's identity.

1.2.2 Credit card application fraud

Criminals use stolen personal information (name, address, birthday, and social security number) to
apply for credit cards. This type of fraud might go undetected until the victim applies for credit or
checks their credit record. The victim is typically not held accountable for any purchases made using
the fake credit card accounts; however, this type of fraud may have a negative impact on the victim's
credit score.

1.2.3 Account takeover

Scammers phone credit card companies pretending to be the cardholder after obtaining personal
information. They then change the account's passwords and PIN codes in order to gain control of the
account. This type of credit card fraud will almost certainly be detected when the cardholder attempts
to use their card or log in to their account online.

1.2.4 Skimming of credit cards

Despite the increasing usage of credit cards, credit card skimming is a common practice. Skimmers
are devices that steal information from the magnetic strip on the back of a credit card. Scammers
place them at ATMs, retail establishments, gas stations, and other locations that accept credit cards.
The information is then sold to other fraudsters or used to make charges on your credit card.

Figure 1: US credit card losses [2011-2018]

1.3. Credit Card Fraud Detection

Hackers or fraudsters may use unsecured websites to steal a card's private information. When a
fraudster compromises an individual's credit or debit card, everyone involved in the process suffers:
the individual whose confidential data has been disclosed, the organisations (usually banks) that
issue the card, and the merchant who completes the sale. As a result, it is critical to spot fraudulent
transactions as soon as possible. Financial institutions and industries such as e-commerce are taking
proactive measures to identify and block scammers. Various powerful machine learning technologies
are in use, examining each transaction and catching fraudsters in the act by analysing behavioural
data and transaction trends. "Credit card fraud detection" is the method of automatically
distinguishing between fraudulent and legitimate users.

1.4. Aim of the project

The goal of this project is to determine whether or not a credit card transaction is fraudulent, based on
a data set taken from the online platform Kaggle. The dataset covers credit card transactions made by
European cardholders in September 2013. Python was chosen for this project because it is simple to
use, supports a variety of approaches, and offers a large number of machine learning algorithms.

2. Profile of Problem
Part of the credit card fraud detection problem involves modelling prior credit card transactions
together with the knowledge of which ones turned out to be fraudulent. This model is then used to detect
whether a new transaction is fraudulent. Our objective is to detect all fraudulent transactions while
lowering the number of false positives.

2.1. Observation

Only a small percentage of transactions are genuinely fraudulent (less than 1%). With 492 frauds in a
total of 284,807 observations, the data set is heavily imbalanced: fraud cases account for only 0.172%
of all transactions. This imbalance reflects how rare fraudulent transactions are in practice.

3. Existing Systems
Assume you're hired to assist a credit card organization in detecting probable fraud cases so that
clients are protected from being charged for products they didn't buy. You are provided a dataset
comprising transactions between persons, as well as information about whether or not they are
fraudulent, and you must distinguish between them. This is the case we'll be dealing with. Our
ultimate goal is to address this problem by developing classification models that can categorise and
distinguish fraud transactions.
3.1. Existing Software

For detecting fraud, credit card companies have developed incredibly sophisticated systems. They
keep track of every transaction made with each card. Then, to hunt for anomalous transactions, credit
card companies employ complex computer algorithms. Your card issuer may flag your card for
probable fraud if you rarely leave your city and your card is used in another state. Flagged
transactions can have a range of outcomes depending on the company. Text messages or automated
phone calls may be sent by some issuers. This allows you to verify that you are the one making the
transaction or not.

Figure 2: Fraud Detection Process

3.2. What’s new in the system to be developed.

We use four different machine learning algorithms in this project to predict fraudulent credit card
transactions accurately. Each machine learning algorithm is required to achieve over 90% accuracy.
This software helps to detect fraudulent transactions reliably and so provides more secure payments.

4. Software Design and Architecture

4.1. System Process Representation

Figure 3: DFD for Fraud Detection (stages shown: Imbalanced Data, Split Data, Test Data, Algorithm,
Confusion matrix, Accuracy score, F1 score)

4.1.1. Imbalanced Data

Imbalanced data are datasets with an uneven distribution of observations in the target class, with one
class label having a big number of observations while the other has a small number.

Assume XYZ is a bank that offers credit cards to its clients. The bank is concerned that some
fraudulent transactions are taking place, and when it reviewed its records, it discovered that
just 20 fraud cases were recognised for every 1000 transactions. As a result, the share of fraudulent
transactions is 2%, meaning that 98 percent of transactions are "No Fraud." The "No Fraud" class is
considered the majority, while the "Fraud" class is considered the minority.
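
The XYZ bank example can be checked with a few lines of Python. This is only a minimal sketch on made-up
labels (980 "No Fraud" and 20 "Fraud"); the function name class_distribution is an illustrative
assumption and is not part of the project code.

```python
from collections import Counter

def class_distribution(labels):
    """Return {label: percentage of all observations} for a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return {label: 100 * count / total for label, count in counts.items()}

# Hypothetical data mirroring the XYZ bank example: 20 frauds per 1000 transactions.
labels = [0] * 980 + [1] * 20
distribution = class_distribution(labels)
print(distribution)  # {0: 98.0, 1: 2.0}
```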

4.1.2. Split Data

When data is split into two or more subsets, it is called data splitting. One part of a two-part split is
typically used to analyse or test the data, while the other is used to train the model. Data splitting is
a crucial feature of data science, especially for constructing data-driven models.

Figure 4: Split Data into training and test data
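
Because fraud cases are rare, a purely random split can leave very few of them in the test set. The
sketch below shows one possible way to split the data while keeping the class ratio the same in both
subsets (a stratified split). It is written in plain Python for illustration only; stratified_split and
the toy data are assumptions and are not the split used in this project.

```python
import random

def stratified_split(rows, labels, test_size=0.2, seed=0):
    """Split rows/labels into train and test parts, preserving the class ratio."""
    rng = random.Random(seed)
    # Group row indices by class label.
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)  # same fraction from every class
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])

    def pick(idx):
        return [rows[i] for i in idx], [labels[i] for i in idx]

    return pick(train_idx), pick(test_idx)

# Toy data: 1000 transactions, 20 of them fraudulent (label 1).
rows = list(range(1000))
labels = [1 if i < 20 else 0 for i in range(1000)]
(train_rows, train_labels), (test_rows, test_labels) = stratified_split(rows, labels)
print(len(train_rows), len(test_rows))      # 800 200
print(sum(train_labels), sum(test_labels))  # 16 4
```

Both subsets keep the 2% fraud rate of the toy data, so the evaluation sees fraud cases in the same
proportion as training does.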

4.1.3. Train Data

Training data is the dataset used to train a machine learning model. Using training data,
prediction models that employ machine learning algorithms are taught how to extract the attributes that
are significant to certain business goals. The training data for supervised ML models is labelled, whereas
the data used to train unsupervised machine learning models is unlabelled. More generally, training data
is any set of data used to teach a program how to learn and give improved outcomes, using technologies
such as neural networks. Additional data sets known as validation and testing sets can be used alongside
it to tune and evaluate the model.

4.1.4. Algorithm

An algorithm is a set of well-defined instructions for addressing a certain issue. It takes a set of
inputs and returns the desired outcome. We employ four distinct machine learning algorithms in our
project: Logistic Regression, XG Boost, Random Forest, and K Nearest Neighbor (KNN).

4.1.5. Confusion matrix

A confusion matrix is a table that displays how well a classification model works on a set of
test data with known true values. The confusion matrix itself is simple, but the terms associated with
it can be confusing. Recall, precision, and accuracy are all computed from its four cells.

Figure 5: confusion matrix

TP = (True Positive)

FP = (False Positive)

FN = (False Negative)

TN = (True Negative)
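
To make the four cells concrete, the following sketch counts them by hand for a handful of predictions.
The labels are invented for illustration and confusion_counts is a hypothetical helper, not a function
from the project.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, FN and TN for binary labels, treating `positive` as fraud."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1  # fraud correctly flagged
        elif predicted == positive:
            fp += 1  # legitimate transaction flagged as fraud
        elif actual == positive:
            fn += 1  # fraud that was missed
        else:
            tn += 1  # legitimate transaction correctly passed
    return tp, fp, fn, tn

# Hypothetical labels for eight transactions (1 = fraud, 0 = non-fraud).
y_true = [1, 0, 0, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0]
print(confusion_counts(y_true, y_pred))  # (2, 1, 1, 4)
```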

4.1.6. Accuracy score

In classification problems, accuracy is a metric that represents the proportion of correct
predictions. This value is calculated by dividing the number of correct predictions by the total number
of predictions. The formula below is for a binary classification problem and gives a simple definition.

Accuracy score = (TP + TN) / (TP + TN + FP + FN)

4.1.7. F1 score

The F1 Score is the harmonic mean of Precision and Recall. As a result, both false positives and
false negatives are considered in this score. Although not as intuitive as accuracy, F1 is often more
useful than accuracy, especially when the class distribution is imbalanced.

F1_Score = 2 * (Precision * Recall) / (Precision + Recall)
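
A short worked example shows why F1 is the more informative score on imbalanced data. Here
Precision = TP / (TP + FP) and Recall = TP / (TP + FN). The counts are invented for a hypothetical test
set of 1000 transactions with 20 frauds, and the function names are assumptions made for this sketch.

```python
def accuracy(tp, fp, fn, tn):
    """Proportion of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)

def f1_from_counts(tp, fp, fn):
    """F1 as the harmonic mean of precision and recall; 0.0 when there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Model A flags some frauds: TP=12, FP=3, FN=8, TN=977.
print(f"{accuracy(12, 3, 8, 977):.3f} {f1_from_counts(12, 3, 8):.3f}")   # 0.989 0.686

# Model B never predicts fraud: TP=0, FP=0, FN=20, TN=980.
print(f"{accuracy(0, 0, 20, 980):.3f} {f1_from_counts(0, 0, 20):.3f}")  # 0.980 0.000
```

Model B still reaches 98% accuracy without detecting a single fraud, while its F1 score of 0 makes the
failure visible.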

5. Methodology
We'll be working with the Kaggle Credit Card Fraud Detection dataset. It consists of the principal
components generated using PCA, V1 to V28. We'll disregard the 'Time' feature because it's irrelevant
for model building. The remaining features are the 'Amount' feature, which contains the total amount
of money being transacted, and the 'Class' feature, which indicates whether or not the transaction is a
fraud case.

5.1. Data Processing and EDA

Let's see how many fraud cases and non-fraud cases are in our collection. Let us also compute the
proportion of fraud incidents in the total number of recorded transactions. We can see that there are
only 492 fraud cases out of a total of 284,807 samples, or 0.17 percent of all samples. As a
consequence, we can say that the data we're dealing with is extremely imbalanced and should be
treated with caution while modelling and appraising it.

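import pandas as pd

d_f = pd.read_csv('creditcard.csv')  # assumed local path to the Kaggle CSV file
cases = len(d_f)  # total number of transactions
nonfraud_count = len(d_f[d_f.Class == 0])
fraud_count = len(d_f[d_f.Class == 1])
fraud_percentage = round(fraud_count / cases * 100, 2)  # fraud share of all transactions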

5.2. Data Split & Feature Selection

During this step we define the independent variables (X) and the dependent variable (y). Based on these
variables, we divide the data into a training set and a testing set, which are used for modelling and
evaluation respectively. Using scikit-learn's train_test_split function, we can split the data easily.

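from sklearn.model_selection import train_test_split

X = d_f.drop('Class', axis=1).values  # feature matrix
y = d_f['Class'].values  # labels: 1 = fraud, 0 = non-fraud
# Hold out 20% of the rows for testing; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)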

6. Algorithms Used
We used four algorithms: Logistic Regression, XG Boost, Random Forest, K Nearest Neighbor
(KNN).

6.1. Logistic Regression

Logistic regression models the probability of a discrete outcome given an input variable. The most
commonly used logistic regression models have a binary outcome, such as true or false, or yes or no.
Logistic regression is a supervised machine learning technique. Its output is produced by the logistic
function, whose range is confined between 0 and 1.

Logistic Function = 1 / (1 + e^(-x))

▪ x is the input value.
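
The behaviour of the logistic function is easy to verify numerically. The snippet below is a small
illustration only; the function name logistic and the sample inputs are our own choices.

```python
import math

def logistic(x):
    """Logistic (sigmoid) function: maps any real x into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

for x in (-4, 0, 4):
    print(x, round(logistic(x), 3))
# -4 0.018
# 0 0.5
# 4 0.982
```

Large negative inputs give values near 0, large positive inputs give values near 1, and an input of 0
gives exactly 0.5, which is why the output can be read as a probability.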

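from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
import matplotlib.pyplot as plt

def logistic_regression():
    # Train on the training split and predict labels for the test split.
    lr = LogisticRegression(solver="lbfgs")
    lr.fit(X_train, y_train)
    lr_yhat = lr.predict(X_test)
    lr_matrix = confusion_matrix(y_test, lr_yhat, labels=[0, 1])
    # plot_confusion_matrix, upload_image and logisticregression_img are
    # project helpers that are not shown in this report.
    lr_cm_plot = plot_confusion_matrix(lr_matrix,
                                       classes=['Non-Default(0)', 'Default(1)'],
                                       normalize=False, title='LogisticRegression')
    plt.rcParams['figure.figsize'] = (6, 6)
    plt.savefig(logisticregression_img)
    aws_file_url = upload_image(logisticregression_img)
    # plt.show()
    # Return accuracy, F1 score and the URL of the uploaded confusion-matrix image.
    return [accuracy_score(y_test, lr_yhat), f1_score(y_test, lr_yhat), aws_file_url]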

F1_score = 0.72251461988304092

Accuracy_score = 0.9991748885221726

Figure 6: matrix of Logistic regression

6.2. XG Boost

XG Boost is a scalable and highly accurate form of gradient boosting that pushes the computational
limitations of boosted tree algorithms. It was created primarily to improve the performance and
computational speed of machine learning models.

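from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
import matplotlib.pyplot as plt

def xgboost():
    # Train boosted trees of depth at most 4 and predict labels for the test split.
    xgb = XGBClassifier(max_depth=4)
    xgb.fit(X_train, y_train)
    xgb_yhat = xgb.predict(X_test)
    xgb_matrix = confusion_matrix(y_test, xgb_yhat, labels=[0, 1])
    # plot_confusion_matrix, upload_image and xgboost_img are
    # project helpers that are not shown in this report.
    xgb_cm_plot = plot_confusion_matrix(xgb_matrix,
                                        classes=['Non-Default(0)', 'Default(1)'],
                                        normalize=False, title='XGBoost')
    plt.rcParams['figure.figsize'] = (6, 6)
    plt.savefig(xgboost_img)
    aws_file_upload = upload_image(xgboost_img)
    # plt.show()
    # Return accuracy, F1 score and the URL of the uploaded confusion-matrix image.
    return [accuracy_score(y_test, xgb_yhat), f1_score(y_test, xgb_yhat), aws_file_upload]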

F1_score = 0.8449197860962566

Accuracy_score = 0.9994908886626171

Figure 7: matrix of XG Boost

6.3. Random Forest

Random forest is a supervised machine learning technique used to address classification and
regression issues. It generates decision trees from a variety of samples, employing the majority vote
for classification and the average for regression.
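
The two ways of combining the trees' outputs can be illustrated in a few lines. The tree predictions
below are invented, and majority_vote and average_vote are hypothetical helper names used only for this
sketch.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees (classification)."""
    return Counter(tree_predictions).most_common(1)[0][0]

def average_vote(tree_predictions):
    """Return the mean of the trees' numeric predictions (regression)."""
    return sum(tree_predictions) / len(tree_predictions)

# Hypothetical outputs of five trees for a single transaction.
print(majority_vote([0, 1, 1, 0, 1]))                # 1
print(average_vote([10.0, 12.0, 11.0, 13.0, 14.0]))  # 12.0
```

Note that scikit-learn's RandomForestClassifier averages the trees' predicted class probabilities
instead of counting hard votes, so this sketch shows the textbook idea and not the library's exact
procedure.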

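from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
import matplotlib.pyplot as plt

def random_forest():
    # Train a forest of depth-limited trees and predict labels for the test split.
    rf = RandomForestClassifier(max_depth=4)
    rf.fit(X_train, y_train)
    rf_yhat = rf.predict(X_test)
    rf_matrix = confusion_matrix(y_test, rf_yhat, labels=[0, 1])
    # plot_confusion_matrix, upload_image and randomforest_img are
    # project helpers that are not shown in this report.
    rf_cm_plot = plot_confusion_matrix(rf_matrix,
                                       classes=['Non-Default(0)', 'Default(1)'],
                                       normalize=False, title='Random Forest Tree')
    plt.rcParams['figure.figsize'] = (6, 6)
    plt.savefig(randomforest_img)
    aws_file_upload = upload_image(randomforest_img)
    # plt.show()
    # Return accuracy, F1 score and the URL of the uploaded confusion-matrix image.
    return [accuracy_score(y_test, rf_yhat), f1_score(y_test, rf_yhat), aws_file_upload]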

F1_score = 0.7727272727272727

Accuracy_score = 0.9993328885923949

Figure 8: Matrix of Random Forest

6.4. K Nearest Neighbor (KNN)

KNN is a supervised machine learning algorithm that can be used to tackle both classification
and regression problems. The symbol 'K' denotes the number of closest neighbours of a new, unseen
data point that are consulted when it is predicted or classified.
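
The idea can be sketched from scratch in plain Python. The points, labels and the function name
knn_predict below are illustrative assumptions; the project itself uses scikit-learn's
KNeighborsClassifier.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=5):
    """Classify `query` by majority vote among its k nearest training points."""
    # Pair every training point's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy two-dimensional data: three points per class.
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_labels = [0, 0, 0, 1, 1, 1]
print(knn_predict(train_points, train_labels, (1.1, 1.0), k=3))  # 0
print(knn_predict(train_points, train_labels, (5.1, 5.0), k=3))  # 1
```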

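from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
import matplotlib.pyplot as plt

def k_nearest_neighbors():
    # Classify each test transaction by its n nearest training neighbours.
    n = 5
    knn = KNeighborsClassifier(n_neighbors=n)
    knn.fit(X_train, y_train)
    knn_yhat = knn.predict(X_test)
    knn_matrix = confusion_matrix(y_test, knn_yhat, labels=[0, 1])
    # plot_confusion_matrix, upload_image and knn_img are
    # project helpers that are not shown in this report.
    knn_cm_plot = plot_confusion_matrix(knn_matrix,
                                        classes=['Non-Default(0)', 'Default(1)'],
                                        normalize=False, title='KNN')
    plt.rcParams['figure.figsize'] = (6, 6)
    plt.savefig(knn_img)
    aws_file_url = upload_image(knn_img)
    # plt.show()
    # Return accuracy, F1 score and the URL of the uploaded confusion-matrix image.
    return [accuracy_score(y_test, knn_yhat), f1_score(y_test, knn_yhat), aws_file_url]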

F1_score = 0.7865168653932841

Accuracy_score = 0.9993328885923949

Figure 9: Matrix of KNN

7. API (NODEJS)

This Node.js API enables organizations or users to interact with our backend (the algorithms, which
were developed in Python).

Node.js is an open-source server-side runtime environment based on Chrome's V8 JavaScript engine.
It provides a cross-platform runtime environment with event-driven, non-blocking (asynchronous) I/O
for constructing highly scalable server-side JavaScript applications.

We developed the API using Node.js and Express; the other technologies are listed below.

7.1. TECHNOLOGIES

i) Nodejs

ii) Express (Express is a lightweight and adaptable Node.js web application framework that offers a
comprehensive range of functionality for online and mobile apps.)

iii) Mongodb (MongoDB is an open source cross-platform document-oriented database programme.
MongoDB, a NoSQL database application, uses JSON-like documents with optional schemas.)

iv) Redis (Redis is a distributed, in-memory key–value database, cache, and message broker.)

7.2. Modules

i) JsonwebToken (JSON Web Token is a proposed Internet standard for creating data with optional
signature and/or encryption, the payload of which holds JSON that asserts some number of claims.)

ii) Multer (Nodejs module that handles multipart/form-data.)

iii) BcryptJs (hashes and salts passwords to protect against rainbow table attacks.)

iv) Child_process (The child process module allows us to access Operating System features by running any
system command within a child process.)

v) Fs (module helps us store, access, and manage data on our operating system.)

7.3. External API’s Used

i) AWS S3 bucket (Amazon S3 is a cloud-based object storage solution that offers industry-leading
scalability, data availability, security, and performance.)

ii) SendGrid (SendGrid is a cloud-based SMTP service that allows you to send email without using email
servers.)

8. User Functionalities

• A user can access our API using Postman or any other API testing application.

• Users register with the server, which stores and maintains their data.

• Signup functionality takes data such as username, email, password, location, date of birth, and
phone number. All fields are required, and the email must be unique.

Figure 10: DFD of signup

Login functionality takes an email and password as input, validates the given data, and returns a
JWT_TOKEN.

Figure 11: DFD of Login

Figure 12: DFD of Server response

File-upload functionality can be accessed from a browser, where we have created an HTML page, or from
Postman. From there, users can upload a CSV file.

Figure 13: DFD of upload

The algorithm route interacts with the Python environment. It accepts the filename and the algorithm
name as input from the user.

For example, if a user calls the /xgboost algorithm route with a filename, the server builds a command
string such as [node app.js user id filename algorithm] and executes it as a subprocess using the
child process module.

Whenever an algorithm is requested, the Node.js application first checks whether the algorithm data
already exists in the database before proceeding to call the Python script.
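
On the Python side, a script launched this way has to read the user id, filename, and algorithm name
from its command-line arguments and dispatch to the matching model. The sketch below is one hypothetical
way to do that. The argument order, the ALGORITHMS registry, the placeholder runners, and the file name
transactions.csv are all assumptions made for illustration and are not taken from the project code.

```python
def make_placeholder(name):
    """Build a stand-in runner; the real project would call the model functions from Section 6."""
    def runner(filename):
        return {"algorithm": name, "file": filename}
    return runner

# Hypothetical registry mapping route names to runner functions.
ALGORITHMS = {
    name: make_placeholder(name)
    for name in ("logistic_regression", "xgboost", "random_forest", "knn")
}

def parse_args(argv):
    """Expect argv in the form: script user_id filename algorithm."""
    if len(argv) != 4:
        raise SystemExit("usage: script.py <user_id> <filename> <algorithm>")
    _, user_id, filename, algorithm = argv
    if algorithm not in ALGORITHMS:
        raise SystemExit(f"unknown algorithm: {algorithm}")
    return user_id, filename, algorithm

def main(argv):
    user_id, filename, algorithm = parse_args(argv)
    result = ALGORITHMS[algorithm](filename)
    result["user_id"] = user_id
    return result

# Simulated command line, as the server might pass it to the subprocess.
print(main(["script.py", "42", "transactions.csv", "xgboost"]))
# {'algorithm': 'xgboost', 'file': 'transactions.csv', 'user_id': '42'}
```

Validating the algorithm name against a fixed registry keeps user input from selecting arbitrary code
paths in the subprocess.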

Figure 14: DFD of python script.

8.1. MindMap

Figure 15: Mind Map

8.2 Architecture

Figure 16: Complete project

9. Project Legacy

9.1. CURRENT STATUS OF THE PROJECT

Our project is presently in the implementation phase. All of the errors and data anomalies we could
identify have been resolved, and the project does not require any further adjustments. The only
thing left is to publish the web app and deploy it to online servers.

9.2 TECHNICAL AND MANAGERIAL LESSONS LEARNT

This project was quite beneficial for understanding machine learning technologies. We also learned
how product prices fluctuate in the real world. Technically, because we all belong to the CSE/IT
department, designing each module of the software, executing various test cases, and finally
merging and maintaining the entire product was a fantastic learning experience. Aside from technical
knowledge, we also learned how to operate as a team and assist one another when necessary. We also
learned how to manage software projects.

10. Conclusion
To detect fraud in the credit card system, machine learning techniques such as logistic regression,
k-nearest neighbours, XG Boost, and random forest were utilised. The suggested system's performance is
measured using sensitivity, specificity, accuracy, and error rate. Logistic regression, k-nearest
neighbours, XG Boost, and the random forest classifier have accuracies of 99.9174888522%,
99.9332888592%, 99.9490888663%, and 99.9332888592%, respectively. When all four methods were compared,
the XG Boost classifier outperformed logistic regression, k-nearest neighbours, and random forest. We
evaluated each of the models using the evaluation metrics and chose the model most suitable for the
given case. The models were feasible to build in Python, but there is a great deal of mathematics and
statistics behind each of them.
