Final Report
NAME Reg no
K.J.S.K. Rohith 11801456
A. Gokul Sai Kumar Reddy 11802934
I. Sai Tharun 11801454
B. Akhilesh Gandhi 11801351
Supervisor Name : Harwant Singh Arri UID : 12975 Designation : Assistant Professor
SR.NO. NAME OF STUDENT Prov. Regd. No. BATCH SECTION CONTACT NUMBER
PROPOSED TOPIC : Real Time Credit Card Fraud Detection using machine learning model.
2 Project Feasibility: Project can be timely carried out in-house with low-cost and available resources in the University by the students. 6.82
3 Project Academic Inputs: Project topic is relevant and makes extensive use of academic inputs in UG program and serves as a culminating effort for core study area of the degree program. 6.45
4 Project Supervision: Project supervisor is technically competent to guide students, resolve any issues, and impart necessary skills. 7.91
5 Social Applicability: Project work intends to solve a practical problem. 6.45
6 Future Scope: Project has potential to become basis of future research work, publication or patent. 7.09
PAC Member (HOD/Chairperson) Name: Harwant Singh Arri UID: 12975 Recommended (Y/N): Yes
PAC Member (Allied) Name: Dr.Max Bhatia UID: 16870 Recommended (Y/N): Yes
Final Topic Approved by PAC: Real Time Credit Card Fraud Detection using machine learning model.
PAC CHAIRPERSON Name: 13897::Dr. Deepak Prashar Approval Date: 22 Mar 2022
5/2/2022 3:27:33 PM
DECLARATION
We hereby declare that the project work titled Real-time Credit Card Fraud Detection Using Machine
Learning and Research Paper is an authentic record of our work completed as part of the Capstone Project
for the award of a B. Tech degree in Computer Science and Engineering from Lovely Professional
University, Phagwara, from January to May 2022, under the supervision of Mr. Harwant Singh Arri. All of
the information in this capstone project report is accurate and based on our extensive research. Project group
number KC091.
K.J.S.K. Rohith
11801456
A. Gokul Sai Kumar Reddy
11802934
I. Sai Tharun
11801454
B. Akhilesh Gandhi
11801351
CERTIFICATE
This is to certify that the declaration statement made by this group of students is correct to the best of my
knowledge and belief. They have completed this Capstone Project under my guidance and supervision. The
present work is the result of their original investigation, effort, and study. No part of the work has ever been
submitted for any other degree at any University. The Capstone Project is fit for submission in partial
fulfillment of the conditions for the award of a B.Tech degree in Computer Science and Engineering from
Lovely Professional University, Phagwara.
Designation
Phagwara, Punjab.
Date:
Acknowledgement
We take this opportunity to express our gratitude to everyone who has supported us throughout the B.Tech
program. We thank them for their enthusiasm, constructive criticism, and advice during this work, and for
sharing their genuine and enlightening ideas on the many issues related to this project.
Most of all, we would like to thank Mr. Harwant Singh Arri (12975) of Lovely Professional University for his
support and guidance. His suggestions and instructions were key in helping us put this work together and
made a major contribution to the implementation of the project.
TABLE OF CONTENT Page No
1. Introduction 8
2. Profile of Problem 10
2.1. Observation 10
3. Existing Systems 11
5. Methodology 16
6. Algorithms Used 18
6.2. XG Boost 19
7. API (NODEJS) 22
7.1. TECHNOLOGIES 22
7.2. Modules 22
7.3. External API’s Used 23
8. User Functionalities 23
8.1.MindMap 25
8.2 Architecture 26
9. PROJECT LEGACY 26
10. Conclusion 27
1. Introduction
Because of the rapid expansion of the e-commerce business, there has been a significant increase in the
use of credit cards for online purchases, resulting in an increase in fraud. In recent years, banks have
found it extremely difficult to detect fraud in the credit card system. Machine learning is essential for
identifying credit card fraud in transactions. To improve prediction, banks use a variety of machine
learning methodologies together with historical data and new features. The sampling approach applied to
the data set, the variable selection, and the detection algorithms all have a substantial impact on credit
card fraud detection performance. The data set used here was obtained from Kaggle and contains a total of
284,807 credit card transactions from a European bank.
Credit card fraud manifests itself in a variety of ways. It can happen online, over the phone, over text, or
in person. Fake emails may fool you, data breaches can expose your information, and your credit cards can
be stolen from your mailbox. These are only a handful of the possibilities. The following are some of the
most prevalent kinds of credit card fraud:
• Scammers steal a cardholder's credit card and personal information, then use it to make online or
phone purchases. Such card-not-present (CNP) fraud is difficult to detect since there is no physical
card to inspect and the merchant cannot verify the buyer's identity.
• Criminals use stolen personal information (name, address, birthday, and social security number) to
apply for credit cards. This type of fraud might go undetected until the victim applies for credit or
checks their credit record. Because of the cards' fraud protection, the victim is typically not held
accountable for any purchases made using fake credit card accounts; however, this type of fraud may
have a negative impact on the victim's credit score.
• Scammers phone credit card companies pretending to be the cardholder after obtaining personal
information. They then change the account's passwords and PIN codes in order to gain control of the
account. This type of credit card fraud will almost certainly be detected when the cardholder attempts
to use their card or log in to their account online.
• Despite the increasing usage of credit cards, credit card skimming is a common practice. Skimmers
are devices that steal information from the magnetic strip on the back of a credit card. Scammers
place them at ATMs, retail establishments, gas stations, and other locations that accept credit cards.
The information is then sold to other fraudsters or used to make charges on your credit card.
• Unsecured websites may be used by hackers or fraudsters to steal the card's private information.
Everyone involved in the process suffers when a fraudster compromises an individual's credit/debit card,
from the individual whose confidential data has been disclosed to the organisations (usually banks)
that issue the card and the merchant who completes the sale. As a result, it's critical to spot
fraudulent transactions as soon as possible. Financial institutions and industries such as e-commerce
are taking proactive measures to identify and block scammers. Various powerful machine learning
technologies are in use, examining each transaction and catching fraudsters in the act by analysing
behavioural data and transaction trends. "Credit card fraud detection" is the method of automatically
distinguishing between fraudulent and legitimate users.
The goal of this project is to determine whether or not a credit card transaction is fraudulent, based on
a data set taken from the Kaggle website. The dataset covers credit card transactions made by
European cardholders in September 2013. Python was chosen for this project because it is simple to
use, supports a variety of approaches, and offers a large number of ready-made machine learning algorithms.
2. Profile of Problem
Part of the credit card fraud detection problem involves modelling prior credit card
transactions together with the knowledge of which ones turned out to be fraudulent. This model is then
used to detect whether a new transaction is fraudulent. Our objective is to detect all fraudulent
transactions while minimising the number of false positives.
2.1. Observation
Only a small percentage of transactions are genuinely fraudulent (less than 1%). With 492 frauds in a
total of 284,807 observations, the data set is heavily imbalanced: only 0.172% of the cases are
fraudulent. This imbalance reflects the fact that fraudulent transactions are rare in practice.
3. Existing Systems
Assume you're hired to assist a credit card organization in detecting probable fraud cases so that
clients are protected from being charged for products they didn't buy. You are provided a dataset
comprising transactions between persons, as well as information about whether or not they are
fraudulent, and you must distinguish between them. This is the case we'll be dealing with. Our
ultimate goal is to address this problem by developing classification models that can categorise and
distinguish fraud transactions.
3.1. Existing Software
For detecting fraud, credit card companies have developed incredibly sophisticated systems. They
keep track of every transaction made with each card. Then, to hunt for anomalous transactions, credit
card companies employ complex computer algorithms. Your card issuer may flag your card for
probable fraud if you rarely leave your city and your card is used in another state. Flagged
transactions can have a range of outcomes depending on the company. Some issuers send text messages
or automated phone calls so that you can confirm whether or not you made the transaction.
We use four different machine learning algorithms in this project to predict fraudulent credit card
transactions accurately. Our target is for each algorithm to achieve over 90% accuracy. The software
is intended to detect fraudulent transactions reliably and thereby make payments more secure.
4. Software Design and Architecture
[Figure: design flow with the stages Imbalanced Data, Split Data, Test Data, Algorithm]
4.1.1. Imbalanced Data
Imbalanced data are datasets with an uneven distribution of observations in the target class: one
class label has a large number of observations while the other has a small number.
Assume XYZ is a bank that offers credit cards to its clients. The bank is concerned that some
fraudulent transactions are taking place, and on reviewing its records it finds that just 20 fraud
cases were recognised for every 1000 transactions. The fraud rate is therefore 2 per 100
transactions, or 2%, which means that 98 percent of transactions are "No Fraud." The "No Fraud"
class is the majority class, while the "Fraud" class is the minority class.
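The XYZ bank figures also show why accuracy alone can mislead on imbalanced data. The following is a minimal sketch using made-up labels that mirror the 20-in-1000 example; the function name `accuracy` and the variables are illustrative assumptions, not project code. A "model" that always predicts "No Fraud" reaches 98% accuracy while catching no fraud at all.

```python
# Hypothetical labels mirroring the XYZ bank example: 20 frauds per 1000 transactions.
y_true = [1] * 20 + [0] * 980          # 1 = Fraud, 0 = No Fraud

# A trivial "model" that always predicts the majority class.
y_pred = [0] * len(y_true)


def accuracy(actual, predicted):
    """Fraction of predictions that match the true labels."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


# Count the fraud cases that were actually flagged as fraud.
frauds_caught = sum(1 for a, p in zip(y_true, y_pred) if a == 1 and p == 1)

baseline_accuracy = accuracy(y_true, y_pred)
print(f"accuracy = {baseline_accuracy:.2%}, frauds caught = {frauds_caught} of 20")
```

This is why scores that focus on the minority class, such as the F1 score, are reported next to accuracy.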
4.1.2. Split Data
Data splitting is the division of data into two or more subsets. In a two-part split, one part is
typically used to train the model, while the other is used to evaluate or test it. Data splitting is
a crucial step in data science, especially for constructing data-driven models.
Training data is the (usually large) dataset used to teach a machine learning model. From it,
prediction models learn to extract the attributes that are significant to a given business goal. The
training data for supervised ML models is labelled, whereas the data used to train unsupervised
models is unlabelled. Additional data sets, known as validation and testing sets, complement the
training data and are used to tune and evaluate the model.
4.1.4. Algorithm
An algorithm is a set of well-defined instructions for addressing a certain issue. It takes a set of
inputs and returns the desired outcome. We employ four distinct machine learning algorithms in our
project: Logistic Regression, XG Boost, Random Forest, and K Nearest Neighbor (KNN).
A confusion matrix is a table that displays how well a classification model works on a set of
test data with known true values. The confusion matrix itself is simple, but the terminology
associated with it can be confusing. It is the basis for measures such as Recall, Precision, and
Accuracy, and is typically reported alongside the AUC-ROC curve. Its four cells are:
TP = (True Positive)
FP = (False Positive)
FN = (False Negative)
TN = (True Negative)
Accuracy score = (TP + TN) / (TP + TN + FP + FN)
4.1.7. F1 score
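As a sketch of how these scores follow from the four confusion-matrix counts, the snippet below computes accuracy, precision, recall, and the F1 score, which is the harmonic mean of precision and recall. The function name `classification_scores` and the counts are hypothetical and chosen only for illustration; they are not results from this project.

```python
def classification_scores(tp, fp, fn, tn):
    """Derive common metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts, for illustration only.
scores = classification_scores(tp=80, fp=20, fn=20, tn=880)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

Because the F1 score ignores true negatives, it is not inflated by the large "No Fraud" class in the way accuracy is.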
5. Methodology
We'll be working with the Kaggle Credit Card Fraud Detection dataset. It contains the features V1 to
V28, which are principal components obtained using PCA. We'll disregard the 'Time' feature because it
is not useful for model building. The remaining features are 'Amount', which contains the amount
of money being transacted, and 'Class', which indicates whether or not the transaction is a
fraud case.
Let's see how many fraud cases and non-fraud cases are in our data set, and compute the proportion
of fraud cases among all recorded transactions. There are only 492 fraud cases out of a total of
284,807 samples, or 0.17 percent of all samples. As a consequence, the data we're dealing with is
extremely imbalanced and should be treated with caution when modelling and evaluating.
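# d_f is the pandas DataFrame holding the Kaggle transactions
cases = len(d_f)
nonfraud_count = len(d_f[d_f.Class == 0])
fraud_count = len(d_f[d_f.Class == 1])
# Share of fraud cases among all recorded transactions, in percent
fraud_percentage = round(fraud_count / cases * 100, 2)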
5.2. Data Split & Feature Selection
In this step we define the independent variables (X) and the dependent variable (Y). Based on these variables,
we divide the data into a training set and a testing set, which are used for modelling and evaluation
respectively. The split can be done simply with scikit-learn's train_test_split function.
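The idea behind the split can be sketched as follows. This is a minimal standard-library illustration of a stratified train/test split (one that keeps the fraud ratio similar in both parts); it is not the project's code and not scikit-learn's implementation. The function name `stratified_split`, the 80/20 ratio, and the toy data are assumptions.

```python
import random


def stratified_split(X, y, test_fraction=0.2, seed=0):
    """Split rows into train/test parts, preserving each class's share."""
    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = {}
    for index, label in enumerate(y):
        by_class.setdefault(label, []).append(index)

    # Shuffle within each class and move the same fraction to the test part.
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])

    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test


# Toy data: 1000 rows with one feature each, 20 of them labelled as fraud.
X = [[float(i)] for i in range(1000)]
y = [1] * 20 + [0] * 980

X_train, X_test, y_train, y_test = stratified_split(X, y)
print(len(X_train), len(X_test), sum(y_train), sum(y_test))
```

Stratifying matters for a data set this imbalanced, since a purely random split could leave very few fraud cases in the test set.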
6. Algorithms Used
We used four algorithms: Logistic Regression, XG Boost, Random Forest, and K Nearest Neighbor
(KNN).
6.1. Logistic Regression
Logistic regression is a supervised machine learning technique that models the probability of a
discrete outcome given an input variable. The most commonly used logistic regression models have a
binary outcome, such as true or false, yes or no. The output of the logistic function is confined
between 0 and 1.
▪ X is input value.
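The bounded output can be illustrated with the logistic (sigmoid) function, which maps any real input value X to a value between 0 and 1. The snippet is a standard-library sketch, not taken from the project; the function name `sigmoid` and the 0.5 decision threshold are illustrative assumptions.

```python
import math


def sigmoid(x):
    """Logistic function: maps any real number to a value between 0 and 1."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # Equivalent form that avoids overflow in exp() for large negative inputs.
    z = math.exp(x)
    return z / (1.0 + z)


for x in (-6, -1, 0, 1, 6):
    p = sigmoid(x)
    label = 1 if p >= 0.5 else 0  # assumed decision threshold of 0.5
    print(f"x = {x:>2}  probability = {p:.4f}  predicted class = {label}")
```

In a fitted model, the input to this function is a weighted sum of the transaction's features, and the output is read as the probability of the positive class.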
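from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
import matplotlib.pyplot as plt

# plot_confusion_matrix and upload_image appear to be project helpers (not shown);
# X_train, X_test, y_train, y_test and logisticregression_img (image path) are defined elsewhere.
def logistic_regression():
    lr = LogisticRegression(solver="lbfgs")
    lr.fit(X_train, y_train)
    lr_yhat = lr.predict(X_test)
    lr_matrix = confusion_matrix(y_test, lr_yhat, labels=[0, 1])
    lr_cm_plot = plot_confusion_matrix(lr_matrix,
                                       classes=['Non-Default(0)', 'Default(1)'],
                                       normalize=False,
                                       title='LogisticRegression')
    plt.rcParams['figure.figsize'] = (6, 6)
    plt.savefig(logisticregression_img)
    aws_file_url = upload_image(logisticregression_img)
    # plt.show()
    # Accuracy, F1 score, and the location of the uploaded confusion-matrix image
    return [accuracy_score(y_test, lr_yhat), f1_score(y_test, lr_yhat), aws_file_url]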
F1_score = 0.72251461988304092.
Accuracy_score = 0.9991748885221726
6.2. XG Boost
XG Boost is a scalable and highly accurate form of gradient boosting that pushes the computational
limitations of boosted tree algorithms. It was created primarily to improve the performance and
computational speed of machine learning models.
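To make the boosting idea concrete, the sketch below implements plain gradient boosting with one-split trees (stumps) on a single feature. It is not XGBoost itself, which adds second-order gradient information, regularisation, and many engineering optimisations. The function names `fit_stump` and `boost`, the hyperparameters, and the toy data are assumptions made for illustration.

```python
import math


def fit_stump(xs, residuals):
    """Find the threshold split on a 1-D feature that best fits the residuals."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, rounds=50, learning_rate=0.5):
    """Gradient boosting for binary labels with log-loss and stump learners."""
    stumps = []

    def raw_score(x):
        return sum(learning_rate * stump(x) for stump in stumps)

    for _ in range(rounds):
        # The negative gradient of log-loss w.r.t. the raw score is (y - p),
        # so each new stump is fitted to the current prediction errors.
        probs = [1.0 / (1.0 + math.exp(-raw_score(x))) for x in xs]
        residuals = [y - p for y, p in zip(ys, probs)]
        stumps.append(fit_stump(xs, residuals))

    return lambda x: 1.0 / (1.0 + math.exp(-raw_score(x)))


# Toy data: a single made-up feature; larger values are labelled as fraud.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 0, 0, 1, 1, 1]

predict = boost(xs, ys)
print(f"P(fraud | x=2) = {predict(2):.3f}, P(fraud | x=7) = {predict(7):.3f}")
```

Each round adds a small tree that corrects the errors of the ensemble built so far, which is the core mechanism that boosted tree libraries refine.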
from xgboost import XGBClassifier

# Uses the same metric imports, data splits, and helpers as logistic_regression();
# xgboost_img is the output image path.
def xgboost():
    xgb = XGBClassifier(max_depth=4)
    xgb.fit(X_train, y_train)
    xgb_yhat = xgb.predict(X_test)
    xgb_matrix = confusion_matrix(y_test, xgb_yhat, labels=[0, 1])
    xgb_cm_plot = plot_confusion_matrix(xgb_matrix,
                                        classes=['Non-Default(0)', 'Default(1)'],
                                        normalize=False,
                                        title='XGBoost')
    plt.rcParams['figure.figsize'] = (6, 6)
    plt.savefig(xgboost_img)
    aws_file_upload = upload_image(xgboost_img)
    # plt.show()
    return [accuracy_score(y_test, xgb_yhat), f1_score(y_test, xgb_yhat), aws_file_upload]
F1_score = 0.8449197860962566
Accuracy_score = 0.9994908886626171
6.3. Random Forest
Random forest is a supervised machine learning technique used to address classification and
regression issues. It generates decision trees from a variety of samples, employing the majority vote
for classification and the average for regression.
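The sampling-and-voting mechanism can be sketched with the standard library. The example trains one-split trees (stumps) on bootstrap samples and classifies by majority vote; a real random forest grows full decision trees and also samples features at random. All names (`bootstrap_sample`, `train_stump`, `forest_predict`) and the toy data are assumptions for illustration.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows at random, with replacement."""
    return [rng.choice(rows) for _ in rows]


def train_stump(rows):
    """Return the threshold on the single feature with the fewest errors.

    The stump predicts class 1 when the feature value exceeds the threshold.
    """
    best_threshold, best_errors = None, None
    for threshold, _ in rows:
        errors = sum((x > threshold) != bool(label) for x, label in rows)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


def forest_predict(thresholds, x):
    """Majority vote over all stumps (an odd count avoids ties)."""
    votes = Counter(int(x > threshold) for threshold in thresholds)
    return votes.most_common(1)[0][0]


# Toy data: (feature value, label); values are made up for illustration.
rows = [(1, 0), (2, 0), (3, 0), (4, 0), (6, 1), (7, 1), (8, 1), (9, 1)]

rng = random.Random(0)
forest = [train_stump(bootstrap_sample(rows, rng)) for _ in range(25)]

print(forest_predict(forest, 1), forest_predict(forest, 9))
```

For regression, the vote in `forest_predict` would be replaced by the average of the individual trees' outputs.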
from sklearn.ensemble import RandomForestClassifier

# Uses the same metric imports, data splits, and helpers as logistic_regression();
# randomforest_img is the output image path.
def random_forest():
    rf = RandomForestClassifier(max_depth=4)
    rf.fit(X_train, y_train)
    rf_yhat = rf.predict(X_test)
    rf_matrix = confusion_matrix(y_test, rf_yhat, labels=[0, 1])
    rf_cm_plot = plot_confusion_matrix(rf_matrix,
                                       classes=['Non-Default(0)', 'Default(1)'],
                                       normalize=False,
                                       title='Random Forest Tree')
    plt.rcParams['figure.figsize'] = (6, 6)
    plt.savefig(randomforest_img)
    aws_file_upload = upload_image(randomforest_img)
    # plt.show()
    return [accuracy_score(y_test, rf_yhat), f1_score(y_test, rf_yhat), aws_file_upload]
F1_score = 0.7727272727272727
Accuracy_score = 0.9993328885923949
6.4. K Nearest Neighbor (KNN)
K Nearest Neighbor is a supervised machine learning algorithm that can be used for both
classification and regression problems. 'K' denotes the number of closest neighbours consulted
when a new, unseen data point must be predicted or classified.
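A minimal sketch of the classification rule, using only the standard library: the query point receives the majority label among its k closest training points. The function name `knn_predict` and the toy points are assumptions; the default k = 5 mirrors the value used in the project code.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=5):
    """Classify `query` by majority vote among its k nearest training points."""
    # Pair each training point's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy data: two made-up clusters of five points each.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (1.1, 1.3), (0.8, 0.9),
          (5.0, 5.0), (5.2, 4.8), (4.9, 5.1), (5.1, 5.3), (4.8, 4.9)]
labels = [0] * 5 + [1] * 5

near_first_cluster = knn_predict(points, labels, (1.0, 1.2))
near_second_cluster = knn_predict(points, labels, (5.0, 4.9))
print(near_first_cluster, near_second_cluster)
```

Because the rule depends on distances, features on very different scales (such as a raw transaction amount next to PCA components) are usually standardised first.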
from sklearn.neighbors import KNeighborsClassifier

# Uses the same metric imports, data splits, and helpers as logistic_regression();
# knn_img is the output image path.
def k_nearest_neighbors():
    n = 5  # number of neighbours consulted for each prediction
    knn = KNeighborsClassifier(n_neighbors=n)
    knn.fit(X_train, y_train)
    knn_yhat = knn.predict(X_test)
    knn_matrix = confusion_matrix(y_test, knn_yhat, labels=[0, 1])
    knn_cm_plot = plot_confusion_matrix(knn_matrix,
                                        classes=['Non-Default(0)', 'Default(1)'],
                                        normalize=False,
                                        title='KNN')
    plt.rcParams['figure.figsize'] = (6, 6)
    plt.savefig(knn_img)
    aws_file_url = upload_image(knn_img)
    # plt.show()
    return [accuracy_score(y_test, knn_yhat), f1_score(y_test, knn_yhat), aws_file_url]
F1_score = 0.7865168653932841
Accuracy_score = 0.9993328885923949
7.API (NODEJS)
This Node.js API enables organizations or users to interact with our backend (the algorithms, which
were developed in Python).
Node.js is an open-source, server-side runtime environment based on Chrome's V8
JavaScript engine. It provides a cross-platform runtime environment with event-driven, non-blocking
(asynchronous) I/O for building highly scalable server-side JavaScript applications.
We developed the API using Node.js/Express together with the other technologies listed below.
7.1. TECHNOLOGIES
i) Nodejs
ii) Express (Express is a lightweight and adaptable Node.js web application framework that offers a
comprehensive range of functionality for online and mobile apps.)
iv) Redis (Redis is a distributed, in-memory key–value database, cache, and message broker.)
7.2. Modules
i) JsonwebToken (JSON Web Token is a proposed Internet standard for creating data with an optional
signature and/or optional encryption, whose payload holds JSON that asserts some number of claims.)
iv) Child_process (The child process module allows us to access Operating System features by running any
system command within a child process.)
v) Fs (The fs module helps us store, access, and manage data on our operating system.)
7.3. External API's Used
i) AWS S3 bucket (Amazon S3 is a cloud-based object storage solution that offers industry-leading
scalability, data availability, security, and performance.)
ii) SendGrid (SendGrid is a cloud-based SMTP service that allows you to send email without using email
servers.)
8. User Functionalities
• A user can access our API using Postman or any other API testing application.
• Signup functionality takes username, email, password, location, date of birth, and phone as
input. All fields are required, and the email must be unique.
• Login functionality takes email and password as input, validates the given data, and returns a
JWT_TOKEN.
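To show what the token returned by the login route contains, here is a standard-library Python sketch of issuing and verifying an HS256-signed JSON Web Token. The project itself uses the Node.js jsonwebtoken module, so this is an illustration of the token format only; the function names, the secret, and the claims are assumptions, and production code should rely on a maintained JWT library.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWT requires."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    padding = "=" * (-len(text) % 4)
    return base64.urlsafe_b64decode(text + padding)


def issue_token(payload: dict, secret: bytes) -> str:
    """Build header.payload.signature, signed with HMAC-SHA256."""
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, payload)
    )
    signature = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    return f"{signing_input}.{_b64url(signature)}"


def verify_token(token: str, secret: bytes) -> dict:
    """Return the claims if the signature is valid and the token has not expired."""
    signing_input, _, signature = token.rpartition(".")
    expected = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    if not hmac.compare_digest(expected, _b64url_decode(signature)):
        raise ValueError("invalid signature")
    payload = json.loads(_b64url_decode(signing_input.split(".")[1]))
    if payload.get("exp", float("inf")) < time.time():
        raise ValueError("token expired")
    return payload


SECRET = b"demo-secret"  # hypothetical; a real secret must stay private
token = issue_token({"sub": "user@example.com", "exp": int(time.time()) + 3600}, SECRET)
claims = verify_token(token, SECRET)
print(claims["sub"])
```

The client sends this token with later requests, and the server only needs the secret to check that the claims were not altered.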
Figure 12: DFD of Server response
The file-upload functionality can be accessed from a browser, for which we have created an HTML
page, or from Postman. From there, users can upload a CSV file.
The algorithm route interacts with the Python environment; it accepts the filename and the
algorithm name as input from the user.
The server builds a command string such as [node app.js user id filename algorithm] and executes it
as a subprocess using child process.
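On the receiving end, a script launched this way has to read its three arguments. The sketch below shows one way to do that in Python with argparse; the algorithm identifiers, the function names, and the JSON printed to standard output are assumptions for illustration, not the project's actual interface.

```python
import argparse
import json

# Hypothetical algorithm names; the project's actual identifiers may differ.
ALGORITHMS = ("logistic_regression", "xgboost", "random_forest", "knn")


def parse_args(argv):
    """Parse the positional arguments: user id, filename, algorithm."""
    parser = argparse.ArgumentParser(description="Run one fraud-detection model")
    parser.add_argument("user_id")
    parser.add_argument("filename", help="uploaded CSV file to score")
    parser.add_argument("algorithm", choices=ALGORITHMS)
    return parser.parse_args(argv)


def main(argv):
    args = parse_args(argv)
    # A real script would load args.filename and call the chosen model here.
    result = {"user_id": args.user_id, "file": args.filename,
              "algorithm": args.algorithm, "status": "received"}
    # Printing JSON on stdout lets the parent process parse the result.
    print(json.dumps(result))
    return result


# Example invocation with made-up values, as the parent process might pass them.
result = main(["42", "transactions.csv", "xgboost"])
```

Restricting `algorithm` to a fixed set of choices means an unexpected value is rejected before any model code runs.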
When each subprocess fires, the NodeJS application checks to see if the algorithm data already exists
in the database before proceeding and calling the Python script.
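The check-before-run step can be sketched as a simple lookup. In this illustration a Python dict stands in for the database, and `fake_model` stands in for launching the Python script; both, along with the function name `get_or_run` and the key layout, are assumptions rather than the project's implementation.

```python
def get_or_run(store, user_id, filename, algorithm, run_model):
    """Return a stored result if present; otherwise run the model and store it."""
    key = (user_id, filename, algorithm)
    if key in store:
        return store[key], True        # hit: the model is not re-run
    result = run_model(filename, algorithm)
    store[key] = result
    return result, False


calls = []


def fake_model(filename, algorithm):
    """Hypothetical stand-in for launching the Python script."""
    calls.append((filename, algorithm))
    return {"algorithm": algorithm, "file": filename}


store = {}  # stand-in for the database
first, first_hit = get_or_run(store, "42", "transactions.csv", "knn", fake_model)
second, second_hit = get_or_run(store, "42", "transactions.csv", "knn", fake_model)
print(first_hit, second_hit, len(calls))
```

Keying the lookup on user, file, and algorithm together means a repeated request returns immediately, while a new file or a different algorithm still triggers a fresh run.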
8.1.MindMap
8.2 Architecture
9. Project Legacy
Our project is presently in the implementation phase. All of the conceivable errors and data
abnormalities have been addressed, and the project requires no further adjustments. The only
thing left is to publish the web app and deploy it to online servers.
9.2 TECHNICAL AND MANAGERIAL LESSONS LEARNT
This project was quite beneficial for understanding machine learning technologies. We
also learned how product prices fluctuate in the real world. Technically, because we all belong to the
CSE/IT department, designing each module of the software, executing various test cases, and finally
merging and maintaining the entire product was a fantastic learning experience. Aside from technical
knowledge, we also learned how to operate as a team and assist one another when necessary, and
how to manage software projects.
10. Conclusion
To detect fraud in the credit card system, machine learning techniques such as logistic regression,
k-nearest neighbours, XG Boost, and random forest were utilised. The suggested system's performance is
measured using sensitivity, specificity, accuracy, and error rate. Logistic regression, k-nearest
neighbours, XG Boost, and the random forest classifier achieve accuracies of 99.9174888522%,
99.9332888592%, 99.9490888663%, and 99.9332888592%, respectively. When all four methods were
compared, the XG Boost classifier outperformed logistic regression, k-nearest neighbours, and
random forest. We evaluated each model using the evaluation metrics, chose the model most
suitable for the given case, and built the models in Python; there is, however, considerably more
mathematics and statistics behind each of the models than is covered here.
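The comparison can be reproduced from the scores reported in Section 6. The snippet below is a small sketch that ranks the four models by F1 score and by accuracy; the dictionary layout and variable names are assumptions, while the numbers are those listed with each algorithm.

```python
# Accuracy and F1 scores as reported for each algorithm in Section 6.
results = {
    "Logistic Regression": {"accuracy": 0.9991748885221726, "f1": 0.72251461988304092},
    "XG Boost": {"accuracy": 0.9994908886626171, "f1": 0.8449197860962566},
    "Random Forest": {"accuracy": 0.9993328885923949, "f1": 0.7727272727272727},
    "K Nearest Neighbor": {"accuracy": 0.9993328885923949, "f1": 0.7865168653932841},
}

best_by_f1 = max(results, key=lambda name: results[name]["f1"])
best_by_accuracy = max(results, key=lambda name: results[name]["accuracy"])

# List the models from highest to lowest F1 score.
for name in sorted(results, key=lambda n: results[n]["f1"], reverse=True):
    scores = results[name]
    print(f"{name:<20} accuracy = {scores['accuracy']:.4%}  F1 = {scores['f1']:.4f}")

print("Best by F1:", best_by_f1)
print("Best by accuracy:", best_by_accuracy)
```

The accuracies differ only in the fourth decimal place, whereas the F1 scores separate the models clearly, which is typical for a data set this imbalanced.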