
INSURANCE FRAUD DETECTION USING

MACHINE LEARNING
A Major Project Report
Submitted in partial fulfillment of the requirements for the award of the Degree of
Bachelor of Engineering with specialization in Electronics and
Communication Engineering

Submitted to

RAJIV GANDHI PROUDYOGIKI VISHWAVIDYALAYA,


BHOPAL (M.P.)
Submitted By:
PRAYAG PARASHAR
(0827EC201018)

Submitted To:
Asst. Prof. JYOTI PIPARIYA
(Department of ECE)

DEPARTMENT OF ELECTRONICS & COMMUNICATION ENGINEERING
ACROPOLIS INSTITUTE OF TECHNOLOGY & RESEARCH
INDORE MP-452001
SESSION: 2023-24

ACROPOLIS INSTITUTE OF TECHNOLOGY & RESEARCH,
INDORE (M.P.)

Department of Electronics & Communication Engineering

2023-2024

CERTIFICATE

This is to certify that the work embodied in this dissertation entitled


“Insurance Fraud Detection Using Machine Learning” is submitted
to Rajiv Gandhi Proudyogiki Vishwavidyalaya, Bhopal (M.P.), in the
Faculty of Engineering by Prayag Parashar (0827EC201018), in
partial fulfillment of the requirements for the award of the degree of
Bachelor of Engineering with specialization in Electronics and
Communication Engineering.

Date:

Internal Examiner External Examiner

ACKNOWLEDGEMENT

This dissertation would not have been possible without the guidance and the help of several individuals, who in
one way or the other, contributed and extended their valuable assistance in the preparation and completion of this
study.

Primarily, I would like to express my deep respect and gratitude towards my guide, Mr. Upendra Tiwari, who
not only guided the academic project work but also stood as a teacher and philosopher in realizing
imagination in a pragmatic way. I want to thank him for introducing me to the field of research with his abundant
knowledge and valuable guidance.

I wish to express my profound sense of gratitude to Dr. UBS Chandrawat (Head, Department of Electronics and
Communication Engineering). His presence and optimism have provided an invaluable influence on my work and
outlook. I consider it as my good fortune to have an opportunity to work under the guidance of such a wonderful
and great person.

I would also like to take this opportunity to acknowledge Dr. S.C. Sharma (Director) and other faculty members
and staff of the institute for extending all help in conducting the dissertation work directly or indirectly. They
have been a reliable source of inspiration to me, and I thank them sincerely. I would like to acknowledge my
institute, Acropolis Institute of Technology & Research, Indore, for providing all the necessary facilities to
complete my thesis work.

I am especially indebted to my parents for their love, sacrifice, and support. They are my teachers after I came to
this world and have set notable examples for me about how to live, study and work. It is all because of the values
and ethics taught by them, which made me capable of doing my work with great honesty and responsibility.

At last, I would like to thank God, since He has provided me with all his blessings, without which this work
would have been an impossible task to go with.

Prayag Parashar
(0827EC201018)

TABLE OF CONTENTS

❖ Acknowledgement

1. Introduction
➔ Introduction to Auto Insurance Fraud
➔ Introduction to Machine Learning
2. Data Pre-Processing in Machine Learning
➔ Why do we need Data Pre-Processing?
3. EDA (Exploratory Data Analysis)
➔ EDA Tools
➔ Types of EDA
4. Machine Learning Algorithms
➔ Linear Regression
➔ Logistic Regression
➔ Decision Tree
➔ Support Vector Machine (SVM)
➔ kNN Algorithm
➔ K-Means Algorithm
5. Clustering
6. Model Building
7. Existing System for Fraud Detection
➔ Private Investigators
➔ Analytics Tools
8. Technology Implemented
➔ Python - The New Generation Language
➔ Why Python is Perfect Language for Machine Learning?
➔ Technology Behind Fraud Detection
9. System Design
10. Implementation
11. Project Report
➔ Dataset Description
➔ Result
➔ Project Output after Deployment
12. Advantages of Machine Learning
13. Disadvantages of Machine Learning
14. Conclusion
15. References

INTRODUCTION

Insurance fraud is any act committed to defraud an insurance process. It occurs when a
claimant attempts to obtain some benefit or advantage they are not entitled to, or when an
insurer knowingly denies some benefit that is due. False insurance claims are insurance
claims filed with the fraudulent intention towards an insurance provider.

Insurance fraud has existed since the beginning of insurance as a commercial enterprise.
Fraudulent claims account for a significant portion of all claims received by insurers, and
cost billions of dollars annually. Types of insurance fraud are diverse and occur in all areas
of insurance. Insurance crimes also range in severity, from slightly exaggerating claims to
deliberately causing accidents or damage. Fraudulent activities affect the lives of innocent
people, both directly through accidental or intentional injury or damage, and indirectly
by the crimes leading to higher insurance premiums. Insurance fraud poses a significant
problem, and governments and other organizations try to deter such activity.

People who commit insurance fraud include:

● organized criminals who steal large sums through fraudulent business activities,
● professionals and technicians who inflate service costs or charge for services not rendered, and
● ordinary people who want to cover their deductible or view filing a claim as an opportunity to make a little money.

INTRODUCTION TO AUTO INSURANCE FRAUD

Auto insurance fraud ranges from misrepresenting facts on insurance applications and
inflating insurance claims to staging accidents and submitting claim forms for injuries or
damage that never occurred, to false reports of stolen vehicles.
There are two main categories for auto insurance fraud—hard insurance fraud and soft
insurance fraud.
Hard insurance fraud involves creating a situation that allows you to make a claim on an
auto insurance policy. A common example of this type of fraud is staging an auto wreck with
the goal of benefitting from the resulting claim.

Soft insurance fraud involves exaggerating or flat-out fabricating the damages or expenses
involved in an auto insurance policy or claim. A common example of this type of fraud is a
mechanic making unnecessary repairs to your auto with the goal of charging the insurance
company a higher bill.

Soft insurance fraud is the more common form because it’s easier to commit and harder to
spot. Hard insurance fraud is usually more costly to the insurance provider and comes with
higher penalties.

Auto fraud is one of the largest and most well-known problems that insurers face.
Fraudulent claims can be highly expensive for each insurer. Therefore, it is important to
know which claims are correct and which are not. Ideally, an insurance agent would have
the capacity to investigate each case and conclude whether it is genuine or not. However,
this process is not only time consuming but also costly. Sourcing and funding the skilled labor
required to review each of the thousands of claims that are filed each day is simply unfeasible.

This is where machine learning comes in to save the day. Once the proper data is fed to the system,
it becomes much easier to determine whether a claim is genuine or not.

INTRODUCTION TO MACHINE LEARNING

Machine Learning is a subset of Artificial Intelligence that is mainly concerned with
the development of algorithms which allow a computer to learn from data and past
experience on its own. Machine learning is a method of data analysis that automates
analytical model building. It is a branch of artificial intelligence based on the idea that
systems can learn from data, identify patterns, and make decisions with minimal human
intervention.
The process of learning begins with observations or data, such as examples, direct
experience, or instruction, in order to look for patterns in data and make better decisions in the future
based on the examples that we provide. The primary aim is to allow computers to learn
automatically, without human intervention or assistance, and to adjust actions accordingly.

Fig 1.1 Block Diagram for Machine learning Steps

Machine learning algorithms are often categorized as supervised or unsupervised.

Supervised machine learning algorithms can apply what has been learned in the past to
new data, using labeled examples to predict future events. Starting from the analysis of a
known training dataset, the learning algorithm produces an inferred function to make
predictions about the output values. The system can provide targets for any new input
after sufficient training. The learning algorithm can also compare its output with the
correct, intended output and find errors to modify the model accordingly.
Types of Supervised Learning:
● Classification: a supervised learning task where the output has defined labels
(discrete values). Example: Gmail classifies mail into more than one class, such as Social,
Promotions, Updates and Forums.
● Regression: a supervised learning task where the output has a continuous value.
Examples of Supervised Learning Algorithms (a minimal sketch follows this list):
● Linear Regression
● Nearest Neighbor
● Decision Trees
● Support Vector Machine (SVM)
● Random Forest
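
As a small illustration of learning from labeled examples, the sketch below (an illustrative example, not the project's code) fits a classifier on a handful of labeled points and then predicts the label of a new point. It assumes scikit-learn is installed; the feature values and their meaning are invented purely for illustration.

# Minimal sketch of supervised learning: fit on labeled examples, then predict new data.
# Assumes scikit-learn is installed; the tiny dataset below is invented for illustration.
from sklearn.tree import DecisionTreeClassifier

X_train = [[1200, 1], [300, 0], [5000, 1], [450, 0]]   # e.g. [claim_amount, prior_claims]
y_train = [1, 0, 1, 0]                                  # labels: 1 = fraud, 0 = genuine

model = DecisionTreeClassifier()
model.fit(X_train, y_train)            # learn a mapping from the labeled examples
print(model.predict([[4000, 1]]))      # predict the label of an unseen example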

Unsupervised machine learning algorithms are used when the information used to
train is neither classified nor labeled. Unsupervised learning studies how systems can
infer a function to describe a hidden structure from unlabeled data. The system doesn't
figure out the right output, but it explores the data and can draw inferences from
datasets to describe hidden structures from unlabeled data.
Unsupervised learning is classified into two categories of algorithms:
Clustering: A clustering problem is where you want to discover the inherent groupings in
the data, such as grouping customers by purchasing behavior.
Association: An association rule learning problem is where you want to discover
rules that describe large portions of your data, such as people that buy X also tend to buy Y.
Types of Unsupervised Learning:
● Clustering
● Exclusive (partitioning)
● Agglomerative
● Overlapping
Clustering Types:
● Hierarchical clustering
● K-means clustering
● Principal Component Analysis
● Singular Value Decomposition
● Independent Component Analysis

Semi-supervised machine learning algorithms fall somewhere in between supervised
and unsupervised learning, since they use both labeled and unlabeled data for training:
typically, a small amount of labeled data and a large amount of unlabeled data. Systems
that use this method can considerably improve learning accuracy. Usually, semi-supervised
learning is chosen when labeling the acquired data requires skilled and relevant resources
to train it / learn from it, whereas acquiring unlabeled data generally doesn't require
additional resources.
● A common example of an application of semi-supervised learning is a text
document classifier. This is the type of situation where semi-supervised
learning is ideal, because it would be nearly impossible to find a large
number of labeled text documents. It is simply not time efficient to have a
person read through entire text documents just to assign them a simple
classification.

● So, semi-supervised learning allows the algorithm to learn from a small
number of labeled text documents while still classifying a large amount of
unlabeled text documents in the training data.
How semi-supervised learning works
● Semi-supervised learning manages to train the model with less labeled
training data than supervised learning by using pseudo-labeling. This can
combine many neural network models and training methods. Here is how it
works (a minimal sketch in Python follows the steps):
● Train the model with the small amount of labeled training data just like you
would in supervised learning, until it gives you good results.
● Then use it with the unlabeled training dataset to predict the outputs, which are
pseudo labels since they may not be quite accurate.
● Link the labels from the labeled training data with the pseudo labels created in
the previous step.
● Link the data inputs in the labeled training data with the inputs in the
unlabeled data.
● Then, train the model the same way as you did with the labeled set in the
beginning to decrease the error and improve the model’s accuracy.
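
The steps above can be sketched with scikit-learn as below. This is only an illustrative outline under assumed variable names (X_lab, y_lab, X_unlab are not from the project), and the choice of logistic regression as the base model is arbitrary.

# Illustrative pseudo-labeling loop; variable names and the classifier are assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

def pseudo_label_training(X_lab, y_lab, X_unlab):
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_lab, y_lab)                      # 1. train on the small labeled set
    pseudo = clf.predict(X_unlab)              # 2. predict pseudo labels for the unlabeled set
    X_all = np.vstack([X_lab, X_unlab])        # 3-4. link labeled data with pseudo-labeled data
    y_all = np.concatenate([y_lab, pseudo])
    clf.fit(X_all, y_all)                      # 5. retrain on the combined data
    return clf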

Reinforcement machine learning algorithms constitute a learning method in which an agent
interacts with its environment by producing actions and discovering errors or rewards.
Trial-and-error search and delayed reward are the most relevant characteristics of reinforcement
learning. This method allows machines and software agents to automatically determine the ideal
behavior within a specific context in order to maximize performance. Simple reward
feedback is required for the agent to learn which action is best; this is known as the
reinforcement signal.

Types of Reinforcement: There are two types of reinforcement:

Positive –
Positive reinforcement is defined as when an event, occurring due to a particular
behavior, increases the strength and the frequency of that behavior. In other words, it
has a positive effect on behavior.
● Advantages of reinforcement learning:
● Maximizes performance
● Sustains change for a long period of time
● Disadvantages of reinforcement learning:
● Too much reinforcement can lead to an overload of states, which
can diminish the results.

Negative –
Negative reinforcement is defined as the strengthening of a behavior because a negative
condition is stopped or avoided.
● Advantages of reinforcement learning:
● Increases behavior
● Helps enforce a minimum standard of performance
● Disadvantages of reinforcement learning:
● It only provides enough to meet the minimum required behavior.

DATA PREPROCESSING IN MACHINE LEARNING

Data preprocessing is the process of preparing raw data and making it suitable for a machine
learning model.

It is the first and crucial step when creating a machine learning model.

When creating a machine learning project, it is not always the case that we come across clean
and formatted data.

Data imputation. Most ML frameworks include methods and APIs for balancing or filling in
missing data. Techniques generally include imputing missing values with the standard
deviation, mean, median or k-nearest neighbors (k-NN) of the data in the given field.

Oversampling. Bias or imbalance in the dataset can be corrected by generating more
observations/samples with methods like repetition, bootstrapping, or the Synthetic Minority Over-Sampling
Technique (SMOTE), and then adding them to the under-represented classes.
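
As an illustration of SMOTE oversampling, the sketch below uses the imbalanced-learn package on an invented, imbalanced toy dataset; the package availability and the toy data are assumptions, not the project's actual pipeline.

# Sketch of SMOTE oversampling; assumes the imbalanced-learn package is installed.
import numpy as np
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Toy imbalanced dataset (roughly 90% legitimate vs 10% fraud), invented for illustration.
X, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=0)
print(np.bincount(y))                                    # class counts before oversampling
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(np.bincount(y_res))                                # balanced class counts afterwards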

Data integration. Combining multiple datasets to get a large corpus can

overcome incompleteness in a single dataset.

Data normalization. The size of a dataset affects the memory and processing required for

iterations during training. Normalization reduces the size by reducing the order and magnitude
of data.

While doing any operation with data, it is mandatory to clean it and put it in a formatted way.
For this, we use data preprocessing.

Why do we need Data Pre-processing?

● Real-world data generally contains noise and missing values, and may be
in an unusable format that cannot be used directly for machine learning
models.
● Data preprocessing is required for cleaning the data and making it suitable
for a machine learning model, which also increases the accuracy and efficiency of the
machine learning model.
● It involves the steps below (a minimal sketch of these steps follows the list):
● Getting the dataset
● Importing libraries
● Importing datasets
● Finding missing data
● Encoding categorical data
● Splitting the dataset into training and test sets
● Feature scaling
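
A minimal Python sketch of these steps is given below. The tiny DataFrame, its column names, and the choice of mean imputation are invented purely for illustration; the project's actual dataset and columns differ.

# Sketch of the preprocessing steps listed above; the small DataFrame is invented.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"age": [30, None, 45, 52, 38],
                   "vehicle": ["car", "bike", "car", "car", "bike"],
                   "fraud": [0, 1, 0, 1, 0]})

df["age"] = df["age"].fillna(df["age"].mean())          # finding and filling missing data
df = pd.get_dummies(df, columns=["vehicle"])            # encoding categorical data
X, y = df.drop("fraud", axis=1), df["fraud"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)                 # feature scaling, fitted on training data
X_test = scaler.transform(X_test)                       # the same scaling applied to test data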

EDA (Exploratory Data Analysis)

Exploratory data analysis (EDA) is used by data scientists to analyze and investigate data sets
and summarize their main characteristics, often employing data visualization methods.

It helps determine how best to manipulate data sources to get the answers you need, making it
easier for data scientists to discover patterns, spot anomalies, test a hypothesis, or check assumptions.

EDA is primarily used to see what data can reveal beyond the formal modeling or hypothesis-testing
task, and it provides a better understanding of data set variables and the relationships
between them.

It can also help determine whether the statistical techniques you are considering for data analysis are
appropriate. Originally developed by American mathematician John Tukey in the 1970s,
EDA techniques continue to be a widely used method in the data discovery process today.

Exploratory data analysis tools

Specific statistical functions and techniques you can perform with EDA tools include:
● Clustering and dimension reduction techniques, which help create graphical
displays of high-dimensional data containing many variables.
● Univariate visualization of each field in the raw dataset, with summary statistics.
● Bivariate visualizations and summary statistics that allow you to assess the relationship
between each variable in the dataset and the target variable you’re looking at.
● Multivariate visualizations, for mapping and understanding interactions between different
fields in the data.
● K-means clustering, a clustering method in unsupervised learning where data points are
assigned into K groups, i.e., the number of clusters, based on the distance from each group’s
centroid. The data points closest to a particular centroid are clustered under the same
category. K-means clustering is commonly used in market segmentation, pattern
recognition, and image compression.
● Predictive models, such as linear regression, which use statistics and data to predict outcomes.
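
As an illustration of the non-graphical side of this work, the sketch below produces a few simple pandas summaries; the small DataFrame and its column names are invented for illustration.

# Sketch of simple non-graphical EDA with pandas; the small DataFrame is invented.
import pandas as pd

df = pd.DataFrame({"claim_amount": [1200, 300, 5000, 450, 800],
                   "region": ["urban", "rural", "urban", "rural", "urban"],
                   "fraud": [1, 0, 1, 0, 0]})

print(df.describe())                               # univariate summary statistics
print(df["region"].value_counts())                 # frequency of each category
print(pd.crosstab(df["region"], df["fraud"]))      # bivariate view against the target
print(df[["claim_amount", "fraud"]].corr())        # correlation between numeric fields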

Types of exploratory data analysis:

● Univariate non-graphical. This is the simplest form of data analysis, where the data being
analyzed consists of just one variable. Since it’s a single variable, it doesn’t deal with causes
or relationships. The main purpose of univariate analysis is to describe the data and find
patterns that exist within it.
● Univariate graphical. Non-graphical methods don’t provide a full picture of the data.
Graphical methods are therefore required. Common types of univariate graphics include:
● Stem-and-leaf plots, which show all data values and the shape of the distribution.
● Histograms, a bar plot in which each bar represents the frequency (count) or proportion
(count/total count) of cases for a range of values.
● Box plots, which graphically depict the five-number summary of minimum, first quartile,
median, third quartile, and maximum.
● Multivariate nongraphical: Multivariate data arises from more than one variable.
Multivariate non-graphical EDA techniques generally show the relationship between two or
more variables of the data through cross-tabulation or statistics.
● Multivariate graphical: Multivariate data uses graphics to display relationships between
two or more sets of data. The most used graphic is a grouped bar plot or bar chart with each
group representing one level of one of the variables and each bar within a group representing
the levels of the other variable.
● Other common types of multivariate graphics include:
● Scatter plot, which is used to plot data points on a horizontal and a vertical axis
to show how much one variable is affected by another.
● Multivariate chart, which is a graphical representation of the relationships
between factors and a response.
● Run chart, which is a line graph of data plotted over time.
● Bubble chart, which is a data visualization that displays multiple circles (bubbles)
in a two-dimensional plot.
● Heat map, which is a graphical representation of data where values are depicted by color.
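
A short matplotlib sketch of some of these univariate and bivariate graphics follows; the data values are invented for illustration.

# Sketch of common EDA plots; the arrays are invented for illustration.
import numpy as np
import matplotlib.pyplot as plt

ages = np.random.normal(40, 10, 200)
claims = ages * 50 + np.random.normal(0, 300, 200)

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].hist(ages, bins=20)          # histogram: frequency of cases per range of values
axes[1].boxplot(ages)                # box plot: the five-number summary
axes[2].scatter(ages, claims)        # scatter plot: how one variable varies with another
plt.show()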

Machine Learning Algorithms

There are many types of Machine Learning algorithms specific to different use cases. As we work with
datasets, a machine learning algorithm works in two stages. We usually split the data around 20%-80%
between the testing and training stages. Under supervised learning, we split a dataset into training data and test
data in Python ML. Following are the algorithms of Python machine learning:

1. Linear Regression-
Linear regression is one of the supervised Machine learning algorithms in Python that observes continuous
features and predicts an outcome. Depending on whether it runs on a single variable or on many features, we
can call it simple linear regression or multiple linear regression.
This is one of the most popular Python ML algorithms and often under-appreciated. It assigns optimal
weights to variables to create a line ax+b to predict the output. We often use linear regression to
estimate real values, like the number of calls or the cost of houses, based on continuous variables. The regression
line is the best line that fits Y=a*X+b to denote the relationship between the independent and dependent variables.
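
A minimal scikit-learn sketch of fitting such a line is shown below; the data points are invented for illustration.

# Sketch: simple linear regression; the points are invented for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])

reg = LinearRegression().fit(X, y)
print(reg.coef_, reg.intercept_)     # the learned slope a and intercept b of Y = a*X + b
print(reg.predict([[6]]))            # prediction for a new value of X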

2. Logistic Regression -
Logistic regression is a supervised classification algorithm in Python machine learning that finds its use in
estimating discrete values like 0/1, yes/no, and true/false, based on a given set of independent variables. We
use a logistic function to predict the probability of an event, and this gives us an output between 0 and 1. Although it
says ‘regression’, this is actually a classification algorithm. Logistic regression fits data into a logit function and is
also called logit regression.
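
A minimal sketch of logistic regression as a classifier follows; the labeled points are invented for illustration.

# Sketch: logistic regression producing 0/1 predictions and probabilities between 0 and 1.
from sklearn.linear_model import LogisticRegression

X = [[0.5], [1.0], [1.5], [4.0], [4.5], [5.0]]
y = [0, 0, 0, 1, 1, 1]

clf = LogisticRegression().fit(X, y)
print(clf.predict([[1.2], [4.8]]))          # discrete 0/1 class predictions
print(clf.predict_proba([[1.2], [4.8]]))    # event probabilities between 0 and 1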

3. Decision Tree -
A decision tree falls under supervised Machine Learning algorithms in Python and is used for both
classification and regression, although mostly for classification. This model takes an instance, traverses the
tree, and compares important features with a determined conditional statement. Whether it descends to the
left child branch or the right depends on the result. Usually, more important features are closer to the root.
Decision Tree, a Machine Learning algorithm in Python can work on both categorical and continuous
dependent variables. Here, we split a population into two or more homogeneous sets. Tree models where the
target variable can take a discrete set of values are called classification trees; in these tree structures, leaves
represent class labels and branches represent conjunctions of features that lead to those class labels.
Decision trees where the target variable can take continuous values (typically real numbers) are called
regression trees.
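
The sketch below fits a small decision tree on a standard scikit-learn toy dataset and prints the learned splits; the depth limit is an arbitrary illustrative choice.

# Sketch: a shallow decision tree on the iris toy dataset (assumes scikit-learn).
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree))   # the learned if/else splits; more important features appear near the root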

4. Support Vector Machine (SVM)-


SVM is a supervised classification algorithm and one of the most important machine learning algorithms in Python;
it plots a line that divides the different categories of your data. In this ML algorithm, we calculate the vector
to optimize the line, so that the closest points in each group lie farthest from each other. While
you will almost always find this to be a linear vector, it can be other than that. An SVM model is a
representation of the examples as points in space, mapped so that the examples of the separate categories are
divided by a clear gap that is as wide as possible. In addition to performing linear classification, SVMs can
efficiently perform non-linear classification using what is called the kernel trick, implicitly mapping their
inputs into high-dimensional feature spaces. When data are unlabeled, supervised learning is not possible,
and an unsupervised learning approach is required, which attempts to find natural clustering of the data into
groups and then maps new data to these formed groups.
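
As an illustration of the kernel trick mentioned above, the sketch below fits an SVM with an RBF kernel to a toy dataset that no straight line can separate; the parameter values are arbitrary.

# Sketch: SVM with a non-linear (RBF) kernel on a non-linearly-separable toy dataset.
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
clf = SVC(kernel="rbf", C=1.0).fit(X, y)
print(clf.score(X, y))      # training accuracy of the wide-margin classifier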

5. kNN Algorithm -
This is a Python Machine Learning algorithm for classification and regression, though mostly for classification.
It is a supervised learning algorithm that stores the training examples and uses a distance function, usually
Euclidean, to compare a new case with them. It then classifies each new case by a majority vote of k of its
neighbors: the class it assigns is the one most common among its K nearest neighbors. k-NN is a type of
instance-based learning, or lazy learning, where the function is only approximated locally and all computation
is deferred until classification. k-NN is a special case of a variable-bandwidth, kernel density "balloon"
estimator with a uniform kernel.
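
A minimal k-NN sketch follows; the points and the choice of k = 3 are invented for illustration.

# Sketch: k-nearest neighbours classification by majority vote of the 3 nearest points.
from sklearn.neighbors import KNeighborsClassifier

X = [[1, 1], [1, 2], [2, 1], [6, 6], [6, 7], [7, 6]]
y = [0, 0, 0, 1, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)    # Euclidean distance by default
print(knn.predict([[2, 2], [6, 5]]))                   # majority vote of the 3 nearest neighbours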

6. K-Means Algorithm -
k-Means is an unsupervised algorithm that solves the problem of clustering. It classifies data using a number
of clusters. The data points inside a class are homogeneous and heterogeneous to peer groups. k-means
clustering is a method of vector quantization, originally from signal processing, that is popular for cluster
analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each
observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. k-means
clustering is rather easy to apply to even large data sets, particularly when using heuristics such as Lloyd's
algorithm. It often is used as a preprocessing step for other algorithms, for example to find a starting
configuration. The problem is computationally difficult (NP-hard). k-means originates from signal
processing, and still finds use in this domain. In cluster analysis, the k-means algorithm can be used to
partition the input data set into k partitions (clusters). k-means clustering has been used as a feature learning
(or dictionary learning) step, in either (semi-)supervised learning or unsupervised learning.
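
A minimal k-means sketch follows; the points and the choice of k = 2 are invented for illustration.

# Sketch: k-means partitions unlabeled points into k clusters around their means.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 1], [1.5, 2], [2, 1], [8, 8], [8.5, 9], [9, 8]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)            # the cluster each observation is assigned to
print(km.cluster_centers_)   # the cluster means that serve as prototypes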

7. Random Forest -
A random forest is an ensemble of decision trees. To classify every new object based on its attributes, trees
vote for class- each tree provides a classification. The classification with the most votes wins in the forest.
Random forests or random decision forests are an ensemble learning method for classification, regression
and other tasks that operates by constructing a multitude of decision trees at training time and outputting the
class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees.
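
A short random forest sketch follows, using a synthetic dataset; the number of trees is an arbitrary illustrative choice.

# Sketch: a random forest of 100 decision trees voting on the class.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(rf.predict(X[:5]))           # the mode (majority vote) of the individual trees
print(rf.feature_importances_)     # relative importance of each feature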

CLUSTERING

It is basically a type of unsupervised learning method. An unsupervised learning method is a


method in which we draw inferences from datasets consisting of input data without labelled
responses. Generally, it is used as a process to find meaningful structure, explanatory underlying
processes, generative features, and groupings inherent in a set of examples. Clustering is the task of
dividing the population or data points into a number of groups such that data points in the same
groups are more similar to other data points in the same group and dissimilar to the data points in
other groups. It is basically a collection of objects on the basis of similarity and dissimilarity
between them.

For example, the data points in the graph below that are clustered together can be classified into one single
group. We can distinguish the clusters, and we can identify that there are 3 clusters in the
picture below.
Fig 1.8 Clusters of Data Points

It is not necessary for clusters to be spherical. For example:

DBSCAN: Density-Based Spatial Clustering of Applications with Noise. These data points
are clustered using the basic concept that a data point lies within a given distance constraint of the other
points in its cluster. Various distance methods and techniques are used for the calculation of outliers.

Fig 1.9 DBSCAN
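
A minimal DBSCAN sketch follows; the toy data and the eps/min_samples values are invented for illustration.

# Sketch: density-based clustering; points outside any dense region are labelled -1 (noise).
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)   # non-spherical clusters
db = DBSCAN(eps=0.3, min_samples=5).fit(X)
print(set(db.labels_))     # cluster ids found, with -1 marking outliers/noise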

MODEL BUILDING

Building machine learning models that can generalize well on future data requires thoughtful
consideration of the data at hand and of assumptions about the various available training algorithms.

Ultimate evaluation of a machine learning model’s quality requires an
appropriate selection and interpretation of assessment criteria.

Machine learning consists of algorithms that can automate analytical model building.
Using algorithms that iteratively learn from data, machine learning models enable
computers to find hidden insights in Big Data without being explicitly programmed
where to look.

This has given rise to a plethora of applications based on machine learning.
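
A hedged sketch of model building and evaluation follows, on a synthetic dataset; the model and metrics are illustrative choices, not the project's final configuration.

# Sketch: build a model, then evaluate it on a held-out test set and with cross-validation.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import classification_report

X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
print(classification_report(y_te, model.predict(X_te)))    # per-class precision and recall
print(cross_val_score(model, X, y, cv=5).mean())           # average cross-validated accuracy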

EXISTING SYSTEM

When we are talking about the current existing system of insurance fraud detection it usually
involves the insurance companies that have a separate team of people, analytics tools, policies and
private investigators to do background checks and find out if the claim made by the insured person
is fraud or not.

Sometimes, it’s obvious. For example, a claim for something the insured probably never owned
and can’t produce any documentation for, will be scrutinized. One of the front-line defenses
though is to analyze claims history. Insurance companies all share information about claims into a
central database, and both the insured and insurer generally have access to it. Was there a similar
claim shortly before you cancelled your last policy? Big red flag. Do you file a claim every few
months? Big red flag. Do you have two policies covering the same thing and you’re trying to
claim on both? Yet another, big red flag. The insurance companies usually know more about our
claims history than we do.

POLICIES IN PLACE
Insurance companies usually share a list of things that make a claim suspicious. A few of the items
on the list are:

● Claimant is in financial distress, worse off than they were when the policy started, or claiming shortly after the inception of the policy.
● Increases in policy limits shortly before an alleged loss occurs.
● Missing police reports, where appropriate.
● Valuable items removed from the premises shortly before a fire loss.
● Firefighters unable to gain access due to items blocking doors, etc.
● Slip and falls with no witnesses.
● Altered documentation.
● Delayed reporting of a claim.

These things create reasons to look deeper into the claim and investigate the authenticity of the claim.

PRIVATE INVESTIGATORS
Insurance companies sometimes hire private investigators when they need someone experienced
and capable of looking into someone that might seem suspicious to them. PIs are great for
gathering information about the insured person and gathering proof, if any. They are cheaper for
companies to afford, and they avoid the publicity that the company might face if it were to start a police
investigation for every single suspicious claim.

ANALYTICS TOOLS

Fig 1.2 Process of Fraud Detection Used by Companies

Many organizations nowadays provide a range of data analytics and pattern recognition tools that
help insurance companies detect insurance fraud at a very early stage. They also help companies
recognize patterns between similar types of fraudulent claims and deny a claim at the earliest
stage possible, so that the loss to the company is minimal and they are also able to save time.

The insurance industry has developed sophisticated algorithms to detect fraud, as well as relying
on tried-and-true methods like raising questions if multiple claim checks are being sent to the
same address.

Technology Implemented

Python – The New Generation Language


Python is a widely used general-purpose, high-level programming language. It was initially designed by
Guido van Rossum in 1991 and is developed by the Python Software Foundation. It was mainly developed with an
emphasis on code readability, and its syntax allows programmers to express concepts in fewer lines of code.
Python is dynamically typed and garbage-collected. It supports multiple programming paradigms, including
procedural, object-oriented, and functional programming. Python is often described as a "batteries included"
language due to its comprehensive standard library.

Features

● Interpreted

In Python there are no separate compilation and execution steps as in C/C++. It directly runs the program
from the source code. Internally, Python converts the source code into an intermediate form called
bytecode, which is then translated into the native language of the specific computer to run it.
● Platform Independent
Python programs can be developed and executed on multiple operating system platforms. Python
can be used on Linux, Windows, Macintosh, Solaris and many more.
● Multi-Paradigm
Python is a multi-paradigm programming language. Object-oriented programming and structured
programming are fully supported, and many of its features support functional programming and aspect-
oriented programming.
● Simple
Python is a very simple language. It is very easy to learn as it is closer to the English language. In
Python, more emphasis is on the solution to the problem rather than on the syntax.
● Rich Library Support
The Python standard library is very vast. It can help to do various things involving regular expressions,
documentation generation, unit testing, threading, databases, web browsers, CGI, email, XML, HTML,
WAV files, cryptography, GUI and many more.
● Free and Open Source
Firstly, Python is freely available. Secondly, it is open-source. This means that its source code is
available to the public. We can download it, change it, use it, and distribute it. This is called FLOSS
(Free/Libre and Open Source Software). As the Python community, we’re all headed toward one goal-
an ever-bettering Python.

Why Python Is a Perfect Language for Machine Learning?

A great library ecosystem -


A great choice of libraries is one of the main reasons Python is the most popular programming
language used for AI. A library is a module or a group of modules published by different sources
which include a pre-written piece of code that allows users to reach some functionality or perform
different actions. Python libraries provide base level items so developers don’t have to code them
from the very beginning every time. ML requires continuous data processing, and Python’s libraries
let us access, handle and transform data. These are some of the most widespread libraries you can
use for ML and AI:
o Scikit-learn for handling basic ML algorithms like clustering, linear
and logistic regressions, regression, classification, and others.
o Pandas for high-level data structures and analysis. It allows merging and filtering
of data, as well as gathering it from other external sources like Excel, for instance.
o Keras for deep learning. It allows fast calculations and prototyping, as it uses
the GPU in addition to the CPU of the computer.
o TensorFlow for working with deep learning by setting up, training,
and utilizing artificial neural networks with massive datasets.
o Matplotlib for creating 2D plots, histograms, charts, and other forms of visualization.
o NLTK for working with computational linguistics, natural
language recognition, and processing.
o Scikit-image for image processing.
o PyBrain for neural networks, unsupervised and reinforcement learning.
o Caffe for deep learning that allows switching between the CPU and the
GPU and processing 60+ mln images a day using a single NVIDIA K40
GPU.
o StatsModels for statistical algorithms and data exploration.
In the PyPI repository, we can discover and compare more python libraries.

A low entry barrier -


Working in the ML and AI industry means dealing with a great deal of data that we need to process in
the most convenient and effective way. The low entry barrier allows more data scientists to quickly
pick up Python and start using it for AI development without investing too much effort in learning
the language. In addition to this, there’s a lot of documentation available, and Python’s community is
always there to help out and give advice.

Flexibility-
Python for machine learning is a great choice, as this language is very flexible:

▪ It offers an option to choose either to use OOPs or scripting.

▪ There’s also no need to recompile the source code, developers


can implement any changes and quickly see the results.

▪ Programmers can combine Python and other languages to reach their goals.

Good Visualization Options-


For AI developers, it’s important to highlight that in artificial intelligence, deep learning, and
machine learning, it’s vital to be able to represent data in a human-readable format. Libraries like
Matplotlib allow data scientists to build charts, histograms, and plots for better data comprehension,
effective presentation, and visualization. Different application programming interfaces also simplify
the visualization process and make it easier to create clear reports.

Community Support-
It’s always very helpful when there’s strong community support built around the programming
language. Python is an open-source language which means that there’s a bunch of resources open for
programmers starting from beginners and ending with pros. A lot of Python documentation is
available online as well as in Python communities and forums, where programmers and machine
learning developers discuss errors, solve problems, and help each other out. Python programming
language is absolutely free as is the variety of useful libraries and tools.

Growing Popularity-
As a result of the advantages discussed above, Python is becoming more and more popular among
data scientists. According to Stack Overflow, the popularity of Python is predicted to grow until
2020, at least. This means it’s easier to search for developers and replace team players if required.
Also, the cost of their work maybe not as high as when using a less popular programming language

THE TECHNOLOGY BEHIND FRAUD DETECTION

In the world of Data Science, there are a great number of other methodologies and algorithms that
accurately leverage large amounts of user data. Each of them has proven to effectively perform in
some particular scenarios and situations. Machine Learning experts divide them into two main
scenarios depending on the available dataset:

Scenario 1: The dataset has a sufficient number of fraud examples.

In this case, classic machine learning or statistics-based techniques are applied to detect fraudulent
attacks. This involves training a machine learning model or employing adequate algorithms to
estimate transaction legitimacy. We’ll go through the most commonly used algorithms below.

Scenario 2: The dataset has no (or only a very small number of) fraud examples.
In the case that no previous information on fraudulent transactions was stored, the
learning model is built based on examples of legitimate transactions.
Before jumping into the commonly used learning models applied to fraud
detection, it is worth saying that the majority have the same purpose and differ only in their
mathematical characteristics. Hence, the available data becomes a decisive factor when choosing
an appropriate learning model, rather than the algorithm itself.
● Random Forest or random decision forests. This algorithm ensembles decision trees and
accurately analyzes missing data, noise, outliers and errors. It is fast to train and score and,
as a consequence, has become one of the preferred algorithms among fraud detection professionals.
● Artificial Neural Networks (ANN). This system simulates the function of the brain to
perform tasks by learning from the past, extracting rules and predicting future activity based on the
current situation. It can predict whether a transaction is fraudulent or not by classifying an
input into predefined groups.
● Support Vector Machines (SVMs). It’s an excellent prediction tool that can resolve a wide
range of learning problems, such as handwritten digit recognition, classification of web
pages and face detection. This method is capable of detecting fraudulent activity at the time
of transaction.

● K-Nearest Neighbors (KNN). Also known as the “lazy learning” algorithm due to its
simplicity: instead of making calculations once the data is introduced, it just stores it for
further classification. The KNN algorithm rests on feature similarity and its proximity. When
the nearest neighbor is fraudulent, the transaction is classified as fraudulent and when the
nearest neighbor is legal, it is classified as legal.
● Logistic Regression is a prediction algorithm borrowed by machine learning from the fields
of statistics. It's widely used for credit card fraud detection and credit scoring.
Finally, the goal of artificial intelligence in the field of insurance fraud is to make it easier for
human agents to find and investigate fraudulent claims and transactions, rather than sifting through
tons of claims in an exhausting and time-consuming way.

SYSTEM DESIGN
Fig 1.3 System Design for our model

IMPLEMENTATION

Overview-
A dataset of auto insurance claims is given. This project classifies whether a given claim is fraudulent
or not.

Dataset Description-
The insurance fraud detection dataset was obtained from Kaggle. The data set consists of anonymized
policy, vehicle and claim attributes, and each row is labelled as either a fraudulent or a genuine claim.
Dataset Source: https://www.kaggle.com/datasets/shivamb/vehicle-claim-fraud-detection
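
A hedged sketch of the kind of end-to-end pipeline used on this dataset is shown below. The CSV file name (fraud_oracle.csv) and the label column (FraudFound_P) are assumptions based on the Kaggle source above and may need to be adjusted; the model choice is illustrative rather than the project's final configuration.

# Hedged sketch of an end-to-end pipeline for the vehicle claim fraud dataset.
# The file name and label column are assumptions based on the Kaggle page and may differ.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("fraud_oracle.csv")                    # CSV downloaded from the Kaggle link above
y = df["FraudFound_P"]                                  # 1 = fraudulent claim, 0 = genuine (assumed)
X = pd.get_dummies(df.drop(columns=["FraudFound_P"]))   # one-hot encode the categorical columns

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print(accuracy_score(y_te, model.predict(X_te)))        # headline accuracy on held-out claims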

Result-
Our project successfully classifies insurance claims as fraudulent or genuine with 80% accuracy.

Project Output - After deployment

Advantages of Machine Learning

Every coin has two faces, and each face has its own properties and features. It’s time to uncover the faces
of ML, a very powerful tool that holds the potential to revolutionize the way things work.

1. Easily identifies trends and patterns -


Machine Learning can review large volumes of data and discover specific trends and patterns that would not
be apparent to humans. For instance, for an e-commerce website like Amazon, it serves to understand the
browsing behaviors and purchase histories of its users to help cater to the right products, deals, and
reminders relevant to them. It uses the results to reveal relevant advertisements to them.

2. No human intervention needed (automation) -


With ML, we don’t need to babysit our project every step of the way. Since it means giving machines the
ability to learn, it lets them make predictions and also improve the algorithms on their own. A common
example of this is anti-virus software: it learns to filter new threats as they are recognized. ML is also
good at recognizing spam.

3. Continuous Improvement -
As ML algorithms gain experience, they keep improving in accuracy and efficiency. This lets them make
better decisions. Say we need to make a weather forecast model. As the amount of data we have keeps
growing, our algorithms learn to make more accurate predictions faster.

4. Handling multi-dimensional and multi-variety data -


Machine Learning algorithms are good at handling data that are multi-dimensional and multi-variety, and
they can do this in dynamic or uncertain environments.

5. Wide Applications -
We could be an e-seller or a healthcare provider and make ML work for us. Where it does apply, it holds the
capability to help deliver a much more personal experience to customers while also targeting the right
customers.

Disadvantages of Machine Learning

With all those advantages to its powerfulness and popularity, Machine Learning isn’t perfect.
The following factors serve to limit it:

1. Data Acquisition -
Machine Learning requires massive data sets to train on, and these should be inclusive/unbiased and of
good quality. There can also be times when we must wait for new data to be generated.

2. Time and Resources -


ML needs enough time to let the algorithms learn and develop enough to fulfill their purpose with a
considerable amount of accuracy and relevancy. It also needs massive resources to function. This can mean
additional requirements of computer power for us.

3. Interpretation of Results -
Another major challenge is the ability to accurately interpret the results generated by the algorithms. We
must also carefully choose the algorithms for our purpose.

4. High error-susceptibility -
Machine Learning is autonomous but highly susceptible to errors. Suppose you train an algorithm with data
sets small enough to not be inclusive. You end up with biased predictions coming from a biased training set.
This leads to irrelevant advertisements being displayed to customers. In the case of ML, such blunders can
set off a chain of errors that can go undetected for long periods of time. And when they do get noticed, it
takes quite some time to recognize the source of the issue, and even longer to correct it.

Conclusion

In conclusion, the integration of machine learning into insurance fraud detection processes offers a
transformative leap in accuracy, efficiency, and cost-effectiveness. By harnessing the power of advanced
algorithms, insurers can proactively identify fraudulent activities in real-time, resulting in improved risk
management, cost savings, and enhanced customer experiences. As the landscape of insurance fraud
evolves, the adaptability of machine learning models becomes a critical asset in staying ahead of
sophisticated tactics. While not without challenges, the promising outcomes suggest that the synergy
between machine learning and traditional methods holds great potential for the ongoing fight against
insurance fraud.

References

➢ https://expertsystem.com/
➢ https://www.geeksforgeeks.org/
➢ https://www.wikipedia.org/
➢ https://www.coursera.org/learn/machine-learning
➢ https://machinelearningmastery.com/
➢ https://towardsdatascience.com/machine-learning/home
