P4 Project Report


MEDICAL INSURANCE COST PREDICTION SYSTEM
A Project Report (Project Work I)
Submitted in partial fulfillment of the requirements for the
Degree of
BACHELOR OF TECHNOLOGY in
ELECTRONICS & COMMUNICATION
BY
DHARESH BAHETY
EN18EL301057
Under the Guidance of
MR. PARAG RAVEKAR SIR
Assistant Professor

Department of Electronics Engineering


Faculty of Engineering
MEDI-CAPS UNIVERSITY, INDORE- 453331
November 2021

Report Approval
The project work “Medical Insurance Cost Prediction System” is
hereby approved as a creditable study of an engineering subject carried
out and presented in a manner satisfactory to warrant its acceptance as
prerequisite for the Degree for which it has been submitted.
It is to be understood that by this approval the undersigned do not
endorse or approve any statement made, opinion expressed, or
conclusion drawn therein, but approve the “Project Report” only for the
purpose for which it has been submitted.

Internal Examiner

Name: Mr. Parag Ravekar Sir

Designation: Assistant Professor

Affiliation

1. ______________________

2. ______________________

3. ______________________

4. ______________________

Declaration

I hereby declare that the project entitled “Medical Insurance Cost
Prediction System”, submitted in partial fulfillment for the award of the
degree of Bachelor of Technology in ‘Electronics Engineering’ and
completed under the supervision of Mr. Parag Ravekar Sir, Assistant
Professor, Electronics Department, Faculty of Engineering, Medi-Caps
University, Indore, is an authentic work.

Further, I declare that the content of this project work, in full or in part,
has neither been taken from any other source nor been submitted
to any other Institute or University for the award of any degree or
diploma.

Signature and name of the student(s) with date

Certificate

I, Mr. Parag Ravekar, certify that the project entitled “Medical
Insurance Cost Prediction System”, submitted in partial fulfillment for
the award of the degree of Bachelor of Technology by Dharesh Bahety,
is the record of work carried out by him under my guidance and that the
work has not formed the basis of award of any other degree elsewhere.

Mr. Parag Ravekar
Assistant Professor
Electronics Department
Medi-Caps University, Indore

Dr. Ajay Kulkarni
Head of the Department
Electronics Department
Medi-Caps University, Indore

Acknowledgement

I would like to express my deepest gratitude to the Honorable Chancellor,
Shri R C Mittal, who has provided me with every facility to
successfully carry out this project, and my profound indebtedness to Dr.
Dilip K. Patnaik, Vice Chancellor, and Prof (Dr.) D.K. Panda, Pro Vice
Chancellor, Medi-Caps University, whose unfailing support and
enthusiasm have always boosted my morale. I also thank Dr. Suresh
Jain, Dean, Faculty of Engineering, Medi-Caps University, for giving me
a chance to work on this project. I would also like to thank my Head of
the Department, Dr. Ajay Kulkarni, for his continuous encouragement
for the betterment of the project.

It is due to their help and support that I was able to complete the
design and technical report. Without their support this report would not
have been possible.

Dharesh Bahety
B.Tech. IV Year
Enrollment No: EN18EL301057
Department of Electronics & Communication
Faculty of Engineering
Medi-Caps University, Indore

Abstract

Many people assume that medical insurance is expensive and therefore avoid
buying it for themselves and their families, even though it has become a
necessity. This project designs an automatic system that predicts what the
medical insurance cost of a person will be.
In this project, we design a machine learning system that learns from the
input data and predicts the insurance cost. The system first takes the
insurance cost data as raw input, which is examined through data analysis.
In data analysis, we find statistical measures of each data column, such as
count, mean, standard deviation, minimum and maximum value. We also plot
graphs to determine the distribution of Age, BMI, Gender, Number of Children,
Smoking habit, Region, Insurance Cost, etc. The raw data is converted to
processed data, and the processed data is then split into training data, used
to train the model, and test data, used to check the accuracy of the designed
model.
Whenever anyone wishes to take medical insurance from a company, this
system will act as a machine expert that predicts the estimated cost of
insurance for that person based on the input parameters provided.

Table of Contents
Report Approval
Declaration
Acknowledgement
Abstract
Table of Contents
List of Figures
Abbreviations
Chapter 1
1.1 Introduction
1.2 Literature Review
1.3 Objectives
1.4 Proposed Solution
1.5 Significance
1.6 Disadvantage
Chapter 2
Prerequisite
Chapter 3
3.1 Methodology
3.2 Tools Used
3.3 Procedure Adopted
Chapter 4
4.1 Result
4.2 Conclusion
4.3 Summary
4.4 Reference

List of Figures

S.No.  Name of Figure
1      Sample data set
2      Example linear regression curve
3      Work Flow
4      Age Distribution
5      Gender Distribution
6      BMI Distribution
7      Smoker Distribution
8      Result screenshot
9      R squared value

Abbreviations

Prof Professor
ANN Artificial Neural Network

Chapter-1

1.1 Introduction

Many people assume that medical insurance is expensive and therefore avoid
buying it for themselves and their families, even though it has become a
necessity. This project designs an automatic system that predicts what the
medical insurance cost of a person will be.

In classification, a learning algorithm takes the input data and maps it
to a discrete output such as True or False. In regression, a learning
algorithm maps the input data to a continuous output such as weight or cost.

In this project, I design a machine learning system using a linear
regression model that learns from the data and predicts what the cost will
be.

1.2 Literature Review

In “Supervised Learning Methods for Predicting Healthcare Costs” by
Morid, Kawamoto et al. [2], the authors explain that they identified
five methods of predicting medical insurance costs and evaluated the
performance of each method. Their data set consisted of approximately
90,000 individuals, 6.3 million medical claims and 1.2 million pharmacy
claims. In this comparison, the method known as “gradient boosting” had
the best predictive performance overall and was well suited to low- to
medium-cost individuals. For high-cost individuals, the highest
performance was reported for the Artificial Neural Network (ANN) and the
Ridge regression model. The authors broadly classify the methods that
have been reported for cost prediction into three kinds: rule-based,
statistical and supervised learning. They identify as a limitation of
the study that they used a single data set from a particular region, and
propose further work exploring more advanced supervised learning methods
such as deep learning and structure analysis.
The disadvantage of the rule based methods is that they require a lot of
domain knowledge, which is not easily available and is often expensive.
There are a variety of supervised learning methods that have been used in
this area. These methods include Lasso, which is a type of linear
regression, gradient boosting on regression decision trees, M5 regression
decision tree, random forest, linear regression and CART regression tree.

1.3 Objectives

• To develop a one-stop system that predicts Medical Insurance cost in no
  time.
• To help achieve transparency in cost decisions, with no hidden or
  commission costs.
• To make it easy for individuals to plan their savings and investments for
  taking Medical Insurance for themselves and their families.

1.4 Proposed Solution

• Using the designed system, a person can predict the Medical Insurance cost
  of individuals of all ages.
• This will help motivate parents to insure their children from an early age,
  as insurance prices are lower for infants and children than for adults.
• People are free to decide which policy they wish to buy and to estimate the
  cost of the policy, which is important in deciding the coverage they will
  get and wish to have.

1.5 Significance

The significance of using a linear regression model is that:

• It determines the strength of the predictors.

• It forecasts an effect.

• It supports trend forecasting.

1.6 Disadvantage

The disadvantages of the linear regression model are:

• It is limited to linear relationships.

• It looks only at the mean of the dependent variable.

• It is sensitive to outliers.

• The observations need to be independent.

Chapter-2
Prerequisite
Direct Comparison of Alternative Cost Prediction Methods
using a Health Insurer Data Set
A. Approach

We used a set of health insurance data to directly compare the performance
of the cost prediction methods outlined in the literature.

B. Data
Our data set included 1340 medical claims from different people. The available
data include demographic information (e.g., age, gender), clinical contact
information (e.g., location of service), and cost information. The data set is
divided into two periods: the observation period and the outcome period. Data
from the former were used to predict individual costs in the outcome period.
Table 1 shows all the input features used in this study. All of the features
used in this study were cost-related features, which formed the largest and
most complete set of cost-related features among the reviewed manuscripts. If
a member had no expenses in a month, the cost for that month was taken as
zero. There are no missing values in this database.

C. Classifier
The models evaluated were Linear Regression, Lasso, Ridge and CART. With
the exception of CART, some of these models have not previously been evaluated
for cost estimation. All models were tuned to obtain their best parameter
settings on 30 percent of the data set, and were then tested using several
performance measures.

A brief description of all the models used in this study is provided below:

Lasso: This is a linear regression model with variable selection and
regularization, provided by an L1-norm penalty on the least squares loss
function. The L1 penalty tends to produce sparse coefficients, meaning that
many coefficients are exactly zero and only a few are large.

Ridge: This is a linear regression model with regularization provided by an
L2-norm penalty on the least squares loss function. The L2 penalty shrinks the
coefficients toward zero, so most take small values, but unlike the L1 penalty
it does not generally set them exactly to zero.

CART: This is a regression decision tree in which, at each node, the
algorithm chooses the split that minimizes the sum of squared errors. Its key
property is that it uses the sample mean of the instances in each node as the
prediction.

Artificial Neural Network (ANN): This is a large set of processing units
(i.e., neurons) in which each unit is connected to many others. Neural
networks often have multiple layers, and the goal is to solve problems in the
same way that the human brain does.
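
For the two regularized models described above, the penalties can be written out explicitly. The notation is an assumption introduced here for illustration and does not appear in the report: y_i is the observed cost of individual i, x_i is that individual's feature vector, β is the coefficient vector with p entries, and λ ≥ 0 is the regularization strength.

```latex
\hat{\beta}_{\text{lasso}} = \arg\min_{\beta} \sum_{i=1}^{n} \left( y_i - x_i^{\top}\beta \right)^2 + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert

\hat{\beta}_{\text{ridge}} = \arg\min_{\beta} \sum_{i=1}^{n} \left( y_i - x_i^{\top}\beta \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^{2}
```

With λ = 0 both expressions reduce to ordinary least squares. The absolute-value penalty is the one that can drive coefficients exactly to zero, while the squared penalty only shrinks them.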

Features              Description
Age                   Age of the person
Gender                Gender of the person
Smoker/non-smoker     Whether the person is a smoker or not
BMI                   Body mass index of the person
Children              Number of children the person has

Table 1: Features used to develop prediction model

D. Model

Linear Regression Model: A linear regression model tries to model the
relationship between two variables by fitting a linear equation to observed
data. One variable is treated as the independent (explanatory) variable, and
the other as the dependent variable.

Before attempting to fit a linear model to observed data, a modeler should first
determine whether there is a relationship between the variables of interest.
A scatter plot can be a helpful tool in determining the strength of the
relationship between two variables. If there seems to be no association between
the proposed explanatory variable and the dependent variable, then fitting a
linear regression model to the data probably will not provide a useful model. A
valuable numerical measure of association between two variables is the
correlation coefficient, which is a value between -1 and 1 indicating the
strength of the association of the observed data for the two variables.
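
As a rough illustration of the correlation coefficient mentioned above, the sketch below computes the Pearson correlation in plain Python. The function name `pearson_r` and the sample values are assumptions made for this example; they are not taken from the project data set.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values for illustration only (not from the report's data set).
ages = [20, 30, 40, 50, 60]
costs = [2000.0, 3100.0, 3900.0, 5200.0, 6000.0]

r = pearson_r(ages, costs)
print(f"correlation between age and cost: {r:.3f}")
```

A value close to 1 or -1 suggests that a straight line may describe the relationship well, whereas a value near 0 suggests that a linear model is unlikely to help.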

A linear regression line has an equation of the form Y = mX + c, where X is the
explanatory variable and Y is the dependent variable. The slope of the line is m,
and c is the intercept.
Fig 1: Sample data set

Let us understand the algorithm for fitting a linear regression through an
example. Say we have a data set of years of experience and salary per year, as
shown in Fig 1. We plot the data in a graph and try to determine the best fit
(blue line in Fig 2). Since the graph is linear, the best fit will be of the
form Y = mX + c. The slope is determined through the formula
m = (y2 – y1)/(x2 – x1). So here the slope will be

m = (1000000 − 800000)/(5 − 4) = 200000/1 = 200000

and the intercept will be

200000 = 200000(0) + c

c = 200000

So our equation for the best fit comes out to be Y = 200000 (X + 1).

Once we determine the slope and intercept, we need to design a function that
reproduces the best-fit line and can therefore be used for prediction.
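
A minimal sketch of such a function is given below. It estimates the slope and intercept by ordinary least squares rather than from two points, which is a swapped-in but standard way of finding the best-fit line. The function names and the sample points are assumptions: the points were generated from the equation Y = 200000 (X + 1) above and are not the data in Fig 1.

```python
def fit_line(xs, ys):
    """Least squares slope m and intercept c for the line Y = mX + c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    m = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    c = mean_y - m * mean_x
    return m, c


def predict(x, m, c):
    """Evaluate the fitted line at x."""
    return m * x + c


# Hypothetical points generated from Y = 200000 (X + 1), for illustration only.
years = [1, 2, 3, 4, 5]
salary = [400000, 600000, 800000, 1000000, 1200000]

m, c = fit_line(years, salary)
print(m, c, predict(6, m, c))
```

Because the sample points lie exactly on the line, the fit recovers the slope and intercept of the equation above; with real, noisy data the fitted line would only approximate the points.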

Fig 2: Example linear regression curve

Chapter 3

3.1 Methodology

Insurance cost data → Data Analysis → Data Pre-processing → Train-Test Split
→ Model Design → Testing of trained model

Input data → Prediction of insurance cost

Fig 3: Work Flow

3.2 Description of tools used in the project

• Python 3.0: Python is an interpreted, high-level, general-purpose
  programming language. Its language constructs as well as its object-oriented
  approach aim to help programmers write clear, logical code for small and
  large-scale projects.

• Google Colab: Colaboratory, or “Colab”, is a product from Google Research.
  Colab allows anybody to write and execute arbitrary Python code through the
  browser, and is especially well suited to machine learning, data analysis and
  education.

3.3 Procedure Adopted


I collected raw insurance cost data and performed data analysis on it.

1. Collection of the raw insurance data set.

2. Uploading the CSV file containing the collected raw data set to the IDE.

3. Importing the dependencies:

Dependencies are the libraries and functions that we need for our program.
The libraries included are NumPy, pandas, Matplotlib and seaborn. The
scikit-learn utilities used are train_test_split and LinearRegression.

4. Data collection and analysis:

The data is loaded from the CSV file into a pandas DataFrame. Using the
DataFrame, I obtained some information about the data set, such as the total
number of rows and columns, the number of null values (if any), the data type
of each column and the categorical features. If there are any missing values,
we determine where they are and process them.

5. Data analysis:

A. In data analysis we analyse the data by plotting graphs.

B. First we need some statistical measures of each data column. The
statistical measures include count, mean, standard deviation, minimum value
and maximum value.

C. The next step is to plot graphs of the data set:

• Distribution of Age: to check which age range most of our data falls in.

• Distribution of the Gender column: since this is categorical data, we have
  to use a count plot.

• Distribution of BMI

• Distribution of the Children column

• Distribution of the Smoker column

• Distribution of the Region column

• Distribution of Charge

Fig 4: Age distribution

Fig 5: Gender distribution curve

Fig 6: BMI distribution

Fig 7: Smoker data distribution
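
The inspection and summary statistics described in steps 4 and 5 could be sketched with pandas as follows. The column names and the six rows are hypothetical and stand in for the real CSV file, whose name and exact columns are not given in this report.

```python
import pandas as pd

# Hypothetical rows and column names, for illustration only.
df = pd.DataFrame({
    "age": [19, 33, 45, 52, 28, 61],
    "sex": ["female", "male", "male", "female", "male", "female"],
    "bmi": [27.9, 22.7, 30.1, 35.2, 24.6, 29.0],
    "children": [0, 1, 2, 3, 0, 1],
    "smoker": ["yes", "no", "no", "yes", "no", "no"],
    "region": ["southwest", "southeast", "northwest",
               "northeast", "southeast", "northwest"],
    "charges": [16800.0, 4400.0, 8200.0, 41000.0, 3900.0, 13500.0],
})

shape = df.shape                      # (number of rows, number of columns)
null_counts = df.isnull().sum()       # missing values per column
summary = df.describe()               # count, mean, std, min, quartiles, max
sex_counts = df["sex"].value_counts() # frequencies behind a count plot

print(shape)
print(null_counts)
print(summary)
print(sex_counts)
```

The value counts are the numbers that a count plot of a categorical column displays; seaborn's `countplot` can draw them directly from the DataFrame.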

6. Pre-processing:

We cannot feed the raw data directly into our machine learning algorithm; it
first needs some processing. Once the data has been processed, it is compatible
with the machine learning algorithm and can be fed into it.
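
The report does not state which processing was applied, so the following is only one plausible sketch: converting the text-valued columns to numbers, since a regression model needs numeric input. The column names, category labels and the chosen encoding are all assumptions.

```python
import pandas as pd

# Hypothetical categorical columns, for illustration only.
df = pd.DataFrame({
    "sex": ["female", "male", "male", "female"],
    "smoker": ["yes", "no", "no", "yes"],
    "region": ["southwest", "southeast", "northwest", "northeast"],
})

encoded = df.copy()
# Two-valued columns are mapped to 0/1 (assumed encoding).
encoded["sex"] = encoded["sex"].map({"male": 0, "female": 1})
encoded["smoker"] = encoded["smoker"].map({"no": 0, "yes": 1})
# A multi-valued column is expanded into one indicator column per category,
# so that no artificial ordering is imposed on the regions.
encoded = pd.get_dummies(encoded, columns=["region"], dtype=int)

print(encoded)
```

Mapping each region to a single integer would also produce numeric input, but it would imply an order among regions that a linear model would then treat as meaningful.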

7. Train-test split:

Next step is to split our data into training data and testing data. Training data is
used to train our machine while testing data is used to evaluate the performance
of our model.
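
A minimal sketch of the split, using the `train_test_split` function named in step 3. The arrays are placeholder data, and the 80/20 ratio and the `random_state` value are assumptions, since the report does not state the values it used.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder feature matrix (10 rows, 2 features) and target vector.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Hold out 20% of the rows for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2
)

print(X_train.shape, X_test.shape)
```

Keeping the test rows out of training is what makes the later evaluation an estimate of performance on unseen data.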

8. Model design:

We feed the training data to our machine learning model. I use the Linear
Regression model because it is a statistical model suited to predictive
analysis: linear regression makes predictions for continuous or numeric
values. Once training is complete we have a trained model, to which we can
feed new data.

When new data is fed to the trained model, the model predicts the
estimated cost of medical insurance.
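
The training and prediction steps above can be sketched with scikit-learn's `LinearRegression`. The training rows are synthetic: they use only two assumed features (age and BMI) and costs generated from a made-up linear rule, so the numbers say nothing about real insurance prices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic training data: each row is [age, bmi] (assumed features).
X_train = np.array([
    [20, 22.0], [30, 30.0], [40, 25.0],
    [50, 35.0], [60, 28.0], [25, 31.0],
])
# Costs generated from a made-up rule: 250 * age + 300 * bmi + 1000.
y_train = 250.0 * X_train[:, 0] + 300.0 * X_train[:, 1] + 1000.0

# Train the model on the training data.
model = LinearRegression()
model.fit(X_train, y_train)

# Feed new input data to the trained model to get an estimated cost.
new_person = np.array([[45, 27.0]])
predicted_cost = model.predict(new_person)[0]

print(model.coef_, model.intercept_, predicted_cost)
```

Since the synthetic costs follow an exact linear rule, the fitted coefficients match that rule; on real data the fit would be approximate and should be checked on the held-out test set.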

Chapter 4
4.1 Result
A. Types of healthcare cost prediction approaches
Three kinds of methods have been used for cost prediction: rule-based,
statistical and supervised learning methods. The disadvantage of the
rule-based methods is that they require a lot of domain knowledge, which is not
easily available and is often expensive. Although statistical models, mainly
multiple regression models, are powerful tools for capturing the relationships
between the predictors and the dependent variable, they face two important
challenges. One is that working with several independent variables often causes
multicollinearity, which arises from significant correlations among the
predictors. The other is that their overall performance is challenged by the
skewed nature of healthcare cost data: there is a spike at zero, distributions
are strongly skewed with a heavy right tail, and extreme values can be
present, all of which make these models inefficient in small to medium sample
sizes if the underlying distribution is not normal. Although several
advanced statistical strategies have been proposed to deal with the skewness
found in healthcare data, this form of prediction approach is not able to
outperform supervised learning methods. Therefore, this work is devoted to the
use of supervised learning methods for cost prediction, and the remainder of
this report excludes other types of prediction methods.
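
To make the skewness described above concrete, the sketch below computes the sample skewness of a small list of costs in plain Python. The function name and the cost values are hypothetical and chosen only to mimic a spike at zero and one extreme value.

```python
def skewness(values):
    """Sample skewness: third central moment divided by variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical costs with several zeros and one extreme value.
costs = [0.0, 0.0, 0.0, 120.0, 250.0, 400.0, 900.0, 15000.0]

cost_skew = skewness(costs)
print(f"skewness of costs: {cost_skew:.2f}")
```

A clearly positive value indicates a heavy right tail, while a symmetric sample gives a value of zero.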

B. Input features that have been used for cost on cost prediction:
Input features are an essential part of a supervised learning task.
Cost prediction studies have benefited from a variety of input features,
which are summarized in Table 1.
Fig 8: Result screenshot

C. Performance measures and evaluation results for cost on cost prediction

R squared value: The R squared value is a statistical measure that indicates
how close the data are to the fitted regression line. It is also called the
coefficient of determination.

R-squared = Explained variation / Total variation

The higher the R squared value, the better the model fits our data. For a
least squares fit evaluated on its own training data, it lies between 0 and 1.
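
The R squared value can be computed directly from actual and predicted costs. The sketch below uses the equivalent form 1 − (residual variation / total variation), which equals explained over total variation for a least squares fit with an intercept. The function name and the numbers are illustrative assumptions, not results from this project.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_y = sum(y_true) / len(y_true)
    ss_total = sum((y - mean_y) ** 2 for y in y_true)
    ss_residual = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    return 1 - ss_residual / ss_total


# Hypothetical actual and predicted costs, for illustration only.
actual = [3000.0, 5000.0, 7000.0, 9000.0]
predicted = [3200.0, 4800.0, 7100.0, 8900.0]

score = r_squared(actual, predicted)
print(f"R squared: {score:.3f}")
```

On held-out test data this quantity can fall below 0 when the predictions are worse than always predicting the mean, so a test-set value should be read with that in mind.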

Fig 9: R squared value screenshot

4.2 Conclusion
The designed system can be used to predict the medical insurance cost of an
individual from their input parameters. The model gave high accuracy, as
judged by its R squared value (Fig 9), and is therefore a good candidate for
adoption in the health care and insurance sectors.

4.3 Summary
I collected the raw data set and uploaded the .csv file to the IDE. I then
imported the dependencies, that is, the libraries and functions needed: the
libraries are NumPy, pandas, Matplotlib and seaborn, and the scikit-learn
utilities used are train_test_split and LinearRegression. The collected data
was then analysed, and graphs of the data set were plotted, for example the
distributions of age, gender and BMI mentioned above. Next, I carried out
data pre-processing, which makes the raw data compatible with the machine
learning algorithm, and split the data into training and testing data. The
training data was fed to the machine learning model to produce a trained
model, and the test data was used to evaluate its performance. The trained
model now gives an estimated insurance cost as output based on the input data.

4.4 Reference

[1] Demsar J. “Statistical comparisons of classifiers over multiple data sets”. The Journal of Machine
Learning Research. 2020;7:1–30.

[2] Mohammad Amin Morid, Kensaku Kawamoto, Travis Ault, Josette Dorius, Samir Abdelrahman.
“Supervised Learning Methods for Predicting Healthcare Costs”. David Eccles School of Business,
University of Utah. PMCID: PMC5977561. 2020.

[3] Duncan I, Loginov M, Ludkovski M. “Testing Alternative Regression Frameworks for Predictive
Modeling of Health Care Costs”. North American Actuarial Journal. 2019.

[4] Pradeep kr, Naveen Aradhya. “A Collective Study of Machine Learning (ML) Algorithms with Big
Data Analytics (BDA) for Healthcare Analytics (HcA)”. International Journal of Emerging Trends. 2018.

[5] Michel Denuit, Donatien Hainaut, Julien Trufin. “Effective Statistical Learning Methods for
Actuaries I: GLMs and Extensions”. January 2019.

[6] Ranjodh Singh, Meghna P Ayyar, Tata Venkata Sri Pavan, Rajiv Ratn Shah. “Automating Car
Insurance Claims Using Deep Learning Techniques”. 2019 IEEE Fifth International Conference on
Multimedia Big Data (BigMM). September 2019.
