P4 Project Report
MEDICAL INSURANCE COST PREDICTION SYSTEM
A Project Report (Project Work I)
Submitted in partial fulfillment of requirement of the
Degree of
BACHELOR OF TECHNOLOGY in
ELECTRONICS & COMMUNICATION
BY
DHARESH BAHETY
EN18EL301057
Under the Guidance of
MR. PARAG RAVEKAR SIR
Assistant Professor
Report Approval
The project work “Medical Insurance Cost Prediction System” is
hereby approved as a creditable study of an engineering subject carried
out and presented in a manner satisfactory to warrant its acceptance as
prerequisite for the Degree for which it has been submitted.
It is to be understood that by this approval the undersigned do not
endorse or approve any statement made, opinion expressed, or
conclusion drawn therein, but approve the “Project Report” only for the
purpose for which it has been submitted.
Internal Examiner
Affiliation
1. ______________________
2. ______________________
3. ______________________
4. ______________________
Declaration
Further, I declare that the content of this Project work, in full or in part,
has neither been taken from any other source nor been submitted
to any other Institute or University for the award of any degree or
diploma.
Certificate
Acknowledgement
It is due to their help and support that we were able to complete the
design and technical report. Without their support this report would not
have been possible.
Dharesh Bahety
B.Tech. IV Year
Enrollment No: EN18EL301057
Department of Electronics & Communication
Faculty of Engineering
Medi-Caps University, Indore
Abstract
People assume that medical insurance is expensive and therefore avoid
buying it for themselves and their families, even though it has become
a necessity. This project designs an automated system that predicts
the medical insurance cost of a person.
In this project, we design a machine learning system that learns from the
input data and predicts the insurance cost. The system first takes the
insurance cost data as raw input, which is examined using data analysis. In the
data analysis step, we compute statistical measures of the data set such as
count, mean, standard deviation, minimum and maximum value. We also plot
graphs to determine the distribution of age, BMI, gender, number of children,
smoking habit, region, insurance cost, etc. The raw data is converted to
processed data, and the processed data is then split into training data, used
to train the model, and test data, used to check the accuracy of the designed
model.
Whenever someone wishes to buy medical insurance from a company, this
system acts as a machine expert that predicts the estimated insurance
cost of that person based on the input parameters provided.
Table of Contents
Report Approval
Declaration
Acknowledgement
Abstract
Table of Contents
List of Figures
Abbreviations
Chapter 1
1.1 Introduction
1.2 Literature Review
1.3 Objectives
1.4 Proposed Solution
1.5 Significance
1.6 Disadvantage
Chapter 2
Prerequisite
Chapter 3
3.1 Methodology
3.2 Tools Used
3.3 Procedure adopted
Chapter 4
4.1 Result
4.2 Conclusion
4.3 Summary
4.4 Reference
List of Figures
Abbreviations
Prof Professor
ANN Artificial Neural Network
Chapter-1
1.1 Introduction
People assume that medical insurance is expensive and therefore avoid
buying it for themselves and their families, even though it has
become a necessity. This project designs an automated system that
predicts the medical insurance cost of a person.
In classification, a learning algorithm takes the input data and maps it
to a discrete output such as True or False. In regression, a learning
algorithm maps the input data to a continuous output such as weight or cost.
1.3 Objectives
Using the designed system, a person can predict the medical insurance cost of
individuals of all ages.
This will help motivate parents to insure their children from an early age, as
insurance prices are lower for infants and children than for adults.
People are free to decide which policy they wish to buy and can estimate the
cost of the policy, which is important for deciding the coverage they wish
to have.
1.5 Significance
Forecast an effect
Trend forecasting
1.6 Disadvantage
The model is sensitive to outliers
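The outlier sensitivity noted above can be illustrated with a minimal sketch. The data below is a hypothetical toy set, not the project data, and NumPy's `polyfit` stands in for the linear regression fit; a single corrupted observation is enough to pull the fitted line noticeably.

```python
import numpy as np

# Hypothetical toy data lying exactly on y = 2x + 1.
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0

# Fit a straight line by least squares (degree-1 polynomial).
slope_clean, intercept_clean = np.polyfit(x, y, 1)

# Replace a single observation with an extreme value and refit.
y_outlier = y.copy()
y_outlier[-1] = 100.0
slope_out, intercept_out = np.polyfit(x, y_outlier, 1)

print(f"clean fit:        slope={slope_clean:.2f}, intercept={intercept_clean:.2f}")
print(f"with one outlier: slope={slope_out:.2f}, intercept={intercept_out:.2f}")
```

Because least squares penalizes squared errors, the one extreme point dominates the loss and the slope moves well away from 2.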
Chapter-2
Prerequisite
Direct Comparison of Alternative Cost Prediction Methods
using a Health Insurer Data Set
A. Approach
We used a health insurance data set to directly compare the performance
of the cost prediction methods described in the literature.
B. Data
Our data set included 1340 medical claims from different people. The available
data include demographic information (e.g., age, gender), clinical contact
information (e.g., location of service), and cost information. The data set is
divided into two periods: the observation period and the outcome period. Data
from the former was used to predict individual costs in the outcome period.
Table 1 shows all the input features used in this study. All of the features
used in this study were cost-related features, which formed the largest and
most complete set of cost-related features among the reviewed papers. If a
member had no expenses in a given month, the cost was taken as zero. There are
no missing values in this data set.
C. Classifier
The classifiers evaluated included Linear Regression, Lasso, Ridge and CART.
With the exception of CART, some of these models have not previously been
evaluated for cost prediction. All models were tuned to find their best
parameter settings using 30 percent of the data set. The models were then
tested using several performance measures.
A brief description of all the models used in this study is provided below:
Lasso: This is a linear regression model regularized by the L1-norm of the
coefficients, added to the least-squares loss function. The L1-norm encourages
sparse coefficients, meaning that most coefficients are zero and only a few
are large, so the model effectively performs feature selection.
Ridge: This is a linear regression model regularized by the L2-norm of the
coefficients, added to the least-squares loss function. The L2-norm shrinks
the coefficients toward zero but does not usually make them exactly zero.
CART: This is a regression decision tree. At each node, the algorithm
chooses the split that minimizes the sum of squared errors of the resulting
nodes. An important property is that the algorithm uses the sample mean of the
instances in each node as the regression prediction.
Artificial Neural Network (ANN): This is a large set of processing units (i.e.,
neurons), each of which is connected to many others. Neural networks often
have multiple layers, and the goal is to solve problems in the same way that
the human brain does.
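To illustrate the difference between the Lasso and Ridge penalties described in this list, here is a minimal sketch using scikit-learn. The data is synthetic and the `alpha` values are illustrative assumptions; they are not the settings tuned in the study.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)

# Hypothetical data: five features, but only the first one drives the target.
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

# alpha is the regularization strength; the values here are illustrative.
lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("Lasso coefficients:", np.round(lasso.coef_, 3))
print("Ridge coefficients:", np.round(ridge.coef_, 3))
```

With this setup the Lasso sets the coefficients of the irrelevant features to exactly zero, while Ridge leaves them small but non-zero.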
Features Description
D. Model
Before attempting to fit a linear model to observed data, a modeler should first
determine whether there is a relationship between the variables of interest.
A scatter plot can be a helpful tool for judging the strength of the
relationship between two variables. If there appears to be no association
between the proposed explanatory variable and the dependent variable, then
fitting a linear regression model to the data will probably not yield a useful
model. A valuable numerical measure of association between two variables is
the correlation coefficient, which is a value between -1 and 1 indicating the
strength of the association in the observed data for the two variables.
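As a small sketch of this check, the correlation coefficient can be computed with NumPy. The age and cost values below are hypothetical and only serve to show the call; they are not taken from the project data.

```python
import numpy as np

# Hypothetical values for illustration only.
age = np.array([19, 25, 32, 41, 48, 55, 63], dtype=float)
cost = np.array([1800, 2500, 4100, 6000, 8200, 10500, 13900], dtype=float)

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal
# entry is the Pearson correlation coefficient between the two variables.
r = np.corrcoef(age, cost)[0, 1]
print(f"correlation between age and cost: {r:.3f}")
```

A value close to 1 or -1 suggests that a linear model is worth fitting, while a value near 0 suggests it is not.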
200000=200000(0)+c
c=200000
Once the slope and intercept are determined, we need to define a function that
reproduces the best-fit line and can therefore be used for prediction.
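A minimal sketch of such a function is given here in plain Python. The helper names `fit_line` and `make_predictor` and the sample points are hypothetical; the slope and intercept are obtained from the standard least-squares formulas.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def make_predictor(slope, intercept):
    """Return a function that evaluates y = slope * x + intercept."""
    def predict(x):
        return slope * x + intercept
    return predict


# Hypothetical points lying exactly on y = 3x + 2.
xs = [0, 1, 2, 3, 4]
ys = [2, 5, 8, 11, 14]

slope, intercept = fit_line(xs, ys)
predict = make_predictor(slope, intercept)
print(slope, intercept, predict(10))
```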
Chapter 3
3.1 Methodology
Model Design
Fig 3: Work Flow
browser, and is especially well suited to machine learning, data analysis and
education.
2. Uploading the CSV file containing the collected raw data set to the IDE.
Dependencies are the libraries and functions that we need for our program.
The libraries used are NumPy, pandas, Matplotlib and seaborn. The
scikit-learn utilities used are train_test_split and LinearRegression.
Loading the data from the CSV file into a pandas DataFrame. Using the
DataFrame, I obtained information about the data set such as the total number
of rows and columns, the number of null values (if any), the data type of each
column and the categorical features. If there are any missing values, I
determine where they occur and handle them.
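The loading and inspection step can be sketched as follows. The inline CSV text is a hypothetical stand-in for the uploaded file, and the column names are assumptions based on the features named in this report; with a real file, `pd.read_csv` would be given the file path instead.

```python
import io
import pandas as pd

# Hypothetical sample standing in for the uploaded CSV file; the column
# names and values are assumptions made for illustration.
csv_text = """age,sex,bmi,children,smoker,region,charges
20,female,28.0,0,yes,southwest,17000.0
35,male,31.5,1,no,southeast,5000.0
42,male,24.0,3,no,southeast,7500.0
51,female,22.5,0,no,northwest,9800.0
29,male,29.0,2,yes,northeast,21000.0
"""

df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)            # (number of rows, number of columns)
print(df.dtypes)           # data type of each column
print(df.isnull().sum())   # number of missing values per column
print(df.describe())       # count, mean, std, min, max of numeric columns

# Columns that are not numeric are treated as the categorical features.
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()
print(categorical_cols)
```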
5. Data analysis:
Normal distribution of BMI
6. Pre processing:
We cannot feed the raw data directly into our machine learning algorithm. We
first need to process it so that it is in a form the algorithm can accept.
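One common form of such processing is to encode the categorical columns as numbers. The sketch below assumes that this is what the pre-processing consists of; the column names, category labels and the particular encoding are illustrative assumptions, not details confirmed by the report.

```python
import pandas as pd

# Hypothetical rows; column names and category labels are assumptions.
df = pd.DataFrame({
    "age": [19, 45, 31],
    "sex": ["female", "male", "male"],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "northeast", "southeast"],
    "charges": [16000.0, 8000.0, 4500.0],
})

# Map two-valued columns to 0/1.
df["sex"] = df["sex"].map({"male": 0, "female": 1})
df["smoker"] = df["smoker"].map({"no": 0, "yes": 1})

# One-hot encode the multi-valued region column.
df = pd.get_dummies(df, columns=["region"], dtype=int)

print(df.dtypes)
```

After this step every column is numeric, which is what a linear regression model requires.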
7. Train-test split:
Next step is to split our data into training data and testing data. Training data is
used to train our machine while testing data is used to evaluate the performance
of our model.
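A minimal sketch of the split using scikit-learn's `train_test_split` is shown here. The feature matrix is random placeholder data, and the 80/20 ratio and `random_state` value are assumptions rather than the settings used in the project.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical feature matrix (100 rows, 3 features) and target vector.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = rng.normal(size=100)

# test_size=0.2 holds out 20% of the rows for testing; the ratio and the
# random_state value are assumptions, not taken from the report.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2
)

print(X_train.shape, X_test.shape)
```

Fixing `random_state` makes the split reproducible, so the same rows end up in the test set on every run.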
8. Model design:
We feed the training data to our machine learning model. I use a linear
regression model, a statistical model used for predictive analysis that
makes predictions for continuous (numeric) values. After training, we have a
trained model to which we can feed new data.
Once we feed new data to the trained model, it predicts the
estimated cost of medical insurance.
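The training and prediction steps can be sketched as follows with scikit-learn's `LinearRegression`. The data is synthetic: the three features and the coefficients used to generate the cost are invented for the demonstration and do not come from the project data set.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the processed data: cost rises with age and BMI
# and jumps for smokers. The coefficients below are invented for the demo.
rng = np.random.default_rng(0)
age = rng.integers(18, 65, size=300)
bmi = rng.normal(30, 5, size=300)
smoker = rng.integers(0, 2, size=300)
X = np.column_stack([age, bmi, smoker])
y = 250 * age + 300 * bmi + 20000 * smoker + rng.normal(0, 1000, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2
)

model = LinearRegression()
model.fit(X_train, y_train)  # train on the training data only

# Predict the cost for one new person: age 31, BMI 25.7, non-smoker.
new_person = np.array([[31, 25.7, 0]])
estimate = model.predict(new_person)[0]
print(f"estimated cost: {estimate:.2f}")

# R squared on the held-out test data.
test_r2 = model.score(X_test, y_test)
print(f"test R^2: {test_r2:.3f}")
```

The new input must have the same columns, in the same order and with the same encoding, as the data the model was trained on.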
Chapter 4
4.1 Result
A. Types of healthcare cost prediction approaches
There are three kinds of methods that have been used for cost prediction:
rule-based, statistical, and supervised learning methods. The disadvantage of
rule-based methods is that they require a lot of domain knowledge, which is not
easily available and is often expensive. Although statistical models, mainly
multiple regression models, are powerful tools for capturing the relationships
between the predictors and the dependent variable, they face two important
challenges. One is that working with several independent variables often causes
multicollinearity, that is, significant correlations among the predictors. The
other is that their performance is challenged by the skewed nature of
healthcare cost data, which feature a spike at zero, strongly skewed
distributions with a heavy right tail, and possible extreme values, all of
which make these models inefficient in small to medium sample sizes if the
underlying distribution is not normal. Although several advanced statistical
strategies have been proposed to deal with the skewness found in healthcare
data, this kind of prediction approach has not been able to outperform
supervised learning methods. Therefore, this work focuses on supervised
learning methods for cost prediction, and the remainder of the literature
review excludes other types of prediction methods.
B. Input features that have been used for cost prediction:
Input features are one of the essential parts of a supervised learning task.
Numeric cost prediction studies have benefited from a variety of features as
input, which are summarized in Table 1.
Fig 8: Result screenshot
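The higher the R squared value, the better the model fits our data. For a
least-squares fit it lies between 0 and 1 on the training data; on test data
it can be negative when the model predicts worse than the mean.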
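For reference, R squared can be computed directly from its definition, 1 - SS_res / SS_tot. The sketch below uses a hypothetical helper `r_squared` and made-up actual and predicted costs.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical actual and predicted costs.
actual = [1000.0, 2000.0, 3000.0, 4000.0]
good_pred = [1100.0, 1900.0, 3100.0, 3900.0]
mean_pred = [2500.0] * 4  # always predicting the mean of the actual values

print(r_squared(actual, good_pred))
print(r_squared(actual, mean_pred))
```

Predictions close to the actual values give a score near 1, while always predicting the mean gives a score of 0.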
4.2 Conclusion
The designed system can be used to predict the medical insurance cost of an
individual from their input parameters. The model gives high accuracy and is
therefore suitable for adoption in the health care and insurance sectors.
4.3 Summary
I collected the raw data set and uploaded the .csv file to the IDE. I
identified the dependencies to be imported, which are the libraries and
functions needed: the libraries are NumPy, pandas, Matplotlib and seaborn, and
the scikit-learn utilities used are train_test_split and LinearRegression. The
collected data was then analysed, and we plotted graphs of the analysed data
set, such as the distributions of age, gender and BMI mentioned above. I then
carried out data pre-processing, which makes the raw data compatible with the
machine learning algorithm, and split the data into training and testing data.
The training data was fed to the machine learning model to produce a trained
model, and the test data was used to evaluate its performance. The trained
model gives the estimated insurance cost as output based on the input data.
4.4 Reference
[1] Demsar J. “Statistical comparisons of classifiers over multiple data sets”. The Journal of Machine
Learning Research. 2006;7:1–30
[3] Duncan I, Loginov M, Ludkovski M. “Testing Alternative Regression Frameworks for Predictive
Modeling of Health Care Costs” North American Actuarial Journal. 2019
[4] Pradeep kr, Naveen Aradhya “ A Collective Study of Machine Learning (ML) Algorithms with Big
Data Analytics (BDA) for Healthcare Analytics (HcA)” International Journal of Emerging Trends 2018
[5] Michel Denuit,Donatien Hainaut,Julien Trufin “Effective Statistical Learning Methods for
Actuaries I: GLMs and Extensions” January 2019
[6] Ranjodh Singh,Meghna P Ayyar,Tata Venkata Sri Pavan,Rajiv Ratn Shah “Automating Car
Insurance Claims Using Deep Learning Techniques” 2019 IEEE Fifth International Conference on
Multimedia Big Data (BigMM) September 2019