
MEDICAL INSURANCE COST PREDICTION SYSTEM

Dharesh Bahety
Electronics Department
Medicaps University
Indore, India
en18el301057@medicaps.ac.in

Parag Raverkar
Electronics Department
Medicaps University
Indore, India
parag.raverkar@medicaps.ac.in

Abstract—People assume medical insurance costs to be high, and so they avoid taking medical insurance for themselves and their families, though it has now become a necessity. This project designs an automatic system that can predict what the medical insurance cost of a person will be.

We design a machine learning system that learns from the input data and predicts what the insurance cost will be. The system first takes the insurance cost data as raw input, which is examined using data analysis. In data analysis, we find statistical measures of the data such as count, mean, standard deviation, minimum and maximum value. We also plot graphs to determine the distribution of age, BMI, gender, number of children, smoking habit, region, insurance cost, etc. The raw data is converted to processed data, and the processed data is then split into training data, used for training the machine, and test data, used to check the accuracy of the designed model.

Whenever anyone wishes to take medical insurance from a company, this system will act as a machine expert that predicts the estimated insurance cost of a person based on the input parameters provided.

Keywords—machine learning, insurance cost prediction, supervised learning, regression model

I. INTRODUCTION

In seeking to control unsustainable increases in healthcare costs, it is imperative that healthcare organizations can predict the likely future costs of individuals, so that care management resources can be efficiently targeted to those individuals at highest risk of incurring significant costs. People assume medical insurance costs to be high, and so they avoid taking medical insurance for themselves and their families, though it has now become a necessity. This project designs an automatic system that can predict what the medical insurance cost of a person will be. In classification, learning algorithms take the input data and map it to a discrete output like True or False. In regression, learning algorithms map the input data to a continuous output like weight, cost, etc. So I am going to design a machine learning system using a linear regression model that can learn from the data and predict what the cost will be.

II. DIRECT COMPARISON OF ALTERNATIVE COST PREDICTION METHODS USING A HEALTH INSURER DATA SET

A. Approach

We used a health insurance data set to directly compare the performance of the cost on cost prediction approaches identified in the literature.

B. Data

Our data set consisted of 1340 medical claims from distinct individuals. Available data included demographic information (e.g., age, gender), clinical encounter information (e.g., place of service), and cost information. The data set was divided into two time periods: an observation period and a result period. The former was used to predict individuals' cost in the result period. Table 1 shows all input features used in this study. All features used in this study were cost related features, forming the largest and most complete set of cost related features among the reviewed manuscripts. If a member did not have any cost for a specific month, the cost was taken as zero. There are no missing values in this data set.

C. Classifier

The models evaluated were Linear Regression, Lasso, Ridge and CART. Except for CART, these models had not been previously evaluated for cost on cost prediction. Each model's parameters were tuned on 30 percent of the data set to find its best parameter setting. Models were evaluated using several performance measures.

A brief description of all the models used in this study is provided below:

Lasso: This is a linear regression model enhanced with variable selection and regularization, which is given by the L1-norm (the loss function is the linear least squares error).

Ridge: This is a linear regression model where the regularization is given by the L2-norm (the loss function is the linear least squares error). The L2-norm yields non-sparse coefficients: it shrinks coefficients toward zero but rarely sets any of them exactly to zero, whereas the L1-norm used by Lasso tends to produce many zero or very small coefficients with a few large ones.

CART: This is a regression decision tree, where at each node the algorithm chooses the split that minimizes the sum of squared errors for regression of the node. The important quality is that the algorithm uses the sample mean of the instances in each node for regression.
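As a rough illustration of the CART description above, the following minimal sketch finds a single split on one feature by minimizing the sum of squared errors, then predicts with the sample mean of each side. It is not the implementation used in this study: the function names (`sse`, `best_split`, `predict_cost`) and the toy age and cost values are hypothetical, and a full regression tree would apply the same step recursively over all features.

```python
from statistics import mean


def sse(values):
    """Sum of squared errors of the values around their sample mean."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def best_split(xs, ys):
    """Return (threshold, left_mean, right_mean) for the SSE-minimizing split."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # equal feature values cannot be separated by a threshold
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        total = sse(left) + sse(right)
        if best is None or total < best[0]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (total, threshold, mean(left), mean(right))
    if best is None:
        raise ValueError("need at least two distinct feature values")
    return best[1:]


def predict_cost(x, split):
    """Predict with the sample mean of the side of the split that x falls on."""
    threshold, left_mean, right_mean = split
    return left_mean if x <= threshold else right_mean


# Hypothetical toy data (made-up numbers, not taken from the study's data set).
ages = [20, 25, 30, 50, 55, 60]
costs = [2000.0, 2200.0, 2100.0, 9000.0, 9500.0, 9200.0]

split = best_split(ages, costs)
print(split)                    # threshold and the two node means
print(predict_cost(22, split))  # prediction for a hypothetical 22-year-old
```

On this toy data the chosen threshold falls between the two age groups, so each prediction is simply the average cost of the group the person belongs to.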
Artificial Neural Network (ANN): This is a large collection of processing units (i.e., neurons), where each unit is connected with many others. Neural networks typically consist of multiple layers and the goal is to solve problems in the same way that the human brain would.

Features             Description
Age                  Age of a person
Gender               Gender of person
Smoker/non smoker    Person is smoker or not
BMI                  Body mass index of person
Children             No. of children a person has

Table 1: Features used to develop prediction model

D. Model

Linear Regression Model: Linear regression attempts to model the relationship between two variables by fitting a linear equation to observed data. One variable is considered to be an explanatory variable, and the other is considered to be a dependent variable. For example, a modeler might want to relate the height of individuals to their weight using a linear regression model.

Before attempting to fit a linear model to observed data, a modeler should first determine whether there is a relationship between the variables of interest or not. A scatter plot can be a helpful tool in determining the strength of the relationship between two variables. If there appears to be no association between the proposed explanatory and dependent variables, then fitting a linear regression model to the data probably will not provide a useful model. A valuable numerical measure of association between two variables is the correlation coefficient, which is a value between -1 and 1 indicating the strength of the association of the observed data for the two variables.

A linear regression line has an equation of the form Y = mX + c, where X is the explanatory variable and Y is the dependent variable. The slope of the line is m, and c is the intercept.

Fig 1: Example data set

Let us understand the algorithm to design linear regression through an example. Say we have a data set of years of experience and salary per year as shown in Fig 1. We plot the data in a graph and try to determine the best fit (blue line in Fig 2). Since it is a linear graph, the best fit will be of the form Y = mX + c. The slope is determined through the formula m = (y2 – y1)/(x2 – x1). So here the slope will be

\[ m = \frac{1000000 - 800000}{5 - 4} = \frac{200000}{1} = 200000 \]

And the intercept will be

\[ 200000 = 200000(0) + c \]
\[ c = 200000 \]

So our equation for the best fit comes out to be Y = 200000 (X + 1).

Once we determine the slope and intercept, we need to design a function that will replicate the functionality of the best fit, and hence help in prediction.

Fig 2: Example of linear regression curve

III. METHODOLOGY

• In a machine learning algorithm, the first step is to collect data, so here we need insurance cost data. Our machine learning algorithm will learn from parameters such as age, gender, pre-ailments, etc., and based on these the insurance cost is to be predicted.

• Once we have the insurance cost data, we need to analyse the data to understand whether it carries some meaning or not. Here I will plot some graphs to see the relations within the data.

• We cannot feed the data directly into our machine learning algorithm. We need to do some processing on it. Once we process the data, it will be compatible to be fed into our machine learning algorithm.

• The next step is to split our data into training data and testing data. Training data is used to train our machine, while testing data is used to evaluate the performance of our model.

• We feed the training data to our machine learning model. I will be using the Linear Regression Model. This model is used because it is a statistical model for predictive analysis. Linear regression makes predictions for continuous or numeric values.
• After that we have a trained model, so we can now feed new data to the trained model.

• Once we feed the new data to our trained model, the model will predict the estimated cost of medical insurance.

[Fig 3: Work Flow. Blocks shown: Insurance cost data; Data Analysis; Data Pre-processing; Train-Test Split; Model Design; Input data; Testing of trained model; Prediction of insurance cost]

IV. RESULT

A. Types of healthcare cost prediction approaches

There are three kinds of methods that have been reported for cost prediction: rule-based, statistical and supervised learning. The disadvantage of the rule-based methods is that they require a lot of domain knowledge, which is not easily available and is often expensive. Although statistical models, mainly multiple regression models, are powerful tools for capturing the relationships between the predictors and the dependent variable, they have two important challenges. One is that working with several independent variables often causes multicollinearity, which is caused by the presence of significant correlations among predictors. Moreover, their performance is challenged by the skewed nature of healthcare data, where cost data typically feature a spike at zero, distributions are strongly skewed with a heavy right-hand tail, and extreme values can be present, all of which make them inefficient in small to medium sample sizes if the underlying distribution is not normal. Although several advanced statistical methods have been proposed to accommodate the skewness observed in healthcare data, this type of prediction method is not able to outperform supervised learning methods. Therefore, this paper is devoted to the use of supervised learning methods for cost prediction, and the remainder of this paper excludes other types of prediction methods.

B. Input features that have been used for cost on cost prediction

Input features are one of the essential parts of a supervised learning task. Numeric cost prediction studies have benefited from a variety of features as input, which are summarized in Table 1.

C. Performance measures and evaluation results for cost on cost prediction

Mean Absolute Error (MAE): This shows the average error of the model on prediction of the actual cost values and is calculated as follows:

\[ \mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |a_i - p_i| \qquad (1) \]

where $a_i$ and $p_i$ are the actual and predicted costs of member $i$ in the result period, respectively, and $n$ is the number of members.

Fig 4: BMI curve

Fig 5: Age Distribution curve

V. Conclusion

In conclusion, the designed system can be used to predict the medical insurance cost of an individual from their input parameters.

VI. Reference

[1] Demsar J. "Statistical comparisons of classifiers over multiple data sets". The Journal of Machine Learning Research. 2006;7:1–30.

[2] Mohammad Amin Morid, Kensaku Kawamoto, Travis Ault, Josette Dorius, Samir Abdelrahman. "Supervised Learning Methods for Predicting Healthcare Costs". David Eccles School of Business, University of Utah. PMCID: PMC5977561.

[3] Duncan I, Loginov M, Ludkovski M. "Testing Alternative Regression Frameworks for Predictive Modeling of Health Care Costs". North American Actuarial Journal. 2016.

[4] Liaw A, Wiener M. "Classification and regression by randomForest". R News. 2002;2(3):18–22.
