Dharesh Bahety Research Paper
SYSTEM
Dharesh Bahety
Electronics Department
Medicaps University
Indore, India
en18el301057@medicaps.ac.in

Parag Raverkar
Electronics Department
Medicaps University
Indore, India
parag.raverkar@medicaps.ac.in
Features             Description
Age                  Age of the person
Gender               Gender of the person
Smoker/non-smoker    Whether the person is a smoker or not
BMI                  Body mass index of the person
Children             Number of children the person has

Table 1: Features used to develop prediction model

III. METHODOLOGY

Before attempting to fit a linear model to observed data, a modeler should first determine whether there is a relationship between the variables of interest. A scatter plot can be a helpful tool in determining the strength of the relationship between two variables. If there appears to be no association between the proposed explanatory and dependent variables, then fitting a linear regression model to the data probably will not provide a useful model. A valuable numerical measure of association between two variables is the correlation coefficient, which is a value between -1 and 1 indicating the strength of the association of the observed data for the two variables.

A linear regression line has an equation of the form Y = mX + c, where X is the explanatory variable and Y is the dependent variable. The slope of the line is m, and c is the intercept.

Let us understand the algorithm to design linear regression through an example. Say we have a data set of years of experience and salary per year, as shown in fig 1. We plot the data in a graph and try to determine the best fit (blue line in fig 2). Since the graph is linear, the best fit will be of the form Y = mX + c. The slope is determined through the formula m = (y2 - y1)/(x2 - x1), so here the slope is m = 200000. Substituting this slope and the point X = 0, Y = 200000 into Y = mX + c gives the intercept:

200000 = 200000(0) + c
c = 200000

So our equation for the best fit comes out to be Y = 200000(X + 1).

Once we determine the slope and intercept, we need to design a function that replicates the best fit and hence helps in prediction.

D. Model

In a machine learning algorithm, the first step is to collect data, so here we need insurance cost data. Our machine learning algorithm will learn from parameters such as age, gender, pre-ailments, etc., and the insurance cost is predicted based on these.

Once we have the insurance cost data, we need to analyse it to understand whether it carries any meaning. Here I will plot some graphs to see the relation between the variables.

We cannot feed the data directly into our machine learning algorithm. We need to do some processing on it first, so that the processed data is compatible with our machine learning algorithm.

We feed the training data to our machine learning model. I will be using a Linear Regression model. This model is used because it is a statistical model for predictive analysis. Linear regression makes predictions for continuous or numeric values.
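As a minimal sketch of what fitting a linear regression model involves, the following Python code computes the correlation coefficient and the line Y = mX + c for a small data set. The data values are hypothetical and were chosen only to agree with the worked example (Y = 200000(X + 1)); the function names `correlation`, `fit_line` and `make_predictor` are illustrative, not part of the original system. Note that the slope is estimated here by ordinary least squares rather than by the two-point formula m = (y2 - y1)/(x2 - x1); for points lying exactly on a line the two give the same result.

```python
import math


def correlation(xs, ys):
    """Pearson correlation coefficient, a value between -1 and 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant data")
    return sxy / math.sqrt(sxx * syy)


def fit_line(xs, ys):
    """Least-squares slope m and intercept c of Y = mX + c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("slope is undefined when all X values are equal")
    m = sxy / sxx
    c = mean_y - m * mean_x
    return m, c


def make_predictor(m, c):
    """Return a function that replicates the best-fit line."""
    def predict(x):
        return m * x + c
    return predict


# Hypothetical data: years of experience and salary per year.
years = [0, 1, 2, 3, 4]
salary = [200000, 400000, 600000, 800000, 1000000]

r = correlation(years, salary)      # strength of the linear association
m, c = fit_line(years, salary)      # slope and intercept of the best fit
predict = make_predictor(m, c)

print("correlation:", r)
print("slope:", m, "intercept:", c)
print("predicted salary for 5 years:", predict(5))
```

The same idea extends to the insurance data, where several input features are used instead of a single X; in that case a library implementation of multiple linear regression would normally replace `fit_line`.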
After that we have a trained model, so we can now feed new data to the trained model.

Once we feed the new data to our trained model, the model will predict the estimated cost of medical insurance.

[Figure: workflow diagram with the blocks Insurance cost data, Data analysis, Data pre-processing, Train-test split, Model design, Testing of trained model, Input data, Prediction of insurance cost]

Input features are one of the essential parts of a supervised learning task. Numeric cost prediction studies have benefited from a variety of features as input, which are summarized in Table 1.

C. Performance measures and evaluation results for cost prediction

Mean Absolute Error (MAE): This shows the average error of the model on prediction of the actual cost values and is calculated as follows:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} |a_i - p_i| \qquad (1)$$

where $a_i$ and $p_i$ are the actual and predicted costs of member $i$ in the result period respectively, and $n$ is the number of members.
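To make the MAE definition concrete, here is a minimal Python sketch that averages the absolute differences between actual and predicted costs. The function name `mean_absolute_error` and the cost values are assumptions chosen for illustration; they are not results from this paper.

```python
def mean_absolute_error(actual, predicted):
    """Return (1/n) * sum(|a_i - p_i|) over paired actual and predicted costs."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if len(actual) == 0:
        raise ValueError("at least one observation is required")
    total = sum(abs(a - p) for a, p in zip(actual, predicted))
    return total / len(actual)


# Hypothetical actual and predicted insurance costs for three members.
actual_costs = [1000.0, 2500.0, 4000.0]
predicted_costs = [1200.0, 2400.0, 3700.0]

mae = mean_absolute_error(actual_costs, predicted_costs)
print(f"MAE = {mae:.2f}")  # prints MAE = 200.00
```

A lower MAE means the predicted costs are, on average, closer to the actual costs; the value is expressed in the same unit as the cost itself.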
There are three kinds of methods that have been reported for cost prediction: rule-based, statistical and supervised learning. The disadvantage of the rule-based methods is that they require a lot of domain knowledge, which is not easily available and is often expensive. Although statistical models, mainly multiple regression models, are powerful tools for capturing the relationships between the predictors and the dependent variable, they have two important challenges. One is that working with several independent variables often causes multicollinearity, which is caused by the presence of significant correlations among predictors. Moreover, their performance is challenged by the skewed nature of healthcare data, where cost data typically feature a spike at zero, distributions are strongly skewed with a heavy right-hand tail, and extreme values can be present, all of which make them inefficient in small to medium sample sizes if the underlying distribution is not normal. Although several advanced statistical methods have been proposed to accommodate the skewness observed in healthcare data, this type of prediction method is not able to outperform supervised learning methods. Therefore, this paper is devoted to the use of supervised learning methods for cost prediction, and the remainder of the literature review excludes other types of prediction methods.

B. Input features that have been used for cost prediction:

Fig 5: Age Distribution curve

V. Conclusion

The conclusion of this project is to use the designed system to predict the medical insurance cost of an individual depending on their input parameters.