IT Assignment Final
Contents
Medical Insurance Premium Prediction
Glossary
References
Medical Insurance Premium Prediction: -
A medical insurance premium is an upfront payment made on behalf of an individual or family
in order to keep their health insurance policy active.
Descriptive Analytics:
Diagnostic Analytics:
Why have there been fluctuations in premium prices over the years?
Are certain demographic groups associated with higher or lower premium prices, and if
so, why?
What factors contribute the most to changes in premium prices?
Predictive Analytics:
What is the likely trend in medical premium prices for the next year based on historical
data?
Can we predict which policyholders are likely to experience substantial premium
increases in the coming year?
What will be the impact of changes in policyholder demographics on future premium
pricing?
Are there hidden patterns or correlations in the data that can help improve premium
pricing accuracy?
Can machine learning models provide more accurate premium predictions compared to
traditional statistical methods?
What variables or features have the most significant influence on premium price
predictions?
Prescriptive Analytics:
What should be done to mitigate significant premium price increases for specific
policyholders?
Can we recommend personalized pricing strategies for policyholders based on their risk
profiles?
How can we optimize pricing decisions to balance profitability and customer retention?
Business Problem: In the healthcare insurance industry, accurately estimating the premium
price for medical insurance policies is crucial. Accurate premium pricing ensures that insurance
companies charge policyholders an appropriate amount, maintaining profitability while offering
competitive rates.
Data Requirements: To solve this problem, you need a dataset that includes the following
variables:
Other relevant policyholder information (e.g., health status, smoking status, etc.).
Premium Price: The actual premium price for medical insurance policies.
Model Architecture: -
Packages: caTools, ggplot2
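Import Data
library(readr)
# The read call is missing from the source; the file name below is an assumed placeholder.
medical_premium <- read_csv("medical_premium.csv")
library(caTools)
set.seed(123)
# The split lines are missing from the source; the 0.8 ratio and the PremiumPrice column name are assumptions.
split <- sample.split(medical_premium$PremiumPrice, SplitRatio = 0.8)
train <- subset(medical_premium, split == TRUE)
test <- subset(medical_premium, split == FALSE)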
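# Modeling
# Assumed reconstruction: the lm() call is missing from the source; "all the features" suggests PremiumPrice ~ .
PremiumPrice_lm <- lm(PremiumPrice ~ ., data = train)
summary(PremiumPrice_lm)
# Diagnostic Studies
# Residual Plot
# The start of this plot() call is missing from the source; the axis labels are assumed.
plot(fitted(PremiumPrice_lm), resid(PremiumPrice_lm),
xlab="Fitted Values", ylab="Residuals",
main="Premium Price")
abline(0, 0)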
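library(ggplot2)
PremiumPrice_lm_res_df <- data.frame(fitted_value=predict(PremiumPrice_lm, train),
residual=resid(PremiumPrice_lm))
# The ggplot() call and the labs() arguments are missing from the source; the ones below are assumed.
ggplot(PremiumPrice_lm_res_df, aes(x=fitted_value, y=residual)) +
geom_point() +
geom_hline(yintercept=0, color='blue')+
labs(x="Fitted Value", y="Residual")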
# Normal Q-Q plot of the standardized residuals
PremiumPrice_lm_stdres <- rstandard(PremiumPrice_lm)
qqnorm(PremiumPrice_lm_stdres,
ylab="Standardized Residuals",
xlab="Normal Scores")
qqline(PremiumPrice_lm_stdres)
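# Prediction
# Assumed reconstruction: the right-hand sides are missing from the source; test is the held-out split.
predicted <- predict(PremiumPrice_lm, test)
predicted
# Model Evaluation
# MAE (Mean Absolute Error)
MAE <- mean(abs(test$PremiumPrice - predicted))
MAE
# MAPE (Mean Absolute Percentage Error), in percent
MAPE <- mean(abs((test$PremiumPrice - predicted) / test$PremiumPrice)) * 100
MAPE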
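The workflow above (split, fit, predict, evaluate) can be tried end to end without the original file. The following is a minimal sketch on simulated data: the column names Age, Income and PremiumPrice, the coefficients, and the 80/20 split are all assumptions, and the split uses base R rather than caTools so that the sketch has no package dependencies.

```r
# Simulated stand-in for the premium data (assumed columns and coefficients).
set.seed(123)
n <- 500
sim <- data.frame(
  Age    = sample(18:65, n, replace = TRUE),
  Income = round(runif(n, 20000, 120000))
)
sim$PremiumPrice <- 5000 + 150 * sim$Age + 0.05 * sim$Income + rnorm(n, sd = 1500)

# 80/20 train/test split with base R (assumed ratio).
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train <- sim[train_idx, ]
test  <- sim[-train_idx, ]

# Fit on the training rows, then predict the held-out rows.
fit <- lm(PremiumPrice ~ ., data = train)
predicted <- predict(fit, newdata = test)

# MAE is in premium units; MAPE is a percentage of the actual premium.
MAE  <- mean(abs(test$PremiumPrice - predicted))
MAPE <- mean(abs((test$PremiumPrice - predicted) / test$PremiumPrice)) * 100
cat("MAE:", round(MAE, 2), "| MAPE (%):", round(MAPE, 2), "\n")
```

Evaluating on rows the model has not seen gives a more honest error estimate than computing the same metrics on the training data.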
Model's Summary: -
The model has an adjusted R-squared value of 0.6225, meaning that about 62.25% of the total
variation in Premium Price is explained by the features, after adjusting for the number of
predictors in the model.
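To make the adjustment concrete, the sketch below recomputes the adjusted R-squared by hand from the ordinary R-squared, using 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of rows and p the number of predictors. The data are simulated and the variable names are hypothetical; only the formula carries over to the premium model.

```r
# Simulated data with two predictors (hypothetical names).
set.seed(1)
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 2 + 3 * d$x1 - d$x2 + rnorm(n)

fit <- lm(y ~ x1 + x2, data = d)
s <- summary(fit)

# Number of predictors, excluding the intercept.
p <- length(coef(fit)) - 1

# Adjusted R-squared penalises R-squared for the number of predictors.
adj_manual <- 1 - (1 - s$r.squared) * (n - 1) / (n - p - 1)

cat("R-squared:          ", round(s$r.squared, 4), "\n")
cat("Adjusted (by hand): ", round(adj_manual, 4), "\n")
cat("Adjusted (summary): ", round(s$adj.r.squared, 4), "\n")
```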
# Answer: Descriptive analytics helps us understand the historical trends and patterns in
medical premium pricing. From the data, we can see that the medical premium prices vary
widely, with some policies having significantly higher premiums than others. We can also
observe the distribution of premiums among different age groups, regions, or policy types.
# Answer: Diagnostic analytics helps us understand why certain trends or patterns exist in
medical premium pricing. For example, we can investigate whether there is a correlation
between age and premium price. We can use correlation analysis to determine if older
policyholders tend to pay higher premiums due to increased health risks.
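As an illustration of the correlation analysis mentioned above, the sketch below runs a Pearson correlation test between age and premium. The data are simulated and the variable names Age and PremiumPrice are assumptions. A significant correlation shows association only; it does not by itself establish that health risk is the cause.

```r
# Simulated ages and premiums with a built-in positive relationship.
set.seed(42)
n <- 300
Age <- sample(18:65, n, replace = TRUE)
PremiumPrice <- 8000 + 200 * Age + rnorm(n, sd = 3000)

# cor.test() uses Pearson's correlation by default.
ct <- cor.test(Age, PremiumPrice)

cat("Correlation r:", round(ct$estimate, 3), "\n")
cat("p-value:      ", signif(ct$p.value, 3), "\n")
cat("95% CI:       ", round(ct$conf.int, 3), "\n")
```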
# Answer: Predictive analytics allows us to make future predictions based on historical data. In
this context, we can build predictive models to estimate future medical premium prices. For
instance, we can use regression models to predict premium prices for new policies based on
variables like age, income, and region.
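A minimal sketch of this idea follows: a regression is fitted on historical policies and then used to price new ones. Everything here is simulated, and the columns Age, Income and Region as well as the region effects are assumptions for illustration. The prediction interval gives a range for each individual premium rather than a single point.

```r
# Simulated historical policies (assumed columns and effects).
set.seed(7)
n <- 400
hist_data <- data.frame(
  Age    = sample(18:65, n, replace = TRUE),
  Income = round(runif(n, 20000, 120000)),
  Region = factor(sample(c("East", "North", "South", "West"), n, replace = TRUE))
)
region_effect <- c(East = 0, North = 500, South = -300, West = 800)
hist_data$PremiumPrice <- 5000 + 150 * hist_data$Age + 0.05 * hist_data$Income +
  unname(region_effect[as.character(hist_data$Region)]) + rnorm(n, sd = 1500)

fit <- lm(PremiumPrice ~ Age + Income + Region, data = hist_data)

# Two hypothetical new policies; factor levels must match the training data.
new_policies <- data.frame(
  Age    = c(25, 60),
  Income = c(40000, 90000),
  Region = factor(c("North", "West"), levels = levels(hist_data$Region))
)

# Point predictions with 95% prediction intervals.
pred <- predict(fit, newdata = new_policies, interval = "prediction", level = 0.95)
print(round(pred))
```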
# Answer: Prescriptive analytics guides us on what actions to take to optimize medical premium
pricing. It can involve optimizing pricing strategies to attract more customers or minimize risk.
For instance, if the analysis shows that premiums are too high for a certain age group, we can
adjust pricing strategies to make policies more attractive to that demographic.
# Answer: We can further enhance this project by applying machine learning techniques. For
example, we can use decision tree algorithms to identify which factors have the most significant
impact on premium pricing. Additionally, clustering algorithms can help segment policyholders
into groups with similar characteristics, allowing for more tailored pricing strategies.
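The segmentation idea can be sketched with k-means, which is one possible clustering algorithm and not necessarily the one intended above. The data are simulated, the columns Age and Income are assumptions, and the choice of three segments is arbitrary.

```r
# Simulated policyholders (assumed columns).
set.seed(2024)
n <- 300
policyholders <- data.frame(
  Age    = sample(18:65, n, replace = TRUE),
  Income = round(runif(n, 20000, 120000))
)

# Scale the features so that Income, with its larger numbers,
# does not dominate the distance calculation.
scaled <- scale(policyholders)

# k-means with 3 clusters; nstart repeats the fit from several random starts.
km <- kmeans(scaled, centers = 3, nstart = 20)
policyholders$Segment <- factor(km$cluster)

# Average profile of each segment.
print(aggregate(cbind(Age, Income) ~ Segment, data = policyholders, FUN = mean))
```

In practice the number of segments would be chosen by inspecting how the within-cluster variation falls as clusters are added.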
Graphical Representation: -
1) Scatterplot Matrix-
This matrix of scatterplots can help visualize the relationships between variables.
Code:
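library(GGally)
# The plotting call is missing from the source; ggpairs() on the full data frame is an assumed reconstruction.
ggpairs(medical_premium)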
Age vs. Income: There is a positive linear relationship between Age and Income. As Age
increases, Income tends to rise.
Age vs. Premium Price: Although there is some variability, there is a subtle positive correlation
between Age and Premium Price.
Conclusion: Older individuals might have slightly higher medical insurance premiums, but the
relationship is not very strong.
Income vs. Premium Price: A clear positive relationship exists between Income and Premium
Price. As Income increases, Premium Price tends to increase.
Conclusion: Individuals with higher incomes tend to pay higher medical insurance premiums.
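The pairwise relationships described above can also be summarised numerically with a correlation matrix. The sketch below uses base R's pairs() in place of GGally and runs on simulated data, so the strengths of the relationships are built in by assumption and do not reproduce the assignment's results.

```r
# Simulated Age, Income and PremiumPrice with assumed relationships.
set.seed(11)
n <- 300
Age <- sample(18:65, n, replace = TRUE)
Income <- 20000 + 1200 * Age + rnorm(n, sd = 15000)
PremiumPrice <- 5000 + 40 * Age + 0.08 * Income + rnorm(n, sd = 2500)
sim <- data.frame(Age, Income, PremiumPrice)

# Base-R scatterplot matrix.
pairs(sim, main = "Scatterplot matrix (simulated data)")

# Pearson correlations for every pair of variables.
cor_mat <- cor(sim)
print(round(cor_mat, 2))
```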
2) Residual Plot: -
Code:
Diagram:
Analysis:
The narrow spread of residuals suggests that the model has relatively low variability in
its errors.
The absence of significant outliers indicates that the model's performance is generally
consistent across most data points.
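The statements about spread and outliers can be backed with numbers as well as read off a plot. The sketch below, on simulated data with assumed variable names, reports the residual spread relative to the average premium and counts standardized residuals beyond 3 in absolute value, which is a common rule of thumb rather than a fixed standard.

```r
# Simulated data and a simple linear fit (assumed names).
set.seed(5)
n <- 300
Age <- sample(18:65, n, replace = TRUE)
PremiumPrice <- 8000 + 200 * Age + rnorm(n, sd = 2000)
fit <- lm(PremiumPrice ~ Age)

res     <- resid(fit)       # raw residuals, in premium units
std_res <- rstandard(fit)   # standardized residuals

# Spread of the residuals relative to the average premium.
rel_spread <- sd(res) / mean(PremiumPrice)

# Potential outliers: standardized residuals beyond +/- 3.
n_outliers <- sum(abs(std_res) > 3)

cat("Residual SD:", round(sd(res), 1), "\n")
cat("Residual SD as share of mean premium:", round(rel_spread, 3), "\n")
cat("Points with |standardized residual| > 3:", n_outliers, "\n")
```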
3) Bar Graph:
A bar plot can be used to visualize the Mean Absolute Error (MAE) and Mean Absolute
Percentage Error (MAPE) calculated above side by side.
Code:
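library(ggplot2)
# Assumed reconstruction: the data frame and the ggplot() call are missing from the source.
metrics_df <- data.frame(Metric=c("MAE", "MAPE"), Value=c(MAE, MAPE))
ggplot(metrics_df, aes(x=Metric, y=Value)) +
geom_bar(stat="identity") +
theme_minimal()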
Diagram:
Analysis:
Both MAE and MAPE are standard metrics for evaluating the performance of regression
models. MAE is expressed in the units of the premium and MAPE is a percentage, so the two
bars are on different scales and their heights should not be compared with each other.
In the graph, both values appear relatively low. A low MAE indicates that, on average, the
model's predictions are close to the actual premium prices in absolute terms.
A low MAPE indicates that the predictions are, on average, within a small percentage of
the actual premium prices.
Overall, the bar graph suggests that the linear regression model built for medical
premium prediction is reasonably accurate, with low MAE and MAPE values.
4) Line Graph: -
Code:
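# Assuming you have a time-based or ordered variable in your dataset, replace "TimeVariable"
# with the actual variable name.
# The ggplot() call is missing from the source; mape_df is a hypothetical data frame
# holding one MAPE value per TimeVariable value.
ggplot(mape_df, aes(x=TimeVariable, y=MAPE)) +
geom_line() +
labs(
y="MAPE"
)+
theme_minimal()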
Diagram:
Analysis:
The linear regression model can be considered reasonable for predicting medical
premium prices based on the provided dataset.
The diagnostic plots and QQ plot indicate that the model assumptions are
approximately met.
The MAE and MAPE values provide a measure of the model's prediction accuracy, which
can be further used for model improvement or comparison with other models.
The data visualization (line graph) helps to identify how the model's prediction errors
vary across different categories or time periods, providing valuable insights for decision-
making.
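One way to obtain the per-period errors that such a line graph needs is to compute the absolute percentage error for each policy and average it within each period. The sketch below does this on simulated data; the Year column and all coefficients are assumptions, and the errors are in-sample, so they will look somewhat better than errors on unseen data.

```r
# Simulated policies over four years (assumed columns and effects).
set.seed(99)
n <- 400
d <- data.frame(
  Year = sample(2019:2022, n, replace = TRUE),
  Age  = sample(18:65, n, replace = TRUE)
)
d$PremiumPrice <- 8000 + 200 * d$Age + 300 * (d$Year - 2019) + rnorm(n, sd = 2000)

fit <- lm(PremiumPrice ~ Age + Year, data = d)

# Absolute percentage error for each policy.
d$ape <- abs((d$PremiumPrice - fitted(fit)) / d$PremiumPrice) * 100

# Average within each year to get one MAPE value per period.
mape_by_year <- aggregate(ape ~ Year, data = d, FUN = mean)
names(mape_by_year)[2] <- "MAPE"
print(mape_by_year)

# Base-R line graph of MAPE over time.
plot(mape_by_year$Year, mape_by_year$MAPE, type = "b",
     xlab = "Year", ylab = "MAPE (%)")
```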
5) Box Plot: -
Code:
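# The ggplot() call is missing from the source; mape_df and TimeVariable are hypothetical names,
# with several MAPE values for each TimeVariable group.
ggplot(mape_df, aes(x=factor(TimeVariable), y=MAPE)) +
geom_boxplot() +
labs(
y = "MAPE"
)+
theme_minimal()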
Diagram:
Analysis:
The example box plot above shows a gradual, roughly proportionate increase in the premium
in the second year.
Model's Residual Diagnostic: -
• Residual Plot
The model diagnostic plots above suggest that the linear regression model fits the data
reasonably well. The residuals scatter around the horizontal zero line with no systematic
pattern against the fitted values, their variance is roughly constant, and they are
approximately normally distributed.
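The constant-variance and normality claims can be checked with formal tests in addition to the plots. The sketch below uses simulated data with assumed variable names. It applies the Shapiro-Wilk test to the standardized residuals and computes the studentized (Koenker) form of the Breusch-Pagan statistic by hand, as n times the R-squared of a regression of the squared residuals on the predictors. In both tests a large p-value means there is no evidence against the assumption.

```r
# Simulated data with constant-variance, normal errors (assumed names).
set.seed(21)
n <- 300
Age <- sample(18:65, n, replace = TRUE)
Income <- round(runif(n, 20000, 120000))
PremiumPrice <- 5000 + 150 * Age + 0.05 * Income + rnorm(n, sd = 1500)
fit <- lm(PremiumPrice ~ Age + Income)

# Normality of the standardized residuals (Shapiro-Wilk test).
sw <- shapiro.test(rstandard(fit))

# Constant variance: regress squared residuals on the predictors.
# n * R-squared is compared with a chi-square distribution whose
# degrees of freedom equal the number of predictors (2 here).
sq_res <- resid(fit)^2
aux <- lm(sq_res ~ Age + Income)
bp_stat <- n * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = 2, lower.tail = FALSE)

cat("Shapiro-Wilk p-value: ", signif(sw$p.value, 3), "\n")
cat("Breusch-Pagan p-value:", signif(bp_p, 3), "\n")
```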
Glossary:
MAE (Mean Absolute Error): A metric used to measure the average absolute difference
between the predicted and actual values in a regression model. It quantifies the accuracy of the
model.
MAPE (Mean Absolute Percentage Error): A metric used to measure the average percentage
difference between the predicted and actual values in a regression model. It quantifies the
percentage accuracy of the model.
Residuals: The differences between the actual (observed) values and the predicted values in a
regression model. Residuals represent the model's errors.
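A small worked example ties the three glossary terms together. The numbers are made up purely for illustration.

```r
# Three made-up policies: actual and predicted premiums.
actual    <- c(100, 200, 400)
predicted <- c(110, 180, 400)

# Residuals: actual minus predicted, giving -10, 20, 0.
res <- actual - predicted

# MAE: average absolute residual, (10 + 20 + 0) / 3 = 10.
MAE <- mean(abs(res))

# MAPE: average absolute residual as a percentage of the actual value,
# (10% + 10% + 0%) / 3, which is about 6.67%.
MAPE <- mean(abs(res / actual)) * 100

cat("Residuals:", res, "\n")
cat("MAE:", MAE, "\n")
cat("MAPE (%):", round(MAPE, 2), "\n")
```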
Lessons Learnt:
Data Preparation is Crucial: Ensuring the data is clean, well-structured, and relevant is a
critical step. Data preprocessing, including handling missing values and outliers,
significantly impacts model performance.
Model Evaluation is Key: Evaluating the model using appropriate metrics is crucial. MAE
and MAPE were used in this project to assess prediction accuracy, but other metrics like
RMSE and R-squared should also be considered depending on the context.
Interpreting Results: Interpreting the model results is essential for drawing meaningful
conclusions. Understanding the coefficients and their significance helps explain the
relationships between variables.
Continuous Learning: The field of data science and machine learning is continually
evolving. Staying updated with the latest techniques, libraries, and best practices is
crucial for success in such projects.
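The RMSE and R-squared metrics mentioned under model evaluation above can be computed in a few lines. The sketch below uses made-up numbers. RMSE squares the errors before averaging, so it penalises large misses more heavily than MAE and is never smaller than it.

```r
# Four made-up policies: actual and predicted premiums.
actual    <- c(10, 20, 30, 40)
predicted <- c(12, 18, 33, 39)
err <- actual - predicted            # -2, 2, -3, 1

# RMSE: square root of the mean squared error, sqrt(18 / 4).
RMSE <- sqrt(mean(err^2))

# MAE for comparison: (2 + 2 + 3 + 1) / 4 = 2.
MAE <- mean(abs(err))

# R-squared: 1 minus (sum of squared errors / total sum of squares),
# here 1 - 18 / 500 = 0.964.
R2 <- 1 - sum(err^2) / sum((actual - mean(actual))^2)

cat("RMSE:", round(RMSE, 3), "\n")
cat("MAE: ", MAE, "\n")
cat("R-squared:", R2, "\n")
```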
References:
Dataset: Medical Premium Data (Dummy data)
R Documentation: R documentation for functions used in the analysis (e.g., lm, ggplot2)