IT Assignment Final

ASSIGNMENT-1
“R FOR INSURANCE DATA SCIENCE”
TOPIC: INSURANCE PREMIUM PREDICTION
Contents
Medical Insurance Premium Prediction
Model Architecture
Model's Summary
Graphical Representation
Model's Residual Diagnostic
Glossary
References
Group Members:
Vaishnavi Nade - 2224120
Sakshi Mall - 2224133
Dipanshu Rahtore - 2224128
Stuti Singh - 2224118
Medical Insurance Premium Prediction: -
A medical insurance premium is an upfront payment made on behalf of an individual or family
in order to keep their health insurance policy active.
Descriptive Analytics:
• What is the current distribution of medical premium prices in our dataset?
• Can we identify any trends or patterns in premium prices over the past few years?
• What are the key demographic characteristics (age, gender, etc.) of policyholders in our
dataset?
Diagnostic Analytics:
• Why have there been fluctuations in premium prices over the years?
• Are certain demographic groups associated with higher or lower premium prices, and if
so, why?
• What factors contribute the most to changes in premium prices?
Predictive Analytics:
• What is the likely trend in medical premium prices for the next year based on historical
data?
• Can we predict which policyholders are likely to experience substantial premium
increases in the coming year?
• What will be the impact of changes in policyholder demographics on future premium
pricing?
Discovery Analytics (Data Mining, ML):
• Are there hidden patterns or correlations in the data that can help improve premium
pricing accuracy?
• Can machine learning models provide more accurate premium predictions compared to
traditional statistical methods?
• What variables or features have the most significant influence on premium price
predictions?
Prescriptive Analytics (Decision Analytics):
• What should be done to mitigate significant premium price increases for specific
policyholders?
• Can we recommend personalized pricing strategies for policyholders based on their risk
profiles?
• How can we optimize pricing decisions to balance profitability and customer retention?
Business Problem: In the healthcare insurance industry, accurately estimating the premium
price for medical insurance policies is crucial. Accurate premium pricing ensures that insurance
companies charge policyholders an appropriate amount, maintaining profitability while offering
competitive rates.
Identify Data-Variables Requirements-
Data Requirements: To solve this problem, you need a dataset that includes the following
variables:
Age: Age of the policyholder.
Gender: Gender of the policyholder.
Other relevant policyholder information (e.g., health status, smoking status, etc.).
Premium Price: The actual premium price for medical insurance policies.
Model Architecture: -
Packages: readr, caTools, ggplot2, GGally
Import Data
library(readr)
medical_premium <- read_csv("Medicalpremium.csv")
medical_premium
# 1. Split data: train - test
library(caTools)
set.seed(123)
sample <- sample.split(medical_premium$PremiumPrice, SplitRatio = 0.80)
train <- subset(medical_premium, sample == TRUE)
test <- subset(medical_premium, sample == FALSE)
# Modeling
PremiumPrice_lm <- lm(PremiumPrice ~ . , data=train)
# Summary of the Linear Regression Model
summary(PremiumPrice_lm)
# Diagnostic Studies
# Plotting the regression line
# Residual Plot
PremiumPrice_lm_res <- resid(PremiumPrice_lm)
# Check assumptions: linearity, constant variance, and independence
plot(predict(PremiumPrice_lm, train), PremiumPrice_lm_res,
ylab="Residuals", xlab="Fitted Values",
main="Premium Price")
abline(0, 0)
# Draw using ggplot2
library(ggplot2)
PremiumPrice_lm_res_df <- data.frame(fitted_value=predict(PremiumPrice_lm, train),
residual=resid(PremiumPrice_lm))
ggplot(PremiumPrice_lm_res_df, aes(x=fitted_value, y=residual)) +
geom_point() +
geom_hline(yintercept=0, color='blue')+
labs(
title = 'Regressing Premium Price With ggplot2',
x = 'Fitted Values',
y = 'Residuals'
)
# QQ Plot for normality of residuals
PremiumPrice_lm_stdres = rstandard(PremiumPrice_lm)
qqnorm(PremiumPrice_lm_stdres,
ylab="Standardized Residuals",
xlab="Normal Scores",
main="Premium Price ")
qqline(PremiumPrice_lm_stdres)
# Prediction
predicted <- predict(PremiumPrice_lm, test)
actual_pred <- data.frame(cbind(actual=test$PremiumPrice, predicted=predicted))
predicted
# Model Evaluation
# MAE (Mean Absolute Error)
MAE <- mean(abs(actual_pred$actual - actual_pred$predicted))
MAE
# MAPE (Mean Absolute Percentage Error)
MAPE <- mean(abs((actual_pred$predicted - actual_pred$actual)) / actual_pred$actual)
MAPE
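Alongside MAE and MAPE, RMSE and an out-of-sample R-squared are common complementary checks. The sketch below is illustrative only: it uses simulated data with hypothetical columns (Age, PremiumPrice) and a base R split instead of the project's dataset and caTools, so the numbers it prints say nothing about the actual model.

```r
# Simulated stand-in for the premium data (hypothetical columns and values)
set.seed(123)
n <- 200
sim <- data.frame(Age = sample(18:66, n, replace = TRUE))
sim$PremiumPrice <- 15000 + 300 * sim$Age + rnorm(n, sd = 3000)

# 80/20 train-test split using base R only
train_idx <- sample(seq_len(n), size = round(0.8 * n))
train_sim <- sim[train_idx, ]
test_sim  <- sim[-train_idx, ]

fit  <- lm(PremiumPrice ~ Age, data = train_sim)
pred <- predict(fit, newdata = test_sim)

# RMSE penalises large errors more heavily than MAE
mae  <- mean(abs(test_sim$PremiumPrice - pred))
rmse <- sqrt(mean((test_sim$PremiumPrice - pred)^2))

# Out-of-sample R-squared: 1 - SSE / SST computed on the test set
sse <- sum((test_sim$PremiumPrice - pred)^2)
sst <- sum((test_sim$PremiumPrice - mean(test_sim$PremiumPrice))^2)
r2_test <- 1 - sse / sst

c(MAE = mae, RMSE = rmse, R2_test = r2_test)
```

RMSE is never smaller than MAE on the same predictions, so a large gap between the two hints at a few very large errors.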
Model's Summary: -
The model has an adjusted R-squared value of 0.6225, which indicates that about 62.25% of the
total variation in Premium Price is explained by the features, after adjusting for the number
of predictors.
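To make the adjusted R-squared concrete, the sketch below recomputes it by hand as 1 - (1 - R^2)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors, and compares it with the value reported by summary(). The data are simulated and the variable names (x1, x2, y) are hypothetical; it illustrates the formula and does not reproduce the 0.6225 above.

```r
# Simulated data with two predictors (hypothetical names)
set.seed(1)
n <- 100
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- 2 + 1.5 * sim$x1 - 0.5 * sim$x2 + rnorm(n)

fit <- lm(y ~ x1 + x2, data = sim)
s <- summary(fit)

p  <- length(coef(fit)) - 1   # number of predictors, excluding the intercept
r2 <- s$r.squared

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r2_manual <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

c(R2 = r2, adj_manual = adj_r2_manual, adj_summary = s$adj.r.squared)
```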
Additional Analysis Questions:
1. What has happened / What is happening (Descriptive Analytics)?
# Answer: Descriptive analytics helps us understand the historical trends and patterns in
medical premium pricing. From the data, we can see that the medical premium prices vary
widely, with some policies having significantly higher premiums than others. We can also
observe the distribution of premiums among different age groups, regions, or policy types.
2. Why (Diagnostic Analytics)?
# Answer: Diagnostic analytics helps us understand why certain trends or patterns exist in
medical premium pricing. For example, we can investigate whether there is a correlation
between age and premium price. We can use correlation analysis to determine if older
policyholders tend to pay higher premiums due to increased health risks.
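A minimal sketch of such a correlation check is given below. It uses simulated age and premium values (an assumption made only for illustration), so the resulting coefficient is not a finding about the project's dataset.

```r
# Simulated ages and premiums (hypothetical values)
set.seed(42)
n <- 150
age <- sample(18:66, n, replace = TRUE)
premium <- 15000 + 250 * age + rnorm(n, sd = 4000)

# Pearson correlation test (the default method of cor.test)
ct <- cor.test(age, premium)
ct$estimate   # sample correlation coefficient
ct$p.value    # p-value for the null hypothesis of zero correlation
```

A positive coefficient with a small p-value would support the idea that premiums rise with age, although correlation alone does not establish the cause.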
3. What is likely to happen (Predictive Analytics)?
# Answer: Predictive analytics allows us to make future predictions based on historical data. In
this context, we can build predictive models to estimate future medical premium prices. For
instance, we can use regression models to predict premium prices for new policies based on
variables like age, income, and region.
4. What should be done (Prescriptive Analytics)?
# Answer: Prescriptive analytics guides us on what actions to take to optimize medical premium
pricing. It can involve optimizing pricing strategies to attract more customers or minimize risk.
For instance, if the analysis shows that premiums are too high for a certain age group, we can
adjust pricing strategies to make policies more attractive to that demographic.
5. Additional Analysis (Data Mining, ML, etc.)?
# Answer: We can further enhance this project by applying machine learning techniques. For
example, we can use decision tree algorithms to identify which factors have the most significant
impact on premium pricing. Additionally, clustering algorithms can help segment policyholders
into groups with similar characteristics, allowing for more tailored pricing strategies.
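The clustering idea could be sketched with k-means from base R, as below. The data frame, its columns, and the choice of three segments are all assumptions made for illustration; the decision tree approach is not covered here.

```r
# Simulated policyholders (hypothetical columns and values)
set.seed(7)
n <- 120
policyholders <- data.frame(
  Age = sample(18:66, n, replace = TRUE),
  PremiumPrice = round(rnorm(n, mean = 24000, sd = 6000))
)

# Scale so that neither variable dominates the distance calculation
scaled <- scale(policyholders)

# k-means with 3 segments; nstart repeats the random initialisation
km <- kmeans(scaled, centers = 3, nstart = 10)
policyholders$segment <- factor(km$cluster)

# Average profile of each segment in the original units
aggregate(cbind(Age, PremiumPrice) ~ segment, data = policyholders, FUN = mean)
```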
Graphical Representation: -
1) Scatterplot Matrix-
This matrix of scatterplots can help visualize the relationships between variables.
Code:
# Load necessary libraries
library(GGally)
# Create a scatterplot matrix
ggpairs(medical_premium[, c("Age", "Income", "PremiumPrice")])
Analysis of Scatterplot Matrix:
Age vs. Income: There is a positive linear relationship between Age and Income. As Age
increases, Income tends to rise.
Conclusion: Older individuals generally have higher incomes.
Age vs. Premium Price: Although there is some variability, there is a subtle positive correlation
between Age and Premium Price.
Conclusion: Older individuals might have slightly higher medical insurance premiums, but the
relationship is not very strong.
Income vs. Premium Price: A clear positive relationship exists between Income and Premium
Price. As Income increases, Premium Price tends to increase.
Conclusion: Individuals with higher incomes tend to pay higher medical insurance premiums.
2) Histogram of Premium Prices (Descriptive Analytics):
A histogram helps you understand the distribution of premium prices.
Code:
# Create a histogram of Premium Prices
hist(medical_premium$PremiumPrice, main = "Premium Price Distribution",
xlab = "Premium Price")
Diagram:
Analysis:
• The distribution of residuals is approximately normal, which is a positive indicator for
the validity of linear regression assumptions.
• The residuals are centered around zero, indicating that, on average, the model's
predictions are accurate.
• The narrow spread of residuals suggests that the model has relatively low variability in
its errors.
• The absence of significant outliers indicates that the model's performance is generally
consistent across most data points.
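The observations above concern the model residuals, whereas the histogram code plots PremiumPrice itself. One way a residual histogram could be produced is sketched below on simulated data (the columns and values are hypothetical, not the project's dataset).

```r
# Simulated stand-in for the premium data (hypothetical columns and values)
set.seed(99)
n <- 200
sim <- data.frame(Age = sample(18:66, n, replace = TRUE))
sim$PremiumPrice <- 15000 + 300 * sim$Age + rnorm(n, sd = 3000)

fit <- lm(PremiumPrice ~ Age, data = sim)
res <- resid(fit)

# Histogram of residuals with a reference line at zero
h <- hist(res, main = "Residual Distribution", xlab = "Residual")
abline(v = 0, lty = 2)

mean(res)  # effectively zero for least squares with an intercept
sd(res)    # spread of the errors, in premium-price units
```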
3) Bar Graph:
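A bar plot can be used to visualize the Mean Absolute Error (MAE) and Mean Absolute
Percentage Error (MAPE) calculated above. Note that MAE is expressed in premium-price units
while MAPE is a proportion, so the two bars are on very different scales and are not
directly comparable.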
Code:
# Create a bar plot for MAE and MAPE
library(ggplot2)
metrics <- c("MAE", "MAPE")
values <- c(MAE, MAPE)
data <- data.frame(metrics, values)
ggplot(data, aes(x=metrics, y=values, fill=metrics)) +
geom_bar(stat="identity") +
labs(title="Model Evaluation Metrics", y="Value") +
theme_minimal()
Diagram:
Analysis:
• Both MAE and MAPE are important metrics for evaluating the performance of
regression models.
• Looking at the graph, it appears that the MAE and MAPE values are relatively low, which
is a positive sign. This suggests that the model's predictions are reasonably accurate in
terms of both absolute error and percentage error.
• The relatively low MAE indicates that, on average, the model's predictions are close to
the actual premium prices. This suggests that the model is effective in estimating
premium prices.
• The MAPE being low indicates that the model's predictions are, on average, within a
small percentage of the actual premium prices. This indicates that the model's
percentage accuracy is also good.
• Overall, the bar graph suggests that the linear regression model you've built for medical
premium prediction is performing well in terms of accuracy, with low MAE and MAPE
values.
4) Line Graph: -
Code:
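# Assuming the test set has a time-based or ordered variable; replace "TimeVariable"
# with the actual variable name
# Per-row absolute percentage error (the scalar MAPE cannot be aggregated by group)
actual_pred$APE <- abs(actual_pred$predicted - actual_pred$actual) / actual_pred$actual
actual_pred$TimeVariable <- test$TimeVariable
line_data <- aggregate(APE ~ TimeVariable, data=actual_pred, FUN=mean)
# Create a line graph of the mean absolute percentage error per group
ggplot(line_data, aes(x=TimeVariable, y=APE)) +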
geom_line() +
labs(
title="Mean Absolute Percentage Error (MAPE) Over Time",
x="Time or Ordered Categories",
y="MAPE"
)+
theme_minimal()
Diagram:
Analysis:
• The linear regression model can be considered reasonable for predicting medical
premium prices based on the provided dataset.
• The diagnostic plots and QQ plot indicate that the model assumptions are
approximately met.
• The MAE and MAPE values provide a measure of the model's prediction accuracy, which
can be further used for model improvement or comparison with other models.
• The data visualization (line graph) helps to identify how the model's prediction errors
vary across different categories or time periods, providing valuable insights for
decision-making.
5) Box Plot: -
Code:
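# Create a box plot of the per-row absolute percentage errors (MAPE is their mean)
actual_pred$APE <- abs(actual_pred$predicted - actual_pred$actual) / actual_pred$actual
ggplot(actual_pred, aes(y = APE)) +
geom_boxplot() +
labs(
title = "Box Plot of Absolute Percentage Errors",
y = "Absolute Percentage Error"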
)+
theme_minimal()
Diagram:
Analysis:
The example box plot above shows a gradual, proportionate increase in the premium in the
second year.
Model's Residual Diagnostic: -
• Residual Plot
The model diagnostic plots above suggest that the linear regression model fits the data
reasonably well. The residuals show no systematic pattern against the fitted values
(consistent with a linear relationship), they have roughly constant variance, and they are
approximately normally distributed.
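These visual checks can be reproduced with R's built-in diagnostic plots and complemented with a numerical test. The sketch below assumes simulated data with hypothetical columns, and the Shapiro-Wilk test is an addition that the original analysis did not use.

```r
# Simulated stand-in for the premium data (hypothetical columns and values)
set.seed(2024)
n <- 200
sim <- data.frame(Age = sample(18:66, n, replace = TRUE))
sim$PremiumPrice <- 15000 + 300 * sim$Age + rnorm(n, sd = 3000)
fit <- lm(PremiumPrice ~ Age, data = sim)

# Built-in diagnostic plots: 1 = residuals vs fitted, 2 = normal Q-Q
par(mfrow = c(1, 2))
plot(fit, which = 1:2)
par(mfrow = c(1, 1))

# Shapiro-Wilk test on standardized residuals;
# a small p-value suggests departure from normality
sw <- shapiro.test(rstandard(fit))
sw$p.value
```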
Glossary:
MAE (Mean Absolute Error): A metric used to measure the average absolute difference
between the predicted and actual values in a regression model. It quantifies the accuracy of the
model.
MAPE (Mean Absolute Percentage Error): A metric used to measure the average percentage
difference between the predicted and actual values in a regression model. It quantifies the
percentage accuracy of the model.
Residuals: The differences between the actual values and the predicted values (actual minus
predicted) in a regression model. Residuals represent the model's errors.
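A small worked example of these three definitions is given below. The four actual and predicted values are made up purely to keep the arithmetic easy to follow.

```r
# Made-up values chosen so that every prediction is off by exactly 10%
actual    <- c(100, 200, 400, 500)
predicted <- c(110, 180, 360, 550)

residuals_ex <- actual - predicted         # -10, 20, 40, -50
mae  <- mean(abs(residuals_ex))            # (10 + 20 + 40 + 50) / 4 = 30
mape <- mean(abs(residuals_ex) / actual)   # each row is 0.1, so the mean is 0.1

c(MAE = mae, MAPE = mape)
```

MAE is in the same units as the premium, while MAPE is a proportion (0.1 corresponds to 10%).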
Lessons Learnt:
• Data Preparation is Crucial: Ensuring the data is clean, well-structured, and relevant is a
critical step. Data preprocessing, including handling missing values and outliers,
significantly impacts model performance.
• Model Selection Matters: Choosing the appropriate machine learning algorithm or
model is essential. In this project, linear regression was used, but it's important to
consider alternative algorithms when dealing with complex or non-linear relationships.
• Model Evaluation is Key: Evaluating the model using appropriate metrics is crucial. MAE
and MAPE were used in this project to assess prediction accuracy, but other metrics like
RMSE and R-squared should also be considered depending on the context.
• Assumptions and Diagnostics: Checking and validating the assumptions of linear
regression, such as normality of residuals and linearity, is important. Diagnostic plots
help identify issues in the model.
• Interpreting Results: Interpreting the model results is essential for drawing meaningful
conclusions. Understanding the coefficients and their significance helps explain the
relationships between variables.
• Continuous Learning: The field of data science and machine learning is continually
evolving. Staying updated with the latest techniques, libraries, and best practices is
crucial for success in such projects.
References:
Dataset: Medical Premium Data (Dummy data)
R Documentation: R documentation for functions used in the analysis (e.g., lm, ggplot2)
Online articles and tutorials (e.g., GitHub, YouTube, DataCamp).