4-R Code and PPT - Predicting Medical Expenses Using Linear Regression - New Without Prerequsit

Dr.
Vivek Tiwari, ABV-IIITM, Gwalior

Focus of the Today’s
Discussion
Predictio
n
Line Medi
ar cal
Reg. Exp.
Points for today’s Discussion
 Understanding Dataset (medical

Expenses), Code: R
 Fundamental of Linear Regression,
Code: R
 Training
Dr. Vivek Tiwari, ABV-IIITM, Gwalior

Step 1 – collecting data
For this analysis, we will use a simulated dataset
(insurance.csv) containing hypothetical medical
expenses for patients in the United States.
https://drive.google.com/file/d/
1QN5TSG3pJHo8v05gVsPKbvLwOyyaDy8t/view?usp=sharing

The insurance.csv file includes 1,338 examples of
beneficiaries currently enrolled in the insurance plan,
with features indicating characteristics of the patient
as:
• age: An integer indicating the age of the

primary beneficiary (excluding those above 64
years, since they are generally covered by the
government).
• Sex/Gender: The policy holder's gender, either

male or female.
• bmi: The body mass index (BMI), which provides a

sense of how over- or under-weight a person is
relative to their height. BMI is equal to weight
(in kilograms) divided by height (in meters)
squared. An ideal BMI is within the range of 18.5
to 24.9.
• children: An integer indicating the number of

children/dependents covered by the insurance
plan.
• smoker: A yes or no categorical variable that

indicates whether the insured regularly smokes
tobacco.
• region: The beneficiary's place of residence in

the US, divided into four geographic regions:
northeast, southeast, southwest, or northwest.
• Expenses: Yearly medical expenditure
Step 2 – Exploring and Preparing

the Data [using R]
Rstudio download
https://www.rstudio.com/products/rstudio/download/#download
getwd ()
(see the current working directory)
setwd("E:/Dell laptop back up/ laptop /Invited Talk ")
(set your own working directory where your all stuff is saved)
insurance <- read.csv("insurance.csv")

str(insurance)
'data.frame': 1338 obs. of 7 variables:

$ age : int 19 18 28 33 32 31 46 37 37 60 ...
$ sex : chr "female" "male" "male" "male" ...
$ bmi : num 27.9 33.8 33 22.7 28.9 25.7 33.4 27.7 29.8 25.8
$ children: int 0 1 3 0 0 0 1 3 2 0 ...
$ smoker : chr "yes" "no" "no" "no" ...
$ region : chr "southwest" "southeast" "southeast" "northwest"
$ expenses: num 16885 1726 4449 21984 3867 ...
insurance <- read.csv("insurance.csv",

stringsAsFactors = TRUE)
str(insurance)
'data.frame': 1338 obs. of 7 variables:

$ age : int 19 18 28 33 32 31 46 37 37 60 ...
$ sex : Factor w/ 2 levels "female","male": 1 2 2 2 2 1 1 1 2 1
$ bmi : num 27.9 33.8 33 22.7 28.9 25.7 33.4 27.7 29.8 25.8
$ children: int 0 1 3 0 0 0 1 3 2 0 ...
$ smoker : Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 1 1 1 1 ...
$ region : Factor w/ 4 levels "northeast","northwest",..: 4 3 3 2
$ expenses: num 16885 1726 4449 21984 3867 ...
table(insurance$region)
northeast northwest southeast southwest

324 325 364 325
summary(insurance)
age sex bmi children smoker region
Min. :18.00 female:662 Min. :16.00 Min. :0.000 no :1064 northeast:324
1st Qu.:27.00 male :676 1st Qu.:26.30 1st Qu.:0.000 yes: 274 northwest:325
Median :39.00 Median :30.40 Median :1.000 southeast:364
Mean :39.21 Mean :30.67 Mean :1.095 southwest:325
3rd Qu.:51.00 3rd Qu.:34.70 3rd Qu.:2.000
Max. :64.00 Max. :53.10 Max. :5.000
expenses
Min. : 1122
1st Qu.: 4740
Median : 9382
Mean :13270
3rd Qu.:16640
Max. :63770
summary(insurance$expenses)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1122 4740 9382 13270 16640 63770

Model Training
install.packages("carnet", dependencies = c("Depends", "Suggests"))
library(caret)
lr_model <- lm(expenses ~ age + children + bmi +

sex +smoker + region, data = insurance)
Following command is equivalent to the preceding command:
model <- lm(expenses ~ ., data = insurance)

lr_model
Call:
lm(formula = expenses ~ age + children + bmi + sex + smoker +
region, data = insurance)
Coefficients:
(Intercept) age children bmi sexmale
-11941.6 256.8 475.7 339.3 -131.4
smokeryes regionnorthwest regionsoutheast regionsouthwest
23847.5 -352.8 -1035.6 -959.3
 The coefficient of age term in our model is saying

that for every 1 year increase in the age of a
person, the required expense goes up by 256.8$
 It is saying that for every 1 children (dependent)

increase in family of a person, the required
expense goes up by 475.7$
 It is saying that for every 1 unit increase in the

bmi of a person, the required expense goes up
by 339.9$
summary (lr_model)
Call:
lm(formula = expenses ~ age + children + bmi + sex + smoker +
Residuals:
Min 1Q Median 3Q Max
-11302.7 -2850.9 -979.6 1383.9 29981.7

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -11941.6 987.8 -12.089 < 2e-16
***
age 256.8 11.9 21.586 < 2e-16 ***
children 475.7 137.8 3.452 0.000574 ***
bmi 339.3 28.6 11.864 < 2e-16 ***
sexmale -131.3 332.9 -0.395 0.693255
smokeryes 23847.5 413.1 57.723 < 2e-16 ***
regionnorthwest -352.8 476.3 -0.741 0.458976
regionsoutheast -1035.6 478.7 -2.163 0.030685 *
regionsouthwest -959.3 477.9 -2.007 0.044921 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’
0.1 ‘ ’ 1
Residual standard error: 6062 on 1329 degrees of

freedom, Multiple R-squared: 0.7509, Adjusted
R-squared: 0.7494, F-statistic: 500.9 on 8 and
1329 DF, p-value: < 2.2e-16
err = residuals(lr_model)
> hist(err) // should be Gaussian curve

1. adding non-linear relationships
insurance$age2 <- insurance$age^2
lr_model1 <- lm(expenses ~ age2 + children + bmi + sex

+ smoker + region,data = insurance)
summary (lr_model1)
Call:
lm(formula = expenses ~ age2 + children + bmi + sex + +smoker +
Residuals:
-11426.9 -2857.9 -935.7 1307.2 30662.4
Coefficients:
(Intercept) -7551.8397 925.4338 -8.160 7.67e-16 ***
age2 3.2535 0.1476 22.039 < 2e-16 ***
children 613.0908 136.9329 4.477 8.21e-06 ***
bmi 335.6486 28.4557 11.795 < 2e-16 ***
sexmale -136.4240 331.1067 -0.412 0.6804

smokeryes 23857.6299 410.8883 58.064 < 2e-16 ***
regionsouthwest -956.8930 475.3003 -2.013 0.0443 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘
’ 1
Residual standard error: 6029 on 1329 degrees of freedom

Multiple R-squared: 0.7536, Adjusted R-squared:
0.7522 F-statistic: 508.2 on 8 and 1329 DF, p-value: < 2.2e-16
lr_model2 <- lm(expenses ~ age+age2 + children + bmi +

sex + smoker + region, data = insurance)
adding interaction effects

What if certain features have a combined impact on the dependent
variable? For instance, smoking and obesity may have harmful effects
separately, but it is reasonable to assume that their combined effect may
be worse than the sum of each one alone. When two features have a
combined effect, this is known as an interaction (bmi*smoker).
lr_model5 <- lm(expenses ~ age+age2 + children +bmi +

sex + bmi*smoker + region, data = insurance)
summary (lr_model5)
Call:
lm(formula = expenses ~ age + age2 + children + bmi + sex + +bmi *
smoker + region, data = insurance)
Residuals:
-17297.1 -1656.0 -1262.7 -727.8 24161.6
Coefficients:
(Intercept) 139.0053 1363.1359 0.102 0.918792
age -32.6181 59.8250 -0.545 0.585690
age2 3.7307 0.7463 4.999 6.54e-07 ***
children 678.6017 105.8855 6.409 2.03e-10 ***

bmi 119.7715 34.2796 3.494 0.000492 ***
sexmale -496.7690 244.3713 -2.033 0.042267 *
bmi30 -997.9355 422.9607 -2.359 0.018449 *
smokeryes 13404.5952 439.9591 30.468 < 2e-16 ***
regionsouthwest -1222.1619 350.5314 -3.487 0.000505 ***
bmi30:smokeryes 19810.1534 604.6769 32.762 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4445 on 1326 degrees of

freedom, Multiple R-squared: 0.8664, Adjusted R-
squared: 0.8653, F-statistic: 781.7 on 11 and 1326
DF, p-value: < 2.2e-16


4-R Code and PPT - Predicting Medical Expenses Using Linear Regression - New Without Prerequsit

Uploaded by

Copyright:

Available Formats

You might also like

4-R Code and PPT - Predicting Medical Expenses Using Linear Regression - New Without Prerequsit

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

4-R Code and PPT - Predicting Medical Expenses Using Linear Regression - New Without Prerequsit

Uploaded by

Copyright:

Available Formats

Dr.

Vivek Tiwari, ABV-IIITM, Gwalior

Points for today’s Discussion

 Understanding Dataset (medical

Dr. Vivek Tiwari, ABV-IIITM, Gwalior

Dr. Vivek Tiwari, ABV-IIITM, Gwalior

• age: An integer indicating the age of the

• Sex/Gender: The policy holder's gender, either

• bmi: The body mass index (BMI), which provides a

• children: An integer indicating the number of

• smoker: A yes or no categorical variable that

• region: The beneficiary's place of residence in

• Expenses: Yearly medical expenditure

Step 2 – Exploring and Preparing

insurance <- read.csv("insurance.csv")

'data.frame': 1338 obs. of 7 variables:

insurance <- read.csv("insurance.csv",

'data.frame': 1338 obs. of 7 variables:

Dr. Vivek Tiwari, ABV-IIITM, Gwalior

northeast northwest southeast southwest

Dr. Vivek Tiwari, ABV-IIITM, Gwalior

lr_model <- lm(expenses ~ age + children + bmi +

Following command is equivalent to the preceding command:

model <- lm(expenses ~ ., data = insurance)

Dr. Vivek Tiwari, ABV-IIITM, Gwalior

 The coefficient of age term in our model is saying

 It is saying that for every 1 children (dependent)

 It is saying that for every 1 unit increase in the

Dr. Vivek Tiwari, ABV-IIITM, Gwalior

Residual standard error: 6062 on 1329 degrees of

Dr. Vivek Tiwari, ABV-IIITM, Gwalior

insurance$age2 <- insurance$age^2

lr_model1 <- lm(expenses ~ age2 + children + bmi + sex

Dr. Vivek Tiwari, ABV-IIITM, Gwalior

Residual standard error: 6029 on 1329 degrees of freedom

lr_model2 <- lm(expenses ~ age+age2 + children + bmi +

adding interaction effects

lr_model5 <- lm(expenses ~ age+age2 + children +bmi +

Dr. Vivek Tiwari, ABV-IIITM, Gwalior

Residual standard error: 4445 on 1326 degrees of

Dr. Vivek Tiwari, ABV-IIITM, Gwalior

You might also like