
2023

Project Report

Insurance claim data


Syed Ameer .S
OBJECTIVES AND INTERPRETATIONS

1.Identify the categorical and continuous variables.
(a)Identify the categorical and continuous variables.
Categorical variables:
• Sex categorized as “male”, “female”.
• Smoker categorized as “no”, “yes”.
• Region categorized as “northeast”, “northwest”, “southeast”, “southwest”.
Continuous variables:
Age
BMI
Children (a discrete count from 0 to 5, treated as a numeric variable here)
Charges
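
The split above can be checked mechanically by looking at the type of value each column holds. The following is a minimal sketch, assuming the data is available as a list of dictionaries; the function name `classify_columns` and the sample rows are hypothetical and are not taken from the report's data.

```python
# Illustrative rows only: hypothetical values, not taken from the report's data.
rows = [
    {"age": 25, "sex": "male", "bmi": 24.5, "children": 1,
     "smoker": "no", "region": "northwest", "charges": 3200.50},
    {"age": 47, "sex": "female", "bmi": 31.2, "children": 3,
     "smoker": "yes", "region": "southeast", "charges": 41250.75},
]


def classify_columns(records):
    """Split column names into categorical (text) and numeric (int/float)."""
    categorical, numeric = [], []
    for name in records[0]:
        values = [record[name] for record in records]
        # bool is a subclass of int, so exclude it explicitly.
        is_numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool)
            for v in values
        )
        (numeric if is_numeric else categorical).append(name)
    return categorical, numeric


categorical, numeric = classify_columns(rows)
print("Categorical:", categorical)
print("Numeric:", numeric)
```

With these sample rows, sex, smoker and region are reported as categorical and the other four columns as numeric.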

(b)Make Histograms and box plots (univariate analysis) for continuous variables
and do a correlation analysis (multivariate analysis).
Histogram for Age:

Fig.1.1.1

Histogram for BMI:

Fig.1.1.2

Histogram for Children:

Fig.1.1.3

Histogram for Charges:

Fig.1.1.4

Boxplot for Age:

Fig.1.2.1

Boxplot for BMI:

Fig.1.2.2

Boxplot for Children:

Fig.1.2.3

Boxplot for Charges:

Fig.1.2.4.

Correlation Analysis:

Correlation Output

             Age     bmi     children   charges($)
Age          1
bmi          0.11    1
children     0.04    0.01    1
charges($)   0.30    0.20    0.07       1
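
Each entry in the correlation output is a Pearson correlation coefficient between two columns. A minimal sketch of that calculation is shown here; the function name `pearson_r` and the two short lists are hypothetical illustrations, not the report's data or its spreadsheet method.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / math.sqrt(var_x * var_y)


# Hypothetical values chosen so that charges rise exactly linearly with age.
ages = [20, 30, 40, 50, 60]
charges = [2000.0, 3000.0, 4000.0, 5000.0, 6000.0]
r = pearson_r(ages, charges)
print(round(r, 2))
```

A value near 1 or -1 indicates a strong linear relationship, and a value near 0 indicates little or no linear relationship, which is how the coefficients in the output above are read.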

(c)Make relevant pivot table and charts for:
i. Male/Female ratio and share information on which gender has more smokers.

Sex / smoker     Count of smoker (% of total)
female            49.48%
    no            40.88%
    yes            8.59%
male              50.52%
    no            38.64%
    yes           11.88%
Grand Total      100.00%
Table.1.1

[Chart: share of claimants by sex and smoker status. Female: no 40.88%, yes 8.59%. Male: no 38.64%, yes 11.88%.]

Fig.1.3.1

ii. Charges vs Age:

Age            Avg of charges($)   Max of charges($)
18-22               8375.01            44501.40
23-27              10244.95            42112.24
28-32              10326.03            58571.07
33-37              13081.74            55135.40
38-42              10902.09            43896.38
43-47              16357.43            62592.87
48-52              15404.69            60021.40
53-57              16510.40            63770.43
58-62              19094.35            52590.83
63-67              21542.59            49577.66
Grand Total        13270.42            63770.43
Table.1.2
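
The age groups in Table.1.2 are five-year bands starting at 18. A minimal sketch of how such bands and their average and maximum charges could be produced is given here; the helper names `age_band` and `summarize_by_band` and the sample records are hypothetical, and the report itself used a pivot table rather than code.

```python
from collections import defaultdict


def age_band(age, start=18, width=5):
    """Return a label such as '18-22' for the band that contains `age`."""
    if age < start:
        raise ValueError("age is below the first band")
    low = start + ((age - start) // width) * width
    return f"{low}-{low + width - 1}"


def summarize_by_band(records):
    """records: iterable of (age, charge). Returns {band: (average, maximum)}."""
    groups = defaultdict(list)
    for age, charge in records:
        groups[age_band(age)].append(charge)
    return {band: (sum(v) / len(v), max(v)) for band, v in groups.items()}


# Hypothetical records, not taken from the report's data.
records = [(18, 1000.0), (22, 3000.0), (23, 2500.0), (64, 9000.0)]
summary = summarize_by_band(records)
for band in sorted(summary):
    average, maximum = summary[band]
    print(f"{band}: avg {average:.2f}, max {maximum:.2f}")
```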

[Chart: Avg of charges($) and Max of charges($) by age group, 18-22 through 63-67.]

Fig.1.3.2

iii.Charges vs BMI:

BMI            Average of charges($)
15.96-19.96         8838.56
19.96-23.96         9680.81
23.96-27.96        11498.75
27.96-31.96        12677.40
31.96-35.96        14842.74
35.96-39.96        17022.31
39.96-43.96        16847.49
43.96-47.96        16959.11
47.96-51.96         7750.77
51.96-55.96        22832.43
Grand Total        13270.42
Table.1.3

[Chart: average of charges($) by BMI range, 15.96-19.96 through 51.96-55.96.]

Fig.1.3.3

iv. Charges for Smokers vs Non-smoker:
Smoker         Max of charges($)   Average of charges($)
no                 36910.61              8434.27
yes                63770.43             32050.23
Grand Total        63770.43             13270.42
Table.1.4

[Chart: Max of charges($) and Average of charges($) for non-smokers (no) and smokers (yes).]

Fig.1.3.4

(d)Region-wise smokers vs Non-smokers analysis with one or more pivot table
and charts.

Smoker / Region    Count of smoker
no                 1064
    northeast       257
    northwest       267
    southeast       273
    southwest       267
yes                 274
    northeast        67
    northwest        58
    southeast        91
    southwest        58
Grand Total        1338
Table.1.5

[Chart: count of non-smokers (no) and smokers (yes) by region. no: northeast 257, northwest 267, southeast 273, southwest 267. yes: northeast 67, northwest 58, southeast 91, southwest 58.]

Fig.1.4
Interpretation:
The southeast region has the highest number of smokers (91) and also the
highest number of non-smokers (273). The lowest number of smokers is found in
the northwest and southwest regions (58 each). The lowest number of non-smokers
is found in the northeast region (257).
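
Because the regions differ in size, the raw counts can be complemented by the share of smokers within each region. The following is a minimal sketch of that arithmetic using the counts from Table.1.5; the helper name `smoker_share` is hypothetical.

```python
# Counts copied from Table.1.5.
counts = {
    "northeast": {"no": 257, "yes": 67},
    "northwest": {"no": 267, "yes": 58},
    "southeast": {"no": 273, "yes": 91},
    "southwest": {"no": 267, "yes": 58},
}


def smoker_share(region_counts):
    """Fraction of a region's claimants who are smokers."""
    total = region_counts["no"] + region_counts["yes"]
    return region_counts["yes"] / total


shares = {region: smoker_share(c) for region, c in counts.items()}
for region, share in shares.items():
    print(f"{region}: {share:.1%}")
```

On these counts the southeast has the largest share of smokers (25.0%), followed by the northeast (about 20.7%), with the northwest and southwest at about 17.8% each.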

(e)Region-wise charges for Smokers vs non-smokers.
Region / smoker    Average of charges($)
northeast          13406.38452
    no              9165.531672
    yes            29673.53647
northwest          12417.57537
    no              8556.463715
    yes            30192.00318
southeast          14735.41144
    no              8032.216309
    yes            34844.99682
southwest          12346.93738
    no              8019.284513
    yes            32269.06349
Grand Total        13270.42227
Table.1.6

[Chart: average of charges($) for non-smokers (no) and smokers (yes) in each region: northeast, northwest, southeast, southwest.]

Fig.1.5

Interpretation:
The southeast region has the highest average charge for smokers (34845.00).
The northeast region has the highest average charge for non-smokers (9165.53).
In every region, the average charge for non-smokers is far lower than the
average charge for smokers.
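
The size of the gap between smokers and non-smokers can be expressed as a ratio of the two averages in each region. This is a minimal sketch of that calculation using the values from Table.1.6; the variable names are hypothetical.

```python
# Average charges copied from Table.1.6.
average_charges = {
    "northeast": {"no": 9165.531672, "yes": 29673.53647},
    "northwest": {"no": 8556.463715, "yes": 30192.00318},
    "southeast": {"no": 8032.216309, "yes": 34844.99682},
    "southwest": {"no": 8019.284513, "yes": 32269.06349},
}

# Ratio of the smoker average to the non-smoker average, per region.
ratios = {
    region: values["yes"] / values["no"]
    for region, values in average_charges.items()
}
for region, ratio in ratios.items():
    print(f"{region}: smokers are charged {ratio:.1f}x the non-smoker average")
```

The ratios range from roughly 3.2 in the northeast to roughly 4.3 in the southeast.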

(f)Has charges got something to do with the number of dependents?
Children       Average of charges($)
0              12365.9756
1              12731.17183
2              15073.56373
3              15355.31837
4              13850.65631
5               8786.035247
Grand Total    13270.42227
Table.1.7

[Chart: average of charges($) by number of children, 0 through 5.]

Fig.1.6

Interpretation:
The line chart shows that the average charge varies with the number of
dependents, but not in a steady direction. The average rises from 12365.98 for
claimants with no children to a peak of 15355.32 for those covering 3 children,
then falls to 13850.66 for 4 children and to a low of 8786.04 for 5 children.

(g)Do a similar dependants-charges analysis, Region-wise.
Average of charges($) by number of children (rows) and region (columns)
Children      northeast      northwest      southeast      southwest      Grand Total
0             11626.46266    11324.37092    14309.86838    11938.50499    12365.9756
1             16310.2064     10230.25631    13687.04197    10406.48495    12731.17183
2             13615.15272    13464.31469    15728.47062    17483.48556    15073.56373
3             14409.9133     17786.16067    18449.84602    10402.44226    15355.31837
4             14485.19312    11347.01873    14451.02397    14933.26053    13850.65631
5             6978.973483    8965.79575     10115.44154    8444.158625    8786.035247
Grand Total   13406.38452    12417.57537    14735.41144    12346.93738    13270.42227
Table.1.8

[Chart: average of charges($) by number of children and region (northeast, northwest, southeast, southwest).]

Fig.1.7
Interpretation:
The highest average charge (18449.85) is for claimants covering 3 children in
the southeast region. The lowest average charge (6978.97) is for claimants
covering 5 children in the northeast region.

(h)Do at least one more pivot table and chart of your own choice on the remaining
variables.
Age vs Smoker:
Count of smoker by age group
Smoker        18-22  23-27  28-32  33-37  38-42  43-47  48-52  53-57  58-62  63-67  Grand Total
no              175    110    107     96    108    103    119    114     99     33         1064
yes              47     30     28     31     23     38     25     20     20     12          274
Grand Total     222    140    135    127    131    141    144    134    119     45         1338
Table.1.9
[Chart: count of non-smokers (no) and smokers (yes) by age group, 18-22 through 63-67.]

Fig.1.8

Interpretation:
The 18-22 age group has the highest count of both smokers (47) and non-smokers
(175). The 63-67 age group has the lowest count of both smokers (12) and
non-smokers (33). From the 48-52 age group onward, the counts of both smokers
and non-smokers generally decline.

(i)Give your understanding from the patterns observed in point (b)
Histogram for Age:
Most claimants are aged between 18 and 22. The fewest claimants are aged
between 62 and 66, as shown in Fig.1.1.1.
Histogram for BMI:
Most claimants have a BMI between 29 and 31, followed by a BMI between 27 and
29. Very few claimants have a BMI between 44 and 54, as shown in Fig.1.1.2.
Histogram for Children:
Most claimants cover 0 to 1 children/dependents. Few claimants cover 4 to 5
children, as shown in Fig.1.1.3.
Histogram for Charges:
Most claims carry a charge between 1100 and 4900. Very few claims carry a
charge between 50500 and 65700.
Boxplot for Age:
The boxplot shows that the middle 50% of claimants are aged between 26 and 51,
so 75% of claimants are younger than 51. The median age is 39: half of the
claimants are older than 39.
Boxplot for BMI:
The boxplot shows that the middle 50% of claimants have a BMI between 26.2 and
34.7, so 75% of claimants have a BMI below 34.7. The median BMI is 30.4: half
of the claimants have a BMI above 30.4.
Boxplot for Children:
The boxplot in Fig.1.2.3 shows the number of children covered, which ranges
from 0 to 5 (Table.1.7). As the histogram indicates, most claimants sit at the
low end, covering 0 to 1 children.
Boxplot for Charges:
The boxplot in Fig.1.2.4 shows the charges, which reach a maximum of 63770.43
against an average of 13270.42 (Table.1.2). Together with the histogram, this
indicates a long upper tail: most claims are small and a few are very large.
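
The box in a boxplot runs from the first quartile to the third quartile, with the median marked inside it. A minimal sketch of how those three values can be computed is given here, assuming Python's standard `statistics` module; the helper name `box_summary` and the sample values are hypothetical, and the report does not state which quartile convention its charts used.

```python
import statistics


def box_summary(values, method="exclusive"):
    """Quartiles that define the box of a boxplot, plus the interquartile range."""
    q1, median, q3 = statistics.quantiles(values, n=4, method=method)
    return {"q1": q1, "median": median, "q3": q3, "iqr": q3 - q1}


# Hypothetical sample, not taken from the report's data.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9]
exclusive = box_summary(sample, method="exclusive")
inclusive = box_summary(sample, method="inclusive")
print("exclusive:", exclusive)
print("inclusive:", inclusive)
```

The two conventions give slightly different box edges for the same data, which is one reason quartiles read off a chart may not match values computed elsewhere.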

Interpretation for correlation:
1. Age vs Age: 1 - perfect positive correlation.
2. Age vs BMI: 0.11 - no correlation.
3. Age vs Children: 0.04 - no correlation.
4. Age vs Charges: 0.30 - weak positive correlation.
5. BMI vs BMI: 1 - perfect positive correlation.
6. BMI vs Children: 0.01 - no correlation.
7. BMI vs Charges: 0.20 - very weak positive correlation.
8. Children vs Children: 1 - perfect positive correlation.
9. Children vs Charges: 0.07 - no correlation.
10. Charges vs Charges: 1 - perfect positive correlation.

(j)Give your interpretation for observations made in point (c).
i. Male/Female ratio and share information on which gender has more smokers.
Non-smoking females make up 40.88% of all claimants and non-smoking males
38.64%, so there is little difference between the sexes among non-smokers.
Smoking males make up 11.88% of all claimants and smoking females 8.59%, so
males have more smokers, although the difference is again small. Overall, the
male/female split is close to even (50.52% male, 49.48% female).
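
The percentages in Table.1.1 are shares of all claimants. To compare the sexes directly, they can be converted into the share of smokers within each sex. This is a minimal sketch of that arithmetic using the table's values; the helper name `smoker_rate` is hypothetical.

```python
# Shares of all claimants (in percent), copied from Table.1.1.
share_of_total = {
    "female": {"no": 40.88, "yes": 8.59},
    "male": {"no": 38.64, "yes": 11.88},
}


def smoker_rate(shares):
    """Fraction of one sex that smokes, from that sex's shares of the total."""
    return shares["yes"] / (shares["no"] + shares["yes"])


rates = {sex: smoker_rate(shares) for sex, shares in share_of_total.items()}
for sex, rate in rates.items():
    print(f"{sex}: {rate:.1%} smokers")
```

On these figures about 17.4% of females and about 23.5% of males are smokers.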
ii. Charges vs Age:
The average charge is highest for the 63-67 age group (21542.59) and
lowest for the 18-22 age group (8375.01). The maximum charge is highest for the
53-57 age group (63770.43). In general, the average charge increases with age,
although not at every step: it dips for the 38-42 and 48-52 age groups.
iii.Charges vs BMI:
The average charge is highest for the BMI range 51.96-55.96 (22832.43)
and lowest for the BMI range 47.96-51.96 (7750.77); among the remaining ranges,
the lowest average is for 15.96-19.96 (8838.56). In general, the average charge
increases with BMI up to the 35.96-39.96 range and then levels off.
iv. Charges for Smokers vs Non-smoker:
Both the average charge (32050.23) and the maximum charge (63770.43) are
higher for smokers. For non-smokers, both the average charge (8434.27) and the
maximum charge (36910.61) are lower.
3.Do a descriptive summary analysis for the edited data. Perform a Multiple Linear
Regression analysis to identify which variables decide the insurance charges/billed
insurance claim. Give your interpretation for the above analysis, do another set of
regression analysis by dropping insignificant variables, if needed.
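
A regression needs numeric predictors, so sex, smoker and region have to be recoded before the model is fitted. A minimal sketch of one possible encoding follows. The function name `encode_row`, the sample row, and the 0/1 coding direction for sex and smoker are assumptions; the three region dummies (northwest, southeast, southwest, which leaves northeast as the baseline) follow the variable names used in this report.

```python
def encode_row(row):
    """Turn one record into the numeric predictors used by a regression."""
    return {
        "age": row["age"],
        "sex": 1 if row["sex"] == "male" else 0,        # assumption: male = 1
        "bmi": row["bmi"],
        "children": row["children"],
        "smoker": 1 if row["smoker"] == "yes" else 0,   # assumption: yes = 1
        # northeast is the baseline: all three region dummies are 0 for it.
        "northwest": 1 if row["region"] == "northwest" else 0,
        "southeast": 1 if row["region"] == "southeast" else 0,
        "southwest": 1 if row["region"] == "southwest" else 0,
    }


# Hypothetical record, not taken from the report's data.
example = {"age": 40, "sex": "female", "bmi": 28.0, "children": 2,
           "smoker": "yes", "region": "southeast"}
encoded = encode_row(example)
print(encoded)
```

Leaving one region out as the baseline avoids perfect collinearity among the region dummies; each region coefficient is then read relative to the northeast.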

Interpretation of the summary analysis by skewness, which shows the direction
in which the data are distributed.
1.For Age (0.056), the skewness is positive. The distribution is nearly
symmetrical.
2.For Sex (-0.021), the skewness is negative. The distribution is nearly
symmetrical.
3.For BMI (0.284), the skewness is positive. The distribution is nearly
symmetrical.
4.For Children (0.938), the skewness is positive. The distribution is
moderately skewed.
5.For Smoker (1.465), the skewness is positive. The distribution is highly
skewed.
6.For northwest (1.200) and southwest (1.200), the skewness is positive for
both. The distribution is highly skewed in both cases. Skewness is not a
meaningful measure here because both are dummy variables.
7.For southeast (1.026), the skewness is positive. The distribution is highly
skewed. Skewness is not a meaningful measure here because this is a dummy
variable.
8.For charges (1.516), the skewness is positive. The distribution is highly
skewed.
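
The labels above follow a common rule of thumb based on the size of the skewness value. A minimal sketch is given here, assuming cut-offs of 0.5 and 1.0 and the adjusted sample skewness formula that spreadsheet SKEW functions commonly use; the function names, the cut-offs, and the choice of formula are assumptions, since the report does not state how its values were computed.

```python
import math


def sample_skewness(values):
    """Adjusted sample skewness: n / ((n-1)(n-2)) * sum(((x - mean) / s)^3)."""
    n = len(values)
    if n < 3:
        raise ValueError("need at least 3 values")
    mean = sum(values) / n
    s = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    if s == 0:
        raise ValueError("skewness is undefined for constant data")
    return n / ((n - 1) * (n - 2)) * sum(((v - mean) / s) ** 3 for v in values)


def describe_skew(skew):
    """Return (direction, shape) using assumed cut-offs of 0.5 and 1.0."""
    size = abs(skew)
    if size < 0.5:
        shape = "approximately symmetric"
    elif size <= 1.0:
        shape = "moderately skewed"
    else:
        shape = "highly skewed"
    direction = "positive" if skew > 0 else "negative" if skew < 0 else "none"
    return direction, shape


# Skewness values as reported in this section.
reported = {"age": 0.056, "bmi": 0.284, "children": 0.938, "charges": 1.516}
for name, value in reported.items():
    print(name, describe_skew(value))
```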
Interpretation of the multiple regression model by significance (p-values).
• The p-value for Age is 7.78E-89, which is effectively 0. Because the
p-value is less than the alpha level of 0.05, we reject the null hypothesis
and accept the alternative hypothesis. It means that there is a
relationship between Age and Charges. Age is statistically significant in
the model.

• The p-value for Sex is 0.534. Because the p-value is greater than the
alpha level of 0.05, we fail to reject the null hypothesis. It means that the
model gives no evidence of a relationship between Sex and Charges. Sex is not
statistically significant in the model.

• The p-value for BMI is 6.50E-31, which is effectively 0. Because the
p-value is less than the alpha level of 0.05, we reject the null hypothesis
and accept the alternative hypothesis. It means that there is a
relationship between BMI and Charges. BMI is statistically significant in
the model.
• The p-value for Children is 5.77E-04 (0.000577). Because the p-value is
less than the alpha level of 0.05, we reject the null hypothesis and accept
the alternative hypothesis. It means that there is a relationship between
Children and Charges. Children is statistically significant in the model.
• The p-value for Smoker is reported as 0.00E+00, which is effectively 0.
Because the p-value is less than the alpha level of 0.05, we reject the null
hypothesis and accept the alternative hypothesis. It means that there is a
relationship between Smoker and Charges. Smoker is statistically significant
in the model.
• The p-value for northwest is 4.59E-01 (0.459). Because the p-value is
greater than the alpha level of 0.05, we fail to reject the null hypothesis.
It means that the model gives no evidence of a relationship between
northwest and Charges. northwest is not statistically significant in the
model.
• The p-value for southeast is 3.08E-02 (0.0308). Because the p-value is
less than the alpha level of 0.05, we reject the null hypothesis and accept
the alternative hypothesis. It means that there is a relationship between
southeast and Charges. southeast is statistically significant in the
model.
• The p-value for southwest is 4.48E-02 (0.0448). Because the p-value is
less than the alpha level of 0.05, we reject the null hypothesis and accept
the alternative hypothesis. It means that there is a relationship between
southwest and Charges. southwest is statistically significant in the
model.
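
The decision rule applied to each variable above is the same comparison of a p-value with the alpha level. A minimal sketch of that rule, using the p-values quoted in this section, is shown here; the names `ALPHA`, `p_values` and `is_significant` are hypothetical.

```python
ALPHA = 0.05  # significance level used in this section

# p-values as quoted in this section (Smoker is reported as 0.00E+00).
p_values = {
    "age": 7.78e-89,
    "sex": 0.534,
    "bmi": 6.50e-31,
    "children": 5.77e-04,
    "smoker": 0.0,
    "northwest": 4.59e-01,
    "southeast": 3.08e-02,
    "southwest": 4.48e-02,
}


def is_significant(p_value, alpha=ALPHA):
    """Reject the null hypothesis when the p-value is below alpha."""
    return p_value < alpha


significant = [name for name, p in p_values.items() if is_significant(p)]
not_significant = [name for name, p in p_values.items() if not is_significant(p)]
print("Significant:", significant)
print("Not significant:", not_significant)
```

With these values, sex and northwest are the only variables that are not significant at the 0.05 level, so they are the candidates to drop in a second regression.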
