
INDIVIDUAL ASSIGNMENT

Group 5
MAS202
Applied Statistics for Business

Lecturer: Khong Van Hai


Class: IB1802
Group members:

Name                     Student ID   Assigned Work
Hà Ngô Khánh Linh        HS170110     Source - Descriptive Statistics - ANOVA
Bùi Chí Đạt              HS176136     Confidence Interval
Nguyễn Trung Thành       HS170167     Confidence Interval
Nguyễn Thị Thu Phương    HS170700     Hypothesis Testing
Đặng Tuấn Nam            HS171236     Simple Linear Regression - Multiple Regression

TABLE OF CONTENTS

PART A
I. Introduction & Methodology
a. Topic
b. Main issue
c. Identify the two continuous variables
d. Identify the population
e. Identify the sample
II. Descriptive Statistics Results
a. Demographics Information
b. Descriptive Statistics

PART B
I. Summarize descriptive statistics result
II. Inferential Statistics
a. Confidence interval
b. Hypothesis testing
c. ANOVA
d. Regression

Reference

PART A
Part I: Introduction & Methodology
a. Topic
Insurance charge prediction based on BMI index.
b. Main issue
Our analysis addresses whether the insurance charge depends on the BMI index.
Our questions are how large the insurance charges of people in the US are and
whether the BMI index affects personal medical costs in the US.
c. Identify the two continuous variables
The charge and BMI are both numerical continuous variables.
The charge could depend on BMI; therefore, the charge is a dependent variable
and BMI is an independent variable.
We choose these two variables because we are curious about the relationship
between them.
d. Identify the population

The population is all people in the US.

e. Identify the sample

The sample is 1338 people randomly selected in the US.

Our data is secondary data.
Part II: Descriptive Statistics Results
a. Demographics Information
age: age of the primary beneficiary.

sex: gender of the insurance contractor (female, male).

bmi: body mass index (kg/m^2), the ratio of weight to height squared; an objective
index of whether body weight is high or low relative to height.

children: number of children covered by health insurance.

smoker: whether the person smokes.

region: the beneficiary's residential area in the US (northeast, southeast,
southwest, northwest).

charges: individual medical costs billed by health insurance.

b. Descriptive Statistics
i. A table with the measures of Central Tendency

Statistic             charges          bmi
Mean                  13270.42         30.66
Standard Error        331.0674543      0.166714
Median                9382.03          30.40
Mode                  1639.56          32.30
Standard Deviation    12110.01         6.10
Sample Variance       146652372.15     37.19
Kurtosis              1.606298653      -0.05073
Skewness              1.515879658      0.284047
Range                 62648.55         37.17
Minimum               1121.87          15.96
Maximum               63770.43         53.13
Sum                   17755824.99      41027.63
Count                 1338             1338
Key Findings
The mean charge in the sample is $13270.42.
The median charge in the sample is $9382.03.
The mode charge in the sample is $1639.56.
The range of charge in the sample is $62648.55.
The variance of charge in the sample is 146652372.15.
The standard deviation of charge in the sample is $12110.01.

The mean BMI in the sample is 30.66.
The median BMI in the sample is 30.40.
The mode BMI in the sample is 32.30.
The range of BMI in the sample is 37.17.
The variance of BMI in the sample is 37.19.
The standard deviation of BMI in the sample is 6.10.
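As a quick consistency check on the table above, the standard error of the mean is the standard deviation divided by the square root of the count. The sketch below recomputes it from the rounded table values. The function name `standard_error` is our own, and small differences from the table come from rounding the standard deviation.

```python
import math


def standard_error(std_dev: float, n: int) -> float:
    """Standard error of the mean: S / sqrt(n)."""
    return std_dev / math.sqrt(n)


# Rounded values taken from the descriptive statistics table.
se_charges = standard_error(12110.01, 1338)
se_bmi = standard_error(6.10, 1338)

print(round(se_charges, 2))  # close to the table's 331.07
print(round(se_bmi, 4))      # close to the table's 0.1667
```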
ii. Graphs

Key Findings (charges)
The shape is right-skewed.

Key Findings (charges)
The first, second and third quartiles are 4733.64, 9382.03 and 16687.36 respectively.
There are many outliers.
The shape is right-skewed.

Key Findings (BMI)
The shape is quite symmetric.

Key Findings (BMI)
The first, second and third quartiles are 26.27, 30.40 and 34.70 respectively.
There are some outliers.
The shape is quite symmetric.
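The outlier findings can be cross-checked against the reported quartiles. The sketch below assumes the common box-plot rule that flags values more than 1.5 times the interquartile range beyond the quartiles. The report does not state which rule its charts used, so the multiplier and the function name `iqr_fences` are assumptions.

```python
def iqr_fences(q1: float, q3: float, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) fences at k * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles, minimum and maximum as reported in this section.
charges_low, charges_high = iqr_fences(4733.64, 16687.36)
bmi_low, bmi_high = iqr_fences(26.27, 34.7)

# A value beyond a fence is flagged as an outlier under this rule.
charges_has_high_outliers = 63770.43 > charges_high
bmi_has_high_outliers = 53.13 > bmi_high
bmi_has_low_outliers = 15.96 < bmi_low

print(round(charges_high, 2), round(bmi_high, 3))
```

Under this rule both variables have outliers on the high side only, which is consistent with the positive skewness reported in the table.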

PART B
Part I: Summarize descriptive statistics result
Our topic is the relationship between BMI, smoking and medical costs. The purpose of
the study is to find a model that relates medical costs to BMI and smoking. Our data
is secondary data. The data includes 1338 records. Each record includes fields for
age, gender, BMI, number of children, smoking status, region and medical expenses.
This data sample has an average medical cost of $13270.42 with a standard deviation
of $12110.01 and an average BMI index of 30.66 with a standard deviation of 6.10; it
includes 274 smoking people and 1064 non-smoking people.
In the simple linear regression model, the medical cost is the dependent variable
and BMI is the independent variable. In addition, we built a multiple regression
model with the medical cost as the dependent variable and the BMI index and smoking
factor as the independent variables.
Part II: Inferential Statistics
a. Confidence interval
i. Problem 1: Construct a 95% confidence interval for the true mean
medical charges.

We use the following formula to calculate a 95% confidence interval for the true mean
medical charges.
$$\bar{x} - t_{\alpha/2,\,n-1}\frac{S}{\sqrt{n}} \le \mu \le \bar{x} + t_{\alpha/2,\,n-1}\frac{S}{\sqrt{n}}$$
We collect a sample of 1338 people with the following information:

• Sample size n = 1338
• Sample mean x̄ = 13270.42
• Sample standard deviation S = 12110.01
The confidence level we choose is 95%, so the critical value is t_{0.025, 1337} = 1.96.
Hence, the 95% confidence interval for the true mean medical charges is (12620.95,
13919.89).
We are 95% confident that the interval (12620.95, 13919.89) contains the true
population mean medical charges.
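The interval can be reproduced from the summary statistics alone. The sketch below is a minimal illustration, not the spreadsheet calculation used in the report: the function name `mean_confidence_interval` is our own, and the critical value is passed in as the rounded 1.96 quoted in the text, so the bounds differ from the reported ones by less than one dollar.

```python
import math


def mean_confidence_interval(mean, std_dev, n, critical_value):
    """Interval mean +/- critical_value * S / sqrt(n)."""
    margin = critical_value * std_dev / math.sqrt(n)
    return mean - margin, mean + margin


# Summary statistics for medical charges (Problem 1).
lower, upper = mean_confidence_interval(13270.42, 12110.01, 1338, 1.96)
print(round(lower, 2), round(upper, 2))
```

The same function applies to the other single-mean problems in this part by swapping in each subgroup's mean, standard deviation, size and critical value.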
ii. Problem 2: Construct a 95% confidence interval for the true mean BMI
index.

We use the following formula to calculate a 95% confidence interval for the true mean
BMI index.
$$\bar{x} - t_{\alpha/2,\,n-1}\frac{S}{\sqrt{n}} \le \mu \le \bar{x} + t_{\alpha/2,\,n-1}\frac{S}{\sqrt{n}}$$
We collect a sample of 1338 people with the following information:

• Sample size n = 1338
• Sample mean x̄ = 30.66
• Sample standard deviation S = 6.10
The confidence level we choose is 95%, so the critical value is t_{0.025, 1337} = 1.96.
Hence, the 95% confidence interval for the true mean BMI index is (30.34, 30.99).
We are 95% confident that the interval (30.34, 30.99) contains the true population
mean BMI index.

iii. Problem 3: Construct a 95% confidence interval for the true mean
medical charges of smoking people.

We use the following formula to calculate a 95% confidence interval for the true mean
medical charges of smoking people.
$$\bar{x} - t_{\alpha/2,\,n-1}\frac{S}{\sqrt{n}} \le \mu \le \bar{x} + t_{\alpha/2,\,n-1}\frac{S}{\sqrt{n}}$$
In the data sample, there are 274 smoking people with the following information:

• Sample size n = 274
• Sample mean x̄ = 32050.23
• Sample standard deviation S = 11541.55
The confidence level we choose is 95%, so the critical value is t_{0.025, 273} = 1.97.
Hence, the 95% confidence interval for the true mean medical charges of smoking people
is (30677.56, 33422.90).
We are 95% confident that the interval (30677.56, 33422.90) contains the true
population mean medical charges of smoking people.

iv. Problem 4: Construct a 95% confidence interval for the true mean
medical charges of non-smoking people.

We use the following formula to calculate a 95% confidence interval for the true mean
medical charges of non-smoking people.
$$\bar{x} - t_{\alpha/2,\,n-1}\frac{S}{\sqrt{n}} \le \mu \le \bar{x} + t_{\alpha/2,\,n-1}\frac{S}{\sqrt{n}}$$
In the data sample, there are 1064 non-smoking people with the following information:

• Sample size n = 1064
• Sample mean x̄ = 8434.27
• Sample standard deviation S = 5993.78
The confidence level we choose is 95%, so the critical value is t_{0.025, 1063} = 1.96.
Hence, the 95% confidence interval for the true mean medical charges of non-smoking
people is (8073.71, 8794.82).
We are 95% confident that the interval (8073.71, 8794.82) contains the true
population mean medical charges of non-smoking people.

v. Problem 5: Construct a 95% confidence interval for the true mean
medical charges of obese people.

We use the following formula to calculate a 95% confidence interval for the true mean
medical charges of obese people.
$$\bar{x} - t_{\alpha/2,\,n-1}\frac{S}{\sqrt{n}} \le \mu \le \bar{x} + t_{\alpha/2,\,n-1}\frac{S}{\sqrt{n}}$$
In the data sample, there are 707 obese people with the following information:

• Sample size n = 707
• Sample mean x̄ = 15552.34
• Sample standard deviation S = 14552.32
The confidence level we choose is 95%, so the critical value is t_{0.025, 706} = 1.96.
Hence, the 95% confidence interval for the true mean medical charges of obese people is
(14477.81, 16626.86).
We are 95% confident that the interval (14477.81, 16626.86) contains the true
population mean medical charges of obese people.

vi. Problem 6: Construct a 95% confidence interval for the true mean
medical charges of non-obese people.

We use the following formula to calculate a 95% confidence interval for the true mean
medical charges of non-obese people.
$$\bar{x} - t_{\alpha/2,\,n-1}\frac{S}{\sqrt{n}} \le \mu \le \bar{x} + t_{\alpha/2,\,n-1}\frac{S}{\sqrt{n}}$$
In the data sample, there are 631 non-obese people with the following information:

• Sample size n = 631
• Sample mean x̄ = 10713.67
• Sample standard deviation S = 7843.54
The confidence level we choose is 95%, so the critical value is t_{0.025, 630} = 1.96.
Hence, the 95% confidence interval for the true mean medical charges of non-obese
people is (10100.50, 11326.84).
We are 95% confident that the interval (10100.50, 11326.84) contains the true
population mean medical charges of non-obese people.

vii. Problem 7: Construct a 95% confidence interval on the difference in the
true means of medical charges between smoking people and non-smoking
people.

We use the following formula to calculate a 95% confidence interval on the difference
in the true means of medical charges between smoking people and non-smoking
people.
$$(\bar{x}_1 - \bar{x}_2) - t_{\alpha/2,\,n_1+n_2-2}\sqrt{S_p^2\left(\frac{1}{n_1}+\frac{1}{n_2}\right)} \le \mu_1 - \mu_2 \le (\bar{x}_1 - \bar{x}_2) + t_{\alpha/2,\,n_1+n_2-2}\sqrt{S_p^2\left(\frac{1}{n_1}+\frac{1}{n_2}\right)}$$
where the pooled variance is
$$S_p^2 = \frac{(n_1-1)S_1^2 + (n_2-1)S_2^2}{n_1+n_2-2}$$

Here is the summary data for each sample:

n1 = 274, x̄1 = 32050.23, S1 = 11541.55

n2 = 1064, x̄2 = 8434.27, S2 = 5993.78

Hence, the 95% confidence interval on the difference in the true means of medical
charges between smoking people and non-smoking people is (22623.17, 24608.75).
We are 95% confident that the interval (22623.17, 24608.75) contains the difference
in the true means of medical charges between smoking people and non-smoking people.
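The pooled-variance interval for a difference in means can be sketched from the two sets of summary statistics. This is an illustration under the assumption of equal population variances, with function names of our own choosing. The smoker mean 32050.23 is the one given in Problem 3, and the rounded critical value 1.96 is an assumption, so the bounds land within about a dollar of the reported interval.

```python
import math


def pooled_variance(n1, s1, n2, s2):
    """Pooled variance with n1 + n2 - 2 degrees of freedom."""
    return ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)


def difference_ci(mean1, s1, n1, mean2, s2, n2, critical_value):
    """Confidence interval for mu1 - mu2 assuming equal variances."""
    sp2 = pooled_variance(n1, s1, n2, s2)
    std_err = math.sqrt(sp2 * (1 / n1 + 1 / n2))
    diff = mean1 - mean2
    margin = critical_value * std_err
    return diff - margin, diff + margin


# Smoking (group 1) versus non-smoking (group 2) medical charges.
lower, upper = difference_ci(32050.23, 11541.55, 274,
                             8434.27, 5993.78, 1064, 1.96)
print(round(lower, 2), round(upper, 2))
```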

viii. Problem 8: Construct a 95% confidence interval on the difference in the
true means of medical charges between obese people and non-obese
people.

We use the following formula to calculate a 95% confidence interval on the difference
in the true means of medical charges between obese people and non-obese people.
$$(\bar{x}_1 - \bar{x}_2) - t_{\alpha/2,\,n_1+n_2-2}\sqrt{S_p^2\left(\frac{1}{n_1}+\frac{1}{n_2}\right)} \le \mu_1 - \mu_2 \le (\bar{x}_1 - \bar{x}_2) + t_{\alpha/2,\,n_1+n_2-2}\sqrt{S_p^2\left(\frac{1}{n_1}+\frac{1}{n_2}\right)}$$
where the pooled variance is
$$S_p^2 = \frac{(n_1-1)S_1^2 + (n_2-1)S_2^2}{n_1+n_2-2}$$

Here is the summary data for each sample:

n1 = 707, x̄1 = 15552.34, S1 = 14552.32

n2 = 631, x̄2 = 10713.67, S2 = 7843.54

Hence, the 95% confidence interval on the difference in the true means of medical
charges between obese people and non-obese people is (3563.32, 6114.02).
We are 95% confident that the interval (3563.32, 6114.02) contains the difference in
the true means of medical charges between obese people and non-obese people.

b. Hypothesis testing
i. Problem 1: Test the hypothesis that the true mean BMI index is greater
than 30.

H0: μ ≤ 30

H1: μ > 30

Test statistic is
$$t = \frac{\bar{x}-\mu}{S/\sqrt{n}} = \frac{30.66-30}{6.10/\sqrt{1338}} = 3.98$$
p-value = P(t > 3.98) = 0.00
p-value < α ⇒ Reject H0

There is sufficient evidence that the true mean BMI index is greater than 30.
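The test statistic can be recomputed from the descriptive statistics table. The sketch below uses the unrounded mean (Sum divided by Count) and the tabulated standard error. Plugging in the rounded 30.66 and 6.10 instead gives about 3.96. The p-value uses a normal approximation to the t distribution, which is an assumption that is reasonable for n = 1338. The function name `t_statistic` is our own.

```python
from statistics import NormalDist


def t_statistic(sample_mean, hypothesized_mean, std_error):
    """One-sample test statistic (x_bar - mu0) / SE."""
    return (sample_mean - hypothesized_mean) / std_error


# Sum / Count and Standard Error for bmi, from the descriptive table.
sample_mean = 41027.63 / 1338
t_stat = t_statistic(sample_mean, 30, 0.166714)

# Upper-tail p-value via the normal approximation (assumption).
p_value = 1 - NormalDist().cdf(t_stat)

print(round(t_stat, 2))
print(p_value < 0.001)
```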
ii. Problem 2: Test the hypothesis that the true mean medical charges of
smoking people exceeds the true mean medical charges of non-smoking
people.

H0: μ1 − μ2 ≤ 0

H1: μ1 − μ2 > 0

From the Excel output:

Test statistic is 46.66
p-value = 0.00
p-value < α ⇒ Reject H0

There is sufficient evidence that the true mean medical charges of smoking people
exceeds the true mean medical charges of non-smoking people.
iii. Problem 3: Test the hypothesis that the true mean medical charges of
obese people exceeds the true mean medical charges of non-obese people.

H0: μ1 − μ2 ≤ 0

H1: μ1 − μ2 > 0

From the Excel output:

Test statistic is 7.44
p-value = 0.00
p-value < α ⇒ Reject H0

There is sufficient evidence that the true mean medical charges of obese people
exceeds the true mean medical charges of non-obese people.
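The two test statistics read from the Excel output can be reproduced from the group summaries given in the confidence interval part. The sketch below assumes a pooled-variance (equal variances) t test. That assumption is ours, although it does reproduce both reported values. The function name `pooled_t_statistic` is hypothetical.

```python
import math


def pooled_t_statistic(mean1, s1, n1, mean2, s2, n2):
    """Two-sample t statistic for H0: mu1 - mu2 = 0 with pooled variance."""
    sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    std_err = math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / std_err


# Problem 2: smoking versus non-smoking people.
t_smoking = pooled_t_statistic(32050.23, 11541.55, 274,
                               8434.27, 5993.78, 1064)
# Problem 3: obese versus non-obese people.
t_obese = pooled_t_statistic(15552.34, 14552.32, 707,
                             10713.67, 7843.54, 631)

print(round(t_smoking, 1), round(t_obese, 1))
```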

iv. Problem 4: Test the hypothesis that the true proportion of smoking males
exceeds the proportion of smoking females.

H0: π1 − π2 ≤ 0

H1: π1 − π2 > 0

Test statistic is
$$Z = \frac{(p_1 - p_2) - (\pi_1 - \pi_2)}{\sqrt{\bar{p}(1-\bar{p})\left(\frac{1}{n_1}+\frac{1}{n_2}\right)}} = \frac{(0.52-0.48)-0}{\sqrt{0.5(1-0.5)\left(\frac{1}{274}+\frac{1}{274}\right)}} = 0.85$$

p-value = P(Z > 0.85) = 0.20

p-value > α ⇒ Fail to reject H0

There is insufficient evidence that the true proportion of smoking males exceeds the
proportion of smoking females.

v. Problem 5: Test the hypothesis that the true proportion of obese males is
less than the proportion of obese females.

H0: π1 − π2 ≥ 0

H1: π1 − π2 < 0

Test statistic is
$$Z = \frac{(p_1 - p_2) - (\pi_1 - \pi_2)}{\sqrt{\bar{p}(1-\bar{p})\left(\frac{1}{n_1}+\frac{1}{n_2}\right)}} = \frac{(0.51-0.49)-0}{\sqrt{0.5(1-0.5)\left(\frac{1}{707}+\frac{1}{707}\right)}} = 0.69$$

Because the alternative hypothesis is lower-tailed, the p-value is the area to the
left of the test statistic:

p-value = P(Z < 0.69) = 0.75

p-value > α ⇒ Fail to reject H0

There is insufficient evidence that the true proportion of obese males is less than the
proportion of obese females.
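The direction of the alternative hypothesis decides which tail of the standard normal distribution gives the p-value. The sketch below takes the Z statistics reported for Problems 4 and 5 as given and converts each to a p-value. The function name `z_p_value` and its `alternative` labels are our own.

```python
from statistics import NormalDist


def z_p_value(z, alternative):
    """One-sided p-value for a Z statistic.

    alternative="greater" uses the upper tail (H1: pi1 - pi2 > 0);
    alternative="less" uses the lower tail (H1: pi1 - pi2 < 0).
    """
    cdf = NormalDist().cdf(z)
    if alternative == "greater":
        return 1 - cdf
    if alternative == "less":
        return cdf
    raise ValueError("alternative must be 'greater' or 'less'")


p_problem4 = z_p_value(0.85, "greater")  # upper-tailed test
p_problem5 = z_p_value(0.69, "less")     # lower-tailed test

print(round(p_problem4, 2), round(p_problem5, 2))
```

Both p-values are well above any conventional significance level, so neither null hypothesis is rejected.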

c. ANOVA
We perform one-way ANOVA to determine if there is a difference in mean
medical charges between the four regions.

H0: μ1 = μ2 = μ3 = μ4

H1: Not all population means are equal

From the Excel output:

Test statistic is 2.97
p-value = 0.03
p-value < α ⇒ Reject H0

There is sufficient evidence of a difference in mean medical charges between the
four regions.
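The F statistic in a one-way ANOVA is the ratio of the between-group mean square to the within-group mean square. The sketch below shows the computation on a small made-up data set, since the regional charges are not listed in this report. The numbers in `toy_groups` are purely illustrative and are not the insurance data, and the function name is our own.

```python
from statistics import mean


def one_way_anova_f(groups):
    """F = (SS_between / (k - 1)) / (SS_within / (N - k))."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Hypothetical toy data: three groups of three observations each.
toy_groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat = one_way_anova_f(toy_groups)
print(round(f_stat, 2))
```

With the real data, the four groups would be the charges in each region, and the resulting F would be compared with the F distribution with 3 and 1334 degrees of freedom.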

d. Regression
i. Simple linear regression

[Figure: plot of charges (vertical axis, 0 to 70000) against bmi (horizontal axis, 15 to 55)]

Identify two random variables X and Y

X: BMI
Y: Charges

Find the equation of the estimated regression line

b0 = 1192.94
--> Meaning: no practical interpretation, because a BMI of 0 lies outside the range of the data.
b1 = 393.87
--> Meaning: if BMI increases by 1 unit, the predicted charge increases by $393.87.
Regression equation: ŷ = 1192.94 + 393.87 * x

Use the regression equation to predict a value
x = 25
ŷ = 11039.76

Compute the sample coefficient of determination

R Square = 0.0393 --> weak positive relationship
--> Meaning: 3.93% of the total variation in the charges is explained by variation in the BMI.
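The prediction and the strength of the relationship follow directly from the fitted coefficients. The sketch below uses the rounded coefficients printed above, so the prediction differs from the reported 11039.76 by a few cents. It also recovers the sample correlation as the square root of R Square, taken as positive because the slope is positive. The names `B0`, `B1` and `predict_charges` are our own.

```python
import math

# Rounded coefficients of the estimated regression line.
B0 = 1192.94
B1 = 393.87


def predict_charges(bmi):
    """Predicted charges from the simple linear regression line."""
    return B0 + B1 * bmi


y_hat = predict_charges(25)

# Correlation coefficient: sqrt(R^2), positive since the slope is positive.
r = math.sqrt(0.0393)

print(round(y_hat, 2), round(r, 3))
```

A correlation of roughly 0.2 is in line with the description of the relationship as weak and positive.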
ii. Multiple regression

[Figure: plot of charges (vertical axis, 0 to 70000) against bmi (horizontal axis, 15 to 55)]

X1: BMI
X2: Smoking
Y: Charges

Regression equation:
Smoking: ŷ = (-3458.1 + 23593.98) + 388.0152 * X1
Non-smoking: ŷ = -3458.1 + 388.0152 * X1

Sample coefficient of determination is 0.6579 ⇒ moderate positive relationship.

65.79% of the total variation in the charges is explained by variation in the BMI and
smoking factor together.
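The two equations are one model in which smoking enters as a 0/1 indicator that shifts the intercept. The sketch below assumes the coding X2 = 1 for smoking people and X2 = 0 otherwise, which is implied by the pair of equations but not stated explicitly. The constant and function names are our own.

```python
# Coefficients of the multiple regression model.
B0 = -3458.1
B_BMI = 388.0152
B_SMOKER = 23593.98


def predict_charges(bmi, smoker):
    """Predicted charges; smoker is coded 1 for smoking, 0 otherwise (assumed)."""
    return B0 + B_BMI * bmi + B_SMOKER * (1 if smoker else 0)


non_smoker_25 = predict_charges(25, smoker=False)
smoker_25 = predict_charges(25, smoker=True)

print(round(non_smoker_25, 2), round(smoker_25, 2))
```

At any given BMI, the predicted charges of a smoking person exceed those of a non-smoking person by the smoking coefficient, $23593.98.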

REFERENCE
https://www.kaggle.com/datasets/mirichoi0218/insurance
