mas202

INDIVIDUAL ASSIGNMENT
Group 5
MAS202
Applied Statistics for Business
Lecturer: Khong Van Hai

Class: IB1802
Group members:
Name                     Student ID   Assigned Work
Hà Ngô Khánh Linh        HS170110     Source - Descriptive Statistics - ANOVA
Bùi Chí Đạt              HS176136     Confidence Interval
Nguyễn Trung Thành       HS170167     Confidence Interval
Nguyễn Thị Thu Phương    HS170700     Hypothesis Testing
Đặng Tuấn Nam            HS171236     Simple Linear Regression - Multiple Regression
TABLE OF CONTENTS
PART A
I. Introduction & Methodology
a. Topic
b. Main issue
c. Identify the two continuous variables
d. Identify the population
e. Identify the sample
II. Descriptive Statistics Results
a. Demographics Information
b. Descriptive Statistics
PART B
I. Summarize descriptive statistics result
II. Inferential Statistics
a. Confidence interval
b. Hypothesis testing
c. ANOVA
d. Regression
Reference
PART A
Part I: Introduction & Methodology
a. Topic
Insurance charge prediction based on BMI index.
b. Main issue
Our analysis examines whether insurance charges depend on BMI.
We ask two questions: how large are the insurance charges of people in the US,
and does BMI affect personal medical costs in the US?
c. Identify the two continuous variables
The charge and BMI are both numerical continuous variables.
The charge could depend on BMI; therefore, the charge is a dependent variable
and BMI is an independent variable.
We choose these two variables because we are curious about the relationship
between them.
d. Identify the population
The population is all people in the US.
e. Identify the sample
The sample is 1338 people randomly selected in the US.
Our data is secondary data.
Part II: Descriptive Statistics Results
a. Demographics Information
age: age of the primary beneficiary.
sex: gender of the insurance contractor (female, male).
bmi: body mass index, an objective index of body weight relative to height,
computed as weight divided by height squared (kg/m^2); it indicates whether
weight is relatively high or low for a given height.
children: number of children covered by health insurance.
smoker: whether the beneficiary smokes.
region: the beneficiary's residential area in the US (northeast, southeast,
southwest, northwest).
charges: individual medical costs billed by health insurance.
b. Descriptive Statistics
i. A table with the measures of Central Tendency
                      charges          bmi
Mean                  13270.42         30.66
Standard Error        331.0674543      0.166714
Median                9382.03          30.40
Mode                  1639.56          32.30
Standard Deviation    12110.01         6.10
Sample Variance       146652372.15     37.19
Kurtosis              1.606298653      -0.05073
Skewness              1.515879658      0.284047
Range                 62648.55         37.17
Minimum               1121.87          15.96
Maximum               63770.43         53.13
Sum                   17755824.99      41027.63
Count                 1338             1338
Key Findings
The mean charge in the sample is $13270.42.
The median charge in the sample is $9382.03.
The mode charge in the sample is $1639.56.
The range of charge in the sample is $62648.55.
The variance of charge in the sample is 146652372.15.
The standard deviation of charge in the sample is $12110.01.
The mean BMI in the sample is 30.66.
The median BMI in the sample is 30.40.
The mode BMI in the sample is 32.30.
The range of BMI in the sample is 37.17.
The variance of BMI in the sample is 37.19.
The standard deviation of BMI in the sample is 6.10.
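The figures in the table can be cross-checked with two identities: the mean equals Sum / Count, and the standard error equals S / sqrt(n). The following is a minimal sketch of that check in plain Python; the names `table` and `summarize` are illustrative and not part of the original analysis, which was done in Excel.

```python
import math

# Values copied from the descriptive statistics table above.
table = {
    "charges": {"sum": 17755824.99, "count": 1338, "sd": 12110.01},
    "bmi": {"sum": 41027.63, "count": 1338, "sd": 6.10},
}

def summarize(row):
    """Recompute (mean, standard error) from the sum, count and standard deviation."""
    mean = row["sum"] / row["count"]
    std_err = row["sd"] / math.sqrt(row["count"])
    return mean, std_err

checks = {name: summarize(row) for name, row in table.items()}
for name, (mean, std_err) in checks.items():
    print(f"{name}: mean = {mean:.2f}, standard error = {std_err:.4f}")
```

Small differences in the last digits are expected because the standard deviations in the table are rounded to two decimals.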
ii. Graphs
[Graph of charges not reproduced]
Key Findings
The shape is right-skewed.
[Graph of charges not reproduced]
Key Findings
The first, second and third quartiles are 4733.64, 9382.03 and 16687.36 respectively.
There are many outliers.
The shape is right-skewed.
[Graph of BMI not reproduced]
Key Findings
The shape is quite symmetric.
[Graph of BMI not reproduced]
Key Findings
The first, second and third quartiles are 26.27, 30.4 and 34.7 respectively.
There are some outliers.
The shape is quite symmetric.
PART B
Part I: Summarize descriptive statistics result
Our topic is the relationship between BMI, smoking and medical costs. The purpose of
the study is to find a model that relates BMI and smoking to medical costs. Our data
is secondary data comprising 1338 records. Each record includes fields for age,
gender, BMI, number of children, smoking status, region and medical expenses.
In this sample, the average medical cost is $13270.42 with a standard deviation of
$12110.01; the average BMI is 30.66 with a standard deviation of 6.10; and there are
274 smokers and 1064 non-smokers.
In the simple linear regression model, medical cost is the dependent variable and BMI
is the independent variable. In addition, we built a multiple regression model with
medical cost as the dependent variable and BMI and smoking as the independent
variables.
Part II: Inferential Statistics
a. Confidence interval
i. Problem 1: Construct a 95% confidence interval for the true mean
medical charges.
We use the following formula to calculate a 95% confidence interval for the true mean
medical charges.
$$\bar{x} - t_{\alpha/2,\,n-1}\,\frac{S}{\sqrt{n}} \le \mu \le \bar{x} + t_{\alpha/2,\,n-1}\,\frac{S}{\sqrt{n}}$$
We collect a sample of 1338 people with the following information:
• Sample size $n = 1338$
• Sample mean $\bar{x} = 13270.42$
• Sample standard deviation $S = 12110.01$
The confidence level we choose is 95%, so the critical value is $t_{0.025,\,1337} \approx 1.96$.
Hence, the 95% confidence interval for the true mean medical charges is (12620.95,
13919.89).
We are 95% confident that the interval (12620.95, 13919.89) contains the true
population mean medical charges.
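ii. Problem 2: Construct a 95% confidence interval for the true mean BMI
index.
We use the following formula to calculate a 95% confidence interval for the true mean
BMI index.
$$\bar{x} - t_{\alpha/2,\,n-1}\,\frac{S}{\sqrt{n}} \le \mu \le \bar{x} + t_{\alpha/2,\,n-1}\,\frac{S}{\sqrt{n}}$$
We collect a sample of 1338 people with the following information:
• Sample size $n = 1338$
• Sample mean $\bar{x} = 30.66$
• Sample standard deviation $S = 6.10$
The confidence level we choose is 95%, so the critical value is $t_{0.025,\,1337} \approx 1.96$.
Hence, the 95% confidence interval for the true mean BMI index is (30.34, 30.99).
We are 95% confident that the interval (30.34, 30.99) contains the true population
mean BMI index.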
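iii. Problem 3: Construct a 95% confidence interval for the true mean
medical charges of smoking people.
We use the following formula to calculate a 95% confidence interval for the true mean
medical charges of smoking people.
$$\bar{x} - t_{\alpha/2,\,n-1}\,\frac{S}{\sqrt{n}} \le \mu \le \bar{x} + t_{\alpha/2,\,n-1}\,\frac{S}{\sqrt{n}}$$
In the data sample, there are 274 smoking people with the following information:
• Sample size $n = 274$
• Sample mean $\bar{x} = 32050.23$
• Sample standard deviation $S = 11541.55$
The confidence level we choose is 95%, so the critical value is $t_{0.025,\,273} \approx 1.97$.
Hence, the 95% confidence interval for the true mean medical charges of smoking
people is (30677.56, 33422.90).
We are 95% confident that the interval (30677.56, 33422.90) contains the true
population mean medical charges of smoking people.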
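iv. Problem 4: Construct a 95% confidence interval for the true mean
medical charges of non-smoking people.
We use the following formula to calculate a 95% confidence interval for the true mean
medical charges of non-smoking people.
$$\bar{x} - t_{\alpha/2,\,n-1}\,\frac{S}{\sqrt{n}} \le \mu \le \bar{x} + t_{\alpha/2,\,n-1}\,\frac{S}{\sqrt{n}}$$
In the data sample, there are 1064 non-smoking people with the following information:
• Sample size $n = 1064$
• Sample mean $\bar{x} = 8434.27$
• Sample standard deviation $S = 5993.78$
The confidence level we choose is 95%, so the critical value is $t_{0.025,\,1063} \approx 1.96$.
Hence, the 95% confidence interval for the true mean medical charges of non-smoking
people is (8073.71, 8794.82).
We are 95% confident that the interval (8073.71, 8794.82) contains the true
population mean medical charges of non-smoking people.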
v. Problem 5: Construct a 95% confidence interval for the true mean
medical charges of obese people.
We use the following formula to calculate a 95% confidence interval for the true mean
medical charges of obese people.
$$\bar{x} - t_{\alpha/2,\,n-1}\,\frac{S}{\sqrt{n}} \le \mu \le \bar{x} + t_{\alpha/2,\,n-1}\,\frac{S}{\sqrt{n}}$$
In the data sample, there are 707 obese people with the following information:
• Sample size $n = 707$
• Sample mean $\bar{x} = 15552.34$
• Sample standard deviation $S = 14552.32$
The confidence level we choose is 95%, so the critical value is $t_{0.025,\,706} \approx 1.96$.
Hence, the 95% confidence interval for the true mean medical charges of obese people
is (14477.81, 16626.86).
We are 95% confident that the interval (14477.81, 16626.86) contains the true
population mean medical charges of obese people.
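vi. Problem 6: Construct a 95% confidence interval for the true mean
medical charges of non-obese people.
We use the following formula to calculate a 95% confidence interval for the true mean
medical charges of non-obese people.
$$\bar{x} - t_{\alpha/2,\,n-1}\,\frac{S}{\sqrt{n}} \le \mu \le \bar{x} + t_{\alpha/2,\,n-1}\,\frac{S}{\sqrt{n}}$$
In the data sample, there are 631 non-obese people with the following information:
• Sample size $n = 631$
• Sample mean $\bar{x} = 10713.67$
• Sample standard deviation $S = 7843.54$
The confidence level we choose is 95%, so the critical value is $t_{0.025,\,630} \approx 1.96$.
Hence, the 95% confidence interval for the true mean medical charges of non-obese
people is (10100.50, 11326.84).
We are 95% confident that the interval (10100.50, 11326.84) contains the true
population mean medical charges of non-obese people.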
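The six one-sample intervals above can be reproduced from their summary statistics. The sketch below is one way to do so in plain Python and is not the Excel procedure the report used. It assumes a series approximation to the t critical value (adequate for the large degrees of freedom here), and it takes the smoker mean as 32050.23, the midpoint of the interval reported in Problem 3. The names `t_critical`, `mean_ci` and `samples` are illustrative.

```python
import math
from statistics import NormalDist

def t_critical(df, confidence=0.95):
    """Approximate two-sided Student t critical value.

    Assumption: a series expansion around the normal quantile is used
    instead of an exact t quantile; it is accurate for large df.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return (z
            + (z**3 + z) / (4 * df)
            + (5 * z**5 + 16 * z**3 + 3 * z) / (96 * df**2))

def mean_ci(mean, sd, n, confidence=0.95):
    """Confidence interval for a population mean from summary statistics."""
    margin = t_critical(n - 1, confidence) * sd / math.sqrt(n)
    return mean - margin, mean + margin

# (sample mean, sample standard deviation, sample size) for Problems 1 to 6.
samples = {
    "charges": (13270.42, 12110.01, 1338),
    "bmi": (30.66, 6.10, 1338),
    "smokers": (32050.23, 11541.55, 274),
    "non_smokers": (8434.27, 5993.78, 1064),
    "obese": (15552.34, 14552.32, 707),
    "non_obese": (10713.67, 7843.54, 631),
}

intervals = {name: mean_ci(*stats) for name, stats in samples.items()}
for name, (low, high) in intervals.items():
    print(f"{name}: ({low:.2f}, {high:.2f})")
```

With these inputs the printed intervals should agree with the reported ones up to rounding of the summary statistics.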
vii. Problem 7: Construct a 95% confidence interval on the difference in the
true means of medical charges between smoking people and non-smoking
people.
We use the following formula to calculate a 95% confidence interval on the difference
in the true means of medical charges between smoking people and non-smoking
people.
$$(\bar{x}_1 - \bar{x}_2) - t_{\alpha/2,\,n_1+n_2-2}\sqrt{S_p^2\left(\frac{1}{n_1}+\frac{1}{n_2}\right)} \le \mu_1 - \mu_2 \le (\bar{x}_1 - \bar{x}_2) + t_{\alpha/2,\,n_1+n_2-2}\sqrt{S_p^2\left(\frac{1}{n_1}+\frac{1}{n_2}\right)}$$
where $S_p^2 = \frac{(n_1-1)S_1^2 + (n_2-1)S_2^2}{n_1+n_2-2}$ is the pooled sample variance.
Here is the summary data for each sample:
$n_1 = 274$, $\bar{x}_1 = 32050.23$, $S_1 = 11541.55$
$n_2 = 1064$, $\bar{x}_2 = 8434.27$, $S_2 = 5993.78$
Hence, the 95% confidence interval on the difference in the true means of medical
charges between smoking people and non-smoking people is (22623.17, 24608.75).
We are 95% confident that the interval (22623.17, 24608.75) contains the difference
in the true means of medical charges between smoking people and non-smoking
people.
viii. Problem 8: Construct a 95% confidence interval on the difference in the
true means of medical charges between obese people and non-obese
people.
We use the following formula to calculate a 95% confidence interval on the difference
in the true means of medical charges between obese people and non-obese people.
$$(\bar{x}_1 - \bar{x}_2) - t_{\alpha/2,\,n_1+n_2-2}\sqrt{S_p^2\left(\frac{1}{n_1}+\frac{1}{n_2}\right)} \le \mu_1 - \mu_2 \le (\bar{x}_1 - \bar{x}_2) + t_{\alpha/2,\,n_1+n_2-2}\sqrt{S_p^2\left(\frac{1}{n_1}+\frac{1}{n_2}\right)}$$
where $S_p^2 = \frac{(n_1-1)S_1^2 + (n_2-1)S_2^2}{n_1+n_2-2}$ is the pooled sample variance.
Here is the summary data for each sample:
$n_1 = 707$, $\bar{x}_1 = 15552.34$, $S_1 = 14552.32$
$n_2 = 631$, $\bar{x}_2 = 10713.67$, $S_2 = 7843.54$
Hence, the 95% confidence interval on the difference in the true means of medical
charges between obese people and non-obese people is (3563.32, 6114.02).
We are 95% confident that the interval (3563.32, 6114.02) contains the difference in
the true means of medical charges between obese people and non-obese people.
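The pooled-variance intervals of Problems 7 and 8 can be recomputed from the summary data. This is a sketch in plain Python under stated assumptions, not the Excel procedure the report used. It assumes equal population variances (the pooled formula), a series approximation to the t critical value with n1 + n2 - 2 degrees of freedom, and a smoker mean of 32050.23, the midpoint of the interval reported in Problem 3. The function names are illustrative.

```python
import math
from statistics import NormalDist

def t_critical(df, confidence=0.95):
    """Approximate two-sided t critical value (series expansion, large df)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return (z
            + (z**3 + z) / (4 * df)
            + (5 * z**5 + 16 * z**3 + 3 * z) / (96 * df**2))

def pooled_diff_ci(mean1, sd1, n1, mean2, sd2, n2, confidence=0.95):
    """Confidence interval for mu1 - mu2, assuming equal population variances."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df
    std_err = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    margin = t_critical(df, confidence) * std_err
    diff = mean1 - mean2
    return diff - margin, diff + margin

# Problem 7: smokers versus non-smokers.
smoking_ci = pooled_diff_ci(32050.23, 11541.55, 274, 8434.27, 5993.78, 1064)
# Problem 8: obese versus non-obese.
obesity_ci = pooled_diff_ci(15552.34, 14552.32, 707, 10713.67, 7843.54, 631)

print("smokers - non-smokers: (%.2f, %.2f)" % smoking_ci)
print("obese - non-obese: (%.2f, %.2f)" % obesity_ci)
```

Both intervals lie entirely above zero, which is consistent with the one-sided tests in the hypothesis testing section.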
b. Hypothesis testing
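i. Problem 1: Test the hypothesis that the true mean BMI index is greater
than 30.
$H_0: \mu \le 30$
$H_1: \mu > 30$
Test statistic is
$$t = \frac{\bar{x} - \mu}{S/\sqrt{n}} = \frac{30.66 - 30}{6.10/\sqrt{1338}} = 3.98$$
(The value 3.98 presumably comes from the unrounded sample mean and standard
deviation; the rounded values shown give about 3.96.)
p-value $= P(t > 3.98) = 0.00$
p-value $< \alpha$ → Reject $H_0$
There is sufficient evidence that the true mean BMI index is greater than 30.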
ii. Problem 2: Test the hypothesis that the true mean medical charges of
smoking people exceeds the true mean medical charges of non-smoking
people.
$H_0: \mu_1 - \mu_2 \le 0$
$H_1: \mu_1 - \mu_2 > 0$
From the Excel output:
Test statistic is 46.66
p-value $= 0.00$
p-value $< \alpha$ → Reject $H_0$
There is sufficient evidence that the true mean medical charges of smoking people
exceeds the true mean medical charges of non-smoking people.
iii. Problem 3: Test the hypothesis that the true mean medical charges of
obese people exceeds the true mean medical charges of non-obese people.
$H_0: \mu_1 - \mu_2 \le 0$
$H_1: \mu_1 - \mu_2 > 0$
From the Excel output:
p-value $= 0.00$
p-value $< \alpha$ → Reject $H_0$
There is sufficient evidence that the true mean medical charges of obese people
exceeds the true mean medical charges of non-obese people.
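The test statistics for Problems 2 and 3 can be recomputed from the group summary data reported in the confidence interval section. The sketch below assumes the pooled-variance (equal variances) two-sample t statistic and a smoker mean of 32050.23, the midpoint of the smokers' confidence interval. `pooled_t_statistic` is an illustrative name, not an Excel function.

```python
import math

def pooled_t_statistic(mean1, sd1, n1, mean2, sd2, n2):
    """Two-sample t statistic for H0: mu1 - mu2 <= 0 with pooled variance."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df
    std_err = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / std_err

# Problem 2: smokers versus non-smokers.
t_smoking = pooled_t_statistic(32050.23, 11541.55, 274, 8434.27, 5993.78, 1064)
# Problem 3: obese versus non-obese.
t_obesity = pooled_t_statistic(15552.34, 14552.32, 707, 10713.67, 7843.54, 631)

print(f"Problem 2 statistic: {t_smoking:.2f}")
print(f"Problem 3 statistic: {t_obesity:.2f}")
```

Under these assumptions the Problem 2 statistic comes out close to the 46.66 given above, and the Problem 3 statistic is roughly 7.4; both are far in the upper tail, in line with the reported p-values of 0.00.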
iv. Problem 4: Test the hypothesis that the true proportion of smoking males
exceeds the proportion of smoking females.
$H_0: \pi_1 - \pi_2 \le 0$
$H_1: \pi_1 - \pi_2 > 0$
Test statistic is
$$Z = \frac{(p_1 - p_2) - (\pi_1 - \pi_2)}{\sqrt{\bar{p}(1-\bar{p})\left(\frac{1}{n_1}+\frac{1}{n_2}\right)}} = \frac{(0.52 - 0.48) - 0}{\sqrt{0.5(1-0.5)\left(\frac{1}{274}+\frac{1}{274}\right)}} = 0.85$$
p-value $= P(Z > 0.85) = 0.20$
p-value $> \alpha$ → Fail to reject $H_0$
There is insufficient evidence that the true proportion of smoking males exceeds the
proportion of smoking females.
v. Problem 5: Test the hypothesis that the true proportion of obese males is
less than the proportion of obese females.
$H_0: \pi_1 - \pi_2 \ge 0$
$H_1: \pi_1 - \pi_2 < 0$
Test statistic is
$$Z = \frac{(p_1 - p_2) - (\pi_1 - \pi_2)}{\sqrt{\bar{p}(1-\bar{p})\left(\frac{1}{n_1}+\frac{1}{n_2}\right)}} = \frac{(0.51 - 0.49) - 0}{\sqrt{0.5(1-0.5)\left(\frac{1}{707}+\frac{1}{707}\right)}} = 0.69$$
Because the alternative hypothesis is lower-tailed, the p-value is the area to the
left of the test statistic:
p-value $= P(Z < 0.69) = 0.75$
p-value $> \alpha$ → Fail to reject $H_0$
There is insufficient evidence that the true proportion of obese males is less than
the proportion of obese females.
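The direction of the alternative hypothesis decides which tail of the standard normal distribution gives the p-value: the upper tail for $H_1: \pi_1 - \pi_2 > 0$ and the lower tail for $H_1: \pi_1 - \pi_2 < 0$. The following sketch evaluates both tails for the Z statistics reported in Problems 4 and 5, using Python's standard library; `z_p_value` is an illustrative helper name.

```python
from statistics import NormalDist

def z_p_value(z, tail):
    """One-sided p-value of a Z statistic.

    tail="upper" corresponds to H1: pi1 - pi2 > 0,
    tail="lower" corresponds to H1: pi1 - pi2 < 0.
    """
    cdf = NormalDist().cdf(z)
    if tail == "upper":
        return 1 - cdf
    if tail == "lower":
        return cdf
    raise ValueError("tail must be 'upper' or 'lower'")

p_problem4 = z_p_value(0.85, "upper")  # Problem 4 is an upper-tailed test
p_problem5 = z_p_value(0.69, "lower")  # Problem 5 is a lower-tailed test

print(f"Problem 4 p-value: {p_problem4:.2f}")
print(f"Problem 5 p-value: {p_problem5:.2f}")
```

In both problems the p-value is well above 0.05, so the decision not to reject $H_0$ does not depend on the exact value.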
c. ANOVA
We perform a one-way ANOVA to determine whether mean medical charges differ
between the four regions.
$H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4$
$H_1$: Not all population means are equal
From the Excel output:
p-value $= 0.03$
p-value $< \alpha$ → Reject $H_0$
There is sufficient evidence of a difference in mean medical charges between the
four regions.
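For readers who want to see what the one-way ANOVA computes, the sketch below builds the F statistic as the ratio of between-group to within-group mean squares. The four groups are made-up toy numbers, not the regional charges from the dataset, and the function name is illustrative; turning F into a p-value needs the F distribution, which Excel provides and this sketch omits.

```python
from statistics import mean

def one_way_anova_f(groups):
    """F statistic of a one-way ANOVA: MS(between) / MS(within)."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = mean([x for g in groups for x in g])
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical toy data standing in for four regions (NOT the real charges).
toy_groups = [[1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6]]
f_stat = one_way_anova_f(toy_groups)
print(f"F = {f_stat:.2f}")
```

A large F means the group means vary more than the spread within groups would explain, which is what leads to a small p-value.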
d. Regression
i. Simple linear regression
[Figure: plot of charges (vertical axis, 0 to 70000) against bmi (horizontal axis, 15 to 55)]
Identify two random variables X and Y
X: BMI
Y: Charges
Find the equation of the estimated regression line
b0 = 1192.94
--> Meaning: no practical meaning, because a BMI of 0 lies outside the range of the data.
b1 = 393.87
--> Meaning: if BMI increases by 1 unit, the predicted charge increases by $393.87.
Regression equation: $\hat{y} = 1192.94 + 393.87\,x$
Use the regression equation to predict a future value
x = 25
$\hat{y}$ = 11039.76
Compute the sample coefficient of determination
R Square = 0.0393 --> weak positive relationship
--> Meaning: 3.93% of the total variation in charges is explained by variation in BMI.
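The prediction step is a direct evaluation of the fitted line. The sketch below uses the rounded coefficients printed above, so its result for a BMI of 25 differs from the reported 11039.76 by a few cents; that gap presumably reflects unrounded coefficients in the Excel output. `predict_charge` is an illustrative name.

```python
# Rounded coefficients of the estimated regression line, as reported above.
B0 = 1192.94  # intercept
B1 = 393.87   # change in predicted charge per unit of BMI

def predict_charge(bmi):
    """Predicted medical charge for a given BMI using y-hat = b0 + b1 * x."""
    return B0 + B1 * bmi

prediction = predict_charge(25)
print(f"Predicted charge at BMI 25: {prediction:.2f}")
```

Because R Square is only 0.0393, individual charges scatter widely around this line, so such a prediction is a rough average rather than a precise estimate for one person.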
ii. Multiple regression
[Figure: plot of charges (vertical axis, 0 to 70000) against bmi (horizontal axis, 15 to 55)]
X1: BMI
X2: Smoking
Y: Charges
Regression equation:
Smoking: $\hat{y} = (-3458.1 + 23593.98) + 388.0152\,X_1$
Non-smoking: $\hat{y} = -3458.1 + 388.0152\,X_1$
Sample coefficient of determination is 0.6579 → moderate positive relationship.
65.79% of the total variation in charges is explained by variation in BMI and the
smoking factor together.
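The two equations above are one model with a 0/1 smoking indicator: the smoker line is the non-smoker line shifted up by 23593.98. The sketch below evaluates that model with the reported coefficients; the function name and the example BMI of 30 are illustrative choices, not part of the original analysis.

```python
# Coefficients reported in the multiple regression above.
INTERCEPT = -3458.1
BMI_SLOPE = 388.0152
SMOKER_SHIFT = 23593.98  # added to the intercept when X2 (smoking) equals 1

def predict_charge(bmi, smoker):
    """Predicted charge from BMI (X1) and a smoking indicator (X2)."""
    x2 = 1 if smoker else 0
    return INTERCEPT + BMI_SLOPE * bmi + SMOKER_SHIFT * x2

non_smoker = predict_charge(30, smoker=False)
smoker = predict_charge(30, smoker=True)
print(f"BMI 30, non-smoker: {non_smoker:.2f}")
print(f"BMI 30, smoker: {smoker:.2f}")
```

At any BMI the predicted gap between a smoker and a non-smoker is the same constant, because this model has no interaction term between BMI and smoking.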
REFERENCE
https://www.kaggle.com/datasets/mirichoi0218/insurance