Group 5
MAS202
Applied Statistics for Business
Confidence Interval
Bùi Chí Đạt HS176136
TABLE OF CONTENTS
PART A
I. Introduction & Methodology
a. Topic
b. Main issue
c. Identify the two continuous variables
d. Identify the population
e. Identify the sample
II. Descriptive Statistics Results
a. Demographics Information
b. Descriptive Statistics
PART B
I. Summarize descriptive statistics result
II. Inferential Statistics
a. Confidence interval
b. Hypothesis testing
c. ANOVA
d. Regression
Reference
PART A
Part I: Introduction & Methodology
a. Topic
Insurance charge prediction based on BMI index.
b. Main issue
Our analysis examines whether insurance charges depend on the BMI index.
Our questions are how large the insurance charges of people in the US are
and whether the BMI index affects personal medical costs in the US.
c. Identify the two continuous variables
Charges and BMI are both continuous numerical variables.
Charges could depend on BMI; therefore, charges is the dependent variable
and BMI is the independent variable.
We chose these two variables because we are curious about the relationship
between them.
d. Identify the population
Part II: Descriptive Statistics Results
a. Demographics Information
age: age of primary beneficiary.
bmi: body mass index, an objective index of body weight relative to height
(kg/m^2) that indicates whether a weight is relatively high or low for a
given height.
smoker: whether the beneficiary smokes.
b. Descriptive Statistics
i. A table with the measures of Central Tendency
charges bmi
The mean BMI in the sample is 30.66.
The median BMI in the sample is 30.40.
The mode BMI in the sample is 32.30.
The range of BMI in the sample is 37.17.
The variance of BMI in the sample is 37.19.
The standard deviation of BMI in the sample is 6.10.
ii. Graphs
Key Findings
The first, second and third quartiles of charges are 4733.64, 9382.033
and 16687.36, respectively.
There are many outliers.
The shape is right-skewed.
Key Findings
The shape is quite symmetric.
Key Findings
The first, second and third quartiles of BMI are 26.27, 30.4 and 34.7, respectively.
There are some outliers.
The shape is quite symmetric.
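The outlier remarks above can be related to the reported quartiles. The following is a minimal sketch, assuming the box plots flag outliers with the common 1.5 × IQR rule (the report does not state which rule was used); the function name `tukey_fences` is illustrative.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences q1 - k*IQR and q3 + k*IQR."""
    iqr = q3 - q1  # interquartile range
    return q1 - k * iqr, q3 + k * iqr

# Quartiles as reported in the Key Findings.
bmi_fences = tukey_fences(26.27, 34.7)
charges_fences = tukey_fences(4733.64, 16687.36)

print("BMI fences:", bmi_fences)
print("Charges fences:", charges_fences)
```

Under this assumption, observations outside the printed fences would be drawn as outliers.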
PART B
Part I: Summarize descriptive statistics result
Our topic is the relationship of BMI and smoking status with medical costs. The
purpose of the study is to build a model relating medical costs to these two factors.
Our data is secondary data comprising 1338 records. Each record includes fields for
age, gender, BMI, number of children, smoking status, region and medical expenses.
The sample has an average medical cost of $13270.42 with a standard deviation of
$12110.01 and an average BMI of 30.66 with a standard deviation of 6.10; it includes
274 smokers and 1064 non-smokers.
In the simple linear regression model, medical cost is the dependent variable and
BMI is the independent variable. In addition, we built a multiple regression model
with medical cost as the dependent variable and BMI and smoking status as the
independent variables.
Part II: Inferential Statistics
a. Confidence interval
i. Problem 1: Construct a 95% confidence interval for the true mean
medical charges.
We use the following formula to calculate a 95% confidence interval for the true mean
medical charges.
$$\bar{x} - t_{\alpha/2,\,n-1} \frac{S}{\sqrt{n}} \le \mu \le \bar{x} + t_{\alpha/2,\,n-1} \frac{S}{\sqrt{n}}$$
We collected a sample of 1338 people with the following information:
Sample size $n = 1338$
Sample mean charges $\bar{x} = 13270.42$
Sample standard deviation $S = 12110.01$
The confidence level we choose is 95%, so the critical value is $t_{0.025,\,1337} \approx 1.96$.
Hence, the 95% confidence interval for the true mean medical charges is (12620.95,
13919.89).
We are 95% confident that the interval (12620.95, 13919.89) contains the true
population mean medical charges.
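The interval can be reproduced from the three summary statistics. The sketch below is illustrative only: the function names are hypothetical, and the critical value is obtained from a Cornish-Fisher series approximation to the t quantile (an assumption made here to avoid a t-table; it is adequate for large degrees of freedom such as 1337).

```python
from math import sqrt
from statistics import NormalDist

def t_critical(alpha, df):
    """Approximate upper alpha/2 quantile of Student's t (series approximation, large df)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z + (z**3 + z) / (4 * df) + (5 * z**5 + 16 * z**3 + 3 * z) / (96 * df**2)

def mean_ci(xbar, s, n, alpha=0.05):
    """Two-sided (1 - alpha) confidence interval for a mean from summary statistics."""
    margin = t_critical(alpha, n - 1) * s / sqrt(n)
    return xbar - margin, xbar + margin

# Summary statistics reported for medical charges.
lower, upper = mean_ci(xbar=13270.42, s=12110.01, n=1338)
print(f"95% CI for mean charges: ({lower:.2f}, {upper:.2f})")
```

The same function can be applied to any of the one-sample intervals in this part once the group's sample size, mean and standard deviation are known.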
ii. Problem 2: Construct a 95% confidence interval for the true mean BMI
index.
We use the following formula to calculate a 95% confidence interval for the true mean
BMI index.
$$\bar{x} - t_{\alpha/2,\,n-1} \frac{S}{\sqrt{n}} \le \mu \le \bar{x} + t_{\alpha/2,\,n-1} \frac{S}{\sqrt{n}}$$
We collect a sample of 1338 people with the following information:
iii. Problem 3: Construct a 95% confidence interval for the true mean
medical charges of smokers.
We use the following formula to calculate a 95% confidence interval for the true mean
medical charges of smokers.
$$\bar{x} - t_{\alpha/2,\,n-1} \frac{S}{\sqrt{n}} \le \mu \le \bar{x} + t_{\alpha/2,\,n-1} \frac{S}{\sqrt{n}}$$
In the data sample, there are 274 smokers with the following information:
iv. Problem 4: Construct a 95% confidence interval for the true mean
medical charges of non-smokers.
We use the following formula to calculate a 95% confidence interval for the true mean
medical charges of non-smokers.
$$\bar{x} - t_{\alpha/2,\,n-1} \frac{S}{\sqrt{n}} \le \mu \le \bar{x} + t_{\alpha/2,\,n-1} \frac{S}{\sqrt{n}}$$
In the data sample, there are 1064 non-smokers with the following information:
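v. Problem 5: Construct a 95% confidence interval for the true mean
medical charges of obese people.
We use the following formula to calculate a 95% confidence interval for the true mean
medical charges of obese people.
$$\bar{x} - t_{\alpha/2,\,n-1} \frac{S}{\sqrt{n}} \le \mu \le \bar{x} + t_{\alpha/2,\,n-1} \frac{S}{\sqrt{n}}$$
In the data sample, there are 707 obese people with the following information: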
vi. Problem 6: Construct a 95% confidence interval for the true mean
medical charges of non-obese people.
We use the following formula to calculate a 95% confidence interval for the true mean
medical charges of non-obese people.
$$\bar{x} - t_{\alpha/2,\,n-1} \frac{S}{\sqrt{n}} \le \mu \le \bar{x} + t_{\alpha/2,\,n-1} \frac{S}{\sqrt{n}}$$
In the data sample, there are 631 non-obese people with the following information:
vii. Problem 7: Construct a 95% confidence interval on the difference in the
true means of medical charges between smokers and non-smokers.
We use the following formula to calculate a 95% confidence interval on the difference
in the true means of medical charges between smokers and non-smokers.
$$(\bar{x}_1 - \bar{x}_2) - t_{\alpha/2,\,n_1+n_2-2} \sqrt{S_p^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)} \le \mu_1 - \mu_2 \le (\bar{x}_1 - \bar{x}_2) + t_{\alpha/2,\,n_1+n_2-2} \sqrt{S_p^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}$$
Hence, the 95% confidence interval on the difference in the true means of medical
charges between smokers and non-smokers is (22623.17, 24608.75).
We are 95% confident that the interval (22623.17, 24608.75) contains the
difference in the true means of medical charges between smokers and
non-smokers.
viii. Problem 8: Construct a 95% confidence interval on the difference in the
true means of medical charges between obese people and non-obese people.
We use the following formula to calculate a 95% confidence interval on the difference
in the true means of medical charges between obese people and non-obese people.
$$(\bar{x}_1 - \bar{x}_2) - t_{\alpha/2,\,n_1+n_2-2} \sqrt{S_p^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)} \le \mu_1 - \mu_2 \le (\bar{x}_1 - \bar{x}_2) + t_{\alpha/2,\,n_1+n_2-2} \sqrt{S_p^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}$$
Hence, the 95% confidence interval on the difference in the true means of medical
charges between obese people and non-obese people is (3563.32, 6114.02).
We are 95% confident that the interval (3563.32, 6114.02) contains
the difference in the true means of medical charges between obese people and
non-obese people.
b. Hypothesis testing
i. Problem 1: Test the hypothesis that the true mean BMI index is greater
than 30.
$H_0: \mu \le 30$
$H_1: \mu > 30$
There is sufficient evidence that the true mean BMI index is greater than 30.
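The test statistic behind this conclusion can be approximated from the rounded BMI summary statistics reported in Part B (mean 30.66, standard deviation 6.10, n = 1338). This is a sketch under two assumptions that the text does not state: a significance level of 0.05, and a normal approximation to the t distribution for the p-value, which is reasonable with 1337 degrees of freedom.

```python
from math import sqrt
from statistics import NormalDist

n, xbar, s = 1338, 30.66, 6.10  # rounded summary statistics from the report
mu0 = 30.0                      # hypothesised mean under H0
alpha = 0.05                    # assumed significance level

t_stat = (xbar - mu0) / (s / sqrt(n))
# Upper-tail p-value, using the normal approximation to t with 1337 df.
p_value = 1 - NormalDist().cdf(t_stat)

print(f"t = {t_stat:.3f}, p-value = {p_value:.6f}")
print("Reject H0" if p_value < alpha else "Do not reject H0")
```

Because the inputs are rounded, the statistic may differ slightly from one computed on the raw data.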
ii. Problem 2: Test the hypothesis that the true mean medical charges of
smokers exceeds the true mean medical charges of non-smokers.
$H_0: \mu_1 - \mu_2 \le 0$
$H_1: \mu_1 - \mu_2 > 0$
where $\mu_1$ and $\mu_2$ are the true mean medical charges of smokers and non-smokers.
The test statistic is 46.66.
p-value ≈ 0.00
Since p-value < α, we reject $H_0$.
There is sufficient evidence that the true mean medical charges of smokers
exceeds the true mean medical charges of non-smokers.
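As a consistency check, the test statistic can be recovered from the 95% confidence interval reported earlier for the same difference, (22623.17, 24608.75), with group sizes 274 and 1064. The sketch assumes that the interval and the test share the same pooled standard error, and it uses a series approximation for the t quantile; both are assumptions of this example rather than statements from the report.

```python
from statistics import NormalDist

def t_critical(alpha, df):
    """Approximate upper alpha/2 quantile of Student's t (series approximation, large df)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z + (z**3 + z) / (4 * df) + (5 * z**5 + 16 * z**3 + 3 * z) / (96 * df**2)

n1, n2 = 274, 1064                 # smokers, non-smokers
lower, upper = 22623.17, 24608.75  # reported 95% CI for mu1 - mu2

diff = (lower + upper) / 2         # point estimate of the difference in means
margin = (upper - lower) / 2       # margin of error of the interval
std_error = margin / t_critical(0.05, n1 + n2 - 2)
t_stat = diff / std_error

print(f"difference = {diff:.2f}, standard error = {std_error:.2f}, t = {t_stat:.2f}")
```

The recovered value agrees with the reported test statistic of 46.66 to within rounding.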
iii. Problem 3: Test the hypothesis that the true mean medical charges of
obese people exceeds the true mean medical charges of non-obese people.
$H_0: \mu_1 - \mu_2 \le 0$
$H_1: \mu_1 - \mu_2 > 0$
where $\mu_1$ and $\mu_2$ are the true mean medical charges of obese and non-obese people.
There is sufficient evidence that the true mean medical charges of obese people
exceeds the true mean medical charges of non-obese people.
iv. Problem 4: Test the hypothesis that the true proportion of smokers among
males exceeds the true proportion of smokers among females.
$H_0: \pi_1 - \pi_2 \le 0$
$H_1: \pi_1 - \pi_2 > 0$
There is insufficient evidence that the true proportion of smokers among males
exceeds the true proportion of smokers among females.
v. Problem 5: Test the hypothesis that the true proportion of obese people among
males is less than the true proportion of obese people among females.
$H_0: \pi_1 - \pi_2 \ge 0$
$H_1: \pi_1 - \pi_2 < 0$
There is insufficient evidence that the true proportion of obese people among males
is less than the true proportion of obese people among females.
c. ANOVA
We perform a one-way ANOVA to determine whether there is a difference in mean
medical charges among the four regions.
$H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4$
$H_1$: not all of the means are equal
There is sufficient evidence of a difference in mean medical charges among the
four regions.
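The F statistic used in a one-way ANOVA can be sketched as below. The per-region charges are not listed in the text, so the four groups here are toy numbers invented purely to show the calculation; they are NOT the regional data, and the function name is illustrative.

```python
from statistics import mean

def one_way_anova_f(groups):
    """F = (between-group mean square) / (within-group mean square)."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Toy data: four hypothetical groups standing in for four regions.
toy_groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7], [4, 5, 6]]
f_stat = one_way_anova_f(toy_groups)
print(f"F = {f_stat:.2f} with ({len(toy_groups) - 1}, 8) degrees of freedom")
```

A large F relative to the critical value of the F distribution with (k - 1, n - k) degrees of freedom leads to rejecting the hypothesis of equal means.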
d. Regression
i. Simple linear regression
[Figure: scatter plot of charges (vertical axis) against bmi (horizontal axis)]
Use the regression equation to predict a future value:
$x = 25$
$\hat{y} = 11039.76$
[Figure: scatter plot of charges (vertical axis) against bmi (horizontal axis)]
ii. Multiple regression
$X_1$: BMI
$X_2$: Smoking
$Y$: Charges
Regression equation:
Smoking: $\hat{y} = (-3458.1 + 23593.98) + 388.0152 X_1$
Non-smoking: $\hat{y} = -3458.1 + 388.0152 X_1$
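The two equations are the same fitted model evaluated with the smoking indicator switched on or off. A minimal sketch, assuming $X_2$ is coded 1 for smokers and 0 for non-smokers (implied by the equations but not stated explicitly); the function name and the BMI value of 30 are illustrative choices.

```python
def predict_charges(bmi, smoker):
    """Predicted charges from the fitted multiple regression equation."""
    intercept, b_bmi, b_smoker = -3458.1, 388.0152, 23593.98
    x2 = 1 if smoker else 0  # assumed dummy coding for smoking status
    return intercept + b_bmi * bmi + b_smoker * x2

# Illustrative input: a BMI of 30 for a smoker and a non-smoker.
smoker_charge = predict_charges(30, smoker=True)
non_smoker_charge = predict_charges(30, smoker=False)
print(f"smoker: {smoker_charge:.2f}, non-smoker: {non_smoker_charge:.2f}")
```

At any fixed BMI the two predictions differ by the smoking coefficient, 23593.98, which is why the two lines are parallel.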
REFERENCE
https://www.kaggle.com/datasets/mirichoi0218/insurance