BADM 221 Unit 10 - With Notes


BADM 221 Statistics for Business

Week 10
ANOVA (Analysis of Variance)
ANOVA

Test of several means – n-sample hypothesis test

Many statistical applications in psychology, social science,
business administration, and the natural sciences involve several
groups.

Example:
• An experiment to study the effects of five different brands of
gasoline on car engine efficiency.
• A consumer looking for a new car might compare the average
gas mileage of seven car models.
• A professor wishes to study the effect of four different teaching
techniques on mathematics proficiency.
ANOVA
The characteristic that differentiates the treatments from one
another is called the factor of the study. The different
treatments are called the levels of the factor. Here, we only
consider one factor.
Example:
• An experiment to study the effects of five different brands of gasoline
on car engine efficiency.
Factor: Gasoline brand    Treatments: The 5 different brands
• A consumer looking for a new car might compare the average gas
mileage of seven car models.
Factor: Car model    Treatments: The 7 car models
• A professor wishes to study the effect of four different teaching
techniques on mathematics proficiency.
Factor: Teaching technique    Treatments: The 4 different techniques
ANOVA

For hypothesis tests comparing averages among more than two
groups, statisticians have developed a method called
"Analysis of Variance" (abbreviated ANOVA).

One-way ANOVA
(Single-factor ANOVA)

The purpose of an ANOVA test is to determine whether
there is any significant difference among several group
means. The test uses variances to help determine if the
means are equal or not.
ANOVA

Two kinds of variance (sources of variation):

• Variance between treatments:
Variation due to the different levels of the factor
(termed the Sum of Squares of Treatment/Factor)
SS(Treatment) or SS(Factor)
• Variance within treatments:
Variation due to error
(termed the Sum of Squares of Error)
SS(Error)
ANOVA

Null and Alternative Hypotheses

H0: All the population means are the same.
Ha: At least one of the means is different.

Suppose we want to compare k groups.

H0: The population means of all k groups are the same.
Ha: At least one group has a different mean.

H0: μ1 = μ2 = ⋯ = μk
Ha: At least one μi is different from the others.
ANOVA
Data are typically put into a table for easy referencing by
computer software. The table is called the ANOVA table.

Number of treatments: k    Total number of data: n

Source of Variation       | Sum of Squares (SS)         | Degrees of Freedom (df) | Mean Square (MS)                | F
Between Treatments        | SS(Factor) or SS(Treatment) | k – 1                   | MS(Factor) = SS(Factor)/(k – 1) | F = MS(Factor)/MS(Error)
Error (Within Treatments) | SS(Error)                   | n – k                   | MS(Error) = SS(Error)/(n – k)   |
Total                     | SS(Total)                   | n – 1                   |                                 |
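The formulas in this table translate directly into code. A minimal sketch in Python (the function name and returned dictionary keys are illustrative, not from the slides):

```python
def one_way_anova_table(groups):
    """Compute the one-way ANOVA table entries for a list of samples."""
    k = len(groups)                          # number of treatments
    n = sum(len(g) for g in groups)          # total number of observations
    grand_mean = sum(x for g in groups for x in g) / n

    # SS(Factor): variation of the treatment means about the grand mean
    ss_factor = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # SS(Error): variation of the observations inside each treatment
    ss_error = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)

    ms_factor = ss_factor / (k - 1)          # MS(Factor) = SS(Factor) / (k - 1)
    ms_error = ss_error / (n - k)            # MS(Error)  = SS(Error) / (n - k)
    return {"SS(Factor)": ss_factor, "SS(Error)": ss_error,
            "SS(Total)": ss_factor + ss_error,
            "df1": k - 1, "df2": n - k,
            "MS(Factor)": ms_factor, "MS(Error)": ms_error,
            "F": ms_factor / ms_error}
```

Applied to the diet-plan example that follows, this reproduces the table values shown there.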
ANOVA
Example:
Three different diet plans are to be tested for mean weight loss. The
entries in the table are the weight losses for the different plans.
Plan 1 | Plan 2 | Plan 3
5      | 3.5    | 8
4.5    | 7      | 4
4      | 4.5    | 3.5
3      |        |

The resulting (partial) ANOVA table is shown below:

Source of Variation       | Sum of Squares | Degrees of Freedom | Mean Square | F
Between Treatments        | 2.2458         |                    |             |
Error (Within Treatments) | 20.8542        |                    |             |
Total                     |                |                    |             |
ANOVA
Number of treatments: k = ___    Total number of data: n = ___

Source of Variation       | Sum of Squares | Degrees of Freedom | Mean Square | F
Between Treatments        | 2.2458         |                    |             |
Error (Within Treatments) | 20.8542        |                    |             |
Total                     |                |                    |             |

Source of Variation       | Sum of Squares (SS)         | Degrees of Freedom (df) | Mean Square (MS)                | F
Between Treatments        | SS(Factor) or SS(Treatment) | k – 1                   | MS(Factor) = SS(Factor)/(k – 1) | F = MS(Factor)/MS(Error)
Error (Within Treatments) | SS(Error)                   | n – k                   | MS(Error) = SS(Error)/(n – k)   |
Total                     | SS(Total)                   | n – 1                   |                                 |
ANOVA
Example (continued):
Three different diet plans are to be tested for mean weight loss. The
entries in the table are the weight losses for the different plans.
Plan 1 | Plan 2 | Plan 3
5      | 3.5    | 8
4.5    | 7      | 4
4      | 4.5    | 3.5
3      |        |

Test the hypothesis that the mean weight losses of the 3 diet plans are
the same, at the 5% level of significance.
ANOVA
Hypothesis Testing:

H0: The population mean weight losses of the three diet
plans are ALL the same.
Ha: At least one of the diet plans has a different mean
weight loss.

ANOVA table:
Source of Variation       | Sum of Squares | Degrees of Freedom | Mean Square | F
Between Treatments        | 2.2458         | 2                  | 1.1229      | 0.3769
Error (Within Treatments) | 20.8542        | 7                  | 2.9792      |
Total                     | 23.1           | 9                  |             |
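The F statistic in the table can be cross-checked in software; a sketch using SciPy's `f_oneway` (assumes SciPy is installed):

```python
from scipy.stats import f_oneway

# Weight-loss data for the three diet plans
plan1 = [5, 4.5, 4, 3]
plan2 = [3.5, 7, 4.5]
plan3 = [8, 4, 3.5]

result = f_oneway(plan1, plan2, plan3)
print(round(result.statistic, 4))   # F test statistic, matching the ANOVA table
```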
ANOVA
Hypothesis Testing:
Source of Variation       | Sum of Squares | Degrees of Freedom | Mean Square | F
Between Treatments        | 2.2458         | 2                  | 1.1229      | 0.3769
Error (Within Treatments) | 20.8542        | 7                  | 2.9792      |
Total                     | 23.1           | 9                  |             |

[Figure: F-distribution density curves for several degree-of-freedom pairs (F3,5, F10,90, F50,50, F90,10)]
ANOVA
Hypothesis Testing:
Source of Variation       | Sum of Squares | Degrees of Freedom | Mean Square | F
Between Treatments        | 2.2458         | 2 (df1)            | 1.1229      | 0.3769
Error (Within Treatments) | 20.8542        | 7 (df2)            | 2.9792      |
Total                     | 23.1           | 9                  |             |

Critical value: F(df1, df2) = F(2, 7) = 4.7375
Test statistic: Fc = 0.3769
ANOVA
Hypothesis Testing:
Reject H0 if (Test Statistic > Critical value)
Do not reject H0 if (Test Statistic ≤ Critical value)

Source of Variation       | Sum of Squares | Degrees of Freedom | Mean Square | F
Between Treatments        | 2.2458         | 2 (df1)            | 1.1229      | 0.3769
Error (Within Treatments) | 20.8542        | 7 (df2)            | 2.9792      |
Total                     | 23.1           | 9                  |             |

Critical value: F(df1, df2) = F(2, 7) = 4.7375
Test statistic: Fc = 0.3769

Fc < F(2, 7) ⇒ Do not reject H0.

⇒ There is insufficient evidence that at least one of the
diet plans has a different mean weight loss.
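The critical value is read from an F-table in the slides; it can also be obtained in software. A sketch of the decision rule using SciPy (assumes SciPy is installed):

```python
from scipy.stats import f

alpha = 0.05
df1, df2 = 2, 7          # k - 1 = 2 and n - k = 7 for the diet-plan example
f_c = 0.3769             # test statistic from the ANOVA table

critical = f.ppf(1 - alpha, df1, df2)   # right-tail critical value of the F-distribution
decision = "Reject H0" if f_c > critical else "Do not reject H0"
print(decision)   # Do not reject H0
```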
ANOVA

1. H0: The population mean weight losses of the three diet
   plans are ALL the same.
   Ha: At least one of the diet plans has a different mean
   weight loss.

2. Test statistic: Fc = 0.3769

3. Critical value: At the 5% level of significance, F(2, 7) = 4.7375

4. Fc < F(2, 7) ⇒ Do not reject H0

5. Conclusion: Do not reject H0 at the 5% level of significance.
   ⇒ There is insufficient evidence that at least one of the diet
   plans has a different mean weight loss.
ANOVA
Example:
As part of an experiment to see how different types of soil cover
would affect slicing tomato production, Douglas College students
grew tomato plants under different soil cover conditions. Groups of
three plants each had one of the 5 treatments (i.e. a total of 15 plants).
All plants grew under the same conditions and were the same variety.
Students recorded the weight (in grams) of tomatoes produced by
each of the plants and the results are summarized in an ANOVA table:
Source of Variation       | Sum of Squares | Degrees of Freedom | Mean Square | F
Between Treatments        | 36,648,561     |                    |             |
Error (Within Treatments) |                |                    |             |
Total                     | 57,095,287     |                    |             |

At the 0.05 level of significance, conduct a hypothesis test to
determine if all treatment means are the same.
ANOVA
Source of Variation       | Sum of Squares | Degrees of Freedom | Mean Square  | F
Between Treatments        | 36,648,561     | 4                  | 9,162,140.25 | 4.481
Error (Within Treatments) | 20,446,726     | 10                 | 2,044,672.6  |
Total                     | 57,095,287     | 14                 |              |

1. H0: The population means of all 5 treatments are the same.
   Ha: At least one treatment has a different mean.

2. Test statistic: Fc = 4.481

3. Critical value: At the 5% level of significance, F(4, 10) = 3.478

4. Fc > F(4, 10) ⇒ Reject H0

5. Conclusion: Reject H0 at the 5% level of significance.
   ⇒ There is sufficient evidence that at least one treatment
   has a different mean.
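The previously blank cells follow from the identity SS(Total) = SS(Factor) + SS(Error) together with the df and MS formulas. A quick arithmetic check in Python:

```python
# Tomato example: k = 5 treatments, 3 plants each, n = 15
k, n = 5, 15
ss_factor = 36_648_561
ss_total = 57_095_287

ss_error = ss_total - ss_factor      # SS(Error) = SS(Total) - SS(Factor)
ms_factor = ss_factor / (k - 1)      # MS(Factor)
ms_error = ss_error / (n - k)        # MS(Error)
f_stat = ms_factor / ms_error
print(ss_error, round(f_stat, 3))    # 20446726 and F = 4.481
```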
ANOVA

Example:
In a completely randomized experimental design, 7 experimental
units were used for each of the 4 levels of the factor:

Source of Variation       | Sum of Squares | Degrees of Freedom | Mean Square | F
Between Treatments        |                |                    |             |
Error (Within Treatments) | 24,000         |                    |             |
Total                     | 38,301         |                    |             |

Complete the ANOVA table and test the hypothesis that the
population treatment means are all the same, at α = 0.05.
ANOVA
Source of Variation       | Sum of Squares | Degrees of Freedom | Mean Square | F
Between Treatments        | 14,301         | 3                  | 4,767       | 4.767
Error (Within Treatments) | 24,000         | 24                 | 1,000       |
Total                     | 38,301         | 27                 |             |

1. H0: The population means of all 4 treatments are the same.
   Ha: At least one treatment has a different mean.

2. Test statistic: Fc = 4.767

3. Critical value: At α = 0.05, F(3, 24) ≈ F(3, 20) = 3.0983 (nearest table entry)

4. Fc > F(3, 20) ⇒ Reject H0

5. Conclusion: Reject H0 at the 5% level of significance.
   ⇒ There is sufficient evidence that at least one treatment has a
   different mean.
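The slides use the critical-value approach; an equivalent alternative, not shown in the slides, is to compute the p-value of the test statistic. A sketch using SciPy (assumes SciPy is installed):

```python
from scipy.stats import f

# Completely randomized design: F = 4.767 with df1 = 3, df2 = 24
p_value = f.sf(4.767, 3, 24)    # right-tail area beyond the test statistic

# Same decision as comparing Fc with the critical value
print("Reject H0" if p_value <= 0.05 else "Do not reject H0")   # Reject H0
```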

BADM 221 Statistics for Business

Unit 11
Linear Regression
Linear Regression

Regression is a statistical technique that uses the idea
that one variable may be related to one or more variables
through an equation.

Here we consider the relationship of two variables only, in a
straight-line relationship, which is called simple linear
regression.
Linear Regression

Simple linear regression uses the relationship between the
two variables to obtain information about one variable by
knowing the values of the other.

The equation showing this type of relationship is called the
linear regression equation.
Linear Regression

Linear equation: y = mx + b
(m = slope, b = y-intercept)

Example: y = 2x – 1
Slope = 2
y-intercept = –1

[Figure: graph of the line y = 2x – 1, crossing the y-axis at –1]
Linear Regression

We want to use X to predict (or estimate) the value of Y that
might be obtained without actually measuring it, provided
the relationship between the two can be expressed by a line.

"X" is usually called the independent variable and "Y" is
called the dependent variable.

[Figure: scatter plot of Statistics Score (Y) against Mathematics Score (X)]
Linear Regression
Example: The exam scores of a class of 9 students in
Mathematics ( X ) and in Statistics ( Y ) are shown
below:
Math Score (X) 80 58 92 60 75 63 93 76 78
Stat Score (Y) 78 64 96 62 78 65 90 61 82

[Figure: scatter plot of Statistics Score (Y) against Mathematics Score (X) for the 9 students]
Linear Regression
We want to determine the equation of the regression line
that best fits the data.

[Figure: four scatter plots of Statistics Score vs. Mathematics Score, each with a different candidate line drawn through the points]
Linear Regression
Equation of the regression line:
df SS
Regression 1 1004.483
Residual 7 301.517
Total 8 1306

Coefficients Standard Error t Stat p-value


Intercept 9.450 13.74 0.687 0.513
Math Score 0.872 0.1807 4.829 0.001

[Figure: scatter plot of the exam scores with the fitted regression line]
Linear Regression
Equation of the regression line:
df SS
Regression 1 1004.483
Residual 7 301.517
Total 8 1306

Coefficients Standard Error t Stat p-value


Intercept 9.450 13.74 0.687 0.513
Math Score 0.872 0.1807 4.829 0.001

Y  9.450  0.872 X
Statistics
Score

Stat Score  9.450  0.872  Math Score

Mathematics Score
29
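The coefficients reported in the output come from the least-squares formulas b = Sxy / Sxx and a = ȳ − b·x̄. A minimal sketch (function name illustrative; checked here on points that lie exactly on the line y = 2x − 1 rather than on the exam data):

```python
def least_squares(xs, ys):
    """Least-squares fit: slope b = Sxy / Sxx, intercept a = y-bar - b * x-bar."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx                 # slope
    a = y_bar - b * x_bar         # intercept
    return a, b

# Sanity check on points that lie exactly on y = 2x - 1
a, b = least_squares([0, 1, 2, 3], [-1, 1, 3, 5])
print(a, b)   # -1.0 2.0
```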
Linear Regression

We can then make predictions using the regression
equation:

Stat Score = 9.450 + 0.872 × (Math Score)

For example:

Score in Math | Estimated score in Stat
61            | 9.450 + 0.872 × 61 = 62.64
73            | 9.450 + 0.872 × 73 = 73.11
91            | 9.450 + 0.872 × 91 = 88.80
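These predictions are direct substitutions into the estimated equation; as a small helper (function name illustrative):

```python
def predict_stat_score(math_score):
    """Plug a Math score into the estimated regression equation."""
    return 9.450 + 0.872 * math_score

for x in (61, 73, 91):
    print(x, round(predict_stat_score(x), 2))
```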
Linear Regression

Is the regression relationship significant?

Null and Alternative Hypotheses

H0: There is no linear relationship between X and Y.
(The regression relationship is NOT significant.)
Ha: There is a linear relationship between X and Y.
(The regression relationship is significant.)
Linear Regression

Is the regression relationship significant?

Use the p-value approach:

Reject H0 if (p-value ≤ level of significance)
⇒ The regression relationship is significant.

Do not reject H0 if (p-value > level of significance)
⇒ The regression relationship is NOT significant.
Linear Regression

Is the regression relationship significant?

df SS
Regression 1 1004.483
Residual 7 301.517
Total 8 1306

Coefficients Standard Error t Stat p-value


Intercept 9.450 13.74 0.687 0.513
Math Score 0.872 0.1807 4.829 0.001

Which p-value?
Linear Regression

Is the regression relationship significant?

df SS
Regression 1 1004.483
Residual 7 301.517
Total 8 1306

Coefficients Standard Error t Stat p-value


Intercept 9.450 13.74 0.687 0.513
Math Score 0.872 0.1807 4.829 0.001

As an illustration: Take level of significance = 5%

The p-value for Math Score is 0.001 < the level of significance
 Reject H0  The regression relationship is significant.
Linear Regression

How good is the regression equation?

Coefficient of Determination, R²

R² = SS(Regression) / SS(Total)    (decimal → percent)

Interpreted as the percentage of the observed variation in Y
that can be explained by the variation in X.
Linear Regression
df SS
Regression 1 1004.483
Residual 7 301.517
Total 8 1306

Coefficients Standard Error t Stat p-value


Intercept 9.450 13.74 0.687 0.513
Math Score 0.872 0.1807 4.829 0.001

R² = 1004.483 / 1306 = 0.7691 = 76.91%

76.91% of the variability of the Statistics score can be explained
by the linear relationship with the Mathematics score.
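R² is a one-line ratio of the two sums of squares from the output:

```python
ss_regression = 1004.483
ss_total = 1306

r_squared = ss_regression / ss_total          # R^2 = SS(Regression) / SS(Total)
print(f"{r_squared:.4f} = {r_squared:.2%}")   # 0.7691 = 76.91%
```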
Linear Regression
Example:
A teacher wishes to investigate if there is any relationship
between a student’s exam score in Mathematics (X) and the
exam score in Accounting (Y). A sample of 11 students is
randomly selected and the results are summarized in the
ANOVA table below:

df SS
Regression 1 1305.68
Residual 9 81.96
Total 10 1387.64

Coefficients Standard Error t Stat p-value


Intercept 24.13 4.657 5.182 0.005
MathScore 0.759 0.063 11.974 0.001
Linear Regression
df SS
Regression 1 1305.68
Residual 9 81.96
Total 10 1387.64

Coefficients Standard Error t Stat p-value


Intercept 24.13 4.657 5.182 0.005
MathScore 0.759 0.063 11.974 0.001

What is the estimated regression equation that relates the exam score in
accounting (Y) to the score in mathematics (X)?

What is the estimated exam score in accounting if a student got a score of
80 in mathematics?
Linear Regression
df SS
Regression 1 1305.68
Residual 9 81.96
Total 10 1387.64

Coefficients Standard Error t Stat p-value


Intercept 24.13 4.657 5.182 0.005
MathScore 0.759 0.063 11.974 0.001

What is the estimated regression equation that relates the exam score in
accounting (Y) to the score in mathematics (X)?

Ŷ = 24.13 + 0.759 X
Acc. Score = 24.13 + 0.759 × (MathScore)

What is the estimated exam score in accounting if a student got a score of
80 in mathematics?

24.13 + 0.759 × 80 = 84.85
Linear Regression
df SS
Regression 1 1305.68
Residual 9 81.96
Total 10 1387.64

Coefficients Standard Error t Stat p-value


Intercept 24.13 4.657 5.182 0.005
MathScore 0.759 0.063 11.974 0.001

Is the regression relationship significant? Use the p-value approach and a 2%
level of significance.
Linear Regression
df SS
Regression 1 1305.68
Residual 9 81.96
Total 10 1387.64

Coefficients Standard Error t Stat p-value


Intercept 24.13 4.657 5.182 0.005
MathScore 0.759 0.063 11.974 0.001

Is the regression relationship significant? Use the p-value approach and a 2%
level of significance.

The p-value for MathScore is 0.001 < the level of significance
⇒ Reject H0 ⇒ The regression relationship is significant.
Linear Regression
df SS
Regression 1 1305.68
Residual 9 81.96
Total 10 1387.64

Coefficients Standard Error t Stat p-value


Intercept 24.13 4.657 5.182 0.005
MathScore 0.759 0.063 11.974 0.001

Compute the coefficient of determination between the exam score in
accounting and the exam score in mathematics. Interpret the result in the
context of the problem.
Linear Regression
df SS
Regression 1 1305.68
Residual 9 81.96
Total 10 1387.64

Coefficients Standard Error t Stat p-value


Intercept 24.13 4.657 5.182 0.005
MathScore 0.759 0.063 11.974 0.001

Compute the coefficient of determination between the exam score in
accounting and the exam score in mathematics. Interpret the result in the
context of the problem.

R² = 1305.68 / 1387.64 = 0.9409 = 94.09%

94.09% of the variability of the exam score in
accounting can be explained by the linear
relationship with the exam score in mathematics.
Linear Regression

           | df | SS
Regression | 1  | 1305.68
Residual   | 9  | 81.96
Total      | 10 | 1387.64

          | Coefficients | Standard Error | t Stat | p-value
Intercept | 24.13        | 4.657          | 5.182  | 0.005
MathScore | 0.759        | 0.063          | 11.974 | 0.001

Estimated regression equation:
Acc. Score = 24.13 + 0.759 × (MathScore)

Coefficient of determination:
R² = 1305.68 / 1387.64 = 0.9409 = 94.09%

Significance of the regression relationship?
p-value ≤ the level of significance ⇒ The regression relationship is significant.
p-value > the level of significance ⇒ The regression relationship is NOT significant.
Linear Regression
Example:
The accountant at Walmart wants to determine the
relationship between customer purchases at the store, Y ($),
and the customer monthly salary, X ($). A sample of 15
customers is randomly selected and the results are
summarized in the ANOVA table below:

df SS
Regression 1 186952
Residual 13 99236
Total 14 286188

Coefficients Standard Error t Stat p-value


Intercept 78.58 7.540 1.202 0.035
Salary 0.066 0.013 4.948 0.003
Linear Regression
df SS
Regression 1 186952
Residual 13 99236
Total 14 286188

Coefficients Standard Error t Stat p-value


Intercept 78.58 7.540 1.202 0.035
Salary 0.066 0.013 4.948 0.003

What is the estimated regression equation that relates the amount of the
customer's purchase (Y) to the customer's monthly salary (X)?
Linear Regression
df SS
Regression 1 186952
Residual 13 99236
Total 14 286188

Coefficients Standard Error t Stat p-value


Intercept 78.58 7.540 1.202 0.035
Salary 0.066 0.013 4.948 0.003

What is the estimated regression equation that relates the amount of the
customer's purchase (Y) to the customer's monthly salary (X)?

Ŷ = 78.58 + 0.066 X
Amt. Purchase = 78.58 + 0.066 × (Salary)
Linear Regression
df SS
Regression 1 186952
Residual 13 99236
Total 14 286188

Coefficients Standard Error t Stat p-value


Intercept 78.58 7.540 1.202 0.035
Salary 0.066 0.013 4.948 0.003

Is the regression relationship significant? Use the p-value approach and a 1%
level of significance.
Linear Regression
df SS
Regression 1 186952
Residual 13 99236
Total 14 286188

Coefficients Standard Error t Stat p-value


Intercept 78.58 7.540 1.202 0.035
Salary 0.066 0.013 4.948 0.003

Is the regression relationship significant? Use the p-value approach and a 1%
level of significance.

The p-value for Salary is 0.003 < the level of significance
⇒ Reject H0 ⇒ The regression relationship is significant.
Linear Regression
df SS
Regression 1 186952
Residual 13 99236
Total 14 286188

Coefficients Standard Error t Stat p-value


Intercept 78.58 7.540 1.202 0.035
Salary 0.066 0.013 4.948 0.003

Compute the coefficient of determination between the amount purchased and
the customer's monthly salary. Interpret the result in the context of the
problem.
Linear Regression
df SS
Regression 1 186952
Residual 13 99236
Total 14 286188

Coefficients Standard Error t Stat p-value


Intercept 78.58 7.540 1.202 0.035
Salary 0.066 0.013 4.948 0.003

Compute the coefficient of determination between the amount purchased and
the customer's monthly salary. Interpret the result in the context of the
problem.

R² = 186952 / 286188 = 0.6532 = 65.32%

65.32% of the variability of the amount
purchased can be explained by the linear
relationship with the customer's monthly salary.