
Preface

The Visual Data Analytics Module (ADaVis Module) uses visualisation to present learning
content for a Data Analytics course at higher education level. It concentrates on Anova
because the concepts behind this topic are difficult and complex, and students often
struggle to master them.

The module chooses appropriate visualisation techniques to explain the basic concepts,
their relevance to a wide range of topics, and their potential applications in life and career.
It aims to give students a clear conceptual framework in statistics and a solid integration
of concepts that carries over into future applications. In particular, the learning outcomes
of the module are:

• to understand the basic concepts of variation.
• to understand the connection between the basic concepts of variation.
• to analyse data using Anova procedure.
• to apply variation concepts in life.

Yes, dear students, this module helps you go deep into the basics and see their application
in real-world situations. The module consists of explorations, worked examples and activities.
Explorations help students investigate important concepts in Anova with the help of
visuals and illustrations. Worked examples provide solutions using the Anova procedure for
various cases. Activities encourage students to practise their understanding and knowledge
of Anova concepts in numerous situations.

The ADaVis module embeds meaningful learning elements to support the construction of
knowledge in such a way that students can clearly use the concepts they have learned. The
strategy is based on the need to prepare students with the skills to leverage data to guide
correct and efficient decision making.

Contents

Introduction to Anova
Between samples and within samples Variation
F statistics and F distribution
Exploration 1 Data distribution and variation
Exploration 2 Data movement effect to variations
Exploration 3 Sum of square of the variation
Exploration 4 Optional function for sum of square
Exploration 5 Advance organiser
Exploration 6 Anova procedure
Exploration 7 Mean square and F test
Worked example 1
Worked example 2
Worked example 3
Worked example 4
Activity 1
Activity 2
Activity 3
Activity 4

Introduction to Anova

Anova is short for analysis of variance. Anova tests the null hypothesis that the means of
three or more populations are all equal, against the alternative that at least one mean
differs. The comparison is based on variance instead of the means themselves. Anova splits
the aggregate variability observed in a data set into two sources: between samples and
within samples. Other names for variance in this context are variation, spread or dispersion.

First, we will see what the data for Anova look like.
In Anova, the data set is presented in k groups. The sample sizes of the groups,
n1, n2, ..., nk, are not necessarily equal, and the total sample size is n.

This is a simple example of a data set suitable for Anova. The set comes with three
groups, each with three members.

Group 1    Group 2    Group 3
2, 3, 4    2, 4, 6    3, 5, 7

Next, we will see how the concepts of between samples and within samples variation work.

Between samples and within samples Variation


Between samples variation considers the spread between the samples. Therefore it compares
the representative value of each sample (sample mean) with the representative value of all
samples (grand mean).

Within samples variation considers the spread only inside the samples. Therefore it
compares each value of each sample (each observation) with the representative value of that
sample (sample mean).

Total variation (both cases) considers the spread not only inside the samples but also
between the samples. Therefore it compares each value of each sample (each observation)
with the representative value of all samples (grand mean), repeating this for every sample
up to the last.

F statistics and F distribution


Since Anova involves a hypothesis test procedure, we use the F statistic and the
F distribution to complete the test.
Once we obtain the values of the variations, we proceed with the F test as our test
statistic. The F test in Anova is the ratio between two variances: between samples and
within samples.
The F distribution is another probability distribution. Similar to a chi-squared
distribution, it is positively skewed. It has two sets of degrees of freedom, called
numerator and denominator.
In Anova, the two sets of degrees of freedom are as follows. The degrees of freedom for
the numerator are k-1. The degrees of freedom for the denominator are n-k.
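
As a minimal sketch of the rule above, the following Python snippet computes both degrees of freedom from a list of group sizes. The function name `anova_degrees_of_freedom` is an illustrative assumption, not part of the module.

```python
def anova_degrees_of_freedom(group_sizes):
    """Return (numerator df, denominator df) for one-way Anova."""
    k = len(group_sizes)   # number of groups
    n = sum(group_sizes)   # total sample size
    return k - 1, n - k


# Introductory example: three groups with three members each.
df_numerator, df_denominator = anova_degrees_of_freedom([3, 3, 3])
print(df_numerator, df_denominator)  # 2 6
```

Unequal group sizes work the same way, because only the number of groups and the total sample size enter the rule.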

Exploration 1

Data distribution and variation

Let us look at a data set (Data set 1) that consists of three samples: Sample 1, Sample 2 and
Sample 3. Each has six records, as presented here. The sample means are
$\bar{x}_1 = 32$, $\bar{x}_2 = 37$ and $\bar{x}_3 = 41$. Now let us describe the
variation in this data set. Can you differentiate between samples variation and within
sample variation in this data set?

Data set 1
Sample 1: 33 20 49 35 32 23
Sample 2: 29 36 21 52 43 41
Sample 3: 33 25 50 45 26 37

[Figure: dot plot of Data set 1]

For the second data set (Data set 2) we also have three samples, each with six readings,
as in the previous data set. Again, we pay attention to the sample means:
$\bar{x}_1 = 32$, $\bar{x}_2 = 37$ and $\bar{x}_3 = 41$. Do you notice their values?
Wow! Each sample has exactly the same mean as in Data set 1. Again, can you
differentiate between samples variation and within sample variation in Data set 2?

Data set 2
Sample 1: 33 32 32 30 32 33
Sample 2: 36 37 37 36 39 37
Sample 3: 42 39 41 42 42 40

[Figure: dot plot of Data set 2]

In Data set 1, we can see strong within sample variation: the data are widely spread within
each sample. If we compare the three samples, the spread pattern in each sample is about
the same. Within sample variation is more prominent than between samples variation.
Within sample variation relates to the dispersion of data contained in the same sample.
Between samples variation relates to the dispersion of data taking into account all
available samples. Within sample variation for the first data set is large because the gaps
between the readings of the members of each sample are large. The dispersion between the
samples is small because the reading ranges of the three samples are very similar and the
readings are scattered throughout the range.

In Data set 2, we do not see the previous pattern. The data are spread across the samples,
but inside each sample they tend to be located around similar values. Between samples
variation is strong in Data set 2, while within sample variation is not. Within sample
variation for the second data set is small because the gaps between the readings of the
members of each sample are small. The dispersion between samples is large because the
three samples occupy different reading ranges and are clearly separated from each other.

Taking the whole case into consideration, there are a few important lessons. The two data
sets appear to be similar in their measure of central location, i.e. the mean. However, they
show distinct features in terms of variation. Data set 1 shows strong within sample variation
compared to between samples variation. Data set 2 shows between samples variation that is
stronger than within sample variation.

In Anova, we separate the two types of variation. The stronger the between samples
variation compared to the within sample variation, the stronger the evidence that a
significant difference exists among the population means.
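
The contrast can also be put into numbers. The sketch below is for illustration only: it applies the sums of squared deviations that Exploration 3 develops to the readings of Data set 2. The helper names `mean` and `sums_of_squares` are assumptions of this sketch.

```python
def mean(values):
    return sum(values) / len(values)


def sums_of_squares(samples):
    """Return (SS between, SS within) for a list of samples."""
    all_values = [x for sample in samples for x in sample]
    grand_mean = mean(all_values)
    ss_between = 0.0
    ss_within = 0.0
    for sample in samples:
        sample_mean = mean(sample)
        # Between: sample mean against grand mean, weighted by sample size.
        ss_between += len(sample) * (sample_mean - grand_mean) ** 2
        # Within: each observation against its own sample mean.
        ss_within += sum((x - sample_mean) ** 2 for x in sample)
    return ss_between, ss_within


data_set_2 = [
    [33, 32, 32, 30, 32, 33],
    [36, 37, 37, 36, 39, 37],
    [42, 39, 41, 42, 42, 40],
]
ss_between, ss_within = sums_of_squares(data_set_2)
print(round(ss_between, 3), round(ss_within, 3))  # 244.0 20.0
```

For Data set 2 the between samples sum of squares is far larger than the within sample one, which agrees with the visual impression described above.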

Exploration 2

Data movement effect to between sample and within sample variations

Consider a data set that consists of three groups measured in only one dimension, such as
score or weight. The groups can be plotted side by side to view the variation that exists.
To measure the variation, Anova finds its value from three sources: between samples,
within samples and total.

The respective sum of squares and mean square values provide the F test statistic. A larger
F value gives stronger evidence that the groups differ significantly in their population
means. Anova is about measuring the difference in means based on variance.

Observe the four situations below. Notice how a change in a data point might increase or
decrease the between samples variation and the within sample variation.

[Figures: dot plots of Data set 3, Data set 4, Data set 5 and Data set 6, each with its accompanying Anova Table]

Referring to Data set 3, within sample variation is more prominent than between samples
variation, as shown in the first two columns of the accompanying Anova Table. The other
three columns will be discussed shortly. The gaps among the members of each sample are
greater than the gaps between the three samples. Within sample variation relates to the
dispersion of data contained in the same sample. Between samples variation relates to the
dispersion of data taking into account all available samples.

By changing only one data point in Sample 2, we obtain Data set 4. Between samples
variation has increased while within sample variation has decreased. The movement has
tightened the gap within the samples, especially in Sample 2, and at the same time has
contributed to the dispersion of the three samples.

A further movement of data in Data set 5 again involves one value, item 2 of Sample 3,
and has another impact on the variation pattern. The movement has reduced the gap within
the samples, especially in Sample 3, and at the same time has contributed to a clearer
distinction between the three samples.

Finally, in Data set 6, one value moves: item 3 of Sample 1. Similarly, the movement has
reduced the gap within the samples, especially in Sample 1, and at the same time has
contributed to a clearer distinction between the three samples.

F statistics

Now that we have observed and discussed the variations, it is time to pay attention to the
rest of the results presented in the Anova Table, from column three onwards. Gradually
from Data set 3 to Data set 4, and finally to Data set 6, the movement of data has increased
the F statistic. In Data set 5 and Data set 6, the data are spread across the samples, but
inside each sample they tend to be located around similar values. Stronger between samples
variation compared to within sample variation leads to a larger F statistic and eventually
a smaller p-value.

Now that you have seen the general concept of how data distribution may affect the
variation in each source, we may go deeper into the concept of sum of squares. This is
closely related to the concept of variance, particularly the top component (numerator) of
its formula. Yes, that component is called the ‘sum of squares’.

Variations in real life application

As we can see in Data set 1 and Data set 2, heavy dependence on the mean, which measures
the central location of data, could create miscommunication and misunderstanding.
A single measure does not fully explain the real situation. Both data sets have similar
mean values, which suggests they are equal in terms of central tendency, but a further
investigation using variance shows that the data have completely different patterns.
Thus, we may use another measure such as variance, which measures the dispersion of data,
to obtain a better explanation of the data distribution. A good practice for exploring the
situation is to view your data in a dot plot and identify the data distribution and pattern.

Exploration 3

Sum of square of the variation

We use the same data set, i.e. Data set 3, to explore the calculation of the sum of squares.
The genuine question any student would raise is simply: “... but how do we obtain that
value?”

We present the data again.

Group 1    Group 2    Group 3
2, 3, 4    2, 4, 6    3, 5, 7

As you can see, this is the first example that we saw in the introduction. The solution
presented here refers to this data set.

We start by simply calculating the mean of each sample and locating the means in the same
plot.

Mean and grand mean


$\bar{x}_1 = 3$
$\bar{x}_2 = 4$
$\bar{x}_3 = 5$
$\bar{\bar{x}} = 4$

We clearly see three black dotted lines representing the sample means and one red dotted
line representing the grand mean. We will use these values to calculate the sum of squares
shortly.

[Figure: mean and grand mean location]

We start with the sum of squares of between samples variation. Between samples variation
considers the spread between the samples. Therefore it compares the representative value
of each sample (sample mean) with the representative value of all samples (grand mean).

The calculation is based on the gap between each black dotted line and the red dotted
line.

$SS_{between} = 3(3-4)^2 + 3(4-4)^2 + 3(5-4)^2 = 3(1)^2 + 3(0)^2 + 3(1)^2 = 3 + 0 + 3 = 6$

*compare each sample mean with the grand mean and weight it by the number of
observations in that sample

Within samples variation considers the spread only inside the samples. Therefore it
compares each value of each sample (each observation) with the representative value of that
sample (sample mean).
The calculation is based on the gap between each point and the respective black
dotted line.

$SS_{within} = (2-3)^2 + (3-3)^2 + (4-3)^2 + (2-4)^2 + (4-4)^2 + (6-4)^2 + (3-5)^2 + (5-5)^2 + (7-5)^2 = 18$

*compare each observation with the respective sample mean
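
The two calculations can be checked with a few lines of Python. This is a minimal sketch that follows the definitions directly; the variable names are chosen here for illustration.

```python
groups = [[2, 3, 4], [2, 4, 6], [3, 5, 7]]

# Sample means (3, 4, 5) and grand mean (4).
sample_means = [sum(g) / len(g) for g in groups]
grand_mean = sum(sum(g) for g in groups) / sum(len(g) for g in groups)

# Between: squared gap between each sample mean and the grand mean,
# counted once for every observation in that sample.
ss_between = sum(
    len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, sample_means)
)

# Within: squared gap between each observation and its own sample mean.
ss_within = sum(
    (x - m) ** 2 for g, m in zip(groups, sample_means) for x in g
)

print(ss_between, ss_within)  # 6.0 18.0
```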

Total variation (both cases) considers the spread not only inside the samples but also
between the samples. Therefore it compares each value of each sample (each observation)
with the representative value of all samples (grand mean), repeating this for every sample
up to the last.

The calculation for this last part will be part of your exercise in Activity 2.

Note

We apply the variance formula to derive the basis for the sum of squares.

Recall that

$variance = \frac{\sum (x_i - \bar{x})^2}{n-1}$

i.e. Sum of squares (total variation) $= \sum (x_i - \bar{x})^2$, where $\bar{x}$ is the
mean of all observations (the grand mean $\bar{\bar{x}}$)

therefore

$SS_{between} = \sum_{j=1}^{k} n_j (\bar{x}_j - \bar{\bar{x}})^2$

Similarly,

$SS_{within} = \sum_{j=1}^{k} \sum_{i=1}^{n_j} (x_{ij} - \bar{x}_j)^2$

Exploration 4

Preferred function for sum of square

Once we have a clear view of the concepts of variation and sum of squares, we may
proceed straight away with the technical procedure. Hang on! This last part introduces
other functions that serve as a shortcut to the tedious sum of squares calculation. The
basis for this calculation is the total of the observations, plus the total of the squared
value of each observation.

We use the same data so that you may compare the results and the steps taken. We
present the data again.

Group 1    Group 2    Group 3
2, 3, 4    2, 4, 6    3, 5, 7

We go back to basics by simply calculating the totals and identifying the sample sizes.

Total, T      n
$T_1 = 9$     3
$T_2 = 12$    3
$T_3 = 15$    3
$T = 36$      9

*Total value of each sample, grand total and sample size

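We apply the totals in various parts of the function, together with the sample sizes.

*Ratio of squared total to sample size, for each sample or for all samples combined

$SS_{between} = \left(\frac{9^2}{3} + \frac{12^2}{3} + \frac{15^2}{3}\right) - \frac{(36)^2}{9} = 150 - 144 = 6$

*compare the total of the squared values of the individual observations with the ratio of
squared total to sample size for each sample

$SS_{within} = 168 - \left(\frac{9^2}{3} + \frac{12^2}{3} + \frac{15^2}{3}\right) = 168 - 150 = 18$

Here 168 is the total of the squared observations: $2^2 + 3^2 + 4^2 + 2^2 + 4^2 + 6^2 + 3^2 + 5^2 + 7^2 = 168$.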
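
The shortcut can be sketched in Python as follows. The snippet assumes only the three groups shown above; the variable names are illustrative.

```python
groups = [[2, 3, 4], [2, 4, 6], [3, 5, 7]]

totals = [sum(g) for g in groups]      # T1, T2, T3 = 9, 12, 15
sizes = [len(g) for g in groups]       # 3, 3, 3
grand_total = sum(totals)              # T = 36
n = sum(sizes)                         # 9

# Total of the squared value of each observation.
sum_of_squared_values = sum(x ** 2 for g in groups for x in g)   # 168

# Ratio of squared total to sample size, added over the samples.
ratio_sum = sum(t ** 2 / size for t, size in zip(totals, sizes))  # 150.0

ss_between = ratio_sum - grand_total ** 2 / n    # 150 - 144
ss_within = sum_of_squared_values - ratio_sum    # 168 - 150

print(ss_between, ss_within)  # 6.0 18.0
```

Both values agree with the results obtained from the squared gaps in Exploration 3.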

Note

Sum of squares to measure variation

A sum of squares, as the name suggests, is the total of the squared differences between
individual values and a target value.

Which concept is related to the sum of squares?

It is variance. Variance has two components: the sum of squares and the degrees of freedom.
Other names for it are variation, dispersion or spread.

$variance = \frac{\sum x_i^2 - \frac{(\sum x_i)^2}{n}}{n-1}$

$variance = \frac{\text{Sum of squares}}{n-1}$

$\text{Sum of squares } (SS) = \sum x_i^2 - \frac{(\sum x_i)^2}{n}$

$SS_{total} = \sum x_{ij}^2 - \frac{(\sum x_{ij})^2}{n} = \sum x_{ij}^2 - \frac{(\sum T_j)^2}{n}$

Note that $T_j = \sum_i x_{ij}$, therefore $\sum T_j = \sum x_{ij}$

$SS_{between} = \sum \frac{(T_j)^2}{n_j} - \frac{(\sum T_j)^2}{n}$

$SS_{within} = \sum x_{ij}^2 - \sum \frac{(T_j)^2}{n_j}$
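
The shortcut form of the sum of squares follows from expanding the square in the definition. A brief derivation, using only the symbols above and the fact that $\bar{x} = \sum x_i / n$:

```latex
\begin{align*}
\sum (x_i - \bar{x})^2
  &= \sum x_i^2 - 2\bar{x}\sum x_i + n\bar{x}^2 \\
  &= \sum x_i^2 - 2\,\frac{(\sum x_i)^2}{n} + \frac{(\sum x_i)^2}{n} \\
  &= \sum x_i^2 - \frac{(\sum x_i)^2}{n}
\end{align*}
```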

Exploration 5

Advance organiser

We may use this illustration to get a general overview of Anova. Not only are we able
to identify the relationships between concepts, but we can also see how the technique
could be applied and may benefit society.

Exploration 6

Anova procedure

After the explanations of the important Anova concepts, it is now suitable to go into the
procedure. We will conduct Anova as a complete hypothesis test procedure in four steps.

Template of Anova procedure

1. State the null and alternative hypotheses

2. Calculate the value of the F test statistic

$F = \frac{\text{Variance between samples}}{\text{Variance within samples}}$

3. Determine the rejection region

4. Decision and conclusion

State whether you reject or do not reject H0 and relate your results to the presented case.
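
Steps 3 and 4 reduce to a single comparison, sketched below in Python. The function name `anova_decision` is an assumption made for illustration. The example values are F = 1 from the small three-group example in Exploration 7 and 5.14, the 5% critical value for degrees of freedom (2, 6) as listed in the F table of Worked example 1.

```python
def anova_decision(f_statistic, f_critical):
    """Step 4: compare the test statistic with the critical value."""
    if f_statistic > f_critical:
        return "reject H0"
    return "do not reject H0"


# F = 1 with degrees of freedom (2, 6); critical value 5.14 at the 5% level.
print(anova_decision(1.0, 5.14))  # do not reject H0
```

The critical value itself still has to be read from an F table or obtained from statistical software; the sketch only covers the decision.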

Exploration 7

Mean square and F test

The mean square is another calculation before we reach the final destination, i.e. the
F test. The mean square is the ratio of a sum of squares to its degrees of freedom. The
degrees of freedom are based on the sample size and the number of samples.

In this data set, we have three samples (k=3) and each sample has three observations, so
the total sample size is n=9.

Total, T      n
$T_1 = 9$     3
$T_2 = 12$    3
$T_3 = 15$    3
$T = 36$      9

*Total value of each sample, grand total and sample size

Degrees of freedom (between sample)= k-1= 3-1= 2

Degrees of freedom (within sample)= n-k= 9- 3= 6

$\text{Mean square (between samples)} = \frac{\text{Sum of squares (between samples)}}{\text{degrees of freedom (between samples)}}$

$\text{Mean square (within samples)} = \frac{\text{Sum of squares (within samples)}}{\text{degrees of freedom (within samples)}}$

For our recent example, the solution is provided here:

$\text{Mean square (between samples)} = \frac{6}{2} = 3$

$\text{Mean square (within samples)} = \frac{18}{9-3} = \frac{18}{6} = 3$

$F = \frac{\text{Mean square (between samples)}}{\text{Mean square (within samples)}} = \frac{3}{3} = 1$

Anova table

The Anova table summarises the important information from the Anova procedure.

For the recent example, the results are summarised in the table.

Source of variation           SS   df   MS   F
Between Groups (Treatment)     6    2    3   1
Within Groups (Error)         18    6    3
Total                         24    8
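
The rows of the table can be reproduced from the two sums of squares, the number of samples and the total sample size. The sketch below is one possible way to do this in Python; the function name `anova_table` and the dictionary layout are assumptions of this example.

```python
def anova_table(ss_between, ss_within, k, n):
    """Return the rows of a one-way Anova table as dictionaries."""
    df_between, df_within = k - 1, n - k
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return [
        {"source": "Between Groups", "SS": ss_between, "df": df_between,
         "MS": ms_between, "F": ms_between / ms_within},
        {"source": "Within Groups", "SS": ss_within, "df": df_within,
         "MS": ms_within},
        {"source": "Total", "SS": ss_between + ss_within, "df": n - 1},
    ]


for row in anova_table(ss_between=6, ss_within=18, k=3, n=9):
    print(row)
```

The Total row is obtained by adding the two sums of squares and the two degrees of freedom, which gives 24 and 8 for this example.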

Worked example 1
At the 5% significance level, does it appear that a difference exists in the true mean exam
score (full mark of 20) produced by four learning methods? The table presents the exam
scores obtained by students following the four methods.

Method 1   Method 2   Method 3   Method 4
13         20         17          8
16         19         14         12
15         18         13         11
17         16         14         14
20         17

Assumptions checking. We need to ensure all the requirements are fulfilled:

1. The populations from which the samples are drawn are (approximately)
normally distributed.

2. The populations from which the samples are drawn have the same variance
(or standard deviation).

3. The samples drawn from different populations are random and independent.

Suitability for Anova test:

Assumptions fulfilled.

The data contain four groups (fulfilling the requirement of at least three groups).

We use the 5% significance level. This will be used in finding the F critical value.

Anova procedure

H0: μ1=μ2=μ3=μ4
Ha: at least one of the population means is not equal to the others

Method 1   Method 2   Method 3   Method 4   Total
13         20         17          8
16         19         14         12
15         18         13         11
17         16         14         14
20         17                               Σx²=4344
T1=81      T2=90      T3=58      T4=45      T=274
n1=5       n2=5       n3=4       n4=4       n=18

Degrees of freedom for numerator = k-1 = 4-1 = 3

Degrees of freedom for denominator = n-k = 18-4 = 14

$SS_{between} = \left(\frac{81^2}{5} + \frac{90^2}{5} + \frac{58^2}{4} + \frac{45^2}{4}\right) - \frac{(274)^2}{18} = 4279.45 - 4170.889 = 108.561$

$SS_{within} = 4344 - \left(\frac{81^2}{5} + \frac{90^2}{5} + \frac{58^2}{4} + \frac{45^2}{4}\right) = 4344 - 4279.45 = 64.55$

Summary of results:

ANOVA
Source of Variation   SS         df   MS         F          P-value    F crit
Between Groups        108.5611    3   36.18704   7.848467   0.002584   3.343889
Within Groups          64.55     14    4.610714
Total                 173.1111   17
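
The sums of squares, mean squares and F statistic in this summary can be recomputed from the raw scores. The Python sketch below uses the shortcut formulas from Exploration 4. It assumes that the two scores in the last data row (20 and 17) belong to Method 1 and Method 2, which is consistent with the totals T1=81 and T2=90 and the sample sizes n1=n2=5.

```python
methods = [
    [13, 16, 15, 17, 20],   # Method 1
    [20, 19, 18, 16, 17],   # Method 2
    [17, 14, 13, 14],       # Method 3
    [8, 12, 11, 14],        # Method 4
]

k = len(methods)                                   # 4 groups
n = sum(len(m) for m in methods)                   # 18 observations
grand_total = sum(sum(m) for m in methods)         # 274
sum_of_squared_values = sum(x ** 2 for m in methods for x in m)   # 4344
ratio_sum = sum(sum(m) ** 2 / len(m) for m in methods)            # about 4279.45

ss_between = ratio_sum - grand_total ** 2 / n
ss_within = sum_of_squared_values - ratio_sum
ms_between = ss_between / (k - 1)
ms_within = ss_within / (n - k)
f_statistic = ms_between / ms_within

print(round(ss_between, 4), round(ss_within, 2), round(f_statistic, 3))
# 108.5611 64.55 7.848
```

The p-value and the F critical value are not computed here; they come from the F distribution with 3 and 14 degrees of freedom, using a table or statistical software.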

F table (significance level 5%)

Denominator            Numerator degrees of freedom (d1)
degrees of freedom     1      2      3      4
5                      6.61   5.79   5.41   5.19
6                      5.99   5.14   4.76   4.53
...
14                     4.60   3.74   3.34   3.11
15                     4.54   3.68   3.29   3.06

F critical value = 3.34

The test statistic F=7.848 is greater than the critical value, so it falls into the
rejection region. We reject H0. Thus our conclusion supports that a difference exists in
the true mean exam score produced by the four learning methods, at α=0.05.

p-value=0.0026

Decision and conclusion: We reject H0 since the p-value < 0.05. We conclude that a
difference exists in the true mean exam score produced by the four learning methods.

Worked example 2

There are nine special interest groups (SIG) in a faculty. Students participate in a survey
that measures their average score for the programme’s effectiveness in developing students’
soft skills. The survey uses a questionnaire with the following scale (1: very poor,
2: poor, 3: moderate, 4: strong, 5: very strong). The data are presented as follows.

SIG 1 SIG2 SIG3 SIG4 SIG5 SIG6 SIG7 SIG8 SIG9


4.34 4.57 4.23 4.65 4.71 4.23 4.36 4.65 4.58
3.56 4.65 4.33 4.32 4.09 4.16 4.25 4.37 4.61
4.76 4.78 4.47 4.15 4.32 4.54 4.31 4.61 4.72
4.34 4.91 3.92 3.92 4.21 4.26 4.65 4.26 4.53
3.67 4.58 4.12 3.67 4.63 4.36 4.16 4.28 4.9
4.12 4.67 4.37 4.51 4.52 4.1 4.22 4.17 4.67
4.81 4.18 4.78 4.14 4.84

You are asked to assist in conducting the Anova procedure and to come up with
recommendations. You are provided with the analysis output.

a) Using α=0.01, identify whether there is a significant difference in the groups’ mean
score for programme effectiveness in developing students’ soft skills.
b) Does SIG have a significant influence on students’ average score for programme
effectiveness?
c) Would you suggest further investigation of the groups doing better in the evaluation?
Why?

Solution
Before we proceed, check that all the assumptions required to apply the one-way Anova
procedure hold true.
1. The populations from which the samples are drawn are (approximately)
normally distributed.
2. The populations from which the samples are drawn have the same variance
(or standard deviation).
3. The samples drawn from different populations are random and independent.

a) One-factor Anova is conducted to see the influence of the SIG groups on their average
score for SIG programme effectiveness.
H0: μ1=μ2=μ3=μ4=μ5=μ6=μ7=μ8=μ9
Ha: at least one of the population means is not equal to the others

F test statistic = 0.265/0.0644 = 4.108


F critical value (at α=0.01; degrees of freedom 8, 50) ≈ 2.89
Comparing these two F values, it is clear that the test statistic falls into the rejection
region. The decision is to reject H0.
We may conclude that the mean score for programme effectiveness in developing students’
soft skills is not the same for each of the nine SIGs.
b) Based on the analysis, SIG clearly has a significant influence on students’ average
score for programme effectiveness.
c) Certainly, further investigation needs to be conducted to examine which groups are doing
better in the evaluation. The group means for SIG2 (4.71) and SIG9 (4.692) are far ahead of
the other groups.

Worked example 3

How do you run Anova in Excel and PSPP?

Conduct Anova using Excel

Ensure you have the Data Analysis package installed in your application.

Identify Anova from the tool selection and proceed with the details for the input.

Conduct Anova using PSPP

Declare the variables in the Variable window.

Key in the data in the Data window.

Select Anova from the analysis selection and proceed with the variables.

Worked example 4

The internship programme in a faculty can be conducted through three options: industrial
placement, research and development, and entrepreneurship. Before and after undergoing
their training, students were assessed on their knowledge in three aspects: hardware
and software skills, communication and professional skills, and problem solving skills.
The percentage of knowledge increase was recorded. Data for a sample of students were
obtained as follows.

Industrial placement Research and development Entrepreneurship


40.4 18 23.3
35.6 22.1 34.3
27.6 25 24.7
31.4 19.5 13.9
36.7 24.1 25.2
20 30.6 14.3
27.6 18.5

You are given the following output for the Anova test. Complete the procedure and provide
recommendations.

ANOVA
Source of Variation   SS         df   MS         F          P-value    F crit
Between Groups        352.8016    2   176.4008   4.319896   0.030412   3.591531
Within Groups         694.1864   17    40.8345
Total                1046.988    19

Before we proceed, we check that all the assumptions required to apply the one-way
Anova procedure hold true.
1. The populations from which the samples are drawn are (approximately)
normally distributed.
2. The populations from which the samples are drawn have the same variance
(or standard deviation).
3. The samples drawn from different populations are random and independent.
Anova solution
H0: μ1=μ2=μ3
Ha: at least one of the population means is not equal to the others

F test statistic = 176.401/40.834 = 4.320

F critical value (at α=0.05; degrees of freedom 2, 17) = 3.592
Comparing these two F values, it is clear that the test statistic falls into the rejection
region. The decision is to reject H0.
The evidence provides enough support that the mean percentage of knowledge increase is
not the same for each of the three internship options.
Based on the analysis, the internship option clearly has a significant influence on the
knowledge increase in the internship programme.
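
The mean squares and the F statistic can be recomputed from the sums of squares and degrees of freedom in the given output. The Python sketch below takes those four numbers and the reported F critical value as its only inputs; nothing is recomputed from the raw data.

```python
# Values copied from the given Anova output.
ss_between, df_between = 352.8016, 2
ss_within, df_within = 694.1864, 17
f_critical = 3.591531   # alpha = 0.05, degrees of freedom (2, 17)

ms_between = ss_between / df_between
ms_within = ss_within / df_within
f_statistic = ms_between / ms_within

print(round(ms_between, 3), round(ms_within, 3), round(f_statistic, 3))
# 176.401 40.834 4.32
print("reject H0" if f_statistic > f_critical else "do not reject H0")
# reject H0
```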

Activity 1

Sample 1 21 21 20 18 20 20
Sample 2 25 28 25 24 24 24

a) Plot your data in a dot plot.

b) Calculate the mean for each sample.

c) From your answers in a) and b), can you predict whether there is a significant
difference in the means?

d) Identify the source of variation that appears to be dominant in the data. Does the
variation come from between samples or within samples?

e) Repeat questions a) to d) for the new data set:

Sample 1 21 37 11 20 8 23
Sample 2 24 31 29 40 9 17

f) For which data set would you predict that we are able to reject H0?

Activity 2

Refer to Exploration 3. Use the data set and calculate the sum of squares of the total
variation. Use a visual to support your calculation.

Activity 3

Collaborate with your peers. Use a concept map to construct the relationships between
various topics:

a) Refer to the Anova concept map. Reflect on the various concepts available in Anova.

b) Using a concept map, identify the differences between Anova and:

I. Independent samples t-test
II. Paired samples t-test
III. Goodness of fit test
IV. Test of independence

Activity 4

Spend some time watching this video. You may go back to the video to ponder the main
concepts.

https://www.youtube.com/watch?v=ITf4vHhyGpc&t=23s

There are three groups in the comparison: Plain water, Juice, Coffee

1. What test is suitable if we have only two groups whose means we want to compare?
What is the reason for not using that test here?

2. There are two types of variation: variation between the groups and variation within
the groups. In the example, name the sources of the two types of variation.

3. In Example 1, it says ‘there is a lot of variation in each group’. Explain this statement.

4. In Example 2, it says ‘there’s not much variation in each group’. Provide an explanation.

5. In which example (Example 1 or Example 2) do we reject H0? Why?

[Hint: H0 in this experiment refers to the belief that there is no significant
difference between the groups]
