EPSC 228/311/0228
There are many reasons schools are moving away from seat time and
toward competency-based learning. However, CBE has had little success,
if any, in Africa: both South Africa and Malawi tried it and abandoned
it. In South Africa it was tried for 12 years. Why did it fail in South
Africa?
Challenges of CBC
1. Failed system
It is also important to note that the proposed Competency-Based
Curriculum was tried for 12 years in South Africa, where it was called
Outcomes-Based Education (OBE), and it failed and was abandoned. It was
also tried in Malawi and abandoned.
The proposed Kenyan CBE was borrowed from Japan and South Korea. Kenya
does not share the culture of these countries, and it is wrong to
assume that we will succeed. Do we have the Asian culture that has been
responsible for the success of CBE there? No.
5. Expensive to implement.
6. Overloaded syllabus: between 11 and 12 subjects to be taken.
7. Pupils' progression from primary to secondary is unclear.
8. Teacher-based subjective assessment in determining movement
from primary to secondary is biased, and hence pupils cannot go to
their school of choice in the absence of a standard, unifying national
examination.
9. Damages national integration, because children will remain in
their neighborhoods.
10. Demands literacy of parents.
LECTURE 7
Welcome to lecture 7.
PSYCHOMETRIC CHARACTERISTICS OF A
GOOD TEST
TOPIC OBJECTIVES
After this lecture, the learner should be able to:
List and describe the characteristics of a good test.
List the factors that affect validity of a testing
instrument.
List the factors that affect reliability of a testing
instrument.
Explain the different ways of estimating reliability.
1. VALIDITY
A test is said to be valid when it measures what it is intended to
measure.
What are you looking for when you are establishing the validity of an
instrument? You are looking for:
i. Trustworthiness of the instrument. Is the instrument trustworthy?
ii. Credibility of the instrument. Is the instrument credible?
A valid classroom test measures what has been taught (or should have
been taught). There are several aspects of validity. These are:
Content validity
Construct validity
Face validity
Concurrent validity
Predictive validity
The two of these validities that are of particular importance with
respect to teacher-made tests are content and construct validity.
Content Validity
This is the most important validity for practicing teachers. It measures the
extent to which a test adequately covers the syllabus to be tested.
Content validity refers to the extent to which a test "covers" the content.
If the test does not cover the content that has been taught, then it
is not valid.
An essay test intended to measure knowledge is likely to lack
content validity because of the severe limitation on the number of
topics which can be included in one such test. In other words, the
sample of possible learning is small. Hence an essay test tends to
lack content validity; it lacks content balance.
A valid test provides for measurement of a good sampling of
content and is balanced with respect to coverage of the various
parts of the subject.
In primary schools, where teachers are compared at the end of the term
on the performance of their classes, teachers may set easier questions for
their classes in order that their pupils perform better.
Construct Validity
This is another aspect of validity that is important in teacher-made
tests. It refers to the kinds of learning specified or implied in the
course/learning objectives; that is, it is based on learning objectives.
For example, if the course learning objective specified that at the end of
the course the learner must:
i. Identify four characteristics of living things, test tasks are required
to measure identification of these characteristics. Each kind of
learning objective must be tested to provide a valid measurement of
achievement.
ii. Identify the methods freedom fighters used to destabilize the colonial
regime.
A test which measures only knowledge (e.g. ability to recall important
historical events) will lack validity if the objectives specify other
kinds of learning.
Face Validity
The test should look as if it is testing what it is intended to test.
This, however, is only the starting point.
Face validity describes how well a measurement instrument appears,
judging by its appearance, to measure what it was designed to
measure. For example, a test of mathematical ability would have
face validity if it contained math problems.
In short, a test is said to have face validity if it appears to be
measuring what it claims to measure.
For example, if we are trying to select pilots from highly trained
personnel, face-valid tests of rapid reaction time will help ensure
the full cooperation of the candidates.
Concurrent Validity
This is where test results are compared with another measure of the
same abilities taken at the same time, or at about the same time. For
example, comparing mock results and actual KCSE results: these
examinations are taken at about the same time, one in July (Mock) and
the other in November (KCSE). In the 1970s, mock results were used to
select students to join Form 5 in January, before the KCE results were
released, on the belief that the mock was a good measure of the final
examination, that is, that it had good concurrent validity.
[Figure: three score distributions, (a) negatively skewed, (b) normal,
(c) positively skewed]
A positively skewed distribution (c) reflects a very difficult test while a
negatively skewed distribution (a) reflects an easy test.
(iv) Length of a Test
The length of a test affects reliability. A very short test (of five
items, for example) cannot spread the scores sufficiently to give
consistent results. Five items are too few to provide a reliable
measure. In general, the longer the test, the more reliable it is.
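One practical consequence of this relationship is captured by the Spearman-Brown prophecy formula, which is standard in measurement theory although not given in the text; it predicts how reliability changes when a test is lengthened. A minimal sketch, with a hypothetical short test of reliability 0.50:

```python
def spearman_brown(r: float, k: float) -> float:
    """Predicted reliability of a test lengthened k times,
    given its current reliability r (Spearman-Brown prophecy formula)."""
    return k * r / (1 + (k - 1) * r)

# Hypothetical: a short test with reliability 0.50, doubled in length
print(round(spearman_brown(0.50, 2), 2))  # 0.67
```

Doubling the hypothetical test raises its predicted reliability from 0.50 to about 0.67, which illustrates the rule that longer tests tend to be more reliable.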
(v) Erratic or inconsistent Marking/Scoring
If the markers are erratic in scoring, the award of scores will be
unreliable. Inconsistency in scoring leads to low reliability of
results. This is why KNEC trains examiners on marking, so that if the
same test or script is marked by different examiners, or even when the
same examiner marks the same test at different times, the scores will
be similar. See the article of September 13, 2014, "Train every teacher
on setting, marking exams".
One student can take one form of a test, and the students sitting to the
right and left could have different variations of the same test. None of
the three students would have an advantage over the others; their
respective scores would provide a fair comparison of the variable being
measured. For example, measuring mathematical ability of standard 8
pupils.
Form A Form B
The two forms, A and B, are developed from the same curriculum and
subjected to the due process of the development of a good test. The
two tests can be taken concurrently or at different times.
3. Inter-Scorer Reliability
4. Inter-Observer Reliability
Time     Observer 1   Observer 2   Total   Difference
10.00    /            //           3       1
10.01    /            /            2       -
10.02    //           /            3       1
10.03    //           ///          5       1
10.04    ///          //           5       1
10.05    /            /            2       -
10.06    //           //           4       -
10.07    /            /            2       -
10.08    //           ///          5       1
10.09    //           //           4       -
Total    17           18           35      5
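The tally table above can be summarized numerically. The text does not give a formula, but one common index is simple percentage agreement; the sketch below is an assumption rather than the author's method, treating the overlapping tallies in each interval as agreements and the surplus tallies as disagreements:

```python
# Tally counts from the table above: (observer 1, observer 2) per interval
counts = [(1, 2), (1, 1), (2, 1), (2, 3), (3, 2),
          (1, 1), (2, 2), (1, 1), (2, 3), (2, 2)]

agreements = sum(min(a, b) for a, b in counts)       # overlapping tallies
disagreements = sum(abs(a - b) for a, b in counts)   # surplus tallies
percent_agreement = 100 * agreements / (agreements + disagreements)
print(agreements, disagreements, round(percent_agreement))  # 15 5 75
```

Under this definition the two observers agree on 15 of 20 recorded behaviours, giving about 75% agreement.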
5. Inter-Item Reliability
This involves splitting the test into two halves, for example
odd-numbered questions and even-numbered questions, and computing a
coefficient of reliability between the two halves. Split-half
reliability estimates are widely used because of their simplicity.
1. Reliability
a) Yes, the probability will be high if the two examiners are trained
so that they can mark consistently.
b) No, the probability will be low if the two are not trained.
Example 3: You know your actual weight to be 92 kg. You take your
weight three times in a day using the same machine and you find:
Morning = 85 kg
Lunch = 92 kg
Evening = 87 kg
Decision: Your scale is unreliable. The scale should read 92 kg at all
times, whenever you step on it. Note that a scale can be consistent
without being accurate; a reliable (but not valid) scale might read:
Morning = 85 kg
Lunch = 85 kg
Evening = 85 kg
2. Validity
[Figure: target diagram illustrating reliability versus validity]
2. Measuring Intelligence
Suppose you want to measure the intelligence of smart students, and
you decide to use a tape measure to measure the circumference of their
heads in centimeters. You consistently obtain the same value for each
student's head; that is, the tape measure is reliable. But using a
tape measure is not a valid measure of intelligence, because we do not
use a tape measure to measure intelligence; we use an intelligence
test to measure the intelligence of people.
LECTURE 11
Welcome to lecture 11
TOPIC OBJECTIVES
After this lecture, the learner should be able to:
Define table of specifications.
Describe the role of table of specifications in test
construction
List and discuss the steps/procedure in test
construction.
Explain the merits and demerits of essay and
objective tests.
Briefly describe the guidelines for the construction of essay and
objective tests.
The number, nature, and specificity of the categories will depend on
the purpose of the test. Although the two-dimensional grid for content
and abilities is most common, there will be cases in which three or
more dimensions are used.
Once the list of topics and abilities has been decided on, the next
task is to determine the relative emphasis to be given to each topic
and ability, and to enter into each cell either a percentage of the
test or the actual number of questions to be tested in that cell. It
may be that certain topics by their nature are essentially limited to
certain abilities. The test plan may also lead to insights into what
has previously been untested, and to ingenious solutions or new
approaches to writing items that test what may long have been
considered untestable in an objective format.
Once the purpose of a test has been determined, the teacher or test
developer has to make two decisions, namely the content (topics) to be
covered and the abilities (kinds of achievement) to be tested.
Once these two decisions have been made, the "specifications" for a
particular test can be drawn up. The "specifications" of a test are
presented in a table called a "Table of Specifications" or blueprint.
A table of specifications is a two-dimensional chart with the content
(topics) as one dimension and behavioural performance, or kinds of
achievement, as the other.
Content     Abilities                          Total
Topic 1     2    2    0    0    0    1         5
Topic 2     3    2    2    1    1    1         10
Topic 3     3    1    3    2    2    2         13
Topic 4     2    5    4    2    0    2         15
Topic 5     2    2    1    0    1    1         7
Total       12   12   10   5    4    7         50
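The row totals of such a table can be generated from percentage weights, as the text describes. A minimal sketch, where the topic weights are hypothetical but chosen to reproduce the row totals of the table above:

```python
# Hypothetical topic weights (percent of the test) that reproduce the
# row totals in the table of specifications above
total_items = 50
topic_weights = {"Topic 1": 10, "Topic 2": 20, "Topic 3": 26,
                 "Topic 4": 30, "Topic 5": 14}

# Convert each percentage weight into a number of test items
items_per_topic = {t: round(total_items * w / 100)
                   for t, w in topic_weights.items()}
print(items_per_topic)
# {'Topic 1': 5, 'Topic 2': 10, 'Topic 3': 13, 'Topic 4': 15, 'Topic 5': 7}
```

A quick check that the per-topic counts sum back to the planned test length guards against rounding errors in the blueprint.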
4. CONSTRUCTION OF TESTS
In the school setting the most convenient tests are paper and pencil tests
– or written tests. Such tests are commonly of two types:
(i) Essay tests and (ii) Objective type tests
ESSAY TESTS
The essential feature of an essay test is that it is open-ended and
each candidate may present his own answer in his own particular style.
Example:
(a) Name two sources of support which helped Britain during the Mau
Mau war of 1950-1959.
(b) Why did the British lose the war?
OBJECTIVE TESTS
An objective test is one so constructed that, irrespective of who
marks the answers, the score for a particular candidate is always the
same. The objectivity really refers to the marking of the test. In
order to achieve such objectivity, objective tests usually have
pre-coded answers. In any particular item, there has to be one and
only one correct answer.
MATCHING ITEMS
A matching item consists of two lists of words, phrases, pictures, or
other symbols, and a set of instructions explaining the basis on which
the examinee is to match an item in the first list with an item in the
second list. The elements of the list that is read first are called
premises, and the elements in the other list are called responses. It
is possible to have more premises than responses, more responses than
premises, or the same number of each. In the example of a matching
exercise that follows, the premises appear in the left-hand column,
with the responses at the right, but in some cases the responses may
be placed below the premises.
The primary cognitive skill that matching exercises test is recall.
1. KANU ( ) 2007
2. NARC ( ) 2013
3. PNU ( ) 1960
4. JUBILEE ( ) 2002
5. KADU ( ) 1925
(iv) A question
(v) A complete statement
(vi) An incomplete statement
Who among the following people chaired the Kenya Constitution Review
Commission?
1. Githu Muigai
2. James Orengo
3. Yash Pal Ghai
4. Paul Muite
5. Raila Odinga
a. Knowledge
b. Synthesis
c. Evaluation
d. Analysis
e. Comprehension
(ii) A statement which itself is true, but which does not satisfy the
requirement of the problem.
(iii) A carefully worded incorrect statement.
After reading the stem, the student should know exactly what the
problem is and what he or she is expected to do to solve it.
3. State the stem in positive form.
4. Keep the item short.
5. Word the alternatives clearly and concisely. This is to reduce student
confusion.
6. Keep the alternatives mutually exclusive.
7. Avoid "all of these", "none of these" and "both A and B" answer choices.
8. Keep option lengths similar.
9. Avoid cues to the correct answer.
10. Use only one correct option.
11. Vary the position of the correct option.
12. Guard against giving clues in the correct answers.
13. Avoid any tendency to make the correct answer consistently longer
than the distracters.
14. Avoid "give-aways" in the distracters, for example "always",
"only", "all", "never", etc.
15. Use language that is simple, direct and free of ambiguity.
16. Do not use double negatives in an item.
MERITS OF OBJECTIVE TESTS
LECTURE 12
Welcome to lecture 12
ITEM ANALYSIS
TOPIC OBJECTIVES
After this lecture, the learner should be able to:
Define item analysis.
Explain and illustrate the three aspects of item
analysis:
Item difficulty index
Item discrimination index
Distractor analysis
Calculate and interpret three aspects of item analysis.
In principle, item analysis can be carried out on both essay and
objective tests, but the techniques are much more developed for
objective test items.
For this type of analysis one requires the services of both subject
content specialists and test construction specialists. Most of the
qualitative analysis can be done before the tests are administered;
for example, judging the content validity is a qualitative type of
item analysis. Examining the stems of items for ambiguity is another
form of qualitative analysis.
When we prepare items for a test, we hope that each of them will be
useful in a certain statistical way. That is, we hope that each item
will turn out to be of the appropriate level of difficulty for the
group, that proportionately more of the better students will get it
right than the poorer ones, and that the incorrect options will prove
attractive to the students who cannot arrive at the right answer
through their own ability.
Item analysis uses statistical methods to identify any test items that
are not working well. If an item is too easy, fails to show a
difference between skilled and unskilled examinees, or is even scored
incorrectly, an item analysis will reveal it. That is, item analysis
information can tell us if
an item was too easy or too hard, how well it discriminated between high
and low scorers on the test, and whether all of the alternatives
(distractors) functioned as intended. The three most common statistics or
areas reported in an item analysis are:
The item difficulty index is one of the most useful, and most
frequently reported, item analysis statistics. It is a measure of the
proportion of
examinees who answered the item correctly. Teachers produce a difficulty
index for a test item by calculating the proportion of students in class who
got an item correct. The larger the proportion, the more students who
have learned the content measured by the item.
This approach is less precise, but it is good for helping teachers
understand the concept of the difficulty index in a simpler way. For
example, imagine a classroom of 40 Standard 6 students who took a test
which included the item below. What is the item difficulty of this
test item? The asterisk indicates that B is the correct answer.
A. Tom Mboya          6
*B. James Gichuru     24
C. Jomo Kenyatta      10
D. Robert Matano      0
Difficulty Index ranges from .00 to 1.00. For this example, Difficulty
Index=24/40=.60 or 60%. This means that sixty percent of students
knew the answer.
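The calculation above can be expressed directly in code; a minimal sketch:

```python
def difficulty_index(correct: int, total: int) -> float:
    """Proportion of examinees who answered the item correctly."""
    return correct / total

# 24 of the 40 pupils chose the keyed answer B
print(difficulty_index(24, 40))  # 0.6
```

The result of 0.6 means sixty percent of the class answered the item correctly, matching the worked example.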
21 – 40% Difficult
41 – 60% Average
61 – 80% Easy
In computing item difficulty index of a test item using this approach you
need to do the following:
Alternatives (Options)
Groups      A    B*   C    D    E
Upper 10    0    6    3    1    0
Lower 10    3    2    2    3    0
* = Correct answer.

Index of Difficulty = (R_U + R_L) / (n1 + n2)

Where,
R_U = the number of examinees in the upper-scoring group responding
correctly
R_L = the number of examinees in the lower-scoring group responding
correctly
n1 + n2 = the total number of examinees in the upper- and
lower-scoring groups respectively

ID = (Students with correct answers / Total students (n1 + n2)) x 100
   = 8/20 x 100 = 40% = 0.40
Since difficulty refers to the percentage getting the item right, the
smaller the percentage figure, the more difficult the item. The Index
of Difficulty can range between 0% and 100%, with a higher value
indicating that a greater proportion of examinees responded to the
item correctly, and that it was thus an easier item.
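The upper/lower-group version of the index can also be sketched in code, using the figures from the table above:

```python
def index_of_difficulty(r_upper: int, r_lower: int,
                        n1: int, n2: int) -> float:
    """ID = (R_U + R_L) / (n1 + n2): the proportion of the combined
    upper and lower groups that answered the item correctly."""
    return (r_upper + r_lower) / (n1 + n2)

# R_U = 6, R_L = 2, with ten examinees in each group
print(index_of_difficulty(6, 2, 10, 10))  # 0.4
```

The result of 0.4 (40%) agrees with the worked calculation and marks the item as difficult on the bands given earlier.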
The item discrimination index reflects the relationship between an
examinee's response to the given item (correct or incorrect) and the
examinee's score on the overall test.
D = R_U/n1 - R_L/n2, where n1 = n2 = 10 (the number of examinees in
the upper- and lower-scoring groups respectively).
[Table: correlation ranges and their descriptions]
A strong and positive correlation suggests that students who get any one
question correct also have a relatively high score on the overall
examination.
1. High scorers=6/20=0.30
Low scorers=18/20=0.90
2. High scorers=0/20=0.00
Low scorers=20/20=1.00
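Both worked examples give negative discrimination. A sketch of the computation, using the common definition D = R_U/n1 - R_L/n2 (assumed here, since the text gives the group proportions but not the formula written out in full):

```python
def discrimination_index(r_upper: int, n_upper: int,
                         r_lower: int, n_lower: int) -> float:
    """D = (proportion correct in upper group)
         - (proportion correct in lower group)."""
    return r_upper / n_upper - r_lower / n_lower

print(round(discrimination_index(6, 20, 18, 20), 2))  # -0.6  (example 1)
print(round(discrimination_index(0, 20, 20, 20), 2))  # -1.0  (example 2)
```

In both cases the low scorers outperform the high scorers on the item, which is the signature of a flawed item (or a miskeyed answer).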
Alternative Approach
Ru = the number of those in the high-scoring group who got the item
correct.
Rl = the number of those in the low-scoring group who got the item
correct.
If the discrimination index is negative, it means that for some reason
students who scored low on the test were more likely to get the answer
correct than students who scored high.
Just as the key, or correct response option, must be definitely
correct, the distractors must be clearly incorrect (or clearly not the
"best" option). In addition to being clearly incorrect, the
distractors must also be plausible. That is, the distractors should
seem likely or reasonable to an examinee who has not mastered the
content being tested.
The analysis of response options shows that those who missed the item
were about equally likely to choose answer A and answer C. No students
chose answer D, so option D does not act as a distractor: students are
really choosing between three options. This makes guessing correctly
more likely, which hurts the validity of the item.
A =6/40 =.15
B =24/40 =.60
C =10/40 =.25
D =0/40 =.00
A good distractor will attract more examinees from the lower group than
the upper group. In this example D was a very poor distractor. It was
obvious to both good and poor students. In this example distractors A and
C are functioning effectively.
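The option proportions above can be produced with a short sketch, using the counts from the worked example with B keyed correct:

```python
responses = {"A": 6, "B": 24, "C": 10, "D": 0}  # option counts; B is the key
total = sum(responses.values())

proportions = {opt: n / total for opt, n in responses.items()}
# A distractor chosen by nobody is not functioning
dead_distractors = [opt for opt, n in responses.items()
                    if opt != "B" and n == 0]
print(proportions)       # {'A': 0.15, 'B': 0.6, 'C': 0.25, 'D': 0.0}
print(dead_distractors)  # ['D']
```

Flagging zero-frequency distractors like D tells the item writer which options should be replaced with more plausible alternatives.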
Example 2.
Alternatives (Options)
Groups      A    B*   C    D    E
Upper 10    0    6    3    1    0
Lower 10    3    2    2    3    0
* = Correct answer.
LECTURE 13
Welcome to lecture 13
TOPIC OBJECTIVES
After this lecture, the learner should be able to:
Explain the mistakes teachers make in developing
classroom tests.
LECTURE 14
Welcome to lecture 14
TOPIC OBJECTIVES
After this lecture, the learner should be able to:
Explain the different procedures KNEC has put in place to provide
credible examination.
Test Administration
As already discussed, tests determine the destiny of individuals, and
hence the conditions under which they are administered must be fair
and uniform. In this respect, fair administration of tests takes into
consideration:
Rehearsal, in the case of national examinations.
Provision of uniform instructions on the conduct of examinations.
Test Scoring
Scoring is one pillar of fairness in an examination, and a source of
unfairness if not well managed. There are two types of scoring systems
in use in Kenya.
Manual scoring. Used mainly in schools and for essay-type
questions in KNEC examinations. Subjectivity/bias can be high
under manual scoring.
Electronic scoring, by use of an Optical Mark Reader/Scanner.
This is used by KNEC for scoring objective test items in KCPE and
KCSE. Objectivity is high in electronic marking/scoring. Objectivity
refers to consistency in test interpretation and scoring.
The conditions that promote fair scoring of a test include:
Moderation of the marking scheme.
Training of markers, in the case of essay tests.
Coordination of markers and putting them in smaller teams.
Retirement of erratic and generous markers.
Test difficulty.
Because the assumption under testing is normality, scores are
interpreted in relation to the normal curve. However, there is an
estimated (hypothetical) distribution and an observed (actual)
distribution of test scores. These are illustrated below using physics
and geography test scores.
[Figure: observed distribution of candidates who took the Geography
Test]
LECTURE 15
Welcome to lecture 15.
TOPIC OBJECTIVES
After this lecture, the learner should be able to:
Describe the terms:
Population
Statistic
Describe the application and interpretation of measures of central
tendency to test scores.
Describe the application and interpretation of measures of
variability to test scores.
Statistical Concepts
(i) What is a sample? The smaller group of people who actually
participate in the test is known as a sample. This is a sub-set of the
population and is represented by lower case n.
[Figure: the sample (n) shown as a subset of the population]
Descriptive Statistics
Once a large set of scores has been collected, certain descriptive
values can be calculated. These are values that summarize or condense
the set of scores, giving it meaning. Descriptive values are used by
teachers to evaluate the individual performance of pupils, and to
describe the group's performance or compare it with that of another
group.
Once you collect data from a large sample, you can do the following
things:
Organize and graph the test scores.
Apply descriptive statistics.
Prepare a frequency distribution table of the test scores. You can
present the test scores individually or in grouped format, in the form
of a frequency table or a histogram.
Score Interval   Frequency
70 – 79          6
60 – 69          1
50 – 59          3
                 ________
Total            17
Mode
The mode is the score most frequently received. It is used with
nominal data. In the ungrouped scores given above the mode is 75. For
grouped data the modal interval is 70-79.
A frequency distribution can be uni-modal (one mode), bi-modal (two
modes), tri-modal (three modes), or poly-modal (many modes).
Example: 2, 2, 2, 3, 4, 6, 6, 6, 7, 8 (bi-modal: the modes are 2 and 6).
Median
The median is the middle score; half the scores fall above the median
and half below. It cannot be calculated unless the scores are listed
in ascending or descending order. Hence the procedure for getting the
median of a distribution is as follows:
First arrange the scores in ascending or descending order.
Determine the position or location of the approximate median.
Mean
The mean is obtained from:
X̄ = ∑X / n
Where X (X bar) is the mean, ∑X is the sum of the scores, and n is the
number of scores. The symbol ∑ means sum of. Hence ∑X means sum of
all scores. In Greek ∑ is called sigma. X represents individual scores and
n is the number of students or number of scores.
Mean (X̄) of 66, 65, 61, 59, 53 = 304/5 = 60.8
The mean is appropriate for interval or ratio data.
The disadvantage of the mean is that it is influenced by outliers.
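All three measures of central tendency can be checked with Python's statistics module; a minimal sketch using the scores from the text:

```python
from statistics import mean, median, mode

scores = [66, 65, 61, 59, 53]
print(mean(scores))    # 60.8
print(median(scores))  # 61

# For the bi-modal example, mode() returns the first mode encountered
print(mode([2, 2, 2, 3, 4, 6, 6, 6, 7, 8]))  # 2
```

Note that for multimodal data `mode()` reports only the first mode it meets, so a frequency table is still needed to see that 6 is a mode as well.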
Group 1 Group 2
9 5
5 6
1 4
For both groups the mean and the median are 5. If you simply report
that the mean and median for both groups are identical, without
showing the variability of the scores, another person could conclude
that the two groups have equal or similar ability. This is not true:
Group 2 is more homogeneous in performance than Group 1. A measure of
variability is the descriptive term that indicates this difference in
the spread, scatter, or heterogeneity of a set of scores. There are
two such measures of variability: the range and the standard
deviation.
Range
The range is the easiest measure of variability to obtain and the one that
is used when the measure of central tendency is the mode or median.
The range is the difference between the highest and the lowest scores.
For example:
For Group 1: Range = 9 – 1 = 8
For Group 2: Range = 6 – 4 = 2
The range is neither a precise nor a stable measure, because it depends
on only two scores- the highest and the lowest.
Standard Deviation
The standard deviation (symbolized S.D) is the measure of variability
used with the mean. It indicates the amount that all the scores differ or
deviate from the mean – the more the scores deviate from the mean, the
higher the standard deviation. The sum of the deviations of the scores
from the mean is always 0. There are two types of formulas that are
used to compute S.D.
Deviation formula.
Raw score formula.
The deviation formula illustrates what the S.D. is, but is more difficult to
use by hand if the mean has a fraction. The raw score formula is easier
to us if you have only a simple calculator.
Let us use the scores: 7, 2, 7, 6, 5, 6, 2.
S.D. = √[ ∑(X – X̄)² / (n – 1) ]
Where S.D. is the standard deviation, X is a score, X̄ is the mean, and
n is the number of scores.
Some books, calculators, and computer programs will use the term n
rather than n-1 in the denominator of the standard deviation formula.
When the sample is large you can use n because a larger sample
approaches the population size.
Why n-1?
i. Use of n-1 gives a good estimate of population variance or S.D.
That is, it gives unbiased estimate of population variance.
ii. We use n-1 when the sample size is small in order to get unbiased
estimate of population variance.
Steps 2 – 3
X     X̄     (X – X̄)   (X – X̄)²
7     5      2         4
2     5     -3         9
7     5      2         4
6     5      1         1
5     5      0         0
6     5      1         1
2     5     -3         9
∑X = 35     ∑(X – X̄) = 0   ∑(X – X̄)² = 28
Step 4
S.D. = √(28 / (7 – 1)) = √4.67 ≈ 2.16
S.D. = √[ (∑X² – (∑X)²/n) / (n – 1) ]
Where ∑X² is the sum of the squared scores, ∑X is the sum of the
scores, and n is the number of scores.
X     X²
7     49
2     4
7     49
6     36
5     25
6     36
2     4
∑X = 35   ∑X² = 203
S.D. = √[ (203 – 35²/7) / (7 – 1) ] = √(28/6) ≈ 2.16
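Both formulas can be verified in code; a minimal sketch using the same seven scores, confirming that the deviation and raw-score routes give the same S.D.:

```python
from math import sqrt

scores = [7, 2, 7, 6, 5, 6, 2]
n = len(scores)
x_bar = sum(scores) / n  # the mean, 5.0

# Deviation formula: S.D. = sqrt(sum((X - mean)^2) / (n - 1))
sd_dev = sqrt(sum((x - x_bar) ** 2 for x in scores) / (n - 1))

# Raw-score formula: S.D. = sqrt((sum(X^2) - (sum(X))^2 / n) / (n - 1))
sd_raw = sqrt((sum(x * x for x in scores) - sum(scores) ** 2 / n) / (n - 1))

print(round(sd_dev, 2), round(sd_raw, 2))  # 2.16 2.16
```

The two routes are algebraically identical; the raw-score form simply avoids computing each deviation by hand.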
How do you interpret the scores in relation to the mean and S.D?
In reporting your pupils‘ scores, you need to report both the mean and
the S.D.
A test norm allows meaningful interpretation of test scores.
A person‘s raw test score is meaningless unless evaluated in terms
of the standardized group norms. For example, if a student
receives a raw score of 78 out of 100 in history, does that mean
that the student is doing well?
The score of 78 can be interpreted only when the norms are consulted.
If the mean of the test norm is 80 and the standard deviation is 10,
the score of 78 can be evaluated as "typical" performance, indicating
that the student possesses an average knowledge of history.
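This kind of norm-referenced interpretation is usually expressed as a standard (z) score, although the text does not use that term; a minimal sketch:

```python
def z_score(raw: float, mean: float, sd: float) -> float:
    """How many standard deviations a raw score lies from the norm mean."""
    return (raw - mean) / sd

# History score of 78 against norms with mean 80 and S.D. 10
print(z_score(78, 80, 10))  # -0.2
```

A z of -0.2 puts the score just a fifth of a standard deviation below the norm mean, which is why it counts as "typical" performance.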
SELF-ASSESSMENT EXERCISE
Use raw score formula to compute the mean and the SD for the test
scores of the following two groups of students:
Group 1: 9, 5, 1
Group 2: 5, 6, 4
What does the S.D. tell you about these two groups?
For Group 1 you should get SD of 4
For Group 2 you should get SD of 1.414
Interpretation
[Figure: overlapping frequency distributions of Group 1 and Group 2
scores]
For both groups, the test scores have the same mean but different
variability or spread. Students in Group 1 have a larger S.D.
(SD = 4), indicating that they are more heterogeneous in ability.
Students in Group 2 have a smaller S.D. (SD = 1.414), indicating that
they are more homogeneous in ability.
**********************************************************