Rater Reliability

R. Gamaroff
Received 20 September 1998; received in revised form 30 March 1999; accepted 15 April 1999
Abstract
A major problem in essay assessment is how to achieve a reliable overall score based on judgements of specific criteria such as topic relevance and grammatical accuracy. To investigate this question, the writer conducted a workshop on interrater reliability at a conference of the National Association of Educators of Teachers of English (NAETE) in South Africa, where a group of 24 experienced educators of teachers of English were asked to assess two Grade 7 English essay protocols. The results revealed substantial variability in the attention that raters paid to different criteria, ranging from penalising students for spelling and/or grammatical errors to glossing over these criteria and considering mainly content. To overcome the problem of rater variability, some researchers recommend that more than one rater be used. The problem is that in the teaching situation there is rarely more than one rater available, and that rater is usually the teacher of the subject. The advantages and disadvantages of using a single rater and more than one rater are examined. Whether one uses one rater or several, without the quest for some kind of objective standard of what is, for example, (good) grammar and (good) spelling, and agreement on what importance to attach to particular criteria, there cannot be much reliability. © 2000 Published by Elsevier Science Ltd. All rights reserved.
Keywords: Reliability; Validity; Interrater reliability; Errors; Equivalent scores; Equivalent judgements; Subjectivity; Moderation; Levels of proficiency
1. Introduction
(Alderson cited in Douglas, 1995, p. 176), there are still many bugbears. This article
focuses on rater reliability in language testing. Rater reliability, which is arguably
the greatest bugbear in assessment (Moss, 1994), is concerned with reconciling
authentic subjectivity and objective precision.
Rater reliability is particularly important in `subjective' tests such as essay tests, where there exist fluctuations in judgements (1) between different raters, which is the concern of interrater reliability, and (2) within the same rater, which is the concern of intrarater reliability. This article focuses on interrater reliability. Interrater reliability involves two major kinds of judgements: (1) the order of priority that individual raters assign to performance criteria (criteria such as grammatical accuracy, content relevance and spelling) and (2) the agreement between raters on the ratings that should be awarded if or when agreement is reached on what importance to attach to different criteria.
Raters may give equivalent scores but this does not necessarily mean that these
scores represent what they are supposed to measure, i.e. that the (purpose of the)
test is valid. To illustrate, if all raters of an essay believe that spelling should be
heavily penalised and, accordingly, give equivalent scores in terms of spelling, the
interrater reliability would be high. The question, however, is whether spelling
should be the most important criterion. Or raters may differ in the importance they attach to different criteria. So, similar scores between raters do not necessarily mean similar judgements, and different scores between raters do not necessarily mean different judgements.
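As a minimal numerical sketch of this point (the scores and weights below are invented for illustration and are not data from the study), two raters can hold almost opposite views of grammar and spelling and still arrive at the same overall score once their criterion judgements are weighted and combined:

```python
# Hypothetical criterion scores on a nine-point scale for one protocol.
# Rater A penalises spelling heavily; Rater B penalises grammar heavily.
rater_a_scores = {"content": 7, "grammar": 6, "spelling": 2}
rater_b_scores = {"content": 7, "grammar": 2, "spelling": 6}

# Each rater also weights the criteria differently when forming an
# overall impression (weights sum to 1 in each case).
rater_a_weights = {"content": 0.3, "grammar": 0.2, "spelling": 0.5}
rater_b_weights = {"content": 0.3, "grammar": 0.5, "spelling": 0.2}

def overall(scores, weights):
    """Weighted overall score on the same nine-point scale."""
    return round(sum(scores[c] * weights[c] for c in scores), 1)

print(overall(rater_a_scores, rater_a_weights))  # 4.3
print(overall(rater_b_scores, rater_b_weights))  # 4.3: same score, opposite judgements
```

Run with a single shared set of weights, the same arithmetic would expose the disagreement immediately.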
In previous research on interrater reliability (Gamaroff, 1998a) I compared the assessments of lecturers of English for academic purposes (EAP) and science lecturers on first-year university student essays. These students were Tswana-mother-tongue speakers. The topic was the `Greenhouse Effect'. Comparisons were firstly made within the EAP group of lecturers and within the science group of lecturers, and secondly between the two groups of EAP and science lecturers. The findings showed a wide range of scores and judgements within and between groups. In this article I report the research on interrater reliability that was based on a workshop on language assessment that I conducted at a conference of the National Association of Educators of Teachers of English (NAETE) (Gamaroff, 1996). These educators taught at different universities, technikons and colleges of education in South Africa.
2. Literature review
The process of essay writing is ``probably the most complex constructive act that most human beings are ever expected to perform'' (Bereiter and Scardamalia, 1983, p. 20). If getting the better of words in writing is usually a very hard struggle for mother-tongue speakers, the difficulties are multiplied for the second-language learner (Widdowson, 1983, p. 34).
The writing process involves the ``pragmatic mapping'' of linguistic structures into extralinguistic context (Oller, 1979, p. 61). This mapping ability subsumes global comprehension of a passage, inferential ability, perception of causal relationships and deducing the meaning of words from context. All these factors mesh together to form a network of vast complexity, which makes objective assessment of essay performance very difficult. It is this vast complexity that makes written discourse, or essay writing, the most `pragmatic' of writing tasks and the main goal of formal education.
Because the production of linguistic sequences in essay writing is not highly constrained, problems arise in scoring when inferential judgements have to be converted to a score. The question is ``[h]ow can essays or other writing tasks be converted to numbers that will yield meaningful variance between learners?''. Oller (1979, p. 385) argues that these inferential judgements should be based on intended meaning and not merely on correct structural forms. That is why in essay assessment raters should rewrite (in their minds, but preferably on paper) the intended meaning. Perhaps one can only have an absolutely objective scoring system with lower-order skills; however, Oller is not claiming that his scoring system is absolutely objective, but only that, as far as psychometric measurement goes, it is a sensible method for assessing an individual's level within a group, irrespective of the actual scores of the individuals in the group (Oller, 1979, pp. 393–394). It has been recommended, however, that equivalence in scores between raters also be taken into account in the assessment of test reliability (Cziko, 1984; Ebel and Frisbie, 1991). Another problem in essay assessment is how to achieve a reliable overall score based on the judgements of specific criteria.
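A simple way to express such equivalence between raters (a sketch only; this is not the specific procedure recommended by Cziko, 1984, or Ebel and Frisbie, 1991) is to report how often two raters' scores coincide exactly, and how often they fall within one point of each other:

```python
def agreement(scores_1, scores_2, tolerance=0):
    """Proportion of protocols on which two raters' scores differ by
    no more than `tolerance` points."""
    pairs = list(zip(scores_1, scores_2))
    agreeing = sum(1 for a, b in pairs if abs(a - b) <= tolerance)
    return agreeing / len(pairs)

# Hypothetical scores from two raters on the same six protocols.
rater_1 = [4, 6, 5, 3, 7, 5]
rater_2 = [5, 6, 4, 3, 6, 5]

print(agreement(rater_1, rater_2))                # exact agreement: 0.5
print(agreement(rater_1, rater_2, tolerance=1))   # within one point: 1.0
```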
When one marks an essay, one can only do so through its structure. The paradox
of language is that structure must `die' so that meaning may live. Yet, if structure
were not preserved, language would not be able to mean. The German term aufhe-
ben (sublation) means `to clear away' and `to preserve': the simultaneous preserva-
tion and transcendence of the form/function antithesis. Language structure has to be
cleared away and preserved in order to convey its meaning (Coe, 1987, p. 13). By the
same token, our analytic criteria of assessment must be cleared away and preserved
in order to assess the total effectiveness of a piece of writing. Some of these analytic
criteria are morphology, phonology and spelling.
Confusions may arise between these three criteria. According to Oller (1979,
p. 279), errors of judgement in distinguishing between spelling, on the one hand, and
morphology and phonology, on the other, are not substantial enough to affect
reliability. Ingram (1985, p. 244), however, disagrees and maintains that ``it is often
a matter of judgement whether, for example, an error is merely spelling (to be dis-
regarded) or phonological or grammatical'' (see Alderson et al., 1995, p. 46, for a
similar view). Cziko (1982) believes that the subjective judgement of deciding what
3. Method
The essay test used in the workshop was part of a battery of English proficiency tests, which also included a cloze test, a dictation test, an error recognition test and a mixed grammar test. The battery was given to Grade 7 (12-year-old) entrants to Mmabatho High School (MHS) in the North West Province of South Africa in 1987, where I was a teacher from 1980 to 1987 (Gamaroff, 1997, 1998b).
The entrants consisted of L1 and L2 speakers of English, where these labels refer to those who took English as a first language taught subject and as a second language taught subject, respectively. All entrants in the L2 group were Bantu-mother-tongue speakers, mostly Tswanas. The L1 group consisted of a mixture of Tswana-, English- and Afrikaans-mother-tongue speakers, and also some whose mother tongue was difficult to identify because they came from a background where several languages or a hybrid of languages were used in the home, e.g. Afrikaans and English, Tswana and English (these were South Africans); Tagalog and English, Tamil and English (these were expatriates).
The L2 group and some of the L1 group came from DET (Department of
Education and Training) schools, where the medium of instruction was English
from Grade 5, while the majority of the L1 group came from Connie Minchin
Primary School in the North West Province of South Africa, which was the main
feeder school of L1 entrants to Mmabatho High School and where English was the
medium of instruction from Grade 1. (The DET was the South African education department responsible at the time for the schooling of black learners.)

Protocol 1 (L2) below is the first of the two protocols that the raters assessed:
If you cover a book you need several things such as a brown cover, a plastic cover
and selotape ect. First you open your couver and put the book on the corver.
You folled the cover onto the book and cut it with the sicor and folled it again.
You stick the cover with the selotape so that it mast not come out of the book.
Same aplies to when you cover with a plastic cover. Then you book is corved well.
Protocol 2 (L1) below belongs to a Sri Lankan of expatriate parents. (Recall that the labels `L1 learner' and `L2 learner' at MHS refer to learners who took English as a first language taught subject or as a second language taught subject.)
You need a roll of paper cover or plastic cover, A pair of scissors some sello-
tape. You put the book on the paper or Plastic and cut the length it is better if
about 5 cm of cover was left from the book. You cut it into strips You fold the
cover over the book. You then put strip of sellotape to keep them down. Then
you put plasitic paper over it and stick it down. Then you can put your name
and standard.
Participants in the workshop were requested to (1) work individually, (2) spend about one and a half minutes on each protocol, (3) give an impressionistic score based on criteria such as topic relevance, content and grammatical accuracy, and any other criteria they wanted to mention, and (4) give reasons for their judgements on the criteria they specified. I did not specify the criterion of `spelling'. I mention this because many participants gave prominence to spelling errors. Raters were explicitly asked, however, to take into account the criteria of `topic relevance', `content' and `grammar'. Most of the raters did not distinguish between topic relevance and content, so I subsumed the two criteria under content.
4. Results
Figs. 1 and 2 show the frequency distribution of the individual scores awarded by the 24 raters for Protocol 1 (L2) and Protocol 2 (L1), respectively. A nine-point scale was used: 0–1 point = totally incomprehensible; 2 points = hardly readable; 3 points = very poor; 4 points = poor; 5 points = satisfactory; 6 points = good; 7 points = very good; 8 points = excellent; 9 points = outstanding. A rating can refer to a numerical scale or to verbal judgements; I shall therefore refer to scores and judgements, not to ratings.
Fig. 1. Frequency distribution of the scores awarded by the 24 raters on Protocol 1 (L2).
Fig. 2. Frequency distribution of the scores awarded by the 24 raters on Protocol 2 (L1).
Although Protocol 2 in Fig. 2 has a wider range of scores (3–8) than Protocol 1 in Fig. 1 (3–7), there is far more variability between raters in Protocol 1. Table 1 below provides the average score for each of the six groups of raters: Groups A–F. Also included in Table 1 is the average score of the four raters at MHS who were involved in the original test battery. These scores appear after Group F.
I did not expect the scores of Groups A–F for Protocol 1 (L2) to be higher than those for Protocol 2, because I judged Protocol 2 (L1) to be better. In the original research at MHS, I awarded, in my capacity as one of the raters, a score of 4 for Protocol 1 and a score of 6 for Protocol 2.
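The summaries reported in Figs. 1 and 2 and Table 1 amount to a frequency count over the nine-point scale and an average per group of raters. The sketch below reproduces that arithmetic with placeholder scores; the actual individual scores are those listed in Tables A1 and A2 of the Appendix.

```python
from collections import Counter
from statistics import mean

# Placeholder scores for the 24 raters, arranged in Groups A-F of four;
# the real scores appear in Tables A1 and A2 of the Appendix.
protocol_1_scores = {
    "A": [4, 5, 3, 6], "B": [5, 4, 5, 7], "C": [3, 5, 4, 5],
    "D": [5, 6, 4, 5], "E": [5, 7, 4, 8], "F": [4, 5, 4, 6],
}

all_scores = [s for group in protocol_1_scores.values() for s in group]

# Frequency distribution over the nine-point scale (cf. Fig. 1).
print(sorted(Counter(all_scores).items()))

# Average score per group of raters (cf. Table 1).
for group, scores in protocol_1_scores.items():
    print(group, round(mean(scores), 1))
```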
Figs. 3 and 4 show a comparison between the percentage of negative judgements,
no judgements and positive judgements on the three criteria of content, grammar
and spelling for Protocol 1 (Fig. 3) and Protocol 2 (Fig. 4). Figs. 5 and 6 compare
the negative judgements of the `EL1' and `EL2' raters. (Percentages are calculated to
the nearest whole number.)
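The percentages plotted in Figs. 3–6 can be computed in the same straightforward way: for each criterion, count the raters in each judgement category, divide by the group size and round to the nearest whole number. The records below are hypothetical stand-ins for the judgements listed in Tables A1 and A2.

```python
# Each record: (rater group, criterion, judgement category).
records = [
    ("EL1", "spelling", "negative"), ("EL1", "grammar", "none"),
    ("EL1", "content", "positive"),  ("EL1", "spelling", "negative"),
    ("EL2", "grammar", "negative"),  ("EL2", "spelling", "none"),
    ("EL2", "content", "negative"),  ("EL2", "grammar", "negative"),
]

for group in ("EL1", "EL2"):
    for criterion in ("content", "grammar", "spelling"):
        cell = [j for g, c, j in records if g == group and c == criterion]
        if not cell:
            continue
        negative = round(100 * cell.count("negative") / len(cell))
        print(f"{group} {criterion}: {negative}% negative")
```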
Table 1
NAETE Workshop and MHS: average scores on Protocols 1 and 2 of groups of raters
Fig. 3. Percentage of positive judgements, no judgements and negative judgements for Protocol 1 (L2).
Fig. 4. Percentage of positive judgements, no judgements and negative judgements for Protocol 2 (L1).
In Figs. 5 and 6, `EL1' refers to raters who use English as a first language (i.e. the language they know best), and `EL2' refers to raters who use English as a second language. If raters had two languages that they claimed to know equally well (see Tables A1 and A2, column 2 of the Appendix), they were categorised as EL1 speakers. It is important to keep the following distinction in mind: the labels L1 and L2 that are used to refer to the protocols refer to learners who took English as a first or second language taught subject. Thus, L1 in the general context of MHS is not identical to the language a person knows best. For example, the L1 protocol belonged to a Sri Lankan expatriate, whose mother tongue was Tamil, but who claimed to know English well enough to describe himself as an EL1 speaker, i.e. English was the language he claimed to know best. The label EL1, which I have used for the NAETE raters, is used in the sense of English as the language that one knows best.
Figs. 5 and 6 focus on the negative judgements, where the EL1 group is compared
with the EL2 group.
Protocol 1 (Fig. 5) shows that 33% of all the raters (EL1+EL2) gave negative comments on content and grammar, while 54% considered spelling to be a problem. There was a substantial difference between the negative judgements of EL1 and EL2 on grammar (19 and 63%, respectively) and on spelling (69 and 25%, respectively), where the judgements of EL1 are almost the reverse of EL2: what EL1 considers to be spelling errors, EL2 considers to be grammatical errors (see Tables A1 and A2 in the Appendix for individual judgements). It would have been interesting to find out which errors were considered to be spelling errors and which ones grammatical errors. For example, consider the highlighted errors in Protocol 1 (the protocol is repeated for easy reference), where different kinds of errors have been highlighted:
If you cover a book you need several things such as a brown cover, a plastic
cover and selotape ect. First you open your couver and put the book on the
corver. You folled the cover onto the book and cut it with the sicor and folled it
again. You stick the cover with the selotape so that it mast not come out of the
book. Same aplies to when you cover with a plastic cover. Then you book is
corved well.
I judged the two italicised errors *folled and *aplies to be spelling errors and *mast to be a phonological error. The other deviant forms are more difficult to specify. Are the different deviant forms of `cover' to be labelled as spelling or phonological errors? Compare these forms with the following deviant phonological forms from Oller (1979, p. 279):
rope – *robe
expected – *espected
ranch – *ransh
something – *somsing
In Protocol 2 there are hardly any deviant forms, and thus little possibility of
confusing spelling errors with grammatical errors. Only one rater (an EL2 rater)
mentions spelling errors. Most of the errors in Protocol 2 are punctuation errors,
which are judged to be `grammatical' errors by most in the EL1 and EL2 groups.
Protocol 2 is repeated for easy reference:
You need a roll of paper cover or plastic cover, A pair of scissors some sell-
otape. You put the book on the paper or Plastic and cut the length it is better if
about 5 cm of cover was left from the book. You cut it into strips You fold the
cover over the book. You then put strip of sellotape to keep them down. Then
you put plasitic paper over it and stick it down. Then you can put your name
and standard.
The punctuation errors are serious, while ``left from the book'' and ``cut into strips'' affect the coherence to a certain extent. The pronoun ``them'' (in bold) does not seem to be a grammatical error because it agrees with ``strips'' in the previous sentence (not with ``strip'' in the same sentence). There seems to be one grammatical error, namely the missing `a' between ``put'' and ``strip'' in the third-last sentence, but no spelling errors. There was a substantial difference in negative judgements between EL1 and EL2 on content: 63% and 38% (Fig. 6). The overall picture on Protocol 2, as far as content and grammar are concerned, is that 54% of the raters were negative about content, and 42% were negative about grammar.
Consider the relationship between individual scores and judgements. Similar scores between raters do not necessarily mean similar judgements, and different scores between raters do not necessarily mean different judgements. Examples are provided from Protocols 1 and 2. In Protocol 1 the following judgements went together with the same scores (the judgements of all the raters for Protocols 1 and 2 are found in Tables A1 and A2, respectively, of the Appendix): a score of 3 for one rater represented ``meaningless cloudy'' (Rater C1), while for another rater the same score of 3 represented ``misspelled many words but not to bad'' (Rater E5: this rater was excluded from the main analysis because he/she was the fifth member of Group E, which had been reduced to four). Many of the misspelled words in Protocol 1 were deviant forms of the one word `cover'. A score of 5 for C4 represented ``Topic deviates. Content sequence satisfactory. Major grammatical. Errors detracts from coherence'', but for D1 the same score represented ``Only one great fault is spelling, quite distracting''. D3, who awarded a score of 6, states: `This learner belongs to an elite group'.
Consider the following examples from Protocol 2: E2, who awarded a score of 5, said ``General reluctance to give extremely high or low marks''. E2's score for Protocol 1 was 7, which seems to contradict the reluctance to give extremely high or low marks, unless a score of 7 is not an ``extremely'' high mark in E2's eyes. If so, one does not know what to make of E2's remark that a score of 5, which E2 gave for Protocol 2, steers a middle path between an ``extremely low'' and an ``extremely high'' score. E2 has a point about ``playing safe'': it is safer to give an average score than to fail the learner or to give a high score. One hugs the safe side of justice.
A few other examples from Protocol 2: some raters attached more importance than others to the segment ``cut into strips''. Consider the remarks of the following raters, which all contained the phrase ``cut into strips''. They all awarded a score of 5 and commented only on content. They were all EL1 speakers.
D1, E2 and F2 made a big issue of ``cut into strips'', which in their eyes made the content inadequate, while F1 and D2 made overall positive comments on content. F1's comment seems the most reasonable, because the fact that ``cut into strips'' is not in the correct sequence does not have a significant effect on coherence: when one reads the sentence that follows this segment, it seems quite clear that one is talking about cutting the sellotape into strips, not the paper used to cover the book, nor the book! Perhaps ``cut the flaps'' is what the writer meant by
``cut into strips''. D2 calls ``cut it into strips'' a ``cohesion'' error (D2 has under-
lined ``it''). The problem is indeed one of cohesion, which in turn affects coherence.
(It is the lack of coherence that enables one to recognise the cohesion problem.)
What this analysis reveals is that it is not so easy to describe how a book is covered.
Most young children and adults alike can cover books, but both children and
adults might not find it so easy to describe, even in their mother tongue, how to
cover one.
One may argue that because there are no data on which words in the protocols individual raters judged to be spelling or grammatical errors, there is no reason to believe that my judgements would be better than other people's. I suppose some judgements must be better than others. Some raters must be wrong and others right; or is it all a matter of interpretative variations on a poststructuralist theme? My judgements aside, the research is still useful because it shows that many of the raters in this investigation cannot agree on what is spelling and what is grammar.
7. Moderation workshops
The differences between the NAETE raters, as shown in Figs. 3–6, are worrying, even more so when compared with their answers to the questions on moderation that were given in the questionnaire at the NAETE workshop (see Table A3 of the Appendix).
In the questionnaire, 14 of the 24 participants stated that in their workplace they never found any significant difference between their ratings and those of their colleagues. Of the seven raters who said that they did find significant differences in the workplace, only four found this a problem. As far as the participation in moderation workshops was concerned, seven of the 24 stated that they had never participated in a moderation workshop. Of the 17 remaining raters, 11 commented on whether these moderation workshops resulted in any improvement. Three of these 11 raters said that there was a great improvement, six said that there was a fair improvement, one said that there was a negligible improvement, and one said that there was no noticeable improvement.
8. Implications
A test is said to be used for a valid purpose when the tester knows what is being tested. However, if testers cannot agree on what that `what' is, i.e. if there is no interrater reliability, there can be no validity. So, validity and reliability are two sides of the same corner: you cannot go round the one side without bumping into the other.
To clarify a possible confusion between rater reliability and concurrent validity: rater reliability has to do with the consistency between raters' judgements on one test method, e.g. an essay test. Concurrent validity, in contrast, has to do with the correlation between two or more different test methods, e.g. a dictation test and an essay test.
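The distinction can be put in computational terms (a sketch with invented scores, not the study's data): interrater reliability correlates two raters' scores on the same essay test, while concurrent validity correlates scores from one test method with scores from another.

```python
from math import sqrt

def pearson(x, y):
    """Pearson product-moment correlation between two lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# Hypothetical results for six learners.
essay_rater_1 = [4, 6, 5, 3, 7, 5]        # essay scores from rater 1
essay_rater_2 = [5, 6, 4, 3, 6, 5]        # essay scores from rater 2
dictation     = [45, 70, 55, 30, 80, 60]  # a different test method

# Interrater reliability: two raters, one test method.
print(round(pearson(essay_rater_1, essay_rater_2), 2))

# Concurrent validity: one test method against another.
print(round(pearson(essay_rater_1, dictation), 2))
```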
The variability in attention that raters pay to different criteria is a general problem in all kinds of educational institutions, where ``lecturers [or teachers] vary from penalising students heavily for mechanical and grammatical errors to ignoring the linguistic surface and marking on content and organisation'' (Bock, 1998, p. 53). There are different learning styles, teaching styles and also different rating styles. One rater, as indeed one learner or one teacher, may be mainly interested in the big picture, i.e. in coherence, while another may be mainly interested in systematicity and structure. Moderation workshops in my experience do not seem to be able to bring about an effective truce in these `style wars' (Oxford et al., 1991; Dreyer, 1998).
With regard to the level of English proficiency of raters, it does not follow that, because a rater (or anybody else) is not a mother-tongue speaker (of English in this case), his or her English proficiency is necessarily lower than that of a mother-tongue speaker of English. Many non-mother-tongue-English speakers have a higher level of academic English proficiency than mother-tongue-English speakers. A major reason for this is not a linguistic one, but that these non-mother-tongue speakers are more academically able, i.e. they have better problem-solving abilities and abilities for learning, and in the case of raters, for assessment (Vollmer, 1983, p. 22; Bley-Vroman, 1990).
In the research situation, it is possible to have more than one rater, even four. Four
raters would be a rare luxury outside a research situation. Most testing situations are
not research situations but teaching situations where often only one rater is available,
and where moderation workshops are seldom (usually only once) or never held.
One may argue that the reason teachers/lecturers do not have moderation workshops, or have them seldom, is that, as many of the NAETE participants said, they did not find any significant difference between their ratings and those of their colleagues in their respective workplaces. In educational institutions, especially tertiary institutions, there is a large turnover of personnel. Thus, if one has about 10 years of experience, one should have had more than one workshop on moderation, because one would generally have worked at more than one institution.
One may argue further that the reason for the differences between the NAETE
raters was that they did not come together previously to discuss the protocols that
they were asked to judge individually in the conference workshop. I would imagine
that educators of English teachers, even if they did not confer beforehand on
Oller's point is that because it is difficult to get raters to agree, one should do the next best thing and try to agree with oneself (intrarater reliability). If one takes into account the gargantuan problems of rater subjectivity, it may be better to use one rater to mark a specific group of test-takers rather than several raters, so that if we cannot improve interrater consistency to any significant extent, we can at least try to make sure that the same person marks all the protocols of the group he or she teaches. But then, as we know, we cannot be sure that the rater will not mark differently before breakfast (a good or bad one) than after.
Raters are in danger of following a circular route to control what is very difficult or perhaps impossible to control, namely, subjectivity (Davies, 1990, p. 4). The problem is very similar to the problem of finding the `best test'. If the construct validity of one test always depends on the validity of another, there cannot exist any one test that stands by itself, an equivalent of a `Prime Mover'. Lado's (1961, p. 324) solution is to compare all tests in terms of ``some other criterion whose validity is self-evident, e.g. the actual use of the language.'' The question is: what is self-evident? Is there a self-evident test that pre-exists all other tests? There is not, because ``the buttressing validity of an external criterion is often neither definable nor, when found, reliable'' (Davies, 1990, p. 3).
Often mother-tongue proficiency is advocated as an absolute yardstick of language proficiency, but, as Bachman and Clark (1987, p. 29) point out, ``native speakers show considerable variation in proficiency, particularly with regard to abilities such as cohesion, discourse organisation, and sociolinguistic appropriateness.'' As a result, theoretical differences between testers can affect the reliability of the test.
Raters who know the language well, indeed even mother-tongue speakers, can differ radically in their assessments of such pragmatic tasks as essay tasks. That is why different raters' scores on a particular protocol are often incommensurate with their judgements. Owing to these problems, it is virtually impossible to define criterion levels of language proficiency in terms of actual individuals or actual performance. Bachman and Clark (1987, p. 30) suggest that such levels must be defined abstractly in terms of the relative presence or absence of the abilities that constitute the domain. But again this does not solve the problem, because the difficulty is how to apply the definition to concrete situations of language behaviour.
The more satisfying the explanation, the foggier the idea may be of what is going
on in the test-taker's head. Thus, in an error analysis it is indeed possible to label the
error in purely linguistic terms, but the more important diagnostic issue of why
speci®c errors are committed remains largely a mystery. Raters are like inquisitive
(or lethargic) insects picking their way around in a ``gigantic multi-dimensional
cobweb'' in which every item requiring an interpretation is attached to a host of
others (Aitchison, 1987, p. 72).
9. Conclusion
Appendix
Tables A1 and A2 show the scores and judgements of individual raters on Proto-
cols 1 and 2, respectively. These tables have been divided into EL1 and EL2 sections,
then sorted within the EL1 and EL2 sections on scores in ascending order so that the
same scores appear together, which makes it easy to compare similar scores with
their corresponding judgements. If the language in the L1 column is English, then that rater is an EL1 speaker.
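The sorting described above is a simple two-key sort: first by EL1/EL2 section (English in the L1 column meaning EL1), then by score in ascending order within each section. A sketch with illustrative rows (the rater labels and judgements below are placeholders, not rows copied from the tables):

```python
# Illustrative rows only (rater, language known best, score, judgement);
# the real rows are those of Tables A1 and A2.
rows = [
    ("R1", "English", 5, "Only fault is spelling"),
    ("R2", "Xhosa",   5, "No comment"),
    ("R3", "English", 3, "Meaningless, cloudy"),
    ("R4", "Tswana",  8, "-"),
    ("R5", "English", 6, "Belongs to an elite group"),
]

def sort_key(row):
    # EL1 section first (English as the language known best), then
    # ascending score within each section.
    rater, language, score, judgement = row
    return (0 if language == "English" else 1, score)

for rater, language, score, judgement in sorted(rows, key=sort_key):
    print(rater, language, score, judgement)
```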
Table A1 (continued)

Rater  L1       Score  Judgement
E1     Xhosa    5      No comment
F3     English  4      Neither gives precise enough instructions to enable s.o. who does not know how to cover a book to cover one
E4     Tswana   8      –
The relevant questions (J–L) of the questionnaire and Table A3 containing the corresponding data are presented below:

J. (i) Do you find that one or more of your colleagues in the workplace evaluate(s) pupil/student protocols in such a way that your respective allocation of scores is significantly different? Yes..., No... (ii) If yes, do you find this to be a serious problem? Yes..., No...
K. Do you have moderation workshops/meetings with your colleagues? 1. Never...; 2. Once annually...; 3. More than once annually...; 4. More than twice annually...
L. If your answer to the previous item is not `never', have you found that these moderation workshops/meetings at your institution have ironed out the assessment disparities between you and your colleagues? 1. There has been a great improvement...; 2. A fair improvement...; 3. A negligible improvement...; 4. No noticeable improvement...; 5. They're a waste of time...
Table A3

Rater  Institution(s)                    Years  J(i)    J(ii)   K      L
A1     University                        20     No (a)  –       1      –
A2     Wits, University of South Africa  12     No      –       Never  –
A3     University                        16     Yes     Yes     1      2
A4     University                        7      No      –       Never  –
B1     Natal                             18     No      –       2      1
B2     Rhodes University                 7      No      –       2+     2
B3     University of Transkei            8      No      –       1+     –
B4     University                        9      –       –       Never  –
C1     Lesotho                           6      No      –       1      1
C2     University of South Africa        12     No      –       2+     1
C3     University of Fort Hare           18     No      –       Never  –
C4     Rhodes                            10     No      –       2+     –
D1     Fort Hare                         20     Yes     No (b)  1+     2
D2     South Africa and UK               7      Yes     No      2+     2
D3     Fort Hare                         38     –       –       Never  –
D4     Venda                             4      Yes     No      Never  –
E1     University                        7      No      –       2+     2
E2     Potchefstroom                     12     No      No      1      –
E3     College                           4      No      –       1      –
E4     Rhodes                            10     No      No      1      –
F1     Exeter                            28     Yes     Yes     Never  –
F2     OFS, UCT, Cambridge               5      Yes     Yes     1      4
F3     Lancaster                         20     Yes     Yes     2+     2
F4     Bangalore, UK                     11     –       –       2+     –
(a) If the answer to J(i) is `No', then no answer is required for J(ii).
(b) It is odd that this rater and the next two would have no problem if they discovered significant differences in the ratings they gave the same student.
References

Aitchison, J., 1987. Words in the Mind: An Introduction to the Mental Lexicon. Blackwell, Oxford.
Alderson, J.C., 1981. Report of the discussion on communicative language testing. In: Alderson, J.C., Hughes, A. (Eds.), Issues in Language Testing: ELT Documents III. The British Council, London.
Alderson, J.C., 1983. Who needs jam? In: Hughes, A., Porter, D. (Eds.), Current Developments in Language Testing. Academic Press, London.
Alderson, J.C., Clapham, C., 1992. Applied linguistics and language testing: a case study of the ELTS test. Applied Linguistics 13, 149–167.
Alderson, J.C., Clapham, C., Wall, D., 1995. Language Test Construction and Evaluation. Cambridge University Press, Cambridge.
Bachman, L.F., Clark, J.L.D., 1987. The measurement of foreign/second language proficiency. American Academy of Political and Social Science Annals 490, 20–33.
Bereiter, C., Scardamalia, M., 1983. Does learning to write have to be so difficult? In: Freedman, A., Pringle, I., Yalden, J. (Eds.), Learning to Write: First Language/Second Language. Longman, London.
Bley-Vroman, R., 1990. The logical problem of foreign language learning. Linguistic Analysis 20, 3–49.
Bock, M., 1998. Teaching grammar in context. In: Angélil-Carter, S. (Ed.), Access to Success: Literacy in Academic Contexts. University of Cape Town Press, Cape Town.
Bradbury, J., Damerell, C., Jackson, F., Searle, R., 1990. ESL issues arising from the ``Teach–test–teach'' programme. In: Chick, K. (Ed.), Searching for Relevance: Contextual Issues in Applied Linguistics. South African Applied Linguistics Association (SAALA), Johannesburg.
Brown, A., 1995. The effect of rater variables in the development of an occupation-specific language performance test. Language Testing 12, 1–15.
Coe, R.M., 1987. An apology for form; or, who took the form out of the process. College English 49, 13–28.
Cziko, G.A., 1982. Improving the psychometric, criterion-referenced, and practical qualities of integrative testing. TESOL Quarterly 16, 367–379.
Cziko, G.A., 1984. Some problems with empirically-based models of communicative competence. Applied Linguistics 5, 23–37.
Davies, A., 1990. Principles of Language Testing. Blackwell, Oxford.
Douglas, D., 1995. Developments in language testing. Annual Review of Applied Linguistics 15, 167–187.
Dreyer, C., 1998. Teacher–student style wars in South Africa: the silent battle. System 26, 115–126.
Ebel, R.L., Frisbie, D.A., 1991. Essentials of Educational Measurement, 5th Edition. Prentice Hall, Englewood Cliffs, NJ.
Gamaroff, R., 1996. Workshop on quantitative measurement in language testing. National Association of Educators of Teachers of English (NAETE) Conference, East London Teacher's Centre, South Africa, September.
Gamaroff, R., 1997. ``Old paradigm'' language proficiency tests as predictors of long-term academic achievement. Per Linguam 13, 1–22.
Gamaroff, R., 1998a. Language, content and skills in the testing of English for academic purposes. South African Journal of Higher Education 12, 109–116.
Gamaroff, R., 1998b. Cloze tests as predictors of global language proficiency: a statistical analysis. South African Journal of Linguistics 16, 7–15.
Hartog, P., Rhodes, E.C., 1936. The Marks of Examiners. Macmillan, New York.
Ingram, E., 1985. Assessing proficiency: an overview on some aspects of testing. In: Hyltenstam, K., Pienemann, M. (Eds.), Modelling and Assessing Second Language Acquisition. Multilingual Matters, Clevedon, Avon.
Kaczmarek, C.M., 1980. Scoring and rating essay tasks. In: Oller Jr., J.W., Perkins, K. (Eds.), Research in Language Testing. Newbury House, Rowley, Massachusetts.
Lado, R., 1961. Language Testing. McGraw-Hill, New York.
Moore, R., 1998. How science educators construe student writing. In: Angélil-Carter, S. (Ed.), Access to Success: Literacy in Academic Contexts. University of Cape Town Press, Cape Town.
Moss, P., 1994. Can there be validity without reliability? Educational Researcher 23, 5–12.