
System 28 (2000) 31–53

www.elsevier.com/locate/system

Rater reliability in language assessment: the bug of all bears

Raphael Gamaroff*

University of Fort Hare, Private Bag X1314, Alice, 5700, South Africa

* Address for correspondence: 51 Kings Road, King William's Town 5601, South Africa. E-mail address: raphgam@border.co.za (R. Gamaroff).

Received 20 September 1998; received in revised form 30 March 1999; accepted 15 April 1999

Abstract

A major problem in essay assessment is how to achieve an overall reliable score based on the judgements of specific criteria such as topic relevance and grammatical accuracy. To investigate this question, the writer conducted a workshop on interrater reliability at a conference of the National Association of Educators of Teachers of English (NAETE) in South Africa, where a group of 24 experienced educators of teachers of English were asked to assess two Grade 7 English essay protocols. The results revealed substantial variability in the attention that raters paid to different criteria, ranging from penalising students for spelling and/or grammatical errors to glossing over these criteria and considering mainly content. To overcome the problem of rater variability, some researchers recommend that more than one rater be used. The problem is that in the teaching situation there is rarely more than one rater available, and that rater is usually the teacher of the subject. The advantages and disadvantages of using a single rater and more than one rater are examined. Whether one uses one rater or several, without the quest for some kind of objective standard of what is, for example, (good) grammar and (good) spelling, and without agreement on what importance to attach to particular criteria, there cannot be much reliability. © 2000 Published by Elsevier Science Ltd. All rights reserved.

Keywords: Reliability; Validity; Interrater reliability; Errors; Equivalent scores; Equivalent judgements; Subjectivity; Moderation; Levels of proficiency

1. Introduction

Although we may no longer stand before an "abyss of ignorance" (Alderson, 1983, p. 90) and may be able to agree that language testing has "come of age" (Alderson cited in Douglas, 1995, p. 176), there are still many bugbears. This article
focuses on rater reliability in language testing. Rater reliability, which is arguably
the greatest bugbear in assessment (Moss, 1994), is concerned with reconciling
authentic subjectivity and objective precision.
Rater reliability is particularly important in 'subjective' tests such as essay tests, where there exist fluctuations in judgements (1) between different raters, which is the concern of interrater reliability, and (2) within the same rater, which is the concern of intrarater reliability. This article focuses on interrater reliability. Interrater reliability involves two major kinds of judgements: (1) the order of priority that individual raters give to performance criteria (criteria such as grammatical accuracy, content relevance and spelling) and (2) the agreement between raters on the ratings that should be awarded if or when agreement is reached on what importance to attach to different criteria.
Raters may give equivalent scores, but this does not necessarily mean that these scores represent what they are supposed to measure, i.e. that the (purpose of the) test is valid. To illustrate: if all raters of an essay believe that spelling should be heavily penalised and, accordingly, give equivalent scores in terms of spelling, the interrater reliability would be high. The question, however, is whether spelling should be the most important criterion. Alternatively, raters may differ in the importance they attach to different criteria. So, similar scores between raters do not necessarily mean similar judgements, and different scores between raters do not necessarily mean different judgements.
In previous research on interrater reliability (Gamaroff, 1998a) I compared the assessments of lecturers of English for academic purposes (EAP) and science lecturers on first-year university student essays. These students were Tswana-mother-tongue speakers. The topic was the 'Greenhouse Effect'. Comparisons were firstly made within the EAP group of lecturers and within the science group of lecturers, and secondly between the two groups of EAP and science lecturers. The findings showed a wide range of scores and judgements within and between groups. In this article I report the research on interrater reliability that was based on a workshop on language assessment that I conducted at a conference of the National Association of Educators of Teachers of English (NAETE) (Gamaroff, 1996). These educators taught at different universities, technikons and colleges of education in South Africa.

2. Literature review

Reliability (whether interrater or intrarater reliability) in essay writing has often been analysed as a statistical concept (Hartog and Rhodes, 1936, p. 15; Pilliner, 1968, p. 27; Oller, 1979, pp. 393–394; Kaczmarek, 1980, pp. 156–159), but what also requires analysis is the relationship between scores and judgements. An analysis of judgements provides information on the rating processes used to judge the language processing of learners, where the end of the processing journey in a test is manifested in the final product called a protocol, or script.
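Treated statistically, interrater reliability is usually indexed by how closely raters' scores agree or correlate. As a minimal sketch of such an index (illustrative only: the scores are invented and a plain Pearson correlation is assumed, not a procedure taken from this article):

def pearson_r(x, y):
    """Pearson correlation between two raters' score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical scores given by two raters to the same six protocols
rater_1 = [4, 5, 3, 6, 5, 7]
rater_2 = [5, 5, 4, 6, 4, 7]
print(round(pearson_r(rater_1, rater_2), 2))

A high value of such an index says nothing, of course, about whether the two raters weighted the same criteria in the same way, which is precisely the relationship between scores and judgements at issue here.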
The process of essay writing is "probably the most complex constructive act that most human beings are ever expected to perform" (Bereiter and Scardamalia, 1983, p. 20). If getting the better of words in writing is usually a very hard struggle for mother-tongue speakers, the difficulties are multiplied for the second-language learner (Widdowson, 1983, p. 34).
The writing process involves the "pragmatic mapping" of linguistic structures into extralinguistic context (Oller, 1979, p. 61). This mapping ability subsumes global comprehension of a passage, inferential ability, perception of causal relationships and deducing the meaning of words from contexts. All these factors mesh together to form a network of vast complexity, which makes objective assessment of essay performance very difficult. It is this vast complexity that makes written discourse, or essay writing, the most 'pragmatic' of writing tasks and the main goal of formal education.
Because the production of linguistic sequences in essay writing is not highly constrained, problems arise in scoring when inferential judgements have to be converted to a score. The question is "[h]ow can essays or other writing tasks be converted to numbers that will yield meaningful variance between learners?". Oller (1979, p. 385) argues that these inferential judgements should be based on intended meaning and not merely on correct structural forms. That is why, in essay assessment, raters should rewrite (in their minds, but preferably on paper) the intended meaning. Perhaps one can only have an absolutely objective scoring system with lower-order skills; however, Oller is not claiming that his scoring system is absolutely objective, but only that, as far as psychometric measurement goes, it is a sensible method for assessing an individual's level within a group, irrespective of the actual scores of the individuals in the group (Oller, 1979, pp. 393–394). It has been recommended, however, that equivalence in scores between raters also be taken into account in the assessment of test reliability (Cziko, 1984; Ebel and Frisbie, 1991). Another problem in essay assessment is how to achieve a reliable overall score based on the judgements of specific criteria.
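This aggregation problem can be made concrete. Even if raters agree exactly on the criterion-level judgements, the overall scores diverge as soon as they weight the criteria differently. A small sketch with hypothetical marks and weights (not taken from the article):

# Criterion marks (on a 0-9 scale) that two raters happen to agree on exactly
criterion_marks = {"content": 6, "grammar": 4, "spelling": 2}

# Each rater's hypothetical weighting of the criteria (weights sum to 1.0)
weights_rater_a = {"content": 0.6, "grammar": 0.3, "spelling": 0.1}
weights_rater_b = {"content": 0.2, "grammar": 0.3, "spelling": 0.5}

def overall(marks, weights):
    # The overall essay score is the weighted sum of the criterion marks
    return sum(marks[c] * weights[c] for c in marks)

print("Rater A:", round(overall(criterion_marks, weights_rater_a), 2))  # 5.0
print("Rater B:", round(overall(criterion_marks, weights_rater_b), 2))  # 3.4

Identical judgements, in other words, can yield a satisfactory mark from one rater and a poor one from another purely through weighting.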
When one marks an essay, one can only do so through its structure. The paradox of language is that structure must 'die' so that meaning may live. Yet, if structure were not preserved, language would not be able to mean. The German term aufheben (sublation) means both 'to clear away' and 'to preserve': the simultaneous preservation and transcendence of the form/function antithesis. Language structure has to be cleared away and preserved in order to convey its meaning (Coe, 1987, p. 13). By the same token, our analytic criteria of assessment must be cleared away and preserved in order to assess the total effectiveness of a piece of writing. Some of these analytic criteria are morphology, phonology and spelling.
Confusions may arise between these three criteria. According to Oller (1979, p. 279), errors of judgement in distinguishing between spelling, on the one hand, and morphology and phonology, on the other, are not substantial enough to affect reliability. Ingram (1985, p. 244), however, disagrees and maintains that "it is often a matter of judgement whether, for example, an error is merely spelling (to be disregarded) or phonological or grammatical" (see Alderson et al., 1995, p. 46, for a similar view). Cziko (1982) believes that the subjective judgement involved in deciding what counts as a spelling error can "adversely" affect reliability. The implication in Ingram and Cziko is that judgements on these matters would vary significantly between scorers/raters.
Oller (1979, p. 392) maintains that "judges always seem to be evaluating communicative effectiveness regardless whether they are trying to gauge 'fluency', 'accentedness', 'nativeness', 'grammar', 'vocabulary', 'content', 'comprehension', or whatever". It is arguable whether judges always seem to be evaluating communicative effectiveness. Although it seems reasonable that in essays one should be looking at the overall impact of a piece of writing (the whole), and that the only way to do this is to look at the various aspects of the writing such as those mentioned by Oller above, it is questionable whether the general tendency is to regard communicative effectiveness as the overarching criterion. The data to be presented indicate a wide range of opinion on this issue. In spite of discussions and workshops on establishing common criteria such as content relevance and grammatical accuracy, there remain large differences in the relative weight that raters attach to different criteria (Santos, 1988; Bradbury et al., 1990; Alderson and Clapham, 1992; Brown, 1995). This problem is not surprising, because language is closely connected to human rationalities, imaginations, motivations and desires, which, because they each comprise an extremely complex network of biological, cognitive, cultural and educational factors, could easily compromise the quest for objectivity.

3. Method

The essay test used in the workshop was part of a battery of English proficiency tests, which included a cloze test, a dictation test, an error recognition test and a mixed grammar test, given to Grade 7 (12-year-old) entrants to Mmabatho High School (MHS) in the North West Province of South Africa in 1987, where I was a teacher from 1980 to 1987 (Gamaroff, 1997, 1998b).
The entrants consisted of L1 and L2 speakers of English, where these labels refer to those who took English as a first-language taught subject and as a second-language taught subject, respectively. All entrants in the L2 group were Bantu-mother-tongue speakers, mostly Tswanas. The L1 group consisted of a mixture of Tswana-, English- and Afrikaans-mother-tongue speakers, and also some whose mother tongue was difficult to identify because they came from a background where several languages or a hybrid of languages were used in the home, e.g. Afrikaans and English, or Tswana and English (these were South Africans); Tagalog and English, or Tamil and English (these were expatriates).
The L2 group and some of the L1 group came from DET (Department of Education and Training) schools, where the medium of instruction was English from Grade 5, while the majority of the L1 group came from Connie Minchin Primary School in the North West Province of South Africa, which was the main feeder school of L1 entrants to Mmabatho High School and where English was the medium of instruction from Grade 1. (The DET was the South African education department in charge of black education up to the democratic elections of 1994. It is now defunct.)
Three raters, who were also the Grade 7 teachers at MHS, and I were involved in the administration and marking of the essay test. Owing to practical obstacles, such as the limited time that these teachers could devote to the marking of the tests, they did not provide judgements on specific criteria but merely gave a score based on global impressions. The average scores of the four raters were informative from a norm-referenced point of view, because they distinguished well between weak and strong learners, but they could not show the relationship between scores and judgements because the raters did not provide judgements. (More about the reliability issue of averaging the scores of raters is provided later.)
Although I was unable to obtain the judgements of these MHS raters (except my own, of course), I compensate for this in this article by providing the judgements (and scores) of a group of 24 educators of teachers of English. As mentioned in the introduction, this research on interrater reliability was based on a workshop on language assessment that I conducted at a conference of NAETE (Gamaroff, 1996). These educators taught at different universities, technikons and colleges of education in South Africa.
The following procedures were followed concerning the data collected from the NAETE workshop: (1) a comparison between individual raters' scores, (2) a comparison between the average scores of six groups of raters, four in a group, and (3) an examination of the relationship between the judgements and scores of individual raters.
There were originally 27 participants in the NAETE workshop. These were divided into six groups of four or five in each group: Groups A–F. Only four raters in each group were used, because the average score of any reasonably competent four raters has been found to be reliable, the rationale being that the problems of subjective judgements will be neutralised by using the average of four judges. According to Alderson (1981, p. 61), "[t]here is considerable evidence to show that any four judges, who may disagree with each other, will agree as a group with any other four judges of a performance". Consequently, I excluded three of the 27 raters from the computations; these were designated by the number 5 in their three respective groups: raters C5, D5 and E5. I do, however, refer to a judgement of E5 because of its relevance.
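As an illustration of this averaging procedure, the sketch below takes hypothetical scores (not the workshop data) for the four retained raters in each of two groups and compares the group means, which are then treated as the reliable estimates:

# Hypothetical scores awarded to one protocol by the four retained raters in each group
groups = {
    "Group A": [4, 5, 4, 5],
    "Group B": [6, 5, 5, 5],
}

for name, scores in groups.items():
    assert len(scores) == 4, "only four raters per group are used"
    # The group average, rather than any single rater's score, is reported
    print(f"{name}: mean score = {sum(scores) / len(scores):.2f}")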
Raters were asked to assess two essay protocols: one from the MHS L2 group (Protocol 1) and one from the MHS L1 group (Protocol 2). Protocol 2 was chosen at random, while Protocol 1 was chosen because of its interesting spelling errors: I wanted to examine the attention raters paid to these highly visible errors. The essay question consisted of a choice between three topics: describe how to (1) clean a pair of shoes, (2) make a cup of tea or (3) cover a book. The content of these topics should be much easier to assess than the controversial topic of the Greenhouse Effect, which was the topic in the previous research mentioned above (Gamaroff, 1998a). The protocols (scripts) are now presented. Protocol 1 belongs to an L2 learner at MHS, who was a Tswana-mother-tongue speaker and took English as a second language at MHS.
36 R. Gamaro€ / System 28 (2000) 31±53

Protocol 1 (Grade 7 L2 learner)

How a school book is covered

If you cover a book you need several things such as a brown cover, a plastic cover
and selotape ect. First you open your couver and put the book on the corver.
You folled the cover onto the book and cut it with the sicor and folled it again.
You stick the cover with the selotape so that it mast not come out of the book.
Same aplies to when you cover with a plastic cover. Then you book is corved well.

Protocol 2 (L1) below belongs to a Sri Lankan of expatriate parents. (Recall that the labels 'L1 learner' and 'L2 learner' at MHS refer to learners who took English as a first-language taught subject or as a second-language taught subject.)

Protocol 2 (Grade 7 L1 learner)

How a school book is covered

You need a roll of paper cover or plastic cover, A pair of scissors some sellotape. You put the book on the paper or Plastic and cut the length it is better if
about 5 cm of cover was left from the book. You cut it into strips You fold the
cover over the book. You then put strip of sellotape to keep them down. Then
you put plasitic paper over it and stick it down. Then you can put your name
and standard.

Participants in the workshop were requested to (1) work individually, (2) spend about one and a half minutes on each protocol, (3) give an impressionistic score based on criteria such as topic relevance, content and grammatical accuracy, plus any other criteria they wanted to mention, and (4) give reasons for their judgements on the criteria they specified. I did not specify the criterion of 'spelling'. I mention this because many participants gave prominence to spelling errors. Raters were explicitly asked, however, to take into account the criteria of 'topic relevance', 'content' and 'grammar'. Most of the raters did not distinguish between topic relevance and content, so I subsumed the two criteria under content.

4. Results

Figs. 1 and 2 show the frequency distribution of the individual scores awarded by the 24 raters for Protocol 1 (L2) and Protocol 2 (L1), respectively. A nine-point scale was used: 0–1 point = totally incomprehensible; 2 points = hardly readable; 3 points = very poor; 4 points = poor; 5 points = satisfactory; 6 points = good; 7 points = very good; 8 points = excellent; 9 points = outstanding. Ratings can refer to a numerical scale or to verbal judgements; I shall therefore refer to scores and judgements, and not to ratings.
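For concreteness, a frequency distribution of this kind can be tallied directly from the raw scores. The sketch below uses invented scores on the scale just described (not the actual workshop data):

from collections import Counter

SCALE = {0: "totally incomprehensible", 1: "totally incomprehensible",
         2: "hardly readable", 3: "very poor", 4: "poor", 5: "satisfactory",
         6: "good", 7: "very good", 8: "excellent", 9: "outstanding"}

# Invented scores from 24 raters for one protocol
scores = [3, 4, 4, 5, 5, 5, 5, 5, 5, 5, 4, 4, 6, 6, 5, 4, 3, 3, 4, 5, 6, 5, 6, 7]

frequencies = Counter(scores)
for point in sorted(frequencies):
    print(f"{point} ({SCALE[point]}): {frequencies[point]} raters")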
Fig. 1. Frequency distribution of the scores awarded by the 24 raters on Protocol 1 (L2).

Fig. 2. Frequency distribution of the scores awarded by the 24 raters on Protocol 2 (L1).

Although Protocol 2 in Fig. 2 has a wider range of scores (3–8) than Protocol 1 in Fig. 1 (3–7), there is far more variability between raters on Protocol 1. Table 1 below provides the average score for each of the six groups of raters, Groups A–F. Also included in Table 1 is the average score of the four raters at MHS who were involved in the original test battery. These scores appear after Group F.

I did not expect the scores of Groups A–F for Protocol 1 (L2) to be higher than those for Protocol 2, because I judged Protocol 2 (L1) to be better. In the original research at MHS, I awarded, in my capacity as one of the raters, a score of 4 for Protocol 1 and a score of 6 for Protocol 2.
Figs. 3 and 4 show a comparison between the percentages of negative judgements, no judgements and positive judgements on the three criteria of content, grammar and spelling for Protocol 1 (Fig. 3) and Protocol 2 (Fig. 4). Figs. 5 and 6 compare the negative judgements of the 'EL1' and 'EL2' raters. (Percentages are calculated to the nearest whole number.)
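The percentage breakdown in Figs. 3 and 4 is simply the proportion of the 24 raters falling into each judgement category for a given criterion. A sketch of the calculation, using invented judgement labels rather than the actual ratings:

# Invented judgements by 24 raters on one criterion (e.g. spelling):
# "+" = positive, "-" = negative, None = no judgement given
judgements = ["-"] * 13 + ["+"] * 4 + [None] * 7

total = len(judgements)
for label, symbol in [("negative", "-"), ("positive", "+"), ("no judgement", None)]:
    count = judgements.count(symbol)
    # Percentages are rounded to the nearest whole number, as in the figures
    print(f"{label}: {round(100 * count / total)}%")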
Table 1
NAETE workshop and MHS: average scores on Protocols 1 and 2 of groups of raters (a)

Groups of raters    Protocol 1 (L2)    Protocol 2 (L1)
Group A             4.5                4.3
Group B             5.3                4.0
Group C             4.8                4.3
Group D             5.0                4.5
Group E             4.8                5.8
Group F             4.3                5.0
MHS                 3.5                5.5

(a) NAETE, National Association of Educators of Teachers of English; MHS, Mmabatho High School.

Fig. 3. Percentage positive judgements, no judgements and negative judgements for Protocol 1 (L2).

Fig. 4. Percentage positive judgements, no judgements and negative judgements for Protocol 2 (L1).

Fig. 5. Percentage negative judgements for Protocol 1 (L2).

Fig. 6. Percentage negative judgements for Protocol 2 (L1).

In Figs. 5 and 6 above, 'EL1' refers to raters who use English as a first language (i.e. the language they know best), and 'EL2' refers to raters who use English as a second language. If raters had two languages that they claimed to know equally well (see Tables A1 and A2, column 2 of the Appendix), they were categorised as EL1 speakers. It is important to keep the following distinction in mind: the labels L1 and L2 that are used to refer to the protocols refer to learners who took English as a first- or second-language taught subject. Thus, L1 in the general context of MHS is not identical to the language a person knows best. For example, the L1 protocol belonged to a Sri Lankan expatriate whose mother tongue was Tamil, but who claimed to know English well enough to describe himself as an EL1 speaker, i.e. English was the language he claimed to know best. The label EL1, which I have used for the NAETE raters, is used in the sense of English as the language that one knows best.

Figs. 5 and 6 focus on the negative judgements, where the EL1 group is compared with the EL2 group.

5. Discussion of the results

If no judgement was given on a particular criterion, I assumed that the judgement for the unmentioned criterion was not negative or that the errors were not serious enough to warrant a specific mention. The "no judgements" category (Figs. 3 and 4) is just as revealing as the positive and negative judgements, for if one rater does not pay attention to spelling, for example, and another does, this could have a significant effect on the score, and could mean the difference between a pass and a fail.
With regard to Figs. 5 and 6, there were 16 EL1 participants/raters but only eight EL2 participants/raters. This proportion of EL1 to EL2 educators of teachers of English is not indicative of South Africa as a whole, because there are far more EL2 educators of English in South Africa than EL1 ones. I do not have precise statistics on this matter, but the fact is clear from the demography of South Africa. The reasons why the NAETE conference of 1996 had this unrepresentative proportion of EL1 (mostly white) and EL2 (black) raters were possibly (1) that the conference was held in the Eastern Cape, where there are fewer tertiary institutions catering for black student teachers of English than in areas such as Gauteng (the Johannesburg–Pretoria region) or the Western Cape, or (2) a lack of conference/workshop funding from the historically black tertiary institutions.

5.1. Protocol 1 (L2) (Fig. 5)

Protocol 1 (Fig. 5) shows that 33% of all the raters (EL1 + EL2) gave negative comments on content and on grammar, while 54% considered spelling to be a problem. There was a substantial difference between the negative judgements of EL1 and EL2 on grammar (19 and 63%, respectively) and on spelling (69 and 25%, respectively), where the judgements of EL1 are almost the reverse of those of EL2: what EL1 considers to be grammatical errors, EL2 considers to be spelling errors (see Tables A1 and A2 in the Appendix for individual judgements). It would have been interesting to find out which errors were considered to be spelling errors and which grammatical errors. For example, consider the highlighted errors in Protocol 1, which is repeated below for easy reference:

If you cover a book you need several things such as a brown cover, a plastic
cover and selotape ect. First you open your couver and put the book on the
corver. You folled the cover onto the book and cut it with the sicor and folled it
again. You stick the cover with the selotape so that it mast not come out of the
book. Same aplies to when you cover with a plastic cover. Then you book is
corved well.

I judged the two italicised errors *folled and *aplies to be spelling errors and *mast to be a phonological error. The other deviant forms are more difficult to specify. Are the different deviant forms of 'cover' to be labelled as spelling or phonological errors? Compare these forms with the following deviant phonological forms from Oller (1979, p. 279):
rope – *robe
expected – *espected
ranch – *ransh
something – *somsing

The deviant forms of 'cover' in Protocol 1 could be (interlanguage?) variations on a phonological theme. Thus, the deviant forms of 'cover' need not be spelling errors but phonological errors.

The underlined error *you in the last line could be a morphosyntactic error (the possessive 'your' is required), a spelling error or a phonological error. For example, black learners often omit the 'y' in words such as 'they'. This error is hardly likely to be a morphological error because (1) black Grade 7 learners generally know that these words belong to distinct syntactic categories, 'the' (article) and 'they' (pronoun), even if they do not know the names for these categories, and (2) 'the' and 'they' have dissimilar pronunciations among many Bantu speakers, whereas 'you' and 'your' have similar pronunciations among many Bantu speakers. Therefore *you could be a spelling slip.

5.2. Protocol 2 (L1) (Fig. 6)

In Protocol 2 there are hardly any deviant forms, and thus little possibility of confusing spelling errors with grammatical errors. Only one rater (an EL2 rater) mentions spelling errors. Most of the errors in Protocol 2 are punctuation errors, which are judged to be 'grammatical' errors by most in the EL1 and EL2 groups. Protocol 2 is repeated for easy reference:

You need a roll of paper cover or plastic cover, A pair of scissors some sellotape. You put the book on the paper or Plastic and cut the length it is better if
about 5 cm of cover was left from the book. You cut it into strips You fold the
cover over the book. You then put strip of sellotape to keep them down. Then
you put plasitic paper over it and stick it down. Then you can put your name
and standard.

The punctuation errors are serious, while "left from the book" and "cut into strips" affect the coherence to a certain extent. The pronoun "them" (in bold) does not seem to be a grammatical error, because it agrees with "strips" in the previous sentence (not with "strip" in the same sentence). There seems to be one grammatical error, namely the missing 'a' between "put" and "strip" in the third-last sentence, but there are no spelling errors. There was a substantial difference in negative judgements between EL1 and EL2 on content: 63 and 38%, respectively (Fig. 6). The overall picture for Protocol 2, as far as content and grammar are concerned, is that 54% of the raters were negative about content and 42% were negative about grammar.

6. The relationship between scores and judgements

Consider the relationship between individual scores and judgements. Similar scores between raters do not necessarily mean similar judgements, and different scores between raters do not necessarily mean different judgements. Examples are provided from Protocols 1 and 2. In Protocol 1 the following judgements went together with the same scores (the judgements of all the raters for Protocols 1 and 2 are found in Tables A1 and A2, respectively, of the Appendix): a score of 3 for one rater represented "meaningless cloudy" (Rater C1), and for another rater the same score of 3 represented "misspelled many words but not to bad" (Rater E5; this rater was excluded from the main analysis because he/she was the fifth member of Group E, which had been reduced to four in a group). Many of the misspelled words in Protocol 1 were deviant forms of the one word 'cover'. A score of 5 for C4 represented "Topic deviates. Content sequence satisfactory. Major grammatical. Errors detracts from coherence", but for D1 the same score represented "Only one great fault is spelling, quite distracting". D3, who awarded a score of 6, states: "This learner belongs to an elite group".
Consider the following examples from Protocol 2. E2, who awarded a score of 5, said "General reluctance to give extremely high or low marks". E2's score for Protocol 1 was 7, which seems to contradict the reluctance to give extremely high or low marks, unless a score of 7 is not an "extremely" high mark in E2's eyes. If so, one does not know what to make of E2's remark that a score of 5, which E2 gave for Protocol 2, steers a middle path between an "extremely low" and an "extremely high" score. E2 has a point about "playing safe": it is safer to give an average score than to fail the learner or give a high score. One hugs the safe side of justice.
A few other examples from Protocol 2: some raters attached more importance than others to the segment "cut into strips". Consider the remarks of the following raters, all of which contained the phrase "cut into strips". These raters all awarded a score of 5, commented only on content, and were all EL1 speakers:

D1: Less accurate. Difficult to understand "cut into strips".
E2: Unclear explanation. Cut what into strips?
F1: Fairly clear, except for "cut it into strips".
F2: Left out important details such as opening the book; "cut into strips" is confusing.
D2: Cohesion bad, e.g. "cut it into strips", but fairly coherent, not too many errors.

D1, E2 and F2 made a big issue of "cut into strips", which in their eyes made the content inadequate, while F1 and D2 made overall positive comments on content. F1's comment seems the most reasonable, because the fact that "cut into strips" is not in the correct sequence does not have a significant effect on coherence: when one reads the sentence that follows this segment, it seems quite clear that the writer is talking about cutting the sellotape into strips, not the paper used to cover the book, nor the book itself! Perhaps "cut the flaps" is what the writer meant by "cut into strips". D2 calls "cut it into strips" a "cohesion" error (D2 has underlined "it"). The problem is indeed one of cohesion, which in turn affects coherence. (It is the lack of coherence that enables one to recognise the cohesion problem.) What this analysis reveals is that it is not so easy to describe how a book is covered. Most young children and adults alike can cover books, but both children and adults might not find it so easy to describe, even in their mother tongue, how to cover one.
One may argue that, because there are no data on which words in the protocols individual raters judged to be spelling or grammatical errors, there is no reason to believe that my judgements would be better than anyone else's. I suppose, though, that some judgements must be better than others: some raters must be wrong and others right, or else it is all a matter of interpretative variations on a poststructuralist theme. My judgements aside, the research is still useful because it shows that many of the raters in this investigation cannot agree on what is spelling and what is grammar.

7. Moderation workshops

The differences between the NAETE raters, as shown in Figs. 3–6, are worrisome, even more so when compared with their answers to the questions on moderation that were given in the questionnaire at the NAETE workshop (see Table A3 of the Appendix).

In the questionnaire, 14 of the 24 participants stated that in their workplace they never found any significant difference between their ratings and those of their colleagues. Of the seven raters who said that they did find significant differences in the workplace, only four found this a problem. As far as participation in moderation workshops was concerned, seven of the 24 stated that they had never participated in a moderation workshop. Of the 17 remaining raters, 11 commented on whether these moderation workshops resulted in any improvement. Three of these 11 raters said that there was a great improvement, six said that there was a fair improvement, one said that there was a negligible improvement, and one said that there was no noticeable improvement.

8. Implications

I mentioned in the introduction that interrater reliability involves two major kinds of judgements: (1) the order of priority that individual raters give to performance criteria and (2) the agreement between raters on the ratings that should be awarded. This is also a construct validity issue. Both construct validity and interrater reliability should be concerned with what scores represent. For example, if raters give a similar low score, but for completely different reasons, e.g. because (1) the spelling was bad, (2) the grammar was bad or (3) the writer was off the topic, the scores would not be valid, because there would be no agreement on the purpose of the test.

A test is said to be used for a valid purpose when the tester knows what is being
tested. However, if testers cannot agree on what that what is, i.e. if there is no
interrater reliability, there can be no validity. So, validity and reliability are two
sides of the same corner. You cannot go round the one side without bumping into
the other.
To clarify a possible confusion between rater reliability and concurrent validity: rater reliability has to do with the consistency between raters' judgements on one test method, e.g. an essay test. Concurrent validity, in contrast, has to do with the correlation between two or more different test methods, e.g. a dictation test and an essay test.
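A rough computational way to see the distinction (the figures are invented, and using a plain Pearson correlation for both quantities is an assumption made here for illustration, not a procedure taken from the article):

from statistics import correlation  # Python 3.10+

# Invented scores for six test-takers
essay_rater_a = [5, 6, 4, 7, 5, 3]  # essays marked by rater A
essay_rater_b = [5, 5, 4, 6, 5, 4]  # the same essays marked by rater B
dictation = [6, 7, 4, 6, 5, 3]      # a different test method, a dictation test

# Interrater reliability: agreement between two raters on one and the same method
print("interrater reliability:", round(correlation(essay_rater_a, essay_rater_b), 2))

# Concurrent validity: correlation between two different test methods
print("concurrent validity:", round(correlation(essay_rater_a, dictation), 2))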
The variability in the attention that raters pay to different criteria is a general problem in all kinds of educational institutions, where "lecturers [or teachers] vary from penalising students heavily for mechanical and grammatical errors to ignoring the linguistic surface and marking on content and organisation" (Bock, 1998, p. 53). There are different learning styles, teaching styles and also different rating styles. One rater, as indeed one learner or one teacher, may be mainly interested in the big picture, i.e. in coherence, while another may be mainly interested in systematicity and structure. Moderation workshops, in my experience, do not seem to be able to bring about an effective truce in these 'style wars' (Oxford et al., 1991; Dreyer, 1998).
With regard to the level of English proficiency of raters, it does not follow that, because a rater (or anybody else) is not a mother-tongue speaker (of English in this case), his or her English proficiency is necessarily lower than that of a mother-tongue speaker of English. Many non-mother-tongue English speakers have a higher level of academic English proficiency than mother-tongue English speakers. A major reason for this is not a linguistic one but that these non-mother-tongue speakers are more academically able, i.e. they have better problem-solving abilities and abilities for learning and, in the case of raters, for assessment (Vollmer, 1983, p. 22; Bley-Vroman, 1990).
In the research situation, it is possible to have more than one rater, even four. Four raters would be a rare luxury outside a research situation. Most testing situations are not research situations but teaching situations, where often only one rater is available, and where moderation workshops are held seldom (usually only once) or never.

One may argue that the reason that teachers/lecturers hold moderation workshops seldom or not at all is that, as many of the NAETE participants said, they did not find any significant difference between their ratings and those of their colleagues in their respective workplaces. In educational institutions, especially tertiary institutions, there is, however, a large turnover of personnel. Thus, if one has had about 10 years of experience, one should have had more than one workshop on moderation, because one would generally have worked at more than one institution.
One may argue further that the reason for the differences between the NAETE raters was that they had not come together previously to discuss the protocols that they were asked to judge individually in the conference workshop. I would imagine, however, that educators of English teachers, even if they did not confer beforehand on assessment procedures, should nevertheless be in gross agreement on whether the grammar, content or spelling of a protocol on such a simple topic with such simple structures was good or bad. The fact that (1) they did not agree on these fundamentals, (2) many of them said in the questionnaire that they had little disagreement with their colleagues, and (3) the majority held few or no moderation workshops, reveals an unsatisfactory situation. The big question is how to deal with the problem in the usual education situation where there is only one rater.
The crucial issue, according to Oller (1979, p. 279), is not the difficulty that a rater has in deciding how to categorise errors, but that one rater's idea of how to categorise errors differs from another's. If different interpretations of what is a spelling error and what is a grammatical error affect the reliability, the use of one rater would, according to Oller (1979), ensure more consistency in judgements when problematic items need to be distinguished within these three categories. Oller (1979, p. 279) maintains that one rater is justified on the grounds that:

...there are cases where it is difficult to decide whether an error is really a spelling problem or is indicative of some other difficulty besides mere spelling. In such cases, for instance, 'pleshure', 'teast' for 'taste', 'ridding' for 'riding', 'fainaly' for 'finally', 'moust' for 'must', 'whit' for 'with' and similar instances, perhaps it is best to be consistently lenient or consistently stringent in scoring. In either case, it is a matter of judgement for the scorer.

Oller's point is that, because it is difficult to get raters to agree, one should do the next best thing and try to agree with oneself (intrarater reliability). If one takes into account the gargantuan problems of rater subjectivity, it may be better to use one rater to mark a specific group of test-takers rather than several raters, so that if we cannot improve interrater consistency to any significant extent, we can at least try to make sure that the same person marks all the protocols of the group he or she teaches. But then, as we know, we cannot be sure that the rater will not mark differently before breakfast (a good or a bad one) than after.
Raters are in danger of following a circular route to control what is very difficult or perhaps impossible to control, namely, subjectivity (Davies, 1990, p. 4). The problem is very similar to the problem of finding the 'best test'. If the construct validity of one test always depends on the validity of another, there cannot exist any one test that stands by itself, such as an equivalent of a 'Prime Mover'. Lado's (1961, p. 324) solution is to compare all tests in terms of "some other criterion whose validity is self-evident, e.g. the actual use of the language". The question is: what is self-evident? Is there a self-evident test that pre-exists all other tests? There is not, because "the buttressing validity of an external criterion is often neither definable nor, when found, reliable" (Davies, 1990, p. 3).

Often mother-tongue proficiency is advocated as an absolute yardstick of language proficiency, but, as Bachman and Clark (1987, p. 29) point out, "native speakers show considerable variation in proficiency, particularly with regard to abilities such as cohesion, discourse organisation, and sociolinguistic appropriateness." As a result, theoretical differences between testers can affect the reliability of the test.
Raters who know the language well, indeed even mother-tongue speakers, can differ radically in their assessments of such pragmatic tasks as essay tasks. That is why different raters' scores on a particular protocol are often incommensurate with their judgements. Owing to these problems, it is virtually impossible to define criterion levels of language proficiency in terms of actual individuals or actual performance. Bachman and Clark (1987, p. 30) suggest that such levels must be defined abstractly in terms of the relative presence or absence of the abilities that constitute the domain. But again this does not solve the problem, because the difficulty is how to apply the definition to concrete situations of language behaviour.

The more satisfying the explanation, the foggier the idea may be of what is going on in the test-taker's head. Thus, in an error analysis it is indeed possible to label the error in purely linguistic terms, but the more important diagnostic issue of why specific errors are committed remains largely a mystery. Raters are like inquisitive (or lethargic) insects picking their way around in a "gigantic multi-dimensional cobweb" in which every item requiring an interpretation is attached to a host of others (Aitchison, 1987, p. 72).

9. Conclusion

When doing research on rater judgements, researchers cannot avoid making judgements themselves if they want to do more than present a list of who gave what score and who said what about a test-taker's performance. Accordingly, the problem of subjectivity can become very complex. For example, this research on interrater reliability included my own judgements, which were based on (but hopefully not merely based on) the judgements of other people's judgements (those of the raters discussed above). Thus, my judgement is a verbalisation (a fourth level of interpretation) of an observation (the third level of interpretation) of other people's verbalisations (a second level of interpretation) of their observations (the first level of interpretation). When one adds a fifth, a sixth and more levels (an assessment of an assessment, of an assessment, etc.), hermeneutics can so easily get trapped in hermetic "webs of beliefs" (Quine and Ullian, 1970, cited in Moore, 1998, p. 83), or, to change the metaphor, in hermeneutic circles.
It is in rater (un)reliability that matters of validity and reliability come to a head, because it brings together, in a poignant and often humbling and humiliating way, what is being (mis)measured, which is the concern of validity, and how it is (mis)measured, which is the concern of reliability. Learners may fail because they do not learn, or because they lack the academic ability, or because they are politically or economically suppressed, and for many other reasons. In my experience, many learners fail or pass depending on who marks their tests and exams (usually their teacher/lecturer); in other words, depending on the luck of the draw.
It might be of interest to know how the two learners whose protocols were used in this investigation fared in their school careers. The L1 learner passed Grade 11 with high marks in English and in aggregate, and then left MHS because his family moved away from Mmabatho. The L2 learner (recall that only the spelling was poor in his essay protocol) obtained a matriculation exemption with a C aggregate, which was a relatively good grade for an MHS matriculant.

Appendix

Tables A1 and A2 show the scores and judgements of individual raters on Protocols 1 and 2, respectively. These tables have been divided into EL1 and EL2 sections, and then sorted within each section on scores in ascending order so that the same scores appear together, which makes it easy to compare similar scores with their corresponding judgements. If the language in the first-language column is English, then the rater is an EL1 speaker.

Table A1. Scores and judgements of raters on Protocol 1

Protocol 1 – English first-language raters (EL1)

Rater | First language | Score | Raters' judgements | Content/grammar/spelling judgements
D4 | English | 3 | Many spelling errors | spelling: negative
E3 | English | 4 | Can understand in spite of errors; facts given not clear and logical | content: negative
F1 | English | 4 | Some confusion about the folding procedure | content: negative
F2 | English and Afrikaans | 4 | Folding instructions confusing | content: negative
F3 | English | 4 | Imprecise instructions on how to cover a book | content: negative
A2 | English | 5 | Well visualised but inconsistent spelling | content: positive; spelling: negative
A3 | English | 5 | Satisfactory, but poor spelling and grammar | grammar: negative; spelling: negative
B2 | English | 5 | Logical structure but a spelling problem | content: positive; spelling: negative
B3 | English and Xhosa | 5 | Coherent and cohesive; some spelling mistakes | content: positive; spelling: negative
C2 | English | 5 | Explicit and cohesive; surface errors do not affect meaning | content: positive; grammar: negative; spelling: negative
C4 | English | 5 | Topic deviates; content sequence satisfactory; major grammatical errors detract from coherence | content: negative; grammar: negative
D1 | English | 5 | Only one great fault is spelling, quite distracting | content: positive; grammar: positive; spelling: negative
F4 | English | 5 | Not enough details; inconsistency of spelling | content: negative; spelling: negative
B1 | English | 6 | Lucid but main problem is spelling | content: positive; spelling: negative
D2 | English | 6 | Logically structured, spelling errors main problem | content: positive; spelling: negative
E2 | English | 7 | Clear logical, no serious grammatical errors, only spelling errors | content: positive; grammar: positive; spelling: negative

Protocol 1 – English second-language raters (EL2)

Rater | First language | Score | Raters' judgements | Content/grammar/spelling judgements
C1 | Sotho | 3 | Meaningless, cloudy | content: negative; grammar: negative
E4 | Tswana | 3 | The student is relevant but the text is full of grammatical errors and inconsistent | content: positive; grammar: negative
A1 | Ewe | 4 | Grammatical errors but adequate description | content: positive; grammar: negative
A4 | Venda | 4 | Grammatical accuracy is a problem | grammar: negative
E1 | Xhosa | 5 | No comment | –
B4 | Zulu and Venda | 5 | Mechanics a problem, but understandable | content: positive; spelling: negative
C3 | Xhosa | 6 | Topic not relevant; any book is covered in this way; content accurate; a few grammatical errors but meaning not affected; spelling inconsistent | content: negative; grammar: negative; spelling: negative
D3 | Xhosa | 6 | Has good command of language; this learner belongs to an 'elite group' | content: positive; grammar: positive
Table A2. Scores and judgements of raters on Protocol 2

Protocol 2 – English first-language raters (EL1)

Rater | First language | Score | Raters' judgements | Content/grammar/spelling judgements
A2 | English | 5 | Logical approach | content: positive
A3 | English | 4 | On topic but content confusing; grammar inaccurate | content: negative; grammar: negative
B1 | English | 4 | Muddled; poor syntax and idiomatic usage | content: negative; grammar: negative
B2 | English | 4 | Repetitive; simple vocab., poor syntax | content: negative; grammar: negative
B3 | English and Xhosa | 4 | Poor punctuation; language is poor | grammar: negative
C2 | English | 4 | Lack of cohesion makes writing less explicit despite limited surface errors; content interpretable | content: positive; grammar: negative
C4 | English | 5 | Topic relevant; content: missing propositions, little connection; reasonable grammatical accuracy | content: negative; grammar: positive
D1 | English | 5 | Less accurate; difficult to understand "cut into strips" | content: negative
D2 | English | 5 | Cohesion bad, e.g. "cut it into strips", but fairly coherent, not too many errors | content: positive; grammar: negative
D4 | English | 4 | Topic relevant, content meaningful and grammar better than 1 | content: positive; grammar: positive
E2 | English | 5 | Unclear explanation; cut what into strips? General reluctance to give extremely high or low marks | content: negative
E3 | English | 5 | Can understand in spite of errors; unclear and illogical | content: negative
F1 | English | 5 | Fairly clear, except for "cut it into strips" | content: positive
F2 | English and Afrikaans | 5 | Left out important details such as opening the book; "cut into strips" is confusing | –
F3 | English | 4 | Neither gives precise enough instructions to enable s.o. who does not know how to cover a book to cover one | –
F4 | English | 6 | Quite good in terms of "understanding ability"; grammar not good | –

Protocol 2 – English second-language raters (EL2)

Rater | First language | Score | Raters' judgements | Content/grammar/spelling judgements
A1 | Ewe | 5 | Content fine | content: positive
C1 | Sotho | 4 | Errors affect meaning | grammar: negative
E4 | Tswana | 8 | – | –
A4 | Venda | 3 | Grammatical inaccuracy | grammar: negative
C3 | Xhosa | 4 | Topic not relevant; any book is covered in this way; content accurate; a few grammatical errors but meaning OK | content: positive; grammar: negative
D3 | Xhosa | 4 | Very limited vocab.; "Perhaps he is from the low income group" | content: negative
E1 | Xhosa | 5 | Does not state clearly in opening sentence what he/she intends to do | content: negative
B4 | Zulu and Venda | 4 | Mechanics blocks meaning, imprecise but understandable; wrong sequence | content: negative; grammar: negative; spelling: negative

Questionnaire on moderation workshops

The relevant questions (J–L) of the questionnaire, and Table A3 containing the corresponding data, are presented below:

J. (i) Do you find that one or more of your colleagues in the workplace evaluate(s) pupil/student protocols in such a way that your respective allocation of scores is significantly different? Yes... No... (ii) If yes, do you find this to be a serious problem? Yes... No...
K. Do you have moderation workshops/meetings with your colleagues? 1. Never...; 2. Once annually...; 3. More than once annually...; 4. More than twice annually...
L. If your answer to the previous item is not 'never', have you found that these moderation workshops/meetings at your institution have ironed out the assessment disparities between you and your colleagues? 1. There has been a great improvement...; 2. A fair improvement...; 3. A negligible improvement...; 4. No noticeable improvement...; 5. They're a waste of time...

Table A3. Raters' opinions on moderation workshops

Rater | Place of study | Experience (years) | J (i): Significant difference between colleagues | J (ii): Find J (i) to be a problem | K: Moderation workshops | L: Is there any improvement
A1 | University | 20 | No (a) | – | 1 | –
A2 | Wits, University of South Africa | 12 | No | – | Never | –
A3 | University | 16 | Yes | Yes | 1 | 2
A4 | University | 7 | No | – | Never | –
B1 | Natal | 18 | No | – | 2 | 1
B2 | Rhodes University | 7 | No | – | 2+ | 2
B3 | University of Transkei | 8 | No | – | 1+ | –
B4 | University | 9 | – | – | Never | –
C1 | Lesotho | 6 | No | – | 1 | 1
C2 | University of South Africa | 12 | No | – | 2+ | 1
C3 | University of Fort Hare | 18 | No | – | Never | –
C4 | Rhodes | 10 | No | – | 2+ | –
D1 | Fort Hare | 20 | Yes | No (b) | 1+ | 2
D2 | South Africa and UK | 7 | Yes | No | 2+ | 2
D3 | Fort Hare | 38 | – | – | Never | –
D4 | Venda | 4 | Yes | No | Never | –
E1 | University | 7 | No | – | 2+ | 2
E2 | Potchefstroom | 12 | No | No | 1 | –
E3 | College | 4 | No | – | 1 | –
E4 | Rhodes | 10 | No | No | 1 | –
F1 | Exeter | 28 | Yes | Yes | Never | –
F2 | OFS, UCT, Cambridge | 5 | Yes | Yes | 1 | 4
F3 | Lancaster | 20 | Yes | Yes | 2+ | 2
F4 | Bangalore, UK | 11 | – | – | 2+ | –

(a) If the answer to J (i) is 'No', then no answer is required for J (ii).
(b) It is odd that this rater and the next two would have no problem if they discovered significant differences in the ratings they gave the same student.

References

Aitchison, J., 1987. Words in the Mind: An Introduction to the Mental Lexicon. Blackwell, Oxford.
Alderson, J.C., 1981. Report of the discussion on Communicative Language Testing. In: Alderson, J.C., Hughes, A. (Eds.), Issues in Language Testing: ELT Documents III. The British Council, London.
Alderson, J.C., 1983. Who needs jam? In: Hughes, A., Porter, D. (Eds.), Current Developments in Language Testing. Academic Press, London.
Alderson, J.C., Clapham, C., 1992. Applied linguistics and language testing: a case study of the ELTS test. Applied Linguistics 13, 149–167.
Alderson, J.C., Clapham, C., Wall, D., 1995. Language Test Construction and Evaluation. Cambridge University Press, Cambridge.
Bachman, L.F., Clark, J.L.D., 1987. The measurement of foreign/second language proficiency. American Academy of the Political and Social Science Annals 490, 20–33.
Bereiter, C., Scardamalia, M., 1983. Does learning to write have to be so difficult? In: Freedman, A., Pringle, I., Yalden, J. (Eds.), Learning to Write: First Language/Second Language. Longman, London.
Bley-Vroman, R., 1990. The logical problem of foreign language learning. Linguistic Analysis 20, 3–49.
Bock, M., 1998. Teaching grammar in context. In: Angélil-Carter, S. (Ed.), Access to Success: Literacy in Academic Contexts. University of Cape Town Press, Cape Town.
Bradbury, J., Damerell, C., Jackson, F., Searle, R., 1990. ESL issues arising from the "Teach–test–teach" programme. In: Chick, K. (Ed.), Searching for Relevance: Contextual Issues in Applied Linguistics. South African Applied Linguistics Association (SAALA), Johannesburg.
Brown, A., 1995. The effect of rater variables in the development of an occupation-specific language performance test. Language Testing 12, 1–15.
Coe, R.M., 1987. An apology for form; or, who took the form out of the process. College English 49, 13–28.
Cziko, G.A., 1982. Improving the psychometric, criterion-referenced, and practical qualities of integrative testing. TESOL Quarterly 16, 367–379.
Cziko, G.A., 1984. Some problems with empirically-based models of communicative competence. Applied Linguistics 5, 23–37.
Davies, A., 1990. Principles of Language Testing. Blackwell Ltd., Oxford.
Douglas, D., 1995. Developments in language testing. Annual Review of Applied Linguistics 15, 167–187.
Dreyer, C., 1998. Teacher–student style wars in South Africa: the silent battle. System 26, 115–126.
Ebel, R.L., Frisbie, D.A., 1991. Essentials of Educational Measurement, 5th edition. Prentice Hall, Englewood Cliffs, NJ.
Gamaroff, R., 1996. Workshop on quantitative measurement in language testing. National Association of Educators of Teachers of English (NAETE) Conference, East London Teacher's Centre, South Africa, September.
Gamaroff, R., 1997. "Old paradigm" language proficiency tests as predictors of long-term academic achievement. Per Linguam 13, 1–22.
Gamaroff, R., 1998a. Language, content and skills in the testing of English for academic purposes. South African Journal of Higher Education 12, 109–116.
Gamaroff, R., 1998b. Cloze tests as predictors of global language proficiency: a statistical analysis. South African Journal of Linguistics 16, 7–15.
Hartog, P., Rhodes, E.C., 1936. The Marks of Examiners. Macmillan, New York.
Ingram, E., 1985. Assessing proficiency: an overview on some aspects of testing. In: Hyltenstam, K., Pienemann, M. (Eds.), Modelling and Assessing Second Language Acquisition. Multilingual Matters Ltd, Clevedon, Avon.
Kaczmarek, C.M., 1980. Scoring and rating essay tasks. In: Oller Jr., J.W., Perkins, K. (Eds.), Research in Language Testing. Newbury House, Rowley, Massachusetts.
Lado, R., 1961. Language Testing. McGraw-Hill, New York.
Moore, R., 1998. How science educators construe student writing. In: Angélil-Carter, S. (Ed.), Access to Success: Literacy in Academic Contexts. University of Cape Town Press, Cape Town.
Moss, P., 1994. Can there be validity without reliability? Educational Researcher 23, 5–12.
Oller Jr., J.W., 1979. Language Tests at School. Longman, London.
Oxford, R.L., Ehrman, M., Lavine, R.Z., 1991. Style wars: teacher–student style conflicts in the language classroom. In: Magnan, S.S. (Ed.), Challenges in the 1990s for College Foreign Language Programs. Heinle and Heinle, Boston, MA.
Pilliner, A.E.G., 1968. Subjective and objective testing. In: Davies, A. (Ed.), Language Testing Symposium. Oxford University Press, London.
Quine, W., Ullian, J., 1970. The Web of Belief. Random House, New York.
Santos, T., 1988. Professors' reactions to the academic writing of nonnative-speaking students. TESOL Quarterly 22, 69–90.
Vollmer, H.J., 1983. The structure of foreign language competence. In: Hughes, A., Porter, D. (Eds.), Current Developments in Language Testing. Academic Press, London.
Widdowson, H.G., 1983. New starts and different kinds of failure. In: Freedman, A., Pringle, I., Yalden, J. (Eds.), Learning to Write: First Language/Second Language. Longman, London.
