
Educational Measurement: Issues and Practice

Winter 2020, Vol. 39, No. 4, pp. 30–40

Rubric Rating with MFRM versus Randomly Distributed Comparative Judgment: A Comparison of Two Approaches to Second-Language Writing Assessment

Maureen E. Sims, Troy L. Cox, Grant T. Eckstein, K. James Hartshorn, Matthew P. Wilcox, and Judson M. Hart, Brigham Young University

The purpose of this study is to explore the reliability of a potentially more practical approach to
direct writing assessment in the context of ESL writing. Traditional rubric rating (RR) is a common
yet resource-intensive evaluation practice when performed reliably. This study compared the
traditional rubric model of ESL writing assessment and many-facet Rasch modeling (MFRM) to
comparative judgment (CJ), the new approach, which shows promising results in terms of
reliability. We employed two groups of raters—novice and experienced—and used essays that had
been previously double-rated, analyzed with MFRM, and selected with fit statistics. We compared the
results of the novice and experienced groups against the initial ratings using raw scores, MFRM,
and a modern form of CJ—randomly distributed comparative judgment (RDCJ). Results showed that
the CJ approach, though not appropriate for all contexts, can be as reliable as RR while showing
promise as a more practical approach. Additionally, CJ is easily transferable to novel assessment
tasks while still providing context-specific scores. Results from this study will not only inform
future studies but can help guide ESL programs in selecting a rating model best suited to their
specific needs.

Keywords: comparative judgment (CJ), ESL writing assessment practicality, ESL writing assessment reliability,
many-facet Rasch measurement model (MFRM), rubric rating

Scoring reliability is a necessary condition in any validity argument (Knoch & Chapelle, 2017). Writing assessment scholars agree that direct measures such as essay tests are better than discrete response tasks such as multiple-choice tests when measuring writing ability (Greenberg, 1992; Huot, 1990; Yancey, 1999) because direct measures are more valid; that is, evaluating a student’s actual writing can lead to a better measurement of writing ability than a multiple-choice exam about writing. Despite the simplicity of this argument, assessment scholars are sensitive to the fact that discrete response tasks are easier to score and produce much higher reliability coefficients than scores emerging from rater-mediated assessments (Eckes, 2011).
Direct writing tasks are prone to unreliable ratings due to interpretations that can be made by individual raters (Breland, 1983), thus reducing the credibility, generalizability, and validity of the assessment (Kane, 2013; Knoch & Chapelle, 2017). To reduce divergent ratings, assessors use holistic and analytic rubrics (Barkaoui, 2011; Huot, O’Neill, & Moore, 2010), rater training (McNamara, 1996), and algorithms like many-facet Rasch measurement (MFRM) that adjust raw scores to account for rater biases (Eckes, 2011; Engelhard, 1992).
While these practices help increase the reliability of direct writing assessment by focusing raters on what is being measured (an issue of validity), each has unique drawbacks. Rubrics and training, for instance, can be costly and resource-intensive to develop and employ, yet do not statistically control for rater bias. MFRM offers statistical adjustment of rater bias, yet the calculations require appropriate software and sufficient technical skill to generate and interpret the modified scores. These conditions have led to the investigation of a reemerging scoring approach to direct writing assessment called Comparative Judgment (CJ; Fechner, 1965; Laming, 2004; Pollitt, 2004; Thurstone, 1927). In this process, raters repeatedly select the better of two essays within a given pool that computer algorithms then organize from strongest to weakest.
Researchers have advocated for wider adoption of CJ based on its high reliability as measured in a range of assessment domains across a number of studies (Bramley, 2007; Jones & Inglis, 2015; Pollitt, 2012; Steedle & Ferrara, 2016). None of these studies, however, examined CJ reliability when rating texts composed by English as a second language (ESL) writers, whose language skills and organizational patterns may make it difficult for raters to agree on relative essay quality.

Furthermore, CJ has not been adequately compared to traditional MFRM-adjusted scores to determine whether it has an advantage over the more commonly accepted reliability procedure. Finally, it is unknown if CJ is reliable when performed by minimally trained raters, though a number of studies involving CJ have revealed high reliability results among both trained and untrained raters in other domains (Heldsinger & Humphry, 2013; Jones & Alcock, 2014; McMahon & Jones, 2015; Metzgar, 2016). Moreover, a comprehensive comparison among rubric-rated and CJ-rated essays by both experienced and inexperienced raters of ESL writing has not yet been conducted. It is possible that highly reliable CJ scores will fail to emerge when compared to rubric rating (RR) of ESL writing, especially by untrained raters. Since resources to conduct CJ are becoming more accessible and may be an attractive approach to assessing ESL writing, CJ must be studied more thoroughly.
This study’s purpose is to compare inferences that can be made in a validity argument through examining reliability measures of experienced and inexperienced raters among raw rubric scores, MFRM-adjusted rubric scores, and CJ scores on ESL writing. The emphasis on reliability is a necessary condition of this research, which takes as its theoretical framework an assumption that reliability is a necessary (but not sufficient) condition of validity (Kane, 2013; Knoch & Chapelle, 2017). Thus, we must first determine whether raters can produce consistent scores utilizing CJ on ESL writing before examining claims about CJ’s validity.

Background
Writing assessors value reliability because low reliability is evidence that raters lack consistency in measuring intended writing objectives—a threat to validity. Increased reliability reduces measurement error (Wiliam, 2001), which is an essential consideration in high-stakes testing, where test results have substantial consequences and must be valid (O’Neill, 2011; Wiliam, 2001). Yet, it is well established in the writing community that productive tasks (e.g., essays) are a preferred method for validly assessing writing skill over discrete response tasks (e.g., multiple choice; Greenberg, 1992; Huot, 1990; Yancey, 1999) and that a valid assessment is one in which inferences made and actions taken are adequately supported by theoretical and empirical evidence (Kane, 2013; Messick, 1995). Such direct assessment approaches are difficult to rate reliably; Breland (1983) explained that “as far back as 1880 it was recognized that the essay examination was beset with the curse of unreliability” (p. 1). Rating inconsistency undermines the generalizability and credibility of assessment, which calls into question the validity and decisions made based on its results (Kane, 2013; Knoch & Chapelle, 2017).

Validity Framework
Concerned with investigating the validity of the rating process of CJ, we have framed the findings of this study in terms congruent with Knoch and Chapelle’s (2017) argument-based framework for rating process validation. Argument-based approaches have been used successfully throughout the assessment literature, and their benefits stem from delineating “principled connections . . . in an argument that demonstrates the logic for making interpretations . . . of test takers’ performance” (p. 4). The framework developed by Knoch and Chapelle is comprehensive and incorporates six inferences, each with warrants backed by assumptions that are, in turn, supported by evidence. These are: evaluation inference, generalization inference, explanation inference, extrapolation inference, decision inference, and consequences inference. It is beyond the scope of this article to address all six inferences as they are rigorous and comprehensive; however, as “reliability is taken to be a necessary but not sufficient condition for validity,” and “the validity of a test with respect to any criterion cannot exceed the index of reliability” (Kane, 2013, p. 29), we, therefore, focus our efforts on addressing the evaluation inference as it relates directly to rating reliability.
The evaluation inference incorporates two important warrants: rating scale properties and rater reliability. These two warrants, in turn, involve a total of 13 combined assumptions that, if supported by evidence, meet the criteria for the evaluation inference in validating the rating process. Warrant 1 asserts that the scale functions as intended by the development team and consists of the following assumptions: (1) the scale criteria are well-defined, (2) the steps are adequate for distinguishing different levels of performance within the scale, and (3) the scale can spread examinees into different levels. Warrant 2 asserts that the raters rate reliably and includes the assumptions that the raters: (4) can identify different levels of performance among different scores, (5) can apply the scale consistently, (6) are comfortable and confident using the scale, (7) are well-trained, (8) have adequate support documentation, and (9) are qualified. In addition, assumption 10 posits that rating sessions are designed to optimize rater performance, and assumptions 11–13 presuppose the existence of evidence that rater bias is not systematically introducing construct-irrelevant variance into the scores through rater characteristics, use of sub-scales, or rating of tasks. For each of these, Knoch and Chapelle (2017) listed types of evidence needed to support the assumptions and thereby the warrants. They further argued that the use of MFRM can provide strong evidence for a number of these assumptions. For example, MFRM provides scale diagnostic data regarding rater use of the scale in terms of both consistency and consensus (see Knoch & Chapelle, 2017, for a full description of the framework).

Increasing Reliability
Reliability, which contributes to a case for the valid interpretations of a score, measures the consistency of rater judgments and is reported on a scale of 0 (significant error) to 1 (no error). Scores of .80 and above are considered highly reliable and acceptable for high-stakes testing. In order to produce reliable essay scores, writing assessors have implemented numerous measures including creating rubrics, training raters on the rubrics, and utilizing software to compensate for rater error and bias (Knoch & Chapelle, 2017). Rubrics provide descriptions of expectations of quality, and raters assign examinees to locations on the scale to represent their rating in relation to the traits being assessed (Myford & Wolfe, 2003). Holistic scoring rubrics include rating scale levels with descriptors for each level (Huot, O’Neill, & Moore, 2010). Analytic rubrics go one step further and include descriptors for varying criteria that can be measured individually and aggregated to reach a final score (Barkaoui, 2011).

While rubrics can act as a “regulatory device” (Jonsson & Svingby, 2007, p. 136), they do not guarantee high reliability (Eckes, 2011). Raters, notwithstanding extensive training, often internalize and apply the same rubric differently (Eckes, 2011; McNamara, 1996; Myford & Wolfe, 2003) and don’t always utilize the rubric as intended (Winke & Lim, 2015). To counter this problem, raters are trained and normed to have essentially the same mental picture of each rubric descriptor (Wolfe, 2005). Although training and norming on rubrics has demonstrated a positive effect on intrarater reliability (Eckes, 2011; McNamara, 1996), rater bias can still prove to be disproportionately influential (Wilson & Case, 1997). In response, sophisticated statistical measures, including MFRM, have been employed in the last three decades to compensate for rater effects (Eckes, 2011; Engelhard, 1992). MFRM is widely accepted as a robust statistical mechanism that adjusts for rater effects and identifies outlying judges or examinees, resulting in a modified “fair average score” that represents a more accurate assessment.
Unlike other models in which item discrimination is estimated freely on an item-by-item basis, the Rasch model scales the discrimination index for all items to 1. This restriction allows the model to place both item difficulty and person ability on the same equal-interval log-odds (logit) scale. MFRM extends this model by permitting the addition of facets, or sources of variability. In the present study, we are primarily interested in rater self-consistency in judging examinee ability; however, other facets such as task difficulty, rating occasion, etc., can be added. MFRM models all these facets jointly and analyzes the pattern of examinee responses and rater scores in the form of fit statistics that help detect anomalous behavior. MFRM, therefore, is an effective solution for greater reliability in direct writing assessment, but it can be resource-intensive and potentially cost-prohibitive. CJ suggests similar reliability through the use of built-in statistical modeling and is offered at no cost by a proprietary system available through the No More Marking website: nomoremarking.com. Since this system is widely accessible and the underlying processes are proprietary, it needs to be carefully scrutinized.
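The study reports its MFRM results through FACETS output rather than equations, but for orientation, the rating scale form of a many-facet Rasch model with examinee and rater facets can be written as follows (the notation is our sketch, not reproduced from the article):

$$\log\!\left(\frac{P_{nrk}}{P_{nr(k-1)}}\right) = \theta_n - \alpha_r - \tau_k,$$

where $P_{nrk}$ is the probability that rater $r$ places examinee $n$ in category $k$ rather than $k-1$, $\theta_n$ is examinee ability, $\alpha_r$ is rater severity, and $\tau_k$ is the Rasch–Andrich threshold between adjacent categories. Additional facets such as task, occasion, or rater group enter as further subtracted terms, and all parameters share the logit scale described above, which is what allows rater severity to be modeled out of the fair average score.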
Comparative Judgment
Comparative Judgment (CJ) was first proposed by Thurstone (1927) in the early 1900s and is based on the theory that humans are innately predisposed to successful comparisons but less apt when attempting absolute judgments in isolation (Fechner, 1965; Laming, 2004). Although the theory has been well researched in the field of psychology, CJ has only recently been used in education (Pollitt, 2004).
We can conceptualize CJ by way of analogy in which a sports league ranks teams from best to worst. CJ would have every team compete against every other team over the course of a season to determine that ranking. When applied to direct writing assessment, raters compare two essays side by side, choose which is better, then move to the next set. Comparison data is aggregated and eventually rank-ordered on a scale representing the relative distance between each essay.
Proponents argue that the main advantage of CJ is its ability to estimate the subjective distance of objects that cannot otherwise be objectively arranged (Vasquez-Espinosa & Conners, 1982). Pollitt, an advocate for its use in educational assessment, claims: “Probably the most immediate advantage of [CJ] is in making reliable the assessment of skills that are currently problematic” (2012, p. 292). This characteristic can be particularly useful in the assessment of essays that reflect actual writing ability but can be difficult to assess reliably (Huot, 1990; O’Neill, 2011; Yancey, 1999). Jones and Inglis (2015) illustrated this potential through a multistage study wherein they investigated the flexibility of CJ in scoring productive tasks. Experienced test designers first developed a productive math assessment of higher order problem-solving skills for which experts then created what amounted to a 16-page rubric. After administering the exam to 750 students, it was marked traditionally by four experienced markers and judged using CJ by 20 math experts. Traditional marking resulted in high interrater reliability, r = .91, and a strong correlation with predicted General Certificate of Secondary Education (GCSE) grades, r = .73. CJ interrater reliability was r = .86, yet correlated more closely with GCSE predicted grades (r = .76). The same task resulted in comparable reliability and validity evidence using CJ without the need for an extensive rubric, exemplifying the potential advantage of CJ over RR in productive assessment design and rating. Steedle and Ferrara (2016) further reported on multiple CJ studies in which all interrater reliability indicators were greater than .73, and most were greater than .93. Such high levels of reliability may result from the method itself: the forced comparisons inherent in CJ can mitigate rater bias, and problems typical in writing assessment, such as central and extreme tendencies or lenience and strictness, are moderated by the very nature of the method (Bramley, 2007; Pollitt, 2012).
A number of CJ studies have further shown high interrater reliability scores among trained and untrained raters (Heldsinger & Humphry, 2013; Jones & Alcock, 2014; Jones & Inglis, 2015; Jones, Swan, & Pollitt, 2015; McMahon & Jones, 2015; Metzgar, 2016). Pollitt (2012) indicated untrained raters competently assessed writing quality and achieved absolute reliability (an equivalent of Cronbach’s α) estimates of .96 of the objects rated. Heldsinger and Humphry (2013) employed a form of CJ involving calibrated exemplars in which teacher ratings were compared with those of trained raters; they reported high interrater reliability for the teachers (.92) and a strong correlation between trained rater and teacher ratings (.90). Therefore, the amount of training required to perform CJ rating may be significantly less than that for rubric rating (Jones & Wheadon, 2015).
Although promising, CJ is not without drawbacks. It is cumbersome when applied to lists, easily measured outcomes, or longer tasks (McMahon & Jones, 2015). Additionally, it does not provide a feedback mechanism and is therefore unsuited for use in certain pedagogical contexts. The lack of a rubric also removes the instrument by which a consensus of quality expectations is communicated (Brown, 2012). While rubric-rated benchmarks can be used as anchors in score translations, this means CJ is not completely independent of RR. There may also be a potential for raters to use proxy surface features, such as word count, spelling, or essay length when making judgments. These potential shortcomings can conceivably enervate the arguments of both the scale and rater warrants of the validation framework.
Traditional CJ exhibits a further drawback in terms of practicality. Studies involving traditional CJ suggest raters need to make anywhere from 25 to 50 judgments per essay to assure quality results (Steedle & Ferrara, 2016; Whitehouse & Pollitt, 2012), as each essay needs to be compared with almost every other essay in order to accurately assign its place in the rank order (Vasquez-Espinosa & Conners, 1982).

Adaptive comparative judgment (ACJ) can conceivably increase reliability and practicality simultaneously by leveraging initial Swiss rounds¹ to fuel an algorithm that generates the most informative pairings (Whitehouse & Pollitt, 2012). ACJ is a system whereby raters make inferences concerning an individual essay’s ranking within a group of essays based on a subset of known judgments between multiple linked pairs. In this approach, to further the sports league analogy used earlier, each team only plays a subset of all the teams in the league but is linked to all other teams through the pairings (i.e., team A never plays team C, but team B plays both team A and team C, thus teams A and C are linked through pairings with team B). Similarly, ACJ works with essay rating in such a way that each essay is compared to a subset of other essays in the group; in each match, one essay is judged to be better than the other and provides additional information that eventually results in the ranking of the essays from best to worst. The algorithms generating ACJ data are based on the Rasch logistic model (Andrich, 1978) and the Bradley–Terry–Luce model (Turner & Firth, 2012). Using these algorithms, which follow similar statistical models to those underlying MFRM analyses, the predictive nature of ACJ minimizes the number of pairs required to complete the scale while maintaining reliability scores above .80 in most cases (Heldsinger & Humphry, 2013; Pollitt, 2012; Steedle & Ferrara, 2016; Whitehouse & Pollitt, 2012). This is an acceptable level for high-stakes testing (Wiliam, 2001). Even in its adaptive form, the number of pairings is greater with ACJ than RR. However, proponents of CJ suggest that the binary nature of the decisions made within the CJ model reduces the cognitive load required to perform evaluations (Goffin & Olson, 2011), thus making rating decisions not only faster but easier as well.
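To illustrate how a pile of binary judgments becomes a scale, the sketch below fits a plain Bradley–Terry model to (winner, loser) pairs with the classic minorization–maximization update. It is a minimal Python illustration of the general technique and makes no claim about the proprietary estimation used by nomoremarking.com or by ACJ systems.

    import numpy as np

    def bradley_terry(n_items, judgments, iters=200):
        """Estimate Bradley-Terry strengths from (winner, loser) index pairs."""
        wins = np.zeros(n_items)              # W_i: total wins for item i
        n = np.zeros((n_items, n_items))      # n_ij: times i and j were compared
        for winner, loser in judgments:
            wins[winner] += 1
            n[winner, loser] += 1
            n[loser, winner] += 1
        strengths = np.ones(n_items)          # start with equal strengths
        for _ in range(iters):
            pair_sums = strengths[:, None] + strengths[None, :]
            denom = (n / pair_sums).sum(axis=1)
            # MM update: p_i <- W_i / sum_j n_ij / (p_i + p_j).
            # The 0.5 floor is a sketch-level fix so items that never won
            # a comparison keep a small positive strength.
            strengths = np.maximum(wins, 0.5) / denom
            strengths /= np.exp(np.log(strengths).mean())   # fix the scale
        return np.log(strengths)              # logit-style scale scores

    # Example: four essays; each tuple records which of a displayed pair won.
    # Every essay must appear in at least one comparison.
    judgments = [(0, 1), (0, 2), (1, 2), (3, 0), (3, 1), (3, 2)]
    scores = bradley_terry(4, judgments)      # higher value = judged better

Ranking essays by the returned scores reproduces the strongest-to-weakest ordering described above; operational CJ engines layer standard errors, rater fit statistics, and reliability estimates on top of this core idea.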
Critics question the efficacy of the adaptive nature of ACJ, however, arguing that preemptive pairings may lead to potential overinflation of ACJ reliability data (Bramley, 2015; Bramley & Wheadon, 2015). In acknowledgement, nomoremarking.com generated a more robust version with randomly distributed pairings (C. Wheadon, personal communication, August 5, 2017). This version, which we refer to as randomly distributed comparative judgment (RDCJ), also reduces the number of required judgments; however, unlike ACJ, which narrows the pairing possibilities as the algorithm advances, RDCJ maintains random pairings throughout the judging process, conceivably providing a more equitable judging framework for comparisons (Wheadon, 2015). These randomly distributed pairs may be less likely to overinflate reliability.
RDCJ is made available through nomoremarking.com. It is important to note, however, that there is a clear trade-off in the use of a proprietary system. While the software is free and easy to use, the calculations are opaque and preclude open scrutiny. Nevertheless, as RDCJ is a readily available resource currently being used extensively in the educational marketplace by novice and experienced raters alike, it is this version of CJ we have chosen to investigate.
Moreover, whereas RDCJ shows potential as a reliable and practical approach to direct writing assessment, the broad applicability of CJ is still unexplored. For instance, while studies of CJ suggest that rater training is not necessary for achieving acceptable levels of reliability (greater than .80), this claim has yet to be explored in the linguistically complex context of ESL writing assessment. Research so far has looked primarily at native English contexts, in areas ranging from peer assessment (Jones & Wheadon, 2015) to mathematical problem-solving (Jones, Swan, & Pollitt, 2015). Thus, there is a need for further investigation into the reliability of RDCJ and its applicability in relatively new contexts.
Furthermore, there is an increasing need for practical approaches to L2 writing assessment. Reports estimate the number of English-language learners will exceed two billion by the year 2020 (Beare, 2017). Many of these learners will be required to take direct writing assessments such as those in the Test of English as a Foreign Language (TOEFL) to demonstrate their language proficiency. Given the potential emergence of RDCJ as an ESL rating method, determining the extent to which it meets the criteria of an assessment validity argument is valuable. Thus, this study assesses the reliability of RDCJ relative to the reliability of rubric rating with MFRM when applied to ESL writing using trained and untrained raters. The primary research aim is to examine how novice and experienced raters compare in terms of reliability when utilizing traditional RR, MFRM, and RDCJ in an ESL setting.

Method
For this study, we used essays that had been double-rated with a rubric, analyzed with MFRM, and selected with fit statistics. Novice and experienced raters scored congruous sets of these essays using RDCJ and RR with MFRM (see Figure 1).

Sample Essays
We collected a stratified sample of 60 essays from a pool of 30-minute ESL placement essays in an intensive English program (IEP) that had previously been rubric-rated by a group of experienced and rubric-normed IEP teachers. We made efforts to select essays evenly from each of the rubric levels and to representatively sample the language groups involved in the testing (see Table 1). To control for prompt effect (Evans, Hartshorn, Martin, & Cox, 2014), we collected essays from four prior rating sessions that used the same essay prompt and were further divided into two congruous sets of 37 and 38 essays, with 12 essays in common.
The strata are based on the rubric levels of 0 (little to no language or a reliance on simple, memorized words and phrases) to 7 (university-level writing). The holistic rubric included three criterion categories (text type, content, and accuracy) that were combined to produce a single rating. The rubric used for all rubric-rating sessions associated with this study has been consistently utilized by the IEP over several years with satisfactory results in terms of scale properties. We selected essays that typified models of each of the seven levels being tested. For rubric levels 1–5, all essays had been rated with 100% rater agreement based on observed scores and MFRM infit scores between 0 and 1.36 (2–13 raters per essay) and represented a full-level difference between each rating level. Only two essays had a rating of 0, so both were included in the study. Additionally, the small number of available level 6 essays resulted in less rigorous selection standards and, consequently, a little less than a full-level difference in some instances. Fair average scores ranged from 5.62 to 6.67, and infit scores exceeded 1.5 on three of the essays (1.58, 2.35, and 3.22). Level 6 rater agreement, however, remained 100% with two raters per essay. Level 7 was not included because none of the available essays had been rated a 7.

FIGURE 1. Study design to compare traditional rubric rating (RR) to many-facet Rasch modeling (MFRM) and randomly distributed comparative judgment (RDCJ).
Note. ANOVA run to test for effects on rating time, and Spearman’s rho used to correlate between MFRM adjusted fair average, the study rubric rating fair averages, and RDCJ true scores to show evidence of accuracy. Pearson correlation run to compare surface features with MFRM fair averages and RDCJ true scores.

Raters
Two groups of evaluators (novice and experienced) rated congruous sets of the essays using RDCJ and RR with MFRM. The novice and experienced raters were further divided into two groups in order to measure method and order for both levels of rater experience; each group, A and B, was composed of four novice raters and four experienced raters. Although Knoch and Chapelle’s (2017) evaluation inference requires that raters are thoroughly and regularly trained, we included novice, untrained raters in order to investigate training effect. Eight undergraduates minoring in the Teaching English to Speakers of Other Languages (TESOL) program at the university where this research was conducted formed the novice group.
The experienced raters met the training and qualification requirement and exhibited confidence using the scale (Knoch & Chapelle, 2017). This group comprised eight individuals who self-selected from a pool of trained raters currently working or having previously worked at the same IEP. They had from 2 to 7 years’ experience rating congruent placement essays within the same context while using the rubric chosen for this study.

Rating Methods
Most MFRM rubric-rating systems use incomplete yet linked rating schedules (Eckes, 2011) where at least two raters judge each essay. For the rubric-rating portion of this study, we employed a fully crossed design in which each essay set was rated by all of the raters within the rating group to which it was assigned. Further, a portion of the essays (20%) were included in both Essay Set 1 and Essay Set 2 so that they were rated by all the raters employing both methods. The design of the CJ portion of the study was also conservative. RDCJ has demonstrated acceptable levels of reliability, .80 or higher, with a minimum of nine judgments for each essay (Jones & Wheadon, 2015). Full essay sets were assigned to each of four distinct groups: Group A Novice, Group A Experienced, Group B Novice, and Group B Experienced for RDCJ rating. For each set of 37 or 38 essays, raters were assigned an average of 100 judgments each, or approximately 11 judgments per essay.
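For orientation, the arithmetic behind that figure is straightforward: with four raters per group each making roughly 100 judgments, an essay set accumulates about 4 × 100 = 400 paired judgments, and 400 ÷ 37 ≈ 10.8, which is where the approximately 11 judgments per essay comes from.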

Table 1. Essay Levels and Language Background

                       Essay rating levels
Languages       0    1    2    3    4    5    6
Arabic          –    –    –    –    –    –    1
Chinese         –    1    2    –    –    –    –
French          –    –    –    –    1    1    –
Japanese        –    –    1    1    –    –    –
Korean          –    3    1    1    –    –    1
Mongolian       –    1    –    1    –    –    –
Portuguese      –    –    1    1    1    2    2
Russian         –    –    –    –    1    –    –
Spanish         2    3    5    5    5    6    6
Thai            –    –    –    –    2    1    –
Turkish         –    1    –    –    –    –    –
Totals          2    9   10    9   10   10   10

Process
To control for order effect, Group A was assigned to RR and Group B used RDCJ during the first rating session using Essay Set 1. They then switched rating modes for the following session one week later using Essay Set 2. Both groups received video rating instructions and were told to complete the ratings independently in one session per rating method. For RR, scores were entered into a computer-based system. For RDCJ, raters used an online proprietary system called nomoremarking.com. Raters were instructed to log out of each system during breaks. During orientation, raters completed three RR practice ratings and two RDCJ practice comparisons. RR raters used a paper copy of the rubric to rate the essay presented on the screen. RDCJ raters were asked to compare two essays, displayed side-by-side on the computer screen, and “choose the better response.” Upon completion of the practice sessions, the raters were instructed to begin rating.
To examine raw RR without MFRM, the ratings of the 12 essays that were judged by all 16 raters were compared to the original MFRM ratings. We used the software program FACETS to run MFRM on the RR sessions from which we derived measure scores, fair averages, fit statistics, and separation reliability results for both examinees and raters (Linacre, 2017). RDCJ data analysis was completed by the proprietary system operating from nomoremarking.com, which provided reliability scores, fit statistics, and true scores.
In examining the scoring accuracy of the newly rated data, we used Spearman’s rank-order correlation coefficient (Spearman’s rho) to identify the level of association between (1) the original fair average and the new MFRM as well as (2) the original fair average and the RDCJ true score (see Table 5). To investigate the relationship between essay length and assigned ratings, we ran a Pearson’s correlation, comparing word count with MFRM fair average scores and RDCJ true scores. Further, we compared the initial MFRM-adjusted ratings with the raw scores assigned in this study to assess the impact of MFRM on study data (see Table 2). We calculated statistical analyses using SPSS (version 25.0).
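For readers who want to reproduce this style of analysis outside SPSS, a minimal sketch of the two correlation checks with SciPy is shown below; the variable contents are placeholders, not the study’s data.

    from scipy.stats import spearmanr, pearsonr

    # Hypothetical parallel lists, one entry per essay.
    original_fair_avg = [2.1, 3.4, 5.0, 1.2, 4.4]    # initial MFRM-adjusted fair averages
    study_fair_avg    = [2.3, 3.2, 5.1, 1.0, 4.6]    # fair averages from the study's MFRM run
    rdcj_true_score   = [-1.1, 0.2, 1.8, -2.0, 1.1]  # true scores from the CJ system
    word_count        = [143, 208, 355, 96, 300]     # essay lengths

    # Accuracy evidence: rank-order association with the original placement scores.
    rho_mfrm, p_mfrm = spearmanr(original_fair_avg, study_fair_avg)
    rho_rdcj, p_rdcj = spearmanr(original_fair_avg, rdcj_true_score)

    # Surface-feature check: linear association between length and scores.
    r_len, p_len = pearsonr(word_count, rdcj_true_score)

    print(f"Spearman rho (MFRM) = {rho_mfrm:.2f}, (RDCJ) = {rho_rdcj:.2f}, "
          f"Pearson r (length) = {r_len:.2f}")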
Results

RR Raw Scores
The initial analysis compared the raw RR scores of 12 essays rated by all participants in this study against the original MFRM-adjusted ratings. This analysis was meant to reveal the accuracy of RR scoring alone and could only be conducted in a fully crossed design in which every rater rated every essay, hence only the 12 essays included in both essay sets were used. We further grouped raters by experience (see Table 2).
The major observation from this analysis is that adjacent agreement was high among experienced raters and low among novice. For instance, one of the level 3 essays received six level 2 ratings and two level 3 ratings from experienced raters, while three novice raters gave it a level 1, four a level 2, zero a level 3, and one a level 4. Globally, experienced raters exhibited 100% adjacent agreement across all essays in contrast with the novice raters achieving 100% adjacent agreement for four essays. Moreover, experienced raters displayed exact agreement above 63% in seven instances while novice rater exact agreement did not exceed 38% except in one case and equaled 0% in two cases. If RR without MFRM adjustments were used to assign essay scores, novice raters would present obvious threats to reliability, which corroborates Weigle’s (1999) findings that novice raters exhibited greater severity on graph essays than experienced raters, a difference that disappeared after training.
Despite the advantage of training, both groups, without MFRM, call into question the reliability of using a single rater (Kane, 2013). The low percentages of exact agreement of experienced and novice rating groups, 53% and 28%, respectively, suggest student placement based on this data would be accurate at best only about half the time with experienced raters. Such variance does little to assure assessment quality and deliver fair results to examinees as McNamara (1996) observed: “Raw scores . . . are no reliable guide to candidate ability” (p. 118).

Table 2. Comparison of Initial and Study RR Raw Scores

                                Experienced raters                       Novice raters
Original rating                       Exact        Adjacent                   Exact        Adjacent
level             Essay    Range    agreement     agreement       Range     agreement     agreement
0                  35        2         25%          100%            2          75%           88%
1                  28        3         63%          100%            2          13%          100%
                   32        2         75%          100%            3          38%          100%
2                  27        2         63%          100%            3          13%          100%
                   31        2         63%          100%            2          38%          100%
3                  36        2         25%          100%            3           0%           63%
4                  34        3         75%          100%            2           0%           63%
                   37        2         75%          100%            3          38%           75%
5                  30        3         25%          100%            5          13%           50%
                   33        3         38%          100%            4          38%           50%
6                  26        2         88%          100%            4          38%           88%
                   29        3         25%          100%            4          38%           88%
Mean                        2.42       53%          100%           3.08        28%           80%

Note. Range = spread of raw rating levels assigned to each essay; exact and adjacent agreement are calculated against the original rating. The published table also shows how each group’s eight ratings were distributed across levels 0–7, with the original rating level shaded.

Table 3. Rating Scale Effectiveness

                              Quality control                          Rasch–Andrich thresholds
Response category   Count    %    Ave. measure   Exp. measure   Outfit MNSQ    Measure     SE
0                     62     8%      –8.18          –8.04            .9           –         –
1                    126    16%      –5.38          –5.31            .7         –7.37      .20
2                    156    19%      –2.80          –2.78            .8         –4.21      .15
3                    137    17%       –.63           –.68            .8         –1.60      .14
4                    146    18%       1.42           1.29            .8           .28      .14
5                    100    12%       2.78           2.70           1.3          2.41      .14
6                     63     8%       4.15           4.15           1.1          3.84      .18
7                     12     1%       4.65           5.74           2.1          6.65      .35

MNSQ = mean square; SE = standard error.

Evidence on the Scale, Rater Consistency, and Examinee Reliability
We chose to distinguish between rater reliability and examinee reliability, though both kinds of reliability measure the reproducibility of results. Rater reliability reflects an ability to consistently distinguish between levels of a trait. Raters can be trained to be more self-consistent in their interpretation and implementation of a rating scale as manifested through fit statistics; however, it is often difficult to achieve perfect unity among raters in terms of absolute agreement, and thus if the raters fit the model (are self-consistent), rater severity differences can be modelled mathematically. Examinee reliability reflects differences in the level of the trait the person possesses. MFRM was conducted using FACETS (Linacre, 2018) in which a four-facet rating scale model was specified, including: Examinee, Rater, Occasion (First, Second), and Rater Group (Novice, Experienced). The 16 raters assigned 802 RR scores to the essays from the 60 examinees and made 1,609 RDCJ judgments through nomoremarking.com.

Rater Reliability. The MFRM results in Table 3 indicate that ratings progressed monotonically without any mis-ordering of categories or thresholds. However, several indices point to issues with category 7. For instance, the count is extremely low compared to the other groups: only 12 essays were given a rating of 7 (coincidentally, none were given this score by the original raters). Further, the outfit mean square of 2.1 (which has an expected value of 1) and the relatively large discrepancy between the expected measure and the average measure indicate a high degree of unpredictability in how raters assigned a score of 7. Finally, results from the MFRM analysis estimate a rater strata of 6.94, which indicates that raters are able to distinguish approximately seven distinct levels of performance rather than the eight response categories available to them.

Likely due to the binary decisions inherent in RDCJ, the company that provides RDCJ does not supply information regarding scale quality. This was a nonissue for this study as all the rating samples were rated by both methods. However, it could inhibit researchers in addressing the assertions that the scale functions as intended. Nevertheless, if trained raters were to agree on rubric-rated exemplars to include in an RDCJ rating session, a post hoc analysis could provide information relative to the effectiveness of the RDCJ rating scale.
Infit measures were used to provide insight into the consistency of each of the raters when performing both RR with MFRM and RDCJ. These measures, infit and outfit, are sensitive to inliers and outliers, respectively, and have an expected value of 1. Linacre (2018) noted that values between .5 and 1.5 are “productive for measurement” (p. 272) and further argued that values between 1.5 and 2.0, while not productive, do not degrade the measurement. For the RR with MFRM, none of the raters had mean-square estimates above 2.0; however, four of the novice raters had infit, outfit, or both mean square estimates between 1.5 and 2.0 (see Table 4). RDCJ infit values less than 1 provide evidence of overall rater consistency, infit values ranging from 1 to 1.3 indicate some rating inconsistency, and infit values greater than 1.3 indicate rating inconsistency (No More Marking Ltd, 2019). Fifteen of the raters had fit statistics less than 1; however, one novice rater exhibited inconsistent rating with an infit of 1.46 (see Table 4). Thus, the scale for RR with MFRM functioned primarily as intended. Furthermore, there is evidence for both RR and RDCJ that the raters can apply the scale consistently, especially among experienced raters.
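For reference, these fit statistics are mean squares of standardized residuals; in the usual Rasch definitions (our summary, not formulas reproduced from the article):

$$z_{nr} = \frac{x_{nr} - E_{nr}}{\sqrt{W_{nr}}}, \qquad \text{outfit}_r = \frac{1}{N_r}\sum_{n} z_{nr}^{2}, \qquad \text{infit}_r = \frac{\sum_{n} W_{nr}\, z_{nr}^{2}}{\sum_{n} W_{nr}},$$

where $x_{nr}$ is the observed score, $E_{nr}$ and $W_{nr}$ are its model expectation and variance, and $N_r$ is the number of ratings made by rater $r$. Infit weights each squared residual by the information it carries, whereas outfit weights all residuals equally and therefore reacts more strongly to occasional extreme ratings.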
Examinee Reliability. Examinee reliability was measured in terms of how reliably MFRM and RDCJ differentiated between the essays. This study reports a common reliability estimate calculated for both MFRM and RDCJ analyses: examinee separation reliability. This type of reliability is analogous to the more well-known Cronbach’s α and indicates the extent to which the scores of a measure are consistent, or the degree to which they provide reproducible results. Reliabilities reported, therefore, are based on examinee separation: examinee separation reliability for RR and an analogous RDCJ reliability indicator (Wheadon, personal communication, August 5, 2017).
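For readers unfamiliar with the statistic, separation reliability is conventionally computed from the spread of the estimated measures relative to their estimation error; a common formulation (our gloss of the Rasch-measurement convention, not of the proprietary RDCJ indicator) is:

$$R = \frac{\sigma^{2}_{\text{measures}} - \overline{SE^{2}}}{\sigma^{2}_{\text{measures}}},$$

where $\sigma^{2}_{\text{measures}}$ is the observed variance of the estimated examinee measures and $\overline{SE^{2}}$ is the mean of their squared standard errors. R approaches 1 when examinees are spread widely relative to the precision of their estimates, which is the sense in which it parallels Cronbach’s α.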

Table 4. Rater Infit Data for MFRM and RDCJ

                              MFRM                        RDCJ
Experience      Rater    Infit MNSQ   Outfit MNSQ    Infit statistic
Experienced       1         .78           .72             .67
Experienced       2         .75           .73             .60
Experienced       3         .65           .74             .70
Experienced       4         .53           .50             .61
Experienced       5         .66           .74             .76
Experienced       6         .72           .72             .74
Experienced       7         .51           .53             .52
Experienced       8        1.00           .88             .65
Novice            9        1.35          1.33             .82
Novice           10         .76           .82             .56
Novice           11        1.86          1.60             .94
Novice           12        1.89          1.98            1.46
Novice           13        1.51          1.19             .51
Novice           14         .52           .49             .40
Novice           15        1.90          1.46             .53
Novice           16        1.00           .96             .67

MNSQ = mean square. In the published table, values outside the accepted range of .5 to 1.5 are bolded.

Table 5. Reliability and Accuracy Indicators

Group   Experience     Mode    n    Examinee separation reliability   Original rating rho
A       Novice         RR     36                .96                          .94
                       RDCJ   38                .91                          .90
        Experienced    RR     36                .98                          .95
                       RDCJ   38                .92                          .94
B       Novice         RR     37                .96                          .96
                       RDCJ   37                .89                          .92
        Experienced    RR     37                .96                          .94
                       RDCJ   37                .94                          .94

RR = rubric rating; RDCJ = randomly distributed comparative judgment.

Although reliability evidence may not, of itself, speak to the quality of the data, it provides important information relative to the consistency of the data collected and the subsequent potential for drawing valid conclusions. In addition to mixed-rater group analyses, we calculated the examinee separation reliability estimates for each rating group—Group A Novice, Group A Experienced, Group B Novice, and Group B Experienced—revealing that within each of the aforementioned homogenous groups, examinees judged by raters of similar experience had reproducible relative locations on a measure scale for the same essays. A comparison of examinee separation reliability among all rating groups for both methods is presented in Table 5. To be acceptable for high-stakes assessments, reliability should be above .80 (Nunnally & Bernstein, 1994). We found both rating methods (MFRM and RDCJ) and rater backgrounds (novice and experienced) exceeded this standard. Examinee separation reliability measures ranged from .89 to .98 and tended to be slightly higher for rubric rating and experienced raters. Close reliability indicators within each method for Group A show no significant difference in rating background. And although experienced raters demonstrated slightly greater reliability than novice raters in Group B, the reliability of both groups exceeded industry standards for high-stakes testing. Thus, novice and experienced raters were able to produce highly reliable results when employing the fully crossed RR with MFRM or RDCJ.

Evidence of Concurrent Validity. Evidence of concurrent validity was primarily derived from a Spearman’s rho correlation between (1) the initial MFRM-adjusted fair average scores and the study MFRM fair average scores as well as (2) the initial MFRM-adjusted fair average scores and the RDCJ true scores. The RDCJ true score is analogous to the MFRM fair average; in other words, the true score is an objective measurement of an examinee’s latent ability. The correlations for both methods and each rating group are presented in Table 5. Spearman’s rank correlation coefficients involving the original MFRM fair average score were significant, ranging from .90 to .96, p < .001.
The Spearman’s rank-order correlation coefficient provides correlational indicators of data quality. Results echo that of reliability as again no significant difference is evident between novice and experienced raters; between 92% and 96% of scores generated by participating raters, novice and experienced, can be explained by the original writing placement score. Novice RR demonstrated slightly higher correlations with the original scores than novice RDCJ, whereas experienced RR and RDCJ correlations with the original scores were essentially the same. Further, in all but one instance (Group B Novice raters), RDCJ correlations with the original MFRM fair average were slightly lower than RR correlations. However, in every case the data was highly correlated—rs > .85, p < .01—suggesting that both novice and experienced raters, utilizing either method, achieved similar results in terms of final scores: MFRM fair averages and RDCJ true scores.

Table 6. Word Count Correlations According to Background and Rating Mode

Group   Background     Mode    N    Pearson’s r with word count
A       Novice         RR     36             .79
                       RDCJ   38             .90
        Experienced    RR     36             .83
                       RDCJ   38             .89
B       Novice         RR     37             .89
                       RDCJ   37             .82
        Experienced    RR     37             .89
                       RDCJ   37             .80

RR = rubric rating; RDCJ = randomly distributed comparative judgment.

We addressed the potential for surface features to introduce construct-irrelevant variance through investigating the possibility that essay length acted as a proxy for writing quality. Using a Pearson’s r, we compared word counts with RR measure scores and RDCJ true scores. Results (see Table 6) indicated statistically significant correlations between word count and both RR and RDCJ scores. Nevertheless, it is important to note that high correlations between essay length and rater score are also common throughout the literature in more traditional rating contexts (Barkaoui, 2011; Kobrin, Deng, & Shaw, 2011; Wolfe, Song, & Jiao, 2016). Interestingly, word count exhibited higher correlations with rating order than rating method; word count correlations were higher with the second rating session (Group B for RR; Group A for RDCJ) regardless of rating mode or rater background.

The lack of distinct variance between the two rating methods, RR with MFRM and RDCJ, provides evidence that RDCJ is not uniquely prone to this potential for construct irrelevance. Yet the high correlation with word count for both methods is indicative of a strong connection. However, high word count correlations with scores may not necessarily equate to the scoring of nonconstruct-relevant indicators. In all writing, but perhaps especially so in an ESL setting, the amount of language the examinee can produce, particularly under time constraints, is a clear indicator of language and writing fluency.
Raw score rating disparity between experienced and novice raters supports what others have found: training minimizes rater effects (Eckes, 2011; McNamara, 1996). However, the remaining gap between experienced rater raw scores and initial MFRM-adjusted ratings points to the need for statistical procedures that account for rater bias. MFRM, under the heightened coverage conditions of eight raters per examinee, essentially eliminated the distinction between the two groups, demonstrating the strength of the model in accounting for rater effect. Under typical double-rating conditions with experienced raters, MFRM has consistently demonstrated its ability to successfully model out rater variance (Eckes, 2011; Myford & Wolfe, 2003). However, under similar double-rating conditions, the distinction between novice and experienced raters would likely be much greater; it is unlikely that MFRM would be able to model out all of the rater variance in novice raters. Interestingly, RDCJ is not subject to the same limitations and demonstrated comparable reliability for both novice and experienced raters.

Limitations
While this study provides important information relative to necessary conditions for thoroughly assessing the validity of an assessment approach, it nevertheless is not broad enough to encompass all essential aspects of argument-based validity theory. As such, it is but one step in a larger sequence of stages necessary for establishing the appropriateness of a particular assessment approach in any given context.
Additionally, despite the irregularities in the novice ratings, the MFRM analysis produced highly reliable results. It is likely, however, that the atypical fully crossed rubric-rating model, as well as the carefully selected stratified essays utilized in the study, overinflated the RR reliability indicators to some degree. A modified rating schedule with more typical rating samples may achieve different results. Further, this study investigated RDCJ in a new but potentially high-use context; however, in light of the concerns raised by Bramley (2015) regarding the reliability of ACJ, further research is needed. It is possible that claims made regarding the greater reliability of RDCJ are unfounded. This potential for unreliability and the undisclosed underlying algorithms of RDCJ within nomoremarking.com introduce possible sources of inaccuracy that could weaken the validity argument.
Finally, while the highly significant correlations between the original MFRM fair average scores and both the MFRM and RDCJ study-generated scores indicate they are likely assessing similar traits, it is not clear what those traits are. The word count correlations of this study provide some evidence; however, further analysis involving construct-relevant components such as T-units would provide important information relative to the construct validity of RDCJ.

Conclusions
Several important conclusions can be drawn from this study, not the least of which is that the resulting evidence indicates the consistency and potential generalizability of the data, thus furthering the investigation of CJ as a viable alternative to RR in terms of reliability. Additional steps need to be taken, however, within the argument-based validity framework before adopting CJ as a primary scoring method. As per the Knoch and Chapelle (2017) framework, these steps include, among other things, an investigation of the various potential interpretations of the data, a look at context-based decisions that can be made, and a review of the resulting consequences.
Further, while training was shown to increase intrarater reliability in RR, it was still inadequate. MFRM under typical double-rating conditions, though adequate for use with experienced raters, would also likely be unequal to the task of modeling out all novice rater variations. In contrast, under typical conditions, RDCJ produced analogous results regardless of rater background and proved appropriate for use in an ESL direct writing assessment context in terms of reliability.
Similar reliability indicators between experienced raters employing MFRM and experienced or novice raters using RDCJ spotlight practicality as a point of discrimination. While not the focus of this article, we did record the time it took to rate using each method. The time per essay to rate using RDCJ was just under 52 seconds faster than RR. A Mann–Whitney U test indicated that rating using RR (Mdn = 63.0) took significantly longer than RDCJ (Mdn = 11.3), U = 293081.500, p < .001. The difference in rating time and the additional time and resources required for rater training suggest that RDCJ might be a more practical rating method. However, more research is needed as part of a validity argument to make this claim.
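A comparison of this kind is straightforward to rerun on per-rating timing data; the sketch below uses SciPy rather than SPSS, and the timing lists are placeholders rather than the study’s records.

    from scipy.stats import mannwhitneyu

    # Placeholder per-essay rating times in seconds, one entry per rating event.
    rr_times   = [58, 71, 63, 90, 45, 66]    # rubric rating
    rdcj_times = [10, 14, 11, 9, 13, 12]     # paired comparative judgments

    u_stat, p_value = mannwhitneyu(rr_times, rdcj_times, alternative="two-sided")
    print(f"U = {u_stat:.1f}, p = {p_value:.4f}")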
Finally, whereas RR with MFRM automatically incorporates the rating scale in data reports, RDCJ true score conversion requires the inclusion of examinee essays already rated using RR to act as anchors against which the relationship of new examinees can be measured. Should RDCJ prove to be a valid assessment approach, perhaps a marriage of the two methods would capitalize on the strengths of each: smaller numbers of highly trained raters could identify benchmarks to anchor RDCJ rankings while larger numbers of less experienced raters would then perform RDCJ on the bulk of the examinees, delivering quality results that are both reliable and practical.

Note
¹ Swiss rounds are typically used when the number of competitors makes the inclusion of all potential pairings infeasible. After an initial random round, subsequent rounds pair according to wins and losses and match pairs with similar scores.

References
Andrich, D. (1978). Relationships between the Thurstone and Rasch approaches to item scaling. Applied Psychological Measurement, 2(3), 449–460.
Barkaoui, K. (2011). Effects of marking method and rater experience on ESL essay scores and rater performance. Assessment in Education: Principles, Policy & Practice, 18(3), 279–293.
Beare, K. (2017, May). How many people learn English? ThoughtCo. Retrieved August 4, 2017, from https://www.thoughtco.com/how-many-people-learn-english-globally-1210367
Bramley, T. (2007). Paired comparison methods. In P. Newton, J. Baird, H. Goldstein, H. Patrick, & P. Tymms (Eds.), Techniques for monitoring the comparability of examination standards (pp. 246–294). London: Qualifications and Curriculum Authority.
Bramley, T. (2015). Investigating the reliability of adaptive comparative judgment. Cambridge Assessment Research Report. Cambridge, UK: Cambridge Assessment.
Bramley, T., & Wheadon, C. (2015, November). The reliability of adaptive comparative judgment. AEA-Europe Annual Conference, Cambridge, UK: Cambridge Assessment. Retrieved from http://www.cambridgeassessment.org.uk/Images/296241-the-reliability-of-adaptive-comparative-judgment.pdf (accessed March 26, 2020).
Breland, H. M. (1983). The direct assessment of writing skill: A measurement review (College Board Report No. 83.6; ETS RR No. 83.82).
Brown, J. (2012). Developing, using, and analyzing rubrics in language assessment with case studies in Asian and Pacific languages. Honolulu, HI: National Foreign Language Resource Center.
Eckes, T. (2011). Introduction to many-facet Rasch measurement: Analyzing and evaluating rater-mediated assessments. Frankfurt, Germany: Lang.
Engelhard, G., Jr. (1992). The measurement of writing ability with a many-faceted Rasch model. Applied Measurement in Education, 5(3), 171–191.
Evans, N. W., Hartshorn, K. J., Martin, T., & Cox, T. (2014). Measuring written linguistic accuracy with weighted clause ratios: A question of validity. Journal of Second Language Writing, 24, 33–50.
Fechner, G. T. (1965). Elements of psychophysics. In R. J. Herrnstein & E. G. Boring (Eds.), A source book in the history of psychology (pp. 66–75). Cambridge, MA: Harvard University Press.
Goffin, R. D., & Olson, J. M. (2011). Is it all relative? Comparative judgments and the possible improvement of self-ratings and ratings of others. Perspectives on Psychological Science, 6(1), 48–60.
Greenberg, K. L. (1992). Validity and reliability issues in the direct assessment of writing. Writing Program Administration, 16(1–2), 7–22.
Heldsinger, S. A., & Humphry, S. M. (2013). Using calibrated exemplars in the teacher-assessment of writing: An empirical study. Educational Research, 55(3), 219–235.
Huot, B. (1990). Reliability, validity, and holistic scoring: What we know and what we need to know. College Composition and Communication, 41(2), 201–213.
Huot, B., O’Neill, P., & Moore, C. (2010). A useable past for writing assessment. College English, 72(5), 495–517.
Jones, I., & Alcock, L. (2014). Peer assessment without assessment criteria. Studies in Higher Education, 39(10), 1774–1787.
Jones, I., & Inglis, M. (2015). The problem of assessing problem solving: Can comparative judgement help? Educational Studies in Mathematics, 89(3), 337–355.
Jones, I., Swan, M., & Pollitt, A. (2015). Assessing mathematical problem solving using comparative judgement. International Journal of Science and Mathematics Education, 13(1), 151–177.
Jones, I., & Wheadon, C. (2015). Peer assessment using comparative and absolute judgement. Studies in Educational Evaluation, 47, 93–101.
Jonsson, A., & Svingby, G. (2007). The use of scoring rubrics: Reliability, validity and educational consequences. Educational Research Review, 2, 130–144.
Kane, M. T. (2013). Validating the interpretations and uses of test scores. Journal of Educational Measurement, 50(1), 1–73.
Knoch, U., & Chapelle, C. A. (2017). Validation of rating processes within an argument-based framework. Language Testing. https://doi.org/10.1177/0265532217710049
Kobrin, J. L., Deng, H., & Shaw, E. J. (2011). The association between SAT prompt characteristics, response features, and essay scores. Assessing Writing, 16(3), 154–169. https://doi.org/10.1016/j.asw.2011.01.001
Laming, D. (2004). Human judgment: The eye of the beholder. London, UK: Thomson.
Linacre, J. M. (2017). FACETS computer program for many-facet Rasch measurement (Version 3.80.0) [Software]. Beaverton, OR: Winsteps.com
Linacre, J. M. (2018). FACETS computer program for many-facet Rasch measurement (Version 3.81.0). Beaverton, OR: Winsteps.com
McMahon, S., & Jones, I. (2015). A comparative judgement approach to teacher assessment. Assessment in Education: Principles, Policy & Practice, 3, 368–389.
McNamara, T. F. (1996). Measuring second language performance. New York: Longman.
Messick, S. (1995). Validity of psychological assessment: Validation of inferences from person’s responses and performances as scientific inquiry into score meaning. American Psychologist, 50(9), 741–749.
Metzgar, M. (2016). Using adaptive comparative judgment to assess student work in an MBA course. International Journal for Infonomics, 9(3), 1217–1219.
Myford, C. M., & Wolfe, E. W. (2003). Detecting and measuring rater effects using many-facet Rasch measurement: Part I. Journal of Applied Measurement, 4(4), 386–422.
No More Marking Ltd. (2019). No More Marking. Retrieved November 15, 2018, from https://www.nomoremarking.com/judges?tab=MonitorJudges
Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory (3rd ed.). New York: McGraw-Hill.
O’Neill, P. (2011). Reframing reliability for writing assessment. The Journal of Writing Assessment, 4(1), n.p.
Pollitt, A. (2004). Let’s stop marking exams. Paper presented at the meeting of the IAEA Conference, Philadelphia. Retrieved from http://www.cambridgeassessment.org.uk/Images/109719-let-s-stop-marking-exams.pdf (accessed March 26, 2020).
Pollitt, A. (2012). The method of adaptive comparative judgement. Assessment in Education: Principles, Policy & Practice, 19(3), 281–300.
Steedle, J. T., & Ferrara, S. (2016). Evaluating comparative judgment as an approach to essay scoring. Applied Measurement in Education, 29(3), 211–223.
Thurstone, L. L. (1927). A law of comparative judgment. Psychological Review, 34(4), 273–286.
Turner, H., & Firth, D. (2012). Bradley-Terry models in R: The BradleyTerry2 package. Journal of Statistical Software, 48(9), 1–21.
Vasquez-Espinosa, R. E., & Conners, R. W. (1982). The law of comparative judgment: Theory and implementation (RSIP TR 403.82). Baton Rouge, LA: Louisiana State University. Retrieved from http://www.dtic.mil/dtic/tr/fulltext/u2/a136169.pdf (accessed March 26, 2020).
Weigle, S. C. (1999). Investigating rater/prompt interactions in writing assessment: Quantitative and qualitative approaches. Assessing Writing, 6(2), 145–178.
Wheadon, C. (2015, February 10). The opposite of adaptivity? [Web log post]. Retrieved August 7, 2017, from https://blog.nomoremarking.com/the-opposite-of-adaptivity-c26771d21d50
Whitehouse, C., & Pollitt, A. (2012). Using adaptive comparative judgement to obtain a highly reliable rank order in summative assessment. Manchester, UK: Centre for Education Research and Policy.
Wiliam, D. (2001). Reliability, validity, and all that jazz. Education 3–13, 29(3), 17–21.
Wilson, M., & Case, H. (1997). An examination of variation in rater severity over time: A study in rater drift (Berkeley Evaluation and Assessment Research Center report). Berkeley, CA: UC Berkeley.

Winke, P., & Lim, H. (2015). ESL essay raters’ cognitive processes in applying the Jacobs et al. rubric: An eye-movement study. Assessing Writing, 25, 37–53.
Wolfe, E. W. (2005). Uncovering rater’s cognitive processing and focus using think-aloud protocols. Journal of Writing Assessment, 2(1), 37–56.
Wolfe, E. W., Song, T., & Jiao, H. (2016). Features of difficult-to-score essays. Assessing Writing, 27, 1–10. https://doi.org/10.1016/j.asw.2015.06.002
Yancey, K. B. (1999). Looking back as we look forward: Historicizing writing assessment. College Composition and Communication, 50(3), 483–503.

