Assessing Writing 43 (2020) 100416


Do raters use rating scale categories consistently across analytic rubric domains in writing assessment?
Stefanie A. Wind
Educational Studies in Psychology, Research Methodology, and Counseling, The University of Alabama, 313C Carmichael Hall, United States
E-mail address: swind@ua.edu.
https://doi.org/10.1016/j.asw.2019.100416
Received 15 February 2019; Received in revised form 21 May 2019; Accepted 10 July 2019
Available online 22 July 2019
1075-2935/ © 2019 Elsevier Inc. All rights reserved.

ARTICLE INFO

Keywords:
Rater effects
Rating scales
Performance assessment
Rubrics

ABSTRACT

Analytic rubrics for writing assessments are intended to provide diagnostic information regarding students’ strengths and weaknesses related to several domains, such as the meaning and mechanics of their composition. Although individual domains refer to unique aspects of student writing, the same rating scales are often applied across different domains. Accordingly, the interpretation of rating scale categories is intended to be invariant across domains. However, this hypothesis is rarely examined empirically. The purpose of this study is to illustrate and explore methods for evaluating the degree to which raters apply a common rating scale consistently across domains in analytic writing assessments. Data from a rater training procedure for a rater-mediated writing assessment serve as a case study for the illustrative analyses. Results indicated a lack of invariance in rating scale category functioning across domains for several raters. Implications are discussed in terms of how evidence of differential rating scale category use can be used to inform rater training and assessment development procedures.

1. Introduction

In the context of writing assessment, analytic rubrics provide insight into strengths and weaknesses in student writing that can
inform instruction (Crusan, 2015). Analytic rubrics require raters to apply a rating scale to distinct aspects of student compositions
(i.e., domains), such as meaning and mechanics. For example, a rating scale with five ordered categories might reflect increasing
levels of control of each domain, such as lack of control, minimal control, sufficient control, consistent control, and full command. By
including a common set of categories, analytic rubrics facilitate comparisons of test-takers’ achievement across domains. For ex-
ample, a student may demonstrate consistent control in meaning but minimal control in mechanics. However, the degree to which
rating scale categories share a consistent interpretation across domains, as well as across raters, is an empirical question (Trace,
Meier, & Janssen, 2016). For example, a consistent interpretation of rating scale categories across domains would imply that the
difference between minimal control and sufficient control reflects the same difference in writing achievement for the meaning domain
as it does for the mechanics domain, and that these differences are consistent across raters. However, researchers and practitioners
rarely examine this hypothesis of consistent category functioning empirically.
Empirical evidence of a comparable interpretation of rating scale categories across domains and raters is central to the meaningful
interpretation of assessment results based on analytic rubrics. Several researchers have demonstrated procedures for evaluating the
comparability of rating scale functioning across one of the facets in a performance assessment system, such as items (e.g., Linacre,
2002; Penfield, Myers, & Wolfe, 2008), prompts (e.g., Engelhard & Wind, 2013), and individual raters (e.g., Wesolowski, Wind, &
Engelhard, 2016). However, these procedures focus on comparing the structure of rating scales across levels of a single variable (e.g.,
across raters), and thus do not provide evidence regarding the comparability of rating scales across combinations of facets, such as
individual raters and domains. Evidence of comparable rating scale functioning across domains at the level of individual raters
reflects idiosyncrasies in rating scale functioning related to each combination of individual raters and domains that are not captured in
overall examinations of rating scale functioning across raters or domains.
Moreover, current practice in rater-mediated assessment usually includes calculating rater reliability and agreement statistics, or,
if a modern measurement theory approach is used, calculating indicators of rater severity, model-data fit, and biases (e.g., differential
rater functioning; Wesolowski, Wind, & Engelhard, 2017). Although they provide insight into important characteristics of these
assessments, such analyses do not reveal the degree to which rating scale categories have a consistent interpretation across raters and
domains. This information is important because it allows researchers and practitioners to understand individual raters’ interpretation
and use of rating scales specific to rubric domains. Evidence of comparable rating scale interpretation ensures meaningful com-
parisons of ratings across raters and domains for a variety of purposes, including identifying test-takers’ strengths and weaknesses.
These details can guide rater training or re-training procedures, the interpretation and use of ratings, and the revision of scoring
materials in rater-mediated performance assessments.

2. Evaluating rating scale functioning

In previous studies, several researchers empirically examined the degree to which rating scales function as expected in rater-
mediated performance assessments. In many of these studies, the researchers used Rating Scale (RS) model (Andrich, 1978) or Partial
Credit (PC) model (Masters, 1982) formulations of the Many-Facet Rasch (MFR) model (Linacre, 1989) for their analyses. The MFR
model is a latent trait model based on Rasch measurement theory (Rasch, 1960) that allows researchers to calculate location esti-
mates for a customized set of facets (i.e., factors or variables) that reflect a given assessment, such as students, raters, and domains.
The RS formulation of the MFR model (RS-MFR) provides information about rating scale functioning over all of the facets in the
model. For example, the following model is a RS-MFR model that would allow an analyst to examine the structure of a rating scale
across students, domains, and raters in a rater-mediated assessment:

\ln\left( \frac{P_{nijk}}{P_{nij(k-1)}} \right) = \theta_n - \lambda_i - \delta_j - \tau_k \qquad (1)

where
Pnijk/Pnij(k−1) = the probability of Student n receiving a rating in category k, rather than in category k − 1;
θn = the location of Student n on the construct (i.e., ability);
λi = the location of Rater i on the construct (i.e., severity);
δj = the logit-scale location of Domain j on the construct (i.e., difficulty);
τk = the location on the construct where a rating in Category k and a rating in Category k − 1 are equally probable.
The RS-MFR model provides estimates of each element within each facet (e.g., individual students, individual raters, and in-
dividual domains) that represent their relative locations on a common scale that represents the construct. The MFR model also
provides estimates that describe the location on the construct that corresponds to the transition point (i.e., threshold) between
adjacent rating scale categories—such that one can examine the difficulty associated with receiving a rating in a particular category,
rather than the category just below it.
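To make the model concrete, the following minimal sketch (written in Python, not the software used for the analyses reported in this study) computes the category probabilities implied by Eq. (1) for a single student-rater-domain combination; the function name and the numeric values are illustrative assumptions only.

```python
import math

def category_probabilities(theta, severity, difficulty, thresholds):
    """Category probabilities implied by an adjacent-categories Rasch model
    such as Eq. (1), for one student-rater-domain combination.

    thresholds: tau_1 ... tau_m for an (m + 1)-category scale; tau_0 is fixed at 0.
    """
    eta = theta - severity - difficulty
    exponents, running = [0.0], 0.0          # category 0 contributes exp(0)
    for tau in thresholds:
        running += eta - tau                 # ln(P_k / P_{k-1}) = eta - tau_k
        exponents.append(running)
    denominator = sum(math.exp(x) for x in exponents)
    return [math.exp(x) / denominator for x in exponents]

# Illustrative values on the logit scale: a student at 0.50, a rater severity of
# 0.10, a domain difficulty of -0.20, and thresholds at -1.0, 0.0, and 1.0.
print([round(p, 3) for p in category_probabilities(0.50, 0.10, -0.20, [-1.0, 0.0, 1.0])])
```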
To illustrate the concept of rating scale category thresholds, Fig. 1 is a plot that shows the probability that test-takers with various
locations on a construct will receive a rating in each category of a four-category rating scale from Rater i. The x-axis shows the
estimated test-taker locations ordered from low-achieving to high-achieving, and the y-axis shows probabilities. Separate lines re-
present the probability associated with receiving a rating in each category. Test-takers with lower locations are more likely to receive a rating in a lower category, and test-takers with higher locations are more likely to receive a rating in a higher category. The intersection points between the curves for adjacent categories (vertical lines) are the rating scale category thresholds (τ). The thresholds are the locations on the latent variable at which a test-taker is equally likely to receive a rating in either of two adjacent categories.

Fig. 1. Illustration of Rating Scale Category Thresholds.
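As a numerical illustration of this definition (a sketch with invented parameter values, not estimates from the data analyzed in this study), the probabilities implied by Eq. (1) can be scanned over a grid of test-taker locations; adjacent category response functions cross at θ = λ + δ + τk, which is the threshold location marked by the vertical lines in Fig. 1.

```python
import math

def category_probabilities(theta, severity, difficulty, thresholds):
    # Adjacent-categories Rasch probabilities, as in the sketch following Eq. (1).
    eta = theta - severity - difficulty
    exps, run = [0.0], 0.0
    for tau in thresholds:
        run += eta - tau
        exps.append(run)
    denom = sum(math.exp(x) for x in exps)
    return [math.exp(x) / denom for x in exps]

severity, difficulty = 0.25, -0.10      # invented rater and domain locations
thresholds = [-1.2, 0.1, 1.3]           # invented tau_1 ... tau_3

grid = [x / 100 for x in range(-400, 401)]   # test-taker locations from -4 to 4 logits
for k in range(1, len(thresholds) + 1):
    crossing = min(grid, key=lambda t: abs(
        category_probabilities(t, severity, difficulty, thresholds)[k]
        - category_probabilities(t, severity, difficulty, thresholds)[k - 1]))
    print(f"Categories {k - 1} and {k} are equally probable near theta = {crossing:.2f} "
          f"(expected: {severity + difficulty + thresholds[k - 1]:.2f})")
```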
On the other hand, by specifying a PC formulation of the MFR model, one can estimate separate rating scale category thresholds
for elements within a researcher-specified facet, such as individual raters or individual domains. For example, the following model
would allow an analyst to examine the structure of a rating scale separately for individual raters:

\ln\left( \frac{P_{nijk}}{P_{nij(k-1)}} \right) = \theta_n - \lambda_i - \delta_j - \tau_{ik} \qquad (2)
where all of the terms are defined as in Eq. (1), except for the threshold (τik), which is the location on the construct where a rating in Category k and a rating in Category k − 1 are equally probable, specific to Rater i. By estimating this model, analysts can examine the properties of rating scale categories, specific to each rater. It is also possible to specify the model such that separate rating scale category thresholds are estimated for each domain:

\ln\left( \frac{P_{nijk}}{P_{nij(k-1)}} \right) = \theta_n - \lambda_i - \delta_j - \tau_{jk} \qquad (3)
where all of the terms are defined as in Eq. (1) except for τjk, which represents the location on the construct where a rating in Category k and a rating in Category k − 1 are equally probable, specific to Domain j.
Analyses with models such as Eq. (1) through (3) provide insight into rating scale functioning, including the degree to which the
categories are ordered in the intended direction (i.e., higher categories reflect higher levels of achievement) and represent mean-
ingfully distinct levels of achievement. When one uses the PC-MFR formulation, it is also possible to examine the degree to which the
difficulty associated with each category threshold is comparable across raters (Eq. (2)) or domains (Eq. (3)). I discuss these indices in
more detail later in the manuscript.
A number of researchers have used models such as those shown in Eq. (1) through (3) to evaluate rating scale category func-
tioning in rater-mediated performance assessments in a variety of domains. For example, Engelhard and Wind (2013) examined
rating scale category functioning across domains in the Advanced Placement Statistics exam using models similar to Eq. (1) and (3).
In their analysis, these authors observed that the interpretation of rating scale categories was not consistent across domains—such
that rating scale categories had different interpretations depending on the particular aspect of statistics that was being evaluated. In
another study, Wesolowski et al. (2016) used a model similar to Eq. (2) to examine the comparability of rating scale functioning
across raters in a rater-mediated music performance assessment. These researchers observed substantial differences in category use
across raters, such that the categories had different interpretations depending on which rater scored a test-taker’s musical perfor-
mance. Likewise, Wind, Tsai, Grajeda, and Bergin (2018) applied a model similar to Eq. (3) to evaluate the extent to which rating scale categories had a comparable interpretation across domains in a teacher evaluation program based on classroom observations.
These researchers found that, although the raters used the scale categories in the intended direction, there were differences in the
structure of the rating scale across domains.
Although researchers have frequently used RS-MFR and PC-MFR models to examine rating scale functioning in rater-mediated
performance assessments, it is not widely recognized that it is possible to specify a version of the PC-MFR model in which the
threshold parameter is estimated separately for combinations of facets. Instead, researchers typically examine rating scale functioning
for one facet at a time. In this study, I illustrate how specifying the MFR model such that the rating scale structure is estimated
separately for two or more facets can provide additional insight into rating scale category functioning for a rater-mediated writing
assessment. Thus, this approach is different from previous studies (e.g., Wesolowski et al., 2016; Wind et al., 2018; Penfield et al.,
2008) because it involves estimating rating scale category thresholds separately for a combination of two facets, rather than one facet.
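The distinction between these specifications can be summarized as a difference in how the threshold parameters are indexed. The sketch below (illustrative values only, not estimates from any dataset) shows a common scale, rater-specific scales, domain-specific scales, and rater-by-domain scales as progressively finer-grained lookups; the probability computation itself is unchanged.

```python
# Illustrative threshold sets; the values and element labels are invented.
taus_common = [-1.0, 0.0, 1.0]                                         # one scale for all facets
taus_by_rater = {"R01": [-1.1, 0.1, 1.0], "R02": [-0.7, -0.4, 1.1]}    # one scale per rater
taus_by_domain = {"Content": [-0.9, 0.1, 1.0],
                  "Language": [-1.1, 0.1, 1.0]}                        # one scale per domain
taus_by_rater_domain = {("R01", "Content"): [-0.6, 0.6, 1.4],
                        ("R01", "Language"): [-0.3, 0.1, 0.3]}         # one scale per combination

def thresholds_for(rater, domain, level):
    """Return the threshold vector used for one observation at each level of detail."""
    if level == "common":
        return taus_common
    if level == "rater":
        return taus_by_rater[rater]
    if level == "domain":
        return taus_by_domain[domain]
    return taus_by_rater_domain[(rater, domain)]

print(thresholds_for("R01", "Language", level="rater-domain"))
```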

3. Purpose

The purpose of this study is to illustrate methods that researchers and practitioners can use to examine the degree to which raters
apply a rating scale consistently across domains in analytic writing assessments. This procedure involves examining rating scale
functioning for individual rater/domain combinations. In light of the methodological focus of this manuscript, I used an illustrative
data analysis to demonstrate how analysts can use modern measurement theory models to address the following questions in rater-
mediated performance assessment contexts:

1 To what extent does the entire group of raters use the rating scale categories as expected?
2 To what extent is rating scale category functioning comparable across individual raters?
3 To what extent is rating scale category functioning comparable across individual domains?
4 To what extent is rating scale category functioning comparable across rater-domain combinations?

The analyses in this manuscript are illustrations of techniques that researchers and practitioners can use to gather numeric and
graphical evidence of rating scale category functioning to evaluate the degree to which rating scales function in a comparable way
across rater-domain combinations in other writing assessment contexts. Accordingly, the current study builds on previous studies in
which researchers have examined rating scale functioning across single facets in an assessment system and shows how these tech-
niques can be extended to examine rating scale functioning for combinations of facets. The results from this study are not intended to
be generalizable to other assessment contexts, but rather serve as an illustration of rating scale category functioning analyses.

4. Instrument

The illustrative dataset is from a training program in which raters scored essays as part of a qualification examination prior to
scoring an English Language Proficiency (ELP) assessment. Results from the ELP assessment are used to inform English course
placement decisions for undergraduate students in a teacher education program. The illustrative dataset includes 100 essays and 18
raters. Each rater scored each essay using a four-category rating scale (0=Limited Proficiency; 1=Emerging Proficiency; 2=Moderate
Proficiency; 3=High Proficiency) and an analytic rubric with three domains: Content, Language, and Organization. These ratings were
obtained at the midpoint of the 10-hour rater training program. At the midpoint and end of the rater training program, raters were
given a set of exemplar essays to rate, and their ratings were compared to criterion scores on the same essays; different exemplar
essays were used in the midpoint and final assessments. Results from the midpoint assessment were used to inform the remainder of
rater training. Raters were required to pass the final assessment before they could participate in operational scoring. Details about the
cut score for passing the final assessment and any other details about raters’ background characteristics were not available to the
author.
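For readers who wish to conduct analyses of this kind with their own ratings, the data can be arranged in a long format with one row per essay-rater-domain observation. The sketch below builds a synthetic stand-in with the same dimensions as the illustrative dataset (100 essays, 18 raters, three domains, ratings from 0 to 3); the ratings themselves are randomly generated placeholders, not the ELP data.

```python
from itertools import product
import random

random.seed(1)
essays = [f"E{n:03d}" for n in range(1, 101)]
raters = [f"R{i:02d}" for i in range(1, 19)]
domains = ["Content", "Language", "Organization"]

# One row per essay-rater-domain combination: 100 x 18 x 3 = 5,400 observations.
ratings = [{"essay": e, "rater": r, "domain": d, "rating": random.randint(0, 3)}
           for e, r, d in product(essays, raters, domains)]

print(len(ratings))   # 5400
print(ratings[0])     # e.g., {'essay': 'E001', 'rater': 'R01', 'domain': 'Content', 'rating': ...}
```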

5. Data analysis

To address the research questions for this study, I used the Facets software program (Linacre, 2015) to analyze the ELP data with
four models. Model One is the RS-MFR model shown in Eq. (1). This model provided estimates of the rating scale structure over all of
the students, domains, and raters. Model Two is the PC-MFR model shown in Eq. (2). This model provides estimates of rating scale
category thresholds specific to each rater who scored the ELP assessment. Model Three is the PC-MFR model shown in Eq. (3). This
model provides estimates of rating scale category thresholds specific to each of the domains in the ELP assessment (Content, Lan-
guage, and Organization). Finally, Model Four is a PC-MFR model in which I estimated the structure of the rating scale separately for
each rater-by-domain combination:

Pnijk
ln = n i j ijk ,
Pnijk 1 (4)

where all of the terms are defined in the same way as in Eq. (1), except for the threshold parameter (τijk). As shown in Eqs. (2) and (3),
this parameter frequently includes two subscripts that indicate that the logit-scale locations for each of the k − 1 rating scale
categories are estimated separately across each level of a single facet. In this model, I have included an additional subscript in the
threshold parameter (τijk) to indicate that the thresholds are estimated separately for each rater-by-domain combination. Applying all
four models allowed me to consider the degree to which Model Four provides additional information about rating scale functioning
compared to more commonly used models.

5.1. Rating scale category functioning

Because my research questions are focused on the rating scale categories, the calibrations of the rating scale category threshold
parameters (τ) were the primary focus of my analyses. Specifically, I used the threshold values to explore numeric and graphical
indicators of rating scale category functioning related to three main categories of evidence: (1) directionality; (2) category precision; and (3) category invariance. I adapted these categories from Engelhard and Wind (2018).
When interpreting the results from rating scale category functioning analyses, it is important to consider the need for relatively
large sample sizes to obtain stable estimates of rating scale threshold parameters (τ). Linacre (2000) recommended a minimum
sample size of 10 observations in each rating scale category for each of the facets across which scales are estimated. This requirement
has different implications for each of the models in the current analysis: For Model One, this requirement applies to the number of
observations in each category across all test-takers, raters, and domains. For Model Two, this requirement applies to the number of
observations in each category for each rater. For Model Three, this requirement applies to the number of observations in each
category for each domain. Finally, for Model Four, this requirement applies to the number of observations from each rater in each
domain.
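A sketch of this check is shown below: it counts the observations in each rating scale category for the grouping that each model uses and flags counts below Linacre's (2000) minimum of 10. The synthetic ratings mirror the long-format layout sketched in Section 4 and are placeholders rather than the ELP data; with real, skewed rating distributions, the Model Four groupings are the most likely to fall below the minimum.

```python
from collections import Counter
from itertools import product
import random

# Synthetic stand-in for the long-format ratings (essay x rater x domain, scored 0-3).
random.seed(1)
ratings = [{"rater": r, "domain": d, "rating": random.randint(0, 3)}
           for _, r, d in product(range(100), range(1, 19),
                                  ["Content", "Language", "Organization"])]

def sparse_categories(ratings, key=lambda obs: "all", categories=range(4), minimum=10):
    """Return (group, category) pairs with fewer than `minimum` observations."""
    counts = Counter((key(obs), obs["rating"]) for obs in ratings)
    groups = {group for group, _ in counts}
    return [(g, k) for g in sorted(groups, key=str) for k in categories
            if counts[(g, k)] < minimum]

# Model One pools all observations; Models Two, Three, and Four group the counts
# by rater, by domain, and by rater-domain combination, respectively.
print(sparse_categories(ratings))
print(sparse_categories(ratings, key=lambda o: o["rater"]))
print(sparse_categories(ratings, key=lambda o: o["domain"]))
print(sparse_categories(ratings, key=lambda o: (o["rater"], o["domain"])))
```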

5.2. Category directionality

The first category of evidence of category functioning includes indicators of the degree to which higher rating scale categories
correspond to higher locations on the construct—that is, the extent to which students who receive high ratings have high levels of
achievement. One can evaluate directionality using estimates of the rating scale category thresholds from Models One through Four,
where monotonically non-decreasing values across increasing categories suggest that the group of raters has used the categories in the
intended order over all students and domains (Model One), individual raters have used the categories in the intended order across
domains (Model Two), raters use the categories in the intended order within each individual domain (Model Three), or individual
raters have used the categories in the intended order for a particular domain (Model Four).

Fig. 2. Illustration of Category Response Functions for two Rater-Domain Combinations.

5.3. Category precision

The second category of evidence of rating scale category functioning includes indicators of the degree to which rating scale
categories reflect unique levels of student achievement. One can evaluate rating scale category precision by examining the dis-
tribution of ratings across categories, as well as the distance between rating scale category thresholds overall (Model One), or specific
to raters (Model Two), domains (Model Three), or rater-domain combinations (Model Four). Following Linacre (2002), evidence of
normally or uniformly distributed ratings across categories, and distances of about 1.4 to 5 logits between adjacent thresholds provide
evidence of category precision. Evidence of category precision provides support for the interpretation of different rating scale ca-
tegories as descriptors of distinct levels of student achievement.
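A sketch of these numeric checks appears below, combining the ordering check from Section 5.2 with Linacre's (2002) guideline of roughly 1.4 to 5 logits between adjacent thresholds; the threshold vectors are invented for illustration rather than taken from the ELP estimates.

```python
def check_threshold_set(thresholds, lower=1.4, upper=5.0):
    """Directionality and precision checks for one set of category thresholds."""
    distances = [round(t2 - t1, 2) for t1, t2 in zip(thresholds, thresholds[1:])]
    return {
        "ordered": all(d >= 0 for d in distances),        # non-decreasing thresholds
        "distances": distances,
        "too_close": [d for d in distances if d < lower], # categories may not be distinct
        "too_wide": [d for d in distances if d > upper],  # categories may be too coarse
    }

print(check_threshold_set([-1.5, 0.0, 1.5]))    # ordered, adequately spaced categories
print(check_threshold_set([0.15, 0.10, -0.25])) # disordered, non-distinct categories
```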
It is also useful to examine graphical displays of the probability of ratings within each category as an additional indicator of rating
scale category precision. Fig. 2 includes two examples of Category Response Functions (CRFs). One can construct these plots for the
overall calibration of a rating scale (Model One), or for individual raters (Model Two), individual domains (Model Three), or rater-
domain combinations (Model Four). The example plots show CRFs for two rater-domain combinations. These plots have the same
format as Fig. 1. When rating scale categories are precise, the CRFs each have unique “peaks,” or ranges along the x-axis at which
they are most probable; the first plot in Fig. 2 (A) illustrates this property. In contrast, the second plot in Fig. 2 (B) shows an example
of non-distinct CRFs. In this plot, Rater i does not use the second and third rating scale categories to describe distinct levels of student
achievement.
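The visual inspection described above can also be approximated numerically, as in the sketch below: scan a grid of locations (expressed relative to the relevant rater and domain) and record which categories are ever the single most probable one. The parameter values are invented for illustration.

```python
import math

def category_probabilities(relative_location, thresholds):
    # Adjacent-categories Rasch probabilities, with the location expressed
    # relative to the rater severity and domain difficulty.
    exps, run = [0.0], 0.0
    for tau in thresholds:
        run += relative_location - tau
        exps.append(run)
    denom = sum(math.exp(x) for x in exps)
    return [math.exp(x) / denom for x in exps]

def modal_categories(thresholds, grid=None):
    """Return the set of categories that are most probable somewhere on the grid."""
    grid = grid if grid is not None else [x / 50 for x in range(-300, 301)]
    modal = set()
    for location in grid:
        probs = category_probabilities(location, thresholds)
        modal.add(probs.index(max(probs)))
    return modal

print(modal_categories([-1.5, 0.0, 1.5]))   # {0, 1, 2, 3}: every category has a distinct peak
print(modal_categories([-0.3, -0.3, 0.6]))  # category 1 is never modal, as in Fig. 2 (B)
```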

5.4. Rating scale category invariance

The last category of evidence for evaluating rating scale category functioning is rating scale category invariance. This category of
evidence applies only to the PC-MFR models (Models Two through Four), and it allows researchers to evaluate the degree to which the
relative difficulty of each rating scale category is comparable for different raters (Model Two), domains (Model Three), or rater-
domain combinations (Model Four). Substantial differences in the threshold locations between raters (Model Two), domains (Model
Three), or rater-domain combinations (Model Four) suggest that the difficulty of rating scale categories is not consistent across raters,
domains, or rater-domain combinations, respectively.
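A simple numeric summary of this comparison is sketched below: for each threshold, compute the spread of its location across the elements of interest (raters here; domains or rater-domain combinations work the same way). The threshold values are invented for illustration.

```python
# Invented rater-specific thresholds, expressed relative to each rater's overall location.
thresholds_by_rater = {
    "R01": [-0.70, 0.20, 0.50],
    "R02": [-0.60, -0.40, 1.10],
    "R03": [-1.40, 0.00, 1.50],
}

for k in range(3):
    locations = sorted(taus[k] for taus in thresholds_by_rater.values())
    spread = locations[-1] - locations[0]
    print(f"tau_{k + 1}: min = {locations[0]:.2f}, max = {locations[-1]:.2f}, "
          f"range = {spread:.2f}")
```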

6. Results

With regard to the minimum sample size requirements within rating scale categories (Linacre, 2000), the minimum number of observations (≥10) was met for all of the models except Model Four. This is not surprising, given the number of raters and test-takers
in the data. Recognizing the limitations associated with the violation of this minimum sample size requirement, I will present the
results from Model Four and note when the interpretation of certain thresholds is compromised due to infrequent observations in a
rating scale category.
In the following paragraphs, I present results from my analysis of the ELP data with each model as they relate to the three
categories of evidence of rating scale category functioning. I provide more detailed information for Model Four compared to the other
models because this model has been represented less frequently in previous research on rater-mediated assessments.

Fig. 3. Category Response Functions for the Overall Rating Scale (Model One).

6.1. Model one: common rating scale

The analyses based on Model One provide information that analysts can use to address the first research question: To what extent
does the entire group of raters use the rating scale categories as expected?

6.2. Category directionality

The threshold locations that I estimated using Model One were as follows: τ1 = −0.91, τ2 = −0.01, τ3 = 0.92. These values are
non-decreasing in location across increasing rating scale categories. Accordingly, the results from Model One suggest that the rating
scale categories are ordered as expected for the ELP assessment.

6.3. Category precision

Based on the threshold estimates noted above, the distance between the thresholds was somewhat smaller than Linacre’s re-
commended distance of 1.4 logits for meaningfully distinct rating scale categories. However, inspection of the CRFs for the overall
rating scale (Fig. 3) indicates that the rating scale categories each described unique locations on the latent variable—suggesting that,
although the distinct ranges of achievement were relatively small, the raters used the categories to distinguish among students with
different levels of writing achievement.

6.4. Model two: rating scale varies across raters

The analyses based on Model Two provide information that analysts can use to address the second research question: To what
extent is rating scale category functioning comparable across individual raters?

6.5. Category directionality

Table 1 (A) shows the threshold locations that I estimated using Model Two. For all of the raters except Rater 12, the thresholds
were non-decreasing over increasing categories in the ordinal rating scale—indicating that 17 of the 18 raters interpreted the rating
scale categories in the intended order. For Rater 12, all three categories were disordered (τ1 = 0.14 > τ2 = 0.10 > τ3 =
−0.24)—indicating that this rater interpreted the category order in the opposite direction as intended.

6.6. Category precision

Table 1 (B) shows the absolute value of the distance between adjacent rating scale categories for each of the raters based on Model
Two. I used double asterisks to mark distances that were less than Linacre’s (2002) recommendation of 1.4 logits for distinct categories.
Although the distance between at least one pair of adjacent rating scale categories was smaller than 1.4 logits for most of the raters,
the CRFs indicated that most of the raters had distinct ranges of student achievement within which they used each rating scale
category, with the exception of four raters. Fig. 4 shows plots for the four raters whose CRFs indicated one or more non-distinct rating
scale categories: Rater 4, Rater 6, Rater 12, and Rater 18. For example, in the CRF plot for Rater 4, there is no range of student
achievement at which the third category (x = 3) is the most probable.

Table 1
Threshold Estimates from Model Two (Rating Scale Varies across Raters).

Rater    A. Threshold Location Estimates      B. Distance Between Thresholds
         τ1       τ2       τ3                 |τ2 − τ1|   |τ3 − τ2|
1        −0.65    0.18     0.47               0.83**      0.29**
2        −0.66    −0.43    1.09               0.23**      1.52
3        −0.91    −0.17    1.08               0.74**      1.25**
4        −0.93    0.55     0.38               1.48        0.17**
5        −1.45    −0.01    1.46               1.44        1.47
6        −0.29    −0.29    0.58               0.00**      0.87**
7        −1.37    0.20     1.17               1.57        0.97**
8        −1.25    0.12     1.12               1.37**      1.00**
9        −0.88    0.08     0.80               0.96**      0.72**
10       −0.55    −0.03    0.58               0.52**      0.61**
11       −0.71    0.23     0.48               0.94**      0.25**
12       0.14     0.10*    −0.24*             0.04**      0.34**
13       −0.98    0.05     0.93               1.03**      0.88**
14       −0.98    −0.12    1.10               0.86**      1.22**
15       −1.32    0.16     1.16               1.48        1.00**
16       −0.92    −0.26    1.18               0.66**      1.44
17       −0.86    −0.11    0.97               0.75**      1.08**
18       −0.44    −0.26    0.70               0.18**      0.96**

Notes: * Disordered threshold; ** Distance between adjacent thresholds is less than 1.4 logits.

Fig. 4. Category Response Functions for Raters with Non-Distinct Category Probabilities (Model Two).

6.7. Category invariance

As the final indicator of rating scale category functioning based on Model Two, I compared the threshold locations across raters.
Fig. 5 illustrates this comparison (these values are the same as the values in Table 1). In the plot, the x-axis is the logit scale that
represents judged writing achievement. Thin, dashed vertical lines are plotted to aid in the visual interpretation of distances along the
x-axis. The y-axis shows the 18 raters who scored the ELP assessment, and numeric plotting symbols represent the three rating scale
thresholds (1 = τ1, 2 = τ2, 3 = τ3). The locations of these numbers show the relative difficulty of the category thresholds. Im-
portantly, the locations shown here reflect the distance in difficulty for the rating scale categories from the overall location of each
rater. As a result, one can use the values shown in this plot to evaluate the extent to which the differences in difficulty between
categories are consistent across raters, even if raters vary in overall severity. Consistency in the distance between the thresholds
suggests that the difference between categories has a consistent interpretation across raters.


Fig. 5. Location of Rating Scale Category Thresholds for Individual Raters (Model Two).

The locations of the thresholds in Fig. 5 suggest that the raters varied in their interpretation of the relative difficulty of the rating
scale categories. As a result, there is not a consistent interpretation of category difficulty across all of the raters. For example, a
student who received a rating in category 3 from Rater 1 would receive a rating in category 2 from Rater 2, even after controlling for
differences in overall rater severity.
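The claim about Rater 1 and Rater 2 can be checked with the Model Two thresholds in Table 1, as in the sketch below. Because the thresholds are expressed relative to each rater's overall location, holding a student's relative location fixed (0.70 logits in this illustration, an invented value) already controls for differences in severity; under the adjacent-categories formulation, the modal category is the one with the largest cumulative numerator.

```python
# Model Two thresholds from Table 1 (relative to each rater's overall location).
rater_1 = [-0.65, 0.18, 0.47]
rater_2 = [-0.66, -0.43, 1.09]

def most_probable_category(relative_location, thresholds):
    """Modal category under an adjacent-categories model (the shared denominator cancels)."""
    exponents, running = [0.0], 0.0
    for tau in thresholds:
        running += relative_location - tau
        exponents.append(running)
    return exponents.index(max(exponents))

relative_location = 0.70   # an invented student location, 0.70 logits above each rater's location
print(most_probable_category(relative_location, rater_1))   # 3: above Rater 1's third threshold
print(most_probable_category(relative_location, rater_2))   # 2: below Rater 2's third threshold
```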

7. Model three: rating scale varies across domains

The analyses based on Model Three provide information that analysts can use to address the third research question: To what
extent is rating scale category functioning comparable across individual domains?

7.1. Category directionality

The results from Model Three indicated expected directionality for the rating scale categories within all three domains.
Specifically, the thresholds were non-decreasing across increasing rating scale categories: Content: τ1 = -0.94, τ2 = 0.07, τ3 = 1.00;
Language: τ1 = -1.11, τ2 = 0.13, τ3 = 0.98; Organization: τ1 = -0.73, τ2 = 0.05, τ3 = 0.78. This result suggests that when the rating
scale structure is estimated separately across domains, there is no evidence of category disordering.

7.2. Category precision

As I observed for the previous two models, the absolute value of the distance between adjacent thresholds was less than Linacre’s (2002) recommended minimum distance of 1.4 logits for distinct rating scale categories. However, the CRFs for all three domains indicated that each of the rating scale categories described unique locations on the latent variable (see Fig. 6). As I observed for
the previous two models, this result suggests that when the rating scale structure is estimated separately across domains, the cate-
gories describe distinct ranges of student achievement.

7.3. Category invariance

Using the same format as Fig. 5, Fig. 7 shows the locations of the thresholds for Model Three. Although the threshold locations do
not match perfectly across domains, they are relatively similar. As a result, these values indicate that when the rating scale is
estimated separately across domains, the categories reflect similar levels of achievement.

7.4. Model four: rating scale varies across rater-domain combinations

The analyses based on Model Four provide information that analysts can use to address the fourth research question: To what
extent is rating scale category functioning comparable across rater-domain combinations? The results from analyses with this model
demonstrate the additional information about rating scale functioning that analysts can glean from examinations of rating scale
category thresholds for two facets beyond the information provided by models in which the threshold parameter varies over a single
facet (e.g., Model Two and Model Three).


Fig. 6. Category Response Functions for Domains (Model Three).

Fig. 7. Location of Rating Scale Category Thresholds for Domains (Model Three).

7.5. Rating scale category directionality

First, I examined the locations of the rating scale category thresholds for each rater-by-domain combination; I presented these
values in Table 2. When I examined rater-domain combinations, I observed that several raters did not use all of the rating scale
categories. For example, in the Content domain, Rater 12 and Rater 15 did not use the lowest rating scale category, and Rater 1 and
Rater 4 did not use the highest rating scale category. As a result, it was not possible to estimate all three thresholds for all of the raters.
Along the same lines, for each rater, there was at least one rating scale category within at least one domain with fewer than 10
observations. I marked the estimates for the affected rating scale category thresholds in Table 2 using italics. Although the Facets
software provided estimates for these thresholds, the number of observations within rating scale categories was not sufficient to
support meaningful interpretation of the values.
In terms of directionality, the threshold locations were generally increasing across increasing rating scale categories—indicating
that most of the raters used higher rating scale categories to describe higher levels of student achievement. However, there were
several raters with disordered thresholds. I marked disordered thresholds in Table 2 using asterisks. For some of the raters with
disordered thresholds (e.g., Rater 2 and Rater 6), the magnitude of the disordering was essentially negligible (≤ 0.07 logits), and the
disordering occurred alongside insufficient sample sizes in the adjacent threshold—such that these instances of disordered thresholds
do not have a clear interpretation. Other raters exhibited more substantial threshold disordering. For example, Rater 7 exhibited
disordering between the first two thresholds, with the first threshold located higher on the logit scale compared to the second
threshold (τ1 = 1.40 > τ2 = 0.10). This result suggests that Rater 7 used the second rating scale category to describe higher levels of
achievement than the third rating scale category. Interestingly, although several raters exhibited category disordering, none of the
raters exhibited disordering on all three domains.

Table 2
Rating Scale Category Threshold Estimates from Model Four.

Rater    Content                      Language                     Organization
         τ1      τ2      τ3           τ1      τ2      τ3           τ1      τ2      τ3
1        −0.55   0.55    –            −0.31   0.06    0.25         –       –       0.00
2        −0.88   −0.23   1.11         −0.77   −0.44   1.22         –       0.01    −0.01*
3        −1.01   −0.47   1.48         −1.04   0.12    0.92         –       −0.40   0.40
4        −0.29   0.29    –            –       −0.20   0.20         −1.21   0.56    0.65
5        −2.87   0.72    2.15         −1.35   −0.11   1.46         −1.06   −0.16   1.21
6        −0.40   −0.47*  0.86         −0.19   −0.24*  0.43         −0.52   −0.25   0.77
7        1.40    0.10*   1.29         −1.62   0.61    1.01         −1.09   −0.10   1.19
8        −1.45   0.11    1.34         −1.25   −0.07   1.32         −1.06   0.48    0.58
9        −0.63   −0.28   0.91         −1.28   0.10    1.18         −0.76   0.28    0.48
10       −1.42   0.39    1.03         −0.72   −0.14   0.85         −0.46   0.03    0.42
11       −0.52   0.24    0.28         −0.96   0.12    0.85         –       −1.32   1.32
12       –       0.43    −0.43*       –       –       0.00         −0.02   0.01    0.02
13       −1.46   −0.03   1.49         −1.07   −0.03   1.10         −0.64   0.21    0.43
14       −0.68   −0.40   1.08         −1.01   0.11    0.89         −1.49   0.06    1.43
15       –       −0.73   0.73         −1.47   0.31    1.16         −0.60   −0.16   0.76
16       −0.95   −0.61   1.56         −1.17   −0.28   1.45         −0.68   0.05    0.63
17       −0.94   −0.24   1.17         −0.82   0.02    0.80         −0.96   −0.09   1.05
18       −0.28   −0.45*  0.73         −0.51   −0.06   0.56         −0.82   −0.47   1.29

Notes: (1) Asterisks (*) indicate disordered thresholds; (2) italic font indicates that there were fewer than 10 observations in one of the categories associated with a rating scale category threshold such that the threshold estimate does not have a clear interpretation.

7.6. Rating scale category precision

Table 3 includes the absolute value of the distance between rating scale category threshold locations on the logit scale based on
Model Four. Because of the small number of observations in some of the rating scale categories for certain rater/domain combi-
nations, many of these differences do not have a clear interpretation. I marked the absolute differences that included one of the
suspect threshold estimates using italic font. Among the absolute differences that did not include suspect threshold estimates, I
observed values less than 1.4 logits within all three domains and at least once for each of the eighteen raters. This finding suggests
that the raters who scored the ELP may have used the rating scale categories to distinguish among relatively small differences in
levels of student writing achievement.

Table 3
Distance Between Rating Scale Category Thresholds: English Language Proficiency Assessment.

Rater    Content                  Language                 Organization
         |τ2 − τ1|  |τ3 − τ2|     |τ2 − τ1|  |τ3 − τ2|     |τ2 − τ1|  |τ3 − τ2|
1        1.10*      –             0.37*      0.19*         –          –
2        0.65*      1.34*         0.33*      1.66          –          0.02*
3        0.54*      1.95          1.16*      0.80*         –          0.80*
4        0.58*      –             –          0.40*         1.77       0.09*
5        3.59       1.43          1.24*      1.57          0.90*      1.37*
6        0.07*      1.33*         0.05*      0.67*         0.27*      1.02*
7        1.30*      1.19*         2.23       0.40*         0.99*      1.29*
8        1.56       1.23*         1.18*      1.39*         1.54       0.10*
9        0.35*      1.19*         1.38*      1.08*         1.04*      0.20*
10       1.81       0.64*         0.58*      0.99*         0.49*      0.39*
11       0.76*      0.04*         1.08*      0.73*         –          2.64
12       –          0.86*         –          –             0.03*      0.01*
13       1.43       1.52          1.04*      1.13*         0.85*      0.22*
14       0.28*      1.48          1.12*      0.78*         1.55       1.37*
15       –          1.46          1.78       0.85*         0.44*      0.92*
16       0.34*      2.17          0.89*      1.73          0.73*      0.58*
17       0.70*      1.41          0.84*      0.78*         0.87*      1.14*
18       0.17*      1.18*         0.45*      0.62*         0.35*      1.76

Notes: (1) Asterisks (*) indicate that the distance between adjacent thresholds is less than 1.4 logits; (2) italic font indicates that there were fewer than 10 observations in one of the categories associated with a rating scale category threshold such that the threshold estimate does not have a clear interpretation.

Fig. 8. Category Response Functions for Selected Rater-Domain Combinations (Model Four).

To illustrate differences in rating scale category precision among rater-domain combinations, Fig. 8 includes CRFs for two raters.
To distinguish between rating scale categories with unstable estimates due to infrequent observations (≤ 10) and categories that
raters did not use at all, I plotted the probability curves for the categories in light grey, and marked them with an asterisk in the
legend. The first set of plots (Panel A) includes CRFs for Rater 1 within the Content, Language, and Organization domains. Two main
observations can be made from these CRFs. First, although the category probability curves for the Content and Organization domains
are ordered as expected and distinct, Rater 1 did not use the full range of rating scale categories for these domains. Second, in the
Language domain, the two middle categories do not have distinct locations on the logit scale where they are most probable. Taken
together, these results suggest that the interpretation of rating scale categories in terms of writing achievement is not consistent
across domains for Rater 1. The second set of plots in Fig. 8 (Panel B) includes CRFs for Rater 5 within the three domains on the ELP
assessment. In contrast to Rater 1, this rater used all of the rating scale categories in all of the domains. Although there were some
differences in category probabilities across domains, each of the rating scale categories had a distinct range of writing achievement at
which it was most probable.

7.7. Rating scale category invariance

Finally, I examined the threshold locations for each rater-by-domain combination to explore the degree to which individual raters
interpreted the difficulty of rating scale categories consistently across domains. For the sake of illustration, I compared the threshold
locations within each rater across domains. However, in theory, one could also compare threshold locations within domains across
raters. Fig. 9 illustrates this comparison for four selected raters. I selected these raters because they exhibited different patterns of
rating scale category use across domains; therefore, these raters demonstrate different patterns that one might observe when eval-
uating rating scale category functioning for rater/domain combinations. In each of the plots in Fig. 9, I used asterisks to indicate
thresholds that do not have a clear interpretation due to infrequent observations within a particular rater/domain combination.
Importantly, the threshold locations shown here reflect the distance in difficulty for the rating scale categories from the overall
location of the rater-domain combination. As a result, one can use the values shown in this plot to evaluate the extent to which the
differences in difficulty between categories are consistent across domains for a particular rater, even if that rater has judged the
difficulty of the domains to be different.
For Rater 5 (Plot A), the locations of the thresholds indicate that this rater had a similar interpretation of category difficulty
between the Language and Organization domains. However, for the Content domain, the first threshold was located much lower on
the rating scale compared to its location for Language and Organization. This result suggests that Rater 5 interpreted the first rating
scale category as substantially easier for Content than for Language and Organization. For Rater 9 (Plot B) and Rater 18 (Plot C), the
relative distance between category thresholds is notably different across all three domains. In contrast, the threshold locations for
Rater 17 (Plot D) are relatively consistent across domains—indicating that this rater consistently interpreted the difficulty of rating
scale categories across the Content, Language, and Organization domains.

Fig. 9. Locations of Rating Scale Category Thresholds across Domains for Four Selected Raters (Model Four).

8. Discussion

The purpose of this study was to demonstrate a method for examining the degree to which raters use rating scale categories in a
comparable manner across domains in an analytic writing assessment. I used data from a rater training program for an ELP as-
sessment to illustrate numeric and graphical indicators that one can obtain using three popular formulations of the MFR model: a RS-
MFR model in which one set of rating scale category thresholds is estimated (Model One), a PC-MFR model in which thresholds are
estimated separately for individual raters (Model Two), and a PC-MFR model in which thresholds are estimated separately for
individual domains (Model Three). Then, I showed how it is possible to use a PC-MFR model in which rating scale category thresholds
are estimated separately across combinations of facets in order to examine rating scale functioning in more detail (Model Four).
Specifically, I used a PC formulation of the MFR model in which the threshold parameter was estimated separately for each rater-
domain combination (τijk).
The analyses and results from each model illustrate how analysts can use different formulations of MFR models to address a
variety of questions about rating scale category functioning. With regard to the first research question (To what extent does the entire
group of raters use the rating scale categories as expected?), one can use techniques based on models similar to Model One to examine the
degree to which a group of raters’ category use reflects the intended category directionality, and the extent to which rating scale
categories have distinct interpretations. The analyses based on Model Two reflect the second research question: To what extent is rating
scale category functioning comparable across individual raters? One can use techniques based on models similar to Model Two to
examine the degree to which individual raters use rating scale categories in the intended direction, make distinctions between
categories, and interpret the difficulty of the categories in a similar way as other raters. With regard to research question three (To
what extent is rating scale category functioning comparable across individual domains?), the illustrative analyses based on Model Three
provide insight into raters’ use of rating scale categories across the domains in an analytic rubric. These analyses provide insight into
the degree to which a group of raters used categories in the intended direction for each domain, distinguished between categories for
each domain, and interpreted the difficulty of the categories similarly across domains. Finally, the analyses based on Model Four
provided information to address research question four: To what extent is rating scale category functioning comparable across rater-
domain combinations? The analyses from this model provided information similar to Model Two and Model Three, but specific to
rater-domain combinations.
Although a number of researchers have used PC-MFR models to explore rating scale functioning in rater-mediated performance
assessments, it is not widely recognized that it is possible to use models such as Model Four to estimate rating scale category
thresholds separately for two or more facets. With the illustrative analyses, I showed that all four formulations of the MFR model
provided useful information about rating scale functioning. However, Model Four (rating scale estimated separately for rater-domain
combinations; research question four) provided more detailed information about how individual raters use rating scale categories
within domains compared to the other models. Importantly, Model Four allowed me to identify the specific domains in which
individual raters exhibited idiosyncratic interpretations of category ordering and a lack of distinction among rating scale categories.
Model Four also provided insight into the consistency with which individual raters interpreted the difficulty of the different domains.
Many of the idiosyncrasies in raters’ category use within domains were only apparent in the results from Model Four. Together, these
results suggested that analyses based on models in which thresholds are estimated separately for rater-domain combinations can help
researchers and practitioners identify individual raters who use rating scale categories in an idiosyncratic way within particular
domains. Individuals who train or monitor raters can use this information to guide remediation and improve scoring materials for
rater-mediated writing assessments.
Although models such as Model Four provide detailed information about category use specific to combinations of facets, an
important limitation of this approach is the requirement for relatively large sample sizes in order to calculate stable estimates of
rating scale category thresholds. The results from this study illustrated this challenge. Nonetheless, if analysts are aware of the sample
size requirements for interpreting rating scale category thresholds and check the number of observations within categories before
interpreting them, these models can still provide useful information about rating scale category use for combinations of facets.
When considering the series of MFR model analyses that I presented in this paper, it is important to note that my goal was not to
illustrate a procedure for identifying the best-fitting model for the rater training data. Instead, I sought to demonstrate the benefits of
estimating rating scale category thresholds separately for meaningful combinations of facets in a rater-mediated assessment system.
This modeling procedure provides analysts with detailed information about individual raters’ category use within domains that could
be useful in contexts such as rater remediation or rater training. When the goal of an analysis is to summarize the psychometric
characteristics of a rater-mediated assessment for other purposes, such as reporting the results to practitioners, simpler models are
likely to be more appropriate.
It is important to note that the methods that I have illustrated here provide different information than the methods that Penfield
and his colleagues (Penfield, 2007; Penfield et al., 2008; Penfield, Gattamorta, & Childs, 2009) have proposed for examining dif-
ferential step functioning (DSF) for polytomous items. DSF analyses provide researchers with information about the degree to which
particular rating scale categories function differently for test-takers who are members of different subgroups. Likewise, the indices
that I illustrated in this study provide different information than the centrality/extremity parameter in the Facets Model for Severity
and Centrality (FM-SC) that Jin and Wang (2018) have recently proposed. That model provides researchers with numeric indicators
of the degree to which raters tend to use the central or extreme categories of a rating scale, and it provides estimates of student
achievement that have been adjusted for systematic differences in severity and category use. In contrast to these approaches, the PC-
MFR model formulation that I presented in this study is a tool through which analysts can directly examine idiosyncrasies within and
between raters related to their rating scale category interpretations in particular domains of an analytic scoring rubric.

8.1. Recommendations for research and practice

With this illustrative analysis, I demonstrated methods that researchers and practitioners can use to gather information about
rating scale category functioning for a group of raters (Model One/research question one), specific to individual raters (Model Two/
research question two), specific to individual domains (Model Three/research question three), and specific to rater-domain combi-
nations (Model Four/research question four). The results from the illustrative analyses lead to several recommendations for research
and practice.
The first major recommendation is to incorporate rating scale functioning analyses into every stage of rater-mediated writing
assessments, including rater training, pilot testing, rater monitoring during operational scoring, and evaluation following scoring. As I de-
monstrated in this study, analyses that are focused specifically on rating scale categories and how scale functioning varies across
raters, domains, and rater-domain combinations provide different information about the psychometric quality of rater-mediated
writing assessments than the methods that are most commonly used to evaluate ratings in these settings (reliability analyses and rater
effects analyses; Engelhard & Wind, 2018). Category directionality, precision, and invariance play a central role in the interpretation
of ratings overall, as well as in the comparison of ratings across raters and domains. Accordingly, this information is critical for
meaningful interpretation and use of ratings based on analytic scoring rubrics.
The second major recommendation is to interpret the results from rater-mediated assessments in light of the results from rating
scale category functioning analyses. Although methods for evaluating rating scale functioning have been available for decades,
researchers and practitioners usually do not conduct empirical checks for category ordering, precision, and invariance before using
analytic ratings to compare student achievement across raters and domains. That ratings are comparable across facets or combi-
nations of facets in a rater-mediated performance assessment is a hypothesis that must be supported with empirical evidence. Without
sufficient evidence to support this hypothesis, scores from analytic rubrics should not be used to compare student achievement across
raters, domains, and rater-domain combinations.
Finally, the third major recommendation is to design rater-mediated performance assessment systems such that it is possible to
conduct rating scale functioning analyses. In order to conduct analyses similar to those illustrated in this study, it is necessary to
ensure that there are “links” or common observations across elements of a rater-mediated assessment, such as raters scoring essays
and domains in common with other raters. The rater training data that I used in this study reflected a complete rating design, in
which all of the raters scored all of the essays; however, such data collection designs are rare in operational administrations of rater-
mediated assessments. Nonetheless, it is possible to conduct rating scale functioning analyses using models similar to the ones that I
included in the current analyses even when complete data are not available, as long as there are common observations that connect
raters, essays, and domains. For additional details about this point, please see Engelhard and Wind (2018).

8.2. Limitations and directions for future research

This study has several limitations. First, the illustrative data that I presented does not reflect the full scope of rater-mediated
writing assessments or rater training programs for these assessments. Researchers and practitioners should consider the
characteristics of the data that I used in my illustration before generalizing my results to other assessment systems that have different
characteristics. Second, the models that I used in my illustrative analyses do not reflect the full scope of MFR model formulations that
one could examine in a rater-mediated writing assessment. Analysts who are interested in examining rating scale category func-
tioning for combinations of facets should consider the combinations of variables for which such analyses would provide useful
information, and define their models to reflect those specifications. Finally, as with any quantitative analysis of rater effects, the
methods that I presented in this study do not provide substantive explanations for raters’ idiosyncratic use of rating scale categories
within domains. Additional qualitative analyses are needed to fully interpret the results from these analyses so that the
quantitative evidence can be used to inform remediation or revision of scoring materials.

Declaration of Competing Interest

The author declares that there are no known competing financial interests or personal relationships that could have appeared to
influence the work reported in this paper.

References

Andrich, D. A. (1978). A rating formulation for ordered response categories. Psychometrika, 43(4), 561–573. https://doi.org/10.1007/BF02293814.
Crusan, D. (2015). Dance, ten; looks, three: Why rubrics matter. Assessing Writing, 26, 1–4. https://doi.org/10.1016/j.asw.2015.08.002.
Engelhard, G., Jr., & Wind, S. A. (2013). Rating quality studies using Rasch Measurement Theory (College Board Research and Development Report No. 2013-3). New York, NY: College Board.
Engelhard, G., Jr., & Wind, S. A. (2018). Invariant measurement with raters and rating scales: Rasch models for rater-mediated assessments. New York, NY: Taylor and Francis.
Jin, K.-Y., & Wang, W.-C. (2018). A new facets model for rater’s centrality/extremity response style. Journal of Educational Measurement, 55(4), 543–563. https://doi.org/10.1111/jedm.12191.
Linacre, J. M. (1989). Many-facet Rasch measurement. Chicago, IL: MESA Press.
Linacre, J. M. (2000). Comparing “partial credit models” (PCM) and “rating scale models” (RSM). Rasch Measurement Transactions, 14, 768.
Linacre, J. M. (2002). Optimizing rating scale category effectiveness. Journal of Applied Measurement, 3(1), 85–106.
Linacre, J. M. (2015). Facets Rasch measurement (Version 3.71.4) [Computer Software]. Chicago, IL: Winsteps.com.
Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47(2), 149–174. https://doi.org/10.1007/BF02296272.
Penfield, R. D. (2007). Assessing differential step functioning in polytomous items using a common odds ratio estimator. Journal of Educational Measurement, 44,
187–210.
Penfield, R. D., Gattamorta, K. A., & Childs, R. (2009). An NCME instructional module on using differential step functioning to refine the analysis of DIF in polytomous items. Educational Measurement: Issues and Practice, 28(1), 38–49.
Penfield, R. D., Myers, N. D., & Wolfe, E. W. (2008). Methods for assessing item, step, and threshold invariance in polytomous items following the Partial Credit model.
Educational and Psychological Measurement, 68(5), 717–733. https://doi.org/10.1177/0013164407312602.
Rasch, G. (1960). Probabilistic models for some intelligence and achievement tests (Expanded edition, 1980). Chicago: University of Chicago Press.
Trace, J., Meier, V., & Janssen, G. (2016). “I can see that”: Developing shared rubric category interpretations through score negotiation. Assessing Writing, 30, 32–43.
https://doi.org/10.1016/j.asw.2016.08.001.
Wesolowski, B., Wind, S. A., & Engelhard, G., Jr. (2016). Examining rater precision in music performance assessment: An analysis of rating scale structure using the
multifaceted Rasch partial credit model. Music Perception, 33(5), 662–678. https://doi.org/10.1525/mp.2016.33.5.662.
Wesolowski, B. W., Wind, S. A., & Engelhard, G., Jr. (2017). Evaluating differential rater functioning over time in the context of solo music performance assessment.
Bulletin of the Council for Research in Music Education, 212, 75–98. https://doi.org/10.5406/bulcouresmusedu.212.0075.
Wind, S. A., Tsai, C.-L., Grajeda, S. B., & Bergin, C. (2018). Principals’ use of rating scale categories in classroom observations for teacher evaluation. School
Effectiveness and School Improvement, 29(3), 485–510. https://doi.org/10.1080/09243453.2018.1470989.

Stefanie A. Wind is an assistant professor of educational measurement at the University of Alabama. She conducts methodological and applied research on educa-
tional assessments, with an emphasis on issues related to raters, rating scales, and parametric and nonparametric item response theory models for ratings.
