
Review

Rating scales and Rasch measurement

Expert Rev. Pharmacoeconomics Outcomes Res. 11(5), 571–585 (2011)

David Andrich
Graduate School of Education, The University of Western Australia, M428, 35 Stirling Highway, Crawley, Western Australia, 6009, Australia
Tel.: +61 864 881 085; Fax: +61 864 881 052; david.andrich@uwa.edu.au

Assessments with ratings in ordered categories have become ubiquitous in health, biological and social sciences. Ratings are used when a measuring instrument of the kind found in the natural sciences is not available to assess some property in terms of degree – for example, greater or smaller, better or worse, or stronger or weaker. The handling of ratings has ranged from the very elementary to the highly sophisticated. In an elementary form, and assumed in classical test theory, the ratings are scored with successive integers and treated as measurements; in a sophisticated form, and used in modern test theory, the ratings are characterized by probabilistic response models with parameters for persons and the rating categories. Within modern test theory, two paradigms, similar in many details but incompatible on crucial points, have emerged. For the purposes of this article, these are termed the statistical modeling and experimental measurement paradigms. Rather than reviewing a compendium of available methods and models for analyzing ratings in detail, the article focuses on the incompatible differences between these two paradigms, with implications for choice of model and inferences. It shows that the differences have implications for different roles for substantive researchers and psychometricians in designing instruments with rating scales. To illustrate these differences, an example is provided.

Keywords: graded response model • item response theory • ordered category formats • Rasch measurement • Rasch models • rating scales

Introduction

This article is not an examination of a compendium of technical issues for analyzing rating scales using Rasch measurement [1,2]. Rather, it shows first that the Rasch measurement of rating scales sits within a statistical paradigm that is different from the standard one within which item analysis sits, and second, how this difference leads to a major consequence in the construction and analysis of rating scales. To illustrate the points, a concrete example is provided.

The term paradigm is used explicitly in the sense of Kuhn [3] to be a complement of many reinforcing assumptions taken for granted in a field. Paradigms govern the questions researched and how they are answered. Within the paradigm, substantial, normal, puzzle-solving research advancing the field is carried out. However, periodically, a paradigm's assumptions are challenged. When the assumptions of the original paradigm are replaced, a scientific revolution occurs. Some of the assumptions of the different paradigms are mutually incompatible. An example is the Copernican revolution, which replaced the complement of many assumptions behind the belief that Earth was the center of the universe [4].

The article clarifies the confusion that can arise in the comparisons of different methods of analyzing rating scales, in which it may appear that the differences are primarily technical. It will be seen that the differences are much deeper, and that they are at the level of paradigms. To see this, it is first necessary to examine the principles underpinning rating scales.

Measurement & rating scales

In the prototype of measurement, an instrument maps some property of an entity that can be considered greater or smaller, stronger or weaker, better or worse, and so on, onto a linear continuum [5]. The linear continuum has a more or less arbitrary origin and a specified unit. Although they have been referred to as constructs and traits, following Reise [6], such properties are referred to here as variables.

In measurement, the linear continuum is partitioned into equal, contiguous intervals by thresholds that are all equally fine (same discrimination) and, relative to the size of the property being measured, fine enough that their width can be ignored. Then the measurement of the variable is the count of the number of intervals, the units, from the chosen origin to where the property of the object is mapped on the continuum. A prototype of a continuum partitioned into equal intervals, the very familiar ruler partitioned into cm and mm, is shown in Figure 1. In anticipation of the analogy with a rating scale, the ruler has a rating scale with five ordered categories superimposed upon it.

Figure 1. A partitioned continuum in the prototype of measurement: a ruler partitioned into cm and mm, with five ordered categories (0–4) superimposed.

Measurement is a sophisticated concept and process. Constructing an instrument in the natural sciences requires a theoretical understanding of the variable being measured. For example, to construct a thermometer, it was necessary to understand the effects of heat on the expansion properties of selected materials [7]. After an instrument has been established, it can be used to relate its measurements to other factors. For example, in some circumstances, such as the presence of a fever, the health statuses of individuals can be assessed by measuring their temperatures.

In the natural sciences, instruments are not expected to function outside their designated operating ranges. In addition, it is understood that all measurement contains at least, and ideally only, random error. Depending on the size of this error, replicated measurements may be made and then, in a well-defined way, their mean is a more precise estimate of the size of the property. The aforementioned characteristic of a measuring instrument, a linear continuum partitioned into units, is not negotiable in the natural sciences and is independent of any dataset.

Rating scales as instruments in health & social sciences

Rating scales, which seem to have been first formalized by Likert [8], are used in the health and social sciences in the assessment of variables when no measuring instrument of the kind found in the natural sciences is available. Moreover, as they are used more and more to assess the effects of interventions, whether these be educational, physical or drug interventions, their importance is constantly growing [9]. Therefore, it may be argued that no less care and knowledge should be applied to constructing rating scales than are used to construct measuring instruments in the natural sciences.

Figure 2 shows a specific rating scale, the Ashworth Scale, concerned with the clinical assessment of the degrees of muscle tone, that is used illustratively in this article. It immediately shows the implied partitioning of a continuum [9].

Figure 2. The implied rating scale continuum partitioned by four thresholds ($\delta_1$–$\delta_4$) into five ordered categories: 0 = limb rigid (minimal movement), 1 = increased tone (restricting movement), 2 = increased tone (easily flexed), 3 = catch, 4 = normal tone.

It may appear redundant to stress that the ordering requires that successive categories from left (rigid) to right (normal) reflect successively more muscle tone and less spasticity, and that whether or not this is the case is central to any subsequent interpretations. (Ashworth's original scale was labeled 0 = normal tone to 4 = rigid. For the purposes of this article, the numbering has been reversed, making a higher score correspond to a better level of muscle tone.) This is an a priori requirement that is independent of any empirical ratings. If empirical ratings do not meet this requirement, then it seems that it is a problem with the ratings. However, we do stress this requirement because how the empirical ordering is handled is a distinctive difference between the two paradigms and is the focus of this article. To anticipate this difference, in one paradigm the empirical ordering is taken for granted and models are found that fit the ratings; in the other paradigm, the ordering is treated as an a priori requirement that needs to be verified empirically. However, although the ordering appears not to be negotiable, we note in anticipation of the application of Rasch measurement that, unlike the case of physical science measurement, the distances between the thresholds do not have to be equal.

Most instruments are composed of more than one rating scale and the ratings of an individual are combined into a single value. The distinct rating scales are generally referred to as items. Having more than one item is a form of replication that is analogous to making replicated measurements in the natural sciences. However, unlike some examples in natural science where the same object may be measured repeatedly with the same instrument, the same item (same wording) cannot be administered repeatedly to the same person. Therefore, different items assessing the same variable are used. Having more than one item potentially increases both the precision and the validity of an assessment. The example provided in this article has multiple items to assess each person on the Ashworth scale of muscle tone.

Theories & paradigms in the analysis of ratings

Three theories for the analysis of ratings are available: first, classical test theory (CTT); second, item response theory (IRT); and third, Rasch measurement theory (RMT). The histories of these are summarized respectively by Traub [10], Bock [11] and Wright [12]. The latter two theories, IRT and RMT, are often placed together, and for the purposes of this article will be referred to as modern test theory. (The term modern test theory is not used in reference to an absolute period in the development of test theory, but to contrast it to classical test theory. The basics of modern test theory were established by Thurstone in the 1920s [5].) There is no space here to consider all three theories in relation to paradigms, and therefore some elements of CTT will be referred to only where relevant in reviewing the features of IRT and RMT. Cano et al. provide a review of all three theories in relation to an area of assessment in health outcomes [13].

Two paradigms for the applications of modern test theory have emerged. For the purpose of this article, the two paradigms are termed the statistical modeling paradigm and the experimental measurement paradigm, respectively. The article shows that IRT sits within the former paradigm and RMT within the latter. The next section briefly reviews the common features of ratings from the perspectives of these theories, before returning to examine their differences.


Analysis of ratings

Classical test theory [10] is historically the oldest theory for the analyses of ratings, but is currently being augmented, and in many cases superseded, by modern test theory. In CTT, the elementary scoring of successive ratings by successive integers is employed. Then the sum across items is taken as the summary score for a person. This score is taken to be composed of a true score and an error. The characterization of each person by a single score implies a single variable, and evidence is used to confirm that the responses to the items do form replications with respect to the assessment of the same variable, rather than that different items may assess different variables. This does not preclude the idea that multiple factors may contribute to any person's location on a variable. This is analogous to the many factors, including nutrition, genes and age, which contribute to any person's weight, which is a single variable [6].

Rather than assigning successive integers to categories and treating them as measurements, in modern test theory (both IRT and RMT), three modifications are made. First, a probabilistic model for an observed rating is specified as a function of the location of the person and categories on a continuum; second, as noted earlier, categories are not presumed to be the same size; third, the number of categories is explicitly finite.

Formally, let $X_{ni}$ be the random variable of the observed response $x_{ni} \in \{0, 1, 2, \ldots, m_i\}$ of person $n$ to item $i$, where the successive integers are simply assigned descriptively to the successive categories, as shown in Figures 1 & 2, and do not imply scores. Furthermore, let $\Pr\{X_{ni} = x\}$ be the probability that person $n$ responds to item $i$ in category $x$, which is a function of parameters $a_{ix}, b_{ix}$ of category $x$ of item $i$ and the location $\beta_n$ of person $n$ on the continuum, giving Equation 1:

$\Pr\{X_{ni} = x; \beta_n, a_{ix}, b_{ix}\} = f\{x; \beta_n, a_{ix}, b_{ix}\}$   (1)

Figure 3 shows an example of the probabilities of responses in five successive rating categories, x = 0, 1, 2, 3 and 4, as a function of the location $\beta$ on the continuum and of the characteristics of the categories of an item. These probabilities are generally referred to as category characteristic curves (CCCs) and appear in both IRT and RMT.

Figure 3. Category characteristic curves for five ordered categories (probability of each category, 0–4, as a function of location on the continuum).

Two paradigms in modern test theory

The two paradigms of modern test theory for the analysis of ratings have many aspects in common and more than one aspect that is different. The terms statistical modeling and experimental measurement are chosen for the paradigms because their differences can be characterized by the following feature: in the former, the model is chosen to fit the observed ratings; in the latter, the ratings are analyzed according to a model to check whether they meet a priori specifications. The statistical modeling paradigm, within which IRT sits, is the presently dominant paradigm [14]. The experimental measurement paradigm, within which RMT sits, is the alternative paradigm for the analysis of rating scales.

Table 1. Models of modern test theory used in the analysis of ratings.

Class 1: divide by total models [19]. All have the form $\Pr\{X_{ni} = x\} = \exp(z_{ix}(\beta_n)) \big/ \sum_{k=0}^{m_i} \exp(z_{ik}(\beta_n))$, with $z_{ix}(\beta_n) = a_{ix} + b_{ix}\beta_n$, and differ in the constraints on $a_{ix}$ and $b_{ix}$:

• The Rasch model, expressed in terms of thresholds $\delta_{ix}$ and an integer scoring function $b_{ix} = x$ for successive categories†: $a_{ix} = -\sum_{k=0}^{x} \delta_{ik}$; $b_{ix} = x$; $\delta_{i0} \equiv 0$.

• The generalized partial credit model, with different discriminations $\alpha_i$ among items: $a_{ix} = -\sum_{k=0}^{x} \alpha_i \delta_{ik}$; $b_{ix} = \alpha_i x$; $\delta_{i0} \equiv 0$.

• The nominal response model, with no specifications of the parameters $a_{ix}$ and $b_{ix}$.

Class 2: difference models [19]. All have the form $\Pr\{X_{ni} = x\} = \pi^*_{nix} - \pi^*_{ni(x+1)}$, where $\pi^*_{nix} = \Pr\{X_{ni} = x\} + \Pr\{X_{ni} = x+1\} + \ldots + \Pr\{X_{ni} = m_i\}$ and $\pi^*_{ni0} = 1$:

• The graded response model, with common discrimination at thresholds among items: $\pi^*_{nix} = \exp(\beta_n - \delta^*_x) \big/ (1 + \exp(\beta_n - \delta^*_x))$.

• The graded response model, with different discriminations $\alpha_i$ among items: $\pi^*_{nix} = \exp(\alpha_i(\beta_n - \delta^*_{ix})) \big/ (1 + \exp(\alpha_i(\beta_n - \delta^*_{ix})))$.

†The only model within this table with sufficient statistics for the person and item parameters.
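
To make the two classes in Table 1 concrete, the following minimal sketch (Python, added for illustration; it is not part of the original article, and the threshold values are hypothetical) computes the category probabilities of a five-category item under the Rasch model (Class 1) and the graded response model with common discrimination (Class 2):

    import math

    def divide_by_total(beta, a, b):
        # Class 1: Pr{X = x} = exp(a_x + b_x*beta) / sum_k exp(a_k + b_k*beta)
        z = [ax + bx * beta for ax, bx in zip(a, b)]
        total = sum(math.exp(v) for v in z)
        return [math.exp(v) / total for v in z]

    def rasch_probs(beta, deltas):
        # Rasch model: a_x = -(delta_1 + ... + delta_x), b_x = x (integer scoring)
        a = [-sum(deltas[:x]) for x in range(len(deltas) + 1)]
        b = list(range(len(deltas) + 1))
        return divide_by_total(beta, a, b)

    def graded_response_probs(beta, dstar):
        # Class 2: Pr{X = x} = pi*_x - pi*_{x+1}, with cumulative logistic
        # curves pi*_x and the conventions pi*_0 = 1, pi*_{m+1} = 0
        pistar = ([1.0]
                  + [1.0 / (1.0 + math.exp(-(beta - d))) for d in dstar]
                  + [0.0])
        return [pistar[x] - pistar[x + 1] for x in range(len(dstar) + 1)]

    deltas = [-3.0, -0.5, 0.9, 1.6]            # hypothetical thresholds
    print(rasch_probs(0.0, deltas))            # five probabilities summing to 1
    print(graded_response_probs(0.0, deltas))  # likewise, from cumulative curves
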

Owing to the complication that the Rasch model is used in both IRT and RMT, and in anticipation of the explication of the models, theories and paradigms, Table 1 summarizes the class of models used in modern test theory. All models specialize for dichotomous ratings. For the purposes of this exposition, the algebraic Rasch model in row 1 of Table 1 is distinguished from RMT.

The statistical modeling paradigm

The statistical modeling paradigm seems to be traceable to a major figure in statistics, Karl Pearson. McKenzie writes: "one aspect of Pearson's approach – the construction of models to fit the data – has if anything gained importance since his day…" [15].

This perspective is consistent with Pearson's view of the application of statistics [15]:

"Using statistics, the biologist could (apparently) measure without theorizing, summarise facts without going beyond them, describe without explaining."

Item response theory sits within this paradigm. That statistical fit is the basis of choosing a model, within a broad class of models, is simply taken for granted in discussions on the application of models in IRT. For example:

"Normally, an assumption is made when fitting an IRT model to a set of data" [16];

and

"First, there is the difficulty of finding a model that fits the available data and estimating model parameters" [16].

More explicitly:

"…when the criterion indicates nonrandomness, an examination of residuals may suggest how the model should be modified to improve fit" [17];

and

"If the proportion of misfitted items is large, the reasonable solution is to discard the Rasch model and try a model that includes discrimination … in a systematic manner" [18].

The range of models that can be chosen for analyzing ratings within this paradigm is summarized in Table 1. Only the model in the first row of the table is a Rasch model.

There are two classes of models in Table 1. Thissen and Steinberg referred to the first class as divide by total models and the second class as difference models [19]. In the former, an expression characterizes the probability of a response in each category, and to ensure that the probabilities sum to one, each expression is divided by their sum; in the second, a cumulative probability of successive categories is specified, and then the probability of a response in any category is the difference between these successive, cumulative probabilities.

In Table 1, the Rasch model [2,20–22], the generalized partial credit model [23] and the nominal response model [24] belong to the first class. The nominal response model is the most general and the Rasch model the most specialized, as it has the fewest parameters to be estimated.

The Rasch model in Table 1 can be applied within IRT according to the statistical modeling paradigm, and as quoted above, if the model does not fit the data, then one of the other models with more parameters that fits the data better is sought. There is a further possible parametric specialization of the Rasch model in which all items have the same relative thresholds, $\delta_{ix} - \delta_{i(x-1)} = \delta_x - \delta_{x-1}\ \forall i$, which is referred to as a rating scale parameterization. However, because the structure and inferences are identical at the level of an item, this is considered the same Rasch model but with the same, rather than different, threshold values specified across items.

The graded response model [25] belongs to the second class and cannot be specialized to the Rasch model in the case of ratings in more than two categories.

The experimental measurement paradigm

The experimental measurement paradigm is traceable to another major figure in statistics, the geneticist Ronald Fisher. In reference to Fisher's book, Statistical Methods for Research Workers [26], McKenzie also writes [15]:

"But, almost more importantly, the book incorporated an effectively new concept of the statistician's role, and therefore a new function for statistical theory. The message was that the statistician should get involved in the practical business of experimentation. This clearly pre-supposed the diffusion of the type of occupational role that Fisher … occupied. It was not enough that the scientist should hand their results to the statistician for analysis: experiments (especially large-scale applied experiments that were difficult to 'control') had to be designed by those with statistical expertise."

Rasch [1] studied with Fisher, and Fisher influenced Rasch at two levels. The first, and more explicit, was Fisher's formalization of sufficient statistics; the second, less explicit but no less significant, was the way Rasch, as a statistician, approached data: it is characterized by the above quote from McKenzie. As both influences have implications for the construction and analysis of rating scales, both are reviewed briefly.

Sufficient statistics

Fisher's formulation of sufficiency [27] was decisive in Rasch's formulation of his measurement theory. Permitting the separation of person and item parameters, sufficiency is the defining characteristic of Rasch models [1]:

"The realization of the concept of sufficiency, I think, is a substantial contribution to the theory of knowledge and the high mark of what Fisher did … His formalization of sufficiency nails down the … conditions that a model must fulfill in order to yield an objective basis for inference."

From the observation of the separation of parameters as a result of sufficiency, Rasch was able to derive the models with sufficiency from the following requirements of the invariance of comparisons within a specified frame of reference, which includes defined classes of individuals and classes of stimuli (items), and the conditions for engagement between members of each class:

"The comparison between two stimuli should be independent of which particular individuals were instrumental for the comparison;…

Symmetrically, a comparison between two individuals should be independent of which particular stimuli within the class considered were instrumental for comparison…" [2].

Sufficiency implies that there is a statistic for a parameter such that, given this statistic, the distribution of responses is independent of the parameter. A sequence of mathematical derivations [2,20–22] led to the Rasch model of Table 1. In this model, the successive ratings are scored with successive integers commencing with zero. Furthermore, it follows that across a set of items, the total score of a person is a sufficient statistic for the person's location parameter. Although this integer scoring and summing to characterize a person is exactly as in CTT, it is stressed that this follows from the requirement of invariant comparisons realized through sufficiency, and is not simply asserted as in CTT.

Then, given this statistic, a probability distribution of responses to the items can be written, which is independent of the person parameter and dependent only on the threshold parameters of the items. Specifically, from the Rasch model equation in Table 1, the distribution of the pair of responses $(x_{ni}, x_{nj})$ to items $i$ and $j$ of person $n$, given the person's total score $r_n = x_{ni} + x_{nj}$, is given by Equation 2:

$\Pr\{(x_{ni}, x_{nj}) \mid r_n\} = \Big[\exp\Big(-\sum_{k=1}^{x_{ni}} \delta_{ik} - \sum_{k=1}^{x_{nj}} \delta_{jk}\Big)\Big] \Big/ c_{ij}$   (2)

where

$c_{ij} = \sum_{(x_{ni}, x_{nj}) \mid r} \exp\Big(-\sum_{k=1}^{x_{ni}} \delta_{ik} - \sum_{k=1}^{x_{nj}} \delta_{jk}\Big)$

and where $\sum_{(x_{ni}, x_{nj}) \mid r}$ is the sum over all possible pairs of scores that sum to $r$.

From a generalization of Equation 2 across any number of persons and more than two items, estimates of the item thresholds can be made. However, because the equation is independent of the person parameters $\beta$, it means no assumptions need be made regarding the distribution of persons. For practical and theoretical reasons, the thresholds need not only to be ordered, but, in the process of scale construction, also be relatively well aligned to the person locations. However, even though the successive categories are scored with successive integers beginning with zero, the successive thresholds do not need to be exactly equally spaced. Furthermore, the estimates of the thresholds reflect the structure of the ratings and not the distribution of the persons. This feature is powerful in diagnosing any potential problems with rating formats. The non-Rasch models in Table 1 do not have sufficient statistics and, in the conduct of analyses of responses, require assumptions for the distribution of persons.
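
A minimal numerical check of this elimination of the person parameter can be made directly from Equation 2 (an illustrative Python sketch, not from the article; the thresholds are hypothetical): the conditional distribution of a pair of responses, given their total, is the same at any value of $\beta$.

    import math
    from itertools import product

    def rasch_probs(beta, deltas):
        # Rasch model of Table 1: Pr{X = x} proportional to
        # exp(x*beta - (delta_1 + ... + delta_x))
        z = [x * beta - sum(deltas[:x]) for x in range(len(deltas) + 1)]
        total = sum(math.exp(v) for v in z)
        return [math.exp(v) / total for v in z]

    def conditional_on_total(beta, deltas_i, deltas_j, r):
        # Pr{(x_i, x_j) | x_i + x_j = r}: by Equation 2, free of beta
        pi = rasch_probs(beta, deltas_i)
        pj = rasch_probs(beta, deltas_j)
        joint = {(xi, xj): pi[xi] * pj[xj]
                 for xi, xj in product(range(len(pi)), range(len(pj)))
                 if xi + xj == r}
        norm = sum(joint.values())
        return {pair: round(p / norm, 10) for pair, p in joint.items()}

    d_i = [-2.0, -0.5, 1.0, 1.5]   # hypothetical thresholds, item i
    d_j = [-1.5, 0.0, 0.8, 2.0]    # hypothetical thresholds, item j
    print(conditional_on_total(-1.0, d_i, d_j, r=4))
    print(conditional_on_total(+2.0, d_i, d_j, r=4))   # identical: beta cancels
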

Thurstone [5] and Guttman [28] preceded Rasch in arguing for invariance in measurement data. However, Rasch went beyond both by expressing the requirement in the form of a probabilistic, mathematical model. Consequently, mathematical derivations that have experimental implications can be made. We will see such an implication for the analysis of rating scales that is present in the experimental measurement paradigm but not in the statistical modeling paradigm.

The model to fit the data or the other way around?

Rasch analyzed data from a study of progress in reading by students in Denmark, which led him to the role of sufficiency in the separation of parameters and to the implications of invariant comparisons within a frame of reference [1]. He then formulated the model for dichotomous responses that had the property of sufficiency. He applied it to data from two existing tests. One was Raven's Progressive Matrices, a non-verbal intelligence test, and the other was a Danish military intelligence test [29]. According to the fit statistics that Rasch employed, the former test fitted the model; the latter did not [1,14].

The question arises as to which was seen to fail in the second example: the data from the intelligence test or the model characterized by sufficiency? Rasch realized that specifying a priori criteria to which responses should conform was not consistent with the standard statistical modeling paradigm (emphasis in original) [1]:

"It is tempting, therefore, in the case with deviations of one sort or other to ask whether it is the model or the test that has gone wrong. In one sense this of course turns the question upside down, but in another sense the question is meaningful. For one thing, it is not easy to believe that several cases of accordance between model and observations should be isolated occurrences. Furthermore the application of the model must have something to do with the construction of the test; at least, if a pair of tests showed results in accordance with our theory, this relationship could easily be destroyed by adding alien items to the tests. Anyhow, it may be worthwhile to know of conditions for the applicability of such relatively simple principles for evaluating test results."

The contrast between Rasch's position, which specifies an a priori criterion, and one where the fit of data to the model is the criterion is starkly shown by the following quote from Lord regarding IRT models for dichotomous responses [30]:

"The reader may ask for some a priori justification of (equations) (2-1) or (2-2). No convincing a priori justification exists (however, see chapter 3). The model must be justified on the basis of the results obtained, not on a priori grounds."

Thus, only the Rasch model of row one of Table 1 inherently sits within the experimental measurement paradigm.

An example

The issue of choice of model and implications for the empirical operation of the putative ordering of the categories goes beyond the normal tests of fit. As discussed later, even if ratings fit the Rasch model according to standard statistical tests of fit, it is possible to use the model to diagnose whether or not the empirical ordering of the categories is as required. By contrast, within an IRT and statistical modeling paradigm, the empirical ordering is not questioned. The analysis of the example in this section makes concrete this important, incompatible difference. The example also shows that the difference is not only with respect to the analyses of ratings, but also with the roles of substantive researchers and psychometricians in constructing and improving scales.


The example concerns ratings of muscle tone according to the clinician-rated Ashworth spasticity scale shown in Figure 2 [31]. The ratings were generated by a multicenter randomized controlled clinical trial of cannabinoid treatment of multiple sclerosis [32]. The study recruited 661 people with multiple sclerosis from 33 clinical sites. Ashworth scores were generated by therapists at each site who were blinded as to whether the people were in the control or treatment groups. For the purpose of this article, analyses of ratings of 656 people with responses to all items from only their first visit are reported. All analyses were conducted using RUMM2030 [33].

Muscle tone was graded for multiple muscles (ten each side: six upper limb, four lower limb). The data in the example are from eight ratings (items) of the parts of the lower limbs (hip adduction, knee extension, knee flexion and foot plantar flexion) for each side. The total score was taken as a summary of the muscle tone of the lower limbs for each person.

The following Rasch model analysis shows one set of results that are first interpreted within the experimental measurement paradigm. This interpretation is then contrasted with an IRT interpretation from within the statistical modeling paradigm.

The thresholds of the Rasch model

As shown in Figure 2, the thresholds that define categories are relevant in a rating scale. However, in order to reflect that they are not explicit in IRT, they are not shown in Figure 3. By contrast, to reflect the fact that they are explicit in RMT, the thresholds $\delta_{ix}$ are shown in the CCCs of Figure 4. These thresholds are from the Rasch model equation of the first row in Table 1. Although successive thresholds need not be equidistant, each is located at the point where the probability of a response in two adjacent categories is equal. We observe that in Figure 4, if a person is located in a category (i.e., between the two thresholds that define a category), then the response in that category has the greatest probability.

Furthermore, the Rasch model implies a latent dichotomous response at each threshold [21,34,35]. Figure 4 also shows, in dashed lines, the latent dichotomous Rasch model characteristic curves at the four thresholds. These curves are necessarily parallel in the Rasch model.

Figure 4. The latent dichotomous responses at the thresholds of the Rasch model. Dashed lines represent the latent dichotomous Rasch model characteristic curves at the four thresholds.

Other dichotomous models could be considered at the thresholds, but only the dichotomous Rasch model of Table 1 permits estimation of the thresholds independently of the distribution of the persons.

This independence of the distribution of persons is entirely compatible with the requirement of the thresholds of a measuring instrument. For example, it would be untenable in the natural sciences that the locations of a measuring instrument's thresholds were dependent on the distribution of the entities being measured.

Of course, we stress that for estimates of the thresholds to be independent of the distribution of persons in the Rasch model, the ratings must be consistent with the model to a level acceptable for the purpose. However, as is shown later, this consistency, generally referred to as fit, is not enough.

The ratings of the example did indeed fit the Rasch model according to the general criterion that the observed values of the responses for each item, given the threshold and person estimates, were not statistically different from their expected values. In addition, the distribution of the persons was well aligned, and, in particular, a substantial number of persons were located near the thresholds that are now examined closely in the following section.

Figure 5. Expected threshold characteristic curves for assessments at successive thresholds by independent therapists.
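
The latent dichotomous response noted above can also be exhibited algebraically: under the Rasch model of Table 1, the probability of the higher of two adjacent categories, conditional on the response being in one of them, is $\Pr\{X_{ni} = x \mid X_{ni} \in \{x-1, x\}\} = \exp(\beta_n - \delta_{ix})/(1 + \exp(\beta_n - \delta_{ix}))$, which is exactly the dichotomous Rasch model at threshold $\delta_{ix}$ (the dashed curves of Figure 4). A short sketch (illustrative Python, not from the article; the thresholds are hypothetical) confirms the identity numerically:

    import math

    def rasch_probs(beta, deltas):
        # polytomous Rasch model of Table 1
        z = [x * beta - sum(deltas[:x]) for x in range(len(deltas) + 1)]
        total = sum(math.exp(v) for v in z)
        return [math.exp(v) / total for v in z]

    def latent_threshold_prob(beta, deltas, x):
        # Pr{X = x | X in {x-1, x}}: latent dichotomous response at threshold x
        p = rasch_probs(beta, deltas)
        return p[x] / (p[x - 1] + p[x])

    deltas = [-3.0, -0.7, 1.3, 0.9]   # hypothetical thresholds
    for beta in (-2.0, 0.0, 2.0):
        implied = latent_threshold_prob(beta, deltas, x=3)
        direct = 1.0 / (1.0 + math.exp(-(beta - deltas[2])))  # dichotomous Rasch at delta_3
        print(round(implied, 10) == round(direct, 10))        # True at every location
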
Requirements in an inferred experimental design

Central to the study of ratings is that in the Rasch model, the latent dichotomous responses at the thresholds are equivalent in theory to observed dichotomous responses at the same thresholds, which are experimentally independent. In the case of the aforementioned Ashworth spasticity rating scale, they are equivalent to a design in which four different therapists decide independently whether or not muscle tone meets the standards at each of the four thresholds. Making this design concrete for a particular part of the limb:

• One therapist decides whether or not muscle tone meets the first threshold standard of increased tone restricting movement;

• A second therapist decides independently whether or not muscle tone meets the second threshold standard of increased tone but easily flexed;

• A third therapist decides independently whether or not muscle tone meets the third threshold standard of catch;

• Finally, a fourth therapist decides independently whether or not muscle tone meets the fourth threshold standard of normal tone.

In each case, the decision is made irrespective of the standards at the other thresholds [34,35]. In such a design, it is possible that a therapist assessing at a higher threshold (e.g., normal tone) decides that a part of a limb meets this higher standard, while the therapist assessing at the lower threshold (catch) decides that the same part of the limb does not meet the lower standard. Although inconsistent, from the independence of the responses, such pairs of responses are inherent to the design that can, as a result, provide evidence as to whether or not categories are working as intended.

Table 2. Estimates of thresholds in the Rasch model.

Item number | Item  | $\chi^2$ (≈4 df) p-value | Mean   | Threshold 1 | Threshold 2 | Threshold 3 | Threshold 4
1           | LhpAd | 0.168                    |  0.041 | -2.958      |  0.083      | 1.650       | 1.389
2           | RhpAd | 0.583                    |  0.050 | -2.958      |  0.170      | 1.603       | 1.386
3           | LknEx | 0.233                    |  0.093 | -2.654      |  0.055      | 1.787       | 1.185
4           | RknEx | 0.495                    |  0.102 | -2.665      | -0.147      | 1.681       | 1.537
5           | LknFx | 0.417                    | -0.329 | -2.748      | -0.699      | 1.282       | 0.850
6           | RknFx | 0.314                    | -0.331 | -2.697      | -0.850      | 1.129       | 1.097
7           | LftPl | 0.388                    |  0.175 | -2.211      |  0.187      | 1.606       | 1.119
8           | RftPl | 0.456                    |  0.198 | -2.221      |  0.234      | 1.605       | 1.175
Total $\chi^2$ (≈32 df) p-value: 0.340
Note: $\hat{\delta}_3 > \hat{\delta}_4$ in all items.

However, because the successive thresholds imply successively better muscle tone, it would be required that the success rate at the successive thresholds is decreasing and that the inconsistent pairs of responses exemplified above are in the minority. This decreasing success rate characterizes the category order, namely, that the same part of the limb for the same person should have a smaller probability of being rated as meeting the standard of threshold $\delta_x$ than meeting the standard at the lower threshold $\delta_{x-1}$. For example, in the Ashworth spasticity scale, because normal tone is intended to be a more exacting standard than catch, if a person's part of a limb has some probability of meeting the standard of catch at threshold 3, then the same person's same part of the limb should have a smaller probability of meeting the standard of normal tone at threshold 4.


Notice again that this ordering is a requirement of the decisions specified independently of any observed (dichotomous) decisions, and that the responses themselves may provide evidence to the contrary.

Figure 5 shows just such an expected ordering and characteristic curves at the thresholds for the above design according to the dichotomous Rasch model. As in Figure 4, it is not expected that the distances between the successive thresholds are equal; nevertheless, it is expected that they are ordered more or less as in Figure 5.

Critically, as emphasized earlier, actual ratings from the above design may not show the required threshold order. For example, if the therapist assessing at threshold $\delta_4$, normal tone, had a consistently much lower standard than intended (such as a standard closer to that of threshold $\delta_3$, catch), then the success rates at the two thresholds would appear similar, making estimates $\hat{\delta}_3$ and $\hat{\delta}_4$ also similar. Figure 6 shows the results that are likely to appear in such a case. Indeed, from the experimental design described, the estimates may even be reversed in their order, that is $\hat{\delta}_3 > \hat{\delta}_4$. In this case, inconsistent responses among the therapists would not be in the minority.

Figure 6. Threshold characteristic curves that might appear if a therapist assessing at threshold 4 was assessing at threshold 3 in error.

Evidence that threshold 4 was close to threshold 3 would indicate the presence of a rating problem, but not the source of the problem. To understand the source of the problem, it would be necessary to consult the therapists making the ratings.

To summarize so far:

• The increasing order of the thresholds is a requirement of ratings independent of any observed ratings;

• Whether observed ratings meet this requirement can be tested empirically using the Rasch model;

• If they do not meet this requirement, then it is necessary to understand substantively and empirically what has gone wrong with the empirical ratings.

It is stressed again that close or reversed threshold estimates are possible even if the judgments fit the dichotomous Rasch model according to statistical tests of fit. Fit would demonstrate that the estimates of the thresholds are invariant with respect to different degrees of muscle tone, but fit would not preclude, in the previously described experimental design, that a pair of threshold estimates are very close or reversed relative to the required order.
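
The inferred experimental design can be imitated in a small simulation (an illustrative Python sketch, not from the article; all values are hypothetical): four independent dichotomous judgments are generated for each assessment from the dichotomous Rasch model, and the success rates at the four standards are compared. With the intended, ordered standards the rates decrease from threshold 1 to threshold 4; under the Figure 6 scenario, in which the standard at threshold 4 is too lenient, the rates at thresholds 3 and 4 come out of order.

    import math, random

    random.seed(1)

    def judge(beta, delta):
        # one therapist's independent dichotomous decision at one threshold
        return random.random() < 1.0 / (1.0 + math.exp(-(beta - delta)))

    def success_rates(deltas, persons, n=20000):
        # proportion of judgments meeting each threshold standard
        counts = [0] * len(deltas)
        for _ in range(n):
            beta = random.choice(persons)
            for t, d in enumerate(deltas):
                counts[t] += judge(beta, d)
        return [round(c / n, 3) for c in counts]

    persons = [random.gauss(0.5, 1.5) for _ in range(500)]  # hypothetical sample
    intended = [-3.0, -0.5, 0.9, 1.6]   # ordered standards
    lenient4 = [-3.0, -0.5, 1.3, 0.9]   # threshold 4 set below threshold 3
    print(success_rates(intended, persons))   # strictly decreasing rates
    print(success_rates(lenient4, persons))   # rates at thresholds 3 and 4 reversed
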
Furthermore, if in an experimental design as previously described, the thresholds were close or not in the correct order, it would be unwise to proceed to a single rating scale as shown in Figure 2. It would be unwise because there could be no confidence that a rating of 4 really represents a better muscle tone than a rating of 3. In that case, choosing treatments to improve tone and studies of improvement as a result of treatment would have problems of interpretation in the region where two adjacent thresholds are very close. It is emphasized that evidence of the problem of the ordering of the thresholds is obtained from an experimental design and that the solution to the problem requires experimentation, not statistical modeling.

Because a Rasch model analysis of ratings [21,34,35] is equivalent to the design in which therapists make independent decisions at the different thresholds and the responses are analyzed according to the dichotomous Rasch model, two important properties follow:

• From a Rasch model analysis of the ratings, pairs of adjacent thresholds can be close or even reversed;

• If estimates of a pair of adjacent thresholds are close or reversed, there is a problem with the operation of the ratings defined by the thresholds.

Threshold estimates in the example

With the aforementioned background, an illustrative analysis of ratings from the Ashworth spasticity scale using the Rasch model of Table 1 is now reported. We first note that when means of five class intervals are compared with their expected values for each item to give an approximate $\chi^2$ statistic, no item showed misfit, nor did the general test of fit with all item statistics pooled. These statistics and threshold estimates are shown in Table 2.

The item characteristic curve of item 5 (left knee flexion), together with the observed means of five class intervals, is shown illustratively in Figure 7. The observed means are clearly close to their expected values. However, as indicated earlier, this statistical fit, which exploits the threshold estimates from the data to obtain the expected values, is essentially immaterial in studying the relative estimates of thresholds.

Figure 7. Item characteristic curve and observed means (plotted points) in five class intervals for item 5 (left knee flexion). CI: Class interval.

Regarding threshold estimates, Figure 8 shows the CCCs for the same item 5. The figure also shows, in dashed lines, the latent, inferred dichotomous response curves at the thresholds. It is evident that the estimates of thresholds $\delta_3$ and $\delta_4$ are not only similar, but that they are reversed: $\hat{\delta}_3 > \hat{\delta}_4$. In this case, and unlike in Figure 4, the rating of 3 (catch) never has the greatest probability. Thus a rating of 3 is not defined on the continuum. In addition, as is evident in Table 2, the reversal $\hat{\delta}_3 > \hat{\delta}_4$ is present in all items. Standard errors, which are available, are not provided for simplicity reasons and are not required to infer a problem with the reversal $\hat{\delta}_3 > \hat{\delta}_4$.

Figure 8. Category characteristic curves and latent threshold response curves for item 5. Dashed lines represent the latent dichotomous Rasch model characteristic curves at the four thresholds.

To emphasize the inference of their equivalence to independent dichotomous responses at the thresholds, Figure 9 shows the threshold curves from Figure 8 again, but with the CCCs removed.

Figure 9. Latent threshold response curves for item 5.

Interpretation from the experimental measurement paradigm of the Rasch model analysis

The most important observation, from the experimental measurement paradigm, is the closeness (and reversal) of estimates $\hat{\delta}_3$ and $\hat{\delta}_4$. Paraphrasing Rasch as mentioned earlier, the ratings, not the model, would be interpreted as having gone wrong. The fact that there is something wrong with the ratings is confirmed by the evidence that all items show exactly the same pattern $\hat{\delta}_3 > \hat{\delta}_4$. In particular, the evidence challenges the very understanding of the assumed difference between catch and normal tone.
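
The pattern reported above can be verified directly from the printed estimates. The following sketch (values transcribed from Table 2; added for illustration, not from the article) confirms that $\hat{\delta}_3 > \hat{\delta}_4$ holds for every one of the eight items and reproduces the across-item averages of 1.543 and 1.217 quoted below:

    # Threshold estimates 1-4 for each item, transcribed from Table 2
    thresholds = {
        "LhpAd": (-2.958,  0.083, 1.650, 1.389),
        "RhpAd": (-2.958,  0.170, 1.603, 1.386),
        "LknEx": (-2.654,  0.055, 1.787, 1.185),
        "RknEx": (-2.665, -0.147, 1.681, 1.537),
        "LknFx": (-2.748, -0.699, 1.282, 0.850),
        "RknFx": (-2.697, -0.850, 1.129, 1.097),
        "LftPl": (-2.211,  0.187, 1.606, 1.119),
        "RftPl": (-2.221,  0.234, 1.605, 1.175),
    }

    # The reversal is present in every item: the estimate of threshold 3
    # exceeds the estimate of threshold 4
    assert all(t[2] > t[3] for t in thresholds.values())

    avg3 = sum(t[2] for t in thresholds.values()) / len(thresholds)
    avg4 = sum(t[3] for t in thresholds.values()) / len(thresholds)
    print(round(avg3, 3), round(avg4, 3))   # 1.543 1.217
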


Furthermore, the solution to the problem needs to be substantive, empirical and experimental. The analysis cannot reveal the source of the problem, only the location of the problem [14]. As a first step, feedback needs to be sought from the clinicians carrying out the ratings; in particular, whether they experienced difficulty distinguishing between catch and normal tone. The evidence suggests that they did. Then a new definition of the categories, which attempts to distinguish better between them, might be trialed. Alternatively, perhaps the levels catch and normal cannot be distinguished either theoretically or empirically, and the two categories need to be amalgamated into one category, given a new name and again trialed. Thus, to help construct an improved rating format, the psychometrician who has analyzed the ratings needs to engage with the clinicians who have constructed and used the ratings.

Of course, other factors might contribute to the closeness of thresholds, such as few persons in the region of these thresholds, rendering the estimates very unstable. However, having few people in a region does not necessarily lead to reversed thresholds. In any case, in this example, there is a substantial number of people in the region of the reversed thresholds. Specifically, the averages across the items of the estimates of thresholds 3 and 4 are $\hat{\delta}_3 = 1.543$ and $\hat{\delta}_4 = 1.217$, and the number of persons located between 1.0 logits and 2.0 logits is 199 – that is, 30%.

A comment on the significance of the difference between thresholds may be relevant. First, if statistical significance tests are conducted, they would be one-tailed tests that $\delta_{i(x+1)} > \delta_{ix}\ \forall x \geq 1$, and if any $\hat{\delta}_{i(x+1)} < \hat{\delta}_{ix}$, the hypothesis of the required order is automatically rejected. Thus generally, it would be required that adjacent thresholds are significantly different and in the correct order. However, two features of such significance testing need to be appreciated. First, statistical significance of such tests is substantially a function of the sample size and therefore it can be contrived. Second, there is no particular distance between thresholds in the experimental design, which the Rasch model represents, that can be specified. It would be as if, in measurement, we considered that every ruler should have its thresholds exactly 1 cm apart. Clearly, this would be absurd, and different rulers used for different purposes must have different distances between their thresholds.

Figure 10. The cumulative probabilities $\pi^*_x$ and thresholds $\delta^*_x$ in the graded response model for item 5.

Table 3. Estimates of the Rasch model and graded response model thresholds for item 5.

Model                                    | Mean  | Threshold 1 | Threshold 2 | Threshold 3 | Threshold 4 | Note
Rasch model $\hat{\delta}_x$             | -0.37 | -2.99       | -0.48       | 1.14        | 0.85        | $\hat{\delta}_3 > \hat{\delta}_4$
Graded response model $\hat{\delta}^*_x$ | -0.35 | -3.05       | -0.58       | 0.76        | 1.46        | $\hat{\delta}^*_3 < \hat{\delta}^*_4$
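
The structural difference summarized in Table 3 can be seen by computing the graded response model's category curves from its ordered cumulative thresholds. The sketch below (illustrative Python, not from the article) uses the $\hat{\delta}^*_x$ estimates for item 5 from Table 3, with a discrimination of 1 assumed for simplicity; even though the $\hat{\delta}^*_x$ are in order, category 3 (catch) is never the most probable response at any location, anticipating the point made in the Difference models section below:

    import math

    def grm_probs(beta, dstar):
        # graded response model of Table 1: differences of cumulative
        # logistic curves pi*_x, with pi*_0 = 1 and pi*_{m+1} = 0
        pistar = ([1.0]
                  + [1.0 / (1.0 + math.exp(-(beta - d))) for d in dstar]
                  + [0.0])
        return [pistar[x] - pistar[x + 1] for x in range(len(dstar) + 1)]

    dstar = [-3.05, -0.58, 0.76, 1.46]   # item 5 estimates from Table 3
    # modal (highest-probability) category at locations from -5 to +5
    modal = {max(range(5), key=lambda x: grm_probs(b / 10.0, dstar)[x])
             for b in range(-50, 51)}
    print(sorted(modal))   # [0, 1, 2, 4]: category 3 is never modal
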
Thus, an excessive focus on statistical significance plays into the statistical modeling paradigm. In general, the thresholds should be correctly and clearly ordered and show a consistent structure. In the example of this paper, one pair of thresholds was slightly disordered and this was sufficient to conclude that there was a problem. However, even if those two thresholds were ordered and statistically significantly different, but only just ordered when compared with the other thresholds, as illustrated in Figure 6, it would be evidence that the category between the thresholds was not working as well as the others. In the case that all items showed exactly the same pattern, even if they were just significantly ordered, this would help in concluding that there is a problem in the format of all items in this example.

In many cases, as in the example in this article, multiple items have the same response format. However, in many cases, it may also be more consistent with the nature of the items to have different numbers of categories for different items, with the number of categories natural to the content of the item. Different items with different numbers of categories can be relevant in performance assessments that have specific protocols for the different criteria. The general features of the assessments, governed by the experimental setup and expectations, should govern a decision that ordering of the categories is not working as intended. Each case must be justified in its own right, not on some general statistical rule that is apparently independent of any case.

Interpretation from a CTT analysis

The interpretation from a CTT perspective can be very brief – the empirical ordering of the categories simply does not arise in CTT. Studies of the empirical ordering of categories have been carried out independently [36], but textbooks on CTT do not refer to this issue.

Interpretation from an IRT analysis from the statistical modeling paradigm

The divide by total models

As indicated earlier, the Rasch model is used in IRT as part of the statistical modeling paradigm. In the example in this article, it would be reported that the model fits the data and there would be no concern for the reversal of the thresholds. As a consequence, interpretations regarding locations of persons, for example before and after some intervention, would be reported confidently. It might also be noted that simply collapsing two categories, where the thresholds are reversed, in order to fit the Rasch model without further experimentation, is not consistent with the experimental measurement paradigm. The fact that the data fit the Rasch model is central to the argument of the article, but it is only a necessary condition. Post hoc adjustments of responses to fit the Rasch model are excellent exploratory tools and may be necessary in some cases before making other relevant interpretations, but they need to be backed up by relevant experimental evidence.

In the nominal response model of Table 1, there are no constraints on the estimates of $a_{ix}, b_{ix}$, and with its analysis of the ratings, the parameters would not be interpreted in terms of thresholds. They would simply be taken as descriptive parameters in the model.

Difference models

Instead of a distinct latent response at each threshold as in the Rasch model and shown in Figure 4, in the graded response model only one response process across the continuum is assumed, with this one process partitioned at thresholds post hoc. This is illustrated in Figure 10 for item 5. As a result of this structure, the thresholds in the graded response model, which are different from the thresholds in the Rasch model, are necessarily in order. Table 3 shows estimates of thresholds for item 5 of the example for the Rasch and graded response models, showing that although $\hat{\delta}_3 > \hat{\delta}_4$ from the Rasch model, $\hat{\delta}^*_3 < \hat{\delta}^*_4$ from the graded response model. Since the latter ordering is an integral property of the model, irrespective of the empirical properties of the ratings, it means that evidence from the graded response thresholds cannot tell whether or not the intended ordering of the categories is present in the ratings.

Category characteristic curves can be drawn from the graded response model. In the example, and despite the thresholds $\hat{\delta}^*_x$ being ordered, these curves will appear as in Figure 8, with catch never having the greatest probability. However, because the model accounts for the ratings, no questions follow from such a figure. Samejima [25], and many other papers which apply the graded response model, show curves as in Figure 8, and no comment is made that there are ratings that never have the greatest probability. If a comment is made, it does not follow from the logic of the graded response model, and it would be an acknowledgement of the implications that follow only from an understanding of the response structure of the Rasch model.

Contrasting responsibilities for action

Clearly, the RMT analysis in the experimental measurement paradigm identifies a problem with the rating scale format (a problem not identified in an IRT analysis in the statistical modeling paradigm). Therefore, the following question arises: who deals with the problem? One answer might be that it is the clinician/substantive researcher who uses the scales and not the statistician/psychometrician. Of course, the clinician/substantive researcher and the psychometrician/statistician may be the same person, as was often the case with Fisher. A second, more tenable, answer is that the psychometrician of necessity has to work with the clinician. To paraphrase McKenzie's [15] contrasting of Pearson's and Fisher's approaches, it is not enough for a psychometrician to find a model that fits the data and to describe the results without explaining them.

Further implications of differences in paradigms

Although they are important, the implications of the two different paradigms go beyond the analyses and inferences from ratings. Observing what might be seen as a clash of the two paradigms described and illustrated in this article, Cook argued that the models were 'just tools' and asked for less controversy in the use of these tools [37]. However, as demonstrated earlier, the way in which the tools are used can be so different that different questions about the data are answered and different outcomes supported.


Rating scales & Rasch measurement Review

This difference in paradigms manifests itself in papers and case for choosing a model is that it provides criteria that ratings
research proposals that are submitted and the reviews they should meet if they are to provide interpretable measurements.
engender. If research from one paradigm is reviewed from the The Rasch model, which arises from the criterion of invariance
perspective of the different paradigm, then a paper or a research of comparisons of persons and ratings within a specified frame
proposal is very likely to be rejected, and rejected with some pas- of reference, is such a model. This criterion sits within what is
sion and hostility, the feature observed by Cook. Controversy termed an experimental measurement paradigm. While hav-
seems inherent when a paradigm is challenged  [3] . However, if ing features in common, this article illustrates how a mutually
it is understood that there is a paradigm difference, then this exclusive question arises from the two paradigms and the appli-
passion and hostility towards excellent research within a para- cation of the respective models. Specifically, it shows that in the
digm may be overcome. It may be overcome by researchers stat- experimental measurement paradigm, ordering of the categories
ing explicitly the paradigm within which they are working and of the rating scales is tested empirically, and if ordering is not
reviewers respecting the paradigm. In some cases, reviewers may met, further experimentation is required before inferences from
declare that, because of the paradigm within which the paper the ratings are made. It also shows that in the statistical mod-
or research proposal is written, they are not in a position to eling paradigm, inferences from the ratings are made with their
review a paper. empirical ordering taken for granted, whether justified or not.
This of course requires an articulation of the different para- Therefore, with the importance of rating scales, it is considered
digms, a task inherently difficult because their respective assump- that models are not merely tools, but that the choice of model
tions are deeply embedded and taken for granted. An aim of governs the very understanding of the requirement of ratings and
this article is to help bring to the surface the presence of two the experimentation that goes into constructing them.
measure­ment paradigms in modern test theory, and to demon-
strate not only their basic difference, but also the implications of Five-year view
this difference in the case of rating scale construction. In the next 5 years, the use of rating scales for the assessment
In summary, for the purpose of the construction and validation of health outcomes will increase and therefore become more
of rating scales that may be used at different levels from clinical important. It is suggested that there will be an increasing rec-
trials to individual assessments, the article makes the case for the ognition of the paradigm difference between the application
application of the Rasch model within the experimental measure- of RMT and IRT and the potential advantages of the former
ment paradigm, and by implication, makes the case against the in which criteria that rating scales must meet are specified in
application of the graded response model within the dominant advance. With this recognition will come a stronger relationship
statistical modeling paradigm. between psychometricians and clinicians in the construction and
verification of rating scales, with the consequent better under-
Expert commentary standing of psychometrics by clinicians, and of clinical issues
In the analysis of rating scales in the social sciences, it is gener- by psychometricians.
ally considered that models of modern test theory that ostensibly
transform ratings into measurements are only tools. Furthermore, Acknowledgements
it is generally considered that the criterion for choosing the model John Zajicek from the Peninsula College of Medicine and Dentistry
as only a tool, from a broad class of models, is the one that (Plymouth, UK) permitted use of the data in the empirical example. Stefan
best accounts for the data. This criterion sits within what is Cano and two anonymous reviewers provided valuable comments for earlier
termed a statistical modeling paradigm. However, an alternative drafts of the article.

Key issues
• Rating scales in the measurement of health outcomes for variables where no instruments of the kind found in the natural sciences can
be constructed are used extensively and their use will increase.
• Modern test theory, encompassing item response theory (IRT) and Rasch measurement theory (RMT), will increasingly supersede
classical test theory in the analysis of rating scales and in the reporting of outcome measures.
• Deep differences have emerged in the analysis of rating scales between the statistical modeling paradigm and the experimental measurement paradigm, which underpin item response theory (IRT) and Rasch measurement theory (RMT), respectively: in the former, the criterion for choosing a model is how well it fits the observed ratings; in the latter, the criterion of invariance of comparisons that rating scales should meet is specified a priori in terms of a response model, and this model is a Rasch model.
• A specific critical issue in rating scales is the ordering of the categories: in IRT, it is taken for granted; in RMT, it is treated as a hypothesis, with the implication that if the ordering is not working as intended, then the response categories need to be studied and improved experimentally (a sketch of this check follows the list). Furthermore, if the empirical ordering is found not to be as required, then the clinician and psychometrician must work together to create ratings that are substantively valid and empirically ordered.
• Paradigm differences usually result in controversy and hostility towards papers submitted for publication and proposals submitted for
funding. The controversy and hostility may be overcome if authors and researchers make explicit the paradigm within which they are
working and reviewers respect the paradigms.
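To make the ordering hypothesis above concrete, the following is a minimal sketch of the check as it is conventionally formalized in RMT; the notation (β for person location, δ for item location, τ for thresholds) is assumed for this sketch rather than quoted from the article. For an item i with m + 1 ordered categories, the polytomous Rasch model gives the probability that person n responds in category x as
$$\Pr\{X_{ni}=x\}=\frac{\exp\Big(\sum_{k=1}^{x}(\beta_n-\delta_i-\tau_k)\Big)}{\sum_{j=0}^{m}\exp\Big(\sum_{k=1}^{j}(\beta_n-\delta_i-\tau_k)\Big)},\qquad x=0,1,\dots,m,$$
where the empty sum for $x=0$ is defined as zero. The a priori requirement is that the estimated thresholds satisfy $\hat{\tau}_1<\hat{\tau}_2<\cdots<\hat{\tau}_m$; reversed (disordered) threshold estimates are the empirical evidence that the categories are not working as intended, and that the response format should be revised experimentally before inferences are made from the ratings.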

Financial & competing interests disclosure
The author has no relevant affiliations or financial involvement with any organization or entity with a financial interest in or financial conflict with the subject matter or materials discussed in the manuscript. This includes employment, consultancies, honoraria, stock ownership or options, expert testimony, grants or patents received or pending, or royalties. No writing assistance was utilized in the production of this manuscript.
