Scale: Comparative and Non-Comparative Scaling; Composite Measures


Scale

In the social sciences, scaling is the process of measuring or ordering entities with respect to
quantitative attributes or traits. For example, a scaling technique might involve estimating
individuals' levels of extraversion, or the perceived quality of products. Certain methods of
scaling permit estimation of magnitudes on a continuum, while other methods provide only for
relative ordering of the entities.
The level of measurement is the type of data that is measured.
The word scale is sometimes (including in academic literature) used to refer to another type of
composite measure, the index. The two concepts are, however, different.[1]
Comparative and non-comparative scaling
With comparative scaling, the items are directly compared with each other (example: Do you
prefer Pepsi or Coke?). In non-comparative scaling, each item is scaled independently of the
others (example: How do you feel about Coke?).
Composite measures
Composite measures of variables are created by combining two or more separate empirical
indicators into a single measure. Composite measures measure complex concepts more
adequately than single indicators, extend the range of scores available and are more efficient at
handling multiple items.
In addition to scales, there are two other types of composite measures. Indexes are similar to
scales, except that multiple indicators of a variable are combined into a single measure. The index of
consumer confidence, for example, is a combination of several measures of consumer attitudes.
A typology is similar to an index except the variable is measured at the nominal level.
Indexes are constructed by accumulating scores assigned to individual attributes, while scales are
constructed through the assignment of scores to patterns of attributes.
While indexes and scales provide measures of a single dimension, typologies are
often employed to examine the intersection of two or more dimensions. Typologies are very
useful analytical tools and can easily be used as independent variables, although, since they are
not unidimensional, they are difficult to use as dependent variables.
Data types
Main article: Level of measurement
The type of information collected can influence scale construction. Different types of
information are measured in different ways.
1. Some data are measured at the nominal level. That is, any numbers used are mere labels;
they express no mathematical properties. Examples are SKU inventory codes and UPC
bar codes.
2. Some data are measured at the ordinal level. Numbers indicate the relative position of
items, but not the magnitude of difference. An example is a preference ranking.
3. Some data are measured at the interval level. Numbers indicate the magnitude of
difference between items, but there is no absolute zero point. Examples are attitude scales
and opinion scales.
4. Some data are measured at the ratio level. Numbers indicate magnitude of difference and
there is a fixed zero point. Ratios can be calculated. Examples include: age, income,
price, costs, sales revenue, sales volume, and market share.
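As a quick illustration of which summary statistics each level supports, here is a minimal Python sketch; the data and labels are hypothetical:

```python
# Which summary statistics are permissible at each of Stevens' four levels.
from statistics import mode, median, mean

nominal = ["SKU-A", "SKU-B", "SKU-A", "SKU-C"]   # labels only, no order
ordinal = [1, 2, 2, 3, 5]                        # ranks: order, not distance
interval = [18.5, 20.0, 21.5, 23.0]              # e.g. degrees Celsius: no true zero
ratio = [12.0, 15.5, 9.0, 30.0]                  # e.g. prices: a true zero exists

print(mode(nominal))             # mode is the only central tendency for nominal data
print(median(ordinal))           # median (and mode) become meaningful for ordinal data
print(mean(interval))            # the mean additionally becomes meaningful for interval data
print(max(ratio) / min(ratio))   # ratios are meaningful only at the ratio level
```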
Scale construction decisions
 What level of data is involved (nominal, ordinal, interval, or ratio)?
 What will the results be used for?
 Should you use a scale, index, or typology?
 What types of statistical analysis would be useful?
 Should you use a comparative scale or a noncomparative scale?
 How many scale divisions or categories should be used (1 to 10; 1 to 7; −3 to +3)?
 Should there be an odd or even number of divisions? (Odd gives neutral center value;
even forces respondents to take a non-neutral position.)
 What should the nature and descriptiveness of the scale labels be?
 What should the physical form or layout of the scale be? (graphic, simple linear, vertical,
horizontal)
 Should a response be forced or be left optional?
Comparative scaling techniques
 Pairwise comparison scale – a respondent is presented with two items at a time and asked
to select one (example: Do you prefer Pepsi or Coke?). This is an ordinal level technique
when a measurement model is not applied. Krus and Kennedy (1977) elaborated the
paired comparison scaling within their domain-referenced model. The Bradley–Terry–
Luce (BTL) model (Bradley and Terry, 1952; Luce, 1959) can be applied in order to
derive measurements provided the data derived from paired comparisons possess an
appropriate structure. Thurstone's Law of comparative judgment can also be applied in
such contexts.
 Rasch model scaling – respondents interact with items and comparisons are inferred
between items from the responses to obtain scale values. Respondents are subsequently
also scaled based on their responses to items given the item scale values. The Rasch
model has a close relation to the BTL model.
 Rank-ordering – a respondent is presented with several items simultaneously and asked
to rank them (example: Rank the following advertisements from 1 to 10.). This is an
ordinal level technique.
 Bogardus social distance scale – measures the degree to which a person is willing to
associate with a class or type of people. It asks how willing the respondent is to make
various associations. The results are reduced to a single score on a scale. There are also
non-comparative versions of this scale.
 Q-Sort – Up to 140 items are sorted into groups based on a rank-order procedure.
 Guttman scale – This is a procedure to determine whether a set of items can be rank-
ordered on a unidimensional scale. It utilizes the intensity structure among several
indicators of a given variable. Statements are listed in order of importance. The rating is
scaled by summing all responses until the first negative response in the list (see the first
sketch following this list). The Guttman
scale is related to Rasch measurement; specifically, Rasch models bring the Guttman
approach within a probabilistic framework.
 Constant sum scale – a respondent is given a constant sum of money, scrip, credits, or
points and asked to allocate these to various items (example: If you had 100 yen to
spend on food products, how much would you spend on product A, on product B, on
product C, etc.?). This is an ordinal level technique.

 Magnitude estimation scale – In a psychophysics procedure invented by S. S. Stevens,
people simply assign numbers to the dimension of judgment. The geometric mean of
those numbers usually produces a power law with a characteristic exponent. In cross-
modality matching, instead of assigning numbers, people manipulate another dimension,
such as loudness or brightness, to match the items. Typically the exponent of the
psychometric function can be predicted from the magnitude estimation exponents of each
dimension (see the second sketch following this list).
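Two of the techniques above lend themselves to brief illustration in Python. First, a minimal sketch of the Guttman summing rule, assuming a hypothetical boolean encoding of responses ordered from least to most extreme item:

```python
def guttman_score(responses):
    # Count affirmative responses up to (not including) the first negative
    # response, per the summing rule described above. `responses` is ordered
    # from the least to the most extreme statement (hypothetical encoding).
    score = 0
    for agreed in responses:
        if not agreed:
            break
        score += 1
    return score

print(guttman_score([True, True, True, False, True]))  # -> 3
```

Second, a sketch of recovering a characteristic power-law exponent from magnitude estimates by least squares on log-log coordinates; the stimulus and judgment values are invented for illustration:

```python
import math

stimulus = [10, 20, 40, 80, 160]           # physical magnitudes (hypothetical)
judgment = [8.1, 12.3, 19.0, 28.8, 44.5]   # geometric-mean magnitude estimates

# Fit psi = k * phi**a: the exponent a is the slope in log-log coordinates.
logx = [math.log(s) for s in stimulus]
logy = [math.log(j) for j in judgment]
n = len(logx)
mx, my = sum(logx) / n, sum(logy) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(logx, logy))
         / sum((x - mx) ** 2 for x in logx))
print(f"estimated exponent a = {slope:.2f}")   # ~0.6 for these invented data
```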
Non-comparative scaling techniques
 Continuous rating scale (also called the graphic rating scale) – respondents rate items
by placing a mark on a line. The line is usually labeled at each end. There are sometimes
a series of numbers, called scale points (say, from zero to 100), under the line. Scoring
and codification are difficult.
 Likert scale – Respondents are asked to indicate the amount of agreement or
disagreement (from strongly agree to strongly disagree) on a five- to nine-point scale.
The same format is used for multiple questions. This categorical scaling procedure can
easily be extended to a magnitude estimation procedure that uses the full scale of
numbers rather than verbal categories.
 Phrase completion scales – Respondents are asked to complete a phrase on an 11-point
response scale in which 0 represents the absence of the theoretical construct and 10
represents the theorized maximum amount of the construct being measured. The same
basic format is used for multiple questions.
 Semantic differential scale – Respondents are asked to rate an item on various attributes
using a seven-point scale. Each attribute requires a scale with bipolar terminal labels.
 Stapel scale – This is a unipolar ten-point rating scale. It ranges from +5 to −5 and has no
neutral zero point.
 Thurstone scale – This is a scaling technique that incorporates the intensity structure
among indicators.
 Mathematically derived scale – Researchers infer respondents’ evaluations
mathematically. Two examples are multidimensional scaling and conjoint analysis.
Scale evaluation
Scales should be tested for reliability, generalizability, and validity. Generalizability is the ability
to make inferences from a sample to the population, given the scale you have selected.
Reliability is the extent to which a scale will produce consistent results. Test-retest reliability
checks how similar the results are if the research is repeated under similar circumstances.
Alternative forms reliability checks how similar the results are if the research is repeated using
different forms of the scale. Internal consistency reliability checks how well the individual
measures included in the scale are converted into a composite measure.
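A common index of internal consistency is Cronbach's alpha. The text above does not name a particular statistic, so the following Python sketch is an illustrative choice, with hypothetical item responses:

```python
def cronbach_alpha(rows):
    # rows: one list of item scores per respondent (hypothetical Likert data).
    # alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)
    k = len(rows[0])
    def var(xs):                         # population variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)
    item_vars = [var([row[i] for row in rows]) for i in range(k)]
    total_var = var([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

data = [[4, 5, 4], [2, 3, 2], [5, 5, 4], [3, 4, 3]]
print(round(cronbach_alpha(data), 3))   # high alpha: the items move together
```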
Scales and indexes have to be validated. Internal validation checks the relation between the
individual measures included in the scale, and the composite scale itself. External validation
checks the relation between the composite scale and other indicators of the variable, indicators
not included in the scale. Content validation (also called face validity) checks how well the scale
measures what it is supposed to measure. Criterion validation checks how meaningful the scale
criteria are relative to other possible criteria. Construct validation checks what underlying
construct is being measured. There are three variants of construct validity: convergent
validity, discriminant validity, and nomological validity (Campbell and Fiske, 1959; Krus and
Ney, 1978). The coefficient of reproducibility indicates how well the data from the individual
measures included in the scale can be reconstructed from the composite scale.

Level of measurement
In statistics and quantitative research methodology, various attempts have been made to classify
variables (or types of data) and thereby develop a taxonomy of levels of measurement or scales
of measure. Perhaps the best known are those developed by the psychologist Stanley Smith
Stevens. He proposed four types: nominal, ordinal, interval, and ratio.
Typology
Stevens proposed his typology in a 1946 Science article titled "On the theory of scales of
measurement".[1] In that article, Stevens claimed that all measurement in science was conducted
using four different types of scales that he called "nominal," "ordinal," "interval," and "ratio,"
unifying both "qualitative" (which are described by his "nominal" type) and "quantitative" (to a
different degree, all the rest of his scales). The concept of scale types later received the
mathematical rigour that it lacked at its inception with the work of mathematical psychologists
Theodore Alper (1985, 1987), Louis Narens (1981a, b), and R. Duncan Luce (1986, 1987, 2001).
As Luce (1997, p. 395) wrote:
“S. S. Stevens (1946, 1951, 1975) claimed that what counted was having an interval or ratio
scale. Subsequent research has given meaning to this assertion, but given his attempts to
invoke scale type ideas it is doubtful if he understood it himself ... no measurement
theorist I know accepts Stevens' broad definition of measurement ... in our view, the only
sensible meaning for 'rule' is empirically testable laws about the attribute.”
Nominal scale
The nominal type differentiates between items or subjects based only on their names or
(meta-)categories and other qualitative classifications they belong to; thus dichotomous data
involves the construction of classifications as well as the classification of items. Discovery of an
exception to a classification can be viewed as progress. Numbers may be used to represent the
variables but the numbers do not have numerical value or relationship.
Examples of these classifications include gender, nationality, ethnicity, language, genre, style,
biological species, and form.[2][3] In a university one could also use hall of affiliation as an
example. Other concrete examples are
 in grammar, the parts of speech: noun, verb, preposition, article, pronoun, etc.
 in politics, power projection: hard power, soft power, etc.
 in biology, the three domains of life: Archaea, Bacteria, and Eukarya
Nominal scales were often called qualitative scales, and measurements made on qualitative
scales were called qualitative data. However, the rise of qualitative research has made this usage
confusing.
Mathematical operations
Set membership, classification, categorical equality, and equivalence are all operations which
apply to objects of the nominal type.
Central tendency

The mode, i.e. the most common item, is allowed as the measure of central tendency for the
nominal type. On the other hand, the median, i.e. the middle-ranked item, makes no sense for the
nominal type of data since ranking is meaningless for the nominal type.
Percentage
Percentages can be used to determine or develop a comparison of the classifications.

Ordinal scale
The ordinal type allows for rank order (1st, 2nd, 3rd, etc.) by which data can be sorted, but still
does not allow for relative degree of difference between them. Examples include, on one hand,
dichotomous data with dichotomous (or dichotomized) values such as 'sick' vs. 'healthy' when
measuring health, 'guilty' vs. 'innocent' when making judgments in courts, 'wrong/false' vs.
'right/true' when measuring truth value, and, on the other hand, non-dichotomous data consisting
of a spectrum of values, such as 'completely agree', 'mostly agree', 'mostly disagree', 'completely
disagree' when measuring opinion.
Central tendency
The median, i.e. middle-ranked, item is allowed as the measure of central tendency; however, the
mean (or average) as the measure of central tendency is not allowed. The mode is allowed.
In 1946, Stevens observed that psychological measurement, such as measurement of opinions,
usually operates on ordinal scales; thus means and standard deviations have no validity, but they
can be used to get ideas for how to improve operationalization of variables used in
questionnaires. Most psychological data collected by psychometric instruments and tests,
measuring cognitive and other abilities, are ordinal, although some theoreticians have argued
they can be treated as interval or ratio scales. However, there is little prima facie evidence to
suggest that such attributes are anything more than ordinal (Cliff, 1996; Cliff & Keats, 2003;
Michell, 2008).[4] In particular,[5] IQ scores reflect an ordinal scale, in which all scores are
meaningful for comparison only.[6][7][8] There is no absolute zero, and a 10-point difference may
carry different meanings at different points of the scale.[9][10]
Interval scale
The interval type allows for the degree of difference between items, but not the ratio between
them. Examples include temperature with the Celsius scale, which has an arbitrarily-defined zero
point (the freezing point of a particular substance under particular conditions), date when
measured from an arbitrary epoch (such as AD) and direction measured in degrees from true or
magnetic north. Ratios are not allowed since 20 °C cannot be said to be "twice as hot" as 10 °C,
nor can multiplication/division be carried out between any two dates directly. However, ratios of
differences can be expressed; for example, one difference can be twice another. Interval type
variables are sometimes also called "scaled variables", but the formal mathematical term is an
affine space (in this case an affine line).
Central tendency and statistical dispersion
The mode, median, and arithmetic mean are allowed to measure central tendency of interval
variables, while measures of statistical dispersion include range and standard deviation. Since
one can only divide by differences, one cannot define measures that require some ratios, such as
the coefficient of variation. More subtly, while one can define moments about the origin, only
central moments are meaningful, since the choice of origin is arbitrary. One can define
standardized moments, since ratios of differences are meaningful, but one cannot define the
coefficient of variation, since the mean is a moment about the origin, unlike the standard
deviation, which is (the square root of) a central moment.
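The restriction to ratios of differences is easy to see numerically. In this minimal sketch, an affine change of origin and unit (a hypothetical Celsius-to-Fahrenheit conversion) alters the coefficient of variation, while standardized values, which are ratios of differences, are unchanged:

```python
from statistics import mean, stdev

celsius = [10.0, 15.0, 20.0, 25.0]
fahrenheit = [1.8 * c + 32 for c in celsius]   # same temperatures, new origin and unit

def zscores(xs):
    m, s = mean(xs), stdev(xs)
    return [round((x - m) / s, 6) for x in xs]

print(stdev(celsius) / mean(celsius))           # coefficient of variation...
print(stdev(fahrenheit) / mean(fahrenheit))     # ...differs: it depends on the arbitrary origin
print(zscores(celsius) == zscores(fahrenheit))  # True: ratios of differences survive
```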
Ratio scale
The ratio type takes its name from the fact that measurement is the estimation of the ratio
between a magnitude of a continuous quantity and a unit magnitude of the same kind (Michell,
1997, 1999). A ratio scale possesses a meaningful (unique and non-arbitrary) zero value. Most
measurement in the physical sciences and engineering is done on ratio scales. Examples include
mass, length, duration, plane angle, energy and electric charge. Ratios are allowed because
having a non-arbitrary zero point makes it meaningful to say, for example, that one object has
"twice the length" of another (= is "twice as long"). Very informally, many ratio scales can be
described as specifying "how much" of something (i.e. an amount or magnitude) or "how many"
(a count). The Kelvin temperature scale is a ratio scale because it has a unique, non-arbitrary
zero point called absolute zero.
Central tendency and statistical dispersion
The geometric mean and the harmonic mean are allowed to measure the central tendency, in
addition to the mode, median, and arithmetic mean. The studentized range and the coefficient of
variation are allowed to measure statistical dispersion. All statistical measures are allowed
because all necessary mathematical operations are defined for the ratio scale.
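For instance, a minimal sketch using Python's standard statistics module on a hypothetical ratio-level variable (task durations in seconds):

```python
from statistics import geometric_mean, harmonic_mean, mean, stdev

durations = [2.0, 4.0, 8.0]                 # ratio level: a true zero exists

print(geometric_mean(durations))            # 4.0
print(harmonic_mean(durations))             # ~3.43
print(stdev(durations) / mean(durations))   # coefficient of variation is meaningful here
```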
Debate on typology
While Stevens' typology is widely adopted, it is still being challenged by other theoreticians,
particularly in the cases of the nominal and ordinal types (Michell, 1986).[11]
Duncan (1986) objected to the use of the word measurement in relation to the nominal type, but
Stevens (1975) said of his own definition of measurement that "the assignment can be any
consistent rule. The only rule not allowed would be random assignment, for randomness amounts
in effect to a nonrule". However, so-called nominal measurement involves arbitrary assignment,
and the "permissible transformation" is any number for any other. This is one of the points made
in Lord's (1953) satirical paper On the Statistical Treatment of Football Numbers.[12]
The use of the mean as a measure of the central tendency for the ordinal type is still debatable
among those who accept Stevens' typology. Many behavioural scientists use the mean for ordinal
data, anyway. This is often justified on the basis that the ordinal type in behavioural science is in
fact somewhere between the true ordinal and interval types; although the interval difference
between two ordinal ranks is not constant, it is often of the same order of magnitude. For
example, applications of measurement models in educational contexts often indicate that total
scores have a fairly linear relationship with measurements across the range of an assessment.
Thus, some argue that so long as the unknown interval difference between ordinal scale ranks is
not too variable, interval scale statistics such as means can meaningfully be used on ordinal scale
variables. Statistical analysis software such as SPSS requires the user to select the appropriate
measurement class for each variable. This ensures that subsequent user errors cannot
inadvertently perform meaningless analyses (for example correlation analysis with a variable on
a nominal level).
L. L. Thurstone made progress toward developing a justification for obtaining the interval type,
based on the law of comparative judgment. A common application of the law is the analytic
hierarchy process. Further progress was made by Georg Rasch (1960), who developed the
probabilistic Rasch model that provides a theoretical basis and justification for obtaining
interval-level measurements from counts of observations such as total scores on assessments.
Another issue is derived from Nicholas R. Chrisman's article "Rethinking Levels of
Measurement for Cartography",[13] in which he introduces an expanded list of levels of
measurement to account for various measurements that do not necessarily fit with the traditional
notions of levels of measurement. Measurements bound to a range and repeating (like degrees in
a circle, clock time, etc.), graded membership categories, and other types of measurement do not
fit Stevens's original work, leading to the introduction of six new levels of measurement, for a
total of ten: (1) Nominal, (2) Graded membership, (3) Ordinal, (4) Interval, (5) Log-Interval, (6)
Extensive Ratio, (7) Cyclical Ratio, (8) Derived Ratio, (9) Counts and finally (10) Absolute. The
extended levels of measurement are rarely used outside of academic geography.
Scale types and Stevens' "operational theory of measurement"
The theory of scale types is the intellectual handmaiden to Stevens' "operational theory of
measurement", which was to become definitive within psychology and the behavioral sciences,
despite Michell's characterization of it as being quite at odds with measurement in the
natural sciences (Michell, 1999). Essentially, the operational theory of measurement was a
reaction to the conclusions of a committee established in 1932 by the British Association for the
Advancement of Science to investigate the possibility of genuine scientific measurement in the
psychological and behavioral sciences. This committee, which became known as the Ferguson
committee, published a Final Report (Ferguson, et al., 1940, p. 245) in which Stevens' sone scale
(Stevens & Davis, 1938) was an object of criticism:
“…any law purporting to express a quantitative relation between sensation intensity and
stimulus intensity is not merely false but is in fact meaningless unless and until a meaning
can be given to the concept of addition as applied to sensation.”
That is, if Stevens' sone scale genuinely measured the intensity of auditory sensations, then
evidence for such sensations as being quantitative attributes needed to be produced. The
evidence needed was the presence of additive structure – a concept comprehensively treated by
the German mathematician Otto Hölder (Hölder, 1901). Given that the physicist and
measurement theorist Norman Robert Campbell dominated the Ferguson committee's
deliberations, the committee concluded that measurement in the social sciences was impossible
due to the lack of concatenation operations. This conclusion was later rendered false by the
discovery of the theory of conjoint measurement by Debreu (1960) and independently by Luce &
Tukey (1964). However, Stevens' reaction was not to conduct experiments to test for the
presence of additive structure in sensations, but instead to render the conclusions of the Ferguson
committee null and void by proposing a new theory of measurement:
“Paraphrasing N. R. Campbell (Final Report, p. 340), we may say that measurement, in the
broadest sense, is defined as the assignment of numerals to objects and events according to
rules (Stevens, 1946, p. 677).”
Stevens was greatly influenced by the ideas of another Harvard academic, the Nobel laureate
physicist Percy Bridgman (1927), whose doctrine of operationism Stevens used to define
measurement. In Stevens' definition, for example, it is the use of a tape measure that defines
length (the object of measurement) as being measurable (and so by implication quantitative).
Critics of operationism object that it mistakes the relations between two objects or events for
properties of one of those objects or events (Hardcastle, 1995; Michell, 1999; Moyer, 1981a,b;
Rogers, 1989).
The Canadian measurement theorist William Rozeboom (1966) was an early and trenchant critic
of Stevens' theory of scale types.

Pairwise comparison
This article is about pairwise comparisons in psychology. For statistical analysis of paired
comparisons, see paired difference test.
Pairwise comparison generally refers to any process of comparing entities in pairs to judge
which entity of each pair is preferred, or has a greater amount of some quantitative property. The
method of pairwise comparison is used in the scientific study of preferences, attitudes, voting
systems, social choice, public choice, and multiagent AI systems. In psychology literature, it is
often referred to as paired comparison.
Prominent psychometrician L. L. Thurstone first introduced a scientific approach to using
pairwise comparisons for measurement in 1927, which he referred to as the law of comparative
judgment. Thurstone linked this approach to psychophysical theory developed by Ernst Heinrich
Weber and Gustav Fechner. Thurstone demonstrated that the method can be used to order items
along a dimension such as preference or importance using an interval-type scale.
Overview
If an individual or organization expresses a preference between two mutually distinct
alternatives, this preference can be expressed as a pairwise comparison. If the two alternatives
are x and y, the following are the possible pairwise comparisons:
The agent prefers x over y: "x > y" or "xPy"
The agent prefers y over x: "y > x" or "yPx"
The agent is indifferent between both alternatives: "x = y" or "xIy"
Probabilistic models
In terms of modern psychometric theory, Thurstone's approach, called the law of comparative
judgment, is more aptly regarded as a measurement model. The Bradley–Terry–Luce (BTL)
model (Bradley & Terry, 1952; Luce, 1959) is often applied to pairwise comparison data to scale
preferences. The BTL model is identical to Thurstone's model if the simple logistic function is
used. Thurstone used the normal distribution in applications of the model. The simple logistic
function varies by less than 0.01 from the cumulative normal ogive across the range, given an
arbitrary scale factor.
In the BTL model, the probability that object j is judged to have more of an attribute than object i
is:

$$P(X_{ij} = 1) = \sigma(\delta_j - \delta_i) = \frac{\exp(\delta_j - \delta_i)}{1 + \exp(\delta_j - \delta_i)},$$

where $\delta_i$ is the scale location of object $i$ and $\sigma$ is the simple logistic function (the inverse of the logit). For
example, the scale location might represent the perceived quality of a product, or the perceived
weight of an object.
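As a minimal sketch, assuming hypothetical scale locations, this probability can be computed directly from the logistic function:

```python
import math

def btl_prob(delta_j, delta_i):
    # Probability that object j is judged to have more of the attribute than
    # object i: the logistic function of the difference in scale locations.
    return 1.0 / (1.0 + math.exp(-(delta_j - delta_i)))

locations = {"A": 0.0, "B": 0.8, "C": 1.5}        # hypothetical scale locations
print(btl_prob(locations["B"], locations["A"]))   # ~0.69: B is judged over A
print(btl_prob(locations["C"], locations["A"]))   # ~0.82: C is judged over A
```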
The BTL is very closely related to the Rasch model for measurement.
Thurstone used the method of pairwise comparisons as an approach to measuring perceived
intensity of physical stimuli, attitudes, preferences, choices, and values. He also studied
implications of the theory he developed for opinion polls and political voting (Thurstone, 1959).

Transitivity
For a given decision agent, if the information, objective, and alternatives used by the agent
remain constant, then it is generally assumed that pairwise comparisons over those alternatives
by the decision agent are transitive. Most agree upon what transitivity is, though there is debate
about the transitivity of indifference. The rules of transitivity are as follows for a given decision
agent.
If xPy and yPz, then xPz
If xPy and yIz, then xPz
If xIy and yPz, then xPz
If xIy and yIz, then xIz
This corresponds to (xPy or xIy) being a total preorder, P being the corresponding strict weak
order, and I being the corresponding equivalence relation.
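A brute-force check of these four rules over a small, hypothetical set of judgments might look like the following sketch:

```python
from itertools import permutations

alternatives = ["a", "b", "c"]
prefers = {("a", "b"), ("b", "c"), ("a", "c")}   # xPy pairs (hypothetical judgments)
indiff = set()                                   # xIy pairs, stored as frozensets

def P(x, y): return (x, y) in prefers
def I(x, y): return frozenset((x, y)) in indiff

def transitivity_violation():
    # Return the first triple violating one of the four rules, else None.
    for x, y, z in permutations(alternatives, 3):
        if P(x, y) and P(y, z) and not P(x, z): return (x, y, z)
        if P(x, y) and I(y, z) and not P(x, z): return (x, y, z)
        if I(x, y) and P(y, z) and not P(x, z): return (x, y, z)
        if I(x, y) and I(y, z) and not I(x, z): return (x, y, z)
    return None

print(transitivity_violation())   # None: these judgments are transitive
```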
Probabilistic models require transitivity only within the bounds of errors of estimates of scale
locations of entities. Thus, decisions need not be deterministically transitive in order to apply
probabilistic models. However, transitivity will generally hold for a large number of comparisons
if models such as the BTL can be effectively applied.
Using a transitivity test,[1] one can investigate whether a data set of pairwise comparisons contains
a higher degree of transitivity than expected by chance.
Argument for intransitivity of indifference
Some contend that indifference is not transitive. Consider the following example. Suppose you
like apples and you prefer apples that are larger. Now suppose there exists an apple A, an apple
B, and an apple C which have identical intrinsic characteristics except for the following. Suppose
B is larger than A, but it is not discernible without an extremely sensitive scale. Further suppose
C is larger than B, but this also is not discernible without an extremely sensitive scale. However,
the difference in sizes between apples A and C is large enough that you can discern that C is
larger than A without a sensitive scale. In psychophysical terms, the size difference between A
and C is above the just noticeable difference ('jnd') while the size differences between A and B
and B and C are below the jnd.
You are confronted with the three apples in pairs without the benefit of a sensitive scale.
Therefore, when presented A and B alone, you are indifferent between apple A and apple B; and
you are indifferent between apple B and apple C when presented B and C alone. However, when
the pair A and C are shown, you prefer C over A.
Preference orders
If pairwise comparisons are in fact transitive in respect to the four mentioned rules, then pairwise
comparisons for a list of alternatives (A1, A2, A3, ..., An−1, and An) can take the form:
A1(>XOR=)A2(>XOR=)A3(>XOR=) ... (>XOR=)An−1(>XOR=)An
For example, if there are three alternatives a, b, and c, then the possible preference orders are:

a > b > c
a > c > b
b > a > c
b > c > a
c > a > b
c > b > a
a > b = c
a = b > c
b > a = c
a = c > b
c > a = b
b = c > a
a = b = c
If the number of alternatives is n, and indifference is not allowed, then the number of possible
preference orders for any given n-value is n!. If indifference is allowed, then the number of
possible preference orders is the number of total preorders. It can be expressed as a function of n:

$$\sum_{k=1}^{n} k! \, S_2(n, k),$$

where $S_2(n, k)$ is the Stirling number of the second kind.
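The count is easy to verify. This sketch computes $S_2(n, k)$ by the standard recurrence and reproduces the total of 13 found above for three alternatives:

```python
from math import factorial
from functools import lru_cache

@lru_cache(maxsize=None)
def stirling2(n, k):
    # Stirling numbers of the second kind via the standard recurrence.
    if n == k:
        return 1
    if k == 0 or k > n:
        return 0
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)

def total_preorders(n):
    # Number of preference orders over n alternatives when indifference
    # is allowed: sum over k of k! * S2(n, k).
    return sum(factorial(k) * stirling2(n, k) for k in range(1, n + 1))

print(total_preorders(3))   # 13, matching the list above
print(total_preorders(4))   # 75
```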


Applications
One important application of pairwise comparisons is the widely used Analytic Hierarchy
Process, a structured technique for helping people deal with complex decisions. It uses pairwise
comparisons of tangible and intangible factors to construct ratio scales that are useful in making
important decisions.[2][3]

Law of comparative judgment


The law of comparative judgment was conceived by L. L. Thurstone. In modern-day
terminology, it is more aptly described as a model that is used to obtain measurements from any
process of pairwise comparison. Examples of such processes are the comparison of perceived
intensity of physical stimuli, such as the weights of objects, and comparisons of the extremity of
an attitude expressed within statements, such as statements about capital punishment. The
measurements represent how we perceive objects, rather than being measurements of actual
physical properties. This kind of measurement is the focus of psychometrics and psychophysics.
In somewhat more technical terms, the law of comparative judgment is a mathematical
representation of a discriminal process, which is any process in which a comparison is made
between pairs of a collection of entities with respect to magnitudes of an attribute, trait, attitude,
and so on. The theoretical basis for the model is closely related to item response theory and the
theory underlying the Rasch model, which are used in psychology and education to analyse data
from questionnaires and tests.
Background
Thurstone published a paper on the law of comparative judgment in 1927. In this paper he
introduced the underlying concept of a psychological continuum for a particular 'project in
measurement' involving the comparison between a series of stimuli, such as weights and
handwriting specimens, in pairs. He soon extended the domain of application of the law of
comparative judgment to things that have no obvious physical counterpart, such as attitudes and
values (Thurstone, 1929). For example, in one experiment, people compared statements about
capital punishment to judge which of each pair expressed a stronger positive (or negative)
attitude.
The essential idea behind Thurstone's process and model is that it can be used to scale a
collection of stimuli based on simple comparisons between stimuli two at a time: that is, based
on a series of pairwise comparisons. For example, suppose that someone wishes to measure the
perceived weights of a series of five objects of varying masses. By having people compare the
weights of the objects in pairs, data can be obtained and the law of comparative judgment
applied to estimate scale values of the perceived weights. This is the perceptual counterpart to
the physical weight of the objects. That is, the scale represents how heavy people perceive the
objects to be based on the comparisons.
Although Thurstone referred to it as a law, as stated above, in terms of modern psychometric
theory the 'law' of comparative judgment is more aptly described as a measurement model. It
represents a general theoretical model which, applied in a particular empirical context,
constitutes a scientific hypothesis regarding the outcomes of comparisons between some
collection of objects. If data agree with the model, it is possible to produce a scale from the data.
Relationships to pre-existing psychophysical theory
Thurstone showed that in terms of his conceptual framework, Weber's law and the so-called
Weber-Fechner law, which are generally regarded as one and the same, are independent, in the
sense that one may be applicable but not the other to a given collection of experimental data. In
particular, Thurstone showed that if Fechner's law applies and the discriminal dispersions
associated with stimuli are constant (as in Case 5 of the LCJ outlined below), then Weber's law
will also be verified. He considered that the Weber-Fechner law and the LCJ both involve a
linear measurement on a psychological continuum whereas Weber's law does not.
Weber's law essentially states that how much people perceive physical stimuli to change depends
on how big a stimulus is. For example, if someone compares a light object of 1 kg with one
slightly heavier, they can notice a relatively small difference, perhaps when the second object is
1.2 kg. On the other hand, if someone compares a heavy object of 30 kg with a second, the
second must be quite a bit larger for a person to notice the difference, perhaps when the second
object is 36 kg. People tend to perceive differences that are proportional to the size rather than
always noticing a specific difference irrespective of the size. The same applies to brightness,
pressure, warmth, loudness and so on.
Thurstone stated Weber's law as follows: "The stimulus increase which is correctly discriminated
in any specified proportion of attempts (except 0 and 100 per cent) is a constant fraction of the
stimulus magnitude" (Thurstone, 1959, p. 61). He considered that Weber's law said nothing
directly about sensation intensities at all. In terms of Thurstone's conceptual framework, the
association posited between perceived stimulus intensity and the physical magnitude of the
stimulus in the Weber-Fechner law will only hold when Weber's law holds and the just
noticeable difference (JND) is treated as a unit of measurement. Importantly, this is not simply
given a priori (Michell, 1997, p. 355), as is implied by purely mathematical derivations of the
one law from the other. It is, rather, an empirical question whether measurements have been
obtained; one which requires justification through the process of stating and testing a well-
defined hypothesis in order to ascertain whether specific theoretical criteria for measurement
have been satisfied. Some of the relevant criteria were articulated by Thurstone, in a preliminary
fashion, including what he termed the additivity criterion. Accordingly, from the point of view of
Thurstone's approach, treating the JND as a unit is justifiable provided only that the discriminal
dispersions are uniform for all stimuli considered in a given experimental context. Similar issues
are associated with Stevens' power law.
In addition, Thurstone employed the approach to clarify other similarities and differences
between Weber's law, the Weber-Fechner law, and the LCJ. An important clarification is that the
LCJ does not necessarily involve a physical stimulus, whereas the other 'laws' do. Another key
difference is that Weber's law and the LCJ involve proportions of comparisons in which one
stimulus is judged greater than another whereas the so-called Weber-Fechner law does not.
The general form of the law of comparative judgment
The most general form of the LCJ is

$$\delta_i - \delta_j = z_{ij} \sqrt{\sigma_i^2 + \sigma_j^2 - 2 r_{ij} \sigma_i \sigma_j}$$

in which:
 $\delta_i$ is the psychological scale value of stimulus i
 $z_{ij}$ is the sigma corresponding with the proportion of occasions on which the magnitude of
stimulus i is judged to exceed the magnitude of stimulus j
 $\sigma_i$ is the discriminal dispersion of stimulus i
 $r_{ij}$ is the correlation between the discriminal deviations of stimuli i and j
The discriminal dispersion of a stimulus i is the dispersion of fluctuations of the discriminal
process for a uniform repeated stimulus, denoted $\sigma_i$; the mode of such values is the stimulus's
scale value $\delta_i$. Thurstone (1959, p. 20) used the term discriminal process to refer to the "psychological
values of psychophysics"; that is, the values on a psychological continuum associated with a
given stimulus.
Case 5 of the law of comparative judgment
Thurstone specified five particular cases of the 'law', or measurement model. An important case
of the model is Case 5, in which the discriminal dispersions are specified to be uniform and
uncorrelated. This form of the model can be represented as follows:

$$\delta_i - \delta_j = z_{ij} \sigma \sqrt{2}$$

where

$$\sigma_i = \sigma_j = \sigma$$

In this case of the model, the difference $\delta_i - \delta_j$ can be inferred directly from the proportion of
instances in which j is judged greater than i if it is hypothesised that $z_{ij}$ is distributed according
to some density function, such as the normal distribution or logistic function. In order to do so, it
is necessary to let $\sigma\sqrt{2} = 1$, which is in effect an arbitrary choice of the unit of measurement.
Letting $p_{ij}$ be the proportion of occasions on which i is judged greater than j, if, for example, $p_{ij} = 0.84$
and it is hypothesised that $z_{ij}$ is normally distributed, then it would be inferred that
$\delta_i - \delta_j \approx 1$.
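A minimal sketch of Case 5 estimation under this choice of unit, using hypothetical proportions and fixing one stimulus at zero; the simple averaging used to pool the two routes to the location of A is an illustrative shortcut, not Thurstone's full procedure:

```python
from statistics import NormalDist

# Proportion of occasions on which the row stimulus was judged greater
# than the column stimulus (hypothetical data for three stimuli).
p = {("A", "B"): 0.73, ("A", "C"): 0.89, ("B", "C"): 0.71}

z = NormalDist().inv_cdf   # proportion -> normal deviate z_ij

# With the unit chosen so that sigma * sqrt(2) = 1, each z_ij estimates
# the separation delta_i - delta_j directly. Fix delta_C = 0.
delta_C = 0.0
delta_B = z(p[("B", "C")]) + delta_C
delta_A = (z(p[("A", "C")]) + (z(p[("A", "B")]) + delta_B)) / 2  # pool two routes
print(round(delta_A, 2), round(delta_B, 2), delta_C)             # ~1.2, ~0.55, 0.0
```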
When a simple logistic function is employed instead of the normal density function, then the
model has the structure of the Bradley-Terry-Luce model (BTL model) (Bradley & Terry, 1952;
Luce, 1959). In turn, the Rasch model for dichotomous data (Rasch, 1960/1980) is identical to
the BTL model after the person parameter of the Rasch model has been eliminated, as is
achieved through statistical conditioning during the process of Conditional Maximum Likelihood
estimation. With this in mind, the specification of uniform discriminal dispersions is equivalent
to the requirement of parallel Item Characteristic Curves (ICCs) in the Rasch model.
Accordingly, as shown by Andrich (1978), the Rasch model should, in principle, yield
essentially the same results as those obtained from a Thurstone scale. Like the Rasch model,
when applied in a given empirical context, Case 5 of the LCJ constitutes a mathematized
hypothesis which embodies theoretical criteria for measurement.
Applications
One important application involving the law of comparative judgment is the widely used
Analytic Hierarchy Process, a structured technique for helping people deal with complex
decisions. It uses pairwise comparisons of tangible and intangible factors to construct ratio scales
that are useful in making important decisions.

Rasch model
The Rasch model, named after Georg Rasch,[1] is a psychometric model for analyzing
categorical data, such as answers to questions on a reading assessment or questionnaire
responses, as a function of the trade-off between (a) the respondent's abilities, attitudes or
personality traits and (b) the item difficulty. For example, it may be used to estimate a
student's reading ability, or the extremity of a person's attitude to capital punishment, from
responses on a questionnaire. In addition to psychometrics and educational research, the Rasch
model and its extensions are used in other areas, including the health professions[2] and market
research,[3] because of their general applicability.[4]
The mathematical theory underlying Rasch models is a special case of item response theory and,
more generally, a special case of a generalized linear model. However, there are important
differences in the interpretation of the model parameters and its philosophical implications[5] that
separate proponents of the Rasch model from the item response modeling tradition. A central
aspect of this divide relates to the role of specific objectivity,[6] a defining property of the Rasch
model according to Georg Rasch, as a requirement for successful measurement.
Overview
The Rasch model for measurement
In the Rasch model, the probability of a specified response (e.g. right/wrong answer) is modeled
as a function of person and item parameters. Specifically, in the original Rasch model, the
probability of a correct response is modeled as a logistic function of the difference between the
person and item parameter. The mathematical form of the model is provided later in this article.
In most contexts, the parameters of the model characterize the proficiency of the respondents and
the difficulty of the items as locations on a continuous latent variable. For example, in
educational tests, item parameters represent the difficulty of items while person parameters
represent the ability or attainment level of people who are assessed. The higher a person's ability
relative to the difficulty of an item, the higher the probability of a correct response on that item.
When a person's location on the latent trait is equal to the difficulty of the item, there is by
definition a 0.5 probability of a correct response in the Rasch model.
A Rasch model is a model in one sense: it represents the structure which data should
exhibit in order to obtain measurements from the data; i.e. it provides a criterion for successful
measurement. Beyond data, Rasch's equations model relationships we expect to obtain in the real
world. For instance, education is intended to prepare children for the entire range of challenges
they will face in life, and not just those that appear in textbooks or on tests. By requiring
measures to remain the same (invariant) across different tests measuring the same thing, Rasch
models make it possible to test the hypothesis that the particular challenges posed in a
curriculum and on a test coherently represent the infinite population of all possible challenges in
that domain. A Rasch model is therefore a model in the sense of an ideal or standard that
provides a heuristic fiction serving as a useful organizing principle even when it is never actually
observed in practice.
The perspective or paradigm underpinning the Rasch model is distinct from the perspective
underpinning statistical modelling. Models are most often used with the intention of describing a
set of data. Parameters are modified and accepted or rejected based on how well they fit the data.
In contrast, when the Rasch model is employed, the objective is to obtain data which fit the
model (Andrich, 2004; Wright, 1984, 1999). The rationale for this perspective is that the Rasch
model embodies requirements which must be met in order to obtain measurement, in the sense
that measurement is generally understood in the physical sciences.
A useful analogy for understanding this rationale is to consider objects measured on a weighing
scale. Suppose the weight of an object A is measured as being substantially greater than the
weight of an object B on one occasion, then immediately afterward the weight of object B is
measured as being substantially greater than the weight of object A. A property we require of
measurements is that the resulting comparison between objects should be the same, or invariant,
irrespective of other factors. This key requirement is embodied within the formal structure of the
Rasch model. Consequently, the Rasch model is not altered to suit data. Instead, the method of
assessment should be changed so that this requirement is met, in the same way that a weighing
scale should be rectified if it gives different comparisons between objects upon separate
measurements of the objects.
Data analysed using the model are usually responses to conventional items on tests, such as
educational tests with right/wrong answers. However, the model is a general one, and can be
applied wherever discrete data are obtained with the intention of measuring a quantitative
attribute or trait.
Scaling

Figure 1: Test characteristic curve showing the relationship between total score on a test and person
location estimate
When all test-takers have an opportunity to attempt all items on a single test, each total score on
the test maps to a unique estimate of ability and the greater the total, the greater the ability
estimate. Total scores do not have a linear relationship with ability estimates. Rather, the
relationship is non-linear as shown in Figure 1. The total score is shown on the vertical axis,
while the corresponding person location estimate is shown on the horizontal axis. For the
particular test on which the test characteristic curve (TCC) shown in Figure 1 is based, the
relationship is approximately linear throughout the range of total scores from about 10 to 33. The
shape of the TCC is generally somewhat sigmoid as in this example. However, the precise
relationship between total scores and person location estimates depends on the distribution of
items on the test. The TCC is steeper in ranges on the continuum in which there are a number of
items, such as in the range on either side of 0 in Figures 1 and 2. In applying the Rasch model,
item locations are often scaled first, based on methods such as those described below. This part
of the process of scaling is often referred to as item calibration. In educational tests, the smaller
the proportion of correct responses, the higher the difficulty of an item and hence the higher the
item's scale location. Once item locations are scaled, the person locations are measured on the
scale. As a result, person and item locations are estimated on a single scale as shown in Figure 2.
Interpreting scale locations

Figure 2: Graph showing histograms of person distribution (top) and item distribution (bottom) on a
scale
For dichotomous data such as right/wrong answers, by definition, the location of an item on a
scale corresponds with the person location at which there is a 0.5 probability of a correct
response to the question. In general, the probability of a person responding correctly to a
question with difficulty lower than that person's location is greater than 0.5, while the probability
of responding correctly to a question with difficulty greater than the person's location is less than
0.5. The Item Characteristic Curve (ICC) or Item Response Function (IRF) shows the probability
of a correct response as a function of the ability of persons. A single ICC is shown and explained
in more detail in relation to Figure 4 in this article (see also the item response function). The
leftmost ICCs in Figure 3 are the easiest items, the rightmost items in the same figure are the
most difficult items.
When responses of a person are listed according to item difficulty, from lowest to highest, the
most likely pattern is a Guttman pattern or vector; i.e. {1,1,...,1,0,0,0,...,0}. However, while this
pattern is the most probable given the structure of the Rasch model, the model requires only
probabilistic Guttman response patterns; that is, patterns which tend toward the Guttman pattern.
It is unusual for responses to conform strictly to the pattern because there are many possible
patterns. It is unnecessary for responses to conform strictly to the pattern in order for data to fit
the Rasch model.

Figure 3: ICCs for a number of items. ICCs are coloured to highlight the change in the probability of a
successful response for a person with ability location at the vertical line. The person is likely to respond
correctly to the easiest items (with locations to the left and higher curves) and unlikely to respond
correctly to difficult items (locations to the right and lower curves).
Each ability estimate has an associated standard error of measurement, which quantifies the
degree of uncertainty associated with the ability estimate. Item estimates also have standard
errors. Generally, the standard errors of item estimates are considerably smaller than the standard
errors of person estimates because there are usually more response data for an item than for a
person. That is, the number of people attempting a given item is usually greater than the number
of items attempted by a given person. Standard errors of person estimates are smaller where the
slope of the ICC is steeper, which is generally through the middle range of scores on a test. Thus,
there is greater precision in this range since the steeper the slope, the greater the distinction
between any two points on the line.
Statistical and graphical tests are used to evaluate the correspondence of data with the model.
Certain tests are global, while others focus on specific items or people. Certain tests of fit
provide information about which items can be used to increase the reliability of a test by omitting
or correcting problems with poor items. In Rasch measurement, the person separation index is
used instead of reliability indices. However, the person separation index is analogous to a
reliability index. The separation index is a summary of the genuine separation as a ratio to
separation including measurement error. As mentioned earlier, the level of measurement error is
not uniform across the range of a test, but is generally larger for more extreme scores (low and
high).
Features of the Rasch model
The class of models is named after Georg Rasch, a Danish mathematician and statistician who
advanced the epistemological case for the models based on their congruence with a core
requirement of measurement in physics; namely the requirement of invariant comparison. This is
the defining feature of the class of models, as is elaborated upon in the following section. The
Rasch model for dichotomous data has a close conceptual relationship to the law of comparative
judgment (LCJ), a model formulated and used extensively by L. L. Thurstone (cf Andrich,
1978b), and therefore also to the Thurstone scale.
Prior to introducing the measurement model he is best known for, Rasch had applied the Poisson
distribution to reading data as a measurement model, hypothesizing that in the relevant empirical
context, the number of errors made by a given individual was governed by the ratio of the text
difficulty to the person's reading ability. Rasch referred to this model as the multiplicative
Poisson model. Rasch's model for dichotomous data – i.e. where responses are classifiable into
two categories – is his most widely known and used model, and is the main focus here. This
model has the form of a simple logistic function.
The brief outline above highlights certain distinctive and interrelated features of Rasch's
perspective on social measurement, which are as follows:
1. He was concerned principally with the measurement of individuals, rather than with
distributions among populations.
2. He was concerned with establishing a basis for meeting a priori requirements for measurement
deduced from physics and, consequently, did not invoke any assumptions about the distribution
of levels of a trait in a population.
3. Rasch's approach explicitly recognizes that it is a scientific hypothesis that a given trait is both
quantitative and measurable, as operationalized in a particular experimental context.
Thus, congruent with the perspective articulated by Thomas Kuhn in his 1961 paper The function
of measurement in modern physical science, measurement was regarded both as being founded in
theory, and as being instrumental to detecting quantitative anomalies incongruent with
hypotheses related to a broader theoretical framework. This perspective is in contrast to that
generally prevailing in the social sciences, in which data such as test scores are directly treated as
measurements without requiring a theoretical foundation for measurement. Although this
contrast exists, Rasch's perspective is actually complementary to the use of statistical analysis or
modelling that requires interval-level measurements, because the purpose of applying a Rasch
model is to obtain such measurements. Applications of Rasch models are described in a wide
variety of sources, including Alagumalai, Curtis & Hungi (2005), Bezruczko (2005), Bond &
Fox (2007), Fisher & Wright (1994), Masters & Keeves (1999), and the Journal of Applied
Measurement.
Invariant comparison and sufficiency
The Rasch model for dichotomous data is often regarded as an item response theory (IRT) model
with one item parameter. However, rather than being a particular IRT model, proponents of the
model[7] regard it as a model that possesses a property which distinguishes it from other IRT
models. Specifically, the defining property of Rasch models is their formal or mathematical
embodiment of the principle of invariant comparison. Rasch summarised the principle of
invariant comparison as follows:
The comparison between two stimuli should be independent of which particular individuals
were instrumental for the comparison; and it should also be independent of which other stimuli
within the considered class were or might also have been compared.
Symmetrically, a comparison between two individuals should be independent of which
particular stimuli within the class considered were instrumental for the comparison; and it
should also be independent of which other individuals were also compared, on the same or
some other occasion (Rasch, 1961, p. 332).
Rasch models embody this principle because their formal structure permits algebraic separation
of the person and item parameters, in the sense that the person parameter can be eliminated
during the process of statistical estimation of item parameters. This result is achieved through the
use of conditional maximum likelihood estimation, in which the response space is partitioned
according to person total scores. The consequence is that the raw score for an item or person is
the sufficient statistic for the item or person parameter. That is to say, the person total score
contains all information available within the specified context about the individual, and the item
total score contains all information with respect to the item, with regard to the relevant latent trait.
The Rasch model requires a specific structure in the response data, namely a probabilistic
Guttman structure.
In somewhat more familiar terms, Rasch models provide a basis and justification for obtaining
person locations on a continuum from total scores on assessments. Although it is not uncommon
to treat total scores directly as measurements, they are actually counts of discrete observations
rather than measurements. Each observation represents the observable outcome of a comparison
between a person and item. Such outcomes are directly analogous to the observation of the
rotation of a balance scale in one direction or another. This observation would indicate that one
or other object has a greater mass, but counts of such observations cannot be treated directly as
measurements.
Rasch pointed out that the principle of invariant comparison is characteristic of measurement in
physics using, by way of example, a two-way experimental frame of reference in which each
instrument exerts a mechanical force upon solid bodies to produce acceleration. Rasch
(1960/1980, pp. 112–3) stated of this context: "Generally: If for any two objects we find a certain
ratio of their accelerations produced by one instrument, then the same ratio will be found for any
other of the instruments". It is readily shown that Newton's second law entails that such ratios are
inversely proportional to the ratios of the masses of the bodies.
The mathematical form of the Rasch model for dichotomous data
Let $X_{ni} \in \{0, 1\}$ be a dichotomous random variable where, for example, $X_{ni} = 1$ denotes a correct response and $X_{ni} = 0$ an incorrect response to a given assessment item. In the Rasch model for dichotomous data, the probability of the outcome $X_{ni} = 1$ is given by:

$$\Pr\{X_{ni} = 1\} = \frac{e^{\beta_n - \delta_i}}{1 + e^{\beta_n - \delta_i}},$$

where $\beta_n$ is the ability of person $n$ and $\delta_i$ is the difficulty of item $i$. Thus, in the case of a dichotomous attainment item, $\Pr\{X_{ni} = 1\}$ is the probability of success upon interaction between the relevant person and assessment item. It is readily shown that the log odds, or logit, of correct response by a person to an item, based on the model, is equal to $\beta_n - \delta_i$. It can be shown that the log odds of a correct response by a person to one item, conditional on a correct response to one of two items, is equal to the difference between the item locations. For example,

$$\log\left(\frac{\Pr\{X_{n1} = 1 \mid r_n = 1\}}{\Pr\{X_{n2} = 1 \mid r_n = 1\}}\right) = \delta_2 - \delta_1,$$

where $r_n$ is the total score of person $n$ over the two items, which implies a correct response to one or other of the items (Andersen, 1977; Rasch, 1960; Andrich, 2010). Hence, the conditional log odds does not involve the person parameter $\beta_n$, which can therefore be eliminated by conditioning on the total score $r_n = 1$. That is, by partitioning the responses according to raw scores and calculating the log odds of a correct response, an estimate of $\delta_2 - \delta_1$ is obtained without involvement of $\beta_n$. More generally, a number of item parameters can be estimated iteratively through application of a process such as Conditional Maximum Likelihood estimation (see Rasch model estimation). While more involved, the same fundamental principle applies in such estimations.
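The elimination can be illustrated numerically; the following is a minimal sketch (the function names are ours, and this is not a published estimation routine):

```python
import math

def rasch_prob(beta, delta):
    """Probability of a correct response under the dichotomous Rasch model."""
    return math.exp(beta - delta) / (1 + math.exp(beta - delta))

def conditional_log_odds(beta, delta1, delta2):
    """Log odds of a correct response to item 1 rather than item 2,
    given a total score of 1 over the two items."""
    p10 = rasch_prob(beta, delta1) * (1 - rasch_prob(beta, delta2))  # pattern (1, 0)
    p01 = (1 - rasch_prob(beta, delta1)) * rasch_prob(beta, delta2)  # pattern (0, 1)
    return math.log(p10 / p01)

# The person parameter cancels: the result equals delta2 - delta1
# regardless of the ability beta.
for beta in (-2.0, 0.0, 3.0):
    print(conditional_log_odds(beta, delta1=-0.5, delta2=1.0))  # always 1.5
```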
Figure 4: ICC for the Rasch model showing the comparison between observed and expected proportions
correct for five Class Intervals of persons
The ICC of the Rasch model for dichotomous data is shown in Figure 4. The grey line maps a
person with a location of approximately 0.2 on the latent continuum, to the probability of the
discrete outcome $X_{ni} = 1$ for items with different locations on the latent continuum. The
location of an item is, by definition, that location at which the probability that $X_{ni} = 1$ is equal
to 0.5. In figure 4, the black circles represent the actual or observed proportions of persons
within Class Intervals for which the outcome $X_{ni} = 1$ was observed. For example, in the case of an
assessment item used in the context of educational psychology, these could represent the
proportions of persons who answered the item correctly. Persons are ordered by the estimates of
their locations on the latent continuum and classified into Class Intervals on this basis in order to
graphically inspect the accordance of observations with the model. There is a close conformity of
the data with the model. In addition to graphical inspection of data, a range of statistical tests of
fit are used to evaluate whether departures of observations from the model can be attributed to
random effects alone, as required, or whether there are systematic departures from the model.
The polytomous form of the Rasch model
The polytomous Rasch model, which is a generalisation of the dichotomous model, can be
applied in contexts in which successive integer scores represent categories of increasing level or
magnitude of a latent trait, such as increasing ability, motor function, endorsement of a
statement, and so forth. The polytomous response model is, for example, applicable to the use of
Likert scales, grading in educational assessment, and scoring of performances by judges.
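The category probabilities of a polytomous Rasch item can be sketched as follows; the partial-credit threshold parametrisation shown here is one common form, and the names and values are illustrative:

```python
import math

def pcm_probs(beta, thresholds):
    """Category probabilities for one item under the polytomous
    (partial credit) Rasch model.

    thresholds: threshold locations tau_1..tau_m, so the item has
    m + 1 ordered categories scored 0..m.
    """
    # Unnormalised weight of category x is exp(sum_{k<=x} (beta - tau_k));
    # the empty sum for x = 0 gives a weight of 1.
    weights = [1.0]
    cumulative = 0.0
    for tau in thresholds:
        cumulative += beta - tau
        weights.append(math.exp(cumulative))
    total = sum(weights)
    return [w / total for w in weights]

# A person at beta = 0.5 responding to a four-category item:
print(pcm_probs(0.5, thresholds=[-1.0, 0.0, 1.5]))
```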
Other considerations
A criticism of the Rasch model is that it is overly restrictive or prescriptive because it does not
permit each item to have a different discrimination. A criticism specific to the use of multiple
choice items in educational assessment is that there is no provision in the model for guessing
because the left asymptote always approaches a zero probability in the Rasch model. These
variations are available in models such as the two and three parameter logistic models
(Birnbaum, 1968). However, the specification of uniform discrimination and zero left asymptote
are necessary properties of the model in order to sustain sufficiency of the simple, unweighted
raw score.
Verhelst & Glas (1995) derive Conditional Maximum Likelihood (CML) equations for a model
they refer to as the One Parameter Logistic Model (OPLM). In algebraic form it appears to be
identical with the 2PL model, but OPLM contains preset discrimination indexes rather than
2PL's estimated discrimination parameters. As noted by these authors, though, the problem one
faces in estimation with estimated discrimination parameters is that the discriminations are
unknown, meaning that the weighted raw score "is not a mere statistic, and hence it is impossible
to use CML as an estimation method" (Verhelst & Glas, 1995, p. 217). That is, sufficiency of the
weighted "score" in the 2PL cannot be used according to the way in which a sufficient statistic is
defined. If the weights are imputed instead of being estimated, as in OPLM, conditional
estimation is possible and some of the properties of the Rasch model are retained (Verhelst, Glas
& Verstralen, 1995; Verhelst & Glas, 1995). In OPLM, the values of the discrimination index are
restricted to between 1 and 15. A limitation of this approach is that in practice, values of
discrimination indexes must be preset as a starting point. This means some type of estimation of
discrimination is involved when the purpose is to avoid doing so.
The Rasch model for dichotomous data inherently entails a single discrimination parameter
which, as noted by Rasch (1960/1980, p. 121), constitutes an arbitrary choice of the unit in terms
of which magnitudes of the latent trait are expressed or estimated. However, the Rasch model
requires that the discrimination is uniform across interactions between persons and items within
a specified frame of reference (i.e., the assessment context under given conditions of assessment).
Application of the models provides diagnostic information regarding how well the criterion is
met. Application of the models can also provide information about how well items or questions
on assessments work to measure the ability or trait. Prominent advocates of Rasch models
include Benjamin Drake Wright, David Andrich and Erling Andersen.
Psychometrics
Psychometrics is the field of study concerned with the theory and technique of psychological
measurement. One part of the field is concerned with the objective measurement of skills and
knowledge, abilities, attitudes, personality traits, and educational achievement. For example,
psychometric research has concerned itself with the construction and validation of assessment
instruments such as questionnaires, tests, raters' judgments, and personality tests. Another part of
the field is concerned with statistical research bearing on measurement theory (e.g., item
response theory; intraclass correlation).
Thus psychometrics involves two major research tasks: (i) the construction of instruments and
procedures for measurement; and (ii) the development and refinement of theoretical approaches
to measurement. Those who practice psychometrics are known as psychometricians.
Psychometricians usually possess a specific qualification, and most are psychologists with
advanced graduate training in psychometric testing. Many work in human resources
departments; others specialize as learning and development professionals.
Definition of measurement in the social sciences
The definition of measurement in the social sciences has a long history. A currently widespread
definition, proposed by Stanley Smith Stevens (1946), is that measurement is "the assignment of
numerals to objects or events according to some rule." This definition was introduced in the
paper in which Stevens proposed four levels of measurement. Although widely adopted, this
definition differs in important respects from the more classical definition of measurement
adopted in the physical sciences, namely that scientific measurement entails "the estimation or
discovery of the ratio of some magnitude of a quantitative attribute to a unit of the same
attribute" (p. 358)[4]
Indeed, Stevens's definition of measurement was put forward in response to the British Ferguson
Committee, whose chair, A. Ferguson, was a physicist. The committee was appointed in 1932 by
the British Association for the Advancement of Science to investigate the possibility of
quantitatively estimating sensory events. Although its chair and other members were physicists,
the committee also included several psychologists. The committee's report highlighted the
importance of the definition of measurement. While Stevens's response was to propose a new
definition, which has had considerable influence in the field, this was by no means the only
response to the report. Another, notably different, response was to accept the classical definition,
as reflected in the following statement:
Measurement in psychology and physics are in no sense different. Physicists can measure when
they can find the operations by which they may meet the necessary criteria; psychologists have
but to do the same. They need not worry about the mysterious differences between the
meaning of measurement in the two sciences. (Reese, 1943, p. 49)
These divergent responses are reflected in alternative approaches to measurement. For example,
methods based on covariance matrices are typically employed on the premise that numbers, such
as raw scores derived from assessments, are measurements. Such approaches implicitly entail
Stevens's definition of measurement, which requires only that numbers are assigned according to
some rule. The main research task, then, is generally considered to be the discovery of
associations between scores, and of factors posited to underlie such associations.
On the other hand, when measurement models such as the Rasch model are employed, numbers
are not assigned based on a rule. Instead, in keeping with Reese's statement above, specific
criteria for measurement are stated, and the goal is to construct procedures or operations that
provide data that meet the relevant criteria. Measurements are estimated based on the models,
and tests are conducted to ascertain whether the relevant criteria have been met.
Instruments and procedures
The first psychometric instruments were designed to measure the concept of intelligence. The
best known historical approach involved the Stanford-Binet IQ test, developed originally by the
French psychologist Alfred Binet. Intelligence tests remain in wide use in educational, clinical, and personnel settings. An
alternative conception of intelligence is that cognitive capacities within individuals are a
manifestation of a general component, or general intelligence factor, as well as cognitive
capacity specific to a given domain.
Psychometrics is applied widely in educational assessment to measure abilities in domains such
as reading, writing, and mathematics. The main approaches in applying tests in these domains
have been Classical Test Theory and the more recent Item Response Theory and Rasch
measurement models. These latter approaches permit joint scaling of persons and assessment
items, which provides a basis for mapping of developmental continua by allowing descriptions of
the skills displayed at various points along a continuum. Such approaches provide powerful
information regarding the nature of developmental growth within various domains.
Another major focus in psychometrics has been on personality testing. There have been a range
of theoretical approaches to conceptualizing and measuring personality. Some of the better
known instruments include the Minnesota Multiphasic Personality Inventory, the Five-Factor
Model (or "Big 5") and tools such as Personality and Preference Inventory and the Myers-Briggs
Type Indicator. Attitudes have also been studied extensively using psychometric approaches. A
common method in the measurement of attitudes is the use of the Likert scale. An alternative
method involves the application of unfolding measurement models, the most general being the
Hyperbolic Cosine Model (Andrich & Luo, 1993).
Theoretical approaches
Psychometricians have developed a number of different measurement theories. These include
classical test theory (CTT) and item response theory (IRT).[5][6] An approach which seems
mathematically to be similar to IRT but also quite distinctive, in terms of its origins and features,
is represented by the Rasch model for measurement. The development of the Rasch model, and
the broader class of models to which it belongs, was explicitly founded on requirements of
measurement in the physical sciences.[7]
Psychometricians have also developed methods for working with large matrices of correlations
and covariances. Techniques in this general tradition include: factor analysis,[8] a method of
determining the underlying dimensions of data; multidimensional scaling,[9] a method for finding
a simple representation for data with a large number of latent dimensions; and data clustering, an
approach to finding objects that are like each other. All these multivariate descriptive methods
try to distill large amounts of data into simpler structures. More recently, structural equation
modeling[10] and path analysis represent more sophisticated approaches to working with large
covariance matrices. These methods allow statistically sophisticated models to be fitted to data
and tested to determine if they are adequate fits.
One of the main deficiencies in various factor analyses is a lack of consensus on cut-off points
for determining the number of latent factors. A usual procedure is to stop factoring when
eigenvalues drop below one, on the grounds that a retained factor should account for at least as
much variance as a single original variable. The lack of agreed cut-off points affects other
multivariate methods as well.[citation needed]
Key concepts
Key concepts in classical test theory are reliability and validity. A reliable measure is one that
measures a construct consistently across time, individuals, and situations. A valid measure is one
that measures what it is intended to measure. Reliability is necessary, but not sufficient, for
validity.
Both reliability and validity can be assessed statistically. Consistency over repeated measures of
the same test can be assessed with the Pearson correlation coefficient, and is often called test-
retest reliability.[11] Similarly, the equivalence of different versions of the same measure can be
indexed by a Pearson correlation, and is called equivalent forms reliability or a similar term.[11]
Internal consistency, which addresses the homogeneity of a single test form, may be assessed by
correlating performance on two halves of a test, which is termed split-half reliability; the value
of this Pearson product-moment correlation coefficient for two half-tests is adjusted with the
Spearman–Brown prediction formula to correspond to the correlation between two full-length
tests.[11] Perhaps the most commonly used index of reliability is Cronbach's α, which is
equivalent to the mean of all possible split-half coefficients. Other approaches include the intra-
class correlation, conventionally defined as the proportion of the total variance of measurements
that is attributable to differences between targets.
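A minimal sketch of two of these reliability indices, assuming a persons-by-items score matrix and an odd/even split (both of which are choices of this illustration, not prescriptions):

```python
import numpy as np

def split_half_reliability(items):
    """Split-half reliability with the Spearman-Brown adjustment.

    items: persons x items matrix of scores; an odd/even split is
    one conventional choice among many.
    """
    odd = items[:, 0::2].sum(axis=1)
    even = items[:, 1::2].sum(axis=1)
    r_half = np.corrcoef(odd, even)[0, 1]
    return 2 * r_half / (1 + r_half)   # Spearman-Brown prediction formula

def cronbach_alpha(items):
    """Cronbach's alpha: k/(k-1) * (1 - sum of item variances / total variance)."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Simulated data: 200 persons, 8 dichotomous items driven by one ability.
rng = np.random.default_rng(0)
ability = rng.normal(size=200)
data = ((ability[:, None] + rng.normal(size=(200, 8))) > 0).astype(float)
print(split_half_reliability(data), cronbach_alpha(data))
```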
There are a number of different forms of validity. Criterion-related validity can be assessed by
correlating a measure with a criterion measure theoretically expected to be related. When the
criterion measure is collected at the same time as the measure being validated the goal is to
establish concurrent validity; when the criterion is collected later the goal is to establish
predictive validity. A measure has construct validity if it is related to measures of other
constructs as required by theory. Content validity is a demonstration that the items of a test do an
adequate job of covering the domain being measured. In a personnel selection example, test
content is based on a defined statement or set of statements of knowledge, skill, ability, or other
characteristics obtained from a job analysis.
Item response theory models the relationship between latent traits and responses to test items.
Among other advantages, IRT provides a basis for obtaining an estimate of the location of a test-
taker on a given latent trait as well as the standard error of measurement of that location. For
example, a university student's knowledge of history can be deduced from his or her score on a
university test and then be compared reliably with a high school student's knowledge deduced
from a less difficult test. Scores derived by classical test theory do not have this characteristic,
and assessment of actual ability (rather than ability relative to other test-takers) must be assessed
by comparing scores to those of a "norm group" randomly selected from the population. In fact,
all measures derived from classical test theory are dependent on the sample tested, while, in
principle, those derived from item response theory are not.
Standards of quality
The considerations of validity and reliability typically are viewed as essential elements for
determining the quality of any test. However, professional and practitioner associations
frequently have placed these concerns within broader contexts when developing standards and
making overall judgments about the quality of any test as a whole within a given context. A
consideration of concern in many applied research settings is whether or not the metric of a given
psychological inventory is meaningful or arbitrary.[12]
Testing standards
In this field, the Standards for Educational and Psychological Testing[13] place standards about
validity and reliability, along with errors of measurement and related considerations under the
general topic of test construction, evaluation and documentation. The second major topic covers
standards related to fairness in testing, including fairness in testing and test use, the rights and
responsibilities of test takers, testing individuals of diverse linguistic backgrounds, and testing
individuals with disabilities. The third and final major topic covers standards related to testing
applications, including the responsibilities of test users, psychological testing and assessment,
educational testing and assessment, testing in employment and credentialing, plus testing in
program evaluation and public policy.
Evaluation standards
In the field of evaluation, and in particular educational evaluation, the Joint Committee on
Standards for Educational Evaluation[14] has published three sets of standards for evaluations. The
Personnel Evaluation Standards[15] was published in 1988, The Program Evaluation Standards
(2nd edition)[16] was published in 1994, and The Student Evaluation Standards[17] was published
in 2003.
Each publication presents and elaborates a set of standards for use in a variety of educational
settings. The standards provide guidelines for designing, implementing, assessing and improving
the identified form of evaluation.[18] Each of the standards has been placed in one of four
fundamental categories to promote educational evaluations that are proper, useful, feasible, and
accurate. In these sets of standards, validity and reliability considerations are covered under the
accuracy topic. For example, the student accuracy standards help ensure that student evaluations
will provide sound, accurate, and credible information about student learning and performance.
Non-human: animals and machines
Psychometrics addresses human abilities, attitudes, traits and educational achievement. Notably,
the study of behavior, mental processes and abilities of non-human animals is usually addressed
by comparative psychology, or with a continuum between non-human animals and humans by
evolutionary psychology. Nonetheless, there are some advocates of a more gradual transition
between the approach taken for humans and the approach taken for (non-human)
animals.[19][20][21][22]
The evaluation of abilities, traits and learning evolution of machines has been mostly unrelated
to the case of humans and non-human animals, with specific approaches in the area of artificial
intelligence. A more integrated approach, under the name of universal psychometrics, has also
been proposed.[23]
Ranking
A ranking is a relationship between a set of items such that, for any two items, the first is either
'ranked higher than', 'ranked lower than' or 'ranked equal to' the second. In mathematics, this is
known as a weak order or total preorder of objects. It is not necessarily a total order of objects
because two different objects can have the same ranking. The rankings themselves are totally
ordered. For example, materials are totally preordered by hardness, while degrees of hardness are
totally ordered.
By reducing detailed measures to a sequence of ordinal numbers, rankings make it possible to
evaluate complex information according to certain criteria. Thus, for example, an Internet search
engine may rank the pages it finds according to an estimation of their relevance, making it
possible for the user quickly to select the pages they are likely to want to see.
Analysis of data obtained by ranking commonly requires non-parametric statistics.
Strategies for assigning rankings
It is not always possible to assign rankings uniquely. For example, in a race or competition two
(or more) entrants might tie for a place in the ranking. When computing an ordinal measurement,
two (or more) of the quantities being ranked might measure equal. In these cases, one of the
strategies shown below for assigning the rankings may be adopted.
A common shorthand way to distinguish these ranking strategies is by the ranking numbers that
would be produced for four items, with the first item ranked ahead of the second and third
(which compare equal) which are both ranked ahead of the fourth. These names are also shown
below.
Standard competition ranking ("1224" ranking)
In competition ranking, items that compare equal receive the same ranking number, and then a
gap is left in the ranking numbers. The number of ranking numbers that are left out in this gap is
one less than the number of items that compared equal. Equivalently, each item's ranking number
is 1 plus the number of items ranked above it. This ranking strategy is frequently adopted for
competitions, as it means that if two (or more) competitors tie for a position in the ranking, the
position of all those ranked below them is unaffected (i.e., a competitor only comes second if
exactly one person scores better than them, third if exactly two people score better than them,
fourth if exactly three people score better than them, etc.).
Thus if A ranks ahead of B and C (which compare equal) which are both ranked ahead of D, then
A gets ranking number 1 ("first"), B gets ranking number 2 ("joint second"), C also gets ranking
number 2 ("joint second") and D gets ranking number 4 ("fourth").
Modified competition ranking ("1334" ranking)
Sometimes, competition ranking is done by leaving the gaps in the ranking numbers before the
sets of equal-ranking items (rather than after them as in standard competition ranking). The
number of ranking numbers that are left out in this gap remains one less than the number of items
that compared equal. Equivalently, each item's ranking number is equal to the number of items
ranked equal to it or above it. This ranking ensures that a competitor only comes second if they
score higher than all but one of their opponents, third if they score higher than all but two of their
opponents, etc.
Thus if A ranks ahead of B and C (which compare equal) which are both ranked ahead of D, then
A gets ranking number 1 ("first"), B gets ranking number 3 ("joint third"), C also gets ranking
number 3 ("joint third") and D gets ranking number 4 ("fourth"). In this case, nobody would get
ranking number 2 ("second") and that would be left as a gap.
Dense ranking ("1223" ranking)
In dense ranking, items that compare equal receive the same ranking number, and the next
item(s) receive the immediately following ranking number. Equivalently, each item's ranking
number is 1 plus the number of items ranked above it that are distinct with respect to the ranking
order.
Thus if A ranks ahead of B and C (which compare equal) which are both ranked ahead of D, then
A gets ranking number 1 ("first"), B gets ranking number 2 ("joint second"), C also gets ranking
number 2 ("joint second") and D gets ranking number 3 ("third").
Ordinal ranking ("1234" ranking)
In ordinal ranking, all items receive distinct ordinal numbers, including items that compare
equal. The assignment of distinct ordinal numbers to items that compare equal can be done at
random, or arbitrarily, but it is generally preferable to use a system that is arbitrary but
consistent, as this gives stable results if the ranking is done multiple times. An example of an
arbitrary but consistent system would be to incorporate other attributes into the ranking order
(such as alphabetical ordering of the competitor's name) to ensure that no two items exactly
match.
With this strategy, if A ranks ahead of B and C (which compare equal) which are both ranked
ahead of D, then A gets ranking number 1 ("first") and D gets ranking number 4 ("fourth"), and
either B gets ranking number 2 ("second") and C gets ranking number 3 ("third") or C gets
ranking number 2 ("second") and B gets ranking number 3 ("third").
In computer data processing, ordinal ranking is also referred to as "row numbering".
Fractional ranking ("1 2.5 2.5 4" ranking)
Items that compare equal receive the same ranking number, which is the mean of what they
would have under ordinal rankings. Equivalently, each item's ranking number is 1 plus the
number of items ranked above it, plus half the number of other items tied with it. This strategy
has the property that
the sum of the ranking numbers is the same as under ordinal ranking. For this reason, it is used in
computing Borda counts and in statistical tests (see below).
Thus if A ranks ahead of B and C (which compare equal) which are both ranked ahead of D, then
A gets ranking number 1 ("first"), B and C each get ranking number 2.5 (average of "joint
second/third") and D gets ranking number 4 ("fourth").
Here is an example. Suppose you have the data set 1 1 2 3 3 4 5 5 5. If the two 1's were different
numbers, they would occupy ranks 1 and 2; since they are tied, each receives the average rank
(1 + 2) / 2 = 1.5. The next value, 2, is assigned rank 3, because ranks 1 and 2 have already been
used up by the average. The two 3's would occupy ranks 4 and 5 if they were different numbers,
so each receives the average rank (4 + 5) / 2 = 4.5. The value 4 then receives rank 6, and the
three 5's share the average of ranks 7, 8 and 9, namely (7 + 8 + 9) / 3 = 8.
The fractional ranks are therefore: 1.5 1.5 3 4.5 4.5 6 8 8 8
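For reference, the five strategies described above map directly onto the method argument of scipy.stats.rankdata, which reproduces the worked example:

```python
from scipy.stats import rankdata

data = [1, 1, 2, 3, 3, 4, 5, 5, 5]
print(rankdata(data, method='min'))      # standard competition "1224": 1 1 3 4 4 6 7 7 7
print(rankdata(data, method='max'))      # modified competition "1334": 2 2 3 5 5 6 9 9 9
print(rankdata(data, method='dense'))    # dense "1223":                1 1 2 3 3 4 5 5 5
print(rankdata(data, method='ordinal'))  # ordinal "1234":              1 2 3 4 5 6 7 8 9
print(rankdata(data, method='average'))  # fractional:     1.5 1.5 3 4.5 4.5 6 8 8 8
```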
Ranking in statistics
In statistics, "ranking" refers to the data transformation in which numerical or ordinal values are
replaced by their rank when the data are sorted. If, for example, the numerical data 3.4, 5.1, 2.6,
7.3 are observed, then the ranks of these data items would be 2, 3, 1 and 4 respectively.
Similarly, the ordinal data hot, cold, warm would be replaced by 3, 1, 2. In these examples, the ranks are
assigned to values in ascending order. (In some other cases, descending ranks are used.) Ranks
are related to the indexed list of order statistics, which consists of the original dataset rearranged
into ascending order.
Some kinds of statistical tests employ calculations based on ranks. Examples include:
 Friedman test
 Kruskal-Wallis test
 Rank products
 Spearman's rank correlation coefficient
 Wilcoxon rank-sum test
 Wilcoxon signed-rank test
Some ranks can have non-integer values for tied data values. For example, when there is an even
number of copies of the same data value, the above described fractional statistical rank of the
tied data ends in ½.
Rank function in Excel
The rank function in Microsoft Excel assigns competition ranks ("1224") as described above.
For some statistical purposes, that is not the desired result - for instance, it means that the sum of
ranks for a list of a given length changes depending on the number of ties. Pottel has described a
user defined ranking function which assigns fractional ranks to ties to keep the sum consistent.[1]
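Pottel's published function is not reproduced here, but a fractional ranking function with the stated sum-preserving property can be sketched as follows:

```python
def fractional_ranks(values):
    """Assign fractional ranks so that tied values share the mean of their
    ordinal ranks; the rank sum is then always n*(n+1)/2 for n values.
    (A sketch in the spirit of Pottel's function, not his published code.)
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + 1 + j + 1) / 2    # average of ordinal ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

print(fractional_ranks([3.4, 5.1, 2.6, 7.3, 5.1]))  # [2.0, 3.5, 1.0, 5.0, 3.5]
```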
Bogardus social distance scale
The Bogardus social distance scale is a psychological testing scale created by Emory S.
Bogardus to empirically measure people's willingness to participate in social contacts of varying
degrees of closeness with members of diverse social groups, such as racial and ethnic groups.
The scale asks people the extent to which they would be accepting of each group (a score of 1.00
for a group is taken to indicate no social distance):
 As close relatives by marriage (score 1.00)
 As my close personal friends (2.00)
 As neighbors on the same street (3.00)
 As co-workers in the same occupation (4.00)
 As citizens in my country (5.00)
 As only visitors in my country (6.00)
 Would exclude from my country (7.00)
The Bogardus social distance scale is a cumulative scale (a Guttman scale), because agreement
with any item implies agreement with all preceding items. The scale has been criticized as too
simple because the social interactions and attitudes in close familial or friendship-type
relationships may be qualitatively different from social interactions with and attitudes toward
relationships with far-away contacts such as citizens or visitors in one's country.
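A sketch of how such responses might be scored; taking a respondent's score as the closest degree of contact they accept, and checking the cumulative (Guttman) pattern, are conventions of this illustration:

```python
ITEMS = {  # score: degree of closeness accepted
    1: "close relatives by marriage",
    2: "close personal friends",
    3: "neighbors on the same street",
    4: "co-workers in the same occupation",
    5: "citizens in my country",
    6: "only visitors in my country",
    7: "would exclude from my country",   # exclusion, not an acceptance level
}

def social_distance(accepted):
    """Score = the closest degree of contact the respondent accepts.

    accepted: set of item scores (1..6) the respondent endorses; an empty
    set is scored 7 (exclusion) under this convention.
    """
    return min(accepted) if accepted else 7

def is_cumulative(accepted):
    """On a perfect Guttman pattern, accepting level k implies accepting
    every more distant level k+1 .. 6."""
    return all(k in accepted for k in range(min(accepted), 7)) if accepted else True

responses = {3, 4, 5, 6}   # accepts neighbors and anything more distant
print(social_distance(responses), is_cumulative(responses))   # 3 True
```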
Research by Bogardus first in 1925 and then repeated in 1946, 1956, and 1966 shows that the
extent of social distancing in the US is decreasing slightly and fewer distinctions are being made
among groups. The study was also replicated in 2005. The results supported the existence of this
tendency, showing that the mean level of social distance has decreased compared with the
previous studies.[1] A web-based questionnaire has been running since late 1993; users are
encouraged to submit their responses on that site, where the maintainer has posted at least
two papers that update research on social distance.
For Bogardus, social distance is a function of affective distance between the members of two
groups: ‘‘[i]n social distance studies the center of attention is on the feeling reactions of persons
toward other persons and toward groups of people.’’[2] Thus, for him, social distance is
essentially a measure of how much or little sympathy the members of a group feel for another
group. It might be important to note that Bogardus’s conceptualization is not the only one in the
sociological literature. Several sociologists have pointed out that social distance can also be
conceptualized on the basis of other parameters such as the frequency of interaction between
different groups or the normative distinctions in a society about who should be considered an
“insider” or “outsider.”[3]
Q methodology
Q Methodology is a research method used in psychology and in social sciences to study people's
"subjectivity"—that is, their viewpoint. Q was developed by psychologist William Stephenson. It
has been used both in clinical settings for assessing a patient's progress over time (intra-rater
comparison), as well as in research settings to examine how people think about a topic (inter-
rater comparisons).
Technical overview
The name "Q" comes from the form of factor analysis that is used to analyze the data. Normal
factor analysis, called "R method," involves finding correlations between variables (say, height
and age) across a sample of subjects. Q, on the other hand, looks for correlations between
subjects across a sample of variables. Q factor analysis reduces the many individual viewpoints
of the subjects down to a few "factors," which are claimed to represent shared ways of thinking.
It is sometimes said that Q factor analysis is R factor analysis with the data table turned
sideways. While helpful as a heuristic for understanding Q, this explanation may be misleading,
as most Q methodologists argue that for mathematical reasons no one data matrix would be
suitable for analysis with both Q and R.
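The "turned sideways" heuristic can nevertheless be made concrete: correlating rows (persons) rather than columns (statements) yields the person-by-person matrix that Q factor analysis then factors. A minimal sketch with invented toy data:

```python
import numpy as np

# Toy data: rows = subjects, columns = statements (Q-sort values from -4 to +4).
rng = np.random.default_rng(1)
data = rng.integers(-4, 5, size=(10, 30)).astype(float)  # 10 people, 30 statements

r_correlations = np.corrcoef(data, rowvar=False)  # statement x statement (R method)
q_correlations = np.corrcoef(data, rowvar=True)   # person x person (Q method)

print(r_correlations.shape)  # (30, 30): correlations between variables
print(q_correlations.shape)  # (10, 10): correlations between subjects, the
                             # matrix that Q factor analysis then factors
```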
Sorting the statements in a Q-sort
The data for Q factor analysis come from a series of "Q sorts" performed by one or more
subjects. A Q sort is a ranking of variables—typically presented as statements printed on small
cards—according to some "condition of instruction." For example, in a Q study of people's views
of a celebrity, a subject might be given statements like "He is a deeply religious man" and "He is
a liar," and asked to sort them from "most like how I think about this celebrity" to "least like how
I think about this celebrity." The use of ranking, rather than asking subjects to rate their
agreement with statements individually, is meant to capture the idea that people think about ideas
in relation to other ideas, rather than in isolation.
The sample of statements for a Q sort is drawn from and claimed to be representative of a
"concourse"—the sum of all things people say or think about the issue being investigated. Since
concourses do not have clear membership lists (as would be the case for a population of
subjects), statements cannot be drawn randomly, nor is there a theory that would specify an
adequate sample; Q samples therefore do not meet accepted scientific expectations for valid
inference. Commonly, Q methodologists use a structured sampling approach in order to try to
represent the full breadth of the concourse, but such samples still do not justify inference from
the Q-sample to the concourse.
One salient difference between Q and other social science research methodologies, such as
surveys, is that it typically uses many fewer subjects. This can be a strength, as Q is sometimes
used with a single subject, and it makes research far less expensive. In such cases, a person will
rank the same set of statements under different conditions of instruction. For example, someone
might be given a set of statements about personality traits and then asked to rank them according
to how well they describe herself, her ideal self, her father, her mother, etc. Working with a
single individual is particularly relevant in the study of how an individual's rankings change over
time and this was the first use of Q-methodology. As Q-methodology works with a small non-
representative sample, conclusions are limited to those who participated in the study.
In studies of intelligence, Q factor analysis can generate Consensus based assessment (CBA)
scores as direct measures. Alternatively, the unit of measurement of a person in this context is his
factor loading for a Q-sort he or she performs. Factors represent norms with respect to schemata.
The individual who gains the highest factor loading on an Operant factor is the person most able
to conceive the norm for the factor. What the norm means is a matter, always, for conjecture and
refutation (Popper). It may be indicative of the wisest solution, or the most responsible, the most
important, or an optimized-balanced solution. These are all untested hypotheses that require
future study.
An alternative method that determines the similarity among subjects somewhat like Q
methodology, as well as the cultural "truth" of the statements used in the test, is Cultural
Consensus Theory.
The "Q sort" data collection procedure is traditionally done using a paper template and the
sample of statements or other stimuli printed on individual cards. However, there are also
computer software applications for conducting online Q sorts. For example, consulting firm
Davis Brand Capital has created a proprietary online product, nQue, that they use to conduct
online Q sorts that mimic the analog, paper-based sorting procedure. However, the web-based
software application that uses a drag-and-drop, graphical user interface to assist researchers is
not available for commercial sale. UC Riverside's Riverside Situational Q-sort (RSQ), a newly
developed web-based tool, purports to measure the psychological properties of situations. The
university's International Situations Project is using the tool to explore the psychologically
salient aspects of situations and how those aspects may differ across cultures. To date there has
been no study of differences in sorts produced by computer-based versus physical sorting.
One Q-sort should produce two sets of data. The first is the physical distribution of sorted
objects. The second is either an ongoing 'think-out-loud' narrative or a discussion that
immediately follows the sorting exercise. The purpose of these narratives was, in the first
instance, to elicit discussion of the reasons for particular placements. While the relevance of this
qualitative data is often suppressed in current uses of Q-methodology, the modes of reasoning
behind placement of an item can be more analytically relevant than the absolute placement of
cards.
Application
Q-methodology has been used as a research tool in a wide variety of disciplines including
nursing, veterinary medicine, public health, transportation, education, rural sociology, hydrology
and mobile communication.[1][2][3][4][5] The methodology is particularly useful when researchers
wish to understand and describe the variety of subjective viewpoints on an issue.[6]
Validation
Some information on validation of the method is available.[7]
Guttman scale
In statistical surveys conducted by means of structured interviews or questionnaires, a subset of
the survey items having binary (e.g., YES or NO) answers forms a Guttman scale (named after
Louis Guttman) if they can be ranked in some order so that, for a rational respondent, the
response pattern can be captured by a single index on that ordered scale. In other words, on a
Guttman scale, items are arranged in an order so that an individual who agrees with a particular
item also agrees with items of lower rank-order. For example, a series of items could be (1) "I
am willing to be near ice cream"; (2) "I am willing to smell ice cream"; (3) "I am willing to eat
ice cream"; and (4) "I love to eat ice cream". Agreement with any one item implies agreement
with the lower-order items. This contrasts with topics studied using a Likert scale or a Thurstone
scale.
The concept of Guttman scale likewise applies to series of items in other kinds of tests, such as
achievement tests, that have binary outcomes. For example, a test of math achievement might
order questions based on their difficulty and instruct the examinee to begin in the middle. The
assumption is if the examinee can successfully answer items of that difficulty (e.g., summing two
3-digit numbers), s/he would be able to answer the earlier questions (e.g., summing two 2-digit
numbers). Some achievement tests are organized in a Guttman scale to reduce the duration of the
test.
By designing surveys and tests such that they contain Guttman scales, researchers can simplify
the analysis of the outcome of surveys and increase the robustness. Guttman scales also make it
possible to detect and discard randomized answer patterns, as may be given by uncooperative
respondents.
A hypothetical, perfect Guttman scale consists of a unidimensional set of items that are ranked in
order of difficulty from least extreme to most extreme position. For example, a person scoring a
"7" on a ten item Guttman scale, will agree with items 1-7 and disagree with items 8,9,10. An
important property of Guttman's model is that a person's entire set of responses to all items can
be predicted from their cumulative score because the model is deterministic.
A well-known example of a Guttman scale is the Bogardus Social Distance Scale.
Another example is the original Beaufort wind force scale, assigning a single number to
observed conditions of the sea surface ("Flat", ..., "Small waves", ..., "Sea heaps up and foam
begins to streak", ...), which was in fact a Guttman scale. The observation "Flat = YES" implies
"Small waves = NO".
Deterministic model
An important objective in Guttman scaling is to maximize the reproducibility of response
patterns from a single score. A good Guttman scale should have a coefficient of reproducibility
(the percentage of original responses that could be reproduced by knowing the scale scores used
to summarize them) above .85. Other commonly used metrics for assessing the quality of a
Guttman scale are Menzel's coefficient of scalability and the coefficient of homogeneity
(Loevinger, 1948; Cliff, 1977; Krus and Blackman, 1988). To maximize unidimensionality,
misfitting items are re-written or discarded.
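A minimal sketch of the coefficient of reproducibility as described above, predicting each response pattern from its total score (the error-counting convention shown is one of several in the literature):

```python
def coefficient_of_reproducibility(responses, item_order):
    """Guttman coefficient of reproducibility: 1 - errors / total responses.

    responses: list of 0/1 response vectors.
    item_order: item indices ordered from least to most extreme; each
    person's predicted pattern endorses the first `total score` items.
    """
    errors = 0
    total = 0
    for pattern in responses:
        score = sum(pattern)
        predicted = [1 if rank < score else 0 for rank in range(len(item_order))]
        observed = [pattern[i] for i in item_order]
        errors += sum(p != o for p, o in zip(predicted, observed))
        total += len(pattern)
    return 1 - errors / total

# Three perfectly scalable patterns and one with two errors: CR = 0.875.
data = [[1, 1, 0, 0], [1, 0, 0, 0], [1, 1, 1, 0], [1, 0, 1, 0]]
print(coefficient_of_reproducibility(data, item_order=[0, 1, 2, 3]))
```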
Stochastic models
Guttman's deterministic model is brought within a probabilistic framework in item response
theory models, and especially Rasch measurement. The Rasch model requires a probabilistic
Guttman structure when items have dichotomous responses (e.g. right/wrong). In the Rasch
model, the Guttman response pattern is the most probable response pattern for a person when
items are ordered from least difficult to most difficult (Andrich, 1985). In addition, the
Polytomous Rasch model is premised on a deterministic latent Guttman response subspace, and
this is the basis for integer scoring in the model (Andrich, 1978, 2005). Analysis of data using
item response theory requires comparatively longer instruments and larger datasets to scale item
and person locations and evaluate the fit of data to model.
In practice, actual data from respondents do not closely match Guttman's deterministic model.
Several probabilistic models of Guttman implicatory scales were developed by Krus (1977) and
Krus and Bart (1974).
Applications
The Guttman scale is used mostly when researchers want to design short questionnaires with
good discriminating ability. The Guttman model works best for constructs that are hierarchical
and highly structured such as social distance, organizational hierarchies, and evolutionary stages.
Unfolding models
A class of unidimensional models that contrasts with Guttman's model is the class of unfolding
models. These models also assume unidimensionality but posit that the probability of endorsing
an item decreases with the distance between the item's standing on the unidimensional trait and
the standing of the respondent. For example, items like "I think immigration should be reduced" on a
scale measuring attitude towards immigration would be unlikely to be endorsed both by those
favoring open policies and also by those favoring no immigration at all. Such an item might be
endorsed by someone in the middle of the continuum. Some researchers feel that many attitude
items fit this unfolding model while most psychometric techniques are based on correlation or
factor analysis, and thus implicitly assume a linear relationship between the trait and the
response probability. The effect of using these techniques would be to only include the most
extreme items, leaving attitude instruments with little precision to measure the trait standing of
individuals in the middle of the continuum.
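A sketch of a single-peaked response function of this kind; the squared-distance kernel below is chosen purely for illustration, and the hyperbolic cosine model mentioned earlier takes a different form:

```python
import math

def unfolding_prob(theta, delta, scale=1.0):
    """Single-peaked endorsement probability: highest when the respondent's
    position theta matches the item's position delta, declining with distance.
    Illustrative kernel only; not the hyperbolic cosine model itself."""
    return math.exp(-((theta - delta) / scale) ** 2)

# An item located mid-continuum (e.g. "immigration should be reduced", delta = 0)
for theta in (-2.0, 0.0, 2.0):   # open-policy, moderate, restrictionist respondents
    print(theta, round(unfolding_prob(theta, delta=0.0), 3))
```

Running the loop shows endorsement is most probable for the respondent in the middle of the continuum and falls away symmetrically toward both extremes, which is the defining feature of unfolding.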
Magnitude estimation scale
Non-comparative scaling techniques
Continuous rating scale
Likert scale
A Likert scale (/ˈlɪkərt/[1]) is a psychometric scale commonly involved in research that employs
questionnaires. It is the most widely used approach to scaling responses in survey research, such
that the term is often used interchangeably with rating scale, or more accurately the Likert-type
scale, even though the two are not synonymous. The scale is named after its inventor,
psychologist Rensis Likert.[2] Likert distinguished between a scale proper, which emerges from
collective responses to a set of items (usually eight or more), and the format in which responses
are scored along a range. Technically speaking, a Likert scale refers only to the former. The
difference between these two concepts has to do with the distinction Likert made between the
underlying phenomenon being investigated and the means of capturing variation that points to
the underlying phenomenon.[3] When responding to a Likert questionnaire item, respondents
specify their level of agreement or disagreement on a symmetric agree-disagree scale for a series
of statements. Thus, the range captures the intensity of their feelings for a given item.[4] A scale
can be created as the simple sum of questionnaire responses over the full range of the scale. In so
doing, Likert scaling assumes that distances on each item are equal. Importantly, "All items are
assumed to be replications of each other or in other words items are considered to be parallel
instruments" [5] (p. 197). By contrast modern test theory treats the difficulty of each item (the
ICCs) as information to be incorporated in scaling items.
Likert scales and items
A Likert scale pertaining to Wikipedia can be calculated using these five Likert items.
An important distinction must be made between a Likert scale and a Likert item. The Likert scale
is the sum of responses on several Likert items. Because Likert items are often accompanied by a
visual analog scale (e.g., a horizontal line, on which a subject indicates his or her response by
circling or checking tick-marks), the items are sometimes called scales themselves. This is the
source of much confusion; it is better, therefore, to reserve the term Likert scale to apply to the
summed scale, and Likert item to refer to an individual item.
A Likert item is simply a statement which the respondent is asked to evaluate according to any
kind of subjective or objective criteria; generally the level of agreement or disagreement is
measured. It is considered symmetric or "balanced" because there are equal numbers of positive
and negative positions.[6] Often five ordered response levels are used, although many
psychometricians advocate using seven or nine levels; a recent empirical study[7] found that items
with five or seven levels may produce slightly higher mean scores relative to the highest possible
attainable score, compared to those produced from the use of 10 levels, and this difference was
statistically significant. In terms of the other data characteristics, there was very little difference
among the scale formats in terms of variation about the mean, skewness or kurtosis.
The format of a typical five-level Likert item, for example, could be:
1. Strongly disagree
2. Disagree
3. Neither agree nor disagree
4. Agree
5. Strongly agree
Likert scaling is a bipolar scaling method, measuring either positive or negative response to a
statement. Sometimes an even-point scale is used, where the middle option of "Neither agree nor
disagree" is not available. This is sometimes called a "forced choice" method, since the neutral
option is removed.[8] The neutral option can be seen as an easy option to take when a respondent
is unsure, and so whether it is a true neutral option is questionable. A 1987 study found
negligible differences between the use of "undecided" and "neutral" as the middle option in a 5-
point Likert scale.[9]
Likert scales may be subject to distortion from several causes. Respondents may avoid using
extreme response categories (central tendency bias); agree with statements as presented
(acquiescence bias); or try to portray themselves or their organization in a more favorable light
(social desirability bias). Designing a scale with balanced keying (an equal number of positive
and negative statements) can obviate the problem of acquiescence bias, since acquiescence on
positively keyed items will balance acquiescence on negatively keyed items, but central tendency
and social desirability are somewhat more problematic.
Scoring and analysis
After the questionnaire is completed, each item may be analyzed separately or in some cases
item responses may be summed to create a score for a group of items. Hence, Likert scales are
often called summative scales.
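A minimal sketch of summative scoring with reverse-keyed items (the item names and keying are illustrative):

```python
def likert_score(responses, reverse_keyed=(), n_levels=5):
    """Sum responses over a set of Likert items to form a scale score.

    responses: dict mapping item name to a response in 1..n_levels.
    reverse_keyed: negatively worded items whose scoring is flipped
    (1 becomes n_levels, and so on).
    """
    total = 0
    for item, value in responses.items():
        if item in reverse_keyed:
            value = n_levels + 1 - value
        total += value
    return total

answers = {"q1": 4, "q2": 5, "q3": 2, "q4": 1}
print(likert_score(answers, reverse_keyed={"q3", "q4"}))  # 4 + 5 + 4 + 5 = 18
```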
Whether individual Likert items can be considered as interval-level data, or whether they should
be treated as ordered-categorical data is the subject of considerable disagreement in the
literature,[10][11] with strong convictions about which methods are most applicable. This
disagreement can be
traced back, in many respects, to the extent to which Likert items are interpreted as being ordinal
data.
There are two primary considerations in this discussion. First, Likert scales are arbitrary. The
value assigned to a Likert item has no objective numerical basis, either in terms of measure
theory or scale (from which a distance metric can be determined). The value assigned to each
Likert item is simply determined by the researcher designing the survey, who makes the decision
based on a desired level of detail. However, by convention Likert items tend to be assigned
progressive positive integer values. Likert scales typically range from 2 to 10 – with 5 or 7 being
the most common. Further, this progressive structure of the scale is such that each successive
Likert item is treated as indicating a ‘better’ response than the preceding value. (This may differ
in cases where reverse ordering of the Likert Scale is needed).
The second, and possibly more important point, is whether the ‘distance’ between each
successive item category is equivalent, which is inferred traditionally. For example, in the above
five-point Likert item, the inference is that the ‘distance’ between category 1 and 2 is the same as
between category 3 and 4. In terms of good research practice, an equidistant presentation by the
researcher is important; otherwise a bias in the analysis may result. For example, a four-point
Likert item with categories "Poor", "Average", "Good", and "Very Good" is unlikely to have all
equidistant categories since there is only one category that can receive a below average rating.
This would arguably bias any result in favor of a positive outcome. On the other hand, even if a
researcher presents what he or she believes are equidistant categories, it may not be interpreted
as such by the respondent.
A good Likert scale, as above, will present a symmetry of categories about a midpoint with
clearly defined linguistic qualifiers. In such symmetric scaling, equidistant attributes will
typically be more clearly observed or, at least, inferred. It is when a Likert scale is symmetric
and equidistant that it will behave more like an interval-level measurement. So while a Likert
scale is indeed ordinal, if well presented it may nevertheless approximate an interval-level
measurement. This can be beneficial since, if it was treated just as an ordinal scale, then some
valuable information could be lost if the ‘distance’ between Likert items were not available for
consideration. The important idea here is that the appropriate type of analysis is dependent on
how the Likert scale has been presented.
Notions of central tendency are often applicable at the item level - that is, responses often show a
quasi-normal distribution. The validity of such measures depends on the underlying interval
nature of the scale.
Responses to several Likert questions may be summed providing that all questions use the same
Likert scale and that the scale is a defensible approximation to an interval scale, in which case
the Central Limit Theorem allows treatment of the data as interval data measuring a latent
variable.[citation needed] If the summed responses fulfill these assumptions, parametric statistical tests
such as the analysis of variance can be applied. Typical cutoffs for thinking that this
approximation will be acceptable is a minimum of 4 and preferably 8 items in the sum.[12][13]
To model binary Likert responses directly, they may be represented in a binomial form by
summing agree and disagree responses separately. The chi-squared, Cochran Q, or McNemar test
are common statistical procedures used after this transformation. Non-parametric tests such as
the chi-squared test, Mann–Whitney test, Wilcoxon signed-rank test, or Kruskal–Wallis test[14]
are often used in the analysis of Likert scale data.
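For example, a Mann–Whitney test comparing Likert ratings from two groups can be run with scipy; the data below are invented for illustration:

```python
from scipy.stats import mannwhitneyu

# Likert responses (1-5) from two groups, treated as ordinal data.
group_a = [4, 5, 3, 4, 4, 5, 2, 4]
group_b = [2, 3, 3, 1, 2, 4, 2, 3]

stat, p_value = mannwhitneyu(group_a, group_b, alternative='two-sided')
print(stat, p_value)  # tests whether one group tends to give higher ratings
```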
Consensus based assessment (CBA) can be used to create an objective standard for Likert scales
in domains where no generally accepted or objective standard exists. Consensus based
assessment (CBA) can be used to refine or even validate generally accepted standards.[citation needed]
Level of measurement
The five response categories are often believed to represent an Interval level of measurement.
But this can only be the case if the intervals between the scale points correspond to empirical
observations in a metric sense. Reips and Funke (2008)[15] show that this criterion is much better
met by a visual analogue scale. In fact, phenomena may even appear that call into question
the ordinal scale level of Likert scales. For example, in a set of items A, B, C rated with a Likert
scale circular relations like A>B, B>C and C>A can appear. This violates the axiom of
transitivity for the ordinal scale.
Research by Labovitz[16] and Traylor[17] provides evidence that, even with rather large distortions
of perceived distances between scale points, Likert-type items perform closely to scales that are
perceived as equal intervals. So these items and other equal-appearing scales in questionnaires
are robust to violations of the equal distance assumption many researchers believe are required
for parametric statistical procedures and tests.
Munshi has shown that the equal interval assumption may not be valid and that careful
construction of the scale paying attention to both the number of choices and their placement on
the scale (and therefore their weight) may be necessary if the data are to be treated as interval
data.[18]
Rasch model
Likert scale data can, in principle, be used as a basis for obtaining interval level estimates on a
continuum by applying the polytomous Rasch model, when data can be obtained that fit this
model. In addition, the polytomous Rasch model permits testing of the hypothesis that the
statements reflect increasing levels of an attitude or trait, as intended. For example, application
of the model often indicates that the neutral category does not represent a level of attitude or trait
between the disagree and agree categories.
Again, not every set of Likert scaled items can be used for Rasch measurement. The data has to
be thoroughly checked to fulfill the strict formal axioms of the model.
Pronunciation
Rensis Likert, the developer of the scale, pronounced his name 'lick-urt' with a short "i" sound.[19][20]
It has been claimed that Likert's name "is among the most mispronounced in [the] field",[21] as
many people pronounce it with a diphthong "i" sound ('lie-kurt').
Phrase completions
Phrase completion scales are a type of psychometric scale used in questionnaires. Developed in
response to the problems associated with Likert scales, phrase completions are concise,
unidimensional measures that tap ordinal level data in a manner that approximates interval level
data.
Overview of the phrase completion method
Phrase completions consist of a phrase followed by an 11-point response key. The phrase
introduces part of the concept. Marking a reply on the response key completes the concept. The
response key represents the underlying theoretical continuum. Zero (0) indicates the absence of
the construct. Ten (10) indicates the theorized maximum amount of the construct. Response keys
are reversed on alternate items to mitigate response set bias.
Sample question using the phrase completion method
I am aware of the presence of God or the Divine
Never 0 1 2 3 4 5 6 7 8 9 10 Continually
Scoring and analysis
After the questionnaire is completed, the scores on the items are summed to create a test score
for the respondent. Hence, phrase completions, like Likert scales, are often considered to
be summative scales.
Level of measurement
The response categories represent an ordinal level of measurement. Ordinal level data, however,
varies in terms of how closely it approximates interval level data. By using a numerical
continuum as the response key instead of sentiments that reflect intensity of agreement,
respondents may be able to quantify their responses in more equal units.
Semantic differential
Fig. 1. Modern Japanese version of the Semantic
Differential. The Kanji characters in background stand for
"God" and "Wind" respectively, with the compound
reading "Kamikaze". (Adapted from Dimensions of
Meaning. Visual Statistics Illustrated at
VisualStatistics.net.)
Semantic differential is a type of a rating scale designed to measure the connotative meaning of
objects, events, and concepts. The connotations are used to derive the attitude towards the given
object, event or concept.
Osgood's semantic differential was an application of his more general attempt to measure the
semantics or meaning of words, particularly adjectives, and their referent concepts. The
respondent is asked to choose where his or her position lies, on a scale between two bipolar
adjectives (for example: "Adequate-Inadequate", "Good-Evil" or "Valuable-Worthless").
Semantic differentials can be used to measure opinions, attitudes and values on a
psychometrically controlled scale.
Theoretical background
Nominalists and realists
Theoretical underpinnings of Charles E. Osgood's semantic differential have roots in the
medieval controversy between the nominalists and realists.[citation needed] Nominalists asserted that
only real things are entities and that abstractions from these entities, called universals, are mere
words. The realists held that universals have an independent objective existence either in a realm
of their own or in the mind of God. Osgood’s theoretical work also bears affinity to linguistics
and general semantics and relates to Korzybski's structural differential.[citation needed]
Use of adjectives
The development of this instrument provides an interesting insight into the broader area between
linguistics and psychology. People have been describing each other since they developed the
ability to speak. Most adjectives can also be used as personality descriptors. The occurrence of thousands of adjectives in English attests to the subtlety of the distinctions that the language makes available for describing persons and their behavior. Roget's Thesaurus was an early attempt to classify most adjectives into categories, and it was used within this context to reduce the number of adjectives to manageable subsets suitable for factor analysis.
Evaluation, potency, and activity
Osgood and his colleagues performed a factor analysis of large collections of semantic
differential scales and found three recurring attitudes that people use to evaluate words and
phrases: evaluation, potency, and activity. Evaluation loads highest on the adjective pair 'good-bad'. The 'strong-weak' adjective pair defines the potency factor, and the 'active-passive' pair defines the activity factor. These three dimensions of affective meaning were found to be cross-cultural universals in a study of dozens of cultures.
This factorial structure makes intuitive sense. When our ancestors encountered a person, the
initial perception had to be whether that person represents a danger. Is the person good or bad?
Next, is the person strong or weak? Our reactions to a person differ markedly depending on whether the person is perceived as good and strong, good and weak, bad and weak, or bad and strong. Subsequently, we might
extend our initial classification to include cases of persons who actively threaten us or represent
only a potential danger, and so on. The evaluation, potency and activity factors thus encompass a
detailed descriptive system of personality. Osgood's semantic differential measures these three
factors. It contains sets of adjective pairs such as warm-cold, bright-dark, beautiful-ugly, sweet-
bitter, fair-unfair, brave-cowardly, meaningful-meaningless.
The studies of Osgood and his colleagues revealed that the evaluative factor accounted for most
of the variance in scalings, and related this to the idea of attitudes.[1]
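An analysis in the spirit of Osgood's can be sketched with an off-the-shelf factor analysis routine. The ratings below are random placeholders standing in for real semantic differential data; with genuine ratings, the three extracted factors would be expected to align with evaluation, potency, and activity.

    import numpy as np
    from sklearn.decomposition import FactorAnalysis

    pairs = ["good-bad", "strong-weak", "active-passive",
             "beautiful-ugly", "brave-cowardly", "sweet-bitter"]
    rng = np.random.default_rng(0)
    # Placeholder for real ratings: 200 respondents, 7-point scales coded -3..+3.
    ratings = rng.integers(-3, 4, size=(200, len(pairs))).astype(float)

    # Extract three factors and inspect which adjective pairs load on each.
    fa = FactorAnalysis(n_components=3, random_state=0).fit(ratings)
    for i, loadings in enumerate(fa.components_):
        print(f"factor {i + 1}:",
              {p: round(l, 2) for p, l in zip(pairs, loadings)})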
Usage
The semantic differential is today one of the most widely used scales in the measurement of
attitudes. One of the reasons is the versatility of the items. The bipolar adjective pairs can be
used for a wide variety of subjects, and as such the scale is nicknamed "the ever ready battery" of
the attitude researcher.[2]
Statistical properties
Five items, or five bipolar pairs of adjectives, have been shown to yield reliable findings that correlate highly with alternative Likert-type numerical measures of the same attitude.[3]
One problem with this scale is that its psychometric properties and level of measurement are
disputed.[2] The most general approach is to treat it as an ordinal scale, but it can be argued that
the neutral response (i.e. the middle alternative on the scale) serves as an arbitrary zero point,
and that the intervals between the scale values can be treated as equal, making it an interval
scale.
A detailed presentation on the development of the semantic differential is provided in the
monumental book, Cross-Cultural Universals of Affective Meaning.[4] David R. Heise's
Surveying Cultures[5] provides a contemporary update with special attention to measurement
issues when using computerized graphic rating scales.
Thurstone scale
In psychology and sociology, the Thurstone scale was the first formal technique to measure an
attitude. It was developed by Louis Leon Thurstone in 1928, as a means of measuring attitudes
towards religion. It is made up of statements about a particular issue, and each statement has a
numerical value indicating how favorable or unfavorable it is judged to be. People check each of the statements with which they agree, and a mean score is computed, indicating their attitude.
Thurstone's method of pair comparisons can be considered a prototype of a normal distribution-
based method for scaling-dominance matrices. Even though the theory behind this method is
quite complex (Thurstone, 1927a), the algorithm itself is straightforward. For the basic Case V,
the frequency dominance matrix is translated into proportions and interfaced with the standard
scores. The scale is then obtained as a left-adjusted column marginal average of this standard
score matrix (Thurstone, 1927b). The underlying rationale for the method and basis for the
measurement of the "psychological scale separation between any two stimuli" derives from
Thurstone's Law of comparative judgment (Thurstone, 1928).
The principal difficulty with this algorithm is its indeterminacy with respect to one-zero
proportions, which return z values as plus or minus infinity, respectively. The inability of the pair
comparisons algorithm to handle these cases imposes considerable limits on the applicability of
the method.
The most frequent recourse when the 1.00-0.00 frequencies are encountered is their omission.
Thus, e.g., Guilford (1954, p. 163) has recommended not using proportions more extreme
than .977 or .023, and Edwards (1957, pp. 41–42) has suggested that “if the number of judges is
large, say 200 or more, then we might use pij values of .99 and .01, but with less than 200
judges, it is probably better to disregard all comparative judgments for which pij is greater
than .98 or less than .02.” Since the omission of such extreme values leaves empty cells in the Z
matrix, the averaging procedure for arriving at the scale values cannot be applied, and an
elaborate procedure for the estimation of unknown parameters is usually employed (Edwards,
1957, pp. 42–46). An alternative solution of this problem was suggested by Krus and Kennedy
(1977).
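The basic Case V algorithm is short enough to sketch directly. This version clips extreme proportions, in the spirit of the Guilford and Edwards recommendations above, rather than omitting them; the function name and the frequency matrix are hypothetical.

    import numpy as np
    from scipy.stats import norm

    def thurstone_case_v(freq, clip=(0.02, 0.98)):
        """Scale values for Thurstone's Case V from a dominance frequency matrix.

        freq[i, j] = number of judges who preferred stimulus j over stimulus i.
        The result is an interval scale, so only differences are meaningful.
        """
        totals = freq + freq.T                  # judgments per pair
        p = np.full(freq.shape, 0.5)            # a stimulus ties with itself
        off = totals > 0
        p[off] = freq[off] / totals[off]        # dominance proportions
        p = np.clip(p, *clip)                   # guard against infinite z values
        z = norm.ppf(p)                         # standard-score (Z) matrix
        return z.mean(axis=0)                   # column means give the scale values

    # Hypothetical data: 100 judges compare three stimuli pairwise.
    freq = np.array([[ 0, 70, 90],
                     [30,  0, 80],
                     [10, 20,  0]])
    print(thurstone_case_v(freq))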
With later developments in psychometric theory, it has become possible to employ direct
methods of scaling such as application of the Rasch model or unfolding models such as the
Hyperbolic Cosine Model (HCM) (Andrich & Luo, 1993). The Rasch model has a close
conceptual relationship to Thurstone's law of comparative judgment (Andrich, 1978), the
principal difference being that it directly incorporates a person parameter. Also, the Rasch model
takes the form of a logistic function rather than a cumulative normal function.
Mathematically derived scale
Rating scale
A rating scale is a set of categories designed to elicit information about a quantitative or a
qualitative attribute. In the social sciences, particularly psychology, common examples are the Likert scale and 1-10 rating scales in which a person selects the number which is considered to reflect the perceived quality of a product.
Background
A rating scale is a method that requires the rater to assign a value, sometimes numeric, to the
rated object, as a measure of some rated attribute.
Types of rating scales
All rating scales can be classified into one of three types:
1. Some data are measured at the ordinal level. Numbers indicate the relative position of items,
but not the magnitude of difference. One example is a Likert scale:
Statement: e.g. "I could not live without my computer".
Response options:
1. Strongly disagree
2. Disagree
3. Agree
4. Strongly agree
2. Some data are measured at the interval level. Numbers indicate the magnitude of difference
between items, but there is no absolute zero point. Examples are attitude scales and opinion
scales.
3. Some data are measured at the ratio level. Numbers indicate magnitude of difference and there
is a fixed zero point. Ratios can be calculated. Examples include age, income, price, costs, sales
revenue, sales volume and market share.
More than one rating scale item is required to measure an attitude or perception, due to the requirement for statistical comparisons between the categories in the polytomous Rasch model for ordered categories.[1] In terms of classical test theory, more than one question is required to
obtain an index of internal reliability such as Cronbach's alpha,[2] which is a basic criterion for
assessing the effectiveness of a rating scale and, more generally, a psychometric instrument.
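Cronbach's alpha is straightforward to compute from a respondents-by-items score matrix: alpha = (k / (k - 1)) * (1 - sum of item variances / variance of total scores). A minimal sketch with hypothetical ratings:

    import numpy as np

    def cronbach_alpha(scores):
        """Cronbach's alpha for an (n_respondents, k_items) score matrix."""
        scores = np.asarray(scores, dtype=float)
        k = scores.shape[1]
        item_vars = scores.var(axis=0, ddof=1)      # variance of each item
        total_var = scores.sum(axis=1).var(ddof=1)  # variance of the summed scores
        return k / (k - 1) * (1 - item_vars.sum() / total_var)

    # Hypothetical data: five respondents answering four related items.
    ratings = [[4, 3, 4, 4],
               [2, 2, 3, 2],
               [5, 4, 5, 5],
               [3, 3, 2, 3],
               [4, 4, 4, 5]]
    print(round(cronbach_alpha(ratings), 3))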
Rating scales used online
Rating scales are used widely online in an attempt to provide indications of consumer opinions
of products. Examples of sites which employ rating scales are IMDb, Epinions.com, Yahoo! Movies, Amazon.com, BoardGameGeek and TV.com, which uses a rating scale from 0 to 100 in order to obtain "personalised film recommendations".
In almost all cases, online rating scales only allow one rating per user per product, though there
are exceptions such as Ratings.net, which allows users to rate products in relation to several
qualities. Most online rating facilities also provide few or no qualitative descriptions of the rating
categories, although again there are exceptions such as Yahoo! Movies, which labels each of the categories between F and A+, and BoardGameGeek, which provides explicit descriptions of each category from 1 to 10. Often, only the top and bottom categories are described, as on IMDb's online rating facility.
Validity
With each user rating a product only once, for example in a category from 1 to 10, there is no
means for evaluating internal reliability using an index such as Cronbach's alpha. It is therefore
impossible to evaluate the validity of the ratings as measures of viewer perceptions. Establishing validity would require establishing both reliability and accuracy (i.e. that the ratings represent what they are supposed to represent). The degree of validity of an instrument is determined through the application of logic and/or statistical procedures: "A measurement procedure is valid to the degree that it measures what it proposes to measure."
Another fundamental issue is that online ratings usually involve convenience sampling much like
television polls, i.e. they represent only the opinions of those inclined to submit ratings.
Validity is concerned with different aspects of the measurement process. Types of validity include content validity, predictive validity, and construct validity; each uses logic, statistical verification, or both to determine the degree of validity, and each has special value under certain conditions.
Sampling
Sampling errors can lead to results which have a specific bias, or are only relevant to a specific
subgroup. Consider this example: suppose that a film only appeals to a specialist audience—90%
of them are devotees of this genre, and only 10% are people with a general interest in movies.
Assume the film is very popular among the audience that views it, and that only those who feel
most strongly about the film are inclined to rate the film online; hence the raters are all drawn
from the devotees. This combination may lead to very high ratings of the film, which do not
generalize beyond the people who actually see the film (or possibly even beyond those who
actually rate it).
Qualitative description
Qualitative description of categories improves the usefulness of a rating scale. For example, if
only the points 1-10 are given without description, some people may select 10 rarely, whereas
others may select the category often. If, instead, "10" is described as "near flawless", the
category is more likely to mean the same thing to different people. This applies to all categories,
not just the extreme points.
The above issues are compounded when aggregated statistics such as averages are used for lists and rankings of products. User ratings are at best ordinal categorizations. While it is not uncommon to calculate averages or means for such data, doing so cannot be justified, because calculating averages requires equal intervals that represent the same difference between levels of perceived quality. The key issues with aggregate data based on the kinds of rating scales commonly used online are as follows (a short sketch of ordinal-appropriate summaries appears after the list):
 Averages should not be calculated for data of the kind collected.
 It is usually impossible to evaluate the reliability or validity of user ratings.
 Products are not compared with respect to explicit, let alone common [clarification needed], criteria.
 Only users inclined to submit a rating for a product do so.
 Data are not usually published in a form that permits evaluation of the product ratings.
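As a hypothetical illustration of the first point above, a median together with the frequency distribution is a defensible summary of ordinal ratings where a mean is not:

    from collections import Counter
    from statistics import median

    # Hypothetical 1-10 user ratings for one product.
    ratings = [10, 9, 10, 8, 10, 3, 10, 9, 10, 2]

    print("median:", median(ratings))                      # ordinal-appropriate summary
    print("distribution:", sorted(Counter(ratings).items()))
    # The polarised shape visible in the distribution is hidden by a mean.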
More developed methodologies include Choice Modelling or Maximum Difference methods, the
latter being related to the Rasch model due to the connection between Thurstone's law of
comparative judgement[clarification needed] and the Rasch model.