Myers 2002

RESEARCH UPDATE REVIEW

This series of 10-year updates in child and adolescent psychiatry began in July 1996. Topics are selected in
consultation with the AACAP Committee on Recertification, both for the importance of new research and
its clinical or developmental significance. The authors have been asked to place an asterisk before the five
or six most seminal references.

Ten-Year Review of Rating Scales. I: Overview of Scale Functioning, Psychometric Properties, and Selection
KATHLEEN MYERS, M.D., M.P.H., M.S., AND NANCY C. WINTERS, M.D.

ABSTRACT
Objective: As part of the Journal’s 10-year Research Update Reviews, a series of articles will be presented on the role
of rating scales in child and adolescent psychiatry. The first article in the series summarizes principles underlying scale
functioning. Method: Sources were reviewed regarding testing theory, scale development, variability in scale functioning,
psychometric properties, and scale selection. The extracted information was adapted to issues in child and adolescent
psychiatry. Results: Rating scales can make major contributions to understanding youths’ needs. They provide easy
and efficient measurement of psychopathology and quantify underlying constructs for comparison across youths, time,
and applications. Although multiple factors may affect a scale’s functioning, these factors can be understood and managed
by considering the goals of measurement and basic psychometric principles. Conclusions: Potential users of rating
scales should not blindly assume that a well-known scale will meet the measurement needs for a particular application.
Rather, they can relatively easily educate themselves regarding the appropriate use of rating scales. This article provides
the background information needed to evaluate scales for intended applications. It will also assist in reviewing the
individual scales presented in subsequent articles in this series. J. Am. Acad. Child Adolesc. Psychiatry, 2002, 41(2):114–122.
Key Words: rating scales, psychometrics, measurement.

Accepted September 4, 2001.
Dr. Myers was Associate Professor and Director of Outpatient Child and Adolescent Psychiatry and Dr. Winters is
Assistant Professor and Director of Training, both at Oregon Health Sciences University, Portland.
Correspondence to Dr. Myers, Division of Child Psychiatry CH-13, Children’s Hospital and Regional Medical Center,
Box 5371, 4800 Sand Point Way, N.E., Seattle WA 98105.
0890-8567/02/4102–0114 ©2002 by the American Academy of Child and Adolescent Psychiatry.
J. Am. Acad. Child Adolesc. Psychiatry, 41:2, February 2002

This article is the first in a series of 10-year updates on the contributions of rating scales to child and adolescent
psychiatry. The topic is broad, and the literature addressing it is rich. To adequately cover the wealth of information
while remaining within the scope of the Journal’s 10-year updates, we have grouped the rating scales into several areas
for sequential publication: overview, internalizing disorders, psychosocial scales, externalizing disorders, and
miscellaneous scales. Within the four groupings, we have emphasized scales most relevant to practice in child and
adolescent psychiatry. This means the exclusion of some interesting scales widely used in related professional fields,
but with limited relevance for child and adolescent psychiatry.

The scales reviewed in this series of articles were chosen according to several guidelines. In keeping with the current
practice of child and adolescent psychiatry, we chose scales predominantly relating to diagnosis, such as major
depressive disorder, obsessive-compulsive disorder (OCD), and attention-deficit/hyperactivity disorder (ADHD), or
alternatively relating to specific constructs, such as self-esteem and aggression. Older scales were included if they had
long track records in research or practice, established psychometric properties, and continued citations in the
literature. Newer scales cannot meet these criteria. Rather, we selected newer scales that have promising initial
psychometric properties and that also improve upon older scales or fill a special niche. To determine which scales meet
these criteria, we sampled articles on various diagnostic categories published over the past 25 years and selected scales
with good representation or special applications. We reviewed the properties of these scales. We then included those
with adequate psychometric properties or wide use. Finally, to determine their current applications, we examined studies
over the past decade that used these scales. Specifically not included in these reviews are broad-band scales and
personality inventories.

Subsequent articles will present specific scales along with their relevant psychometric properties, strengths and
weaknesses, and current use. This article reviews principles underlying the successful use of these rating scales,
including variability in scales’ functioning, psychometric properties, and successful scale selection.

DEFINITIONS AND CHARACTERISTICS

Rating scales gained popularity in the second half of the 20th century as a response to the declining interest in
projective measures, along with an increasing focus on scientific measurement, refinements in diagnostic nomenclature,
development of new models of juvenile psychopathology, need for outcome measures in clinical trials, and the recognition
of internalizing disorders in youths. The term rating scale is broad and encompasses multiple types of measurement,
including checklists, questionnaires, inventories, self-reports, other-reports, indices, and other measures. In this
series, rating scale refers to any type of measure that provides relatively rapid assessment of a specific construct with
an easily derived numerical score which is readily interpreted, whether completed by the youth or someone else,
regardless of the response format and irrespective of application. Traditionally, self-report methodology has been used
to assess internalizing disorders and emotional functioning, while other-report methodology from parents and teachers has
been used to assess externalizing disorders and behavior. However, newer scales have incorporated multiple informants
into rating scales across disorders and related constructs.

Rating scales are diverse (Corcoran and Fischer, 2000a; Piacentini, 1993). They may measure global constructs, such as
anxiety or hostility; or specific behavioral quantities, such as types of phobias or number of fights. They may measure
a construct broadly, such as thinking style; or narrowly, such as negative cognitions. Some scales measure trait
characteristics, such as temperament, while others measure state functioning, such as fearfulness. Regardless of their
focus, all good scales include both generic and specific aspects of a problem in order to represent the range of
symptoms. For example, ADHD scales cover both hyperactivity and fidgeting, or disorganization and losing items. Scales
may be unidimensional or multidimensional. The former may be more time-efficient, but the latter provide more
information. Also, rating scales may be completed by various individuals, with each providing a unique perspective.
Youths describe their own perceptions, which often surprise even those adults closest to them. Parents provide the most
comprehensive knowledge as they observe variations in behavior across multiple situations. Teachers best note deviations
from peers in the normalized setting of school. Some scales use peer ratings to gain a perspective of youths in their own
world. Clinicians draw upon their experience in assessing youths’ problems and functioning. The relevance of others’
reports may vary across youths’ development, e.g., parents’ importance may wane while teachers’ importance may increase
and then decline as youths mature. The best informants are those who initiate referral, provide feedback during
treatment, and are most familiar with the youth.

Finally, rating scales are standardized. They have uniform items, scoring, and administration procedures that do not
change over youths, users, applications, administrations, or time. Their developers are obliged to provide information
that allows the user to assess the scale’s relevance for an application. Unfortunately, optimal information is often not
available.

ADVANTAGES AND DISADVANTAGES OF RATING SCALES

The value of rating scales may be best indicated by their inclusion in the evaluation and treatment of youths as
recommended in the practice parameters published by the American Academy of Child and Adolescent Psychiatry. They have
multiple applications (Corcoran and Fischer, 2000a; Piacentini, 1993). Rating scales are used to screen groups in
normative settings such as school or the community, to monitor the emergence of symptoms in high-risk youths, to ensure
selection of homogeneous subjects for research, to evaluate intervention effects, and to determine treatment outcomes.
Rating scales can ensure systematic coverage of behaviors, thereby reducing variability in data collection. They provide
quantifiable information regarding the presence, frequency, and severity of behavior and symptoms. They allow comparisons
with self across multiple administrations, with peers in similar circumstances, and with the overall population of other
youths. Early in treatment, they may allow youths to more easily endorse distressing symptoms that they are
reluctant to discuss, such as hallucinations or deviant behaviors. They may also reveal difficult-to-observe behaviors,
such as compulsions or stealing. Rating scales are easy to score and to interpret. They are efficient and economical in
time, cost, and personnel. Of great relevance for outcome-based treatments is the increased accountability provided by
these quantifiable indicators. Finally, a major advantage is that specialized training is not needed to use most rating
scales. They do not require advanced training in psychological testing, knowledge of testing theory, or even in-depth
understanding of psychometrics. This is not to say that no training is needed. Some understanding of scale construction,
developmental relevance, informer variance, and limitations helps the clinician to better select and implement a scale.
Also, the examiner must provide standardized instruction to youths completing the scales. For venues using
clinician-rated scales with multiple raters, these raters must be trained to high interrater agreement.

Rating scales are not without disadvantages (Corcoran and Fischer, 2000c). These relate predominantly to youths’
self-reporting abilities, examiners’ goals for the scale, and psychometric properties. During their earliest years of
use, many juvenile scales represented downward modifications of adult scales and thus were not developmentally
appropriate; this compromised their functioning. Newer scales developed specifically for youths addressed this
shortcoming. However, concerns then arose regarding youths’ competence as reporters of their feelings and behaviors.
Although adolescents have generally been considered competent self-reporters, factors such as reading level, learning
disabilities, psychological maturity, and experience may attenuate their competence. Children’s competence is even less
clear. In addition to the factors affecting adolescents’ competence, children may also have limitations in their
linguistic skills, self-reflection, emotional awareness, and the ability to monitor their behaviors, thoughts, and
feelings. They may also tend to respond in a socially desirable manner. Despite these concerns, research over two decades
has shown that both adolescents and children can be reliable and valid self-reporters. Nevertheless, caution is still
warranted to ensure that there is an appropriate match between a particular scale and the youths completing it.

Examiners may have unrealistic goals or poorly formulated expectations for a scale and may not appreciate alternative
methods of ascertaining the same information. A typical unrealistic goal is examiners’ use of cutoff scores for
diagnostic determination. This is inappropriate. Rating scales are not diagnostic instruments and should not substitute
for diagnostic evaluation. When examiners do not clearly formulate their expectations, they may obtain interesting
information about their patients, but not information that helps to elucidate the youths’ problem or answer their
questions. Therefore, the expectations for a scale and the information obtained from it should be clear prior to scale
selection. For example, if the scale is hoped to easily render diagnostic classification in order to determine which
youths will receive pharmacotherapy, irrelevant data will be collected. On the other hand, if the scale is intended to
provide severity data during treatment, helpful monitoring may be achieved. Rating scales should also be among the
easiest ways of obtaining information. Thus, if the examiner is interested in whether a youth hallucinates, it may be
easier to simply ask the youth.

Finally, examiners may use scales so frequently that salient trends that take time to develop cannot be detected. For
example, weekly assessment of inattention during stimulant treatment of a child with ADHD might yield very helpful
information. However, weekly assessment of dissociation will not accurately depict effectiveness of treatment.

Disadvantages regarding psychometric properties are complex, but critical to a scale’s functioning. A summary of
psychometric properties is presented later. Relevant here is the recognition that many rating scales do not have
sufficient psychometric information to allow optimal decisions about their use. This is probably most frequently noted in
the lack of normative data that could be used to interpret scores. Many scales lack validity data, which makes it
difficult to know whether the intended construct has been measured. Compromises must then be made regarding which
psychometric properties are most important to an intended task. A widely used scale cannot simply be assumed to possess
the desired properties.

Perhaps the greatest empirical evidence of the disadvantages of rating scales is their poor functioning in recent
pharmacological trials (Ambrosini et al., 1999; Emslie et al., 1997). In these studies, diagnosis-specific rating scales
have not adequately detected treatment effects when global ratings have. The reasons are not clear, but they may include
suboptimal properties of the scales used, study design, youths’ self-reporting, too short a study period, or true
medication nonresponse. There are obvious implications for any treatment monitoring.

The main implication regarding these advantages and disadvantages is that rating scales can be very useful in the
assessment and treatment of youths. However, examiners must carefully consider what they want to learn, decide whether a
rating scale is the best way to ascertain this information, and then familiarize themselves with a scale’s functioning.
Blind use of a scale is likely to bring unanticipated, misleading, or invalid results.

VARIABILITY IN FUNCTIONING OF RATING SCALES

Although most scales highly discriminate clinically referred from community youths, misclassification rates up to 30% may
occur. Misclassification may be an even greater problem when clinical groups are compared. For example, most anxiety
rating scales will incorrectly include depressed youths, and ADHD scales may pick up other disruptive youths. Multiple
factors affect a scale’s performance (Piacentini, 1993; Reynolds, 1993).

Individual, Contextual, and Interpersonal Factors

The best-known individual factors include the association of younger age with less reliable and valid ratings, and female
gender with higher internalizing scores (Angold et al., 1987). Also, youths who seek social acceptance may underreport
symptoms, while those who feel overwhelmed may overreport symptoms.

Rating scales generally fail to consider contextual factors. It is widely recognized that both children’s and
adolescents’ self-reports are situationally influenced. The behavior and emotional functioning of young children are
especially reactive to environmental factors. Youths also function differently across settings, such as between home and
school, or between the classroom and playground. It may be difficult to decide whether measured problems reflect
underlying psychopathology or contextual issues. These site factors are often addressed by using multiple informants in
different relevant sites. However, scales rarely address environmental stressors.

The poor concordance between different adult reporters and between juvenile and adult reporters has been well documented
(Herjanic and Reich, 1982; Ines and Sacco, 1992; Welner et al., 1987). Four factors are particularly salient regarding
this poor concordance: contextual factors, the youth’s development, parental psychopathology, and the type of symptom
assessed.

Contextual issues were discussed above in relation to youths’ differential functioning across settings. It is not
surprising that adults in the same settings demonstrate greater agreement than adults in different settings. For example,
a teacher will demonstrate greater agreement with another teacher than with the youth’s parents. Even mothers and fathers
may differ. Generally, mothers tend to rate their children’s symptoms higher than do fathers, perhaps suggesting the
different contexts youths experience with each parent.

Developmentally, the older the child, the better the concordance between youths’ and adults’ reports. However, it is
unclear whether the improving concordance reflects social-cognitive development, greater verbal abilities, or other
factors (Renouf and Kovacs, 1994). Furthermore, concordance at all ages decreases when mothers are depressed, since they
then overreport depressive symptoms (Angold et al., 1987; Fergusson et al., 1993; Moretti et al., 1985; Weissman et al.,
1987) and possibly behavior problems (Fergusson et al., 1993) in their children. Other factors distressing the mother may
also influence her perception of her children.

The type of symptom is also relevant. Parents and children agree best on concrete and observable behaviors such as school
suspension or fighting, but poorly on psychological symptoms such as sadness or suicidality. In general, parents are
better reporters of externalizing behaviors and youths are better reporters of internalizing symptoms (Welner et al.,
1987; Yule, 1993).

Factors Relating to Characteristics of the Scale

Variability in rating scales partially reflects deficits in scale construction. Rating scales are often not developed
with samples representative of the population, but with specific groups such as urban black schoolchildren or children in
a single midwestern state. Application to other samples, such as rural Hispanic youths or incarcerated white teenagers,
may then produce suboptimal results. This is because the distribution of the measured variable varies according to
development, culture, and situation (Riegelman and Hirsch, 1989; Robson, 1988). Normative values, cutoffs, and other
properties will then vary with samples that differ from the original sample. Thus it is important to know the sample in
which a scale was developed and to decide whether it is sufficiently similar to the test sample to ensure minimal
variability in its functioning.

A particular concern is the degree to which a scale measures the range or complexities of a problem (Piacentini, 1993;
Reynolds, 1993). The scope of a scale may be too narrow to helpfully define clinical implications, or conversely, too
broad to measure the construct of interest.
While most scales focus on a particular problem, most are too general to tap subtleties of that problem. Thus it may not
be possible to find a scale that provides helpful information on specific aspects of a youth’s problem. As an example,
many depression-rating scales assess overall depression severity, but few provide information on somatic versus cognitive
aspects of depression. Furthermore, even when a scale is helpful in assessment, it may sensitize the youth toward its
content, i.e., the simple act of repetitively completing the scale may influence how youths endorse items. The scale
would then not be invariant across administrations, violating the standardization rules.

A scale may also not be appropriate to the type of symptom assessed. Since many symptoms represent state conditions, and
thus wax and wane, changes measured over time might not reflect the scale’s ability to detect treatment effects, but may
represent its variation with the natural course of the symptom. This may be more of an issue for internalizing than for
externalizing disorders. The scale must be matched to the symptoms.

On a practical level, specifications for using a scale are often not well delineated, especially the time frame for
reporting (Corcoran and Fischer, 2000b; Piacentini, 1993). Most scales specify time frames between 1 week and 1 month,
but many do not indicate any time frame. Thus examiners must make their own decisions without supporting data, producing
variable results. Also, a scale may not provide an optimal number of response options for a stated purpose (Aman, 1993;
Piacentini, 1993). For example, dichotomous responses may be adequate for detecting the presence of tics, but not for
detecting their decrease during treatment. Also, the type of response options may be vague (e.g., never, sometimes,
often) and thus confuse youths, leading to increased response variability. On the other hand, if the response options are
too precise (e.g., never, weekly, monthly), reliability and validity may be compromised. The length of the scale is also
important. Generally, a longer scale will demonstrate better psychometric properties, but may decrease youths’ ability to
maintain interest and respond accurately throughout the scale administration. Also, if properties of a scale have not
been examined in a long time, changes in understanding a disorder, youths’ sophistication, goals of measurement, and
other factors may alter applicability of the scale. Over such time periods, different versions of a scale may emerge
without reexamination of the psychometric properties. Similarly, when scales are purchased by publishing companies and
copyrighted, their formats and factor structures may be reevaluated. However, these newly marketed versions may not be
consistent with the original scale’s structure and their relevance may not be clear.

Factors Relating to Psychometric Properties

Rating scales do not provide “the truth.” They represent measurement of a variable, e.g., of a construct such as youths’
feelings or behavior. Measurement is the systematic process of assigning a number to this variable. However, such
measurement is subject to error, and thus variability in functioning. Psychometric properties provide an estimate of this
error, and thereby reveal how relevant these scores might be for a selected application (Sackett et al., 1991).
Unfortunately, most scales do not provide all of the psychometric data desired in selecting the best scale; and even when
such data are available, they may not be optimal. The user must then decide which properties best meet the needs of a
particular application.

A major psychometric issue affecting variability is the choice of a cutoff score. Cutoffs are useful for identifying
individuals for further clinical evaluation, but they always represent a trade-off between sensitivity and specificity. A
few points in either direction can greatly alter who will be considered clinically relevant and who will not receive
further intervention. Conversion of raw scores to T scores provides greater standardization as well as a useful
comparison in relation to all other examinees. T scores greater than 70 lie more than 2 SD above the mean (T score = 50)
and are considered statistically significant, but lower scores may be clinically significant. Strict adherence to either
raw score cutoffs or T scores to define clinical relevance may miss youths in need of treatment. Rather, it is important
to examine the pattern of raw scores or T scores and to consider these scores in relation to the examiner’s overall
practice population.

Because psychometric properties are so important to understanding the variability of rating scales and their appropriate
use, and because many readers may be “rusty” in their recollection of psychometric principles, following is a brief
review of the most relevant psychometric properties. This information will complement future articles in this series that
provide reliability and validity data for the scales reviewed.

REVIEW OF PSYCHOMETRIC PROPERTIES

A good measure is both reliable and valid. Unfortunately, no scale available for clinical practice is totally reliable
and/or totally valid. Lack of reliability and lack of validity are referred to as random and systematic error,
respectively. Thus, higher reliability and higher validity increase the user’s confidence that the scale is measuring
what it proposes to measure, with a minimum of error (Corcoran and Fischer, 2000b; Piacentini, 1993; Sackett et al.,
1991).

Reliability

Reliability refers to the consistency with which all of a scale’s items measure the same construct, and the consistency
with which the total scale measures that construct in the same way every time (Corcoran and Fischer, 2000c; Piacentini,
1993; Sackett et al., 1991). Stated in another way, reliability reveals whether the scale performs the same way every
time it is administered across persons, situations, and time. There are four approaches to reliability: consistency of
the items comprising a scale (internal reliability or internal consistency), stability of the scale over time and
measurements (test-retest reliability), agreement between different raters using the scale (interrater reliability), and
concordance between similar forms of a scale (parallel reliability).

Internal reliability, or internal consistency, measures the homogeneity of the scale. It represents the degree to which
the individual items are consistent with each other, and thus are tapping the same construct. Items that are not
internally consistent are likely measuring different constructs and detract from the scale. Typically, an item analysis
is conducted during scale construction and those items with poor internal consistency are dropped from the final scale
version. However, the author may elect to retain some such items because of theoretical or clinical appeal. Ultimately,
they decrease the scale’s internal reliability. Scales measuring a unitary construct are expected to have high internal
reliability. Multifactorial scales, or those covering a wide variety of symptoms, have lower internal reliability. In
general, longer rating scales tend to have higher internal consistency than shorter scales.

Internal consistency is usually measured and reported in two ways. The most common is the Cronbach coefficient α, a
measure of the average correlations among all items. Another estimate of internal consistency is split-half reliability.
This approach correlates half of the items with the other half. These two halves can be chosen in various ways, e.g., the
first half with the last half of items, even numbers with odd numbers, or random selection of items in each group. It
should be noted that split-half reliability underestimates internal consistency because reliability is influenced by the
total number of items in a scale, and split-half reliability effectively reduces the number of items in each
correlational assessment. This underestimate may be corrected through a procedure referred to as the Spearman-Brown
formula. For either of these types of internal reliability, coefficients exceeding 0.80 suggest that the scale is
generally internally consistent. However, a coefficient of 0.80 also means that 20% of the variance in the scale’s scores
is due to random error. Thus, higher reliability coefficients are more desirable and are commonly reported for newer
scales.

Test-retest reliability, or stability, assesses whether a scale is stable over time. If the variable measured has not
changed, then a scale’s scores should be similar over administrations, and stability should be high. Test-retest
reliability is especially important when a scale is used to assess the progress of treatment. If a scale is not stable,
then it is impossible to determine whether measured change is real or represents random error in the scale. The construct
measured may affect the scale’s apparent stability. For example, a trait measure, such as self-concept or inattentiveness,
should demonstrate high stability over 1 to 2 months; while a state measure, such as loneliness or truancy, will have
lower stability due to the natural history of the construct. A correlation greater than 0.80 for two administrations of a
scale 1 to 2 weeks apart suggests adequate stability. For administrations over a month, a correlation greater than 0.70
is considered reasonable stability. The lower stability with longer intervals has been posited to represent practice
effects, reactivity, the natural history of juvenile disorders, and statistical regression.

Interrater reliability represents the agreement, or concordance, between different informants. Informants may include lay
informants, such as adults who are familiar with the youth, but interrater reliability is most relevant to
clinician-rated scales requiring an interview format. Considerable training may be needed to ensure that multiple raters
are scoring items similarly, i.e., are using the scale in a consistent manner. To assess their concordance, correlations
may be made between raters’ scores for the total scale as well as for individual items. Again, correlations greater than
0.80 are acceptable.

Parallel-forms reliability also assesses agreement between different entities, but this time between two forms of a
scale. When such parallel forms of a scale exist, such as parent and child versions or long and short versions, they
would tap the same construct and their scores should be highly correlated. Correlations greater than 0.80 suggest
adequate parallel reliability.
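
The internal-consistency and stability estimates reviewed in this Reliability section can be illustrated with a minimal sketch, given here in Python. It computes the Cronbach coefficient α in its usual variance-based computational form, an odd-even split-half correlation with the Spearman-Brown correction, and a test-retest correlation of total scores. The function names and all ratings below are hypothetical and chosen only for illustration; they are not drawn from any published scale, and a constant column of scores would make the correlation undefined.

```python
from statistics import mean, pvariance


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def cronbach_alpha(rows):
    """Cronbach alpha; rows holds one list of item scores per respondent."""
    k = len(rows[0])  # number of items in the scale
    item_vars = [pvariance([r[i] for r in rows]) for i in range(k)]
    total_var = pvariance([sum(r) for r in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def split_half_r(rows):
    """Correlate summed odd-position items with summed even-position items."""
    odd = [sum(r[0::2]) for r in rows]
    even = [sum(r[1::2]) for r in rows]
    return pearson_r(odd, even)


def spearman_brown(r_half, factor=2):
    """Project a half-length correlation to a scale `factor` times as long."""
    return factor * r_half / (1 + (factor - 1) * r_half)


if __name__ == "__main__":
    # Hypothetical data: 6 youths rating 4 items on a 0-3 scale.
    ratings = [
        [0, 1, 0, 1],
        [1, 1, 2, 1],
        [2, 2, 1, 2],
        [3, 2, 3, 3],
        [1, 0, 1, 1],
        [2, 3, 2, 2],
    ]
    # Hypothetical total scores for the same youths 2 weeks later.
    retest_totals = [3, 5, 6, 11, 2, 9]

    alpha = cronbach_alpha(ratings)
    r_half = split_half_r(ratings)
    r_full = spearman_brown(r_half)
    stability = pearson_r([sum(r) for r in ratings], retest_totals)

    print(f"Cronbach alpha:            {alpha:.2f}")
    print(f"Split-half r (uncorrected): {r_half:.2f}")
    print(f"Spearman-Brown corrected:   {r_full:.2f}")
    print(f"Test-retest r:              {stability:.2f}")
```

Under these assumptions, each estimate can be read against the benchmarks cited in the text: 0.80 for internal consistency and short-interval stability, and 0.70 for stability over a month. The Spearman-Brown step shows why the uncorrected split-half value understates consistency: it describes a scale only half as long as the one actually administered.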

Validity

Validity pertains to whether the scale accurately assesses what it was designed to assess (Corcoran and Fischer, 2000c; Piacentini, 1993; Sackett et al., 1991). This is a serious concern for scales measuring juvenile psychopathology due to the questionable validity of childhood diagnoses, changing diagnostic criteria, and the natural history of the course of juvenile disorders. Validity must be established against multiple criteria, and it generally requires several years to assess accurately. Thus newer scales do not have optimal validity data available for the potential user to review. Even older scales with wide applicability may not have had their validity reevaluated since the initial preliminary assessment. Caution is warranted in considering a scale’s validity. There are three major types of validity: content, criterion, and construct validity.

Content validity assesses whether the scale’s items represent the entity being measured. Adequate content validity is often ensured by deriving items from the diagnostic criteria or clinical correlates of the disorder of interest or by careful examination of youths with the disorder. There are two basic approaches to content validity: face validity and logical content validity. Face validity asks whether the items appear on the surface to tap the content. It is determined by simple examination of the items and subjectively judging whether they appear to be measuring the content area. Logical content validity is more systematic. It refers to the procedure the scale’s developer used to evaluate the content of the items, whether they cover the entire content domain, and whether the items are representative of all content areas that should be included. However, this information is not always available.

Criterion validity offers greater depth than content validity. It is empirically based, assessed in relation to other scales with established validity measuring the same construct. Correlations with these established scales provide greater confidence that the scale is measuring what it is supposed to measure. There are two types of criterion validity: predictive validity and concurrent validity. Predictive validity asks whether the scale is correlated with some event that will occur in the future, e.g., a scale measuring strong parental support may predict early hospital discharge. Concurrent validity refers to a scale’s correlation with an event that is assessed at the same time the scale is administered; for example, a scale of psychosis should correlate with parents’ concerns about odd behaviors, or a conduct disorder scale should correlate with incarceration. Two types of concurrent validity are convergent validity and discriminant validity. Convergent validity is the extent to which the scale correlates with some theoretically relevant variable with which it should correlate. For example, a depression scale should correlate with a decrease in a youth’s social activities. Discriminant validity, sometimes termed known-groups validity, compares a scale’s scores for a group that is known to have the problem with a group that is known not to have the problem. If the scale is valid, then these two groups should have different scores. For example, scores on a drug abuse scale should differ for youths in drug rehabilitation and those who abstain from drugs.

Construct validity examines whether the scale taps a particular theoretical construct. To consider a scale as having construct validity, it should be shown to have discriminant and convergent validity. Thus construct validity shows that the scale converges with and diverges from other appropriate variables.

Factorial validity is another approach to construct validity as it examines a scale’s convergent and discriminant validity using a statistical procedure known as factor analysis. This derives groups of variables that measure separate aspects of the problem, termed factors. If variables are similar, they correlate with the same factor, demonstrating convergent validity. Variables not associated with a particular factor suggest discriminant validity. Alternatively, factorial validity is determined by assessing whether individual items correlate with the scale’s total score and do not correlate with unrelated variables. Factorial validity is complicated and often not available. However, scales with reported subscales are often described as multifactorial because these subscales have been determined through factor analysis to demonstrate some degree of discrimination from one another, but good correlation with the overall score.

During scale construction reliability is determined first. The scale’s ability to perform similarly each time must be ensured in order to assess its ability to measure a construct. In other words, random error must be minimized in order to detect any systematic error in the scale. Thus, to be valid, a scale must be somewhat reliable. The converse does not apply. Thus, if only reliability is reported, caution is warranted regarding validity. In addition, because validity is determined over years as the scale is applied to various groups, newer scales generally do not have sufficient validity data. Surprisingly, some scales with long track records in child and adolescent psychiatry may not have established validity. Also, no scale is completely reliable and valid. Users will have to settle for some error in measurement and be judicious in interpreting scores. Finally, users
should be aware that authors often use the various terms relating to reliability and validity inconsistently.

Normative Data

Normative data for a scale are important but often not adequately addressed. Normative data provide information on the representativeness of a scale’s functioning. Normative data should be representative of the current population and should be stratified on relevant variables that show differences in scores, usually age and gender, but often also ethnicity and geography. Separate norms for these variables should then be provided. Normative values are affected by base rates of the construct measured, as well as its distribution in the population. For many clinical disorders and symptoms, self-report and other-report measures are not normally distributed, but are skewed. This can affect the determination of standard score norms developed from raw scores. Although this problem is not generally addressed, it forms one basis for the aforementioned caution about cutoff scores.

SELECTING A RATING SCALE

In choosing the best scale, several factors warrant consideration (Corcoran and Fischer, 2000d; Riegelman and Hirsch, 1989). Stability refers to the extent to which a scale performs equally over repeated administrations. When considered from a psychometric perspective, this is called test-retest reliability. However, in relation to scale selection, it can also refer to practical aspects, such as how well the youth tolerates the scale with repeated administrations. Utility represents the practical advantages the scale offers and is influenced by how helpful the information will be and by ease of use. For example, scales routinely administered for entry to a clinic will likely demonstrate low utility, inasmuch as they are not providing helpful information on a specific issue. Also, if there is an easier way to obtain desired information, then a rating scale will have low utility. Suitability is an estimate of the scale’s appropriateness to the youth’s abilities. A scale developed for a teenager will not be suitable for a third grader with a learning disability. Sensitivity estimates a scale’s ability to detect change due to an intervention, rather than due to other factors such as variability in the scale itself. This is similar to the sensitivity described in the psychometrics section, since in both cases sensitivity refers to the scale’s ability to detect something “real.” In the case of psychometric properties, it is the percentage of “real” cases the scale detects, whereas in the case of scale selection, it is the amount of “real” change due to the intervention, not due to scale instability or error. By contrast, reactivity refers to how the act of measuring something may change it. This may be desirable for some interventions in which measurement is intended to induce behavioral change, such as dieting or smoking cessation. It is not desirable with rating scales intended simply to monitor change from the treatment. However, with repeated administrations of the scale, youths can figure out what is expected and may respond with a bias toward pleasing the examiner. Directness refers to how the score reflects the youth’s actual behavior, thoughts, or feelings. Behavioral observations are relatively direct, whereas projective tests are indirect, or symbolic. Direct measures best ensure reliability and validity. Finally, appropriateness refers to how compatible a scale is with the desired evaluation. The most appropriate scales are valid, stable, and sensitive; measure the problem in a direct and nonreactive manner; have utility; and are suitable.

When selecting a scale, first address whether the scale will be used for research or clinical purposes, as utility will vary for investigators and clinicians who may have different measurement goals. Define the problem in concrete, observable, and measurable terms. For example, if the scale is to be used to follow response to cognitive-behavioral therapy for OCD, scales may be needed to measure a reduction in overall OCD severity, as well as a change in time spent checking, possibly global functioning, or even days tardy to school. Each focus for measurement assesses a different domain and will require a different scale or subscale. Next, determine whether the youth will be a reliable and valid reporter, whether other reporters will also be needed, and if so who would be the most accurate reporter. When possible, use more than one informant, usually a parent and a teacher, to obtain various perspectives of the youth’s problem. Next, determine whether the youth’s behavior should be measured in a specific setting or in multiple settings.

Only after deciding these aspects of the youth and the problem should potentially relevant scales be reviewed. These scales should be assessed regarding their suitability for the youth’s age, reading level, and ability to complete the scale. If the scale is suitable, review it for content, determining whether each item is descriptive of the problem to be assessed. This will provide familiarity with the scale, as well as estimate whether the scale might tap subtleties of the problem. Next, address the purpose of measurement in relation to psychometric properties. For example, screening, severity of a diagnostic construct, or treatment monitoring all require strong overall psychometric properties, but each also has its own special needs. A screening scale must have high sensitivity. Measuring the intensity of a construct requires high construct validity and an appropriate response format. Monitoring treatment effects requires high test-retest reliability, sensitivity, nonreactivity, and appropriate response format.

Even when a scale has excellent psychometric properties, it is best to use more than one scale whenever possible. Different scales will tap somewhat different aspects of the problem and will provide a more robust overall assessment, perhaps closer to “the truth.” These scales should demonstrate at least fair concurrent validity (>0.6), although not so high (>0.95) that they are measuring exactly the same aspects of the construct. Finally, be sure that the response format is appropriate to the purpose of the measurement. In general, a Likert-type format with multiple responses is best because it will have greater power and reliability. However, children may not be able to understand expanded formats and may need only three options or even a dichotomous yes/no format. Overall, matching the scale to the problem and youth is crucial to obtaining helpful information.

CONCLUSIONS

Rating scales assist in our evaluation and treatment of children and adolescents. They enrich our understanding of youths’ psychopathology and quantify their problems for comparison across individuals, situations, and time. Their roles are adjunctive to careful psychiatric diagnosis and include screening, monitoring, and outcome assessment. Among their many advantages is increased accountability in clinical practice.

To obtain useful information, the potential user must understand a scale’s functioning in relation to the proposed application. This includes setting realistic goals for the scale, conceptualizing the goals in measurable terms, ensuring a developmental match between the youth and a specific scale, and choosing a scale with the best psychometric properties, utility, and appropriateness. Using multiple scales can circumvent the limitations of any single scale and can provide an estimation of the construct measured in its various aspects. Similarly, the use of multiple informants can help to account for contextual issues.

The most important feature of the material presented in this first article in this series is that potential users of rating scales should not simply assume that a particular scale, even a widely used scale, will fulfill their measurement needs. Otherwise invalid, and potentially damaging, results may ensue. Rather, users can relatively easily educate themselves regarding the major factors in successful measurement of child and adolescent psychopathology.

In subsequent articles in this series, we will review individual scales relevant to specific diagnoses, such as OCD, ADHD, and major depression, or specific problems, such as suicidality, self-esteem, and aggression. The aforementioned principles will be incorporated into the discussion of the utility of the individual scales to aid the reader in selecting the most appropriate scale for an application.

REFERENCES

Aman MG (1993), Monitoring and measuring drug effects, II: behavioral, emotional, and cognitive effects. In: Practitioner’s Guide to Psychoactive Drugs for Children and Adolescents, Werry JR, Aman MG, eds. New York: Plenum, pp 99–159
Ambrosini PJ, Wagner KD, Biederman J et al. (1999), Multicenter open-label sertraline study in adolescent outpatients with major depression. J Am Acad Child Adolesc Psychiatry 38:566–572
Angold A, Weissman MM, John K (1987), Parent and child reports of depressive symptoms in children at low and high risk of depression. J Child Psychol Psychiatry 28:901–915
*Corcoran K, Fischer J (2000a), Measures for Clinical Practice: A Sourcebook, 3rd ed, Vol I. New York: Free Press, pp 3–10
Corcoran K, Fischer J (2000b), Measures for Clinical Practice: A Sourcebook, 3rd ed, Vol I. New York: Free Press, pp 11–26
Corcoran K, Fischer J (2000c), Measures for Clinical Practice: A Sourcebook, 3rd ed, Vol I. New York: Free Press, pp 43–48
Corcoran K, Fischer J (2000d), Measures for Clinical Practice: A Sourcebook, 3rd ed, Vol I. New York: Free Press, pp 49–62
Emslie GJ, Rush AJ, Weinberg WA et al. (1997), A double-blind, randomized, placebo-controlled trial of fluoxetine in children and adolescents with depression. Arch Gen Psychiatry 54:1031–1037
*Fergusson DM, Lynskey MT, Horwood LJ (1993), The effect of maternal depression on maternal ratings of child behavior. J Abnorm Child Psychol 21:245–269
Herjanic B, Reich W (1982), Development of a structured psychiatric interview for children: agreement between child and parent on individual symptoms. J Abnorm Child Psychol 10:307–324
*Ines TM, Sacco WP (1992), Factors related to correspondence between teacher ratings of elementary student depression and student self-ratings. J Consult Clin Psychol 60:140–142
Moretti M, Fine S, Haley G, Marriage K (1985), Childhood and adolescent depression: child-report versus parent-report information. J Am Acad Child Psychiatry 24:298–302
*Piacentini J (1993), Checklists and rating scales. In: Handbook of Child and Adolescent Assessment, Vol 167, Ollendick TH, Hersen M, eds. Boston: Allyn & Bacon, pp 82–97
*Renouf AG, Kovacs M (1994), Concordance between mothers’ reports and children’s self-reports of depressive symptoms: a longitudinal study. J Am Acad Child Adolesc Psychiatry 33:208–216
Reynolds WM (1993), Self-report methodology. In: Handbook of Child and Adolescent Assessment, Vol 167, Ollendick TH, Hersen M, eds. Boston: Allyn & Bacon, pp 98–123
*Riegelman RK, Hirsch RP (1989), Studying a Study and Testing a Test: How to Read the Medical Literature. Boston: Little, Brown, pp 127–174
Robson PJ (1988), Self-esteem: a psychiatric view. Br J Psychiatry 153:6–15
Sackett DL, Haynes RB, Guyatt GH, Tugwell P (1991), Clinical Epidemiology: A Basic Science for Clinical Medicine, 2nd ed. Boston: Little, Brown
Weissman MM, Wickramaratne P, Warner V (1987), Assessing psychiatric disorders in children: discrepancies between mothers’ and children’s reports. Arch Gen Psychiatry 44:747–753
Welner Z, Reich W, Herjanic B, Jung KG, Amado H (1987), Reliability, validity, and parent child agreement studies of the Diagnostic Interview for Children and Adolescents (DICA). J Am Acad Child Adolesc Psychiatry 26:649–653
*Yule W (1993), Developmental considerations in child assessment. In: Handbook of Child and Adolescent Assessment, Vol 167, Ollendick TH, Hersen M, eds. Boston: Allyn & Bacon, pp 15–25
