
Pain 117 (2005) 314–325

www.elsevier.com/locate/pain

A scale for rating the quality of psychological trials for pain


Shona L. Yates a, Stephen Morley a,*, Christopher Eccleston b, Amanda C. de C. Williams c

a Academic Unit of Psychiatry and Behavioural Sciences, University of Leeds, 15 Hyde Terrace, Leeds LS2 9JT, UK
b Pain Management Unit, University of Bath, Bath, UK
c Department of Psychology, University College London, London, UK
Received 1 February 2005; received in revised form 19 April 2005; accepted 20 June 2005

Abstract
This paper reports the development of a scale for assessing the quality of reports of randomised controlled trials of psychological treatments. The Delphi method was used, in which a panel of experts (15 in the first round, reducing to 12) generated statements relating to treatment and design components of trials. After three rounds, statements with high consensus agreement were reviewed by a second expert panel and rewritten as a scale. Evidence to support the reliability and validity of the scale is reported. Three expert and five novice raters assessed sets of 31 and 25 published trials to establish scale reliability (ICC ranges from 0.91 to 0.41 for experts and novices, respectively) and item reliability (Kappa and inter-rater agreement). The total scale score discriminated between trials globally judged as good and poor by experts, and trial quality was shown to be a function of year of publication. Uses for the scale are suggested.
© 2005 Published by Elsevier B.V. on behalf of the International Association for the Study of Pain.

1. Introduction

It is widely agreed that interpretation of the results of a randomised controlled trial (RCT) should be informed by the quality of the trial: the better the quality, the greater the confidence one may have in the validity and utility of the results. There are several guidelines to aid the critical appraisal of reports of RCTs, e.g. (Davidson et al., 2003), and other authors have developed scales by which the quality of a study may be quantified (Chalmers et al., 1981; Downs and Black, 1998; Harbour and Miller, 2001; Jadad et al., 1996; Sindhu et al., 1997). Quantification can be used to inform the conduct and analysis of systematic reviews and meta-analyses, either by setting a cut-off score to determine the exclusion of trials that do not meet a pre-defined criterion or by examining the influence of quality parameters on standardised trial outcomes.

For many purposes judgment of quality is indexed by methods used to control bias. Quality is a multi-dimensional construct and most current scales have been constructed around features of the internal validity of trials to identify potential sources of bias. To this end most scales assessing quality have focused on the design features of trials. Table 1 summarises published scales identified in a literature search.¹ These scales were mostly designed for medical trials in which pharmacological treatments can be delivered in a double blind manner. Furthermore, the major aspect of the quality of treatment delivered in medical trials (the drug) is controlled via manufacturing quality control processes, although the context in which therapy may be delivered within trials may vary considerably. In contrast, delivering psychological treatments in controlled trials poses a number of other problems (Schwartz et al., 1997; Waltz et al., 1993). For example, it is improbable that delivery can be double blind, as skilled therapists will know what they are delivering and participants will also be aware of treatment content. Other steps must, therefore, be taken to ensure equivalence between the treatment arms on potentially confounding variables, e.g. expectation of improvement. Furthermore, treatment integrity needs to be maintained throughout the study as the treatment is essentially

* Corresponding author. Tel.: +44 113 343 2733; fax: +44 113 243 3719. E-mail address: s.j.morley@leeds.ac.uk (S. Morley).
¹ The scales in Table 1 were retrieved by a systematic search of the literature in January 2003.

0304-3959/$20.00 © 2005 Published by Elsevier B.V. on behalf of the International Association for the Study of Pain.
doi:10.1016/j.pain.2005.06.018
Table 1
Content analysis of published quality scales

| # | Author (date) | RCTs | No. items | Response options (a) | Scale development | Reliability | Randomisation | Blinding | Outcomes | Statistical analysis | Scoring guide | Treatment description | Protocol adherence | Treatment delivery | Treatment dosage |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Chalmers (1981) | General | 27 | Y, N, P, variable | Arbitrary | – | + | + | – | + | + | + | – | – | – |
| 2 | Cho (1994) | General | 31 | Y, N, P | Revised measure | + | + | + | + | + | + | – | – | – | – |
| 3 | de Vet (1997) | General | 15 | Y, N, P, weighting | Arbitrary | – | + | + | + | + | – | + | – | – | + |
| 4 | Detsky (1992) | General | 5 | Y, N, P | Revised measure | – | + | + | + | + | – | + | – | – | – |
| 5 | Downs (1998) | General | 27 | Y, N | Revised pilot | + | + | + | + | + | + | + | – | – | – |
| 6 | Evans (1985) | General/surgery | 33 | Y, N | Arbitrary | – | + | + | + | + | + | + | – | – | – |
| 7 | Goodman (b) (1994) | General | 34 | 5 point Likert | Arbitrary | + | – | + | + | + | + | – | – | – | – |
| 8 | Gotzsche (1989) | Drug trials | 16 | Not stated | Arbitrary | – | + | + | + | + | – | – | – | – | + |
| 9 | Harbour (2001) | General | 23 | Written, variable | Revised measure | – | + | + | + | + | + | – | – | – | – |
| 10 | Huwiler-Muntener (c) (2002) | General | 3 | Y, N, P | Arbitrary | + | – | + | – | + | + | – | – | – | – |
| 11 | Jadad (1996) | Pain group | 3 | Y, N | Nominal | + | + | + | – | – | + | – | – | – | – |
| 12 | Kleijnen (1991) | General | 7 | Y, N, P | Arbitrary | – | + | + | – | + | + | + | – | – | + |
| 13 | Liberati (1986) | Cancer | 24 | Y, N, P, variable | Revised measure | – | + | + | – | + | + | + | + | – | – |
| 14 | Reisch (1989) | General | 13 | Y, N, U | Arbitrary | – | + | + | + | + | + | + | – | – | + |
| 15 | Sindhu (1997) | General/Pain | 15 | Weighted points | Delphi | + | + | + | + | + | + | – | + | – | – |
| 16 | Turlik (2000) | General | 14 | 5 point Likert | Arbitrary | – | + | + | – | + | – | – | – | – | – |
| 17 | Van der Heijden (1996) | Drug trials | 15 | Weighted points | Arbitrary | – | + | + | + | + | + | + | – | – | – |
| 18 | van Tulder (1997) | Pain | 17 | Weighted points | Arbitrary | – | + | + | + | + | – | + | + | – | – |
| Total number of scales with items for each aspect of quality | | | | | | 6 | 16 | 18 | 12 | 17 | 13 | 10 | 3 | 0 | 4 |

+, included in scale; –, not included in scale. In the original, the columns were grouped under three headings: general items, methodological quality and treatment quality.
(a) Y, yes; N, no; P, partial definition; U, unknown.
(b) Scale measured quality of report NOT methodological quality.
(c) Also had scale (25 items) assessing quality of report.
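The totals row of Table 1 can be reproduced by tallying the inclusion marks. The sketch below transcribes each scale's row as a string of digits (1 = item included, 0 = not included), in column order from reliability through treatment dosage; it is purely an illustration of the content analysis, not part of the original study.

```python
# Inclusion patterns transcribed from Table 1, one digit per column in the
# order: reliability, randomisation, blinding, outcomes, statistical
# analysis, scoring guide, treatment description, protocol adherence,
# treatment delivery, treatment dosage (1 = included, 0 = not included).
SCALES = {
    "Chalmers 1981":         "0110111000",
    "Cho 1994":              "1111110000",
    "de Vet 1997":           "0111101001",
    "Detsky 1992":           "0111101000",
    "Downs 1998":            "1111111000",
    "Evans 1985":            "0111111000",
    "Goodman 1994":          "1011110000",
    "Gotzsche 1989":         "0111100001",
    "Harbour 2001":          "0111110000",
    "Huwiler-Muntener 2002": "1010110000",
    "Jadad 1996":            "1110010000",
    "Kleijnen 1991":         "0110111001",
    "Liberati 1986":         "0110111100",
    "Reisch 1989":           "0111111001",
    "Sindhu 1997":           "1111110100",
    "Turlik 2000":           "0110100000",
    "Van der Heijden 1996":  "0111111000",
    "van Tulder 1997":       "0111101100",
}

# Count, for each aspect of quality, how many of the 18 scales include it.
totals = [sum(int(pattern[i]) for pattern in SCALES.values())
          for i in range(10)]
print(totals)  # [6, 16, 18, 12, 17, 13, 10, 3, 0, 4]
```

The printed list matches the totals row of Table 1, including the striking zero for treatment delivery, the gap the authors set out to fill.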

‘manufactured’ by the therapist at each session. Trialists must, therefore, ensure that therapists deliver only the prescribed treatment components at an acceptable level of competence. A content analysis of current quality scales (Table 1) clearly identifies a lacuna in the scales around the issue of treatment implementation. Psychosocial trials may be unduly penalised because of the problem of double blinding (Guzman et al., 2001), whereas other, potentially compensatory, methodological refinements in the studies may be overlooked.

The purpose of the current study was to develop a scale to assess the quality of trials of psychological treatments that could be used both to assess individual trials (a critical appraisal tool) and to provide quantification of quality for use in meta-analytic studies. The present study reports the development of a scale using Delphi methodology (Jones and Hunter, 1999; Linstone and Turoff, 2002) to develop a consensus from a panel of experts, to ensure that the items generated were not merely a function of the small team of individuals represented by the authors.

2. Methods

The study comprised several phases: (1) generation of a pool of statements by a Delphi panel; (2) a panel of experts to write the scale items; (3) use of the scale by expert and novice raters to establish reliability; (4) assessment of the scale's validity; (5) a preliminary analysis of the influence of trial quality on the magnitude of effect sizes. Two sets of published RCTs were used in the reliability and validity phases: one from which data (effect sizes) were already available, and an additional sample of six trials established through a newly written search strategy. To aid the reader, an overview of the phases is represented diagrammatically in Fig. 1.

2.1. Delphi panel

2.1.1. Recruitment of panel
We identified possible participants for the Delphi panel if they met two of the following criteria: (1) previous involvement in a published randomised controlled trial of a psychological treatment for chronic pain; (2) two or more published articles on psychological treatment for chronic pain; (3) two or more conference presentations on the same subject; or (4) possession of a professional qualification in a relevant discipline, e.g. clinical psychology, statistics.

A list of 62 eligible candidates from Australia, Europe and North America was compiled from several sources. The second author identified approximately 25 eligible participants through his own knowledge of the field. Further candidates for the survey panel were identified as the authors of the randomised controlled trials included in the meta-analysis by Morley et al. (1999) and from a search of the electronic databases Medline, Embase and PsycINFO, using a search strategy developed to identify randomised controlled trials of psychological therapy for chronic pain published subsequent to the trials included in the Morley et al. (1999) article.

Electronic mail addresses for 44 of those eligible were obtained from published articles, the World Wide Web and personal knowledge: 21 of the experts were located within Europe, 18 in North America and 5 in Australia and New Zealand. The experts were individually contacted via e-mail and invited to take part in the study. Anonymity between participants was maintained throughout the study.

2.1.2. Development of consensus agreed statements
The Delphi survey was conducted over three rounds. In the first round, participants were invited to contribute as many ideas as they wished in response to two open-ended questions regarding quality in research trials: "What factors do you consider are important for assessing the quality of treatments used in randomised controlled trials of cognitive behaviour therapy and behaviour therapy for chronic pain?" and "What factors do you consider are important for assessing the methodological quality of randomised controlled trials of cognitive behaviour therapy and behaviour therapy for chronic pain?"

In the second round, the responses obtained in round one were collated and grouped together under a number of semantically related headings by the first author in discussion with the second author. The statements were organised into two categories: those relating to treatment quality and those relating to the quality of the design and methods of a trial. Participants were asked to consider which of the statements would be essential to include in a rating scale designed to measure the quality of randomised controlled trials of cognitive behaviour therapy for chronic pain. For each statement, the degree of necessity for its inclusion in a final quality rating scale was rated on a seven-point Likert scale (one point = completely unnecessary, to seven points = completely necessary).

The median score (representing the group level of agreement) and the inter-quartile range (indicating the degree of consensus) for each statement were computed. This information was then incorporated into the round three questionnaire, with the addition of the participant's own ratings for each statement as a reminder. Thus, separate round three questionnaires were developed for each participant. The participants reviewed and re-rated the statements in the light of the new information about the opinion of the group as a whole. A list of statements which achieved consensus agreement was prepared by the first author. Consensus for inclusion was pre-defined as a median rating of six or above and an inter-quartile range (IQR) of 1.5 or less.

2.2. Expert panel

The expert panel comprised three of the authors (SM, CE and AW). Their credentials as experts were that they had previously conducted systematic reviews and meta-analyses of psychological treatments for chronic pain for both adult (Morley et al., 1999) and child and adolescent (Eccleston et al., 2002) populations, and for irritable bowel symptoms (Lackner et al., 2004). One (AW) had also conducted an RCT (Williams et al., 1996) and all were thoroughly familiar with the field. The panel was presented with the output of the third round from the Delphi panel to consider prior to meeting face-to-face for one day. The panel meeting was chaired by the first author (SY). The main task of the panel was to draft a quality scale from the output generated by the Delphi panel.
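The consensus rule described in Section 2.1.2 (median rating of six or above and an inter-quartile range of 1.5 or less) reduces to a short computation per statement. A minimal sketch; note that quartile conventions differ between statistical packages, so the exact IQR may vary slightly from the one the authors obtained.

```python
import statistics

def reaches_consensus(ratings, median_cutoff=6, iqr_cutoff=1.5):
    """Apply the study's Delphi retention rule to one statement's ratings.

    A statement is retained when the panel's median rating is at least
    `median_cutoff` (judged necessary) and the inter-quartile range is at
    most `iqr_cutoff` (indicating agreement amongst the panel).
    """
    q1, _, q3 = statistics.quantiles(ratings, n=4)  # exclusive quartiles
    return statistics.median(ratings) >= median_cutoff and (q3 - q1) <= iqr_cutoff

# Unanimous 'completely necessary' ratings clearly pass...
print(reaches_consensus([7] * 12))               # True
# ...while widely dispersed ratings fail the IQR criterion.
print(reaches_consensus([1, 2, 3, 4, 5, 6, 7]))  # False
```

The same function could be run over all 150 round-three statements to regenerate the consensus pools summarised in Table 2.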

Fig. 1. Diagrammatic representation of the sequence of tasks in the study.

The expert panel aimed to produce a new scale of a reasonable length that would be a practical tool.² Definitions for each item were drafted and a response scale for each item prepared. The draft scale was then circulated between members of the expert panel for further comment and editing.

² The coding sheets developed in a previous study (Morley et al., 1999) were found to be too exhaustive for practical use.

2.3. Reliability

Three experienced raters (authors: SY, SM and CE) rated the 25 trials included in the meta-analysis reported by Morley et al. (1999, Table 3) and an additional six trials randomly selected from those published subsequent to the 1999 study (Basler et al., 1997; Ersek et al., 2003; Johansson et al., 1998; Marhold et al., 2001; Sharpe et al., 2001; Thieme et al., 2003). Two of the raters (SM and CE) were familiar with the first 25 trials, whereas one rater (SY) was not familiar with any of the trials. Inter-rater reliability was computed for the total scale score and the two sub-scale scores (intra-class correlation, ICC) and by item (Kappa and agreement ratio). A further test of reliability for the total scale and
subscale scores was obtained by recruiting five novice raters who were unfamiliar with the scale and did not have detailed and extensive knowledge of the trials. These five raters were all psychologists, four of them with some experience in providing cognitive behaviour therapy for chronic pain. They were given one brief training session (approximately 1.5 h) by the first author. Each rater rated 10 of the 25 trials from the Morley et al. (1999) article in a balanced order so that each trial was rated by two raters. A set of intra-class correlations was computed for each of the five pairs of raters.

2.4. Validity

We sought to establish validity using two methods. First, we followed the method reported by Jadad et al. (1996), where published articles were initially allocated to three grades by raters with knowledge of the field. In this study, the second and fourth authors were presented with the abstracts of the 25 trials analysed by Morley et al. (1999). They categorised each study as high, medium or low quality. The two judges were familiar with the contents and methodology of the trials. (These broad category judgments were made before the second author reread the articles in depth to code them for the reliability study.) The validity of the quality scale was assessed by testing whether the quality scale discriminated between these categories. The first author's ratings were used for this test, and these had been completed prior to, and independently from, the expert rankings of the studies produced by the two judges. As a second test of validity we assumed that the quality of trials might be expected to improve over time. We, therefore, regressed the total quality score onto the year of publication.

2.5. Quality and outcome

In the final set of analyses, we conducted exploratory analyses of the influence of trial quality on outcome. We examined the relationship of the total score and the two sub-scale scores (treatment and design) to outcome by regressing the scores onto a measure of outcome for the 31 trials in the data set. The major issue to consider here was the selection of the outcome measure, because the trials have multiple treatments (trial arms) and multiple outcome measures, and there was no single measure that could be regarded as the 'primary endpoint' that was also common across trials. It is probable that most trialists with a cognitive-behavioural allegiance expect that treatment should have a broad impact across a range of measures. We, therefore, aggregated the effect sizes (ES) across outcomes within trials. To take into account the fact that the outcomes are not independent, we used the algorithm devised by Wampold et al. (1997) to compute the mean ES across measures within trial arms, assuming that the average inter-correlation between measures was 0.5. The selection of 0.5 as the average inter-correlation was a 'guesstimate' and followed the precedent set by Wampold et al. (1997). The aggregated ES using this method is monotonically related to the estimated inter-correlation, thus preserving the order of trials across the range of possible correlations. As the focus of the analysis was the relationship between quality and the relative differences of ESs across trials, the exact estimate of the aggregated ES is of secondary interest.

3. Results

3.1. Delphi panel

Emails were sent to 44 of the 62 individuals eligible for inclusion in the Delphi survey. Fifteen experts responded to the invitation and completed round 1, and 12 also completed rounds 2 and 3. Reasons for the attrition of experts are given in Fig. 1. No further participants were sought, as it has been suggested that between 12 and 20 participants is an optimal size for a Delphi study (Henry et al., 1987).

Table 2 shows a summary of the statements generated and retained over the three rounds. For ease of summary, the statements have been aggregated into recurrent themes. (A full list of all items at each stage of development is available from the authors on request.) In round 1, the panel generated a total of 234 statements, each person generating on average 15.13 statements (SD = 3.33). Removal of duplicate statements resulted in 150 statements that were equally distributed between the two main categories: treatment, and design and methods. Consensus was defined as a median rating of six or more (indicating necessity of inclusion) and an inter-quartile range (IQR) of 1.5 or less (indicating agreement amongst the Delphi panel).

Table 2
Summary of statements generated for each theme at each round

| Theme | Round 1 (no. of items) | Round 2 consensus | Round 3 consensus |
|---|---|---|---|
| Treatment quality | | | |
| Manualisation | 6 | 2 (33) | 3 (50) |
| Client characteristics | 7 | 3 (43) | 4 (57) |
| Client adherence | 3 | 1 (33) | 2 (66) |
| Client perception | 5 | 2 (40) | 2 (40) |
| Therapist training | 6 | 2 (33) | 4 (66) |
| Therapist characteristics | 7 | – | 1 (14) |
| Therapist competence | 3 | 1 (33) | 1 (33) |
| Treatment adherence | 7 | 2 (28) | 2 (28) |
| Treatment duration | 4 | 1 (25) | 1 (25) |
| Therapy outcome | 7 | 4 (57) | 6 (86) |
| Therapy content | 20 | 2 (10) | 3 (15) |
| Therapy setting | 3 | 2 (66) | 2 (66) |
| Relevance of methods | 1 | – | 1 (100) |
| Subtotals | 75 | 22 | 32 |
| Design and method quality | | | |
| Sample size | 4 | 3 (75) | 3 (75) |
| Blinding | 4 | – | 1 (25) |
| Treatment adherence | 5 | 1 (20) | 2 (40) |
| Sample characteristics | 9 | 5 (55) | 8 (88) |
| Manual | 2 | 2 (100) | 2 (100) |
| Therapist training | 2 | – | 1 (50) |
| Attrition | 5 | 1 (20) | 5 (100) |
| Follow-up assessment | 3 | 2 (66) | 3 (100) |
| Outcomes | 12 | 5 (42) | 8 (66) |
| Randomisation | 4 | 2 (50) | 4 (100) |
| Therapist interests | 1 | – | 1 (100) |
| Statistical analysis | 7 | 4 (57) | 4 (57) |
| Hypotheses | 3 | 3 (100) | 3 (100) |
| Control groups | 13 | 4 (31) | 7 (54) |
| Pilot study | 1 | – | – |
| Subtotals | 75 | 32 | 52 |

The numbers in parentheses for rounds 2 and 3 are the percentage of items retained from the first round.
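The effect-size aggregation described in Section 2.5 is not spelled out in the text. A common formula for the effect size of a composite of m equally weighted, standardized outcomes sharing an assumed inter-correlation r gives the flavour of such an aggregation; this is a sketch of the general technique, not necessarily the exact Wampold et al. (1997) algorithm.

```python
import math

def composite_effect_size(arm_effect_sizes, r=0.5):
    """Aggregate dependent effect sizes from one trial arm.

    Treats the m outcome measures as a single composite: the mean ES is
    rescaled by sqrt(m / (1 + (m - 1) * r)), where r is the assumed
    average inter-correlation between measures (0.5 in the paper, a
    'guesstimate' following Wampold et al., 1997). As r rises, the
    composite ES falls monotonically, which preserves the ordering of
    trials whatever value of r is chosen.
    """
    m = len(arm_effect_sizes)
    mean_es = sum(arm_effect_sizes) / m
    return mean_es * math.sqrt(m / (1 + (m - 1) * r))

# With perfectly correlated outcomes the composite equals the mean ES;
# with the paper's r = 0.5 the same data yield a somewhat larger value.
print(round(composite_effect_size([0.4, 0.6], r=1.0), 6))  # 0.5
print(composite_effect_size([0.4, 0.6], r=0.5))
```

The monotone dependence on r is the property the authors rely on: the relative ordering of trials, which drives the quality-outcome regressions, is insensitive to the guessed inter-correlation.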
After round two, 22 of the 75 statements in the Treatment Quality section of the questionnaire (supplementary Appendix 1, online only) achieved consensus; 30 statements had median ratings in the middle of the range (median = 3–5), of which only one also obtained consensus. Only one statement was judged by the panel to be unnecessary: 'duration of therapy at least three months'. In the Design and Method section, 32 statements achieved consensus; 11 statements had median ratings between 3 and 5 and there was a consensus level of agreement for only one of these items. None of the statements in this section of the questionnaire were considered unnecessary by the Delphi panel (i.e. median ≤ 2).

Table 2 shows the number of statements in each category gaining consensus for rounds 2 and 3. Eight treatment quality statements obtained absolute agreement (median = 7, IQR = 0) and three had stable absolute agreement across rounds 2 and 3. Twenty-six statements relating to methodological quality obtained absolute agreement, of which 14 had absolute agreement across rounds 2 and 3. After round three, 45 of the 75 statements relating to Treatment Quality were judged as necessary (median rating > 6): 32 statements achieved consensus and were included in the pool for the quality scale. For Design and Method Quality, 52 statements achieved consensus for inclusion in the statement pool for the quality scale.

3.2. Expert panel

The expert panel considered the 84 statements generated by the Delphi panel and distilled them into 13 main topics, each referring to a major theme identified by the Delphi panel. A number of topics contained two or more parts, resulting in a total of 26 items for the quality scale. For example, the topic of treatment manuals contained items referring both to the presence of a treatment manual and to whether there was evidence that therapists had adhered to the manual. The final scale comprised two sections, with six items in the section on treatment quality (supplementary Appendix 1) and 20 items in the section on design and method quality (supplementary Appendix 1). A brief coding

Table 3
The quality rating scale


guide (manual) detailing the criteria for each item and the associated scale points was also produced as a result of the panel's deliberations. The final version of the scale is shown in Table 3 and the coding guide is reproduced in supplementary Appendix 1.

3.3. Reliability

The intra-class correlation (two-way mixed effects absolute agreement model for average measures; McGraw and Wong, 1996) for three raters was 0.91 (95% CI =

Table 4
Item analysis

| Scale item | Response range | Kappa | Agreement, strict criterion (%) | Agreement, relaxed criterion (%) | Criteria met, strict (%) | Criteria met, relaxed (%) |
|---|---|---|---|---|---|---|
| Treatment quality | | | | | | |
| Treatment content | 0–2 | 0.41 | 70 | 80 | 60 | 98 |
| Treatment duration | 0–1 | 0.70 | 87 | 100 | 73 | 84 |
| Manualisation | 0–2 | 0.43 | 53 | 84 | 13 | 50 |
| Manual adherence | 0–1 | 0.69 | 80 | 100 | 27 | 64 |
| Therapist training | 0–2 | 0.60 | 63 | 96 | 13 | 66 |
| Patient engagement | 0–1 | 0.13 | 38 | 83 | 3 | 46 |
| Quality of design and methods | | | | | | |
| Sample criteria | 0–1 | 0.49 | 80 | 92 | 73 | 80 |
| Evidence criteria met | 0–1 | 0.22 | 47 | 84 | 7 | 52 |
| Attrition | 0–2 | 0.28 | 47 | 80 | 0 | 60 |
| Rates of attrition | 0–1 | 0.10 | 37 | 60 | 30 | 60 |
| Sample characteristics | 0–1 | −0.01 | 73 | 84 | 70 | 92 |
| Group equivalence | 0–1 | 0.07 | 50 | 88 | 47 | 90 |
| Randomisation | 0–2 | 0.44 | 77 | 92 | 3 | 8 |
| Allocation bias | 0–1 | 0.74 | 97 | 100 | 3 | 0 |
| Measurement bias | 0–1 | 0.73 | 83 | 100 | 13 | 20 |
| Treatment expectations | 0–1 | 0.68 | 77 | 96 | 30 | 54 |
| Justification of outcomes | 0–2 | 0.21 | 47 | 64 | 40 | 98 |
| Validity of outcomes | 0–2 | 0.36 | 47 | 88 | 37 | 90 |
| Reliability | 0–2 | 0.40 | 50 | 76 | 17 | 90 |
| Follow-up | 0–1 | 0.30 | 50 | 80 | 37 | 46 |
| Power calculation | 0–1 | 0.69 | 90 | 100 | 7 | 8 |
| Sample size | 0–1 | 0.28 | 57 | 92 | 10 | 12 |
| Data analysis | 0–1 | −0.07 | 80 | 80 | 80 | 90 |
| Statistics reporting | 0–1 | 0.48 | 93 | 96 | 93 | 90 |
| Intention to treat analysis | 0–1 | −0.01 | 97 | 100 | 0 | 0 |
| Control group | 0–2 | 0.66 | 78 | 96 | 10 | 42 |
0.76–0.96) for the full scale, 0.91 (95% CI = 0.76–0.96) for the treatment subscale, and 0.85 (95% CI = 0.70–0.93) for the design and methods subscale. The multiple-rater Kappa coefficients for each item are shown in Table 4; they ranged from 0.74 to −0.07. The median Kappa value for all items was 0.405 (IQR = 0.21–0.66). As Kappa is sensitive to the marginal totals, it is possible to have very low values of Kappa when raters agree at a high level and most of their agreement lies within one cell. We, therefore, computed agreement coefficients for each item, shown in Table 4. We used two criteria of agreement: the strict criterion was defined as complete agreement between all three raters, and the relaxed criterion was the highest agreement ratio between any rater pair. The median value for the strict agreement criterion was 72% (IQR = 50–80%), indicating a good level of agreement across most items. As expected, the relaxed criterion gave higher values of agreement: 90% (IQR = 80–96%).

Intra-class correlations (model as above) were computed for all pairings of 'novice' raters (five pairs) for the total score and the two subscale scores. The median of the average rater ICCs for the total scale score was 0.81 (ICC coefficients for pairs 1–5 were 0.89, 0.69, 0.91, 0.47, 0.81). The corresponding values were median = 0.57 for the treatment subscale (ICC coefficients for pairs 1–5 were 0.94, 0.57, 0.99, 0.50, 0.53) and median = 0.76 for the methods subscale (ICC coefficients for pairs 1–5 were 0.72, 0.76, 0.50, 0.42, 0.91).

The final two columns of Table 4 also display the percentage of the trials entered into the analysis which met the quality criterion represented in each of the items, as given by both the strict and relaxed agreement criteria. There is marked variation between items, which suggests that for this sample of trials there is significant variation in the degree to which the reports of the trials meet the quality criteria. The trials are, generally, strong in reporting treatment content, sample criteria (inclusion/exclusion) and characteristics, equivalence between groups, and details of outcomes and analysis reporting. In contrast, items relating to the controlled delivery of treatments (e.g. manuals) show only a modest attainment of the criteria, and there are clear limitations with respect to the reporting of aspects of experimental design (power calculations, intention-to-treat analyses, sample sizes, and randomisation procedures).

3.4. Validity

The two raters achieved consensus agreement that five of the 25 sample trials were 'excellent', seven were 'average', and five were 'poor' quality. The remaining eight trials, where there was no consensus agreement, were removed from the analysis to ensure a clear unambiguous criterion.
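The two per-item agreement coefficients used in Section 3.3 can be stated precisely in code: strict agreement is the proportion of trials on which all three raters gave the identical score, and relaxed agreement is the best agreement achieved by any single pair of raters. The rating data below are hypothetical, for illustration only.

```python
from itertools import combinations

def agreement_coefficients(ratings_by_rater):
    """Strict and relaxed agreement for one scale item.

    `ratings_by_rater` holds one list of item scores per rater, aligned
    by trial. Strict = proportion of trials on which ALL raters gave the
    same score; relaxed = highest proportion of matching scores over any
    single pair of raters.
    """
    n_trials = len(ratings_by_rater[0])
    per_trial = list(zip(*ratings_by_rater))  # one tuple of scores per trial
    strict = sum(len(set(scores)) == 1 for scores in per_trial) / n_trials
    relaxed = max(
        sum(a == b for a, b in zip(r1, r2)) / n_trials
        for r1, r2 in combinations(ratings_by_rater, 2)
    )
    return strict, relaxed

# Hypothetical scores from three raters on four trials:
raters = [[1, 0, 1, 1],
          [1, 0, 0, 1],
          [1, 0, 1, 0]]
print(agreement_coefficients(raters))  # (0.5, 0.75)
```

Because the relaxed criterion takes the best pair, it is always at least as large as the strict criterion, which matches the pattern across Table 4.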

Fig. 2. Scatter plot of total quality score against year of publication. The dotted line represents the fitted regression line.

Fig. 3. Scatter plot of averaged effect size within trial against total quality score. The dotted line represents the fitted regression line. The circle data point is the identified outlier.
The mean overall score for the 17 trials with expert consensus judgements of quality was 17.94, with a range of 8.5–25. The mean quality scores (SD in parentheses) for the excellent, average and poor trials were 22.7 (1.95), 18.71 (2.25) and 12.10 (3.17), respectively. Comparisons between all pairs of means were made with one-sided t-tests with predicted directional differences and alpha set at 0.01. Using this criterion, all means were significantly different from each other. Despite the small number of trials, the post hoc power for the comparisons always exceeded 83%.

A regression analysis (quality score against year of publication) included the 25 trials from Morley et al. (1999) plus the sample of six additional trials published since 1996. The mean overall quality score for the trials published prior to 1996 was 18.24 (SD 4.88) and for those published after 1996 was 21.33 (SD 4.00). There was a significant regression effect (F(1,29) = 9.52, P < 0.01, AdjR² = 0.221, b = 0.497) such that a later year of publication predicted a higher quality score, as shown in Fig. 2. The AdjR² indicated that the year of publication accounted for just over 20% of the variance in total quality score, suggesting that the quality of trials (or their reporting) has increased over time. The analysis was repeated with just the 25 trials included in the Morley et al. (1999) meta-analysis, resulting in very similar regression statistics (F(1,29) = 7.41, P < 0.05, AdjR² = 0.211, b = 0.494).

3.5. Quality and outcome

We regressed the quality score onto the averaged effect size of each trial, i.e. averaged across trial arms. Fig. 3 shows the scatter plot of the averaged effect sizes and the total quality scores. Inspection of the plot suggested a trend of a negative correlation and also indicated the presence of an outlier. The presence of the outlier was confirmed by diagnostic statistics in a regression analysis. We, therefore, excluded the outlier and regressed the quality score onto the averaged effect size of each trial. The regression line is shown in Fig. 3. The resulting regression did not meet conventional significance criteria (F(1,28) = 3.93, P < 0.057, AdjR² = 0.092, b = −0.351). When the separate components of the quality scale were considered, there was a significant impact of the quality of design on the magnitude of the effect size (F(1,28) = 5.39, P < 0.05, AdjR² = 0.131, b = −0.402) but no effect of the quality of treatment implementation (b = −0.147, t < 1.0, P = ns).

The preceding analyses depend on averaging the effect sizes for treatment arms within each trial, and the weights of the resultant averages were regarded as equivalent. This may introduce bias. We, therefore, repeated the analyses using grouped regression, in which effect sizes within each trial are regarded as replicates (Buchan, 2000). This method incorporates all the effect sizes and weights the trial by the number of 'replicates', but makes the assumption that the replicates are independent. In these analyses, the same trial was identified as an outlier and excluded from the analysis. There was a significant regression of effect size on the total quality score (F(1,40) = 8.19, P < 0.01), a marginally significant effect for the quality of design and method (F(1,40) = 4.06, P < 0.057) and no significant relationship between treatment quality and effect size (F(1,40) = 1.02, ns).

4. Discussion

The purpose of this study was to develop a scale to measure the quality of randomised controlled trials of psychological interventions. A Delphi panel generated statements with consensus validity that were then used to construct a scale. In two reliability studies, with experts and non-experts, the total scale score and the subscale scores achieved good levels of reliability. There was variation in the inter-rater reliability across items. This may be attributed to the degree of inference required of the raters when making judgments.
S.L. Yates et al. / Pain 117 (2005) 314–325 323

when making judgments. For example, the presence of a power calculation is either clearly stated or not, whereas evidence that patients have actively engaged in the treatment is more a question of interpretation on the part of the rater. In the absence of a gold standard, we tested the scale against two criteria: its ability to discriminate between expert-nominated trials of different quality and the assumption that trial quality has improved over time (the two decades between 1982 and 2003). Both of these tests indicated preliminary support for the validity of the scale.

There are potential biases in the methodology used in the study. First, the statements may not be an exhaustive inventory of every aspect of methodology that could impact on trial quality. We obtained consensus opinion from those with direct experience of conducting RCTs of psychological treatments for chronic pain, as knowledge of the subject matter is considered the most significant assurance of a valid outcome using the Delphi method (Stone Fish and Busby, 1996). The experts in the Delphi panel and the reliability study were predominantly behavioural scientists and research clinicians, in contrast to the statisticians and medical clinical trialists involved in other psychosocial and community-based trials. Differences between these two groups might be reflected in the content of the various scales. Nevertheless, the validity of the current scale items is supported by the considerable overlap between the items in the design and methods section (supplementary Appendix 1) and similar items reported in other quality scales (Tables 1 and 3). Second, the involvement of the authors in both the expert panel and as raters in the reliability study may have inflated the reliability and validity coefficients. We attempted to minimise these biases by sequencing the order of the tasks and by temporal separation of the tasks, e.g. the expert panel meeting occurred between 3 and 8 months before rating the trials. The results from the novice raters provided additional support for the potential usability and reliability of the scale.

Three areas are not covered by the scale: therapist allegiance, credibility of therapy and the reporting of adverse events. As far as we can ascertain there is no evidence linking the reporting of adverse events to bias in the estimation of the effectiveness of therapy, but documentation of such effects in pharmacological trials is required. Adverse events are rarely reported in psychological trials, although the fact that some patients deteriorate in psychotherapy has long been documented (Lambert and Bergin, 1994). The absence of an item directly assessing therapist allegiance might be rectified in any revision of this scale. There is substantial evidence that the allegiance of therapists to a particular model of psychological treatment is associated with larger effect sizes (Berman et al., 1985; Wampold, 2001). However, these findings come from a literature in which therapy is delivered by a single therapist to a single patient, and caution should be exercised in generalising this finding to chronic pain treatment, which is typically delivered to patient groups by a multidisciplinary team. Nevertheless, we suggest that an attempt should be made to estimate the influence of therapist allegiance on response to treatment, and its importance as a vehicle for therapeutic change recognised and given due weight. The omission of an item assessing equivalence of treatment credibility across arms of a trial might also be rectified. Non-equivalence of treatment credibility has long been recognised as a potential source of differential expectations of treatment gain (Kazdin and Wilcoxon, 1976)—a potential placebo mechanism (Kirsch, 1985; Price et al., 1999). There is some evidence that initial expectations of treatment gain influence outcomes in treatments for chronic pain (Goossens et al., 2005). In mitigation, the scale does include an item to assess treatment expectations, and a credibility assessment might, therefore, be redundant.

Assessment of trial quality is necessarily intertwined with the quality of the trial report (Juni et al., 2001). This can potentially lead to the situation where a well reported but biased trial could be judged to be of high quality, while a trial that is well designed but poorly reported is judged to be of low quality (Jadad, 1998). It has been argued that poor reporting is indeed reflective of poor methods generally (Schulz et al., 1995). The CONSORT statement (Moher et al., 2001) was developed with the aim of improving the standard of reporting of randomised controlled trials for medical interventions. More recently, additions have been made to the statement that reflect more accurately the design features of randomised controlled trials of psychological interventions that are pertinent, e.g. treatment adherence (Davidson et al., 2003).

The final two columns of Table 4 provide an overview of the relative strengths and weaknesses of cognitive-behavioural treatments published between 1982 and 2003 and indicate where improvements in design or reporting of trials are necessary. Despite the sophisticated data analysis of many trials there are lacunae in either design or reporting, e.g. participant allocation to treatment (allocation bias), randomisation, power calculations, adequate sample sizes and intention-to-treat analysis. Whether all these criteria should be applied to psychological trials merits further debate. More attention could perhaps be given to the selection and design of control groups, as only a minority of trials appear to include control groups that are matched to the general structure of treatment groups. This is a complex issue (Schwartz et al., 1997), but an important source of potential bias if the magnitude of treatment effects is to be estimated (Baskin et al., 2003). Structural equivalence of control groups is of primary importance for explanatory trials but may not be relevant in pragmatic trials. This distinction is not often made by authors of psychological trials; nevertheless users of the scale should consider the use of this scale item in the light of their aims. In contrast to the apparent shortfalls in design, many trialists have developed manualised protocols, assessed the integrity of implementation and justified the selection of outcomes. A recent meta-analytic review of psychological interventions for
irritable bowel syndrome revealed a similar pattern of strengths and weaknesses (Lackner et al., 2004).

The quality scale developed in this study offers some advantages for assessing the quality of psychological trials: its content was developed through the consensus of experts, it captures features of trial design that are widespread in this field, and there is preliminary evidence of its validity. The scale can be used to assess trials in systematic reviews and to explore the influence of trial quality or particular design features on the estimated effect size. Although more comprehensive than existing tools, e.g. Jadad et al. (1996), it remains concise and easy to use. Its use should provide greater validity, and a correction to the emphasis in existing scales on specific methods of bias control, e.g. blinding, that may not pertain to psychological interventions.3 Clearly, caution should be exercised in interpreting the results if single items are used, as they are likely to be measured less reliably. Users are encouraged to consider the addition of further items, e.g. to assess credibility, but as with all rating scales it is necessary to establish coding reliability each time the scale is used. The scale can be used to assess trials of psychological interventions in the general field of behavioural medicine, as none of the items are specific to chronic pain. We also note that many of the items concerning treatment may apply equally to pharmacological and other interventions, where the competence of the therapist and adherence to the treatment protocols are also important but perhaps somewhat neglected by current quality scales. Finally, we note that the current scale might be adapted to appraise trials where different modalities of treatment are being compared, e.g. pharmacotherapy vs. psychological treatment.

3 Although the majority of CBT trials cannot be blinded there are some psychological treatments, e.g. those delivering biofeedback, where the treatment can be delivered blind to both participant and therapist. Meta-analyses of these trials should consider incorporating a 'blinding' item from another scale.

Acknowledgements

Shona Yates was supported by the West Yorkshire Workforce Development Confederation. We would like to thank: the members of the Delphi panel who gave their time freely and generously—without them this project would have been impossible; to Bruce E. Wampold of the University of Wisconsin–Madison who kindly provided the necessary SPSS syntax file for computing the aggregated effect sizes; the 5 'novice' raters who gave their time in the presence of competing demands; to Sylvia Bickley for advice on the search strategy; and finally to Chris Yates.

Appendix 1. Supplementary material

Supplementary data associated with this article can be found, in the online version, at doi:10.1016/j.pain.2005.06.018

References

Baskin TW, Tierney SC, Minami T, Wampold BE. Establishing specificity in psychotherapy: a meta-analysis of structural equivalence of placebo controls. J Consult Clin Psychol 2003;71:973–9.
Basler HD, Jakle C, Kroner-Herwig B. Incorporation of cognitive-behavioral treatment into the medical care of chronic low back patients: a controlled randomized study in German pain treatment centers. Patient Educ Couns 1997;31:113–24.
Berman JS, Miller RC, Massman PJ. Cognitive therapy versus systematic desensitization: is one treatment superior? Psychol Bull 1985;97:451–61.
Buchan IE. StatsDirect—software program. Sale, Cheshire: StatsDirect Ltd; 2000.
Chalmers TC, Smith H, Blackburn B, Silverman B, Schroeder B, Reitman D, Ambroz A. A method for assessing the quality of a randomized control trial. Control Clin Trials 1981;2:31–49.
Davidson KW, Goldstein M, Kaplan RM, Kaufmann PG, Knatterud GL, Orleans CT, Springs B, Trudeau KJ, Whitlock EP. Evidence-based behavioral medicine: what is it and how do we achieve it? Ann Behav Med 2003;26:161–71.
Downs SH, Black N. The feasibility of creating a checklist for the assessment of the methodological quality both of randomised and non-randomised studies of health care interventions. J Epidemiol Community Health 1998;52:377–84.
Eccleston C, Morley S, Williams A, Yorke L, Mastroyannopoulou K. Systematic review of randomised controlled trials of psychological therapy for chronic pain in children and adolescents, with a subset meta-analysis of pain relief. Pain 2002;99:157–65.
Ersek M, Turner JA, McCurry SM, Gibbons L, Kraybill BM. Efficacy of a self-management group intervention for elderly persons with chronic pain. Clin J Pain 2003;19:156–67.
Goossens MEJB, Vlaeyen JWS, Hidding A, Kole-Snijders A, Evers S. Treatment expectancy affects the outcome of cognitive-behavioral interventions in chronic pain. Clin J Pain 2005;21:18–26.
Guzman J, Esmail R, Karjalainen K, Malmivaara A, Irvin E, Bombardier C. Multidisciplinary rehabilitation for chronic low back pain: systematic review. Br Med J 2001;322:1511–6.
Harbour R, Miller J. A new system for grading recommendations in evidence based guidelines. Br Med J 2001;323:334–6.
Henry B, Moody LE, Pendergast JF, O'Donnell J, Hutchinson SA, Scully G. Delineation of nursing administration research priorities. Nurs Res 1987;36:309–14.
Jadad AR, Moore A, Carroll D, Jenkinson C, Reynolds DJM, Gavaghan DJ, McQuay HJ. Assessing the quality of reports of randomized clinical trials: is blinding necessary? Control Clin Trials 1996;17:1–12.
Jadad AR. Randomised controlled trials: a user's guide. London: BMJ Publishing; 1998.
Johansson C, Dahl J, Jannert M, Melin L, Andersson G. Effects of a cognitive-behavioral pain-management program. Behav Res Ther 1998;36:915–30.
Jones J, Hunter D. Using the Delphi and nominal group technique in health services research. In: Pope C, Mays N, editors. Qualitative research in health care. BMJ Books; 1999.
Juni P, Altman DG, Egger M. Systematic reviews in health care: assessing the quality of randomised controlled trials. Br Med J 2001;323:42–6.
Kazdin AE, Wilcoxon LA. Systematic desensitization and nonspecific treatment effects: a methodological evaluation. Psychol Bull 1976;83:729–58.
Kirsch I. Response expectancy as a determinant of experience and behavior. Am Psychol 1985;40:1189–202.
Lackner JM, Morley S, Dowzer C, Mesmer C, Hamilton S. Psychological treatments for irritable bowel syndrome: a systematic review and meta-analysis. J Consult Clin Psychol 2004;72:1100–13.
Lambert MJ, Bergin AE. The effectiveness of psychotherapy. In: Bergin AE, Garfield SL, editors. Handbook of psychotherapy and behavior change. New York: Wiley; 1994. p. 143–89.
Linstone HA, Turoff M. The Delphi method: techniques and applications; 2002.
Marhold C, Linton SJ, Melin L. A cognitive-behavioral return-to-work program: effects on pain patients with a history of long-term versus short-term sick leave. Pain 2001;91:155–63.
McGraw KO, Wong SP. Forming inferences about some intraclass correlation coefficients. Psychol Methods 1996;1:30–46.
Moher D, Schulz KF, Altman DG, for the CONSORT Group. The CONSORT statement: revised recommendations for improving the quality of reports of parallel group randomized trials. J Am Med Assoc 2001;285:1987–91.
Morley S, Eccleston C, Williams A. Systematic review and meta-analysis of randomized controlled trials of cognitive behaviour therapy and behaviour therapy for chronic pain in adults, excluding headache. Pain 1999;80:1–13.
Price DD, Milling LS, Kirsch I, Duff A, Montgomery GH, Nicholls SS. An analysis of factors that contribute to the magnitude of placebo analgesia in an experimental paradigm. Pain 1999;83:147–56.
Schulz KF, Chalmers I, Hayes RJ, Altman DG. Empirical evidence of bias: dimensions of methodological quality associated with estimates of treatment effects in controlled trials. J Am Med Assoc 1995;273:408.
Schwartz CE, Chesney MA, Irvine MJ, Keefe FJ. The control group dilemma in clinical research: applications for psychosocial and behavioral medicine trials. Psychosom Med 1997;59:362–71.
Sharpe L, Sensky T, Timberlake N, Ryan B, Brewin C, Allard S. A blind, randomized, controlled trial of cognitive-behavioural intervention for patients with recent onset rheumatoid arthritis: preventing psychological and physical morbidity. Pain 2001;89:275–83.
Sindhu F, Carpenter L, Seers K. Development of a tool to rate the quality assessment of randomized controlled trials using a Delphi technique. J Adv Nurs 1997;25:1262–8.
Stone Fish L, Busby DM. The Delphi method. In: Sprenkle DH, Moon SM, editors. Research methods in family therapy. New York: Guilford Press; 1996.
Thieme K, Gromnica-Ihle E, Flor H. Operant behavioral treatment of fibromyalgia: a controlled study. Arthritis Rheum 2003;49:314–20.
Waltz J, Addis ME, Koerner K, Jacobson NS. Testing the integrity of a psychotherapy protocol: assessment of adherence and competence. J Consult Clin Psychol 1993;61:620–30.
Wampold BE. The great psychotherapy debate: models, methods, and findings. Mahwah, NJ: Lawrence Erlbaum Associates; 2001.
Wampold BE, Mondin GW, Moody M, Stich F, Benson K, Ahn H. A meta-analysis of outcome studies comparing bona fide psychotherapies: empirically, 'all must have prizes'. Psychol Bull 1997;122:203–15.
Williams A, Richardson P, Nicholas M, Pither C, Harding V, Ridout K, Ralphs J, Richardson I, Justins D, Chamberlain J. Inpatient vs. outpatient pain management: results of a randomised controlled trial. Pain 1996;66:13–22.