
METHODOLOGY PAPER

Validation of a new tool for the assessment of study quality and reporting in exercise training studies: TESTEX
Neil A. Smart,1 Mark Waldron,1 Hashbullah Ismail,1 Francesco Giallauria,1,2 Carlo Vigorito,2 Veronique Cornelissen1,3 and Gudrun Dieberg1
1 University of New England, School of Science and Technology, Armidale, New South Wales, Australia
2 Department of Translational Medical Sciences, Division of Internal Medicine and Cardiac Rehabilitation, Federico II University of Naples, Naples, Italy
3 University of Leuven, KU Leuven, Department of Rehabilitation Sciences, Leuven, Belgium

ABSTRACT

Introduction: Several established tools are available to assess study quality and reporting of randomized controlled
trials; however, these tools were designed with clinical intervention trials in mind. In exercise training intervention
trials some of the traditional study quality criteria, such as participant or researcher blinding, are extremely difficult to
implement.
Methods: We developed the Tool for the assEssment of Study qualiTy and reporting in EXercise (TESTEX) – a study
quality and reporting assessment tool, designed specifically for use in exercise training studies. Our tool is a 15-point
scale (5 points for study quality and 10 points for reporting) and addresses previously unmentioned quality
assessment criteria specific to exercise training studies.
Results: There were no systematic differences between the summated TESTEX scores of each observer [H(2) = 0.392,
P = 0.822]. There was a significant association between the summated TESTEX scores of the three observers, with
almost perfect agreement between observers 1 and 2 [intra-class correlation coefficient (ICC) = 0.93, 95% confidence
interval (CI) 0.82–0.97, P < 0.001], observers 1 and 3 (ICC = 0.96, 95% CI 0.89–0.98, P < 0.001) and observers 2 and
3 (ICC = 0.91, 95% CI 0.75–0.96, P < 0.001).
Conclusions: The TESTEX scale is a new, reliable tool, specific to exercise scientists, that facilitates a comprehensive
review of exercise training trials.
Key words: assessment tool, exercise training, study quality, study reporting
Int J Evid Based Healthc 2015; 13:9–18.

Correspondence: Neil A. Smart, School of Science and Technology, University of New England, Armidale, NSW 2351, Australia. E-mail: nsmart2@une.edu.au
DOI: 10.1097/XEB.0000000000000020

Introduction
Exercise training study designs and reporting, especially in clinical populations, are increasingly complex. The benefits of exercise training for improving the clinical status of patients are widely recognized, but research and service provision funding may be withheld in certain circumstances owing to the growing concerns over poor exercise compliance. If a person refuses to adhere to an intervention, be it medication or exercise, then the potential to improve is likely to be less and, for this reason, interest in translational research into exercise training is burgeoning. Poor study designs and incomplete reporting in the exercise sciences may limit the usefulness of study findings in terms of translation to clinical servicing.
The most robust individual study design is the randomized controlled trial, but results from two different exercise training interventions are often conflicting. In order to determine which trial is more believable, one could rank studies in terms of the standards of methodological and reporting quality. The reason one may

International Journal of Evidence-Based Healthcare ß 2015 University of Adelaide, Joanna Briggs Institute 9

©2015 University of Adelaide, Joanna Briggs Institute. Unauthorized reproduction of this article is prohibited.
NA Smart et al.

employ this ranking strategy is to identify sources of bias. One common source of bias is measurement error where researchers who deliver the intervention are also responsible for taking outcome measurements, which may be biased in favour of the exercise over control group; blinding of outcome assessors may therefore eliminate this form of bias. Outcome reporting from trials may be deficient in a number of ways and this is easily rectified. Selective outcome reporting is common, where post-intervention changes in initially stipulated outcome measures are withheld from publication because results are undesirable. Reliability, or the consistency of a measurement or the absence of measurement error,1 and degree of observer error or validity are two other sources of error. In exercise training studies health practitioners are almost always concerned with safety, yet adverse events are not always reported. Related to this, one may report a certain type of exercise intervention to be more beneficial to a group than another intervention; yet this is of secondary importance if adverse event rates, withdrawal or adherence rates are worse. For these reasons, we have focussed on the specifics of methodology and reporting in exercise training studies with respect to these shortcomings.
Various tools are available to assess the quality of methodology and reporting in randomized controlled trials. The Consolidated Standards of Reporting Trials statement2 is a general tool to guide study reporting, and while the JADAD score3 was for some time the preferred quality and reporting assessment tool favoured by the Cochrane Collaboration,4 a more recent risk of bias (ROB) scale has been developed by Cochrane, although the reliability of ROB appears low.5 A general tool, such as ROB, is likely to lack specificity when applied to exercise training studies, and certain criteria (such as participant blinding) will be redundant. Within the field of physical therapy, the Physiotherapy Evidence Database (PEDro) tool has become widely used by physiotherapists and is currently also the tool of choice for exercise physiologists. Some of the criteria included in the PEDro scale, many of which were adapted from the Delphi list,6 are often redundant for exercise training studies. Examples of this are the blinding criteria utilized by the PEDro scale; blinding of exercise training participants is not feasible, nor is blinding of the investigators directly supervising the training. Therefore, these criteria are redundant for exercise training studies. Conversely, other important methodological and reporting criteria determining the effectiveness and the risks of an exercise intervention are not adequately addressed. Examples of these are reporting of withdrawals/adverse events, session attendance, exercise adherence, and exercise programme characteristics, which are important in exercise training studies and are not included in the PEDro scale.7 The authors of this work have experience assessing study quality for numerous published meta-analyses.8–15 The PEDro scale is perhaps the tool that comes closest to meeting the methodological and reporting requirements of exercise studies; however, several shortcomings remain.
The primary objective of this study was to develop an exercise science-specific scale, designed for use by exercise specialists, to assess the quality and reporting of exercise training trials. The secondary objective was to assess the validity and reliability of this scale. It is intended that the scale will be used by researchers conducting systematic reviews and meta-analyses, so they can quantify the strength of individual study designs and reporting; and by exercise science practitioners seeking to establish whether a particular intervention is beneficial or safe in the face of conflicting evidence. While developing the Tool for the assEssment of Study qualiTy and reporting in EXercise (TESTEX) scale, we aimed to avoid using redundant criteria, which we feel are not applicable for exercise training studies, and to include new criteria, which we think are most relevant to study design, quality and reporting in exercise sciences.

Methods
The authorship group consisted of members of an existing collaborative with previous experience in producing meta-analyses and therefore assessing study quality. All members were asked to list the difficulties they had encountered in using PEDro in assessing exercise training studies. Members were also asked which items they thought were redundant and which should be included in an exercise training-specific assessment tool. The group used the PEDro scale as a template for the development of a new scale. A series of meetings was organized, during which newly proposed and existing (PEDro) criteria were assessed for inclusion in the new scale. Newly proposed criteria were based upon difficulties experienced when conducting study quality assessment for meta-analyses, which is crucial for information accuracy and translation into clinical practice. Most items were unanimously included in the protocol, but two items were debated at three meetings. At the third meeting, a consensus was reached and a draft protocol was circulated for comment. Three drafts were edited before a final version was reached. Once the draft had been finalized, a reliability study was conducted.
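
The item-level agreement statistic used in that reliability study, Cohen's kappa, K = (PO - PC)/(1 - PC), can be illustrated with a short sketch. This is a hedged example and not the authors' analysis, which was run in SPSS. The function names `cohen_kappa` and `landis_koch` and the example ratings are hypothetical; PO and PC are treated as proportions of agreement; and the band edges are an assumption based on the Landis and Koch labels quoted under 'Statistical analysis of reliability'.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters scoring the same items (e.g. '1' or '0' per study)."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two equal-length, non-empty rating lists")
    n = len(ratings_a)
    # Observed proportion of agreement (PO).
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement (PC) from each rater's marginal category frequencies.
    count_a, count_b = Counter(ratings_a), Counter(ratings_b)
    p_c = sum(count_a[c] * count_b[c] for c in set(count_a) | set(count_b)) / (n * n)
    if p_c == 1.0:
        # Both raters used one identical category throughout; kappa is undefined.
        # Returning 1.0 here is an assumption made for this sketch.
        return 1.0
    return (p_o - p_c) / (1 - p_c)


def landis_koch(kappa):
    """Map kappa to a Landis and Koch label (values above 0.80 treated as 'almost perfect')."""
    if kappa < 0.0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


if __name__ == "__main__":
    # Hypothetical scores from two observers for one TESTEX point across 10 studies.
    observer_1 = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
    observer_2 = [1, 1, 0, 0, 0, 1, 0, 1, 1, 1]
    k = cohen_kappa(observer_1, observer_2)
    print(round(k, 2), landis_koch(k))
```

In a three-observer design such as the one described here, the same function would be applied to each pair of observers and to each of the 15 available points in turn.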


Reliability of TESTEX
Study selection and quality assessments
Three reviewers/observers (M.W., N.S., and G.D.) independently evaluated the quality of 19 published exercise training studies using the TESTEX criteria. The studies were randomly selected from a list of randomized controlled trials of exercise training in people with chronic disease. This search intentionally produced a broad variety of exercise studies (randomized controlled trials), administering exercise interventions to a range of patients. It was not deemed necessary to search for studies with any greater scrutiny, given the intended broad application of the TESTEX scale. Nineteen studies were selected for the reliability analysis since this equated to the average number of studies included in meta-analyses conducted by the current authors. In total, there were 15 points available (5 points for study quality and 10 points for reporting) in the TESTEX scale that were each assigned either a ‘1’ or ‘0’ by the observers. All observers had experience in conducting exercise training studies and varying levels of expertise in assessing study quality of exercise intervention trials. Each observer was provided with a copy of the TESTEX protocol, the 19 research papers and an Excel spreadsheet on which to record the data.

Statistical analysis of reliability
The inter-observer agreement between the observers (n = 3) was assessed for each individual point available on the TESTEX scale (15 in total) using the Cohen Kappa statistic (K). The Kappa statistic is based on the following calculation: $K = (P_O - P_C)/(1 - P_C)$, where $P_O$ is the proportion of observed agreements and $P_C$ is the proportion of agreements expected by chance. The Kappa statistic and intra-class correlation coefficients (ICCs) are appropriate for measuring agreement between individuals when the data are nominal (i.e. ‘1’ or ‘0’).1,16 The use of these statistics is consistent with previous studies assessing the inter-observer agreement of quality assessment tools.7
A secondary analysis was also performed, based on the same data-set, to evaluate the inter-observer reliability of the summated TESTEX scale score (i.e. score out of 15). Following checks for normality using the Shapiro–Wilk statistic, systematic differences between the three observers were evaluated using non-parametric Kruskal–Wallis tests. The reliability of the total score of each observer was assessed using an ICC (2, k) and the associated 95% confidence intervals (95% CIs). For all analyses, we described the level of agreement between the observers according to Landis and Koch17: (>0.81 ‘almost perfect’; 0.61–0.80 ‘substantial’; 0.41–0.60 ‘moderate’; 0.21–0.40 ‘fair’; 0.00–0.20 ‘slight’ and <0.00 ‘poor’). Data analyses were performed using SPSS version 20 and statistical significance was set at P less than 0.05 throughout.

Results
TESTEX is a study quality assessment tool for exercise training studies that addresses previously unmentioned quality assessment criteria such as crossover from sedentary control to exercise, periodic adjustment of exercise training intensity in respect to physical training adaptation, and reporting of exercise programme characteristics. The TESTEX scale uses 12 criteria with some criteria having more than one possible point, for a maximum score of 15 points (5 points for study quality and 10 points for reporting). A concise summary of the TESTEX scale is provided in Table 1. A detailed description of each TESTEX criterion and justification for inclusion is given below. A comparison between the TESTEX criteria and those used by the PEDro scale was conducted and is included in Table 2.

TESTEX criteria
Common study quality criteria
Eligibility criteria specified – TESTEX Criterion 1
In exercise training studies, specific diagnostic criteria must fall within a certain range for the condition to exist, e.g. to be considered hypertensive a person’s systolic blood pressure should be greater than 139 mmHg and/or diastolic blood pressure greater than 89 mmHg or being treated with antihypertensive medication. However, often the mean and standard deviation (SD) values for a study group indicate that some values fall outside the specified range and, hence, eligibility criteria ‘are not met’ by all participants. We note that in the PEDro document it is stated that this criterion influences external validity, but not the internal or statistical validity of the trial. We also note that the PEDro scale suggests that eligibility criteria should be specified but this criterion is not used to calculate PEDro score. Indeed, the PEDro scale was specifically designed to avoid the assessment of external validity. We feel that eligibility criteria are more precise in the exercise sciences for three reasons. First, eligibility is less reliant on clinical judgement (e.g. physical examination) and more dependent on numerical values from objective diagnostic tests in exercise science. Second, the scope of practice for exercise scientists excludes diagnosis, which is not always the case for physiotherapists. Third, manuscript and grant reviewer feedback over the years has led us to believe that some eligibility criteria are likely to affect the statistical validity of exercise

Table 1. ‘Detailed TESTEX scale’ (maximum score 15)

| Criterion | Explanation | Scoring |
| --- | --- | --- |
| Study quality | | |
| 1 – Eligibility criteria specified | Eligibility criteria should be specified and fulfilled and specific diagnostic test values should be provided for all participants. | 1 Point – if eligibility criteria are clearly stated and fulfilled |
| 2 – Randomization specified | A description of the method used to allocate patients into treatment groups should be provided. | 1 Point – if methods are described and they are truly random e.g. coin-tossing, sequence of randomly generated numbers |
| 3 – Allocation concealment | It should be stated if group allocation was concealed, meaning that a patient who was eligible for inclusion in the trial was unaware (when this decision was made) of which group the patient would be allocated to. | 1 Point – if group allocation was concealed from patients eligible for inclusion in the trial (e.g. consent should be given before randomization) |
| 4 – Groups similar at baseline | Baseline data of all participants who were randomized should be presented. There should be no significant difference in the measure of the severity of the treated condition between treatment groups. | 1 Point – if baseline data are separated by group allocation, presented and no differences are apparent |
| Blinding of all participants | This item is not scored. | No point |
| Blinding of all therapists | This item is not scored. | No point |
| 5 – Blinding of assessor (for at least one key outcome) | It is not always possible to blind patients and/or therapists; however, blinding of assessors is reasonable. If assessors of primary outcome measures are blinded to the intervention allocation of the patients, this should be stated clearly. | 1 Point – if it is stated unambiguously that the assessor of at least 1 primary outcome measure was blinded to group allocation |
| Study reporting | | |
| 6 – Outcome measures assessed in 85% of patients | The percentage of patients completing the study in both groups should be reported. Any adverse events (serious medical events, deaths, hospitalizations etc.) should be reported for each intervention group. The percentage of exercise sessions completed by the exercise patients who did not withdraw from the study should be reported. | No point – if withdrawals are >15%; 1 Point – if adherence >85%; 1 Point – if adverse events are reported; 1 Point – if exercise attendance is reported; Total possible – 3 points |
| 7 – Intention-to-treat analysis | When a patient withdraws, this analysis is conducted by using either the last value obtained for each of the outcome measures as a post-intervention value, or by using the baseline value as a post value. This analysis should be added to the data of those that did complete the study and an analysis conducted. | 1 Point – if intention to treat analysis was performed on outcomes of interest |
| 8 – Between-group statistical comparisons reported | Comparison of exercise vs. comparator (control) group for the primary and at least one secondary outcome should be performed. | 1 Point – if between-group statistical comparisons are reported for the primary outcome measure of interest; 1 Point – if between-group statistical comparisons are reported for at least one secondary outcome measure; Total possible – 2 points |
| 9 – Point measures and measures of variability for all reported outcome measures | Point estimates should be provided for all outcomes, otherwise this could be deemed selective outcome reporting. | 1 Point – if all outcomes are reported with point estimates |
| 10 – Activity monitoring in control groups | Between-group differences may be diluted if control patients crossover to intervention. As many as one third of patients do this, so some measure e.g. exercise diary or activity monitoring should be supplied so this effect can be measured and quantified. | 1 Point – if control patients are asked to report their levels of physical activity and data are presented |
| 11 – Relative exercise intensity remained constant | Exercise intensity is considered by many to be the best stimulus for adaptation. Once patients begin an exercise programme at a set intensity they will begin to adapt. Throughout the study duration the relative intensity will fall in those that do adapt. Therefore, periodic assessment of exercise capacity should be conducted and the intensity titrated up (or in those that lose fitness, titrated down) so that exercise intensity remains constant. | 1 Point – if exercise load is titrated to keep relative intensity constant |
| 12 – Exercise volume and energy expenditure | Exercise parameters (session and programme duration, session frequency, exercise training intensity and modality) should be clearly reported. | 1 Point – if exercise volume and energy expenditure can be calculated |
| Total | | Out of a possible 15 points |
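
The point allocation in Table 1 can be sketched as a small scoring routine. This is a minimal illustration, not an official TESTEX implementation: the class `StudyRating`, its field names and the example values are hypothetical. The completion point is awarded at 85% or above, following the 'at least 85%' wording in the text for Criterion 6; that threshold choice is an assumption, since Table 1 itself uses '>85%'.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StudyRating:
    """One rater's judgements for one study (all field names are hypothetical)."""
    eligibility_specified: bool          # criterion 1
    randomization_specified: bool        # criterion 2
    allocation_concealed: bool           # criterion 3
    groups_similar_at_baseline: bool     # criterion 4
    assessor_blinded: bool               # criterion 5
    completion_pct: Optional[float]      # criterion 6: % of patients completing, None if unreported
    adverse_events_reported: bool        # criterion 6
    attendance_reported: bool            # criterion 6
    intention_to_treat: bool             # criterion 7
    primary_between_group: bool          # criterion 8
    secondary_between_group: bool        # criterion 8
    point_and_variability: bool          # criterion 9
    control_activity_monitored: bool     # criterion 10
    intensity_kept_constant: bool        # criterion 11
    volume_calculable: bool              # criterion 12


def quality_score(r: StudyRating) -> int:
    """Study quality sub-score, 0 to 5 (criteria 1 to 5)."""
    return sum([r.eligibility_specified, r.randomization_specified,
                r.allocation_concealed, r.groups_similar_at_baseline,
                r.assessor_blinded])


def reporting_score(r: StudyRating) -> int:
    """Study reporting sub-score, 0 to 10 (criteria 6 to 12)."""
    # Assumption: the completion point needs a reported value of at least 85%.
    completion_point = r.completion_pct is not None and r.completion_pct >= 85.0
    return sum([completion_point, r.adverse_events_reported, r.attendance_reported,
                r.intention_to_treat, r.primary_between_group,
                r.secondary_between_group, r.point_and_variability,
                r.control_activity_monitored, r.intensity_kept_constant,
                r.volume_calculable])


def total_score(r: StudyRating) -> int:
    """Summated TESTEX score out of 15."""
    return quality_score(r) + reporting_score(r)


if __name__ == "__main__":
    example = StudyRating(True, True, False, True, True, 90.0, True, False,
                          True, True, True, True, False, True, False)
    print(quality_score(example), reporting_score(example), total_score(example))
```

Keeping the two sub-scores separate mirrors the 5-point quality and 10-point reporting split of the scale, so a reviewer can report either sub-score or the summated total.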


Table 2. Comparison between PEDro scale and TESTEX scale

| PEDro | Points | TESTEX | Points |
| --- | --- | --- | --- |
| TESTEX study quality (total 5 points) | | | |
| Eligibility criteria specified | None | Eligibility criteria specified | 1 |
| Random allocation of patients | 1 | Randomization specified | 1 |
| Allocation concealed | 1 | Allocation concealment of all patients at the time of randomization | 1 |
| Groups similar at baseline | 1 | Groups similar at baseline | 1 |
| Blinding of all participants | 1 | Blinding of all participants | None |
| Blinding of all therapists | 1 | Blinding of all therapists | None |
| Blinding of all assessors | 1 | Blinding of assessor (for at least one key outcome) | 1 |
| TESTEX study reporting (total 10 points) | | | |
| Outcome measures assessed in 85% of patients | 1 | Outcome measures assessed in 85% of patients: | |
| | | Study withdrawals reported (>15% – no point; <15% – 1 point) | 1 |
| | | Adverse events reported | 1 |
| | | Session attendance reported | 1 |
| Intention-to-treat analysis | 1 | Intention-to-treat analysis (a) | 1 |
| Reporting of between-group statistical comparisons | 1 | Reporting of between-group statistical comparisons: | |
| | | Primary outcome reported | 1 |
| | | Secondary outcome(s) reported | 1 |
| Point measures and measures of variability reported | 1 | Point measures and measures of variability for all reported outcome measures reported | 1 |
| | | Activity monitoring in control groups: to avoid/measure crossover to exercise by sedentary control patients; method of activity monitoring is reported | 1 |
| | | Relative exercise intensity remained constant: periodic evidence-based adjustment of exercise intensity is reported | 1 |
| | | Exercise volume and exercise expenditure: information on all exercise characteristics (intensity, duration, frequency, mode) is provided to calculate exercise volume and expenditure | 1 |
| Total points possible | 10 | Study quality – 5 points; Study reporting – 10 points | 15 |

PEDro, Physiotherapy Evidence Database.
(a) When a patient withdraws, this analysis is conducted by using either the last value obtained for each of the outcome measures as a post-intervention value, or by using the baseline value as a post value.
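
The imputation rule in the footnote above (carry the last available measurement forward, falling back to the baseline value) can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the data layout (one list per patient, baseline first, `None` for a missed visit) and the example values are all hypothetical and not taken from the paper.

```python
from statistics import mean
from typing import Dict, List, Optional

Measurements = List[Optional[float]]  # [baseline, visit 1, ..., final visit]


def post_value_locf(measurements: Measurements) -> float:
    """Post-intervention value with the last observation carried forward."""
    if not measurements or measurements[0] is None:
        raise ValueError("a baseline value is required")
    observed = [m for m in measurements if m is not None]
    # For a patient who withdrew before any follow-up this is the baseline value.
    return observed[-1]


def mean_change_itt(group: Dict[str, Measurements]) -> float:
    """Mean change from baseline using every randomized patient in the group."""
    return mean(post_value_locf(m) - m[0] for m in group.values())


def mean_change_completers(group: Dict[str, Measurements]) -> float:
    """Mean change from baseline using only patients with a final measurement."""
    return mean(m[-1] - m[0] for m in group.values() if m[-1] is not None)


if __name__ == "__main__":
    # Hypothetical outcome values for three patients in an exercise group.
    exercise_group = {
        "p1": [20.0, 22.0, 24.0],   # completed the study
        "p2": [18.0, 19.0, None],   # withdrew after the first follow-up
        "p3": [25.0, None, None],   # withdrew before any follow-up
    }
    print(mean_change_itt(exercise_group), mean_change_completers(exercise_group))
```

Comparing the two means shows how far a completers-only analysis can drift from the intention-to-treat estimate when withdrawals are frequent.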

training trials. For these reasons, we award 1 point for studies that report and fulfil included eligibility criteria.

Randomization specified – TESTEX Criterion 2
Consistent with the second criterion of the PEDro scale, we feel it is insufficient to consider that a study has utilized a random allocation if the published manuscript merely states that allocation was random. The precise method of randomization (e.g. computer-generated random numbers, coin-tossing and dice-rolling) should be specified. Quasi-randomized allocation procedures do not satisfy this criterion. Random allocation ensures that (within the constraints provided by chance) treatment and control groups are comparable; this is especially important in exercise training studies as sample size is often less than 50 patients and study withdrawal is more than 15%. We award 1 point for studies that stipulate the method of randomization.

Allocation concealment – TESTEX Criterion 3
In addition to eligibility criteria, the TESTEX scale awards 1 point for the concealment of allocation. A study is considered to have provided allocation concealment if the potential patients were unaware of which group they would be allocated to, at the time the patients give their consent.

Groups similar at baseline – TESTEX Criterion 4
Studies of exercise training interventions should report at least one measure of the severity of the condition being treated and at least one (different) key outcome measure at baseline. The rater must be satisfied that the groups’ outcomes would not be expected to differ (on the basis of baseline differences in prognostic variables alone) by a statistically significant amount. Discrepancies at baseline between groups may be indicative of inadequate randomization procedures. One point is awarded if baseline data are presented by group


allocation, and there is no significant difference between study groups in the key outcome(s) of interest and the measure of the severity of the condition being treated.

Blinding of all participants
Blinding involves ensuring that participants were unable to discriminate whether they had or had not received the treatment. It is acknowledged that in some physiotherapy studies it is possible to provide ‘sham’ interventions that could be perceived by participants to mimic actual interventions. In exercise training studies participant blinding is difficult to achieve, with just a few notable exceptions. For example, there have been studies that have compared exercise training to: cycling at zero load; ‘functional electrical stimulation (FES)’ or ‘sham’ and inspiratory muscle training (IMT), whereby participants can be subjected to interventions that are likely to be below the stimulus threshold to elicit physiological adaptation. Despite these attempts at ‘sham’ training, participants are usually aware of the groups to which they have been randomized (based on information that they receive when giving their consent to participate), so true blinding is almost impossible. Studies that have allocated participants to either FES or IMT intervention groups could be considered to have employed shams or placebos. We acknowledge that the reason for using sham treatments is that they can be manipulated to fall above or below the therapeutic threshold expected to elicit a physical adaptation, but feel that this is in general not feasible in exercise training studies. For this reason, unlike the PEDro scale, we do not award a point for this criterion.

Blinding of all therapists
As with the blinding of participants, it is our contention that exercise training interventions do not lend themselves to blinding the administering therapist. When therapists have been blinded, the reader can be satisfied that the apparent effect (or lack of effect) of treatment was not due to the therapists’ enthusiasm or lack of enthusiasm for the treatment or control conditions. In exercise studies performed on patients with chronic disease, it is of fundamental importance that the administering therapist is fully aware of the possible effects of a given treatment. Indeed, in studies administering exercise interventions to moderate-high risk patients, it would be considered negligent on behalf of the therapist not to have obtained an a priori record of the patient’s medical history and the scope of effects that might occur in that individual as a result of the treatment. Furthermore, part of the therapist’s role is to provide motivation for patients to continue to exercise, inherently attaching the practitioner to the treatment. Collectively, we feel that the above reasons render this an inappropriate scoring criterion. For this reason, unlike the PEDro scale, we do not award a point for this criterion.

Blinding of assessor (for at least one key outcome) – TESTEX Criterion 5
Whereas blinding of patients and therapists is very difficult to implement in exercise training studies, it is reasonable to expect to blind assessors (those people that conduct outcome data measurements) to the intervention allocation of the participants. When assessors have been blinded, the reader can be satisfied that the apparent effect (or lack of effect) of treatment was not due to the assessors’ biases impinging on their measures of outcomes. One point is awarded if it is stated, unambiguously, that the assessor of the primary outcome measure was blinded to group allocation. An exception to this is where studies state that measurements are completely automated (for example, measurements of blood analyses), in which case the potential for human bias has been removed and 1 point can be awarded.

Common study reporting criteria
Outcome measures assessed in 85% of patients – TESTEX Criterion 6
The volume and duration of exercise training required to elicit adaptations varies with different outcome measures, exercise prescriptions and patient characteristics. However, it is generally accepted that significant changes in some measures, such as cardio-respiratory fitness, cardiac function, lipids or glycaemic control, are not immediate and that at least 1 month is required to detect changes. Due to the extended intervention periods typical of exercise training, the proportion of patients who complete the study is often less than 85% and not all of those who complete the study attend all exercise sessions (we deal with high withdrawal rates in criterion 7 on ITT). It is therefore important in exercise training studies to distinguish between exercise adherence and exercise attendance as both are relevant. For the purposes of this document we will define exercise adherence as the number of withdrawals and completions in both the study’s intervention and control groups. Exercise attendance is defined here as the percentage of target sessions completed by each individual who completes the study. Quite often more than 15% of people will withdraw from an exercise training study during the stipulated study period. Moreover, exercise attendance is less than 85% in some of the people who do not withdraw from the study. It is therefore desirable


that both the number of withdrawals in each allocation group, and also the mean and SD of the percentage exercise session attendance are reported for intervention groups. The one caveat to this would be where an alternative target, for example, kilocalories per week of exercise energy expenditure was successfully achieved. In this case a point would be awarded regardless of the number of exercise sessions completed.
We also feel that physical activity in control group patients should be monitored but we address this, and attribute a score, in criterion 10. We award 1 point if studies report exercise training adherence of at least 85%; no point will be awarded if adherence is less than 85%. We award 1 point for reporting adverse events (deaths, hospitalizations, etc.). We award a point for this as the uptake of exercise therapy is almost always evaluated in terms of a balance between expected benefits and the risk of adverse events. We also award 1 point for reporting session attendance for the exercise group(s).

‘Intention-to-treat’ analysis – TESTEX Criterion 7
In this criterion, we assess if all patients for whom outcome measures were available received the treatment or control condition as allocated or, where this was not the case, data for at least one key outcome in which one is interested was analysed by ITT analysis. We propose that, when possible, an ITT analysis is conducted so either the last value obtained for each of the outcome measures is used as the post-intervention value, or the baseline value is used as the post-value, when a patient withdraws. The inclusion of this criterion aims to establish if certain patient demographics, clinical status, medications and so on predispose patients to withdraw. The overall aim here is to help future research and clinical services to improve exercise adherence by identifying predisposing factors that lead to study withdrawal. One point is awarded if an ITT analysis is conducted. ITT analysis is performed by substituting the last measurement (which may be the baseline measurement) in those that did not complete the study and these data are included in analyses of those that did complete.

Between-group statistical comparisons reported – TESTEX Criterion 8
[...] group provides two things. Firstly, this will indicate whether there is regression to the mean, that is patients improve without exercise, perhaps because their medications are optimized or for other reasons unrelated to the exercise intervention. Secondly, it may be that some patients who should have exercised did not and some who should not have received the intervention did. Although we attribute a score to these issues in other criteria, the between-group comparison alerts us to the fact that regression to the mean or inappropriate treatment allocations may have occurred and also whether the difference between groups is greater than can plausibly be attributed to chance. One point is awarded if the primary outcome of interest is reported, with another point awarded for the reporting of at least one secondary outcome for a total of 2 points.

Point measures and measures of variability for all reported outcome measures – TESTEX Criterion 9
Point estimates (often P values) of treatment effect only provide limited information about the outcomes of treatment and control groups. A more comprehensive approach is to also provide measures of variability. We suggest, however, that this is extended to all reported outcome measures to avoid selective outcome reporting. We award 1 point for this criterion.

Activity monitoring in control groups – TESTEX Criterion 10
The largest trial of exercise training in heart failure to date (HF-ACTION)18 suggested that one of the reasons for a lower than expected post-intervention difference between groups was that approximately 30% of patients allocated to sedentary control undertook exercise training privately. It is therefore recommended that a robust study design quantifies this by making some provision for measuring activity levels in control patients to avoid crossover to exercise. This may be a simple method, such as providing patients with an activity diary or, more advanced, by providing accelerometry or heart rate monitoring devices to assess potential contamination of the control group by monitoring physical activity behaviour. We award 1 point for any reporting on results of activity monitoring in sedentary control participants.
To score in this criterion, between-group statistical com-
parisons should be reported for the primary and at least Relative exercise intensity remained constant –
one secondary outcome measure. If all outcomes are TESTEX Criterion 11
not reported, this would deem to be selective outcome As patients adapt to exercise training, if the workload is
reporting. We are primarily interested in whether inter- kept constant, then the relative exercise intensity will
vention groups improve, but a comparison with a control continually fall as patients improve their physical work


capacity. Therefore, periodic assessments of work capacity (using either rate of perceived exertion, heart rate reserve or percentage of maximum heart rate or peak VO2) should be undertaken at least once during the training programme and the exercise load titrated accordingly. Any drift in exercise intensity means that adaptations will be attenuated or even plateau after 3–4 weeks and also that any conclusions based on expected magnitude of adaptation in relation to intensity will be difficult to examine. We award 1 point for periodic adjustment for the purpose of keeping relative exercise intensity constant.

Exercise volume characteristics and energy expenditure – TESTEX Criterion 12
Often there is a dose-response relationship between exercise and clinical improvement although clinical markers respond differently to different types and doses of exercise. Adequate reporting should allow the reader to calculate, rather than estimate, the volume of exercise in terms of energy expended during the programme of exercise. To this end, the exercise parameters, namely type (aerobic/resistance/combined), session and programme duration, session frequency, exercise training intensity and modality, should be clearly reported so that each exercise prescription can be evaluated as to whether or not it meets the therapeutic threshold. One point is awarded if all exercise characteristics are reported adequately (i.e. intensity, frequency, mode, duration of session and duration of the intervention) and the exercise volume and energy expenditure can be evaluated.

Reliability study
As presented in Table 3, the agreement (K) between observers 1 and 2 ranged between 0.44 (moderate) and 1.00 (almost perfect). The moderate agreements occurred in the categories of 'Groups similar at baseline' and 'Study withdrawals below 15%'. Almost perfect agreement (or constant values, i.e. 100% agreement) occurred in 8 of the 15 categories (53.3%) and substantial agreement in 5 (33.3%). Observers 2 and 3 achieved almost perfect agreement (or constant agreement) in 7 of the 15 categories (46.7%), substantial agreement in 5 of the 15 categories (33.3%), with 3 categories (20%) reaching moderate agreement. Observers 1 and 3 achieved almost perfect agreement (or constant agreement) in 11 of the 15 categories (73.3%), substantial agreement in 2 categories (13.3%), and moderate agreement in the remainder (13.3%).
There were no systematic differences between the summated TESTEX scores of each observer [H(2) = 0.392, P = 0.822]. There was a significant association between the summated TESTEX scores of the three reviewers, with almost perfect agreement between observers 1 and 2 (ICC = 0.93, 95% CI 0.82–0.97, P < 0.001), observers 1 and 3 (ICC = 0.96, 95% CI 0.89–0.98, P < 0.001) and observers 2 and 3 (ICC = 0.91, 95% CI 0.75–0.96, P < 0.001).

Discussion
Building on the widely used PEDro and JADAD scales, we have developed the TESTEX scale as an exercise science-specific scale, designed for use by exercise specialists to assess the quality of randomized controlled trials of

Table 3. Inter-observer reliability (Kappa ± SE) between the three expert reviewers using the 15-point TESTEX criteria

TESTEX criteria                                     Observer 1 vs.   Observer 2 vs.   Observer 1 vs.
                                                    observer 2       observer 3       observer 3
Eligibility criteria included                       Constant         Constant         Constant
Randomization method stated                         1.00 (0.00)f     1.00 (0.00)f     1.00 (0.00)f
Allocation concealment                              Constant         Constant         Constant
Groups similar at baseline                          0.48 (0.24)d     0.61 (0.23)e     0.32 (0.30)d
Assessor blinded                                    0.69 (0.16)e     0.69 (0.16)e     1.00 (0.00)f
Study withdrawals <15%                              0.44 (0.33)d     0.41 (0.21)d     0.41 (0.21)d
Adverse events reported                             0.65 (0.18)e     0.77 (0.15)e     0.88 (0.11)f
Session attendance reported                         0.63 (0.18)e     0.52 (0.20)d     0.86 (0.14)f
Intention-to-treat analysis                         1.00 (0.00)f     0.61 (0.24)e     0.61 (0.24)e
Between-group primary analysis                      Constant         Constant         Constant
Between-group secondary analysis                    Constant         Constant         Constant
Point measures for all outcomes                     Constant         Constant         Constant
Activity monitoring controls                        0.77 (0.14)e     0.51 (0.21)d     0.77 (0.14)e
Relative exercise intensity adjusted                0.73 (0.17)e     0.73 (0.17)e     1.00 (0.00)f
Exercise energy expenditure information reported    0.83 (0.17)f     0.83 (0.17)f     1.00 (0.00)f

Superscript letters denote the following level of agreement between observers17: a, poor; b, slight; c, fair; d, moderate; e, substantial; f, almost perfect.
Constant = 100% agreement between observers, preventing kappa analysis.
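
As an illustration of how the entries in Table 3 can be read, the following minimal Python sketch computes an unweighted Cohen's kappa for two observers scoring a single TESTEX item and maps it to the agreement bands of Landis and Koch.17 It is not the analysis code used in this study: the function names and the example scores are hypothetical, and returning None for "Constant" items is an assumption that mirrors the table footnote, since kappa is undefined when both observers give every study the same score.

```python
import math
from typing import Optional, Sequence


def cohen_kappa(rater_a: Sequence[int], rater_b: Sequence[int]) -> Optional[float]:
    """Unweighted Cohen's kappa for two raters scoring the same studies.

    Returns None when chance agreement equals 1 (both raters used one
    identical score throughout), where kappa cannot be computed.
    """
    a, b = list(rater_a), list(rater_b)
    if len(a) != len(b) or not a:
        raise ValueError("both raters must score the same, non-empty set of studies")
    n = len(a)
    # Observed proportion of studies on which the raters agree.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Agreement expected by chance from each rater's marginal proportions.
    expected = sum((a.count(c) / n) * (b.count(c) / n) for c in set(a) | set(b))
    if math.isclose(expected, 1.0):
        return None  # "Constant" in Table 3 terms
    return (observed - expected) / (1.0 - expected)


def landis_koch(kappa: float) -> str:
    """Agreement label for a kappa value using the Landis and Koch bands."""
    if kappa < 0.0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


# Hypothetical item scores (1 = point awarded, 0 = not awarded) for ten studies.
observer_1 = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
observer_2 = [1, 1, 0, 1, 1, 1, 1, 0, 0, 1]

kappa = cohen_kappa(observer_1, observer_2)
if kappa is None:
    print("Constant")
else:
    print(f"kappa = {kappa:.2f} ({landis_koch(kappa)})")
```

In this made-up example the observers agree on 8 of 10 studies, chance agreement is 0.58, and kappa is therefore about 0.52, which falls in the moderate band.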


exercise training in clinical populations. The TESTEX scale uses 12 criteria, with some criteria scoring more than one possible point, for a maximum score of 15 points, as compared to the PEDro scale, which uses 11 criteria for a maximum score of 10 points.
In contrast to the PEDro scale, the TESTEX scale takes eligibility criteria into account. Both scales award 1 point for the concealment of allocation. Subsequent blinding of patients and therapists is nearly always unachievable in exercise training studies and does not attract any points. The TESTEX scale expands on outcome measures including reports on study withdrawals, session attendance, intention-to-treat (ITT) analyses, reporting sedentary control crossover to exercise, periodic adjustment of exercise load so that intensity remains constant, adverse events and measurement errors, as well as description of exercise characteristics which allows a calculation of exercise volume and energy expenditure. We acknowledge that exercise training studies do not lend themselves to participant blinding and this may introduce measurement error.
The inter-observer reliability of the 15 different TESTEX items ranged between moderate and almost perfect agreement, suggesting that observers, varying in experience, can reach an acceptable level of reliability without specific training or familiarization. These findings reflect the clarity of our descriptions for each item of the TESTEX and are consistent with those reported for the PEDro scale.7 The moderate agreements occurred consistently for the criteria of 'Study withdrawals less than 15%'. Follow-up interviews revealed some minor oversights among the observers, whereby points were awarded to studies reporting withdrawals of only the treatment groups, and not the control groups. Given that this is clearly stated in the TESTEX criteria, and only minor disagreements were observed, there appears to be limited threat to the reliability of each of the TESTEX items.
The observers also agreed, almost perfectly, on the summated TESTEX score, with ICCs ranging from 0.91 to 0.96. In a sample of 19 studies, the typical difference between observers' summated TESTEX scores ranged from 1 to 2 points and was not systematically different. We applied the TESTEX criteria to a previous meta-analysis performed by some of the current authors in the area of exercise intervention, accounting for a worst-case error of 2 points. An error of 2 points would not have resulted in the exclusion of research papers from the meta-analysis and would not have altered the conclusions of the study. On this basis, the worst case error of the summated TESTEX can be tolerated.
We feel that the TESTEX scale with the newly introduced criteria addresses common shortcomings in study design, quality and reporting in the exercise sciences. We are therefore confident the TESTEX scale will improve study design and reporting, thus qualifying inferences and conclusions in the exercise sciences.

Conclusion
The TESTEX scale is a new reliable tool, specific to exercise scientists, that facilitates a comprehensive review of exercise training trials.

Acknowledgements
The authors would like to acknowledge the researchers who assisted with the review procedures.
The authors report no conflicts of interest.

References
1. Batterham A, George KP. Reliability in evidence-based clinical practice: a primer for allied health professionals. Phys Ther Sport 2003; 4: 122–8.
2. Antes G. The new CONSORT statement. Br Med J 2010; 340: c1432.
3. Jadad AR, Moore RA, Carroll D, et al. Assessing the quality of reports of randomized clinical trials: is blinding necessary? Control Clin Trials 1996; 17: 1–12.
4. Higgins JPT, Green S (Eds): Cochrane handbook for systematic reviews of interventions Version 5.1.0 [updated March 2011]. The Cochrane Collaboration; 2011. www.cochrane-handbook.org. [Accessed 31 July 2014]
5. Hartling L, Hamm MP, Milne A, et al. Testing the risk of bias tool showed low reliability between individual reviewers and across consensus assessments of reviewer pairs. J Clin Epidemiol 2012; 66: 973–81.
6. Verhagen AP, de Vet HC, de Bie RA, et al. The Delphi list: a criteria list for quality assessment of randomized clinical trials for conducting systematic reviews developed by Delphi consensus. J Clin Epidemiol 1995; 51: 1235–41.
7. Maher CG, Sherrington C, Herbert RD, et al. Reliability of the PEDro scale for rating quality of randomized controlled trials. Phys Ther 2003; 83: 713–21.
8. Cornelissen VA, Buys R, Smart NA. Endurance exercise beneficially affects ambulatory blood pressure: a systematic review and meta-analysis. J Hypertens 2013; 31: 639–48.
9. Smart N, Meyer T, Butterfield J, et al. Individual patient meta-analysis of exercise training effects on systemic brain natriuretic peptide expression in heart failure. Eur J Prev Cardiol 2012; 19: 428–35.
10. Smart NA, Dieberg G, Giallauria F. Functional electrical stimulation for chronic heart failure: a meta-analysis. Int J Cardiol 2013; 167: 80–6.
11. Cornelissen VA, Smart NA. Exercise training for blood pressure: a systematic review and meta-analysis. J Am Heart Assoc 2013; 2: e004473.


12. Ismail H, McFarlane JR, Nojoumian AH, et al. Clinical outcomes and cardiovascular responses to different exercise training intensities in patients with heart failure: a systematic review and meta-analysis. J Am Coll Cardiol Heart Fail 2013; 1: 514–22.
13. Ismail H, McFarlane J, Smart NA. Is exercise training beneficial for heart failure patients taking beta-adrenergic blockers? A systematic review and meta-analysis. Congest Heart Fail 2013; 19: 61–9.
14. Smart NA, Dieberg G, Giallauria F. Intermittent versus continuous exercise training in chronic heart failure: a meta-analysis. Int J Cardiol 2013; 166: 352–8.
15. Smart NA, Giallauria F, Dieberg G. Efficacy of inspiratory muscle training in chronic heart failure patients: a systematic review and meta-analysis. Int J Cardiol 2013; 167: 1502–7.
16. Fleiss J, Cohen J. The equivalence of weighted kappa and the interclass correlation coefficient as measures of reliability. Educ Psychol Measurement 1973; 33: 613–9.
17. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics 1977; 33: 159–74.
18. O'Connor CM, Whellan DJ, Lee KL, et al. Efficacy and safety of exercise training in patients with chronic heart failure: HF-ACTION randomized controlled trial. J Am Med Assoc 2009; 301: 1439–50.
