A Systematic Review of Validity Evidence For Checklists Versus Global Rating Scales in Simulation-Based Assessment
CONTEXT The relative advantages and disadvantages of checklists and global rating scales (GRSs) have long been debated. To compare the merits of these scale types, we conducted a systematic review of the validity evidence for checklists and GRSs in the context of simulation-based assessment of health professionals.

METHODS We conducted a systematic review of multiple databases including MEDLINE, EMBASE and Scopus to February 2013. We selected studies that used both a GRS and checklist in the simulation-based assessment of health professionals. Reviewers working in duplicate evaluated five domains of validity evidence, including correlation between scales and reliability. We collected information about raters, instrument characteristics, assessment context, and task. We pooled reliability and correlation coefficients using random-effects meta-analysis.

RESULTS We found 45 studies that used a checklist and GRS in simulation-based assessment. All studies included physicians or physicians in training; one study also included nurse anaesthetists. Topics of assessment included open and laparoscopic surgery (n = 22), endoscopy (n = 8), resuscitation (n = 7) and anaesthesiology (n = 4). The pooled GRS–checklist correlation was 0.76 (95% confidence interval [CI] 0.69–0.81, n = 16 studies). Inter-rater reliability was similar between scales (GRS 0.78, 95% CI 0.71–0.83, n = 23; checklist 0.81, 95% CI 0.75–0.85, n = 21), whereas GRS inter-item reliabilities (0.92, 95% CI 0.84–0.95, n = 6) and inter-station reliabilities (0.80, 95% CI 0.73–0.85, n = 10) were higher than those for checklists (0.66, 95% CI 0–0.84, n = 4 and 0.69, 95% CI 0.56–0.77, n = 10, respectively). Content evidence for GRSs usually referenced previously reported instruments (n = 33), whereas content evidence for checklists usually described expert consensus (n = 26). Checklists and GRSs usually had similar evidence for relations to other variables.

CONCLUSIONS Checklist inter-rater reliability and trainee discrimination were more favourable than suggested in earlier work, but each task requires a separate checklist. Compared with the checklist, the GRS has higher average inter-item and inter-station reliability, can be used across multiple tasks, and may better capture nuanced elements of expertise.
1 Division of Emergency Medicine, Department of Medicine, University of Washington School of Medicine, Seattle, Washington, USA
2 Department of Medicine, University of Calgary, Calgary, Alberta, Canada
3 Department of Medicine, University of British Columbia, Vancouver, British Columbia, Canada
4 Mayo Clinic Multidisciplinary Simulation Center, Mayo Clinic College of Medicine, Rochester, Minnesota, USA
5 Division of General Internal Medicine, Mayo Clinic, Rochester, Minnesota, USA

Correspondence: Jonathan S Ilgen, Division of Emergency Medicine, University of Washington School of Medicine, Harborview Medical Center, 325 9th Avenue, Box 359702, Seattle, Washington 98104-2499, USA. E-mail: ilgen@u.washington.edu
… the learner physically interacts to mimic an aspect of clinical care for the purpose of teaching or assessment. This includes (but is not limited to) high-fidelity and low-fidelity manikins, part-task trainers, virtual reality (including any computer simulation that requires non-standard computer equipment), animal models, and human cadaveric models used for teaching purposes.'24 The use of human standardised patients was not included in this definition, although the inter-rater reliability of GRSs and checklists with standardised patients in the OSCE setting was recently reported.18 We defined checklists as instruments with a dichotomous response format and more than one item; we excluded studies with only a single checklist item (i.e. an overall pass/fail only). We defined GRSs as instruments with more than two response options per item. Because these scales have been designed to allow global judgements, we included single-item summative GRSs (i.e. those that ask for a 'global impression'). To enable meaningful psychometric comparisons between instrument types, we excluded studies in which the GRS and checklist assessed different constructs.

Study identification and selection

Our search strategy has been previously published in full.22,24 To summarise briefly, an experienced research librarian developed a search strategy that included terms focused on the topic (e.g. simulat*), learner population (med*, nurs*, health occupations), and assessment (e.g. assess*, valid*). We used no beginning date cut-off, and the last date of search was 26 February 2013. Reviewers worked independently in pairs to screen studies for inclusion, resolving conflicts by consensus. From this large pool of studies, we identified studies that included both a GRS and a checklist that were designed to assess the same construct (reviewer inter-rater reliability [IRR] intraclass correlation coefficient [ICC]: 0.76). Some articles referred to more than a single checklist or a single GRS; in these instances, we selected tools using a hierarchy described previously.21

Data extraction

We abstracted data independently and in duplicate for all variables, resolving conflicts by consensus. Inter-rater reliability was substantial for nearly all variables (Appendix S1 [online] gives details on IRR for data abstraction). We noted the task being assessed, and classified the assessment as measuring technical skills (the learner's ability to demonstrate a procedure or technique) or non-technical skills (such as communication or team leadership). We collected validity evidence from five sources25 using previously defined operational definitions,22,26 which included: internal structure (inter-rater, inter-item and inter-station reliability, and factor analysis); content (processes used to ensure that items accurately represent the construct); relations to other variables (association with participant characteristics, such as training level, or scores from another instrument); response process (the relationship between the intended construct and the thought processes of participants or observers), and consequences (the assessment's impact on participants and programmes).26 We also coded information about the raters, instruments and tasks, and evaluated study quality using the Medical Education Research Study Quality Instrument27 (MERSQI) (IRR ICCs: 0.51–0.84).21

Data analysis

We pooled reliability and correlation coefficients using random-effects meta-analysis. We used Z-transformed coefficients for these analyses, and then transformed the pooled results back to the native format for reporting. We used the Spearman–Brown formula to adjust all reliability coefficients to reflect a single item, station or rater prior to pooling. We performed analyses separately for GRS and checklist scores. We quantified between-study inconsistency using the I² statistic; I² values of > 50%, 25–49% and < 25% indicate large, moderate and little inconsistency, respectively.28 We conducted planned sensitivity analyses excluding studies with single-item GRSs. To avoid undue influence from any single instrument, we conducted post hoc sensitivity analyses excluding all studies using the objective structured assessment of technical skill (OSATS) GRS29 because this tool comprised nearly one-third of our GRS sample.

Nearly all inter-rater reliability results in our sample comprised kappa, weighted kappa or an ICC corresponding to the Shrout and Fleiss (2, 1) ICC, all of which reflect the reliability of a single rater.30 Thus, we report (2, 1) inter-rater reliability coefficients and classify these using criteria proposed by Landis and Koch31 (fair [ICC values: 0.21–0.4], moderate [ICC values: 0.41–0.6], substantial [ICC values: 0.61–0.8]). For inter-item and inter-station reliability, which always used a (2, k) ICC (e.g. Cronbach's alpha),30 we used the Spearman–Brown formula to adjust pooled coefficients to reflect the median (or near-median) number of observations, and classified these as suboptimal (values < 0.70), good (0.70–0.89) or substantial (> 0.90).32 We also noted instances in which authors did not clearly identify analyses as reflecting inter-item or inter-station reliability, and performed sensitivity analyses excluding these studies. We classified correlation coefficients as small (0.10–0.29), moderate (0.30–0.49) and large (≥ 0.50) using thresholds proposed by Cohen.33 We used SAS Version 9.3 (SAS Institute, Inc., Cary, NC, USA) to perform all analyses.
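For reference, the adjustments named above correspond to standard formulas (the notation here is ours, stated for completeness). The Spearman–Brown formula relates the reliability of k raters, items or stations to that of a single observation; pooling is performed on Fisher Z-transformed coefficients and back-transformed; and I² follows from Cochran's Q with df equal to the number of studies minus one:

$$\rho_1 = \frac{\rho_k}{k - (k-1)\rho_k}, \qquad \rho_k = \frac{k\,\rho_1}{1 + (k-1)\,\rho_1},$$

$$Z = \operatorname{artanh}(r) = \tfrac{1}{2}\ln\frac{1+r}{1-r}, \qquad r = \tanh(Z),$$

$$I^2 = \max\!\left(0,\ \frac{Q - \mathrm{df}}{Q}\right) \times 100\%.$$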
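To make the pooling procedure concrete, the following is a minimal sketch of random-effects pooling of Z-transformed coefficients. The published analyses were run in SAS; this Python illustration, including the DerSimonian–Laird estimator and the usual 1/(n − 3) within-study variance for Fisher Z, is our assumption about implementation detail, not the authors' code.

```python
import math
from typing import List, Tuple

def fisher_z(r: float) -> float:
    """Fisher Z-transformation; pooling is performed on this scale."""
    return math.atanh(r)

def pool_random_effects(r: List[float], n: List[int]) -> Tuple[float, float]:
    """DerSimonian-Laird random-effects pooling of Z-transformed
    coefficients. Within-study variance is taken as 1/(n - 3), the usual
    choice for Fisher Z (an assumption). Returns (pooled r, I-squared %)."""
    z = [fisher_z(ri) for ri in r]
    v = [1.0 / (ni - 3) for ni in n]                # within-study variances
    w = [1.0 / vi for vi in v]                      # fixed-effect weights
    z_fe = sum(wi * zi for wi, zi in zip(w, z)) / sum(w)
    q = sum(wi * (zi - z_fe) ** 2 for wi, zi in zip(w, z))  # Cochran's Q
    df = len(z) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)                   # between-study variance
    w_re = [1.0 / (vi + tau2) for vi in v]          # random-effects weights
    z_re = sum(wi * zi for wi, zi in zip(w_re, z)) / sum(w_re)
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return math.tanh(z_re), i2                      # back-transform to r

# Hypothetical example: three GRS-checklist correlations from three studies.
pooled_r, i2 = pool_random_effects([0.70, 0.80, 0.75], [40, 60, 50])
print(f"pooled r = {pooled_r:.2f}, I2 = {i2:.0f}%")
```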
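Similarly, a minimal sketch of the Shrout and Fleiss (2, 1) ICC and of the Landis and Koch bands quoted above (the array shape, example data and helper names are illustrative only):

```python
import numpy as np

def icc_2_1(ratings: np.ndarray) -> float:
    """Shrout & Fleiss ICC(2,1): two-way random effects, absolute agreement,
    single rater. `ratings` has shape (n targets, k raters)."""
    n, k = ratings.shape
    grand = ratings.mean()
    ss_rows = k * ((ratings.mean(axis=1) - grand) ** 2).sum()  # targets
    ss_cols = n * ((ratings.mean(axis=0) - grand) ** 2).sum()  # raters
    ss_err = ((ratings - grand) ** 2).sum() - ss_rows - ss_cols
    ms_r = ss_rows / (n - 1)                 # between-targets mean square
    ms_c = ss_cols / (k - 1)                 # between-raters mean square
    ms_e = ss_err / ((n - 1) * (k - 1))      # residual mean square
    return (ms_r - ms_e) / (ms_r + (k - 1) * ms_e + k * (ms_c - ms_e) / n)

def classify_irr(icc: float) -> str:
    """Only the bands quoted in the text (Landis & Koch) are mapped here."""
    if 0.21 <= icc <= 0.40:
        return "fair"
    if 0.41 <= icc <= 0.60:
        return "moderate"
    if 0.61 <= icc <= 0.80:
        return "substantial"
    return "outside the quoted bands"

# Hypothetical example: five trainees scored by three raters.
scores = np.array([[3, 4, 3], [5, 5, 4], [2, 3, 2],
                   [4, 4, 5], [1, 2, 2]], dtype=float)
icc = icc_2_1(scores)
print(f"ICC(2,1) = {icc:.2f} ({classify_irr(icc)})")
```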
GRS anchors were most commonly behavioural (i.e. directly observable actions, n = 23); other anchors included proficiency (i.e. ranging from 'high' to 'low' performance without outlining specific behaviours, n = 10), Likert scale-based anchors (i.e. ranging from 'disagree' to 'agree', n = 5), expert/intermediate/novice performance (n = 1), and visual analogue scales (n = 3) (some studies used multiple anchor types). Thirteen studies used the OSATS GRS29 or very slight modifications of it, and another 14 studies used the OSATS as the starting point for a new instrument.
Table 1 Validity evidence (global rating scales‡ and checklists‡) and rater characteristics§ for the included studies (table body not reproduced)
* MIS = minimally invasive surgery; NTS = non-technical skills; MDs = practising physicians; PGs = postgraduate resident physicians; MSs = medical students; NAs = nurse anaesthetists.
† Study quality evaluated using the Medical Education Research Study Quality Instrument (MERSQI); maximum score 18 (see Appendix S3 for detailed results).
‡ IS = internal structure (R = inter-rater reliability; I = inter-item reliability; S = inter-station reliability; O = other); RoV = relationship to other variables (M = another measure; T = trainee characteristic); C = content; RP = response process; CQ = consequences.
§ Rater characteristics: E = expert; R = researcher; O = other; NR = not reported; G = raters trained to use GRS; C = raters trained to use checklist; NT = not trained; Blinding: R = blinded to other raters; T = blinded to trainee characteristics.
… assessed a subset of trainees, assignment was randomised in three studies, and was either non-random or not reported in 13 studies.

Ratings were performed in duplicate for all trainees in about two-thirds of the studies (GRS, n = 28; checklist, n = 28), and three studies performed duplicate ratings on a subset of trainees. In these 31 studies with duplicate ratings, the vast majority of raters were blinded to one another's scores (GRS, n = 29; checklist, n = 28). It was less commonly reported that raters were blinded to trainees' characteristics (GRS, n = 13; checklist, n = 13).

Correlation between instruments

Figure S1 (online) summarises the meta-analysis of correlation coefficients between GRSs and checklists in the 16 studies in which these analyses were available. The pooled correlation was large (r = 0.76, 95% confidence interval [CI] 0.69–0.81), with large inconsistency between studies (I² = 71%).
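As a worked illustration of the back-transformation described under Data analysis (assuming, as stated there, that the pooled estimate and its limits were computed on the Fisher Z scale):

$$\bar{Z} \approx 0.996 \;\Rightarrow\; r = \tanh(0.996) \approx 0.76, \qquad \tanh(0.848) \approx 0.69, \qquad \tanh(1.127) \approx 0.81,$$

which reproduces the reported estimate and its 95% CI.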
Although agreement was substantial for most abstracted data, agreement was poor for some items as a result, at least in part, of incomplete or unclear reporting. We addressed this by reaching consensus on all reported data. The strengths of this review include its use of a broad search strategy that did not exclude potential material on the basis of language or publication year, duplicate and independent data abstraction, rigorous coding of methodological quality, and the use of reproducible inclusion criteria encompassing a broad range of learners, outcomes and study designs.

Implications for research

We found numerous instances in which authors were vague in their reporting (such as uncertainty between inter-station versus inter-item reliability) or used non-standard methods (such as in the use of Cronbach's alpha to calculate inter-rater reliability). To facilitate useful interpretations and cross-study comparisons, we encourage authors to clearly define the facet(s) of variation (raters, items, stations, time), use reliability analyses appropriate to each facet, and then explicitly report these findings. Generalisability studies may be helpful in this regard.43
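To make this concrete, a generalisability analysis for the simplest case, a fully crossed trainee × rater design with a single rater facet, can be sketched as follows (an illustrative Python sketch using standard G-theory expected mean squares; the design, names and decision types are ours, not drawn from any included study):

```python
import numpy as np

def g_study_crossed(scores: np.ndarray, n_raters: int = 1):
    """Variance components and G coefficients for a fully crossed
    trainee x rater design, estimated from two-way ANOVA mean squares.
    `scores` has shape (n trainees, n raters observed)."""
    p, r = scores.shape
    grand = scores.mean()
    ms_p = r * ((scores.mean(axis=1) - grand) ** 2).sum() / (p - 1)
    ms_r = p * ((scores.mean(axis=0) - grand) ** 2).sum() / (r - 1)
    ss_e = ((scores - grand) ** 2).sum() - (p - 1) * ms_p - (r - 1) * ms_r
    ms_e = ss_e / ((p - 1) * (r - 1))
    var_e = ms_e                              # trainee x rater + error
    var_p = max(0.0, (ms_p - ms_e) / r)       # trainee (universe score)
    var_r = max(0.0, (ms_r - ms_e) / p)       # rater main effect
    g_rel = var_p / (var_p + var_e / n_raters)             # relative decisions
    g_abs = var_p / (var_p + (var_r + var_e) / n_raters)   # absolute decisions
    return var_p, var_r, var_e, g_rel, g_abs

# Hypothetical example: six trainees, each scored by the same four raters.
obs = np.array([[3, 4, 3, 4], [5, 5, 4, 5], [2, 3, 2, 2],
                [4, 4, 5, 4], [1, 2, 2, 1], [3, 3, 4, 3]], dtype=float)
var_p, var_r, var_e, g_rel, g_abs = g_study_crossed(obs, n_raters=2)
print(f"G (relative, 2 raters) = {g_rel:.2f}; Phi (absolute) = {g_abs:.2f}")
```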
Although our findings are generally supportive of GRSs and checklists, they clearly do not generalise to all GRSs or checklists. Yet we found several instances, particularly for the OSATS, in which authors cited evidence for a previously reported checklist in order to support their newly developed checklist. Such practices are not defensible. We remind researchers and educators that every new GRS and checklist must be validated independently and, further, that validity evidence must be collected afresh for each new application (e.g. task or learner group).26

Implications for practice

Our data support a more favourable view of checklists than has been suggested in earlier work.6 Average inter-rater reliability was high and slightly better for checklists than for GRSs, and discrimination and correlation with other measures were usually similar. The use of checklists may also diminish rater training requirements and improve the quality of feedback,41,44 although these issues require further study. However, each task requires a separate checklist and each task-specific checklist requires independent validation, especially in the context of assessing technical skills. As such, checklists will typically lag behind GRSs in the robustness of validity evidence. It is also important to highlight that, despite the perception that checklists offer more objective assessment, the construction of these tools often requires subjective judgements.

Global rating scales have important advantages. Compared with checklists, GRSs have higher average inter-item and inter-station reliability. Moreover, GRSs can be used across multiple tasks, obviating the need for task-specific instrument development and simplifying application-specific validation. Global rating scales may require more rater training, although subjective responses can capture nuanced elements of expertise7 or potentially dangerous deviations from desired practice,45 and reflect multiple complementary perspectives.14 Finally, we note the inseparable interaction between the person using the instrument and the instrument itself: neither the checklist nor the GRS will supplant the need for human expertise and judgement.

Contributors: JSI contributed to the conception and design of the work, and to data acquisition and interpretation, and drafted the paper. IWYM contributed to the conception and design of the work, and to data acquisition and interpretation, and assisted in the drafting of the paper. RH contributed to the conception and design of the work, and to data acquisition and interpretation. DAC contributed to the conception and design of the work, and to data analysis and interpretation, and assisted in the drafting of the paper. All authors contributed to the critical revision of the paper and approved the final manuscript for publication. All authors have agreed to be accountable for all aspects of the work.

Acknowledgements: we acknowledge Jason Szostek and Amy Wang (Division of General Internal Medicine), Benjamin Zendejas (Department of Surgery), and Patricia Erwin (Mayo Libraries), Mayo Clinic College of Medicine, Rochester, MN, USA; and Stanley Hamstra, Academy for Innovation in Medical Education, Faculty of Medicine, University of Ottawa, Ontario, Canada, for their assistance in the literature search and initial data acquisition. We thank Glenn Regehr, Associate Director of Research and Senior Scientist, Centre for Health Education Scholarship, Faculty of Medicine, University of British Columbia, Vancouver, Canada, for his constructive critique and insights.

Funding: this work was supported by intramural funds, including an award from the Division of General Internal Medicine, Mayo Clinic.

Conflicts of interest: none.

Ethical approval: not applicable.

REFERENCES

1 Hodges B, Regehr G, McNaughton N, Tiberius R, Hanson M. OSCE checklists do not capture increasing levels of expertise. Acad Med 1999;74:1129–34.
2 Regehr G, MacRae H, Reznick R, Szalay D. Comparing the psychometric properties of checklists and global rating scales for assessing performance on an OSCE-format examination. Acad Med 1998;73:993–7.
3 Ringsted C, Østergaard D, Ravn L, Pedersen JA, Berlac PA, van der Vleuten CP. A feasibility study comparing checklists and global rating forms to assess resident performance in clinical skills. Med Teach 2003;25:654–8.
4 Swanson DB, van der Vleuten CP. Assessment of clinical skills with standardised patients: state of the art revisited. Teach Learn Med 2013;25 (Suppl 1):17–25.
5 Archer JC. State of the science in health professional education: effective feedback. Med Educ 2010;44:101–8.
6 van der Vleuten CPM, Norman GR, De Graaff E. Pitfalls in the pursuit of objectivity: issues of reliability. Med Educ 1991;25:110–8.
7 Norman G. Checklists vs. ratings, the illusion of objectivity, the demise of skills and the debasement of evidence. Adv Health Sci Educ Theory Pract 2005;10:1–3.
8 Streiner DL, Norman GR. Health Measurement Scales: A Practical Guide to their Development and Use, 4th edn. New York, NY: Oxford University Press 2008.
9 Cunnington JPW, Neville AJ, Norman GR. The risks of thoroughness: reliability and validity of global ratings and checklists in an OSCE. Adv Health Sci Educ Theory Pract 1996;1:227–33.
10 Norman GR, van der Vleuten CPM, De Graaff E. Pitfalls in the pursuit of objectivity: issues of validity, efficiency and acceptability. Med Educ 1991;25:119–26.
11 Hodges B, McIlroy JH. Analytic global OSCE ratings are sensitive to level of training. Med Educ 2003;37:1012–6.
12 Govaerts MB, van der Vleuten CM, Schuwirth LT, Muijtjens AM. Broadening perspectives on clinical performance assessment: rethinking the nature of in-training assessment. Adv Health Sci Educ Theory Pract 2007;12:239–60.
13 Eva KW, Hodges BD. Scylla or Charybdis? Can we navigate between objectification and judgement in assessment? Med Educ 2012;46:914–9.
14 Schuwirth LWT, van der Vleuten CPM. A plea for new psychometric models in educational assessment. Med Educ 2006;40:296–300.
15 Lievens F. Assessor training strategies and their effects on accuracy, interrater reliability, and discriminant validity. J Appl Psychol 2001;86:255–64.
16 Holmboe ES, Hawkins RE, Huot SJ. Effects of training in direct observation of medical residents' clinical competence: a randomised trial. Ann Intern Med 2004;140:874–81.
17 Kogan JR, Hess BJ, Conforti LN, Holmboe ES. What drives faculty ratings of residents' clinical skills? The impact of faculty's own clinical skills. Acad Med 2010;85 (Suppl):25–8.
18 Brannick MT, Erol-Korkmaz HT, Prewett M. A systematic review of the reliability of objective structured clinical examination scores. Med Educ 2011;45:1181–9.
19 Khan KZ, Gaunt K, Ramachandran S, Pushkar P. The objective structured clinical examination (OSCE): AMEE Guide No. 81. Part II: organisation and administration. Med Teach 2013;35:1447–63.
20 Hettinga AM, Denessen E, Postma CT. Checking the checklist: a content analysis of expert- and evidence-based case-specific checklist items. Med Educ 2010;44:874–83.
21 Cook DA, Brydges R, Zendejas B, Hamstra SJ, Hatala R. Technology-enhanced simulation to assess health professionals: a systematic review of validity evidence, research methods, and reporting quality. Acad Med 2013;88:872–83.
22 Brydges R, Hatala R, Zendejas B, Erwin PJ, Cook DA. Linking simulation-based educational assessments and patient-related outcomes: a systematic review and meta-analysis. Acad Med 2014. Epub ahead of print November 4, 2014. doi: 10.1097/ACM.0000000000000549.
23 Moher D, Liberati A, Tetzlaff J, Altman DG, PRISMA Group. Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. Ann Intern Med 2009;151:264–9.
24 Cook DA, Hatala R, Brydges R, Zendejas B, Szostek JH, Wang AT, Erwin PJ, Hamstra SJ. Technology-enhanced simulation for health professions education: a systematic review and meta-analysis. JAMA 2011;306:978–88.
25 Messick S. Validity. In: Linn RL, ed. Educational Measurement, 3rd edn. New York, NY: American Council on Education and Macmillan 1989;13–103.
26 Cook DA, Beckman TJ. Current concepts in validity and reliability for psychometric instruments: theory and application. Am J Med 2006;119:166.e7–16.
27 Reed DA, Cook DA, Beckman TJ, Levine RB, Kern DE, Wright SM. Association between funding and quality of published medical education research. JAMA 2007;298:1002–9.
28 Higgins JP, Thompson SG, Deeks JJ, Altman DG. Measuring inconsistency in meta-analyses. BMJ 2003;327:557–60.
29 Martin JA, Regehr G, Reznick R, MacRae H, Murnaghan J, Hutchison C, Brown M. Objective structured assessment of technical skill (OSATS) for surgical residents. Br J Surg 1997;84:273–8.
30 Shrout PE, Fleiss JL. Intraclass correlations: uses in assessing rater reliability. Psychol Bull 1979;86:420–8.
31 Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics 1977;33:159–74.
32 Nunnally JC. Psychometric Theory, 2nd edn. New York, NY: McGraw-Hill 1978.
33 Cohen J. Statistical Power Analysis for the Behavioral Sciences, 2nd edn. Hillsdale, NJ: Lawrence Erlbaum 1988.
34 Murray DJ, Boulet JR, Kras JF, McAllister JD, Cox TE. A simulation-based acute skills performance assessment for anaesthesia training. Anesth Analg 2005;101:1127–34.
35 White MA, DeHaan AP, Stephens DD, Maes AA, Maatman TJ. Validation of a high fidelity adult ureteroscopy and renoscopy simulator. J Urol 2010;183:673–7.
36 Finan E, Bismilla Z, Campbell C, Leblanc V, Jefferies A, Whyte HE. Improved procedural performance following a simulation training session may not be transferable to the clinical environment. J Perinatol 2012;32:539–44.
37 Gordon JA, Alexander EK, Lockley SW, Flynn-Evans E, Venkatan SK, Landrigan CP, Czeisler CA, Harvard Work Hours Health and Safety Group. Does simulator-based clinical performance correlate with actual hospital behaviour? The effect of extended work hours on patient care provided by medical interns. Acad Med 2010;85:1583–8.
38 Mazor KM, Ockene JK, Rogers HJ, Carlin MM, Quirk ME. The relationship between checklist scores on a communication OSCE and analogue patients' perceptions of communication. Adv Health Sci Educ Theory Pract 2005;10:37–51.
39 Sackett PR, Laczo RM, Arvey RD. The effects of range restriction on estimates of criterion interrater reliability: implications for validation research. Pers Psychol 2002;55:807–25.
40 Holmboe ES, Ward DS, Reznick RK, Katsufrakis PJ, Leslie KM, Patel VL, Ray DD, Nelson EA. Faculty development in assessment: the missing link in competency-based medical education. Acad Med 2011;86:460–7.
41 Holmboe ES, Sherbino J, Long DM, Swing SR, Frank JR. The role of assessment in competency-based medical education. Med Teach 2010;32:676–82.
42 Cook DA, Dupras DM, Beckman TJ, Thomas KG, Pankratz VS. Effect of rater training on reliability and accuracy of mini-CEX scores: a randomised, controlled trial. J Gen Intern Med 2009;24:74–9.
43 Schuwirth LWT, van der Vleuten CPM. Programmatic assessment and Kane's validity perspective. Med Educ 2012;46:38–48.
44 Schuwirth LW, van der Vleuten CP. Programmatic assessment: from assessment of learning to assessment for learning. Med Teach 2011;33:478–85.
45 Boulet JR, Murray D. Review article: assessment in anaesthesiology education. Can J Anaesth 2012;59:182–92.
46 Jansen JJ, Berden HJ, van der Vleuten CP, Grol RP, Rethans J, Verhoeff CP. Evaluation of cardiopulmonary resuscitation skills of general practitioners using different scoring methods. Resuscitation 1997;34:35–41.
47 Reznick R, Regehr G, MacRae H, Martin J, McCulloch W. Testing technical skill via an innovative 'bench station' examination. Am J Surg 1997;173:226–30.
48 Friedlich M, MacRae H, Oandasan I, Tannenbaum D, Batty H, Reznick R, Regehr G. Structured assessment of minor surgical skills (SAMSS) for family medicine residents. Acad Med 2001;76:1241–6.
49 Morgan PJ, Cleave-Hogg D, Guest CB. A comparison of global ratings and checklist scores from an undergraduate assessment using an anaesthesia simulator. Acad Med 2001;76:1053–5.
50 Murray D, Boulet J, Ziv A, Woodhouse J, Kras J, McAllister J. An acute care skills evaluation for graduating medical students: a pilot study using clinical simulation. Med Educ 2002;36:833–41.
51 Adrales GL, Park AE, Chu UB, Witzke DB, Donnelly MB, Hoskins JD, Mastrangelo MJ Jr, Gandsas A. A valid method of laparoscopic simulation training and competence assessment. J Surg Res 2003;114:156–62.
52 Datta V, Bann S, Beard J, Mandalia M, Darzi A. Comparison of bench test evaluations of surgical skill with live operating performance assessments. J Am Coll Surg 2004;199:603–6.
53 Murray DJ, Boulet JR, Kras JF, Woodhouse JA, Cox T, McAllister JD. Acute care skills in anaesthesia practice: a simulation-based resident performance assessment. Anesthesiology 2004;101:1084–95.
54 Weller J, Robinson B, Larsen P, Caldwell C. Simulation-based training to improve acute care skills in medical undergraduates. N Z Med J 2004;117:U1119.
55 Bann S, Davis IM, Moorthy K, Munz Y, Hernandez J, Khan M, Datta V, Darzi A. The reliability of multiple objective measures of surgery and the role of human performance. Am J Surg 2005;189:747–52.
56 Moorthy K, Munz Y, Adams S, Pandey V, Darzi A. A human factors analysis of technical and team skills among surgical trainees during procedural simulations in a simulated operating theatre. Ann Surg 2005;242:631–9.
57 Berkenstadt H, Ziv A, Gafni N, Sidi A. The validation process of incorporating simulation-based accreditation into the anaesthesiology Israeli national board exams. Isr Med Assoc J 2006;8:728–33.
58 Broe D, Ridgway PF, Johnson S, Tierney S, Conlon KC. Construct validation of a novel hybrid surgical simulator. Surg Endosc 2006;20:900–4.
59 Matsumoto ED, Pace KT, D'A Honey RJ. Virtual reality ureteroscopy simulator as a valid tool for assessing endourological skills. Int J Urol 2006;13:896–901.
60 Banks EH, Chudnoff S, Karmin I, Wang C, Pardanani S. Does a surgical simulator improve resident operative performance of laparoscopic tubal ligation? Am J Obstet Gynecol 2007;197:541.e1–5.
61 Fialkow M, Mandel L, Van Blaricom A, Chinn M, Lentz G, Goff B. A curriculum for Burch colposuspension and diagnostic cystoscopy evaluated by an objective structured assessment of technical skills. Am J Obstet Gynecol 2007;197:544.e1–6.
62 Goff BA, Van Blaricom A, Mandel L, Chinn M, Nielsen P. Comparison of objective, structured assessment of technical skills with a virtual reality hysteroscopy trainer and standard latex hysteroscopy model. J Reprod Med 2007;52:407–12.
63 Khan MS, Bann SD, Darzi AW, Butler PE. Assessing surgical skill using bench station models. Plast Reconstr Surg 2007;120:793–800.