
Eur Radiol
DOI 10.1007/s00330-017-4913-x

ULTRASOUND

Development of a reliable simulation-based test for diagnostic abdominal ultrasound with a pass/fail standard usable for mastery learning

Mia L. Østergaard 1 & Kristina R. Nielsen 1 & Elisabeth Albrecht-Beste 2 & Lars Konge 3 & Michael B. Nielsen 1

Received: 7 November 2016 / Revised: 13 March 2017 / Accepted: 24 May 2017

© European Society of Radiology 2017

Abstract

Background This study aimed to develop a test with validity evidence for abdominal diagnostic ultrasound with a pass/fail standard to facilitate mastery learning.

Method The simulator had 150 real-life patient abdominal scans, of which 15 cases with 44 findings were selected, representing level 1 from the European Federation of Societies for Ultrasound in Medicine and Biology. Four groups of experience levels were constructed: novices (medical students), trainees (first-year radiology residents), intermediates (third- to fourth-year radiology residents) and advanced (physicians with an ultrasound fellowship). Participants were tested in a standardized setup and scored by two blinded reviewers prior to an item analysis.

Results The item analysis excluded 14 diagnoses. Both internal consistency (Cronbach's alpha 0.96) and inter-rater reliability (0.99) were good, and there were statistically significant differences (p < 0.001) between all four groups, except the intermediate and advanced groups (p = 1.0). There was a statistically significant correlation between experience and test scores (Pearson's r = 0.82, p < 0.001). The pass/fail standard failed all novices (no false positives) and passed all advanced participants (no false negatives). All intermediate participants and six out of 14 trainees passed.

Conclusion We developed a test for diagnostic abdominal ultrasound with solid validity evidence and a pass/fail standard without any false-positive or false-negative scores.

Key Points
• Ultrasound training can benefit from competency-based education based on reliable tests.
• This simulation-based test can differentiate between competency levels of ultrasound examiners.
• This test is suitable for competency-based education, e.g. mastery learning.
• We provide a pass/fail standard without false-negative or false-positive scores.

Keywords Ultrasonography . Abdomen . Simulation training . Education, medical . Radiology

* Mia L. Østergaard
mlo@dadlnet.dk

1 Department of Radiology, Copenhagen University Hospital, Rigshospitalet, Blegdamsvej 9, afd. 2023, 2100 Copenhagen O, Denmark
2 Department of Clinical Physiology, Nuclear Medicine and PET, Copenhagen University Hospital, Rigshospitalet, Blegdamsvej 9, 2100 Copenhagen, Denmark
3 Copenhagen Academy for Medical Education and Simulation CAMES, The Capital Region of Denmark, Blegdamsvej 9, 2100 Copenhagen, Denmark

Introduction

Abdominal ultrasound examinations are associated with a risk of both false-positive and false-negative findings, with potentially grave consequences for diagnosis and treatment. While ultrasound itself is a safe modality, false findings can lead to additional testing or provide inappropriate reassurance, either of which can be associated with serious sequelae, such as prolonged or more serious illness and anxiety, as well as unnecessary radiation exposure or invasive procedures [1, 2].

The value of an ultrasound examination depends on the skills of the examiner. The acquisition of these skills, as well as their testing and ongoing maintenance, should be based upon a structured approach that prioritises objective assessment of competency [3, 4]. Clinical competence has traditionally been achieved with the use of the apprenticeship model

(‘see one, do one, teach one’), and educational goals have been set as a fixed timeframe or a number of procedures performed. In modern medicine, this approach has been increasingly questioned due to several important factors, including limited resident working hours, fatigue, supervision shortages, patient safety concerns, and an increasing focus on the reduction of human error [5, 6]. As a result, competency-based education has emerged. This approach focuses on continuous competency assessment, as well as the establishment of competency-based goals. The latter provides a basis for the educational approach called mastery learning, which focuses on training until a pre-defined competency level is reached. Mastery learning ensures that individual learners all achieve the same competence level, but not necessarily at the same pace [7].

Mastery learning requires a test in order to ensure skill acquisition at a sufficient level, as well as to help identify individual training needs and to serve as a well-defined goal. Goal setting is a crucial factor in skill acquisition, alongside motivation, feedback and opportunity for repetition [3]. Given that mastery learning is structured around a measurement of competency, it is critical to ensure that the test results can be trusted and represent a true reflection of competency. In other words, the test must demonstrate solid validity evidence. The framework by Messick [8] identifies five sources of validity evidence: (1) Content: does the test represent the relevant curriculum? (2) Response process: on what grounds are the test results interpreted? (3) Internal structure: is the test reliable and generalizable? (4) Relation to other variables: correlations within the test or to other assessment tools or measurements. (5) Consequences: who, how and on what grounds does the test score have an impact [9]? In a systematic review from 2015, none of the identified studies on simulation-based abdominal ultrasound training demonstrated a high level of evidence, and no tests with validity evidence were used [10]. To the best of our knowledge, there is no standardized test of competency in abdominal ultrasound with solid validity evidence. The aim of this study was to develop a test with validity evidence for abdominal diagnostic ultrasound and to establish a pass/fail standard to facilitate mastery learning.

Material and method

The study was approved by The Danish Ethical Committee with an exemption letter (protocol H-15013261). Test development was based on two identical simulators manufactured by Schallware (station 64; version 10013) and provided by the research fund of the Department of Radiology, Rigshospitalet. The simulators resemble a diagnostic ultrasound machine and consist of a hard drive, a keyboard, a sensor table with a mannequin torso, two touch screens and a mock ultrasound probe. Based on probe positioning, a pre-acquired scan is shown on one screen, while the other screen replicates the buttons of an ultrasound machine and displays any written information. The scans are of real-life patients, and each case is constructed on the basis of up to 2,000 raw B-mode scans [11]. No dynamic movements were simulated, and the Doppler technique was not available.

The simulators have 150 abdominal cases, which were all reviewed by a first-year radiology resident (M.L.Ø.) who selected 50 cases representing a variety of pathologies related to the liver, gallbladder, bile ducts, pancreas, spleen, urinary tract and pelvis [12].

An advanced ultrasound radiologist within the research group (M.B.N.) prioritized the 50 cases and, ultimately, agreement was reached on a group of 15 cases, including 44 findings that collectively represent the recommended level 1 knowledge from the European Federation of Societies for Ultrasound in Medicine and Biology [12] (Table 1). A short patient history and reason for referral were provided for each case. A maximum of 6 minutes was allowed for scanning, with no time limit for writing the answers. A pilot test was completed by a fourth-year radiology resident (K.R.N.), and minor changes were made based on the resident's feedback and with agreement from all study authors.

Four study groups were created, each with a different level of ultrasound experience, based on the minimum training requirements defined by the European Federation of Societies for Ultrasound in Medicine and Biology (EFSUMB) [12]:

1. Novices: Medical students with little or no ultrasound experience.
2. Trainees: First-year radiology residents who have completed an introduction to clinical ultrasound, including 4–8 weeks of focused training.
3. Intermediates: Third- or fourth-year radiology residents who have completed general ultrasound training, including a minimum of 4–12 months of focused training; corresponding to EFSUMB level 1 clinicians.
4. Advanced: Fully specialised radiology physicians who have completed an ultrasound fellowship and have a minimum of 3 years of experience, as well as current employment at an ultrasound clinic; corresponding to EFSUMB level 2 or level 3 clinicians.

Table 1 Test information showing diagnostic findings: case number in the test (Case No.) and in the simulator (Case No. in Sim.), diagnostic finding numbers before (Org. Diag. No.) and after (Final Diag. No.) the item analysis, and the item difficulty and discrimination number (Item Diss. No.). Diagnostic findings included in the final test are marked with a grey background

We estimated a maximum of 17 correct diagnoses in group 1 and a minimum of 39 correct diagnoses in group 4, with a power of 0.9 provided for groups of 14. Data were collected from December 2015 to May 2016. All eligible physicians and residents in southern and eastern Denmark were recruited by phone, by e-mail or in educational groups. Medical students were recruited through the weekly university paper. All participants provided written informed consent and stated their experience level. They were excluded if the group criteria were not met or if any additional ultrasound training exceeded the maximum for their group. Participants did not receive any compensation.
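For illustration, the sample-size reasoning above might be sketched as follows. This is a minimal sketch, not the authors' actual calculation: it assumes a two-sided two-sample t-test framing and an illustrative 15-point common standard deviation, neither of which is stated in the paper.

```python
# Hedged sketch of a power calculation for the expected group 1 vs group 4
# difference; the test family and the assumed SD are illustrative, not
# values reported in the study.
from statsmodels.stats.power import TTestIndPower

expected_novice_max = 17.0    # expected maximum correct diagnoses, group 1
expected_advanced_min = 39.0  # expected minimum correct diagnoses, group 4
assumed_sd = 15.0             # illustrative pooled standard deviation

effect_size = (expected_advanced_min - expected_novice_max) / assumed_sd

# Per-group sample size needed for 90% power at alpha = 0.05.
n_per_group = TTestIndPower().solve_power(
    effect_size=effect_size, power=0.9, alpha=0.05, alternative="two-sided"
)
print(f"Cohen's d = {effect_size:.2f}, n per group ≈ {n_per_group:.1f}")
```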

All participants were given a unique test ID that was randomly assigned by computer. They were introduced to the simulator by one radiology resident (M.L.Ø.) using two standardized simulation cases that were not included in the test. Written answers to the test questions were provided, and answer sheets were automatically locked away after each case. All tests were scored by two blinded reviewers using a list of correct diagnoses. There was a minimum of 0 and a maximum of 44 points, representing the total of 44 findings in the 15 cases. Test scores were calculated as the sum of the correct diagnoses. Incorrect diagnoses were noted separately. Empty answer boxes were given zero points (Table 1).
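As a concrete illustration of this scoring rule, a minimal sketch follows; the answer-sheet data structures are hypothetical stand-ins for illustration, not the authors' implementation.

```python
# Hedged sketch of the scoring rule described above; the answer-sheet
# representation is an assumption.
from dataclasses import dataclass

@dataclass
class CaseResult:
    answered: list[str]         # findings written down by the participant
    correct_findings: set[str]  # the reviewers' list of correct diagnoses

def score_test(cases: list[CaseResult]) -> tuple[int, int]:
    """Return (correct, incorrect). Correct answers sum to the test score
    (0-44 over the 15 cases); incorrect diagnoses are tallied separately,
    and an empty answer sheet simply contributes zero points."""
    correct = incorrect = 0
    for case in cases:
        for finding in case.answered:
            if finding in case.correct_findings:
                correct += 1
            else:
                incorrect += 1
    return correct, incorrect
```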
An item analysis was performed based on the categories of Downing et al.: for each reviewer, all the diagnoses were placed into four categories according to their difficulty and their ability to discriminate between experience levels, with category one being the optimal category (high discriminatory level and appropriate difficulty) [8]. According to Downing et al.'s parameters, the decision was made to include all category one, two and three diagnoses in the test. Selected category four diagnoses were individually chosen by an advanced ultrasound radiologist (M.B.N.) for inclusion in the test.

Statistical analysis was performed by two authors (L.K. and M.L.Ø.) using SPSS version 22 (IBM, Armonk, NY, USA). Cronbach's alpha was calculated using all included diagnoses in order to determine internal consistency, and from the diagnosis scores of both raters in order to determine inter-rater reliability. Relationship to other variables was explored by comparing scores (between participants and between raters), as well as incorrect diagnoses, across all groups with a one-way analysis of variance (ANOVA) with corrections for multiple comparisons (Bonferroni). The correlation between scanning experience in weeks and test scores was calculated with Pearson's r.
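A minimal Python analogue of these analyses is sketched below; the authors used SPSS 22, so the data layout, column names and toy values here are illustrative assumptions, not study data.

```python
# Hedged sketch of the reliability and group-comparison analyses described
# above; toy data only.
import pandas as pd
from scipy import stats

def cronbach_alpha(items: pd.DataFrame) -> float:
    """Internal consistency; columns are diagnoses (scored 0/1), rows are
    participants."""
    k = items.shape[1]
    item_var_sum = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_var_sum / total_var)

items = pd.DataFrame({"d1": [1, 0, 1, 0, 1, 1, 1, 1],
                      "d2": [0, 0, 1, 0, 1, 1, 1, 1]})
print(f"Cronbach's alpha = {cronbach_alpha(items):.2f}")

# Toy per-participant data: group, total test score, experience in weeks.
df = pd.DataFrame({
    "group": ["novice"] * 4 + ["trainee"] * 4,
    "score": [5, 7, 6, 4, 14, 16, 15, 13],
    "weeks": [0, 1, 0, 1, 6, 8, 7, 5],
})

# One-way ANOVA across groups, then Bonferroni-corrected pairwise t-tests.
groups = [g["score"].to_numpy() for _, g in df.groupby("group")]
f_stat, p_anova = stats.f_oneway(*groups)

pairs = [("novice", "trainee")]  # all group pairs in the full analysis
for a, b in pairs:
    t, p = stats.ttest_ind(df.loc[df.group == a, "score"],
                           df.loc[df.group == b, "score"])
    print(a, b, "Bonferroni-corrected p =", min(p * len(pairs), 1.0))

# Correlation between experience and test score.
r, p_r = stats.pearsonr(df["weeks"], df["score"])
print(f"Pearson r = {r:.2f} (p = {p_r:.3f})")
```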

Consequences were considered through the establishment of a pass/fail standard. This standard was determined using the contrasting groups' method, which allows later adjustment if needed, and was based on the mean scores and their associated standard deviations for group 1 and group 4 [8].

Results

The item analysis resulted in a final test with 30 of the 44 diagnoses, distributed over all 15 cases. The test included 12 category one, seven category two and six category three diagnoses, as well as five of the 19 category four diagnoses. The decision to include the category four diagnoses was based upon input from an advanced ultrasound radiologist (M.B.N.), who identified the five diagnoses as essential curriculum content (Table 1).

The internal structure of the test was very good, with a Cronbach's alpha of 0.96 for internal consistency and 0.99 for inter-rater reliability. The ANOVA (Bonferroni) showed statistically significant differences (p < 0.001) between test scores for all four groups, with the exception of the intermediate and advanced groups (p = 1.0) (Fig. 1). As shown in Fig. 2, a highly statistically significant correlation was seen between scanning experience in weeks and mean test scores, with Pearson's r = 0.82 (p < 0.001) (Table 2).

Fig. 1 Boxplot showing scores for the four groups with median, minimum, maximum and two outliers

Fig. 2 Test scores as a function of experience in weeks (transformed with log10 to aid graph drawing) for the four groups in a linear regression, with the pass/fail level of 14 marked (horizontal line)

Table 2 Demographic information and scoring results

                             Novices      Trainees     Intermediates  Advanced
                             (Group 1)    (Group 2)    (Group 3)      (Group 4)
Group size (no.)             16           14           15             15
Male                         5            6            5              10
Female                       11           8            10             5
Mean age (y)                 25.4         33.5         38.5           56.3
Experience (weeks), mean     0.5          7            62             1,012
Experience (weeks), median   0            6            44             1,200
Mean score                   6.0          15.1         22.1           22.7
  95% CI                     4.6–7.4      13.4–16.7    19.9–24.2      21.2–24.3
Min. score                   2            9            16             16
Max. score                   11           22           28             28
Mean incorrect               6.9          5.3          3.8            4.1
  95% CI                     4.9–8.9      3.9–6.8      2.6–5.0        2.7–5.6
Min. incorrect               0            2            1              0
Max. incorrect               13           11           7              9

CI confidence interval

A pass/fail standard was established at a test score of 14 correct diagnoses, based on the mean test scores of the novice and advanced groups (Fig. 3). The consequences were consistent with expectations: all novices failed the test (no false positives) and all advanced participants passed (no false negatives). All intermediate participants, as well as six out of 14 trainees, passed the test.

Fig. 3 Normal distributions, using mean and standard deviation, for the novice and advanced groups' test scores using the contrasting groups method: the pass/fail score of 14 is shown (vertical line) without any overlap between the two groups

As previously noted, all incorrect diagnoses were registered and analysed separately. A post hoc multiple comparison test (Bonferroni) based on the mean incorrect scores demonstrated a statistically significant difference between novices and intermediates (p = 0.02). However, the findings fell just short of statistical significance for the comparison between novices and advanced participants (p = 0.053). There were no differences between any other groups (p ≥ 0.8). A maximum of ten incorrect diagnoses was set as an additional criterion for passing the test, with the subsequent consequence of failing one additional trainee, but without any consequences for the novice, intermediate or advanced groups.

The participating medical students were all from the University of Copenhagen, and the physicians were from ten different hospitals and two private practices in southern and eastern Denmark.
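For illustration, the contrasting-groups cut-off and the combined pass rule might be sketched as follows. This is a hedged reconstruction, not the authors' procedure: the standard deviations are rough values inferred from the reported 95% confidence intervals, so the computed cut-off only approximates the published score of 14.

```python
# Hedged sketch of the contrasting-groups standard and the combined pass
# rule; the SDs below are assumptions back-calculated from the reported CIs.
import numpy as np
from scipy.stats import norm

def contrasting_groups_cutoff(mu_fail, sd_fail, mu_pass, sd_pass):
    """Score at which the two fitted normal densities intersect."""
    grid = np.linspace(mu_fail, mu_pass, 10_000)
    diff = norm.pdf(grid, mu_fail, sd_fail) - norm.pdf(grid, mu_pass, sd_pass)
    return grid[np.argmin(np.abs(diff))]

# Reported means: novices 6.0, advanced 22.7; SDs of ~2.9 are illustrative.
cutoff = contrasting_groups_cutoff(6.0, 2.9, 22.7, 2.9)

def passes(correct: int, incorrect: int,
           min_correct: int = 14, max_incorrect: int = 10) -> bool:
    """Pass requires enough correct diagnoses AND few enough incorrect ones."""
    return correct >= min_correct and incorrect <= max_incorrect

print(round(float(cutoff), 1))           # ≈ 14.3 under these assumptions
print(passes(correct=15, incorrect=11))  # False: fails on incorrect diagnoses
```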
Eur Radiol

Discussion

We have developed a test for abdominal ultrasound competency with solid validity evidence. In addition, we have established a pass/fail standard with no false-positive or false-negative results. The test is highly reliable and can discriminate between novices (medical students) and trainees (first-year radiology residents), as well as between trainees and intermediates (third- to fourth-year radiology residents). The test does not discriminate between the intermediate and advanced groups (specialised ultrasound radiologists). The consequences of the established pass/fail standard were appropriate, with a clear cut-off in the middle of the trainee group and no outliers from other groups on either side.

Solid validity evidence has been shown for tests in other areas of medical education, but to the best of our knowledge, there is no standardized test with validity evidence for diagnostic abdominal ultrasound [13, 14]. This test is well suited as a foundation for mastery learning, an approach that has been used effectively for procedural training in other specialties. This strategy allows trainees to individually determine how much time they need to achieve the desired skill level [15, 16]. Mastery learning uses repeated examinations for trainees who fail, which could make our test easier for them, as the questions and cases in the test are fixed. Further studies should explore the effect of repeated examinations, and it would be beneficial if other experts developed different tests in the future.

The simulator cases used in this study are all recordings of real patient scans with varied diagnoses, necessitating not only diagnostic skill but also thoroughness, given the presence of multiple positive findings in numerous cases. The latter issue was especially relevant for one case (no. 13), in which the patient's history identified risk factors for a diagnosis that was visible on ultrasound (abdominal aortic aneurysm). This diagnosis was identified by 83% of all participants, while only 23% identified the additional finding on the ultrasound (uterine fibroma), which was unrelated to the patient's risk-factor profile.

The administration of this test is straightforward and can easily be transferred to other settings. The inter-rater reliability is very high, suggesting that scoring could be done by one rater only and that the answer sheet could be used without prior training. Moreover, the validity evidence supports the use of this test for individually tailored skill development and assessment prior to clinical training. The widespread adoption of simulation-based learning in medical education has potential benefits for efficient skill development and resource utilisation, as a recent study in endobronchial ultrasound has shown [17, 18]. Research in abdominal ultrasound should further investigate these benefits and determine the mean training time needed, with a randomised trial exploring the potential for simulation-based education to enhance efficiency and prepare residents for future practice with a decreased need for supervision [19]. Ideally, outcome measures should include assessment of image optimisation, a systematic scanning approach and instrument control.

The pass/fail cut-off for the test was very clear, without any false positives or false negatives. This score could be used as a minimum criterion for inexperienced residents; however, the overlap in confidence intervals for the trainee and intermediate groups indicates that a competency standard could be set at a higher score (Fig. 1). The maximum of ten incorrect diagnoses was set as a separate passing requirement to filter out passing by extensive guessing or overdiagnosis. This was exemplified by a trainee with 15 correct diagnoses and 11 incorrect diagnoses, who failed the test based on the number of incorrect diagnoses.

Limitations

There are limitations to the fidelity of any simulator and, in this study, the scans with partial abdominal views may have provided a hint regarding the location of the positive findings. In addition, these images may have presented challenges to experienced clinicians who have developed systematic scanning techniques, and thereby given less experienced scanners an advantage.

Our test focused on scanning and identifying pathological findings, but other factors are also essential to an optimal examination (e.g. patient communication), and these should be learned in the clinical environment. Likewise, the test does not consider differential diagnostic skills or related clinical knowledge. It is important to acknowledge that trainees will still need initial clinical supervision even after completing a simulation-based mastery learning training programme. The high correlation between experience in weeks and test scores supports our conclusion that the test effectively discriminates between the study groups, but this finding should be interpreted with caution, as clinical competency does not necessarily correlate perfectly with experience [20].

Conclusion

We have developed a test for diagnostic abdominal ultrasound with solid validity evidence and a pass/fail standard without any false-positive or false-negative scores. The test can be directly implemented for the training and certification of health professionals using diagnostic abdominal ultrasound.

Compliance with ethical standards

Guarantor The scientific guarantor of this publication is Michael Bachmann Nielsen.

Conflict of interest The authors of this manuscript declare no relationships with any companies whose products or services may be related to the subject matter of the article.

Funding The authors state that this work has not received any funding.

Statistics and biometry One of the authors has significant statistical expertise.

Informed consent Written informed consent was obtained from all subjects in this study.

Ethical approval Institutional Review Board approval was obtained.

Methodology
• Prospective
• Observational
• Performed at one institution

References

1. Martínez-Ares D, Martín-Granizo Barrenechea I, Souto-Ruzo J, Yáñez López J, Pallarés Peral A, Vázquez-Iglesias JL (2005) The value of abdominal ultrasound in the diagnosis of colon cancer. Rev Esp Enferm Dig 97:877–886

2. Torloni MR, Vedmedovska N, Merialdi M, Betrán AP, Allen T, González R et al (2009) Safety of ultrasonography in pregnancy: WHO systematic review of the literature and meta-analysis. Ultrasound Obstet Gynecol 33:599–608
3. Ericsson KA (2008) Deliberate practice and acquisition of expert performance: a general overview. Acad Emerg Med 15:988–994
4. McGaghie WC (2015) Mastery learning: it is time for medical education to join the 21st century. Acad Med 90:1438–1441
5. European Society of Radiology (ESR) (2013) Organisation and practice of radiological ultrasound in Europe: a survey by the ESR Working Group on Ultrasound. Insights Imaging 4:401–407
6. Garg M, Drolet BC, Tammaro D, Fischer SA (2014) Resident duty hours: a survey of internal medicine program directors. J Gen Intern Med 29:1349–1354
7. McGaghie WC, Miller GE, Sajid AW, Telder TV (1978) Competency-based curriculum development in medical education: an introduction. Public Health Pap 68:11–91
8. Downing SM, Yudkowsky R (2009) Assessment in health professions education, 1st edn. Routledge, New York, p 108 and p 143
9. Ghaderi I, Manji F, Park YS, Juul D, Ott M, Harris I et al (2015) Technical skills assessment toolbox: a review using the unitary framework of validity. Ann Surg 261:251–262
10. Østergaard M, Ewertsen C, Konge L, Albrecht-Beste E, Bachmann Nielsen M (2016) Simulation-based abdominal ultrasound training – a systematic review. Ultraschall Med 37:253–261
11. Schallware (2016) Specifications of simulator. Schallware, Germany. Available via http://www.schallware.com/
12. European Society of Radiology (2015) Guidelines & recommendations – Appendix 5. EFSUMB. Available via http://www.efsumb.org/guidelines/guidelines01.asp
13. Thinggaard E, Bjerrum F, Strandbygaard J, Gögenur I, Konge L (2015) Validity of a cross-specialty test in basic laparoscopic techniques (TABLT). Br J Surg 102:1106–1113
14. Thomsen ASS, Kiilgaard JF, Kjaerbo H, la Cour M, Konge L (2015) Simulation-based certification for cataract surgery. Acta Ophthalmol 93:416–421
15. Dyre L, Nørgaard LN, Tabor A, Madsen ME, Sørensen JL, Ringsted C et al (2016) Collecting validity evidence for the assessment of mastery learning in simulation-based ultrasound training. Ultraschall Med 37:386–392
16. Jacobsen ME, Andersen MJ, Hansen CO, Konge L (2015) Testing basic competency in knee arthroscopy using a virtual reality simulator: exploring validity and reliability. J Bone Joint Surg Am 97:775–781
17. Konge L, Clementsen PF, Ringsted C, Minddal V, Larsen KR, Annema JT (2015) Simulator training for endobronchial ultrasound: a randomised controlled trial. Eur Respir J 46:1140–1149
18. Konge L, Albrecht-Beste E, Nielsen MB (2014) Virtual-reality simulation-based training in ultrasound. Ultraschall Med 35:95–97
19. Ericsson KA (2008) Deliberate practice and acquisition of expert performance: a general overview. Acad Emerg Med 15:988–994
20. Bransford JD, Schwartz DL (1999) Rethinking transfer: a simple proposal with multiple implications. Rev Res Educ 24:61
