Van der Vleuten, Assessment of Clinical Skills With Standardized Patients: State of the Art
To cite this article: C. P. M. van der Vleuten & David B. Swanson (1990) Assessment of clinical skills with
standardized patients: State of the art, Teaching and Learning in Medicine: An International Journal, 2:2, 58-76, DOI:
10.1080/10401339009539432
Teaching and Learning in Medicine, 1990, Vol. 2, No. 2, 58-76. Copyright 1990 by Lawrence Erlbaum Associates, Inc.
A little more than 10 years ago, the objective structured clinical examination (OSCE) was introduced. It includes several "stations," at which examinees perform a variety of clinical tasks. Although an OSCE may involve a range of testing methods, standardized patients (SPs), who are nonphysicians trained to play the role of a patient, are commonly used to assess clinical skills. This article provides a comprehensive review of large-scale studies of the psychometric characteristics of SP-based tests.

Across studies, reliability analyses consistently indicate that the major source of measurement error is variation in examinee performance from station to station (termed content specificity in the medical-problem-solving literature). As a consequence, tests must include large numbers of stations to obtain a stable, reproducible assessment of examinee skills. Disagreements among raters observing examinee performance and differences between SPs playing the same patient role appear to have less effect on the precision of scores, as long as examinees are randomly assigned to raters and SPs.

Results of validation studies (e.g., differences in group performance, correlations with other measures) are generally favorable, though not particularly informative. Future validation research should investigate the impact of station format, timing, and instructions on examinee performance; study the procedures used to translate examinee behavior into station and test scores; and work on rater and SP bias.

Several recommendations are offered for improving SP-based tests. These include (a) focusing on assessment of history taking, physical examination, and communication skills, with separately administered written tests used to measure diagnostic and management skills, (b) adoption of a mastery-testing framework for score interpretation, and (c) development of improved procedures for setting pass-fail standards. Use of generalizability theory in analyzing and reporting results of psychometric studies is also suggested.
Medical schools and other organizations responsible for certifying clinical competence have traditionally based their decisions on written examinations and faculty ratings of performance in
Requests for reprints should be sent to Dr. van der Vleuten at the Department of Educational Development and Research, P.O. Box 616, University of Limburg, 6200 MD Maastricht, The Netherlands.
ASSESSMENT OF CLINICAL SKILLS WITH SPs
clinical training. In recent years, there has been growing dissatisfaction with these procedures, because of the limited skills assessed by written tests and psychometric problems associated with ratings of performance. This has led to a new emphasis in assessment, in which the tasks presented to examinees are more representative of those faced in real clinical situations. These performance-based tests have become very popular, and medical schools worldwide use them for assessing clinical skills.1,2 Increasingly, schools are conducting and reporting studies of the psychometric characteristics of the tests they use. This article reviews the psychometric studies of one family of performance-based tests, those involving standardized patients (SPs).

Some Definitions and Terminological …

… murmurs, pulmonary crackles, joint abnormalities, etc.), or can simulate physical findings (e.g., abnormal reflexes, diminished breath sounds, elevated blood pressure). Examinees interact with SPs as though they were interviewing, examining, and counseling real patients. Often, SPs are also trained to complete checklists and rating forms at the end of encounters, recording the history information obtained, the examination maneuvers performed, and the counseling provided, as well as rating communication skills of examinees. Alternatively, faculty-raters may observe SP-examinee encounters and complete checklists and rating forms.

SP-based tests have been used to assess a broad range of clinical skills. Most often, they are used to measure history taking, physical examination, and communication skills, although skills in diagnosis,
Scoring systems vary extensively across studies. Some groups calculate only overall scores for each station, aggregating across the skills measured and the checklists, rating forms, and written test materials associated with a station; for these groups, our review of psychometric results must be based on overall station scores. Other groups retain and use subscores for individual station components, most often by calculating one or more subscores based on interaction with an SP and one or more subscores related to written follow-up questions; an overall score for the station may or may not be calculated in addition. Because this review concerns SP-based assessment, when multiple subscores have been used, we have focused on psychometric results for scores based on direct interaction with SPs (generally those related to data-gathering and communication skills). Results for other scores are reported if they provide insight into key issues in test design and score interpretation.

Investigators also vary in the methods used to aggregate (sub)scores across stations: Some groups form composite scores by averaging individual station scores, yielding a single composite test score; other groups form a composite score for each group of similar stations; a third alternative (used by groups calculating multiple subscores for each station) is to calculate a profile of composite scores corresponding to station subscores. Psychometric analyses may or may not be reported for all scores, depending on the purpose of the test, the skills viewed as important by the investigator group, the reliability of the subscores, and other factors. In accord with the primary objectives of the review, we focus on composite scores based on direct interaction with SPs. In reporting reliability results, we have adjusted total testing time and testing time per station to include only the time actually spent interacting with SPs. The only exceptions occur in those studies where investigators reported only results for a single composite score that included both SP-based and written components; in these instances (explicitly noted in the text and tables), total testing time and testing time per station reflect all components. Although this overall approach is somewhat artificial, due to variation in station format, it was necessary to make cross-study comparisons more meaningful.

Studies Included in the Review

The review includes published (and some unpublished) studies of the psychometric characteristics of SP-based tests in which four criteria are met.

1. The test must have been administered to a minimum of 40 medical students or residents.

2. Examinees must have completed a minimum of three stations.

3. The total number of SP-examinee encounters (product of examinees and stations per examinee) must have been at least 400. Although larger examinee and station sample sizes are clearly desirable for accurate estimation of key psychometric parameters, these minimum values were established in recognition of the logistical intricacies and resource requirements of SP-based tests. Because the review integrates results across studies, fluctuations due to limited sample sizes should average out.

4. Results of reliability and validity analyses had to be reported in sufficient depth to be interpretable. A number of studies were eliminated on this basis, either because reliability or validity information was not reported, or because the reported results were based on inappropriate estimation procedures. We revisit this problem at the end of the article.

Table 1 lists the studies that met our inclusion criteria and form the basis for the review. For each study, the table provides the institution responsible for test development, citations to reports and publications, the type of examinees tested, and the station formats used. Because SP-based tests from the same institution and investigator group tend to share common features, studies are grouped by institution. If several independent studies were conducted at the same institution, they are shown as separate "data sets" in the table under a common institutional heading. If a data set is described in several reports, the results are integrated across them in discussion and later tables.

Organization of the Review

The remainder of the review is divided into four sections. The first section discusses the reproducibility (reliability) of SP-based scores and pass-fail decisions, integrating results across studies through use of generalizability theory.4,5 The next section summarizes research on the validity of SP-based test scores. The third section discusses research on the impact of SP-based tests on the educational process. In an effort to address the practical needs of SP-based test users, these three sections are structured as a series of responses to key questions in SP-based testing. The last section summarizes the state of the SP-based testing art, presents some ideas for improvement of SP-based tests, identifies several areas for further research, and makes some methodological observations and recommendations for future studies.
Institution (Data Set)                              Examinees                                  Station Format
…                                                   …                                          … written follow-upa
National Board of Medical Examiners (NBME)66        Senior students                            20 min for history and initial management
Southern Illinois University (SIU)14,32,35,67,68    Senior students                            15 min for history and physical, 15 min for written follow-upb
University of Texas Medical Branch at Galveston (UTMB)
  UTMB Data Set 130,69,70                           Junior students                            5 min for history or physical, 5 min for written follow-upb
  UTMB Data Set 271                                 First- and second-year medicine residents  5 min for history or physical, 5 min for written follow-upb
  UTMB Data Set 372                                 Junior students                            5 min for history or physical, 5 min for written follow-upb
University of Toronto17,73                          Foreign medical graduates                  5 to 10 min for history or physical, 5 min for written follow-upa

aWritten follow-up not included in testing time or calculation of scores in later tables. bWritten follow-up included in testing time and calculation of scores in later tables.
Reproducibility of SP-Based Scores and Pass-Fail Decisions

This section discusses issues related to the reproducibility of SP-based tests, integrating results across studies. We begin by presenting a general conceptual framework for thinking about the reproducibility of SP-based tests. This is followed by discussion of the various factors that influence the reproducibility of SP-based scores and pass-fail decisions.

A Conceptual Framework for the Reproducibility of SP-Based Tests

The purpose of any test is to draw inferences about the ability of examinees that extend beyond the particular items used to the larger domain from which the items are sampled. Depending on the size and nature of the sample, these inferences can be more or less reproducible (reliable) and more or less accurate (valid). From this perspective, test design is basically the development of a sampling plan that reflects the skills and areas to be assessed.

For SP-based tests, SPs, raters (either observers or the SPs themselves), and stations are sampled from larger domains of SPs, raters, and stations that might have been used on the test. Test scores are reproducible if an examinee's score is reasonably stable across different but similar (randomly parallel) samples of SPs, raters, and stations, and reproducibility (generalizability) coefficients can be thought of as the expected correlation between scores derived from these similar samples. For an estimate of an examinee's skill level to be reproducible (e.g., a reproducibility coefficient greater than .8), an adequate number of SPs, raters, and stations must be included in the sample that the test comprises. Lack of interrater agreement in scoring examinee behavior, inconsistency in SP performance, and variation in an examinee's performance across stations all affect the reproducibility of scores.
The structure of the next three subsections follows from this conceptualization of the reproducibility of SP-based tests. First, rater-related sources of measurement error are examined; this is followed by discussion of SP-related sources of measurement error. Although SPs often act as raters as well as patients, it is important to keep the two sources distinct conceptually, because it is common practice to train multiple SPs to play the same role. Ratings (whether provided by an SP or an observer) could be perfectly accurate, with different SPs still varying extensively in how they play the same patient role. Next we consider station-related sources of measurement error, the impact of variation in an examinee's performance from station to station (often referred to as content specificity in the medical-problem-solving literature) on reproducibility of scores. This is the largest source of measurement error in SP-based … sizable reduction in testing time requirements, an important consideration, because SP-based tests are very resource intensive.

In large-scale SP-based testing, it is often necessary to develop several test forms for use at multiple sites over an extended period of time. These forms generally differ in difficulty and discrimination; consequently, the score received by any particular examinee is influenced by the test form used. In the last subsection, we discuss the problem of statistically adjusting (equating) scores on alternate test forms to put them on the same scale.

Rater-Related Sources of Measurement Error

How well do raters agree in scoring individual …
… stations, as long as errors are nonsystematically related to examinees (e.g., not associated with the site where the exam was taken, not related to the examinees' race, sex, or appearance, etc.). This issue is explored further in later subsections.

How many raters are required per station? Because interrater agreement is fairly good, it is generally unnecessary to use more than one rater per station, particularly for relatively long tests involving many stations. Analyses performed with the University of Adelaide data set10 provide a good illustration. In this study, data was accumulated on approximately 400 examinees over a 4-year period encompassing eight test forms. Each examinee encountered three to five SPs (depending on the test form), and performance was rated by two faculty-physicians on checklists tailored to station content. Generalizability analyses statistically compared use of one and two raters per station at a variety of test lengths. Results indicated that use of multiple raters per station has only a marginal effect on reproducibility of scores. If a sufficient number of stations are used to obtain reproducible scores, a large enough sample of raters is automatically included with a single rater per station. If large numbers of raters are available, it is much more effective to increase the number of stations, assigning one to each. For example, a 2-hr test with two raters per station requires the same total rater time as a 4-hr test with one rater per station. In the Adelaide data set the reproducibility of the former was projected to be .64, whereas the reproducibility of the latter was projected at .75, a major improvement in precision.

Who should rate examinee performance? Some previous research has suggested that physician-observers are naturally either stringent (hawks) or lenient (doves), and these tendencies are resistant to training.11,12 In a recent study of the University of Limburg Skills Test using very detailed checklists, van der Vleuten et al.13 studied the impact of training on different types of raters. Trained and untrained groups of nonphysicians, medical students, and physician faculty rated the videotaped performance of examinees with two SPs. Results indicated that the need for and effectiveness of training varied across groups: It was least needed and least effective for physician-raters, more needed and effective for medical students, and most needed and effective for nonphysicians. Differences in accuracy among groups were almost eliminated by rater training.

Inspection of Table 2 supports these results. There are no consistent trends in level of interrater agreement as a function of rater characteristics: Adequate interrater agreement can be achieved through use of SPs or physicians as raters. Intuitively, it does seem likely that SPs and physicians may differ in the aspects of examinee performance that they can rate accurately. For example, physicians should be more attuned to logical sequencing of questions in history taking and technical adequacy of some physical examination maneuvers. SPs may be more sensitive to some communication skills (e.g., establishing rapport, sensitivity to patient needs, avoidance of jargon) and better judges of certain examination maneuvers, where feeling what is done provides important information (e.g., palpation generally; pelvic, rectal, and joint examinations specifically). Practical and educational considerations may be the most important factors in rater selection. If faculty physicians can participate, it may well be desirable for them to serve as raters, because observation of examinees provides useful feedback concerning instructional effectiveness in terms of the skill levels of trainees. If physicians are unavailable, nonphysicians (including SPs themselves) appear to provide a logical, less expensive, and satisfactory alternative.

Should checklists or rating scales be used? Review of Table 2 indicates that interrater agreement is generally better for checklists than for rating scales, though agreement is sufficiently good for both that either can be used, particularly in long tests involving large numbers of stations and raters. Presumably, interrater agreement is better for checklists because items are more concretely stated and can be judged more objectively. From the perspective of interrater agreement, when checklists and rating scales are both viable alternatives, checklists are to be preferred. From an educational perspective, checklists also provide better definition of expectations for examinees and more specific feedback on performance. However, it is difficult to develop checklists in several areas (e.g., attitudes, aspects of communication skills) without trivializing the aspect of examinee performance to be judged, and such validity considerations are more important than interrater agreement. In such situations, use of behaviorally anchored rating scales is generally advised, though we could find no research basis for this recommendation in the SP literature.

SP-Related Sources of Measurement Error

What are the measurement consequences of using several SPs to play the same role? Several research groups have investigated this question.
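The station-versus-rater trade-off described under "How many raters are required per station?" can be sketched with a standard two-facet generalizability projection. All variance components below are invented for the illustration; they are not the published Adelaide estimates, though the resulting pattern is the same.

```python
# Illustrative two-facet projection: examinees crossed with stations,
# raters nested within stations. Variance components are hypothetical.
VAR_PERSON         = 0.040  # true examinee variance
VAR_PERSON_STATION = 0.140  # examinee-by-station interaction (content specificity)
VAR_PERSON_RATER   = 0.020  # examinee-by-rater (within station) disagreement

def g_coefficient(n_stations, n_raters_per_station):
    """Projected generalizability coefficient for a given sampling design."""
    error = (VAR_PERSON_STATION / n_stations
             + VAR_PERSON_RATER / (n_stations * n_raters_per_station))
    return VAR_PERSON / (VAR_PERSON + error)

# Equal total rater time (3 stations/hr): a 2-hr test with two raters per
# station versus a 4-hr test with one rater per station.
print(f"2 hr, 2 raters/station: {g_coefficient(6, 2):.2f}")   # 0.62
print(f"4 hr, 1 rater/station:  {g_coefficient(12, 1):.2f}")  # 0.75
```

Because the station facet contributes far more error than the rater facet, spending the same rater time on more stations rather than more raters per station buys the larger gain in precision.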
Using a subset of the stations from the SIU data set, Dawson-Saunders et al.14 compared the performance of groups of examinees who saw different SPs playing the same role. Statistically significant group differences were obtained on SP-based scores for five of the seven stations studied, suggesting that the particular SP seen had a major influence on scores. The investigators recommended several measures to reduce this influence: (a) using a single SP per case, (b) phrasing patient checklists in lay language, and (c) increasing the training provided for completion of checklists and rating forms. In small-scale studies, Hiemstra et al.15 and Vu et al.16 also found some evidence of differences between SPs trained to play the same role. However, analyses of the University of Toronto data set17 reported no significant differences between SPs playing the same role.

The real issue is not, however, if scores on individual stations differ systematically as a result of multiple SPs playing the patient role; the impact of using multiple SPs on overall reproducibility of scores is of primary importance. This issue was investigated in generalizability analyses of 10 stations from University of Massachusetts (UMass) Data Set 3 where two SPs played the same role.10 Results (shown in Table 3) indicated that, although scores derived from different SPs playing the same role are often significantly different, reproducibility of total test scores is not markedly affected. As long as examinees are randomly assigned to SPs playing the same role, differences average out across the test as a whole. This may be approximately true for many SP-based tests given at a single point in time at the same institution, but not for tests given at different times or at multiple sites, as discussed next.

A well-designed, large-scale study using the Manitoba/SIU data set9,18 studied the relative accuracy of SPs playing the same role at two institutions, Southern Illinois University and the University of Manitoba. This provides a particularly strong challenge to use of multiple SPs, because different individuals trained SPs at each school. Videotapes of 252 SP-examinee encounters from Manitoba and 197 SP-examinee encounters from SIU on 15 common stations were viewed by trained raters, who recorded the accuracy* of presentation on case-specific checklists of critical findings. Average accuracy exceeded 90% at roughly two thirds of the stations at both institutions. For the remaining stations, accuracy was generally between 80% and 90%, but several SPs had average accuracy scores between 69% and 75%. Errors in presentation of physical findings and patient affect were more common than errors in presentation of historical information. Use of multiple SPs contributed sizable random and systematic error into station scores, potentially having a major impact on overall pass-fail results.9

Additional research in this area is highly desirable. Large-scale use of SP-based tests depends, in part, on the ability of test developers to train multiple SPs at different sites to play the same roles accurately. Training procedures that facilitate "transportability" of stations are needed.

Table 3. Reproducibility of Scores as a Function of Testing Time and Number of SPs Used

                    Data Gathering Skills       Communication Skills
Test Length (hr)a   Same SP    Different SPs    Same SP    Different SPs
 1                  .34        .33              .59        .56
 2                  .51        .50              .74        .71
 4                  .67        .67              .85        .83
 6                  .76        .75              .90        .88
 8                  .81        .80              .92        .91
16                  .89        .89              .96        .95

Notes: Analyses reported in Swanson and Norcini10 based on UMass Data Set 3;65 table from "Factors Influencing the Reproducibility of Tests Using Standardized Patients" by D. Swanson and J. Norcini, 1989, Teaching and Learning in Medicine, 1, p. 161. Copyright 1989 by Lawrence Erlbaum Associates, Inc. Adapted by permission.
aThree stations per hour.

Station-Related Sources of Measurement Error

In virtually all efforts to assess clinical competence, examinee performance on one case is a poor predictor of performance on other cases. This has been termed the content-specificity of clinical competence.19 This phenomenon has been observed across measurement techniques, including written and computer-based simulations,20,21 vignette-based short-answer tests,22 chart audits,23 oral exams,24 and SP-based tests.25 Long tests, including large samples of cases, are necessary as a consequence, and careful determination of the number of SP-based stations required to obtain reproducible results is clearly merited.

*Accuracy was defined as the number of critical findings presented correctly divided by the number of critical findings that could have been presented, given the actions of the examinee, rescaled as a percentage. Thus, accuracy could vary from 0% to 100%, with the latter representing a perfectly accurate presentation.
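The accuracy index used in the Manitoba/SIU study is a simple ratio; a minimal sketch, with made-up checklist counts:

```python
def sp_accuracy(presented_correctly, presentable):
    """Percentage of critical findings presented correctly, out of the
    findings that could have been presented given the examinee's actions."""
    if presentable == 0:
        raise ValueError("no critical findings could have been presented")
    return 100.0 * presented_correctly / presentable

# A hypothetical SP who correctly presented 9 of 12 elicitable findings:
print(sp_accuracy(9, 12))  # 75.0
```

Note that the denominator is conditioned on the examinee's actions: findings the examinee never tried to elicit do not count against the SP.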
How much testing time is required to obtain reproducible scores? The studies reviewed in this article varied substantially in the methods used to estimate and report results of reliability analyses. To achieve greater comparability across studies, we have reanalyzed reported results using generalizability theory to estimate reproducibility of scores. Because total testing time and time per station varied across studies, our analysis focused on reproducibility as a function of testing time rather than number of stations.

Table 4 presents the results of the reanalysis. Generalizability coefficients in the table are analogous to coefficient alpha; they can be thought of as the expected correlation between scores derived from similar, but not identical exams using a different sample of stations but the same test … testing time was too short to yield reproducible SP-based scores in the remaining studies.

Second, generalizability coefficients are lowest for those studies in which written follow-up questions linked to SP presentation are included in scores and testing time (SIU and University of Texas Medical Branch at Galveston [UTMB] data sets). In part, this reflects the extra (doubling of) testing time per station that the written components require. However, even if testing time per station is halved to eliminate this effect, coefficients remain relatively low, indicating that use of linked written questions reduces generalizability and increases test length requirements. In part, this is a natural consequence of a shift in what is measured. The SP component of stations generally tests the hands-on skills involved in data-gathering and communication. Follow-up written …

Table 4

                       Test Length (hr)                                 Average Station
Data Set               1     2     3     4     6     8     12           Length (min)
Adelaide               .43a  .60   .69   .75   .82   .86   .90          6
CFPC                   .53a  .67   .77   .82   .87   .90   .93          15
ECFMG                  .51a  .68   .76   .81   .86   .89   .93          20
Limburg                .54   .69a  .77   .82   .87   .90   .93          15
UMass Data Set 1       .59   .74   .81a  .85   .90   .92   .94          30
UMass Data Set 2       .50   .67a  .75   .80   .86   .89   .92          10
UMass Data Set 3       .34   .51   .61   .67a  .76   .80   .86          15
NBME                   .53   .69   .77   .82a  .87   .90   .93          20
SIU                    .19   .31   .41   .48   .58   .65   .73a         40
UTMB Data Set 1        .24   .38   .49a  .56   .66   .72   .79          10
UTMB Data Set 2        .19   .32   .41a  .48   .58   .65   .74          10
UTMB Data Set 3        .31   .47a  .57   .64   .73   .78   .84          10
Toronto                .65   .79   .85a  .88   .92   .94   .96          6

aApproximate testing time used (rounded to nearest whole hour).
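The rows of Table 4 behave like a Spearman-Brown projection applied to testing time (for a single-facet stations design, the generalizability projection takes the same form): each row extrapolates the coefficient observed at the actually administered length to the other lengths. A sketch using the Adelaide row:

```python
def project(coeff_obs, hours_obs, hours_new):
    """Spearman-Brown projection of a reliability-like coefficient from an
    observed testing time to a new testing time (station length held fixed)."""
    k = hours_new / hours_obs  # lengthening factor
    return k * coeff_obs / (1 + (k - 1) * coeff_obs)

# Adelaide: coefficient .43 observed at roughly 1 hr of SP testing time.
for hours in (2, 3, 4, 6, 8, 12):
    print(f"{hours:2d} hr: {project(0.43, 1, hours):.2f}")
```

Rounded to two decimals, this reproduces the Adelaide row (.60, .69, .75, .82, .86, .90); the same projection applied to the SIU row shows why a long test is needed when the 1-hr coefficient is as low as .19.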
How long should individual stations be, and which formats use testing time most efficiently? Aside from lower reproducibility coefficients for stations including written components, there is no obvious relationship in Table 4 between testing time per station and generalizability of scores. Both short and long stations are effective (and ineffective), depending on the study. Apparently, longer stations tend to yield more measurement information, but the fact that more short stations can be completed in a fixed amount of time completely compensates. The range of station lengths for this compensatory effect is probably limited. Tests using very long stations (e.g., more than 1 hr each) would probably yield less reproducible scores, even controlling for total testing time, because other sources of measurement error (i.e., raters and SPs) would be …

Reproducibility of Domain-Referenced Test Scores

Most studies to date have adopted (often implicitly in selection of reliability estimation procedures) a norm-referenced framework for score interpretation. That is, scores are given meaning by reference to the performance of other examinees (e.g., an examinee's score is 1 SD below the mean, in the 95th percentile, etc.), and reproducibility is high to the extent that tests differentiate examinees and allow fairly precise rank ordering. For SP-based tests, it seems natural and desirable to use a domain-referenced framework for score interpretation, where an examinee's score is interpretable in absolute terms (e.g., an examinee's score indicates that he or she can take an adequate …

                                           Test Length (hr)
Data Set          Framework                1     2     3     4     6     8     12
Adelaide          Norm referenced          .43a  .60   .69   .75   .82   .86   .90
                  Domain referenced        .34   .51   .61   .67   .76   .81   .88
Limburg           Norm referenced          .54   .69a  .77   .82   .87   .90   .93
                  Domain referenced        .42   .59   .68   .74   .81   .85   .90
UMass Data Set 3  Norm referenced          .34   .51   .61   .67a  .76   .80   .86
                  Domain referenced        .21   .34   .44   .51   .61   .68   .76
SIU               Norm referenced          .19   .31   .41   .48   .58   .65   .73a
                  Domain referenced        .10   .19   .26   .32   .41   .48   .58
Torontob          Norm referenced          .54   .70   .78   .83   .88c  .91   .93
                  Domain referenced        .41   .58   .68   .74   .81   .85   .89

aApproximate testing time used (rounded to nearest whole hour). bWritten follow-up here included, because the necessary variance components for domain-referenced estimation were only reported for the combined data set. cActual testing time was 5 hr.
reproducibility is substantial, and longer tests are necessary as a consequence. Differences in reproducibility decrease as test length increases, because test forms tend to become more similar in difficulty as the number of stations increases.

Reproducibility of Pass-Fail Decisions

The discussion of reproducibility and testing time requirements has, thus far, focused on the reproducibility of scores. For many applications of SP-based tests, reproducibility of scores is not really important: reproducibility of pass-fail decisions is crucial. For example, SP-based tests given after completion of required clerkships and before medical school graduation often focus on history-taking and physical examination skills in an effort to ensure that these skills have been mastered to the level required for postgraduate training.29,32-34 In such testing situations, although it is obviously desirable to estimate an examinee's skill level precisely, the basic issue is whether ability exceeds the mastery point, and the reproducibility of pass-fail decisions is of primary importance, both practically and psychometrically.

What happens to reproducibility if a mastery-testing approach is adopted? If most examinees perform well relative to the pass-fail point, fairly short tests can still yield reproducible pass-fail decisions, particularly for examinees at upper ability levels.10,35 In such situations, use of "sequential testing" procedures may be advantageous. In this approach to assessment, a brief screening test is given initially to all examinees. Those who perform well relative to the pass-fail point are excused from further testing with a passing result. The test is continued for the remaining examinees, concentrating testing time and resources on "close call" decisions in the vicinity of the passing score. A hypothetical example based on the University of Adelaide data set is provided in Swanson and Norcini.10

What methods have been used to set standards for SP-based tests? Implicit in the mastery-testing approach is use of absolute standards in making pass-fail decisions. Unfortunately, almost no work has been done on development of absolute standard-setting procedures for SP-based tests analogous to those used for written exams.36 Most researchers have not had to confront the standard-setting problem, because test scores either did not count or were combined with other assessments. When pass-fail standards were needed, typically a relative standard was set (e.g., 2 SD below the mean), the score distribution was inspected, looking for "gaps," or a pass-fail point was selected arbitrarily. Given the current interest in use of SP-based tests as graduation exams or as a component of the licensure process, development of better standard-setting procedures should have a high priority.

Equating Scores on Alternate Forms of SP-Based Tests

In many situations where SP-based tests are used, several equivalent forms of a test are developed, similar in overall content and format, but with different stations on each form. Most commonly, multiple forms are required for security reasons when testing is spread out over time, either because the number of examinees is large or because several cohorts (e.g., successive clerkships or graduating classes) are to be tested. Regardless of the reason, whenever multiple test forms are used, they are unlikely to be equivalent in level and range of difficulty, and any direct comparison of scores would be unfair to those examinees tested with more difficult forms.37 A variety of statistical procedures, termed equating methods, are used with written tests to deal with this problem.38

What procedures have been used to equate scores on SP-based tests? For the most part, users of SP-based tests do not equate scores on alternate test forms. Although multiple forms are commonly used, scores are not adjusted for differences in form difficulty. The problem is simply ignored: Scores are interpreted as if they are on the same scale. However, a few investigators have developed some rough-and-ready procedures for coping with differences in form difficulty.

At the University of Adelaide, two test forms are used each year. Examinees are randomly assigned to forms, which should result in roughly equivalent ability groups taking each form. After test administration, the mean score on each form is calculated, and the difference between them is added to the score of each examinee taking the more difficult form.26 When used with written tests, this procedure is termed mean equating.39 Similar stations are also developed in pairs and randomly assigned to the two forms; this procedure should result in forms that are parallel in content and similar, though not identical, in difficulty.

In SP-based tests given at the University of Massachusetts (Data Sets 2 and 3), for security reasons separate test forms are typically constructed from a common pool of stations according to a fixed blueprint. After test administration, scores on each station are standardized to a mean of 500 and a standard deviation of 100,
VAN DER VLEUTEN & SWANSON
averaged across stations for each examinee, and restandardized across examinees. This procedure also results in a type of mean equating that adjusts for differences in form difficulty, assuming examinees and stations are randomly assigned to test forms (not quite true for UMass test administrations).

What other procedures might be used to equate scores on SP-based tests? A variety of equating procedures have been developed for written tests.38 Several of these could be adapted for use in SP-based assessment, at least for situations in which large numbers of examinees are tested with each form. The various forms of "common-item linear equating" would probably be the simplest and most practical to use. Using this approach, different test forms would include some common stations. The relative performance on common stations of examinee groups taking different test forms provides a basis for estimating group ability and heterogeneity, independent of form difficulty and discrimination. These estimates are then used as a basis for adjusting scores on alternate forms.39 Other equating procedures (equipercentile equating; methods based on item response theory) might also be applied; in general, however, the sample-size requirements for these procedures (at least several hundred examinees per form) are too large.

If assessment takes place within a mastery-testing framework, it may prove more practical to identify equivalent pass-fail points on alternate test forms, rather than attempting to equate scores. If primary focus is on pass-fail decisions, this approach could permit appropriate adjustment for differences in test forms without requiring large numbers of examinees. However, it will be necessary to develop improved standard-setting procedures before investigating this alternative.

Validity of SP-Based Test Scores

Validity refers to "the accuracy of a prediction or inference made from a test score"40 (p. 443). It is not a property of the test itself, but of interpretations based on test scores. Thus, the same test can have many validities, depending on how it is used and how scores are interpreted. Making matters worse, there is typically no "gold standard" with which scores can be compared; this is surely true for SP-based tests. As a consequence, validation requires the accumulation of evidence across a series of studies.

Traditionally, validation studies take one of three forms: (a) study of differences in group performance (e.g., comparison of scores received by examinees at different points in their training), (b) study of the relationships between scores and other measures (e.g., correlations with written test scores or ratings of clinical performance), and (c) logical analysis of test content (clinical tasks posed to examinees, items included on checklists, etc.). After a brief discussion of procedures for scoring stations, the following subsections review work in each category. The last subsection outlines some validation studies that we believe are needed.

Procedures for Scoring SP-Based Tests

Because the essence of validity is the accuracy of inferences based on test scores, it seems appropriate to begin discussion of validity by commenting on procedures used for scoring SP-based stations and tests. Interestingly, published articles rarely describe scoring procedures, beyond stating that scores were calculated as the percentage of possible points obtained on checklists and/or rating scales. Often, it is unclear how checklists were developed, how individual items were weighted in calculation of station (sub)scores, and how station (sub)scores were aggregated to obtain composite scores. Thus, material in the remainder of this section is based predominantly on speculation, rather than results of research.

What items should station checklists and rating forms include? Inspection of data-gathering checklists used by different groups reveals remarkable diversity. Checklist length varies from a few items to several dozen. On checklists used for history taking, some groups list questions that examinees should ask; others list answers that SPs provide. Intuitively, the latter seem easier to use, because roughly equivalent questions may be asked in several ways at varying levels of specificity, but the information provided is relatively unambiguous. Checklists used for the physical examination component of stations generally list examination maneuvers, though specificity level varies considerably (e.g., from "examines the abdomen" to "palpates the right upper quadrant for the liver" to a list of discrete steps for each quadrant). Rating forms for communication skills vary from a single global item to several dozen items concerning discrete behaviors. Individual items may be fairly concrete ("maintains good eye contact"), fairly abstract ("establishes good rapport"), or related to behavioral intentions ("I would recommend this examinee to a friend"). In part, differences in checklists and rating forms may reflect differences in the focus of assessment (more detailed lists for interpersonal skills and
physical diagnosis courses; less detailed lists for exams required for graduation from school). Reproducibility of test scores appears to be fairly invariant across the various rating form and checklist formats; validity of scores may not be. Some systematic research on the content and format of checklists and rating forms seems highly desirable.

How should items and subscores be weighted in calculating station scores? Relatively little research on this question has been reported in the standardized patient literature. Stillman27 explored the use of weighting in calculation of checklist scores and found little impact on reproducibility or validity. These findings parallel those obtained in studies of written patient-management problems20 and in psychometric studies of [. . .]

[. . .]ciently reproducible to be meaningful. Reporting a profile of subscores is particularly ill-advised, because individual subscores in a profile are likely to be very unstable.

Differences in Group Performance

Do examinees at different points in training perform differently? There have been relatively few studies of the performance of different groups on SP-based tests. In UMass Data Set 1, the performance of internal medicine residents improved as they progressed through training: Third-year residents performed better than second-year residents, who performed better than first-year residents.27,28 In the same study, residents from training programs with stronger reputations per-
Data Set and Measures                                                         Observed      True
                                                                              Correlationa  Correlationb
Adelaide
  Multiple-Choice Test (Locally Developed)                                    .33           .68
  Non-SP Skills Test                                                          .35           1.00
  Written Follow-Up (Short-Answer Test)                                       .40           .88
Limburg
  Multiple-Choice Test (Locally Developed)c                                   .63           .77
  Written Follow-Up (Multiple-Choice Test)c,d                                 .62           .77
Manitoba/SIU
  Multiple-Choice Test (NBME Part II)                                         .63           .82e
  Clinical Ratings                                                            .52           .66e,f
UMass Data Set 1
  Multiple-Choice Test (American Board of Internal Medicine Certifying Exam)  .24           .29e
  Clinical Ratings                                                            ns            —
  Months of Residency Training                                                .32           .37e
  Self-Ratings                                                                ns            —
UMass Data Set 2
  Multiple-Choice Test (NBME Part I)                                          .19           .24e
  Multiple-Choice Test (NBME Part II)                                         .27           .34e
  Clinical Ratings                                                            .44           .50e,f
  Written Follow-Up (Short-Answer and Multiple-Choice Test)                   .26           .36e
UMass Data Set 3
  Multiple-Choice Test (NBME Part I)                                          .10           .13e
  Multiple-Choice Test (NBME Part II)                                         .22           .28e
  Clinical Ratings                                                            .25           .31e,f
  Self-Ratings                                                                .08           .10e,f
  Written Follow-Up (Pattern Recognition Test)                                .22           .32e
NBME
  Multiple-Choice Test (NBME Behavioral Science Subtest)                      .37           .46e
  Multiple-Choice Test (NBME Psychiatry Subtest)                              .31           .40e
  Multiple-Choice Test (Other NBME)                                           ns            ns
SIU
  Multiple-Choice Test (NBME Part I)                                          .53           .63e
  Multiple-Choice Test (NBME Part II)                                         .51           .60e
  Clinical Ratings                                                            .65           .75e,f
UTMB Data Set 1
  Multiple-Choice Test (NBME Medicine Subtest)                                .43           .64
  Clinical Ratings                                                            .46           .73
UTMB Data Set 2
  Multiple-Choice Test (American Board of Internal Medicine Certifying Exam)  .24           .35
  Clinical Ratings                                                            .00           —
  Months of Residency Training                                                .31           .49
UTMB Data Set 3
  Multiple-Choice Test (NBME Medicine Subtest)                                .64           1.00
  Clinical Ratings                                                            .37           .56f
Toronto
  Multiple-Choice Test (Locally Developed)                                    .43           .50
  Written Follow-Up (Short-Answer Test)                                       .69           .91

ans = nonsignificant. bEntries corrected for measurement error (disattenuated) in both scores, unless otherwise noted. cReported in van der Vleuten et al.31 dAdministered 2 weeks after the SP test. eNo values were reported; entries were approximated from available results. fEstimate is disattenuated for unreliability in the SP score only.
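The "true" correlations in the table above are observed correlations corrected for attenuation due to measurement error (footnote b); correcting for unreliability in only one score (footnote f) amounts to setting the other score's reliability to 1. A minimal sketch of the classical correction, with hypothetical reliability values that are not taken from the article:

```python
import math

def disattenuate(r_obs: float, rel_x: float, rel_y: float = 1.0) -> float:
    """Classical correction for attenuation: estimate the 'true'
    correlation from an observed correlation and the reliability of
    each score. Leave rel_y at 1.0 to correct for unreliability in
    one measure only."""
    return r_obs / math.sqrt(rel_x * rel_y)

# Hypothetical: observed r = .40 between an SP-based score with
# reliability .55 and a written score with reliability .60.
print(round(disattenuate(0.40, 0.55, 0.60), 2))  # → 0.7
```

Disattenuated values can greatly exceed the observed correlation when reliabilities are low, and can reach 1.0 (as in several table entries) when the observed correlation is as large as the reliabilities permit.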
reliable). They provide a better index of the true strength of relationship between the measures involved.

What is the relationship between SP-based test scores and other measures? True correlations between scores on SP-based and multiple-choice tests vary extensively, from near zero to one, though the average value is fairly high. True correlations between SP-based scores and ratings of clinical performance are also moderately high. These results are not particularly surprising. The
be built in through the test construction process, as long as a sufficiently large sample of SP cases is included. However, several "threats to validity" remain to be studied.

First, it is unclear if scoring procedures accurately translate examinee behavior into appropriate, meaningful scores. In general, published reports vaguely describe scoring as calculation of "percentage of possible points" obtained. The validity of such scores depends on the appropriateness of the items, the weighting attached to each, and other factors. The potential for omitting important items and including unimportant ones is great. The former penalizes examinees who take indicated actions that are not listed; the latter rewards examinees who are unjustifiably thorough, a common problem in scoring written and computer-based patient management problems.20

A related problem follows from the time pressure under which SP-based tests are commonly administered. Developers of written tests try to obtain information on the time required for examinees to comfortably complete the test. If insufficient time is allowed (i.e., the test is "speeded"), score interpretation is more difficult. A straightforward method for investigating "speededness" of SP-based tests would involve comparing performance on identical stations under varying time conditions. Given the diversity of stations and station formats, it would not be surprising if this factor were important in examinee performance, and more specific guidelines for station construction could result.

Because station scores depend on the judgment of observers, characteristics of examinees unre-
trainees should have mastered. Station construction requires concrete definition of performance criteria. Reaching a consensus on these items should lead to more standardized instructional experiences and learning outcomes, particularly if the consensus is carefully communicated to both faculty and students.55 Trainees view assessment methods as indicators of what faculty believe is important, and careful design of exams can exert a major influence on learning.

The only published empirical work on the educational impact of SP-based tests has been at the University of Adelaide. The original motivation for developing a practical examination of clinical skills was a general faculty perception that students were spending a disproportionate amount of time studying for written tests relative to participating in clinical work on the wards.54,56-58 A practical exam with SP-based components was introduced to improve the congruence between the educational goals of the medical school and the assessment methods used, anticipating that a shift in students' learning activities might result. To investigate the impact on student study habits, questionnaires were sent to students before and after the introduction of the practical exam.54 Questionnaire results indicated that the exam had a dramatic impact on how students spent their time, decreasing time spent preparing for written tests and increasing ward-based learning activities. In addition, students reported a generally high level of satisfaction with the practical exam and rated it as substantially more relevant than the written tests to the work of an intern. These results have persisted since the practical exam was introduced.58

Discussion

The purpose of this article was to review psychometric research on SP-based tests. In this final section, we summarize the major conclusions reached in the review, present some suggestions for improved use of SP-based tests, and provide methodological observations and recommendations.

Summary of Conclusions

The review was divided into three major areas: (a) reproducibility of scores and pass-fail decisions, (b) validity of score interpretation, and (c) educational impact of tests. This summary also follows that organization.

Lack of interrater agreement in scoring examinee behavior, inconsistency in SP performance, and variation in examinee performance across stations all affect the reproducibility of scores. Interrater reliability is adequate, regardless of rater background, as long as requisite training is provided. Use of multiple SPs playing the same patient role does not generally reduce reproducibility very much. Variation in examinee performance across stations has the largest impact on the reproducibility of scores. In most testing situations, if a sufficiently large sample of stations is included, the resulting sample of raters and SPs will also be large enough to obtain reproducible scores. Exceptions can occur in large-scale testing situations, when examinees are tested at different times and/or at different sites—when there is major departure from random assignment of raters and SPs to examinees. Four to 8 hr of testing time are required to obtain reproducible scores for hands-on clinical skills; longer tests are required if stations include written questions linked to SPs. Otherwise, station format and testing time per station appear to be relatively unrelated to reproducibility. Domain-referenced score interpretation requires longer tests than does a norm-referenced approach. Mastery testing, in which only pass-fail results are of interest, has the potential to reduce test length requirements and costs. However, better standard-setting procedures must be developed to realize this potential. More work is needed on procedures for adjusting (equating) scores obtained on alternate test forms.

Results of validation studies have, for the most part, been encouraging, though not particularly informative. Groups at different stages of training obtain appropriately different scores, and relationships between SP-based scores and traditional measures of clinical competence are fairly strong. Content validity of SP-based tests should be particularly good due to the realistic clinical tasks included as stations, although additional attention to domain definition and blueprinting seems merited. Additional efforts should be devoted to test validation; this could include research on scoring procedures, examinee perceptions of tasks posed by stations, effect of station speededness, and rater bias.

There has been much rhetoric but little empirical work on the educational impact of SP-based tests, aside from Newble's efforts at the University of Adelaide. More research in this area is needed, because the hypothesized educational impact of SP-based tests has been a major factor in their increased use, despite high costs and psychometric shortcomings.

Suggestions for the Improved Use of SP-Based Tests

Several practical suggestions for improving SP-based tests and conserving testing resources follow
directly from the psychometric conclusions emerging from the review:

1. Scores on short SP-based tests are not meaningful, because they are not sufficiently reproducible. Profiles of subscores based on a small number of stations or short segments of each station are also unstable and should generally not be used.
2. There is no need to have more than one rater per station. If extra raters are available, the number of stations should be increased instead.
3. The decision to use nonphysicians (usually SPs) or physicians as raters has practical and educational elements. If faculty physicians serve as raters, they receive useful feedback concerning curriculum effectiveness through observation of examinees. Nonphysicians can also be used as [. . .]

[. . .] unclear. The format and content of checklists and rating forms and the procedures used for scoring are almost never described. More work is needed in these areas, beginning with better descriptions of methods already in use.

Procedures used for reliability estimation were sometimes unspecified and, occasionally, appeared to be wrong. Use of generalizability theory in analysis is absolutely required, because multiple sources of measurement error are commonly present. Variance component estimates should be reported (along with the standard errors of the estimates), so that readers can explore alternative uses of testing resources, and researchers can integrate results across studies. If multiple subscores are calculated, results of generalizability analyses should be reported for all of them.
12. Ludbrook J, Marshall VR. Examiner training for clinical examinations. British Journal of Medical Education 1971;5:152-5.
13. van der Vleuten C, van Luyk S, Ballegooijen A, Swanson D. Training and experience of medical examiners. Medical Education 1989;23:290-6.
14. Dawson-Saunders B, Verhulst S, Marcy M, Steward D. Variability in standardized patients and its effect on student performance. In I Hart, R Harden (Eds.), Further developments in assessing clinical competence (pp. 451-8). Montreal: Can-Heal, 1987.
15. Hiemstra R, Scherpbier A, Roze B. Assessing history-taking skills or . . . simulated patients' peculiarities. In I Hart, R Harden (Eds.), Further developments in assessing clinical competence (pp. 491-6). Montreal: Can-Heal, 1987.
16. Vu N, Steward D, Marcy M. An assessment of the consistency and accuracy of standardized patients' simulations. Journal of Medical Education 1987;62:1000-2.
17. Cohen R, Rothman A, Ross J, et al. A comprehensive assessment of graduates of foreign medical schools (Internal Report, University of Toronto), 1988.
18. Tamblyn R, Schnabl G, Klass D, Kopelow M, Marcy M. How standardized are standardized patients? In Proceedings of the 27th Research in Medical Education Conference (pp. 148-53). Washington, DC: Association of American Medical Colleges, 1988.
19. Elstein A, Shulman L, Sprafka S. Medical problem solving. Cambridge, MA: Harvard University Press, 1978.
20. Swanson D, Norcini J, Grosso L. Assessment of clinical competence: Written and computer-based simulations. Assessment and Evaluation in Higher Education 1987;12:220-46.
21. Norcini J, Swanson D. Factors influencing testing time requirements for simulation-based measurements: Do simulations ever yield reliable scores? Teaching and Learning in Medicine 1989;1:85-91.
22. De Graaff E, Post G, Drop M. Validation of a new measure of clinical problem-solving. Medical Education 1987;21:213-8.
23. Erviti V, Templeton B, Bunce J, Burg F. The relationships of pediatric resident recording behavior across medical conditions. Medical Care 1980;18:1020-31.
24. Swanson D. A measurement framework for performance-based tests. In I Hart, R Harden (Eds.), Further developments in assessing clinical competence (pp. 13-45). Montreal: Can-Heal, 1987.
25. van der Vleuten C, van Luyk S, Swanson D. Reliability (generalizability) of the Maastricht Skills Test. In Proceedings of the 27th Research in Medical Education Conference (pp. 228-33). Washington, DC: Association of American Medical Colleges, 1988.
26. Newble D, Swanson D. Psychometric characteristics of the objective structured clinical examination. Medical Education 1988;22:325-34.
27. Stillman P, Swanson D, Smee S, et al. Psychometric characteristics of standardized patients for assessment of clinical skills (Final Report to the American Board of Internal Medicine), 1986.
28. Stillman P, Swanson D, Smee S, et al. Assessing clinical skills of residents with standardized patients. Annals of Internal Medicine 1986;105:762-71.
29. Stillman P, Regan M, Swanson D. A diagnostic fourth year performance assessment. Archives of Internal Medicine 1987;147:1981-5.
30. Petrusa E, Blackwell T, Parcel S, Saydjari C. Psychometric properties of the objective clinical exam as an instrument for final evaluation. In I Hart, R Harden, H Walton (Eds.), Newer developments in assessing clinical competence (pp. 181-91). Montreal: Heal, 1986.
31. van der Vleuten C, van Luyk S, Beckers A. A written test as an alternative to performance testing. Medical Education 1989;23:97-107.
32. Williams R, Barrows H, Vu N, et al. Direct, standardized assessment of clinical competence. Medical Education 1987;21:482-9.
33. Newble D, Elmslie R, Baxter A. A problem-based criterion-referenced examination of clinical competence. Journal of Medical Education 1978;53:720-6.
34. Stillman P, Swanson D. Ensuring the clinical competence of medical school graduates through standardized patients. Archives of Internal Medicine 1987;147:1049-52.
35. Colliver J, Verhulst S, Williams R, Norcini J. Reliability of performance on standardized patient cases: A comparison of consistency measures based on generalizability theory. Teaching and Learning in Medicine 1989;1:31-7.
36. Livingston S, Zieky M. Passing scores. Princeton, NJ: Educational Testing Service, 1982.
37. Angoff W. Scales, norms, and equivalent scores. Princeton, NJ: Educational Testing Service, 1984.
38. Petersen N, Kolen M, Hoover H. Scaling, norming, and equating. In R Linn (Ed.), Educational measurement (pp. 221-62). New York: American Council on Education and Macmillan, 1989.
39. Kolen M. Traditional equating methodology. Educational Measurement: Issues and Practice 1988;7:29-36.
40. Cronbach LJ. Test validation. In RL Thorndike (Ed.), Educational measurement (pp. 443-507). Washington, DC: American Council on Education, 1971.
41. Dawes R, Corrigan B. Linear models in decision making. Psychological Bulletin 1974;81:95-106.
42. Newble D, Hoare J, Elmslie R. The validity and reliability of a new examination of the clinical competence of medical students. Medical Education 1981;17:165-71.
43. Stillman P, Rutala P, Nicholson G, Sabers D, Stillman A. Measurement of clinical competence of residents using patient instructors. In Proceedings of the 21st Research in Medical Education Conference (pp. 111-6). Washington, DC: Association of American Medical Colleges, 1982.
44. Robb K, Rothman A. The assessment of clinical skills in general internal medicine residents—Comparison of the objective structured clinical examination to a conventional oral examination. Annals of the Royal College of Physicians and Surgeons of Canada 1985;18:235-8.
45. Maatsch J. Model for a criterion-referenced medical specialty test (Final Report, Grant No. HS-02038-02). East Lansing: Michigan State University, Office of Medical Education Research and Development, 1980.
46. Maatsch J. Theories of clinical competence: The construct validity of objective tests and performance assessments. Paper presented at the International Conference on Evaluation in Medical Education, Beer Sheva, Israel, 1987.
47. Maatsch J, Huang R. An evaluation of the construct validity of four alternative theories of clinical competence. In Proceedings of the 25th Research in Medical Education Conference (pp. 69-74). Washington, DC: Association of American Medical Colleges, 1986.
48. Ebel R. Must all tests be valid? American Psychologist 1961;16:640-7.
49. Ebel R. Measuring educational achievement. Englewood Cliffs, NJ: Prentice-Hall, 1965.
50. Kane M. The validity of licensure examinations. American Psychologist 1982;37:911-8.
51. Frederiksen N. The real test bias: Influences of testing on teaching and learning. American Psychologist 1984;39:193-202.
52. Entwistle N. Styles of learning and teaching. Chichester, England: Wiley, 1981.
53. Frederiksen J, Collins A. A systems approach to educational testing. Educational Researcher 1989;18(9):27-32.
54. Newble D, Jaeger K. The effect of assessments and examinations on the learning of medical students. Medical Education 1983;17:165-71.
55. Bouhuijs P, van der Vleuten C, van Luyk S. The OSCE as a part of a systematic skills-training approach. Medical Teacher 1987;9:183-91.
56. Newble D. The assessment of clinical competence—A perspective from "down under." In I Hart, R Harden (Eds.), Newer developments in assessing clinical competence (pp. 40-5). Montreal: Heal, 1986.
57. Newble D. Improving the clinical and oral examination process. In I Hart, R Harden (Eds.), Further developments in assessing clinical competence (pp. 88-98). Montreal: Can-Heal, 1987.
58. Newble D. Eight years' experience with a structured clinical examination. Medical Education 1988;22:200-4.
59. Rainsberry P, Grava-Gubins I, Khan S. Reliability and validity of oral examinations in family medicine. In I Hart, R Harden (Eds.), Further developments in assessing clinical competence (pp. 399-405). Montreal: Can-Heal, 1987.
60. Grava-Gubins I, Khan S, Rainsberry P. Factor analysis of simulated office oral examinations in family medicine. In I Hart, R Harden (Eds.), Further developments in assessing clinical competence (pp. 406-17). Montreal: Can-Heal, 1987.
61. Conn H. Assessing the clinical skills of foreign medical graduates. Journal of Medical Education 1986;61:863-71.
62. Cody R. Additional analysis for the 1987 administration of the clinical skill exam (Internal report, Educational Commission for Foreign Medical Graduates), 1988.
63. Conn H, Cody R. Results of the second clinical skills assessment examination of the ECFMG. Academic Medicine 1989;64:448-53.
64. Klass D, Hazzards T, Kopelow M, Tamblyn R, Barrows H, Williams R. Portability of a multiple station, performance based assessment of clinical competence. In I Hart, R Harden (Eds.), Further developments in assessing clinical competence (pp. 434-42). Montreal: Can-Heal, 1987.
65. Stillman P, Regan M, Swanson D, et al. An assessment of the clinical skills of New England fourth year medical students. Academic Medicine, in press.
66. Templeton B, Best A, Samph T, Case S. Short-term outcomes achieved in interviewing medical students (Internal report, National Board of Medical Examiners), 1978.
67. Barrows H, Williams R, Moy R. A comprehensive [. . .] assessing clinical competence (pp. 425-33). Montreal: Can-Heal, 1987.
69. Petrusa E, Guckian J, Perkowski L. A multiple station objective clinical evaluation. In Proceedings of the 23rd Research in Medical Education Conference (pp. 211-6). Washington, DC: Association of American Medical Colleges, 1984.
70. Petrusa E, Blackwell T, Rogers L, Saydjari C, Parcel S, Guckian J. An objective measure of clinical performance. American Journal of Medicine 1987;83:34-42.
71. Petrusa E, Blackwell T, Ainsworth M. Performance of internal medicine house officers on a short station OSCE. In I Hart, R Harden (Eds.), Further developments in assessing clinical competence (pp. 598-608). Montreal: Can-Heal, 1987.
72. Petrusa E. Collaborative Project to Improve the Evaluation of Clinical Competence (Final report to the National Fund for Medical Education), 1988.
73. Cohen R, Rothman A, Ross J, et al. Comprehensive assessment of clinical performance. In I Hart, R Harden (Eds.), Further developments in assessing clinical competence (pp. 624-8). Montreal: Can-Heal, 1987.
74. Norman G, Tugwell P, Feightner J. A comparison of resident performance on real and simulated patients. Journal of Medical Education 1982;57:708-15.
75. Norman G, Neufeld V, Walsh A, Woodward C, McConvey G. Measuring physicians' performance by using simulated patients. Journal of Medical Education 1985;60:925-34.
76. Sanson-Fisher R, Poole A. Simulated patients and the assessment of medical students' interpersonal skills. Medical Education 1980;14:249-53.
77. Owen A, Winkler R. General practitioners and psychosocial problems: An evaluation using pseudopatients. Medical Journal of Australia 1974;2:393-8.
78. Burri A, McCaughan K, Barrows H. The feasibility of using the simulated patient as a means to evaluate clinical competence of practicing physicians in a community. In Proceedings of the 15th Research in Medical Education Conference (pp. 295-9). Washington, DC: Association of American Medical Colleges, 1976.
79. Renaud M, Beauchemin J, LaLonde C, Poirier H, Berthiaume S. Practice settings and prescribing profiles: The simulation of tension headaches to general practitioners working in different practice settings in the Montreal area. American Journal of Public Health 1980;70:1068-73.
80. Rethans J, van Boven C. Simulated patients in general practice: A different look at the consultation. British
performance-based assessment of fourth-year students' Medical Journal 1987;294:809-12.
clinical skills. Journal of Medical Education 1987;62:
805-9.
68. Williams R, Barrows H. Performance-based assessment of
clinical competence using clinical encounter multiple sta-
tions. In I Hart, R Harden (Eds.), Further developments in Received 9 June 1989