Van der Vleuten, Assessment of Clinical Skills With Standardized Patients: State of the Art
To cite this article: C. P. M. van der Vleuten & David B. Swanson (1990) Assessment of clinical skills with
standardized patients: State of the art, Teaching and Learning in Medicine: An International Journal, 2:2, 58-76, DOI:
10.1080/10401339009539432
Teaching and Learning in Medicine, 1990, Vol. 2, No. 2, 58-76. Copyright 1990 by Lawrence Erlbaum Associates, Inc.
A little more than 10 years ago, the objective structured clinical examination (OSCE) was introduced. It includes several "stations," at which examinees perform a variety of clinical tasks. Although an OSCE may involve a range of testing methods, standardized patients (SPs), who are nonphysicians trained to play the role of a patient, are commonly used to assess clinical skills. This article provides a comprehensive review of large-scale studies of the psychometric characteristics of SP-based tests.

Across studies, reliability analyses consistently indicate that the major source of measurement error is variation in examinee performance from station to station (termed content specificity in the medical-problem-solving literature). As a consequence, tests must include large numbers of stations to obtain a stable, reproducible assessment of examinee skills. Disagreements among raters observing examinee performance and differences between SPs playing the same patient role appear to have less effect on the precision of scores, as long as examinees are randomly assigned to raters and SPs.

Results of validation studies (e.g., differences in group performance, correlations with other measures) are generally favorable, though not particularly informative. Future validation research should investigate the impact of station format, timing, and instructions on examinee performance; study the procedures used to translate examinee behavior into station and test scores; and work on rater and SP bias.

Several recommendations are offered for improving SP-based tests. These include (a) focusing on assessment of history taking, physical examination, and communication skills, with separately administered written tests used to measure diagnostic and management skills, (b) adoption of a mastery-testing framework for score interpretation, and (c) development of improved procedures for setting pass-fail standards. Use of generalizability theory in analyzing and reporting results of psychometric studies is also suggested.
Medical schools and other organizations responsible for certifying clinical competence have traditionally based their decisions on written examinations and faculty ratings of performance in
Requests for reprints should be sent to Dr. van der Vleuten at the Department of Educational Development and Research, P.O. Box 616, University of Limburg, 6200 MD Maastricht, The Netherlands.
ASSESSMENT OF CLINICAL SKILLS WITH SPs
clinical training. In recent years, there has been growing dissatisfaction with these procedures, because of the limited skills assessed by written tests and psychometric problems associated with ratings of performance. This has led to a new emphasis in assessment, in which the tasks presented to examinees are more representative of those faced in real clinical situations. These performance-based tests have become very popular, and medical schools worldwide use them for assessing clinical skills.1,2 Increasingly, schools are conducting and reporting studies of the psychometric characteristics of the tests they use. This article reviews the psychometric studies of one family of performance-based tests, those involving standardized patients (SPs).

Some Definitions and Terminological …

… murmurs, pulmonary crackles, joint abnormalities, etc.), or can simulate physical findings (e.g., abnormal reflexes, diminished breath sounds, elevated blood pressure). Examinees interact with SPs as though they were interviewing, examining, and counseling real patients. Often, SPs are also trained to complete checklists and rating forms at the end of encounters, recording the history information obtained, the examination maneuvers performed, and the counseling provided, as well as rating communication skills of examinees. Alternatively, faculty-raters may observe SP-examinee encounters and complete checklists and rating forms.

SP-based tests have been used to assess a broad range of clinical skills. Most often, they are used to measure history taking, physical examination, and communication skills, although skills in diagnosis,
Scoring systems vary extensively across studies. Some groups calculate only overall scores for each station, aggregating across the skills measured and the checklists, rating forms, and written test materials associated with a station; for these groups, our review of psychometric results must be based on overall station scores. Other groups retain and use subscores for individual station components, most often by calculating one or more subscores based on interaction with an SP and one or more subscores related to written follow-up questions; an overall score for the station may or may not be calculated in addition. Because this review concerns SP-based assessment, when multiple subscores have been used, we have focused on psychometric results for scores based on direct interaction with SPs (generally those related to data-gathering and communication skills). Results for other scores are reported if they provide insight into key issues in test design and score interpretation.

Investigators also vary in the methods used to aggregate (sub)scores across stations: Some groups form composite scores by averaging individual station scores, yielding a single composite test score; other groups form a composite score for each group of similar stations; a third alternative (used by groups calculating multiple subscores for each station) is to calculate a profile of composite scores corresponding to station subscores. Psychometric analyses may or may not be reported for all scores, depending on the purpose of the test, the skills viewed as important by the investigator group, the reliability of the subscores, and other factors. In accord with the primary objectives of the review, we focus on composite scores based on direct interaction with SPs. In reporting reliability results, we have adjusted total testing time and testing time per station to include only the time actually spent interacting with SPs. The only exceptions occur in those studies where investigators reported only results for a single composite score that included both SP-based and written components; in these instances (explicitly noted in the text and tables), total testing time and testing time per station reflect all components. Although this overall approach is somewhat artificial, due to variation in station format, it was necessary to make cross-study comparisons more meaningful.

Studies Included in the Review

The review includes published (and some unpublished) studies of the psychometric characteristics of SP-based tests in which four criteria are met.

1. The test must have been administered to a minimum of 40 medical students or residents.

2. Examinees must have completed a minimum of three stations.

3. The total number of SP-examinee encounters (product of examinees and stations per examinee) must have been at least 400. Although larger examinee and station sample sizes are clearly desirable for accurate estimation of key psychometric parameters, these minimum values were established in recognition of the logistical intricacies and resource requirements of SP-based tests. Because the review integrates results across studies, fluctuations due to limited sample sizes should average out.

4. Results of reliability and validity analyses had to be reported in sufficient depth to be interpretable. A number of studies were eliminated on this basis, either because reliability or validity information was not reported, or because the reported results were based on inappropriate estimation procedures. We revisit this problem at the end of the article.

Table 1 lists the studies that met our inclusion criteria and form the basis for the review. For each study, the table provides the institution responsible for test development, citations to reports and publications, the type of examinees tested, and the station formats used. Because SP-based tests from the same institution and investigator group tend to share common features, studies are grouped by institution. If several independent studies were conducted at the same institution, they are shown as separate "data sets" in the table under a common institutional heading. If a data set is described in several reports, the results are integrated across them in discussion and later tables.

Organization of the Review

The remainder of the review is divided into four sections. The first section discusses the reproducibility (reliability) of SP-based scores and pass-fail decisions, integrating results across studies through use of generalizability theory.4,5 The next section summarizes research on the validity of SP-based test scores. The third section discusses research on the impact of SP-based tests on the educational process. In an effort to address the practical needs of SP-based test users, these three sections are structured as a series of responses to key questions in SP-based testing. The last section summarizes the state of the SP-based testing art, presents some ideas for improvement of SP-based tests, identifies several areas for further research, and makes some methodological observations and recommendations for future studies.
Institution (Data Set)                              Examinees                                  Station Format
…                                                   …                                          … written follow-upa
National Board of Medical Examiners (NBME)66        Senior students                            20 min for history and initial management
Southern Illinois University (SIU)14,32,35,67,68    Senior students                            15 min for history and physical, 15 min for written follow-upb
University of Texas Medical Branch at Galveston (UTMB)
  UTMB Data Set 130,69,70                           Junior students                            5 min for history or physical, 5 min for written follow-upb
  UTMB Data Set 271                                 First- and second-year medicine residents  5 min for history or physical, 5 min for written follow-upb
  UTMB Data Set 372                                 Junior students                            5 min for history or physical, 5 min for written follow-upb
University of Toronto17,73                          Foreign medical graduates                  5 to 10 min for history or physical, 5 min for written follow-upa

aWritten follow-up not included in testing time or calculation of scores in later tables. bWritten follow-up included in testing time and calculation of scores in later tables.
Reproducibility of SP-Based Scores and Pass-Fail Decisions

This section discusses issues related to the reproducibility of SP-based tests, integrating results across studies. We begin by presenting a general conceptual framework for thinking about the reproducibility of SP-based tests. This is followed by discussion of the various factors that influence the reproducibility of SP-based scores and pass-fail decisions.

A Conceptual Framework for the Reproducibility of SP-Based Tests

The purpose of any test is to draw inferences about the ability of examinees that extend beyond the particular items used to the larger domain from which the items are sampled. Depending on the size and nature of the sample, these inferences can be more or less reproducible (reliable) and more or less accurate (valid). From this perspective, test design is basically the development of a sampling plan that reflects the skills and areas to be assessed.

For SP-based tests, SPs, raters (either observers or the SPs themselves), and stations are sampled from larger domains of SPs, raters, and stations that might have been used on the test. Test scores are reproducible if an examinee's score is reasonably stable across different but similar (randomly parallel) samples of SPs, raters, and stations, and reproducibility (generalizability) coefficients can be thought of as the expected correlation between scores derived from these similar samples. For an estimate of an examinee's skill level to be reproducible (e.g., a reproducibility coefficient greater than .8), an adequate number of SPs, raters, and stations must be included in the sample that the test comprises. Lack of interrater agreement in scoring examinee behavior, inconsistency in SP performance, and variation in an examinee's performance across stations all affect the reproducibility of scores.
The structure of the next three subsections follows from this conceptualization of the reproducibility of SP-based tests. First, rater-related sources of measurement error are examined; this is followed by discussion of SP-related sources of measurement error. Although SPs often act as raters as well as patients, it is important to keep the two sources distinct conceptually, because it is common practice to train multiple SPs to play the same role. Ratings (whether provided by an SP or an observer) could be perfectly accurate, with different SPs still varying extensively in how they play the same patient role. Next we consider station-related sources of measurement error, the impact of variation in an examinee's performance from station to station (often referred to as content specificity in the medical-problem-solving literature) on reproducibility of scores. This is the largest source of measurement error in SP-based … sizable reduction in testing time requirements, an important consideration, because SP-based tests are very resource intensive.

In large-scale SP-based testing, it is often necessary to develop several test forms for use at multiple sites over an extended period of time. These forms generally differ in difficulty and discrimination; consequently, the score received by any particular examinee is influenced by the test form used. In the last subsection, we discuss the problem of statistically adjusting (equating) scores on alternate test forms to put them on the same scale.

Rater-Related Sources of Measurement Error

How well do raters agree in scoring individual …
… stations, as long as errors are nonsystematically related to examinees (e.g., not associated with the site where the exam was taken, not related to the examinees' race, sex, or appearance, etc.). This issue is explored further in later subsections.

How many raters are required per station? Because interrater agreement is fairly good, it is generally unnecessary to use more than one rater per station, particularly for relatively long tests involving many stations. Analyses performed with the University of Adelaide data set10 provide a good illustration. In this study, data was accumulated on approximately 400 examinees over a 4-year period encompassing eight test forms. Each examinee encountered three to five SPs (depending on the test form), and performance was rated by two faculty-physicians on checklists tailored to station content. Generalizability analyses statistically compared use of one and two raters per station at a variety of test lengths. Results indicated that use of multiple raters per station has only a marginal effect on reproducibility of scores. If a sufficient number of stations are used to obtain reproducible scores, a large enough sample of raters is automatically included with a single rater per station. If large numbers of raters are available, it is much more effective to increase the number of stations, assigning one to each. For example, a 2-hr test with two raters per station requires the same total rater time as a 4-hr test with one rater per station. In the Adelaide data set the reproducibility of the former was projected to be .64, whereas the reproducibility of the latter was projected at .75, a major improvement in precision.

Who should rate examinee performance? Some previous research has suggested that physician-observers are naturally either stringent (hawks) or lenient (doves), and these tendencies are resistant to training.11,12 In a recent study of the University of Limburg Skills Test using very detailed checklists, van der Vleuten et al.13 studied the impact of training on different types of raters. Trained and untrained groups of nonphysicians, medical students, and physician faculty rated the videotaped performance of examinees with two SPs. Results indicated that the need for and effectiveness of training varied across groups: It was least needed and least effective for physician-raters, more needed and effective for medical students, and most needed and effective for nonphysicians. Differences in accuracy among groups were almost eliminated by rater training.

Inspection of Table 2 supports these results. There are no consistent trends in level of interrater agreement as a function of rater characteristics: Adequate interrater agreement can be achieved through use of SPs or physicians as raters. Intuitively, it does seem likely that SPs and physicians may differ in the aspects of examinee performance that they can rate accurately. For example, physicians should be more attuned to logical sequencing of questions in history taking and technical adequacy of some physical examination maneuvers. SPs may be more sensitive to some communication skills (e.g., establishing rapport, sensitivity to patient needs, avoidance of jargon) and better judges of certain examination maneuvers, where feeling what is done provides important information (e.g., palpation generally; pelvic, rectal, and joint examinations specifically). Practical and educational considerations may be the most important factors in rater selection. If faculty physicians can participate, it may well be desirable for them to serve as raters, because observation of examinees provides useful feedback concerning instructional effectiveness in terms of the skill levels of trainees. If physicians are unavailable, nonphysicians (including SPs themselves) appear to provide a logical, less expensive, and satisfactory alternative.

Should checklists or rating scales be used? Review of Table 2 indicates that interrater agreement is generally better for checklists than for rating scales, though agreement is sufficiently good for both that either can be used, particularly in long tests involving large numbers of stations and raters. Presumably, interrater agreement is better for checklists because items are more concretely stated and can be judged more objectively. From the perspective of interrater agreement, when checklists and rating scales are both viable alternatives, checklists are to be preferred. From an educational perspective, checklists also provide better definition of expectations for examinees and more specific feedback on performance. However, it is difficult to develop checklists in several areas (e.g., attitudes, aspects of communication skills) without trivializing the aspect of examinee performance to be judged, and such validity considerations are more important than interrater agreement. In such situations, use of behaviorally anchored rating scales is generally advised, though we could find no research basis for this recommendation in the SP literature.

SP-Related Sources of Measurement Error

What are the measurement consequences of using several SPs to play the same role? Several research groups have investigated this question.
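The station-versus-rater trade-off described under "How many raters are required per station?" can be sketched with a standard two-facet generalizability projection. All variance components below are invented for the illustration; they are not the published Adelaide estimates, though the resulting pattern is the same.

```python
# Illustrative two-facet projection: examinees crossed with stations,
# raters nested within stations. Variance components are hypothetical.
VAR_PERSON         = 0.040  # true examinee variance
VAR_PERSON_STATION = 0.140  # examinee-by-station interaction (content specificity)
VAR_PERSON_RATER   = 0.020  # examinee-by-rater (within station) disagreement

def g_coefficient(n_stations, n_raters_per_station):
    """Projected generalizability coefficient for a given sampling design."""
    error = (VAR_PERSON_STATION / n_stations
             + VAR_PERSON_RATER / (n_stations * n_raters_per_station))
    return VAR_PERSON / (VAR_PERSON + error)

# Equal total rater time (3 stations/hr): a 2-hr test with two raters per
# station versus a 4-hr test with one rater per station.
print(f"2 hr, 2 raters/station: {g_coefficient(6, 2):.2f}")   # 0.62
print(f"4 hr, 1 rater/station:  {g_coefficient(12, 1):.2f}")  # 0.75
```

Because the station facet contributes far more error than the rater facet, spending the same rater time on more stations rather than more raters per station buys the larger gain in precision.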
Using a subset of the stations from the SIU data set, Dawson-Saunders et al.14 compared the performance of groups of examinees who saw different SPs playing the same role. Statistically significant group differences were obtained on SP-based scores for five of the seven stations studied, suggesting that the particular SP seen had a major influence on scores. The investigators recommended several measures to reduce this influence: (a) using a single SP per case, (b) phrasing patient checklists in lay language, and (c) increasing the training provided for completion of checklists and rating forms. In small-scale studies, Hiemstra et al.15 and Vu et al.16 also found some evidence of differences between SPs trained to play the same role. However, analyses of the University of Toronto data set17 reported no significant differences between SPs playing the same role.

The real issue is not, however, if scores on individual stations differ systematically as a result of multiple SPs playing the patient role; the impact of using multiple SPs on overall reproducibility of scores is of primary importance. This issue was investigated in generalizability analyses of 10 stations from University of Massachusetts (UMass) Data Set 3 where two SPs played the same role.10 Results (shown in Table 3) indicated that, although scores derived from different SPs playing the same role are often significantly different, reproducibility of total test scores is not markedly affected. As long as examinees are randomly assigned to SPs playing the same role, differences average out across the test as a whole. This may be approximately true for many SP-based tests given at a single point in time at the same institution, but not for tests given at different times or at multiple sites, as discussed next.

A well-designed, large-scale study using the Manitoba/SIU data set9,18 studied the relative accuracy of SPs playing the same role at two institutions, Southern Illinois University and the University of Manitoba. This provides a particularly strong challenge to use of multiple SPs, because different individuals trained SPs at each school. Videotapes of 252 SP-examinee encounters from Manitoba and 197 SP-examinee encounters from SIU on 15 common stations were viewed by trained raters, who recorded the accuracy* of presentation on case-specific checklists of critical findings. Average accuracy exceeded 90% at roughly two thirds of the stations at both institutions. For the remaining stations, accuracy was generally between 80% and 90%, but several SPs had average accuracy scores between 69% and 75%. Errors in presentation of physical findings and patient affect were more common than errors in presentation of historical information. Use of multiple SPs contributed sizable random and systematic error into station scores, potentially having a major impact on overall pass-fail results.9

Additional research in this area is highly desirable. Large-scale use of SP-based tests depends, in part, on the ability of test developers to train multiple SPs at different sites to play the same roles accurately. Training procedures that facilitate "transportability" of stations are needed.

Table 3. Reproducibility of Scores as a Function of Testing Time and Number of SPs Used

                    Data Gathering Skills       Communication Skills
Test Length (hr)a   Same SP    Different SPs    Same SP    Different SPs
 1                  .34        .33              .59        .56
 2                  .51        .50              .74        .71
 4                  .67        .67              .85        .83
 6                  .76        .75              .90        .88
 8                  .81        .80              .92        .91
16                  .89        .89              .96        .95

Notes: Analyses reported in Swanson and Norcini10 based on UMass Data Set 3;65 table from "Factors Influencing the Reproducibility of Tests Using Standardized Patients" by D. Swanson and J. Norcini, 1989, Teaching and Learning in Medicine, 1, p. 161. Copyright 1989 by Lawrence Erlbaum Associates, Inc. Adapted by permission.
aThree stations per hour.

Station-Related Sources of Measurement Error

In virtually all efforts to assess clinical competence, examinee performance on one case is a poor predictor of performance on other cases. This has been termed the content-specificity of clinical competence.19 This phenomenon has been observed across measurement techniques, including written and computer-based simulations,20,21 vignette-based short-answer tests,22 chart audits,23 oral exams,24 and SP-based tests.25 Long tests, including large samples of cases, are necessary as a consequence, and careful determination of the number of SP-based stations required to obtain reproducible results is clearly merited.

*Accuracy was defined as the number of critical findings presented correctly divided by the number of critical findings that could have been presented, given the actions of the examinee, rescaled as a percentage. Thus, accuracy could vary from 0% to 100%, with the latter representing a perfectly accurate presentation.
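The accuracy index used in the Manitoba/SIU study is a simple ratio; a minimal sketch, with made-up checklist counts:

```python
def sp_accuracy(presented_correctly, presentable):
    """Percentage of critical findings presented correctly, out of the
    findings that could have been presented given the examinee's actions."""
    if presentable == 0:
        raise ValueError("no critical findings could have been presented")
    return 100.0 * presented_correctly / presentable

# A hypothetical SP who correctly presented 9 of 12 elicitable findings:
print(sp_accuracy(9, 12))  # 75.0
```

Note that the denominator is conditioned on the examinee's actions: findings the examinee never tried to elicit do not count against the SP.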
How much testing time is required to obtain reproducible scores? The studies reviewed in this article varied substantially in the methods used to estimate and report results of reliability analyses. To achieve greater comparability across studies, we have reanalyzed reported results using generalizability theory to estimate reproducibility of scores. Because total testing time and time per station varied across studies, our analysis focused on reproducibility as a function of testing time rather than number of stations.

Table 4 presents the results of the reanalysis. Generalizability coefficients in the table are analogous to coefficient alpha; they can be thought of as the expected correlation between scores derived from similar, but not identical exams using a different sample of stations but the same test … testing time was too short to yield reproducible SP-based scores in the remaining studies.

Second, generalizability coefficients are lowest for those studies in which written follow-up questions linked to SP presentation are included in scores and testing time (SIU and University of Texas Medical Branch at Galveston [UTMB] data sets). In part, this reflects the extra (doubling of) testing time per station that the written components require. However, even if testing time per station is halved to eliminate this effect, coefficients remain relatively low, indicating that use of linked written questions reduces generalizability and increases test length requirements. In part, this is a natural consequence of a shift in what is measured. The SP component of stations generally tests the hands-on skills involved in data-gathering and communication. Follow-up written …

Table 4

                       Test Length (hr)                                 Average Station
Data Set               1     2     3     4     6     8     12           Length (min)
Adelaide               .43a  .60   .69   .75   .82   .86   .90          6
CFPC                   .53a  .67   .77   .82   .87   .90   .93          15
ECFMG                  .51a  .68   .76   .81   .86   .89   .93          20
Limburg                .54   .69a  .77   .82   .87   .90   .93          15
UMass Data Set 1       .59   .74   .81a  .85   .90   .92   .94          30
UMass Data Set 2       .50   .67a  .75   .80   .86   .89   .92          10
UMass Data Set 3       .34   .51   .61   .67a  .76   .80   .86          15
NBME                   .53   .69   .77   .82a  .87   .90   .93          20
SIU                    .19   .31   .41   .48   .58   .65   .73a         40
UTMB Data Set 1        .24   .38   .49a  .56   .66   .72   .79          10
UTMB Data Set 2        .19   .32   .41a  .48   .58   .65   .74          10
UTMB Data Set 3        .31   .47a  .57   .64   .73   .78   .84          10
Toronto                .65   .79   .85a  .88   .92   .94   .96          6

aApproximate testing time used (rounded to nearest whole hour).
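The rows of Table 4 behave like a Spearman-Brown projection applied to testing time (for a single-facet stations design, the generalizability projection takes the same form): each row extrapolates the coefficient observed at the actually administered length to the other lengths. A sketch using the Adelaide row:

```python
def project(coeff_obs, hours_obs, hours_new):
    """Spearman-Brown projection of a reliability-like coefficient from an
    observed testing time to a new testing time (station length held fixed)."""
    k = hours_new / hours_obs  # lengthening factor
    return k * coeff_obs / (1 + (k - 1) * coeff_obs)

# Adelaide: coefficient .43 observed at roughly 1 hr of SP testing time.
for hours in (2, 3, 4, 6, 8, 12):
    print(f"{hours:2d} hr: {project(0.43, 1, hours):.2f}")
```

Rounded to two decimals, this reproduces the Adelaide row (.60, .69, .75, .82, .86, .90); the same projection applied to the SIU row shows why a long test is needed when the 1-hr coefficient is as low as .19.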
How long should individual stations be, and which formats use testing time most efficiently? Aside from lower reproducibility coefficients for stations including written components, there is no obvious relationship in Table 4 between testing time per station and generalizability of scores. Both short and long stations are effective (and ineffective), depending on the study. Apparently, longer stations tend to yield more measurement information, but the fact that more short stations can be completed in a fixed amount of time completely compensates. The range of station lengths for this compensatory effect is probably limited. Tests using very long stations (e.g., more than 1 hr each) would probably yield less reproducible scores, even controlling for total testing time, because other sources of measurement error (i.e., raters and SPs) would be …

Reproducibility of Domain-Referenced Test Scores

Most studies to date have adopted (often implicitly in selection of reliability estimation procedures) a norm-referenced framework for score interpretation. That is, scores are given meaning by reference to the performance of other examinees (e.g., an examinee's score is 1 SD below the mean, in the 95th percentile, etc.), and reproducibility is high to the extent that tests differentiate examinees and allow fairly precise rank ordering. For SP-based tests, it seems natural and desirable to use a domain-referenced framework for score interpretation, where an examinee's score is interpretable in absolute terms (e.g., an examinee's score indicates that he or she can take an adequate …

                                           Test Length (hr)
Data Set          Framework                1     2     3     4     6     8     12
Adelaide          Norm referenced          .43a  .60   .69   .75   .82   .86   .90
                  Domain referenced        .34   .51   .61   .67   .76   .81   .88
Limburg           Norm referenced          .54   .69a  .77   .82   .87   .90   .93
                  Domain referenced        .42   .59   .68   .74   .81   .85   .90
UMass Data Set 3  Norm referenced          .34   .51   .61   .67a  .76   .80   .86
                  Domain referenced        .21   .34   .44   .51   .61   .68   .76
SIU               Norm referenced          .19   .31   .41   .48   .58   .65   .73a
                  Domain referenced        .10   .19   .26   .32   .41   .48   .58
Torontob          Norm referenced          .54   .70   .78   .83   .88c  .91   .93
                  Domain referenced        .41   .58   .68   .74   .81   .85   .89

aApproximate testing time used (rounded to nearest whole hour). bWritten follow-up here included, because the necessary variance components for domain-referenced estimation were only reported for the combined data set. cActual testing time was 5 hr.
reproducibility is substantial, and longer tests are necessary as a consequence. Differences in reproducibility decrease as test length increases, because test forms tend to become more similar in difficulty as the number of stations increases.

Reproducibility of Pass-Fail Decisions

The discussion of reproducibility and testing time requirements has, thus far, focused on the reproducibility of scores. For many applications of SP-based tests, reproducibility of scores is not really important: reproducibility of pass-fail decisions is crucial. For example, SP-based tests given after completion of required clerkships and before medical school graduation often focus on history-taking and physical examination skills in an effort to ensure that these skills have been mastered to the level required for postgraduate training.29,32-34 In such testing situations, although it is obviously desirable to estimate an examinee's skill level precisely, the basic issue is whether ability exceeds the mastery point, and the reproducibility of pass-fail decisions is of primary importance, both practically and psychometrically.

What happens to reproducibility if a mastery-testing approach is adopted? If most examinees perform well relative to the pass-fail point, fairly short tests can still yield reproducible pass-fail decisions, particularly for examinees at upper ability levels.10,35 In such situations, use of "sequential testing" procedures may be advantageous. In this approach to assessment, a brief screening test is given initially to all examinees. Those who perform well relative to the pass-fail point are excused from further testing with a passing result. The test is continued for the remaining examinees, concentrating testing time and resources on "close call" decisions in the vicinity of the passing score. A hypothetical example based on the University of Adelaide data set is provided in Swanson and Norcini.10

What methods have been used to set standards for SP-based tests? Implicit in the mastery-testing approach is use of absolute standards in making pass-fail decisions. Unfortunately, almost no work has been done on development of absolute standard-setting procedures for SP-based tests analogous to those used for written exams.36 Most researchers have not had to confront the standard-setting problem, because test scores either did not count or were combined with other assessments. When pass-fail standards were needed, typically a relative standard was set (e.g., 2 SD below the mean), the score distribution was inspected, looking for "gaps," or a pass-fail point was selected arbitrarily. Given the current interest in use of SP-based tests as graduation exams or as a component of the licensure process, development of better standard-setting procedures should have a high priority.

Equating Scores on Alternate Forms of SP-Based Tests

In many situations where SP-based tests are used, several equivalent forms of a test are developed, similar in overall content and format, but with different stations on each form. Most commonly, multiple forms are required for security reasons when testing is spread out over time, either because the number of examinees is large or because several cohorts (e.g., successive clerkships or graduating classes) are to be tested. Regardless of the reason, whenever multiple test forms are used, they are unlikely to be equivalent in level and range of difficulty, and any direct comparison of scores would be unfair to those examinees tested with more difficult forms.37 A variety of statistical procedures, termed equating methods, are used with written tests to deal with this problem.38

What procedures have been used to equate scores on SP-based tests? For the most part, users of SP-based tests do not equate scores on alternate test forms. Although multiple forms are commonly used, scores are not adjusted for differences in form difficulty. The problem is simply ignored: Scores are interpreted as if they are on the same scale. However, a few investigators have developed some rough-and-ready procedures for coping with differences in form difficulty.

At the University of Adelaide, two test forms are used each year. Examinees are randomly assigned to forms, which should result in roughly equivalent ability groups taking each form. After test administration, the mean score on each form is calculated, and the difference between them is added to the score of each examinee taking the more difficult form.26 When used with written tests, this procedure is termed mean equating.39 Similar stations are also developed in pairs and randomly assigned to the two forms; this procedure should result in forms that are parallel in content and similar, though not identical, in difficulty.

In SP-based tests given at the University of Massachusetts (Data Sets 2 and 3), for security reasons separate test forms are typically constructed from a common pool of stations according to a fixed blueprint. After test administration, scores on each station are standardized to a mean of 500 and a standard deviation of 100,
VAN DER VLEUTEN & SWANSON
averaged across stations for each examinee, and restandardized across examinees. This procedure also results in a type of mean equating that adjusts for differences in form difficulty, assuming examinees and stations are randomly assigned to test forms (not quite true for UMass test administrations).

What other procedures might be used to equate scores on SP-based tests? A variety of equating procedures have been developed for written tests.38 Several of these could be adapted for use in SP-based assessment, at least for situations in which large numbers of examinees are tested with each form. The various forms of "common-item linear equating" would probably be the simplest and most practical to use. Using this approach, different test forms would include some common stations. The relative performance on common stations of examinee groups taking different test forms provides a basis for estimating group ability and heterogeneity, independent of form difficulty and discrimination. These estimates are then used as a basis for adjusting scores on alternate forms.39 Other equating procedures (equipercentile equating; methods based on item response theory) might also be applied; in general, however, the sample-size requirements for these procedures (at least several hundred examinees per form) are too large.

If assessment takes place within a mastery-testing framework, it may prove more practical to identify equivalent pass-fail points on alternate test forms, rather than attempting to equate scores. If primary focus is on pass-fail decisions, this approach could permit appropriate adjustment for differences in test forms without requiring large numbers of examinees. However, it will be necessary to develop improved standard-setting procedures before investigating this alternative.

Validity of SP-Based Test Scores

Validity refers to "the accuracy of a prediction or inference made from a test score"40 (p. 443). It is not a property of the test itself, but of interpretations based on test scores. Thus, the same test can have many validities, depending on how it is used and how scores are interpreted. Making matters worse, there is typically no "gold standard" with which scores can be compared; this is surely true for SP-based tests. As a consequence, validation requires the accumulation of evidence across a series of studies.

Traditionally, validation studies take one of three forms: (a) study of differences in group performance (e.g., comparison of scores received by examinees at different points in their training), (b) study of the relationships between scores and other measures (e.g., correlations with written test scores or ratings of clinical performance), and (c) logical analysis of test content (clinical tasks posed to examinees, items included on checklists, etc.). After a brief discussion of procedures for scoring stations, the following subsections review work in each category. The last subsection outlines some validation studies that we believe are needed.

Procedures for Scoring SP-Based Tests

Because the essence of validity is the accuracy of inferences based on test scores, it seems appropriate to begin discussion of validity by commenting on procedures used for scoring SP-based stations and tests. Interestingly, published articles rarely describe scoring procedures, beyond stating that scores were calculated as the percentage of possible points obtained on checklists and/or rating scales. Often, it is unclear how checklists were developed, how individual items were weighted in calculation of station (sub)scores, and how station (sub)scores were aggregated to obtain composite scores. Thus, material in the remainder of this section is based predominantly on speculation, rather than results of research.

What items should station checklists and rating forms include? Inspection of data-gathering checklists used by different groups reveals remarkable diversity. Checklist length varies from a few items to several dozen. On checklists used for history taking, some groups list questions that examinees should ask; others list answers that SPs provide. Intuitively, the latter seem easier to use, because roughly equivalent questions may be asked in several ways at varying levels of specificity, but the information provided is relatively unambiguous. Checklists used for the physical examination component of stations generally list examination maneuvers, though specificity level varies considerably (e.g., from "examines the abdomen" to "palpates the right upper quadrant for the liver" to a list of discrete steps for each quadrant). Rating forms for communication skills vary from a single global item to several dozen items concerning discrete behaviors. Individual items may be fairly concrete ("maintains good eye contact"), fairly abstract ("establishes good rapport"), or related to behavioral intentions ("I would recommend this examinee to a friend"). In part, differences in checklists and rating forms may reflect differences in the focus of assessment (more detailed lists for interpersonal skills and
physical diagnosis courses; less detailed lists for exams required for graduation from school). Reproducibility of test scores appears to be fairly invariant across the various rating form and checklist formats; validity of scores may not be. Some systematic research on the content and format of checklists and rating forms seems highly desirable.

How should items and subscores be weighted in calculating station scores? Relatively little research on this question has been reported in the standardized patient literature. Stillman27 explored the use of weighting in calculation of checklist scores and found little impact on reproducibility or validity. These findings parallel those obtained in studies of written patient-management problems20 and in psychometric studies of [. . .]

[. . .]ciently reproducible to be meaningful. Reporting a profile of subscores is particularly ill-advised, because individual subscores in a profile are likely to be very unstable.

Differences in Group Performance

Do examinees at different points in training perform differently? There have been relatively few studies of the performance of different groups on SP-based tests. In UMass Data Set 1, the performance of internal medicine residents improved as they progressed through training: Third-year residents performed better than second-year residents, who performed better than first-year residents.27,28 In the same study, residents from training programs with stronger reputations per-
Data Set and Measures                                                         Observed      True
                                                                              Correlationa  Correlationb
Adelaide
  Multiple-Choice Test (Locally Developed)                                    .33           .68
  Non-SP Skills Test                                                          .35           1.00
  Written Follow-Up (Short-Answer Test)                                       .40           .88
Limburg
  Multiple-Choice Test (Locally Developed)c                                   .63           .77
  Written Follow-Up (Multiple-Choice Test)c,d                                 .62           .77
Manitoba/SIU
  Multiple-Choice Test (NBME Part II)                                         .63           .82e
  Clinical Ratings                                                            .52           .66e,f
UMass Data Set 1
  Multiple-Choice Test (American Board of Internal Medicine Certifying Exam)  .24           .29e
  Clinical Ratings                                                            ns            —
  Months of Residency Training                                                .32           .37e
  Self-Ratings                                                                ns            —
UMass Data Set 2
  Multiple-Choice Test (NBME Part I)                                          .19           .24e
  Multiple-Choice Test (NBME Part II)                                         .27           .34e
  Clinical Ratings                                                            .44           .50e,f
  Written Follow-Up (Short-Answer and Multiple-Choice Test)                   .26           .36e
UMass Data Set 3
  Multiple-Choice Test (NBME Part I)                                          .10           .13e
  Multiple-Choice Test (NBME Part II)                                         .22           .28e
  Clinical Ratings                                                            .25           .31e,f
  Self-Ratings                                                                .08           .10e,f
  Written Follow-Up (Pattern Recognition Test)                                .22           .32e
NBME
  Multiple-Choice Test (NBME Behavioral Science Subtest)                      .37           .46e
  Multiple-Choice Test (NBME Psychiatry Subtest)                              .31           .40e
  Multiple-Choice Test (Other NBME)                                           ns            ns
SIU
  Multiple-Choice Test (NBME Part I)                                          .53           .63e
  Multiple-Choice Test (NBME Part II)                                         .51           .60e
  Clinical Ratings                                                            .65           .75e,f
UTMB Data Set 1
  Multiple-Choice Test (NBME Medicine Subtest)                                .43           .64
  Clinical Ratings                                                            .46           .73
UTMB Data Set 2
  Multiple-Choice Test (American Board of Internal Medicine Certifying Exam)  .24           .35
  Clinical Ratings                                                            .00           —
  Months of Residency Training                                                .31           .49
UTMB Data Set 3
  Multiple-Choice Test (NBME Medicine Subtest)                                .64           1.00
  Clinical Ratings                                                            .37           .56f
Toronto
  Multiple-Choice Test (Locally Developed)                                    .43           .50
  Written Follow-Up (Short-Answer Test)                                       .69           .91

ans = nonsignificant. bEntries corrected for measurement error (disattenuated) in both scores, unless otherwise noted. cReported in van der Vleuten et al.31 dAdministered 2 weeks after the SP test. eNo values were reported; entries were approximated from available results. fEstimate is disattenuated for unreliability in the SP score only.
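The "true" correlations in the table above are observed correlations corrected for attenuation due to measurement error (footnote b); correcting for unreliability in only one score (footnote f) amounts to setting the other score's reliability to 1. A minimal sketch of the classical correction, with hypothetical reliability values that are not taken from the article:

```python
import math

def disattenuate(r_obs: float, rel_x: float, rel_y: float = 1.0) -> float:
    """Classical correction for attenuation: estimate the 'true'
    correlation from an observed correlation and the reliability of
    each score. Leave rel_y at 1.0 to correct for unreliability in
    one measure only."""
    return r_obs / math.sqrt(rel_x * rel_y)

# Hypothetical: observed r = .40 between an SP-based score with
# reliability .55 and a written score with reliability .60.
print(round(disattenuate(0.40, 0.55, 0.60), 2))  # → 0.7
```

Disattenuated values can greatly exceed the observed correlation when reliabilities are low, and can reach 1.0 (as in several table entries) when the observed correlation is as large as the reliabilities permit.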
reliable). They provide a better index of the true strength of relationship between the measures involved.

What is the relationship between SP-based test scores and other measures? True correlations between scores on SP-based and multiple-choice tests vary extensively, from near zero to one, though the average value is fairly high. True correlations between SP-based scores and ratings of clinical performance are also moderately high. These results are not particularly surprising. The
be built in through the test construction process, as long as a sufficiently large sample of SP cases is included. However, several "threats to validity" remain to be studied.

First, it is unclear if scoring procedures accurately translate examinee behavior into appropriate, meaningful scores. In general, published reports vaguely describe scoring as calculation of "percentage of possible points" obtained. The validity of such scores depends on the appropriateness of the items, the weighting attached to each, and other factors. The potential for omitting important items and including unimportant ones is great. The former penalizes examinees who take indicated actions that are not listed; the latter rewards examinees who are unjustifiably thorough, a common problem in scoring written and computer-based patient management problems.20

A related problem follows from the time pressure under which SP-based tests are commonly administered. Developers of written tests try to obtain information on the time required for examinees to comfortably complete the test. If insufficient time is allowed (i.e., the test is "speeded"), score interpretation is more difficult. A straightforward method for investigating "speededness" of SP-based tests would involve comparing performance on identical stations under varying time conditions. Given the diversity of stations and station formats, it would not be surprising if this factor were important in examinee performance, and more specific guidelines for station construction could result.

Because station scores depend on the judgment of observers, characteristics of examinees unre-
trainees should have mastered. Station construction requires concrete definition of performance criteria. Reaching a consensus on these items should lead to more standardized instructional experiences and learning outcomes, particularly if the consensus is carefully communicated to both faculty and students.55 Trainees view assessment methods as indicators of what faculty believe is important, and careful design of exams can exert a major influence on learning.

The only published empirical work on the educational impact of SP-based tests has been at the University of Adelaide. The original motivation for developing a practical examination of clinical skills was a general faculty perception that students were spending a disproportionate amount of time studying for written tests relative to participating in clinical work on the wards.54,56-58 A practical exam with SP-based components was introduced to improve the congruence between the educational goals of the medical school and the assessment methods used, anticipating that a shift in students' learning activities might result. To investigate the impact on student study habits, questionnaires were sent to students before and after the introduction of the practical exam.54 Questionnaire results indicated that the exam had a dramatic impact on how students spent their time, decreasing time spent preparing for written tests and increasing ward-based learning activities. In addition, students reported a generally high level of satisfaction with the practical exam and rated it as substantially more relevant than the written tests to the work of an intern. These results have persisted since the practical exam was introduced.58

Discussion

The purpose of this article was to review psychometric research on SP-based tests. In this final section, we summarize the major conclusions reached in the review, present some suggestions for improved use of SP-based tests, and provide methodological observations and recommendations.

Summary of Conclusions

The review was divided into three major areas: (a) reproducibility of scores and pass-fail decisions, (b) validity of score interpretation, and (c) educational impact of tests. This summary also follows that organization.

Lack of interrater agreement in scoring examinee behavior, inconsistency in SP performance, and variation in examinee performance across stations all affect the reproducibility of scores. Interrater reliability is adequate, regardless of rater background, as long as requisite training is provided. Use of multiple SPs playing the same patient role does not generally reduce reproducibility very much. Variation in examinee performance across stations has the largest impact on the reproducibility of scores. In most testing situations, if a sufficiently large sample of stations is included, the resulting sample of raters and SPs will also be large enough to obtain reproducible scores. Exceptions can occur in large-scale testing situations, when examinees are tested at different times and/or at different sites—when there is major departure from random assignment of raters and SPs to examinees. Four to 8 hr of testing time are required to obtain reproducible scores for hands-on clinical skills; longer tests are required if stations include written questions linked to SPs. Otherwise, station format and testing time per station appear to be relatively unrelated to reproducibility. Domain-referenced score interpretation requires longer tests than does a norm-referenced approach. Mastery testing, in which only pass-fail results are of interest, has the potential to reduce test length requirements and costs. However, better standard-setting procedures must be developed to realize this potential. More work is needed on procedures for adjusting (equating) scores obtained on alternate test forms.

Results of validation studies have, for the most part, been encouraging, though not particularly informative. Groups at different stages of training obtain appropriately different scores, and relationships between SP-based scores and traditional measures of clinical competence are fairly strong. Content validity of SP-based tests should be particularly good due to the realistic clinical tasks included as stations, although additional attention to domain definition and blueprinting seems merited. Additional efforts should be devoted to test validation; this could include research on scoring procedures, examinee perceptions of tasks posed by stations, effect of station speededness, and rater bias.

There has been much rhetoric but little empirical work on the educational impact of SP-based tests, aside from Newble's efforts at the University of Adelaide. More research in this area is needed, because the hypothesized educational impact of SP-based tests has been a major factor in their increased use, despite high costs and psychometric shortcomings.

Suggestions for the Improved Use of SP-Based Tests

Several practical suggestions for improving SP-based tests and conserving testing resources follow
directly from the psychometric conclusions emerging from the review:

1. Scores on short SP-based tests are not meaningful, because they are not sufficiently reproducible. Profiles of subscores based on a small number of stations or short segments of each station are also unstable and should generally not be used.
2. There is no need to have more than one rater per station. If extra raters are available, the number of stations should be increased instead.
3. The decision to use nonphysicians (usually SPs) or physicians as raters has practical and educational elements. If faculty physicians serve as raters, they receive useful feedback concerning curriculum effectiveness through observation of examinees. Nonphysicians can also be used as [. . .]

[. . .] unclear. The format and content of checklists and rating forms and the procedures used for scoring are almost never described. More work is needed in these areas, beginning with better descriptions of methods already in use.

Procedures used for reliability estimation were sometimes unspecified and, occasionally, appeared to be wrong. Use of generalizability theory in analysis is absolutely required, because multiple sources of measurement error are commonly present. Variance component estimates should be reported (along with the standard errors of the estimates), so that readers can explore alternative uses of testing resources, and researchers can integrate results across studies. If multiple subscores are calculated, results of generalizability analyses should be reported for all of them.
12. Ludbrook J, Marshall VR. Examiner training for clinical examinations. British Journal of Medical Education 1971;5:152-5.
13. van der Vleuten C, van Luyk S, Ballegooijen A, Swanson D. Training and experience of medical examiners. Medical Education 1989;23:290-6.
14. Dawson-Saunders B, Verhulst S, Marcy M, Steward D. Variability in standardized patients and its effect on student performance. In I Hart, R Harden (Eds.), Further developments in assessing clinical competence (pp. 451-8). Montreal: Can-Heal, 1987.
15. Hiemstra R, Scherpbier A, Roze B. Assessing history-taking skills or . . . simulated patients' peculiarities. In I Hart, R Harden (Eds.), Further developments in assessing clinical competence (pp. 491-6). Montreal: Can-Heal, 1987.
16. Vu N, Steward D, Marcy M. An assessment of the consistency and accuracy of standardized patients' simulations. Journal of Medical Education 1987;62:1000-2.
17. Cohen R, Rothman A, Ross J, et al. A comprehensive assessment of graduates of foreign medical schools (Internal Report, University of Toronto), 1988.
18. Tamblyn R, Schnabl G, Klass D, Kopelow M, Marcy M. How standardized are standardized patients? In Proceedings of the 27th Research in Medical Education Conference (pp. 148-53). Washington, DC: Association of American Medical Colleges, 1988.
19. Elstein A, Shulman L, Sprafka S. Medical problem solving. Cambridge, MA: Harvard University Press, 1978.
20. Swanson D, Norcini J, Grosso L. Assessment of clinical competence: Written and computer-based simulations. Assessment and Evaluation in Higher Education 1987;12:220-46.
21. Norcini J, Swanson D. Factors influencing testing time requirements for simulation-based measurements: Do simulations ever yield reliable scores? Teaching and Learning in Medicine 1989;1:85-91.
22. De Graaff E, Post G, Drop M. Validation of a new measure of clinical problem-solving. Medical Education 1987;21:213-8.
23. Erviti V, Templeton B, Bunce J, Burg F. The relationships of pediatric resident recording behavior across medical conditions. Medical Care 1980;18:1020-31.
24. Swanson D. A measurement framework for performance-based tests. In I Hart, R Harden (Eds.), Further developments in assessing clinical competence (pp. 13-45). Montreal: Can-Heal, 1987.
25. van der Vleuten C, van Luyk S, Swanson D. Reliability (generalizability) of the Maastricht Skills Test. In Proceedings of the 27th Research in Medical Education Conference (pp. 228-33). Washington, DC: Association of American Medical Colleges, 1988.
26. Newble D, Swanson D. Psychometric characteristics of the objective structured clinical examination. Medical Education 1988;22:325-34.
27. Stillman P, Swanson D, Smee S, et al. Psychometric characteristics of standardized patients for assessment of clinical skills (Final Report to the American Board of Internal Medicine), 1986.
28. Stillman P, Swanson D, Smee S, et al. Assessing clinical skills of residents with standardized patients. Annals of Internal Medicine 1986;105:762-71.
29. Stillman P, Regan M, Swanson D. A diagnostic fourth year performance assessment. Archives of Internal Medicine 1987;147:1981-5.
30. Petrusa E, Blackwell T, Parcel S, Saydjari C. Psychometric properties of the objective clinical exam as an instrument for final evaluation. In I Hart, R Harden, H Walton (Eds.), Newer developments in assessing clinical competence (pp. 181-91). Montreal: Heal, 1986.
31. van der Vleuten C, van Luyk S, Beckers A. A written test as an alternative to performance testing. Medical Education 1989;23:97-107.
32. Williams R, Barrows H, Vu N, et al. Direct, standardized assessment of clinical competence. Medical Education 1987;21:482-9.
33. Newble D, Elmslie R, Baxter A. A problem-based criterion-referenced examination of clinical competence. Journal of Medical Education 1978;53:720-6.
34. Stillman P, Swanson D. Ensuring the clinical competence of medical school graduates through standardized patients. Archives of Internal Medicine 1987;147:1049-52.
35. Colliver J, Verhulst S, Williams R, Norcini J. Reliability of performance on standardized patient cases: A comparison of consistency measures based on generalizability theory. Teaching and Learning in Medicine 1989;1:31-7.
36. Livingston S, Zieky M. Passing scores. Princeton, NJ: Educational Testing Service, 1982.
37. Angoff W. Scales, norms, and equivalent scores. Princeton, NJ: Educational Testing Service, 1984.
38. Petersen N, Kolen M, Hoover H. Scaling, norming, and equating. In R Linn (Ed.), Educational measurement (pp. 221-62). New York: American Council on Education and Macmillan, 1989.
39. Kolen M. Traditional equating methodology. Educational Measurement: Issues and Practice 1988;7:29-36.
40. Cronbach LJ. Test validation. In RL Thorndike (Ed.), Educational measurement (pp. 443-507). Washington, DC: American Council on Education, 1971.
41. Dawes R, Corrigan B. Linear models in decision making. Psychological Bulletin 1974;81:95-106.
42. Newble D, Hoare J, Elmslie R. The validity and reliability of a new examination of the clinical competence of medical students. Medical Education 1981;17:165-71.
43. Stillman P, Rutala P, Nicholson G, Sabers D, Stillman A. Measurement of clinical competence of residents using patient instructors. In Proceedings of the 21st Research in Medical Education Conference (pp. 111-6). Washington, DC: Association of American Medical Colleges, 1982.
44. Robb K, Rothman A. The assessment of clinical skills in general internal medicine residents—Comparison of the objective structured clinical examination to a conventional oral examination. Annals of the Royal College of Physicians and Surgeons of Canada 1985;18:235-8.
45. Maatsch J. Model for a criterion-referenced medical specialty test (Final Report, Grant No. HS-02038-02). East Lansing: Michigan State University, Office of Medical Education Research and Development, 1980.
46. Maatsch J. Theories of clinical competence: The construct validity of objective tests and performance assessments. Paper presented at the International Conference on Evaluation in Medical Education, Beer Sheva, Israel, 1987.
47. Maatsch J, Huang R. An evaluation of the construct validity of four alternative theories of clinical competence. In Proceedings of the 25th Research in Medical Education Conference (pp. 69-74). Washington, DC: Association of American Medical Colleges, 1986.
48. Ebel R. Must all tests be valid? American Psychologist 1961;16:640-7.
49. Ebel R. Measuring educational achievement. Englewood Cliffs, NJ: Prentice-Hall, 1965.
50. Kane M. The validity of licensure examinations. American Psychologist 1982;37:911-8.
51. Frederiksen N. The real test bias: Influences of testing on teaching and learning. American Psychologist 1984;39:193-202.
52. Entwistle N. Styles of learning and teaching. Chichester, England: Wiley, 1981.
53. Frederiksen J, Collins A. A systems approach to educational testing. Educational Researcher 1989;18(9):27-32.
54. Newble D, Jaeger K. The effect of assessments and examinations on the learning of medical students. Medical Education 1983;17:165-71.
55. Bouhuijs P, van der Vleuten C, van Luyk S. The OSCE as a part of a systematic skills-training approach. Medical Teacher 1987;9:183-91.
56. Newble D. The assessment of clinical competence—A perspective from "down under." In I Hart, R Harden (Eds.), Newer developments in assessing clinical competence (pp. 40-5). Montreal: Heal, 1986.
57. Newble D. Improving the clinical and oral examination process. In I Hart, R Harden (Eds.), Further developments in assessing clinical competence (pp. 88-98). Montreal: Can-Heal, 1987.
58. Newble D. Eight years' experience with a structured clinical examination. Medical Education 1988;22:200-4.
59. Rainsberry P, Grava-Gubins I, Khan S. Reliability and validity of oral examinations in family medicine. In I Hart, R Harden (Eds.), Further developments in assessing clinical competence (pp. 399-405). Montreal: Can-Heal, 1987.
60. Grava-Gubins I, Khan S, Rainsberry P. Factor analysis of simulated office oral examinations in family medicine. In I Hart, R Harden (Eds.), Further developments in assessing clinical competence (pp. 406-17). Montreal: Can-Heal, 1987.
61. Conn H. Assessing the clinical skills of foreign medical graduates. Journal of Medical Education 1986;61:863-71.
62. Cody R. Additional analysis for the 1987 administration of the clinical skill exam (Internal report, Educational Commission for Foreign Medical Graduates), 1988.
63. Conn H, Cody R. Results of the second clinical skills assessment examination of the ECFMG. Academic Medicine 1989;64:448-53.
64. Klass D, Hazzards T, Kopelow M, Tamblyn R, Barrows H, Williams R. Portability of a multiple station, performance based assessment of clinical competence. In I Hart, R Harden (Eds.), Further developments in assessing clinical competence (pp. 434-42). Montreal: Can-Heal, 1987.
65. Stillman P, Regan M, Swanson D, et al. An assessment of the clinical skills of New England fourth year medical students. Academic Medicine, in press.
66. Templeton B, Best A, Samph T, Case S. Short-term outcomes achieved in interviewing medical students (Internal report, National Board of Medical Examiners), 1978.
67. Barrows H, Williams R, Moy R. A comprehensive [. . .] assessing clinical competence (pp. 425-33). Montreal: Can-Heal, 1987.
69. Petrusa E, Guckian J, Perkowski L. A multiple station objective clinical evaluation. In Proceedings of the 23rd Research in Medical Education Conference (pp. 211-6). Washington, DC: Association of American Medical Colleges, 1984.
70. Petrusa E, Blackwell T, Rogers L, Saydjari C, Parcel S, Guckian J. An objective measure of clinical performance. American Journal of Medicine 1987;83:34-42.
71. Petrusa E, Blackwell T, Ainsworth M. Performance of internal medicine house officers on a short station OSCE. In I Hart, R Harden (Eds.), Further developments in assessing clinical competence (pp. 598-608). Montreal: Can-Heal, 1987.
72. Petrusa E. Collaborative Project to Improve the Evaluation of Clinical Competence (Final report to the National Fund for Medical Education), 1988.
73. Cohen R, Rothman A, Ross J, et al. Comprehensive assessment of clinical performance. In I Hart, R Harden (Eds.), Further developments in assessing clinical competence (pp. 624-8). Montreal: Can-Heal, 1987.
74. Norman G, Tugwell P, Feightner J. A comparison of resident performance on real and simulated patients. Journal of Medical Education 1982;57:708-15.
75. Norman G, Neufeld V, Walsh A, Woodward C, McConvey G. Measuring physicians' performance by using simulated patients. Journal of Medical Education 1985;60:925-34.
76. Sanson-Fisher R, Poole A. Simulated patients and the assessment of medical students' interpersonal skills. Medical Education 1980;14:249-53.
77. Owen A, Winkler R. General practitioners and psychosocial problems: An evaluation using pseudopatients. Medical Journal of Australia 1974;2:393-8.
78. Burri A, McCaughan K, Barrows H. The feasibility of using the simulated patient as a means to evaluate clinical competence of practicing physicians in a community. In Proceedings of the 15th Research in Medical Education Conference (pp. 295-9). Washington, DC: Association of American Medical Colleges, 1976.
79. Renaud M, Beauchemin J, LaLonde C, Poirier H, Berthiaume S. Practice settings and prescribing profiles: The simulation of tension headaches to general practitioners working in different practice settings in the Montreal area. American Journal of Public Health 1980;70:1068-73.
80. Rethans J, van Boven C. Simulated patients in general practice: A different look at the consultation. British
performance-based assessment of fourth-year students' Medical Journal 1987;294:809-12.
clinical skills. Journal of Medical Education 1987;62:
805-9.
68. Williams R, Barrows H. Performance-based assessment of
clinical competence using clinical encounter multiple sta-
tions. In I Hart, R Harden (Eds.), Further developments in Received 9 June 1989