See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/9014051

Cognitive, Social and Environmental Sources of Bias in Clinical Performance Ratings

Article in Teaching and Learning in Medicine · February 2003
DOI: 10.1207/S15328015TLM1504_11 · Source: PubMed
Citations: 245 · Reads: 263



SPECIAL ARTICLE

Cognitive, Social and Environmental Sources of Bias in Clinical Performance Ratings
Reed G. Williams
Department of Surgery
Southern Illinois University School of Medicine
Springfield, Illinois USA
Debra A. Klamen
Department of Psychiatry
University of Illinois at Chicago
Chicago, Illinois USA
William C. McGaghie
Office of Medical Education and Faculty Development
Northwestern University
Feinberg School of Medicine
Chicago, Illinois USA

Background: Global ratings based on observing convenience samples of clinical performance form the primary basis for appraising the clinical competence of medical students, residents, and practicing physicians. This review explores cognitive, social, and environmental factors that contribute unwanted sources of score variation (bias) to clinical performance evaluations.

Summary: Raters have a 1- or 2-dimensional conception of clinical performance and do not recall details. Good news is reported more quickly and fully than bad news, leading to overly generous performance evaluations. Training has little impact on accuracy and reproducibility of clinical performance ratings.

Conclusions: Clinical performance evaluation systems should assure broad, systematic sampling of clinical situations; keep rating instruments short; encourage immediate feedback for teaching and learning purposes; encourage maintenance of written performance notes to support delayed clinical performance ratings; give raters feedback about their ratings; supplement formal with unobtrusive observation; make promotion decisions via group review; supplement traditional observation with other clinical skills measures (e.g., Objective Structured Clinical Examination); encourage rating of specific performances rather than global ratings; and establish the meaning of ratings in the manner used to set normal limits for clinical diagnostic investigations.

Teaching and Learning in Medicine, 15(4), 270–292 Copyright © 2003 by Lawrence Erlbaum Associates, Inc.

Correspondence may be sent to Reed G. Williams, Ph.D., Department of Surgery, Southern Illinois University School of Medicine, P.O. Box 19638, Springfield, IL 62794–9638 USA. E-mail: rwilliams@siumed.edu

Global ratings based on observing convenience samples of clinical performance in vivo form the primary basis for appraising the clinical skills of medical students, residents, and practicing physicians (hereafter called practitioners). Global ratings are subject to several sources of bias that have not been the focus of earlier research and reviews because they involve cognitive, social, and environmental factors attendant to the ratings rather than measurement problems such as instrument design. Research shows that features of measurement instruments account for no more than 8% of the variance in performance ratings.1

This review article aims to explore the cognitive, social, and environmental factors that contribute unwanted sources of score variation (bias) to ratings of practitioners. Such bias compromises the value of performance ratings and constrains inferences about the clinical competence of practitioners. To start this, we set the stage by defining clinical competence and describing characteristics that are pertinent to this review. Later, we describe characteristics of the natural clinical environment that lead to unwanted variation in clinical performance ratings.

Next, we establish four quality criteria that clinical ratings must meet to allow strong inferences about medical clinical competence. We also summarize psychometric data from the performance assessment literature in medicine, business, and psychology that document how performance ratings achieve these four criteria and, in the process, better define these criteria.

We then summarize the evidence about cognitive, social, and environmental factors that can bias evaluator ability to make valid and useful inferences about competence from performance ratings. The goal is to identify and describe factors worthy of attention because they are based on well-designed and conducted research studies and, in most cases, have stood the test of replication. It is not possible to describe all studies and findings in detail. Interested readers will have the references to allow further exploration as desired.

Finally, we present a set of 16 recommendations for clinical performance assessment practice and for research based on our analysis of performance rating problems and our review of the literature.

The article is not a review using a single, standard, and highly specific set of methodological inclusion criteria,2–4 nor is it an exhaustive review of the performance assessment research literature. Instead, it focuses selectively on cognitive, social, and environmental sources of error that compromise the value of clinical performance ratings as typically collected and the inferences about clinical competence that are based on this information. We try to be comprehensive in reviewing research studies that address these sources of error and to summarize the results when the research design of the studies and the preponderance of evidence earned our confidence. We cite a combination of field studies that document the extent of the problem, and experimental studies (randomized controlled trials) that allow us to disentangle the elements of the problem and document their relative strength. Using the taxonomy of literature reviews proposed by Cooper,5 the focus of this review is research findings; the goals are to formulate general statements from multiple studies, to bridge the gap in research literature from disciplines involved in performance assessment research, and to identify and refine key issues central to this field of practice and research; the perspective is to provide a neutral representation of the findings; the review describes a purposeful sample of the articles that represent the extensive literature we reviewed; the organization of the review is around a conceptual framework that we developed to guide future practice and research; and the intended audiences are those individuals responsible for the practice of clinical performance assessment and those who do research in this area. Hopefully, the review will also help those practitioners and researchers who focus on performance assessment outside the field of medicine. All factors discussed and recommendations made are based on findings from sound research studies that have also stood the test of replication, taking advantage of meta-analyses when available.

Clinical Competence Defined

Michael Kane6 defined clinical competence as the degree to which an individual can use the knowledge, skills, and judgment associated with the profession to perform effectively in the domain of possible encounters defining the scope of professional practice (p. 167). This definition highlights a key point about clinical competence appraisal. Evaluators are interested not only in how the clinician performs in observed situations (i.e., judging clinical performance) but also as grounds for generalizing about the practitioner's ability to perform a variety of other tasks in a range of similar clinical situations (i.e., estimating clinical competence).

Estimating a practitioner's professional competence requires at least three inferences: (a) judging the quality of the observed performance, (b) drawing general conclusions about the person's ability to perform in a universe of similar situations based on the observer's judgment of the observed performance (Kane's6 example: drawing general conclusions about skill in delivering babies from the observation of one or two deliveries), and (c) extrapolating from the appraisal context to expected performance in practice situations where the practitioner is not being observed. This third distinction, between what a practitioner can do and what the practitioner typically will do in practice, introduces the contribution of practitioner personal qualities (e.g., dependability, responsibility, ethical beliefs) to the mix of factors (knowledge, skills, and professional judgment) that shape practice behavior.

Clinical Competence Characteristics

Multidimensionality

The term clinical competence leads one to think in terms of a unitary performance capability. This oversimplifies medical practice. Most patient encounters require the practitioner to integrate and perform at least 10 separate component capabilities.7 These include (a) discerning what information is needed to clarify a patient's problem; (b) collecting data about the patient (history taking, physical examination, diagnostic tests, procedures); (c) organizing, integrating, and interpreting patient data to derive working diagnoses and refining them as more information becomes available; (d) establishing management goals and evaluating the effectiveness of treatment, adjusting management as needed; (e) performing procedures required for patient management; (f) communicating and relating effectively with patients and their families; (g) working effectively with other health professionals under the range of conditions that make up clinical practice; (h) critically appraising new information for adequacy, quality, and veracity; (i) monitoring personal knowledge and skills and updating these in an efficient and effective manner as needed; and (j) performing consistently in a dependable and responsible manner.

Case Specificity

Research on clinical problem solving,8 clinical performance assessment (primarily using standardized patient technology),9–12 and clinical practice has shown consistently that clinical performance is case specific. Practitioners who excel at diagnosing and managing one clinical problem are not necessarily skillful at diagnosing even closely related clinical situations.13 Practitioners differ in effectiveness in different clinical situations due to differences in experience, training, professional interest, and personal disposition.

Van der Vleuten and Swanson9 summarized the findings from 29 reports of large-scale standardized patient-based examinations of clinical performance conducted on residents and medical students in the 1980s. They concluded that the major source of score variation is due to variation in examinee performance from situation to situation (case to case), and that 16 hr of performance observation across a variety of clinical situations is necessary to achieve reproducible estimates of clinical competence. The studies that formed the basis for van der Vleuten and Swanson's9 conclusions involved (a) carefully selected situations and tasks where observation was tightly controlled; (b) instruments and rater training that assured commonality in rating focus; and (c) ratings that occurred immediately after the performance, limiting forgetting and memory distortion. Because this estimate was made under nearly ideal conditions, it is likely that more hours of performance observation would be required to achieve reproducible estimates of clinical competence in less controlled clinical settings. The van der Vleuten and Swanson9 review suggests that appraisals designed to determine whether clinical performance is acceptable rather than just determine how practitioners rank relative to each other will require observation of a broader clinical sample because the difficulty of the observed encounters must be taken into account.

Field research14,15 involving ratings of physician performance also suggests case specificity. Carline et al.15 recruited 269 internists from six Western states to participate in a study designed to validate the American Board of Internal Medicine (ABIM) certification process. Six professional colleagues (two internists, two surgeons, one head nurse, and one medical director or chief of staff) nominated by each physician were asked to evaluate the performance of that physician in eight areas of clinical competence and to provide a rating of overall clinical skills. The return rate averaged approximately 78% for the raters. Results of this study suggested that 10 to 17 ratings by peers are necessary to yield reproducible global ratings of competence in each competency area. Kreiter et al.16 did a similar study to investigate the measurement characteristics of ratings of medical student clerkship performance using standard clinical evaluation forms. Again, their results documented that ratings depended heavily on the individual rater to whom the student was assigned and on the nature of the patient–student encounter observed. They used decision study results to estimate the reliability of ratings for panels of up to eight raters per student. Panels of eight raters resulted in reliability estimates of at best 0.65. More than eight raters would be required to achieve reliabilities of 0.80, but the article does not allow us to be more precise. Clinical rating research is clear that many observations are needed, and attention needs to be directed toward assuring systematic sampling across the range of normal clinical encounters and tasks.
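
Although Kreiter et al.16 do not report a precise figure, a rough sense of the required panel size can be obtained from the Spearman–Brown formula. The projection below is our illustration based on the reported values, not a calculation taken from that article:

    \rho_1 = \frac{\rho_k}{k - (k-1)\rho_k} = \frac{0.65}{8 - 7(0.65)} \approx 0.19,
    \qquad
    k^{*} = \frac{\rho^{*}(1 - \rho_1)}{\rho_1(1 - \rho^{*})} = \frac{0.80 \times 0.81}{0.19 \times 0.20} \approx 17,

where \rho_k is the reliability of the mean of k ratings and \rho^{*} = 0.80 is the target. A projection of roughly 17 raters under the same rating design would be consistent with the 10 to 17 peer ratings reported by Carline et al.15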

Context

The typical performance assessment context is an uncertain, uncontrolled, and "noisy" information environment. The most common performance assessment approach is to observe, rate, and judge a practitioner's clinical performance in a natural clinical setting. The practitioner is observed performing routine clinical tasks under real clinical conditions. A global rating scale containing relatively nonspecific items designed to be used in a range of clinical situations is employed to direct the observer's attention to common, important aspects of clinical performance and to calibrate the ratings of performance quality. The result of this general approach is that practitioners are observed working up cases that vary in difficulty and in required skill. Likewise, performance standards used for the case may vary because only one expert observes the doctor–patient encounter, and the expert rarely studies the case in depth. Consequently, this method provides a weak basis for comparing clinicians because the practitioners perform different tasks, and the standards used by raters are idiosyncratic.

Contextual noise contributes error (unreliability) to performance measurements. The rater usually has many responsibilities beyond supervision, training, and evaluation of practitioners. As a result, the rater's database about a medical practitioner's clinical performance is often fragmentary and restricted to a small, unsystematic sample of clinical situations and tasks due to limited direct personal contact. Given what has been written about the multidimensionality and case specificity of clinical competence, problems produced by using small, unsystematic, clinical performance samples are magnified. In addition, performance expectations are rarely articulated for the rater or the practitioner. A lack of shared understanding of responsibilities may influence clinical performance ratings. Also, clinical care in hospital and outpatient settings is usually a team responsibility. Therefore, it is often difficult to isolate and appraise the contributions and clinical competence of individual practitioners.

Quality of Clinical Ratings

Observational data must fulfill four quality criteria. First, the observational instrument and process must be accurate. That is, observations must yield scores that are defined by the behavior being observed and recorded, nothing else. Second, the observers must agree. A rating instrument may be capable of accurate use, but rater agreement depends on how it is actually used by observers in a field setting. If only one rater is using the instrument, intrarater reliability is the only issue; but this is rarely the case. When multiple raters use the instrument, interest also resides in agreement among different raters when observing the same performance. The third criterion concerns the ability to generalize from a set of observations to a range of pertinent professional performance situations. As described in the discussion of Kane's6 definition of professional competence, the goal of measurement and evaluation of clinical performance is to formulate accurate inferences about clinical competence generally from a series of observations of particular instances of performance. The third criterion addresses this issue. It asks, "To what extent do observations of this set of performances predict how that individual will perform in other situations?"

In addition to accuracy, rater agreement, and the ability to generalize across situations, the focus must include the fourth criterion: utility. Utility addresses the question about whether the generalizations are useful. Does knowing the rating score have any value in predicting future or concurrent performance of interest? Does knowing the score aid in making progress, teaching and learning, licensure, or certification decisions? If not, the ratings have no real value. Utility data also provide the basis for setting meaningful performance standards.

Accuracy

In a randomized, controlled trial designed to determine the accuracy of faculty evaluations of residents' clinical skills, Noel and colleagues17 had 203 faculty internists watch a videotape of one of two residents performing a new patient work-up on one of two patients. The faculty internists then immediately rated the clinical skills of the resident. One half of the raters used an open-ended evaluation form, and the other half used a structured form that prompted detailed observations of selected clinical skills. The unique contribution of this study was that it investigated and compared attending ratings of the same clinical performance. When observations were not specifically directed through use of a structured form, a condition typical of traditional clinical performance rating practice, participants recorded only 30% of known resident strengths and weaknesses. Strengths were omitted more frequently than weaknesses, an outcome consistent with findings from the general performance assessment literature.18,19 Rater accuracy increased to 60% or greater when participants used structured forms designed to direct attention to specific dimensions of performance. Comparable results were obtained in an earlier, smaller, randomized controlled trial.20

The only other study that addressed accuracy in a direct experimental manner was a study by Pangaro and associates21 done to document the accuracy of standardized patient-produced checklist scores. In this study, internal medicine residents were trained to act as ordinary candidates and to achieve target scores by performing to a predetermined "strong" or "weak" level on checklist items used by raters to record interviewing, physical examination, and communication skills. The strong standardized examinees (SEs) were trained to score 80% correct, and the weak SEs were trained to score 40% correct. Seven SEs took the Standardized Patient (SP) examination along with regular candidates. The mean score recorded by the raters was 77% for the strong SEs and 44% for the weak SEs. A videotape review revealed that most scoring discrepancies were in the area of communication skills.

Achieving the increases in accuracy reported by Noel and colleagues17 and by Herbers and colleagues20 would require standardized patient encounters or, at least, use of rating scales specific to the case being observed and appraised (i.e., rating scales specific to the chief complaint of patients). Obviously, this would come at added cost, extra effort, and added logistical complexity.

There is usually a time lag between observation of clinical behavior and rating of that behavior in the real world of performance evaluation. Delay introduces error into ratings. We know of no controlled study of the effects of time lag on accuracy of clinical ratings, but laboratory studies reported in the business literature and a survey of psychologists outlined the nature and magnitude of the relation.

In an experimental study with 180 participants, Heneman22 demonstrated that ratings recorded 1 to 3 weeks after observation were less accurate than ratings recorded immediately. However, the decrease in accuracy was not large. The delay accounted for only 2% of the variance in rating accuracy.

A related but "soft" source of evidence confirms the importance of immediate recording to maximize rating accuracy. Kassin et al.23 conducted a survey of 64 psychologists who are frequently asked to testify about the reliability of eyewitness testimony. The survey asked the psychologists to list findings that are adequately documented to present in court. The expert psychologists agreed there is sufficient evidence to support the conclusion that event memory loss is greatest right after the event and then levels off with time.

Rater Agreement (Intra- and Interrater Agreement)

Intrarater agreement is based on having a single rater evaluate the same (e.g., videotaped) performance on at least two different occasions. Differences in ratings reflect changes in rater attention, perspective, standards, or mood. These studies provide a good estimate of how much variability is introduced into a performance measurement as a result of these factors.

Kalet et al.24 had faculty raters rerate videotapes of 21 medical student interviews after a period of 3 to 6 months. They found substantial agreement between the earlier and later ratings only when raters judged the information obtained (i.e., data collection) during the interview. Even here, the level of agreement was far from perfect. There was little to no agreement in the earlier and later ratings of the interviewing process.

Interrater agreement is based on having different raters evaluate the same performance. Under these circumstances, differences in ratings reflect differences in rater focus and rigor as well as the sources of variability addressed in intrarater reliability studies.
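
One common way to make this distinction concrete (a standard classical-test-theory sketch, not a formulation taken from the studies reviewed here) is to write a rating as the sum of a ratee effect, a stable rater effect, and occasion-specific error, X = T + R + E. Then, approximately,

    \text{intrarater reliability} \approx \frac{\sigma_T^2 + \sigma_R^2}{\sigma_T^2 + \sigma_R^2 + \sigma_E^2},
    \qquad
    \text{interrater reliability} \approx \frac{\sigma_T^2}{\sigma_T^2 + \sigma_R^2 + \sigma_E^2},

so the gap between the two coefficients reflects the stable rater-specific variance \sigma_R^2 (focus, standards, leniency). The Viswesvaran et al.30 figures reported later in this section (0.81 vs. 0.52) illustrate the size of that gap in field ratings.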

Weekley and Gier25 examined ratings gathered from world-class figure skating judges during the 1984 Olympics. These results provide some indication of the theoretical upper limit on interrater agreement using human judges. Olympic judges have at least 15 years of judging experience. Criteria for judging are clearly spelled out, and judges are periodically trained and retrained to use them. Nine judges are employed to observe and appraise the same performance. Olympic judges have no conflicting responsibilities during judging, and they receive continuous feedback about the consistency of their ratings with those of other judges because all ratings are posted immediately. This serves as a form of feedback that can refine agreement among future ratings and also is likely to increase the motivation to provide accurate ratings.

Interrater reliability coefficients were consistently high, ranging from 0.93 to 0.97 for the full complement of judges. More detailed analyses demonstrated that the judges were able to agree in their evaluation of the skaters, but primarily in a global sense. Judges' ratings of skaters did not differ much for the two dimensions rated, technical merit and artistic impression. The rating differences between technical merit and artistic impression were larger in the short program where skaters were required to perform prescribed maneuvers, reinforcing the view that careful definition of tasks performed and rated helps raters to differentiate among performance dimensions. These results suggest that the ceiling on rating reliability using human judges is high. Virtually all of the variability in ratings was attributable to differences in skating performance.

Elliot and Hickam26 had groups of three physician faculty members observe the same student while the student performed a physical examination on a patient. As in the Weekley and Gier25 study, observers were familiar with the task, had limited environmental interruptions to disrupt the observation, and used a checklist with careful description of the tasks and finite classification categories for rating performance. Elliot and Hickam26 found that observers did not agree when judging 17 of the 53 physical examination maneuvers performed (32%). There was good interrater agreement for 18 of the remaining maneuvers and moderate agreement for the other 18. Agreement was lower on items requiring palpation and percussion, characteristics that are not subject to complete evaluation by observation alone. Elliot and Hickam26 also reported that trained nonphysician observers were able to reliably evaluate 83% of the skills that were reliably assessed by faculty.

Kalet et al.24 also had four judges each rate medical interviews. They concluded that interrater agreement was poor when judging the overall interview as well as when judging the information obtained and the interviewing process.

In the Noel et al.17 study described earlier, medical faculty raters were also asked to appraise the overall clinical performance of the resident. Raters disagreed about the residents' overall clinical performance. For the first resident, 5% of the judges rated the overall clinical performance as unsatisfactory, whereas 5% rated the performance as superior. Twenty-six percent rated the performance as marginal, and 64% rated the performance as satisfactory. For the second resident, the ratings were 6% unsatisfactory, 42% marginal, 49% satisfactory, and 3% superior. Agreement in overall ratings was not improved among raters who filled out the structured form. These results strongly suggest that judges use different criteria and standards to judge clinical performances. Therefore, multiple ratings need to be collected and combined to assure a fair and representative overall appraisal of clinical performance. These results are a conservative estimate of actual clinical performance assessment problems because the raters in this study all observed the entire performance and were able to complete their appraisals immediately under circumstances where no competing responsibilities interfered. Again, parallel findings were observed in the Herbers et al.20 study.

Martin and colleagues27 compared the score established by individual physician raters with a gold standard based on the average score of three physician raters who rated the same encounters from a videotaped record. This allowed the investigators to determine the absolute score deviation from the gold standard as well as conduct correlational analyses that compare relative rankings of individuals. The average correlation of the individual physician raters with the gold standard was approximately 0.54 with a fair amount of variation among clinical situations. The unique contribution of this study was to document that individual physician scores were almost always overestimates of true clinical competence (estimated by the gold standard consensus ratings). This tendency to give medical practitioners the benefit of the doubt has also been established in a study by Vu and colleagues.28

MacRae and colleagues29 compared physician ratings of 120 videotaped medical student clinical encounters across four cases using an eight-item rating scale covering information acquired and interviewing process. They found that the interrater reliability coefficients ranged from 0.65 to 0.93 with an average reliability of 0.85. These results demonstrate that good interrater agreement is possible when judging clinical performance. One factor that probably contributed to the high level of agreement is that the raters collaborated in developing the rating scales used in this study, resulting in a shared definition of good clinical performance.

The differences in overall appraisal of the same performance documented in this section are most likely due to the combination of two factors. First, different raters focus on (differentially weight) different aspects of clinical performance. Second, different raters have different expectations about levels of acceptable clinical performance. Anything that reduces these differences, such as in the MacRae et al.29 study, tends to improve interrater agreement.

Accuracy in Estimating Competence (Estimating From a Discrete Set of Observations to Performance Across a Range of Tasks and Situations)

Interrater agreement studies using ratings by judges who have observed different practitioner performances combine variance due to differences in judges' expectations with variance due to differences in demands (task and case characteristics) of the observed sample of behavior. Viswesvaran et al.30 used meta-analytic methods to investigate interrater and intrarater reliabilities of job performance ratings reported in 15 primary organizational psychology and business journals from their inception through January 1994. This resulted in studying 215 research reports. Viswesvaran et al.30 established that the mean interrater reliability estimate for ratings by supervisors equaled 0.52, with the majority of interrater reliability coefficients falling between 0.50 and 0.54. They reported that the mean intrarater reliability coefficient was 0.81, with most coefficients ranging from 0.78 to 0.84. Based on the differences between inter- and intrarater reliability coefficients, Viswesvaran and associates30 concluded that approximately 29% of the variance in ratings is attributable to differences in the perspective of different raters, including the sample of behavior they observed; whereas 5% is due to variations in a single rater attributable to changes in moods, frames of reference, or other variables.

Recently, authors of a number of studies on clinical performance evaluation have been influenced by generalizability theory and analytic methods. This work has established the number of ratings needed to achieve a generalizability coefficient of 0.80 that is accepted as the minimum desired level for use in making progress decisions. Carline et al.14 collected 3,557 ratings of 328 medical students who had taken an internal medicine clerkship. They concluded that 7 independent ratings would be needed to judge overall clinical performance. The amount of rater experience did not change the results. Similar results have been reported by Kreiter and colleagues16 (more than 8 ratings needed), Kwolek et al.31 (7–8 ratings), Ramsey et al.32 (11 ratings), Violato et al.33 (10 ratings), and Kroboth et al.34 (6–10 ratings). All reports agree that somewhere between 7 and 11 ratings are necessary to achieve a generalizable global estimate of competence when raters are basing ratings on a nonsystematic sample of observations.
due to the combination of two factors. First, different ple of observations.
raters focus on (differentially weight) different aspects Although there has been only limited research on
of clinical performance. Second, different raters have the component skills included under the broader cate-
different expectations about levels of acceptable clini- gory of clinical competence, it is reasonable to expect
cal performance. Anything that reduces these differ- that these abilities will develop at different rates and
ences, such as in the MacRae et al.29 study, tends to im- may differ in their stability across situations. The work
prove interrater agreement. that has been done suggests that different numbers of

275
WILLIAMS, KLAMEN, MCGAGHIE

observations will be required to establish stable esti- clinical competence. The results in the studies reviewed
mates of clinical competence in various clinical com- here suggest that a minimum of 7 to 11 observer ratings
petence areas. Carline et al.15 established that the num- will be required to estimate overall clinical competence.
ber of ratings required to achieve stable estimates of More ratings will be necessary to obtain stable estimates
competence in each of the eight areas reflected in their of competence in more specific competence domains
items ranged from 10 to 32 ratings, with an average of (e.g., data collection, communication, responsible pro-
18. The areas where the least agreement occurred in- fessional behavior), with the most ratings (up to 30)
cluded ratings of (a) verbal presentations; (b) written needed in the noncognitive areas.
communications; and (c) management of patients with A few studies have been reported that investigate
simple, uncomplicated problems (probably because using nurses and patients to appraise aspects of clinical
these are managed in the office where care is less sub- competence. We have included these studies under the
ject to observation than in the hospital). Because this heading of Accuracy in Estimating Competence be-
investigation was a field study, it is not possible to de- cause we believe that these ratings broaden the dimen-
termine whether these findings are due to less stable sions of competence observed more than they change
practitioner performance or to rater variability associ- the characteristics of raters. Wenrich and associates35
ated with lack of rater opportunity to observe. Carline compared ratings of physicians by nurses and physi-
et al.14 also reported that the number of independent cian peers. They found that the correlation among
ratings needed to reliably judge various individual as- these ratings for individual clinical performance com-
pects of clinical performance varied dramatically. To ponents ranged from 0.46 to 0.72, with the average cor-
illustrate the range of requirements, these investigators relation being 0.59. Nurses’ ratings were generally
reported that 7 ratings would be needed to judge data lower than those of physician peer raters for humanis-
gathering reliably, whereas 27 would be needed to rate tic qualities, and their ratings were higher for medical
interpersonal relationships with patients. Wenrich et knowledge and verbal communication. These results
al.35 observed similar results when studying nurses’ suggest that nurses’ ratings are not a replacement for
ratings of internists. They established that somewhere physician ratings. However, nurses ratings do expand
between 11 and 24 nurse ratings would be necessary to the number of perspectives incorporated into the esti-
establish a generalizable estimate (generalizability co- mates of clinical competence and broaden the range of
efficient = 0.80) of clinical competence depending on clinical situations and competencies observed and
the competence dimension being investigated. Inter- judged. Nurses also are more likely to have the oppor-
personal and professional behavior characteristics re- tunity to observe how practitioners typically perform.
quired more ratings than did cognitive and technical at- Tamblyn and associates38 investigated the feasibil-
tributes. Meta-analytic reviews of the general ity and value of using satisfaction ratings obtained
performance assessment literature30,36 produced simi- from patients to evaluate the doctor–patient relation-
lar findings. More ratings are needed to provide stable ship skills of medical residents. They found that patient
estimates of communication and interpersonal compe- satisfaction ratings provide valuable information about
tence than for specific skills (e.g., clinical skills). a resident’s ability to establish an effective doctor–pa-
We believe that the discrepancies in number of rat- tient relationship. However, these investigators noted
ings needed for interpersonal and professional behav- that approximately 30 patient ratings would be neces-
ior and cognitive and technical attributes are best ex- sary to obtain a generalizable estimate of resident skill
plained by the conceptual framework offered by in these areas (0.80 generalizability coefficient), al-
Ginsburg and associates.37 These investigators noted though a smaller number of ratings would be enough to
that we tend to talk about people as being honest or dis- detect outliers. The larger number of required ratings
honest, professional or unprofessional rather than talk speaks to the increased heterogeneity of patient raters
about their behaviors. In other words, professionalism and raises feasibility issues. However, the use of pa-
is thought of as a set of stable traits. They argued in- tient satisfaction ratings provides the perspective of the
stead that professional behavior is much more context client. Church’s39 findings from a business environ-
dependent than has usually been acknowledged. The ment suggested that service providers and their clients
implication of this is that generalizations about profes- have different perceptual frames of reference than in-
sional behavior require sampling much larger numbers ternal observers. In this regard, the addition of infor-
of events (contexts and behaviors) than is true for some mation from patients should provide an important per-
other aspects of competence. More research regarding spective that has not often been captured
the nature of aspects of clinical competence will help systematically.
to increase our understanding of these individual com-
petencies and will allow further refinement of perfor-
Utility
mance assessment practice.
It is clear that multiple observations of students, resi- The bottom line for any measurement instrument
dents, and physicians are needed to reliably estimate revolves around its usefulness. A useful clinical perfor-

276
CLINICAL PERFORMANCE RATINGS

mance measure is one that successfully estimates (pre- gold standard problem is evaluation of a single per-
dicts) performance in situations that we care about. formance by a group of expert judges who all observe
There are very few studies that provide insight into the same performance and then rate the performance
how useful clinical ratings are for accomplishing this independently (perhaps using videotape). A third,
goal. more common approach in traditional clinical perfor-
Hojat et al.40 investigated how effectively medical mance assessment involves a comparison of the rat-
student clerkship ratings predicted residency supervi- ings of a group of expert judges who observed the
sors’ ratings of clinical performance. The investigators same practitioner across an uncontrolled and undocu-
sought to determine if clerkship ratings provided infor- mented range of situations and tasks in natural envi-
mation that was helpful in predicting which students ronments. Although this approach has most of the
performed poorly or well during their first year of resi- weaknesses described earlier, it is the most prevalent.
dency. They found that 42% of students who received Cronbach48 contributed to refining this approach con-
clerkship ratings that put them in the bottom 25% on ceptually by proposing that rater performance be ana-
global ratings of clinical competence were also in the lyzed to determine whether (a) an individual rater
bottom fourth based on residency program ratings of tends to rate too high or too low averaged across all
data gathering and processing skills. Only 17% of ratees and items (leniency–stringency), (b) the rater
those in the bottom fourth on medical school clerkship can accurately distinguish among practitioners based
ratings were found to achieve residency ratings that put on the candidate’s total job performance (i.e., the
them in the top fourth in their residency program. rater can accurately rank practitioners relative to each
Thirty-seven percent of those individuals who received other), (c) the rater can accurately determine whether
clerkship ratings that put them in the top performance a group of practitioners are better at one aspect of
quartile were found to achieve residency ratings that their jobs or another, and (d) the rater can diagnose
placed them in the top quartile. Conversely, only 14% accurately the strengths and weaknesses of individual
of those individuals ended with residency ratings that practitioners. This framework for thinking about and
placed them in the bottom fourth. Put another way, the evaluating the quality of ratings is appealing and
1,704 individuals were three times more likely to be practical. Where called for, it is used in subsequent
classified in the same quartile during residency train- sections of this article.
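
Cronbach's48 first two questions can be read directly off a raters-by-ratees score matrix. The sketch below is our illustration of that idea with made-up numbers; it is not an analysis performed in any of the studies reviewed here, and questions (c) and (d) would additionally require dimension-level scores rather than a single global rating.

    import numpy as np

    # rows = raters, columns = ratees; entries are hypothetical global ratings
    scores = np.array([
        [6.0, 7.5, 5.0, 8.0],
        [7.0, 8.0, 6.5, 8.5],   # runs "hot": a lenient rater
        [5.5, 6.5, 4.5, 7.0],   # runs "cold": a stringent rater
    ])

    grand_mean = scores.mean()

    for i, row in enumerate(scores):
        # (a) leniency-stringency: how far the rater's average sits from the grand mean
        leniency = row.mean() - grand_mean
        # (b) ranking accuracy: agreement with the consensus formed by the other raters
        others = np.delete(scores, i, axis=0).mean(axis=0)
        ranking_r = np.corrcoef(row, others)[0, 1]
        print(f"rater {i}: leniency {leniency:+.2f}, ranking r = {ranking_r:.2f}")

A rater with a large leniency value but a high ranking correlation orders practitioners well yet shifts every score upward or downward, which matters more for absolute (pass/fail) decisions than for relative comparisons.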

Cognitive, Social, and Environmental Sources of Bias in Clinical Ratings

Research in experimental psychology has investigated the nature of human memory since the late 19th century. Since the 1980s, research in performance assessment has taken conceptual cues from this research. A number of experimental studies have investigated how humans store information and use that information in performance assessment settings. The research results provide partial answers to the following two questions. How are rating data organized? What are the mechanisms of data recall?

Data Organization

Research has been conducted to determine the strategy raters use in organizing and storing the information they acquire and to determine which organizational strategies result in the most accurate recall and ratings. These research results remind us of the limits of human memory and the implications for performance assessment. Nathan and Lord49 concluded, "Even under optimal conditions many subjects will not or cannot independently use five dimensions to rate performance" (p. 112). Evidence supporting this conclusion was present in the Weekley and Gier25 study of world-class figure skating judges during the 1984 Olympics. Their analyses demonstrated that the judges agreed about their evaluations of the skaters, but chiefly in a global sense.

Cafferty et al.50 reported that raters organized information either by persons (performances of one person on all pertinent tasks, followed by performances for a second person) or by tasks (performances of all persons for one task, followed by performances of all persons for a second task, etc.). Not surprising, acquisition by person led to information being organized in memory by persons, whereas information acquired by task was organized in memory by tasks. Most important, raters who did not organize information in memory or were inconsistent in their organizational strategy had the least accurate recall and ratings. DeNisi and Peters51 investigated the effects of structured diary keeping on recall of performance information and on performance ratings. They found that raters who kept diaries structured by persons or by performance dimensions were better able to recall performance information than were those who kept no diary even when they were not allowed to refer to the diary. Raters who kept diaries provided (a) lower overall ratings, (b) ratings of individual practitioners that varied more from dimension to dimension (provided a more varied profile for each person), and (c) ratings that discriminated more among ratees. Raters who organized their diaries by persons believed that their ratings were more accurate and defensible than did raters in the other conditions, and they believed that the use of diaries helped them in the appraisal process. Raters who were required to maintain diaries, but where no structure was imposed consistently, organized their diaries by persons.

Research by Williams and colleagues,52,53 in a laboratory setting, suggested that raters who acquired and stored performance information in an unorganized fashion can be prompted to structure the information stored in memory before a rating task by having them recall pertinent performance incidents and record them according to a given format. This structuring resulted in recall and rating accuracy results similar to those obtained when the information was initially obtained in a structured manner. DeNisi and Peters51 found that raters in a free recall situation chose to organize the performance information by persons. Most important, they found that raters asked to organize the information before the rating task provided less elevated and more discriminating ratings and reported more positive reactions to the rating task. This suggests that raters can be prompted to structure performance information in memory just before the appraisal task and realize a rating benefit without adding the logistic demands of a diary-keeping requirement. Although we assume a posture of healthy skepticism about these findings until they stand the test of multiple replications, they are interesting and promising.

Therefore, a series of laboratory and field studies suggest that (a) organization of information in memory is important for recall and rating accuracy; (b) recall and rating accuracy can be improved by prompting the rater to organize information before the rating even if the information was acquired in an unorganized fashion; and (c) raters prefer to store information organized by persons, and this strategy produces the most accurate recall and ratings.

Data Recall

Stored information about practitioner clinical performance can be recalled and used by raters to formulate ratings in two ways. First, recall may involve specific details about practitioner performance in specific clinical situations. Second, recall may be based on observations of specific events, forming a general impression of the effectiveness of performance in that situation, and storing only the general impression. In the general case, much of the specific information used in forming an impression is lost. When asked to make an appraisal, the rater recalls the pertinent general impression data and uses it for making a judgment. Evidence regarding how raters perceive and recall clinical performance data is instructive. Results of factor analytic and correlational studies suggest that raters provide global ratings of clinical competence rather than differentiate individual aspects of competence. Verhulst et al.54 found that supervisor's ratings (13-item rating form) of first-year resident performance revealed that raters differentiated and rated two aspects of clinical performance: clinical skills and professional behavior. These two factors together accounted for 83.6% of the common variance in ratings, with the clinical skills factor accounting for 49.4% and the professional behavior factor accounting for 34.2% of the common variance. The results were virtually identical each of the 7 years of the study.

In their study of peer ratings, Ramsey and colleagues32 also found that raters have a two-dimensional conception of clinical skills. One factor represented cognitive and clinical management skills and the other represented humanistic qualities and management of psychosocial aspects of illness.

Other studies reported similar results suggesting that clinical performance ratings reflect a general impression of clinical performance rather than a differentiated view of individual competencies.33,35,54–58 These results suggest that either the individual competencies that make up clinical competence are highly correlated (i.e., individuals who perform one competency well are likely to perform well in the other competency areas); or that individual competencies are not highly correlated, but raters perceive them as such and rate them accordingly. Nathan and Lord49 summarized their research on performance assessment by stating:

    The true correlation among dimensions related to any naturally occurring category will usually be nonzero, because categories are built from natural clusters of characteristics. On the other hand, the correlation among rated dimensions will almost always be higher than true correlations, due to the nonrandom errors or simplifications of raters. (p. 112)

This is an area where a number of large-scale studies have been done regarding clinical performance assessment. Results from these studies show consistently that raters have either a one- or two-dimensional conception of clinical skills and do not differentiate further among the specific dimensions of clinical performance using current instruments and procedures. Efforts to elicit more differentiated ratings may result in rater time and effort being wasted, and this may result in inaccurate characterizations of true performance.

Frequency of Rater Observation

In a survey of 336 internal medicine residents from 14 New England residency programs, Stillman and colleagues11 found that 19% of the residents reported that no faculty member ever observed them performing a complete history and physical examination during medical school, and 30% reported they had never been observed performing a complete history and physical in residency. Over one half reported that they had been observed less than three times during medical school; and 45% reported that they had been observed once during residency, presumably during the required ABIM Clinical Evaluation Exercise (CEX). Burdick and Schoffstall59 reported similar results in a survey of 514 residents in 21 emergency medicine residency programs throughout the United States.

These results suggest that the frequency of direct clinical performance observation at the medical school and the residency level is very low. To the best of our knowledge, no one has reported on the total number of hours of observation per student or resident underlying clinical performance evaluation. Based on surveys of 3,600 business organizations throughout the United States, Bretz et al.60 reported that business supervisors commonly spend about 7 hr per year assessing the performance of each employee. This includes any special observation, other data collection, filling out assessment forms, and communicating with each employee. Because middle- and upper-level human resource managers filled out the survey, one might expect that these results paint a favorable picture of the process. We know of no comparable survey of clerkship directors, residency program directors, or health maintenance organization directors, but we have no reason to expect that more time is spent in clinical performance assessment. Therefore, this section is directed toward investigating available evidence about the impact of frequency of observation on the accuracy and reproducibility of ratings.

In a study of almost 10,000 pairs of raters from 79 business organizations, Rothstein61 established that interrater agreement increases as the length of exposure to the ratee increases. Interrater agreement improved slowly and reached only about 0.60 with 10 to 12 years of opportunity to observe long-time employees. Heneman,22 in a randomized controlled trial, also found that longer observation periods yield more accurate ratings. Finally, eyewitness testimony experts23 agreed that there was sufficient evidence to testify in court that the less time an eyewitness has to observe an event the less accurate is the memory of the event. Therefore, there is evidence from a range of situations to reinforce the idea that the frequency of observation influences the reproducibility and accuracy of ratings. The quality and utility of rating data is likely reduced at the low levels of observation that are common in clinical performance assessment.

Focus of Attention

One constructive role that a clinical performance assessment instrument can play is to direct the attention of raters toward particular dimensions of competence. This may increase the level of rater agreement and the accuracy of ratings.

In the randomized controlled trials described earlier, Noel and colleagues17 and Herbers and colleagues20 found that participants recorded only 30% of resident strengths and weaknesses using global rating forms. Strengths were omitted more frequently than weaknesses, a finding consistent with results from the general performance assessment literature.18,19 Rater accuracy increased to 60% or greater when participants used structured forms designed to focus attention on specific performance dimensions. Noel and colleagues17 reported the percentage of raters who detected each strength and weakness but the number of items is too small to allow generalizations about the attributes of performance that raters observed. More work of this type is needed to determine what expert raters attend to when rating clinical performance. If, as we suspect, the scope is limited to a few competencies (e.g., clinical skills, professional behavior) then efforts need to be directed to expanding rater focus.

Woehr62 and Day63 found that when raters are told about task rating dimensions in advance they perform better in rating that performance. Day63 found that raters were better able to distinguish between effective and ineffective behaviors when they had advance access to the rating instrument. Specifically, raters were able to more accurately identify behaviors that did not occur. Raters without prior knowledge of task rating dimensions were more dependent on a general impression of effectiveness.
Rater Reporting

Judgments are private evaluations of a practitioner's performance, whereas ratings are a public record. For a variety of reasons the two may not always coincide.

In a review chapter, Tesser and Rosen64 described a bias among communicators to transmit messages that are pleasant for recipients to receive and to avoid transmitting unpleasant messages. They labeled the tendency to avoid transmitting bad news the "Mum Effect." The review recounts the results of studies conducted by themselves and other investigators to probe this bias directly. Tesser and Rosen64 concluded that good news is communicated more frequently, more quickly, and more fully than bad news. They also reported this is true across a wide range of situations whether the communication is in person or in a less direct manner (e.g., written communication). This Mum Effect would seem to have clear implications for supervisors' communications of poor performance to practitioners.

A number of studies have been conducted to determine the impact of the Mum Effect on supervisors' communications about performance of employees and trainees. In a laboratory study involving performance assessment, Benedict and Levine65 found that participants scheduled low performers for feedback sessions later and evaluated these performers with more positive distortion. Two field studies suggested similar results. Waldman and Thornton66 found that ratings used for promotion and salary increases and those that were to be shared with employees were more lenient than confidential ratings of the same employees. Harris et al.67 had supervisors rate employees to aid in validating a performance examination. The supervisors had earlier rated the same employees for purposes of promotion and salary increases using the same instrument. The investigators reported that slightly more than 1% of the ratees were rated as in need of improvement when ratings were to be used for promotion and salary increases, whereas almost 7% were rated as in need of improvement in the ratings for research purposes that had no economic consequences for employees. However, the ranking of employees under the two conditions was similar.

Longenecker et al.68 reported survey research results showing that many managers knowingly provide employee ratings that are inflated due to actual and perceived negative consequences of accurate appraisal. Bernardin and Cardy69 found that 65% of police department supervisors indicated that the "typical" supervisor would willfully distort ratings. Further, they found that supervisors with lower levels of trust in the appraisal process rated their subordinates higher than did those with higher levels of trust.

In summary, there is research reported in the social psychology and the business management literature supporting the finding that supervisors are likely to distort ratings of their employees when they believe employees will see the ratings or when the ratings will be used for promotion or salary increase purposes. The Mum Effect research suggests that distortion in communications may occur even when there are no consequences for the employee. In all cases, the tendency is for ratings to be increased. We are not aware of any studies of the Mum Effect or of the degree of distortion in ratings as they apply to clinical performance ratings.

Leniency and Stringency

Littlefield et al.70 studied the ratings given by surgery faculty in five medical schools to determine the percentage of lenient and stringent raters. The study involved 107 raters who each evaluated four or more students and a total of 1,482 ratings. The investigators established that 13% of the raters were significantly more lenient in their ratings. Students with true clinical ability at the 50th percentile would receive ratings from these lenient raters that would rank them at the 76th percentile or higher. Conversely, 14% of the raters were classified as significantly more stringent in their rating patterns. Ratings by these stringent raters would rank average students (true clinical ability at the 50th percentile) at the 23rd percentile or lower. However, the total group of raters was normally distributed regarding stringency of ratings. Therefore, where multiple ratings are available for each student, the average rating should approximate the student's true clinical competence. Further, the investigators noted that the format of rating forms was not related to the tendency to provide extreme ratings. Ryan et al.71 studied emergency medicine supervisors and found that they also showed large variations in both leniency and range restriction, and that differences in individual evaluator scoring leniency were large enough to have a moderate effect on the overall rating received by the resident.
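To make the percentile figures above concrete, the following minimal simulation (ours, not drawn from Littlefield et al.70 or Ryan et al.71) models a rater's leniency as a constant bias added to the true ability of the person being rated. All numbers are assumptions chosen only to mirror the shifts described in the text: on a z-score scale, a bias of roughly 0.7 SD is about the shift needed to move a median performer near the 76th (or, with a negative bias, the 23rd) percentile.

```python
# Illustrative only: leniency/stringency as an additive rater bias, and the
# moderating effect of averaging over several raters whose biases are roughly
# normally distributed (as Littlefield et al. described for their rater pool).
import random
import statistics

random.seed(1)

def observed_rating(true_ability, rater_bias, noise_sd=0.5):
    """One rating = true ability + the rater's leniency bias + observation noise."""
    return true_ability + rater_bias + random.gauss(0, noise_sd)

true_ability = 0.0     # z-score units; 0.0 is the 50th-percentile student
lenient_bias = +0.7    # ~0.7 SD shift puts a median student near the 76th percentile
stringent_bias = -0.7  # and symmetrically down toward the 23rd percentile

single_lenient = observed_rating(true_ability, lenient_bias)
single_stringent = observed_rating(true_ability, stringent_bias)

# A panel of raters whose biases vary around zero.
panel_biases = [random.gauss(0, 0.7) for _ in range(9)]
panel_mean = statistics.mean(observed_rating(true_ability, b) for b in panel_biases)

print(f"single lenient rater:   {single_lenient:+.2f}")
print(f"single stringent rater: {single_stringent:+.2f}")
print(f"mean of 9 raters:       {panel_mean:+.2f}  (closer to the true value 0.0)")
```

Under these assumptions, any single rating can be badly displaced by the rater's bias, but the mean of several ratings falls much closer to the true value, which is the logic behind collecting multiple ratings per learner.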
Inflation of Ratings

Based on surveys of 3,587 business and industrial organizations, Bretz et al.60 found that 60% to 70% of an organization's workforce received ratings in the top two performance levels on the organization's performance rating scale. We know of no directly comparable large-scale survey regarding medical clinical performance assessment. However, there are a number of studies that report similar outcomes, and the results are consistent across the studies found. Magarian and Mazur72 surveyed all internal medicine clerkship directors in the United States and received completed surveys from 81% of them. Survey results indicated that 75% of students in internal medicine clerkships using an A through F grading system received ratings in the top two performance categories.
In clerkships using descriptor grades (e.g., honors, high pass, pass, fail), 46% of the students received ratings in the top two performance categories. Because 86% of the internal medicine clerkships incorporated results of an objective knowledge test such as the National Board of Medical Examiners (NBME) internal medicine subject test into student grades,73 it is reasonable to expect that percentages in the top two grading categories would be higher if clinical ratings alone were used for grading. Objective test results generally lower grades.74

Shea et al.75 described program directors' ratings of all 17,916 internal medicine residents who sought certification in internal medicine during the years 1982 through 1985. They reported that the mean rating was 6.5 on a 9-point scale. Six is a "satisfactory" rating and 7 is a "superior" rating according to the scale used. Residents with "unsatisfactory" program directors' ratings were excluded from this study. Therefore, the average rating is raised artificially by a small degree. However, it is clear that the ratings assigned are concentrated in the higher scores on the scale.

Ramsey and associates76 reported that 91% of all practicing physicians in their study received a rating of 7 or above on a 9-point scale from peers. The mean clinical performance rating for all physicians was 7.71 with no physician receiving a rating below 6. Other studies16,55,71 reported similar findings across a range of practitioner rating situations (medical students, residents). Carline et al.15 found that out of 6,932 ratings in their study, only 136 or 2.0% were poor or fair ratings. Twenty-two percent of the physicians (58 of 269 physicians rated) received all of the 136 poor or fair ratings, with 15 physicians receiving three or more low ratings.

Some other evidence exists indicating that individual ratings of clinical performance are inflated. We reported earlier that Martin and colleagues27 found that individual physician ratings were almost always overestimates of true clinical competence as established by a consensual gold standard panel. Vu and colleagues28 established this tendency for individual raters to give ratees the benefit of the doubt as well. Schwartz et al.77 reported that faculty ward evaluations of fourth-year surgery residents were inflated and did not correlate with other, more objective measures of clinical competence (i.e., Objective Structured Clinical Examination [OSCE] scores).

Speer et al.78 reported on their efforts to modify the clinical rating process in an internal medicine clerkship toward the goal of reducing grade inflation. In the original grading system, the average grade assigned to students by preceptors was an A. The revised system involved using a more behaviorally anchored definition for each grade and assigning final grades through an "objective group review." This revised system resulted in students receiving an average grade of B+. The authors documented that preceptors believed the average student should receive a grade of B. Therefore, the revised grading approach was more consistent with preceptor expectations. Although this study does not establish the relative contributions that behavioral anchors and the group review process contributed to decreasing the grades assigned, a study by Hemmer et al.79 supports our belief that the objective group review process was the more important factor. They found that a group review process significantly improved the detection of unprofessional behavior. In almost 25% of the cases, the group review process was the only evaluation method that detected unprofessional behavior.

Metheny56 reported that first-year residents in obstetrics–gynecology were more lenient than other raters in their grading of medical students. These results were based on findings in one clerkship at one medical school. It would be informative to have a larger scale study to determine whether such variables as amount of clinical supervising experience have systematic effects on ratings assigned.

Failure Rates

Another way to study the issues of leniency and stringency and grade inflation is to look at failure rates. The Liaison Committee on Medical Education conducts a survey of all medical schools in the United States each year that normally yields a 100% response rate. The results of the survey80 indicated that attrition from U.S. medical schools continues to be about 1% of total enrollment. Forty-one percent of those students who left medical school during the 1996 to 1997 year were either dismissed for reasons of academic failure or withdrew in poor academic standing. Of these students, only 31% were in the clinical years of training. Therefore, there were 95 students in the clinical years who were dismissed for academic failure or withdrew in poor academic standing from the 125 U.S. medical schools during the 1996 to 1997 academic year. This is less than one medical student per U.S. medical school. Therefore, the likelihood of a student leaving medical school for academic performance reasons during the clinical years of training was very low.

In an Association of American Medical Colleges survey of representatives from 10 medical schools, Tonesk and Buchanan81 reported that one of the most commonly reported problems was unwillingness to record negative performance evaluations. A second was unwillingness to act on negative evaluations. Results from the Magarian and Mazur72,73 surveys of internal medicine clerkship directors support the Tonesk and Buchanan81 findings. Thirty-six percent of the clerkship directors responding could not recall having ever given a failing grade in their clerkship. The highest failure rate reported for a medical school class was 10% (two schools). Failure rates in other studies were comparable.
Yao and Wright82 reported that 6.9% of internal medicine residents were reported to have significant enough problems to require intervention by someone of authority. They also reported that 94% of residency programs had problem residents. Kwolek et al.83 reported that 1.3% of ratings in a surgery clerkship were unsatisfactory or marginal, and 5% of interns in the same program were rated as deficient.77

Schueneman et al.57 reported that 9 out of 310 (3%) surgery residents were asked to leave their residency training program over a period of 15 years. Vu et al.74 studied failure rates in six required clerkships (family medicine, internal medicine, obstetrics–gynecology, pediatrics, psychiatry, and surgery) in one medical school over a period of 5 years. They reported that the average failure rate was less than 1% in family medicine, 1% in obstetrics–gynecology and in psychiatry, 5% in surgery, 7% in internal medicine, and 8% in pediatrics (average failure rate: 3.67%). However, 70% of the failing students received passing marks based on clinical performance ratings alone. In these cases, performance on an objectively scored knowledge examination (the NBME subject test) was the factor that determined their failure. Jain et al.84 reported the results of a survey of physical medicine and rehabilitation residency programs. Eighty-three percent of program representatives responded. Only 50% of programs reported rating at least one resident unsatisfactory during a clinical rotation in the 3 years prior to the survey. Only 11% had reported one resident's overall performance as unsatisfactory to the parent organization.

As mentioned earlier, results from a study in a large business organization67 indicated that 1.3% of 223 ratees were rated as in need of improvement on regular ratings of employees for promotion and salary increases. When these same employees were rated to validate a performance test, a situation where presumably raters perceived no negative consequences of their ratings for the employees, 6.7% were rated as needing improvement. Therefore, the true rate of cases where performance is inadequate may be closer to 7% than to the 1% rate, which normal ratings yield.

Rater Calibration and Training

A common response to the perception that clinical performance ratings are less accurate and reproducible than they should be is to suggest implementing a rater training program. We found very few studies in medical clinical performance evaluation that addressed the effectiveness or cost effectiveness of rater training. Both studies that we found suggested that training may not lead to gains commensurate with the cost of the program. Noel and colleagues17 included a training condition in their randomized controlled trial that studied the accuracy of raters in appraising clinical performance. They found that physicians who received training were no more accurate than other raters who participated in the study but without training. Newble et al.85 compared the interrater agreement for groups of physicians with no training, moderate training, and more extensive training. They reported that training did not improve the low rate of interrater agreement. However, interrater agreement was improved substantially by identifying the most inconsistent raters and removing them from the analysis. This approach of culling inconsistent raters may be a more cost-effective way of dealing with rater reliability problems.

The limited number of studies that have addressed the issue of rater training for clinical performance evaluation should be judged cautiously. The findings may be the result of the type or quantity of training provided. To illustrate, the training in the Noel et al.17 study only involved viewing a videotape. There was no practice and feedback component. The Newble et al.85 study involved a ½ hr training session with no feedback for the moderate training group and a 2½ hr training session including practice and feedback for the extensively trained group. The amount and type of training in the Newble et al.85 study is more in line with what would be expected to result in a positive training outcome, but it may take even more training to yield the desired improvement in reliability.

Many more studies investigating the effect of training on rater performance have been done in business settings. Woehr and Huffcutt86 reported a meta-analysis of 29 rater training studies. These studies yielded 71 meaningful comparisons of training programs with control groups. Most important, Woehr and Huffcutt86 sorted out four different types of training programs and investigated their effects separately. The most common programs (28 comparisons) were rater error training programs designed to minimize errors of leniency, central tendency, and halo by making raters aware of these tendencies. These programs proved to be moderately effective in reducing halo error and somewhat less effective with respect to rater leniency. Rater error training programs also yielded a modest increase in rater accuracy, with some being substantially more effective than others.

The second type of rater training was performance dimension training, which was designed to train raters to recognize and use appropriate dimensions of performance in their ratings. The idea behind this training is that raters will make dimension-relevant judgments as opposed to more global judgments. Woehr and Huffcutt86 reported that performance dimension training was moderately effective at reducing halo error but less effective with respect to increasing accuracy. This training procedure actually led to a slight increase in rater leniency.

Frame of reference training programs, the third approach, train raters with regard to performance standards as well as performance dimensionality. These programs typically provide raters with samples of behavior representing each dimension and each level of performance on the rating scale.
The programs also include practice and feedback. These programs proved to be the single most effective training strategy for increasing accuracy of rating and observation. They also yielded small decreases in halo and leniency.

Finally, behavioral observation training programs focus on improving observation of behavior rather than evaluation of behavior. There have been few (four) such studies reported. Like frame of reference training, behavioral observation training programs were found to produce a medium to large positive effect on both rating and observational accuracy. These studies did not investigate effects on either leniency or halo.

A summary of results from Woehr and Huffcutt86 suggests that rater training programs have promise as a means of improving performance of raters observing and appraising clinical performance. However, the rater training studies in medical clinical evaluation have not been effective, most likely due to their short duration, no practice, and no feedback. This suggests either that the training program characteristics have not been optimized or that physician raters are impervious to training.

Practitioner (Ratee) Characteristics

This section addresses practitioner characteristics that can affect performance ratings but are unrelated to practitioner performance ability. Practitioner characteristics other than performance ability can be subdivided into two categories: those that are relatively fixed and those that are more transitory and can be manipulated by the practitioner. Authors have used such terms as impression management and social control to describe the overt manipulation of transitory variables for personal gain.

Fixed Personal Characteristics of Practitioners

We found no studies investigating the impact of fixed personal characteristics (e.g., ethnicity, age, gender, physical attractiveness) on clinical performance ratings in medicine. However, several studies involving these variables have been done on performance evaluation ratings in the military and in business. Most of the reported research investigates the impact of rater and practitioner ethnic background. A few studies have investigated practitioner age and its impact on assigned ratings. Finally, one study investigated ratee attractiveness and affect.

The most comprehensive and informative studies of ethnicity have been those by Pulakos et al.87 and by Sackett and DuBois.88 The Pulakos et al.87 study involved 39,537 ratings of 8,642 army enlisted personnel in a variety of jobs. They performed separate analyses for the data set where both an African American and a Caucasian rater evaluated ratees. This data set included 4,951 ratees (1,663 African American and 3,288 Caucasian). Sackett and DuBois88 conducted a similar study using a data set including ratings for 36,000 individuals holding 174 civilian jobs in 2,876 firms. As in the Pulakos et al.87 study, Sackett and DuBois88 performed separate analyses for the 617 ratees (286 Caucasian, 331 African American) where both an African American and a Caucasian rater had submitted ratings for a ratee. Sackett and DuBois88 did a combined analysis that incorporated the data from the Pulakos et al.87 study and their study. They reported that African American and Caucasian raters provided virtually identical ratings of Caucasian ratees. African American ratees, on the other hand, received ratings from Caucasian raters that were slightly lower (0.02–0.10 SDs lower) than the ratings they received from African American raters. For comparison purposes, Sackett and DuBois88 noted that ratings for Caucasian ratees by African American raters ranged from 0.003 SDs lower to 0.003 SDs higher than ratings from Caucasian raters. They concluded that race of the rater accounted for an extremely small amount of variance in the rating data.

Like ethnicity, the individual studies of age produced mixed results. The best available basis for drawing conclusions lies in available meta-analyses designed to look at the combined results from multiple studies. Two meta-analyses were found that investigated the effects of ratee age on performance ratings. McEvoy and Cascio89 looked at the combined results from 96 independent samples of the age–performance relation from 65 studies published between 1965 and 1986. They found that the mean correlation between age and objective measures of productivity (e.g., units produced, patents received, articles published) was 0.07. The relation between age and supervisor's rating was similar (mean correlation = 0.03). Waldman and Avolio90 looked at the combined results from 37 samples reported in 13 research reports. Direct measures of productivity indicated that older workers were more productive. Older workers received lower ratings from their supervisors (average correlation = –0.14), but this relation was limited to nonprofessional workers. Based on the pooled results reported in these meta-analyses, there seems to be little relation between age and supervisors' ratings for professional workers.

The relation between gender and performance ratings has received surprisingly little attention. The study by Pulakos et al.87 described earlier also investigated gender. Due to the very large sample, statistically significant differences were obtained, but Pulakos et al.87 noted the proportion of variance in ratings accounted for by gender was minimal. Male raters tended to rate women slightly more positively than men. Female raters on the other hand tended to rate males slightly more positively.

Only one study was identified that studied attractiveness and its relation to performance ratings.
Heilman and Stopeck91 conducted a laboratory study in which the attractiveness of the employee was varied experimentally by changing the picture on the employee's written performance report. Participants in the study were MBA students who were asked to evaluate the performance and estimate advancement potential. Heilman and Stopeck91 concluded that attractiveness of the ratee inflated ratings of nonmanagerial women, deflated ratings of managerial women, and had no impact on the ratings of men. Because this was a single laboratory study with a very small number of participants rating hypothetical employees from written performance reports, the results should at most be considered suggestive of an area for further investigation.

Impression Management

Certain practitioner characteristics are under personal control and can be manipulated to influence one's ratings. Only two studies were found that investigated the effect of practitioner characteristics other than clinical competence on clinical performance ratings. Wigton92 studied case presentation style and its effect on the ratings assigned to medical students. He coached five first-year medical students to present each of five prepared (standardized) case presentations that varied systematically in content and organization. Fifteen experienced faculty raters evaluated videotaped records of the presentations. Results showed that the ranking given each presentation depended as much on which student was giving the presentation as the content of the presentation. Wigton concluded that the evaluation of clinical competence based on student case presentations could be significantly influenced by the personal characteristics of the students. This is the only experimental study of this type in medical education that we found. More experimental studies of this type would help us understand factors that influence clinical performance ratings. Kalet and colleagues24 conducted a study comparing faculty ratings on an OSCE to expert ratings. They concluded that faculty rated students on the basis of likeability rather than specific clinical skills.

There is a very active research community in social psychology investigating the processes by which people control the impressions others form of them, dating back at least to the theoretical work of Goffman.93 Haas and Shaffir94,95 conducted investigations of some of these variables in their study of the professional socialization of one class of medical students over the period of their entire medical school training. Haas and Shaffir's94,95 primary research methods included participant observation and interview research. They observed the students over the full range of their educational experiences. Referring to the sociological research on the process by which neophytes are socialized into professions, Haas and Shaffir94 noted that one characteristic is that neophytes must develop a "pretense of competence even though one may be privately uncertain" (p. 142). They proceeded to document the students' development of this cloak of competence. Haas and Shaffir94 concluded by stating the following:

A significant part of professionalization is an increased ability to perceive and adapt behavior to legitimators' (faculty, staff, and peer) expectations, no matter how variable or ambiguous … In this context of ambiguity, students in both settings accommodate themselves, individually and collectively, to convincing others of their developing competence by selective learning and by striving to control the impressions others receive of them. (p. 148)

There have been many studies conducted in business organizations to determine the effects of subordinates' use of impression management techniques on their supervisors' ratings. There have also been a number of studies in the laboratory designed to control and separate the variables involved. All of these studies address concerns that supervisors' evaluations of employees can be influenced substantially by factors other than job performance itself. The validity of performance ratings is compromised to the extent that this occurs. The results of these impression management studies are mixed, partly due to the variety of impression management techniques that have been studied and partly to the environmental conditions under which they have been studied. As a result, the best indication of the impact of impression management on performance ratings is found in the cumulative findings reflected in the meta-analysis conducted by Gordon.96 Gordon's96 meta-analysis covered 55 journal articles, including 69 studies, yielding 106 effect sizes.

Gordon96 found that impression management techniques studied so far appear to have little effect on performance ratings in field settings. There was a slightly larger effect in laboratory experiments, but the average effect was still characterized as small based on conventional standards for interpreting effect sizes. In summary, there is little hard evidence that impression management tactics are an important influence on supervisors' ratings of performance.

Influence by Other Evaluators

There are two traditional approaches to evaluating medical students and residents. In one, all supervisors are asked to fill out the rating form independently and submit it to the clerkship director or residency program director. The clerkship director or residency program director then reviews the individual ratings and assigns a final grade.
The second approach requires that raters complete individual rating forms in preparation for a group discussion of all practitioners by all supervisors. Each practitioner's performance is discussed in committee and then a final rating is assigned through some consensus building process.

The two approaches have benefits and disadvantages. The consensus building model has the potential benefit of providing each rater with multiple perspectives on the practitioner's performance and making all raters aware of data regarding practitioner performance in a variety of settings. However, many people express fear that such an approach may lead to a dominant individual unduly influencing the decision process. Although little data exist that address the benefits and weaknesses of the group model, the results of one study in the business literature are informative.

Martell and Borg97 conducted a laboratory study in which single individuals and groups of four rated the same performance either immediately or after a delay of 5 days. In the delayed rating condition, groups remembered performance details more accurately than individuals. However, groups also were more liberal in their rating criteria. On the other hand, Speer and colleagues78 found that grades assigned to medical students on an internal medicine clerkship were lower and closer to the faculty-stated appropriate average grade for clerks when the grades were assigned after a group discussion by faculty. These investigators attributed this difference to committee impartiality compared to individual faculty observers who worked with the students regularly on their internal medicine clerkship. Likewise, Hemmer et al.79 found that group performance assessment identified many professional behavior problems that were not otherwise disclosed. We found no studies to confirm or refute anecdotal reports that individuals with strong opinions or a persuasive style can sway group opinion inordinately in evaluative discussions. This is widely perceived as a disadvantage to the group process model and should be studied systematically.

Environmental Bias

As mentioned earlier, raters and practitioners interact in an environment with many distractions and confounding factors. Some of these factors are likely to influence judgments and ratings and are unrelated to true clinical competence. In this section, we discuss some of the confounding factors and provide evidence, where available, of their influence on performance ratings.

Time Pressures and Distractions

Anecdotal evidence and nonsystematic observation suggest that time pressures, distractions, and the pressure of competing responsibilities may compromise the ability of clinical supervisors to accurately observe and evaluate the clinical performance of physicians-in-training and practitioners. Time pressures and distractions likely influence the clinical performance assessment process most by decreasing the amount of observation that occurs and thus by decreasing the sampling of clinical performance across the domain of clinical situations and competencies of interest. They may also influence the rater's time for reflection and the quality of information integration that underlies the assigned ratings. We found no research that addresses this question directly in either the clinical or general performance assessment research literature. One study in the decision-making literature is suggestive. Wright98 found that consumers, facing time pressures and distractions when making purchasing decisions, reduced search time by collecting fewer pieces of information and generally searching for negative information. It is possible that a similar process occurs in performance assessment. Clinical performance assessment practice would profit from more research in this area.

Clinical Performance = Group Effort

Patient care is a group effort in hospital settings and in most office settings. Many individuals including specialists, attendings, residents, and students participate in the process of working up, diagnosing, and managing patients. Therefore, it is difficult to assess the clinical competence of any one practitioner based on patient presentations and write-ups or on individual conversations about patients. Weaknesses in practitioner knowledge and skill may be masked by the contributions of others to the diagnosis and care of the patient as reported. One of the primary advantages of standardized patient encounters for clinical performance evaluation is the opportunity they give to assess an individual's clinical proficiency knowing that the results reflect that individual's clinical competence alone. We found no research that investigated the impact of this source of bias on traditional clinical performance ratings, nor did we find any reports of attempts to control this source of bias in field settings.

A great deal of research and evaluation on appraising team effectiveness99 in addition to individual effectiveness has occurred in the military, in business, and in the nuclear power industry in recent years. The medical profession has much to learn from this research legacy.

Effects of Being Observed

One factor that makes clinical performance assessment artificial is the fact that practitioners usually know when they are being observed. It is customary to think that we are seeing a person's best performance under conditions where the person knows that he or she is being watched.
However, it is also possible that the awareness of being observed creates anxiety and leads to poor performance. In either case, appraisals may not provide an accurate reflection of how the individual will perform in practice when not under observation.

We found one study using clinical performance ratings by Kopelow et al.100 that addressed the effects of being observed. Kopelow and colleagues100 had practitioners work up the same standardized patient cases (portrayed by different individuals) both in a setting where they knew they were being evaluated and also in practice without evaluation awareness. One group of the practitioners saw the patients in the office first and then saw the patients under examination conditions. A second group took the examination first and saw the patients in their office later. The practitioners agreed to have the standardized patients rotate through their practice but were not made aware when the patients were introduced. Physicians recognized the standardized patients in only 26 of the 263 encounters (10%). Most pertinent to this article, average overall performance scores were 65% to 66% of possible points in the examination setting when practitioners knew they were being assessed and 52% to 56% in the practice setting when practitioners were not aware they were being assessed (p < .01).

Sackett et al.101 performed a comparable study in supermarkets. They instructed 1,370 supermarket cashiers from 12 supermarket chains to process (check out) 25 standardized items in a grocery cart while being observed. Cashiers were told that they would be timed and errors would be recorded. They were asked to do their best and to place equal emphasis on speed and accuracy. Each cashier processed six different standard cartloads of groceries over a period of 4 weeks. Sackett et al.101 considered these performance tests measures of maximum performance. All the cashiers worked in supermarkets where the cash register system kept records of items checked per minute, a measure of speed, and voids per shift, a measure of accuracy. These records provided a measure of typical performance. Sackett and colleagues101 established that the correlations between maximum and typical performance (speed and accuracy) were very low, leading them to conclude that the two measures do not yield comparable information about the relative performance of cashiers. They also found that supervisors' ratings of the cashiers correlated more highly with the maximum than the typical performance measures, confirming the sampling weaknesses noted for evaluation based on nonsystematic observation of critical incidents (positive or negative).

The results of the Kopelow et al.100 and Sackett et al.101 studies document the likelihood that performance under formal test conditions is not a very good indicator of "normal" performance. Furthermore, these results suggest that differences under these two sets of conditions are likely to be large. Their findings demonstrate the importance of creating ways to sample clinical performance in unobtrusive as well as more formal ways. Patient satisfaction measures102 and nurse observations103 may prove to be good ways of capturing information about normal clinical performance.

Recommendations

The following recommendations for improvement in clinical performance assessment practice are based on the research covered in this article.

Broad, Systematic Sampling

Practitioners need to be observed and evaluated across a broad range of clinical situations and procedures to draw reasonable conclusions about overall clinical competence. Further, it is unreasonable to expect that practitioners are equally effective across a broad range of situations. Efforts need to be directed toward assuring that practitioners can handle important, specific situations competently rather than a sole focus on identifying good and bad practitioners. Advanced Trauma Life Support (ATLS) and Advanced Cardiac Life Support training are good examples of training and assessment protocols designed to assure competence.

Medical educators need to be much more selective and deliberate in planning performance observations. Observations should cross a representative sample of chief complaints, diagnoses, tasks, and patient types. This will require planning and record keeping to fulfill the desired systematic sampling. Standardized patient encounters9–11 and high-fidelity simulations44–46 can be used to compensate for deficits in the naturally occurring patient mix.104 Educators also need to increase the absolute amount of observation that is done in support of clinical performance evaluation. One way to accomplish this may be to limit the amount of observation time during any one encounter. Results reported by Shatzer and his colleagues105,106 suggest that short periods of observation may be enough for clinical evaluation. A strategy that distributes available observation time across a larger number and range of encounters is preferable.
sors’ ratings of the cashiers correlated more highly
Observation by Multiple Raters
with the maximum than the typical performance mea-
sures confirming the sampling weaknesses noted for Although the primary goal should be to increase the
evaluation based on nonsystematic observation of criti- range of situations and procedures observed, it is also
cal incidents (positive or negative). the case that individual raters have idiosyncrasies re-
The results of the Kopelow et al.100 and Sackett et garding clinical focus (competencies they observe) and
al.101 studies document the likelihood that perfor- standards (stringency and leniency). Each practitioner
mance under formal test conditions is not a very good should be rated by a number of different raters to balance
indicator of “normal” performance. Furthermore, these the effects of these rater idiosyncrasies. The research
results suggest that differences under these two sets of that we have reviewed suggests that a minimum of 7 to
conditions are likely to be large. Their findings demon- 11 ratings should be collected to allow for a reproducible
strate the importance of creating ways to sample clini- holistic estimate of general clinical competence. Ac-

Acquiring 7 to 11 ratings per practitioner with multiple raters, although not a small logistical task, appears to be feasible. Ramsey and colleagues76 sought to acquire 10 or more ratings for each of 228 practicing physicians. They reported that this was achieved for 72% of the physicians with no follow-up mailings.
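The gain from pooling ratings can be illustrated with the Spearman–Brown prophecy formula. The sketch below is ours, not taken from the studies cited; the single-rating reliability of 0.25 is an assumed round figure used only to show the shape of the relationship.

```python
# Spearman-Brown prophecy formula: reliability of the mean of k ratings, given
# the reliability of a single rating. The single-rating value of 0.25 is an
# assumed illustration, not a figure reported in the article.
def spearman_brown(single_rating_reliability: float, k: int) -> float:
    r = single_rating_reliability
    return k * r / (1 + (k - 1) * r)

for k in (1, 3, 5, 7, 9, 11):
    print(f"{k:2d} ratings -> composite reliability {spearman_brown(0.25, k):.2f}")
# 1 -> 0.25, 3 -> 0.50, 5 -> 0.62, 7 -> 0.70, 9 -> 0.75, 11 -> 0.79
```

Under that assumption, a composite based on 7 to 11 independent ratings reaches the 0.70 to 0.80 range, which is consistent with the recommendation above that single ratings are too unreliable to stand alone.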
Keep Rating Instruments Short

Our review indicates that raters view clinical competence as one or two dimensional. The work by Ramsey and colleagues76 suggests that a rating scale composed of 10 specific items plus a global item is sufficient when the goal is holistic decision making (i.e., about progress or promotion). Adding more items only boosts rater time and makes the task harder. Results of the Kreiter et al.16 study suggest that relatively little is gained in generalizability by adding more than 5 items to the rating scale when progress decision making is the goal. Careful instrument construction using a process similar to that used by Ramsey et al.76 should simplify the rating task with no loss of information. When the goal of evaluation is to provide feedback to support learning, the observation should be specific to the performance observed, and the feedback should be immediate and interactive. A general performance rating form is not appropriate for this purpose.

Separate Appraisal for Teaching, Learning, and Feedback From Appraisal for Promotion

One of the clearest implications from our review is that the feedback and progress decision goals of clinical performance evaluation should be separate. Physicians recall information regarding two clinical performance factors: (a) clinical skills, and (b) professional behavior. They do not recall performance details at the end of rotations when asked to fill out rating forms. Consequently, physicians should be encouraged to comment immediately about specific performance attributes that should be reinforced or corrected. This should include an interactive conversation about the situation, the practitioner's behavior, desirable or undesirable consequences of the behavior, and some suggested alternatives if the behavior should be changed in the future. The implied message in performance assessment forms with room for written comments is that performance feedback should be saved and conveyed only through the form. We believe this is the wrong message to send.

Encourage Prompt Recording

Clinical performance evaluation usually involves completing an evaluation instrument at the end of a rotation based on observations that occurred during the rotation. As a result, there is some decrease in rating accuracy due to memory loss and distortion. Any method that encourages recording of observations and judgments at the time of the observation will minimize this problem. Encounter records can be referred to when completing end-of-rotation rating forms. The encounter records also provide a basis for providing detailed feedback to practitioners-in-training to help them improve their clinical performance. A number of approaches have been used to capture observations and judgments as they are made, including performance report cards (Rhoton and associates,107–109 Soglin,110 Brennan & Norman111), diaries,112 and clinical evaluations filled out immediately.113

Supplement Formal Observation With Unobtrusive Observation

Unobtrusive observation is valuable because it provides a better estimate of normal clinical performance than formal observation. To the extent that it increases the total number of clinical performance observations, it also helps to achieve the goal of broad sampling. Newble's103 reported use of nurse evaluations and Tamblyn and associates'102 use of patient evaluations are two effective ways of increasing unobtrusive evaluation.

Consider Making Promotion and Grading Decisions Via a Group Review

We believe that a group review process for making promotion and grading decisions has more benefits than problems associated with it. The group process provides each decision maker with a broader base of information on which to base the decision. It also provides added perspective because different raters tend to focus on different aspects of performance. Finally, available evidence suggests that groups are more likely to make unpopular promotion decisions than are individual raters making decisions in isolation.78,79 However, some physicians are reluctant to use a group process for fear that a person with a strong view will unduly sway the opinions of others. There is no evidence on this issue one way or the other.

Supplement Traditional Clinical Performance Ratings With Standardized Clinical Encounters and Skills Training and Assessment Protocols

Standardized clinical encounters have the advantage of allowing direct comparisons of practitioners when they perform exactly the same clinical tasks.
They also allow for evaluation of individual physicians' skills in comparison to an established gold standard. Performance in these situations can also isolate the individual's clinical performance capabilities and deficits. There is no contamination of the clinical performance measure resulting from the contributions of other members of the health care team who have cooperated in working up the patient. Standardized clinical encounters usually provide a more in-depth look at clinical performance. Finally, they provide a means of looking at the performance of an entire class (e.g., all third-year residents) and assuring that critical skills are acquired and maintained.

Educate Raters

At a minimum, make sure that all raters are familiar with the rating form at the beginning of the rotation. This will help raters focus their observations and organize performance information in a way that will facilitate performance assessment. Consider employing frame of reference training to assure that raters are calibrated regarding both definition of performance dimensions and performance quality rating category dimensions. More research is needed to determine whether adding other types of more extensive training will yield higher quality clinical performance ratings by physicians. The research by Newble and associates85 suggests that eliminating raters who are extreme outliers (positive and negative) is a more cost-effective means of increasing the validity and reproducibility of clinical performance ratings than is training. It would be nice to have other studies investigate this practice to increase confidence in the reproducibility of the finding. We mention it here as much to encourage research as to suggest practice. We realize that the practice of culling raters may have undesirable consequences on faculty morale that need to be considered by the program or clerkship director, but the practice does merit consideration of the positive and negative consequences.

Provide Time for Rating

Do whatever you can to encourage intermittent observation and recording of clinical performance. Ratings and other types of clinical evaluation should be thoughtful, candid, and not rushed. Clinical evaluation of learners is an important professional activity that warrants protected time. We have heard anecdotally of departments that gather faculty members at a given time and place for the sole purpose of filling out performance ratings. The group rating process described earlier has the added benefit of protecting time for performance assessment.

Encourage Raters to Observe and Rate Specific Performances

The ABIM has explored use of the mini-CEX. This process involves having an evaluator observe and rate specific performance on a single case immediately using an instrument tailored for that purpose. Given the problems of memory distortion associated with the normal rating process, we believe a supplemental process such as that proposed by the ABIM will provide additional, more accurate data about clinical performance that can be added to the mix when final promotion decisions are made. An even more optimal process would use rating scales specifically designed for selected chief complaints or procedures.

Use No More Than Seven Quality Rating Categories (i.e., 1 [Poor] to 7 [Excellent])

Five to seven quality rating categories are optimal. Even at this, the extreme rating categories will rarely be used. We also discourage two-level rating systems (e.g., 1–3 [unsatisfactory], 4–6 [satisfactory], and 7–9 [superior]). Additional quality rating categories add no performance information and just complicate the task for raters.

Establish the Meaning of Ratings

Given the tendency for raters to use primarily the top two categories and the reluctance to assign failing ratings, clinical performance assessment ratings only have meaning relative to each other. As with any measurement instrument, scores only acquire meaning through experience and systematic data analysis and interpretation. We encourage maintaining the same rating instrument and procedure. Clinical evaluators have more to gain by accumulating norms over a number of years than will be gained by "refining" the rating instrument. Familiarity with a form due to extensive use enables directors to easily and quickly identify outliers (i.e., those who perform unusually well or poorly based on their ratings). One can also develop a sense for the significance of particular ratings by keeping and using historical records regarding the ultimate status (e.g., asked to leave program, chose to leave program, ultimately judged to be outstanding) of practitioners with particular scores. This will require some investment to collect and make sense of the data, but it is the only way scores acquire meaning. Keeping the same form also makes it more feasible to conduct validity studies such as those conducted by Hojat and colleagues.40 Changing the form repeatedly does nothing but require recalibration of the new form, while losing data captured with the old form.

Give Raters Feedback About Stringency and Leniency

Much like feedback to Olympic skating judges, feedback to raters about their clinical performance assessments may help calibrate future evaluations and increase the motivation to provide accurate ratings. Admittedly, because clinical raters observe different samples of clinical performance, they will have more difficulty interpreting information about stringency and leniency. However, providing feedback to raters regarding their average ratings, along with average ratings for all raters, provides some idea of rater stringency and leniency, especially when the averages are based on many ratings. We are not convinced of the wisdom of handicapping raters arithmetically, as rater knowledge that handicapping is being used may lead them to even more extreme ratings in the future.
wisdom of handicapping raters arithmetically as rater
skills with standardized patients: State of the art. Teaching and
knowledge that handicapping is being used may lead Learning in Medicine 1990;2:58–76.
them to even more extreme ratings in the future. 10. Stillman PL, Swanson DB, Smee S, et al. Assessing clinical
skills of residents with standardized patients. Annals of Internal
Medicine 1986;105:762–71.
Learn From Other Professions 11. Stillman PL, Swanson DB, Regan MB, et al. Assessment of
clinical skills of residents utilizing standardized patients: A fol-
Clinical medicine has much to learn about perfor- low-up study and recommendations for application. Annals of
mance evaluation from other professions including the Internal Medicine 1991;114:393–401.
astronaut corps,114 aviation,115 business and indus- 12. Harper AC, Roy WB, Norman GR, Rand CA, Feightner JW.
try,116 the clergy,117 the military,118,119 and nuclear Difficulties in clinical skills evaluation. Medical Education
1983;17:24–27.
power plant operation.120 Each of these professions 13. Norman GR, Tugwell P, Feightner JW, Muzzin LJ, Jacoby LL.
employs personnel evaluation theories and technolo- Knowledge and problem-solving. Medical Education
gies that are applicable and useful in clinical medicine. 1985;19:344–56.
They also have made more progress in evaluating team 14. Carline JD, Paauw DS, Thiede KW, Ramsey PG. Factors affect-
performance,115 an important clinical competency. ing the reliability of ratings of students’ clinical skills in a medi-
cine clerkship. Journal of General Internal Medicine
1992;7:506–10.
15. Carline JD, Wenrich MD, Ramsey PG. Characteristics of rat-
Acknowledge the Limits of Ratings
ings of physician competence by professional associates. Eval-
The consistent message throughout this target arti- uation and the Health Professions 1989;12:409–23.
16. Kreiter CD, Ferguson K, Lee WC, Brennan RL, Densen P. A
cle is that clinical ratings, as an evaluation tool, have
generalizability study of a new standardized rating form used to
serious limits. Users of clinical ratings need to be fully evaluate students’ clinical clerkship performances. Academic
aware of the limitations and make necessary adjust- Medicine 1998;73:1294–8.
ments in their use for making inferences about clinical 17. Noel GL, Herbers JEJ, Caplow MP, Cooper GS, Pangaro LN,
competence and in making progress decisions. Despite Harvey J. How well do internal medicine faculty members eval-
recent arguments to the contrary,121 our review shows uate the clinical skills of residents? Annals of Internal Medicine
1992;117:757–65.
that faculty observations and the ratings they produce 18. Ganzach Y. Negativity (and positivity) in performance evalua-
are insufficient measures of clinical competence by tion: Three field studies. Journal of Applied Psychology
themselves although they provide an important source 1995;80:491–9.
of information about clinical performance. The chal- 19. McIntyre M, James L. The inconsistency with which raters
lenge ahead is to understand when and how to use tra- weight and combine information across targets. Human Perfor-
mance 1995;8:95–111.
ditional clinical ratings and to complement their use by 20. Herbers JEJ, Noel GL, Cooper GS, Harvey J, Pangaro LN,
adding other measures such as standardized clinical Weaver MJ. How accurate are faculty evaluations of clinical
encounters and skills examinations (e.g., ATLS) to fill competence? Journal of General Internal Medicine
in the gaps regarding clinical performance. 1989;4:202–8.
21. Pangaro LN, Worth-Dickstein H, Macmillan K, Shatzer J,
Klass D. Performance of “standardized examinees” in a stan-
dardized patient examination of clinical skills. Academic Medi-
References cine 1997;72:1008–11.
22. Heneman RL. The effects of time delay in rating and amount of
1. Landy FJ, Farr JL. Performance rating. Psychological Bulletin information observed on performance rating accuracy. Acad-
1980;87:72–107. emy of Management Journal 1983;26:677–86.
2. Wolf FM. Lessons to be learned from evidence-based medicine: 23. Kassin S, Tubb V, Hosch H, Memon A. On the “general accep-
Practice and promise of evidence-based medicine and evi- tance” of eyewitness testimony research: A new survey of the
dence-based education. Medical Teacher 2000;22:251–9. experts. American Psychologist 2001;56:405–16.

24. Kalet A, Earp JA, Kowlowitz V. How well do faculty evaluate the interviewing skills of medical students? Journal of General Internal Medicine 1992;7:499–505.
25. Weekly JA, Gier JA. Ceilings in the reliability and validity of performance ratings: The case of expert raters. Academy of Management Journal 1989;32:213–22.
26. Elliot DL, Hickam DH. Evaluation of physical examination skills: Reliability of faculty observers and patient instructors. Journal of the American Medical Association 1987;258:3405–8.
27. Martin JA, Reznick RK, Rothman A, Tamblyn RM, Regehr G. Who should rate candidates in an objective structured clinical examination? Academic Medicine 1996;71:170–5.
28. Vu NV, Marcy ML, Colliver JA, Verhulst SJ, Travis TA, Barrows HS. Standardized (simulated) patients' accuracy in recording clinical performance checklist items. Medical Education 1992;26:99–104.
29. MacRae HM, Vu NV, Graham B, Word-Sims M, Colliver JA, Robbs RS. Comparing checklists and databases with physicians' ratings as measures of students' history and physical-examination skills. Academic Medicine 1995;70:313–7.
30. Viswesvaran C, Ones D, Schmidt F. Comparative analysis of the reliability of job performance ratings. Journal of Applied Psychology 1996;81:557–74.
31. Kwolek CJ, Donnelly MB, Sloan DA, Birrell SN, Strodel WE, Schwartz RW. Ward evaluations: Should they be abandoned? Journal of Surgical Research 1997;69:1–6.
32. Ramsey PG, Wenrich MD, Carline JD, Inui TS, Larson EB, LoGerfo JP. Use of peer ratings to evaluate physician performance. Journal of the American Medical Association 1993;269:1655–60.
33. Violato C, Marini A, Toews J, Lockyer J, Fidler H. Feasibility and psychometric properties of using peers, consulting physicians, co-workers, and patients to assess physicians. Academic Medicine 1997;72:S82–4.
34. Kroboth FJ, Hanusa BH, Parker S, et al. The inter-rater reliability and internal consistency of a clinical evaluation exercise. Journal of General Internal Medicine 1992;7:174–9.
35. Wenrich MD, Carline JD, Giles LM, Ramsey PG. Ratings of the performances of practicing internists by hospital-based registered nurses. Academic Medicine 1993;68:680–7.
36. Conway JM, Huffcutt AI. Psychometric properties of multisource performance ratings: A meta-analysis of subordinate, supervisor, peer, and self-ratings. Human Performance 1997;10:331–60.
37. Ginsburg S, Regehr G, Hatala R, et al. Context, conflict, and resolution: A new conceptual framework for evaluating professionalism. Academic Medicine 2000;75(October, Suppl.):S6–11.
38. Tamblyn R, Benaroya S, Snell L, McLeod P, Schnarch B, Abrahamowicz M. The feasibility and value of using patient satisfaction ratings to evaluate internal medicine residents.
44. Issenberg SB, McGaghie WC, Hart IR, et al. Simulation technology for health care professional skills training and assessment. Journal of the American Medical Association 1999;282:861–6.
45. Issenberg SB, Petrusa ER, McGaghie WC, et al. Effectiveness of a computer-based system to teach bedside cardiology. Academic Medicine 1999;74(October, Suppl.):S93–5.
46. Issenberg SB, McGaghie WC, Gordon DL, et al. Effectiveness of a cardiology review course for internal medicine residents using simulation technology and deliberate practice. Teaching and Learning in Medicine 2002;14:223–8.
47. Mangione S, Nieman LZ, Gracely E, Kaye D. The teaching and practice of cardiac auscultation during internal medicine and cardiology training. Annals of Internal Medicine 1993;119:47–54.
48. Cronbach LJ. Processes affecting scores on "understanding of others" and "assumed similarity." Psychological Bulletin 1955;52:177–93.
49. Nathan BR, Lord RG. Cognitive categorization and dimensional schemata: A process approach to the study of halo in performance ratings. Journal of Applied Psychology 1983;68:102–14.
50. Cafferty T, DeNisi AS, Williams KJ. Search and retrieval patterns for performance information. Journal of Personality and Social Psychology 1986;50:676–83.
51. DeNisi AS, Peters LH. Organization of information in memory and the performance appraisal process: Evidence from the field. Journal of Applied Psychology 1996;81:717–37.
52. Williams KJ, DeNisi AS, Meglino BM, Cafferty T. Initial judgments and subsequent appraisal decisions. Journal of Applied Psychology 1986;71:189–95.
53. Williams KJ. The effect of performance appraisal salience on recall and ratings. Organizational Behavior and Human Decision Processes 1990;46:217–39.
54. Verhulst S, Colliver J, Paiva R, Williams RG. A factor analysis of performance of first-year residents. Journal of Medical Education 1986;61:132–4.
55. Thompson WG, Lipkin MJ, Gilbert DA, Guzzo RA, Roberson L. Evaluating evaluation: Assessment of the American Board of Internal Medicine Resident Evaluation Form (see comments). Journal of General Internal Medicine 1990;5:214–7.
56. Metheny WP. Limitations of physician ratings in the assessment of student clinical performance in an obstetrics and gynecology clerkship. Obstetrics and Gynecology 1991;78:136–41.
57. Schueneman AL, Carley JP, Baker WH. Residency evaluations: Are they worth the effort? Archives of Surgery 1994;129:1067–73.
58. Haber RJ, Avins AL. Do ratings on the American Board of Internal Medicine Resident Evaluation Form detect differences in clinical competency? Journal of General Internal Medicine 1994;9:140–5.
59. Burdick WP, Schoffstall J. Observation of emergency medicine
Journal of General Internal Medicine 1994;9:146–52. residents at the bedside: How often does it happen? Academic
39. Church AH. Do you see what I see? An exploration of congru- Emergency Medicine 1995;2:909–13.
ence in ratings from multiple perspectives. Journal of Applied 60. Bretz RD, Milkovich GT, Read W. The current state of perfor-
Social Psychology 1997;27:983–1020. mance appraisal research and practice: Concerns, directions,
40. Hojat M, Gonnella JS, Veloski JJ, Erdmann JB. Is the glass half and implications. Journal of Management 1992;18:321–52.
full or half empty? A reexamination of the associations between 61. Rothstein HR. Interrater reliability of job performance ratings:
assessment measures during medical school and clinical com- Growth to asymptote level with increasing opportunity to ob-
petence after graduation. Academic Medicine 1993;68:S69–76. serve. Journal of Applied Psychology 1990;75:322–7.
41. Williams RG, Barrows HS, Vu NV. Direct, standardized assess- 62. Woehr DJ. Performance dimension accessibility: Implications
ment of clinical competence. Medical Education for rating accuracy. Journal of Organizational Behavior
1987;21:482–9. 1992;13:357–67.
42. Stillman PL, Regan MB, Philbin MM, Haley HL. Results of a 63. Day NE. Can performance raters be more accurate? Investi-
survey on the use of standardized patients to teach and evaluate gating the benefits of prior knowledge of performance dimen-
clinical skills. Academic Medicine 1990;65:288–92. sions. Journal of Managerial Issues 1995;7:323–42.
43. Colliver JA, Williams RG. Technical issues: Test application.
Academic Medicine 1993;68:454–60.

290
CLINICAL PERFORMANCE RATINGS

64. Tesser A, Rosen S. The reluctance to transmit bad news. In L 85. Newble DI, Hoare J, Sheldrake PF. The selection and training of
Berkowitz (Ed.), Advances in experimental social psychology examiners for clinical examinations. Medical Education
(Vol. 8, pp. 193–232). New York: Academic, 1975. 1980;14:345–9.
65. Benedict ME, Levine EL. Delay and distortion: Tacit influences 86. Woehr DJ, Huffcutt AI. Rater training for performance ap-
on performance appraisal effectiveness. Journal of Applied praisal: A quantitative review. Journal of Occupational and
Psychology 1988;73:507–14. Organizational Psychology 1994;67:189–205.
66. Waldman DA, Thornton GC, III. A field study of rating condi- 87. Pulakos ED, White LA, Oppler SH, Borman WC. Examination
tions and leniency in performance appraisal. Psychological Re- of race and sex effects on performance ratings. Journal of Ap-
ports 1988;63:835–40. plied Psychology 1989;74:770–80.
67. Harris MM, Smith DE, Champagne D. A field study of perfor- 88. Sackett PR, DuBois CL. Rater–ratee race effects on perfor-
mance appraisal purpose: Research- versus administra- mance evaluation: Challenging meta-analytic conclusions.
tive-based ratings. Personnel Psychology 1995;48:151–60. Journal of Applied Psychology 1991;76:873–7.
68. Longenecker CO, Sims HP, Gioia DA. Behind the mask: The 89. McEvoy GM, Cascio WF. Cumulative evidence of the relation-
politics of employee appraisal. The Academy of Management ship between employee age and job performance. Journal of
Executive 1987;1:183–93. Applied Psychology 1989;74:11–7.
69. Bernardin HJ, Cardy RL. Appraisal accuracy: The ability and 90. Waldman DA, Avolio BJ. A meta-analysis of age differences in
motivation to remember the past. Public Personnel Management job performance. Journal of Applied Psychology
1982;11:352–7. 1986;71:33–8.
70. Littlefield JH, DaRosa DA, Anderson KD, Bell RM, Nicholas 91. Heilman ME, Stopeck MH. Being attractive, advantage or dis-
GG, Wolfson PJ. Accuracy of surgery clerkship performance advantage? Performance-based evaluations and recommended
raters. Academic Medicine 1991;66:S16–8. personnel actions as a function of appearance, sex, and job type.
71. Ryan JG, Mandel FS, Sama A, Ward MF. Reliability of faculty Organizational Behavior and Human Decision Processes
clinical evaluations of non-emergency medicine residents dur- 1985;35:202–15.
ing emergency department rotations. Academic Emergency 92. Wigton RS. The effects of student personal characteristics on
Medicine 1996;3:1124–30. the evaluation of clinical performance. Journal of Medical Edu-
72. Magarian GJ, Mazur DJ. A national survey of grading systems cation 1980;55:423–7.
used in medicine clerkships. Academic Medicine 93. Goffman E. The presentation of self in everyday life. Garden
1990;65:636–9. City, NY: Doubleday, 1959.
73. Magarian GJ, Mazur DJ. Evaluation of students in medicine 94. Haas J, Shaffir W. Ritual evaluation of competence. Work and
clerkships. Academic Medicine 1990;65:341–5. Occupations 1982;9:131–54.
74. Vu NV, Henkle JQ, Colliver JA. Consistency of pass–fail de- 95. Haas J, Shaffir W. Becoming doctors: The adoption of a cloak
cisions made with clinical clerkship ratings and standard- of competence. Greenwich, CT: JAI, 1987.
ized-patient examination scores. Academic Medicine 96. Gordon R. Impact of ingratiation on judgments and evaluations:
1994;69:S40–1. A meta-analytic investigation. Journal of Personality and So-
75. Shea J, Norcini J, Kimball H. Relationships of ratings of clini- cial Psychology 1996;71:54–70.
cal competence and ABIM scores to certification status. Aca- 97. Martell RF, Borg MR. A comparison of the behavioral rating
demic Medicine 1993;68:S22–4. accuracy of groups and individuals. Journal of Applied Psy-
76. Ramsey PG, Carline JD, Blank LL, Wenrich MD. Feasibility chology 1993;78:43–50.
of hospital-based use of peer ratings to evaluate the perfor- 98. Wright P. The harassed decision maker: Time pressures, dis-
mances of practicing physicians. Academic Medicine tractions, and the use of evidence. Journal of Applied Psychol-
1996;71:364–70. ogy 1974;59:555–61.
77. Schwartz RW, Donnelly MB, Sloan DA, Johnson SB, Strodel 99. Brannick MT, Salas E, Prince C. (Eds.). Team performance as-
WE. The relationship between faculty ward evaluations, OSCE, sessment and measurement: Theory, methods, and applica-
and ABSITE as measures of surgical intern performance. tions. Mahwah, NJ: Lawrence Erlbaum Associates, Inc., 1997.
American Journal of Surgery 1995;169:414–7. 100. Kopelow ML, Schnabl GK, Hassard TH, et al. Assessing prac-
78. Speer AJ, Solomon DJ, Ainsworth MA. An innovative evalua- ticing physicians in two settings using standardized patients.
tion method in an internal medicine clerkship. Academic Medi- Academic Medicine 1992;67(October, Suppl.):S19–21.
cine 1996;71:S76–8. 101. Sackett P, Zedeck S, Fogli L. Relations between measures of
79. Hemmer PA, Hawkins R, Jackson JL, Pangaro LN. Assessing typical and maximum job performance. Journal of Applied
how well three evaluation methods detect deficiencies in medi- Psychology 1988;73:482–6.
cal students’ professionalism in two settings of an internal med- 102. Tamblyn R, Benaroya S, Snell L, McLeod P, Schnarch B,
icine clerkship. Academic Medicine 2000;75:167–73. Abrahamowicz M. The feasibility and value of using patient
80. Barzansky B, Jonas HS, Etzel SI. Educational programs in US satisfaction ratings to evaluate internal medicine residents.
medical schools, 1997–1998. Journal of the American Medical Journal of General Internal Medicine 1994;9:146–52.
Association 1998;280:803–8. 103. Newble D. The critical incident technique: A new approach to
81. Tonesk X, Buchanan RG. An AAMC pilot study by 10 medical the assessment of clinical performance. Medical Education
schools of clinical evaluation of students. Journal of Medical 1983;17:401–3.
Education 1987;62:707–18. 104. McGlynn TJ, Munzenrider RF, Zizzo J. A resident’s internal
82. Yao DC, Wright SM. National survey of internal medicine resi- medicine practice. Evaluation and the Health Professions
dency program directors regarding problem residents. Journal 1979;2:463–76.
of the American Medical Association 2000;284:1099–104. 105. Shatzer, J., Wardrop, J. Williams, R., Hatch, T. The
83. Kwolek CJ, Donnelly MB, Sloan DA, Birrell SN, Strodel WE, generalizability of performance on different station length stan-
Schwartz RW. Ward evaluations: Should they be abandoned? dardized patient cases. Teaching and Learning in Medicine
Journal of Surgical Research 1997;69:1–6. 1994;6:54–8.
84. Jain SS, DeLisa JA, Campagnolo DI. Methods used in the eval- 106. Shatzer JH, DaRosa D, Colliver JA, Barkmeier L. Sta-
uation of clinical competency of physical medicine and rehabil- tion-length requirements for reliable performance-based exam-
itation residents. American Journal of Physical Medicine and ination scores. Academic Medicine 1993;68:224–9.
Rehabilitation 1994;73:234–9.

291
WILLIAMS, KLAMEN, MCGAGHIE

107. Rhoton MF, Furgerson CL, Cascorbi HF. Evaluating clinical 115. Wiener EL, Kanki BG, Helmreich RL. (Eds.). Cockpit resource
competence in anesthesia: Using faculty comments to develop management. San Diego, CA: Academic, 1993.
criteria. Annual Conference on Research in Medical Education 116. Berk RA. (Ed.). Performance assessment: Methods & applica-
1986;25:57–62. tions. Baltimore: Johns Hopkins University Press, 1986.
108. Rhoton MF. A new method to evaluate clinical performance 117. Hunt RA, Hinkle JE, Malony HN. (Eds.). Clergy assessment
and critical incidents in anesthesia: Quantification of daily and career development. Nashville, TN: Abingdon, 1990.
comments by teachers. Medical Education 1989;23:280–9. 118. Wigdor AK, Green BF. (Eds.). Performance assessment for the
109. Rhoton MF, Barnes A, Flashburg M, Ronai A, Springman S. In- workplace (Vol. 1). Washington, DC: National Academy Press,
fluence of anesthesiology residents’ noncognitive skills on the 1991.
occurrence of critical incidents and the residents’ overall clini- 119. Wigdor AK, Green BF. (Eds.). Performance assessment for the
cal performances. Academic Medicine 1991;66:359–61. workplace: Technical issues (Vol. 2). Washington, DC: Na-
110. Soglin, D. J. The use of event report cards in the evaluation of tional Academy Press, 1991.
pediatric residents: A pilot study. Chicago: University of Illi- 120. Toquam JL, Macaulay JL, Westra CD, Fujita Y, Murphy SE.
nois, Department of Medical Education, 1996. Assessment of nuclear power plant crew performance variabil-
111. Brennan BG, Norman GR. Use of encounter cards for evalua- ity. In MT Brannick, E Salas, C Prince (Eds.), Team perfor-
tion of residents in obstetrics. Academic Medicine mance assessment and measurement: Theory, methods, and ap-
1997;72:S43–4. plications (pp. 253–287). Mahwah, NJ: Lawrence Erlbaum
112. Flanagan JC, Burns RK. The employee performance record: A Associates, Inc., 1997.
new appraisal and development tool. Harvard Business Review 121. Whitcomb ME. Competency-based graduate medical educa-
1955;33:95–102. tion? Of course! But how should competency be assessed? Aca-
113. Norcini JJ, Blank LL, Arnold GK, Kimball HR. The mini-CEX demic Medicine 2002;77:359–60.
(clinical evaluation exercise): A preliminary investigation. An-
nals of Internal Medicine 1995;123:795–9.
Received 23 September 2002
Final revision received 14 April 2003