
Med Teach 2014; 36: 97–110
DOI: 10.3109/0142159X.2013.853119

AMEE GUIDE

How to set standards on performance-based examinations: AMEE Guide No. 85

DANETTE W. MCKINLEY & JOHN J. NORCINI
FAIMER, Research and Data Resources, USA

Correspondence: Danette W. McKinley, PhD, Foundation for the Advancement of International Medical Education and Research, 3624 Market Street, 4th Floor, Philadelphia, PA 19104, USA. Tel: +1 215 823 2231; fax: +1 215 386 3309; email: dmckinley@faimer.org

Abstract
This AMEE Guide offers an overview of methods used in determining passing scores for performance-based assessments. A
consideration of various assessment purposes will provide context for discussion of standard setting methods, followed by a
description of different types of standards that are typically set in health professions education. A step-by-step guide to the
standard setting process will be presented. The Guide includes detailed explanations and examples of standard setting methods,
and each section presents examples of research done using the method with performance-based assessments in health professions
education. It is intended for use by those who are responsible for determining passing scores on tests and need a resource
explaining methods for setting passing scores. The Guide contains a discussion of reasons for assessment, defines standards, and
presents standard setting methods that have been researched with performance-based tests. The first section of the Guide
addresses types of standards that are set. The next section provides guidance on preparing for a standard setting study. The
following sections include conducting the meeting, selecting a method, implementing the passing score, and maintaining the
standard. The Guide will support efforts to determine passing scores that are based on research, matched to the assessment
purpose, and reproducible.

Introduction

Standard setting is the process of defining or judging the level of knowledge and skill required to meet a typical level of performance and then identifying a score on the examination score scale that corresponds to that performance standard. Standard setting procedures are employed to provide a conceptual definition of competence for an occupation or educational domain and to operationalise the concept. When considering the conceptual definition of competence, it is helpful to think about the criteria developed in competency-based medical education. The descriptive information provided in the development of milestones or benchmarks (Holmboe et al. 2010) can be helpful in defining the performance standard. The standard setting process is designed to translate a conceptual definition of competence to an operational version, called the passing score (Kane 1994; Norcini 1994). Verification that the passing score is appropriate is another critical element in collecting evidence to support the validity of test score interpretation (American Educational Research Association et al. 1999; Kane 2006). Various approaches to determining passing scores for examinations have been developed and researched. In this Guide, an overview of the methods that have been typically used with performance-based assessments will be provided. A consideration of various assessment purposes will provide context for discussion of standard setting methods, followed by a description of different types of standards that are typically set in health professions education. A step-by-step guide to the standard setting process will be presented.

Practice points

- Although there is extensive research on standard setting with both multiple-choice and performance-based tests, there is no "right" passing score, and no "best" method.
- Different methods yield different results.
- Pre-fixed passing scores set without consideration of test content or examinee performance can vary greatly due to test difficulty and content, affecting the appropriateness of the decisions made.
- Selecting a method depends on the purpose of the examination and the resources available for the standard setting effort.
- The passing score should be determined by a group (e.g. faculty members) familiar with the assessment purpose and the domain assessed.
- The standard setting method selected should: be closely aligned with assessment goals; be easy to explain and implement; require judgments that are based on performance data; entail thoughtful effort; and be based on research.

Assessment purposes

In education, it is often necessary to evaluate whether trainees are attaining the knowledge, skills, and attitudes needed to perform in the field of endeavour.


In order to determine whether "sufficient knowledge, skills and attitudes" are present, different methods are typically employed as part of a programme of assessment (Dijkstra et al. 2010). In health professions education, there are many approaches to assessing the knowledge, skills, attitudes and abilities of applicants, students, graduates, and practitioners. For some time, health professions educators used any available assessment method to evaluate the competencies of a doctor, even if they were not appropriate (Norcini & McKinley 2007). For example, although it is important for a doctor to be able to communicate effectively with the healthcare team, assessment of this aspect is not appropriately tested through the use of written examinations. Several methods of assessment have been developed and implemented, with a movement towards assessment based on performance that is tied to what is expected in practice. In the education of health professionals, standardised patients (SPs), lay people trained to portray complaints of "real" patients, are frequently used (e.g. Patil et al. 2003). This type of assessment provides examinees the opportunity to show what they can do (e.g. correctly perform a physical examination, communicate with a patient), rather than what they know (Miller 1990).

In developing methods that assess what examinees do, other methods, or even combinations of modalities, have also been used (Nestel et al. 2006). In many healthcare professions, the use of various workplace-based assessments, including chart reviews and 360 degree evaluations, have been instituted (Norcini & Burch 2007). These assessments are usually employed as quality improvement measures, and involve peer review, evaluation of practice outcomes, and patient or client satisfaction measures (Norcini 2003).

The varieties of instruments that are available have been developed, at least in part, to meet different assessment goals. Assessment goals may include determining who is eligible to enter medical school (e.g. admissions testing); whether course requirements are satisfied (e.g. classroom testing); if a student is ready to advance to the next level of training (e.g. end-of-year testing); whether the examinee is ready to enter a profession (e.g. licensure and certification testing); or whether the examinee has shown evidence of expertise (e.g. maintenance of certification). The assumption is usually made that scores obtained from assessments provide an indication of a student's ability to "use the appropriate knowledge, skills, and judgment to provide effective professional services over the domain of encounters defining the area of practice" (Kane 1992, p. 167). Scores are often used to make decisions (or interpretations) regarding whether students (or graduates) have sufficiently acquired the knowledge and skills to enter, or continue practice in, a profession. In this manner, test scores are used to classify people into two or more groups (e.g. passing or failing an examination). Test scores can be used to make decisions about who needs additional educational help, whether the test-taker will go on to the next level of training, or whether the test-taker has achieved mastery in the domain of interest.

Kane (1992) defined competence as the extent to which the individual can handle the various situations that arise in that area of practice (p. 165). In education, licensure, and certification, assessments are used to determine the proficiency of candidates. For example, when performance measures such as the objective structured clinical examination (OSCE) are developed, principles associated with domain-based assessment, including definition of characteristics denoting adequate performance, can be employed (Pell et al. 2010). Whether the test consists of multiple-choice items or task performance, it is important to consider whether the goal is to compare an individual with others taking the same assessment, or to determine the level of proficiency. In the next section of the Guide, we discuss types of standards related to these different purposes.

Types of standards

Standards can be categorised as relative (sometimes called norm-referenced) or absolute (sometimes called criterion-referenced) (Livingston & Zieky 1982). Standards that are established based on a comparison of those who take the assessment to each other are relative standards. For example, when the passing score is set based on the number or percentage of examinees that will pass, the standard is relative. This type of standard setting is typically used in selection for employment or admission to educational programmes where the positions available are limited. With high stakes examinations (e.g. graduation, certification, licensure), relative standards are not typically used, because the ability of the groups of test takers could vary over time and the content of the assessment may also vary over time. When pre-determined passing scores are used with educational (classroom) tests, the ability of the students in the class and the difficulty of the test given are not considered (Cohen-Schotanus & van der Vleuten 2010). Examinee ability and test difficulty are factors that could adversely affect evidence of the appropriateness of the passing score. To avoid the disadvantages associated with the relative standard setting method, absolute standard setting approaches are more commonly used in credentialing examinations (i.e. licensure or certification).

Standards set by determining the amount of test material that must be answered (or performed) correctly in order to pass are absolute standards. For example, if the examinee must answer 75% of the items on a multiple-choice test correctly in order to pass, the standard is absolute. When absolute standards are used, it is possible that decisions made will result in all examinees passing or failing the examination.

Standard setting and performance-based assessments

Several standard setting methods have been used with performance-based assessments. Because relative standards are set based on the desired result (e.g. admitting the top 75 candidates to a school) and are, therefore, determined more easily, this Guide will focus on absolute standard setting methods (see Table 1 for comparisons). These methods involve review of examination materials or examinee performance, and the resulting passing scores can be derived from the judgments of a group of subject matter experts. Absolute standard setting approaches have been referred to as either test-centred or examinee-centred (Livingston & Zieky 1982).

Table 1. Features of standard setting methods.

Relative or norm-referenced
  Purpose: Ranking examinees.
  Description: Standard reflects group performance. Passing score is set based on test results (i.e. percent passing).
  Comments: Useful when a predetermined number (or percentage) of examinees should pass the examination, e.g. admissions testing, employment testing. Passing score may fluctuate due to examinee ability or test difficulty. Pass rate (number or percentage passing) is stable.

Absolute or criterion-referenced
  Purpose: Determining mastery.
  Description: Standard reflects a conceptual definition of mastery, proficiency, or competence. Passing score is set independent of test results.
  Comments: Useful when determining whether examinees meet requirements defined by the standard. All examinees could pass or fail using this type of standard.

Test-centred (absolute)
  Description: Standard is translated to a passing score based on review of test materials (e.g. multiple-choice questions, cases, tasks). Standard is defined, and judges focus on exam content.

Examinee-centred (absolute)
  Description: Standard is translated to a passing score based on the review of examinees' performance on tasks. Judges determine whether the performance they review depicts someone possessing the knowledge and skills needed to meet the standard.

Compromise
  Description: Judges consider both the passing score and the passing (or failing) rate, and give an opinion about what constitutes an "acceptable" cut score and passing rate.
  Comments: Can be a useful alternative when resources are limited. The approach assumes that a passing score has been determined and test results are available. Useful when a balance between absolute and relative methods is needed.

When using test-centred methods, the judges focus on exam content. The Angoff (1971), Ebel (1972), and Nedelsky (1954) methods are examples of test-centred standard setting methods. In contrast, when using examinee-centred methods, judges focus on the performance of examinees. These methods include the contrasting groups, borderline group, and borderline regression methods (Livingston & Zieky 1982; Wood et al. 2006). The judges' task in examinee-centred methods is to determine whether the performance they review depicts someone possessing the knowledge and skills needed to meet the standard (e.g. is minimally competent).

Aspects of test-centred and examinee-centred approaches, related to examinations containing multiple choice questions (MCQs), were presented in a previous AMEE Guide (Bandaranayake 2008), and the application of some of these approaches to the OSCE is presented here. Specifically, we will provide guidance for the use of the Angoff, borderline group, borderline regression, contrasting group, and compromise methods of standard setting using simulated OSCE data. The remainder of the Guide is divided into four sections: preparing for the standard setting study, conducting the standard setting study, generating the passing score, and implementing and maintaining standards.

Preparing for the standard setting study

Determining the passing score is typically accomplished by a group familiar with the assessment purpose and the domain assessed. Before this group meets, there are a number of steps that should be completed. First, panellists need to be recruited. To increase objectivity and to derive plausible passing scores, panellists should be knowledgeable about the examination content area and the purpose of the test (Jaeger 1991; Raymond & Reid 2001). In addition, they should be familiar with the qualifications of the students being tested. Finally, experience with the assessment method is essential. Panellists who make judgments about clinical proficiency should be experts in the profession and should have familiarity with expectations of trainees at various stages of education, because they will readily understand the consequences associated with passing the test. Because OSCEs are often used to test clinical and communication skills, faculty members who have participated in the OSCE as examiners, or those who have assisted in the development of materials used in the administration (e.g. checklists), would be ideal to recruit as panellists for the standard setting meeting. To set standards for communication skills of physicians, for example, other members of the health care team (e.g. nursing staff, standardised patient trainers) could be considered as judges. Because they are familiar with the expected performance of physicians in clinical settings, they would also be appropriate participants in the standard setting process. A suitable mix of panellists based on gender, discipline (e.g. paediatrics, general medicine), and professional activity (e.g. faculty vs. practicing physician) should be considered. This is particularly important when the passing score identifies those candidates who will enter a profession (e.g. licensure). The more panellists there are, the more likely it is that the resulting passing score will be

stable (Jaeger 1991; Ben-David 2000). However, consideration of the management of a large group is also important. The number of panellists needed should be balanced against a number of factors: the consequences associated with examination decisions, the number of performances to review, a reasonable time frame for completion of the standard setting meeting, and the resources available for the meeting. Successful standard setting meetings can be conducted with as few as four panellists or as many as 20. Having a large group will provide the meeting facilitator with the opportunity to assign panellists to smaller groups so that more material can be covered. The deciding factor will be resources (e.g. space, number of facilitators) available for the meeting.

The next step in organising the meeting is preparing materials to be used in training the panellists. Panellist training is very important; developing a clear understanding of the performance standard (e.g. student in need of remediation, graduate ready for supervised practice, practitioner ready for unsupervised practice) is essential (Ben-David 2000; Bandaranayake 2008). To promote understanding of the performance standard, criteria that are typically part of a competency-based curriculum can be very useful. This type of information can assist in the delineation of the knowledge, skills, and abilities that comprise the performance standard in a particular context, and at a particular stage in a health professional's career (Frank et al. 2010).

To support training, examination materials may be used as part of the orientation. For example, as part of the orientation to standard setting for an OSCE, panellists could complete some of the stations as examinees. This allows them to experience the examination from the perspective of the examinee. Next, a discussion of the characteristics associated with the performance standard would be conducted. Finally, the panellists would be afforded the opportunity to practice the method selected.

In the next section of the Guide, we present methods that can be used to set passing scores for performance assessments, using the OSCE as an example. These methods are commonly used to derive passing scores for OSCE and standardised patient examinations (Norcini et al. 1993; Boulet et al. 2003; Downing et al. 2006). In each section, research regarding the method as used with OSCEs or standardised patient examinations is cited. In order to provide detailed guidelines for deriving passing scores for an OSCE, we generated data for an end-of-year examination for 50 students, using five OSCE stations. Data for a multiple-choice examination consisting of 50 items were also generated. This simulated data set will be used to illustrate a number of methods that can be used to develop passing scores for an OSCE examination.

Conducting the standard setting meeting: modified Angoff

Checklist items

Several studies have been conducted where the modified Angoff was used to set standards for each item in a checklist (e.g. Downing et al. 2003, 2006). For this process, it is necessary to prepare case materials for all panellists, including checklists and a form for providing ratings. The meeting facilitator will need a way to display the ratings (flip chart, projector) and should prepare forms for data entry and, if possible, set up spreadsheets for each case checklist. This will permit simple analyses of the data, calculating averages for items, judges, and cases. In these studies, panellists review the items on the checklist, and the task is to estimate the percentage of examinees meeting the performance standard (e.g. minimally competent) who will correctly perform the action described in the OSCE checklist.

Table 2 shows a sample checklist for a case of a fall in an elderly patient. Following a discussion of the performance standard, and presentation of case materials, it is recommended that the panellists review the first five checklist items as a group. In the example, there is a large difference between the ratings of raters 10 and 13 on item 3. Discussion of discrepancies of 15% or more would be facilitated, with the goal of reaching consensus on the definition and the rating task.

Table 2. Modified Angoff: checklist items.

Item Text Rater 10 Rater 11 Rater 12 Rater 13 Rater 14 Average


1 Describe the fall in more detail OR Tell me what happened. 65 80 65 80 75 73
2 Hit head when you fell? 75 70 60 65 70 68
3 Loss of consciousness? 65 72 75 85 68 73
4 Felt faint OR lightheaded OR dizzy recently or just before the fall? 70 80 85 80 55 74
5 Since fall, any new symptoms other than hip pain, OR double vision 75 65 60 70 45 63
OR speech problems OR weakness of an arm/hand?
6 Palpitations before fall? 80 55 40 62 60 59
7 What makes pain worse? 65 60 65 57 65 62
8 Other falls? 75 75 55 85 67 71
9 Past medical history? 65 68 75 50 70 66
10 Medications? 70 65 45 55 55 58
11 Checks symmetry of strength in lower extremities 67 68 70 65 65 67
(must include hip flexion on right)
12 Palpates over area(s) of pain and trochanteric area of hip. 65 70 68 75 60 68
13 Observes gait 60 80 65 70 72 69
14 Orientation to time OR assessment of recent memory. 65 75 78 60 70 70
15 Auscultates heart 60 50 50 65 65 58
Passing score for case 67


Table 3. Modified Angoff: station-level passing score.

Station Rater 10 Rater 11 Rater 12 Rater 13 Rater 14 Rater Average Average Score (All students)
1 45 55 45 60 45 50 59.3
2 65 60 70 55 65 63 74.8
3 65 60 65 50 55 59 73.1
4 70 65 60 55 60 62 71.9
5 65 60 65 55 60 61 70.1
Passing score 59 69.8

If there are items with discrepancies of 15% or more, they should be reviewed after the meeting to determine if there is a content-related problem with the checklist item. This item may need to be removed from determination of the passing score, and any discussion the panellists have can be useful in making that decision. There may be some items where discrepancies of 15% or more cannot be resolved through discussion, but there should not be many of these items. After group rating and discussion, panellists can proceed through the remaining checklist items independently.

The table used for this example shows the average rating across all panellists. To derive the passing score for the case, the average rating across all items and panellists was calculated. In this example, the passing score would be 67%. If examinees correctly complete 67% or more of the items on the checklist, they would achieve a passing score for that case. These are the steps associated with using the modified Angoff method to determine the passing score for a single case. This process would be repeated with all checklists for stations included in the OSCE administration. Once panellists have completed their rating process, data entry can begin. Data entry can be facilitated by using a spreadsheet for each case. Calculating the mean across items and judges produces a cut score for each case. Depending on the scoring and reporting of assessment results, passing scores can be calculated for the OSCE by averaging case passing scores, or by OSCE content domains (e.g. history taking, physical examination, communication).
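Because the computation is simply a grand mean, it is easy to script. The following Python sketch is our own illustration rather than part of the published method; it reproduces the calculation for the first five checklist items of Table 2 and flags items whose ratings span 15 percentage points or more. Over all 15 items the same calculation yields the 67% reported above.

```python
# Modified Angoff, checklist-item level: each cell is a panellist's
# estimate of the percentage of borderline examinees who would perform
# the item correctly. The case passing score is the mean over items and
# panellists. Data: first five items of Table 2 (raters 10-14).
ratings = [
    [65, 80, 65, 80, 75],  # Item 1: describe the fall
    [75, 70, 60, 65, 70],  # Item 2: hit head when you fell?
    [65, 72, 75, 85, 68],  # Item 3: loss of consciousness?
    [70, 80, 85, 80, 55],  # Item 4: felt faint/lightheaded/dizzy?
    [75, 65, 60, 70, 45],  # Item 5: new symptoms since the fall?
]

item_means = [sum(row) / len(row) for row in ratings]
case_cut = sum(item_means) / len(item_means)
print(f"Passing score over these five items: {case_cut:.1f}%")

# Flag large discrepancies (15 points or more) for panel discussion.
for i, row in enumerate(ratings, start=1):
    spread = max(row) - min(row)
    if spread >= 15:
        print(f"Item {i}: ratings span {spread} points - discuss")
```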
Cases

Although the modified Angoff has been used by gathering judgments at the checklist item level, it is more common to have panellists make their judgments at the case or station level. One rationale for this approach is that the items in a checklist are inter-related; the likelihood of asking a question or performing a manoeuvre is dependent on asking other questions or on other findings in a physical examination (Ben-David 2000; Boulet et al. 2003). If the Angoff is conducted at the case level, the panellists' task is making an estimation of the percentage of examinees who will meet the performance standard based on the content of the case, rather than the individual items within the checklist for the OSCE station (e.g. Kaufman et al. 2000). Alternately, the panellists can be asked to review the checklist items and estimate the percentage of items for which the examinees meeting the performance standard (e.g. minimally qualified) will get credit.

Case materials are reviewed by the panellists, and the preparation for data entry is similar. However, a single spreadsheet may be used, depending on the number of cases to be analysed as part of the standard setting process. Table 3 provides a sample spreadsheet format for the simulation we mentioned earlier, with five OSCE stations for an end-of-year test of 50 students. For this example, there are five judges. Once again, the characteristics of the examinees meeting the performance standard are discussed, test materials are presented and reviewed, and panellists begin the rating task. Practicing the method is essential. Using our example, a sixth case would be used for practice and discussion amongst panellists. The ratings from this case would not be used to generate the passing score. The case scores (in percent correct metric) for all students are presented in the last column of the table. If time permits, panellists could provide their ratings and then be given information on how all students performed on the stations. They can then change their estimates. Calculating the mean across judges and stations provides the passing score for the OSCE.
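The station-level arithmetic is equally small. A minimal sketch using the judge estimates in Table 3 (again Python, with variable names of our own choosing):

```python
# Modified Angoff at the station level: each judge estimates the percent
# correct score a borderline examinee would earn on each station; the
# examination passing score is the mean over judges and stations.
station_ratings = {  # station -> ratings from the five judges (Table 3)
    1: [45, 55, 45, 60, 45],
    2: [65, 60, 70, 55, 65],
    3: [65, 60, 65, 50, 55],
    4: [70, 65, 60, 55, 60],
    5: [65, 60, 65, 55, 60],
}

station_cuts = {s: sum(r) / len(r) for s, r in station_ratings.items()}
for station, cut in sorted(station_cuts.items()):
    print(f"Station {station}: cut score {cut:.0f}%")

exam_cut = sum(station_cuts.values()) / len(station_cuts)
print(f"OSCE passing score: {exam_cut:.0f}%")  # 59%, matching Table 3
```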
Conducting the standard setting meeting: borderline group

While both modified Angoff approaches (checklist items and case level) are commonly used to determine passing scores in OSCE and standardised patient examinations, panellists may find challenging the task of estimating the percentage of examinees meeting the performance standard who will receive credit for items or who will correctly manage the case (Boulet et al. 2003). A method that focuses on examinee performance rather than examination materials, and that is frequently used with OSCEs, is the borderline group method. The borderline group method requires the identification of the characteristics (e.g. knowledge, skills, and abilities) of the "borderline" examinee. The "borderline" examinee is one whose knowledge and skills are not quite adequate, but are not inadequate (Livingston & Zieky 1982). Assessment materials (or actual examinee performances) are categorised by panellists as clear fail, borderline, or clear pass. The passing score is then set at the median (i.e. 50th percentile) score of the borderline group (e.g. Rothman & Cohen 1996).

One modification to this method that has been used with OSCEs is to use the judgments gathered during the OSCE administration (e.g. see Reznick et al. 1996). In this modification, a panel of judges is not used; instead, observers provide information used to derive the passing score for each station. If experts (e.g. physicians, faculty members) are used to score the OSCE stations, they can be asked whether the performance they have seen would be considered "borderline." This approach can save time by gathering examiners' judgments while the examination is being administered.

Table 4. Example of individual results for borderline group method.

Date Student number Case 6 score Examiner rating Date Student number Case 6 score Examiner rating
12-Apr-10 15 75 Clear Pass 14-Apr-11 12 64 Borderline
12-Apr-10 20 83 Superior 14-Apr-11 39 50 Clear Fail
12-Apr-10 43 75 Clear Pass 14-Apr-11 28 57 Clear Fail
12-Apr-10 42 100 Superior 14-Apr-11 50 43 Clear Fail
12-Apr-10 13 75 Clear Pass 14-Apr-11 10 64 Borderline
12-Apr-10 38 92 Superior 14-Apr-11 48 71 Clear Pass
12-Apr-10 5 92 Superior 14-Apr-11 30 71 Clear Pass
12-Apr-10 14 83 Superior 14-Apr-11 23 71 Clear Pass
12-Apr-10 29 83 Superior 13-Apr-11 19 89 Superior
10-Apr-11 37 60 Clear Fail 13-Apr-11 8 79 Clear Pass
10-Apr-11 36 40 Clear Fail 14-Apr-11 1 64 Borderline
10-Apr-11 16 50 Clear Fail 14-Apr-11 41 64 Borderline
10-Apr-11 47 60 Clear Fail 13-Apr-11 19 89 Superior
10-Apr-11 24 70 Clear Pass 13-Apr-11 17 58 Clear Fail
10-Apr-11 40 80 Superior 13-Apr-11 6 74 Clear Pass
10-Apr-11 33 70 Clear Pass 13-Apr-11 32 74 Clear Pass
10-Apr-11 35 90 Superior 13-Apr-11 27 95 Superior
10-Apr-11 26 70 Clear Pass 13-Apr-11 11 68 Borderline
10-Apr-11 7 80 Superior 13-Apr-11 3 68 Borderline
11-Apr-11 4 56 Clear Fail 13-Apr-11 34 89 Superior
11-Apr-11 18 75 Clear Pass 13-Apr-11 46 84 Superior


11-Apr-11 21 69 Borderline 11-Apr-11 9 94 Superior
11-Apr-11 2 50 Clear Fail 11-Apr-11 25 69 Borderline
11-Apr-11 31 81 Superior 11-Apr-11 22 75 Clear Pass
11-Apr-11 45 63 Borderline 12-Apr-11 44 92 Superior
11-Apr-11 49 50 Clear Fail

Once examiners have identified examinees whose performance is considered borderline, the passing score can be calculated by finding the median score of all examinees who were classified as "borderline."

To illustrate this approach, we will use the same data set used for the modified Angoff approach, which is presented in Table 4. Fifty students were tested, and examiners provided a rating indicating whether the observed performance was a "clear fail," "borderline," "clear pass," or "superior" at each station. Of the 50 students, nine were thought to have demonstrated "borderline" performance at the OSCE station. To derive the passing score for the station, the median (i.e. 50th percentile) score was calculated. For this example, the passing score was identified using spreadsheet software: =MEDIAN(C23,C26,C29,C36,C37,C42,C43,G2,G6), where "C" and "G" indicate the columns where OSCE scores are located and the numbers indicate the rows for the scores of the borderline examinees. In this example, the median score is 64%, so examinees with scores of 64% or higher would pass the station.
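The same result can be obtained outside a spreadsheet. A minimal Python sketch; the score-rating pairs below are the nine borderline examinees from Table 4 plus a few others for context:

```python
from statistics import median

# Borderline group method: the station passing score is the median
# checklist score of examinees rated "Borderline" by the examiners.
results = [  # (case score, examiner rating), drawn from Table 4
    (75, "Clear Pass"), (64, "Borderline"), (50, "Clear Fail"),
    (69, "Borderline"), (63, "Borderline"), (64, "Borderline"),
    (68, "Borderline"), (68, "Borderline"), (89, "Superior"),
    (69, "Borderline"), (64, "Borderline"), (64, "Borderline"),
]

borderline = [score for score, rating in results if rating == "Borderline"]
print(f"{len(borderline)} borderline performances; "
      f"passing score = {median(borderline)}%")  # 64%, as above
```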
A modification of this approach is used by the Medical Council of Canada (Smee & Blackmore 2001), where a six-point rating scale is used: inferior, poor, borderline unsatisfactory, borderline satisfactory, good, and excellent. The mean station score for those examinees rated borderline unsatisfactory and borderline satisfactory is calculated to derive the passing score for the station. This modified borderline group method works well when there are enough examinees who were rated "borderline" for the station. However, the stability of the passing score is dependent on the number of examinees in the borderline unsatisfactory and borderline satisfactory categories. If few examinees are rated "borderline", then a passing score based on the mean of their station scores is not likely to be stable. That is, the reliability associated with a cut score derived from two or three scores is likely to be very low.

To overcome this potential disadvantage, a regression approach was studied by Wood et al. (2006). Using the entire range of OSCE scores can be particularly useful if only a small number of examinees have participated. Because the number of examinees classified as borderline could be very small, the resulting passing score could be less precise than if all scores were used (Wood et al. 2006). In this modification, the checklist score is the dependent variable; the rating is the independent variable. The goal of the regression analysis is to predict the checklist score of the examinees classified as "borderline" for the station.

The borderline regression method is straightforward, and can be done using a Microsoft Excel worksheet. Details on the method are provided in Figures 1–5, which depict a series of seven steps.

Step 1: Prepare a spreadsheet of OSCE scores and examiner ratings.
Step 2: Click on the tab labelled "Data," and when the pop-up window appears, select "Data Analysis." The analysis tool you will select is "Regression."
Step 3: Identify the "Input Y Range" – what will be predicted. In this case, it is the OSCE scores in Column C.
Step 4: Identify the "Input X Range" – what will be used to predict scores. In this case, the ratings provided by "Examiner PF1" in Column D will be selected.
Step 5: Identify the location for analysis results. In the example, we gave the spreadsheet the name "Sheet 3." Click OK in the "Regression" window (upper right side).

Figure 1. Spreadsheet of OSCE scores and examiner ratings.

Figure 2. Data analysis "regression."



Figure 3. Define variables for analysis.

Figure 4. Location for analysis results.



Figure 5. Output from regression analysis.

Step 6: The output from the regression ("Summary Output") is in "Sheet 3."
Step 7: The formula for deriving the passing score is: passing score = (median of ratings × "X Variable 1") + Intercept.

For this example, the passing score would be 75.4 = (2 × 11.561) + 52.326, where 2 is the median of the ratings, 11.561 is the "X Variable 1" coefficient, and 52.326 is the intercept.

The passing score could be adjusted by the standard error of estimation (labelled "Standard Error" in the Summary Output), if review leads to the conclusion that the examiner at a station was particularly harsh (or lenient).
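The same regression can be run outside Excel. The sketch below fits the line by ordinary least squares and evaluates it at the median rating. The numeric rating codes (1 = clear fail through 4 = superior) and the small illustrative data set are our assumptions; the full Table 4 data give the slope (11.561) and intercept (52.326) quoted above.

```python
from statistics import median

def ols_line(x, y):
    """Least-squares slope and intercept for a single predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx

# Ratings coded numerically: 1 = Clear Fail, 2 = Borderline,
# 3 = Clear Pass, 4 = Superior (an assumed coding).
ratings = [1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4]
scores  = [50, 57, 63, 64, 64, 68, 71, 74, 75, 83, 92]

slope, intercept = ols_line(ratings, scores)
cut = median(ratings) * slope + intercept   # predicted score at "Borderline"
print(f"passing score = {median(ratings)} x {slope:.3f} + {intercept:.3f} "
      f"= {cut:.1f}%")
```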

Table 5. Contrasting groups example. (The Fail and Pass columns give the examiner's decision.)

Score range   Fail   Pass   Total   Pass rate (%)
0–49(a)       3      1      4       92
50–54         3      0      3       86
55–59         5      2      7       72
60–64         2      4      6       60
65–69         0      5      5       50
70–74         0      13     13      44
75–79         0      3      3       38
85–89         0      4      4       32
90–94         0      4      4       24
95–100(a)     0      1      1       2

(a) "Smoothed" averages cannot be created for the highest and lowest score ranges.
Conducting the standard setting meeting: contrasting groups

The contrasting groups method requires panellists to review examinee work and classify the performance as acceptable or unacceptable (Livingston & Zieky 1982). In education, information external to the test is used to classify the examinees in these categories (Hambleton et al. 2000). When other measures with similar content are available, two groups of examinees are identified. Then, scores from the test on which performance standards are being established are used to generate distributions (one for each group), and the distributions are compared to determine their degree of overlap. This is done by tabulating the percentage of test-takers in each category and at each score level who are considered "competent". The passing score is the point at which about 50% of the test-takers are considered competent. For examination programmes in health professions education, it is difficult to find an external measure that assesses the same skills as those measured in the OSCE.

[Figure 6. Example of contrasting groups: overlapping distributions of the number of examinees rated "fail" and "pass" across the score range.]



The variation most commonly used in medical education is to have panellists decide whether the performance they review on the measure of interest (i.e. the OSCE or standardised patient examination) meets the characteristics associated with the performance standard. One example of a variation on this approach derived the passing score by regressing the number of judges rating the performance as "competent" on the test scores, and set the passing score at the point at which 50% of the panellists rated the performance as competent (Burrows et al. 1999).

In another variation of the contrasting groups method, panellists judged the performance of examinees on the test of interest without knowledge of their test scores (Clauser & Clyman 1994). The passing score was then identified as the intersection of the two score distributions. We will illustrate the use of this approach using our earlier example. Fifty students were tested, and the examiner provided a rating indicating whether the observed performance was considered "failing" or "passing." Table 5 shows the data set with score ranges and counts of examinees rated "fail" or "pass" by the examiner. In addition to the columns labelled with the examiner's decision, the total number of examinees with scores within each range is provided. In this example, the examiner's ratings are separate from the score range; instead, imagine that the scores are based on a checklist completed by another rater. The results show that the rater identified examinees considered passing even in the lowest part of the score range. The column labelled "Pass rate" is an indication of the percentage of examinees that would pass if the passing score was set just above the score range. For example, if the passing score was set at 50% correct, 46 examinees would pass, and the pass rate would be 92%. Figure 6 illustrates the overlap in the score distributions. Considering the approach studied by Clauser & Clyman (1994), the point of intersection would generate a recommended passing score of 65%. Examinees with scores of 65% or higher would pass, and those with scores of 64% or lower would fail.
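The intersection can also be located programmatically. A sketch under stated assumptions: band midpoints stand in for exact scores, and with these coarse bands the crossing lands near 60%, whereas the smoothed distributions plotted in Figure 6 put it at the 65% reported above; binning and smoothing choices affect the result.

```python
# Contrasting groups: find where the frequency polygon of "fail" ratings
# crosses that of "pass" ratings. Counts per score band are from Table 5.
bands = [  # (band midpoint, examinees rated fail, examinees rated pass)
    (24.5, 3, 1), (52, 3, 0), (57, 5, 2), (62, 2, 4), (67, 0, 5),
    (72, 0, 13), (77, 0, 3), (87, 0, 4), (92, 0, 4), (97.5, 0, 1),
]

for (x0, f0, p0), (x1, f1, p1) in zip(bands, bands[1:]):
    d0, d1 = f0 - p0, f1 - p1       # fail-minus-pass at adjacent midpoints
    if d0 > 0 >= d1:                 # sign change: the curves cross here
        t = d0 / (d0 - d1)           # linear interpolation between midpoints
        print(f"Intersection near {x0 + t * (x1 - x0):.0f}%")
        break
```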
Determining examination level passing scores

Although we have reviewed methods that would derive passing scores for each task performed in the performance-based assessment, a passing score at the examination level is often needed. If examinees must receive a passing score on each task or skill tested, the standard is conjunctive. If performance across all tasks or skills is considered, the standard is compensatory (Haladyna & Hess 1999; Ben-David 2000). When deciding whether to require passing each task or considering performance overall, there are several factors to consider. First, examinee performance is likely to be variable from task to task. That is, there is likely to be inconsistency in the performance of each examinee. On some tasks, an examinee will have a better performance than on others. In addition, the reliability of the individual tasks is likely to be much lower than the reliability across all tasks. Because conjunctive standards are likely to result in a higher number of students failing, the consequences of failing the examination and the logistics of resitting or repeating instruction must be considered (Ben-David 2000). Compensatory standard setting involves averaging (or summing) performances across all tasks to derive the examination score. Compensatory standards allow examinees to compensate for poor performance on one task with better performance on another.

The degree to which the tasks (or skills) correlate with each other can provide support to the compensatory vs. conjunctive decision (Ben-David 2000). Another option is to consider how performance is reported to examinees. This decision is important because those managing the standard setting process will need to decide how the derived passing scores will be used. In the conjunctive model, examinees must meet or exceed the score for each task in order to pass the examination. For the compensatory model, the average passing score across tasks can be used to set the examination-level passing score. With OSCE stations, it may be that decisions can be made across tasks (compensatory) but each skill (e.g. communications, clinical decision-making) must be passed in order to pass the examination. Consideration of the feedback provided to the examinees will play an important role in determining whether compensatory, conjunctive, or a combination will be used to set the examination level passing score.
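The two decision models can be contrasted in a few lines of code. A sketch in Python: the station cut scores reuse the judge averages from Table 3, and the examinee profile is invented for illustration.

```python
# Conjunctive vs. compensatory pass/fail decisions for one examinee.
station_cuts = {1: 50, 2: 63, 3: 59, 4: 62, 5: 61}   # from Table 3
scores       = {1: 55, 2: 60, 3: 70, 4: 65, 5: 62}   # one examinee

# Conjunctive: every station's cut score must be met.
conjunctive = all(scores[s] >= station_cuts[s] for s in station_cuts)

# Compensatory: the mean score must meet the mean cut score, so a weak
# station can be offset by a strong one.
compensatory = (sum(scores.values()) / len(scores)
                >= sum(station_cuts.values()) / len(station_cuts))

print(f"Conjunctive decision:  {'pass' if conjunctive else 'fail'}")   # fail
print(f"Compensatory decision: {'pass' if compensatory else 'fail'}")  # pass
```

Here the examinee fails station 2 (60 vs. a cut of 63) and so fails under the conjunctive rule, but the overall mean of 62.4% clears the mean cut of 59% and passes under the compensatory rule, which is exactly the trade-off discussed above.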
that would be considered passing? (Maximum passing
score; kmax)
Med Teach Downloaded from informahealthcare.com by HINARI on 02/14/14

Compromise methods The four data points are calculated by averaging across all
Although results from the standard setting panellists are the judges. The percentage of examinees that would pass for every
most important elements in determining the passing score, possible value of the passing score on the test is graphed and
additional information is often used to determine the final the four data points are plotted, based on the four judgments
passing score that will be applied to examinations (Geisinger of the standard setting panel. Figure 7 provides an example of
1991; Geisinger & McCormick 2010). One type of information application of the Hofstee method. In this example, 140
that is considered is the pass–fail rate for the passing score. students took a 50-item end-of-year test.
For personal use only.

The compromise approaches proposed by Hofstee (1983), The curve in the chart shows the projected failure rate
Beuk (1984), and De Gruijter (1985) explicitly ask the based on percent correct scores on the test. Instructors were
panellists to consider both the passing score and the passing asked the four questions that appeared above:
(or failing) rate. Each approach assumes that the judges have
(1) What is the lowest acceptable percentage of students
an opinion about what constitutes an ‘‘acceptable’’ passing
who fail the examination? Average: 20%
score and passing rate.

(2) What is the highest acceptable percentage of students who fail the examination? Average: 30%
(3) What is the lowest acceptable percent correct score that would be considered passing? Average: 60%
(4) What is the highest acceptable percent correct score that would be considered passing? Average: 75%

[Figure 7. Example of the Hofstee method: the percentage of examinees passing is plotted against test scores (% correct), with the minimum passing score (60%), maximum passing score (75%), minimum fail rate (20%), and maximum fail rate (30%) marked; the resulting passing score is 63%.]
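The graphical solution in Figure 7 can also be computed directly. In the sketch below, the judges' four averaged answers bound the compromise, and the passing score is taken where the line from (kmin, fmax) to (kmax, fmin) meets the observed failure-rate curve. The simulated scores are stand-ins of our own for the real 140-student distribution, which the figure shows yielding 63%.

```python
import random

# Hofstee method: intersect the judges' line with the observed fail-rate
# curve. Judgment averages are the four answers reported in the text.
k_min, k_max = 60.0, 75.0    # lowest / highest acceptable passing score
f_min, f_max = 20.0, 30.0    # lowest / highest acceptable fail rate

random.seed(1)               # simulated stand-in for the 140 students
scores = [min(100, max(0, random.gauss(68, 10))) for _ in range(140)]

def fail_rate(cut):
    """Percentage of examinees scoring below this cut score."""
    return 100.0 * sum(s < cut for s in scores) / len(scores)

def judges_line(cut):
    """Acceptable fail rate implied by the panel at this cut score."""
    return f_max + (f_min - f_max) * (cut - k_min) / (k_max - k_min)

cut = k_min                  # scan upwards until the curves cross
while cut <= k_max and fail_rate(cut) < judges_line(cut):
    cut += 0.1
print(f"Passing score ~ {cut:.1f}%, fail rate ~ {fail_rate(cut):.1f}%")
```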


Using the information from the judges, two points are plotted: the intersection of the lowest acceptable fail rate and the highest acceptable percent correct score; and the intersection of the highest acceptable fail rate and the lowest acceptable percent correct score (see Figure 7). These two points create a line that intersects the curve that is defined by percent correct score and projected failure rate. The passing score is found by following the dotted line from the intersection to the x-axis (percent correct scores). The fail rate is found by following the dotted line from the intersection to the y-axis (percent fail).

In a modification of the Hofstee method, Beuk (1984) suggested that the panellists report to what extent each of their judgments should be considered in deriving the final passing score. That is, panellists are asked the degree to which their decisions are examinee-oriented or test-oriented. The means and standard deviations of both the passing scores and acceptable pass rates are computed. The mean passing rate and mean passing score are plotted, and the point on the chart where these two values intersect is identified. The compromise consists of using the ratio of the standard deviation of pass rate to the standard deviation of passing score. The point where the distribution of scores intersects the line generated based on this slope constitutes the passing score. De Gruijter (1985) further suggested that an additional question be posed to panellists, that of the level of uncertainty regarding these two judgments. Beuk's and De Gruijter's methods have not been reported in the literature for medical education, but the Hofstee method has been used by a number of researchers.

Schindler et al. (2007) reported on the use of the Hofstee approach to set passing scores for a surgery clerkship. Because the goal was to set a passing score for the clerkship as a whole instead of individual assessments (multiple-choice examinations, OSCEs, clerkship grades, ratings of professionalism), the standard setting panel determined that the use of the Hofstee method was appropriate. The use of multiple, related assessments led the group to conclude that compensatory standards would be set, although a breach in professionalism could result in failing. Panellists reviewed score distributions for all students, as well as those who had failed in previous years, along with scoring rubrics and examination materials, before they responded to the four questions in the Hofstee method. The authors found that there was a high level of agreement amongst the judges, and that the pass rate derived was reasonable when applied to previous clerkship data.

Selecting a standard setting method

With many methods available, it may seem difficult to decide which the "best" method is. When selecting a standard setting method, there are practical considerations to be made. The method should permit judgments that are based on information; processes that permit expert judgment in light of performance data are preferable. The method chosen should be closely aligned with the goal of assessment. The method should require thoughtful effort of those participating in the process, and it should be based on research. Finally, the method should be easy to explain to participants, and easy to implement.

It is important to keep in mind that the standard setting study will generate a recommended passing score and that the score should correspond to a level of performance that meets the purpose for the test and the standard setting process. For example, if the test is used for identification of students who may need additional training or remediation, then passing denotes the group of students ready for the next phase of study, while failing identifies the group who may repeat the course. In this case the level of performance may not be as high as the level that corresponds to competence in independent practice. If the test is used to represent those who are ready to graduate and enter a setting with supervised practice, those who pass the test possess the characteristics associated with readiness to enter supervised practice. The result of passing these tests has different meanings, and the final determination of the passing score will take these differences into account. While it is not possible to identify the best method, the selection should be based on the purpose of the test, as well as practical considerations delineated in this Guide.

Implementing the standard

Since the standard setting study will generate a "recommended" passing score, there are additional issues to be considered before implementing the results of the standard setting process. One important decision to make is whether the passing score will be compensatory or conjunctive. For OSCEs and standardised patient examinations, several stations are typically included. If the assessment is averaged (or summed) across cases, the passing score should be generated in a similar fashion (i.e. averaged or summed across cases). In this example, the standard would be considered compensatory; those who meet or exceed the passing score will pass, and poor performance at one station can be compensated by better performance at another station. Alternately, a passing score could be derived for each case/station, and an additional requirement could be that a set number of cases have to be passed in order to pass the assessment. In this case, the standard would be conjunctive. Because cases often measure both clinical and interpersonal skills, passing scores could be generated for each of these skills, and the requirement to pass would be to meet or exceed the passing score in each skill area. This approach would also be considered conjunctive.

When deciding whether the pass–fail decision will be compensatory or conjunctive, it is important to consider the research done in this area. Performance on different tasks can be quite variable (Traub 1994), and performance on a single case is not likely to be a reliable indicator of an examinee's ability (Linn & Burton 1994). Conjunctive standards based on individual stations will result in higher failure rates, and can result in incorrect decisions due to measurement error (Hambleton & Slater 1997; Ben-David 2000).

While higher failure rates will also result from conjunctive standards based on skill area, it is reasonable to require that the passing score be met for each area without compensation. Ben-David (2000) suggests that consideration of the construct measured by the assessment is essential in making a decision about compensatory and conjunctive standards. The purpose of the assessment and the feedback given regarding the results are important criteria to include in making a decision. For example, it would be very useful to have examinees know that they need to improve their physical examination manoeuvres, but that their history taking and communication skills are adequate. In this case, it would be reasonable to set separate passing scores based on the skills measured.

Another consideration is the format of reporting the results of the examination to test-takers and other stakeholders. If the OSCE is administered as an end-of-year assessment, students who fail (and their instructors) may want to know about areas of strength and weakness, so that they can concentrate their efforts on skill improvement. Even students who pass may want to know whether there were any areas in which they could improve. Providing feedback is important, particularly for failing examinees (Livingston & Zieky 1982; American Educational Research Association et al. 1999).

Finally, consideration of the percentage of examinees passing is essential. Understanding the consequences of the decisions generated is vital to ensuring that decision makers comprehend and endorse the process. It is not likely to be feasible to generate the recommended passing score during the standard setting meeting, so a meeting with stakeholders (e.g. faculty members, heads of departments) should be conducted to inform them of the results of the study and to present the implications (i.e. the number passed).

Maintaining the standard

Once the meetings have been conducted and the passing score has been generated and endorsed, it is time to consider how the passing score will be generated for the next testing cycle. Because the performance of examinees and the difficulty of the test can change from administration to administration, the same passing score may not have the same effect over time. If test materials are revised, it is essential to conduct the standard setting meeting once again. Even if the test materials are not changed, it is important to monitor the performance of examinees and the difficulty of the test, as well as the consequences of implementing the passing score (i.e. changes in passing rates). If the test becomes easier (i.e. examinees obtain higher scores) and the passing score remains the same, the passing rate is likely to increase. Conversely, if the test becomes more difficult, the passing rate is likely to decrease. Revisiting the definition of the standard, as well as the passing score, in light of changes associated with the test on a regular basis is advised. Monitoring test performance is essential if the test is used for determining examinee qualifications, whether that means going on to the next level of training or entering independent practice.
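Routine monitoring of this kind can be automated. A small sketch with invented numbers that flags administrations whose pass rate drifts from a baseline under a fixed passing score:

```python
# Track the pass rate across administrations under a fixed passing score;
# a large drift suggests revisiting the standard (or the test's difficulty).
passing_score = 63
administrations = {
    "Year 1": [58, 61, 64, 66, 70, 72, 55, 68, 74, 62],
    "Year 2": [60, 63, 67, 69, 73, 75, 59, 71, 77, 66],
}

baseline = None
for label, scores in administrations.items():
    rate = 100.0 * sum(s >= passing_score for s in scores) / len(scores)
    baseline = rate if baseline is None else baseline
    flag = "  <- review the standard" if abs(rate - baseline) > 10 else ""
    print(f"{label}: pass rate {rate:.0f}%{flag}")
```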
Conclusions

Although there is extensive research on standard setting with both multiple-choice and performance-based tests, there is no "right" passing score, and no "best" method. Different methods yield different results. Selecting a method depends on the purpose of the examination and the resources available for the standard setting effort. The methods presented, the guidelines provided, and the examples given are meant to inform decisions regarding selection of a method, preparation for a standard setting meeting, conducting the meeting and analysing the data obtained, and implementing and maintaining the standard.

Notes on contributors

DANETTE W. MCKINLEY, PhD, is the Director, Research and Data Resources, FAIMER. Dr McKinley determines research priorities, defines scope, and proposes methodology for studies focused on understanding and promoting international medical education. She supports research activities related to the certification of graduates of international medical programs. Her interests include educational research methodology and assessment, particularly for licensure or certification. With more than 20 years of experience in licensure and certification testing, she now concentrates her efforts on the development of research programs on international medical education and the migration of health care workers.

JOHN J. NORCINI, PhD, is President and Chief Executive Officer, FAIMER. Dr Norcini became FAIMER's first President and Chief Executive Officer in May 2002. Before joining FAIMER, Dr Norcini spent 25 years with the American Board of Internal Medicine, serving in various capacities, including Director of Psychometrics, Executive Vice President for Evaluation and Research, and finally, Executive Vice President of the Institute for Clinical Evaluation. Dr Norcini's principal academic interest is the assessment of physician performance. Current major research interests include methods for setting standards, assessing practice performance, and testing professional competence. His research also focuses on physician migration and workforce issues, as well as the impact of international medical graduates on the U.S. health care system.

Declaration of interest: The authors report no conflicts of interest. The authors alone are responsible for the content and writing of this article.

References

American Educational Research Association, American Psychological Association, National Council on Measurement in Education. 1999. Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
Angoff WH. 1971. Scales, norms, and equivalent scores. In: Thorndike RL, editor. Educational measurement. Washington, DC: American Council on Education. pp 508–600.
Bandaranayake RC. 2008. Setting and maintaining standards in multiple choice examinations: AMEE Guide No. 37. Med Teach 30:836–845.
Ben-David MF. 2000. AMEE Guide No. 18: Standard setting in student assessment. Med Teach 22:120–130.
Beuk CH. 1984. A method for reaching a compromise between absolute and relative standards in examinations. J Educ Measure 21:147–152.
Boulet JR, De Champlain AF, McKinley DW. 2003. Setting defensible performance standards on OSCEs and standardized patient examinations. Med Teach 25:245–249.
Burrows PJ, Bingham L, Brailovsky CA. 1999. A modified contrasting groups method used for setting the passmark in a small scale standardised patient examination. Adv Health Sci Educ Theory Pract 4:145–154.

Clauser BE, Clyman SG. 1994. A contrasting-groups approach to standard setting for performance assessments of clinical skills. Acad Med 69:S42–S44.
Cohen-Schotanus J, van der Vleuten CPM. 2010. A standard setting method with the best performing students as point of reference: Practical and affordable. Med Teach 32:154–160.
De Gruijter DNM. 1985. Compromise models for establishing examination standards. J Educ Measure 22:263–269.
Dijkstra J, Van der Vleuten CPM, Schuwirth LWT. 2010. A new framework for designing programmes of assessment. Adv Health Sci Educ Theory Pract 15:379–393.
Downing SM, Lieska NG, Raible MD. 2003. Establishing passing standards for classroom achievement tests in medical education: A comparative study of four methods. Acad Med 78:S85–S87.
Downing SM, Tekian A, Yudkowsky R. 2006. Procedures for establishing defensible absolute passing scores on performance examinations in health professions education. Teach Learn Med 18:50–57.
Ebel R. 1972. Essentials of educational measurement. 2nd ed. Englewood Cliffs, NJ: Prentice-Hall.
Frank JR, Snell LS, Cate OT, Holmboe ES, Carraccio C, Swing SR, Harris P, Glasgow NJ, Campbell C, Dath D, et al. 2010. Competency-based medical education: Theory to practice. Med Teach 32:638–645.
Geisinger KF. 1991. Using standard-setting data to establish cutoff scores. Educ Measure: Issu Pract 10:17–22.
Geisinger KF, McCormick CM. 2010. Adopting cut scores: Post-standard-setting panel considerations for decision makers. Educ Measure: Issu Pract 29:38–44.
Haladyna T, Hess R. 1999. An evaluation of conjunctive and compensatory standard-setting strategies for test decisions. Educ Assess 6:129–153.
Hambleton RK, Jaeger RM, Plake BS, Mills C. 2000. Setting performance standards on complex educational assessments. Appl Psychol Measur 24:355–366.
Hambleton RK, Slater SC. 1997. Reliability of credentialing examinations and the impact of scoring models and standard-setting policies. Appl Measur Educ 10:19–28.
Hofstee WKB. 1983. The case for compromise in educational selection and grading. In: Anderson SB, Helmick JS, editors. On educational testing. San Francisco, CA: Jossey-Bass. pp 109–127.
Holmboe ES, Sherbino J, Long DM, Swing SR, Frank JR. 2010. The role of assessment in competency-based medical education. Med Teach 32:676–682.
Jaeger RM. 1991. Selection of judges for standard-setting. Educ Measure: Issu Pract 10:3–14.
Kane MT. 1992. The assessment of professional competence. Eval Health Prof 15:163–182.
Kane MT. 1994. Validating interpretive arguments for licensure and certification examinations. Eval Health Prof 17:133–159; discussion 236–241.
Kane MT. 2006. Validation. In: Brennan RL, editor. Educational measurement. Westport, CT: American Council on Education and Praeger Publishers. pp 17–64.
Kaufman DM, Mann KV, Muijtjens AMM, van der Vleuten CPM. 2000. A comparison of standard-setting procedures for an OSCE in undergraduate medical education. Acad Med 75:267–271.
Linn RL, Burton E. 1994. Performance-based assessment: Implications of task specificity. Educ Measure: Issu Pract 13:5–8.
Livingston SA, Zieky MJ. 1982. Passing scores: A manual for setting standards of performance on educational and occupational tests. Princeton, NJ: Educational Testing Service.
Miller GE. 1990. The assessment of clinical skills/competence/performance. Acad Med 65:S63–S67.
Nedelsky L. 1954. Absolute grading standards for objective tests. Educ Psychol Measur 14:3–19.
Nestel D, Kneebone R, Black S. 2006. Simulated patients and the development of procedural and operative skills. Med Teach 28:390–391.
Norcini J, Burch V. 2007. Workplace-based assessment as an educational tool: AMEE Guide No. 31. Med Teach 29:855–871.
Norcini JJ. 1994. Principles for setting standards on certifying and licensing examinations. In: Rothman AI, Cohen R, editors. The Sixth Ottawa Conference on Medical Education. Toronto: University of Toronto Bookstore. pp 346–347.
Norcini JJ. 2003. Work based assessment. Br Med J 326:753–755.
Norcini J, McKinley D. 2007. Assessment methods in medical education. Teach Teacher Educ 23:239–250.
Norcini JJ, Stillman PL, Sutnick AI, Regan MB, Haley HL, Williams RG, Friedman M. 1993. Scoring and standard setting with standardized patients. Eval Health Prof 16:322–332.
Patil NG, Saing H, Wong J. 2003. Role of OSCE in evaluation of practical skills. Med Teach 25:271–272.
Pell G, Fuller R, Homer M, Roberts T. 2010. How to measure the quality of the OSCE: A review of metrics – AMEE Guide No. 49. Med Teach 32:802–811.
Raymond MR, Reid J. 2001. Who made thee a judge? Selecting and training participants for standard setting. In: Cizek GJ, editor. Setting performance standards: Concepts, methods, and perspectives. Mahwah, NJ: Lawrence Erlbaum Associates. pp 119–158.
Reznick RK, Blackmore D, Dauphinée WD, Rothman AI, Smee S. 1996. Large-scale high-stakes testing with an OSCE: Report from the Medical Council of Canada. Acad Med 71:S19–S21.
Rothman AI, Cohen R. 1996. A comparison of empirically- and rationally-defined standards for clinical skills checklists. Acad Med 71:S1–S3.
Schindler N, Corcoran J, DaRosa D. 2007. Description and impact of using a standard-setting method for determining pass/fail scores in a surgery clerkship. Am J Surg 193:252–257.
Smee SM, Blackmore DE. 2001. Setting standards for an objective structured clinical examination: The borderline group method gains ground on Angoff. Med Educ 35:1009–1010.
Traub RE. 1994. Facing the challenge of multidimensionality in performance assessment. In: Rothman AI, Cohen R, editors. Proceedings of the Sixth Annual Ottawa Conference on Medical Education. Toronto: University of Toronto Bookstore. pp 9–11.
Wood TJ, Humphrey-Murto SM, Norman GR. 2006. Standard setting in a small scale OSCE: A comparison of the modified borderline-group method and the borderline regression method. Adv Health Sci Educ 11:115–122.
