THE BAR EXAMINER, Volume 65, Number 3, August 1996

THE COSTS AND BENEFITS OF PERFORMANCE TESTING ON THE BAR EXAMINATION
by Stephen P. Klein

The National Conference of Bar Examiners will begin offering the Multistate Performance Test in February 1997. This article reviews how performance test (PT) problems differ from typical bar exam essay or multiple-choice questions. I also discuss what we know about the technical quality and other characteristics of PT problems from over fifteen years of research with this type of measure. Taken together, the findings from these studies show that inserting one or two PT problems into a state's bar exam will most likely improve the overall quality of that exam. However, jurisdictions should not rely on PT problems alone to make pass/fail decisions. A combination of multiple-choice (MBE), essay, and PT questions is the best way to go. States also should be careful about how much weight they give the PT in determining an applicant's pass/fail status. This article summarizes the basis for these conclusions and refers to the technical studies that support them.

HOW ARE PT PROBLEMS DIFFERENT?

MBE questions really ask applicants to select the choice that corresponds to the most appropriate [...] for resolving a case. Essay questions normally ask applicants to present a balanced analysis of a case situation, not unlike a micro version of an appeals court decision. Both of these question types usually present all of the relevant facts in one or two paragraphs, unessential or ambiguous information is avoided, and applicants must draw heavily on their knowledge of the law to respond appropriately.
While questions posed in a multiple-choice or essay format assess important legal skills, such as analytical ability, it is unlikely that a newly licensed attorney would be presented with a multiple-choice or essay type task in practice. In contrast, PT problems ask candidates to carry out the kinds of activities lawyers actually perform, such as drafting a brief in support of a motion. PT problems require applicants to read and analyze a wide array of documents with which attorneys normally work, such as statutes, regulations, legal opinions, transcripts of depositions, briefs, police reports, news articles, interview notes, investigator reports, and internal memos. Applicants must sift through several pages of these materials to identify the information that is salient to their assigned task and then prioritize and organize that information in preparing their response.

Unlike the typical MBE or essay question, a PT problem includes documents that contain the specific laws that are needed to respond. In this sense, PT problems are more like an "open book" exam. As in practice, information sources for PT problems (such as witness accounts of events) may be unreliable or biased; facts are sometimes ambiguous, incomplete, or even conflicting; and applicants have to recognize these problems.

In sum, PT problems emphasize the day-to-day practical skills attorneys need to function effectively, such as the ability to integrate facts and the law, to present a coherent and persuasive legal argument, or to plan an appropriate course of action (MacCrate et al., 1992; O'Hara & Klein, 1981). In comparison to MBE and essay questions, PT tasks place far more weight on these kinds of skills than they do on specific legal knowledge. Legal analysis and reasoning do, of course,
play important roles in responding to MBE and essay questions. Thus, the major underlying distinctions among these different types of test formats are the emphasis they place on legal knowledge versus other job-related skills and how closely they simulate the actual practice experience.

SUMMARY OF RESEARCH ON PT TASKS

Alaska and California have included PT problems as a regular part of their bar exams for over ten years. Colorado, Georgia, Hawaii, New Mexico, Puerto Rico, and Virginia also have used this form of testing on an operational or experimental basis. The experiences in these jurisdictions provide important insights into the technical quality and other features of PT tasks. These characteristics are discussed below in terms of how they relate to five major indicators of test quality, namely: inter-reader consistency, score reliability, validity, fairness, and cost effectiveness.

Inter-reader Consistency

Inter-reader consistency refers to the extent to which different qualified readers assign the same score to a given response. Highly consistent readers would rank the quality of different answers the same way, and one reader's mean score would not be substantially higher than some other reader's mean on a set of answers they grade in common. If there is little or no agreement among readers in their judgments about the relative quality of the answers,
then the scores they assign simply reflect each reader's idiosyncratic views regarding answer quality and the test should not be used for making pass/fail decisions about an applicant. Reasonably high inter-reader agreement is therefore an essential ingredient of test quality.

The major finding regarding this criterion is that experienced PT readers are just as consistent with each other as are experienced essay readers (Klein, 1991, 1994). With both types of measures, there is an adequate degree of agreement among readers provided they receive appropriate training and supervision in the use of the scoring guidelines for their question. For example, inter-reader agreement on a PT task was much higher in Alaska (where the graders are accustomed to evaluating PT answers and there is extensive reader training) than it was on this same task in jurisdictions which were using PT problems for the first time. The factors that contribute to high inter-reader agreement on essay questions (Klein, 1996) also apply to PT readers. However, if PT answers are longer than the typical essay answer, training also may take longer and there may be a greater advantage in using an analytic or semi-analytic grading scale.

Score Reliability and Weighting the PT Questions

A test normally consists of questions that are sampled from some theoretical universe of questions that could have been asked. Because of time and costs, we cannot ask all the questions in the universe. That is why we rely on a sample. Nevertheless, we want to use the results in the sample to make an inference about how well an applicant would have performed if that applicant had answered all the questions in the universe. Score reliability indicates the confidence we can place in this inference. The higher the reliability, the more certain we are that the results with the sample of questions that were asked are indicative of how well the applicants would have performed if they had taken all the questions that could have been asked.
Thus, score reliability is analogous to the margin of error in an exit poll where we estimate how all voters voted on election day based on how a small sample of them voted. The higher the score reliability, the smaller the margin of error in making an inference from the sample (i.e., the test results) to the universe.

Another way of thinking about score reliability is that it indicates the consistency with which different versions of the bar exam would make the same pass/fail decision about an applicant. In this context, reliability refers to the probability that a given applicant's pass/fail status would remain the same regardless of which set of MBE and essay questions were asked; e.g., the July 1997 version of the test instead of the February 1997 version.

Several factors influence score reliability and thereby the consistency of pass/fail decisions (Klein, 1993). One of the most important of these factors is the number of questions asked. All other things being equal, the larger the number of questions (i.e., the larger the sample), the higher the reliability. However, this is a case of diminishing returns. Going from three to four questions produces a greater increase in reliability than does going from four to five questions, but the five-question test produces a more reliable score than the four-question test.
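The diminishing returns described above can be illustrated with the Spearman-Brown formula from classical test theory. The article does not name this formula or report a per-question reliability, so the choice of formula, the assumption that all questions are equally reliable, and the value 0.20 used below are illustrative assumptions only.

```python
def spearman_brown(single_question_reliability: float, n_questions: int) -> float:
    """Projected reliability of a test made of n equally reliable (parallel) questions."""
    r = single_question_reliability
    return n_questions * r / (1 + (n_questions - 1) * r)


# Hypothetical reliability of a single essay-type question (not a figure from the article).
ASSUMED_SINGLE_QUESTION_RELIABILITY = 0.20

previous = 0.0
for n in range(1, 9):
    current = spearman_brown(ASSUMED_SINGLE_QUESTION_RELIABILITY, n)
    # Each added question raises reliability, but by a smaller amount than the one before.
    print(f"{n} questions: reliability {current:.3f} (gain {current - previous:+.3f})")
    previous = current
```

Under these assumptions every added question raises the projected reliability, and each gain is smaller than the one before it, which is the pattern the text describes for three, four, and five questions.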
Consequently, in most jurisdictions, adding one or two PT tasks to the state's existing essay test will produce a small but noticeable improvement in reliability.

Adding PT questions to a state's bar exam will normally improve score reliability provided each PT task carries no more than twice as much weight as a standard essay question in determining an applicant's total written (essay + PT) score. This holds even if a PT task takes three or more times longer to answer than a state's typical essay question. This is an especially important rule of thumb if the written section has fewer than six to eight questions [...] weight than the MBE in determining an applicant's pass/fail status, and/or if less than 85 percent of the applicants pass the exam (score reliability usually is not a major concern when there is a very high passing rate).

Some of the states that are considering including PT questions on their exam are unable to lengthen the total amount of testing time. Thus, in these jurisdictions, adding one or more PT questions will require eliminating one or more essay questions. My statistical modeling of these trade-offs suggests that in most jurisdictions, replacing two or even three essay questions with one PT problem will have little effect on the overall consistency with which pass/fail decisions are made. Again, the size and direction of this effect will depend on several factors, including the number of essay questions that remain, the average correlation among these questions, the correlation between PT and regular essay questions, the correlation between the MBE and written portion of the exam, the relative weights of these sections in determining an applicant's total bar exam score, and the passing rate.
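The reasoning behind the weighting rule of thumb can be sketched with a toy model of the written score alone. This is not the statistical modeling the author refers to. It assumes, purely for illustration, that every written question (essay or PT) is standardized, equally reliable, and correlated with every other question at one common value `r`; the function name, the weights, and `r = 0.20` are all hypothetical. It also ignores the MBE, the passing rate, and the other factors listed above.

```python
def composite_reliability(weights, r):
    """Reliability of a weighted sum of standardized questions that share one
    common inter-question correlation r (a simplifying assumption)."""
    total_weight = sum(weights)
    sum_of_squared_weights = sum(w * w for w in weights)
    true_variance = r * total_weight ** 2             # variance from the shared ability
    error_variance = (1 - r) * sum_of_squared_weights  # question-specific variance
    return true_variance / (true_variance + error_variance)


ASSUMED_R = 0.20            # hypothetical common correlation among written questions
six_essays = [1.0] * 6      # six equally weighted essay questions

baseline = composite_reliability(six_essays, ASSUMED_R)
print(f"six essays only: {baseline:.3f}")

# Add one PT task at increasing weights relative to a single essay question.
for pt_weight in (1, 2, 3, 4):
    with_pt = composite_reliability(six_essays + [float(pt_weight)], ASSUMED_R)
    print(f"six essays + one PT at weight {pt_weight}: {with_pt:.3f}")
```

In this toy model, adding one PT task weighted once or twice as heavily as an essay raises the reliability of the written score above the six-essay baseline, while weights of three or four pull it below the baseline, because a single heavily weighted question lets its question-specific error dominate. Different assumed values would move the crossover point.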
All of these factors make a difference.

Validity

If score reliability were the only criterion of test quality that had to be satisfied, the bar exam would consist entirely of multiple-choice questions (as noted below, they produce more reliable scores per hour of testing and they cost less to achieve a given level of reliability than any other type of question, which is why they are so prominent in large-scale testing programs). However, there are many critical, job-relevant skills that cannot be tested with multiple-choice questions, such as the ability to identify issues in a case and express ideas in writing. That is why all bar exams include essay questions.

Similarly, there are many skills lawyers use in practice that cannot be measured or measured well with either multiple-choice or essay questions, but can be assessed with PTs. For example, a twelve-member "content validity" panel of attorneys found that four prototype PTs developed by NCBE did a good job of assessing certain skills such as extracting rules of law from authority, recognizing the precise points of law at issue, recognizing the specific facts necessary to resolve legal questions, and applying legal rules to demonstrate how they determine the result sought; "fact analysis" skills such as identifying the relevant facts and breaking down legal rules into components and connecting them to the facts; and "problem solving" skills such as identifying factual and legal obstacles and solutions to a client's objectives and priorities. In the judgment of this panel, the PTs also did a better job in assessing some of these skills than did the typical essay question (ACT, 1994).

There have been several separate independent surveys of applicant opinions regarding PTs.
All of these surveys found that applicants judge PTs to be a significantly better measure of their ability to perform as an attorney than either multiple-choice or essay testing.

The applicants' opinions appear to be confirmed by empirical data. Analyses of California, Georgia, and Virginia data show that attorneys with four or more years of practice experience score higher on the PT section than would be expected on the basis of their scores on the rest of the exam (Klein, [...]). In short, after holding other factors constant, attorneys with practical experience do better on the PT.

PTs also appear to be sensitive to the effects of legal education. For example, one study (Klein, 1988) found that students who were just entering one of four well-known California law schools (the "novices") earned much lower PT scores than comparable graduates from these same schools. In fact, none of the novices earned a passing score on any PT problem, nor did any novice score higher than any graduate from their school.

Candidates who earn relatively high PT scores also tend to earn relatively high MBE and essay scores. After controlling for the differences in reliability between MBE and essay scores, PT scores are more closely associated with the essay section than with the MBE. In other words, PT problems are more aligned with that portion of the exam which virtually all bar examiners consider to be the most important indicator of applicant ability. However, even with the controls for reliability, the correlation between essay and PT scores is far from perfect. Moreover, the correlation between two PT problems is generally higher than their correlations with a typical essay question.
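One common way to control for differences in reliability when comparing correlations is the classical correction for attenuation, sketched below. The article does not say which adjustment was used, so treating it as this correction is an assumption, and the correlation and reliability values are hypothetical.

```python
import math


def disattenuated_correlation(observed_r: float, reliability_x: float, reliability_y: float) -> float:
    """Estimate the correlation two measures would show if both were perfectly
    reliable (classical correction for attenuation)."""
    if not (0 < reliability_x <= 1 and 0 < reliability_y <= 1):
        raise ValueError("reliabilities must be in the interval (0, 1]")
    return observed_r / math.sqrt(reliability_x * reliability_y)


# Hypothetical values: observed essay-PT correlation and the reliability of each section.
observed = 0.45
essay_reliability = 0.70
pt_reliability = 0.60

corrected = disattenuated_correlation(observed, essay_reliability, pt_reliability)
print(f"observed {observed:.2f} -> corrected {corrected:.2f}")
```

With these made-up inputs the corrected value is higher than the observed one but still well below 1.0, which is the sense in which a correlation can remain "far from perfect" even after the adjustment.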
In short, the applicants who have the abilities needed to do well on the essay section usually have the skills that are needed for success on the PT, but it is evident that a PT task is not just another essay question. This is analogous to the relationship between MBE and essay scores: applicants who receive high MBE scores also tend to receive high essay scores, but the correlation is far from perfect.

Fairness

It appears that taking one PT problem enables applicants to do better on the next one. In the ACT study, for example, the applicants who took PT A and then B did relatively better on B whereas those who took B then A did relatively better on A. These practice effects probably stemmed from applicants learning how to budget their time better.

Applicants taking PT type problems for the first time often say that they cannot finish in the time allotted. However, applicants frequently say the same thing about the MBE and essay sections. In all likelihood, these concerns will dissipate as PT problems become an operational part of the exam. Applicants will certainly have ample opportunity to practice on previously used problems; all PT problems are generally released following the exam.

PT problems are designed to assess skills rather than content knowledge. Consequently, the tasks deal with topics that either all applicants should know about or topics that few if any applicants know about (such as maritime shipping rules).
However, with both types of tasks, applicants are given copies of all the applicable (albeit fictional) statutes, regulations, cases, etc., that are relevant to the case. Thus, specific content knowledge of the topics covered should not play a significant role, although it is conceivable that someone who is familiar with the terminology in a specialized field could have an advantage if a case situation involved that field.

Multiple-choice and essay questions are not immune to this same concern. Indeed, scores on these tests are just as likely as PT scores to be sensitive to some applicants having an advantage on a particular question. For example, an applicant may have assisted on a case that had a fact pattern that was very similar to the one in an essay or MBE question. Test developers try to minimize this problem by asking several questions, so that no one question carries an excessive amount of weight in determining an applicant's pass/fail status. In short, the more questions that are asked, the less likely an applicant's total score, and thereby pass/fail status, will be influenced by specialized knowledge on a single question. That is why score reliability is sensitive to the number of questions asked and why I recommend not giving a single PT question more than twice as much weight as a single essay question.

A related consideration is what applicants should do to prepare for the PT. Unlike torts or contracts, there is no law school course called "Performance Testing." There are clinical courses that should help, and the presence of PT problems on the exam may encourage more professors to include such tasks in their classroom activities in much the same way Harvard Business School asks students to resolve case problems.

It also should be noted that although multiple-choice and essay questions are currently characterized by content area, these labels bear no relationship to the questions' statistical properties.
For example, two torts multiple-choice questions generally correlate no higher with each other than they do with any other question (Linn, 1992). The same is true on the essay. These findings mean that specific content knowledge is not driving the differences in scores among applicants. While it is certainly true that content knowledge is required to do well on the MBE and essay sections, cross-cutting abilities (such as legal reasoning) appear to play the major role in determining who passes and fails. These reasoning skills are needed on all three sections of the exam (MBE, essay, and PT), which may help to explain the relatively high correlations among these sections.

Some PT problems may be harder than others and/or the responses to one PT problem may, on the average, be graded more leniently than the responses to some other problem. Thus, if no adjustment is made for this, it may be easier to pass one exam than another simply because of differences in the difficulty of the particular set of PT problems that happen to go into each of them. The same is true, of course, of the essay section, but it is not the case with the MBE. Raw MBE scores (i.e., the number of questions answered correctly) are adjusted ("equated") for possible differences in average question difficulty across different administrations of the exam. Consequently, most jurisdictions now scale their essay scores to the distribution of MBE scores in their state (Klein, 1995). The same thing can be done with the PT section. The simplest method involves combining the essay and PT scores into a total written score that is then scaled to the MBE.

Under these conditions, the inclusion of PT problems on the exam probably will have no effect on the overall passing rate or on differences in passing rates between gender and racial/ethnic groups.
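A minimal sketch of scaling a total written score to the MBE follows. The article does not spell out the procedure, so the sketch assumes a linear transformation that gives the written scores the same mean and standard deviation as the MBE scores for the same group of applicants. The function name and all score values are hypothetical.

```python
from statistics import mean, pstdev


def scale_to_mbe(written_scores, mbe_scores):
    """Linearly rescale written scores so their mean and standard deviation
    match those of the MBE scores earned by the same applicants."""
    written_mean, written_sd = mean(written_scores), pstdev(written_scores)
    mbe_mean, mbe_sd = mean(mbe_scores), pstdev(mbe_scores)
    if written_sd == 0:
        raise ValueError("written scores have no spread, so they cannot be scaled")
    return [mbe_mean + mbe_sd * (w - written_mean) / written_sd for w in written_scores]


# Hypothetical total written (essay + PT) scores and MBE scaled scores for five applicants.
written = [52.0, 61.5, 58.0, 70.0, 66.5]
mbe = [128.0, 140.0, 135.0, 152.0, 147.0]

scaled_written = scale_to_mbe(written, mbe)
for raw, scaled in zip(written, scaled_written):
    print(f"raw written {raw:5.1f} -> scaled written {scaled:6.1f}")
```

Because the transformation is linear, applicants keep their rank order on the written section; only the scale changes, so a harder or more leniently graded set of written questions no longer shifts the score distribution from one administration to the next.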
In general, men tend to score higher than women on the MBE while the reverse is true on the essay and PT sections. Adding a PT section would therefore benefit women only if this action also resulted in reducing the weight given to the MBE in determining an applicant's total score and pass/fail status. Including or not including a PT section (or changing the weight given to the MBE) would have little if any effect on the differences in passing rates among racial/ethnic groups because all three sections (MBE, essay, and PT) result in about the same sized differences in average scores among these groups (Klein, 1989). These differences also correspond to the disparities among these groups in law school grade point averages (Klein, 1995).

Some applicants (regardless of their group affiliation) tend to relate better to multiple-choice tests while others prefer essay or PT exams. For example, some applicants write faster or neater than others and that may result in their doing better on the essay than on the MBE even though writing speed or neatness is not related to success on the job (what you write is far more important). Thus, the more ways we can test, the fairer the examination process because there is less likelihood that extraneous features of the test format will influence an applicant's pass/fail status. For that reason, there is a real advantage to using several assessment methods. Including one or more PT problems is an important contribution to that end.

Cost Effectiveness and Benefits

To an economist, a "cost effectiveness" analysis involves determining the costs of different methods to achieve the same end or, for a given cost, determining which method provides the best result (such as the most reliable test per dollar spent). A "cost benefit" analysis, on the other hand, recognizes that all methods may not be able to achieve the same ends no matter how much we spend on them (e.g., multiple-choice tests cannot measure issue spotting).
Consequently, the determination of which method is the most "beneficial" involves judgments regarding the relative value of different outcomes (such as score reliability, validity, fairness, efficiency, etc.).

In the context of the bar exam, some costs are out-of-pocket expenses (such as the fee paid for the MBE) while others are "opportunity costs," such as the value of the time board members donate to writing essay questions and grading the responses; i.e., time they could have spent doing something else. Cost effectiveness studies usually assign market values to donated services (e.g., what other states actually pay readers to grade answers). Testing time also is a valuable resource. Thus, a fair comparison among methods has to hold testing time constant (which also effectively controls the costs of test administration, proctors, space, etc.).

Table 1 shows the prorated cost of three hours of testing time for three testing methods and the score reliability that is obtained with each method for a three-hour test. Materials costs are the fees NCBE charges for the MBE, Multistate Essay Examination (MEE), and Multistate Performance Test (MPT); MBE scoring is included in its materials cost. Essay and PT scoring costs were estimated at $2.00 and $2.50 per answer, respectively.

TABLE 1
Estimated Costs and Score Reliability for Three Hours of Testing Time with Three Different Assessment Methods

                     Estimated Cost per Applicant*
Testing Method       Materials    Scoring    Total    Score Reliability
Multiple-choice      20.00        [...]      [...]    [...]
Essay                [...]        [...]      [...]    [...]
Performance Test     [...]        [...]      [...]    [...]

* Excluding expenses for data entry and similar services.

The reliability for the MBE is based on 100 questions. The reliability for the essay and PT sections is based on empirical data from several jurisdictions with 30-minute essay questions and 90-minute PT problems, respectively. Actual costs and reliability may vary across jurisdictions.

Table 1 shows that the MBE is by far the best deal if the only goal is to obtain a high level of score reliability. Because of differences in scoring charges,
a set of six [...] PT problems, but the essay produces a higher level of reliability. We would have to almost double the testing time for the PT to bring its reliability up to the same level.

However, score reliability is not the only measure of test quality. Validity and fairness also must be considered. Few attorneys would rely solely on the MBE to make pass/fail decisions even though it is the least expensive way to achieve a given level of score reliability. Bar examiners recognize there are important skills that cannot be tested with the MBE just as there are important abilities that cannot be assessed with the traditional essay question. One type of testing simply cannot replace another if they assess somewhat different abilities. Moreover, although the applicants who do well on one type of test also tend to do well on another, this relationship is far from perfect (even after adjusting for the reliability of the measures). Including the PT on the exam may encourage more students to develop their practical skills. In this way, it may improve the applicants' overall level of proficiency even if it does not affect their relative standings. That is why bar examiners should not rely solely on the method that costs the least to obtain a given level of score reliability. They also must consider the unique benefits that are derived from each method.

CONCLUSIONS AND IMPLICATIONS

There is now ample empirical and judgmental evidence to support including PT problems on the bar exam. They are more job relevant than the typical multiple-choice or essay question.
Thus, their use facilitates responding to challenges to the bar exam process. Nevertheless, boards of bar examiners must be sensitive to realistic budgetary, operational, legal, and political constraints. Testing programs must strike a balance among costs, testing time, and technical quality (including score reliability, validity, and fairness). In this context, a jurisdiction that is limited to a two-day test with six hours of testing time per day might well consider using the MBE, six 30-minute essay questions, and two 90-minute PT problems (with each PT problem given twice as much weight as a single essay question and the MBE and written sections carrying equal weight in determining an applicant's total score). If more testing time were available, then the board could increase the number of essay and PT questions asked and/or the time allocated to them. However, the particular blend of measures, and the weights attached to them, that would work best for a jurisdiction will depend on several technical factors, such as the reliability of its essay questions [...] (such as the kind of essay questions it has used in [...]

REFERENCES

ACT (1994). Research on the NCBE Performance Test [...]. Report submitted to the NCBE [...].

Klein, S. ([...]). Testing [...] on the California Bar Examination. [...] State Bar of California and the National [...].

Klein, S. ([...]). An analysis of the [...] bar examination [...]. Report prepared for the Committee of Bar Examiners of the State Bar of California and the National Conference of Bar Examiners.

Klein, S. ([...]). [...] bar examinations to performance [...].

Klein, S. ([...]). An analysis of the relationship between clinical skills and bar examination results. Report prepared for the Committee of Bar Examiners of the State Bar of California and the National Conference of Bar Examiners.

Klein, S. ([...]). [...] legal research skills on a bar examination. Paper presented to the American Psychological Association, Anaheim, California.

Klein, S.
An analysis of the performance test on the [...] California Bar Examination. Report prepared for the Committee of Bar Examiners of the State Bar of California.

Klein, S. ([...]). [...] novices take a licensing test. Paper presented at the meetings of the American Educational Research Association [...].

Klein, S. (1989). Does performance testing on the bar examination reduce differences in scores among sex and racial groups? Paper presented at the meetings of the American [...] Research Association.

Klein, S. (1991). Performance testing on the bar examination. Report prepared for the National Conference of Bar Examiners.

Klein, S. ([...]). [...] National Conference [...].

Klein, S. ([...]). Relationships among MBE, essay, and performance test scores. Report prepared for the National Conference of Bar Examiners.

Klein, S. (1995). Options for combining MBE and essay scores. [...].

Klein, S. (1996). Options for assigning essay scores. The Bar Examiner [...].

Linn, R. (1992). An analysis of the subtest structure of the Multistate Bar Examination. Report prepared for the National Conference of Bar Examiners.

MacCrate, R., et al. (1992). Legal education and professional development: An educational continuum. Report [...] on Law Schools and the Profession: Narrowing the Gap. American Bar Association, Section of Legal Education and Admissions to the Bar.

O'Hara, [...], & Klein, S. (1981). Is the bar examination an adequate measure of lawyer competence? The Bar Examiner, [...].

