Using G-theory and Many-facet Rasch measurement in the development of performance assessments of the ESL speaking skills of immigrants

Brian K. Lynch and T.F. McNamara
University of Melbourne

This material may be protected by Copyright law (Title 17 U.S. Code 108)

... were multiply rated and analysed using GENOVA (Crick and Brennan, 1984) and FACETS (Linacre and Wright, 1993). The advantages of these contrasting analytical techniques are considered ...

I Introduction

1 Background

The field of language testing continues to embrace the notion of performance assessment as a means of achieving a close link between the tests and authentic language use.
In doing so, the relationship between test and criterion is thought to strengthen the validity of the inferences that we draw from our test data. However, while we may be able to closely mirror authentic language use in our assessment procedures, and thus make an argument for the validity of inferences on content grounds, we still require additional evidence of reliability and validity.

Address for correspondence: Brian Lynch, Centre for Communication Skills and ESL, University of Melbourne, 151 Barry Street, Carlton, Victoria 3053, Australia.

Language Testing 1998 15 (2) 158-180   0265-5322(98)LT146OA   © 1998 Arnold

Second language performance tests, through the richness of their assessment context, introduce a range of factors which may influence a candidate's chances of success. In particular, researchers have looked at the issue of variability in assessment tasks and rater judgements in performance tests of speaking. This study examines data from a performance test of speaking using two complementary approaches: Generalizability theory (Brennan, 1983; Shavelson and Webb, 1991) and Many-facet Rasch measurement (Linacre, 1989; McNamara, 1996).

2 Generalizability theory

In Generalizability theory, the true score of classical test theory is replaced by the notion of a universe score, thought of as an average score for a person across all the conditions of measurement that we can identify and are interested in. The undifferentiated error of classical theory is replaced with the identification of systematic sources of error: the facets of the measurement procedure, other than the object we are attempting to measure. A set of facets might include, for example: 1) tasks representing language use for academic purposes; and 2) raters who were native or near-native speakers of Spanish with a tertiary academic background, trained in the use of the study's rating scale. A Generalizability study (G-study) is then designed to estimate the relative effects of these facets on test performance data. This estimation is expressed in terms of variance components, obtained from the expected mean squares in an analysis of variance where the main effects are persons (the object of the measurement in G-theory terminology) and the facets.
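The expected-mean-squares step described above can be sketched for the simplest case, a fully crossed persons x raters design with one observation per cell. This is a minimal illustration of the idea only, not a reimplementation of GENOVA (which handles far more general designs); the function name is our own.

```python
# A minimal sketch of G-study variance-component estimation for a fully
# crossed persons x raters design (one observation per cell), using the
# expected-mean-squares method described in the text. Illustrative only;
# GENOVA supports far more general designs.
import numpy as np

def g_study_p_x_r(scores):
    """scores: 2-D array, rows = persons, columns = raters."""
    n_p, n_r = scores.shape
    grand = scores.mean()
    p_means = scores.mean(axis=1)          # person means
    r_means = scores.mean(axis=0)          # rater means
    ss_p = n_r * ((p_means - grand) ** 2).sum()
    ss_r = n_p * ((r_means - grand) ** 2).sum()
    ss_tot = ((scores - grand) ** 2).sum()
    ss_pr = ss_tot - ss_p - ss_r           # residual: p x r interaction + error
    ms_p = ss_p / (n_p - 1)
    ms_r = ss_r / (n_r - 1)
    ms_pr = ss_pr / ((n_p - 1) * (n_r - 1))
    # Solve the expected-mean-squares equations for the variance components,
    # truncating negative estimates at zero as is conventional.
    return {
        "persons": max((ms_p - ms_pr) / n_r, 0.0),
        "raters": max((ms_r - ms_pr) / n_p, 0.0),
        "pr,e": ms_pr,
    }
```

The "pr,e" component is the confounded interaction-plus-error term; with one observation per cell it cannot be separated further.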
Variance components are also estimated for the interactions between persons and the facets, and among the facets themselves.

3 Many-facet Rasch measurement

In Many-facet Rasch measurement, estimates of candidate ability are adjusted for the characteristics of the conditions of measurement. If person A is rated by a lenient rater on a particular task, this will not inflate the estimation of the person's ability; person B, hypothetically of the same ability level as person A but rated by a less lenient rater on a more difficult task, will receive an ability estimate more or less the same as person A's, depending on the standard error. Many-facet Rasch measurement also allows the identification of elements within a facet that are problematic. This may be a rater who is unsystematically inconsistent, a task that is unsystematically difficult across observations, or a person whose responses appear erratic. A further feature, known as bias analysis, examines systematic interactions between combinations of facet elements.

Following earlier work in this area (e.g., Bachman, Lynch and Mason, 1995), we will attempt to demonstrate the role these two approaches can play in test design and development, as well as in more theoretically oriented research on second language performance tests. We also hope to clarify our understanding of the comparative advantages of these relatively recent methodologies as a part of the current debate in educational measurement concerning their relative characteristics.

II Methodology

1 The instrument

The Australian government in 1993 introduced a formal test of communicative skills in English for certain classes of intending immigrants seeking to enter the country. The access: test (Australian Assessment of Communicative English Skills) is administered at a large number of test centres overseas. It was developed in 1992-93 under the overall direction of the National Centre for English Language Teaching and Research at Macquarie University, Sydney, by a team involving experts from universities in three Australian states. There are four test modules, one for each macroskill. A team of researchers at the National Languages and Literacy Institute of Australia Language Testing Research Centre at the University of Melbourne was responsible from 1992 to 1994 for the development of the speaking skills module, and data from trialling of this module are used in this study.

The speaking skills module is offered in two formats, because of operational constraints in the overseas centres.
In the first format, a trained interlocutor engages the candidate in a series of tasks designed to sample oral language across a range of social and interactional contexts; data from this format are used in this study. The interaction is audiotaped and marked in Australia by a team of trained raters. In the second format, the tasks are presented on audiotape in a listening centre or language laboratory, and the candidate records his or her response on to a tape, which is again marked in Australia. The two formats are designed to resemble each other as closely as possible; their claim to equivalence is currently the subject of extensive research (O'Loughlin, 1995; 1997).

2 The data

The data for this study are from a trial conducted prior to the first administration of the test in April 1993, and are from the first format of the test. The performances of 83 candidates were each marked by four raters. For each of seven tasks, raters were required to make assessments on a six-point scale of relevant dimensions of the performance; in a role-play task, separate assessments were made and included in the analysis. There were no missing ratings: ratings existed for all candidates.

The raters were part of a larger group of 13 raters, all qualified and experienced ESL teachers, who had been trained in preparation for the first test administration. Each rater had attended a one-day training session at which independent ratings of sample tapes were made and discussed, and had subsequently rated another sample set. Many-facet Rasch analysis of the data from these raters was then carried out, and raters meeting predetermined standards for consistency were permitted to participate in the marking of the actual administration. The four raters in this study were chosen because they had completed the rating of all candidates.

3 Data analysis

The G-study design used in this investigation was a random effects model with two facets: raters and items. Our universe of admissible observations consisted of trained raters who were native speakers of English, and items sampling the defined characteristics of ESL speaking ability. There were four conditions for the rater facet and 23 conditions for the item facet.
These facets were considered random in that the four raters and 23 items were considered interchangeable with any other set of four raters and 23 items from the universe of admissible observations. All analyses were done with the GENOVA program, version 2.2, for the Macintosh computer (Crick and Brennan, 1984). One complication in our design was the fact that items on the access: test are associated with certain tasks. Because the design requirements for GENOVA were not met, given the unequal numbers of ratings for the various tasks, we were unable to consider the facet of task, which otherwise could have been treated as a fixed effect.

The Many-facet Rasch measurement (Linacre, 1989) analysis was conducted with the computer program FACETS, version 2.62 (Linacre and Wright, 1993). In our analysis we specified persons, raters and items as the facets of interest, with 83, four and 23 elements respectively. The output of the FACETS analysis reports estimates for each person, rater and item, together with associated fit statistics, as well as bias analyses for rater x person interactions, rater x item interactions and person x item interactions.

For an introduction to Many-facet Rasch analysis using FACETS, see McNamara (1996, Chapter 5). FACETS does not have the same design requirements as GENOVA; the same data set was nevertheless used in the interests of consistent comparison with the GENOVA analysis. Both GENOVA and FACETS make the assumption of independence of items from task, which may not in fact be the case; the effect of the violation of this assumption has been studied, using new Rasch-based models which do not make such an assumption, in a recent paper by McNamara and Adams (1996).

III Results

1 The GENOVA analysis

Using the variance components from our G-study, we conducted a series of D-studies to investigate the relative effects of varying the numbers of raters and items on test performance estimates. However, the Raters facet accounts for a substantial amount of the total variance. This means that there is a tendency towards inconsistencies between raters in their judgements: certain raters are more lenient or more severe than others across all persons.
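The Many-facet Rasch (rating-scale) model that underlies the FACETS analysis can be sketched as follows: the log-odds of a rating falling in category k rather than k-1 is modelled as candidate ability minus rater severity minus item difficulty minus the step difficulty for category k. This sketch is ours, with invented parameter values; FACETS estimates all parameters jointly from the full data set.

```python
# A sketch of the Many-facet Rasch rating-scale model: category
# probabilities for one candidate-rater-item combination, all values in
# logits. Parameter values used with this function are illustrative only.
import math

def category_probs(ability, severity, difficulty, steps):
    """Return P(rating = k) for k = 0..len(steps)."""
    adjusted = ability - severity - difficulty
    # Cumulative sums of (adjusted - step_k) give the category log-numerators.
    logits = [0.0]
    for step in steps:
        logits.append(logits[-1] + adjusted - step)
    total = sum(math.exp(x) for x in logits)
    return [math.exp(x) / total for x in logits]
```

Because rater severity enters the model directly, a lenient rater shifts probability mass without changing the candidate's ability estimate, which is the adjustment property described in the Introduction.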
There is also a decision boundary between 'functional' (social) proficiency and 'vocational' proficiency at level 5. However, since decisions are made on the basis of the composite score across modules, we did not have sufficient data to pursue this question in this study.

Table 2 gives the generalizability coefficients (G-coefficients), as well as the Phi coefficients, associated with the various combinations of Raters and Items, beginning with the original G-study sample.

Table 2 Reliability/dependability estimates for the 12 D-studies: for each combination of number of raters and number of items, the G-coefficient (NRT reliability) and the Phi coefficient (CRT dependability).

The G-coefficients are only relevant to the G-study, not the D-study results, since the access: test is primarily concerned with CRT decisions; they are presented for purposes of illustrating the differences between relative and absolute error terms. As discussed in the Introduction, the G-coefficients are parallel to reliability coefficients in classical test theory and represent the degree of accuracy with which we can generalize from test-takers' observed scores to their universe scores. They are calculated with a relative error term, and are thus useful for NRT contexts where relative decisions are the primary concern. The Phi coefficients represent the degree of dependability that exists for an observed score as representing the individual test-taker's domain score. They are calculated with an absolute error term, and are thus appropriate for CRT contexts, where the individual test-taker's standing in relation to a well-defined criterion or domain of ability is the prime concern.

Relative error involves only those variance components that interact with persons; absolute error also includes the variance components for the facets themselves. Thus, when the variance component for one of the facets is large, dependability for decisions about meeting a particular standard (absolute decisions) is reduced. These results suggest that more rater training could be a focus of improvement in the test when decisions about meeting a particular standard are of primary concern.

The dependability estimates for the D-study designs indicate little gain from increasing the number of items beyond the original conditions: the difference between having 16 items and 23 items with four raters, for example, is .866 and .870, respectively.
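The D-study computations behind Table 2 can be sketched for a persons x raters x items random design. The function and the variance-component values in the example are our own illustration, not the study's estimates; only the structure of the relative and absolute error terms follows standard G-theory.

```python
# A sketch of D-study G (relative) and Phi (absolute) coefficients for a
# persons x raters x items random design. Variance-component labels follow
# G-theory convention ("p" = persons, "pr" = person x rater, etc.); the
# values passed in are illustrative, not the study's.

def d_study(vc, n_r, n_i):
    """vc: dict of G-study variance components; n_r, n_i: D-study sizes."""
    # Relative error: only components that interact with persons.
    rel_err = vc["pr"] / n_r + vc["pi"] / n_i + vc["pri,e"] / (n_r * n_i)
    # Absolute error adds the facet main effects and their interaction.
    abs_err = rel_err + vc["r"] / n_r + vc["i"] / n_i + vc["ri"] / (n_r * n_i)
    g_coef = vc["p"] / (vc["p"] + rel_err)      # NRT: relative decisions
    phi_coef = vc["p"] / (vc["p"] + abs_err)    # CRT: absolute decisions
    return g_coef, phi_coef
```

Because absolute error can never be smaller than relative error, Phi is never larger than G, which is why the paper recommends the Phi coefficients for the criterion-referenced access: context.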
They further indicate that having two raters, as opposed to a single rater, produces a clear increase in dependability; the two-rater estimate is .716.

2 The FACETS analysis

b Items: The mean item difficulty was set at zero (the scaling default). The standard deviation of the item difficulties was .63, and the range of difficulty across the items was 2.46 logits, with a maximum of +1.20. An Infit Mean Square value greater than the mean plus twice the standard deviation of the set of Infit Mean Square values would be considered as misfitting; for these data, this meant an Infit Mean Square value greater than 1.2. The inconsistency in the ratings of a misfitting rater cannot be modelled and compensated for by the program, and must be dealt with by further training.

c Raters: The results of the analysis of rater behaviour are presented in Table 3. Wright and Linacre provide a table for converting differences in logit values into differences in expected ratings.

Table 3 Rater characteristics: rater number, severity (logits), error and Infit Mean Square for each of the four raters.

Of the possible rater x candidate interactions, 36% showed significant bias. It must be remembered that interactions between particular raters and candidates are not compensated for in operational marking; this demonstrates the risk of the practice of relying on a single rating in performance test settings. Table 6 shows the distribution of significantly biased ratings across raters.

Table 4 Explanation of bias analysis reports: rater ID; candidate ID; rater severity (logit); candidate ability (logit); predicted score; observed score; discrepancy; error; and the likelihood of this discrepancy occurring by chance (expressed as a z-score).
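The misfit rule just described can be sketched directly: a rater whose Infit Mean Square exceeds the group mean by more than two standard deviations is flagged as unsystematically inconsistent. The function and the rater values in the example are invented for illustration.

```python
# The paper's misfit screen, sketched: flag any rater whose Infit Mean
# Square exceeds the group mean plus two standard deviations. The input
# values used with this function are illustrative, not the study's.
import statistics

def flag_misfits(infit_by_rater):
    """infit_by_rater: dict mapping rater id -> Infit Mean Square."""
    values = list(infit_by_rater.values())
    cutoff = statistics.mean(values) + 2 * statistics.stdev(values)
    return sorted(r for r, v in infit_by_rater.items() if v > cutoff)
```

Note that with a small rater group one extreme value inflates the standard deviation and thus the cutoff, so the screen is deliberately conservative.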
Table 5 Bias analysis: RATER x CANDIDATE interactions, examples of significantly biased interactions (rater ID, candidate, rater severity (logit), candidate ability (logit), predicted score, observed score, discrepancy, error, z-score).

Table 6 Significantly biased ratings, by rater, RATER x CANDIDATE interactions (rater number, number of significantly biased ratings, % of all significantly biased ratings).

b Rater x item bias

A bias analysis was also carried out on the interaction of raters with items. The meaning of the bias considered here is as follows: in each instance of significant bias, the rater involved is responding consistently to the item in a way which differs from other raters, and different from his or her own behaviour on other items. There were 92 possible interactions (four raters x 23 items). Examples of significantly biased interactions are given in Table 7. Forty-four of the rater x item interactions, or 48% of the total, showed significant bias. This is in striking contrast to the result in the GENOVA analysis, which showed a zero Rater x Item interaction variance component. The effect could not be isolated to a single rater (see Table 8), and the items most frequently involved were the three rated worst.

c Candidate x item bias

A bias analysis was also carried out on the interaction of candidates with items. The meaning of the bias is parallel: in each instance of significant bias, the candidate is responding consistently to the item in a way which differs from other candidates, and different from his or her own behaviour on other items. There were 1909 possible interactions (83 candidates x 23 items). The number of candidate x item interactions showing significant bias was 132, or 7% of the total. There was a relatively even distribution of bias terms across items.
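The bias terms reported in Tables 4-8 can be read as standardized discrepancies: the difference between the observed score and the score predicted from the main-effect estimates, divided by its standard error, gives a z-score, with an absolute value of 2 or more serving as the conventional significance screen. The sketch below is our own illustration of that reading; the numbers in the example are invented.

```python
# A sketch of how a FACETS-style bias term is read: the observed-minus-
# predicted discrepancy, standardized by its error, is a z-score, and
# |z| >= 2 is the usual significance screen. Example values are invented.

def bias_z(observed, expected, std_error):
    """Standardized bias term for one rater x candidate (or rater x item) cell."""
    return (observed - expected) / std_error

def is_significant(z, threshold=2.0):
    return abs(z) >= threshold
```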
Table 8 Significantly biased ratings, by rater, RATER x ITEM interactions (rater number, number of significantly biased ratings, % of all significantly biased ratings).

IV Discussion

With GENOVA, the relative severity of raters (or relative difficulty of tasks) is reported as the variance component for that facet. Inconsistencies within raters (or tasks) are indicated by the variance components for interactions between that facet and persons, or that facet and other facets. FACETS reports these two types of information, relative severity and inconsistency, separately, and for individual raters as well. Consistent differences in rater severity are built into the estimates of candidate ability; this incorporation of arguably legitimate differences in perspective among raters into the measurement process has clear advantages in terms of validity (Lumley and McNamara, 1995). The degree of consistency or inconsistency is reported for each rater as a fit statistic. Little can be done about such inconsistency within the measurement process itself: the program cannot model, and hence cannot compensate for, what is essentially random. But inconsistent raters can be flagged, and decisions about further training or exclusion from the rating process can be made as appropriate.

Similarly, for persons and items, reports on individual persons and items are available, together with individual fit statistics for each. Thus specific intervention can be planned at many points.
As a development which may result in G- or yeing able to provide more specific information than is con (1995) is developing puter application which uses a weighted Euclidean distance order to provide a graphical representation of test data tification of unusual performance by persons, items (also see Marcoulides and Drezner, 1996). The two ly iance in the scores. These are even more striking in the results for rater > i of these were revealed by the FACETS sneasurement approaches, used together, allow for balanced and thor- Whereas the rater x item interaction term interpretations of performance assessment data and offer prin- contributed almost no variance (esti Variance component ways of improving performance tests. 003) The teraction term in the GENOVA anal was similarly very sm: ugh FACETS analysis revealed ber of significant it ci) a ie way of reconciling these apparent differences is to reco, Gat the GENOVA and FACETS analyses operate with difering io, els of detail, Using the microscope as an analogy, FACETS tures fe tratnifcation up quite high and reveals every potential blemish vg the measurement surface. GENOVA, on the other nification lower and Acknowledgements {An earlier version of the article was presented at the Language Test- ing Research Colloquium, Center for Applied Linguistics, Wash- jon DC, March 1994, The research for this article was made poss- by a grant from the Australian Commonwealth Department of ration and Ethnic Affairs, administered through the National for English Language Teaching and Research at Macquarie the Test Development Com- the access: project for permission to use data from the test research, We also wish to acknowledge the help of Kieran in and Tom Lumley of the National Languages and Literacy f Australia Language Testing Research Centre at the Uni- of Melbourne and Gillian Wigglesworth of Macquarie Univer- for their assistance in several aspects of the project. 
In addition, the final draft was strengthened by helpful comments from Lyle Bachman and two anonymous Language Testing reviewers.

While the FACETS analysis identified a number of specific person-by-rater and rater-by-item combinations considered to be 'biased', the effect of these combinations was limited in terms of the overall ratings. New software incorporating many-facet analysis (ConQuest: Wu, 1996) is now available. At a more substantive level, GENOVA is useful in providing general, group-level information, and particularly in making overall decisions about test design. FACETS provides more specific information, which can be fed into the test development and improvement process.

References

Australian Department of Immigration, Local Government and Ethnic Affairs: Migrating to Australia: English language assessment. DILGEA.
Bachman, L.F. 1990: Fundamental considerations in language testing. Oxford: Oxford University Press.
Bachman, L.F., Lynch, B.K. and Mason, M. 1995: Investigating variability in tasks and rater judgments in a performance test of foreign language speaking. Language Testing 12, 238-57.
Barnwell, D. 1989: 'Naive' native speakers and judgments of oral proficiency in Spanish. Language Testing 6, 152-63.
... measurement: the state of the art. ... University Press ...
Brennan, R.L. 1983: Elements of generalizability theory. Iowa City, IA: American College Testing Program.
Brennan, R.L. and Kane, M.T.: ... referenced test consis...
Crick, J.E. and Brennan, R.L. 1984: GENOVA: a general purpose analysis of variance system. Iowa City, IA: American College Testing Program.
Linacre, J.M. and Wright, B.D. 1993: A user's guide to FACETS: Rasch measurement computer program. Chicago, IL: MESA Press.
Appendix A Structure and content of oral interaction subtest

Section / Title / Content-materials / Performance aspects
1 Warm-up: general personal questions; unassessed.
2 Description: Q&A to elicit description of a familiar setting (e.g., school or work environment); fluency, grammar, vocabulary.
3 Narration: picture sequence stimulus for retelling of a story; fluency, grammar, grammatical accuracy.
4 Exposition: description, analysis and interpretation of information presented in the form of tables, graphs or diagrams; fluency, grammar.
5 Discussion: Q&A on subjects of general or vocational interest; fluency, grammar, vocabulary.

Appendix B Examples of rating scales used

Fluency
6 Speech is as fluent as, and of a speed similar to, an educated native speaker.
5 Speaks fluently with only occasional hesitation. Speech may be slightly slower than that of a native speaker.
4 Speaks more slowly than native speakers, with some hesitations and ...
3 ...
2 ...
1 Fluency evident only in the most common formulaic phrases.
Oral and witlen tered by trained 1 privat, single-sex suburban high school; the second involved a coeducational ‘2 suburban public school. Results showed overall proficiency groups on native-language and foreign ‘were significant in dstingi ‘connections among foreign-language proficiency and naive- foreign-language apttede, and end-of-year grades are presented. Foreign-language educators speculate that one’s level of native- language skill affects one’s ability to leam a foreign language. For research results which support Carroll's for foreign language is, to some arning abi 278). Through factor-analytic studies, Carroll (1962) finds that language skills — "address for coespondence: Richard Sparks, Deparment of Education, College of Mount St Joseph 5101 Delhi Road, Cincional, OH &5253, USA; ema richard _sparks@mail ms dd “Sn the United States poli echool Is 2 nonprvae school i. the equivalent ofa Briiah state schon! Language Testing 1998 15 (2) 181-218 _—_0265.5322(98).TI4BOA © 1998 Arnold
