2. McNamara (2000) & Hughes (2003): About Validity

Oxford Introductions to Language Study (Series Editor: H. G. Widdowson)
Language Testing, by Tim McNamara
Oxford University Press, 2000. ISBN-13: 978 0 19 437222 0

5 Validity: testing the test

As we have seen, testing is a matter of using data to establish evidence of learning. But evidence does not occur concretely in the natural state, so to speak; it is a matter of abstract inference, a matter of judgement. The question arises as to who makes this judgement, and how we can decide how valid the evidence is. The very terms judgement and evidence suggest a court of law, and one way of making the issues clear is to draw parallels between testing and legal procedures.

In the famous American murder trial of the athlete O. J. Simpson, the jury was asked to determine, on the basis of the evidence presented, whether the police and prosecutor's claim that he had been involved in the murder of his wife and her friend was likely to be true ('beyond reasonable doubt'). The death of his wife and her friend had been witnessed by no one apart from the victims themselves and the killer, so that reconstruction of what actually happened had to be done by inference.
This was initially done by the police investigating the case, who came to the conclusion, on the evidence available to them, that the likely killer was O. J. Simpson. He was thus charged with murder. In the trial, the police procedures and the conclusions they had reached on the basis of the evidence were themselves put to the test. In the event, the jury decided there was enough doubt to acquit Simpson. In criminal procedures such as this, there are thus two stages, each involving the consideration of evidence. First, the police make an investigation, and on the evidence available to them reach the conclusion that a crime has been committed by someone, who is then charged. This conclusion is itself then examined, using an independent procedure (often a trial with a jury).

These two stages are mirrored in language test development and validation. The initial stage is represented by the test itself, in which the evidence of test performance is used to reach a conclusion about the candidate's ability to handle the demands of the criterion situation. (Remember, as we saw in Chapter 1, we are never in a position to observe those subsequent performances directly.) In some tests, matters are sometimes left there; the test procedures are not themselves subject to scrutiny, that is, they are not validated. Where a lot hinges on the determinations made in a language test, for example where it is used to screen for admission to academic or work settings (such tests are sometimes called high stakes tests), measures may similarly be taken to investigate the procedures by which test judgements were reached. This process is known as test validation.

The purpose of validation in language testing is to ensure the defensibility and fairness of interpretations based on test performance. It asks, 'On what basis is it proposed that individuals be admitted to or denied access to the criterion setting being sought? Is this a sufficient or fair basis?' In the case of both legal and assessment settings, the focus of investigation is on the procedures used. If the procedures are faulty, then conclusions about particular individuals are likely to be unsound. The scrutiny of such procedures will involve both reasoning and examination of the facts. In the legal case, the reasoning may involve legal argumentation, and appeals to the common sense, insight, and human understanding of the jury members, as well as careful examination of the evidence. Test validation similarly involves thinking about the logic of the test, particularly its design and its intentions, and also involves looking at empirical evidence (the hard facts) emerging from data from test trials or operational administrations. If no validation procedures are available there is potential for unfairness and injustice. This potential is significant in proportion to what is at stake.

There are certain differences between the two contexts. First, legal cases usually involve an individual accused; test validation looks at the procedures as a whole, for all the candidates affected by them. Secondly, in the case of a crime, the picture being formed in the minds of the police concerns something that has already happened, that is, it is retrospective. This is replicated only in certain kinds of tests, but not in others. We saw in Chapter 1 that we can distinguish tests according to their purpose, and defined one such type of test, an achievement test, as retrospective, giving evidence on what has been achieved.
The inferences from proficiency tests, on the other hand, are predictive or forward looking, as such tests typically precede entry to the criterion setting, as in selection, screening, and certification tests. As we saw in Chapter 1, inferences are made from these tests about how a person is likely to manage the language and communicative demands of the subsequent non-test or criterion situation, for example, listening to lectures (in the role of international student), or communicating with colleagues or clients (in work-related language assessments).

There is also a contrast in the allocation of roles to individuals in the two settings. In the legal setting, the arguments for and against the charge are presented by different individuals, the prosecution and defence lawyers. The persons making the decision (the jury or the judge) are independent of either. The person who has most at stake, the accused, is directly represented. In the test situation, the prosecution, defence, judge, and jury are all the same person: the person responsible for the validation research. Moreover, this is often the test developer, who may be seen as having a vested interest in the test surviving the challenge of validation. Of course, validation research may be presented to a wider audience of other researchers in the form of conference papers or publications in academic journals, in which case it may encounter further challenges; this is the function of the discourse community of language testing researchers. As test validation involves close analysis of test data, it is necessarily technical, and its function too easily misunderstood or discounted, particularly by those funding the test, who may wish to do without the complication and expense of carrying it out. Many public tests with a significant burden of responsibility in important decision making about individuals have been too little validated as a result.

The research carried out to validate test procedures can accompany test development, and is often done by the test developers themselves; that is, it can begin before the test becomes operational. Validation ideally continues through the life of the test, as new questions about its validity arise, usually in the context of language testing research.

In some public discussions of new test procedures, particularly those fulfilling a role in public policy, the term validation is sometimes used rather differently. It refers to the process of negotiating the acceptability of a new assessment procedure to the stakeholders, that is, those most interested in its introduction. For example, if a new testing procedure is being introduced as a matter of government policy, then it may be politically important to ensure its acceptability to educators and administrators. In this case, the scales and frameworks used in such procedures, or even actual sample test materials, may be distributed in draft and become the subject of intense discussion of their content and wording. This process may result in valuable revisions to the materials, but its deeper function is to ensure that nobody is too unhappy with the change; the 'validation' is designed to defuse opposition. This procedure guarantees the face validity of the test (its surface acceptability to those involved in its development or use) but no more.

Threats to test validity

Why are test validation procedures necessary? Why is face validity not enough?
What can threaten the validity of assessments (scores, ratings), that is, their meaningfulness, interpretability, and fairness? Let us look at a number of possible problem areas, to do with test content (what the test contains; see Chapter 3), test method (the way in which the candidate is asked to engage with the materials and tasks in the test, and how these responses will be scored; see also Chapter 3), and test construct (the underlying ability being captured by the test; see Chapter 2).

Test content

The issue here is the extent to which the test content forms a satisfactory basis for the inferences to be made from test performance. We saw in Chapter 3 how content relevance can be established in well designed tests. These procedures are used to establish the relevance of what candidates are asked to do. Imagine that you are working as a flight attendant for an international airline. On certain routes passengers may need assistance in their own language in the course of the flight. The airline has thus decided to give bonuses to flight attendants who can demonstrate a given level of proficiency in the languages most frequently spoken by passengers on that airline. As such assistance rarely involves reading and writing, and is on the whole restricted to a range of predictable topics, it would be unreasonable to test potential employees on communication tasks not found in that setting, or on tasks presented through an inappropriate mode of language use (reading, writing). On the one hand, even if a potential employee could manage such test tasks, it may not be safe to infer that the person concerned can communicate adequately on non-tested oral tasks more relevant to the occupational role. And vice versa: if the person fails the test tasks, he or she may still be fluent orally; this would be so in the case of languages with different alphabets or writing systems, particularly where the person's acquisition of the language has been through informal means. The issues arising in such contexts are issues of what is known as content-related validity or, more traditionally, content validity. The argument for the relevance of test content to the decisions to be made about functioning in the criterion situation has led to the growth of specific purpose language tests, such as the Occupational English Test for health professionals wishing to work in Australia.

Judgements as to the relevance of content are often quite complex, and the validation effort is accordingly elaborate. For example, in a test of ability to read academic texts, does it matter from which academic domain the texts are drawn? Should someone studying law be asked to read texts drawn from fields such as education or medicine? In other contexts, we may want to know whether performance on a general proficiency test can be used to predict performance in particular occupational roles, and vice versa. Sometimes there is pressure from bureaucracies to use tests designed for one purpose to make decisions in a very different context that had not been envisioned by the original test designers. The problem is that the inferences we draw about candidates based on a test designed for one purpose are not necessarily valid for another unrelated purpose, particularly where test content reflects the original test purpose.

Test method and test construct

How are the test-takers required to engage with the test materials?
To what extent are arbitrary features of the test method influencing the inferences we are reaching about candidates? We saw in Chapter 2 the kinds of choices about test method open to test designers. We also saw that the most commonly used methods involve considerable compromise on the authenticity of the test, so that the gap between test performance and performance in the criterion may, on the face of it, appear quite wide. What implications does our choice of test method have for the inferences we make about candidates?

One way of approaching this issue is to ask to what extent the method is properly part of the test construct (the underlying ability or trait being measured by the test), or is irrelevant to it. If the latter is the case (and it often necessarily is), then we need to investigate the impact of test method on scores, because if the impact is large, then it has the potential to obscure our picture of the relevant aspects of candidate abilities. This will involve a programme of research, for example by varying the conditions of performance. Thus, in the case of the note-taking task, we can compare scores obtained from comparable groups of subjects under various conditions of interest and study the resulting impact on scores. We can see whether scores are affected when candidates are allowed unconstrained versus constrained note-taking, are exposed to shorter versus longer chunks of text at any one time, are required to pre-read the questions or not, listen once or more than once to the test materials, and so on.

In the case of speaking and writing, even when test content and methods used to elicit a performance seem reasonable, other aspects of the testing procedure can jeopardize the meaningfulness of test inferences. We saw in Chapter 4, for example, that rating procedures introduce a host of variables into the assessment. Research on ratings is part of the validation required for performance tests of this type. In general, the more complex the context of performance, the more there is to jeopardize the validity of the ratings. This point was well recognized by Lado in the 1950s and 1960s (see Chapter 2), and is what made him so wary of performance assessment.

In general, tests may introduce factors that are irrelevant to the aspect of ability being measured (construct-irrelevant variance); or they may require too little of the candidate (construct under-representation). There may be factors in the test which will cause performances to be affected, or to vary in a way which is not relevant to the information being sought about candidates' abilities. Thus, as we have seen, the knowledge or skill being tested may be embedded in a context which is neither within the candidate's experience nor relevant to the thing being assessed. In an advanced level oral test, candidates may be asked to speak on an abstract topic; however, if the topic does not match their interests or is one about which they may have little knowledge, the performance is likely to appear less impressive than when candidates are speaking about a more familiar topic at an equivalent level of abstraction. In this case, then, a potential problem is that the trait being assessed (ability to discuss an abstract topic in the foreign language) is confounded with the irrelevant requirement of having knowledge of a particular topic.
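By way of illustration, a minimal sketch of the kind of empirical check described above, comparing scores from comparable groups who took the note-taking task under different method conditions, might look like the following. The scores, group labels, and condition names are invented for this example; the chapter does not prescribe any particular analysis, and a real study would use larger samples and an appropriate significance test.

```python
from statistics import mean, stdev

# Hypothetical scores from two comparable groups who heard the same lecture
# extract, but under different test-method conditions.
unconstrained_notes = [14, 17, 15, 16, 13, 18, 15, 16]  # free note-taking allowed
constrained_notes = [12, 14, 11, 15, 13, 12, 14, 13]    # note-taking restricted to a grid

def describe(label, scores):
    """Print the mean, spread, and size of a set of scores."""
    print(f"{label}: mean={mean(scores):.2f}, sd={stdev(scores):.2f}, n={len(scores)}")

describe("Unconstrained note-taking", unconstrained_notes)
describe("Constrained note-taking", constrained_notes)

# A sizeable difference in means between otherwise comparable groups suggests
# that the method factor (how note-taking is constrained), rather than the
# listening ability of interest, is influencing scores.
print(f"Difference in means: {mean(unconstrained_notes) - mean(constrained_notes):.2f}")
```

The point of such a comparison is simply the logic of varying one condition at a time while holding the candidates' ability constant, so that any difference in scores can be attributed to the method factor.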
By contrast, in other cases, the real requirements of the criterion may not be fully represented in the test. Take the case of foreign medical graduates in the UK or Australia, who face practical clinical examinations where they must take case histories from real patients. Examiners frequently complain that the candidates' communicative skills are not up to the task, even though they will have passed a prior test, and, on this measure, seem to have a high degree of language ability. Clearly, something which the clinicians feel is important in communication with patients is missing from the language test.

The impact of tests

In the last decade, a renewed theory of test validation has expanded the scope of validation research to include the changes that may occur as a consequence of a test's introduction. Such changes (for example in the preparation of test candidates) may in turn have an impact on what is being measured by the test, in such a way that the fairness of inferences about candidates is called into question. This area is known as the consequential validity of tests. For example, in a school context, an assessment reform which changes the emphasis from formal tests to ongoing assessment of complex projects and assignments may raise issues of consequential validity if it turns out that students can be coached into performance on the projects, and the opportunities for coaching are differentially available to the students being assessed (for example, because only some families can afford coaching, or because children with more highly educated parents get help from their parents). What appears initially to be a test reform may thus in the end have the unfortunate and obviously unintended effect of reducing our ability to make meaningful distinctions between students in terms of the abilities being measured. To the extent that such consequences can be foreseen, the test developer is bound to anticipate them and investigate their likely effect on the validity of test scores. Concerns about consequential validity are part of a larger area of research on the impact of assessment procedures on teaching and learning, and more broadly on society as a whole. The social context of assessment will be considered in detail in Chapter 7.

Conclusion

In this chapter we have examined the need for questioning the bases for inferences about candidate abilities residing in test procedures, and the way in which these inferences may be at risk from aspects of test design and test method, or lack of clarity in our thinking about what we are measuring. Efforts to establish the validity of tests have generated much of what constitutes the field of language testing research. Such research involves two primary techniques: speculation and empiricism. Speculation here refers to reasoning and logical analysis about the nature of language and language use, and of the nature of performance, of the type that we outlined in Chapter 2. Empiricism means subjecting such theorizing, and the specific implications of particular testing practices, to examination in the light of data from test trials and operational test administrations. Thus, as an outcome of the test development cycle, language testing research involves the formation of hypotheses about the nature of language ability, and putting such hypotheses to the test. In this way, language testing is rescued from being a merely technical activity and constitutes a site for research activity of a fundamental nature in applied linguistics.
Testing for Language Teachers, Second Edition
Arthur Hughes
Cambridge Language Teaching Library, Cambridge University Press
First published 1989; second edition 2003
Validity

We already know from Chapter 2 that a test is said to be valid if it measures accurately what it is intended to measure. We create language tests in order to measure such essentially theoretical constructs as 'reading ability', 'fluency in speaking', 'control of grammar', and so on. For this reason, in recent years the term 'construct validity' has been increasingly used to refer to the general, overarching notion of validity. It is not enough to assert that a test has construct validity; empirical evidence is needed. Such evidence may take several forms, including the subordinate forms of validity, content validity and criterion-related validity. We shall begin by looking at these two forms of evidence in turn, and attempt to show their relevance for the solution of language testing problems. We shall then turn to other forms of evidence.

Content validity

The first form of evidence relates to the content of the test. A test is said to have content validity if its content constitutes a representative sample of the language skills, structures, etc. with which it is meant to be concerned. It is obvious that a grammar test, for instance, must be made up of items relating to the knowledge or control of grammar. But this in itself does not ensure content validity. The test would have content validity only if it included a proper sample of the relevant structures. Just what are the relevant structures will depend, of course, upon the purpose of the test. We would not expect an achievement test for intermediate learners to contain just the same set of structures as one for advanced learners. In order to judge whether or not a test has content validity, we need a specification of the skills or structures, etc. that it is meant to cover. Such a specification should be made at a very early stage in test construction. It isn't to be expected that everything in the specification will always appear in the test; there may simply be too many things for all of them to appear in a single test. But it will provide the test constructor with the basis for making a principled selection of elements for inclusion in the test. A comparison of test specification and test content is the basis for judgements as to content validity. Ideally these judgements should be made by people who are familiar with language teaching and testing but who are not directly concerned with the production of the test in question.

What is the importance of content validity? First, the greater a test's content validity, the more likely it is to be an accurate measure of what it is supposed to measure, i.e. to have construct validity. A test in which major areas identified in the specification are under-represented, or not represented at all, is unlikely to be accurate. Secondly, such a test is likely to have a harmful backwash effect. Areas that are not tested are likely to become areas ignored in teaching and learning. Too often the content of tests is determined by what is easy to test rather than what is important to test. The best safeguard against this is to write full test specifications and to ensure that the test content is a fair reflection of these. For this reason, content validation should be carried out while a test is being developed; it should not wait until the test is already being used. Advice on the writing of specifications is to be found in Chapter 7.
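As a simple illustration of the comparison between test specification and test content described above, the following sketch tallies which specified structures are represented among the items of a draft grammar test and which are missing. The specification entries and item tags are invented for the example; they stand in for whatever skills or structures a real specification would list.

```python
from collections import Counter

# Hypothetical specification: the structures the test is meant to sample.
specification = {
    "present perfect", "past simple", "conditionals",
    "passive voice", "reported speech", "comparatives",
}

# Hypothetical draft test: each item tagged with the structure it targets.
test_items = [
    ("item01", "past simple"), ("item02", "past simple"),
    ("item03", "present perfect"), ("item04", "comparatives"),
    ("item05", "past simple"), ("item06", "present perfect"),
]

covered = {structure for _, structure in test_items}
print("Structures covered:", sorted(covered))
print("Structures not represented:", sorted(specification - covered))

# Heavy imbalance in the number of items per structure may also weaken
# content validity, even when every specified structure appears at least once.
print(Counter(structure for _, structure in test_items))
```

A check of this kind is no substitute for the expert judgement described above; it simply makes under-representation visible at a glance.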
Criterion-related validity

The second form of evidence of a test's construct validity relates to the degree to which results on the test agree with those provided by some independent and highly dependable assessment of the candidate's ability. This independent assessment is thus the criterion measure against which the test is validated.

There are essentially two kinds of criterion-related validity: concurrent validity and predictive validity. Concurrent validity is established when the test and the criterion are administered at about the same time. To exemplify this kind of validation in achievement testing, let us consider a situation where course objectives call for an oral component as part of the final achievement test. The objectives may list a large number of 'functions' which students are expected to perform orally, to test all of which might take 45 minutes for each student. This could well be impractical. Perhaps it is felt that only ten minutes can be devoted to each student for the oral component. The question then arises: can such a ten-minute session give a sufficiently accurate estimate of the student's ability with respect to the functions specified in the course objectives? Is it, in other words, a valid measure? From the point of view of content validity, this will depend on how many of the functions are tested in the component, and how representative they are of the complete set of functions included in the objectives. Every effort should be made when designing the oral component to give it content validity. Once this has been done, however, we can go further. We can attempt to establish the concurrent validity of the component.

To do this, we should choose at random a sample of all the students taking the test. These students would then be subjected to the full 45-minute oral component necessary for coverage of all the functions, using perhaps four scorers to ensure reliable scoring (see next chapter). This would be the criterion test against which the shorter test would be judged. The students' scores on the full test would be compared with the ones they obtained on the ten-minute session, which would have been conducted and scored in the usual way, without knowledge of their performance on the longer version. If the comparison between the two sets of scores reveals a high level of agreement, then the shorter version of the oral component may be considered valid, inasmuch as it gives results similar to those obtained with the longer version. If, on the other hand, the two sets of scores show little agreement, the shorter version cannot be considered valid; it cannot be used as a dependable measure of achievement with respect to the functions specified in the objectives. Of course, if ten minutes really is all that can be spared for each student, then the oral component may be included for the contribution that it makes to the assessment of students' overall achievement and for its backwash effect. But it cannot be regarded as an accurate measure in itself.

References to 'a high level of agreement' and 'little agreement' raise the question of how the level of agreement is measured. There are, in fact, standard procedures for comparing sets of scores in this way, which generate what is called a 'correlation coefficient' (or, when we are considering validity, a 'validity coefficient'), a mathematical measure of similarity. Perfect agreement between two sets of scores will result in a coefficient of 1; total lack of agreement will give a coefficient of zero.
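To make the idea concrete, here is a minimal sketch that calculates a coefficient of this kind for two invented sets of scores, one from the ten-minute oral component and one from the full 45-minute criterion version. It uses the Pearson product-moment formula, one standard way of computing such a coefficient, although the chapter itself does not commit to a particular formula; the scores and the variable names are purely illustrative.

```python
from math import sqrt

# Hypothetical scores for the same ten students on the two versions of the oral test.
short_test = [55, 62, 70, 48, 81, 66, 59, 74, 52, 68]  # ten-minute session
long_test = [58, 60, 75, 50, 85, 70, 55, 78, 49, 72]   # full 45-minute criterion version

def pearson(x, y):
    """Pearson product-moment correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)

r = pearson(short_test, long_test)
print(f"validity coefficient r = {r:.2f}")
# Squaring the coefficient gives the proportion of variation in the criterion
# scores accounted for by the shorter test (as explained in the text below):
# r = 0.7 gives 0.49, i.e. roughly 49 per cent agreement.
print(f"shared variation r^2 = {r ** 2:.2f}")
```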
To get a feel for the meaning of a coefficient between these two extremes, it is best to square it. Let us imagine that a coefficient of 0.7 is calculated between the two oral tests referred to above. Squared, this becomes 0.49. If this is regarded as a proportion of one, and converted to a percentage, we get 49 per cent. On this basis, we can say that the scores on the shorter test predict 49 per cent of the variation in scores on the longer test. In broad terms, there is almost 50 per cent agreement between one set of scores and the other. A coefficient of 0.5 would signify 25 per cent agreement; a coefficient of 0.8 would indicate 64 per cent agreement. It is important to note that a 'level of agreement' of, say, 50 per cent does not mean that 50 per cent of the students would each have equivalent scores on the two versions. We are dealing with an overall measure of agreement that does not refer to the individual scores of students. This explanation of how to interpret validity coefficients is very brief and necessarily rather crude; for a better understanding, the reader is referred to the Further reading section at the end of the chapter.

Whether or not a particular level of agreement is regarded as satisfactory will depend upon the purpose of the test and the importance of the decisions that are made on the basis of it. If, for example, a test of oral ability was to be used as part of the selection procedure for a high-level diplomatic post, then a coefficient of 0.7 might well be regarded as too low for a shorter test to be substituted for a full and thorough test of oral ability. The saving in time would not be worth the risk of appointing someone with insufficient ability in the relevant foreign language. On the other hand, a coefficient of the same size might be perfectly acceptable for a brief interview forming part of a placement test.

It should be said that the criterion for concurrent validation is not necessarily a proven, longer test. A test may be validated against, for example, teachers' assessments of their students, provided that the assessments themselves can be relied on. This would be appropriate where a test was developed that claimed to be measuring something different from all existing tests.

The second kind of criterion-related validity is predictive validity. This concerns the degree to which a test can predict candidates' future performance. An example would be how well a proficiency test could predict a student's ability to cope with a graduate course at a British university. The criterion measure here might be an assessment of the student's English as perceived by his or her supervisor at the university, or it could be the outcome of the course (pass/fail, etc.). The choice of criterion measure raises interesting issues. Should we rely on the subjective and untrained judgements of supervisors? How helpful is it to use final outcome as the criterion measure when so many factors other than ability in English (such as subject knowledge, intelligence, motivation, health and happiness) will have contributed to every outcome? Where outcome is used as the criterion measure, a validity coefficient of around 0.4 (only 20 per cent agreement) is about as high as one can expect.
This is partly because of the other factors, and partly because those students ‘whose English the test predicted would be inadequate are not normally permit predicting problems for those students goes unrecognised”. ‘Asa result, a validity coefficient of this order is generally regarded as satisfactory. The Further reading section at the end of the chapter gives ‘references tothe reports on the validation ofthe British Council's LTS tes (the predecessor of IELTS), in which these issues are discussed at length, ‘Another example of predictive validity would be where an attempt was made to validate @ placement test. Placement tests attempt to predict the most appropriate class for any particular student. Validation would involve an enquiry, once courses were under way, into the proportion of students who were thought to be misplaced. It would then bea matter of comparing the number of misplacements (and their effect (on teaching and learning) with the cost of developing and administering atest that would place students more accurately Content validity, concurrent validity and predictive validity all have a part to play in the development of a test. For instance, in developing an English placement test for language schools, Hughes et al (1996) vali- dated test content against the content of three popular course books used by language schools in Britain, compared students’ performance on the test with their performance on the existing placement tests of a number of language schools, and then examined the success of the test in placing students in clases. Only when this pracess was complete (and ‘minor changes made on the basis of the results obtained) was the test published. Other forms of evidence for construct validity Investigations of a test's content validity and criterion-related validity provide evidence for its overall, or construct validity. However, they are not the only source of evidence. One could imagine a test that was ‘meant to measure reading ability, the specifications for which included reference to a variety of reading sub-skills, including, for example, the ability to guess the meaning of unknown words from the context in which they are met. Content validation of the test might confirm that these sub-skills were well represented in the test. Concurrent validation od to take the course, and so the test's (possible) accuracy in ‘might reveal a strong relationship between students’ performance on the test and their supervisors" assessment of their reading ability. But one 30 Wis Validity would still not be sure that the items in the test were ‘really’ measuring the sub-skills listed in che specifications. “The word ‘construct’ refers to any underlying ability (or trait) that is hypothesised in a theory of language ability. The ability to guess the meaning of unknown words from context, referred to above, would be an example. It is a matter of empirical esearch to establish whether for not such a distinct ability exists, can be measured, and is indeed measured in that est. Without confirming evidence from such research, Fe would not be possible to say that the part of a test that attempted t0 ‘measure that ability has construct validity. If all of the items in a test ‘were meant to measure specified abilities, then, without evidence that they were actually measuring those abilities the consteuct validity ofthe ‘whole test would be in question “The reader may ask at cis point whether such a demanding require- iment for validity is appropriate for practical testing situations. 
It is easy to see the relevance of content validity in developing a test. And if a test has criterion-related validity, whether concurrent or predictive, surely it is doing its job well. But does it matter if we can't demonstrate that parts of the test are measuring exactly what we say they are measuring? I have some sympathy for this view. What is more, I believe that gross, commonsense constructs like 'reading ability' and 'writing ability' are unproblematic. Similarly, the direct measurement of writing ability, for instance, should not cause us too much concern: even without research we can be fairly confident that we are measuring a distinct and meaningful ability (albeit a quite general and not closely defined ability). Once we try to measure such an ability indirectly, however, we can no longer take for granted what we are doing. We need to look to a theory of writing ability for guidance as to the form an indirect test should take, its content and techniques.

Let us imagine that we are indeed planning to construct an indirect test of writing ability that must, for reasons of practicality, be multiple choice. Our theory of writing tells us that underlying writing ability are a number of sub-abilities, such as control of punctuation, sensitivity to demands on style, and so on. We construct items that are meant to measure these sub-abilities and administer them as a pilot test. How do we know that this test really is measuring writing ability? One step we would almost certainly take is to obtain extensive samples of the writing ability of the group to whom the test is first administered, and have these reliably scored. We would then compare scores on the pilot test with the scores given for the samples of writing. If there is a high level of agreement (and a coefficient of the kind described in the previous section can be calculated), then we have evidence that we are measuring writing ability with the test.

So far, however, although we may have developed a satisfactory indirect test of writing, we have not demonstrated the reality of the underlying constructs (control of punctuation, etc.). To do this we might administer a series of specially constructed tests, measuring each of the constructs by a number of different methods. In addition, compositions written by the people who took the tests could be scored separately for performance in relation to the hypothesised constructs (control of punctuation, for example). In this way, for each person, we would obtain a set of scores for each of the constructs. Coefficients could then be calculated between the various measures. If the coefficients between scores on the same construct are consistently higher than those between scores on different constructs, then we have evidence that we are indeed measuring separate and identifiable constructs. This knowledge would be particularly valuable if we wanted to use the test for diagnostic purposes.
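A minimal sketch of the comparison just described might look like the following: each hypothesised construct is measured by two different methods, coefficients are calculated between all pairs of measures, and the same-construct coefficients are compared with the cross-construct ones. The constructs, scores, and methods are invented for the example, and a real study would use far larger samples and a more formal multitrait-multimethod analysis; the correlation function assumed here is the one in Python's standard library (version 3.10 or later).

```python
from itertools import combinations
from statistics import correlation  # available from Python 3.10

# Hypothetical scores: two constructs, each measured by two different methods.
measures = {
    ("punctuation", "multiple-choice"): [12, 15, 9, 14, 11, 16, 10, 13],
    ("punctuation", "composition"):     [11, 16, 8, 13, 12, 15, 9, 14],
    ("style", "multiple-choice"):       [7, 10, 14, 8, 13, 9, 12, 11],
    ("style", "composition"):           [8, 11, 13, 7, 14, 10, 11, 12],
}

# Evidence that the constructs are separate and identifiable: coefficients
# between measures of the same construct should be consistently higher than
# coefficients between measures of different constructs.
for (c1, m1), (c2, m2) in combinations(measures, 2):
    r = correlation(measures[(c1, m1)], measures[(c2, m2)])
    kind = "same construct " if c1 == c2 else "cross-construct"
    print(f"{kind}  {c1}/{m1} vs {c2}/{m2}: r = {r:.2f}")
```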
Another way of obtaining evidence about the construct validity of a test is to investigate what test takers actually do when they respond to an item. Two principal methods are used to gather such information: think-aloud and retrospection. In the think-aloud method, test takers voice their thoughts as they respond to the item. In retrospection, they try to recollect what their thinking was as they responded. In both cases their thoughts are usually tape-recorded, although a questionnaire may be used for the latter. The problem with the think-aloud method is that the very voicing of thoughts may interfere with what would be the natural response to the item. The drawback to retrospection is that thoughts may be misremembered or forgotten. Despite these weaknesses, such research can give valuable insights into how items work (which may be quite different from what the test developer intended).

All test validation is to some degree a research activity. When it goes beyond content and criterion-related validation, theories are put to the test and are confirmed, modified, or abandoned. It is in this way that language testing can be put on a sounder, more scientific footing. But it will not all happen overnight; there is a long way to go. In the meantime, the practical language tester should try to keep abreast of what is known. When in doubt, where it is possible, direct testing of abilities is recommended.

Validity in scoring

It is worth pointing out that if a test is to have validity, not only the items but also the way in which the responses are scored must be valid. It is no use having excellent items if they are scored invalidly. A reading test may call for short written responses. If the scoring of these responses takes into account spelling and grammar, then it is not valid (assuming the reading test is meant to measure reading ability). By measuring more than one ability, it makes the measurement of the one ability in question less accurate. There may be occasions when, because of misspelling or faulty grammar, it is not clear what the test taker intended. In this case, the problem is with the item, not with the scoring. Similarly, if we are interested in measuring speaking or writing ability, it is not enough to elicit speech or writing in a valid fashion. The rating of that speech or writing has to be valid too. For instance, overemphasis on such mechanical features as spelling and punctuation can invalidate the scoring of written work (and so the test of writing).

Face validity

A test is said to have face validity if it looks as if it measures what it is supposed to measure. For example, a test that pretended to measure pronunciation ability but which did not require the test taker to speak (and there have been some) might be thought to lack face validity. This would be true even if the test's construct and criterion-related validity could be demonstrated. Face validity is not a scientific notion and is not seen as providing evidence for construct validity, yet it can be very important. A test which does not have face validity may not be accepted by candidates, teachers, education authorities or employers. It may simply not be used; and if it is used, the candidates' reaction to it may mean that they do not perform on it in a way that truly reflects their ability. Novel techniques, particularly those which provide indirect measures, have to be introduced slowly, with care, and with convincing explanations.

How to make tests more valid

In the development of a high stakes test, which may significantly affect the lives of those who take it, there is an obligation to carry out a full validation exercise before the test becomes operational. In the case of teacher-made tests, full validation is unlikely to be possible. In these circumstances, I would recommend the following:

First, write explicit specifications for the test (see Chapter 7) which take account of all that is known about the constructs that are to be measured. Make sure that you include a representative sample of the content of these in the test.
Second, whenever feasible, use direct testing. If for some reason it is decided that indirect testing is necessary, reference should be made to the research literature to confirm that measurement of the relevant underlying constructs has been demonstrated using the testing techniques that are to be employed (this may often result in disappointment, another reason for favouring direct testing!).

Third, make sure that the scoring of responses relates directly to what is being tested.

Finally, do everything possible to make the test reliable. If a test is not reliable, it cannot be valid. Reliability is dealt with in the next chapter.

Last word

Test developers must make every effort to make their tests as valid as possible. Any published test should supply details of its validation, without which its validity (and suitability) can hardly be judged by a potential purchaser. Tests for which validity information is not available should be treated with caution.

Reader activities

Consider any tests with which you are familiar. Assess each of them in terms of the various kinds of validity that have been presented in this chapter. What empirical evidence is there that the test is valid? If evidence is lacking, how would you set about gathering it?

Further reading

At first sight, validity seems a quite straightforward concept. On closer examination, however, it can seem impossibly complex, with some writers even finding it difficult to separate it from the notion of reliability in some circumstances. In the present chapter, I have tried to present validity in a form which can be grasped by newcomers to the field and which will prove useful in thinking about and developing tests. For those who would like to explore the concept in greater depth, I would recommend: Anastasi and Urbina (1997) for a general discussion of test validity and ways of measuring it; Nitko (2001) for validity in the context of educational measurement; and Messick (1989) for a long, wide-ranging, and detailed chapter on validity which is much cited in the language testing literature. His 1996 paper discusses the relationship between validity and backwash. Bachman and Palmer (1981) was a notable early attempt to introduce construct validation to language testing. A still interesting example of test validation (of the British Council ELTS test), in which a number of important issues are raised, is described and evaluated in Criper and Davies (1988) and Hughes, Porter and Weir (1988). More recent accounts of validation can be found in Wall et al. (1994) and Fulcher (1997). Cohen (1984) describes early use of think-aloud and retrospection. Buck (1991) and Wu (1998) provide more recent examples of the use of introspection. Storey (1997) uses think-aloud. Bradshaw (1990) investigates the face validity of a placement test. Weir et al. (1993) and Weir and Porter (1995) disagree with Alderson (1990a, 1990b) about the evidence for certain reading comprehension skills. Cumming and Berwick (1996) is a collection of papers on validation in language testing. Bachman and Cohen (1998) is a collection of papers concerned with the relationship between second language acquisition and language testing research. For the argument (with which I do not agree) that there is no criterion against which 'communicative' language tests can be validated (in the sense of criterion-related validity), see Morrow (1986).
Bachman's (1990) book, much referred to and influential in the field of language testing, discusses validity and other theoretical issues in depth.

Notes

1. When the term 'construct validity' was first used, it was in the context of psychological tests, particularly of personality tests. There was real concern at that time at the number of such tests which purported to measure psychological constructs without offering evidence that these constructs existed in a measurable form. The demand was therefore that such evidence of these constructs be provided as part of demonstrating a test's validity.

2. Sometimes the size of a correlation coefficient can be misleading, an accident of the particular sample of people taking the test(s). If, for example, there are extreme scores from outstandingly good or outstandingly poor takers of the test(s), the coefficient may be higher than the performance of the group as a whole warrants. See Nitko (2001) for details.

3. Because the full range of ability is not included, the validity coefficient is an underestimate (see previous footnote).

4. However, one may question the validity of the scales used to assess performance in, say, writing. How far do they reflect the development or acquisition of the skills they refer to? This may not be important in proficiency testing, where the scales may be based on levels of skill needed for a particular purpose (a job, for example). In achievement testing, scales that are not consistent with patterns of development may lack validity.
