
A rating scale to assess English speaking proficiency of university students in Pakistan

WAJDAN RAZA Department of English College of Humanities & Sciences, PAF-Karachi Institute of Economics & Technology, Karachi, Pakistan.

Abstract
This paper aims to propose a much-needed rating scale for assessing the English speaking proficiency of students graduating from the 124 universities in Pakistan, keeping in mind the increasing utility of English in the domains of education and profession in the Outer Circle and the singular emphasis recently laid by the Higher Education Commission of Pakistan on improving learners' communicative skills, which in turn has encouraged universities to introduce various compulsory English language proficiency courses. In fact, this study is motivated by the fact that the absence of any standard scale for rating extended-response speaking skills in Pakistan seems to have created both difficulties and discrepancies in assessment. This preliminary work is divided into four phases: studying available rating scales and their practices in and outside Pakistan, examining relevant levels of analytic marking categories, drafting a rating scale with the expert advice of eight qualified ESL teachers, and running the scale on forty-four students with eight raters. To achieve fairness, many-facet Rasch measurement, using FACETS (Linacre, 2006), was employed to analyze three facets: examinees, raters, and rating categories. The results suggest that the rating scale can be a proper indicator of students' English speaking abilities with high reliability; however, further investigation is needed to validate and improve the scale.

Key words: language proficiency, testing, rating scales, FACETS, concentric circles.

Introduction

Pakistan is the third largest South Asian country in Kachru's Outer Circle1, wherein 18 million Pakistanis use English as a second(ary) language, particularly in the domains of education and profession (Kachru, 1989: 85-89; Bolton, 2008: 5-8). The cause of this growing utility of the language is twofold: the use of English as an international language in and outside the country, a phenomenon spreading across the world, and the resentment of regional-language nationalists of all four provinces against the national language, Urdu2 (Rahman, 1999: 98). As a result, English has permeated this multilingual society, and its importance has been acknowledged by teaching it as a compulsory course from class I3. Even then this language is not
1. Kachru's Concentric Circles consist of three circles: the Inner Circle, comprising English-speaking countries; Outer Circle countries, such as Pakistan and Malaysia, in which English has a long history and serves a vital role in domains of power; and Expanding Circle countries, wherein English has no official role, e.g. Germany and France.
2. During the Urdu-Hindi controversy in pre-partition India, Urdu was the mark of Muslim identity. It was officially announced as the national language soon after independence, despite the fact that it was the mother tongue of a minority of the newly independent Pakistan who had migrated from India in the 1940s and 50s. The role of Urdu, however, cannot be denied, as it is a unifying force and a means of communication for speakers of more than 60 regional languages.
3. English used to be taught from class VI in state-owned schools till 2000.

learned successfully in the 12-year schooling, due to unclear government policies that spawn incompetent English language teachers, purposeless syllabus design and material development, and neglect of the teaching skills required for effective communication. Only reading and writing skills are taught and tested in this crucial phase of education (Khan, 2003: 1-7). This is done without any communicative purpose, which affects students' higher education (HE). Much to one's surprise, English is taught as a subject as opposed to a language (Raza, 2006), with Urdu or a regional language used as the means of classroom communication (Khan, 2003: 1-7). Consequently, students who pursue professional courses in HE and seek good employment in the future face communication problems, as English is the means of communication in HE and in good organizations (Mansoor, 2005: 275-277). In university education, they need to speak English and read numerous books and research papers written in English (Khan, 2003: 1-7). Since discussion of all four skills is beyond the scope of this paper, the present study focuses on speaking skills only. Lack of proficiency in English becomes a hindrance when students refrain from expressing their ideas in discussions and opt out as a face-saving technique. This not only shakes their confidence in front of a handful of proficient speakers of English4 but also puts them on the receiving end of a great deal of harmful rote learning at the tertiary level. The immediate need to be catered for is speaking proficiency, which has been missing since schooling.

The Higher Education Commission of Pakistan (HEC), formerly known as the University Grants Commission (UGC), has not only acknowledged but also emphasized the significance of English language proficiency among university students by setting up the National Committee on English (NCE). The NCE, with its six sub-committees, has initiated a project, English Language Teaching Reform (ELTR), with an aim to improve English language teaching and provide all the logistic support needed in the present 124 institutions of higher education (Higher Education Commission of Pakistan, 2007: 99-120). Dr Rahman, Chairman of the HEC, reiterates the importance of speaking skills for using analytic approaches in research (HEC News & Views, 2007: 13). Funds allocated for the accomplishment of the project are, however, on a modest scale. In pursuance of this, three compulsory English language courses, each of 3 credit hours, and one optional course of the same credit value have been introduced in the 4-year integrated curricula for undergraduate programmes, in order to develop students' communicative skills. The discussion of the effectiveness of teaching language proficiency courses as credit courses, in contrast to pre-university courses, is beyond the scope of the present paper.

Testing of Speaking Proficiency

Testing can tremendously facilitate both learning and teaching when it assesses required skills successfully (Heaton, 1988: 5-6). With its beneficial backwash, it streamlines classroom activities and is deemed inexpensive compared to the cost incurred in changes involving reducing class size (large classes being a main issue in ELT in most Pakistani universities), increasing instructional time, hiring teachers, and purchasing teaching equipment (Linn & Gronlund, 2005: 20-25). Although literature with scholarly discussions on the influence of testing over teaching and learning in general (Alderson & Wall 1993; Bailey 1996; Cheng 1999;
4. English has been the mark of elitism in Pakistan. The rich send their children to very expensive private schools, where trained teachers teach them in accordance with the Cambridge International Examinations, namely the GCE O and A level examinations, which are administered by the British Council in Pakistan.

Cheng & Falvey 2000) are copious, directions and foci required in testing of speaking skills are found short of solid grounding on theory and pedagogy (Pennington, 1999; Celce-Murcia, Brinton, & Goodwin, 1996). This is because testing speaking proficiency of an L2 leaner is one of the most difficult skills to test (Heaton, 1988: 88). Knowing what to assess specifically and how to conduct tests with recent theories and valid procedures is of utmost importance, which in turn helps testers measure with the principles of validity and reliability. Thus, it is equally significant to glance up to see the effects of these principles in the developments rating scales. Validity refers to the measurement of a required skill. For example, the testing of speaking is likely to be considered invalid if a student of engineering is asked to orally describe current marketing problems in Pakistan. It is because it demands critical knowledge of marketing rather than linguistic abilities of the learner. There are various overlapping concepts of validity, e.g., face validity, content validity, construct validity, and empirical validity. The test loses face validity if it assesses English pronunciation of Pakistani learners by instructing them to transcribe in IPA or to speak with Received Pronunciation (RP) or General American (GA). In other words, face validity varies country to country also. If greater emphasis is given on the use of Western phatic communion5 in the Outer Circle countries as opposed to the demonstrations of overall communicative skills in speaking tests, the test is considered with a low content validity. Thus, what is relevant to the testees capabilities brings the content validity (Hugues, 1994: 2223), while the construct validity depends on the current theories. If a speaking test is aimed to assess error-free grammar as the most important feature of communication in contrast to the ability required to convey message effectively, the construct validity is low. Empirical validity, also known as criterion-related validity, assess how far results on the test agree with those provided by some independent and highly dependable assessment of the candidates ability(ibid.). This is further divided into predictive validity and concurrent validity. However, Messick (1996) has strongly disagreed upon the distinction made in the description of test validity and considers it as a many-sided concept. In fact, it does not validate a test, as a characteristic of test, by specifying what it measures exactly but it makes inferences, as a feature of inferences (Cronbach & Meehl, 1955: 297). Therefore, language testing must consider language ability as a multi-component and acknowledge the influence of the test method and the testees performance (Bachman, 1991). Chalhoub-Deville (1997) has criticized the Bachman Model on theoretical grounds though. Another pertinent view reconceptualizing proficiency as multidimensional indicates the changing patterns of proficiency domain to domain; therefore, it is imperative to highlight, through testing, what learners are capable of doing in different domains (Perkins and Gass, 1996). McNamara (1995) discusses the social dimension of language proficiency and includes an interlocutor the testee speaks to. By replacing the rater with an interlocutor reduces the pressure on the part of the testee and the performance is likely to increase. 
Although the history of the testing of speaking is rich (Spolsky, 1990; 2001), its testing in L2 took form in the 1980s, with the Oral Proficiency Interviews (OPIs) conducted by the US Foreign Service Institute (FSI). The FSI rating scale uses both analytic and holistic6 methods and
5. Phatic communion is the social function of a language, used to show rapport or establish a pleasant environment between interlocutors. A comment about the weather or someone's health is a typical example of British phatic communion.
6. The process of scoring can be either holistic or analytic. In holistic scoring, also known as impressionistic scoring, the tester assigns a single score to the performance of the testee on the basis of its overall impression, while scoring methods which require a separate score for each of a number of aspects of the same performance are said to be analytic.

involves two testers, who assign levels to candidates and rate them on a six-point scale for each of the following: accent, grammar, vocabulary, comprehension, and fluency. Later on, OPIs, with raters as interlocutors, were used for large-scale oral testing systems in schools and universities (Robinson, 1992). The same is practised in Pakistan today, but, unfortunately, it is done without any clear method of scoring. Despite a spate of criticism of their form, OPIs are still considered representative samples of communicative speech events (Moder & Halleck, 1998). In the case of Pakistan, the OPI is particularly relevant to university students, who are interviewed for internships or training. Therefore, it is not only a representative sample for assessment but also cost-effective. However, the model of McNamara (1995) can also be used to make the relationship between the participants relatively symmetrical.

Next to validity is reliability, which provides consistency in measurement and underpins the validity of the test. For example, if a test is run on different occasions on the same candidates, who have learned nothing in between, and the results change, the test is unreliable. Heaton (1988: 162-63) lists the factors affecting reliability as follows:

1. Reliability is directly proportional to the sample size of the test. For example, a 30-minute speaking task is more reliable than one of 5 minutes.
2. The administration of the test affects reliability. For example, an uncomfortable environment or unclear instructions can reduce the reliability of the test.
3. Personal factors, such as motivation or illness, have the potential to disturb reliability.
4. Varying scores put reliability in question.

Rating scales

Since the discussions on the concept of reliability versus validity, passionately pursued by various scholars, are beyond the scope of this paper, I would like to turn to the last factor highlighted by Heaton (1988), which further divides into two types: intra-rater reliability and inter-rater reliability. As human raters have tremendous potential to affect the consistency of the measurement of a test in most cases (Bachman, 2004: 169-70), intra-rater reliability becomes low if the rating criteria of a rater vary for the same test run on different candidates, while inter-rater reliability is deemed low if two raters score the same test differently. Therefore, the development of rating scales and their use are instrumental in bringing about a solution to the problem of reliability, particularly in L2 speaking proficiency tests in the Outer Circle countries, wherein the definition of speaking abilities varies with respect to the use of the target language. McNamara (1996) and Shohamy (2001) have integrated the component of performance in recent tests and proposed subjective judgment; by making tests flexible, accommodation theory of adult SLA is incorporated. At the same time, there is an urgent need to understand that rating scales can never be error-free tools for the assessment of L2 learners. They must be tailored to elicit specified functions of language in different domains (Weir, 1993). For example, the rating scale for candidates at intermediate level should be different from the one for advanced level. Therefore, the purpose of the rating scale must be made clear and suitable for the language samples elicited in different testing situations.
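
As a rough illustration of the inter-rater reliability just described, the following sketch (Python; the scores are hypothetical and not drawn from any study) computes a simple Pearson correlation between two raters' marks for the same set of testees; values close to 1 suggest high inter-rater consistency, while low values signal the problem outlined above.

    from statistics import mean, stdev

    def pearson(xs, ys):
        # Pearson correlation between two equally long lists of scores.
        mx, my = mean(xs), mean(ys)
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
        return cov / (stdev(xs) * stdev(ys))

    # Hypothetical scores (1-6 scale) awarded by two raters to the same six testees.
    rater_a = [4, 3, 5, 2, 4, 3]
    rater_b = [4, 2, 5, 3, 4, 2]
    print(f"inter-rater consistency (Pearson r) = {pearson(rater_a, rater_b):.2f}")

The same routine applied to one rater's scores for the same performances on two different occasions would give a rough index of intra-rater reliability.
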
The objectives of rating scales can be classified into four categories: user-oriented, assessor-oriented, constructor-oriented and diagnosis-oriented (Alderson, 1991; Pollitt & Murray, 1996). It is clear that different purposes require different rating scales; therefore, the purpose compatible with the scale

must be prioritized to gauge the sample language elicited from learners' performance in particular testing situations. This paper aims to develop an assessor-oriented speaking rating scale which university ESL teachers can utilize to assess the speaking proficiency of students in their classrooms. A criterion-referenced analytic rating scale is developed due to its pedagogic relevance. In order to establish statistical procedures for a better understanding of the sample language performances of testees, various approaches to rating scales have emerged. This paper uses many-facet Rasch measurement (MFRM), with FACETS (Linacre, 2006), which has been used effectively for rating scale development and rater training. It triangulates testing by including the components of item difficulty, examinee ability, and rater severity, and it also identifies the misfitting elements within each facet. In Pakistan, where the role of the English language in higher education is underpinned by the teaching of four courses, a rating scale for oral proficiency has not yet been developed. Consequently, teachers are perceived as unable to assess the speaking skills of learners, each improvising an inconsistent rating scheme of their own. Some teachers assess oral proficiency by asking students to prepare a 15-minute presentation and deliver it in front of the whole class. Others give students 10 minutes for extempore speeches and assess the performances accordingly. Some students are asked to participate in role-plays. In the end, teachers give marks out of 40, which is quite difficult to handle. As has been discussed, these are not relevant forms of speaking tests.

Available scales

After the FSI scale, a number of performance-based rating scales have emerged, for instance the ACTFL Proficiency Guidelines, the ALTE Framework and the Interagency Language Roundtable (ILR) proficiency ratings, which were also used in the development of the ACTFL Proficiency Guidelines. The most popular, however, has been the Common European Framework of Reference for Languages (CEFR), developed by the Council of Europe (CoE), which comprises six levels: A1 (Breakthrough), A2 (Waystage), B1 (Threshold), B2 (Vantage), C1 (Effective Operational Proficiency) and C2 (Mastery). Each level is illustrated with descriptors. IELTS and the Cambridge ESOL examinations are examples aligned with the same framework. The speaking assessment component consists of discourse analysis, fluency, flexibility, sociolinguistic competence, vocabulary range, grammatical accuracy, and phonological control. In other words, there are many rating scales used outside Pakistan; however, no Pakistani university has developed a rating scale of its own to assess the speaking proficiency of university students. The National Testing Service (NTS) has recently launched a Graduate Employment Examination (GEE) in partnership with the Educational Testing Service (ETS). The GEE comprises the GAT (Graduate Assessment Test) and the ETS-administered TOEIC (Test of English for International Communication). The objective of the TOEIC is to assess the workplace English language proficiency of candidates. There are three speaking tasks of 20 minutes, carried out online: reading aloud for pronunciation, describing pictures for vocabulary and grammar, and expressing an opinion for fluency. The speech of the testee is recorded and marked against a score range of 0 to 200, along with 8 proficiency levels. Both reliability and validity are in question for running this test at the mass level in the Outer Circle countries.
TOEFL iBT (Internet-based Test) and IELTS are administered in Pakistan by ETS and the British Council respectively. Although they are primarily taken for higher studies abroad, they are also utilized by some institutions in Pakistan.

Examination of scales

It is imperative, therefore, to ensure that the rating scale we are going to develop is fully compatible with the courses taught and the objectives set. Although the objectives set by the HEC for teaching the three compulsory English courses are unclear, they can be drawn out from the course outlines as follows: learners should be able to meet their real-life communication needs by using the language for social and academic purposes (Higher Education Commission of Pakistan, 2006: 14-20). Therefore, keeping in mind the above objective and the GEE, communicative ability in group discussions is deemed most relevant to the procedural framework of the rating scale. OPIs with class teachers and pair discussions could be the most relevant forms for testing the speaking skills of learners. Taking help from the IELTS speaking band descriptors and the levels of the ILR, the rating scale could be divided into five essential components: fluency, vocabulary, grammar, pronunciation, and interactive communication. The last component delineates the precise analysis of the discourse carried out in interviews and discussions. Since the whole rating model is based on the communicative abilities of the learner in the domains of education and profession, the success rate of conveying messages was proposed to be prioritized in the assessment.

Draft of the scale

After the specification of the categories of the rating scale, eight qualified ESL university teachers were requested to help draft the scale. It was decided that, to make the scale flexible and easy to mark accurately, the rating scale must range from one to six and precisely determine procedures to quantify learners' performance. These procedures depend on the attributes observed by raters in the individual performance of the learner. Administrators of the test must rest assured that classroom teachers, as opposed to raters, will be the interlocutors with the testees in the OPI. The OPI should begin with topics that stimulate a 15-minute conversation, say, introductions, questions helping the testee participate, etc. In this way the social asymmetry of the relationship is reduced and the performance of the learner is improved. In the second part of the test, a discussion between two testees should be carried out for 20 minutes. Again, classroom teachers, in contrast to raters, should introduce the topic and help if the discussants fumble for words. These speeches should ideally be video recorded and given to the raters. It is also necessary to give essential information about the speaking tests to raters so that they can score accurately and comfortably. Each component must be marked separately. The rating scale must be discussed with raters first. Assessment of intonation should be an integral part of pronunciation. Weak forms and features of connected speech could be ignored. The appropriate choice of word must be preferred to inappropriate high-frequency words or phrases. Slow speed is better than unexpected long pauses in fast speech. See the Appendix for the rating scale.
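
To make the drafted instrument concrete, the following sketch (Python; the category names and the one-to-six range follow the draft above, while the helper name and the validation logic are merely illustrative assumptions) represents an analytic score sheet in which each component is marked separately, as the draft requires.

    CATEGORIES = ("Fluency", "Vocabulary", "Grammar", "Pronunciation",
                  "Interactive Communication")
    LEVELS = range(1, 7)   # 1 = poor ability ... 6 = very good ability

    def check_score_sheet(sheet):
        # Ensure every category is marked separately and lies on the 1-6 scale.
        missing = [c for c in CATEGORIES if c not in sheet]
        if missing:
            raise ValueError(f"unmarked categories: {missing}")
        bad = {c: s for c, s in sheet.items() if s not in LEVELS}
        if bad:
            raise ValueError(f"scores outside the 1-6 scale: {bad}")
        return sum(sheet.values()) / len(sheet)   # average analytic score

    # Hypothetical score sheet for one testee.
    sheet = {"Fluency": 4, "Vocabulary": 3, "Grammar": 4,
             "Pronunciation": 3, "Interactive Communication": 5}
    print(f"average analytic score: {check_score_sheet(sheet):.2f}")
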

Operation of the scale

In order to see the level of reliability of the rating scale, forty-four students studying at a private university in Karachi7 were tested for their oral proficiency. They were provided with comprehensive instructions about the two tasks they had to accomplish. Their classroom teachers prepared them for the tasks, and the same teachers administered the test. The first task comprised the Oral Proficiency Interview (OPI). Since the testees already had a good relationship with their teachers, they were observed to be comfortable in accomplishing the tasks. The teachers initiated the 15-minute interviews by asking the testees to introduce themselves, as a warm-up technique. Later, they gradually moved to a selected topic of general concern and encouraged the testee to speak on it. Since the teachers had been informed about the significance of being friendly with testees, the interviews were in large part carried out smoothly and testees participated with great enthusiasm. The second part consisted of 20-minute peer-based pair discussions, which were, in very few cases, moderated by the teachers. The students had already been taught how to speak effectively as discussants. All interactions were audio recorded, as some female testees were reluctant to consent to video recording; the reluctance stemmed from the cultural norms of the country. Later on, the first half of the recorded speeches was given to raters 1, 2, 3 and 4, and the second half to raters 5, 6, 7 and 8. The raters were given the analytic rating scale and briefed about it. They scored at their convenience and returned the scores within a week.
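
Before analysis, the ratings gathered under this design can be laid out as one record per examinee, rater and category, which is essentially the flat format a many-facet analysis such as FACETS works from. The sketch below (Python; the identifiers and scores are hypothetical, not the study's data) flattens the nested score sheets and checks that every rater marked all five categories.

    CATEGORIES = ("Fluency", "Vocabulary", "Grammar", "Pronunciation",
                  "Interactive Communication")

    def to_records(scores):
        # scores: {examinee: {rater: {category: score}}} -> flat records,
        # with a completeness check on each rater's score sheet.
        records = []
        for examinee, by_rater in scores.items():
            for rater, sheet in by_rater.items():
                missing = [c for c in CATEGORIES if c not in sheet]
                if missing:
                    raise ValueError(f"{examinee}, rater {rater}: unmarked {missing}")
                records.extend((examinee, rater, c, sheet[c]) for c in CATEGORIES)
        return records

    # Hypothetical excerpt: one examinee marked by two of the four raters in its group.
    scores = {"S01": {1: dict.fromkeys(CATEGORIES, 4), 2: dict.fromkeys(CATEGORIES, 3)}}
    print(to_records(scores)[:3])
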

FACETS Analysis

In the present study, FACETS is used to investigate the rating patterns in association with the effects of three facets: examinee ability, rater severity and item difficulty. Figure 1 displays the patterns of the rating scale in conjunction with the effects of rater, examinee and item difficulty. The first column is the continuum of a logit interval scale ranging between +3 and -3 logits. The second column is a summary of the levels of severity of the raters, wherein the rater at the top is deemed the severest and the one at the bottom the most lenient. In the present study, rater 3, at 1.86 logits, and rater 5, at -2.56 logits, are the severest and the most lenient respectively. Similarly, the third column, which shows examinee ability, places the most able at the top, e.g. examinee 1061, at 2.14 on the logit scale, whose average score is 4.69 out of 6, and the least able at the bottom, e.g. examinee 1081, at -2.79 on the logit scale, whose average score is 2.79. Summing up category difficulty, the fourth column indicates that interactive communication (IC) is the easiest category, whereas vocabulary is the most difficult. Since no examinee is assigned level 1 (poor ability), the last column displays the rating scale ranging between level 2 (limited ability) and level 6 (very good ability).
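
For readers unfamiliar with the model underlying Figure 1, the sketch below (Python; the parameter values are purely illustrative, not the study's estimates) shows the usual many-facet Rasch rating-scale formulation, in which the log-odds of a testee being awarded category k rather than k-1 equal examinee ability minus item difficulty, rater severity and the step threshold.

    import math

    def category_probabilities(ability, item_difficulty, rater_severity, thresholds):
        # Many-facet Rasch rating-scale model:
        # log(P_k / P_{k-1}) = ability - item_difficulty - rater_severity - threshold_k.
        logits = [0.0]
        for tau in thresholds:
            logits.append(logits[-1] + (ability - item_difficulty - rater_severity - tau))
        weights = [math.exp(l) for l in logits]
        total = sum(weights)
        return [w / total for w in weights]

    # Illustrative values for a six-category (1-6) scale, hence five step thresholds.
    probs = category_probabilities(ability=1.0, item_difficulty=0.5,
                                   rater_severity=0.3,
                                   thresholds=[-2.0, -1.0, 0.0, 1.0, 2.0])
    expected = sum(k * p for k, p in enumerate(probs, start=1))
    print([round(p, 3) for p in probs], "expected score:", round(expected, 2))
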

7. Karachi is the largest city of Pakistan. It is the hub of economic activities of the whole country.

FIGURE 1: FACETS Summary (Examinee ability, rater severity, item difficulty)


[Vertical ruler omitted: column 1 shows the logit scale from +3 to -3; column 2 the raters ordered by severity (rater 3 at the top, rater 5 at the bottom); column 3 the examinees ordered by ability; column 4 the categories from Interactive Communication (easiest) down to Vocabulary (most difficult); column 5 the rating-scale levels from 2 to 6.]

Table 1 summarizes the FACETS analysis of selected statistics on the ability scale for the 44 examinees, in which the mean examinee ability is 3.79, with a standard deviation (S.D.) of 0.63, indicating that the spread of examinee scores is concentrated between 2.14 and -0.62 logits. Although the overall scores of examinees range only between 5 and 3, significance at p=0.00 clearly indicates that the examinees have different levels of ability in the five categories of speaking skill and cannot be considered equal.

TABLE 1: Summary of Statistics on Examinees (N=44)
Mean ability        3.79
S.D.                0.63
Model S.E.          0.34
Separation          3.86
Reliability         0.94
Significance (p)    0.00
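
The separation and reliability figures reported in Table 1 follow from the spread of the ability estimates relative to their measurement error; a minimal sketch of the usual Rasch-style formulas is given below (Python; the measures and standard errors are hypothetical, not the study's data).

    from statistics import mean, pstdev

    def separation_and_reliability(measures, standard_errors):
        # True spread = observed spread with measurement error removed;
        # separation G = true S.D. / root mean-square error; reliability = G^2 / (1 + G^2).
        observed_var = pstdev(measures) ** 2
        error_var = mean(se ** 2 for se in standard_errors)
        true_var = max(observed_var - error_var, 0.0)
        g = (true_var ** 0.5) / (error_var ** 0.5)
        return g, g ** 2 / (1 + g ** 2)

    # Hypothetical examinee measures (logits), each with a model S.E. of 0.34.
    measures = [2.1, 1.4, 0.9, 0.3, -0.2, -0.8, -1.5, -2.4]
    g, r = separation_and_reliability(measures, [0.34] * len(measures))
    print(f"separation = {g:.2f}, reliability = {r:.2f}")
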

The reliability coefficient of 0.94 for examinees proves the test highly reliable, with the separation value of 3.86 revealing that the examinees can be categorized into roughly four different levels of ability. FACETS identifies two misfit examinees, using the following rule:

Misfit:  Infit MnSq > mean Infit MnSq + 2(S.D.)   (expression 1)
Overfit: Infit MnSq < mean Infit MnSq - 2(S.D.)   (expression 2)
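
Expressed as a small routine, the rule in expressions 1 and 2 simply flags any element whose infit mean-square lies more than two standard deviations above or below the group mean; the sketch below (Python, with hypothetical infit values and identifiers) applies it to a set of examinees.

    from statistics import mean, pstdev

    def flag_fit(infit):
        # infit: {element: infit mean-square}. Misfit if value > mean + 2*S.D. (expression 1),
        # overfit if value < mean - 2*S.D. (expression 2), otherwise acceptable fit.
        m, sd = mean(infit.values()), pstdev(infit.values())
        upper, lower = m + 2 * sd, m - 2 * sd
        return {e: ("misfit" if v > upper else "overfit" if v < lower else "fit")
                for e, v in infit.items()}

    # Hypothetical infit mean-square values for six examinees.
    infit = {"E01": 0.95, "E02": 1.02, "E03": 1.05, "E04": 0.98, "E05": 1.00, "E06": 1.85}
    print(flag_fit(infit))
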

TABLE 2: Misfit Examinees (N=2)

Examinee 1086 (Average score=2.84)
Rater (severity)    Pronunciation   Vocabulary   Grammar   Fluency   IC
1 (-2.13)           2               2            2         2         4
2 (0.39)            3               3            2         2         3
3 (1.86)            2               3            4         3         2
4 (-0.03)           5               2            3         4         5

Examinee 1083 (Average score=2.93)
Rater (severity)    Pronunciation   Vocabulary   Grammar   Fluency   IC
1 (-2.13)           2               2            2         2         4
2 (0.39)            2               5            2         2         4
3 (1.86)            4               2            2         4         4
4 (-0.03)           3               3            4         3         4

If the value is greater than the sum of the Infit mean-square mean and twice the standard deviation, it is interpreted as unpredictable and misfitting. It is also important to note that misfit examinees should not exceed 2% (Pollitt & Hutchinson, 1987: 72-92); in the present case, however, the misfit examinees amount to 4.5% (2 out of 44), calling for a little more improvement in the scale. Table 2 identifies that the level 5 in Interactive Communication (IC) given by rater 4 is unexpectedly high for examinee 1086, whose average score is 2.84. Rater 3, who has been the severest, gave 5 in pronunciation to examinee 1083, which is unexpectedly high, as no other rater gave the same. The answers to some of the problematic findings presented in Table 2 can be gathered from Table 3, which clearly shows intra-rater consistency, with a mean model standard error (S.E.) of 0.14 and an S.D. of 0.0, but reveals an unacceptable reliability coefficient of 0.99 for raters, which should ideally be 0.00 so that the raters could be perceived as equally harsh or lenient. For example, rater 3 is consistently harsher, while rater 5 is consistently more lenient. The most serious element Table 3 identifies is the unexpectedly high separation of 6.11, which means that seven levels of severity could be distinguished for this rating scale.

TABLE 3: Rater measurement report
Rater         Severity (logits)   Model S.E.   Infit Mean-Square Index
1 (overfit)   -2.13               0.16         0.81
2              0.39               0.13         0.90
3              1.86               0.14         1.01
4             -0.03               0.14         1.17
5 (overfit)   -2.56               0.17         0.86
6             -0.59               0.14         0.88
7              1.38               0.14         1.12
8             -0.83               0.14         1.18
Mean          -0.38               0.14         0.99
S.D.           1.40               0.0          0.13

This problem is not perceived to lie with the rating scale, as the raters' self-consistency is fairly high; rather, it requires the raters, who appear to be less qualified in rating, to be trained for the rating scale in question. Using expressions 1 and 2, rater 1 and rater 5 are overfit, as they lie outside the boundary between 3.79 logits and -1.81 logits; they are deemed unacceptably lenient. Although the separation value is too high, significance at p=0.22 also shows some proximity among the levels of severity of the eight raters. Table 4 is a summary of the item difficulty measurement report, pointing out that interactive communication is the easiest category and vocabulary the most difficult. After the application of expressions 1 and 2, the boundary for the categories is set between 2.80 logits and -0.80 logits. Keeping the same in mind, pronunciation needs a little improvement to fit fully in the scale, while vocabulary is deemed overfit; thus its descriptors need considerable amendment. After the required amendment, the reliability coefficient of the separation index, which is 0.97 at present, can be reduced, in order to make each item nearly equally difficult. By doing so, the rating scale will become more reliable for university students studying in Pakistan.

TABLE 4: Item difficulty measurement report
Category                      Difficulty Measure (logits)   Model S.E.   Infit Mean-Square Index
Interactive Communication      1.30                         0.16         1.44
Fluency                        0.48                         0.16         0.97
Grammar                        0.23                         0.16         1.14
Pronunciation                 -0.83                         0.17         0.71
Vocabulary (overfit)          -1.18                         0.17         0.74
Mean                           0.00                         0.16         1.00
S.D.                           0.90                         0.01         0.27
Reliability = 0.97; Significance: p = 0.00

The mean of rater severity (in logits) is brought to zero in FACETS. Since this mean combines the group of raters 1, 2, 3 and 4 with the group of raters 5, 6, 7 and 8, it is -0.3.


Conclusion

The present study on a rating scale for the English speaking proficiency of university students in Pakistan not only aims to develop the required scale but also highlights the importance of indigenous scales for ESL learners. It is concluded that the scale is fairly suitable for the target population but requires qualified raters to make it more reliable. In other words, some training sessions must be conducted for raters, which in turn will improve the item difficulty facet, wherein vocabulary is too demanding a challenge for examinees at the moment. Raters must be educated about what vocabulary and pronunciation should be considered appropriate in Outer Circle countries, particularly in the domains of education and employment, where students are required to communicate in English, while the descriptors of both categories must be amended to make them equally difficult.
References

Kachru, B. (1989) Teaching world Englishes. Indian Journal of Applied Linguistics, 13.1: 85-95.
Linn, Robert L., and Norman E. Gronlund (2005) Measurement and Assessment in Teaching (8th edition). Singapore: Pearson Education.
Bolton, K. (2008) English in Asia, Asian Englishes, and the issue of proficiency. English Today, 94.2: 5-8.
Rahman, T. (1999) Language, Education, and Culture. Karachi: Oxford University Press.
Khan, M. K. Raza (2003) Washback effects of English language testing in Pakistan. SPELT Quarterly, 18.3: 1-7.
Raza, W. (2006, 8 December) Teaching English as language. The News International. Pakistan.
Mansoor, S. (2005) Language Planning in Higher Education: A Case Study of Pakistan. Karachi: Oxford University Press.
Heaton, J. B. (1988) Writing English Tests. New York: Longman.
Pennington, M. C. (1999) Computer-aided pronunciation pedagogy: promise, limitations, directions. Computer Assisted Language Learning, 12.1: 427-440.
Celce-Murcia, M., D. Brinton, and J. Goodwin (1996) Teaching Pronunciation: A Reference for Teachers of English to Speakers of Other Languages. Cambridge: Cambridge University Press.
Alderson, J. Charles, and Dianne Wall (1993) Does washback exist? Applied Linguistics, 14.2: 115-29.
Alderson, J. Charles (1991) Dis-sporting life. Response to Alastair Pollitt's paper: Giving students a sporting chance: assessment by counting and judging. In J. C. Alderson & B. North (Eds.), Language Testing in the 1990s: The Communicative Legacy (60-70). London: Macmillan.
Bailey, K. (1996) Working for washback: A review of the washback concept in language testing. Language Testing, 13: 257-79.
Cheng, L. (1999) Changing assessment: Washback on teacher perceptions and actions. Teaching and Teacher Education, 13: 253-71.
Cheng, L., and P. Falvey (2000) What works? The washback effect of a new public examination on teachers' perspectives and behaviours in classroom teaching. Curriculum Forum, 9.2: 1-33.
Hughes, Arthur (1994) Testing for Language Teachers. Cambridge: Cambridge University Press.


Messick, S. (1996) Validity and washback in language testing. Language Testing, 13.3: 241-56.
Cronbach, L. J., and P. E. Meehl (1955) Construct validity in psychological tests. Psychological Bulletin, 52: 281-302.
Bachman, L. F. (1991) What does language testing have to offer? TESOL Quarterly, 25.4: 671-704.
Chalhoub-Deville, M. (1997) Theoretical models, assessment frameworks and test construction. Language Testing, 14.1: 3-22.
Perkins, K., and S. M. Gass (1996) An investigation of patterns of discontinuous learning: implications for ESL measurement. Language Testing, 13.1: 63-82.
McNamara, T. F. (1995) Modelling performance: opening Pandora's Box. Applied Linguistics, 16.2: 159-75.
Spolsky, B. (2001) The speaking construct in historical perspective. Paper presented at the LTRC / AAAL Symposium, St Louis.
Spolsky, B. (1990) Oral examinations: an historical note. Language Testing, 7.2: 158-73.
Robinson, R. E. (1992) Developing practical speaking tests for the foreign language classroom: a small group approach. Foreign Language Annals, 25.6: 487-96.
Moder, C. L., and G. B. Halleck (1998) Framing the language proficiency interview as a speech event: native and non-native speakers' questions. In R. Young & A. W. He (Eds.), Talking and Testing: Discourse Approaches to the Assessment of Oral Proficiency, Studies in Bilingualism (Vol. 14). Amsterdam: John Benjamins Publishing Company. 117-46.
Bachman, L. F. (2004) Statistical Analyses for Language Assessment. Cambridge: Cambridge University Press. 169-70.
McNamara, T. F. (1996) Measuring Second Language Performance. London: Addison Wesley Longman.
Shohamy, E. (2001) The Power of Tests. London: Longman.
Weir, C. J. (1993) Understanding and Developing Language Tests. Hemel Hempstead: Prentice Hall.
Pollitt, A., and N. L. Murray (1996) What raters really pay attention to. In M. Milanovic & N. Saville (Eds.), Studies in Language Testing 3: Performance Testing, Cognition and Assessment. Cambridge: Cambridge University Press.

Pollitt, A., & Hutchinson, C. (1987). Calibrated graded assessment: Rasch partial credit analysis of performance in writing. Language Testing, 4, 72-92.
Linacre, M. (2006) A User's Guide to FACETS: Rasch-Model Computer Programs. Chicago, IL: MESA Press.
Higher Education Commission of Pakistan (2007) Annual Report (2005-06). Islamabad: Higher Education Commission of Pakistan. 99-120.
HEC News & Views (2007, January) A Monthly Magazine of Higher Education. Islamabad: Higher Education Commission of Pakistan. 13.
Higher Education Commission of Pakistan (2006) English BA/BS & MA/MS (Revised Curriculum). Islamabad: Higher Education Commission of Pakistan. 14-20.

