Factors Affecting the Assessment of ESL Students’ Writing

Jinyan Huang, Ph.D.


Assistant Professor in TESOL and Assessment
Niagara University

Abstract: Due to the different linguistic and cultural backgrounds of ESL students, the assessment of their English writing becomes a problematic area. On the one hand, many factors affect ESL students' writing, including their English proficiency, mother tongue, home culture, and style of written communication. In rating ESL students' writing, raters may differentially consider these factors. On the other hand, empirical studies have found that many factors (e.g., raters' linguistic backgrounds, previous experiences, and prior training in assessment) affect the rating of ESL students' writing. The impact of these factors leads to questions about the accuracy, precision, and ultimately the fairness of the assessment of ESL students' writing. This paper reviews 20 major empirical studies in which the factors affecting the rating of ESL writing in North American school contexts were investigated. These factors are categorized into two types: rater-related and task-related. Rater-related factors include the rating methods used, rating criteria, raters' academic disciplines, professional experiences, linguistic backgrounds, tolerance for error, perceptions and expectations, and rater training. Task-related factors include the types of and difficulty levels of writing tasks. The paper also identifies research gaps and proposes directions for future research in the assessment of ESL writing.

Introduction

The assessment of writing has long been considered a problematic area for educational assessment professionals. As stated by Speck and Jones (1998), "there are more problems than solutions - problems of inter-grader reliability, single-grader consistency, and ultimate accountability for the grades we assign" (p. 17). Due to the different linguistic and cultural backgrounds of English-as-a-second-language (ESL) students, the assessment of their English writing is more problematic than the assessment of native English (NE) students' writing (Connor-Linton, 1995; Hamp-Lyons, 1991a; Sakyi, 2000; Sweedler-Brown, 1993). On the one hand, many factors affect ESL students' writing, including their English proficiency, mother tongue, home culture, and style of written communication (Casanave & Hubbard, 1992; Hinkel, 2003; Shaw & Liu, 1998; Yang, 2001). On the other hand, raters may differentially consider these factors when rating ESL students' writing. Empirical studies have noted differences in rater behavior for ESL students (Bachman, 2000). Rater background, mother tongue, previous experience, amount of prior training, and the types and difficulty of writing tasks have been found to affect the rating of the written responses of ESL students (Brown, 1991; Kobayashi, 1992; Sakyi, 2000; Santos, 1988; Weigle, 1994, 1999; Weigle, Boldt, & Valsecchi, 2003). The impact of these factors leads to questions about the accuracy, precision, and ultimately the fairness of the scores obtained from the ratings of written work produced by ESL students.

Fairness is a priority in the field of educational assessment. Educational organizations, institutions, and individual professionals should make assessments as fair as possible for test takers of different races, genders, and ethnic backgrounds (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 1999; Cole & Zieky, 2001; Joint Advisory Committee, 1993). Due to a significant growth in the number of ESL students being educated in North American schools in the past two decades (CBIE, 2002; IIE, 2001), fairness issues in ESL writing assessments have been of growing interest and importance (Connor-Linton, 1995; Hamp-Lyons, 1991a; Kunnan, 2000; McDaniel, 1985; Sweedler-Brown, 1993; Vaughan, 1991). Increasingly, writing-proficiency standards are being established for both secondary school and university students in North America without regard to students' native languages (EQAO, 2002; Johnson, Penny, & Gordon, 2000; Thompson, 1990). ESL students have to write the same tests as NE students. Like NE students, ESL students are expected to successfully demonstrate their English writing skills or complete high-stakes essay examinations (Casanave & Hubbard, 1992; Hayward, 1990; Wiggins, 1993).

However, research shows that ESL students face considerable challenges passing institutional or provincial/state competency examinations of writing (Blackett, 2002; Johns, 1991; Ruetten, 1994; Thompson, 1990). Further, these challenges may be due to more than language deficiencies. As an example, leniency or severity within raters could underestimate or overestimate ESL students' performance on these writing examinations. Previous studies have also found that raters with different teaching experience assign different scores to the same piece of ESL writing (Cumming, 1990b; Hamp-Lyons, 1996; Rubin & Williams-James, 1997). Such variability due to raters, therefore, may threaten the fairness of assessments of ESL writing.

Assessing and evaluating ESL writing involves both assigning a score or grade to an essay and commenting on it (Perkins, 1983). This review paper discusses only the scoring of ESL essays. Therefore, the terms "grading", "rating", and "scoring" are used interchangeably in this paper, referring to the process raters use to arrive at the scores students will receive (Speck & Jones, 1998). Given the increasing number of ESL students being educated in the North American context (IIE, 2001; CBIE, 2002), the investigation of the factors that affect the accuracy, reliability, and validity of the scoring of English compositions written by ESL students in North American secondary schools and colleges/universities is the focus of this paper. The central argument of this paper is that, to some extent, ESL compositions are not scored as fairly as they should be, because various factors have been found to affect the accuracy, reliability, and validity of the scoring of ESL compositions at North American schools. Studies conducted in North American school contexts and related to 1) secondary school teachers' or college/university professors' rating of ESL students' writing, and 2) raters' rating of ESL compositions in large-scale state/provincial and institutional assessment contexts are reviewed. A rater, depending on the assessment context, can be a teacher/professor or a grader/marker. The writing tasks can be compositions, short essays, timed essays, or constructed written responses in state/provincial (e.g., diploma exams, literacy tests) and institutional (e.g., placement, exit exams) assessment contexts. They can also include writing assignments, term papers, and course papers.

This review paper begins with a brief explanation of reliability and validity in the context of the rating of students' writing. It then reviews the empirical studies on factors affecting the accuracy, reliability, and validity of the rating of ESL students' writing in North American school contexts. Finally, it identifies research gaps and proposes directions for future research.

Reliability and Validity of the Rating of Writing

Inter- and intra-rater reliability, or rating consistency, is particularly important in the rating of students' writing since there may exist unwanted variations in scores due to variations among raters and within raters (Bachman, 1990; Gamaroff, 2000; Huot, 1990; Johnson et al., 2000). Different raters often assign different scores to the same piece of writing, and the same rater may assign different scores to the same composition at different times. Both of these variations are problematic as they adversely affect the reliability, validity, and fairness of the scores assigned to students.

Reliability

In classical test theory (CTT), an examinee's observed score consists of a "true score" and an "error score" (Huot, 1990). Within the CTT framework, the variance of the observed score across persons is equal to the sum of the variances of the persons' true scores and error scores. Variation among raters and within raters of writing constitutes random measurement error if there is no predictability in the variation. As such, these sources of error contribute to the variance of the error scores. Spearman (1904) defined reliability as the ratio of the true score variance to the total score variance. Therefore, if the error score variance is small, then reliability will be high and close to 1.0. In contrast, if the error score variance is large, then reliability will be low. Given that random variation among and within raters of writing contributes to the error score variance, this random variation contributes to lower reliability.
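Stated compactly, the CTT decomposition and Spearman's reliability ratio described above can be written as follows; this is the standard textbook formulation, added here for reference rather than reproduced from any of the reviewed studies.

```latex
% Observed score X decomposes into a true score T and an error score E
X = T + E, \qquad \sigma^{2}_{X} = \sigma^{2}_{T} + \sigma^{2}_{E}
% Reliability is the proportion of observed-score variance attributable to true scores
\rho_{XX'} = \frac{\sigma^{2}_{T}}{\sigma^{2}_{X}} = \frac{\sigma^{2}_{T}}{\sigma^{2}_{T} + \sigma^{2}_{E}}
```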

Consequently, the study of random inter- and intra-rater reliability and attempts to lower these two unwanted sources of variability are warranted. Rater reliability, or consistency, is important because it indicates the precision of the rating of students' writing in North American school contexts. Consistency is thus related to fairness for test-takers (Johnson et al., 2000). Therefore, it is vital to ensure a high level of rater reliability in the rating of writing.

Validity

The validity of scoring is also a major concern in the rating of student writing (Connor-Linton, 1995; Hamp-Lyons, 1991a). Messick (1989) describes validity as "an integrated evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores" (p. 13). In this sense, a student who scores high on a writing test would be considered a good or competent writer, and a student who scores low would be labeled as a poor or incompetent writer (Huot, 1990).

Of concern are the sources of systematic measurement error that, if present, would serve to confound the interpretation of a writing score. If the nature of the systematic error is to raise observed scores, then the observed scores would systematically overestimate true scores. Alternatively, if the nature of the systematic error is to lower the observed scores, then the true scores would be underestimated. Variations due to systematic sources lower the validity of the interpretation of observed scores as estimates of the corresponding true scores. Many factors may contribute to the systematic rater error of writing. In the case of ESL writing, these factors include the raters' linguistic and academic backgrounds, rater severity, raters' tolerances for errors, rater training, the types and difficulty of writing tasks, and rater interactions with tasks (Ferris, 1994; Janopoulos, 1992; Russikoff, 1995; Santos, 1988; Song & Caruso, 1996; Sweedler-Brown, 1993; Vaughan, 1991; Weigle, 1994, 1998).

Like the reliability of rating, the study of validity is important because it indicates the absence or presence of bias in the rating of writing. Likewise, validity is closely tied to fairness. Cole and Moss (1989) argue that fairness is an aspect of validity. Messick (1989) clearly ties fairness to validity in his "Validity" chapter in the third edition of Educational Measurement (Linn, 1989). An invalid writing score for a student would mean unfairness to him/her and produce consequences (Huot, 1990; Johnson et al., 2000). Therefore, it is important to ensure valid ratings of students' compositions.

Because both reliability and validity are closely related to fairness, we need to consider fairness as a critical central component and give primacy to it (Kunnan, 1997). In discussing reliability, validity, and fairness in language assessment, Kunnan (2000) argues that "if a test is not fair there is little or no value in it being valid and reliable or even authentic and interactive" (p. 10).

Factors Affecting the Rating of ESL Writing

Table 1 contains a summary of 20 major empirical studies in which the factors affecting the accuracy, reliability, and validity of the rating of ESL writing in North American schools were investigated. The nature of the writing tasks, the number of students or papers (ESL versus NE), the number of raters, rater qualifications and training, and the rating methods of each empirical study are listed. As shown in Table 1, many factors affecting the rating of ESL compositions in North American school contexts have been studied. These factors can be categorized into two types: rater-related and task-related. Rater-related factors include the rating methods used, rating criteria, raters' academic disciplines, professional experiences, linguistic backgrounds, tolerance for error, perceptions and expectations, and rater training. Task-related factors include the types of and difficulty levels of writing tasks. Each of these factors is reviewed below, beginning with rating method.

Rating Methods. Holistic and analytical rating have gained wide acceptance in writing assessment practices (Jacobs, Zinkgraf, Wormuth, Hartfiel, & Hughey, 1981; Perkins, 1983). In holistic rating, each rater rates the overall writing proficiency on a single rating scale, whereas in analytical rating the writing performance is categorized into identifiable component parts, such as organization and content, and then each component is separately marked using a rating scale (Hamp-Lyons, 1991b; Stiggins & Bridgeford, 1983).
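To make the distinction between the two rating procedures concrete, the following minimal sketch shows one way the two kinds of score records might be represented; the scale ranges, component labels, and the unweighted composite are illustrative assumptions only and are not taken from any rubric used in the studies reviewed here.

```python
# Hypothetical illustration of holistic vs. analytical score records.
# Scale sizes, components, and the averaging rule are assumptions for this sketch.

from dataclasses import dataclass


@dataclass
class HolisticRating:
    """One overall judgment on a single scale (e.g., 1-6)."""
    score: int  # 1 (lowest) to 6 (highest)


@dataclass
class AnalyticalRating:
    """Separate judgments for identifiable components, each on its own scale."""
    content: int       # e.g., 1-4
    organization: int  # e.g., 1-4
    language_use: int  # e.g., 1-4

    def composite(self) -> float:
        """Simple unweighted average; real rubrics may weight components differently."""
        return (self.content + self.organization + self.language_use) / 3


# Example: the same essay judged both ways by one rater.
holistic = HolisticRating(score=4)
analytical = AnalyticalRating(content=3, organization=4, language_use=2)
print(holistic.score, round(analytical.composite(), 2))
```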

Table 1.
A Summary of 20 Major Empirical Studies on ESL Writing Assessment

Author(s) (Year) | Writing Task(s) | # of Students | # of Raters | Rater Qualification | Rater Training | Rating Method(s)
Brown (1991) | Essay (institutional) | 56 ESL, 56 NE | 16 | English & ESL faculties | Yes | Holistic (6-point)
Cumming (1990b) | Essay (institutional) | 12 ESL | 13 | Novice & expert ESL teachers | Yes | Analytical* 3 (4-point)
Engber (1995) | Essay (institutional) | 66 ESL | 10 | Experienced in holistic rating | No | Holistic (6-point)
Ferris (1994) | Persuasive essay (institutional) | 30 ESL, 30 NE | 3 | Composition instructors | No | Holistic (6-point)
Hamp-Lyons & Mathias (1994) | 64 writing tasks (commercial) | 8583 ESL | 2 | Experienced in ESL writing assessment | Yes | Holistic (10-point)
Intaraprawat & Steffensen (1995) | Persuasive essay (institutional) | 12 ESL | 5 | Experienced in teaching ESL | No | Holistic (5-point)
Kobayashi (1992) | Essay (institutional) | 2 ESL | 269 | Professors, grad. & undergrad. students | No | Analytical 4 (10-point)
McDaniel (1985) | Essay (provincial) | 11 ESL, 9 NE | 2 | Trained English teachers | Yes | Analytical 3 (9-point)
Mendelsohn & Cumming (1987) | Essay (institutional) | 8 ESL | 26 | University professors | No | Holistic (8-point)
Reid & O'Brien (1981) | Essay (institutional) | 11 ESL | 10 | Graduate students | No | Holistic (9-point)
Russikoff (1995) | Essay (institutional) | 89 ESL, 11 NE | 8 | Some with ESL writing expertise | No | Holistic & analytical
Sakyi (2000) | Essay (institutional) | 12 university students | 6 | Experienced in writing, teaching & testing | No | Holistic (5-point)
Santos (1988) | Essay (institutional) | 2 ESL | 178 | University professors | No | Analytical 6 (10-point)
Song & Caruso (1996) | Essay (institutional) | 2 ESL, 2 NE | 62 | English & ESL professors | No | Holistic & analytical (6-point)
Sweedler-Brown (1993) | Essay (institutional) | 6 ESL, 6 NE | 6 | Experienced in writing | No | Holistic (6-point)
Vaughan (1991) | Essay (institutional) | 6 university students | 9 | Experienced in holistic assessment | No | Holistic (6-point)
Weigle (1994) | Essay (institutional) | 4 ESL | 16 | Graduate students | Yes | Analytical 3 (10-point)
Weigle (1998) | Essay (institutional) | 60 ESL | 16 | Graduate students | Yes | Analytical 3 (10-point)
Weigle (1999) | Essay (institutional) | 60 ESL | 16 | ESL faculty & TAs | Yes | Analytical 3 (10-point)
Weigle, Boldt & Valsecchi (2003) | Essay (institutional) | 6 ESL | 16 | University professors & TAs | No | Analytical

* For analytical rating, "3 (4-point)" indicates that the writing performance is categorized into 3 identifiable component parts and each component part is separately rated using a 4-point scale.

These two rating methods have both advantages and disadvantages (Perkins, 1983). Holistic rating has the highest construct validity when overall attained writing proficiency is the construct to be assessed. It is frequently used as a tool for "certification, placement, proficiency, and research testing" (Perkins, 1983). But it has "threats to reliability" because it can be highly subjective due to "bias, fatigue, internal lack of consistency, previous knowledge of the student, and/or shifting standards from one paper to the next" (Perkins, 1983, p. 653). Analytical rating allows students to see how they have performed on each component considered and, as such, provides diagnostic information. Although analytical rating produces higher inter-rater reliability than holistic rating, it is more time-consuming (Perkins, 1983).

Empirical evidence indicates that the two rating methods may affect the reliability and validity of the rating of ESL compositions. For example, Song and Caruso (1996) examined the degree to which differences existed between the holistic and analytical ratings of four compositions written by two ESL speakers and two NE speakers. The ratings were completed by 32 English and 30 ESL professors in the English Department of an American university. The two faculty groups were divided into two subgroups, one rating the four essays holistically on a 1 to 6 scoring rubric and the other rating the same four essays analytically using a 1 to 6 scale for each of 10 features, six comprising rhetorical features and four, language features. Statistical analyses indicated that both the holistic and analytical methods produced no significant differences between the scores assigned to ESL and NE essays. However, there was a significant difference (p < .05) between the holistic scores awarded by the English faculty and the holistic scores awarded by the ESL faculty, with the English faculty assigning higher scores to all four essays. The English faculty raters seemed to give more weight to the overall content and quality of the rhetorical features in the writing samples than they did to language use when they used the analytic method. In this study, the two ESL compositions were written by Russian speakers. This factor might limit the generalizability of the findings.

Rating Criteria. Rating criteria are specified in scoring rubrics and rating scales to guide the rating of the written responses of the students who were tested. Generally, rhetoric (text organization, style, register, discourse functions, or genre), language (grammar/mechanics, syntax, and vocabulary), and content (knowledge of subject, thesis development, topic relevance, original or factual support) are considered the three main criteria for rating both ESL and NE compositions in North American school contexts (Cumming, 1990a; Mendelsohn & Cumming, 1987; Reid & O'Brien, 1981; Russikoff, 1995; Sakyi, 2000).

Empirical research indicates that raters weight or emphasize these criteria differently while rating compositions written by ESL and NE students (McDaniel, 1985). By analyzing a large-scale essay test administered to all graduating high school students in British Columbia, McDaniel (1985) found that, in analytical rating, raters (people teaching or qualified to teach English in high school, college, or university, or as graduate teaching assistants, or employed as graders in English departments of public schools) rated essays written by ESL and NE students differently in terms of the following three specific criteria: "content development and organization," "sentences," and "words." For example, sentence-level errors had a stronger negative effect on the scores of ESL writers than on the scores of the NE writers.

Further, raters may unfairly focus on one or two criteria outlined in the rating guide or use their own internalized criteria when rating ESL compositions. For example, Russikoff (1995) found that when the raters rated ESL compositions holistically, they paid attention only to "language use," which was the ESL students' weakness. However, when the same raters rated the same ESL compositions analytically, they were surprised to see how strong the "content" and "organization" of these compositions were.

Sakyi (2000) found that six raters, whose experience in teaching and testing writing ranged from 6 to 37 years, focused on or used different criteria while holistically rating 12 essays written by first-year students attending a Canadian university. He found that the raters tended to focus on three criteria not in the rating guide: errors in the text; essay topic and presentation of ideas; and their own personal reaction to the text. Further, he also found that raters who tried to follow the rating guide assigned scores to essays using mostly one or two of the particular features outlined in the rating guide.

Other researchers have reported results similar to those of Sakyi (2000). For example, syntactic and lexical proficiency have become major criteria in the rating of ESL writing. Essay raters often consider assigning a lower rating to an essay due to its simple construction and lexicon (Vaughan, 1991). However, investigations into ESL writing have established that, in large-scale testing and university-level assessments of student writing, syntactic and lexical simplicity are often considered to be a severe handicap for ESL students (Casanave & Hubbard, 1992; Hinkel, 2003). Further, Engber (1995) investigated the relationship between lexical proficiency (lexical richness and lexical errors) and the quality of compositions written by ESL students in an intensive English program at an American university. The results showed high and significant correlations (p < .05) for lexical variation and lexical variation minus error, suggesting that the "diversity of lexical choice and the correctness of lexical form" have a significant benign effect on the ratings of timed essays written by ESL students at an intermediate to high-intermediate level of proficiency (Engber, 1995, p. 150).

Similarly, the use of rhetorical strategies by students in their writing is considered an important criterion by some raters of compositions. By analyzing 60 persuasive compositions written by both ESL and NE freshman composition students at an American university, Ferris (1994) found that in the rating of persuasive/argumentative writing, raters often considered assigning good scores to writers who used rhetorical strategies. However, the ESL students were at a disadvantage because of their linguistic and rhetorical deficits for the task of persuasion in English (Hinds, 1987; Ostler, 1987). Intaraprawat and Steffensen (1995) further investigated the effect of meta-discourse features, "facets of a text which make the organization of the text explicit, provide information about the writer's attitude toward the text content, and engage the reader in the interaction" (p. 253), on the rating of ESL compositions. After analyzing 12 good and poor essays written by ESL students at an American university, they found that ESL essays which contained a variety of meta-discourse features were considered "good" essays and received high scores. Intaraprawat and Steffensen's findings also indicated that ESL students were at a disadvantage because they faced considerable challenges in learning how to apply English meta-discourse features when writing in English. In this study, the criterion used to rate the essays affected the rating of ESL students' essays.

Lastly, different criteria may be used by different raters in the rating of ESL compositions of different types and quality. Weigle et al. (2003) found that professors from different disciplines tended to use different criteria for assessing ESL essays. For example, the ESL and English department raters chose grammar more frequently as the most important reason for failing essays on general topics, in contrast to essays that were based on such specific texts as reading passages and lecture notes. This result suggests that ESL students may be "penalized" for poor linguistic control in essays on general topics, which are typically used in large-scale assessment writing tasks. In contrast, the psychology raters chose content as the most important aspect of writing for both essays on general topics and essays that were based on such specific texts as reading passages and lecture notes. Similarly, Sweedler-Brown (1993) found that raters with no ESL training placed far more emphasis on the essays' linguistic features (sentence-level errors) than on the essays' strong rhetorical features.

Raters' Discipline. Several researchers have investigated the effects of a rater's discipline (e.g., English, History, ESL, and Engineering) on the rating of compositions written by ESL students at North American universities (Brown, 1991; Mendelsohn & Cumming, 1987; Santos, 1988; Song & Caruso, 1996; Weigle et al., 2003). Mendelsohn and Cumming (1987) found that Engineering professors attributed more importance to language use than to rhetorical organization in rating the effectiveness of ESL papers. ESL professors attributed more importance to rhetorical organization when rating the same papers. Santos (1988) investigated 178 non-ESL professors' (96 in the humanities/social sciences and 82 in the physical sciences) ratings of two compositions written by two Asian students at an American university. The results showed that all raters rated the content (holistic impression, development, and sophistication) more severely than the language (comprehensibility, acceptability, and irritation). However, raters in the humanities/social sciences were more lenient in rating than raters in the physical sciences.

Brown (1991) and Song and Caruso (1996) investigated whether raters from different disciplines rated ESL speakers' compositions the same way as they rated NE students' compositions. Brown (1991) investigated whether significant differences existed in the holistic writing scores of 56 ESL and 56 NE students at the end of their first-year composition courses at an American university. The compositions were marked by eight ESL and eight English raters. The results showed that there were no significant differences (p < .05) between the scores assigned to ESL and NE compositions or by ESL and English raters. Brown (1991) explained that the raters from both faculties seem to "assign very similar scores - regardless of differences in background or training," although they may "arrive at those scores from somewhat different perspectives (as indicated by the features analysis)" (p. 601). Similarly, as previously reported, Song and Caruso (1996) did not find any differences between the scores assigned to ESL and NE essays by 30 ESL and 32 English faculty members at an American university. However, they found significant differences (p < .05) between ESL and English faculty raters.

A feature of the papers reviewed thus far is that the essays were on general topics. Recently, Weigle et al. (2003) extended this area of research to the rating of essays that were based on specific texts, such as reading passages and lecture notes, by ESL teachers and teachers in English, history, and psychology departments at an American university. They found that scores differed (p < .05) among raters from different disciplines. The instructors from the English department were the most severe raters and the instructors from the history department the most lenient. The ESL raters, in contrast to the raters from other disciplines, demonstrated consistency across all tasks.

Raters' Professional Experience. The ESL writing assessment literature has shown that raters' professional experience, such as the number of years of teaching and rating ESL writing, influences their rating of ESL writing (Cumming, 1990a; Hamp-Lyons, 1996; Rubin & Williams-James, 1997; Vaughan, 1991). Song and Caruso (1996) found that raters with more years of experience in teaching tended to be more lenient than raters with fewer years of teaching experience when they used holistic scoring. In contrast, when they employed analytical ratings, experience was not a significant factor that affected their rating of ESL compositions.

Reid and O'Brien (1981) examined the reliability of the ratings of 11 compositions written by 11 ESL university students who were enrolled in an intensive ESL program. Ten raters holistically marked the compositions. The results showed that the more experienced the raters were in teaching ESL writing and/or the more practice the raters had with holistic scoring, the higher the inter- and intra-rater reliability.

Cumming (1990b) investigated the ratings of seven novice and six expert ESL raters who scored 12 compositions written by adult students with differing levels of ESL proficiency at a Canadian university. The results showed that there was a significant difference (p < .05) between the two groups of raters' ratings for "content" and "rhetorical organization," but not for "language use." The novice raters consistently rated content and rhetorical organization higher than the expert raters. Further, their ratings for language use were significantly different from their ratings of content and rhetorical organization. However, the expert raters did not show any significant differences among their ratings of the three aspects of the compositions.

Raters' Linguistic Background. Raters' linguistic background (native language) can be a factor that affects their ratings of ESL compositions. Kobayashi (1992) investigated how 145 NE raters and 124 native Japanese raters at the professorial, graduate, and undergraduate levels rated two compositions written by ESL students at an American university. The results showed that NE raters were stricter about grammar than the native Japanese raters, and that NE professors and graduate students gave more positive ratings for the aspects of clarity of meaning and organization for both compositions than did the comparable native Japanese raters. However, the native Japanese undergraduates rated both compositions much more positively than did the NE undergraduates. The study ignored such factors as raters' age and gender, which have been found to affect the rating of ESL compositions (Santos, 1988; Vann, Lorenz, & Meyer, 1991).

Raters' Perceptions and Expectations. Raters' perceptions and expectations of ESL writing may have an effect on the rating of ESL compositions (Janopoulos, 1995). Casanave and Hubbard (1992) surveyed graduate faculty at an American university to obtain specific information about the writing the faculty required of first-year doctoral students, the criteria they used to evaluate students' written work, and the writing problems of both ESL and NE students. Eighty-five graduate faculty representing the humanities/social sciences and science/technology academic fields responded to the survey. The findings revealed that faculty from both fields claimed to use similar criteria for evaluating students' writing. In terms of importance, discourse-level criteria (e.g., quality of content, development of ideas, and adequate treatment of topic) were ranked high, and word- and sentence-level criteria (e.g., accuracy of grammar, size of vocabulary, and spelling and punctuation) were ranked low. However, faculty from the humanities/social sciences gave higher ratings to "development of ideas, quality of overall paper organization, appropriateness of vocabulary, and appropriateness of tone/style" than science and technology faculty (p. 38). In terms of their students' writing problems, while all participants agreed that ESL students had more problems than NE students on most features of writing, they perceived ESL students as "having only minor or moderate problems meeting the constraints of particular assignments" (pp. 38-39). In addition, two differences were found between the two groups of faculty members. First, appropriateness of vocabulary usage was considered to be the biggest problem for ESL students by the humanities/social sciences faculty, but was considered less of a problem by the science and technology faculty. Second, humanities/social sciences faculty thought that both ESL students and NE students had greater problems with overall paper organization than did the science and technology faculty.

Raters' Tolerance for Error. Research suggests that grammatical errors made in a written composition lead to lower ratings of ESL students' compositions (Homburg, 1984; Sweedler-Brown, 1993). But there are different types of errors, and raters from different disciplines have different tolerances for errors and may assign different weights to grammatical errors while rating ESL compositions (Janopoulos, 1992; Santos, 1988; Vann, Meyer, & Lorenz, 1984; Vann et al., 1991).

Four studies have examined faculty members' responses to the writing errors of ESL students (Janopoulos, 1992; Santos, 1988; Vann et al., 1984; Vann et al., 1991). The data from all four studies revealed that professors considered such errors as word order, verb form, and relative clause errors to be among the most serious, and placed such errors as incorrect article and preposition usage at the "high end of the spectrum of tolerance" (Janopoulos, 1992, p. 116). In addition, these four studies also revealed that, among faculty members in the social sciences, education, humanities, biological and agricultural sciences, physical and mathematical sciences, and engineering, those from the social sciences were the most tolerant of ESL writing errors in general. However, the biggest limitation of the two studies completed by Vann et al. (1984) and Janopoulos (1992) is that the errors were considered at the sentence level rather than in extended discourse.

Rater training can also change raters' tolerance for grammatical errors. McDaniel (1985) found that "the judgments of graders who were not trained in evaluating ESL writing were dominated by error when they graded ESL essays" (quoted in Sweedler-Brown, 1993, p. 5). Sweedler-Brown (1993) investigated the rating of ESL papers by experienced English writing instructors with no training in evaluating ESL writing and found that "sentence-level error was the only significant influence on holistic score" and was the "critical factor" in causing ESL essays to fail (p. 12).

These studies suggest that a "double standard" may exist in terms of faculty tolerance of ESL writing errors, with some faculty apparently willing to make allowances for the efforts of ESL writers (Janopoulos, 1992). Ruetten (1994) suggested that, to provide more accurate holistic ratings, officials responsible for university composition programs need to clarify their standards regarding ESL students, especially on the "acceptability of error, and to train faculty members to read accordingly" (p. 94).

Rater Training. Rater training is an essential component of the essay rating process (Davidson, 1991; Weigle, 1994). Training provides raters with a clear conception of what a piece of quality writing looks like, and thus promotes rater consensus (Homburg, 1984). Training can also minimize differences caused by raters' different backgrounds (Jacobs et al., 1981; Reid & O'Brien, 1981) and modify expectations of good writing by clarifying for the raters both the task demands and writer characteristics (Huot, 1990). In fact, rater training is an issue that lies at the heart of both reliability and validity in ESL essay rating (Weigle, 1994). Homburg (1984) commented that holistic rating of ESL compositions, "with training to familiarize readers with the types of features present in ESL compositions, can be considered to be adequately reliable and valid" (p. 103).

Weigle (1994) investigated the effects of training on eight experienced and eight inexperienced raters of ESL placement compositions written by four ESL students at an American university. The results showed that four of the inexperienced raters gave scores after training that were different from the scores they had given to the same compositions before training. The think-aloud protocols of these four raters indicated that the changes in their scores could be attributed to the following three factors: 1) clarification of the rating criteria, 2) revision of expectations, and 3) concern for rater agreement. Analyses of raters' think-aloud protocols can be an approach to the study of rater training effects; however, such protocols are never a complete record of one's cognitive activity (Ericsson & Simon, 1993). Hence the changes in raters' scores from before to after training could also be attributed to other factors.

Weigle (1998) conducted a second, similar study to explore differences in rater severity and consistency among inexperienced and experienced raters of ESL placement compositions at an American university, both before and after rater training, using the Item Response Theory (IRT) program FACETS (Linacre, 1989). The analysis showed changes in rater behavior from before to after training. First, the spread of rater severity estimates was reduced, particularly for the inexperienced raters, indicating that the raters were more like each other after training than before training. Second, the training seemed to have brought the extreme raters "within a more tolerable range of severity," but still had not eliminated all differences in rater severity. Third, the fit statistics improved from before to after training. For example, the three raters who were highly inconsistent before training became quite consistent after training. Finally, the results showed that before training the inexperienced raters were more extreme in their severities than the experienced raters, and for the most part they were more severe as well. These results suggest that rater training can be more successful in helping essay raters to give more predictable scores than in getting them to give identical scores.

Types and Difficulty of Writing Tasks. Writing tasks of different types should be of equal difficulty, and different raters should rate essays of different difficulty levels consistently. However, in reality they may not (Brown, Hilgers, & Marsella, 1991). The types and difficulty of writing prompts, then, may affect raters' ratings of ESL compositions. Weigle (1999) investigated how experienced and inexperienced raters scored placement essays written by ESL students on two different prompts. One prompt was a graph, while the second prompt was a chart or table. In the case of the graph, the students were required to interpret the graph and make predictions based on the information contained in the graph. The chart or table prompt required the students to make and defend a choice based on information contained in the chart or table. The results showed that there were no significant differences in severity of marking between the experienced and inexperienced raters before training on the chart or table essay. However, the inexperienced raters were significantly more severe on the graph essays than were the experienced raters before training. After training, no significant differences were found between the two rater groups on either prompt.

Regarding the effects of the difficulty of writing prompts on raters' ratings of ESL compositions, Hamp-Lyons (1990) indicated that raters could compensate for more difficult questions by giving higher scores. In later research, Hamp-Lyons and Mathias (1994) confirmed that topics which were judged more difficult by experts received higher scores, and those which were judged easier received lower scores, indicating that essay raters were "consciously or unconsciously compensating" in their rating for relative prompt difficulty "based on their own, internalized, difficulty estimates" (p. 59).

Identified Research Gaps

Based upon the above review of 20 empirical studies on ESL writing assessments in North American school contexts, three major research gaps are evident. First, there was an imbalance across both school levels and countries. Most studies examined the factors affecting the rating of ESL compositions written by undergraduate and graduate students in colleges and universities. Very few studies, however, examined the factors affecting the rating of ESL compositions written by secondary school students. Among the reviewed studies, the study by McDaniel (1985) is the only one involving secondary school students. Undergraduate and graduate students usually come to study at North American universities on their own, and they need to pass English language proficiency tests before studying; hence they are quite proficient in English. However, secondary school students come to study in North America with their immigrant parents. Further, their English skills are not as well developed as those of university students (Blackett, 2002). Thus secondary ESL students likely have very different exposure to English and English writing. Further, most studies were conducted in American contexts. Only a few studies were conducted within Canadian contexts (Cumming, 1990a; McDaniel, 1985; Mendelsohn & Cumming, 1987; Sakyi, 2000). ESL students in these two countries are generally different in terms of their linguistic backgrounds. For example, in America there are a large number of Spanish-speaking ESL students, whereas there are a large number of French-speaking and Asian students in Canada (CBIE, 2002; IIE, 2001). ESL students' native language affects their English writing (Shaw & Liu, 1998; Yang, 2001).

Second, the generalizability of most studies was limited due to their unrepresentative sampling (McDaniel, 1985; Sakyi, 2000; Santos, 1988; Song & Caruso, 1996). For example, Song and Caruso (1996) sampled two ESL compositions written by Russian speakers in their study. Similarly, Santos (1988) sampled two compositions written by two Asian students in order to investigate the effects of a rater's discipline on the rating of ESL compositions. Such sampling was not intended to be representative, limiting the interpretations that can be made and affecting the generalizability of their results.

Finally, due to the quantitative nature of analyzing and reporting ESL writing scores, most reviewed studies used quantitative approaches. However, in the investigation of empirical evidence for rater variation, most reviewed quantitative studies were based on the framework of CTT. In CTT, the ratio of true score variance to observed score variance is defined as reliability (Crocker & Algina, 1986). The reliability estimates of scores derived from various holistic and analytical scoring methods reported in the literature have relied on CTT (Brown, 1991; Reid & O'Brien, 1981), which accounts for only a single source of variance within a given analysis. Generalizability (G-) theory (Cronbach, Gleser, Nanda, & Rajaratnam, 1972) is a more powerful quantitative approach for detecting rater variation in the field of performance assessment (Linn & Burton, 1994). It extends the framework of CTT in order to take into account the multiple sources of variability that can have an effect on test scores. It can identify the sources of variance and error and estimate the impact of these variance components on scoring accuracy, allowing the investigator to consider numerous applications of an instrument (Shavelson & Webb, 1991). Therefore, G-theory provides a comprehensive conceptual framework and methodology for analyzing more than one measurement facet simultaneously in investigations of assessment error and score dependability (Brennan, 2001). However, none of the reviewed studies has used G-theory for detecting rater variation.
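To illustrate the kind of analysis G-theory supports, the sketch below estimates variance components for a fully crossed persons-by-raters (p x r) design and computes a generalizability coefficient. It is a minimal, hypothetical example with made-up scores, intended only to show how rater variance can be separated from examinee variance; it is not a reanalysis of any study reviewed above.

```python
import numpy as np

# Hypothetical ratings: rows = examinees (p), columns = raters (r); values are essay scores.
scores = np.array([
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 6],
    [3, 2, 3],
    [4, 4, 5],
], dtype=float)

n_p, n_r = scores.shape
grand = scores.mean()
person_means = scores.mean(axis=1)
rater_means = scores.mean(axis=0)

# Mean squares for a two-way crossed design with one observation per cell.
ms_p = n_r * np.sum((person_means - grand) ** 2) / (n_p - 1)
ms_r = n_p * np.sum((rater_means - grand) ** 2) / (n_r - 1)
resid = scores - person_means[:, None] - rater_means[None, :] + grand
ms_pr = np.sum(resid ** 2) / ((n_p - 1) * (n_r - 1))

# Variance components (negative estimates would normally be truncated at zero).
var_pr = ms_pr                   # person-by-rater interaction, confounded with error
var_p = (ms_p - ms_pr) / n_r     # universe-score (person) variance
var_r = (ms_r - ms_pr) / n_p     # rater (severity) variance

# Generalizability coefficient for relative decisions with n_r raters per essay.
g_coef = var_p / (var_p + var_pr / n_r)
print(f"person: {var_p:.3f}  rater: {var_r:.3f}  p x r,e: {var_pr:.3f}  G: {g_coef:.3f}")
```

For absolute decisions, the rater variance component would also enter the error term, illustrating how this framework separates sources of error that a single CTT reliability coefficient lumps together.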
Conclusion

This review paper explores various factors that affect the accuracy, reliability, and validity of the scoring of ESL compositions in North American school contexts. The reviewed empirical studies indicate that scoring criteria have become a major concern for the construct validity and reliability of the scoring of ESL writing (Ferris, 1994; Mendelsohn & Cumming, 1987; Sakyi, 2000). No matter what scoring methods they use, different raters use different criteria to rate ESL compositions (Russikoff, 1995; Song & Caruso, 1996). Therefore, the rater is a factor that has an important impact on the rating of ESL writing. Rater-related factors include the rater's linguistic background, discipline, perceptions and expectations, professional experience, tolerance for error, and rater training (Brown, 1991; Janopoulos, 1992; Mendelsohn & Cumming, 1987; Santos, 1988; Weigle, 1994, 1998; Weigle et al., 2003). Finally, the type and difficulty of writing tasks and the rater-task interaction also affect raters' rating of ESL writing (Sakyi, 2000; Weigle, 1999). The results of these studies have confirmed the central argument of this paper: to some extent, ESL compositions are not scored as fairly as they should be.

In order to thoroughly examine the factors that influence raters as they assess ESL students' writing in North American schools, future empirical investigations need to keep a balance between school levels and assessment types. For example, factors affecting classroom teachers' assessment of ESL writing in secondary and graduate school settings, and raters' rating of ESL writing in large-scale provincial/state assessment contexts, are under-researched. As an increasing number of ESL students are educated in North American schools, it is important to explore these factors that have an impact on the assessment of their writing and to ensure that the assessment of their writing is fair.

References

American Educational Research Association (AERA), American Psychological Association (APA), and National Council on Measurement in Education (NCME). (1999). Standards for educational and psychological testing. Washington, DC: American Psychological Association.
Bachman, L. (2000). Modern language testing at the turn of the century: Assuring that what we count counts. Language Testing, 17(1), 1-42.

Bachman, L. (1990). Fundamental considerations in language testing. Oxford: Oxford University Press.
Blackett, K. (2002). Ontario schools losing English as a second language programs - despite increase in immigration. Retrieved August 31, 2004, from http://www.peopleforeducation.com/releases/2003/oct24_02.html
Brennan, R. L. (2001). Statistics for social science and public policy: Generalizability theory. New York: Springer-Verlag.
Brown, J. D. (1991). Do English and ESL faculties rate writing samples differently? TESOL Quarterly, 25(4), 587-603.
Brown, J. D., Hilgers, T., & Marsella, J. (1991). Essay prompts and topics. Written Communication, 8, 533-556.
Casanave, C. P., & Hubbard, P. (1992). The writing assignments and writing problems of doctoral students: Faculty perceptions, pedagogical issues, and needed research. English for Specific Purposes, 11, 33-49.
Canadian Bureau for International Education. (2002, April 15). International student numbers hit record high, but Canada offers dwindling support for African students. Retrieved October 28, 2002, from http://www.cbie.ca/news/index_e.cfm?folder=releases&page=rel_2002-04-15_e
Cole, N. S., & Moss, P. A. (1989). Bias in test use. In R. L. Linn (Ed.), Educational measurement (pp. 201-219). New York: Macmillan.
Cole, N. S., & Zieky, M. J. (2001). The new faces of fairness. Journal of Educational Measurement, 38, 369-382.
Connor-Linton, J. (1995). Looking behind the curtain: What do L2 composition ratings really mean? TESOL Quarterly, 29(4), 762-765.
Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. New York: Holt, Rinehart and Winston.
Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measurements: Theory of generalizability for scores and profiles. New York: Wiley.
Cumming, A. (1990a). Application of contrastive rhetoric in advanced ESL writing. Paper presented at the 24th Annual TESOL Conference, San Francisco, CA.
Cumming, A. (1990b). Expertise in evaluating second language composition. Language Testing, 7, 31-51.
Davidson, F. (1991). Statistical support for training in ESL composition rating. In L. Hamp-Lyons (Ed.), Assessing second language writing (pp. 155-165). Norwood, NJ: Ablex.
Education Quality and Accountability Office. (2002). Ontario Secondary School Literacy Test, February 2002: Report of provincial results. Toronto: Queen's Printer for Ontario.
Engber, C. A. (1995). The relationship of lexical proficiency to the quality of ESL compositions. Journal of Second Language Writing, 4, 139-155.
Ericsson, K. A., & Simon, H. (1993). Protocol analysis: Verbal reports as data. Cambridge, MA: MIT Press.
Ferris, D. (1994). Rhetorical strategies in student persuasive writing: Differences between native and non-native English speakers. Research in the Teaching of English, 28, 45-65.
Gamaroff, R. (2000). Rater reliability in language assessment: The bug of all bears. System, 28, 31-53.
Hamp-Lyons, L. (1990). Second language writing: Assessment issues. In B. Kroll (Ed.), Second language writing: Research insights for the classroom (pp. 69-87). Cambridge: Cambridge University Press.
Hamp-Lyons, L. (1991a). Issues and directions in assessing second language writing in academic contexts. In L. Hamp-Lyons (Ed.), Assessing second language writing in academic contexts (pp. 323-329). Norwood, NJ: Ablex.
Hamp-Lyons, L. (1991b). Rating procedures for ESL contexts. In L. Hamp-Lyons (Ed.), Assessing second language writing in academic contexts (pp. 241-277). Norwood, NJ: Ablex.
Hamp-Lyons, L. (1996). The challenges of second language writing assessment. In E. White, W. Lutz, and S. Kamusikiri (Eds.), Assessment of writing: Policies, politics, practice (pp. 226-240). New York: Modern Language Association.
Hamp-Lyons, L., & Mathias, S. P. (1994). Examining expert judgments of task difficulty on essay tests. Journal of Second Language Writing, 3(1), 49-68.
Hayward, M. (1990). Evaluations of essay prompts by nonnative speakers of English. TESOL Quarterly, 24, 753-758.
Hinds, J. (1987). Reader versus writer accountability: A new typology. In U. Connor & R. Kaplan (Eds.), Writing across languages: Analysis of L2 text (pp. 141-152). Reading, MA: Addison-Wesley.
Hinkel, E. (2003). Simplicity without elegance: Features of sentences in L1 and L2 academic texts. TESOL Quarterly, 37, 275-301.
Homburg, T. (1984). Holistic evaluation of ESL compositions: Can it be validated objectively? TESOL Quarterly, 18(1), 87-107.
Huot, B. A. (1990). Reliability, validity, and holistic rating: What we know and what we need to know. College Composition and Communication, 41, 201-213.
Institute of International Education. (2001, May 16). 98/99 open doors on the Web. Retrieved June 15, 2002, from http://www.opendoorsweb.org/Lead%20Stories/international_studs.htm
Intaraprawat, P., & Steffensen, M. S. (1995). The use of metadiscourse in good and poor ESL essays. Journal of Second Language Writing, 4(3), 253-272.
Jacobs, H. L., Zinkgraf, S. A., Wormuth, D. R., Hartfiel, V. F., & Hughey, J. B. (1981). Testing ESL composition: A practical approach. Rowley, MA: Newbury House.
Janopoulos, M. (1992). University faculty tolerance of NS and NNS writing errors: A comparison. Journal of Second Language Writing, 1(2), 109-121.
Janopoulos, M. (1995). Writing across the curriculum, writing proficiency exams, and the NNS college student. Journal of Second Language Writing, 4, 43-50.
Johns, A. M. (1991). Interpreting an English competency examination: The frustrations of an ESL science student. Written Communication, 8(3), 379-401.
Johnson, R. L., Penny, J., & Gordon, B. (2000). The relation between score resolution methods and interrater reliability: An empirical study of an analytic rating rubric. Applied Measurement in Education, 13(2), 121-138.
Joint Advisory Committee. (1993). Principles for fair student assessment practices for education in Canada. Edmonton, AB.
Kobayashi, T. (1992). Native and nonnative reactions to ESL compositions. TESOL Quarterly, 26, 81-112.
Kunnan, A. J. (1997). Connecting fairness and validation. In A. Huhta, V. Kohonen, L. Kurki-Suomo, & S. Luoma (Eds.), Current developments and alternatives in language assessment (pp. 85-105). Jyvaskyla, Finland.
Kunnan, A. J. (Ed.). (2000). Fairness and validation in language assessment. Cambridge: Cambridge University Press.
Linacre, J. M. (1989). Many-facet Rasch measurement. Chicago, IL: MESA Press.
Linn, R. L. (Ed.). (1989). Educational measurement (3rd ed.). New York: Macmillan.
Linn, R. L., & Burton, E. (1994). Performance-based assessment: Implications of task specificity. Educational Measurement: Issues and Practice, 13, 5-8.
McDaniel, B. A. (1985). Ratings vs. equity in the evaluation of writing. Paper presented at the annual meeting of the Conference on College Composition and Communication, Minneapolis, MN. (ERIC Document Reproduction Service No. ED 260 459)
Mendelsohn, D., & Cumming, A. (1987). Professors' ratings of language use and rhetorical organization in ESL compositions. TESL Canada Journal, 5, 9-26.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (pp. 13-103). New York: Macmillan.
Ostler, S. (1987). English in parallels: A comparison of English and Arabic prose. In U. Connor & R. Kaplan (Eds.), Writing across languages: Analysis of L2 text (pp. 169-185). Reading, MA: Addison-Wesley.
Perkins, K. (1983). On the use of composition rating techniques, objective measures, and objective tests to evaluate ESL writing ability. TESOL Quarterly, 17(4), 651-671.
Reid, J., & O'Brien, M. (1981). The application of holistic grading in an ESL writing program. Paper presented at the annual convention of Teachers of English to Speakers of Other Languages, Detroit, MI. (ERIC Document Reproduction Service No. ED 221 044)
Rubin, D. L., & Williams-James, M. (1997). The impact of writer nationality on mainstream teachers' judgments of composition quality. Journal of Second Language Writing, 6(2), 139-153.
Ruetten, M. K. (1994). Evaluating ESL students' performance on proficiency exams. Journal of Second Language Writing, 3, 85-96.
Russikoff, K. A. (1995). A comparison of writing criteria: Any differences? Paper presented at the annual meeting of the Teachers of English to Speakers of Other Languages, Long Beach, CA.
Sakyi, A. (2000). Validation of holistic rating for ESL writing assessment: How raters evaluate ESL compositions. In A. Kunnan (Ed.), Fairness and validation in language assessment (pp. 129-152). Cambridge: Cambridge University Press.
Santos, T. (1988). Professors' reactions to the writing of nonnative-speaking students. TESOL Quarterly, 22(1), 69-90.
Shavelson, R. J., & Webb, N. M. (1991). Generalizability theory: A primer. Newbury Park, CA: Sage.
Shaw, P., & Liu, E. T.-K. (1998). What develops in the development of second language writing. Applied Linguistics, 19, 225-254.
Song, B., & Caruso, I. (1996). Do English and ESL faculty differ in evaluating the essays of native English-speaking and ESL students? Journal of Second Language Writing, 5(2), 163-182.
Spearman, C. (1904). The proof and measurement of association between two things. American Journal of Psychology, 15, 72-101.
Speck, B. W., & Jones, T. R. (1998). Direction in the grading of writing? In F. Zak and C. C. Weaver (Eds.), The theory and practice of grading: Problems and possibilities (pp. 17-29). Albany: SUNY Press.
Stiggins, R. J., & Bridgeford, N. J. (1983). An analysis of published tests of writing proficiency. Educational Measurement: Issues and Practices, 2(1), 6-19.
Sweedler-Brown, C. O. (1993). ESL essay evaluation: The influence of sentence-level and rhetorical features. Journal of Second Language Writing, 2, 3-17.
Thompson, R. (1990). Writing-proficiency tests and remediation: Some cultural differences. TESOL Quarterly, 24, 99-102.
Vann, R., Lorenz, F., & Meyer, D. (1991). Error gravity: Faculty response to errors in the written discourse of non-native speakers of English. In L. Hamp-Lyons (Ed.), Assessing second language writing in academic contexts (pp. 181-195). Norwood, NJ: Ablex.
Vann, R., Meyer, D., & Lorenz, F. (1984). Error gravity: A study of faculty opinion of ESL errors. TESOL Quarterly, 18, 427-440.
Vaughan, C. (1991). Holistic assessment: What goes on in the raters' minds? In L. Hamp-Lyons (Ed.), Assessing second language writing in academic contexts (pp. 111-126). Norwood, NJ: Ablex.
Weigle, S. C. (1994). Effects of training on raters of ESL compositions. Language Testing, 11, 197-223.
Weigle, S. C. (1998). Using FACETS to model rater training effects. Language Testing, 15(2), 263-287.
Weigle, S. C. (1999). Investigating rater/prompt interactions in writing assessment: Quantitative and qualitative approaches. Assessing Writing, 6(2), 145-178.
Weigle, S. C., Boldt, H., & Valsecchi, M. I. (2003). Effects of task and rater background on the evaluation of ESL writing: A pilot study. TESOL Quarterly, 37(2), 345-354.
Wiggins, G. (1993). Assessing student performance: Exploring the purpose and limits of testing. San Francisco, CA: Jossey-Bass.
Yang, Y. (2001). Chinese interference in English writing: Cultural and linguistic differences. (ERIC Document Reproduction Service No. ED 461 992)