Native- and Nonnative-Speaking EFL Teachers' Evaluation of Chinese Students' English Writing
This study examined differences between native and nonnative EFL (English as a Foreign Language) teachers' ratings of the English writing of Chinese university students. I explored whether two groups of teachers, expatriates who typically speak English as their first language and ethnic Chinese teachers with proficiency in English, gave similar scores to the same writing task and used the same criteria in their judgements. Forty-six teachers, 23 Chinese and 23 English-background, rated 10 expository essays using a 10-point scale, then wrote and ranked three reasons for their ratings. I coded their reported reasons as positive or negative criteria under five major categories: general, content, organization, language and length. MANOVA showed no significant differences between the two groups in their scores for the 10 essays. Chi-square tests, however, showed that the English-background teachers attended more positively in their criteria to the content and language, whereas the Chinese teachers attended more negatively to the organization and length of the essays. The Chinese teachers were also more concerned with content and organization in their first criteria, whereas English-background teachers focused more on language in their third criteria. The results raise questions about the validity of holistic ratings as well as the underlying differences between native and nonnative EFL teachers in their instructional goals for second language (L2) writing.
I Introduction
The fairness and construct validity of the ratings of different groups
of raters using the same scale in assessing ESL/EFL (English as a
Second/Foreign Language) writing is a major concern in second
language (L2) writing instruction and testing (Hamp-Lyons, 1991a;
Connor-Linton, 1995a; Silva, 1997). Exploring whether L2 writers' texts meet the expectations of native English readers, a growing number of studies have investigated whether native English speakers (NES) and nonnative English speakers (NNS) share the same judgement of the students' writing (James, 1977; Hughes and Lascaratou, 1982; Machi, 1988; Santos, 1988; Kobayashi, 1992; Hinkel, 1994;
Address for correspondence: Ling Shi, Department of Language and Literacy Education, 2034 Lower Mall Road, UBC, Vancouver, BC, Canada V6T 1Z2; email: Ling.Shi@ubc.ca
Language Testing 2001 18 (3) 303–325
2 Teacher raters
The 10 written samples were first sent, together with a letter of invitation to participate in the study and a background questionnaire, to
70 English expatriates teaching in various tertiary institutions in all
parts of China. All of them had then been teaching in China for a
minimum of six months and a maximum of about a year. The name
list was provided by Amity Foundation, a Christian organization that
helps Chinese universities to recruit native English teachers. Twenty-four of the NES EFL teachers responded to the letter and sent back
the completed questionnaires and their quantitative and qualitative
evaluations of the essays. I then invited 24 EFL NNS teachers, who
were either working or taking an in-service teacher training program
in the university where data were collected, to evaluate the same set
of essays. In each of the groups, there was one rater who missed
ratings for several of the essays, which made for an equal number of
23 raters in each group for the study.
The information summarized from the questionnaire suggested that
the participating raters were mostly experienced teachers with some
professional training. At the time of data collection, the 46 raters, as
they reported in the questionnaires, were teaching at 23 tertiary institutes in 12 cities of China. Of the 23 NES teachers, 14 were US
citizens, 5 British, 2 Canadian, and 2 Norwegians. (The two Norwegians are considered NESs in this study because they were expatriates
who had learnt and used English since elementary school. With
emphasis on multilingualism in the school system and English being
the second most important language, many educated Norwegians are
bilinguals.) Summarizing the rater profiles, Table 1 shows that more
than half (frequency of 31, 67%) of the raters were female and 15
(33%) were male. In terms of educational background, about half (frequency of 23, 50%) had a master's degree and 17 (37%) had an undergraduate degree. Most raters (frequency of 34, 74%) said they had teacher training, whereas 11 (8 NNSs and 3 NESs) reported that they had no such training. Of the 46 raters, 18 (39.1%) had taught English for 1 to 5 years, 11 (23.9%) for 6 to 10 years, and 17 (37%) for over 10 years. In general, the NES group appeared to be more educated than the NNSs (4 NES teachers had a PhD whereas 1 NNS had only a high school diploma), although the latter seemed to have more teaching experience than the former (12 NNSs had taught for over 10 years compared with only 5 NESs with similar experience).
Table 1  Rater profiles

Categories                        English   Chinese   Total   Percentage   Missing cases

Gender
  Male                                6         9        15       32.6
  Female                             17        14        31       67.4
  Total                              23        23        46      100.0            0

Degrees
  High school                         –         1         1        2.2
  Graduate                            7        10        17       37.0
  Master                             12        11        23       50.0
  PhD                                 4         –         4        8.7
  Total                              23        22        45       97.9            1

Teacher training
  Yes                                20        14        34       73.9
  No                                  3         8        11       23.9
  Total                              23        22        45       97.8            1

Years of teaching English
  1–5 years                          10         8        18       39.1
  6–10 years                          8         3        11       23.9
  Over 10 years                       5        12        17       37.0
  Total                              23        23        46      100.0            0
3 Procedures
I asked the participating teachers to rate the 10 compositions holistically using a 10-point scale. The teachers were told that the purpose
of the research was to compare how teachers evaluated English writing of Chinese university students. Following previous research
(Cumming, 1990; Connor-Linton, 1995b; Kobayashi and Rinnert,
1996), no criteria or analytical categories for the rating scale were
provided with the purpose of finding out how these teachers defined
the criteria themselves. In addition, to find out whether these teachers
made different qualitative judgments, the teachers were asked to state
three reasons or comments, in the order of importance, for their ratings of each essay. (For evaluation instructions, see Appendix 2.)
With no pre-determined criteria, the teachers were expected to make
and weight qualitative judgments based on their own beliefs of how
salient certain writing elements were in L2 writing evaluation.
Finally, the raters completed a questionnaire about their teaching
experiences and educational backgrounds. Although similar in its
basic design to Connor-Linton's (1995b) study, asking two groups of
teachers to rate 10 essays and state three reasons (for justifications
of these methodological choices, see Connor-Linton, 1995b: 102), the
present study, apart from subject identities and context, differed from
the previous study in that all the teachers rated the essays holistically
Ratings of the 10 essays by the NES (n = 23) and NNS (n = 23) raters: means, standard deviations, and ranking of essays

                NES                 NNS                 Ranking of essays
Essay       M        sd         M        sd           NES        NNS
1          6.69     2.27       6.83     1.40            6          7
2          6.29     1.20       7.14     1.14            7          5
3          6.96     1.52       7.35     1.17            5          3
4          5.14     1.77       6.07     1.54           10          9
5          7.50     1.20       7.65     1.15            2          2
6          5.47     1.86       5.35     0.92            8         10
7          7.21     1.99       7.31     1.78            3          4
8          7.06     1.51       6.91     1.01            4          6
9          8.14     1.25       7.87     1.04            1          1
10         5.64     2.20       6.59     1.47            9          8
Group M    5.98     1.68       6.19     1.26
(group SDs of 1.26 vs. 1.68). This indicates that the scores given by the NNS teachers were more homogeneous, or less spread out, than those given by the NES teachers, although individual members within the NNS group were less consistent, as suggested by the lower reliability of the group reported earlier. In contrast, the present data suggest that the NES group, while maintaining high reliability, also used a wider range of average scores than the NNSs (from 5.14 to 8.14 for NESs compared with 5.35 to 7.87 for NNSs). This tendency towards greater variation among NES raters has also been documented by previous researchers (e.g., Hamp-Lyons, 1989; Vaughan, 1991; Zhang, 1999). Together with previous studies, the present study suggests that NES raters were perhaps either more willing to take risks in awarding scores at the endpoints of the scale or better able to make finer distinctions among levels of ability in rating ESL/EFL essays than their NNS colleagues.
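The spread comparison above can be verified directly from the per-essay means reported in the table. The sketch below is my own illustration using only those published values (the variable names are mine, not the study's):

```python
# Per-essay mean ratings for the 10 essays, copied from the ratings table.
nes = [6.69, 6.29, 6.96, 5.14, 7.50, 5.47, 7.21, 7.06, 8.14, 5.64]
nns = [6.83, 7.14, 7.35, 6.07, 7.65, 5.35, 7.31, 6.91, 7.87, 6.59]

# Range of average scores each rater group used across the essays:
# the NES group spans a wider band (5.14 to 8.14) than the NNS group (5.35 to 7.87).
nes_range = round(max(nes) - min(nes), 2)  # 8.14 - 5.14
nns_range = round(max(nns) - min(nns), 2)  # 7.87 - 5.35

print(nes_range, nns_range)
```

The wider NES range is the numerical basis for the claim that NES raters used more of the scale's endpoints than their NNS colleagues.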
3 Comparison of qualitative judgements: raters' positive and negative comments
Figure 1 compares the raters' self-reported positive and negative comments, grouped into 12 categories. The statistically significant overall chi-square values for the frequencies of the 12 types of positive and negative comments indicate that NES and NNS teachers were systematically influenced by different qualitative judgements in their ratings (chi-square = 54.03 and 51.01, d.f. = 11, p < 0.001). In line with previous research (Brown, 1991; Connor-Linton, 1995b), the present study implies that these two groups of teacher raters arrived at their ratings of the 10 essays through different qualitative judgements.
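The reported d.f. = 11 is what a Pearson chi-square on a 2 × 12 contingency table (two rater groups by 12 comment categories) yields. The paper's raw frequency counts appear only in Figure 1, so the counts below are hypothetical placeholders, and the helper function is my own stdlib sketch rather than the study's analysis code:

```python
def chi2_2xk(row1, row2):
    """Pearson chi-square statistic and d.f. for a 2 x k contingency table."""
    k = len(row1)
    col_totals = [a + b for a, b in zip(row1, row2)]
    n1, n2 = sum(row1), sum(row2)
    n = n1 + n2
    chi2 = 0.0
    for j in range(k):
        for observed, row_total in ((row1[j], n1), (row2[j], n2)):
            expected = row_total * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2, k - 1  # d.f. = (2 - 1) * (k - 1)

# Hypothetical comment frequencies over the 12 categories (NOT the paper's data).
nes_counts = [30, 25, 40, 60, 20, 10, 5, 15, 12, 35, 20, 3]
nns_counts = [35, 30, 55, 45, 30, 15, 8, 10, 8, 20, 15, 6]

chi2, dof = chi2_2xk(nes_counts, nns_counts)
print(dof)  # 11, matching the d.f. reported for the 12-category comparison
```

With the study's actual Figure 1 counts substituted in, this is the form of test that produced the reported chi-square values of 54.03 and 51.01.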
[Figure 1. Frequencies of the NES and NNS raters' positive and negative comments in the 12 categories (general; content: general, ideas, arguments; organization: general, paragraphs, transitions; language: general, intelligibility, accuracy, fluency; length)]

[Figure 2. Frequencies of the NES and NNS raters' comments in the 12 categories by ranked importance (Reason 1, Reason 2, Reason 3)]

Since the teachers were asked to record their three
reasons or comments in order of importance, I expected a comparison
of the ranking of the comments to reveal how these teachers weighted
various writing elements in their evaluations. An initial inspection of
the data suggests that in their rank ordering of the 12 categorical
comments, the two groups of teachers varied from primary focus on
the quality of content and organization to a lesser focus on language.
As Figure 2 illustrates, in the comments they ranked as first in
importance, both groups of teachers focused primarily on the content,
particularly the ideas and arguments of the essays, although the NNS
teachers were found to attend more than the NES teachers to ideas
(chi-square = 5.15 and 4.19, d.f. = 1, p < 0.05). There was also a fair
amount of attention among both groups of teachers to the general
organization of the essays in their most important comments, although
Coding categories for the raters' comments: sub-categories, definitions, and examples of positive/negative comments

General
  General: General comments on overall quality of writing.
    Examples: "Well written." / "It fails to complete the task."

Content
  General: General comments on content.
    Examples: "Good contents." / "Content shallow."
  Ideas: General or specific comments on ideas and thesis.
    Examples: "Good ideas." / "Poor idea." / "Thesis clearly stated." / "No thesis statement."
  Arguments: General or specific comments on aspects of arguments such as balance, use of comparison, counter-arguments, support, use of details or examples, clarity, unity, maturity, originality, relevance, logic, depth, objectivity, conciseness, development and expression.
    Examples: "Good argument." / "Poor argument." / "Arguments balanced." / "Lack of arguments on the newspaper issue." / "Arguments well supported." / "Arguments not very well supported." / "Logical argument." / "No logic in reasoning." / "Argument well developed." / "Limited development of argument."

Organization
  General: General comments on organization.
    Examples: "Excellent organization." / "Weak organization."
  Paragraphs: Comments on the macro level concerning paragraphs, introductions, and conclusions.
    Examples: "Paragraphs are well arranged." / "Paragraphs are poorly organized." / "Good introduction." / "Poor introduction." / "Unique conclusion." / "No conclusion."
  Transitions: Comments on transitions, coherence and cohesion.
    Examples: "Good transitions." / "Bad use of transition words." / "Coherent." / "Lacks coherence." / "Cohesive." / "Lack of connectives."

Language
  General: General comments on language.
    Examples: "Language good." / "Poor English."
  Intelligibility: Comments on whether the language is clear or easy to understand.
    Examples: "Easy to read and follow." / "Meaning unclear."
  Accuracy: General comments on accuracy or specific comments on word use, grammar and mechanics.
    Examples: "Accurate language." / "Too many errors." / "Good use of words." / "Bad diction." / "Good grammar." / "Poor grammar." / "Few errors of spelling and punctuation." / "Too many errors in mechanics."
  Fluency: Comments on fluency, conciseness, maturity, naturalness, appropriateness, and vividness of language.
    Examples: "Fluent language." / "Language not smooth." / "Concise language." / "Wordy." / "Complex sentences." / "Use of language immature." / "Idiomatic English." / "Chinglish." / "Voice not appropriate." / "Vivid language."

Length
  Length: Comments on length.
    Examples: "About 250 words." / "Too short."