Professional Documents
Culture Documents
978-1-4438-8590-4-sample
978-1-4438-8590-4-sample
in Language
Evaluation,
Assessment
and Testing
Current Issues
in Language
Evaluation,
Assessment
and Testing:
Edited by
All rights for this book reserved. No part of this book may be reproduced,
stored in a retrieval system, or transmitted, in any form or by any means,
electronic, mechanical, photocopying, recording or otherwise, without
the prior permission of the copyright owner.
Abstract
This study examined item-level data from fifty 30-item cloze tests that
were randomly administered to university-level examinees from Japan (N
= 2,298) and Russia (N = 5,170). A single 10-item anchor cloze test was
also administered to all students. The analyses investigated differences
between the two nationalities in terms of both classical test theory (CTT)
item analysis and multifaceted Rasch analysis (the latter allowed us to
estimate test-taker ability and item difficulty measures and fit statistics
simultaneously across 50 cloze tests separately and combined for the two
nationalities). The results indicated that considerably larger proportions of
items functioned well in the Rasch item analyses than in the traditional
CTT item analysis. Rasch analyses also turned out to be more appropriate
for our cloze test analysis and revision purposes than did traditional CTT
item analyses. Linguistic analyses of items that fit the Rasch model
revealed that blanks representing certain categories of words (i.e., function
words rather than content words, and Germanic-origin words rather than
Latin-origin words), and to a greater extent relatively high frequency
words were more likely to work well for norm-referenced test (NRT)
purposes. In addition, this study found that different items were
functioning well for the two nationalities.
How Well do Cloze Items Work and Why? 3
Introduction
Taylor (1953) first proposed the use of cloze tests for evaluating the
readability of reading materials in US elementary schools. In the 60s and
70s, a number of studies appeared on the usefulness of cloze for English as
a second language (ESL) proficiency or placement testing (see Alderson,
1979 for a summary of this early ESL research). Since then, as Brown
(2013) noted, this research on using cloze in ESL proficiency or placement
testing has continued, but has been inconsistent at best with reported
reliability estimates ranging from .31 to .96 and criterion-related validity
coefficients ranging from .43 to .91.
While the literature has focused predominantly on fixed interval (i.e.,
every nth word) deletion cloze tests, other bases have been used for
developing cloze tests. For example, rational deletion cloze was
developed by selecting blanks based on word classes (cf., Bachman, 1982,
1985; Markham, 1985). Tailored cloze involved using classical test theory
(CTT) item analysis techniques to select items and thereby create cloze
tests tailored to a particular group of students (cf. Brown, 1988, 1989,
2013; Brown, Yamashiro, & Ogane, 1999, 2001; Revard, 1990).
For the most part, cloze studies have been based on CTT. However,
Item Response Theory (IRT), including Rasch analysis, has been applied
to cloze in a few cases. Baker (1987) used Rasch analysis to examine a
dichotomously scored cloze test and found that “observed and expected
item characteristic curves show reasonable conformity, though with some
instances of serious misfit…no evidence for departure from unidimensionality
is found for the cloze data…” (p. iv). Hale, Stansfield, Rock, Hicks,
Butler, and Oller (1988) found that IRT provided stable estimates for cloze
in their study of the degree to which groups of cloze items related to
different subparts of the overall Test of English as a Foreign Language
(TOEFL). Hoshino and Nakagawa (2008) used IRT in developing a
multiple-choice cloze test authoring system. Lee-Ellis (2009) used Rasch
analysis in developing and validating a Korean C-test. However, Rasch
analysis has not been used to study the effectiveness of individual items.
The Study
Certainly, no work has investigated the degree to which cloze items
function well when analyzed using both CTT and IRT frameworks, and
little research has examined the functioning of cloze items in terms of their
linguistic characteristics. To address these issues and others, the following
research questions were posed, all focusing on the individual items
4 Chapter One
Participants
A total of 7,468 English as a foreign language (EFL) students
participated in this study: 2,298 of these EFL students were studying at 18
different universities in Japan as part of their normal classroom activities;
the remaining 5,170 EFL students were studying at 38 universities in
Russia (see Appendix 1-A for a list of the participating universities in both
countries). In Japan, about 38.3% of the participants were women, and
61.7% of the participants were men; in Russia, 71.7% of the participants
were women, and 28.0% were men, with the remaining 0.3% giving no
response. The participants in Japan were between 18-24 years old, while in
Russia they were between 14-46 years old. The data from Japan were
collected as part of Brown (1993 & 1998); the data from Russia were
collected in 2012-2013 and served as the basis of Brown, Janssen, Trace,
and Kozhevnikova (2013). Though these samples were convenience
samples (i.e., not randomly selected), they were relatively large, which is
important as this sample size permits robust analyses of these cloze data.
It is critically important to stress that in this study we are interested in
how linguistic background affects different analyses; we do not make any
claims for the generalizability of these results to the EFL populations of all
undergraduate students in university-level institutions in these countries.
In fact, we want to stress that the samples from Japan and Russia cannot
be said to be comparable given the sampling procedures, the very different
proportions of university seats per million people available in the two
How Well do Cloze Items Work and Why? 5
Measures
The 50 cloze tests used in this study were first created and used for
Brown (1993). The 50 passages were randomly selected from among the
adult-level books at a public library in Florida. Passages were chosen from
each book by randomly selecting a page then working backwards for a
reasonable starting place. Passages were between 366-478 words long
with a mean of 412.1 words. Each passage contained 30 items, and the
deletion pattern was every 12th word, which created a fairly high degree of
item independence relative to the more typical 7th-word deletion pattern.
The first and last sentences of all passages were left intact to provide
context. Appendix 1-B shows the layout of the directions, example items,
and answer key.
A 10-item cloze passage was also administered to all participants to act
as anchor items (i.e., items that provide a common metric for making
comparisons across tests and examinee samples). This anchor-item cloze
was first created in a study by Brown (1989), wherein it was found that
these 10 items were functioning effectively.
To check the degree to which the English in the cloze passages was
representative of typical written English, the lexical frequencies for all 50
passages combined were calculated (see Appendix 1-C) and compared to
the frequencies reported for the same words in the well-known Brown
Corpus (Francis & Kučera, 1979, 1982; Kučera & Francis, 1967). We felt
justified in comparing the 50 passages to this particular corpus for two
reasons. First, following Stubbs (2004), though the Brown Corpus is
relatively small, it is
“still useful because of their careful design ... one million words of written
American English, sampled from texts published in 1961: both informative
prose, from different text types (e.g., press and academic writing), and
different topics (e.g., religion and hobbies); and imaginative prose (e.g.,
detective fiction and romance).” (p. 111)
(Brown, 1998). Thus, we felt reasonably certain that these passages and
cloze items were representative of the written English language, or at a
minimum the genres of English found in US public library books.
Procedures
The 50 cloze tests were distributed to intact classes by teachers such
that every student had an equal chance of receiving each of the 50 cloze
test passages. In Japan, 42-50 participants completed each cloze test, with
a mean of 46.0 participants completing each passage. In Russia, 90-122
completed each cloze test (Mean = 103.4). All examinees in both countries
completed the 10-item anchor cloze. Twenty-five minutes were allowed
for completing the tests. Exact-answer scoring was used (i.e., only the
word found in the original text was counted as correct). This was done for
two reasons: (a) we wanted each item to be interpretable as fillable by a
single lexical item for analysis purposes; and (b) with the hundreds of
items and thousands of examinees in this study, using an acceptable-
answer scoring or any other of the available scoring schemes would
clearly have been beyond our resources.
Analyses
Initially, CTT statistics were used to analyze the cloze test data. These
statistics included: the mean, standard deviation, minimum and maximum
scores, reliability, item facility, and item discrimination. Rasch analyses
were also used in this study to calculate item difficulty measures and to
identify misfitting test items. We used FACETS (Linacre, 2014a) analysis
rather than WINSTEPS because the former allowed us to easily analyze
our nested design (i.e., multiple tests administered to different groups of
examinees). Or as Linacre put it, “Use Winsteps if the data can be
formatted as a rectangle, e.g., persons-items … Use Facets when Winsteps
won’t do the job” (Linacre, 2014b, np).
We performed the analyses in several steps. Initially, we needed to
determine anchor values through a separate FACETS analysis of only the
10 anchor items that were administered across all groups of participants.
Then, we created a FACETS input file to link our 50 cloze tests by using
our 10 anchor items (see Appendix 1-D for a description of the actual code
that was used). There were three facets in this analysis: test-takers, test
version, and test items. By using the FACETS program, we were able to
combine the 50 different cloze procedures for both nationalities into a
single analysis using anchor items, and put all of the items onto the same
How Well do Cloze Items Work and Why? 7
true interval scale for ease of comparison (see e.g., Bond & Fox, 2007, pp.
75-90). Four of the total 1,500 items had blanks that were either missing or
made no sense, thus the total number of valid cloze items was 1,496.
Appendix 1-D also shows how we coded the data for the analysis. In
order to analyze separate tests in a single analysis using a common set of
anchor items, each examinee required two lines of response data. The first
line corresponds to the set of items for the particular cloze procedure, set
up by examinee ID, test version, the range of applicable items (e.g., 101-
130 for items 1-30 on Test 1), followed by the observed response for each
item. An additional line was also needed for examinee performance on
anchor items, with the same coding format as above except for a common
range of items for all examinees (31-40). The series of commas within the
data indicates items that were removed as explained above. Using the
same setup, we were able to run the program separately for the samples in
Russia, Japanese, and Combined (i.e., with the two samples analyzed
together as one).
Results
Classical Test Theory
Reliability. Tables 1-1 and 1-2 also show how the reliability estimates of
the various cloze passages were for the two nationalities. These cloze tests
functioned somewhat less reliably with the Japan sample (ranging from
.17 to .87) than with the Russia sample (ranging from .65 to .92). This
pattern could be a consequence of the greater variation and perhaps the
larger sample sizes in Russia. A synthesis of the cloze passages’ reliability
estimates is shown in Table 1-3.
8 Chapter One
Japan
Test Mean SD Min Max N r
1 5.23 3.16 0 15 48 0.71
2 4.21 3.42 0 13 47 0.86
3 2.02 2.13 0 10 48 0.74
4 7.54 3.87 2 16 46 0.80
5 3.98 2.79 0 13 47 0.73
6 5.11 3.23 0 14 47 0.80
7 6.14 3.41 0 16 43 0.83
8 3.16 2.27 0 8 45 0.46
9 2.85 2.46 0 11 46 0.77
10 2.54 2.31 0 8 46 0.83
11 5.94 3.36 0 16 46 0.74
12 8.98 3.97 0 21 47 0.79
13 2.87 1.71 0 8 46 0.50
14 3.23 2.50 0 9 47 0.68
15 9.18 3.42 4 18 49 0.68
16 1.36 1.41 0 6 48 0.65
17 1.38 1.25 0 5 46 0.35
18 1.02 1.09 0 3 50 0.50
19 4.76 2.88 0 10 50 0.70
20 4.38 3.24 0 15 47 0.86
21 9.92 4.44 0 19 48 0.84
22 3.70 2.86 0 11 47 0.84
23 3.64 2.40 0 11 43 0.65
24 2.96 2.26 0 9 47 0.44
25 5.36 2.74 0 12 46 0.63
26 2.68 1.56 0 5 47 0.17
27 2.34 2.72 0 13 47 0.87
28 2.58 2.17 0 8 43 0.57
29 2.32 1.77 0 7 44 0.64
30 9.56 3.28 3 16 48 0.72
31 3.78 3.08 0 15 46 0.83
32 3.83 2.53 0 9 42 0.77
33 2.14 1.87 0 6 44 0.63
34 5.87 2.92 0 13 45 0.82
35 6.63 3.66 0 17 45 0.72
36 5.00 2.05 0 9 46 0.51