The language test development process: main principles and suggestions for
classroom practice
Olena Rossi
This article extends a TEASIG webinar given by the author on 20 October 2020.
One can argue that a test is valid by collecting pieces of evidence that support
the claim. For example, if test-takers’ scores on Test A correlate well with their
scores on Test B, which measures the same construct, this can serve as evidence of
validity. No single piece of evidence is enough to claim test validity, but the more
pieces one can produce, the stronger the validity argument becomes. One such
important piece of evidence is the process that was used to produce the test items.
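The correlation check mentioned above can be sketched in a few lines of Python. The scores below are invented purely for illustration:

```python
# Hypothetical scores of ten students on two tests that claim to
# measure the same construct (illustrative data only).
scores_a = [12, 18, 9, 15, 20, 7, 14, 16, 11, 19]
scores_b = [14, 17, 10, 16, 19, 8, 13, 18, 12, 20]

def pearson(x, y):
    """Pearson correlation coefficient between two score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)

r = pearson(scores_a, scores_b)
print(f"r = {r:.2f}")  # a high positive r supports the validity claim
```

A strong positive coefficient would be one piece of evidence; on its own, of course, it proves nothing about what either test actually measures.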
While high-quality items contribute to the overall test validity, low-quality items
constitute a validity threat (Messick, 1989). One such threat is construct
underrepresentation, which occurs when an important aspect of the intended
construct is not targeted in the item. For example, suppose a teacher has a group of
A2-level students and needs to measure how well they have mastered wh-questions in
the past tense. The teacher writes a test which includes the multiple-choice (MC)
item presented in Table 1.
Although the item stem is a wh-question in the past, what the item actually tests is
the knowledge of question words. The intended construct, however, involves much
more than just question words – it also includes word order and the use of auxiliary
verbs. Because these are not tested by the item the teacher created, the item
underrepresents the construct.
The stem is a wh-question in the past and the item targets important aspects of the
construct. However, the complexity of the vocabulary and of the sentence structure
will most probably prevent A2-level students from understanding the question and,
consequently, from answering the item correctly, irrespective of whether the students
know how to form wh-questions in the past or not. In other words, the complexity of
vocabulary and sentence structure, inappropriate at A2 level, introduces construct-
irrelevant variance which affects the test-takers’ performance.
The two examples above are somewhat extreme and were created to make a point.
It is surprising, however, how many items produced for classroom testing suffer from
similar problems, something that could have been avoided if sufficient care had been
taken during the test production process. Being aware of the effect that item quality
has on test validity, large-scale examination bodies follow rigorous procedures for
test development. They base the test production process on relevant documentation
that includes the test framework, specifications, and item-writer guidelines. The test
development itself is a multi-step process spanning many months or even
several years. It starts with commissioning test items from suitably qualified item
writers; items then go through several rounds of review and revision; the revised
items are sent for pre-testing (trialling); pre-testing is then followed by analyses
of item responses and, if necessary, further item revision.
In the study I conducted with 25 novice item writers (Rossi, 2017), the following were
reported as important for producing good quality test items:
• the ability to understand and follow item specifications;
• good knowledge of language proficiency levels and task types;
• awareness of test constructs, including language skills and sub-skills, and the
ability to target the intended construct in items;
• proficient use of item-writing tools (e.g., the ability to check text readability or
vocabulary frequency);
• the ability to give constructive feedback on others’ items and to respond
constructively to feedback on one’s own items.
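As an illustration of the kind of item-writing tool mentioned in the list above, the following Python sketch estimates text readability with the Flesch Reading Ease formula. The syllable counter is a crude vowel-group heuristic, so the scores are rough guides, not authoritative values:

```python
import re

def count_syllables(word):
    """Very rough syllable estimate: count vowel groups (heuristic)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    """Flesch Reading Ease: higher scores indicate easier text."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / sentences)
            - 84.6 * (syllables / len(words)))

simple = "The cat sat on the mat. It was warm."
dense = ("Epistemological considerations notwithstanding, the proliferation "
         "of multifarious terminological conventions obfuscates comprehension.")
print(flesch_reading_ease(simple))  # high score: easy text
print(flesch_reading_ease(dense))   # very low score: hard text
```

Dedicated tools and vocabulary-frequency lists do this far better; the point is only that such checks can be run quickly on draft input texts.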
Arguably, some of the above abilities develop either through targeted item-writing
training, which is not always available to classroom teachers, or through continuous
practice. However, the knowledge of language testing principles and concepts, of
test constructs, and of language proficiency levels as they are defined, for example,
in the CEFR, can be gained by doing free online courses such as The Teachers’
Assessment Literacy Enhancement (TALE) Project developed by Dina Tsagari, Karin
Vogt and colleagues (http://taleproject.eu/) or the Language Assessment in the
Classroom course regularly offered by the British Council
(https://www.futurelearn.com/courses/language-assessment/).
There are also some books on test development that can be used for self-study. For
example, Hughes (2003) provides practical guidance in test development aimed at
language teachers. Although the book written by Haladyna and Rodriguez (2013) is
not exclusively about producing language test items, it is one of the most
comprehensive guides on writing items of different types, from MC questions to
essay prompts to portfolio and project tasks. Language testing and teaching
associations such as the International Language Testing Association (ILTA), the
European Association for Language Testing and Assessment (EALTA), and IATEFL
also have a range of useful resources available to members on their websites.
Testing, Evaluation & Assessment Today, Issue 4 May 2021, pp.53-57 ISSN2709-1724
Test specifications
Any test should be produced based on specifications. Good test specifications will
provide overall information about the test including its purpose, proficiency level(s),
intended construct, and overall structure. The specifications will also detail each task
and item in the test: the number of tasks and items, their type(s) (e.g., MC items,
gap-fill, short-answer questions, etc.), and the order they are to appear in the test.
For reading and listening items, input text characteristics should also be specified
(e.g., their genre, length and topic), as well as how the texts should be sourced. For
speaking and writing tasks, guidance on the scoring method should be provided,
including the assessment criteria, the score range, and examples of performance at
different levels of proficiency.
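As a rough illustration, a specification entry for a single task might be recorded in a structured form like the following Python sketch. All field names and values here are hypothetical, not a standard format:

```python
# A minimal, hypothetical specification entry for one reading task;
# field names and values are illustrative only.
reading_task_spec = {
    "purpose": "end-of-term achievement test",
    "level": "A2 (CEFR)",
    "construct": "reading for gist and specific detail",
    "task_type": "multiple-choice, 3 options",
    "n_items": 5,
    "position_in_test": 2,
    "input_text": {
        "genre": "short informational email",
        "length_words": (80, 120),  # minimum, maximum
        "topics": ["daily routines", "travel", "school life"],
        "source": "adapted authentic texts, simplified to A2 vocabulary",
    },
}

# A quick automated sanity check a teacher could run on a draft spec:
assert 1 <= reading_task_spec["n_items"] <= 10
```

Even a plain document works just as well; what matters is that every detail an item writer needs is written down before writing starts.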
Test specifications should contain sample items for each item type to be included in
the test. It might also be a good idea to have some examples of poorly written items,
to signal what items should not be like. Having a range of good- and poor-quality
items to refer to is something professional item writers appreciate (see, e.g.,
Al-Lawati, 2014), and can also be useful for teachers with test-writing responsibilities.
Item review
Practice shows that item writers are often blind to flaws in their own items. That is
why, after the first item drafts have been produced, they should be reviewed by
colleagues, who should start the review by taking the test as if they were students.
Approaching an item from the students’ perspective will help reveal unclear
instructions, double keys or weak distractors in MC items, or correct answers
missing from the key in gap-fill items.
fill items. It is a good idea to ask more than one colleague to review items because
different people will notice different problems. The reviewers should also check the
items against the specifications, paying particular attention to whether the items
target the intended construct and whether there are any threats to test validity, such
as construct-irrelevance or construct-underrepresentation (see discussion earlier in
this article).
From my own experience as an item writer, I know that receiving negative feedback
on one’s own items can be painful, so developing the ability to respond
constructively to feedback is a must. Salvageable items should be revised and
returned for review, and items might have to go through several cycles of review and
revision until the reviewers cannot identify any obvious flaws.
Actual students, however, might react to items differently from teacher colleagues;
therefore, trialling the items on a small student sample might help reveal
problems that the reviewers were not able to identify. For example, a look at
students’ responses to a writing prompt might reveal some off-topic or tangential
answers. This might not only be because the students are poor writers, but also
because the prompt is unclear, or the topic is unsuitable. Just reviewing students’
responses can be very helpful, but students themselves can also be asked to
comment on the test, for example, how well they understood the instructions or how
confident they are about doing a particular item type (including an unfamiliar item
type in a test is not a good idea). Importantly, the students should be as similar as
possible to those the test was produced for, so a parallel class is often the best
choice, although there remains the problem of item exposure.
Post-test analysis
Large-scale exam bodies carry out complex statistical analyses on each test
administration to ensure test items perform as expected. Classroom teachers might
not have enough time, or statistical knowledge, to carry out such analyses after each
test they give to their students. However, a quick review of students’ responses might
produce interesting findings. For example, the teacher might discover that a
distractor for an MC item is never selected. This means that the distractor is weak and
should be replaced. The teacher might also find that those students who received a
high score on the whole test repeatedly failed a particular item, while weaker
students answered the item correctly. Such items should also be removed from the
test because they do not fulfil their purpose of discriminating between weak and
strong students.
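Such a quick review can even be partly automated. The following Python sketch, using invented response data, counts how often each option is chosen and computes a simple top-half versus bottom-half discrimination index:

```python
from collections import Counter

# Hypothetical responses of ten students to one MC item with key "B",
# plus their total test scores (illustrative data only).
responses = ["B", "A", "B", "B", "C", "B", "A", "B", "B", "A"]
totals = [18, 9, 17, 15, 8, 19, 10, 16, 14, 7]
key = "B"

# 1. Distractor analysis: an option nobody picks adds nothing to the item
#    (here a "D" option, if the item had one, would show a count of 0).
print(Counter(responses))

# 2. Simple discrimination index: compare how often the top half of
#    scorers answered correctly with how often the bottom half did.
ranked = sorted(zip(totals, responses), reverse=True)
half = len(ranked) // 2
top = [r for _, r in ranked[:half]]
bottom = [r for _, r in ranked[-half:]]
discrimination = (top.count(key) - bottom.count(key)) / half
print(discrimination)  # near zero or negative values flag a problem item
```

Values close to 1 indicate the item separates strong from weak students well; a negative value would flag exactly the reversed pattern described above.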
Conclusion
The test development process described in this article should be followed for all
higher-stakes classroom testing, for example, end-of-term and end-of-year tests.
When creating tests for continuous assessment, it will not always be possible to go
through the whole test development cycle. However, even for these tests, it is
important to ensure that the items target the intended construct and are suitable for
the purpose. Creating a bank of past items that worked well can save a lot of time
and effort: if teachers in a school or university maintain a shared item bank, they can
re-use the items multiple times, and also re-combine the items to produce different
tests.
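A shared item bank can be as simple as a structured list from which fresh test forms are drawn. The sketch below uses hypothetical records and a minimal filter-and-sample function; a real bank would also store the item text, key, and usage history:

```python
import random

# A minimal item-bank sketch; all records are hypothetical.
item_bank = [
    {"id": 1, "construct": "past wh-questions", "level": "A2", "facility": 0.65},
    {"id": 2, "construct": "past wh-questions", "level": "A2", "facility": 0.55},
    {"id": 3, "construct": "present perfect", "level": "B1", "facility": 0.48},
    {"id": 4, "construct": "past wh-questions", "level": "A2", "facility": 0.70},
]

def draw_test(bank, construct, level, n):
    """Re-combine banked items into a fresh test form."""
    pool = [i for i in bank if i["construct"] == construct
            and i["level"] == level]
    return random.sample(pool, n)

form = draw_test(item_bank, "past wh-questions", "A2", 2)
print([item["id"] for item in form])  # two of the three matching items
```

Because each draw re-combines items, two classes can sit different but comparable forms built from the same pool.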
If I were asked for the most important tip in item writing, I would say that tests should
not be produced the day before they are administered. Teachers should allow
themselves enough time to think the test over, to write, and to revise what they have
created. After all, it is only fair for teachers and their students if the tests give a true
picture of the students’ learning.
References
Haladyna, T. and Rodriguez, M.C. (2013). Developing and validating test items. New
York: Routledge.
Rossi, O. (2017, June 26-27). Assessment literacy for test writers: What do people
who write language tests need to know about testing? [Paper presentation].
Linguistics and English Language Postgraduate Conference. Lancaster, UK.
Biodata
Olena Rossi has recently completed her doctoral studies at Lancaster University
specialising in language testing, and teaches language testing and academic study
skills on MA courses. Olena’s main research interests include test design, item
writing, and assessment literacy for test stakeholders. Her background is in EFL as a
teacher and teacher trainer, and she also has experience working as an examiner,
item writer and item reviewer for several large-scale examination boards. Olena has
facilitated assessment-related workshops and taught item-writing courses.
olena.rossi@yahoo.com