The language test development process: main principles and suggestions for
classroom practice
Olena Rossi
This article extends a TEASIG webinar given by the author on 20 October 2020.
One can argue that a test is valid by collecting pieces of evidence that support
the claim. For example, if test-takers’ scores on Test A correlate well with their
scores on Test B, which measures the same construct, this can serve as evidence of
validity. No single piece of evidence is enough to claim test validity, but the more
pieces one can produce, the stronger the validity argument becomes. One such
important piece of evidence is the process that was used to produce the test items.
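The correlation check mentioned above can be sketched in a few lines of Python. The scores below are invented purely for illustration:

```python
# Hypothetical scores of ten students on two tests that claim to
# measure the same construct (illustrative data only).
scores_a = [12, 18, 9, 15, 20, 7, 14, 16, 11, 19]
scores_b = [14, 17, 10, 16, 19, 8, 13, 18, 12, 20]

def pearson(x, y):
    """Pearson correlation coefficient between two score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)

r = pearson(scores_a, scores_b)
print(f"r = {r:.2f}")  # a high positive r supports the validity claim
```

A strong positive coefficient would be one piece of evidence; on its own, of course, it proves nothing about what either test actually measures.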
While high-quality items contribute to the overall test validity, low-quality items
constitute a validity threat (Messick, 1989). One such threat is construct
underrepresentation, which occurs when an important aspect of the intended
construct is not targeted in the item. For example, suppose a teacher has a group of
A2-level students and needs to measure how well they have mastered wh-questions in
the past tense. The teacher writes a test which includes the multiple-choice (MC)
item presented in Table 1.
Although the item stem is a wh-question in the past, what the item actually tests is
the knowledge of question words. The intended construct, however, involves much
more than just question words – it also includes word order and the use of auxiliary
verbs. Because these are not tested by the item the teacher created, the item
underrepresents the construct.
The stem is a wh-question in the past and the item targets important aspects of the
construct. However, the complexity of the vocabulary and of the sentence structure
will most probably prevent A2-level students from understanding the question and,
consequently, from answering the item correctly, irrespective of whether the students
know how to form wh-questions in the past or not. In other words, the complexity of
vocabulary and sentence structure, inappropriate at A2 level, introduces construct-
irrelevant variance which affects the test-takers’ performance.
The two examples above are somewhat extreme and were created to make a point.
It is surprising, however, how many items produced for classroom testing suffer from
similar problems, something that could have been avoided if sufficient care had been
taken during the test production process. Being aware of the effect that item quality
has on test validity, large-scale examination bodies follow rigorous procedures for
test development. They base the test production process on relevant documentation
that includes the test framework, specifications, and item-writer guidelines. The test
development itself is a multi-step process spanning many months or even
several years. It starts with commissioning test items from suitably qualified item
writers; items then go through several rounds of review and revision; the revised
items are sent for pre-testing (trialling); pre-testing is then followed by analyses
of item responses and, if necessary, further item revision.
In the study I conducted with 25 novice item writers (Rossi, 2017), the following were
reported as important for producing good quality test items:
• the ability to understand and follow item specifications;
• good knowledge of language proficiency levels and task types;
• awareness of test constructs, including language skills and sub-skills, and the
ability to target the intended construct in items;
• proficient use of item-writing tools (e.g., the ability to check text readability or
vocabulary frequency);
• the ability to give constructive feedback on others’ items and to respond
constructively to feedback on one’s own items.
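As an illustration of the kind of item-writing tool mentioned in the list above, the following Python sketch estimates text readability with the Flesch Reading Ease formula. The syllable counter is a crude vowel-group heuristic, so the scores are rough guides, not authoritative values:

```python
import re

def count_syllables(word):
    """Very rough syllable estimate: count vowel groups (heuristic)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    """Flesch Reading Ease: higher scores indicate easier text."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / sentences)
            - 84.6 * (syllables / len(words)))

simple = "The cat sat on the mat. It was warm."
dense = ("Epistemological considerations notwithstanding, the proliferation "
         "of multifarious terminological conventions obfuscates comprehension.")
print(flesch_reading_ease(simple))  # high score: easy text
print(flesch_reading_ease(dense))   # very low score: hard text
```

Dedicated tools and vocabulary-frequency lists do this far better; the point is only that such checks can be run quickly on draft input texts.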
Arguably, some of the above abilities develop either through targeted item-writing
training, which is not always available to classroom teachers, or through continuous
practice. However, the knowledge of language testing principles and concepts, of
test constructs, and of language proficiency levels as they are defined, for example,
in the CEFR, can be gained by doing free online courses such as The Teachers’
Assessment Literacy Enhancement (TALE) Project developed by Dina Tsagari, Karin
Vogt and colleagues (http://taleproject.eu/) or the Language Assessment in the
Classroom course regularly offered by the British Council
(https://www.futurelearn.com/courses/language-assessment/).
There are also some books on test development that can be used for self-study. For
example, Hughes (2003) provides practical guidance in test development aimed at
language teachers. Although the book written by Haladyna and Rodriguez (2013) is
not exclusively about producing language test items, it is one of the most
comprehensive guides on writing items of different types, from MC questions to
essay prompts to portfolio and project tasks. Language testing and teaching
associations such as the International Language Testing Association (ILTA), the
European Association for Language Testing and Assessment (EALTA), and IATEFL
also have a range of useful resources available to members on their websites.
Testing, Evaluation & Assessment Today, Issue 4 May 2021, pp.53-57 ISSN2709-1724
Test specifications
Any test should be produced based on specifications. Good test specifications will
provide overall information about the test including its purpose, proficiency level(s),
intended construct, and overall structure. The specifications will also detail each task
and item in the test: the number of tasks and items, their type(s) (e.g., MC items,
gap-fill, short-answer questions, etc.), and the order they are to appear in the test.
For reading and listening items, input text characteristics should also be specified
(e.g., their genre, length and topic), as well as how the texts should be sourced. For
speaking and writing tasks, guidance on the scoring method should be provided,
including the assessment criteria, the score range, and examples of performance at
different levels of proficiency.
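As a rough illustration, a specification entry for a single task might be recorded in a structured form like the following Python sketch. All field names and values here are hypothetical, not a standard format:

```python
# A minimal, hypothetical specification entry for one reading task;
# field names and values are illustrative only.
reading_task_spec = {
    "purpose": "end-of-term achievement test",
    "level": "A2 (CEFR)",
    "construct": "reading for gist and specific detail",
    "task_type": "multiple-choice, 3 options",
    "n_items": 5,
    "position_in_test": 2,
    "input_text": {
        "genre": "short informational email",
        "length_words": (80, 120),  # minimum, maximum
        "topics": ["daily routines", "travel", "school life"],
        "source": "adapted authentic texts, simplified to A2 vocabulary",
    },
}

# A quick automated sanity check a teacher could run on a draft spec:
assert 1 <= reading_task_spec["n_items"] <= 10
```

Even a plain document works just as well; what matters is that every detail an item writer needs is written down before writing starts.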
Test specifications should contain sample items for each item type to be included in
the test. It might also be a good idea to have some examples of poorly written items,
to signal what items should not be like. Having a range of good- and poor-quality
items to refer to is something professional item writers appreciate (see, e.g.,
Al-Lawati, 2014), and can also be useful for teachers with test-writing responsibilities.
Item review
Practice shows that item writers are often blind to flaws in their own items. That is
why, after the first item drafts have been produced, they should be reviewed by
colleagues, who should start the review by taking the test as if they were students.
Approaching an item from the students’ perspective will help reveal unclear
instructions, double keys or weak distractors in MC items, or correct answers
missing from the key in gap-fill items.
fill items. It is a good idea to ask more than one colleague to review items because
different people will notice different problems. The reviewers should also check the
items against the specifications, paying particular attention to whether the items
target the intended construct and whether there are any threats to test validity, such
as construct-irrelevance or construct-underrepresentation (see discussion earlier in
this article).
From my own experience as an item writer, I know that receiving negative feedback
on one’s own items can be painful, so developing the ability to respond
constructively to feedback is a must. Salvageable items should be revised and
returned for review, and items might have to go through several cycles of review and
revision until the reviewers cannot identify any obvious flaws.
Actual students, however, might react to items differently from teacher colleagues;
therefore, trialling the items on a small student sample might help reveal
problems that the reviewers were not able to identify. For example, a look at
students’ responses to a writing prompt might reveal some off-topic or tangential
answers. This might not only be because the students are poor writers, but also
because the prompt is unclear, or the topic is unsuitable. Just reviewing students’
responses can be very helpful, but students themselves can also be asked to
comment on the test, for example, how well they understood the instructions or how
confident they are about doing a particular item type (including an unfamiliar item
type in a test is not a good idea). Importantly, the students should be as similar as
possible to those the test was produced for, so a parallel class is often the best
choice, although there remains the problem of item exposure.
Post-test analysis
Large-scale exam bodies carry out complex statistical analyses on each test
administration to ensure test items perform as expected. Classroom teachers might
not have enough time, or statistical knowledge, to carry out such analyses after each
test they give to their students. However, a quick review of students’ responses might
produce interesting findings. For example, the teacher might discover that a
distractor for an MC item is never selected. This means that the distractor is weak and
should be replaced. The teacher might also find that those students who received a
high score on the whole test repeatedly failed a particular item, while weaker
students answered the item correctly. Such items should also be removed from the
test because they do not fulfil their purpose of discriminating between weak and
strong students.
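Such a quick review can even be partly automated. The following Python sketch, using invented response data, counts how often each option is chosen and computes a simple top-half versus bottom-half discrimination index:

```python
from collections import Counter

# Hypothetical responses of ten students to one MC item with key "B",
# plus their total test scores (illustrative data only).
responses = ["B", "A", "B", "B", "C", "B", "A", "B", "B", "A"]
totals = [18, 9, 17, 15, 8, 19, 10, 16, 14, 7]
key = "B"

# 1. Distractor analysis: an option nobody picks adds nothing to the item
#    (here a "D" option, if the item had one, would show a count of 0).
print(Counter(responses))

# 2. Simple discrimination index: compare how often the top half of
#    scorers answered correctly with how often the bottom half did.
ranked = sorted(zip(totals, responses), reverse=True)
half = len(ranked) // 2
top = [r for _, r in ranked[:half]]
bottom = [r for _, r in ranked[-half:]]
discrimination = (top.count(key) - bottom.count(key)) / half
print(discrimination)  # near zero or negative values flag a problem item
```

Values close to 1 indicate the item separates strong from weak students well; a negative value would flag exactly the reversed pattern described above.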
Conclusion
The test development process described in this article should be followed for all
higher-stakes classroom testing, for example, end-of-term and end-of-year tests.
When creating tests for continuous assessment, it will not always be possible to go
through the whole test development cycle. However, even for these tests, it is
important to ensure that the items target the intended construct and are suitable for
the purpose. Creating a bank of past items that worked well can save a lot of time
and effort: if teachers in a school or university maintain a shared item bank, they can
re-use the items multiple times, and also re-combine the items to produce different
tests.
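A shared item bank can be as simple as a structured list from which fresh test forms are drawn. The sketch below uses hypothetical records and a minimal filter-and-sample function; a real bank would also store the item text, key, and usage history:

```python
import random

# A minimal item-bank sketch; all records are hypothetical.
item_bank = [
    {"id": 1, "construct": "past wh-questions", "level": "A2", "facility": 0.65},
    {"id": 2, "construct": "past wh-questions", "level": "A2", "facility": 0.55},
    {"id": 3, "construct": "present perfect", "level": "B1", "facility": 0.48},
    {"id": 4, "construct": "past wh-questions", "level": "A2", "facility": 0.70},
]

def draw_test(bank, construct, level, n):
    """Re-combine banked items into a fresh test form."""
    pool = [i for i in bank if i["construct"] == construct
            and i["level"] == level]
    return random.sample(pool, n)

form = draw_test(item_bank, "past wh-questions", "A2", 2)
print([item["id"] for item in form])  # two of the three matching items
```

Because each draw re-combines items, two classes can sit different but comparable forms built from the same pool.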
If I were asked for the most important tip in item writing, I would say that tests should
not be produced the day before they are administered. Teachers should allow
themselves enough time to think the test over, to write, and to revise what they have
created. After all, it is only fair for teachers and their students if the tests give a true
picture of the students’ learning.
References
Haladyna, T. and Rodriguez, M.C. (2013). Developing and validating test items. New
York: Routledge.
Rossi, O. (2017, June 26-27). Assessment literacy for test writers: What do people
who write language tests need to know about testing? [Paper presentation].
Linguistics and English Language Postgraduate Conference. Lancaster, UK.
Biodata
Olena Rossi has recently completed her doctoral studies at Lancaster University
specialising in language testing, and teaches language testing and academic study
skills on MA courses. Olena’s main research interests include test design, item
writing, and assessment literacy for test stakeholders. Her background is in EFL as a
teacher and teacher trainer, and she also has experience working as an examiner,
item writer and item reviewer for several large-scale examination boards. Olena has
facilitated assessment-related workshops and taught item-writing courses.
olena.rossi@yahoo.com