
CHAPTER VI

TEST CONSTRUCTION PROCEDURES

This part describes the stages of the test construction process, from drafting the
initial test specification through public and user trial. According to Alderson, et al (1995:2), the
process of test construction should cover test specification, item writing and moderation or
editing, pre-testing or trialling and analysis, validation, posttest reports, and developing and
improving tests.

A. Test Specifications

A test's specifications provide the official statement about what the test tests and how it
tests it (Alderson, et al, 1995:9). The specifications are the blueprint to be followed by test and
item writers, and they are also essential in the establishment of the test's construct validity.
Alderson, et al (1995:9) further explain that a test specification is a detailed document, and is often for
internal purposes only. It is sometimes confidential to the examining body. A test specification is
for developers and those who need to evaluate whether a test has met its aim. The development of
test specifications is, therefore, a central and crucial part of the test construction and evaluation
process. Specifications should provide specific information. The following questions serve as a
guide, as proposed by Alderson, et al (1995:11-13):

1. What is the purpose of the test? Tests tend to fall into one of the following broad categories, as
presented in the earlier part: placement, progress, achievement, proficiency, and diagnostic.

2. What sort of learner will be taking the test: age, sex, level of proficiency, stage of learning, first
language, cultural background, level and nature of education, reason for taking the test, and
likely levels of background knowledge?

3. How many sections should the test have, how long should they be, and how will they be
differentiated: one two-hour exam, four separate two-hour papers, or a single test section?

4. What target language situation is envisaged for the test, and is this to be simulated in some way
in the test content and method?

5. What text types should be chosen: written and/or spoken? What should be the sources of these,
the supposed audience, the topics, and the degree of authenticity? How difficult or long should
they be? How complex should the language be?

6. What language skills should be tested? Are micro-skills specified, and should items be
designed to test these individually or in some integrated fashion?

7. What language elements should be tested? Is there a list of grammatical structures/features
specified?

8. What sort of tasks are required: discrete-point, integrative, simulated 'authentic', objectively
assessable?

9. How many items are required for each section? What is the relative weight for each item: equal
weighting, or extra weighting for more difficult items?

10. What test methods are to be used: multiple choice, gap filling, matching, transformation, short
answer questions, picture description, role play with cue cards, essay, structured writing?

11. What rubrics are to be used as instructions for candidates? Will examples be required to help
candidates know what is expected? Should the criteria by which candidates will be assessed be
included in the rubric?

12. Which criteria will be used for assessment by markers? How important is accuracy,
appropriacy, spelling, length of utterance/script, etc.?

Test specifications vary according to their uses. The specification must, however, provide
appropriate information about what a test should cover. Test specifications should include all or some
of the following:

The test's purpose

Description of the test taker

Test level

Construct (theoretical framework for test)

Description of suitable language course or textbook

Number of sections/papers

Time of each section/paper

Target language situation

Text-types

Text length

Language skills to be tested

Language elements to be tested


Test tasks

Test methods

Rubrics

Criteria for marking

Descriptions of typical performance of each level

Descriptions of what candidates at each level can do in the real world

Sample papers and samples of students' performance on tasks
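Because the items in this checklist function as a blueprint to be checked against during item writing, they can be held in a simple structured record. The following is a minimal sketch in Python; every field name and value is an illustrative assumption, not a format prescribed by Alderson, et al:

```python
# An illustrative test specification record. Field names and values are
# hypothetical; a real specification would be far more detailed.
test_spec = {
    "purpose": "achievement",
    "test_taker": {"level": "intermediate", "first_language": "Indonesian"},
    "sections": [
        {"skill": "grammar", "method": "multiple choice", "items": 40, "minutes": 45},
        {"skill": "writing", "method": "structured essay", "items": 1, "minutes": 60},
    ],
    "text_types": ["written"],
    "rubrics": "instructions in English, with one worked example per section",
    "marking_criteria": ["accuracy", "appropriacy", "spelling"],
}

# A structured record lets item writers verify overall balance mechanically.
total_items = sum(s["items"] for s in test_spec["sections"])
total_minutes = sum(s["minutes"] for s in test_spec["sections"])
print(total_items, total_minutes)  # 41 105
```

Keeping the specification in such a form makes it straightforward to confirm, for each draft paper, that the number of items, timing, and methods match what the blueprint demands.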

The test specification should also consider the three domains of the taxonomy of educational
objectives. According to Krathwohl, et al (1973), most of the objectives stated by teachers in
institutions, as well as those found in the literature, could be placed rather easily in one of three
major domains or classifications: the cognitive, affective, and psychomotor domains. Since tests
should be constructed based on the instructional objectives, they too can be placed in the three
domains.

1. Cognitive Domain

The cognitive domain covers objectives which emphasize remembering or reproducing something
which has presumably been learned, as well as objectives which involve the solving of some
intellectual task for which the individual has to determine the essential problem and then reorder
given material or combine it with ideas, methods, or procedures previously learned. Cognitive
objectives vary from simple recall of material learned to highly original and creative ways of
combining and synthesizing new ideas and materials.

Krathwohl, et al (1973) state that the largest proportion of educational objectives falls
into the cognitive domain. It includes objectives that are related to recall or recognition of
knowledge and the development of higher intellectual skills and abilities.

Bloom, et al (1979) identified six major areas within which cognitive objectives may be
classified: knowledge, comprehension, application, analysis, synthesis, and evaluation.
Knowledge refers to the recall of specific information; comprehension refers to an understanding
of what was read; application refers to the converting of abstract content to concrete situations;
analysis refers to the comparison and contrast of the content to personal experiences; synthesis
refers to the organization of thoughts, ideas, and information from the content; and evaluation
refers to the judgment and evaluation of characters, actions, outcomes, etc. for personal reflection
and understanding. The higher the level, the more sophisticated the test. In the revised taxonomy,
however, the names of the six major categories are changed from noun to verb forms, and some
are reorganized (Tarlinton, 2003). The knowledge category is renamed: knowledge is a product of
thinking and is inappropriate to describe a category of thinking, so it is replaced with the word
remembering instead. Comprehension becomes understanding, and synthesis is renamed creating
in order to better reflect the nature of the thinking described by each category. Thus, the revised
cognitive domains are ordered from lower-level thinking (remembering, understanding, and
applying) to higher-level thinking (analyzing, evaluating, and creating).

a. Remembering indicates recalling information (recognising, listing, describing, retrieving,
naming, finding).

b. Understanding signifies explaining ideas or concepts (interpreting, summarising,
paraphrasing, classifying, and explaining).

c. Applying conveys using information in another familiar situation (implementing, carrying out,
using, executing).

d. Analysing denotes breaking information into parts to explore understandings and relationships
(comparing, organising, deconstructing, interrogating, and finding).

e. Evaluating implies justifying a decision or course of action (checking, hypothesising,
critiquing, experimenting, and judging).

f. Creating refers to generating new ideas, products, or ways of viewing things (designing,
constructing, planning, producing, inventing).

2. Affective Domain

The affective domain covers objectives which emphasize a feeling tone, an emotion, or a degree of
acceptance or rejection. Affective objectives vary from simple attention to selected phenomena
to complex but internally consistent qualities of character and conscience. A large number of
such objectives in the literature are expressed as interests, attitudes, appreciations, values, and
emotional sets or biases.

3. Psychomotor Domain

The psychomotor domain covers objectives which emphasize some muscular or motor skill, some
manipulation of materials and objects, or some act that requires neuromuscular
coordination.

B. Test Construction and Moderation

Test construction, which is commonly known as item writing, is the next step in test
development after the test specifications have been formulated. In writing test items, one should
ideally combine the necessary formal professional qualifications with teaching experience of
students similar to those who will take the test, and knowledge of the relevant subject areas. The
teaching experience will provide insights into what such students find easy and difficult, what
interests them, their cultural background, and so on.
Item writing must be based on the test specifications, although it is possible to look
at past papers. Trying to replicate or build upon past papers, however, restricts the test methods and
contents to what have already been tested. It is normal practice to vary test content, and often test
method, for each new test that is written, unless there is a requirement to produce a narrowly
parallel test. Thus, it is essential to refer to the test specifications in order to ensure as wide a
sampling of the potential content and methods as possible.

It is important to realize that the method used for testing a language ability may itself affect
the student's score; this is called the method effect. For that reason, its influence should be
reduced as much as possible. The present researcher is not interested in finding out whether a
student is good at multiple-choice tests, or can do error identification tests better than other
students, or finds essay tests particularly difficult. He is interested in finding out about students'
grammatical knowledge at four different successive grammar courses.

It is likely that particular test methods will lend themselves to testing some abilities and
not be so good at testing others. An extreme example provided by Alderson, et al (1995) is that
multiple-choice tests are not suitable for testing a student's ability to pronounce a language
correctly, but they tend to be good for testing students' knowledge of grammar.

In terms of test editing or moderation, each item and the test as a whole are considered
for the degree of match with the test specifications, likely level of difficulty, possible unforeseen
problems, ambiguities in the wording of items and of instructions, problems of layout, match
between stems and choices, and overall balance of the subtest or paper. The process of editing
does not only involve reading the test and its items; the editor must attempt each item as
if he were a student taking the test. Items that have provoked unexpected responses from editors
or are too problematic must be revised or dropped.

C. Try-out

However well designed a test may be, and however carefully it has been edited, it is not
possible to know how it will work until it has been tried out on students. An item writer cannot
anticipate the responses of students at different levels of language ability, although he may
think he knows what an item is testing and what the correct answer is.

We do not only need to know how difficult the test items are; we also need to know
whether they work. This may mean that an item which is intended to test a particular structure
actually does so, or it may mean that the item succeeds in distinguishing between students at
different levels, so that the more proficient students can answer it better than the weaker ones. It
is impossible to predict whether items will work without trying them out. According to Alderson,
et al (1995), the performance of multiple-choice items may be the most difficult to predict, since
the presence of a variety of correct and incorrect answers provides plenty of scope for ambiguity
and disagreement, but open-ended items and subjectively marked tests can also produce surprises.
For example, an open-ended question may turn out to confuse the best rather than the worst
students, or an essay task may unintentionally elicit only a small range of language from the students.

The number of students on whom a test should be trialled depends on the importance and
type of test, and also on the availability of suitable students. The only guiding rule is the more the
better, since the more students there are, the less effect chance will have on the results.
Regardless of how many students there are, it is important that the sample should, as far as
possible, be representative of the intended students, with a similar range of abilities and
backgrounds; otherwise the results of the trials may be useless.

D. Test Analysis

The test items that have been tried out must be analyzed to see whether they work. This
analysis will show the extent to which each item works. For objective test items, there are
traditionally two measures: the facility value and the discrimination index. The facility
value measures the level of difficulty of an item, and the discrimination index measures the
extent to which the results of an individual item correlate with results from the whole test, that is,
how well it discriminates between students at different levels of ability.
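For dichotomously scored (right/wrong) items, both measures can be computed directly from a trial score matrix. The sketch below uses the common upper-lower group method for the discrimination index; the trial data and the 27% group fraction are illustrative assumptions, not values from the source:

```python
# Item analysis for objective items scored 1 (correct) or 0 (incorrect).
# Rows are students, columns are items.

def facility_value(scores, item):
    """Proportion of students answering the item correctly (0.0 to 1.0).

    A high value means an easy item; a low value means a hard one."""
    return sum(row[item] for row in scores) / len(scores)

def discrimination_index(scores, item, fraction=0.27):
    """Difference in item facility between top and bottom scoring groups.

    Students are ranked by total test score; the upper and lower
    `fraction` of students are then compared on the single item.
    Values near +1 mean strong students get the item right and weak
    students get it wrong, i.e. the item discriminates well."""
    ranked = sorted(scores, key=sum, reverse=True)
    n = max(1, round(len(ranked) * fraction))
    upper, lower = ranked[:n], ranked[-n:]
    p_upper = sum(row[item] for row in upper) / n
    p_lower = sum(row[item] for row in lower) / n
    return p_upper - p_lower

# Hypothetical trial: 6 students, 3 items.
scores = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 0],
]

print(round(facility_value(scores, 0), 2))   # 0.67 (4 of 6 correct)
print(discrimination_index(scores, 0))       # 1.0
```

Item 0 here is answered by all of the strongest students and none of the weakest, so its discrimination index is at the maximum; an index near zero or negative would flag an item for revision or removal.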

For subjectively marked tests, such as summaries, essays, and oral interviews, item
analysis is inappropriate, but such tests still need to be tried out to see whether the items elicit the
intended sample of language, whether the marking system, which should have been drafted
during the item writing stage, is usable, and whether the examiners are able to mark consistently.
It may be impossible to try out such tests on large numbers because of the time needed to mark
the scripts or run the interviews, but students with a wide range of backgrounds and
language levels should be tested in order to ensure that the sample of language produced contains
most of the features which will be found in the tests.

Once the papers or interviews have been administered, there should be trial marking
sessions to see whether the test item prompts have produced the intended kinds of responses, and
whether the marking guidelines and criteria are working satisfactorily.

E. Validation

The most important question of all in language testing is validity. Henning (1987) defines
validity as referring to the appropriateness of a given test or any of its component parts
as a measure of what it is supposed to measure. A test is said to be valid to the extent that it
measures what it is supposed to measure. Any test may be valid for a certain purpose or for some
purposes, but not for others. Alderson, et al (1995) state that one of the commonest problems in
test use is test misuse, for example using a test for a purpose for which it was not intended.

The validation process involves the terms internal and external validity, the
distinction being that internal validity relates to studies of the perceived content of the test and its
perceived effect, while external validity relates to studies comparing students' test scores with
measures of their ability gleaned from outside the test. Three of the most common ways of
assessing the internal validity of a test are face validation, where non-testers such as students and
administrators comment on the value of the test; content validation, where testers or subject
experts judge the test; and response validation, where a growing range of qualitative techniques,
such as self-report or self-observation on the part of test takers, are used to understand how they
respond to test items and why.

The commonest types of external validity are concurrent validity and predictive validity,
and the statistic most frequently used is the correlation coefficient. Concurrent validation
involves the comparison of the test scores with some other measure for the same candidates
taken at roughly the same time as the test. Predictive validation is most common with
proficiency tests, which are intended to predict how well somebody will perform in the
future.
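The correlation coefficient used in such studies is typically Pearson's product-moment coefficient, which can be sketched as follows; the two score lists are invented for illustration only:

```python
# Pearson product-moment correlation between two lists of scores.
def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Covariance numerator and the two spread (standard deviation) terms.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sx = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sy = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Hypothetical concurrent validation: the same six candidates sit the
# new test and an established measure at roughly the same time.
new_test = [55, 62, 70, 48, 81, 66]
criterion = [58, 60, 75, 50, 85, 63]

r = pearson(new_test, criterion)  # ranges from -1 to +1
print(round(r, 2))
```

A coefficient close to +1 would support the concurrent validity of the new test; a coefficient near zero would suggest the two instruments are measuring different things.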

F. Public and User Trial

The tests that have been constructed, tried out, and analyzed should also be evaluated by the
public, especially the future users of the tests. The tests are presented to the future users, who
analyze them, give comments or suggestions for the improvement of the tests, and approve the
tests. The future user of the constructed tests is supposed to be the English Department of UNM.
