
Preface

Testing in general, and language testing in particular, is a challenging field. On
the one hand, tests are used to make decisions which influence people's lives.
Therefore, tests must provide information that is as accurate as possible to enable
testers to make fair decisions. This makes testing a very delicate responsibility. On the
other hand, testing is rooted in many complicated scientific disciplines, such as
linguistics, psychology, and sociology, each of which has its own intricate and
unresolved issues. This makes testing a very complex responsibility. The delicacy
of decision making and the intricacy of the different related fields have made
language testing a challenging field.
The challenge has forced the field of testing to grow rapidly in the last few
decades. Developments in different language-related fields have contributed to
swift changes in language testing. To accommodate these developments, several
excellent textbooks have been written by distinguished scholars. From Harris,
1969, up to Bachman, 1991, many people have contributed to capturing the
ever-growing developments in the field.
Then why another textbook on language testing? Of course, the existing
textbooks cover a wide variety of topics in language testing and collectively fulfill
the needs of students. Any one of them, however, would not serve this
purpose individually, for several reasons. First of all, they mostly focus on testing
English as a second language rather than as a foreign language. Second,
each of these books covers the field of language testing from a particular
perspective. And finally, they do not accommodate peculiarities related to the
testing situation in our country, Iran.
Thus, the main motivation for producing this book as an addition to the ones
already in the market was to provide students with a single textbook dealing with
the issues from different perspectives, which would meet most of their needs.
Furthermore, a deliberate attempt is made to gear the text
towards the needs of Iranian students.
Besides, Testing English Skills enjoys certain unique characteristics. The
first is its organization. The book is organized in such a way that it leads the students
from the first to the last stage of language testing. That is, the chapters are organized
to help students develop, administer, score, and interpret the scores of language
tests they have developed.
The second characteristic of Testing English Skills is its lucid style. The
concepts are explained without appealing to pedantic language. Highly technical
treatment of testing concepts is avoided. Attempts are made to communicate ideas to
the readers through plain language, so that a lack of native-like language competence
would not create any serious barrier to the comprehension of the ideas. The third
quality of this book is its scope. A conscious attempt is made to cover a wide range
of topics which would provide students with a fairly comprehensive picture of
language testing. In other words, very little need is left to be fulfilled by an
additional text.
Even with these peculiarities, no claim is made that the book needs no
improvement. Nor is it claimed that it has exhausted all topics. It is, however, hoped
that the book will fulfill a considerable number of students' needs. It is also hoped that
readers' constructive comments will improve the book in future editions.

Dr. H. Farhady
Dr. A. Ja'farpur
Dr. P. Birjandi
June, 1994

1
Preliminaries

1.1 Importance of Testing


Measurement and evaluation have been with us for a long time. Man has always
been concerned with measurement and evaluation. Educators particularly have
been concerned with measuring and evaluating the progress of their students. As
the goals of education have become more complex and the number of students
has enormously increased, evaluation has, accordingly, become much more
difficult. Moreover, educators have always attempted to revise existing programs
and develop innovative ones.
As many people rightly perceive, a lot of responsibility for improving the
society has been placed on the shoulders of educators. Seemingly, for every
existing social problem there exists someone who strongly advocates that the
responsibility for the solution lies within the realm of education.
Education is the most important enterprise in any society. Next to defense, it
is the largest economic enterprise in most countries. At some time, every citizen
is directly involved with education. More than one-fourth of the nation's
population attends school. Education is truly a giant and an important undertaking
and, therefore, it is crucial that its processes and products be evaluated. In fact,
evaluation is a major consideration in any educational setting. Teachers have
always wanted to know how much their students have learned. Also the
government and private sectors which pay teachers and employ the students
afterwards are interested in having precise information about students' abilities.
And, finally, students, teachers, administrators, and parents all work toward
achieving educational goals, and it is quite natural that they would want to
ascertain the degree to which those goals have been realized. Measurement and
evaluation are essential devices to help them achieve
most of these objectives in order to make sound educational decisions.

1.2 Decision Making


The direct involvement of everyone in education means that every person must
at some time make educational decisions. Some educational decisions will affect
a large number of people (for example, the entrance exam to the universities);
others, only a single person (Ali's decision not to review his math book). There
are many decisions that educators must make, and many more that they must
assist individual pupils, parents and the general public to make. The following
examples illustrate just some of these situations:
a) Should Reza be placed in an advanced reading group?
b) Should the school continue using the English textbook adopted this year, go
back to the previous text, or try still another one?
c) Is grammar being stressed at the expense of pronunciation in the first year
of English?
d) Am I doing as well in my English class as I should?
e) Should I go to college?
These are just a few of the types of questions facing educators, parents and
students. When a decision is made, whether the decision is great or small, it
should be based on as much and as accurate information as possible. The more
accurate the information upon which a decision is made the better that decision is
likely to be. In fact, many scholars who study decision making define a good
decision as one that is based on all relevant information.

1.3 Test, Measurement, and Evaluation


The terms test, measurement, and evaluation are sometimes used interchangeably,
but some educators make distinctions among them. The term "test" is usually
considered the narrowest of the three terms. "Test" often connotes the
presentation of a set of questions to be answered. As a result of a person's answers
to such a series of questions, we obtain a measure (that is, a numerical value) of a
characteristic of that person. "Measurement" often

implies a broader sense: we can measure characteristics by means other than
giving tests. Using observations, rating scales, or other devices that allow us to
obtain information in a quantitative form is measurement.
"Evaluation" has been defined in a variety of ways. Stufflebeam, et al.
(1971) stated that evaluation is "the process of delineating, obtaining, and
providing useful information for judging decision alternatives." A second popular
concept of evaluation is interpreted as the determination of the congruence
between performance and objectives. Other definitions simply categorize
evaluation as professional judgement or as a process that allows one to make a
judgement about the desirability or value of a measure.
To evaluate, then, would require that we have a goal or objective in mind. In
education we occasionally gather data that are not measures of specific
educational goals but are gathered to help us make decisions about what goals
should be set or what instructional procedures should be employed to reach the
goals. For example, two students may obtain the same measures (test scores) but
we might evaluate these measures differently. Suppose, at the end of the third
grade of junior high school (Rahnemai school), we have two students who are
both performing English at the third-grade level. However, at the beginning of the
year, one student was performing at the first-grade level, and the other at the
second-grade level. Our evaluations of these outcomes should not be the same.
Obviously, one student progressed at an above-average rate and the other at a
below-average rate.
It is also important to point out that we never measure or evaluate people.
We measure or evaluate characteristics or properties of people: their scholastic
potential, knowledge of English, ability to teach, and so forth. This is not to be
confused with evaluating the worth of a person. Teachers, parents, and students are
recommended to keep this distinction clearly in mind.

1.4 Language Testing


Testing is an important part of every language teaching and language learning
experience. Well-made tests, prepared by teachers or a team of skillful test
makers, can help students in at least two ways.

First, testing will encourage the students and will motivate them in learning
the subject matter. All teachers should do their best to provide positive classroom
experiences for their students through giving tests. Some teachers assume that the
main responsibility of an instructor is to provide good instruction. As a matter of
fact, good instruction cannot do much if it is not accompanied by appropriate
evaluation. Appropriate evaluation provides a sense of accomplishment in the
students and in many cases alleviates students' dissatisfaction, frustration, and
complaints about the educational programs.
Second, testing will help the students prepare themselves and thus learn
the materials. Repeated preparations will enable students to master the language.
They will also benefit from the test results and the discussion over these results.
Also, several tests or quizzes during a given term will make students better aware of
the course objectives. The analysis of the test results will reveal the students' areas
of difficulty and, accordingly, the students will have an opportunity to make up for
their weaknesses. It is generally believed that a better awareness of course
objectives and personal language needs can help the students adjust their personal
activities towards the achievement of their goals (Madsen, 1983).

1.5 Why Test?


Teachers of English as a Second/Foreign Language should be able to explain and
justify their activities in class. This is not possible unless they are capable of
interpreting their test results. A good and appropriate test should provide answers
for the following questions:
a) Has the instruction been successful?
b) Were the materials for instruction at the right level?
c) Have all language skills been emphasized equally?
d) What points need reviewing?
e) Should the same materials be used next year or do they need some
modifications?
Furthermore, a careful analysis of the test results will enable us to improve the
evaluation process itself. Such an analysis will provide answers for the
following questions:
a) Were the test instructions clear?
b) Was the allotted time sufficient?
c) How did the students feel when responding to the items?
d) Were the test results a reflection of the students' performances during the
course?
It is quite obvious that tests can help us improve the quality of instruction in
any educational setting. Both the test-givers and the test-takers benefit from the test
results. No one can ignore the importance of a good test. A good test is not
necessarily one prepared by professional organizations. A test, whether teacher-
made or standardized, can serve useful purposes. To clarify the issue, a distinction
between teacher-made tests and standardized tests follows.

1.6 Teacher-Made Tests Versus Standardized Tests

Teacher-made tests are frequently used to evaluate the students' progress in school. In any
educational setting, students will be exposed to some kind of teacher-made tests.
The value of these test results is recognized by all who are involved in education.
Teachers are obliged to provide their students with good instruction.
Through classroom achievement tests they can measure the efficiency of the
instruction in terms of how effectively their students have been taught.
Schwartz and Tiedeman (1957, p. 110) stated that teacher-made tests are
valuable because they:
1. measure students' progress based on the classroom activities,
2. motivate students,
3. provide an opportunity for the teacher to diagnose students' weaknesses
concerning a given subject matter, and
4. help the teacher make plans for remedial instruction, if needed.
In spite of the usefulness of teacher-made tests, they have always been faced
with students' complaints. These complaints have originated from the ambiguity of
the content of the test and sometimes the irrelevance of such tests to
instructional materials. Students' comments, such as "This was a useless exercise,"
"I didn't know what the teacher was looking for," "I studied the major details of
the course but was only examined on unimportant points and footnotes," are not
uncommon. Therefore, any test must be based on a pre-determined content to
measure the students' knowledge at a given point of time. Such a test could be
prepared by a teacher or a group of professional test-makers.
Teacher-made tests and standardized tests differ in many respects. In order to
facilitate understanding of standardized tests, it would be quite reasonable to
discuss what standardized tests are. They are commercially prepared by skilled
test-makers and measurement experts. They provide methods of obtaining samples of
behavior under uniform procedures. By a "uniform procedure" it is meant that the
same fixed set of questions is administered with the same set of directions, time
restrictions, and scoring procedures. Scoring is usually based on an objective
procedure. However, some standardized achievement tests may also include some
essay-type questions. Compared to teacher-made tests, standardized tests usually
have a wider range of coverage (that is, they cover more material). They are used to
assess either one year's learning or more than one year's learning. On the other hand,
teacher-made tests usually cover a single unit of work or the work of a term. Table 1.1
illustrates some of the major differences between teacher-made tests and
standardized tests.

Table 1.1 Comparison between Standardized and Teacher-Made Achievement Tests

Directions for administration and scoring
  Teacher-Made Achievement Test: Usually no uniform directions specified.
  Standardized Achievement Test: Specific instructions; standardized administration and scoring procedures.

Sampling of content
  Teacher-Made Achievement Test: Both content and sampling are determined by the classroom teacher.
  Standardized Achievement Test: Content determined by curriculum and subject-matter experts; involves extensive investigations of existing syllabi, textbooks, and programs; sampling of content done systematically.

Construction
  Teacher-Made Achievement Test: May be hurried and haphazard; often no test blueprints, item tryouts, item analysis, or revision; quality of test may be quite poor.
  Standardized Achievement Test: Uses meticulous construction procedures that include constructing objectives and test blueprints, employing item tryouts, item analysis, and item revisions.

Norms
  Teacher-Made Achievement Test: Only local classroom norms are available.
  Standardized Achievement Test: In addition to local norms, standardized tests typically make available national, school district, and school building norms.

Purposes and use
  Teacher-Made Achievement Test: Best suited for measuring particular objectives set by the teacher and for intraclass comparisons.
  Standardized Achievement Test: Best suited for measuring broad curriculum objectives and for interclass, school, and national comparisons.

Source: Mehrens, W. A. and Lehmann, I. J., 1973, p. 454.

1.7 Language Teaching and Language Testing


Teaching and testing are so closely interrelated that it is virtually impossible to
work in either field without taking the other into account. Testing is viewed as a
constructive and practical teaching strategy giving learners useful opportunities
for discussion of language choices. As Madsen (1983) states:
"Language testing today reflects current interest in teaching genuine
communication, but it also reflects earlier concerns for scientifically sound tests."
In this section we will briefly discuss some language testing procedures and
how they have been influenced by major language teaching methods at various
times in this century. The sequence of presentation will roughly correspond to
their historical development.

1.7.1 Traditional Tests


These tests are closely related to the grammar-translation method in language
teaching. To be more specific, this relationship was much stronger in the early
stages. This stage of language testing could be called the intuitive stage. Many
teachers with no training in teaching and testing would strongly emphasize
knowing about the language as well as using the language. Consequently,
students had to memorize many language rules and lists of words. The following
are some of the item types:

Example 1
Convert the following statement into the past tense:
"He drives his car fast."

Example 2
Write the main parts of these verbs:
go, buy, ring, lay

Example 3
Make sentences using each of these words:
intelligent, bashful, diligent

Example 4
Translate the following sentence(s)/passage into Farsi:
(A few sentences or a short passage in English)

Traditional tests also include a great deal of writing (composition) and
reading comprehension. These item types are very similar to what is used in first-
language testing. Here are some examples:

Example 5
Write a composition of about 150-200 words on "the importance of
education in third world countries."

Example 6
A passage for dictation

Example 7
A reading comprehension passage. We can construct different types of questions
based on the reading passage: asking for definitions, asking for information or asking
for inferences.

After this stage of testing (i.e., traditional tests), language testing entered a
more scientific stage. We can easily detect the determining impacts of structural
linguistics and behavioristic psychology on both language teaching and language
testing in this era. The findings of these two disciplines suggested that "language
mastery could be evaluated scientifically bit by bit"
(Madsen, 1983). In other words, the behaviorist psychologists would consider
language as a set of habits. And at the same time, the structural linguists would
start analysing the components of language (sounds, morphemes, words, syntax).
As a result, objective tests were devised to measure these different language
elements. One other reason for the development of objective tests was the
notorious unreliability of subjective tests. It is true that we all realize the value of
composition writing as a language exercise, but many teachers are dissatisfied
with this type of subjective assessment. It should be noted, however, that many of
the traditional grammar tests are also objective as far as scoring is concerned.
Let's examine the following test examples and then discuss the questions related
to them:

Example 8
Complete the following sentences, using not more than five words:
"She didn't go to school because ............... "

Example 9
Change the following sentence into a question.
"She works in a library."

Example 10
Use the correct form of the verb in brackets.
He (go) to school before you came.

Example 11
Check the correct form of the verb.
If you need it, he ............. it to you.
a) lends        b) to lend        c) will lend        d) has lent
Example 12
Check the correct sentence.
a) He at school studies English.
b) English at school he studies.
c) He studies at school English.
d) He studies English at school.

The above examples illustrate two types of objective tests. It is obvious that one
type is more objective than the other. In example 8 students are faced with an
open-ended sentence. The scorer might have some assessment problems.
In example 12 students are exposed to three incorrect forms. Presenting
students with incorrect forms is a controversial issue. It is assumed that frequent
exposure to incorrect language forms would result in the fossilization of these
forms. Then to what extent can this be justified?
Some objective tests are obviously open-ended, and others are multiple-
choice. In this section we will only deal with the latter type.

1.7.2 Multiple-Choice Items
Multiple-choice items are the most popular type of objective tests. The students
are presented with three, four, or five alternatives or options (a correct response
and distractors). The student is expected to choose the correct alternative from a
range of answers provided. The student's task is quite simple, and scoring is
easy as well. Consider the following example:

Example 13
He left home ............ ten o'clock.
a) in b) at c) by d) until

This is a typical form of a grammar item. It has been suggested that such an item
measures only a single, or discrete, feature of the grammar.
We observe that multiple-choice tests provide the learner with restricted
contexts, usually no wider than the item context. The testee (test taker) is given no
more information about the speaker or the situation, and consequently, choosing
the appropriate option is difficult. As a matter of fact, many grammatical rules are
not observed in informal situations. It has been noticed that we may be able to find
possible contexts for many of the distractors.

To sum up, it can be said that constructing good test items with reasonable
distractors is very difficult, and unfortunately many of the teacher-made tests are
bad ones. A very good piece of advice suggested by Elizabeth Ingram is as
follows:
"The inexperienced test constructors should first prepare open-ended
items. These items should be administered to some students of the sort he
wishes to test ultimately. The wrong answers provided by the students could
be used as reasonable distractors later on."
It should also be noted that uncommon and implausible distractors are
dangerous instruments to be used in language testing. Frequently, many types of
multiple-choice tests expose students to a lot of unlikely errors, where more
language is wrong than is right. Consider the following examples:

Example 14
The teacher ............ me what to do.

a) told        b) suggested to        c) explained to        d) said

Example 15
How is everything?
a) Not so good.        b) All right.        c) Thank you so much.        d) Everything is all right.

The examples given have all provided unrelated sentences. It is also possible
to provide much longer contexts in which many test items can be embedded. As
we mentioned before, the behaviorists view the process of language learning as
learning a set of habits. This complex network of habits can be built up step by step.
In addition, many of the language tests prepared by teachers are intended to examine
these linguistic components separately. These linguistic components are thought to
constitute language skills. Thus, we may make tests that can measure the four
traditional language skills and the very components related to them. Many of these
discrete-item tests, as we have already discussed, are of the multiple-choice type. A
detailed discussion on the construction of such item types and the measuring of
different skills will be presented in later Chapters.
In the later stages of language testing development a number of tests have
begun to move toward global testing. In other words, these types of tests make more
comprehensive demands on the language learner. Two very popular types of global
tests are dictation and cloze. At this point, we would just mention that the use of
dictation has always been controversial. On the other hand, a cloze test (the term
taken from Gestalt psychology) is based on a passage with some deleted words.
Providing these deleted words requires perceptive and productive skills and a sound
knowledge of lexical and grammatical systems. Besides taking advantage of all
linguistic clues for answering a cloze test, the student should rely on some other
contextual clues as well.
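
To make the procedure concrete, the sketch below illustrates one common way a cloze passage can be produced: after an intact lead-in, every nth running word is deleted and recorded as the scoring key. This is only an illustrative sketch, not a prescribed procedure; the deletion ratio, the length of the lead-in, and the sample passage are all invented for the example.

# Illustrative sketch of fixed-ratio cloze construction. The deletion
# interval and lead-in length are arbitrary choices for this example;
# a real test constructor would also handle punctuation more carefully.

def make_cloze(passage, nth=7, lead_in=10):
    """Return the cloze text and the list of deleted words (the key)."""
    words = passage.split()
    cloze_words, key = [], []
    for i, word in enumerate(words):
        # Leave the first `lead_in` words untouched so the examinee can
        # establish the context before meeting the first blank.
        if i >= lead_in and (i - lead_in) % nth == nth - 1:
            key.append(word)            # deleted word goes into the key
            cloze_words.append("_____")
        else:
            cloze_words.append(word)
    return " ".join(cloze_words), key

text = ("The seasons of the year are spring, summer, fall, and winter. "
        "Each season lasts about three months and brings its own weather.")
cloze_text, answer_key = make_cloze(text, nth=5, lead_in=8)
print(cloze_text)
print("Key:", answer_key)

Restoring the deleted words then demands exactly the mixture of linguistic and contextual clues described above.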

1.7.3 Testing Communication


Language teaching is becoming increasingly concerned with communication and its
objectives are being reassessed. Language testing, however, has to some extent
failed to develop techniques for measuring effectively the use which is made of
language in a communicative situation. Many of the procedures are designed to
measure students' ability to manipulate the grammatical and phonological systems
of a language. Nevertheless, tests can be constructed to enable learners to
manipulate language functions, and to identify utterances as belonging to a certain
function of language on account of their appropriateness. This movement, referred
to as "the functional approach", will be discussed later.
As a matter of fact, there is more to language than strings of words.
Authentic language consists of more than either words or grammatical rules or
arrangement of these words into sentences. One of the (now outgrown) misconceptions
about language is that successful language usage would lead to successful language
use. This is by no means true, because linguistic aspects of language are only one
part of the communication process. For instance, when language is embedded in a
social context, it conveys meanings other than those present in the sentence in
isolation. The following example will clarify this
point:
Imagine a scene where you, with your arms full of books, intend to leave the
room. The door is closed and you need some help. So you address one of your
students and tell him:
"Do you think you could open that door?"
Student (without moving to open the door):
"Yes, I do."
This student has not responded in the way you had hoped he would. This
example may seem to be evidence of miscommunication of meaning but it
demonstrates that the problem does not lie in the misuse of words or in something
ungrammatical in the exchange. We all know that in the grammar of individual
sentences, a question is a request for information, and that is exactly how your
student has treated it. Yet, we all know that questions are in fact one of the most
common ways of making a polite request.
In conclusion, the present literature in TEFL indicates that the teaching of
English for communication, especially communication for specific purposes,
differs from more traditional approaches. But to what extent is this true of testing
the communicative ability of our students? Some scholars in the field feel that
testing is lagging behind.

Activities
1. What purposes do you consider the most useful in testing students' language
ability?
2. To what extent are your teachers using the behaviorist approach in testing your
language ability?
3. What are the merits and demerits of composition as a test?
4. To what extent is it true to say that most teachers use dictation simply as a
means of testing spelling?
5. Read the following information and the conversation between Mr. Ahmadi and
Mr. Hatch, who has applied for a teaching job. Then answer the questions. As
all responses are possible, score them on a scale of 1 to 3. Give your first
choice 3 points, and so on.

Mr. Ahmadi: Good morning, Mr. Hatch. Please sit down.
Mr. Hatch: 1. Good morning, Mr. Ahmadi. Thank you.
           2. Good day to you, sir. You are most kind.
           3. Morning. Thanks a lot.
Mr. Ahmadi: I have your application form here in front of me, but I'd like
to ask you a few questions about your experience.
Mr. Hatch: 1. Why not? Just go ahead.
           2. Please don't hesitate; I shall be most pleased to answer them.
           3. (No linguistic response; smiles nervously)
Mr. Ahmadi: I believe you do some EFL teaching in the evenings?
Mr. Hatch: 1. That's right. I teach two lots, a group of kids and a bunch of
              older characters.
           2. Indeed, you are quite correct. I have accepted the responsibility
              for two groups at rather different levels of ability.
3. Yes, and I'm finding it very interesting. I teach two evenings a
week at the institute, an elementary class on Tuesdays, and a
fairly advanced class on Thursdays.
Mr. Ahmadi: Which level do you prefer?
Mr. Hatch: 1. Well, it's all the same, isn't it?
2. I find the advanced group more challenging, and feel it's a more
useful experience for me at the moment.
3. It is my sincere belief that the elementary group is in greater
need of my endeavours.

a. How would you label the three choices? If one is about right or fairly
appropriate, what are the others?
b. What impression would you say Mr. Hatch would be likely to make on the
inspector if he consistently used a) your first choice b) your second choice and
c) your third choice?
c. What are some of the formal characteristics of your third choices?
d. Can you think of communicative situations in which your second and third
choices would be appropriate?
e. One of the choices of Mr. Hatch is to say nothing: linguistically, a zero
response. How can such a response be justified?

2
Functions of Language Tests

2.1 Introduction

As mentioned before, a test is an instrument for collecting numerical information on an
attribute. The purpose of collecting data is to determine the degree of existence of the attribute. For
measurement purposes, the nature of the instrument through which information is collected, and the
nature of the attribute upon which measurement is carried out, should be clearly specified. This
implies that there should be a close, and in some cases a one-to-one correspondence between the
instrument and the attribute. To secure this relationship, the purpose of the test and the kinds of
decisions made on the basis of test scores should be clearly determined. This chapter focuses on the
purpose or function of a test.

The function of a test refers to the purpose for which a test is designed. A test user should
clearly identify the function for which a test is to be used. Otherwise, employing a test for
inappropriate purposes would lead to making unjustified decisions and thus to undesirable
consequences. Since language testing follows the principles of educational measurement, theoretical
and practical regulations governing educational measurement should be utilized in language testing as
well. According to these principles, tests serve two major functions: prognostic and evaluation of
attainment. Prognostic tests, which include placement, selection, and aptitude tests, are not directly
related to a particular course of instruction. Evaluation of attainment tests, on the other hand, are based
on a clearly specified course of instruction and include achievement, proficiency, and knowledge
tests. These categories, illustrated in Figure 2.1, will be discussed in detail.

Figure 2.1 Functions of Language Tests: FUNCTION branches into Prognostic (selection, placement, aptitude) and Evaluation of Attainment (achievement, proficiency, knowledge).
2.2 Prognostic Tests

The word "prognostic" means predictive, Tests developed for prognostic purposes are used to
predict the future course of action about the examinees. The scores on prognostic tests are used to
make decisions about the most appropriate channel of educational or occupational career for the
testees. More specifically, prognostic measures enable authorities to find answers to questions
dealing with appropriate major fields of study, learning suitable foreign languages, and future
occupation of the students. Questions referring to future actions about the examinees indicate that
prognostic tests are not related to their educational backgrounds. The focus of prognostic tests is to
make sound decisions about the future success of the examinees on the basis of their present
capabilities. The major categories of prognostic tests, i.e., selection, placement and aptitude, are
discussed below.

2.2.1 Selection Tests

The purpose of selection tests is to provide information upon which the examinees'
acceptance or non-acceptance into a particular program can be determined. The program does not
necessarily refer to an educational one. It applies to all sorts of institutions dealing with accepting or not
accepting a group of applicants. For example, to obtain a driver's license, applicants take a selection
test and they either pass or fail. The criterion for pass or fail is determined by the authorities. Another
example of tests used for selection purposes can be given in an occupational area. Applicants usually
take a selection test in order to demonstrate their capabilities for employment. For instance, to
determine whether a typist is qualified or not, she is required to type something as a sample. If she
meets the criterion, say 300 words per minute with at most three mistakes, she will be considered a
qualified typist and given a certificate. In educational environments too, selection tests are used in
various situations. For example, noncredit courses are evaluated on a pass or fail basis through
selection tests. Those who obtain an already determined score pass, and others fail the course. In fact,
the education system in Iran operates on a pass-fail procedure with the criterion of 50 percent
achievement.

An important factor in using selection tests is that when a passing score is determined, there
should not be any limitation for those who obtain that score. In reality, however, due to
administrative restrictions and limitations, admitting all applicants who pass a selection test is not
possible. Apart from a few occasions, such as conferring degrees and issuing certificates, in most
cases many applicants, even with scores higher than the required standard, would not be admitted.
That is, the number of applicants passing a test exceeds the capacity of the educational programs. In
such cases, the selection test becomes a competition test.

Through a competition test, the scores of applicants are ranked from the highest to the lowest.
Then the selection starts from the highest score and moves towards the score which belongs to the
last person who can be admitted. The applicants following the last person admitted may have met the
criterion and sometimes have scores well above the criterion; however, they are not admitted because
the educational institution does not have facilities to accommodate them. In such situations,
applicants compete rather than meet a criterion.

In order to avoid such limitations, two options are possible. The first is to increase the
facilities in order to admit more applicants. The second deals with modifications of the passing
criteria. If the applicants are quite knowledgeable in a given area, the difficulty level of the test can
be increased.

In the former situation, the possibility of admitting a greater number of applicants would be
increased, whereas in the latter, the number of applicants who can pass the test will be limited to a
number for whom facilities can be provided. A very good example of a competition test is the Entrance
Examination for universities in Iran. If universities were capable of accommodating all applicants
who pass the examination, the examination would serve the selection purpose. However, due to the
excessively high number of applicants, selection procedures are changed into competition
procedures. Through competition tests, applicants are selected from among those who receive the
highest scores on the test up to the person for whom universities can provide educational
opportunities. The last person's score, no matter how high it is, would be the passing criterion.
An inconvenient consequence of competition procedures is the fluctuation of the criterion
for pass and fail. Depending on the overall performance of examinees and the total capacity of the
universities, an individual with the same performance may obtain a passing score in one
administration but not in another. As mentioned before, one way to obviate the problem is to
modify the standards, i.e., raise or lower the criterion. Another possibility would be to increase the
capacity of educational institutions to accommodate all who pass the test.
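
The logic of the shift from selection to competition can be made concrete in a few lines of code. The sketch below is purely illustrative: the scores, the criterion of 50, and the capacity of three seats are invented numbers, not data from any actual examination.

# Illustrative sketch of a selection procedure turning into a
# competition procedure. All figures below are invented examples.

scores = {"A": 92, "B": 74, "C": 68, "D": 61, "E": 55, "F": 43}
criterion = 50   # nominal passing score (e.g., 50 percent achievement)
capacity = 3     # seats the institution can actually provide

# Selection: everyone who meets the criterion passes.
passed = [name for name, s in scores.items() if s >= criterion]

# Competition: rank the passers from highest to lowest and admit
# only as many as the capacity allows.
ranked = sorted(passed, key=lambda name: scores[name], reverse=True)
admitted = ranked[:capacity]
effective_cutoff = scores[admitted[-1]]  # score of the last person admitted

print("Met the criterion:", passed)                   # A, B, C, D, E
print("Admitted:", admitted)                          # A, B, C
print("Effective passing score:", effective_cutoff)   # 68, not 50

Note that applicants D and E meet the nominal criterion yet are excluded, and that the effective cutoff (68 here) would shift with every new group of examinees. This is precisely the fluctuation problem described above.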

2.2.2 Placement Tests

Placement tests are used to determine the most appropriate channel of education for examinees. In
contrast to selection tests, there is no pass or fail in placement tests. The purpose of placement tests is
merely to measure the capabilities of an applicant in pursuing a certain path of language learning.
Through appropriate placement and efficient instruction, all applicants should be helped to reach a
certain level of language ability which is required to meet a pre-determined criterion.

Placement tests are one of the most-frequently used tests at different levels of language
instruction. As an example, suppose that students in the English department of a university are
required to have a good command of English in order to be allowed to pursue their academic courses.
It is possible that not every admitted student possesses the desired language ability. Therefore, the
department administers a placement test. Suppose further that, according to empirical data obtained
by the English department, it is determined that a score of 80 on the placement test is a convincing
indication of students' capability in taking academic courses. In other words, students who receive a
score of 80 or higher do not need any additional language instruction. This implies that students
whose scores are below 80 should take extra language instruction in order to improve their language
ability.

Such students could be classified in different ways. Depending on the departmental policy,
they could be placed in three groups: those who received scores between 60 and 79; those who
obtained scores between 40 and 59; and those who received scores below 40. This classification is
illustrated in Figure 2.2.

As shown in Figure 2.2, students in Group 3 are far from the criterion, and thus need more
instruction to be able to meet the requirements of the department. In comparison, students in
Group 2 need less instruction than those in Group 3. In other words, the farther a student's score is
from the criterion, the more instruction he will need.

Figure 2.2 Classification of students by placement test scores: scores of 80 and above meet the criterion; Group 1 (scores 60-79), Group 2 (scores 40-59), and Group 3 (scores below 40) stand increasingly far from it.

Clearly, then, it would be an erroneous policy to administer a placement test, group the students into
different classes, and then give them all the same instruction through the same materials. The purpose of
placement procedures is to help those who need more instruction. Therefore, the different ability levels of the
groups should be of primary importance in teaching the materials. The objectives of instruction at these
levels should center around alleviating the existing differences among the groups and helping the students
reach the criterion level. If they receive the same instruction, the distance between the groups will remain
unchanged and thus the purpose of the test will not be fulfilled.

To achieve the objective, two different, but not necessarily mutually exclusive, methods can be
applied. The first method regards the length of instruction. Those at the elementary level should receive
instruction for a longer time than those at the intermediate or advanced levels. For example, the
elementary group would receive instruction for three semesters, the intermediate group for two semesters,
and the advanced group for only one semester.

The second method concerns the intensity of instruction. That is, students at the elementary level
would receive more intensive instruction than those at the intermediate or advanced levels. For example,
students at the elementary level would receive instruction for 10
hours a week, those at the intermediate level for 6 hours a week, and those at the advanced level for 3
hours a week.

Depending on the educational setting, one of the above-mentioned methods, or both of them
at the same time, should be applied. Due to certain time limitations, however, intensive instruction is
more favorable than lengthy instruction. No matter what method is utilized, the differences between
the students' levels of knowledge should be diminished as much as possible. Otherwise, the placement test,
which is used to prepare students for further studies, will not serve its purpose.
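
To summarize the placement policy in concrete form, the sketch below assigns a student to a group by score band and attaches the corresponding intensity of instruction. The score bands (80, 60-79, 40-59, below 40) and the weekly hours (3, 6, 10) are taken from the examples in this section; equating Group 1 with the advanced level and Group 3 with the elementary level follows the discussion above, and the function itself is only an illustration of one possible departmental policy.

# Illustrative sketch of a placement policy. Score bands and weekly
# hours follow the examples in this section; the group-to-level
# correspondence is the one implied by the discussion above.

def place(score):
    """Return (group label, hours of instruction per week)."""
    if score >= 80:
        return "criterion met, no extra instruction", 0
    elif score >= 60:
        return "Group 1 (advanced)", 3
    elif score >= 40:
        return "Group 2 (intermediate)", 6
    else:
        return "Group 3 (elementary)", 10

for score in (85, 72, 47, 31):
    group, hours = place(score)
    print(f"score {score}: {group}, {hours} hours per week")

The point of the sketch is that the farther a score falls below the criterion, the more intensive the assigned instruction becomes, which is exactly the principle stated above.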

2.2.3 Aptitude Tests

Aptitude tests are used to predict applicants' success in achieving certain objectives in the
future. In order to take an aptitude test, the examinee does not need to have prior knowledge of the
subject being tested. In language education, through aptitude tests, one can determine which language
the examinee is likely to learn more easily and successfully than other languages.

Aptitude tests can contribute to making decisions on the future career of the applicants as well.
These tests have wide applications, ranging from determining students' major fields of study, to
predicting appropriate occupations for applicants, and to making suggestions on learning a particular
language in the future. For example, through aptitude test scores, it would be possible to predict the
extent to which a person can fulfill certain responsibilities in the future, i.e., how good a pilot, an
engineer, or a teacher one can be.

Of course, developing aptitude tests is a very delicate and time-consuming task. Before
making any suggestion on the basis of test scores, the degree of their predictive power should be
experimentally determined. Only after several trials can one make a sound prediction on the basis of
aptitude tests. Otherwise, using unestablished tests would provide invalid and sometimes misleading
information about the phenomenon being tested.

2.3 Evaluation of Attainment Tests

The second major category of test functions comprises tests serving the purposes of
evaluation of attainment. In contrast to prognostic tests, measures of attainment deal with the
extent to which examinees have learned the materials they have been taught. This implies that these
tests are more directly related to educational settings than those of prognostic functions. Measures of
attainment include achievement tests, proficiency tests and knowledge tests. Each will be discussed
in turn.

2.3.1 Achievement Tests

Tests used for achievement purposes are designed to measure the degree of students' learning
from a particular set of instructional materials. These tests play a crucial role in educational
environments because most classroom tests fall in this category. Final, midterm and classroom
examinations are but a few examples of achievement tests.

Most achievement tests are made by classroom teachers because these tests should be
based on the materials taught in the classroom. There exist, of course, some standardized
achievement tests which are developed by professional organizations. Most of these tests deal with a
body of knowledge that the examinee is supposed to achieve through a course or courses of study.
Such tests are called general achievement tests. For example, a test to measure students' achievement
in the first year of high school, in the first three years of secondary school, or in the seven years of
secondary school would be a general achievement test.

However, teachers and administrators sometimes intend to measure the degree of students'
achievement on a particular topic. These tests, which are aimed at measuring, specifically and
purposefully, the detailed elements of an instructional topic, are called diagnostic achievement tests.
Diagnostic achievement tests are used to determine, unambiguously, the strengths and weaknesses of
the examinees in a particular course of study. For example, a test of English developed on the basis of
materials being taught, and designed to measure students' overall achievement in a particular English
language class, would be considered a general achievement test of English. Such a test would include
subtests of structure, vocabulary, reading comprehension and some others related to the materials
covered in the course. On the other hand, a test designed to demonstrate strengths and weaknesses of
students in a specific area, such as structure, would be considered a diagnostic test.

Through achievement tests, especially diagnostic achievement tests, teachers and educators
would evaluate the extent to which their instructional program has helped the students
achieve the objectives of the program. By collecting information on the weaknesses and strengths of
the learners, teachers would modify the instructional procedures to meet the objectives by focusing
on the students' weaknesses. Achievement tests, then, can be used for both instructional and
evaluation purposes.

A major practical advantage of achievement tests over other tests is their usability before an
instructional program starts. Through diagnostic achievement tests, teachers would collect
information to determine potential problems of the students. Then, on the basis of such information,
they would plan instructional activities which would alleviate the students' problems. Teachers can
also modify instructional programs to meet the students' needs in terms of determining what elements
they should emphasize in the class.

2.3.2 Proficiency Tests

In contrast to achievement tests, which focus on measuring students' achievement of the
materials covered within an academic course, proficiency tests are used to measure the overall
language ability of the learners. Proficiency tests are designed to measure a) the degree of knowledge
a learner has accumulated through his language education, b) the degree of his capability in language
components, and c) the degree to which he is able to practically demonstrate his knowledge of language
use. In a proficiency measurement situation, there is no interest in the ways through which the learner has
achieved or accumulated a certain body of knowledge. What language proficiency measures focus
upon is determining the extent of examinees' ability to utilize language at the time of examination.
How the examinee has acquired that degree of capability is not of much significance to the examiner.
Language proficiency tests have wide applications in the world. Most universities use proficiency
tests for admission purposes. A good example of language proficiency tests is the Test of English as a
Foreign Language (TOEFL), which is used worldwide to determine applicants' ability to pursue their
higher education in English-speaking countries.

In spite of their wide applications, construction of language proficiency tests is more difficult
in comparison to the development of other types of tests. One major reason for the
difficulty centers around the complexity of the term "proficiency". It is not an easily definable term.
Scholars have provided different definitions for proficiency. For instance, Briere (1972) defines it as
"the degree of competence or capability in a given language demonstrated by an individual at a
given point in time independent of a specific textbook, chapter in the book, or pedagogical
method" (p. 332).

Such definitions complicate the issue because most of the terms used in the definition,
including competence, at a given point in time, and demonstration of knowledge, are ambiguous by
themselves. The term competence could refer to linguistic, social, or other types of competencies
related to language proficiency (Farhady, 1980). The term capability could refer to the ability of the
examinee to recognize, comprehend, or produce language elements. In addition, at a given point in
time, a language learner may be a listener, a speaker, or both. And finally, demonstration of ability could
be in various modes of language, including oral, written, or both.

An improvement over this definition is provided by Clark (1972), who states: "The use of
language for real life purposes without regard to the manner in which that competence was acquired.
Thus, in proficiency testing, the frame of reference shifts from the classroom to the actual situation in
which the language is used." (p. 5)

In this definition, too, the way language learning has taken place is ignored. However, it
seems an oversimplification to define a concept without taking all relevant parameters into account.
The assumption that language proficiency is independent of the way it is obtained has been
questioned. Farhady (1983) demonstrated that language proficiency was not, in fact, independent of
educational programs through which the learners had achieved language competence. He concluded
that performance on language proficiency tests was closely related to students' educational
background, major field of study, sex, and nationality. This means that the term "proficiency," as
defined by authorities, has not, as yet, been comprehensively defined. It also justifies the point that developing
a language proficiency test is more difficult than constructing other kinds of tests. For the purposes of
this Chapter, however, it seems sufficient to go along with most scholars who agree that language
proficiency tests should be designed to measure examinees' ability in a particular area of competency
in order to determine the extent to which they can function in a real language-use situation.
2.3.3 Knowledge Tests

Knowledge tests are used in situations where the medium of instruction is a language other
than the learners' mother tongue. In these cases, the second language is used as the language of the
tests to measure the examinees' knowledge in areas other than the language itself. For example, a
physics test or a psychology test written in English would be considered a knowledge test.
Knowledge tests are not of primary importance to language testers because these tests are not
designed to measure language abilities. They are designed to measure examinees' knowledge of a
scientific subject through a second language. Therefore, in countries where education is not carried
out in a foreign language, these tests have no particular role. For example, in the educational system
of Iran, knowledge tests are used only in foreign language departments to measure the students'
knowledge in subject matter areas. In other majors, the knowledge tests are developed in the students'
native language.

Activities

1. What is the difference between an achievement test and a proficiency test?

2. A language test consists of structure, vocabulary, listening comprehension, reading


comprehension, and composition. By examining the content of such a test, how is it possible to
determine whether it is an achievement test, a placement test, or a proficiency test?

3. Think of items which would be appropriate for an aptitude test of the Farsi language.

4. Consider the definition of language proficiency. Do you agree with Briere's definition? Why or why
not? If you do not agree, what factors do you think would influence the language proficiency of
learners?

5. One solution for avoiding competition tests is to raise the standard of the pass-fail point. Do you think
this solution would help administrators in using selection tests? What other solutions can be offered?
3
Forms of Language Tests

3.1 Introduction
The very first minute an examinee encounters a test, its appearance catches
his attention. The examinee's first impression of the test is quite important. It may
encourage or disappoint him. He may find the appearance of the test harmonious
with or contrary to his presuppositions about it. As an example, consider an
examinee having to answer a set of true-false questions for a composition test.
What he might have in mind for such a test could be anything but a set of true-
false questions. Even if true-false questions can serve the purpose of measuring
one's writing ability, such a test will, in this case, put the examinee in an
unexpected situation. Thus, in preparing a test, teachers and educators should
consider its form as an important factor.
The form of a test refers to its physical appearance. Considering the nature
and varieties of attributes, language testers should utilize the most appropriate
form of the test which would correspond to the nature of the attribute to be
measured. For example, a test in written form would not be appropriate for
measuring the examinees' listening comprehension ability. Nor would it be
appropriate to evaluate one's knowledge of vocabulary through a test in oral
form. Therefore, to decide on the form of a test, certain parameters such as the
nature of the attribute and the function of the test should be taken into account.
A test, of course, consists of certain items, and the form of a test is
determined by the form of the items comprising it. Thus, to provide a
comprehensive discussion of the test format, an explanation of the structure of an
item, as the building block of a test, seems warranted. After explaining
the concept of the item, different classifications of item formats will be presented
and their advantages and disadvantages will be pointed out. And finally, a
classification of the item format will be suggested.

3.2 The Structure of an Item


An item, the smallest unit of a test, consists of two parts: the stem and the
response. The purpose of the stem is to elicit information from the examinee. The
stem can be presented as a question, as a statement, or as other varieties of
linguistic constructions. The following is a list of examples of stems.
1. Explain the function of language tests.
2. What is the purpose of a placement test?
3. Discuss advantages and disadvantages of proficiency tests.
4. An aptitude test refers to ..................... .
The purpose of the above-mentioned stems is to make examinees provide the
examiner with information.
The second part of an item is the response, which refers to the information
elicited from the examinee. The response can range from recognizing a single
word to producing a comprehensive essay presenting a discussion on or an
explanation of a complex issue. Depending on the nature of the attribute to be
measured, and depending on the learning objectives to have been achieved by the
examinee, language testers should decide on eliciting information through
appropriate forms of tests. For instance, sometimes the examinees are required to
produce certain pieces of information. On other occasions, they are provided
with a set of responses and are required to select the correct response from
among given alternatives. The former case calls for stems which would make
examinees produce the response, while the latter case requires stems which
would measure the examinees' comprehension ability.
The latter type of activity can be carried out through the well-known
multiple-choice format. In multiple-choice items, the most frequently used
forms, the stem is followed by three, four, or five responses. These responses are
also called alternatives, options or choices, one of which is the correct

response, and the others are called distractors. The following example demonstrates
the components of a multiple-choice item.

An item ........... of two parts.
a) consists          (the correct response)
b) consisting
c) will consist
d) has consisted
Alternatives (a) through (d) together constitute the responses, also called
alternatives, options, or choices; (b), (c), and (d) are the distractors.
The example clarifies the difference between an alternative and a distractor.


An alternative may or may not be the correct response; i.e., alternatives include
both the correct response and the distractors, whereas distractors consist of only
wrong alternatives.
With the structure of an item presented, the reader should keep in mind that the
format of a test is determined by the format of the items comprising the test.
Therefore, any classification of item format can be considered a classification
of test format as well.
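
The terminology introduced above can also be summarized in a small data structure. The sketch below is illustrative only; the class and field names are invented, and the example simply mirrors the definitions: the alternatives comprise the correct response plus the distractors, and objective scoring reduces to comparing the examinee's choice with the keyed response.

# Illustrative sketch of the structure of a multiple-choice item.
# Class and field names are invented; they mirror the terminology
# introduced above (stem, alternatives, correct response, distractors).

from dataclasses import dataclass

@dataclass
class MultipleChoiceItem:
    stem: str                 # elicits information from the examinee
    alternatives: list[str]   # the correct response plus the distractors
    key: int                  # index of the correct response

    @property
    def distractors(self):
        """All alternatives except the keyed (correct) response."""
        return [a for i, a in enumerate(self.alternatives) if i != self.key]

    def score(self, choice):
        """Objective scoring: 1 point if the keyed response was chosen."""
        return 1 if choice == self.key else 0

item = MultipleChoiceItem(
    stem="An item ........... of two parts.",
    alternatives=["consists", "consisting", "will consist", "has consisted"],
    key=0,
)
print(item.distractors)   # ['consisting', 'will consist', 'has consisted']
print(item.score(0))      # 1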

3.3 Classification of Item Forms


There has been some confusion about the test format in the field. One source of this
confusion originates from using different terminologies to refer to similar or
identical concepts. For example, words such as supply, fill in the blank, and
completion have been used to refer to a single form of item. Another source of
confusion originates from different views taken by scholars in interpreting the
concepts. For instance, classifying tests into objective or subjective forms stems
from the differing schools of thought on the nature of items. Despite existing
similarities in classification, each has its own advantages and disadvantages.
Therefore, the following section is devoted to clarifying the issue by presenting and
critically examining different categorizations and then suggesting a comprehensive
classification.

3.3.1 Subjective vs. Objective Items


Ever since the era of the grammar-translation method of language teaching,
translation tests have been used as a major technique of testing. These tests required
the examinees to translate a passage or a set of sentences from one language into
another. Although the content of such tests seems relevant to the materials to be tested
and in most cases is a valid representation of the materials, the scoring procedures of
these tests have not been systematic. The scoring of translation tests requires a great
amount of time and energy. In addition, the fluctuation of scores from one scorer to
another creates a serious problem for the consistency of translation test scores.
Since the scoring procedures for such tests did not follow any objective criteria, they
have been classified as subjective tests.
To compensate for the inadequacies of subjective tests on the one hand, and with
the application of psychometric principles to language tests on the other, scholars
moved towards easily and consistently scored tests. To contrast with subjective tests,
the newly developed tests were classified as objective. Multiple-choice and true-false
types of tests are popular kinds of the so-called objective measures.
It should be pointed out that the contrast between subjective and objective tests,
which stemmed from the difference between translation-type tests and multiple-choice
tests, has led the field to a common misunderstanding. The misconception originates from
the fact that objectivity or subjectivity refers to the way a test item is scored and has
little or nothing to do with the form of a test. For instance, it would be a
misunderstanding to assume that all composition-type tests are subjective or all
multiple-choice type tests are objective. The following examples clearly illustrate the
irrelevance of form to subjectivity or objectivity of a test.

Example 1
There are four .............. in a year.

a) seasons c) weeks
b) months d) fortnights

Example 2
There are four seasons in a year.
true false

Example 3
How many seasons are there in a year?
answer: 4

Example 4
In Summer, the weather is usually hot here.
true false

Example 5
Name the seasons of the year.
Answer: Spring, Summer, Fall, Winter

Example 6
Translate the following sentence into Farsi:
The seasons of the year are Spring, Summer, Fall, and Winter.

All these items, regardless of their form, are objective because they have
concrete, verifiable and straightforward answers upon which the scorers can
hardly exercise their personal tastes or attitudes. However, the following items,
which are identical to the above-mentioned items in form, would not be
considered objective.

Example 7
There are four ............ seasons in a year.
a) beautiful c) uninteresting
b) tiring d) exciting

Example 8
There are four beautiful seasons in a year.
true false

Example 9

How many exciting seasons are there in a year?

Example 10

In Summer, the weather is usually pleasant here.

Example 11
Which season do you like best?

The two sets of items are very similar in form. While each item in the first set
(examples 1-6) is an objective one, its counterpart in the second set (examples 7-11)
is quite subjective. In other words, the answers to the items in the first set are
such that almost everybody would agree on a particular response. However, the
answers to the items in the second set depend, to a great extent, on the tastes,
attitudes, likes and dislikes of the individuals scoring the items. Thus, classifying
items as subjective or objective would not account for the form of the items
because objectivity or subjectivity refers to the way items are scored. From two
items with identical formats, one could be quite objective and the other quite
subjective. Therefore, a different classification is required to avoid the
shortcomings of the objective/subjective dichotomy.

3.3.2 Essay-Type vs. Multiple-Choice Items


Since classifying items as subjective or objective proved inadequate in accounting
for all item forms, scholars attempted to develop a new approach. They classified
the items into two categories: essay-type and multiple-choice. Essay-type items
refer to all kinds of items in which the examinee is required to produce language
elements. Multiple-choice items, on the other hand, refer to all items in which the
examinee is required to select the correct response from among given alternatives.
Although this classification was an improvement over the previous one, it has
certain shortcomings. Consider the following examples:
1. The students ..................... playing now.
2. The students ...................... in the park everyday.
3. What is the past tense of "go"?
4. What is the passive form of "He wrote a book."?
5. Change the following sentence into indirect form:
Hossein said, "I'm going to start my graduate studies in physics at Tehran
University."

6. Write a short paragraph about "Your Education".
7. Discuss the advantages and disadvantages of living in big cities.
8. Compare and contrast two educational systems in Iran.
All of the above-mentioned items, ranging from producing a single word to producing a
comprehensive explanation of a concept or topic, would be considered varieties of
essay-type formats. Since each of these items requires a certain type of activity, they
cannot be classified under the same category. Clearly, then, classifying the tests as essay-
type does not seem satisfactory because an essay has a fixed format. Producing a single
word cannot logically be considered similar to writing an essay. Therefore, further
modifications were needed in classifying item forms.

3.3.3 Suppletion vs. Recognition Items


To avoid the defects of the previous classification, new terms were coined without any
substantial changes in the principles of categorization. The terms "objective" or
"multiple-choice" were replaced by the word "recognition". Recognition form items
only required the examinees to recognize the correct response from among the
alternatives provided for each stem. On the other hand, the term suppletion, and
sometimes completion, replaced the former terms, i.e., essay-type or subjective.
Suppletion or completion form items required the examinee to supply the missing
part(s) of the stem or complete an incomplete stem.
Although this classification improved the previous ones, it still suffered
from the same or similar inadequacies. Consider the following examples:
1. Hossein will ........................ his education abroad. (continue)
2. Parviz will ....................... .
3. Write a paragraph about "the process of reading".
All of the above items are suppletion or completion types. Nevertheless, the degree of
production and the way they will be scored are not clear. Except for the first item,
whose answer is clear, the others have shortcomings similar to those of essay-type
items. Thus, this categorization could not account for all varieties of items. To obviate
some of the ambiguities involved in the classification of items, and to shed some
light on the issue, a new classification is
suggested below. It is hoped that the new approach will be comprehensive both
theoretically and practically.

3.4 Psycholinguistic Classification


It should be noted that the form of an item cannot be determined without a critical
examination of the nature of the item. Calling an item multiple-choice form, completion
form, or any other name one may wish, without identifying the process through which
the item is answered, would not result in a reasonable classification. Therefore, in the
classification suggested below, the form of the item is determined by taking theoretical
principles of language processing into account.
The present classification is called psycholinguistic because it assumes
psychological and linguistic principles as the underlying theoretical assumptions of item
formats. It attempts to benefit from the principles of psychology because responding to
an item, by and large, requires certain psychological processes. Furthermore, the
suggested classification involves linguistic theories because an item which is presented
and responded to in a certain form of language involves linguistic elements. Regarding
psychology, it is clear that from the very first moment of encountering a single item,
psychological processes such as perception, identification, recognition, comprehension,
analysis and production of written or oral materials are, in one way or another, utilized.
Therefore, one criterion for determining the item form should concern the psychological
processes required for answering that item. Considering linguistic principles, language
can be manifested in three different ways, referred to as the modality or instrumentality.
Modality, dealing with the ways through which language is manifested, includes oral,
written, and pictorial modes. In other words, language can be manifested verbally or
non-verbally. Verbal manifestation includes oral and written forms and non-verbal
manifestation includes all sorts of graphic devices. The following graph illustrates this
point:

Modality
  - Verbal: oral, written
  - Non-verbal: pictorial (pictures, maps, graphs, etc.)

Thus, the other criterion for determining item forms should involve the modality in
which the item is presented.
The psycholinguistic classification, then, would be two-dimensional. One
dimension would determine the psychological processes involved in answering a
particular item and the other dimension would concern the modality of language
through which the item is presented. To avoid complications, two major
psychological processes, which are crucial to language processing, namely,
recognition and production, can be considered the two extreme poles of language
processing tasks, and other psychological processes such as comprehension, or a
combination of comprehension and production would be placed in between.
Of course, recognition is a pre-requisite to comprehension and
comprehension is a pre-requisite to production. That is, language processing starts
from recognition and moves towards production. Since a clear-cut division between
these two processes is not possible, a continuum could be assumed with recognition
tasks at the one end and production activities at the other. Furthermore, because of
the possibilities existing in between, additional stages including comprehension and
comprehension-production should be taken into account. This process would
accommodate the activities which involve some degrees of both comprehension and
production. Thus, the psychological dimension embodies items which require
recognition, comprehension, comprehension-production, or production on the part
of the examinee. The linguistic dimension, too, encompasses three levels of oral,
written, and pictorial modes of language.
In addition to linguistic and psychological factors, one more variable, the
components of an item, should be taken into account. As mentioned before, an item
consists of the stem and the response. These components give more variability to
the factors involved in determining item forms. In this regard, the stem and the
response may or may not have the same modes of language.

In other words, the modality of the stem may, and in most cases does, differ from
that of the response. Therefore, items can have a variety of forms depending on the
psychological process and the linguistic mode of both their stems and the responses.
These factors, linguistic, psychological and the structure of the item, are presented
in the following Table.

Table 3.1

Psychological process:   Recog.   Comp.   Comp./Prod.   Prod.
Linguistic mode:
  Oral                      1       2          3           4
  Written                   5       6          7           8
  Pictorial                 9      10         11          12

This classification provides for 12 boxes, each of which would identify one
aspect of the item. For example, a true-false item can be presented in the following
form to meet the objectives for which it is designed:
a) psychological process: recognition or comprehension
b) stem: oral, written or pictorial
c) response: oral, written or pictorial
So, such an item will be placed in one of the cells of column 1 if it is designed
to tap the testees' recognition ability, and in one of the cells of column 2 if it is
intended to measure students' comprehension ability. Furthermore, it will be placed
in cells 1 or 2 if it is presented in oral form, in cells 5 or 6 if it is presented in
written form, and in cells 9 or 10 if it is presented in pictorial form.

All the characteristics of the item can be identified when it is placed in one of
the cells of the Table. As another example, a test such as dictation requires both
comprehension and production on the part of the examinee. Whether
the stem and the response are in an oral form or a written form would add to the
diversity of the form of the test. Such a test will be placed in one of the cells of
column 3 depending on its modality. And, finally, a test such as free composition
requires only production and should be placed in one of the cells of column 4 on the
Table.
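
Since Table 3.1 is a simple two-way grid, its logic can also be stated procedurally. The following minimal sketch (ours, not part of the original text; the function and label names are illustrative) shows how an item's cell number follows from its modality and the psychological process it requires:

# A minimal sketch (illustrative only) of how the cell number in Table 3.1
# follows from the two dimensions of the psycholinguistic classification.

PROCESSES = ["recognition", "comprehension",
             "comprehension-production", "production"]     # columns 1-4
MODALITIES = ["oral", "written", "pictorial"]               # rows

def cell_number(modality: str, process: str) -> int:
    """Return the Table 3.1 cell (1-12) for an item's modality and process."""
    row = MODALITIES.index(modality)
    col = PROCESSES.index(process)
    return row * 4 + col + 1

# A written true-false item tapping comprehension falls in cell 6; dictation
# (written, comprehension-production) in cell 7; free composition in cell 8.
print(cell_number("written", "comprehension"))              # 6
print(cell_number("written", "comprehension-production"))   # 7
print(cell_number("written", "production"))                 # 8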
Considering the above-mentioned factors, test developers should be more cautious
in calling an item multiple-choice or essay-type. It would seem an over-simplification
of the issue to assume that an item can be categorized as one form or another without
taking the above-mentioned variables into account. Therefore, much more care should
be exercised before giving a name to the form of an item.

Activities
1. What is the difference between a distractor and an alternative?
2. How does the form of a test influence its content or function?
3. In what ways is it possible to score composition-type tests objectively?
4. What are the advantages of psycholinguistic classification over other types of
classifications?
5. Consider the Table representing the psycholinguistic classification. Think of forms of
items which would fit every cell in the Table.

5
Test Construction

5.1 Introduction
In the previous Chapters, certain fundamental concepts in language testing
were discussed. The function or the purpose of a test, which has a significant role in
developing and using a test, was explained in detail. Then the form of a test, which
refers to the way an item is presented to the examinee, was discussed. It should be
mentioned that form and function are interrelated, though not interdependent. This
means that a particular form of the item can be included in tests used for different
purposes. However, depending on the function of a test, some restrictions are
imposed upon the form of items. For example, if the purpose of a test is to measure
the examinee's oral proficiency, a test item in written mode may not be appropriate.
Thus, the function of a test can impose certain limitations on the form of the items of
that test.
It should also be mentioned that test construction is not a one-dimensional
process. In addition to determining the form and function of the test, the content and
the number of the items should be specified in advance. More specifically, to
construct a test the following steps should be taken. First, the function and the form
of the test should be decided upon. Second, the content of the items should be
specified. This step is also called planning. Third, the items should be prepared in
accordance with the specified content. Fourth, the items should be reviewed. Fifth,
the items should be pretested in order to determine their statistical characteristics.
And finally, the test should be validated. Thus the steps in developing a test can be
listed as follows:
1. Determining the function and the form
2. Planning (determining the content of the test)

3. Preparing the items
4. Reviewing the items
5. Pretesting the items
6. Validating the test
Any test which is administered for decision-making purposes must go through
these steps. Otherwise, it will not be dependable or valid. Due to the significance of
these steps, the present Chapter is devoted to helping readers to develop skills
necessary for carrying out these steps.

5.2 Determining the Function and the Form of the Test

As mentioned before, a test can be developed to serve different functions. Since the
function of a test influences its form and content, it should be determined in
advance. In order to determine the function of a test, three factors should be taken
into account: characteristics of the examinees, the specific purpose of the test and
the scope of the test.
The first factor deals with the identification of the characteristics of the
examinees. Test developers should consider the nature of the population to which
the test is likely to be administered. If the test is intended for a group of youngsters,
for example, its content and form should be geared to their level of intellectual and
cognitive abilities. Therefore, a test with items in pictorial mode would be more
appropriate than a test with items in written mode. Or, if the examinees are from a
particular language background, the test should include items based on the findings
of contrastive analysis on the structures of the source and the target languages. For a
heterogeneous group, on the other hand, findings of contrastive analysis should be
used quite cautiously because they may not prove fruitful. In addition to the factors
such as age and language background, the educational system through which the
examinees have carried out their education should be taken into account, because
the educational policy influences the mastery level of examinees at different
language skills.
The second factor is to determine the specific purpose of the test. As
mentioned before, tests can serve two major functions: prognostic and evaluation
of attainment. In addition to determining the major function, test
developers should decide on the specific function the test is to serve because the
specific function, i.e., aptitude, placement, etc., would significantly influence the
content and form of the test. For example, a proficiency test requires a different
content from that of an aptitude test. As another example, the content of an
achievement test differs from that of a proficiency test because the former is based
on a set of materials covered in a particular instructional course, whereas the latter is
independent of any specific instructional material.
And finally, the scope of the test should be clarified. Whether the test is to be
used within the scope of a classroom, a school, a district, or a country influences the
structure of the test. As the scope of the test widens, the degree of care to be taken
along with the amount of time and energy to be spent on developing the test
increases because the decision to be made on the test scores would influence a larger
population.
In sum, to develop a test, one should determine the function and the form of the
test considering the above-mentioned factors. Having determined the function and
the form, test developers should move toward specifying the content of the items to
be included in the test.

5.3 Planning (Specifying the Content of the Test)


As mentioned before, the purpose of testing is to gather quantitative information
about the degree of the examinees' command in a particular area of knowledge.
Along the same lines, language tests are designed to measure examinees' ability
regarding a certain skill or component of language. Therefore, it is of utmost
importance for the tester to decide on the area of knowledge to be measured. It
means that the content of the test should be precisely and carefully specified. The test
developer should not only clarify the content of the test but also specify the relative
importance of the elements to be included in the test.
Of course, specifying the content of the test cannot be independent of the
function of that test. Nor can it be unrelated to the form of the test. For example,
the content of a placement test should logically differ from that of an
aptitude test. The content of a placement test is limited to the materials to be
covered in a particular course of instruction, whereas the content of an aptitude
test has little or no relationship to any predetermined instructional materials.
Moreover, the form of the items of a test may influence the content of the test. For
example, a comprehension test in multiple-choice format allows the examiner to
include as many items as he feels necessary, whereas in a production type test,
limitations of time and space force the test developer to limit the test to a
manageable number of items.
To illustrate the steps involved in content specification, an example will be
helpful. Suppose a test for measuring the degree of an examinee's command of the
structure of a language is to be developed. Suppose further that the test is to be
designed to assess a high school student's achievement of the materials covered in
the textbook. And, finally, assume that the items are to be of comprehension type,
multiple choice, and in written mode.
In order to determine the content of such a test, the first step is to examine
the instructional objectives. That is, the course content should be outlined to
include a list of major structural points covered during the instruction. Then these
major topics should be divided into their specific components. The degree of
detail would depend on practical factors such as test length and test time.
The third step is to prepare a Table of specifications. A Table of specifications has
two dimensions. On one dimension, the specific topics and subtopics are listed in
accordance with the instructional objectives. On the other dimension, the form and
number of items to be written on any particular topic or subtopic are delineated.
The main purpose of the Table of specifications is to assure the test
developer that the test includes a representative sample of the materials covered in
a particular course. In the case of the example mentioned above, the following
Table of specifications can be constructed (Table 5.1).
This Table, illustrating major structural points, serves three main purposes.
First, it specifies what is to be tested, i.e., the content. Second, it shows the aspect
of achievement to be tested, i.e., comprehension. And, third, it demonstrates the
number of items necessary to measure each point at different phases.

Table 5.1

Inst. Objectives            Number of Items
Content                      Comprehension
Prepositions                       3
Verb Forms                         2
Conditionals                       5
Indirect Speech                    3
Comparatives                       2
Total                             15

As mentioned before, each topic can be divided into detailed subtopics, including
the various elements within that topic. The following Table, not exhaustive of course,
demonstrates one of the possible classifications of the verb forms.

Table 5.2

Inst. Objectives            Number of Items
Content                      Comprehension
Verb Forms
  1. Simple present                1
  2. Simple past                   1
  3. Present perfect               1
  4. Past perfect                  1
  5. Future                        1

Of course, it is not necessary to keep the number of items in one verb form equal
to those in other forms. Nor is it necessary to give more weight to items measuring one
psychological process than those measuring other processes. The important point,
however, is that the nature of items should be determined before embarking on item
preparation.
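
To make this planning step concrete, a table of specifications can be represented as a simple mapping from topics to planned item counts. The following minimal sketch (ours, in Python, reusing the illustrative figures of Table 5.1) shows the idea:

# A minimal sketch of a table of specifications as a topic-to-count mapping,
# using the illustrative figures of Table 5.1.

table_of_specifications = {
    "Prepositions":    3,
    "Verb Forms":      2,
    "Conditionals":    5,
    "Indirect Speech": 3,
    "Comparatives":    2,
}

# The planned test length is simply the sum of the counts:
print(sum(table_of_specifications.values()))  # 15 items in all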

Having specified the nature of the items, the test developers should decide on the
content of the items. In other words, they have to clearly determine the linguistic
context within which the identified elements should be tested. This step, referred
to as preparing the items, is of crucial importance because certain rules and
regulations should be followed to avoid defective items. The remainder of this
Chapter, then, is devoted to providing test writers with suggestions for preparing
sound items.

5.4 Preparing Items


It would be a dangerous over-simplification to assume that every classroom
teacher is capable of constructing reasonably acceptable items. Of course, the
experience of the teachers is an invaluable asset which would be of great help in
item construction. Experience alone, however, does not suffice. Without a
thorough knowledge of the principles of item preparation, even the most
experienced teacher is apt to make defective items. Therefore, some guidelines are
suggested here for preparing various item types.
It should be mentioned that suggestions are limited to comprehension items,
including true-false, multiple-choice, and matching. Other items, such as
comprehension-production and production forms, will be discussed in subsequent
Chapters.

5.4.1 Suggestions for Preparing True-False Items


True-false form items are technically called alternative-response items. Such
items consist of a statement that the examinee is required to read and mark true or
false (T/F), right or wrong (R/W), correct or incorrect (C/I), yes or no (Y/N), or
agree or disagree (A/D). In every case, only two alternatives are possible. Since
the term true-false is used more frequently than the others, this item type is generally
referred to as true-false.
The most common use of true-false items is to measure the ability of
examinees to identify the accuracy of the information provided through a
statement. Consequently, such items are usually used to assess simple learning
outcomes. In other words, comprehension is the major psychological process
involved in answering true-false items. The following is a set of examples
demonstrating the use of true-false items in language testing:

1. Listening comprehension through visual cues:

   examinee hears               examinee sees
   T/F  a) This is a pencil.    a pencil
   T/F  b) They are books.      pencils

2. Listening comprehension with aural cues:

   examinee hears                     examinee reads
   T/F  They went out for a walk.     They are inside the house.

3. Structure:

   The passive form of "He wrote a book." is:
   T/F  A book was written by him.

4. Reading comprehension:

After reading a passage, the examinee is required to determine whether a given


statement is true or false on the basis of the information provided in the passage.
True-false items, though frequently used in language tests, are not highly
recommended, for two reasons. First, they depend very much on chance; namely,
the examinee has a fifty percent chance of getting a correct response without
having any knowledge of the points being tested. Second, they are limited to
measuring simple learning activities in language. Complex tasks cannot be measured
validly through true-false items. These two shortcomings decrease the reliability,
validity and application of true-false items.
In spite of the aforementioned shortcomings, however, true-false items have
some merits. They are easy to construct and easy to score. They also allow the test
developers to use a large number of items in a given test. Considering both
advantages and disadvantages mentioned here, the test users should take the
following precautions regarding true-false items:

1. Avoid using broad general statements. This is because most broad
generalizations tend to be true. Consider the following examples:
a) The president of a country is usually elected.
b) People have friendly relationships in many societies.
Such statements are usually true. So even if the examinee does not have any
knowledge of the point being tested, he may get the item right.
2. Avoid using statements which measure trivial points. Since general
statements are not recommended, testers should not move toward the other
extreme, i.e., using T-F items to measure unimportant points. The following
sentence, for example, is not an appropriate item because the information is so
trivial that the examinee would be led to memorize certain points rather than
comprehend the flow of language in a natural context.

Example:
The past tense of "talk" is "talked".
3. Avoid using negative statements. Negative words, such as "no" or "not",
are likely to be overlooked by the examinees. Statements with double negation
are especially confusing.

Example:
a) Students are not· required to memorize the lesson.
b) It is unacceptable for the students not to memorize the lesson.
Sometimes, however, using negative statements is inevitable. In such cases,
the negative word should be written so clearly that students would notice it
(writing it in capital letters, underlining it, or both).
4. Avoid using long and complex sentences. Such statements tend to include
information over and beyond the point to be tested. In most cases, the complexity
of the sentence prevents the students from getting to the point being tested.
5. Make true and false statements approximately similar in length and
difficulty, and distribute them evenly. If the true statements are longer than the false ones, the
students will find a clue to mark longer statements as true even if they do not have
the knowledge of the points being tested.

The above-mentioned guidelines should be taken as suggestions rather than rules.
Teachers and test developers should use their common sense and logic in the
construction of such items. These suggestions, however, will help them develop
relatively reasonable items.

5.4.2 Suggestions for Preparing Matching Items
Matching items are usually used for measuring facts based on simple associations.
In language testing, one of the uses of matching items is to check the students'
ability in recognizing and comprehending synonyms and antonyms. In comparison
to true-false items, matching items require more complex mental activities. A
common form of matching items is to arrange items in two columns and require the
examinees to match the corresponding items from the two columns.

Example:
Match the words in column I with their antonyms in column II. There are more
items in column one than required.
I II
1. construct a) accept
2. refuse b) clarify
3. approach c) destroy
4. confuse

In constructing matching items, too, certain suggestions may prove helpful.


1. Use homogeneous materials in a single matching item. For example, if the item is
to measure vocabulary knowledge, it should be limited to definitions, antonyms or
synonyms alone, and not a combination of the three. Furthermore, an item intended
to measure vocabulary should not include grammatical structures.
2. Include an unequal number of items in each column. If the numbers of items are
equal in both columns, the last item is usually predictable from the previously
answered ones. As the number of items in one column increases, the possibility
of guessing decreases.
3. Clarify the way the items are to be matched from the two columns.
Whether the examinee should draw lines between corresponding items, number
them, or use other forms of matching, should be clearly specified.
4. Keep the list brief and place the shorter column to the right. Long lists require
excessive concentration on one area. A brief list, on the other hand, is convenient
for both the tester and the testees. Usually seven items in each column is just about
right. Placing the shorter column on the right facilitates the testee's job in that he
reads the long column first and moves through the shorter column to find the
corresponding items.

5.4.3 Suggestions for Preparing Multiple-Choice Items

Multiple-choice form items are probably the most widely used types of items. They are
applicable to a wide variety of skills. Multiple-choice items can measure simple
learning outcomes more effectively than true-false or matching items can. That is
why in formal testing situations multiple-choice items are more frequently used
than any other kind of items.
The stem of multiple-choice items can take two forms. It may be a question to
which one of the alternatives is the answer. Or, it may be a statement to be
completed by one of the alternatives. Due to the wide range of applications of
multiple-choice items, care should be taken in constructing them. The following
suggestions should prove helpful.
1. The stem should be quite clear and state the point to be tested
unambiguously. If the stem does not clearly specify the problem, the alternatives
would serve as true-false items rather than multiple-choice items.

Example:
He is ........... .
a) one of the students passing the test.
b) one of the students failing the test.
c) the man who left the testing session early.
d) the man who did not participate in the test.

Such an item is faulty in that the stem does not pose the problem. Therefore, each
alternative functions as an independent true-false item. To improve such items,
attempts should be made to shift the bulk of information into the stem.
He is the man who .......................... because he got a headache unexpectedly
during the testing session.
a) passed the test
b) failed the test
c) left the testing session early
d) did not participate in the testing session
2. The stem should include as much of the item as possible. Any word or phrase that
is shared by all alternatives should be placed in the stem.

Example:
The person .............. is called an author.
a) who writes a book c) who prints a book
b) who reviews a book d) who sells a book

Such items can be improved by moving the common elements in the alternatives to
the stem.
The person who ............... a book is called an author.
a) writes c) prints
b) reviews d) sells

3. Negative statements should be avoided because they are likely to be ignored by the
examinees. As mentioned before, in unavoidable cases, the negative word should be
either capitalized, underlined or both.
4. All of the alternatives must be grammatically correct by themselves and consistent
with the stem. Distractors, however, should prove wrong when they are placed in the
stem.

Example:
Last year, incoming students ................ on the first day of school.
a) enrolled c) will enroll
b) will enrolled d) should enroll

In this item, alternative (b) is wrong by itself because such an expression does not
exist in the English language. Thus, using such wrong expressions as alternatives is
quite useless because they will be automatically ignored by the examinees.
Furthermore, using wrong distractors would expose examinees to wrong forms of
the language which might negatively influence the student's language learning
process. Since there is no merit whatsoever in using wrong distractors, while some
disadvantages are clearly present, such distractors should be avoided altogether.
Another example is the following:

He entered the room after ............... the light.


a) turning on c) turn on
b) when he turned on d) had turned on

In this example, distractor (b) is not consistent with the other alternatives because
it is the only distractor which has a relative pronoun. Nor is it consistent with the
stem because the structure "after when he turned on" does not exist in the English
language. Thus, it will be easily ignored by the examinees.
5. Every item should have one correct or clearly best answer.

Example:
When Hossein called, I ............. the house.
a) left c) had left

b) was leaving d) have left

In this example, there is more than one possible answer, (b) and (c). Such items
should be avoided.
6. All distractors should be plausible. That is, distractors which do not
logically belong to the point being tested will be discarded by the testees.

Example:
To call on someone means .............. .
a) to visit c) to talk
b) to telephone d) a curious person

In this item, the point to be tested is the meaning of the verb. However, distractor
(d) is a noun which does not belong to the point being tested. Therefore, it should
be avoided.
7. All distractors should be of similar length and level of difficulty. A
relatively long alternative tends to be the correct response. In most cases, naive
test developers are not capable of constructing the correct alternative in a short
form. Furthermore, they want to guarantee that an alternative is indisputably
correct. Therefore, they are forced to use more words in order to achieve this
objective. That is probably why long responses often tend to be the correct ones.
Moreover, if an alternative is exceedingly more difficult than others, either it will
be ignored or selected erroneously as the correct response. In either case, such a
selection is not based on the knowledge of the examinee but on some sort of wild
guess.
8. Using "all of the above" or "none of the above" as an alternative is not
recommended. These alternatives are usually used when the test developers do not
find appropriate choices. The use of "all of the above" should be strongly avoided.
"None of the above", however, should be used sporadically if used at all.

Example:
Abundant means ............ .

a) plenty c) infercarious
b) a great deal of d) all of the above

In this item, the examinee attempts to answer with the assumption that there
is only one correct response for every item. Then he notices that alternatives (a)
and (b) are both correct. In this case, even if he does not know the meaning of the
word in alternative (c), (in fact, he does not need to know it) he would select (d) as
the correct response. Therefore, a four-choice item would function as a three-
choice one.
In the case of "none of the above", although such a clear shortcoming does
not exist, it has the disadvantage of introducing incorrect responses to
the examinees.

9. Correct responses should be distributed approximately equally but
randomly among the alternatives. There should be no discernible pattern in the
distribution of the correct responses. Furthermore, depending on the number of
alternatives, an approximately equal proportion of correct responses should be
assigned to each alternative.
10. The stem should not provide any grammatical clue which might help the
examinee find the correct response without understanding the item.

Example:
Ali picked an ................ off the tree and gave it to his guest.
a) apple c) tangerine
b) banana d) peach

In this example, the article "an" leads the testee to the correct response. Even
though the testee does not understand the meaning of the stem, through the
grammatical clue given in the stem, he can guess the correct response. In such cases
the grammatical clue should be eliminated.
11. The stem should not start with a blank. This recommendation originates
from the concept of meaningful learning. According to the cognitive-code learning
theory, language processes start with known information and move towards
unknown information. Starting a stem with a blank means that the testee should
move from unknown to known information. Thus, the flow of information is in the
direction opposite to the normal flow of information.
As mentioned before, it is reiterated here that the above-mentioned guidelines
are to be considered suggestive, not prescriptive. The intention is to facilitate the
test constructors' task in developing a test. In addition, these suggestions would
hopefully alleviate some of the obvious deficiencies in existing test items. It should
be kept in mind that taking these suggestions into consideration is necessary but not
sufficient for developing acceptable items. Of course, they would help test
developers in a subjective evaluation of test items. For an objective scrutiny of the
accuracy and/or efficiency of the items, test constructors should go through the
remaining steps in the process of test construction. These steps, which deal with
reviewing and determining the statistical characteristics of test items, are
discussed below.

5.5 Reviewing
In fact, the development of a particular test is often teamwork. Very seldom
would a test be constructed by a single person, because any individual is likely to
commit errors. Whether a test is the product of an individual or a team, it is highly
recommended to have the test reviewed by an outsider. Through the reviewing
stage, problems unnoticed by the test developer will most likely be observed by
the reviewers. Then, the reviewers would suggest modifications in order to
alleviate the problems. Of course, similar to guidelines for item construction, the
reviewer's comments would be subjective as well. Therefore, although these
suggestions for modification may improve the quality of the test items, they are
not sufficient for the development of a reasonable test. For a test to be
scientifically defensible, it should be examined objectively. Such a scrutiny will be
possible through the next stage of test construction, i.e., pretesting.
5.6 Pretesting
Up to now, the newly developed test has gone through the stages of planning,
item writing, and reviewing. Before the planning stage, the form and function,
and at the planning stage the content of the test, are determined with certain
preassumptions about the testees. That is, the population for whom the test is
designed is an important factor because the nature of the population influences
the parameters of a test. For example, a test may be developed to be given to a
group of children, youngsters, or adults. In each case, the content as well as the
form of the items should be geared toward the characteristics of the prospective
testees. It is neither logical nor purposeful to develop a test for an unspecified
population.
The notion of determining the nature of the population, i.e., "for whom the
test is designed", is of crucial importance because pretesting is based on this
notion. At the item writing stage, logical items are written on the basis of the
suggestions offered earlier. Taking these rules into account is essential because
bad items would complicate the pretesting process. And finally, at the reviewing
stage expert opinions are solicited and some modifications are made on the items.
Reviewing is important because it indirectly helps the pretesting stage. That is,
improved items would make pretesting more fruitful. Thus, planning, item writing,
and reviewing would set the ground for pretesting.
Pretesting is defined as administering the newly developed test to a group of
examinees with characteristics similar to those of the target group. For example, if
a test is designed for high school graduates, it should be pretested with a group of
such graduates. Or if a test is designed for children, it should be pretested with a
group of children. Otherwise, the goal of pretesting will not be achieved.
The purpose of pretesting is twofold. The first purpose is to determine,
objectively, the characteristics of the individual items. These characteristics
include item facility (IF), item discrimination (ID), and choice distribution (CD).
The second purpose of pretesting, which is called validation, is to determine the
characteristics of the items altogether. In fact, through validation, which is the last
step in test construction process, characteristics of a test as a total unit are
determined. These characteristics include reliability, validity, and practicality. To
avoid the burden of technical complexities, item characteristics are dealt with in
this section, and test characteristics will be presented in the subsequent Chapter.

5.6.1 Item Facility


One of the most important characteristics of a single item is its facility. In
non-technical terms, item facility refers to the easiness of an item. How easy an item
is can have answers such as 'very easy' or 'not easy'. These terms are, however,
relative and subjective. How easy is 'very easy' or 'not easy'? In order to develop an
objective index of easiness, a more technical definition is necessary. Item facility
is defined as the proportion of correct responses for every single item. Proportion
means that all correct responses should be divided by the total number of
responses. This idea can be illustrated by the following formula:

IF = ΣC / N

where:
IF = item facility
ΣC = sum of the correct responses
N = total number of responses

To facilitate the comprehension of the concept of item characteristics, a sample
data set is presented below. Suppose a ten-item test is given to twelve subjects. Their
performances are illustrated in Table 5.3. The columns represent the items, the rows
show the subjects, and the number in each cell represents a subject's response to a
particular item. Each correct response is given one point and wrong responses are
given zeros. Using the sample data, it is obvious that out of 12 responses to item
number one, seven are correct. Therefore, the facility index of this item can be
calculated as follows:

IF(1) = 7/12 = 0.58

As another example, consider the facility of item number 9.

IF(9) = 12/12 = 1
Table 5.3
                              Items
Subjects    1   2   3   4   5   6   7   8   9  10   Total
    1       0   1   1   0   1   1   0   0   1   1     6
    2       1   0   1   0   1   0   1   0   1   0     5
    3       1   1   1   0   1   1   1   0   1   1     8
    4       0   1   0   1   0   1   1   1   1   0     6
    5       1   0   0   1   0   1   1   0   1   1     6
    6       1   1   1   1   1   1   1   1   1   1    10
    7       1   1   1   1   0   1   0   0   1   0     6
    8       1   0   1   0   0   0   0   0   1   0     3
    9       0   0   0   1   0   1   0   0   1   1     4
   10       0   0   1   0   0   0   1   0   1   0     3
   11       1   1   1   0   0   0   0   0   1   0     4
   12       0   0   0   1   1   1   0   0   1   1     5
Total       7   6   8   6   5   8   6   2  12   6

This implies that all examinees answered this particular item correctly. In other
words, it is quite an easy item. It can be understood that the maximum item facility,
when all examinees get an item correct, equals 1. By the same token, the most
difficult item is the one to which no one gives a correct response; that is, its item
facility is zero. An extremely easy item is not recommended because it does not
provide useful information about the examinees' knowledge. Nor is a very difficult
item recommended. It is like adding one point to or subtracting one point from
every individual's score. Of course, in some specific cases, such items are valuable
and should be kept in the test. These cases will be explained later. For the moment,
it is sufficient to state that items with facility indexes beyond 0.63 are too easy, and
items with facility indexes below 0.37 are too difficult. Such items should,
therefore, be deleted from the test. By determining item facility, the test constructor
can easily find out item difficulty. Item difficulty can be calculated by using the
following formula:

Item difficulty = 1 - item facility

In some textbooks, item difficulty may be used for item facility. The readers should,
however, keep in mind that item facility refers to the proportion of correct responses,
while item difficulty refers to the proportion of wrong responses. When item
facilities are determined, the next characteristic of the item, i.e., item discrimination,
should be calculated.
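
For readers who wish to verify the computations, the following minimal sketch (ours, not part of the original text) applies the item facility and item difficulty formulas to the sample data of Table 5.3 and flags items outside the 0.37-0.63 range suggested above:

# A minimal sketch applying IF = (sum of correct responses) / N and
# item difficulty = 1 - IF to the sample data of Table 5.3.
# Rows are subjects, columns are items; 1 = correct, 0 = wrong.

responses = [
    [0, 1, 1, 0, 1, 1, 0, 0, 1, 1],
    [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
    [1, 1, 1, 0, 1, 1, 1, 0, 1, 1],
    [0, 1, 0, 1, 0, 1, 1, 1, 1, 0],
    [1, 0, 0, 1, 0, 1, 1, 0, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0, 1, 0, 0, 1, 0],
    [1, 0, 1, 0, 0, 0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0, 1, 0, 0, 1, 1],
    [0, 0, 1, 0, 0, 0, 1, 0, 1, 0],
    [1, 1, 1, 0, 0, 0, 0, 0, 1, 0],
    [0, 0, 0, 1, 1, 1, 0, 0, 1, 1],
]

n = len(responses)  # 12 responses per item
for item in range(10):
    correct = sum(row[item] for row in responses)
    facility = correct / n        # IF
    difficulty = 1 - facility     # item difficulty
    if facility > 0.63:
        note = "too easy"
    elif facility < 0.37:
        note = "too difficult"
    else:
        note = "acceptable"
    print(f"item {item + 1}: IF = {facility:.2f}, "
          f"difficulty = {difficulty:.2f} ({note})")

# Item 1 yields IF = 7/12 = 0.58 and item 9 yields IF = 12/12 = 1.00,
# matching the worked examples above.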

5.6.2 Item Discrimination


One of the purposes of testing is to measure students' knowledge of a language
component or skill. In other words, the purpose of testing is to provide information
on who performs better than whom. In fact, a test should discriminate between
more and less knowledgeable examinees. The discrimination power of a test depends
on the discrimination power of its items. In non-technical terms again, item
discrimination refers to the extent to which a particular item discriminates more
knowledgeable examinees from less knowledgeable ones. If, for example, all
students answered a question correctly, it would mean that the item is not only
too easy but also non-discriminating. Thus, there is a relationship between item
facility and item discrimination. An item with a too high or too low facility index
is not likely to have a good discrimination power. In technical terms, however,
item discrimination refers to the index which is derived from comparing the
difference between the performance of more knowledgeable and less
knowledgeable examinees on a particular item.
One can calculate the item discrimination index through the following steps.
1. The total scores on the test should be ranked from the highest to the lowest.
Doing so, the data will be rearranged in the form shown in Table 5.4.
2. The examinees should be divided into two equal groups. The best way to do so
is to count up to half of the examinees. In the case of the sample data, the half, or
fifty percent, will be 6. Then, counting from the top, after person number 6, a line
should be drawn. The group above this line will be called the "high group" or
simply H, and the group below the line will be called the "low group" or simply L.

Table 5.4
                              Items
Subjects    1   2   3   4   5   6   7   8   9  10   Total
    1       1   1   1   1   1   1   1   1   1   1    10
    2       1   1   1   0   1   1   1   0   1   1     8
    3       0   1   1   0   1   1   0   0   1   1     6
    4       0   1   0   1   0   1   1   1   1   0     6
    5       1   0   0   1   0   1   1   0   1   1     6
    6       1   1   1   1   0   1   0   0   1   0     6
  ----------------------------------------------------  (high group above, low group below)
    7       1   0   1   0   1   0   1   0   1   0     5
    8       0   0   0   1   1   1   0   0   1   1     5
    9       1   1   1   0   0   0   0   0   1   0     4
   10       0   0   0   1   0   1   0   0   1   1     4
   11       0   0   1   0   0   0   1   0   1   0     3
   12       1   0   1   0   0   0   0   0   1   0     3
Total       7   6   8   6   5   8   6   2  12   6

3. To compute the item discrimination, the following formula should be used:

   ID = (CH - CL) / (½N)

where:
ID refers to item discrimination
CH refers to the number of correct responses to that particular item given by the
examinees in the high group
CL refers to the number of correct responses to that particular item given by the
examinees in the low group
½N refers to the total number of responses divided by 2
As an example, the discrimination index for item number six will be calculated
as follows. From the Table, six out of six examinees in the high group gave correct
answers to this particular item, but only two examinees in the low group gave
correct responses to it. Thus, the item does in fact discriminate between the low
and high groups.

   ID(6) = (6 - 2) / 6 = 4/6 = 0.66
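
The three steps can also be stated procedurally. A minimal sketch (ours; it assumes the responses matrix defined in the item facility sketch above):

# A minimal sketch of the three-step item discrimination procedure:
# rank by total score, split into high and low halves, and apply
# ID = (CH - CL) / (N/2).

def item_discrimination(responses, item):
    ranked = sorted(responses, key=sum, reverse=True)  # step 1: rank by total
    half = len(ranked) // 2                            # step 2: split in half
    high, low = ranked[:half], ranked[half:]
    ch = sum(row[item] for row in high)  # correct responses in the high group
    cl = sum(row[item] for row in low)   # correct responses in the low group
    return (ch - cl) / half              # step 3: apply the formula

# With the Table 5.3/5.4 data, item six (index 5) gives (6 - 2) / 6 = 0.66:
#   item_discrimination(responses, 5)
# Note: ties on the borderline are broken arbitrarily here; as discussed
# next, such cases may call for an on-the-spot decision.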
One point should be clarified here: drawing the dividing line may sometimes
cause problems. For example, suppose that the line falls on a subject who has the
same score as the next subject. In such cases, some on-the-spot decisions should
be made by the test developer. He may decide to eliminate one or more subjects
to have a clear point for dividing the subjects into two groups. In fact, with large
data sets, it is possible for more than one person to get the same score. However,
one or two subjects, especially on the borderline, will not influence item
discrimination to a great extent.
Having determined the item discrimination, the tester should set a criterion for
accepting or rejecting a particular index of discrimination. In contrast to item
facility, where the ideal index was 0.50, for item discrimination the ideal index is
unity. An item discrimination index of 1 means that all the subjects in the high
group answered the item correctly and all the subjects in the low group answered
the item wrongly. That is, the item has the highest discrimination power. However,
not all items will enjoy such a discrimination power. Therefore, the tester should
make a decision on the acceptability level of the item discrimination value.
Logically, the closer the value of item discrimination to unity, the more
discriminating the item. Nevertheless, items which show a discrimination value
beyond 0.40 can be considered acceptable.

5.6.3 Choice Distribution


Item facility and item discrimination are the two determining parameters for the
acceptability of an item. On the basis of these characteristics, one can judge whether
an item is acceptable or not. The point which is not, however, accounted for by either
IF or ID is the effectiveness of the choices, because the values obtained for these
characteristics are based on some sort of quantification of the number of correct
responses. Neither item facility nor item discrimination can provide the test
constructor with necessary information about the appropriateness of the choices.
Therefore, the third characteristic of an item, i.e., choice distribution, should be
determined in order to improve the test both qualitatively and quantitatively. Choice
distribution refers to the frequency with which alternatives are selected by the
examinees. That is, through choice distribution, the tester examines the efficiency of
the distractors. Sometimes an item with reasonable facility and discrimination indexes
may have a poor choice distribution. If a particular distractor does not attract
examinees, it should be either discarded or modified. Otherwise, there would be no
purpose for that distractor. As an example, suppose that the items presented below are
administered to a group of one hundred examinees and the selection of choices is
distributed in the following pattern.
In spite of the fact that all the items have acceptable facility indexes, they have
undesirable choice distribution patterns. In item number 1, sixty examinees selected
choice B, which is the correct response. Distractor A is selected by 10 and distractor C
by 30 people. Distractor D, however, is not selected by any examinee. It implies that
this distractor did not function satisfactorily. In other words, one can assume that the
item is a three-choice rather than a four-choice item. Therefore, this distractor should
be modified.

Table 5.5
                 Choice Distribution
Item    Key     A      B      C      D
  1      B     10     60     30      0
  2      A     50     40      5      5
  3      D     40     12      8     40

In item number 2, the correct choice is A which was selected by 50 people.


Fifty percent correct response represents an ideal item facility. However, few people
attempted choices D or C. Thus, these distractors should be examined and modified.
And in item number 3, there is another problem which can only be observed
through choice distribution. As it is shown in the Table, the correct choice is D,
which is selected by 40 people. However, the same number of people have selected
choice A. This implies that choice A is quite deceiving; it is either too close to the
correct choice or it involves some unnoticed problem. No matter what the reason
may be, it should be definitely modified. Thus through choice distribution the test
developer can observe deficiencies existing in the nature of choices. These
observations will eventually lead the test constructor to improve the quality of the
choices. This improvement will, in turn, improve the characteristics of the items
and ultimately those of the totality of the test. The characteristics of a test as a total
unit, which are obtained at the validation stage, will be discussed in subsequent
Chapters.
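
The tally underlying a choice distribution table is equally straightforward to compute. A minimal sketch (ours; the selection data are hypothetical, mimicking item 1 of Table 5.5):

# A minimal sketch of a choice distribution tally: count how often each
# alternative is selected and flag distractors that attract no one.

from collections import Counter

def choice_distribution(selections, key):
    counts = Counter(selections)
    for choice in "ABCD":
        note = ""
        if choice == key:
            note = "  <- correct response"
        elif counts[choice] == 0:
            note = "  <- unchosen distractor: modify or discard"
        print(f"{choice}: {counts[choice]:3d}{note}")

# One hundred hypothetical examinees answering item 1 (key = B) of Table 5.5:
selections = ["A"] * 10 + ["B"] * 60 + ["C"] * 30  # nobody chose D
choice_distribution(selections, key="B")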

5.7 Applying Logic


The purpose of pretesting is to determine item characteristics. Indexes of item
facility and item discrimination along with information obtained from choice
distribution are important pieces of information to help test developers keep, modify,
or discard a particular item. However, in rare cases, when a reasonable item shows
poor item characteristics, the test developer should not totally depend on these
characteristics. He should apply logic to the selection of items. If an item, for one
reason or another, must be included in a test, in spite of its poor characteristics, the
test developer may keep that item. It might be due to certain unknown factors
including the nature of the examinees, the examination setting, etc. Thus,
statistical information should be used as a suggestive rather than a determining
tool in the exclusion or inclusion of a particular item in the test.
One last point to be mentioned here is that suggestions made for
modifying or discarding a particular item on the basis of item characteristics should
be carried out with caution. As mentioned before, an item with a facility index of
unity or zero is considered a bad item because it does not provide the tester with
useful information. This is true when a norm-referenced testing theory is applied. In
norm-referenced testing the main purpose is to compare student performances with
one another and rank their scores. Thus, adding one point (because of an easy
item) or subtracting one point (because of a difficult item) will not influence the
examinees' ranks. That is why it is suggested that too easy or too difficult items
be eliminated. In criterion-referenced testing, however, such items may provide the
tester with useful information. That is, when all students give a correct response to
a particular item, it means that the instruction was quite successful. On the other
hand, if all students missed a given item, it implies that instruction was not
successful. Therefore, to make a decision on whether an item should be excluded or
included in a test, the theoretical assumptions behind the interpretation of scores
must be taken into account. These theories and assumptions are explained in the
later Chapters.

Activities
The following items are defective. Find the problems and fix them.
1. Farmers don't work ............. farms .......... fridays.
   a. at/on        c. on/on
   b. in/in        d. at/at
2. "When did you go home?"
   "I went home ............."
   a. late         c. tomorrow
   b. easy         d. by bus

"3 . h. ?"
. . .......... car IS t IS.
"It is Mr. Amini's car." c. When
a. Who d. Whose
b. What
4 ........... is a kilo of apples? c. How often
a. How many d. How
b. How much
5. "............... dictionary is large?"
"That heavy dictionary."
a. Which c. How
b. Who d. Where
6. " ........... did you talk about?"
"We talked about many things." C. What
a. Where d. Who
b. Why
7. Which word is not correct?
a. several c. excuse me
b. runer d. finished
8. Which word is not correct?
a. country c. resturant
b. sunny d. finished
9. "Is Mehdi at home?"
c. is studying
·"Yes he ................ for his examination."
d. had study
a. is study
b. had studying
?"
10. "Was your tea sweet. c. so sweet as
"Y . h " d. as sweet like
es, 1t was ............ oney.
11. He brought ............ which I gave to my brother.
a. as sweet some c. me a notebook
a. a notebook to me
b. as sweet as d. a notebook me
b. to me a notebook

12. "Was Parvin a literature student last year?"
"N h 1· I "
o, s e ............... iterature ast year. c. hasn't study
a. wasn't study d. hasn't studying
b. wasn't studying
13. Have ............ to Shiraz?
a. ever been you c. ever you been
b. you been ever d. you ever been

14. "Is John studying mathematics?"


"N h h .f "
o, e ............... mat emancs or two years.
a. hasn't study c. haven't studied
b. haven't study d. hasen 't studied
15. "Are you going to study tonight?"
"N I igh igh "
o, m1 t .................. tom t. c. no study
a. not studying d. no studying
b. not study
16. "Do you speak Japanese?"
"Yes, I learned ................ in Japan."
c. how to speaking it
a. it how to speak
d. it how to speaking
17. Ib.don't
how to speak it
know where the station is. Is ..................... ?
a. near the hotel it c. it near the hotel
b. it the hotel near d. the hotel it near
18. "Will your friends be here today?"
"Y ,,
es, ........... .
c. would they said they come
a. they would come said they
d. they said they would come
b. said they would they come
19. We want to know ................ .
a. how well she speaks English
b. how well does she speaks English
c. she speaks English how well
d. she speaks how well English
20. "Where is the book I was reading last night?"

100
''I think ............ is on the table."
a. the you were reading book c. you were the book reading
b. the book you were reading d. you were reading the book
21. "Why didn't you visit Mehdi?"
"I h. ·t I h d . "
.......... un 1 a time.
c. would have visited
a. would have visit
d. will have visited
b. will have visit
22. "Didn't you get in touch with him?"
"Oh, I ............, but the telephone was out of order."
a. did my better c. did my best
b. did my good d. did my well
23. "What was he doing last night?"
"He ...........himself for the final exam, but I'm not sure."
a. must have been preparing c. might have been preparing
b. must be prepare d. may be preparing
24. "Is that all, Mr. Afshin?"
"No, on Monday next week, I ............. you some more information."
a. have given c. shall giving
b. was giving d. will be giving
25. He opened the office door and shouted that ................ .
a. he has to see the director c. he was to see the director
b. he is seeing the director d. he was seeing the director
26. "What do we need now?"
"At the moment we ............. some fresh meat and five loaves of bread."
c. have needed
a. are needing
d. needed
b.need
27. According to the teacher if Ali should come late again, he ................. .
a. is dismiss c. will certainly dismiss
b. would be dismissed d. will be dismissed
28. "Why don't you take this one?"
"I . . h ?"
s 1t superior ........... t at . c.to
a. then d.from
b. of
29. "I'm really tired."
"N d · d1 ''
o won er you are tired. You .............. all ay ong.
a. have always working c. have to always work
b. have to work always d. always have to work
30. Scientists say that ............ .
a. the earth now is just like a blazing ball
b. there is no earth at all
c. the earth was not similar to that at present
d. we must now believe that the earth is a hot cloud of gases
31. "Popular" is the synonym of ............. .
a. population c. famous
b. studying d. pollution
32. They have .......... the new words.
a. to studied c. to study
b. studying d. to studying
33. It is very important equipment .............. it we have been able to save lives.
a. By all means c. By all means of
b. With means of d. By means of
34. It shows you how many syllables ....................... in a word and how to
pronounce it.
a. are there c. there is
b. will there d. there are
35. If you had gone to the art museum, you ........................... different kinds of
pictures.
a. would see c. would seen
b. would have see d. would have seen
36. The students will not stay there and the principal ................. .
a. doesn't either c. won't either
b. will, too d. will so
37. Afshin is able to communicate his ideas to others very ................. .
a. affecting c. effective
b. affect d. effectively

38. The tests were ............ for the students to perform on.
a. very difficult c. difficult
b. so difficult d. too difficult
39. Our literature teacher knows many of Ferdowsi's poems ................. .
a. in memory c. by heart
b. on mind d. by mind
40. They acted very ........... in the play last night.
a. beautifully c. beautiful
b. in beautiful way d. beauty
41. The man ............on the bench is the principal of our school.
a. who sitting c. sat
b. sitting d. will sit
42. Your heart works ............. .
a. only when you sleep c. most of the time
b. day and night d. all the time
43. Your heart sends blood to ............. .
a. all parts of your body c. your head and your hands
b. your arms and your legs d. all organs in your body
44. Your body needs food and oxygen ............ .
a. only when you rest c. all the time
b. only when you run d. only when you are tired
45. When you rest, your heart ........... .
a. stops walking c. beats fast
b. slows down d. rests, too
46. Your body needs the most food and oxygen when you ............ .
a. sit down c. sleep
b. are active d. are tired
47. Calcium and phosphorus are among ............ materials.
a. organic c. systematic
b. inorganic d. unsystematic
48. Glass, mica, hard rubber, air, and many plastics are good ............ .
a. conductors c. insulators
b. electrifiers d. magnifiers
49. Any two substances, when rubbed together under suitable conditions,
become .............. or acquire an electrical charge.
a. electrified c. purified
b. magnified d. identified
50. Magnetism which is induced by an electric ............ is known as
electromagnetism.
a. flow c. current
b. jet d. iron
51. The sun and the stars are examples of ..................... .
a. celestial bodies c. astronomical studies
b. scientific investigations d. subatomic divisions
52. I can seldom concentrate on what I read when my text is ............ with
unwanted things.
a. challenged c. diversified
b. conformed d. cluttered
53. ............ has no place in an astronaut's occupation.
a. Precision c. Exactness
b. Frailty d. Expansion
54. "John's book is on the Table. Where is my book?"
" ...........is under the table."
a. Yours c. His
b. John's d. Yours book
55. We don't .............eat lunch at home.
a. never c. rarely
b. ever d. at all
56. Paul has recently finished his high school. Therefore, he ............ be
about 18 years old.
a. may c. must
b. will d. can
57. Mr. Smith is going to buy two shirts. He is buying one now. He is
going to buy ............ tomorrow.
a. another one c. other
b. the other one d. other one

58. ............ two miles from Paris to Rome.
a. Its is c. There is
b. Its d. There are
59. George writes ............ .
a. carefully than John c. the same careful as
b. less carefully than John d. more careful than John
60. "This student is the same height as the other." means:
This student is .............. the other one.
a. as tall as c. similar high to
b. the same tall as d. as much high as
61. The children didn't finish their meals. They said that the fish had
a ........... taste.
a. wonderful c. best
b. queer d. worst
62. Are you still looking .............. a job?
a. at c. after
b. for d. up
63. How many times a week .............. English?
a. you have c. you had
b. do you have d. you have had
64. He ........... nothing at first.
a. said c. didn't say
b. was not saying d. doesn't say
65. Let's ............here.
a. don't sit c. not sit
b. not to sit d. not sitting
66. This patient ............ in the hospital since 8 o'clock.
a. waits c. waiting
b. was waiting d. has been waiting
67. If he were here now, what ............... .
a. you would do c. you will do
b. you did d. would you do
68. He could not concentrate on what he was reading and .................. .
a. nor I could c. so I could
b. neither could I d. I couldn't too
69. How many books will each of us .................. give you?
a. has to c. had to
b. have to d. to have
70. I saw him ............ the work.
a. that he does c. done
b. do d. did

71. He ............ studying hard.


a. used to c. uses to
b. is used to d. is used
72. My father didn't want to take a trip, and ......................my mother.
a. so did c. neither did
b. didn't so d. didn't either
73. Will you please help me ................ the door.
a. open c. that I open
b. to open d. that you open
74. The train ............. the station when we got there.
a. had left c. has been leaving
b. has left d. had been leaving
75. Mathematics is a .............. language.
a. various c. reasonable
b. unhappy d. symbolic
76. Food, heat and shelter are ................. needs.
a. basic c. minor
b. quick d. closed
77. Computers are .............. machines.
a. complicated c. internal
b. invested d. stabilized
78. Man is a ............. being. He can think.
a. funny c. verbal
b. irrational d. rational

79. Man .............culture from one generation to another generation.
a. transmits c. finishes
b. pauses d. manages
80 ............ is the science of population.
a. Demography c. Dictionary
b. Sociology d. Politics
81. More people "participated" in government affairs. "Participate" means
.. .. .. . . .. ... .
a. separated c. took part
b. held out d. set out
82. Society before the Industrial Revolution did not change very "rapidly".
"Rapidly" means ............ .
a. fast c. hardly
b. slowly d. smoothly
83. Social science studies the "behavior" of human beings. "Behavior"
means ............. .
a. appearance c. method
b. manner d. speech
84. Man "manipulates" his surrounding for his benefit. "Manipulates"
means ............ .
a. controls c. illustrates
b. searches d. defines
85. The government should "stabilize" the prices. "Stabilize" means ......................
a. select c. save
b. train d. make fixed
86. Philosophers study the relationship among different "concepts".
"Concepts" means ............ .
a. sizes c. ideas
b. modes d. teams
87. She had a very unique idea. "Unique" means ............ .
a. divided c. deep
b. only one of its kind d. avoided

. d ?"
88. "What do you grow m ................. gar en.
"Fruits and vegetables." c. yours
a. your d. you're
b. you
89. "Why did he go to Spain?"
''H h h. ''
e went t ere ............... see 1s son.
c. because
d. from
a. to
b. for
90. "What's the matter with you?"
"I h '' c. bad headache
ave ........... . d. a bad headache
a. headache a bad
b. a headache,,bad
"O
n .............
91. "Where do you . live?''
c. fourth street
a. the fourth street
d. the street fourth
b. street fourth
92. "Can you speak Spanish?"
"No, but I can speak .................well."
c. the French language
a. French language
d. the language French
b. language
93. "This French
water is very cold, isn't it?"
"I I . I' . h b -& "

t sure y 1s. ve never swum m sue ....................... erore.


a.
b. awater
watercold
cold c. a cold water
d. cold water
94. I will help you when I ............ my work.
a. have finished c. will finish
b. finish d. am finishing
95. We observed our teacher ............ the science experiment.
a. conducts c. conduct
b. conducted d. conducting
96. She ............ to market every morning.
a. will go c. goes
b. have gone d. is going

97. This is .............man who stole my bicycle.
a. the c. a
b. any d. an
98. The sick man was confined to his bed all the time because he .......................... .
a. did not feel like walking c. liked to sleep
b. was lazy d. was bedridden
99. They ............. that you are going to be late.
a. have been known c. will know
b. know d. are knowing
100. What is the .............. of man living on the moon?
a. wrinkle c. basin
b. likelihood d. basalt
101. Among the instruments, this one is the most ............................ It is used more
than the others.
a. integral c. strict
b. versatile d. evident
102. New computers which .............. recently are not as big as the old ones.
a. has developed c. are developing
b. have been developed d. have developed
103. The study of elementary particles is of great interest to physicists.
Physicists are very interested in exploring the nuclear processes which lead to
the evolution of the elements in the stars. For physicists the
study of the elementary particles is .............................exploring the nuclear
processes which lead to the evolution of the elements in the stars.
a. not as interesting as c. of much less interest than
b. of the same interest as d. of much more interest than
104. A nutrient that performs only one function is equally as essential as one
involved in different functions. The importance of a nutrient that
performs different functions is ............... the importance of one involved in
only one function.
a. not the same as c. quite the same as
b. equally different from d. quite different from

6
Characteristics of a Good Test

6.1 Introduction
As mentioned in earlier Chapters, individual items are the building blocks of
a test. A test may consist of a single item or a combination of items.
Regardless of the number of items in a test, every single item should possess
certain characteristics. That is, every item should satisfy the criteria of item
facility, item discrimination and choice distribution. On the basis of these criteria,
defective items should be either modified or discarded. By eliminating such
items, the test developer will end up with a certain number of reasonable and
acceptable items.
Having good items, however, does not necessarily lead to a good test,
because a test as a whole is more than a mere combination of individual items. In
fact, test developers should bear in mind that having good items is necessary but
not sufficient for a test to function satisfactorily. Of course, having good
individual items provides the test developer with good raw materials. However,
when the raw materials, in this case good items, are put together, certain points
should be taken into account in order to have a good test as a whole. An example
may help clarify the point.
Suppose someone intends to build a house. Suppose further that he has
prepared some high-quality raw materials such as bricks, iron bars, etc. Having
excellent raw material does not guarantee, by any means, that he will have a good
house as well. Many other factors, such as the design of the
house and the way materials are utilized, should be taken into account.
Otherwise, it is quite possible to end up with a bad house even with the best
quality materials.
The analogy can be extended to the case of test construction. That is, simply
putting some good individual items together would not necessarily lead to a good
test as a whole. Here, too, certain factors, including the administration process and
the scoring procedures are important. Therefore, in addition to having good items, a
test should have certain characteristics. These characteristics are reliability, validity,
and practicality. Since these concepts are quite essential in the test construction
process, and they require fairly comprehensive explanations, this Chapter is
devoted to clarifying them.

6.2 Reliability
All students, at one time or another, have taken tests. After taking a test, they have
had feelings that they hadn't done as well as they could have, that they had probably
been surprised by the materials covered on the test, that they hadn't got enough
sleep the night before, or that they just hadn't studied hard enough.
At other times, at least a few of them have finished a test with the feeling that
they had done better than they expected, that perhaps they had studied just the right
portions, or that they had a lucky day.
Considering these different views, it seems safe to say that many of the
students have mixed emotions about tests. Of course, the above mentioned
comments are inevitable. Most of them, however, are based on personal experience.
Although personal experience is a good criterion, it is not a sufficient one for
criticizing tests. Most of the criticism is more emotional than rational. In only a few
cases, have they taken a good hard look at the tests. Seldom have they searched for
information other than personal experience. One of the most important pieces of
information which will help test users, or test takers, to make rational judgments on
the tests they use or take is reliability.
To explain the concept of reliability in as non-technical terms as possible, a
restatement of the arguments presented in the foregoing paragraphs will be helpful.
The feeling that someone did not do as well on a test as he could have can be
restated as follows: If he could take the test

again, he would do better.
This may be quite true. Nevertheless, one should also admit that some factors, such
as a good chance on guessing the correct responses, would raise his score higher than it
should really be. Seldom does anyone complain about this.
Now, if one could take a test over and over again, he would probably agree that his
average score over all the tests is an acceptable estimate of what he really knows or how
he really feels about the test. On a "reliable" test, one's score on its various
administrations would not differ greatly. That is, one's score would be quite consistent.
On an "unreliable" test, on the other hand, one's score might fluctuate from one
administration to the other. That is, one's score on its various administrations will be
inconsistent. The notion of consistency of one's score with respect to one's average score
over repeated administrations is the central concern of the concept of "reliability". In
order to investigate the concept of reliability or consistency of a test, first a theoretical
framework should be established. Different theories have been developed to explain the
concept of reliability in scientific terms. Each theory makes certain assumptions which
are similar to axioms and should be taken for granted. From among existing theories, the
one advocated in this book is the classical test theory. One main reason for this selection
is the ease with which the concept of reliability is explained in this theory.
According to this theory, reliability or unreliability is explained as follows. If one
takes two measures of the same attribute, e.g., height, or verbal knowledge, the two
measures will not resemble each other exactly. The fact that repeated measurements of
some attributes of the same individual almost never duplicate one another is called
"unreliability". On the other hand, repeated measurements of the same attribute of the
same person will show some consistency. The tendency toward consistency from one
set of measurement to the next is called reliability.
In other words, when a person earns a score on a test, in most cases this score
does not correspond to his perception of his ability. That is, for one reason or another, if
he could only take the test again, his score might change.

As mentioned before, this can be due to different reasons. At the time of the test,
he may have had a headache or experienced some disturbing factors. He may not
have had time to review for the test or enough sleep the night before because of
reviewing. On the other hand, he may have had a good day with a number of lucky
guesses.
The change in one "s score is inevitable. Some of the changes might
represent a steady increase in one's score. The increase would most likely be due
to some sort of learning. This kind of change, which would be predictable, is
called systematic variation. Some of the changes, however, could be due to such
factors as how the examinee felt on a particular day he took the test, under what
conditions he took the test, and so on. For the most part, variation under these
conditions and many others which may not be predictable is called unsystematic
variation. Thus, whenever several observations of a person are made and several
scores are recorded on the same ability, those scores are likely to differ from one
another. This variation is due, in part, to systematic variation and, in part, to
unsystematic variation. The systematic variation contributes to the reliability and
the unsystematic variation, which is called error variation, contributes to the
unreliability of a test.
In order to examine the reliability of a test, then, a study must be designed to
control systematic variation so that differences in test scores can be attributed to
unsystematic variation or random errors. This means that the study should develop
techniques to account for systematic variation as well as unsystematic variation.
The following semi-technical explanation would help clarify the issue.
Let's assume that someone takes a test. Since all measurement devices are
subject to error, the score one gets on a test cannot be a true manifestation of one's
ability in that particular trait. In other words, the score contains one's true ability
along with some error. If this error part could be eliminated, the resulting score
would represent an errorless measure of that ability. By definition, this errorless
score is called a "true score".
This true score is almost always different from the score one gets, which is
called the "observed score". Since the observed score includes the measurement
error, i.e., the error score, it can be greater than, equal to, or

smaller than the true score. If there is absolutely no error of measurement,
the observed score will equal the true score. However, when there is a
measurement error, which is often the case, it can lead to an overestimation or an
underestimation of the true score. Therefore, if the observed score is represented by
X, the true score by T and the error score by E, the relationship between the
observed and true score can be illustrated as follows:

(1) X = T or X > T or X < T

These relations, however, do not hold true when the scores are changed into their
corresponding variance terms. The variance of the true scores does not change
because the true scores are constant. The variance of the observed scores,
nonetheless, fluctuates because of the extent of the error of measurement. Since
error variance is included in the observed variance, the variance of the observed
scores is always greater than the variance of the true scores. That is, the magnitude
of the observed variance equals the magnitude of the true variance plus the
magnitude of the error variance. If the variance of observed scores is represented by
V x, the variance of true scores by Vt, and the variance of error scores by Ve, formula
number 1 can be rewritten as:

(2) Vx = Vt + Ve

These three variance components are crucial to understanding the concept of
reliability in statistical terms.
From this formula, it can easily be understood that there is a close
relationship between the degree of error in measurement and the exact amount of
the true score. The greater the measurement error, the smaller the estimation of the
true score. On the other hand, when a small part of the observed score is due to
measurement error, the estimate of the true score will approximate its real value.
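
These relationships can also be illustrated numerically. The following short
simulation, written in Python as a minimal sketch (the group size and the spreads
of the true and error scores are merely illustrative assumptions), generates
observed scores as true scores plus random error and shows that the observed
variance approximates the sum of the true and error variances:

    import random
    import statistics

    random.seed(1)

    # A hypothetical group of 1000 examinees: each observed score X is
    # a true score T plus a random error E, as in classical test theory.
    true_scores = [random.gauss(50, 8) for _ in range(1000)]
    error_scores = [random.gauss(0, 4) for _ in range(1000)]
    observed = [t + e for t, e in zip(true_scores, error_scores)]

    v_t = statistics.variance(true_scores)   # true score variance (Vt)
    v_e = statistics.variance(error_scores)  # error variance (Ve)
    v_x = statistics.variance(observed)      # observed variance (Vx)

    # Because the errors are random and unrelated to the true scores,
    # Vx comes out approximately equal to Vt + Ve.
    print(v_x, v_t + v_e)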
Understanding the concepts of true, error, and observed variance will help to
define reliability in its technical form. By definition, reliability is the

ratio of true score variance to observed score variance. If the reliability is
represented by r, the following formula can be written:

(3) r = Vt / Vx
Of course, the true score is not measurable and thus the value of Vt is never known.
Therefore, we can solve for the unknown Vt through the following computations:

Vx = Vt + Ve
or:
Vt = Vx - Ve

Substituting this value of Vt in formula 3 will lead to formula 4:


(4) r = (Vx - Ve) / Vx

From formula 4 certain conclusions can be drawn. First, if the measurement is


without error, the error variance will be zero, i.e., Ve = 0. Thus, we can have:

r = (Vx - Ve) / Vx = (Vx - 0) / Vx = Vx / Vx = 1

This means that when there is no error in measurement, the reliability equals
unity. Second, if the error in measurement is so large as to equal the observed
score, i.e., all the observed score is error, in this case Vx = Ve and we will have:
r = (Vx - Ve) / Vx = (Vx - Vx) / Vx = 0 / Vx = 0

This means that when there is the greatest amount of error in measurement, the
reliability will equal zero. Thus, the magnitude of reliability can range from zero
to one. The reliability of zero, which is the minimum, means that all observed
variation is due to error. That is, the test is completely unreliable. On the other
hand, the reliability of 1 indicates that there is no error in measurement and the
test is perfectly reliable. Of course, this does not happen in reality. All tests show
a certain degree of unreliability. But the closer the magnitude of reliability to
unity, the more reliable the test will be.
It should be mentioned that the degree of error variance will decrease the
reliability and consequently the accuracy of making decisions on the basis of test
scores. Therefore, there should be a way to account for this inaccuracy. Of course,
it is not possible to account for measurement error on every single occasion. Even
if it were possible, it would not be very helpful because the error part of the score
is not systematic and will fluctuate from one occasion to another. Therefore, it is
necessary to find an index of error in measurement which could be applied to all
occasions of a particular measure. This index of error is called the standard error
of measurement, abbreviated as SEM. By definition, SEM is the standard
deviation of all error scores obtained from a given measure in different situations.
To calculate the numerical value of SEM, the formula for reliability can be used
through the following procedures.

(1) r = Vt / Vx
    Vx = Vt + Ve
    Vt = Vx - Ve
(2) r = (Vx - Ve) / Vx
or:
    r = Vx / Vx - Ve / Vx
(3) r = 1 - Ve / Vx

Solving (3) for Ve:

Ve / Vx = 1 - r
or:
(4) Ve = Vx (1 - r)

From Chapter 4, it should be recalled that standard deviation is the square


root of the variance. Taking the square root of the variance terms in formula 4,
we will have:

√Ve = √(Vx (1 - r))
or:
Se = Sx √(1 - r)

The value of Se, the standard deviation of errors, as mentioned before, is called
the SEM. Thus:

SEM = Sx √(1 - r)

In the formula, Sx refers to the standard deviation of observed scores and r is the
reliability. From the formula it is clear that there is a negative relationship between
reliability and SEM. The higher the reliability, the smaller the SEM. For example,
if the reliability is perfect, i.e., r = 1, the value of SEM will be zero because the
value of ( 1 - r) will equal zero. By the same token, the lower the reliability, the
greater the SEM.
The concept of SEM is very important in testing. One should be careful in
using a particular test score because the magnitude of SEM will influence the
accuracy of the interpretation of that score. An example may help clarify the point.
Suppose that a vocabulary test is administered to a group of students and the
following results are obtained.

X̄ = 20    Sx = 5    r = 0.84

The SEM can be calculated to be:

SEM = Sx √(1 - r)
    = 5 √(1 - 0.84)
    = 5 √0.16
    = (5) (0.4)
SEM = 2

This means that the standard error of measurement for the vocabulary test is 2.
Thus, when one interprets a given score on this test, he should be careful that on the
average, the observed score may be lower or higher than the examinee's true score.
The degree of highness or lowness can be predicted from the value of the SEM.
Usually, a safe estimate is to interpret the true score within the range of plus or
minus one SEM from the observed score. In formulaic form:

A more accurate score = observed score ± 1 SEM

If the score of a given examinee were, for example, 25 on the test, his score might
fluctuate between 23 and 27.

A more accurate score = 25 ± 1 SEM = 25 ± 2, i.e., between 23 and 27
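
The same arithmetic can be verified in a few lines of Python (a sketch only,
using the figures of the vocabulary test example above):

    import math

    s_x = 5    # standard deviation of the observed scores
    r = 0.84   # estimated reliability

    sem = s_x * math.sqrt(1 - r)   # SEM = Sx * sqrt(1 - r)

    observed = 25
    # Interpret the true score within plus or minus one SEM.
    print(round(sem, 2), round(observed - sem, 2), round(observed + sem, 2))
    # prints: 2.0 23.0 27.0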


So, the observed score cannot be taken as the most exact estimate of one's
ability. The magnitude of the SEM should be taken into account in interpreting all
test scores. Since the value of SEM is computed from the value of the reliability,
to calculate the SEM, one has to calculate the reliability first. However, it is not so
easy to calculate the reliability of a set of test scores. In fact, it is almost impossible
to calculate the real magnitude of reliability because measuring the true score is
impossible. This, of course, does not mean that attempts to approximate the true
score should be abandoned. On the contrary, one should try to make the best
possible

estimate of the true score from the observed score. The implication is that the
mathematical value of reliability is always estimated rather than calculated. Apart
from theoretical exactness, there are different practical methods of estimating
reliability. These methods are explained below.

6.3 Methods of Estimating Reliability


Reliability was defined as the consistency of scores produced by a given test. The
term consistency needs some clarification. Let's assume that a test was administered
on a particular day. How can one be sure that the scores on the test will be consistent
if the test is to be administered to the same subjects again? From the preceding
explanations, it should be clear that no single person would be expected to obtain
exactly the same score in two administrations of the same test. The difference
between the scores of the two administrations, as was explained before, contributes
to the unreliability of the scores, i.e., the degree of the error in measurement.
Furthermore, it was discussed that reliability is the total variance in test scores
minus the error variance. Thus, to estimate the reliability of a test, one should
somehow calculate the amount of variance produced by test scores.


One way to obtain this variance is to use the following formula:

V = Σ (X - X̄)² / (N - 1)


However, this formula provides the amount of observable variance rather than
the true variance; therefore, it is not sufficient. Another way to obtain the amount of
variance is through correlational procedures. From the discussion of the correlation
coefficient in Chapter 4, it should be remembered that the square of correlation
the amount of common variance between two sets of scores. Of course, in
correlation, the two sets of scores are obtained from the administration of two
different tests to a particular group. When the same test is administered to the same
group twice, the correlation, and not the square of correlation, is the amount of
common variance.
This common variance between two sets of scores obtained from the two
administrations of the same test is the amount of consistent variation in test

scores, i.e., the reliability. Therefore, to estimate the reliability of test scores, the
correlation coefficient between two sets of scores obtained from two
administrations of the same test to the same group should be calculated. Different
administrations can be possible in various ways. These possibilities are the
sources of different methods of estimating reliability. The following is a brief
explanation of four methods of estimating reliability. Three of them are through
using correlational procedures, and one through using a noncorrelational
procedure, i.e., a particular formula.

6.3.1 Test-Retest Method


As the name implies, in this method reliability is obtained through
administering a given test to a particular group twice and calculating the
correlation between the two sets of scores obtained from the two administrations.
In using this method, the assumption is made that no significant change occurs in
the examinees' knowledge during the interval between the two administrations.
Since there has to be a reasonable amount of time between the two
administrations, this kind of reliability is referred to as the reliability or
consistency of scores over time. The extent of this consistency can be estimated by
computing the correlation coefficient between the two sets of scores obtained
from. the two administrations of the same test to the same group. The coefficient
of correlation is the reliability estimate.
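
As an illustration, the estimate is simply the Pearson correlation between the
two sets of scores. The following sketch in Python computes it from the
definitional formula; the five pairs of scores are invented for the example, and an
actual study would, of course, involve many more testees:

    # Hypothetical scores of the same five testees on two
    # administrations of the same test.
    first = [12, 15, 9, 18, 14]
    second = [13, 14, 10, 19, 13]

    def pearson(x, y):
        """Pearson correlation: covariance over the product of the
        standard deviations."""
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sx = sum((a - mx) ** 2 for a in x) ** 0.5
        sy = sum((b - my) ** 2 for b in y) ** 0.5
        return cov / (sx * sy)

    # The coefficient is the test-retest reliability estimate.
    print(round(pearson(first, second), 2))   # 0.95 for these scores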
Although the test-retest method provides a logical estimate of test score
reliability, it has some disadvantages. First, it requires two administrations.
Obviously it is difficult to arrange two testing sessions for the same group of
examinees. Furthermore, preparing similar conditions under which the
administrations take place adds to the complications of this method. Only on
exceptional occasions would it be possible to arrange two similar testing sessions
for the same group of testees.
Second, human beings are intelligent and dynamic creatures. They are
always involved in the process of learning. Thus, their abilities are most likely to
change from one administration to another, especially when the interval between
the two testing sessions is long. To avoid drastic changes, one should

keep the interval as short as possible. However, there is a trade off. The longer the
interval, the more change will occur in the testees' behavior, but less memory
factor will exist; the shorter the interval, the less change will occur in the testees'
behavior, but more memory factor will exist. To keep a balance, scholars
recommend a two-week interval as appropriate.
Third, there is the test effect, especially when the interval is short. That is,
the examinees may perform differently on the second administration because
either they have learned something from the test administered before, or they
have memorized some of the items from the first administration. To avoid these
disadvantages, educators tried to develop an alternative method of estimating
reliability referred to as the parallel-forms method.

6.3.2 Parallel-Forms Method


To avoid the complexities of the test-retest method, educators developed the
parallel-forms method. The major disadvantage of the test-retest method was the
difficulties involved in administering a single test to the same group twice.
In the parallel-forms method, however, two similar, or parallel forms of the
same test are administered to a group of examinees just once. Then the correlation
coefficient between the scores obtained from the two forms will be an estimate of
test score reliability.
Although this method alleviates the problem of two administrations, it
creates another problem. Constructing two parallel forms of a test is not an easy
task. Two parallel forms of a test should demonstrate certain statistical
characteristics. A discussion of these properties is too technical to be included
here in this book. However, certain logical considerations should be mentioned.
First, the Table of specifications for the two forms of the test must be the
same. It means that all the elements upon which test items are constructed should
be the same in both forms. For example, if one form of the test has a certain
number of items measuring a particular element of grammar, the other form
should also contain the same number of items on the same element of grammar.

Second, the components of the two tests, i.e., subtests, should also be the
same. This means that if one form of the test has, for example, three
subsections of grammar, vocabulary, and reading comprehension, the other
form should also have the same subsections with the same proportions.
This sameness does not imply that the surface forms of the items should be the
same. Each item in one form has a stem which differs from its counterpart in the
other form. What remains consistent across the items is the element to be tested. For
example, the following items can be considered parallel because both are to measure
the examinees' ability to recognize the correct form of the present tense.

1. Ali usually ............ late at night.
a) study b) studies c) studying
2. Reza often ............ the shopping in the afternoon.
a) do b) does c) doing

Although the parallel-forms method is an improvement over the test-retest method,
it still has some disadvantages. Therefore, another method, called the split-half
method, was developed.

6.3.3 Split-Half Method


The main disadvantage of the test-retest method was administering a single test to the
same group twice. And the major shortcoming of the parallel-forms method was the
need to develop two parallel forms of a single test. As discussed before, these two problems,
along with other factors, made the estimation of reliability through these methods
difficult. Therefore, scholars developed the split-half method as a convenient
alternative.
As a matter of fact, the split-half method was developed on the basis of parallel
forms assumption; of course, parallel forms of the items in a single test, not the
parallel forms of two separate tests. The main idea behind the split-half method is
that the items comprising a test are homogeneous. That is, all the items in a test
attempt to measure elements of a particular trait, e.g., tenses, prepositions, other
grammatical points, vocabulary, reading and

listening comprehension, which are all subparts of the trait called language ability.
In fact, the method assumes that there is an internal homogeneity among the items.
Thus, the relationship among the items will be a sort of reliability of scores
regarding their internal relationship. That's why this method is sometimes referred
to as the internal consistency of the test scores.
In this method, when a single test with homogeneous items is administered to a
group of examinees, the test is split, or divided, into two equal halves. The
correlation between the two halves is an estimate of the test score reliability.
In using this method, two main points should be taken into consideration. First,
the procedure for dividing the test into two equal halves, and second, the
computation of the total test reliability from the reliability of one half of the test.
Let's take the first point. In a well-designed test, the items are sequenced from
easy to difficult. A ready-made answer to the question of "What is easy?" or "What
is difficult?" does not exist. However, the complexity of the psychological process
involved in answering an item can serve as an acceptable criterion. Thus items can
be arranged from recognition type to production type.
Assuming that in a one-hundred item test, the items are arranged to be
progressively more difficult, dividing the test into two halves, say from items 1 to 50
and 51 to 100, would not be appropriate. In such a case, the first half will be easier
than the second half, thus, the performance on the two halves will not be
interrelated. Furthermore, it is quite likely that some of the examinees will not have
enough time to attempt all the items. Therefore, the last items will be left blank,
leading to the inequality of the two halves.
In order to avoid the above-mentioned problems, an appropriate procedure for
dividing a test into two equal halves should be developed. This would be possible by
selecting odd items for one half and even items for the other. Through this
procedure, easy and difficult items will be equally distributed in the two halves.
Furthermore, if some of the last items are left blank, they will also be equally
divided into the two halves.

The second point regarding the split-half method is to compensate for the loss
in the degree of reliability due to dividing the test into halves. Since the length of
the test, i.e., the number of items, is an important factor in test score reliability, by
dividing the test into two halves, the length of the test will be reduced to half of
the length of the total test. Thus, the correlation between the two halves will be
the reliability of one half of the test, not the total test. To estimate the reliability
of the total test, the following formula, known as the Spearman-Brown prophecy
formula, should be used.

r(total) = 2 (r half) / (1 + r half)

If, for instance, the reliability coefficient of half of the test is computed to be
0.80, the reliability of the total test will be:

r(total) = 2 (0.80) / (1 + 0.80) = 1.60 / 1.80 = 0.88
It should be logically clear that the reliability of the total test will always be higher than the
reliability of half of the test.
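
The whole procedure can be summarized in a short Python sketch. The
right/wrong answer matrix below is invented for the illustration; the sketch scores
the odd and even items separately, correlates the two half scores, and then steps
the result up to the full test length with the Spearman-Brown formula:

    # Hypothetical right (1) / wrong (0) answers: 5 testees, 8 items.
    answers = [
        [1, 1, 0, 1, 1, 0, 1, 0],
        [1, 0, 1, 1, 0, 0, 1, 1],
        [0, 1, 0, 0, 1, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 0],
        [0, 0, 1, 0, 0, 1, 0, 1],
    ]

    odd_half = [sum(row[0::2]) for row in answers]    # items 1, 3, 5, 7
    even_half = [sum(row[1::2]) for row in answers]   # items 2, 4, 6, 8

    def pearson(x, y):
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sx = sum((a - mx) ** 2 for a in x) ** 0.5
        sy = sum((b - my) ** 2 for b in y) ** 0.5
        return cov / (sx * sy)

    r_half = pearson(odd_half, even_half)   # reliability of one half
    r_total = 2 * r_half / (1 + r_half)     # Spearman-Brown step-up
    print(round(r_half, 2), round(r_total, 2))   # 0.79 and 0.88 here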
The split-half method of estimating reliability has certain advantages over the other
methods. First, it is more practical than others. In using the split-half method, there is no need
to administer the same test twice. Nor is it necessary to develop two parallel forms of the same
test. To use the split-half method, a single administration of a single test will suffice.
Along with these advantages, however, the split-half method has certain demerits. The
main shortcoming of this method is the need to develop a test with homogeneous
items, because assuming equality between the two halves is not always safe. Furthermore,
different subsections in a test, e.g., grammar, vocabulary, reading or listening comprehension,
will jeopardize test homogeneity, and thus reduce test score reliability. To improve this
method statisticians have developed a still more convenient alternative called KR-21.

6.3.4 KR-21 Method
Kuder and Richardson, two famous statisticians, have developed a set of
mathematical formulas for statistical computations. One of their formulas,
referred to as KR-21, was developed to estimate the reliability of test scores. This
formula was also based on the assumption that all items in a test are designed to
measure a single trait. This method, sometimes called rational equivalence, is a
purely statistical procedure. It only requires the calculation of the mean (X̄) and
the variance (V) of test scores. These parameters are put into the following
formula:

(KR-21)  r = [K / (K - 1)] [1 - X̄(K - X̄) / (K·V)]

where:
K = the number of the items in a test
X̄ = the mean score
V = the variance
This method is advantageous over all other methods because it does not have any
complications. It does not require double administrations as in test-retest,
parallel forms of a test as in the parallel-forms method, or splitting the test into
two halves as in the split-half method. Furthermore, all other methods require the
utilization of correlational procedure. Of course, not all educators or teachers are
familiar with this statistical technique. Therefore, the KR-21 method is the most
practical, frequently used, and convenient method of estimating reliability. For
example, if a 100-item test is administered to a group of testees and results in a
mean of 60 and a variance of 48, the reliability of this test can easily be computed
as follows.

r= f" 100 7 [ 1 _ 60(100 - 60)7


1J
L100 - 100(48) j

=11007[1- 24007
l 99] 4800J

=[100 7 [1 - 2] = 100 ( _!_)


99] 4] 99 2
12
5
- 0.55
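
Since the formula needs nothing but the number of items, the mean, and the
variance, it is easy to program. The following Python sketch simply restates the
formula and reproduces the computation above:

    def kr21(k, mean, variance):
        """KR-21 reliability estimate from the number of items (k),
        the mean, and the variance of the test scores."""
        return (k / (k - 1)) * (1 - mean * (k - mean) / (k * variance))

    # The worked example: 100 items, mean 60, variance 48.
    print(round(kr21(100, 60, 48), 2))   # about 0.51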
It should be mentioned that when different methods are available to estimate the
reliability of a set of test scores, a question may arise regarding when or why a
particular one should be used. In order to choose one method over the others, the
function of each method should be taken into consideration. For instance, the test-
retest method is most appropriate when the consistency of scores over a particular
time interval is important. In other words, it can be said that the test-retest method
provides a good estimate of stability of test scores over time. On the contrary, the
parallel-forms method is desirable when the . consistency of scores over the
different forms is of importance. And finally, when the go togetherness of the items
of a test is of significance, i.e., the internal consistency, the split-half method, KR-
21, or other varieties of this formula will be most appropriate.
Understanding the function of each method will, of course, facilitate the
choice of one method over the others. However, in each case, the advantages and
disadvantages of each method should be taken into consideration. Some logical
factors similar to the foregoing arguments will influence the choice of a particular
method. However, from the practical point of view, the KR-21 method, as
mentioned before, is the most practical and convenient method of estimating test
score reliability.

6.4 Factors Influencing Reliability


As explained before, reliability is an index to estimate the extent to which a test
produces consistent scores. This consistency can be over time, i.e., test-retest, or
over two forms, i.e., parallel forms, or over equal parts of the same test, i.e., split-
half and KR-21. In each case, to have a reliability estimate, one or two sets of
scores should be obtained from the same group of testees. Thus, two factors
contribute to test reliability: the testee and the test itself. Fluctuations of each of
these factors can somehow influence the magnitude of reliability. Therefore, these
factors will be discussed briefly, because having some information about these
factors would help test developers to be more careful in controlling them.

6.4.1 The Effect of Testees
The process of measuring an attribute differs from one scientific field to another.
In physical sciences, for example, attributes are often objects or characteristics of
objects which usually change through manipulation exerted by certain controlled
procedures. Therefore, fluctuations in the characteristics of the objects can be
measured fairly objectively and consistently.
In human sciences, however, the attribute to be measured does not have
absolute properties. Since human beings are dynamic creatures, the attributes
related to human beings are also dynamic. The implication is that the
performances of human beings will, by their very nature, fluctuate from time to
time, or from place to place. Fluctuations in the psychological and physiological
conditions of the testees will influence their performance. The influence may
increase their true score or their error score. In the former case, the reliability will
be over-estimated, and, in the latter case, the reliability will be under-estimated.
Another factor related to the testee which will influence the reliability is the
homogeneity of the testees' ability on the measured attribute. As was mentioned
earlier, reliability is a function of the amount of variation in the testees'
performance. If the testees are homogeneous regarding the ability to be measured,
there will not be a great deal of variation among their scores. That is, the
reliability will be under-estimated. On the other hand, when the testees' ability
varies greatly in the attribute to be measured, the amount of variance will increase
and thus reliability will be over-estimated. Therefore, in selecting the testees, care
should be exercised to have a reasonable range of ability in the attribute to be
measured. Otherwise, obtaining an accurate estimate of reliability will be quite
difficult.

6.4.2 The Effect of Test Factors


Fluctuations in the psychological or physiological conditions of the learner can be
argued not to influence the reliability estimate to a great extent. In this regard, one
can assume that in a group of testees some may randomly obtain high true scores
and some others high error scores. These random fluctuations

may either cancel each other out, or they may not be so great as to influence the
reliability estimate drastically.
The test factors, on the other hand, can potentially influence the
reliability to a great extent. Fluctuations in test factors can be due to the structure
of the content of the test, the administration procedures of the test, or the scoring
process of the test. Each will be discussed briefly.

6.4.2.1 The Structure of the Test


Inconsistencies in test scores can be partially attributed to the structure of the test.
Certain parameters which are built into the test itself can contribute to its
unreliability. These parameters include the homogeneity of the items, the speed
with which a test is performed, and the length of the test. Of course, with some
care and
planning, the effect of these factors can be minimized if not eliminated.
The first parameter is the homogeneity of the items. As a general rule, the items
in a test should be aimed at measuring the same trait or aspects of the same trait. For
example, a test of grammar should include items which measure only the
grammatical elements. If some of the items measure, let's say, vocabulary, the test
will lose its homogeneity. It is generally believed that the more homogeneous the
test items, the more consistent the scores the test will produce. This statement is justified
because almost all measures of consistency, i.e., reliability, are obtained through
correlational techniques. Homogeneous items go together well and thus produce
high correlation. The higher the correlation between the two parts of the test or
among the items of a test, the higher the reliability. On the other hand, when the
items are heterogeneous, the correlation among them will decrease which will lead
to a low reliability index.
The second parameter is the speed with which the test is performed. In speeded
tests, the examinees compete against the available time, whereas in power tests, the
examinees concentrate on the content of the test. In other words, in a speeded test,
examinees are instructed to work within a limited amount of time. Even if they know
the answers to the test items, they may not

have enough time to attempt all items. In a power test, on the other hand, the time
factor is eliminated and the examinees arc given a chance to try all the items. Of
course, it would be difficult to administer a test under purely speeded or power
situations. Most tests fall somewhere between these two extremes.
To the extent that a test is speeded, error scores creep into the testees'
performance, because time limitation will make the testees rush through the items.
Consequently, most of them will not get to all items in a relaxed environment. Of
course, an unlimited time allocation will negatively influence test results as well,
because the testees will work luxuriously and thus lose their concentration. Therefore,
there should be a balance between the time allowed and the number of items. As a
rule of thumb, for a multiple-choice item thirty seconds will be sufficient. Of course,
in tests such as reading comprehension, cloze, or similar tasks, since the testees should
read the passage as well as the items, one minute per item will be appropriate.
The third parameter is the number of items, or the length of the test. The longer the
test, i.e., the more items in a test, the more reliable the test will be. The justification
is that reliability is a function of the variance of test scores produced by the testees.
When there is a great number of items in a test, testees will most likely perform
differently on these items and will be widely dispersed along the scoring continuum.
Consequently, the variation, which contributes to reliability, will increase.
It should be mentioned that the length of the test improves the reliability up to a
certain point. Beyond a particular number of items, the contribution of the length to
the reliability decreases to a negligible point. The contribution of length to
reliability has been explored by many scholars. The following Figure shows the
approximate relationship between length and reliability.
Although the Figure is an approximation, it shows the extent to which the
number of items contributes to reliability. As it is shown in the graph, up to 30-35
items, the reliability increases sharply. Beyond 35 items, however, the increase in
reliability is so smooth that it can be ignored. In other words,

[Figure: reliability (vertical axis, from 0 to 1.00) plotted against the number of
items (horizontal axis, 25 to 125). The curve rises sharply up to about 30-35
items and then flattens out.]

if a test demonstrates a reliability of, say, 0.80 with 30 items, and a reliability of .85
with 100 items, it is more logical and cost-effective to ignore the .05 increase in
reliability and cut the testing time and scoring time to a great extent.
In order to account for the influence of the number of items on the reliability
through mathematical computation, the Spearman Brown Prophecy formula is
used:

rk = k r1 / (1 + (k - 1) r1)

In this formula, rk refers to the reliability of the test when adjusted to k times its
original length, r1 refers to the observed reliability of the test with its present length, and k
refers to the number of times the length of the test is to be increased. A couple of
examples will help clarify the use of this formula.
As mentioned in the split-half method of estimating reliability, suppose that a
test is divided into two halves and its reliability is computed to be 0.80(r1 = 0.80).
If the reliability of the test is to be adjusted to the full length (double the present
length), i.e., k = 2, the adjusted reliability can be computed as follows:

rk = k r1 / (1 + (k - 1) r1)
   = 2 (0.80) / (1 + (2 - 1) 0.80)
   = 1.60 / 1.80 = 0.88


So, the reliability of the full test, i.e., the number of items doubled, would
increase from 0.80 to 0.88.
As another example, suppose that a 40-item test has a reliability of 0.70, i.e.,
r1 = 0.70. Suppose further that ten items are added to this test, i.e., the length is
increased by one-fourth of the present length (k = 1 + 1/4 = 1.25). The


reliability of the new test will be computed as follows:

rk = k r1 / (1 + (k - 1) r1)
   = 1.25 (0.70) / (1 + (1.25 - 1) 0.70)
   = 0.875 / 1.175 = 0.74

So, if we increase the length of the test by one-fourth of the total length of the test,
the reliability will increase from 0.70 to 0.74. In using this formula, two points
should be taken into account. First, the addition of the items is assumed to follow the
principle of homogeneity. That is, additional items should be similar in content and
form to those already in the test. Otherwise, the adjusted reliability will not be an
accurate estimate of the reliability. Second, the concept of "k" in the formula does
not refer to the number of items. It refers to the proportionate increase of the items
regarding the present length of the test. That is, if" a test has 20 items, when the
number of items is doubled, k will equal 2. But if only ~ve items are added to the
tests, k . will equal 1 + _L or 1.25.
4' .
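
For convenience, the adjustment can be wrapped in a small Python function.
The sketch below merely restates the formula and reproduces the two examples
above; it assumes, as noted, that any added items are homogeneous with the
existing ones:

    def adjusted_reliability(r1, k):
        """Spearman-Brown prophecy: reliability of a test lengthened
        to k times its present length, given present reliability r1."""
        return k * r1 / (1 + (k - 1) * r1)

    # Doubling a half-test with r = 0.80: 1.60/1.80, the 0.88 above.
    print(round(adjusted_reliability(0.80, 2), 2))    # prints 0.89
    # Adding one-fourth to a 40-item test with r = 0.70 (k = 1.25).
    print(round(adjusted_reliability(0.70, 1.25), 2)) # prints 0.74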
It should be pointed out that the Spearman Brown Prophecy formula
enables the test developers to determine the number of items required to achieve a
desired level of reliability. The formula to determine the number of items is
developed by solving the formula for k.

rk = k r1 / (1 + (k - 1) r1)
or
k r1 = rk (1 + (k - 1) r1)
or
k r1 = rk + rk k r1 - rk r1
or
k r1 - rk k r1 = rk - rk r1
or
k r1 (1 - rk) = rk (1 - r1)
or
k = rk (1 - r1) / (r1 (1 - rk))

In this formula, k is the number of times the test length should be


increased, rk is the desired reliability, and r1 is the observed reliability.
As an example, if a 40-item test shows a reliability of 0.60, i.e., r1 = 0.60,
and the test developer wants to increase the reliability up to 0.80, i.e., rk = 0.80,
the number of times the length of the test must be increased can be calculated as
follows:

k = rk (1 - r1) / (r1 (1 - rk))
  = 0.80 (1 - 0.60) / (0.60 (1 - 0.80))
  = (0.80) (0.40) / ((0.60) (0.20))
  = 0.32 / 0.12 ≈ 2.6

Then k equals almost 2.6. It means that the test should be lengthened up to
2.6 times the present length. Since there are 40 items in the test, and 2.6 times 40
is 104, the number of items should be increased to 104. In other words, if
the test developer desires to have a test with the reliability of 0.80, he has to have
a 104-item test rather than a 40-item test.
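
The same computation can be expressed as a short Python function (a sketch
reproducing the example above; note that the exact quotient 0.32/0.12 is 2.67,
which the text rounds to 2.6):

    def required_length_factor(r1, rk):
        """How many times a test must be lengthened to raise its
        observed reliability r1 to the desired reliability rk."""
        return (rk * (1 - r1)) / (r1 * (1 - rk))

    k = required_length_factor(0.60, 0.80)
    # k is about 2.67, rounded to 2.6 in the text, which gives
    # roughly 2.6 x 40 = 104 items.
    print(round(k, 2))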
By using this formula, the test constructors will be in a better position to
decide on the length of their tests as well as the level of reliability they desire.

6.4.2.2 The Effect of Administration Factors
Inconsistencies in the administration process may increase the measurement error
and thus reduce test reliability. The influence of administration may become more
serious when a test is administered to different groups in different locations and on
different occasions. Of course, a single administration of a single test is also
subject to such fluctuations.
Some of the fluctuations originating from test administration are due to the
ambiguity of instructions, the time of administration and the extent of interaction
between the tester and the testees. Some others are related to certain irregularities
in the administration process. For example, whether the time remaining should be
announced regularly, whether questions should be answered during the testing
session, and whether any extra explanations should be given are all potential
sources of error.
As an example, suppose that a test of listening comprehension is to be
administered in which the items are read aloud by one of the examiners. How consistent
can the reader be? How many times will the reading be interrupted because of the
reader's physiological lapses? All these fluctuations may influence the test results.
Still another source of fluctuation in test administration is the environment of
testing. Interruptions, distractions, inappropriate light, temperature, etc. can
influence the performance of examinees and thus the reliability of the test.
As a test administrator, then, one should try to reduce the effect of
administration factors. A major way to control such factors is to plan all the
procedural steps in advance. If the regulations of administration procedure are
carefully planned in advance, much of the fluctuation can be avoided.

6.4.2.3 The Influence of Scoring Factors


As mentioned in earlier Chapters, a test can be scored either objectively or
subjectively. In an objectively scored test, the likes and dislikes of the scorers will
not influence the results. Therefore, there will not be any fluctuations due to
scoring. In such cases, the reliability will not be influenced by scoring
either. In subjectively scored tests, however, due to fluctuations in the scorer
judgement, measurement error will influence the reliability.
For instance, suppose that a set of compositions is to be evaluated and
scored. In such a situation, two sources of error may be present in the scoring
process. The first is the error due to the fluctuations of a single scorer in scoring a
composition twice. Obviously, a single scorer will not assign exactly the same
score to the same composition in two different scoring situations. The differences
due to this kind of fluctuation originate from the inconsistency of the scorer. In
other words, the error is due to the scorer. This kind of error is called intra-rater
error. The second is the error due to the fluctuations of different scorers scoring a
single composition. Again, two different scorers will not assign the same score to
the same composition. Such error, which is due to the differences between the
scorers, is called inter-rater error. Inter-rater and intra-rater errors will increase the
error variance in the computation of reliability.
Of course, certain suggestions may help reduce the effect of scoring
procedures. To minimize these kinds of measurement errors, the raters should be
encouraged to use clearly defined rating scales, to score the compositions without
prior information about their writers, i.e., anonymously, and to mark the sample
more than once. These guidelines will, of course, reduce the effect of scoring, not
eliminate the error due to scoring altogether.
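As a rough illustration of how inter-rater consistency might be checked in practice, the sketch below correlates two raters' scores on the same compositions; the scores are invented, and `statistics.correlation` requires Python 3.10 or later.

```python
from statistics import correlation  # Pearson r; Python 3.10+

# Hypothetical scores assigned by two raters to the same ten compositions
rater_a = [15, 12, 18, 10, 14, 16, 11, 17, 13, 19]
rater_b = [14, 13, 17, 11, 15, 15, 10, 18, 12, 18]

# A high correlation suggests little inter-rater error; correlating one
# rater's two scorings of the same papers would address intra-rater error.
print(round(correlation(rater_a, rater_b), 2))
```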

6.5 Validity
The second major characteristic of a good test is validity. In the preceding section,
the concept of reliability was defined and the ways to estimate it were presented. It
would be most desirable if the concept of validity could be treated in the same
way. Unfortunately, for most tests it is not so because validity is a more test-
dependent concept than reliability is. Furthermore, reliability is a purely statistical
parameter. That is, it can be determined fairly independently of the test itself. For
example, given a set of scores obtained from a given test, the degree of reliability
can be determined without even referring to the test. Validity, on the other hand,
depends mostly on the

peculiarities of the test. Therefore, just obtaining a certain number of scores from a
test will not enable test users to establish its validity. Validity refers to the extent
to which a test measures what it is supposed to measure. In the earlier Chapters,
the concept of the function of the tests, referring to the purpose for which a test is
used, was explained. The question of validity is similarly concerned with whether
the test is achieving what it is intended to or not. As an example, if a test is
designed to measure students' ability on a particular trait, it will be desirable to
observe that the test actually provides information on the intended trait rather
than something else. In order for a test to be valid, it is not sufficient to examine
the test from only a mathematical or statistical perspective. Other aspects are
important as well. That's why different types of validity exist to examine
different aspects of the test. Generally speaking, there are three types of validity:
content validity, criterion-related validity, and construct validity. Each is briefly
discussed below.

6.5.1 Content Validity


Content validity refers to the degree of correspondence between the test content
and the content of the materials to be tested. The content of materials may be
broadly defined to include both subject matter content and instructional
objectives. The subject matter content is concerned with the topics, or subject
matter areas, to be covered. The instructional objectives part is concerned with
the degree of learning that students are supposed to achieve. Both of these
aspects should be taken into account in examining content validity. So, content
validity may be defined as the extent to which a test measures a representative
sample of the content to be tested at the intended level of learning. The focus of
content validity, then, is on the appropriacy of the sample of elements included in
the test. The term appropriacy refers to both the appropriacy of the sample and
the appropriacy of the learning level. That's why content validity is sometimes
called the appropriateness of the test.
For a test to be content valid, it should meet two major criteria. First,
the content of the test should be selected appropriately to correspond to the

content of the materials to be tested. Second, the test should be aimed at measuring
the appropriate level of the students' learning. An example may help clarify the
point. Suppose a test is to be developed to measure examinees' ability in
recognizing the correct grammar or usage of the English language. To assure that
the content of the test corresponds to the content to be tested, i.e., usage or
grammar, the test should include a representative sample of grammatical items. This
would be possible through utilization of the Table of specifications discussed earlier
in Chapter Four. The test should not only represent the content, but also represent
the content proportionate to the importance or the weight given to certain elements.
Furthermore, the test items should meet, quite accurately, the level of learning
expected from the examinees. Complex items would not be appropriate for
elementary level students. Thus, in dealing with content validity, both the content of
the test and the student for whom the test is designed should be taken into account.
Content validity is the most important type of validity which can be achieved
through careful examination of the content of the test items. Although there is no
commonly used numerical expression for content validity, it provides the most
useful subjective information about the appropriateness of the test.
Of course, this subjectivity is a drawback in itself. Two individuals who do
not have the same understanding of the content to be tested may well make
different judgments about the match between the items and the content to be tested.
One way to avoid too much subjectivity is to have the test reviewed by more than
one expert. Another is to define the content to be tested in as detailed terms as
possible and transfer the detailed definition onto a Table of specifications. Here a
distinction should be made between content validity, which refers to the
appropriateness of the content of the test, and the so-called face validity, because so
often these two kinds of validity are confused. Face validity is not really validity at
all in the technical sense of the word. Face validity is simply whether the test looks
valid on the face of it or not. That is, would it be possible for untrained people to
look at the test and understand what the test is supposed to measure or not? Face
validity often is a desirable feature of a test in the sense that it is useful from a
public relations standpoint.

Of course, if a test appears to be irrelevant, examinees may not take the test
seriously, or potential users may not consider the results useful. Content validity,
on the other hand, is directly relevant to the content of the items comprising a test.
An example may clarify the point. Suppose a 20-item test on "prepositions" is
under review. The question of face validity will ask whether the test looks like a
grammar test of prepositions or not. In this case, of course, the answer is positive.
Thus, the test has face validity. However, whether the very same test has content
validity or not depends, totally, on a careful examination of the correspondence
between the items and the Table of specifications. If this test is developed to
measure students' achievement after a crash course on the use of prepositions, it
may have content validity as well. Otherwise, the content validity will be under
question.
It should be noted that a test may have a desirable level of content validity,
but not face validity. For example, a test such as a cloze test, which is
experimentally shown to have a reasonable validity, may not be acceptable, on
the face of it, as a good test of, say, grammar. Thus, test developers should not
be very much concerned with face validity. They should, however, be very
careful about establishing the content validity.
As mentioned before, even if a test proves to have an acceptable content
validity, the comments and evaluations will be basically subjective. No matter
how useful the information may be, the educators would want to convert this
information into numerical values. Of course, it is not possible to have numerical
values for content validity. Other types of validity, nevertheless, including
criterion-related validity and construct validity, are developed to obtain such
information.

6.5.2 Criterion-Related Validity


Through establishing the content validity, one would make sure that the test
serves its purpose in terms of its content. That is, the correspondence between the
content to be tested and the content of the test is present. This is a desirable
property of a test without considering any outside companson.

Criterion-related validity, on the other hand, investigates the correspondence between
the scores obtained from the newly-developed test and the scores obtained from some
independent outside criteria. The criteria can range from a teacher's subjective
judgment to standardized objective tests.
To obtain criterion-related validity, the newly-developed test has to be
administered along with the criterion measure to the same group. Then the
correspondence or correlation between the two sets of scores will be an indication of
the criterion-related validity of the test. Depending on the time of administration of
the criterion measure, two types of criterion-related validity are established:
concurrent and predictive.

6.5.2.1 Concurrent Validity


Concurrent validity is a kind of criterion-related validity. To obtain concurrent
validity, a test developed to measure a particular trait is administered concurrently
with another well-known, reputable test of which the validity is already established.
Two sets of scores, obtained from the newly-developed test and the criterion measure,
are correlated. The degree of correlation is an indication of the concurrent validity of
the test.
As an example, a newly-developed language proficiency test can be validated
against an already established test such as TOEFL. The question to be asked here is
whether or not the new test, which is designed to measure language proficiency, does
in fact measure this trait as well as TOEFL, which is also designed to measure
language proficiency. So, the two tests are administered to the same group
concurrently, and the degree of the correlation coefficient between the two sets of
scores will be an indication of the concurrent validity of the new test.
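Computationally, this index is an ordinary Pearson correlation between the two sets of scores. Below is a minimal sketch, with invented scores standing in for the new test and the criterion measure:

```python
def pearson_r(x, y):
    """Pearson correlation between two sets of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx = (sum((a - mx) ** 2 for a in x) / n) ** 0.5
    sy = (sum((b - my) ** 2 for b in y) / n) ** 0.5
    return cov / (sx * sy)

# Hypothetical scores of the same group on a new proficiency test
# and on an established criterion measure such as TOEFL:
new_test = [62, 55, 71, 48, 66, 59, 75, 52]
criterion = [540, 500, 590, 460, 555, 520, 610, 480]
print(round(pearson_r(new_test, criterion), 2))  # concurrent validity estimate
```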

6.5.2.2 Predictive Validity


Another kind of criterion-related validity is predictive validity, which is closely
related to concurrent validity. It is closely related because it, too, depends on some
sort of correspondence between the scores obtained from the new test and those
obtained from an already established one. It is different from

concurrent validity, however, in that the administration of the two tests is not
concurrent but separated by some time interval.
As an example, suppose that a test is developed with the claim that it will
predict an applicant's success in the nation-wide University Entrance
Examination. A group of applicants take the new test. Then after some time they
take the University Entrance Examination as well. The correspondence between
the two sets of scores will serve as the predictive validity of the test.
In using either predictive validity or concurrent validity, i.e., the criterion-related
validity, some points should be taken into account. First is the nature of the
criterion. The criterion measure should possess all the characteristics of a good
test. That is, it must have a reasonable index of reliability and validity. Second is
the content of the criterion. The content of the criterion measure must be on the
same domain as that of the new test. That is, the criterion measure should reflect
the same content which is to be measured by the newly developed test. Third is
the caution to be taken in the interpretation of the validity index. It is important
for educators and test users to bear in mind that the correspondence between the
new test and the criterion test can lead to a circular question. For example, test
A, the new test, is validated against test B, the criterion measure. But test B itself
should have been validated against still another measure, let's say, C. And so the
process goes on. This means that the criterion-related validity should be interpreted
quite cautiously because all validity indexes depend on the very first test against
which all other tests have been subsequently validated. Fourth is the empirical
nature of this validity. Through criterion-related validity numerical values are
obtained. This kind of validity is also known as empirical validity. In contrast,
content validity is non-empirical because it does not provide numerical
information. Most often researchers and educators depend heavily on empirical
validity in general, and on concurrent validity in particular, because of its
practicality and objectivity.

6.5.3 Construct Validity


The different types of validity thus far described are concerned with some
specific, practical use of test results. They provide information through which

educators can make decisions on how well a test measures a particular trait or
how well a test predicts the future performance of the testees. In addition to such
specific and immediately applicable uses, some other issues may be addressed
regarding testing devices. A crucial question which is asked about validity of a
test is whether it measures what it is purported to measure. So, if a test is
intended to measure the oral language ability of a student and it does, in fact,
measure this ability, the test is said to be valid. But a more crucial question
regarding test validity is whether there exists such an ability called oral language
ability that the test is intended to measure or not. In other words, a test is
designed to measure a trait. The trait is psychological in this case. And the
question is about the reality of this trait. Thus, construct validity refers to the
extent to which the psychological reality of a trait, or construct, can be
established.
A quick review of existing tests reveals that numerous tests are used under
different names. More popular ones include measures of intelligence, critical
reasoning, reading comprehension, study skills, and all kinds of aptitude and
attitude. Construct validity asks questions about the reality of these constructs.
Determining construct validity could have a crucial influence on the
multiplicity of tests. There may be various tests labeled differently which, in fact,
measure the same psychological construct. Construct validity is difficult to
determine because it requires the utilization of a sophisticated statistical procedure
called factor analysis. However, educators should always keep in mind that construct validity
is the most important type of validity which can dominate all others. The reason
is quite simple. If there does not exist a particular trait or construct, then there
will be no sense in attempting to measure it.
The importance of and the complexities involved in determining construct
validity have led to interesting arguments among scholars regarding the
underlying nature of what the tests actually measure. More research and
experiments are needed before one can clearly and unambiguously determine the
construct validity and the nature of psychological constructs.
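As a rough illustration of the kind of analysis involved, the sketch below fits a one-factor model to simulated item scores; the data are randomly generated, and the use of scikit-learn's FactorAnalysis is our choice of tool, not one named in the text.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)

# Simulate 200 examinees: one underlying ability drives six item scores,
# each contaminated by some measurement error.
ability = rng.normal(size=(200, 1))
items = ability @ np.ones((1, 6)) + 0.5 * rng.normal(size=(200, 6))

fa = FactorAnalysis(n_components=1, random_state=0)
fa.fit(items)

# Large loadings of similar size suggest the items tap one construct.
print(np.round(fa.components_, 2))
```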

6.6 Factors Influencing Validity
Many factors tend to influence the validity of test scores. Some of these factors are
obvious. For example, no one would evaluate a student's knowledge of
mathematics with an English language test, because it will not measure what the
test purports to measure and thus it will easily destroy the content validity of the
test. Nor would anyone measure an elementary student's language ability with a
fairly complex and difficult test because it will not measure the learner's true
ability. This would also destroy the content appropriacy of the test. Such factors
can easily be observed and avoided without any need for special expertise. There are,
however, some factors to which the test users should pay close attention. Some of
these factors are discussed below.

6.6.1 Directions
Directions of the test should be quite clear and simple to ensure that the testees
understand what they are expected to do. For instance, the examinees should be
informed whether they have to mark on the answer sheets or on the test papers;
whether they can guess on the items they are not sure of; whether they will be
penalized for wrong choices; and whether they are allowed to ask questions during
the test session. Ignoring these seemingly simple procedural elements will reduce
the validity of test scores because the obtained scores may fluctuate due to these
elements rather than the examinees' real ability.

6.6.2 Difficulty Level of the Test


As mentioned before, item facility is one of the characteristics of an individual
item. Too easy or too difficult items will jeopardize test validity because such items
are either below or above the ability level of the testees, and thus they will not
measure the testees' real knowledge.

6.6.3 Structure of the Items


Poorly constructed and/or ambiguous items will contribute to the invalidity of the
test because such items do not allow testees to perform to their potential.

If a test taker misses a poorly constructed item, it will not be known whether he missed the
item because he did not know the correct response or because he simply did not understand it.

6.6.4 Arrangement of Items and Correct Responses


Test items are usually arranged in the order of difficulty. That is, a test typically starts with
easy items and progresses toward difficult ones. Furthermore, item responses are usually
arranged randomly to avoid any identifiable pattern for the correct response. If items are
not arranged from easy to difficult, the testee may feel frustrated at the very outset, and
thus, the scores obtained will not be valid.
Considering the above-mentioned factors, test users can avoid some unwanted
sources which contribute to test invalidity. The more control there is over these factors, the
more confidence the test developers can have in the validity of the scores they obtain from
a particular test.

6.7 The Relationship Between Reliability and Validity
Although reliability and validity were discussed under separate topics, in most cases they are closely
interrelated. Before explaining their interrelationship, two points should be clarified. First,
reliability and validity refer to the scores obtained from a particular test, not to the test
itself. That is, claiming a test to be reliable or valid is a common mistake. Rather, it should
be stated that the scores obtained from a test are reliable or valid.
Second, reliability is a mathematical concept. That is, when a test shows a certain
degree of reliability, it will produce, to the same degree, consistent scores. But validity is a
relative term. Validity depends on the purpose of the test. By changing the purpose of the
test, validity can completely disappear. For example, a test may be quite valid for one
purpose but not as valid for another, and absolutely invalid for still another purpose.
Keeping these points in mind, the reader should remember that reliability is an independent
statistical concept. Its computation, no matter what method is employed, depends totally on
a set of scores. Validity, on the other hand, has a direct

correspondence to the content of the test. For example, criterion-related validity
is based on the correlation between two tests measuring the same or similar
subject matter. However, it should be borne in mind that there is an underlying
mathematical relationship between reliability and validity.
If a test is reliable, it may or may not be valid. For example, if a particular
test on mathematics produces consistent results, it is not, by any means, a valid
measure of one's language ability, even if it is a valid measure of mathematics in
a particular context. On the contrary, if a test demonstrates a certain degree of
validity, it is to some extent reliable. That is why validity is the single most
important characteristic of a test. The reason is that validity is the degree of
correlation between two tests. Correlation will not exist if the test is not reliable.
In technical terms, the square of correlation is the common variance between the
two tests. There must exist reliability, i.e., variance, in each test in order for
correlation to materialize. In fact, it is possible to describe this relationship
through the following formula:

$$r_{xy} \le \sqrt{r_x} \times \sqrt{r_y}$$

In this formula, $r_{xy}$ refers to the correlation between x and y, i.e., validity; $r_x$ and
$r_y$ refer to the reliabilities of x and y, respectively. The formula indicates that the
correlation between two tests is at most equal to the product of the square roots of
the reliability coefficients of the two tests. In other words, if a test demonstrates a
certain degree of validity, its reliability must be at least as large as the square of
that validity coefficient.
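A quick numeric check of this ceiling, with hypothetical reliability coefficients:

```python
import math

def max_validity(rx, ry):
    """Upper bound on the correlation (validity) between two tests,
    given their reliability coefficients."""
    return math.sqrt(rx) * math.sqrt(ry)

# With reliabilities of 0.64 and 0.81, the validity coefficient
# cannot exceed 0.8 * 0.9 = 0.72.
print(round(max_validity(0.64, 0.81), 2))
```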
From the discussion about the relationship between validity and
reliability, one can conclude that a reliable test may or may not be valid, but a
valid test is, to some extent, reliable. Therefore, validity is more important
than reliability.

6.8 Reliability, Validity and Acceptability


A common question usually asked by students and teachers is directed toward
the acceptable or desirable magnitude of reliability and validity coefficients.
How reliable and valid should a test be? There is no ready-made or clear-cut

answer to this question. It depends, to a great extent, on the application of test
scores, and on the importance of the decisions to be made upon the test scores.
Consider the following cases. Suppose a newly-developed extracurricular
program utilizes a reading book which is intended to enhance the students'
reading comprehension ability. Of course, the students are required to take a test
on the materials. The results obtained from such a test do not play an important
role in making a decision about the educational achievement of the students.
Therefore, even if a moderately reliable or valid test is used in such a situation, it
will not have any serious consequences upon the students' lives. On the contrary,
when a test is used to make crucial decisions, such as to place a student in a
particular class, to direct a student to a particular career, to promote a student to a
higher level, or to admit an applicant to an educational program, the tests should
be highly valid and reliable. Of course, making a decision in educational settings
is not a unidimensional process. Seldom do educators make a crucial decision on
the basis of a single score on a single test. Most often, decision-making requires
multi-dimensional sources of information. As a general rule, the more pieces of
information gathered on a particular trait, the more logical decisions will be made
on that trait. Therefore, educators should gather as many pieces of information on
a trait as they can in order to make sound decisions. However, where final
decisions are being made, the educators are compelled to seek the most reliable
and valid information.
Thus, the answer to the question posed at the beginning of this section
depends on the importance of the decision to be made on the basis of test scores.
The more important the decision to be made, the more confidence is needed in
the scores, and thus, the more reliable and valid tests are required. Nevertheless,
it is a generally accepted tradition that validity and reliability coefficients below
0.50 are considered low, 0.50 to 0.75 are considered moderate, and 0.75 to 0.90
or above are considered high. In language testing too, these ranges can be used
depending on the various situations in which the tests are used.
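These conventional ranges are straightforward to encode; in the following sketch, the function is ours, while the thresholds are those cited above:

```python
def classify_coefficient(r):
    """Label a reliability or validity coefficient using the
    conventional ranges cited in the text."""
    if r < 0.50:
        return "low"
    elif r < 0.75:
        return "moderate"
    else:
        return "high"

for r in (0.45, 0.68, 0.88):
    print(r, classify_coefficient(r))
```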

6.9 Practicality
At the outset of this Chapter, it was mentioned that a good test should have three
characteristics, namely, validity, reliability, and practicality. The first two were
discussed in the previous sections. The third characteristic, practicality, is to be
presented in this section.
Although reliability and validity are two crucial properties of a test, practical
considerations are important as well. No matter how reliable or valid a test may
be, it should be usable and practical. Generally speaking, practicality refers to the
ease of administration and scoring of a test. In addition to these two major
considerations, some other factors, such as the time of administration, cost of
testing, ease of interpretation and application of scores, and availability of comparable
forms, contribute to the practicality of a test.

6.9.1 Ease of Administration


Tests are usually administered by teachers. Most often, teachers do not have
special training in testing. Therefore, easy administration of a test is an important
quality for such teachers. One of the factors which makes a test easy to administer
is the clarity and simplicity of directions. Clear, simple, and easily read directions
receive the highest significance in foreign language situations in general, and with
low proficiency level students in particular. In such cases, directions can safely be
translated into the testees' native language. Particularly with low proficiency level
students, giving the directions in the native language of the testees may even be
desirable. Another factor contributing to the difficulty of administration is the
excessive number of subtests. The fewer the number of subtests, the easier the test
administration, because too many subtests will make the test complicated. And
finally, the time required for a test is important because sitting for a test for a long
time will tire the students.
Any inconvenience created by giving difficult directions, inappropriate
timing, and other aspects of administration will negatively influence the test
results, and thus make the test impractical.

6.9.2 Ease of Scoring
Traditionally, one of the most time-consuming and troublesome stages of the
testing process is the scoring of the test. As discussed in the earlier Chapters, a
test can be scored subjectively or objectively. Subjective scoring procedures
create many problems regarding the validity and reliability of the tests. In
addition, sometimes tests include too many subsections with complex keys for
scoring. This kind of complexity will also make the scoring task quite tedious.
Since scoring is difficult and time consuming, the trend is toward
objectivity, simplicity, and machine scorability. That is, test developers attempt
to avoid subjectively scored tests as much as possible. Nowadays, except for very
rare cases, almost all standardized tests are objectively scored. Furthermore, test
developers tend to prefer simple scoring systems. And finally, developments in
technology have contributed to facilitating the scoring procedures. By using
machine scored answer sheets, almost all complexities of scoring a test have
vanished. Of course, where the exploitation of machines is not possible, using
separate answer sheets which can be scored by mapping keys are of great help. In
short, other things being equal, a test which provides ease of scoring and
economy without sacrificing scoring accuracy should be selected and used.

6.9.3 Ease of Interpretation and Application


Clearly, developing a test is a costly procedure. The purpose of a test is to make
decisions on a certain aspect of the test taker's life. No matter how reliable, valid,
or easily administered a test may be, the most crucial point about a test is the
meaningfulness of the scores obtained from that test. For example, what does it
mean to get a score of X on a particular test? Is the score interpreted on a norm-
referenced system? Should the magnitude of the standard error of measurement
be taken into account or not? How clear-cut a decision can be made? These and
many other questions are fundamental to the effective application and accurate
interpretation of the scores. If the test results are misinterpreted or misapplied,
they will be of little value and may actually be harmful to some individual or
group.

Of course, for standardized tests, information concerning the interpretation
and use of scores is usually available from the test manual. However, care must be
taken to see how easy it is to convert raw scores into meaningful scores, how
clearly the Tables of norms are presented, and how comprehensive the
suggestions are for using and applying test results.
To sum up, next to the validity and reliability of a test, practical
considerations play an important role in using a test. In general, practicality deals
with the facilities needed to administer a test. These facilities concern the scoring
and administration of the test, required equipment for the administration of the
test, and interpretation of scores obtained from the test. Due to the limited
knowledge of most school teachers regarding the technicalities involved in the
testing process, attempts should be made to select, develop, or use tests which are
easily administered, easily scored, and provide easily interpreted scores.
However, it should be mentioned that although practicality is an important
factor in selecting and using a test, educators should not ignore the vitality of the
decisions to be made on the test scores. If crucial decisions are to be made,
practicality should not be considered a determinant factor. For example, using a
composition test is not practical from a scoring point of view, especially for large-scale
administrations. Nor is conducting oral interviews. However, when the selection
of language teachers is at stake, the difficulties regarding the administration or
scoring of oral interviews should not deter authorities, because it is very important
for language teachers to be able to express themselves orally. Or, when the selection of
professional writers is the purpose of the test, administering a composition test
should be mandatory, because in the selection of such people writing ability plays
a crucial role. Of course, in large-scale administrations, preliminary screening can
be accomplished by using practical paper and pencil tests. This procedure can
reduce the number of testees. Then, in the final stages, and with a small number of
testees, all necessary testing devices can be used regardless of their practicality.

Activities
1. A test of reading comprehension has a variance of 25 and reliability of 0.81.

Calculate the standard error of measurement for this test.
2. What is the maximum correlation between two tests with reliabilities of

0.64 and 0.81?


3. Why is the parallel-forms method of estimating reliability an improvement

over the test-retest method?


4. A 40-item vocabulary test has a mean of 25 and standard deviation of 10.

What is the reliability of this test?


5. The reliability of a test computed through the split-half method is 0.70.
What is the reliability of the full test?
6. How and why does the homogeneity of subjects and items influence the

reliability and validity of a test?


7. A 20-item test has a reliability of 0.40. If the number of items were increased
to 30, what would be the reliability of the new test? How about 40?
8. How does concurrent validity differ from predictive validity?
9. Which is more important: reliability or validity? Why?
10. What is the relationship between item characteristics and test
characteristics?
