English Language Assessment Module
CHAPTER I
EVALUATION IN LANGUAGE LEARNING
This chapter discusses the position of evaluation in language learning, explains some key terms in language testing, and describes the functions and targets of evaluation in language learning.
Discussion
1. Mention the three main components of the teaching-learning process and explain the relationship between them.
2. Discuss the differences between ‘evaluation’ and ‘test’.
3. How can the result of evaluation affect the students’ learning?
4. What do you think a teacher can do when the result of evaluation is not very satisfying?
5. In what case should the learning goal be modified?
6. When students are involved in a language program, what aspects need to be evaluated?
CHAPTER II
APPROACHES AND TYPES OF LANGUAGE TESTING
This chapter provides some explanation of approaches to language testing, together with their historical background and some important issues in language testing. This is followed by some classifications of language tests based on purpose, interpretation, and scoring method.
A. Approaches to Language Testing: A Brief History and Some Current Issues in Classroom Testing
In order to get a comprehensive understanding of classroom-based testing, the classification of approaches to language testing is presented in relation to the history of language testing over the past half-century. The next discussion will be about some current issues in classroom language testing. Heaton (1988) classifies approaches to language testing into four main types: (1) the essay-translation approach, (2) the structuralist approach, (3) the integrative approach, and (4) the communicative approach. Brown (2004) adds one more approach: performance-based assessment.
a. Performance-Based Assessment
In language courses and programs around the world, test designers are now tackling this new
and more student-centered agenda. Instead of just offering paper-and-pencil selective response tests,
performance-based assessment of language typically involves oral production, written production,
open-ended responses, integrated performance (across skill areas), group performance, and other
interactive tasks. To be sure, such assessment is time-consuming and therefore expensive, but those
extra efforts are paying off in the form of more direct testing because students are assessed as they
perform actual or simulated real-world tasks. In technical terms, higher content validity is achieved
because learners are measured in the process of performing the targeted linguistic acts.
In an English language-teaching context, performance-based assessment is more likely to lie on the borderline between formal and informal assessment. The extreme differences between formal and informal assessment will be described in the section on traditional and alternative assessment. Considerably more time and higher institutional budgets are required to administer and score assessments that presuppose more subjectivity, more individualization, and more interaction in the process of offering feedback. The payoff of the latter, however, comes with more useful feedback to students, the potential for intrinsic motivation, and ultimately a more complete description of a student’s ability (Brown, 2004:14).
b. Computer-Based Testing
Recent years have seen a burgeoning of assessment in which the test-taker performs
responses on a computer. Almost all computer-based test items have fixed, closed-ended responses.
However, tests like the Test of English as a Foreign Language (TOEFL) offer a written essay section
that must be scored by humans (as opposed to automatic, electronic, or machine scoring).
A specific type of computer-based test, a computer-adaptive test, has been available for many
years but has recently gained momentum. In a computer-adaptive test (CAT), each test-taker receives
a set of questions that meet the test specifications and that are generally appropriate for his or her
performance level. The CAT starts with questions of moderate difficulty. As test-takers answer each question, the computer scores the question and uses that information, as well as responses to previous questions, to determine which question will be presented next. As long as examinees respond correctly, the computer typically selects questions of greater or equal difficulty. Incorrect answers, however, typically bring questions of lesser or equal difficulty. The computer is programmed to fulfill the test
design as it continuously adjusts to find questions of appropriate difficulty for test-takers at all
performance levels. In CATs, the test-taker sees only one question at a time, and the computer scores
each question before selecting the next one. As a result, test-takers cannot skip questions, and once they have entered and confirmed their answers, they cannot return to them or to any earlier part of the test.
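The adaptive loop just described can be sketched in a few lines of Python. The nine-level difficulty scale, the simple level counter, and the fixed test length here are illustrative assumptions; operational CATs such as the TOEFL estimate ability with item response theory rather than stepping through discrete levels.

import random

def run_cat(item_pool, num_items, ask):
    # item_pool: dict mapping a difficulty level (1 = easiest, 9 = hardest)
    # to a list of questions; ask(question) presents one question and
    # returns True if the test-taker answers it correctly.
    level = 5                          # start with moderate difficulty
    score = 0
    for _ in range(num_items):
        question = random.choice(item_pool[level])
        if ask(question):              # each question is scored before the next is chosen
            score += 1
            level = min(level + 1, 9)  # correct -> greater or equal difficulty
        else:
            level = max(level - 1, 1)  # incorrect -> lesser or equal difficulty
    return score                       # the test-taker can never skip or go back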
Proficiency Tests
Proficiency tests are designed to measure people’s ability in a language regardless of any
training they may have had in that language. The content of a proficiency test, therefore, is not based
on the content or objectives of language courses which people taking the test may have followed.
Rather, it is based on a specification of what candidates have to be able to do in the language in order
to be considered proficient. ‘Proficient’ means having sufficient command of the language for a
particular purpose.
An example of a proficiency test would be a test used to determine whether a student’s English is
good enough to follow a course of study at a British university. Such a test may even attempt to take
into account the level and kind of English needed to follow courses in particular subject areas.
There are other proficiency tests which, by contrast, do not have any occupation or course of
study in mind. For them, the concept of proficiency is more general. A typical example of a
standardized proficiency test is the Test of English as a Foreign Language (TOEFL) produced by the
Educational Testing Service. The function of such a test is to show whether candidates have reached a certain standard with respect to certain specified abilities. Usually, the test producers are independent of teaching institutions and so can be relied on by potential employers and others to make fair comparisons between candidates.
Proficiency tests have traditionally consisted of standardized multiple choice items on grammar,
vocabulary, reading comprehension, aural comprehension and sometimes a sample of writing.
Achievement Tests
In contrast to proficiency tests, achievement tests are directly related to language courses. The
purpose of this kind of test is to establish how successful individual students, groups of students, or the
courses themselves have been in achieving objectives.
There are two kinds of achievement tests: final and progress ones. Final achievement tests are
those administered at the end of a course of study. The content of a final achievement test should be based directly on course objectives. Progress achievement tests, on the other hand, are intended to
measure the progress that students are making. Since ‘progress’ is towards the achievement of course
objectives, these tests too should relate to objectives (short-term objectives).
Diagnostic Tests
Diagnostic tests are used to identify students’ strengths and weaknesses. They are intended
primarily to ascertain what further teaching is necessary. Therefore, they are designed to diagnose particular aspects of a language.
A diagnostic test in pronunciation might have the purpose of determining which phonological
features of English are difficult for a learner and should therefore become the part of a curriculum.
Usually, such tests offer a checklist of features for the administrator (often the teacher) to use in
pinpointing difficulties. It is not advisable to use a general achievement test for diagnosis, since diagnostic tests should provide information on what students need to work on in the future. A diagnostic test will therefore typically offer more detailed, subcategorized information on the learner. Conversely,
achievement tests are useful for analyzing the extent to which students have acquired language
features that have already been taught.
Placement Tests
Placement tests, as their name suggests, are intended to provide information which will help to
place students at the stage (or in the part) of the teaching program most appropriate to their abilities.
Typically, they are used to assign students to classes at different levels.
Placement tests can be bought, but this is not to be recommended unless the institution concerned is
quite sure that the test being considered suits its particular teaching program. No one placement test
will work for every institution and the initial assumption about any test that is commercially available
must be that it will not work well.
The placement tests that are most successful are those constructed for particular situations.
They depend on the identification of the key features at different levels of the teaching in the institution.
Such placement tests will result in accurate placement.
Aptitude Tests
Finally, we need to consider the type of test that is given to a person prior to any exposure to
the second language, a test that predicts a person’s future success. A language aptitude test is
designed to measure a person’s capacity or general ability to learn a foreign language and to be
successful in that undertaking. Aptitude tests are considered to be independent of a particular
language.
Two standardized aptitude tests have been used in the US – the Modern Language Aptitude
Test (MLAT) and the Pimsleur Language Aptitude Battery (PLAB). Both are English language tests and
require students to perform such tasks as memorizing numbers and vocabulary, listening to foreign
words, and detecting spelling clues and grammatical patterns.
Discussion
1. Try to summarize the strengths and the weaknesses of each testing approach!
2. Among the five approaches described in this book, which one do you think is the most effective for assessing students’ language proficiency?
3. How can computer-based assessment assist the test administrator and the test-takers?
4. By the end of a language learning program, the teacher gives the students a test. What kind of test is she giving?
5. Give one example of a proficiency test and explain the purpose of such a test.
CHAPTER III
PRINCIPLES OF LANGUAGE TESTING
This chapter explores how principles of language assessment can and should be applied to formal
tests, but with ultimate recognition that these principles also apply to assessments of all kinds. Brown (2004) proposes five principles: reliability, validity, practicality, authenticity, and washback.
A. Reliability
A reliable test is consistent and dependable. If the students are given the same test on two
different occasions, the test should yield similar results. The word ‘similar’ is used here because it is
almost impossible for the test-takers to get exactly the same scores when the test is repeated the
following day. This is because of the fact that human beings do not simply behave in exactly the same
way on every occasion, even when the circumstances seem identical. Therefore, the more similar the
scores are, the more reliable the test is.
Hughes (1989) presents some examples of students’ test scores. Table 3.1a represents the scores obtained by ten students who took a 100-item test A on a particular occasion and the scores they obtained a day later. Note the size of the difference between the two scores for each student.
Now look at Table 3.1b which displays the same kind of information for another 100-item test B. Again
note the difference in scores for each student.
From the above tables, it can be seen that the differences between the two sets of scores are
much smaller for Test B than for Test A. Therefore, it can be concluded that Test B appears to be more reliable than Test A, although in practice claims about reliability would not be made so simply for such a small number of test-takers.
1. Test-Retest Method
To arrive at the reliability coefficient of a test, the following requirements should be met. First,
we have to get two sets of scores for comparison. The most obvious way of obtaining these is to get a
group of subjects to take the same test twice. This is known as test-retest method.
Consequently, there will be two sets of scores from the first and the second administration of
the same test. Then, the two sets of scores are calculated to get the correlation between them.
Pearson Product-Moment formula is usually used to find the correlation:

rxy = Σ(X − X̄)(Y − Ȳ) / (N · Sx · Sy)

where
rxy = Pearson product-moment reliability coefficient
X = each student’s score on test X
X̄ = mean of test X
Sx = standard deviation of test X
Y = each student’s score on test Y
Ȳ = mean of test Y
Sy = standard deviation of test Y
N = the number of students who took the two tests
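As a minimal illustration, the formula can be computed directly in Python; the two score lists below are hypothetical, standing for the first and second administrations of the same test:

import math

def pearson_r(x, y):
    # Pearson product-moment correlation between two sets of scores,
    # using population standard deviations as in the formula above.
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((xi - mean_x) ** 2 for xi in x) / n)
    sy = math.sqrt(sum((yi - mean_y) ** 2 for yi in y) / n)
    return sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n * sx * sy)

first = [68, 45, 72, 50, 81, 63, 57, 90, 66, 74]   # first administration
second = [70, 48, 69, 55, 83, 60, 59, 88, 64, 76]  # the same students a day later
print(round(pearson_r(first, second), 2))          # close to 1 = highly reliable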
However, this method has a weakness: if the second administration of the test is too soon after the first, subjects are likely to recall items and their responses to them, and the reliability will then be spuriously high.
4. Kuder-Richardson Reliability
Kuder-Richardson reliability requires only one test administration. A correct answer is scored 1, while an incorrect answer is scored 0. There are two commonly used formulas, KR-20 and KR-21; the latter is considered simpler than the former.
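KR-21 needs only the number of items (k), the mean (M), and the variance (s²) of the total scores: KR-21 = (k / (k − 1)) × (1 − M(k − M) / (k·s²)). A minimal Python sketch, with hypothetical scores of ten students on a 40-item test:

def kr21(k, scores):
    # KR-21 reliability estimate from total scores on a k-item test
    # where each item is scored 1 (correct) or 0 (incorrect).
    n = len(scores)
    mean = sum(scores) / n
    variance = sum((s - mean) ** 2 for s in scores) / n
    return (k / (k - 1)) * (1 - mean * (k - mean) / (k * variance))

print(round(kr21(40, [31, 25, 36, 22, 28, 33, 27, 35, 30, 24]), 2))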
5. Rater Reliability
Besides the ways of establishing reliability mentioned previously, there is another kind of reliability specialized for subjective tests, in which responses cannot simply be judged correct or incorrect and a rater is involved in the process of judgment. Examples of such tests are tests of writing and speaking.
Rater reliability also requires two sets of scores, but the scores are not obtained from the test-retest method, the alternate-form method, or the split-half method. Rather, the two sets of scores are obtained through intra-rater or inter-rater scoring.
Intra-rater reliability is estimated when one scorer or rater does the scoring twice. The two sets of scores obtained are then calculated using the Pearson Product-Moment formula to get the correlation coefficient.
Inter-rater reliability is estimated when two scorers or raters do the scoring. Then, as in intra-rater reliability, the two sets of scores obtained from the two raters are calculated to get the correlation coefficient.
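Computationally, both kinds of rater reliability reduce to the correlation already shown; reusing the pearson_r sketch above with two hypothetical raters’ essay scores:

rater_a = [72, 58, 85, 64, 90, 77, 69, 81, 55, 73]  # first rater’s scores
rater_b = [70, 61, 88, 60, 92, 74, 71, 79, 58, 75]  # second rater, same ten scripts
print(round(pearson_r(rater_a, rater_b), 2))        # inter-rater reliability coefficient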
B. Validity
The most complex criterion of an effective test and the most important principle of language
testing is validity. It is the extent to which inferences made from assessment results are appropriate,
meaningful, and useful in terms of the purpose of the assessment (Gronlund in Brown, 2004:22). A test
should test what the writer wants to test. Test validity presupposes that the writer can be explicit about
what is to be tested and takes steps to ensure that the test reflects realistic use of particular ability to be
measured (Weir, 1993:19). A valid test of reading ability actually measures reading ability, not previous
knowledge, nor some other irrelevant variable. To measure writing ability, one might ask students to write for 15 minutes and then simply count the words for the final score. Although it would be easy to
administer (practical) and the scoring is quite dependable (reliable), it would not be considered a valid
test of writing ability because there is no consideration of comprehensibility, organization of ideas and
other factors of writing ability.
How is the validity of a test established? The four types of validity below provide evidence for judging the validity of a test.
Content Validity
A test is said to have content validity if its contents constitute a representative sample of the language skills, structures, etc. being tested. It is obvious that a grammar test, for instance, must be made up of items testing knowledge of grammar. However, this alone is not enough to ensure content validity. The test will have content validity only if it includes a proper sample of the structures or content relevant to the purpose of the test. It would be absurd for an achievement test for intermediate learners to have the same content as one for advanced learners. In order to judge whether or not a test has content validity, we need a specification of the skills or structures being tested. A comparison of test specification and test content is the basis for judging content validity.
Criterion-Related Validity
Another approach to test validity is to see how far results on the test agree with those provided by some independent and highly dependable measure. This independent measure is the criterion against which the test is validated, and this kind of validity is called criterion-related validity.
There are essentially two kinds of criterion-related validity: concurrent validity and predictive
validity (Hughes, 1989:23). Concurrent validity is established when the test and the criterion are
administered at about the same time. Demonstrating concurrent validity usually requires one group of
students to take two kinds of tests: the new test being developed and another well-established test. For
instance, to demonstrate the criterion-related validity of a new test called Test of Overall ESL
Proficiency (TOESLP), a test developer might administer it to a group of students wishing to study
English in the USA. As a criterion measure, the test developer might also administer a well-established test, the TOEFL, to the same group of students. The two sets of scores obtained from both tests are then calculated for the correlation coefficient. If the calculation yields, say, 0.95, this indicates a very strong relationship between the two sets of scores. Thus, it can be concluded that the new test measures much the same abilities as the TOEFL.
Predictive validity, on the other hand, concerns the degree to which a test can predict future
performance of test-takers. To establish this kind of validity, the test developer might administer a test before the students start a course. At the end of the course, the students’ later performance (for example, scores on a final achievement test) is obtained, and the two sets of scores are calculated for the correlation coefficient. The closer the correlation is to 1, the stronger the relationship between the two sets of scores, and the better the test predicts the students’ future performance.
Construct Validity
A test is said to have construct validity if it can be demonstrated that it measures just the ability that it is supposed to measure. The word ‘construct’ refers to any underlying ability which is
hypothesized in a theory of language ability. Brown (2004:25) mentioned that a construct is any theory,
hypothesis, or model that attempts to explain observed phenomena in our universe of perception.
An illustration of the use of underlying theory in language testing is as follows. It is assumed that the ability to write involves a number of sub-abilities, such as control of punctuation, style, and grammar. Given such knowledge of the sub-abilities of writing, we would not develop a test of writing in multiple choice form, where control of punctuation, for instance, cannot be demonstrated. Rather, the right form of a writing test is one that asks the test-takers to write.
Face Validity
A test is said to have face validity if it looks as if it measures what it is supposed to measure.
For example, a test which pretended to measure pronunciation ability but which did not require the test-
takers to speak might be thought to lack face validity. This is true even if the test’s construct and
criterion-related validity can be demonstrated. Face validity is hardly a scientific concept, yet it is very
important. A test which does not have face validity may not be accepted by test-takers, teachers,
education authorities or employers.
C. Practicality
Besides being reliable and valid, an effective test is practical. This means that it is not
excessively expensive, stays within appropriate time constraints, is relatively easy to administer, and
has a scoring procedure that is specific and time-efficient.
A test that is prohibitively expensive is impractical. A test of language proficiency that takes a
student five hours to complete is impractical – it consumes more time (and money) than necessary to
accomplish its objective. A test that takes a few minutes for a student to complete but several hours for an examiner to evaluate is impractical for most classroom situations.
D. Authenticity
The next principle of a good test is authenticity. The idea of authenticity is usually associated with real-world tasks (tasks likely to be performed in the real world). More specifically, an authentic test is described by Brown (2004:35) as having the following characteristics:
The language in the test is as natural as possible.
Items are as contextualized as possible rather than isolated.
Topics and situations are meaningful (relevant, interesting, enjoyable, and/or humorous) for the
students.
Some thematic organization is provided, such as through a story line or episode.
Tasks represent, or closely approximate, real-world tasks.
E. Washback
Washback generally refers to the effects a test has on instruction. This can be in the form of
how students prepare for the test. “Cram” courses and “teaching to the test” are examples of such
washback. Another form of washback that occurs more in the classroom assessment is the information
that ‘washes back’ to the students in the form of useful diagnoses of strengths and weaknesses.
Washback also includes the effects of an assessment on teaching and learning prior to the assessment
itself, that is, on preparation for the assessment. Informal performance assessment is by nature more
likely to have built-in washback effects because the teacher is usually providing interactive feedback.
Formal tests can also have positive washback, but they provide no washback if the students receive only a simple letter grade or a single overall numerical score.
The challenge to teachers is to create classroom tests that serve as learning devices through
which washback is achieved. Students’ incorrect responses can become windows of insight into further
work. Their correct responses need to be praised. Finally, washback enhances a number of basic
principles of language acquisition: intrinsic motivation, autonomy, self-confidence, language ego, and
interlanguage (Brown, 2004:29).
Discussion
1. Review the five basic principles of language testing and define them clearly!
2. Reliability can be achieved in several ways. Explain these ways briefly!
3. There are many factors causing unreliability of a test. Discuss these factors concisely!
4. Why do you think that content validity is important?
5. Do you think that face validity is essential in achieving a valid test?
6. What things should be considered to develop a practical test?
7. What do you know about an authentic test?
8. Give examples of washback that can be achieved by test administration!
CHAPTER IV
TESTING LANGUAGE SKILLS AND COMPONENTS
This chapter discusses several ways of testing language skills and components, preceded by some explanation of various test techniques.
Multiple Choice Test
A multiple choice item consists of a stem and a number of options, one of which is the correct or most appropriate answer (the key) while the others are distractors, for example:

Because it was raining hard, Tom …… his raincoat.
A. took off B. put on C. put off D. put away

It is the test-takers’ task to identify the correct or most appropriate option (in this case B).
The multiple choice technique has some advantages. The most obvious is that scoring can be perfectly reliable. Scoring is also rapid and economical. A further considerable advantage is that more items can be included than in other forms of test, since the test-takers have only to make a mark on the paper.
Despite these advantages, the multiple choice technique also has some limitations. It tests only recognition knowledge and so cannot give an accurate picture of test-takers’ productive performance. A multiple choice grammar test score, for example, may be a poor indicator of someone’s ability to use grammatical structures: the person who can identify the correct response in the item above may not be able to produce the correct form when speaking or writing. Therefore, the construct validity of the technique is questionable.
Besides, the multiple choice technique gives the test-takers a chance of guessing the correct answer, and it cannot be known what part of any particular individual’s score has come about through guessing.
Writing successful items for multiple choice tests is also extremely difficult. Hughes (1989:61) lists some of the commonest problems in multiple choice tests: more than one correct answer, no correct answer, clues in the options as to which is correct, and ineffective distractors.
Practising language through multiple choice items may also not be the best way for students to improve their command of a language, since much attention is usually paid to improving one’s guessing rather than to the content of the items. For this reason, Hughes (1989:61) considers multiple choice tests to have harmful backwash.
Finally, the multiple choice technique is said to facilitate cheating, because the responses (a, b, c, d) are so simple that test-takers can easily communicate them to others nonverbally.
All in all, the multiple choice technique is best suited to relatively infrequent testing of large numbers of test-takers. In order to write effective multiple choice items, Djiwandono (2008:47) suggests that the test developer take care in formulating the stem, the correct answer, and the distractors. The stem should be a complete sentence whenever possible. To discourage guessing, the options should be parallel in form, content, and length; such parallel options force the students to think critically.
Matching Test
Matching tests require the students to match two parts of a test. The two parts are usually interrelated in terms of meaning or content and are usually presented as two lists: the first usually consists of statements or questions, while the second consists of responses. To make a matching test effective, the number of responses should exceed the number of statements; this keeps the students thinking critically until the last question.
The supposed advantages of the C-Test over the more traditional cloze procedure are that only exact scoring is necessary and that shorter passages are possible. (In a C-Test, the second half of every second word is deleted and must be restored, for example: Steven to__ his umbr____ because i_ looked li__ rain.) A possible disadvantage of the C-Test is that such a passage is harder to read than a cloze passage.
Dictation is a testing technique in which a passage is read aloud to students, with pauses during which they have to write down what they have heard as accurately as possible (Richards et al., 1992). Dictation tests give results similar to those obtained from cloze tests. In predicting overall ability, dictation has the advantage of involving listening ability. It is also easy to create and relatively easy to administer, but certainly not easy to score. Because of this scoring problem, partial dictation may be considered as an alternative: part of what is dictated is already printed on the answer sheet, the test-takers simply fill in the gaps, and the scoring is likely to be more reliable.
Testing Listening
It may seem rather odd to test listening separately from speaking, since the two skills are
typically exercised together in oral interaction. However, there are occasions, such as listening to the
radio, listening to lectures, or listening to announcements, when no speaking is required.
The testing of listening involves listening macro-skills and micro-skills. The macro-skills of
listening include listening for specific information, obtaining the gist of what is being said, following directions, and following instructions. The micro-skills of listening include interpretation of intonation
patterns and recognition of function of structures. At the lowest level are abilities like being able to
distinguish between phonemes (for example between /w/ and /v/).
There are several types of texts that can be used for listening test such as monologue,
dialogue, or multi-participant. Those types can be in the forms of announcement, talk or lecture,
instructions, directions, etc.
The source of listening test materials can be recordings of radio broadcasts, television
broadcast, teaching materials, or even recording of native speakers made by ourselves. The most
important thing to consider is that recordings must be good and natural.
There are some techniques that are possibly used in testing listening. They are multiple choice,
short answer, information transfer, note taking, partial dictation and recordings and live presentations.
Multiple choice The technique has the advantages and disadvantages discussed previously. For a listening test, the problem is greater because the test-takers have to listen to a passage while reading the alternatives. Therefore, the options must be short and simple.
Information transfer This technique is useful in testing listening since it makes minimal
demands on productive skills. It can involve such activities as the labeling of diagrams or
pictures, completing forms, or showing routes on a map.
Note taking Where the ability to take notes while listening to a lecture is in question, this activity
can be quite realistically replicated in the testing situation. Test-takers take notes during the
talk, and only after the talk is finished do they see the items to which they have to respond.
Partial dictation Although partial dictation may not be an authentic listening activity, it may be
possible to administer it when no other test of listening is practical.
Recording and live presentation The great advantage of using recordings when administering a listening test is uniformity in what is presented to the test-takers. This is fine if the recording is listened to in a well-maintained language laboratory or in a room with good acoustic qualities and suitable equipment. If these conditions cannot be met, then a live presentation is preferable.
Testing Speaking
The objective of teaching speaking is the development of the ability to interact successfully in the language, which involves comprehension as well as production. Consequently, a speaking test should elicit behavior which truly represents the students’ ability and which can be scored validly and reliably.
The materials tested for speaking test include dialog and multi-participant interactions including
operations of language functions such as:
- Expressing: thanks, requirements, opinions, comment, attitude, confirmation,
apology, want/need, information, complaints, reasons/justifications.
- Narrating: sequence of events
- Eliciting: information, directions, service, clarification, help, permission.
- Directing: ordering, instructing, persuading, advising, warning
- Reporting: description, comment, decisions.
There are several formats that can be used to test speaking ability. They are interview, interaction
with peers, and response to tape-recordings. Each format has some techniques.
Interview It is the most obvious format for testing speaking. Its techniques include questions and requests for information (yes/no questions should generally be avoided) and pictures: pictures can be used to elicit descriptions, and series of pictures (or video sequences) form a natural basis for narration.
Interaction with peers Two or more test-takers may be asked to discuss a topic, make plans, and so on.
Role play Students can be asked to assume a role in a particular situation, with the tester acting as an observer.
Response to tape-recordings Uniformity of elicitation procedures can be achieved by presenting all students with the same audio (or video) recordings. One such technique is imitation, in which the test-takers hear a series of sentences, each of which they have to repeat in turn.
Scoring will be valid and reliable only if clearly recognizable and appropriate descriptions of criteria
levels are written and scorers are trained to use them. Descriptions of speaking proficiency usually deal with accent, grammar, vocabulary, fluency, and comprehension, as in the following examples taken from Hughes (1989).
Proficiency Description
Accent
1. Pronunciation frequently unintelligible
2. Frequent gross errors and a very heavy accent make understanding difficult and require frequent repetition.
3. Foreign accent requires concentrated listening, and mispronunciations lead to occasional
misunderstanding and apparent errors in grammar or vocabulary.
4. Marked foreign accent and occasional mispronunciations which do not interfere with
understanding.
5. No conspicuous mispronunciation, but would not be taken for a native speaker.
6. Native pronunciation, with no trace of foreign accent.
Grammar
1. Grammar almost entirely inaccurate, except in stock phrases.
2. Constant errors showing control of very few major patterns and frequently preventing
communication
3. Frequent errors showing some major patterns uncontrolled and causing occasional irritation
and misunderstanding
4. Occasional error showing imperfect control of some patterns but no weakness that cause
misunderstanding.
5. Few errors, with no patterns of failure.
6. No more than two errors during the interview.
Vocabulary
1. Vocabulary inadequate for even the simplest conversation.
2. Vocabulary limited to basic personal and survival areas (time, food, transportation, family, etc.)
3. Choice of words sometimes inaccurate, limitations of vocabulary prevent discussion of some
common professional and social topics
4. Professional vocabulary adequate to discuss special interests; general vocabulary permits
discussion of any non-technical subject with some circumlocutions.
5. Professional vocabulary broad and precise; general vocabulary adequate to cope with complex
practical problems and varied situations.
6. Vocabulary apparently as accurate and extensive as that of an educated native speaker.
Fluency
1. Speech is so halting and fragmentary that conversation is virtually impossible.
2. Speech is very slow and uneven except for short or routine sentences.
3. Speech is frequently hesitant and jerky; sentences may be left uncompleted
4. Speech is occasionally hesitant, with some unevenness caused by rephrasing and groping for
words.
5. Speech is effortless and smooth, but perceptibly non-native in speed and evenness.
6. Speech on all professional and general topics as effortless and smooth as a native speaker’s.
Comprehension
1. Understands too little for the simplest type of conversation.
2. Understands only slow, very simple speech on common social and touristic topics; requires
constant repetition and rephrasing.
3. Understands careful, somewhat simplified speech when engaged in a dialogue, but may
require considerable repetition and rephrasing.
4. Understands quite well normal educated speech when engaged in a dialogue, but requires
occasional repetition or rephrasing.
5. Understands everything in normal educated conversation except for very colloquial or low-
frequency items, or exceptionally rapid or slurred speech.
6. Understands everything in both formal and colloquial speech to be expected of an educated
native speaker.
Besides using clear descriptions of criteria levels, the use of more than one scorer will
decrease the subjectivity as described earlier. If two testers are involved in an interview, then they can
independently assess each test-taker. If they disagree, even after discussion, then a third scorer may
be referred to.
Testing Reading
Like listening, reading is a receptive skill. The task of the language tester, then, is to set reading tasks which will result in behavior that demonstrates their successful completion.
The reading macro-skills (directly related to course objectives) are scanning text to locate
specific information, skimming text to obtain general idea, identifying stages of argument, and
identifying examples presented in support of an argument. The micro-skills underlying reading are
identifying referents of pronouns, using context to guess meaning of unfamiliar words, and
understanding relations between parts of text.
The reading texts can be taken from textbooks, novels, newspapers, magazines, academic journals, letters, timetables, etc. The texts can take the form of newspaper reports, advertisements, editorials, etc.
The techniques that might be used to test reading skills are multiple choice, true/false,
completion, short answer, guided short answer, summary cloze, information transfer, identifying order of events, identifying referents, and guessing the meaning of unfamiliar words from context.
Multiple Choice The test-takers provide evidence of successful reading by making a mark against one out of a number of alternatives. Its strengths and weaknesses have been presented earlier.
True/false The test-takers should respond to a statement by choosing one of the two choices,
true or false.
Completion The students are required to complete a sentence with a single word, for example:
……………was the man responsible for the first steam railway.
Short answer It is in the form of questions and requires the students to answer briefly, for
example:
According to the author, what does the increase in divorce rates show about people’s expectations of
marriage?
Guided short answer This is the alternative of short answer in which students are guided to
have the intended answer. They have to complete sentences presented to them, for example:
Complete the following based on the fourth paragraph!
‘Many universities in Europe used to insist that their students speak and write only ………………… Now many of them accept ………………….. as an alternative, but not a ………………. of the two.’
Summary cloze A reading passage is summarized by the tester, and then gaps are left in the
summary for completion by the test-takers. This is really the extension of the guided short
answer.
Information transfer One way to minimize demands on writing by test-takers is to require them
to show successful completion of a reading task by supplying simple information in a table,
following a route on a map, labeling a picture, and so on.
Identifying order of events, topics, or arguments The test-takers can be required to number the
events etc.
Identifying referents One of the ‘microskills’ listed previously was the ability to identify referents.
An example of an item to test this is:
What does the word ‘it’ (line 25) refer to? ……………………
Guessing the meaning of unfamiliar words from context This is another microskill mentioned
earlier. Items may take the form:
Find a single word in the passage (between lines 1 and 25) which has the same meaning as ‘making of
laws’.
The above techniques are among the many techniques for testing reading. In scoring a reading test, Hughes (1989) suggests that errors of grammar, spelling, or punctuation should not be penalized, as long as it is clear that the test-taker has successfully performed the reading task which the item set. The function of a reading test is to test reading ability; testing productive skills at the same time simply makes the measurement of reading ability less accurate.
Testing Writing
The best way to test people’s writing ability is to get them to write directly; indirect tests of writing have proved difficult to construct accurately, even for professional testing institutions. There are three things that we should consider in developing a good test of writing:
2. The tasks should elicit samples of writing which truly represent the students’ ability.
There are at least two things we can do to obtain samples that properly represent each student’s ability. The first is to set as many tasks as is feasible, because students’ performance on the same task is not consistent and they are sometimes better at some tasks than at others. Giving many different tasks therefore enables the test developer to see the students’ performance more objectively.
The second way to elicit students’ writing ability accurately is to test only writing ability. One ability that at times interferes with the accurate measurement of writing is reading. It is acceptable to expect students to be able to read simple instructions, but asking them to read very difficult and long instructions in a writing test should be avoided, since this prevents the students from performing adequately on the writing task. One way to reduce dependence on reading ability is to make use of illustrations in the form of diagrams, series of pictures, or graphs.
A method of scoring that assigns a single overall score to a piece of writing is said to be holistic, as in the following example.
Holistic Scoring:
5 The main idea is stated very clearly, and there is a clear statement of change of opinion. The essay
is well organized and coherent. The choice of vocabulary is excellent. There are no major or minor
grammatical errors. Spelling and punctuation are fine.
4 The main idea is fairly clear, and change of opinion is evident. The essay is moderately well
organized and is relatively coherent. The vocabulary is good, and there are only minor grammatical
errors. There are few spelling and punctuation errors.
3 The main idea and a change of opinion are indicated but not so clearly. The essay is not well
organized and is somewhat lacking in coherence. The vocabulary is fair, and there are some major
and minor grammatical errors. There are a fair number of spelling and punctuation errors.
2 The main idea and change of opinion are hard to identify in the essay. The essay is poorly
organized and relatively incoherent. The use of vocabulary is weak, and grammatical errors appear
frequently. Spelling and punctuation errors are frequent.
1 The main idea and change of opinion are absent in the essay. The essay is poorly organized and
generally incoherent. The use of vocabulary is very weak, and grammatical errors appear very
frequently. Spelling and punctuation errors are very frequent.
A method of scoring which requires a separate score for each of a number of aspects of a writing task is said to be analytic. The following is an example of analytic scoring provided by Cohen (1994:328-329).
Analytic Scoring:
Content
5 – Excellent : main ideas stated clearly and accurately, change of opinion very clear
4 – Good : main ideas stated fairly clearly and accurately, change of opinion relatively clear
3 – Average : main ideas somewhat unclear and inaccurate, change of opinion somewhat weak
2 – Poor : main ideas not clear or accurate, change of opinion weak
1 – Very Poor : main ideas not at all clear or accurate, change of opinion very weak
Organization
5 – Excellent : well organized and perfectly coherent
4 – Good : fairly well organized and generally coherent
3 – Average : loosely organized but main ideas clear, logical but incomplete sequencing
2 – Poor : ideas disconnected, lacks logical sequencing
1 – Very poor : no organization, incoherent
Vocabulary
5 – Excellent : very effective choice of words and use of idioms and word forms
4 – Good : effective choice of words and use of idioms and word forms
3 – Average : adequate choice of words but some misuse of vocabulary, idioms and word forms
2 – Poor : limited range, confused use of words, idioms, and word forms
1 – Very Poor : very limited range, very poor knowledge of words, idioms, and word forms
Grammar
5 – Excellent : no errors, full control of complex structure
4 – Good : almost no errors, good control of structure
3 – Average : some errors, fair control of structure
2 – Poor : many errors, poor control of structure
1 – Very Poor : dominated by errors, no control of structure
Mechanics
5 – Excellent : mastery of spelling and punctuation
4 – Good : few errors in spelling and punctuation
3 – Average : fair number of spelling and punctuation errors
2 – Poor : frequent errors in spelling and punctuation
1 – Very poor : no control over spelling and punctuation
The choice between holistic and analytic scoring depends on the purpose of testing (Hughes,
1989:97). If diagnostic information is required, then analytic scoring is essential. If the scoring is carried
out by a small group of people, then holistic scoring may be appropriate. Analytic scoring is used when
scoring is conducted by heterogeneous, less well-trained people or in a number of different places.
However, whichever is used, multiple scoring involving two or more scorers is suggested.
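The arithmetic of combining analytic subscores and multiple scorers can be sketched in Python. The equal weighting of the five categories and the two raters’ scores below are illustrative assumptions, not a prescribed scheme:

# Combine analytic subscores (1-5 per category) into a single mark.
CATEGORIES = ("content", "organization", "vocabulary", "grammar", "mechanics")

def analytic_total(subscores):
    # subscores: dict mapping each category to a rating from 1 to 5;
    # with equal weights the total is simply their sum, out of 25.
    return sum(subscores[c] for c in CATEGORIES)

# Two raters score the same essay; multiple scoring averages their totals.
rater_1 = {"content": 4, "organization": 3, "vocabulary": 4, "grammar": 3, "mechanics": 5}
rater_2 = {"content": 4, "organization": 4, "vocabulary": 3, "grammar": 3, "mechanics": 4}
final_score = (analytic_total(rater_1) + analytic_total(rater_2)) / 2
print(final_score)  # 18.5 out of 25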
Testing Grammar
The place of grammar in language teaching is sometimes debated. Some hold that control of grammatical structure is the core of language ability, so that it would be unthinkable not to test it. For that reason, and because large numbers of grammar items can be administered and scored within a short period of time, most proficiency tests include a grammar section.
In contrast, others argue that mastery of the language skills cannot be accurately predicted by measuring control of the grammatical structures believed to underlie them. Besides, the backwash effect of a grammar test may encourage the learning of grammatical structures in isolation, with no apparent need to use them. Considerations of this kind have resulted in the absence of a grammar component from some well-known proficiency tests.
However, whether or not grammar has an important place in an institution’s teaching, it has to be accepted that grammatical ability has an important influence on performance. Successful academic writing, for example, must depend to some extent on command of elementary grammatical structures. Therefore, there is still room for a grammar component in a language test.
The specification of grammar test should be in line with the teaching syllabus if the syllabus
lists the grammatical structures to be taught. Where there is no such list, the structures to be tested must be inferred from textbooks or other teaching materials.
There are several techniques that can be used to test grammar. Multiple choice is one alternative, but it is not recommended because of the difficulty of finding appropriate distractors. The other proposed techniques are paraphrase, completion, and modified cloze.
Paraphrase This technique requires the students to write a sentence equivalent in meaning to
one that is given. It is helpful to give part of the paraphrase in order to restrict the students to
the grammatical structures being tested. An example testing the past continuous passive form would be:
When we arrived, a policeman was questioning the bank clerk.
When we arrived, the bank clerk ……………………………..
Completion This technique can be used to test variety of structures. The following is an
example of testing interrogative forms:
In the following conversation, some sentences have been left incomplete. Complete them
suitably. Read the whole conversation before you begin to answer the question.
(Mr. Cole wants a job in Mr. Gilbert’s export business. He has come for an interview.)
Mr. Gilbert: Good morning, Mr. Cole. Please come in and sit down. Now let me see. (1)
Which school ……………………………………………………….?
Mr. Cole: Whitestone College
Mr Gilbert: (2) And when …………………………………………………………...?
Mr. Cole: In 1999, at the end of the summer term.
Mr. Gilbert: (3) And since then, what ……………………………………………….?
Mr. Cole: I worked in a bank for a year. Then I took my present job, selling cars. But I
would like a change now.
Mr. Gilbert: (4) Well, what sort of a job ……………………………………………?
Mr. Cole: I’d really like to work in your Export Department.
Modified cloze This technique presents a passage in which selected grammatical items, for example articles, have been deleted for the students to supply.
In the scoring process, the scorer should only score what the item is testing, not something
else. For instance, when the focus is to test pronouns, an error on a missing third person -s should not
be penalized. Finally, for valid and reliable scoring of grammar items, careful preparation of the scoring
key is necessary.
Testing Vocabulary
The debate on testing vocabulary parallels the debate on testing grammar. Clearly, knowledge of
vocabulary is essential to the development and demonstration of linguistic skills. But according to some
people, that does not mean that it should be tested separately.
On the other hand, some argue that time should be devoted to the regular, conscious teaching of vocabulary. Thus, it is important to test vocabulary in an achievement test after teaching.
The specification for a vocabulary achievement test should be based on the items presented to the students in vocabulary teaching. For a placement test, the vocabulary being tested should refer to one of the common published word lists.
Testing vocabulary productively is difficult, so information on receptive ability is usually regarded as sufficient. The following techniques are suggested for possible use in achievement tests.
Pictures The use of pictures can limit the students to lexical items that we have in mind. Some
pictures are provided and the students are required to write down the names of the objects.
This method of testing vocabulary is obviously restricted to concrete nouns which can be
drawn.
Definitions This may work for a range of lexical items. The following is an example of such test.
A …… is a person who looks after our teeth.
……… is frozen water.
……… is the second month of the year.
But not all items can be identified using a definition. Nor can all words be defined entirely in
words more common or simpler than themselves.
Gap filling This can take the form of one or more sentences with a single word missing.
Because of the snow, the football match was ….. until the following week.
I ….. to have to tell you this, Mrs. Jones, but your husband has had an accident
To avoid a variety of possible answers, the first letter of the word, or even an indication of the number of letters, can be given.
Testing Pronunciation
Heaton (1990) includes pronunciation in the testing of speaking skill. There are at least three techniques for testing pronunciation: pronouncing words in isolation, pronouncing words in sentences, and reading aloud.
Pronouncing words in isolation The importance of listening in almost all tests of speaking,
especially those of pronunciation, should never be underestimated. It is impossible for students to pronounce a word correctly unless they first hear and recognize its precise sounds. In the early stages of learning English, it is useful to base our pronunciation tests on
minimal pairs: that is, pairs of words which differ only in one sound, for example:
bud bird ferry fairy
nip nib boss bus
pill pail knit lit
ball bowl fry fly
sheet seat sport support
Pictures can also be used to test the students’ pronunciation. The students can be shown pictures and asked to identify the object in each picture, where each picture is based on a possible source of confusion. For example, a picture of a ship can be used to test whether the students distinguish between sheep and ship.
Pronouncing words in sentences Students can also be asked to read aloud sentences
containing the problematic sounds which we want to test. Sentences are, of course, preferable
because they provide a context for the sounds (as in real life). For example:
There were several people standing in the hole. (hole/hall)
Are you going to sail your boat today? (sail/sell)
Do you like this sport? (sport/spot)
Reading aloud Reading aloud can offer a useful way of testing pronunciation provided that we
give a student a few minutes to look at the reading text first. When choosing suitable texts to be read aloud, it is useful to imagine actual situations in which someone might read something aloud; for example, people read aloud news on TV, letters, or instructions.
Discussion
1. Using a cloze test passage, complete it and say what you think each item is testing.
2. Discuss when and how multiple choice tests can be used appropriately in an English classroom.
3. What advantages can we gain in testing language proficiency by using dictation?
4. Design a test that requires the test-takers to draw (or complete) a simple picture after listening to an instruction!
5. Can reading aloud be included as one technique for testing reading ability?
6. Do you think grammar should be tested separately?
7. What do you think is the best way of testing writing ability?
CHAPTER V
DESIGNING CLASSROOM TESTS AND STANDARDIZED TESTS
This chapter provides the teachers with step-by-step procedures in designing classroom tests
and standardized tests. Most of the explanation is summarized from Brown (2004).
Table 5.1 Example of Selected Objectives for a unit in a low-intermediate integrated-skills course
(Brown, 2004:50)
In reviewing the objectives of a unit, we cannot possibly test each one. We then need to choose a subset of the objectives to test.
Now we are ready to draft the test items. To provide a sense of authenticity and interest, we base the items on the context of a recent TV sitcom that we have used in class. The sitcom depicted a loud, noisy party with lots of small talk. The following are sample test items for each section.
Writing
Directions: Write a paragraph about what you liked or didn’t like about one of the
characters at the party in the TV sitcom we saw.
However, the above test items are quite traditional. It should be admitted that the format of some of the items is unnatural, which lowers their authenticity. Therefore, the draft items need to be revised.
In revising our draft, we need to ask some important questions:
1. Are the directions to each section absolutely clear?
2. Is there an example item for each section?
3. Does each item measure a specified objective?
4. Is each item stated in clear, simple language?
5. Does each multiple choice item have appropriate distractors?
6. Is the difficulty of each item appropriate for your students?
7. Is the language of each item sufficiently authentic?
8. Does the sum of the items and the test as a whole adequately reflect the learning objectives?
In the current example that we have been analyzing, our revising process is likely to result in at
least four changes or additions:
1. In both the interview and writing sections, we recognize that a scoring rubric will be essential. For the interview we decide to create a holistic scale, and for the writing section we devise a simple analytic scale that captures only the objectives we have focused on (see the previous chapter).
2. In the interview questions, we realize that follow-up questions may be needed for students who give one-word or very short answers.
3. In the listening section, we intended choice “c” as the correct answer, but we realize that choice “d” is also acceptable. We need an answer that is unambiguously incorrect, so we shorten choice “d” to “Around eleven o’clock.” We also note that providing the prompts for this section on an audio recording will be logistically difficult, and so we opt to read these items to our students.
4. In the writing prompt, we can see that some students might not use the words so or because, which were in our objectives, so we re-word the prompt: “Name one of the characters at the party in the TV sitcom we saw. Then use the word so at least once and the word because at least once to tell why you liked or didn’t like that person.”
Ideally, we would try out all our tests on students not in our class before actually administering them. But in daily classroom teaching, the tryout phase is almost impossible; alternatively, we can enlist the aid of a colleague to look over our test.
In the final revision of the test, imagine that you are a student taking it. Go through each set of directions and all items slowly and deliberately, timing yourself. If the test should be shortened or lengthened, make the necessary adjustments, and be sure everything is in order.
B. Standardized Tests
A standardized test presupposes certain standard objectives, or criteria, that are held constant across one form of the test to another. The criteria in large-scale standardized tests are designed to apply to a broad band of competencies that are usually not exclusive to one particular curriculum. A good standardized test is the product of a thorough process of empirical research and development. It dictates standard procedures for administration and scoring. And finally, it is typically a norm-referenced test, the goal of which is to place test-takers on a continuum across a range of scores and to differentiate them by their relative ranking.
Many people are under the incorrect impression that all standardized tests consist of items presented in multiple-choice format. While it is true that many standardized tests use a multiple-choice format for the sake of objective scoring, multiple choice is not a prerequisite characteristic of standardized tests. Human-scored components are also found in standardized tests, as in the Test of Spoken English (TSE) and the Test of Written English (TWE) produced by Educational Testing Service (ETS).
Standardized tests have both advantages and disadvantages. The advantages of standardized testing include a ready-made, previously validated product that frees the teacher from having to spend hours creating a test. Administration to large groups can be accomplished within reasonable time limits, and in the case of multiple-choice formats, the scoring procedures are simple.
The disadvantages of standardized tests center largely on their inappropriate use, for example, using an overall proficiency test as an achievement test simply because of the convenience of standardization. Teachers should therefore be careful when using standardized tests.
Discussion
1. Following the steps of developing classroom tests, can you construct your own English test for the first grade of junior high school? Do it in groups!
2. In pairs or in small groups, compile a brief list of pros and cons of standardized testing!
3. Tell the class about the worst test experience you’ve ever had. Briefly analyze what made the experience so unbearable.
CHAPTER VI
DESCRIBING, ANALYZING, AND INTERPRETING
TEST SCORES
This chapter deals with three things a teacher should do after test administration: describing, analyzing, and interpreting test results.
Central Tendency
Central tendency describes the most typical behavior of a group. Four statistics are used for estimating central tendency: the mean, the mode, the median, and the midpoint.
1. Mean
The mean is probably the single most important indicator of central tendency. The mean is virtually the same as the average. It is symbolized by $\bar{X}$ (read “X bar”) and is the sum of all the scores divided by the number of scores. Thus the mean of 14, 34, 56, and 68 is (14 + 34 + 56 + 68)/4 = 43. The formula is:

$\bar{X} = \dfrac{\sum X}{N}$

where
$\bar{X}$ = the mean
$X$ = a score
$N$ = the number of scores
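A minimal sketch in Python, using the illustrative scores from the worked example above:

```python
# A minimal sketch of the mean: sum of all scores divided by N.
scores = [14, 34, 56, 68]  # illustrative scores from the example above

mean = sum(scores) / len(scores)
print(mean)  # 43.0, matching the worked example
```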
2. Mode
Another indicator of central tendency is the mode. The mode is the score that occurs most frequently. When the students’ scores are 77, 75, 72, 72, 70, 70, 69, 69, 69, 69, 69, 68, 68, 67, 64, 64, 61, the mode is 69 because it occurs five times. No statistical formula is necessary for this straightforward idea.
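Although no formula is needed, a frequency count does the same job. A minimal Python sketch using the scores above:

```python
from collections import Counter

# Illustrative scores from the example above.
scores = [77, 75, 72, 72, 70, 70, 69, 69, 69, 69, 69,
          68, 68, 67, 64, 64, 61]

# most_common(1) returns the most frequent score and how often it occurs.
mode, count = Counter(scores).most_common(1)[0]
print(mode, count)  # 69 5
```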
3. Median
The median is the point below which 50% of the scores fall and above which 50% fall. Thus in the set of scores 100, 95, 83, 71, 61, 57, 30, the median is 71 because 71 has three scores above it (100, 95, and 83) and three scores below it (61, 57, and 30).
In real data, cases arise that are not so clear. For example, what is the median of these scores: 9, 12, 15, 16, 17, 27? When there is an even number of scores, the median is taken to be midway between the two middle scores. In this case, 15 and 16 are the two middle scores, so the median is 15.5.
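A minimal Python sketch covering both cases, odd and even numbers of scores:

```python
def median(scores):
    """Point below which half of the scores fall: the middle score, or,
    with an even number of scores, midway between the two middle ones."""
    s = sorted(scores)
    n = len(s)
    if n % 2 == 1:
        return s[n // 2]                     # odd: the single middle score
    return (s[n // 2 - 1] + s[n // 2]) / 2   # even: midway between the middle pair

print(median([100, 95, 83, 71, 61, 57, 30]))  # 71
print(median([9, 12, 15, 16, 17, 27]))        # 15.5
```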
4. Midpoint
The midpoint in a set of scores is the point halfway between the highest score and the lowest score on the test. The formula for calculating the midpoint is:

$\text{Midpoint} = \dfrac{\text{High} + \text{Low}}{2}$

For example, if the lowest score on a test was 30 and the highest was 100, the midpoint would be (100 + 30)/2 = 65.
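The same calculation as a minimal Python sketch:

```python
# Midpoint: halfway between the highest and the lowest score.
scores = [100, 95, 83, 71, 61, 57, 30]  # illustrative scores

midpoint = (max(scores) + min(scores)) / 2
print(midpoint)  # 65.0, matching the worked example
```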
Dispersion
Dispersion describes how individual performances vary from the central tendency. Three indicators of dispersion are commonly used for describing distributions of test scores: the range, the standard deviation, and the variance.
1. Range
Most teachers are already familiar with the concept of range from tests they have given in
class. The range is the number of points between the highest score on a measure and the lowest score, plus one (one is added so that the range includes the scores at both ends). Thus, when the highest score is 77 and the lowest score is 61, the range is 17 points (77 – 61 + 1 = 17). The range provides some idea of how individuals vary from the central tendency.
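A minimal Python sketch, reusing the score list from the mode example:

```python
# Range: highest minus lowest, plus one to include both endpoints.
scores = [77, 75, 72, 72, 70, 70, 69, 69, 69, 69, 69,
          68, 68, 67, 64, 64, 61]

score_range = max(scores) - min(scores) + 1
print(score_range)  # 17
```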
2. Standard Deviation
The standard deviation is a sort of average of the differences of all scores from the mean (Brown, 1989:107). This is not an exact statistical definition, but it serves well to convey the meaning of the statistic. The formula is as follows:

$S = \sqrt{\dfrac{\sum (X - \bar{X})^2}{N}}$

where
$S$ = the standard deviation
$X$ = a score
$\bar{X}$ = the mean
$N$ = the number of scores
The standard deviation is a very flexible and useful statistic because it is a good indicator of the dispersion of a set of scores around the mean. It is usually more informative than the range because it results from averaging over all scores rather than relying only on the two extremes.
Sometimes a slightly different formula is used for the standard deviation:

$S = \sqrt{\dfrac{\sum (X - \bar{X})^2}{N - 1}}$

This version (called the “N – 1” formula) is appropriate only if the number of students taking the test is less than 30.
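A minimal Python sketch covering both versions of the formula:

```python
import math

def standard_deviation(scores, n_minus_1=False):
    """Square root of the average squared difference from the mean.
    With n_minus_1=True, divides by N - 1 instead of N."""
    n = len(scores)
    mean = sum(scores) / n
    squared_diffs = sum((x - mean) ** 2 for x in scores)
    return math.sqrt(squared_diffs / (n - 1 if n_minus_1 else n))

scores = [14, 34, 56, 68]  # illustrative scores
print(round(standard_deviation(scores), 2))                  # 20.71 (N formula)
print(round(standard_deviation(scores, n_minus_1=True), 2))  # 23.92 (N - 1 formula)
```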
3. Variance
The variance is another descriptive statistic for dispersion. As indicated by its symbol, $S^2$, the test variance is equal to the squared value of the standard deviation, so its formula is the same as that for the standard deviation except that no square root is taken:

$S^2 = \dfrac{\sum (X - \bar{X})^2}{N}$

Hence, variance can easily be defined, with reference to this formula, as the average of the squared differences of the students’ scores from the mean.
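A minimal Python sketch, continuing the same illustrative scores:

```python
# Variance: the average squared difference from the mean
# (the standard deviation formula without the square root).
scores = [14, 34, 56, 68]
mean = sum(scores) / len(scores)

variance = sum((x - mean) ** 2 for x in scores) / len(scores)
print(variance)  # 429.0, the square of the standard deviation (about 20.71)
```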
After describing the distribution of scores, the teacher can analyze the individual test items. The first index is item facility (IF), which is calculated by dividing the number of students who answered an item correctly by the total number of students who took the test:

$IF = \dfrac{\text{number of correct answers}}{\text{number of students}}$

The result of this formula is an item facility value that can range from 0.00 to 1.00 for different items. Teachers can interpret this value as the percentage of correct answers for a given item (by moving the decimal point two places to the right). For example, the correct interpretation of an index of 0.27 is that 27% of the students answered the item correctly. In most cases, an item with an IF of 0.27 would be a very difficult question because many more students missed it than answered it correctly.
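A minimal Python sketch, using hypothetical 1/0 responses for a single item (the data below are illustrative, not taken from the book's tables):

```python
# Item facility for one item: proportion of correct (1) answers.
# Hypothetical responses from eleven students (1 = correct, 0 = wrong).
responses = [1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0]

item_facility = sum(responses) / len(responses)
print(round(item_facility, 2))  # 0.27: about 27% answered correctly, a difficult item
```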
Arranging the data in a matrix, as shown in Table 6.1 below, is very helpful. The actual responses are recorded with a 1 for each correct answer and a 0 for a wrong answer. Notice that student A answered the first item correctly; indeed, so did everyone else except poor student J. This item must have been very easy.
Table 6.1. Students’ responses (1 = correct, 0 = wrong)

Student  Item: 1  2  3  4  5  6  7  8  9  10   Score (%)
D              1  1  0  1  1  1  0  1  0  0       60
E              1  1  1  1  0  0  1  1  0  0       60
F              1  0  1  1  1  0  0  0  1  1       60
G              1  1  0  1  0  0  1  1  0  0       50
H              1  0  0  1  1  1  0  0  0  1       50
I              1  0  0  1  0  0  1  0  1  0       40
J              0  1  0  1  0  0  0  0  1  1       40
Once the data are sorted into groups of students, calculation of the discrimination index is easy. To do this, calculate the IF of the upper and lower groups separately for each item. Then, to obtain the ID index, subtract the IF for the lower group from the IF for the upper group on each item, as follows:

ID = IF upper – IF lower

For example, in Table 6.2, the IF for the upper group on item 8 is 1.00 because everyone in that group answered it correctly. At the same time, the IF for the lower group on that item is 0.00 because everyone in the lower group answered it incorrectly. The calculation of item discrimination for item 8 therefore gives 1.00 (1.00 – 0.00 = 1.00). An item discrimination index of 1.00 is very good because it indicates the maximum contrast between the upper and the lower groups of students. See Table 6.3 for the ID of each item.
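A minimal Python sketch of the item 8 example, with hypothetical group responses (the book's Table 6.2 is not reproduced here):

```python
def item_facility(responses):
    """Proportion of correct (1) answers in a group's responses."""
    return sum(responses) / len(responses)

# Hypothetical responses to item 8 (1 = correct, 0 = wrong).
upper_group = [1, 1, 1]  # everyone in the upper group answered correctly
lower_group = [0, 0, 0]  # everyone in the lower group answered incorrectly

item_discrimination = item_facility(upper_group) - item_facility(lower_group)
print(item_discrimination)  # 1.0: maximum contrast between the two groups
```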
The calculations for ID tell us about the quality of each item and help us decide whether an item should be revised in the process of developing a good test. The ideal item in a test has an IF of around 0.50 and the highest available ID. Ebel, as quoted by Brown (1989:70), has suggested the following guidelines for making decisions based on ID:
0.40 and above: very good items
0.30 to 0.39: reasonably good items, but possibly subject to improvement
0.20 to 0.29: marginal items, usually needing and subject to improvement
0.19 and below: poor items, to be rejected or improved by revision
A further step is distractor efficiency analysis, which examines how well the incorrect options (distractors) of each item perform; Table 6.4 displays this information. Notice that the table also provides the same item facility and discrimination indexes that were previously shown in Table 6.3. In addition, Table 6.4 gives the proportion of students in the high, middle, and low groups who chose each of the options. For example, in item 1 the correct answer is option a (indicated by an asterisk), and options b, c, and d are the distractors. One student from the low group chose option b, while no one chose option c or d. This indicates that options c and d are not good distractors because they do not attract any students; therefore, options c and d need to be revised.
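A minimal Python sketch of such a tally, using hypothetical option choices for item 1 consistent with the description above (the actual Table 6.4 data are not reproduced here):

```python
from collections import Counter

# Hypothetical choices for item 1 by group; "a" is the keyed answer,
# "b", "c", and "d" are the distractors.
choices = {
    "high":   ["a", "a", "a"],
    "middle": ["a", "a", "a", "a"],
    "low":    ["a", "a", "b"],
}

# Tally how often each option was chosen in each group.
for group, picks in choices.items():
    print(group, dict(Counter(picks)))
# Distractors chosen by no one ("c" and "d" here) are candidates for revision.
```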
Discussion
1. How would you define central tendency? What are the four ways to estimate it?
2. What is dispersion? Which of the three indices for dispersion are most often reported?
3. Why should a teacher describe the students’ behavior on a measure in terms of both central
tendency and dispersion?
4. What is the item facility index? How do you calculate it? How do you interpret the results of your calculations?
5. What is the item discrimination index? How do you calculate it? How do you interpret the results of your calculations?
6. What is distractor efficiency analysis? How do you do it? What can you learn from it in terms of improving your test?
7. Draw an ideal normal distribution. Start by drawing two lines: an ordinate and an abscissa. Then mark off a reasonable set of scores along the abscissa and some sort of frequency scale along the ordinate. Make sure that you represent the mean, mode, median, and midpoint with a vertical line down the middle of the distribution. Also include six lines to represent the three standard deviations above and below the mean.
REFERENCES
Allison, D. 1999. Language Testing: An Introductory Course. Singapore: Singapore University Press
Brown, H. D. 2001. Teaching by Principles: An Interactive Approach to Language Pedagogy. White
Plains, NY: Addison Wesley Longman
Brown, H. D. 2004. Language Assessment: Principles and Classroom Practices. White Plains, NY:
Pearson Education
Brown, J. D. 1996. Testing in Language Programs. New Jersey: Prentice Hall
Cohen, A. D. 1994. Assessing Language Ability in the Classroom. Boston: Heinle & Heinle Publishers
Djiwandono, M. S. 2008. Tes Bahasa. Jakarta: Indeks
Heaton, J. B. 1988. Writing English Language Tests. New York: Longman
Heaton, J. B. 1990. Classroom Testing. New York: Longman
Hughes, A. 1989. Testing for Language Teachers. Cambridge: Cambridge University Press
Johnson, K. 2001. An Introduction to Foreign Language Learning and Teaching. Essex: Pearson Education
Weir, C. 1993. Understanding and Developing Language Tests. Hertfordshire: Prentice Hall International (UK) Ltd.