
A STUDY ON AN ACHIEVEMENT LISTENING TEST DESIGN

FOR ESPD’S STUDENTS


Bui Ngoc Lien

Context
This test is a final achievement test designed to measure the knowledge and academic
achievement of students in listening skills at the Foundation Studies Department of Hanoi
University. They are intermediate learners who have just completed the first semester of
their first academic year and are required to sit an exam. In their first semester, they
studied a listening course which consists of two sub-components, namely Conversation
Listening and Dictation Listening. In Conversation Listening, special focus is put on
developing learners' listening skills and strategies so that they can comprehend daily-life
conversations and talks made by native speakers with various accents. Meanwhile, Dictation
Listening puts more emphasis on promoting students' capacity to listen to and perceive the
accurate written forms of oral sounds, words and simple sentences so that they can
confidently dictate (write down exactly what they hear) from speakers. Therefore, the test
designers have chosen listening material and designed listening tasks which match the
description and objectives of the course as well as the students' level.

I. Literature review
Testing listening skills
According to Heaton (1988), testing the four basic skills plays a significant role because it
evaluates students' language skills for purposes of selection or comparison. In addition, he
claims that testing provides feedback about the results of education for teachers as well as
for learners, which can bring some backwash to the teaching and learning process. Since
listening is a receptive skill, listening tests typically resemble reading tests, except that
students listen to a text instead of reading it; therefore, they have few chances to look back
at the information. In general, a listening test includes three basic elements: the listening
stimuli, which represent typical oral language; the question and response formats; and the
testing environment, which should be free of external distraction (Rubin & Mead, 1984).

1. Kinds of tests/testing


Achievement Test
Madsen (1983) defines an achievement test as a test that measures mastery of the language
skills taught during a course. In other words, the test measures whether the objectives of
the course have been achieved. Heaton (1988) subdivides this type of test into progress
achievement tests and final achievement tests.

Final Achievement Test


A final achievement test is usually administered at the end of a unit or a course in order to
assess acquired knowledge and skills. It serves the purpose of giving a grade or making a
judgment about the students' achievement after the course. Hughes (1989) proposed two
approaches to final achievement tests: the syllabus-content approach and the course-objectives
approach. The former is based on the course syllabus, the course textbook and other
supplementary materials used during the course. However, if a syllabus is badly designed, the
latter is recommended as an alternative because it increases the accountability of course
designers and exerts beneficial backwash on teaching. On the other hand, under this approach
students might suffer the risk of unfairness if the course itself is poorly designed.

Direct test
According to Hughes (1989), a direct test requires examinees to "perform precisely the skill
which we wish to measure" (p. 15). This seems attractive, even though there may be
reliability problems when testing productive skills such as writing and speaking, because
such a test is limited to a rather small sample of tasks and, as a result, cannot be fully
representative. However, if the test designers have a clear awareness of the abilities they
want to test, direct testing is straightforward and leads to helpful backwash (Hughes, 1989).

Criterion-referenced testing


Criterion-referenced testing assesses students according to whether they are able to perform
a task or a set of tasks satisfactorily at a given level (Heaton, 1988). In line with this,
Hughes (1989) states that students' performance is measured in relation to meaningful
criteria, which is believed to motivate students to achieve that standard. Therefore, it is
likely to provide students with beneficial washback.

Objective testing
In terms of scoring, objective testing requires no judgment from the scorer. The reason is
that an objective test has only one correct answer, or a limited set of correct answers, so
no matter which teacher marks it, a candidate's score will stay the same (Heaton, 1988).
Heaton also notes a weakness of objective testing: preparation and test design are
time-consuming. Despite criticisms that the answers to objective tests may be easy to guess,
Heaton argues that the fact that such items "look easier is no indication that they are
easier" (p. 26).

2. Quality of a good test


Test validity
Validity is commonly defined as the extent to which a test measures precisely what it aims
to measure (Hughes, 1989, p. 26). Without validity, test results become useless or
misleading, because validity ensures the "meaningfulness, interpretability, and fairness of
assessments" (McNamara, 2000, p. 50). If a test is valid, an individual's score is a true
reflection of that individual's skill. It was not until Messick (1989) that validity was
re-conceptualized, with "empirical evidence" added as a new aspect, so that validity has
since been understood as the meaning of the test scores. The following types of validity are
distinguished:

Content validity
Content validity is the extent to which the items constitute a representative sample
covering the most appropriate and necessary content essential for a good performance
(Hughes, 1989). To establish and judge content validity, Heaton (1988) points out that it is
important to have a specification in which the particular skills and structures to be tested
are clearly written.
Construct validity
Construct validity is the most important type of validity (Cronbach, 1990). It investigates
whether the test measures exactly and adequately the abilities and skills it aims to measure.
Face validity
A test is said to have face validity "if it looks as if it measures what it is supposed to
measure" (Hughes, 1989). In other words, it is the extent to which a test appeals to
candidates or to those choosing it on behalf of the candidates. Face validity is crucial
because if students do not perceive the face validity of a test, they might not put maximum
effort into performing the tasks (Heaton, 1988).
Predictive validity
Predictive validity indicates the extent to which an individual's future level on the
criterion can be predicted from prior test performance (Hughes, 1989).
Concurrent Validity
Concurrent validity is the degree to which a test correlates with, or provides similar
results to, another test of the same skill. In other words, it can be studied when one test
is proposed as a substitute for another, and it is examined when the test score and the
criterion score are determined at essentially the same time (Hughes, 1989).
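To make this concrete, the sketch below shows one way concurrent validity could be examined empirically: correlate candidates' scores on the new test with their scores on an established test of the same skill taken at about the same time. This is an illustrative sketch only; the score lists are invented data, not results from this study.

```python
from statistics import correlation  # Pearson's r; available in Python 3.10+

# Hypothetical scores for eight candidates on the new listening test and on
# an established test of the same skill, taken at about the same time.
new_test = [6.5, 7.0, 4.5, 8.0, 5.5, 9.0, 6.0, 7.5]
established_test = [6.0, 7.5, 5.0, 8.5, 5.0, 9.5, 6.5, 7.0]

# An r close to 1 means the two tests rank candidates similarly,
# which is evidence of concurrent validity.
r = correlation(new_test, established_test)
print(f"Pearson's r = {r:.2f}")
```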

Test reliability
Reliability is another fundamental consideration in testing because it addresses the
consistency of the testing process in relation to test administration and scoring (Hughes,
1989). In other words, a reliable test will produce the same outcome every time it is
administered. Consistency within a test, which is called internal consistency, occurs when
there is correlation among the variables comprising the test. Besides, inter-rater
reliability refers to the level of agreement between different raters on an instrument.
Teachers, as the agents of assessment, need to ensure the reliability and validity of their
classroom assessment and build on it to support their students' learning (Black & Wiliam,
1998a, 1998b).
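As a rough illustration of internal consistency, the sketch below computes Cronbach's alpha, one widely used internal-consistency estimate: when items correlate well, the variance of the total score is much larger than the sum of the per-item variances. The formula is standard, but the item-score matrix is invented data, not results from this study.

```python
from statistics import pvariance

# Hypothetical data: one row per candidate, one column per item
# (1 = correct, 0 = incorrect).
responses = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [1, 1, 0, 0],
]

k = len(responses[0])                              # number of items
item_vars = [pvariance(col) for col in zip(*responses)]
total_var = pvariance([sum(row) for row in responses])

# Cronbach's alpha: k/(k-1) * (1 - sum of item variances / total variance).
alpha = (k / (k - 1)) * (1 - sum(item_vars) / total_var)
print(f"Cronbach's alpha = {alpha:.2f}")           # values near 1 = high consistency
```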

Relationship between validity and reliability


A test cannot be valid unless it is reliable. However, the reverse is not necessarily true:
a reliable test is not necessarily a valid one. Sometimes test designers may need to reduce
the validity of a test intentionally in order to maintain its reliability. As a result,
testers need to mind the tension between these two qualities in order to balance them.

3. Test specifications
Specifications for the test must be developed at the very beginning of the test design
process. According to Hughes (2003, p. 59), the specifications will contain information
regarding the "content, test structure, timing, medium/channel, techniques to be used,
criterial levels of performance, and scoring procedure."

Test content
Test content refers to the items that will be included in the test, which will subsequently
suggest the construct of the test. As a result, test content needs to be chosen carefully
from samples of the test domain (McNamara, 2000). The domain of the test can be defined
either as "a set of practical, real-world tasks", if the construct is more operational, or
as "a theory of components of knowledge and ability that underlie performance in the
domain", if the construct is rather abstract (p. 25). Importantly, content should be
specified as fully as possible in terms of operations, types, addressees and length of
text(s), topics, readability, structural and vocabulary range, dialect, accent, style and
speed of processing.

Format and timing


Regarding test format, the different parts of the test, the types of skill to be tested, the
number of items and passages, and the techniques to be used to measure different
skills/subskills should be specified. The exact amount of time allocated to each section and
to the entire test should also be decided. Additionally, the way in which candidates react
to the test materials is investigated through their interaction with the response format,
that is, the test method. The relationship between test method and test content consequently
determines the type of response format: if the method is seen as an aspect of content, the
response format needs to be more authentic, whereas if it is considered independent of
content, an inauthentic response format is acceptable (McNamara, 2000).

Criterial levels of performance


The level(s) of performance required for success need to be specified. For receptive skills
such as reading or listening, the criterion can be as simple as a required percentage of
mastery of all items (Hughes, 2003).

Scoring procedure
Scoring procedures are of vital importance, especially when scoring is subjective. Test
developers should be clear about the ways to achieve reliable and valid scoring, the types
of rating scale, the number of people involved in scoring, and the procedures for dealing
with disagreement among scorers (Hughes, 2003).

4. Types of test items
According to Brown and Hudson (2002), there are three types of responses: selected,
constructed and personal. The three types differ in that a selected response requires the
examinee to choose from a set of possible solutions to the test item, a constructed response
requires students to produce some language from a relatively limited set of possible
answers, and a personal response requires learners to produce an individual response
(p. 59). Despite these differences, the types share some common problems. First, items can
end up testing intelligence more than language ability. Second, it may be background
knowledge that is tested rather than the target subskills. Furthermore, test items may not
be independent of one another as they should be; for example, when test takers fail item 1,
they will automatically fail item 2. Lastly, instructions can be vague, with no examples
provided to lead students to a full understanding of what they are required to do.
Selected response
Selected responses are those in which students choose the correct answer from among a set of
options, by circling it, darkening it or a similar selection procedure. That is to say,
students do not have to create any language; this response type is therefore suitable for
listening.
Within the selected response format, our test will use two types of items, binary choice and
multiple choice, each with pros and cons. A binary-choice item requires students to respond
to the language by selecting one of two choices, for instance between true and false. Its
primary strength is that assessing the ability to distinguish between only two choices makes
it a simple and direct measure of comprehension. However, test writers may produce items
that are tricky or deceptive. Moreover, binary choice carries a large guessing factor (50%),
requires a large number of items and mostly focuses on facts and details, which are its main
weaknesses. A multiple-choice item requires students to examine some language material and
then select, usually from among three, four or five options, the answer that best completes
a statement or best fills a blank in the statement. Among its advantages, this kind of item
has a smaller guessing factor and is helpful for testing a wide range of learning points.
However, it can be overused or used for inappropriate purposes.
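The guessing factors mentioned above follow from simple arithmetic: under blind guessing, the expected proportion of correct answers is one divided by the number of options. The sketch below is purely illustrative; the item counts are invented.

```python
def expected_guess_score(num_options: int, num_items: int) -> float:
    """Expected number of items answered correctly by blind guessing."""
    return num_items / num_options

# With 10 items, blind guessing yields on average:
print(expected_guess_score(2, 10))  # binary choice: 5.0 items correct (50%)
print(expected_guess_score(3, 10))  # 3-option multiple choice: ~3.3 items (33%)
```

This is why a binary-choice section needs a larger number of items than a multiple-choice section to give an equally trustworthy measure.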

Constructed response
Constructed responses are those in which a student is required actually to produce language
rather than simply select answers. They are therefore suitable for testing the interaction
of receptive and productive skills, as in listening. In contrast to the selected response
format, constructed response tests eliminate most of the guessing, but they introduce all
the problems associated with subjectivity on the part of scorers; marking is also more
time-consuming. Another potential problem is that test takers may be able to bluff, which is
also a type of guessing, but about what raters want rather than about what may be correct,
as in selected-response tests. Among the constructed-response item types, the test designed
here uses partial dictation. Nation and Newton (2009) consider partial dictation (PD) an
easier variant of full dictation and a plausible activity for enhancing FL/L2 listening
ability. Students are provided with an incomplete written text and fill in the missing words
while listening to an oral version of the text. Several FL/L2 researchers have recommended
partial dictation as a reliable, valid and practical listening test (Buck, 2001; Hughes,
1989; Nation & Newton, 2009). Buck (2001) supports Hughes' (1989) suggestion to use partial
dictation for low-level students when full dictation proves too difficult for them. Partial
dictation helps students focus on the missing parts, making it easier for them to follow the
text and/or to get its main points.
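As an illustration of how a partial-dictation task might be prepared, the sketch below blanks out every nth word of a transcript and keeps the removed words as an answer key. This is a hypothetical helper, not the procedure used in this study; the transcript and the gap interval are invented.

```python
import re

def make_partial_dictation(transcript: str, gap_every: int = 7):
    """Blank out every `gap_every`-th word; return (gapped text, answer key)."""
    words = transcript.split()
    answers = []
    for i in range(gap_every - 1, len(words), gap_every):
        # Keep only the bare word in the key, without punctuation.
        answers.append(re.sub(r"\W", "", words[i]))
        words[i] = f"({len(answers)}) ______"
    return " ".join(words), answers

text, key = make_partial_dictation(
    "Studying abroad gives students the chance to experience a new culture "
    "and to become more independent in their daily lives."
)
print(text)  # gapped text that students complete while listening
print(key)   # the removed words, used for marking
```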

II. Test specifications


Content and format
The paper contains two sections with a total of 20 questions.
Section 1
Questions 1-5
Features of the task

Skill focus: Listening for specific ideas
Task level: B1
Task description: Listen to a short talk/dialogue and find the correct answer for each question. The task focuses on identification of specific ideas.
Instructions to candidates: The rubric will contain two parts: 1) a short contextualization: you will hear ….. talking about …..; 2) a short requirement: for questions one to five, listen and choose the best answer.
Presentation: Aural
Response format: Multiple choice
Topic: Education, Work and Leisure, Food, Movie and Television, Memory, Describing things, Relationships, Money makes the world go round, Travel and Exploration, Environment
Length: 2-3 minutes; Words: 60-80
Accent: Native speakers of English
Pattern: Dialogue

Questions 6-10
Features of the task

Skill focus: Listening for specific ideas
Task level: B1
Task description: Listen to a short talk/dialogue and decide whether each sentence is True or False. The task focuses on identification of specific ideas.
Instructions to candidates: The rubric will contain two parts: 1) a short contextualization: you will hear ….. talking about …..; 2) a short requirement: for questions six to ten, listen and decide whether each sentence is True or False.
Presentation: Aural
Response format: Binary choice (True/False)
Topic: Education, Work and Leisure, Food, Movie and Television, Memory, Describing things, Relationships, Money makes the world go round, Travel and Exploration, Environment
Length: 2-3 minutes; Words: 60-80
Accent: Native speakers of English
Pattern: Dialogue

Section 2
Features of the task

Skill focus: Listening for sounds/words
Task level: B1
Task description: Students listen to a dictation recording twice and fill ONE WORD or A NUMBER in each gap.
Instructions to candidates: Listen and fill in the gaps. Write no more than ONE WORD and/or A NUMBER in each gap. You will hear the piece TWICE.
Presentation: Aural
Response format: Gap filling
Topic: Study Abroad, Vitamins, Family matters, Childhood, Television, Happiness and dreams, Superstitions, Environment, Technology, Careers
Length: 5 minutes; Words: 400-450
Accent: Native speakers of English
Pattern: Monologue
Timing
Timing
The test lasts about 20-22 minutes, excluding 10 minutes for transferring answers.
Operations
- Students present their answers by writing A, B or C (Questions 1-5), T or F (Questions
6-10), or words from the recording (Questions 11-20) on the answer sheet.
- Candidates record their answers on the question paper as they listen.
- They are given 10 minutes at the end of the test to copy their answers onto the answer
sheet.
- In each part, the recording is played twice.
- The recordings contain a variety of native-speaker accents.

Criterial levels of performance/benchmark

Students need to score a minimum of 5 points in total in order to pass this listening test.

Scoring procedures
Section 1 (Questions 1-10): Total: 5 points
- For multiple-choice and binary-choice questions, there is only one correct answer per
question.
- Each correct answer is awarded 0.5 mark; a wrong or blank answer is awarded 0 marks.
Section 2 (Questions 11-20): Total: 5 points
- Each correct answer is awarded 0.5 mark.
- An answer with any of the following problems is awarded 0 marks: wrong spelling, exceeding
the word/number limit, wrong word order, or a blank answer.
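A minimal sketch of how these scoring rules could be implemented is given below. The question numbers, mark values and pass mark follow the specification above, while the answer keys and the candidate's responses are invented for illustration.

```python
def score_section1(answers: dict[int, str], key: dict[int, str]) -> float:
    """Questions 1-10: 0.5 mark per correct answer, 0 for wrong or blank."""
    return sum(0.5 for q, correct in key.items()
               if answers.get(q, "").strip().upper() == correct)

def score_section2(answers: dict[int, str], key: dict[int, str]) -> float:
    """Questions 11-20: 0.5 mark per gap; the answer must be a single
    word/number with exact spelling, otherwise 0."""
    total = 0.0
    for q, correct in key.items():
        given = answers.get(q, "").strip()
        if len(given.split()) == 1 and given.lower() == correct.lower():
            total += 0.5
    return total

# Hypothetical keys and candidate answers for a quick check.
key1 = {1: "A", 2: "C", 3: "B", 4: "A", 5: "B",
        6: "T", 7: "F", 8: "T", 9: "T", 10: "F"}
key2 = {11: "culture", 12: "1998", 13: "abroad", 14: "family", 15: "dreams",
        16: "careers", 17: "television", 18: "vitamins", 19: "happiness",
        20: "technology"}
candidate = {1: "A", 2: "B", 3: "B", 4: "A", 5: "C",
             6: "T", 7: "F", 8: "F", 9: "T", 10: "F",
             11: "culture", 12: "1998", 13: "abroad", 14: "familly",
             15: "dreams", 16: "careers", 17: "the television",
             18: "vitamins", 19: "happiness", 20: "technology"}

total = score_section1(candidate, key1) + score_section2(candidate, key2)
print(f"Total: {total}/10 -> {'PASS' if total >= 5 else 'FAIL'}")
```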
References
Black, P. J., & Wiliam, D. (1998a). Assessment and classroom learning. Assessment in
Education: Principles, Policy and Practice, 5(1), 7-73.
Black, P. J., & Wiliam, D. (1998b). Inside the Black Box: Raising Standards through
Classroom Assessment. London: King's College London School of Education.
Brown, J. D., & Hudson, T. (2002). Criterion-referenced Language Testing. Cambridge:
Cambridge University Press.
Buck, G. (2001). Assessing Listening. Cambridge: Cambridge University Press.
Cronbach, L. (1990). Essentials of Psychological Testing. New York: Harper & Row.
Heaton, J. B. (1988). Writing English Language Tests: Longman Handbook for Language
Teachers (New Edition). London: Longman Group UK Ltd.
Hughes, A. (1989). Testing for Language Teachers. Cambridge: Cambridge University Press.
Hughes, A. (2003). Testing for Language Teachers (2nd ed.). Cambridge: Cambridge
University Press.
Madsen, H. S. (1983). Techniques in Testing. Oxford: Oxford University Press.
McNamara, T. (2000). Language Testing. Oxford: Oxford University Press.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational Measurement (3rd ed., pp.
13-103). New York: Macmillan.
Nation, I. S. P., & Newton, J. (2009). Teaching ESL/EFL Listening and Speaking. New York:
Routledge.
Rubin, D. L., & Mead, N. A. (1984). Large Scale Assessment of Oral Communication Skills:
Kindergarten through Grade 12. Urbana, IL: ERIC Clearinghouse on Reading and
Communication Skills, National Institute of Education.
