Validity and Reliability in Language Assessment and Testing


Validity and reliability
Dr. D. Spiteri, Faculty of Education, University of Malta
VALIDITY
 Construct validity
 Content validity
 Face validity
CONTENT VALIDITY
 Does it reflect the teaching programme?
 Does the test test what it is supposed to test?
 Does it reflect the syllabus?
 Does it include a representative sample of what has
been learnt?
What has been taught → Sample of what has been taught

• The lower the Forms, the younger the learners, the more careful we have to be that our test has content validity.
• We cannot expect Form 1 students to be able to use the passive or to have a wide range of vocabulary.
My test lacks content validity
if ......
 I don’t teach listening comprehension skills in class but then include a listening task in a test;
 I do not teach students how to write a letter of complaint but then test them on it;
 I do not teach students how to work out the meaning of a word from the context, but then include such an item in the reading comprehension test.

Content validity therefore asks..


Does a test match the syllabus or teaching
that preceded it?
….. Content validity
Does this mean that I can only test what I have expressly
taught?
No, certain aspects can be assumed as having been learnt.

E.g. among average Form 1 students: the basic tenses, several prepositions, certain vocabulary, certain conjunctions, etc.

E.g. among average Form V students: all the tenses, several text types, all reading skills, several vocabulary areas, etc.
Improving content validity
 Our teaching should be guided by the syllabus

 Make a list of the vocab areas / topics / functions /


structures / subskills that you have covered

 From this whole, sample carefully and purposefully so


that the test is in proportion to the time and coverage
that you devoted to the above.
Syllabus → Test
 It is easy to sample haphazardly and end up being
unfair to some learners.

 Sampling can be done in a systematic manner using a


little grid as we’re going to see next.

WHAT EXACTLY ARE WE TEACHING?

SYLLABUS ITEM | WHAT EXACTLY ARE WE TEACHING? | % OF SYLLABUS | RECOGNISE | PRODUCE | NUMBER OF ITEMS IN TEST
Grammar: Present simple | All persons: statements, questions, negatives, short answers | 15 | | |

SYLLABUS ITEM | WHAT EXACTLY ARE WE TEACHING? | % OF SYLLABUS | RECOGNISE | PRODUCE | NUMBER OF ITEMS IN TEST
Reading skills | Scanning / skimming / working out meaning from context | 25 | ✓ | ✓ | 25/100
Speaking skills | Apologizing / giving reasons / offering solutions | 25 | ✓ | ✓ | 25/100
Writing skills | A letter of apology / a letter explaining reasons | 25 | ✓ | ✓ | 25/100
Listening skills | Conversation (a row); expressing regret / giving reasons | 25 | ✓ | ✓ | 25/100
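A sampling grid like this reduces to simple arithmetic: each syllabus area receives a share of the test's items in proportion to its share of the syllabus. A minimal sketch in Python; the coverage figures, test length, and `allocate_items` helper are illustrative assumptions, not taken from any real syllabus:

```python
# Sketch: allocate test items in proportion to syllabus coverage.
# The percentages and the 40-item test length are invented for illustration.
syllabus_coverage = {
    "Reading skills": 25,
    "Speaking skills": 25,
    "Writing skills": 25,
    "Listening skills": 25,
}

def allocate_items(coverage, total_items):
    """Give each syllabus area a share of test items matching its coverage."""
    total_pct = sum(coverage.values())
    return {
        area: round(total_items * pct / total_pct)
        for area, pct in coverage.items()
    }

print(allocate_items(syllabus_coverage, 40))
# four areas at 25% each of a 40-item test -> 10 items per area
```

Skewing the percentages skews the item counts accordingly, which is exactly the "test in proportion to time and coverage" principle from the earlier slide.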
Construct validity
 Does the test test what it is supposed to - and nothing else?
 How happy would you be if you found out that the pilot
flying your plane got his license after studying lots of
books?
 How valid is a driving test in which the learner did not
drive a car?
 How valid is a test of physical stamina if a young person is
asked to walk around the University ring road?
 How valid is a speaking test if students only answer questions
but never ask one?
 How valid is a speaking test if the questions are on general
knowledge?
 How valid is a reading test which requires me to write long
answers?
 How valid is a reading test where the teacher removes marks
for my spelling and grammar mistakes?
..... Construct validity

How valid is a writing test in which I am only asked to fill in the blanks in sentences?
…. or to choose the correct sentences from a multiple-choice set?
How valid is a test of one’s weight if the person is wearing thick winter clothes and boots?

ALL THE ABOVE ARE CONSTRUCTS – THINGS ABOUT WHICH WE HAVE CERTAIN KNOWLEDGE. A TEST HAS CONSTRUCT VALIDITY IF IT MATCHES AND FITS THE KNOWLEDGE WE HAVE OF IT.
4. Read the text and answer the following questions:
 Yesterday I saw the palglish flester gollining begrunt the
bruck. He seemed very chanderbil, so I did not jorter
him, just deapled to him quistly. Perhaps later he will
besand chander, and I will be able to rangel to him.

 What was the flester doing, and where?


 What sort of flester was he?
 Why did the writer decide not to jorter him?
 How did she deaple?
 What did she hope would happen?

Are we really testing reading comprehension skills?


Yesterday I saw the new patient hurrying along the
corridor. He seemed very upset, so I did not follow
him, just called to him gently. Perhaps later he will
feel better, and I will be able to talk to him.

 What is the problem described here?


 Is this event taking place indoors or outside?
 Did the writer try to get near the patient?
 What do you think she said when she called to him?
 What might the job of the writer be?
 Why do you think she wants to talk to the patient?
Improving construct validity – a few examples
For writing:
 Require students to write a piece which is authentic and exists in real life.
 Make sure that the writing task specifies an audience or reader to whom the learners should write – this makes a huge difference to the style of writing.
 Set two writing tasks rather than one; one should be longer than the other, and together they should not exceed the number of words specified in the syllabus. By giving the learners a fresh start, you are giving them another opportunity to show what they can do.
....for writing
Be clear about assessment criteria – for yourself and
for students – use a rating scale.

 Do not test or expect general knowledge or cultural


knowledge or schematic knowledge beyond your
learners’ grasp.

Do not expect semi-literary pieces of writing from your learners, nor should you expect to be entertained by their writing.

GO BUY YOURSELF A NOVEL INSTEAD!


For reading:
 Require students to write as little as possible when answering questions;
 Do not penalize accuracy (but don’t tell students this);
 Have more than one reading comprehension task;
 Don’t stick to prose passages only – film listings, recipes, timetables, instructions, directions are also texts meant for reading;
 Don’t stick to excerpts from works of literature – these might not be easily understood when lifted from a whole;
 Do test a variety of reading subskills (v. VLE).
For speaking:
 Require learners to take on a role they can identify with;
 Require them to initiate, not only respond;
 Do not penalize learners who can monitor and correct themselves;
 Do not penalize learners if they ask for clarification – this is a skill in itself!
 Use a rating scale to judge learners’ output.
 Don’t be discouraged by the time factor – things that are worth doing are worth doing, no matter the time it takes.
For listening:
 Use ‘texts’ meant to be listened to;
 Use more than one ‘text’ – these shouldn’t be too long;
 Use a recording;
 Require learners to write as little as possible;
 Do not penalize for accuracy as long as meaning is clear.
In general…..
 Give examples;
 Keep rubrics short and simple;
 Show allocation of marks for each task;
 Insert duration of test / exam;
 Work through the paper yourself;
 Ask a critical friend to work through the paper;
 Decide what is required for students to score full marks – out of 3;
 Use rating scale for writing skills and speaking skills;
 Use a list of ‘operations’ for the reading and listening
skills.
Face validity
 This has to do with the test’s appearance;
does it look as though it is doing what it’s
supposed to be doing? Do test takers and test
users believe in it?

 Does the public feel that the test does what it should do?

Face validity refers mainly to people’s opinion about a test or examination.
What makes a ‘good’ test good?
A GOOD TEST HAS THE FOLLOWING QUALITIES:
 VALID
 RELIABLE
 GOOD BACKWASH
Reliability

Test reliability: if it was possible to give the same person the same test at the same time, would the result be the same?

Scorer reliability:
 Intra-scorer reliability: if the same person marked the same test twice, would they give the same mark?
 Inter-scorer reliability: if two people marked the same test, would they give the same score?
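Inter-scorer reliability can be checked empirically: have two teachers mark the same set of scripts and compare their scores, for instance with a correlation coefficient. A minimal sketch; the scores and the `pearson` helper are invented for illustration, not part of the slides:

```python
# Sketch: quantify inter-scorer reliability for two markers of the same scripts.
# The scores below are invented; real data would come from a marking exercise.
scorer_a = [12, 15, 9, 18, 14, 11]
scorer_b = [13, 15, 8, 17, 14, 12]

def pearson(x, y):
    """Pearson correlation: 1.0 means the two markers rank scripts identically."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)

print(f"inter-scorer correlation: {pearson(scorer_a, scorer_b):.2f}")
# prints inter-scorer correlation: 0.96
```

A figure near 1.0 suggests the two markers agree; a low figure signals exactly the problem the slide warns about, where different marking makes the results unreliable.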
Test reliability - how can we improve it

1. Choose a representative sample of language to test.

2. Vary the testing techniques but choose ones sts are


familiar with.

3. Write clear and simple instructions.

4. Include an example.

5. Restrict the task.

6. Keep conditions comparable.


Scorer reliability – how can we improve it?

TASK: Which of the following two types of test questions makes scoring more reliable?

a) Objectively-marked test items
b) Subjectively-marked test items

 Can an entire test be made up of objectively-marked items / questions?
….Scorer reliability – how can we improve it?
1. Ensure that some parts of the test are objective-type items:
 Multiple choice (!)
 Cloze
 Underline / circle the correct answer
 Put in correct order
 Matching

2. Limit the number of possible correct answers. If this is not possible then use a marking scheme. Involve more than one teacher in marking the test.

3. Agree on the criteria that answers will be judged by.
IF TEACHERS MARK IN DIFFERENT WAYS THEN THE
EXAM / TEST HAS BEEN A WASTE OF TIME BECAUSE
THE RESULTS ARE NOT RELIABLE
Backwash
Backwash is a term that describes the effect that a test has on the
teaching programme that leads to it.

When we design a test we need to consider what effect the test will
have on people.

For example, what happens if:
 We don’t test listening comprehension?
 We test only grammar?
 Speaking is not tested?
 Grammar is not tested?

Unfortunately or fortunately, tests and exams still exert a strong influence on the teaching and learning that take place; therefore we have to ensure that tests are valid, communicative and meaningful.
