There are five criteria for evaluating a test:

1. Practicality
2. Reliability
3. Validity
4. Authenticity
5. Washback


A practical test
is not excessively expensive,
stays within appropriate time constraints,
is relatively easy to administer, and
has a scoring/evaluation procedure that is specific and time-efficient.


For a test to be practical,
administrative details should be clearly established before the test,
students should be able to complete the test reasonably within the set time frame,
the test should be able to be administered smoothly, without getting bogged down in procedure,
all materials and equipment should be ready,
the cost of the test should be within budgeted limits,
the scoring/evaluation system should be feasible in the teacher's time frame, and
methods for reporting results should be determined in advance.


A reliable test is consistent and dependable. (The same test, given to the same
student on different occasions, should yield the same results.)

The issue of reliability of a test may best be addressed by considering a number of
factors that may contribute to the unreliability of a test.
Consider the following possibilities: fluctuations
in the student (Student-Related Reliability),
in scoring (Rater Reliability),
in test administration (Test Administration Reliability), and
in the test itself (Test Reliability).

Student-Related Reliability:

Temporary illness, fatigue, a bad day, anxiety, and other physical or
psychological factors may make an observed score deviate from one's true
score. A test-taker's test-wiseness, or strategies for efficient test taking, can
also be included in this category.

Rater Reliability:

Human error, subjectivity, lack of attention to scoring criteria, inexperience,
inattention, or even preconceived biases may enter into
the scoring process.

Inter-rater unreliability occurs when two or more scorers yield
inconsistent scores for the same test. (That is, different instructors
assign inconsistent scores to the same test.)
Intra-rater unreliability is a common occurrence for classroom teachers
because of unclear scoring criteria, fatigue, bias toward particular "good"
and "bad" students, or simple carelessness.
One solution to such intra-rater unreliability is to read through about half of
the tests before rendering any final scores or grades, then to recycle back
through the whole set of tests to ensure an even-handed judgement.

The careful specification of an analytical scoring instrument can increase rater reliability.
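
The inter-rater consistency discussed above can be checked with a quick calculation. The following is a minimal sketch, not a prescribed procedure: it assumes two teachers have scored the same set of tests on a shared band scale, and it reports simple percent agreement together with Cohen's kappa, a standard statistic that corrects agreement for chance. The function names and the band scores are hypothetical and chosen only for illustration.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Share of tests on which both raters gave exactly the same score."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of tests")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Agreement between two raters, corrected for chance agreement."""
    n = len(rater_a)
    observed = percent_agreement(rater_a, rater_b)
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    # Chance agreement: probability that both raters pick the same band
    # if each assigned bands independently at their own observed rates.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical band throughout
    return (observed - expected) / (1 - expected)


# Hypothetical band scores (1-5) given by two teachers to the same ten essays.
teacher_1 = [3, 4, 2, 5, 3, 4, 4, 2, 3, 5]
teacher_2 = [3, 4, 3, 5, 3, 3, 4, 2, 3, 4]

print(round(percent_agreement(teacher_1, teacher_2), 2))
print(round(cohens_kappa(teacher_1, teacher_2), 2))
```

A kappa well below the raw agreement figure suggests that part of the apparent consistency is due to chance, which is one signal that the scoring criteria may need to be specified more carefully.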

Test Administration Reliability:

Unreliability may also result from the conditions in which the test is administered.

Examples include street noise, photocopying variations, poor light, variations
in temperature, and the condition of desks and chairs.

Test Reliability:

Sometimes the nature of the test itself can cause measurement errors.

o Timed tests may discriminate against students who do not perform well on a
test with a time limit.
o Poorly written test items (that are ambiguous or that have more than one
correct answer) may be a further source of test unreliability.

Arguably, validity is the most important principle. In classroom terms, it is the extent to which the
assessment requires students to perform tasks that were included in previous classroom lessons.

How is the validity of a test established?
There is no final, absolute measure of validity, but several different kinds of evidence
may be invoked in support.
In some cases
it may be appropriate to examine the extent to which a test calls for performance that
matches that of the course or unit of study being tested.
In other cases
we may be concerned with how well a test determines whether or not students have
reached an established set of goals or level of competence.
In still other cases
it could be appropriate to study statistical correlation with other related but
independent measures.
Other concerns about a test's validity
may focus on the consequences of a test, beyond measuring the criteria themselves,
or even on the test-taker's perception of validity.
We will look at these five types of evidence below.

Content Validity:

If a test requires the test-taker to perform the behaviour that is being measured,
it can claim content-related evidence of validity, often popularly referred to as
content validity.

If you are trying to assess a person's ability to speak a second language in a
conversational setting, asking the learner to answer paper-and-pencil multiple-choice
questions requiring grammatical judgements does not achieve content
validity. In contrast, a test that requires the learner actually to speak within some sort of
authentic context does.

Additionally, in order for content validity to be achieved in a test, the
following conditions should be met:
Classroom objectives should be identified and appropriately framed. The
first measure of an effective classroom test is the identification of objectives.

Lesson objectives should be represented in the form of test specifications. In
other words, a test should have a structure that follows logically from the
lesson or unit you are testing.

If you clearly perceive the performance of test-takers as reflective of the
classroom objectives, then you can argue that content validity has probably
been achieved.

Another way of understanding content validity is to consider the difference between
direct and indirect testing.
Direct testing involves the test-taker in actually performing the target task.
Indirect testing involves the test-taker in performing not the target task
itself, but a task that is related to it in some way.
When you test learners' oral production of syllable stress,
having them mark stressed syllables in a list of written words is
indirect testing, whereas requiring them actually to produce the target words orally
is direct testing.

Consequently, it can be said that direct testing is the most feasible way to
achieve content validity in classroom assessment.

Criterion-related Validity:

It examines the extent to which the criterion of the test has actually been
achieved. (That is, how well the tested skill, topic, or knowledge has actually been grasped.)
For example, a classroom test designed to assess a point of grammar in
communicative use will have criterion validity if test scores are corroborated
either by observed subsequent behaviour or by other communicative
measures of the grammar point in question.
(Either the test-taker's behaviour regarding the tested topic is observed and
checked for consistency, or the test-taker is given a different test on the same topic and the
two sets of results are compared for consistency.)

Criterion-related evidence usually falls into one of two categories:
Concurrent validity:
A test has concurrent validity if its results are supported by other
concurrent performance beyond the assessment itself.
For example, the validity of a high score on the final exam of a foreign
language course will be substantiated by actual proficiency in the language.
(That is, success on the test should be reflected in actual use of the language.)

Predictive validity:
The assessment criterion in such cases is not to measure concurrent ability
but to assess (and predict) a test-taker's likelihood of future success.
For example, the predictive validity of an assessment becomes important in
the case of placement tests, language aptitude tests, and the like. (For example,
forming homogeneous groups through a placement test in order to obtain
more successful classes.)
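
The kind of corroboration described in this section is often examined by correlating test scores with an independent measure of the same ability. The sketch below is only an illustration under stated assumptions: it computes the Pearson correlation coefficient between scores on a classroom grammar quiz and teacher ratings of the same students on a later communicative task. The function name and both score lists are hypothetical.

```python
from math import sqrt
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation between two paired lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two paired lists with at least two scores each")
    mean_x, mean_y = mean(xs), mean(ys)
    # Sum of products of paired deviations from the means.
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when all scores are identical")
    return covariation / (spread_x * spread_y)


# Hypothetical data for six students.
quiz_scores = [12, 15, 9, 18, 14, 11]   # classroom grammar quiz (out of 20)
task_ratings = [3, 4, 2, 5, 4, 3]       # rating on a communicative task (1-5)

print(round(pearson_r(quiz_scores, task_ratings), 2))
```

A coefficient close to 1 would count as one piece of criterion-related evidence; a low or negative coefficient would suggest that the quiz and the communicative task are not measuring the same thing. With so few students, any such figure should be read cautiously.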

Construct Validity:

Virtually every issue in language learning and teaching involves theoretical
constructs. In the field of assessment, construct validity asks, "Does this test
actually tap into the theoretical construct as it has been identified?" (In other
words, does this test really have the structural features needed to test the topic
or skill I want to test?)

Example 1:
Imagine that you have been given a procedure for conducting an oral interview. The
scoring analysis for the interview includes several factors in the final score:
pronunciation, fluency, grammatical accuracy, vocabulary use, and sociolinguistic
appropriateness. The justification for these five factors lies in a theoretical construct
that claims those factors to be major components of oral proficiency.
So if you were asked to conduct an oral proficiency interview that evaluated only
pronunciation and grammar, you could be justifiably suspicious about the
construct validity of that test.

Example 2:
Let's suppose you've created a simple written vocabulary quiz, covering the content
of a recent unit, that asks students to correctly define a set of words. Your chosen
items may be a perfectly adequate sample of what was covered in the unit, but if the
lexical objective of the unit was the communicative use of vocabulary, then the
writing of definitions certainly fails to match a construct of communicative
language use.

The exams we call large-scale standardized tests are not very strong in terms of
construct validity. For the sake of practicality (that is, for reasons of both time
and cost), these tests cannot measure all the language skills that ought to be
measured. For example, the absence of an oral production section in the TOEFL
is a major obstacle to its construct validity.

Consequential Validity:

Consequential validity encompasses all the consequences of a test,
including such considerations as its accuracy in measuring intended criteria, its
impact on the preparation of test-takers, its effect on the learner, and the
(intended and unintended) social consequences of a test's interpretation and
use.
McNamara (2000, p. 54) cautions against test results that may reflect
socioeconomic conditions such as opportunities for coaching (private tutoring,
special attention). For example, only some families can afford coaching, and children with
more highly educated parents may get help from their parents.

Teachers should consider the effect of assessments on students' motivation,
subsequent performance in a course, independent learning, study habits, and
attitude toward school work.

Face Validity:

Face validity refers to the degree to which a test "looks right" and appears
to measure the knowledge or abilities it claims to measure, based on the
subjective judgment of the test-takers. (That is, it concerns how sound,
relevant, and useful the test-takers find the test.)
Face validity means that the students perceive the test to be valid. Face
validity asks the question, "Does the test, on the face of it, appear from the
learner's perspective to test what it is designed to test?"
Face validity is not something that can be empirically tested by a teacher or
even by a testing expert. It depends on the subjective evaluation of the
test-taker.
A classroom test is not the time to introduce new tasks.
If a test samples the actual content of what the learner has achieved or
expects to achieve, face validity will be more likely to be perceived.
Content validity is a very important ingredient in achieving face validity.
Students will generally judge a test to be face valid if
directions are clear,
the structure of the test is organized logically,
its difficulty level is appropriately pitched,
the test has no surprises, and
timing is appropriate.
To give an assessment procedure that is "biased for best" (that is, designed to
bring out students' best performance rather than to catch them out), a teacher
offers students appropriate review and preparation for the test,
suggests strategies that will be beneficial, and
structures the test so that the best students will be modestly challenged
and the weaker students will not be overwhelmed.


In an authentic test
the language is as natural as possible,
items are as contextualized as possible,
topics and situations are interesting, enjoyable, and/or humorous,
some thematic organization, such as through a story line or
episode, is provided, and
tasks represent real-world tasks.

Reading passages are selected from real-world sources that test-takers are
likely to have encountered or will encounter.
Listening comprehension sections feature natural language with hesitations,
white noise, and interruptions.
More and more tests offer items that are episodic in that they are sequenced to
form meaningful units, paragraphs, or stories.


Washback includes the effects of an assessment on teaching and learning prior to the
assessment itself, that is, on preparation for the assessment.

Informal performance assessment is by nature more likely to have built-in
washback effects because the teacher is usually providing interactive feedback.
(Interim quizzes given before formal exams, so that students can get themselves
in order, produce a washback effect.)
Formal tests can also have positive washback, but they provide no washback if
the students receive a simple letter grade or a single overall numerical
score.
Classroom tests should serve as learning devices through which washback is
achieved.
Students' incorrect responses can become windows of insight into further
work.
Their correct responses need to be praised, especially when they represent
accomplishments in a student's interlanguage.
Washback enhances a number of basic principles of language acquisition:
intrinsic motivation, autonomy, self-confidence, language ego,
interlanguage, and strategic investment, among others.
One way to enhance washback is to comment generously and specifically on test
performance.
Washback implies that students have ready access to the teacher to discuss the
feedback and evaluation the teacher has given.
Teachers can raise the washback potential by asking students to use test results
as a guide to setting goals for their future effort.

What is washback?
In general terms: the effect of testing on teaching and learning.
In large-scale assessment: the effects that tests have on instruction in terms of how students prepare for the test.
In classroom assessment: the information that washes back to students in the form of useful diagnoses of strengths and weaknesses.

What does washback enhance?
Intrinsic motivation
Language ego
Strategic investment

What should teachers do to enhance washback?
Comment generously and specifically on test performance
Respond to as many details as possible
Praise strengths
Criticize weaknesses
Give strategic hints to improve performance


Decide whether the following statements are TRUE or FALSE.

1. An expensive test is not practical.
2. One of the sources of unreliability of a test is the school.
3. Students, raters, the test, and the administration of it may affect the test's reliability.
4. In indirect tests, students do not actually perform the task.
5. If students are aware of what is being tested when they take a test, and think that
the questions are appropriate, the test has face validity.
6. Face validity can be tested empirically.
7. Diagnosing strengths and weaknesses of students in language learning is a facet of washback.
8. One way of achieving authenticity in testing is to use simplified language.

Decide which type of validity each sentence belongs to.

a) Content validity b) Criterion-related validity c) Construct validity
d) Consequential validity e) Face validity

1. It is based on subjective judgment. ----------------------
2. It questions the accuracy of measuring the intended criteria. ----------------------
3. It appears to measure the knowledge and abilities it claims to measure. ---------------
4. It measures whether the test meets the classroom objectives. -----------
5. It requires the test to be based on a theoretical background. ----------------------
6. Washback is part of it. ----------------------
7. It requires the test-taker to perform the behavior being measured. --------------------
8. The students (test-takers) think they are given enough time to do the test. -----------
9. It assesses a test-taker's likelihood of future success. (e.g. placement tests). ---------
10. The students' psychological mood may affect it negatively or positively. ---------------
11. It includes the consideration of the test's effect on the learner. ----------------------
12. Items of the test do not seem to be complicated. ----------------------
13. The test covers the objectives of the course. ----------------------
14. The test has clear directions. ----------------------

Decide which type of reliability each sentence could be related to.
a) Student-related reliability b) Rater reliability
c) Test administration reliability d) Test reliability
1. There are ambiguous items.
2. The student is anxious.
3. The tape is of bad quality.
4. The teacher is tired but continues scoring.
5. The test is too long.
6. The room is dark.
7. The student has had an argument with the teacher.
8. The scorers interpret the criteria differently.
9. There is a lot of noise outside the building.

