Methodology 3 Exam Notes
By far the most complex and important principle of an effective test – validity, “the
extent to which inferences made from assessment results are appropriate,
meaningful and useful in terms of the purpose of the assessment” (Gronlund).
Several different kinds of evidence may be invoked in order to confirm a test’s
validity. It may be appropriate to examine the extent to which a test calls for
performance that matches that of the course or unit of study being tested. We may
be concerned with how well a test determines whether or not students have reached an
established set of goals or level of competence. Statistical correlation with other
related but independent measures is another widely accepted form of evidence. Other
concerns about a test’s validity may focus on the consequences of a test, or even on the
test taker’s perception of validity.
Content validity (content-related evidence) – whether a test actually samples the subject
matter about which conclusions are to be drawn, and whether it requires the test taker to
perform the behavior that is being measured.
Criterion-related validity – the extent to which the criterion of the test has been
reached. In such tests, specified classroom objectives are measured, and implied
predetermined levels of performance are expected to be reached.
Construct validity – asks whether the test actually taps into the theoretical construct as it
has been defined.
Consequential validity – Consequences of a test such as accuracy in measuring intended
criteria, the test’s impact on the preparation of the test takers, its effects on the learner.
Face validity – the degree to which the test looks right and appears to measure the
knowledge or abilities it claims to measure, based on the subjective judgment of the
examinees who take it and the administrative personnel who decide on its use.
A reliable test is consistent and dependable and yields the same results if given to the
same or matched students on two different occasions. The reliability of a test may
best be considered by checking a number of factors which contribute to the
unreliability of a test: fluctuations in scoring, in the student (temporary illness,
fatigue, a bad day or anxiety), in test administration, and in the test itself.
Test specifications are most often a simple outline of the test. They can consist of a
broad outline of the test, what skills will be tested and what the test items will look like.
4. Cloze task?
These tasks were developed on the assumption that, in written language, a sentence
with a word left out should have enough context that a reader should be able to close
that gap with a calculated guess, using linguistic expectancies (formal schemata),
background expectancies (content schemata) and strategic competence.
Cloze tests are usually a minimum of two paragraphs in length in order to account
for discourse expectancies. Specifications for scoring and choosing deletions need to be
clearly defined. Typically, every seventh word is deleted (fixed-ratio deletion), but
many cloze test designers use a rational deletion procedure of choosing deletions
according to the grammatical or discourse functions of the words. Traditionally,
cloze passages have between 30 and 50 blanks to fill. The test may likewise come in a
multiple-choice format, making grading even easier.
There are two scoring methods for the test – the exact word method and the
appropriate word method.
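To make the procedure concrete, here is a small illustrative sketch (my own, not from the notes; the function names and lead-in length are arbitrary choices) of fixed-ratio deletion plus the two scoring methods:

```python
def make_cloze(text, ratio=7, lead_in=1):
    """Fixed-ratio deletion: blank every `ratio`-th word, leaving the
    first `lead_in` words intact as opening context."""
    words = text.split()
    answers = []
    for i in range(lead_in + ratio - 1, len(words), ratio):
        answers.append(words[i])
        words[i] = "____"
    return " ".join(words), answers

def score_exact(responses, answers):
    """Exact-word method: only the originally deleted word earns a point."""
    return sum(r.strip().lower() == a.lower() for r, a in zip(responses, answers))

def score_appropriate(responses, acceptable):
    """Appropriate-word method: any contextually acceptable word counts."""
    return sum(r.strip().lower() in {a.lower() for a in alts}
               for r, alts in zip(responses, acceptable))
```

Note that the appropriate-word method needs a pre-agreed list of acceptable answers for each blank, which is what makes it slower to prepare and score than the exact-word method.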
If your aim is to test global competence in a language, then you are testing
proficiency. These tests have traditionally consisted of standardized multiple-choice
items on grammar, vocabulary, reading comprehension, and aural comprehension.
Proficiency tests are almost always summative and norm referenced, and provide
results in the form of a single score, which is a sufficient result for the gate-keeping
role they play of accepting or denying someone passage into the next stage of a journey.
(Example: TOEFL test)
7. Three major genres of writing (according to Brown) with at least three examples for
each!
Academic writing:
Papers, essays, short-answer test responses, technical reports, dissertations, theses
Job-related writing:
Messages, letters/emails, memos, reports, advertisements, manuals
Personal writing:
Emails, letters, greeting cards, messages, notes, forms, fiction, diaries, questionnaires,
medical reports etc.
‘Alternative assessment’ is usually taken to mean assessment procedures which are less
formal than traditional testing, which are gathered over a period of time rather than
being taken at one point in time, which are usually formative rather than summative in
function, are often low-stakes in terms of consequences, and are claimed to have
beneficial washback effects (2001: 228).
If a test is valid, teachers can make accurate judgments about the competence of the
learners they are working with.
Macroskills:
Recognize the communicative functions of utterances, according to situations,
participants, goals.
Infer situations, participants, goals using real-world knowledge.
From events, ideas, and so on, described, deduce causes, and effects and such relations
as main idea, supporting idea, new information, given information, generalization and
exemplification.
Distinguish between literal and implied meanings.
Use facial, kinesic, body language, and other nonverbal cues to decipher meaning.
Develop and use a battery of listening strategies, such as detecting key words, guessing
the meaning of words from context, appealing for help, and signaling comprehension or
lack thereof.
Microskills:
Discriminate among the distinctive sounds of English.
Retain chunks of language of different lengths in short term memory.
Recognize English stress patterns, words in stressed and unstressed positions, rhythmic
structure, intonation contours, and their role in signaling information.
Distinguish word boundaries, recognize a core of words, and interpret word order
patterns and their significance.
Process speech at different rates of delivery.
Process speech containing pauses, errors, corrections, and other performance variables.
Recognize grammatical word classes, systems, patterns and elliptical forms.
Detect sentence constituents and distinguish between major and minor constituents.
Recognize cohesive devices in spoken discourse.
Recognize that a particular meaning may be expressed in different grammatical forms.
A practical test
Is not excessively expensive
Stays within appropriate time constraints
Is relatively easy to administer
Has a scoring procedure that is specific and time efficient
Each point on a holistic scale is given a systematic set of descriptors, and the reader-
evaluator matches an overall impression with the descriptors to arrive at a score.
Descriptors usually follow a prescribed pattern. For example, the first descriptor across
all score categories may address the quality of task achievement, the second may
deal with organization, the third with grammatical or rhetorical features, and
so on. Scoring, however, is truly holistic in that those subsets are not quantitatively
added up to yield a score. Advantages include: fast evaluation, relatively high inter-
rater reliability, the fact that scores represent “standards” that are easily interpreted by
lay persons, the fact that scores tend to emphasize the writer’s strengths, and applicability
to writing across many disciplines. For classroom instructional purposes, however, holistic
scoring provides very little. In most classroom settings where a teacher wishes to adapt a
curriculum to the needs of a particular group of students, much more differentiated
information across subskills is desirable than holistic scoring provides.
Primary trait scoring focuses on “how well students can write within a narrowly defined
range of discourse”. This type of scoring emphasizes the task at hand and assigns a
score based on the effectiveness of the text’s achieving that one goal. In summary, a
primary trait score would assess:
The accuracy of the account of the original (summary)
The expression of the writer’s opinion (response to an article)
A four-point scale (e.g., from zero to three) is suggested for rating the primary
trait of the text. It goes without saying that organization, fluency, syntactic variety,
supporting details, and other features will implicitly be evaluated in the process of
offering a primary trait score.
Extensive reading applies to texts of more than a page, up to and including essays,
professional articles, technical reports, short stories and books. The purposes of
assessment are usually to tap into a learner’s global understanding of a text as
opposed to asking test-takers to “zoom in” on small details. Top-down processing is
assumed for more extensive tasks.
17. What skills do learners have to master in order to become efficient readers?
First, they need to be able to master fundamental bottom-up strategies for processing
separate letters, words and phrases, as well as top-down conceptually driven
strategies for comprehension.
Second, as a part of that top-down approach, second language learners must develop
appropriate content and formal schemata – background information and cultural
experience – to carry out those interpretations effectively.
Formative and summative assessment are distinguished by the function the assessment
serves. Most of our classroom assessment is formative assessment – evaluating students
in the process of “forming” their competencies and skills with the goal of helping them
to continue their growth process. The key to such formation is the delivery (by the
teacher) and internalization (by the student) of appropriate feedback on
performance, with an eye on the future continuation (formation) of learning. Virtually
all kinds of informal assessment are formative. Summative assessment, by contrast,
aims to measure, or summarize, what a student has grasped, and typically occurs at the
end of a course or unit; final exams and proficiency tests are examples.
Formal assessments are the systematic, data-based tests that measure what and how well the students
have learned. Formal assessments determine the students’ proficiency or mastery of the content, and
can be used for comparisons against certain standards.
Examples:
standardized tests
criterion-referenced tests
norm-referenced tests
achievement tests
aptitude tests
Informal assessments are those spontaneous forms of assessment that can easily be incorporated in the
day-to-day classroom activities and that measure the students’ performance and progress. Informal
assessments are content and performance driven.
Examples:
checklist
observation
portfolio
rating scale
time sampling
event sampling
anecdotal record
Test taker's temporary psychological or physical state. Test performance can be influenced by a
person's psychological or physical state at the time of testing. For example, differing levels of
anxiety, fatigue, or motivation may affect the applicant's test results.
Environmental factors. Differences in the testing environment, such as room temperature,
lighting, noise, or even the test administrator, can influence an individual's test performance.
Test form. Many tests have more than one version or form. Items differ on each form, but each
form is supposed to measure the same thing. Different forms of a test are known as parallel
forms or alternate forms. These forms are designed to have similar measurement
characteristics, but they contain different items. Because the forms are not exactly the same, a
test taker might do better on one form than on another.
Multiple raters. In certain tests, scoring is determined by a rater's judgments of the test taker's
performance or responses. Differences in training, experience, and frame of reference among
raters can produce different test scores for the test taker.
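A crude way to check this multiple-raters problem is to compute how often two raters agree within a given number of points. A hypothetical sketch (not from the notes):

```python
def rater_agreement(rater_1, rater_2, tolerance=0):
    """Proportion of scripts on which two raters' scores differ by at most
    `tolerance` points – a rough check on inter-rater consistency."""
    matches = sum(abs(a - b) <= tolerance for a, b in zip(rater_1, rater_2))
    return matches / len(rater_1)
```

With tolerance 0 this is exact agreement; testing programs often also report adjacent agreement (tolerance 1), which is more forgiving of small judgment differences.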
Validity is the most important issue in selecting a test. Validity refers to what characteristic the test
measures and how well the test measures that characteristic.
Validity tells you if the characteristic being measured by a test is related to job qualifications and
requirements.
Validity gives meaning to the test scores. Validity evidence indicates that there is linkage
between test performance and job performance. It can tell you what you may conclude or
predict about someone from his or her score on the test. If a test has been demonstrated to be
a valid predictor of performance on a specific job, you can conclude that persons scoring high on
the test are more likely to perform well on the job than persons who score low on the test, all
else being equal.
Validity also describes the degree to which you can make specific conclusions or predictions
about people based on their test scores. In other words, it indicates the usefulness of the test.
An achievement test measures one's performance on specific knowledge or skills learnt in a specific
course. In other words, an achievement test measures how much the person has learnt over a specific
period (e.g., by the end of a chapter or the end of a course) and reveals the student's strengths and
weak points. Achievement tests have educational and diagnostic purposes that can help the
teacher and the learner investigate which area(s) the learner needs to improve. Proficiency tests,
however, measure how much a person is able to use the knowledge or skills s/he has learnt in
real-life situation(s), and they evaluate the learner's level of knowledge or skills in a specific area;
IELTS and TOEFL are examples of language proficiency tests.
A diagnostic test is a test that helps the teacher and learners identify problems that they have with the
language. At the start of the course, the teacher gives the learners a diagnostic test to see what areas of
language need to be in the syllabus.
Test specifications are a detailed summary of the test components and their purpose. They can serve as
an outline for the structure of the test and its general objectives, and they can also be made more
detailed depending on how much info is added.
Assessing skills:
Speaking: word repetition, sentence/dialogue completion, oral questionnaires, picture-cued tasks,
translations, question and answer, paraphrasing.
Reading: multiple choice, picture-cued tasks, reading aloud, matching, editing, gap-filling, cloze tasks,
scanning, ordering.
Writing: copying, listening cloze, picture-cued tasks, multiple choice, matching, dictation, grammatical
transformation, ordering, paraphrasing.
Listening: question and answer, listening cloze, information transfer, sentence repetition, dictations,
note-taking, editing, retelling.
Extensive reading, free reading, book flood, or reading for pleasure is a way of language learning,
including foreign language learning, through large amounts of reading. As well as facilitating acquisition
of vocabulary, it is believed to increase motivation through positive affective benefits.
Holistic scoring provides an examinee with a single score regarding the quality of examinee work (i.e.,
performance) as a whole. Most commonly, holistic scoring is used to assess writing samples, though it
may be employed to assess any performance task, for example, acting, debate, dance, or athletics.
When scoring an essay holistically, the rater neither marks errors on the paper nor writes
constructive comments in the margins. Instead, the rater considers the quality of the entire paper
and then assigns one holistic score. The SAT, ACT, and Advanced Placement tests all utilize a 6-point
holistic scoring rubric to assess their respective writing sections.
Analytic scoring is a method of evaluating student work that requires assigning a separate score for
each dimension of a task. Often used with performance assessment tasks, analytic scoring rubrics
specify the key dimensions of a task and define student performance relative to a set of criteria across
performance levels for each dimension. For example, analytic rubrics used to evaluate student essay
writing often include the following dimensions: development of ideas, organization, language use,
vocabulary, grammar, spelling, and mechanics.
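As an illustration of how the separate dimension ratings combine, here is a sketch with hypothetical weights and dimension names (these are not a standard rubric – real analytic rubrics define their own dimensions, weights and scales):

```python
# Hypothetical dimension weights; a real rubric would define its own.
WEIGHTS = {"content": 0.30, "organization": 0.20, "vocabulary": 0.20,
           "language_use": 0.25, "mechanics": 0.05}

def analytic_score(ratings, weights=WEIGHTS, scale=5):
    """Combine separate per-dimension ratings (each 0..scale) into one
    weighted percentage. Unlike holistic scoring, every dimension is
    rated on its own before the scores are combined."""
    if set(ratings) != set(weights):
        raise ValueError("every rubric dimension needs exactly one rating")
    return round(100 * sum(weights[d] * ratings[d] / scale for d in weights), 1)
```

Because each dimension keeps its own score before combination, the teacher can report the per-dimension profile to students – exactly the differentiated subskill information that holistic scoring cannot provide.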
What makes speaking difficult?
- Clustering, redundancy, reduced forms, performance variables, colloquial language, rate of delivery,
stress, rhythm and intonation and interaction.
A cloze test (also cloze deletion test or occlusion test) is an exercise, test, or assessment consisting of a
portion of language with certain items, words, or signs removed (cloze text), where the participant is
asked to replace the missing language item. Cloze tests require the ability to understand context
and vocabulary in order to identify the correct language or part of speech that belongs in the deleted
passages. This exercise is commonly administered for the assessment of native and second language
learning and instruction.
C-test is a task where the second half of every other word is obliterated and the test taker must restore
each word.
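A minimal sketch of that C-test procedure (my own illustration; real C-tests typically also leave the first and last sentences fully intact):

```python
def make_c_test(sentence, keep_first=1):
    """Blank the second half of every other word. The first `keep_first`
    words are left whole for context; thereafter every second word keeps
    only its first half (rounded up) and the rest becomes underscores."""
    words = sentence.split()
    gapped, answers = [], []
    for i, word in enumerate(words):
        if i >= keep_first and (i - keep_first) % 2 == 0 and len(word) > 1:
            half = (len(word) + 1) // 2
            gapped.append(word[:half] + "_" * (len(word) - half))
            answers.append(word)
        else:
            gapped.append(word)
    return " ".join(gapped), answers
```

Because half of each mutilated word remains visible, scoring is usually by the exact-word method: only the original word restores the text.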
What skills do learners have to master in order to become efficient readers?
Bottom-up and top-down processing, plus appropriate schemata: learn the basics, then connect
them to whole contexts.