
CHAPTER 15

DEVELOPMENT OF
LARGE-SCALE STUDENT
ASSESSMENT TEST
Review of Classroom Test
Development Process
Within the context of summative assessment in the classroom, the
suggested phases of work in developing the test generally consist
of the following procedural steps to ensure basic validity
requirements:
o Planning the test which specifies
• Purpose of test
• Learning outcomes
• Test blueprint – test format, number of items
o Item construction, which is performed by the classroom teacher
following a table of specifications
o Review and Revision for item improvement
• Judgemental approach – before and after administration of
test
- by the teacher/peers to ensure the accuracy and alignment
of test content to learning outcomes
- by the students to ensure comprehensibility of items and
test instructions
o Empirical approaches – after administration of test
• Obtain item statistics in the form of quality indices
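
A minimal sketch of how such quality indices can be computed, assuming dichotomously scored (0/1) item responses in a NumPy array; the function name and the upper-lower 27% grouping are illustrative choices, not part of the original text:

```python
import numpy as np

def item_quality_indices(scores):
    """scores: examinees x items array of 0/1 responses."""
    # Difficulty index: proportion of examinees answering each item correctly
    difficulty = scores.mean(axis=0)

    # Discrimination index: difference in proportion correct between the
    # upper 27% and lower 27% groups on total score
    total = scores.sum(axis=1)
    upper = scores[total >= np.percentile(total, 73)]
    lower = scores[total <= np.percentile(total, 27)]
    discrimination = upper.mean(axis=0) - lower.mean(axis=0)

    return difficulty, discrimination
```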
Development Process for
Large-Scale Test
Changing the context from classroom to system-wide testing, other
significant considerations must be in place in addition to what is
required for a teacher-made test. With an understanding of the
nature of large-scale student assessment, more questions must be
addressed in the development process concerning the purpose of the
test, its coverage, the length of the test, the review of items for
quality and fairness, and such technical merits as validity and
reliability, among others.

o Divide the class into four conversation groups. Discuss the
concerns that must be raised if a large-scale assessment test is to
be developed.
o As a class, discuss these questions and classify them into phases
of work they will fall under.
o Watch the video presentation showing how “ETS creates fair
meaningful tests and test questions” to guide your discussion
later.
Development Process for
Large-Scale Test
What do you see as common steps in developing classroom tests and
large-scale tests?

o They both need a test framework that specifies the purpose of the
test, what is to be measured, to whom the test will be administered,
what test format to use, the length of the test, etc.
o They both need to prepare a test blueprint or table of
specifications that specifies the content, knowledge, and skills to
be covered and the number of items to be prepared for each
learning outcome (see the sketch after this list).
o There is a need to review the items to ensure that they measure
the intended outcomes, that the problem posed is unambiguous, that
the distracters are plausible, and that the keyed option is correct.
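
To make the idea concrete, a test blueprint can be represented as a simple mapping from each learning outcome to its item format and item allotment. The outcomes and counts below are purely hypothetical:

```python
# Hypothetical table of specifications for a 40-item test
blueprint = {
    "Recall key concepts":          {"format": "multiple-choice", "items": 10},
    "Apply procedures to problems": {"format": "multiple-choice", "items": 14},
    "Interpret graphs and tables":  {"format": "multiple-choice", "items": 10},
    "Explain reasoning in writing": {"format": "constructed-response", "items": 6},
}

# The allotted items should sum to the planned test length
total_items = sum(entry["items"] for entry in blueprint.values())
assert total_items == 40
```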
Figure 13.1. Standard Test Development Process (cyclical flow): Create blueprint → Write questions → Content review → Fairness review → Editorial review → Stakeholders' review → Pilot testing → Statistical review → Make ready for use.
The LSA process, however, spends much more time and effort in
carrying out multiple checks and balances. The various types of
review to be undertaken (i.e., content, fairness, editorial, and
stakeholders' reviews) involve curriculum experts, teachers, item
developers, testing experts, language specialists, sociologists,
psychometricians, statisticians, and large-database specialists.
These are reflected in two steps which apparently are not done with
classroom tests: pilot testing of the test with sample groups whose
characteristics are similar to the target population, and the
statistical review that establishes the psychometric integrity of the
items and of the test as a whole by gathering empirical evidence for
the validity of its score interpretation and for its reliability in
terms of the consistency of scores obtained across versions of the
test.
Key Steps in Large-Scale Test
Development

The test development process is basically influenced by the
Standards for Educational and Psychological Testing developed by the
American Educational Research Association, American Psychological
Association, and National Council on Measurement in Education (1985).
While these are regarded as criteria for evaluating tests (cf.
Section 2), they serve as the foundation for the process.

Given these standards, ETS has developed its own stringent
guidelines contained in the 2014 ETS Standards for Quality and
Fairness, with specific standards on “Validity”, “Fairness”,
“Scoring”, and “Reporting Test Results” in addition to “Test Design
and Development”. These have defined the key steps in the
development of large-scale tests.
Table 13.1: Steps in Developing
Tests by ETS
Key Steps and the Fundamental Questions to be Addressed

Step 1: Defining Objectives
 Who will take the test and for what purpose?
 What skills and/or areas of knowledge should be tested?
 How should test takers be able to use their knowledge?
 What kinds of questions should be included? How many of each kind?
 How long should the test be?
 How difficult should the test be?

Step 2: Item Development Committees
Committee members are responsible for:
 Defining test objectives and specifications
 Helping ensure test questions are unbiased
 Determining test format (e.g., multiple-choice, essay, constructed-response, etc.)
 Considering supplemental test materials
 Reviewing test questions
 Writing test questions
Step 3: Writing and Reviewing Questions
Item developers and reviewers must see to it that each item:
 Has only one correct answer among the options provided in the test
 Conforms to the style rules used throughout the test
 Has scoring guides for open-ended responses (e.g., short written answers and essays)

Step 4: The Pretest
Items are pretested on a sample group similar to the population to be tested. Results should determine:
 The difficulty of each question
 If questions are ambiguous or misleading
 If questions should be revised or eliminated
 If incorrect alternative answers should be revised or replaced

Step 5: Detecting and Removing Unfair Questions
After pretesting, test reviewers re-examine the items.
 Are there any test questions whose language, symbols, words, or
phrases are inappropriate or offensive to any subgroup of the
population?
 Are there questions on which one group consistently performs
better than other groups? (see the sketch after this table)
 Which items need further revision or removal before the final
version is made?
Step 6: Assembling the Test
After the test is assembled, item reviewers prepare a list of
correct answers, which is compared with the existing answer keys.
• Are the intended answers indeed the correct answers?
Step 7: Making Sure the Test Questions Are Functioning Properly
After the test is administered, results are analyzed to find out if
the test is working as intended.
• Is the test valid? Are the score interpretations supported by
empirical evidence?
• Is the test reliable? Can the performance on one version of the
test predict performance on any other version of the test?
• What corrective actions need to be done when problems are
detected before final scoring is done?
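
The pretest analysis in Steps 4 and 5 can be illustrated with a simplified subgroup comparison. Operational programs use formal differential item functioning (DIF) procedures such as Mantel-Haenszel; the sketch below, with an illustrative threshold, only flags items whose proportion correct differs sharply between groups as candidates for fairness review:

```python
import numpy as np

def flag_subgroup_gaps(scores, group_labels, threshold=0.15):
    """scores: examinees x items array of 0/1 responses; group_labels: group id per examinee."""
    groups = np.unique(group_labels)
    flagged = []
    for item in range(scores.shape[1]):
        # Proportion correct on this item within each subgroup
        p_by_group = [scores[group_labels == g, item].mean() for g in groups]
        if max(p_by_group) - min(p_by_group) > threshold:
            flagged.append(item)  # candidate for fairness review, not proof of bias
    return flagged
```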
Establishing Validity of Tests

Validity is regarded as the basic requirement of every test. It refers to the
degree to which a test measures what it is intended to measure. Can the test
perform its intended function? This is the business of validity, and it is the
view adopted by the classical model of validity. There are three conventional
types of validity according to this model: content validity, criterion-related
validity, and construct validity.

Content validity refers to how well the test covers a representative sample of
the behaviors to be measured; the coverage of these behaviors by the test items
serves as evidence of content validity.

Construct validity involves empirical examination of the psychological
construct hypothetically assumed to be measured by the test. It is established
by doing a factor analysis of the test items to bring out what defines the
overall construct. It determines if the test measures a unitary construct or a
multi-dimensional construct, as shown by the resultant factors (see the sketch
below). These “validities” have for a while been what educational and
psychological tests are required to establish.
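
A minimal sketch of such a dimensionality check, assuming item-level scores in a NumPy array and using scikit-learn's FactorAnalysis; in practice the number of factors retained would be guided by eigenvalues, scree plots, or fit indices rather than fixed in advance:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

def examine_dimensionality(item_scores, n_factors=2):
    """item_scores: examinees x items array; returns the item-by-factor loading matrix."""
    fa = FactorAnalysis(n_components=n_factors, random_state=0)
    fa.fit(item_scores)
    return fa.components_.T  # items x factors

# Items loading strongly on a single factor suggest a unitary construct;
# distinct clusters of loadings suggest a multi-dimensional construct.
```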
While validity is spoken of with reference to what the test purports to
measure, the concept as applied to large-scale testing has shifted to “the
degree to which evidence and theory support the interpretations of test scores
entailed by proposed uses of tests” (Messick, 1995). There are five categories
of evidence supporting score interpretation, and these have brought about
other forms of validity:

1. Evidence based on test content
2. Evidence based on response processes
3. Evidence based on internal structure
4. Evidence based on relations to other variables
5. Evidence based on consequences of testing

The type of validity that looks into the social impact of a test result on an
individual, a group, or a school is referred to as consequential validity
(Crocker and Algina, 1986).
Estimating Reliability of
Large-Scale Tests
Reliability is related to the concept of error of measurement, which indicates
the degree of fluctuation likely to occur in an individual score as a result of
irrelevant, chance factors, which Anastasi and Urbina (1997) call error
variance.

There are several ways of estimating the reliability of a test, and they are
grouped according to the number of times the test is administered to the same
group of students. With two test sessions, there is test-retest reliability,
where the same test is given twice with a time interval not exceeding six
months, and alternate-form reliability, where two comparable versions of the
test are administered to the same individuals. Administration of the two forms
can be done immediately, one after the other, or delayed with an interval not
exceeding six months. Alternate-form reliability is also widely known as
parallel-form reliability since the forms emerge from the same table of
specifications. The nature and strength of the relationship or correspondence
between the two sets of scores is then established using the coefficient of
correlation.
This value ranges from -1.0 to +1.0; the closer it gets to +1.0, the more
consistent are the scores obtained from the two test trials. To obtain the
reliability coefficient in these two types, the Pearson Product-Moment
Correlation is used to get the coefficient of correlation (r) with this
well-known formula:
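
In its raw-score form, with N examinees and X and Y as the two sets of scores being compared, the formula is:

r = [N ΣXY − (ΣX)(ΣY)] / √{[N ΣX² − (ΣX)²][N ΣY² − (ΣY)²]}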

With only a single administration, split-half reliability is workable. This divides the
test into two halves using the odd-even split. All the odd-numbered items make up
Form A while the even-numbered items compose Form B. The coefficient of
correlation between the two half-tests is obtained using the Pearson Product-Moment
Correlation, with the Spearman-Brown formula applied to estimate the reliability of
the full-length test.
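
A minimal sketch of the split-half procedure with the Spearman-Brown correction, assuming dichotomously scored items in a NumPy array (the function names are illustrative):

```python
import numpy as np

def pearson_r(x, y):
    # Raw-score Pearson product-moment correlation between two score vectors
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    num = n * np.sum(x * y) - np.sum(x) * np.sum(y)
    den = np.sqrt((n * np.sum(x**2) - np.sum(x)**2) * (n * np.sum(y**2) - np.sum(y)**2))
    return num / den

def split_half_reliability(item_scores):
    """item_scores: examinees x items array of 0/1 responses."""
    form_a = item_scores[:, 0::2].sum(axis=1)  # odd-numbered items (Form A)
    form_b = item_scores[:, 1::2].sum(axis=1)  # even-numbered items (Form B)
    r_half = pearson_r(form_a, form_b)
    # Spearman-Brown correction estimates the reliability of the full-length test
    return (2 * r_half) / (1 + r_half)
```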

Inter-rater reliability assesses the degree to which different judges or raters agree
in their assessment decisions. This is quite useful to avoid doubts on the scoring
procedure of tests with non-objective items. The sets of scores obtained in the test
from two raters can also be subjected to Pearson r to get the reliability coefficient.
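
For inter-rater reliability, the same Pearson r applies; a brief sketch using hypothetical essay ratings and scipy's pearsonr:

```python
from scipy.stats import pearsonr

# Hypothetical essay scores assigned by two independent raters to ten examinees
rater_a = [4, 5, 3, 6, 2, 5, 4, 6, 3, 5]
rater_b = [4, 6, 3, 5, 2, 5, 5, 6, 4, 5]

r, p_value = pearsonr(rater_a, rater_b)
print(f"Inter-rater reliability (Pearson r): {r:.2f}")
```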
