Development of Large-Scale Student Assessment Test
Review of Classroom Test Development Process
Within the context of summative assessment in the classroom, the suggested phases of work in developing the test generally consist of the following procedural steps to ensure basic validity requirements:
o Planning the test, which specifies
• Purpose of the test
• Learning outcomes
• Test blueprint – test format, number of items
o Item construction, which is performed by the classroom teacher following a table of specifications
o Review and revision for item improvement
• Judgemental approach – before and after administration of the test
- by the teacher/peers, to ensure the accuracy and alignment of test content to learning outcomes
- by the students, to ensure comprehensibility of items and test instructions
• Empirical approach – after administration of the test
- obtain item statistics in the form of quality indices, such as difficulty and discrimination (see the sketch after this list)
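As a minimal sketch of two such quality indices, the Python fragment below computes an item's difficulty index (proportion correct) and an upper-lower discrimination index; the function names and the 0/1 scored-response matrix are hypothetical, assuming each row is one student and each column one item.

```python
def difficulty_index(item_scores):
    """Proportion of students answering the item correctly (p-value).
    A higher p means an easier item."""
    return sum(item_scores) / len(item_scores)

def discrimination_index(item_scores, total_scores, top_fraction=0.27):
    """Upper-lower discrimination index: proportion correct in the
    top-scoring group minus proportion correct in the bottom group."""
    n = max(1, int(len(total_scores) * top_fraction))
    ranked = sorted(range(len(total_scores)), key=lambda i: total_scores[i])
    bottom, top = ranked[:n], ranked[-n:]
    p_top = sum(item_scores[i] for i in top) / n
    p_bottom = sum(item_scores[i] for i in bottom) / n
    return p_top - p_bottom

# Hypothetical scored responses: rows = students, columns = items (1 = correct)
responses = [
    [1, 0, 1, 1],
    [1, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
totals = [sum(row) for row in responses]
item_1 = [row[0] for row in responses]
print(f"item 1 difficulty: {difficulty_index(item_1):.2f}")
print(f"item 1 discrimination: {discrimination_index(item_1, totals):.2f}")
```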
Development Process for Large-Scale Test
Changing the context from classroom to system-wide testing, there are other significant considerations that must be in place in addition to what is required by a teacher-made test. With an understanding of the nature of large-scale student assessment, more questions must be addressed in the development process concerning the purpose of the test, its coverage and length, the review of items for quality and fairness, and such technical merits as validity and reliability, among others.
[Figure: the large-scale test development cycle – make blueprint → write questions → content review → fairness review → editorial review → stakeholders' review → pilot testing → statistical review → ready for use]
The LSA process, however, spends much more time and effort in carrying out multiple checks and balances. The various types of review to be undertaken (content, fairness, editorial, and stakeholders') involve curriculum experts, teachers, item developers, testing experts, language specialists, sociologists, psychometricians, statisticians, and large-database specialists. Two steps that apparently are not done with classroom tests also appear: pilot testing of the test on sample groups whose characteristics are similar to the target population, and the statistical review that establishes the psychometric integrity of the items and the test as a whole by gathering empirical evidence for the validity of its score interpretation and for its reliability in terms of consistency of scores obtained across versions of the test.
Key Steps in Large-Scale Test Development
Step 1: Defining Objectives
Who will take the test and for what purpose?
What skills and/or areas of knowledge should the test cover?
How should test takers be able to use their knowledge?
What kinds of questions should be included? How many of each kind?
How long should the test be?
How difficult should the test be?
Step 4: The Pretest
Items are pretested on a sample group similar to the population to be tested. Results should determine:
The difficulty of each question
If questions are ambiguous or misleading
If questions should be revised or eliminated
If incorrect alternative answers should be revised or replaced (a simple distractor-analysis sketch follows this list)
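The last point can be screened with a simple distractor analysis. The Python sketch below is a hypothetical illustration: it tallies how often each alternative is chosen, assuming responses are recorded as option letters; an alternative that nobody picks is a candidate for revision or replacement.

```python
from collections import Counter

def distractor_analysis(responses, options, correct_option):
    """Tally how often each alternative was chosen on one item."""
    counts = Counter(responses)
    for option in options:
        marker = " (key)" if option == correct_option else ""
        print(f"option {option}: {counts.get(option, 0)} picks{marker}")

# Hypothetical answers of ten pretest examinees to one item, key = 'B'
answers = ["B", "A", "B", "B", "C", "B", "A", "B", "B", "B"]
distractor_analysis(answers, ["A", "B", "C", "D"], "B")
# 'D' is never chosen here, suggesting that distractor should be revised or replaced.
```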
Step 5: Detecting and Removing Unfair Questions
After pretesting, test reviewers re-examine the items.
Do any test questions contain language, symbols, words, or phrases that are inappropriate or offensive to any subgroup of the population?
Are there questions on which one group consistently performs better than others? (A simple screening sketch follows this list.)
Which items need further revision or removal before the final version is made?
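One rough empirical screen for the second question is to compare an item's proportion correct across subgroups, as in the hypothetical Python sketch below; this only flags items for expert review and is not a full differential-item-functioning analysis.

```python
def proportion_correct(item_scores):
    return sum(item_scores) / len(item_scores)

def flag_group_gap(group_a_scores, group_b_scores, threshold=0.20):
    """Flag an item when the proportion-correct gap between two subgroups
    exceeds the threshold. A large gap does not prove unfairness by
    itself, but it marks the item for expert review."""
    gap = abs(proportion_correct(group_a_scores) - proportion_correct(group_b_scores))
    return gap, gap > threshold

# Hypothetical 0/1 scores on one item for two subgroups of examinees
group_a = [1, 1, 1, 0, 1, 1, 1, 0]   # 75% correct
group_b = [1, 0, 0, 0, 1, 0, 0, 1]   # 37.5% correct
gap, flagged = flag_group_gap(group_a, group_b)
print(f"gap = {gap:.2f}, flag for review: {flagged}")
```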
Step 6: Assembling the Test
After the test is assembled, item reviewers prepare a list of correct answers and compare it with the existing answer keys.
• Are the intended answers indeed the correct answers?
Step 7: Making Sure That the Test Questions Are Functioning Properly
After administration, test results are analyzed to find out if the test is working as intended.
• Is the test valid? Are the score interpretations supported by empirical evidence?
• Is the test reliable? Can the performance on one version of the test predict performance on any other version of the test?
• What corrective actions need to be taken when problems are detected, before final scoring is done?
Establishing Validity of Tests
Content validity refers to how well the test covers the domain being measured; coverage of a representative sample of the relevant behaviors serves as evidence of content validity.
The type of validity that looks into the social impact of a test result on an individual, a group, or a school is referred to as consequential validity (Crocker and Algina, 1986).
Estimating Reliability of Large-Scale Tests
Reliability is related to the concept of error of measurement, which indicates the degree of fluctuation likely to occur in an individual score as a result of irrelevant, chance factors – what Anastasi and Urbina (1997) call error variance.
There are several ways of estimating the reliability of a test, and they are grouped according to the number of times the test is administered to the same group of students. With two test sessions, there is test-retest reliability, where the same test is given twice with a time interval not exceeding six months, and alternate-form reliability, where two comparable versions of the test are administered to the same individuals. Administration of the two forms can be done immediately, one after the other, or delayed with an interval not exceeding six months. This is also widely known as parallel-form reliability since the two forms emerge from the same table of specifications. The nature and strength of the relationship or correspondence between the two sets of scores is then established using the coefficient of correlation.
This value ranges from -1.0 to +1.0; the closer it gets to 1.0, the more consistent are the scores obtained from the two test trials. To obtain the reliability coefficient in these two types, the Pearson Product Moment Correlation is used to get the coefficient of correlation (r) with this well-known formula:

r = [NΣXY – (ΣX)(ΣY)] / √{[NΣX² – (ΣX)²][NΣY² – (ΣY)²]}
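As an illustration, here is a minimal Python sketch of this computation; the function name pearson_r and the sample score lists are hypothetical, showing how the raw-score formula above is applied to two sets of scores (test and retest, or Form A and Form B).

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation, using the raw-score formula."""
    if len(x) != len(y):
        raise ValueError("Both score lists must have the same length.")
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Hypothetical scores of five students on two administrations of the same test
test_1 = [38, 45, 29, 41, 35]
test_2 = [40, 44, 31, 39, 34]
print(f"test-retest reliability: r = {pearson_r(test_1, test_2):.2f}")
```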
With only a single administration, split-half reliability is workable. This divides the test into two halves using the odd-even split. All the odd-numbered items make up Form A while the even-numbered items compose Form B. The coefficient of correlation between the two half-tests is obtained using the Pearson Product Moment Correlation, with the Spearman-Brown formula, r_full = 2r_half / (1 + r_half), then applied to estimate the reliability of the full-length test (r).
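A minimal Python sketch of this procedure, under the assumption that responses are scored 0/1 per item: it splits each student's item scores by odd/even position, correlates the half-test totals, and applies the Spearman-Brown step-up. All names and data here are hypothetical.

```python
from math import sqrt

def pearson_r(x, y):
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sx2, sy2 = sum(a * a for a in x), sum(b * b for b in y)
    return (n * sxy - sx * sy) / sqrt((n * sx2 - sx ** 2) * (n * sy2 - sy ** 2))

def split_half_reliability(responses):
    """Odd-even split-half reliability with Spearman-Brown correction.
    `responses` is a list of per-student 0/1 item-score lists."""
    odd_totals = [sum(row[0::2]) for row in responses]   # items 1, 3, 5, ...
    even_totals = [sum(row[1::2]) for row in responses]  # items 2, 4, 6, ...
    r_half = pearson_r(odd_totals, even_totals)
    return 2 * r_half / (1 + r_half)  # Spearman-Brown step-up

# Hypothetical scored responses for six students on an eight-item test
responses = [
    [1, 1, 1, 0, 1, 1, 0, 1],
    [1, 0, 1, 1, 0, 1, 1, 0],
    [0, 0, 1, 0, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 1, 1, 1, 1],
]
print(f"split-half reliability: {split_half_reliability(responses):.2f}")
```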
Inter-rater reliability assesses the degree to which different judges or raters agree in their assessment decisions. This is quite useful for avoiding doubts about the scoring procedure of tests with non-objective items. The sets of scores obtained in the test from two raters can also be subjected to Pearson r to get the reliability coefficient.
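A short sketch, assuming SciPy is available: scipy.stats.pearsonr returns the correlation coefficient together with a p-value, so two raters' scores can be correlated directly. The rater data here are hypothetical.

```python
from scipy.stats import pearsonr

# Hypothetical essay scores (out of 10) given by two raters to eight students
rater_1 = [7, 9, 5, 8, 6, 9, 4, 7]
rater_2 = [8, 9, 4, 7, 6, 8, 5, 7]

r, p_value = pearsonr(rater_1, rater_2)
print(f"inter-rater reliability: r = {r:.2f} (p = {p_value:.3f})")
```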