Test Conceptualization: Norm-Referenced Vs Criterion-Referenced

BES3149 PSYCHOMET 1

Developing Psychological Tests

Stages in Test Development

Test Conceptualization
- The test developer answers this basic question: What am I going to measure?
SOME PRELIMINARY QUESTIONS RELEVANT DURING THIS STAGE
1. What is the test going to measure?
2. What is the objective of the test?
3. Is there a need for this test?
- Are there existing tests measuring the same attribute?
- Will this newly developed test be better than the existing ones?
- Will it have stronger psychometric properties?
- Will it have greater practicality and utility?
4. Who are the intended test users?
5. Who are the intended test takers?
6. What is the content of the test?
7. How will the test be administered?
8. What is the ideal test format?
9. Should there be more than one form?
10. Does the test require special training for test administration, scoring, and interpretation?
11. What types of responses are required of test takers? (Address issues of accommodation for differently-abled test takers.)
12. Who will benefit from administering the test?
13. Is there any potential for harm as a result of administering this test?
14. How will meaning be attributed to scores on this test?

NORM-REFERENCED VS CRITERION-REFERENCED
a. Norm-referenced = we create a table of norms; the table serves as the basis for
interpreting the scores
b. Criterion-referenced = there is an established criterion (or set of criteria) for deciding whether a test taker is qualified
® From a criterion-referenced perspective, what matters is whether a test taker has met a
particular set of criteria or standards; it does not matter whether a test taker got the
highest score
® In developing items for a criterion-referenced test, the test developer begins by
“conceptualizing the knowledge and skills to be mastered”, then focuses on
developing items that measure these
® It is recommended that in developing a criterion-referenced test, at least two
groups of test takers are utilized (one group that has mastered the required skills and
knowledge, and another group without mastery)
- For both, a good item is one that is answered correctly by high scorers and incorrectly by
low scorers
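The two-group recommendation above can be sketched as a simple comparison of item pass rates; the response data below are hypothetical and only illustrate the idea that a good criterion-referenced item clearly separates the mastery group from the non-mastery group.

```python
# Hypothetical item responses (1 = correct, 0 = incorrect) from a group
# known to have mastered the skill and a group known not to have.
mastery_group = [1, 1, 1, 0, 1]
non_mastery_group = [0, 1, 0, 0, 0]

def pass_rate(responses):
    """Proportion of the group answering the item correctly."""
    return sum(responses) / len(responses)

# A usable criterion-referenced item shows a clearly higher pass rate
# in the mastery group than in the non-mastery group.
separation = pass_rate(mastery_group) - pass_rate(non_mastery_group)
print(separation)
```

Items with little or no separation between the two groups would be candidates for revision or removal.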
Pilot work
- Different from test tryout
- Refers to the preliminary research surrounding the creation of a prototype of a test
- Test items may be pilot studied to evaluate whether they should be included in
the final form of the instrument
- Test developer typically attempts to determine how best to measure a targeted
construct
- May entail literature reviews and experimentation, as well as the creation,
revision, and deletion of preliminary test items

Test construction
- Scaling Methods = process of setting rules for assigning numbers in
measurement
1. Rating scale
- Grouping of words, statements, or symbols on which judgements of the strength
of a particular trait, attitude, or emotion are indicated by the test taker
- Final score is obtained by summing the rating across all items = summative
scale
- Examples:

a. Likert Scale
® Likert (1932) experimented with different weightings of the five categories but
concluded that assigning weighting of 1 (for endorsement of items at one
extreme) through 5 (for endorsement of items at the other extreme) generally
worked best
® One that is used extensively in psychology, usually to measure attitudes
® Relatively easy to construct, each item presents the testtaker with 5 alternative
responses (sometimes 7)
® Usually on an agree-disagree or approve-disapprove continuum
® Use of rating scales results in ordinal-level data
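The summative scoring described above can be sketched as follows; the item texts and the reverse-keyed flag are hypothetical examples, used only to show how a negatively worded item is flipped before summing.

```python
# Hypothetical 5-point Likert responses (1 = strongly disagree ... 5 = strongly agree).
responses = {
    "I enjoy meeting new people": 4,
    "I avoid social gatherings": 2,   # negatively worded item
}
reverse_keyed = {"I avoid social gatherings"}

def score_likert(responses, reverse_keyed, n_points=5):
    """Sum the ratings across all items (summative scale); reverse-keyed
    items are flipped so a higher total always means more of the trait."""
    total = 0
    for item, rating in responses.items():
        if item in reverse_keyed:
            rating = (n_points + 1) - rating   # e.g. 2 becomes 4 on a 5-point scale
        total += rating
    return total

print(score_likert(responses, reverse_keyed))
```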
b. Method of Paired Comparison
® Testtakers are presented with pairs of stimuli which they are asked to compare
® They must select one of the stimuli according to some rule; for example, the rule
that they agree more with one statement than the other, or the rule that they find
one stimulus more appealing than the other
® This method is similar to the forced choice technique
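A minimal sketch of tallying paired-comparison choices: the stimuli A, B, and C are hypothetical, and each tuple records the pair presented and which member the testtaker selected under the chosen rule.

```python
from collections import Counter

# (stimulus 1, stimulus 2, stimulus chosen) for each presented pair.
choices = [
    ("A", "B", "A"),
    ("A", "C", "C"),
    ("B", "C", "C"),
]

def paired_comparison_scores(choices):
    """Score each stimulus by how often the testtaker preferred it."""
    wins = Counter(chosen for _, _, chosen in choices)
    return dict(wins)

print(paired_comparison_scores(choices))
```

Stimuli never chosen simply do not appear in the tally; across many testtakers, these counts can be aggregated to scale the stimuli.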
c. Guttman Scale
® Items on it range sequentially from weaker to stronger expressions of the attitude,
belief, or feeling being measured
® All respondents who agree with the stronger statements of the attitude will also
agree with milder statements
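The defining property of a Guttman scale can be checked mechanically: once the items are ordered from mildest to strongest, a consistent respondent's agreements (1) should never resume after the first disagreement (0). A small sketch of that check:

```python
def is_guttman_consistent(pattern):
    """Check a response pattern over items ordered mildest to strongest.
    Consistent patterns look like [1, 1, 0, 0]: agreement with a stronger
    statement implies agreement with every milder one."""
    seen_disagreement = False
    for response in pattern:
        if response == 0:
            seen_disagreement = True
        elif seen_disagreement:
            # Agreed with a stronger item after disagreeing with a milder one.
            return False
    return True

print(is_guttman_consistent([1, 1, 0, 0]))
print(is_guttman_consistent([1, 0, 1, 0]))
```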
d. Thurstone Scale
® Also known as Method of Equal Appearing Intervals
® Made up of statements about a particular issue, and each statement has a numerical
value indicating the respondent’s attitude about the issue, either favorable or
unfavorable
® Respondents indicate the statements with which they agree, and the average
response is computed
® Problems with developing Thurstone scales:
a. Time consuming and expensive
b. Examples of the mid-points of the scale for which there is consensus among the
judges can be difficult to obtain
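The Thurstone procedure above can be sketched in two steps: each statement's scale value is commonly taken as the median of the judges' ratings, and a respondent's score is the mean value of the statements they endorse. The statements and ratings below are hypothetical.

```python
import statistics

# Hypothetical ratings by five judges on an 11-point favorability continuum
# (1 = most unfavorable ... 11 = most favorable).
judge_ratings = {
    "Superstitions are harmless fun": [6, 7, 6, 8, 7],
    "Superstitions hold society back": [2, 1, 2, 3, 2],
}

# Step 1: scale value of each statement = median of judges' ratings.
scale_values = {s: statistics.median(r) for s, r in judge_ratings.items()}

def thurstone_score(endorsed, scale_values):
    """Step 2: respondent's score = mean scale value of endorsed statements."""
    return statistics.mean(scale_values[s] for s in endorsed)

print(thurstone_score(["Superstitions are harmless fun"], scale_values))
```

The expense noted in the problems above comes from step 1: a panel of judges must rate every candidate statement before any respondent is scored.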
Writing Items
- The prospective test developer or item writer immediately faces three questions related
to the test blueprint:
a. What range of content should the items cover?
b. Which of the many different types of item formats should be employed?
c. How many items should be written in total and for each content area covered?
- In this stage, an item pool is developed
® Item pool is the reservoir or well from which items will or will not be drawn for
the final version of the test
® The item pool should contain at least twice the number of items expected to be
included in the final version of the test
® Content validity is an important consideration in developing the item pool

Selected Response vs Constructed Response

a. Selected-response format
- Requires testtakers to select a response from a set of alternatives
- Examples: multiple choice, matching type, true or false (binary-choice) items
- There are many considerations in assessing which among the selected-response formats could be better (i.e., controlling guesswork, providing an uneven number of items per column in the matching type, writing good distractors for multiple choice)

b. Constructed-response format
- Requires testtakers to supply or provide the correct answer
- Examples: fill-in-the-blanks, identification, enumeration, essay
- There are many advantages and downsides for each of these formats (i.e., the type of skill measured by a fill-in-the-blanks item compared with the essay, the subjectivity in scoring essay tests, etc.)

a. Cumulative scoring: the higher the score on the test, the higher the testtaker is on the
ability measured
b. Category/Class scoring: the testtaker earns points/credits toward placement in a
particular class or category together with other testtakers who share the same response
patterns.
c. Ipsative scoring: involves comparing a testtaker’s score on one scale within a test to
another scale within that same test (applicable to forced-choice techniques)
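The cumulative and ipsative models above can be contrasted in a short sketch; the scale names and scores are hypothetical.

```python
def cumulative_score(item_scores):
    """Cumulative scoring: sum the credit earned across items;
    a higher total means more of the ability or trait measured."""
    return sum(item_scores)

def ipsative_ranking(scale_scores):
    """Ipsative interpretation: compare the testtaker's scales WITHIN
    the same test, e.g. rank them against one another rather than
    against other testtakers' scores."""
    return sorted(scale_scores, key=scale_scores.get, reverse=True)

print(cumulative_score([1, 0, 1, 1]))
print(ipsative_ranking({"achievement": 12, "affiliation": 18, "autonomy": 9}))
```

Note that an ipsative profile says which need is strongest *for this testtaker*, not how the testtaker compares with a norm group.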

SOME INITIAL GUIDELINES FOR WRITING ITEMS FOR ATTITUDE SCALES


- Define clearly what you want to measure. To do this, use substantive theory as a guide
and try to make items as specific as possible.
- Generate an item pool. Avoid redundant items. Three or four items per area would be
advisable.
- Avoid exceptionally long items. They are rarely good.
- Keep the level of reading difficulty appropriate for those who would use the test.
- Avoid “double barreled” items, which convey two or more ideas at the same time.
® For example, “I like people who are open to change because I am flexible”
consists of two different statements for which a person may agree: “I like people
who are open to change” and “I am flexible.”
- Consider mixing positively and negatively worded items. Sometimes, respondents
develop the “acquiescence response set.”
® This means that they will tend to agree with most items
® To avoid this bias, you can include some items worded in the opposite direction
® For example, “I felt depressed” may be mixed with “I am hopeful about the
future.”

THREE SUBSCALES

Cognitive
- Consists of items that address the cognitive domain of the scale, with items related to opinions, beliefs, thoughts, concept understanding, etc.
- Examples (for the Attitude towards Superstition Scale):
- “I believe that superstitions strengthen the Filipino cultural identity.”
- “There is nothing wrong with believing in superstitions.”
- “Superstitions lead to a more backward way of thinking.”

Affective
- Consists of items that involve one’s feelings and emotions, with items related to feelings, values, appreciation, enthusiasm, motivation, etc.
- Examples (for the Attitude towards Superstition Scale):
- “I feel weird when I am with people who subscribe to superstitions.”
- “Believing in superstitions gives me a sense of security.”
- “I am irritated when people defend their superstitious beliefs.”

Behavioral
- Consists of items that are more action-oriented, with items related to behaviors, practices, actions, etc.
- Examples (for the Attitude towards Superstition Scale):
- “I avoid continuing my journey when a black cat crosses my path.”
- “Superstitions lead me to avoid doing certain things.”
- “Going straight home after visiting a wake is not a problem for me.”

TEST TRYOUT
a. This stage involves administering the test to a sample that is similar to the intended
testtakers
b. The sample should include no fewer than five, and perhaps as many as ten, testtakers for each item on the test
c. Administration should be under standardized conditions
® Example: Similar to the setting and conditions when the actual test will be
administered, including instructions, time limits, etc.
Item analysis
1. Item Difficulty Index
- What proportion of testtakers answered each item correctly?
- Applies mainly to ability tests
- In personality/attitude scales, the proper term = “Item Endorsement Index”
® Proportion of people who indicated their agreement/endorsement for every item
- The Item Difficulty Index is computed per item, and an “average difficulty index” is
determined by averaging the item difficulty indices for all the items
- For maximal discrimination of the ability of testtakers, the optimal average item
difficulty should be approximately .5 (item difficulty indices ranging from .3 to .8)
® 0.8 is the easier end of that range
- The effect of guesswork must be factored in when computing optimal level of difficulty
- The suggested formula is: (probability of getting the answer correct by guesswork + 1)/2
® For example, in a five-option multiple choice test, the optimal level of difficulty
is: (0.20 + 1.00) / 2 = 0.60
- The index measures how easy the item is: the higher the proportion, the easier the item
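The difficulty index and the guessing adjustment above can be sketched directly; the response data are made up for illustration.

```python
# Hypothetical responses to one item (1 = correct, 0 = incorrect), one per testtaker.
responses = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]

def item_difficulty(responses):
    """Item Difficulty Index: proportion of testtakers answering correctly."""
    return sum(responses) / len(responses)

def optimal_difficulty(n_options):
    """Optimal difficulty adjusted for guesswork on an n-option item:
    (probability of guessing correctly + 1) / 2."""
    return (1 / n_options + 1) / 2

print(item_difficulty(responses))   # 7 of 10 correct
print(optimal_difficulty(5))        # the five-option example from the text
```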
2. Item Reliability Index
- Provides an indication of the internal consistency of a test
- The higher this index, the greater the test’s internal consistency
3. Item Validity Index
- Provides an indication of the degree to which a test is measuring what it purports to
measure
- The higher the item-validity index, the greater the test’s criterion- related validity
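One common textbook formulation computes both indices the same way: the item's standard deviation multiplied by the correlation between item scores and an anchor, where the anchor is the total test score for the item-reliability index and an external criterion for the item-validity index. The sketch below assumes that formulation, with made-up data.

```python
import statistics

def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    mx, my = statistics.mean(x), statistics.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def item_index(item_scores, anchor_scores):
    """s_item * r(item, anchor): pass total scores for the reliability
    index, or external criterion scores for the validity index."""
    return statistics.pstdev(item_scores) * pearson_r(item_scores, anchor_scores)

item = [1, 0, 1, 1, 0]          # one item's scores across five testtakers
total = [28, 15, 25, 30, 18]    # the same testtakers' total test scores
print(round(item_index(item, total), 3))
```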
4. Item Discrimination Index
- Measure of the difference between the proportion of high scorers answering an item
correctly and the proportion of low scorers answering the item correctly
- The higher the value of d, the better the item discriminates: proportionally more high
scorers than low scorers answer it correctly
® Negative d-value on a particular item is a red flag because it indicates that low-
scoring examinees are more likely to answer the item correctly than high-scoring
examinees
® This situation calls for some action such as revising or eliminating the item
- Item Characteristic Curves = provide a graphic representation of item difficulty and
discrimination
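The discrimination index can be sketched as the difference between the two groups' pass rates; the upper- and lower-group responses below are hypothetical.

```python
# Hypothetical responses to one item (1 = correct, 0 = incorrect)
# from the highest- and lowest-scoring groups on the whole test.
upper_group = [1, 1, 1, 0, 1]
lower_group = [0, 0, 1, 0, 0]

def discrimination_index(upper, lower):
    """d = proportion correct in the upper group minus the lower group.
    Positive d is desirable; negative d flags an item for revision
    or elimination, since low scorers outperform high scorers on it."""
    p_upper = sum(upper) / len(upper)
    p_lower = sum(lower) / len(lower)
    return p_upper - p_lower

d = discrimination_index(upper_group, lower_group)
print(d)
```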

Test revision
- This stage can either be (1) the last stage in developing a new test or (2) a part of the life
cycle of an existing test
- As the last stage in developing a new test, test revision entails finalizing the test after
considering the outcomes from the four earlier stages
- The end result is coming up with a better version of the test than when initially designed
- After being in use for several years, an existing test can be due for revision when any of
the following conditions exist:
a. The stimulus materials look dated and current testtakers cannot relate to them
b. The verbal content of the test, including the administration instructions and the test items,
contains dated vocabulary that is not readily understood by current testtakers
c. As popular culture changes and words take on new meanings, certain words or
expressions in the test items or directions may be perceived as inappropriate or even
offensive to a particular group and must therefore be changed
d. The test norms are no longer adequate as a result of group membership changes in the
population of potential testtakers
e. The test norms are no longer adequate as a result of age-related shifts in the abilities
measured over time, and so an age extension of the norms (upward, downward, or in both
directions) is necessary
f. The reliability or the validity of the test, as well as the effectiveness of individual test
items, can be significantly improved by a revision
g. The theory on which the test was originally based has been improved significantly, and
these changes should be reflected in the design and content of the test

- Normally, initial validity results are high because the sample chosen (on whom validation
was performed) closely resembles the intended test takers
- The second sample used for the cross-validation procedure is not expected to be as
similar
® This leads to a phenomenon called validity shrinkage – a decrease in validity that
occurs as a result of cross-validation

CROSS validation
- The revalidation of a test on a sample of testtakers other than those on
whom test performance was originally found to be a valid predictor of
some criterion

Co-validation
- A test-validation process conducted on two or more tests using the same
sample of testtakers
- When used in conjunction with the creation of norms or the revision of
existing norms, this process is also called co-norming

Take note

- There are two types of tests, according to the attribute measured: ability tests
(assessment of intelligence, aptitude, achievement) and personality tests (measures of
personality, interests, attitudes)
- There are item formats that are more appropriate for ability tests, as well as item formats
more suitable for personality and attitude measures
- Test blueprint = plan for the test
