PSYCHOLOGICAL TEST
DEVELOPMENT - Jeyi’s Notes
FOR OFFICIAL COMMUNICATION: 08069690241 | kosenjoseph67@gmail.com
COURSE DESCRIPTION:
This course introduces the student to the principles of test development and
evaluation: conceptualization, test construction strategies, test tryout and analysis,
use and scope. Students are expected to carry out a test development project.
COURSE REQUIREMENTS:
Each student is required to:
2. Partake in continuous assessment, made up of two (2) tests, one (1) test
development project and several home tasks after each lecture (INDIVIDUAL AND
GROUP)
COURSE OBJECTIVES:
COURSE OUTLINE
MODULE ONE: INTRODUCTION TO PSYCHOLOGICAL
TESTS
A. Item-Difficulty index
B. Item-Validity index
C. Item-Discrimination index
D. Item-Characteristic curves: graphic representation of item difficulty and
discrimination
E. Other considerations in item analysis
MODULE ONE:
Meaning of Test
A test is a measurement device or procedure.
1. CONTENT:
2. FORMAT:
The form, plan, structure, arrangement/layout of the test items, time limits of
the test, and manner of administration (paper and pen, computerized, inkblot
images) are important considerations. When the test is computerized, the
format refers to whether it is IBM or Apple compatible. Form and structure
also apply to other evaluative tools and processes, such as interviews or task
performance assessments.
4. ADMINISTRATION PROCEDURE:
Some tests require special conditions for administration, such as tests for
clinical or field settings.
5. TECHNICAL & PSYCHOMETRIC PROPERTIES
1. LOGICAL STANDPOINT
A good test…
2. TECHNICAL STANDPOINT
READING ASSIGNMENT
Discuss the psychological test (15 marks).
ANSWER
There are several types of psychological tests, each serving different purposes and
employing different methodologies:
1. Personality Tests: These tests aim to measure enduring traits,
tendencies, and behaviors that characterize an individual's personality.
Examples include the Myers-Briggs Type Indicator (MBTI), the Big Five
Personality Inventory, and the Minnesota Multiphasic Personality
Inventory (MMPI).
2. Intelligence Tests: Also known as IQ tests, these assessments measure
cognitive abilities such as reasoning, problem-solving, and verbal and
nonverbal skills. The most well-known intelligence test is the Wechsler
Adult Intelligence Scale (WAIS) for adults and the Wechsler Intelligence
Scale for Children (WISC) for children.
3. Neuropsychological Tests: These assessments evaluate cognitive
functions such as memory, attention, language, and executive
functioning. They are often used in clinical settings to diagnose
neurological conditions or brain injuries and to monitor cognitive
changes over time.
4. Behavioral Assessments: These tests focus on observing and
measuring specific behaviors or behavioral patterns. They may be used
in clinical psychology to diagnose disorders such as autism spectrum
disorder or attention-deficit/hyperactivity disorder (ADHD).
5. Projective Tests: These tests involve presenting ambiguous stimuli,
such as inkblots or incomplete sentences, and asking individuals to
interpret them. The responses are believed to reveal unconscious
thoughts, feelings, and conflicts. The Rorschach Inkblot Test and the
Thematic Apperception Test (TAT) are examples of projective tests.
Reliability
(Freeman 1962)
Reliability has two closely related but different meanings in psychological testing:
a. Internal consistency
b. Consistent results upon testing and retesting
Oxford
Reliability is the consistency of scores obtained by the same persons when they are re-examined with the same test on different occasions, or with different sets of equivalent items, or under varying conditions. In psychometry, the concept of reliability is used to cover several aspects of score consistency.
X = T + E, where X = observed/total score, T = true score, and E = chance error.
Reliability in its broad sense indicates the extent to which individual differences in test scores are attributable to true differences in the characteristics being measured, and the extent to which they are attributable to chance errors. Error is part of all measurement in psychological testing.
(Variance = Standard Deviation²)
Variance is a useful statistic for describing test score reliability. It can be broken into components:
a. True variance
b. Error variance (variance from irrelevant, random sources)
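As an illustration (not part of the original notes), a minimal Python sketch of the X = T + E model: simulated true scores and random errors are added, and the observed-score variance splits into true variance plus error variance, with reliability as the true share.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 10_000                      # simulated test takers
    true = rng.normal(50, 10, n)    # T: true scores
    error = rng.normal(0, 5, n)     # E: chance errors, independent of T
    observed = true + error         # X = T + E

    # With independent T and E, Var(X) = Var(T) + Var(E).
    print(observed.var(), true.var() + error.var())
    # Reliability = true variance / total (observed) variance, here ~0.8.
    print(true.var() / observed.var())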
Sources of Errors
1. Test Construction:
Errors are likely to arise at this stage, in the form of content and item sampling error. Content sampling refers to the variation among items within a test as well as the variation among items between tests.
2. Test Administration:
There are several factors that are sources of variance error when administering
a test.
a. Test Environment:
Temperature, ventilation, noise, lighting, seating arrangement.
b. Test Taker:
Anxiety, fatigue, illness, drugs/medication, presence of emotional
problems.
c. Test Examiner:
Presence/absence of examiner,
1. Test-Retest.
Weaknesses
2. Split-Half.
Only one session is used here. Two scores are obtained for each person by dividing the test into two equivalent halves. The most common procedure is to find each person's score on the odd-numbered items and on the even-numbered items of the test. Generally, the longer the test, the more reliable it is.
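A minimal sketch of the split-half procedure (illustrative, with simulated data): the test is split into odd and even items, the two half-scores are correlated, and the Spearman-Brown formula, the standard correction, estimates the reliability of the full-length test from the half-test correlation.

    import numpy as np

    rng = np.random.default_rng(1)
    # Simulated 0/1 responses: 200 test takers x 12 items sharing one ability.
    ability = rng.normal(0, 1, (200, 1))
    scores = (ability + rng.normal(0, 1, (200, 12)) > 0).astype(int)

    odd = scores[:, 0::2].sum(axis=1)    # score on items 1, 3, 5, ...
    even = scores[:, 1::2].sum(axis=1)   # score on items 2, 4, 6, ...

    r_half = np.corrcoef(odd, even)[0, 1]   # reliability of the half-tests
    r_full = 2 * r_half / (1 + r_half)      # Spearman-Brown full-length estimate
    print(r_half, r_full)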
3. Alternate/Parallel Forms.
If you have ever taken a makeup test or exam, you will have noticed that the questions were not all the same as those on the first test; it was an alternate form of the test. The terms alternate and parallel are used interchangeably but are technically not the same. In an alternate-form test, the items in the two forms are not all the same. This method examines the degree of relationship between the various forms of a test, and the resulting coefficient is often called the coefficient of equivalence. Parallel forms give a useful estimate of reliability, and as in test-retest, the interval between administrations of the parallel forms has to be stated in the manual.
Advantages
a. Parallel forms estimate is much more widely applicable for use than
test-retest
Limitations
a. Alternate forms estimates reduce practice effect but do not eliminate it.
b. Reduce but do not eliminate the issue of the nature of the test changing with repetition.
c. Alternate forms are unavailable for many tests because of the practical
difficulties of constructing a truly equivalent test.
4. Kuder-Richardson
The Kuder-Richardson Formula 20 (KR-20) is a statistical measure used to assess the internal consistency or reliability of a test that consists of items with dichotomous (yes/no or true/false) responses. It is named after its developers, G. Frederick Kuder and M. W. Richardson.
FURTHER EXPLANATION
For each item, calculate p, the proportion of test takers who answered the item correctly, and q = 1 − p, the proportion who answered it incorrectly. Sum the products pq across all items and compare the sum with the variance of the total test scores:

KR-20 = (k / (k − 1)) × (1 − Σpq / σ²)

Where:
- k = the number of items on the test
- p = the proportion of test takers answering a given item correctly
- q = 1 − p, the proportion answering it incorrectly
- σ² = the variance of the total test scores
The resulting KR-20 value ranges from 0 to 1, with higher values indicating greater internal consistency or reliability of the test.
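A short Python sketch of the formula above (a hand-rolled illustration, not a library routine):

    import numpy as np

    def kr20(responses):
        # responses: rows = test takers, columns = dichotomous (0/1) items.
        responses = np.asarray(responses, dtype=float)
        k = responses.shape[1]
        p = responses.mean(axis=0)                     # proportion passing each item
        q = 1 - p
        var_total = responses.sum(axis=1).var(ddof=1)  # variance of total scores
        return (k / (k - 1)) * (1 - (p * q).sum() / var_total)

    data = [[1, 0, 1, 1], [1, 1, 1, 0], [0, 0, 1, 0], [1, 1, 1, 1]]
    print(kr20(data))   # roughly 0.75 for this toy data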
5. Cronbach's alpha
Advantages
a. Assessment of Reliability:
b. Ease of Interpretation:
Disadvantages
e. Assumes Unidimensionality:
g. Sample Dependence:
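For comparison, a minimal sketch of Cronbach's alpha, which generalizes KR-20 to items that are not dichotomous (e.g., Likert ratings); the formula replaces the sum of pq with the sum of per-item variances:

    import numpy as np

    def cronbach_alpha(items):
        # items: rows = test takers, columns = item scores (any numeric scale).
        items = np.asarray(items, dtype=float)
        k = items.shape[1]
        sum_item_vars = items.var(axis=0, ddof=1).sum()   # sum of item variances
        total_var = items.sum(axis=1).var(ddof=1)         # variance of total scores
        return (k / (k - 1)) * (1 - sum_item_vars / total_var)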
PSYCHOMETRY - VALIDITY
Psychologists, clinicians, industrial-organizational (I-O) psychologists, educators, counselors, forensic psychologists, the military, etc. use test results for various purposes.
The clinician may use a test for the purpose of arriving at a diagnosis or evaluating the outcome of an intervention. Similarly, the I-O psychologist may use tests or test results for the purposes of personnel selection, screening, promotion, etc.
Also, the forensic psychologist may use test results for the purpose of evaluating competency to stand trial. None of these purposes can be achieved, even partially, if the psychological instrument does not have a high degree of validity.
WHAT IS VALIDITY?
Freeman (1965) says that an index of validity shows the degree to which a test measures what it proposes to measure when compared to its accepted criterion.
Who decides? The expert. The construction and use of a test mean that the test has been evaluated against criteria regarded by experts as the best evidence of the trait measured by the test.
Some circumstances that may warrant the test user providing evidence of validity include:
1. When the test user plans to alter in some way the format, instructions, language, or content of the test, for instance translating a test into Braille (for blind test takers).
2. When the test will be used with a population of test takers that differs in some significant way from the population on which the test was standardized.
1. Content Validity
2. Criterion-related validity
3. Construct validity
There are other forms of validity. Cohen and Swerdlik (2002) note that “Predictive Validity” and “Concurrent Validity” are forms of validity collapsed under the general category of criterion-related validity.
Anastasi and Urbina (2007) described another form of validity called Face Validity.
The three-category taxonomy of validity is the idea that the validity of a test may be evaluated by gathering evidence of its content validity, its criterion-related validity, and its construct validity.
According to Cohen and Swerdlik (2002), these three approaches to validity are not mutually exclusive (they should not be taken as separate entities). Each should be thought of as one type of evidence; in other words, all of them together form a unified picture of validity.
1. FACE VALIDITY:
According to Anastasi and Urbina, face validity is not the same as content
validity. Technically speaking, face validity is not validity, as it does not refer to what the test actually measures but to what the test superficially appears to measure.
Face validity is concerned with things that make the test “look valid” to both
the test takers and test users.
2. CONTENT VALIDITY:
Lawshe (1975) developed one of the most widely used methods of measuring content validity: it measures agreement among raters or judges on how essential a particular item is. Each judge/rater rates each item as one of:
- Essential
- Useful but not essential (important)
- Not necessary (not essential)
For each item, the content validity ratio (CVR) is then computed as:

CVR = (ne − N/2) / (N/2)

Where:
- ne = number of panelists indicating “essential”
- N = total number of panelists (judges or raters)
- If CVR = .62 and above, retain the item (.62 is Lawshe's critical value for a panel of about ten raters).
● I feel lonely
● I feel guilty
● I feel edgy
If more than half of the judges indicate that an item is essential, that item at least
has some content validity.
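A one-line Python rendering of Lawshe's CVR, with a hypothetical panel of ten judges:

    def cvr(n_essential, n_panelists):
        # Lawshe's content validity ratio for a single item.
        return (n_essential - n_panelists / 2) / (n_panelists / 2)

    # e.g., 9 of 10 judges rate "I feel lonely" as essential:
    print(cvr(9, 10))   # 0.8, above the .62 cut-off, so retain the item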
3. CRITERION-RELATED VALIDITY
Cohen and Swerdlik (2004) described two types of validity evidence which are subsumed under criterion-related validity.
a. Concurrent Validity
If test scores are obtained at about the same time that criterion measures are obtained, measures of the relationship between the test scores and the criterion provide evidence of concurrent validity.
Sometimes we need to know how well Test A compares to Test B. In this case, Test B is used as the validating criterion.
i. Concurrent validity of Test A.
b. Predictive Validity
4. CONSTRUCT VALIDITY
Construct validity has been viewed as the unifying concept for all validity evidence. All types of validity evidence (content, criterion-related) are types of construct validity. Content and criterion-related validity coefficients have a bearing on the construct validity of a test.
There are several procedures used to provide different kinds of evidence for construct
validity. These procedures include:
DISCRIMINANT AND CONVERGENT EVIDENCE
DISCRIMINANT EVIDENCE
Discriminant evidence is a form of construct validity (discriminant validity). It is simply a validity coefficient showing little (statistically insignificant) relationship between test scores and other variables with which scores on the test being constructed (i.e., construct-validated) should not theoretically correlate.
CONVERGENT EVIDENCE
The evidence of construct validity of a particular test may converge from a number of sources, particularly from other tests or measures designed to assess the same or a similar construct. Here, if scores on the test being construct-validated tend to correlate highly, in the predicted direction, with scores on other, more established and already validated tests designed to measure the same or a similar construct, this would be an example of convergent evidence.
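A simulated illustration (hypothetical measures, not from the notes): a new anxiety test should correlate highly with an established anxiety measure (convergent evidence) and near zero with an unrelated measure such as a vocabulary test (discriminant evidence).

    import numpy as np

    rng = np.random.default_rng(2)
    n = 150
    anxiety = rng.normal(0, 1, n)                  # the construct being measured
    new_test = anxiety + rng.normal(0, 0.5, n)     # test being construct-validated
    established = anxiety + rng.normal(0, 0.5, n)  # already validated anxiety measure
    vocabulary = rng.normal(0, 1, n)               # theoretically unrelated construct

    print(np.corrcoef(new_test, established)[0, 1])  # high -> convergent evidence
    print(np.corrcoef(new_test, vocabulary)[0, 1])   # near zero -> discriminant evidence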
FACTOR ANALYSIS
E.g., personality is a construct, and it has various attributes, characteristics, or dimensions. Depending on which characteristics we are looking at, there are different dimensions (factors).
STUDY QUESTIONS:
TEST DEVELOPMENT
This is the process by which tests are constructed.
○ Conceptualisation
○ Construction
○ Test Tryouts
○ Item Analysis
○ Test revision
Test Conceptualisation
The first stage in the process starts with our thoughts, which arise from our experiences. According to cognitive psychologists, thoughts are "self-talk"; in behavioural terms, thoughts are the probable source of all tests ever published. The test developer thinks, "there ought to be a test that measures something." Often, there is a stimulus for such a thought. That stimulus could be:
9. What special training will be required of test users for the administration and
interpretation of the test?
- What background and qualifications will a prospective user of data
derived from an administration of the test need to have?
- What restrictions, if any, should be placed on the distribution of the
test and on the test user?
12. Is there any potential for harm as a result of the administration of the test?
- What safeguards are built into the recommended testing procedure to
prevent any sort of harm to any of the parties involved in the use of this
test?
- Who are the parties in the test? Test developer, publisher, taker, and the
society. Sometimes, the test developer may also be the test publisher.
NORM-REFERENCED
Anastasi and Urbina (2007) note that a norm-referenced test is one in which an individual's score is interpreted by comparing it with the scores obtained by others on the same test. An example of a norm-referenced test is the Eysenck Personality Questionnaire (EPQ). The EPQ has two versions:
a. EPQ Adult
b. EPQ Children
This means that there are norms for the children and different norms for the adults.
CRITERION-REFERENCED
The criterion-referenced test was first proposed by Glaser (1963). The term is defined loosely, and apparently differently, by different writers, and it goes by other names such as domain-referenced tests and content-referenced tests.
Score   Grade   Grade Point   Remark
44      F       0             Fail
59      C       3             Pass
69      B       4             Pass
79      A       5             Pass
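A criterion-referenced interpretation can be sketched as a simple cut-off function: each score is judged against a fixed standard, not against other test takers. The cut-offs below are assumed from the table above (treating each listed score as the upper bound of its band):

    def grade(score):
        # Fixed, criterion-referenced cut-offs (assumed from the table above).
        if score <= 44:
            return "F", 0, "Fail"
        elif score <= 59:
            return "C", 3, "Pass"
        elif score <= 69:
            return "B", 4, "Pass"
        else:
            return "A", 5, "Pass"

    print(grade(63))   # ('B', 4, 'Pass'), regardless of how others scored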
PILOT WORK
In pilot work, test items are pilot-studied to evaluate whether they should be in the final form of the instrument. Pilot work requires the creation, revision, and deletion of many items. It also involves literature reviews and experimentation. Once the pilot study has been completed, the process of test construction begins.
Test Construction
Test construction is the second phase or stage in the test development process.
Constructing a test involves three major things:
EXPLANATION
1. Scaling:
A fourth way in which scales may be classified is when raw scores are transformed into standard scores that range from 1 to 9 (stanines).
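A sketch of one common way to compute stanines (a simplification: real stanine tables often use fixed percentile bands rather than this linear rescaling):

    import numpy as np

    def to_stanine(raw_scores):
        # Rescale z-scores to mean 5, SD 2, rounded and clipped to 1..9.
        raw = np.asarray(raw_scores, dtype=float)
        z = (raw - raw.mean()) / raw.std(ddof=1)
        return np.clip(np.round(2 * z + 5), 1, 9).astype(int)

    print(to_stanine([10, 12, 15, 18, 20, 22, 25, 28, 30, 35]))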
SCALING METHODS
There are many scaling methods; the choice of which scaling method to use as a test developer depends on many factors.
RATING SCALES
Rating scales are the most commonly used method of scaling. A rating scale may be defined as a grouping of words, statements, or symbols on which judgments concerning the strength of a particular trait, attitude, or emotion are indicated by the test taker or examiner. Rating scales can be used to record judgments of one's self, of others, of experiences, or of objects.
c. A rating scale may take a pictorial form of sad, happy, confused faces.
In rating scales, the test taker rates every item. E.g., the DASS-21 has 21 items; the test taker rates each item (of the 21), and the scores are added together to obtain subscale and final scores.
Because the final test result is obtained by summing the ratings across all the items, such a scale is called a summative scale. One example of a summative scale is the Likert scale. The Likert (1932) scale is used extensively in psychology, and Likert scales are relatively easy to construct. In a Likert scale, the test taker is presented with five alternative responses, usually on an agree/disagree or approve/disapprove continuum (e.g., strongly agree, agree, undecided, disagree, strongly disagree).
Likert scales are used because they are usually reliable. The classic Likert scale has five response options; at times you might encounter a "Likert-type scale," which may have five or more options.
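Summative (Likert) scoring in miniature, with hypothetical 5-point items; reverse-keyed items are commonly rescored as 6 minus the rating before summing:

    responses = [5, 4, 2, 5, 3]                        # one test taker, five items
    reverse_keyed = [False, False, True, False, False]

    total = sum((6 - r) if rev else r
                for r, rev in zip(responses, reverse_keyed))
    print(total)   # summative scale score: 5 + 4 + 4 + 5 + 3 = 21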
PAIR COMPARISON SCALING METHODS
In this scaling method, the test taker is presented with pairs of stimuli. The pair of stimuli may be two pictures, two options, two objects, or two statements, which they are asked to compare. They must select one of the stimuli based on some rules. The rules are:
3. Which is heavier?
a. A kg of cotton
b. A kg of tin
For each pair of options, test takers receive a higher score if the option they selected was deemed more justifiable (more appealing) by the majority of judges. The judges must be asked to rate the pairs of options before administration of the test, and the list of options they selected is provided with the scoring instructions as the answer key.
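A toy sketch of paired-comparison scoring: the test taker earns a point for each pair where their choice matches the option the judges keyed in advance (pair labels and options here are hypothetical):

    answer_key = {1: "a", 2: "b", 3: "a"}   # option keyed by the judges, per pair
    choices = {1: "a", 2: "a", 3: "a"}      # one test taker's selections

    score = sum(choices[p] == answer_key[p] for p in answer_key)
    print(score)   # 2 of the 3 choices matched the judges' key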
SORTING TASK
Comparative Scaling
Categorical Scaling
a. All people should have the right to decide whether they wish to end their
lives.
b. People who are terminally ill and in pain should have the option of
having a doctor who will assist them to end their lives.
c. People should have the option of signing away the use of
artificial life-support systems when they become seriously ill.
d. People have a right to a comfortable life.
Look at the statements above carefully: they are arranged in sequence from the most extreme position to the least extreme. People who agree with the most extreme position (a) should also agree with (b), (c), and (d). This cumulative pattern is characteristic of a Guttman scale.
Another scaling method was first described in 1929: the equal-appearing interval (Thurstone) scale, which yields interval-level data.
1. What range of content should the items cover, and which of the many
different item formats should be employed?
2. Should I use a selected-response format (e.g., multiple-choice, matching)
or a constructed-response format?
3. How many items should be written?
Under matching
Completion Item
Short Answer
The items have to be well and clearly written; there are no shortcuts.
Essay
An essay may have many paragraphs. E.g., I might ask you to compare and
contrast the techniques and definitions of classical and operant conditioning,
to compare the selected-response formats of item writing, or to discuss how
norm-referenced and domain-referenced tests differ.
SCORING ITEMS
1. Cumulative scoring
2. Class/category scoring
3. Ipsative scoring
Cumulative
This is the commonest method of scoring. It is simple, and its logic is simple as well. The rule in a cumulative scoring test is that the higher the score on the test, the higher the test taker stands on the ability, trait, or other characteristic the test purports to measure. Examples of cumulative scoring tests are the DASS-21 and the PTSD Checklist for DSM-5 (PCL-5).
Class/Category Scoring:
Test Tryout
Once we have generated a pool of items that will probably be included, the next thing is to try out the items in order to find out which of the items are good. The test should be tried out on people similar to those for whom the test was intended. For example, if the test was designed to assess the performance of employees, it should be tried out on employees of an organization. Similarly, if the test is designed for children, it should be tried out on children; if it is for adults, it should be tried out on adults. Another consideration is culture: if the test is culture-specific, it should be tried out on people who share that culture.
The main objective of trying out a test is for the test developer to identify the items that are reliable and valid. A good item should be able to discriminate among test takers: a good item is one that test takers who score high on the test as a whole tend to get right, while those who score low tend to get wrong.
In a nutshell, if everyone passes or everybody fails then it's not a good test.
Test Analysis
For a test user to be able to evaluate the quality of a published test, they must have
reasonable knowledge of the basic concepts or techniques of item analysis. This is
because item analysis itself is relevant and an integral part of test development.
It enables test developers to shorten the test and at the same time increase its reliability and validity. Generally, a longer test is more reliable and valid than a short test. However, when a test is shortened by eliminating the least satisfactory items, the shortened test can be more valid and reliable than the longer one.
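Tying this to the item-analysis indices listed in the course outline, a sketch of the item-difficulty index (p, the proportion passing an item) and a discrimination index computed as the corrected item-total correlation (illustrative, for 0/1-scored items):

    import numpy as np

    def item_analysis(responses):
        # responses: rows = test takers, columns = 0/1 item scores.
        responses = np.asarray(responses, dtype=float)
        total = responses.sum(axis=1)
        difficulty = responses.mean(axis=0)   # p near 1 = easy, near 0 = hard
        discrimination = np.array([
            # correlate each item with the total score excluding that item
            np.corrcoef(responses[:, i], total - responses[:, i])[0, 1]
            for i in range(responses.shape[1])
        ])
        return difficulty, discrimination

Items that everyone passes or everyone fails (p near 1 or 0) cannot discriminate, which is the point made above: if everyone passes or everybody fails, it is not a good test.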
Test Revision
1. We carry out item analysis as an integral part of test development to determine which items are good and which are bad. Discuss.
2. Having successfully taken lectures in Psy 314, 300-level students in Psychology decided to develop a test to measure exam anxiety. They decided to name the scale the Student Examination Anxiety Scale (SEAS); they have generated an item pool of 400. What remains is a test tryout.
a. What does carrying out this test tryout mean?
b. What are the objectives of the test tryout?
c. What questions will you ask and answer while carrying out the tryout?
d. Which group of items are you likely to select from the item pool, and which ones are you likely to eliminate?
e. SEAS uses a multiple-choice item format; how many items does it have?
3. Test construction is a crucial process in test development. With a firm grasp of your course lecturer's predisposition on this issue, discuss confidently and generously, with examples, the claim that "a test is born in the mind" cannot be contested.
ASSIGNMENT
Think carefully and develop a test to measure any construct of your choice. You must follow the test development process, that is, test conceptualization, test construction, test tryout, test analysis, and test revision. Highlight the literature review and the theories that explain the construct, and state which theorist developed the construct.