PSYCHOLOGICAL TEST
DEVELOPMENT - Jeyi’s Notes
FOR OFFICIAL COMMUNICATION: 08069690241 | kosenjoseph67@gmail.com
COURSE DESCRIPTION:
This course introduces the student to the principles of test development and
evaluation: conceptualization, test construction strategies, test tryout and analysis,
use and scope. Students are expected to carry out a test development project.
COURSE REQUIREMENTS:
Each student is required to:
2. Partake in continuous assessment, made up of two (2) tests, one (1) test
development project and several home tasks after each lecture (INDIVIDUAL AND
GROUP)
COURSE OBJECTIVES:
COURSE OUTLINE
MODULE ONE: INTRODUCTION TO PSYCHOLOGICAL
TESTS
A. Item-Difficulty index
B. Item-Validity index
C. Item-Discrimination index
D. Item-Characteristic curves: graphic representation of item difficulty and
discrimination
E. Other considerations in item analysis
MODULE ONE:
Meaning of Test
A test is a measurement device or procedure.
1. CONTENT:
2. FORMAT:
The form, plan, structure, arrangement/layout of the test items, time limits of
the test, and manner of administration (paper and pen, computerized, inkblot
images) are important considerations. When the test is computerized, the
format refers to whether it is IBM or Apple compatible. Form and structure
also apply to other evaluative tools and processes, such as interviews or task
performance assessments.
4. ADMINISTRATION PROCEDURE:
Some tests require special conditions for administration, such as tests for
clinical or field settings.
5. TECHNICAL & PSYCHOMETRIC PROPERTIES
1. LOGICAL STANDPOINT
A good test…
2. TECHNICAL STANDPOINT
READING ASSIGNMENT
Discuss the psychological test (15 marks).
ANSWER
There are several types of psychological tests, each serving different purposes and
employing different methodologies:
1. Personality Tests: These tests aim to measure enduring traits,
tendencies, and behaviors that characterize an individual's personality.
Examples include the Myers-Briggs Type Indicator (MBTI), the Big Five
Personality Inventory, and the Minnesota Multiphasic Personality
Inventory (MMPI).
2. Intelligence Tests: Also known as IQ tests, these assessments measure
cognitive abilities such as reasoning, problem-solving, and verbal and
nonverbal skills. The most well-known intelligence test is the Wechsler
Adult Intelligence Scale (WAIS) for adults and the Wechsler Intelligence
Scale for Children (WISC) for children.
3. Neuropsychological Tests: These assessments evaluate cognitive
functions such as memory, attention, language, and executive
functioning. They are often used in clinical settings to diagnose
neurological conditions or brain injuries and to monitor cognitive
changes over time.
4. Behavioral Assessments: These tests focus on observing and
measuring specific behaviors or behavioral patterns. They may be used
in clinical psychology to diagnose disorders such as autism spectrum
disorder or attention-deficit/hyperactivity disorder (ADHD).
5. Projective Tests: These tests involve presenting ambiguous stimuli,
such as inkblots or incomplete sentences, and asking individuals to
interpret them. The responses are believed to reveal unconscious
thoughts, feelings, and conflicts. The Rorschach Inkblot Test and the
Thematic Apperception Test (TAT) are examples of projective tests.
Reliability
(Freeman 1962)
Reliability has two closely related but different meanings in psychological testing:
a. Internal consistency
b. Consistent results upon testing and retesting
Oxford
Reliability is the consistency of scores obtained by the same persons when they are re-examined with the same test on different occasions, or with different sets of equivalent items, or under varying conditions. In psychometry, the concept of reliability is used to cover several aspects of score consistency.
X = T + E, where X = observed/total score, T = true score, and E = chance error.
Reliability in its broad sense indicates the extent to which individual differences in test scores are attributable to true differences in the characteristics being measured, and the extent to which they are attributable to chance errors. Error is part of all measurement in psychological testing.
(Variance = Standard Deviation²)
Variance is a useful statistic for describing test score reliability. It can be broken into components:
a. True variance
b. Error variance (variance from irrelevant, random sources)
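As an illustration (not part of the original notes), a minimal Python sketch of the X = T + E model: simulated true scores and random errors are added, and the observed-score variance splits into true variance plus error variance, with reliability as the true share.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 10_000                      # simulated test takers
    true = rng.normal(50, 10, n)    # T: true scores
    error = rng.normal(0, 5, n)     # E: chance errors, independent of T
    observed = true + error         # X = T + E

    # With independent T and E, Var(X) = Var(T) + Var(E).
    print(observed.var(), true.var() + error.var())
    # Reliability = true variance / total (observed) variance, here ~0.8.
    print(true.var() / observed.var())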
Sources of Errors
1. Test Construction:
Errors are likely to arise at this stage, in the form of content and item sampling error. Content sampling refers to the variation among items within a test as well as the variation among items between tests.
2. Test Administration:
There are several factors that are sources of variance error when administering
a test.
a. Test Environment:
Temperature, ventilation, noise, lighting, seating arrangement.
b. Test Taker:
Anxiety, fatigue, illness, drugs/medication, presence of emotional
problems.
c. Test Examiner:
Presence/absence of examiner,
1. Test-Retest.
Weaknesses
2. Split-Half.
Only one session is used here. Two scores are obtained for each person by dividing the test into two equivalent halves. The most common procedure is to find each person's score on the odd-numbered items and on the even-numbered items of the test. Generally, the longer the test, the more reliable it is.
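A minimal sketch of the split-half procedure (illustrative, with simulated data): the test is split into odd and even items, the two half-scores are correlated, and the Spearman-Brown formula, the standard correction, estimates the reliability of the full-length test from the half-test correlation.

    import numpy as np

    rng = np.random.default_rng(1)
    # Simulated 0/1 responses: 200 test takers x 12 items sharing one ability.
    ability = rng.normal(0, 1, (200, 1))
    scores = (ability + rng.normal(0, 1, (200, 12)) > 0).astype(int)

    odd = scores[:, 0::2].sum(axis=1)    # score on items 1, 3, 5, ...
    even = scores[:, 1::2].sum(axis=1)   # score on items 2, 4, 6, ...

    r_half = np.corrcoef(odd, even)[0, 1]   # reliability of the half-tests
    r_full = 2 * r_half / (1 + r_half)      # Spearman-Brown full-length estimate
    print(r_half, r_full)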
3. Alternate/Parallel Forms.
If you have ever taken a makeup test or exam, you will have noticed that the questions were not all the same as those on the first test; it was an alternate form of the test. The terms alternate and parallel are used interchangeably but are technically not the same. In an alternate-form test, the items in the two forms are not all the same. This method examines the degree of relationship between the various forms of a test, and the resulting coefficient is often called the coefficient of equivalence. Parallel forms give a useful estimate of reliability, and as in test-retest, the interval between administrations of the parallel forms has to be stated in the manual.
Advantages
a. Parallel forms estimate is much more widely applicable for use than
test-retest
Limitations
a. Alternate forms estimates reduce practice effect but do not eliminate it.
b. Reduce but do not eliminate the issue of the nature of the test changing with repetition.
c. Alternate forms are unavailable for many tests because of the practical
difficulties of constructing a truly equivalent test.
4. Kuder-Richardson
The Kuder-Richardson Formula 20 (KR-20) is a statistical measure used to assess the internal consistency or reliability of a test that consists of items with dichotomous (yes/no or true/false) responses. It is named after its developers, G. Frederick Kuder and M. W. Richardson.
FURTHER EXPLANATION
For each item, calculate p, the proportion of test takers who answered the item correctly, and q = 1 − p, the proportion who answered it incorrectly. Sum the products pq across all items and compare the sum with the variance of the total test scores:

KR-20 = (k / (k − 1)) × (1 − Σpq / σ²)

Where:
- k = the number of items on the test
- p = the proportion of test takers answering a given item correctly
- q = 1 − p, the proportion answering it incorrectly
- σ² = the variance of the total test scores
The resulting KR-20 value ranges from 0 to 1, with higher values indicating greater internal consistency or reliability of the test.
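A short Python sketch of the formula above (a hand-rolled illustration, not a library routine):

    import numpy as np

    def kr20(responses):
        # responses: rows = test takers, columns = dichotomous (0/1) items.
        responses = np.asarray(responses, dtype=float)
        k = responses.shape[1]
        p = responses.mean(axis=0)                     # proportion passing each item
        q = 1 - p
        var_total = responses.sum(axis=1).var(ddof=1)  # variance of total scores
        return (k / (k - 1)) * (1 - (p * q).sum() / var_total)

    data = [[1, 0, 1, 1], [1, 1, 1, 0], [0, 0, 1, 0], [1, 1, 1, 1]]
    print(kr20(data))   # roughly 0.75 for this toy data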
5. Cronbach's alpha
Advantages
a. Assessment of Reliability:
b. Ease of Interpretation:
Disadvantages
e. Assumes Unidimensionality:
g. Sample Dependence:
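For comparison, a minimal sketch of Cronbach's alpha, which generalizes KR-20 to items that are not dichotomous (e.g., Likert ratings); the formula replaces the sum of pq with the sum of per-item variances:

    import numpy as np

    def cronbach_alpha(items):
        # items: rows = test takers, columns = item scores (any numeric scale).
        items = np.asarray(items, dtype=float)
        k = items.shape[1]
        sum_item_vars = items.var(axis=0, ddof=1).sum()   # sum of item variances
        total_var = items.sum(axis=1).var(ddof=1)         # variance of total scores
        return (k / (k - 1)) * (1 - sum_item_vars / total_var)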
PSYCHOMETRY - VALIDITY
Psychologists, clinicians, industrial-organizational (I-O) psychologists, educators, counselors, forensic psychologists, the military, etc. use test results for various purposes.
The clinician may use a test for the purpose of arriving at a diagnosis or evaluating the outcome of an intervention. Similarly, the I-O psychologist may use tests or test results for the purposes of personnel selection, screening, promotion, etc.
Also, the forensic psychologist may use test results for the purpose of evaluating competency to stand trial. None of these purposes can be achieved, even partially, if the psychological instrument does not have a high degree of validity.
WHAT IS VALIDITY?
Freeman (1965) says that an index of validity shows the degree to which a test measures what it proposes to measure when compared to its accepted criterion.
Who decides? The expert. The construction and use of a test mean that the test has been evaluated against criteria regarded by experts as the best evidence of the trait measured by the test.
Some circumstances that may warrant the test user providing evidence of validity include:
1. When the test user plans to alter in some way the format, instructions, language, or content of the test, for instance translating a test into Braille (for blind test takers).
2. When the test will be used with a population of test takers that differs in some significant way from the population on which the test was standardized.
1. Content Validity
2. Criterion-related validity
3. Construct validity
There are other forms of validity. Cohen and Swerdlik (2002) note that “Predictive Validity” and “Concurrent Validity” are forms of validity collapsed under the general category of criterion-related validity.
Anastasi and Urbina (2007) described another form of validity called Face Validity.
The three-category taxonomy of validity is the idea that the validity of a test may be evaluated by gathering evidence of its content validity, its criterion-related validity, and its construct validity.
According to Cohen and Swerdlik (2002), these three approaches to validity are not mutually exclusive (they should not be taken as separate entities). Each should be thought of as one type of evidence; in other words, all of them together form a unified picture of validity.
1. FACE VALIDITY:
According to Anastasi and Urbina, face validity is not the same as content
validity. Technically speaking, face validity is not validity, as it does not refer to what the test actually measures but to what the test superficially appears to measure.
Face validity is concerned with things that make the test “look valid” to both
the test takers and test users.
2. CONTENT VALIDITY:
Lawshe (1975) developed one of the most widely used methods of measuring content validity: it measures agreement among raters or judges on how essential a particular item is. Each judge/rater rates each item as one of:
- Essential
- Useful but not essential (important)
- Not necessary (not essential)
For each item, the content validity ratio (CVR) is then computed as:

CVR = (ne − N/2) / (N/2)

Where:
- ne = number of panelists indicating “essential”
- N = total number of panelists (judges or raters)
- If CVR = .62 and above, retain the item (.62 is Lawshe's critical value for a panel of about ten raters).
● I feel lonely
● I feel guilty
● I feel edgy
If more than half of the judges indicate that an item is essential, that item at least
has some content validity.
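A one-line Python rendering of Lawshe's CVR, with a hypothetical panel of ten judges:

    def cvr(n_essential, n_panelists):
        # Lawshe's content validity ratio for a single item.
        return (n_essential - n_panelists / 2) / (n_panelists / 2)

    # e.g., 9 of 10 judges rate "I feel lonely" as essential:
    print(cvr(9, 10))   # 0.8, above the .62 cut-off, so retain the item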
3. CRITERION-RELATED VALIDITY
Cohen and Swerdlik (2004) described two types of validity evidence which are subsumed under criterion-related validity.
a. Concurrent Validity
If test scores are obtained at about the same time that criterion measures are obtained, measures of the relationship between the test scores and the criterion provide evidence of concurrent validity.
Sometimes we need to know how well Test A compares to Test B. In this case, Test B is used as the validating criterion.
i. Concurrent validity of Test A.
b. Predictive Validity
4. CONSTRUCT VALIDITY
Construct validity has been viewed as the unifying concept for all validity evidence. All types of validity evidence (content, criterion-related) are types of construct validity. Content and criterion-related validity coefficients have a bearing on the construct validity of a test.
There are several procedures used to provide different kinds of evidence for construct
validity. These procedures include:
DISCRIMINANT AND CONVERGENT EVIDENCE
DISCRIMINANT EVIDENCE
Discriminant evidence is a form of construct validity (discriminant validity). It is simply a validity coefficient showing little (statistically insignificant) relationship between test scores and other variables with which scores on the test being constructed (i.e., construct-validated) should not theoretically correlate.
CONVERGENT EVIDENCE
The evidence of construct validity of a particular test may converge from a number of sources, particularly from other tests or measures designed to assess the same or a similar construct. Here, if scores on the test being construct-validated tend to correlate highly, in the predicted direction, with scores on other, more established and already validated tests designed to measure the same or a similar construct, this would be an example of convergent evidence.
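A simulated illustration (hypothetical measures, not from the notes): a new anxiety test should correlate highly with an established anxiety measure (convergent evidence) and near zero with an unrelated measure such as a vocabulary test (discriminant evidence).

    import numpy as np

    rng = np.random.default_rng(2)
    n = 150
    anxiety = rng.normal(0, 1, n)                  # the construct being measured
    new_test = anxiety + rng.normal(0, 0.5, n)     # test being construct-validated
    established = anxiety + rng.normal(0, 0.5, n)  # already validated anxiety measure
    vocabulary = rng.normal(0, 1, n)               # theoretically unrelated construct

    print(np.corrcoef(new_test, established)[0, 1])  # high -> convergent evidence
    print(np.corrcoef(new_test, vocabulary)[0, 1])   # near zero -> discriminant evidence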
FACTOR ANALYSIS
E.g., personality is a construct, and it has various attributes, characteristics, or dimensions. Depending on which characteristics we are looking at, there are different dimensions (factors).
STUDY QUESTIONS:
TEST DEVELOPMENT
This is the process by which tests are constructed.
○ Conceptualisation
○ Construction
○ Test Tryouts
○ Item Analysis
○ Test revision
Test Conceptualisation
The first stage in the process starts with our thoughts, which arise from our experiences. According to cognitive psychologists, thoughts are "self-talk"; in behavioural terms, thoughts are the probable source of all tests ever published. The test developer thinks, "there ought to be a test that measures something." Often, there is a stimulus for such a thought. That stimulus could be:
9. What special training will be required of test users for the administration and
interpretation of the test?
- What background and qualifications will a prospective user of data
derived from an administration of the test need to have?
- What restrictions, if any, should be placed on the distribution of the
test and on the test user?
12. Is there any potential for harm as a result of the administration of the test?
- What safeguards are built into the recommended testing procedure to
prevent any sort of harm to any of the parties involved in the use of this
test?
- Who are the parties in the test? Test developer, publisher, taker, and the
society. Sometimes, the test developer may also be the test publisher.
NORM-REFERENCED
Anastasi and Urbina (2007) note that a norm-referenced test is one in which an individual's score is interpreted by comparing it with the scores obtained by others on the same test. An example of a norm-referenced test is the Eysenck Personality Questionnaire (EPQ). The EPQ has two versions:
a. EPQ Adult
b. EPQ Children
This means that there are norms for the children and different norms for the adults.
CRITERION-REFERENCED
The criterion-referenced test was first proposed by Glaser (1963). The term is defined loosely, and apparently differently, by different writers, and it goes by other names such as domain-referenced tests and content-referenced tests.
Score   Grade   Grade Point   Remark
44      F       0             Fail
59      C       3             Pass
69      B       4             Pass
79      A       5             Pass
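A criterion-referenced interpretation can be sketched as a simple cut-off function: each score is judged against a fixed standard, not against other test takers. The cut-offs below are assumed from the table above (treating each listed score as the upper bound of its band):

    def grade(score):
        # Fixed, criterion-referenced cut-offs (assumed from the table above).
        if score <= 44:
            return "F", 0, "Fail"
        elif score <= 59:
            return "C", 3, "Pass"
        elif score <= 69:
            return "B", 4, "Pass"
        else:
            return "A", 5, "Pass"

    print(grade(63))   # ('B', 4, 'Pass'), regardless of how others scored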
PILOT WORK
In pilot work, test items are pilot-studied to evaluate whether they should be in the final form of the instrument. Pilot work requires the creation, revision, and deletion of many items. It also involves literature reviews and experimentation. Once the pilot study has been completed, the process of test construction begins.
Test Construction
Test construction is the second phase or stage in the test development process.
Constructing a test involves three major things:
EXPLANATION
1. Scaling:
A fourth way in which scales may be classified is when raw scores are transformed into standard scores that range from 1 to 9 (stanines).
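A sketch of one common way to compute stanines (a simplification: real stanine tables often use fixed percentile bands rather than this linear rescaling):

    import numpy as np

    def to_stanine(raw_scores):
        # Rescale z-scores to mean 5, SD 2, rounded and clipped to 1..9.
        raw = np.asarray(raw_scores, dtype=float)
        z = (raw - raw.mean()) / raw.std(ddof=1)
        return np.clip(np.round(2 * z + 5), 1, 9).astype(int)

    print(to_stanine([10, 12, 15, 18, 20, 22, 25, 28, 30, 35]))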
SCALING METHODS
There are many scaling methods; the choice of which scaling method to use as a test developer depends on many factors.
RATING SCALES
Rating scales are the most commonly used method of scaling. A rating scale may be defined as a grouping of words, statements, or symbols on which judgments concerning the strength of a particular trait, attitude, or emotion are indicated by the test taker or examiner. Rating scales can be used to record judgments of one's self, of others, of experiences, or of objects.
c. A rating scale may take a pictorial form of sad, happy, confused faces.
In rating scales, the test taker rates every item. E.g., the DASS-21 has 21 items; the test taker rates each item (of the 21), and the scores are added together to obtain subscale and final scores.
Because the final test result is obtained by summing the ratings across all the items, such a scale is called a summative scale. One example of a summative scale is the Likert scale. The Likert (1932) scale is used extensively in psychology, and Likert scales are relatively easy to construct. In a Likert scale, the test taker is presented with five alternative responses, usually on an agree/disagree or approve/disapprove continuum (e.g., strongly agree, agree, undecided, disagree, strongly disagree).
Likert scales are used because they are usually reliable. The classic Likert scale has five response options; at times you might encounter a "Likert-type scale," which may have five or more options.
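Summative (Likert) scoring in miniature, with hypothetical 5-point items; reverse-keyed items are commonly rescored as 6 minus the rating before summing:

    responses = [5, 4, 2, 5, 3]                        # one test taker, five items
    reverse_keyed = [False, False, True, False, False]

    total = sum((6 - r) if rev else r
                for r, rev in zip(responses, reverse_keyed))
    print(total)   # summative scale score: 5 + 4 + 4 + 5 + 3 = 21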
PAIR COMPARISON SCALING METHODS
In this scaling method, the test taker is presented with pairs of stimuli. The pair of stimuli may be two pictures, two options, two objects, or two statements, which they are asked to compare. They must select one of the stimuli based on some rules. The rules are:
3. Which is heavier?
a. A kg of cotton
b. A kg of tin
For each pair of options, test takers receive a higher score if the option they selected was deemed more justifiable (more appealing) by the majority of judges. The judges must be asked to rate the pairs of options before administration of the test, and the list of options they selected is provided with the scoring instructions as the answer key.
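A toy sketch of paired-comparison scoring: the test taker earns a point for each pair where their choice matches the option the judges keyed in advance (pair labels and options here are hypothetical):

    answer_key = {1: "a", 2: "b", 3: "a"}   # option keyed by the judges, per pair
    choices = {1: "a", 2: "a", 3: "a"}      # one test taker's selections

    score = sum(choices[p] == answer_key[p] for p in answer_key)
    print(score)   # 2 of the 3 choices matched the judges' key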
SORTING TASK
Comparative Scaling
Categorical Scaling
a. All people should have the right to decide whether they wish to end their
lives.
b. People who are terminally ill and in pain should have the option of
having a doctor who will assist them to end their lives.
c. People should have the option of signing away the use of
artificial life-support systems when they become seriously ill.
d. People have a right to a comfortable life.
Look at the statements above carefully: they are arranged in sequence from the most extreme position to the least extreme. People who agree with the most extreme position (a) should also agree with (b), (c), and (d). This cumulative pattern is characteristic of a Guttman scale.
Another scaling method was first described in 1929: the equal-appearing interval (Thurstone) scale, which yields interval-level data.
1. What range of content should the items cover, and which of the many
different item formats should be employed?
2. Should I use a selected-response format (e.g., multiple-choice, matching)
or a constructed-response format?
3. How many items should be written?
Under matching
Completion Item
Short Answer
The items have to be well and clearly written; there are no shortcuts.
Essay
An essay may have many paragraphs. E.g., I might ask you to compare and
contrast the techniques and definitions of classical and operant conditioning,
to compare the selected-response formats of item writing, or to discuss how
norm-referenced and domain-referenced tests differ.
SCORING ITEMS
1. Cumulative scoring
2. Class/category scoring
3. Ipsative scoring
Cumulative
This is the commonest method of scoring. It is simple, and its logic is simple as well. The rule in a cumulative scoring test is that the higher the score on the test, the higher the test taker stands on the ability, trait, or other characteristic the test purports to measure. Examples of cumulative scoring tests are the DASS-21 and the PTSD Checklist for DSM-5 (PCL-5).
Class/Category Scoring:
Test Tryout
Once we have generated a pool of items that will probably be included, the next thing is to try out the items in order to find out which of the items are good. The test should be tried out on people similar to those for whom the test was intended. For example, if the test was designed to assess the performance of employees, it should be tried out on employees of an organization. Similarly, if the test is designed for children, it should be tried out on children; if it is for adults, it should be tried out on adults. Another consideration is culture: if the test is culture-specific, it should be tried out on people who share that culture.
The main objective of trying out a test is for the test developer to identify the items that are reliable and valid. A good item should be able to discriminate among test takers: a good item is one that test takers who score high on the test as a whole tend to get right, while those who score low tend to get wrong.
In a nutshell, if everyone passes or everybody fails then it's not a good test.
Test Analysis
For a test user to be able to evaluate the quality of a published test, they must have
reasonable knowledge of the basic concepts or techniques of item analysis. This is
because item analysis itself is relevant and an integral part of test development.
It enables test developers to shorten the test and at the same time increase its reliability and validity. Generally, a longer test is more reliable and valid than a short test. However, when a test is shortened by eliminating the least satisfactory items, the shortened test can be more valid and reliable than the longer one.
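Tying this to the item-analysis indices listed in the course outline, a sketch of the item-difficulty index (p, the proportion passing an item) and a discrimination index computed as the corrected item-total correlation (illustrative, for 0/1-scored items):

    import numpy as np

    def item_analysis(responses):
        # responses: rows = test takers, columns = 0/1 item scores.
        responses = np.asarray(responses, dtype=float)
        total = responses.sum(axis=1)
        difficulty = responses.mean(axis=0)   # p near 1 = easy, near 0 = hard
        discrimination = np.array([
            # correlate each item with the total score excluding that item
            np.corrcoef(responses[:, i], total - responses[:, i])[0, 1]
            for i in range(responses.shape[1])
        ])
        return difficulty, discrimination

Items that everyone passes or everyone fails (p near 1 or 0) cannot discriminate, which is the point made above: if everyone passes or everybody fails, it is not a good test.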
Test Revision
1. We carry out item analysis as an integral part of test development to determine which items are good and which are bad. Discuss.
2. Having successfully taken lectures in Psy 314, 300-level students in Psychology decided to develop a test to measure exam anxiety. They decided to name the scale the Student Examination Anxiety Scale (SEAS); they have generated an item pool of 400. What remains is a test tryout.
a. What does carrying out this test tryout mean?
b. What are the objectives of the test tryout?
c. What questions will you ask and answer while carrying out the tryout?
d. Which group of items are you likely to select from the item pool, and which ones are you likely to eliminate?
e. SEAS uses a multiple-choice item format; how many items does it have?
3. Test construction is a crucial process in test development. With a firm grasp of your course lecturer's predisposition on this issue, discuss confidently and generously, with examples, the claim that "a test is born in the mind" cannot be contested.
ASSIGNMENT
Think carefully and develop a test to measure any construct of your choice. You must follow the test development process, that is, test conceptualization, test construction, test tryout, test analysis, and test revision. Highlight the literature review and the theories that explain the construct, and state which theorist developed the construct.