Psychometrics Notes
Psychometrics
10, 23, & 24 April (Class Test material)
o Introduction to Psychometrics
o Test Development
o Reliability
o Validity
Recommended reading:
Downing, S. M. (2006). Twelve Steps for Effective Test Development. In S. M. Downing & T.
M. Haladyna (Eds.), Handbook of Test Development (pp. 3-25). London: Lawrence Erlbaum
Associates. (available on Vula)
What is Psychometrics?
Simply put = The study of Psychological measurement
The measurement of mental capacity, thought processes, aspects of personality, etc., especially by
mathematical or statistical analysis of quantitative data
What is measurement?
What is a test?
Psychometrics is concerned with constructing tests to measure things and evaluating how good
these tests are
• Overt behaviour
Observable
E.g., time needed to put 10 pegs in a peg board, time taken to tap fingertips
10 times
• Covert behaviour
Not directly observable
E.g., thoughts, feelings, attitudes
“A test designed to provide a quantitative analysis of a person’s mental capacities or traits, typically
as shown by responses to a standard series of questions or statements.”
A set of items that is designed to measure characteristics of human beings that pertain to behaviour
- A test question; something a person must do in a test; a task in a test a person must
perform
- NB: A test item is not necessarily always a question
Test items are therefore defined as: “A specific stimulus to which a person responds overtly.”
Types of tests:
Modes of administration:
1: Individual tests
• Usually some degree of subjectivity in the scoring; not just about right and wrong answers.
• E.g., some personality tests, some IQ tests (Wechsler Adult Intelligence Scale), the
Rorschach test
2: Group tests
• Scoring is usually more objective. There is a norm. Right and wrong answers.
Dimensions measured:
A: Ability tests
1. Achievement tests
2. Aptitude tests
3. Intelligence tests
B: Personality tests
TAT (Thematic Apperception Test):
Story includes: What had led up to the event; What is happening in the moment;
Emotions & thoughts of characters; What the outcome of the story was
Norm-referenced tests
Criterion-referenced tests
Summary
Psychometrics is the quantitative measurement of mental capacity, thought processes, aspects of
personality, etc.
- Individual tests
- Group tests
1: Ability tests
Achievement
Aptitude
Intelligence
2: Personality tests
Structured
Projective
1. Norm-referenced tests
2. Criterion-referenced tests
Officials of the Emperor were examined every 3 years to determine their fitness for office
Went through 3 rounds of rigorous exams with very low pass rates:
Candidates had to spend a day & night in an isolated booth composing an essay and
writing a poem
China can therefore be said to have developed the first civil service examination program
Independent assessments
Experimental psychology flourished in the late 1800s in continental Europe and Great Britain.
The problem was that the early experimental psychologists mistook simple sensory processes for
intelligence.
- They used assorted brass instruments to measure sensory thresholds and reaction
times, thinking that such abilities were at the heart of intelligence.
- Hence, this period is sometimes referred to as the Brass Instruments era of psychological
testing. Introduced by Wundt
Wundt’s ‘Thought Meter’: Wundt thought that the difference between the observed pendulum
position and the actual position would provide a means of determining the swiftness of thought of
the observer
Wundt believed that the speed of thought might differ from one person to the next
Demonstrated empirical analysis that sought to explain individual differences: relevant to current
practices in psychological testing.
Overly simplistic: overlooks other factors like attention, motivation and self-correcting feedback
from other trials.
Galton was more interested in the problems of human evolution than psychology (he was Charles
Darwin’s cousin)
Believed that apes, ‘savages’, races of the colonies, the Irish, & English working class were
inferior
Attempted to measure intellect by things such as visual, auditory, & weight discrimination,
threshold levels, etc.
- Galton’s simplistic attempts to gauge intellect with measures of reaction time and
sensory discrimination proved fruitless.
Set up psychometric laboratory in London. Tests involved physical & behavioural domains
Galton’s Contributions
ADAPTED Wundt’s psychophysical brass instruments to a series of single & quick sensorimotor
measures: Allowed him to collect data from thousands of subjects quickly
Demonstrated that objective tests could be devised and that meaningful scores could be obtained
through standardized procedures.
His work on correlation was later formalised by Karl Pearson as the Pearson product-moment correlation, used for analysing data from these experiments
Examined relationship between academic grades, psycho-sensory tests, & size of the brain, & shape
of the head
- These tests were clearly a reworking and embellishment of the Galtonian tradition
Alfred Binet
First to develop an intelligence test
Argued that intelligence could be better measured by means of the higher psychological processes
rather than the elementary sensory processes such as reaction time.
1. Did not measure any single faculty – assessed child’s general mental development through
a heterogeneous group of tasks
2. Aim was classification – not measurement
3. Brief & practical test, taking less than an hour to administer and requiring little equipment
4. Directly measured what Binet and Simon regarded as the essential factor of intelligence—
practical judgment—rather than wasting time with lower-level abilities involving sensory and
motor elements
5. Items were arranged by approximate level of difficulty – instead of content
How does this relate to the modern view of intelligence? And the common sense, everyday view?
- Had 58 items
- Most of the very simple items were dropped and new items were added at the higher
end of the scale
- Introduced the concept of a mental level as the test had been standardized
Within months, what Binet called mental level was being translated as mental age.
In his writings, Binet emphasized strongly that the child’s exact mental level should not be taken too
seriously as an absolute measure of intelligence.
Galton: intelligence is hereditary and unchangeable. Binet: intelligence can be improved through
special training
In America, IQ tests were welcomed as a way to assess the intellect of immigrants & potential
soldiers
Developed by Lewis Terman in 1916 (who also suggested multiplying the ratio of mental age to
chronological age by 100 to yield the intelligence quotient, and was the first to use the abbreviation IQ)
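As a worked example of the ratio IQ (the ages here are hypothetical):

```latex
\mathrm{IQ} = \frac{\text{mental age}}{\text{chronological age}} \times 100,
\qquad \text{e.g., } \frac{10}{8} \times 100 = 125
```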
Stanford-Binet Scale
5th version of this test still in use today
Now contained the familiar term “IQ” for expressing test results
Summary
Sir Francis Galton (1869)
Children with mental age 4 or more years behind chronological age were
feeble-minded (this constituted 3% of his normal sample)
Needed to be segregated
83% of Jews
80% of Hungarians
79% of Italians
87% of Russians
• Suggested deportation, or: “We might be able to use moron labourers only if we are
wise enough to train them properly.”
Eugenics
The science of using controlled, selective breeding to improve hereditary qualities of the human
race & create superior individuals
Negative eugenics: aims to prevent those deemed physically, mentally, or morally unfit from
procreating (through sterilisation and segregation)
Eugenics Movement:
Positive eugenics:
Purpose = To encourage the fit to reproduce, raise racial consciousness, bring public awareness to
eugenics agenda, emphasize value of marriage & family
Negative eugenics
Forced sterilization: hope was to stop problems of mental illness, crime, and low IQ
Quick, effective, efficient way of evaluating emotional & intellectual functioning of soldiers
Stanford-Binet adapted to multiple choice tests in 1917 for use by US Army (Alpha & Beta)
Segregate & eliminate mentally incompetent, & classify men according to mental ability
1. Army Alpha
- For English literates
- Oral directions, arithmetic, practical judgment, analogies, disarranged sentences, number
series, information
2. Army Beta
- For non-English, & non-literates
- Memory, matching, picture completion, geometric construction
Used Binet IQ test to show that black American children had a lower average IQ than white
children
This difference was due to genetics
Argue that poor black communities in the US are a ‘cognitive underclass’ formed by
interbreeding of people with low IQ
Mental tests require knowledge & cultural values rather than innate intelligence – biased towards
white middle class
Originally proposed cultural, environmental, educational, & social reasons for discrepancy
Later suggested that due to differences in ‘innate abilities’ between whites & blacks, & not
external factors
Use of tests gained momentum after WWII and 1948 when the NP came into power
1960’s onwards in SA
View was that there was little need for common tests as groups were separate and did not compete
with one another
Competition for some jobs during 80s & 90s raised questions such as:
A drive in SA & internationally to find tasks that are not biased towards race, culture, gender, &
language
Test development
Evaluation:
There will be Psychometrics sections in both the class test and final exam.
In the Psychometrics section on your final exam, you will only be tested on the portion of work
covered after your class test (i.e., only Test Development, Reliability, and Validity). For the
Psychometrics section you will have to answer short answer questions worth 40 marks in total.
Note: All material covered on your lecture slides is examinable. Use the recommended readings
above to supplement your understanding of what is covered during lectures.
1. Overall plan
2. Content definition
3. Test specifications
4. Item development
5. Test design & assembly
6. Test production
7. Test administration
8. Scoring responses
A priori decisions:
What: What construct is the test designed to measure? What are the aim (what) and purpose
(why) of the test? What is the administration modality, the need for the test, the format of
the test, etc.?
Why: Why is the test needed?
Who: Who will use the test? Who will take the test? Who produces, publishes, or prints the
test? Who creates and reviews the test items?
When: When will it happen? (Decide on a timeline)
The aim is what you want to achieve with your test. Purpose = the why: what are the test scores
going to be used for?
Can have different aims for the same construct, e.g., stress
Are we going to use our questionnaire for screening or in-depth assessment / diagnosis?
Screening purposes
What decisions can we make based on a score on the test? Depends on length and purpose of the
test (screening or diagnoses)
In what areas is the person the most (or least) stressed – family life, work life, etc.?
E.g. To evaluate a potential employee’s ability to handle stress by giving them a large amount
of tasks to do in a short period of time
Completing at least half of the tasks = handles stress well (completing half of the tasks would
be defined as the cut-off score)
2: Content definition
Defining what content the test aims to measure
Operational definition
Dictionary definition
E.g., stress defined as: difficulty that causes worry or emotional tension
Example: intelligence
3: Test specifications
The test blue-print
1. Test/response format
2. Item format
3. Test length
4. Content areas of the construct/s tested
5. Whether items will contain visual stimuli
6. How tests scores will be interpreted
7. Time limits
1: Test/Response format
How will participants demonstrate their skills? How does the construct manifest through the test?
- Objective: very structured, person usually picks only one response (e.g., MCQ)
- Subjective: interpretation of response depends on examiner judgment, so more
unstructured (e.g., projective tests)
2: Item format
Open-ended items
Forced-choice items
Sentence-completion items
Performance-based items
3: Test length
Depends on:
Need at least 50% more items in initial version than final version
E.g., a blueprint for a superstition measure crossing content areas (C1, C2) with
manifestations (M1, M2), where each cell is a group of items:

Manifestation                              Content area 1   Content area 2
M1: Behaviour from superstitions           C1M1             C2M1
M2: Belief in superstitions (cognitive)    C1M2             C2M2
4: Item development
Guidelines for writing the items
- Consider your audience: If writing a test for children (vs. adults), consider different format
such as visual
- Don’t make questions too obvious
Pre-testing: Giving the test to a representative sample from the target population to gather
information about it
6: Test production
Production & printing
Finalising all items, their order, & necessary visual stimuli
Security of tests
7: Test administration
Most public & visible aspect of testing
Security is a major concern. Security problems during test administration can lead to the invalidation
of some or all examinee scores and can require the test developers to retire or eliminate large
numbers of very expensive test items.
Preferable to designate one highly experienced invigilator as ‘chief invigilator’ for the site to
supervise others
- Control extraneous variables & make conditions identical for all examinees
- Examinee fairness, clarity of instructions, time limits
- Otherwise examinee scores difficult to interpret
8: Scoring responses
Test scoring is the process of applying a scoring key to examinee responses to the test stimuli.
Scoring criteria
- Develop a scoring key: how many marks particular questions get. The scoring key must be
applied with perfect accuracy to examinees’ item responses. High levels of quality
control are therefore necessary.
- When to ‘drop’ respondents from sample?
o Those who answer fewer than 75% of the questions?
o Response bias to all questions?
Item analysis
Deciding whether to keep or discard items according to:
1. Item difficulty
2. Discriminating power
3. Item bias
When analysing test items, we have several questions about the performance of each item. Some of
these include:
1: Item difficulty
= the proportion/percentage of individuals who answer the item correctly (also known as the facility
index)
Retrospective. Difficulty can only be analysed after the test has been administered.
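A minimal sketch of computing the facility index from a scored response matrix (the data here are hypothetical):

```python
import numpy as np

# Hypothetical scored responses: rows = examinees, columns = items (1 = correct, 0 = incorrect)
scores = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 0, 1],
    [1, 1, 1, 1],
])

# Facility index = proportion of examinees answering each item correctly
difficulty = scores.mean(axis=0)
print(difficulty)  # [0.75 0.75 0.25 1.  ] -> item 3 is hard, item 4 trivially easy
```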
2: Discriminating power
Item discriminating power = the extent to which an item measures the same aspect of what the
total test is measuring
Measured by:
- Discrimination index (D): Works by subtracting the item difficulty for people in the bottom
25% from item difficulty for people in the top 25%
- Item-total correlations: Correlation between the score on an item & performance on the
total measure
o Positive correlation = good discriminating power
o 0 correlation = no discriminating power
o Negative correlation = poor discriminating power
Items with correlations less than 0.20 & negative correlations are not retained
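A sketch of both indices, assuming the same kind of 0/1 scores matrix as above:

```python
import numpy as np

def discrimination_index(scores, item, frac=0.25):
    """D = item difficulty in the top 25% of scorers minus difficulty in the
    bottom 25%, with groups defined by total test score."""
    totals = scores.sum(axis=1)
    order = np.argsort(totals)            # examinees from lowest to highest total
    n = max(1, int(len(totals) * frac))
    bottom, top = order[:n], order[-n:]
    return scores[top, item].mean() - scores[bottom, item].mean()

def item_total_correlation(scores, item):
    """Pearson correlation between one item's scores and total test scores.
    Note: an item everyone answers the same way has zero variance and yields nan."""
    totals = scores.sum(axis=1)
    return np.corrcoef(scores[:, item], totals)[0, 1]
```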
3: Item Bias
Bias in testing = errors in measurement that are associated with group membership
Because effective test questions are so difficult to develop, and so many resources must be used to
produce useful test items that perform well, test security is important.
During the planning phase, be mindful of test-taker characteristics that may influence performance:
- Educational level
Impacts ability to read, write, and work with numbers as well as higher order thinking
Tends to differ from rural to urban areas
Include questionnaire on quality and level of schooling?
Develop separate norms i.e. differently graded tests depending on educational background?
- Language
If a test is given in an unfamiliar language, it’s difficult to tell whether poor performance
is due to language / communication difficulties or whether the test-taker has a low level of the
construct being assessed
Provide evidence that language is appropriate for intended populations
Should different tests be constructed for different languages? Translation may be difficult
Can produce tests with different language versions: bilingual or multilingual
- the same construct could be interpreted and understood in very different ways in various
cultural and language groups
- Is the construct meaningful (i.e. of value, important) to different cultural groups?
- Is the construct defined the same way in different cultures? Is there a shared
understanding?
E.g., Eastern & Western associations with intelligence
A test consists of a stimulus (item) to which the test-taker responds using a specified response mode.
There are various modes in which a test is presented (e.g., paper-based, computer-based), various
item formats (e.g., multiple choice, performance tasks), and various response modes (e.g., verbal,
written, typing on a computer keyboard).
- Developers must provide evidence that all formats are familiar to target populations.
Shouldn’t be assumed that the different presentation and response modes or item formats
are equally familiar to and appropriate for all cultural groups
Numbers, dates, time, currency
Icons, symbols, colours
Computer-based tests: there are differing levels of computer familiarity and technological
sophistication among various cultural and socio-economic groups in South Africa
Could include a preparatory tutorial
- Omit items
- Balance number of familiar/unfamiliar items
- Practice items that provide opportunity to familiarise test-takers with unfamiliar item types
or content
Reliability
- Consistency in measurement
- How much error we have in a measurement
So it’s the precision with which test scores measure achievement
Physical apparatus (e.g. polygraph): other factors like feeling embarrassed (may appear that they’re
lying)
In mathematical terms: x = T + e
- x – observed score
- T – true score
- e – error
We can estimate someone’s true score by taking the average of all their observed scores
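A tiny simulation of x = T + e (all numbers hypothetical) shows why averaging many observed scores estimates the true score:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 50                                # true score (unknown in practice)
e = rng.normal(0, 5, size=1000)       # random measurement error, mean 0
x = T + e                             # observed scores over 1000 hypothetical administrations

print(x[:3])       # individual observed scores scatter around 50
print(x.mean())    # the mean of observed scores converges on T as errors cancel
```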
If we construct a test on something, we can’t ask all possible questions, so we only use a few test
items (we sample from all the possible test questions / the domain)
But using fewer test items can lead to the introduction of error because the test items may or may
not adequately sample the domain / construct.
Our task in reliability analysis is to ascertain how much error we would make by using a score from a
shorter test as an estimate of true ability
We can’t ask a test-taker every word in the dictionary, so we ask them to spell a subset (sample) of
words
1. Test-retest reliability
2. Parallel forms reliability
3. Internal consistency
a. Split-half reliability
b. Kuder-Richardson 20 reliability
c. Coefficient / Cronbach’s alpha
4. Inter-rater reliability
1: Test-Retest Reliability
Give someone a test then give it to them again at another time
The correlation between 2 scores is known as the co-efficient of stability – refers to stability of the
score
The source of error measured is time sampling – scores differ due to a factor occurring over time
Can’t really be used when measuring things like mood and stress as they fluctuate naturally over
time and may change between 1st and 2nd administrations – co-efficient of stability will be low
Someone’s score can improve the second time due to testing effects (learning what to expect,
improving)
Something could happen in between administrations that changes that which is being tested
The thing you’re trying to measure could change through being tested (e.g. being tested for
empathy: may want to appear more empathetic)
2: Parallel Forms Reliability
Two equivalent forms of the same test are administered and the two sets of scores are correlated
The source of error measured is item sampling – difficulty levels unequal across tests
What if the different forms are given to people at two different times?
Should the different forms be given to the same people or to different people?
- People might work out how to answer the one form from doing the other form
Need to be able to make two forms of the test in the first place
3: Internal Consistency
Measures whether different items within one test all measure the same thing to the same extent
The source of error measured is internal consistency – items are linked to the overarching construct
a: Split-half reliability
A test is split in half. Each half is scored separately and total scores for each half are correlated
Challenges: different ways of splitting the test give different reliability estimates, and each
half is shorter (and therefore less reliable) than the full test
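A sketch of an odd-even split, assuming the same kind of 0/1 scores matrix used above; the Spearman-Brown step (the standard correction, though not named in these notes) adjusts for each half being only half the test’s length:

```python
import numpy as np

def split_half_reliability(scores):
    """Correlate odd-item and even-item half scores, then apply the
    Spearman-Brown correction for the halved test length."""
    odd = scores[:, 0::2].sum(axis=1)    # items 1, 3, 5, ...
    even = scores[:, 1::2].sum(axis=1)   # items 2, 4, 6, ...
    r_half = np.corrcoef(odd, even)[0, 1]
    return 2 * r_half / (1 + r_half)     # estimated reliability of the full test
```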
b: Kuder-Richardson 20 (KR20)
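The notes don’t reproduce the formulas; for reference, the standard KR-20 (for items scored 0/1) and Cronbach’s alpha (its generalisation to non-dichotomous items) are:

```latex
KR_{20} = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} p_i q_i}{\sigma_X^2}\right),
\qquad
\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} \sigma_i^2}{\sigma_X^2}\right)
```

where k = number of items, p_i = proportion answering item i correctly, q_i = 1 − p_i, σ_i² = variance of item i, and σ_X² = variance of total test scores.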
4: Inter-Rater Reliability
Measures how consistently 2 or more raters / judges agree on
something’s rating
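Percent agreement is one simple way to quantify this (the ratings below are hypothetical; Cohen’s kappa, which corrects for chance agreement, is a common alternative):

```python
# Hypothetical ratings of the same 6 responses by two judges
rater_a = [1, 0, 1, 1, 0, 1]
rater_b = [1, 0, 0, 1, 0, 1]

agreement = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)
print(agreement)  # 5 of 6 ratings agree -> ~0.83
```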
Improving reliability:
- A sample that’s representative of the wider population gives a better estimate of a test’s
reliability
- Control extraneous variables
- Increase the number of items (but be sure not to add too many or bad items)
- Item analysis: there are several ways to assess whether items/questions are ‘good’ or ‘bad’ and
therefore need to be discarded
- Item discriminating power = the extent to which an item measures the same aspect of what
the total test is measuring
- Item bias
- Item difficulty
Make sure that which you aim to measure has been clearly conceptualised
Validity
Refers to whether or not a test measures what it intends/claims to measure
- To be able to make accurate inferences from scores on a test, giving meaning to a test score
- i.e. validity indicates the usefulness of a test
A test must be considered reliable before it can be considered valid. But if it isn’t
valid, it doesn’t matter if it’s reliable.
Types of Validity
1. Face validity
2. Content validity
3. Criterion validity
• Concurrent
• Predictive
4. Construct validity
• Convergent
• Discriminant
1: Face Validity
When a test seems on the surface (on its face) to measure what it’s supposed to measure
But a test can have good face validity without actually being valid
Measured by researchers looking at items and giving their opinions regarding whether the items
appear to measure what they’re trying to measure
The least scientific of all the measures of validity, as it’s determined by researchers’ opinions. It’s not
enough to just have face validity.
1. Readability
2. Feasibility
3. Layout and style
4. Clarity of wording
- How relevant and interesting the test seems to participants impacts whether they want to take it
or not
- If the test looks valid, clinicians etc are more likely to trust and therefore buy it
- Tests with low face validity usually have low reliability (because there might be issues with
readability, feasibility, clarity of wording etc)
For these reasons, face validity is insufficient for claiming that a test is valid.
2: Content Validity
Degree to which a test measures an intended content area
- E.g., does an IQ questionnaire have items covering all areas of intelligence discussed in the
scientific literature?
In other words: there’s a correspondence between the test items and the content domain
1. Specifying the content area covered by the phenomenon when developing the construct
definition (early stages of test development)
2. Writing questions/scale items that are relevant to each of the content areas
3. Developing a measure of the construct that includes the best / most representative items
from each content area
1. Construct under-representation: when the test fails to capture important aspects of the construct
E.g., a test of PTSD that does not have questions relating to vividly re-experiencing the traumatic
event
2. Construct-irrelevant variance
When test scores are influenced by things other than the construct the test is supposed to measure
3: Criterion Validity
How well a test score estimates or predicts a criterion behaviour outcome, now or in the future
E.g. The depression inventory estimates what depressive behaviour the person displays
Predicting is easier for ability tests, but harder for personality and attitude tests
Criterion Validity: How well test performances estimate / predict current and future performance on
some valued measure (other than the test itself)
Concurrent Validity: The extent to which test scores can correctly identify the CURRENT state of
individuals
- Measured by correlating scores from the new test with scores from an already established
test
- E.g., do results from a new intelligence test correlate with the Wechsler IQ test?
Predictive Validity: Do scores on a test predict a FUTURE event successfully? Does the measurement
correctly predict the score on a future test?
- The test = the predictor variable, and the future event = the criterion variable
- Matric math score = predictor. Success at Psychology statistics = criterion
E.g. Do scores on a test of acute stress disorder predict scores on a test of PTSD after a few weeks?
E.g. Are NBT results (predictor variable) correlated with first-year university scores (criterion
variable)?
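A minimal sketch of checking predictive validity (all marks hypothetical): the correlation between predictor and criterion scores serves as the validity coefficient:

```python
import numpy as np

# Hypothetical data: matric maths marks (predictor) and first-year statistics marks (criterion)
predictor = np.array([55, 60, 72, 80, 65, 90, 48, 70])
criterion = np.array([50, 58, 70, 75, 60, 88, 52, 66])

validity_coefficient = np.corrcoef(predictor, criterion)[0, 1]
print(validity_coefficient)  # closer to 1 = stronger prediction of the future criterion
```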
4: Construct Validity
The extent to which the instrument measures a theoretical construct
E.g., depression, empathy, or intelligence. A construct can’t be directly measured the way a physical
quantity like water can be; it isn’t directly observable and measurable.
2. What observable behaviours can we expect of a person with a high or low score on the test
measuring the construct?
What is the relationship among these behaviours?
For a test to have good construct validity, there needs to be evidence for both convergent and
discriminant validity.
1: Reliability
- E.g., a test of anxiety may actually measure stress. It might measure stress very well but it is
not measuring what you want it to
- You cannot establish that a test measures what it is supposed to if it doesn’t consistently
measure what it is supposed to
2: Social Diversity