Psychometrics Notes
Psychometrics
10, 23, & 24 April (Class Test material)
o Introduction to Psychometrics
o Test Development
o Reliability
o Validity
Recommended reading:
Downing, S. M. (2006). Twelve Steps for Effective Test Development. In S. M. Downing & T.
M. Haladyna (Eds.), Handbook of Test Development (pp. 3-25). London: Lawrence Erlbaum
Associates. (available on Vula)
What is Psychometrics?
Simply put = The study of Psychological measurement
The measurement of mental capacity, thought processes, aspects of personality, etc., especially by
mathematical or statistical analysis of quantitative data
What is measurement?
What is a test?
Psychometrics is concerned with constructing tests to measure things and evaluating how good
these tests are
• Overt behaviour
Observable
E.g., time needed to put 10 pegs in a peg board, time taken to tap fingertips
10 times
• Covert behaviour
Not directly observable
E.g., thoughts, feelings, attitudes
“A test designed to provide a quantitative analysis of a person’s mental capacities or traits, typically
as shown by responses to a standard series of questions or statements.”
A set of items that is designed to measure characteristics of human beings that pertain to behaviour
- A test question; something a person must do in a test; a task in a test a person must
perform
- NB: A test item is not necessarily always a question
Test items are therefore defined as: “A specific stimulus to which a person responds overtly.”
Types of tests:
Modes of administration:
1: Individual tests
• Usually some degree of subjectivity in the scoring; not just about right and wrong answers.
• E.g., some personality tests, some IQ tests (Wechsler Adult Intelligence Scale), the
Rorschach test
2: Group tests
• Scoring is usually more objective. There is a norm. Right and wrong answers.
Dimensions measured:
A: Ability tests
1. Achievement tests
2. Aptitude tests
3. Intelligence tests
B: Personality tests
TAT (Thematic Apperception Test):
Story includes: What had led up to the event; What is happening in the moment;
Emotions & thoughts of characters; What the outcome of the story was
Norm-referenced tests
Criterion-referenced tests
Summary
Psychometrics is the quantitative measurement of mental capacity, thought processes, aspects of
personality, etc.
- Individual tests
- Group tests
1: Ability tests
Achievement
Aptitude
Intelligence
2: Personality tests
Structured
Projective
1. Norm-referenced tests
2. Criterion-referenced tests
Officials of the Emperor were examined every 3 years to determine their fitness for office
Went through 3 rounds of rigorous exams with very low pass rates:
Candidates had to spend a day & night in an isolated booth composing an essay and
writing a poem
China can therefore be said to have developed the first civil service examination program
Independent assessments
Experimental psychology flourished in the late 1800s in continental Europe and Great Britain.
The problem was that the early experimental psychologists mistook simple sensory processes for
intelligence.
- They used assorted brass instruments to measure sensory thresholds and reaction
times, thinking that such abilities were at the heart of intelligence.
- Hence, this period is sometimes referred to as the Brass Instruments era of psychological
testing. Introduced by Wundt
Wundt’s ‘Thought Meter’: Wundt thought that the difference between the observed pendulum
position and the actual position would provide a means of determining the swiftness of thought of
the observer
Wundt believed that the speed of thought might differ from one person to the next
Demonstrated empirical analysis that sought to explain individual differences: relevant to current
practices in psychological testing.
Overly simplistic: overlooks other factors like attention, motivation and self-correcting feedback
from other trials.
Galton was more interested in the problems of human evolution than psychology (he was Charles
Darwin’s cousin)
Believed that apes, ‘savages’, races of the colonies, the Irish, & English working class were
inferior
Attempted to measure intellect by things such as visual, auditory, & weight discrimination,
threshold levels, etc.
- Galton’s simplistic attempts to gauge intellect with measures of reaction time and
sensory discrimination proved fruitless.
Set up psychometric laboratory in London. Tests involved physical & behavioural domains
Galton’s Contributions
ADAPTED Wundt’s psychophysical brass instruments to a series of single & quick sensorimotor
measures: Allowed him to collect data from thousands of subjects quickly
Demonstrated that objective tests could be devised and that meaningful scores could be obtained
through standardized procedures.
His work on correlation was later formalised by Karl Pearson as the Pearson product-moment correlation, used for analysing data from these experiments
Examined relationship between academic grades, psycho-sensory tests, & size of the brain, & shape
of the head
- These tests were clearly a reworking and embellishment of the Galtonian tradition
Alfred Binet
First to develop an intelligence test
Argued that intelligence could be better measured by means of the higher psychological processes
rather than the elementary sensory processes such as reaction time.
1. Did not measure any single faculty – assessed child’s general mental development through
a heterogeneous group of tasks
2. Aim was classification – not measurement
3. Brief & practical test, taking less than an hour to administer and requiring little equipment
4. Directly measured what Binet and Simon regarded as the essential factor of intelligence—
practical judgment—rather than wasting time with lower-level abilities involving sensory and
motor elements
5. Items were arranged by approximate level of difficulty – instead of content
How does this relate to the modern view of intelligence? And the common sense, everyday view?
- Had 58 items
- Most of the very simple items were dropped and new items were added at the higher
end of the scale
- Introduced the concept of a mental level as the test had been standardized
Within months, what Binet called mental level was being translated as mental age.
In his writings, Binet emphasized strongly that the child’s exact mental level should not be taken too
seriously as an absolute measure of intelligence.
Galton: intelligence is hereditary and unchangeable. Binet: intelligence can be improved through
special training
In America, IQ tests were welcomed as a way to assess the intellect of immigrants & potential
soldiers
Developed by Lewis Terman in 1916 (who also suggested multiplying the ratio of mental age to
chronological age by 100 to yield the intelligence quotient, and was the first to use the abbreviation IQ)
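As a worked example of the ratio IQ (the ages here are hypothetical):

```latex
\mathrm{IQ} = \frac{\text{mental age}}{\text{chronological age}} \times 100,
\qquad \text{e.g., } \frac{10}{8} \times 100 = 125
```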
Stanford-Binet Scale
5th version of this test still in use today
Now contained the familiar term “IQ” for expressing test results
Summary
Sir Francis Galton (1869)
Children with mental age 4 or more years behind chronological age were
feeble-minded (this constituted 3% of his normal sample)
Needed to be segregated
83% of Jews
80% of Hungarians
79% of Italians
87% of Russians
• Suggested deportation, or: “We might be able to use moron labourers only if we are
wise enough to train them properly.”
Eugenics
The science of using controlled, selective breeding to improve hereditary qualities of the human
race & create superior individuals
Negative eugenics: aims to prevent those deemed physically, mentally, or morally unfit from
procreating (through sterilisation and segregation)
Eugenics Movement:
Positive eugenics:
Purpose = To encourage the fit to reproduce, raise racial consciousness, bring public awareness to
eugenics agenda, emphasize value of marriage & family
Negative eugenics
Forced sterilization: hope was to stop problems of mental illness, crime, and low IQ
Quick, effective, efficient way of evaluating emotional & intellectual functioning of soldiers
Stanford-Binet adapted to multiple choice tests in 1917 for use by US Army (Alpha & Beta)
Segregate & eliminate mentally incompetent, & classify men according to mental ability
1. Army Alpha
- For English literates
- Oral directions, arithmetic, practical judgment, analogies, disarranged sentences, number
series, information
2. Army Beta
- For non-English, & non-literates
- Memory, matching, picture completion, geometric construction
Used Binet IQ test to show that black American children had a lower average IQ than white
children
This difference was due to genetics
Argue that poor black communities in the US are a ‘cognitive underclass’ formed by
interbreeding of people with low IQ
Mental tests require knowledge & cultural values rather than innate intelligence – biased towards
white middle class
Originally proposed cultural, environmental, educational, & social reasons for discrepancy
Later suggested that due to differences in ‘innate abilities’ between whites & blacks, & not
external factors
Use of tests gained momentum after WWII and 1948 when the NP came into power
1960’s onwards in SA
View was that there was little need for common tests as groups were separate and did not compete
with one another
Competition for some jobs during 80s & 90s raised questions such as:
A drive in SA & internationally to find tasks that are not biased towards race, culture, gender, &
language
Test development
Evaluation:
There will be Psychometrics sections in both the class test and final exam.
In the Psychometrics section on your final exam, you will only be tested on the portion of work
covered after your class test (i.e., only Test Development, Reliability, and Validity). For the
Psychometrics section you will have to answer short answer questions worth 40 marks in total.
Note: All material covered on your lecture slides is examinable. Use the recommended readings
above to supplement your understanding of what is covered during lectures.
1. Overall plan
2. Content definition
3. Test specifications
4. Item development
5. Test design & assembly
6. Test production
7. Test administration
8. Scoring responses
A priori decisions:
What: What construct is the test designed to measure? What are the aim (what) and purpose
(why) of the test? What is the administration modality, the need for the test, the format of
the test, etc.?
Why: Why is the test needed?
Who: Who will use the test? Who will take the test? Who produces, publishes, or prints the
test? Who creates and reviews the test items?
When: When will it happen? (Decide on a timeline)
The aim is what you want to achieve with your test. Purpose = the why: what are the test scores
going to be used for?
Can have different aims for the same construct, e.g., stress
Are we going to use our questionnaire for screening or in-depth assessment / diagnosis?
Screening purposes
What decisions can we make based on a score on the test? Depends on length and purpose of the
test (screening or diagnoses)
In what areas is the person the most (or least) stressed – family life, work life, etc.?
E.g. To evaluate a potential employee’s ability to handle stress by giving them a large amount
of tasks to do in a short period of time
Completing at least half of the tasks = handles stress well (completing half of the tasks would
be defined as the cut-off score)
2: Content definition
Defining what content the test aims to measure
Operational definition
Dictionary definition
E.g., stress defined as: difficulty that causes worry or emotional tension
Example: intelligence
3: Test specifications
The test blue-print
1. Test/response format
2. Item format
3. Test length
4. Content areas of the construct/s tested
5. Whether items will contain visual stimuli
6. How tests scores will be interpreted
7. Time limits
1: Test/Response format
How will participants demonstrate their skills? How does the construct manifest through the test?
- Objective: very structured, person usually picks only one response (e.g., MCQ)
- Subjective: interpretation of response depends on examiner judgment, so more
unstructured (e.g., projective tests)
2: Item format
Open-ended items
Forced-choice items
Sentence-completion items
Performance-based items
3: Test length
Depends on:
Need at least 50% more items in initial version than final version
E.g., a blueprint for a superstition measure crossing content areas (C1, C2) with
manifestations (M1, M2), where each cell is a group of items:

Manifestation                              Content area 1   Content area 2
M1: Behaviour from superstitions           C1M1             C2M1
M2: Belief in superstitions (cognitive)    C1M2             C2M2
4: Item development
Guidelines for writing the items
- Consider your audience: If writing a test for children (vs. adults), consider different format
such as visual
- Don’t make questions too obvious
Pre-testing: Giving the test to a representative sample from the target population to gather
information about it
6: Test production
Production & printing
Finalising all items, their order, & necessary visual stimuli
Security of tests
7: Test administration
Most public & visible aspect of testing
Security is a major concern. Security problems during test administration can lead to the invalidation
of some or all examinee scores and can require the test developers to retire or eliminate large
numbers of very expensive test items.
Preferable to designate one highly experienced invigilator as ‘chief invigilator’ for the site to
supervise others
- Control extraneous variables & make conditions identical for all examinees
- Examinee fairness, clarity of instructions, time limits
- Otherwise examinee scores difficult to interpret
8: Scoring responses
Test scoring is the process of applying a scoring key to examinee responses to the test stimuli.
Scoring criteria
- Develop a scoring key: how many marks particular questions get. The scoring key must be
applied with perfect accuracy to examinees’ item responses. High levels of quality
control are therefore necessary.
- When to ‘drop’ respondents from sample?
o Those who answer fewer than 75% of the questions?
o Response bias to all questions?
Item analysis
Deciding whether to keep or discard items according to:
1. Item difficulty
2. Discriminating power
3. Item bias
When analysing test items, we have several questions about the performance of each item. Some of
these include:
1: Item difficulty
= the proportion/percentage of individuals who answer the item correctly (also known as the facility
index)
Retrospective. Difficulty can only be analysed after the test has been administered.
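A minimal sketch of computing the facility index from a scored response matrix (the data here are hypothetical):

```python
import numpy as np

# Hypothetical scored responses: rows = examinees, columns = items (1 = correct, 0 = incorrect)
scores = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 0, 1],
    [1, 1, 1, 1],
])

# Facility index = proportion of examinees answering each item correctly
difficulty = scores.mean(axis=0)
print(difficulty)  # [0.75 0.75 0.25 1.  ] -> item 3 is hard, item 4 trivially easy
```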
2: Discriminating power
Item discriminating power = the extent to which an item measures the same aspect of what the
total test is measuring
Measured by:
- Discrimination index (D): Works by subtracting the item difficulty for people in the bottom
25% from item difficulty for people in the top 25%
- Item-total correlations: Correlation between the score on an item & performance on the
total measure
o Positive correlation = good discriminating power
o 0 correlation = no discriminating power
o Negative correlation = poor discriminating power
Items with correlations less than 0.20 & negative correlations are not retained
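A sketch of both indices, assuming the same kind of 0/1 scores matrix as above:

```python
import numpy as np

def discrimination_index(scores, item, frac=0.25):
    """D = item difficulty in the top 25% of scorers minus difficulty in the
    bottom 25%, with groups defined by total test score."""
    totals = scores.sum(axis=1)
    order = np.argsort(totals)            # examinees from lowest to highest total
    n = max(1, int(len(totals) * frac))
    bottom, top = order[:n], order[-n:]
    return scores[top, item].mean() - scores[bottom, item].mean()

def item_total_correlation(scores, item):
    """Pearson correlation between one item's scores and total test scores.
    Note: an item everyone answers the same way has zero variance and yields nan."""
    totals = scores.sum(axis=1)
    return np.corrcoef(scores[:, item], totals)[0, 1]
```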
3: Item Bias
Bias in testing = errors in measurement that are associated with group membership
Because effective test questions are so difficult to develop, and so many resources must be used to
produce useful test items that perform well, test security is important.
During the planning phase, be mindful of test-taker characteristics that may influence performance:
- Educational level
Impacts ability to read, write, and work with numbers as well as higher order thinking
Tends to differ from rural to urban areas
Include questionnaire on quality and level of schooling?
Develop separate norms i.e. differently graded tests depending on educational background?
- Language
If a test is given in an unfamiliar language, it’s difficult to tell whether poor performance
is due to language / communication difficulties or whether the test-taker has a low level of the
construct being assessed
Provide evidence that language is appropriate for intended populations
Should different tests be constructed for different languages? Translation may be difficult
Can produce tests with different language versions: bilingual or multilingual
- the same construct could be interpreted and understood in very different ways in various
cultural and language groups
- Is the construct meaningful (i.e. of value, important) to different cultural groups?
- Is the construct defined the same way in different cultures? Is there a shared
understanding?
E.g., Eastern & Western associations with intelligence
A test consists of a stimulus (item) to which the test-taker responds using a specified response mode.
There are various modes in which a test is presented (e.g., paper-based, computer-based), various
item formats (e.g., multiple choice, performance tasks), and various response modes (e.g., verbal,
written, typing on a computer keyboard).
- Developers must provide evidence that all formats are familiar to target populations.
Shouldn’t be assumed that the different presentation and response modes or item formats
are equally familiar to and appropriate for all cultural groups
Numbers, dates, time, currency
Icons, symbols, colours
Computer-based tests: there are differing levels of computer familiarity and technological
sophistication among various cultural and socio-economic groups in South Africa
Could include a preparatory tutorial
- Omit items
- Balance number of familiar/unfamiliar items
- Practice items that provide opportunity to familiarise test-takers with unfamiliar item types
or content
Reliability
- Consistency in measurement
- How much error we have in a measurement
So it’s the precision with which test scores measure achievement
Physical apparatus (e.g. polygraph): other factors like feeling embarrassed (may appear that they’re
lying)
In mathematical terms: x = T + e
- x – observed score
- T – true score
- e – error
We can estimate someone’s true score by taking the average of all their observed scores
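A tiny simulation of x = T + e (all numbers hypothetical) shows why averaging many observed scores estimates the true score:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 50                                # true score (unknown in practice)
e = rng.normal(0, 5, size=1000)       # random measurement error, mean 0
x = T + e                             # observed scores over 1000 hypothetical administrations

print(x[:3])       # individual observed scores scatter around 50
print(x.mean())    # the mean of observed scores converges on T as errors cancel
```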
If we construct a test on something, we can’t ask all possible questions, so we only use a few test
items (we sample from all the possible test questions / the domain)
But using fewer test items can lead to the introduction of error because the test items may or may
not adequately sample the domain / construct.
Our task in reliability analysis is to ascertain how much error we would make by using a score from a
shorter test as an estimate of true ability
We can’t ask a test-taker every word in the dictionary, so we ask them to spell a subset (sample) of
words
1. Test-retest reliability
2. Parallel forms reliability
3. Internal consistency
a. Split-half reliability
b. Kuder-Richardson 20 reliability
c. Coefficient / Cronbach’s alpha
4. Inter-rater reliability
1: Test-Retest Reliability
Give someone a test then give it to them again at another time
The correlation between 2 scores is known as the co-efficient of stability – refers to stability of the
score
The source of error measured is time sampling – scores differ due to a factor occurring over time
Can’t really be used when measuring things like mood and stress as they fluctuate naturally over
time and may change between 1st and 2nd administrations – co-efficient of stability will be low
Someone’s score can improve the second time due to testing effects (learning what to expect,
improving)
Something could happen in between administrations that changes that which is being tested
The thing you’re trying to measure could change through being tested (e.g. being tested for
empathy: may want to appear more empathetic)
2: Parallel Forms Reliability
Two equivalent forms of the same test are administered and the two sets of scores are correlated
The source of error measured is item sampling – difficulty levels unequal across tests
What if the different forms are given to people at two different times?
Should the different forms be given to the same people or to different people?
- People might work out how to answer the one form from doing the other form
Need to be able to make two forms of the test in the first place
3: Internal Consistency
Measures whether different items within one test all measure the same thing to the same extent
The source of error measured is internal consistency – items are linked to the overarching construct
a: Split-half reliability
A test is split in half. Each half is scored separately and total scores for each half are correlated
Challenges: different ways of splitting the test give different reliability estimates, and each
half is shorter (and therefore less reliable) than the full test
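A sketch of an odd-even split, assuming the same kind of 0/1 scores matrix used above; the Spearman-Brown step (the standard correction, though not named in these notes) adjusts for each half being only half the test’s length:

```python
import numpy as np

def split_half_reliability(scores):
    """Correlate odd-item and even-item half scores, then apply the
    Spearman-Brown correction for the halved test length."""
    odd = scores[:, 0::2].sum(axis=1)    # items 1, 3, 5, ...
    even = scores[:, 1::2].sum(axis=1)   # items 2, 4, 6, ...
    r_half = np.corrcoef(odd, even)[0, 1]
    return 2 * r_half / (1 + r_half)     # estimated reliability of the full test
```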
b: Kuder-Richardson 20 (KR20)
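The notes don’t reproduce the formulas; for reference, the standard KR-20 (for items scored 0/1) and Cronbach’s alpha (its generalisation to non-dichotomous items) are:

```latex
KR_{20} = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} p_i q_i}{\sigma_X^2}\right),
\qquad
\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} \sigma_i^2}{\sigma_X^2}\right)
```

where k = number of items, p_i = proportion answering item i correctly, q_i = 1 − p_i, σ_i² = variance of item i, and σ_X² = variance of total test scores.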
4: Inter-Rater Reliability
Measures how consistently 2 or more raters / judges agree on
something’s rating
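Percent agreement is one simple way to quantify this (the ratings below are hypothetical; Cohen’s kappa, which corrects for chance agreement, is a common alternative):

```python
# Hypothetical ratings of the same 6 responses by two judges
rater_a = [1, 0, 1, 1, 0, 1]
rater_b = [1, 0, 0, 1, 0, 1]

agreement = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)
print(agreement)  # 5 of 6 ratings agree -> ~0.83
```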
Improving reliability:
- A sample that’s representative of the wider population gives a better estimate of a test’s
reliability
- Control extraneous variables
- Increase the number of items (but be sure not to add too many or bad items)
- Item analysis: there are several ways to assess whether items/questions are ‘good’ or ‘bad’ and
therefore need to be discarded
- Item discriminating power = the extent to which an item measures the same aspect of what
the total test is measuring
- Item bias
- Item difficulty
Make sure that which you aim to measure has been clearly conceptualised
Validity
Refers to whether or not a test measures what it intends/claims to measure
- To be able to make accurate inferences from scores on a test, giving meaning to a test score
- i.e. validity indicates the usefulness of a test
A test must be considered reliable before it can be considered valid. But if it isn’t
valid, it doesn’t matter if it’s reliable.
Types of Validity
1. Face validity
2. Content validity
3. Criterion validity
• Concurrent
• Predictive
4. Construct validity
• Convergent
• Discriminant
1: Face Validity
When a test seems on the surface (on its face) to measure what it’s supposed to measure
But a test can have good face validity without actually being valid
Measured by researchers looking at items and giving their opinions regarding whether the items
appear to measure what they’re trying to measure
The least scientific of all the measures of validity, as it’s determined by researchers’ opinions. It’s not
enough to just have face validity.
1. Readability
2. Feasibility
3. Layout and style
4. Clarity of wording
- How relevant and interesting the test seems to participants impacts whether they want to take it
or not
- If the test looks valid, clinicians etc are more likely to trust and therefore buy it
- Tests with low face validity usually have low reliability (because there might be issues with
readability, feasibility, clarity of wording etc)
For these reasons, face validity is insufficient for claiming that a test is valid.
2: Content Validity
Degree to which a test measures an intended content area
- E.g., does an IQ questionnaire have items covering all areas of intelligence discussed in the
scientific literature?
In other words: there’s a correspondence between the test items and the content domain
1. Specifying the content area covered by the phenomenon when developing the construct
definition (early stages of test development)
2. Writing questions/scale items that are relevant to each of the content areas
3. Developing a measure of the construct that includes the best / most representative items
from each content area
1. Construct under-representation: when the test fails to capture important aspects of the construct
E.g., a test of PTSD that does not have questions relating to vividly re-experiencing the traumatic
event
2. Construct-irrelevant variance
When test scores are influenced by things other than the construct the test is supposed to measure
3: Criterion Validity
How well a test score estimates or predicts a criterion behaviour outcome, now or in the future
E.g. The depression inventory estimates what depressive behaviour the person displays
Predicting is easier for ability tests, but harder for personality and attitude tests
Criterion Validity: How well test performances estimate / predict current and future performance on
some valued measure (other than the test itself)
Concurrent Validity: The extent to which test scores can correctly identify the CURRENT state of
individuals
- Measured by correlating scores from the new test with scores from an already established
test
- E.g., do results from a new intelligence test correlate with the Wechsler IQ test?
Predictive Validity: Do scores on a test predict a FUTURE event successfully? Does the measurement
correctly predict the score on a future test?
- The test = the predictor variable, and the future event = the criterion variable
- Matric math score = predictor. Success at Psychology statistics = criterion
E.g. Do scores on a test of acute stress disorder predict scores on a test of PTSD after a few weeks?
E.g. Are NBT results (predictor variable) correlated with first-year university scores (criterion
variable)?
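A minimal sketch of checking predictive validity (all marks hypothetical): the correlation between predictor and criterion scores serves as the validity coefficient:

```python
import numpy as np

# Hypothetical data: matric maths marks (predictor) and first-year statistics marks (criterion)
predictor = np.array([55, 60, 72, 80, 65, 90, 48, 70])
criterion = np.array([50, 58, 70, 75, 60, 88, 52, 66])

validity_coefficient = np.corrcoef(predictor, criterion)[0, 1]
print(validity_coefficient)  # closer to 1 = stronger prediction of the future criterion
```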
4: Construct Validity
The extent to which the instrument measures a theoretical construct
E.g., depression, empathy, or intelligence. A construct can’t be directly measured the way a physical
quantity like water can be; it isn’t directly observable and measurable.
2. What observable behaviours can we expect of a person with a high or low score on the test
measuring the construct?
What is the relationship among these behaviours?
For a test to have good construct validity, there needs to be evidence for both convergent and
discriminant validity.
1: Reliability
- E.g., a test of anxiety may actually measure stress. It might measure stress very well but it is
not measuring what you want it to
- You cannot establish that a test measures what it is supposed to if it doesn’t consistently
measure what it is supposed to
2: Social Diversity