Psychometrics notes

Research Methods I (University of Cape Town)


Psychometrics
10, 23, & 24 April (Class Test material)

o Introduction to Psychometrics

o The History of Psychological Testing

7 May (Exam material)

o Test Development

14 & 15 May (Exam material)

o Reliability

o Validity

Recommended reading:

 Downing, S. M. (2006). Twelve steps for effective test development. In S. M. Downing & T. M. Haladyna (Eds.), Handbook of test development (pp. 3-25). London: Lawrence Erlbaum Associates. (available on Vula)

 Foxcroft, C. D. (2004). Planning a psychological test in the multi-cultural South African context. South African Journal of Industrial Psychology, 30, 8-15. (available on Vula)

 Gregory, R. J. (2003). History of psychological testing. In R. J. Gregory (Ed.), Psychological testing: History, principles and applications (pp. 1-28). Available at: http://www.ablongman.com/partners_in_psych/PDFs/Gregory/gregory_ch01.pdf

 Kanjee, A. (2006). Assessment research. In M. Terre Blanche, K. Durrheim, & D. Painter (Eds.), Research in practice (pp. 484-493). Cape Town: University of Cape Town Press. (prescribed textbook)

What is Psychometrics?
Simply put = The study of Psychological measurement

The measurement of mental capacity, thought processes, aspects of personality, etc., especially by
mathematical or statistical analysis of quantitative data

The science or study of this measurement

The construction & application of psychological tests

What is measurement?

The assignment of numbers to objects & events according to rules


The assigning of numbers to individuals in a systematic way as a means of representing properties of the individual

Measurement allows for comparison, analysis and evaluation

What is a test?

Measurement devices or techniques used to quantify behaviour – to aid in understanding & prediction of behaviour

Defining Psychometrics: 3 Important Aspects

1: Psychometrics measures MANY different things

- Mental processes, personality, behaviour, intelligence, cognition, etc.

2: Psychometrics aims to operationally define & QUANTIFY the things it measures

- E.g., anxiety, intelligence, empathy

And then to be able to predict the things it measures

3: Psychometrics is concerned with constructing tests to measure things and evaluating how good
these tests are

Basic terms and concepts

Psychological measurement is the measurement of human behaviour

• Overt behaviour

 Observable

 E.g., time needed to put 10 pegs in a peg board, time taken to tap fingertips
10 times

• Covert behaviour

 Mental, social, or physical action or practice that is not immediately observable

 E.g., feelings of anxiety, depression

Psychological Tests + Behavioural Samples


We can’t measure every aspect of every type of mental process, behaviour, emotional process, etc., that a person has, so we take a sample of that behaviour through psychological tests


A psychological test is therefore:

“A test designed to provide a quantitative analysis of a person’s mental capacities or traits, typically
as shown by responses to a standard series of questions or statements.”

A set of items that is designed to measure characteristics of human beings that pertain to behaviour

The process of measuring psychology-related variables by means of devices or procedures designed to obtain a sample of behaviour

A systematic procedure for obtaining samples of behaviour relevant to cognitive or affective functioning, & for scoring & evaluating those samples according to standards

Trials = the individual turns or attempts a test-taker makes on a test.

How do we quantify behaviour?

Through test items

- A test question; something a person must do in a test; a task in a test a person must
perform
- NB: A test item is not necessarily always a question

E.g., can be spotting a missing detail in a picture, drawing a picture

Test items are therefore defined as: “A specific stimulus to which a person responds overtly.”

Types of tests:

Modes of administration:

1: Individual tests

• Designed to be administered to 1 person at a time

• Useful for collecting comprehensive info

• Usually some degree of subjectivity in the scoring. Not just about wrong + right.

• Limitations: Time, cost, & labour-intensive

• E.g., some personality tests, some IQ tests (Wechsler Adult Intelligence Scale), the
Rorschach test


2: Group tests

• Designed to be administered to more than 1 person at a time (i.e., mass testing)

• Scoring is usually more objective. There is a norm. Right and wrong answers.

• Economical & time-efficient: quick to administer and mark.

• E.g., university class tests (especially MCQs)

Dimensions measured:

A: Ability tests

1. Achievement tests

 Measures a person’s past/previous learning through accomplishment of a task

2. Aptitude tests

 Measures person’s potential for learning a specific skill/task under provision of training

3. Intelligence tests

 Refers to person’s general mental abilities: ability to solve problems, adapt to changing circumstances, think abstractly, and benefit from experience

 E.g., Stanford-Binet, WAIS test

B: Personality tests

• Measure typical behaviour: traits, temperaments, dispositions, etc.


• Designed to measure a person’s individuality
• Tests can help in predicting future behaviour
• E.g., 16PF questionnaire

1. Structured (objective) personality tests:


 Use self-report statements
 Person chooses between 2 or more alternative responses
 E.g. True/false, yes/no, strongly agree/agree/disagree etc.

2. Projective (unstructured) personality tests:


 An ambiguous stimulus is provided. Response requirements are unclear.
 Test-takers required to respond spontaneously
 Assumes that a person’s interpretation of an ambiguous stimulus will reflect their
unique characteristics. They will project their personality into their answers.


 E.g. Rorschach, Thematic Apperception Test (TAT)

TAT:

Provides ambiguous pictures; the test-taker has to make up a story about each

Story includes: What had led up to the event; What is happening in the moment;
Emotions & thoughts of characters; What the outcome of the story was

Story content & structure thought to reveal:

- Person’s attitudes, inner conflicts, & inner motives


- Needs for achievement, power, intimacy, & views

31 pictures in standard TAT

Performance Interpretation / Assessment: Norm- & Criterion-Referenced Tests

Norm-referenced tests

Test score is judged against the distribution of scores obtained by the other test-takers

This distribution is called the norm

Compares an individual’s results on the test with a statistically representative sample

Ranks the performance of a student within a particular group

E.g., “You fall in the 90th percentile”

Criterion-referenced tests

Compares an individual’s results to a criterion or expected level of performance

Test-taker’s score compared to objectively stated standard of performance on that test

Establish standard/criterion & mark student against it

E.g., Getting 60% for an exam

Summary
Psychometrics is the quantitative measurement of mental capacity, thought processes, aspects of
personality, etc.

- Overt or covert behaviours


- Measured by test items


Types of tests (mode of administration)

- Individual tests
- Group tests

Categories of test (dimension measured)

1: Ability tests

 Achievement
 Aptitude
 Intelligence

2: Personality tests

 Structure
 Projective

Mode of interpretation (performance assessment)

1. Norm-referenced tests
2. Criterion-referenced tests

PSYCHOLOGICAL TESTING: Introduction and history

Ancient History of Psychological Testing


Rudimentary forms of testing date back some 4,000 years to ancient China (ca. 2200 BC)

Officials of the Emperor were examined every 3 years to determine their fitness for office

Tests required proficiency in civil law, military affairs, geography etc

But good penmanship was important

Went through 3 rounds of rigorous exams with very low pass rates:

Preliminary exam on Confucian classics:

Candidates had to spend a day & night in an isolated booth composing an essay and
writing a poem

Only 1-7% of them passed

Moved on to the district exam

Three separate sessions of 3 days & nights

Gruelling & rigorous, only 1-10% passed


Moved on to the Peking final-round exam

Only 3% of this group passed

Became mandarins, eligible for public office

China can therefore be said to have developed the first civil service examination program

Contributions of the Han Dynasty to Modern Testing

 Names of candidates had to be concealed

 Independent assessments

 Conditions of examinations had to be standardised. I.e., the same for everyone

History of Modern Psychometrics: Wundt

Experimental psychology flourished in the late 1800s in continental Europe and Great Britain.

The problem was that the early experimental psychologists mistook simple sensory processes for
intelligence.

- They used assorted brass instruments to measure sensory thresholds and reaction
times, thinking that such abilities were at the heart of intelligence.
- Hence, this period is sometimes referred to as the Brass Instruments era of psychological
testing, an approach introduced by Wundt

Wilhelm Wundt: Father of psychology (1879)

Wundt’s ‘Thought Meter’: Wundt thought that the difference between the observed pendulum
position and the actual position would provide a means of determining the swiftness of thought of
the observer

Wundt believed that the speed of thought might differ from one person to the next

Demonstrated empirical analysis that sought to explain individual differences: relevant to current
practices in psychological testing.

Overly simplistic: overlooks other factors like attention, motivation and self-correcting feedback
from other trials.

History of Modern Psychometrics: Galton


Francis Galton: Father of modern psychometrics

Obsessed with measurement: Believed everything was measurable

Attempted to measure intellect by means of reaction time and sensory discrimination


Was more interested in the problems of human evolution than psychology (Galton was Charles
Darwin’s cousin)

1869: Hereditary Genius: An inquiry into its Laws & Consequences

• Studied the genealogy of famous scientific families (including his own)

• Argued this genius was genetic in origin

 Derived from Darwin’s theories

• White, English middle-class men were the best

 Apes, ‘savages’, races of the colonies, the Irish, & English working class were
inferior

Attempted to measure intellect by things such as visual, auditory, & weight discrimination,
threshold levels, etc.

- Galton’s simplistic attempts to gauge intellect with measures of reaction time and
sensory discrimination proved fruitless.

Set up psychometric laboratory in London. Tests involved physical & behavioural domains

Galton’s Contributions

ADAPTED Wundt’s psychophysical brass instruments to a series of single & quick sensorimotor
measures: Allowed him to collect data from thousands of subjects quickly

Demonstrated that objective tests could be devised and that meaningful scores could be obtained
through standardized procedures.

Laid the groundwork for the Pearson product-moment correlation (formalised by his protégé Karl Pearson) for analysing data from these experiments

Came up with twin studies to study hereditary factors

History of Modern Psychometrics: Cattell


Also interested in sensory discrimination

Studied experimental psychology with Wundt & Galton

Examined relationship between academic grades, psycho-sensory tests, & size of the brain, & shape
of the head

Invented the term mental test

Interested in creating comprehensive, standardized tests. Proposed a battery of 10 mental tests

- These tests were clearly a reworking and embellishment of the Galtonian tradition


- They had physiological and sensory bias similar to that of Galton


- Tests include strength of hand squeeze, rate of hand movement, reaction time to sound,
remembering numbers heard

Psychometrics and Intelligence Testing


One of the things leading to development of psychology & psychometrics was intelligence testing

Intelligence seen as being part of a person’s make-up (i.e., genetically determined)

Alfred Binet
First to develop an intelligence test

 The Binet-Simon intelligence test (1905)


 Wanted to separate children with intellectual disabilities from normal children in Parisian
schools

Argued that intelligence could be better measured by means of the higher psychological processes
rather than the elementary sensory processes such as reaction time.

Binet-Simon (1905) differed from the Brass Instruments tests:

1. Did not measure any single faculty – assessed child’s general mental development through
a heterogenous group of tasks
2. Aim was classification – not measurement
3. Brief & practical test, taking less than an hour to administer and requiring little equipment
4. Directly measured what Binet and Simon regarded as the essential factor of intelligence—
practical judgment—rather than wasting time with lower-level abilities involving sensory and
motor elements
5. Items were arranged by approximate level of difficulty – instead of content

Binet saw intelligence as good judgment

How does this relate to the modern view of intelligence? And the common sense, everyday view?

- Today: intelligence seen as how well you can adapt to circumstances

Binet-Simon intelligence test (1905) had 30 items

- Heavily weighted towards verbal skills


- Did not offer precise method for arriving at a total score
- Purpose was classification, not measurement

Published a revised Binet-Simon test in 1908

- Had 58 items


- Most of the very simple items were dropped and new items were added at the higher
end of the scale
- Introduced the concept of a mental level as the test had been standardized

Binet-Simon published a third revision in 1911

- Each age level now had five tests


- Scale was extended into the adult range

‘Mental Level/Mental Age’

Within months, what Binet called mental level was being translated as mental age.

In his writings, Binet emphasized strongly that the child’s exact mental level should not be taken too
seriously as an absolute measure of intelligence.

Influenced intelligence testing throughout 20th Century

People were comparing mental age with chronological age

Then came the idea of the intelligence quotient (IQ) - Terman

Galton and Binet’s Influence in the US

Galton: intelligence is hereditary and unchangeable. Binet: intelligence can be improved through
special training

In America, IQ tests were welcomed as a way to assess the intellect of immigrants & potential
soldiers

A ‘scientific’ way to create order out of chaos

The Stanford-Binet Scale

Developed by Lewis Terman in 1916 (who also suggested multiplying the intelligence quotient by
100, and was the first to use the abbreviation IQ)
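The ratio IQ that Terman popularised follows the standard formula (a worked illustration, not from the slides):

$$\mathrm{IQ} = \frac{\text{mental age}}{\text{chronological age}} \times 100$$

E.g., a child with a mental age of 10 and a chronological age of 8 has an IQ of (10 / 8) × 100 = 125.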

Adapted Binet test for American schools & adults

 Stanford-Binet Scale
 5th version of this test still in use today

Heavy reliance on language/vocabulary skills

Now contained the familiar term “IQ” for expressing test results

Number of items increased to 90

Had clear & well-organised instructions for administration & scoring


Had standardised it using a representative sample of subjects

Summary
 Sir Francis Galton (1869)

• Believed ‘genius’ was hereditary

• Used psychophysical methods

• Believed sensory discrimination & reaction times defined intelligence

 James Cattell (1890)

• Took Galton’s ‘Brass Instruments’ to USA

• Invented the term ‘mental test’

• Proposed a battery of 10 tests

 Alfred Binet (1905; 1908; 1911)

• Developed the Binet-Simon Intelligence Test (for ages 3-13) to identify special-needs children;

• Believed intelligence was ‘good judgment’ & changeable

• 1908 revision invented the term ‘mental level/age’;

• 1911 revision extended the test into the adult range

 Lewis Terman (1916)

• Coined the abbreviation IQ by suggesting multiplying the intelligence quotient by 100

• Developed the Stanford-Binet Scale for US schools & adults

Intelligence testing and eugenics


Intelligence testing and psychometrics developed through the need to identify the “feeble-minded”

Early uses and abuses of tests


Henry Goddard (1906)

• Hired to do research on classification & education of feeble-minded children

• Needed a diagnostic instrument

• Tested normal children

 Children with mental age 4 or more years behind chronological age were
feeble-minded (this constituted 3% of his normal sample)

 Needed to be segregated


• Invited to Ellis Island by Commissioner of Immigration

• Convinced that rates of feeble-mindedness were higher than estimated

• Tests administered through a translator

 Immigrants often tired, scared, & confused

• Translated from French to English, then retranslated to immigrant language, but interpreted according to French norms

• Found feeble-mindedness in:

 83% of Jews

 80% of Hungarians

 79% of Italians

 87% of Russians

• Suggested deportation, or: “We might be able to use moron labourers only if we are
wise enough to train them properly.”

Eugenics
The science of using controlled, selective breeding to improve hereditary qualities of the human
race & create superior individuals

Concerned about the lower classes breeding too quickly

- Lowering the average standard of intelligence

The ideological forerunner to Nazism

Positive eugenics: encouraging reproduction of the genetically “fit”

Negative eugenics: aims to prevent those deemed psychically, mentally, or morally unfit to procreate
(through sterilisation and segregation)

Eugenics Movement:

Positive eugenics:

‘Fitter Family Contests’

- Judged on physical, mental, moral traits

- Examined by doctors, social workers, historians, dentists

- Grades given to each member along with examination record

- Sent to Eugenics office


Purpose = To encourage the fit to reproduce, raise racial consciousness, bring public awareness to
eugenics agenda, emphasize value of marriage & family

Negative eugenics

Forced sterilization: hope was to stop problems of mental illness, crime, and low IQ

Psychometrics and the wars: WWI army recruits

Robert Yerkes & the rise of group tests

Quick, effective, efficient way of evaluating emotional & intellectual functioning of soldiers

Stanford-Binet adapted to multiple choice tests in 1917 for use by US Army (Alpha & Beta)

- Ease of administration & scoring


- Does not need to be administered by trained professionals
- Lack of subjectivity

Psychometrics and the wars: Army Alpha and Army Beta


Designed for WW1 recruits

Segregate & eliminate mentally incompetent, & classify men according to mental ability

1. Army Alpha
- For English literates
- Oral directions, arithmetic, practical judgment, analogies, disarranged sentences, number
series, information

2. Army Beta
- For non-English, & non-literates
- Memory, matching, picture completion, geometric construction

Psychometrics and the wars: WWII


Personality tests caught on for screening recruits

Early tests were structured paper-and-pencil tests

Later, projective tests. E.g., Rorschach

After the war: psychometric testing and racism


Army testing led to an explosion of psychometric testing outside of the military

 E.g., school testing, Scholastic Aptitude Tests (SAT)

Jensen & Eysenck (1960s)


 Used Binet IQ test to show that black American children had a lower average IQ than white
children
 This difference was due to genetics

Herrnstein & Murray (1994)

 Argue that poor black communities in the US are a ‘cognitive underclass’ formed by
interbreeding of people with low IQ

Critiques of mental testing:


Testing came under attack by advocates for under-privileged

Mental tests require knowledge & cultural values rather than innate intelligence – biased towards
white middle class

Tests are culturally biased

Correlation does not imply causation

Psychometric testing and racism in SA


Fick (1929) applied tests developed for & standardized on whites only to white & black children

 Used Army Alpha & Army Beta tests


 White children of course got the highest scores

Originally proposed cultural, environmental, educational, & social reasons for discrepancy

 Later suggested that due to differences in ‘innate abilities’ between whites & blacks, & not
external factors

Use of tests gained momentum after WWII and 1948 when the NP came into power

Arose from need to identify occupational suitability of black people

Measures of intellectual ability used to draw distinction between races

Justify superiority of one group over another

1960’s onwards in SA

Measures developed along racial lines

View was that there was little need for common tests as groups were separate and did not compete
with one another

Competition for some jobs during 80s & 90s raised questions such as:

 How can you compare scores if different measures are used?


 How do you appoint people if different measures are used?

The current challenge

A drive in SA & internationally to find tasks that are not biased towards race, culture, gender, &
language

- or to adapt existing tests to be race, culture, gender, & language appropriate

Test development

Evaluation:

There will be Psychometrics sections in both the class test and final exam.

In the Psychometrics section on your final exam, you will only be tested on the portion of work
covered after your class test (i.e., only Test Development, Reliability, and Validity). For the
Psychometrics section you will have to answer short answer questions worth 40 marks in total.

Note: All material covered on your lecture slides is examinable. Use the recommended readings
above to supplement your understanding of what is covered during lectures.

Steps to developing a measure

1. Overall goal & pre-planning

2. Content definition

3. Test specifications

4. Item development

5. Test design & assembly

6. Test production

7. Test administration

8. Scoring responses

9. Establishing passing scores

10. Reporting results

11. Item banking

12. Test technical report


1: Overall goal and pre-planning


Provides systematic framework for all test development activities

Outlines all essential tasks to be accomplished for a successful testing program.

A priori decisions:

 What: construct is the test designed to measure? What are the aim (what) and purpose (why) of the test, its administration modality, and its format?
 Why: is the test needed? What need does it address?
 Who: will use the test? Who will take the test? Who produces, publishes, or prints the test? Who creates and reviews the test items?
 When: will it happen? (Decide on a timeline.)

Test security & quality control

Specifying the aim of the measure:

The aim is what you want to achieve with your test. Purpose = the why: what are the test scores
going to be used for?

Can have different aims for the same construct, e.g., stress

 To measure how someone reacts to stress


 To measure how much stress a person is experiencing at the moment

Aim is informed by the purpose of the test

Other considerations when specifying the aim:

Are we going to use our questionnaire for screening or in-depth assessment / diagnosis?

Screening purposes

 Fewer items & less content covered


 Quick & easy to administer

Detailed assessment (e.g., diagnostic purposes)

 More items & content covered


 More reliable, but more time-consuming
 Might require special training

What decisions can we make based on a score on the test? Depends on length and purpose of the
test (screening or diagnoses)

What type of measure are we going to use, based on our aim?


Normative: Compares scores to a norm group

 How stressed is the person compared to the average SA citizen?

Ipsative: Intra-individual comparisons

 In what areas is the person the most (or least) stressed – family life, work life, etc.?

Criterion-referenced: Performance compared to pre-defined standard

 E.g. To evaluate a potential employee’s ability to handle stress by giving them a large amount
of tasks to do in a short period of time
 Completing at least half of the tasks = handles stress well (completing half of the tasks would
be defined as the cut-off score)

2: Content definition
Defining what content the test aims to measure

Need to operationally define the construct that you are measuring

 Operationalisation: The act of making a fuzzy concept measurable

Operational definition

 Defined in terms of how to observe/measure the concept


 Concepts often have many indicators – related but distinct items that make up that concept
(e.g. indicators of depression include tiredness / fatigue, suicidality, feelings of
worthlessness)
 How could we operationally define stress?

Theoretical (conceptual) definitions

 Defined in terms of a concept’s relationship with other concepts


 E.g., stress defined as: hardship, adversity, affliction, feeling of strain, & pressure

Dictionary definition

 E.g., stress defined as: difficulty that causes worry or emotional tension

Example: intelligence

- Conceptual definition: The capacity for abstract thought, understanding, communication, reasoning, learning, planning and problem solving.
- Operational definition: The score resulting from performing the Raven’s Progressive Matrices Test.

Defining the test’s purpose:

- Aim = what exactly are we going to measure?


- Purpose = why: what are we going to use the scores on the test for?

Example: Purpose = to inform a behavioural intervention


- Aim = to measure behavioural reactions to stress


- Purpose = to inform how to change behavioural reactions to stress in a behaviour
modification programme

3: Test specifications
The test blue-print

A complete operational definition of test characteristics

Test specifications should describe:

1. Test/response format
2. Item format
3. Test length
4. Content areas of the construct/s tested
5. Whether items will contain visual stimuli
6. How tests scores will be interpreted
7. Time limits

1: Test/Response format

How will participants demonstrate their skills? How does the construct manifest through the test?

- Selected response: e.g., Likert scale, MCQ, dichotomous


- Constructed response: e.g., essay question, fill-in-the-blank
- Performance response: e.g., block design task

Objective vs. subjective formats

- Objective: very structured, person usually picks only one response (e.g., MCQ)
- Subjective: interpretation of response depends on examiner judgment, so more
unstructured (e.g., projective tests)

2: Item format

Open-ended items

- No limitations on the test-taker


- E.g., describe how you normally behave when you have to confront a work colleague

Forced-choice items

- E.g., MCQs, true false questions

Sentence-completion items

- E.g., “The most important thing in life is _______.”

Performance-based items


- E.g., writing an essay, oral presentations

3: Test length

Depends on:

- Amount of administration time available


- Purpose of the measure (e.g., screening vs. comprehensive)

Compliance is lower when the number of items is higher

- People get fatigued, bored, etc.

Need at least 50% more items in initial version than final version

- You will discard bad items

How many items per area are being tested/assessed?

4: Test content areas

Ensure that all domains of the construct are tested

A test structure (test blueprint)

- Columns represent content areas (indicators of the construct)


- Rows represent manifestations (items that tell you something about the content area)

Example: the influence of superstitions:

Content Areas and Manifestations of Superstitions

MANIFESTATIONS                        | CONTENT AREAS / indicators
                                      | Chance events | Lucky items (charms)
Behaviour from superstitions          | C1M1          | C2M1
Belief in superstitions (cognitive)   | C1M2          | C2M2

4: Item development
Guidelines for writing the items

- Use clear wording to avoid ambiguity


- Use appropriate vocabulary (no jargon)
- Avoid double negatives (e.g., “I do not never get stressed out around exam time.”)


- Consider your audience: If writing a test for children (vs. adults), consider different format
such as visual
- Don’t make questions too obvious

5: Test design and assembly


Assembling

- Placement of correct items


- Check for errors (quality control)
- Manual or computer assembly?
- How are people going to answer the test?
- Does the test look aesthetically pleasing? Tests must be formatted to maximize the ease of
reading and minimize any additional cognitive burden that’s unrelated to the construct being
tested

Pre-testing: Giving the test to a representative sample from the target population to gather
information about it

- Do people answer how we expect?


- Are there any problems?
- Too easy or difficult?
- Time limits?

6: Test production
Production & printing

Finalising all items, their order, & necessary visual stimuli

Security of tests

Quality assurance procedures

- Randomly sample final printed booklets

7: Test administration
Most public & visible aspect of testing

Security is a major concern. Security problems during test administration can lead to the invalidation
of some or all examinee scores and can require the test developers to retire or eliminate large
numbers of very expensive test items.

Preferable to designate one highly experienced invigilator as ‘chief invigilator’ for the site to
supervise others

Standardization of testing conditions of utmost importance

- Control extraneous variables & make conditions identical for all examinees
- Examinee fairness, clarity of instructions, time limits
- Otherwise examinee scores difficult to interpret


8: Scoring responses
Test scoring is the process of applying a scoring key to examinee responses to the test stimuli.

Scoring criteria

How will we score?

- Develop a scoring key: how many marks particular questions get. The scoring key must be
applied with perfect accuracy to examinees’ item responses, so high levels of quality
control are necessary.
- When to ‘drop’ respondents from sample?
o Those who answer less than 75% of the questionnaire?
o Response bias to all questions?

Item analysis
Deciding whether to keep or discard items according to:

1. Item difficulty
2. Discriminating power
3. Item bias

When analysing test items, we have several questions about the performance of each item. Some of
these include:

- Are the items congruent with the test objectives?


- Are the items valid?
- Are the items reliable?
- How long does it take an examinee to complete each item?
- What items are easy/difficult?
- Are there any poor performing items that need to be discarded?

1: Item difficulty

= the proportion/percentage of individuals who answer the item correctly (also known as the facility
index)

- Higher proportion/percentage of correct responses = easier the item; and vice-versa


- Items with a facility index of less than 0.30 are considered too difficult and of more than
0.70 are considered too easy (so must potentially discard them)

Retrospective. Difficulty can only be analysed after the test has been administered.

Need to include items with a range of difficulty
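A minimal sketch of computing the facility index from a scored response matrix (toy data and variable names are illustrative, not from the notes):

```python
# Facility index: proportion of examinees answering each item correctly.
import numpy as np

# rows = examinees, columns = items; 1 = correct, 0 = incorrect (toy data)
responses = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 0],
    [0, 1, 0, 1],
])

facility = responses.mean(axis=0)  # proportion correct per item
for i, p in enumerate(facility, start=1):
    if p < 0.30:
        verdict = "too difficult"
    elif p > 0.70:
        verdict = "too easy"
    else:
        verdict = "acceptable"
    print(f"Item {i}: p = {p:.2f} ({verdict})")
```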

2: Item discriminating power

Item discriminating power = the extent to which an item measures the same aspect of what the
total test is measuring


Able to discriminate between high and low performers

Measured by:

- Discrimination index (D): Works by subtracting the item difficulty for people in the bottom
25% from item difficulty for people in the top 25%

Index Range    Interpretation                                  Action
-1.0 to -.50   Can discriminate but the item is questionable   Discarded
-.55 to .45    Non-discriminating                              Revised
.46 to 1.0     Discriminating item                             Include

- Item-total correlations: Correlation between the score on an item & performance on the
total measure
o Positive correlation = good discriminating power
o 0 correlation = no discriminating power
o Negative correlation = poor discriminating power
Items with correlations less than 0.20 & negative correlations are not retained
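A minimal sketch of both statistics (toy data; the 25% split follows the definition above):

```python
# Discrimination index D and item-total correlation for one item.
import numpy as np

rng = np.random.default_rng(0)
responses = (rng.random((40, 5)) > 0.4).astype(int)  # toy 0/1 answers
total = responses.sum(axis=1)

# D = item difficulty in the top 25% of scorers minus the bottom 25%
order = np.argsort(total)
k = len(total) // 4
bottom, top = order[:k], order[-k:]
item = 0
D = responses[top, item].mean() - responses[bottom, item].mean()

# Item-total correlation: item score vs. total test score
r_it = np.corrcoef(responses[:, item], total)[0, 1]
print(f"D = {D:.2f}, item-total r = {r_it:.2f}")  # retain if r_it >= 0.20
```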

3: Item Bias

Bias in testing:

 Errors in measurement
 Associated with group membership

Item bias in tests:

 Bias arises as a result of the item content


 Bias arises from differences in group performance on a test

9: Establishing passing scores


Many, but not all, tests require some type of passing score or performance standard.

Establishing norms, performance standards, or cut-off scores

Establishing reliability & validity across different test administrations

10: Reporting results


All examinees have a right to accurate, timely and useful reports of their performance, in
understandable language.


11. Item banking


Storing effective items for future use

Effective test questions are difficult and resource-intensive to develop, so items that perform well are stored for reuse

Security is important

12: Publishing and refinement


1. Compiling the test manual & technical report (usually very detailed: describes all important
aspects of test development, administration, scoring etc). Identifies potential threats to
validity and makes recommendations for improvement.

2. Submitting the measure for classification as a psychological measure

3. Publishing & marketing

4. Revision & refinement

Security is important

Developing multi-cultural tests

During the planning phase, be mindful of test-taker characteristics that may influence performance:

- Educational level
Impacts ability to read, write, and work with numbers as well as higher order thinking
Tends to differ from rural to urban areas
Include questionnaire on quality and level of schooling?
Develop separate norms i.e. differently graded tests depending on educational background?

- Language
If a test is given in an unfamiliar language, it’s difficult to find out whether poor performance
is due to language / communication difficulties or that the test-taker has a low level of the
construct being assessed
Provide evidence that language is appropriate for intended populations
Should different tests be constructed for different languages? Translation may be difficult
Can produce tests with different language versions: bilingual or multilingual

Defining the construct

- the same construct could be interpreted and understood in very different ways in various
cultural and language groups
- Is the construct meaningful (i.e. of value, important) to different cultural groups?
- Is the construct defined the same way in different cultures? Is there a shared
understanding?
E.g., Eastern & Western associations with intelligence


Test, item, & response modes and formats

A test consists of a stimulus (item) to which the test-taker responds using a specified response mode.
There are various modes in which a test is presented (e.g., paper-based, computer-based); various
item formats (e.g., multiple choice, performance tasks); and various response modes (e.g., verbal,
written, typing on a computer keyboard).

- Developers must provide evidence that all formats are familiar to target populations.
Shouldn’t be assumed that the different presentation and response modes or item formats
are equally familiar to and appropriate for all cultural groups
Numbers, dates, time, currency
Icons, symbols, colours
Computer-based tests: there are differing levels of computer familiarity and technological
sophistication among various cultural and socio-economic groups in South Africa
Could include a preparatory tutorial

If items are not familiar, decide whether to:

- Omit items
- Balance number of familiar/unfamiliar items
- Practice items that provide opportunity to familiarise test-takers with unfamiliar item types
or content

Reliability: Evaluating a Scale


In psychological tests, reliability means:

- Consistency in measurement
- How much error we have in a measurement

So it’s the precision with which the test score measures achievement

- The higher the better

Reliability in measurement = The desired consistency or reproducibility of test scores

No test is free from error

- We always assume that there is some random error

The distribution of random errors (and thus of observed scores around the true score) is assumed to be normal

Sources of unreliability in different measures:

Questionnaires: participants may interpret questions differently

Structured observations: involves subjectivity

Physical apparatus (e.g. polygraph): other factors like feeling embarrassed (may appear that they’re
lying)


In mathematical terms: x = T + e

- x – observed score
- T – true score
- e – error

The Four Assumptions of Classical Test Theory


1. Each person has a true score we could obtain if there was no measurement error
2. There is measurement error, but this error is random
3. The true score of an individual doesn’t change with repeated applications of the same test,
even though their observed score does
4. The distribution of random errors and thus observed test scores will be the same for all
people

We can estimate someone’s true score by taking the average of all their observed scores
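A minimal toy simulation of this idea (an assumed illustration, not from the notes): with x = T + e and random e, the mean of many observed scores approaches T.

```python
# Simulate repeated testings under classical test theory: x = T + e.
import numpy as np

rng = np.random.default_rng(0)
T = 70.0                              # hypothetical true score
errors = rng.normal(0, 5, size=1000)  # random error with SD = 5
observed = T + errors                 # 1000 observed scores

print(f"mean observed = {observed.mean():.2f} (true score = {T})")
print(f"SD of observed scores around T = {observed.std(ddof=1):.2f}")
```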

Standard Error of Measurement


We can work out how much measurement error there is by working out how much, on average, an
observed score differs from the true score.

- i.e., the standard deviation of the distribution of error (observed minus true) scores
- Also called the SEM: standard error of measurement
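In practice (a standard classical-test-theory result, not derived in these notes), the SEM is estimated from the test’s standard deviation $s_x$ and its reliability $r_{xx}$:

$$\mathrm{SEM} = s_x \sqrt{1 - r_{xx}}$$

E.g., a test with $s_x = 15$ and $r_{xx} = 0.91$ gives $\mathrm{SEM} = 15\sqrt{0.09} = 4.5$.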

The Domain Sampling Model


Another central concept of Classical Test Theory

If we construct a test on something, we can’t ask all possible questions, so we only use a few test
items (we sample from all the possible test questions / the domain)

But using fewer test items can lead to the introduction of error because the test items may or may
not adequately sample the domain / construct.

Our task in reliability analysis is to ascertain how much error we would make by using a score from a
shorter test as an estimate of true ability

Reliability = variance of the observed score on a short test ÷ variance of the true score
The observed test score should be correlated with the true score

As the sample gets larger, the estimate becomes more accurate

Example: Spelling Ability


We can’t ask a test-taker every word in the dictionary, so we ask them to spell a subset (sample) of
words

Does their mark reflect their true spelling skill?

- What if they were given very easy or difficult words?


- What if they were tired on one day but rested on the next?
- Their true score would stay the same but their observed score would vary
- Amount of error (e) would vary

Investigating a Test’s Reliability


Types of reliability:

1. Test-retest reliability
2. Parallel forms reliability
3. Internal consistency
a. Split-half reliability
b. Kuder-Richardson 20 reliability
c. Coefficient / Cronbach’s alpha
4. Inter-rater reliability

1: Test-Retest Reliability
Give someone a test then give it to them again at another time

If the scores are highly correlated, we have good test-retest reliability

The correlation between 2 scores is known as the co-efficient of stability – refers to stability of the
score

The source of error measured is time sampling – scores differ due to a factor occurring over time
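A minimal sketch of the coefficient of stability (toy scores, not from the notes):

```python
# Coefficient of stability: correlation between two administrations.
import numpy as np

time1 = np.array([12, 18, 25, 30, 22, 15])  # first administration
time2 = np.array([14, 17, 27, 29, 20, 16])  # same people, retested later

stability = np.corrcoef(time1, time2)[0, 1]
print(f"coefficient of stability = {stability:.2f}")
```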

Issues with test-retest reliability

Can’t really be used when measuring things like mood and stress as they fluctuate naturally over
time and may change between 1st and 2nd administrations – co-efficient of stability will be low

Someone’s score can improve the second time due to testing effects (learning what to expect,
improving)

Something could happen in between administrations that changes that which is being tested

The thing you’re trying to measure could change through being tested (e.g. being tested for
empathy: may want to appear more empathetic)

2: Parallel forms reliability


Two different forms of the same test are administered

Difficulty levels are relatively equal


E.g., two questionnaires with different items in each

The correlation between the 2 scores is known as the co-efficient of equivalence

The source of error measured is item sampling – difficulty levels unequal across tests

Changing the form of the test

The question responses can be reworded

The order of items can be changed (to reduce practice effect)

The wording of the question/item can be changed

- Two items must be equivalent; items should differ in wording only


- Items with different degrees of difficulty don’t measure the same attribute

Issues with parallel forms reliability

What if the different forms are given to people at two different times?

Should the different forms be given to the same people or to different people?

- People might work out how to answer the one form from doing the other form

Need to be able to make two forms of the test in the first place

3: Internal Consistency Reliability


The reliability of one test administered on one occasion.

Measures whether different items within one test all measure the same thing to the same extent

The source of error measured is internal consistency – items are linked to the overarching construct

Tests for internal consistency reliability:

1: Split-half reliability

A test is split in half. Each half is scored separately and total scores for each half are correlated

Advantage: only one test is needed (not two forms)

Challenges:

1. Dividing the test into equivalent halves


The correlation changes each time depending on which items are put into which half
2. Doesn’t reducing the length of the test by splitting it in half automatically reduce its
reliability? The fewer the items, the lower the reliability (remember the Domain Sampling
Model)
Solution: Spearman-Brown formula, which estimates the reliability of the full-length test from the half-test correlation:

corrected reliability = 2 × rhh / (1 + rhh)

rhh = correlation between the halves
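A minimal sketch of a split-half analysis with the Spearman-Brown correction (toy data generated for illustration):

```python
# Split-half reliability: correlate odd- and even-item half scores,
# then apply the Spearman-Brown correction for full test length.
import numpy as np

rng = np.random.default_rng(1)
ability = rng.normal(0, 1, size=(50, 1))
items = ability + rng.normal(0, 1, size=(50, 10))  # 50 people, 10 items

odd_half = items[:, 0::2].sum(axis=1)
even_half = items[:, 1::2].sum(axis=1)

r_hh = np.corrcoef(odd_half, even_half)[0, 1]
r_full = 2 * r_hh / (1 + r_hh)  # Spearman-Brown correction
print(f"r_hh = {r_hh:.2f}, corrected reliability = {r_full:.2f}")
```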


2: Coefficient / Cronbach’s alpha

Estimates the consistency of responses to different scale items.

Takes the average of all possible split-half correlations for a test.

Can overestimate reliability when there’s a large number of test items, even if the mean inter-item correlation is low

Reliability according to Cronbach’s alpha is affected by number of items

There’s a positive non-linear correlation between the number of items + reliability

- There’s a rapid increase in reliability from 2 – 10 items


- Increases steadily from 11 – 30 items
- Tapers off at about 40 items – so around 30-40 items is ideal in terms of reliability

Cronbach’s alpha can be affected by:

- Multidimensionality: questionnaire measuring more than one construct (constructs aren’t correlated)
- Bad test items
- Number of items

Interpreting Coefficient / Cronbach’s alpha:

- 0 = no consistency in measurement whereas 1 = perfect consistency in measurement


- 0.70 is appropriate for exploratory research
- 0.80 is appropriate for basic research
- 0.90 is appropriate for applied research
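A minimal sketch using the standard variance formula for alpha (toy data; the formula is the textbook one, not spelled out on the slides):

```python
# Cronbach's alpha = k/(k-1) * (1 - sum of item variances / variance of total score)
import numpy as np

rng = np.random.default_rng(2)
ability = rng.normal(0, 1, size=(100, 1))
items = ability + rng.normal(0, 1, size=(100, 8))  # 100 people, 8 items

k = items.shape[1]
item_vars = items.var(axis=0, ddof=1)
total_var = items.sum(axis=1).var(ddof=1)
alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)
print(f"Cronbach's alpha = {alpha:.2f}")
```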

3: Kuder-Richardson 20 (KR20)

A special case of Cronbach’s alpha for tests whose items are dichotomous (scored right/wrong); interpreted in the same way as alpha.
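The standard KR-20 formula (given here for completeness), where k = number of items, pᵢ = proportion answering item i correctly, qᵢ = 1 − pᵢ, and σ²ₓ = variance of total scores:

$$\mathrm{KR20} = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} p_i q_i}{\sigma_x^2}\right)$$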

4: Inter-Rater Reliability
Measures how consistently 2 or more raters / judges agree on something’s rating

Correlates the raters’ scores

Ranges from 1 (perfect agreement) to -1 (disagreement)

- > 0.75: excellent agreement


- .40-.75: satisfactory agreement
- < .40: poor agreement


Source of error is observer differences
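A minimal sketch correlating two raters’ scores (toy ratings, not from the notes):

```python
# Inter-rater reliability as the correlation between two raters' ratings.
import numpy as np

rater_a = np.array([3, 5, 4, 2, 5, 1, 4])  # ratings of 7 targets
rater_b = np.array([4, 5, 3, 2, 4, 2, 4])

r = np.corrcoef(rater_a, rater_b)[0, 1]
print(f"inter-rater reliability = {r:.2f}")  # > .75 = excellent agreement
```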

Factors Influencing a Test’s Reliability


Number of items in a scale / test

Variability of the sample

- A sample that’s representative of the wider population gives a better estimate of a test’s
reliability

Extraneous variables

- The testing situation


- Ambiguous or misleading items
- Un-standardised testing procedures
- Perceived demand effects

How to improve reliability

Increase the number of items (but be sure not to add too many or bad items)

Item analysis: there are several ways to assess whether items/questions are ‘good’ or ‘bad’ and
therefore need to be discarded

- Item discriminating power = the extent to which an item measures the same aspect of what
the total test is measuring
- Item bias
- Item difficulty

Use identical instructions with each test administration

Train the raters when using inter-rater reliability

Test the measurement scale by doing a pilot-run

Make sure the construct you aim to measure has been clearly conceptualised

Validity
Refers to whether or not a test measures what it intends/claims to measure

Aims of establishing validity:

- To be able to make accurate inferences from scores on a test, giving meaning to a test score
- i.e. validity indicates the usefulness of a test

Relationship between validity and reliability

A test must be considered reliable before it can be considered valid. But if it isn’t
valid, it doesn’t matter if it’s reliable.


You must demonstrate reliability BEFORE validity

Validity tells you how good a test is for a particular situation

Reliability tells you how trustworthy a score on that test is

Types of Validity
1. Face validity

2. Content validity

3. Criterion validity

• Concurrent

• Predictive

4. Construct validity

• Convergent

• Discriminant

1: Face Validity
When a test seems on the surface (on its face) to measure what it’s supposed to measure

How authentic the test/scale seems to participants

- If they doubt the test, their scores will be affected

But a test can have good face validity without actually being valid

Measured by researchers looking at items and giving their opinions regarding whether the items
appear to measure what they’re trying to measure

The least scientific of all the measures of validity, as it’s determined by researchers’ opinions. It’s not
enough to just have face validity.

Evaluated in terms of:

1. Readability
2. Feasibility
3. Layout and style
4. Clarity of wording

Face validity is important because:

- How relevant the test seems to participants impacts whether they want to take it or not. It
looks interesting to them
- If the test looks valid, clinicians etc are more likely to trust and therefore buy it


- Tests with low face validity usually have low reliability (because there might be issues with
readability, feasibility, clarity of wording etc)

Issues with face validity

- Many don’t consider this an actual measure of validity


- It doesn’t refer to what is actually being measured. Refers to what it appears to measure
- Is determined by a review of the items and not through statistical analyses

For these reasons, face validity is insufficient for claiming that a test is valid.

2: Content Validity
Degree to which a test measures an intended content area

- E.g., does an IQ questionnaire have items covering all areas of intelligence discussed in the
scientific literature?

In other words: there’s a correspondence between the test items and the content domain

- Domain sampling model: do the questions/items make up a representative sample of the attribute the test is supposed to measure?
- E.g., if you think Emotional Intelligence is a type of intelligence, an IQ scale that does not
include a measure of EI would not have content validity

Researchers work towards content validity by:

1. Specifying the content area covered by the phenomenon when developing the construct
definition (early stages of test development)
2. Writing questions/scale items that are relevant to each of the content areas
3. Developing a measure of the construct that includes the best / most representative items
from each content area

Aspects of content validity

1. Construct under-representation

The test doesn’t capture important components of the construct

E.g., a test of PTSD that does not have questions relating to vividly re-experiencing the traumatic
event

2. Construct irrelevant-variance

When test scores are influenced by things other than the construct the test is supposed to measure

E.g., scores on a depression test are influenced by a person’s level of anxiety

E.g., IQ tests might be influenced by reading ability


How is content validity established?

- Judgement by expert judges


They ask the question “how relevant is each item to the content domain?”
They independently examine the items and decide whether each of the items is weakly or
strongly relevant to the content domain of the construct
Ranges from 0 to 1 (the higher, the more valid)

- Can also use statistical methods


Factor analysis: items relating to each content domain should load on factors representative
of those domains

3: Criterion Validity
How well a test score estimates or predicts a criterion behaviour outcome, now or in the future

E.g. The depression inventory estimates what depressive behaviour the person displays

Predicting is easier for ability tests, but harder for personality and attitude tests

a. Concurrent Criterion Validity

Criterion Validity: How well test performances estimate / predict current and future performance on
some valued measure (other than the test itself)

Concurrent Validity: The extent to which test scores can correctly identify the CURRENT state of
individuals

- Measured by correlating scores from the new test with scores from an already established
test
- E.g. Results from a new intelligence test correlated with the Wechsler IQ test?

b. Predictive Criterion Validity


Predictive Validity: Do scores on a test predict a FUTURE event successfully? Does the measurement
correctly predict the score on a future test?

- The test = the predictor variable, and the future event = the criterion variable
- Matric math score = predictor. Success at Psychology statistics = criterion

E.g. Do scores on a test of acute stress disorder predict scores on a test of PTSD after a few weeks?

E.g. Are NBT results (predictor variable) correlated with first-year university scores (criterion
variable)?
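A minimal sketch of the predictor–criterion correlation (hypothetical numbers; names like nbt are illustrative only):

```python
# Predictive validity: correlate predictor scores with a later criterion.
import numpy as np

nbt = np.array([55, 70, 62, 80, 45, 68])         # predictor: NBT results
first_year = np.array([58, 72, 60, 85, 50, 65])  # criterion: first-year marks

validity = np.corrcoef(nbt, first_year)[0, 1]
print(f"predictive validity coefficient = {validity:.2f}")
```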


4: Construct Validity
The extent to which the instrument measures a theoretical construct

A construct = a postulated, hypothetical attribute.

- Something we think exists but is not directly observable or measurable.

E.g. Depression, empathy or intelligence. Can’t be directly measured like we can measure water. It’s
a construct that isn’t directly observable and measurable.

How is construct validity measured?

1. We look at the construct’s relationship with other constructs


To what other constructs is it similar and different?

2. What observable behaviours can we expect of a person with a high or low score on the test
measuring the construct?
What is the relationship among these behaviours?

Two main types of evidence to look for:

a. Convergent validity: the scores from different measures of theoretically-related constructs should converge / relate to each other
Scores on the test have high correlations with other tests that measure similar constructs
E.g., measures of self-esteem should correlate highly with measures of self-worth, or
confidence
b. Divergent / discriminant validity: scores from measures of concepts that are theoretically not
supposed to be related are actually unrelated.
The test score should have a low correlation with other tests that measure different
constructs
E.g., Gender and a questionnaire on racism.

For a test to have good construct validity, there needs to be evidence for both convergent and
discriminant validity.

Factors Affecting Validity

1. Reliability

Any form of measurement error (lack of reliability) can reduce validity

But it is possible for a test to be reliable without being valid


- E.g., a test of anxiety may actually measure stress. It might measure stress very well but it is
not measuring what you want it to

However, you must demonstrate reliability BEFORE validity

- You cannot establish that a test measures what it is supposed to if it doesn’t consistently
measure what it is supposed to

2: Social Diversity

Tests may not be equally valid for different social/cultural groups

E.g., a test of “superstition” in one culture might be a test of “religiosity” in another
