Modern Psychometrics

This popular text introduces the reader to all aspects of psychometric assessment,
including its history, the construction and administration of traditional tests, and the
latest techniques for psychometric assessment online.
Rust, Kosinski, and Stillwell begin with a comprehensive introduction to the
increased sophistication in psychometric methods and regulation that took place
during the 20th century, including the many benefits to governments, businesses, and
customers. In this new edition, the authors explore the increasing influence of the
internet, wherein everything we do on the internet is available for psychometric analysis,
often by AI systems operating at scale and in real time. The intended and unintended
consequences of this paradigm shift are examined in detail, and key controversies, such
as privacy and the psychographic microtargeting of online messages, are addressed.
Furthermore, this new edition includes brand-new chapters on item response theory,
computer adaptive testing, and the psychometric analysis of the digital traces we all leave
online.
Modern Psychometrics combines an up-to-date scientific approach with full
consideration of the political and ethical issues involved in the implementation of
psychometric testing in today’s society. It will be invaluable to both undergraduate and
postgraduate students, as well as practitioners who are seeking an introduction to modern
psychometric methods.

John Rust is the founder of The Psychometrics Centre at the University of Cambridge,
UK. He is a Senior Member of Darwin College, UK, and an Associate Fellow of the
Leverhulme Centre for the Future of Intelligence, University of Cambridge, UK.

Michal Kosinski is an Associate Professor of organizational behavior at the Stanford
Graduate School of Business, USA.

David Stillwell is the Academic Director of the Psychometrics Centre at the University
of Cambridge, UK. He is also a reader in computational social science at the Cambridge
Judge Business School, UK.
Modern Psychometrics
The Science of Psychological Assessment

Fourth Edition

John Rust, Michal Kosinski, and
David Stillwell
Fourth edition published 2021
by Routledge
2 Park Square, Milton Park, Abingdon, Oxon, OX14 4RN
and by Routledge
52 Vanderbilt Avenue, New York, NY 10017
Routledge is an imprint of the Taylor & Francis Group, an informa business
© 2021 John Rust, Michal Kosinski and David Stillwell
The right of John Rust, Michal Kosinski, and David Stillwell to be identified as authors of
this work has been asserted by them in accordance with sections 77 and 78 of the
Copyright, Designs and Patents Act 1988.
All rights reserved. No part of this book may be reprinted or reproduced or utilised in any
form or by any electronic, mechanical, or other means, now known or hereafter
invented, including photocopying and recording, or in any information storage or
retrieval system, without permission in writing from the publishers.
Trademark notice: Product or corporate names may be trademarks or registered trademarks,
and are used only for identification and explanation without intent to infringe.
First edition published in 1989
Third edition published by Routledge 2009
British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library
Library of Congress Cataloging-in-Publication Data
Names: Rust, John, 1943- author. | Kosinski, Michal, author. | Stillwell, David, author.
Title: Modern psychometrics : the science of psychological assessment/John Rust, Michal
Kosinski and David Stillwell.
Description: Fourth edition. | Milton Park, Abingdon, Oxon ; New York, NY:
Routledge, 2021. | Includes bibliographical references and index. |
Identifiers: LCCN 2020034344 (print) | LCCN 2020034345 (ebook) | ISBN
9781138638631 (hardback) | ISBN 9781138638655 (paperback) | ISBN
9781315637686 (ebook)
Subjects: LCSH: Psychometrics.
Classification: LCC BF39 .R85 2009 (print) | LCC BF39 (ebook) | DDC 150.28/
7--dc23
LC record available at https://lccn.loc.gov/2020034344
LC ebook record available at https://lccn.loc.gov/2020034345

ISBN: 978-1-138-63863-1 (hbk)
ISBN: 978-1-138-63865-5 (pbk)
ISBN: 978-1-315-63768-6 (ebk)

Typeset in Bembo
by MPS Limited, Dehradun
Contents

Preface to the fourth edition

1 The history and evolution of psychometric testing
Introduction
What is psychometrics?
Psychometrics in the 21st century
History of assessment
Chinese origins
The ability to learn
The 19th century
Beginnings of psychometrics as a science
Intelligence testing
Eugenics and the dark decades
Psychometric testing of ability
The dark ages come to an end
An abundance of abilities
Tests of other psychological constructs
Personality
Integrity
Interests
Motivation
Values
Temperament
Attitudes
Beliefs
Summary

2 Constructing your own psychometric questionnaire
The purpose of the questionnaire
Making a blueprint
Content areas
Manifestations
Writing items
Alternate-choice items
Multiple-choice items
Rating-scale items
All questionnaires
Knowledge-based questionnaires
Person-based questionnaires
Designing the questionnaire
Background information
Instructions
Layout
Piloting the questionnaire
Item analysis
Facility
Discrimination
Distractors
Obtaining reliability
Cronbach’s alpha
Split-half reliability
Assessing validity
Face validity
Content validity
Standardization

3 The psychometric principles
Reliability
Test–retest reliability
Parallel-forms reliability
Split-half reliability
Interrater reliability
Internal consistency
Standard error of measurement (SEM)
Comparing test reliabilities
Restriction of range
Validity
Face validity
Content validity
Predictive validity
Concurrent validity
Construct validity
Differential validity
Standardization
Norm referencing
Criterion referencing
Equivalence
Differential item functioning
Measurement invariance
Adverse impact
Summary

4 Psychometric measurement
True-score theory
Identification of latent traits with factor analysis
Spearman’s two-factor theory
Vector algebra and factor rotation
Moving into more dimensions
Multidimensional scaling
Application of factor analysis to test construction
Eigenvalues
Identifying the number of factors to extract using the Kaiser criterion
Identifying the number of factors to extract using the Cattell scree test
Other techniques for identifying the number of factors to extract
Factor rotation
Rotation to simple structure
Orthogonal rotation
Oblique rotation
Limitations of the classical factor-analytic approach
Criticisms of psychometric measurement theory
The Platonic true score
Psychological vs. physical true scores
Functional assessment and competency testing
Machine learning and the black box
Summary

5 Item response theory and computer adaptive testing
Introduction
Item banks
The Rasch model
Assessment of educational standards
The Birnbaum model
The evolution of modern psychometrics
Computer adaptive testing
Test equating
Polytomous IRT
An intuitive graphical description of item response theory
Limitations of classical test theory
A graphical introduction to item response theory
The logistic curve
3PL model: difficulty parameter
3PL model: discrimination parameter
3PL model: guessing parameter
The Fisher information function
The test information function and its relationship to the standard error of measurement
How to score an IRT test
Principles of computer adaptive testing
Summary of item response theory
Confirmatory factor analysis

6 Personality theory
Theories of personality
Psychoanalytic theory
Humanistic theory
Social learning theory
Behavioral genetics
Type and trait theories
Different approaches to personality assessment
Self-report techniques and personality profiles
Reports by others
Online digital footprints
Situational assessments
Projective measures
Observations of behavior
Task performance methods
Polygraph methods
Repertory grids
Sources and management of bias
Self-report techniques and personality profiles
Reports by others
Online digital footprints
Situational assessments
Projective measures
Observations of behavior
Task performance methods
Polygraph methods
Repertory grids
Informal methods of personality assessment
State vs. trait measures
Ipsative scaling
Spurious validity and the Barnum effect
Summary

7 Personality assessment in the workplace
Prediction of successful employment outcomes
Validation of personality questionnaires previously used in employment
Historical antecedents to the five-factor model
Stability of the five-factor model
Cross-cultural aspects of the five-factor model
Scale independence and the role of facets
Challenges to scale construction for the five-factor model
Impression management
Acquiescence
Response bias and factor structure
Development of the five OBPI personality scales
Assessing counterproductive behavior at work
The impact of behaviorism
Prepsychological theories of integrity
Modern integrity testing
Psychiatry and the medical model
The dysfunctional tendencies
The dark triad
Assessing integrity at work
The OBPI integrity scales
Conclusion

8 Employing digital footprints in psychometrics
Introduction
Types of digital footprints
Usage logs
Language data
Mobile sensors
Images and audiovisual data
Typical applications of digital footprints in psychometrics
Replacing and complementing traditional measures
New contexts and new constructs
Predicting future behavior
Studying human behavior
Supporting the development of traditional measures
Advantages and challenges of employing digital footprints in psychometrics
High ecological validity
Greater detail and longitude
Less control over the assessment environment
Greater speed and unobtrusiveness
Less privacy and control
No anonymity
Bias
Enrichment of existing constructs
Developing digital-footprint-based psychometric measures
Collecting digital footprints
Preparing digital footprints for analysis
Reducing the dimensionality of the respondent-footprint matrix
Building prediction models
Summary

9 Psychometrics in the era of the intelligent machine
History of computerization in psychometrics
Computerized statistics
Computerized item banks
Computerized item generation
Automated advice and report systems
The evolution of AI in psychometrics
Expert systems
Neural networks (machine learning)
Parallel processing
Predicting with statistics and machine learning
Explainability
Psychometrics in cyberspace
What and where is cyberspace?
The medium is the message
Moral development in AI
Kohlberg’s theory of moral development
Do machines have morals?
The laws of robotics
Artificial general intelligence
Conclusion

References
Index
Preface to the fourth edition

It is now 30 years since the first edition of Modern Psychometrics was published, and in that
time the science has continued to make great strides. Many of the future possibilities
tentatively discussed in the first and second editions are now accepted realities. Since the
publication of the third edition in 2009, the internet has completely revolutionized our
lives. Psychometrics has played a major part in this, much of it good, some not so good.
Psychometric profiles derived from our online digital activity are the subject of constant
AI scrutiny, providing corporations, political parties, and governments with tools to
nudge us for their own benefit and not always in our best interest. But also,
psychographic microtargeting of information based on these profiles enables
individualized learning, information retrieval, and purchasing of preferred products on
a scale previously undreamed of. It is the engine that drives the big tech money machine
and the digital economy.
At the same time, psychometrics continues to play a central role in improving
examination systems in our schools and universities, recruitment and staff development
in human-resources management, and development of research tools for academic
projects. This book is intended to provide both a theoretical underpinning to
psychometrics and a practical guide for professionals and scholars working in all these
fields. In this new edition we outline the history and discuss central issues such as IQ and
personality testing and the impact of computer technology. It is increasingly recognized
that modern psychometricians, because their role is so central to fair assessment and
selection, must not only continue to take a stand on issues of racism and injustice but also
contribute to debates concerning privacy and the regulation of corporate and state
power.
The book includes a practical step-by-step guide to the development of a
psychometric test. This enables anyone who wishes to create their own test to plan,
design, construct, and validate it to a professional standard. Knowledge-based tests of
ability, aptitude, and achievement are considered, as well as person-based tests of
personality, integrity, motivation, mood, attitudes, and clinical symptoms. There is
extensive coverage of the psychometric principles of reliability, validity, standardization,
and bias, knowledge of which is essential for the testing practitioner, whether in school
examination boards, human-resources departments, or academic research. The fourth
edition has been extensively updated and expanded to take into account recent
developments in the field, making it the ideal companion for those wishing to
achieve qualifications of professional competence in testing.
But today, no psychometrics text would be complete without extensive coverage of
key issues in testing in the online environment made possible by advances in internet
technology. Computer adaptive testing and real-time item generation are now available
to any psychometrician with the necessary know-how and access to the relevant
software, much of it open source, such as Concerto (The Psychometrics Centre, 2019).
Psychometric skills are currently in enormous demand, not just from classic markets but
also for new online applications that require understanding and measuring the unique
traits of individuals, such as the provision of personalized health advice, market research,
online recommendations, and persuasion. The fourth edition extends coverage of these
fields to provide advice to computer scientists and AI specialists on how to develop and
understand computer adaptive tests and online digital-footprint analysis.
The groundwork for the collaboration that led to this fourth edition was established in
The Psychometrics Centre at the University of Cambridge. We were very fortunate to
have been supported by an amazing team, without whose enthusiasm, creativity,
ambition, and drive many of the revolutionary developments in psychometrics would
not have been possible. Among them are Iva Cek, Fiona Chan, Tanvi Chaturvedi,
Kalifa Damani, Bartosz Kielczewski, Shining Li, Przemyslaw Lis, Aiden Loe, Vaishali
Mahalingam, Sandra Matz, Igor Menezes, Tomoya Okubo, Vesselin Popov, Luning
Sun, Ning Wang and Youyou Wu. Many have now dispersed to all corners of the
world, but their work continues, and there is much more yet to do. Finally, thanks
are due to Peter Hiscocks and Christoph Loch, who facilitated the move of
The Psychometrics Centre to the Judge Business School, and to Susan Golombok, the
original coauthor, for her generosity and support in the preparation of this new edition.
1 The history and evolution of psychometric testing

Introduction
People have always judged each other in terms of their skills, potential, character,
motives, mood, and expected behavior. Since the beginning of time, skill in this practice
has been passed on from generation to generation. Being able to evaluate our friends,
family, colleagues, and enemies in terms of these attributes is fundamental to us as human
beings. Since the introduction of the written word, these opinions and evaluations have
been recorded, and our techniques for classifying, analyzing, and improving them have
become not just an art but also a technology that has played an increasing role in how
societies are governed. Triumphing in this field has been the secret of success in war,
business, and politics. As with all technologies, it has been driven by science—in this
case, the science of human behavior, the psychology of individual differences, and, when
applied to psychological assessment, psychometrics.
In the 20th century, early psychometricians played a key role in the development of
related disciplines such as statistics and biometrics. They also revolutionized education by
introducing increasingly refined testing procedures that enabled individuals to demonstrate
their potential from an early age. Psychometrics required statistical and computational
know-how, as well as data on a large scale, before its impact could be felt. Today, we think
of big data in terms of the technological revolution, but large-scale programs implementing
the analysis of human data on millions of individuals date back to over 100 years ago in the
form of early national censuses and military recruitment. These early scientists did not see
their subject as just an interesting academic discipline; they were also fascinated by its
potential to improve all our lives. And indeed, in most ways it has, but it has been a long
and rocky road—with many false starts, and indeed disasters, on the way. In this chapter,
we start with definitions, followed by an evaluation of future potential, a history, a warning
about past missteps, applause for current successes, and an invitation to learn from history’s
lessons. It is said that those who cannot remember the past are condemned to repeat it. Let
us all make sure that this does not happen, but rather that we can bring about a future that
sees human potential expand to the stars.

What is psychometrics?
Psychometrics is the science of psychological assessment, and has traditionally been seen
as an aspect of psychology. But its impact has been much broader. The scientific
principles that underpin psychometrics apply equally to other forms of assessment such
as educational examinations, clinical diagnoses, crime detection, credit ratings, and staff
recruitment. The early psychometricians were equally at home in all of these fields.
Since then, paths have often diverged, but they have generally reunited as the
importance of advances made in one context comes to the attention of workers in other
areas. Currently, great strides are being made in the application of machine-learning
techniques and big-data analytics—particularly in the analysis of the digital traces we all
leave online—and these are beginning to have a significant impact across a broad range
of applications. These are both exciting and disturbing times.
We experience psychometric assessment in many of our activities, for example:

• We are tested throughout our education to inform us, our parents, teachers, and
policy makers about our progress (and the efficiency of teaching).
• We are assessed at the end of each stage of education to provide us with academic
credentials and inform future schools, colleges, or employers about our strengths and
weaknesses.
• We must pass a driving test before we are allowed to drive a car.
• Many of us need to pass a know-how or skills test to be able to practice our
professions.
• We are assessed in order to gain special provisions (e.g., for learning difficulties) or to
obtain prizes.
• When we borrow money or apply for a mortgage, we must complete credit scoring
forms to assess our ability to repay the debt.
• We are tested at work when we apply for a promotion and when we seek
another job.
• Our playlists are analyzed to assess our music tastes and recommend new songs.
• Our social media profiles are analyzed—sometimes without our consent—to
estimate our personality and choose the advertisements that we are most likely
to click.

Assessment can take many forms: job interviews, school examinations, multiple-choice
aptitude tests, clinical diagnoses, continuous assessment, or analysis of our online
footprints. But despite the wide variety of applications and manifestations, all assessments
should share a common set of fundamental characteristics: they should strive to be
accurate, measure what they intend to measure, produce scores that can be meaningfully
compared between people, and be free from bias against members of certain groups.
There are good assessments and bad assessments, and psychometrics is the science of how
to maximize the quality of the assessments that we use.

Psychometrics in the 21st century


Psychometrics depends on the availability of data on a large scale, and so it is no surprise
that the advent of the internet has massively boosted its influence. If we had to date the
internet, we would probably start at CERN, the European Organization for Nuclear
Research, in Geneva, with Tim Berners-Lee’s invention of the World Wide Web in
1990; he linked the newly developed hypertext markup language (HTML) to a graphical
user interface (GUI), thereby creating the first web pages. Since then, the web has
expanded to make Marshall McLuhan’s “global village” a reality (McLuhan, 1964). The
population of this global village grew from a handful of academics in the early 1990s to a
diverse and vibrant community of one billion users in 2005, and to over four billion users
(representing more than 50% of the world’s population) in 2020. Thus, within less than
20 years, the new medium of cyberspace came into existence, creating a completely new
science with new disciplines, new experts, and, of course, new problems. Some aspects
of this new science are exceptional. While the science of biology is only 300 years old,
and that of psychology considerably younger, both their subjects of study—humans and
life itself—have existed for millions of years. Not so the internet. Hence the cyberworld
is unique, and it is hard to predict what to expect of its future. It is also a serious
disruptor; it has completely changed the nature of its adjacent disciplines, especially
computer and information sciences, but also psychology and its progeny, psychometrics.
By the year 2000, the migration of psychometrics into the online world was well
underway, producing both new opportunities and new challenges, particularly for global
examination organizations such as the Educational Testing Service (ETS) at Princeton
and Cambridge Assessment in the UK. On the positive side, gone were the massive
logistical problems involved in securely delivering and recovering huge numbers of
examination papers by road, rail, and air from remote parts of the world. But the
downside was that examinations needed to take place at fixed times during the school
or working day, and it became possible for candidates in, say, Singapore to contact
their friends in, say, Mexico with advance knowledge of forthcoming questions.
Opportunities for cheating were rife. To counter these challenges, the major
examination boards and test publishers turned to the advantages offered by large item banks
and computer adaptive testing, the psychometricians’ own version of machine learning.
However, it was the development of the app—an abbreviation of “application” used to
describe a piece of software that can be run through a web browser or on a mobile
phone—that was to prove the most disruptive to traditional ways of thinking about
psychometric assessment.
One such app was David Stillwell’s myPersonality, published on Facebook in
2007 (Stillwell, 2007; Kosinski, Stillwell, & Graepel, 2013; Youyou, Kosinski, &
Stillwell, 2015). It offered its users a chance to take a personality test, receive
feedback on their scores, and share those scores—if they were so inclined—with
their Facebook friends. It was similar to countless other quizzes widely shared on
Facebook around that time, yet it employed an established and well-validated
personality test taken from the International Personality Item Pool (IPIP), an open-
source repository established in the 1990s for academic use as a reaction to test
publishers’ domination of the testing world. The huge popularity of myPersonality
was unforeseen. Within a few years, the app had collected over six million
personality profiles, generated by enthusiasts who were interested to see the sort of
results and feedback about themselves that had previously only been available to
psychology professionals. It was one of psychometrics’ first encounters with the
big-data revolution.
But the availability of psychometric data on such a grand scale was to have
unexpected consequences. Many saw opportunities for emulating the procedure in online
advertising, destined to become the major source of revenue for the digital industry.
Once the World Wide Web existed, it could be searched or trawled by search engines,
the most ubiquitous of which is Google. In the mid-1990s, search engines simply
provided information. By 2010 they did so with a scope and accuracy that exceeded all
previous expectations; information on anything or anyone was ripe for the picking.
But those who wished to be found soon became active players on the scene—it was the
advertising industry’s new paradise. The battle to reach the top in search league
tables—or, at the very least, the first results page—began in earnest. Once online
advertising entered the fray, it became a new war zone. The battle for the keywords
had begun. Marketing was no longer about putting up a board on the high street; it was
about building a digital presence in cyberspace that would bring customers to you in
droves. By the early 2000s, no company or organization could afford not to have a
presence in cyberspace. For a high proportion of customers, companies without some
digital presence simply ceased to exist.
While web pages were the first universally available data source in cyberspace, social
networks soon followed, and these opened a whole new world of individualized personal
information about their users that was available for exploitation. Not only was standard
demographic information such as age, marital status, gender, occupation, and education
available, but there were also troves of new data such as the words being used in status
updates and tweets, images, music preferences, and Facebook Likes. And these data
sources soon became delicious morsels in a new informational feeding frenzy. They were
mined extensively by tech companies and the marketing industry to hone their ability to
target advertisements to the most relevant audiences—or, to put it another way, to those
who might be most vulnerable to persuasion. The prediction techniques used were the
same as those that had been used by psychometricians for decades: principal component
analysis, cluster analysis, machine learning, and regression analysis. These were able to
predict a person’s character and future behavior with far more accuracy than simple
demographics. Cross-correlating demographics with traditional psychometric data, such
as personality traits, showed that internet users were giving away much more
information about their most intimate secrets than they realized. Thus, online
psychographic targeting was born. This new methodology, creating clickbait and directing
news feeds using psychological as well as demographic data, was soon considered to be
far too powerful to exist in an unregulated world. But this will prove one day to have
been just the midpoint in a journey that began many centuries ago.

History of assessment
Chinese origins
Employers have assessed prospective employees since the beginnings of civilization, and have
generated consistent and replicable techniques for doing this. China was the first country to
use testing for the selection of talents (Jin, 2001; Qui, 2003). Earlier than 500 BCE,
Confucius had argued that people were different from each other. In his words, “their nature
might be similar, but behaviors are far apart,” and he differentiated between “the superior and
intelligent” and “the inferior and dim” (Lun Yu, Chapter Yang Huo). Mencius (372–289
BCE) believed that these differences were measurable. He advised: “assess, to tell light from
heavy; evaluate, to know long from short” (Mencius, Chapter Liang Hui Wang). Xunzi
(310–238 BCE) built upon this theory and advocated the idea that we should “measure a
candidate’s ability to determine his position [in the court]” (Xun Zi, Chapter Jun Dao).
Thus, over 2,000 years ago, much of the fundamental thinking that today underpins
psychometric testing was already in place, as were systems that used this in the selection
of talents. In fact, there is evidence that talent selection systems appeared in China even
before Confucius. In the Xia Dynasty (c. 2070–1600 BCE), the tradition of selecting
officers by competition placed heavy emphasis on physical strength and skills, but by the
time of the Zhou Dynasty (1046–256 BCE) the content of the tests had changed. The
emperor assessed candidates not only based on their shooting skills but also in terms of
their courteous conduct and good manners. From then on, the criteria used for the
selection of talent grew to include the “Six Skills”: arithmetic, writing, music, archery,
horsemanship, and skills in the performance of rituals and ceremonies; the “Six
Conducts”: filial piety, friendship, harmony, love, responsibility, and compassion; and
the “Six Virtues”: insight, kindness, judgment, courage, loyalty, and concord. During
the Warring States period (475–221 BCE), oral exams became more prominent. In the
Qin Dynasty, from 221 BCE, the main test syllabus primarily consisted of the ability to
recite historical and legal texts, calligraphy, and the ability to write official letters and
reports. The Sui (581–618 CE) and Tang Dynasties (618–907 CE) saw the introduction
of the imperial examinations, a nationwide testing system that became the main method
of selecting imperial officials. Formal procedures required—then as they do now—that
candidates’ names should be concealed, independent assessments by two or more
assessors should be made, and conditions of examination should be standardized. The
general framework of assessment set down then—including a “syllabus” of material that
should be learned and rules governing an efficient and fair “examination” of candidates’
knowledge—has not changed for 3,000 years. While similar but less sophisticated
frameworks may have existed in other ancient civilizations, it was models based on the
Chinese system that were to become the template for the modern examination system.
The British East India Company, active in Shanghai, introduced the Chinese system
to its occupied territories in Bengal in the early 19th century. Once the company was
abolished in 1858, the system was adopted by the British for the Indian Civil Service. It
subsequently became the template for civil service examinations in England, France, the
USA, and much of the rest of the world.

The ability to learn


It has long been recognized by teachers that some students are more capable of learning
than others. In Europe in about 375 BCE, for example, Plato had Socrates ask his student Glaucon:

When you spoke of a nature gifted or not gifted in any respect, did you mean to say
that one man will acquire a thing hastily, another with difficulty; a little learning will
lead the one to discover a great deal; whereas the other, after much study and
application, no sooner learns than he forgets; or again, did you mean that the one has
a body that is a good servant of his mind, while the body of the other is a hindrance
to him? – Would not these be the sort of differences which distinguish the man
gifted by nature from the one who is ungifted?
(Plato, Respublica V, 449a–480a)

This view of the ability to learn, generally referred to as intelligence, was very
familiar to European scientists in the 19th century—almost all would have studied Greek
at school and university. Intelligence was not education but educability, and represented
an important distinction between the educated person and the intelligent person. An
educated person is not necessarily intelligent, and an uneducated person is not necessarily
unintelligent.
In medieval Europe, the number of people entitled to receive an education was
very small. However, the Reformation and then the Industrial Revolution were
transformative. In Europe, the importance of being able to read the Bible in a native
language, and the need to learn how to operate machines, led to popular support for a
movement to provide an education for all—regardless of social background. One
consequence was that more attention was drawn to those who continued to find it
difficult to integrate into everyday society. The 19th century saw the introduction of the
asylum system, in which “madhouses” were replaced by “lunatic asylums” for the
mentally ill and “imbecile asylums” for those with learning difficulties. These words,
obviously distasteful today, were then in common usage: the term “lunatic” in law and
the term “imbecile” in the classification system used by psychiatrists. Indeed, the term
“asylum” was intended to be positive, denoting a place of refuge (as in “political asylum”
today).
The need to offer provision for those who had difficulty with the learning process
focused attention on how such people could be identified and how their needs could
best be accommodated. In pre-Victorian England, Edward Jenner (the advocate of
vaccination) proposed a four-stage hierarchy of human intellect, in which he
confounded intelligence and social class. Jenner (1807) summarized the attitude of
the time. In “Classes of the Human Powers of Intellect,” published in the popular
magazine The Artist, he wrote:

“I propose therefore to offer you some thought on the various degrees of power
which appear in the human intellect; Or, to speak more correctly, of the various
degrees of intellectual power that distinguish the human animal. For though all men
are, as we trust and believe, capable of the divine faculty of reason, yet it is not to all
that the heavenly beam is disclosed in all its splendour.

1 In the first and lowest order I place the idiot: the mere vegetative being, totally
destitute of intellect.
2 In the second rank I shall mention that description of Being just lifted into
intellectuality, but too weak and imperfect to acquire judgement; who can
perform some of the minor offices of life – can shut a door – light a fire –
express sensations of pain, etc., and, although faintly endowed with perceptions
of comparative good, is yet too feeble to discriminate with accuracy. A being of
this degree may, with sufficient propriety, be denominated the silly poor
creature, the dolt.
3 The third class is best described by the general term of Mediocrity and includes
the large mass of mankind. These crowd our streets, these line our queues, these
cover our seas. It is with this class that the world is peopled. These are they who
move constantly in the beaten path; these support the general order which they
do not direct; these uphold the tumult which they do not stir; these echo the
censure or the praise of that which they are neither capable of criticising nor
admiring.
4 The highest level is Mental Perfection; the happy union of all the faculties of
the mind, which conduce to promote present and future good; all of the
energies of genius, valour and judgement. In this class are found men who,
surveying truth in all her loveliness, defend her from assault, and unveil her
charms to the world; who rule mankind by their wisdom, and contemplate
glory, as the Eagle fixes his view on the Sun, undazzled by the rays that
surround it.”
The 19th century
The era of European colonialism saw the spread of the Western education system to the
colonized world, but uptake was slow, and ideas taken from perceptions of social status in
Europe were often transferred to the subjected populations. Charles Darwin, for example—a
giant figure of that age, whose theory of evolution had considerable implications for how
differences both between and within species would be understood—was among those who
held this Eurocentric approach, something that perturbed the development of evolutionary
science in a way that would increasingly be recognized as racist. In The Descent of Man, first
published in 1871, Darwin argued that the intellectual and moral faculties had been gradually
perfected through natural selection, stating as evidence that “at the present day, civilized
nations are everywhere supplanting barbarous nations.” Darwin’s view that natural selection
in humans was an ongoing process, with “the savage” and “the lower races” being
evolutionarily inferior to “the civilized nations,” had considerable influence. But it was not a view
that was shared by every scientist of that time. Others chose to differ, including Alfred
Wallace, Darwin’s copresenter of papers to the seminal 1858 meeting of the Linnean
Society of London that introduced the idea of natural selection through survival of the fittest
(Wallace, 1858). Wallace took issue with his former colleague, believing Darwin’s arguments
for differences between the races to be fundamentally flawed. His observations in Southeast
Asia, South America, and elsewhere convinced him that so-called “primitive” peoples all
exhibited a high moral sense. He also drew attention to the ability of children from these
groups to learn advanced mathematics, having taught five-year-olds in Borneo how to solve
simultaneous equations, and pointed out that no evolutionary pressure could have ever been
exerted on their ancestors in this direction by the natural environment. In Wallace’s view,
the evolutionary factors that had led to the development of intelligence and morality in
humanity had happened in the distant evolutionary past and were shared by all humans.

Beginnings of psychometrics as a science


The evolution of the human intellect was also of particular interest to Darwin’s cousin, Sir
Francis Galton, who in 1869 published Hereditary Genius: An Inquiry into Its Laws and
Consequences. He had first carried out a study of the genealogy of famous families in 1865,
based on a compendium by Sir Thomas Phillipps entitled The Million of Facts (Galton,
1865). Galton argued that genius, genetic in origin, was to be found in these families. But
when he spoke of genius, he was considering much more than mere intellect. He believed
that these people were superior in many other respects, be it the ability to appreciate music
or art, performance in sport, or even simply physical appearance. In 1884, in order to collect
data to validate his idea, he established his Anthropometric Laboratory at the International
Health Exhibition at South Kensington, London, in which people attending the exhibition
could have their characteristics measured for three pence (about two US dollars at today’s
prices). In the late 1880s, the American psychophysicist James McKeen Cattell, recently
arrived in Cambridge from Wundt’s psychophysics laboratory in Germany, introduced
Galton to many of Wundt’s psychological testing instruments, which he added to his
repertoire. Hence mental testing—psychometrics—was born. The data generated from
Galton’s studies provided the raw material for his development of many key statistical
methods such as standard deviation and correlation. Karl Pearson, an acolyte of Galton,
added partial and multiple correlation coefficients, as well as the chi-square test, to the
techniques available. In 1904, Charles Spearman, an army officer turned psychologist,
introduced procedures for the analysis of more complex correlation matrices and laid down
the foundations of factor analysis. Thus by the end of the first decade of the 20th century,
the fundamentals of psychometric theory were in place. The groundwork had also been laid
for the subsequent development of many new scientific endeavors: the statistical sciences,
biometrics, latent variable modeling, machine learning, and artificial intelligence.

Intelligence testing
The main impetus to provide an intelligence test specifically for educational selection
arose in France in 1904, when the minister of public instruction in Paris appointed a
committee to find a method that could identify children with learning difficulties. It was
urged that “children who failed to respond to normal schooling be examined before
dismissal and, if considered educable, be assigned to special classes” (Binet and Simon,
1916). Drawing from item types already developed, the psychologist Alfred Binet and his
colleague Théodore Simon put together a standard set of 30 scales that were quick and
easy to administer. These were found to be very successful at differentiating between
children who were seen as bright and children who were seen as dull (by teachers), and
between children in institutions for special educational needs and children in mainstream
schools. Furthermore, the scores of each child’s scales could be compared with those of
other children of the same or similar age, thus freeing the assessment from teacher bias.
The results of Binet’s testing program not only provided guidance on the education of
children at an individual level but also influenced educational policy.
The first version of the Binet–Simon Scale was published in 1905, and an updated
version followed in 1908 when the concept of “mental age” was introduced—this being
the age for which a child’s score was most typical, regardless of their chronological age. In
1911, further amendments were made to improve the ability of the test to differentiate
between education and educability. Scales of reading, writing, and knowledge that had
been incidentally acquired were eliminated. The English-language derivative of the
Binet–Simon test, the Stanford–Binet, is still in widespread use today as one of the primary
assessment methods for the identification of learning difficulties in children.
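
The idea of mental age described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the age norms in `TYPICAL_SCORE_BY_AGE` are invented numbers, not values from the Binet–Simon Scale, and the function name `mental_age` is hypothetical.

```python
# Hypothetical age norms: the typical raw score for each age group.
# These numbers are invented for illustration; they are NOT taken
# from the Binet-Simon Scale.
TYPICAL_SCORE_BY_AGE = {5: 8, 6: 11, 7: 14, 8: 17, 9: 20, 10: 23}


def mental_age(raw_score: int, norms: dict[int, int] = TYPICAL_SCORE_BY_AGE) -> int:
    """Return the age whose typical score lies closest to raw_score.

    Ties are resolved in favor of the younger age.
    """
    return min(sorted(norms), key=lambda age: abs(norms[age] - raw_score))


# Chronological age plays no part in the lookup: a seven-year-old and a
# ten-year-old with the same raw score receive the same mental age.
print(mental_age(20))
print(mental_age(11))
```

With these invented norms, a raw score of 20 corresponds to the score most typical of nine-year-olds, whatever the age of the child who obtained it.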
Binet’s tests emphasized what he called the higher mental processes that he believed
underpinned the capacity to learn: the execution of simple commands, coordination,
recognition, verbal knowledge, definitions, picture recognition, suggestibility, and the
completion of sentences. In their book The Development of Intelligence in Children, first
published in 1916, Binet and Simon, using the language of the time, stated their belief
that good judgment was the key to intelligence:

It seems to us that in intelligence there is a fundamental faculty, the alteration or the
lack of which, is of the utmost importance for practical life. This faculty is judgment,
otherwise called good sense, practical sense, initiative, the faculty of adapting one’s self
to circumstances. To judge well, to comprehend well, to reason well, these are the
essential activities of intelligence. A person may be a moron or an imbecile if he is
lacking in judgment; but with good judgment he can never be either. Indeed, the rest
of the intellectual faculties seem of little importance in comparison with judgment.
(Binet and Simon, 1916)

The potential of intelligence testing to identify intellectual capacity in adults as well as
children was soon recognized, and the First World War saw the introduction of such a
program in the US on a grand scale. A committee under the chairmanship of Robert
Yerkes, president of the American Psychological Association, was established. It was
tasked with devising tests of intelligence that would be positively correlated with Binet’s
scales but adapted for group use and simple to administer and score. The tests should
measure a wide range of abilities, be resilient to malingering and cheating, be
independent of school training, and have a minimal need for writing. In seven working
days, they constructed 10 subtests with sufficient items for 10 different forms. These
were piloted with 500 participants from a broad range of backgrounds, including
institutions for people with special educational needs, patients at a psychiatric hospital,
army recruits, officer trainees, and high school students. The entire process was
completed in less than six months. By the end of the war, these tests, known as Army Alpha
and Army Beta, had been administered at the rate of 200,000 per month to nearly two
million American recruits.
Following from the popularity and success of the army tests, the mantle for mass testing
for an Intelligence Quotient (IQ) was taken up by the US College Board, which in 1926
introduced the Scholastic Aptitude Test (SAT), designed to facilitate entry into colleges
throughout the US. Adopted first by Harvard and then by the University of California, by
1940 the SAT had become the standard admission test for practically all US universities.
The aim was to create a level playing field on which every child leaving high school with
an IQ above a certain level could have the opportunity to benefit from tertiary education.
The development of meritocratic education rapidly spread around the world, and the old
world order—in which education depended on birth and economic privilege—began its
decline.
However, the success of the new system was not universal. While more gifted
individuals benefited, particularly among ethnic minorities and the working class, the effect
on the majority in such groups was counterproductive. Large differences in average
group scores on these tests remained, resulting in considerable group differences in
admission rates, particularly to elite schools and universities, and consequently in
employment and entry to professions. The impact of social background factors on test scores
went largely unrecognized, and one form of elite was replaced by another. The
beneficiaries were predominantly middle class and white. It seemed that IQ testing, while
perhaps a panacea, was not a panacea for all.
Early attempts to address the discriminatory consequences of group differences in IQ
test scores when used for selection purposes focused on several strategies. Predominant
was a shift away from a single measure, referred to as general intelligence or “g,” toward
separate measures for different intellectual abilities such as numerical or verbal,
tailored more specifically to the requirements of the training course or employment
position in question. There was also increased attention in law to the extent to which any
psychometric testing procedures would impact constitutional rights, particularly under
the Bill of Rights and the 14th Amendment.

Eugenics and the dark decades


But despite the enormous success of Alfred Binet and his successors in addressing
the requirements of the education system, for many this was just a sideshow. The
originator of the science was Galton, not Binet, and Galton’s true interest was not
psychometrics—or even anthropometrics per se; rather, his concern was that the
quality of the human race, particularly its intelligence, was degenerating as those of
lower intelligence were having more children and passing on more and more of their
“inferior” genes through succeeding generations. In 1883 he coined the terms
“eugenics” (defining it as “the conditions under which men of high type are produced”)
and “dysgenics” (the opposite of eugenics). Galton’s academic influence was
considerable, culminating in the establishment of a eugenics department at University
College London. But it was much more than just a theory. In many countries, its
ambitions were soon implemented in policy. In 1907, the USA was the first country to
undertake compulsory sterilization programs for eugenics purposes. The principal
targets were the “feebleminded” and the mentally ill, but also included under many
state laws were people who were deaf, blind, epileptic, and physically disabled. Native
Americans were sterilized against their will in many states, often without their
knowledge, while they were in hospital for some other reason (e.g., after giving birth).
Some sterilizations also took place in prisons and other penal institutions, targeting
criminality. Over 65,000 individuals were sterilized in 33 states under state compulsory
sterilization programs. These were last carried out in 1981.
Assessment of intelligence played a key part in many of these programs. Indeed, many
early intelligence tests were designed with a eugenics agenda in mind. By 1913, Henry
Goddard had introduced them to the evaluation process for potential immigrants at Ellis
Island in New York, and in 1919 Lewis Terman stated in his introduction to the first
edition of the Stanford–Binet Intelligence Scales (his own translations of Binet’s scales
into English):

It is safe to predict that in the near future intelligence tests will bring tens of
thousands of … high-grade defectives under the surveillance and protection of
society. This will ultimately result in the curtailing of the reproduction of
feeble-mindedness and in the elimination of enormous amounts of crime, pauperism, and
industrial inefficiency. It is hardly necessary to emphasize that the high-grade cases,
of the type now so frequently overlooked, are precisely the ones whose guardianship
it is most important for the state to assume.
(Terman, 1919)

In 1927, the Kaiser Wilhelm Institute of Anthropology, Human Heredity, and Eugenics
was the first to advocate sterilization in Germany. The year 1934 saw the introduction of
the country’s Law for the Prevention of Offspring with Hereditary Defects; and in
1935 this law was amended to allow abortion for the “hereditarily ill,” including the
“social feebleminded” and “asocial persons.” Two years later came the introduction of
Sonderkommission 3 (Special Commission Number 3), under which all local authorities
in Germany were required to submit a list of all children of African descent. All such
children were to be medically sterilized. We all know what followed. In 1939,
euthanasia was legalized for psychiatric patients (including homosexual people) in
psychiatric hospitals; and in 1942, the same methods were extended to Roma and Jewish
people in concentration camps in what became the Holocaust.
Debates about the implementation of eugenics generally ended with the Second
World War. However, widespread beliefs among white communities about
differences between races did not end. The decolonization process had yet to begin, and
many people—not just in the USA but also in Africa and elsewhere—continued to
attribute African and African-American academic underachievement to inherited IQ
differences. Martin Luther King Jr. did not have his dream until the early 1960s, and
it was not until 1994 that apartheid was finally abolished in South Africa. Until that
time, arguments over whether inherited differences in intelligence could be a
possible cause of group differences continued to stir up controversy among academics
in both the USA and Europe, culminating in the publication of Herrnstein and
Murray’s The Bell Curve in 1994, in which the authors argued that poor and
ethnically diverse inner-city communities were a “cognitive underclass” formed by
the interbreeding of people of poorer genetic stock. But there was light at the end of
the tunnel.

Psychometric testing of ability


The dark ages come to an end
The year 1984 saw the publication of James Flynn’s (Flynn, 1984) first report on
what subsequently became known as the “Flynn effect,” the term used to describe
the now well-documented year-by-year rise in IQ scores that dates back to the early
20th century, when the practice of IQ testing first began. By examining these
changes, it was possible to extrapolate across the entire life of the test. Flynn showed
that, on average, IQ scores increased by 0.3 to 0.4 IQ points every year and had been
doing so for at least the past 100 years. Various theories have been put forward to
explain this phenomenon, including improved nutrition and increasing familiarity
with the testing process. For his part, Flynn (2007, 2016) argued that the change was
due to the way in which the scientific method had influenced education. Today, we
are more rational thinkers than were our ancestors, because the requirements of an
industrialized and increasingly technical world force us to be just that. It is interesting
that Flynn does not believe that we are actually more intelligent today, in the
traditional sense of the word. If that were so, the logic of the finding would be that
almost half of our great-grandparents’ generation would be diagnosed as having
severe learning difficulties by today’s standards. Rather, it is what we need to
comprehend scientifically that has changed. Today, most primary school children
understand that the correct answer to the question “What do dogs and rabbits have in
common?” is that they are both mammals. It is unlikely that our great-grandparents
would think this was a matter of any consequence. They might choose the wrong
answer—for example, “Dogs chase rabbits”—and fail to understand why this might
be considered wrong. Flynn’s (2007, 2016) work has increasingly focused on the
interplay between intelligence test scores and education—not just across time, but
also between the level of education provided by each country’s schools and colleges,
which has historically varied enormously.
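
The arithmetic behind Flynn’s argument can be sketched in a few lines of Python. The sketch assumes the conventional IQ scale (mean 100, standard deviation 15), a cutoff score of 70, and normally distributed scores. None of these figures is given in the text, and the function names are illustrative only.

```python
from statistics import NormalDist

# Assumptions (conventional values, not stated in the text): IQ is normed
# to a mean of 100 with a standard deviation of 15, and a score below 70
# is used as the cutoff for a learning-difficulty classification.
NORM_MEAN = 100.0
NORM_SD = 15.0
CUTOFF = 70.0


def past_mean_on_todays_norms(points_per_year: float, years: float) -> float:
    """Mean score of an earlier cohort when scored against today's norms."""
    return NORM_MEAN - points_per_year * years


def share_below_cutoff(cohort_mean: float) -> float:
    """Fraction of a normally distributed cohort scoring below CUTOFF."""
    return NormalDist(mu=cohort_mean, sigma=NORM_SD).cdf(CUTOFF)


# Apply both ends of the reported range of yearly gains over a century.
for rate in (0.3, 0.4):
    mean = past_mean_on_todays_norms(rate, years=100)
    print(f"{rate} points/year: mean {mean:.0f}, "
          f"{share_below_cutoff(mean):.0%} below {CUTOFF:.0f}")
```

Under these assumptions, a gain of 0.3 points per year over 100 years places the earlier cohort’s mean at 70 on today’s norms, so half of that cohort would fall below the cutoff. That is the implausible conclusion that leads Flynn to reject a literal reading of the gains.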
As well as being an interesting phenomenon in its own right, the Flynn effect made it clear that the
average IQ of African-Americans in the 2000s was higher than the average IQ of white
Americans in the 1970s. Further research by Flynn extended this work beyond the
USA and found similar results in many other countries across all the continents. This
evidence of the equally enormous impact of environmental factors on IQ scores of all
groups rendered unnecessary any need to explain group differences in terms of
genetics. Today, the controversy has been largely forgotten. The world has become
increasingly multicultural and global in outlook. Moreover, in both the USA and
Europe, the enormous professional success of immigrant groups that had previously
been excluded speaks for itself.
An abundance of abilities
Today, intelligence testing is conceptualized more broadly and assessed in a variety of
more positive ways. A popular view is exemplified by the work of Robert Sternberg
(1990) in his triarchic model. Sternberg suggested three major forms of intelligence:
analytic, creative, and practical.
Those with analytic intelligence do well on IQ tests, appear “clever,” are able to
solve preset problems, learn quickly, appear knowledgeable, are able to automate their
expertise, and normally pass exams easily. While analytic intelligence—measured by
classical IQ tests—is important, it is not always the most relevant. Sternberg gives an apocryphal
example of a brilliant mathematician who appeared to have no insight at all into human
relationships, to such an extent that when his wife left him unexpectedly, he was
completely unable to come up with any meaningful explanation.
Doubts about the sufficiency of the classical notion of IQ have often been expressed in
the business world. It has been pointed out that to do well on a test of analytic
intelligence, a candidate must demonstrate the ability to solve problems set by others, such
as in a test or examination. A successful business entrepreneur, however, needs to ask
good questions rather than supply good answers. It is the ability to come up with good
questions that characterizes the creative form of intelligence.
People with practical intelligence are “streetwise.” They want to understand and can
stand back and deconstruct a proposal or idea, they will question the reasons for wanting
to know, and they can integrate their knowledge for a deeper purpose. Hence, they are
often the most successful in life. They know the right people, avoid making powerful
enemies, and are able to work out the unspoken rules of the game.
Howard Gardner (1983) also argued that there were multiple intelligences, each linked
with an independent and separate system within the human brain. These were linguistic,
logical-mathematical, spatial, musical, bodily-kinesthetic, interpersonal, and intrapersonal
intelligence. Gardner emphasized the distinctive nature of some of these intelligences and
their ability to operate independently of each other. For example, bodily-kinesthetic
intelligence, representing excellence in sporting activities, is a new concept within the field,
as are interpersonal intelligence and intrapersonal intelligence—representing the ability to
understand other people and the ability to have insight into one’s own feelings,
respectively. The Mayer–Salovey–Caruso Emotional Intelligence Test (MSCEIT) is an
ability-based test designed to measure emotional intelligence.
Most tests today under the name of emotional intelligence are not in fact tests of
intelligence at all, but rather assess the personality traits of those who are said to be sensitive
to the feelings of others. The MSCEIT stands in contrast to these others in that there are
predefined right and wrong answers in terms of, say, whether a person’s recognition of
anger in someone else is correct or incorrect.
It has been argued that many of Sternberg’s and Gardner’s diverse forms of
intelligence are not really new. Many correspond with traditional notions that were
already tested within IQ tests. Aristotle himself emphasized the importance of
distinguishing between wisdom and intelligence, while creativity tests have been around
for about 70 years. Nevertheless, their approaches have had an important role in
deconstructing the idea that academic success is the sole form of intellectual merit. One
result is that today, intelligence testing is conceptualized more broadly and is assessed in
a more strategic manner. There is increased emphasis on a diversity of talents, enabling
all to identify their strengths as well as areas that they are more likely to find
challenging. Tests of specific forms of intelligence—such as numeracy, verbal proficiency,
and critical thinking—remain central to assessment for entry to specific professional
training programs that depend on skills in these areas. In addition, broad-spectrum
assessments of the many fundamental cognitive skills that underpin key learning
processes play an increasing role in targeting specific remedial learning programs to those
who most need them. For example, the Wechsler Intelligence Scale for Children, Fifth
Edition (WISC-V), contains specific subtests of:

• Similarities, vocabulary, information, and comprehension (to assess verbal concept
formation)
• Block design and visual puzzles (to assess visual spatial processing)
• Matrix reasoning, figure weights, picture concepts, and arithmetic (to assess
inductive and quantitative reasoning)
• Digit span, picture span, and letter-number sequencing (to assess working memory)
• Coding, symbol search, and cancellation (to assess processing speed)

The WISC-V is in widespread use around the world by educational psychologists who
work with children with learning disabilities such as dyslexia, dyscalculia, and autism.
But we should not forget that while the idea of “g,” a single score of general
intelligence, has fallen out of favor, the underlying concept of IQ has captured the popular
imagination (as have astrological terms before it), and wishing it away is unlikely to be
successful. It has entered everyday language, and it is surprising how many people think
they know their own IQ, even though they are generally wrong. The idea is particularly
likely to remain popular with those who have attained high scores on IQ tests. Societies
such as MENSA have grown throughout the world, and perhaps we should not
begrudge the pleasure and pride felt by their members.

Tests of other psychological constructs


Psychometric intelligence tests assess optimal performance, as do all tests of intelligence,
ability, competence, or achievement. What these all have in common is that candidates
are expected to achieve the highest score that they can and need to be sufficiently
motivated to do so. Typically, in a classical psychometric test of this type, it is a simple
question of how many of the questions they can answer correctly. The higher the
number of correct answers, the higher the score. However, psychometric methods can
also be applied to the assessment of other psychological characteristics in which people
differ from each other, such as personality, integrity, interests, motivation, values,
temperament, attitudes, and beliefs.

Personality
Psychometric personality testing, like psychometric intelligence testing, can trace its roots back
to Galton—not just the statistical methods, but also his lexical hypothesis (Galton, 1884).
Galton proposed that consequential differences between people should be reflected in
language, and that the more important the difference, the more likely it is to be encoded as a
single word: if we differ in some way, and such a difference matters, we will—sooner or
later—come up with a word capturing this difference. This is why most languages have terms
such as respectfulness, gregariousness, or intelligence. If, on the other hand, a particular trait
does not vary much between people, or it varies yet is inconsequential, it is less likely to be
captured with a single term. This is why most languages do not have terms describing “the
ability to count” (nearly everyone can do it) or “not having a favorite color” (perhaps
surprisingly, that seems not to be an important individual characteristic). Galton identified about
1,000 such terms describing differences, based largely on his knowledge of English, German,
and other European languages. Later, Allport and Odbert (1936) carried out a systematic
survey of the English language and listed about 18,000 personality-descriptive terms, or words
that could be used to describe a characteristic of a person. They grouped them into four
categories: personal traits, temporary moods or activities, judgments of personal conduct, and
capacities and talents. After laboriously omitting words that were rarely used or little
understood, obvious synonyms, as well as “non-neutral” words such as “good” and “bad,” they
were left with a list of approximately 4,500 “neutral” words that they then categorized
according to their meaning to produce a smaller number of personality constructs.
Following this early work based on researchers’ intuition, psychologists began to
apply a data-driven approach to grouping those words into categories. They employed
factor analysis, a statistical procedure in which correlated variables are combined into
factors, originally developed in the context of intelligence assessment (factor analysis
is discussed in Chapter 4). One of the most influential factor-analytic theorists in this
field, Raymond Cattell, asked people to rate their friends—and themselves—on 200
personality-descriptive terms from the Allport and Odbert list. By analyzing those ratings
using factor analysis, Cattell discovered that people were not described by random sets of
personality-descriptive terms, but that there were clear patterns. For example, people
rated as “warm” were often also rated as “easygoing” and “gregarious,” but rarely described with
words such as “reserved,” “cool,” or “impersonal.” Overall, the factor analysis conducted by
Cattell produced the 16 personality factors (also known as the 16PF) in Table 1.1.
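To make the idea of grouping correlated terms concrete, the following is a minimal Python sketch, not Cattell’s actual procedure. It simulates ratings on six terms taken from Table 1.1, assumes two hypothetical underlying traits, and counts the eigenvalues of the correlation matrix that exceed one (principal-component extraction) as a simple stand-in for a full factor analysis. The function names, sample size, and noise level are illustrative assumptions.

```python
import numpy as np


def simulate_ratings(n_raters=500, noise_sd=0.6, seed=0):
    """Simulate ratings on six terms driven by two hypothetical latent traits."""
    rng = np.random.default_rng(seed)
    warmth = rng.normal(size=n_raters)   # assumed latent trait 1
    tension = rng.normal(size=n_raters)  # assumed latent trait 2
    terms = ["warm", "easygoing", "gregarious", "tense", "impatient", "frustrated"]
    # Each observed rating is its latent trait plus independent noise.
    signal = np.column_stack([warmth, warmth, warmth, tension, tension, tension])
    noise = rng.normal(scale=noise_sd, size=(n_raters, len(terms)))
    return terms, signal + noise


def count_factors(ratings):
    """Return the correlation matrix, its sorted eigenvalues, and the number above 1."""
    corr = np.corrcoef(ratings, rowvar=False)
    eigenvalues = np.sort(np.linalg.eigvalsh(corr))[::-1]  # largest first
    n_factors = int(np.sum(eigenvalues > 1.0))             # eigenvalue-greater-than-one rule
    return corr, eigenvalues, n_factors


terms, ratings = simulate_ratings()
corr, eigenvalues, n_factors = count_factors(ratings)
print("Correlation matrix (rows and columns follow `terms`):")
print(np.round(corr, 2))
print("Eigenvalues:", np.round(eigenvalues, 2))
print("Factors retained:", n_factors)
```

In this simulated data, terms that share a latent trait correlate strongly with one another and only weakly with the other group, which is the kind of pattern that factor analysis summarizes as a small number of factors.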
Hans Eysenck (1967), a prominent personality theorist, also used factor analysis, but
instead of the 16 factors favored by Cattell, he argued that the structure of personality is more
usefully described by two dimensions that he called neuroticism and extraversion. The
dimension of neuroticism represents the difference between people who are anxious and
moody on one end and calm and carefree on the other, whereas extraversion distinguishes
between those who are sociable and like parties (extraverts) and those who are quiet,
introspective, and reserved (introverts). Eysenck’s model did not contradict Cattell’s: in fact,
Cattell’s 16PF can be further reduced to produce Eysenck’s two factors. It is largely a matter
of personal preference whether a researcher will opt for a larger number of factors, to give a
wider description of personality, or a smaller number, to produce fewer factors that are more
robust. However, in recent years, psychologists have come to favor five personality factors:
openness to experience, conscientiousness, extraversion, agreeableness, and neuroticism. These
factors have become ubiquitous in the literature and are generally referred to as the “OCEAN” or
“five-factor” model. This model is discussed in more detail in Chapter 7.

Table 1.1 Most highly related personality-descriptive terms from Cattell’s 16PF

| Trait Name | Negatively Related | Positively Related |
| --- | --- | --- |
| Warmth | Impersonal, distant, cool, reserved, detached, formal, aloof | Warm, outgoing, attentive to others, kindly, easygoing, participative, likes people |
| Reasoning | Concrete-thinking, less intelligent, lower general mental capacity, unable to handle abstract problems | Abstract-thinking, more intelligent, bright, higher general mental capacity, fast learner |
| Emotional Stability | Reactive emotionally, changeable, affected by feelings, emotionally less stable, easily upset | Emotionally stable, adaptive, mature, faces reality calmly |
| Dominance | Deferential, cooperative, avoids conflict, submissive, humble, obedient, easily led, docile, accommodating | Dominant, forceful, assertive, aggressive, competitive, stubborn, bossy |
| Liveliness | Serious, restrained, prudent, taciturn, introspective, silent | Lively, animated, spontaneous, enthusiastic, happy-go-lucky, cheerful, expressive, impulsive |
| Rule Consciousness | Expedient, nonconforming, disregards rules, self-indulgent | Rule-conscious, dutiful, conscientious, conforming, moralistic, staid, rule-bound |
| Social Boldness | Shy, threat-sensitive, timid, hesitant, intimidated | Socially bold, venturesome, thick-skinned, uninhibited |
| Sensitivity | Utilitarian, objective, unsentimental, tough-minded, self-reliant, no-nonsense, rough | Sensitive, aesthetic, sentimental, tender-minded, intuitive, refined |
| Vigilance | Trusting, unsuspecting, accepting, unconditional, easy | Vigilant, suspicious, skeptical, distrustful, oppositional |
| Abstractedness | Grounded, practical, prosaic, solution-oriented, steady, conventional | Abstract, imaginative, absentminded, impractical, absorbed in ideas |
| Privateness | Forthright, genuine, artless, open, guileless, naive, unpretentious, involved | Private, discreet, nondisclosing, shrewd, polished, worldly, astute, diplomatic |
| Apprehension | Self-assured, unworried, complacent, secure, free of guilt, confident, self-satisfied | Apprehensive, self-doubting, worried, guilt-prone, insecure, worrying, self-blaming |
| Openness to Change | Traditional, attached to the familiar, conservative, respecting traditional ideas | Open to change, experimental, liberal, analytical, critical, freethinking, flexible |
| Self-Reliance | Group-oriented, affiliative, a joiner and follower, dependent | Self-reliant, solitary, resourceful, individualistic, self-sufficient |
| Perfectionism | Tolerates disorder, unexacting, flexible, undisciplined, lax, self-conflict, impulsive, careless of social rules, uncontrolled | Perfectionistic, organized, compulsive, self-disciplined, socially precise, exacting willpower, control, self-sentimental |
| Tension | Relaxed, placid, tranquil, torpid, patient, composed low drive | Tense, high-energy, impatient, driven, frustrated, overwrought, time-driven |

Integrity
In many occupational settings, it is the integrity of the job applicant, rather than their
personality per se, that is of interest. A lack of integrity in this context might refer to drug
taking, previous criminal activity, or barefaced lying concerning a previous work record
or the possession of a qualification. The subject of truth-telling has long been a primary
concern of legal, police, and security systems all over the world, and during the past
century, lie-detector tests using a polygraph entered widespread use among these
services. Such devices assess physiological activity such as heart rate, blood pressure,
respiration rate, and galvanic skin response during an interrogation. However, the clear
ethical and privacy concerns involved in the use of these techniques have led to them
being made illegal in most countries, at least within employment settings. Where they
were previously in use, they have generally been replaced by psychometric integrity tests
based on self-report data. Such tests are themselves particularly vulnerable to dishonesty;
however, several strategies are able to keep the effects of this to a minimum (see
Chapter 7).

Interests
Psychometric tests of interests were first introduced for use in career guidance in 1927
by Edward Kellogg Strong Jr., then at the Carnegie Institute of Technology. He believed
that “the relationship among abilities, interests and achievements may be likened
to a motorboat with a motor and a rudder. The motor (abilities) determine how fast
the boat can go, the rudder (interests) determine which way the boat goes” (Strong,
1943). His Strong Interest Inventory and variants thereof are still in use today, assessing
interest in occupations, subject areas, activities, leisure activities, people, and personal
characteristics. In 1956, John Holland introduced a system that was based on the idea
that occupational preference was a veiled expression of underlying character. This
more personality-based framework was updated in the 1990s when the Holland Codes,
sometimes called Holland Occupational Themes, became standard. The Holland
Codes (RIASEC) refer to his personality types of realistic, investigative, artistic, social,
enterprising, and conventional.

Motivation
What motivates both humans and animals to act as they do is a central part of psychology.
We may be motivated by basic needs such as the need to quench our thirst or
relieve our hunger. However, when it comes to the psychometric assessment of motivation,
we are nearly always talking about motivation in the work environment. In an
occupational setting, some people are more motivated by a need for interesting work
than by a need for money. For others, their primary motivation is a need for recognition
of their achievements and the possibility of promotion. Several theories have been
requisitioned in support of assessments of employee motivation, Abraham Maslow’s
hierarchy of needs being one of them. It is suggested that the basic need of the
employee is to survive, for which they need money, so that cash in this case (rather than
food and drink as in Maslow’s original model) comes first. Once this is satisfied, they
need safety (maybe adequate accommodation), and only after these two have been
satisfied do friendship, belonging, and career ambition come into play. The Atkinson and
McClelland model of achievement motivation, first introduced in 1953, has specified
motivational needs at work as being the need for achievement, the need for authority,
and the need for affiliation. In the field of industrial and organizational psychology, there
are many different motivational questionnaires available.
Values
A value is an enduring belief about what should be important in our lives and how
people should behave. Our values influence our choice of occupation. For example, a
person who is deeply concerned about the effects of climate change is unlikely to apply
for a job in a coal-fired power station. Values questionnaires were first developed in a
cross-cultural framework by Geert Hofstede, who between 1967 and 1973 examined
international differences among IBM employees, identifying the four dimensions of
individualism-collectivism, uncertainty avoidance, strength of social hierarchy, and
task vs. person orientation. Later, the dimensions of long-term vs. short-term
orientation and indulgence vs. restraint were added. Subsequently, Shalom Schwartz
developed his Theory of Basic Human Values, first published in 1992, focusing on
those that he suggested may be universal. These were self-direction, stimulation,
hedonism, achievement, power, security, conformity, tradition, benevolence, and
spirituality (which he calls “universalism”), and can be assessed with the Schwartz
Value Survey. It is noteworthy that there is considerable conceptual overlap between
the idea of a value and that of a motivator.

Temperament
Although the term “temperament” was formerly used by theorists such as Gordon
Allport to refer to those aspects of adult personality that were considered to be largely
inherited (such as impulsiveness), it is more commonly used in relation to infants and
young children, and refers to their tendency to behave in consistent ways across
situations. Developmental psychologists such as Jerome Kagan saw temperament as a genetic
predisposition because differences in characteristics such as activity level and
emotionality can be observed in newborn infants. Beginning in the 1950s, Alexander Thomas,
Stella Chess, and their colleagues initiated the New York Longitudinal Study, which
identified nine temperament characteristics that affect how well a child fits in with
school, friends, and family. They later suggested three dimensions of temperament in
children: the “easy child” has a positive mood, can quickly establish regular routines
from infancy, and easily adapts to new experiences; the “difficult child” reacts negatively
and frequently cries, is irregular in attention to daily routines, and is slow to accept new
experiences; and the “slow-to-warm-up child” has a low activity level, is
somewhat negative in mood, and is slow to adapt. These are assessed through attention
to individual differences in emotion, motor reactivity, self-regulation, and consistency
in behavior.

Attitudes
The term “attitude” refers to how much a person likes or dislikes an object, person, or
idea—or, to put it another way, our positive or negative feelings toward an attitude
object. As there are so many different objects, people, and ideas toward which people
can have attitudes, there are an enormous number of possible attitude scales. Hence
discussion on attitude questionnaires focuses on methodology rather than the object of
measurement. Usually there is a set of questions, each with response options along the
agree/disagree continuum, a system based on the work of Rensis Likert in the 1930s.
Today, these are referred to as Likert scales. In the 1950s, Charles Osgood introduced
the semantic differential as an alternative method of assessing attitudes, based on
George Kelly’s repertory-grid technique for the assessment of personal constructs.
These days, attitude measurement is a mass industry, with opinion polls playing a
major role in fields ranging from politics to economic policy. However, when
measuring attitude, you will usually be starting from scratch and need to develop the
questionnaire yourself.
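The agree/disagree response format described above can be illustrated with a short Python sketch. It assumes a five-point scale coded 1 to 5 and a hypothetical helper named score_scale. Reverse keying of negatively worded items is a common convention that is added here as an assumption; it is not described in the text.

```python
# Assumed five-point coding of the agree/disagree continuum.
LIKERT = {
    "strongly disagree": 1,
    "disagree": 2,
    "neutral": 3,
    "agree": 4,
    "strongly agree": 5,
}


def score_scale(responses, reverse_keyed=()):
    """Sum the coded responses, flipping items whose index is in reverse_keyed."""
    total = 0
    for index, label in enumerate(responses):
        value = LIKERT[label.lower()]
        if index in reverse_keyed:
            # On a 1-5 scale, reversing maps 1<->5 and 2<->4; 3 stays 3.
            value = 6 - value
        total += value
    return total


# Three hypothetical items; the third is negatively worded, so it is reversed.
answers = ["agree", "strongly agree", "disagree"]
print(score_scale(answers, reverse_keyed={2}))
```

A higher total then indicates a more favorable attitude toward the attitude object, provided that every item has been keyed in the same direction.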

Beliefs
The term “belief” refers to the cognitive component of an attitude, i.e., what a person
assumes to be true. Examples are the belief in religion or the belief that a placebo drug
can cure an illness. Beliefs may be held with varying degrees of certainty. They are
distinct from attitudes in that they do not include an affective component, i.e., they do
not include liking or disliking. In the 1980s, Albert Bandura, already well known for his
work on social learning theory, introduced the concept of self-efficacy—belief in one’s
own effectiveness, or, put another way, belief in oneself—and this has become an
important concept in a variety of domains. People with high self-efficacy belief are more
likely to persevere with and complete tasks, whether they be work-performance-related
or self-directed, such as following a health regime. Those with low self-efficacy are more
likely to give up and less likely to prepare effectively. Today there are many different
self-belief scales available, particularly around health beliefs, but again these tend to be
targeted at specific applications.

Summary
In the 20th century, the reputation of psychometrics suffered considerable damage twice
over: first, directly from the enthusiasm with which its testing procedures were taken up
by the eugenics movement, and second, from that movement’s successors, who
continued to argue that political and policy decisions should be made under the assumption
that an intelligence level was part of each race’s genetic heritage—a position that the
Flynn effect so soundly showed to be unwarranted. Psychometrics is both a pure and an
applied science, but it is also a science that can all too easily be misapplied.
Today, the use of other forms of psychometric testing in education, recruitment, and
marketing—involving the whole person rather than just skills and character—is
increasing in society, and it is important that all issues associated with this phenomenon
be properly and objectively evaluated if we wish procedures to be efficient and fair.
Many believe that there should be no testing at all, but this is based on a
misunderstanding that has arisen from attempting to separate the process and
instrumentation of testing from its function. It is the function of testing that
determines its use, and this function derives from the need in any society to select and
assess individuals within it for a job or for entry into a university. Given that selection
and assessment exist, it is important that they be carried out as accurately as possible and
that they be studied and understood. Psychometrics can be defined as the science of
this selection and evaluation process in human beings. But now we must realize that
the ethics, ideology, and politics of this selection and assessment are as much an integral
part of psychometrics as are data science and psychology. Any science dealing with
selection is also, by default, dealing with rejection, and is therefore intrinsically
political. It is essential that psychometrics be understood in this way if we wish to apply
proper standards and controls to the current expansion in test use, and thus develop a
more equitable society. It is one matter to question testing; it is quite another to
question the psychometric concepts of reliability and validity on the grounds that these
were developed by eugenicists. It would be just as irrational to dismiss the theory of
evolution and consequently most of modern biology! The techniques developed by
Darwin, Galton, Spearman, Cattell, and their followers have made a major
contribution to our understanding of the principles of psychometrics.
In this third decade of the 21st century, given the rise of cyberspace and the enormous
power of psychographics as a tool for control within it, we must ensure that we fully
anticipate the possibility of unintended consequences. It is time for urgent remedial
action in the form of regulation. But what should be regulated? The right to privacy, of
course, but this has been only a part of the problem. In cyberspace, our wishes, desires,
and psychological vulnerabilities can be targeted all too readily on the basis of the data
trail that we leave behind. Broadcast messages are filtered through our digital footprint so
that we receive only those that are deemed relevant—perhaps to our own interests, but
more likely to the interests of others. All this can be done at scale and in real time, with
no one else knowing the microtargeted messages that we receive or are likely to receive.
This filtering, driven by machine-learning algorithms trained on our input,
determines what we see or hear in the online world; artificial intelligence builds models
that enable the information that we receive to become increasingly targeted so that our
filter bubble becomes increasingly controlling. Proper regulation of the application of
psychometrics for psychographic purposes clearly needs to be established and controlled.
But how?
One thing is certain. Following this, the world will never be the same again. Let us
not fear the future, but rather seek ways in which we can influence it for the better.
For now, the machines should be our servants, not our masters … and maybe, one day,
they’ll be our colleagues.

References
Alighieri, Dante (1935). The Divine Comedy of Dante Alighieri: Inferno, Purgatory, Paradise (pp. 1265–1321).
New York: The Union Library Association.
Allport, G., & Odbert, H. (1936). Trait names: A psycholexical study. Psychological Monographs,
47(1), 1–171.
Asimov, I. (1950). I, Robot. New York: Gnome Press.
Atkinson, R. D., Brake, D., Castro, D., Cunliff, C., Kennedy, J., McLaughlin, M., McQuinn, A.,
& New, J. (2019). Guide to the “Techlash”: What it is and why it’s a threat to growth and progress.
Information Technology and Innovation Foundation, October 28, 2019.
Barrick, M. R., & Mount, M. K. (1991). The Big Five personality dimensions and job performance: A
meta-analysis. Personnel Psychology, 44(1), 1–26.
Binet, A., & Simon, T. (1916). The Development of Intelligence in Children. Baltimore: Williams &
Wilkins.
Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee’s ability. In F. M.
Lord & M. R. Novick (eds.), Statistical Theories of Mental Test Scores. Reading, MA: Addison-Wesley.
Bloom, B. (1956). Taxonomy of Educational Objectives. New York: Longmans.
Bond, M. H. (1997). Working at the Interface of Culture: Eighteen Lives in Social Sciences. London:
Routledge.
Boring, E. G. (1957). A History of Experimental Psychology, 2nd edition. New York: Appleton-Century-
Crofts.
Brown, T. A. (2006). Confirmatory Factor Analysis for Applied Research. New York: Guilford Press.
Cattell, R. B. (1957). Personality and Motivation Structure and Measurement. New York: World Book.
Cheung, F., Leung, K., & Zhang, J. (2001). Indigenous Chinese personality constructs: Is the five factor
model complete? Journal of Cross-Cultural Psychology, 32, 407–433. doi: 10.1177/0022022101032004003.
Cleckley, H. M. (1976). The Mask of Sanity (5th edition). London: Mosby.
Cleckley, H. (1941). The Mask of Sanity: An Attempt to Reinterpret the So-Called Psychopathic
Personality. St Louis: C.V. Mosby.
Concerto (2019). Cambridge, UK: The Psychometrics Centre, University of Cambridge. Retrieved
from https://www.psychometrics.cam.ac.uk/newconcerto.
Darwin, C. (1871). The Descent of Man, and Selection in Relation to Sex. London: John Murray.
Darwin, C. (1859). On the Origin of Species by Means of Natural Selection, or the Preservation of Favoured
Races in the Struggle for Life. London: John Murray.
Darwin, C. R., & Wallace, A. R. (1858). On the tendency of species to form varieties; and on the
perpetuation of varieties and species by natural means of selection. Journal of the Proceedings of the
Linnean Society of London, Zoology, 3(9), 45–62.
Edgeworth, F. Y. (1888). The statistics of examinations. Journal of the Royal Statistical Society, 51(3),
599–635.
Embretson, S. E., & Reise, S. P. (2000). Item Response Theory for Psychologists. Multivariate Applications
Books Series. Mahwah, NJ: Lawrence Erlbaum Associates Publishers.
Eysenck, H. J. (1967). The Biological Basis of Personality. Springfield, IL: Thomas Press.
Eysenck, H. J. (1970). The Structure of Human Personality (3rd edition). London: Methuen.
Feldt, L. S., & Brennan, R. L. (1989). Reliability. In R. L. Linn (ed.), Educational Measurement. The
American Council on Education/Macmillan Series on Higher Education (pp. 105–146). New
York: Macmillan Publishing Co, Inc; American Council on Education.
Flynn, J. R. (1984). The mean IQ of Americans: Massive gains 1932 to 1978. Psychological Bulletin,
95, 29–51.
Flynn, J. R. (2007). What is Intelligence?: Beyond the Flynn Effect. Cambridge, UK: Cambridge University
Press.
Flynn, J. R. (2016). Does Your Family Make You Smarter? Nature, Nurture, and Human Autonomy.
Cambridge, UK: Cambridge University Press.
Galton, F. (1865). Hereditary talent and character. Macmillan’s Magazine, 12, 157–166. Note: Galton’s
misspelling of Phillipps as Phillips.
Galton, F. (1884). Measurement of character. Fortnightly Review, 42, 179–182.
Gardner, H. (1983). Frames of Mind: The Theory of Multiple Intelligences. New York: Basic Books.
Gosling, S. D., Rentfrow, P. J., & Swann, W. B., Jr. (2003). A very brief measure of the big five
personality domains. Journal of Research in Personality, 37, 504–528.
Hebb, D. O. (1949). The Organization of Behavior: A Neuropsychological Theory. New York, NY: John
Wiley & Sons.
Herculano-Houzel, S. (2012). The remarkable, yet not extraordinary, human brain as a scaled-up
primate brain and its associated cost. Proceedings of the National Academy of Sciences
(PNAS), 109(Supplement 1), 10661–10668. First published June 20, 2012. https://doi.org/10.1073/
pnas.1201895109.
International Cognitive Ability Resource (2018). Retrieved from icar-project.com
International Personality Item Pool (IPIP) (2006). Retrieved from https://ipip.ori.org/.
Jenner, E. (1807). Classes of the human power of intellect. The Artist, 19, 1–7.
Jin, Y. (Ed.) (2001). Psychometrics (pp. 2–9). Hua Dong Normal University Press: Shanghai. (in
Chinese).
Kleinberg, J., Lakkaraju, H., Leskovec, J. H., Ludwig, J., & Mullainathan, S. (2018). Human decisions
and machine predictions. Quarterly Journal of Economics, 133, 237–293.
Kosinski, M., Stillwell, D., & Graepel, T. (2013). Private traits and attributes are predictable from digital
records of human behavior, Proceedings of the National Academy of Sciences of the USA, April 2013.
LeBreton, J. M., Shiverdecker, L. K., & Grimaldi, E. M. (2018). The dark triad and workplace
behavior. Annual Review of Organizational Psychology and Organizational Behavior, 5, 387–414.
Liu, C. (2015). The Dark Forest (The Three Body Problem). New York: Tor Books.
McLuhan, M. (1964). Understanding Media: The Extensions of Man. New York: McGraw Hill.
Monti, A. (2001). Does Cyberspace Exist? APC Europe Internet Rights Workshop, Prague, February
2001. https://blog.andreamonti.eu/?p=38.
Pearl, J., & Mackenzie, D. (2018). The Book of Why: The New Science of Cause and Effect. New York:
Basic Books.
Popham, W. J. (1999). Why standardized tests don’t measure educational quality. Educational Leadership,
56(6), 8–15.
Qi, S. Q. (2003). Applying Modern Psychometric Theory in Examination (pp. 2–5). WuHan: HuaZhong
Normal University Press (in Chinese).
Rumelhart, D., & McClelland, J. (1986). Parallel Distributed Processing: Explorations in the Microstructure of
Cognition. London: MIT Press.
Rust, J. (1997). Giotto Manual, London, UK: Pearson Assessment.
Rust, J. (2019). The Orpheus Business Personality Inventory (OBPI), Cambridge, UK: The Psychometrics
Centre, University of Cambridge.
Rust, J., & Golombok, S. (2020). The Golombok Rust Inventory of Sexual Satisfaction (GRISS),
Cambridge, UK: The Psychometrics Centre, University of Cambridge.
Rust, J., & Golombok, S. (2020). The Golombok Rust Inventory of Marital State (GRIMS), Cambridge,
UK: The Psychometrics Centre, University of Cambridge.
Rust, J., Golombok, S., & Collier, J. (1988). Marital problems and sexual dysfunction: How are they
related? British Journal of Psychiatry, 152, 629–631.
Saville, P., Holdsworth, R., Nyfield, G., Cramp, L., & Mabey, W. (1984). Occupational Personality
Questionnaires Manual. London: Saville & Holdsworth Ltd.
Sternberg, R. (1990). Wisdom: Its Nature, Origin and Development. Cambridge, MA: MIT Press.
Strong, E. K. (1943). Vocational Interests of Men and Women. Stanford, CA: Stanford University Press.
Stillwell, D. (2007). myPersonality. https://sites.google.com/michalkosinski.com/mypersonality.
Sun, L., Liu, Y., & Luo, F. (2019). Automatic generation of number series reasoning items of high
difficulty. Frontiers in Psychology, 10, 884.
Terman, L. M. (1919). Measurement of Intelligence. London: Harrap.
Tett, R. P., Jackson, D. N., & Rothstein, M. (1991). Personality measures as predictors of job
performance: A meta-analytic review. Personnel Psychology, 44(4), 703–742.
The UK Cyber Security Strategy: Protecting and promoting the UK in a digital world, November
2011, HMSO London. https://www.gov.uk/government/publications/cyber-security-strategy.
Thomson, W. (1891). Popular Lectures and Addresses. London: MacMillan.
Thorndike, R. M., & Thorndike-Christ, T. M. (2014). Measurement and Evaluation in Psychology and
Education (8th edition). New York, NY: Pearson Education.
Wainer, H., Dorans, N. J., Flaugher, R., Green, B. F., & Mislevy, R. J. (2000). Computerized Adaptive
Testing: A Primer. London, UK: Routledge.
Wang, D. F., & Chiu, H. (2005). Measuring the personality of Chinese: QZPS vs. NEO PI-R. Asian
Journal of Social Psychology, 8(1), 97–122.
Wittgenstein, L. (1958). In G. E. M. Anscombe (trans). Philosophical Investigations (3rd edition).
Englewood Cliffs, NJ: Prentice Hall.
Youyou, W., Kosinski, M., & Stillwell, D. (2015). Computer-based personality judgments are more
accurate than those made by humans, Proceedings of the National Academy of Sciences of the USA
(PNAS), September 2013.
